See also: Machine learning terms
A perceptron is a type of linear model and one of the earliest forms of artificial neural network. It was introduced by Frank Rosenblatt in 1958 at the Cornell Aeronautical Laboratory. The perceptron models simple decision-making for binary classification tasks, where the goal is to separate data into two classes. Although modern deep learning systems have far surpassed the perceptron in capability, the perceptron remains one of the most historically significant algorithms in machine learning. Its story, including the controversy surrounding its limitations, shaped decades of research funding and public perception of artificial intelligence.
At its core, the perceptron takes a set of numerical inputs, multiplies each by a learned weight, adds a bias term, and passes the result through a step activation function to produce a binary output. Despite this simplicity, the perceptron laid the groundwork for every neural network architecture that followed, from multi-layer perceptrons to the deep neural networks powering modern AI systems.
Frank Rosenblatt, a psychologist by training, built the first perceptron implementation (the Mark I Perceptron) as a hardware device at Cornell in 1957-1958. The machine used 400 photocells connected randomly to "neurons" and could learn to recognize simple shapes. When Rosenblatt unveiled the system in July 1958, the press coverage was enthusiastic. The New York Times reported it as "the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence." Rosenblatt himself made ambitious claims, suggesting the perceptron was "the first machine which is capable of having an original idea" [1].
Rosenblatt published the formal description of the perceptron algorithm in his 1958 paper "The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain" in the journal Psychological Review [2]. He later expanded on the theory in his 1962 book Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms.
The perceptron was not the first computational model of a neuron. Warren McCulloch and Walter Pitts had proposed an artificial neuron model in 1943, but their model had fixed, non-learnable weights. The perceptron's innovation was that it could learn its weights from data through an iterative training procedure, making it the first trainable neural network model.
The perceptron consists of a single layer of artificial neurons, also called nodes or units. Each neuron receives one or more input values, multiplies each input by an associated weight, and then sums the products. A bias term is added to this weighted sum. The result is then passed through an activation function, typically a step function (also called a Heaviside function), which outputs the final classification decision. The weights and bias are adjustable parameters learned during training.
The mathematical formulation of a single perceptron is:
| Component | Formula |
|---|---|
| Weighted sum | z = w_1 x_1 + w_2 x_2 + ... + w_n x_n + b |
| Vector notation | z = w . x + b |
| Output (step function) | y = 1 if z >= 0, else y = 0 |
Here, x is the input vector, w is the weight vector, b is the bias, and y is the binary output.
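To make this concrete, here is a minimal sketch of the computation in Python; the function and variable names are illustrative rather than taken from any particular library:

```python
import numpy as np

def perceptron_predict(x, w, b):
    """Single perceptron: weighted sum plus bias, then a step (Heaviside) activation."""
    z = np.dot(w, x) + b          # z = w . x + b
    return 1 if z >= 0 else 0     # step activation

# Example: a perceptron implementing logical AND of two binary inputs
w = np.array([1.0, 1.0])
b = -1.5
print(perceptron_predict(np.array([1, 1]), w, b))  # 1
print(perceptron_predict(np.array([1, 0]), w, b))  # 0
```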
Geometrically, the perceptron defines a hyperplane in the input space. Points on one side of the hyperplane are classified as class 1, while points on the other side are classified as class 0. The weight vector w is normal (perpendicular) to this hyperplane, and the bias b determines the offset of the hyperplane from the origin. This means the perceptron can only solve problems where the two classes are linearly separable, meaning a single straight line (in 2D), plane (in 3D), or hyperplane (in higher dimensions) can perfectly divide the classes.
The learning process of a perceptron involves adjusting the weights so that classification errors are minimized on a training dataset. This is achieved through supervised learning, where the perceptron receives labeled input-output pairs and updates its weights iteratively.
The perceptron learning rule works as follows: for each training example (x, y), the current weights are used to compute the prediction y_hat, and the weights and bias are then updated as w <- w + eta * (y - y_hat) * x and b <- b + eta * (y - y_hat). The algorithm cycles through the training set until every example is classified correctly, or until a maximum number of passes is reached.
The parameter eta is the learning rate, a positive constant that controls the size of the weight updates. For the basic perceptron, any positive value of eta works: starting from zero weights, changing eta only rescales the weight vector without changing the sequence of decision boundaries, and the convergence guarantee discussed below holds for any positive learning rate as long as the data is linearly separable.
A key feature of the perceptron learning rule is that it only updates weights when a misclassification occurs. If the current prediction is correct, the weights remain unchanged. This makes the perceptron an example of an error-driven or online learning algorithm.
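The rule fits in a short training loop. The following Python sketch assumes labels in {0, 1} and a fixed maximum number of passes over the data; it is illustrative rather than a reference implementation:

```python
import numpy as np

def train_perceptron(X, y, eta=1.0, max_epochs=100):
    """Perceptron learning rule: update w and b only on misclassified examples."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            y_hat = 1 if np.dot(w, xi) + b >= 0 else 0
            if y_hat != yi:                 # error-driven: update only on mistakes
                w += eta * (yi - y_hat) * xi
                b += eta * (yi - y_hat)
                errors += 1
        if errors == 0:                     # converged: every example classified correctly
            break
    return w, b

# Learns the (linearly separable) AND function
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, y)
```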
The perceptron convergence theorem, formally proved by Albert Novikoff in 1962 [3], guarantees that if the training data is linearly separable, the perceptron learning rule will converge to a correct solution in a finite number of steps. The theorem provides an upper bound on the number of weight updates: at most R^2 / gamma^2, where R is the maximum norm (length) of any training example and gamma is the margin of the best separating hyperplane.
The proof works by bounding the weight vector from above and below. Each misclassification causes a weight update that makes progress toward a valid solution (the lower bound grows linearly with the number of updates), while the overall magnitude of the weight vector grows more slowly (the upper bound grows with the square root of the number of updates). Because the lower bound eventually exceeds the upper bound, the algorithm must terminate [4].
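In symbols, a compact sketch of this argument (under the theorem's assumptions: zero initial weights, the bias absorbed into w, and a unit-norm weight vector w* that separates the data with margin gamma; w_t denotes the weights after t updates):

```latex
\mathbf{w}_t \cdot \mathbf{w}^{*} \;\ge\; t\,\gamma
\qquad\text{and}\qquad
\|\mathbf{w}_t\|^{2} \;\le\; t\,R^{2}
% Since w_t . w* <= ||w_t|| (because ||w*|| = 1), combining the bounds gives
\;\Longrightarrow\;
t\,\gamma \;\le\; \|\mathbf{w}_t\| \;\le\; \sqrt{t}\,R
\;\Longrightarrow\;
t \;\le\; \frac{R^{2}}{\gamma^{2}}.
```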
If the data is not linearly separable, the standard perceptron algorithm will cycle indefinitely without converging. Variants such as the pocket algorithm (Gallant, 1990) address this by keeping track of the best weight vector found so far.
| Property | Detail |
|---|---|
| Convergence guarantee | Finite steps if data is linearly separable |
| Maximum updates | R^2 / gamma^2 |
| Proved by | Albert Novikoff (1962) |
| Limitation | Does not converge on non-separable data |
The most well-known limitation of the single-layer perceptron is its inability to solve the XOR problem. XOR (exclusive or) is a logical function that outputs 1 when its two inputs differ and 0 when they are the same:
| Input A | Input B | XOR Output |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
No single line can divide the XOR outputs into the correct classes. Points (0,0) and (1,1) produce output 0, while (0,1) and (1,0) produce output 1. These form a pattern that is not linearly separable. Therefore, a single-layer perceptron cannot learn the XOR function.
This limitation extends to any function that is not linearly separable. Problems like determining whether pixels form a connected pattern, computing parity, or recognizing complex shapes all fall outside the capabilities of a single-layer perceptron.
In two dimensions, the four XOR data points are:
| Input A | Input B | XOR Output | Position in 2D space |
|---|---|---|---|
| 0 | 0 | 0 | Bottom-left corner |
| 0 | 1 | 1 | Top-left corner |
| 1 | 0 | 1 | Bottom-right corner |
| 1 | 1 | 0 | Top-right corner |
The class-0 points (0,0) and (1,1) lie on one diagonal, while the class-1 points (0,1) and (1,0) lie on the other. No single straight line can separate the two classes. Any line that correctly classifies three of the four points will misclassify the fourth.
The inability of single-layer perceptrons to solve XOR is mathematically trivial. Any function that is not linearly separable is beyond the reach of a linear classifier. What made XOR devastating for neural network research was not the technical result itself, but its rhetorical power. Minsky and Papert used XOR as the simplest possible illustration of a fundamental limitation, and the clarity of the example made it memorable and convincing.
A two-layer network can solve XOR easily. One common solution uses two hidden neurons: the first hidden unit computes OR of the inputs (weights 1 and 1, bias -0.5), the second computes AND (weights 1 and 1, bias -1.5), and the output unit fires when OR is active but AND is not (weights 1 and -1, bias -0.5), which reproduces XOR exactly.
This requires only 2 hidden neurons and 6 weights (plus biases). The simplicity of this solution highlights that the limitation was not in neural networks as a concept, but specifically in single-layer architectures. The missing piece was an efficient algorithm for training the weights in multi-layer networks, which backpropagation eventually provided.
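A short Python sketch of this hand-built network, using the weights described above and the perceptron's step activation (the weight values are the illustrative choice from the preceding paragraph, not a unique solution):

```python
import numpy as np

def step(z):
    """Heaviside step activation used by the original perceptron."""
    return (z >= 0).astype(int)

def xor_network(a, b):
    """Two-layer network for XOR with hand-chosen weights."""
    x = np.array([a, b])
    # Hidden layer: an OR unit and an AND unit
    W1 = np.array([[1.0, 1.0],    # OR:  fires if a + b >= 0.5
                   [1.0, 1.0]])   # AND: fires if a + b >= 1.5
    b1 = np.array([-0.5, -1.5])
    h = step(W1 @ x + b1)
    # Output layer: fires when OR is active but AND is not
    z2 = np.dot(np.array([1.0, -1.0]), h) - 0.5
    return 1 if z2 >= 0 else 0

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, xor_network(a, b))   # prints 0, 1, 1, 0
```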
In 1969, MIT researchers Marvin Minsky and Seymour Papert published Perceptrons: An Introduction to Computational Geometry [5], which provided rigorous mathematical analysis of what perceptrons could and could not compute. The book went beyond the XOR example to systematically characterize the limitations of single-layer perceptrons.
Minsky and Papert proved that single-layer perceptrons could not compute certain predicates (such as connectedness and parity) and showed that the computational limitations were tied to the "order" of the perceptron, essentially how many input variables each unit could examine simultaneously. They demonstrated that for some problems, the required order grew with the size of the input, making the perceptron impractical.
The book was technically rigorous, but its impact went beyond what the mathematical results strictly implied. Minsky and Papert were aware that multi-layer networks could, in principle, overcome these limitations. However, no efficient learning algorithm for multi-layer networks was widely known at the time, and the book was interpreted by many in the research community as a definitive argument against neural network approaches in general [6].
Allen Newell called Perceptrons "a great book," and Michael Arbib noted it was "widely hailed as an exciting new chapter in the theory of pattern recognition." At the same time, H.D. Block argued the authors "study a severely limited class of machines from a viewpoint quite alien to Rosenblatt's," making the title "seriously misleading" [7]. Minsky himself humorously compared the book to H.P. Lovecraft's Necronomicon, a work "known to many, but read only by a few," highlighting the gap between its actual content and its outsized influence on the field.
The publication of Perceptrons coincided with growing skepticism about AI research that had been heavily funded through the 1960s. The book provided academic justification for redirecting funding away from neural network research and toward symbolic AI approaches. The period from the early 1970s through the early 1980s saw a sharp decline in neural network research, a period now called the first "AI winter" [8].
Several factors contributed to this shift: the formal limitations documented in Perceptrons, the redirection of DARPA and NSF funding toward symbolic AI, the absence of a practical algorithm for training multi-layer networks, and, in the UK, the Lighthill Report's broader criticism of AI research.
Rosenblatt did not live to see the eventual revival of neural networks. He died in a boating accident on Chesapeake Bay on his 43rd birthday, July 11, 1971 [9].
| Year | Event | Impact |
|---|---|---|
| 1969 | Perceptrons published | Provided formal basis for skepticism about neural networks |
| 1970-72 | Funding agencies reassess priorities | DARPA and NSF shift funding toward symbolic AI |
| 1971 | Frank Rosenblatt dies | Neural networks lose their most prominent advocate |
| 1973 | Lighthill Report (UK) | Broader AI funding cuts, compounding neural network winter |
| 1974 | Paul Werbos describes backpropagation in his PhD thesis | Goes largely unnoticed for over a decade |
| Early 1980s | Hopfield networks and Boltzmann machines | Renewed interest begins, but slowly |
| 1986 | Rumelhart, Hinton, Williams publish on backpropagation | Multi-layer training becomes practical; the neural network winter effectively ends |
Minsky later acknowledged that the book may have had an outsized negative effect. In the 1988 expanded edition of Perceptrons, he and Papert added a chapter discussing multi-layer networks and backpropagation, recognizing the progress that had been made. A 2017 reprint included a foreword by Leon Bottou discussing the book from a modern deep learning perspective.
A multi-layer perceptron (MLP) is an extension of the basic perceptron, consisting of multiple layers of interconnected neurons. An MLP typically includes an input layer, one or more hidden layers, and an output layer. The addition of hidden layers allows MLPs to model non-linear relationships between inputs and outputs.
Unlike single-layer perceptrons, MLPs use differentiable activation functions (such as sigmoid or ReLU) rather than step functions. This is necessary for training with gradient descent-based methods.
The key breakthrough that made multi-layer networks practical was the popularization of backpropagation by David Rumelhart, Geoffrey Hinton, and Ronald Williams in their 1986 paper "Learning representations by back-propagating errors" [10]. Backpropagation provided an efficient method for computing gradients through multiple layers, solving the credit assignment problem: determining how much each weight in each layer contributed to the output error.
With backpropagation, MLPs could learn to solve problems like XOR and many others that were impossible for single-layer perceptrons. The universal approximation theorem (Cybenko, 1989; Hornik, 1991) later showed that a feedforward network with a single hidden layer containing enough neurons can approximate any continuous function to arbitrary accuracy [11].
| Architecture | Layers | Can solve XOR | Training method |
|---|---|---|---|
| Single-layer perceptron | 1 | No | Perceptron learning rule |
| Multi-layer perceptron | 2+ | Yes | Backpropagation / gradient descent |
An MLP with one hidden layer computes its output in two stages: first the hidden representation h = sigma(W_1 x + b_1), then the output y = sigma(W_2 h + b_2).
Here, W_1 and W_2 are weight matrices, b_1 and b_2 are bias vectors, and sigma is a non-linear activation function applied element-wise. The hidden layer transforms the input into a new representation where the classes may become linearly separable, and the output layer then performs the final classification.
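A minimal sketch of this two-stage computation in Python, using a sigmoid activation and illustrative layer sizes (not tied to any specific library API):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, W1, b1, W2, b2):
    """One-hidden-layer MLP: non-linear hidden representation, then output layer."""
    h = sigmoid(W1 @ x + b1)       # stage 1: hidden representation
    y = sigmoid(W2 @ h + b2)       # stage 2: output (here a single sigmoid unit)
    return y

# Illustrative shapes: 2 inputs, 3 hidden units, 1 output
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)
print(mlp_forward(np.array([0.0, 1.0]), W1, b1, W2, b2))
```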
Several variants of the original perceptron have been developed over the years, each addressing specific limitations or extending the model's capabilities.
The voted perceptron, introduced by Yoav Freund and Robert Schapire in 1999, keeps track of all intermediate weight vectors generated during training along with a count of how many consecutive examples each vector classified correctly (its "survival time"). At prediction time, each weight vector casts a vote weighted by its survival time, and the final prediction is determined by majority vote. The voted perceptron achieves generalization bounds comparable to support vector machines without requiring quadratic programming [12].
The averaged perceptron is a simplification of the voted perceptron. Instead of storing all intermediate weight vectors, it computes their weighted average (weighted by survival time). The resulting single weight vector is used for predictions. In practice, the averaged perceptron performs nearly as well as the voted perceptron and is much more efficient, requiring only a single weight vector at prediction time. It has been widely adopted in natural language processing for tasks such as part-of-speech tagging and syntactic parsing [13].
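A minimal sketch of the averaged perceptron in Python, assuming labels in {-1, +1}; accumulating the current weights after every example weights each intermediate vector by its survival time:

```python
import numpy as np

def train_averaged_perceptron(X, y, epochs=10):
    """Averaged perceptron: return the survival-time-weighted average of the
    intermediate weight vectors rather than the final weights."""
    w, b = np.zeros(X.shape[1]), 0.0
    w_sum, b_sum, count = np.zeros(X.shape[1]), 0.0, 0
    for _ in range(epochs):
        for xi, yi in zip(X, y):                    # yi in {-1, +1}
            if yi * (np.dot(w, xi) + b) <= 0:       # mistake: standard perceptron update
                w += yi * xi
                b += yi
            w_sum += w                              # accumulate current weights
            b_sum += b
            count += 1
    return w_sum / count, b_sum / count             # averaged weights used at prediction time
```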
The kernel perceptron, first proposed by Aizerman, Braverman, and Rozonoer in 1964 through their "potential function method," applies the kernel trick to the perceptron algorithm [14]. Instead of computing predictions using the explicit weight vector, the kernel perceptron expresses predictions in terms of kernel evaluations between training examples:
Prediction: y_hat = sgn( sum_i alpha_i * y_i * K(x_i, x) )
Here, K is a kernel function (such as a polynomial or Gaussian/radial basis function kernel) that implicitly maps the data into a higher-dimensional feature space. This allows the kernel perceptron to learn non-linear decision boundaries without explicitly computing the feature transformation. The kernel perceptron was historically the first kernel-based classification algorithm, predating kernel support vector machines by nearly three decades.
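A sketch of the kernel perceptron with a Gaussian (RBF) kernel, assuming labels in {-1, +1}; the kernel choice and bandwidth are illustrative, and the bias term is omitted for simplicity:

```python
import numpy as np

def rbf_kernel(x1, x2, gamma=1.0):
    """Gaussian (RBF) kernel between two vectors."""
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

def train_kernel_perceptron(X, y, epochs=10, gamma=1.0):
    """Kernel perceptron: store a mistake count alpha_i for each training example."""
    n = len(X)
    alpha = np.zeros(n)
    for _ in range(epochs):
        for i in range(n):
            # Prediction expressed entirely through kernel evaluations
            s = sum(alpha[j] * y[j] * rbf_kernel(X[j], X[i], gamma) for j in range(n))
            if y[i] * s <= 0:        # mistake: increment this example's coefficient
                alpha[i] += 1
    return alpha

def kernel_predict(x, X, y, alpha, gamma=1.0):
    """y_hat = sgn( sum_i alpha_i * y_i * K(x_i, x) )"""
    s = sum(alpha[j] * y[j] * rbf_kernel(X[j], x, gamma) for j in range(len(X)))
    return 1 if s >= 0 else -1
```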
The pocket algorithm, introduced by Stephen Gallant in 1990, modifies the perceptron to handle data that is not linearly separable. It runs the standard perceptron learning rule but keeps a "pocket" copy of the best weight vector found so far (the one that correctly classifies the most training examples). When training ends, the pocket weight vector is used rather than the final weight vector.
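A sketch of the pocket modification layered on top of the standard update (labels in {0, 1}; names and the per-update full-dataset scoring are illustrative simplifications):

```python
import numpy as np

def train_pocket(X, y, eta=1.0, max_epochs=100):
    """Pocket algorithm: run the perceptron rule, but keep ('pocket') the weights
    that classify the most training examples correctly."""
    w, b = np.zeros(X.shape[1]), 0.0
    best_w, best_b = w.copy(), b
    best_correct = np.sum((X @ w + b >= 0).astype(int) == y)
    for _ in range(max_epochs):
        for xi, yi in zip(X, y):
            y_hat = 1 if np.dot(w, xi) + b >= 0 else 0
            if y_hat != yi:
                w += eta * (yi - y_hat) * xi
                b += eta * (yi - y_hat)
                # After each update, score the new weights on the whole training set
                correct = np.sum((X @ w + b >= 0).astype(int) == y)
                if correct > best_correct:
                    best_w, best_b, best_correct = w.copy(), b, correct
    return best_w, best_b    # pocket weights, not the final weights
```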
The multiclass perceptron extends binary classification to multiple classes. One common approach maintains a separate weight vector for each class and predicts the class whose weight vector produces the highest score. Updates are applied to the weight vectors of both the correct class and the incorrectly predicted class.
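A sketch of this one-weight-vector-per-class scheme in Python (names illustrative; class labels are assumed to be integer indices):

```python
import numpy as np

def train_multiclass_perceptron(X, y, n_classes, epochs=10, eta=1.0):
    """Multiclass perceptron: one weight vector per class; on a mistake,
    reward the correct class and penalize the predicted class."""
    W = np.zeros((n_classes, X.shape[1]))    # one row of weights per class
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = int(np.argmax(W @ xi))    # class with the highest score
            if pred != yi:
                W[yi] += eta * xi            # move the correct class's score up
                W[pred] -= eta * xi          # move the predicted class's score down
    return W
```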
| Variant | Key idea | Introduced by |
|---|---|---|
| Voted perceptron | Weighted vote from all intermediate weight vectors | Freund and Schapire (1999) |
| Averaged perceptron | Average of all intermediate weight vectors | Freund and Schapire (1999) |
| Kernel perceptron | Kernel trick for non-linear classification | Aizerman, Braverman, Rozonoer (1964) |
| Pocket algorithm | Keeps best weight vector found during training | Gallant (1990) |
| Multiclass perceptron | Separate weight vector per class | Varies |
| Maxover algorithm | Convergence regardless of linear separability | Wendemuth (1995) |
The perceptron, logistic regression, and support vector machines (SVMs) are all linear classifiers that share a common mathematical framework. They each compute a linear combination of inputs (w . x + b) and use the result to make classification decisions. The key differences lie in their loss functions, training procedures, and outputs.
| Property | Perceptron | Logistic regression | SVM |
|---|---|---|---|
| Loss function | 0/1 misclassification (implicit) | Cross-entropy (log loss) | Hinge loss |
| Output | Binary class label (0 or 1) | Probability (0 to 1) | Class label with margin |
| Update rule | Only on misclassified examples | On all examples, every iteration | Only on support vectors |
| Decision boundary | Any separating hyperplane | Maximum likelihood hyperplane | Maximum margin hyperplane |
| Probabilistic | No | Yes | No (without calibration) |
| Convergence | Guaranteed if linearly separable | Always (convex objective) | Always (convex objective) |
Logistic regression can be thought of as a probabilistic generalization of the perceptron. Where the perceptron applies a hard step function to the weighted sum, logistic regression applies a sigmoid function to produce a probability. The perceptron updates only on mistakes, while logistic regression adjusts weights on every example based on the gradient of the log loss. Because the log loss is convex, logistic regression is guaranteed to converge to a global optimum regardless of whether the data is linearly separable.
SVMs take a different approach by finding the hyperplane that maximizes the margin (the distance between the decision boundary and the nearest training examples on each side). The perceptron, by contrast, finds any separating hyperplane without regard to margin. The perceptron of optimal stability, developed using iterative algorithms such as Min-Over and AdaTron, seeks to maximize the margin and is conceptually a precursor to SVMs [15].
The kernel perceptron and kernel SVMs also share the kernel trick for non-linear classification, though SVMs typically achieve better generalization due to their margin-maximizing objective.
Modern neural networks, including convolutional neural networks (CNNs), recurrent neural networks (RNNs), and transformers, all build on the basic principles established by the perceptron. Every neuron in these architectures performs a weighted sum of inputs followed by a non-linear activation, which is the same core operation Rosenblatt described in 1958.
The differences between modern networks and the original perceptron are primarily in scale, architecture, and training techniques:
| Feature | Original perceptron | Modern neural networks |
|---|---|---|
| Layers | 1 | Dozens to hundreds |
| Neurons | Tens | Millions to billions |
| Activation function | Step function | ReLU, GELU, SiLU, etc. |
| Training | Perceptron learning rule | Backpropagation with Adam, SGD, etc. |
| Hardware | Custom analog circuits | GPUs, TPUs |
| Applications | Simple pattern recognition | Language, vision, reasoning |
The perceptron also inspired the development of support vector machines in the 1990s, which can be seen as a more principled version of the linear classifier that maximizes the margin between classes.
Since 2002, perceptron-based training methods (especially the averaged perceptron) have been widely used in natural language processing for tasks like part-of-speech tagging, named entity recognition, and syntactic parsing. These applications take advantage of the perceptron's simplicity and efficiency for high-dimensional, sparse feature spaces.
In the context of modern deep learning, the perceptron serves as both a historical starting point and a pedagogical tool. Introductory courses in machine learning and neural networks almost universally begin with the perceptron, using it to illustrate fundamental concepts such as linear classification, weight updates, loss functions, and the need for non-linear architectures. The progression from perceptron to multi-layer perceptron to deep neural network tells the story of the entire field.
The Mark I Perceptron was a custom-built hardware device, not a software simulation. Understanding its physical design helps appreciate both the ingenuity and the constraints of early neural network research.
The Mark I Perceptron was first publicly demonstrated on June 23, 1960, at the Cornell Aeronautical Laboratory. It was funded by the Information Systems Branch of the United States Office of Naval Research and the Rome Air Development Center. Its physical components were:
| Component | Specification | Function |
|---|---|---|
| Input retina (S-units) | 400 cadmium sulfide photocells in a 20x20 grid | Sensory input: detected light patterns from images |
| Association units (A-units) | 512 perceptron units | Hidden processing layer; each connected to up to 40 random S-units |
| Response units (R-units) | 8 output perceptrons | Final classification decisions |
| Weight storage | Motor-driven potentiometers | Adjustable analog weights on connections |
| Weight update mechanism | Electric motors | Physically turned potentiometers to change weights during learning |
| Connection pattern | Random wiring | S-unit to A-unit connections were wired randomly during construction |
Each sensory unit incorporated photoresistors paired with transistor amplifiers and relays, generating bipolar outputs of +/-24 volts upon activation. The random wiring between S-units and A-units was a deliberate design choice. Rosenblatt believed that the specific connection pattern did not matter as long as the learning rule could adjust the weights appropriately. This was a remarkably modern idea, anticipating the random initialization used in today's neural networks.
The learning process was electromechanical. When the perceptron made an incorrect classification, electric motors physically rotated potentiometers to adjust the weights. The direction and magnitude of rotation were determined by the perceptron learning rule. This process was slow by modern standards, but it demonstrated that a physical machine could learn from examples without being explicitly programmed.
The machine was primarily used to recognize simple visual patterns: letters, geometric shapes, and basic figures. Its 20x20 input resolution (400 pixels) was extremely coarse, but sufficient for demonstrating the learning principle.
The Mark I Perceptron is now housed in the Smithsonian Institution's National Museum of American History in Washington, D.C. It is recognized by Guinness World Records as the first artificial neural network [16]. The machine stands as a tangible artifact of the earliest attempt to build a learning machine, bridging the gap between theoretical neuroscience and practical engineering.
Frank Rosenblatt was born on July 11, 1928, in New Rochelle, New York. He attended the Bronx High School of Science before entering Cornell University, where he earned his B.A. in psychology in 1950 and his Ph.D. in 1956. His doctoral work focused on psychometrics and personality assessment, but his interests were shifting toward the computational modeling of the brain.
After completing his doctorate, Rosenblatt joined the Cornell Aeronautical Laboratory (CAL) in Buffalo, New York, where he rose through the ranks from research psychologist to senior psychologist to head of the cognitive systems section. It was at CAL that he conceived and built the perceptron, motivated by the question of how biological neural networks could learn to recognize patterns.
Rosenblatt was ambitious and charismatic, and he communicated his vision for machine intelligence with an enthusiasm that captured both scientific and public attention. His claims about the perceptron's potential were bold. At the 1958 press conference announcing the perceptron, he suggested it could eventually learn to "recognize people and call out their names, instantly translate speech in one language to speech and writing in another language, and be the first device to think as the human brain." These claims were scientifically premature, and they drew criticism from researchers who felt Rosenblatt was overpromising. This tension with the AI establishment, particularly with Marvin Minsky at MIT, would define much of the perceptron's legacy.
Rosenblatt died on July 11, 1971, his 43rd birthday, in a boating accident on Chesapeake Bay. He did not live to see the revival of neural networks in the 1980s or the deep learning revolution of the 2010s that vindicated many of his core ideas about learning machines.
Imagine you have a group of animals, and you want to decide if an animal is a bird or a fish. A perceptron works like a simple decision maker that looks at different features of the animals, such as whether they have wings or fins. You give each feature a score for how important it is. If the total score is above a certain number, the perceptron says "bird." If not, it says "fish."
The clever part is that the perceptron learns from its mistakes. Every time it gets an answer wrong, it adjusts the scores a little bit. After seeing enough examples, it gets better and better at telling birds from fish.
However, a perceptron can only draw one straight line to separate two groups. If the groups are mixed together in a complicated way (like a checkerboard pattern), one line is not enough. That is when you need a multi-layer perceptron, which is like having several decision makers working together, each drawing their own line, so they can handle much trickier problems.