Neural Network
Last reviewed
May 18, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v4 · 5,764 words
A neural network (also called an artificial neural network or ANN) is a computational model loosely inspired by the networks of biological neurons in animal brains. It consists of layers of simple processing units, called artificial neurons, that transmit numerical signals along weighted connections. By adjusting these weights to fit training data, a neural network can learn to approximate a wide range of functions, from classifying images to generating text, without being explicitly programmed for the task. Since the early 2010s, deep neural networks (networks with many stacked layers) have become the dominant model class in machine learning and the technological foundation of deep learning and modern artificial intelligence[^1][^2].
This article is a high-level survey. It traces the eight-decade history of neural networks, sketches their mathematical structure and training procedure, and links out to dedicated articles for each major architecture (such as the convolutional neural network, recurrent neural network, and transformer), training technique, and theoretical concept.
Imagine a huge team of tiny helpers, where each helper can only do one very simple thing: look at some numbers coming in, multiply them, add them up, and pass the result to the next helper. Alone, none of them are very smart. But when you line up thousands of these helpers in rows and connect them together, something amazing happens. You can show the whole team a picture of a cat, and after the numbers pass through all the helpers, the team says "cat!" at the end.
How does the team get so good? By practice. At first, the helpers give wrong answers. Each time they are wrong, a coach goes backward through the team and tells each helper to nudge its multiplication number a tiny bit. After seeing thousands of pictures of cats, dogs, and cars, the helpers settle on numbers that work, and the team can recognize new things it has never seen before.
That is essentially how a neural network works. The "helpers" are artificial neurons, the "multiplication numbers" are weights, and the "coach going backward" is an algorithm called backpropagation.
The history of neural networks spans more than eighty years and is conventionally divided into periods of progress separated by stretches of reduced funding and attention often called AI winters.
In 1943, neurophysiologist Warren McCulloch and logician Walter Pitts published "A Logical Calculus of the Ideas Immanent in Nervous Activity" in the Bulletin of Mathematical Biophysics. They proposed the first mathematical model of an artificial neuron: a binary-threshold unit that fires if and only if a weighted sum of its inputs exceeds a fixed threshold. They proved that networks of such units, with appropriately chosen connections, could in principle compute any function expressible in propositional logic[^3].
In 1949, psychologist Donald Hebb published The Organization of Behavior, in which he formulated what became known as the Hebbian learning principle. Hebb proposed that when one neuron repeatedly participates in firing another, the connection between them strengthens — often paraphrased as "cells that fire together wire together." Hebb's principle provided a theoretical basis for how synaptic strengths might be modified by experience and remains the conceptual root of many learning rules used in both neuroscience and artificial neural networks[^4].
In 1957 and 1958, Frank Rosenblatt of the Cornell Aeronautical Laboratory introduced the perceptron, the first trainable neural network model. Rosenblatt described the perceptron in a 1957 technical report and in the widely cited 1958 Psychological Review paper "The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain." The perceptron is a single-layer network of threshold neurons whose weights are learned from data via an iterative correction rule. Rosenblatt and colleagues also built the Mark I Perceptron, a custom hardware implementation with photocell inputs[^5].
In 1960, Bernard Widrow and his student Marcian "Ted" Hoff at Stanford introduced ADALINE (ADAptive LInear NEuron) and the closely related MADALINE network. ADALINE used a linear output unit and was trained by the Widrow–Hoff (or least-mean-square, LMS) learning rule, which adjusts weights in proportion to the difference between target and actual output. The LMS rule is a precursor to modern gradient descent and remains a foundational algorithm in adaptive signal processing[^6].
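The Widrow–Hoff update described above can be sketched in a few lines of plain Python. This is a minimal illustration, not ADALINE as originally built: the function name `lms_train`, the learning rate, the epoch count, and the toy data are all assumptions chosen for the example.

```python
def lms_train(samples, lr=0.1, epochs=500):
    """Fit a linear unit y = w.x + b with the Widrow-Hoff (LMS) rule."""
    w = [0.0] * len(samples[0][0])
    b = 0.0
    for _ in range(epochs):
        for x, target in samples:
            y = sum(wi * xi for wi, xi in zip(w, x)) + b  # linear output, no threshold
            err = target - y                               # target minus actual output
            # Adjust each weight in proportion to the error and its input.
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b


# Hypothetical toy data generated from target = 2*x1 - 1*x2 + 0.5.
data = [((0.0, 0.0), 0.5), ((1.0, 0.0), 2.5), ((0.0, 1.0), -0.5), ((1.0, 1.0), 1.5)]
w, b = lms_train(data)
```

Because the toy targets are exactly linear in the inputs, the learned `w` and `b` should approach (2, -1) and 0.5.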
These early successes generated intense optimism. Rosenblatt was widely quoted in the popular press as predicting machines that would walk, talk, see, and reproduce themselves.
In 1969, Marvin Minsky and Seymour Papert published Perceptrons: An Introduction to Computational Geometry (MIT Press). The book gave a rigorous mathematical analysis of what single-layer perceptrons can and cannot compute. Most famously, Minsky and Papert showed that a single-layer perceptron cannot represent the exclusive-or (XOR) function because XOR is not linearly separable. They also raised pessimistic, though more nuanced, questions about whether multi-layer networks could be trained efficiently[^7].
The book's influence, combined with the unrealistic expectations set during the perceptron boom, contributed to a sharp decline in neural network research funding through the 1970s. Together with the broader 1973 Lighthill report and contemporaneous funding cuts, this period is widely considered part of the first AI winter. The XOR limitation applies strictly to single-layer perceptrons — multi-layer networks can represent XOR and arbitrary Boolean functions — but a practical learning algorithm for multi-layer networks would not become widely known until the mid-1980s.
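The XOR point can be checked directly. The sketch below uses hand-picked weights (an assumption for illustration, not values from Minsky and Papert) to show that three threshold units arranged in two layers compute XOR, and runs a coarse brute-force search suggesting that no single threshold unit does.

```python
from itertools import product


def step(z):
    """Binary threshold: fire (1) when the net input is positive."""
    return 1 if z > 0 else 0


def xor_two_layer(x1, x2):
    """XOR built from three threshold units with hand-picked weights."""
    h_or = step(x1 + x2 - 0.5)       # hidden unit: at least one input is on
    h_and = step(x1 + x2 - 1.5)      # hidden unit: both inputs are on
    return step(h_or - h_and - 0.5)  # output unit: OR but not AND


inputs = list(product([0, 1], repeat=2))
xor_table = {x: xor_two_layer(*x) for x in inputs}

# Brute-force search over a coarse grid of single-unit weights and biases.
grid = [i / 2 for i in range(-8, 9)]  # -4.0, -3.5, ..., 4.0
single_unit_hits = [
    (w1, w2, b)
    for w1, w2, b in product(grid, repeat=3)
    if all(step(w1 * x1 + w2 * x2 + b) == (x1 ^ x2) for x1, x2 in inputs)
]
```

The grid search is only a finite check, but it agrees with the linear-separability argument: `single_unit_hits` stays empty while `xor_table` matches XOR on all four inputs.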
Throughout the late 1970s and early 1980s, a small group of researchers continued to study learning in multi-layer networks. In 1974, Paul Werbos described the backpropagation algorithm in his Harvard PhD thesis, framing it as an application of the chain rule to layered systems[^8]. Independent rediscoveries and extensions appeared throughout the early 1980s.
The decisive event was the 1986 publication of David Rumelhart, Geoffrey Hinton, and Ronald Williams's "Learning Representations by Back-Propagating Errors" in Nature, which clearly described backpropagation and demonstrated that gradient-trained multi-layer networks could learn useful internal representations[^9]. In parallel, Rumelhart, James McClelland, and the broader Parallel Distributed Processing (PDP) Research Group published the two-volume Parallel Distributed Processing: Explorations in the Microstructure of Cognition, which laid out the connectionist research program: cognition as the emergent behavior of large networks of simple units. The PDP volumes helped seed a generation of researchers, including Hinton and Yoshua Bengio[^10].
Other landmark contributions of the 1980s include John Hopfield's 1982 associative memory networks (Hopfield networks), the introduction of Boltzmann machines by Hinton and Terrence Sejnowski, and Teuvo Kohonen's self-organizing maps.
In 1989, Yann LeCun and colleagues at AT&T Bell Labs applied backpropagation to a convolutional neural network for reading handwritten digits, demonstrating that depth, weight sharing, and local receptive fields could solve a practical pattern-recognition problem[^11]. Subsequent refinements led to the LeNet-5 system described in LeCun et al.'s 1998 Proceedings of the IEEE paper "Gradient-Based Learning Applied to Document Recognition," which was deployed at scale to read checks and ZIP codes in the United States[^12].
The same period produced fundamental theoretical results. In 1989, George Cybenko proved that a feedforward network with a single hidden layer of sigmoidal units can approximate any continuous function on a compact domain to arbitrary accuracy[^13]. In 1991, Kurt Hornik generalized this universal approximation result to a wide class of activation functions, establishing approximation as a property of the multi-layer architecture rather than any particular nonlinearity[^14]. The 1997 introduction of Long Short-Term Memory (LSTM) by Sepp Hochreiter and Jürgen Schmidhuber provided a sequence model that mitigated the vanishing-gradient problem in recurrent networks and would later become the standard model for speech and language tasks until transformers displaced it[^15].
Despite these advances, neural networks fell out of favor through much of the 1990s and early 2000s. Vladimir Vapnik and colleagues' support vector machine (SVM), with strong theoretical guarantees and effective kernel methods, became the default tool for classification on moderate-sized datasets. Ensemble methods such as random forests and gradient-boosted trees also outperformed neural networks on many benchmarks. Compute and data were scarce by modern standards, deep networks were difficult to train end-to-end, and many researchers abandoned neural approaches. This stretch is sometimes called the second AI winter for neural networks (though the broader AI field experienced its own cycles)[^16].
The modern era of neural networks began in 2006 with Geoffrey Hinton, Simon Osindero, and Yee-Whye Teh's "A Fast Learning Algorithm for Deep Belief Nets" in Neural Computation, which showed how to pre-train deep networks layer-by-layer using restricted Boltzmann machines. Hinton and Ruslan Salakhutdinov's companion Science paper, "Reducing the Dimensionality of Data with Neural Networks," further popularized the term "deep learning"[^17][^18].
The decisive watershed came in 2012, when Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton entered AlexNet in the ImageNet Large Scale Visual Recognition Challenge. AlexNet — a deep CNN trained on two NVIDIA GTX 580 GPUs using ReLU activations and dropout — achieved a top-5 error rate of 15.3%, compared to 26.2% for the second-best (non-neural) entry[^19]. The result was widely viewed as the unambiguous victory of deep learning over hand-engineered features, and triggered the rapid adoption of deep neural networks across vision, speech, and natural language processing.
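Subsequent years produced rapid architectural advances, including generative adversarial networks (Ian Goodfellow et al., 2014) and the 152-layer ResNet (Kaiming He et al., 2015).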
In 2017, Ashish Vaswani and colleagues at Google Brain introduced the transformer in "Attention Is All You Need." The transformer replaced recurrence and convolution with multi-head self-attention, enabling models to learn long-range dependencies while being fully parallelizable on GPU and TPU hardware[^22]. Within a few years, transformers displaced LSTMs as the default sequence model and spread to vision (Vision Transformer), speech (Whisper), code, and scientific data.
In 2019, Hinton, LeCun, and Yoshua Bengio received the 2018 ACM A.M. Turing Award for "conceptual and engineering breakthroughs that have made deep neural networks a critical component of computing." The citation specifically credits backpropagation, convolutional networks, probabilistic models, and the broader connectionist research program[^23].
Beginning in 2018, large transformer language models pre-trained on web-scale corpora became the dominant paradigm. Google's BERT demonstrated that bidirectional masked-language-modeling pre-training produced state-of-the-art results across many NLP tasks, while OpenAI's GPT series — particularly GPT-3 (2020, 175 billion parameters) and GPT-4 (2023) — showed that autoregressive language models scaled to extreme sizes exhibited in-context learning and broad generalization[^24][^25]. The Stanford CRFM research group coined the term "foundation model" in 2021 to refer to any model trained on broad data at scale that can be adapted to a wide range of downstream tasks[^26].
By 2025–2026, frontier neural systems — including Anthropic's Claude, OpenAI's GPT-4 and GPT-5 lines, Google's Gemini, Meta's Llama family, and DeepMind's AlphaFold and Gemini Robotics — are routinely multimodal, mix dense and sparse mixture-of-experts layers, and rely on enormous distributed training systems running for weeks on tens of thousands of accelerators.
| Year | Milestone | Key contributors |
|---|---|---|
| 1943 | First mathematical neuron model | Warren McCulloch, Walter Pitts |
| 1949 | Hebbian learning principle | Donald Hebb |
| 1958 | Perceptron | Frank Rosenblatt |
| 1960 | ADALINE / Widrow–Hoff rule | Bernard Widrow, Marcian Hoff |
| 1969 | Perceptrons book identifies XOR limitation | Marvin Minsky, Seymour Papert |
| 1974 | Backpropagation in thesis form | Paul Werbos |
| 1982 | Hopfield networks | John Hopfield |
| 1986 | Backpropagation popularized; PDP volumes | Rumelhart, Hinton, Williams; PDP Group |
| 1989 | Backprop-trained CNN for digit recognition | Yann LeCun et al. |
| 1989 | Universal approximation theorem | George Cybenko (and Kurt Hornik, 1991) |
| 1997 | LSTM | Sepp Hochreiter, Jürgen Schmidhuber |
| 2006 | Deep belief networks; "deep learning" rises | Geoffrey Hinton et al. |
| 2012 | AlexNet wins ImageNet | Krizhevsky, Sutskever, Hinton |
| 2014 | GANs | Ian Goodfellow et al. |
| 2015 | ResNet (152 layers) | Kaiming He et al. |
| 2017 | Transformer architecture | Vaswani et al. |
| 2018 | BERT and GPT released | Google AI; OpenAI |
| 2018 | ACM A.M. Turing Award (announced 2019) to the "fathers of the deep-learning revolution" | Geoffrey Hinton, Yann LeCun, Yoshua Bengio |
| 2020 | GPT-3 (175B parameters) | OpenAI |
| 2022 | Chinchilla scaling laws; ChatGPT | DeepMind; OpenAI |
| 2023–2026 | Multimodal foundation models, sparse MoE, agentic systems | Multiple labs |
A standard feedforward neural network defines a parameterized function $f_\theta: \mathbb{R}^d \to \mathbb{R}^k$, where $\theta$ collects all weights and biases.
A single artificial neuron computes:
$$z = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b, \qquad a = \varphi(z)$$
Here $x_i$ are the inputs, $w_i$ the corresponding weights, $b$ a bias term, $\varphi$ a nonlinear activation function, and $a$ the unit's output (also called its activation).
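As a minimal sketch of the formula above, a single neuron can be written in plain Python. The function name `neuron`, the choice of `tanh` as default activation, and the example numbers are illustrative assumptions.

```python
import math


def neuron(x, w, b, phi=math.tanh):
    """Compute a = phi(w . x + b) for one artificial neuron."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b  # pre-activation: weighted sum plus bias
    return phi(z)                                  # activation


# Here z = 0.5*1.0 + (-0.25)*2.0 + 0.0 = 0.0, so the tanh output is also 0.0.
a = neuron(x=[1.0, 2.0], w=[0.5, -0.25], b=0.0)
```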
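A neural network organizes neurons into layers. Writing the activations of layer $\ell$ as a vector $a^{(\ell)}$, a fully-connected layer applies an affine transformation followed by an elementwise nonlinearity:
$$a^{(\ell)} = \varphi^{(\ell)}\left( W^{(\ell)} a^{(\ell-1)} + b^{(\ell)} \right)$$
with weight matrix $W^{(\ell)}$ and bias vector $b^{(\ell)}$. Setting $a^{(0)} = x$ and stacking $L$ such layers gives the network output:
$$f_\theta(x) = a^{(L)} = \varphi^{(L)}\left( W^{(L)} \varphi^{(L-1)}\left( \cdots \varphi^{(1)}\left( W^{(1)} x + b^{(1)} \right) \cdots \right) + b^{(L)} \right)$$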
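The layer recursion can be sketched as a loop over (weights, bias, activation) triples. The helper names `dense` and `forward` and the tiny two-layer example are assumptions made for illustration; practical code would use a tensor library rather than nested lists.

```python
def relu(v):
    return [max(0.0, t) for t in v]


def identity(v):
    return list(v)


def dense(W, b, a_prev, phi):
    """One fully-connected layer: phi(W a_prev + b)."""
    z = [sum(wij * aj for wij, aj in zip(row, a_prev)) + bi for row, bi in zip(W, b)]
    return phi(z)


def forward(layers, x):
    """Apply the layers in order, starting from a = x."""
    a = x
    for W, b, phi in layers:
        a = dense(W, b, a, phi)
    return a


# Hypothetical network: 2 inputs -> 2 hidden ReLU units -> 1 linear output.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0], relu),
    ([[2.0, 1.0]], [0.1], identity),
]
y = forward(layers, [3.0, 1.0])  # hidden activations [2.0, 1.0], output about 5.1
```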
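Layers play three roles: the input layer receives the raw features $x$, one or more hidden layers compute intermediate representations, and the output layer produces the prediction.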
For non-feedforward networks (recurrent, graph, attention-based), the same building blocks reappear but the connectivity pattern differs.
Training a neural network is the process of choosing parameters $\theta$ to make $f_\theta$ agree with a dataset. The standard recipe is empirical risk minimization by stochastic gradient descent on a differentiable loss, with gradients computed via backpropagation.
Given a training example $x$, the network is evaluated layer-by-layer to produce a prediction $\hat{y} = f_\theta(x)$. This is the forward pass. Because each layer is a matrix–vector product followed by an elementwise nonlinearity, the forward pass is a sequence of dense linear algebra operations, which is why GPUs and TPUs (hardware specialized for parallel matrix arithmetic) are so effective.
The prediction $\hat{y}$ is compared against the target $y$ using a loss function $L(\hat{y}, y)$. Common choices are mean squared error for regression, cross-entropy for classification, and contrastive or sequence-level losses for self-supervised and generative tasks. The training objective is the expected loss over the training distribution, approximated as the average loss over a mini-batch of examples.
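The two most common losses, and the mini-batch average that approximates the training objective, can be sketched as follows. The function names are illustrative, and the cross-entropy here assumes the network outputs raw logits and the target is a class index.

```python
import math


def mse(y_hat, y):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(y_hat, y)) / len(y)


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[target_index])


def batch_loss(loss_fn, predictions, targets):
    """Average a per-example loss over a mini-batch."""
    return sum(loss_fn(p, t) for p, t in zip(predictions, targets)) / len(targets)


regression_loss = mse([1.0, 2.0], [1.0, 4.0])            # (0 + 4) / 2
classification_loss = cross_entropy([0.0, 0.0], 0)       # uniform over 2 classes: ln 2
```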
To improve $\theta$, the network computes $\nabla_\theta L$: the gradient of the loss with respect to every parameter. Backpropagation is the efficient algorithm for doing so. It applies the chain rule of calculus to the computation graph of the forward pass, starting from the output and propagating partial derivatives backward layer-by-layer. Backpropagation is the dominant training algorithm for essentially all modern neural networks[^9].
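To make the chain-rule bookkeeping concrete, here is a hand-written backward pass for a tiny network with one tanh hidden layer and a squared-error loss, checked against a finite-difference estimate. The architecture, parameter values, and function names are assumptions chosen for illustration; real frameworks derive these gradients automatically.

```python
import math


def forward(params, x):
    W1, b1, w2, b2 = params
    z1 = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    h = [math.tanh(z) for z in z1]                   # hidden activations
    y_hat = sum(w * hi for w, hi in zip(w2, h)) + b2  # linear scalar output
    return h, y_hat


def loss_and_grads(params, x, y):
    W1, b1, w2, b2 = params
    h, y_hat = forward(params, x)
    loss = 0.5 * (y_hat - y) ** 2
    # Backward pass: start at the output and apply the chain rule layer by layer.
    d_y = y_hat - y                                   # dL/dy_hat
    g_w2 = [d_y * hi for hi in h]                     # dL/dw2
    g_b2 = d_y                                        # dL/db2
    d_h = [d_y * w for w in w2]                       # dL/dh
    d_z1 = [dh * (1.0 - hi ** 2) for dh, hi in zip(d_h, h)]  # tanh'(z) = 1 - tanh(z)^2
    g_W1 = [[dz * xi for xi in x] for dz in d_z1]     # dL/dW1
    g_b1 = d_z1                                       # dL/db1
    return loss, (g_W1, g_b1, g_w2, g_b2)


params = ([[0.3, -0.2], [0.1, 0.4]], [0.0, 0.1], [0.5, -0.7], 0.2)
x, y = [1.0, 2.0], 1.0
loss, grads = loss_and_grads(params, x, y)
analytic = grads[0][0][1]                             # dL/dW1[0][1] from backprop

# Finite-difference check on the same weight.
eps = 1e-6
W1_plus = [row[:] for row in params[0]]
W1_plus[0][1] += eps
loss_plus, _ = loss_and_grads((W1_plus, params[1], params[2], params[3]), x, y)
numeric = (loss_plus - loss) / eps
```

The analytic and numeric values should agree to several decimal places; a mismatch is the usual sign of a bug in a hand-derived backward pass.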
Gradient descent uses the gradient to update parameters:
$$\theta \leftarrow \theta - \eta \, \nabla_\theta L$$
where η is the learning rate. In practice, the gradient is estimated from a mini-batch of typically 32–4,096 examples; this is mini-batch stochastic gradient descent. Modern training almost always uses momentum-based and adaptive variants. The most popular optimizers are SGD with momentum, Adam, and AdamW, the last being the de facto standard for transformer training.
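A minimal sketch of the update with momentum is shown below, using one common formulation in which a velocity accumulates past gradients. The function name, hyperparameter values, and the toy quadratic objective are assumptions; Adam and AdamW add per-parameter scaling on top of this idea and are not shown.

```python
def sgd_momentum_step(theta, grad, velocity, lr=0.1, momentum=0.9):
    """One SGD-with-momentum update on a flat list of parameters."""
    velocity = [momentum * v + g for v, g in zip(velocity, grad)]  # accumulate gradients
    theta = [t - lr * v for t, v in zip(theta, velocity)]          # step against velocity
    return theta, velocity


# Toy objective L(theta) = sum(theta_i^2), whose gradient is 2 * theta.
theta, velocity = [3.0, -2.0], [0.0, 0.0]
for _ in range(200):
    grad = [2.0 * t for t in theta]
    theta, velocity = sgd_momentum_step(theta, grad, velocity)
```

Setting `momentum=0.0` recovers the plain gradient descent update given above.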
Large networks overfit easily, so training combines several techniques to improve generalization and stabilize optimization:
These techniques, together with careful initialization (He, Xavier/Glorot) and the use of ReLU-family activations, are what make networks of hundreds or thousands of layers and billions of parameters trainable in practice.
Neural networks come in many architectural families; each tailors connectivity and parameterization to a class of data. This section is a high-level tour with links to dedicated articles.
The classical fully-connected feedforward network — historically called a multilayer perceptron or feedforward neural network — applies a stack of dense linear layers and nonlinear activations. MLPs remain ubiquitous as components inside larger models: the position-wise "MLP block" in a transformer is essentially a two-layer MLP applied independently to each token.
Convolutional neural networks introduce weight sharing and local receptive fields well-suited to spatial data such as images, video, and audio spectrograms. Key operations are convolution (feature detection), pooling (spatial downsampling), and fully connected classification heads. Landmark architectures include LeNet (1989/1998), AlexNet (2012), VGG (2014), GoogLeNet/Inception (2014), and ResNet (2015)[^11][^19][^21].
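Weight sharing and local receptive fields are easiest to see in one dimension. The sketch below slides one small kernel across a signal (the cross-correlation that deep learning libraries call convolution) and then downsamples with max pooling; the function names and the edge-detector kernel are illustrative assumptions.

```python
def conv1d(signal, kernel):
    """Valid 1D convolution: the same kernel weights are reused at every position."""
    k = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(k))  # local receptive field of width k
        for i in range(len(signal) - k + 1)
    ]


def max_pool1d(values, size=2):
    """Non-overlapping max pooling (stride equal to the window size)."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]


signal = [0, 0, 1, 1, 1, 0, 0, 0]
edges = conv1d(signal, [-1, 1])   # responds with +1 at a rising edge, -1 at a falling edge
pooled = max_pool1d(edges, 2)
```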
Recurrent neural networks maintain a hidden state that is updated as a sequence is consumed, providing a natural model for time series, text, and speech. Vanilla RNNs suffer from vanishing and exploding gradients, motivating gated variants: LSTM (Hochreiter and Schmidhuber, 1997) and the simpler GRU (Cho et al., 2014). RNNs dominated machine translation and speech recognition from roughly 2014 until they were largely displaced by transformers after 2017[^15].
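The hidden-state update and the vanishing-gradient effect can both be seen in a scalar toy RNN. This is a sketch under simplifying assumptions (one hidden unit, fixed weights `w` and `u`, no gating); the names are hypothetical.

```python
import math


def rnn_scan(xs, w=0.5, u=1.0, h0=0.0):
    """Scalar RNN: h_t = tanh(w * h_{t-1} + u * x_t)."""
    h = h0
    states = []
    grad_wrt_h0 = 1.0  # running product of d h_t / d h_{t-1}
    for x in xs:
        h = math.tanh(w * h + u * x)          # update the hidden state
        grad_wrt_h0 *= w * (1.0 - h * h)      # chain rule through one time step
        states.append(h)
    return states, grad_wrt_h0


states, grad = rnn_scan([0.1] * 20)
```

Each factor in the product is at most `w` in magnitude, so with `w = 0.5` the sensitivity of the final state to the initial state shrinks at least as fast as 0.5 to the power of the sequence length. Gated architectures such as LSTM were designed to counteract this decay.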
The transformer has been the dominant architecture for sequence modeling since its introduction in 2017. Its core ingredient is multi-head self-attention: each position in the sequence computes its output as a learned, weighted combination of the representations at all positions. Transformers are highly parallelizable, scale gracefully to enormous parameter counts, and now power large language models, image models (Vision Transformer), speech models (Whisper), and protein models (AlphaFold 2/3)[^22].
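A single attention head can be sketched with nested lists. This is a simplified illustration of scaled dot-product self-attention: it omits multiple heads, masking, positional information, and the output projection, and the identity projection matrices in the example are an assumption made to keep the numbers easy to follow.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a sequence X."""
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    d_k = len(K[0])
    # Similarity of every query with every key, scaled by sqrt(d_k).
    scores = [[sum(q * k for q, k in zip(qi, kj)) / math.sqrt(d_k) for kj in K] for qi in Q]
    weights = [softmax(row) for row in scores]  # each row sums to 1
    return matmul(weights, V), weights          # weighted combination of the values


identity = [[1.0, 0.0], [0.0, 1.0]]
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]        # three tokens, two features each
out, weights = self_attention(X, identity, identity, identity)
```

In this toy input the first token attends more strongly to itself and to the third token than to the second, because its query has a larger dot product with their keys.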
Graph neural networks generalize convolutions to arbitrary graphs by iteratively passing messages between neighboring nodes. They are central to molecular property prediction, drug discovery, recommendation systems, traffic forecasting (Google's road ETA models), and combinatorial optimization.
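One round of message passing can be sketched as each node averaging its own features with those of its neighbors. This is a deliberately stripped-down assumption: practical layers such as GCN or GraphSAGE also apply learned weight matrices and a nonlinearity, which are omitted here.

```python
def message_passing_step(features, neighbors):
    """One round of mean aggregation over each node and its neighbors."""
    updated = {}
    for node, feat in features.items():
        group = [feat] + [features[n] for n in neighbors[node]]  # self plus incoming messages
        updated[node] = [sum(vals) / len(group) for vals in zip(*group)]
    return updated


# Hypothetical path graph a - b - c with a one-dimensional feature per node.
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
features = {"a": [1.0], "b": [0.0], "c": [0.0]}
after_one = message_passing_step(features, neighbors)
after_two = message_passing_step(after_one, neighbors)
```

After one round the signal from node `a` has reached `b` but not `c`; after two rounds it has reached `c`. Stacking k rounds lets each node see its k-hop neighborhood.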
State-space models (SSMs) such as S4 and Mamba are a more recent family that models long sequences via linear recurrences with structured kernels, achieving subquadratic scaling in sequence length. SSMs are increasingly used as transformer alternatives or complements for long-context modeling.
Autoencoders (including variational autoencoders) learn compressed latent representations by reconstructing their inputs. Generative families include GANs, normalizing flows, autoregressive models, and (most recently) diffusion models, which dominate state-of-the-art image and video synthesis.
A mixture-of-experts (MoE) layer routes each input token to a small subset of "expert" sub-networks via a learned gating function. MoE allows total parameter count to grow much faster than per-token compute and underlies many recent frontier models (Mixtral, GPT-4-class systems, DeepSeek-V3).
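The routing idea can be sketched for a single token. The gate below is a plain linear scorer followed by top-k selection and a softmax over the selected experts; the expert functions, gate weights, and `top_k=2` are all hypothetical, and production systems add load balancing and batching that are not shown.

```python
import math


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def moe_layer(x, gate_weights, experts, top_k=2):
    """Route one token to its top_k experts and mix their outputs."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    chosen = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:top_k]
    gate = softmax([logits[i] for i in chosen])   # renormalize over the chosen experts
    out = [0.0] * len(x)                          # assumes experts preserve dimension
    for g, i in zip(gate, chosen):
        y = experts[i](x)                         # only the chosen experts are evaluated
        out = [o + g * yi for o, yi in zip(out, y)]
    return out, chosen


experts = [
    lambda v: [2.0 * t for t in v],
    lambda v: [-t for t in v],
    lambda v: [t + 1.0 for t in v],
    lambda v: [0.0 for _ in v],
]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
out, chosen = moe_layer([2.0, 1.0], gate_weights, experts)
```

With four experts and `top_k=2`, the layer holds four experts' worth of parameters but spends compute on only two per token, which is the trade-off described above.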
| Architecture | Best suited for | Key mechanism | Examples |
|---|---|---|---|
| MLP | Tabular data, components in larger nets | Fully-connected layers | Standard MLP |
| CNN | Images, audio, video | Convolutional filters, pooling | AlexNet, ResNet, Inception |
| RNN / LSTM | Sequences, time series | Recurrent hidden state with gating | LSTM, GRU |
| Transformer | Text, sequences, multimodal | Multi-head self-attention | GPT-4, Claude, BERT, ViT |
| Graph NN | Graphs, molecules, social networks | Message passing | GCN, GAT, GraphSAGE |
| State-space model / Mamba | Very long sequences | Structured linear recurrence | S4, Mamba |
| Autoencoder | Compression, representation learning | Encoder–decoder bottleneck | VAE, denoising AE |
| GAN | Image synthesis | Generator vs. discriminator | StyleGAN, BigGAN |
| Mixture of experts | Scaling parameters cheaply | Sparse routing | Switch Transformer, Mixtral |
Activation functions inject the nonlinearity that lets stacked layers represent more than a single affine map. The most widely used today are[^27]:
| Activation | Definition | Notes |
|---|---|---|
| Sigmoid | $\sigma(x) = 1 / (1 + e^{-x})$ | Bounded (0,1); historically dominant; vanishing gradients limit use in deep nets. Common at output of binary classifiers. |
| Tanh | $\tanh(x) = (e^{x} - e^{-x}) / (e^{x} + e^{-x})$ | Bounded (−1,1); standard in early RNNs and LSTMs. |
| ReLU | $\max(0, x)$ | Default hidden activation since AlexNet (2012); simple, sparse, and accelerates convergence[^19]. |
| Leaky ReLU / PReLU | $\max(\alpha x, x)$ | Avoids "dead neuron" problem of vanilla ReLU. |
| GELU | $x \cdot \Phi(x)$ | Smooth ReLU variant; default in BERT, GPT, and many modern transformers[^28]. |
| SwiGLU | $\mathrm{Swish}(xW) \odot (xV)$ | Gated activation used in PaLM, LLaMA, and many recent LLMs[^29]. |
| Softmax | $e^{x_i} / \sum_j e^{x_j}$ | Converts logits to a probability distribution at the output of multiclass classifiers. |
The shift from sigmoid/tanh to ReLU around 2010–2012 — and then to GELU and gated variants such as SwiGLU around 2018–2020 — was one of several inconspicuous changes that made deep networks reliably trainable.
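For reference, the scalar activations in the table can be written directly with the standard library. This is a plain sketch: the default `alpha`, the exact (erf-based) GELU, and the function names are choices made for the example, and the sigmoid is not hardened against overflow for very large negative inputs.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    return max(0.0, x)


def leaky_relu(x, alpha=0.01):
    return max(alpha * x, x)       # assumes 0 < alpha < 1


def gelu(x):
    # x * Phi(x), where Phi is the standard normal CDF expressed via erf.
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def swish(x):
    return x * sigmoid(x)          # the gate nonlinearity used inside SwiGLU


def softmax(logits):
    m = max(logits)                # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.0])
```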
The choice of loss function encodes what "wrong" means for the task:
The universal approximation theorem is the foundational expressivity result for feedforward neural networks. George Cybenko proved in 1989 that finite linear combinations of sigmoidal activation functions are dense in the space of continuous functions on the unit cube. In other words, a single-hidden-layer feedforward network can approximate any continuous function on a compact domain to any desired accuracy, given enough hidden units[^13]. In 1991, Kurt Hornik extended the result to arbitrary bounded, nonconstant activation functions and clarified that the universality is a property of the multi-layer architecture rather than of sigmoid in particular[^14].
The theorem is purely an existence result. It does not say how many neurons are needed, how to find the right weights, or whether gradient descent can in fact reach them. Subsequent work has shown that depth often allows networks to represent the same functions with exponentially fewer parameters than a shallow network, which is a key motivation for the modern emphasis on deep models.
Training contemporary neural networks is a large-scale systems problem.
Modern training runs on parallel matrix accelerators:
Distributed training spreads a single model across many accelerators. Standard parallelism strategies include:
Frameworks such as Megatron-LM, DeepSpeed, FSDP (Fully Sharded Data Parallel), and JAX's pjit/shard_map orchestrate these strategies.
Modern training uses mixed precision where the hardware supports it: weights are stored in 32-bit or 16-bit precision while matrix multiplications run in lower-precision formats (FP16, BF16, FP8), which dramatically increases throughput. Loss scaling and stochastic rounding are used to preserve numerical stability. Mixed precision and sparse MoE layers account for much of the improvement in training efficiency for large models over the past five years, with KV-cache techniques playing a comparable role at inference time.
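The reason loss scaling helps can be shown with the standard library alone, using the `struct` module's half-precision format to emulate FP16 rounding. The gradient value and the scale factor of 1024 are illustrative assumptions; real frameworks typically adjust the scale dynamically.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


tiny_grad = 1e-8                      # smaller than FP16 can represent
unscaled = to_fp16(tiny_grad)         # underflows to zero

scale = 1024.0
scaled = to_fp16(tiny_grad * scale)   # scaled value survives the cast to FP16
recovered = scaled / scale            # unscale in higher precision before the update
```

Without scaling the gradient is lost entirely; with scaling it is recovered to within a fraction of a percent, at the cost of having to watch for overflow when the scale is too large.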
Scaling laws, first systematically studied by Kaplan et al. (OpenAI, 2020) and refined by the Chinchilla work of Hoffmann et al. (DeepMind, 2022), describe how validation loss falls smoothly as a power law in compute, dataset size, and parameters. Chinchilla in particular showed that for a fixed compute budget, model size and training tokens should be scaled in roughly equal proportions, implying that many earlier large models were undertrained[^30][^31].
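The equal-proportions result can be turned into a back-of-the-envelope calculator. The sketch relies on two commonly quoted rules of thumb that are assumptions here, not statements from this article: training compute of roughly 6 FLOPs per parameter per token, and a compute-optimal ratio of roughly 20 tokens per parameter.

```python
import math


def compute_optimal(compute_flops, tokens_per_param=20.0, flops_per_param_token=6.0):
    """Split a FLOP budget C ~ 6 * N * D under the assumption D ~ 20 * N."""
    n_params = math.sqrt(compute_flops / (flops_per_param_token * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


# A budget of about 5.88e23 FLOPs gives roughly 70B parameters and 1.4T tokens.
n_params, n_tokens = compute_optimal(5.88e23)

# Quadrupling compute doubles both the model size and the token count.
n_params_4x, n_tokens_4x = compute_optimal(4 * 5.88e23)
```

Under these assumptions both quantities grow with the square root of compute, which is what scaling them in equal proportions means in practice.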
Most modern neural networks are built using one of a handful of high-level frameworks:
All three major frameworks share core abstractions: tensors as multi-dimensional arrays, automatic differentiation, and compilation or dispatch to optimized accelerator code (cuDNN, XLA, Triton).
CNNs and Vision Transformers power image classification, object detection, semantic and instance segmentation, optical character recognition, satellite and medical imaging analysis, and the perception stacks of autonomous vehicles. Architectures such as AlexNet, ResNet, Inception, YOLO, EfficientNet, ViT, and Segment Anything are all neural networks.
Transformer-based language models dominate machine translation, summarization, question answering, code generation, sentiment analysis, and conversational AI. Large language models such as Claude, GPT-4, Gemini, and Llama 3/4 perform extended chains of reasoning, function calling, and multi-step agentic behavior.
Neural networks underpin automatic speech recognition (e.g., Whisper), text-to-speech (WaveNet, VALL-E), music generation, and audio classification.
DeepMind's AlphaGo defeated world champion Lee Sedol at Go in 2016, combining deep CNNs, value/policy networks, and Monte Carlo tree search. AlphaGo Zero (2017) and MuZero (2019) generalized this approach to self-play and model-based reinforcement learning.
AlphaFold 2 (2020) predicted protein 3D structures at near-experimental accuracy. The 2024 Nobel Prize in Chemistry recognized this line of work: one half went to Demis Hassabis and John Jumper for protein structure prediction and the other half to David Baker for computational protein design[^32]. Neural networks are now used in weather forecasting (GraphCast, GenCast), materials discovery, particle physics, fluid dynamics, and quantum chemistry. Physics-informed neural networks (PINNs) embed known physical laws as soft constraints.
Applications include radiology and pathology image analysis, retinal disease screening, electronic health record modeling, drug discovery, and single-cell genomics.
The empirical success of neural networks has outpaced theoretical understanding, but several lines of work have made progress:
Open theoretical questions include why overparameterized networks generalize, how representations form during training, whether emergent capabilities reflect genuine phase transitions, and what — if any — fundamental limits constrain scaling.
Despite their success, neural networks have well-known limitations: