# Neural Network

> Source: https://aiwiki.ai/wiki/neural_network
> Updated: 2026-06-20
> Categories: Deep Learning, Machine Learning, Neural Networks
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

A **neural network** (also called an **artificial neural network** or **ANN**) is a computational model, loosely inspired by the networks of biological neurons in animal brains, that learns to perform tasks by adjusting the strengths of connections between simple processing units rather than by following explicit programmed rules. It consists of layered groups of artificial [neurons](/wiki/neuron) that transmit numerical signals along weighted connections; by tuning these [weights](/wiki/weight) on data, the network can approximate a very wide class of functions, which lets it handle tasks ranging from classifying images to generating text. Since the early 2010s, deep neural networks (networks with many stacked layers) have become the dominant model class in [machine learning](/wiki/machine_learning) and the technological foundation of [deep learning](/wiki/deep_learning) and modern [artificial intelligence](/wiki/artificial_intelligence)[^1][^2]. The earliest mathematical model of an artificial neuron was published by Warren McCulloch and Walter Pitts in 1943, the first trainable network (the [perceptron](/wiki/perceptron)) by Frank Rosenblatt in 1958, and the [backpropagation](/wiki/backpropagation) learning algorithm that made deep networks practical was popularized in 1986[^3][^5][^9].

This article is a high-level survey. It traces the eight-decade history of neural networks, sketches their mathematical structure and training procedure, and links out to dedicated articles for each major architecture (such as the [convolutional neural network](/wiki/convolutional_neural_network), [recurrent neural network](/wiki/recurrent_neural_network), and [transformer](/wiki/transformer)), training technique, and theoretical concept.

## What is a neural network? (Explain like I'm 5)

Imagine a huge team of tiny helpers, where each helper can only do one very simple thing: look at some numbers coming in, multiply them, add them up, and pass the result to the next helper. Alone, none of them are very smart. But when you line up thousands of these helpers in rows and connect them together, something amazing happens. You can show the whole team a picture of a cat, and after the numbers pass through all the helpers, the team says "cat!" at the end.

How does the team get so good? By practice. At first, the helpers give wrong answers. Each time they are wrong, a coach goes backward through the team and tells each helper to nudge its multiplication number a tiny bit. After seeing thousands of pictures of cats, dogs, and cars, the helpers settle on numbers that work, and the team can recognize new things it has never seen before.

That is essentially how a neural network works. The "helpers" are artificial [neurons](/wiki/neuron), the "multiplication numbers" are weights, and the "coach going backward" is an algorithm called [backpropagation](/wiki/backpropagation).

## How does a neural network work?

At a high level, a neural network turns an input (such as the pixels of an image or the tokens of a sentence) into an output (such as a label or the next word) by passing numbers through a sequence of layers. Each artificial neuron multiplies its inputs by learned weights, adds a bias, and applies a nonlinear [activation function](/wiki/activation_function); stacking many such layers lets the network build up increasingly abstract representations of the data. The network is not programmed with the rules for a task. Instead, it is shown many examples and learns the weights that minimize its errors, using the forward pass, a [loss function](/wiki/loss_function), and the backward pass described in the Training section below. The mathematical formulation that follows makes this picture precise.

## History

The history of neural networks spans more than eighty years and is conventionally divided into periods of progress separated by stretches of reduced funding and attention often called [AI winters](/wiki/ai_winter).

### When were neural networks invented? Early origins (1940s)

In 1943, neurophysiologist Warren McCulloch and logician Walter Pitts published "A Logical Calculus of the Ideas Immanent in Nervous Activity" in the *Bulletin of Mathematical Biophysics*. They proposed the first mathematical model of an artificial neuron: a binary-threshold unit that fires if and only if a weighted sum of its inputs exceeds a fixed threshold. The paper opens with the premise that motivated the whole field: "Because of the 'all-or-none' character of nervous activity, neural events and the relations among them can be treated by means of propositional logic."[^3] They proved that networks of such units, with appropriately chosen connections, could in principle compute any function expressible in propositional logic[^3].

In 1949, psychologist Donald Hebb published *The Organization of Behavior*, in which he formulated what became known as the Hebbian learning principle. Hebb proposed that when one neuron repeatedly participates in firing another, the connection between them strengthens, often paraphrased as "cells that fire together wire together." Hebb's principle provided a theoretical basis for how synaptic strengths might be modified by experience and remains the conceptual root of many learning rules used in both neuroscience and artificial neural networks[^4].

### Perceptron era (1957-1968)

In 1957 and 1958, Frank Rosenblatt of the Cornell Aeronautical Laboratory introduced the [perceptron](/wiki/perceptron), the first trainable neural network model. Rosenblatt described the perceptron in a 1957 technical report and in the widely cited 1958 *Psychological Review* paper "The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain." The perceptron is a single-layer network of threshold neurons whose weights are learned from data via an iterative correction rule. Rosenblatt and colleagues also built the Mark I Perceptron, a custom hardware implementation with photocell inputs[^5].

In 1960, Bernard Widrow and his student Marcian "Ted" Hoff at Stanford introduced ADALINE (ADAptive LInear NEuron) and the closely related MADALINE network. ADALINE used a linear output unit and was trained by the Widrow-Hoff (or least-mean-square, LMS) learning rule, which adjusts weights in proportion to the difference between target and actual output. The LMS rule is a precursor to modern [gradient descent](/wiki/gradient_descent) and remains a foundational algorithm in adaptive signal processing[^6].

These early successes generated intense optimism. After a US Navy press conference in 1958, *The New York Times* reported that the perceptron was "the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence."[^35] Such coverage set expectations that the technology of the era could not meet.

### First AI winter (1969-1979)

In 1969, Marvin Minsky and Seymour Papert published *Perceptrons: An Introduction to Computational Geometry* (MIT Press). The book gave a rigorous mathematical analysis of what single-layer perceptrons can and cannot compute. Most famously, Minsky and Papert showed that a single-layer perceptron cannot represent the exclusive-or (XOR) function because XOR is not linearly separable. They also raised pessimistic, though more nuanced, questions about whether multi-layer networks could be trained efficiently[^7].

The book's influence, combined with the unrealistic expectations set during the perceptron boom, contributed to a sharp decline in neural network research funding through the 1970s. Together with the broader 1973 Lighthill report and contemporaneous funding cuts, this period is widely considered part of the first [AI winter](/wiki/ai_winter). The XOR limitation applies strictly to single-layer perceptrons; multi-layer networks can represent XOR and arbitrary Boolean functions, but a practical learning algorithm for multi-layer networks would not become widely known until the mid-1980s.
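The linear-separability argument can be checked directly. The sketch below uses hand-picked, purely illustrative weights (they do not come from Minsky and Papert's book) to show a two-layer network of threshold units computing XOR, and then searches a coarse grid of weights to illustrate that no single threshold unit does. The grid search is only a demonstration; the impossibility itself follows from XOR not being linearly separable.

```python
from itertools import product

def step(z):
    """Binary threshold: output 1 if and only if the weighted sum exceeds 0."""
    return 1 if z > 0 else 0

def unit(weights, bias, inputs):
    """One threshold neuron: step(w . x + b)."""
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)

def xor_two_layer(x1, x2):
    # Hypothetical hand-chosen weights: the hidden units compute OR and NAND,
    # and the output unit computes AND of the two hidden activations.
    h_or = unit([1, 1], -0.5, [x1, x2])
    h_nand = unit([-1, -1], 1.5, [x1, x2])
    return unit([1, 1], -1.5, [h_or, h_nand])

INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
XOR = [0, 1, 1, 0]

two_layer_outputs = [xor_two_layer(a, b) for a, b in INPUTS]

# Coarse grid over (w1, w2, b): no single threshold unit reproduces XOR.
grid = [v / 2 for v in range(-6, 7)]  # -3.0, -2.5, ..., 3.0
single_unit_solves_xor = any(
    [unit([w1, w2], b, x) for x in INPUTS] == XOR
    for w1, w2, b in product(grid, repeat=3)
)

print(two_layer_outputs)       # [0, 1, 1, 0]
print(single_unit_solves_xor)  # False
```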

### Connectionism revival (1980s)

Throughout the 1970s and early 1980s, a small group of researchers continued to study learning in multi-layer networks. In 1974, Paul Werbos described the backpropagation algorithm in his Harvard PhD thesis, framing it as an application of the chain rule to layered systems[^8]. Independent rediscoveries and extensions appeared throughout the early 1980s.

The decisive event was the 1986 publication of David Rumelhart, Geoffrey Hinton, and Ronald Williams's "Learning Representations by Back-Propagating Errors" in *Nature*, which clearly described [backpropagation](/wiki/backpropagation) and demonstrated that gradient-trained multi-layer networks could learn useful internal representations[^9]. In parallel, Rumelhart, James McClelland, and the broader Parallel Distributed Processing (PDP) Research Group published the two-volume *Parallel Distributed Processing: Explorations in the Microstructure of Cognition*, which laid out the connectionist research program: cognition as the emergent behavior of large networks of simple units. The PDP volumes helped seed a generation of researchers, including Hinton and Yoshua Bengio[^10].

Other landmark contributions of the 1980s include John Hopfield's 1982 associative memory networks (Hopfield networks), the introduction of Boltzmann machines by Hinton and Terrence Sejnowski, and Teuvo Kohonen's self-organizing maps.

### CNN era (1989-1998)

In 1989, Yann LeCun and colleagues at AT&T Bell Labs applied backpropagation to a [convolutional neural network](/wiki/convolutional_neural_network) for reading handwritten digits, demonstrating that depth, weight sharing, and local receptive fields could solve a practical pattern-recognition problem[^11]. Subsequent refinements led to the [LeNet](/wiki/lenet)-5 system described in LeCun et al.'s 1998 *Proceedings of the IEEE* paper "Gradient-Based Learning Applied to Document Recognition," which was deployed at scale to read checks and ZIP codes in the United States[^12].

The same period produced fundamental theoretical results. In 1989, George Cybenko proved that a feedforward network with a single hidden layer of sigmoidal units can approximate any continuous function on a compact domain to arbitrary accuracy[^13]. In 1991, Kurt Hornik generalized this universal approximation result to a wide class of activation functions, establishing approximation as a property of the multi-layer architecture rather than any particular nonlinearity[^14]. The 1997 introduction of [Long Short-Term Memory](/wiki/lstm) (LSTM) by Sepp Hochreiter and Jürgen Schmidhuber provided a sequence model that mitigated the vanishing-gradient problem in recurrent networks and would later dominate speech and language modeling until transformers displaced it in the late 2010s[^15].

### Second AI winter (mid 1990s-mid 2000s)

Despite these advances, neural networks fell out of favor through much of the 1990s and early 2000s. Vladimir Vapnik and colleagues' [support vector machine](/wiki/support_vector_machine) (SVM), with strong theoretical guarantees and effective kernel methods, became the default tool for classification on moderate-sized datasets. Ensemble methods such as random forests and gradient-boosted trees also outperformed neural networks on many benchmarks. Compute and data were scarce by modern standards, deep networks were difficult to train end-to-end, and many researchers abandoned neural approaches. This stretch is sometimes called the second AI winter for neural networks (though the broader AI field experienced its own cycles)[^16].

### Deep learning revolution (2006-2016)

The modern era of neural networks began in 2006 with Geoffrey Hinton, Simon Osindero, and Yee-Whye Teh's "A Fast Learning Algorithm for Deep Belief Nets" in *Neural Computation*, which showed how to pre-train deep networks layer-by-layer using restricted Boltzmann machines. Hinton and Ruslan Salakhutdinov's companion *Science* paper, "Reducing the Dimensionality of Data with Neural Networks," further popularized the term "deep learning"[^17][^18].

The decisive watershed came in 2012, when Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton entered [AlexNet](/wiki/alexnet) in the [ImageNet](/wiki/imagenet) Large Scale Visual Recognition Challenge. AlexNet, a deep CNN with 60 million parameters and 650,000 neurons trained on two NVIDIA GTX 580 GPUs using ReLU activations and dropout, achieved a top-5 error rate of 15.3%, compared to 26.2% for the second-best (non-neural) entry[^19]. The result was widely viewed as the unambiguous victory of deep learning over hand-engineered features, and triggered the rapid adoption of deep neural networks across vision, speech, and natural language processing.

Subsequent years produced rapid architectural advances:

- 2014: Ian Goodfellow and colleagues introduced the [generative adversarial network](/wiki/gan) (GAN), a framework in which a generator and discriminator network are trained in opposition[^20].
- 2014: Sutskever, Vinyals, and Le introduced sequence-to-sequence learning with LSTMs, soon extended with Bahdanau attention (2014-2015) to dramatically improve machine translation.
- 2015: Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun introduced [ResNet](/wiki/resnet), whose [residual connections](/wiki/residual_connection) enabled training of networks 152 layers deep and won the 2015 ImageNet challenge[^21].
- 2016: DeepMind's [AlphaGo](/wiki/alphago) defeated world champion Lee Sedol, combining deep neural networks with Monte Carlo tree search and reinforcement learning.

### Transformer era (2017-present)

In 2017, Ashish Vaswani and colleagues at Google Brain introduced the [transformer](/wiki/transformer) in "Attention Is All You Need." The transformer replaced recurrence and convolution with multi-head [self-attention](/wiki/self_attention); the authors proposed "a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely," and reported 28.4 BLEU on the WMT 2014 English-to-German task and 41.8 BLEU on English-to-French, setting new state-of-the-art results while being far more parallelizable[^22]. Within a few years, transformers displaced LSTMs as the default sequence model and spread to vision ([Vision Transformer](/wiki/vision_transformer_vit)), speech ([Whisper](/wiki/whisper)), code, and scientific data.

In 2019, Hinton, LeCun, and Yoshua Bengio received the 2018 ACM A.M. [Turing Award](/wiki/turing_award) for "conceptual and engineering breakthroughs that have made deep neural networks a critical component of computing." The citation specifically credits backpropagation, convolutional networks, probabilistic models, and the broader connectionist research program[^23].

### Foundation model era (2018-present)

Beginning in 2018, large transformer language models pre-trained on web-scale corpora became the dominant paradigm. Google's [BERT](/wiki/bert) demonstrated that bidirectional masked-language-modeling pre-training produced state-of-the-art results across many NLP tasks, while OpenAI's [GPT](/wiki/gpt) series, particularly GPT-3 (2020, 175 billion parameters, described by its authors as "10x more than any previous non-sparse language model") and GPT-4 (2023), showed that autoregressive language models scaled to extreme sizes exhibited [in-context learning](/wiki/in_context_learning) and broad generalization[^24][^25]. The Stanford CRFM research group coined the term "[foundation model](/wiki/foundation_model)" in 2021 to refer to any model trained on broad data at scale that can be adapted to a wide range of downstream tasks[^26].

By 2025-2026, frontier neural systems, including Anthropic's Claude, OpenAI's GPT-4 and GPT-5 lines, Google's Gemini, Meta's Llama family, and DeepMind's AlphaFold and Gemini Robotics, are routinely multimodal, mix dense and sparse [mixture-of-experts](/wiki/mixture_of_experts) layers, and rely on enormous distributed training systems running for weeks on tens of thousands of accelerators.

| Year | Milestone | Key contributors |
|------|-----------|------------------|
| 1943 | First mathematical neuron model | Warren McCulloch, Walter Pitts |
| 1949 | Hebbian learning principle | Donald Hebb |
| 1958 | [Perceptron](/wiki/perceptron) | Frank Rosenblatt |
| 1960 | ADALINE / Widrow-Hoff rule | Bernard Widrow, Marcian Hoff |
| 1969 | *Perceptrons* book identifies XOR limitation | Marvin Minsky, Seymour Papert |
| 1974 | [Backpropagation](/wiki/backpropagation) in thesis form | Paul Werbos |
| 1982 | Hopfield networks | John Hopfield |
| 1986 | Backpropagation popularized; PDP volumes | Rumelhart, Hinton, Williams; PDP Group |
| 1989 | Backprop-trained CNN for digit recognition | Yann LeCun et al. |
| 1989 | Universal approximation theorem | George Cybenko (and Kurt Hornik, 1991) |
| 1997 | [LSTM](/wiki/lstm) | Sepp Hochreiter, Jürgen Schmidhuber |
| 2006 | Deep belief networks; "deep learning" rises | Geoffrey Hinton et al. |
| 2012 | [AlexNet](/wiki/alexnet) wins ImageNet (60M parameters) | Krizhevsky, Sutskever, Hinton |
| 2014 | [GANs](/wiki/gan) | Ian Goodfellow et al. |
| 2015 | [ResNet](/wiki/resnet) (152 layers) | Kaiming He et al. |
| 2017 | [Transformer](/wiki/transformer) architecture | Vaswani et al. |
| 2018 | BERT and GPT released | Google AI; OpenAI |
| 2019 | 2018 ACM A.M. Turing Award announced ("Fathers of the deep-learning revolution") | Geoffrey Hinton, Yann LeCun, Yoshua Bengio |
| 2020 | GPT-3 (175B parameters) | OpenAI |
| 2022 | Chinchilla scaling laws; ChatGPT | DeepMind; OpenAI |
| 2023-2026 | Multimodal foundation models, sparse MoE, agentic systems | Multiple labs |

## Mathematical formulation

A standard feedforward neural network defines a parameterized function f<sub>θ</sub>: R<sup>d</sup> → R<sup>k</sup>, where θ collects all weights and biases.

A single artificial [neuron](/wiki/neuron) computes:

**z = w<sub>1</sub>·x<sub>1</sub> + w<sub>2</sub>·x<sub>2</sub> + … + w<sub>n</sub>·x<sub>n</sub> + b**,    **a = φ(z)**

Here x<sub>i</sub> are the inputs, w<sub>i</sub> the corresponding weights, b a bias term, φ a nonlinear [activation function](/wiki/activation_function), and a the unit's output (also called its activation).
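As a minimal sketch of this formula, the following evaluates one neuron in plain Python. The sigmoid activation and the example numbers are arbitrary choices for illustration only.

```python
import math

def sigmoid(z):
    """Logistic activation: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w, b, phi=sigmoid):
    """Compute a = phi(z) with z = w_1*x_1 + ... + w_n*x_n + b."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return phi(z)

# z = 0.5*1.0 + (-0.25)*2.0 + 0.0 = 0.0, so a = sigmoid(0) = 0.5
a = neuron(x=[1.0, 2.0], w=[0.5, -0.25], b=0.0)
print(a)  # 0.5
```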

A neural network organizes neurons into [layers](/wiki/layer). Writing the activations of layer ℓ as a vector a<sup>(ℓ)</sup>, a fully-connected layer applies an affine transformation followed by an elementwise nonlinearity:

**a<sup>(ℓ)</sup> = φ<sup>(ℓ)</sup>( W<sup>(ℓ)</sup> a<sup>(ℓ−1)</sup> + b<sup>(ℓ)</sup> )**

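with weight matrix W<sup>(ℓ)</sup> and bias vector b<sup>(ℓ)</sup>. Taking a<sup>(0)</sup> = x as the input and stacking L such layers gives the network output as the final activation:

**f<sub>θ</sub>(x) = a<sup>(L)</sup>,  where a<sup>(ℓ)</sup> = φ<sup>(ℓ)</sup>( W<sup>(ℓ)</sup> a<sup>(ℓ−1)</sup> + b<sup>(ℓ)</sup> ) for ℓ = 1, …, L**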
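The layer recursion can be sketched in a few lines of plain Python. The two-layer network below, its weights, and the choice of ReLU followed by an identity output are all made up for illustration; a real implementation would use a tensor library rather than nested lists.

```python
def relu(v):
    return [max(0.0, t) for t in v]

def identity(v):
    return list(v)

def dense(W, b, a_prev, phi):
    """One fully-connected layer: phi(W a_prev + b), with W stored as a list of rows."""
    z = [sum(w_ij * a_j for w_ij, a_j in zip(row, a_prev)) + b_i
         for row, b_i in zip(W, b)]
    return phi(z)

def forward(layers, x):
    """Apply the layers in order, starting from a^(0) = x."""
    a = x
    for W, b, phi in layers:
        a = dense(W, b, a, phi)
    return a

layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0], relu),  # hidden layer: 2 inputs -> 2 units
    ([[2.0, 1.0]], [0.5], identity),                 # output layer: 2 units -> 1 output
]

# Hidden: relu([3 - 1, 1.5 + 0.5 - 1]) = [2.0, 1.0]; output: 2*2 + 1*1 + 0.5 = 5.5
y = forward(layers, [3.0, 1.0])
print(y)  # [5.5]
```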

The layers of a network play three roles:

- **Input layer:** receives the raw data (pixel values, token embeddings, sensor readings). Input nodes typically perform no computation themselves.
- **Hidden layers:** apply weights, biases, and activations to compute increasingly abstract internal representations. Networks with two or more hidden layers are usually called [deep neural networks](/wiki/deep_neural_network).
- **Output layer:** produces the final prediction (a class probability, a regression target, or a token logit), whose dimensionality depends on the task.

For non-feedforward networks (recurrent, graph, attention-based), the same building blocks reappear but the connectivity pattern differs.

## Training

Training a neural network is the process of choosing parameters θ to make f<sub>θ</sub> agree with a dataset. The standard recipe is **empirical risk minimization** by **stochastic gradient descent on a differentiable loss**, with gradients computed via [backpropagation](/wiki/backpropagation).

### Forward pass

Given a training example x, the network is evaluated layer-by-layer to produce a prediction ŷ = f<sub>θ</sub>(x). This is the *forward pass*. Because each layer is a matrix-vector product followed by an elementwise nonlinearity, the forward pass is a sequence of dense linear algebra operations, which is why GPUs and TPUs, hardware specialized for parallel matrix arithmetic, are so effective.

### Loss

The prediction ŷ is compared against the target y using a [loss function](/wiki/loss_function) L(ŷ, y). Common choices are mean squared error for regression, cross-entropy for classification, and contrastive or sequence-level losses for self-supervised and generative tasks. The training objective is the expected loss over the training distribution, approximated as the average loss over a mini-batch of examples.

### Backward pass

To improve θ, the network computes ∇<sub>θ</sub>L: the gradient of the loss with respect to every parameter. [Backpropagation](/wiki/backpropagation) is the efficient algorithm for doing so. It applies the chain rule of calculus to the computation graph of the forward pass, starting from the output and propagating partial derivatives backward layer-by-layer. Backpropagation is the dominant training algorithm for essentially all modern neural networks[^9].
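To make the chain-rule bookkeeping concrete, here is a small sketch for a hypothetical scalar network with one tanh hidden unit and a squared-error loss. The architecture and parameter values are illustrative assumptions; the analytic gradients are checked against central finite differences.

```python
import math

def forward(params, x, y):
    """Forward pass of y_hat = w2 * tanh(w1*x + b1) + b2 with squared-error loss."""
    w1, b1, w2, b2 = params
    z = w1 * x + b1
    h = math.tanh(z)
    y_hat = w2 * h + b2
    loss = (y_hat - y) ** 2
    return loss, (h, y_hat)

def backward(params, x, y):
    """Backward pass: apply the chain rule from the loss back to each parameter."""
    w1, b1, w2, b2 = params
    _, (h, y_hat) = forward(params, x, y)
    d_yhat = 2.0 * (y_hat - y)       # dL/dy_hat
    d_w2 = d_yhat * h                # y_hat = w2*h + b2
    d_b2 = d_yhat
    d_h = d_yhat * w2
    d_z = d_h * (1.0 - h * h)        # tanh'(z) = 1 - tanh(z)^2
    d_w1 = d_z * x                   # z = w1*x + b1
    d_b1 = d_z
    return [d_w1, d_b1, d_w2, d_b2]

def numeric_grad(params, x, y, eps=1e-6):
    """Central finite differences, used only to check the analytic gradient."""
    grads = []
    for i in range(len(params)):
        up = list(params)
        dn = list(params)
        up[i] += eps
        dn[i] -= eps
        grads.append((forward(up, x, y)[0] - forward(dn, x, y)[0]) / (2 * eps))
    return grads

params = [0.3, -0.1, 0.8, 0.2]
analytic = backward(params, x=1.5, y=1.0)
numeric = numeric_grad(params, x=1.5, y=1.0)
max_gap = max(abs(a - n) for a, n in zip(analytic, numeric))
print(max_gap)  # a very small number if the two gradients agree
```

The backward pass reuses the intermediate values from the forward pass (`h`, `y_hat`), which is the source of backpropagation's efficiency: one backward sweep costs about as much as one forward sweep, regardless of the number of parameters.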

### Optimization

[Gradient descent](/wiki/gradient_descent) uses the gradient to update parameters:

**θ ← θ − η · ∇<sub>θ</sub>L**

where η is the *learning rate*. In practice, the gradient is estimated from a mini-batch of typically 32-4,096 examples; this is *mini-batch stochastic gradient descent*. Modern training almost always uses momentum-based and adaptive variants. The most popular [optimizers](/wiki/optimizer) are SGD with momentum, [Adam](/wiki/adam_optimizer), and [AdamW](/wiki/adamw), the last being the de facto standard for transformer training.
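The update rule can be illustrated with plain mini-batch SGD on a toy problem. The sketch below fits a line to noiseless synthetic data; the dataset, learning rate, batch size, and step count are arbitrary illustrative choices, and momentum and adaptive optimizers are deliberately left out.

```python
import random

random.seed(0)

# Synthetic, noiseless data from y = 2x + 1 (an assumption made for the demo).
xs = [i / 10 for i in range(-20, 21)]
data = [(x, 2.0 * x + 1.0) for x in xs]

w, b = 0.0, 0.0      # parameters theta = (w, b)
eta = 0.1            # learning rate
batch_size = 8

for _ in range(500):
    batch = random.sample(data, batch_size)
    # Gradient of the mean squared error over the mini-batch.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / batch_size
    grad_b = sum(2 * (w * x + b - y) for x, y in batch) / batch_size
    # theta <- theta - eta * gradient
    w -= eta * grad_w
    b -= eta * grad_b

print(round(w, 3), round(b, 3))  # approaches 2.0 and 1.0
```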

### Regularization and stability

Large networks overfit easily, so training combines several techniques to improve generalization and stabilize optimization:

- [**Dropout**](/wiki/dropout) randomly zeroes a fraction of activations during training to discourage co-adaptation.
- **L2 weight decay** penalizes the squared norm of the weights.
- [**Batch normalization**](/wiki/batch_normalization) and layer normalization standardize intermediate activations, accelerating training.
- [**Residual connections**](/wiki/residual_connection) (ResNet, transformer) provide identity shortcuts that mitigate vanishing gradients in very deep stacks.
- **Early stopping**, **data augmentation**, and **learning-rate scheduling** (warmup, cosine decay) are standard.

These techniques, together with careful initialization (He, Xavier/Glorot) and the use of ReLU-family activations, are what make networks of hundreds or thousands of layers and billions of parameters trainable in practice.
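Two of the techniques listed above are simple enough to sketch directly. The dropout function below uses the common "inverted" convention (survivors are rescaled by 1/(1−p) during training so that nothing changes at evaluation time), which is an assumption beyond what the list states; the weight-decay step treats the penalty as (λ/2)·‖w‖², whose gradient is λ·w. Function names are hypothetical.

```python
import random

def dropout(activations, p, training, rng):
    """Inverted dropout: zero each activation with probability p during training
    and rescale the survivors by 1/(1-p); act as the identity at evaluation."""
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

def sgd_step_with_weight_decay(w, grad, eta, lam):
    """One gradient step on loss + (lam/2)*||w||^2; the penalty adds lam*w to the gradient."""
    return [wi - eta * (gi + lam * wi) for wi, gi in zip(w, grad)]

rng = random.Random(0)
acts = [1.0] * 10000
dropped = dropout(acts, p=0.5, training=True, rng=rng)
mean_train = sum(dropped) / len(dropped)   # close to 1.0 in expectation
eval_out = dropout(acts, p=0.5, training=False, rng=rng)

# With a zero data gradient, weight decay alone shrinks each weight toward zero.
w_new = sgd_step_with_weight_decay([1.0, -2.0], [0.0, 0.0], eta=0.1, lam=0.5)
```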

## What are the types of neural networks? Architectures

Neural networks come in many architectural families; each tailors connectivity and parameterization to a class of data. This section is a high-level tour with links to dedicated articles.

### Multilayer perceptrons (MLPs)

The classical fully-connected feedforward network, historically called a [multilayer perceptron](/wiki/multilayer_perceptron) or [feedforward neural network](/wiki/feedforward_neural_network_ffn), applies a stack of dense linear layers and nonlinear activations. MLPs remain ubiquitous as components inside larger models: the position-wise "MLP block" in a transformer is essentially a two-layer MLP applied independently to each token.

### Convolutional neural networks (CNNs)

[Convolutional neural networks](/wiki/convolutional_neural_network) introduce weight sharing and local receptive fields well-suited to spatial data such as images, video, and audio spectrograms. Key operations are convolution (feature detection), pooling (spatial downsampling), and fully connected classification heads. Landmark architectures include [LeNet](/wiki/lenet) (1989/1998), [AlexNet](/wiki/alexnet) (2012), VGG (2014), GoogLeNet/Inception (2014), and [ResNet](/wiki/resnet) (2015)[^11][^19][^21].

### Recurrent neural networks (RNNs)

[Recurrent neural networks](/wiki/recurrent_neural_network) maintain a hidden state that is updated as a sequence is consumed, providing a natural model for time series, text, and speech. Vanilla RNNs suffer from vanishing and exploding gradients, motivating gated variants: [LSTM](/wiki/lstm) (Hochreiter and Schmidhuber, 1997) and the simpler GRU (Cho et al., 2014). RNNs dominated machine translation and speech recognition from roughly 2014 until they were largely displaced by transformers after 2017[^15].

### Transformers

The [transformer](/wiki/transformer) has been the dominant architecture for sequence modeling since 2017. Its core ingredient is multi-head [self-attention](/wiki/self_attention): each position in the sequence computes a weighted combination of the representations at all positions, with the weights derived from learned query and key projections. Transformers are highly parallelizable, scale gracefully to enormous parameter counts, and now power large language models, image models ([Vision Transformer](/wiki/vision_transformer_vit)), speech models ([Whisper](/wiki/whisper)), and protein models (AlphaFold 2/3)[^22].
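A single attention head can be sketched in plain Python as follows. This assumes the scaled dot-product form (scores divided by the square root of the key dimension, then a softmax), and it omits masking, multiple heads, and the output projection. The identity projection matrices are placeholders chosen for illustration, not learned values.

```python
import math

def softmax(v):
    m = max(v)                       # subtract the max for numerical stability
    exps = [math.exp(t - m) for t in v]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(A, B):
    """Multiply matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over the rows of X."""
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)          # attention weights over all positions
        weights.append(w)
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V))
                        for c in range(len(V[0]))])
    return outputs, weights

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three positions, two features each
I2 = [[1.0, 0.0], [0.0, 1.0]]              # placeholder projections
out, weights = self_attention(X, I2, I2, I2)
```

Each row of `weights` sums to 1, and every output row is a convex combination of the value vectors, which includes the position's own value.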

### Graph neural networks (GNNs)

[Graph neural networks](/wiki/graph_neural_network) generalize convolutions to arbitrary graphs by iteratively passing messages between neighboring nodes. They are central to molecular property prediction, drug discovery, recommendation systems, traffic forecasting (Google's road ETA models), and combinatorial optimization.

### State-space models

[State-space models](/wiki/state_space_model) (SSMs) such as S4 and [Mamba](/wiki/mamba) are a more recent family that models long sequences via linear recurrences with structured kernels, achieving subquadratic scaling in sequence length. SSMs are increasingly used as transformer alternatives or complements for long-context modeling.

### Autoencoders and generative models

[Autoencoders](/wiki/autoencoder) (including [variational autoencoders](/wiki/variational_autoencoder)) learn compressed latent representations by reconstructing their inputs. Generative families include [GANs](/wiki/gan), normalizing flows, autoregressive models, and (most recently) diffusion models, which dominate state-of-the-art image and video synthesis.

### Mixture of experts

A [mixture-of-experts](/wiki/mixture_of_experts) (MoE) layer routes each input token to a small subset of "expert" sub-networks via a learned gating function. MoE allows total parameter count to grow much faster than per-token compute and underlies many recent frontier models (Mixtral, GPT-4-class systems, DeepSeek-V3).
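The routing idea can be sketched for a single token. The version below selects the top-k experts by gate logit and renormalizes the gate with a softmax over just those k, which is one common convention rather than the only one. The experts here are toy scaling functions and `gate_W` is a made-up gating matrix, both hypothetical.

```python
import math

def softmax(v):
    m = max(v)
    exps = [math.exp(t - m) for t in v]
    total = sum(exps)
    return [e / total for e in exps]

def moe_layer(x, gate_W, experts, k=2):
    """Route one token x to its top-k experts and mix their outputs by gate weight."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_W]
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    gates = softmax([logits[i] for i in top])
    y = [0.0] * len(x)               # assumes experts preserve the input dimension
    for g, i in zip(gates, top):
        expert_out = experts[i](x)   # only the selected experts are evaluated
        y = [yi + g * ei for yi, ei in zip(y, expert_out)]
    return y, top, gates

calls = []                           # records which experts actually ran

def make_expert(idx, scale):
    def expert(x):
        calls.append(idx)
        return [scale * t for t in x]
    return expert

experts = [make_expert(i, s) for i, s in enumerate((1.0, 2.0, 3.0, 4.0))]
gate_W = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
y, top, gates = moe_layer([2.0, 1.0], gate_W, experts, k=2)
print(top, calls)  # [0, 1] [0, 1]
```

Only two of the four experts run for this token, which is how total parameter count can grow without a matching increase in per-token compute.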

| Architecture | Best suited for | Key mechanism | Examples |
|---|---|---|---|
| [MLP](/wiki/multilayer_perceptron) | Tabular data, components in larger nets | Fully-connected layers | Standard MLP |
| [CNN](/wiki/convolutional_neural_network) | Images, audio, video | Convolutional filters, pooling | [AlexNet](/wiki/alexnet), [ResNet](/wiki/resnet), Inception |
| [RNN](/wiki/recurrent_neural_network) / [LSTM](/wiki/lstm) | Sequences, time series | Recurrent hidden state with gating | LSTM, GRU |
| [Transformer](/wiki/transformer) | Text, sequences, multimodal | Multi-head [self-attention](/wiki/self_attention) | GPT-4, Claude, [BERT](/wiki/bert), [ViT](/wiki/vision_transformer_vit) |
| [Graph NN](/wiki/graph_neural_network) | Graphs, molecules, social networks | Message passing | GCN, GAT, GraphSAGE |
| [State-space model](/wiki/state_space_model) / [Mamba](/wiki/mamba) | Very long sequences | Structured linear recurrence | S4, Mamba |
| [Autoencoder](/wiki/autoencoder) | Compression, representation learning | Encoder-decoder bottleneck | [VAE](/wiki/variational_autoencoder), denoising AE |
| [GAN](/wiki/gan) | Image synthesis | Generator vs. discriminator | StyleGAN, BigGAN |
| [Mixture of experts](/wiki/mixture_of_experts) | Scaling parameters cheaply | Sparse routing | Switch Transformer, Mixtral |

## Activation functions

Activation functions inject the nonlinearity that lets stacked layers represent more than a single affine map. The most widely used today are[^27]:

| Activation | Definition | Notes |
|---|---|---|
| [Sigmoid](/wiki/sigmoid) | σ(x) = 1 / (1 + e<sup>−x</sup>) | Bounded (0,1); historically dominant; vanishing gradients limit use in deep nets. Common at output of binary classifiers. |
| [Tanh](/wiki/tanh) | (e<sup>x</sup> − e<sup>−x</sup>) / (e<sup>x</sup> + e<sup>−x</sup>) | Bounded (−1,1); standard in early RNNs and LSTMs. |
| [ReLU](/wiki/relu) | max(0, x) | Default hidden activation since AlexNet (2012); simple, sparse, and accelerates convergence[^19]. |
| Leaky ReLU / PReLU | max(αx, x), with 0 < α < 1 | Avoids "dead neuron" problem of vanilla ReLU; α is a fixed small constant in Leaky ReLU and a learned parameter in PReLU. |
| [GELU](/wiki/gelu) | x · Φ(x) | Smooth ReLU variant; default in BERT, GPT, and many modern transformers[^28]. |
| [SwiGLU](/wiki/swiglu) | Swish(xW) ⊙ (xV) | Gated activation used in PaLM, LLaMA, and many recent LLMs[^29]. |
| [Softmax](/wiki/softmax) | e<sup>x<sub>i</sub></sup> / Σ<sub>j</sub> e<sup>x<sub>j</sub></sup> | Converts logits to a probability distribution at the output of multiclass classifiers. |

The shift from sigmoid/tanh to ReLU around 2010-2012, and then to GELU and gated variants such as SwiGLU around 2018-2020, was one of several inconspicuous changes that made deep networks reliably trainable.
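Most of the definitions in the table translate directly into code. The scalar sketch below covers sigmoid, ReLU, Leaky ReLU, GELU (using the exact Gaussian CDF via `math.erf`), and softmax; tanh is available as `math.tanh`. The default α of 0.01 is a conventional illustrative value, not something the table specifies.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    """max(alpha*x, x); assumes 0 < alpha < 1."""
    return max(alpha * x, x)

def gelu(x):
    """x * Phi(x), where Phi is the standard normal CDF."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def softmax(logits):
    m = max(logits)                  # subtract the max for numerical stability
    exps = [math.exp(t - m) for t in logits]
    total = sum(exps)
    return [e / total for e in exps]

print(sigmoid(0.0))            # 0.5
print(relu(-3.0), relu(2.0))   # 0.0 2.0
print(gelu(0.0))               # 0.0
```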

## Loss functions

The choice of [loss function](/wiki/loss_function) encodes what "wrong" means for the task:

- **Mean squared error (MSE):** L = (1/n) Σ<sub>i</sub> (ŷ<sub>i</sub> − y<sub>i</sub>)<sup>2</sup>, averaged over the n examples. Standard for regression.
- **Mean absolute error (MAE / L1):** more robust to outliers than MSE.
- **Binary cross-entropy:** for two-class problems with a sigmoid output.
- **Categorical cross-entropy:** the standard loss for multiclass classification and the per-token loss in language modeling.
- **Contrastive and triplet losses:** for representation learning and metric learning (e.g., CLIP, SimCLR).
- **Reinforcement-learning losses** such as policy-gradient and PPO losses, and **RLHF / DPO objectives** that combine a reward model with KL constraints, are used to align large language models to human preferences.
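The first four losses in the list can be sketched in a few lines. The function names are hypothetical, the clipping constant `eps` is an illustrative safeguard against log(0), and the categorical cross-entropy takes raw logits and uses the log-sum-exp trick for stability.

```python
import math

def mse(y_hat, y):
    return sum((p - t) ** 2 for p, t in zip(y_hat, y)) / len(y)

def mae(y_hat, y):
    return sum(abs(p - t) for p, t in zip(y_hat, y)) / len(y)

def binary_cross_entropy(p, t, eps=1e-12):
    """Loss for a predicted probability p (sigmoid output) and a target t in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)
    return -(t * math.log(p) + (1 - t) * math.log(1.0 - p))

def categorical_cross_entropy(logits, target_index):
    """Negative log-softmax of the target class, computed via log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]

print(mse([1.0, 2.0], [1.0, 4.0]))   # 2.0
print(mae([1.0, 2.0], [1.0, 4.0]))   # 1.0
# Uniform logits over 4 classes give a loss of log(4) for any target.
uniform_loss = categorical_cross_entropy([0.0, 0.0, 0.0, 0.0], 2)
```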

## Universal approximation theorem

The [universal approximation theorem](/wiki/universal_approximation_theorem) is the foundational expressivity result for feedforward neural networks. George Cybenko proved in 1989 that finite linear combinations of sigmoidal activation functions are dense in the space of continuous functions on the unit cube, meaning that a single-hidden-layer feedforward network can approximate any continuous function on a compact domain to any desired accuracy, given enough hidden units[^13]. In 1991, Kurt Hornik extended the result to a wide class of activation functions (any bounded, nonconstant activation) and clarified that the universality is a property of the multi-layer architecture rather than of sigmoid in particular[^14].

The theorem is purely an *existence* result. It does not say how many neurons are needed, how to find the right weights, or whether [gradient descent](/wiki/gradient_descent) can in fact reach them. Subsequent work has shown that, for certain classes of functions, *depth* allows a network to represent the same function with exponentially fewer parameters than a shallow network would need, which is a key motivation for the modern emphasis on deep models.

## Modern training

Training contemporary neural networks is a large-scale systems problem.

### Hardware

Modern training runs on parallel matrix accelerators:

- **[Graphics processing units (GPUs)](/wiki/gpu)**, originally designed for 3D rendering, became the workhorse of deep learning after NVIDIA's [CUDA](/wiki/cuda) platform (2007) opened them up to general-purpose computing. AlexNet was trained on two NVIDIA GTX 580 GPUs in 2012; frontier 2025 systems use tens of thousands of NVIDIA H100, H200, GB200, or AMD Instinct accelerators[^19].
- **[Tensor processing units (TPUs)](/wiki/tpu)** are Google's custom ASICs designed for the dense matrix multiplications and reduced-precision arithmetic central to deep learning.
- Other accelerators include AWS Trainium/Inferentia, Cerebras wafer-scale engines, and Graphcore IPUs.

### Distributed training

[Distributed training](/wiki/distributed_training) spreads the work of training a single model across many accelerators, either by replicating the model or by partitioning it. Standard parallelism strategies include:

- **Data parallelism:** every accelerator holds a full copy of the model and processes a different mini-batch; gradients are synchronized (e.g., via all-reduce).
- **Tensor / model parallelism:** large weight matrices are sharded across devices.
- **Pipeline parallelism:** different layers are placed on different devices and operate as a pipeline.
- **Expert parallelism:** mixture-of-experts routes different tokens to experts placed on different devices.
- **Sequence and context parallelism:** very long sequences are split across devices for attention.

Frameworks such as Megatron-LM, DeepSpeed, FSDP (Fully Sharded Data Parallel), and JAX's `pjit`/`shard_map` orchestrate these strategies.
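
Data parallelism, the simplest of the strategies listed above, can be simulated in a few lines. The sketch below is a single-process stand-in, not a real distributed program: `all_reduce_mean` is a hypothetical helper that imitates an all-reduce collective, and the one-parameter linear model, toy batch, learning rate, and step count are illustrative assumptions.

```python
def local_gradient(w, shard):
    """Mean gradient of 0.5 * (w * x - y)^2 over one worker's shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values):
    """Stand-in for an all-reduce: every worker receives the same mean."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


# Toy batch of (x, y) pairs, split into equal-size shards, one per worker.
batch = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
n_workers = 2
shards = [batch[i::n_workers] for i in range(n_workers)]

replicas = [0.0] * n_workers   # every worker holds a full copy of the model
learning_rate = 0.05

for _ in range(200):
    # Each worker computes a gradient on its own shard only.
    grads = [local_gradient(w, shard) for w, shard in zip(replicas, shards)]
    # Synchronize, then apply the identical update on every replica.
    synced = all_reduce_mean(grads)
    replicas = [w - learning_rate * g for w, g in zip(replicas, synced)]
```

Because the shards have equal size, the averaged gradient equals the full-batch gradient, so the replicas stay identical and behave like a single model trained on the whole batch.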

### Mixed precision

Modern training uses [mixed precision](/wiki/mixed_precision) where the hardware supports it, storing weights in 32-bit or 16-bit precision and performing matrix multiplications in lower-precision formats (FP16, BF16, FP8) to dramatically increase throughput. Loss scaling and stochastic rounding are used to preserve numerical stability. Mixed precision, together with sparse mixture-of-experts layers and KV-cache optimizations at inference time, accounts for much of the efficiency improvement in training and serving large models over the past five years.
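
The reason loss scaling matters can be shown with the standard library alone. The sketch below rounds values to IEEE 754 half precision using the `struct` module's `"e"` format. The gradient value and the scale factor are hypothetical, and Python's double-precision floats stand in for the higher-precision side of a real mixed-precision setup.

```python
import struct


def to_fp16(x):
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("e", struct.pack("e", x))[0]


true_grad = 1e-8       # hypothetical small gradient value
loss_scale = 1024.0    # hypothetical scale; real systems often adjust it dynamically

# Without scaling, the value is below the smallest FP16 subnormal and is lost.
unscaled = to_fp16(true_grad)

# Scaling the loss scales every gradient, lifting it into FP16's range.
scaled = to_fp16(true_grad * loss_scale)

# Unscaling happens in higher precision before the weight update.
recovered = scaled / loss_scale
```

Here `unscaled` is exactly zero, while `recovered` is within about 1% of the original value, which is the effect loss scaling is meant to achieve.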

### Scaling laws

[Scaling laws](/wiki/scaling_laws), first systematically studied by Kaplan et al. (OpenAI, 2020) and refined by the [Chinchilla](/wiki/chinchilla) work of Hoffmann et al. (DeepMind, 2022), describe how validation loss falls smoothly as a power law in compute, dataset size, and parameters. Chinchilla in particular showed that for a fixed compute budget, model size and training tokens should be scaled in roughly equal proportions, implying that many earlier large models were undertrained[^30][^31].
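
A rough sense of what "equal proportions" implies can be had from a back-of-the-envelope calculation. The sketch below relies on two approximations that are not stated in this article and should be treated as assumptions: that training compute is about 6 FLOPs per parameter per token, and that the compute-optimal ratio is about 20 tokens per parameter (a commonly quoted rule of thumb associated with the Chinchilla results). The function name is hypothetical.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6.0   # assumption: compute C is roughly 6 * N * D
TOKENS_PER_PARAM = 20.0       # assumption: rule-of-thumb optimal ratio D / N


def compute_optimal(compute_flops):
    """Split a FLOP budget into (parameters, tokens) under the assumptions above."""
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens


small = compute_optimal(1e21)
large = compute_optimal(1e23)   # 100x the compute of `small`
```

Under these assumptions, a 100-fold increase in compute is split evenly: roughly 10 times more parameters and 10 times more training tokens.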

## Frameworks

Most modern neural networks are built using one of a handful of high-level frameworks:

- [**PyTorch**](/wiki/pytorch) (originally Meta AI, now Linux Foundation): the dominant research framework since roughly 2018, with eager-mode execution, dynamic graphs, and a large ecosystem (Hugging Face Transformers, PyTorch Lightning, torch.compile).
- [**TensorFlow**](/wiki/tensorflow) (Google): the dominant framework from 2015-2018; still widely used in production, especially through Keras and TFX.
- [**JAX**](/wiki/jax) (Google): a functional, transformation-based framework built on XLA, increasingly popular for large-scale research at DeepMind, Anthropic, and academic labs.
- Other tools include MLX (Apple), MindSpore (Huawei), and ONNX as an interchange format.

All three major frameworks share core abstractions: [tensors](/wiki/tensor) as multi-dimensional arrays, automatic differentiation, and lowering to optimized accelerator code through vendor libraries and compilers (cuDNN, XLA, Triton).
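
Automatic differentiation, the abstraction these frameworks share, can be sketched for scalars in plain Python. This is a toy reverse-mode implementation written for illustration; the `Value` class and its methods are hypothetical and do not correspond to the API of PyTorch, TensorFlow, or JAX.

```python
class Value:
    """A scalar that records the operations applied to it."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # inputs that produced this value
        self._local_grads = local_grads  # d(this) / d(parent) for each parent

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def relu(self):
        return Value(max(0.0, self.data), (self,), (1.0 if self.data > 0 else 0.0,))

    def backward(self):
        """Apply the chain rule from this value back to every input."""
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)       # parents are appended before children

        visit(self)
        self.grad = 1.0
        for node in reversed(order):     # children are processed before parents
            for parent, local in zip(node._parents, node._local_grads):
                parent.grad += local * node.grad


w, x, b = Value(2.0), Value(3.0), Value(-1.0)
y = (w * x + b).relu()   # relu(2 * 3 - 1) = 5
y.backward()             # dy/dw = 3, dy/dx = 2, dy/db = 1
```

Production frameworks apply the same idea to whole tensors and fuse the recorded operations into accelerator kernels, but the bookkeeping is conceptually the same.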

## What is a neural network used for? Notable applications

### Computer vision

CNNs and Vision Transformers power image classification, object detection, semantic and instance segmentation, optical character recognition, satellite and medical imaging analysis, and the perception stacks of autonomous vehicles. Architectures such as [AlexNet](/wiki/alexnet), [ResNet](/wiki/resnet), Inception, YOLO, EfficientNet, ViT, and Segment Anything are all neural networks.

### Natural language processing

Transformer-based language models dominate machine translation, summarization, question answering, code generation, sentiment analysis, and conversational AI. Large language models such as Claude, [GPT](/wiki/gpt)-4, Gemini, and Llama 3/4 perform extended chains of reasoning, function calling, and multi-step agentic behavior.

### Speech and audio

Neural networks underpin automatic speech recognition (e.g., [Whisper](/wiki/whisper)), text-to-speech (WaveNet, VALL-E), music generation, and audio classification.

### Game playing and decision-making

DeepMind's [AlphaGo](/wiki/alphago) defeated world champion Lee Sedol at Go in 2016, combining deep CNNs, value/policy networks, and Monte Carlo tree search. AlphaGo Zero (2017) and MuZero (2019) generalized this approach to self-play and model-based [reinforcement learning](/wiki/reinforcement_learning).

### Scientific computing

[AlphaFold](/wiki/alphafold) 2 (2020) predicted protein 3D structures at near-experimental accuracy. The 2024 Nobel Prize in Chemistry recognized this line of work: half went jointly to Demis Hassabis and John Jumper for protein structure prediction, and half to David Baker for computational protein design[^32]. Neural networks are now used in weather forecasting (GraphCast, GenCast), materials discovery, particle physics, fluid dynamics, and quantum chemistry. Physics-informed neural networks (PINNs) embed known physical laws as soft constraints.

### Healthcare and biology

Applications include radiology and pathology image analysis, retinal disease screening, electronic health record modeling, drug discovery, and single-cell genomics.

## What is the difference between a neural network and deep learning?

A neural network is the model: a specific class of function made of layers of artificial neurons connected by weights. [Deep learning](/wiki/deep_learning) is the broader field and methodology of training neural networks that have many layers (hence "deep") on large datasets, including the architectures, optimization methods, regularization tricks, and hardware practices that make such training work. Put simply, every deep learning system is built on neural networks, but the term "deep learning" emphasizes depth and scale: a single-layer [perceptron](/wiki/perceptron) is a neural network but is not usually called deep learning, whereas a 100-layer [ResNet](/wiki/resnet) or a billion-parameter [transformer](/wiki/transformer) is. Neural networks are also one model class within the still-broader field of [machine learning](/wiki/machine_learning), which also includes non-neural methods such as decision trees, [support vector machines](/wiki/support_vector_machine), and linear regression. The 2015 *Nature* review by LeCun, Bengio, and Hinton defines deep learning as methods that "allow computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction."[^2]

## Theoretical understanding

The empirical success of neural networks has outpaced theoretical understanding, but several lines of work have made progress:

- **Neural tangent kernel (NTK)** theory (Jacot, Gabriel, and Hongler, 2018) shows that in an appropriate infinite-width limit, training a neural network by gradient descent is equivalent to kernel regression with a particular fixed kernel. NTK provides a tractable model of deep network training and generalization in the wide-network regime[^33].
- [**Double descent**](/wiki/double_descent) (Belkin, Hsu, Ma, and Mandal, 2019) describes the empirical phenomenon that test error, plotted as a function of model size, exhibits a second decrease in the heavily overparameterized regime, which the classical U-shaped bias-variance curve does not predict[^34].
- **Lottery ticket hypothesis** (Frankle and Carbin, 2019) proposes that large networks contain sparse sub-networks ("winning tickets") that, when trained in isolation from their original initialization, achieve performance comparable to the full network.
- **Mechanistic interpretability**, championed by groups at Anthropic, OpenAI, Google DeepMind, and academic labs, aims to reverse-engineer the algorithms that trained networks have learned to implement.

Open theoretical questions include why overparameterized networks generalize, how representations form during training, whether emergent capabilities reflect genuine phase transitions, and what, if any, fundamental limits constrain scaling.

## Limitations

Despite their success, neural networks have well-known limitations:

- **Data hunger.** Training large models from scratch typically requires enormous labeled or web-scale datasets, although transfer learning, self-supervised learning, and instruction tuning have substantially reduced per-task data requirements.
- **Compute cost and energy.** Training frontier models can require months of compute on tens of thousands of accelerators and energy on the order of gigawatt-hours.
- **Interpretability.** Modern networks are largely opaque to human inspection. Saliency maps, attention visualizations, probing classifiers, and circuit-level mechanistic interpretability address this only partially.
- **Adversarial fragility.** Small, carefully crafted input perturbations can cause confident misclassifications, an issue with serious security implications.
- **Hallucination and reliability.** Large language models can produce fluent but factually wrong outputs.
- **Bias and fairness.** Networks readily absorb and amplify biases in their training data, raising concerns in high-stakes applications such as hiring, lending, and policing.
- **Catastrophic forgetting.** Networks trained sequentially on multiple tasks tend to lose performance on earlier tasks, which motivates continual-learning research.
- **Biological implausibility.** Backpropagation has no known direct biological analog; the brain is unlikely to perform gradient descent on a global loss, and biologically plausible alternatives remain an open research area.
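
The adversarial fragility item above can be made concrete with a toy example. The sketch below applies a sign-of-gradient perturbation (in the style of the fast gradient sign method) to a logistic-regression classifier. The weights, input, and perturbation size are hypothetical, and a linear model is used only because its input gradient has a simple closed form.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, b, x):
    """Probability of class 1 under logistic regression."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def sign(v):
    return (v > 0) - (v < 0)


def sign_gradient_attack(w, b, x, y, eps):
    """Shift every input coordinate by eps in the direction that raises the loss."""
    p = predict(w, b, x)
    # For cross-entropy loss on logistic regression, dL/dx_i = (p - y) * w_i.
    grad_x = [(p - y) * wi for wi in w]
    return [xi + eps * sign(g) for xi, g in zip(x, grad_x)]


# Hypothetical 100-dimensional classifier and input.
w = [0.5] * 100
b = 0.0
x = [0.02] * 100   # logit = 1.0, so the clean input is classified as class 1
y = 1

x_adv = sign_gradient_attack(w, b, x, y, eps=0.05)
clean_prob = predict(w, b, x)
adv_prob = predict(w, b, x_adv)
```

No coordinate moves by more than 0.05, yet the many small shifts add up across 100 dimensions and flip the prediction from class 1 to class 0.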

## See also

- [Register tokens (Vision Transformers Need Registers)](/wiki/vision_registers)
- [DeepNorm / DeepNet](/wiki/deepnorm)
- [H-Net (dynamic chunking)](/wiki/h_net)
- [PEER (Parameter Efficient Expert Retrieval / Mixture of a Million Experts)](/wiki/peer_experts)
- [Absolute Zero Reasoner](/wiki/absolute_zero)
- [Deep learning](/wiki/deep_learning)
- [Machine learning](/wiki/machine_learning)
- [Perceptron](/wiki/perceptron)
- [Multilayer perceptron](/wiki/multilayer_perceptron)
- [Convolutional neural network](/wiki/convolutional_neural_network)
- [Recurrent neural network](/wiki/recurrent_neural_network)
- [Transformer](/wiki/transformer)
- [Backpropagation](/wiki/backpropagation)
- [Gradient descent](/wiki/gradient_descent)
- [Universal approximation theorem](/wiki/universal_approximation_theorem)
- [Geoffrey Hinton](/wiki/geoffrey_hinton)
- [Yann LeCun](/wiki/yann_lecun)
- [Yoshua Bengio](/wiki/yoshua_bengio)
- [AlexNet](/wiki/alexnet)
- [ResNet](/wiki/resnet)
- [GAN](/wiki/gan)
- [LSTM](/wiki/lstm)
- [ReLU](/wiki/relu)
- [GELU](/wiki/gelu)
- [SwiGLU](/wiki/swiglu)
- [Mixture of experts](/wiki/mixture_of_experts)
- [Mamba](/wiki/mamba)
- [State-space model](/wiki/state_space_model)
- [Foundation model](/wiki/foundation_model)
- [Scaling laws](/wiki/scaling_laws)
- [Double descent](/wiki/double_descent)
- [AI winter](/wiki/ai_winter)

## References

[^1]: Goodfellow, I., Bengio, Y., & Courville, A. (2016). *Deep Learning*. MIT Press. https://www.deeplearningbook.org/
[^2]: LeCun, Y., Bengio, Y., & Hinton, G. (2015). "Deep learning." *Nature* 521, 436-444. https://www.nature.com/articles/nature14539
[^3]: McCulloch, W. S., & Pitts, W. (1943). "A logical calculus of the ideas immanent in nervous activity." *Bulletin of Mathematical Biophysics* 5, 115-133. https://link.springer.com/article/10.1007/BF02478259
[^4]: Hebb, D. O. (1949). *The Organization of Behavior: A Neuropsychological Theory*. Wiley. https://pure.mpg.de/rest/items/item_2346268_3/component/file_2346267/content
[^5]: Rosenblatt, F. (1958). "The perceptron: A probabilistic model for information storage and organization in the brain." *Psychological Review* 65(6), 386-408. https://psycnet.apa.org/doi/10.1037/h0042519
[^6]: Widrow, B., & Hoff, M. E. (1960). "Adaptive switching circuits." *1960 IRE WESCON Convention Record*, Part 4, 96-104. https://isl.stanford.edu/~widrow/papers/c1960adaptiveswitching.pdf
[^7]: Minsky, M., & Papert, S. (1969). *Perceptrons: An Introduction to Computational Geometry*. MIT Press. https://mitpress.mit.edu/9780262630221/perceptrons/
[^8]: Werbos, P. J. (1974). *Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences*. PhD thesis, Harvard University. https://www.researchgate.net/publication/35657389
[^9]: Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). "Learning representations by back-propagating errors." *Nature* 323, 533-536. https://www.nature.com/articles/323533a0
[^10]: Rumelhart, D. E., McClelland, J. L., & the PDP Research Group (1986). *Parallel Distributed Processing: Explorations in the Microstructure of Cognition*, Vols. 1-2. MIT Press. https://mitpress.mit.edu/9780262680530/parallel-distributed-processing-volume-1/
[^11]: LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1989). "Backpropagation applied to handwritten zip code recognition." *Neural Computation* 1(4), 541-551. https://direct.mit.edu/neco/article-abstract/1/4/541/5515
[^12]: LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). "Gradient-based learning applied to document recognition." *Proceedings of the IEEE* 86(11), 2278-2324. https://ieeexplore.ieee.org/document/726791
[^13]: Cybenko, G. (1989). "Approximation by superpositions of a sigmoidal function." *Mathematics of Control, Signals, and Systems* 2(4), 303-314. https://link.springer.com/article/10.1007/BF02551274
[^14]: Hornik, K. (1991). "Approximation capabilities of multilayer feedforward networks." *Neural Networks* 4(2), 251-257. https://www.sciencedirect.com/science/article/abs/pii/089360809190009T
[^15]: Hochreiter, S., & Schmidhuber, J. (1997). "Long short-term memory." *Neural Computation* 9(8), 1735-1780. https://direct.mit.edu/neco/article-abstract/9/8/1735/6109
[^16]: Cortes, C., & Vapnik, V. (1995). "Support-vector networks." *Machine Learning* 20, 273-297. https://link.springer.com/article/10.1007/BF00994018
[^17]: Hinton, G. E., Osindero, S., & Teh, Y.-W. (2006). "A fast learning algorithm for deep belief nets." *Neural Computation* 18(7), 1527-1554. https://direct.mit.edu/neco/article/18/7/1527/7065
[^18]: Hinton, G. E., & Salakhutdinov, R. R. (2006). "Reducing the dimensionality of data with neural networks." *Science* 313(5786), 504-507. https://www.science.org/doi/10.1126/science.1127647
[^19]: Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). "ImageNet classification with deep convolutional neural networks." *Advances in Neural Information Processing Systems* 25. https://papers.nips.cc/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html
[^20]: Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). "Generative adversarial nets." *Advances in Neural Information Processing Systems* 27. https://papers.nips.cc/paper/2014/hash/5ca3e9b122f61f8f06494c97b1afccf3-Abstract.html
[^21]: He, K., Zhang, X., Ren, S., & Sun, J. (2016). "Deep residual learning for image recognition." *IEEE CVPR*. https://openaccess.thecvf.com/content_cvpr_2016/html/He_Deep_Residual_Learning_CVPR_2016_paper.html
[^22]: Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). "Attention is all you need." *Advances in Neural Information Processing Systems* 30. https://papers.nips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html
[^23]: ACM (2019). "Fathers of the Deep Learning Revolution Receive ACM A.M. Turing Award." Press release, March 27, 2019. https://www.acm.org/media-center/2019/march/turing-award-2018
[^24]: Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). "BERT: Pre-training of deep bidirectional transformers for language understanding." *NAACL-HLT*. https://aclanthology.org/N19-1423/
[^25]: Brown, T. B., Mann, B., Ryder, N., Subbiah, M., et al. (2020). "Language models are few-shot learners." *Advances in Neural Information Processing Systems* 33. https://papers.nips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html
[^26]: Bommasani, R., et al. (Stanford CRFM) (2021). "On the opportunities and risks of foundation models." arXiv:2108.07258. https://arxiv.org/abs/2108.07258
[^27]: Nwankpa, C., Ijomah, W., Gachagan, A., & Marshall, S. (2018). "Activation functions: Comparison of trends in practice and research for deep learning." arXiv:1811.03378. https://arxiv.org/abs/1811.03378
[^28]: Hendrycks, D., & Gimpel, K. (2016). "Gaussian Error Linear Units (GELUs)." arXiv:1606.08415. https://arxiv.org/abs/1606.08415
[^29]: Shazeer, N. (2020). "GLU variants improve Transformer." arXiv:2002.05202. https://arxiv.org/abs/2002.05202
[^30]: Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., & Amodei, D. (2020). "Scaling laws for neural language models." arXiv:2001.08361. https://arxiv.org/abs/2001.08361
[^31]: Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., et al. (2022). "Training compute-optimal large language models." arXiv:2203.15556. https://arxiv.org/abs/2203.15556
[^32]: Nobel Foundation (2024). "The Nobel Prize in Chemistry 2024." Press release, October 9, 2024. https://www.nobelprize.org/prizes/chemistry/2024/press-release/
[^33]: Jacot, A., Gabriel, F., & Hongler, C. (2018). "Neural tangent kernel: Convergence and generalization in neural networks." *Advances in Neural Information Processing Systems* 31. https://papers.nips.cc/paper/2018/hash/5a4be1fa34e62bb8a6ec6b91d2462f5a-Abstract.html
[^34]: Belkin, M., Hsu, D., Ma, S., & Mandal, S. (2019). "Reconciling modern machine-learning practice and the classical bias-variance trade-off." *PNAS* 116(32), 15849-15854. https://www.pnas.org/doi/10.1073/pnas.1903070116
[^35]: "New Navy Device Learns by Doing." *The New York Times*, July 8, 1958, p. 25. (Report on Frank Rosenblatt's perceptron demonstration.) https://www.nytimes.com/1958/07/08/archives/new-navy-device-learns-by-doing-psychologist-shows-embryo-of.html

