Backpropagation (short for "backward propagation of errors") is the primary algorithm used to train artificial neural networks. It computes the gradient of a loss function with respect to every weight in the network by applying the chain rule of calculus, working backward from the output layer to the input layer. These gradients then guide an optimization method such as gradient descent to adjust the weights and reduce the network's prediction error.
Backpropagation is foundational to virtually all modern deep learning. Every time a neural network learns from data, whether it is a convolutional neural network classifying images, a recurrent neural network processing text, or a transformer generating language, backpropagation is the mechanism computing how each parameter should change.
Imagine you are trying to throw a ball into a bucket. On your first throw, you miss to the left. So you adjust and throw a little more to the right. You miss again, but this time you only missed by a tiny bit. You keep adjusting each throw based on how far off you were, and eventually you get the ball in the bucket.
Backpropagation works the same way for computers. A neural network makes a guess, checks how wrong it was, and then figures out which parts of its "brain" caused the mistake. It traces the error backward through all its layers, telling each layer: "You contributed this much to the mistake, so adjust yourself by this much." After thousands of these cycles, the network gets very good at its task.
The mathematical foundations of backpropagation predate neural networks by several decades. Its development unfolded across multiple independent discoveries, with different researchers contributing pieces of the puzzle before the algorithm was finally popularized in the context of neural network training.
The earliest precursors of backpropagation emerged from the field of optimal control theory. In 1960, Henry J. Kelley published "Gradient Theory of Optimal Flight Paths" in the ARS Journal, which described gradient computation for multi-stage systems using adjoint equations and Green's theorem. Arthur E. Bryson published related work in 1961, using Lagrange multipliers for gradient methods in multi-stage optimization.
In 1962, Stuart Dreyfus provided a simpler derivation based only on the chain rule of differentiation. He used a recursive formulation that dealt explicitly with the optimal control problem in its discrete-stage form. As later researchers would note, the well-known backpropagation process for multilayer perceptrons can be viewed as a simplified version of the Kelley-Bryson gradient formula in classical discrete-time optimal control theory.
The core mathematical technique behind backpropagation, reverse-mode automatic differentiation, was first described by Seppo Linnainmaa in his 1970 master's thesis at the University of Helsinki. Linnainmaa's work presented an efficient method for computing derivatives of composite functions by propagating errors backward through a sequence of operations. He even included working FORTRAN code implementing the method. Although his formulation did not reference neural networks, the underlying algorithm is essentially identical to what is now called backpropagation. He published a more detailed version in 1976.
In 1971, Ostrovskii, Volin, and Borisov published what appears to be the first journal article on this form of automatic differentiation.
Paul Werbos was the first to propose applying reverse-mode differentiation specifically to neural network training. In his 1974 PhD dissertation at Harvard University, titled "Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences," Werbos described how error signals could be propagated backward through a network to compute gradients for all weights simultaneously. His work remained relatively unknown outside a small community for over a decade. In 1982, Werbos published what appears to be the first explicitly neural network-specific application of efficient backpropagation.
Separately, Shun'ichi Amari proposed training deep multilayer perceptrons via stochastic gradient descent as early as 1967 and demonstrated the formation of "internal representations" in a five-layer network.
Several researchers independently described backpropagation algorithms for neural networks in 1985, including Yann LeCun and David Parker. However, the paper that brought backpropagation into mainstream use was "Learning representations by back-propagating errors" by David Rumelhart, Geoffrey Hinton, and Ronald Williams, published in Nature (volume 323, pages 533-536) in October 1986. This paper demonstrated that backpropagation could train multi-layer networks to learn useful internal representations, something that had been considered impractical. As the authors wrote, internal "hidden" units that are not part of the input or output come to represent important features of the task domain. The clear writing and compelling experimental results sparked widespread interest in neural networks and helped launch the second wave of connectionist research in the late 1980s.
| Year | Contributor(s) | Contribution |
|---|---|---|
| 1960 | Henry J. Kelley | Gradient computation for multi-stage systems (optimal control) |
| 1961 | Arthur E. Bryson | Gradient methods for multi-stage optimization |
| 1962 | Stuart Dreyfus | Simplified chain rule derivation of gradient formulas |
| 1967 | Shun'ichi Amari | Stochastic gradient descent for multilayer perceptrons |
| 1970 | Seppo Linnainmaa | Reverse-mode automatic differentiation (master's thesis, with FORTRAN code) |
| 1974 | Paul Werbos | Applied reverse-mode differentiation to neural networks (PhD thesis) |
| 1982 | Paul Werbos | First explicitly NN-specific backpropagation application |
| 1985 | Yann LeCun | Independent derivation of backpropagation for neural networks |
| 1985 | David Parker | Independent derivation published as a technical report |
| 1986 | Rumelhart, Hinton, Williams | Published in Nature; popularized backpropagation widely |
Backpropagation operates in a cycle of four steps, repeated over many iterations and across multiple passes through the training data (epochs), until the network's loss converges to an acceptable level.
During the forward pass, input data flows through the network from the input layer to the output layer. At each neuron, the algorithm computes a weighted sum of the inputs, adds a bias term, and applies an activation function to produce the neuron's output.
For a single neuron in layer l, the computation is:
z^(l) = W^(l) * a^(l-1) + b^(l)
a^(l) = f(z^(l))
where:
- W^(l) is the weight matrix for layer l
- a^(l-1) is the activation vector from the previous layer (with a^(0) being the input x)
- b^(l) is the bias vector
- f is the activation function (such as ReLU, sigmoid, or tanh)

The final layer's output is the network's prediction, which is then compared against the true target values.
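To make this notation concrete, here is a minimal NumPy sketch of the forward pass for a two-layer network; the layer sizes, random weights, and ReLU choice are illustrative assumptions rather than anything specified above.

```python
import numpy as np

def relu(z):
    # Activation function f, applied element-wise
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)

# Assumed sizes for illustration: 4 inputs -> 3 hidden units -> 1 output
W1, b1 = rng.standard_normal((3, 4)), np.zeros(3)
W2, b2 = rng.standard_normal((1, 3)), np.zeros(1)

x = rng.standard_normal(4)      # a^(0), the input
z1 = W1 @ x + b1                # z^(1) = W^(1) a^(0) + b^(1)
a1 = relu(z1)                   # a^(1) = f(z^(1))
z2 = W2 @ a1 + b2               # z^(2) = W^(2) a^(1) + b^(2)
prediction = z2                 # network output (final layer kept linear here)
```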
A loss function quantifies how far the network's predictions are from the true values. Common choices include:
| Loss function | Formula | Typical use case |
|---|---|---|
| Mean squared error (MSE) | L = (1/n) * sum((y_i - y_hat_i)^2) | Regression tasks |
| Cross-entropy | L = -sum(y_i * log(y_hat_i)) | Classification tasks |
| Hinge loss | L = max(0, 1 - y_i * y_hat_i) | Support vector machines, margin-based classifiers |
| Huber loss | Quadratic for small errors, linear for large | Robust regression |
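As a quick numerical illustration of the first two rows, here is a hedged NumPy sketch; the toy targets and predictions are made-up values.

```python
import numpy as np

# Regression (toy values): mean squared error
y_true = np.array([2.0, -1.0, 0.5])
y_pred = np.array([1.8, -0.6, 0.9])
mse = np.mean((y_true - y_pred) ** 2)

# Classification (toy values): cross-entropy with a one-hot target
t = np.array([0.0, 1.0, 0.0])            # one-hot true label
p = np.array([0.1, 0.7, 0.2])            # predicted class probabilities
cross_entropy = -np.sum(t * np.log(p))   # = -log(0.7)
```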
The scalar loss value L is the starting point for the backward pass.
The backward pass computes the gradient of the loss with respect to every weight and bias in the network. It does this by applying the chain rule of calculus, starting from the output layer and working backward through each layer.
For the output layer, the algorithm first computes dL/da^(L), the derivative of the loss with respect to the output activations. It then computes dL/dz^(L) by multiplying with the derivative of the activation function. From there, the gradients with respect to the weights and biases follow directly:
dL/dW^(l) = dL/dz^(l) * (a^(l-1))^T
dL/db^(l) = dL/dz^(l)
The gradient is then propagated to the previous layer:
dL/da^(l-1) = (W^(l))^T * dL/dz^(l)
This process repeats layer by layer until the input layer is reached. The quantity dL/dz^(l) is often denoted as the "error signal" or "delta" for layer l and is defined as:
delta^(l) = dL/dz^(l) = (dL/da^(l)) * f'(z^(l))
where f'(z^(l)) is the derivative of the activation function evaluated at the pre-activation value, and * here denotes element-wise multiplication.
Once all gradients are computed, an optimizer updates each weight in the direction that reduces the loss. In basic gradient descent, the update rule is:
W^(l) = W^(l) - alpha * dL/dW^(l)
b^(l) = b^(l) - alpha * dL/db^(l)
where alpha is the learning rate. In practice, more sophisticated optimizers are used (see the section on optimizers below).
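Putting the forward pass, backward pass, and update together, here is a minimal NumPy sketch of one training step for a tiny two-layer network. The architecture, the 0.5 factor on the squared-error loss (so that dL/dy_hat = y_hat - y), and the learning rate are illustrative assumptions.

```python
import numpy as np

def relu(z):      return np.maximum(0.0, z)
def relu_grad(z): return (z > 0).astype(z.dtype)   # f'(z) for ReLU

rng = np.random.default_rng(1)

# Assumed architecture: 4 -> 3 (ReLU) -> 1 (linear)
W1, b1 = rng.standard_normal((3, 4)), np.zeros(3)
W2, b2 = rng.standard_normal((1, 3)), np.zeros(1)
alpha = 0.01                               # learning rate

x, y = rng.standard_normal(4), np.array([0.5])

# Forward pass
z1 = W1 @ x + b1;  a1 = relu(z1)
z2 = W2 @ a1 + b2; y_hat = z2              # linear output layer

# Loss: L = 0.5 * (y_hat - y)^2, so dL/dy_hat = y_hat - y
dL_dyhat = y_hat - y

# Backward pass: delta^(l) = dL/dz^(l)
delta2 = dL_dyhat                          # output layer is linear, so f'(z^(2)) = 1
dW2 = np.outer(delta2, a1)                 # dL/dW^(2) = delta^(2) (a^(1))^T
db2 = delta2
delta1 = (W2.T @ delta2) * relu_grad(z1)   # (W^(2))^T delta^(2), times f'(z^(1)) element-wise
dW1 = np.outer(delta1, x)                  # dL/dW^(1) = delta^(1) (a^(0))^T
db1 = delta1

# Gradient descent update: W = W - alpha * dL/dW
W2 -= alpha * dW2;  b2 -= alpha * db2
W1 -= alpha * dW1;  b1 -= alpha * db1
```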
The chain rule is the mathematical backbone of backpropagation. For a network with L layers, the gradient of the loss with respect to a weight in layer l involves a product of partial derivatives through all subsequent layers:
dL/dW^(l) = dL/da^(L) * da^(L)/dz^(L) * dz^(L)/da^(L-1) * ... * da^(l)/dz^(l) * dz^(l)/dW^(l)
Each factor in this product corresponds to either the derivative of an activation function or the weight matrix connecting two layers. The key insight is that these intermediate quantities can be computed once during the backward pass and reused, making the algorithm efficient. Without this reuse, computing gradients for each weight independently would require a separate forward pass, making training prohibitively expensive.
For a concrete example, consider a simple three-layer network with layers 1, 2, and 3. The gradient of the loss with respect to a weight w in layer 1 is:
dL/dw = (dL/da3) * (da3/dz3) * (dz3/da2) * (da2/dz2) * (dz2/da1) * (da1/dz1) * (dz1/dw)
The backward pass computes the terms from left to right (output to input), caching intermediate values so that each is computed only once.
One of backpropagation's most important properties is its computational efficiency compared to naive gradient computation methods.
For a fully connected layer with n_in input neurons and n_out output neurons, the forward pass requires O(n_in * n_out) multiplications (a matrix-vector product). The backward pass has the same asymptotic cost, since it involves a matrix-transpose-vector product of the same dimensions. Therefore, one complete forward-backward cycle through a single layer costs O(n_in * n_out).
For a network with L layers, the total cost per training example is the sum of costs across all layers. If the largest layer has n neurons, a rough upper bound for the entire network is O(L * n^2). Across a training set of m examples over E epochs, the total cost becomes O(E * m * L * n^2).
Without backpropagation's reuse of intermediate values, one would need to compute each gradient independently, resulting in a cost that is orders of magnitude higher. This efficiency gain is analogous to how dynamic programming avoids redundant computation by caching subproblem solutions.
The backward pass requires storing all intermediate activations from the forward pass, because each layer's gradient computation depends on its own activations (for computing dL/dz^(l)) and the previous layer's activations (for computing dL/dW^(l)). This means memory usage grows linearly with the number of layers and the size of each layer.
For very deep networks, this memory requirement can become a bottleneck. Gradient checkpointing (discussed below) is a technique that trades additional computation time for reduced memory usage by recomputing some activations during the backward pass rather than storing them.
Backpropagation computes the gradient; a separate optimization algorithm uses that gradient to update weights. The gradient can be computed over different subsets of the training data, giving rise to three main variants.
| Variant | Gradient computed over | Pros | Cons |
|---|---|---|---|
| Batch gradient descent | Entire training set | Stable convergence, exact gradient | Slow for large datasets, high memory |
| Stochastic gradient descent (SGD) | Single training example | Fast updates, can escape local minima | Noisy gradients, high variance |
| Mini-batch gradient descent | Small random subset (e.g. 32-512 examples) | Good balance of speed and stability | Requires tuning batch size |
Mini-batch gradient descent is the most widely used variant in practice. It combines the computational benefits of batch processing on GPUs with the regularizing effect of gradient noise.
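The variants differ only in how much data feeds each gradient computation. A hedged sketch of the mini-batch loop (the dataset, batch size, and epoch count are arbitrary placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
X, Y = rng.standard_normal((1000, 10)), rng.standard_normal((1000, 1))  # toy dataset
batch_size, num_epochs = 64, 5            # illustrative hyperparameters

for epoch in range(num_epochs):
    perm = rng.permutation(len(X))        # reshuffle examples each epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        x_batch, y_batch = X[idx], Y[idx]
        # forward pass, backward pass, and weight update on this mini-batch
        ...
```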
Backpropagation provides the gradient; the optimizer decides how to use it. Over the decades, researchers have developed increasingly sophisticated optimizers that go beyond vanilla gradient descent.
| Optimizer | Key idea | Update rule (simplified) | Reference |
|---|---|---|---|
| SGD with momentum | Accumulates past gradients to accelerate in consistent directions | v = beta * v + grad; W = W - alpha * v | Polyak (1964) |
| Nesterov momentum | Looks ahead before computing gradient | Gradient evaluated at the lookahead point W - alpha * beta * v | Nesterov (1983) |
| Adagrad | Per-parameter learning rates based on historical gradient magnitudes | W = W - alpha*grad / sqrt(sum_of_squared_grads) | Duchi et al. (2011) |
| RMSprop | Adagrad with exponential moving average to prevent learning rate decay | Uses running average of squared gradients | Hinton (unpublished lecture, 2012) |
| Adam | Combines momentum and RMSprop; maintains running means of gradient and squared gradient | Uses bias-corrected first and second moment estimates | Kingma and Ba (2015) |
| AdamW | Adam with decoupled weight decay | Separates L2 regularization from adaptive learning rate | Loshchilov and Hutter (2019) |
Adam and AdamW are the most commonly used optimizers in modern deep learning. SGD with momentum remains popular for training convolutional neural networks on image tasks, where it has been observed to find flatter minima that generalize better.
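For concreteness, here is a minimal NumPy sketch of the Adam update described in the table, following Kingma and Ba (2015); the function name and default hyperparameters are ours, not taken from any particular library.

```python
import numpy as np

def adam_step(w, grad, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single parameter array (simplified sketch)."""
    m = beta1 * m + (1 - beta1) * grad          # first moment: running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment: running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```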
A major practical challenge with backpropagation in deep networks is the vanishing gradient problem and its counterpart, the exploding gradient problem. Both arise from the repeated multiplication of gradient terms across many layers.
When the derivatives of the activation functions are consistently less than 1 (as with sigmoid or tanh in their saturated regions), the gradient shrinks exponentially as it propagates backward through many layers. Layers close to the input end up receiving near-zero gradients, which means their weights barely change during training. Sepp Hochreiter analyzed this problem rigorously in his 1991 diploma thesis at the Technical University of Munich, and it was a primary reason why deep networks were considered impractical to train for many years.
Mathematically, if each layer's Jacobian has spectral norm less than 1, then the product of L such Jacobians decays exponentially:
||dL/dz^(1)|| <= ||dL/dz^(L)|| * product(||J^(l)||, l=2..L)
When each ||J^(l)|| < 1, this product approaches zero as L increases.
Conversely, if the gradient terms are consistently greater than 1, the gradient grows exponentially during backpropagation. This causes extremely large weight updates, leading to numerical instability and divergent training. Exploding gradients are especially common in recurrent neural networks, where the same weight matrix is multiplied at every time step.
Researchers have developed multiple techniques to address these problems:
| Problem | Solution | How it helps |
|---|---|---|
| Vanishing gradients | ReLU activation | Derivative is 1 for positive inputs, preventing gradient shrinkage |
| Vanishing gradients | Batch normalization | Normalizes layer inputs, stabilizing gradient flow |
| Vanishing gradients | Residual connections (skip connections) | Allow gradients to bypass layers via additive shortcuts |
| Vanishing gradients | LSTM / GRU gates | Gating mechanisms regulate gradient flow in recurrent networks |
| Exploding gradients | Gradient clipping | Caps gradient magnitude to a maximum threshold |
| Both | Xavier/Glorot initialization | Sets initial weights to preserve variance across layers |
| Both | He (Kaiming) initialization | Weight initialization tailored for ReLU networks |
| Both | Careful learning rate selection | Prevents both stagnation and divergence |
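Gradient clipping in particular is a one-line addition to the training loop. A hedged PyTorch sketch (the model, placeholder loss, and clipping threshold are illustrative):

```python
import torch
import torch.nn as nn

model = nn.RNN(input_size=10, hidden_size=20)   # recurrent nets are a common case for clipping
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(5, 1, 10)                       # toy sequence: 5 steps, batch of 1
output, h_n = model(x)
loss = output.pow(2).mean()                     # placeholder loss for illustration
loss.backward()

# Cap the global gradient norm before the weight update (threshold of 1.0 is illustrative)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad()
```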
Modern implementations of backpropagation use computational graphs to represent the forward computation. A computational graph is a directed acyclic graph (DAG) where each node represents an operation (addition, multiplication, activation function application) and each edge represents the flow of data (tensors) between operations.
During the forward pass, the framework builds this graph dynamically (in PyTorch) or statically (in older versions of TensorFlow). Each node stores the information needed to compute its local gradient. During the backward pass, the algorithm traverses the graph in reverse topological order, applying the chain rule at each node and accumulating gradients.
This graph-based view makes backpropagation both modular and general. Adding a new layer type or operation to a network only requires defining its forward computation and its local gradient; the framework handles the rest automatically. The local computation at each node can be described as a forward function paired with a rule for converting an incoming gradient into gradients for the node's inputs, as sketched below.
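A minimal sketch of this idea, using a toy multiplication node; the class and method names are illustrative and do not correspond to any framework's actual API.

```python
class Multiply:
    """Toy computational-graph node for elementwise multiplication: out = a * b."""

    def forward(self, a, b):
        self.a, self.b = a, b              # cache the inputs needed for the local gradient
        return a * b

    def backward(self, grad_out):
        # Chain rule at this node: d(out)/da = b and d(out)/db = a
        return grad_out * self.b, grad_out * self.a

node = Multiply()
out = node.forward(3.0, 4.0)               # forward pass: 12.0
grad_a, grad_b = node.backward(1.0)        # backward pass: (4.0, 3.0)
```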
This modularity is the reason modern deep learning frameworks can support arbitrary network architectures without requiring manual gradient derivation.
Backpropagation is a specific instance of reverse-mode automatic differentiation (autodiff). Understanding the broader context of autodiff helps clarify why backpropagation is so efficient.
Automatic differentiation comes in two modes:
Forward mode computes derivatives alongside the forward computation. It propagates a "tangent" vector forward through the graph, computing the derivative of every intermediate value with respect to a single input variable. For a function f: R^n -> R^m, forward mode computes one column of the Jacobian per pass, so it requires n passes to compute the full Jacobian.
Reverse mode (which is backpropagation) computes derivatives in a backward sweep after the forward pass. It propagates an "adjoint" vector backward through the graph, computing the derivative of a single output with respect to every intermediate value. For the same function, reverse mode computes one row of the Jacobian per pass, requiring m passes for the full Jacobian.
Since neural networks have millions or billions of parameters but a single scalar loss, reverse mode is the natural choice. A single backward pass computes the gradient with respect to all parameters simultaneously, at a computational cost roughly equal to two to three times that of a single forward pass.
| Property | Forward mode | Reverse mode (backpropagation) |
|---|---|---|
| Direction of computation | Input to output | Output to input |
| Efficient when | Few inputs, many outputs | Many inputs, few outputs (typical in deep learning) |
| Jacobian columns/rows per pass | One column | One row |
| Memory requirement | Low (no need to store intermediate values) | Higher (must store intermediate activations) |
| Use in deep learning | Rare; used for Jacobian-vector products | Standard for all training |
Automatic differentiation is distinct from both symbolic differentiation (which manipulates mathematical expressions and can produce unwieldy formulas for complex functions) and numerical differentiation (which uses finite differences and suffers from both truncation and rounding errors). Autodiff computes exact derivatives (up to floating-point precision) at a cost proportional to the original computation.
| Differentiation method | Approach | Accuracy | Computational cost | Suitable for neural networks |
|---|---|---|---|---|
| Symbolic | Manipulates expressions algebraically | Exact | Can grow exponentially with expression complexity | No |
| Numerical (finite differences) | Approximates using (f(x + h) - f(x)) / h | Approximate; prone to rounding errors | O(n) forward passes for n parameters | No (too slow and inaccurate) |
| Automatic (forward mode) | Propagates tangent vectors forward | Exact (up to floating point) | O(n) passes for n parameters | Impractical for large n |
| Automatic (reverse mode) | Propagates adjoints backward | Exact (up to floating point) | O(1) backward pass for scalar output | Yes; this is backpropagation |
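The contrast between the last rows can be checked directly. A hedged PyTorch sketch comparing reverse-mode autodiff against a finite-difference approximation on a toy function (the function and step size are arbitrary):

```python
import torch

def f(x):
    return (x ** 3).sum()                  # toy scalar-valued function; gradient is 3*x^2

x = torch.tensor([1.0, 2.0], requires_grad=True)

# Reverse-mode autodiff: exact up to floating-point precision, one backward pass
f(x).backward()
autodiff_grad = x.grad                     # tensor([3., 12.])

# Numerical differentiation: (f(x + h) - f(x)) / h, one forward pass per coordinate
h = 1e-4
with torch.no_grad():
    numeric_grad = [float((f(x + h * torch.eye(2)[i]) - f(x)) / h) for i in range(2)]
    # approximately [3.0003, 12.0006]: close, but approximate and n times slower
```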
All major deep learning frameworks implement backpropagation through reverse-mode automatic differentiation engines.
PyTorch uses a dynamic computational graph ("define-by-run") approach. Each operation on a tensor with requires_grad=True is recorded in a DAG. Calling .backward() on the loss tensor triggers reverse-mode differentiation through this graph. The autograd engine computes vector-Jacobian products (VJPs) at each node, which is mathematically equivalent to applying the chain rule. After the backward pass, the graph is discarded by default (unless retain_graph=True is specified), and gradients are accumulated in each parameter's .grad attribute.
A simple PyTorch example:
```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Forward pass
x = torch.randn(32, 10)
y = torch.randn(32, 1)
pred = model(x)
loss = loss_fn(pred, y)

# Backward pass (backpropagation)
loss.backward()

# Weight update
optimizer.step()
optimizer.zero_grad()  # clear accumulated gradients before the next iteration
```
TensorFlow 2.x provides tf.GradientTape, a context manager that records operations for automatic differentiation. Operations executed inside the tape's context are traced, and calling tape.gradient(loss, variables) computes gradients via reverse-mode autodiff. TensorFlow also supports higher-order derivatives by nesting gradient tapes, and persistent tapes for computing multiple gradients from the same computation.
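A minimal GradientTape sketch, assuming a toy squared-error loss (the variable values are arbitrary):

```python
import tensorflow as tf

w = tf.Variable([1.0, 2.0])
x = tf.constant([3.0, 4.0])

with tf.GradientTape() as tape:
    # Operations on the watched variable are recorded for reverse-mode differentiation
    loss = tf.reduce_sum((w * x - 1.0) ** 2)

grads = tape.gradient(loss, [w])           # gradient of the scalar loss with respect to w
```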
JAX, developed by Google, provides both forward-mode (jax.jvp) and reverse-mode (jax.grad) differentiation as composable function transformations. JAX can differentiate through native Python and NumPy code, and supports higher-order derivatives, per-example gradients (via jax.vmap), and compilation to accelerators via XLA. Its functional approach makes it particularly well-suited for research applications that require non-standard differentiation patterns.
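A brief JAX sketch showing both modes on a toy linear-model loss (the data and loss function are illustrative):

```python
import jax
import jax.numpy as jnp

def loss_fn(w, x, y):
    return jnp.sum((x @ w - y) ** 2)       # toy squared-error loss for a linear model

w = jnp.array([1.0, -0.5])
x = jnp.array([[1.0, 2.0], [3.0, 4.0]])
y = jnp.array([0.0, 1.0])

g = jax.grad(loss_fn)(w, x, y)             # reverse mode: gradient w.r.t. the first argument

# Forward mode: directional derivative along a tangent vector for w
tangents = (jnp.ones_like(w), jnp.zeros_like(x), jnp.zeros_like(y))
_, jvp_out = jax.jvp(loss_fn, (w, x, y), tangents)

# Per-example gradients via vmap over the data axis
per_example = jax.vmap(jax.grad(loss_fn), in_axes=(None, 0, 0))(w, x, y)
```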
For recurrent neural networks that process sequences, the standard backpropagation algorithm is extended to backpropagation through time (BPTT). The recurrent network is "unrolled" across time steps, creating a deep feedforward graph where each time step corresponds to a layer. Gradients are then computed by standard backpropagation on this unrolled graph.
For a recurrent network with hidden state h_t, input x_t, and weight matrices W_hh (recurrent) and W_xh (input), the unrolled computation at each time step is:
h_t = f(W_hh * h_{t-1} + W_xh * x_t + b)
The gradient with respect to W_hh involves a product of Jacobians across all time steps:
dL/dW_hh = sum over t of sum over k from 1 to t of (dL/dh_t * product(dh_j/dh_{j-1}, j=k+1..t) * dh_k/dW_hh)

where dh_k/dW_hh denotes the immediate partial derivative at step k (treating h_{k-1} as constant), and the product of Jacobians dh_j/dh_{j-1} spans every step between k and t.
BPTT is particularly susceptible to vanishing and exploding gradients because the same weight matrices are multiplied at every time step, compounding the problem. This difficulty motivated the development of gated architectures such as LSTM (Hochreiter and Schmidhuber, 1997) and GRU (Cho et al., 2014).
Truncated BPTT is a common practical approximation that limits the number of time steps through which gradients are propagated, trading off long-range gradient accuracy for computational efficiency and numerical stability.
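A hedged PyTorch sketch of truncated BPTT; the cell size, toy sequence, window length, and loss are all illustrative choices.

```python
import torch
import torch.nn as nn

cell = nn.RNNCell(input_size=8, hidden_size=16)
optimizer = torch.optim.SGD(cell.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

seq = torch.randn(100, 1, 8)               # toy sequence: 100 steps, batch of 1
targets = torch.randn(100, 1, 16)
h = torch.zeros(1, 16)
truncation = 20                            # backpropagate through at most 20 steps

window_loss = 0.0
for t, (x_t, y_t) in enumerate(zip(seq, targets)):
    h = cell(x_t, h)
    window_loss = window_loss + loss_fn(h, y_t)
    if (t + 1) % truncation == 0:
        window_loss.backward()             # gradients flow back through this window only
        optimizer.step()
        optimizer.zero_grad()
        h = h.detach()                     # cut the graph so earlier steps are not revisited
        window_loss = 0.0
```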
Several regularization methods interact directly with the backpropagation process to improve generalization.
| Technique | How it interacts with backpropagation |
|---|---|
| L2 regularization (weight decay) | Adds a penalty term proportional to the squared magnitude of weights to the loss, which modifies the gradients computed during backpropagation |
| L1 regularization | Adds a penalty proportional to the absolute value of weights, encouraging sparsity by driving some weights to exactly zero |
| Dropout | Randomly sets a fraction of neuron outputs to zero during each forward pass, so only a random subset of weights receives gradient updates during backpropagation |
| Batch normalization | Normalizes activations within each mini-batch, stabilizing gradient flow and allowing higher learning rates |
| Early stopping | Monitors validation loss during training and stops when it begins increasing, preventing the network from overfitting to the training data |
| Gradient noise | Adds small random noise to gradients during training, which can help the optimizer escape sharp minima |
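Two of these techniques appear directly in typical PyTorch training code; the sketch below shows where they plug in (the layer sizes, dropout rate, and weight-decay coefficient are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(100, 50),
    nn.ReLU(),
    nn.Dropout(p=0.5),        # randomly zeroes half the activations during each forward pass
    nn.Linear(50, 10),
)

# weight_decay adds the L2 penalty's gradient contribution to every update
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)
```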
Backpropagation solves the credit assignment problem: given an error at the output, it determines how much each weight in every layer contributed to that error. While this works well in artificial systems, its biological plausibility has been debated since its popularization.
Several features of backpropagation do not match what is known about biological neural circuits:
Weight transport problem: Backpropagation requires the backward pass to use the exact transpose of the forward weight matrices. In the brain, synapses are unidirectional; feedforward and feedback connections are physically distinct, and requiring them to share precise weight values seems implausible.
Distinct forward and backward phases: Backpropagation requires a complete forward pass before the backward pass can begin. Biological neurons process information continuously.
Non-local error signals: In backpropagation, a neuron's weight update depends on error information from distant layers. In the brain, synaptic plasticity is thought to depend primarily on local information.
Derivative of activation function: Backpropagation requires computing the precise derivative of each neuron's activation function, which biological neurons do not appear to calculate.
| Algorithm | Key idea | Reference |
|---|---|---|
| Feedback alignment | Uses fixed random feedback weights instead of weight transposes | Lillicrap et al. (2016) |
| Direct feedback alignment | Sends error signals directly from the output to each hidden layer | Nokland (2016) |
| Target propagation | Propagates targets rather than gradients through the network | Lee et al. (2015) |
| Forward-forward algorithm | Replaces forward-backward cycle with two forward passes on positive and negative data | Hinton (2022) |
| Equilibrium propagation | Uses an energy-based model with contrastive learning | Scellier and Bengio (2017) |
| Predictive coding | Updates are based on prediction errors at each layer | Rao and Ballard (1999); Whittington and Bogacz (2017) |
None of these alternatives have matched backpropagation's effectiveness at scale, but they remain active research areas. The forward-forward algorithm, proposed by Hinton in 2022, replaces the forward and backward passes with two forward passes (one on real data and one on negative data), with each layer having its own local objective. It has shown competitive results on some benchmarks but is slower than backpropagation and does not yet generalize as well.
Standard backpropagation computes first-order gradient information. Second-order methods attempt to use curvature information (the Hessian matrix or approximations of it) for faster convergence, though at higher computational cost.
| Method | Approach | Practical consideration |
|---|---|---|
| Newton's method | Uses the full Hessian matrix to scale gradients | Impractical for neural networks due to O(n^2) storage and O(n^3) inversion cost |
| Hessian-free optimization | Uses conjugate gradient to approximately solve the Newton step without forming the Hessian explicitly | Hessian-vector products can be computed via a modified backpropagation pass |
| Natural gradient | Uses the Fisher information matrix as a preconditioner | Captures the geometry of the parameter space rather than Euclidean distance |
| K-FAC | Approximates the Fisher information using Kronecker products | Efficient per-layer approximation that scales to large networks |
James Martens demonstrated in 2010 that Hessian-free optimization could train deep networks that were difficult to train with first-order methods alone. A key insight is that Hessian-vector products can be computed at roughly the cost of one additional forward-backward pass, using a technique sometimes called "double backpropagation" or the R-operator.
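A hedged PyTorch sketch of a Hessian-vector product computed with two backward passes; the toy loss and direction vector are arbitrary.

```python
import torch

def loss_fn(w):
    return (w ** 4).sum() + (w ** 2).sum()   # toy loss with a non-trivial Hessian

w = torch.tensor([1.0, 2.0], requires_grad=True)
v = torch.tensor([0.5, -1.0])                # direction for the Hessian-vector product

loss = loss_fn(w)
(grad,) = torch.autograd.grad(loss, w, create_graph=True)  # first backward pass, kept differentiable
(hvp,) = torch.autograd.grad(grad @ v, w)                   # second backward pass yields H @ v
```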
Backpropagation remains the workhorse of neural network training. Even as model architectures have changed dramatically, from simple feedforward networks to transformers with hundreds of billions of parameters, the underlying training algorithm is still backpropagation combined with first-order optimization.
Recent developments that extend or build upon backpropagation include:
Gradient checkpointing: Reducing memory usage by recomputing intermediate activations during the backward pass rather than storing them all. This technique is essential for training very large models on limited GPU memory. It trades roughly 20-30% additional computation time for a large reduction in memory.
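A minimal sketch using torch.utils.checkpoint; the block sizes are illustrative, and the use_reentrant flag follows current PyTorch recommendations but may differ across versions.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block1 = nn.Sequential(nn.Linear(512, 512), nn.ReLU())
block2 = nn.Sequential(nn.Linear(512, 512), nn.ReLU())

x = torch.randn(64, 512, requires_grad=True)

# Activations inside each checkpointed block are discarded after the forward pass
# and recomputed during backward, trading extra compute for lower memory use.
h = checkpoint(block1, x, use_reentrant=False)
out = checkpoint(block2, h, use_reentrant=False)
out.sum().backward()
```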
Mixed-precision training: Performing forward and backward passes in lower-precision floating-point formats (such as float16 or bfloat16) to speed up computation and reduce memory, while maintaining a float32 copy of the weights for stable updates. This approach was popularized by Micikevicius et al. (2018) and is now standard practice for training large models.
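A hedged sketch of the standard PyTorch mixed-precision loop (requires a CUDA device; the model and data are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()          # scales the loss so float16 gradients do not underflow

x = torch.randn(64, 512, device="cuda")
y = torch.randn(64, 10, device="cuda")

with torch.cuda.amp.autocast():               # forward pass runs in reduced precision
    loss = nn.functional.mse_loss(model(x), y)

scaler.scale(loss).backward()                 # backward pass on the scaled loss
scaler.step(optimizer)                        # unscales gradients, then updates float32 weights
scaler.update()
optimizer.zero_grad()
```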
Pipeline and tensor parallelism: Splitting the backpropagation computation across multiple GPUs by partitioning either the model's layers (pipeline parallelism) or individual layers' weight matrices (tensor parallelism).
Sparse backpropagation: Computing gradients for only a subset of parameters, reducing both computation and communication costs in distributed training settings.
Higher-order gradients: Computing gradients of gradients, used in meta-learning algorithms like MAML (Model-Agnostic Meta-Learning) and in certain generative adversarial network training procedures.