Backpropagation
Last reviewed
May 18, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v6 · 5,032 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 18, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v6 · 5,032 words
Add missing citations, update stale details, or suggest a clearer explanation.
Backpropagation (short for "backward propagation of errors") is the algorithm used to compute the gradient of a scalar loss function with respect to every parameter of a neural network. It applies the chain rule of calculus in a particular order, sweeping backward from the output of the network to its inputs and reusing intermediate quantities, so that the cost of computing the entire gradient is only a small constant factor more than the cost of evaluating the network itself.[^1] The resulting gradient is then fed to an optimizer such as gradient descent or one of its variants to update the parameters.
Backpropagation is the central computational mechanism in modern deep learning. Every major training framework, including PyTorch, TensorFlow, and JAX, exposes it through a general-purpose reverse-mode automatic differentiation engine, of which neural-network backpropagation is a special case.[^2] From a mathematical standpoint backpropagation is simply the chain rule organized so that the gradient of a scalar output with respect to many inputs is computed in a single backward sweep rather than one forward sweep per input.
Imagine throwing a ball at a bucket. You miss to the left, so on the next throw you aim a bit to the right. Each throw teaches you a little bit about how the world responds to your arm movements, and you slowly converge on the bucket. Backpropagation lets a neural network do exactly the same thing, except that instead of one arm it has millions or billions of tiny knobs (weights), and it needs to know how to turn each one individually. Backpropagation is the bookkeeping trick that takes the single "you missed by this much" error at the output and, in one sweep, tells every knob in the network how much it should turn.
Backpropagation was not invented by a single person or in a single year. The underlying mathematics, reverse-mode automatic differentiation, was worked out before researchers knew it would matter for neural networks, and the application to neural networks was rediscovered several times before it became famous. A careful accounting must distinguish three threads: the general technique (reverse-mode automatic differentiation), its application to neural-network training, and its popularization in the machine-learning community.[^3]
The first explicit description of what is now called reverse-mode automatic differentiation appears in the 1970 Master's thesis of Seppo Linnainmaa at the University of Helsinki, titled The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors.[^4] Linnainmaa described how the derivatives of an arbitrary computation expressed as a sequence of elementary operations could be obtained by traversing those operations in reverse order, accumulating partial derivatives via the chain rule. His thesis even included working FORTRAN code that implemented the algorithm. Linnainmaa published a more accessible version in 1976 in BIT Numerical Mathematics.[^5] His motivation was not learning but the analysis of floating-point rounding error; nevertheless, the algorithm he described is mathematically identical to the backward pass used today to train neural networks.
Paul Werbos, in his 1974 PhD dissertation at Harvard University, Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences, was the first to propose using a reverse-mode chain-rule procedure to compute the gradient of a network of differentiable units with respect to its parameters, for the explicit purpose of fitting such a model to data.[^6] Werbos's dissertation framed the problem in the language of "dynamic feedback," but it presented exactly the equations of modern backpropagation. He later expanded the work in a series of papers, including a 1982 article that gave a clearer neural-network-specific presentation.[^7] Despite this, his work received little attention from the connectionist community for more than a decade, in part because the field had been depressed by the 1969 Perceptrons critique of Minsky and Papert.
In the early 1980s several researchers independently rediscovered backpropagation. David Parker, then at Stanford, derived the algorithm in 1982 and circulated a technical report at the MIT Center for Computational Research in Economics and Management Science in 1985 titled Learning Logic.[^8] Yann LeCun, working independently in France, presented a closely related procedure ("HLM," for hierarchical learning machine) at the Cognitiva 85 conference, where it was published in proceedings.[^9] Neither rediscovery achieved broad notice, partly because they appeared in venues outside the mainstream of artificial intelligence and partly because the broader machine-learning community had not yet seen compelling demonstrations of what backpropagation could do on non-trivial problems.
The paper that turned backpropagation into a widely known and adopted algorithm is Learning representations by back-propagating errors by David Rumelhart, Geoffrey Hinton, and Ronald Williams, published in Nature in October 1986.[^10] All three co-authors were essential: Williams in particular contributed substantially to the mathematical formulation and is sometimes underemphasized in popular accounts. The paper presented backpropagation in clean modern notation, demonstrated that it could discover useful internal "hidden" representations in tasks such as symmetry detection and family-tree learning, and was published in a highly visible interdisciplinary venue. It is the 1986 Nature paper, more than the earlier mathematical priority work, that triggered the renewed interest in neural networks that became the connectionist movement of the late 1980s.
A longer presentation appeared the same year as a chapter in the Parallel Distributed Processing (PDP) book volumes edited by Rumelhart and McClelland, which became an extraordinarily influential introduction to connectionism.[^11]
Backpropagation is a particular evaluation order for the chain rule of multivariate calculus. Its goal is to compute the gradient of a scalar loss with respect to many parameters with cost proportional to a single forward pass.
If y = f(g(x)), the chain rule says that dy/dx = (dy/dg) * (dg/dx). For a composition of many functions y = f_L(f_{L-1}(... f_1(x) ...)), repeated application yields
dy/dx = (dy/df_L) * (df_L/df_{L-1}) * ... * (df_2/df_1) * (df_1/dx)
The product is associative, so the factors can be multiplied in any order. Forward-mode automatic differentiation multiplies from the right (input side first); reverse-mode automatic differentiation, which is backpropagation, multiplies from the left (output side first). The two strategies have very different costs depending on the shape of the function.[^12]
Consider a feedforward neural network with L layers. Let a^(0) = x be the input. For each layer l = 1, ..., L:
z^(l) = W^(l) * a^(l-1) + b^(l)
a^(l) = f^(l)(z^(l))
Here W^(l) is the weight matrix of layer l, b^(l) the bias, and f^(l) an activation function such as ReLU, sigmoid, or tanh. The output a^(L) is the network's prediction, and the loss is a scalar L = loss(a^(L), y) comparing the prediction to a target y.
The backward pass computes dL/dz^(l) (often abbreviated delta^(l)) for each layer in reverse order, from l = L down to l = 1. For the output layer,
delta^(L) = dL/dz^(L) = (dL/da^(L)) * f^(L)'(z^(L))
where * denotes element-wise multiplication and f^(L)' is the derivative of the output activation. For an intermediate layer,
delta^(l) = ((W^(l+1))^T * delta^(l+1)) * f^(l)'(z^(l))
The gradients with respect to the parameters are then read off directly:
dL/dW^(l) = delta^(l) * (a^(l-1))^T
dL/db^(l) = delta^(l)
The key fact is that delta^(l) is computed once and reused both for the parameter gradients of layer l and for the recursion to layer l-1. This shared computation is what makes the entire backward pass cost only O(1) times a single forward pass, rather than O(P) times for P parameters as naive symbolic differentiation would.[^13]
In practice all of the above is implemented in vector/matrix form on a GPU. For a mini-batch of B examples, a^(l-1) is a (d_{l-1}, B) matrix and delta^(l) is (d_l, B). The weight gradient becomes
dL/dW^(l) = delta^(l) * (a^(l-1))^T
which is a matrix-matrix product (a (d_l, B) x (B, d_{l-1}) GEMM), and the bias gradient is the sum of delta^(l) across the batch dimension. The mini-batch dimension is implicit in modern frameworks but crucial for hardware efficiency: the same dense linear algebra kernels are used for both forward and backward passes.
A cleaner way to understand backpropagation, and the way it is actually implemented in modern frameworks, is via the computational graph. A computational graph is a directed acyclic graph (DAG) whose nodes are elementary operations (multiplication, addition, exponentiation, an activation function, a matmul) and whose edges carry tensors.[^14] The forward pass evaluates the DAG in topological order, producing an output value at the root (the loss).
The backward pass traverses the DAG in reverse topological order. At each node, the framework already knows the local Jacobian of that node's output with respect to its inputs; combining this with the gradient that flows in from downstream gives the gradient that should flow out to upstream nodes. This is captured by the vector-Jacobian product (VJP): if y = f(x) is a node and g_y = dL/dy flows in, then g_x = (df/dx)^T * g_y flows out. The framework only ever needs efficient VJP rules for each elementary operation; the chain of VJPs across the whole DAG is the backward pass.
This view makes it obvious why backpropagation is modular. To add a new operation to a framework, an implementer specifies (1) how to compute its forward value, and (2) how to compute its VJP. The framework handles composition automatically. The same view also clarifies why memory is the dominant resource constraint: each forward node must, in general, retain enough state to evaluate its VJP later, and that state lives in GPU memory between the forward and the backward pass.
Backpropagation is, formally, a special case of reverse-mode automatic differentiation (reverse-mode AD).[^15] Reverse-mode AD applies to any sufficiently smooth computer program, not just neural networks. It rests on the same idea: build the computational graph, then walk it backward computing VJPs.
The mathematical statement is that, given a function f: R^n -> R^m and an output cotangent vector v in R^m, reverse-mode AD computes v^T * J_f in time roughly proportional to the time to evaluate f itself. When m = 1 (a scalar loss), choosing v = 1 gives the full gradient in R^n, which is what backpropagation does. The fact that one reverse-mode pass yields the entire gradient regardless of how many parameters there are is precisely what makes deep-learning training feasible: a modern model can have 10^11 parameters yet still compute its gradient in roughly the time of a forward pass.[^16]
Reverse-mode AD's main downside is memory. To evaluate the backward pass, the intermediate activations from the forward pass must in general be retained, leading to memory usage that scales with the depth and width of the network. This is the same memory pressure that motivates gradient checkpointing and activation offloading.
Both forward and reverse mode compute exact derivatives; they differ in evaluation order and therefore in cost as a function of input/output dimension.[^17]
| Property | Forward mode | Reverse mode (backpropagation) |
|---|---|---|
| Traversal | Input to output | Output to input |
| Primitive | Jacobian-vector product (JVP) | Vector-Jacobian product (VJP) |
Passes for full Jacobian of f: R^n -> R^m | n passes | m passes |
| Best when | Few inputs, many outputs (n << m) | Many inputs, few outputs (n >> m) |
| Memory | Low; no need to store activations | High; activations must be saved |
| Typical use in deep learning | Jacobian-vector products, sensitivity analysis | All gradient-based training |
Neural networks have n (parameters) in the millions to billions, but m = 1 (a scalar loss), so reverse mode wins by orders of magnitude. Forward mode is still useful for tasks where the output is high-dimensional and the input low-dimensional, for instance certain sensitivity analyses, or when one wants a single column of the Jacobian without paying the full memory cost of reverse mode. Modern frameworks expose both: in JAX, jax.jvp is forward-mode and jax.vjp and jax.grad are reverse-mode.[^18]
All major deep-learning frameworks ship a reverse-mode AD engine. They differ in how the computational graph is constructed and how flexibly the user can manipulate it.
PyTorch uses a dynamic ("define-by-run") graph. Each tensor with requires_grad=True carries a reference to the operation that produced it, building a DAG as the forward computation executes.[^19] Calling .backward() on a scalar loss tensor walks this graph in reverse, accumulating gradients into the .grad attribute of every leaf tensor. Because the graph is rebuilt on each iteration, control flow that depends on tensor values (Python if, for, recursion) "just works"; this flexibility was a major reason for PyTorch's adoption in research.
import torch
import torch.nn as nn
model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
x = torch.randn(32, 10)
y = torch.randn(32, 1)
pred = model(x)
loss = loss_fn(pred, y)
loss.backward() # reverse-mode AD over the dynamic graph
optimizer.step()
optimizer.zero_grad()
PyTorch also exposes torch.autograd.grad for ad-hoc gradient computation, torch.func.vjp and torch.func.jvp for the functional primitives, and torch.autograd.functional for Jacobians and Hessians.[^20]
TensorFlow 2.x provides tf.GradientTape, a context manager that records operations executed within its scope.[^21] Calling tape.gradient(loss, variables) runs reverse-mode AD on the recorded graph. Tapes can be made persistent (allowing multiple gradient calls from a single forward pass) and nested (for higher-order derivatives).
import tensorflow as tf
w = tf.Variable(tf.random.normal((10, 1)))
x = tf.random.normal((32, 10))
y = tf.random.normal((32, 1))
with tf.GradientTape() as tape:
pred = x @ w
loss = tf.reduce_mean((pred - y) ** 2)
grad_w = tape.gradient(loss, w)
Earlier static-graph TensorFlow (tf.Session, version 1.x) built the entire backward graph symbolically before execution, which made debugging harder but allowed certain whole-graph optimizations. TensorFlow 2 unified the two execution modes through tf.function, which traces eager code into a static graph for speed.
JAX takes a functional approach.[^22] jax.grad(f) returns a new function that computes the gradient of f. JAX traces f to obtain an intermediate representation (jaxpr) and applies reverse-mode AD as a function transformation. Because differentiation is just a transformation, it composes cleanly with other transformations: jax.jit(jax.grad(f)) produces a compiled gradient, jax.vmap(jax.grad(f)) produces a vectorized per-example gradient, and jax.grad(jax.grad(f)) gives second-order derivatives.
import jax
import jax.numpy as jnp
def loss(w, x, y):
pred = x @ w
return jnp.mean((pred - y) ** 2)
grad_fn = jax.grad(loss) # reverse-mode AD on `loss`
grad_w = grad_fn(w, x, y)
This functional style is well-suited to research workloads that want to differentiate through unusual control flow, train multiple models that share parameters, or compute higher-order derivatives.
For a recurrent neural network that consumes a sequence x_1, x_2, ..., x_T and maintains a hidden state h_t = f(W_hh * h_{t-1} + W_xh * x_t + b), backpropagation is applied to the unrolled computation graph that ties together all T time steps. This procedure is called backpropagation through time (BPTT) and was introduced by Werbos in 1990.[^23] The gradient with respect to the recurrent weight W_hh is a sum over time steps of contributions, each of which involves a product of Jacobians across the time steps it spans:
dL/dW_hh = sum over t of (dL/dh_t) * product_{k <= t} (dh_k/dh_{k-1}) * (dh_1/dW_hh)
Because the same weight matrix appears in every factor of the product, BPTT is especially prone to vanishing and exploding gradients, which motivated the development of LSTM gates by Hochreiter and Schmidhuber in 1997.[^24]
Truncated BPTT cuts the unrolled graph at a fixed horizon (say, the last k time steps) and backpropagates only through that window, treating earlier states as constants.[^25] This is essential in practice because computing the exact BPTT gradient for a sequence of length T requires O(T) memory and O(T) time per step; truncation lets training scale to long sequences at the cost of dropping long-range gradient signal. Many large language models trained on long contexts effectively rely on truncated BPTT-style approximations even when implemented as transformers.
When each layer's local Jacobian has spectral norm < 1, the gradient signal multiplied through L layers decays exponentially in L. The early layers receive near-zero gradients and effectively stop learning. Sepp Hochreiter analyzed this rigorously in his 1991 Diploma thesis at the Technical University of Munich, and it became known as the vanishing gradient problem.[^26] It is one of the main reasons deep feedforward networks and especially RNNs were considered untrainable in the 1990s. Modern remedies include the ReLU activation (whose derivative is exactly 1 on the positive half-line), careful weight initialization such as Xavier/Glorot and He, residual connections (which provide an unobstructed gradient path), batch normalization and its descendants, and gated recurrent architectures such as LSTM and GRU.[^27]
Conversely, when local Jacobians have spectral norm > 1, gradients can grow exponentially, producing huge weight updates and divergent training. This is the exploding gradient problem. It is most acute in RNNs and very deep networks. The standard fix is gradient clipping: rescale the gradient if its norm exceeds a threshold c (g <- g * (c / ||g||) when ||g|| > c), which guarantees the update magnitude is bounded regardless of how large the raw gradient is.[^28]
Gradient clipping comes in two common variants: clipping by global norm (computed across all parameters jointly), which preserves the gradient's direction, and clipping element-wise, which can distort it. The global-norm form proposed by Pascanu, Mikolov, and Bengio in 2013 is the standard choice for RNNs and remains common for very large transformer training runs.[^28]
The dominant memory cost of backpropagation is the activation memory: every intermediate tensor produced during the forward pass that will be needed to compute a VJP during the backward pass must be retained. For a network with L layers and per-layer activation size M, this gives O(L * M * B) memory for batch size B. In modern transformers training on long sequences, activation memory often dominates parameter memory by a large factor.
This is why memory-saving training techniques target activations specifically. Mixed-precision training stores activations in float16 or bfloat16 to halve their size; activation recomputation (next section) avoids storing some of them at all; sequence/tensor parallelism distributes them across devices; and CPU/NVMe offloading evicts them from GPU memory between the forward and backward passes.
Gradient checkpointing (also called activation checkpointing or activation recomputation) trades compute for memory by saving only a subset of forward activations and recomputing the rest on demand during the backward pass.[^29] The classic recipe, due to Chen, Xu, Zhang, and Guestrin in 2016, divides an L-layer network into sqrt(L) segments, saves activations only at segment boundaries, and recomputes within each segment during the backward pass. This reduces activation memory from O(L) to O(sqrt(L)) at the cost of one extra forward pass per training step. In modern training stacks for large transformers, gradient checkpointing is essentially mandatory and is built directly into PyTorch (torch.utils.checkpoint), JAX (via jax.checkpoint/jax.remat), and DeepSpeed/Megatron-style libraries.[^29]
In a modern large-scale training run, backpropagation is the central computation but it is wrapped in a stack of orthogonal techniques:
The same loop also underlies fine-tuning techniques (LoRA, full fine-tuning, RLHF), with the only difference being which subset of parameters has requires_grad=True. From this perspective backpropagation is the universal substrate, and everything else is an optimization on top of it.
Although backpropagation is enormously effective in artificial systems, it has long been argued that it cannot literally be what biological brains do. The standard objections are:[^31]
These objections have motivated a substantial research program in biologically plausible alternatives to backpropagation. The strongest practical answers so far do not match backpropagation's performance at scale, but they have made enough progress to remain interesting research directions.
Feedback alignment (FA) replaces the transpose of the forward weights in the backward pass with fixed random matrices. Lillicrap, Cownden, Tweed, and Akerman showed in 2016 that, surprisingly, this still works: the forward weights gradually align with the random feedback weights enough to learn.[^32] Direct feedback alignment (DFA), proposed by Nokland in 2016, takes this further by injecting the output error directly into each hidden layer through its own fixed random matrix, removing the sequential dependence on previous layers' error signals entirely.[^33] Both methods are appealingly local but generally do worse than backpropagation on hard tasks; DFA notably struggles on convolutional networks.
In 2022 Geoffrey Hinton proposed the forward-forward algorithm as an explicit replacement for the forward-backward cycle.[^34] Instead of one forward pass followed by a backward pass, FF performs two forward passes: one on real ("positive") data and one on negative data, with each layer trained to maximize a local "goodness" function (such as the sum of squared activations) on positive data and minimize it on negative data. There is no chain of gradients flowing backward through the network at all; each layer has its own local objective. FF has shown competitive results on small benchmarks and resolves several of the biological plausibility objections, but it has not yet displaced backpropagation on large-scale tasks. The wikilink for the technique is [[forward_forward]].
Other alternatives include target propagation (Lee, Zhang, Fischer, Bengio 2015), which propagates targets rather than gradients, and equilibrium propagation (Scellier and Bengio 2017), which uses an energy-based formulation with contrastive learning.[^35]
Who "invented" backpropagation has been the subject of recurring priority discussions. The most prominent advocate for a careful historical accounting is Jurgen Schmidhuber, who has argued in several venues, most extensively in his 2022 Annotated History of Modern AI and Deep Learning, that credit for the core algorithm belongs to Linnainmaa (1970, for reverse-mode AD) and Werbos (1974, for application to neural networks), with the 1986 Rumelhart-Hinton-Williams paper being a popularizer rather than the original source.[^36] In a separate dispute, Schmidhuber has also argued that LSTM, GAN, and several other techniques are attributed in popular accounts to researchers who came later than the original work.
The opposing perspective, often associated with the Rumelhart-Hinton-Williams group and its intellectual descendants, holds that the 1986 Nature paper is the natural attribution point because it is the work that successfully connected the algorithm to a working machine-learning practice, demonstrated on tasks that mattered, in a venue where the relevant community could see it. On this view "invention" of an algorithm and its emergence as a usable technology are not the same event.
Most contemporary histories now adopt a middle position: Linnainmaa 1970 originated reverse-mode AD; Werbos 1974 applied it to neural networks; Parker 1985 and LeCun 1985 independently rediscovered it; Rumelhart, Hinton, and Williams 1986 popularized it.[^3] All five contributions are routinely cited in serious historical accounts, and most recent textbooks present the timeline in roughly this form. This article takes the same neutral stance.