# Backpropagation

> Source: https://aiwiki.ai/wiki/backpropagation
> Updated: 2026-06-20
> Categories: Deep Learning, Machine Learning, Neural Networks
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**Backpropagation** (short for "backward propagation of errors") is the algorithm used to compute the [gradient](/wiki/gradient) of a scalar [loss function](/wiki/loss_function) with respect to every parameter of a [neural network](/wiki/neural_network) by applying the chain rule of calculus in reverse order, sweeping backward from the output to the inputs. Its defining property is efficiency: backpropagation computes the full gradient of a scalar output with respect to all of a network's parameters at a cost of only a small constant factor (provably bounded by about 5, and in practice typically 3 to 4) times the cost of evaluating the network once, no matter how many parameters there are.[^1][^16] This "cheap gradient principle" is what makes it feasible to train models with hundreds of billions of parameters, and backpropagation is therefore the central computational mechanism of modern [deep learning](/wiki/deep_neural_network).[^16][^37]

The algorithm was popularized for neural networks by the 1986 paper "Learning representations by back-propagating errors" by David Rumelhart, Geoffrey Hinton, and Ronald Williams, published in *Nature* (volume 323, pages 533 to 536), one of the most-cited papers in the history of machine learning.[^10][^38] Every major training framework, including [PyTorch](/wiki/pytorch), [TensorFlow](/wiki/tensorflow), and [JAX](/wiki/jax), exposes backpropagation through a general-purpose reverse-mode [automatic differentiation](/wiki/automatic_differentiation) engine, of which neural-network backpropagation is a special case.[^2] From a mathematical standpoint backpropagation is simply the [chain rule](/wiki/chain_rule) organized so that the gradient of a scalar output with respect to many inputs is computed in a single backward sweep rather than one forward sweep per input.

## Explain like I'm 5 (ELI5)

Imagine throwing a ball at a bucket. You miss to the left, so on the next throw you aim a bit to the right. Each throw teaches you a little bit about how the world responds to your arm movements, and you slowly converge on the bucket. Backpropagation lets a neural network do exactly the same thing, except that instead of one arm it has millions or billions of tiny knobs (weights), and it needs to know how to turn each one individually. Backpropagation is the bookkeeping trick that takes the single "you missed by this much" error at the output and, in one sweep, tells every knob in the network how much it should turn.

## Who invented backpropagation?

Backpropagation was not invented by a single person or in a single year. The underlying mathematics, reverse-mode automatic differentiation, was worked out before researchers knew it would matter for neural networks, and the application to neural networks was rediscovered several times before it became famous. A careful accounting must distinguish three threads: the general technique (reverse-mode automatic differentiation), its application to neural-network training, and its popularization in the machine-learning community.[^3]

### Linnainmaa 1970 (reverse-mode AD)

The first explicit description of what is now called reverse-mode automatic differentiation appears in the 1970 Master's thesis of Seppo Linnainmaa at the University of Helsinki, titled *The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors*.[^4] Linnainmaa described how the derivatives of an arbitrary computation expressed as a sequence of elementary operations could be obtained by traversing those operations in reverse order, accumulating partial derivatives via the chain rule. His thesis even included working FORTRAN code that implemented the algorithm. Linnainmaa published a more accessible version in 1976 in *BIT Numerical Mathematics*.[^5] His motivation was not learning but the analysis of floating-point rounding error; nevertheless, the algorithm he described is mathematically identical to the backward pass used today to train neural networks.

### Werbos 1974 (application to NNs)

Paul Werbos, in his 1974 PhD dissertation at Harvard University, *Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences*, was the first to propose using a reverse-mode chain-rule procedure to compute the gradient of a network of differentiable units with respect to its parameters, for the explicit purpose of fitting such a model to data.[^6] Werbos's dissertation framed the problem in the language of "dynamic feedback," but it presented exactly the equations of modern backpropagation. He later expanded the work in a series of papers, including a 1982 article that gave a clearer neural-network-specific presentation.[^7] Despite this, his work received little attention from the connectionist community for more than a decade, in part because the field had been depressed by the 1969 *Perceptrons* critique of Minsky and Papert.

### Parker 1985 (rediscovery)

In the early 1980s several researchers independently rediscovered backpropagation. David Parker, then at Stanford, derived the algorithm in 1982 and circulated a technical report at the MIT Center for Computational Research in Economics and Management Science in 1985 titled *Learning Logic*.[^8] Yann LeCun, working independently in France, presented a closely related procedure ("HLM," for *hierarchical learning machine*) at the Cognitiva 85 conference, where it was published in proceedings.[^9] Neither rediscovery achieved broad notice, partly because they appeared in venues outside the mainstream of artificial intelligence and partly because the broader machine-learning community had not yet seen compelling demonstrations of what backpropagation could do on non-trivial problems.

### Rumelhart, Hinton, Williams 1986 (Nature popularization)

The paper that turned backpropagation into a widely known and adopted algorithm is *Learning representations by back-propagating errors* by David Rumelhart, Geoffrey Hinton, and Ronald Williams, published in *Nature* in October 1986.[^10] Its abstract opens by stating, "We describe a new learning procedure, back-propagation, for networks of neurone-like units," and argues that the key advance is the ability to learn useful internal features: "The ability to create useful new features distinguishes back-propagation from earlier, simpler methods such as the perceptron-convergence procedure."[^10] All three co-authors were essential: Williams in particular contributed substantially to the mathematical formulation and is sometimes underemphasized in popular accounts. The paper presented backpropagation in clean modern notation, demonstrated that it could discover useful internal "hidden" representations in tasks such as symmetry detection and family-tree learning, and was published in a highly visible interdisciplinary venue. With tens of thousands of citations (Semantic Scholar records roughly 30,000), it ranks among the most-cited papers in machine learning.[^38] It is the 1986 *Nature* paper, more than the earlier mathematical priority work, that triggered the renewed interest in neural networks that became the connectionist movement of the late 1980s.

The lasting influence of this line of work was recognized at the highest levels of the field. Geoffrey Hinton shared the 2018 ACM A.M. Turing Award with Yoshua Bengio and Yann LeCun "for conceptual and engineering breakthroughs that have made deep neural networks a critical component of computing," and in 2024 he shared the Nobel Prize in Physics with John Hopfield "for foundational discoveries and inventions that enable machine learning with artificial neural networks."[^39][^40]

A longer presentation appeared the same year as a chapter in the *Parallel Distributed Processing* (PDP) book volumes edited by Rumelhart and McClelland, which became an extraordinarily influential introduction to connectionism.[^11]

## Mathematical formulation

Backpropagation is a particular evaluation order for the chain rule of multivariate calculus. Its goal is to compute the gradient of a scalar loss with respect to many parameters with cost proportional to a single forward pass.

### Chain rule basics

If `y = f(g(x))`, the [chain rule](/wiki/chain_rule) says that `dy/dx = (dy/dg) * (dg/dx)`. For a composition of many functions `y = f_L(f_{L-1}(... f_1(x) ...))`, repeated application yields

```
dy/dx = (dy/df_L) * (df_L/df_{L-1}) * ... * (df_2/df_1) * (df_1/dx)
```

The product is associative, so the factors can be multiplied in any order. **Forward-mode** automatic differentiation multiplies from the right (input side first); **reverse-mode** automatic differentiation, which is backpropagation, multiplies from the left (output side first). The two strategies have very different costs depending on the shape of the function.[^12]

### Forward pass

Consider a feedforward [neural network](/wiki/neural_network) with `L` layers. Let `a^(0) = x` be the input. For each layer `l = 1, ..., L`:

```
z^(l) = W^(l) * a^(l-1) + b^(l)
a^(l) = f^(l)(z^(l))
```

Here `W^(l)` is the weight matrix of layer `l`, `b^(l)` the bias, and `f^(l)` an [activation function](/wiki/activation_function) such as [ReLU](/wiki/rectified_linear_unit_relu), sigmoid, or tanh. The output `a^(L)` is the network's prediction, and the loss is a scalar `L = loss(a^(L), y)` comparing the prediction to a target `y`.

### Backward pass

The backward pass computes `dL/dz^(l)` (often abbreviated `delta^(l)`) for each layer in reverse order, from `l = L` down to `l = 1`. For the output layer,

```
delta^(L) = dL/dz^(L) = (dL/da^(L)) * f^(L)'(z^(L))
```

where `*` denotes element-wise multiplication and `f^(L)'` is the derivative of the output activation. For an intermediate layer,

```
delta^(l) = ((W^(l+1))^T * delta^(l+1)) * f^(l)'(z^(l))
```

The gradients with respect to the parameters are then read off directly:

```
dL/dW^(l) = delta^(l) * (a^(l-1))^T
dL/db^(l) = delta^(l)
```

The key fact is that `delta^(l)` is computed once and reused both for the parameter gradients of layer `l` and for the recursion to layer `l-1`. This shared computation is what makes the entire backward pass cost only `O(1)` times a single forward pass, rather than `O(P)` times for `P` parameters as naive symbolic differentiation would.[^13]

### Vector/matrix form

In practice all of the above is implemented in vector/matrix form on a GPU. For a mini-batch of `B` examples, `a^(l-1)` is a `(d_{l-1}, B)` matrix and `delta^(l)` is `(d_l, B)`. The weight gradient becomes

```
dL/dW^(l) = delta^(l) * (a^(l-1))^T
```

which is a matrix-matrix product (a `(d_l, B) x (B, d_{l-1})` GEMM), and the bias gradient is the sum of `delta^(l)` across the batch dimension. The mini-batch dimension is implicit in modern frameworks but crucial for hardware efficiency: the same dense linear algebra kernels are used for both forward and backward passes.

## Computational graph perspective

A cleaner way to understand backpropagation, and the way it is actually implemented in modern frameworks, is via the **computational graph**. A computational graph is a directed acyclic graph (DAG) whose nodes are elementary operations (multiplication, addition, exponentiation, an activation function, a matmul) and whose edges carry tensors.[^14] The forward pass evaluates the DAG in topological order, producing an output value at the root (the loss).

The backward pass traverses the DAG in **reverse topological order**. At each node, the framework already knows the local Jacobian of that node's output with respect to its inputs; combining this with the gradient that flows in from downstream gives the gradient that should flow out to upstream nodes. This is captured by the *vector-Jacobian product (VJP)*: if `y = f(x)` is a node and `g_y = dL/dy` flows in, then `g_x = (df/dx)^T * g_y` flows out. The framework only ever needs efficient VJP rules for each elementary operation; the chain of VJPs across the whole DAG is the backward pass.

This view makes it obvious why backpropagation is **modular**. To add a new operation to a framework, an implementer specifies (1) how to compute its forward value, and (2) how to compute its VJP. The framework handles composition automatically. The same view also clarifies why memory is the dominant resource constraint: each forward node must, in general, retain enough state to evaluate its VJP later, and that state lives in GPU memory between the forward and the backward pass.

## Why is backpropagation so efficient? (Reverse-mode automatic differentiation)

Backpropagation is, formally, a special case of **reverse-mode automatic differentiation** (reverse-mode AD).[^15] Reverse-mode AD applies to any sufficiently smooth computer program, not just neural networks. It rests on the same idea: build the computational graph, then walk it backward computing VJPs.

The mathematical statement is that, given a function `f: R^n -> R^m` and an output cotangent vector `v in R^m`, reverse-mode AD computes `v^T * J_f` in time roughly proportional to the time to evaluate `f` itself. When `m = 1` (a scalar loss), choosing `v = 1` gives the full gradient in `R^n`, which is what backpropagation does. The fact that one reverse-mode pass yields the entire gradient regardless of how many parameters there are is precisely what makes deep-learning training feasible: a modern model can have `10^11` parameters yet still compute its gradient in roughly the time of a forward pass.[^16] Griewank and Walther call this the **cheap gradient principle** and prove that the ratio of the cost of the gradient to the cost of the function is bounded by a small constant (an upper bound of about 5, with the constant often 3 or less in well-structured code), independent of the number of inputs.[^16][^37]

Reverse-mode AD's main downside is memory. To evaluate the backward pass, the intermediate activations from the forward pass must in general be retained, leading to memory usage that scales with the depth and width of the network. This is the same memory pressure that motivates [gradient checkpointing](#gradient-checkpointing) and activation offloading.

## How does backpropagation differ from forward-mode AD?

Both forward and reverse mode compute exact derivatives; they differ in evaluation order and therefore in cost as a function of input/output dimension.[^17]

| Property | Forward mode | Reverse mode (backpropagation) |
|---|---|---|
| Traversal | Input to output | Output to input |
| Primitive | Jacobian-vector product (JVP) | Vector-Jacobian product (VJP) |
| Passes for full Jacobian of `f: R^n -> R^m` | `n` passes | `m` passes |
| Best when | Few inputs, many outputs (`n << m`) | Many inputs, few outputs (`n >> m`) |
| Memory | Low; no need to store activations | High; activations must be saved |
| Typical use in deep learning | Jacobian-vector products, sensitivity analysis | All gradient-based training |

Neural networks have `n` (parameters) in the millions to billions, but `m = 1` (a scalar loss), so reverse mode wins by orders of magnitude. Forward mode is still useful for tasks where the output is high-dimensional and the input low-dimensional, for instance certain sensitivity analyses, or when one wants a single column of the Jacobian without paying the full memory cost of reverse mode. Modern frameworks expose both: in [JAX](/wiki/jax), `jax.jvp` is forward-mode and `jax.vjp` and `jax.grad` are reverse-mode.[^18]

## Implementation in modern frameworks

All major deep-learning frameworks ship a reverse-mode AD engine. They differ in how the computational graph is constructed and how flexibly the user can manipulate it.

### PyTorch autograd

[PyTorch](/wiki/pytorch) uses a **dynamic** ("define-by-run") graph. Each tensor with `requires_grad=True` carries a reference to the operation that produced it, building a DAG as the forward computation executes.[^19] Calling `.backward()` on a scalar loss tensor walks this graph in reverse, accumulating gradients into the `.grad` attribute of every leaf tensor. Because the graph is rebuilt on each iteration, control flow that depends on tensor values (Python `if`, `for`, recursion) "just works"; this flexibility was a major reason for PyTorch's adoption in research.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(32, 10)
y = torch.randn(32, 1)

pred = model(x)
loss = loss_fn(pred, y)

loss.backward()        # reverse-mode AD over the dynamic graph
optimizer.step()
optimizer.zero_grad()
```

PyTorch also exposes `torch.autograd.grad` for ad-hoc gradient computation, `torch.func.vjp` and `torch.func.jvp` for the functional primitives, and `torch.autograd.functional` for Jacobians and Hessians.[^20]

### TensorFlow GradientTape

[TensorFlow](/wiki/tensorflow) 2.x provides `tf.GradientTape`, a context manager that records operations executed within its scope.[^21] Calling `tape.gradient(loss, variables)` runs reverse-mode AD on the recorded graph. Tapes can be made persistent (allowing multiple gradient calls from a single forward pass) and nested (for higher-order derivatives).

```python
import tensorflow as tf

w = tf.Variable(tf.random.normal((10, 1)))
x = tf.random.normal((32, 10))
y = tf.random.normal((32, 1))

with tf.GradientTape() as tape:
    pred = x @ w
    loss = tf.reduce_mean((pred - y) ** 2)

grad_w = tape.gradient(loss, w)
```

Earlier static-graph TensorFlow (`tf.Session`, version 1.x) built the entire backward graph symbolically before execution, which made debugging harder but allowed certain whole-graph optimizations. TensorFlow 2 unified the two execution modes through `tf.function`, which traces eager code into a static graph for speed.

### JAX grad

[JAX](/wiki/jax) takes a functional approach.[^22] `jax.grad(f)` returns a *new function* that computes the gradient of `f`. JAX traces `f` to obtain an intermediate representation (jaxpr) and applies reverse-mode AD as a function transformation. Because differentiation is just a transformation, it composes cleanly with other transformations: `jax.jit(jax.grad(f))` produces a compiled gradient, `jax.vmap(jax.grad(f))` produces a vectorized per-example gradient, and `jax.grad(jax.grad(f))` gives second-order derivatives.

```python
import jax
import jax.numpy as jnp

def loss(w, x, y):
    pred = x @ w
    return jnp.mean((pred - y) ** 2)

grad_fn = jax.grad(loss)        # reverse-mode AD on `loss`
grad_w = grad_fn(w, x, y)
```

This functional style is well-suited to research workloads that want to differentiate through unusual control flow, train multiple models that share parameters, or compute higher-order derivatives.

## Specific variants

### Backprop through time (BPTT) for RNNs

For a [recurrent neural network](/wiki/recurrent_neural_network) that consumes a sequence `x_1, x_2, ..., x_T` and maintains a hidden state `h_t = f(W_hh * h_{t-1} + W_xh * x_t + b)`, backpropagation is applied to the **unrolled** computation graph that ties together all `T` time steps. This procedure is called [backpropagation through time](/wiki/backpropagation_through_time) (BPTT) and was introduced by Werbos in 1990.[^23] The gradient with respect to the recurrent weight `W_hh` is a sum over time steps of contributions, each of which involves a product of Jacobians across the time steps it spans:

```
dL/dW_hh = sum over t of (dL/dh_t) * product_{k <= t} (dh_k/dh_{k-1}) * (dh_1/dW_hh)
```

Because the same weight matrix appears in every factor of the product, BPTT is especially prone to vanishing and exploding gradients, which motivated the development of [LSTM](/wiki/long_short-term_memory_lstm) gates by Hochreiter and Schmidhuber in 1997.[^24]

### Truncated BPTT

**Truncated BPTT** cuts the unrolled graph at a fixed horizon (say, the last `k` time steps) and backpropagates only through that window, treating earlier states as constants.[^25] This is essential in practice because computing the exact BPTT gradient for a sequence of length `T` requires `O(T)` memory and `O(T)` time per step; truncation lets training scale to long sequences at the cost of dropping long-range gradient signal. Many large language models trained on long contexts effectively rely on truncated BPTT-style approximations even when implemented as transformers.

## Numerical issues

### Vanishing gradients

When each layer's local Jacobian has spectral norm `< 1`, the gradient signal multiplied through `L` layers decays exponentially in `L`. The early layers receive near-zero gradients and effectively stop learning. Sepp Hochreiter analyzed this rigorously in his 1991 Diploma thesis at the Technical University of Munich, and it became known as the **[vanishing gradient](/wiki/vanishing_gradient) problem**.[^26] It is one of the main reasons deep feedforward networks and especially RNNs were considered untrainable in the 1990s. Modern remedies include the [ReLU](/wiki/rectified_linear_unit_relu) activation (whose derivative is exactly 1 on the positive half-line), careful weight initialization such as Xavier/Glorot and He, residual connections (which provide an unobstructed gradient path), batch normalization and its descendants, and gated recurrent architectures such as LSTM and GRU.[^27]

### Exploding gradients

Conversely, when local Jacobians have spectral norm `> 1`, gradients can grow exponentially, producing huge weight updates and divergent training. This is the **[exploding gradient](/wiki/exploding_gradient)** problem. It is most acute in RNNs and very deep networks. The standard fix is **gradient clipping**: rescale the gradient if its norm exceeds a threshold `c` (`g <- g * (c / ||g||)` when `||g|| > c`), which guarantees the update magnitude is bounded regardless of how large the raw gradient is.[^28]

### Gradient clipping

Gradient clipping comes in two common variants: clipping by global norm (computed across all parameters jointly), which preserves the gradient's direction, and clipping element-wise, which can distort it. The global-norm form proposed by Pascanu, Mikolov, and Bengio in 2013 is the standard choice for RNNs and remains common for very large transformer training runs.[^28]

## Memory cost

The dominant memory cost of backpropagation is the **activation memory**: every intermediate tensor produced during the forward pass that will be needed to compute a VJP during the backward pass must be retained. For a network with `L` layers and per-layer activation size `M`, this gives `O(L * M * B)` memory for batch size `B`. In modern transformers training on long sequences, activation memory often dominates parameter memory by a large factor.

This is why memory-saving training techniques target activations specifically. Mixed-precision training stores activations in float16 or bfloat16 to halve their size; activation recomputation (next section) avoids storing some of them at all; sequence/tensor parallelism distributes them across devices; and CPU/NVMe offloading evicts them from GPU memory between the forward and backward passes.

## Gradient checkpointing

**Gradient checkpointing** (also called activation checkpointing or activation recomputation) trades compute for memory by saving only a subset of forward activations and recomputing the rest on demand during the backward pass.[^29] The classic recipe, due to Chen, Xu, Zhang, and Guestrin in 2016, divides an `L`-layer network into `sqrt(L)` segments, saves activations only at segment boundaries, and recomputes within each segment during the backward pass. This reduces activation memory from `O(L)` to `O(sqrt(L))` at the cost of one extra forward pass per training step. In modern training stacks for large transformers, gradient checkpointing is essentially mandatory and is built directly into [PyTorch](/wiki/pytorch) (`torch.utils.checkpoint`), JAX (via `jax.checkpoint`/`jax.remat`), and DeepSpeed/Megatron-style libraries.[^29]

## Modern training stack with backprop

In a modern large-scale training run, backpropagation is the central computation but it is wrapped in a stack of orthogonal techniques:

1. **Forward pass** in mixed precision (bfloat16 or fp8), producing fp32 loss.
2. **Loss scaling** (for fp16) or fp32 accumulation (for bf16) to avoid underflow of small gradients.
3. **Reverse-mode backward pass** via the framework's autograd engine, often with **gradient checkpointing** in the deeper blocks.
4. **Gradient all-reduce** across data-parallel workers (or all-gather/scatter in ZeRO/FSDP).
5. **Gradient clipping** by global norm.
6. **Optimizer step** (typically Adam or AdamW) with weight decay.[^30]
7. **Optional EMA** (exponential moving average) of weights for evaluation.

The same loop also underlies fine-tuning techniques (LoRA, full fine-tuning, RLHF), with the only difference being which subset of parameters has `requires_grad=True`. From this perspective backpropagation is the universal substrate, and everything else is an optimization on top of it.

## Limitations and alternatives

### Backprop critiques (biological plausibility)

Although backpropagation is enormously effective in artificial systems, it has long been argued that it cannot literally be what biological brains do. The standard objections are:[^31]

1. **Weight transport.** The backward pass uses the *exact transpose* of the forward weight matrices. In a real brain, feedback connections are physically distinct synapses; they cannot be made to share weights with feedforward ones without an unexplained mechanism.
2. **Separate forward and backward phases.** Backpropagation requires the full forward pass to complete before any backward signal is computed, but biological neurons compute continuously.
3. **Non-local error signals.** Each synapse's update depends on error information that travels back through many layers, whereas synaptic plasticity is thought to be primarily local.
4. **Exact activation derivatives.** Backpropagation requires the derivative of each activation function at the exact pre-activation value, which biological neurons do not appear to compute.

These objections have motivated a substantial research program in **biologically plausible alternatives** to backpropagation. The strongest practical answers so far do not match backpropagation's performance at scale, but they have made enough progress to remain interesting research directions.

### Direct feedback alignment

**Feedback alignment** (FA) replaces the transpose of the forward weights in the backward pass with *fixed random matrices*. Lillicrap, Cownden, Tweed, and Akerman showed in 2016 that, surprisingly, this still works: the forward weights gradually align with the random feedback weights enough to learn.[^32] **Direct feedback alignment** (DFA), proposed by Nokland in 2016, takes this further by injecting the output error directly into each hidden layer through its own fixed random matrix, removing the sequential dependence on previous layers' error signals entirely.[^33] Both methods are appealingly local but generally do worse than backpropagation on hard tasks; DFA notably struggles on convolutional networks.

### Forward-forward (Hinton 2022)

In 2022 Geoffrey Hinton proposed the **forward-forward algorithm** as an explicit replacement for the forward-backward cycle.[^34] Instead of one forward pass followed by a backward pass, FF performs two forward passes: one on real ("positive") data and one on negative data, with each layer trained to maximize a local "goodness" function (such as the sum of squared activations) on positive data and minimize it on negative data. There is no chain of gradients flowing backward through the network at all; each layer has its own local objective. FF has shown competitive results on small benchmarks and resolves several of the biological plausibility objections, but it has not yet displaced backpropagation on large-scale tasks. The wikilink for the technique is [forward forward](/wiki/forward_forward).

Other alternatives include **target propagation** (Lee, Zhang, Fischer, Bengio 2015), which propagates targets rather than gradients, and **equilibrium propagation** (Scellier and Bengio 2017), which uses an energy-based formulation with contrastive learning.[^35]

## Priority disputes

Who "invented" backpropagation has been the subject of recurring priority discussions. The most prominent advocate for a careful historical accounting is Jurgen Schmidhuber, who has argued in several venues, most extensively in his 2022 *Annotated History of Modern AI and Deep Learning*, that credit for the core algorithm belongs to Linnainmaa (1970, for reverse-mode AD) and Werbos (1974, for application to neural networks), with the 1986 Rumelhart-Hinton-Williams paper being a popularizer rather than the original source.[^36] In a separate dispute, Schmidhuber has also argued that LSTM, GAN, and several other techniques are attributed in popular accounts to researchers who came later than the original work.

The opposing perspective, often associated with the Rumelhart-Hinton-Williams group and its intellectual descendants, holds that the 1986 *Nature* paper is the natural attribution point because it is the work that successfully connected the algorithm to a working machine-learning practice, demonstrated on tasks that mattered, in a venue where the relevant community could see it. On this view "invention" of an algorithm and its emergence as a usable technology are not the same event.

Most contemporary histories now adopt a middle position: Linnainmaa 1970 originated reverse-mode AD; Werbos 1974 applied it to neural networks; Parker 1985 and LeCun 1985 independently rediscovered it; Rumelhart, Hinton, and Williams 1986 popularized it.[^3] All five contributions are routinely cited in serious historical accounts, and most recent textbooks present the timeline in roughly this form. This article takes the same neutral stance.

## See also

- [gradient descent](/wiki/gradient_descent)
- [chain rule](/wiki/chain_rule)
- [automatic differentiation](/wiki/automatic_differentiation)
- [backpropagation through time](/wiki/backpropagation_through_time)
- [vanishing gradient](/wiki/vanishing_gradient)
- [exploding gradient](/wiki/exploding_gradient)
- [geoffrey hinton](/wiki/geoffrey_hinton)
- [paul werbos](/wiki/paul_werbos)
- [rumelhart](/wiki/rumelhart)
- [pytorch](/wiki/pytorch)
- [tensorflow](/wiki/tensorflow)
- [jax](/wiki/jax)
- [neural network](/wiki/neural_network)
- [forward forward](/wiki/forward_forward)

## References

[^1]: Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). "Learning representations by back-propagating errors." *Nature* 323, 533-536. https://www.nature.com/articles/323533a0

[^2]: Baydin, A. G., Pearlmutter, B. A., Radul, A. A., and Siskind, J. M. (2018). "Automatic Differentiation in Machine Learning: a Survey." *Journal of Machine Learning Research* 18(153):1-43. https://www.jmlr.org/papers/v18/17-468.html

[^3]: Schmidhuber, J. (2022). "Annotated History of Modern AI and Deep Learning." arXiv:2212.11279. https://arxiv.org/abs/2212.11279

[^4]: Linnainmaa, S. (1970). *The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors.* Master's thesis, University of Helsinki. Discussed in Griewank, A. (2012), "Who Invented the Reverse Mode of Differentiation?" *Documenta Mathematica*, Extra Volume ISMP, 389-400. https://www.math.uni-bielefeld.de/documenta/vol-ismp/52_griewank-andreas-b.pdf

[^5]: Linnainmaa, S. (1976). "Taylor expansion of the accumulated rounding error." *BIT Numerical Mathematics* 16(2):146-160. https://link.springer.com/article/10.1007/BF01931367

[^6]: Werbos, P. J. (1974). *Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences.* PhD thesis, Harvard University. Reprinted in Werbos, P. J. (1994), *The Roots of Backpropagation*, Wiley. https://www.wiley.com/en-us/The+Roots+of+Backpropagation%3A+From+Ordered+Derivatives+to+Neural+Networks+and+Political+Forecasting-p-9780471598978

[^7]: Werbos, P. J. (1982). "Applications of advances in nonlinear sensitivity analysis." In Drenick, R. F. and Kozin, F. (eds.), *System Modeling and Optimization* (Springer), pp. 762-770. https://link.springer.com/chapter/10.1007/BFb0006203

[^8]: Parker, D. B. (1985). *Learning Logic.* Technical Report TR-47, Center for Computational Research in Economics and Management Science, MIT. Cited in Rumelhart, Hinton, and Williams (1986); see also discussion in Schmidhuber (2022). https://arxiv.org/abs/2212.11279

[^9]: LeCun, Y. (1985). "Une procedure d'apprentissage pour reseau a seuil asymetrique." In *Proceedings of Cognitiva 85*, Paris, pp. 599-604. See also LeCun's account at http://yann.lecun.com/exdb/publis/index.html

[^10]: Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). "Learning representations by back-propagating errors." *Nature* 323(6088):533-536. https://www.nature.com/articles/323533a0

[^11]: Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). "Learning internal representations by error propagation." In Rumelhart, D. E. and McClelland, J. L. (eds.), *Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations*, MIT Press, pp. 318-362. https://mitpress.mit.edu/9780262680530/parallel-distributed-processing/

[^12]: Griewank, A. and Walther, A. (2008). *Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation* (2nd ed.). SIAM. https://epubs.siam.org/doi/book/10.1137/1.9780898717761

[^13]: Goodfellow, I., Bengio, Y., and Courville, A. (2016). *Deep Learning.* MIT Press. Chapter 6, "Deep Feedforward Networks," section 6.5 on backpropagation. https://www.deeplearningbook.org/contents/mlp.html

[^14]: Nielsen, M. (2015). *Neural Networks and Deep Learning.* Chapter 2, "How the backpropagation algorithm works." http://neuralnetworksanddeeplearning.com/chap2.html

[^15]: Baydin, A. G., Pearlmutter, B. A., Radul, A. A., and Siskind, J. M. (2018). "Automatic Differentiation in Machine Learning: a Survey." https://www.jmlr.org/papers/v18/17-468.html

[^16]: Griewank, A. and Walther, A. (2008). *Evaluating Derivatives.* SIAM. The "cheap gradient principle" states that reverse-mode AD computes the gradient at a small constant multiple of the cost of the function. https://epubs.siam.org/doi/book/10.1137/1.9780898717761

[^17]: Margossian, C. C. (2019). "A Review of Automatic Differentiation and its Efficient Implementation." *WIREs Data Mining and Knowledge Discovery*, 9(4). https://wires.onlinelibrary.wiley.com/doi/10.1002/widm.1305 (preprint at https://arxiv.org/abs/1811.05031)

[^18]: JAX team (2018-present). "The Autodiff Cookbook." JAX documentation. https://jax.readthedocs.io/en/latest/notebooks/autodiff_cookbook.html

[^19]: PyTorch team (2017-present). "Autograd mechanics." PyTorch documentation. https://pytorch.org/docs/stable/notes/autograd.html

[^20]: Paszke, A. et al. (2019). "PyTorch: An Imperative Style, High-Performance Deep Learning Library." In *Advances in Neural Information Processing Systems 32*. https://papers.nips.cc/paper/2019/hash/bdbca288fee7f92f2bfa9f7012727740-Abstract.html

[^21]: TensorFlow team (2019-present). "Introduction to gradients and automatic differentiation." TensorFlow documentation. https://www.tensorflow.org/guide/autodiff

[^22]: Bradbury, J., Frostig, R., Hawkins, P., Johnson, M. J., Leary, C., Maclaurin, D., Necula, G., Paszke, A., VanderPlas, J., Wanderman-Milne, S., and Zhang, Q. (2018). "JAX: composable transformations of Python+NumPy programs." https://github.com/google/jax. Documentation: https://jax.readthedocs.io/

[^23]: Werbos, P. J. (1990). "Backpropagation through time: what it does and how to do it." *Proceedings of the IEEE* 78(10):1550-1560. https://ieeexplore.ieee.org/document/58337

[^24]: Hochreiter, S. and Schmidhuber, J. (1997). "Long Short-Term Memory." *Neural Computation* 9(8):1735-1780. https://www.bioinf.jku.at/publications/older/2604.pdf

[^25]: Williams, R. J. and Peng, J. (1990). "An efficient gradient-based algorithm for on-line training of recurrent network trajectories." *Neural Computation* 2(4):490-501. https://direct.mit.edu/neco/article/2/4/490/5567

[^26]: Hochreiter, S. (1991). *Untersuchungen zu dynamischen neuronalen Netzen.* Diploma thesis, Technical University of Munich. English summary in Hochreiter, S., Bengio, Y., Frasconi, P., and Schmidhuber, J. (2001), "Gradient Flow in Recurrent Nets: the Difficulty of Learning Long-Term Dependencies," in Kolen, J. F. and Kremer, S. C. (eds.), *A Field Guide to Dynamical Recurrent Neural Networks*, IEEE Press. https://www.bioinf.jku.at/publications/older/ch7.pdf

[^27]: Glorot, X. and Bengio, Y. (2010). "Understanding the difficulty of training deep feedforward neural networks." *Proceedings of AISTATS 2010*, JMLR W&CP 9:249-256. https://proceedings.mlr.press/v9/glorot10a.html. He, K., Zhang, X., Ren, S., and Sun, J. (2015). "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification." *Proceedings of ICCV 2015.* https://arxiv.org/abs/1502.01852

[^28]: Pascanu, R., Mikolov, T., and Bengio, Y. (2013). "On the difficulty of training recurrent neural networks." *Proceedings of ICML 2013.* https://proceedings.mlr.press/v28/pascanu13.html (preprint at https://arxiv.org/abs/1211.5063)

[^29]: Chen, T., Xu, B., Zhang, C., and Guestrin, C. (2016). "Training Deep Nets with Sublinear Memory Cost." arXiv:1604.06174. https://arxiv.org/abs/1604.06174. PyTorch documentation for `torch.utils.checkpoint`: https://pytorch.org/docs/stable/checkpoint.html

[^30]: Kingma, D. P. and Ba, J. (2015). "Adam: A Method for Stochastic Optimization." *Proceedings of ICLR 2015.* https://arxiv.org/abs/1412.6980. Loshchilov, I. and Hutter, F. (2019). "Decoupled Weight Decay Regularization." *Proceedings of ICLR 2019.* https://arxiv.org/abs/1711.05101

[^31]: Lillicrap, T. P., Santoro, A., Marris, L., Akerman, C. J., and Hinton, G. (2020). "Backpropagation and the brain." *Nature Reviews Neuroscience* 21:335-346. https://www.nature.com/articles/s41583-020-0277-3

[^32]: Lillicrap, T. P., Cownden, D., Tweed, D. B., and Akerman, C. J. (2016). "Random synaptic feedback weights support error backpropagation for deep learning." *Nature Communications* 7:13276. https://www.nature.com/articles/ncomms13276

[^33]: Nokland, A. (2016). "Direct Feedback Alignment Provides Learning in Deep Neural Networks." In *Advances in Neural Information Processing Systems 29.* https://papers.nips.cc/paper/2016/hash/d490d7b4576290fa60eb31b5fc917ad1-Abstract.html (preprint at https://arxiv.org/abs/1609.01596)

[^34]: Hinton, G. (2022). "The Forward-Forward Algorithm: Some Preliminary Investigations." arXiv:2212.13345. https://arxiv.org/abs/2212.13345

[^35]: Lee, D.-H., Zhang, S., Fischer, A., and Bengio, Y. (2015). "Difference Target Propagation." *Proceedings of ECML-PKDD 2015.* https://arxiv.org/abs/1412.7525. Scellier, B. and Bengio, Y. (2017). "Equilibrium Propagation: Bridging the Gap Between Energy-Based Models and Backpropagation." *Frontiers in Computational Neuroscience* 11:24. https://www.frontiersin.org/articles/10.3389/fncom.2017.00024/full

[^36]: Schmidhuber, J. (2015). "Deep Learning in Neural Networks: An Overview." *Neural Networks* 61:85-117. https://arxiv.org/abs/1404.7828. Updated and expanded as Schmidhuber, J. (2022), "Annotated History of Modern AI and Deep Learning," arXiv:2212.11279, https://arxiv.org/abs/2212.11279

[^37]: Baydin, A. G., Pearlmutter, B. A., Radul, A. A., and Siskind, J. M. (2018). "Automatic Differentiation in Machine Learning: a Survey." *JMLR* 18(153):1-43. On the cost of reverse-mode AD relative to function evaluation (the cheap gradient principle). https://www.jmlr.org/papers/v18/17-468.html

[^38]: Semantic Scholar. "Learning representations by back-propagating errors" (Rumelhart, Hinton, Williams 1986), citation record. https://www.semanticscholar.org/paper/052b1d8ce63b07fec3de9dbb583772d860b7c769

[^39]: Association for Computing Machinery (2019). "Fathers of the Deep Learning Revolution Receive ACM A.M. Turing Award" (2018 Turing Award to Bengio, Hinton, and LeCun). https://awards.acm.org/about/2018-turing

[^40]: The Royal Swedish Academy of Sciences (2024). "The Nobel Prize in Physics 2024" (awarded to John J. Hopfield and Geoffrey E. Hinton "for foundational discoveries and inventions that enable machine learning with artificial neural networks"). https://www.nobelprize.org/prizes/physics/2024/summary/

