Automatic differentiation (often abbreviated AD, also called algorithmic differentiation or autodiff) is a family of techniques for numerically evaluating the derivative of a function specified by a computer program. AD exploits the fact that every program, no matter how complicated, executes a sequence of elementary arithmetic operations (addition, subtraction, multiplication, division) and elementary functions (exp, log, sin, cos, etc.). By applying the chain rule of calculus repeatedly to these operations, derivatives of arbitrary order can be computed automatically, accurate to working precision, at a cost of at most a small constant factor more arithmetic operations than the original program.
Automatic differentiation is the mathematical engine behind modern deep learning, powering the gradient computation that makes training large neural networks feasible. The reverse mode of AD is what most practitioners know as backpropagation, and every major machine learning framework, including PyTorch, TensorFlow, and JAX, is built around an automatic differentiation engine. AD has also become foundational to scientific computing, robotics, computational finance, optimization, probabilistic programming, and a broader paradigm sometimes called differentiable programming, in which entire computational pipelines are made end-to-end differentiable so they can be optimized with gradient descent.
Given a numerical function $f: \mathbb{R}^n \to \mathbb{R}^m$ implemented as a computer program, the goal of automatic differentiation is to compute the Jacobian matrix $J = \partial f / \partial x$ (or directional projections of it) without resorting to either of the two classical alternatives, both of which have severe practical drawbacks.
The first alternative is numerical differentiation, which approximates the derivative by finite differences such as $f'(x) \approx (f(x + h) - f(x)) / h$ for some small $h$. This approach is simple but suffers from two compounding errors: truncation error (because $h$ is not zero) and round-off error from subtracting nearly-equal floating point numbers. Choosing $h$ requires a delicate trade-off, and even with careful tuning, finite differences are typically accurate only to about half of working precision. Worse, computing a full gradient of a scalar-valued function with $n$ inputs requires $O(n)$ function evaluations, hopelessly expensive for the millions or billions of parameters in modern neural networks.
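As a quick illustration of this trade-off, the following plain-Python sketch (the test function and step sizes are chosen arbitrarily for illustration) shows the forward-difference error first shrinking as $h$ decreases and then growing again once round-off dominates:

```python
import math

def fd_error(h, x=1.0):
    """Absolute error of the forward-difference estimate of d/dx sin(x)."""
    approx = (math.sin(x + h) - math.sin(x)) / h
    return abs(approx - math.cos(x))

# Truncation error dominates for large h, round-off error for tiny h;
# the best achievable accuracy is roughly half of working precision.
for h in (1e-2, 1e-5, 1e-8, 1e-11):
    print(f"h = {h:.0e}   error = {fd_error(h):.2e}")
```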
The second classical alternative is symbolic differentiation, which manipulates a closed-form algebraic expression for $f$ to produce a closed-form expression for its derivative, much like a computer algebra system would. Symbolic differentiation is exact, but it suffers from a problem known as expression swell: the symbolic derivative of even a moderately complex expression can be exponentially larger than the original, leading to huge memory consumption and slow evaluation. Symbolic differentiation also struggles with control flow (loops, branches, recursion), iterative algorithms, and any program where the computation is defined operationally rather than as a single mathematical expression.
Automatic differentiation occupies the middle ground. Like numerical differentiation, it produces numerical values rather than symbolic expressions. Like symbolic differentiation, it is exact to working precision rather than approximate. The key insight is that AD does not need a global symbolic expression for $f$; it needs only the local derivative of each elementary operation, which is known in closed form. By composing these local derivatives via the chain rule as the program runs, AD computes derivatives in a way that handles arbitrary control flow, scales gracefully to functions of millions of variables, and avoids the truncation and cancellation errors inherent in finite differences.
Every program that computes a numerical function can be represented as a directed acyclic graph called the computational graph. The nodes of the graph are intermediate variables, and the edges represent elementary operations that produce one variable from others. Inputs sit at the leaves, the output sits at the root, and the graph encodes the order in which sub-expressions must be evaluated.
For example, the function $f(x_1, x_2) = \ln(x_1) + x_1 \cdot x_2 - \sin(x_2)$ might be decomposed into the following sequence of elementary assignments, sometimes called a Wengert list or evaluation trace:
```
v1 = x1
v2 = x2
v3 = ln(v1)
v4 = v1 * v2
v5 = sin(v2)
v6 = v3 + v4
v7 = v6 - v5
y  = v7
```
Each intermediate variable depends only on previously computed variables, so the trace can be visualized as a DAG with arrows pointing in the direction of data flow. The chain rule of multivariable calculus tells us how derivatives propagate along this graph: the derivative of any node with respect to any input is the sum, over all paths from input to node, of the product of local derivatives along the path. The two modes of automatic differentiation correspond to two different ways of organizing this sum.
Forward-mode automatic differentiation computes derivatives by propagating tangents from inputs to outputs in lockstep with the original computation. Alongside each intermediate value $v_i$, the program also carries a derivative value $\dot{v}_i = \partial v_i / \partial x_j$ for some chosen input $x_j$. Whenever a new $v_k$ is computed by an elementary operation $v_k = \phi(v_a, v_b, \ldots)$, its tangent is computed using the local partial derivatives:
$$\dot{v}_k = \frac{\partial \phi}{\partial v_a} \dot{v}_a + \frac{\partial \phi}{\partial v_b} \dot{v}_b + \ldots$$
At the end of the forward sweep, the tangent associated with the output node is exactly $\partial f / \partial x_j$. To compute the full Jacobian, the forward sweep must be repeated once for each input, seeding $\dot{x}_j = 1$ and all other input tangents to zero. The cost is therefore proportional to the number of inputs $n$.
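Applied to the trace above, a forward sweep can be written out by hand. The following plain-Python sketch (variable names mirror the trace; the evaluation point $(2, 5)$ is chosen arbitrarily) carries a tangent $\dot{v}_i$ alongside each value $v_i$:

```python
import math

def f_and_tangent(x1, x2, dx1, dx2):
    """Evaluate f(x1, x2) = ln(x1) + x1*x2 - sin(x2) together with its
    directional derivative along the input tangent (dx1, dx2)."""
    v1, dv1 = x1, dx1
    v2, dv2 = x2, dx2
    v3, dv3 = math.log(v1), dv1 / v1              # d/dt ln(v1) = v1_dot / v1
    v4, dv4 = v1 * v2, dv1 * v2 + v1 * dv2        # product rule
    v5, dv5 = math.sin(v2), math.cos(v2) * dv2    # chain rule through sin
    v6, dv6 = v3 + v4, dv3 + dv4
    v7, dv7 = v6 - v5, dv6 - dv5
    return v7, dv7

# Seeding (dx1, dx2) = (1, 0) yields df/dx1; (0, 1) yields df/dx2.
y, dy_dx1 = f_and_tangent(2.0, 5.0, 1.0, 0.0)
print(y, dy_dx1)   # dy_dx1 == 1/x1 + x2 == 5.5
```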
The most elegant implementation of forward mode uses dual numbers, a two-component algebraic extension of the reals analogous to complex numbers. A dual number is written $a + b\epsilon$, where $\epsilon$ is an infinitesimal symbol satisfying $\epsilon^2 = 0$. Lifting any smooth function $f$ to dual numbers via Taylor expansion yields $f(a + b\epsilon) = f(a) + b \cdot f'(a) \epsilon$, so the coefficient of $\epsilon$ in the result is exactly the directional derivative. Forward-mode AD can therefore be implemented by replacing every floating-point variable in a program with a dual number and overloading the elementary operations to act correctly on dual numbers.
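A minimal sketch of this idea in Python, assuming only the handful of elementary operations needed for the running example, might look like:

```python
import math

class Dual:
    """Minimal dual number a + b*eps; the eps coefficient carries the tangent."""
    def __init__(self, value, tangent=0.0):
        self.value, self.tangent = value, tangent

    def __add__(self, other):
        return Dual(self.value + other.value, self.tangent + other.tangent)

    def __sub__(self, other):
        return Dual(self.value - other.value, self.tangent - other.tangent)

    def __mul__(self, other):
        # (a + b*eps)(c + d*eps) = ac + (ad + bc)*eps, because eps^2 = 0
        return Dual(self.value * other.value,
                    self.value * other.tangent + self.tangent * other.value)

def log(x): return Dual(math.log(x.value), x.tangent / x.value)
def sin(x): return Dual(math.sin(x.value), math.cos(x.value) * x.tangent)

# df/dx1 at (2, 5): seed x1's tangent with 1 and x2's with 0.
x1, x2 = Dual(2.0, 1.0), Dual(5.0, 0.0)
y = log(x1) + x1 * x2 - sin(x2)
print(y.value, y.tangent)   # f(2, 5) and df/dx1 = 1/2 + 5 = 5.5
```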
Forward mode is conceptually simple, requires no extra memory beyond the augmented values themselves, and parallelizes trivially across input directions. Its principal weakness is that it scales poorly when $n$ is large.
Reverse-mode automatic differentiation organizes the chain rule in the opposite direction. The original program is executed forward, producing all intermediate values $v_i$ and recording the operations that produced them. Then a second backward sweep starts at the output and propagates adjoints $\bar{v}_i = \partial y / \partial v_i$ from the output back toward the inputs. When the backward sweep encounters an operation $v_k = \phi(v_a, v_b, \ldots)$, it updates the adjoints of the inputs of that operation using the same local partial derivatives, but accumulated in reverse:
$$\bar{v}_a \mathrel{+}= \bar{v}_k \cdot \frac{\partial \phi}{\partial v_a}, \quad \bar{v}_b \mathrel{+}= \bar{v}_k \cdot \frac{\partial \phi}{\partial v_b}, \quad \ldots$$
The adjoints of the input nodes at the end of the backward sweep give the gradient $\nabla f$ in a single backward pass, no matter how many inputs there are. Provided the output is a scalar, the cost of computing the entire gradient is at most a small constant multiple (typically between three and five) of the cost of the original forward computation. This remarkable property is sometimes called the cheap gradient principle.
The price for this efficiency is memory: the backward sweep needs every intermediate value computed during the forward pass, since these values appear in the local derivatives. The data structure that stores them is called a tape or, in honor of its inventor, a Wengert list. For very long computations, the memory cost of the tape can become a bottleneck, motivating optimizations such as gradient checkpointing (also called recomputation), in which only some intermediates are stored and the rest are recomputed on demand during the backward pass, trading time for memory.
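The following self-contained Python sketch illustrates the mechanism: each operation records its inputs and local partial derivatives on the graph as it executes, and a backward sweep in reverse topological order accumulates adjoints exactly as in the update rule above. It is an illustrative toy, not how production frameworks are implemented:

```python
import math

class Var:
    """Scalar graph node: stores a value, an adjoint, and, for each input,
    the local partial derivative of the operation that produced this node."""
    def __init__(self, value, parents=()):
        self.value, self.grad, self.parents = value, 0.0, parents

    def __add__(self, other): return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])
    def __sub__(self, other): return Var(self.value - other.value, [(self, 1.0), (other, -1.0)])
    def __mul__(self, other): return Var(self.value * other.value, [(self, other.value), (other, self.value)])

    def backward(self):
        # Order the recorded nodes topologically, then sweep in reverse,
        # accumulating adjoints with the += rule from the text.
        order, seen = [], set()
        def visit(node):
            if node not in seen:
                seen.add(node)
                for parent, _ in node.parents:
                    visit(parent)
                order.append(node)
        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in node.parents:
                parent.grad += node.grad * local

def log(x): return Var(math.log(x.value), [(x, 1.0 / x.value)])
def sin(x): return Var(math.sin(x.value), [(x, math.cos(x.value))])

x1, x2 = Var(2.0), Var(5.0)
y = log(x1) + x1 * x2 - sin(x2)
y.backward()
print(x1.grad, x2.grad)   # 5.5 and x1 - cos(x2) = 2 - cos(5)
```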
In machine learning, the loss function is almost always a scalar (the training objective) computed from a model with millions or billions of parameters. This is the regime where reverse-mode AD shines: a single backward pass produces the entire parameter gradient, which is then fed to a gradient descent optimizer to update the weights. Backpropagation in neural networks is exactly this construction, applied to the specific class of computational graphs corresponding to layered neural networks.
The two modes of AD differ sharply in cost depending on the shape of the function being differentiated. The following table summarizes when to prefer each.
| Property | Forward mode | Reverse mode |
|---|---|---|
| Direction of sweep | Inputs to outputs | Outputs to inputs |
| Quantity computed per sweep | Jacobian-vector product (JVP), or pushforward | Vector-Jacobian product (VJP), or pullback |
| Cost to compute full Jacobian | $O(n)$ forward sweeps for $n$ inputs | $O(m)$ backward sweeps for $m$ outputs |
| Memory requirement | Constant overhead, no tape | Stores entire forward trace (tape) |
| Best when | Output dimension much greater than input dimension ($m \gg n$) | Input dimension much greater than output dimension ($n \gg m$) |
| Implementation strategy | Dual numbers, operator overloading | Wengert tape, source-to-source transformation |
| Typical ML use case | Sensitivity of small parameter sets, JVPs in physics | Backpropagation, training neural networks |
| Error from chain rule | Exact to working precision | Exact to working precision |
In machine learning, $n$ (the number of parameters) is enormous and $m$ (the size of the loss) is one, so reverse mode is the default. In some scientific computing applications such as forward sensitivity analysis of ODEs, $m$ may be larger than $n$, so forward mode wins. The Hessian, $H = \nabla^2 f$, can be computed efficiently by composing the two modes: applying forward mode on top of reverse mode produces the Hessian-vector product in a single pass, and full Hessians can be assembled by stacking such products.
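In JAX, for example, a Hessian-vector product can be obtained by applying forward mode (jax.jvp) to the reverse-mode gradient (jax.grad); the function below is a sketch of this forward-over-reverse composition, with an arbitrary scalar test function:

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sum(jnp.sin(x) ** 2)          # arbitrary scalar test function

def hvp(f, x, v):
    """Hessian-vector product: forward-mode JVP applied to the reverse-mode gradient."""
    return jax.jvp(jax.grad(f), (x,), (v,))[1]

x = jnp.arange(3.0)
v = jnp.ones(3)
print(hvp(f, x, v))                           # equals jax.hessian(f)(x) @ v
```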
The backpropagation algorithm popularized by David Rumelhart, Geoffrey Hinton, and Ronald Williams in their 1986 Nature paper Learning representations by back-propagating errors is precisely reverse-mode AD restricted to the computational graph of a feedforward neural network. The forward pass computes the activations of each layer, and the backward pass propagates partial derivatives of the loss with respect to each activation back through the same layers, accumulating gradients with respect to the weights along the way.
In fact, the underlying mathematics of reverse-mode AD predates the 1986 Nature paper by sixteen years. Seppo Linnainmaa described essentially modern reverse-mode AD in his 1970 University of Helsinki master's thesis on accumulating round-off errors in the evaluation of arithmetic expressions, treating the gradient computation in full generality for nested differentiable functions. The reframing of backpropagation as a special case of AD has been important for the field, because it explains why backpropagation works (it is just the chain rule, organized cleverly), generalizes it to arbitrary computation graphs (loops, conditionals, recursion), and provides the conceptual foundation for differentiable programming.
There are two main implementation paradigms for automatic differentiation, and modern frameworks combine elements of both.
In an operator overloading implementation, the host language (typically C++, Python, or Julia) provides a special tensor or number type whose arithmetic operators are overloaded to record the operations they perform. When a program runs with these special types, it implicitly constructs the computational graph as a side effect of execution, ready to be traversed in reverse for backpropagation. The graph is built dynamically and reflects whatever control flow the program actually took on this particular input.
This approach is the foundation of PyTorch autograd, TensorFlow eager mode, JAX tracing, and many older systems such as Adept (C++) and Autograd (Python, the spiritual ancestor of JAX). Operator overloading is flexible and easy to integrate into existing host languages, but it carries the runtime overhead of dispatching to overloaded methods and recording every operation onto a tape, which can hurt performance compared to compiled approaches.
In a source code transformation (or source-to-source) implementation, a compiler analyzes the source of the user's function and generates new source code that computes both the function's output and its derivative. The generated code can then be optimized and compiled like any other code, often producing extremely fast adjoint code. Source transformation can take advantage of static information about the program, such as variable types and array sizes, that is opaque to operator overloading.
Classical Fortran AD tools such as Tapenade, ADIFOR, and OpenAD, the Python library Tangent, and the Julia package Zygote are notable source-transformation systems. Zygote is particularly notable because it operates on the SSA-form intermediate representation used by the Julia compiler, allowing it to differentiate a large subset of ordinary Julia code, including control flow, recursion, closures, structs, and dictionaries, without requiring users to rewrite their programs.
A hybrid approach, used by JAX, is to trace the user's Python function with abstract values, capturing the resulting computation graph as an intermediate representation, and then compile it just-in-time (JIT) into highly optimized machine code. JAX uses Google's XLA (Accelerated Linear Algebra) compiler, which lowers traced programs into HLO (High-Level Operations) and ultimately into device-specific kernels for CPUs, GPUs, and TPUs. The OpenXLA project and the broader MLIR (Multi-Level Intermediate Representation) ecosystem are increasingly providing shared compilation infrastructure across PyTorch, JAX, and TensorFlow, with projects like Enzyme bringing automatic differentiation directly into the LLVM and MLIR compiler stacks as a first-class transformation. This trend toward compiler-based AD blurs the line between operator overloading and source transformation: the user writes ordinary code, the framework traces it like operator overloading, and then compiles it like source transformation.
Automatic differentiation has a long and somewhat tangled history that predates modern machine learning by decades.
The earliest published descriptions of forward-mode AD appeared in the 1950s and early 1960s, including a 1964 paper by R. E. Wengert titled A simple automatic derivative evaluation program in Communications of the ACM. Wengert's paper introduced the idea of decomposing a function into a sequence of elementary operations and propagating derivative values alongside primal values through that sequence. The data structure now known as the Wengert list or Wengert tape is named in his honor.
Reverse-mode AD was first described in modern form by Seppo Linnainmaa in his 1970 master's thesis at the University of Helsinki. Linnainmaa's thesis treated the problem of accumulating rounding errors in arithmetic computations, but the mathematical structure he developed (a backward propagation of derivatives through a computational graph) is reverse-mode AD as we know it today. A related 1974 PhD thesis by Paul Werbos at Harvard developed similar ideas in the context of training neural networks, although it was not widely recognized at the time.
The first general-purpose, automatic implementation of reverse-mode AD was given by Bert Speelpenning in his 1980 University of Illinois PhD thesis Compiling Fast Partial Derivatives of Functions Given by Algorithms. Speelpenning's system accepted a Fortran-like specification of a computation and automatically produced reverse-mode adjoint code, demonstrating in practice the cheap-gradient principle that had previously been only theoretical.
The 1986 Nature paper by Rumelhart, Hinton, and Williams was a watershed for machine learning, popularizing reverse-mode AD under the name backpropagation and demonstrating that gradient-based optimization could train multi-layer neural networks to learn useful internal representations. Although later commentary (notably by Juergen Schmidhuber) emphasized that the underlying mathematics had been published earlier, the 1986 paper triggered the wave of neural network research that ultimately led to modern deep learning.
The AD research community continued to develop the theory and practice through the 1990s and 2000s, producing major textbooks (notably Andreas Griewank and Andrea Walther's Evaluating Derivatives, whose second edition was published by SIAM in 2008), tools (ADIFOR, ADOL-C, Tapenade, OpenAD), and a series of conferences. The 2018 JMLR survey by Baydin, Pearlmutter, Radul, and Siskind, Automatic differentiation in machine learning: a survey, brought the AD and ML communities back together and made explicit that backpropagation and the autograd engines of deep learning frameworks are instances of reverse-mode AD.
Every major modern machine learning framework is built around an automatic differentiation engine. The three dominant systems differ in how they construct and execute the computational graph, in their language and ergonomic choices, and in the degree to which they expose AD as a first-class composable transformation.
| Feature | PyTorch (autograd) | TensorFlow (GradientTape) | JAX |
|---|---|---|---|
| Graph style | Dynamic, tape-based, eager by default | Eager by default since 2.0; static via tf.function | Functional, traced, JIT-compiled |
| AD trigger | requires_grad=True on tensors, implicit recording | Explicit with tf.GradientTape() context | Explicit transformation: jax.grad(f), jax.vjp(f, x) |
| Backward call | loss.backward() accumulates into .grad | tape.gradient(loss, vars) returns gradients | grad_fn = jax.grad(f); grads = grad_fn(x) |
| Forward mode | Limited (torch.func.jvp) | Limited (forward_accumulator) | First class: jax.jvp, jax.jacfwd |
| Higher-order derivatives | Supported via create_graph=True or torch.func | Supported with nested tapes | Native via grad(grad(f)), hessian(f) |
| Vectorization | torch.vmap | tf.vectorized_map | jax.vmap (composable with grad and JIT) |
| Parallelization | DDP, FSDP at training-loop level | tf.distribute strategies | jax.pmap for SPMD across devices |
| JIT compiler | torch.compile (Inductor / TorchDynamo) | tf.function (XLA optional) | jax.jit (XLA by default) |
| Functional purity | Imperative and stateful | Imperative and stateful | Pure functions, no implicit state |
| Best fit | Rapid research, dynamic models, NLP, RL | Production deployment, mobile, TPUs | Research at scale, scientific ML, transformer training on TPU |
| Year introduced | 2016 | 2015 | 2018 |
In PyTorch, the autograd engine builds the computational graph dynamically as tensors with requires_grad=True flow through operations. Each operation node stores a function object describing how to compute the gradient of its inputs given the gradient of its output. Calling .backward() on a scalar loss triggers a topological traversal of the graph in reverse, populating the .grad fields of the leaf tensors. Because the graph is rebuilt on every forward pass, PyTorch handles arbitrary Python control flow naturally, which made it the favored framework for research and natural language processing.
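A minimal example of this workflow, using the running function from earlier (input values chosen arbitrarily):

```python
import torch

x1 = torch.tensor(2.0, requires_grad=True)
x2 = torch.tensor(5.0, requires_grad=True)

# The graph is recorded dynamically as these operations execute.
y = torch.log(x1) + x1 * x2 - torch.sin(x2)

y.backward()                  # reverse traversal of the recorded graph
print(x1.grad, x2.grad)       # 5.5 and x1 - cos(x2)
```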
In TensorFlow, the original 1.x API used static, pre-compiled graphs constructed via Session and tf.placeholder. TensorFlow 2.0 (2019) made eager execution the default and introduced tf.GradientTape, an explicit context manager that records operations on watched tensors and computes gradients via tape.gradient(). Wrapping functions with @tf.function compiles them into graphs optimizable by the XLA backend, blending eager flexibility with static-graph speed.
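The equivalent computation with a GradientTape looks roughly like this (same running example, arbitrary values):

```python
import tensorflow as tf

x1, x2 = tf.Variable(2.0), tf.Variable(5.0)

with tf.GradientTape() as tape:              # records ops on trainable variables
    y = tf.math.log(x1) + x1 * x2 - tf.sin(x2)

g1, g2 = tape.gradient(y, [x1, x2])          # reverse sweep over the tape
print(g1.numpy(), g2.numpy())                # 5.5 and x1 - cos(x2)
```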
In JAX, automatic differentiation is exposed as a set of pure, composable functional transformations that operate on Python functions. jax.grad transforms a function into one that returns its gradient. jax.jvp computes a forward-mode Jacobian-vector product (the pushforward), and jax.vjp computes a reverse-mode vector-Jacobian product (the pullback). Higher-level transformations such as jax.jacfwd, jax.jacrev, and jax.hessian are built compositionally from these primitives. Crucially, AD transformations compose with jax.jit (XLA compilation), jax.vmap (automatic vectorization), and jax.pmap (SPMD parallelism across devices), so a researcher can differentiate, vectorize, parallelize, and compile a function in any combination simply by composing transformations. This design makes JAX particularly attractive for large-scale transformer training on TPU pods and for methods that demand higher-order derivatives.
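A small sketch of this style, again using the running example (batch contents chosen arbitrarily):

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.log(x[0]) + x[0] * x[1] - jnp.sin(x[1])

x = jnp.array([2.0, 5.0])

grad_f = jax.grad(f)                  # reverse-mode gradient as a new function
print(grad_f(x))                      # [5.5, x1 - cos(x2)]

# Transformations compose: vectorize the gradient over a batch, then JIT it.
batched_grad = jax.jit(jax.vmap(grad_f))
xs = jnp.stack([x, x + 1.0])
print(batched_grad(xs))
```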
Outside the Python ecosystem, Julia has built a distinctive AD stack. Zygote, ForwardDiff, ReverseDiff, and Enzyme provide source-to-source AD that can differentiate ordinary Julia code, including code that calls into C, manipulates structs, and runs on GPUs. The Julia community has popularized the phrase scientific machine learning to describe the use of differentiable programming to train models that mix neural networks with differential equations, optimization, and physical simulators.
Because the output of an AD transformation is itself a program, AD can be applied to that program again. Composing reverse mode with itself yields second-order gradients (gradient of gradient), which can be used to inspect curvature, detect saddle points, and implement second-order optimizers like Newton's method. Composing forward mode on top of reverse mode is the standard way to compute Hessian-vector products without materializing the full Hessian, which is essential for trust-region methods, conjugate-gradient solvers, and Hessian-free optimization.
Frameworks differ in how cleanly they expose this composition. JAX makes it natural: jax.grad(jax.grad(f)) is the second derivative, and jax.hessian(f) is jax.jacfwd(jax.jacrev(f)) by definition. PyTorch supports it via the torch.func module (originally functorch) and the create_graph=True flag on backward(), while TensorFlow supports it through nested tape contexts.
Composition with vmap is equally important. The Jacobian of a function $f: \mathbb{R}^n \to \mathbb{R}^m$ can be assembled by mapping the VJP across the standard basis of the output space; this is essentially how jax.jacrev is built, vectorizing the pullback returned by jax.vjp over the rows of an $m \times m$ identity matrix, as the sketch below illustrates. This compositional view of differentiation is the conceptual foundation of much of the modern AD literature.
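A rough JAX sketch of this construction (with an arbitrary test function) compares the hand-assembled Jacobian against jax.jacrev:

```python
import jax
import jax.numpy as jnp

def f(x):                                     # arbitrary test function R^3 -> R^2
    return jnp.array([x[0] * x[1], jnp.sin(x[2])])

def jacobian_rev(f, x):
    """Assemble the Jacobian by mapping the VJP over the output's standard basis."""
    y, vjp_fn = jax.vjp(f, x)
    basis = jnp.eye(y.shape[0])               # one cotangent per output component
    rows, = jax.vmap(vjp_fn)(basis)
    return rows

x = jnp.array([2.0, 3.0, 0.5])
print(jacobian_rev(f, x))
print(jax.jacrev(f)(x))                       # matches the hand-assembled version
```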
The success of AD has motivated a broader paradigm called differentiable programming (DiffProg), in which entire software systems are written so their input-output behavior is end-to-end differentiable. Rather than confining AD to neural networks, differentiable programming applies it to physics simulators, ray tracers, ODE solvers, control loops, fluid dynamics codes, and even probabilistic programs. Any of these systems can then be embedded as a layer inside a larger model and trained jointly with neural network components by gradient descent.
Yann LeCun, in a widely circulated 2018 Facebook post, argued that differentiable programming is more than a rebranding of deep learning: it is a paradigm in which programmers assemble networks of parameterized differentiable blocks trained automatically from data, much as conventional programmers assemble subroutines into programs. The Julia ecosystem (Flux.jl, DiffEqFlux.jl), the JAX ecosystem (Equinox, Flax), and PyTorch extensions like torchdiffeq have made differentiable programming accessible across many scientific domains.
Despite its remarkable power, automatic differentiation is not a panacea. Several practical issues recur often enough to merit attention.
Non-differentiable operations such as discrete sampling, integer arithmetic, and hard branches break the chain rule. Workarounds include the reparameterization trick, the straight-through estimator, score-function (REINFORCE) gradients, and continuous relaxations such as the Gumbel-softmax. Care must also be taken with operations that are differentiable almost everywhere but not at specific points, such as ReLU at zero, max-pooling at ties, and absolute value at zero; AD systems handle these by adopting a conventional sub-gradient.
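For instance, a straight-through estimator for rounding can be written in one line of PyTorch: the forward pass keeps the hard, non-differentiable rounding, while the backward pass treats the operation as the identity (an illustrative sketch, not a recommendation for any particular model):

```python
import torch

x = torch.randn(4, requires_grad=True)

# Forward pass uses the hard rounding; the detach() trick makes the backward
# pass see the whole operation as the identity, so gradients flow straight through.
y = x + (torch.round(x) - x).detach()

y.sum().backward()
print(y)        # rounded values
print(x.grad)   # all ones, as if y were simply x
```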
Numerical stability is another concern. AD's local derivatives are exact to working precision, but the full gradient computation can still suffer from overflow, underflow, or catastrophic cancellation in deep networks where products of many terms can vanish or explode. Loss scaling, mixed-precision training, gradient clipping, batch normalization, and residual connections were all developed in part to keep AD-computed gradients well behaved.
Memory pressure from the reverse-mode tape grows linearly with computation length, which bottlenecks very deep networks, long recurrent sequences, and physics simulators. Gradient checkpointing trades compute for memory by storing only a subset of intermediates and recomputing the rest on demand; the classical Griewank-Walther binomial schedule and modern variants used in transformer training show that surprisingly little extra compute is needed to keep memory cost under control.
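In PyTorch, for example, torch.utils.checkpoint wraps a block so that its intermediates are recomputed rather than stored; the sketch below (layer sizes and block structure are arbitrary) illustrates the pattern:

```python
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(
    torch.nn.Linear(256, 256), torch.nn.ReLU(),
    torch.nn.Linear(256, 256), torch.nn.ReLU(),
)
x = torch.randn(32, 256, requires_grad=True)

# Intermediates inside the block are not kept on the tape; they are recomputed
# from x during the backward pass, trading extra compute for lower peak memory.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```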
Correctness is subtle. Mutating arrays, in-place operations, and side effects such as printing can silently break AD if the framework's tracing or tape mechanism is not equipped to handle them. Functional-style frameworks like JAX impose purity restrictions to make differentiation correct by construction; imperative frameworks like PyTorch instead provide explicit hooks so users can supply custom gradient rules when needed. Finally, some functions involving fixed-point iterations or implicit equations do not admit a meaningful gradient simply by unrolling their computation; implicit differentiation, used in deep equilibrium models and meta-learning, computes derivatives of solutions of equations using the implicit function theorem rather than by tracing the solver.
Although machine learning has put AD into the spotlight, the technique has been used productively in many other fields for decades. In computational fluid dynamics and weather forecasting, adjoint codes enable variational data assimilation (4D-Var), in which observations are blended with a forward simulation by minimizing a misfit whose gradient is computed by reverse-mode AD through the simulator. In robotics, AD computes Jacobians of forward-kinematics maps for inverse kinematics and contact dynamics. In computational finance, AD provides exact Greeks for derivative pricing models via the adjoint algorithmic differentiation (AAD) technique. In computer graphics, differentiable renderers like Mitsuba 3 use AD to invert rendering pipelines for inverse graphics and 3D reconstruction. In probabilistic programming, frameworks like Pyro, NumPyro, and Stan use AD to compute gradients of log-densities for Hamiltonian Monte Carlo and variational inference. This breadth of application is part of why differentiable programming is sometimes called one of the deepest computational ideas of the early twenty-first century.