See also: Machine learning terms
A partial derivative measures how a multivariable function changes when one of its inputs is varied while every other input is held fixed. For a function f(x_1, x_2, ..., x_n), the partial derivative with respect to the i-th variable is written as ∂f/∂x_i. Partial derivatives are the building blocks of multivariable calculus and the mathematical engine behind almost every learning algorithm in modern artificial intelligence. Training a neural network with billions of parameters comes down to computing one partial derivative per parameter at every update step and using those numbers to nudge each weight in the direction that lowers the loss.
Let f: ℝ^n → ℝ be a real-valued function of n variables. The partial derivative of f with respect to x_i at a point x = (x_1, ..., x_n) is defined as the limit
∂f/∂x_i (x) = lim_{h → 0} [ f(x_1, ..., x_i + h, ..., x_n) − f(x_1, ..., x_i, ..., x_n) ] / h
when this limit exists. Geometrically, the partial derivative is the slope of the tangent line to the graph of f along the i-th coordinate axis, with all other coordinates frozen. If you slice the surface z = f(x, y) with a plane y = y_0, you get a one-dimensional curve in x; the slope of that curve at x_0 is exactly ∂f/∂x evaluated at (x_0, y_0).
This "freeze everything else" trick reduces a multivariable derivative to an ordinary single-variable derivative. To compute ∂f/∂x_i symbolically, treat every other variable as a constant and apply the usual rules of differentiation (power rule, product rule, quotient rule, chain rule).
Several notations for partial derivatives appear across textbooks, papers, and code. They all mean the same thing.
| Notation | Example | Where it appears |
|---|---|---|
| Leibniz | ∂f/∂x | Standard in physics and most ML papers |
| Subscript (Lagrange) | f_x, f_xy | Compact, common in pure math |
| Operator | D_i f, ∂_i f | Index notation, used in tensor calculus |
| Numerator-only | ∂_x f | Functional analysis |
| Programming | df/dx, grad_x | Symbolic math libraries like SymPy |
The symbol ∂ is a stylized cursive d, sometimes called "del" or "partial", and it signals that the derivative is partial rather than total. A total derivative df/dx accounts for indirect dependencies; a partial derivative does not.
Applying ∂/∂x_j to a partial derivative ∂f/∂x_i gives a second-order partial derivative
∂²f / ∂x_j ∂x_i
When i = j, this is the pure second partial ∂²f/∂x_i². When i ≠ j, it is a mixed partial. A natural question is whether the order of differentiation matters. In general it can, but for well-behaved functions it does not.
Schwarz's theorem (Clairaut's theorem): If f and its first and second partial derivatives are continuous on an open neighborhood of a point, then
∂²f / ∂x_i ∂x_j = ∂²f / ∂x_j ∂x_i
at that point. This result is sometimes attributed to Alexis Clairaut (1740) or Hermann Schwarz (1873). It fails for pathological functions whose mixed partials exist but are discontinuous; Peano constructed a classical counterexample. Almost every loss function used in practice satisfies the smoothness assumptions, so deep learning practitioners rely on the symmetry of mixed partials without comment.
Partial derivatives are usually packaged into matrices and vectors that summarize all first- or second-order information about a function.
| Object | Function type | Definition | Shape | Role in ML |
|---|---|---|---|---|
| Gradient ∇f | f: ℝ^n → ℝ | Vector of all first partials ∂f/∂x_i | n-dimensional vector | Direction of steepest ascent of the loss |
| Jacobian J_f | f: ℝ^n → ℝ^m | Matrix with entries ∂f_i/∂x_j | m × n matrix | Linear approximation of vector-valued layers |
| Hessian H_f | f: ℝ^n → ℝ | Matrix with entries ∂²f / ∂x_i ∂x_j | n × n symmetric matrix | Second-order curvature, used in Newton-type methods |
The gradient ∇f points in the direction in which f increases most rapidly, and its negative −∇f gives the steepest descent direction. The Jacobian generalizes this to vector-valued functions: each row is the gradient of one output component. The Hessian, by Schwarz's theorem, is symmetric for any twice continuously differentiable function. Its eigenvalues tell you about the local curvature of the loss landscape, which connects partial derivatives to convexity and to the choice of learning rate.
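The sketch below, assuming JAX is installed, evaluates all three objects for small functions invented for the example; jax.grad, jax.jacobian, and jax.hessian are the relevant entry points.

```python
import jax
import jax.numpy as jnp

# Scalar-valued function of 3 variables: gradient and Hessian are defined.
def f(x):
    return jnp.sum(x**2) + x[0] * x[1]

# Vector-valued function R^3 -> R^2: the Jacobian is a 2 x 3 matrix.
def g(x):
    return jnp.stack([x[0] * x[1], jnp.sin(x[2])])

x = jnp.array([1.0, 2.0, 3.0])
print(jax.grad(f)(x))       # shape (3,)   - vector of first partials
print(jax.jacobian(g)(x))   # shape (2, 3) - one row per output component
print(jax.hessian(f)(x))    # shape (3, 3) - symmetric matrix of second partials
```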
The chain rule extends to functions of several variables and is the workhorse of backpropagation. Suppose y = f(u_1, ..., u_m) where each u_k = g_k(x_1, ..., x_n). Then
∂y/∂x_i = Σ_k (∂f/∂u_k) (∂u_k/∂x_i)
This sum-of-products structure looks innocuous, but when you stack hundreds of layers it becomes a long matrix product. Each layer of a neural network contributes one Jacobian to the chain, and the gradient of the loss with respect to a weight deep inside the network is computed by multiplying these Jacobians together in a specific order. See the article on the chain rule for a deeper treatment.
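To make the sum-of-products structure concrete, the following SymPy sketch (with inner functions g_1, g_2 made up for the example) computes ∂y/∂x_1 both by the chain-rule sum and by differentiating the composition directly.

```python
import sympy as sp

x1, x2 = sp.symbols('x1 x2')
u1 = x1 + x2              # u1 = g1(x1, x2)
u2 = x1 * x2              # u2 = g2(x1, x2)
f = u1**2 + sp.sin(u2)    # y = f(u1, u2) composed with g1, g2

# Chain rule by hand: dy/dx1 = (df/du1)(du1/dx1) + (df/du2)(du2/dx1)
a, b = sp.symbols('a b')
f_outer = a**2 + sp.sin(b)
chain = (sp.diff(f_outer, a).subs({a: u1, b: u2}) * sp.diff(u1, x1)
         + sp.diff(f_outer, b).subs({a: u1, b: u2}) * sp.diff(u2, x1))

direct = sp.diff(f, x1)             # differentiate the composition directly
print(sp.simplify(chain - direct))  # 0: both routes agree
```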
A modern loss function L is a scalar-valued function of every learnable parameter in the model. A small language model has hundreds of millions of parameters; a frontier model has hundreds of billions. The fundamental learning step is
w_i ← w_i − η · ∂L/∂w_i
where η is the learning rate. This is just gradient descent written one parameter at a time. The partial derivative ∂L/∂w_i answers a very specific question: if I increase weight w_i by a tiny amount and leave every other weight alone, how much does the training loss change? The optimizer uses that answer to decide how to update each weight.
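A toy version of this update loop, with a two-parameter loss and hand-coded partial derivatives invented for the example, might look like:

```python
# Toy loss L(w) = (w0 - 3)^2 + 2*(w1 + 1)^2 with hand-coded partial derivatives.
def dL_dw(w):
    return [2 * (w[0] - 3), 4 * (w[1] + 1)]   # [dL/dw0, dL/dw1]

w = [0.0, 0.0]
eta = 0.1                                     # learning rate
for step in range(200):
    grad = dL_dw(w)
    w = [w[i] - eta * grad[i] for i in range(2)]  # one update per parameter

print(w)  # approaches the minimizer (3, -1)
```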
Making this practical requires three things: a way to compute ∂L/∂w_i for every parameter at once, a way to do it efficiently, and a way to do it without writing the formulas by hand. Backpropagation provides all three.
The backpropagation algorithm, popularized by David Rumelhart, Geoffrey Hinton, and Ronald Williams in their 1986 Nature paper Learning representations by back-propagating errors, is a particular schedule for applying the multivariable chain rule. Rather than recomputing the product of Jacobians from scratch for each parameter, backprop traverses the network once forward to compute activations, then once backward to accumulate partial derivatives layer by layer.
The key efficiency insight is reuse. The partial derivative of the loss with respect to a hidden activation appears in the chain-rule expansion for every parameter that influences that activation. Backpropagation computes it once and shares it. The total cost of evaluating the gradient is a small constant times the cost of evaluating the loss itself, regardless of how many parameters there are. This complexity result is what makes deep learning computationally tractable; without it, training a billion-parameter model would be hopeless.
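The sketch below, assuming NumPy and a one-hidden-layer network made up for the example, shows the reuse explicitly: the upstream partial dL/dh is computed once and then feeds the gradient of every first-layer weight.

```python
import numpy as np

rng = np.random.default_rng(0)
x  = rng.normal(size=3)          # input
y  = 1.0                         # target
W1 = rng.normal(size=(4, 3))     # first-layer weights
w2 = rng.normal(size=4)          # second-layer weights

# Forward pass: store intermediate activations for reuse in the backward pass.
a    = W1 @ x                    # pre-activation
h    = np.tanh(a)                # hidden activation
yhat = w2 @ h                    # scalar output
L    = 0.5 * (yhat - y) ** 2     # loss

# Backward pass: each upstream partial derivative is computed once and reused.
dL_dyhat = yhat - y                  # dL/dyhat
dL_dw2   = dL_dyhat * h              # reuses dL/dyhat for every entry of w2
dL_dh    = dL_dyhat * w2             # computed once, shared by every W1 entry
dL_da    = dL_dh * (1 - h ** 2)      # tanh'(a) = 1 - tanh(a)^2
dL_dW1   = np.outer(dL_da, x)        # dL/dW1[i, j] = dL/da[i] * x[j]
print(dL_dw2.shape, dL_dW1.shape)    # (4,) (4, 3)
```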
Backpropagation is a special case of a more general technique called automatic differentiation, often shortened to autodiff or AD. Autodiff systematically applies the chain rule to any program built from differentiable primitives. A 2017 survey by Atilim Gunes Baydin, Barak Pearlmutter, Alexey Radul, and Jeffrey Siskind in the Journal of Machine Learning Research gives the standard reference. Autodiff comes in two main modes.
| Property | Forward mode | Reverse mode |
|---|---|---|
| Primitive operation | Jacobian-vector product (JVP) | Vector-Jacobian product (VJP) |
| Cost of a full Jacobian | O(n) passes, one per input dimension | O(m) passes, one per output dimension |
| Memory | Low, no graph storage | Stores intermediate activations |
| Best when | Few inputs, many outputs | Many inputs, one output (typical ML) |
| ML name | Tangent propagation | Backpropagation |
For a function f: ℝ^n → ℝ^m, forward-mode AD computes one column of the Jacobian per pass, and reverse-mode AD computes one row per pass. A neural network loss has m = 1 (a scalar) and n in the millions or billions, which makes reverse mode dramatically cheaper. That is why every major deep learning framework defaults to reverse mode.
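Assuming JAX, the two primitive operations look like this for a small function f: ℝ³ → ℝ² invented for the example.

```python
import jax
import jax.numpy as jnp

def f(x):                                   # f: R^3 -> R^2
    return jnp.stack([x[0] * x[1], jnp.sin(x[2])])

x = jnp.array([1.0, 2.0, 3.0])

# Forward mode: one JVP gives J @ v, i.e. the Jacobian applied to one input direction.
v = jnp.array([1.0, 0.0, 0.0])
_, jvp_out = jax.jvp(f, (x,), (v,))         # shape (2,): first column of J

# Reverse mode: one VJP gives u^T @ J, i.e. one row of the Jacobian.
_, vjp_fn = jax.vjp(f, x)
(vjp_out,) = vjp_fn(jnp.array([1.0, 0.0]))  # shape (3,): first row of J
print(jvp_out, vjp_out)
```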
The big three frameworks each expose autodiff slightly differently:
- PyTorch records operations on tensors created with requires_grad=True. Calling .backward() on a scalar walks the graph in reverse and populates the .grad attribute on every leaf tensor.
- TensorFlow records operations executed inside a tf.GradientTape context. Calling tape.gradient(loss, variables) replays the tape backward, applying the chain rule.
- JAX exposes jax.grad for the gradient of a scalar function, jax.jvp for forward-mode Jacobian-vector products, and jax.vjp for reverse-mode vector-Jacobian products. Higher-order derivatives compose by repeated application.

All three implementations are exact up to floating-point rounding. They are not numerical approximations.
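For instance, a minimal PyTorch example of the first workflow, with an arbitrary two-parameter loss made up for the example:

```python
import torch

w = torch.tensor([1.0, -2.0], requires_grad=True)  # leaf tensor tracked by autograd
x = torch.tensor([3.0, 4.0])
loss = ((w * x).sum() - 1.0) ** 2                   # scalar loss

loss.backward()        # reverse-mode pass over the recorded graph
print(w.grad)          # dL/dw = 2*((w*x).sum() - 1)*x = [-36., -48.]
```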
Automatic differentiation is one of three ways a computer can compute partial derivatives. The other two are worth mentioning because both have failure modes that make them poor choices for neural network training.
Numerical differentiation uses finite differences:
∂f/∂x_i ≈ [ f(x + h e_i) − f(x) ] / h
where e_i is the i-th standard basis vector. This is mathematically clean but computationally awful for ML. Each gradient component requires its own forward pass, so the cost is O(n) full evaluations of the loss; for n in the billions, this is hopeless. Worse, finite differences have a numerical accuracy problem: for the forward difference above, a small h gives a truncation error of roughly (h/2)·M, where M bounds the second derivative, but a large round-off error of roughly ε/h, where ε is machine epsilon. The two errors trade off, and the optimal h depends on the function. For double-precision floats, the best achievable accuracy is around 10^-8, far worse than autodiff. Finite differences remain useful as a sanity check, sometimes called a "gradient check", to catch bugs in custom autodiff implementations.
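The trade-off is easy to see numerically. The sketch below (NumPy, with sin as the test function, chosen only for illustration) sweeps the step size h for the forward difference.

```python
import numpy as np

f = np.sin
x0 = 1.0
exact = np.cos(x0)

# Truncation error dominates for large h; floating-point round-off dominates
# for very small h. The total error is smallest near h ~ 1e-8 in double precision.
for h in [1e-2, 1e-4, 1e-6, 1e-8, 1e-10, 1e-12]:
    approx = (f(x0 + h) - f(x0)) / h
    print(f"h={h:.0e}  error={abs(approx - exact):.2e}")
```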
Symbolic differentiation manipulates expressions algebraically, the way Mathematica or SymPy do. It produces exact closed-form derivatives. The downside is expression swell: differentiating a nested expression naively can produce derivative expressions exponentially larger than the original, because the chain rule duplicates subexpressions. Symbolic systems can mitigate this with common subexpression elimination, but for the deeply nested compositions that define neural networks, automatic differentiation is more practical.
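A small SymPy sketch (the nesting depth is arbitrary) makes the swell visible by counting operations before and after differentiation.

```python
import sympy as sp

x = sp.symbols('x')
expr = x
for _ in range(6):
    expr = sp.sin(expr) * sp.cos(expr)   # repeatedly nest the expression

d = sp.diff(expr, x)                     # the chain rule duplicates subexpressions
print(sp.count_ops(expr), sp.count_ops(d))  # the derivative is far larger
```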
Linear regression with mean squared error has loss
L(w, b) = (1/2N) Σ_i (w · x_i + b − y_i)²
The partial derivative with respect to weight w_j is
∂L/∂w_j = (1/N) Σ_i (w · x_i + b − y_i) · x_{i,j}
and with respect to the bias is
∂L/∂b = (1/N) Σ_i (w · x_i + b − y_i)
In vector form, the gradient with respect to w is (1/N) X^T (Xw + b·1 − y), where 1 is the vector of all ones. Setting the gradient to zero recovers the closed-form normal equations. Most modern ML uses gradient descent rather than the normal equations because forming and solving that linear system does not scale to large, high-dimensional data.
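A sketch of these gradient formulas in NumPy, with synthetic data generated only for the example, recovers the true parameters by plain gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 3
X = rng.normal(size=(N, d))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.3 + 0.01 * rng.normal(size=N)

w = np.zeros(d)
b = 0.0
for _ in range(500):
    r = X @ w + b - y          # residuals (w . x_i + b - y_i)
    grad_w = X.T @ r / N       # dL/dw_j = (1/N) sum_i r_i * x_{i,j}
    grad_b = r.mean()          # dL/db   = (1/N) sum_i r_i
    w -= 0.1 * grad_w
    b -= 0.1 * grad_b

print(w, b)   # close to the true parameters (2, -1, 0.5) and 0.3
```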
For a classifier with logits z = Wh + b, softmax probabilities p_k = exp(z_k) / Σ_j exp(z_j), and cross-entropy loss L = −Σ_k y_k log p_k against a one-hot label y, the partial derivative with respect to the logits has a famously clean form:
∂L/∂z_k = p_k − y_k
The gradient is just "predicted probability minus target". This identity is the reason softmax and cross-entropy are almost always implemented as a single fused operation in deep learning libraries: the combined form is numerically stable and the derivative collapses to a subtraction. By the chain rule, ∂L/∂W_{kj} = (p_k − y_k) · h_j, which is what backprop pushes back to the previous layer.
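The identity is easy to verify numerically. The sketch below (NumPy, with logits made up for the example) compares p − y against a finite-difference estimate of one component.

```python
import numpy as np

z = np.array([2.0, -1.0, 0.5])        # logits
y = np.array([0.0, 1.0, 0.0])         # one-hot label

# Numerically stable softmax and cross-entropy
p = np.exp(z - z.max())
p /= p.sum()
loss = -np.sum(y * np.log(p))

grad_z = p - y                        # dL/dz_k = p_k - y_k

# Finite-difference check of the first component
h = 1e-6
z2 = z.copy(); z2[0] += h
p2 = np.exp(z2 - z2.max()); p2 /= p2.sum()
loss2 = -np.sum(y * np.log(p2))
print(grad_z[0], (loss2 - loss) / h)  # the two values agree closely
```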
The rectified linear unit ReLU(x) = max(0, x) has derivative 1 for x > 0 and 0 for x < 0, but it is not differentiable at x = 0 because the left and right derivatives disagree. In practice, this is harmless. ReLU is convex, so it has a well-defined subdifferential at zero, which is the closed interval [0, 1]. Any value in that interval is a valid subgradient and produces a correct subgradient descent step. PyTorch and TensorFlow both implement the convention ∂ReLU/∂x|_{x=0} = 0, picking the lower endpoint of the subdifferential. Floating-point inputs hit exactly zero with vanishing probability anyway, so the choice almost never affects training in practice.
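For example, in PyTorch the gradient at an input of exactly zero comes out as zero:

```python
import torch

x = torch.tensor([-1.0, 0.0, 2.0], requires_grad=True)
torch.relu(x).sum().backward()
print(x.grad)   # tensor([0., 0., 1.]) - the derivative at exactly 0 is taken as 0
```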
Computing partial derivatives for a deep network exposes several numerical and statistical pitfalls.
Vanishing gradients. When a network is many layers deep, the chain rule multiplies a long sequence of Jacobians. If most of those Jacobians have spectral norm less than one, the product shrinks exponentially with depth, and the gradient at early layers becomes too small to drive learning. Sepp Hochreiter identified this in his 1991 diploma thesis, and Yoshua Bengio and collaborators analyzed it again in 1994. The vanishing gradient problem motivated LSTM cells for recurrent networks, residual connections in ResNets, and careful weight initialization.
Exploding gradients. The opposite failure mode happens when Jacobian norms exceed one and the product grows exponentially. The standard fix, introduced by Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio in their 2013 paper on training recurrent networks, is gradient clipping: whenever the gradient's norm exceeds a fixed maximum, rescale it down to that maximum before applying the update. Modern transformer training routinely clips gradients to a global norm of 1.0.
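A minimal sketch of clipping by global norm (plain NumPy; the gradient values are invented for the example) could look like:

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    # Rescale the whole list of gradient arrays so their combined L2 norm
    # is at most max_norm, preserving the gradient's direction.
    total = np.sqrt(sum(np.sum(g**2) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads]

grads = [np.array([3.0, 4.0]), np.array([12.0])]     # global norm = 13
clipped = clip_by_global_norm(grads, max_norm=1.0)
print(np.sqrt(sum(np.sum(g**2) for g in clipped)))   # ~1.0
```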
Mixed-precision training. Float16 has a narrow dynamic range (and only about three significant decimal digits), so partial derivatives that underflow to zero in fp16 can derail training. The standard workaround is loss scaling: multiply the loss by a large constant before backprop, then divide gradients by the same constant. The bf16 format keeps the same dynamic range as fp32 and largely sidesteps the issue, which is why bf16 has become the default for large model training.
Stop-gradient operators. Sometimes you want to use a tensor as a constant during backprop even though it depends on parameters. PyTorch provides tensor.detach(), TensorFlow provides tf.stop_gradient, and JAX provides jax.lax.stop_gradient. These primitives appear in policy-gradient methods (where the value function is treated as fixed when computing the policy update), in Gumbel-softmax estimators, in target networks for deep Q-learning, and in self-supervised methods like BYOL where one branch's gradient must be blocked.
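A tiny PyTorch example of the effect: detaching a factor removes its dependence on x from the backward pass.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x**2
z = x * y.detach()       # y is treated as a constant during backprop
z.backward()
print(x.grad)            # 4.0 (the value of y), not 3*x^2 = 12, because dy/dx is blocked
```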
Not every quantity in modern ML is defined by an explicit forward computation. A deep equilibrium model, introduced by Shaojie Bai, J. Zico Kolter, and Vladlen Koltun in 2019, defines its hidden state as the fixed point of a nonlinear map z* = f(z*, x). To compute ∂z*/∂x without storing every iteration of the fixed-point solver, the model uses implicit differentiation derived from the implicit function theorem. The same trick appears in implicit MAML for meta-learning, where Aravind Rajeswaran and collaborators differentiate through the optimality condition of an inner-loop optimizer instead of differentiating through the optimizer steps themselves. Implicit differentiation gives constant-memory backward passes, which is otherwise impossible for problems with arbitrarily deep computation.
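As a scalar toy example (the map f(z, x) = tanh(a·z + x) and its constants are invented for illustration), implicit differentiation of the fixed point matches differentiating through the solver, without storing the solver's iterations.

```python
import numpy as np

# Fixed point z* of f(z, x) = tanh(a*z + x); solve by iteration, then differentiate
# implicitly: dz*/dx = f_x / (1 - f_z) evaluated at the fixed point.
a, x = 0.5, 0.3
z = 0.0
for _ in range(100):
    z = np.tanh(a * z + x)           # fixed-point iteration

s = 1 - np.tanh(a * z + x) ** 2      # derivative of tanh at the fixed point
dz_dx_implicit = s / (1 - s * a)     # implicit function theorem

# Check against a finite difference through the whole solver
eps = 1e-6
z2 = 0.0
for _ in range(100):
    z2 = np.tanh(a * z2 + x + eps)
print(dz_dx_implicit, (z2 - z) / eps)
```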
Imagine you're on a hill with lots of bumps and valleys, and you're trying to find the lowest point on the hill. To do this, you can only take one step at a time, and you want each step to take you lower. This is like what happens in machine learning when we try to make our model better by finding the lowest point of a special hill, called the "loss function."
Partial derivatives are like a magical compass that shows us which way to take a step to go lower on the hill. They tell us how the height of the hill changes if we only move in one direction. By following the directions given by the partial derivatives, we can find the lowest point on the hill, which helps make our machine learning model better.