# Gradient

> Source: https://aiwiki.ai/wiki/gradient
> Updated: 2026-07-13
> Categories: Machine Learning, Mathematics, Training & Optimization
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

In [machine learning](/wiki/machine_learning), the **gradient** is the vector of [partial derivatives](/wiki/partial_derivative) of a [loss function](/wiki/loss_function) with respect to every model parameter, and it points in the direction in which the loss increases most steeply. Training works by repeatedly stepping the parameters in the direction of the negative gradient, which reduces the loss fastest; this is [gradient descent](/wiki/gradient_descent). For a deep [neural network](/wiki/neural_network) with millions of parameters, the entire gradient of the scalar loss is computed in a single backward pass using reverse-mode automatic differentiation, the algorithm known in this setting as [backpropagation](/wiki/backpropagation). In vector calculus more broadly, the gradient of a scalar-valued differentiable function is a vector field whose direction is the direction of steepest ascent and whose magnitude equals the rate of fastest increase.

## Definition and mathematical notation

Formally, the gradient of a scalar-valued function $$f(x)$$, where $$x = (x_1, x_2, \ldots, x_n)$$ is a vector in $$\mathbb{R}^n$$, is the vector of all [partial derivatives](/wiki/partial_derivative) of $$f$$ with respect to each variable:

$$
\nabla f(x) = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right)
$$

The symbol $$\nabla$$ (nabla) denotes the gradient operator. Each component $$\partial f / \partial x_i$$ is the partial derivative of $$f$$ with respect to $$x_i$$, measuring how $$f$$ changes when $$x_i$$ varies while all other variables are held constant.

For example, given a function $$f(x, y) = x^2 + 3xy$$, the gradient is:

$$
\nabla f = (2x + 3y, 3x)
$$
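
As a quick numerical check (the point $$(1, 2)$$ is chosen here purely for illustration), substituting into the gradient above gives:

$$
\nabla f(1, 2) = (2 \cdot 1 + 3 \cdot 2,\; 3 \cdot 1) = (8, 3), \qquad \lVert \nabla f(1, 2) \rVert = \sqrt{8^2 + 3^2} = \sqrt{73} \approx 8.54
$$

So at that point $$f$$ rises fastest along $$(8, 3)$$, at a rate of about 8.54 per unit distance.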

The gradient generalizes the single-variable derivative to functions of multiple variables. When a function has only one input, the gradient reduces to the ordinary derivative $$f'(x)$$.

## Geometric interpretation

The gradient carries two essential pieces of geometric information:

- **Direction.** The gradient vector at a point $$p$$ points in the direction along which $$f$$ increases most rapidly. Conversely, the negative gradient points in the direction of steepest descent. This follows because the directional derivative $$\nabla f \cdot v$$ is maximized when the unit direction $$v$$ is parallel to $$\nabla f$$, since the cosine of the angle between them is then 1 [11].
- **Magnitude.** The length of the gradient vector equals the maximum rate of change of $$f$$ at that point. A large gradient magnitude means the function is changing steeply, while a gradient near zero indicates a relatively flat region.

The gradient is also closely related to directional derivatives. The directional derivative of $$f$$ in the direction of a unit vector $$v$$ is given by the dot product $$\nabla f \cdot v$$. This means that the rate of change in any direction can be computed by projecting the gradient onto that direction.
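
The projection property can be checked numerically. The following is a minimal sketch, assuming the example function $$f(x, y) = x^2 + 3xy$$ from earlier and an arbitrarily chosen point $$(1, 2)$$; the helper names are illustrative only. It scans unit vectors at one-degree spacing and confirms that none gives a larger directional derivative than the gradient direction.

```python
import math

def grad_f(x, y):
    # Analytic gradient of f(x, y) = x**2 + 3*x*y
    return (2 * x + 3 * y, 3 * x)

def directional_derivative(grad, angle):
    # Dot product of the gradient with the unit vector at the given angle
    v = (math.cos(angle), math.sin(angle))
    return grad[0] * v[0] + grad[1] * v[1]

g = grad_f(1.0, 2.0)                 # gradient at the illustrative point (1, 2)
g_norm = math.hypot(g[0], g[1])      # magnitude of the gradient

# Scan unit directions at one-degree spacing and keep the steepest one
angles = [math.radians(d) for d in range(360)]
best = max(angles, key=lambda a: directional_derivative(g, a))

# Angle of the gradient vector itself
grad_angle = math.atan2(g[1], g[0])

print(math.degrees(best), math.degrees(grad_angle), g_norm)
```

The best scanned angle lands within one degree of the gradient's own angle, and the directional derivative there is close to the gradient's magnitude.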

At a point where $$\nabla f = 0$$, the function has a critical point, which may be a local minimum, a local maximum, or a saddle point.

## How is the gradient used in optimization?

Optimization is the process of finding the input values that minimize (or maximize) a given function. The gradient is the cornerstone of first-order optimization methods, which use only the gradient (first derivatives) to guide the search for optimal parameters.

### Gradient descent

[Gradient descent](/wiki/gradient_descent) is the most widely used optimization algorithm in machine learning. The basic update rule is:

$$
\theta_{\text{new}} = \theta_{\text{old}} - \eta \nabla L(\theta_{\text{old}})
$$

Here, $$\theta$$ represents the model parameters, $$L(\theta)$$ is the loss function, and $$\eta$$ is the [learning rate](/wiki/learning_rate) controlling the step size. Because the negative gradient points toward steepest descent, each step moves the parameters in the direction that reduces the loss most rapidly (locally).
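
The update rule can be sketched in a few lines. This example assumes a hypothetical two-parameter quadratic loss with its minimum at $$(3, -1)$$ and a learning rate of 0.1; both are chosen for illustration and are not tied to any particular model.

```python
def loss(theta):
    # Illustrative loss: (theta0 - 3)^2 + 2 * (theta1 + 1)^2, minimized at (3, -1)
    return (theta[0] - 3.0) ** 2 + 2.0 * (theta[1] + 1.0) ** 2

def grad_loss(theta):
    # Analytic gradient of the loss above
    return [2.0 * (theta[0] - 3.0), 4.0 * (theta[1] + 1.0)]

def gradient_descent(theta, eta=0.1, steps=100):
    # Repeatedly apply theta_new = theta_old - eta * grad L(theta_old)
    for _ in range(steps):
        g = grad_loss(theta)
        theta = [t - eta * gi for t, gi in zip(theta, g)]
    return theta

theta_final = gradient_descent([0.0, 0.0])
print(theta_final, loss(theta_final))
```

With this step size each coordinate error shrinks by a constant factor per step, so the iterates approach the minimizer. A much larger learning rate would make the same loop diverge.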

Augustin-Louis Cauchy first proposed the method of steepest descent for solving the systems of equations that arise in astronomical calculations, in a note titled "Méthode générale pour la résolution des systèmes d'équations simultanées" presented to the French Academy of Sciences on October 18, 1847 [1]. Jacques Hadamard independently proposed a similar approach in 1907. The method remained primarily a tool for numerical analysis until the rise of machine learning in the late 20th century, when it became the dominant paradigm for training models.

### Stochastic gradients and mini-batches

In practice, computing the exact gradient over an entire dataset is expensive. Stochastic gradient descent (SGD) approximates the true gradient by computing it on a single randomly chosen training example or a small subset called a mini-batch. The mini-batch gradient is a noisy but unbiased estimate of the full gradient:

$$
\nabla L(\theta) \approx \frac{1}{\lvert B \rvert} \sum_{i \in B} \nabla L_i(\theta)
$$

where $$B$$ is the mini-batch. This approximation introduces variance into the gradient estimates, but it provides substantial computational savings and can even help the optimization escape shallow local minima. Mini-batch SGD is the standard approach for training [neural networks](/wiki/neural_network) [7].
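
The relationship between mini-batch and full gradients can be illustrated with a toy problem. The sketch below assumes a hypothetical one-parameter model $$y \approx w x$$ with a squared-error loss and a synthetic 12-example dataset. It shows that individual mini-batch gradients differ from the full gradient, while their average over an equal-sized partition of the data recovers it.

```python
import random

# Synthetic dataset of 12 (x, y) pairs; values are illustrative only
data = [(float(x), 2.0 * x + 1.0) for x in range(1, 13)]

def example_grad(w, x, y):
    # Derivative of the per-example loss (w * x - y)^2 with respect to w
    return 2.0 * (w * x - y) * x

def batch_grad(w, batch):
    # Average of per-example gradients over a batch
    return sum(example_grad(w, x, y) for x, y in batch) / len(batch)

w = 0.5
full = batch_grad(w, data)  # exact gradient over the whole dataset

# Shuffle, then split into mini-batches of 4 examples each
rng = random.Random(0)
shuffled = data[:]
rng.shuffle(shuffled)
batches = [shuffled[i:i + 4] for i in range(0, len(shuffled), 4)]

mini = [batch_grad(w, b) for b in batches]   # noisy estimates
mean_of_mini = sum(mini) / len(mini)         # averages back to the full gradient

print(full, mini, mean_of_mini)
```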

### Gradient-based optimizers

Several advanced [optimizers](/wiki/optimizer) build on the basic gradient descent idea by incorporating additional information:

| Optimizer | Key idea | Year introduced |
|---|---|---|
| SGD with Momentum | Accumulates a velocity vector from past gradients to accelerate convergence and dampen oscillations | 1964 (Polyak) |
| Adagrad | Adapts the learning rate per parameter based on the sum of historical squared gradients | 2011 (Duchi et al.) |
| RMSProp | Uses an exponentially decaying average of squared gradients to adapt learning rates | 2012 (Hinton) |
| Adam | Combines momentum (first moment) with RMSProp-style adaptive rates (second moment), plus bias correction | 2014 (Kingma & Ba) |
| AdamW | Decouples weight decay from the adaptive learning rate for better regularization | 2017 (Loshchilov & Hutter) |

The momentum method dates to Boris Polyak's 1964 "heavy ball" paper in *USSR Computational Mathematics and Mathematical Physics* [12]. Adagrad was introduced by Duchi, Hazan, and Singer in the *Journal of Machine Learning Research*, where its central idea is to give "frequently occurring features very low learning rates and infrequent features high learning rates" [13]. RMSProp was never formally published; Geoffrey Hinton presented it in Lecture 6 of his 2012 Coursera course as the rule "divide the gradient by a running average of its recent magnitude" [14]. Adam, the most common default today, ships with the standard hyperparameters $$\eta = 0.001$$, $$\beta_1 = 0.9$$, and $$\beta_2 = 0.999$$; its authors describe the method as one that "is straightforward to implement, is computationally efficient, has little memory requirements, is invariant to diagonal rescaling of the gradients, and is well suited for problems that are large in terms of data and/or parameters" [4]. Adam is frequently the default choice because of its robustness across a wide variety of tasks, though SGD with momentum can achieve better generalization in some settings [10].
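
To make the table's description of Adam concrete, here is a minimal pure-Python sketch of the update with bias correction, using the default hyperparameters quoted above. The function name `adam_minimize`, the small constant `eps`, and the quadratic test loss are assumptions made for illustration; this is not the code of any particular library.

```python
import math

def adam_minimize(grad_fn, theta, eta=0.001, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=1000):
    m = [0.0] * len(theta)  # first-moment estimate (momentum)
    v = [0.0] * len(theta)  # second-moment estimate (squared gradients)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        # Exponentially decaying averages of the gradient and its square
        m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
        v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
        # Bias correction compensates for the zero initialization of m and v
        m_hat = [mi / (1 - beta1 ** t) for mi in m]
        v_hat = [vi / (1 - beta2 ** t) for vi in v]
        # Per-parameter step scaled by the root of the second moment
        theta = [th - eta * mh / (math.sqrt(vh) + eps)
                 for th, mh, vh in zip(theta, m_hat, v_hat)]
    return theta

def quad_grad(theta):
    # Gradient of the illustrative loss theta0^2
    return [2.0 * theta[0]]

after_one = adam_minimize(quad_grad, [1.0], steps=1)
after_fifty = adam_minimize(quad_grad, [1.0], eta=0.01, steps=50)
print(after_one, after_fifty)
```

On the first step the bias-corrected ratio has magnitude close to 1, so the parameter moves by roughly $$\eta$$ regardless of the gradient's scale. This is one way to see the rescaling invariance mentioned in the quotation.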

## How are gradients computed in practice?

There are three principal ways to compute gradients in practice.

### Analytical differentiation

The gradient is derived symbolically using the rules of calculus (chain rule, product rule, etc.). This approach yields exact results and is the most efficient when a closed-form expression is available, but it becomes impractical for complex computational graphs with millions of operations.

### Numerical differentiation (finite differences)

The partial derivative with respect to $$x_i$$ is approximated by evaluating the function at two nearby points:

$$
\frac{\partial f}{\partial x_i} \approx \frac{f(x + h e_i) - f(x - h e_i)}{2h}
$$

where $$h$$ is a small step size and $$e_i$$ is the unit vector in the $$i$$-th direction. This central difference formula is simple to implement and useful for gradient checking, but it is slow (requiring $$2n$$ function evaluations for $$n$$ parameters) and susceptible to numerical errors from choosing $$h$$ too large or too small.
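
A central-difference gradient check might look like the following sketch. The step size `h = 1e-5` and the test function (the earlier example $$f(x, y) = x^2 + 3xy$$) are illustrative choices, and `numerical_gradient` is a hypothetical helper name.

```python
def numerical_gradient(f, x, h=1e-5):
    # Central differences: two function evaluations per coordinate (2n total)
    grad = []
    for i in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[i] += h
        x_minus[i] -= h
        grad.append((f(x_plus) - f(x_minus)) / (2.0 * h))
    return grad

def f(p):
    x, y = p
    return x ** 2 + 3 * x * y

approx = numerical_gradient(f, [1.0, 2.0])
exact = [2 * 1.0 + 3 * 2.0, 3 * 1.0]  # analytic gradient (2x + 3y, 3x) at (1, 2)
print(approx, exact)
```

In a gradient check, the analytic or automatically computed gradient is compared against `approx` and a large discrepancy signals a bug in the derivative code.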

### Automatic differentiation

Automatic differentiation (AD) computes exact derivatives by decomposing a program into elementary operations, each with a known derivative, and combining them via the chain rule. AD avoids both the symbolic complexity of analytical differentiation and the approximation errors of numerical methods [8].

AD has two primary modes:

| Mode | Direction | Cost | Best when |
|---|---|---|---|
| Forward mode | Propagates derivatives from inputs to outputs alongside the function evaluation | One pass per input variable | Few inputs, many outputs |
| Reverse mode | Evaluates the function forward, then propagates derivatives backward from outputs to inputs | One pass per output variable | Many inputs, few outputs |

Reverse-mode AD is the method of choice for training neural networks, because a network typically has millions of parameters (inputs to the loss function) but produces a single scalar loss (one output). Reverse-mode AD computes the gradient of this scalar loss with respect to all parameters in a single backward pass, making it extremely efficient. The reverse mode of automatic differentiation was first published by Seppo Linnainmaa in his 1970 master's thesis, and modern deep learning frameworks are built on this method [8].
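
The idea of reverse mode can be sketched with a tiny scalar "tape". The `Var` class and `backward` function below are hypothetical and support only addition and multiplication; real frameworks are far more general. The example differentiates $$f(x, y) = x^2 + 3xy$$ at $$(1, 2)$$ and obtains both partial derivatives from one backward sweep.

```python
class Var:
    """Scalar node in a computation graph (illustrative, not a library API)."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # tuple of (parent Var, local derivative)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

def backward(output):
    # Build a topological order so each node is processed after its consumers
    order, seen = [], set()

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)

    visit(output)
    output.grad = 1.0
    # Single backward sweep: push each node's gradient to its parents
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += node.grad * local

x = Var(1.0)
y = Var(2.0)
three = Var(3.0)
f = x * x + three * x * y   # forward pass records the graph
backward(f)                 # one backward pass fills in every gradient
print(f.value, x.grad, y.grad)
```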

## What is backpropagation?

[Backpropagation](/wiki/backpropagation) is the specific application of reverse-mode automatic differentiation to neural networks. The algorithm was popularized by Rumelhart, Hinton, and Williams in their landmark 1986 paper in *Nature*, though earlier formulations existed [2]. They described "a new learning procedure, back-propagation, for networks of neurone-like units" that "repeatedly adjusts the weights of the connections in the network so as to minimize a measure of the difference between the actual output vector of the net and the desired output vector" [2].

Backpropagation works in two phases:

1. **Forward pass.** Input data flows through the network layer by layer, producing activations at each layer and ultimately the loss value.
2. **Backward pass.** Starting from the loss, gradients are propagated backward through each layer using the chain rule. At each layer, the algorithm computes the gradient of the loss with respect to the layer's weights and the gradient with respect to the layer's inputs (which becomes the incoming gradient for the previous layer).

For a network with layers $$f_1, f_2, \ldots, f_N$$ and loss $$L$$, the chain rule gives:

$$
\frac{\partial L}{\partial W_k} = \frac{\partial L}{\partial a_N} \cdot \frac{\partial a_N}{\partial a_{N-1}} \cdot \ldots \cdot \frac{\partial a_{k+1}}{\partial a_k} \cdot \frac{\partial a_k}{\partial W_k}
$$

where $$a_k$$ denotes the activation at layer $$k$$, $$W_k$$ denotes the weights of layer $$k$$, and $$N$$ is the number of layers.
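
The two phases and the chain-rule product above can be traced by hand on a very small network. The sketch below assumes a hypothetical two-layer scalar network, $$a_1 = \sigma(w_1 x)$$ and $$a_2 = w_2 a_1$$, with a squared-error loss; the input, target, and weight values are arbitrary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward_backward(x, target, w1, w2):
    # Forward pass: compute and keep the activations
    a1 = sigmoid(w1 * x)
    a2 = w2 * a1
    loss = 0.5 * (a2 - target) ** 2

    # Backward pass: apply the chain rule from the loss back to each weight
    dloss_da2 = a2 - target
    dloss_dw2 = dloss_da2 * a1                     # gradient for layer 2 weight
    dloss_da1 = dloss_da2 * w2                     # passed to the previous layer
    dloss_dw1 = dloss_da1 * a1 * (1.0 - a1) * x    # sigmoid'(z) = a1 * (1 - a1)
    return loss, dloss_dw1, dloss_dw2

loss_value, g1, g2 = forward_backward(x=1.5, target=0.2, w1=0.4, w2=-0.7)
print(loss_value, g1, g2)
```

Note how `dloss_da1` plays the role of the incoming gradient for the earlier layer, exactly as described in the backward-pass step.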

Modern deep learning frameworks such as PyTorch and TensorFlow implement backpropagation automatically, constructing a computational graph during the forward pass and then traversing it in reverse to compute gradients.

## Gradient problems in deep networks

### Vanishing gradients

The [vanishing gradient problem](/wiki/vanishing_gradient_problem) occurs when gradients become exponentially small as they are propagated backward through many layers. This causes the weights of early layers to receive negligible updates, effectively preventing them from learning. The problem is especially severe with saturating activation functions such as sigmoid or tanh: the sigmoid derivative never exceeds 0.25 and the tanh derivative is below 1 everywhere except at zero, so repeated multiplication of these factors drives the gradient toward zero [7].
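
The effect of repeated multiplication is easy to see numerically. The following sketch assumes a simplified chain of scalar sigmoid layers that all share the same pre-activation `z` and weight; real networks vary per layer, so this is an illustration rather than a model of any specific architecture.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    # Derivative of the sigmoid; its maximum value is 0.25, reached at z = 0
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop_factor(depth, z=0.0, weight=1.0):
    # Product of per-layer factors weight * sigmoid'(z) across `depth` layers
    factor = 1.0
    for _ in range(depth):
        factor *= weight * sigmoid_grad(z)
    return factor

for depth in (1, 5, 10, 20):
    print(depth, backprop_factor(depth))
```

Even in the best case for the sigmoid (`z = 0`), ten layers scale the gradient by $$0.25^{10}$$, which is already below one millionth.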

Sepp Hochreiter first analyzed this problem in his 1991 diploma thesis at the Technical University of Munich, supervised by Jürgen Schmidhuber, and it remained a major obstacle to training deep networks for years [3].

### Exploding gradients

The [exploding gradient problem](/wiki/exploding_gradient_problem) is the opposite: gradients grow exponentially large during backpropagation, causing weight updates that are so large they destabilize training. This is particularly common in recurrent neural networks (RNNs), where the same weight matrix is applied repeatedly across time steps [9].

### How do you fix vanishing and exploding gradients?

Researchers have developed a range of techniques to address gradient pathologies:

| Technique | How it helps |
|---|---|
| ReLU activation | Derivative is 1 for positive inputs, avoiding the saturation problem of sigmoid/tanh |
| Careful weight initialization (Xavier, He) | Sets initial weight scales so that activations and gradients maintain stable variance across layers |
| Batch normalization | Normalizes activations within each layer, stabilizing the distribution of gradients |
| Residual connections (skip connections) | Allow gradients to flow directly through shortcut paths, bypassing many multiplicative layers |
| LSTM and GRU gating | Gating mechanisms in recurrent networks explicitly control gradient flow across time steps |
| Gradient clipping | Caps the gradient norm or individual gradient values at a maximum threshold before applying updates |

## What is gradient clipping?

Gradient clipping is a practical technique for preventing exploding gradients. There are two common variants:

- **Norm clipping.** If the global norm of the gradient vector exceeds a threshold $$c$$, the gradient is rescaled so that its norm equals $$c$$:

  if $$\lVert \nabla L \rVert > c$$: $$\nabla L \leftarrow c \cdot \nabla L / \lVert \nabla L \rVert$$

- **Value clipping.** Each individual gradient component is clipped to lie within $$[-c, c]$$.

Norm clipping is generally preferred because it preserves the relative direction of the gradient vector, whereas value clipping can distort it. Gradient clipping is standard practice when training RNNs, transformers, and other architectures prone to gradient instability [9].
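
Both variants can be sketched on a plain list of gradient components. The helper names `clip_by_norm` and `clip_by_value` are illustrative and do not refer to a specific framework's API.

```python
import math

def clip_by_norm(grad, c):
    # Rescale the whole vector if its Euclidean norm exceeds c
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > c:
        return [g * c / norm for g in grad]
    return list(grad)

def clip_by_value(grad, c):
    # Clamp each component independently to the interval [-c, c]
    return [max(-c, min(c, g)) for g in grad]

grad = [3.0, 4.0]                      # norm is 5
by_norm = clip_by_norm(grad, 1.0)      # same direction, norm 1
by_value = clip_by_value(grad, 1.0)    # each component capped at 1
print(by_norm, by_value)
```

In this example norm clipping keeps the 3:4 ratio between components, while value clipping flattens both to 1 and so changes the direction of the update.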

## What is gradient accumulation?

Gradient accumulation is a memory-saving technique that enables training with effectively larger batch sizes than the hardware can hold in memory at once. Instead of updating the model after every mini-batch, gradients from several consecutive mini-batches are summed (accumulated), and the parameter update is applied only after the desired number of accumulation steps.

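For example, if the GPU can hold a batch of 8 samples but the target effective batch size is 32, the training loop accumulates gradients over 4 mini-batches before performing one optimizer step. Provided the accumulated gradients are averaged over the 4 mini-batches (equivalently, each mini-batch loss is scaled by 1/4) and the model has no batch-dependent layers such as batch normalization, the result matches training with a batch of 32, though the wall-clock time is longer because the mini-batches are processed sequentially.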

Gradient accumulation is widely used when training large language models and other memory-intensive architectures.
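
The 8-versus-32 example can be reproduced on a toy problem. This sketch assumes a hypothetical one-parameter model with a mean squared-error loss and synthetic data, and it scales each micro-batch gradient by the number of accumulation steps so that the accumulated value is an average rather than a raw sum.

```python
def example_grad(w, x, y):
    # Derivative of the per-example loss (w * x - y)^2 with respect to w
    return 2.0 * (w * x - y) * x

data = [(float(x), 3.0 * x) for x in range(1, 33)]  # 32 synthetic examples
w = 1.0

# Reference: gradient of the mean loss over the full batch of 32
full_grad = sum(example_grad(w, x, y) for x, y in data) / len(data)

# Accumulation: process micro-batches of 8 and average their gradients
micro_batch_size = 8
accum_steps = len(data) // micro_batch_size
accumulated = 0.0
for step in range(accum_steps):
    start = step * micro_batch_size
    micro = data[start:start + micro_batch_size]
    micro_grad = sum(example_grad(w, x, y) for x, y in micro) / len(micro)
    accumulated += micro_grad / accum_steps  # scale so the total is a mean

# A single optimizer step would now use `accumulated`
print(full_grad, accumulated)
```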

## What is gradient checkpointing?

Gradient checkpointing (also called activation checkpointing) is a memory optimization technique that trades computation for memory during backpropagation. During the forward pass, only a subset of intermediate activations are saved to memory. During the backward pass, the activations that were not saved are recomputed on the fly from the nearest saved checkpoint.

The technique was formalized by Chen, Xu, Zhang, and Guestrin in 2016, who designed an algorithm "to cost $$O(\sqrt{n})$$ memory to train a n layer network, with only the computational cost of an extra forward pass per mini-batch" [6]. In their experiments this cut the memory needed to train a deep residual network on ImageNet from roughly 48 GB to 7 GB, at the cost of about 30 percent additional runtime [6]. The approach has since become standard in training large-scale transformer models such as BERT and GPT, where the memory required to store all intermediate activations would otherwise exceed available GPU memory.

Gradient accumulation and gradient checkpointing are complementary: accumulation reduces memory demands from batch size, while checkpointing reduces memory demands from model depth.
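
The recomputation trade-off can be sketched on a chain of scalar layers. The example below is a simplified illustration, not the algorithm of Chen et al.: it assumes hypothetical layers of the form $$\tanh(w a)$$, saves only every fourth layer input during the forward pass, and recomputes the rest one segment at a time during the backward pass. The "stored values" figures are a rough count of activations held at once.

```python
import math

def layer(a, w):
    return math.tanh(w * a)

def layer_input_grad(a_in, w):
    # Derivative of tanh(w * a_in) with respect to a_in
    return w * (1.0 - math.tanh(w * a_in) ** 2)

def grad_full_storage(x, weights):
    # Baseline: keep every activation from the forward pass
    acts = [x]
    for w in weights:
        acts.append(layer(acts[-1], w))
    grad = 1.0
    for i in reversed(range(len(weights))):
        grad *= layer_input_grad(acts[i], weights[i])
    return grad, len(acts)

def grad_checkpointed(x, weights, every):
    # Forward pass: save only the inputs of layers 0, every, 2 * every, ...
    checkpoints = {}
    a = x
    for i, w in enumerate(weights):
        if i % every == 0:
            checkpoints[i] = a
        a = layer(a, w)
    grad = 1.0
    peak = len(checkpoints)
    # Backward pass: walk the segments from last to first
    for start in sorted(checkpoints, reverse=True):
        end = min(start + every, len(weights))
        # Recompute the layer inputs inside this segment from its checkpoint
        seg = [checkpoints[start]]
        for i in range(start, end - 1):
            seg.append(layer(seg[-1], weights[i]))
        peak = max(peak, len(checkpoints) + len(seg))
        for i in reversed(range(start, end)):
            grad *= layer_input_grad(seg[i - start], weights[i])
    return grad, peak

weights = [0.9 + 0.01 * i for i in range(16)]
g_full, stored_full = grad_full_storage(0.5, weights)
g_ckpt, peak_ckpt = grad_checkpointed(0.5, weights, every=4)
print(g_full, g_ckpt, stored_full, peak_ckpt)
```

Both functions return the same derivative of the final output with respect to the input, but the checkpointed version holds fewer values at its peak in exchange for recomputing the forward pass within each segment.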

## Gradient flow in deep networks

Gradient flow refers to how gradient signals propagate from the loss function back through the layers of a network during backpropagation. Healthy gradient flow means that every layer receives a gradient signal of sufficient magnitude to learn effectively.

Residual networks (ResNets) dramatically improved gradient flow by introducing skip connections that create shortcut paths for gradients. In a residual block, the output is $$f(x) + x$$, so during backpropagation the gradient has an additive path that bypasses the learned transformation entirely. This architectural innovation, introduced by He, Zhang, Ren, and Sun in 2015, enabled the training of networks up to 152 layers deep, "8 times deeper than VGG nets but still having lower complexity" [5]. An ensemble of these residual networks reached 3.57 percent top-5 error on the ImageNet test set and won 1st place in the ILSVRC 2015 classification task [5].
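
The additive path can be made explicit with one application of the chain rule. As a sketch, writing the block output as $$y = f(x) + x$$ and using $$I$$ for the identity matrix (notation introduced here for illustration):

$$
\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \left( \frac{\partial f}{\partial x} + I \right) = \frac{\partial L}{\partial y} \frac{\partial f}{\partial x} + \frac{\partial L}{\partial y}
$$

The second term passes the upstream gradient through unchanged, so the signal reaching $$x$$ is not forced toward zero even when $$\partial f / \partial x$$ is small.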

Batch normalization further supports gradient flow by keeping activations in a well-conditioned range; it was originally proposed to reduce the internal covariate shift that can slow or destabilize learning.

## Jacobian and Hessian matrices

The gradient is one member of a family of derivative objects used in optimization and machine learning.

**Jacobian matrix.** For a vector-valued function $$f: \mathbb{R}^n \to \mathbb{R}^m$$, the Jacobian is the $$m \times n$$ matrix of all first-order partial derivatives. Each row of the Jacobian is the gradient of one component of the output. When the function is scalar-valued ($$m = 1$$), the Jacobian reduces to the gradient (as a row vector).

**Hessian matrix.** For a scalar-valued function $$f: \mathbb{R}^n \to \mathbb{R}$$, the Hessian is the $$n \times n$$ matrix of second-order partial derivatives. The Hessian captures the curvature of $$f$$ and is the Jacobian of the gradient. Its eigenvalues at a critical point reveal whether the point is a local minimum (all positive eigenvalues), a local maximum (all negative eigenvalues), or a saddle point (mixed signs).
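
The eigenvalue test can be sketched for functions of two variables, where a symmetric $$2 \times 2$$ Hessian has closed-form eigenvalues. The helper names are hypothetical, and the sketch reports "inconclusive" when an eigenvalue is zero, a case in which the second-order test alone does not decide the type of critical point.

```python
import math

def eigenvalues_2x2_symmetric(a, b, d):
    # Eigenvalues of the symmetric matrix [[a, b], [b, d]]
    mean = (a + d) / 2.0
    radius = math.sqrt(((a - d) / 2.0) ** 2 + b * b)
    return mean - radius, mean + radius

def classify_critical_point(a, b, d):
    lo, hi = eigenvalues_2x2_symmetric(a, b, d)
    if lo > 0:
        return "local minimum"
    if hi < 0:
        return "local maximum"
    if lo < 0 < hi:
        return "saddle point"
    return "inconclusive"

# Hessian of x^2 + y^2 is [[2, 0], [0, 2]]
print(classify_critical_point(2.0, 0.0, 2.0))
# Hessian of x^2 + 3xy is [[2, 3], [3, 0]]
print(classify_critical_point(2.0, 3.0, 0.0))
# Hessian of -(x^2 + y^2) is [[-2, 0], [0, -2]]
print(classify_critical_point(-2.0, 0.0, -2.0))
```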

Second-order optimization methods, such as Newton's method, use the Hessian to take more informed steps than gradient descent. However, computing and storing the full Hessian requires $$O(n^2)$$ memory, which is intractable for models with millions of parameters. Practical approximations include quasi-Newton methods (such as L-BFGS), which build a low-memory curvature estimate from recent gradient differences, and Hessian-free optimization, which relies on Hessian-vector products without forming the full matrix.

## When was the gradient method developed?

The mathematical foundations of the gradient trace back to the development of multivariable calculus in the 18th and 19th centuries. Key milestones include:

- **1740s.** Leonhard Euler and others developed the theory of partial derivatives and functions of several variables.
- **1847.** Augustin-Louis Cauchy proposed the method of steepest descent (gradient descent), the first systematic algorithm for minimizing a function by following the negative gradient [1].
- **1907.** Jacques Hadamard independently described a similar iterative method.
- **1964.** Robert E. Wengert published an early description of forward-mode automatic differentiation in *Communications of the ACM*, introducing the evaluation trace now called the Wengert list [8].
- **1970.** Seppo Linnainmaa introduced the reverse mode of automatic differentiation in his master's thesis, the method that underlies modern backpropagation [8].
- **1986.** Rumelhart, Hinton, and Williams published their influential paper on backpropagation, making gradient computation in neural networks practical and widely known [2].
- **2010s.** The development of frameworks like Theano (2010), TensorFlow (2015), and PyTorch (2016) made automatic gradient computation accessible, fueling the deep learning revolution.
- **2014-present.** Advanced optimizers such as Adam (2014) and training techniques including gradient clipping, accumulation, and checkpointing became standard practice for training increasingly large models [4].

## Explain like I'm 5 (ELI5)

Imagine you are standing on a hilly field with your eyes closed, and you want to walk to the lowest point. You can feel the ground under your feet, and you can tell which direction slopes downhill the most steeply. That "feel" for the steepest downhill direction is what the gradient tells a computer.

In machine learning, the computer has a "hill" (the loss function) that measures how wrong its answers are. The gradient is like an arrow that says: "If you change your settings this way, your mistakes will shrink the fastest." The computer takes a small step in that direction, checks the gradient again, and repeats. Over many steps, it walks down the hill to find settings that make good predictions. This process is called gradient descent.

## See also

- [Gradient descent](/wiki/gradient_descent)
- [Backpropagation](/wiki/backpropagation)
- [Loss function](/wiki/loss_function)
- [Vanishing gradient problem](/wiki/vanishing_gradient_problem)
- [Exploding gradient problem](/wiki/exploding_gradient_problem)
- [Optimizer](/wiki/optimizer)
- [Learning rate](/wiki/learning_rate)
- [Automatic differentiation](/wiki/automatic_differentiation)

## References

1. Cauchy, A. (1847). "Méthode générale pour la résolution des systèmes d'équations simultanées." *Comptes Rendus de l'Académie des Sciences*, 25, 536-538.
2. Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). "Learning representations by back-propagating errors." *Nature*, 323(6088), 533-536.
3. Hochreiter, S. (1991). "Untersuchungen zu dynamischen neuronalen Netzen." Diploma thesis, Technische Universität München.
4. Kingma, D. P., & Ba, J. (2014). "Adam: A Method for Stochastic Optimization." *arXiv preprint arXiv:1412.6980*.
5. He, K., Zhang, X., Ren, S., & Sun, J. (2016). "Deep Residual Learning for Image Recognition." *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*, 770-778.
6. Chen, T., Xu, B., Zhang, C., & Guestrin, C. (2016). "Training Deep Nets with Sublinear Memory Cost." *arXiv preprint arXiv:1604.06174*.
7. Goodfellow, I., Bengio, Y., & Courville, A. (2016). *Deep Learning*. MIT Press. Chapter 4: Numerical Computation; Chapter 6: Deep Feedforward Networks.
8. Baydin, A. G., Pearlmutter, B. A., Radul, A. A., & Siskind, J. M. (2018). "Automatic Differentiation in Machine Learning: a Survey." *Journal of Machine Learning Research*, 18(153), 1-43.
9. Pascanu, R., Mikolov, T., & Bengio, Y. (2013). "On the difficulty of training recurrent neural networks." *Proceedings of the 30th International Conference on Machine Learning (ICML)*, 1310-1318.
10. Ruder, S. (2016). "An overview of gradient descent optimization algorithms." *arXiv preprint arXiv:1609.04747*.
11. Sootla, S. (2017). "Proof that the gradient points in the direction of steepest ascent." Available at sootlasten.github.io.
12. Polyak, B. T. (1964). "Some methods of speeding up the convergence of iteration methods." *USSR Computational Mathematics and Mathematical Physics*, 4(5), 1-17.
13. Duchi, J., Hazan, E., & Singer, Y. (2011). "Adaptive Subgradient Methods for Online Learning and Stochastic Optimization." *Journal of Machine Learning Research*, 12, 2121-2159.
14. Tieleman, T., & Hinton, G. (2012). "Lecture 6.5 - RMSProp: Divide the gradient by a running average of its recent magnitude." COURSERA: Neural Networks for Machine Learning.