Gradient descent is a first-order iterative optimization algorithm for finding the minimum of a differentiable function. It is the most widely used optimization method in machine learning and deep learning, forming the backbone of how neural networks learn from data. The algorithm works by repeatedly adjusting parameters in the direction opposite to the gradient of the objective function at the current point, taking steps whose size is proportional to the gradient's magnitude.
Gradient descent and its variants are responsible for training virtually every modern deep learning model, from convolutional neural networks for image recognition to transformers for natural language processing. Understanding gradient descent is essential for anyone working in artificial intelligence or data science.
Imagine you are standing on a hilly field and you want to get to the lowest point, but you are blindfolded. You can feel the ground under your feet to figure out which direction slopes downward. At each step, you move in the steepest downhill direction you can feel. You keep doing this over and over, and eventually you reach the bottom of the valley.
That is what gradient descent does with math. A computer has a big math problem (called a "loss function") and it wants to find the answer that makes the result as small as possible. It checks which direction would make the answer go down the fastest, takes a small step that way, and then checks again. It repeats this many times until it finds a good answer. The size of each step is called the "learning rate." If steps are too big, you might jump right over the lowest point. If steps are too small, it takes forever to get there.
The method of gradient descent was first described by the French mathematician Augustin-Louis Cauchy in 1847. In his memoir "Méthode générale pour la résolution des systèmes d'équations simultanées," Cauchy proposed an iterative process for solving systems of equations that is equivalent to what we now call gradient descent: updating variables by subtracting a positive multiple of partial derivatives.
Jacques Hadamard independently proposed a similar method in 1907. The convergence properties of gradient descent for nonlinear optimization problems were first rigorously studied by Haskell Curry in 1944, establishing the theoretical foundation for the method.
In 1951, Herbert Robbins and Sutton Monro introduced the stochastic approximation framework in their landmark paper "A Stochastic Approximation Method," published in the Annals of Mathematical Statistics. This work laid the groundwork for stochastic gradient descent, which would later become the dominant training algorithm for neural networks.
Boris Polyak introduced the momentum method (also called the "heavy ball" method) in 1964, drawing an analogy to a heavy ball rolling through a potential field with friction. Yurii Nesterov proposed his accelerated gradient method in 1983, achieving the optimal convergence rate for first-order methods on convex functions.
The connection between gradient descent and neural network training was solidified in 1986, when David Rumelhart, Geoffrey Hinton, and Ronald Williams demonstrated that backpropagation could efficiently compute gradients for multi-layer networks, enabling practical training of deep architectures with gradient-based methods.
The development of adaptive learning rate methods accelerated in the 2010s: AdaGrad (Duchi et al., 2011), RMSProp (Hinton, 2012), and Adam (Kingma and Ba, 2015) each introduced ways to automatically adjust per-parameter learning rates. AdamW, proposed by Loshchilov and Hutter in 2017 (published at ICLR 2019), fixed a subtle issue with weight decay in Adam and became the standard optimizer for training transformer models.
More recently, the Lion optimizer (Chen et al., 2023) was discovered through automated search by Google Brain researchers, demonstrating that optimization algorithms can themselves be found by machine learning techniques.
Gradient descent solves the unconstrained minimization problem:
minimize f(theta)
where f: R^n -> R is a differentiable function and theta is a vector of n parameters. In machine learning, f is typically a loss function that measures the discrepancy between a model's predictions and the true values, averaged over training data.
The core update rule for gradient descent at iteration t is:
theta(t+1) = theta(t) - alpha * nabla f(theta(t))
where theta(t) is the parameter vector at iteration t, alpha > 0 is the learning rate (step size), and nabla f(theta(t)) is the gradient of the objective evaluated at the current point.
The gradient nabla f points in the direction of steepest ascent. By moving in the opposite direction (negative gradient), the algorithm follows the direction of steepest descent, which locally reduces the function value most rapidly.
The rationale for gradient descent comes from the first-order Taylor approximation. Near a point theta(t), the function can be approximated as:
f(theta) approximately equals f(theta(t)) + nabla f(theta(t))^T * (theta - theta(t))
Choosing theta to minimize this linear approximation within a small neighborhood (controlled by the learning rate) leads directly to the gradient descent update. The step size alpha must be small enough that the linear approximation remains valid.
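To make the update rule concrete, here is a minimal sketch in Python with NumPy. The quadratic objective, the matrix A, and the vector b are illustrative choices, not part of any standard API; any differentiable function with a computable gradient would work the same way.

```python
import numpy as np

def gradient_descent(grad_f, theta0, alpha=0.05, num_iters=1000):
    """Iterate theta(t+1) = theta(t) - alpha * nabla f(theta(t))."""
    theta = theta0.astype(float).copy()
    for _ in range(num_iters):
        theta -= alpha * grad_f(theta)
    return theta

# Illustrative objective: f(theta) = 0.5 * ||A @ theta - b||^2
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad_f = lambda theta: A.T @ (A @ theta - b)  # analytic gradient of f

theta_star = gradient_descent(grad_f, theta0=np.zeros(2))
print(theta_star)  # approaches the solution of A @ theta = b
```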
The convergence behavior of gradient descent depends strongly on properties of the objective function, chiefly convexity and the smoothness of the gradient. The table below summarizes convergence rates for several standard function classes.
| Function class | Convergence rate | Iterations to reach epsilon accuracy | Step size requirement |
|---|---|---|---|
| Convex, Lipschitz gradient | O(1/k) | O(1/epsilon) | alpha <= 1/L |
| Strongly convex, Lipschitz gradient | O(c^k) where c = (1 - mu/L) | O(log(1/epsilon)) | alpha <= 1/L |
| Non-convex, Lipschitz gradient | O(1/sqrt(k)) to stationary point | O(1/epsilon^2) | alpha <= 1/L |
| Quadratic (condition number kappa) | O(((kappa-1)/(kappa+1))^k) | O(kappa * log(1/epsilon)) | alpha = 2/(L + mu) |
For non-convex problems (including most deep learning objectives), gradient descent can only guarantee convergence to a stationary point where the gradient is approximately zero. This could be a local minimum, a saddle point, or even a local maximum. However, research has shown that for over-parameterized neural networks, most local minima have loss values close to the global minimum.
In practice, the choice of how much training data to use when computing the gradient at each step defines three primary variants of gradient descent.
Batch gradient descent (also called vanilla gradient descent) computes the gradient of the loss function with respect to the parameters using the entire training dataset:
theta = theta - alpha * nabla J(theta)
where J(theta) is the average loss over all N training examples. Because it uses the full dataset for each update, batch gradient descent produces smooth, stable convergence trajectories and, with a suitably small learning rate, is guaranteed to converge to the global minimum for convex functions. However, it becomes computationally expensive and memory-intensive for large datasets, since every update requires a full pass through the data. It also cannot incorporate new data on the fly (no online learning).
Stochastic gradient descent (SGD) updates parameters after computing the gradient on a single randomly chosen training example:
theta = theta - alpha * nabla J(theta; x_i, y_i)
SGD is much faster per iteration than batch gradient descent and can handle very large or even streaming datasets. The noise introduced by single-sample gradient estimates causes the objective value to fluctuate heavily during training, which can actually help the optimizer escape shallow local minima. However, this same noise makes convergence less stable. To guarantee convergence, the learning rate typically needs to be decreased over time according to a schedule that satisfies the Robbins-Monro conditions: the sum of learning rates diverges (sum alpha_t = infinity), while the sum of squared learning rates converges (sum alpha_t^2 < infinity). For example, alpha_t = alpha_0 / t satisfies both conditions.
Mini-batch gradient descent is the practical compromise and the standard approach used in modern deep learning. It computes gradients over small randomly sampled subsets (mini-batches) of the training data, typically between 32 and 512 examples:
theta = theta - alpha * nabla J(theta; x_{i:i+b}, y_{i:i+b})
where b is the batch size. This approach reduces the variance of gradient estimates compared to SGD while remaining computationally tractable for large datasets. It also enables efficient use of hardware parallelism on GPUs and TPUs, since matrix operations on batches can be highly optimized. In common usage, the term "SGD" often refers to mini-batch gradient descent rather than true single-sample stochastic gradient descent.
| Variant | Data per update | Update noise | Memory usage | Convergence stability | Hardware efficiency |
|---|---|---|---|---|---|
| Batch | Entire dataset | None | Very high | Very stable | Poor (large datasets) |
| Stochastic (SGD) | 1 example | Very high | Very low | Unstable | Poor (no parallelism) |
| Mini-batch | b examples (32-512) | Moderate | Moderate | Balanced | Excellent (GPU/TPU) |
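The three variants differ only in how much data feeds each gradient estimate, so one sketch covers all of them. Below is a minimal mini-batch implementation (with synthetic linear-regression data, an illustrative choice); setting batch_size=1 recovers SGD, and batch_size=N recovers batch gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 1000, 5
X = rng.normal(size=(N, d))                             # synthetic inputs
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=N)   # synthetic targets

def minibatch_gd(X, y, batch_size=64, alpha=0.05, epochs=20):
    """Mini-batch gradient descent on mean squared error."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        perm = rng.permutation(len(y))                  # reshuffle every epoch
        for i in range(0, len(y), batch_size):
            idx = perm[i:i + batch_size]
            residual = X[idx] @ theta - y[idx]
            grad = X[idx].T @ residual / len(idx)       # gradient of the batch loss
            theta -= alpha * grad
    return theta

theta = minibatch_gd(X, y)   # batch_size=1 gives SGD; batch_size=N gives batch GD
```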
Plain gradient descent can be slow when the loss surface has regions where the curvature is much steeper in one direction than another (called "ravines"). In these settings, the gradient oscillates across the narrow dimension while making slow progress along the long dimension. Momentum methods address this by accumulating a velocity vector that smooths out these oscillations.
The heavy ball method adds a fraction of the previous update vector to the current update:
v(t) = gamma * v(t-1) + alpha * nabla J(theta(t))
theta(t+1) = theta(t) - v(t)
where gamma is the momentum coefficient, typically set to 0.9. The momentum term accelerates gradient descent in directions where the gradient consistently points the same way, while dampening oscillations in directions where the gradient frequently changes sign. Physically, this is analogous to a ball rolling downhill that accumulates speed and can roll through small bumps.
Nesterov momentum improves on classical momentum by computing the gradient at a "lookahead" position rather than the current position:
v(t) = gamma * v(t-1) + alpha * nabla J(theta(t) - gamma * v(t-1))
theta(t+1) = theta(t) - v(t)
By evaluating the gradient at the anticipated next position (theta - gamma * v), Nesterov momentum provides a correction factor that reduces overshooting. This lookahead mechanism gives it the optimal convergence rate of O(1/k^2) for convex functions, compared to O(1/k) for standard gradient descent. In practice, Nesterov momentum often converges faster and is less prone to overshooting than classical momentum.
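Both momentum variants fit in one small function. The sketch below is a direct transcription of the update equations above; grad_J stands for any callable returning nabla J at a point, and the defaults follow the values quoted in the text.

```python
import numpy as np

def momentum_step(theta, v, grad_J, alpha=0.01, gamma=0.9, nesterov=False):
    """One heavy-ball or Nesterov momentum update."""
    if nesterov:
        g = grad_J(theta - gamma * v)   # gradient at the lookahead position
    else:
        g = grad_J(theta)               # gradient at the current position
    v = gamma * v + alpha * g           # v(t) = gamma * v(t-1) + alpha * g
    return theta - v, v                 # theta(t+1) = theta(t) - v(t)
```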
A major limitation of standard gradient descent is that a single learning rate is applied to all parameters. Adaptive methods address this by maintaining per-parameter learning rates that are adjusted automatically based on the history of gradients observed for each parameter.
AdaGrad (Adaptive Gradient) accumulates the squared gradients for each parameter and scales the learning rate inversely by the square root of this sum:
G(t) = G(t-1) + g(t)^2 (element-wise)
theta(t+1) = theta(t) - (alpha / (sqrt(G(t)) + epsilon)) * g(t)
where g(t) = nabla J(theta(t)) and epsilon is a small constant (typically 10^-8) for numerical stability.
AdaGrad performs larger updates for infrequent parameters and smaller updates for frequent parameters, making it well suited for sparse data and natural language processing tasks where rare features are informative. However, its main weakness is that the accumulated squared gradients grow monotonically, causing the effective learning rate to shrink continuously and eventually become vanishingly small, stopping learning prematurely.
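A minimal transcription of the AdaGrad update follows; G and g are arrays of the same shape as theta, and the caller keeps the accumulator G across iterations, initialized to zeros.

```python
import numpy as np

def adagrad_step(theta, G, g, alpha=0.01, eps=1e-8):
    """One AdaGrad update: accumulate squared gradients, scale each parameter."""
    G = G + g**2                                     # element-wise accumulation
    theta = theta - alpha * g / (np.sqrt(G) + eps)   # per-parameter step sizes
    return theta, G
```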
Adadelta addresses AdaGrad's diminishing learning rate problem by restricting the accumulation window using an exponentially decaying average of past squared gradients:
E[g^2](t) = rho * E[g^2](t-1) + (1 - rho) * g(t)^2
Delta theta(t) = -(sqrt(E[Delta theta^2](t-1) + epsilon) / sqrt(E[g^2](t) + epsilon)) * g(t)
A distinctive feature of Adadelta is that it eliminates the need to set a base learning rate entirely. Instead, it uses the ratio of the running average of parameter updates to the running average of gradients, making it fully adaptive.
RMSProp (Root Mean Square Propagation) was proposed by Geoffrey Hinton in an unpublished lecture from his Coursera course "Neural Networks for Machine Learning" and addresses the same diminishing learning rate problem as Adadelta, but in a simpler way:
E[g^2](t) = rho * E[g^2](t-1) + (1 - rho) * g(t)^2
theta(t+1) = theta(t) - (alpha / sqrt(E[g^2](t) + epsilon)) * g(t)
Hinton recommended setting rho = 0.9 and alpha = 0.001. RMSProp divides the learning rate by an exponentially decaying average of squared gradients, which prevents the learning rate from decaying too aggressively while still adapting per parameter. It became widely used for training recurrent neural networks and remains a reliable choice.
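Compared with the AdaGrad sketch above, only the accumulator line changes: the monotone sum becomes an exponential moving average. The defaults below follow Hinton's recommendations.

```python
import numpy as np

def rmsprop_step(theta, Eg2, g, alpha=0.001, rho=0.9, eps=1e-8):
    """One RMSProp update: decaying average of squared gradients."""
    Eg2 = rho * Eg2 + (1 - rho) * g**2               # forget old gradients
    theta = theta - alpha * g / np.sqrt(Eg2 + eps)
    return theta, Eg2
```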
Adam (Adaptive Moment Estimation) combines the benefits of momentum (first moment) with adaptive learning rates (second moment). It maintains exponential moving averages of both the gradient and the squared gradient:
m(t) = beta1 * m(t-1) + (1 - beta1) * g(t) (first moment estimate)
v(t) = beta2 * v(t-1) + (1 - beta2) * g(t)^2 (second moment estimate)
Because m and v are initialized to zero, they are biased toward zero in early iterations. Adam applies bias correction:
m_hat(t) = m(t) / (1 - beta1^t)
v_hat(t) = v(t) / (1 - beta2^t)
The parameter update is then:
theta(t+1) = theta(t) - (alpha / (sqrt(v_hat(t)) + epsilon)) * m_hat(t)
Default hyperparameters are beta1 = 0.9, beta2 = 0.999, and epsilon = 10^-8. Adam is effective across a wide range of problems and requires little hyperparameter tuning, making it one of the most popular optimizers in deep learning. However, research has shown that Adam can sometimes converge to suboptimal solutions compared to well-tuned SGD with momentum, particularly in computer vision tasks.
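Putting the moment estimates, bias correction, and update together gives a compact sketch; m and v start as zero arrays, and the step counter t counts from 1 so the corrections are well defined.

```python
import numpy as np

def adam_step(theta, m, v, g, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update with bias correction; t starts at 1."""
    m = beta1 * m + (1 - beta1) * g                  # first moment (momentum)
    v = beta2 * v + (1 - beta2) * g**2               # second moment (scaling)
    m_hat = m / (1 - beta1**t)                       # bias correction
    v_hat = v / (1 - beta2**t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```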
AdamW fixes a subtle but important issue with how weight decay (L2 regularization) interacts with adaptive learning rate methods. In standard Adam, L2 regularization is applied to the gradient before the adaptive scaling, which means the regularization strength effectively varies across parameters. AdamW decouples weight decay from the gradient-based update:
theta(t+1) = (1 - lambda) * theta(t) - (alpha / (sqrt(v_hat(t)) + epsilon)) * m_hat(t)
where lambda is the weight decay coefficient. This decoupling ensures that weight decay acts uniformly regardless of the adaptive learning rate for each parameter, resulting in better generalization. AdamW has become the default optimizer for training large language models and transformer architectures.
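Only the final line differs from the Adam sketch above: the decay multiplies theta directly instead of being added to the gradient. Here weight_decay plays the role of lambda in the update equation; some implementations additionally scale it by the learning rate, a detail omitted in this sketch.

```python
import numpy as np

def adamw_step(theta, m, v, g, t, alpha=0.001, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update: weight decay applied to theta, not to the gradient."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    theta = (1 - weight_decay) * theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```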
| Optimizer | Year | Key innovation | Typical use case |
|---|---|---|---|
| AdaGrad | 2011 | Per-parameter rates from accumulated squared gradients | Sparse data, NLP |
| Adadelta | 2012 | Windowed accumulation, no base learning rate needed | General purpose |
| RMSProp | 2012 | Exponential moving average of squared gradients | RNNs, general purpose |
| Adam | 2015 | Combines momentum + adaptive rates with bias correction | General default |
| AdaMax | 2015 | Adam with infinity-norm instead of L2 norm | Sparse gradients |
| Nadam | 2016 | Adam + Nesterov lookahead momentum | Sequence models |
| AMSGrad | 2018 | Non-increasing step sizes via max of second moments | Theoretical guarantees |
| AdamW | 2017/2019 | Decoupled weight decay from gradient adaptation | Transformers, LLMs |
| LAMB | 2019 | Layer-wise adaptive rates for large-batch training | Distributed training |
| Lion | 2023 | Sign-based momentum, discovered by automated search | Vision, language models |
The learning rate is arguably the single most important hyperparameter in gradient descent. Setting it correctly can mean the difference between fast convergence to a good solution and complete failure to train.
Modern deep learning training almost always adjusts the learning rate over the course of training rather than keeping it fixed. Common schedules include:
| Schedule | Description | Formula (simplified) | Common usage |
|---|---|---|---|
| Constant | Fixed learning rate throughout training | alpha(t) = alpha_0 | Simple baselines |
| Step decay | Multiply by a factor every N epochs | alpha(t) = alpha_0 * gamma^(floor(t/N)) | CNNs (ResNet, VGG) |
| Exponential decay | Continuously decay by a fixed ratio | alpha(t) = alpha_0 * e^(-lambda*t) | General purpose |
| Cosine annealing | Smooth decay following a cosine curve | alpha(t) = alpha_min + 0.5 * (alpha_0 - alpha_min) * (1 + cos(pi * t / T)) | Transformers, modern CNNs |
| Linear warmup + decay | Ramp up from zero, then decay | Ramp for W steps, then decay | LLMs, transformers |
| Cyclical learning rates | Oscillate between bounds | Triangular or cosine cycles | Fast training (Smith, 2017) |
| 1-cycle policy | Single ramp up then down over training | One large cycle | Super-convergence |
| Cosine with warm restarts (SGDR) | Cosine decay with periodic resets | Restart cosine schedule at intervals | SGDR (Loshchilov and Hutter, 2017) |
Warmup is a technique where the learning rate starts at a very small value (or zero) and gradually increases to the target learning rate over a set number of initial training steps. This is particularly important for transformer models and large-batch training, where starting with a high learning rate can cause training instability due to large, poorly calibrated gradients in the early iterations when the model weights are still random. The combination of linear warmup followed by cosine decay has become the standard schedule for training large language models.
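The warmup-then-cosine schedule is a few lines of arithmetic. In the sketch below, the step counts and rates are illustrative placeholders, not recommendations for any particular model.

```python
import math

def warmup_cosine_lr(step, max_lr=3e-4, warmup_steps=1000,
                     total_steps=100_000, min_lr=0.0):
    """Learning rate at a given step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps              # linear ramp from 0
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    progress = min(progress, 1.0)                        # hold min_lr afterwards
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```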
Gradient descent faces several challenges, especially when applied to the high-dimensional, non-convex loss surfaces typical of deep neural networks.
In high-dimensional spaces, saddle points (where the gradient is zero but the point is neither a local minimum nor a local maximum) vastly outnumber local minima. At a saddle point, some dimensions curve upward while others curve downward. The gradient approaches zero near saddle points, causing standard gradient descent to slow down dramatically or stall. The number of saddle points grows exponentially with the dimensionality of the parameter space. Momentum-based methods and noise from stochastic gradients help escape saddle points by providing perturbations that push the optimizer off the saddle.
In non-convex optimization, gradient descent may converge to a local minimum rather than the global minimum. However, research on deep neural network loss surfaces has revealed that in over-parameterized networks (where the number of parameters greatly exceeds the number of training examples), most local minima have loss values very close to the global minimum. This means getting "stuck" in a local minimum is often not as problematic in practice as it might appear in theory.
Plateaus are large regions of the loss surface where the gradient is nearly zero. The optimizer makes very slow progress through these regions since the gradient signal is weak. Adaptive learning rate methods can partially address this by scaling up updates when gradients are small.
When the loss surface has very different curvatures along different directions (a large condition number), gradient descent zigzags across the narrow dimensions while making slow progress along the broader dimensions. The condition number kappa = L/mu (ratio of largest to smallest curvature) directly determines how many iterations are needed: O(kappa * log(1/epsilon)) for strongly convex functions. Batch normalization and proper weight initialization can help reduce ill-conditioning in neural network training.
In deep networks, gradients can become extremely small (vanishing gradients) or extremely large (exploding gradients) as they are propagated backward through many layers. Vanishing gradients cause early layers to learn very slowly or not at all, while exploding gradients cause numerical instability and divergence.
Common solutions include careful weight initialization (such as Xavier or He initialization), non-saturating activation functions such as ReLU, normalization layers (batch or layer normalization), residual connections that give gradients a shortcut path through deep networks, and gradient clipping, which rescales gradients whose norm exceeds a threshold (see the sketch below).
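Gradient clipping by global norm is a simple rescaling. A minimal sketch follows; max_norm=1.0 matches the transformer default in the hyperparameter table later in this section.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale gradient arrays so their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g**2) for g in grads))
    if total_norm > max_norm:
        grads = [g * (max_norm / total_norm) for g in grads]
    return grads
```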
Gradient descent is a first-order method because it uses only the gradient (first derivative) of the objective function. Second-order methods additionally incorporate curvature information from the Hessian matrix (second derivative), which can yield faster convergence per iteration.
Newton's method updates parameters using:
theta(t+1) = theta(t) - H^(-1) * nabla f(theta(t))
where H is the Hessian matrix of second partial derivatives. By accounting for curvature, Newton's method achieves quadratic convergence near a minimum. However, computing and inverting the full Hessian is O(n^2) in storage and O(n^3) in computation for n parameters. For modern neural networks with millions or billions of parameters, this is completely impractical.
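For small n the update is straightforward to implement. The sketch below solves the linear system H * delta = grad rather than forming the inverse explicitly; grad_f and hess_f are illustrative callables supplied by the user.

```python
import numpy as np

def newton_step(theta, grad_f, hess_f):
    """One Newton update: theta - H^(-1) * grad, via a linear solve."""
    g = grad_f(theta)
    H = hess_f(theta)
    return theta - np.linalg.solve(H, g)   # solving is cheaper than inverting H
```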
L-BFGS (Limited-memory Broyden-Fletcher-Goldfarb-Shanno) approximates the inverse Hessian using gradient information from the most recent iterations, requiring only O(n) storage. It is effective for smaller-scale optimization problems and is sometimes used for fine-tuning or in specific scientific computing applications, but is rarely used for large-scale neural network training due to its reliance on full-batch gradients.
K-FAC (Kronecker-Factored Approximate Curvature) approximates the Fisher information matrix using a Kronecker product structure, making it feasible for neural networks. Shampoo extends this approach with a block-diagonal approximation. These methods can achieve faster convergence than first-order methods in some settings but add significant implementation complexity and computational overhead.
| Method | Order | Per-iteration cost | Storage | Convergence rate (near minimum) | Practicality for large nets |
|---|---|---|---|---|---|
| Gradient descent | First | O(n) | O(n) | Linear | Standard |
| Newton's method | Second | O(n^3) | O(n^2) | Quadratic | Impractical |
| L-BFGS | Quasi-second | O(n) | O(n) | Superlinear | Rare, small models |
| K-FAC | Approximate second | O(n) amortized | O(n) | Faster than first-order | Experimental |
| Shampoo | Approximate second | O(n) amortized | O(n) | Faster than first-order | Experimental |
Backpropagation is the algorithm used to efficiently compute the gradient of the loss function with respect to all parameters in a neural network. It applies the chain rule of calculus layer by layer, starting from the output and working backward. Backpropagation computes the gradient; gradient descent (or one of its variants) uses that gradient to update the parameters. The two algorithms are complementary: backpropagation provides the directional information, and gradient descent determines how to step in that direction.
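The division of labor is easiest to see in a tiny example. The sketch below runs one training step for a one-hidden-layer network on a single example (the dimensions, data, and squared-error loss are illustrative): the backward pass applies the chain rule to produce gradients, and a plain gradient descent step consumes them.

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), 1.0                  # one training example
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=4)
alpha = 0.1

# Forward pass
h = np.maximum(0.0, W1 @ x)                     # hidden layer with ReLU
y_hat = W2 @ h                                  # scalar prediction
loss = 0.5 * (y_hat - y)**2                     # squared-error loss

# Backward pass: backpropagation applies the chain rule, output to input
d_yhat = y_hat - y                              # dL/dy_hat
d_W2 = d_yhat * h                               # dL/dW2
d_h = d_yhat * W2                               # dL/dh
d_W1 = np.outer(d_h * (h > 0), x)               # dL/dW1, gated by the ReLU

# Gradient descent consumes the gradients that backpropagation produced
W1 -= alpha * d_W1
W2 -= alpha * d_W2
```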
Different subfields of deep learning have converged on different default optimizers based on empirical results:
| Domain | Recommended optimizer | Typical schedule | Batch size |
|---|---|---|---|
| Computer vision (CNNs) | SGD with momentum (0.9) | Step decay or cosine annealing | 64-256 |
| NLP / Transformers | AdamW | Linear warmup + cosine decay | 256-2048 |
| Large language models | AdamW or LAMB | Linear warmup + cosine decay | 2048-8192+ |
| Generative models (GANs) | Adam (beta1=0.0, beta2=0.99) | Constant or linear decay | 32-128 |
| Reinforcement learning | Adam | Constant or linear decay | 32-256 |
| Sparse / NLP embeddings | AdaGrad | Constant | Variable |
In computer vision, well-tuned SGD with momentum often achieves slightly better final accuracy than Adam, though it requires more careful hyperparameter tuning. For transformers and language models, AdamW with warmup has become the standard because it converges reliably without extensive tuning.
Training on multiple GPUs or TPUs requires special consideration for gradient descent: gradients computed on each device must be averaged (typically via an all-reduce operation) before every update, the learning rate is commonly scaled up with the global batch size, and longer warmup is used to keep large-batch training stable. The table below lists typical ranges for the key hyperparameters discussed in this section.
| Hyperparameter | Typical range | Notes |
|---|---|---|
| Learning rate (SGD) | 0.01 - 0.1 | Often requires grid search |
| Learning rate (Adam/AdamW) | 1e-4 - 3e-4 | More robust to exact setting |
| Learning rate (Lion) | 1e-5 - 3e-5 | 3-10x smaller than Adam |
| Momentum (SGD) | 0.9 - 0.99 | 0.9 is a safe default |
| beta1 (Adam) | 0.9 | Rarely changed |
| beta2 (Adam) | 0.999 | 0.95 for unstable training |
| Weight decay | 0.01 - 0.1 | Task-dependent |
| Warmup steps | 1-5% of total steps | Higher for larger models |
| Gradient clipping norm | 1.0 | Standard for transformers |
Gradient descent can be viewed as a discretization of the ordinary differential equation:
dx/dt = -nabla f(x(t))
Applying Euler's method with step size alpha to this ODE, replacing dx/dt by (x(t+1) - x(t)) / alpha, yields exactly the standard gradient descent update x(t+1) = x(t) - alpha * nabla f(x(t)). This continuous-time perspective has led to a deeper understanding of optimizer dynamics and has inspired new optimization methods. Nesterov's accelerated method, for example, can be understood as discretizing a second-order ODE that includes a friction term.
Gradient descent is a special case of mirror descent, which generalizes the algorithm by replacing the Euclidean distance with an arbitrary Bregman divergence. Mirror descent is particularly useful for constrained optimization problems where the constraint set has non-Euclidean geometry.
Proposed by Shun-ichi Amari, natural gradient descent replaces the Euclidean metric in parameter space with the Fisher information metric, which accounts for the geometry of the probability distributions the model represents. This leads to updates that are invariant to reparameterization of the model. K-FAC can be understood as a practical approximation to natural gradient descent.
| Approach | Gradient required | Convergence speed | Scalability | Typical application |
|---|---|---|---|---|
| Gradient descent | Yes | Moderate to fast | Excellent | Neural networks, ML |
| Coordinate descent | Partial | Moderate | Good | Lasso, sparse models |
| Evolutionary algorithms | No | Slow | Poor for high-dim | Neural architecture search |
| Bayesian optimization | No | Fast (low-dim) | Poor for high-dim | Hyperparameter tuning |
| Simulated annealing | No | Slow | Moderate | Combinatorial optimization |
| Conjugate gradient | Yes | Fast (quadratic) | Good | Linear systems, small models |