Gradient Descent

Gradient descent is a first-order iterative optimization algorithm used to find the minimum of a differentiable function. It is the most widely used optimization method in machine learning and deep learning, forming the backbone of how neural networks learn from data. The algorithm works by repeatedly adjusting parameters in the direction opposite to the gradient of the objective function, taking steps proportional to the negative of the gradient at the current point.

Gradient descent and its variants are responsible for training virtually every modern deep learning model, from convolutional neural networks for image recognition to transformers for natural language processing. Understanding gradient descent is essential for anyone working in artificial intelligence or data science.

ELI5 (Explain like I'm 5)

Imagine you are standing on a hilly field and you want to get to the lowest point, but you are blindfolded. You can feel the ground under your feet to figure out which direction slopes downward. At each step, you move in the steepest downhill direction you can feel. You keep doing this over and over, and eventually you reach the bottom of the valley.

That is what gradient descent does with math. A computer has a big math problem (called a "loss function") and it wants to find the answer that makes the result as small as possible. It checks which direction would make the answer go down the fastest, takes a small step that way, and then checks again. It repeats this many times until it finds a good answer. The size of each step is called the "learning rate." If steps are too big, you might jump right over the lowest point. If steps are too small, it takes forever to get there.

History

The method of gradient descent was first described by the French mathematician Augustin-Louis Cauchy in 1847. In his memoir "Methode generale pour la resolution des systemes d'equations simultanees," Cauchy proposed an iterative process for solving systems of equations that is equivalent to what we now call gradient descent: updating variables by subtracting a positive multiple of partial derivatives.

Jacques Hadamard independently proposed a similar method in 1907. The convergence properties of gradient descent for nonlinear optimization problems were first rigorously studied by Haskell Curry in 1944, establishing the theoretical foundation for the method.

In 1951, Herbert Robbins and Sutton Monro introduced the stochastic approximation framework in their landmark paper "A Stochastic Approximation Method," published in the Annals of Mathematical Statistics. This work laid the groundwork for stochastic gradient descent, which would later become the dominant training algorithm for neural networks.

Boris Polyak introduced the momentum method (also called the "heavy ball" method) in 1964, drawing an analogy to a heavy ball rolling through a potential field with friction. Yurii Nesterov proposed his accelerated gradient method in 1983, achieving the optimal convergence rate for first-order methods on convex functions.

The connection between gradient descent and neural network training was solidified in 1986, when David Rumelhart, Geoffrey Hinton, and Ronald Williams demonstrated that backpropagation could efficiently compute gradients for multi-layer networks, enabling practical training of deep architectures with gradient-based methods.

The development of adaptive learning rate methods accelerated in the 2010s: AdaGrad (Duchi et al., 2011), RMSProp (Hinton, 2012), and Adam (Kingma and Ba, 2015) each introduced ways to automatically adjust per-parameter learning rates. AdamW, proposed by Loshchilov and Hutter in 2017 (published at ICLR 2019), fixed a subtle issue with weight decay in Adam and became the standard optimizer for training transformer models.

More recently, the Lion optimizer (Chen et al., 2023) was discovered through automated search by Google Brain researchers, demonstrating that optimization algorithms can themselves be found by machine learning techniques.

Mathematical foundation

Problem formulation

Gradient descent solves the unconstrained minimization problem:

minimize f(theta)

where f: R^n -> R is a differentiable function and theta is a vector of n parameters. In machine learning, f is typically a loss function that measures the discrepancy between a model's predictions and the true values, averaged over training data.

The update rule

The core update rule for gradient descent at iteration t is:

theta(t+1) = theta(t) - alpha * nabla f(theta(t))

where:

theta(t) is the parameter vector at iteration t
alpha (also written as eta) is the learning rate, a positive scalar controlling step size
nabla f(theta(t)) is the gradient of f evaluated at theta(t)

The gradient nabla f points in the direction of steepest ascent. By moving in the opposite direction (negative gradient), the algorithm follows the direction of steepest descent, which locally reduces the function value most rapidly.

Intuition from Taylor expansion

The rationale for gradient descent comes from the first-order Taylor approximation. Near a point theta(t), the function can be approximated as:

f(theta) approximately equals f(theta(t)) + nabla f(theta(t))^T * (theta - theta(t))

Choosing theta to minimize this linear approximation within a small neighborhood (controlled by the learning rate) leads directly to the gradient descent update. The step size alpha must be small enough that the linear approximation remains valid.

Conditions for convergence

Convergence of gradient descent depends on properties of the objective function:

Lipschitz continuous gradient: The gradient does not change too rapidly. Formally, ||nabla f(x) - nabla f(y)|| <= L * ||x - y|| for all x, y, where L is the Lipschitz constant. This ensures that a fixed step size of alpha = 1/L guarantees descent at each step.
Convexity: The function has no local minima other than the global minimum. For convex functions with Lipschitz gradients, gradient descent converges at a rate of O(1/k), meaning the error after k iterations shrinks proportionally to 1/k.
Strong convexity: A stronger condition where the function curves upward at least quadratically. With strong convexity parameter mu, gradient descent achieves linear convergence at rate O((1 - mu/L)^k), requiring only O(log(1/epsilon)) iterations to reach epsilon accuracy.

Convergence rates

The convergence behavior of gradient descent varies significantly depending on the properties of the objective function. The table below summarizes the convergence rates for different function classes.

Function class	Convergence rate	Iterations to reach epsilon accuracy	Step size requirement
Convex, Lipschitz gradient	O(1/k)	O(1/epsilon)	alpha <= 1/L
Strongly convex, Lipschitz gradient	O(c^k) where c = (1 - mu/L)	O(log(1/epsilon))	alpha <= 1/L
Non-convex, Lipschitz gradient	O(1/sqrt(k)) to stationary point	O(1/epsilon^2)	alpha <= 1/L
Quadratic (condition number kappa)	O(((kappa-1)/(kappa+1))^k)	O(kappa * log(1/epsilon))	alpha = 2/(L + mu)

For non-convex problems (including most deep learning objectives), gradient descent can only guarantee convergence to a stationary point where the gradient is approximately zero. This could be a local minimum, a saddle point, or even a local maximum. However, research has shown that for over-parameterized neural networks, most local minima have loss values close to the global minimum.

Three primary variants

In practice, the choice of how much training data to use when computing the gradient at each step defines three primary variants of gradient descent.

Batch gradient descent

Batch gradient descent (also called vanilla gradient descent) computes the gradient of the loss function with respect to the parameters using the entire training dataset:

theta = theta - alpha * nabla J(theta)

where J(theta) is the average loss over all N training examples. Because it uses the full dataset for each update, batch gradient descent produces smooth, stable convergence trajectories and is guaranteed to converge to the global minimum for convex functions. However, it becomes computationally expensive and memory-intensive for large datasets, since every single update requires a full pass through the data. It also cannot incorporate new data on the fly (no online learning).

Stochastic gradient descent

Stochastic gradient descent (SGD) updates parameters after computing the gradient on a single randomly chosen training example:

theta = theta - alpha * nabla J(theta; x_i, y_i)

SGD is much faster per iteration than batch gradient descent and can handle very large or even streaming datasets. The noise introduced by single-sample gradient estimates causes the objective function value to fluctuate heavily during training, which can actually help the optimizer escape shallow local minima. However, this same noise means convergence is less stable. To guarantee convergence, the learning rate typically needs to be decreased over time according to a schedule that satisfies the Robbins-Monro conditions: the sum of learning rates diverges (sum alpha_t = infinity), while the sum of squared learning rates converges (sum alpha_t^2 < infinity).

Mini-batch gradient descent

Mini-batch gradient descent is the practical compromise and the standard approach used in modern deep learning. It computes gradients over small randomly sampled subsets (mini-batches) of the training data, typically between 32 and 512 examples:

theta = theta - alpha * nabla J(theta; x_{i:i+b}, y_{i:i+b})

where b is the batch size. This approach reduces the variance of gradient estimates compared to SGD while remaining computationally tractable for large datasets. It also enables efficient use of hardware parallelism on GPUs and TPUs, since matrix operations on batches can be highly optimized. In common usage, the term "SGD" often refers to mini-batch gradient descent rather than true single-sample stochastic gradient descent.

Variant	Data per update	Update noise	Memory usage	Convergence stability	Hardware efficiency
Batch	Entire dataset	None	Very high	Very stable	Poor (large datasets)
Stochastic (SGD)	1 example	Very high	Very low	Unstable	Poor (no parallelism)
Mini-batch	b examples (32-512)	Moderate	Moderate	Balanced	Excellent (GPU/TPU)

Momentum-based methods

Plain gradient descent can be slow when the loss surface has regions where the curvature is much steeper in one direction than another (called "ravines"). In these settings, the gradient oscillates across the narrow dimension while making slow progress along the long dimension. Momentum methods address this by accumulating a velocity vector that smooths out these oscillations.

Classical momentum (Polyak, 1964)

The heavy ball method adds a fraction of the previous update vector to the current update:

v(t) = gamma * v(t-1) + alpha * nabla J(theta(t))

theta(t+1) = theta(t) - v(t)

where gamma is the momentum coefficient, typically set to 0.9. The momentum term accelerates gradient descent in directions where the gradient consistently points the same way, while dampening oscillations in directions where the gradient frequently changes sign. Physically, this is analogous to a ball rolling downhill that accumulates speed and can roll through small bumps.

Nesterov accelerated gradient (Nesterov, 1983)

Nesterov momentum improves on classical momentum by computing the gradient at a "lookahead" position rather than the current position:

v(t) = gamma * v(t-1) + alpha * nabla J(theta(t) - gamma * v(t-1))

theta(t+1) = theta(t) - v(t)

By evaluating the gradient at the anticipated next position (theta - gamma * v), Nesterov momentum provides a correction factor that reduces overshooting. This lookahead mechanism gives it the optimal convergence rate of O(1/k^2) for convex functions, compared to O(1/k) for standard gradient descent. In practice, Nesterov momentum often converges faster and is less prone to overshooting than classical momentum.

Adaptive learning rate methods

A major limitation of standard gradient descent is that a single learning rate is applied to all parameters. Adaptive methods address this by maintaining per-parameter learning rates that are adjusted automatically based on the history of gradients observed for each parameter.

AdaGrad (Duchi et al., 2011)

AdaGrad (Adaptive Gradient) accumulates the squared gradients for each parameter and scales the learning rate inversely by the square root of this sum:

G(t) = G(t-1) + g(t)^2 (element-wise)

theta(t+1) = theta(t) - (alpha / (sqrt(G(t)) + epsilon)) * g(t)

where g(t) = nabla J(theta(t)) and epsilon is a small constant (typically 10^-8) for numerical stability.

AdaGrad performs larger updates for infrequent parameters and smaller updates for frequent parameters, making it well suited for sparse data and natural language processing tasks where rare features are informative. However, its main weakness is that the accumulated squared gradients grow monotonically, causing the effective learning rate to shrink continuously and eventually become vanishingly small, stopping learning prematurely.

Adadelta (Zeiler, 2012)

Adadelta addresses AdaGrad's diminishing learning rate problem by restricting the accumulation window using an exponentially decaying average of past squared gradients:

Eg^2 = rho * Eg^2 + (1-rho) * g(t)^2

Delta theta(t) = -(sqrt(EDelta theta^2 + epsilon) / sqrt(Eg^2 + epsilon)) * g(t)

A distinctive feature of Adadelta is that it eliminates the need to set a base learning rate entirely. Instead, it uses the ratio of the running average of parameter updates to the running average of gradients, making it fully adaptive.

RMSProp (Hinton, 2012)

RMSProp (Root Mean Square Propagation) was proposed by Geoffrey Hinton in an unpublished lecture and addresses the same diminishing learning rate problem as Adadelta, but in a simpler way:

Eg^2 = rho * Eg^2 + (1-rho) * g(t)^2

theta(t+1) = theta(t) - (alpha / sqrt(Eg^2 + epsilon)) * g(t)

Hinton recommended setting rho = 0.9 and alpha = 0.001. RMSProp divides the learning rate by an exponentially decaying average of squared gradients, which prevents the learning rate from decaying too aggressively while still adapting per parameter. It became widely used for training recurrent neural networks and remains a reliable choice.

Adam (Kingma and Ba, 2015)

Adam (Adaptive Moment Estimation) combines the benefits of momentum (first moment) with adaptive learning rates (second moment). It maintains exponential moving averages of both the gradient and the squared gradient:

m(t) = beta1 * m(t-1) + (1 - beta1) * g(t) (first moment estimate)

v(t) = beta2 * v(t-1) + (1 - beta2) * g(t)^2 (second moment estimate)

Because m and v are initialized to zero, they are biased toward zero in early iterations. Adam applies bias correction:

m_hat(t) = m(t) / (1 - beta1^t)

v_hat(t) = v(t) / (1 - beta2^t)

The parameter update is then:

theta(t+1) = theta(t) - (alpha / (sqrt(v_hat(t)) + epsilon)) * m_hat(t)

Default hyperparameters are beta1 = 0.9, beta2 = 0.999, and epsilon = 10^-8. Adam is effective across a wide range of problems and requires little hyperparameter tuning, making it one of the most popular optimizers in deep learning. However, research has shown that Adam can sometimes converge to suboptimal solutions compared to well-tuned SGD with momentum, particularly in computer vision tasks.

AdamW (Loshchilov and Hutter, 2017)

AdamW fixes a subtle but important issue with how weight decay (L2 regularization) interacts with adaptive learning rate methods. In standard Adam, L2 regularization is applied to the gradient before the adaptive scaling, which means the regularization strength effectively varies across parameters. AdamW decouples weight decay from the gradient-based update:

theta(t+1) = (1 - lambda) * theta(t) - (alpha / (sqrt(v_hat(t)) + epsilon)) * m_hat(t)

where lambda is the weight decay coefficient. This decoupling ensures that weight decay acts uniformly regardless of the adaptive learning rate for each parameter, resulting in better generalization. AdamW has become the default optimizer for training large language models and transformer architectures.

Additional adaptive methods

Optimizer	Year	Key innovation	Typical use case
AdaGrad	2011	Per-parameter rates from accumulated squared gradients	Sparse data, NLP
Adadelta	2012	Windowed accumulation, no base learning rate needed	General purpose
RMSProp	2012	Exponential moving average of squared gradients	RNNs, general purpose
Adam	2015	Combines momentum + adaptive rates with bias correction	General default
AdaMax	2015	Adam with infinity-norm instead of L2 norm	Sparse gradients
Nadam	2016	Adam + Nesterov lookahead momentum	Sequence models
AMSGrad	2018	Non-increasing step sizes via max of second moments	Theoretical guarantees
AdamW	2017/2019	Decoupled weight decay from gradient adaptation	Transformers, LLMs
LAMB	2019	Layer-wise adaptive rates for large-batch training	Distributed training
Lion	2023	Sign-based momentum, discovered by automated search	Vision, language models

Learning rate selection and scheduling

The learning rate is arguably the single most important hyperparameter in gradient descent. Setting it correctly can mean the difference between fast convergence to a good solution and complete failure to train.

Effects of learning rate magnitude

Too large: The optimizer overshoots minima, causing the loss to oscillate wildly or diverge to infinity.
Too small: Training converges extremely slowly, potentially getting stuck in suboptimal regions and wasting computational resources.
Just right: The optimizer makes steady progress toward a good solution, balancing speed and stability.

Common scheduling strategies

Modern deep learning training almost always adjusts the learning rate over the course of training rather than keeping it fixed. Common schedules include:

Schedule	Description	Formula (simplified)	Common usage
Constant	Fixed learning rate throughout training	alpha(t) = alpha_0	Simple baselines
Step decay	Multiply by a factor every N epochs	alpha(t) = alpha_0 * gamma^(floor(t/N))	CNNs (ResNet, VGG)
Exponential decay	Continuously decay by a fixed ratio	alpha(t) = alpha_0 * e^(-lambda*t)	General purpose
Cosine annealing	Smooth decay following a cosine curve	alpha(t) = alpha_min + 0.5(alpha_0 - alpha_min)(1 + cos(pi*t/T))	Transformers, modern CNNs
Linear warmup + decay	Ramp up from zero, then decay	Ramp for W steps, then decay	LLMs, transformers
Cyclical learning rates	Oscillate between bounds	Triangular or cosine cycles	Fast training (Smith, 2017)
1-cycle policy	Single ramp up then down over training	One large cycle	Super-convergence
Cosine with warm restarts (SGDR)	Cosine decay with periodic resets	Restart cosine schedule at intervals	SGDR (Loshchilov and Hutter, 2017)

Learning rate warmup

Warmup is a technique where the learning rate starts at a very small value (or zero) and gradually increases to the target learning rate over a set number of initial training steps. This is particularly important for transformer models and large-batch training, where starting with a high learning rate can cause training instability due to large, poorly calibrated gradients in the early iterations when the model weights are still random. The combination of linear warmup followed by cosine decay has become the standard schedule for training large language models.

Challenges in optimization

Gradient descent faces several challenges, especially when applied to the high-dimensional, non-convex loss surfaces typical of deep neural networks.

Saddle points

In high-dimensional spaces, saddle points (where the gradient is zero but the point is neither a local minimum nor a local maximum) vastly outnumber local minima. At a saddle point, some dimensions curve upward while others curve downward. The gradient approaches zero near saddle points, causing standard gradient descent to slow down dramatically or stall. The number of saddle points grows exponentially with the dimensionality of the parameter space. Momentum-based methods and noise from stochastic gradients help escape saddle points by providing perturbations that push the optimizer off the saddle.

Local minima

In non-convex optimization, gradient descent may converge to a local minimum rather than the global minimum. However, research on deep neural network loss surfaces has revealed that in over-parameterized networks (where the number of parameters greatly exceeds the number of training examples), most local minima have loss values very close to the global minimum. This means getting "stuck" in a local minimum is often not as problematic in practice as it might appear in theory.

Plateaus and flat regions

Plateaus are large regions of the loss surface where the gradient is nearly zero. The optimizer makes very slow progress through these regions since the gradient signal is weak. Adaptive learning rate methods can partially address this by scaling up updates when gradients are small.

Ill-conditioning

When the loss surface has very different curvatures along different directions (a large condition number), gradient descent zigzags across the narrow dimensions while making slow progress along the broader dimensions. The condition number kappa = L/mu (ratio of largest to smallest curvature) directly determines how many iterations are needed: O(kappa * log(1/epsilon)) for strongly convex functions. Batch normalization and proper weight initialization can help reduce ill-conditioning in neural network training.

Vanishing and exploding gradients

In deep networks, gradients can become extremely small (vanishing gradients) or extremely large (exploding gradients) as they are propagated backward through many layers. Vanishing gradients cause early layers to learn very slowly or not at all, while exploding gradients cause numerical instability and divergence.

Common solutions include:

Gradient clipping: Capping gradient magnitudes by value or by norm to prevent explosions. This is standard practice in transformer training.
Residual connections: Skip connections (as in ResNet) that provide shorter gradient paths.
Careful initialization: Methods like Xavier/Glorot or He initialization that maintain gradient magnitudes across layers.
Normalization layers: Batch normalization, layer normalization, and other techniques that stabilize gradient magnitudes.
Activation function choice: Using ReLU or its variants instead of sigmoid or tanh to reduce vanishing gradients.

Second-order methods

Gradient descent is a first-order method because it uses only the gradient (first derivative) of the objective function. Second-order methods additionally incorporate curvature information from the Hessian matrix (second derivative), which can yield faster convergence per iteration.

Newton's method

Newton's method updates parameters using:

theta(t+1) = theta(t) - H^(-1) * nabla f(theta(t))

where H is the Hessian matrix of second partial derivatives. By accounting for curvature, Newton's method achieves quadratic convergence near a minimum. However, computing and inverting the full Hessian is O(n^2) in storage and O(n^3) in computation for n parameters. For modern neural networks with millions or billions of parameters, this is completely impractical.

Quasi-Newton methods

L-BFGS (Limited-memory Broyden-Fletcher-Goldfarb-Shanno) approximates the inverse Hessian using gradient information from the most recent iterations, requiring only O(n) storage. It is effective for smaller-scale optimization problems and is sometimes used for fine-tuning or in specific scientific computing applications, but is rarely used for large-scale neural network training due to its reliance on full-batch gradients.

Other approximate methods

K-FAC (Kronecker-Factored Approximate Curvature) approximates the Fisher information matrix using a Kronecker product structure, making it feasible for neural networks. Shampoo extends this approach with a block-diagonal approximation. These methods can achieve faster convergence than first-order methods in some settings but add significant implementation complexity and computational overhead.

Method	Order	Per-iteration cost	Storage	Convergence rate (near minimum)	Practicality for large nets
Gradient descent	First	O(n)	O(n)	Linear	Standard
Newton's method	Second	O(n^3)	O(n^2)	Quadratic	Impractical
L-BFGS	Quasi-second	O(n)	O(n)	Superlinear	Rare, small models
K-FAC	Approximate second	O(n) amortized	O(n)	Faster than first-order	Experimental
Shampoo	Approximate second	O(n) amortized	O(n)	Faster than first-order	Experimental

Gradient descent in practice

Connection to backpropagation

Backpropagation is the algorithm used to efficiently compute the gradient of the loss function with respect to all parameters in a neural network. It applies the chain rule of calculus layer by layer, starting from the output and working backward. Backpropagation computes the gradient; gradient descent (or one of its variants) uses that gradient to update the parameters. The two algorithms are complementary: backpropagation provides the directional information, and gradient descent determines how to step in that direction.

Optimizer selection by domain

Different subfields of deep learning have converged on different default optimizers based on empirical results:

Domain	Recommended optimizer	Typical schedule	Batch size
Computer vision (CNNs)	SGD with momentum (0.9)	Step decay or cosine annealing	64-256
NLP / Transformers	AdamW	Linear warmup + cosine decay	256-2048
Large language models	AdamW or LAMB	Linear warmup + cosine decay	2048-8192+
Generative models (GANs)	Adam (beta1=0.0, beta2=0.99)	Constant or linear decay	32-128
Reinforcement learning	Adam	Constant or linear decay	32-256
Sparse / NLP embeddings	AdaGrad	Constant	Variable

In computer vision, well-tuned SGD with momentum often achieves slightly better final accuracy than Adam, though it requires more careful hyperparameter tuning. For transformers and language models, AdamW with warmup has become the standard because it converges reliably without extensive tuning.

Distributed and large-batch training

Training on multiple GPUs or TPUs requires special consideration for gradient descent:

Data parallelism: Each device processes a different mini-batch, and gradients are averaged across devices before the parameter update. This is the most common distributed training strategy.
Gradient accumulation: When the desired batch size exceeds GPU memory, gradients from multiple forward-backward passes are summed before a single parameter update.
Large-batch scaling: Larger batch sizes require proportionally larger learning rates (linear scaling rule). The LAMB optimizer was specifically designed to enable large-batch training of BERT-scale models.
Hogwild! (Recht et al., 2011): A lock-free parallel SGD approach where workers update a shared parameter vector without coordination, effective for sparse problems.

Practical hyperparameter guidelines

Hyperparameter	Typical range	Notes
Learning rate (SGD)	0.01 - 0.1	Often requires grid search
Learning rate (Adam/AdamW)	1e-4 - 3e-4	More robust to exact setting
Learning rate (Lion)	1e-5 - 3e-5	3-10x smaller than Adam
Momentum (SGD)	0.9 - 0.99	0.9 is a safe default
beta1 (Adam)	0.9	Rarely changed
beta2 (Adam)	0.999	0.95 for unstable training
Weight decay	0.01 - 0.1	Task-dependent
Warmup steps	1-5% of total steps	Higher for larger models
Gradient clipping norm	1.0	Standard for transformers

Connections to other areas

Continuous-time interpretation

Gradient descent can be viewed as a discretization of the ordinary differential equation:

dx/dt = -nabla f(x(t))

Applying Euler's method with step size alpha to this ODE yields the standard gradient descent update. This continuous-time perspective has led to deeper understanding of optimizer dynamics and has inspired new optimization methods. Nesterov's accelerated method, for example, can be understood as discretizing a second-order ODE that includes a friction term.

Mirror descent

Gradient descent is a special case of mirror descent, which generalizes the algorithm by replacing the Euclidean distance with an arbitrary Bregman divergence. Mirror descent is particularly useful for constrained optimization problems where the constraint set has non-Euclidean geometry.

Natural gradient descent

Proposed by Shun-ichi Amari, natural gradient descent replaces the Euclidean metric in parameter space with the Fisher information metric, which accounts for the geometry of the probability distributions the model represents. This leads to updates that are invariant to reparameterization of the model. K-FAC can be understood as a practical approximation to natural gradient descent.

Comparison with other optimization approaches

Approach	Gradient required	Convergence speed	Scalability	Typical application
Gradient descent	Yes	Moderate to fast	Excellent	Neural networks, ML
Coordinate descent	Partial	Moderate	Good	Lasso, sparse models
Evolutionary algorithms	No	Slow	Poor for high-dim	Neural architecture search
Bayesian optimization	No	Fast (low-dim)	Poor for high-dim	Hyperparameter tuning
Simulated annealing	No	Slow	Moderate	Combinatorial optimization
Conjugate gradient	Yes	Fast (quadratic)	Good	Linear systems, small models

References

Cauchy, A. (1847). "Methode generale pour la resolution des systemes d'equations simultanees." Comptes Rendus de l'Academie des Sciences, 25, 536-538.
Robbins, H. and Monro, S. (1951). "A Stochastic Approximation Method." The Annals of Mathematical Statistics, 22(3), 400-407.
Polyak, B. T. (1964). "Some methods of speeding up the convergence of iteration methods." USSR Computational Mathematics and Mathematical Physics, 4(5), 1-17.
Nesterov, Y. (1983). "A method for solving the convex programming problem with convergence rate O(1/k^2)." Doklady Akademii Nauk SSSR, 269, 543-547.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). "Learning representations by back-propagating errors." Nature, 323(6088), 533-536.
Duchi, J., Hazan, E., and Singer, Y. (2011). "Adaptive Subgradient Methods for Online Learning and Stochastic Optimization." Journal of Machine Learning Research, 12, 2121-2159.
Hinton, G. (2012). "Neural Networks for Machine Learning: Lecture 6a Overview of mini-batch gradient descent." Coursera lecture slides.
Kingma, D. P. and Ba, J. (2015). "Adam: A Method for Stochastic Optimization." Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015).
Ruder, S. (2016). "An overview of gradient descent optimization algorithms." arXiv preprint arXiv:1609.04747.
Smith, L. N. (2017). "Cyclical Learning Rates for Training Neural Networks." IEEE Winter Conference on Applications of Computer Vision (WACV).
Loshchilov, I. and Hutter, F. (2017). "Decoupled Weight Decay Regularization." arXiv preprint arXiv:1711.05101. Published at ICLR 2019.
Loshchilov, I. and Hutter, F. (2017). "SGDR: Stochastic Gradient Descent with Warm Restarts." Proceedings of the 5th International Conference on Learning Representations (ICLR 2017).
Chen, X. et al. (2023). "Symbolic Discovery of Optimization Algorithms." arXiv preprint arXiv:2302.06675.
Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press. Chapter 8: Optimization for Training Deep Models.

ELI5 (Explain like I'm 5)

History

Mathematical foundation

Problem formulation

The update rule

Intuition from Taylor expansion

Conditions for convergence

Convergence rates

Three primary variants

Batch gradient descent

Stochastic gradient descent

Mini-batch gradient descent

Momentum-based methods

Classical momentum (Polyak, 1964)

Nesterov accelerated gradient (Nesterov, 1983)

Adaptive learning rate methods

AdaGrad (Duchi et al., 2011)

Adadelta (Zeiler, 2012)

RMSProp (Hinton, 2012)

Adam (Kingma and Ba, 2015)

AdamW (Loshchilov and Hutter, 2017)

Additional adaptive methods

Learning rate selection and scheduling

Effects of learning rate magnitude

Common scheduling strategies

Learning rate warmup

Challenges in optimization

Saddle points

Local minima

Plateaus and flat regions

Ill-conditioning

Vanishing and exploding gradients

Second-order methods

Newton's method

Quasi-Newton methods

Other approximate methods

Gradient descent in practice

Connection to backpropagation

Optimizer selection by domain

Distributed and large-batch training

Practical hyperparameter guidelines

Connections to other areas

Continuous-time interpretation

Mirror descent

Natural gradient descent

Comparison with other optimization approaches

See also

References

Improve this article

Related Articles

Sparse autoencoder

ARC-AGI 2

GELU (Gaussian Error Linear Unit)

LeNet

Clipping

Hyperparameter

ELI5 (Explain like I'm 5)

History

Mathematical foundation

Problem formulation

The update rule

Intuition from Taylor expansion

Conditions for convergence

Convergence rates

Three primary variants

Batch gradient descent

Stochastic gradient descent

Mini-batch gradient descent

Momentum-based methods

Classical momentum (Polyak, 1964)

Nesterov accelerated gradient (Nesterov, 1983)

Adaptive learning rate methods

AdaGrad (Duchi et al., 2011)

Adadelta (Zeiler, 2012)

RMSProp (Hinton, 2012)

Adam (Kingma and Ba, 2015)

AdamW (Loshchilov and Hutter, 2017)

Additional adaptive methods

Learning rate selection and scheduling

Effects of learning rate magnitude