Stochastic gradient descent (SGD) is an iterative optimization algorithm that updates model parameters using the gradient of the loss function computed on a single training example or a small subset of examples, rather than the entire dataset. Originally formalized in the context of stochastic approximation by Robbins and Monro in 1951, SGD has become the workhorse optimization method for training neural networks and other large-scale machine learning models. It is widely used in machine learning and deep learning to minimize objective functions, powering everything from convolutional neural networks for computer vision to transformers for natural language processing. Its combination of computational efficiency, implicit regularization through gradient noise, and favorable generalization properties makes it one of the most important algorithms in modern artificial intelligence.
Imagine you are trying to find the lowest spot in a huge, hilly park, but it is nighttime and you can only use a small flashlight. With regular gradient descent, you would somehow measure the slope of every single hill and valley in the entire park before taking one step. That would take forever. With SGD, you just shine your flashlight on the ground right in front of you, check which way the ground slopes downward, and take a step in that direction. Sometimes your flashlight shows you a slightly misleading patch of ground, so your path is a bit wobbly. But because you take lots of quick steps instead of one perfectly planned step, you actually reach the lowest point faster. The wobbliness even helps you avoid getting stuck in small dips that are not the true lowest point.
The mathematical foundations of SGD predate modern machine learning by several decades. In 1951, Herbert Robbins and Sutton Monro published "A Stochastic Approximation Method" in The Annals of Mathematical Statistics, introducing a general framework for iteratively solving root-finding problems in which the underlying function is observed only through noisy measurements. They proved that under certain conditions on the learning rate schedule (now called the Robbins-Monro conditions), stochastic iterative procedures converge to the true optimum despite the noise. These conditions specify that the step sizes must sum to infinity while their squares sum to a finite value, and they remain central to the theory of SGD today.
Jack Kiefer and Jacob Wolfowitz extended this framework in 1952, demonstrating that stochastic optimization could proceed even when only function evaluations, not gradients, were available: their method approximated the gradient by finite differences, removing the need for an analytic derivative. Frank Rosenblatt applied SGD-like updates to train the perceptron in the late 1950s, marking one of the first uses of stochastic optimization in a learning algorithm.
Despite these early foundations, SGD did not become widely adopted in machine learning until the 1980s and 1990s, when backpropagation made it practical to compute gradients for neural networks. Léon Bottou's influential 2010 paper "Large-Scale Machine Learning with Stochastic Gradient Descent" provided both theoretical analysis and practical guidance that helped establish SGD as the default optimizer for large-scale learning problems. With the explosion of deep learning in the 2010s, SGD and its variants became essential tools, as datasets grew to millions or billions of examples where full-batch methods were computationally infeasible. The development of GPU-accelerated training further cemented SGD's role as the default optimization method for deep learning.
Gradient descent is an iterative optimization algorithm used to minimize a differentiable objective function. The idea is to update the model parameters iteratively by moving them in the direction of the negative gradient of the objective function with respect to the parameters. This movement is governed by a learning rate, which determines the size of the steps taken towards the minimum.
The term "stochastic" refers to the presence of randomness in the optimization process. In the context of SGD, this randomness comes from the random selection of data points used in each iteration of the algorithm. This stochastic nature helps the algorithm explore the optimization landscape more effectively, allowing it to find better solutions and escape local minima in complex, non-convex optimization problems.
Consider a supervised learning problem where the goal is to minimize an empirical risk function over a dataset of n training examples:
J(θ) = (1/n) Σᵢ₌₁ⁿ L(θ; xᵢ, yᵢ)
where θ represents the model parameters, L is the per-example loss, and (xᵢ, yᵢ) are individual training samples.
Batch gradient descent computes the gradient over all n examples before making a single update:
θₜ₊₁ = θₜ - α (1/n) Σᵢ₌₁ⁿ ∇L(θₜ; xᵢ, yᵢ)
Pure SGD instead samples a single example i uniformly at random and updates using that example's gradient alone:
θₜ₊₁ = θₜ - α ∇L(θₜ; xᵢ, yᵢ)
The key insight is that the single-sample gradient ∇L(θₜ; xᵢ, yᵢ) is an unbiased estimator of the true gradient:
E[∇L(θₜ; xᵢ, yᵢ)] = (1/n) Σᵢ₌₁ⁿ ∇L(θₜ; xᵢ, yᵢ) = ∇J(θₜ)
This means that on average, the stochastic gradient points in the same direction as the full gradient, even though any individual estimate may be noisy. This unbiasedness property is what allows SGD to converge to the same solution as batch gradient descent given appropriate learning rate schedules, while performing each step at a fraction of the computational cost.
The main steps of the stochastic gradient descent algorithm are as follows (a code sketch of the loop follows the list):

1. Initialize the parameters θ (for example, randomly) and choose a learning rate α.
2. Shuffle the training data.
3. Select a training example (or mini-batch) at random.
4. Compute the gradient of the loss on that sample with respect to θ.
5. Update the parameters: θ ← θ − α∇L.
6. Repeat steps 3–5 until convergence or until an iteration budget is exhausted.
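These steps translate into a very short loop. The following is a minimal sketch, not production code: the synthetic least-squares problem, the variable names, the learning rate, and the iteration count are all illustrative assumptions.

```python
import numpy as np

# Minimal pure-SGD sketch on a synthetic least-squares problem.
rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.normal(size=(n, d))
theta_true = rng.normal(size=d)
y = X @ theta_true + 0.1 * rng.normal(size=n)

theta = np.zeros(d)        # step 1: initialize parameters
alpha = 0.01               # step 1: choose a learning rate
for t in range(10_000):
    i = rng.integers(n)                     # step 3: uniform sampling plays the role of shuffling
    grad = (X[i] @ theta - y[i]) * X[i]     # step 4: per-example gradient of the squared loss
    theta -= alpha * grad                   # step 5: SGD update
```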
In practice, the term "SGD" is used loosely to refer to three distinct variants that differ in how many examples are used to estimate the gradient at each step. These variants sit on a spectrum from using the full dataset to using a single example.
| Variant | Samples per update | Gradient quality | Compute per step | Memory usage | Convergence behavior |
|---|---|---|---|---|---|
| Batch gradient descent | Entire dataset (n) | Exact gradient | Very high | High | Smooth, stable descent; can get trapped in sharp minima |
| Mini-batch SGD | Small subset (b, typically 32 to 4096) | Low-variance estimate | Moderate | Moderate | Balanced noise and stability; parallelizes well on GPUs |
| Pure (online) SGD | Single example (1) | High-variance estimate | Very low | Very low | Noisy, rapid initial progress; high variance helps escape local minima |
Batch gradient descent computes the true gradient at each step, yielding smooth and predictable updates. However, for large datasets it is prohibitively expensive, since every update requires a full pass through the data.
Mini-batch SGD is the most widely used variant in deep learning practice. At each step, a random subset (mini-batch) of b examples is drawn, and the gradient is averaged over this subset:
θₜ₊₁ = θₜ - α (1/b) Σⱼ∈Bₜ ∇L(θₜ; xⱼ, yⱼ)
where Bₜ is the mini-batch at step t. Mini-batch SGD balances the computational cost per step with the quality of the gradient estimate. It also benefits from hardware parallelism, since modern GPUs and TPUs can process a batch of examples simultaneously through vectorized operations.
Pure SGD uses a single example per step. While it provides the fastest updates in wall-clock time per iteration, the extreme variance of the gradient estimate can slow convergence and make training unstable.
When practitioners and papers refer to "SGD" without further qualification, they almost always mean mini-batch SGD with a batch size between 32 and 256. The mini-batch size is a hyperparameter that trades off gradient noise for computational efficiency.
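Continuing the least-squares sketch above, the mini-batch update changes only how the gradient estimate is formed; the batch size of 64 is an illustrative choice.

```python
# Mini-batch SGD: average per-example gradients over a random subset B_t.
b = 64
for t in range(2_000):
    batch = rng.integers(n, size=b)          # draw the mini-batch B_t
    residual = X[batch] @ theta - y[batch]   # shape (b,)
    grad = X[batch].T @ residual / b         # (1/b) * sum of per-example gradients
    theta -= alpha * grad
```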
A counterintuitive property of SGD is that the noise in its gradient estimates can actually be beneficial rather than purely harmful. This phenomenon has been studied extensively and is understood through several complementary perspectives.
Escaping local minima and saddle points. The loss landscape of deep neural networks is highly non-convex, containing many local minima and saddle points. The stochastic noise in SGD's gradient estimates acts as a form of random perturbation that helps the optimizer escape shallow local minima and saddle points that would trap a deterministic optimizer.
Implicit regularization. SGD exhibits an implicit regularization effect, biasing the optimization trajectory toward flatter regions of the loss landscape. Flatter minima correspond to solutions where small perturbations to the parameters do not drastically change the loss, which correlates with better generalization to unseen data. This bias arises because the noise in SGD updates is larger in sharper regions of the loss landscape, effectively pushing the optimizer away from sharp minima and toward broader, flatter basins.
Exploration of the loss landscape. The stochastic nature of SGD means that it does not follow a single deterministic path through parameter space. Instead, it effectively samples from a distribution of trajectories, providing a form of exploration that can help discover better optima.
Research by Wu et al. (2022) and Damian et al. (2021) has provided theoretical evidence that SGD noise aligns with the Hessian of the loss, causing the optimizer to preferentially escape sharp minima. This alignment property provides a rigorous explanation for why SGD with smaller batch sizes (and thus more noise) tends to find solutions that generalize better.
Plain SGD can oscillate back and forth across narrow valleys in the loss landscape, making slow progress toward the optimum. Momentum methods address this by accumulating a running average of past gradients, which dampens oscillations and accelerates movement along consistent gradient directions.
Boris Polyak introduced the heavy ball method in 1964, which adds a momentum term to the update rule:
vₜ₊₁ = μ vₜ + ∇L(θₜ; xᵢ, yᵢ)
θₜ₊₁ = θₜ - α vₜ₊₁
Here, vₜ is the velocity vector, μ is the momentum coefficient (typically 0.9), and α is the learning rate. The velocity accumulates past gradients with exponential decay, so consistent gradient directions are reinforced while oscillating directions are damped. This is analogous to a ball rolling downhill: it gains speed on consistent slopes and resists changing direction on bumpy terrain. Momentum accelerates convergence on strongly convex problems and has been shown to improve the convergence rate from O(1/t) to O(1/t²) for quadratic objectives under optimal parameter choices.
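In code, the heavy ball method adds one extra state vector to the least-squares sketch above; μ = 0.9 follows the typical value mentioned in the text.

```python
# Heavy-ball (classical) momentum added to the least-squares sketch.
mu = 0.9                       # momentum coefficient
v = np.zeros(d)                # velocity vector
for t in range(2_000):
    i = rng.integers(n)
    grad = (X[i] @ theta - y[i]) * X[i]
    v = mu * v + grad          # accumulate past gradients with exponential decay
    theta -= alpha * v         # step along the velocity, not the raw gradient
```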
Yurii Nesterov proposed a modification that evaluates the gradient at a "lookahead" position rather than the current position:
vₜ₊₁ = μ vₜ + ∇L(θₜ - α μ vₜ; xᵢ, yᵢ)
θₜ₊₁ = θₜ - α vₜ₊₁
By computing the gradient at the anticipated next position (θₜ - α μ vₜ), Nesterov momentum provides a corrective factor that prevents overshooting. For smooth convex functions, Nesterov's method achieves a convergence rate of O(1/t²), which is optimal among first-order methods that use only gradient information at consecutive iterates. In practice, Nesterov momentum often converges faster than classical momentum, particularly in the final stages of optimization when the iterates are close to the solution, and it is commonly enabled in PyTorch training pipelines via the nesterov=True flag of the SGD optimizer.
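Under the same assumptions as the snippets above, the only change in code is where the gradient is evaluated.

```python
# Nesterov momentum: evaluate the gradient at the lookahead point.
v = np.zeros(d)
for t in range(2_000):
    i = rng.integers(n)
    lookahead = theta - alpha * mu * v           # anticipated next position
    grad = (X[i] @ lookahead - y[i]) * X[i]      # gradient at the lookahead, not at theta
    v = mu * v + grad
    theta -= alpha * v
```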
The learning rate is the single most important hyperparameter for SGD. Setting it too high causes the optimizer to diverge or oscillate wildly. Setting it too low results in painfully slow convergence. In practice, the learning rate is almost always varied during training according to a predefined schedule.
| Schedule | Formula / description | Typical use case |
|---|---|---|
| Constant | α remains fixed throughout training | Baseline; simple experiments; fine-tuning |
| Step decay | α is multiplied by a factor (e.g., 0.1) at specific epochs | Image classification (e.g., ResNet training); CNNs on ImageNet |
| Exponential decay | αₜ = α₀ · γᵗ for decay factor γ < 1 | Smooth, gradual reduction; older-style training |
| Cosine annealing | αₜ = α_min + 0.5(α₀ - α_min)(1 + cos(πt/T)) | Modern deep learning; transformer pretraining |
| Linear warmup + decay | Linearly increase α from 0 to α₀ over w steps, then decay | Large-batch training, transformers |
| One-cycle policy | Single cycle: increase α then decrease, with momentum mirror | Super-convergence; fast training |
| Polynomial decay | αₜ = α₀ · (1 - t/T)^p | NLP fine-tuning |
The simplest approach uses a fixed learning rate throughout training. While this is easy to implement, it is rarely optimal. A large constant rate prevents the optimizer from settling into a precise minimum, while a small constant rate wastes computation in the early stages of training.
Step decay reduces the learning rate by a multiplicative factor at predetermined milestones. For example, the learning rate might be divided by 10 at epochs 30, 60, and 90 of a 100-epoch training run. This schedule was widely used for training convolutional neural networks on ImageNet and remains a reliable baseline.
Proposed by Loshchilov and Hutter (2016) in their paper on SGDR (Stochastic Gradient Descent with Warm Restarts), cosine annealing smoothly decreases the learning rate following a cosine curve from an initial value to near zero over a training cycle. The schedule is given by:
αₜ = α_min + 0.5(α₀ - α_min)(1 + cos(πt/T))
where T is the total number of steps. The cosine shape starts with a gentle decline, allowing extended exploration at higher learning rates, and gradually flattens near zero for fine-tuning. Variants with warm restarts periodically reset the learning rate to its initial value, enabling the optimizer to escape suboptimal regions.
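The schedule is a direct transcription of the formula; the function below is a minimal sketch with illustrative names.

```python
import math

def cosine_lr(t, T, alpha_max, alpha_min=0.0):
    """Cosine annealing from alpha_max to alpha_min over T steps."""
    return alpha_min + 0.5 * (alpha_max - alpha_min) * (1 + math.cos(math.pi * t / T))
```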
Proposed by Leslie Smith (2018), the one-cycle policy trains with a single learning rate cycle: the rate increases linearly from a small value to a large maximum over the first half of training, then decreases back down over the second half, often continuing to decay to a value several orders of magnitude below the starting point. This policy exploits a phenomenon called "super-convergence," where training with large learning rates in the middle of the schedule acts as a regularizer, preventing the model from settling into sharp minima. Experiments showed that super-convergence can reduce training time by an order of magnitude on datasets like CIFAR-10 and CIFAR-100 without sacrificing accuracy.
The Robbins-Monro conditions provide the classical theoretical requirement for learning rate schedules to guarantee convergence:
Σₜ αₜ = ∞ and Σₜ αₜ² < ∞
The first condition ensures that the optimizer can reach any point in parameter space, while the second ensures that the step sizes decrease fast enough for the noise to average out. The canonical example is αₜ = α₀/t, which satisfies both conditions: the harmonic series Σ 1/t diverges, while Σ 1/t² converges (to π²/6). By contrast, αₜ = c/√t satisfies the first condition but not the second, since its squares again form the divergent harmonic series.
Learning rate warm-up is a technique where the learning rate starts at a very small value (or zero) and gradually increases to the target learning rate over the first few hundred or thousand training steps. Popularized by Goyal et al. (2017) for large-batch training and adopted in the original Transformer paper by Vaswani et al. (2017), warm-up has become standard practice for training transformers and other large models.
The primary benefit of warm-up is stabilizing early training dynamics. At initialization, the model parameters are essentially random, and the gradient estimates can be unreliable. A large learning rate applied to these unreliable gradients can push parameters into poorly conditioned regions of the loss landscape, causing training to diverge. By starting with a small learning rate, warm-up allows the model to settle into a reasonable region of parameter space before applying the full learning rate.
Research has shown that warm-up effectively reduces the sharpness of the loss landscape (measured by the top eigenvalue of the Hessian), guiding the optimizer toward flatter regions that can tolerate larger learning rates. Goyal et al. (2017) demonstrated that warm-up is critical for training with very large batch sizes (up to 8,192 images) using SGD with the linear scaling rule.
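A common way to combine warm-up with a decay schedule is sketched below. The linear-warmup-then-cosine shape is one popular choice; the names and the post-warmup decay are illustrative assumptions.

```python
import math

def warmup_cosine_lr(t, T, warmup_steps, alpha_max):
    """Linear warm-up to alpha_max over warmup_steps, then cosine decay to zero."""
    if t < warmup_steps:
        return alpha_max * (t + 1) / warmup_steps     # linear ramp from near zero
    progress = (t - warmup_steps) / max(1, T - warmup_steps)
    return 0.5 * alpha_max * (1 + math.cos(math.pi * progress))
```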
While SGD with momentum uses a single learning rate for all parameters, adaptive methods maintain per-parameter learning rates that automatically adjust based on the history of gradients. This is especially useful when different parameters have gradients of vastly different magnitudes.
AdaGrad (Adaptive Gradient Algorithm), proposed by John Duchi, Elad Hazan, and Yoram Singer in 2011, was the first widely adopted adaptive learning rate method. It accumulates the sum of squared gradients for each parameter and scales the learning rate inversely by the square root of this sum:
Gₜ = Gₜ₋₁ + (∇L(θₜ))²
θₜ₊₁ = θₜ − (α / √(Gₜ + ε)) ∇L(θₜ)
AdaGrad performs well on problems with sparse gradients (such as NLP tasks with large vocabularies) because infrequent features receive larger effective learning rates. However, AdaGrad has a significant drawback: the accumulated squared gradients in G grow monotonically, causing the effective learning rate to shrink continuously. For long training runs, this can cause the learning rate to become vanishingly small, effectively halting learning before the model has converged.
RMSProp was proposed by Geoffrey Hinton in Lecture 6e of his Coursera course on neural networks. It was never published in a formal paper, yet it became one of the most widely used optimizers in deep learning. RMSProp addresses AdaGrad's diminishing learning rate problem by replacing the cumulative sum of squared gradients with an exponential moving average:
E[g²]ₜ = ρ E[g²]ₜ₋₁ + (1 − ρ)(∇L(θₜ))²
θₜ₊₁ = θₜ − (α / √(E[g²]ₜ + ε)) ∇L(θₜ)
The decay factor ρ (typically 0.9 or 0.99) controls how quickly the moving average forgets old gradients. By using a moving window rather than an ever-growing accumulator, RMSProp maintains a more stable effective learning rate throughout training.
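Expressed in code, the two optimizers differ only in their accumulator. The sketch below continues the least-squares example and takes RMSProp steps, with the AdaGrad accumulator shown alongside for contrast.

```python
# AdaGrad vs. RMSProp: same update shape, different accumulator.
rho, eps = 0.9, 1e-8
G = np.zeros(d)                # AdaGrad: monotonically growing sum
Eg2 = np.zeros(d)              # RMSProp: exponential moving average
for t in range(2_000):
    i = rng.integers(n)
    g = (X[i] @ theta - y[i]) * X[i]
    G += g**2                                 # grows forever -> effective lr shrinks to zero
    Eg2 = rho * Eg2 + (1 - rho) * g**2        # forgets old gradients
    theta -= alpha * g / np.sqrt(Eg2 + eps)   # RMSProp step (swap in G for AdaGrad)
```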
Adam (Adaptive Moment Estimation), proposed by Diederik Kingma and Jimmy Ba in 2014, combines the ideas of momentum and adaptive learning rates. It maintains exponential moving averages of both the first moment (mean) and the second moment (uncentered variance) of the gradients:
mₜ = β₁ mₜ₋₁ + (1 − β₁) ∇L(θₜ)
vₜ = β₂ vₜ₋₁ + (1 − β₂)(∇L(θₜ))²
m̂ₜ = mₜ / (1 − β₁ᵗ)
v̂ₜ = vₜ / (1 − β₂ᵗ)
θₜ₊₁ = θₜ − α m̂ₜ / (√v̂ₜ + ε)
The bias correction terms (m̂ₜ and v̂ₜ) compensate for the fact that the moving averages are initialized at zero, which would otherwise cause them to be biased toward zero during the early iterations. The default hyperparameters recommended by Kingma and Ba are β₁ = 0.9, β₂ = 0.999, and ε = 10⁻⁸. Adam became the default optimizer for many deep learning tasks due to its fast convergence and relative insensitivity to hyperparameter choices.
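The update equations translate directly into code. The following continues the numpy sketch with the default hyperparameters quoted above.

```python
# Adam on the least-squares sketch, transcribing the equations above.
beta1, beta2, eps = 0.9, 0.999, 1e-8
alpha_adam = 0.001
m = np.zeros(d)
v2 = np.zeros(d)
for t in range(1, 2_001):                      # t starts at 1 for bias correction
    i = rng.integers(n)
    g = (X[i] @ theta - y[i]) * X[i]
    m = beta1 * m + (1 - beta1) * g            # first moment (mean)
    v2 = beta2 * v2 + (1 - beta2) * g**2       # second moment (uncentered variance)
    m_hat = m / (1 - beta1**t)                 # bias correction
    v_hat = v2 / (1 - beta2**t)
    theta -= alpha_adam * m_hat / (np.sqrt(v_hat) + eps)
```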
In 2017, Ilya Loshchilov and Frank Hutter identified a subtle but important flaw in how weight decay is typically implemented with Adam. Standard L2 regularization adds a penalty term λ‖θ‖² to the loss function, which is equivalent to weight decay for vanilla SGD. However, for adaptive optimizers like Adam, L2 regularization and weight decay are not equivalent. In Adam, the gradient of the L2 penalty is scaled by the adaptive learning rate, which means that parameters with large accumulated gradients receive less regularization than intended.
AdamW decouples the weight decay from the gradient-based update by applying it directly to the parameters rather than through the loss function:
θₜ₊₁ = (1 − λα) θₜ − α m̂ₜ / (√v̂ₜ + ε)
This simple modification substantially improves Adam's generalization performance. Loshchilov and Hutter reported a 15% relative improvement in test error when using decoupled weight decay versus L2 regularization with Adam. AdamW has become the standard optimizer for training transformer-based language models and is the default in libraries like Hugging Face Transformers.
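The decoupling amounts to a one-line change in the parameter update. A minimal sketch, where lam is an illustrative weight decay coefficient:

```python
import numpy as np

def adamw_update(theta, m_hat, v_hat, alpha, lam, eps=1e-8):
    """One AdamW step: weight decay is applied directly to the parameters,
    not routed through the adaptive gradient scaling."""
    return (1 - alpha * lam) * theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
```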
| Optimizer | Year | Key Innovation | Per-Parameter Adaptive? | Momentum? | Typical Use Cases |
|---|---|---|---|---|---|
| SGD | 1951 | Stochastic gradient updates | No | No | Baseline; simple models |
| SGD + Momentum | 1964 | Velocity accumulation (heavy ball) | No | Yes (classical) | CNNs on image classification |
| SGD + Nesterov | 1983 | Lookahead gradient evaluation | No | Yes (Nesterov) | CNNs; any task where SGD is preferred |
| AdaGrad | 2011 | Cumulative squared gradient scaling | Yes | No | Sparse data; NLP with large vocabularies |
| RMSProp | 2012 | Exponential moving average of squared gradients | Yes | No | RNNs; reinforcement learning |
| Adam | 2014 | First and second moment estimation with bias correction | Yes | Yes | General deep learning; default for many tasks |
| AdamW | 2017 | Decoupled weight decay | Yes | Yes | Transformer pretraining; language models |
| LARS | 2017 | Layer-wise adaptive learning rates | Yes (layer-wise) | Yes | Large-batch CNN training |
| LAMB | 2019 | Layer-wise adaptation combined with Adam | Yes (layer-wise + per-param) | Yes | Large-batch BERT pretraining |
The convergence properties of SGD depend heavily on the assumptions made about the objective function.
| Function class | Convergence rate | Key requirement |
|---|---|---|
| Convex, Lipschitz continuous | O(1/√T) | Decaying learning rate |
| Strongly convex, Lipschitz continuous | O(1/T) | Decaying learning rate |
| Smooth, convex | O(1/T) | Constant or decaying learning rate |
| Smooth, strongly convex | Linear rate O(exp(−cT)) to a neighborhood of the optimum | Constant learning rate (neighborhood shrinks with step size) |
| Non-convex, smooth | O(1/√T) to stationary point | Decaying learning rate |
For convex functions, SGD with a decaying learning rate αₜ = O(1/√t) achieves an expected suboptimality of O(1/√T) after T iterations. This is slower than the O(1/T) rate of full-batch gradient descent, which is the price paid for using noisy gradient estimates. This rate is optimal for first-order stochastic methods in the general convex setting.
For strongly convex functions (functions with a positive curvature lower bound), SGD achieves the faster rate of O(1/T) with an appropriately decaying learning rate of αₜ = O(1/t), where the hidden constant depends on the strong convexity parameter. This matches the minimax optimal rate for stochastic first-order optimization. Polyak and Juditsky (1992) showed that averaging the iterates (Polyak-Ruppert averaging) can further improve the convergence, achieving the information-theoretic lower bound.
For non-convex functions, which is the setting relevant to deep learning, SGD converges to a stationary point (where the gradient norm is small) at a rate of O(1/√T). Notably, this does not guarantee convergence to a global minimum or even a good local minimum. With a constant learning rate chosen proportional to 1/√T for a horizon of T steps, SGD drives the minimum expected squared gradient norm along the trajectory to O(1/√T). The practical success of SGD in deep learning suggests that the loss landscapes of neural networks have benign properties (such as few poor local minima) that enable SGD to find good solutions despite the lack of convexity.
The Polyak-Łojasiewicz (PL) condition provides a useful middle ground: functions satisfying the PL condition are not necessarily convex, but SGD can still achieve linear convergence rates on them. Many overparameterized neural networks have been shown to satisfy the PL condition near their initialization, helping explain the fast convergence observed in practice.
The choice between SGD (with momentum) and adaptive optimizers like Adam is one of the most debated practical questions in deep learning optimization. Both optimizers have passionate advocates, and the empirical evidence suggests that neither is universally better.
Wilson et al. (2017) published an influential study titled "The Marginal Value of Adaptive Gradient Methods in Machine Learning," demonstrating that adaptive methods (Adam, AdaGrad, RMSProp) consistently found solutions with worse generalization performance than SGD with momentum across a range of tasks, including image classification, character-level language modeling, and constituency parsing. They provided both empirical evidence and theoretical arguments that adaptive methods converge to different (and less desirable) minima than SGD.
Subsequent research has clarified this picture:
| Aspect | SGD with momentum | Adam |
|---|---|---|
| Generalization on vision tasks | Often better; finds flatter minima | Can underperform on test accuracy |
| Training speed | Slower initial convergence | Faster initial convergence |
| Hyperparameter sensitivity | Requires careful tuning of learning rate and schedule | More robust to learning rate choice |
| Transformer training | Rarely used alone | Nearly universal default |
| Gradient distribution | Assumes relatively homogeneous gradients | Handles heterogeneous gradient scales well |
| Final performance on NLP | Competitive when well-tuned | Standard choice |
Multiple studies have observed that SGD with momentum, when properly tuned, finds solutions that generalize better than those found by Adam. The theoretical explanation for SGD's generalization advantage centers on the structure of gradient noise. SGD's noise is anisotropic and scales with the local curvature of the loss, which biases the optimizer toward flatter minima in the loss landscape, consistent with the Hessian-alignment results discussed earlier. Flat minima tend to generalize better because small perturbations to the parameters do not significantly change the loss.

Research by Zhou et al. (2020) showed that SGD is more locally unstable at sharp minima than Adam, meaning it is more likely to escape them and settle in flatter regions. Adaptive methods, by rescaling gradients per parameter, alter this noise structure in ways that can direct the optimizer toward sharper minima. This property makes SGD the preferred optimizer for image classification with CNNs, where it has historically produced state-of-the-art results on benchmarks like ImageNet.
Adam tends to converge faster in the early stages of training and is much less sensitive to the choice of learning rate. This makes it particularly valuable for large-scale experiments where hyperparameter tuning budgets are limited. For transformer architectures, Adam (or AdamW) is strongly preferred because transformers exhibit highly heterogeneous gradient distributions across layers and parameters. Research has shown that transformers have a block-heterogeneous Hessian spectrum, meaning the curvature of the loss landscape varies dramatically across different parameter groups (e.g., attention weights versus layer norm parameters). Adam's per-parameter adaptivity naturally handles this heterogeneity, while SGD's single global learning rate struggles to handle these varying scales. Language models trained with transformers are widely reported to be difficult or impossible to train effectively with SGD.
Hybrid approaches such as SWATS (Keskar and Socher, 2017) begin training with Adam and automatically switch to SGD once a triggering criterion is met. This strategy attempts to capture Adam's fast early convergence while benefiting from SGD's generalization properties in the later stages of training.
Recent research has highlighted that the batch size plays a critical role in the SGD versus Adam comparison. At smaller batch sizes with sufficient training steps, SGD can match or outperform Adam on many tasks. As batch size increases, Adam's advantage grows because its adaptive learning rates compensate for the reduced gradient noise in large batches. This finding suggests that the choice of optimizer should be considered jointly with the batch size and total training budget.
Training modern deep learning models often requires distributing computation across many GPUs, TPUs, or machines, which typically means using very large batch sizes. The primary strategies for parallel SGD are synchronous and asynchronous approaches. However, simply increasing the batch size with a fixed learning rate degrades model quality.
In synchronous distributed SGD, each worker computes gradients on its local mini-batch, and all gradients are averaged (typically via an AllReduce operation) before any worker updates its parameters. This is mathematically equivalent to running SGD with a larger effective batch size equal to the per-worker batch size multiplied by the number of workers.
The linear scaling rule (Goyal et al., 2017) states that when the batch size is multiplied by k, the learning rate should also be multiplied by k to maintain the same training dynamics. Goyal et al. at Facebook AI Research used this rule together with a gradual warmup phase to train ResNet-50 on ImageNet in one hour using a batch size of 8,192 across 256 GPUs with no loss in accuracy. This rule works well up to a certain batch size, beyond which training quality degrades.
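As a sketch, the rule is a one-line computation; the base values of 0.1 per batch of 256 follow the ResNet-50 recipe described above, and the function name is illustrative.

```python
def scaled_lr(batch_size, base_lr=0.1, base_batch=256):
    """Linear scaling rule: multiply the learning rate by k when the
    batch size is multiplied by k."""
    return base_lr * batch_size / base_batch

# e.g., scaled_lr(8192) == 3.2 for the ImageNet-in-one-hour setup
```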
LARS (Layer-wise Adaptive Rate Scaling), proposed by You et al. (2017), addresses the scaling limitation by adjusting the learning rate independently for each layer based on the ratio of weight norms to gradient norms. LARS enabled training ImageNet with batch sizes up to 32,768 without significant accuracy loss and was instrumental in pushing the limits of distributed SGD training.
The Layer-wise Adaptive Moments optimizer for Batch training (LAMB), proposed by You et al. (2019), extends the LARS idea to Adam. LAMB combines per-dimension adaptivity (from Adam's second moment) with per-layer normalization (from LARS). It was used to train BERT in 76 minutes with a batch size of 65,536, significantly accelerating the pretraining of large language models. LAMB generally outperforms LARS across all batch sizes tested.
In asynchronous SGD, each worker computes gradients and updates a shared parameter server independently, without waiting for other workers. This eliminates the synchronization bottleneck but introduces "staleness," where gradients are computed using parameters that may have been updated multiple times since the gradient computation began.
Staleness can slow convergence and degrade final model quality. Mitigation strategies include bounded staleness (limiting how out-of-date gradients can be) and staleness-aware learning rate correction.
Gradient accumulation is a technique that simulates large-batch training on hardware with limited memory. Instead of processing a large batch at once, multiple smaller mini-batches are processed sequentially, and their gradients are summed before performing a single parameter update. This is mathematically equivalent to using the larger batch size, but requires only the memory needed for the smaller mini-batch.
For example, processing 4 mini-batches of 64 examples with gradient accumulation before updating is equivalent to a single mini-batch of 256 examples. This technique is particularly important for training large language models where memory constraints limit the per-GPU batch size.
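A sketch of the pattern in PyTorch, where model, criterion, and dataloader are assumed to exist; four accumulation steps over mini-batches of 64 reproduce the batch-of-256 example.

```python
import torch

accum_steps = 4   # 4 mini-batches of 64 -> effective batch of 256
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(dataloader):
    loss = criterion(model(inputs), targets)
    (loss / accum_steps).backward()       # scale so summed gradients average correctly
    if (step + 1) % accum_steps == 0:
        optimizer.step()                  # one parameter update per accum_steps batches
        optimizer.zero_grad()
```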
A persistent challenge in deep learning is that optimal hyperparameters (including the learning rate, initialization scale, and others) change as the model width scales up. This means practitioners cannot simply tune hyperparameters on a small model and transfer them to a larger one.
The Maximal Update Parameterization (muP), introduced by Greg Yang, Edward Hu, and collaborators in the "Tensor Programs" series of papers (presented at NeurIPS 2021), provides a principled solution. Under muP, the initialization variance, learning rate, and other hyperparameters are parameterized as functions of model width in such a way that optimal hyperparameters remain stable across different model sizes. This enables "muTransfer": tuning hyperparameters on a small proxy model and then directly transferring them to the full-scale model without further tuning.
The practical benefits are significant. By transferring pretraining hyperparameters from a 13M-parameter model, the authors matched the published performance of BERT-large (350M parameters) at a tuning cost equivalent to pretraining BERT-large only once. By transferring from a 40M-parameter model, they matched the published numbers for a 6.7B-parameter GPT-3 model at only 7% of the total pretraining cost. muP has been adopted by several large-scale training efforts and is available as an open-source PyTorch library from Microsoft.
PyTorch provides SGD through torch.optim.SGD:
```python
import torch
import torch.optim as optim

model = MyModel()

# SGD with momentum and weight decay
optimizer = optim.SGD(
    model.parameters(),
    lr=0.1,
    momentum=0.9,
    weight_decay=1e-4,
    nesterov=True,
)

# Training loop
for epoch in range(num_epochs):
    for batch in dataloader:
        optimizer.zero_grad()
        loss = criterion(model(batch.inputs), batch.targets)
        loss.backward()
        optimizer.step()
```
Key parameters include lr (learning rate, default 0.001), momentum (default 0), weight_decay (L2 penalty, default 0), nesterov (enables Nesterov momentum, default False), and dampening (dampening for momentum, default 0). Note that PyTorch's momentum implementation differs slightly from the classical formulation: it uses v = mu * v + g and then p = p - lr * v, rather than the formulation found in some textbooks.
In TensorFlow, SGD is available through tf.keras.optimizers.SGD:
```python
import tensorflow as tf

model = build_model()
optimizer = tf.keras.optimizers.SGD(
    learning_rate=0.1,
    momentum=0.9,
    nesterov=True,
)
model.compile(optimizer=optimizer, loss='categorical_crossentropy')
model.fit(train_data, train_labels, epochs=100, batch_size=128)
```
Both frameworks support learning rate schedulers that can be attached to the optimizer to implement the various schedules described above.
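For example, a cosine schedule can be attached to the PyTorch optimizer defined earlier; train_one_epoch is an assumed helper, not a library function.

```python
import torch

# Cosine annealing over 100 epochs, stepped once per epoch.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
for epoch in range(100):
    train_one_epoch(model, dataloader, optimizer)   # assumed training helper
    scheduler.step()                                # advance the learning rate schedule
```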