Stochastic gradient descent (SGD) is an iterative optimization algorithm that updates model parameters using the gradient of the loss function computed on a single training example or a small subset of examples, rather than the entire dataset. Originally formalized in the context of stochastic approximation by Robbins and Monro in 1951, SGD has become the workhorse optimization method for training neural networks and other large-scale machine learning models. It is widely used in machine learning and deep learning to minimize objective functions, powering everything from convolutional neural networks for computer vision to transformers for natural language processing. Its combination of computational efficiency, implicit regularization through gradient noise, and favorable generalization properties makes it one of the most important algorithms in modern artificial intelligence.
Imagine you are trying to find the lowest spot in a huge, hilly park, but it is nighttime and you can only use a small flashlight. With regular gradient descent, you would somehow measure the slope of every single hill and valley in the entire park before taking one step. That would take forever. With SGD, you just shine your flashlight on the ground right in front of you, check which way the ground slopes downward, and take a step in that direction. Sometimes your flashlight shows you a slightly misleading patch of ground, so your path is a bit wobbly. But because you take lots of quick steps instead of one perfectly planned step, you actually reach the lowest point faster. The wobbliness even helps you avoid getting stuck in small dips that are not the true lowest point.
The mathematical foundations of SGD predate modern machine learning by several decades. In 1951, Herbert Robbins and Sutton Monro published "A Stochastic Approximation Method" in The Annals of Mathematical Statistics, introducing a general framework for iteratively solving root-finding problems in which the underlying function is observed only through noisy measurements. They proved that under certain conditions on the learning rate schedule (now called the Robbins-Monro conditions), stochastic iterative procedures converge to the true optimum despite the noise. These conditions specify that the step sizes must sum to infinity while their squares sum to a finite value, and they remain central to the theory of SGD today.
Jack Kiefer and Jacob Wolfowitz extended this framework in 1952, demonstrating that stochastic optimization could proceed even when only function evaluations, not gradients, were available: their method approximated the gradient by finite differences, removing the need for an analytic derivative. Frank Rosenblatt applied SGD-like updates to train the perceptron in the late 1950s, marking one of the first uses of stochastic optimization in a learning algorithm.
Despite these early foundations, SGD did not become widely adopted in machine learning until the 1980s and 1990s, when backpropagation made it practical to compute gradients for neural networks. Léon Bottou's influential 2010 paper "Large-Scale Machine Learning with Stochastic Gradient Descent" provided both theoretical analysis and practical guidance that helped establish SGD as the default optimizer for large-scale learning problems. With the explosion of deep learning in the 2010s, SGD and its variants became essential tools, as datasets grew to millions or billions of examples where full-batch methods were computationally infeasible. The development of GPU-accelerated training further cemented SGD's role as the default optimization method for deep learning.
Gradient descent is an iterative optimization algorithm used to minimize a differentiable objective function. The idea is to update the model parameters iteratively by moving them in the direction of the negative gradient of the objective function with respect to the parameters. This movement is governed by a learning rate, which determines the size of the steps taken towards the minimum.
The term "stochastic" refers to the presence of randomness in the optimization process. In the context of SGD, this randomness comes from the random selection of data points used in each iteration of the algorithm. This stochastic nature helps the algorithm explore the optimization landscape more effectively, allowing it to find better solutions and escape local minima in complex, non-convex optimization problems.
Consider a supervised learning problem where the goal is to minimize an empirical risk function over a dataset of n training examples:
J(θ) = (1/n) Σᵢ₌₁ⁿ L(θ; xᵢ, yᵢ)
where θ represents the model parameters, L is the per-example loss, and (xᵢ, yᵢ) are individual training samples.
Batch gradient descent computes the gradient over all n examples before making a single update:
θₜ₊₁ = θₜ - α (1/n) Σᵢ₌₁ⁿ ∇L(θₜ; xᵢ, yᵢ)
Pure SGD instead samples a single example i uniformly at random and updates using that example's gradient alone:
θₜ₊₁ = θₜ - α ∇L(θₜ; xᵢ, yᵢ)
The key insight is that the single-sample gradient ∇L(θₜ; xᵢ, yᵢ) is an unbiased estimator of the true gradient:
E[∇L(θₜ; xᵢ, yᵢ)] = (1/n) Σᵢ₌₁ⁿ ∇L(θₜ; xᵢ, yᵢ) = ∇J(θₜ)
This means that on average, the stochastic gradient points in the same direction as the full gradient, even though any individual estimate may be noisy. This unbiasedness property is what allows SGD to converge to the same solution as batch gradient descent given appropriate learning rate schedules, while performing each step at a fraction of the computational cost.
The main steps of the stochastic gradient descent algorithm are as follows (a code sketch of the loop follows the list):

1. Initialize the parameters θ (for example, randomly) and choose a learning rate α.
2. Shuffle the training data.
3. Select a training example (or mini-batch) at random.
4. Compute the gradient of the loss on that sample with respect to θ.
5. Update the parameters: θ ← θ − α∇L.
6. Repeat steps 3–5 until convergence or until an iteration budget is exhausted.
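These steps translate into a very short loop. The following is a minimal sketch, not production code: the synthetic least-squares problem, the variable names, the learning rate, and the iteration count are all illustrative assumptions.

```python
import numpy as np

# Minimal pure-SGD sketch on a synthetic least-squares problem.
rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.normal(size=(n, d))
theta_true = rng.normal(size=d)
y = X @ theta_true + 0.1 * rng.normal(size=n)

theta = np.zeros(d)        # step 1: initialize parameters
alpha = 0.01               # step 1: choose a learning rate
for t in range(10_000):
    i = rng.integers(n)                     # step 3: uniform sampling plays the role of shuffling
    grad = (X[i] @ theta - y[i]) * X[i]     # step 4: per-example gradient of the squared loss
    theta -= alpha * grad                   # step 5: SGD update
```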
In practice, the term "SGD" is used loosely to refer to three distinct variants that differ in how many examples are used to estimate the gradient at each step. These variants sit on a spectrum from using the full dataset to using a single example.
| Variant | Samples per update | Gradient quality | Compute per step | Memory usage | Convergence behavior |
|---|---|---|---|---|---|
| Batch gradient descent | Entire dataset (n) | Exact gradient | Very high | High | Smooth, stable descent; can get trapped in sharp minima |
| Mini-batch SGD | Small subset (b, typically 32 to 4096) | Low-variance estimate | Moderate | Moderate | Balanced noise and stability; parallelizes well on GPUs |
| Pure (online) SGD | Single example (1) | High-variance estimate | Very low | Very low | Noisy, rapid initial progress; high variance helps escape local minima |
Batch gradient descent computes the true gradient at each step, yielding smooth and predictable updates. However, for large datasets it is prohibitively expensive, since every update requires a full pass through the data.
Mini-batch SGD is the most widely used variant in deep learning practice. At each step, a random subset (mini-batch) of b examples is drawn, and the gradient is averaged over this subset:
θₜ₊₁ = θₜ - α (1/b) Σⱼ∈Bₜ ∇L(θₜ; xⱼ, yⱼ)
where Bₜ is the mini-batch at step t. Mini-batch SGD balances the computational cost per step with the quality of the gradient estimate. It also benefits from hardware parallelism, since modern GPUs and TPUs can process a batch of examples simultaneously through vectorized operations.
Pure SGD uses a single example per step. While it provides the fastest updates in wall-clock time per iteration, the extreme variance of the gradient estimate can slow convergence and make training unstable.
When practitioners and papers refer to "SGD" without further qualification, they almost always mean mini-batch SGD with a batch size between 32 and 256. The mini-batch size is a hyperparameter that trades off gradient noise for computational efficiency.
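Continuing the least-squares sketch above, the mini-batch update changes only how the gradient estimate is formed; the batch size of 64 is an illustrative choice.

```python
# Mini-batch SGD: average per-example gradients over a random subset B_t.
b = 64
for t in range(2_000):
    batch = rng.integers(n, size=b)          # draw the mini-batch B_t
    residual = X[batch] @ theta - y[batch]   # shape (b,)
    grad = X[batch].T @ residual / b         # (1/b) * sum of per-example gradients
    theta -= alpha * grad
```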
A counterintuitive property of SGD is that the noise in its gradient estimates can actually be beneficial rather than purely harmful. This phenomenon has been studied extensively and is understood through several complementary perspectives.
Escaping local minima and saddle points. The loss landscape of deep neural networks is highly non-convex, containing many local minima and saddle points. The stochastic noise in SGD's gradient estimates acts as a form of random perturbation that helps the optimizer escape shallow local minima and saddle points that would trap a deterministic optimizer.
Implicit regularization. SGD exhibits an implicit regularization effect, biasing the optimization trajectory toward flatter regions of the loss landscape. Flatter minima correspond to solutions where small perturbations to the parameters do not drastically change the loss, which correlates with better generalization to unseen data. This bias arises because the noise in SGD updates is larger in sharper regions of the loss landscape, effectively pushing the optimizer away from sharp minima and toward broader, flatter basins.
Exploration of the loss landscape. The stochastic nature of SGD means that it does not follow a single deterministic path through parameter space. Instead, it effectively samples from a distribution of trajectories, providing a form of exploration that can help discover better optima.
Research by Wu et al. (2022) and Damian et al. (2021) has provided theoretical evidence that SGD noise aligns with the Hessian of the loss, causing the optimizer to preferentially escape sharp minima. This alignment property provides a rigorous explanation for why SGD with smaller batch sizes (and thus more noise) tends to find solutions that generalize better.
Plain SGD can oscillate back and forth across narrow valleys in the loss landscape, making slow progress toward the optimum. Momentum methods address this by accumulating a running average of past gradients, which dampens oscillations and accelerates movement along consistent gradient directions.
Boris Polyak introduced the heavy ball method in 1964, which adds a momentum term to the update rule:
vₜ₊₁ = μ vₜ + ∇L(θₜ; xᵢ, yᵢ)
θₜ₊₁ = θₜ - α vₜ₊₁
Here, vₜ is the velocity vector, μ is the momentum coefficient (typically 0.9), and α is the learning rate. The velocity accumulates past gradients with exponential decay, so consistent gradient directions are reinforced while oscillating directions are damped. This is analogous to a ball rolling downhill: it gains speed on consistent slopes and resists changing direction on bumpy terrain. Momentum accelerates convergence on strongly convex problems and has been shown to improve the convergence rate from O(1/t) to O(1/t²) for quadratic objectives under optimal parameter choices.
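In code, the heavy ball method adds one extra state vector to the least-squares sketch above; μ = 0.9 follows the typical value mentioned in the text.

```python
# Heavy-ball (classical) momentum added to the least-squares sketch.
mu = 0.9                       # momentum coefficient
v = np.zeros(d)                # velocity vector
for t in range(2_000):
    i = rng.integers(n)
    grad = (X[i] @ theta - y[i]) * X[i]
    v = mu * v + grad          # accumulate past gradients with exponential decay
    theta -= alpha * v         # step along the velocity, not the raw gradient
```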
Yurii Nesterov proposed a modification that evaluates the gradient at a "lookahead" position rather than the current position:
vₜ₊₁ = μ vₜ + ∇L(θₜ - α μ vₜ; xᵢ, yᵢ)
θₜ₊₁ = θₜ - α vₜ₊₁
By computing the gradient at the anticipated next position (θₜ - α μ vₜ), Nesterov momentum provides a corrective factor that prevents overshooting. For smooth convex functions, Nesterov's method achieves a convergence rate of O(1/t²), which is optimal among first-order methods that use only gradient information at consecutive iterates. In practice, Nesterov momentum often converges faster than classical momentum, particularly in the final stages of optimization when the iterates are close to the solution, and it is commonly enabled in PyTorch training pipelines via the nesterov=True flag of the SGD optimizer.
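Under the same assumptions as the snippets above, the only change in code is where the gradient is evaluated.

```python
# Nesterov momentum: evaluate the gradient at the lookahead point.
v = np.zeros(d)
for t in range(2_000):
    i = rng.integers(n)
    lookahead = theta - alpha * mu * v           # anticipated next position
    grad = (X[i] @ lookahead - y[i]) * X[i]      # gradient at the lookahead, not at theta
    v = mu * v + grad
    theta -= alpha * v
```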
The learning rate is the single most important hyperparameter for SGD. Setting it too high causes the optimizer to diverge or oscillate wildly. Setting it too low results in painfully slow convergence. In practice, the learning rate is almost always varied during training according to a predefined schedule.
| Schedule | Formula / description | Typical use case |
|---|---|---|
| Constant | α remains fixed throughout training | Baseline; simple experiments; fine-tuning |
| Step decay | α is multiplied by a factor (e.g., 0.1) at specific epochs | Image classification (e.g., ResNet training); CNNs on ImageNet |
| Exponential decay | αₜ = α₀ · γᵗ for decay factor γ < 1 | Smooth, gradual reduction; older-style training |
| Cosine annealing | αₜ = α_min + 0.5(α₀ - α_min)(1 + cos(πt/T)) | Modern deep learning; transformer pretraining |
| Linear warmup + decay | Linearly increase α from 0 to α₀ over w steps, then decay | Large-batch training, transformers |
| One-cycle policy | Single cycle: increase α then decrease, with momentum mirror | Super-convergence; fast training |
| Polynomial decay | αₜ = α₀ · (1 - t/T)^p | NLP fine-tuning |
The simplest approach uses a fixed learning rate throughout training. While this is easy to implement, it is rarely optimal. A large constant rate prevents the optimizer from settling into a precise minimum, while a small constant rate wastes computation in the early stages of training.
Step decay reduces the learning rate by a multiplicative factor at predetermined milestones. For example, the learning rate might be divided by 10 at epochs 30, 60, and 90 of a 100-epoch training run. This schedule was widely used for training convolutional neural networks on ImageNet and remains a reliable baseline.
Proposed by Loshchilov and Hutter (2016) in their paper on SGDR (Stochastic Gradient Descent with Warm Restarts), cosine annealing smoothly decreases the learning rate following a cosine curve from an initial value to near zero over a training cycle. The schedule is given by:
αₜ = α_min + 0.5(α₀ - α_min)(1 + cos(πt/T))
where T is the total number of steps. The cosine shape starts with a gentle decline, allowing extended exploration at higher learning rates, and gradually flattens near zero for fine-tuning. Variants with warm restarts periodically reset the learning rate to its initial value, enabling the optimizer to escape suboptimal regions.
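The schedule is a direct transcription of the formula; the function below is a minimal sketch with illustrative names.

```python
import math

def cosine_lr(t, T, alpha_max, alpha_min=0.0):
    """Cosine annealing from alpha_max to alpha_min over T steps."""
    return alpha_min + 0.5 * (alpha_max - alpha_min) * (1 + math.cos(math.pi * t / T))
```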
Proposed by Leslie Smith (2018), the one-cycle policy trains with a single learning rate cycle: the rate increases linearly from a small value to a large maximum over the first half of training, then decreases back down over the second half, often continuing to decay to a value several orders of magnitude below the starting point. This policy exploits a phenomenon called "super-convergence," where training with large learning rates in the middle of the schedule acts as a regularizer, preventing the model from settling into sharp minima. Experiments showed that super-convergence can reduce training time by an order of magnitude on datasets like CIFAR-10 and CIFAR-100 without sacrificing accuracy.
The Robbins-Monro conditions provide the classical theoretical requirement for learning rate schedules to guarantee convergence:
Σₜ αₜ = ∞ and Σₜ αₜ² < ∞
The first condition ensures that the optimizer can reach any point in parameter space, while the second ensures that the step sizes decrease fast enough for the noise to average out. The canonical example is αₜ = α₀/t, which satisfies both conditions: the harmonic series Σ 1/t diverges, while Σ 1/t² converges (to π²/6). By contrast, αₜ = c/√t satisfies the first condition but not the second, since its squares again form the divergent harmonic series.
Learning rate warm-up is a technique where the learning rate starts at a very small value (or zero) and gradually increases to the target learning rate over the first few hundred or thousand training steps. Popularized by Goyal et al. (2017) for large-batch training and adopted in the original Transformer paper by Vaswani et al. (2017), warm-up has become standard practice for training transformers and other large models.
The primary benefit of warm-up is stabilizing early training dynamics. At initialization, the model parameters are essentially random, and the gradient estimates can be unreliable. A large learning rate applied to these unreliable gradients can push parameters into poorly conditioned regions of the loss landscape, causing training to diverge. By starting with a small learning rate, warm-up allows the model to settle into a reasonable region of parameter space before applying the full learning rate.
Research has shown that warm-up effectively reduces the sharpness of the loss landscape (measured by the top eigenvalue of the Hessian), guiding the optimizer toward flatter regions that can tolerate larger learning rates. Goyal et al. (2017) demonstrated that warm-up is critical for training with very large batch sizes (up to 8,192 images) using SGD with the linear scaling rule.
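A common way to combine warm-up with a decay schedule is sketched below. The linear-warmup-then-cosine shape is one popular choice; the names and the post-warmup decay are illustrative assumptions.

```python
import math

def warmup_cosine_lr(t, T, warmup_steps, alpha_max):
    """Linear warm-up to alpha_max over warmup_steps, then cosine decay to zero."""
    if t < warmup_steps:
        return alpha_max * (t + 1) / warmup_steps     # linear ramp from near zero
    progress = (t - warmup_steps) / max(1, T - warmup_steps)
    return 0.5 * alpha_max * (1 + math.cos(math.pi * progress))
```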
While SGD with momentum uses a single learning rate for all parameters, adaptive methods maintain per-parameter learning rates that automatically adjust based on the history of gradients. This is especially useful when different parameters have gradients of vastly different magnitudes.
AdaGrad (Adaptive Gradient Algorithm), proposed by John Duchi, Elad Hazan, and Yoram Singer in 2011, was the first widely adopted adaptive learning rate method. It accumulates the sum of squared gradients for each parameter and scales the learning rate inversely by the square root of this sum:
Gₜ = Gₜ₋₁ + (∇L(θₜ))²
θₜ₊₁ = θₜ − (α / √(Gₜ + ε)) ∇L(θₜ)
AdaGrad performs well on problems with sparse gradients (such as NLP tasks with large vocabularies) because infrequent features receive larger effective learning rates. However, AdaGrad has a significant drawback: the accumulated squared gradients in G grow monotonically, causing the effective learning rate to shrink continuously. For long training runs, this can cause the learning rate to become vanishingly small, effectively halting learning before the model has converged.
RMSProp was proposed by Geoffrey Hinton in Lecture 6e of his Coursera course on neural networks. It was never published in a formal paper, yet it became one of the most widely used optimizers in deep learning. RMSProp addresses AdaGrad's diminishing learning rate problem by replacing the cumulative sum of squared gradients with an exponential moving average:
E[g²]ₜ = ρ E[g²]ₜ₋₁ + (1 − ρ)(∇L(θₜ))²
θₜ₊₁ = θₜ − (α / √(E[g²]ₜ + ε)) ∇L(θₜ)
The decay factor ρ (typically 0.9 or 0.99) controls how quickly the moving average forgets old gradients. By using a moving window rather than an ever-growing accumulator, RMSProp maintains a more stable effective learning rate throughout training.
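Expressed in code, the two optimizers differ only in their accumulator. The sketch below continues the least-squares example and takes RMSProp steps, with the AdaGrad accumulator shown alongside for contrast.

```python
# AdaGrad vs. RMSProp: same update shape, different accumulator.
rho, eps = 0.9, 1e-8
G = np.zeros(d)                # AdaGrad: monotonically growing sum
Eg2 = np.zeros(d)              # RMSProp: exponential moving average
for t in range(2_000):
    i = rng.integers(n)
    g = (X[i] @ theta - y[i]) * X[i]
    G += g**2                                 # grows forever -> effective lr shrinks to zero
    Eg2 = rho * Eg2 + (1 - rho) * g**2        # forgets old gradients
    theta -= alpha * g / np.sqrt(Eg2 + eps)   # RMSProp step (swap in G for AdaGrad)
```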
Adam (Adaptive Moment Estimation), proposed by Diederik Kingma and Jimmy Ba in 2014, combines the ideas of momentum and adaptive learning rates. It maintains exponential moving averages of both the first moment (mean) and the second moment (uncentered variance) of the gradients:
mₜ = β₁ mₜ₋₁ + (1 − β₁) ∇L(θₜ)
vₜ = β₂ vₜ₋₁ + (1 − β₂)(∇L(θₜ))²
m̂ₜ = mₜ / (1 − β₁ᵗ)
v̂ₜ = vₜ / (1 − β₂ᵗ)
θₜ₊₁ = θₜ − α m̂ₜ / (√v̂ₜ + ε)
The bias correction terms (m̂ₜ and v̂ₜ) compensate for the fact that the moving averages are initialized at zero, which would otherwise cause them to be biased toward zero during the early iterations. The default hyperparameters recommended by Kingma and Ba are β₁ = 0.9, β₂ = 0.999, and ε = 10⁻⁸. Adam became the default optimizer for many deep learning tasks due to its fast convergence and relative insensitivity to hyperparameter choices.
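The update equations translate directly into code. The following continues the numpy sketch with the default hyperparameters quoted above.

```python
# Adam on the least-squares sketch, transcribing the equations above.
beta1, beta2, eps = 0.9, 0.999, 1e-8
alpha_adam = 0.001
m = np.zeros(d)
v2 = np.zeros(d)
for t in range(1, 2_001):                      # t starts at 1 for bias correction
    i = rng.integers(n)
    g = (X[i] @ theta - y[i]) * X[i]
    m = beta1 * m + (1 - beta1) * g            # first moment (mean)
    v2 = beta2 * v2 + (1 - beta2) * g**2       # second moment (uncentered variance)
    m_hat = m / (1 - beta1**t)                 # bias correction
    v_hat = v2 / (1 - beta2**t)
    theta -= alpha_adam * m_hat / (np.sqrt(v_hat) + eps)
```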
In 2017, Ilya Loshchilov and Frank Hutter identified a subtle but important flaw in how weight decay is typically implemented with Adam. Standard L2 regularization adds a penalty term λ‖θ‖² to the loss function, which is equivalent to weight decay for vanilla SGD. However, for adaptive optimizers like Adam, L2 regularization and weight decay are not equivalent. In Adam, the gradient of the L2 penalty is scaled by the adaptive learning rate, which means that parameters with large accumulated gradients receive less regularization than intended.
AdamW decouples the weight decay from the gradient-based update by applying it directly to the parameters rather than through the loss function:
θₜ₊₁ = (1 − λα) θₜ − α m̂ₜ / (√v̂ₜ + ε)
This simple modification substantially improves Adam's generalization performance. Loshchilov and Hutter reported a 15% relative improvement in test error when using decoupled weight decay versus L2 regularization with Adam. AdamW has become the standard optimizer for training transformer-based language models and is the default in libraries like Hugging Face Transformers.
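The decoupling amounts to a one-line change in the parameter update. A minimal sketch, where lam is an illustrative weight decay coefficient:

```python
import numpy as np

def adamw_update(theta, m_hat, v_hat, alpha, lam, eps=1e-8):
    """One AdamW step: weight decay is applied directly to the parameters,
    not routed through the adaptive gradient scaling."""
    return (1 - alpha * lam) * theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
```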
| Optimizer | Year | Key Innovation | Per-Parameter Adaptive? | Momentum? | Typical Use Cases |
|---|---|---|---|---|---|
| SGD | 1951 | Stochastic gradient updates | No | No | Baseline; simple models |
| SGD + Momentum | 1964 | Velocity accumulation (heavy ball) | No | Yes (classical) | CNNs on image classification |
| SGD + Nesterov | 1983 | Lookahead gradient evaluation | No | Yes (Nesterov) | CNNs; any task where SGD is preferred |
| AdaGrad | 2011 | Cumulative squared gradient scaling | Yes | No | Sparse data; NLP with large vocabularies |
| RMSProp | 2012 | Exponential moving average of squared gradients | Yes | No | RNNs; reinforcement learning |
| Adam | 2014 | First and second moment estimation with bias correction | Yes | Yes | General deep learning; default for many tasks |
| AdamW | 2017 | Decoupled weight decay | Yes | Yes | Transformer pretraining; language models |
| LARS | 2017 | Layer-wise adaptive learning rates | Yes (layer-wise) | Yes | Large-batch CNN training |
| LAMB | 2019 | Layer-wise adaptation combined with Adam | Yes (layer-wise + per-param) | Yes | Large-batch BERT pretraining |
The convergence properties of SGD depend heavily on the assumptions made about the objective function.
| Function class | Convergence rate | Key requirement |
|---|---|---|
| Convex, Lipschitz continuous | O(1/√T) | Decaying learning rate |
| Strongly convex, Lipschitz continuous | O(1/T) | Decaying learning rate |
| Smooth, convex | O(1/T) | Constant or decaying learning rate |
| Smooth, strongly convex | Linear rate O(exp(−cT)) to a neighborhood of the optimum | Constant learning rate (neighborhood shrinks with step size) |
| Non-convex, smooth | O(1/√T) to stationary point | Decaying learning rate |
For convex functions, SGD with a decaying learning rate αₜ = O(1/√t) achieves an expected suboptimality of O(1/√T) after T iterations. This is slower than the O(1/T) rate of full-batch gradient descent, which is the price paid for using noisy gradient estimates. This rate is optimal for first-order stochastic methods in the general convex setting.
For strongly convex functions (functions with a positive curvature lower bound), SGD achieves the faster rate of O(1/T) with an appropriately decaying learning rate of αₜ = O(1/t), where the hidden constant depends on the strong convexity parameter. This matches the minimax optimal rate for stochastic first-order optimization. Polyak and Juditsky (1992) showed that averaging the iterates (Polyak-Ruppert averaging) can further improve the convergence, achieving the information-theoretic lower bound.
For non-convex functions, which is the setting relevant to deep learning, SGD converges to a stationary point (where the gradient norm is small) at a rate of O(1/√T). Notably, this does not guarantee convergence to a global minimum or even a good local minimum. With a constant learning rate chosen proportional to 1/√T for a horizon of T steps, SGD drives the minimum expected squared gradient norm along the trajectory to O(1/√T). The practical success of SGD in deep learning suggests that the loss landscapes of neural networks have benign properties (such as few poor local minima) that enable SGD to find good solutions despite the lack of convexity.
The Polyak-Łojasiewicz (PL) condition provides a useful middle ground: functions satisfying the PL condition are not necessarily convex, but SGD can still achieve linear convergence rates on them. Many overparameterized neural networks have been shown to satisfy the PL condition near their initialization, helping explain the fast convergence observed in practice.
The choice between SGD (with momentum) and adaptive optimizers like Adam is one of the most debated practical questions in deep learning optimization. Both optimizers have passionate advocates, and the empirical evidence suggests that neither is universally better.
Wilson et al. (2017) published an influential study titled "The Marginal Value of Adaptive Gradient Methods in Machine Learning," demonstrating that adaptive methods (Adam, AdaGrad, RMSProp) consistently found solutions with worse generalization performance than SGD with momentum across a range of tasks, including image classification, character-level language modeling, and constituency parsing. They provided both empirical evidence and theoretical arguments that adaptive methods converge to different (and less desirable) minima than SGD.
Subsequent research has clarified this picture:
| Aspect | SGD with momentum | Adam |
|---|---|---|
| Generalization on vision tasks | Often better; finds flatter minima | Can underperform on test accuracy |
| Training speed | Slower initial convergence | Faster initial convergence |
| Hyperparameter sensitivity | Requires careful tuning of learning rate and schedule | More robust to learning rate choice |
| Transformer training | Rarely used alone | Nearly universal default |
| Gradient distribution | Assumes relatively homogeneous gradients | Handles heterogeneous gradient scales well |
| Final performance on NLP | Competitive when well-tuned | Standard choice |
Multiple studies have observed that SGD with momentum, when properly tuned, finds solutions that generalize better than those found by Adam. The theoretical explanation for SGD's generalization advantage centers on the structure of gradient noise. SGD's noise is anisotropic and scales with the local curvature of the loss, which biases the optimizer toward flatter minima in the loss landscape, consistent with the Hessian-alignment results discussed earlier. Flat minima tend to generalize better because small perturbations to the parameters do not significantly change the loss.

Research by Zhou et al. (2020) showed that SGD is more locally unstable at sharp minima than Adam, meaning it is more likely to escape them and settle in flatter regions. Adaptive methods, by rescaling gradients per parameter, alter this noise structure in ways that can direct the optimizer toward sharper minima. This property makes SGD the preferred optimizer for image classification with CNNs, where it has historically produced state-of-the-art results on benchmarks like ImageNet.
Adam tends to converge faster in the early stages of training and is much less sensitive to the choice of learning rate. This makes it particularly valuable for large-scale experiments where hyperparameter tuning budgets are limited. For transformer architectures, Adam (or AdamW) is strongly preferred because transformers exhibit highly heterogeneous gradient distributions across layers and parameters. Research has shown that transformers have a block-heterogeneous Hessian spectrum, meaning the curvature of the loss landscape varies dramatically across different parameter groups (e.g., attention weights versus layer norm parameters). Adam's per-parameter adaptivity naturally handles this heterogeneity, while SGD's single global learning rate struggles to handle these varying scales. Language models trained with transformers are widely reported to be difficult or impossible to train effectively with SGD.
Hybrid approaches such as SWATS (Keskar and Socher, 2017) begin training with Adam and automatically switch to SGD once a triggering criterion is met. This strategy attempts to capture Adam's fast early convergence while benefiting from SGD's generalization properties in the later stages of training.
Recent research has highlighted that the batch size plays a critical role in the SGD versus Adam comparison. At smaller batch sizes with sufficient training steps, SGD can match or outperform Adam on many tasks. As batch size increases, Adam's advantage grows because its adaptive learning rates compensate for the reduced gradient noise in large batches. This finding suggests that the choice of optimizer should be considered jointly with the batch size and total training budget.
Training modern deep learning models often requires distributing computation across many GPUs, TPUs, or machines, which typically means using very large batch sizes. The primary strategies for parallel SGD are synchronous and asynchronous approaches. However, simply increasing the batch size with a fixed learning rate degrades model quality.
In synchronous distributed SGD, each worker computes gradients on its local mini-batch, and all gradients are averaged (typically via an AllReduce operation) before any worker updates its parameters. This is mathematically equivalent to running SGD with a larger effective batch size equal to the per-worker batch size multiplied by the number of workers.
The linear scaling rule (Goyal et al., 2017) states that when the batch size is multiplied by k, the learning rate should also be multiplied by k to maintain the same training dynamics. Goyal et al. at Facebook AI Research used this rule together with a gradual warmup phase to train ResNet-50 on ImageNet in one hour using a batch size of 8,192 across 256 GPUs with no loss in accuracy. This rule works well up to a certain batch size, beyond which training quality degrades.
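As a sketch, the rule is a one-line computation; the base values of 0.1 per batch of 256 follow the ResNet-50 recipe described above, and the function name is illustrative.

```python
def scaled_lr(batch_size, base_lr=0.1, base_batch=256):
    """Linear scaling rule: multiply the learning rate by k when the
    batch size is multiplied by k."""
    return base_lr * batch_size / base_batch

# e.g., scaled_lr(8192) == 3.2 for the ImageNet-in-one-hour setup
```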
LARS (Layer-wise Adaptive Rate Scaling), proposed by You et al. (2017), addresses the scaling limitation by adjusting the learning rate independently for each layer based on the ratio of weight norms to gradient norms. LARS enabled training ImageNet with batch sizes up to 32,768 without significant accuracy loss and was instrumental in pushing the limits of distributed SGD training.
The Layer-wise Adaptive Moments optimizer for Batch training (LAMB), proposed by You et al. (2019), extends the LARS idea to Adam. LAMB combines per-dimension adaptivity (from Adam's second moment) with per-layer normalization (from LARS). It was used to train BERT in 76 minutes with a batch size of 65,536, significantly accelerating the pretraining of large language models. LAMB generally outperforms LARS across all batch sizes tested.
In asynchronous SGD, each worker computes gradients and updates a shared parameter server independently, without waiting for other workers. This eliminates the synchronization bottleneck but introduces "staleness," where gradients are computed using parameters that may have been updated multiple times since the gradient computation began.
Staleness can slow convergence and degrade final model quality. Mitigation strategies include bounded staleness (limiting how out-of-date gradients can be) and staleness-aware learning rate correction.
Gradient accumulation is a technique that simulates large-batch training on hardware with limited memory. Instead of processing a large batch at once, multiple smaller mini-batches are processed sequentially, and their gradients are summed before performing a single parameter update. This is mathematically equivalent to using the larger batch size, but requires only the memory needed for the smaller mini-batch.
For example, processing 4 mini-batches of 64 examples with gradient accumulation before updating is equivalent to a single mini-batch of 256 examples. This technique is particularly important for training large language models where memory constraints limit the per-GPU batch size.
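A sketch of the pattern in PyTorch, where model, criterion, and dataloader are assumed to exist; four accumulation steps over mini-batches of 64 reproduce the batch-of-256 example.

```python
import torch

accum_steps = 4   # 4 mini-batches of 64 -> effective batch of 256
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(dataloader):
    loss = criterion(model(inputs), targets)
    (loss / accum_steps).backward()       # scale so summed gradients average correctly
    if (step + 1) % accum_steps == 0:
        optimizer.step()                  # one parameter update per accum_steps batches
        optimizer.zero_grad()
```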
A persistent challenge in deep learning is that optimal hyperparameters (including the learning rate, initialization scale, and others) change as the model width scales up. This means practitioners cannot simply tune hyperparameters on a small model and transfer them to a larger one.
The Maximal Update Parameterization (muP), introduced by Greg Yang, Edward Hu, and collaborators in the "Tensor Programs" series of papers (presented at NeurIPS 2021), provides a principled solution. Under muP, the initialization variance, learning rate, and other hyperparameters are parameterized as functions of model width in such a way that optimal hyperparameters remain stable across different model sizes. This enables "muTransfer": tuning hyperparameters on a small proxy model and then directly transferring them to the full-scale model without further tuning.
The practical benefits are significant. By transferring pretraining hyperparameters from a 13M-parameter model, the authors matched the published performance of BERT-large (350M parameters) at a tuning cost equivalent to pretraining BERT-large only once. By transferring from a 40M-parameter model, they matched the published numbers for a 6.7B-parameter GPT-3 model at only 7% of the total pretraining cost. muP has been adopted by several large-scale training efforts and is available as an open-source PyTorch library from Microsoft.
PyTorch provides SGD through torch.optim.SGD:
```python
import torch
import torch.optim as optim

model = MyModel()

# SGD with momentum and weight decay
optimizer = optim.SGD(
    model.parameters(),
    lr=0.1,
    momentum=0.9,
    weight_decay=1e-4,
    nesterov=True,
)

# Training loop
for epoch in range(num_epochs):
    for batch in dataloader:
        optimizer.zero_grad()
        loss = criterion(model(batch.inputs), batch.targets)
        loss.backward()
        optimizer.step()
```
Key parameters include lr (learning rate, default 0.001), momentum (default 0), weight_decay (L2 penalty, default 0), nesterov (enables Nesterov momentum, default False), and dampening (dampening for momentum, default 0). Note that PyTorch's momentum implementation differs slightly from the classical formulation: it uses v = mu * v + g and then p = p - lr * v, rather than the formulation found in some textbooks.
In TensorFlow, SGD is available through tf.keras.optimizers.SGD:
```python
import tensorflow as tf

model = build_model()
optimizer = tf.keras.optimizers.SGD(
    learning_rate=0.1,
    momentum=0.9,
    nesterov=True,
)
model.compile(optimizer=optimizer, loss='categorical_crossentropy')
model.fit(train_data, train_labels, epochs=100, batch_size=128)
```
Both frameworks support learning rate schedulers that can be attached to the optimizer to implement the various schedules described above.
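For example, a cosine schedule can be attached to the PyTorch optimizer defined earlier; train_one_epoch is an assumed helper, not a library function.

```python
import torch

# Cosine annealing over 100 epochs, stepped once per epoch.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
for epoch in range(100):
    train_one_epoch(model, dataloader, optimizer)   # assumed training helper
    scheduler.step()                                # advance the learning rate schedule
```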