See also: Machine learning terms
Mini-batch stochastic gradient descent (often shortened to mini-batch SGD or MB-SGD) is the workhorse optimization algorithm of modern machine learning. It updates a model's parameters by computing the gradient of a loss function on a small random subset of the training data, called a mini-batch, and then taking a step opposite to that gradient. Almost every neural network trained today, from a simple convolutional classifier to a frontier LLM with hundreds of billions of parameters, is fit with some flavor of mini-batch SGD or one of its adaptive variants such as Adam or AdamW.
The method sits between two extremes. Full-batch gradient descent computes an exact gradient over the entire dataset before each step, which is expensive and requires the whole dataset to fit in memory. Pure SGD, in the strict sense of using a single example per step, gives very noisy updates that bounce around the loss surface. Mini-batch SGD picks a batch size B somewhere between 1 and the dataset size N, averaging gradients over B examples per step. This middle ground is what makes the method practical: it produces gradient estimates with manageable variance, makes good use of vectorized hardware like GPUs and TPUs, and converges much faster in wall-clock time than either alternative.
The statistical foundations of stochastic optimization predate machine learning by decades. In 1951 Herbert Robbins and Sutton Monro published "A Stochastic Approximation Method" in the Annals of Mathematical Statistics, introducing what is now called the Robbins-Monro algorithm for finding the root of a function known only through noisy measurements. Their convergence conditions on the step sizes ηₜ, that Σ ηₜ must diverge while Σ ηₜ² remains finite, are still cited today as classical sufficient conditions for SGD to converge.
The ideas filtered into pattern recognition through the perceptron rule (Rosenblatt 1958) and the LMS algorithm (Widrow and Hoff 1960), both of which are early examples of stochastic gradient methods. The connection to neural network training was made explicit once backpropagation was popularized in the 1980s. The mini-batch variant became the standard recipe in deep learning during the 2000s and 2010s, when GPUs made it efficient to compute gradients on dozens or hundreds of examples in parallel using matrix-matrix multiplications instead of slower matrix-vector operations.
Given a model with parameters θ, a per-example loss function ℓ, and a training set of N examples, mini-batch SGD repeats the following loop:

1. Sample a mini-batch of B examples (in practice, take the next B examples from a shuffled pass over the training set).
2. Compute the average gradient over the mini-batch, g = (1/B) Σᵢ ∇θ ℓ(θ; xᵢ, yᵢ).
3. Update the parameters by stepping against the gradient: θ ← θ − ηg, where η is the learning rate.
In each epoch the algorithm processes every training example exactly once, distributed across N/B mini-batch updates. A typical training run lasts anywhere from a single epoch (common for very large language model pretraining) to hundreds of epochs (common for vision tasks).
The gradient computed on a mini-batch is an unbiased estimate of the true gradient over the data distribution, with variance that scales as 1/B. Doubling the batch size halves the variance of the gradient estimate but also doubles the compute per step, so there is a tradeoff between the quality of each step and the number of steps you can afford.
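To make the loop concrete, here is a minimal NumPy sketch of mini-batch SGD on a toy least-squares problem; the data, dimensions, batch size, and learning rate are illustrative placeholders rather than a recommended recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear-regression data: N examples, y ≈ X @ theta_true plus noise.
N, d, B, eta = 10_000, 20, 64, 0.1
X = rng.normal(size=(N, d))
theta_true = rng.normal(size=d)
y = X @ theta_true + 0.1 * rng.normal(size=N)

theta = np.zeros(d)
for epoch in range(5):
    perm = rng.permutation(N)                  # reshuffle once per epoch
    for start in range(0, N, B):
        idx = perm[start:start + B]            # indices of the current mini-batch
        pred = X[idx] @ theta
        # Mean gradient of the squared error over the B examples in the batch.
        g = 2.0 * X[idx].T @ (pred - y[idx]) / len(idx)
        theta -= eta * g                       # the update: theta <- theta - eta * g
```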
Algorithms in the gradient descent family are usually grouped by how much data they touch per update.
| regime | batch size | gradient quality | steps per epoch | typical use |
|---|---|---|---|---|
| Full-batch gradient descent | B = N | exact | 1 | small problems, convex optimization, theoretical analysis |
| Mini-batch SGD | 1 < B << N | unbiased estimate, moderate noise | N / B | the standard for deep learning |
| Stochastic gradient descent (strict sense) | B = 1 | unbiased but very noisy | N | online learning, streaming data |
In practice the term "SGD" is used loosely. When a deep learning paper says it trains a model "with SGD," it almost always means mini-batch SGD, with batch sizes ranging from a few dozen examples up to the equivalent of millions of tokens for the largest language models.
Three reasons explain why the mini-batch regime dominates.
First, hardware. GPUs and TPUs are designed for dense linear algebra. A forward and backward pass over a batch of 256 images is not 256 times slower than a single image; it is often only 5 to 10 times slower, because the matrix multiplications inside the network keep the accelerator's compute units busy. Larger batches amortize the fixed overhead of kernel launches, memory transfers, and pipeline bubbles.
Second, variance reduction. The variance of the mini-batch gradient is the per-example gradient variance divided by B. Smaller batches give noisier updates, which can help the optimizer escape saddle points and shallow minima but make convergence less stable. Larger batches give cleaner updates but, beyond a certain point, the extra noise reduction stops helping.
Third, generalization. There is a long-running observation, formalized by Nitish Keskar and colleagues in 2017, that small-batch training tends to find flatter minima of the loss surface that generalize better to held-out data. Their paper "On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima" gave numerical evidence that large batches converge to sharp minima, while small batches converge to flat ones. The picture is not the whole story (later work showed the gap can often be closed with the right learning rate schedule), but the implicit regularization effect of mini-batch noise is real and is part of why neural networks generalize as well as they do.
The basic update rule θ ← θ − ηg has been extended in many ways. The most influential variants are summarized below.
| optimizer | year | author | core idea | typical use |
|---|---|---|---|---|
| Vanilla SGD | classical | Robbins & Monro (1951) | θ ← θ − ηg | baseline; image classification with momentum |
| Heavy ball momentum | 1964 | Polyak | v ← μv + g; θ ← θ − ηv | computer vision, ResNets |
| Nesterov accelerated gradient | 1983 | Nesterov | look-ahead momentum with provably better convex rate | convex problems, some CV models |
| AdaGrad | 2011 | Duchi, Hazan, Singer | per-parameter learning rate scaled by 1/√(Σ g²) | sparse features, NLP |
| RMSProp | 2012 | Hinton (Coursera lecture) | exponentially decaying average of g² | RNNs, early deep learning |
| Adam | 2015 | Kingma & Ba | combines momentum and RMSProp with bias correction | the de facto default for most tasks |
| AdamW | 2019 | Loshchilov & Hutter | Adam with decoupled weight decay | LLM and large-model training |
| Adafactor | 2018 | Shazeer & Stern | factorizes Adam's second moment to save memory | T5, PaLM, very large models |
| LARS | 2017 | You, Gitman, Ginsburg | layer-wise learning rate for large-batch CNN training | ResNet at large batch |
| LAMB | 2019 | You et al. | layer-wise variant of Adam for large batches | BERT pretraining in 76 minutes |
| Lion | 2023 | Chen et al. | sign-of-momentum updates discovered by symbolic search | competitive with AdamW, less memory |
Momentum, introduced by Boris Polyak in his 1964 paper "Some methods of speeding up the convergence of iteration methods," maintains a velocity vector v that accumulates past gradients with decay coefficient μ (typically 0.9). The update becomes vₜ = μ vₜ₋₁ + gₜ and θₜ = θₜ₋₁ − η vₜ. This damps oscillation across narrow valleys and accelerates progress along consistent gradient directions.
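As a minimal, framework-agnostic sketch (not any library's API), the heavy-ball update above can be written as:

```python
import numpy as np

def momentum_step(theta, v, grad, lr=0.01, mu=0.9):
    """One heavy-ball step: v accumulates past gradients with decay mu."""
    v = mu * v + grad            # v_t = mu * v_{t-1} + g_t
    theta = theta - lr * v       # theta_t = theta_{t-1} - eta * v_t
    return theta, v
```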
Adam, proposed by Diederik Kingma and Jimmy Ba at ICLR 2015, keeps an exponential moving average of both the gradient (first moment, like momentum) and the squared gradient (second moment, like RMSProp), then divides one by the square root of the other to get a per-parameter adaptive step size. It is the most widely used optimizer in deep learning practice. Its successor AdamW, from a 2019 ICLR paper by Ilya Loshchilov and Frank Hutter, fixes a subtle bug: in standard Adam, applying L2 regularization by adding λθ to the gradient does not behave like true weight decay because the adaptive denominator scales the regularization term too. AdamW decouples weight decay from the gradient update, applying θ ← (1 − ηλ) θ directly. The change is small in code but materially improves generalization, which is why AdamW has become the default for large-language-model pretraining.
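The decoupling is easiest to see in a sketch of a single step. The helper name adamw_step and the hyperparameter values below are made up for illustration, but the body follows the update described above:

```python
import numpy as np

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW-style step: the adaptive update sees only the raw gradient,
    and weight decay multiplies theta directly (decoupled)."""
    theta = (1 - lr * weight_decay) * theta          # decoupled decay: theta <- (1 - eta*lambda) theta
    m = beta1 * m + (1 - beta1) * grad               # first moment (momentum-like)
    v = beta2 * v + (1 - beta2) * grad ** 2          # second moment (RMSProp-like)
    m_hat = m / (1 - beta1 ** t)                     # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Standard Adam with "L2 regularization" would instead add weight_decay * theta
# to grad before the moment updates, so the decay term gets rescaled by the
# adaptive denominator -- the coupling that AdamW removes.
```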
Lion, introduced by Xiangning Chen and colleagues at Google Brain in their 2023 paper "Symbolic Discovery of Optimization Algorithms," was discovered by an evolutionary program search rather than designed by hand. Its update uses only the sign of a momentum-smoothed gradient, which keeps memory usage low (no second moment to store) and gives every parameter the same update magnitude. Reported gains include training compute reductions of up to 2.3x on diffusion models and competitive results on language models, though it typically calls for a learning rate roughly 3 to 10 times smaller than AdamW's, with a correspondingly larger weight decay.
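A sketch of one Lion step along the lines of the update described in the paper (the hyperparameter values here are illustrative):

```python
import numpy as np

def lion_step(theta, grad, m, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.01):
    """One Lion step: the update direction is the sign of an interpolation
    between the momentum buffer and the current gradient."""
    c = beta1 * m + (1 - beta1) * grad                          # interpolation
    theta = theta - lr * (np.sign(c) + weight_decay * theta)    # sign update + decoupled decay
    m = beta2 * m + (1 - beta2) * grad                          # the only optimizer state kept
    return theta, m
```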
The learning rate η is the single most important hyperparameter in mini-batch SGD. Most modern training runs vary it over time according to a schedule.
| schedule | shape | typical use |
|---|---|---|
| Constant | flat | small experiments, debugging |
| Step decay | drop by factor (e.g. 10x) at fixed epochs | classical CNN training |
| Exponential decay | ηₜ = η₀ · γᵗ | older recipes |
| Cosine annealing | half-cosine from η₀ to η_min | modern CV, LLM pretraining |
| Linear warmup + cosine | ramp up over first k steps, then cosine decay | the standard LLM recipe |
| One-cycle | warmup, plateau near peak, then anneal below η_min | super-convergence (Smith 2018) |
| Inverse square root | linear warmup, then ηₜ ∝ 1/√t | original Transformer paper |
Cosine annealing comes from "SGDR: Stochastic Gradient Descent with Warm Restarts" by Loshchilov and Hutter (ICLR 2017). The schedule decreases the learning rate from η_max to η_min following the curve ηₜ = η_min + 0.5 (η_max − η_min) (1 + cos(π T_cur / T_i)), with optional warm restarts that snap the rate back to its peak value. Combined with a short linear warmup, this is the schedule used by GPT-3 and most subsequent large-scale language models.
Linear warmup is important when training starts from a random initialization. A high learning rate applied to noisy early gradients can blow up the optimization. Warming up over a few hundred to a few thousand steps lets the gradient statistics stabilize before the optimizer takes large steps.
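A sketch of the warmup-plus-cosine recipe described above, using the SGDR-style curve from the previous paragraph (the step counts and rates below are placeholders, not recommendations):

```python
import math

def lr_at_step(step, peak_lr=3e-4, min_lr=3e-5, warmup_steps=2_000, total_steps=100_000):
    """Linear warmup to peak_lr, then a single half-cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps               # linear ramp from near zero
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```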
Learning rate and batch size are coupled. If you change one, you usually need to change the other.
The most-cited rule of thumb is the linear scaling rule from "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour" by Priya Goyal and colleagues at Facebook AI Research (2017). When the batch size grows by a factor k, the learning rate should grow by the same factor k, holding everything else constant. The intuition is that a k-times-larger batch produces a gradient with roughly the same direction but lower variance, so taking a k-times-larger step is safe and keeps the total per-epoch progress comparable. Combined with a gradual warmup over the first few epochs, this rule allowed the team to train ResNet-50 on ImageNet to 76.3% top-1 accuracy in one hour using a batch of 8,192 images on 256 GPUs, with no loss of accuracy versus a small-batch baseline.
The linear scaling rule has practical limits. Sam McCandlish and colleagues at OpenAI made these limits precise in their 2018 paper "An Empirical Model of Large-Batch Training," which introduced the gradient noise scale. The noise scale is a measurable statistic that predicts the critical batch size, the point beyond which doubling the batch stops giving a corresponding speedup in wall-clock time. Below the critical batch size, larger batches mean fewer steps to convergence; above it, you get diminishing returns and eventually waste compute. The critical batch size grows during training as the loss decreases, and it varies enormously by task: tens of thousands for ImageNet, millions of tokens for language models, and even larger for some reinforcement learning tasks. This framework was used to plan the training of GPT-3 and remains a standard reference for deciding how much data parallelism is worth.
For very large effective batches that exceed available accelerator memory, gradient accumulation is the standard trick. Instead of computing the full batch in one forward and backward pass, you split it into k micro-batches, accumulate the gradients across them, and only call the optimizer once per k micro-batches. The result is mathematically equivalent (modulo numerical effects) to training with a batch k times larger. This is how teams routinely simulate batch sizes in the millions of tokens on hardware that can only fit thousands per device.
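In PyTorch, for instance, the pattern is a few extra lines around the inner training loop; model, criterion, optimizer, dataloader, and the micro-batch count k below are assumed to be defined elsewhere:

```python
k = 8                                        # micro-batches per optimizer step (illustrative)
optimizer.zero_grad()
for i, (x, y) in enumerate(dataloader):
    loss = criterion(model(x), y) / k        # scale so the accumulated gradient is a mean
    loss.backward()                          # gradients accumulate in .grad across micro-batches
    if (i + 1) % k == 0:
        optimizer.step()                     # one parameter update per k micro-batches
        optimizer.zero_grad()
```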
| scenario | typical batch size | notes |
|---|---|---|
| Memory-constrained fine-tuning | 1 to 8 | gradient accumulation often used |
| Vision fine-tuning, small CNNs | 32 to 256 | the classical sweet spot |
| Standard ImageNet training | 256 to 1,024 | works on a single 8-GPU node |
| Large-batch ImageNet (Goyal 2017) | 8,192 | with linear scaling and warmup |
| BERT pretraining (LAMB) | 32,768 | Yang You et al. 2019 |
| GPT-3 pretraining | ~3.2 million tokens | with linear warmup and cosine decay |
| RL agents (e.g. OpenAI Five Dota 2) | tens of millions | high noise scale environment |
Under a few standard assumptions (smooth loss, bounded gradient variance, suitable step sizes) SGD provably converges to a stationary point of the expected risk. For convex objectives the expected suboptimality of the averaged iterate after T steps decreases as O(1/√T) with an appropriately chosen step size; for strongly convex objectives a Robbins-Monro-style decreasing step size improves the rate to O(log T / T) for the last iterate and O(1/T) with iterate averaging.
Deep learning loss surfaces are non-convex, and the classical theory does not directly apply. In practice SGD on overparameterized neural networks reliably finds solutions with low training loss, often even when the network can fit random labels (Zhang et al. 2017, "Understanding deep learning requires rethinking generalization"). The implicit regularization of small-batch SGD, combined with explicit techniques such as weight decay, dropout, and data augmentation, makes these solutions generalize despite the network's capacity to memorize.
Every major deep learning framework ships with mini-batch SGD as a built-in optimizer.
| framework | API | notes |
|---|---|---|
| PyTorch | torch.optim.SGD, torch.optim.Adam, torch.optim.AdamW | momentum, weight decay, Nesterov supported as flags |
| TensorFlow / Keras | tf.keras.optimizers.SGD, tf.keras.optimizers.Adam | similar surface, also includes Adafactor and Lion |
| JAX / Optax | optax.sgd, optax.adam, optax.adamw, optax.lion | composable transformations for chaining schedules |
| Hugging Face Transformers | wraps the framework optimizer | exposes a Trainer with warmup and weight decay defaults |
A minimal PyTorch training loop looks like this:
```python
import torch

# model, criterion, dataloader, and num_epochs are assumed to be defined elsewhere.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

for epoch in range(num_epochs):
    for x, y in dataloader:          # dataloader yields mini-batches
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()              # backprop fills .grad on every parameter
        optimizer.step()             # apply the update
```
The DataLoader handles shuffling, batching, and parallel data loading, while loss.backward() and optimizer.step() implement the gradient computation and parameter update.
Large-scale model training has changed what "mini-batch SGD" looks like in practice.
LLM pretraining today almost universally uses AdamW with a linear warmup followed by a cosine decay to roughly 10% of the peak learning rate. Effective batch sizes are measured in millions of tokens, achieved through a combination of distributed training across many accelerators and gradient accumulation. The Chinchilla and other scaling laws papers have shaped how teams allocate the compute budget between model size and the number of training tokens, but the underlying optimizer remains a mini-batch method.
Mixed-precision training is now standard: a 32-bit master copy of the weights is kept for the optimizer update, while the forward and backward passes, and hence the gradients, run in BF16 or FP16 on the accelerator. Optimizer states (the momentum and variance buffers in Adam) are typically kept in 32-bit to preserve numerical accuracy, although memory-saving variants like 8-bit Adam (Dettmers et al., 2022) are common when memory is tight. Adafactor and Lion go further, factorizing the second moment or cutting optimizer state to a single tensor per parameter.
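With PyTorch's automatic mixed precision, for example, an FP16 run looks like the sketch below (model, criterion, optimizer, and dataloader are assumed to exist; the GradScaler guards FP16 gradients against underflow while the optimizer still updates the FP32 weights):

```python
scaler = torch.cuda.amp.GradScaler()         # dynamic loss scaling for FP16 gradients
for x, y in dataloader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():          # forward pass runs in reduced precision
        loss = criterion(model(x), y)
    scaler.scale(loss).backward()            # backward pass on the scaled loss
    scaler.step(optimizer)                   # unscales gradients, then applies the update
    scaler.update()                          # adjust the scale factor for the next step
```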
For very large models, optimizer state itself becomes a bottleneck: standard Adam stores two extra full-precision tensors per parameter, which can exceed the model size for models in the hundreds of billions of parameters. Sharding the optimizer state across data-parallel ranks, as in DeepSpeed ZeRO and PyTorch FSDP, has become a routine part of the training stack.
Mini-batch SGD is not magic. It has well-known weak points.
It is sensitive to the learning rate. Set it too high and the loss diverges; set it too low and training stalls. Tuning the schedule, especially the peak learning rate and the warmup length, is one of the most important parts of getting a training run to work.
It requires gradients, which means it cannot be applied directly to non-differentiable objectives. Reinforcement learning, discrete optimization, and many combinatorial problems require gradient estimators, surrogate losses, or evolutionary methods to fit into the SGD framework.
It is path-dependent. Two runs with the same data and the same hyperparameters but different random seeds can land at noticeably different solutions, with different generalization properties. Reproducibility requires careful seeding of the data shuffler, parameter initialization, and any stochastic layers like dropout.
It does not give principled uncertainty estimates. The point estimate produced by SGD is just one mode of the posterior over parameters, and turning it into calibrated predictive uncertainty requires extra machinery such as Monte Carlo dropout, deep ensembles, or stochastic weight averaging.
You are trying to find the lowest point in a hilly field while blindfolded. You can feel the slope of the ground under your feet and step downhill. If you step after feeling just one square inch of dirt, you can move quickly, but often not in the right direction, because that one square inch might be a bump sloping the wrong way. If you stop and survey the entire field before each step, you will always head the right way, but you will be exhausted and slow. The smart thing to do is to feel the slope across a small patch of ground (a mini-batch), average it, and step that way. You move efficiently, you avoid being misled by tiny bumps, and you eventually reach the lowest spot. That is what mini-batch SGD does for a neural network.