See also: Machine learning terms
Mini-batch stochastic gradient descent (often shortened to mini-batch SGD or MB-SGD) is the workhorse optimization algorithm of modern machine learning. It updates a model's parameters by computing the gradient of a loss function on a small random subset of the training data, called a mini-batch, and then taking a step opposite to that gradient. Almost every neural network trained today, from a simple convolutional classifier to a frontier LLM with hundreds of billions of parameters, is fit with some flavor of mini-batch SGD or one of its adaptive variants such as Adam or AdamW.
The method sits between two extremes. Full-batch gradient descent computes an exact gradient over the entire dataset before each step, which is expensive and requires the whole dataset to fit in memory. Pure SGD, in the strict sense of using a single example per step, gives very noisy updates that bounce around the loss surface. Mini-batch SGD picks a batch size B somewhere between 1 and the dataset size N, averaging gradients over B examples per step. This middle ground is what makes the method practical: it produces gradient estimates with manageable variance, makes good use of vectorized hardware like GPUs and TPUs, and converges much faster in wall-clock time than either alternative.
The statistical foundations of stochastic optimization predate machine learning by decades. In 1951 Herbert Robbins and Sutton Monro published "A Stochastic Approximation Method" in the Annals of Mathematical Statistics, introducing what is now called the Robbins-Monro algorithm for finding the root of a function known only through noisy measurements. Their convergence conditions on the step sizes ηₜ, that Σ ηₜ must diverge while Σ ηₜ² remains finite, are still cited today as classical sufficient conditions for SGD to converge.
The ideas filtered into pattern recognition through the perceptron rule (Rosenblatt 1958) and the LMS algorithm (Widrow and Hoff 1960), both of which are early examples of stochastic gradient methods. The connection to neural network training was made explicit once backpropagation was popularized in the 1980s. The mini-batch variant became the standard recipe in deep learning during the 2000s and 2010s, when GPUs made it efficient to compute gradients on dozens or hundreds of examples in parallel using matrix-matrix multiplications instead of slower matrix-vector operations.
Given a model with parameters θ, a per-example loss function ℓ, and a training set of N examples, mini-batch SGD repeats the following loop:

1. Sample a mini-batch of B examples (in practice, take the next B examples from a shuffled pass over the training set).
2. Compute the average gradient over the mini-batch, g = (1/B) Σᵢ ∇θ ℓ(θ; xᵢ, yᵢ).
3. Update the parameters by stepping against the gradient: θ ← θ − ηg, where η is the learning rate.
In each epoch the algorithm processes every training example exactly once, distributed across N/B mini-batch updates. A typical training run lasts anywhere from a single epoch (common for very large language model pretraining) to hundreds of epochs (common for vision tasks).
The gradient computed on a mini-batch is an unbiased estimate of the true gradient over the data distribution, with variance that scales as 1/B. Doubling the batch size halves the variance of the gradient estimate but also doubles the compute per step, so there is a tradeoff between the quality of each step and the number of steps you can afford.
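To make the loop concrete, here is a minimal NumPy sketch of mini-batch SGD on a toy least-squares problem; the data, dimensions, batch size, and learning rate are illustrative placeholders rather than a recommended recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear-regression data: N examples, y ≈ X @ theta_true plus noise.
N, d, B, eta = 10_000, 20, 64, 0.1
X = rng.normal(size=(N, d))
theta_true = rng.normal(size=d)
y = X @ theta_true + 0.1 * rng.normal(size=N)

theta = np.zeros(d)
for epoch in range(5):
    perm = rng.permutation(N)                  # reshuffle once per epoch
    for start in range(0, N, B):
        idx = perm[start:start + B]            # indices of the current mini-batch
        pred = X[idx] @ theta
        # Mean gradient of the squared error over the B examples in the batch.
        g = 2.0 * X[idx].T @ (pred - y[idx]) / len(idx)
        theta -= eta * g                       # the update: theta <- theta - eta * g
```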
Algorithms in the gradient descent family are usually grouped by how much data they touch per update.
| regime | batch size | gradient quality | steps per epoch | typical use |
|---|---|---|---|---|
| Full-batch gradient descent | B = N | exact | 1 | small problems, convex optimization, theoretical analysis |
| Mini-batch SGD | 1 < B << N | unbiased estimate, moderate noise | N / B | the standard for deep learning |
| Stochastic gradient descent (strict sense) | B = 1 | unbiased but very noisy | N | online learning, streaming data |
In practice the term "SGD" is used loosely. When a deep learning paper says it trains a model "with SGD," it almost always means mini-batch SGD, with batch sizes ranging from a few dozen examples up to the equivalent of millions of tokens for the largest language models.
Three reasons explain why the mini-batch regime dominates.
First, hardware. GPUs and TPUs are designed for dense linear algebra. A forward and backward pass over a batch of 256 images is not 256 times slower than a single image; it is often only 5 to 10 times slower, because the matrix multiplications inside the network keep the accelerator's compute units busy. Larger batches amortize the fixed overhead of kernel launches, memory transfers, and pipeline bubbles.
Second, variance reduction. The variance of the mini-batch gradient is the per-example gradient variance divided by B. Smaller batches give noisier updates, which can help the optimizer escape saddle points and shallow minima but make convergence less stable. Larger batches give cleaner updates but, beyond a certain point, the extra noise reduction stops helping.
Third, generalization. There is a long-running observation, formalized by Nitish Keskar and colleagues in 2017, that small-batch training tends to find flatter minima of the loss surface that generalize better to held-out data. Their paper "On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima" gave numerical evidence that large batches converge to sharp minima, while small batches converge to flat ones. The picture is not the whole story (later work showed the gap can often be closed with the right learning rate schedule), but the implicit regularization effect of mini-batch noise is real and is part of why neural networks generalize as well as they do.
The basic update rule θ ← θ − ηg has been extended in many ways. The most influential variants are summarized below.
| optimizer | year | author | core idea | typical use |
|---|---|---|---|---|
| Vanilla SGD | classical | Robbins & Monro (1951) | θ ← θ − ηg | baseline; image classification with momentum |
| Heavy ball momentum | 1964 | Polyak | v ← μv + g; θ ← θ − ηv | computer vision, ResNets |
| Nesterov accelerated gradient | 1983 | Nesterov | look-ahead momentum with provably better convex rate | convex problems, some CV models |
| AdaGrad | 2011 | Duchi, Hazan, Singer | per-parameter learning rate scaled by 1/√(Σ g²) | sparse features, NLP |
| RMSProp | 2012 | Hinton (Coursera lecture) | exponentially decaying average of g² | RNNs, early deep learning |
| Adam | 2015 | Kingma & Ba | combines momentum and RMSProp with bias correction | the de facto default for most tasks |
| AdamW | 2019 | Loshchilov & Hutter | Adam with decoupled weight decay | LLM and large-model training |
| Adafactor | 2018 | Shazeer & Stern | factorizes Adam's second moment to save memory | T5, PaLM, very large models |
| LARS | 2017 | You, Gitman, Ginsburg | layer-wise learning rate for large-batch CNN training | ResNet at large batch |
| LAMB | 2019 | You et al. | layer-wise variant of Adam for large batches | BERT pretraining in 76 minutes |
| Lion | 2023 | Chen et al. | sign-of-momentum updates discovered by symbolic search | competitive with AdamW, less memory |
Momentum, introduced by Boris Polyak in his 1964 paper "Some methods of speeding up the convergence of iteration methods," maintains a velocity vector v that accumulates past gradients with decay coefficient μ (typically 0.9). The update becomes vₜ = μ vₜ₋₁ + gₜ and θₜ = θₜ₋₁ − η vₜ. This damps oscillation across narrow valleys and accelerates progress along consistent gradient directions.
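As a minimal, framework-agnostic sketch (not any library's API), the heavy-ball update above can be written as:

```python
import numpy as np

def momentum_step(theta, v, grad, lr=0.01, mu=0.9):
    """One heavy-ball step: v accumulates past gradients with decay mu."""
    v = mu * v + grad            # v_t = mu * v_{t-1} + g_t
    theta = theta - lr * v       # theta_t = theta_{t-1} - eta * v_t
    return theta, v
```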
Adam, proposed by Diederik Kingma and Jimmy Ba at ICLR 2015, keeps an exponential moving average of both the gradient (first moment, like momentum) and the squared gradient (second moment, like RMSProp), then divides one by the square root of the other to get a per-parameter adaptive step size. It is the most widely used optimizer in deep learning practice. Its successor AdamW, from a 2019 ICLR paper by Ilya Loshchilov and Frank Hutter, fixes a subtle bug: in standard Adam, applying L2 regularization by adding λθ to the gradient does not behave like true weight decay because the adaptive denominator scales the regularization term too. AdamW decouples weight decay from the gradient update, applying θ ← (1 − ηλ) θ directly. The change is small in code but materially improves generalization, which is why AdamW has become the default for large-language-model pretraining.
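The decoupling is easiest to see in a sketch of a single step. The helper name adamw_step and the hyperparameter values below are made up for illustration, but the body follows the update described above:

```python
import numpy as np

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW-style step: the adaptive update sees only the raw gradient,
    and weight decay multiplies theta directly (decoupled)."""
    theta = (1 - lr * weight_decay) * theta          # decoupled decay: theta <- (1 - eta*lambda) theta
    m = beta1 * m + (1 - beta1) * grad               # first moment (momentum-like)
    v = beta2 * v + (1 - beta2) * grad ** 2          # second moment (RMSProp-like)
    m_hat = m / (1 - beta1 ** t)                     # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Standard Adam with "L2 regularization" would instead add weight_decay * theta
# to grad before the moment updates, so the decay term gets rescaled by the
# adaptive denominator -- the coupling that AdamW removes.
```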
Lion, introduced by Xiangning Chen and colleagues at Google Brain in their 2023 paper "Symbolic Discovery of Optimization Algorithms," was discovered by an evolutionary program search rather than designed by hand. Its update uses only the sign of a momentum-smoothed gradient, which keeps memory usage low (no second moment to store) and gives every parameter the same update magnitude. Reported gains include training compute reductions of up to 2.3x on diffusion models and competitive results on language models, though it typically calls for a learning rate roughly 3 to 10 times smaller than AdamW's, with a correspondingly larger weight decay.
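A sketch of one Lion step along the lines of the update described in the paper (the hyperparameter values here are illustrative):

```python
import numpy as np

def lion_step(theta, grad, m, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.01):
    """One Lion step: the update direction is the sign of an interpolation
    between the momentum buffer and the current gradient."""
    c = beta1 * m + (1 - beta1) * grad                          # interpolation
    theta = theta - lr * (np.sign(c) + weight_decay * theta)    # sign update + decoupled decay
    m = beta2 * m + (1 - beta2) * grad                          # the only optimizer state kept
    return theta, m
```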
The learning rate η is the single most important hyperparameter in mini-batch SGD. Most modern training runs vary it over time according to a schedule.
| schedule | shape | typical use |
|---|---|---|
| Constant | flat | small experiments, debugging |
| Step decay | drop by factor (e.g. 10x) at fixed epochs | classical CNN training |
| Exponential decay | ηₜ = η₀ · γᵗ | older recipes |
| Cosine annealing | half-cosine from η₀ to η_min | modern CV, LLM pretraining |
| Linear warmup + cosine | ramp up over first k steps, then cosine decay | the standard LLM recipe |
| One-cycle | warmup, plateau near peak, then anneal below η_min | super-convergence (Smith 2018) |
| Inverse square root | linear warmup, then ηₜ ∝ 1/√t | original Transformer paper |
Cosine annealing comes from "SGDR: Stochastic Gradient Descent with Warm Restarts" by Loshchilov and Hutter (ICLR 2017). The schedule decreases the learning rate from η_max to η_min following the curve ηₜ = η_min + 0.5 (η_max − η_min) (1 + cos(π T_cur / T_i)), with optional warm restarts that snap the rate back to its peak value. Combined with a short linear warmup, this is the schedule used by GPT-3 and most subsequent large-scale language models.
Linear warmup is important when training starts from a random initialization. A high learning rate applied to noisy early gradients can blow up the optimization. Warming up over a few hundred to a few thousand steps lets the gradient statistics stabilize before the optimizer takes large steps.
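A sketch of the warmup-plus-cosine recipe described above, using the SGDR-style curve from the previous paragraph (the step counts and rates below are placeholders, not recommendations):

```python
import math

def lr_at_step(step, peak_lr=3e-4, min_lr=3e-5, warmup_steps=2_000, total_steps=100_000):
    """Linear warmup to peak_lr, then a single half-cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps               # linear ramp from near zero
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```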
Learning rate and batch size are coupled. If you change one, you usually need to change the other.
The most-cited rule of thumb is the linear scaling rule from "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour" by Priya Goyal and colleagues at Facebook AI Research (2017). When the batch size grows by a factor k, the learning rate should grow by the same factor k, holding everything else constant. The intuition is that a k-times-larger batch produces a gradient with roughly the same direction but lower variance, so taking a k-times-larger step is safe and keeps the total per-epoch progress comparable. Combined with a gradual warmup over the first few epochs, this rule allowed the team to train ResNet-50 on ImageNet to 76.3% top-1 accuracy in one hour using a batch of 8,192 images on 256 GPUs, with no loss of accuracy versus a small-batch baseline.
The linear scaling rule has practical limits. Sam McCandlish and colleagues at OpenAI made these limits precise in their 2018 paper "An Empirical Model of Large-Batch Training," which introduced the gradient noise scale. The noise scale is a measurable statistic that predicts the critical batch size, the point beyond which doubling the batch stops giving a corresponding speedup in wall-clock time. Below the critical batch size, larger batches mean fewer steps to convergence; above it, you get diminishing returns and eventually waste compute. The critical batch size grows during training as the loss decreases, and it varies enormously by task: tens of thousands for ImageNet, millions of tokens for language models, and even larger for some reinforcement learning tasks. This framework was used to plan the training of GPT-3 and remains a standard reference for deciding how much data parallelism is worth.
For very large effective batches that exceed available accelerator memory, gradient accumulation is the standard trick. Instead of computing the full batch in one forward and backward pass, you split it into k micro-batches, accumulate the gradients across them, and only call the optimizer once per k micro-batches. The result is mathematically equivalent (modulo numerical effects) to training with a batch k times larger. This is how teams routinely simulate batch sizes in the millions of tokens on hardware that can only fit thousands per device.
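In PyTorch, for instance, the pattern is a few extra lines around the inner training loop; model, criterion, optimizer, dataloader, and the micro-batch count k below are assumed to be defined elsewhere:

```python
k = 8                                        # micro-batches per optimizer step (illustrative)
optimizer.zero_grad()
for i, (x, y) in enumerate(dataloader):
    loss = criterion(model(x), y) / k        # scale so the accumulated gradient is a mean
    loss.backward()                          # gradients accumulate in .grad across micro-batches
    if (i + 1) % k == 0:
        optimizer.step()                     # one parameter update per k micro-batches
        optimizer.zero_grad()
```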
| scenario | typical batch size | notes |
|---|---|---|
| Memory-constrained fine-tuning | 1 to 8 | gradient accumulation often used |
| Vision fine-tuning, small CNNs | 32 to 256 | the classical sweet spot |
| Standard ImageNet training | 256 to 1,024 | works on a single 8-GPU node |
| Large-batch ImageNet (Goyal 2017) | 8,192 | with linear scaling and warmup |
| BERT pretraining (LAMB) | 32,768 | Yang You et al. 2019 |
| GPT-3 pretraining | ~3.2 million tokens | with linear warmup and cosine decay |
| RL agents (e.g. OpenAI Five Dota 2) | tens of millions | high noise scale environment |
Under a few standard assumptions (smooth loss, bounded gradient variance, suitable step sizes) SGD provably converges to a stationary point of the expected risk. For convex objectives the expected suboptimality of the averaged iterate after T steps decreases as O(1/√T) with an appropriately chosen step size; for strongly convex objectives a Robbins-Monro-style decreasing step size improves the rate to O(log T / T) for the last iterate and O(1/T) with iterate averaging.
Deep learning loss surfaces are non-convex, and the classical theory does not directly apply. In practice SGD on overparameterized neural networks reliably finds solutions with low training loss, often even when the network can fit random labels (Zhang et al. 2017, "Understanding deep learning requires rethinking generalization"). The implicit regularization of small-batch SGD, combined with explicit techniques such as weight decay, dropout, and data augmentation, makes these solutions generalize despite the network's capacity to memorize.
Every major deep learning framework ships with mini-batch SGD as a built-in optimizer.
| framework | API | notes |
|---|---|---|
| PyTorch | torch.optim.SGD, torch.optim.Adam, torch.optim.AdamW | momentum, weight decay, Nesterov supported as flags |
| TensorFlow / Keras | tf.keras.optimizers.SGD, tf.keras.optimizers.Adam | similar surface, also includes Adafactor and Lion |
| JAX / Optax | optax.sgd, optax.adam, optax.adamw, optax.lion | composable transformations for chaining schedules |
| Hugging Face Transformers | wraps the framework optimizer | exposes a Trainer with warmup and weight decay defaults |
A minimal PyTorch training loop looks like this:
```python
import torch

# model, criterion, dataloader, and num_epochs are assumed to be defined elsewhere.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

for epoch in range(num_epochs):
    for x, y in dataloader:          # dataloader yields mini-batches
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()              # backprop fills .grad on every parameter
        optimizer.step()             # apply the update
```
The DataLoader handles shuffling, batching, and parallel data loading, while loss.backward() and optimizer.step() implement the gradient computation and parameter update.
Large-scale model training has changed what "mini-batch SGD" looks like in practice.
LLM pretraining today almost universally uses AdamW with a linear warmup followed by a cosine decay to roughly 10% of the peak learning rate. Effective batch sizes are measured in millions of tokens, achieved through a combination of distributed training across many accelerators and gradient accumulation. The Chinchilla and other scaling laws papers have shaped how teams allocate the compute budget between model size and the number of training tokens, but the underlying optimizer remains a mini-batch method.
Mixed-precision training is now standard: a 32-bit master copy of the weights is kept for the optimizer update, while the forward and backward passes, and hence the gradients, run in BF16 or FP16 on the accelerator. Optimizer states (the momentum and variance buffers in Adam) are typically kept in 32-bit to preserve numerical accuracy, although memory-saving variants like 8-bit Adam (Dettmers et al., 2022) are common when memory is tight. Adafactor and Lion go further, factorizing the second moment or cutting optimizer state to a single tensor per parameter.
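With PyTorch's automatic mixed precision, for example, an FP16 run looks like the sketch below (model, criterion, optimizer, and dataloader are assumed to exist; the GradScaler guards FP16 gradients against underflow while the optimizer still updates the FP32 weights):

```python
scaler = torch.cuda.amp.GradScaler()         # dynamic loss scaling for FP16 gradients
for x, y in dataloader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():          # forward pass runs in reduced precision
        loss = criterion(model(x), y)
    scaler.scale(loss).backward()            # backward pass on the scaled loss
    scaler.step(optimizer)                   # unscales gradients, then applies the update
    scaler.update()                          # adjust the scale factor for the next step
```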
For very large models, optimizer state itself becomes a bottleneck: standard Adam stores two extra full-precision tensors per parameter, which can exceed the model size for models in the hundreds of billions of parameters. Sharding the optimizer state across data-parallel ranks, as in DeepSpeed ZeRO and PyTorch FSDP, has become a routine part of the training stack.
Mini-batch SGD is not magic. It has well-known weak points.
It is sensitive to the learning rate. Set it too high and the loss diverges; set it too low and training stalls. Tuning the schedule, especially the peak learning rate and the warmup length, is one of the most important parts of getting a training run to work.
It requires gradients, which means it cannot be applied directly to non-differentiable objectives. Reinforcement learning, discrete optimization, and many combinatorial problems require gradient estimators, surrogate losses, or evolutionary methods to fit into the SGD framework.
It is path-dependent. Two runs with the same data and the same hyperparameters but different random seeds can land at noticeably different solutions, with different generalization properties. Reproducibility requires careful seeding of the data shuffler, parameter initialization, and any stochastic layers like dropout.
It does not give principled uncertainty estimates. The point estimate produced by SGD is just one mode of the posterior over parameters, and turning it into calibrated predictive uncertainty requires extra machinery such as Monte Carlo dropout, deep ensembles, or stochastic weight averaging.
You are trying to find the lowest point in a hilly field while blindfolded. You can feel the slope of the ground under your feet and step downhill. If you step after feeling just one square inch of dirt, you can move quickly, but often not in the right direction, because that one square inch might be a bump sloping the wrong way. If you stop and survey the entire field before each step, you will always head the right way, but you will be exhausted and slow. The smart thing to do is to feel the slope across a small patch of ground (a mini-batch), average it, and step that way. You move efficiently, you avoid being misled by tiny bumps, and you eventually reach the lowest spot. That is what mini-batch SGD does for a neural network.