Mini-batch
Last reviewed
May 11, 2026
Sources
8 citations
Review status
Source-backed
Revision
v2 · 2,184 words
See also: Machine learning terms
In machine learning, a mini-batch is a small subset of the training dataset used to compute a single update to the model parameters during training. Instead of running the whole dataset through the model before each weight update (full batch gradient descent) or updating after every single example (pure stochastic gradient descent), mini-batch methods sit in between: they sample a chunk of examples, compute the gradient of the loss over that chunk, and step the parameters in the opposite direction. The chunk is the mini-batch, and the number of examples in it is the batch size.
Google's machine learning glossary describes the mini-batch as a batch-size strategy where the batch size is usually between 10 and 1,000 examples, and calls it the most efficient strategy in practice. Almost every modern deep learning workflow uses mini-batches by default. Training that performs one parameter update per mini-batch is called mini-batch SGD (mini-batch stochastic gradient descent). When people loosely say "SGD" today, they almost always mean mini-batch SGD, not the pure single-example version that traces back to the original Robbins and Monro paper in 1951.
A training run is organized into three nested concepts: epochs, iterations, and mini-batches. An epoch is one full pass through the entire training dataset. The dataset is sliced into mini-batches, and one iteration is a single forward pass, backward pass, and weight update on one mini-batch. If your dataset has 50,000 examples and your batch size is 100, one epoch contains 500 iterations.
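The arithmetic above can be checked with a short helper. This is a minimal sketch: the function name `iterations_per_epoch` and its `drop_last` flag are illustrative assumptions, covering the case where the dataset size is not an exact multiple of the batch size.

```python
import math

def iterations_per_epoch(num_examples: int, batch_size: int, drop_last: bool = False) -> int:
    """Number of mini-batch updates in one full pass over the data."""
    if drop_last:
        # Discard the final, smaller mini-batch.
        return num_examples // batch_size
    # Keep the final, smaller mini-batch.
    return math.ceil(num_examples / batch_size)

print(iterations_per_epoch(50_000, 100))                  # 500
print(iterations_per_epoch(50_000, 128))                  # 391 (last batch has 80 examples)
print(iterations_per_epoch(50_000, 128, drop_last=True))  # 390
```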
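Within each iteration the optimizer does roughly the following: it takes the next mini-batch, runs a forward pass to compute the loss on it, runs a backward pass to compute the gradient of that loss with respect to the parameters, and steps the parameters in the direction opposite to the gradient.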
When all mini-batches are consumed, the epoch ends. The data is shuffled and a new epoch begins. Training stops once the loss on a held-out set stops improving, or a maximum iteration count is reached.
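The epoch and iteration structure can be sketched as a plain NumPy training loop. This is an illustration under stated assumptions, not a reference implementation: the model is linear regression with mean squared error, the gradient is written out by hand, the loop runs for a fixed number of epochs instead of stopping on a held-out set, and the function name `minibatch_sgd` and its default hyperparameters are hypothetical.

```python
import numpy as np

def minibatch_sgd(X, y, batch_size=32, lr=0.1, epochs=20, seed=0):
    """Fit y ~ X @ w with mini-batch SGD on mean squared error."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)               # reshuffle at the start of every epoch
        for start in range(0, n, batch_size):    # one iteration per mini-batch
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            residual = Xb @ w - yb               # forward pass on the mini-batch
            grad = 2.0 * Xb.T @ residual / len(idx)  # gradient of the mini-batch MSE
            w -= lr * grad                       # step against the gradient
    return w

# Synthetic data (assumed for the example): 1,000 examples, 3 features.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=1000)

w = minibatch_sgd(X, y)
print(np.round(w, 2))
```

In a framework such as PyTorch or JAX the hand-written gradient would be replaced by automatic differentiation, but the loop shape (shuffle, slice, forward, backward, update) stays the same.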
The three classical variants of gradient descent differ only in how many examples contribute to each update.
| Method | Examples per update | Update frequency per epoch | Typical use |
|---|---|---|---|
| Batch (full-batch) gradient descent | The entire training set | 1 | Small classical ML problems |
| Stochastic gradient descent (pure SGD) | 1 | One per example | Online learning, theoretical analysis |
| Mini-batch gradient descent | Usually 16 to 1,024 | Dataset size divided by batch size | Almost all modern deep learning |
Full-batch gradient descent gives the lowest-variance estimate of the true gradient, but each step is expensive on a large dataset and the parameters only move once per epoch. Pure SGD with a batch size of one is light per update, but every step is based on a single example, so the gradient is extremely noisy and modern hardware ends up underutilized.
Mini-batches are the compromise. The gradient on a uniformly sampled mini-batch is an unbiased estimate of the full-data gradient, with variance that shrinks roughly in proportion to one over the batch size. GPUs and TPUs reach their peak throughput on batched matrix multiplications, so a step on a mini-batch of 128 or 256 examples often runs only a little slower than a step on a mini-batch of 32, while delivering a much less noisy gradient and much higher hardware utilization.
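The unbiasedness and the one-over-batch-size variance can be written out explicitly. The notation here is an assumption introduced for illustration: $g_i$ is the gradient of the loss on example $i$, $N$ is the dataset size, $B$ is the batch size, and the mini-batch indices $i_1, \dots, i_B$ are drawn uniformly and independently (with replacement).

$$
\bar g = \frac{1}{N}\sum_{i=1}^{N} g_i,
\qquad
\hat g_B = \frac{1}{B}\sum_{j=1}^{B} g_{i_j},
\qquad
\mathbb{E}\big[\hat g_B\big] = \bar g,
\qquad
\operatorname{Cov}\big[\hat g_B\big] = \frac{\Sigma}{B},
\quad
\Sigma = \frac{1}{N}\sum_{i=1}^{N} (g_i - \bar g)(g_i - \bar g)^{\top}.
$$

When the mini-batch is drawn without replacement, as in the usual shuffle-and-slice loop, the covariance picks up the finite-population factor $(N-B)/(N-1)$, which is close to 1 when $B \ll N$. That is why the shrinkage is only "roughly" $1/B$.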
Common mini-batch sizes are powers of two: 32, 64, 128, 256, 512, and 1,024. The convention comes from older GPU guidance suggesting that aligned, power-of-two tensor dimensions map cleanly onto hardware warps and tensor cores. Sebastian Raschka's 2022 benchmarks showed that the speed difference between a power of two and a nearby non-power-of-two value is small on current hardware, so the convention is more habit than hard requirement.
A few rules of thumb from practitioner guides:
In their 2017 ICLR paper on large-batch training, Keskar and co-authors defined the small-batch regime as 32 to 512 examples, which matches the default range in many deep learning frameworks.
The practical ceiling on batch size is usually GPU memory, not optimization theory. Larger batches store more activations during the forward pass, because backpropagation needs them again on the way back. When the desired batch size exceeds what fits on one accelerator, two workarounds are standard:
Batch size also controls parallelism. Tiny batches produce matrix multiplications too small to fill the hardware, so throughput is wasted. Huge batches expose all the available parallelism, at the cost of more memory and a harder optimization problem.
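One common workaround when the desired batch does not fit in memory is gradient accumulation: process several smaller micro-batches in sequence, sum their gradients, and apply a single update. The sketch below assumes a linear model with squared-error loss, and the helper names `mse_grad_sum` and `accumulated_gradient` are hypothetical. It only checks that the accumulated gradient matches the gradient computed on the whole batch at once.

```python
import numpy as np

def mse_grad_sum(w, X, y):
    """Sum (not mean) of the per-example squared-error gradients."""
    return 2.0 * X.T @ (X @ w - y)

def accumulated_gradient(w, X, y, micro_batch_size):
    """Gradient of the mean loss over (X, y), computed one micro-batch at a time."""
    total = np.zeros_like(w)
    for start in range(0, len(X), micro_batch_size):
        Xm = X[start:start + micro_batch_size]
        ym = y[start:start + micro_batch_size]
        total += mse_grad_sum(w, Xm, ym)   # only this micro-batch needs to be in memory
    return total / len(X)                  # normalize once by the effective batch size

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 4))
y = rng.normal(size=256)
w = rng.normal(size=4)

full = mse_grad_sum(w, X, y) / len(X)
accum = accumulated_gradient(w, X, y, micro_batch_size=32)
print(np.allclose(full, accum))  # True
```

The trade is time for memory: eight micro-batches of 32 give the same gradient as one batch of 256, but the forward and backward passes run eight times in sequence.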
Larger mini-batches do not automatically lead to better-trained models. In their 2017 ICLR paper "On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima," Keskar, Mudigere, Nocedal, Smelyanskiy, and Tang reported that switching from a small-batch regime (32 to 512 examples) to a large-batch regime causes a measurable drop in test accuracy. They argued that small-batch SGD tends to settle into flat regions of the loss surface, while large-batch SGD often converges to sharp minimizers, and that flat minima generalize better. The noise in small-batch gradients seems to act as implicit regularization. Subsequent work showed that the gap can be partially closed by changing the learning rate schedule, using warmup, or training for more epochs at the large batch size. The takeaway is that batch size is not free: doubling it and otherwise leaving everything alone often produces a slightly worse model.
The most influential recipe for keeping large mini-batches well-behaved is the linear scaling rule from Goyal and co-authors in their 2017 paper "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour." The rule says: when you multiply the mini-batch size by a factor k, multiply the learning rate by the same factor k. The intuition is that a larger batch averages over more independent gradient estimates, which lets the optimizer take a proportionally larger step without overshooting.
The paper showed that this rule, combined with a gradual learning rate warmup over the first five epochs, allowed the authors to train ResNet-50 on ImageNet with a global mini-batch of 8,192 across 256 GPUs in roughly one hour, while matching the accuracy of the standard 256-batch baseline. Warmup matters because the linear scaling rule breaks down very early in training, when the network is changing rapidly and a large step at the full learning rate can send the parameters in a bad direction.
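The linear scaling rule and a gradual warmup can be sketched in a few lines. The function names `scaled_lr` and `warmup_lr` are hypothetical, and the base learning rate of 0.1 and the warmup length of 100 steps are illustrative assumptions, not settings taken from this article. The batch sizes mirror the 256-to-8,192 comparison above.

```python
def scaled_lr(base_lr: float, base_batch: int, batch: int) -> float:
    """Linear scaling rule: multiply the learning rate by k = batch / base_batch."""
    return base_lr * batch / base_batch

def warmup_lr(step: int, warmup_steps: int, base_lr: float, target_lr: float) -> float:
    """Ramp linearly from base_lr to target_lr over warmup_steps, then hold target_lr."""
    if step >= warmup_steps:
        return target_lr
    return base_lr + (target_lr - base_lr) * step / warmup_steps

target = scaled_lr(base_lr=0.1, base_batch=256, batch=8192)
print(target)  # 3.2

# Learning rate at the start, midpoint, and end of an assumed 100-step warmup.
for step in (0, 50, 100):
    print(step, warmup_lr(step, warmup_steps=100, base_lr=0.1, target_lr=target))
```

After warmup, whatever decay schedule the baseline used would be applied to the scaled learning rate; that part is omitted here.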
The rule is empirical and has limits. It tends to hold well for SGD with momentum on convolutional networks and breaks down at very large batch sizes; in Goyal et al.'s own ImageNet experiments, accuracy began to degrade once the mini-batch grew past roughly 8,192 examples. For Adam and other adaptive optimizers, the scaling looks closer to a square root rule in some regimes, and several papers since 2018 have re-derived scaling laws from stochastic differential equation (SDE) approximations of SGD.
In a 2018 ICLR follow-up, "Don't Decay the Learning Rate, Increase the Batch Size," Smith, Kindermans, Ying, and Le pointed out that decaying the learning rate over training has, under the linear scaling rule, much the same effect as growing the mini-batch size while holding the learning rate fixed: both reduce the amount of noise in the parameter updates, and the resulting learning curves are usually nearly identical. Their experiments trained ResNet-50 on ImageNet to 76.1% validation accuracy in under 30 minutes by ramping the batch size up over training.
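The argument is usually stated in terms of an SGD "noise scale." As a hedged sketch in assumed notation ($\epsilon$ for the learning rate, $N$ for the training set size, $B$ for the batch size), the quantity that controls the size of the random fluctuations in plain SGD is approximately

$$
g \;=\; \epsilon\left(\frac{N}{B} - 1\right) \;\approx\; \frac{\epsilon N}{B}
\qquad \text{for } B \ll N .
$$

Dividing $\epsilon$ by a factor $\alpha$ and multiplying $B$ by the same $\alpha$ both reduce $g$ by $\alpha$, which is the sense in which the two schedules are interchangeable. The approximation relies on $B \ll N$, so the exchange stops working once the batch is a sizable fraction of the dataset.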
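In data-parallel distributed training, every worker holds a copy of the model and processes a shard of each global mini-batch. Two numbers matter: the local batch size, which is the number of examples each worker processes per step, and the global batch size, which is the local batch size multiplied by the number of workers.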
The global batch is what the optimization recipe "sees," so it is the size that should be plugged into the linear scaling rule. When papers report training ImageNet with a batch size of 8,192 on 256 GPUs, the local batch is 32 and the global batch is 8,192. The gradient averaging happens through an AllReduce collective, often implemented as a ring AllReduce, which sums the per-device gradients and broadcasts the result so every worker performs the same parameter update.
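The relationship between per-worker gradients and the global-batch gradient can be simulated on one machine. This is a sketch under stated assumptions: the "workers" are just array shards, `all_reduce_mean` is a hypothetical stand-in for a real AllReduce collective and not a library call, and the model is linear regression with squared error.

```python
import numpy as np

def local_gradient(w, X, y):
    """Mean squared-error gradient on one worker's shard."""
    return 2.0 * X.T @ (X @ w - y) / len(X)

def all_reduce_mean(grads):
    """Stand-in for an AllReduce: sum the per-worker gradients, divide by the worker count."""
    return np.sum(grads, axis=0) / len(grads)

num_workers, local_batch = 8, 32
global_batch = num_workers * local_batch

rng = np.random.default_rng(0)
X = rng.normal(size=(global_batch, 4))
y = rng.normal(size=global_batch)
w = rng.normal(size=4)

# Each simulated worker receives an equal-sized shard of the global mini-batch.
shards = zip(np.split(X, num_workers), np.split(y, num_workers))
grads = [local_gradient(w, Xs, ys) for Xs, ys in shards]
averaged = all_reduce_mean(grads)

print(global_batch)                                     # 256
print(np.allclose(averaged, local_gradient(w, X, y)))   # True
```

Because every shard has the same size, the average of the local mean gradients equals the mean gradient over the global batch, which is why the global batch size is the one that belongs in the learning rate recipe.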
The same pattern (large global mini-batches, linear scaling, warmup, gradient accumulation when memory is tight) underlies most large-scale training pipelines for large language models, vision transformers, and other large neural networks.
Mini-batch training is the default for several reasons. The optimizer takes many steps per epoch, so parameters move toward a good region of the loss surface faster than under full-batch gradient descent. Batched matrix operations saturate GPU and TPU throughput in a way that single-example SGD cannot. The averaged gradient is smoother than the per-example gradient, which gives steadier convergence. The remaining noise (relative to full-batch gradients) appears to help networks find flatter, better-generalizing minima. Memory stays manageable because only one mini-batch needs to sit on the device at a time.
The tradeoffs are real too. Batch size is one more hyperparameter to tune, and it couples with the learning rate, so a careless change can quietly cost accuracy. Mini-batch gradients fluctuate around the true gradient, which sometimes causes training instability. Past a certain point, doubling the batch size doubles the compute per step but cuts the number of steps needed to converge by less than half; the exact crossover depends on the model, the dataset, and the optimizer.
Imagine a giant bag of math problems and a tutor teaching a student. If the tutor reads every problem in the bag before giving any feedback, the feedback is great but very slow. If the tutor reacts after every single problem, the feedback is fast but jumpy, and the student gets confused. So the tutor grabs a small handful of problems, looks at them together, gives one piece of feedback, then grabs another handful. The handful is the mini-batch: big enough that the feedback is steady, small enough that the student gets plenty of feedback before the bag is empty.