Mini-batch

Introduction

In machine learning, a mini-batch is a small subset of the training dataset used to compute a single update to the model parameters during training. Instead of running the whole dataset through the model before each weight update (full-batch gradient descent) or updating after every single example (pure stochastic gradient descent), mini-batch methods sit in between: they sample a chunk of examples, compute the gradient of the loss over that chunk, and step the parameters in the direction opposite to that gradient. The chunk is the mini-batch, and the number of examples in it is the batch size.

Google's machine learning glossary describes the mini-batch as a batch-size strategy where the batch size is usually between 10 and 1,000 examples, and calls it the most efficient strategy in practice. Almost every modern deep learning workflow uses mini-batches by default. Stochastic gradient descent run on mini-batches is usually called mini-batch SGD. When people loosely say "SGD" today, they almost always mean mini-batch SGD, not the pure single-example version that traces back to the original Robbins and Monro paper in 1951.

How a mini-batch fits into training

A training run is organized into three nested concepts: epochs, iterations, and mini-batches. An epoch is one full pass through the entire training dataset. The dataset is sliced into mini-batches, and one iteration is a single forward pass, backward pass, and weight update on one mini-batch. If your dataset has 50,000 examples and your batch size is 100, one epoch contains 500 iterations.
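The arithmetic above can be checked with a short sketch. The function name and the `drop_last` flag are illustrative assumptions, not part of any particular framework; the flag only decides whether a final, smaller mini-batch counts as an iteration when the dataset size is not a multiple of the batch size.

```python
import math


def iterations_per_epoch(num_examples: int, batch_size: int, drop_last: bool = False) -> int:
    """Number of mini-batches (and therefore weight updates) in one epoch."""
    if num_examples <= 0 or batch_size <= 0:
        raise ValueError("num_examples and batch_size must be positive")
    if drop_last:
        # Discard the final partial mini-batch, if there is one.
        return num_examples // batch_size
    # Keep the final partial mini-batch as its own iteration.
    return math.ceil(num_examples / batch_size)


print(iterations_per_epoch(50_000, 100))                  # 500
print(iterations_per_epoch(50_000, 128))                  # 391
print(iterations_per_epoch(50_000, 128, drop_last=True))  # 390
```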

Within each iteration the optimizer does roughly the following:

Sample a mini-batch from the training set, typically by shuffling the data once per epoch and walking through it in order.
Run a forward pass through the model on those examples and compute the loss.
Run a backward pass and compute the gradient of the loss function with respect to the parameters, averaged over the mini-batch.
Update the parameters using the gradient, a learning rate, and whatever optimizer state is in use (momentum, Adam moments, and so on).
Move on to the next mini-batch.

When all mini-batches are consumed, the epoch ends. The data is shuffled and a new epoch begins. Training stops once the loss on a held-out set stops improving, or a maximum iteration count is reached.
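The loop described above can be sketched in plain Python. This is a minimal illustration, assuming a one-feature linear model, mean squared error, a fixed epoch count instead of a held-out stopping rule, and hand-written gradients; the function name, the toy data, and the hyperparameter values are all hypothetical.

```python
import random


def minibatch_sgd(xs, ys, batch_size=4, lr=0.05, epochs=200, seed=0):
    """Fit y ~ w * x + b with mini-batch SGD on mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # reshuffle at the start of every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Forward pass: prediction errors on this mini-batch.
            errors = [(w * xs[i] + b) - ys[i] for i in batch]
            # Backward pass: gradient of the loss, averaged over the mini-batch.
            grad_w = sum(2 * e * xs[i] for e, i in zip(errors, batch)) / len(batch)
            grad_b = sum(2 * e for e in errors) / len(batch)
            # Update: step against the gradient, scaled by the learning rate.
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


xs = [i / 10 for i in range(20)]   # 20 toy inputs
ys = [3.0 * x + 1.0 for x in xs]   # noiseless targets with slope 3 and intercept 1
w, b = minibatch_sgd(xs, ys)
print(w, b)  # expected to land close to 3.0 and 1.0
```

With 20 examples and a batch size of 4, each epoch here contains 5 iterations. A real training run would add optimizer state such as momentum and a validation check for early stopping.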

Mini-batch versus full-batch and stochastic gradient descent

The three classical variants of gradient descent differ only in how many examples contribute to each update.

Method	Examples per update	Updates per epoch	Typical use
Batch (full-batch) gradient descent	The entire training set	1	Small classical ML problems
Stochastic gradient descent (pure SGD)	1	Dataset size (one per example)	Online learning, theoretical analysis
Mini-batch gradient descent	Usually 16 to 1,024	Dataset size divided by batch size	Almost all modern deep learning

Full-batch gradient descent computes the true gradient of the training loss, with no sampling noise, but each step is expensive on a large dataset and the parameters only move once per epoch. Pure SGD with a batch size of one is cheap per update, but every step is based on a single example, so the gradient is extremely noisy and modern hardware ends up underutilized.

Mini-batches compromise. The gradient on a mini-batch is an unbiased estimate of the full-data gradient, with variance that shrinks roughly with one over the batch size. GPUs and TPUs reach their peak throughput on batched matrix multiplications, so a mini-batch of 128 or 256 examples often runs only a little slower than a mini-batch of 32, while delivering a much less noisy gradient and a much higher utilization rate.
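The one-over-batch-size behaviour can be observed in a small simulation. This is a sketch under stated assumptions: the "per-example gradients" are just synthetic scalars drawn from a normal distribution, and mini-batches are sampled with replacement so that the variance of the batch mean is the population variance divided by the batch size. Epoch-based sampling without replacement behaves similarly when the batch is much smaller than the dataset.

```python
import random
import statistics


def batch_mean_variance(values, batch_size, trials=5_000, seed=0):
    """Empirical variance of the mini-batch mean over many random mini-batches."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.choices(values, k=batch_size))  # sample with replacement
        for _ in range(trials)
    ]
    return statistics.pvariance(means)


# Synthetic stand-in for per-example gradients of one parameter (hypothetical data).
data_rng = random.Random(1)
per_example_grads = [data_rng.gauss(0.5, 2.0) for _ in range(1_000)]
population_variance = statistics.pvariance(per_example_grads)

for batch_size in (1, 4, 16, 64):
    measured = batch_mean_variance(per_example_grads, batch_size)
    predicted = population_variance / batch_size
    print(batch_size, round(measured, 4), round(predicted, 4))
```

Each printed row should show the measured variance tracking the predicted value, so quadrupling the batch size cuts the gradient variance by roughly four and the standard deviation by roughly two.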

Typical batch sizes

Common mini-batch sizes are powers of two: 32, 64, 128, 256, 512, and 1,024. The convention comes from older GPU guidance suggesting that aligned, power-of-two tensor dimensions map cleanly onto hardware warps and tensor cores. Sebastian Raschka's 2022 benchmarks showed that the speed difference between a power of two and a nearby non-power-of-two value is small on current hardware, so the convention is more habit than hard requirement.

A few rules of thumb from practitioner guides:

On a single GPU, batch sizes between 32 and 256 are typical for image classification and similar workloads.
Very small batches (1 to 16) are used when memory is tight, when training very large models, or when researchers want the regularizing effect of noisy gradients.
Very large batches (4,096 and above) are reserved for distributed training across many devices.

In their 2017 ICLR paper on large-batch training, Keskar and co-authors defined the small-batch regime as 32 to 512 examples, which matches the default range in many deep learning frameworks.

Memory and parallelism tradeoffs

The practical ceiling on batch size is usually GPU memory, not optimization theory. Larger batches store more activations during the forward pass, because backpropagation needs them again on the way back. When the desired batch size exceeds what fits on one accelerator, two workarounds are standard:

Gradient accumulation runs several small mini-batches in sequence and accumulates their gradients before a single weight update. With the accumulated gradient averaged over the small batches, the result is mathematically close to one big batch, with the memory footprint of one small one. The cost is wall-clock time, since the small batches run one after another instead of in parallel.
Data parallelism splits the mini-batch across multiple GPUs. Each GPU computes the gradient on its local shard, then an AllReduce step averages the gradients before the weight update. The global batch is the sum of the local batches.
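The equivalence behind gradient accumulation can be illustrated with a toy example. This is a minimal sketch, assuming a one-parameter model y = w * x with mean squared error and hand-written gradients; the function names and data are hypothetical. Each micro-batch gradient is weighted by its share of the full batch so that the accumulated value matches the big-batch average even when the last micro-batch is smaller.

```python
def mean_gradient(w, xs, ys):
    """Gradient of mean squared error for the model y = w * x, averaged over the examples."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_gradient(w, xs, ys, micro_batch_size):
    """Average gradient over the full batch, computed one micro-batch at a time."""
    total = 0.0
    for start in range(0, len(xs), micro_batch_size):
        micro_xs = xs[start:start + micro_batch_size]
        micro_ys = ys[start:start + micro_batch_size]
        # Weight each micro-batch mean by its share of the full batch.
        total += mean_gradient(w, micro_xs, micro_ys) * len(micro_xs) / len(xs)
    return total


xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [2.0 * x for x in xs]
w = 0.3

big_batch = mean_gradient(w, xs, ys)
accumulated = accumulated_gradient(w, xs, ys, micro_batch_size=2)
print(abs(big_batch - accumulated) < 1e-9)  # True
```

In real networks the match is only approximate for layers whose behaviour depends on the batch itself, such as batch normalization, which computes its statistics per micro-batch.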

Batch size also controls parallelism. Tiny batches produce matrix multiplications too small to keep the hardware busy, which wastes throughput. Huge batches expose all the parallelism at the cost of more memory and more difficult optimization.

Effect on generalization

Larger mini-batches do not automatically lead to better-trained models. In their 2017 ICLR paper "On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima," Keskar, Mudigere, Nocedal, Smelyanskiy, and Tang reported that switching from a small-batch regime (32 to 512 examples) to a large-batch regime causes a measurable drop in test accuracy. They argued that small-batch SGD tends to settle into flat regions of the loss surface, while large-batch SGD often converges to sharp minimizers, and that flat minima generalize better. The noise in small-batch gradients seems to act as implicit regularization. Subsequent work showed that the gap can be partially closed by changing the learning rate schedule, using warmup, or training for more epochs at the large batch size. The takeaway is that batch size is not free: doubling it and otherwise leaving everything alone often produces a slightly worse model.

Linear scaling rule and warmup

The most influential recipe for keeping large mini-batches well-behaved is the linear scaling rule from Goyal and co-authors in their 2017 paper "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour." The rule says: when you multiply the mini-batch size by a factor k, multiply the learning rate by the same factor k. The intuition is that a larger batch averages over more independent gradient estimates, which lets the optimizer take a proportionally larger step without overshooting.

The paper showed that this rule, combined with a gradual learning rate warmup over the first five epochs, allowed the authors to train ResNet-50 on ImageNet with a global mini-batch of 8,192 across 256 GPUs in roughly one hour, while matching the accuracy of the standard 256-batch baseline. Warmup matters because the linear scaling rule breaks down very early in training, when the network is changing rapidly and a large step at the full learning rate can send the parameters in a bad direction.
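The rule and a linear warmup can be written as two small functions. This is a hedged sketch, not the paper's code: the function names, the base learning rate of 0.1 for a batch of 256, and the warmup length of 500 steps are illustrative assumptions.

```python
def scaled_learning_rate(base_lr, base_batch_size, batch_size):
    """Linear scaling rule: multiply the learning rate by the batch-size ratio k."""
    return base_lr * batch_size / base_batch_size


def warmup_learning_rate(step, warmup_steps, start_lr, target_lr):
    """Ramp linearly from start_lr to target_lr, then hold target_lr."""
    if warmup_steps <= 0 or step >= warmup_steps:
        return target_lr
    return start_lr + (target_lr - start_lr) * step / warmup_steps


base_lr = 0.1  # assumed learning rate tuned for a batch size of 256
target_lr = scaled_learning_rate(base_lr, base_batch_size=256, batch_size=8192)
print(target_lr)  # 3.2, since k = 8192 / 256 = 32

for step in (0, 250, 500, 1000):
    print(step, warmup_learning_rate(step, warmup_steps=500, start_lr=base_lr, target_lr=target_lr))
```

Starting the ramp at the small-batch learning rate and ending at the scaled one keeps the early updates conservative while still reaching the rate the rule prescribes.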

The rule is empirical and has limits. It tends to hold well for SGD with momentum on convolutional networks and breaks down at very large batch sizes (typically beyond a few tens of thousands of examples). For Adam and other adaptive optimizers, the scaling looks closer to a square root rule in some regimes, and several papers since 2018 have re-derived scaling laws from stochastic differential equation (SDE) approximations of SGD.

In a 2018 ICLR follow-up, "Don't Decay the Learning Rate, Increase the Batch Size," Smith, Kindermans, Ying, and Le pointed out that decaying the learning rate over training has nearly the same effect on the optimization dynamics (under the linear scaling rule) as growing the mini-batch size while holding the learning rate fixed, and that the two schedules usually produce the same learning curves. Their experiments trained ResNet-50 on ImageNet to 76.1% validation accuracy in under 30 minutes by ramping the batch size up over training.
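One way to see the correspondence is to compare the ratio of learning rate to batch size under the two schedules, since under the linear scaling rule that ratio is what stays matched. The sketch below is an illustration only: the milestone epochs, the factor of 5, and the starting values are hypothetical and are not taken from the paper's experiments.

```python
import math

MILESTONES = (30, 60, 80)  # hypothetical epochs at which the schedule changes
FACTOR = 5                 # hypothetical change factor at each milestone


def decay_learning_rate(epoch, base_lr=0.1, batch_size=128):
    """Conventional schedule: divide the learning rate by FACTOR at each milestone."""
    changes = sum(epoch >= m for m in MILESTONES)
    return base_lr / FACTOR ** changes, batch_size


def grow_batch_size(epoch, base_lr=0.1, batch_size=128):
    """Alternative schedule: multiply the batch size by FACTOR at each milestone."""
    changes = sum(epoch >= m for m in MILESTONES)
    return base_lr, batch_size * FACTOR ** changes


for epoch in (0, 30, 60, 80):
    lr_a, bs_a = decay_learning_rate(epoch)
    lr_b, bs_b = grow_batch_size(epoch)
    same_ratio = math.isclose(lr_a / bs_a, lr_b / bs_b)
    print(epoch, (lr_a, bs_a), (lr_b, bs_b), same_ratio)
```

The second schedule takes fewer, larger steps per epoch as training progresses, which is where the reported wall-clock savings come from when enough devices are available.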

Mini-batches in distributed training

In data-parallel distributed training, every worker holds a copy of the model and processes a shard of each global mini-batch. Two numbers matter:

The local (or per-device, or micro) batch size: how many examples each GPU processes in one forward and backward pass.
The global (or effective) mini-batch size: the sum of the local batch sizes across all workers, which is the number of examples averaged over before the weight update.

The global batch is what the optimization recipe "sees," so it is the size that should be plugged into the linear scaling rule. When papers report training ImageNet with a batch size of 8,192 on 256 GPUs, the local batch is 32 and the global batch is 8,192. The gradient averaging happens through an AllReduce collective, often implemented as a ring AllReduce, which sums the per-device gradients and broadcasts the result so every worker performs the same parameter update.
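The relationship between local gradients, the AllReduce step, and the global mini-batch gradient can be simulated on one machine. This is a toy sketch: `all_reduce_mean` is a hypothetical stand-in for a real collective, the model is y = w * x with mean squared error, and every worker is assumed to hold a shard of the same size.

```python
def mean_gradient(w, xs, ys):
    """Gradient of mean squared error for the model y = w * x, averaged over the examples."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def all_reduce_mean(values):
    """Stand-in for AllReduce: every worker ends up holding the mean of all workers' values."""
    average = sum(values) / len(values)
    return [average] * len(values)


global_xs = [float(i) for i in range(1, 9)]   # global mini-batch of 8 examples
global_ys = [2.0 * x for x in global_xs]
num_workers = 4
local_batch_size = len(global_xs) // num_workers  # 2 examples per worker
w = 0.3

# Each worker computes the gradient on its own shard of the global mini-batch.
local_grads = []
for rank in range(num_workers):
    lo, hi = rank * local_batch_size, (rank + 1) * local_batch_size
    local_grads.append(mean_gradient(w, global_xs[lo:hi], global_ys[lo:hi]))

synced_grads = all_reduce_mean(local_grads)
global_grad = mean_gradient(w, global_xs, global_ys)

print(local_batch_size * num_workers)             # 8, the global batch size
print(abs(synced_grads[0] - global_grad) < 1e-9)  # True
```

Because the shards are equal in size, the mean of the per-worker means equals the mean over the whole global mini-batch, so every worker applies the same update it would have computed on the full batch.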

The same pattern (large global mini-batches, linear scaling, warmup, gradient accumulation when memory is tight) underlies most large-scale training pipelines for large language models, vision transformers, and other large neural networks.

Practical considerations

Start with a power-of-two batch that fits comfortably in memory with room to spare for activations and optimizer state. 64 or 128 is a reasonable default on a single modern GPU for medium-sized models.
Shuffle the training data at the start of every epoch. Without shuffling, the same examples land in the same mini-batches each epoch, and the gradient estimate can become biased.
When you change the batch size, change the learning rate. The default starting point is the linear scaling rule.
Watch the loss curve. Training instability and divergent loss are common symptoms of a learning rate that is too high for the chosen batch size, often fixed by adding warmup or reducing the base rate.
Use batch normalization carefully with very small mini-batches. Batch norm statistics are computed across the mini-batch, and batches under about 8 examples per device tend to make the statistics noisy enough to hurt training. Layer normalization, group normalization, and similar variants were partly developed for that case.

Advantages and disadvantages

Mini-batch training is the default for several reasons. The optimizer takes many steps per epoch, so parameters move toward a good region of the loss surface faster than under full-batch gradient descent. Batched matrix operations saturate GPU and TPU throughput in a way that single-example SGD cannot. The averaged gradient is smoother than the per-example gradient, which gives steadier convergence. The remaining noise (relative to full-batch gradients) appears to help networks find flatter, better-generalizing minima. Memory stays manageable because only one mini-batch needs to sit on the device at a time.

The tradeoffs are real too. Batch size is one more hyperparameter to tune, and it couples with the learning rate, so a bad change can quietly cost accuracy. Mini-batch gradients oscillate around the true gradient, which sometimes causes training instability. Past a certain point, doubling the batch size doubles the compute but produces a smaller speedup in convergence; the exact crossover depends on the model, the dataset, and the optimizer.

Explain like I am 5 (ELI5)

Imagine a giant bag of math problems and a tutor teaching a student. If the tutor reads every problem in the bag before giving any feedback, the feedback is great but very slow. If the tutor reacts after every single problem, the feedback is fast but jumpy, and the student gets confused. So the tutor grabs a small handful of problems, looks at them together, gives one piece of feedback, then grabs another handful. The handful is the mini-batch: big enough that the feedback is steady, small enough that the student gets plenty of feedback on the way through the bag.

References

Google Developers. "Machine Learning Glossary" (batch, batch size, mini-batch, SGD entries). developers.google.com/machine-learning/glossary.
Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., Tang, P. T. P. (2017). "On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima." ICLR 2017. arXiv:1609.04836.
Goyal, P. et al. (2017). "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour." arXiv:1706.02677.
Smith, S. L., Kindermans, P.-J., Ying, C., Le, Q. V. (2018). "Don't Decay the Learning Rate, Increase the Batch Size." ICLR 2018. arXiv:1711.00489.
Robbins, H., and Monro, S. (1951). "A Stochastic Approximation Method." Annals of Mathematical Statistics 22(3): 400-407.
Goodfellow, I., Bengio, Y., Courville, A. (2016). Deep Learning. MIT Press. Chapter 8.
Wikipedia. "Stochastic gradient descent." en.wikipedia.org/wiki/Stochastic_gradient_descent.
Raschka, S. (2022). "No, We Don't Have to Choose Batch Sizes As Powers Of 2." sebastianraschka.com/blog/2022/batch-size-2.html.
