# Mini-batch

> Source: https://aiwiki.ai/wiki/mini-batch
> Updated: 2026-06-23
> Categories: Machine Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

## What is a mini-batch?

A mini-batch is a small, randomly sampled subset of the training [dataset](/wiki/dataset) used to compute a single update to a [model](/wiki/model)'s parameters during [training](/wiki/training). In [machine learning](/wiki/machine_learning), rather than running the whole dataset through the model before each weight update (full [batch](/wiki/batch) [gradient descent](/wiki/gradient_descent)) or updating after every single [example](/wiki/example) (pure [stochastic gradient descent](/wiki/stochastic_gradient_descent_sgd), batch size 1), mini-batch methods sit in between: they sample a chunk of examples, compute the gradient of the [loss](/wiki/loss) over that chunk, and step the parameters in the direction opposite to that gradient. The chunk is the mini-batch, the number of examples in it is the [batch size](/wiki/batch_size), and mini-batch gradient descent is the dominant training method for modern deep neural networks.

Google's machine learning glossary describes the mini-batch as a batch-size strategy where the batch size is usually between 10 and 1,000 examples, and calls it the most efficient strategy in practice.[1] Typical batch sizes range from 32 to a few thousand, with powers of two (32, 64, 128, 256, 512, 1,024) being conventional. Almost every modern deep learning workflow uses mini-batches by default. Training with one mini-batch per update is usually called mini-batch SGD. When people loosely say "SGD" today, they almost always mean mini-batch SGD, not the pure single-example version that traces back to the original Robbins and Monro paper in 1951.[5][7]

## How a mini-batch fits into training

A training run is organized into three nested concepts: [epochs](/wiki/epoch), [iterations](/wiki/iteration), and mini-batches. An epoch is one full pass through the entire training dataset. The dataset is sliced into mini-batches, and one iteration is a single forward pass, backward pass, and weight update on one mini-batch. If your dataset has 50,000 examples and your batch size is 100, one epoch contains 500 iterations.
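The arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the helper name `iterations_per_epoch` and the `drop_last` flag are illustrative assumptions, not part of any particular framework. The flag only controls whether a final, smaller mini-batch is kept when the dataset size is not a multiple of the batch size.

```python
import math

def iterations_per_epoch(num_examples: int, batch_size: int, drop_last: bool = False) -> int:
    """Number of weight updates in one full pass over the data."""
    if drop_last:
        # Discard the final partial mini-batch, if any
        return num_examples // batch_size
    # Keep the final partial mini-batch as its own iteration
    return math.ceil(num_examples / batch_size)

print(iterations_per_epoch(50_000, 100))        # 500
print(iterations_per_epoch(50_000, 128))        # 391
print(iterations_per_epoch(50_000, 128, True))  # 390
```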

Within each iteration the optimizer does roughly the following:

1. Sample a mini-batch from the training set, typically by shuffling the data once per epoch and walking through it in order.
2. Run a forward pass through the model on those examples and compute the loss.
3. Run a backward pass and compute the [gradient](/wiki/gradient) of the [loss function](/wiki/loss_function) with respect to the parameters, averaged over the mini-batch.
4. Update the parameters using the gradient, a learning rate, and whatever optimizer state is in use (momentum, Adam moments, and so on).
5. Move on to the next mini-batch.

When all mini-batches are consumed, the epoch ends. The data is shuffled and a new epoch begins. Training stops once the loss on a held-out set stops improving, or a maximum iteration count is reached.
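The loop described above can be sketched in plain Python for a toy problem. Everything here is an assumption made for illustration: the function name `minibatch_sgd`, the one-dimensional linear model, the noiseless synthetic data, and the choice of plain SGD without momentum. It uses a fixed number of epochs rather than a held-out set as its stopping rule.

```python
import random

def minibatch_sgd(xs, ys, batch_size=32, lr=0.1, epochs=50, seed=0):
    """Fit y ~ w*x + b with mini-batch SGD on mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # reshuffle once per epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]  # step 1: next mini-batch
            # Steps 2-3: forward pass and MSE gradient, averaged over the batch
            grad_w = grad_b = 0.0
            for i in batch:
                err = (w * xs[i] + b) - ys[i]
                grad_w += 2.0 * err * xs[i]
                grad_b += 2.0 * err
            grad_w /= len(batch)
            grad_b /= len(batch)
            # Step 4: plain SGD parameter update
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Toy data (hypothetical): y = 3x + 1 with no noise, x drawn from [-1, 1]
data_rng = random.Random(42)
xs = [data_rng.uniform(-1.0, 1.0) for _ in range(1000)]
ys = [3.0 * x + 1.0 for x in xs]

w, b = minibatch_sgd(xs, ys)
print(round(w, 3), round(b, 3))
```

On this noiseless toy problem the fitted values should land very close to the true slope 3 and intercept 1. Real training loops delegate batching, gradients, and updates to a framework, but the control flow is the same.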

## How does mini-batch differ from full-batch and stochastic gradient descent?

The three classical variants of gradient descent differ only in how many examples contribute to each update.[7]

| Method | Examples per update | Update frequency per epoch | Typical use |
| --- | --- | --- | --- |
| Batch (full-batch) gradient descent | The entire training set | 1 | Small classical ML problems |
| Stochastic gradient descent (pure SGD) | 1 | One per example | Online learning, theoretical analysis |
| Mini-batch gradient descent | Usually 16 to 1,024 | Dataset size divided by batch size | Almost all modern deep learning |

Full-batch gradient descent gives the lowest-variance estimate of the true gradient, but each step is expensive on a large dataset and the parameters only move once per epoch. Pure SGD with a batch size of one is light per update, but every step is based on a single example, so the gradient is extremely noisy and modern hardware ends up underutilized.

Mini-batches compromise. The gradient on a mini-batch is an unbiased estimate of the full-data gradient, with variance that shrinks roughly in proportion to one over the batch size.[6] GPUs and TPUs reach their peak throughput on batched matrix multiplications, so a mini-batch of 128 or 256 examples often runs only a little slower than a mini-batch of 32, while delivering a much less noisy gradient and a much higher utilization rate. This is also why batch sizes above about 10 are favored: matrix-matrix products are far better optimized in numerical libraries than the matrix-vector products of single-example SGD, so larger batches use the hardware more efficiently per example.[9][10]
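The one-over-batch-size behavior of the variance can be seen in a small simulation. This is a sketch under simplifying assumptions: the "per-example gradients" are independent draws from a unit-variance Gaussian around a true gradient of 1.0, and the function name `batch_mean_variance` is hypothetical. Real per-example gradients are neither Gaussian nor exactly independent.

```python
import random
import statistics

def batch_mean_variance(batch_size, trials=4000, seed=0):
    """Empirical variance of the mean of `batch_size` i.i.d. per-example gradients."""
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        # Stand-in per-example gradients: true gradient 1.0 plus unit-variance noise
        batch = [1.0 + rng.gauss(0.0, 1.0) for _ in range(batch_size)]
        means.append(sum(batch) / batch_size)
    return statistics.pvariance(means)

for b in (1, 4, 16, 64):
    # The printed variance should be close to 1 / b
    print(b, round(batch_mean_variance(b), 4))
```

Quadrupling the batch size cuts the variance to roughly a quarter, which means the typical size of the gradient error (the standard deviation) only halves.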

## What are typical mini-batch sizes?

Common mini-batch sizes are powers of two: 32, 64, 128, 256, 512, and 1,024.[6] The convention comes from older GPU guidance suggesting that aligned, power-of-two tensor dimensions map cleanly onto hardware warps and tensor cores. Sebastian Raschka's 2022 benchmarks showed that the speed difference between a power of two and a nearby non-power-of-two value is small on current hardware, so the convention is more habit than hard requirement.[8]

There is no single "correct" batch size, and the research literature disagrees on the ideal range. Yoshua Bengio's 2012 practical-recommendations guide treats batch size as a hyperparameter that mostly affects training time rather than final test performance, and suggests 32 as a reasonable default, noting that values above about 10 begin to exploit the speed of matrix-matrix products.[10] Masters and Luschi argued for the small end of the range. Reviewing batch sizes across CIFAR-10, CIFAR-100, and ImageNet, they concluded: "The best performance has been consistently obtained for mini-batch sizes between m = 2 and m = 32, which contrasts with recent work advocating the use of mini-batch sizes in the thousands."[9] They also found that larger batches shrink the range of learning rates that train stably.[9]

A few rules of thumb from practitioner guides:

- On a single GPU, batch sizes between 32 and 256 are typical for image classification and similar workloads.
- Very small batches (1 to 16) are used when memory is tight, when training very large models, or when researchers want the regularizing effect of noisy gradients.
- Very large batches (4,096 and above) are reserved for distributed training across many devices.

In their 2017 ICLR paper on large-batch training, Keskar and co-authors defined the small-batch regime as 32 to 512 examples, which matches the default range in many deep learning frameworks.[2]

## Memory and parallelism tradeoffs

The practical ceiling on batch size is usually GPU memory, not optimization theory. Larger batches store more activations during the forward pass, because backpropagation needs them again on the way back. When the desired batch size exceeds what fits on one accelerator, two workarounds are standard:

- [Gradient accumulation](/wiki/gradient_accumulation) runs several small mini-batches in sequence and adds up the gradients before a single weight update. The result is mathematically close to one big batch, with the memory footprint of one small one. The cost is wall-clock time, since the small batches run one after another rather than in parallel.
- [Data parallelism](/wiki/data_parallelism) splits the mini-batch across multiple GPUs. Each GPU computes the gradient on its local shard, then an AllReduce step averages the gradients before the weight update. The global batch is the sum of the local batches.
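The equivalence behind gradient accumulation can be checked numerically. The sketch below assumes a one-parameter linear model with mean squared error and made-up data; `mse_grad` is a hypothetical helper. It also assumes the model has no batch-dependent layers such as batch normalization, for which the match is only approximate.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean squared error for the model y ~ w*x, averaged over the batch."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

# Hypothetical batch of 8 examples
xs = [0.5, -1.0, 2.0, 1.5, -0.5, 3.0, 0.25, -2.0]
ys = [1.0, -2.5, 4.2, 2.9, -1.1, 6.3, 0.4, -3.8]
w = 0.7

# Gradient from one large batch of 8
full = mse_grad(w, xs, ys)

# Same gradient accumulated over 4 micro-batches of 2, before any weight update
micro = 2
accumulated = 0.0
for start in range(0, len(xs), micro):
    g = mse_grad(w, xs[start:start + micro], ys[start:start + micro])
    accumulated += g * micro / len(xs)  # weight each micro-batch by its share of the batch

print(abs(full - accumulated) < 1e-12)
```

Only one micro-batch of activations is needed at a time, which is where the memory saving comes from.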

Batch size also controls parallelism. Tiny batches produce matrix multiplications too small to keep the hardware busy, which wastes throughput. Huge batches expose more parallelism at the cost of more memory and harder optimization.

## Does batch size affect generalization?

Larger mini-batches do not automatically lead to better-trained models. In their 2017 ICLR paper "On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima," Keskar, Mudigere, Nocedal, Smelyanskiy, and Tang reported that switching from a small-batch regime (32 to 512 examples) to a large-batch regime causes a measurable drop in test accuracy, a phenomenon they called the generalization gap.[2] They wrote: "Large-batch methods tend to converge to sharp minimizers of the training and testing functions, and as is well known, sharp minima lead to poorer generalization. In contrast, small-batch methods consistently converge to flat minimizers."[2]

The noise in small-batch gradients seems to act as implicit [regularization](/wiki/regularization), nudging the optimizer toward flatter regions of the loss surface. Subsequent work showed that the gap can be partially closed by changing the learning rate schedule, using warmup, or training for more epochs at the large batch size. The takeaway is that batch size is not free: doubling it and otherwise leaving everything alone often produces a slightly worse model.

## Linear scaling rule and warmup

The most influential recipe for keeping large mini-batches well-behaved is the linear scaling rule from Goyal and co-authors in their 2017 paper "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour."[3] The rule, stated verbatim, is: "When the minibatch size is multiplied by k, multiply the learning rate by k."[3] The intuition is that a larger batch averages over more independent gradient estimates, which lets the optimizer take a proportionally larger step without overshooting.

The paper showed that this rule, combined with a gradual learning rate warmup over the first five epochs, allowed the authors to train ResNet-50 on ImageNet with a global mini-batch of 8,192 across 256 GPUs in roughly one hour, reaching 23.74% top-1 error, in line with the standard 256-batch baseline, and achieving about 90% scaling efficiency from 8 to 256 GPUs.[3] Warmup matters because the linear scaling rule breaks down very early in training, when the loss surface is steeply curved and a large step at full learning rate sends the parameters in a bad direction.
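The linear scaling rule and a gradual warmup can be written as two small functions. This is a sketch rather than the paper's code: the names `scaled_lr` and `warmup_lr`, the base rate of 0.1 at batch size 256, and the warmup length of 500 steps are assumptions chosen for illustration.

```python
def scaled_lr(base_lr: float, base_batch: int, batch_size: int) -> float:
    """Linear scaling rule: multiply the learning rate by k = batch_size / base_batch."""
    return base_lr * batch_size / base_batch

def warmup_lr(step: int, warmup_steps: int, start_lr: float, target_lr: float) -> float:
    """Ramp linearly from start_lr to target_lr, then hold target_lr."""
    if step >= warmup_steps:
        return target_lr
    return start_lr + (target_lr - start_lr) * step / warmup_steps

# Illustrative values: base rate 0.1 at batch 256, scaled up to a global batch of 8,192
target = scaled_lr(0.1, 256, 8192)
print(round(target, 6))  # 3.2

# Learning rate at the start, middle, and end of a hypothetical 500-step warmup
for step in (0, 250, 500):
    print(step, round(warmup_lr(step, 500, 0.1, target), 6))
```

Starting the ramp at the small-batch rate and ending at the scaled rate keeps early steps modest while still reaching the full large-batch learning rate.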

The rule is empirical and has limits. It tends to hold well for SGD with momentum on convolutional networks and breaks down at very large batch sizes (typically beyond a few tens of thousands of examples). For [Adam](/wiki/adam) and other adaptive optimizers, the scaling looks closer to a square root rule in some regimes, and several papers since 2018 have re-derived scaling laws from stochastic differential equation (SDE) approximations of SGD.

In a 2018 ICLR follow-up, "Don't Decay the Learning Rate, Increase the Batch Size," Smith, Kindermans, Ying, and Le pointed out that decaying the learning rate over training is approximately equivalent (under the linear scaling rule) to growing the mini-batch size while holding the learning rate fixed.[4] They wrote that "one can usually obtain the same learning curve on both training and test sets by instead increasing the batch size during training," and their experiments trained ResNet-50 on ImageNet to 76.1% validation accuracy in under 30 minutes by ramping the batch size up over training.[4]
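The correspondence between the two schedules can be illustrated by comparing the ratio of learning rate to batch size, which is the quantity the linear scaling rule keeps matched. The sketch below is not the authors' schedule: the function names, the factor of 5, the milestone epochs, and the base values are all hypothetical.

```python
import math

def step_decay(epoch, base_lr=0.1, base_batch=128, factor=5, milestones=(30, 60)):
    """Conventional schedule: divide the learning rate at each milestone."""
    k = sum(epoch >= m for m in milestones)  # number of milestones passed
    return base_lr / factor ** k, base_batch

def batch_growth(epoch, base_lr=0.1, base_batch=128, factor=5, milestones=(30, 60)):
    """Alternative schedule: multiply the batch size at each milestone instead."""
    k = sum(epoch >= m for m in milestones)
    return base_lr, base_batch * factor ** k

for epoch in (0, 30, 60):
    lr_a, batch_a = step_decay(epoch)
    lr_b, batch_b = batch_growth(epoch)
    # Both schedules give the same lr / batch ratio at every epoch
    print(epoch, batch_b, math.isclose(lr_a / batch_a, lr_b / batch_b))
```

The batch-growth schedule reaches the same ratio with fewer, larger updates per epoch, which is the source of the wall-clock savings when enough devices are available.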

## How are mini-batches used in distributed training?

In data-parallel distributed training, every worker holds a copy of the model and processes a shard of each global mini-batch. Two numbers matter:

- The local (or per-device, or micro) batch size: how many examples each GPU processes in one forward and backward pass.
- The global (or effective) mini-batch size: the sum of the local batch sizes across all workers, which is the number of examples averaged over before the weight update.

The global batch is what the optimization recipe "sees," so it is the size that should be plugged into the linear scaling rule. When papers report training ImageNet with a batch size of 8,192 on 256 GPUs, the local batch is 32 and the global batch is 8,192.[3] The gradient averaging happens through an AllReduce collective, often implemented as a ring AllReduce, which sums the per-device gradients and broadcasts the result so every worker performs the same parameter update.
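The relationship between local and global batches can be simulated on one machine. In this sketch, plain Python lists stand in for workers and a simple mean stands in for an averaging AllReduce; the per-example gradient values are made up. With equal-sized shards, averaging the per-worker gradients gives the same result as averaging over the whole global batch.

```python
def mean(values):
    return sum(values) / len(values)

num_workers = 4
local_batch = 3
global_batch = num_workers * local_batch  # 12 examples per weight update

# Hypothetical per-example gradients for a single scalar parameter
per_example_grads = [0.2, -0.1, 0.4, 0.3, 0.0, -0.2, 0.5, 0.1, -0.3, 0.6, 0.2, -0.4]

# Each worker receives one shard of the global mini-batch
shards = [per_example_grads[i * local_batch:(i + 1) * local_batch]
          for i in range(num_workers)]

local_grads = [mean(shard) for shard in shards]  # each worker's local gradient
all_reduced = mean(local_grads)                  # stand-in for an averaging AllReduce
global_grad = mean(per_example_grads)            # gradient over the full global batch

print(global_batch, abs(all_reduced - global_grad) < 1e-12)

# The figures quoted above: 8,192 examples over 256 GPUs is 32 per device
print(8192 // 256)  # 32
```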

The same pattern (large global mini-batches, linear scaling, warmup, gradient accumulation when memory is tight) underlies most large-scale training pipelines for [large language models](/wiki/large_language_model), [vision transformers](/wiki/vision_transformer), and other large neural networks.

## Practical considerations

- Start with a power-of-two batch that fits comfortably in memory with room to spare for activations and optimizer state. 64 or 128 is a reasonable default on a single modern GPU for medium-sized models.
- Shuffle the training data at the start of every epoch. Without shuffling, the same examples land in the same mini-batches each epoch, and the gradient estimate can become biased.
- When you change the batch size, change the learning rate. The default starting point is the linear scaling rule.
- Watch the loss curve. Training instability and divergent loss are common symptoms of a learning rate that is too high for the chosen batch size, often fixed by adding warmup or reducing the base rate.
- Use [batch normalization](/wiki/batch_normalization) carefully with very small mini-batches. Batch norm statistics are computed across the mini-batch, and batches under about 8 examples per device tend to make the statistics noisy enough to hurt training. [Layer normalization](/wiki/layer_normalization), group normalization, and similar variants were partly developed for that case.

## Advantages and disadvantages

Mini-batch training is the default for several reasons. The optimizer takes many steps per epoch, so parameters move toward a good region of the loss surface faster than under full-batch gradient descent.[6] Batched matrix operations saturate GPU and TPU throughput in a way that single-example SGD cannot. The averaged gradient is smoother than the per-example gradient, which gives steadier [convergence](/wiki/convergence). The remaining noise (relative to full-batch gradients) appears to help networks find flatter, better-generalizing minima.[2] Memory stays manageable because only one mini-batch needs to sit on the device at a time.

The tradeoffs are real too. Batch size is one more hyperparameter to tune, and it couples with the learning rate, so a bad change can quietly cost accuracy. Mini-batch gradients fluctuate around the true gradient, which sometimes causes training instability. Past a certain point, doubling the batch size doubles the compute per step but cuts the number of steps needed to converge by less than half; the exact crossover depends on the model, the dataset, and the optimizer.

## Explain like I am 5 (ELI5)

Imagine a giant bag of math problems and a tutor teaching a student. If the tutor reads every problem in the bag before giving any feedback, the feedback is great but very slow. If the tutor reacts after every single problem, the feedback is fast but jumpy, and the student gets confused. So the tutor grabs a small handful of problems, looks at them together, gives one piece of feedback, then grabs another handful. The handful is the mini-batch: big enough that the feedback is steady, small enough that the student gets plenty of feedback on the way through the bag.

## References

1. Google Developers. "Machine Learning Glossary" (batch, batch size, mini-batch, SGD entries). developers.google.com/machine-learning/glossary.
2. Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., Tang, P. T. P. (2017). "On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima." ICLR 2017. arXiv:1609.04836.
3. Goyal, P., Dollár, P., Girshick, R., et al. (2017). "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour." arXiv:1706.02677.
4. Smith, S. L., Kindermans, P.-J., Ying, C., Le, Q. V. (2018). "Don't Decay the Learning Rate, Increase the Batch Size." ICLR 2018. arXiv:1711.00489.
5. Robbins, H., and Monro, S. (1951). "A Stochastic Approximation Method." Annals of Mathematical Statistics 22(3): 400-407.
6. Goodfellow, I., Bengio, Y., Courville, A. (2016). Deep Learning. MIT Press. Chapter 8.
7. Wikipedia. "Stochastic gradient descent." en.wikipedia.org/wiki/Stochastic_gradient_descent.
8. Raschka, S. (2022). "No, We Don't Have to Choose Batch Sizes As Powers Of 2." sebastianraschka.com/blog/2022/batch-size-2.html.
9. Masters, D., and Luschi, C. (2018). "Revisiting Small Batch Training for Deep Neural Networks." arXiv:1804.07612.
10. Bengio, Y. (2012). "Practical Recommendations for Gradient-Based Training of Deep Architectures." In Neural Networks: Tricks of the Trade (2nd ed.), Springer, 437-478. arXiv:1206.5533.