In machine learning, batch size is a hyperparameter that specifies how many training examples are processed together before the model's parameters are updated through a single step of gradient descent. During each iteration, the model performs a forward pass on the batch of examples, computes the loss function, calculates gradients via backpropagation, and updates the weights. The batch size determines how many examples contribute to each gradient estimate.
Batch size interacts closely with the learning rate, training speed, GPU memory usage, and generalization. Selecting an appropriate batch size requires balancing computational efficiency against model quality, and it remains one of the most consequential decisions in deep learning experiment design.
The batch size defines the variant of gradient descent being used. The three canonical variants are:
| Variant | Batch Size | Description |
|---|---|---|
| Batch gradient descent | Entire dataset | Uses all training examples per update; exact gradient but slow and memory-intensive |
| Stochastic gradient descent (SGD) | 1 | Updates after a single example; fast but noisy |
| Mini-batch gradient descent | Typically 16 to 8,192 | Standard modern approach using subsets of the data |
In practice, the term "SGD" in deep learning almost always refers to mini-batch gradient descent with batch sizes between 32 and several thousand, not the literal single-example variant.
Research consistently demonstrates that smaller batches tend to produce models with better generalization to unseen data. Keskar et al. (2017) showed that large-batch training converges to "sharp minima" (narrow valleys in the loss landscape), while small-batch training finds "flat minima" (broad basins).[1] Flat minima generalize better because the loss remains stable under small perturbations to the weights.
The mechanism behind this effect is gradient noise. Small-batch gradient estimates are inherently noisy, and this noise acts as a form of implicit regularization that prevents the optimizer from settling into sharp valleys that would perform poorly on test data. Smith and Le (2018) formalized this by showing that the stochastic noise scale in SGD is approximately proportional to the ratio of learning rate to batch size.[2] This insight explains why the choice of batch size has such a significant impact on test performance.
However, the relationship is not absolute. With proper hyperparameter tuning and techniques like learning rate warmup, the generalization gap between small and large batches can often be closed.
Larger batches enable greater parallelism on modern hardware. GPUs and TPUs are designed to process large matrix operations efficiently, so a 4x increase in batch size may only raise iteration time by 1.5 to 2x, significantly improving data throughput.
However, larger batches mean fewer total parameter updates per epoch, reducing the number of times the optimizer can adjust the weights. This can slow convergence when measured in epochs, even though wall-clock time per epoch decreases.
| Aspect | Small Batch (32) | Large Batch (4,096) |
|---|---|---|
| GPU utilization | Often underutilizes hardware | Better hardware utilization |
| Time per iteration | Fast | Slower per step |
| Iterations per epoch | Many | Few |
| Gradient estimate | Noisy (high variance) | Accurate (low variance) |
| Generalization | Often better | Can degrade without tuning |
| Memory usage | Low | High |
Batch size directly affects GPU memory consumption. Each example in a batch requires memory for input data, intermediate activations (stored for backpropagation), and computed gradients. Activation memory scales linearly with batch size and, for transformer architectures, activations typically dominate memory requirements.
The maximum batch size that fits in GPU memory depends on the model architecture, input dimensions, numerical precision (FP32 vs. FP16/BF16), and whether memory-saving techniques like gradient checkpointing are employed.
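One common way to find that maximum is to probe it empirically. The sketch below assumes a recent PyTorch (which exposes `torch.cuda.OutOfMemoryError`) and a `model` already moved to the GPU; the function name and starting size are illustrative.

```python
import torch

def max_fitting_batch_size(model, input_shape, device="cuda", start=8192):
    """Halve a candidate batch size until one forward/backward pass fits in memory.
    `model` (already on `device`) and `input_shape` are placeholders for your own
    network and per-example data shape."""
    batch_size = start
    while batch_size >= 1:
        try:
            model.zero_grad()
            x = torch.randn(batch_size, *input_shape, device=device)
            model(x).sum().backward()  # backward pass allocates activation/gradient memory
            return batch_size
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()   # release the failed allocation before retrying
            batch_size //= 2
    return 0
```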
Practitioners commonly select batch sizes that are powers of two (32, 64, 128, 256, 512, etc.). This convention originates from the way GPU memory and compute units are organized. GPU warp sizes, memory bus widths, and CUDA core groupings are all structured around powers of two, so tensor operations on arrays aligned to these sizes can be more efficient.
NVIDIA's cuDNN library, which underlies most deep learning frameworks, optimizes kernels for tensor dimensions that are multiples of 8 or 16.[3] Using powers of two naturally satisfies these alignment requirements.
That said, empirical benchmarks by Raschka (2022) and Weights & Biases have shown that the difference in throughput between, for example, a batch size of 128 and 127 is negligible on modern hardware.[3] Non-power-of-two sizes like 48, 96, or 384 work perfectly well in practice. The convention persists primarily out of habit and convenience rather than strict necessity.
Mini-batch gradients are stochastic estimates of the true gradient computed over the full dataset. The difference between the mini-batch gradient and the true gradient is referred to as "gradient noise." Its standard deviation scales inversely with the square root of the batch size: a batch size of 1 produces highly noisy estimates, while using the full dataset yields the exact gradient with zero noise.
Gradient noise has dual effects:
- Beneficial: as discussed above, it acts as implicit regularization, steering the optimizer away from sharp minima and improving generalization.
- Detrimental: it makes individual updates less reliable, which can slow convergence and force the use of smaller learning rates.
McCandlish et al. (2018) formalized the concept of the gradient noise scale as a measurable quantity that characterizes the signal-to-noise ratio of gradient estimates.[4] The noise scale is defined as:
B_noise = tr(Sigma) / |G|^2
where tr(Sigma) is the trace of the covariance matrix of per-example gradients and |G| is the norm of the true gradient. This provides a natural reference point: when the batch size is below B_noise, training is noise-dominated and each additional example provides significant information; when the batch size is above B_noise, training is compute-dominated and additional examples provide diminishing returns.
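The sketch below instantiates this formula on a toy linear-regression problem, where per-example gradients have a closed form; the dataset, dimensions, and noise level are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 20
X = rng.normal(size=(n, d))
true_w = rng.normal(size=d)
y = X @ true_w + rng.normal(scale=0.5, size=n)     # noisy labels

w = np.zeros(d)                                    # current parameters
per_example_grads = (X @ w - y)[:, None] * X       # gradient of 0.5*(x.w - y)^2 per example

G = per_example_grads.mean(axis=0)                 # true (full-dataset) gradient
trace_sigma = per_example_grads.var(axis=0).sum()  # tr(Sigma): summed per-coordinate variance
B_noise = trace_sigma / (G @ G)                    # the simple noise scale defined above

print(f"estimated gradient noise scale: {B_noise:.1f}")
```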
The gradient noise scale is not fixed during training. At initialization, it is typically small. As training progresses toward convergence, the true gradient shrinks while per-example variation persists, causing the noise scale to increase.
McCandlish et al. (2018) introduced the concept of the critical batch size, the batch size that represents an optimal tradeoff between data parallelism and compute efficiency.[4] Below the critical batch size, doubling the batch size nearly halves the number of optimization steps needed to reach a target loss. Above it, doubling the batch yields diminishing reductions in step count, effectively wasting compute.
The critical batch size relates directly to the gradient noise scale: B_crit is approximately equal to B_noise. When B is much less than B_crit, noise dominates and averaging more examples significantly improves the signal-to-noise ratio. When B is much greater than B_crit, the gradient estimate is already accurate, and additional examples add minimal new information.
Kaplan et al. (2020) discovered a scaling law relating critical batch size to the training loss in language modeling:[5]
B_crit(L) = 2.0 x 10^8 * L^(-4.76)
This means that as models improve (loss decreases), the critical batch size grows, justifying the extremely large batch sizes used in modern large language model training.
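Plugging losses into the published fit makes the effect concrete (loss measured in nats; the specific loss values below are arbitrary illustrations):

```python
def critical_batch_size(loss: float) -> float:
    """Kaplan et al. (2020) fit: B_crit(L) = 2.0e8 * L^(-4.76), in tokens."""
    return 2.0e8 * loss ** -4.76

# Halving the loss from 4.0 to 2.0 grows the fitted B_crit by 2^4.76, about 27x:
for L in (4.0, 3.0, 2.0):
    print(f"L = {L:.1f}  ->  B_crit ~ {critical_batch_size(L):,.0f} tokens")
```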
Recent empirical work (2025) from Allen AI on language model pre-training has refined the critical batch size framework with several notable findings:[6]
| Training Phase | Critical Batch Size Behavior | Practical Implication |
|---|---|---|
| Initialization | Very low | Small batches are efficient |
| Early (0 to 20% of steps) | Rapidly increasing | Gradually increase batch size |
| Mid to late training | Plateaus at a high value | Use the full target batch size |
The relationship between batch size and learning rate is one of the most extensively studied topics in deep learning optimization.
Goyal et al. (2017) proposed the linear scaling rule: when the batch size is multiplied by k, multiply the learning rate by k as well.[7] The intuition is that k-times larger batches average k times as many examples, so the learning rate must scale proportionally to maintain the same effective step size relative to the noise.
Using this rule, Goyal et al. trained ResNet-50 on ImageNet in one hour with batches of 8,192 images across 256 GPUs, achieving 76.3% validation accuracy that matched small-batch baselines. A key addition was a gradual warmup scheme: starting with a small learning rate and linearly increasing it to the target over the first few epochs, preventing the large learning rate from destabilizing the randomly initialized model.
However, the linear scaling rule tends to break down for very large batch sizes, where generalization performance can degrade despite matching the prescribed learning rate.
An alternative is the square root scaling rule: when the batch size is multiplied by k, multiply the learning rate by sqrt(k). This rule has theoretical justification for maintaining a constant gradient noise level across different batch sizes and tends to work better with adaptive optimizers like Adam in some settings.
Some practitioners find that Adam and AdamW are less sensitive to batch size changes and require no learning rate adjustment, though this is task-dependent.
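A minimal sketch of these rules follows. The reference values (learning rate 0.1 at batch size 256, target batch 8,192) echo the Goyal et al. ResNet-50 recipe; the warmup length is an arbitrary placeholder.

```python
def scaled_lr(base_lr, base_batch, batch, rule="linear"):
    """Scale a reference learning rate for a new batch size using either the
    linear rule (Goyal et al.) or the square-root rule."""
    k = batch / base_batch
    return base_lr * (k if rule == "linear" else k ** 0.5)

def warmup_lr(target_lr, step, warmup_steps):
    """Goyal-style gradual warmup: ramp linearly from near zero to the target."""
    return target_lr * min(1.0, (step + 1) / warmup_steps)

target = scaled_lr(0.1, 256, 8192)   # linear rule: 0.1 * 32 = 3.2
lrs = [warmup_lr(target, s, warmup_steps=500) for s in range(2000)]
```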
| Scaling Rule | Formula | Best Suited For |
|---|---|---|
| Linear (Goyal et al.) | LR x k when batch x k | SGD on vision tasks |
| Square root | LR x sqrt(k) when batch x k | Adaptive optimizers, some LLM settings |
| No scaling | LR unchanged | Some Adam/AdamW setups |
Smith et al. (2018) proposed an alternative to the standard practice of decaying the learning rate during training: keep the learning rate fixed and increase the batch size instead.[2] The two approaches have approximately the same effect on the gradient signal-to-noise ratio, but increasing the batch size enables greater data parallelism on multi-GPU systems, resulting in faster wall-clock training. The technique was validated with SGD, SGD with momentum, Nesterov momentum, and Adam, reaching equivalent test accuracies after the same number of training epochs but with fewer parameter updates.
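The sketch below illustrates the swap: at each milestone where a step schedule would have divided the learning rate by some factor, the batch size is multiplied by that factor instead. The milestone epochs and factor are illustrative placeholders, not values from the paper.

```python
def schedule(epoch, base_batch=256, base_lr=0.1, milestones=(30, 60, 80), factor=4):
    """Where a step decay would divide the LR by `factor` at each milestone,
    multiply the batch size instead and leave the LR fixed."""
    n_passed = sum(epoch >= m for m in milestones)
    return base_batch * factor ** n_passed, base_lr

for epoch in (0, 30, 60, 80):
    batch, lr = schedule(epoch)
    print(f"epoch {epoch:>2}: batch size {batch:>6}, learning rate {lr}")
```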
Large-batch training enables greater data parallelism across GPUs and TPUs, reducing total wall-clock training time. However, naively increasing the batch size often degrades model quality. Several techniques have been developed to overcome this challenge.
You et al. (2017) introduced LARS, which applies different learning rates to each layer of the network based on the ratio of the layer's weight norm to its gradient norm.[8] This layer-wise adaptation stabilizes training at very large batch sizes. Using LARS, ResNet-50 was trained on ImageNet with batch sizes of 32,768 in just 14 minutes.
You et al. (2020) developed LAMB to extend the layer-wise adaptive approach to Adam-based optimizers.[9] LARS performed poorly on attention-based models like BERT, motivating LAMB's development. LAMB adds layer-wise normalization to Adam's update rule and enabled BERT pre-training with batch sizes up to 65,536, reducing training time from 3 days to 76 minutes on a TPUv3 Pod.
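The core of LARS is a per-layer trust ratio. The sketch below shows a simplified single-layer update, omitting momentum and the customary exclusion of biases and normalization parameters; the coefficient values are illustrative.

```python
import torch

def lars_step(param, grad, lr, trust_coef=0.001, weight_decay=1e-4, eps=1e-9):
    """One simplified LARS-style update for a single layer's weight tensor."""
    w_norm = param.norm()
    g_norm = grad.norm()
    # Layer-wise trust ratio: shrinks the step when gradients are large
    # relative to the weights, stabilizing large-batch training.
    local_lr = trust_coef * w_norm / (g_norm + weight_decay * w_norm + eps)
    param.data.add_(grad + weight_decay * param.data, alpha=-lr * local_lr.item())
```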
| Research | Year | Batch Size | Key Technique | Result |
|---|---|---|---|---|
| Goyal et al. | 2017 | 8,192 | Linear LR scaling + warmup | ImageNet in 1 hour |
| You et al. (LARS) | 2017 | 32,768 | Layer-wise adaptive learning rates | ResNet-50 in 14 minutes |
| You et al. (LAMB) | 2020 | 65,536 | Layer-wise adaptive Adam | BERT in 76 minutes |
| Smith et al. | 2018 | Varies | Increase batch instead of decaying LR | Improved parallelism, same accuracy |
Despite the extensive research on large-batch scaling, a 2025 NeurIPS paper revisited small-batch language model training and found that batch sizes as small as 1 can train stably with vanilla SGD.[10] Small batches showed consistent robustness to hyperparameter choices. The authors argued that for many practical settings, gradient accumulation may waste compute that would be better spent on frequent, smaller updates. This challenges the conventional wisdom that large batch sizes are necessary for efficient training.
Modern large language model training uses token-based batch sizes rather than example counts, since sequence lengths can vary. The effective token batch size equals the number of sequences per batch multiplied by the sequence length.
| Model | Year | Batch Size (tokens) | Notes |
|---|---|---|---|
| GPT-2 | 2019 | ~512K | 512 sequences of 1,024 tokens |
| GPT-3 (175B) | 2020 | ~3.2M | Ramped during training |
| Chinchilla (70B) | 2022 | ~1.5M | Compute-optimal training |
| LLaMA (65B) | 2023 | ~4M | 2,048 sequences of 2,048 tokens |
| LLaMA 2 (70B) | 2023 | ~4M | Ramped from 512K to 4M tokens |
| LLaMA 3 (405B) | 2024 | ~16M | Largest documented token batch |
| Mistral (7B) | 2023 | ~2M | Sliding window attention |
| DeepSeek-V2 (236B) | 2024 | ~9.4M | Mixture of experts architecture |
Batch size warmup is standard practice in LLM pre-training: starting with smaller batches early in training (when the critical batch size is low) and increasing as training progresses. LLaMA 2 ramped from 512K to 4M tokens, and GPT-3 similarly increased its batch size during training. This approach aligns training efficiency with the evolving critical batch size.
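A simple ramp schedule might look like the sketch below. The 512K and 4M endpoints mirror the LLaMA 2 figures quoted above, but the linear ramp shape and the 20% ramp fraction are assumptions for illustration.

```python
def token_batch_size(step, total_steps, start=524_288, target=4_194_304, ramp_frac=0.2):
    """Ramp the token batch size linearly over the first ~20% of training,
    then hold it at the target for the remainder."""
    ramp_steps = max(1, int(total_steps * ramp_frac))
    if step >= ramp_steps:
        return target
    return int(start + (step / ramp_steps) * (target - start))
```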
When the desired batch size exceeds available GPU memory, gradient accumulation (also called micro-batching) provides a solution. The training loop proceeds as follows:
1. Run a forward pass and compute the loss on a micro-batch that fits in memory.
2. Backpropagate, adding the resulting gradients to an accumulation buffer rather than updating the weights.
3. Repeat for the desired number of micro-batches.
4. Apply a single optimizer step using the accumulated gradients, then reset them.
Provided each micro-batch loss is scaled by the number of accumulation steps, the result is mathematically equivalent to training with the full effective batch size (up to minor floating-point differences). The effective batch size equals the micro-batch size multiplied by the number of accumulation steps.
Example: If the target batch size is 1,024 but only 256 examples fit in GPU memory, the training loop processes 4 micro-batches of 256 before performing one parameter update. The model sees 1,024 examples per update, just as it would with a true batch size of 1,024.
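In code, this amounts to scaling each micro-batch loss and deferring the optimizer step, as in the PyTorch sketch below (the model, data, and learning rate are placeholders):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Illustrative setup: micro-batches of 256, accumulated 4x for an effective 1,024.
model = nn.Linear(20, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()
loader = DataLoader(TensorDataset(torch.randn(4096, 20), torch.randn(4096, 1)),
                    batch_size=256, shuffle=True)
ACCUM_STEPS = 4

optimizer.zero_grad()
for i, (xb, yb) in enumerate(loader):
    loss = loss_fn(model(xb), yb) / ACCUM_STEPS   # scale so summed grads average over 1,024
    loss.backward()                               # gradients add up in the .grad buffers
    if (i + 1) % ACCUM_STEPS == 0:
        optimizer.step()                          # one parameter update per effective batch
        optimizer.zero_grad()
```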
| Configuration | Micro-batch Size | Accumulation Steps | Effective Batch Size |
|---|---|---|---|
| No accumulation | 256 | 1 | 256 |
| 4x accumulation | 256 | 4 | 1,024 |
| 16x accumulation | 64 | 16 | 1,024 |
| 4-GPU + 2x accumulation | 256 per GPU | 2 | 2,048 |
Gradient accumulation combines naturally with other memory-saving techniques:
- Gradient checkpointing, which recomputes activations during the backward pass instead of storing them.
- Mixed-precision training (FP16/BF16), which roughly halves activation and gradient memory.
- Multi-GPU data parallelism, in which each device accumulates gradients over its own micro-batches (as in the last row of the table above).
One important caveat is that gradient accumulation interacts poorly with standard batch normalization layers, which compute statistics on each micro-batch independently. Those statistics reflect small, noisy micro-batches rather than the full effective batch, which can destabilize training. Workarounds include synchronizing batch normalization statistics across devices in data-parallel setups or replacing batch normalization with layer normalization or group normalization.
Different machine learning tasks tend to use different batch size ranges depending on model size, data characteristics, and hardware constraints.
| Task | Typical Batch Size | Notes |
|---|---|---|
| Image classification | 32 to 256 | Standard for CNNs on single GPUs |
| Object detection | 2 to 16 | High-resolution images consume more memory |
| NLP fine-tuning (BERT) | 16 to 32 | Common for sequence classification tasks |
| LLM pre-training | 512K to 16M tokens | Measured in tokens, ramped during training |
| Reinforcement learning | 32 to 2,048 | Varies widely by algorithm and environment |
| GANs | 16 to 128 | Smaller batches help stabilize adversarial training |
| Diffusion models | 64 to 2,048 | Larger batches improve sample quality |
| Speech recognition | 16 to 64 | Variable-length audio limits batch size |
Imagine you are studying for a test using flashcards. You could look at one flashcard, check the answer, and then move on to the next (that would be a batch size of 1). Or you could look at 10 flashcards, think about all the answers at once, and then check how you did (that would be a batch size of 10). Looking at more cards at once gives you a better overall picture of what you know and what you do not know. But it also means you have to hold more cards in your hands at the same time, and you wait longer before checking your answers. Batch size is just how many flashcards the computer looks at before it checks its answers and tries to get better.