# Batch Size

> Source: https://aiwiki.ai/wiki/batch_size
> Updated: 2026-07-11
> Categories: Deep Learning, Machine Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

In [machine learning](/wiki/machine_learning), **batch size** is the [hyperparameter](/wiki/hyperparameter) that sets how many [training](/wiki/training) examples a model processes together before it updates its parameters with one step of [gradient descent](/wiki/gradient_descent). Common batch sizes range from 1 to several thousand examples for vision models and reach millions of tokens for [large language model](/wiki/large_language_model) pre-training. During each iteration, the model runs a forward pass on the [batch](/wiki/batch), computes the [loss function](/wiki/loss_function), calculates gradients via [backpropagation](/wiki/backpropagation), and updates the weights, so the batch size determines how many examples contribute to each gradient estimate.

Batch size interacts closely with the [learning rate](/wiki/learning_rate), training speed, GPU memory usage, and [generalization](/wiki/generalization). Selecting an appropriate batch size requires balancing computational efficiency against model quality, and it remains one of the most consequential decisions in [deep learning](/wiki/deep_learning) experiment design.

## Types of Gradient Descent by Batch Size

The batch size defines the variant of [gradient descent](/wiki/gradient_descent) being used. The three canonical variants are:

| Variant | Batch Size | Description |
|---------|-----------|-------------|
| Batch gradient descent | Entire dataset | Uses all training examples per update; exact gradient but slow and memory-intensive |
| [Stochastic gradient descent](/wiki/stochastic_gradient_descent_sgd) (SGD) | 1 | Updates after a single example; fast but noisy |
| [Mini-batch](/wiki/mini-batch) gradient descent | Typically 16 to 8,192 | Standard modern approach using subsets of the data |

In practice, the term "SGD" in deep learning almost always refers to [mini-batch](/wiki/mini-batch) gradient descent with batch sizes between 32 and several thousand, not the literal single-example variant.
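As a rough illustration of how the three variants differ, the sketch below counts parameter updates per epoch and slices a dataset into mini-batches. The dataset size of 50,000 and the helper names `updates_per_epoch` and `iterate_minibatches` are hypothetical and chosen only for the example.

```python
import math


def updates_per_epoch(num_examples: int, batch_size: int) -> int:
    """Parameter updates in one pass over the data (the last batch may be partial)."""
    return math.ceil(num_examples / batch_size)


def iterate_minibatches(data, batch_size):
    """Yield consecutive slices of `data`, each holding at most `batch_size` items."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]


N = 50_000  # hypothetical dataset size
for name, batch_size in [("batch GD", N), ("SGD", 1), ("mini-batch", 128)]:
    print(f"{name}: {updates_per_epoch(N, batch_size)} updates per epoch")
```

In a real training loop the data would normally be shuffled before each epoch; that step is omitted here for brevity.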

## How Does Batch Size Affect Training?

### Generalization

Several studies have found that smaller batches tend to produce models with better [generalization](/wiki/generalization) to unseen data. Keskar et al. (2017) showed that large-batch training converges to "sharp minima" (narrow valleys in the loss landscape), while small-batch training finds "flat minima" (broad basins).[1] The paper states that "large-batch methods tend to converge to sharp minimizers of the training and testing functions, and as is well known, sharp minima lead to poorer generalization."[1] Flat minima generalize better because the [loss](/wiki/loss_function) remains stable under small perturbations to the weights.

The mechanism behind this effect is gradient noise. Small-batch gradient estimates are inherently noisy, and this noise acts as a form of implicit [regularization](/wiki/regularization) that prevents the [optimizer](/wiki/optimizer) from settling into sharp valleys that would perform poorly on test data. Smith et al. (2018) formalized this by showing that the stochastic noise scale in SGD is approximately proportional to the ratio of [learning rate](/wiki/learning_rate) to batch size.[2] This insight explains why the choice of batch size has such a significant impact on test performance.

However, the relationship is not absolute. With proper [hyperparameter](/wiki/hyperparameter) tuning and techniques like [learning rate](/wiki/learning_rate) warmup, the generalization gap between small and large batches can often be closed.

### Training Speed and Throughput

Larger batches enable greater parallelism on modern hardware. GPUs and [TPUs](/wiki/tpu) are designed to process large matrix operations efficiently, so a 4x increase in batch size may only raise iteration time by 1.5 to 2x, significantly improving data throughput.
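The arithmetic behind that claim can be checked directly: throughput in examples per second changes by the batch-size factor divided by the step-time factor. The function name `throughput_gain` is made up for this sketch.

```python
def throughput_gain(batch_factor: float, step_time_factor: float) -> float:
    """Relative change in examples processed per second.

    batch_factor: how many times larger the batch is.
    step_time_factor: how many times longer one iteration takes.
    """
    return batch_factor / step_time_factor


# A 4x larger batch whose iteration takes 1.5x to 2x as long.
for slowdown in (1.5, 2.0):
    print(f"step time x{slowdown}: throughput x{throughput_gain(4, slowdown):.2f}")
```

Under these figures a 4x larger batch processes roughly 2x to 2.7x as many examples per second.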

However, larger batches mean fewer total parameter updates per [epoch](/wiki/epoch), reducing the number of times the [optimizer](/wiki/optimizer) can adjust the weights. This can slow convergence when measured in epochs, even though wall-clock time per epoch decreases.

| Aspect | Small Batch (32) | Large Batch (4,096) |
|--------|-----------------|-------------------|
| GPU utilization | Often underutilizes hardware | Better hardware utilization |
| Time per iteration | Fast | Slower per step |
| Iterations per epoch | Many | Few |
| Gradient estimate | Noisy (high variance) | Accurate (low variance) |
| Generalization | Often better | Can degrade without tuning |
| Memory usage | Low | High |

### Memory Usage

Batch size directly affects GPU memory consumption. Each example in a batch requires memory for input data, intermediate activations (stored for [backpropagation](/wiki/backpropagation)), and computed gradients. Activation memory scales linearly with batch size and, for [transformer](/wiki/transformer) architectures, activations typically dominate memory requirements.

The maximum batch size that fits in GPU memory depends on the model architecture, input dimensions, numerical precision ([FP32](/wiki/floating_point) vs. [FP16](/wiki/mixed_precision_training)/[BF16](/wiki/bfloat16)), and whether memory-saving techniques like [gradient checkpointing](/wiki/gradient_checkpointing) are employed.
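A back-of-the-envelope estimate of the largest batch that fits can be made with a linear memory model: a fixed cost (weights, gradients, optimizer state) plus a per-example activation cost. This is only a sketch under that assumption; the function name and all figures below are hypothetical, and real memory use should be measured with a profiler.

```python
def max_batch_size(capacity_mib: int, fixed_mib: int, per_example_mib: int) -> int:
    """Largest batch whose estimated memory fits under a simple linear model.

    Assumes memory = fixed_mib + batch_size * per_example_mib, which ignores
    allocator fragmentation and framework workspace buffers.
    """
    if per_example_mib <= 0:
        raise ValueError("per_example_mib must be positive")
    available = capacity_mib - fixed_mib
    if available <= 0:
        return 0
    return available // per_example_mib


# Hypothetical 24 GiB GPU, 6 GiB of fixed state, 256 MiB of activations per example.
print(max_batch_size(24 * 1024, 6 * 1024, 256))
# Halving per-example activation memory (for example via lower precision) doubles the estimate.
print(max_batch_size(24 * 1024, 6 * 1024, 128))
```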

## Why Are Batch Sizes Usually Powers of Two?

Practitioners commonly select batch sizes that are powers of two (32, 64, 128, 256, 512, etc.). This convention originates from the way GPU memory and compute units are organized. GPU warp sizes, memory bus widths, and CUDA core groupings are all structured around powers of two, so tensor operations on arrays aligned to these sizes can be more efficient.

NVIDIA's cuDNN library, which underlies most deep learning frameworks, optimizes kernels for tensor dimensions that are multiples of 8 or 16.[3] Powers of two from 16 upward satisfy both alignment requirements automatically.

That said, empirical benchmarks by Raschka (2022) and Weights & Biases have shown that the difference in throughput between, for example, a batch size of 128 and 127 is negligible on modern hardware.[3] Non-power-of-two sizes like 48, 96, or 384 work perfectly well in practice. The convention persists primarily out of habit and convenience rather than strict necessity.

## Gradient Noise and the Gradient Noise Scale

[Mini-batch](/wiki/mini-batch) gradients are stochastic estimates of the true gradient computed over the full dataset. The difference between the mini-batch gradient and the true gradient is referred to as "gradient noise." The standard deviation of this noise scales inversely with the square root of the batch size (equivalently, its variance scales as $$1/B$$): a batch size of 1 produces highly noisy estimates, while using the full dataset yields the exact gradient with zero noise.

Gradient noise has dual effects:

- **Positive**: It helps the optimizer escape sharp minima and saddle points, providing implicit [regularization](/wiki/regularization) that improves [generalization](/wiki/generalization).
- **Negative**: Excessive noise prevents convergence and causes erratic training dynamics.
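The square-root relationship between batch size and gradient noise can be checked with a small simulation. The sketch below assumes per-example gradients are independent draws from a one-dimensional Gaussian with mean 1 and standard deviation 2. This is a toy stand-in for real gradients, and the function name is hypothetical.

```python
import random
import statistics


def minibatch_gradient_std(batch_size: int, trials: int = 2000, seed: int = 0) -> float:
    """Empirical standard deviation of the mini-batch mean gradient.

    Each per-example gradient is a toy scalar drawn from N(mean=1, std=2).
    """
    rng = random.Random(seed)
    estimates = [
        statistics.fmean(rng.gauss(1.0, 2.0) for _ in range(batch_size))
        for _ in range(trials)
    ]
    return statistics.stdev(estimates)


for batch_size in (1, 4, 16, 64):
    measured = minibatch_gradient_std(batch_size)
    predicted = 2.0 / batch_size ** 0.5
    print(f"B={batch_size:3d}  measured std={measured:.3f}  predicted 2/sqrt(B)={predicted:.3f}")
```

Each 4x increase in batch size should roughly halve the measured standard deviation, up to sampling error.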

### The Gradient Noise Scale

McCandlish et al. (2018) formalized the concept of the **gradient noise scale** as a measurable quantity that characterizes the signal-to-noise ratio of gradient estimates.[4] The noise scale is defined as:

$$
B_{\text{noise}} = \frac{\mathrm{tr}(\Sigma)}{\lvert G \rvert^2}
$$

where $$\mathrm{tr}(\Sigma)$$ is the trace of the covariance matrix of per-example gradients and $$\lvert G \rvert$$ is the norm of the true gradient. This provides a natural reference point: when the batch size is below $$B_{\text{noise}}$$, training is noise-dominated and each additional example provides significant information; when the batch size is above $$B_{\text{noise}}$$, training is compute-dominated and additional examples provide diminishing returns.

The gradient noise scale is not fixed during training. At initialization, it is typically small. As training progresses toward convergence, the true gradient shrinks while per-example variation persists, causing the noise scale to increase.
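The definition above can be evaluated naively from a set of per-example gradients. The sketch below substitutes the sample mean for the true gradient and the sample covariance for $$\Sigma$$. That substitution is a simplifying assumption made for this example, not the estimator used in the paper, and it becomes unreliable when few examples are available.

```python
def gradient_noise_scale(per_example_grads):
    """Naive estimate of tr(Sigma) / |G|^2 from a list of per-example gradient vectors.

    Assumption: the sample mean stands in for the true gradient G, and the
    unbiased sample variance of each coordinate stands in for the diagonal of Sigma.
    """
    n = len(per_example_grads)
    dim = len(per_example_grads[0])
    mean = [sum(g[d] for g in per_example_grads) / n for d in range(dim)]
    # The trace of the covariance matrix is the sum of per-coordinate variances.
    trace_sigma = sum(
        sum((g[d] - mean[d]) ** 2 for g in per_example_grads) / (n - 1)
        for d in range(dim)
    )
    grad_norm_sq = sum(m * m for m in mean)
    return trace_sigma / grad_norm_sq


# Four hypothetical two-dimensional per-example gradients.
grads = [[1.0, 0.0], [3.0, 0.0], [2.0, 2.0], [2.0, -2.0]]
print(gradient_noise_scale(grads))
```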

## What Is the Critical Batch Size?

McCandlish et al. (2018) introduced the concept of the **critical batch size**, the batch size that represents an optimal tradeoff between data parallelism and compute efficiency.[4] Below the critical batch size, doubling the batch size nearly halves the number of optimization steps needed to reach a target loss. Above it, doubling the batch yields diminishing reductions in step count, effectively wasting compute.

The critical batch size relates directly to the gradient noise scale: $$B_{\text{crit}}$$ is approximately equal to $$B_{\text{noise}}$$. When $$B$$ is much less than $$B_{\text{crit}}$$, noise dominates and averaging more examples significantly improves the signal-to-noise ratio. When $$B$$ is much greater than $$B_{\text{crit}}$$, the gradient estimate is already accurate, and additional examples add minimal new information.
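The two regimes can be made concrete with the tradeoff relation from McCandlish et al., in which the number of steps to reach a target loss is modelled as $$S = S_{\min}(1 + B_{\text{crit}}/B)$$.[4] The sketch below evaluates that relation; the values of `b_crit` and `s_min` are hypothetical.

```python
def steps_needed(batch_size: float, b_crit: float, s_min: float) -> float:
    """Optimization steps to reach the target loss: S = S_min * (1 + B_crit / B)."""
    return s_min * (1 + b_crit / batch_size)


def examples_needed(batch_size: float, b_crit: float, s_min: float) -> float:
    """Total examples processed, E = B * S, a proxy for compute."""
    return batch_size * steps_needed(batch_size, b_crit, s_min)


B_CRIT, S_MIN = 1_000, 10_000  # hypothetical values
for b in (10, 20, 10_000, 20_000):
    print(f"B={b:6d}  steps={steps_needed(b, B_CRIT, S_MIN):>11,.0f}  "
          f"examples={examples_needed(b, B_CRIT, S_MIN):>13,.0f}")
```

Doubling from 10 to 20 nearly halves the step count at almost no extra compute, while doubling from 10,000 to 20,000 saves under 5 percent of the steps and nearly doubles the examples processed.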

Kaplan et al. (2020) discovered a scaling law relating critical batch size to the training loss in language modeling:[5]

$$
B_{\text{crit}}(L) = 2.0 \times 10^8 \cdot L^{-4.76}
$$

This means that as models improve (loss decreases), the critical batch size grows, justifying the extremely large batch sizes used in modern [large language model](/wiki/large_language_model) training.
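Plugging numbers into the formula shows how quickly the critical batch size grows. The sketch below assumes, following the conventions of [5], that the loss is measured in nats per token and that the result is a batch size in tokens. The two loss values are arbitrary examples.

```python
def critical_batch_size_tokens(loss: float) -> float:
    """Evaluate B_crit(L) = 2.0e8 * L^(-4.76) from the scaling law above."""
    return 2.0e8 * loss ** -4.76


for loss in (3.0, 2.0):
    print(f"L={loss}: B_crit is about {critical_batch_size_tokens(loss):,.0f} tokens")
```

Dropping the loss from 3.0 to 2.0 raises the predicted critical batch size from roughly one million to roughly seven million tokens.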

### Revisiting Critical Batch Size for LLMs

Recent empirical work (2025) from Allen AI on language model pre-training has refined the critical batch size framework with several notable findings:[6]

- The critical batch size starts near zero at initialization, increases rapidly during the first 10 to 20 percent of training, then plateaus.
- Batch size warmup strategies that double the batch size when justified can reduce total gradient steps by up to 43 percent. The authors used this approach to train OLMo 1B "to slightly better loss than the original training run with 43% fewer gradient steps."[6]
- The original gradient noise scale formula does not reliably predict critical batch size for modern LLMs trained with [Adam](/wiki/adam_optimizer); direct empirical measurement proves more reliable.
- The critical batch size is "largely independent of model size, scaling primarily with data size," which means data quantity, not parameter count, sets how large the batch can usefully grow.[6]

| Training Phase | Critical Batch Size Behavior | Practical Implication |
|---|---|---|
| Initialization | Very low | Small batches are efficient |
| Early (0 to 20% of steps) | Rapidly increasing | Gradually increase batch size |
| Mid to late training | Plateaus at a high value | Use the full target batch size |

## How Should Learning Rate Scale With Batch Size?

The relationship between batch size and [learning rate](/wiki/learning_rate) is one of the most extensively studied topics in deep learning optimization.

### Linear Scaling Rule

Goyal et al. (2017) proposed the **linear scaling rule**, stated in the paper as: "When the minibatch size is multiplied by k, multiply the learning rate by k."[7] The intuition is that one step on a batch k times larger should cover the same ground as k consecutive small-batch steps, which holds approximately as long as the gradient changes little over those k steps.

Using this rule, Goyal et al. trained [ResNet](/wiki/resnet)-50 on [ImageNet](/wiki/imagenet) in one hour with batches of 8,192 images across 256 GPUs, reaching a 23.74 percent top-1 validation error that matched the 23.60 percent small-batch baseline (about 76.3 percent accuracy).[7] A key addition was a **gradual warmup** scheme: starting with a small learning rate and linearly increasing it to the target over the first few epochs, preventing the large learning rate from destabilizing the randomly initialized model.
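A gradual warmup of this kind can be sketched as a linear ramp. The function name, the step counts, and the learning-rate values below are illustrative assumptions, not the paper's exact configuration.

```python
def warmup_lr(step: int, warmup_steps: int, start_lr: float, target_lr: float) -> float:
    """Linearly ramp the learning rate from start_lr to target_lr, then hold it."""
    if step >= warmup_steps:
        return target_lr
    return start_lr + (target_lr - start_lr) * step / warmup_steps


# Illustrative values: ramp from 0.1 to 3.2 over the first 100 steps.
for step in (0, 25, 50, 100, 500):
    print(step, round(warmup_lr(step, 100, 0.1, 3.2), 4))
```

After warmup, the usual learning rate schedule (for example, step decay) takes over from the target value.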

However, the linear scaling rule tends to break down for very large batch sizes, where generalization performance can degrade despite matching the prescribed learning rate.

### Square Root Scaling Rule

An alternative is the **square root scaling rule**: when the batch size is multiplied by k, multiply the learning rate by $$\sqrt{k}$$. This rule has theoretical justification for maintaining a constant gradient noise level across different batch sizes and tends to work better with adaptive optimizers like [Adam](/wiki/adam_optimizer) in some settings.

### No Scaling

Some practitioners find that [Adam](/wiki/adam_optimizer) and [AdamW](/wiki/adamw) are less sensitive to batch size changes and require no learning rate adjustment, though this is task-dependent.

| Scaling Rule | Formula | Best Suited For |
|---|---|---|
| Linear (Goyal et al.) | $$\text{LR} \times k$$ when $$\text{batch} \times k$$ | SGD on vision tasks |
| Square root | $$\text{LR} \times \sqrt{k}$$ when $$\text{batch} \times k$$ | Adaptive optimizers, some LLM settings |
| No scaling | LR unchanged | Some Adam/AdamW setups |
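The three rules in the table differ only in how the batch-size ratio k enters the learning rate. A minimal sketch follows; the function name and the base values (learning rate 0.1 at batch size 256) are chosen for illustration.

```python
import math


def scale_learning_rate(base_lr: float, base_batch: int, new_batch: int,
                        rule: str = "linear") -> float:
    """Adjust a learning rate tuned at base_batch for use at new_batch."""
    k = new_batch / base_batch
    if rule == "linear":
        return base_lr * k
    if rule == "sqrt":
        return base_lr * math.sqrt(k)
    if rule == "none":
        return base_lr
    raise ValueError(f"unknown rule: {rule!r}")


# Illustrative base configuration: lr 0.1 at batch size 256, scaled to 8,192 (k = 32).
for rule in ("linear", "sqrt", "none"):
    print(rule, round(scale_learning_rate(0.1, 256, 8192, rule), 4))
```

The result is a starting point for tuning rather than a guarantee, since which rule works best depends on the optimizer and task.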

### Increasing Batch Size Instead of Decaying Learning Rate

Smith et al. (2018) proposed an alternative to the standard practice of decaying the learning rate during training: keep the learning rate fixed and increase the batch size instead.[2] Because the SGD noise scale is approximately proportional to the ratio of learning rate to batch size, both schedules reduce the noise in roughly the same way. Increasing the batch size, however, enables greater data parallelism on multi-GPU systems and so shortens wall-clock training time. The technique was validated with SGD, SGD with momentum, Nesterov momentum, and [Adam](/wiki/adam_optimizer), reaching equivalent test accuracies after the same number of training epochs but with fewer parameter updates.
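The idea can be illustrated with the approximate SGD noise scale, learning rate times dataset size divided by batch size, which is valid when the batch is much smaller than the dataset. The sketch below compares a learning rate decay with a batch size increase by the same factor; all numeric values are hypothetical.

```python
def noise_scale(lr: float, dataset_size: int, batch_size: int) -> float:
    """Approximate SGD noise scale: lr * N / B, assuming B is much smaller than N."""
    return lr * dataset_size / batch_size


N = 1_000_000                # hypothetical dataset size
lr, batch = 0.1, 128         # hypothetical starting configuration
factor = 5                   # hypothetical schedule factor

decayed_lr = noise_scale(lr / factor, N, batch)      # conventional: shrink the learning rate
bigger_batch = noise_scale(lr, N, batch * factor)    # alternative: grow the batch

print(decayed_lr, bigger_batch)
print("updates per epoch:", N // batch, "vs", N // (batch * factor))
```

Both schedules lower the noise scale by the same factor, but the larger batch needs one fifth as many parameter updates per epoch.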

## Large Batch Training Techniques

Large-batch training enables greater data parallelism across GPUs and [TPUs](/wiki/tpu), reducing total wall-clock training time. However, naively increasing the batch size often degrades model quality. Several techniques have been developed to overcome this challenge.

### LARS (Layer-wise Adaptive Rate Scaling)

You et al. (2017) introduced **LARS**, which applies different learning rates to each layer of the network based on the ratio of the layer's weight norm to its gradient norm.[8] This layer-wise adaptation stabilizes training at very large batch sizes. The paper reports that "using LARS, we scaled AlexNet up to a batch size of 8K, and ResNet-50 to a batch size of 32K without loss in accuracy."[8] Building on LARS, You et al. (2018) subsequently trained ResNet-50 on ImageNet with a batch size of 32,768 to 74.9 percent top-1 accuracy in 14 minutes, and completed the full 90-epoch schedule in 20 minutes.[9]
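The core of LARS is a per-layer "local" learning rate proportional to the ratio of weight norm to gradient norm. The sketch below shows a single update for one layer in plain Python. It omits momentum, and the trust coefficient value and function names are assumptions made for illustration rather than the paper's reference implementation.

```python
import math


def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))


def lars_local_lr(weights, grads, trust_coef=0.001, weight_decay=0.0):
    """Layer-wise rate: trust_coef * ||w|| / (||g|| + weight_decay * ||w||)."""
    w_norm, g_norm = l2_norm(weights), l2_norm(grads)
    denom = g_norm + weight_decay * w_norm
    if w_norm == 0.0 or denom == 0.0:
        return 1.0  # fall back to the global rate when a norm vanishes
    return trust_coef * w_norm / denom


def lars_step(weights, grads, global_lr, trust_coef=0.001, weight_decay=0.0):
    """One LARS update for a single layer (momentum omitted for brevity)."""
    local_lr = lars_local_lr(weights, grads, trust_coef, weight_decay)
    return [
        w - global_lr * local_lr * (g + weight_decay * w)
        for w, g in zip(weights, grads)
    ]


layer = [3.0, 4.0]
small_grad = [0.6, 0.8]
large_grad = [60.0, 80.0]  # same direction, 100x the magnitude
print(lars_step(layer, small_grad, global_lr=1.0))
print(lars_step(layer, large_grad, global_lr=1.0))
```

Both calls move the weights by the same amount, because the local rate cancels the gradient's magnitude and leaves a step proportional to the weight norm.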

### LAMB (Layer-wise Adaptive Moments for Batch Training)

You et al. (2020) developed **LAMB** to extend the layer-wise adaptive approach to [Adam](/wiki/adam_optimizer)-based optimizers.[10] LARS performed poorly on attention-based models like [BERT](/wiki/bert), motivating LAMB's development. LAMB adds layer-wise normalization to Adam's update rule and enabled BERT pre-training with batch sizes up to 65,536, reducing training time from 3 days to 76 minutes on a TPUv3 Pod.

| Research | Year | Batch Size | Key Technique | Result |
|----------|------|-----------|----------------|--------|
| Goyal et al. | 2017 | 8,192 | Linear LR scaling + warmup | ImageNet in 1 hour, 23.74% top-1 error |
| You et al. (LARS) | 2017 | 32,768 | Layer-wise adaptive learning rates | ResNet-50 scaled without accuracy loss |
| You et al. (ImageNet in Minutes) | 2018 | 32,768 | LARS at scale | ResNet-50 in 14 minutes, 74.9% top-1 |
| You et al. (LAMB) | 2020 | 65,536 | Layer-wise adaptive Adam | BERT in 76 minutes |
| Smith et al. | 2018 | Varies | Increase batch instead of decaying LR | Improved parallelism, same accuracy |

### Small Batch Training Revisited

Despite the extensive research on large-batch scaling, a 2025 NeurIPS paper revisited small-batch language model training and found that batch sizes as small as 1 can train stably with vanilla SGD, even without momentum.[11] Small batches showed consistent robustness to [hyperparameter](/wiki/hyperparameter) choices. The authors recommend against [gradient accumulation](/wiki/gradient_accumulation) unless training on multiple devices, arguing that for many practical settings it wastes compute that would be better spent on frequent, smaller updates.[11] This challenges the conventional wisdom that large batch sizes are necessary for efficient training.

## Batch Sizes in Modern LLM Training

Modern [large language model](/wiki/large_language_model) training uses token-based batch sizes rather than example counts, since sequence lengths can vary. The effective token batch size equals the number of sequences per batch multiplied by the sequence length.

| Model | Year | Batch Size (tokens) | Notes |
|-------|------|-------------------|-------|
| [GPT-2](/wiki/gpt-2) | 2019 | ~512K | 512 sequences of 1,024 tokens |
| [GPT-3](/wiki/gpt-3) (175B) | 2020 | ~3.2M | Ramped during training |
| Chinchilla (70B) | 2022 | ~1.5M | Compute-optimal training |
| [LLaMA](/wiki/llama) (65B) | 2023 | ~4M | 2,048 sequences of 2,048 tokens |
| [LLaMA 2](/wiki/llama_2) (70B) | 2023 | ~4M | Ramped from 512K to 4M tokens |
| [LLaMA 3](/wiki/llama_3) (405B) | 2024 | ~16M | Ramped from 4M to 16M tokens |
| [Mistral](/wiki/mistral) (7B) | 2023 | ~2M | Sliding window attention |
| [DeepSeek](/wiki/deepseek)-V2 (236B) | 2024 | ~9.4M | [Mixture of experts](/wiki/mixture_of_experts) architecture |

**Batch size warmup** is standard practice in LLM pre-training: starting with smaller batches early in training (when the critical batch size is low) and increasing as training progresses. [LLaMA 3](/wiki/llama_3) (405B) illustrates the pattern, starting at 4M tokens, doubling to 8M after roughly 252M tokens, and doubling again to 16M after about 2.87 trillion tokens.[12] [GPT-3](/wiki/gpt-3) similarly increased its batch size during training. This approach aligns training efficiency with the evolving critical batch size.
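The token arithmetic and the stepwise schedule described above can be written down directly. The sketch below encodes the LLaMA 3 thresholds quoted in this section as a lookup table; the function and variable names are hypothetical, and the "4M", "8M", and "16M" figures are treated as round numbers.

```python
def tokens_per_batch(num_sequences: int, seq_len: int) -> int:
    """Effective token batch size: sequences per batch times sequence length."""
    return num_sequences * seq_len


# (tokens seen so far, batch size in tokens), using the figures quoted above.
LLAMA3_SCHEDULE = [
    (0, 4_000_000),
    (252_000_000, 8_000_000),
    (2_870_000_000_000, 16_000_000),
]


def batch_size_at(tokens_seen: int, schedule=LLAMA3_SCHEDULE) -> int:
    """Return the batch size (in tokens) in force after `tokens_seen` training tokens."""
    current = schedule[0][1]
    for threshold, batch_tokens in schedule:
        if tokens_seen >= threshold:
            current = batch_tokens
    return current


print(tokens_per_batch(2_048, 2_048))          # 2,048 sequences of 2,048 tokens
print(batch_size_at(100_000_000))              # before the first doubling
print(batch_size_at(1_000_000_000_000))        # between the two doublings
print(batch_size_at(3_000_000_000_000))        # after the second doubling
```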

## Micro-batching and Gradient Accumulation

When the desired batch size exceeds available GPU memory, **[gradient accumulation](/wiki/gradient_accumulation)** (also called micro-batching) provides a solution. The training loop proceeds as follows:

1. The target batch is divided into smaller micro-batches that fit in memory.
2. A forward and backward pass is performed on each micro-batch, accumulating the gradients without updating the model.
3. After processing all micro-batches, the accumulated gradients are averaged and a single optimizer step is performed.

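The resulting update is mathematically equivalent to processing the whole target batch in a single pass (up to minor floating-point differences), provided the loss is averaged over the full effective batch and no layer depends on per-batch statistics. The effective batch size equals the micro-batch size multiplied by the number of accumulation steps.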

**Example**: If the target batch size is 1,024 but only 256 examples fit in GPU memory, the training loop processes 4 micro-batches of 256 before performing one parameter update. The model sees 1,024 examples per update, just as it would with a true batch size of 1,024.
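The equivalence can be verified on a toy problem. The sketch below uses a one-parameter linear model with mean squared error and compares the gradient over the whole batch with the average of micro-batch gradients. The model, data, and function names are made up for illustration, and the micro-batch size is assumed to divide the batch evenly.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w over one batch."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Average the gradients of equally sized micro-batches before a single update."""
    if len(xs) % micro_batch_size != 0:
        raise ValueError("micro_batch_size must divide the batch size evenly")
    num_micro = len(xs) // micro_batch_size
    total = 0.0
    for i in range(num_micro):
        lo, hi = i * micro_batch_size, (i + 1) * micro_batch_size
        # Dividing by num_micro turns the running sum into an average.
        total += mse_grad(w, xs[lo:hi], ys[lo:hi]) / num_micro
    return total


xs = [float(i) for i in range(1, 9)]   # a toy batch of 8 examples
ys = [3.0 * x for x in xs]             # targets from y = 3x
w = 1.0

print(mse_grad(w, xs, ys))                              # whole batch at once
print(accumulated_grad(w, xs, ys, micro_batch_size=2))  # 4 micro-batches of 2
```

Both calls return the same gradient. If the micro-batches had unequal sizes, a plain average would weight examples unevenly, which is why the sketch rejects that case.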

| Configuration | Micro-batch Size | Accumulation Steps | Effective Batch Size |
|---|---|---|---|
| No accumulation | 256 | 1 | 256 |
| 4x accumulation | 256 | 4 | 1,024 |
| 16x accumulation | 64 | 16 | 1,024 |
| 4-GPU + 2x accumulation | 256 per GPU | 2 | 2,048 |

Gradient accumulation combines naturally with other memory-saving techniques:

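- **[Mixed precision](/wiki/mixed_precision_training) training**: Storing activations in FP16 or BF16 roughly halves activation memory relative to FP32, although total savings are smaller when FP32 master weights and optimizer states are kept.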
- **[Gradient checkpointing](/wiki/gradient_checkpointing)**: Trades compute for memory by recomputing some activations during the backward pass instead of storing them.
- **[Model parallelism](/wiki/model_parallelism)**: Splits models across multiple GPUs when a single model is too large for one device.

One important caveat is that gradient accumulation does not reproduce large-batch behavior for standard [batch normalization](/wiki/batch_normalization) layers, which compute statistics on each micro-batch independently. The statistics therefore come from small, noisy micro-batches rather than the full effective batch, which can destabilize training. Synchronized batch normalization pools statistics across devices but not across sequential accumulation steps, so the usual remedy is to replace batch normalization with [layer normalization](/wiki/layer_normalization) or [group normalization](/wiki/group_normalization), whose statistics do not depend on the batch size.
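The mismatch is easy to see on a toy example: the mean and variance of each micro-batch can differ sharply from those of the full effective batch. The activation values below are invented to make the effect obvious.

```python
import statistics


def batch_stats(values):
    """Mean and (population) variance, the statistics batch normalization relies on."""
    return statistics.fmean(values), statistics.pvariance(values)


# Hypothetical activations for one feature across an effective batch of 8 examples.
activations = [0.0, 1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 13.0]

full_batch = batch_stats(activations)
micro_batches = [batch_stats(activations[i:i + 4]) for i in range(0, 8, 4)]

print("full batch (mean, var):  ", full_batch)
print("micro-batches (mean, var):", micro_batches)
```

Each micro-batch is normalized with its own statistics, so the same activation value is mapped differently than it would be under full-batch statistics.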

## What Batch Size Should I Use for My Task?

Different machine learning tasks tend to use different batch size ranges depending on model size, data characteristics, and hardware constraints.

| Task | Typical Batch Size | Notes |
|------|-------------------|-------|
| Image classification | 32 to 256 | Standard for CNNs on single GPUs |
| [Object detection](/wiki/object_detection) | 2 to 16 | High-resolution images consume more memory |
| [NLP](/wiki/natural_language_processing) fine-tuning ([BERT](/wiki/bert)) | 16 to 32 | Common for sequence classification tasks |
| [LLM](/wiki/large_language_model) pre-training | 512K to 16M tokens | Measured in tokens, ramped during training |
| [Reinforcement learning](/wiki/reinforcement_learning) | 32 to 2,048 | Varies widely by algorithm and environment |
| [GANs](/wiki/generative_adversarial_network) | 16 to 128 | Smaller batches help stabilize adversarial training |
| [Diffusion models](/wiki/diffusion_model) | 64 to 2,048 | Larger batches improve sample quality |
| [Speech recognition](/wiki/speech_recognition) | 16 to 64 | Variable-length audio limits batch size |

## Practical Guidelines

- **Start with standard sizes.** Use 32 to 64 for smaller tasks and 256 to 512 for larger ones. Pick the largest batch size that fits comfortably in GPU memory.
- **Use gradient accumulation** when memory limits the batch size, simulating the desired effective batch size without additional hardware.
- **Adjust the learning rate** when changing the batch size. Use linear scaling with warmup for SGD; experiment with square root scaling or no scaling for Adam.
- **Monitor validation performance.** If increasing the batch size degrades generalization, consider adding explicit [regularization](/wiki/regularization), extending the warmup period, or reducing the batch size.
- **Consider the critical batch size.** If doubling the batch size does not meaningfully reduce the number of required training steps, you have exceeded the efficiency threshold and are wasting compute.
- **Try batch size warmup** for large-scale pre-training: start with smaller batches and increase during training for both efficiency and model quality.
- **Profile memory usage** before committing to a batch size. Tools like PyTorch's memory profiler or NVIDIA's nvidia-smi help identify the maximum feasible batch size.

## Explain Like I'm 5 (ELI5)

Imagine you are studying for a test using flashcards. You could look at one flashcard, check the answer, and then move on to the next (that would be a batch size of 1). Or you could look at 10 flashcards, think about all the answers at once, and then check how you did (that would be a batch size of 10). Looking at more cards at once gives you a better overall picture of what you know and what you do not know. But it also means you have to hold more cards in your hands at the same time, and you wait longer before checking your answers. Batch size is just how many flashcards the computer looks at before it checks its answers and tries to get better.

## References

1. Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., & Tang, P. T. P. (2017). "On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima." *ICLR 2017*. https://arxiv.org/abs/1609.04836
2. Smith, S. L., Kindermans, P.-J., Ying, C., & Le, Q. V. (2018). "Don't Decay the Learning Rate, Increase the Batch Size." *ICLR 2018*. https://arxiv.org/abs/1711.00489
3. Raschka, S. (2022). "No, We Don't Have to Choose Batch Sizes As Powers Of 2." https://sebastianraschka.com/blog/2022/batch-size-2.html
4. McCandlish, S., Kaplan, J., Amodei, D., & the OpenAI Dota Team. (2018). "An Empirical Model of Large-Batch Training." *arXiv:1812.06162*. https://arxiv.org/abs/1812.06162
5. Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., et al. (2020). "Scaling Laws for Neural Language Models." *arXiv:2001.08361*. https://arxiv.org/abs/2001.08361
6. Bergsma, S., et al. / Allen AI. (2025). "Power Lines: Scaling Laws for Weight Decay and Batch Size in LLM Pre-training" (Critical Batch Size Revisited). *arXiv:2505.23971*. https://arxiv.org/abs/2505.23971
7. Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., et al. (2017). "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour." *arXiv:1706.02677*. https://arxiv.org/abs/1706.02677
8. You, Y., Gitman, I., & Ginsburg, B. (2017). "Large Batch Training of Convolutional Networks" (Scaling SGD Batch Size to 32K for ImageNet Training). *arXiv:1708.03888*. https://arxiv.org/abs/1708.03888
9. You, Y., Zhang, Z., Hsieh, C.-J., Demmel, J., & Keutzer, K. (2018). "ImageNet Training in Minutes." *arXiv:1709.05011*. https://arxiv.org/abs/1709.05011
10. You, Y., Li, J., Reddi, S., Hseu, J., et al. (2020). "Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes." *ICLR 2020*. https://arxiv.org/abs/1904.00962
11. Marek, M., et al. (2025). "Small Batch Size Training for Language Models: When Vanilla SGD Works, and Why Gradient Accumulation Is Wasteful." *NeurIPS 2025*. https://arxiv.org/abs/2507.07101
12. Grattafiori, A., et al. / Meta AI. (2024). "The Llama 3 Herd of Models." *arXiv:2407.21783*. https://arxiv.org/abs/2407.21783