In machine learning, batch size is a hyperparameter that specifies how many training examples are processed together before the model's parameters are updated through a single step of gradient descent. During each iteration, the model performs a forward pass on the batch of examples, computes the loss function, calculates gradients via backpropagation, and updates the weights. The batch size determines how many examples contribute to each gradient estimate.
Batch size interacts closely with the learning rate, training speed, GPU memory usage, and generalization. Selecting an appropriate batch size requires balancing computational efficiency against model quality, and it remains one of the most consequential decisions in deep learning experiment design.
The batch size defines the variant of gradient descent being used. The three canonical variants are:
| Variant | Batch Size | Description |
|---|---|---|
| Batch gradient descent | Entire dataset | Uses all training examples per update; exact gradient but slow and memory-intensive |
| Stochastic gradient descent (SGD) | 1 | Updates after a single example; fast but noisy |
| Mini-batch gradient descent | Typically 16 to 8,192 | Standard modern approach using subsets of the data |
In practice, the term "SGD" in deep learning almost always refers to mini-batch gradient descent with batch sizes between 32 and several thousand, not the literal single-example variant.
Research consistently demonstrates that smaller batches tend to produce models with better generalization to unseen data. Keskar et al. (2017) showed that large-batch training converges to "sharp minima" (narrow valleys in the loss landscape), while small-batch training finds "flat minima" (broad basins).[1] Flat minima generalize better because the loss remains stable under small perturbations to the weights.
The mechanism behind this effect is gradient noise. Small-batch gradient estimates are inherently noisy, and this noise acts as a form of implicit regularization that prevents the optimizer from settling into sharp valleys that would perform poorly on test data. Smith and Le (2018) formalized this by showing that the stochastic noise scale in SGD is approximately proportional to the ratio of learning rate to batch size.[2] This insight explains why the choice of batch size has such a significant impact on test performance.
However, the relationship is not absolute. With proper hyperparameter tuning and techniques like learning rate warmup, the generalization gap between small and large batches can often be closed.
Larger batches enable greater parallelism on modern hardware. GPUs and TPUs are designed to process large matrix operations efficiently, so a 4x increase in batch size may only raise iteration time by 1.5 to 2x, significantly improving data throughput.
However, larger batches mean fewer total parameter updates per epoch, reducing the number of times the optimizer can adjust the weights. This can slow convergence when measured in epochs, even though wall-clock time per epoch decreases.
| Aspect | Small Batch (32) | Large Batch (4,096) |
|---|---|---|
| GPU utilization | Often underutilizes hardware | Better hardware utilization |
| Time per iteration | Fast | Slower per step |
| Iterations per epoch | Many | Few |
| Gradient estimate | Noisy (high variance) | Accurate (low variance) |
| Generalization | Often better | Can degrade without tuning |
| Memory usage | Low | High |
Batch size directly affects GPU memory consumption. Each example in a batch requires memory for input data, intermediate activations (stored for backpropagation), and computed gradients. Activation memory scales linearly with batch size and, for transformer architectures, activations typically dominate memory requirements.
The maximum batch size that fits in GPU memory depends on the model architecture, input dimensions, numerical precision (FP32 vs. FP16/BF16), and whether memory-saving techniques like gradient checkpointing are employed.
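One common way to find that maximum is to probe it empirically. The sketch below assumes a recent PyTorch (which exposes `torch.cuda.OutOfMemoryError`) and a `model` already moved to the GPU; the function name and starting size are illustrative.

```python
import torch

def max_fitting_batch_size(model, input_shape, device="cuda", start=8192):
    """Halve a candidate batch size until one forward/backward pass fits in memory.
    `model` (already on `device`) and `input_shape` are placeholders for your own
    network and per-example data shape."""
    batch_size = start
    while batch_size >= 1:
        try:
            model.zero_grad()
            x = torch.randn(batch_size, *input_shape, device=device)
            model(x).sum().backward()  # backward pass allocates activation/gradient memory
            return batch_size
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()   # release the failed allocation before retrying
            batch_size //= 2
    return 0
```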
Practitioners commonly select batch sizes that are powers of two (32, 64, 128, 256, 512, etc.). This convention originates from the way GPU memory and compute units are organized. GPU warp sizes, memory bus widths, and CUDA core groupings are all structured around powers of two, so tensor operations on arrays aligned to these sizes can be more efficient.
NVIDIA's cuDNN library, which underlies most deep learning frameworks, optimizes kernels for tensor dimensions that are multiples of 8 or 16.[3] Using powers of two naturally satisfies these alignment requirements.
That said, empirical benchmarks by Raschka (2022) and Weights & Biases have shown that the difference in throughput between, for example, a batch size of 128 and 127 is negligible on modern hardware.[3] Non-power-of-two sizes like 48, 96, or 384 work perfectly well in practice. The convention persists primarily out of habit and convenience rather than strict necessity.
Mini-batch gradients are stochastic estimates of the true gradient computed over the full dataset. The difference between the mini-batch gradient and the true gradient is referred to as "gradient noise." Its standard deviation scales inversely with the square root of the batch size: a batch size of 1 produces highly noisy estimates, while using the full dataset yields the exact gradient with zero noise.
Gradient noise has dual effects:
- Beneficial: as discussed above, it acts as implicit regularization, steering the optimizer away from sharp minima and improving generalization.
- Detrimental: it makes individual updates less reliable, which can slow convergence and force the use of smaller learning rates.
McCandlish et al. (2018) formalized the concept of the gradient noise scale as a measurable quantity that characterizes the signal-to-noise ratio of gradient estimates.[4] The noise scale is defined as:
B_noise = tr(Sigma) / |G|^2
where tr(Sigma) is the trace of the covariance matrix of per-example gradients and |G| is the norm of the true gradient. This provides a natural reference point: when the batch size is below B_noise, training is noise-dominated and each additional example provides significant information; when the batch size is above B_noise, training is compute-dominated and additional examples provide diminishing returns.
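The sketch below instantiates this formula on a toy linear-regression problem, where per-example gradients have a closed form; the dataset, dimensions, and noise level are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 20
X = rng.normal(size=(n, d))
true_w = rng.normal(size=d)
y = X @ true_w + rng.normal(scale=0.5, size=n)     # noisy labels

w = np.zeros(d)                                    # current parameters
per_example_grads = (X @ w - y)[:, None] * X       # gradient of 0.5*(x.w - y)^2 per example

G = per_example_grads.mean(axis=0)                 # true (full-dataset) gradient
trace_sigma = per_example_grads.var(axis=0).sum()  # tr(Sigma): summed per-coordinate variance
B_noise = trace_sigma / (G @ G)                    # the simple noise scale defined above

print(f"estimated gradient noise scale: {B_noise:.1f}")
```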
The gradient noise scale is not fixed during training. At initialization, it is typically small. As training progresses toward convergence, the true gradient shrinks while per-example variation persists, causing the noise scale to increase.
McCandlish et al. (2018) introduced the concept of the critical batch size, the batch size that represents an optimal tradeoff between data parallelism and compute efficiency.[4] Below the critical batch size, doubling the batch size nearly halves the number of optimization steps needed to reach a target loss. Above it, doubling the batch yields diminishing reductions in step count, effectively wasting compute.
The critical batch size relates directly to the gradient noise scale: B_crit is approximately equal to B_noise. When B is much less than B_crit, noise dominates and averaging more examples significantly improves the signal-to-noise ratio. When B is much greater than B_crit, the gradient estimate is already accurate, and additional examples add minimal new information.
Kaplan et al. (2020) discovered a scaling law relating critical batch size to the training loss in language modeling:[5]
B_crit(L) = 2.0 x 10^8 * L^(-4.76)
This means that as models improve (loss decreases), the critical batch size grows, justifying the extremely large batch sizes used in modern large language model training.
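Plugging losses into the published fit makes the effect concrete (loss measured in nats; the specific loss values below are arbitrary illustrations):

```python
def critical_batch_size(loss: float) -> float:
    """Kaplan et al. (2020) fit: B_crit(L) = 2.0e8 * L^(-4.76), in tokens."""
    return 2.0e8 * loss ** -4.76

# Halving the loss from 4.0 to 2.0 grows the fitted B_crit by 2^4.76, about 27x:
for L in (4.0, 3.0, 2.0):
    print(f"L = {L:.1f}  ->  B_crit ~ {critical_batch_size(L):,.0f} tokens")
```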
Recent empirical work (2025) from Allen AI on language model pre-training has refined the critical batch size framework with several notable findings:[6]
| Training Phase | Critical Batch Size Behavior | Practical Implication |
|---|---|---|
| Initialization | Very low | Small batches are efficient |
| Early (0 to 20% of steps) | Rapidly increasing | Gradually increase batch size |
| Mid to late training | Plateaus at a high value | Use the full target batch size |
The relationship between batch size and learning rate is one of the most extensively studied topics in deep learning optimization.
Goyal et al. (2017) proposed the linear scaling rule: when the batch size is multiplied by k, multiply the learning rate by k as well.[7] The intuition is that k-times larger batches average k times as many examples, so the learning rate must scale proportionally to maintain the same effective step size relative to the noise.
Using this rule, Goyal et al. trained ResNet-50 on ImageNet in one hour with batches of 8,192 images across 256 GPUs, achieving 76.3% validation accuracy that matched small-batch baselines. A key addition was a gradual warmup scheme: starting with a small learning rate and linearly increasing it to the target over the first few epochs, preventing the large learning rate from destabilizing the randomly initialized model.
However, the linear scaling rule tends to break down for very large batch sizes, where generalization performance can degrade despite matching the prescribed learning rate.
An alternative is the square root scaling rule: when the batch size is multiplied by k, multiply the learning rate by sqrt(k). This rule has theoretical justification for maintaining a constant gradient noise level across different batch sizes and tends to work better with adaptive optimizers like Adam in some settings.
Some practitioners find that Adam and AdamW are less sensitive to batch size changes and require no learning rate adjustment, though this is task-dependent.
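A minimal sketch of these rules follows. The reference values (learning rate 0.1 at batch size 256, target batch 8,192) echo the Goyal et al. ResNet-50 recipe; the warmup length is an arbitrary placeholder.

```python
def scaled_lr(base_lr, base_batch, batch, rule="linear"):
    """Scale a reference learning rate for a new batch size using either the
    linear rule (Goyal et al.) or the square-root rule."""
    k = batch / base_batch
    return base_lr * (k if rule == "linear" else k ** 0.5)

def warmup_lr(target_lr, step, warmup_steps):
    """Goyal-style gradual warmup: ramp linearly from near zero to the target."""
    return target_lr * min(1.0, (step + 1) / warmup_steps)

target = scaled_lr(0.1, 256, 8192)   # linear rule: 0.1 * 32 = 3.2
lrs = [warmup_lr(target, s, warmup_steps=500) for s in range(2000)]
```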
| Scaling Rule | Formula | Best Suited For |
|---|---|---|
| Linear (Goyal et al.) | LR x k when batch x k | SGD on vision tasks |
| Square root | LR x sqrt(k) when batch x k | Adaptive optimizers, some LLM settings |
| No scaling | LR unchanged | Some Adam/AdamW setups |
Smith et al. (2018) proposed an alternative to the standard practice of decaying the learning rate during training: keep the learning rate fixed and increase the batch size instead.[2] The two approaches have approximately the same effect on the gradient signal-to-noise ratio, but increasing the batch size enables greater data parallelism on multi-GPU systems, resulting in faster wall-clock training. The technique was validated with SGD, SGD with momentum, Nesterov momentum, and Adam, reaching equivalent test accuracies after the same number of training epochs but with fewer parameter updates.
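The sketch below illustrates the swap: at each milestone where a step schedule would have divided the learning rate by some factor, the batch size is multiplied by that factor instead. The milestone epochs and factor are illustrative placeholders, not values from the paper.

```python
def schedule(epoch, base_batch=256, base_lr=0.1, milestones=(30, 60, 80), factor=4):
    """Where a step decay would divide the LR by `factor` at each milestone,
    multiply the batch size instead and leave the LR fixed."""
    n_passed = sum(epoch >= m for m in milestones)
    return base_batch * factor ** n_passed, base_lr

for epoch in (0, 30, 60, 80):
    batch, lr = schedule(epoch)
    print(f"epoch {epoch:>2}: batch size {batch:>6}, learning rate {lr}")
```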
Large-batch training enables greater data parallelism across GPUs and TPUs, reducing total wall-clock training time. However, naively increasing the batch size often degrades model quality. Several techniques have been developed to overcome this challenge.
You et al. (2017) introduced LARS, which applies different learning rates to each layer of the network based on the ratio of the layer's weight norm to its gradient norm.[8] This layer-wise adaptation stabilizes training at very large batch sizes. Using LARS, ResNet-50 was trained on ImageNet with batch sizes of 32,768 in just 14 minutes.
You et al. (2020) developed LAMB to extend the layer-wise adaptive approach to Adam-based optimizers.[9] LARS performed poorly on attention-based models like BERT, motivating LAMB's development. LAMB adds layer-wise normalization to Adam's update rule and enabled BERT pre-training with batch sizes up to 65,536, reducing training time from 3 days to 76 minutes on a TPUv3 Pod.
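The core of LARS is a per-layer trust ratio. The sketch below shows a simplified single-layer update, omitting momentum and the customary exclusion of biases and normalization parameters; the coefficient values are illustrative.

```python
import torch

def lars_step(param, grad, lr, trust_coef=0.001, weight_decay=1e-4, eps=1e-9):
    """One simplified LARS-style update for a single layer's weight tensor."""
    w_norm = param.norm()
    g_norm = grad.norm()
    # Layer-wise trust ratio: shrinks the step when gradients are large
    # relative to the weights, stabilizing large-batch training.
    local_lr = trust_coef * w_norm / (g_norm + weight_decay * w_norm + eps)
    param.data.add_(grad + weight_decay * param.data, alpha=-lr * local_lr.item())
```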
| Research | Year | Batch Size | Key Technique | Result |
|---|---|---|---|---|
| Goyal et al. | 2017 | 8,192 | Linear LR scaling + warmup | ImageNet in 1 hour |
| You et al. (LARS) | 2017 | 32,768 | Layer-wise adaptive learning rates | ResNet-50 in 14 minutes |
| You et al. (LAMB) | 2020 | 65,536 | Layer-wise adaptive Adam | BERT in 76 minutes |
| Smith et al. | 2018 | Varies | Increase batch instead of decaying LR | Improved parallelism, same accuracy |
Despite the extensive research on large-batch scaling, a 2025 NeurIPS paper revisited small-batch language model training and found that batch sizes as small as 1 can train stably with vanilla SGD.[10] Small batches showed consistent robustness to hyperparameter choices. The authors argued that for many practical settings, gradient accumulation may waste compute that would be better spent on frequent, smaller updates. This challenges the conventional wisdom that large batch sizes are necessary for efficient training.
Modern large language model training uses token-based batch sizes rather than example counts, since sequence lengths can vary. The effective token batch size equals the number of sequences per batch multiplied by the sequence length.
| Model | Year | Batch Size (tokens) | Notes |
|---|---|---|---|
| GPT-2 | 2019 | ~512K | 512 sequences of 1,024 tokens |
| GPT-3 (175B) | 2020 | ~3.2M | Ramped during training |
| Chinchilla (70B) | 2022 | ~1.5M | Compute-optimal training |
| LLaMA (65B) | 2023 | ~4M | 2,048 sequences of 2,048 tokens |
| LLaMA 2 (70B) | 2023 | ~4M | Ramped from 512K to 4M tokens |
| LLaMA 3 (405B) | 2024 | ~16M | Largest documented token batch |
| Mistral (7B) | 2023 | ~2M | Sliding window attention |
| DeepSeek-V2 (236B) | 2024 | ~9.4M | Mixture of experts architecture |
Batch size warmup is standard practice in LLM pre-training: starting with smaller batches early in training (when the critical batch size is low) and increasing as training progresses. LLaMA 2 ramped from 512K to 4M tokens, and GPT-3 similarly increased its batch size during training. This approach aligns training efficiency with the evolving critical batch size.
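A simple ramp schedule might look like the sketch below. The 512K and 4M endpoints mirror the LLaMA 2 figures quoted above, but the linear ramp shape and the 20% ramp fraction are assumptions for illustration.

```python
def token_batch_size(step, total_steps, start=524_288, target=4_194_304, ramp_frac=0.2):
    """Ramp the token batch size linearly over the first ~20% of training,
    then hold it at the target for the remainder."""
    ramp_steps = max(1, int(total_steps * ramp_frac))
    if step >= ramp_steps:
        return target
    return int(start + (step / ramp_steps) * (target - start))
```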
When the desired batch size exceeds available GPU memory, gradient accumulation (also called micro-batching) provides a solution. The training loop proceeds as follows:
1. Run a forward pass and compute the loss on a micro-batch that fits in memory.
2. Backpropagate, adding the resulting gradients to an accumulation buffer rather than updating the weights.
3. Repeat for the desired number of micro-batches.
4. Apply a single optimizer step using the accumulated gradients, then reset them.
Provided each micro-batch loss is scaled by the number of accumulation steps, the result is mathematically equivalent to training with the full effective batch size (up to minor floating-point differences). The effective batch size equals the micro-batch size multiplied by the number of accumulation steps.
Example: If the target batch size is 1,024 but only 256 examples fit in GPU memory, the training loop processes 4 micro-batches of 256 before performing one parameter update. The model sees 1,024 examples per update, just as it would with a true batch size of 1,024.
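In code, this amounts to scaling each micro-batch loss and deferring the optimizer step, as in the PyTorch sketch below (the model, data, and learning rate are placeholders):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Illustrative setup: micro-batches of 256, accumulated 4x for an effective 1,024.
model = nn.Linear(20, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()
loader = DataLoader(TensorDataset(torch.randn(4096, 20), torch.randn(4096, 1)),
                    batch_size=256, shuffle=True)
ACCUM_STEPS = 4

optimizer.zero_grad()
for i, (xb, yb) in enumerate(loader):
    loss = loss_fn(model(xb), yb) / ACCUM_STEPS   # scale so summed grads average over 1,024
    loss.backward()                               # gradients add up in the .grad buffers
    if (i + 1) % ACCUM_STEPS == 0:
        optimizer.step()                          # one parameter update per effective batch
        optimizer.zero_grad()
```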
| Configuration | Micro-batch Size | Accumulation Steps | Effective Batch Size |
|---|---|---|---|
| No accumulation | 256 | 1 | 256 |
| 4x accumulation | 256 | 4 | 1,024 |
| 16x accumulation | 64 | 16 | 1,024 |
| 4-GPU + 2x accumulation | 256 per GPU | 2 | 2,048 |
Gradient accumulation combines naturally with other memory-saving techniques:
- Gradient checkpointing, which recomputes activations during the backward pass instead of storing them.
- Mixed-precision training (FP16/BF16), which roughly halves activation and gradient memory.
- Multi-GPU data parallelism, in which each device accumulates gradients over its own micro-batches (as in the last row of the table above).
One important caveat is that gradient accumulation interacts poorly with standard batch normalization layers, which compute statistics on each micro-batch independently. Those statistics reflect small, noisy micro-batches rather than the full effective batch, which can destabilize training. Workarounds include synchronizing batch normalization statistics across devices in data-parallel setups or replacing batch normalization with layer normalization or group normalization.
Different machine learning tasks tend to use different batch size ranges depending on model size, data characteristics, and hardware constraints.
| Task | Typical Batch Size | Notes |
|---|---|---|
| Image classification | 32 to 256 | Standard for CNNs on single GPUs |
| Object detection | 2 to 16 | High-resolution images consume more memory |
| NLP fine-tuning (BERT) | 16 to 32 | Common for sequence classification tasks |
| LLM pre-training | 512K to 16M tokens | Measured in tokens, ramped during training |
| Reinforcement learning | 32 to 2,048 | Varies widely by algorithm and environment |
| GANs | 16 to 128 | Smaller batches help stabilize adversarial training |
| Diffusion models | 64 to 2,048 | Larger batches improve sample quality |
| Speech recognition | 16 to 64 | Variable-length audio limits batch size |
Imagine you are studying for a test using flashcards. You could look at one flashcard, check the answer, and then move on to the next (that would be a batch size of 1). Or you could look at 10 flashcards, think about all the answers at once, and then check how you did (that would be a batch size of 10). Looking at more cards at once gives you a better overall picture of what you know and what you do not know. But it also means you have to hold more cards in your hands at the same time, and you wait longer before checking your answers. Batch size is just how many flashcards the computer looks at before it checks its answers and tries to get better.