See also: machine learning terms, batch size, gradient descent
In machine learning, a batch is the set of training examples used in one iteration of model parameter updates. The batch size determines how many examples are included in each batch. Rather than processing an entire dataset at once or one sample at a time, practitioners divide data into batches to balance computational efficiency, memory usage, and optimization quality.
Batches play a central role in virtually every stage of the machine learning pipeline, from gradient descent optimization during training to batch normalization layers within network architectures, and even to batched request handling during inference. Understanding how batches work, and how their size affects training dynamics, is one of the most practical concerns in applied deep learning.
Imagine you are a teacher grading a stack of 1,000 homework papers. You could try to read all 1,000 papers before deciding what to teach differently tomorrow, but that would take forever. Or, you could look at just one paper at a time, but each single paper might not tell you much about what the whole class needs. A good middle ground is to grab a small pile of papers (say 32), read through them, notice a pattern ("most students got question 5 wrong"), and adjust your lesson plan. Then you grab the next pile of 32 and keep refining. Each small pile is a batch. The number of papers in the pile is the batch size. Once you have gone through all 1,000 papers, you have finished one epoch.
In machine learning, the "papers" are data samples, "reading them" is computing how wrong the model's predictions are (the loss), and "adjusting the lesson plan" is updating the model's parameters. Working in batches lets the model learn efficiently without needing to see the entire dataset before making any improvement.
The concept of a batch is inseparable from the optimization algorithm used to train a model. In gradient descent, the model computes a loss function over some portion of the data, calculates gradients of that loss with respect to the model parameters, and then updates the parameters to reduce the loss. The portion of data used for each such computation is the batch.
There are three main variants of gradient descent, distinguished by how much data goes into each batch.
In full-batch (or simply "batch") gradient descent, the entire training dataset is used to compute the gradient at each step. This produces the most accurate estimate of the true gradient, since it averages over all examples. However, for large datasets, this approach is often impractical because it requires loading the entire dataset into memory at once and performing a full forward and backward pass before making even a single parameter update.
Full-batch gradient descent follows a smooth, deterministic path toward a minimum of the loss surface. While this sounds desirable, the lack of noise in the gradient estimate means that training can get stuck in sharp local minima or saddle points, which may hurt generalization performance on unseen data.
Stochastic gradient descent (SGD) sits at the opposite extreme. Each parameter update is computed from a single training example (a batch size of 1). The resulting gradient estimate is very noisy because it reflects only one data point rather than the full distribution. This noise, however, can be beneficial: it helps the optimizer escape shallow local minima and saddle points, often leading to solutions that generalize better.
The downside is that SGD makes very slow progress in terms of wall-clock time. Each update processes only one example, and modern hardware (particularly GPUs) is designed for parallel computation over large arrays. Processing one example at a time leaves most of the GPU's compute capacity idle.
Mini-batch gradient descent is the practical middle ground and the default approach in modern deep learning. Each update uses a small random subset (the mini-batch) of the training data, typically between 16 and 8,192 examples. This offers several advantages:
- Gradient estimates are far less noisy than single-sample SGD, yet retain enough noise to provide a mild regularizing effect.
- Batches are large enough to exploit the parallelism of GPUs and other accelerators.
- Memory requirements stay moderate, since only one batch's worth of activations must be held at a time.
- Updates happen far more often than with full-batch gradient descent, so the model improves after seeing only a fraction of the data.
In practice, when researchers and practitioners say "SGD" they almost always mean mini-batch SGD, not the single-sample variant.
| Variant | Batch size | Gradient noise | GPU utilization | Memory cost | Typical use |
|---|---|---|---|---|---|
| Full-batch | Entire dataset | None | High (if dataset fits) | Very high | Small datasets, convex problems |
| Stochastic (SGD) | 1 | Very high | Very low | Minimal | Rare in practice today |
| Mini-batch | 16 to 8,192+ | Moderate | High | Moderate | Standard for neural network training |
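The difference between the three variants is easiest to see in code. The sketch below uses a toy NumPy linear regression; the data, learning rate, and epoch count are arbitrary illustrative choices, but the loop structure (shuffle, slice a batch, update) is the same one used in real training.

```python
import numpy as np

# Toy regression data: y = 3x + noise (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 1))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=1000)

def gradient(w, X_batch, y_batch):
    """Mean-squared-error gradient for a single weight, averaged over the batch."""
    preds = w * X_batch[:, 0]
    return np.mean(2.0 * (preds - y_batch) * X_batch[:, 0])

def train(batch_size, lr=0.05, epochs=5):
    w, n = 0.0, len(X)
    for _ in range(epochs):
        perm = rng.permutation(n)                      # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]       # one batch
            w -= lr * gradient(w, X[idx], y[idx])      # one iteration = one update
    return w

print(train(batch_size=len(X)))   # full-batch: only 5 updates in total
print(train(batch_size=1))        # stochastic: 5,000 noisy updates
print(train(batch_size=32))       # mini-batch: ~160 moderately noisy updates
```

In this toy setup, the full-batch run has made only five (accurate) updates and is still far from the true slope, while the mini-batch run has made roughly 160 and is essentially converged, a small-scale illustration of why update frequency matters as much as gradient quality.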
Three closely related terms describe the structure of a training loop: epoch, iteration (also called a training step), and batch. Their relationship can be expressed with a simple formula.
An epoch is one complete pass through the entire training dataset, where every sample has been seen exactly once. An iteration is one parameter update, which consumes one batch of data. The number of iterations per epoch is therefore:
Iterations per epoch = ceil(Total training samples / Batch size)
If a dataset contains 10,000 samples and the batch size is 256, one epoch consists of ceil(10,000 / 256) = 40 iterations (the last batch may be smaller if the dataset size is not evenly divisible by the batch size). Over 50 epochs, the model performs 40 x 50 = 2,000 total parameter updates.
| Term | Definition | Formula |
|---|---|---|
| Batch size | Number of samples in one batch | Chosen by the practitioner |
| Iteration | One forward pass, backward pass, and parameter update on one batch | Iterations per epoch = ceil(N / batch size) |
| Epoch | One full pass through all N training samples | Total iterations = iterations per epoch x epochs |
This relationship has practical consequences. Doubling the batch size halves the number of iterations per epoch and therefore halves the number of gradient updates. If no other hyperparameters are changed, the model sees the same total data but performs half as many updates, each using a more accurate gradient estimate. Whether this speeds up or slows down convergence depends on the learning rate and the characteristics of the loss landscape.
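The arithmetic itself is trivial but worth making explicit; the snippet below reproduces the worked example above and shows how doubling the batch size roughly halves the number of updates.

```python
import math

n_samples, epochs = 10_000, 50

for batch_size in (128, 256, 512):
    iters_per_epoch = math.ceil(n_samples / batch_size)
    print(batch_size, iters_per_epoch, iters_per_epoch * epochs)
# 128 -> 79 iterations/epoch, 3,950 total updates
# 256 -> 40 iterations/epoch, 2,000 total updates
# 512 -> 20 iterations/epoch, 1,000 total updates
```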
Batch size is one of the most important hyperparameters in deep learning. It affects convergence speed, generalization ability, memory consumption, and hardware utilization. The interactions between these factors are nuanced and have been the subject of extensive research.
Larger batches produce lower-variance gradient estimates, which means each parameter update points more reliably toward the direction that reduces the loss. This allows the optimizer to take larger steps without overshooting, and in principle, training progresses faster per update. However, the number of updates per epoch decreases as batch size increases (since each update consumes more data), so the relationship between batch size and wall-clock convergence is not straightforward.
Smaller batches, by contrast, produce noisier gradients. Each update is less reliable on its own, but more updates happen per epoch. The noise acts as an implicit form of regularization, preventing the model from fitting too precisely to the training data.
One of the most cited findings in the batch size literature comes from Keskar et al. (2017), who observed that large-batch training tends to converge to "sharp" minima of the loss surface, while small-batch training tends to find "flat" minima. A sharp minimizer sits in a narrow, steep valley, meaning that even slight perturbations to the parameters cause a large increase in loss. A flat minimizer sits in a broad, gently sloped region where small parameter changes have little effect on the loss.
The practical consequence is that models trained with large batches often achieve low training loss but higher test loss compared to models trained with small batches. This phenomenon is known as the generalization gap. The noise inherent in small-batch gradient estimates is believed to help the optimizer avoid sharp minima and settle in flatter regions that transfer better to held-out data.
That said, subsequent research has shown that the generalization gap can be narrowed or eliminated with proper learning rate tuning, warmup schedules, and other techniques. The relationship between batch size and generalization is not an immutable law; it depends on the optimizer, the learning rate schedule, and the training duration.
Batch size and learning rate are tightly coupled. When the batch size increases, the variance of the gradient estimate decreases, and each step becomes more deterministic. To maintain a similar training trajectory, the learning rate typically needs to increase as well.
The linear scaling rule, popularized by Goyal et al. (2017) in their work on training ResNet-50 on ImageNet, states: when the batch size is multiplied by a factor of k, the learning rate should also be multiplied by k. Using this rule, along with a gradual warmup period in the first few epochs, they trained ResNet-50 with a batch size of 8,192 across 256 GPUs and matched the accuracy of smaller-batch baselines, completing training in about one hour.
The linear scaling rule has practical limits. For very large batch sizes (beyond approximately 8,192), it begins to break down because gradient noise reduction does not scale linearly with batch size: doubling the batch size reduces the gradient standard deviation by a factor of sqrt(2), not by a factor of 2. This observation motivates the square root scaling rule, where the learning rate is multiplied by sqrt(k) instead of k, which is sometimes more stable for very large batches.
Smith et al. (2018) took this idea further in "Don't Decay the Learning Rate, Increase the Batch Size." They showed that the common practice of decaying the learning rate during training is mathematically equivalent to increasing the batch size while keeping the learning rate fixed. This insight provides an alternative path to efficient large-batch training: instead of reducing the learning rate at scheduled epochs, one can increase the batch size, which improves parallelism and reduces the total number of parameter updates needed.
| Scaling approach | Key idea | Reference |
|---|---|---|
| Linear scaling rule | Multiply learning rate by k when batch size is multiplied by k; use warmup | Goyal et al. (2017) |
| Batch size increase schedule | Increase batch size during training instead of decaying learning rate | Smith et al. (2018) |
| Square root scaling | Scale learning rate by sqrt(k) for batch size increase of k | Hoffer et al. (2017) |
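As a concrete illustration of the two scaling rules, the snippet below starts from a hypothetical baseline (batch size 256 at learning rate 0.1, values chosen only for the example) and computes the scaled learning rate under each rule.

```python
base_batch, base_lr = 256, 0.1        # hypothetical baseline configuration

def scaled_lr(new_batch, rule="linear"):
    k = new_batch / base_batch        # batch-size scaling factor
    return base_lr * (k if rule == "linear" else k ** 0.5)

for b in (512, 1024, 8192):
    print(b, round(scaled_lr(b, "linear"), 4), round(scaled_lr(b, "sqrt"), 4))
# 512  -> 0.2 (linear)  vs 0.1414 (sqrt)
# 1024 -> 0.4 (linear)  vs 0.2    (sqrt)
# 8192 -> 3.2 (linear)  vs 0.5657 (sqrt)
```

At batch size 8,192 the linear rule recovers the 3.2 learning rate used by Goyal et al. (2017), while the square-root rule is considerably more conservative.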
The noise in mini-batch gradient estimates is not merely a nuisance; it has a well-defined statistical structure that plays a central role in optimization and generalization.
When computing the gradient over a mini-batch of size B drawn from a dataset of size N, the mini-batch gradient is an unbiased estimator of the full-batch gradient. Its variance is approximately:
Var(mini-batch gradient) = (1 - B/N) x (sigma^2 / B)
where sigma^2 is the per-sample gradient variance. The factor (1 - B/N) is a finite-population correction that becomes negligible when B is much smaller than N, which is the typical case. In practice, the variance scales roughly as sigma^2 / B, meaning that doubling the batch size halves the gradient variance.
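This 1/B behavior is easy to verify numerically. The simulation below draws synthetic per-sample gradients with a known variance (the numbers are arbitrary) and measures how the variance of the batch-averaged gradient shrinks as B grows.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic scalar per-sample gradients with sigma = 2 (so sigma^2 = 4)
per_sample_grads = rng.normal(loc=1.0, scale=2.0, size=100_000)

for B in (1, 16, 64, 256):
    # Sample many mini-batches and measure the variance of the batch-mean gradient
    batches = rng.choice(per_sample_grads, size=(10_000, B))
    print(B, round(batches.mean(axis=1).var(), 4))
# Shrinks roughly as sigma^2 / B: ~4.0, ~0.25, ~0.0625, ~0.0156
```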
McCandlish et al. (2018) from OpenAI formalized this relationship through the gradient noise scale, defined as the ratio of the gradient noise to the gradient signal. The gradient noise scale determines a critical batch size: below this threshold, doubling the batch size roughly halves training time (the noise is the bottleneck), while above it, further increases yield diminishing returns (the signal dominates). In their experiments, critical batch sizes ranged from around 20 for small autoencoders on SVHN to millions for Dota 2 reinforcement learning agents.
The gradient noise scale typically grows during training as the model approaches a minimum and the gradient signal shrinks relative to the noise. This observation supports the practice of increasing the batch size over the course of training, as used in GPT-3 pretraining, where the batch size was progressively raised based on gradient noise scale measurements.
Modern GPUs and accelerators are designed for massively parallel computation. A neural network's forward and backward passes consist primarily of matrix multiplications, and these operations are most efficient when the matrices are large. The batch dimension is one of the dimensions of these matrices, so increasing the batch size directly increases the GPU's arithmetic throughput, up to a point.
The total GPU memory consumed during training comes from several sources:
| Component | Description | Scales with batch size? |
|---|---|---|
| Model parameters | Weights and biases of the network | No |
| Optimizer state | Momentum buffers, second-moment estimates (for Adam) | No |
| Activations | Intermediate outputs stored for backpropagation | Yes (linearly) |
| Gradients | Parameter gradients are fixed-size; temporary activation gradients in the backward pass grow with the batch | Partially |
| Input data | The batch itself | Yes (linearly) |
Activations are typically the largest memory consumer for deep networks, because every layer's output must be saved for use during the backward pass. Since activation memory grows linearly with batch size, the maximum batch size is often constrained by the available GPU VRAM. Exceeding this limit triggers an out-of-memory error.
To maximize GPU utilization, practitioners generally want the largest batch size that fits in memory, though generalization considerations may argue for a smaller batch. A common workflow is:
- Start from a moderate batch size (often a power of two such as 32 or 64).
- Increase it, typically by doubling, until training hits an out-of-memory error (a probing loop of this kind is sketched below).
- Back off to the largest size that fits, leaving some headroom for memory fragmentation.
- Re-tune the learning rate for the chosen batch size, using the scaling rules discussed above.
Using powers of two for batch sizes is a widespread convention. The rationale is that GPU memory is organized in pages whose sizes are powers of two, and NVIDIA's documentation recommends that matrix dimensions be multiples of 8 for optimal Tensor Core utilization. In practice, experiments by Raschka (2022) and others have found that non-power-of-two batch sizes often perform just as well, but the convention persists because it simplifies benchmarking and comparison.
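One way to automate the "increase until it breaks" step is to probe with dummy data, as in the hedged sketch below. The model, input width, and doubling policy are placeholders; note also that on older PyTorch releases a CUDA out-of-memory condition surfaces as a plain RuntimeError rather than torch.cuda.OutOfMemoryError.

```python
import torch
import torch.nn as nn

# Illustrative stand-in model and input width; substitute your own
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10)).cuda()

def fits_in_memory(batch_size):
    """Run one forward/backward pass on dummy data and report whether it fits."""
    try:
        x = torch.randn(batch_size, 1024, device="cuda")
        model(x).sum().backward()
        return True
    except torch.cuda.OutOfMemoryError:
        return False
    finally:
        model.zero_grad(set_to_none=True)
        torch.cuda.empty_cache()

batch_size = 32
while fits_in_memory(batch_size * 2):   # keep doubling until the next size fails
    batch_size *= 2
print("Largest power-of-two batch size that fits:", batch_size)
```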
When the desired effective batch size is too large to fit in GPU memory, gradient accumulation provides a workaround. Instead of processing the full batch in one pass, the training loop processes several smaller micro-batches sequentially, accumulating (summing) the gradients from each, and then performs a single parameter update after all micro-batches have been processed.
For example, suppose the target effective batch size is 1,024 but only 256 examples fit in memory. The training loop processes four micro-batches of 256, accumulates the gradients, and then updates the model once. From the optimizer's perspective, this is equivalent to a single batch of 1,024.
Gradient accumulation trades time for memory. Each micro-batch is processed sequentially rather than in parallel, so the total training time is longer than it would be if the full batch fit in memory. However, it allows practitioners to simulate large-batch training on hardware that would otherwise be insufficient.
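A minimal sketch of this pattern in PyTorch follows. It assumes a model, loss criterion, optimizer, and a loader yielding micro-batches of 256 have already been defined; the names are illustrative and match the DataLoader example later in this article.

```python
accumulation_steps = 4        # 4 micro-batches of 256 -> effective batch size of 1,024

optimizer.zero_grad()
for step, (batch_X, batch_y) in enumerate(loader):
    loss = criterion(model(batch_X), batch_y)
    # Gradients accumulate across micro-batches; dividing by the number of steps
    # keeps the result equal to the average over the full effective batch.
    (loss / accumulation_steps).backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()      # one parameter update per 4 micro-batches
        optimizer.zero_grad()
```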
There is one notable complication: standard batch normalization computes statistics (mean and variance) over the micro-batch, not the full effective batch. Since each micro-batch is smaller, these statistics are noisier, which can degrade training stability. Solutions include synchronized batch normalization across micro-batches or replacing batch normalization with alternatives like Group Normalization or Layer Normalization.
In data parallelism setups where multiple GPUs each process a portion of the batch, gradient accumulation is often combined with all-reduce communication patterns. Each GPU computes gradients on its local micro-batch, the gradients are summed across GPUs, and the combined gradient is used for the parameter update. This approach allows the effective batch size to scale with the number of GPUs.
Batch normalization (BN), introduced by Ioffe and Szegedy (2015), is one of the most widely used techniques in deep learning, and it depends directly on the batch. During training, BN normalizes the activations of each layer by subtracting the mean and dividing by the standard deviation computed over the current mini-batch, for each feature independently. Two learnable parameters (scale and shift) are then applied to allow the network to undo the normalization if that is optimal.
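The training-time computation is only a few lines. The sketch below reimplements it for a 2-D (batch, features) tensor; it is a simplified stand-in for torch.nn.BatchNorm1d and omits the running averages used at inference time, which are discussed below.

```python
import torch

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Normalize a (batch, features) tensor using statistics of the current mini-batch."""
    mean = x.mean(dim=0)                    # per-feature mean over the batch
    var = x.var(dim=0, unbiased=False)      # per-feature variance over the batch
    x_hat = (x - mean) / torch.sqrt(var + eps)
    return gamma * x_hat + beta             # learnable scale and shift

x = torch.randn(32, 8)                      # mini-batch of 32 samples, 8 features
out = batch_norm_train(x, gamma=torch.ones(8), beta=torch.zeros(8))
print(out.mean(dim=0), out.std(dim=0))      # each feature now has mean ~0, std ~1
```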
The original motivation was to address "internal covariate shift," the idea that the distribution of each layer's inputs changes as the preceding layers' parameters are updated. Later research has debated whether reducing internal covariate shift is actually the reason BN works. Santurkar et al. (2018) argued that BN's benefits come more from smoothing the loss landscape, making it easier for the optimizer to navigate.
Regardless of the underlying mechanism, the practical benefits are well established:
- Training tolerates higher learning rates without diverging.
- Convergence is faster, often requiring fewer epochs to reach a given accuracy.
- The network is less sensitive to weight initialization.
- The noise in the per-batch statistics acts as a mild regularizer.
Batch normalization's dependence on the batch creates certain limitations. During inference, there is no mini-batch to compute statistics from, so BN layers use running averages of mean and variance accumulated during training. When the training batch size is very small, the per-batch statistics become noisy and unreliable, degrading performance. This has motivated alternatives like Layer Normalization (which normalizes over features rather than over the batch) and Group Normalization (which normalizes over groups of channels), both of which are independent of batch size.
| Normalization method | Normalizes over | Depends on batch size? | Typical use case |
|---|---|---|---|
| Batch Normalization | Batch dimension | Yes | CNNs, large batch sizes |
| Layer Normalization | Feature dimension | No | Transformers, RNNs |
| Group Normalization | Groups of channels | No | Small batch sizes, detection |
| Instance Normalization | Single instance per channel | No | Style transfer |
Training with very large batches (tens of thousands of examples or more) can dramatically reduce training time by increasing parallelism across many GPUs. However, naively scaling up the batch size often degrades accuracy due to the generalization gap discussed earlier. Several specialized techniques have been developed to make large-batch training work.
LARS (Layer-wise Adaptive Rate Scaling), introduced by You et al. (2017), addresses the observation that different layers of a deep neural network may need very different learning rates. LARS computes a per-layer learning rate by looking at the ratio of the weight norm to the gradient norm for each layer, then scales the base learning rate accordingly. Using LARS, You et al. trained ResNet-50 on ImageNet with batch sizes up to 32,768 while maintaining accuracy.
However, LARS was designed primarily for networks trained with SGD with momentum. It performs poorly on attention-based architectures like BERT.
LAMB (Layer-wise Adaptive Moments optimizer for Batch training), introduced by You et al. (2020), extends the LARS idea to the Adam optimizer. LAMB computes the Adam update for each layer and then scales it by a per-layer trust ratio, which limits the relative change to any layer's weights. LAMB enabled training BERT with a batch size of 32,768 without accuracy loss, reducing training time from 3 days to 76 minutes on a TPUv3 Pod.
| Optimizer | Base method | Per-layer adaptation | Designed for | Key result |
|---|---|---|---|---|
| LARS | SGD + Momentum | Weight norm / gradient norm ratio | CNNs | ResNet-50 with batch size 32K on ImageNet |
| LAMB | Adam | Trust ratio on Adam update | Transformers, BERT | BERT training in 76 minutes |
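The core idea shared by both optimizers, scaling each layer's step by a trust ratio derived from its weight and gradient norms, can be sketched in a few lines. The version below is a deliberately simplified illustration of the LARS-style scaling only; it omits momentum, weight decay, and the clipping used in the published algorithms.

```python
import torch

def layerwise_scaled_step(params, base_lr=0.1, trust_coefficient=1e-3):
    """Apply an SGD step whose size is adapted per layer (simplified LARS-style scaling)."""
    for p in params:
        if p.grad is None:
            continue
        w_norm = p.detach().norm()
        g_norm = p.grad.detach().norm()
        # Trust ratio: layers with large weights and small gradients get a larger local step
        local_lr = trust_coefficient * w_norm / g_norm if w_norm > 0 and g_norm > 0 else 1.0
        p.data.add_(p.grad, alpha=-base_lr * float(local_lr))
```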
Large-batch training often requires a learning rate warmup period at the beginning of training. During the first few epochs, the learning rate is gradually increased from a small value to the target value. This prevents the optimizer from making overly large updates early in training when the model parameters are far from any reasonable solution and the loss landscape is poorly conditioned.
Goyal et al. (2017) used a linear warmup over the first 5 epochs when training with a batch size of 8,192 on ImageNet. This simple technique proved essential for making the linear scaling rule work. Warmup duration typically scales with the batch size scaling factor: if the batch size is increased by 8x relative to the baseline, a warmup of roughly 5 epochs is often sufficient, while larger scaling factors may need longer warmup.
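A warmup of this kind is straightforward to express with a standard scheduler. The sketch below uses PyTorch's LambdaLR with a stand-in model; the target learning rate of 3.2 corresponds to the linear scaling rule applied to a 0.1 baseline at batch size 8,192, and the inner training loop is elided.

```python
import torch

model = torch.nn.Linear(10, 10)                      # stand-in for a real network
warmup_epochs, target_lr, num_epochs = 5, 3.2, 90    # 3.2 = 0.1 x (8,192 / 256)

optimizer = torch.optim.SGD(model.parameters(), lr=target_lr, momentum=0.9)

# Ramp the learning rate linearly up to target_lr over the first 5 epochs, then
# hold it constant (a decay schedule would normally take over afterwards).
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda epoch: min(1.0, (epoch + 1) / warmup_epochs)
)

for epoch in range(num_epochs):
    # ... one epoch of mini-batch updates on `model` goes here ...
    scheduler.step()
```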
Modern deep learning frameworks provide built-in utilities to divide datasets into batches, shuffle them, and load them efficiently.
In PyTorch, the torch.utils.data.DataLoader class is the primary interface for batching. It wraps a Dataset object and yields batches of data during training. Key parameters include:
| Parameter | Description | Typical value |
|---|---|---|
| batch_size | Number of samples per batch | 32, 64, 128 |
| shuffle | Whether to randomize sample order each epoch | True for training, False for evaluation |
| num_workers | Number of subprocesses for parallel data loading | 2 to 8 |
| drop_last | Whether to drop the last incomplete batch | True when batch normalization is used |
| pin_memory | Whether to use pinned (page-locked) memory for faster GPU transfer | True when using CUDA |
A typical usage pattern looks like:
```python
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(X_train, y_train)      # X_train, y_train: pre-loaded tensors
loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=4)

for epoch in range(num_epochs):
    for batch_X, batch_y in loader:
        ...  # forward pass, loss computation, backward pass, optimizer step
```
The DataLoader handles shuffling the dataset at the start of each epoch, dividing it into batches, and optionally loading batches in parallel using worker processes so that data preparation overlaps with GPU computation.
In TensorFlow, the tf.data.Dataset API provides similar functionality. The .batch() method groups consecutive elements into batches, and .prefetch() overlaps data preprocessing with model execution for higher throughput:
```python
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices((X_train, y_train))
dataset = dataset.shuffle(buffer_size=10000)    # shuffle within a 10,000-element buffer
dataset = dataset.batch(64)                     # group consecutive elements into batches of 64
dataset = dataset.prefetch(tf.data.AUTOTUNE)    # overlap preprocessing with model execution
```
The prefetch transformation is particularly important for performance. While the model is executing training step s, the input pipeline is already reading and preparing the data for step s+1. Setting the buffer size to tf.data.AUTOTUNE lets the runtime dynamically tune prefetch depth based on available resources.
Batching is not only relevant during training. At inference time, batching multiple requests together can significantly increase throughput by amortizing the cost of loading model weights from memory.
For most neural network architectures, the bottleneck during inference is not computation but memory bandwidth: the time it takes to move model weights from GPU memory (VRAM) to the GPU's compute cores. Once the weights for a given layer are loaded into the GPU's cache, processing one example or many examples through that layer costs roughly the same in terms of memory traffic. By batching N requests, the weight-loading cost is amortized over N predictions, dramatically increasing throughput.
The tradeoff is latency. A server that batches incoming requests must wait until a batch is full (or until a timeout expires) before running inference, which increases the time any individual request spends waiting. This creates a throughput-latency tension that is central to production ML serving.
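A toy version of the "batch until full or a timeout expires" policy is sketched below using a thread-safe queue. The queue, the batch-size cap, the latency budget, and the run_model callable are all illustrative placeholders rather than any particular serving framework's API.

```python
import queue
import time

request_queue = queue.Queue()     # incoming requests are placed here by handler threads
MAX_BATCH = 8                     # largest batch the server will form
MAX_WAIT_S = 0.01                 # latency budget for filling a batch

def serve_forever(run_model):
    """Collect requests until the batch is full or the timeout expires, then run them together."""
    while True:
        batch = [request_queue.get()]                  # block until at least one request arrives
        deadline = time.monotonic() + MAX_WAIT_S
        while len(batch) < MAX_BATCH:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(request_queue.get(timeout=remaining))
            except queue.Empty:
                break
        run_model(batch)                               # one batched forward pass
```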
Batch inference, also called offline inference, refers to generating predictions on a large set of data all at once rather than responding to individual requests in real time. This approach is common when immediate responses are not required. For example, a recommendation system might score all products for all users overnight, store the results in a database, and serve pre-computed recommendations during the day.
Batch inference typically runs on a recurring schedule (hourly, daily, or weekly) and can take advantage of distributed computing frameworks like Apache Spark or cloud services like AWS Batch and Google Cloud Vertex AI Batch Predictions. Because latency is not a concern, batch inference jobs can use larger batch sizes, optimize for throughput, and run on cheaper preemptible or spot instances.
| Inference mode | Latency requirement | Throughput | Cost efficiency | Use case |
|---|---|---|---|---|
| Real-time (online) | Milliseconds | Lower | Higher per prediction | Chatbots, search ranking |
| Batch (offline) | Hours to days | Very high | Lower per prediction | Recommendation scoring, risk analysis |
For large language models (LLMs) and other autoregressive models, batching strategies have evolved considerably, from static batching, through dynamic batching, to continuous batching (introduced by Orca, Yu et al., 2022), which schedules requests at the granularity of individual generation steps. The table below summarizes the three.
vLLM (Kwon et al., 2023) combined continuous batching with PagedAttention, a memory management technique that reduces GPU memory fragmentation for the key-value cache. Together, these optimizations have achieved up to 23x throughput improvements over naive batched inference for LLMs. The continuous batching paradigm has since been adopted by most major inference frameworks, including TensorRT-LLM (which calls it "in-flight batching"), Hugging Face TGI, and SGLang.
| Batching strategy | How it works | Pros | Cons |
|---|---|---|---|
| Static | Fixed batch, return when all done | Simple to implement | Wasted GPU cycles on completed sequences |
| Dynamic | Batch until full or timeout | Better latency than static | Still waits for slowest request in batch |
| Continuous | Insert/remove requests at each iteration | Near-optimal GPU utilization | More complex scheduling logic |
Mixed precision training uses 16-bit floating-point (FP16 or BF16) for most computations while keeping a master copy of the weights in 32-bit (FP32) for numerical stability. Because 16-bit values take half the memory of 32-bit values, mixed precision effectively doubles the amount of activation memory available, allowing the batch size to be roughly doubled for the same GPU memory budget.
This technique was formalized by Micikevicius et al. (2018) and has become standard practice for training large models. NVIDIA's Tensor Cores are designed to accelerate FP16 matrix multiplications, so mixed precision training also improves computational throughput independent of the batch size benefit.
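In PyTorch, automatic mixed precision is typically enabled with an autocast context and a gradient scaler, as in the sketch below; the model, criterion, optimizer, and loader are assumed to be defined as in the earlier DataLoader example.

```python
import torch

scaler = torch.cuda.amp.GradScaler()          # rescales the loss to avoid FP16 underflow

for batch_X, batch_y in loader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():           # forward pass runs in reduced precision where safe
        loss = criterion(model(batch_X), batch_y)
    scaler.scale(loss).backward()             # backward pass on the scaled loss
    scaler.step(optimizer)                    # unscale gradients, then update the FP32 weights
    scaler.update()                           # adjust the scale factor for the next iteration
```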
Choosing a batch size involves balancing multiple competing objectives. There is no single correct batch size for all situations, but several practical guidelines have emerged from research and industry experience.
| Task type | Typical batch sizes | Notes |
|---|---|---|
| Image classification (CNNs) | 32 to 256 | Higher with large-batch optimizers (LARS) |
| Object detection | 2 to 16 | High-resolution images consume more memory |
| Language modeling (Transformers) | 256 to 8,192+ sequences | Large-scale pretraining batches are often reported in tokens instead, frequently millions of tokens per batch |
| Fine-tuning LLMs | 1 to 32 | Small batches due to model size and long sequences |
| Reinforcement learning | Varies widely | Depends on environment and on-policy vs. off-policy |
| GANs | 16 to 128 | Smaller batches sometimes improve stability |
Smith et al. (2018) showed that the ratio of learning rate to batch size (or equivalently, the noise scale in SGD) is what controls the training dynamics. This means there are multiple paths to the same outcome: decaying the learning rate at a fixed batch size, increasing the batch size at a fixed learning rate, or combining the two all reduce the SGD noise scale in the same way.
Practitioners can use these equivalences to adapt their training schedule to their hardware constraints. If more GPUs become available mid-training, one can increase the batch size (and learning rate) without starting over.
The use of mini-batches for training neural networks predates the modern deep learning era. The idea of using subsets of training data for gradient estimation dates back to the early stochastic approximation methods of Robbins and Monro (1951). LeCun et al. (1998) discussed practical batch size choices for training convolutional neural networks, noting the tradeoff between gradient accuracy and computational cost.
The explosion of interest in batch size optimization came with the rise of large-scale distributed training in the 2010s. As models grew from millions to billions of parameters, and datasets from thousands to billions of examples, the ability to train efficiently across hundreds or thousands of GPUs became a competitive advantage. The batch size was the primary lever for distributing work across devices.
| Year | Milestone | Reference |
|---|---|---|
| 1951 | Stochastic approximation methods | Robbins and Monro |
| 1998 | Practical batch size guidelines for CNNs | LeCun et al. |
| 2015 | Batch Normalization | Ioffe and Szegedy |
| 2017 | Linear scaling rule; ImageNet in 1 hour | Goyal et al. |
| 2017 | LARS optimizer for large-batch CNN training | You et al. |
| 2017 | Generalization gap and sharp minima | Keskar et al. (published ICLR 2017) |
| 2018 | Critical batch size and gradient noise scale | McCandlish et al. |
| 2018 | Batch size increase as alternative to LR decay | Smith et al. |
| 2020 | LAMB optimizer; BERT in 76 minutes | You et al. |
| 2022 | Continuous batching (Orca) for LLM inference | Yu et al. |
| 2023 | vLLM with PagedAttention for batched inference | Kwon et al. |