# Batch

> Source: https://aiwiki.ai/wiki/batch
> Updated: 2026-06-23
> Categories: Deep Learning, Machine Learning
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

*See also: [machine learning terms](/wiki/machine_learning_terms), [batch size](/wiki/batch_size), [gradient descent](/wiki/gradient_descent)*

A **batch** in [machine learning](/wiki/machine_learning) is the set of [training](/wiki/training) [examples](/wiki/example) processed together in one forward and backward pass before the [model](/wiki/model)'s [parameters](/wiki/parameter) are updated once. The number of examples in a batch is the [batch size](/wiki/batch_size). Rather than processing an entire [dataset](/wiki/dataset) at once (full-batch) or one example at a time ([stochastic gradient descent](/wiki/stochastic_gradient_descent_sgd)), practitioners almost always use [mini-batches](/wiki/mini-batch) of tens to thousands of examples to balance computational efficiency, memory usage, and optimization quality. Each parameter update from one batch is an [iteration](/wiki/iteration), and one full pass over every batch in the dataset is an [epoch](/wiki/epoch).

Batches play a central role in virtually every stage of the machine learning pipeline, from [gradient descent](/wiki/gradient_descent) optimization during [training](/wiki/training) to [batch normalization](/wiki/batch_normalization) layers within network architectures, and even to batched request handling during [inference](/wiki/inference). Understanding how batches work, and how their size affects training dynamics, is one of the most practical concerns in applied [deep learning](/wiki/deep_learning).

## Explain like I'm five (ELI5)

Imagine you are a teacher grading a stack of 1,000 homework papers. You could try to read all 1,000 papers before deciding what to teach differently tomorrow, but that would take forever. Or, you could look at just one paper at a time, but each single paper might not tell you much about what the whole class needs. A good middle ground is to grab a small pile of papers (say 32), read through them, notice a pattern ("most students got question 5 wrong"), and adjust your lesson plan. Then you grab the next pile of 32 and keep refining. Each small pile is a **batch**. The number of papers in the pile is the **batch size**. Once you have gone through all 1,000 papers, you have finished one **[epoch](/wiki/epoch)**.

In machine learning, the "papers" are data samples, "reading them" is computing how wrong the model's predictions are (the [loss](/wiki/loss_function)), and "adjusting the lesson plan" is updating the model's [parameters](/wiki/parameter). Working in batches lets the model learn efficiently without needing to see the entire dataset before making any improvement.

## Gradient descent variants and the role of batches

The concept of a batch is inseparable from the optimization algorithm used to train a model. In [gradient descent](/wiki/gradient_descent), the model computes a [loss function](/wiki/loss_function) over some portion of the data, calculates [gradients](/wiki/gradient) of that loss with respect to the model [parameters](/wiki/parameter), and then updates the parameters to reduce the loss. The portion of data used for each such computation is the batch.

There are three main variants of gradient descent, distinguished by how much data goes into each batch.

### Full-batch gradient descent

In full-batch (or simply "batch") [gradient descent](/wiki/gradient_descent), the entire training dataset is used to compute the gradient at each step. This produces the most accurate estimate of the true gradient, since it averages over all examples. However, for large datasets, this approach is often impractical because it requires loading the entire dataset into memory at once and performing a full forward and backward pass before making even a single parameter update.

Full-batch gradient descent follows a smooth, deterministic path toward a minimum of the loss surface. While this sounds desirable, the lack of noise in the gradient estimate means that training can get stuck in sharp local minima or saddle points, which may hurt [generalization](/wiki/generalization) performance on unseen data.

### Stochastic gradient descent

[Stochastic gradient descent](/wiki/stochastic_gradient_descent_sgd) (SGD) sits at the opposite extreme. Each parameter update is computed from a single training example (a batch size of 1). The resulting gradient estimate is very noisy because it reflects only one data point rather than the full distribution. This noise, however, can be beneficial: it helps the [optimizer](/wiki/optimizer) escape shallow local minima and saddle points, often leading to solutions that generalize better.

The downside is that SGD makes very slow progress in terms of wall-clock time. Each update processes only one example, and modern hardware (particularly [GPUs](/wiki/gpu)) is designed for parallel computation over large arrays. Processing one example at a time leaves most of the GPU's compute capacity idle.

### Mini-batch gradient descent

[Mini-batch](/wiki/mini-batch) gradient descent is the practical middle ground and the default approach in modern deep learning. Each update uses a small random subset (the mini-batch) of the training data, typically between 16 and 8,192 examples. This offers several advantages:

- The gradient estimate is less noisy than single-sample SGD but retains enough stochasticity to help with [generalization](/wiki/generalization).
- The computation can exploit [GPU](/wiki/gpu) parallelism efficiently because matrix operations over a batch of examples can be executed in parallel.
- Memory requirements are manageable because only a fraction of the dataset needs to be in GPU memory at any time.

In practice, when researchers and practitioners say "SGD" they almost always mean mini-batch SGD, not the single-sample variant.

| Variant | Batch size | Gradient noise | GPU utilization | Memory cost | Typical use |
|---|---|---|---|---|---|
| Full-batch | Entire dataset | None | High (if dataset fits) | Very high | Small datasets, convex problems |
| Stochastic (SGD) | 1 | Very high | Very low | Minimal | Rare in practice today |
| Mini-batch | 16 to 8,192+ | Moderate | High | Moderate | Standard for neural network training |
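
The three variants differ only in how many examples feed each update, which a single training loop can show. The following is a minimal NumPy sketch, assuming a linear model trained on mean squared error; the function name `train_linear_regression`, the synthetic data, and the hyperparameter values are illustrative choices rather than anything prescribed by the sources cited here.

```python
import numpy as np

def train_linear_regression(X, y, batch_size, lr=0.1, epochs=50, seed=0):
    """Fit y ~ X @ w with gradient descent on mean squared error.

    batch_size == len(X) -> full-batch gradient descent
    batch_size == 1      -> single-sample SGD
    anything in between  -> mini-batch gradient descent
    """
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(epochs):
        order = rng.permutation(n_samples)            # reshuffle every epoch
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]     # last batch may be smaller
            error = X[idx] @ w - y[idx]
            grad = 2.0 * X[idx].T @ error / len(idx)  # gradient of the batch MSE
            w -= lr * grad                            # one iteration = one update
    return w

# Synthetic problem with known weights (illustrative data only).
rng = np.random.default_rng(42)
X = rng.normal(size=(512, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=512)

w_full = train_linear_regression(X, y, batch_size=len(X))  # 1 update per epoch
w_mini = train_linear_regression(X, y, batch_size=32)      # 16 updates per epoch
print(w_full, w_mini)
```

On this small convex problem both settings recover weights close to `true_w`; the differences in noise and generalization discussed above matter mainly for non-convex neural network training.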

## What is the difference between an epoch, an iteration, and a batch?

Three closely related terms describe the structure of a training loop: [epoch](/wiki/epoch), [iteration](/wiki/iteration) (also called a training step), and batch. Their relationship can be expressed with a simple formula.

An **epoch** is one complete pass through the entire training dataset, where every sample has been seen exactly once. An **iteration** is one parameter update, which consumes one batch of data. The number of iterations per epoch is therefore:

**Iterations per epoch = ceil(Total training samples / Batch size)**

If a dataset contains 10,000 samples and the [batch size](/wiki/batch_size) is 256, one epoch consists of ceil(10,000 / 256) = 40 iterations (the last batch may be smaller if the dataset size is not evenly divisible by the batch size). Over 50 epochs, the model performs 40 x 50 = 2,000 total parameter updates.
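
The arithmetic in this example can be checked with a few lines of Python. The helper name `iterations_per_epoch` and its `drop_last` flag are illustrative (the flag mirrors the common option of discarding an incomplete final batch).

```python
import math

def iterations_per_epoch(num_samples, batch_size, drop_last=False):
    """Number of parameter updates in one pass over the dataset."""
    if drop_last:
        return num_samples // batch_size          # discard the incomplete final batch
    return math.ceil(num_samples / batch_size)    # keep the smaller final batch

steps = iterations_per_epoch(10_000, 256)         # 40
total_updates = steps * 50                        # 2,000 updates over 50 epochs
last_batch_size = 10_000 - 256 * (steps - 1)      # 16 examples in the final batch
print(steps, total_updates, last_batch_size)
```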

| Term | Definition | Formula |
|---|---|---|
| [Batch size](/wiki/batch_size) | Number of samples in one batch | Chosen by the practitioner |
| [Iteration](/wiki/iteration) | One forward pass, backward pass, and parameter update on one batch | Iterations per epoch = ceil(N / batch size) |
| [Epoch](/wiki/epoch) | One full pass through all N training samples | Total iterations = iterations per epoch x epochs |

This relationship has practical consequences. Doubling the batch size halves the number of iterations per epoch and therefore halves the number of gradient updates. If no other hyperparameters are changed, the model sees the same total data but performs half as many updates, each using a more accurate gradient estimate. Whether this speeds up or slows down convergence depends on the [learning rate](/wiki/learning_rate) and the characteristics of the loss landscape.

## How does batch size affect training?

[Batch size](/wiki/batch_size) is one of the most important [hyperparameters](/wiki/hyperparameter) in deep learning. It affects convergence speed, generalization ability, memory consumption, and hardware utilization. The interactions between these factors are nuanced and have been the subject of extensive research.

### Convergence speed

Larger batches produce lower-variance gradient estimates, which means each parameter update points more reliably toward the direction that reduces the loss. This allows the optimizer to take larger steps without overshooting, and in principle, training progresses faster per update. However, the number of updates per [epoch](/wiki/epoch) decreases as batch size increases (since each update consumes more data), so the relationship between batch size and wall-clock convergence is not straightforward.

Smaller batches, by contrast, produce noisier gradients. Each update is less reliable on its own, but more updates happen per epoch. The noise acts as an implicit form of [regularization](/wiki/regularization), preventing the model from fitting too precisely to the training data.

### Do large batches generalize worse? The sharp vs. flat minima debate

One of the most cited findings in the batch size literature comes from Keskar et al. (2017), who observed that large-batch training tends to converge to "sharp" minima of the loss surface, while small-batch training tends to find "flat" minima.[4] The paper presented "numerical evidence that supports the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions," and noted that "sharp minima lead to poorer generalization," whereas "small-batch methods consistently converge to flat minimizers."[4] A sharp minimizer sits in a narrow, steep valley, meaning that even slight perturbations to the parameters cause a large increase in loss. A flat minimizer sits in a broad, gently sloped region where small parameter changes have little effect on the loss.

The practical consequence is that models trained with large batches often achieve low training loss but higher test loss compared to models trained with small batches. This phenomenon is known as the generalization gap.[4] The noise inherent in small-batch gradient estimates is believed to help the optimizer avoid sharp minima and settle in flatter regions that transfer better to held-out data.

That said, subsequent research has shown that the generalization gap can be narrowed or eliminated with proper [learning rate](/wiki/learning_rate) tuning, warmup schedules, and other techniques. The relationship between batch size and generalization is not an immutable law; it depends on the optimizer, the learning rate schedule, and the training duration.

### Learning rate scaling

Batch size and [learning rate](/wiki/learning_rate) are tightly coupled. When the batch size increases, the variance of the gradient estimate decreases, and each step becomes more deterministic. To maintain a similar training trajectory, the learning rate typically needs to increase as well.

The **linear scaling rule**, popularized by Goyal et al. (2017) in their work on training [ResNet](/wiki/resnet)-50 on [ImageNet](/wiki/imagenet), states: when the batch size is multiplied by a factor of *k*, the learning rate should also be multiplied by *k*.[2] Using this rule, along with a gradual warmup period in the first few epochs, they trained ResNet-50 with a batch size of 8,192 across 256 GPUs and matched the accuracy of smaller-batch baselines, completing training in about one hour.[2] The authors describe adopting "a hyper-parameter-free linear scaling rule for adjusting learning rates as a function of minibatch size" together with "a new warmup scheme that overcomes optimization challenges early in training."[2]

The linear scaling rule has practical limits. For very large batch sizes (beyond approximately 8,192), it begins to break down because gradient noise reduction does not scale linearly with batch size: doubling the batch size reduces the gradient standard deviation by a factor of sqrt(2), not by a factor of 2. This observation motivates the **square root scaling rule**, where the learning rate is multiplied by sqrt(*k*) instead of *k*, which is sometimes more stable for very large batches.

Smith et al. (2018) took this idea further in "Don't Decay the Learning Rate, Increase the Batch Size." They showed that the common practice of decaying the learning rate during training can usually be replaced by increasing the batch size while keeping the learning rate fixed, yielding nearly identical learning curves.[11] This insight provides an alternative path to efficient large-batch training: instead of reducing the learning rate at scheduled epochs, one can increase the batch size, which improves parallelism and reduces the total number of parameter updates needed.

| Scaling approach | Key idea | Reference |
|---|---|---|
| Linear scaling rule | Multiply learning rate by *k* when batch size is multiplied by *k*; use warmup | Goyal et al. (2017) |
| Batch size increase schedule | Increase batch size during training instead of decaying learning rate | Smith et al. (2018) |
| Square root scaling | Scale learning rate by sqrt(*k*) for batch size increase of *k* | Hoffer et al. (2017) |
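
The linear and square root rules can be compared with a small helper. This is a sketch only: the function name `scale_learning_rate` is hypothetical, and the base values (learning rate 0.1 at batch size 256) are used here purely as an illustration.

```python
import math

def scale_learning_rate(base_lr, base_batch_size, new_batch_size, rule="linear"):
    """Rescale a learning rate when the batch size changes by a factor k."""
    k = new_batch_size / base_batch_size
    if rule == "linear":
        return base_lr * k               # multiply by k
    if rule == "sqrt":
        return base_lr * math.sqrt(k)    # multiply by sqrt(k)
    raise ValueError(f"unknown rule: {rule!r}")

# Going from batch size 256 to 8,192 is a factor of k = 32.
lr_linear = scale_learning_rate(0.1, 256, 8192, rule="linear")
lr_sqrt = scale_learning_rate(0.1, 256, 8192, rule="sqrt")
print(lr_linear, lr_sqrt)
```

For k = 32 the two rules differ by a factor of sqrt(32), roughly 5.7, so the choice between them matters a great deal at large batch sizes.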

## Gradient noise from mini-batches

The noise in mini-batch gradient estimates is not merely a nuisance; it has a well-defined statistical structure that plays a central role in optimization and generalization.

When computing the gradient over a mini-batch of size *B* drawn from a dataset of size *N*, the mini-batch gradient is an unbiased estimator of the full-batch gradient. Its variance is approximately:

**Var(mini-batch gradient) ≈ (1 - B/N) x (sigma^2 / B)**

where sigma^2 is the per-sample gradient variance. The factor (1 - B/N) is a finite-population correction that becomes negligible when B is much smaller than N, which is the typical case. In practice, the variance scales roughly as sigma^2 / B, meaning that doubling the batch size halves the gradient variance.
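
The variance formula can be checked empirically. The sketch below assumes scalar "per-sample gradients" drawn from a normal distribution as a stand-in for real gradients, and samples mini-batches without replacement; all names and constants are illustrative.

```python
import numpy as np

def predicted_variance(per_sample_var, batch_size, dataset_size):
    """Variance of a mini-batch mean drawn without replacement."""
    return (1 - batch_size / dataset_size) * per_sample_var / batch_size

rng = np.random.default_rng(0)
N = 1_000
# Stand-in for one coordinate of the per-sample gradients (illustrative data).
per_sample_grads = rng.normal(loc=0.3, scale=2.0, size=N)
sigma2 = per_sample_grads.var(ddof=1)

def empirical_variance(batch_size, trials=5_000):
    """Monte Carlo estimate of the variance of the mini-batch mean."""
    means = [
        rng.choice(per_sample_grads, size=batch_size, replace=False).mean()
        for _ in range(trials)
    ]
    return np.var(means)

for B in (16, 32, 64):
    print(B, predicted_variance(sigma2, B, N), empirical_variance(B))
```

The predicted and simulated values should agree to within Monte Carlo error, and each doubling of B roughly halves both.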

McCandlish et al. (2018) from [OpenAI](/wiki/openai) formalized this relationship through the **gradient noise scale**, a statistic that measures the size of the gradient noise relative to the gradient signal (in its simplest form, the summed per-example gradient variance divided by the squared norm of the true gradient).[7] The gradient noise scale predicts a **critical batch size**: below this threshold, doubling the batch size roughly halves training time (the noise is the bottleneck), while above it, further increases yield diminishing returns (the signal dominates). In their experiments, critical batch sizes ranged from around 20 for small autoencoders on SVHN to millions for Dota 2 reinforcement learning agents, with [ImageNet](/wiki/imagenet) supervised training sitting in the tens of thousands. The gradient noise scale predicted the critical batch size across all of these domains.[7]

The gradient noise scale typically grows during training as the model approaches a minimum and the gradient signal shrinks relative to the noise. This observation supports the practice of increasing the batch size over the course of training. [GPT-3](/wiki/gpt-3) pretraining, for example, ramped the batch size up during the early part of training and used gradient noise scale measurements to guide the choice of batch size.
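
To make the noise-to-signal idea concrete, the sketch below computes a naive plug-in estimate of the simplified noise scale, assuming the form tr(Sigma) / |G|^2 (summed per-example gradient variance over the squared norm of the mean gradient). The function name and the synthetic gradients are illustrative, and this direct estimate is not the estimation procedure used in the paper.

```python
import numpy as np

def simple_noise_scale(per_example_grads):
    """Plug-in estimate of tr(Sigma) / |G|^2 from an (n, d) array of gradients."""
    g = np.asarray(per_example_grads, dtype=float)
    mean_grad = g.mean(axis=0)                 # estimate of the true gradient G
    trace_cov = g.var(axis=0, ddof=1).sum()    # trace of the per-example covariance
    return trace_cov / np.dot(mean_grad, mean_grad)

rng = np.random.default_rng(0)
noise = rng.normal(scale=1.0, size=(100_000, 2))

# Same noise, different signal strength (synthetic, for illustration only).
early = simple_noise_scale(np.array([1.0, 1.0]) + noise)  # strong gradient signal
late = simple_noise_scale(np.array([0.1, 0.1]) + noise)   # weak signal near a minimum
print(early, late)
```

With the noise held fixed, shrinking the gradient by 10x raises the estimate by roughly 100x, which matches the intuition that larger batches become more useful later in training.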

## Batch size and hardware utilization

Modern [GPUs](/wiki/gpu) and accelerators are designed for massively parallel computation. A [neural network](/wiki/neural_network)'s forward and backward passes consist primarily of matrix multiplications, and these operations are most efficient when the matrices are large. The batch dimension is one of the dimensions of these matrices, so increasing the batch size directly increases the GPU's arithmetic throughput, up to a point.

### GPU memory breakdown

The total GPU memory consumed during training comes from several sources:

| Component | Description | Scales with batch size? |
|---|---|---|
| Model parameters | Weights and biases of the network | No |
| Optimizer state | Momentum buffers, second-moment estimates (for [Adam](/wiki/adam_optimizer)) | No |
| Activations | Intermediate outputs stored for [backpropagation](/wiki/backpropagation) | Yes (linearly) |
| Gradients | Computed during backward pass | Parameter gradients: no; intermediate activation gradients: yes |
| Input data | The batch itself | Yes (linearly) |

Activations are typically the largest memory consumer for deep networks, because every layer's output must be saved for use during the backward pass. Since activation memory grows linearly with batch size, the maximum batch size is often constrained by the available GPU VRAM. Exceeding this limit triggers an out-of-memory error.

### Maximizing throughput

To maximize GPU utilization, practitioners generally want the largest batch size that fits in memory, though generalization considerations may argue for a smaller batch. A common workflow is:

1. Start with a batch size of 32.
2. Gradually increase by factors of 2 (64, 128, 256, ...) and monitor GPU memory and training throughput.
3. Stop when GPU memory is nearly full or when the validation loss begins to degrade.
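
The memory half of this workflow can be sketched as a doubling search. Everything here is hypothetical scaffolding: `try_batch` stands for a user-supplied function that runs one training step and raises `MemoryError` when the batch does not fit, and `fake_step` merely simulates a memory budget. In PyTorch, a CUDA out-of-memory failure surfaces as a `RuntimeError`, so a real version would catch that instead. Monitoring validation loss is left out.

```python
def find_largest_batch_size(try_batch, start=32, max_batch=65_536):
    """Double the batch size until `try_batch` fails; return the last size that fit."""
    best = None
    batch_size = start
    while batch_size <= max_batch:
        try:
            try_batch(batch_size)    # hypothetical: run one training step
        except MemoryError:
            break                    # this size does not fit; stop searching
        best = batch_size
        batch_size *= 2
    return best

def fake_step(batch_size, budget_mb=12_000, per_example_mb=40):
    """Simulated training step: pretend each example needs 40 MB of activations."""
    if batch_size * per_example_mb > budget_mb:
        raise MemoryError(f"batch size {batch_size} exceeds the simulated budget")

largest = find_largest_batch_size(fake_step)
print(largest)
```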

Using powers of two for batch sizes is a widespread convention. The rationale is that GPU memory is organized in pages whose sizes are powers of two, and NVIDIA's documentation recommends that matrix dimensions be multiples of 8 for optimal Tensor Core utilization. In practice, experiments by Raschka (2022) and others have found that non-power-of-two batch sizes often perform just as well, but the convention persists because it simplifies benchmarking and comparison.

## What is gradient accumulation and micro-batching?

When the desired effective batch size is too large to fit in GPU memory, **gradient accumulation** provides a workaround. Instead of processing the full batch in one pass, the training loop processes several smaller **micro-batches** sequentially, accumulating (summing) the gradients from each, and then performs a single parameter update after all micro-batches have been processed. When the loss is averaged over examples, each micro-batch's loss is typically divided by the number of micro-batches so that the accumulated gradient matches the full-batch average.

For example, suppose the target effective batch size is 1,024 but only 256 examples fit in memory. The training loop processes four micro-batches of 256, accumulates the gradients, and then updates the model once. From the optimizer's perspective, this is equivalent to a single batch of 1,024.

Gradient accumulation trades time for memory. Each micro-batch is processed sequentially rather than in parallel, so the total training time is longer than it would be if the full batch fit in memory. However, it allows practitioners to simulate large-batch training on hardware that would otherwise be insufficient.
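
The equivalence between four micro-batches of 256 and one batch of 1,024 can be verified numerically. The sketch below uses NumPy and a linear model with a hand-written gradient so that it stays self-contained; the data and the helper `mse_gradient` are illustrative.

```python
import numpy as np

def mse_gradient(w, X, y):
    """Gradient of mean squared error for the linear model X @ w."""
    return 2.0 * X.T @ (X @ w - y) / len(X)

rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 8))     # effective batch of 1,024 examples
y = rng.normal(size=1024)
w = rng.normal(size=8)

micro_batch_size = 256
num_micro_batches = len(X) // micro_batch_size   # 4

accumulated = np.zeros_like(w)
for i in range(num_micro_batches):
    rows = slice(i * micro_batch_size, (i + 1) * micro_batch_size)
    # Scale each micro-batch gradient by 1/num_micro_batches so the sum of
    # per-micro-batch means equals the mean over the full effective batch.
    accumulated += mse_gradient(w, X[rows], y[rows]) / num_micro_batches

full_batch = mse_gradient(w, X, y)
w_new = w - 0.01 * accumulated                   # single parameter update
print(np.allclose(accumulated, full_batch))
```

In PyTorch the same pattern is usually written by dividing each micro-batch loss by the number of micro-batches before calling `backward()`, and calling `optimizer.step()` and `optimizer.zero_grad()` only after the last micro-batch.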

There is one notable complication: standard [batch normalization](/wiki/batch_normalization) computes statistics (mean and variance) over the micro-batch, not the full effective batch. Since each micro-batch is smaller, these statistics are noisier, which can degrade training stability. Synchronized batch normalization does not help here, because it pools statistics across devices that run at the same time rather than across micro-batches processed one after another. The usual remedy is to replace batch normalization with alternatives like Group Normalization or Layer Normalization, whose statistics do not depend on the batch.

### Gradient accumulation in distributed training

In [data parallelism](/wiki/data_parallelism) setups where multiple GPUs each process a portion of the batch, gradient accumulation is often combined with all-reduce communication patterns. Each GPU computes gradients on its local micro-batch, the gradients are summed across GPUs, and the combined gradient is used for the parameter update. This approach allows the effective batch size to scale with the number of GPUs.

## Batch normalization

[Batch normalization](/wiki/batch_normalization) (BN), introduced by Ioffe and Szegedy (2015), is one of the most widely used techniques in deep learning, and it depends directly on the batch.[3] During training, BN normalizes the activations of each layer by subtracting the mean and dividing by the standard deviation computed over the current mini-batch, for each feature independently. Two learnable parameters (scale and shift) are then applied to allow the network to undo the normalization if that is optimal.

The original motivation was to address "internal covariate shift," the idea that the distribution of each layer's inputs changes as the preceding layers' parameters are updated.[3] Later research has debated whether reducing internal covariate shift is actually the reason BN works. Santurkar et al. (2018) argued that BN's benefits come more from smoothing the loss landscape, making it easier for the optimizer to navigate.[10]

Regardless of the underlying mechanism, the practical benefits are well established:

- BN allows the use of higher learning rates, which speeds up training.
- BN acts as a regularizer, sometimes reducing or eliminating the need for [dropout](/wiki/dropout).
- BN makes training less sensitive to weight initialization.

Batch normalization's dependence on the batch creates certain limitations. During inference, there is no mini-batch to compute statistics from, so BN layers use running averages of mean and variance accumulated during training. When the training batch size is very small, the per-batch statistics become noisy and unreliable, degrading performance. This has motivated alternatives like Layer Normalization (which normalizes over features rather than over the batch) and Group Normalization (which normalizes over groups of channels), both of which are independent of batch size.
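
The training-time normalization and the inference-time use of running averages can be sketched in NumPy. This is a simplified illustration for 2-D inputs of shape (batch, features); the class name `BatchNorm1D`, the exponential-moving-average convention, and the default `momentum` and `eps` values are assumptions, not a reproduction of any framework's implementation.

```python
import numpy as np

class BatchNorm1D:
    """Minimal batch normalization for inputs of shape (batch, features)."""

    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        self.gamma = np.ones(num_features)          # learnable scale
        self.beta = np.zeros(num_features)          # learnable shift
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum = momentum
        self.eps = eps

    def __call__(self, x, training):
        if training:
            mean = x.mean(axis=0)                   # statistics over the mini-batch
            var = x.var(axis=0)
            # Track running averages for later use at inference time.
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mean
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        else:
            mean, var = self.running_mean, self.running_var
        x_hat = (x - mean) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

rng = np.random.default_rng(0)
bn = BatchNorm1D(num_features=4)
train_batch = rng.normal(loc=5.0, scale=3.0, size=(64, 4))
train_out = bn(train_batch, training=True)          # per-feature mean ~0, std ~1
single_out = bn(train_batch[:1], training=False)    # works with a batch of one
print(train_out.mean(axis=0), train_out.std(axis=0), single_out.shape)
```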

| Normalization method | Normalizes over | Depends on batch size? | Typical use case |
|---|---|---|---|
| Batch Normalization | Batch dimension | Yes | [CNNs](/wiki/convolutional_neural_network), large batch sizes |
| Layer Normalization | Feature dimension | No | [Transformers](/wiki/transformer), [RNNs](/wiki/recurrent_neural_network) |
| Group Normalization | Groups of channels | No | Small batch sizes, detection |
| Instance Normalization | Single instance per channel | No | Style transfer |
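
The difference between these methods comes down to which axes the mean and variance are computed over. The sketch below assumes an image-style tensor of shape (N, C, H, W) and omits the learnable scale and shift; in Transformers, Layer Normalization is usually applied over the last (feature) axis only. Function names are illustrative.

```python
import numpy as np

def normalize(x, axes, eps=1e-5):
    """Subtract the mean and divide by the standard deviation over `axes`."""
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def group_norm(x, num_groups, eps=1e-5):
    """Normalize each sample over groups of channels."""
    n, c, h, w = x.shape
    grouped = x.reshape(n, num_groups, c // num_groups, h, w)
    return normalize(grouped, axes=(2, 3, 4), eps=eps).reshape(n, c, h, w)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 6, 4, 4))                # (N, C, H, W)

batch_norm = normalize(x, axes=(0, 2, 3))        # per channel, across the batch
layer_norm = normalize(x, axes=(1, 2, 3))        # per sample, across all features
instance_norm = normalize(x, axes=(2, 3))        # per sample and per channel
grouped_norm = group_norm(x, num_groups=3)       # per sample, 3 groups of 2 channels

# Only batch normalization changes when the rest of the batch is removed.
ln_alone = normalize(x[:1], axes=(1, 2, 3))
bn_alone = normalize(x[:1], axes=(0, 2, 3))
print(np.allclose(ln_alone, layer_norm[:1]), np.allclose(bn_alone, batch_norm[:1]))
```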

## Large-batch training techniques

Training with very large batches (tens of thousands of examples or more) can dramatically reduce training time by increasing parallelism across many GPUs. However, naively scaling up the batch size often degrades accuracy due to the generalization gap discussed earlier. Several specialized techniques have been developed to make large-batch training work.

### LARS

LARS (Layer-wise Adaptive Rate Scaling), introduced by You et al. (2017), addresses the observation that different layers of a [deep neural network](/wiki/deep_learning) may need very different learning rates.[12] LARS computes a per-layer learning rate by looking at the ratio of the weight norm to the gradient norm for each layer, then scales the base learning rate accordingly. Using LARS, You et al. trained ResNet-50 on ImageNet with batch sizes up to 32,768 while maintaining accuracy.[12]

However, LARS was designed primarily for networks trained with SGD with [momentum](/wiki/momentum). It performs poorly on [attention](/wiki/attention)-based architectures like [BERT](/wiki/bert).

### LAMB

LAMB (Layer-wise Adaptive Moments optimizer for Batch training), introduced by You et al. (2020), extends the LARS idea to the [Adam optimizer](/wiki/adam_optimizer). LAMB computes the Adam update for each layer and then scales it by a per-layer trust ratio, which limits the relative change to any layer's weights. LAMB enabled training BERT with a batch size of 32,768 without accuracy loss, reducing training time from 3 days to 76 minutes on a TPUv3 Pod.[13]

| Optimizer | Base method | Per-layer adaptation | Designed for | Key result |
|---|---|---|---|---|
| LARS | SGD + Momentum | Weight norm / gradient norm ratio | CNNs | ResNet-50 with batch size 32K on ImageNet |
| LAMB | Adam | Trust ratio on Adam update | Transformers, BERT | BERT training in 76 minutes |
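
The per-layer scaling idea behind LARS can be illustrated in a few lines. This is a simplified sketch, assuming the local learning rate takes the form trust coefficient x ||w|| / (||g|| + weight decay x ||w||) and leaving out momentum; the function name, default values, and the fallback of 1.0 for zero norms are assumptions made for illustration.

```python
import numpy as np

def lars_local_lr(weights, grads, trust_coefficient=0.001, weight_decay=0.0):
    """Layer-wise learning-rate multiplier based on the weight/gradient norm ratio."""
    w_norm = np.linalg.norm(weights)
    g_norm = np.linalg.norm(grads)
    if w_norm == 0.0 or g_norm == 0.0:
        return 1.0  # assumed fallback: leave the layer's learning rate unscaled
    return trust_coefficient * w_norm / (g_norm + weight_decay * w_norm)

layer_weights = np.array([3.0, 4.0])     # norm 5
small_grad = np.array([0.3, 0.4])        # norm 0.5
large_grad = np.array([30.0, 40.0])      # norm 50

# The step size (local rate x gradient norm) ends up proportional to the
# weight norm, regardless of how large the raw gradient is.
step_small = lars_local_lr(layer_weights, small_grad) * np.linalg.norm(small_grad)
step_large = lars_local_lr(layer_weights, large_grad) * np.linalg.norm(large_grad)
print(step_small, step_large)
```

Because the update magnitude is tied to the weight norm, a layer with unusually large gradients cannot take a disproportionately large step, which is the property that makes very large batches trainable.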

### Warmup schedules

Large-batch training often requires a [learning rate](/wiki/learning_rate) warmup period at the beginning of training. During the first few epochs, the learning rate is gradually increased from a small value to the target value. This prevents the optimizer from making overly large updates early in training when the model parameters are far from any reasonable solution and the loss landscape is poorly conditioned.

Goyal et al. (2017) used a linear warmup over the first 5 epochs when training with a batch size of 8,192 on ImageNet.[2] This simple technique proved essential for making the linear scaling rule work. Warmup duration typically scales with the batch size scaling factor: if the batch size is increased by 8x relative to the baseline, a warmup of roughly 5 epochs is often sufficient, while larger scaling factors may need longer warmup.
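
A linear warmup schedule is simple to express as a function of the training step. In the sketch below the function name `warmup_lr` is hypothetical, and the numbers (157 steps per epoch, a ramp from 0.1 to 3.2) are illustrative values for a batch size of 8,192 rather than figures taken from the paper.

```python
def warmup_lr(step, warmup_steps, start_lr, target_lr):
    """Linearly ramp the learning rate from start_lr to target_lr, then hold it."""
    if step >= warmup_steps:
        return target_lr
    return start_lr + (target_lr - start_lr) * step / warmup_steps

steps_per_epoch = 157                 # illustrative value
warmup_steps = 5 * steps_per_epoch    # five epochs of warmup

for step in (0, warmup_steps // 2, warmup_steps, warmup_steps + 1000):
    print(step, round(warmup_lr(step, warmup_steps, 0.1, 3.2), 4))
```

After the warmup period the schedule returns the target rate unchanged; in practice it would hand over to whatever decay schedule the rest of training uses.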

## Framework implementations of batching

Modern deep learning frameworks provide built-in utilities to divide datasets into batches, shuffle them, and load them efficiently.

### PyTorch DataLoader

In [PyTorch](/wiki/pytorch), the `torch.utils.data.DataLoader` class is the primary interface for batching. It wraps a `Dataset` object and yields batches of data during training. Key parameters include:

| Parameter | Description | Typical value |
|---|---|---|
| `batch_size` | Number of samples per batch | 32, 64, 128 |
| `shuffle` | Whether to randomize sample order each epoch | True for training, False for evaluation |
| `num_workers` | Number of subprocesses for parallel data loading | 2 to 8 |
| `drop_last` | Whether to drop the last incomplete batch | True when batch normalization is used |
| `pin_memory` | Whether to use pinned (page-locked) memory for faster GPU transfer | True when using CUDA |

A typical usage pattern looks like:

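```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data so the snippet runs; substitute real tensors.
X_train, y_train = torch.randn(1000, 20), torch.randint(0, 2, (1000,))
num_epochs = 2

dataset = TensorDataset(X_train, y_train)
# With num_workers > 0, run under `if __name__ == "__main__":` on Windows/macOS.
loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=4)

for epoch in range(num_epochs):
    for batch_X, batch_y in loader:
        # Forward pass, loss computation, backward pass, optimizer step go here
        pass
```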

The DataLoader handles shuffling the dataset at the start of each [epoch](/wiki/epoch), dividing it into batches, and optionally loading batches in parallel using worker processes so that data preparation overlaps with GPU computation.

### TensorFlow tf.data

In [TensorFlow](/wiki/tensorflow), the `tf.data.Dataset` API provides similar functionality. The `.batch()` method groups consecutive elements into batches, and `.prefetch()` overlaps data preprocessing with model execution for higher throughput:

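```python
import numpy as np
import tensorflow as tf

# Placeholder arrays so the snippet runs; substitute real data.
X_train = np.random.rand(1000, 20).astype("float32")
y_train = np.random.randint(0, 2, size=1000)

dataset = tf.data.Dataset.from_tensor_slices((X_train, y_train))
dataset = dataset.shuffle(buffer_size=10000)
dataset = dataset.batch(64)
dataset = dataset.prefetch(tf.data.AUTOTUNE)
```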

The `prefetch` transformation is particularly important for performance. While the model is executing training step *s*, the input pipeline is already reading and preparing the data for step *s+1*. Setting the buffer size to `tf.data.AUTOTUNE` lets the runtime dynamically tune prefetch depth based on available resources.

## Batches in inference

Batching is not only relevant during training. At inference time, batching multiple requests together can significantly increase throughput by amortizing the cost of loading model weights from memory.

### Why does batching help inference?

For many [neural network](/wiki/neural_network) workloads, especially small-batch inference and autoregressive decoding, the bottleneck is not computation but memory bandwidth: the time it takes to move model weights from GPU memory (VRAM) to the GPU's compute cores. Once the weights for a given layer are loaded into the GPU's cache, processing one example or many examples through that layer costs roughly the same in terms of memory traffic. By batching *N* requests, the weight-loading cost is amortized over *N* predictions, dramatically increasing throughput.

The tradeoff is latency. A server that batches incoming requests must wait until a batch is full (or until a timeout expires) before running inference, which increases the time any individual request spends waiting. This creates a throughput-latency tension that is central to production ML serving.

### Batch (offline) inference

Batch inference, also called offline inference, refers to generating predictions on a large set of data all at once rather than responding to individual requests in real time. This approach is common when immediate responses are not required. For example, a recommendation system might score all products for all users overnight, store the results in a database, and serve pre-computed recommendations during the day.

Batch inference typically runs on a recurring schedule (hourly, daily, or weekly) and can take advantage of distributed computing frameworks like Apache Spark or cloud services like AWS Batch and Google Cloud Vertex AI Batch Predictions. Because latency is not a concern, batch inference jobs can use larger batch sizes, optimize for throughput, and run on cheaper preemptible or spot instances.

| Inference mode | Latency requirement | Throughput | Cost per prediction | Use case |
|---|---|---|---|---|
| Real-time (online) | Milliseconds | Lower | Higher | Chatbots, search ranking |
| Batch (offline) | Hours to days | Very high | Lower | Recommendation scoring, risk analysis |

### Static, dynamic, and continuous batching for LLM serving

For [large language models](/wiki/large_language_model) (LLMs) and other autoregressive models, batching strategies have evolved considerably:

- **Static batching** groups a fixed number of requests into a batch and waits until all sequences in the batch are complete before returning any results. This is simple but wasteful, because shorter sequences finish early and sit idle while longer ones continue generating.
- **Dynamic batching** assembles batches on the fly, starting inference once a batch reaches a target size or a maximum wait time elapses. This improves latency compared to static batching while maintaining high throughput during busy periods.
- **Continuous batching** (also called iteration-level scheduling), introduced by Yu et al. (2022) in the Orca system, is a more significant departure. Instead of waiting for all sequences in a batch to complete, the system can insert new requests into the batch at every decoding step as old ones finish.[14] This keeps the GPU continuously occupied and avoids the idle-slot problem of static batching.

vLLM (Kwon et al., 2023) combined continuous batching with PagedAttention, a memory management technique that reduces GPU memory fragmentation for the key-value cache. Together, these optimizations achieve up to 24x higher throughput than Hugging Face Transformers and up to 3.5x higher throughput than Hugging Face Text Generation Inference (TGI), without changing the model architecture.[5] The continuous batching paradigm has since been adopted by most major inference frameworks, including TensorRT-LLM (which calls it "in-flight batching"), Hugging Face TGI, and SGLang.

| Batching strategy | How it works | Pros | Cons |
|---|---|---|---|
| Static | Fixed batch, return when all done | Simple to implement | Wasted GPU cycles on completed sequences |
| Dynamic | Batch until full or timeout | Better latency than static | Still waits for slowest request in batch |
| Continuous | Insert/remove requests at each iteration | Near-optimal GPU utilization | More complex scheduling logic |
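
The idle-slot problem can be quantified with a toy simulation. The sketch below makes strong simplifying assumptions: all requests are queued at time zero, each decoding step produces one token for every active sequence, and prefill cost is ignored. The function names are illustrative, and real schedulers such as those in Orca or vLLM are considerably more involved.

```python
import heapq

def static_batching_steps(output_lengths, batch_slots):
    """Each fixed batch runs until its longest sequence finishes."""
    total = 0
    for start in range(0, len(output_lengths), batch_slots):
        total += max(output_lengths[start:start + batch_slots])
    return total

def continuous_batching_steps(output_lengths, batch_slots):
    """A waiting request takes over a slot as soon as that slot frees up."""
    finish_times = []   # min-heap: step at which each busy slot becomes free
    makespan = 0
    for length in output_lengths:
        if len(finish_times) == batch_slots:
            start = heapq.heappop(finish_times)   # wait for the earliest free slot
        else:
            start = 0                             # an empty slot is available now
        finish = start + length
        heapq.heappush(finish_times, finish)
        makespan = max(makespan, finish)
    return makespan

lengths = [2, 8, 3, 7]   # tokens to generate per request (illustrative)
print(static_batching_steps(lengths, batch_slots=2))      # 15 decoding steps
print(continuous_batching_steps(lengths, batch_slots=2))  # 12 decoding steps
```

With two slots, static batching spends 8 steps on the first pair and 7 on the second, while continuous batching lets the third request start as soon as the 2-token request finishes.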

## Mixed precision training and batch size

[Mixed precision training](/wiki/mixed_precision_training) uses 16-bit floating-point (FP16 or BF16) for most computations while keeping a master copy of the weights in 32-bit (FP32) for numerical stability. Because 16-bit values take half the memory of 32-bit values, mixed precision effectively doubles the amount of activation memory available, allowing the batch size to be roughly doubled for the same GPU memory budget.

This technique was formalized by Micikevicius et al. (2018) and has become standard practice for training large models.[8] NVIDIA's Tensor Cores are designed to accelerate FP16 matrix multiplications, so mixed precision training also improves computational throughput independent of the batch size benefit.
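
The batch size effect of halving activation precision can be checked with back-of-the-envelope arithmetic. The sketch below is illustrative only: the memory budget and per-example activation count are hypothetical numbers, and it deliberately ignores weights, gradients, and optimizer state, which do not shrink in the same way.

```python
def max_batch_size(activation_budget_bytes, elements_per_example, bytes_per_element):
    """Largest batch whose stored activations fit in the given memory budget."""
    return activation_budget_bytes // (elements_per_example * bytes_per_element)


budget = 8 * 1024**3      # 8 GiB reserved for activations (hypothetical)
elements = 50_000_000     # activation values stored per example (hypothetical)

fp32_batch = max_batch_size(budget, elements, bytes_per_element=4)
fp16_batch = max_batch_size(budget, elements, bytes_per_element=2)
print(fp32_batch, fp16_batch)  # 42 85
```

Under these assumptions the feasible batch size goes from 42 to 85, which matches the "roughly doubled" estimate; in a real training run the gain is smaller when parameters and optimizer state take a large share of memory.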

## How do you choose a batch size in practice?

Choosing a batch size involves balancing multiple competing objectives. There is no single correct batch size for all situations, but several practical guidelines have emerged from research and industry experience.

### Rules of thumb

1. **Start with 32.** This is a widely recommended starting point (Bengio, 2012), and it works well for many problems.[1]
2. **Scale by powers of 2.** Try 32, 64, 128, 256, and so on. While non-power-of-two sizes work fine, the convention simplifies experimentation and comparison.
3. **Adjust the [learning rate](/wiki/learning_rate) when changing batch size.** If the batch size is doubled, increase the learning rate by a factor of 2 (linear scaling) or by a factor of sqrt(2) (square root scaling, sometimes more stable for very large batches).
4. **Use warmup with large batches.** A few epochs of learning rate warmup at the start of training are often necessary for batch sizes above a few hundred.
5. **Monitor validation metrics, not just training loss.** A batch size that produces fast training convergence may still generalize poorly. Always evaluate on a validation set.
6. **Let the GPU fill up.** Find the largest batch size that fits in memory, then check whether a smaller batch size (with appropriate learning rate) gives better validation performance.
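
Rules 3 and 4 can be written down as two small helpers. This is a minimal sketch, not a library API: the function names, the baseline of learning rate 0.1 at batch size 256, and the choice of a linear warmup ramp measured in steps are all assumptions made for illustration.

```python
import math


def scaled_lr(base_lr, base_batch, new_batch, rule="linear"):
    """Rescale a learning rate tuned at base_batch for use at new_batch."""
    ratio = new_batch / base_batch
    if rule == "linear":
        return base_lr * ratio             # double the batch, double the rate
    if rule == "sqrt":
        return base_lr * math.sqrt(ratio)  # double the batch, rate times sqrt(2)
    raise ValueError(f"unknown rule: {rule!r}")


def warmup_lr(target_lr, step, warmup_steps):
    """Ramp linearly up to target_lr over the first warmup_steps steps."""
    if step >= warmup_steps:
        return target_lr
    return target_lr * (step + 1) / warmup_steps


# Hypothetical baseline: learning rate 0.1 tuned at batch size 256.
linear = scaled_lr(0.1, base_batch=256, new_batch=1024, rule="linear")
sqrt_rule = scaled_lr(0.1, base_batch=256, new_batch=1024, rule="sqrt")
schedule = [warmup_lr(linear, step, warmup_steps=5) for step in range(7)]
print(linear, sqrt_rule)
print(schedule)
```

Going from batch size 256 to 1,024 multiplies the rate by 4 under linear scaling and by 2 under square root scaling. The warmup helper then approaches the scaled rate gradually instead of applying it from the first step.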

### Task-specific considerations

| Task type | Typical batch sizes | Notes |
|---|---|---|
| Image classification ([CNNs](/wiki/convolutional_neural_network)) | 32 to 256 | Higher with large-batch optimizers (LARS) |
| Object detection | 2 to 16 | High-resolution images consume more memory |
| Language modeling ([Transformers](/wiki/transformer)) | 256 to 8,192+ tokens | Often measured in tokens rather than sequences |
| Fine-tuning [LLMs](/wiki/large_language_model) | 1 to 32 | Small batches due to model size and long sequences |
| Reinforcement learning | Varies widely | Depends on environment and on-policy vs. off-policy |
| GANs | 16 to 128 | Smaller batches sometimes improve stability |

### The batch size, learning rate, and training time triangle

Smith et al. (2018) showed that the ratio of learning rate to batch size, which sets the noise scale of SGD, largely controls the training dynamics.[11] This means there are multiple paths to the same outcome:

- Scaling the learning rate and the batch size up together keeps the ratio, and hence the noise scale, roughly constant while reaching a similar result in fewer parameter updates.
- A low learning rate with a small batch can match that noise scale, but it needs more parameter updates, each of which is cheaper to compute.
- Decaying the learning rate during training has roughly the same effect on the noise scale as increasing the batch size during training.

Practitioners can use these equivalences to adapt their training schedule to their hardware constraints. If more GPUs become available mid-training, one can increase the batch size (and learning rate) without starting over.
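
The link between learning rate decay and batch size growth can be checked numerically. The sketch below assumes the approximation g ≈ εN/B for the SGD noise scale, where ε is the learning rate, N the dataset size, and B the batch size (reasonable when B is much smaller than N). The function name and the numeric values are hypothetical.

```python
import math


def sgd_noise_scale(learning_rate, batch_size, dataset_size):
    """Approximate SGD noise scale g = lr * N / B, assuming B is much smaller than N."""
    return learning_rate * dataset_size / batch_size


N = 50_000  # hypothetical dataset size

baseline = sgd_noise_scale(0.1, 128, N)
decayed_lr = sgd_noise_scale(0.1 / 5, 128, N)     # decay the learning rate by 5x
bigger_batch = sgd_noise_scale(0.1, 128 * 5, N)   # or grow the batch by 5x instead
scaled_both = sgd_noise_scale(0.1 * 5, 128 * 5, N)  # scale both: ratio unchanged

print(baseline, decayed_lr, bigger_batch, scaled_both)
```

Dividing the learning rate by 5 and multiplying the batch size by 5 lower the noise scale by the same factor, while scaling both together leaves it unchanged. The second schedule simply reaches that point with fewer, larger steps.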

## Historical context

The use of mini-batches for training neural networks predates the modern deep learning era. The idea of using subsets of training data for gradient estimation dates back to the early stochastic approximation methods of Robbins and Monro (1951).[9] LeCun et al. (1998) discussed the practical choice between stochastic and batch learning when training neural networks with backpropagation, noting the tradeoff between gradient accuracy and computational cost.[6]

The explosion of interest in batch size optimization came with the rise of large-scale distributed training in the 2010s. As models grew from millions to billions of parameters, and datasets from thousands to billions of examples, the ability to train efficiently across hundreds or thousands of GPUs became a competitive advantage. The batch size was the primary lever for distributing work across devices.

| Year | Milestone | Reference |
|---|---|---|
| 1951 | Stochastic approximation methods | Robbins and Monro |
| 1998 | Practical guidance on stochastic vs. batch learning | LeCun et al. |
| 2015 | Batch Normalization | Ioffe and Szegedy |
| 2017 | Linear scaling rule; ImageNet in 1 hour | Goyal et al. |
| 2017 | LARS optimizer for large-batch CNN training | You et al. |
| 2017 | Generalization gap and sharp minima | Keskar et al. (published ICLR 2017) |
| 2018 | Critical batch size and gradient noise scale | McCandlish et al. |
| 2018 | Batch size increase as alternative to LR decay | Smith et al. |
| 2020 | LAMB optimizer; BERT in 76 minutes | You et al. |
| 2022 | Continuous batching (Orca) for LLM inference | Yu et al. |
| 2023 | vLLM with PagedAttention for batched inference | Kwon et al. |

## References

1. Bengio, Y. (2012). "Practical Recommendations for Gradient-Based Training of Deep Architectures." *Neural Networks: Tricks of the Trade*, Springer.
2. Goyal, P., Dollár, P., Girshick, R., et al. (2017). "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour." *arXiv:1706.02677*.
3. Ioffe, S., & Szegedy, C. (2015). "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift." *Proceedings of the 32nd International Conference on Machine Learning (ICML)*.
4. Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., & Tang, P. T. P. (2017). "On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima." *International Conference on Learning Representations (ICLR)*.
5. Kwon, W., Li, Z., Zhuang, S., et al. (2023). "Efficient Memory Management for Large Language Model Serving with PagedAttention." *Proceedings of the 29th Symposium on Operating Systems Principles (SOSP)*.
6. LeCun, Y., Bottou, L., Orr, G. B., & Müller, K.-R. (1998). "Efficient BackProp." *Neural Networks: Tricks of the Trade*, Springer.
7. McCandlish, S., Kaplan, J., Amodei, D., & the OpenAI Dota Team. (2018). "An Empirical Model of Large-Batch Training." *arXiv:1812.06162*.
8. Micikevicius, P., Narang, S., Alben, J., et al. (2018). "Mixed Precision Training." *International Conference on Learning Representations (ICLR)*.
9. Robbins, H., & Monro, S. (1951). "A Stochastic Approximation Method." *The Annals of Mathematical Statistics*, 22(3), 400–407.
10. Santurkar, S., Tsipras, D., Ilyas, A., & Madry, A. (2018). "How Does Batch Normalization Help Optimization?" *Advances in Neural Information Processing Systems (NeurIPS)*.
11. Smith, S. L., Kindermans, P.-J., Ying, C., & Le, Q. V. (2018). "Don't Decay the Learning Rate, Increase the Batch Size." *International Conference on Learning Representations (ICLR)*.
12. You, Y., Gitman, I., & Ginsburg, B. (2017). "Large Batch Training of Convolutional Networks." *arXiv:1708.03888*.
13. You, Y., Li, J., Reddi, S., et al. (2020). "Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes." *International Conference on Learning Representations (ICLR)*.
14. Yu, G.-I., Jeong, J. S., Kim, G.-W., Kim, S., & Chun, B.-G. (2022). "Orca: A Distributed Serving System for Transformer-Based Generative Models." *16th USENIX Symposium on Operating Systems Design and Implementation (OSDI)*.
