See also: machine learning terms, batch size, gradient descent
In machine learning, a batch is the set of training examples used in one iteration of model parameter updates. The batch size determines how many examples are included in each batch. Rather than processing an entire dataset at once or one sample at a time, practitioners divide data into batches to balance computational efficiency, memory usage, and optimization quality.
Batches play a central role in virtually every stage of the machine learning pipeline, from gradient descent optimization during training to batch normalization layers within network architectures, and even to batched request handling during inference. Understanding how batches work, and how their size affects training dynamics, is one of the most practical concerns in applied deep learning.
Imagine you are a teacher grading a stack of 1,000 homework papers. You could try to read all 1,000 papers before deciding what to teach differently tomorrow, but that would take forever. Or, you could look at just one paper at a time, but each single paper might not tell you much about what the whole class needs. A good middle ground is to grab a small pile of papers (say 32), read through them, notice a pattern ("most students got question 5 wrong"), and adjust your lesson plan. Then you grab the next pile of 32 and keep refining. Each small pile is a batch. The number of papers in the pile is the batch size. Once you have gone through all 1,000 papers, you have finished one epoch.
In machine learning, the "papers" are data samples, "reading them" is computing how wrong the model's predictions are (the loss), and "adjusting the lesson plan" is updating the model's parameters. Working in batches lets the model learn efficiently without needing to see the entire dataset before making any improvement.
The concept of a batch is inseparable from the optimization algorithm used to train a model. In gradient descent, the model computes a loss function over some portion of the data, calculates gradients of that loss with respect to the model parameters, and then updates the parameters to reduce the loss. The portion of data used for each such computation is the batch.
There are three main variants of gradient descent, distinguished by how much data goes into each batch.
In full-batch (or simply "batch") gradient descent, the entire training dataset is used to compute the gradient at each step. This produces the most accurate estimate of the true gradient, since it averages over all examples. However, for large datasets, this approach is often impractical because it requires loading the entire dataset into memory at once and performing a full forward and backward pass before making even a single parameter update.
Full-batch gradient descent follows a smooth, deterministic path toward a minimum of the loss surface. While this sounds desirable, the lack of noise in the gradient estimate means that training can get stuck in sharp local minima or saddle points, which may hurt generalization performance on unseen data.
Stochastic gradient descent (SGD) sits at the opposite extreme. Each parameter update is computed from a single training example (a batch size of 1). The resulting gradient estimate is very noisy because it reflects only one data point rather than the full distribution. This noise, however, can be beneficial: it helps the optimizer escape shallow local minima and saddle points, often leading to solutions that generalize better.
The downside is that SGD makes very slow progress in terms of wall-clock time. Each update processes only one example, and modern hardware (particularly GPUs) is designed for parallel computation over large arrays. Processing one example at a time leaves most of the GPU's compute capacity idle.
Mini-batch gradient descent is the practical middle ground and the default approach in modern deep learning. Each update uses a small random subset (the mini-batch) of the training data, typically between 16 and 8,192 examples. This offers several advantages:
- Gradient estimates are far less noisy than single-sample SGD, yet retain enough noise to provide a mild regularizing effect.
- Batches are large enough to exploit the parallelism of GPUs and other accelerators.
- Memory requirements stay moderate, since only one batch's worth of activations must be held at a time.
- Updates happen far more often than with full-batch gradient descent, so the model improves after seeing only a fraction of the data.
In practice, when researchers and practitioners say "SGD" they almost always mean mini-batch SGD, not the single-sample variant.
| Variant | Batch size | Gradient noise | GPU utilization | Memory cost | Typical use |
|---|---|---|---|---|---|
| Full-batch | Entire dataset | None | High (if dataset fits) | Very high | Small datasets, convex problems |
| Stochastic (SGD) | 1 | Very high | Very low | Minimal | Rare in practice today |
| Mini-batch | 16 to 8,192+ | Moderate | High | Moderate | Standard for neural network training |
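The difference between the three variants is easiest to see in code. The sketch below uses a toy NumPy linear regression; the data, learning rate, and epoch count are arbitrary illustrative choices, but the loop structure (shuffle, slice a batch, update) is the same one used in real training.

```python
import numpy as np

# Toy regression data: y = 3x + noise (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 1))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=1000)

def gradient(w, X_batch, y_batch):
    """Mean-squared-error gradient for a single weight, averaged over the batch."""
    preds = w * X_batch[:, 0]
    return np.mean(2.0 * (preds - y_batch) * X_batch[:, 0])

def train(batch_size, lr=0.05, epochs=5):
    w, n = 0.0, len(X)
    for _ in range(epochs):
        perm = rng.permutation(n)                      # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]       # one batch
            w -= lr * gradient(w, X[idx], y[idx])      # one iteration = one update
    return w

print(train(batch_size=len(X)))   # full-batch: only 5 updates in total
print(train(batch_size=1))        # stochastic: 5,000 noisy updates
print(train(batch_size=32))       # mini-batch: ~160 moderately noisy updates
```

In this toy setup, the full-batch run has made only five (accurate) updates and is still far from the true slope, while the mini-batch run has made roughly 160 and is essentially converged, a small-scale illustration of why update frequency matters as much as gradient quality.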
Three closely related terms describe the structure of a training loop: epoch, iteration (also called a training step), and batch. Their relationship can be expressed with a simple formula.
An epoch is one complete pass through the entire training dataset, where every sample has been seen exactly once. An iteration is one parameter update, which consumes one batch of data. The number of iterations per epoch is therefore:
Iterations per epoch = ceil(Total training samples / Batch size)
If a dataset contains 10,000 samples and the batch size is 256, one epoch consists of ceil(10,000 / 256) = 40 iterations (the last batch may be smaller if the dataset size is not evenly divisible by the batch size). Over 50 epochs, the model performs 40 x 50 = 2,000 total parameter updates.
| Term | Definition | Formula |
|---|---|---|
| Batch size | Number of samples in one batch | Chosen by the practitioner |
| Iteration | One forward pass, backward pass, and parameter update on one batch | Iterations per epoch = ceil(N / batch size) |
| Epoch | One full pass through all N training samples | Total iterations = iterations per epoch x epochs |
This relationship has practical consequences. Doubling the batch size halves the number of iterations per epoch and therefore halves the number of gradient updates. If no other hyperparameters are changed, the model sees the same total data but performs half as many updates, each using a more accurate gradient estimate. Whether this speeds up or slows down convergence depends on the learning rate and the characteristics of the loss landscape.
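The arithmetic itself is trivial but worth making explicit; the snippet below reproduces the worked example above and shows how doubling the batch size roughly halves the number of updates.

```python
import math

n_samples, epochs = 10_000, 50

for batch_size in (128, 256, 512):
    iters_per_epoch = math.ceil(n_samples / batch_size)
    print(batch_size, iters_per_epoch, iters_per_epoch * epochs)
# 128 -> 79 iterations/epoch, 3,950 total updates
# 256 -> 40 iterations/epoch, 2,000 total updates
# 512 -> 20 iterations/epoch, 1,000 total updates
```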
Batch size is one of the most important hyperparameters in deep learning. It affects convergence speed, generalization ability, memory consumption, and hardware utilization. The interactions between these factors are nuanced and have been the subject of extensive research.
Larger batches produce lower-variance gradient estimates, which means each parameter update points more reliably toward the direction that reduces the loss. This allows the optimizer to take larger steps without overshooting, and in principle, training progresses faster per update. However, the number of updates per epoch decreases as batch size increases (since each update consumes more data), so the relationship between batch size and wall-clock convergence is not straightforward.
Smaller batches, by contrast, produce noisier gradients. Each update is less reliable on its own, but more updates happen per epoch. The noise acts as an implicit form of regularization, preventing the model from fitting too precisely to the training data.
One of the most cited findings in the batch size literature comes from Keskar et al. (2017), who observed that large-batch training tends to converge to "sharp" minima of the loss surface, while small-batch training tends to find "flat" minima. A sharp minimizer sits in a narrow, steep valley, meaning that even slight perturbations to the parameters cause a large increase in loss. A flat minimizer sits in a broad, gently sloped region where small parameter changes have little effect on the loss.
The practical consequence is that models trained with large batches often achieve low training loss but higher test loss compared to models trained with small batches. This phenomenon is known as the generalization gap. The noise inherent in small-batch gradient estimates is believed to help the optimizer avoid sharp minima and settle in flatter regions that transfer better to held-out data.
That said, subsequent research has shown that the generalization gap can be narrowed or eliminated with proper learning rate tuning, warmup schedules, and other techniques. The relationship between batch size and generalization is not an immutable law; it depends on the optimizer, the learning rate schedule, and the training duration.
Batch size and learning rate are tightly coupled. When the batch size increases, the variance of the gradient estimate decreases, and each step becomes more deterministic. To maintain a similar training trajectory, the learning rate typically needs to increase as well.
The linear scaling rule, popularized by Goyal et al. (2017) in their work on training ResNet-50 on ImageNet, states: when the batch size is multiplied by a factor of k, the learning rate should also be multiplied by k. Using this rule, along with a gradual warmup period in the first few epochs, they trained ResNet-50 with a batch size of 8,192 across 256 GPUs and matched the accuracy of smaller-batch baselines, completing training in about one hour.
The linear scaling rule has practical limits. For very large batch sizes (beyond approximately 8,192), it begins to break down because gradient noise reduction does not scale linearly with batch size: doubling the batch size reduces the gradient standard deviation by a factor of sqrt(2), not by a factor of 2. This observation motivates the square root scaling rule, where the learning rate is multiplied by sqrt(k) instead of k, which is sometimes more stable for very large batches.
Smith et al. (2018) took this idea further in "Don't Decay the Learning Rate, Increase the Batch Size." They showed that the common practice of decaying the learning rate during training is mathematically equivalent to increasing the batch size while keeping the learning rate fixed. This insight provides an alternative path to efficient large-batch training: instead of reducing the learning rate at scheduled epochs, one can increase the batch size, which improves parallelism and reduces the total number of parameter updates needed.
| Scaling approach | Key idea | Reference |
|---|---|---|
| Linear scaling rule | Multiply learning rate by k when batch size is multiplied by k; use warmup | Goyal et al. (2017) |
| Batch size increase schedule | Increase batch size during training instead of decaying learning rate | Smith et al. (2018) |
| Square root scaling | Scale learning rate by sqrt(k) for batch size increase of k | Hoffer et al. (2017) |
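As a concrete illustration of the two scaling rules, the snippet below starts from a hypothetical baseline (batch size 256 at learning rate 0.1, values chosen only for the example) and computes the scaled learning rate under each rule.

```python
base_batch, base_lr = 256, 0.1        # hypothetical baseline configuration

def scaled_lr(new_batch, rule="linear"):
    k = new_batch / base_batch        # batch-size scaling factor
    return base_lr * (k if rule == "linear" else k ** 0.5)

for b in (512, 1024, 8192):
    print(b, round(scaled_lr(b, "linear"), 4), round(scaled_lr(b, "sqrt"), 4))
# 512  -> 0.2 (linear)  vs 0.1414 (sqrt)
# 1024 -> 0.4 (linear)  vs 0.2    (sqrt)
# 8192 -> 3.2 (linear)  vs 0.5657 (sqrt)
```

At batch size 8,192 the linear rule recovers the 3.2 learning rate used by Goyal et al. (2017), while the square-root rule is considerably more conservative.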
The noise in mini-batch gradient estimates is not merely a nuisance; it has a well-defined statistical structure that plays a central role in optimization and generalization.
When computing the gradient over a mini-batch of size B drawn from a dataset of size N, the mini-batch gradient is an unbiased estimator of the full-batch gradient. Its variance is approximately:
Var(mini-batch gradient) = (1 - B/N) x (sigma^2 / B)
where sigma^2 is the per-sample gradient variance. The factor (1 - B/N) is a finite-population correction that becomes negligible when B is much smaller than N, which is the typical case. In practice, the variance scales roughly as sigma^2 / B, meaning that doubling the batch size halves the gradient variance.
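This 1/B behavior is easy to verify numerically. The simulation below draws synthetic per-sample gradients with a known variance (the numbers are arbitrary) and measures how the variance of the batch-averaged gradient shrinks as B grows.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic scalar per-sample gradients with sigma = 2 (so sigma^2 = 4)
per_sample_grads = rng.normal(loc=1.0, scale=2.0, size=100_000)

for B in (1, 16, 64, 256):
    # Sample many mini-batches and measure the variance of the batch-mean gradient
    batches = rng.choice(per_sample_grads, size=(10_000, B))
    print(B, round(batches.mean(axis=1).var(), 4))
# Shrinks roughly as sigma^2 / B: ~4.0, ~0.25, ~0.0625, ~0.0156
```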
McCandlish et al. (2018) from OpenAI formalized this relationship through the gradient noise scale, defined as the ratio of the gradient noise to the gradient signal. The gradient noise scale determines a critical batch size: below this threshold, doubling the batch size roughly halves training time (the noise is the bottleneck), while above it, further increases yield diminishing returns (the signal dominates). In their experiments, critical batch sizes ranged from around 20 for small autoencoders on SVHN to millions for Dota 2 reinforcement learning agents.
The gradient noise scale typically grows during training as the model approaches a minimum and the gradient signal shrinks relative to the noise. This observation supports the practice of increasing the batch size over the course of training, as used in GPT-3 pretraining, where the batch size was progressively raised based on gradient noise scale measurements.
Modern GPUs and accelerators are designed for massively parallel computation. A neural network's forward and backward passes consist primarily of matrix multiplications, and these operations are most efficient when the matrices are large. The batch dimension is one of the dimensions of these matrices, so increasing the batch size directly increases the GPU's arithmetic throughput, up to a point.
The total GPU memory consumed during training comes from several sources:
| Component | Description | Scales with batch size? |
|---|---|---|
| Model parameters | Weights and biases of the network | No |
| Optimizer state | Momentum buffers, second-moment estimates (for Adam) | No |
| Activations | Intermediate outputs stored for backpropagation | Yes (linearly) |
| Gradients | Parameter gradients are fixed-size; temporary activation gradients in the backward pass grow with the batch | Partially |
| Input data | The batch itself | Yes (linearly) |
Activations are typically the largest memory consumer for deep networks, because every layer's output must be saved for use during the backward pass. Since activation memory grows linearly with batch size, the maximum batch size is often constrained by the available GPU VRAM. Exceeding this limit triggers an out-of-memory error.
To maximize GPU utilization, practitioners generally want the largest batch size that fits in memory, though generalization considerations may argue for a smaller batch. A common workflow is:
- Start from a moderate batch size (often a power of two such as 32 or 64).
- Increase it, typically by doubling, until training hits an out-of-memory error (a probing loop of this kind is sketched below).
- Back off to the largest size that fits, leaving some headroom for memory fragmentation.
- Re-tune the learning rate for the chosen batch size, using the scaling rules discussed above.
Using powers of two for batch sizes is a widespread convention. The rationale is that GPU memory is organized in pages whose sizes are powers of two, and NVIDIA's documentation recommends that matrix dimensions be multiples of 8 for optimal Tensor Core utilization. In practice, experiments by Raschka (2022) and others have found that non-power-of-two batch sizes often perform just as well, but the convention persists because it simplifies benchmarking and comparison.
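One way to automate the "increase until it breaks" step is to probe with dummy data, as in the hedged sketch below. The model, input width, and doubling policy are placeholders; note also that on older PyTorch releases a CUDA out-of-memory condition surfaces as a plain RuntimeError rather than torch.cuda.OutOfMemoryError.

```python
import torch
import torch.nn as nn

# Illustrative stand-in model and input width; substitute your own
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10)).cuda()

def fits_in_memory(batch_size):
    """Run one forward/backward pass on dummy data and report whether it fits."""
    try:
        x = torch.randn(batch_size, 1024, device="cuda")
        model(x).sum().backward()
        return True
    except torch.cuda.OutOfMemoryError:
        return False
    finally:
        model.zero_grad(set_to_none=True)
        torch.cuda.empty_cache()

batch_size = 32
while fits_in_memory(batch_size * 2):   # keep doubling until the next size fails
    batch_size *= 2
print("Largest power-of-two batch size that fits:", batch_size)
```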
When the desired effective batch size is too large to fit in GPU memory, gradient accumulation provides a workaround. Instead of processing the full batch in one pass, the training loop processes several smaller micro-batches sequentially, accumulating (summing) the gradients from each, and then performs a single parameter update after all micro-batches have been processed.
For example, suppose the target effective batch size is 1,024 but only 256 examples fit in memory. The training loop processes four micro-batches of 256, accumulates the gradients, and then updates the model once. From the optimizer's perspective, this is equivalent to a single batch of 1,024.
Gradient accumulation trades time for memory. Each micro-batch is processed sequentially rather than in parallel, so the total training time is longer than it would be if the full batch fit in memory. However, it allows practitioners to simulate large-batch training on hardware that would otherwise be insufficient.
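A minimal sketch of this pattern in PyTorch follows. It assumes a model, loss criterion, optimizer, and a loader yielding micro-batches of 256 have already been defined; the names are illustrative and match the DataLoader example later in this article.

```python
accumulation_steps = 4        # 4 micro-batches of 256 -> effective batch size of 1,024

optimizer.zero_grad()
for step, (batch_X, batch_y) in enumerate(loader):
    loss = criterion(model(batch_X), batch_y)
    # Gradients accumulate across micro-batches; dividing by the number of steps
    # keeps the result equal to the average over the full effective batch.
    (loss / accumulation_steps).backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()      # one parameter update per 4 micro-batches
        optimizer.zero_grad()
```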
There is one notable complication: standard batch normalization computes statistics (mean and variance) over the micro-batch, not the full effective batch. Since each micro-batch is smaller, these statistics are noisier, which can degrade training stability. Solutions include synchronized batch normalization across micro-batches or replacing batch normalization with alternatives like Group Normalization or Layer Normalization.
In data parallelism setups where multiple GPUs each process a portion of the batch, gradient accumulation is often combined with all-reduce communication patterns. Each GPU computes gradients on its local micro-batch, the gradients are summed across GPUs, and the combined gradient is used for the parameter update. This approach allows the effective batch size to scale with the number of GPUs.
Batch normalization (BN), introduced by Ioffe and Szegedy (2015), is one of the most widely used techniques in deep learning, and it depends directly on the batch. During training, BN normalizes the activations of each layer by subtracting the mean and dividing by the standard deviation computed over the current mini-batch, for each feature independently. Two learnable parameters (scale and shift) are then applied to allow the network to undo the normalization if that is optimal.
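The training-time computation is only a few lines. The sketch below reimplements it for a 2-D (batch, features) tensor; it is a simplified stand-in for torch.nn.BatchNorm1d and omits the running averages used at inference time, which are discussed below.

```python
import torch

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Normalize a (batch, features) tensor using statistics of the current mini-batch."""
    mean = x.mean(dim=0)                    # per-feature mean over the batch
    var = x.var(dim=0, unbiased=False)      # per-feature variance over the batch
    x_hat = (x - mean) / torch.sqrt(var + eps)
    return gamma * x_hat + beta             # learnable scale and shift

x = torch.randn(32, 8)                      # mini-batch of 32 samples, 8 features
out = batch_norm_train(x, gamma=torch.ones(8), beta=torch.zeros(8))
print(out.mean(dim=0), out.std(dim=0))      # each feature now has mean ~0, std ~1
```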
The original motivation was to address "internal covariate shift," the idea that the distribution of each layer's inputs changes as the preceding layers' parameters are updated. Later research has debated whether reducing internal covariate shift is actually the reason BN works. Santurkar et al. (2018) argued that BN's benefits come more from smoothing the loss landscape, making it easier for the optimizer to navigate.
Regardless of the underlying mechanism, the practical benefits are well established:
- Training tolerates higher learning rates without diverging.
- Convergence is faster, often requiring fewer epochs to reach a given accuracy.
- The network is less sensitive to weight initialization.
- The noise in the per-batch statistics acts as a mild regularizer.
Batch normalization's dependence on the batch creates certain limitations. During inference, there is no mini-batch to compute statistics from, so BN layers use running averages of mean and variance accumulated during training. When the training batch size is very small, the per-batch statistics become noisy and unreliable, degrading performance. This has motivated alternatives like Layer Normalization (which normalizes over features rather than over the batch) and Group Normalization (which normalizes over groups of channels), both of which are independent of batch size.
| Normalization method | Normalizes over | Depends on batch size? | Typical use case |
|---|---|---|---|
| Batch Normalization | Batch dimension | Yes | CNNs, large batch sizes |
| Layer Normalization | Feature dimension | No | Transformers, RNNs |
| Group Normalization | Groups of channels | No | Small batch sizes, detection |
| Instance Normalization | Single instance per channel | No | Style transfer |
Training with very large batches (tens of thousands of examples or more) can dramatically reduce training time by increasing parallelism across many GPUs. However, naively scaling up the batch size often degrades accuracy due to the generalization gap discussed earlier. Several specialized techniques have been developed to make large-batch training work.
LARS (Layer-wise Adaptive Rate Scaling), introduced by You et al. (2017), addresses the observation that different layers of a deep neural network may need very different learning rates. LARS computes a per-layer learning rate by looking at the ratio of the weight norm to the gradient norm for each layer, then scales the base learning rate accordingly. Using LARS, You et al. trained ResNet-50 on ImageNet with batch sizes up to 32,768 while maintaining accuracy.
However, LARS was designed primarily for networks trained with SGD with momentum. It performs poorly on attention-based architectures like BERT.
LAMB (Layer-wise Adaptive Moments optimizer for Batch training), introduced by You et al. (2020), extends the LARS idea to the Adam optimizer. LAMB computes the Adam update for each layer and then scales it by a per-layer trust ratio, which limits the relative change to any layer's weights. LAMB enabled training BERT with a batch size of 32,768 without accuracy loss, reducing training time from 3 days to 76 minutes on a TPUv3 Pod.
| Optimizer | Base method | Per-layer adaptation | Designed for | Key result |
|---|---|---|---|---|
| LARS | SGD + Momentum | Weight norm / gradient norm ratio | CNNs | ResNet-50 with batch size 32K on ImageNet |
| LAMB | Adam | Trust ratio on Adam update | Transformers, BERT | BERT training in 76 minutes |
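The core idea shared by both optimizers, scaling each layer's step by a trust ratio derived from its weight and gradient norms, can be sketched in a few lines. The version below is a deliberately simplified illustration of the LARS-style scaling only; it omits momentum, weight decay, and the clipping used in the published algorithms.

```python
import torch

def layerwise_scaled_step(params, base_lr=0.1, trust_coefficient=1e-3):
    """Apply an SGD step whose size is adapted per layer (simplified LARS-style scaling)."""
    for p in params:
        if p.grad is None:
            continue
        w_norm = p.detach().norm()
        g_norm = p.grad.detach().norm()
        # Trust ratio: layers with large weights and small gradients get a larger local step
        local_lr = trust_coefficient * w_norm / g_norm if w_norm > 0 and g_norm > 0 else 1.0
        p.data.add_(p.grad, alpha=-base_lr * float(local_lr))
```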
Large-batch training often requires a learning rate warmup period at the beginning of training. During the first few epochs, the learning rate is gradually increased from a small value to the target value. This prevents the optimizer from making overly large updates early in training when the model parameters are far from any reasonable solution and the loss landscape is poorly conditioned.
Goyal et al. (2017) used a linear warmup over the first 5 epochs when training with a batch size of 8,192 on ImageNet. This simple technique proved essential for making the linear scaling rule work. Warmup duration typically scales with the batch size scaling factor: if the batch size is increased by 8x relative to the baseline, a warmup of roughly 5 epochs is often sufficient, while larger scaling factors may need longer warmup.
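A warmup of this kind is straightforward to express with a standard scheduler. The sketch below uses PyTorch's LambdaLR with a stand-in model; the target learning rate of 3.2 corresponds to the linear scaling rule applied to a 0.1 baseline at batch size 8,192, and the inner training loop is elided.

```python
import torch

model = torch.nn.Linear(10, 10)                      # stand-in for a real network
warmup_epochs, target_lr, num_epochs = 5, 3.2, 90    # 3.2 = 0.1 x (8,192 / 256)

optimizer = torch.optim.SGD(model.parameters(), lr=target_lr, momentum=0.9)

# Ramp the learning rate linearly up to target_lr over the first 5 epochs, then
# hold it constant (a decay schedule would normally take over afterwards).
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda epoch: min(1.0, (epoch + 1) / warmup_epochs)
)

for epoch in range(num_epochs):
    # ... one epoch of mini-batch updates on `model` goes here ...
    scheduler.step()
```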
Modern deep learning frameworks provide built-in utilities to divide datasets into batches, shuffle them, and load them efficiently.
In PyTorch, the torch.utils.data.DataLoader class is the primary interface for batching. It wraps a Dataset object and yields batches of data during training. Key parameters include:
| Parameter | Description | Typical value |
|---|---|---|
| batch_size | Number of samples per batch | 32, 64, 128 |
| shuffle | Whether to randomize sample order each epoch | True for training, False for evaluation |
| num_workers | Number of subprocesses for parallel data loading | 2 to 8 |
| drop_last | Whether to drop the last incomplete batch | True when batch normalization is used |
| pin_memory | Whether to use pinned (page-locked) memory for faster GPU transfer | True when using CUDA |
A typical usage pattern looks like:
```python
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(X_train, y_train)      # X_train, y_train: pre-loaded tensors
loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=4)

for epoch in range(num_epochs):
    for batch_X, batch_y in loader:
        ...  # forward pass, loss computation, backward pass, optimizer step
```
The DataLoader handles shuffling the dataset at the start of each epoch, dividing it into batches, and optionally loading batches in parallel using worker processes so that data preparation overlaps with GPU computation.
In TensorFlow, the tf.data.Dataset API provides similar functionality. The .batch() method groups consecutive elements into batches, and .prefetch() overlaps data preprocessing with model execution for higher throughput:
```python
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices((X_train, y_train))
dataset = dataset.shuffle(buffer_size=10000)    # shuffle within a 10,000-element buffer
dataset = dataset.batch(64)                     # group consecutive elements into batches of 64
dataset = dataset.prefetch(tf.data.AUTOTUNE)    # overlap preprocessing with model execution
```
The prefetch transformation is particularly important for performance. While the model is executing training step s, the input pipeline is already reading and preparing the data for step s+1. Setting the buffer size to tf.data.AUTOTUNE lets the runtime dynamically tune prefetch depth based on available resources.
Batching is not only relevant during training. At inference time, batching multiple requests together can significantly increase throughput by amortizing the cost of loading model weights from memory.
For most neural network architectures, the bottleneck during inference is not computation but memory bandwidth: the time it takes to move model weights from GPU memory (VRAM) to the GPU's compute cores. Once the weights for a given layer are loaded into the GPU's cache, processing one example or many examples through that layer costs roughly the same in terms of memory traffic. By batching N requests, the weight-loading cost is amortized over N predictions, dramatically increasing throughput.
The tradeoff is latency. A server that batches incoming requests must wait until a batch is full (or until a timeout expires) before running inference, which increases the time any individual request spends waiting. This creates a throughput-latency tension that is central to production ML serving.
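A toy version of the "batch until full or a timeout expires" policy is sketched below using a thread-safe queue. The queue, the batch-size cap, the latency budget, and the run_model callable are all illustrative placeholders rather than any particular serving framework's API.

```python
import queue
import time

request_queue = queue.Queue()     # incoming requests are placed here by handler threads
MAX_BATCH = 8                     # largest batch the server will form
MAX_WAIT_S = 0.01                 # latency budget for filling a batch

def serve_forever(run_model):
    """Collect requests until the batch is full or the timeout expires, then run them together."""
    while True:
        batch = [request_queue.get()]                  # block until at least one request arrives
        deadline = time.monotonic() + MAX_WAIT_S
        while len(batch) < MAX_BATCH:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(request_queue.get(timeout=remaining))
            except queue.Empty:
                break
        run_model(batch)                               # one batched forward pass
```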
Batch inference, also called offline inference, refers to generating predictions on a large set of data all at once rather than responding to individual requests in real time. This approach is common when immediate responses are not required. For example, a recommendation system might score all products for all users overnight, store the results in a database, and serve pre-computed recommendations during the day.
Batch inference typically runs on a recurring schedule (hourly, daily, or weekly) and can take advantage of distributed computing frameworks like Apache Spark or cloud services like AWS Batch and Google Cloud Vertex AI Batch Predictions. Because latency is not a concern, batch inference jobs can use larger batch sizes, optimize for throughput, and run on cheaper preemptible or spot instances.
| Inference mode | Latency requirement | Throughput | Cost efficiency | Use case |
|---|---|---|---|---|
| Real-time (online) | Milliseconds | Lower | Higher per prediction | Chatbots, search ranking |
| Batch (offline) | Hours to days | Very high | Lower per prediction | Recommendation scoring, risk analysis |
For large language models (LLMs) and other autoregressive models, batching strategies have evolved considerably, from static batching, through dynamic batching, to continuous batching (introduced by Orca, Yu et al., 2022), which schedules requests at the granularity of individual generation steps. The table below summarizes the three.
vLLM (Kwon et al., 2023) combined continuous batching with PagedAttention, a memory management technique that reduces GPU memory fragmentation for the key-value cache. Together, these optimizations have achieved up to 23x throughput improvements over naive batched inference for LLMs. The continuous batching paradigm has since been adopted by most major inference frameworks, including TensorRT-LLM (which calls it "in-flight batching"), Hugging Face TGI, and SGLang.
| Batching strategy | How it works | Pros | Cons |
|---|---|---|---|
| Static | Fixed batch, return when all done | Simple to implement | Wasted GPU cycles on completed sequences |
| Dynamic | Batch until full or timeout | Better latency than static | Still waits for slowest request in batch |
| Continuous | Insert/remove requests at each iteration | Near-optimal GPU utilization | More complex scheduling logic |
Mixed precision training uses 16-bit floating-point (FP16 or BF16) for most computations while keeping a master copy of the weights in 32-bit (FP32) for numerical stability. Because 16-bit values take half the memory of 32-bit values, mixed precision effectively doubles the amount of activation memory available, allowing the batch size to be roughly doubled for the same GPU memory budget.
This technique was formalized by Micikevicius et al. (2018) and has become standard practice for training large models. NVIDIA's Tensor Cores are designed to accelerate FP16 matrix multiplications, so mixed precision training also improves computational throughput independent of the batch size benefit.
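In PyTorch, automatic mixed precision is typically enabled with an autocast context and a gradient scaler, as in the sketch below; the model, criterion, optimizer, and loader are assumed to be defined as in the earlier DataLoader example.

```python
import torch

scaler = torch.cuda.amp.GradScaler()          # rescales the loss to avoid FP16 underflow

for batch_X, batch_y in loader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():           # forward pass runs in reduced precision where safe
        loss = criterion(model(batch_X), batch_y)
    scaler.scale(loss).backward()             # backward pass on the scaled loss
    scaler.step(optimizer)                    # unscale gradients, then update the FP32 weights
    scaler.update()                           # adjust the scale factor for the next iteration
```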
Choosing a batch size involves balancing multiple competing objectives. There is no single correct batch size for all situations, but several practical guidelines have emerged from research and industry experience.
| Task type | Typical batch sizes | Notes |
|---|---|---|
| Image classification (CNNs) | 32 to 256 | Higher with large-batch optimizers (LARS) |
| Object detection | 2 to 16 | High-resolution images consume more memory |
| Language modeling (Transformers) | 256 to 8,192+ sequences | Large-scale pretraining batches are often reported in tokens instead, frequently millions of tokens per batch |
| Fine-tuning LLMs | 1 to 32 | Small batches due to model size and long sequences |
| Reinforcement learning | Varies widely | Depends on environment and on-policy vs. off-policy |
| GANs | 16 to 128 | Smaller batches sometimes improve stability |
Smith et al. (2018) showed that the ratio of learning rate to batch size (or equivalently, the noise scale in SGD) is what controls the training dynamics. This means there are multiple paths to the same outcome: decaying the learning rate at a fixed batch size, increasing the batch size at a fixed learning rate, or combining the two all reduce the SGD noise scale in the same way.
Practitioners can use these equivalences to adapt their training schedule to their hardware constraints. If more GPUs become available mid-training, one can increase the batch size (and learning rate) without starting over.
The use of mini-batches for training neural networks predates the modern deep learning era. The idea of using subsets of training data for gradient estimation dates back to the early stochastic approximation methods of Robbins and Monro (1951). LeCun et al. (1998) discussed practical batch size choices for training convolutional neural networks, noting the tradeoff between gradient accuracy and computational cost.
The explosion of interest in batch size optimization came with the rise of large-scale distributed training in the 2010s. As models grew from millions to billions of parameters, and datasets from thousands to billions of examples, the ability to train efficiently across hundreds or thousands of GPUs became a competitive advantage. The batch size was the primary lever for distributing work across devices.
| Year | Milestone | Reference |
|---|---|---|
| 1951 | Stochastic approximation methods | Robbins and Monro |
| 1998 | Practical batch size guidelines for CNNs | LeCun et al. |
| 2015 | Batch Normalization | Ioffe and Szegedy |
| 2017 | Linear scaling rule; ImageNet in 1 hour | Goyal et al. |
| 2017 | LARS optimizer for large-batch CNN training | You et al. |
| 2017 | Generalization gap and sharp minima | Keskar et al. (published ICLR 2017) |
| 2018 | Critical batch size and gradient noise scale | McCandlish et al. |
| 2018 | Batch size increase as alternative to LR decay | Smith et al. |
| 2020 | LAMB optimizer; BERT in 76 minutes | You et al. |
| 2022 | Continuous batching (Orca) for LLM inference | Yu et al. |
| 2023 | vLLM with PagedAttention for batched inference | Kwon et al. |