Gradient Accumulation
Last reviewed
Sources
12 citations
Review status
Source-backed
Revision
v7 · 4,249 words
Gradient accumulation is a deep learning training technique that simulates a large batch size on limited GPU memory by summing the gradients from several small mini-batches (called micro-batches) and performing a single optimizer step only after the last one. Instead of computing one gradient on a large batch that may not fit in memory, the model runs N smaller forward and backward passes, accumulates their gradients in place, then updates the weights as if it had processed a batch N times larger. The relationship is effective_batch_size = micro_batch_size * accumulation_steps * num_devices. As the Unsloth engineering team put it, "The goal of gradient accumulation is to mimic full batch training with reduced VRAM usage." [1]
This technique is fundamental to modern large language model training, where effective batch sizes of millions of tokens are standard but no single GPU can hold even a fraction of that data in memory at once. Llama 2, for example, was trained with a global batch size of 4 million tokens across every model size in the family [6]. Gradient accumulation trades wall-clock time for memory: it does not make training faster, but it lets memory-bound hardware reproduce the large-batch recipes used to train frontier models [1].
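In standard training with stochastic gradient descent (or its variants), each training step consists of: (1) a forward pass that computes the loss on a mini-batch, (2) a backward pass that computes the gradients of that loss, and (3) an optimizer step that updates the weights using those gradients.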
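With gradient accumulation, steps 1 and 2 are repeated N times (where N is the number of accumulation steps) before performing step 3. The process becomes: for each of the N micro-batches, run the forward pass, divide the loss by N, and run the backward pass so that the new gradients add to those already stored; after the N-th micro-batch, perform a single optimizer step and reset the gradients to zero.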
The division of the loss by N in each backward pass ensures that the accumulated gradient is mathematically equivalent to the gradient computed on a single batch of size N times the micro-batch size. (As discussed below, this naive per-micro-batch averaging is correct for a simple mean loss but was the source of a subtle 2024 bug for token-normalized losses.)
Let B be the micro-batch size and N be the number of accumulation steps. The effective batch size is B_eff = N * B.
In standard training with batch size B_eff, the gradient update is:
g = (1 / B_eff) * sum_{i=1}^{B_eff} grad(L_i)
With gradient accumulation, processing N micro-batches of size B:
g = (1 / N) * sum_{k=1}^{N} [ (1 / B) * sum_{j=1}^{B} grad(L_{k,j}) ]
These two expressions are mathematically identical: both compute the average gradient over B_eff = N * B examples. This means that, for most purposes, training with gradient accumulation produces the same result as training with the larger batch size directly [2]. Hugging Face describes the property plainly: gradient accumulation is "supposed to be mathematically equivalent to full batch training." [10]
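The equivalence can be checked numerically. The following is a minimal sketch using a one-parameter model with squared-error loss; the model, the data values, and the helper names are illustrative assumptions, not taken from any cited source. It relies on all micro-batches having the same size B, as in the formulas above.

```python
import math

def grad_single(w, x, y):
    """Gradient of the squared error (w*x - y)**2 with respect to w."""
    return 2.0 * (w * x - y) * x

def full_batch_grad(w, examples):
    """Mean gradient over all examples at once (the B_eff formula)."""
    return sum(grad_single(w, x, y) for x, y in examples) / len(examples)

def accumulated_grad(w, examples, micro_batch_size):
    """Sum of per-micro-batch mean gradients, each scaled by 1/N."""
    micro_batches = [
        examples[i:i + micro_batch_size]
        for i in range(0, len(examples), micro_batch_size)
    ]
    n = len(micro_batches)
    total = 0.0
    for mb in micro_batches:
        # Mirrors calling backward() on (loss / N) for each micro-batch.
        total += full_batch_grad(w, mb) / n
    return total

# Made-up (x, y) pairs: 8 examples, so B_eff = 8.
examples = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0), (4.0, 9.0),
            (5.0, 9.5), (6.0, 13.0), (7.0, 15.0), (8.0, 15.5)]

g_full = full_batch_grad(0.5, examples)
g_accum = accumulated_grad(0.5, examples, micro_batch_size=2)  # N = 4, B = 2
print(math.isclose(g_full, g_accum, rel_tol=1e-9))
```

The two values agree up to floating-point rounding, because summing in a different order can change the last few bits.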
Batch normalization is the primary exception to this mathematical equivalence. During training it normalizes activations using the mean and variance of the current mini-batch and updates its running statistics from them. With gradient accumulation, these statistics are computed over each micro-batch of size B rather than over the full effective batch of size B_eff, so the layer never sees the full effective batch at once.
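This means the normalization statistics are noisier, which can lead to slightly different training dynamics and final performance. In practice, this is often not a significant issue because many modern architectures, including the transformer models used for LLMs, rely on normalization layers that compute their statistics per example rather than per batch: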
| Normalization Type | Affected by Gradient Accumulation? | Explanation |
|---|---|---|
| Batch Normalization | Yes | Statistics computed per micro-batch, not effective batch |
| Layer Normalization | No | Statistics computed per example, batch-size independent |
| Group Normalization | No | Statistics computed per channel group within each example |
| RMSNorm | No | Computed per example; standard in modern LLMs |
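To make the batch normalization row concrete, here is a small sketch comparing the statistics of a full batch with those of its two micro-batches. The activation values are made up for illustration, and `batch_stats` is a hypothetical helper, not a framework API.

```python
import statistics

def batch_stats(values):
    """Mean and population variance, the quantities batch norm computes per batch."""
    return statistics.fmean(values), statistics.pvariance(values)

# One activation value per example; the two halves deliberately differ.
activations = [0.0, 1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 13.0]

# Statistics the layer would see with the full effective batch.
full_mean, full_var = batch_stats(activations)

# Statistics the layer actually sees under accumulation (two micro-batches of 4).
micro_batches = [activations[:4], activations[4:]]
micro_stats = [batch_stats(mb) for mb in micro_batches]

print("full batch:", full_mean, full_var)
print("micro-batches:", micro_stats)
```

Each micro-batch is normalized with its own mean and variance, so the normalized activations differ from what a single pass over the full batch would produce. Per-example normalizations such as layer normalization have no such dependence.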
Gradient accumulation solves a fundamental tension in deep learning: larger batch sizes generally improve training stability and throughput, but GPU memory is finite.
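During training, GPU memory must hold the model parameters, the gradients, the optimizer states, and the activations saved for the backward pass.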
For a 7-billion-parameter model in mixed precision (FP16 weights + FP32 optimizer states), the parameters and optimizer states alone consume roughly 56 GB. Activations scale linearly with batch size and sequence length. On a GPU with 80 GB of memory (e.g., an NVIDIA A100, which pairs 80 GB of HBM2e with about 2 TB/s of bandwidth), the remaining memory after parameters and optimizer states may only accommodate a micro-batch of 1 to 4 sequences [3][8].
Gradient accumulation allows this limited micro-batch size to be scaled to an effective batch size of hundreds or thousands of sequences, matching the batch sizes used in published training recipes.
Larger effective batch sizes produce gradient estimates with lower variance, leading to more stable training. This is particularly important for:
| Aspect | Small Effective Batch | Large Effective Batch (via accumulation) |
|---|---|---|
| Gradient noise | High | Low |
| Training stability | Lower | Higher |
| Convergence speed (in updates) | More updates needed | Fewer updates needed |
| Wall-clock time per update | Faster | Slower (sequential micro-batches) |
| Memory usage per GPU | Lower | Same per micro-batch |
| Learning rate sensitivity | More sensitive | More tolerant of higher learning rates |
PyTorch makes gradient accumulation straightforward because its autograd engine accumulates gradients by default. Calling loss.backward() adds the computed gradients to any existing gradients in the .grad attribute of each parameter, rather than replacing them. This is why optimizer.zero_grad() is explicitly called to reset gradients, and why omitting that call (intentionally) enables accumulation [4].
A standard PyTorch training loop with gradient accumulation looks like:
```python
# Assumes model, criterion, optimizer, and dataloader are already defined.
accumulation_steps = 4
optimizer.zero_grad()

for i, (inputs, targets) in enumerate(dataloader):
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    loss = loss / accumulation_steps  # Normalize loss
    loss.backward()                   # Accumulate gradients

    if (i + 1) % accumulation_steps == 0:
        optimizer.step()       # Update weights
        optimizer.zero_grad()  # Reset gradients
```
Key implementation details:
Divide the loss by accumulation_steps before calling backward(). This ensures the accumulated gradient is the average over all micro-batches, matching the behavior of training with the full effective batch size. (See the 2024 bug section for the important caveat that dividing by the step count is not the same as dividing by the total token count for token-normalized losses.)
When combining gradient accumulation with mixed precision training (using torch.cuda.amp), the gradient scaler should only unscale and step after all micro-batches have been accumulated:
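```python
import torch

# torch.cuda.amp matches the text above; newer PyTorch releases expose the same API under torch.amp.
scaler = torch.cuda.amp.GradScaler()
optimizer.zero_grad()

for i, (inputs, targets) in enumerate(dataloader):
    with torch.cuda.amp.autocast():
        outputs = model(inputs)
        loss = criterion(outputs, targets) / accumulation_steps
    scaler.scale(loss).backward()  # Accumulate scaled gradients

    if (i + 1) % accumulation_steps == 0:
        scaler.step(optimizer)  # Unscales the gradients, then steps the optimizer
        scaler.update()
        optimizer.zero_grad()
```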
The Hugging Face Transformers Trainer exposes gradient accumulation as a single argument, gradient_accumulation_steps, on TrainingArguments. It defaults to 1 (no accumulation), and the documentation describes the effective batch size as per_device_train_batch_size * num_devices * gradient_accumulation_steps [11]. When accumulation is enabled, the Trainer counts one logged "step" per optimizer update rather than per micro-batch, so evaluation, logging, and checkpoint saving fire on the accumulation cadence rather than on every forward pass [11].
```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./output",  # placeholder path; required by older Transformers releases
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,  # effective batch = 4 * num_gpus * 8
    learning_rate=2e-5,
)
```
In October 2024, the broader LLM training community discovered that the gradient accumulation paths in many popular trainers were not, in fact, mathematically equivalent to full-batch training for the cross-entropy loss used in causal language modeling. The bug was rediscovered and reported by researcher Benjamin Marie, shared publicly by Unsloth on October 15, 2024, and patched in Hugging Face Transformers the following day [1][10].
The root cause was loss normalization. Cross-entropy loss for an LLM is normalized by the number of non-padded (non-ignored) tokens. When gradient accumulation computes that mean loss independently for each micro-batch and then sums the results, the denominators do not combine correctly. Unsloth showed that the naive sum produces a loss that is G times too large, where G is the number of accumulation steps. As the Unsloth writeup explains, "gradient accumulation should effectively be equivalent mathematically to full batch training," yet the loss curves with and without accumulation diverged until the denominator was fixed [1]. Hugging Face described the same symptom: gradient accumulation is "supposed to be mathematically equivalent to full batch training; however, losses did not match between training runs where the setting was toggled on and off." [10]
The correct computation, as Hugging Face documented, is to take "the total loss across all batches in a gradient accumulation step divided by the total number of all non padding tokens in those batches," which is not the same as averaging the per-micro-batch losses [10]. In code, the fix switches the loss reduction from a per-batch mean to a sum, then divides once by the global token count:
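```python
from torch import nn

# shift_logits: (num_tokens, vocab_size); shift_labels: (num_tokens,), with -100 marking ignored tokens
loss = nn.functional.cross_entropy(
    shift_logits, shift_labels, ignore_index=-100, reduction="sum"
)
loss = loss / num_items  # total non-padding tokens across the accumulation cycle
```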
The impact was significant. The issue affected not only single-device accumulation but also Distributed Data Parallel and multi-GPU setups, meaning many published fine-tunes and pretraining runs had been trained with a subtly miscalibrated effective loss. After the fix, Unsloth reported that "all loss curves now match up, showing indeed gradient accumulation is equivalent to full batch training." [1] The corresponding Transformers patch landed in pull requests #34191 and #34198, and users were advised to install from the main branch to pick up the fix [10].
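The denominator problem is easy to reproduce with plain arithmetic. The sketch below uses made-up per-token losses for two micro-batches with different numbers of non-padding tokens; the values and variable names are illustrative assumptions, not taken from the Unsloth or Hugging Face code.

```python
def mean(values):
    return sum(values) / len(values)

# Per-token losses for two micro-batches (padding already removed).
micro_batches = [
    [2.0, 2.0, 2.0, 2.0, 2.0, 2.0],  # 6 non-padding tokens
    [8.0, 8.0],                      # 2 non-padding tokens
]
accumulation_steps = len(micro_batches)

# Full-batch reference: one mean over all 8 tokens.
all_tokens = [t for mb in micro_batches for t in mb]
reference = mean(all_tokens)

# Naive accumulation: per-micro-batch mean, each divided by the step count.
naive = sum(mean(mb) / accumulation_steps for mb in micro_batches)

# Fixed accumulation: per-micro-batch sum, divided by the global token count.
num_items = sum(len(mb) for mb in micro_batches)
fixed = sum(sum(mb) / num_items for mb in micro_batches)

print(reference, naive, fixed)
```

The naive version gives the short micro-batch the same weight as the long one, so its two tokens are over-weighted. The sum-then-divide version weights every token equally and matches the full-batch reference. When every micro-batch has the same number of non-padding tokens, the two versions coincide, which is part of why the discrepancy went unnoticed.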
DeepSpeed, Microsoft's distributed training library, provides built-in gradient accumulation support through its configuration system. Rather than manually implementing the accumulation loop, users specify the desired behavior in a JSON configuration file [5]:
```json
{
  "train_micro_batch_size_per_gpu": 4,
  "gradient_accumulation_steps": 8,
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 1e-4,
      "weight_decay": 0.01
    }
  }
}
```
In this configuration, each GPU processes micro-batches of 4 and accumulates gradients over 8 steps, giving an effective per-GPU batch size of 32. On 4 GPUs, the global effective batch size is 128.
DeepSpeed handles the loss normalization, gradient synchronization, and optimizer stepping automatically, reducing the risk of implementation errors.
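Gradient accumulation and distributed training are complementary techniques that both increase the effective batch size, but through different mechanisms: gradient accumulation processes micro-batches sequentially on the same GPU, while data parallelism processes micro-batches in parallel across multiple GPUs.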
In practice, modern LLM training uses both simultaneously. The total effective batch size is:
B_eff = micro_batch_size * accumulation_steps * num_data_parallel_devices
| Technique | Mechanism | Speed Impact | Memory Impact | Hardware Requirement |
|---|---|---|---|---|
| Gradient accumulation | Sequential micro-batches on same GPU | Slower (sequential processing) | No additional memory | Single GPU sufficient |
| Data parallelism | Parallel micro-batches across GPUs | Faster (parallel processing) | Same per GPU | Multiple GPUs |
| Both combined | Sequential accumulation on each of multiple GPUs | Balanced trade-off | Same per GPU | Multiple GPUs |
When combining gradient accumulation with data parallelism, an important optimization is to synchronize gradients only after the final micro-batch in each accumulation cycle, not after every micro-batch. In standard PyTorch Distributed Data Parallel (DDP), this is achieved using the no_sync() context manager:
```python
from contextlib import nullcontext

# Assumes model is wrapped in DistributedDataParallel.
for i, (inputs, targets) in enumerate(dataloader):
    is_last_micro_batch = (i + 1) % accumulation_steps == 0
    # Skip gradient sync for intermediate micro-batches
    context = nullcontext() if is_last_micro_batch else model.no_sync()
    with context:
        outputs = model(inputs)
        loss = criterion(outputs, targets) / accumulation_steps
        loss.backward()  # Must run inside no_sync() to skip the all-reduce

    if is_last_micro_batch:
        optimizer.step()
        optimizer.zero_grad()
```
This optimization reduces the communication overhead by a factor of N (the accumulation steps), since the expensive all-reduce operation across GPUs happens once per effective batch rather than once per micro-batch. DeepSpeed handles this automatically based on the gradient_accumulation_steps configuration [5].
Gradient accumulation is a standard component of virtually every large language model training pipeline. The scale of modern LLM training makes it indispensable:
Models like GPT-4, Llama, and Gemini are pretrained with effective batch sizes measured in millions of tokens. For example, every Llama 2 model was trained with a global batch size of 4 million tokens [6]. With a sequence length of 4096 tokens, this corresponds to roughly 1000 sequences per batch. As an illustration, spreading this across 512 data-parallel replicas with a micro-batch size of 1 each covers 512 sequences per pass, which leaves a factor of roughly 2 to be covered by gradient accumulation.
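The arithmetic behind such a recipe can be sketched as a small helper. The function name and the 512-replica configuration below are hypothetical, chosen only to illustrate the calculation; the 4 million token batch and 4096-token sequence length come from the text above.

```python
import math

def accumulation_steps_needed(global_batch_tokens, seq_len,
                              micro_batch_size, data_parallel_size):
    """Accumulation steps required to reach a target token batch size."""
    sequences_per_update = global_batch_tokens / seq_len
    # Sequences processed by one forward/backward pass across all replicas.
    sequences_per_pass = micro_batch_size * data_parallel_size
    return math.ceil(sequences_per_update / sequences_per_pass)

# 4M-token batch, 4096-token sequences, micro-batch 1, 512 data-parallel replicas.
steps = accumulation_steps_needed(4_000_000, 4096, 1, 512)
print(steps)
```

Rounding up means the effective batch may slightly exceed the target. A real recipe would usually pick values that divide evenly.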
Many training recipes also use batch size warmup, starting with a smaller effective batch size and increasing it during training. Gradient accumulation provides a clean mechanism for this: simply increase the accumulation steps over time.
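One way to express such a warmup is a function that maps the optimizer-step index to the number of accumulation steps. This is a hypothetical linear ramp for illustration only; published recipes may use other shapes, and the function name and parameters are assumptions.

```python
def accumulation_steps_at(update, warmup_updates, start_steps, final_steps):
    """Linearly ramp accumulation steps from start_steps to final_steps."""
    if update >= warmup_updates:
        return final_steps
    fraction = update / warmup_updates
    return start_steps + int(fraction * (final_steps - start_steps))

# Ramp from 2 to 16 accumulation steps over the first 1000 optimizer steps.
for update in (0, 500, 1000):
    print(update, accumulation_steps_at(update, 1000, 2, 16))
```

The loss divisor has to follow the current value, so a loop using this schedule would divide each micro-batch loss by the accumulation steps in effect for that cycle rather than by a fixed constant.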
Fine-tuning is where gradient accumulation is perhaps most commonly encountered by practitioners, because fine-tuning often happens on single GPUs or small clusters with limited memory:
| Training Phase | Typical Effective Batch Size (tokens) | Common Micro-Batch Size | Accumulation Steps (varies with GPU count) |
|---|---|---|---|
| Pretraining (small models, under 7B) | 500K to 2M | 2 to 8 sequences | 4 to 32 |
| Pretraining (large models, 70B+) | 2M to 8M | 1 to 2 sequences | 8 to 64 |
| Supervised fine-tuning | 32K to 256K | 1 to 4 sequences | 4 to 16 |
| RLHF / DPO | 16K to 128K | 1 to 2 sequences | 8 to 32 |
Beyond the basic accumulation loop, several implementation patterns address common pitfalls and optimize performance in production training systems.
When the number of batches in an epoch is not evenly divisible by the accumulation steps, the last cycle will have fewer micro-batches than expected. This is a subtle but important issue: if the loss is divided by N (the expected accumulation steps) but only M < N micro-batches are processed, the accumulated gradient will be scaled incorrectly. (This is closely related to the 2024 token-normalization bug described above: assuming a fixed denominator instead of measuring the real one is exactly what went wrong at scale.)
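There are two approaches to handle this: drop the incomplete final cycle so that every optimizer step sees exactly N micro-batches, or still step on the final cycle but divide its loss by the actual count M instead of N.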
Most production frameworks (Hugging Face Transformers, PyTorch Lightning) handle this automatically.
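For a manual loop, the bookkeeping can be sketched as follows. The helper is hypothetical and covers two common options, either dropping the short final cycle or keeping it with its own divisor. It returns the number of micro-batches in each cycle, which is the value the loss should be divided by in that cycle.

```python
def accumulation_cycles(num_batches, accumulation_steps, drop_incomplete=False):
    """Return the micro-batch count (loss divisor) for each accumulation cycle."""
    full, remainder = divmod(num_batches, accumulation_steps)
    cycles = [accumulation_steps] * full
    if remainder and not drop_incomplete:
        # Short final cycle: divide by M (the real count), not N.
        cycles.append(remainder)
    return cycles

print(accumulation_cycles(10, 4))                        # keep the short tail
print(accumulation_cycles(10, 4, drop_incomplete=True))  # drop the short tail
```

Keeping the tail uses all of the data, but the last update is computed from a smaller effective batch. Dropping it keeps every update at the same effective batch size.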
Gradient clipping is standard practice in LLM training, typically clipping the global gradient norm to a maximum value (e.g., 1.0). When combined with gradient accumulation, the timing of clipping matters:
| Clipping approach | Correct? | Behavior |
|---|---|---|
| Clip after each micro-batch backward | No | Clips incomplete gradients; final accumulated gradient may be wrong |
| Clip after all micro-batches, before optimizer step | Yes | Clips the full accumulated gradient; mathematically correct |
| Clip within DeepSpeed/FSDP framework | Automatic | Frameworks handle the timing correctly |
The correct approach is to accumulate all gradients first, then clip the accumulated gradient as a single operation before the optimizer step. Clipping intermediate micro-batch gradients changes the mathematical equivalence between accumulation and direct large-batch training.
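The difference can be seen with two-dimensional toy gradients. This sketch uses plain Python lists and made-up values; `clip_by_norm` is a hypothetical stand-in for a global-norm clipping routine such as PyTorch's `torch.nn.utils.clip_grad_norm_`, which in a real loop would be called once, just before `optimizer.step()`.

```python
import math

def l2_norm(vec):
    return math.sqrt(sum(v * v for v in vec))

def clip_by_norm(vec, max_norm):
    """Scale vec down so that its L2 norm is at most max_norm."""
    norm = l2_norm(vec)
    if norm <= max_norm:
        return list(vec)
    return [v * max_norm / norm for v in vec]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

# Two micro-batch gradients, already scaled by 1 / accumulation_steps.
g1 = [3.0, 0.0]
g2 = [-3.0, 0.5]
max_norm = 1.0

# Correct: accumulate everything, then clip once before the optimizer step.
clip_after = clip_by_norm(add(g1, g2), max_norm)

# Incorrect: clip the partially accumulated gradient after every backward pass.
running = [0.0, 0.0]
for g in (g1, g2):
    running = clip_by_norm(add(running, g), max_norm)
clip_each = running

print(clip_after, clip_each)
```

Here the full accumulated gradient is small and needs no clipping, but clipping after each micro-batch distorts the first contribution and leaves a final gradient pointing in a different direction.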
When using gradient accumulation, metrics like loss and learning rate should be logged per optimizer step (i.e., once per accumulation cycle), not per micro-batch. Logging per micro-batch produces N times as many data points and can obscure the true training dynamics.
Similarly, training progress (current step, progress percentage) should count optimizer steps, not micro-batch iterations. A common bug is to report total micro-batch iterations as "steps," which misrepresents how far training has progressed.
The interaction between gradient accumulation and learning rate scheduling is a frequent source of subtle bugs. The key principle is that the learning rate scheduler should step based on optimizer steps (once per accumulation cycle), not based on micro-batch iterations.
Consider a training run with 1000 optimizer steps and a cosine decay schedule. If the scheduler advances after every micro-batch (instead of every optimizer step), and the accumulation factor is 4, the scheduler will complete its full cycle after only 250 optimizer steps, spending the remaining 750 steps at the minimum learning rate. This effectively truncates the schedule and can significantly hurt performance.
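The truncation can be reproduced with a few lines of arithmetic. The cosine function below is a generic sketch that holds the minimum learning rate once the schedule ends, rather than any specific framework's scheduler, and the base learning rate of 1e-3 is an assumed value.

```python
import math

def cosine_lr(scheduler_step, total_steps, base_lr, min_lr=0.0):
    """Cosine decay from base_lr to min_lr, held at min_lr afterwards."""
    progress = min(scheduler_step / total_steps, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

total_optimizer_steps = 1000
accumulation_steps = 4

def lr_at_optimizer_step(k, step_per_micro_batch):
    # A scheduler stepped per micro-batch advances accumulation_steps times
    # for every optimizer step.
    scheduler_step = k * accumulation_steps if step_per_micro_batch else k
    return cosine_lr(scheduler_step, total_optimizer_steps, base_lr=1e-3)

correct_lr = lr_at_optimizer_step(250, step_per_micro_batch=False)
buggy_lr = lr_at_optimizer_step(250, step_per_micro_batch=True)
print(correct_lr, buggy_lr)
```

At optimizer step 250 the correctly stepped schedule is only a quarter of the way through its decay, while the per-micro-batch schedule has already reached the minimum learning rate.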
Different frameworks handle this differently:
| Framework | Default LR scheduler behavior | Accumulation-aware? |
|---|---|---|
| PyTorch (manual loop) | User controls when scheduler.step() is called | User must handle correctly |
| Hugging Face Transformers Trainer | Steps scheduler per optimizer step | Yes (automatic) |
| PyTorch Lightning | Steps scheduler per optimizer step | Yes (automatic) |
| DeepSpeed | Steps scheduler per optimizer step | Yes (automatic) |
| FSDP (manual) | User controls timing | User must handle correctly |
When writing a manual training loop with gradient accumulation, the scheduler should be stepped inside the same if (i + 1) % accumulation_steps == 0 block as the optimizer:
```python
if (i + 1) % accumulation_steps == 0:
    optimizer.step()
    scheduler.step()  # Step the scheduler here, not outside
    optimizer.zero_grad()
```
Learning rate warmup interacts cleanly with gradient accumulation as long as the warmup is specified in optimizer steps. For example, "warmup for 500 steps" means 500 optimizer steps (each of which includes N accumulated micro-batches), not 500 micro-batch iterations.
DeepSpeed provides the most comprehensive built-in support for gradient accumulation among distributed training frameworks. Understanding its internals helps avoid common issues.
DeepSpeed's gradient accumulation is configured through three related parameters:
```json
{
  "train_batch_size": 128,
  "train_micro_batch_size_per_gpu": 4,
  "gradient_accumulation_steps": 8
}
```
Only two of these three parameters need to be specified; DeepSpeed computes the third:
train_batch_size = train_micro_batch_size_per_gpu * gradient_accumulation_steps * num_gpus
If all three are specified and inconsistent, DeepSpeed will raise an error.
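The relationship between the three values can be sketched as a small resolver. This is an illustration of the rule described above, not DeepSpeed's own implementation; the function name and signature are hypothetical.

```python
def resolve_batch_config(num_gpus, train_batch_size=None,
                         micro_batch_per_gpu=None, accumulation_steps=None):
    """Fill in the missing batch parameter and check consistency."""
    given = [train_batch_size, micro_batch_per_gpu, accumulation_steps]
    if sum(v is not None for v in given) < 2:
        raise ValueError("specify at least two of the three values")

    if train_batch_size is None:
        train_batch_size = micro_batch_per_gpu * accumulation_steps * num_gpus
    elif micro_batch_per_gpu is None:
        micro_batch_per_gpu = train_batch_size // (accumulation_steps * num_gpus)
    elif accumulation_steps is None:
        accumulation_steps = train_batch_size // (micro_batch_per_gpu * num_gpus)

    # Also catches values that do not divide evenly.
    if train_batch_size != micro_batch_per_gpu * accumulation_steps * num_gpus:
        raise ValueError("inconsistent batch configuration")
    return train_batch_size, micro_batch_per_gpu, accumulation_steps

# 4 GPUs, global batch 128, micro-batch 4 per GPU.
print(resolve_batch_config(4, train_batch_size=128, micro_batch_per_gpu=4))
```

With the values from the configuration above on 4 GPUs, the missing accumulation step count resolves to 8.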
DeepSpeed optimizes gradient synchronization during accumulation. Rather than running an all-reduce after every micro-batch (which would synchronize gradients across GPUs N times per optimizer step), DeepSpeed averages gradients locally during each micro-batch step and performs a single all-reduce at the end of the accumulation cycle. This reduces communication overhead by a factor of N.
| Approach | All-reduce operations per optimizer step | Communication overhead |
|---|---|---|
| Naive (sync every micro-batch) | N (one per micro-batch) | High |
| DeepSpeed optimized | 1 (after final micro-batch) | Low |
| PyTorch DDP with no_sync() | 1 (after final micro-batch) | Low |
| FSDP without communication | 0 during accumulation (sync only at the optimizer step) | Lowest (but uses more memory) |
DeepSpeed's ZeRO (Zero Redundancy Optimizer) memory optimization interacts with gradient accumulation at each stage:
PyTorch's Fully Sharded Data Parallel (FSDP), described by Zhao et al. (2023), offers two modes of gradient accumulation [12]:
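With communication: FSDP still reduces gradients across ranks after every micro-batch, so each rank stores only its shard of the accumulated gradient. This keeps memory low at the cost of communicating on every micro-batch.
Without communication (using no_sync()): FSDP skips the inter-GPU reduction for intermediate micro-batches, accumulating unsharded gradients locally. This uses more memory but eliminates communication for N-1 of the N micro-batches.
A known difference between FSDP and DeepSpeed is that FSDP may require a higher learning rate to achieve the same convergence rate as DeepSpeed, particularly when using different communication strategies during accumulation. Practitioners should verify convergence when switching between frameworks [9].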
While gradient accumulation is extremely useful, it has inherent limitations:
Increased wall-clock time. Processing micro-batches sequentially takes longer than processing the full batch in parallel. If the hardware can accommodate a larger batch size, increasing the batch size directly is generally faster than gradient accumulation. Accumulation should be used when memory, not time, is the binding constraint.
No parallelism benefit. Unlike data parallelism, gradient accumulation does not utilize additional hardware. It trades time for memory, processing more data per update at the cost of longer time per update.
Interaction with learning rate. The optimal learning rate depends on the effective batch size. When using gradient accumulation to increase the effective batch size, the learning rate may need to be adjusted upward following the linear scaling rule or similar heuristics.
Stale statistics. Beyond batch normalization, any technique that relies on batch-level statistics (e.g., some contrastive learning methods that compute similarity within a batch) will see different behavior with gradient accumulation, since each micro-batch operates on a subset of the effective batch.
Correct loss normalization. As the 2024 bug demonstrated, naively averaging token-normalized losses across micro-batches is not equivalent to full-batch training. Use a framework version that normalizes by the total token count, or implement the sum-then-divide pattern yourself [1][10].
As of early 2026, gradient accumulation is a standard, universally supported feature across all major deep learning frameworks and training libraries:
Hugging Face Transformers exposes gradient_accumulation_steps as a direct argument [11], and JAX training loops can combine accumulation with pmap for data parallelism.
The technique remains essential because GPU memory growth has not kept pace with model size growth. Even the NVIDIA H100, which offers 80 GB of HBM3 memory in its SXM form factor, cannot hold the activations for large batch sizes during training of models with tens of billions of parameters [7][8]. As models continue to scale and training recipes demand ever-larger effective batch sizes, gradient accumulation will remain a foundational technique in the practitioner's toolkit.