# Gradient Accumulation

> Source: https://aiwiki.ai/wiki/gradient_accumulation
> Updated: 2026-06-23
> Categories: Deep Learning, Machine Learning, Training & Optimization
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**Gradient accumulation** is a [deep learning](/wiki/deep_learning) training technique that simulates a large [batch size](/wiki/batch_size) on limited GPU memory by summing the gradients from several small [mini-batches](/wiki/mini-batch) (called micro-batches) and performing a single optimizer step only after the last one. Instead of computing one gradient on a large batch that may not fit in memory, the model runs N smaller forward and backward passes, accumulates their gradients in place, then updates the weights as if it had processed a batch N times larger. The relationship is `effective_batch_size = micro_batch_size * accumulation_steps * num_devices`. As the Unsloth engineering team put it, "The goal of gradient accumulation is to mimic full batch training with reduced VRAM usage." [1]

This technique is fundamental to modern [large language model](/wiki/large_language_model) training, where effective batch sizes of millions of tokens are standard but a single GPU can hold only a small fraction of that data in memory at once. Llama 2, for example, was trained with a global batch size of 4 million tokens across every model size in the family [6]. Gradient accumulation trades wall-clock time for memory: it does not make training faster, but it lets memory-bound hardware reproduce the large-batch recipes used to train frontier models [1].

## How does gradient accumulation work?

In standard training with [stochastic gradient descent](/wiki/stochastic_gradient_descent_sgd) (or its variants), each training step consists of:

1. Forward pass: compute predictions on a batch of data
2. Backward pass: compute gradients of the loss with respect to model weights
3. Optimizer step: update weights using the computed gradients
4. Zero gradients: reset accumulated gradients to zero

With gradient accumulation, steps 1 and 2 are repeated N times (where N is the accumulation steps) before performing step 3. The process becomes:

1. **For each micro-batch i = 1 to N:**
   - Forward pass on micro-batch i
   - Compute loss for micro-batch i, divided by N (to average)
   - Backward pass to accumulate gradients
2. **After all N micro-batches:**
   - Optimizer step: update weights
   - Zero gradients: reset for next accumulation cycle

The division of the loss by N in each backward pass ensures that the accumulated gradient is mathematically equivalent to the gradient computed on a single batch of size N times the micro-batch size. (As discussed below, this naive per-micro-batch averaging is correct for a simple mean loss but was the source of a subtle 2024 bug for token-normalized losses.)

## Mathematical Equivalence

Let B be the micro-batch size and N be the number of accumulation steps. The effective batch size is B_eff = N * B.

In standard training with batch size B_eff, the gradient update is:

```
g = (1 / B_eff) * sum_{i=1}^{B_eff} grad(L_i)
```

With gradient accumulation, processing N micro-batches of size B:

```
g = (1 / N) * sum_{k=1}^{N} [ (1 / B) * sum_{j=1}^{B} grad(L_{k,j}) ]
```

These two expressions are mathematically identical: both compute the average gradient over B_eff = N * B examples. This means that, for most purposes, training with gradient accumulation produces the same result as training with the larger batch size directly [2]. Hugging Face describes the property plainly: gradient accumulation is "supposed to be mathematically equivalent to full batch training." [10]
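
The identity can be checked numerically. The sketch below is a minimal illustration that uses random scalars as stand-ins for per-example gradients (real gradients are tensors, but the averaging is the same element-wise); the values of `B` and `N` are arbitrary assumptions.

```python
import random

def mean(values):
    return sum(values) / len(values)

random.seed(0)
B, N = 4, 3  # micro-batch size and accumulation steps (illustrative values)
per_example_grads = [random.gauss(0.0, 1.0) for _ in range(B * N)]

# Large-batch gradient: one mean over all B_eff = N * B examples.
g_full = mean(per_example_grads)

# Accumulated gradient: each micro-batch contributes (mean over B) / N.
g_accum = 0.0
for k in range(N):
    micro = per_example_grads[k * B:(k + 1) * B]
    g_accum += mean(micro) / N

# The two agree up to floating-point rounding.
print(abs(g_full - g_accum) < 1e-12)
```

The agreement relies on every micro-batch having the same size B; when micro-batches contribute unequal numbers of elements, a mean of means is no longer the overall mean.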

### Does batch normalization break the equivalence?

[Batch normalization](/wiki/batch_normalization) is the primary exception to this mathematical equivalence. During training, batch normalization normalizes activations with the mean and variance of the current mini-batch and uses those same values to update its running statistics. When using gradient accumulation, these statistics are computed over each micro-batch of size B rather than over the full effective batch of size B_eff, so neither the normalization nor the running averages ever see the full effective batch at once.

This means the normalization statistics are noisier, which can lead to slightly different training dynamics and final performance. In practice, this is often not a significant issue because:

- Many modern architectures, including [transformers](/wiki/transformer), use [Layer Normalization](/wiki/layer_normalization) instead of Batch Normalization. Layer Normalization computes statistics per example rather than per batch, so it is unaffected by gradient accumulation.
- For architectures that do use Batch Normalization (such as some [convolutional neural networks](/wiki/convolutional_neural_network)), the difference is usually small unless the micro-batch size is very small (e.g., 1 or 2).

| Normalization Type | Affected by Gradient Accumulation? | Explanation |
|---|---|---|
| Batch Normalization | Yes | Statistics computed per micro-batch, not effective batch |
| Layer Normalization | No | Statistics computed per example, batch-size independent |
| Group Normalization | No | Statistics computed per channel group within each example |
| RMSNorm | No | Computed per example; standard in modern LLMs |
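
To make the Batch Normalization row concrete, the following sketch compares the mean and (biased) variance of one feature over a full batch of 8 values with the statistics of the four micro-batches of size 2 that accumulation would actually feed to the layer. The numbers are made up for illustration.

```python
from statistics import fmean, pvariance

full_batch = [0.1, 0.4, 2.5, 3.0, -1.2, 0.3, 5.0, 4.4]  # one feature, B_eff = 8
micro_batches = [full_batch[i:i + 2] for i in range(0, len(full_batch), 2)]  # B = 2

# Statistics a batch-norm layer would use with the full batch in memory.
full_stats = (fmean(full_batch), pvariance(full_batch))

# Statistics it sees under accumulation: one pair per micro-batch.
micro_stats = [(fmean(mb), pvariance(mb)) for mb in micro_batches]

print("full batch   mean/var:", full_stats)
for i, stats in enumerate(micro_stats):
    print(f"micro-batch {i} mean/var:", stats)
```

Each micro-batch is normalized with its own, noisier statistics, which is why the effect grows as the micro-batch size shrinks.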

## Why does gradient accumulation matter?

Gradient accumulation solves a fundamental tension in deep learning: larger batch sizes generally improve training stability and throughput, but GPU memory is finite.

### Memory Constraints

During training, GPU memory must hold:
- Model parameters
- Optimizer states (e.g., [Adam](/wiki/adam_optimizer) stores two additional values per parameter: running estimates of the gradient's first and second moments)
- Activations saved for the backward pass
- The gradient tensors themselves

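For a 7-billion-parameter model trained with Adam in mixed precision, the two FP32 moment estimates alone consume roughly 56 GB (8 bytes per parameter), and the ZeRO paper's accounting of about 16 bytes per parameter for FP16 weights, FP16 gradients, and FP32 optimizer states puts the full model state near 112 GB, which is why such models are usually sharded, offloaded, or trained with parameter-efficient methods [3]. Activations scale linearly with batch size and sequence length. On a GPU with 80 GB of memory (e.g., an [NVIDIA](/wiki/nvidia) A100, which pairs 80 GB of HBM2e with about 2 TB/s of bandwidth), the memory left over after model state may only accommodate a micro-batch of 1 to 4 sequences [3][8].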

Gradient accumulation allows this limited micro-batch size to be scaled to an effective batch size of hundreds or thousands of sequences, matching the batch sizes used in published training recipes.
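
A rough way to itemize model-state memory is sketched below. The per-parameter byte counts follow the mixed-precision Adam accounting in the ZeRO paper [3]; the function name is hypothetical, and real frameworks may differ (for example, when optimizer states are sharded, offloaded, or quantized).

```python
GB = 1e9

def model_state_gb(num_params: float) -> dict:
    """Estimate model-state memory in GB, excluding activations."""
    bytes_per_param = {
        "fp16_weights": 2,
        "fp16_gradients": 2,
        "fp32_master_weights": 4,
        "fp32_adam_momentum": 4,
        "fp32_adam_variance": 4,
    }
    breakdown = {name: num_params * b / GB for name, b in bytes_per_param.items()}
    breakdown["total"] = sum(breakdown.values())
    return breakdown

for name, gb in model_state_gb(7e9).items():
    print(f"{name:>20}: {gb:6.1f} GB")
```

Activations come on top of these figures and are the part that the micro-batch size controls.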

### Training Stability

Larger effective batch sizes produce gradient estimates with lower variance, leading to more stable training. This is particularly important for:

- **LLM pretraining**, where batch sizes of 2 to 4 million tokens are common
- **Fine-tuning**, where small datasets combined with small batch sizes can cause noisy gradients
- **Multi-task learning**, where each task needs sufficient representation in each update

### Comparison: Small vs. Large Effective Batch

| Aspect | Small Effective Batch | Large Effective Batch (via accumulation) |
|---|---|---|
| Gradient noise | High | Low |
| Training stability | Lower | Higher |
| Convergence speed (in updates) | More updates needed | Fewer updates needed |
| Wall-clock time per update | Faster | Slower (sequential micro-batches) |
| Memory usage per GPU | Set by the micro-batch size | Same (still set by the micro-batch size) |
| Learning rate sensitivity | More sensitive | More tolerant of higher learning rates |

## Implementation in PyTorch

[PyTorch](/wiki/pytorch) makes gradient accumulation straightforward because its autograd engine accumulates gradients by default. Calling `loss.backward()` adds the computed gradients to any existing gradients in the `.grad` attribute of each parameter, rather than replacing them. This is why `optimizer.zero_grad()` is explicitly called to reset gradients, and why omitting that call (intentionally) enables accumulation [4].

A standard PyTorch training loop with gradient accumulation looks like:

```python
accumulation_steps = 4
optimizer.zero_grad()

for i, (inputs, targets) in enumerate(dataloader):
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    loss = loss / accumulation_steps  # Normalize loss
    loss.backward()  # Accumulate gradients
    
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()  # Update weights
        optimizer.zero_grad()  # Reset gradients
```

Key implementation details:

- **Loss normalization**: The loss is divided by `accumulation_steps` before calling `backward()`. This ensures the accumulated gradient is the average over all micro-batches, matching the behavior of training with the full effective batch size. (See the 2024 bug section for the important caveat that dividing by the step count is not the same as dividing by the total token count for token-normalized losses.)
- **Gradient clipping**: If using gradient clipping (common in LLM training), it should be applied after all micro-batches have been accumulated but before the optimizer step, so that clipping operates on the final accumulated gradient.
- **Learning rate scheduling**: The scheduler should step based on actual optimizer updates, not micro-batch iterations. If using a scheduler that steps per batch, it needs to be adjusted to account for the accumulation factor.

### Mixed Precision Considerations

When combining gradient accumulation with mixed precision training (using `torch.amp`; the older `torch.cuda.amp` entry points are deprecated in recent PyTorch releases), the gradient scaler should step and update only after all micro-batches have been accumulated:

```python
import torch

accumulation_steps = 4
scaler = torch.amp.GradScaler("cuda")
optimizer.zero_grad()

for i, (inputs, targets) in enumerate(dataloader):
    with torch.amp.autocast("cuda"):
        outputs = model(inputs)
        loss = criterion(outputs, targets) / accumulation_steps

    scaler.scale(loss).backward()  # Accumulate scaled gradients

    if (i + 1) % accumulation_steps == 0:
        scaler.step(optimizer)  # Unscales gradients; skips the step on inf/NaN
        scaler.update()
        optimizer.zero_grad()
```

## Implementation in Hugging Face Transformers

The [Hugging Face](/wiki/hugging_face) Transformers `Trainer` exposes gradient accumulation as a single argument, `gradient_accumulation_steps`, on `TrainingArguments`. It defaults to 1 (no accumulation), and the documentation describes the effective batch size as `per_device_train_batch_size * num_devices * gradient_accumulation_steps` [11]. When accumulation is enabled, the `Trainer` counts one logged "step" per optimizer update rather than per micro-batch, so evaluation, logging, and checkpoint saving fire on the accumulation cadence rather than on every forward pass [11].

```python
from transformers import TrainingArguments

args = TrainingArguments(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,  # effective batch = 4 * num_gpus * 8
    learning_rate=2e-5,
)
```

## What was the 2024 gradient accumulation bug?

In October 2024, the broader LLM training community discovered that the gradient accumulation paths in many popular trainers were not, in fact, mathematically equivalent to full-batch training for the cross-entropy loss used in causal language modeling. The bug was rediscovered and reported by researcher Benjamin Marie, shared publicly by [Unsloth](/wiki/unsloth) on October 15, 2024, and patched in Hugging Face Transformers the following day [1][10].

The root cause was loss normalization. Cross-entropy loss for an LLM is normalized by the number of non-padded (non-ignored) tokens. When gradient accumulation computes that mean loss independently for each micro-batch and then sums the results, the denominators do not combine correctly. Unsloth showed that the naive sum produces a loss that is G times too large, where G is the number of accumulation steps, and that dividing by G restores the right scale only when every micro-batch contains the same number of non-padded tokens. As the Unsloth writeup explains, "gradient accumulation should effectively be equivalent mathematically to full batch training," yet the loss curves with and without accumulation diverged until the denominator was fixed [1]. Hugging Face described the same symptom: gradient accumulation is "supposed to be mathematically equivalent to full batch training; however, losses did not match between training runs where the setting was toggled on and off." [10]

The correct computation, as Hugging Face documented, is to take "the total loss across all batches in a gradient accumulation step divided by the total number of all non padding tokens in those batches," which is not the same as averaging the per-micro-batch losses [10]. In code, the fix switches the loss reduction from a per-batch mean to a sum, then divides once by the global token count:

```python
loss = nn.functional.cross_entropy(
    shift_logits, shift_labels, ignore_index=-100, reduction="sum"
)
loss = loss / num_items  # total non-padding tokens across the accumulation cycle
```

The impact was significant. The issue affected not only single-device accumulation but also Distributed Data Parallel and multi-GPU setups, meaning many published fine-tunes and pretraining runs had been trained with a subtly miscalibrated effective loss. After the fix, Unsloth reported that "all loss curves now match up, showing indeed gradient accumulation is equivalent to full batch training." [1] The corresponding Transformers patch landed in pull requests #34191 and #34198, and users were advised to install from the main branch to pick up the fix [10].
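
The mismatch is easy to reproduce with plain numbers. The sketch below uses made-up per-token losses for two micro-batches with different numbers of non-padding tokens and compares the naive mean of per-micro-batch means with the sum-then-divide normalization.

```python
# Per-token losses for two micro-batches (illustrative values only).
micro_batches = [
    [2.0, 2.0, 2.0, 2.0, 2.0, 2.0],  # 6 non-padding tokens
    [8.0, 8.0],                      # 2 non-padding tokens
]
G = len(micro_batches)  # accumulation steps

# Naive: mean per micro-batch, then average over the G steps.
naive = sum(sum(mb) / len(mb) for mb in micro_batches) / G

# Fixed: sum every token loss, then divide once by the total token count.
total_tokens = sum(len(mb) for mb in micro_batches)
fixed = sum(sum(mb) for mb in micro_batches) / total_tokens

print(naive, fixed)  # 5.0 3.5
```

The naive value over-weights the tokens of the short micro-batch; the two values coincide only when every micro-batch has the same token count.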

## Implementation in DeepSpeed

[DeepSpeed](/wiki/deepspeed), Microsoft's distributed training library, provides built-in gradient accumulation support through its configuration system. Rather than manually implementing the accumulation loop, users specify the desired behavior in a JSON configuration file [5]:

```json
{
  "train_micro_batch_size_per_gpu": 4,
  "gradient_accumulation_steps": 8,
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 1e-4,
      "weight_decay": 0.01
    }
  }
}
```

In this configuration, each GPU processes micro-batches of 4 and accumulates gradients over 8 steps, giving an effective per-GPU batch size of 32. On 4 GPUs, the global effective batch size is 128.

DeepSpeed handles the loss normalization, gradient synchronization, and optimizer stepping automatically, reducing the risk of implementation errors.

## Relationship to Distributed Training

Gradient accumulation and [distributed training](/wiki/distributed_training) are complementary techniques that both increase the effective batch size, but through different mechanisms:

- **Gradient accumulation** increases the effective batch size by processing micro-batches sequentially on the same device
- **[Data parallelism](/wiki/data_parallelism)** increases the effective batch size by processing micro-batches in parallel across multiple devices

In practice, modern LLM training uses both simultaneously. The total effective batch size is:

```
B_eff = micro_batch_size * accumulation_steps * num_data_parallel_devices
```

| Technique | Mechanism | Speed Impact | Memory Impact | Hardware Requirement |
|---|---|---|---|---|
| Gradient accumulation | Sequential micro-batches on same GPU | Slower (sequential processing) | No additional memory | Single GPU sufficient |
| Data parallelism | Parallel micro-batches across GPUs | Faster (parallel processing) | Same per GPU | Multiple GPUs |
| Both combined | Sequential accumulation on each of multiple GPUs | Balanced trade-off | Same per GPU | Multiple GPUs |

### Gradient Synchronization

When combining gradient accumulation with data parallelism, an important optimization is to synchronize gradients only after the final micro-batch in each accumulation cycle, not after every micro-batch. In standard [PyTorch Distributed Data Parallel](/wiki/distributed_data_parallel) (DDP), this is achieved using the `no_sync()` context manager:

```python
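from contextlib import nullcontext

# `model` is assumed to be wrapped in torch.nn.parallel.DistributedDataParallel
for i, (inputs, targets) in enumerate(dataloader):
    # Skip gradient sync for intermediate micro-batches
    context = model.no_sync() if (i + 1) % accumulation_steps != 0 else nullcontext()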
    
    with context:
        outputs = model(inputs)
        loss = criterion(outputs, targets) / accumulation_steps
        loss.backward()
    
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

This optimization reduces the communication overhead by a factor of N (the accumulation steps), since the expensive all-reduce operation across GPUs happens once per effective batch rather than once per micro-batch. DeepSpeed handles this automatically based on the `gradient_accumulation_steps` configuration [5].

## Use in LLM Training

Gradient accumulation is a standard component of virtually every large language model training pipeline. The scale of modern LLM training makes it indispensable:

### Pretraining

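Models like [GPT-4](/wiki/gpt-4), [Llama](/wiki/llama), and [Gemini](/wiki/gemini) are pretrained with effective batch sizes measured in millions of tokens. For example, every Llama 2 model was trained with a global batch size of 4 million tokens [6]. With a sequence length of 4096 tokens, this corresponds to roughly 1000 sequences per batch. As an illustration, spreading these across 512 data-parallel replicas with a micro-batch size of 1 each still leaves a factor of roughly 2 to be covered by gradient accumulation. The replica count is smaller than the GPU count whenever GPUs are grouped for tensor or pipeline parallelism; 2048 replicas at a micro-batch size of 1 would already exceed the target batch with no accumulation at all.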
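
The arithmetic behind such an estimate can be sketched as follows. The function name and the replica count of 512 are illustrative assumptions, not figures from the Llama 2 paper; when model parallelism is used, the number of data-parallel replicas is smaller than the number of GPUs.

```python
import math

def accumulation_steps_needed(global_batch_tokens, seq_len,
                              micro_batch_size, dp_replicas):
    """Accumulation steps needed to reach at least the target token batch."""
    tokens_per_micro_step = seq_len * micro_batch_size * dp_replicas
    return max(1, math.ceil(global_batch_tokens / tokens_per_micro_step))

# 4M tokens, 4096-token sequences, micro-batch 1, 512 data-parallel replicas.
print(accumulation_steps_needed(4_000_000, 4096, 1, 512))   # 2

# With 2048 replicas, one micro-batch per replica already covers the target.
print(accumulation_steps_needed(4_000_000, 4096, 1, 2048))  # 1
```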

Many training recipes also use batch size warmup, starting with a smaller effective batch size and increasing it during training. Gradient accumulation provides a clean mechanism for this: simply increase the accumulation steps over time.
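
One possible shape for such a warmup is a linear ramp in the number of accumulation steps. The schedule below is a hypothetical sketch (the function name, the linear form, and the example values are all assumptions rather than a published recipe).

```python
def accumulation_steps_at(step, warmup_steps, start_accum, final_accum):
    """Linearly ramp accumulation steps from start_accum to final_accum."""
    if step >= warmup_steps:
        return final_accum
    frac = step / warmup_steps
    return max(1, round(start_accum + frac * (final_accum - start_accum)))

for step in (0, 250, 500, 1000, 5000):
    print(step, accumulation_steps_at(step, warmup_steps=1000,
                                      start_accum=2, final_accum=16))
```

When the accumulation count changes during training, the loss divisor for each cycle has to use the count in effect for that cycle, not a fixed constant.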

### Fine-Tuning

[Fine-tuning](/wiki/fine_tuning) is where gradient accumulation is perhaps most commonly encountered by practitioners, because fine-tuning often happens on single GPUs or small clusters with limited memory:

- [LoRA](/wiki/lora) and [QLoRA](/wiki/qlora) fine-tuning of 7B to 70B models on a single memory-limited GPU (a 24 GB consumer card for the smaller models, larger cards for the biggest ones) typically requires gradient accumulation with 4 to 16 steps
- Instruction tuning datasets are often small, making larger effective batch sizes critical for stable training
- [RLHF](/wiki/rlhf) and [DPO](/wiki/dpo) training pipelines use gradient accumulation to process paired preference data efficiently

### Practical Batch Size Guidelines for LLM Training

| Training Phase | Typical Effective Batch Size (tokens) | Common Micro-Batch Size | Accumulation Steps (varies with GPU count) |
|---|---|---|---|
| Pretraining (small models, under 7B) | 500K to 2M | 2 to 8 sequences | 4 to 32 |
| Pretraining (large models, 70B+) | 2M to 8M | 1 to 2 sequences | 8 to 64 |
| Supervised fine-tuning | 32K to 256K | 1 to 4 sequences | 4 to 16 |
| RLHF / DPO | 16K to 128K | 1 to 2 sequences | 8 to 32 |

## Advanced implementation patterns

Beyond the basic accumulation loop, several implementation patterns address common pitfalls and optimize performance in production training systems.

### Handling the last incomplete accumulation cycle

When the number of batches in an [epoch](/wiki/epoch) is not evenly divisible by the accumulation steps, the last cycle will have fewer micro-batches than expected. This is a subtle but important issue: if the loss is divided by N (the expected accumulation steps) but only M < N micro-batches are processed, the accumulated gradient will be scaled incorrectly. (This is closely related to the 2024 token-normalization bug described above: assuming a fixed denominator instead of measuring the real one is exactly what went wrong at scale.)

There are two approaches to handle this:

- **Drop the incomplete cycle**: Skip the final partial accumulation, similar to dropping the last incomplete batch. This is simpler but wastes some data.
- **Adjust the normalization**: Track the actual number of micro-batches in each cycle and divide by that number instead of the fixed N. This uses all data but requires slightly more bookkeeping.

Most production frameworks (Hugging Face Transformers, PyTorch Lightning) handle this automatically.
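
For a manual loop, the "adjust the normalization" approach can be sketched by grouping micro-batches into cycles up front, so the real cycle length is known before any loss is scaled. The helper name is hypothetical, and integers stand in for real micro-batches.

```python
def accumulation_cycles(micro_batches, accumulation_steps):
    """Yield (cycle, divisor) pairs; divisor is the actual cycle length."""
    cycle = []
    for mb in micro_batches:
        cycle.append(mb)
        if len(cycle) == accumulation_steps:
            yield cycle, len(cycle)
            cycle = []
    if cycle:  # trailing partial cycle with M < N micro-batches
        yield cycle, len(cycle)

# 10 micro-batches with N = 4: two full cycles, then a partial one of size 2.
for cycle, divisor in accumulation_cycles(range(10), accumulation_steps=4):
    print(cycle, "-> divide each micro-batch loss by", divisor)
```

In a training loop, each yielded cycle would run its forward and backward passes with the loss divided by `divisor`, followed by one optimizer step.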

### Gradient clipping with accumulation

[Gradient clipping](/wiki/gradient_clipping) is standard practice in [LLM](/wiki/large_language_model) training, typically clipping the global gradient norm to a maximum value (e.g., 1.0). When combined with gradient accumulation, the timing of clipping matters:

| Clipping approach | Correct? | Behavior |
|---|---|---|
| Clip after each micro-batch backward | No | Clips incomplete gradients; final accumulated gradient may be wrong |
| Clip after all micro-batches, before optimizer step | Yes | Clips the full accumulated gradient; mathematically correct |
| Clip within DeepSpeed/FSDP framework | Automatic | Frameworks handle the timing correctly |

The correct approach is to accumulate all gradients first, then clip the accumulated gradient as a single operation before the optimizer step. Clipping intermediate micro-batch gradients breaks the mathematical equivalence between accumulation and direct large-batch training.
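
The difference between the two orderings shows up even in a two-dimensional toy case. The sketch below uses plain lists as gradients and a simplified norm clip (no epsilon term, unlike library implementations); the gradient values are made up.

```python
import math

def clip_by_norm(grad, max_norm):
    """Rescale grad so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

micro_grads = [[3.0, 0.0], [0.0, 0.4]]  # assumed already scaled by 1/N
max_norm = 1.0

# Correct: accumulate first, then clip once.
accumulated = [0.0, 0.0]
for g in micro_grads:
    accumulated = add(accumulated, g)
clip_after = clip_by_norm(accumulated, max_norm)

# Incorrect: clip each micro-batch gradient, then accumulate.
clip_each = [0.0, 0.0]
for g in micro_grads:
    clip_each = add(clip_each, clip_by_norm(g, max_norm))

print("clip after accumulation:", clip_after)
print("clip per micro-batch:   ", clip_each)
```

Per-micro-batch clipping changes the direction of the update (the small second component is no longer shrunk along with the first) and leaves a final norm above `max_norm`.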

### Logging and metrics tracking

When using gradient accumulation, metrics like loss and learning rate should be logged per optimizer step (i.e., once per accumulation cycle), not per micro-batch. Logging per micro-batch produces N times as many data points and can obscure the true training dynamics.

Similarly, training progress (current step, progress percentage) should count optimizer steps, not micro-batch iterations. A common bug is to report total micro-batch iterations as "steps," which misrepresents how far training has progressed.
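
A small helper makes the conversion explicit. The function below is a hypothetical sketch that turns a count of micro-batch iterations into the number of optimizer steps, with an assumed flag for whether a trailing partial cycle is dropped.

```python
import math

def optimizer_steps(num_micro_batches, accumulation_steps, drop_last_cycle=False):
    """Number of optimizer updates produced by num_micro_batches iterations."""
    if drop_last_cycle:
        return num_micro_batches // accumulation_steps
    return math.ceil(num_micro_batches / accumulation_steps)

print(optimizer_steps(1000, 8))                        # 125
print(optimizer_steps(1003, 8))                        # 126 (partial cycle kept)
print(optimizer_steps(1003, 8, drop_last_cycle=True))  # 125
```

Progress bars, logging intervals, and checkpoint cadences are then expressed against this count rather than against the raw iteration index.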

## Interaction with learning rate scheduling

The interaction between gradient accumulation and [learning rate](/wiki/learning_rate) scheduling is a frequent source of subtle bugs. The key principle is that the learning rate scheduler should step based on **optimizer steps** (once per accumulation cycle), not based on micro-batch iterations.

### The problem

Consider a training run with 1000 optimizer steps and a cosine decay schedule. If the scheduler advances after every micro-batch (instead of every optimizer step), and the accumulation factor is 4, the scheduler will complete its full cycle after only 250 optimizer steps. The remaining 750 steps then run at the minimum learning rate, or, with a scheduler that does not clamp at the end of its cycle, on an unintended continuation of the cosine curve. Either way the schedule is effectively truncated, which can significantly hurt performance.
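
The scenario can be traced with a few lines of arithmetic. The cosine function below is a generic sketch, not any particular library's scheduler; clamping the progress at 1.0 (so the rate stays at `min_lr` after the cycle ends) and the base rate of 1e-3 are assumptions.

```python
import math

def cosine_lr(step, total_steps, base_lr=1e-3, min_lr=0.0):
    """Cosine decay from base_lr to min_lr, held at min_lr afterwards."""
    progress = min(step / total_steps, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

total_optimizer_steps, accumulation_steps = 1000, 4

for opt_step in (0, 125, 250, 500, 1000):
    correct = cosine_lr(opt_step, total_optimizer_steps)
    # Bug: the scheduler has been stepped once per micro-batch.
    buggy = cosine_lr(opt_step * accumulation_steps, total_optimizer_steps)
    print(f"optimizer step {opt_step:4d}: correct={correct:.2e}  buggy={buggy:.2e}")
```

At optimizer step 250 the correctly stepped schedule is still above 80% of the base rate, while the per-micro-batch schedule has already reached the minimum.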

### Framework behavior

Different frameworks handle this differently:

| Framework | Default LR scheduler behavior | Accumulation-aware? |
|---|---|---|
| PyTorch (manual loop) | User controls when scheduler.step() is called | User must handle correctly |
| Hugging Face Transformers Trainer | Steps scheduler per optimizer step | Yes (automatic) |
| PyTorch Lightning | Steps scheduler per optimizer step | Yes (automatic) |
| DeepSpeed | Steps scheduler per optimizer step | Yes (automatic) |
| FSDP (manual) | User controls timing | User must handle correctly |

When writing a manual training loop with gradient accumulation, the scheduler should be stepped inside the same `if (i + 1) % accumulation_steps == 0` block as the optimizer:

```python
if (i + 1) % accumulation_steps == 0:
    optimizer.step()
    scheduler.step()  # Step the scheduler here, not outside
    optimizer.zero_grad()
```

### Warmup considerations

Learning rate warmup interacts cleanly with gradient accumulation as long as the warmup is specified in optimizer steps. For example, "warmup for 500 steps" means 500 optimizer steps (each of which includes N accumulated micro-batches), not 500 micro-batch iterations.

## DeepSpeed integration details

[DeepSpeed](/wiki/deepspeed) provides the most comprehensive built-in support for gradient accumulation among distributed training frameworks. Understanding its internals helps avoid common issues.

### Configuration

DeepSpeed's gradient accumulation is configured through three related parameters:

```json
{
  "train_batch_size": 128,
  "train_micro_batch_size_per_gpu": 4,
  "gradient_accumulation_steps": 8
}
```

Only two of these three parameters need to be specified; DeepSpeed computes the third:

`train_batch_size = train_micro_batch_size_per_gpu * gradient_accumulation_steps * num_gpus`

If all three are specified and inconsistent, DeepSpeed will raise an error.
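
The relation can be mirrored in a few lines. The helper below is a hypothetical illustration of the rule, not DeepSpeed's own code; its name, signature, and error messages are assumptions.

```python
def resolve_batch_config(num_gpus, train_batch_size=None,
                         micro_batch_per_gpu=None, grad_accum_steps=None):
    """Fill in the missing value and check the three for consistency."""
    given = [train_batch_size, micro_batch_per_gpu, grad_accum_steps]
    if sum(v is not None for v in given) < 2:
        raise ValueError("specify at least two of the three values")
    if train_batch_size is None:
        train_batch_size = micro_batch_per_gpu * grad_accum_steps * num_gpus
    elif micro_batch_per_gpu is None:
        micro_batch_per_gpu = train_batch_size // (grad_accum_steps * num_gpus)
    elif grad_accum_steps is None:
        grad_accum_steps = train_batch_size // (micro_batch_per_gpu * num_gpus)
    # Also catches totals that are not evenly divisible.
    if train_batch_size != micro_batch_per_gpu * grad_accum_steps * num_gpus:
        raise ValueError("inconsistent batch configuration")
    return {
        "train_batch_size": train_batch_size,
        "train_micro_batch_size_per_gpu": micro_batch_per_gpu,
        "gradient_accumulation_steps": grad_accum_steps,
    }

print(resolve_batch_config(num_gpus=4, micro_batch_per_gpu=4, grad_accum_steps=8))
print(resolve_batch_config(num_gpus=4, train_batch_size=128, micro_batch_per_gpu=4))
```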

### DeepSpeed's communication optimization

DeepSpeed optimizes gradient synchronization during accumulation. Rather than running an all-reduce after every micro-batch (which would synchronize gradients across GPUs N times per optimizer step), DeepSpeed accumulates gradients locally during each micro-batch step and performs a single all-reduce at the end of the accumulation cycle. This reduces communication overhead by a factor of N.

| Approach | All-reduce operations per optimizer step | Communication overhead |
|---|---|---|
| Naive (sync every micro-batch) | N (one per micro-batch) | High |
| DeepSpeed optimized | 1 (after final micro-batch) | Low |
| PyTorch DDP with no_sync() | 1 (after final micro-batch) | Low |
| FSDP with `no_sync()` | 1 reduction (on the final micro-batch) | Low (but holds unsharded gradients, so uses more memory) |

### ZeRO stages and gradient accumulation

DeepSpeed's [ZeRO](/wiki/zero) (Zero Redundancy Optimizer) memory optimization interacts with gradient accumulation at each stage:

- **ZeRO Stage 1** (optimizer state partitioning): Gradient accumulation works normally. Accumulated gradients are reduced and then partitioned optimizer states are updated.
- **ZeRO Stage 2** (gradient partitioning): Each GPU only stores a partition of the gradients. Accumulation happens on the partitioned gradients, reducing memory further.
- **ZeRO Stage 3** (parameter partitioning): Parameters are gathered on-demand for each micro-batch. This means parameters are gathered N times per optimizer step, which adds overhead but enables training of much larger models.

### FSDP gradient accumulation

PyTorch's Fully Sharded Data Parallel ([FSDP](/wiki/fsdp)), described by Zhao et al. (2023), offers two modes of gradient accumulation [12]:

- **With communication**: FSDP reduces gradients across ranks after each micro-batch but holds the sharded gradients for accumulation. This uses less memory per GPU.
- **Without communication** (using `no_sync()`): FSDP skips the inter-GPU reduction for intermediate micro-batches, accumulating unsharded gradients locally. This uses more memory but eliminates communication for N-1 of the N micro-batches.

A known difference between FSDP and DeepSpeed is that a learning rate tuned for one may not carry over to the other: FSDP may require a higher learning rate to achieve the same convergence rate as DeepSpeed, a gap the Accelerate guide discusses in terms of how the two frameworks handle data precision. Practitioners should verify convergence when switching between frameworks [9].

## Trade-offs and Limitations

While gradient accumulation is extremely useful, it has inherent limitations:

**Increased wall-clock time.** Processing micro-batches sequentially takes longer than processing the full batch in parallel. If the hardware can accommodate a larger batch size, increasing the batch size directly is generally faster than gradient accumulation. Accumulation should be used when memory, not time, is the binding constraint.

**No parallelism benefit.** Unlike data parallelism, gradient accumulation does not utilize additional hardware. It trades time for memory, processing more data per update at the cost of longer time per update.

**Interaction with learning rate.** The optimal learning rate depends on the effective batch size. When using gradient accumulation to increase the effective batch size, the learning rate may need to be adjusted upward following the [linear scaling rule](/wiki/linear_scaling_rule) or similar heuristics.

**Stale statistics.** Beyond batch normalization, any technique that relies on batch-level statistics (e.g., some contrastive learning methods that compute similarity within a batch) will see different behavior with gradient accumulation, since each micro-batch operates on a subset of the effective batch.

**Correct loss normalization.** As the 2024 bug demonstrated, naively averaging token-normalized losses across micro-batches is not equivalent to full-batch training. Use a framework version that normalizes by the total token count, or implement the sum-then-divide pattern yourself [1][10].

## Current Usage

As of early 2026, gradient accumulation is a standard feature of every major deep learning framework and training library:

- **PyTorch** supports it natively, because autograd adds new gradients to each parameter's `.grad` until they are explicitly zeroed
- **[DeepSpeed](/wiki/deepspeed)** and **[FSDP](/wiki/fsdp)** (Fully Sharded Data Parallel) provide built-in configuration options
- **[Hugging Face](/wiki/hugging_face) Transformers** Trainer accepts `gradient_accumulation_steps` as a direct argument [11]
- **PyTorch Lightning** and **Accelerate** abstract it behind simple configuration flags
- **[JAX](/wiki/jax)** / **Flax** implementations use explicit gradient accumulation in training loops, often combined with `pmap` for data parallelism

The technique remains essential because GPU memory growth has not kept pace with model size growth. Even the NVIDIA H100, which offers 80 GB of HBM3 memory in its SXM form factor, cannot hold the activations for large batch sizes during training of models with tens of billions of parameters [7][8]. As models continue to scale and training recipes demand ever-larger effective batch sizes, gradient accumulation will remain a foundational technique in the practitioner's toolkit.

## References

1. Han, D., Han, M., and the Unsloth team (2024). "Bug Fixes in LLM Training - Gradient Accumulation." *Unsloth Blog*, October 15, 2024. https://unsloth.ai/blog/gradient
2. "Gradient Accumulation [+ code in PyTorch]." *OpenGenus IQ*. https://iq.opengenus.org/gradient-accumulation/
3. Rajbhandari, S. et al. (2020). "ZeRO: Memory Optimizations Toward Training Trillion Parameter Models." *SC '20*. https://arxiv.org/abs/1910.02054
4. "PyTorch Gradient Accumulation Training Loop." *Thomas Wolf, GitHub Gist*. https://gist.github.com/thomwolf/ac7a7da6b1888c2eeac8ac8b9b05d3d3
5. "Training Overview and Features." *DeepSpeed Documentation*. https://www.deepspeed.ai/training/
6. Touvron, H. et al. (2023). "Llama 2: Open Foundation and Fine-Tuned Chat Models." *arXiv preprint*. https://arxiv.org/abs/2307.09288
7. "Effective Training Techniques." *PyTorch Lightning Documentation*. https://lightning.ai/docs/pytorch/stable/advanced/training_tricks.html
8. "NVIDIA A100 Tensor Core GPU Datasheet." *NVIDIA*. https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/nvidia-a100-datasheet-nvidia-us-2188504-web.pdf
9. "FSDP vs DeepSpeed." *Hugging Face Accelerate Documentation*. https://huggingface.co/docs/accelerate/concept_guides/fsdp_and_deepspeed
10. Zucker, A. and the Hugging Face team (2024). "Fixing Gradient Accumulation." *Hugging Face Blog*, October 16, 2024. https://huggingface.co/blog/gradient_accumulation
11. "Trainer." *Hugging Face Transformers Documentation*. https://huggingface.co/docs/transformers/en/main_classes/trainer
12. Zhao, Y. et al. (2023). "PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel." *arXiv preprint*. https://arxiv.org/abs/2304.11277

