Gradient Accumulation

Updated Jun 23, 2026

Gradient accumulation is a deep learning training technique that simulates a large batch size on limited GPU memory by summing the gradients from several small mini-batches (called micro-batches) and performing a single optimizer step only after the last one. Instead of computing one gradient on a large batch that may not fit in memory, the model runs N smaller forward and backward passes, accumulates their gradients in place, then updates the weights as if it had processed a batch N times larger. The relationship is effective_batch_size = micro_batch_size * accumulation_steps * num_devices. As the Unsloth engineering team put it, "The goal of gradient accumulation is to mimic full batch training with reduced VRAM usage." ^[1]

This technique is fundamental to modern large language model training, where effective batch sizes of millions of tokens are standard but no single GPU can hold even a fraction of that data in memory at once. Llama 2, for example, was trained with a global batch size of 4 million tokens across every model size in the family ^[6]. Gradient accumulation trades wall-clock time for memory: it does not make training faster, but it lets memory-bound hardware reproduce the large-batch recipes used to train frontier models ^[1].

How does gradient accumulation work?

In standard training with stochastic gradient descent (or its variants), each training step consists of:

1. Forward pass: compute predictions on a batch of data
2. Backward pass: compute gradients of the loss with respect to model weights
3. Optimizer step: update weights using the computed gradients
4. Zero gradients: reset accumulated gradients to zero

With gradient accumulation, steps 1 and 2 are repeated N times (where N is the number of accumulation steps) before step 3 is performed. The process becomes:

For each micro-batch i = 1 to N:
- Forward pass on micro-batch i
- Compute loss for micro-batch i, divided by N (to average)
- Backward pass to accumulate gradients
After all N micro-batches:
- Optimizer step: update weights
- Zero gradients: reset for next accumulation cycle

The division of the loss by N in each backward pass ensures that the accumulated gradient is mathematically equivalent to the gradient computed on a single batch of size N times the micro-batch size. (As discussed below, this naive per-micro-batch averaging is correct for a simple mean loss but was the source of a subtle 2024 bug for token-normalized losses.)

Mathematical Equivalence

Let B be the micro-batch size and N be the number of accumulation steps. The effective batch size is B_eff = N * B.

In standard training with batch size B_eff, the gradient update is:

g = (1 / B_eff) * sum_{i=1}^{B_eff} grad(L_i)

With gradient accumulation, processing N micro-batches of size B:

g = (1 / N) * sum_{k=1}^{N} [ (1 / B) * sum_{j=1}^{B} grad(L_{k,j}) ]

These two expressions are mathematically identical: both compute the average gradient over B_eff = N * B examples. This means that, for most purposes, training with gradient accumulation produces the same result as training with the larger batch size directly ^[2]. Hugging Face describes the property plainly: gradient accumulation is "supposed to be mathematically equivalent to full batch training." ^[10]
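The equivalence can be checked numerically. The following is a minimal sketch in plain Python, assuming a toy one-parameter model y_hat = w * x with a mean-squared-error loss; the data values and the helper name mse_grad are illustrative and not taken from any cited source.

```python
# Toy check: accumulated micro-batch gradients equal the full-batch gradient.
# Assumed model: y_hat = w * x with a mean-squared-error loss.

def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w over the given examples."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 15.9]

B = 2              # micro-batch size
N = len(xs) // B   # accumulation steps, so B_eff = N * B = 8

# Reference: one gradient over the whole effective batch.
full_batch_grad = mse_grad(w, xs, ys)

# Accumulation: N micro-batch gradients, each scaled by 1/N and summed.
accumulated_grad = 0.0
for k in range(N):
    micro_x = xs[k * B:(k + 1) * B]
    micro_y = ys[k * B:(k + 1) * B]
    # Dividing by N mirrors dividing the loss by the accumulation steps.
    accumulated_grad += mse_grad(w, micro_x, micro_y) / N

print(full_batch_grad, accumulated_grad)
```

The two printed values agree up to floating-point rounding because every micro-batch here has the same size B; unequal micro-batches would need weighting by their actual sizes.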

Does batch normalization break the equivalence?

Batch normalization is the primary exception to this mathematical equivalence. During training it normalizes activations with the mean and variance of the current mini-batch and folds those values into the running estimates used at inference. With gradient accumulation, those statistics are computed over each micro-batch of size B rather than over the full effective batch of size B_eff, so the layer never sees the effective batch as a whole.

This means the normalization statistics are noisier, which can lead to slightly different training dynamics and final performance. In practice, this is often not a significant issue because:

Many modern architectures, including transformers, use Layer Normalization instead of Batch Normalization. Layer Normalization computes statistics per example rather than per batch, so it is unaffected by gradient accumulation.
For architectures that do use Batch Normalization (such as some convolutional neural networks), the difference is usually small unless the micro-batch size is very small (e.g., 1 or 2).

Normalization Type	Affected by Gradient Accumulation?	Explanation
Batch Normalization	Yes	Statistics computed per micro-batch, not effective batch
Layer Normalization	No	Statistics computed per example, batch-size independent
Group Normalization	No	Statistics computed per channel group within each example
RMSNorm	No	Computed per example; standard in modern LLMs
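To make the batch normalization row concrete, here is a small sketch that compares the statistics of one effective batch with those of its micro-batches. The activation values are made up for illustration, and a single feature is used to keep the arithmetic visible.

```python
from statistics import fmean, pvariance

# One feature, eight examples: the "effective batch" (illustrative values).
activations = [0.2, 1.4, -0.7, 3.1, 0.9, -1.2, 2.5, 0.4]
micro_batch_size = 2

# Statistics a batch-norm layer would use if it saw the whole effective batch.
full_stats = (fmean(activations), pvariance(activations))

# Statistics it actually uses under accumulation: one pair per micro-batch.
chunks = [
    activations[i:i + micro_batch_size]
    for i in range(0, len(activations), micro_batch_size)
]
micro_stats = [(fmean(chunk), pvariance(chunk)) for chunk in chunks]

print("effective batch:", full_stats)
for stats in micro_stats:
    print("micro-batch:    ", stats)
```

Each micro-batch normalizes with its own, noisier mean and variance, which is why the accumulated gradient is no longer identical to the large-batch gradient for batch-normalized networks.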

Why does gradient accumulation matter?

Gradient accumulation solves a fundamental tension in deep learning: larger batch sizes generally improve training stability and throughput, but GPU memory is finite.

Memory Constraints

During training, GPU memory must hold:

Model parameters
Optimizer states (e.g., Adam keeps two extra values per parameter: the first- and second-moment estimates)
Activations saved for the backward pass
The gradient tensors themselves

For a 7-billion-parameter model trained with Adam in mixed precision, the FP16 weights take about 14 GB and the two FP32 Adam moment buffers about 56 GB, before counting gradients or an FP32 master copy of the weights. Activations scale linearly with batch size and sequence length. On a GPU with 80 GB of memory (e.g., an NVIDIA A100, which pairs 80 GB of HBM2e with about 2 TB/s of bandwidth), whatever remains after this model state may only accommodate a micro-batch of 1 to 4 sequences ^[3]^[8].
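A back-of-the-envelope sketch shows where a figure like 56 GB comes from. It assumes 1 GB = 10^9 bytes, Adam with two FP32 moment buffers, and the 2 + 2 + 12 bytes-per-parameter accounting for mixed-precision Adam described in the ZeRO paper ^[3]; the helper name tensor_set_gb is hypothetical.

```python
def tensor_set_gb(num_params, bytes_per_param):
    """Memory in GB (10**9 bytes) for one parameter-sized tensor set."""
    return num_params * bytes_per_param / 1e9

num_params = 7_000_000_000

fp16_weights = tensor_set_gb(num_params, 2)            # working copy of the weights
fp16_gradients = tensor_set_gb(num_params, 2)          # gradients in half precision
fp32_master_weights = tensor_set_gb(num_params, 4)     # FP32 copy kept by the optimizer
fp32_adam_moments = 2 * tensor_set_gb(num_params, 4)   # first and second moments

total_model_state = (
    fp16_weights + fp16_gradients + fp32_master_weights + fp32_adam_moments
)

print(f"FP16 weights:        {fp16_weights:.0f} GB")
print(f"FP32 Adam moments:   {fp32_adam_moments:.0f} GB")
print(f"Full model state:    {total_model_state:.0f} GB")
```

Under these assumptions the Adam moments alone account for 56 GB, and the full unsharded model state exceeds a single 80 GB device, which is why sharding or offloading is usually combined with small micro-batches and accumulation.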

Gradient accumulation allows this limited micro-batch size to be scaled to an effective batch size of hundreds or thousands of sequences, matching the batch sizes used in published training recipes.

Training Stability

Larger effective batch sizes produce gradient estimates with lower variance, leading to more stable training. This is particularly important for:

LLM pretraining, where batch sizes of 2 to 4 million tokens are common
Fine-tuning, where small datasets combined with small batch sizes can cause noisy gradients
Multi-task learning, where each task needs sufficient representation in each update

Comparison: Small vs. Large Effective Batch

Aspect	Small Effective Batch	Large Effective Batch (via accumulation)
Gradient noise	High	Low
Training stability	Lower	Higher
Convergence speed (in updates)	More updates needed	Fewer updates needed
Wall-clock time per update	Shorter	Longer (sequential micro-batches)
Memory usage per GPU	Set by the batch size	Set by the micro-batch size, not the effective batch
Learning rate sensitivity	More sensitive	More tolerant of higher learning rates

Implementation in PyTorch

PyTorch makes gradient accumulation straightforward because its autograd engine accumulates gradients by default. Calling loss.backward() adds the computed gradients to any existing gradients in the .grad attribute of each parameter, rather than replacing them. This is why optimizer.zero_grad() is explicitly called to reset gradients, and why omitting that call (intentionally) enables accumulation ^[4].

A standard PyTorch training loop with gradient accumulation looks like:

accumulation_steps = 4
optimizer.zero_grad()

for i, (inputs, targets) in enumerate(dataloader):
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    loss = loss / accumulation_steps  # Normalize loss
    loss.backward()  # Accumulate gradients
    
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()  # Update weights
        optimizer.zero_grad()  # Reset gradients

Key implementation details:

Loss normalization: The loss is divided by accumulation_steps before calling backward(). This ensures the accumulated gradient is the average over all micro-batches, matching the behavior of training with the full effective batch size. (See the 2024 bug section for the important caveat that dividing by the step count is not the same as dividing by the total token count for token-normalized losses.)
Gradient clipping: If using gradient clipping (common in LLM training), it should be applied after all micro-batches have been accumulated but before the optimizer step, so that clipping operates on the final accumulated gradient.
Learning rate scheduling: The scheduler should step based on actual optimizer updates, not micro-batch iterations. If using a scheduler that steps per batch, it needs to be adjusted to account for the accumulation factor.

Mixed Precision Considerations

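When combining gradient accumulation with mixed precision training (using torch.amp, the current home of what older releases expose as torch.cuda.amp), the gradient scaler should only unscale and step after all micro-batches have been accumulated:

import torch

accumulation_steps = 4
# torch.amp.GradScaler("cuda") is the current spelling; older PyTorch releases
# expose the same class as torch.cuda.amp.GradScaler().
scaler = torch.amp.GradScaler("cuda")

# model, criterion, optimizer, and dataloader are assumed to be defined elsewhere.
optimizer.zero_grad()
for i, (inputs, targets) in enumerate(dataloader):
    with torch.amp.autocast("cuda"):
        outputs = model(inputs)
        loss = criterion(outputs, targets) / accumulation_steps  # average over the cycle

    scaler.scale(loss).backward()  # scaled gradients accumulate in .grad

    if (i + 1) % accumulation_steps == 0:
        scaler.step(optimizer)  # unscales gradients; skips the step on inf/NaN
        scaler.update()         # adjust the loss scale once per optimizer step
        optimizer.zero_grad()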

Implementation in Hugging Face Transformers

The Hugging Face Transformers Trainer exposes gradient accumulation as a single argument, gradient_accumulation_steps, on TrainingArguments. It defaults to 1 (no accumulation), and the documentation describes the effective batch size as per_device_train_batch_size * num_devices * gradient_accumulation_steps ^[11]. When accumulation is enabled, the Trainer counts one logged "step" per optimizer update rather than per micro-batch, so evaluation, logging, and checkpoint saving fire on the accumulation cadence rather than on every forward pass ^[11].

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",               # illustrative path; older releases require it
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,  # effective batch = 4 * num_gpus * 8
    learning_rate=2e-5,
)

What was the 2024 gradient accumulation bug?

In October 2024, the broader LLM training community discovered that the gradient accumulation paths in many popular trainers were not, in fact, mathematically equivalent to full-batch training for the cross-entropy loss used in causal language modeling. The bug was rediscovered and reported by researcher Benjamin Marie, shared publicly by Unsloth on October 15, 2024, and patched in Hugging Face Transformers the following day ^[1]^[10].

The root cause was loss normalization. Cross-entropy loss for an LLM is normalized by the number of non-padded (non-ignored) tokens. When gradient accumulation computes that mean loss independently for each micro-batch and then sums the results, the denominators do not combine correctly. Unsloth showed that the naive sum produces a loss that is G times too large, where G is the number of accumulation steps. Dividing by G repairs the scale, but the result still matches full-batch training only when every micro-batch contains the same number of non-padded tokens, which is rarely true for variable-length sequences. As the Unsloth writeup explains, "gradient accumulation should effectively be equivalent mathematically to full batch training," yet the loss curves with and without accumulation diverged until the denominator was fixed ^[1]. Hugging Face described the same symptom: gradient accumulation is "supposed to be mathematically equivalent to full batch training; however, losses did not match between training runs where the setting was toggled on and off." ^[10]

The correct computation, as Hugging Face documented, is to take "the total loss across all batches in a gradient accumulation step divided by the total number of all non padding tokens in those batches," which is not the same as averaging the per-micro-batch losses ^[10]. In code, the fix switches the loss reduction from a per-batch mean to a sum, then divides once by the global token count:

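from torch import nn

# shift_logits: (num_tokens, vocab_size); shift_labels: (num_tokens,), -100 marks ignored positions
loss = nn.functional.cross_entropy(
    shift_logits, shift_labels, ignore_index=-100, reduction="sum"
)
loss = loss / num_items  # total non-padding tokens across the accumulation cycle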

The impact was significant. The issue affected not only single-device accumulation but also Distributed Data Parallel and multi-GPU setups, meaning many published fine-tunes and pretraining runs had been trained with a subtly miscalibrated effective loss. After the fix, Unsloth reported that "all loss curves now match up, showing indeed gradient accumulation is equivalent to full batch training." ^[1] The corresponding Transformers patch landed in pull requests #34191 and #34198, and users were advised to install from the main branch to pick up the fix ^[10].
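The size of the discrepancy is easy to reproduce with plain numbers. The sketch below uses made-up per-token losses for two micro-batches of different lengths and compares the full-batch loss, the naive mean-of-means, and the sum-then-divide pattern; no framework code is involved, and the values are illustrative only.

```python
# Per-token losses for two micro-batches with different non-padding token counts.
micro_batches = [
    [2.0, 2.0, 2.0, 2.0, 2.0, 2.0],   # 6 tokens
    [5.0, 5.0],                        # 2 tokens
]
G = len(micro_batches)  # accumulation steps

# Full-batch reference: a single mean over every token.
all_tokens = [loss for mb in micro_batches for loss in mb]
full_batch_loss = sum(all_tokens) / len(all_tokens)

# Naive accumulation: mean per micro-batch, then divide by G.
naive_loss = sum(sum(mb) / len(mb) for mb in micro_batches) / G

# Fixed accumulation: sum per micro-batch, one division by the total token count.
num_items = sum(len(mb) for mb in micro_batches)
fixed_loss = sum(sum(mb) for mb in micro_batches) / num_items

print(full_batch_loss, naive_loss, fixed_loss)  # 2.75 3.5 2.75
```

The naive version gives the short micro-batch the same weight as the long one, so its two tokens count three times as much as they should. With equal token counts in every micro-batch the three values coincide, which is why the bug went unnoticed for so long.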

Implementation in DeepSpeed

DeepSpeed, Microsoft's distributed training library, provides built-in gradient accumulation support through its configuration system. Rather than manually implementing the accumulation loop, users specify the desired behavior in a JSON configuration file ^[5]:

{
  "train_micro_batch_size_per_gpu": 4,
  "gradient_accumulation_steps": 8,
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 1e-4,
      "weight_decay": 0.01
    }
  }
}

In this configuration, each GPU processes micro-batches of 4 and accumulates gradients over 8 steps, for an effective per-GPU batch size of 32. On 4 GPUs, the global effective batch size is 128.

DeepSpeed handles the loss normalization, gradient synchronization, and optimizer stepping automatically, reducing the risk of implementation errors.

Relationship to Distributed Training

Gradient accumulation and distributed training are complementary techniques that both increase the effective batch size, but through different mechanisms:

Gradient accumulation increases the effective batch size by processing micro-batches sequentially on the same device
Data parallelism increases the effective batch size by processing micro-batches in parallel across multiple devices

In practice, modern LLM training uses both simultaneously. The total effective batch size is:

B_eff = micro_batch_size * accumulation_steps * num_data_parallel_devices

Technique	Mechanism	Speed Impact	Memory Impact	Hardware Requirement
Gradient accumulation	Sequential micro-batches on same GPU	Slower (sequential processing)	No additional memory	Single GPU sufficient
Data parallelism	Parallel micro-batches across GPUs	Faster (parallel processing)	Same per GPU	Multiple GPUs
Both combined	Sequential accumulation on each of multiple GPUs	Balanced trade-off	Same per GPU	Multiple GPUs

Gradient Synchronization

When combining gradient accumulation with data parallelism, an important optimization is to synchronize gradients only after the final micro-batch in each accumulation cycle, not after every micro-batch. In standard PyTorch Distributed Data Parallel (DDP), this is achieved using the no_sync() context manager:

from contextlib import nullcontext

# model is assumed to be wrapped in torch.nn.parallel.DistributedDataParallel.
for i, (inputs, targets) in enumerate(dataloader):
    is_last_micro_batch = (i + 1) % accumulation_steps == 0
    # Skip gradient sync for intermediate micro-batches
    context = nullcontext() if is_last_micro_batch else model.no_sync()

    with context:
        outputs = model(inputs)
        loss = criterion(outputs, targets) / accumulation_steps
        loss.backward()

    if is_last_micro_batch:
        optimizer.step()
        optimizer.zero_grad()

This optimization reduces the communication overhead by a factor of N (the accumulation steps), since the expensive all-reduce operation across GPUs happens once per effective batch rather than once per micro-batch. DeepSpeed handles this automatically based on the gradient_accumulation_steps configuration ^[5].

Use in LLM Training

Gradient accumulation is a standard component of virtually every large language model training pipeline. The scale of modern LLM training makes it indispensable:

Pretraining

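Models like GPT-4, Llama, and Gemini are pretrained with effective batch sizes measured in millions of tokens. For example, every Llama 2 model was trained with a global batch size of 4 million tokens ^[6]. With a sequence length of 4096 tokens, this corresponds to roughly 1000 sequences per batch. How much of that falls to accumulation depends on the data-parallel degree, which is smaller than the GPU count whenever tensor or pipeline parallelism is used: with, say, 512 data-parallel replicas and a micro-batch size of 1 per replica (illustrative figures, not the reported Llama 2 configuration), a factor of roughly 2 remains to be covered by gradient accumulation.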
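The arithmetic generalizes to any token budget. Here is a minimal sketch, assuming "4 million tokens" means 4 x 2^20 and using a hypothetical helper accumulation_steps_needed; the replica counts are illustrative and do not describe the actual Llama 2 cluster layout.

```python
import math

def accumulation_steps_needed(global_batch_tokens, seq_len,
                              micro_batch_size, data_parallel_size):
    """Accumulation steps required to reach a target global batch (in tokens)."""
    sequences_per_update = global_batch_tokens // seq_len
    sequences_per_micro_step = micro_batch_size * data_parallel_size
    return math.ceil(sequences_per_update / sequences_per_micro_step)

global_batch_tokens = 4 * 1024 * 1024   # assumed reading of "4 million tokens"
seq_len = 4096

sequences = global_batch_tokens // seq_len   # 1024 sequences per update
steps_512 = accumulation_steps_needed(global_batch_tokens, seq_len, 1, 512)
steps_1024 = accumulation_steps_needed(global_batch_tokens, seq_len, 1, 1024)

print(sequences, steps_512, steps_1024)  # 1024 2 1
```

Once the data-parallel replicas alone supply the whole batch, accumulation is no longer needed; adding more replicas beyond that point would overshoot the target batch size.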

Many training recipes also use batch size warmup, starting with a smaller effective batch size and increasing it during training. Gradient accumulation provides a clean mechanism for this: simply increase the accumulation steps over time.

Fine-Tuning

Fine-tuning is where gradient accumulation is perhaps most commonly encountered by practitioners, because fine-tuning often happens on single GPUs or small clusters with limited memory:

LoRA and QLoRA fine-tuning of 7B to 70B models on consumer GPUs (24 GB VRAM) typically requires gradient accumulation with 4 to 16 steps
Instruction tuning datasets are often small, making larger effective batch sizes critical for stable training
RLHF and DPO training pipelines use gradient accumulation to process paired preference data efficiently

Practical Batch Size Guidelines for LLM Training

Training Phase	Typical Effective Batch Size (tokens)	Common Micro-Batch Size	Accumulation Steps (varies with GPU count)
Pretraining (small models, under 7B)	500K to 2M	2 to 8 sequences	4 to 32
Pretraining (large models, 70B+)	2M to 8M	1 to 2 sequences	8 to 64
Supervised fine-tuning	32K to 256K	1 to 4 sequences	4 to 16
RLHF / DPO	16K to 128K	1 to 2 sequences	8 to 32

Advanced implementation patterns

Beyond the basic accumulation loop, several implementation patterns address common pitfalls and optimize performance in production training systems.

Handling the last incomplete accumulation cycle

When the number of batches in an epoch is not evenly divisible by the accumulation steps, the last cycle will have fewer micro-batches than expected. This is a subtle but important issue: if the loss is divided by N (the expected accumulation steps) but only M < N micro-batches are processed, the accumulated gradient will be scaled incorrectly. (This is closely related to the 2024 token-normalization bug described above: assuming a fixed denominator instead of measuring the real one is exactly what went wrong at scale.)

There are two approaches to handle this:

Drop the incomplete cycle: Skip the final partial accumulation, similar to dropping the last incomplete batch. This is simpler but wastes some data.
Adjust the normalization: Track the actual number of micro-batches in each cycle and divide by that number instead of the fixed N. This uses all data but requires slightly more bookkeeping.

Most production frameworks (Hugging Face Transformers, PyTorch Lightning) handle this automatically.
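Both options can be expressed as a small planning helper. This is a sketch under the assumption that the number of micro-batches per epoch is known in advance; cycle_sizes is a hypothetical name, not a function from any of the frameworks mentioned.

```python
def cycle_sizes(num_micro_batches, accumulation_steps, drop_last=False):
    """Number of micro-batches in each accumulation cycle of one epoch."""
    full_cycles, remainder = divmod(num_micro_batches, accumulation_steps)
    sizes = [accumulation_steps] * full_cycles
    if remainder and not drop_last:
        sizes.append(remainder)   # final, shorter cycle
    return sizes

# 10 micro-batches with N = 4: two full cycles and one partial cycle of 2.
sizes = cycle_sizes(10, 4)

# "Adjust the normalization": scale each loss by 1 / (actual cycle size).
loss_scales = [1.0 / m for m in sizes]

# "Drop the incomplete cycle": the last two micro-batches are skipped.
sizes_dropped = cycle_sizes(10, 4, drop_last=True)

print(sizes, loss_scales, sizes_dropped)  # [4, 4, 2] [0.25, 0.25, 0.5] [4, 4]
```

In a training loop, the per-cycle scale would replace the fixed division by accumulation_steps, and the optimizer would step when the cycle's last micro-batch has been processed rather than only when the counter is a multiple of N.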

Gradient clipping with accumulation

Gradient clipping is standard practice in LLM training, typically clipping the global gradient norm to a maximum value (e.g., 1.0). When combined with gradient accumulation, the timing of clipping matters:

Clipping approach	Correct?	Behavior
Clip after each micro-batch backward	No	Clips incomplete gradients; final accumulated gradient may be wrong
Clip after all micro-batches, before optimizer step	Yes	Clips the full accumulated gradient; mathematically correct
Clip within DeepSpeed/FSDP framework	Automatic	Frameworks handle the timing correctly

The correct approach is to accumulate all gradients first, then clip the accumulated gradient as a single operation before the optimizer step. Clipping intermediate micro-batch gradients changes the mathematical equivalence between accumulation and direct large-batch training.
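The difference between the two clipping orders shows up even with two-dimensional gradients. The sketch below uses plain Python lists in place of parameter tensors; the gradient values and the helper names are illustrative, and the micro-batch gradients are assumed to be already scaled by 1/N.

```python
import math

def global_norm(grad):
    """L2 norm over all gradient entries."""
    return math.sqrt(sum(g * g for g in grad))

def clip_by_global_norm(grad, max_norm):
    """Rescale grad so that its global norm does not exceed max_norm."""
    norm = global_norm(grad)
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]

micro_grads = [[3.0, 0.0], [0.0, 0.4]]   # two micro-batch gradients
max_norm = 1.0

# Correct: accumulate first, then clip the accumulated gradient once.
accumulated = [sum(column) for column in zip(*micro_grads)]
correct = clip_by_global_norm(accumulated, max_norm)

# Incorrect: clip each micro-batch gradient, then accumulate.
clipped_each = [clip_by_global_norm(g, max_norm) for g in micro_grads]
wrong = [sum(column) for column in zip(*clipped_each)]

print(correct, global_norm(correct))
print(wrong, global_norm(wrong))
```

Clipping per micro-batch shrinks only the large first gradient, so the result both points in a different direction and ends up with a norm above the threshold it was supposed to enforce.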

Logging and metrics tracking

When using gradient accumulation, metrics like loss and learning rate should be logged per optimizer step (i.e., once per accumulation cycle), not per micro-batch. Logging per micro-batch produces N times as many data points and can obscure the true training dynamics.

Similarly, training progress (current step, progress percentage) should count optimizer steps, not micro-batch iterations. A common bug is to report total micro-batch iterations as "steps," which misrepresents how far training has progressed.
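One way to keep logs on the optimizer-step cadence is to buffer micro-batch losses and emit a single averaged record per cycle. The following sketch assumes equally sized micro-batches; the function name and the record format are hypothetical.

```python
def optimizer_step_logs(micro_batch_losses, accumulation_steps):
    """Collapse per-micro-batch losses into one log record per optimizer step."""
    logs = []
    for start in range(0, len(micro_batch_losses), accumulation_steps):
        cycle = micro_batch_losses[start:start + accumulation_steps]
        logs.append({
            "step": len(logs) + 1,             # counts optimizer steps
            "loss": sum(cycle) / len(cycle),   # mean loss over the cycle
        })
    return logs

# Eight micro-batches with N = 4 yield two log records, not eight.
logs = optimizer_step_logs([4.0, 2.0, 3.0, 1.0, 2.0, 2.0, 2.0, 2.0], 4)
print(logs)  # [{'step': 1, 'loss': 2.5}, {'step': 2, 'loss': 2.0}]
```

For token-normalized losses, the per-cycle value would be the summed loss divided by the total token count rather than a mean of micro-batch means.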

Interaction with learning rate scheduling

The interaction between gradient accumulation and learning rate scheduling is a frequent source of subtle bugs. The key principle is that the learning rate scheduler should step based on optimizer steps (once per accumulation cycle), not based on micro-batch iterations.

The problem

Consider a training run with 1000 optimizer steps and a cosine decay schedule configured for those 1000 steps. If the scheduler advances after every micro-batch (instead of every optimizer step) and the accumulation factor is 4, the scheduler completes its full decay after only 250 optimizer steps. Assuming the schedule holds its final value once finished, the remaining 750 steps run at the minimum learning rate. This effectively truncates the schedule and can significantly hurt performance.
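The truncation can be seen by evaluating the schedule under both stepping conventions. This is a sketch with a hand-written cosine decay that is assumed to hold its minimum after the configured number of steps; the learning-rate values and function names are illustrative, and library schedulers may behave differently once they run past their configured length.

```python
import math

def cosine_lr(tick, total_ticks, max_lr=1e-3, min_lr=1e-5):
    """Cosine decay from max_lr to min_lr, held at min_lr after total_ticks."""
    progress = min(tick / total_ticks, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

total_optimizer_steps = 1000
accumulation_steps = 4

def lr_at_optimizer_step(opt_step, scheduler_steps_per_micro_batch):
    """Learning rate in effect after opt_step optimizer updates."""
    if scheduler_steps_per_micro_batch:
        ticks = opt_step * accumulation_steps   # bug: one tick per micro-batch
    else:
        ticks = opt_step                        # correct: one tick per update
    return cosine_lr(ticks, total_optimizer_steps)

correct_lr = lr_at_optimizer_step(250, scheduler_steps_per_micro_batch=False)
buggy_lr = lr_at_optimizer_step(250, scheduler_steps_per_micro_batch=True)

print(f"correct: {correct_lr:.2e}  buggy: {buggy_lr:.2e}")
```

After a quarter of training, the correctly stepped schedule is still near its peak, while the per-micro-batch version has already reached the minimum learning rate.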

Framework behavior

Different frameworks handle this differently:

Framework	Default LR scheduler behavior	Accumulation-aware?
PyTorch (manual loop)	User controls when scheduler.step() is called	User must handle correctly
Hugging Face Transformers Trainer	Steps scheduler per optimizer step	Yes (automatic)
PyTorch Lightning	Steps scheduler per optimizer step	Yes (automatic)
DeepSpeed	Steps scheduler per optimizer step	Yes (automatic)
FSDP (manual)	User controls timing	User must handle correctly

When writing a manual training loop with gradient accumulation, the scheduler should be stepped inside the same if (i + 1) % accumulation_steps == 0 block as the optimizer:

if (i + 1) % accumulation_steps == 0:
    optimizer.step()
    scheduler.step()  # Step the scheduler here, not outside
    optimizer.zero_grad()

Warmup considerations

Learning rate warmup interacts cleanly with gradient accumulation as long as the warmup is specified in optimizer steps. For example, "warmup for 500 steps" means 500 optimizer steps (each of which includes N accumulated micro-batches), not 500 micro-batch iterations.

DeepSpeed integration details

DeepSpeed provides the most comprehensive built-in support for gradient accumulation among distributed training frameworks. Understanding its internals helps avoid common issues.

Configuration

DeepSpeed's gradient accumulation is configured through three related parameters:

{
  "train_batch_size": 128,
  "train_micro_batch_size_per_gpu": 4,
  "gradient_accumulation_steps": 8
}

Only two of these three parameters need to be specified; DeepSpeed computes the third:

train_batch_size = train_micro_batch_size_per_gpu * gradient_accumulation_steps * num_gpus

If all three are specified and inconsistent, DeepSpeed will raise an error.
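The relationship can be sketched as a small resolver that fills in the missing value and rejects inconsistent settings. This mirrors the rule stated above and is not DeepSpeed's own code; resolve_batch_config and its argument names are hypothetical.

```python
def resolve_batch_config(num_gpus, train_batch_size=None,
                         micro_batch_per_gpu=None, grad_accum_steps=None):
    """Fill in whichever batch parameter is missing; raise if they disagree."""
    values = (train_batch_size, micro_batch_per_gpu, grad_accum_steps)
    if sum(v is not None for v in values) < 2:
        raise ValueError("specify at least two of the three batch parameters")

    if train_batch_size is None:
        train_batch_size = micro_batch_per_gpu * grad_accum_steps * num_gpus
    elif micro_batch_per_gpu is None:
        micro_batch_per_gpu = train_batch_size // (grad_accum_steps * num_gpus)
    elif grad_accum_steps is None:
        grad_accum_steps = train_batch_size // (micro_batch_per_gpu * num_gpus)

    # Catches both contradictory inputs and batch sizes that do not divide evenly.
    if train_batch_size != micro_batch_per_gpu * grad_accum_steps * num_gpus:
        raise ValueError("inconsistent batch configuration")
    return train_batch_size, micro_batch_per_gpu, grad_accum_steps

# The configuration shown above, resolved on 4 GPUs.
print(resolve_batch_config(4, micro_batch_per_gpu=4, grad_accum_steps=8))
# (128, 4, 8)
```

Note that the result depends on the number of GPUs, so the same configuration file yields a different train_batch_size or accumulation count when the job is launched on a different cluster size.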

DeepSpeed's communication optimization

DeepSpeed optimizes gradient synchronization during accumulation. Rather than running an all-reduce after every micro-batch (which would synchronize gradients across GPUs N times per optimizer step), DeepSpeed averages gradients locally during each micro-batch step and performs a single all-reduce at the end of the accumulation cycle. This reduces communication overhead by a factor of N.

Approach	All-reduce operations per optimizer step	Communication overhead
Naive (sync every micro-batch)	N (one per micro-batch)	High
DeepSpeed optimized	1 (after final micro-batch)	Low
PyTorch DDP with no_sync()	1 (after final micro-batch)	Low
FSDP without communication	0 all-reduces (one reduce-scatter on the final micro-batch)	Lowest (but uses more memory)

ZeRO stages and gradient accumulation

DeepSpeed's ZeRO (Zero Redundancy Optimizer) memory optimization interacts with gradient accumulation at each stage:

ZeRO Stage 1 (optimizer state partitioning): Gradient accumulation works normally. Accumulated gradients are reduced and then partitioned optimizer states are updated.
ZeRO Stage 2 (gradient partitioning): Each GPU only stores a partition of the gradients. Accumulation happens on the partitioned gradients, reducing memory further.
ZeRO Stage 3 (parameter partitioning): Parameters are gathered on-demand for each micro-batch. This means parameters are gathered N times per optimizer step, which adds overhead but enables training of much larger models.

FSDP gradient accumulation

PyTorch's Fully Sharded Data Parallel (FSDP), described by Zhao et al. (2023), offers two modes of gradient accumulation ^[12]:

With communication: FSDP reduces gradients across ranks after each micro-batch but holds the sharded gradients for accumulation. This uses less memory per GPU.
Without communication (using no_sync()): FSDP skips the inter-GPU reduction for intermediate micro-batches, accumulating unsharded gradients locally. This uses more memory but eliminates communication for N-1 of the N micro-batches.

A known difference between FSDP and DeepSpeed is that FSDP may require a higher learning rate to achieve the same convergence rate as DeepSpeed, particularly when using different communication strategies during accumulation. Practitioners should verify convergence when switching between frameworks ^[9].

Trade-offs and Limitations

While gradient accumulation is extremely useful, it has inherent limitations:

Increased wall-clock time. Processing micro-batches sequentially takes longer than processing the full batch in parallel. If the hardware can accommodate a larger batch size, increasing the batch size directly is generally faster than gradient accumulation. Accumulation should be used when memory, not time, is the binding constraint.

No parallelism benefit. Unlike data parallelism, gradient accumulation does not utilize additional hardware. It trades time for memory, processing more data per update at the cost of longer time per update.

Interaction with learning rate. The optimal learning rate depends on the effective batch size. When using gradient accumulation to increase the effective batch size, the learning rate may need to be adjusted upward following the linear scaling rule or similar heuristics.

Stale statistics. Beyond batch normalization, any technique that relies on batch-level statistics (e.g., some contrastive learning methods that compute similarity within a batch) will see different behavior with gradient accumulation, since each micro-batch operates on a subset of the effective batch.

Correct loss normalization. As the 2024 bug demonstrated, naively averaging token-normalized losses across micro-batches is not equivalent to full-batch training. Use a framework version that normalizes by the total token count, or implement the sum-then-divide pattern yourself ^[1]^[10].
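For the learning-rate interaction noted in this list, the linear scaling rule multiplies a reference learning rate by the ratio of effective batch sizes. The sketch below also includes square-root scaling as one example of a "similar heuristic"; both are starting points for tuning rather than guarantees, and the function name and reference values are illustrative.

```python
def scaled_lr(base_lr, base_batch_size, effective_batch_size, rule="linear"):
    """Rescale a reference learning rate for a new effective batch size."""
    ratio = effective_batch_size / base_batch_size
    if rule == "linear":
        return base_lr * ratio           # linear scaling rule
    if rule == "sqrt":
        return base_lr * ratio ** 0.5    # square-root scaling heuristic
    raise ValueError(f"unknown rule: {rule}")

micro_batch_size, accumulation_steps, num_devices = 4, 8, 4
effective_batch_size = micro_batch_size * accumulation_steps * num_devices   # 128

# Reference recipe (assumed): lr 1e-4 tuned at batch size 32.
linear_lr = scaled_lr(1e-4, 32, effective_batch_size, rule="linear")
sqrt_lr = scaled_lr(1e-4, 32, effective_batch_size, rule="sqrt")

print(effective_batch_size, linear_lr, sqrt_lr)
```

Whichever rule is used, the effective batch size fed into it should include the accumulation factor, not just the per-device micro-batch size.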

Current Usage

As of early 2026, gradient accumulation is a standard, universally supported feature across all major deep learning frameworks and training libraries:

PyTorch supports it natively, because loss.backward() adds to existing .grad values until they are explicitly zeroed
DeepSpeed and FSDP (Fully Sharded Data Parallel) provide built-in configuration options
Hugging Face Transformers Trainer accepts gradient_accumulation_steps as a direct argument ^[11]
PyTorch Lightning and Accelerate abstract it behind simple configuration flags
JAX / Flax implementations use explicit gradient accumulation in training loops, often combined with pmap for data parallelism

The technique remains essential because GPU memory growth has not kept pace with model size growth. Even the NVIDIA H100, which offers 80 GB of HBM3 memory in its SXM form factor, cannot hold the activations for large batch sizes during training of models with tens of billions of parameters ^[7]^[8]. As models continue to scale and training recipes demand ever-larger effective batch sizes, gradient accumulation will remain a foundational technique in the practitioner's toolkit.

References

1. Han, D., Han, M., and the Unsloth team (2024). "Bug Fixes in LLM Training - Gradient Accumulation." *Unsloth Blog*, October 15, 2024. https://unsloth.ai/blog/gradient
2. "Gradient Accumulation [+ code in PyTorch]." *OpenGenus IQ*. https://iq.opengenus.org/gradient-accumulation/
3. Rajbhandari, S. et al. (2020). "ZeRO: Memory Optimizations Toward Training Trillion Parameter Models." *SC '20*. https://arxiv.org/abs/1910.02054
4. "PyTorch Gradient Accumulation Training Loop." *Thomas Wolf, GitHub Gist*. https://gist.github.com/thomwolf/ac7a7da6b1888c2eeac8ac8b9b05d3d3
5. "Training Overview and Features." *DeepSpeed Documentation*. https://www.deepspeed.ai/training/
6. Touvron, H. et al. (2023). "Llama 2: Open Foundation and Fine-Tuned Chat Models." *arXiv preprint*. https://arxiv.org/abs/2307.09288
7. "Effective Training Techniques." *PyTorch Lightning Documentation*. https://lightning.ai/docs/pytorch/stable/advanced/training_tricks.html
8. "NVIDIA A100 Tensor Core GPU Datasheet." *NVIDIA*. https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/nvidia-a100-datasheet-nvidia-us-2188504-web.pdf
9. "FSDP vs DeepSpeed." *Hugging Face Accelerate Documentation*. https://huggingface.co/docs/accelerate/concept_guides/fsdp_and_deepspeed
10. Zucker, A. and the Hugging Face team (2024). "Fixing Gradient Accumulation." *Hugging Face Blog*, October 16, 2024. https://huggingface.co/blog/gradient_accumulation
11. "Trainer." *Hugging Face Transformers Documentation*. https://huggingface.co/docs/transformers/en/main_classes/trainer
12. Zhao, Y. et al. (2023). "PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel." *arXiv preprint*. https://arxiv.org/abs/2304.11277
