# Gradient clipping

> Source: https://aiwiki.ai/wiki/gradient_clipping
> Updated: 2026-07-11
> Categories: Training & Optimization
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Gradient clipping** is a training technique that caps the magnitude of [gradient](/wiki/gradient) values before they update model weights, preventing the excessively large [parameter updates](/wiki/parameter_update) that destabilize [neural network](/wiki/neural_network) training. It is the standard fix for the [exploding gradient problem](/wiki/exploding_gradient_problem): the most common form rescales the entire gradient vector whenever its global L2 norm exceeds a fixed threshold, and a threshold of 1.0 is the de facto default used by nearly every published large language model recipe, including GPT-3, Llama 2, and DeepSeek-V3.[1][17][16][15] Gradient clipping is cheap, adds one line to a training loop, and can prevent NaN divergence that would otherwise waste weeks of compute.

The two dominant variants are clip-by-value, which caps each individual gradient component to a fixed range, and clip-by-norm, which rescales the entire gradient vector when its overall magnitude exceeds a threshold. The norm version, formalized by Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio in their 2013 ICML paper "On the Difficulty of Training Recurrent Neural Networks," is by far the more common form in current practice.[1] Mikolov had already proposed the basic idea in his 2012 PhD thesis at Brno University of Technology, where he used it to train recurrent language models that would otherwise diverge.[2]

Although it is a small change to a training loop, clipping sits at the center of how billion-parameter models stay numerically stable across months of training. The remainder of this article walks through the math, the major variants, the empirical thresholds used in production [LLMs](/wiki/llm), the interaction with [distributed training](/wiki/distributed_training) and [mixed precision training](/wiki/mixed_precision_training), the recent adaptive methods that try to do better than a fixed threshold, and the connection to differential privacy through DP-SGD.

## Why is gradient clipping needed?

During [backpropagation](/wiki/backpropagation), the chain rule produces gradients of the loss with respect to each parameter. In a deep feedforward network, those gradients are products of many Jacobians stacked layer by layer. In a [recurrent neural network](/wiki/recurrent_neural_network), they are products across time steps using the same recurrent weight matrix at each step, an operation known as backpropagation through time (BPTT). When the spectral radius of those repeated linear operators sits above one, the gradient magnitude can grow exponentially with depth or sequence length. This is the [exploding gradient](/wiki/exploding_gradient_problem) problem.

Bengio, Simard, and Frasconi formalized the analytical version of this in 1994, showing that recurrent networks face a structural tension: the same eigenvalue conditions that allow gradients to flow over long horizons also expose training to gradient blowup.[3] The practical consequences are familiar to anyone who has trained a deep network without safeguards:

- Floating-point overflow turns gradients into NaN or infinity, after which every parameter update produces NaN, and the model is dead.
- Even before overflow, a single huge gradient can move parameters far outside the region where the local linear approximation underlying [gradient descent](/wiki/gradient_descent) is valid, sending the loss into a much worse part of the landscape.
- Loss curves develop sharp spikes that may or may not recover. When they do not recover, the training run is wasted.
- In autoregressive language modeling, a single divergence event can corrupt activation statistics for all subsequent steps and force a restart from a previous checkpoint.

Clipping does not fix the underlying conditioning of the optimization problem. It just bounds how badly any one step can hurt. That bound turns out to be enough for nearly all practical training, which is why the technique has stuck around for more than a decade with very little change to its basic form.

## Methods

### Clip by value

Clip-by-value, sometimes called elementwise clipping, treats each scalar entry of the gradient tensor independently. Given a threshold `c`, each gradient component `g_i` is replaced by

$$
g_i \leftarrow \max(-c, \min(c, g_i))
$$

Values inside $$[-c, c]$$ are unchanged; values outside are pulled to the nearest endpoint. The implementation is one line in any framework, and the cost is a single elementwise operation over the parameters.

The drawback is that clipping different components by different amounts changes the direction of the resulting gradient vector. If one or two coordinates are very large and the rest are moderate, value clipping can shrink the dominant coordinates while leaving the others alone, rotating the descent direction away from the true negative gradient. For convex objectives this slows convergence; for nonconvex objectives it can push the optimizer toward a different basin entirely. Practitioners use value clipping mostly when they want a hard worst-case bound on each weight update, for example in [GANs](/wiki/generative_adversarial_network_gan) or in custom optimizers where elementwise control is convenient.
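
The rotation effect described above can be seen on a three-component gradient. This is a minimal sketch in plain Python; the helper names `clip_by_value` and `cosine_similarity` are illustrative and not taken from any framework.

```python
import math

def clip_by_value(grad, c):
    """Clamp each component of grad to the range [-c, c]."""
    return [max(-c, min(c, g)) for g in grad]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

grad = [10.0, 0.5, -0.2]                 # one dominant coordinate
clipped = clip_by_value(grad, c=1.0)

print(clipped)                                  # [1.0, 0.5, -0.2]
print(cosine_similarity(grad, clipped) < 1.0)   # True: the direction has rotated
```

Only the dominant coordinate is touched, so the clipped vector no longer points along the original gradient.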

### Clip by norm

Clip-by-norm treats the gradient as a single vector and rescales it whenever its overall length exceeds a threshold. With the L2 (Euclidean) norm, the rule is

$$
\text{if } \lVert g \rVert_2 > \tau: \quad g \leftarrow g \cdot \frac{\tau}{\lVert g \rVert_2}
$$

or equivalently $$g \leftarrow g \cdot \min(1, \tau / \lVert g \rVert_2)$$. When the norm is below `τ`, the gradient passes through unchanged. When it is above, the entire vector is shrunk by a single scalar factor, which preserves direction exactly. Only step length is affected.

Direction preservation is the main reason norm clipping has displaced value clipping in mainstream practice. Pascanu et al. demonstrated empirically that direction-preserving rescaling stabilized RNN training without introducing the systematic bias that elementwise clipping does.[1] Their suggested operating range was a threshold somewhere between half and ten times the typical observed gradient norm during a stable run; the modern convention of using values like 1.0 or 5.0 grew out of that recipe.
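
As a rough illustration of the norm rule, the sketch below applies it to a two-component gradient in plain Python. The function names are illustrative, and the zero-norm guard is an added assumption to avoid dividing by zero.

```python
import math

def l2_norm(grad):
    """Euclidean length of the gradient vector."""
    return math.sqrt(sum(g * g for g in grad))

def clip_by_norm(grad, tau):
    """Rescale grad so its L2 norm is at most tau; the direction is unchanged."""
    norm = l2_norm(grad)
    # Assumption: an all-zero gradient is passed through untouched.
    scale = min(1.0, tau / norm) if norm > 0.0 else 1.0
    return [g * scale for g in grad]

grad = [3.0, 4.0]                      # L2 norm is 5.0
print(clip_by_norm(grad, tau=1.0))     # approximately [0.6, 0.8], norm 1.0
print(clip_by_norm(grad, tau=10.0))    # [3.0, 4.0]: below threshold, unchanged
```

Both components are multiplied by the same factor, so the ratio between them (and therefore the direction) is the same before and after clipping.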

### Global norm versus per-parameter norm

Norm clipping admits two natural granularities. Global norm clipping concatenates every parameter's gradient into one logical vector, computes the L2 norm over the whole thing, and applies a single rescaling factor to all parameters when that combined norm exceeds the threshold. Per-parameter norm clipping computes a separate norm for each parameter tensor (each weight matrix, each bias vector) and rescales each one independently against its own threshold.

Global clipping is the standard. It treats the model as a single point in a single vector space, which matches the geometry of [SGD](/wiki/stochastic_gradient_descent_sgd) and [Adam](/wiki/adam_optimizer). Per-parameter clipping can over-shrink some tensors while under-shrinking others, distorting the overall update direction in much the same way that elementwise clipping does. Per-parameter is occasionally useful for debugging or for architectures with very heterogeneous parameter scales, but it has not displaced global clipping in mainstream training.
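
The difference between the two granularities shows up in the relative scale between tensors. The sketch below is a plain-Python illustration with two toy "tensors" represented as lists; all names are hypothetical.

```python
import math

def l2_norm(values):
    return math.sqrt(sum(v * v for v in values))

def clip_global(tensors, tau):
    """Apply one scale factor derived from the norm over all tensors."""
    total = math.sqrt(sum(v * v for t in tensors for v in t))
    scale = min(1.0, tau / total) if total > 0.0 else 1.0
    return [[v * scale for v in t] for t in tensors]

def clip_per_tensor(tensors, tau):
    """Apply a separate scale factor to each tensor."""
    clipped = []
    for t in tensors:
        norm = l2_norm(t)
        scale = min(1.0, tau / norm) if norm > 0.0 else 1.0
        clipped.append([v * scale for v in t])
    return clipped

grads = [[6.0, 8.0], [0.3, 0.4]]       # tensor norms 10.0 and 0.5

g = clip_global(grads, tau=1.0)
p = clip_per_tensor(grads, tau=1.0)

ratio_global = l2_norm(g[0]) / l2_norm(g[1])
ratio_per_tensor = l2_norm(p[0]) / l2_norm(p[1])
print(ratio_global)        # about 20: the relative scale between tensors is kept
print(ratio_per_tensor)    # about 2: the large tensor was shrunk, the small one was not
```

Global clipping keeps the 20:1 ratio of the original gradient, while per-tensor clipping changes it, which is the distortion of the overall update direction described above.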

| Method | What it bounds | Direction preserved | Cost | Where it shows up |
|---|---|---|---|---|
| Clip by value | Each scalar component to $$[-c, c]$$ | No | One elementwise op | Older RNN code, GANs, custom losses |
| Clip by L2 norm (per-tensor) | Each parameter tensor's norm | Per-tensor only | One reduction per tensor | Specialized debugging, mixed scales |
| Clip by L2 global norm | Global vector norm across all params | Yes (globally) | One global reduction | RNNs, transformers, LLM pretraining |
| Clip by infinity norm | Largest absolute component | Yes (uniform rescale) | One max reduction | Rare, theoretical interest |

### Adaptive Gradient Clipping (AGC)

Adaptive Gradient Clipping was introduced by Brock, De, Smith, and Simonyan in the 2021 NFNet paper, "High-Performance Large-Scale Image Recognition Without Normalization."[4] The idea is to set the clip threshold for each parameter unit as a multiple of the parameter's own norm, instead of using a single fixed value across the model. Concretely, for a parameter row `W_i` with gradient `G_i`,

$$
\text{if } \frac{\lVert G_i \rVert}{\lVert W_i \rVert} > \lambda: \quad G_i \leftarrow G_i \cdot \lambda \cdot \frac{\lVert W_i \rVert}{\lVert G_i \rVert}
$$

The coefficient `λ` is a small constant. Brock et al. used λ = 0.01 for every parameter except the final fully-connected classifier layer (which had AGC turned off entirely) when training NFNet-F0 through F6 at batch size 4096; smaller batch sizes such as 128 to 256 tolerated a looser λ around 0.16, while larger batches required tighter clipping for stability. The justification is that the magnitude of a sensible weight update should scale with the magnitude of the weight itself; clipping in absolute units conflates very different parameter scales. AGC was the key ingredient that allowed NFNets to match or beat the accuracy of [batch normalized](/wiki/batch_normalization) ResNets on ImageNet without using normalization layers, with NFNet-F5 reaching 86.0% top-1 accuracy on ImageNet.
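
A minimal sketch of the unit-wise rule, written in plain Python over lists of rows. The function name `agc_clip_rows` is hypothetical, and the small floor `eps` on the weight norm is an assumption added here so that rows with zero weights do not have their gradients clipped to zero.

```python
import math

def agc_clip_rows(weights, grads, lam=0.01, eps=1e-3):
    """Clip each gradient row so that ||G_i|| <= lam * ||W_i||."""
    clipped = []
    for w_row, g_row in zip(weights, grads):
        # Assumption: floor the weight norm at eps to handle all-zero rows.
        w_norm = max(math.sqrt(sum(w * w for w in w_row)), eps)
        g_norm = math.sqrt(sum(g * g for g in g_row))
        max_norm = lam * w_norm
        if g_norm > max_norm:
            scale = max_norm / g_norm
            clipped.append([g * scale for g in g_row])
        else:
            clipped.append(list(g_row))
    return clipped

weights = [[30.0, 40.0], [0.3, 0.4]]    # row norms 50.0 and 0.5
grads = [[0.03, 0.04], [0.03, 0.04]]    # identical gradients, norm 0.05

out = agc_clip_rows(weights, grads, lam=0.01)
print(out[0])   # unchanged: 0.05 is below 0.01 * 50.0
print(out[1])   # shrunk to norm 0.005, i.e. 0.01 * 0.5
```

The same gradient is left alone for the large-norm row and shrunk tenfold for the small-norm row, which is the scale-relative behavior that distinguishes AGC from a fixed threshold.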

### AdaGC and ZClip

For [LLM](/wiki/llm) pretraining, where a single fixed threshold across hundreds of billions of parameters is increasingly seen as too blunt, two newer adaptive methods have appeared.

**AdaGC** (Wang et al., 2025) tracks an exponential moving average (EMA) of each parameter tensor's gradient norm and clips against a relative threshold derived from that history.[5] The per-tensor EMA is updated as $$\gamma_{t,i} = \beta \cdot \gamma_{t-1,i} + (1 - \beta) \cdot \lVert g_{t,i} \rVert$$, and the clip factor is $$\min(\lambda_{\text{rel}} \cdot \gamma_{t-1,i} / \lVert g_{t,i} \rVert, 1.0)$$. On Llama-2 7B and 13B pretraining, AdaGC eliminated visible loss spikes while reducing WikiText perplexity by about 3.5% relative to global clipping.
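
The EMA update and clip factor above can be sketched in a few lines of plain Python. This is an illustration of the two formulas only, not the authors' implementation: the class name, the default values of `lambda_rel` and `beta`, the choice to seed the EMA with the first observed norm, and the choice to feed the post-clip norm into the EMA are all assumptions.

```python
import math

class EmaNormClipper:
    """Per-tensor clipping against an EMA of past gradient norms (sketch)."""

    def __init__(self, lambda_rel=1.05, beta=0.98):
        self.lambda_rel = lambda_rel
        self.beta = beta
        self.ema = {}                     # tensor name -> EMA of gradient norm

    def clip(self, name, grad):
        norm = math.sqrt(sum(g * g for g in grad))
        prev = self.ema.get(name)
        if prev is None or norm == 0.0:
            factor = 1.0                  # no history yet, or nothing to clip
        else:
            factor = min(self.lambda_rel * prev / norm, 1.0)
        clipped_norm = norm * factor
        if prev is None:
            self.ema[name] = clipped_norm  # assumption: seed with first norm
        else:
            self.ema[name] = self.beta * prev + (1.0 - self.beta) * clipped_norm
        return [g * factor for g in grad]

clipper = EmaNormClipper()
first = clipper.clip("w", [3.0, 4.0])      # norm 5.0, passes through
second = clipper.clip("w", [30.0, 40.0])   # norm 50.0, a 10x spike
print(first)                               # [3.0, 4.0]
print(math.sqrt(sum(g * g for g in second)))   # about 5.25 = 1.05 * 5.0
```

The spike is cut back to slightly above the tensor's own recent norm, with no absolute threshold involved.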

**ZClip** (Kumar, Owen et al., April 2025) takes a similar EMA-of-gradient-norms approach but frames the clip decision as a z-score test.[6] At each step it tracks the running mean and standard deviation of the gradient norm and only clips when the current norm sits more than a configurable number of standard deviations above the running mean, treating the spike as a statistical anomaly rather than relying on an absolute cutoff. The authors report that on 1B-parameter Llama-style models, ZClip outperforms both fixed-threshold clipping and percentile-based methods, and broadens the range of learning rates that train stably.
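
A simplified sketch of a z-score clip decision follows. It is not the authors' code: the class name, the initial statistics, the values of `z_thresh` and `alpha`, and the rule of pulling a spiking norm back to `mean + z_thresh * std` are assumptions chosen for illustration, and the published method may adjust spikes differently.

```python
import math

class ZScoreClipper:
    """Clip only when the gradient norm is a statistical outlier (sketch)."""

    def __init__(self, init_mean, init_var, z_thresh=2.5, alpha=0.97):
        self.mean = init_mean       # running mean of the gradient norm
        self.var = init_var         # running variance of the gradient norm
        self.z_thresh = z_thresh
        self.alpha = alpha

    def scale_for(self, norm):
        """Return the factor to multiply the gradient by at this step."""
        std = math.sqrt(self.var)
        limit = self.mean + self.z_thresh * std
        is_spike = std > 0.0 and norm > limit
        used = limit if is_spike else norm
        scale = used / norm if norm > 0.0 else 1.0
        # Update running statistics with the (possibly clipped) norm.
        dev = used - self.mean
        self.mean = self.alpha * self.mean + (1.0 - self.alpha) * used
        self.var = self.alpha * self.var + (1.0 - self.alpha) * dev * dev
        return scale

clipper = ZScoreClipper(init_mean=1.0, init_var=0.01)   # std 0.1
print(clipper.scale_for(1.1))   # 1.0: within 2.5 standard deviations
print(clipper.scale_for(5.0))   # roughly 0.25: treated as an anomaly
```

Because the limit moves with the running statistics, the same absolute norm can be accepted early in training and clipped later once typical norms have fallen.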

Both methods aim at the same observation: the typical gradient norm during pretraining shifts substantially over the course of a run, and a fixed $$\tau = 1.0$$ threshold that is appropriately tight at step 100K may be too loose at step 1M.

## Where in the training loop is gradient clipping applied?

Gradient clipping is always applied between the backward pass and the optimizer step. The standard ordering in PyTorch is:

```python
optimizer.zero_grad()
output = model(input)
loss = loss_fn(output, target)
loss.backward()                                    # populates .grad on each param
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()                                    # consumes the (now clipped) .grad
```

Clipping after the optimizer step would clip the wrong tensor; clipping before the backward pass would clip nothing. The order also matters with respect to gradient accumulation, mixed precision, and distributed reduction, all discussed below.

## What clip threshold should you use?

The clip threshold `τ` is a [hyperparameter](/wiki/hyperparameter) that interacts with the [learning rate](/wiki/learning_rate), batch size, and model architecture. There is no single correct value, but the empirical literature has converged on a small set of conventions. Pascanu et al. recommended setting the threshold somewhere between half and ten times the average gradient norm observed during stable training, and most modern defaults sit at the low end of that range.[1]

| Setting | Typical global-norm threshold | Notes |
|---|---|---|
| Generic deep learning | 1.0 | Standard default for new training loops |
| LLM pretraining (decoder-only transformers) | 1.0 | Used by GPT-3, Llama 2, DeepSeek-V3 |
| Smaller transformers / fine-tuning | 1.0 to 5.0 | Less aggressive when batches are small |
| Vision transformers | 1.0 | Often paired with cosine schedule |
| Reinforcement learning (PPO, A2C) | 0.5 | Tighter clipping to handle non-stationarity |
| Compressive Transformer (memory-augmented) | 0.1 | Aggressive clipping on long-context recurrence |
| Value clipping (when used) | $$c \in [0.5, 1.0]$$ | Range applied per element |

A practical recipe is to log the gradient norm for the first few thousand steps without clipping, observe the typical magnitude, and set `τ` slightly above the median observed norm so that clipping triggers only on the genuine spikes. The PyTorch `clip_grad_norm_` function returns the pre-clipping total norm precisely so it can be logged for this purpose.
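
One way to turn logged norms into a threshold is sketched below. The function name and the `margin` multiplier of 1.5 are hypothetical; "slightly above the median" is a judgment call that depends on the run.

```python
import statistics

def suggest_threshold(logged_norms, margin=1.5):
    """Median of the logged pre-clip norms times a safety margin."""
    return margin * statistics.median(logged_norms)

# Pre-clip gradient norms logged during an unclipped warm-up phase.
norms = [0.8, 0.9, 1.0, 1.1, 0.95, 12.0, 0.85]   # one spike at 12.0

tau = suggest_threshold(norms)
would_clip = [n for n in norms if n > tau]

print(tau)          # 1.5 * 0.95, about 1.425
print(would_clip)   # [12.0]: only the spike exceeds the threshold
```

Using the median rather than the mean keeps the spike itself from dragging the suggested threshold upward.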

## LLM training recipes

Gradient clipping at a global L2 norm of 1.0 is the de facto standard for large language model pretraining. It appears, with minor variations, in nearly every published recipe for a frontier-scale decoder-only transformer. GPT-3 (175B) was trained with the Adam optimizer, gradient clipping at a global norm of 1.0, and weight decay of 0.1.[17]

| Model | Optimizer | Clip rule | Threshold |
|---|---|---|---|
| GPT-3 (175B) | Adam | Global L2 norm | 1.0 |
| Llama 2 (7B / 13B / 70B) | AdamW (β1=0.9, β2=0.95, wd=0.1) | Global L2 norm | 1.0 |
| Llama 3 family | AdamW | Global L2 norm | 1.0 |
| DeepSeek-V3 (671B MoE) | AdamW (β1=0.9, β2=0.95, wd=0.1) | Global L2 norm | 1.0 |
| PaLM (540B) | Adafactor | Global L2 norm | 1.0 |
| GPT-NeoX, OPT, BLOOM | AdamW | Global L2 norm | 1.0 |
| Mistral 7B | AdamW | Global L2 norm | 1.0 |

The convergence on $$\tau = 1.0$$ is striking. It reflects the fact that decoder-only transformers trained with AdamW and a moderate peak learning rate (typically 1e-4 to 6e-4) produce gradient norms that hover around or just below 1 during the bulk of training. A threshold of 1.0 is therefore tight enough to catch genuine spikes without continuously throttling the normal gradient flow.

### Why does clipping not always stop loss spikes?

Google's PaLM paper documented a phenomenon that should temper any belief in clipping as a complete solution. As the authors put it, "For the largest model, we observed spikes in the loss roughly 20 times during training, despite the fact that gradient clipping was enabled."[7] The spikes occurred at irregular intervals, sometimes very late into the run, and could not be predicted from gradient statistics alone. The PaLM team's mitigation was operational rather than algorithmic: when a spike began, they would restart training from a checkpoint roughly 100 steps before the spike and skip the next 200 to 500 data batches. After the skipped window, the loss did not spike again at the same point. The authors concluded that "spikes only occur due to the combination of specific data batches with a particular model parameter state," rather than from any systematic data corruption.[7]

This is part of why adaptive methods like ZClip and AdaGC are an active area of research. A static threshold of 1.0 is good enough for most steps and most models, but not good enough to guarantee zero spikes across a pretraining run spanning trillions of tokens.

## How does gradient clipping work in distributed training?

In data-parallel training, gradient clipping is logically a global operation on the aggregated gradient, not a local operation on each worker's partial gradient. The standard order with [data parallelism](/wiki/data_parallelism) is:

1. Each worker computes its local gradient on its data shard.
2. An all-reduce sums (or averages) the gradients across workers.
3. Each worker holds the same combined gradient.
4. Clipping is applied to that combined gradient.
5. The optimizer step proceeds.

Clipping each worker's local gradient before the all-reduce would change the meaning of the threshold, because combining individually clipped vectors does not, in general, give the same result as clipping the combined vector. Frameworks like PyTorch DDP handle this correctly by default: gradients are reduced first, then `clip_grad_norm_` is called on the synchronized gradients.
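
The effect of the ordering can be checked on two toy workers. This plain-Python sketch uses a list average as a stand-in for the all-reduce; the helper names are illustrative.

```python
import math

def l2_norm(grad):
    return math.sqrt(sum(g * g for g in grad))

def clip_by_norm(grad, tau):
    norm = l2_norm(grad)
    scale = min(1.0, tau / norm) if norm > 0.0 else 1.0
    return [g * scale for g in grad]

def average(grads):
    """Stand-in for an averaging all-reduce across workers."""
    return [sum(column) / len(grads) for column in zip(*grads)]

worker_grads = [[10.0, 0.0], [0.0, 0.1]]   # worker 0 sees a spike

reduce_then_clip = clip_by_norm(average(worker_grads), tau=1.0)
clip_then_reduce = average([clip_by_norm(g, tau=1.0) for g in worker_grads])

print(reduce_then_clip)   # norm about 1.0, direction of the averaged gradient
print(clip_then_reduce)   # about [0.5, 0.05]: different length and direction
```

Clipping locally shrinks only the spiking worker's contribution, so the two orderings disagree on both the size and the direction of the final update.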

Fully Sharded Data Parallel (FSDP) and ZeRO add a wrinkle. With sharded parameters and sharded gradients, no single rank holds the full gradient vector at clip time. Computing the global norm requires an additional cross-rank reduction (each rank computes the local sum of squares for its shard, then an all-reduce sums those across ranks, and the square root is taken). PyTorch FSDP exposes `model.clip_grad_norm_(max_norm)` precisely for this reason, since calling the unsharded `torch.nn.utils.clip_grad_norm_` directly on `model.parameters()` would only see the local shard and produce an incorrect (smaller) norm. Megatron-LM and DeepSpeed ZeRO handle the same coordination internally.
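
The cross-rank reduction amounts to summing squared norms before taking the square root. The sketch below simulates two ranks with plain Python lists; `sum` over the shard results stands in for the all-reduce, and the function names are hypothetical.

```python
import math

def local_sq_sum(shard):
    """Sum of squared gradient entries held by one rank."""
    return sum(g * g for g in shard)

def global_norm_from_shards(shards):
    """Combine per-rank sums of squares, then take one square root."""
    total = sum(local_sq_sum(s) for s in shards)   # stand-in for all-reduce(sum)
    return math.sqrt(total)

shards = [[3.0, 0.0], [0.0, 4.0]]    # gradient shards on rank 0 and rank 1

print(global_norm_from_shards(shards))       # 5.0: the true global norm
print(math.sqrt(local_sq_sum(shards[0])))    # 3.0: what rank 0 alone would see
```

A rank that clips against its local value of 3.0 would apply a different scale factor from a rank that sees 4.0, so the shards would no longer form a consistently scaled gradient.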

Gradient accumulation introduces another subtlety. When gradients are accumulated across `k` micro-batches before a single optimizer step, clipping should apply to the accumulated gradient, not to each micro-batch's contribution. Clipping per micro-batch and then summing produces a different effective threshold, because the sum of `k` individually clipped gradients is no longer bounded by the intended value. The standard pattern is to call `loss.backward()` `k` times, then call `clip_grad_norm_` once just before `optimizer.step()`.

## Mixed precision

[Mixed precision training](/wiki/mixed_precision_training) with FP16 introduces loss scaling: the loss is multiplied by a large scalar before the backward pass to push small gradients into the representable range of FP16, and the resulting gradients are correspondingly inflated. If clipping is applied to these scaled gradients, the threshold means something completely different: in terms of the true gradient it is tighter by a factor of the loss scale, so nearly every step would be clipped.

The correct sequence in PyTorch with `torch.amp.GradScaler` is:

```python
scaler.scale(loss).backward()
scaler.unscale_(optimizer)                          # restore true gradient magnitudes
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
scaler.step(optimizer)
scaler.update()
```

The explicit `scaler.unscale_` call divides the gradients by the current loss scale before clipping, so the threshold of 1.0 means what it normally means. BF16 training does not need this dance, since BF16 has the same dynamic range as FP32 and loss scaling is unnecessary, but the same clip-then-step ordering still applies.
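
The arithmetic behind the ordering can be checked without a GPU. This plain-Python sketch compares the two orders on a toy gradient; the loss scale of 1024 and the helper names are illustrative assumptions, not values used by any particular scaler.

```python
import math

def l2_norm(grad):
    return math.sqrt(sum(g * g for g in grad))

def clip_by_norm(grad, tau):
    norm = l2_norm(grad)
    scale = min(1.0, tau / norm) if norm > 0.0 else 1.0
    return [g * scale for g in grad]

loss_scale = 1024.0
true_grad = [0.3, 0.4]                            # norm 0.5, below tau = 1.0
scaled_grad = [g * loss_scale for g in true_grad] # what backward() produces

# Unscale first, then clip: the true gradient passes through unchanged.
unscaled = [g / loss_scale for g in scaled_grad]
unscale_then_clip = clip_by_norm(unscaled, tau=1.0)

# Clip the scaled gradient, then unscale: the threshold acts as 1.0 / 1024.
clip_then_unscale = [g / loss_scale for g in clip_by_norm(scaled_grad, tau=1.0)]

print(l2_norm(unscale_then_clip))   # about 0.5
print(l2_norm(clip_then_unscale))   # about 0.000977, i.e. 1 / 1024
```

In the second ordering a perfectly healthy gradient is shrunk by a factor of 512, which is why the unscale call has to come before the clip.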

## Implementation across frameworks

### PyTorch

PyTorch provides two utilities in `torch.nn.utils`:

```python
# Norm clipping (returns the unclipped norm for logging)
total_norm = torch.nn.utils.clip_grad_norm_(
    model.parameters(), max_norm=1.0, norm_type=2.0
)

# Value clipping
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)
```

The norm version computes the norm over all parameter gradients as if they were concatenated into a single vector (this is global clipping). It scales gradients in place and returns the pre-clipping total norm so it can be logged or used for monitoring. The `norm_type` argument accepts any p-norm, including `'inf'` for infinity-norm clipping; the default `2.0` is what almost everyone uses.

### TensorFlow / Keras

TensorFlow exposes three primitives:

```python
import tensorflow as tf

# Global L2 norm clipping (most common)
clipped, global_norm = tf.clip_by_global_norm(gradients, clip_norm=1.0)

# Per-tensor norm clipping
clipped = [tf.clip_by_norm(g, 1.0) for g in gradients]

# Value clipping
clipped = [tf.clip_by_value(g, -1.0, 1.0) for g in gradients]
```

Keras optimizers also accept `clipnorm`, `global_clipnorm`, and `clipvalue` constructor arguments that perform the clipping internally:

```python
optimizer = tf.keras.optimizers.AdamW(learning_rate=1e-3, global_clipnorm=1.0)
```

The `global_clipnorm` argument applies global L2 clipping; `clipnorm` applies per-variable clipping; `clipvalue` applies elementwise value clipping.

### JAX / Optax

In the JAX ecosystem, clipping is a `GradientTransformation` that composes with the optimizer through `optax.chain`:

```python
import optax

optimizer = optax.chain(
    optax.clip_by_global_norm(1.0),
    optax.adamw(learning_rate=3e-4, b1=0.9, b2=0.95, weight_decay=0.1),
)
```

`optax.clip_by_global_norm` implements the same global L2 rule used elsewhere; `optax.clip` does elementwise clipping; `optax.adaptive_grad_clip` implements Brock et al.'s AGC with a `clipping` parameter.

## Why does gradient clipping accelerate training?

For a long time, gradient clipping was justified entirely on empirical grounds. Pascanu et al. argued from a dynamical systems perspective that recurrent networks pass through narrow regions of the loss surface where the gradient becomes locally enormous, and clipping is a reasonable response: rescale the step but keep its direction so the optimizer can still descend.[1]

The sharper theoretical picture came from Zhang, He, Sra, and Jadbabaie in their 2020 ICLR paper, "Why Gradient Clipping Accelerates Training: A Theoretical Justification for Adaptivity."[8] They observed that the standard analysis of gradient descent assumes Lipschitz-smooth gradients with a fixed Lipschitz constant `L`. In real neural network training, that constant is anything but fixed: empirically, the local Lipschitz constant of the gradient grows roughly linearly with the gradient norm itself, a regime they called $$(L_0, L_1)$$-smoothness. Under that relaxed condition, standard fixed-step gradient descent is forced to take very small steps to stay safe in the high-curvature regions, slowing convergence. Gradient clipping (and the closely related normalized gradient method) effectively adapts the step size to the local curvature, achieving provably faster convergence rates without needing to know the Lipschitz constant in advance.
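
Stated informally for a twice-differentiable objective $$f$$, the relaxed smoothness condition bounds the local curvature by an affine function of the gradient norm. The notation below is a paraphrase of the condition described above, not a quotation from the paper:

$$
\lVert \nabla^2 f(x) \rVert \le L_0 + L_1 \lVert \nabla f(x) \rVert
$$

Setting $$L_1 = 0$$ recovers ordinary $$L$$-smoothness. Writing the clipped update with the same rule used earlier in this article, and a base step size $$\eta$$ (a symbol introduced here for illustration),

$$
x_{k+1} = x_k - \eta \cdot \min\left(1, \frac{\tau}{\lVert \nabla f(x_k) \rVert}\right) \nabla f(x_k)
$$

the effective step size $$\eta \cdot \min(1, \tau / \lVert \nabla f(x_k) \rVert)$$ shrinks exactly where the gradient norm, and with it the curvature bound, is large. That is the sense in which clipping adapts the step to local curvature.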

This result reframes gradient clipping as a form of implicit step-size adaptation rather than just a safety net against overflow. It also suggests why clipping helps even on training runs that never threaten to diverge: the threshold acts as a coarse but effective curvature regularizer, taming the optimizer's behavior in regions where the loss surface is locally rough.

## Differential privacy: gradient clipping as a sensitivity bound

Gradient clipping plays an entirely different role in privacy-preserving machine learning. In differentially private stochastic gradient descent (DP-SGD), introduced by Abadi et al. in their 2016 ACM CCS paper "Deep Learning with Differential Privacy," per-example gradients are clipped to a fixed L2 norm `C` before being summed and noised:[9]

1. For each example in the mini-batch, compute the per-example gradient `g_i`.
2. Clip each one: $$g_i \leftarrow g_i \cdot \min(1, C / \lVert g_i \rVert)$$.
3. Sum the clipped gradients and add Gaussian noise with standard deviation $$\sigma \cdot C$$.
4. Average over the batch and apply the optimizer step.
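
The four steps above can be sketched in plain Python as follows. This is an illustration of the mechanism only and provides no privacy accounting; the function name, the list-based gradients, and the use of `random.Random` as the noise source are assumptions.

```python
import math
import random

def dp_sgd_gradient(per_example_grads, clip_c, sigma, rng):
    """Clip each per-example gradient to norm clip_c, sum, add noise, average."""
    dim = len(per_example_grads[0])
    summed = [0.0] * dim
    for grad in per_example_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        scale = min(1.0, clip_c / norm) if norm > 0.0 else 1.0
        for j in range(dim):
            summed[j] += grad[j] * scale
    # Gaussian noise with standard deviation sigma * clip_c per coordinate.
    noisy = [s + rng.gauss(0.0, sigma * clip_c) for s in summed]
    return [v / len(per_example_grads) for v in noisy]

rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]     # per-example norms 5.0 and 0.5

# With sigma = 0 the result is just the average of the clipped gradients.
print(dp_sgd_gradient(grads, clip_c=1.0, sigma=0.0, rng=rng))   # about [0.45, 0.6]

# With sigma > 0 each coordinate is perturbed before averaging.
print(dp_sgd_gradient(grads, clip_c=1.0, sigma=1.0, rng=rng))
```

Unlike the stability use case, every example is clipped individually here, which is what bounds how much any one example can move the sum.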

The role of the clip bound `C` here is fundamentally different. In standard training, clipping is a stability tool that triggers occasionally on outlier gradients. In DP-SGD, it is a privacy mechanism: the L2 sensitivity of the gradient sum to any single example is at most `C`, which is what allows the Gaussian noise of scale $$\sigma \cdot C$$ to mask each example's contribution and yield formal $$(\epsilon, \delta)$$-differential privacy guarantees. Choosing `C` therefore involves a privacy-utility tradeoff that has nothing to do with exploding gradients per se: too small and the model cannot learn, too large and the required noise becomes overwhelming. Adaptive variants such as the median-clipping trick of Andrew et al. (2021) adjust `C` over training to track the empirical gradient norm distribution, but the core mechanism is unchanged.[18]

This dual role makes gradient clipping a rare technique that is structurally important in two otherwise unrelated subfields: stable training of large models, and differentially private learning of any model.

## How do you monitor gradient norms during training?

Logging the gradient norm at every step is a habit that pays for itself many times over during long training runs. The norm provides a direct readout of training health and an early warning of problems.

What to look for:

- The norm typically starts higher and decays over the course of training as the model approaches a basin. A flat or rising norm late in training suggests the learning rate is too high.
- Sudden spikes that get clipped down to the threshold are normal and what clipping is for. Frequent spikes (more than a few percent of steps) suggest the threshold is too tight or the learning rate is too high.
- Norms that hit the threshold on every step indicate that clipping is throttling the entire run, not just catching outliers, and the threshold should probably be raised.
- A NaN in the gradient norm is fatal and should trigger automatic checkpoint reload. The PyTorch `clip_grad_norm_` function with `error_if_nonfinite=True` will raise instead of silently clipping NaN.
- A gradient norm that collapses toward zero for many consecutive steps points to the [vanishing gradient](/wiki/vanishing_gradient_problem) problem, not exploding, and is unaffected by clipping.

Logging the pre-clipping norm (which is what `clip_grad_norm_` returns) is more informative than logging the post-clipping norm, because the post-clip value is just the threshold whenever clipping triggers and provides no information about how close the run is to instability.
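
A small summary over logged pre-clip norms covers several of the checks listed above. The function name and the returned keys are hypothetical; the sketch only assumes a list of per-step norms and the threshold in use.

```python
import math

def summarize_norms(pre_clip_norms, tau):
    """Summarize logged pre-clip gradient norms against threshold tau."""
    finite = [n for n in pre_clip_norms if math.isfinite(n)]
    nonfinite_steps = len(pre_clip_norms) - len(finite)
    clipped_steps = sum(1 for n in finite if n > tau)
    return {
        "nonfinite_steps": nonfinite_steps,    # NaN or inf: treat as fatal
        "clipped_fraction": clipped_steps / len(finite) if finite else 0.0,
        "max_norm": max(finite) if finite else None,
    }

norms = [0.7, 0.8, 3.2, 0.9, float("nan"), 0.6]
summary = summarize_norms(norms, tau=1.0)
print(summary)
# {'nonfinite_steps': 1, 'clipped_fraction': 0.2, 'max_norm': 3.2}
```

A `clipped_fraction` near 1.0 would indicate that clipping is throttling every step, while any nonzero `nonfinite_steps` is a signal to stop and reload a checkpoint.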

## When should you not use gradient clipping?

Gradient clipping is not free. The L2 norm requires touching every gradient tensor at every step, which costs a small amount of communication in distributed settings and a tiny amount of compute everywhere. More importantly, an aggressively low threshold can mask informative gradient signals and slow convergence. There are settings where clipping adds nothing:

- Small, well-conditioned models (logistic regression, small MLPs on tabular data) almost never produce exploding gradients and need no clipping.
- Tree-based methods do not have backpropagation gradients in the same sense and the technique does not apply.
- Very small fine-tuning runs where the optimizer is taking tens or hundreds of steps over a stable pretrained model rarely benefit from clipping; the model is already in a well-behaved region.
- Training runs that have been carefully tuned to use a learning rate well below the stability frontier may not need clipping, though the cost of leaving it in is so low that most practitioners do anyway as cheap insurance.

The failure mode of overly aggressive clipping is convergence that is slower than it needs to be but otherwise normal. The failure mode of no clipping on a model that needs it is total divergence, often hours or days into a run. The asymmetry strongly favors leaving clipping enabled.

## How does gradient clipping relate to other stability techniques?

Gradient clipping is one tool in a broader stability toolkit and interacts with several others.

- **[Batch normalization](/wiki/batch_normalization) and [layer normalization](/wiki/normalization)**: normalization layers smooth the loss surface and reduce gradient magnitudes indirectly. Modern networks use both normalization and clipping; the AGC work showed that careful clipping can substitute for normalization in some architectures.
- **Weight decay (L2 regularization)**: penalizes large weights, which can indirectly keep gradient magnitudes in check. Weight decay does not bound gradient norm directly and is not a substitute for clipping.
- **Learning rate warmup**: linearly ramping the learning rate from zero over the first 1K to 10K steps is a standard companion to clipping. Warmup avoids the most violent gradients of the first few steps; clipping handles the residual spikes throughout training.
- **Learning rate scheduling (cosine, linear decay)**: a decaying learning rate naturally reduces the effective parameter update magnitude over time. Clipping caps the per-step magnitude; scheduling shapes the trajectory.
- **Skip-bad-batches and checkpoint rewind**: when clipping fails to prevent a loss spike, the operational mitigation is to roll back to a recent checkpoint and skip the offending batches, as documented in the PaLM paper.[7]

## Explain like I'm 5

Imagine you are walking on a hilly path with your eyes closed, taking small steps in the direction someone tells you to go. Most of the time the ground is gentle and your steps work fine. But sometimes the person shouts a really long instruction, like "go ten feet that way!" and if you actually take a giant leap with your eyes closed you will probably fall off a cliff.

Gradient clipping is the rule "no matter how big the instruction is, only take one step at a time." The direction is still right, but you never lurch farther than you can recover from. Because of that one rule, you can keep walking the whole path without falling, even when the instructions get scary.

## References

1. Pascanu, R., Mikolov, T., & Bengio, Y. (2013). "On the Difficulty of Training Recurrent Neural Networks." Proceedings of the 30th International Conference on Machine Learning (ICML), PMLR 28(3):1310-1318. https://proceedings.mlr.press/v28/pascanu13.html
2. Mikolov, T. (2012). "Statistical Language Models Based on Neural Networks." PhD thesis, Brno University of Technology. https://www.fit.vut.cz/study/phd-thesis/283/.en
3. Bengio, Y., Simard, P., & Frasconi, P. (1994). "Learning Long-Term Dependencies with Gradient Descent is Difficult." IEEE Transactions on Neural Networks, 5(2), 157-166.
4. Brock, A., De, S., Smith, S. L., & Simonyan, K. (2021). "High-Performance Large-Scale Image Recognition Without Normalization." Proceedings of the 38th International Conference on Machine Learning (ICML), PMLR 139:1059-1071. https://proceedings.mlr.press/v139/brock21a/brock21a.pdf
5. Wang et al. (2025). "AdaGC: Improving Training Stability for Large Language Model Pretraining." arXiv:2502.11034. https://arxiv.org/abs/2502.11034
6. Kumar, A., Owen, L., et al. (2025). "ZClip: Adaptive Spike Mitigation for LLM Pre-Training." arXiv:2504.02507. https://arxiv.org/abs/2504.02507
7. Chowdhery, A., et al. (2022). "PaLM: Scaling Language Modeling with Pathways." arXiv:2204.02311. https://arxiv.org/abs/2204.02311
8. Zhang, J., He, T., Sra, S., & Jadbabaie, A. (2020). "Why Gradient Clipping Accelerates Training: A Theoretical Justification for Adaptivity." Proceedings of the 8th International Conference on Learning Representations (ICLR). https://openreview.net/forum?id=BJgnXpVYwS
9. Abadi, M., Chu, A., Goodfellow, I., McMahan, H. B., Mironov, I., Talwar, K., & Zhang, L. (2016). "Deep Learning with Differential Privacy." Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, 308-318. https://arxiv.org/abs/1607.00133
10. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). "Proximal Policy Optimization Algorithms." arXiv:1707.06347.
11. Goodfellow, I., Bengio, Y., & Courville, A. (2016). *Deep Learning*. MIT Press. Chapter 10.11.1: Clipping Gradients.
12. PyTorch Documentation. "torch.nn.utils.clip_grad_norm_." https://pytorch.org/docs/stable/generated/torch.nn.utils.clip_grad_norm_.html
13. TensorFlow Documentation. "tf.clip_by_global_norm." https://www.tensorflow.org/api_docs/python/tf/clip_by_global_norm
14. Optax Documentation. "optax.clip_by_global_norm." https://optax.readthedocs.io/en/latest/api/transformations.html
15. DeepSeek-AI (2024). "DeepSeek-V3 Technical Report." arXiv:2412.19437. https://arxiv.org/abs/2412.19437
16. Touvron, H., et al. (2023). "Llama 2: Open Foundation and Fine-Tuned Chat Models." arXiv:2307.09288. https://arxiv.org/abs/2307.09288
17. Brown, T. B., et al. (2020). "Language Models are Few-Shot Learners." Advances in Neural Information Processing Systems 33 (NeurIPS). https://arxiv.org/abs/2005.14165
18. Andrew, G., Thakkar, O., McMahan, H. B., & Ramaswamy, S. (2021). "Differentially Private Learning with Adaptive Clipping." Advances in Neural Information Processing Systems 34 (NeurIPS).