Gradient clipping is a family of techniques used during the training of neural networks that constrain the magnitude of gradient values before they are applied to model weights. The goal is to prevent excessively large parameter updates, which can destabilize training, cause numerical overflow (NaN), or send the loss function into divergence. Although the operation is mathematically simple, gradient clipping has become one of the most reliable workhorse stabilizers in modern deep learning, used in nearly every recurrent network, every large transformer, and every published large language model recipe.
The two dominant variants are clip-by-value, which caps each individual gradient component to a fixed range, and clip-by-norm, which rescales the entire gradient vector when its overall magnitude exceeds a threshold. The norm version, formalized by Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio in their 2013 ICML paper "On the Difficulty of Training Recurrent Neural Networks," is by far the more common form in current practice.[1] Mikolov had already proposed the basic idea in his 2012 PhD thesis at Brno University of Technology, where he used it to train recurrent language models that would otherwise diverge.[2]
Gradient clipping is cheap to implement, costs almost nothing at runtime, and can be added to a training loop with a single line of code. Despite this, it sits at the center of how billion-parameter models stay numerically stable across months of training. The remainder of this article walks through the math, the major variants, the empirical thresholds used in production LLMs, the interaction with distributed training and mixed precision training, the recent adaptive methods that try to do better than a fixed threshold, and the connection to differential privacy through DP-SGD.
During backpropagation, the chain rule produces gradients of the loss with respect to each parameter. In a deep feedforward network, those gradients are products of many Jacobians stacked layer by layer. In a recurrent neural network, they are products across time steps using the same recurrent weight matrix at each step, an operation known as backpropagation through time (BPTT). When the spectral radius of those repeated linear operators sits above one, the gradient magnitude grows exponentially with depth or sequence length. This is the exploding gradient problem.
Bengio, Simard, and Frasconi formalized the analytical version of this in 1994, showing that recurrent networks face a structural tension: the same eigenvalue conditions that allow gradients to flow over long horizons also expose training to gradient blowup.[3] The practical consequences are familiar to anyone who has trained a deep network without safeguards: sudden spikes in the training loss, weights and activations overflowing to NaN or Inf, and runs that diverge outright after hours of apparently healthy progress.
Clipping does not fix the underlying conditioning of the optimization problem. It just bounds how badly any one step can hurt. That bound turns out to be enough for nearly all practical training, which is why the technique has stuck around for more than a decade with very little change to its basic form.
Clip-by-value, sometimes called elementwise clipping, treats each scalar entry of the gradient tensor independently. Given a threshold c, each gradient component g_i is replaced by
g_i ← max(-c, min(c, g_i))
Values inside [-c, c] are unchanged; values outside are pulled to the nearest endpoint. The implementation is one line in any framework, and the cost is a single elementwise operation over the parameters.
The drawback is that clipping different components by different amounts changes the direction of the resulting gradient vector. If one or two coordinates are very large and the rest are moderate, value clipping can shrink the dominant coordinates while leaving the others alone, rotating the descent direction away from the true negative gradient. For convex objectives this slows convergence; for nonconvex objectives it can push the optimizer toward a different basin entirely. Practitioners use value clipping mostly when they want a hard worst-case bound on each weight update, for example in GANs or in custom optimizers where elementwise control is convenient.
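A small numeric sketch makes the rotation concrete; the gradient values below are invented purely for illustration:

```python
import torch

# Illustrative gradient with one dominant component and several moderate ones.
g = torch.tensor([8.0, 0.5, -0.3, 0.7])

# Clip-by-value with c = 1.0: each component is clamped to [-1, 1] independently.
c = 1.0
g_clipped = g.clamp(min=-c, max=c)   # tensor([ 1.0,  0.5, -0.3,  0.7])

# The dominant coordinate shrank 8x while the others were untouched,
# so the clipped vector no longer points where the true gradient pointed.
cos = torch.nn.functional.cosine_similarity(g, g_clipped, dim=0)
print(cos.item())   # ~0.81 rather than 1.0: the descent direction has rotated
```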
Clip-by-norm treats the gradient as a single vector and rescales it whenever its overall length exceeds a threshold. With the L2 (Euclidean) norm, the rule is
if ||g||_2 > τ: g ← g · τ / ||g||_2
or equivalently g ← g · min(1, τ / ||g||_2). When the norm is below τ, the gradient passes through unchanged. When it is above, the entire vector is shrunk by a single scalar factor, which preserves direction exactly. Only step length is affected.
Direction preservation is the main reason norm clipping has displaced value clipping in mainstream practice. Pascanu et al. demonstrated empirically that direction-preserving rescaling stabilized RNN training without introducing the systematic bias that elementwise clipping does.[1] Their suggested operating range was a threshold somewhere between half and ten times the typical observed gradient norm during a stable run; the modern convention of using values like 1.0 or 5.0 grew out of that recipe.
Norm clipping admits two natural granularities. Global norm clipping concatenates every parameter's gradient into one logical vector, computes the L2 norm over the whole thing, and applies a single rescaling factor to all parameters when that combined norm exceeds the threshold. Per-parameter norm clipping computes a separate norm for each parameter tensor (each weight matrix, each bias vector) and rescales each one independently against its own threshold.
Global clipping is the standard. It treats the model as a single point in a single vector space, which matches the geometry of SGD and Adam. Per-parameter clipping can over-shrink some tensors while under-shrinking others, distorting the overall update direction in much the same way that elementwise clipping does. Per-parameter is occasionally useful for debugging or for architectures with very heterogeneous parameter scales, but it has not displaced global clipping in mainstream training.
| Method | What it bounds | Direction preserved | Cost | Where it shows up |
|---|---|---|---|---|
| Clip by value | Each scalar component to [-c, c] | No | One elementwise op | Older RNN code, GANs, custom losses |
| Clip by L2 norm (per-tensor) | Each parameter tensor's norm | Per-tensor only | One reduction per tensor | Specialized debugging, mixed scales |
| Clip by L2 global norm | Global vector norm across all params | Yes (globally) | One global reduction | RNNs, transformers, LLM pretraining |
| Clip by infinity norm | Largest absolute component | No (rotates) | One max reduction | Rare, theoretical interest |
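To make the global rule concrete, here is a minimal from-scratch sketch rather than any framework's implementation: it clips a list of gradient tensors against one shared L2 norm, and the small epsilon guard is just an implementation convenience.

```python
import torch

def clip_by_global_norm_(grads, tau=1.0):
    # Global L2 norm: sum of squares over every tensor, one square root at the end,
    # exactly as if all gradients were concatenated into a single vector.
    total_norm = torch.sqrt(sum((g ** 2).sum() for g in grads))
    # A single scalar factor applied to every tensor preserves the update direction.
    scale = (tau / (total_norm + 1e-6)).clamp(max=1.0)
    for g in grads:
        g.mul_(scale)
    return total_norm  # pre-clipping norm, handy for logging

# grads = [p.grad for p in model.parameters() if p.grad is not None]
# clip_by_global_norm_(grads, tau=1.0)
```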
Adaptive Gradient Clipping was introduced by Brock, De, Smith, and Simonyan in the 2021 NFNet paper, "High-Performance Large-Scale Image Recognition Without Normalization."[4] The idea is to set the clip threshold for each parameter unit as a multiple of the parameter's own norm, instead of using a single fixed value across the model. Concretely, for a parameter row W_i with gradient G_i,
if ||G_i|| / ||W_i|| > λ: G_i ← G_i · λ · ||W_i|| / ||G_i||
The coefficient λ is a small constant. Brock et al. used λ = 0.01 for every parameter except the final fully-connected classifier layer (which had AGC turned off entirely) when training NFNet-F0 through F6 at batch size 4096; smaller batch sizes such as 128 to 256 tolerated a looser λ around 0.16, while larger batches required tighter clipping for stability. The justification is that the magnitude of a sensible weight update should scale with the magnitude of the weight itself; clipping in absolute units conflates very different parameter scales. AGC was the key ingredient that allowed NFNets to match or beat the accuracy of batch normalized ResNets on ImageNet without using normalization layers, with NFNet-F5 reaching 86.0% top-1 accuracy on ImageNet.
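A simplified sketch of the AGC rule follows. Brock et al. apply it unit-wise (per output row, with a small floor on the weight norm); this version clips per tensor, and the λ and epsilon values are only placeholders:

```python
import torch

def adaptive_grad_clip_(parameters, lam=0.01, eps=1e-3):
    # Simplified per-tensor AGC: keep each gradient's norm at most
    # lam times the norm of the corresponding weight tensor.
    for p in parameters:
        if p.grad is None:
            continue
        w_norm = p.detach().norm().clamp_(min=eps)   # floor avoids clipping to zero at init
        g_norm = p.grad.detach().norm()
        max_norm = lam * w_norm
        if g_norm > max_norm:
            p.grad.mul_(max_norm / (g_norm + 1e-6))
```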
For LLM pretraining, where a single fixed threshold across hundreds of billions of parameters is increasingly seen as too blunt, two newer adaptive methods have appeared.
AdaGC (Wang et al., 2025) tracks an exponential moving average (EMA) of each parameter tensor's gradient norm and clips against a relative threshold derived from that history.[5] The per-tensor EMA is updated as γ_t,i = β · γ_{t-1,i} + (1 - β) · ||g_t,i||, and the clip factor is min(λ_rel · γ_{t-1,i} / ||g_t,i||, 1.0). On Llama-2 7B and 13B pretraining, AdaGC eliminated visible loss spikes while reducing WikiText perplexity by about 3.5% relative to global clipping.
ZClip (Kumar, Owen et al., April 2025) takes a similar EMA-of-gradient-norms approach but frames the clip decision as a z-score test.[6] At each step it tracks the running mean and standard deviation of the gradient norm and only clips when the current norm sits more than a configurable number of standard deviations above the running mean, treating the spike as a statistical anomaly rather than relying on an absolute cutoff. The authors report that on 1B-parameter Llama-style models, ZClip outperforms both fixed-threshold clipping and percentile-based methods, and broadens the range of learning rates that train stably.
Both methods aim at the same observation: the typical gradient norm during pretraining shifts substantially over the course of a run, and a fixed τ = 1.0 threshold that is appropriately tight at step 100K may be too loose at step 1M.
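The following is a rough per-tensor sketch of the EMA-relative rule described for AdaGC, built directly from the formula above; the β and λ_rel values are placeholders, and the published method includes bookkeeping details not reproduced here.

```python
import torch

class EMARelativeClipper:
    """Per-tensor clipping against an EMA of past gradient norms (sketch only)."""
    def __init__(self, beta=0.98, lam_rel=1.1):
        self.beta, self.lam_rel = beta, lam_rel
        self.ema = {}  # parameter name -> running gradient-norm EMA

    def clip_(self, named_params):
        for name, p in named_params:
            if p.grad is None:
                continue
            g_norm = p.grad.detach().norm()
            prev = self.ema.get(name, g_norm)  # bootstrap the EMA from the first observed norm
            # Clip factor min(lam_rel * gamma_{t-1} / ||g_t||, 1.0) from the text.
            factor = torch.clamp(self.lam_rel * prev / (g_norm + 1e-12), max=1.0)
            p.grad.mul_(factor)
            # EMA update as written in the text; the published method may track the
            # clipped norm here instead, so one spike cannot inflate the history.
            self.ema[name] = self.beta * prev + (1 - self.beta) * g_norm

# Usage: clipper.clip_(model.named_parameters()) after backward, before optimizer.step()
```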
Gradient clipping is always applied between the backward pass and the optimizer step. The standard ordering in PyTorch is:
```python
optimizer.zero_grad()
output = model(input)
loss = loss_fn(output, target)
loss.backward()        # populates .grad on each param
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()       # consumes the (now clipped) .grad
```
Clipping after the optimizer step would act on gradients the update has already consumed; clipping before the backward pass would find no gradients to clip. The order also matters with respect to gradient accumulation, mixed precision, and distributed reduction, all discussed below.
The clip threshold τ is a hyperparameter that interacts with the learning rate, batch size, and model architecture. There is no single correct value, but the empirical literature has converged on a small set of conventions.
| Setting | Typical global-norm threshold | Notes |
|---|---|---|
| Generic deep learning | 1.0 | Standard default for new training loops |
| LLM pretraining (decoder-only transformers) | 1.0 | Used by GPT-3, Llama 2, DeepSeek-V3 |
| Smaller transformers / fine-tuning | 1.0 to 5.0 | Less aggressive when batches are small |
| Vision transformers | 1.0 | Often paired with cosine schedule |
| Reinforcement learning (PPO, A2C) | 0.5 | Tighter clipping to handle non-stationarity |
| Compressive Transformer (memory-augmented) | 0.1 | Aggressive clipping on long-context recurrence |
| Value clipping (when used) | c ∈ [0.5, 1.0] | Range applied per element |
A practical recipe is to log the gradient norm for the first few thousand steps without clipping, observe the typical magnitude, and set τ slightly above the median observed norm so that clipping triggers only on the genuine spikes. The PyTorch clip_grad_norm_ function returns the pre-clipping total norm precisely so it can be logged for this purpose.
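A sketch of that calibration pass, assuming an ordinary PyTorch loop in which model, loss_fn, optimizer, and loader already exist:

```python
import torch

observed = []
for step, (inputs, targets) in enumerate(loader):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    # max_norm=1e9 makes the call a pure measurement: it never actually rescales.
    norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1e9)
    observed.append(norm.item())
    optimizer.step()
    if step >= 2000:
        break

observed.sort()
median = observed[len(observed) // 2]
print(f"median gradient norm ~ {median:.3f}; set tau slightly above this")
```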
Gradient clipping at a global L2 norm of 1.0 is the de facto standard for large language model pretraining. It appears, with minor variations, in nearly every published recipe for a frontier-scale decoder-only transformer.
| Model | Optimizer | Clip rule | Threshold |
|---|---|---|---|
| GPT-3 (175B) | Adam | Global L2 norm | 1.0 |
| Llama 2 (7B / 13B / 70B) | AdamW (β1=0.9, β2=0.95, wd=0.1) | Global L2 norm | 1.0 |
| Llama 3 family | AdamW | Global L2 norm | 1.0 |
| DeepSeek-V3 (671B MoE) | AdamW (β1=0.9, β2=0.95, wd=0.1) | Per-parameter clip | 1.0 |
| PaLM (540B) | Adafactor | Global L2 norm | 1.0 |
| GPT-NeoX, OPT, BLOOM | AdamW | Global L2 norm | 1.0 |
| Mistral 7B | AdamW | Global L2 norm | 1.0 |
The convergence on τ = 1.0 is striking. It reflects the fact that decoder-only transformers trained with AdamW and a moderate peak learning rate (typically 1e-4 to 6e-4) produce gradient norms that hover around or just below 1 during the bulk of training. A threshold of 1.0 is therefore tight enough to catch genuine spikes without continuously throttling the normal gradient flow.
Google's PaLM paper documented a phenomenon that should temper any belief in clipping as a complete solution: during training of the 540B PaLM model, the loss spiked roughly 20 times despite clipping being enabled at norm 1.0.[7] The spikes occurred at irregular intervals, sometimes very late into the run, and could not be predicted from gradient statistics alone. The PaLM team's mitigation was operational rather than algorithmic: when a spike began, they would restart training from a checkpoint roughly 100 steps before the spike and skip the next 200 to 500 data batches. After the skipped window, the loss did not spike again at the same point, suggesting that the spikes arose from specific interactions between particular data batches and particular model parameter states rather than from any systematic data corruption.
This is part of why adaptive methods like ZClip and AdaGC are an active area of research. A static threshold of 1.0 is good enough for most steps and most models, but not good enough to guarantee zero spikes across a months-long, multi-trillion-token pretraining run.
In data-parallel training, gradient clipping is logically a global operation on the aggregated gradient, not a local operation on each worker's partial gradient. The standard order with data parallelism is: each worker computes gradients on its local micro-batch, the gradients are averaged across workers with an all-reduce, the global norm of the synchronized gradient is computed and clipping is applied, and only then does the optimizer step run.
Clipping each worker's local gradient before the all-reduce would change the meaning of the threshold, because the global norm of the sum of clipped vectors is not the same as the clip of the global norm of the sum. Frameworks like PyTorch DDP handle this correctly by default: gradients are reduced first, then clip_grad_norm_ is called on the synchronized gradients.
Fully Sharded Data Parallel (FSDP) and ZeRO add a wrinkle. With sharded parameters and sharded gradients, no single rank holds the full gradient vector at clip time. Computing the global norm requires an additional cross-rank reduction (each rank computes the local sum of squares for its shard, then an all-reduce sums those across ranks, and the square root is taken). PyTorch FSDP exposes model.clip_grad_norm_(max_norm) precisely for this reason, since calling the unsharded torch.nn.utils.clip_grad_norm_ directly on model.parameters() would only see the local shard and produce an incorrect (smaller) norm. Megatron-LM and DeepSpeed ZeRO handle the same coordination internally.
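As a sketch, assuming the model has already been wrapped in FullyShardedDataParallel and the usual loop variables (loss_fn, batch, target, optimizer) exist, the per-step usage looks like this:

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# model = FSDP(base_module)   # wrapping and process-group setup elided
loss = loss_fn(model(batch), target)
loss.backward()
# FSDP's own method reduces each rank's local sum of squares across ranks,
# so the norm it clips against is the true global norm, not the local shard's.
total_norm = model.clip_grad_norm_(max_norm=1.0)
optimizer.step()
optimizer.zero_grad()
```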
Gradient accumulation introduces another subtlety. When gradients are accumulated across k micro-batches before a single optimizer step, clipping should apply to the accumulated gradient, not to each micro-batch's contribution. Clipping per micro-batch and then summing produces a different (and smaller) effective threshold. The standard pattern is to call loss.backward() k times, then call clip_grad_norm_ once just before optimizer.step().
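In code, the pattern looks roughly like this (model, loss_fn, optimizer, and loader are assumed; the accumulation count is illustrative):

```python
import torch

accum_steps = 8   # k micro-batches per optimizer step
optimizer.zero_grad()
for i, (inputs, targets) in enumerate(loader):
    loss = loss_fn(model(inputs), targets) / accum_steps   # keep the effective scale right
    loss.backward()                                        # gradients accumulate in .grad
    if (i + 1) % accum_steps == 0:
        # One clip on the fully accumulated gradient, immediately before the step.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        optimizer.zero_grad()
```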
Mixed precision training with FP16 introduces loss scaling: the loss is multiplied by a large scalar before the backward pass to push small gradients into the representable range of FP16, and the resulting gradients are correspondingly inflated. If clipping is applied to these scaled gradients, the threshold means something completely different (and very loose) compared to its meaning on unscaled gradients.
The correct sequence in PyTorch with torch.amp.GradScaler is:
```python
scaler.scale(loss).backward()
scaler.unscale_(optimizer)      # restore true gradient magnitudes
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
scaler.step(optimizer)
scaler.update()
```
The explicit scaler.unscale_ call divides the gradients by the current loss scale before clipping, so the threshold of 1.0 means what it normally means. BF16 training does not need this dance, since BF16 has the same dynamic range as FP32 and loss scaling is unnecessary, but the same clip-then-step ordering still applies.
PyTorch provides two utilities in torch.nn.utils:
```python
# Norm clipping (returns the unclipped norm for logging)
total_norm = torch.nn.utils.clip_grad_norm_(
    model.parameters(), max_norm=1.0, norm_type=2.0
)

# Value clipping
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)
```
The norm version computes the norm over all parameter gradients as if they were concatenated into a single vector (this is global clipping). It scales gradients in place and returns the pre-clipping total norm so it can be logged or used for monitoring. The norm_type argument accepts any p-norm, including float('inf') for infinity-norm clipping; the default of 2.0 is what almost everyone uses.
TensorFlow exposes three primitives:
```python
import tensorflow as tf

# Global L2 norm clipping (most common)
clipped, global_norm = tf.clip_by_global_norm(gradients, clip_norm=1.0)

# Per-tensor norm clipping
clipped = [tf.clip_by_norm(g, 1.0) for g in gradients]

# Value clipping
clipped = [tf.clip_by_value(g, -1.0, 1.0) for g in gradients]
```
Keras optimizers also accept clipnorm, global_clipnorm, and clipvalue constructor arguments that perform the clipping internally:
```python
optimizer = tf.keras.optimizers.AdamW(learning_rate=1e-3, global_clipnorm=1.0)
```
The global_clipnorm argument applies global L2 clipping; clipnorm applies per-variable clipping; clipvalue applies elementwise value clipping.
In the JAX ecosystem, clipping is a GradientTransformation that composes with the optimizer through optax.chain:
```python
import optax

optimizer = optax.chain(
    optax.clip_by_global_norm(1.0),
    optax.adamw(learning_rate=3e-4, b1=0.9, b2=0.95, weight_decay=0.1),
)
```
optax.clip_by_global_norm implements the same global L2 rule used elsewhere; optax.clip does elementwise clipping; optax.adaptive_grad_clip implements Brock et al.'s AGC with a clipping parameter.
For a long time, gradient clipping was justified entirely on empirical grounds. Pascanu et al. argued from a dynamical systems perspective that recurrent networks pass through narrow regions of the loss surface where the gradient becomes locally enormous, and clipping is a reasonable response: rescale the step but keep its direction so the optimizer can still descend.[1]
The sharper theoretical picture came from Zhang, He, Sra, and Jadbabaie in their 2020 ICLR paper, "Why Gradient Clipping Accelerates Training: A Theoretical Justification for Adaptivity."[8] They observed that the standard analysis of gradient descent assumes Lipschitz-smooth gradients with a fixed Lipschitz constant L. In real neural network training, that constant is anything but fixed: empirically, the local Lipschitz constant of the gradient grows roughly linearly with the gradient norm itself, a regime they called (L_0, L_1)-smoothness. Under that relaxed condition, standard fixed-step gradient descent is forced to take very small steps to stay safe in the high-curvature regions, slowing convergence. Gradient clipping (and the closely related normalized gradient method) effectively adapts the step size to the local curvature, achieving provably faster convergence rates without needing to know the Lipschitz constant in advance.
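In their notation, a loss f is (L_0, L_1)-smooth when the local smoothness (the Hessian norm, for twice-differentiable f) is bounded by an affine function of the gradient norm,

||∇²f(x)|| ≤ L_0 + L_1 · ||∇f(x)||

so curvature is allowed to grow exactly where the gradient is large; a clipped or normalized step automatically shortens itself in those regions, which is what the convergence analysis exploits.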
This result reframes gradient clipping as a form of implicit step-size adaptation rather than just a safety net against overflow. It also suggests why clipping helps even on training runs that never threaten to diverge: the threshold acts as a coarse but effective curvature regularizer, taming the optimizer's behavior in regions where the loss surface is locally rough.
Gradient clipping plays an entirely different role in privacy-preserving machine learning. In differentially private stochastic gradient descent (DP-SGD), introduced by Abadi et al. in their 2016 ACM CCS paper "Deep Learning with Differential Privacy," per-example gradients are clipped to a fixed L2 norm C before being summed and noised:[9]
g_i ← g_i · min(1, C / ||g_i||)
The clipped per-example gradients are then summed, and Gaussian noise of scale σ · C is added before the averaged result updates the model. The role of the clip bound C here is fundamentally different. In standard training, clipping is a stability tool that triggers occasionally on outlier gradients. In DP-SGD, it is a privacy mechanism: the L2 sensitivity of the gradient sum to any single example is exactly C, which is what allows noise of that scale to mask each example's contribution and yield formal (ε, δ)-differential privacy guarantees. Choosing C therefore involves a privacy-utility tradeoff that has nothing to do with exploding gradients per se: too small and the model cannot learn, too large and the required noise becomes overwhelming. Adaptive variants such as the median-clipping trick of Andrew et al. (2021) adjust C over training to track the empirical gradient norm distribution, but the core mechanism is unchanged.
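A sketch of the aggregation step under this rule, written against per-example gradient vectors that have already been computed (obtaining them efficiently is its own problem, handled in practice by libraries such as Opacus); the function name and noise bookkeeping here are illustrative:

```python
import torch

def dp_sgd_aggregate(per_example_grads, C=1.0, sigma=1.0):
    # per_example_grads: list of flattened gradient vectors, one per example.
    clipped = []
    for g in per_example_grads:
        # Per-example clip to L2 norm C: the sum's sensitivity to any one example is C.
        scale = torch.clamp(C / (g.norm() + 1e-12), max=1.0)
        clipped.append(g * scale)
    total = torch.stack(clipped).sum(dim=0)
    # Gaussian noise of scale sigma * C masks any single example's contribution.
    noised = total + torch.randn_like(total) * sigma * C
    return noised / len(per_example_grads)
```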
This dual role makes gradient clipping a rare technique that is structurally important in two otherwise unrelated subfields: stable training of large models, and differentially private learning of any model.
Logging the gradient norm at every step is a habit that pays for itself many times over during long training runs. The norm provides a direct readout of training health and an early warning of problems.
What to look for:
- A norm that sits at or above the threshold on most steps means clipping is throttling ordinary updates: the threshold is too low or the learning rate is too high.
- Isolated spikes are the events clipping exists to absorb; their frequency is worth tracking, since clusters of spikes often precede divergence.
- A norm that drifts steadily upward over thousands of steps is an early warning of instability, usually visible before the loss reacts.
- NaN or Inf norms mean numerical overflow somewhere in the backward pass; the clip_grad_norm_ function with error_if_nonfinite=True will raise instead of silently clipping NaN.

Logging the pre-clipping norm (which is what clip_grad_norm_ returns) is more informative than logging the post-clipping norm, because the post-clip value is just the threshold whenever clipping triggers and provides no information about how close the run is to instability.
Gradient clipping is not free. The L2 norm requires touching every gradient tensor at every step, which costs a small amount of communication in distributed settings and a tiny amount of compute everywhere. More importantly, an aggressively low threshold can mask informative gradient signals and slow convergence. There are also settings where clipping adds nothing: small, well-conditioned models trained with normalization layers and conservative learning rates rarely produce gradient norms anywhere near a sensible threshold, so the clip simply never fires.
The failure mode of overly aggressive clipping is convergence that is slower than it needs to be but otherwise normal. The failure mode of no clipping on a model that needs it is total divergence, often hours or days into a run. The asymmetry strongly favors leaving clipping enabled.
Gradient clipping is one tool in a broader stability toolkit and interacts with several others, most directly learning-rate warmup and schedules, normalization layers, careful initialization, and the loss scaling used in mixed precision training.
Imagine you are walking on a hilly path with your eyes closed, taking small steps in the direction someone tells you to go. Most of the time the ground is gentle and your steps work fine. But sometimes the person shouts an instruction for a huge move, like "go ten feet that way!", and if you actually take a giant leap with your eyes closed you will probably fall off a cliff.
Gradient clipping is the rule "no matter how big the instruction is, never step farther than a normal stride." The direction is still right, but you never lurch farther than you can recover from. Because of that one rule, you can keep walking the whole path without falling, even when the instructions get scary.