Gradient clipping is a family of techniques used during the training of neural networks that constrain the magnitude of gradient values before they are applied to model weights. The goal is to prevent excessively large parameter updates, which can destabilize training, cause numerical overflow (NaN), or send the loss function into divergence. Although the operation is mathematically simple, gradient clipping has become one of the most reliable workhorse stabilizers in modern deep learning, used in nearly every recurrent network, every large transformer, and every published large language model recipe.
The two dominant variants are clip-by-value, which caps each individual gradient component to a fixed range, and clip-by-norm, which rescales the entire gradient vector when its overall magnitude exceeds a threshold. The norm version, formalized by Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio in their 2013 ICML paper "On the Difficulty of Training Recurrent Neural Networks," is by far the more common form in current practice.[1] Mikolov had already proposed the basic idea in his 2012 PhD thesis at Brno University of Technology, where he used it to train recurrent language models that would otherwise diverge.[2]
Gradient clipping is cheap to implement, costs almost nothing at runtime, and can be added to a training loop with a single line of code. Despite this, it sits at the center of how billion-parameter models stay numerically stable across months of training. The remainder of this article walks through the math, the major variants, the empirical thresholds used in production LLMs, the interaction with distributed training and mixed precision training, the recent adaptive methods that try to do better than a fixed threshold, and the connection to differential privacy through DP-SGD.
During backpropagation, the chain rule produces gradients of the loss with respect to each parameter. In a deep feedforward network, those gradients are products of many Jacobians stacked layer by layer. In a recurrent neural network, they are products across time steps using the same recurrent weight matrix at each step, an operation known as backpropagation through time (BPTT). When the spectral radius of those repeated linear operators sits above one, the gradient magnitude grows exponentially with depth or sequence length. This is the exploding gradient problem.
Bengio, Simard, and Frasconi formalized the analytical version of this in 1994, showing that recurrent networks face a structural tension: the same eigenvalue conditions that allow gradients to flow over long horizons also expose training to gradient blowup.[3] The practical consequences are familiar to anyone who has trained a deep network without safeguards: sudden spikes in the training loss, weights and activations overflowing to NaN or Inf, and runs that diverge outright after hours of apparently healthy progress.
Clipping does not fix the underlying conditioning of the optimization problem. It just bounds how badly any one step can hurt. That bound turns out to be enough for nearly all practical training, which is why the technique has stuck around for more than a decade with very little change to its basic form.
Clip-by-value, sometimes called elementwise clipping, treats each scalar entry of the gradient tensor independently. Given a threshold c, each gradient component g_i is replaced by
g_i ← max(-c, min(c, g_i))
Values inside [-c, c] are unchanged; values outside are pulled to the nearest endpoint. The implementation is one line in any framework, and the cost is a single elementwise operation over the parameters.
The drawback is that clipping different components by different amounts changes the direction of the resulting gradient vector. If one or two coordinates are very large and the rest are moderate, value clipping can shrink the dominant coordinates while leaving the others alone, rotating the descent direction away from the true negative gradient. For convex objectives this slows convergence; for nonconvex objectives it can push the optimizer toward a different basin entirely. Practitioners use value clipping mostly when they want a hard worst-case bound on each weight update, for example in GANs or in custom optimizers where elementwise control is convenient.
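A small numeric sketch makes the rotation concrete; the gradient values below are invented purely for illustration:

```python
import torch

# Illustrative gradient with one dominant component and several moderate ones.
g = torch.tensor([8.0, 0.5, -0.3, 0.7])

# Clip-by-value with c = 1.0: each component is clamped to [-1, 1] independently.
c = 1.0
g_clipped = g.clamp(min=-c, max=c)   # tensor([ 1.0,  0.5, -0.3,  0.7])

# The dominant coordinate shrank 8x while the others were untouched,
# so the clipped vector no longer points where the true gradient pointed.
cos = torch.nn.functional.cosine_similarity(g, g_clipped, dim=0)
print(cos.item())   # ~0.81 rather than 1.0: the descent direction has rotated
```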
Clip-by-norm treats the gradient as a single vector and rescales it whenever its overall length exceeds a threshold. With the L2 (Euclidean) norm, the rule is
if ||g||_2 > τ: g ← g · τ / ||g||_2
or equivalently g ← g · min(1, τ / ||g||_2). When the norm is below τ, the gradient passes through unchanged. When it is above, the entire vector is shrunk by a single scalar factor, which preserves direction exactly. Only step length is affected.
Direction preservation is the main reason norm clipping has displaced value clipping in mainstream practice. Pascanu et al. demonstrated empirically that direction-preserving rescaling stabilized RNN training without introducing the systematic bias that elementwise clipping does.[1] Their suggested operating range was a threshold somewhere between half and ten times the typical observed gradient norm during a stable run; the modern convention of using values like 1.0 or 5.0 grew out of that recipe.
Norm clipping admits two natural granularities. Global norm clipping concatenates every parameter's gradient into one logical vector, computes the L2 norm over the whole thing, and applies a single rescaling factor to all parameters when that combined norm exceeds the threshold. Per-parameter norm clipping computes a separate norm for each parameter tensor (each weight matrix, each bias vector) and rescales each one independently against its own threshold.
Global clipping is the standard. It treats the model as a single point in a single vector space, which matches the geometry of SGD and Adam. Per-parameter clipping can over-shrink some tensors while under-shrinking others, distorting the overall update direction in much the same way that elementwise clipping does. Per-parameter is occasionally useful for debugging or for architectures with very heterogeneous parameter scales, but it has not displaced global clipping in mainstream training.
| Method | What it bounds | Direction preserved | Cost | Where it shows up |
|---|---|---|---|---|
| Clip by value | Each scalar component to [-c, c] | No | One elementwise op | Older RNN code, GANs, custom losses |
| Clip by L2 norm (per-tensor) | Each parameter tensor's norm | Per-tensor only | One reduction per tensor | Specialized debugging, mixed scales |
| Clip by L2 global norm | Global vector norm across all params | Yes (globally) | One global reduction | RNNs, transformers, LLM pretraining |
| Clip by infinity norm | Largest absolute component | No (rotates) | One max reduction | Rare, theoretical interest |
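To make the global rule concrete, here is a minimal from-scratch sketch rather than any framework's implementation: it clips a list of gradient tensors against one shared L2 norm, and the small epsilon guard is just an implementation convenience.

```python
import torch

def clip_by_global_norm_(grads, tau=1.0):
    # Global L2 norm: sum of squares over every tensor, one square root at the end,
    # exactly as if all gradients were concatenated into a single vector.
    total_norm = torch.sqrt(sum((g ** 2).sum() for g in grads))
    # A single scalar factor applied to every tensor preserves the update direction.
    scale = (tau / (total_norm + 1e-6)).clamp(max=1.0)
    for g in grads:
        g.mul_(scale)
    return total_norm  # pre-clipping norm, handy for logging

# grads = [p.grad for p in model.parameters() if p.grad is not None]
# clip_by_global_norm_(grads, tau=1.0)
```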
Adaptive Gradient Clipping was introduced by Brock, De, Smith, and Simonyan in the 2021 NFNet paper, "High-Performance Large-Scale Image Recognition Without Normalization."[4] The idea is to set the clip threshold for each parameter unit as a multiple of the parameter's own norm, instead of using a single fixed value across the model. Concretely, for a parameter row W_i with gradient G_i,
if ||G_i|| / ||W_i|| > λ: G_i ← G_i · λ · ||W_i|| / ||G_i||
The coefficient λ is a small constant. Brock et al. used λ = 0.01 for every parameter except the final fully-connected classifier layer (which had AGC turned off entirely) when training NFNet-F0 through F6 at batch size 4096; smaller batch sizes such as 128 to 256 tolerated a looser λ around 0.16, while larger batches required tighter clipping for stability. The justification is that the magnitude of a sensible weight update should scale with the magnitude of the weight itself; clipping in absolute units conflates very different parameter scales. AGC was the key ingredient that allowed NFNets to match or beat the accuracy of batch normalized ResNets on ImageNet without using normalization layers, with NFNet-F5 reaching 86.0% top-1 accuracy on ImageNet.
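A simplified sketch of the AGC rule follows. Brock et al. apply it unit-wise (per output row, with a small floor on the weight norm); this version clips per tensor, and the λ and epsilon values are only placeholders:

```python
import torch

def adaptive_grad_clip_(parameters, lam=0.01, eps=1e-3):
    # Simplified per-tensor AGC: keep each gradient's norm at most
    # lam times the norm of the corresponding weight tensor.
    for p in parameters:
        if p.grad is None:
            continue
        w_norm = p.detach().norm().clamp_(min=eps)   # floor avoids clipping to zero at init
        g_norm = p.grad.detach().norm()
        max_norm = lam * w_norm
        if g_norm > max_norm:
            p.grad.mul_(max_norm / (g_norm + 1e-6))
```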
For LLM pretraining, where a single fixed threshold across hundreds of billions of parameters is increasingly seen as too blunt, two newer adaptive methods have appeared.
AdaGC (Wang et al., 2025) tracks an exponential moving average (EMA) of each parameter tensor's gradient norm and clips against a relative threshold derived from that history.[5] The per-tensor EMA is updated as γ_t,i = β · γ_{t-1,i} + (1 - β) · ||g_t,i||, and the clip factor is min(λ_rel · γ_{t-1,i} / ||g_t,i||, 1.0). On Llama-2 7B and 13B pretraining, AdaGC eliminated visible loss spikes while reducing WikiText perplexity by about 3.5% relative to global clipping.
ZClip (Kumar, Owen et al., April 2025) takes a similar EMA-of-gradient-norms approach but frames the clip decision as a z-score test.[6] At each step it tracks the running mean and standard deviation of the gradient norm and only clips when the current norm sits more than a configurable number of standard deviations above the running mean, treating the spike as a statistical anomaly rather than relying on an absolute cutoff. The authors report that on 1B-parameter Llama-style models, ZClip outperforms both fixed-threshold clipping and percentile-based methods, and broadens the range of learning rates that train stably.
Both methods aim at the same observation: the typical gradient norm during pretraining shifts substantially over the course of a run, and a fixed τ = 1.0 threshold that is appropriately tight at step 100K may be too loose at step 1M.
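The following is a rough per-tensor sketch of the EMA-relative rule described for AdaGC, built directly from the formula above; the β and λ_rel values are placeholders, and the published method includes bookkeeping details not reproduced here.

```python
import torch

class EMARelativeClipper:
    """Per-tensor clipping against an EMA of past gradient norms (sketch only)."""
    def __init__(self, beta=0.98, lam_rel=1.1):
        self.beta, self.lam_rel = beta, lam_rel
        self.ema = {}  # parameter name -> running gradient-norm EMA

    def clip_(self, named_params):
        for name, p in named_params:
            if p.grad is None:
                continue
            g_norm = p.grad.detach().norm()
            prev = self.ema.get(name, g_norm)  # bootstrap the EMA from the first observed norm
            # Clip factor min(lam_rel * gamma_{t-1} / ||g_t||, 1.0) from the text.
            factor = torch.clamp(self.lam_rel * prev / (g_norm + 1e-12), max=1.0)
            p.grad.mul_(factor)
            # EMA update as written in the text; the published method may track the
            # clipped norm here instead, so one spike cannot inflate the history.
            self.ema[name] = self.beta * prev + (1 - self.beta) * g_norm

# Usage: clipper.clip_(model.named_parameters()) after backward, before optimizer.step()
```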
Gradient clipping is always applied between the backward pass and the optimizer step. The standard ordering in PyTorch is:
```python
optimizer.zero_grad()
output = model(input)
loss = loss_fn(output, target)
loss.backward()        # populates .grad on each param
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()       # consumes the (now clipped) .grad
```
Clipping after the optimizer step would act on gradients the update has already consumed; clipping before the backward pass would find no gradients to clip. The order also matters with respect to gradient accumulation, mixed precision, and distributed reduction, all discussed below.
The clip threshold τ is a hyperparameter that interacts with the learning rate, batch size, and model architecture. There is no single correct value, but the empirical literature has converged on a small set of conventions.
| Setting | Typical global-norm threshold | Notes |
|---|---|---|
| Generic deep learning | 1.0 | Standard default for new training loops |
| LLM pretraining (decoder-only transformers) | 1.0 | Used by GPT-3, Llama 2, DeepSeek-V3 |
| Smaller transformers / fine-tuning | 1.0 to 5.0 | Less aggressive when batches are small |
| Vision transformers | 1.0 | Often paired with cosine schedule |
| Reinforcement learning (PPO, A2C) | 0.5 | Tighter clipping to handle non-stationarity |
| Compressive Transformer (memory-augmented) | 0.1 | Aggressive clipping on long-context recurrence |
| Value clipping (when used) | c ∈ [0.5, 1.0] | Range applied per element |
A practical recipe is to log the gradient norm for the first few thousand steps without clipping, observe the typical magnitude, and set τ slightly above the median observed norm so that clipping triggers only on the genuine spikes. The PyTorch clip_grad_norm_ function returns the pre-clipping total norm precisely so it can be logged for this purpose.
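A sketch of that calibration pass, assuming an ordinary PyTorch loop in which model, loss_fn, optimizer, and loader already exist:

```python
import torch

observed = []
for step, (inputs, targets) in enumerate(loader):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    # max_norm=1e9 makes the call a pure measurement: it never actually rescales.
    norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1e9)
    observed.append(norm.item())
    optimizer.step()
    if step >= 2000:
        break

observed.sort()
median = observed[len(observed) // 2]
print(f"median gradient norm ~ {median:.3f}; set tau slightly above this")
```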
Gradient clipping at a global L2 norm of 1.0 is the de facto standard for large language model pretraining. It appears, with minor variations, in nearly every published recipe for a frontier-scale decoder-only transformer.
| Model | Optimizer | Clip rule | Threshold |
|---|---|---|---|
| GPT-3 (175B) | Adam | Global L2 norm | 1.0 |
| Llama 2 (7B / 13B / 70B) | AdamW (β1=0.9, β2=0.95, wd=0.1) | Global L2 norm | 1.0 |
| Llama 3 family | AdamW | Global L2 norm | 1.0 |
| DeepSeek-V3 (671B MoE) | AdamW (β1=0.9, β2=0.95, wd=0.1) | Per-parameter clip | 1.0 |
| PaLM (540B) | Adafactor | Global L2 norm | 1.0 |
| GPT-NeoX, OPT, BLOOM | AdamW | Global L2 norm | 1.0 |
| Mistral 7B | AdamW | Global L2 norm | 1.0 |
The convergence on τ = 1.0 is striking. It reflects the fact that decoder-only transformers trained with AdamW and a moderate peak learning rate (typically 1e-4 to 6e-4) produce gradient norms that hover around or just below 1 during the bulk of training. A threshold of 1.0 is therefore tight enough to catch genuine spikes without continuously throttling the normal gradient flow.
Google's PaLM paper documented a phenomenon that should temper any belief in clipping as a complete solution: during training of the 540B PaLM model, the loss spiked roughly 20 times despite clipping being enabled at norm 1.0.[7] The spikes occurred at irregular intervals, sometimes very late into the run, and could not be predicted from gradient statistics alone. The PaLM team's mitigation was operational rather than algorithmic: when a spike began, they would restart training from a checkpoint roughly 100 steps before the spike and skip the next 200 to 500 data batches. After the skipped window, the loss did not spike again at the same point, suggesting that the spikes arose from specific interactions between particular data batches and particular model parameter states rather than from any systematic data corruption.
This is part of why adaptive methods like ZClip and AdaGC are an active area of research. A static threshold of 1.0 is good enough for most steps and most models, but not good enough to guarantee zero spikes across a months-long, multi-trillion-token pretraining run.
In data-parallel training, gradient clipping is logically a global operation on the aggregated gradient, not a local operation on each worker's partial gradient. The standard order with data parallelism is: each worker computes gradients on its local micro-batch, the gradients are averaged across workers with an all-reduce, the global norm of the synchronized gradient is computed and clipping is applied, and only then does the optimizer step run.
Clipping each worker's local gradient before the all-reduce would change the meaning of the threshold, because the global norm of the sum of clipped vectors is not the same as the clip of the global norm of the sum. Frameworks like PyTorch DDP handle this correctly by default: gradients are reduced first, then clip_grad_norm_ is called on the synchronized gradients.
Fully Sharded Data Parallel (FSDP) and ZeRO add a wrinkle. With sharded parameters and sharded gradients, no single rank holds the full gradient vector at clip time. Computing the global norm requires an additional cross-rank reduction (each rank computes the local sum of squares for its shard, then an all-reduce sums those across ranks, and the square root is taken). PyTorch FSDP exposes model.clip_grad_norm_(max_norm) precisely for this reason, since calling the unsharded torch.nn.utils.clip_grad_norm_ directly on model.parameters() would only see the local shard and produce an incorrect (smaller) norm. Megatron-LM and DeepSpeed ZeRO handle the same coordination internally.
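As a sketch, assuming the model has already been wrapped in FullyShardedDataParallel and the usual loop variables (loss_fn, batch, target, optimizer) exist, the per-step usage looks like this:

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# model = FSDP(base_module)   # wrapping and process-group setup elided
loss = loss_fn(model(batch), target)
loss.backward()
# FSDP's own method reduces each rank's local sum of squares across ranks,
# so the norm it clips against is the true global norm, not the local shard's.
total_norm = model.clip_grad_norm_(max_norm=1.0)
optimizer.step()
optimizer.zero_grad()
```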
Gradient accumulation introduces another subtlety. When gradients are accumulated across k micro-batches before a single optimizer step, clipping should apply to the accumulated gradient, not to each micro-batch's contribution. Clipping per micro-batch and then summing produces a different (and smaller) effective threshold. The standard pattern is to call loss.backward() k times, then call clip_grad_norm_ once just before optimizer.step().
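In code, the pattern looks roughly like this (model, loss_fn, optimizer, and loader are assumed; the accumulation count is illustrative):

```python
import torch

accum_steps = 8   # k micro-batches per optimizer step
optimizer.zero_grad()
for i, (inputs, targets) in enumerate(loader):
    loss = loss_fn(model(inputs), targets) / accum_steps   # keep the effective scale right
    loss.backward()                                        # gradients accumulate in .grad
    if (i + 1) % accum_steps == 0:
        # One clip on the fully accumulated gradient, immediately before the step.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        optimizer.zero_grad()
```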
Mixed precision training with FP16 introduces loss scaling: the loss is multiplied by a large scalar before the backward pass to push small gradients into the representable range of FP16, and the resulting gradients are correspondingly inflated. If clipping is applied to these scaled gradients, the threshold means something completely different (and very loose) compared to its meaning on unscaled gradients.
The correct sequence in PyTorch with torch.amp.GradScaler is:
```python
scaler.scale(loss).backward()
scaler.unscale_(optimizer)      # restore true gradient magnitudes
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
scaler.step(optimizer)
scaler.update()
```
The explicit scaler.unscale_ call divides the gradients by the current loss scale before clipping, so the threshold of 1.0 means what it normally means. BF16 training does not need this dance, since BF16 has the same dynamic range as FP32 and loss scaling is unnecessary, but the same clip-then-step ordering still applies.
PyTorch provides two utilities in torch.nn.utils:
```python
# Norm clipping (returns the unclipped norm for logging)
total_norm = torch.nn.utils.clip_grad_norm_(
    model.parameters(), max_norm=1.0, norm_type=2.0
)

# Value clipping
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)
```
The norm version computes the norm over all parameter gradients as if they were concatenated into a single vector (this is global clipping). It scales gradients in place and returns the pre-clipping total norm so it can be logged or used for monitoring. The norm_type argument accepts any p-norm, including float('inf') for infinity-norm clipping; the default of 2.0 is what almost everyone uses.
TensorFlow exposes three primitives:
```python
import tensorflow as tf

# Global L2 norm clipping (most common)
clipped, global_norm = tf.clip_by_global_norm(gradients, clip_norm=1.0)

# Per-tensor norm clipping
clipped = [tf.clip_by_norm(g, 1.0) for g in gradients]

# Value clipping
clipped = [tf.clip_by_value(g, -1.0, 1.0) for g in gradients]
```
Keras optimizers also accept clipnorm, global_clipnorm, and clipvalue constructor arguments that perform the clipping internally:
```python
optimizer = tf.keras.optimizers.AdamW(learning_rate=1e-3, global_clipnorm=1.0)
```
The global_clipnorm argument applies global L2 clipping; clipnorm applies per-variable clipping; clipvalue applies elementwise value clipping.
In the JAX ecosystem, clipping is a GradientTransformation that composes with the optimizer through optax.chain:
```python
import optax

optimizer = optax.chain(
    optax.clip_by_global_norm(1.0),
    optax.adamw(learning_rate=3e-4, b1=0.9, b2=0.95, weight_decay=0.1),
)
```
optax.clip_by_global_norm implements the same global L2 rule used elsewhere; optax.clip does elementwise clipping; optax.adaptive_grad_clip implements Brock et al.'s AGC with a clipping parameter.
For a long time, gradient clipping was justified entirely on empirical grounds. Pascanu et al. argued from a dynamical systems perspective that recurrent networks pass through narrow regions of the loss surface where the gradient becomes locally enormous, and clipping is a reasonable response: rescale the step but keep its direction so the optimizer can still descend.[1]
The sharper theoretical picture came from Zhang, He, Sra, and Jadbabaie in their 2020 ICLR paper, "Why Gradient Clipping Accelerates Training: A Theoretical Justification for Adaptivity."[8] They observed that the standard analysis of gradient descent assumes Lipschitz-smooth gradients with a fixed Lipschitz constant L. In real neural network training, that constant is anything but fixed: empirically, the local Lipschitz constant of the gradient grows roughly linearly with the gradient norm itself, a regime they called (L_0, L_1)-smoothness. Under that relaxed condition, standard fixed-step gradient descent is forced to take very small steps to stay safe in the high-curvature regions, slowing convergence. Gradient clipping (and the closely related normalized gradient method) effectively adapts the step size to the local curvature, achieving provably faster convergence rates without needing to know the Lipschitz constant in advance.
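In their notation, a loss f is (L_0, L_1)-smooth when the local smoothness (the Hessian norm, for twice-differentiable f) is bounded by an affine function of the gradient norm,

||∇²f(x)|| ≤ L_0 + L_1 · ||∇f(x)||

so curvature is allowed to grow exactly where the gradient is large; a clipped or normalized step automatically shortens itself in those regions, which is what the convergence analysis exploits.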
This result reframes gradient clipping as a form of implicit step-size adaptation rather than just a safety net against overflow. It also suggests why clipping helps even on training runs that never threaten to diverge: the threshold acts as a coarse but effective curvature regularizer, taming the optimizer's behavior in regions where the loss surface is locally rough.
Gradient clipping plays an entirely different role in privacy-preserving machine learning. In differentially private stochastic gradient descent (DP-SGD), introduced by Abadi et al. in their 2016 ACM CCS paper "Deep Learning with Differential Privacy," per-example gradients are clipped to a fixed L2 norm C before being summed and noised:[9]
g_i ← g_i · min(1, C / ||g_i||)
The clipped per-example gradients are then summed, and Gaussian noise of scale σ · C is added before the averaged result updates the model. The role of the clip bound C here is fundamentally different. In standard training, clipping is a stability tool that triggers occasionally on outlier gradients. In DP-SGD, it is a privacy mechanism: the L2 sensitivity of the gradient sum to any single example is exactly C, which is what allows noise of that scale to mask each example's contribution and yield formal (ε, δ)-differential privacy guarantees. Choosing C therefore involves a privacy-utility tradeoff that has nothing to do with exploding gradients per se: too small and the model cannot learn, too large and the required noise becomes overwhelming. Adaptive variants such as the median-clipping trick of Andrew et al. (2021) adjust C over training to track the empirical gradient norm distribution, but the core mechanism is unchanged.
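A sketch of the aggregation step under this rule, written against per-example gradient vectors that have already been computed (obtaining them efficiently is its own problem, handled in practice by libraries such as Opacus); the function name and noise bookkeeping here are illustrative:

```python
import torch

def dp_sgd_aggregate(per_example_grads, C=1.0, sigma=1.0):
    # per_example_grads: list of flattened gradient vectors, one per example.
    clipped = []
    for g in per_example_grads:
        # Per-example clip to L2 norm C: the sum's sensitivity to any one example is C.
        scale = torch.clamp(C / (g.norm() + 1e-12), max=1.0)
        clipped.append(g * scale)
    total = torch.stack(clipped).sum(dim=0)
    # Gaussian noise of scale sigma * C masks any single example's contribution.
    noised = total + torch.randn_like(total) * sigma * C
    return noised / len(per_example_grads)
```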
This dual role makes gradient clipping a rare technique that is structurally important in two otherwise unrelated subfields: stable training of large models, and differentially private learning of any model.
Logging the gradient norm at every step is a habit that pays for itself many times over during long training runs. The norm provides a direct readout of training health and an early warning of problems.
What to look for:
- A norm that sits at or above the threshold on most steps means clipping is throttling ordinary updates: the threshold is too low or the learning rate is too high.
- Isolated spikes are the events clipping exists to absorb; their frequency is worth tracking, since clusters of spikes often precede divergence.
- A norm that drifts steadily upward over thousands of steps is an early warning of instability, usually visible before the loss reacts.
- NaN or Inf norms mean numerical overflow somewhere in the backward pass; the clip_grad_norm_ function with error_if_nonfinite=True will raise instead of silently clipping NaN.

Logging the pre-clipping norm (which is what clip_grad_norm_ returns) is more informative than logging the post-clipping norm, because the post-clip value is just the threshold whenever clipping triggers and provides no information about how close the run is to instability.
Gradient clipping is not free. The L2 norm requires touching every gradient tensor at every step, which costs a small amount of communication in distributed settings and a tiny amount of compute everywhere. More importantly, an aggressively low threshold can mask informative gradient signals and slow convergence. There are also settings where clipping adds nothing: small, well-conditioned models trained with normalization layers and conservative learning rates rarely produce gradient norms anywhere near a sensible threshold, so the clip simply never fires.
The failure mode of overly aggressive clipping is convergence that is slower than it needs to be but otherwise normal. The failure mode of no clipping on a model that needs it is total divergence, often hours or days into a run. The asymmetry strongly favors leaving clipping enabled.
Gradient clipping is one tool in a broader stability toolkit and interacts with several others, most directly learning-rate warmup and schedules, normalization layers, careful initialization, and the loss scaling used in mixed precision training.
Imagine you are walking on a hilly path with your eyes closed, taking small steps in the direction someone tells you to go. Most of the time the ground is gentle and your steps work fine. But sometimes the person shouts an instruction for a huge move, like "go ten feet that way!", and if you actually take a giant leap with your eyes closed you will probably fall off a cliff.
Gradient clipping is the rule "no matter how big the instruction is, never step farther than a normal stride." The direction is still right, but you never lurch farther than you can recover from. Because of that one rule, you can keep walking the whole path without falling, even when the instructions get scary.