Clipping is a family of techniques in machine learning that constrain numerical values to lie within a specified range or below a specified magnitude. The most common application is gradient clipping, which limits the size of gradients during backpropagation to prevent the exploding gradient problem. Clipping is also applied to weights, activations, and policy ratios in various contexts across deep learning and reinforcement learning.
Imagine you are steering a toy car with a remote control. If you push the joystick all the way to one side, the car spins out of control and crashes. Clipping is like putting bumpers on the joystick so it can only move a little bit at a time. The car still turns, but it never turns so fast that it crashes. In machine learning, the "joystick" is the gradient (the signal that tells the model how to change), and clipping keeps it from getting so large that training goes haywire.
Neural networks learn by computing gradients of a loss function with respect to each parameter and then updating those parameters using an optimizer such as stochastic gradient descent (SGD) or Adam. During backpropagation, gradients are propagated backward through layers via the chain rule. In deep networks, and especially in recurrent neural networks (RNNs), repeated multiplication of Jacobian matrices can cause gradient magnitudes to grow exponentially with the number of layers or time steps. This is known as the exploding gradient problem.
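A toy calculation (plain Python, not a real network) makes the exponential growth concrete: if each backward step amplifies the signal by a constant factor greater than 1, standing in for a Jacobian with spectral norm above 1, the gradient magnitude blows up after a few dozen layers or time steps.

```python
# Toy illustration of the exploding gradient problem: backpropagating
# through 50 "layers" whose Jacobians each amplify the signal by 1.5
# grows the gradient magnitude exponentially (1.5**50 is about 6.4e8).
grad = 1.0
for _ in range(50):
    grad *= 1.5  # one "layer" or "time step" of backprop
print(grad)      # roughly 6.4e8
```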
When gradients explode, parameter updates become extremely large, causing the loss to diverge, weights to overflow to NaN, or training to oscillate wildly. Gradient clipping provides a straightforward remedy: if the gradient exceeds a threshold, it is scaled down (or capped) before the optimizer step.
The technique was formalized by Pascanu, Mikolov, and Bengio in their 2013 paper "On the difficulty of training recurrent neural networks," which analyzed the exploding and vanishing gradient problems from analytical, geometric, and dynamical systems perspectives and proposed gradient norm clipping as a practical solution.
Gradient clipping is performed after computing gradients (via loss.backward() or equivalent) but before the optimizer updates parameters (via optimizer.step()). There are two main strategies: clipping by value and clipping by norm.
Clipping by value constrains each individual gradient component independently to lie within a fixed interval [-λ, λ], where λ is a chosen threshold.
For each gradient element g_i:
g_i = max(-λ, min(λ, g_i))
This approach is simple but has a significant drawback: it can change the direction of the gradient vector. If some components are clipped and others are not, the resulting vector may point in a different direction from the original gradient. This can slow convergence or cause erratic training dynamics.
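The direction change is easy to see in a small self-contained sketch; `clip_by_value` below is an illustrative pure-Python helper, not a framework function.

```python
import math

def clip_by_value(grad, lam):
    """Clamp each gradient component independently to [-lam, lam]."""
    return [max(-lam, min(lam, g)) for g in grad]

# A gradient whose components differ widely in magnitude.
g = [10.0, 0.5]
clipped = clip_by_value(g, 1.0)
print(clipped)  # [1.0, 0.5]

# The original vector points almost along the first axis; the clipped one
# has rotated noticeably, because only the large component was reduced.
print(math.atan2(g[1], g[0]))              # ~0.05 rad
print(math.atan2(clipped[1], clipped[0]))  # ~0.46 rad
```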
Clipping by norm treats the entire gradient as a single vector and rescales it so that its norm does not exceed a threshold, while preserving the gradient's direction. This is the more commonly used strategy in practice.
Given a gradient vector g and a threshold τ, the L2 (Euclidean) norm is computed as:
||g||_2 = sqrt(sum(g_i^2))
If ||g||_2 > τ, the gradient is rescaled:
g = (τ / ||g||_2) * g
If ||g||_2 ≤ τ, the gradient is left unchanged. Because the rescaling is uniform across all components, the direction of the gradient is preserved. Only the magnitude is reduced.
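The rule above can be sketched in a few lines of plain Python; `clip_by_norm` here is an illustrative helper mirroring the formula, not a library API.

```python
import math

def clip_by_norm(grad, tau):
    """Rescale grad so its L2 norm is at most tau, preserving direction."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > tau:
        scale = tau / norm
        return [g * scale for g in grad]
    return grad

g = [3.0, 4.0]                  # ||g||_2 = 5
clipped = clip_by_norm(g, 1.0)  # ~[0.6, 0.8]: same direction, norm 1
small = clip_by_norm([0.1, 0.1], 1.0)  # under the threshold: unchanged
```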
There are two ways to apply norm-based clipping in a multi-parameter model:
| Strategy | Description | Direction preserved? | Speed |
|---|---|---|---|
| Global norm clipping | Concatenate all parameter gradients into one vector, compute its norm, and scale all gradients by the same factor if the global norm exceeds the threshold. | Yes (globally) | Slower (requires all gradients before clipping) |
| Per-parameter clipping | Compute the norm of each parameter's gradient independently and clip each one separately. | Yes (per tensor) | Faster (can clip as each gradient is ready) |
Global norm clipping is generally preferred because it preserves the relative scale of gradients across different parameters, maintaining the overall descent direction. This is the approach recommended by Pascanu et al. (2013) and used in most large-scale training pipelines, including transformer models.
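Global norm clipping can be sketched as follows; `clip_by_global_norm` is an illustrative pure-Python helper that mirrors what frameworks do internally, with each inner list standing in for one parameter tensor's gradient.

```python
import math

def clip_by_global_norm(grads, tau):
    """Scale all gradients by one shared factor if their joint L2 norm exceeds tau."""
    total = math.sqrt(sum(g * g for grad in grads for g in grad))
    if total > tau:
        scale = tau / total
        grads = [[g * scale for g in grad] for grad in grads]
    return grads, total

# Two "parameter" gradients; because both are scaled by the same factor,
# their relative magnitudes (and the overall direction) are preserved.
grads = [[3.0, 0.0], [0.0, 4.0]]
clipped, norm = clip_by_global_norm(grads, 1.0)  # norm = 5.0
```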
In reinforcement learning, clipping appears prominently in Proximal Policy Optimization (PPO), introduced by Schulman et al. (2017). PPO uses a clipped surrogate objective to prevent the policy from changing too much in a single update, which improves training stability.
The PPO-Clip objective is:
L(θ) = E[ min( r(θ) * A, clip(r(θ), 1 - ε, 1 + ε) * A ) ]
where:

- r(θ) = π_θ(a|s) / π_θ_old(a|s) is the probability ratio between the new and old policies,
- A is an estimate of the advantage of the action taken, and
- ε is a small hyperparameter (typically 0.1 to 0.3) that bounds how far the ratio may move from 1.
The clip function constrains r(θ) to the interval [1 - ε, 1 + ε]. When the advantage is positive (the action was better than expected), the objective is capped at (1 + ε) * A, preventing the policy from increasing the probability of that action beyond the clipping boundary. When the advantage is negative, the objective is capped at (1 - ε) * A, preventing excessive reduction. This mechanism replaces the hard KL divergence constraint used in Trust Region Policy Optimization (TRPO) with a simpler first-order optimization approach.
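The per-sample surrogate term follows directly from the formula above; `ppo_clip_term` is an illustrative helper, not part of any RL library.

```python
def ppo_clip_term(ratio, advantage, eps=0.2):
    """Single-sample PPO-Clip surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# Positive advantage: the objective is capped at (1 + eps) * A once r > 1 + eps,
# so there is no incentive to push the action's probability further up.
print(ppo_clip_term(1.5, 2.0))   # 2.4

# Negative advantage: the objective floors at (1 - eps) * A once r < 1 - eps,
# removing the incentive to reduce the probability further in this update.
print(ppo_clip_term(0.5, -2.0))  # -1.6
```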
Weight clipping constrains the values of model parameters (not gradients) to a fixed range after each optimization step. The most prominent use case is in the original Wasserstein GAN (WGAN), proposed by Arjovsky, Chintala, and Bottou (2017).
WGAN requires the discriminator (called the "critic") to be a 1-Lipschitz function. The original paper enforced this by clipping all critic weights to a compact interval [-c, c] (typically c = 0.01) after each parameter update:
```python
for p in critic.parameters():
    p.data.clamp_(-c, c)
```
While simple, weight clipping has notable drawbacks. It tends to push weights toward the boundary values of the clipping range, underutilizing the network's capacity. It can also cause the critic to learn overly simple functions. These limitations led to the development of WGAN-GP (Gulrajani et al., 2017), which replaces weight clipping with a gradient penalty term that penalizes the critic when the gradient norm deviates from 1.
Activation clipping bounds the output of activation functions to a fixed range. The most well-known example is ReLU6, defined as:
ReLU6(x) = min(max(0, x), 6)
ReLU6 was introduced by Krizhevsky (2010) and became widely used in mobile and edge architectures such as MobileNet. The upper bound of 6 prevents activations from growing unboundedly, which is particularly useful for low-precision and quantized inference, where a known, bounded activation range maps cleanly onto fixed-point number formats.
More generally, any use of torch.clamp() or tf.clip_by_value() on intermediate activations constitutes activation clipping.
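The ReLU6 definition translates directly into code; this is a scalar plain-Python sketch rather than a framework op.

```python
def relu6(x):
    """ReLU capped at 6: min(max(0, x), 6)."""
    return min(max(0.0, x), 6.0)

print([relu6(x) for x in (-2.0, 3.0, 9.0)])  # [0.0, 3.0, 6.0]
```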
PyTorch provides two built-in functions for gradient clipping in torch.nn.utils:
| Function | Type | Description |
|---|---|---|
| `clip_grad_norm_(parameters, max_norm, norm_type=2.0)` | Norm clipping | Computes the total norm of all parameter gradients (concatenated as a single vector) and scales them if the norm exceeds `max_norm`. Returns the total norm. |
| `clip_grad_value_(parameters, clip_value)` | Value clipping | Clamps each gradient element to the range `[-clip_value, clip_value]`. |
Typical usage:
```python
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```
TensorFlow offers several clipping functions:
| Function | Type | Description |
|---|---|---|
| `tf.clip_by_value(t, clip_value_min, clip_value_max)` | Value clipping | Clips tensor values element-wise to a range. |
| `tf.clip_by_norm(t, clip_norm)` | Per-tensor norm clipping | Clips a single tensor so its L2 norm does not exceed `clip_norm`. |
| `tf.clip_by_global_norm(t_list, clip_norm)` | Global norm clipping | Clips a list of tensors so that their joint (global) L2 norm does not exceed `clip_norm`, scaling every tensor by the same factor. |
TensorFlow/Keras optimizers also accept clipnorm (per-parameter) and global_clipnorm (global) arguments directly:
```python
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, global_clipnorm=1.0)
```
Mixed-precision training uses lower-precision floating-point formats (such as float16) to speed up computation and reduce memory usage. Because float16 has a limited dynamic range, small gradients can underflow to zero. To prevent this, frameworks like PyTorch use a GradScaler that multiplies the loss by a scale factor before the backward pass, inflating gradient magnitudes into the representable float16 range.
When combining gradient clipping with mixed-precision training, the correct order of operations is:
1. `scaler.scale(loss).backward()`
2. `scaler.unscale_(optimizer)` (restores the original gradient magnitudes)
3. `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)`
4. `scaler.step(optimizer)`
5. `scaler.update()`

Unscaling before clipping is necessary because the clipping threshold is specified for the true (unscaled) gradient magnitudes. If gradients were still scaled, the threshold would not correspond to the intended magnitude, rendering the clipping ineffective or overly aggressive.
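The arithmetic behind this ordering can be illustrated without the actual GradScaler API; the numbers below are a made-up example assuming a loss scale of 1024 and a clipping threshold of 1.0.

```python
import math

def l2(v):
    """L2 norm of a plain list of floats."""
    return math.sqrt(sum(x * x for x in v))

true_grad = [0.3, 0.4]               # true norm 0.5, under the threshold of 1.0
scale = 1024.0                       # hypothetical loss-scale factor
scaled_grad = [g * scale for g in true_grad]

# Wrong order: clipping the still-scaled gradient against max_norm = 1.0
# shrinks it drastically, even though the true gradient needed no clipping.
max_norm = 1.0
norm = l2(scaled_grad)               # ~512.0
wrong = [g * (max_norm / norm) for g in scaled_grad]
wrong_unscaled = [g / scale for g in wrong]   # norm ~0.001: over-clipped

# Right order: unscale first, then compare against the threshold.
unscaled = [g / scale for g in scaled_grad]   # norm 0.5 <= max_norm: untouched
```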
Selecting the right clipping threshold requires balancing stability against learning speed. If the threshold is too low, gradients are clipped too aggressively, slowing convergence and potentially preventing the model from escaping local minima. If it is too high, clipping never activates and provides no protection against gradient explosions.
Common practices include:

- Monitoring gradient norms during training (which `clip_grad_norm_` conveniently returns) to reveal the typical range of gradient magnitudes. The threshold can then be set at a percentile (e.g., 90th or 95th) of observed norms, so that clipping only activates on unusually large gradients.

| Application | Typical threshold range | Clipping type |
|---|---|---|
| Transformer language models | 0.25 to 1.0 | Global norm |
| RNNs / LSTMs | 1.0 to 5.0 | Global norm |
| GANs (gradient penalty) | 1.0 to 10.0 | Global norm |
| PPO (policy ratio) | ε = 0.1 to 0.3 | Ratio clipping |
| WGAN (weight clipping) | c = 0.01 | Weight value |
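The percentile heuristic for choosing a norm threshold can be sketched as follows; `observed_norms` is a hypothetical log of per-step gradient norms from a warm-up run, and `percentile` is a simple nearest-rank implementation rather than a library call.

```python
# Hypothetical log of per-step global gradient norms, including one spike.
observed_norms = [0.4, 0.6, 0.5, 0.7, 0.55, 3.2, 0.45, 0.65, 0.5, 0.6]

def percentile(values, q):
    """Nearest-rank percentile (q in [0, 100]) of a list of numbers."""
    ordered = sorted(values)
    k = max(0, min(len(ordered) - 1, round(q / 100 * (len(ordered) - 1))))
    return ordered[k]

# A 90th-percentile threshold leaves typical steps untouched and only
# activates on outliers such as the 3.2 spike.
threshold = percentile(observed_norms, 90)
print(threshold)  # 0.7
```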
| Method | What is clipped | Direction preserved? | Primary use case |
|---|---|---|---|
| Gradient clipping by value | Individual gradient elements | No | Preventing outlier gradient components |
| Gradient clipping by norm | Gradient vector magnitude | Yes | General training stabilization (most common) |
| Weight clipping | Model parameters | N/A | Enforcing Lipschitz constraints (WGAN) |
| Activation clipping | Layer outputs | N/A | Quantization-friendly architectures (MobileNet) |
| PPO ratio clipping | Policy probability ratios | N/A | Stable policy updates in RL |
The problem of exploding and vanishing gradients in recurrent networks was identified by Bengio, Simard, and Frasconi (1994), and independently by Hochreiter (1991). The Long Short-Term Memory (LSTM) architecture, introduced by Hochreiter and Schmidhuber (1997), addressed vanishing gradients through gating mechanisms but did not fully solve the exploding gradient issue.
Gradient clipping as a regularization heuristic appeared in various forms throughout the 2000s, but the technique was formalized and analyzed rigorously by Pascanu, Mikolov, and Bengio (2013). Their paper provided both theoretical analysis and practical algorithms, establishing gradient norm clipping as a standard tool in deep learning. The approach has since been adopted in nearly every major training framework and is a default component of training pipelines for large language models, vision transformers, and other deep architectures.