Clipping is a family of techniques in machine learning that constrain numerical values to lie within a specified range or below a specified magnitude. The most common application is gradient clipping, which limits the size of gradients during backpropagation to prevent the exploding gradient problem. Clipping is also applied to weights, activations, and policy ratios in various contexts across deep learning and reinforcement learning.
Imagine you are steering a toy car with a remote control. If you push the joystick all the way to one side, the car spins out of control and crashes. Clipping is like putting bumpers on the joystick so it can only move a little bit at a time. The car still turns, but it never turns so fast that it crashes. In machine learning, the "joystick" is the gradient (the signal that tells the model how to change), and clipping keeps it from getting so large that training goes haywire.
Neural networks learn by computing gradients of a loss function with respect to each parameter and then updating those parameters using an optimizer such as stochastic gradient descent (SGD) or Adam. During backpropagation, gradients are propagated backward through layers via the chain rule. In deep networks, and especially in recurrent neural networks (RNNs), repeated multiplication of Jacobian matrices can cause gradient magnitudes to grow exponentially with the number of layers or time steps. This is known as the exploding gradient problem.
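A toy calculation (plain Python, not a real network) makes the exponential growth concrete: if each backward step amplifies the signal by a constant factor greater than 1, standing in for a Jacobian with spectral norm above 1, the gradient magnitude blows up after a few dozen layers or time steps.

```python
# Toy illustration of the exploding gradient problem: backpropagating
# through 50 "layers" whose Jacobians each amplify the signal by 1.5
# grows the gradient magnitude exponentially (1.5**50 is about 6.4e8).
grad = 1.0
for _ in range(50):
    grad *= 1.5  # one "layer" or "time step" of backprop
print(grad)      # roughly 6.4e8
```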
When gradients explode, parameter updates become extremely large, causing the loss to diverge, weights to overflow to NaN, or training to oscillate wildly. Gradient clipping provides a straightforward remedy: if the gradient exceeds a threshold, it is scaled down (or capped) before the optimizer step.
The technique was formalized by Pascanu, Mikolov, and Bengio in their 2013 paper "On the difficulty of training recurrent neural networks," which analyzed the exploding and vanishing gradient problems from analytical, geometric, and dynamical systems perspectives and proposed gradient norm clipping as a practical solution.
Gradient clipping is performed after computing gradients (via loss.backward() or equivalent) but before the optimizer updates parameters (via optimizer.step()). There are two main strategies: clipping by value and clipping by norm.
Clipping by value constrains each individual gradient component independently to lie within a fixed interval [-λ, λ], where λ is a chosen threshold.
For each gradient element g_i:
g_i = max(-λ, min(λ, g_i))
This approach is simple but has a significant drawback: it can change the direction of the gradient vector. If some components are clipped and others are not, the resulting vector may point in a different direction from the original gradient. This can slow convergence or cause erratic training dynamics.
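The direction change is easy to see in a small self-contained sketch; `clip_by_value` below is an illustrative pure-Python helper, not a framework function.

```python
import math

def clip_by_value(grad, lam):
    """Clamp each gradient component independently to [-lam, lam]."""
    return [max(-lam, min(lam, g)) for g in grad]

# A gradient whose components differ widely in magnitude.
g = [10.0, 0.5]
clipped = clip_by_value(g, 1.0)
print(clipped)  # [1.0, 0.5]

# The original vector points almost along the first axis; the clipped one
# has rotated noticeably, because only the large component was reduced.
print(math.atan2(g[1], g[0]))              # ~0.05 rad
print(math.atan2(clipped[1], clipped[0]))  # ~0.46 rad
```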
Clipping by norm treats the entire gradient as a single vector and rescales it so that its norm does not exceed a threshold, while preserving the gradient's direction. This is the more commonly used strategy in practice.
Given a gradient vector g and a threshold τ, the L2 (Euclidean) norm is computed as:
||g||_2 = sqrt(sum(g_i^2))
If ||g||_2 > τ, the gradient is rescaled:
g = (τ / ||g||_2) * g
If ||g||_2 ≤ τ, the gradient is left unchanged. Because the rescaling is uniform across all components, the direction of the gradient is preserved. Only the magnitude is reduced.
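The rule above can be sketched in a few lines of plain Python; `clip_by_norm` here is an illustrative helper mirroring the formula, not a library API.

```python
import math

def clip_by_norm(grad, tau):
    """Rescale grad so its L2 norm is at most tau, preserving direction."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > tau:
        scale = tau / norm
        return [g * scale for g in grad]
    return grad

g = [3.0, 4.0]                  # ||g||_2 = 5
clipped = clip_by_norm(g, 1.0)  # ~[0.6, 0.8]: same direction, norm 1
small = clip_by_norm([0.1, 0.1], 1.0)  # under the threshold: unchanged
```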
There are two ways to apply norm-based clipping in a multi-parameter model:
| Strategy | Description | Direction preserved? | Speed |
|---|---|---|---|
| Global norm clipping | Concatenate all parameter gradients into one vector, compute its norm, and scale all gradients by the same factor if the global norm exceeds the threshold. | Yes (globally) | Slower (requires all gradients before clipping) |
| Per-parameter clipping | Compute the norm of each parameter's gradient independently and clip each one separately. | Yes (per tensor) | Faster (can clip as each gradient is ready) |
Global norm clipping is generally preferred because it preserves the relative scale of gradients across different parameters, maintaining the overall descent direction. This is the approach recommended by Pascanu et al. (2013) and used in most large-scale training pipelines, including transformer models.
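Global norm clipping can be sketched as follows; `clip_by_global_norm` is an illustrative pure-Python helper that mirrors what frameworks do internally, with each inner list standing in for one parameter tensor's gradient.

```python
import math

def clip_by_global_norm(grads, tau):
    """Scale all gradients by one shared factor if their joint L2 norm exceeds tau."""
    total = math.sqrt(sum(g * g for grad in grads for g in grad))
    if total > tau:
        scale = tau / total
        grads = [[g * scale for g in grad] for grad in grads]
    return grads, total

# Two "parameter" gradients; because both are scaled by the same factor,
# their relative magnitudes (and the overall direction) are preserved.
grads = [[3.0, 0.0], [0.0, 4.0]]
clipped, norm = clip_by_global_norm(grads, 1.0)  # norm = 5.0
```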
In reinforcement learning, clipping appears prominently in Proximal Policy Optimization (PPO), introduced by Schulman et al. (2017). PPO uses a clipped surrogate objective to prevent the policy from changing too much in a single update, which improves training stability.
The PPO-Clip objective is:
L(θ) = E[ min( r(θ) * A, clip(r(θ), 1 - ε, 1 + ε) * A ) ]
where:

- r(θ) = π_θ(a|s) / π_θ_old(a|s) is the probability ratio between the new and old policies,
- A is an estimate of the advantage of the action taken, and
- ε is a small hyperparameter (typically 0.1 to 0.3) that bounds how far the ratio may move from 1.
The clip function constrains r(θ) to the interval [1 - ε, 1 + ε]. When the advantage is positive (the action was better than expected), the objective is capped at (1 + ε) * A, preventing the policy from increasing the probability of that action beyond the clipping boundary. When the advantage is negative, the objective is capped at (1 - ε) * A, preventing excessive reduction. This mechanism replaces the hard KL divergence constraint used in Trust Region Policy Optimization (TRPO) with a simpler first-order optimization approach.
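The per-sample surrogate term follows directly from the formula above; `ppo_clip_term` is an illustrative helper, not part of any RL library.

```python
def ppo_clip_term(ratio, advantage, eps=0.2):
    """Single-sample PPO-Clip surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# Positive advantage: the objective is capped at (1 + eps) * A once r > 1 + eps,
# so there is no incentive to push the action's probability further up.
print(ppo_clip_term(1.5, 2.0))   # 2.4

# Negative advantage: the objective floors at (1 - eps) * A once r < 1 - eps,
# removing the incentive to reduce the probability further in this update.
print(ppo_clip_term(0.5, -2.0))  # -1.6
```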
Weight clipping constrains the values of model parameters (not gradients) to a fixed range after each optimization step. The most prominent use case is in the original Wasserstein GAN (WGAN), proposed by Arjovsky, Chintala, and Bottou (2017).
WGAN requires the discriminator (called the "critic") to be a 1-Lipschitz function. The original paper enforced this by clipping all critic weights to a compact interval [-c, c] (typically c = 0.01) after each parameter update:
```python
for p in critic.parameters():
    p.data.clamp_(-c, c)
```
While simple, weight clipping has notable drawbacks. It tends to push weights toward the boundary values of the clipping range, underutilizing the network's capacity. It can also cause the critic to learn overly simple functions. These limitations led to the development of WGAN-GP (Gulrajani et al., 2017), which replaces weight clipping with a gradient penalty term that penalizes the critic when the gradient norm deviates from 1.
Activation clipping bounds the output of activation functions to a fixed range. The most well-known example is ReLU6, defined as:
ReLU6(x) = min(max(0, x), 6)
ReLU6 was introduced by Krizhevsky (2010) and became widely used in mobile and edge architectures such as MobileNet. The upper bound of 6 prevents activations from growing unboundedly, which is particularly useful for low-precision and quantized inference, where a known, bounded activation range maps cleanly onto fixed-point number formats.
More generally, any use of torch.clamp() or tf.clip_by_value() on intermediate activations constitutes activation clipping.
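The ReLU6 definition translates directly into code; this is a scalar plain-Python sketch rather than a framework op.

```python
def relu6(x):
    """ReLU capped at 6: min(max(0, x), 6)."""
    return min(max(0.0, x), 6.0)

print([relu6(x) for x in (-2.0, 3.0, 9.0)])  # [0.0, 3.0, 6.0]
```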
PyTorch provides two built-in functions for gradient clipping in torch.nn.utils:
| Function | Type | Description |
|---|---|---|
| `clip_grad_norm_(parameters, max_norm, norm_type=2.0)` | Norm clipping | Computes the total norm of all parameter gradients (concatenated as a single vector) and scales them if the norm exceeds `max_norm`. Returns the total norm. |
| `clip_grad_value_(parameters, clip_value)` | Value clipping | Clamps each gradient element to the range `[-clip_value, clip_value]`. |
Typical usage:
```python
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```
TensorFlow offers several clipping functions:
| Function | Type | Description |
|---|---|---|
| `tf.clip_by_value(t, clip_value_min, clip_value_max)` | Value clipping | Clips tensor values element-wise to a range. |
| `tf.clip_by_norm(t, clip_norm)` | Per-tensor norm clipping | Clips a single tensor so its L2 norm does not exceed `clip_norm`. |
| `tf.clip_by_global_norm(t_list, clip_norm)` | Global norm clipping | Clips a list of tensors so that their joint (global) L2 norm does not exceed `clip_norm`, scaling every tensor by the same factor. |
TensorFlow/Keras optimizers also accept clipnorm (per-parameter) and global_clipnorm (global) arguments directly:
```python
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, global_clipnorm=1.0)
```
Mixed-precision training uses lower-precision floating-point formats (such as float16) to speed up computation and reduce memory usage. Because float16 has a limited dynamic range, small gradients can underflow to zero. To prevent this, frameworks like PyTorch use a GradScaler that multiplies the loss by a scale factor before the backward pass, inflating gradient magnitudes into the representable float16 range.
When combining gradient clipping with mixed-precision training, the correct order of operations is:
1. `scaler.scale(loss).backward()`
2. `scaler.unscale_(optimizer)` (restores the original gradient magnitudes)
3. `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)`
4. `scaler.step(optimizer)`
5. `scaler.update()`

Unscaling before clipping is necessary because the clipping threshold is specified for the true (unscaled) gradient magnitudes. If gradients were still scaled, the threshold would not correspond to the intended magnitude, rendering the clipping ineffective or overly aggressive.
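The arithmetic behind this ordering can be illustrated without the actual GradScaler API; the numbers below are a made-up example assuming a loss scale of 1024 and a clipping threshold of 1.0.

```python
import math

def l2(v):
    """L2 norm of a plain list of floats."""
    return math.sqrt(sum(x * x for x in v))

true_grad = [0.3, 0.4]               # true norm 0.5, under the threshold of 1.0
scale = 1024.0                       # hypothetical loss-scale factor
scaled_grad = [g * scale for g in true_grad]

# Wrong order: clipping the still-scaled gradient against max_norm = 1.0
# shrinks it drastically, even though the true gradient needed no clipping.
max_norm = 1.0
norm = l2(scaled_grad)               # ~512.0
wrong = [g * (max_norm / norm) for g in scaled_grad]
wrong_unscaled = [g / scale for g in wrong]   # norm ~0.001: over-clipped

# Right order: unscale first, then compare against the threshold.
unscaled = [g / scale for g in scaled_grad]   # norm 0.5 <= max_norm: untouched
```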
Selecting the right clipping threshold requires balancing stability against learning speed. If the threshold is too low, gradients are clipped too aggressively, slowing convergence and potentially preventing the model from escaping local minima. If it is too high, clipping never activates and provides no protection against gradient explosions.
Common practices include:

- Monitoring gradient norms during training (which `clip_grad_norm_` conveniently returns) to reveal the typical range of gradient magnitudes. The threshold can then be set at a percentile (e.g., 90th or 95th) of observed norms, so that clipping only activates on unusually large gradients.

| Application | Typical threshold range | Clipping type |
|---|---|---|
| Transformer language models | 0.25 to 1.0 | Global norm |
| RNNs / LSTMs | 1.0 to 5.0 | Global norm |
| GANs (gradient penalty) | 1.0 to 10.0 | Global norm |
| PPO (policy ratio) | ε = 0.1 to 0.3 | Ratio clipping |
| WGAN (weight clipping) | c = 0.01 | Weight value |
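The percentile heuristic for choosing a norm threshold can be sketched as follows; `observed_norms` is a hypothetical log of per-step gradient norms from a warm-up run, and `percentile` is a simple nearest-rank implementation rather than a library call.

```python
# Hypothetical log of per-step global gradient norms, including one spike.
observed_norms = [0.4, 0.6, 0.5, 0.7, 0.55, 3.2, 0.45, 0.65, 0.5, 0.6]

def percentile(values, q):
    """Nearest-rank percentile (q in [0, 100]) of a list of numbers."""
    ordered = sorted(values)
    k = max(0, min(len(ordered) - 1, round(q / 100 * (len(ordered) - 1))))
    return ordered[k]

# A 90th-percentile threshold leaves typical steps untouched and only
# activates on outliers such as the 3.2 spike.
threshold = percentile(observed_norms, 90)
print(threshold)  # 0.7
```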
| Method | What is clipped | Direction preserved? | Primary use case |
|---|---|---|---|
| Gradient clipping by value | Individual gradient elements | No | Preventing outlier gradient components |
| Gradient clipping by norm | Gradient vector magnitude | Yes | General training stabilization (most common) |
| Weight clipping | Model parameters | N/A | Enforcing Lipschitz constraints (WGAN) |
| Activation clipping | Layer outputs | N/A | Quantization-friendly architectures (MobileNet) |
| PPO ratio clipping | Policy probability ratios | N/A | Stable policy updates in RL |
The problem of exploding and vanishing gradients in recurrent networks was identified by Bengio, Simard, and Frasconi (1994), and independently by Hochreiter (1991). The Long Short-Term Memory (LSTM) architecture, introduced by Hochreiter and Schmidhuber (1997), addressed vanishing gradients through gating mechanisms but did not fully solve the exploding gradient issue.
Gradient clipping as a regularization heuristic appeared in various forms throughout the 2000s, but the technique was formalized and analyzed rigorously by Pascanu, Mikolov, and Bengio (2013). Their paper provided both theoretical analysis and practical algorithms, establishing gradient norm clipping as a standard tool in deep learning. The approach has since been adopted in nearly every major training framework and is a default component of training pipelines for large language models, vision transformers, and other deep architectures.