# Clipping

> Source: https://aiwiki.ai/wiki/clipping
> Updated: 2026-04-26
> Categories: Deep Learning, Machine Learning, Training & Optimization
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**Clipping** is a family of techniques in [machine learning](/wiki/machine_learning) that constrain numerical values to lie within a specified range or below a specified magnitude. The most common application is **gradient clipping**, which limits the size of gradients during [backpropagation](/wiki/backpropagation) to prevent the [exploding gradient problem](/wiki/exploding_gradient_problem). Clipping is also applied to weights, activations, and policy ratios in various contexts across [deep learning](/wiki/deep_learning) and [reinforcement learning](/wiki/reinforcement_learning_rl).

## Explain like I'm 5 (ELI5)

Imagine you are steering a toy car with a remote control. If you push the joystick all the way to one side, the car spins out of control and crashes. Clipping is like putting bumpers on the joystick so it can only move a little bit at a time. The car still turns, but it never turns so fast that it crashes. In machine learning, the "joystick" is the gradient (the signal that tells the model how to change), and clipping keeps it from getting so large that training goes haywire.

## Background and motivation

Neural networks learn by computing gradients of a [loss function](/wiki/loss_function) with respect to each parameter and then updating those parameters using an optimizer such as [stochastic gradient descent](/wiki/stochastic_gradient_descent_sgd) (SGD) or [Adam](/wiki/adam_optimizer). During [backpropagation](/wiki/backpropagation), gradients are propagated backward through layers via the chain rule. In deep networks, and especially in [recurrent neural networks](/wiki/recurrent_neural_network) (RNNs), repeated multiplication of Jacobian matrices can cause gradient magnitudes to grow exponentially with the number of layers or time steps. This is known as the exploding gradient problem.
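
As a rough illustration of this exponential growth, the sketch below uses a toy scalar recurrence h_t = w * h_{t-1}, where each time step contributes one factor of w to the gradient. The function name and the chosen values of w are assumptions made for the example, not part of any library.

```python
def gradient_through_time(w: float, steps: int) -> float:
    """Gradient of h_T w.r.t. h_0 for the toy recurrence h_t = w * h_{t-1}.

    Each step contributes a factor of w (a 1x1 "Jacobian"), so the
    product over `steps` time steps equals w ** steps.
    """
    grad = 1.0
    for _ in range(steps):
        grad *= w  # chain rule: multiply by the per-step Jacobian
    return grad


# w < 1 shrinks the gradient, w > 1 blows it up, w == 1 keeps it constant.
for w in (0.9, 1.0, 1.1):
    print(w, gradient_through_time(w, steps=100))
```

Even a per-step factor only slightly above 1 produces a gradient that is several orders of magnitude larger after 100 steps, which is the regime that clipping is meant to guard against.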

When gradients explode, parameter updates become extremely large, causing the loss to diverge, weights to overflow to NaN, or training to oscillate wildly. Gradient clipping provides a straightforward remedy: if the gradient exceeds a threshold, it is scaled down (or capped) before the optimizer step.

The technique was formalized by Pascanu, Mikolov, and Bengio in their 2013 paper "On the difficulty of training recurrent neural networks," which analyzed the exploding and vanishing gradient problems from analytical, geometric, and dynamical systems perspectives and proposed gradient norm clipping as a practical solution.

## Gradient clipping

Gradient clipping is performed after computing gradients (via `loss.backward()` or equivalent) but before the optimizer updates parameters (via `optimizer.step()`). There are two main strategies: clipping by value and clipping by norm.

### Clipping by value

Clipping by value constrains each individual gradient component independently to lie within a fixed interval [-λ, λ], where λ is a chosen threshold.

For each gradient element g_i:

```
g_i = max(-λ, min(λ, g_i))
```

This approach is simple but has a significant drawback: it can change the direction of the gradient vector. If some components are clipped and others are not, the resulting vector may point in a different direction from the original gradient. This can slow convergence or cause erratic training dynamics.
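
The direction change can be seen in a small sketch. The helper names `clip_by_value` and `cosine_similarity` are hypothetical and written in plain Python for illustration; they are not framework functions.

```python
import math


def clip_by_value(grad, lam):
    """Clamp every component of `grad` to [-lam, lam] independently."""
    return [max(-lam, min(lam, g)) for g in grad]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


grad = [10.0, 0.5]                       # dominated by the first component
clipped = clip_by_value(grad, lam=1.0)   # only the first component is clipped
print(clipped)                           # [1.0, 0.5]
print(cosine_similarity(grad, clipped))  # below 1.0: the direction has changed
```

Only the first component is clipped, so the second component becomes relatively more important and the clipped vector no longer points along the original gradient.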

### Clipping by norm

Clipping by norm treats the entire gradient as a single vector and rescales it so that its norm does not exceed a threshold, while preserving the gradient's direction. This is the more commonly used strategy in practice.

Given a gradient vector **g** and a threshold τ, the L2 (Euclidean) norm is computed as:

```
||g||_2 = sqrt(sum(g_i^2))
```

If ||**g**||_2 > τ, the gradient is rescaled:

```
g = (τ / ||g||_2) * g
```

If ||**g**||_2 ≤ τ, the gradient is left unchanged. Because the rescaling is uniform across all components, the direction of the gradient is preserved. Only the magnitude is reduced.
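
The rule above can be sketched in a few lines of plain Python. The function name `clip_by_norm` and the list-based gradient representation are assumptions made for this example.

```python
import math


def clip_by_norm(grad, tau):
    """Rescale `grad` so its L2 norm is at most `tau`; direction is unchanged."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= tau:
        return list(grad)  # already within the threshold
    scale = tau / norm     # uniform factor applied to every component
    return [scale * g for g in grad]


print(clip_by_norm([3.0, 4.0], tau=1.0))  # norm 5 -> roughly [0.6, 0.8]
print(clip_by_norm([0.3, 0.4], tau=1.0))  # norm 0.5 -> returned unchanged
```

In the first call the ratio between the two components stays at 3:4, which is what preserving the direction means in practice.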

### Global norm clipping vs. per-parameter clipping

There are two ways to apply norm-based clipping in a multi-parameter model:

| Strategy | Description | Direction preserved? | Speed |
|---|---|---|---|
| Global norm clipping | Concatenate all parameter gradients into one vector, compute its norm, and scale all gradients by the same factor if the global norm exceeds the threshold. | Yes (globally) | Slower (requires all gradients before clipping) |
| Per-parameter clipping | Compute the norm of each parameter's gradient independently and clip each one separately. | Yes (per tensor) | Faster (can clip as each gradient is ready) |

Global norm clipping is generally preferred because it preserves the relative scale of gradients across different parameters, maintaining the overall descent direction. This is the approach recommended by Pascanu et al. (2013) and used in most large-scale training pipelines, including [transformer](/wiki/transformer) models.
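
A minimal sketch of the difference, assuming two hypothetical parameter tensors flattened to Python lists. The helper names are illustrative and do not correspond to any framework API.

```python
import math


def l2_norm(vec):
    return math.sqrt(sum(v * v for v in vec))


def clip_global(grads, max_norm):
    """Scale every tensor by one shared factor based on the combined norm."""
    total = math.sqrt(sum(l2_norm(g) ** 2 for g in grads))
    scale = min(1.0, max_norm / total) if total > 0 else 1.0
    return [[scale * v for v in g] for g in grads]


def clip_per_parameter(grads, max_norm):
    """Clip each tensor's norm on its own, ignoring the others."""
    out = []
    for g in grads:
        n = l2_norm(g)
        scale = min(1.0, max_norm / n) if n > 0 else 1.0
        out.append([scale * v for v in g])
    return out


grads = [[30.0, 40.0], [0.3, 0.4]]  # norms 50 and 0.5, a ratio of 100:1
g_glob = clip_global(grads, max_norm=1.0)
g_per = clip_per_parameter(grads, max_norm=1.0)
print(l2_norm(g_glob[0]) / l2_norm(g_glob[1]))  # ratio stays near 100
print(l2_norm(g_per[0]) / l2_norm(g_per[1]))    # ratio collapses to about 2
```

With per-parameter clipping the large tensor is shrunk while the small one is untouched, so their relative scale changes; with global clipping both are shrunk by the same factor.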

## Clipping in reinforcement learning

### PPO clipped objective

In [reinforcement learning](/wiki/reinforcement_learning_rl), clipping appears prominently in Proximal Policy Optimization (PPO), introduced by Schulman et al. (2017). PPO uses a clipped surrogate objective to prevent the policy from changing too much in a single update, which improves training stability.

The PPO-Clip objective is:

```
L(θ) = E[ min( r(θ) * A, clip(r(θ), 1 - ε, 1 + ε) * A ) ]
```

where:
- r(θ) = π_θ(a|s) / π_θ_old(a|s) is the probability ratio between the new and old policies
- A is the advantage estimate
- ε is a small hyperparameter (commonly 0.2)

The clip function constrains r(θ) to the interval [1 - ε, 1 + ε]. When the advantage is positive (the action was better than expected), the objective is capped at (1 + ε) * A, so the policy gains nothing from raising the probability of that action beyond the clipping boundary. When the advantage is negative, the objective stops improving once r(θ) falls below 1 - ε, where it is held at (1 - ε) * A, so there is no incentive to reduce the action's probability further. Because of the outer min, changes that make the objective worse are never clipped. This mechanism replaces the hard KL divergence constraint used in Trust Region Policy Optimization (TRPO) with a simpler first-order optimization approach.
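
The per-sample term inside the expectation can be sketched as follows. The function name `ppo_clip_term` is hypothetical, and a real implementation would operate on batches of ratios and advantages rather than single floats.

```python
def ppo_clip_term(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Per-sample PPO-Clip term: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# Positive advantage: raising the ratio past 1 + eps earns nothing extra.
print(ppo_clip_term(1.1, advantage=2.0))
print(ppo_clip_term(1.5, advantage=2.0))   # same value as at ratio 1.2

# Negative advantage: lowering the ratio past 1 - eps earns nothing extra.
print(ppo_clip_term(0.5, advantage=-2.0))  # same value as at ratio 0.8

# Moving the wrong way is not clipped, so the objective keeps getting worse.
print(ppo_clip_term(1.5, advantage=-2.0))
```

The last call shows the asymmetry introduced by the outer `min`: the clip only removes the reward for moving further in the favorable direction.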

## Weight clipping

Weight clipping constrains the values of model parameters (not gradients) to a fixed range after each optimization step. The most prominent use case is in the original Wasserstein GAN (WGAN), proposed by Arjovsky, Chintala, and Bottou (2017).

WGAN requires the discriminator (called the "critic") to be a 1-Lipschitz function. The original paper enforced this by clipping all critic weights to a compact interval [-c, c] (typically c = 0.01) after each parameter update:

```
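# c is the clipping bound (0.01 in the original paper); critic is the critic network
with torch.no_grad():
    for p in critic.parameters():
        p.clamp_(-c, c)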
```

While simple, weight clipping has notable drawbacks. It tends to push weights toward the boundary values of the clipping range, underutilizing the network's capacity. It can also cause the critic to learn overly simple functions. These limitations led to the development of WGAN-GP (Gulrajani et al., 2017), which replaces weight clipping with a gradient penalty term that penalizes the critic when the gradient norm deviates from 1.

## Activation clipping

Activation clipping bounds the output of activation functions to a fixed range. The most well-known example is **ReLU6**, defined as:

```
ReLU6(x) = min(max(0, x), 6)
```

ReLU6 was introduced by Krizhevsky (2010) and became widely used in mobile and edge architectures such as [MobileNet](/wiki/mobilenet). The upper bound of 6 prevents activations from growing unboundedly, which is particularly useful for:

- **Quantization:** Bounded activations map cleanly to fixed-point representations (e.g., 8-bit integers), reducing computational cost on mobile hardware.
- **Numerical stability:** Clamping prevents overflow in low-precision arithmetic.

More generally, any use of `torch.clamp()` or `tf.clip_by_value()` on intermediate activations constitutes activation clipping.
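
To make the link between a bounded activation and quantization concrete, the sketch below applies ReLU6 and then maps the result to an 8-bit code. The fixed scale of 6 / 255 and the function name `quantize_uint8` are illustrative assumptions, not a scheme prescribed by any particular framework.

```python
def relu6(x: float) -> float:
    """ReLU6(x) = min(max(0, x), 6)."""
    return min(max(0.0, x), 6.0)


def quantize_uint8(a: float) -> int:
    """Map an activation in [0, 6] to an integer code in [0, 255]."""
    scale = 6.0 / 255.0  # width of one quantization step (assumed for this sketch)
    return round(a / scale)


# Inputs outside [0, 6] are clamped first, so every code fits in 8 bits.
for x in (-3.0, 2.5, 6.0, 100.0):
    a = relu6(x)
    print(x, a, quantize_uint8(a))
```

Because the activation range is known in advance, the quantization scale can be fixed ahead of time instead of being estimated from data.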

## Implementation

### PyTorch

PyTorch provides two built-in functions for gradient clipping in `torch.nn.utils`:

| Function | Type | Description |
|---|---|---|
| `clip_grad_norm_(parameters, max_norm, norm_type=2.0)` | Norm clipping | Computes the total norm of all parameter gradients (concatenated as a single vector) and scales them if the norm exceeds `max_norm`. Returns the total norm as computed before clipping. |
| `clip_grad_value_(parameters, clip_value)` | Value clipping | Clamps each gradient element to the range [-clip_value, clip_value]. |

Typical usage:

```python
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```

### TensorFlow

TensorFlow offers several clipping functions:

| Function | Type | Description |
|---|---|---|
| `tf.clip_by_value(t, clip_value_min, clip_value_max)` | Value clipping | Clips tensor values element-wise to a range. |
| `tf.clip_by_norm(t, clip_norm)` | Per-tensor norm clipping | Clips a single tensor so its L2 norm does not exceed `clip_norm`. |
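| `tf.clip_by_global_norm(t_list, clip_norm)` | Global norm clipping | Scales every tensor in the list by `clip_norm / max(global_norm, clip_norm)`, where `global_norm` is the square root of the sum of the squared L2 norms of all tensors. Returns the clipped list and the global norm. |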

TensorFlow/Keras optimizers also accept `clipnorm` (per-parameter) and `global_clipnorm` (global) arguments directly:

```python
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, global_clipnorm=1.0)
```

## Gradient clipping and mixed-precision training

Mixed-precision training uses lower-precision floating-point formats (such as float16) to speed up computation and reduce memory usage. Because float16 has a limited dynamic range, small gradients can underflow to zero. To prevent this, frameworks like PyTorch use a **GradScaler** that multiplies the loss by a scale factor before the backward pass, inflating gradient magnitudes into the representable float16 range.

When combining gradient clipping with mixed-precision training, the correct order of operations is:

1. Compute the scaled loss: `scaler.scale(loss).backward()`
2. **Unscale the gradients:** `scaler.unscale_(optimizer)` (restores original magnitudes)
3. Clip the unscaled gradients: `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)`
4. Step the optimizer: `scaler.step(optimizer)`
5. Update the scaler: `scaler.update()`

Unscaling before clipping is necessary because the clipping threshold is specified for the true (unscaled) gradient magnitudes. If the gradients were still multiplied by the loss scale, their norm would be inflated by that factor, so the threshold would trigger far too often and the update that remains after unscaling would be much smaller than intended.
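
The effect of the ordering can be checked with plain arithmetic. The sketch below does not use PyTorch; the loss scale of 1024 and the helper names are assumptions chosen for illustration.

```python
import math


def l2_norm(vec):
    return math.sqrt(sum(v * v for v in vec))


def clip_by_norm(grad, max_norm):
    """Rescale `grad` so that its L2 norm does not exceed `max_norm`."""
    norm = l2_norm(grad)
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [scale * g for g in grad]


loss_scale = 1024.0     # hypothetical loss scale factor
true_grad = [0.3, 0.4]  # true norm 0.5, below the threshold
scaled_grad = [loss_scale * g for g in true_grad]
max_norm = 1.0

# Correct order: unscale, then clip. The gradient is left alone.
right = clip_by_norm([g / loss_scale for g in scaled_grad], max_norm)

# Wrong order: clip the scaled gradient, then unscale.
wrong = [g / loss_scale for g in clip_by_norm(scaled_grad, max_norm)]

print(l2_norm(right))  # about 0.5
print(l2_norm(wrong))  # about 1 / 1024, far smaller than intended
```

In the wrong order a gradient that should not have been clipped at all ends up shrunk by roughly the loss scale.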

## Choosing clipping thresholds

Selecting the right clipping threshold requires balancing stability against learning speed. If the threshold is too low, gradients are clipped too aggressively, slowing convergence and potentially preventing the model from escaping local minima. If it is too high, clipping never activates and provides no protection against gradient explosions.

Common practices include:

- **Standard defaults:** Many practitioners start with a max norm of 1.0 for global norm clipping. Values of 0.5, 1.0, and 5.0 are all common in the literature.
- **Monitoring gradient norms:** Logging the gradient norm at each training step (which `clip_grad_norm_` conveniently returns) reveals the typical range of gradient magnitudes. The threshold can then be set at a percentile (e.g., 90th or 95th) of observed norms, so that clipping only activates on unusually large gradients.
- **Architecture-specific tuning:** Transformer models for language modeling commonly use max norm values between 0.25 and 1.0. RNN and LSTM models are often trained with max norm values between 1.0 and 5.0. Reinforcement learning algorithms may use different thresholds depending on reward scale.
- **Adaptive methods:** Some training pipelines adjust the clipping threshold over the course of training, starting with a lower threshold during early (unstable) phases and relaxing it later.
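
The percentile-based approach from the list above can be sketched with the standard library. The logged norms are made-up values and `threshold_from_norms` is a hypothetical helper; in practice the norms would come from the training loop.

```python
import statistics


def threshold_from_norms(norms, percentile=90):
    """Return the given percentile of logged gradient norms.

    `statistics.quantiles` with n=100 returns the 1st..99th percentile
    cut points, so index `percentile - 1` selects the requested one.
    """
    cuts = statistics.quantiles(norms, n=100, method="inclusive")
    return cuts[percentile - 1]


# Hypothetical norms logged over 20 steps, including two spikes.
logged = [0.8, 0.9, 1.1, 1.0, 0.7, 1.2, 0.9, 1.0, 1.1, 0.8,
          0.9, 1.0, 1.3, 0.9, 1.0, 25.0, 1.1, 0.8, 1.0, 40.0]
tau = threshold_from_norms(logged, percentile=90)
print(tau)                             # lies between the typical norms and the spikes
print(sum(n > tau for n in logged))    # number of steps that would be clipped
```

With this data only the two spikes exceed the threshold, so ordinary steps pass through unchanged.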

| Application | Typical threshold range | Clipping type |
|---|---|---|
| Transformer language models | 0.25 to 1.0 | Global norm |
| RNNs / LSTMs | 1.0 to 5.0 | Global norm |
| GANs (gradient penalty) | 1.0 to 10.0 | Global norm |
| PPO (policy ratio) | ε = 0.1 to 0.3 | Ratio clipping |
| WGAN (weight clipping) | c = 0.01 | Weight value |

## Comparison of clipping methods

| Method | What is clipped | Direction preserved? | Primary use case |
|---|---|---|---|
| Gradient clipping by value | Individual gradient elements | No | Preventing outlier gradient components |
| Gradient clipping by norm | Gradient vector magnitude | Yes | General training stabilization (most common) |
| Weight clipping | Model parameters | N/A | Enforcing Lipschitz constraints (WGAN) |
| Activation clipping | Layer outputs | N/A | Quantization-friendly architectures (MobileNet) |
| PPO ratio clipping | Policy probability ratios | N/A | Stable policy updates in RL |

## Historical context

The problem of exploding and vanishing gradients in recurrent networks was identified by Bengio, Simard, and Frasconi (1994), and independently by Hochreiter (1991). The [Long Short-Term Memory](/wiki/long_short-term_memory_lstm) (LSTM) architecture, introduced by Hochreiter and Schmidhuber (1997), addressed vanishing gradients through gating mechanisms but did not fully solve the exploding gradient issue.

Gradient clipping as a stabilization heuristic appeared in various forms throughout the 2000s, but the technique was formalized and analyzed rigorously by Pascanu, Mikolov, and Bengio (2013). Their paper provided both theoretical analysis and practical algorithms, establishing gradient norm clipping as a standard tool in deep learning. The approach has since been adopted in nearly every major training framework and is a default component of training pipelines for large language models, vision transformers, and other deep architectures.

## References

1. Pascanu, R., Mikolov, T., & Bengio, Y. (2013). "On the difficulty of training recurrent neural networks." *Proceedings of the 30th International Conference on Machine Learning (ICML)*, PMLR 28(3):1310-1318. [arXiv:1211.5063](https://arxiv.org/abs/1211.5063)
2. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). "Proximal Policy Optimization Algorithms." [arXiv:1707.06347](https://arxiv.org/abs/1707.06347)
3. Arjovsky, M., Chintala, S., & Bottou, L. (2017). "Wasserstein GAN." [arXiv:1701.07875](https://arxiv.org/abs/1701.07875)
4. Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., & Courville, A. (2017). "Improved Training of Wasserstein GANs." [arXiv:1704.00028](https://arxiv.org/abs/1704.00028)
5. Bengio, Y., Simard, P., & Frasconi, P. (1994). "Learning long-term dependencies with gradient descent is difficult." *IEEE Transactions on Neural Networks*, 5(2):157-166.
6. Hochreiter, S. & Schmidhuber, J. (1997). "Long Short-Term Memory." *Neural Computation*, 9(8):1735-1780.
7. Krizhevsky, A. (2010). "Convolutional Deep Belief Networks on CIFAR-10." *Unpublished manuscript*.
8. Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., & Adam, H. (2017). "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications." [arXiv:1704.04861](https://arxiv.org/abs/1704.04861)
9. Micikevicius, P., Narang, S., Alben, J., Diamos, G., Elsen, E., Garcia, D., Ginsburg, B., Houston, M., Kuchaiev, O., Venkatesh, G., & Wu, H. (2018). "Mixed Precision Training." *International Conference on Learning Representations (ICLR)*. [arXiv:1710.03740](https://arxiv.org/abs/1710.03740)
10. Zhang, J., He, T., Sra, S., & Jadbabaie, A. (2020). "Why Gradient Clipping Accelerates Training: A Theoretical Justification for Adaptivity." *International Conference on Learning Representations (ICLR)*. [arXiv:1905.11881](https://arxiv.org/abs/1905.11881)
