# Clipping

> Source: https://aiwiki.ai/wiki/clipping
> Updated: 2026-06-29
> Categories: Deep Learning, Machine Learning, Training & Optimization
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Clipping** is a family of techniques in [machine learning](/wiki/machine_learning) that constrain numerical values to lie within a specified range or below a specified magnitude. The most common application is **gradient clipping**, which limits the size of gradients during [backpropagation](/wiki/backpropagation) to prevent the [exploding gradient problem](/wiki/exploding_gradient_problem) by rescaling or capping gradients before the optimizer step. [1][2] Clipping is also applied to raw feature values (to cap outliers), and to weights, activations, and policy ratios in various contexts across [deep learning](/wiki/deep_learning) and [reinforcement learning](/wiki/reinforcement_learning_rl). Google's Machine Learning Glossary defines clipping in these two main senses: forcing gradient values within a designated range during training, and reducing or increasing feature values that fall outside a maximum or minimum threshold. [3]

## What is clipping in machine learning?

Clipping is any operation that bounds a numerical quantity to a fixed interval or maximum magnitude. The term is overloaded: depending on what is being clipped, it refers to distinct techniques. The two senses cataloged by the Google Machine Learning Glossary are gradient clipping (capping gradient magnitudes during training) and feature clipping, which the glossary describes as "A technique for handling outliers by ... [r]educing feature values that are greater than a maximum threshold down to that maximum threshold ... [or i]ncreasing feature values that are less than a minimum threshold up to that minimum threshold." [3] Beyond these, the same clamping operation appears as weight clipping in [generative adversarial networks](/wiki/generative_adversarial_network), activation clipping in efficient vision models, and ratio clipping in policy-gradient reinforcement learning.

| Sense of clipping | What is bounded | Where it is used |
|---|---|---|
| Gradient clipping | Gradient magnitudes during training | RNNs, [transformers](/wiki/transformer), most deep nets |
| Feature clipping | Raw input feature values | Data preprocessing, outlier handling |
| Weight clipping | Model parameters | WGAN critic (Lipschitz constraint) |
| Activation clipping | Layer outputs | MobileNet (ReLU6), quantization |
| Ratio clipping | Policy probability ratio | PPO in reinforcement learning |

## Explain like I'm 5 (ELI5)

Imagine you are steering a toy car with a remote control. If you push the joystick all the way to one side, the car spins out of control and crashes. Clipping is like putting bumpers on the joystick so it can only move a little bit at a time. The car still turns, but it never turns so fast that it crashes. In machine learning, the "joystick" is the gradient (the signal that tells the model how to change), and clipping keeps it from getting so large that training goes haywire.

## Why is gradient clipping needed?

Neural networks learn by computing gradients of a [loss function](/wiki/loss_function) with respect to each parameter and then updating those parameters using an optimizer such as [stochastic gradient descent](/wiki/stochastic_gradient_descent_sgd) (SGD) or [Adam](/wiki/adam_optimizer). During [backpropagation](/wiki/backpropagation), gradients are propagated backward through layers via the chain rule. In deep networks, and especially in [recurrent neural networks](/wiki/recurrent_neural_network) (RNNs), repeated multiplication of Jacobian matrices can cause gradient magnitudes to grow exponentially with the number of layers or time steps. This is known as the exploding gradient problem. [1]

When gradients explode, parameter updates become extremely large, causing the loss to diverge, weights to overflow to infinity or become NaN, or training to oscillate wildly. Gradient clipping provides a straightforward remedy: if the gradient exceeds a threshold, it is scaled down (or capped) before the optimizer step.

The technique was formalized by Pascanu, Mikolov, and Bengio in their 2013 paper "On the difficulty of training recurrent neural networks," which analyzed the exploding and vanishing gradient problems from analytical, geometric, and dynamical systems perspectives and proposed gradient norm clipping as a practical solution. [1] As the authors state in the abstract, "We propose a gradient norm clipping strategy to deal with exploding gradients and a soft constraint for the vanishing gradients problem." [1] As of 2026 the paper has accumulated more than 3,400 citations, reflecting how central the technique has become to deep learning practice. A key insight of the paper is that the exploding-gradient problem is fundamentally geometric: gradients point in the right direction but become unreasonably long, which is exactly why rescaling the magnitude while preserving direction works so well.

## What is gradient clipping?

Gradient clipping is performed after computing gradients (via `loss.backward()` or equivalent) but before the optimizer updates parameters (via `optimizer.step()`). There are two main strategies: clipping by value and clipping by norm.

### Clipping by value

Clipping by value constrains each individual gradient component independently to lie within a fixed interval [-&lambda;, &lambda;], where &lambda; is a chosen threshold.

For each gradient element g_i:

```
g_i = max(-λ, min(λ, g_i))
```

This approach is simple but has a significant drawback: it can change the direction of the gradient vector. If some components are clipped and others are not, the resulting vector may point in a different direction from the original gradient. This can slow convergence or cause erratic training dynamics.
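
A minimal sketch of value clipping in plain Python may make the direction change concrete. The gradient `[10.0, 0.5]`, the threshold `lam=1.0`, and the helper names are illustrative assumptions, not part of any framework API.

```python
import math

def clip_by_value(grad, lam):
    """Clamp each component of grad to [-lam, lam] independently."""
    return [max(-lam, min(lam, g)) for g in grad]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

grad = [10.0, 0.5]                      # hypothetical gradient with one large component
clipped = clip_by_value(grad, lam=1.0)  # only the first component is clamped

print(clipped)                           # [1.0, 0.5]
print(cosine_similarity(grad, clipped))  # below 1.0, so the direction has changed
```

Only the first component is clamped, so the clipped vector leans much more toward the second axis than the original gradient did.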

### Clipping by norm

Clipping by norm treats the entire gradient as a single vector and rescales it so that its norm does not exceed a threshold, while preserving the gradient's direction. This is the more commonly used strategy in practice. [1]

Given a gradient vector **g** and a threshold &tau;, the L2 (Euclidean) norm is computed as:

```
||g||_2 = sqrt(sum(g_i^2))
```

If ||**g**||_2 > &tau;, the gradient is rescaled:

```
g = (τ / ||g||_2) * g
```

If ||**g**||_2 &le; &tau;, the gradient is left unchanged. Because the rescaling is uniform across all components, the direction of the gradient is preserved. Only the magnitude is reduced.
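
The rescaling rule above can be sketched in a few lines of plain Python. The function name `clip_by_norm` and the example vector are assumptions chosen for illustration; real frameworks operate on tensors rather than lists.

```python
import math

def clip_by_norm(grad, tau):
    """Rescale grad so its L2 norm is at most tau; the direction is unchanged."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= tau:
        return list(grad)          # within the threshold: leave the gradient as is
    scale = tau / norm             # one factor shared by every component
    return [scale * g for g in grad]

grad = [3.0, 4.0]                  # L2 norm is 5.0
clipped = clip_by_norm(grad, tau=1.0)
print(clipped)                     # approximately [0.6, 0.8], norm 1.0

small = clip_by_norm([0.3, 0.4], tau=1.0)
print(small)                       # unchanged, because the norm is already below tau
```

The ratio between components (3 to 4) is the same before and after clipping, which is what preserving the direction means here.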

### How does clipping by norm differ from clipping by value?

The core difference is whether the gradient's direction is preserved. Clipping by value clamps each component independently, so it can rotate the gradient vector away from the true descent direction. Clipping by norm rescales the whole vector uniformly, so the direction is unchanged and only the overall step size is reduced. Pascanu et al. (2013) recommend norm-based clipping precisely because preserving direction matches their geometric finding that exploding gradients are correctly oriented but too long. [1]

### Global norm clipping vs. per-parameter clipping

There are two ways to apply norm-based clipping in a multi-parameter model:

| Strategy | Description | Direction preserved? | Speed |
|---|---|---|---|
| Global norm clipping | Concatenate all parameter gradients into one vector, compute its norm, and scale all gradients by the same factor if the global norm exceeds the threshold. | Yes (globally) | Slower (requires all gradients before clipping) |
| Per-parameter clipping | Compute the norm of each parameter's gradient independently and clip each one separately. | Yes (per tensor) | Faster (can clip as each gradient is ready) |

Global norm clipping is generally preferred because it preserves the relative scale of gradients across different parameters, maintaining the overall descent direction. This is the approach recommended by Pascanu et al. (2013) and used in most large-scale training pipelines, including [transformer](/wiki/transformer) models. [1]
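
To illustrate the difference, the following sketch applies both strategies to two hypothetical parameter gradients with very different norms. The helper names and the example values are assumptions for illustration only.

```python
import math

def l2_norm(vec):
    return math.sqrt(sum(v * v for v in vec))

def clip_global_norm(grads, tau):
    """Scale every tensor by one shared factor based on the combined norm."""
    total = math.sqrt(sum(l2_norm(g) ** 2 for g in grads))
    scale = min(1.0, tau / total) if total > 0 else 1.0
    return [[scale * v for v in g] for g in grads]

def clip_per_tensor(grads, tau):
    """Clip each tensor's norm on its own, ignoring the others."""
    out = []
    for g in grads:
        n = l2_norm(g)
        scale = min(1.0, tau / n) if n > 0 else 1.0
        out.append([scale * v for v in g])
    return out

grads = [[6.0, 8.0], [0.3, 0.4]]   # hypothetical gradients with norms 10.0 and 0.5
global_clipped = clip_global_norm(grads, tau=1.0)
per_tensor_clipped = clip_per_tensor(grads, tau=1.0)

# Ratio between the two tensors' norms (20 before clipping).
ratio_global = l2_norm(global_clipped[0]) / l2_norm(global_clipped[1])
ratio_per_tensor = l2_norm(per_tensor_clipped[0]) / l2_norm(per_tensor_clipped[1])
print(ratio_global)       # about 20: relative scale is preserved
print(ratio_per_tensor)   # about 2: the large tensor was shrunk, the small one was not
```

Per-tensor clipping changes how much each parameter contributes to the update, which is why the combined update no longer points along the original gradient.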

## What is feature clipping?

Feature clipping (also called value capping or winsorizing-style capping) is a data-preprocessing technique that limits raw feature values to a chosen range so that outliers do not dominate training. The Google Machine Learning Glossary frames it with a concrete example: "suppose that <0.5% of values for a particular feature fall outside the range 40-60. In this case, you could ... [c]lip all values over 60 (the maximum threshold) to be exactly 60 ... [and c]lip all values under 40 (the minimum threshold) to be exactly 40." [3]

Unlike gradient clipping, which acts on the optimization signal, feature clipping acts on the input data before it ever enters the model. It is commonly applied during feature engineering by setting the minimum and maximum thresholds at fixed values or at percentiles of the observed distribution (for example, the 1st and 99th percentiles). Capping outliers in this way keeps a single extreme record from distorting normalization statistics (mean, variance, min, max) and from producing exploding activations downstream. The same operation underlies `numpy.clip`, `pandas.Series.clip`, and `torch.clamp` when applied to feature tensors rather than gradients.
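
As a rough sketch, both the fixed-threshold and the percentile-based variants can be written in plain Python. The sample data, the helper names, and the nearest-rank percentile convention are assumptions; libraries such as NumPy offer several percentile interpolation schemes that can give slightly different bounds.

```python
def clip_feature(values, lower, upper):
    """Clamp each value to the interval [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]

def percentile_bounds(values, low_pct=1.0, high_pct=99.0):
    """Nearest-rank percentiles of the observed values (one of several conventions)."""
    ordered = sorted(values)
    last = len(ordered) - 1
    lo_idx = round(last * low_pct / 100.0)
    hi_idx = round(last * high_pct / 100.0)
    return ordered[lo_idx], ordered[hi_idx]

# Fixed thresholds, as in the 40-60 example above.
raw = [41, 45, 47, 50, 52, 55, 58, 59, 12, 97]
fixed = clip_feature(raw, 40, 60)
print(fixed)              # [41, 45, 47, 50, 52, 55, 58, 59, 40, 60]

# Percentile thresholds on hypothetical data with one extreme record.
values = list(range(1, 101)) + [10_000]
bounds = percentile_bounds(values)
capped = clip_feature(values, *bounds)
print(bounds)             # (2, 100)
print(max(capped))        # 100
```

In practice the bounds would be computed on the training split only and then reused unchanged for validation and test data.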

## How is clipping used in reinforcement learning?

### PPO clipped objective

In [reinforcement learning](/wiki/reinforcement_learning_rl), clipping appears prominently in Proximal Policy Optimization (PPO), introduced by Schulman et al. (2017). [2] PPO uses a clipped surrogate objective to prevent the policy from changing too much in a single update, which improves training stability.

The PPO-Clip objective is:

```
L(θ) = E[ min( r(θ) * A, clip(r(θ), 1 - ε, 1 + ε) * A ) ]
```

where:
- r(&theta;) = &pi;_&theta;(a|s) / &pi;_&theta;_old(a|s) is the probability ratio between the new and old policies
- A is the advantage estimate
- &epsilon; is a small hyperparameter (commonly 0.2)

The clip function constrains r(&theta;) to the interval [1 - &epsilon;, 1 + &epsilon;]. When the advantage is positive (the action was better than expected), the objective is capped at (1 + &epsilon;) * A: once the ratio exceeds 1 + &epsilon;, the term no longer depends on &theta; and its gradient is zero, so there is no further incentive to raise that action's probability. When the advantage is negative, the objective is capped at (1 - &epsilon;) * A once the ratio falls below 1 - &epsilon;, which removes the incentive to push the action's probability down any further. Because of the outer min, changes that make the objective worse are never clipped. This mechanism replaces the hard KL divergence constraint used in Trust Region Policy Optimization (TRPO) with a simpler first-order optimization approach. [2]
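
The per-sample term inside the expectation can be sketched as follows. The function name `ppo_clip_term` and the example ratios and advantages are illustrative assumptions; a real implementation would average this term over a batch and differentiate it with an autograd framework.

```python
def ppo_clip_term(ratio, advantage, epsilon=0.2):
    """Per-sample PPO-Clip term: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# Positive advantage: the term stops growing once the ratio passes 1 + epsilon.
print(ppo_clip_term(1.5, advantage=2.0))    # about 2.4, not 3.0

# Negative advantage: the term stops improving once the ratio drops below 1 - epsilon.
print(ppo_clip_term(0.5, advantage=-2.0))   # about -1.6, not -1.0

# A move that makes the objective worse is not clipped.
print(ppo_clip_term(1.5, advantage=-2.0))   # -3.0

# Inside the interval the term is just ratio * advantage.
print(ppo_clip_term(1.1, advantage=2.0))    # about 2.2
```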

## What is weight clipping?

Weight clipping constrains the values of model parameters (not gradients) to a fixed range after each optimization step. The most prominent use case is in the original Wasserstein GAN (WGAN), proposed by Arjovsky, Chintala, and Bottou (2017). [3]

WGAN requires the discriminator (called the "critic") to be a 1-Lipschitz function. The original paper enforced this by clipping all critic weights to a compact interval [-c, c] (typically c = 0.01) after each parameter update: [3]

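```python
# c is the clipping bound (typically 0.01, as noted above)
with torch.no_grad():
    for p in critic.parameters():
        p.clamp_(-c, c)  # in-place clamp of every critic weight to [-c, c]
```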

While simple, weight clipping has notable drawbacks. It tends to push weights toward the boundary values of the clipping range, underutilizing the network's capacity. It can also cause the critic to learn overly simple functions. These limitations led to the development of WGAN-GP (Gulrajani et al., 2017), which replaces weight clipping with a gradient penalty term that penalizes the critic when the gradient norm deviates from 1. [4]

## What is activation clipping?

Activation clipping bounds the output of activation functions to a fixed range. The most well-known example is **ReLU6**, defined as:

```
ReLU6(x) = min(max(0, x), 6)
```

ReLU6 was introduced by Krizhevsky (2010) and became widely used in mobile and edge architectures such as [MobileNet](/wiki/mobilenet). [7][8] The upper bound of 6 prevents activations from growing unboundedly, which is particularly useful for:

- **Quantization:** Bounded activations map cleanly to fixed-point representations (e.g., 8-bit integers), reducing computational cost on mobile hardware.
- **Numerical stability:** Clamping prevents overflow in low-precision arithmetic.

More generally, any use of `torch.clamp()` or `tf.clip_by_value()` on intermediate activations constitutes activation clipping.
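
A small sketch shows why a fixed upper bound is convenient for quantization. The `quantize_uint8` mapping below is a hypothetical linear scheme chosen for illustration, not the quantization procedure of MobileNet or of any particular toolkit.

```python
def relu6(x):
    """ReLU capped at 6: min(max(0, x), 6)."""
    return min(max(0.0, x), 6.0)

def quantize_uint8(activation, upper=6.0):
    """Map an activation in [0, upper] onto the integer grid 0..255 (illustrative scheme)."""
    return round(activation / upper * 255)

for x in [-3.0, 2.0, 6.0, 40.0]:
    a = relu6(x)
    # Every output lies in [0, 6], so one fixed scale covers the whole range.
    print(x, a, quantize_uint8(a))
```

Because the output range is known in advance, the scale factor does not have to be estimated from data, and a very large input such as 40.0 cannot stretch the range and waste integer levels.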

## How is gradient clipping implemented?

### PyTorch

PyTorch provides two built-in functions for gradient clipping in `torch.nn.utils`:

| Function | Type | Description |
|---|---|---|
| `clip_grad_norm_(parameters, max_norm, norm_type=2.0)` | Norm clipping | Computes the total norm of all parameter gradients (concatenated as a single vector) and scales them if the norm exceeds `max_norm`. Returns the total norm. |
| `clip_grad_value_(parameters, clip_value)` | Value clipping | Clamps each gradient element to the range [-clip_value, clip_value]. |

Internally, `clip_grad_norm_` scales gradients by `min(max_norm / (total_norm + 1e-6), 1)`, so the scale factor is clamped at 1.0 and gradients are only ever reduced, never amplified. [5] The `1e-6` term in the denominator guards against division by zero. Typical usage:

```python
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```

### TensorFlow

TensorFlow offers several clipping functions:

| Function | Type | Description |
|---|---|---|
| `tf.clip_by_value(t, clip_value_min, clip_value_max)` | Value clipping | Clips tensor values element-wise to a range. |
| `tf.clip_by_norm(t, clip_norm)` | Per-tensor norm clipping | Clips a single tensor so its L2 norm does not exceed `clip_norm`. |
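| `tf.clip_by_global_norm(t_list, clip_norm)` | Global norm clipping | Scales every tensor in the list by `clip_norm / max(global_norm, clip_norm)`, where `global_norm` is the square root of the sum of the tensors' squared L2 norms. Returns the clipped list and the global norm. |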

TensorFlow/Keras optimizers also accept `clipvalue` (element-wise), `clipnorm` (per-parameter), and `global_clipnorm` (global) arguments directly:

```python
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, global_clipnorm=1.0)
```

## How does clipping interact with mixed-precision training?

Mixed-precision training uses lower-precision floating-point formats (such as float16) to speed up computation and reduce memory usage. [9] Because float16 has a limited dynamic range, small gradients can underflow to zero. To prevent this, frameworks like PyTorch use a **GradScaler** that multiplies the loss by a scale factor before the backward pass, inflating gradient magnitudes into the representable float16 range. [9]

When combining gradient clipping with mixed-precision training, the correct order of operations is:

1. Compute the scaled loss: `scaler.scale(loss).backward()`
2. **Unscale the gradients:** `scaler.unscale_(optimizer)` (restores original magnitudes)
3. Clip the unscaled gradients: `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)`
4. Step the optimizer: `scaler.step(optimizer)`
5. Update the scaler: `scaler.update()`

Unscaling before clipping is necessary because the clipping threshold is specified for the true (unscaled) gradient magnitudes. If gradients were still scaled, the threshold would not correspond to the intended magnitude, rendering the clipping ineffective or overly aggressive.
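
The effect of the ordering can be checked with plain arithmetic. The sketch below does not use PyTorch; it simulates loss scaling on a list of floats, with a hypothetical scale factor of 1024 and a `clip_by_norm` helper that mirrors the `min(max_norm / (total_norm + 1e-6), 1)` factor described earlier.

```python
import math

def l2_norm(vec):
    return math.sqrt(sum(v * v for v in vec))

def clip_by_norm(grad, max_norm):
    """Scale grad by min(max_norm / (norm + 1e-6), 1)."""
    scale = min(1.0, max_norm / (l2_norm(grad) + 1e-6))
    return [scale * g for g in grad]

loss_scale = 1024.0                 # hypothetical GradScaler scale factor
true_grad = [0.3, 0.4]              # true gradient, norm 0.5 (below the threshold)
scaled_grad = [loss_scale * g for g in true_grad]   # what the scaled backward pass yields
max_norm = 1.0

# Correct order: unscale first, then clip. The norm 0.5 is under 1.0, so nothing changes.
correct = clip_by_norm([g / loss_scale for g in scaled_grad], max_norm)

# Wrong order: clip the still-scaled gradient (norm 512), then unscale.
wrong = [g / loss_scale for g in clip_by_norm(scaled_grad, max_norm)]

print(l2_norm(correct))   # about 0.5
print(l2_norm(wrong))     # about 0.001, roughly 512 times too small
```

In the wrong order the threshold is compared against inflated magnitudes, so a gradient that needed no clipping is shrunk by a factor that depends on the current loss scale.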

## How do you choose a clipping threshold?

Selecting the right clipping threshold requires balancing stability against learning speed. If the threshold is too low, gradients are clipped too aggressively, slowing convergence and potentially preventing the model from escaping local minima. If it is too high, clipping never activates and provides no protection against gradient explosions.

Common practices include:

- **Standard defaults:** Many practitioners start with a max norm of 1.0 for global norm clipping. Values of 0.5, 1.0, and 5.0 are all common in the literature.
- **Monitoring gradient norms:** Logging the gradient norm at each training step (which `clip_grad_norm_` conveniently returns) reveals the typical range of gradient magnitudes. The threshold can then be set at a percentile (e.g., 90th or 95th) of observed norms, so that clipping only activates on unusually large gradients.
- **Architecture-specific tuning:** Transformer models for language modeling commonly use max norm values between 0.25 and 1.0. RNN-based models are commonly clipped at a max norm between 1.0 and 5.0. Reinforcement learning algorithms may use different thresholds depending on reward scale.
- **Adaptive methods:** Some training pipelines adjust the clipping threshold over the course of training, starting with a lower threshold during early (unstable) phases and relaxing it later. Zhang et al. (2020) gave a theoretical justification for why clipping accelerates training, showing it behaves like an adaptive step size that can converge faster than fixed-step gradient descent. [10]
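
The percentile heuristic from the list above can be sketched as follows. The logged norms are invented for illustration, and `threshold_from_norms` is a hypothetical helper using a nearest-rank percentile; in a PyTorch loop the norms would typically come from the value returned by `clip_grad_norm_`.

```python
def threshold_from_norms(logged_norms, percentile=95.0):
    """Pick a clipping threshold at a percentile of observed gradient norms (nearest rank)."""
    ordered = sorted(logged_norms)
    idx = round((len(ordered) - 1) * percentile / 100.0)
    return ordered[idx]

# Hypothetical norms logged over 21 steps: mostly near 1.0, with one spike.
norms = [0.8, 0.9, 1.0, 1.1] * 5 + [55.0]

threshold = threshold_from_norms(norms, percentile=95.0)
clipped_steps = sum(1 for n in norms if n > threshold)

print(threshold)       # 1.1
print(clipped_steps)   # 1: only the spike would be clipped
```

With this choice, ordinary steps pass through untouched and only the outlier step is rescaled, which matches the goal of clipping unusually large gradients without slowing normal training.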

| Application | Typical threshold range | Clipping type |
|---|---|---|
| Transformer language models | 0.25 to 1.0 | Global norm |
| RNNs / LSTMs | 1.0 to 5.0 | Global norm |
| GANs (gradient penalty) | 1.0 to 10.0 | Global norm |
| PPO (policy ratio) | &epsilon; = 0.1 to 0.3 | Ratio clipping |
| WGAN (weight clipping) | c = 0.01 | Weight value |

## Comparison of clipping methods

| Method | What is clipped | Direction preserved? | Primary use case |
|---|---|---|---|
| Gradient clipping by value | Individual gradient elements | No | Preventing outlier gradient components |
| Gradient clipping by norm | Gradient vector magnitude | Yes | General training stabilization (most common) |
| Feature clipping | Raw input feature values | N/A | Outlier capping in preprocessing |
| Weight clipping | Model parameters | N/A | Enforcing Lipschitz constraints (WGAN) |
| Activation clipping | Layer outputs | N/A | Quantization-friendly architectures (MobileNet) |
| PPO ratio clipping | Policy probability ratios | N/A | Stable policy updates in RL |

## Historical context

The problem of exploding and vanishing gradients in recurrent networks was identified by Bengio, Simard, and Frasconi (1994), and independently by Hochreiter (1991). [5] The [Long Short-Term Memory](/wiki/long_short-term_memory_lstm) (LSTM) architecture, introduced by Hochreiter and Schmidhuber (1997), addressed vanishing gradients through gating mechanisms but did not fully solve the exploding gradient issue. [6]

Gradient clipping as a stabilization heuristic appeared in various forms throughout the 2000s, but the technique was formalized and analyzed rigorously by Pascanu, Mikolov, and Bengio (2013). [1] Their paper provided both theoretical analysis and practical algorithms, establishing gradient norm clipping as a standard tool in deep learning. The approach has since been adopted in nearly every major training framework and is a default component of training pipelines for large language models, vision transformers, and other deep architectures.

## See also

- [Exploding gradient problem](/wiki/exploding_gradient_problem)
- [Vanishing gradient problem](/wiki/vanishing_gradient_problem)
- [Backpropagation](/wiki/backpropagation)
- [Recurrent neural network](/wiki/recurrent_neural_network)
- [Long Short-Term Memory](/wiki/long_short-term_memory_lstm)
- [Transformer](/wiki/transformer)

## References

1. Pascanu, R., Mikolov, T., & Bengio, Y. (2013). "On the difficulty of training recurrent neural networks." *Proceedings of the 30th International Conference on Machine Learning (ICML)*, PMLR 28(3):1310-1318. [arXiv:1211.5063](https://arxiv.org/abs/1211.5063)
2. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). "Proximal Policy Optimization Algorithms." [arXiv:1707.06347](https://arxiv.org/abs/1707.06347)
3. Arjovsky, M., Chintala, S., & Bottou, L. (2017). "Wasserstein GAN." [arXiv:1701.07875](https://arxiv.org/abs/1701.07875). Definitions of "clipping" and "gradient clipping" from Google for Developers, Machine Learning Glossary. [developers.google.com](https://developers.google.com/machine-learning/glossary)
4. Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., & Courville, A. (2017). "Improved Training of Wasserstein GANs." [arXiv:1704.00028](https://arxiv.org/abs/1704.00028)
5. Bengio, Y., Simard, P., & Frasconi, P. (1994). "Learning long-term dependencies with gradient descent is difficult." *IEEE Transactions on Neural Networks*, 5(2):157-166. PyTorch documentation, `torch.nn.utils.clip_grad_norm_`. [docs.pytorch.org](https://docs.pytorch.org/docs/stable/generated/torch.nn.utils.clip_grad_norm_.html)
6. Hochreiter, S. & Schmidhuber, J. (1997). "Long Short-Term Memory." *Neural Computation*, 9(8):1735-1780.
7. Krizhevsky, A. (2010). "Convolutional Deep Belief Networks on CIFAR-10." *Unpublished manuscript*.
8. Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., & Adam, H. (2017). "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications." [arXiv:1704.04861](https://arxiv.org/abs/1704.04861)
9. Micikevicius, P., Narang, S., Alben, J., Diamos, G., Elsen, E., Garcia, D., Ginsburg, B., Houston, M., Kuchaiev, O., Venkatesh, G., & Wu, H. (2018). "Mixed Precision Training." *International Conference on Learning Representations (ICLR)*. [arXiv:1710.03740](https://arxiv.org/abs/1710.03740)
10. Zhang, J., He, T., Sra, S., & Jadbabaie, A. (2020). "Why Gradient Clipping Accelerates Training: A Theoretical Justification for Adaptivity." *International Conference on Learning Representations (ICLR)*. [arXiv:1905.11881](https://arxiv.org/abs/1905.11881)