# Vanishing Gradient Problem

> Source: https://aiwiki.ai/wiki/vanishing_gradient_problem
> Updated: 2026-07-11
> Categories: Deep Learning, Machine Learning, Neural Networks
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

*See also: [Machine learning terms](/wiki/machine_learning_terms)*


The **vanishing gradient problem** is a difficulty in training [deep neural networks](/wiki/deep_neural_network) where the gradients used to update the network shrink exponentially as they are propagated backward through the layers, leaving the earliest layers with updates so small that they learn extremely slowly or not at all. It was first identified by Sepp Hochreiter in 1991, and it kept deep networks largely untrainable for nearly two decades until solutions such as the [ReLU](/wiki/rectified_linear_unit_relu) activation, residual connections, [LSTM](/wiki/long_short-term_memory_lstm), and [batch normalization](/wiki/batch_normalization) made depth practical.[1][3][7][12]

The problem arises because [backpropagation](/wiki/backpropagation) computes each layer's gradient as a product of many partial derivatives (the chain rule). When those derivatives are smaller than 1, the product decays toward zero with depth. With the [sigmoid function](/wiki/sigmoid_function), whose derivative peaks at just 0.25, a 10-layer network can attenuate the gradient by a factor near $$0.25^{10}$$, roughly $$9.5 \times 10^{-7}$$, so the first layer receives almost no learning signal.[1] This single mechanism shaped the trajectory of deep learning research: it is the reason early deep and [recurrent neural networks](/wiki/recurrent_neural_network) failed to learn long-range dependencies, and overcoming it is what enabled the modern era of networks with hundreds or thousands of layers.

## What is the vanishing gradient problem?

In one sentence: the vanishing gradient problem is the exponential decay of error gradients as they travel backward through a deep network during training, which starves early layers of the information they need to learn. It occurs when [deep neural networks](/wiki/deep_neural_network) are trained with [backpropagation](/wiki/backpropagation) and gradient-based optimization, and it makes the affected layers learn very slowly or, in severe cases, not at all. It was one of the primary reasons that training deep networks remained impractical for nearly two decades.

## When was the vanishing gradient problem discovered?

The vanishing gradient problem was first formally identified by Sepp Hochreiter in his 1991 diploma thesis at the Technische Universität München, titled *Untersuchungen zu dynamischen neuronalen Netzen* ("Investigations into Dynamic Neural Networks"), supervised by Jürgen Schmidhuber. Hochreiter's thesis provided a rigorous mathematical analysis showing that error signals in [recurrent neural networks](/wiki/recurrent_neural_network) (RNNs) and deep feedforward networks either shrink or grow exponentially during backpropagation.[1] Because the thesis was written entirely in German, it was not widely read by the broader international research community at the time.

In 1994, Yoshua Bengio, Patrice Simard, and Paolo Frasconi published the influential paper "Learning Long-Term Dependencies with [Gradient Descent](/wiki/gradient_descent) is Difficult" in *IEEE Transactions on Neural Networks*, which independently arrived at conclusions similar to Hochreiter's 1991 analysis. The paper showed "why gradient based learning algorithms face an increasingly difficult problem as the duration of the dependencies to be captured increases."[2] Bengio and colleagues approached the problem from a dynamical systems perspective, demonstrating that for recurrent networks, either the dynamics of the system are robust to noise (in which case gradients vanish), or the gradients explode. Their work formalized the theoretical barriers to learning long-range temporal dependencies with standard gradient descent and brought widespread international attention to the problem.[2]

Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio further expanded on these findings in their 2013 paper "On the Difficulty of Training Recurrent Neural Networks," providing geometric and dynamical systems analyses of both the vanishing and [exploding gradient](/wiki/exploding_gradient_problem) problems. They showed that a sufficient condition for vanishing gradients in RNNs is that the spectral radius of the recurrent weight matrix is less than 1.[9]

## Mathematical explanation

### The chain rule and gradient computation

The vanishing gradient problem arises directly from the chain rule of calculus, which is the mathematical foundation of the backpropagation algorithm. In a deep feedforward neural network with *L* layers, the gradient of the loss function *C* with respect to the weights in layer *l* is computed by multiplying a chain of partial derivatives from the output layer back to layer *l*:

$$
\frac{\partial C}{\partial w_l} = \frac{\partial C}{\partial a_L} \cdot \frac{\partial a_L}{\partial a_{L-1}} \cdot \frac{\partial a_{L-1}}{\partial a_{L-2}} \cdots \frac{\partial a_{l+1}}{\partial a_l} \cdot \frac{\partial a_l}{\partial w_l}
$$

Each term $$\partial a_{k} / \partial a_{k-1}$$ involves the derivative of the [activation function](/wiki/activation_function) at layer *k* multiplied by the weights of that layer. When many of these intermediate derivatives are small (less than 1), their product shrinks exponentially as the number of layers increases. For a network with *n* layers where each intermediate derivative has magnitude *d* (where $$0 < d < 1$$), the gradient reaching the first layer is proportional to $$d^n$$, which approaches zero rapidly as *n* grows.
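The product of per-layer factors can be traced numerically. The following is a minimal sketch, not a training routine: it assumes a hypothetical chain of 20 scalar sigmoid units with weights drawn uniformly from [-1, 1], and multiplies the factors $$w_k \cdot \sigma'(z_k)$$ from the output back toward the input.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def backprop_factors(weights, x):
    """Forward a scalar chain a_k = sigmoid(w_k * a_{k-1}) and return
    the per-layer factors d a_k / d a_{k-1} = w_k * sigmoid'(z_k)."""
    factors = []
    a = x
    for w in weights:
        z = w * a
        a = sigmoid(z)
        factors.append(w * a * (1.0 - a))  # w_k * sigmoid'(z_k)
    return factors

rng = random.Random(0)  # fixed seed so the run is repeatable
weights = [rng.uniform(-1.0, 1.0) for _ in range(20)]  # illustrative weights
factors = backprop_factors(weights, x=0.5)

# Multiply the factors from the output layer back toward the input.
grad = 1.0
for depth, f in enumerate(reversed(factors), start=1):
    grad *= f
    if depth in (1, 5, 10, 20):
        print(f"{depth:2d} layers back: |gradient factor| = {abs(grad):.3e}")
```

Because every weight here has magnitude at most 1 and the sigmoid derivative is at most 0.25, each factor is bounded by 0.25 in magnitude, so the printed values shrink at least as fast as $$0.25^n$$.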

### Why do sigmoid and tanh cause vanishing gradients?

The problem is particularly severe when using the [sigmoid function](/wiki/sigmoid_function) or the hyperbolic tangent (tanh) as activation functions. The sigmoid function is defined as:

$$
\sigma(x) = \frac{1}{1 + e^{-x}}
$$

Its derivative is:

$$
\sigma'(x) = \sigma(x) \cdot (1 - \sigma(x))
$$

The maximum value of this derivative occurs at $$x = 0$$, where $$\sigma(0) = 0.5$$, giving a maximum derivative of 0.25. For all other input values, the derivative is strictly less than 0.25. As the input moves away from zero in either direction, the sigmoid function saturates (its output approaches 0 or 1), and the derivative approaches zero.

This means that during backpropagation, each sigmoid layer contributes an activation-derivative factor no greater than 0.25 (and typically much smaller). Unless the weights are large enough to compensate, the gradient reaching the first layer of a 10-layer sigmoid network can be reduced by a factor of approximately $$0.25^{10} \approx 9.5 \times 10^{-7}$$, effectively making it vanish.[1]

The tanh activation function has a maximum derivative of 1.0 at $$x = 0$$, but its derivative falls off quickly for inputs away from zero. In practice, the average derivative during training is still well below 1, meaning that tanh networks can also suffer from vanishing gradients, though typically less severely than sigmoid networks.
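The derivative values quoted above are easy to check directly. This short sketch evaluates the closed-form derivatives of sigmoid and tanh at a few illustrative inputs (the sample points are arbitrary choices) and computes the ten-layer sigmoid bound.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_prime(x):
    return 1.0 - math.tanh(x) ** 2

for x in (0.0, 1.0, 2.0, 5.0):
    print(f"x={x:>4}: sigmoid'={sigmoid_prime(x):.4f}  tanh'={tanh_prime(x):.4f}")

# Upper bound on the attenuation contributed by ten sigmoid derivatives.
bound = sigmoid_prime(0.0) ** 10
print(f"0.25**10 = {bound:.3e}")
```

Both derivatives peak at zero and fall off quickly as the input moves into the saturated regions.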

The following table compares the gradient behavior of common activation functions:

| Activation Function | Maximum Derivative | Derivative Range | Saturation Behavior | Vanishing Gradient Risk |
|---|---|---|---|---|
| [Sigmoid](/wiki/sigmoid_function) | 0.25 (at $$x = 0$$) | $$(0, 0.25]$$ | Saturates at both extremes | Severe |
| Tanh | 1.0 (at $$x = 0$$) | $$(0, 1]$$ | Saturates at both extremes | Moderate |
| [ReLU](/wiki/rectified_linear_unit_relu) | 1.0 (for $$x > 0$$) | $$\{0, 1\}$$ | Saturates only for $$x < 0$$ | Low |
| Leaky ReLU | 1.0 (for $$x > 0$$) | $$\{\alpha, 1\}$$ where $$\alpha \approx 0.01$$ | Never fully saturates | Very low |
| ELU | 1.0 (for $$x > 0$$) | $$(0, 1]$$ | Approaches $$-\alpha$$ for large negative x | Low |
| GELU | $$\sim 1.13$$ (near $$x \approx 1.41$$; approaches 1 for large x) | $$[\sim -0.13, \sim 1.13]$$ | Smooth; gradient approaches 0 for large negative x | Low |

### Numerical example

Consider a simple feedforward network with 5 hidden layers, each using sigmoid activations. Suppose that during a particular forward pass, the sigmoid derivatives at each layer are approximately 0.2 (roughly the value of $$\sigma'(x)$$ at $$|x| = 1$$), and ignore the weight terms for simplicity. The gradient flowing back from the output to the first hidden layer would be multiplied by:

$$
0.2 \cdot 0.2 \cdot 0.2 \cdot 0.2 \cdot 0.2 = 0.2^5 = 0.00032
$$

This means the first layer receives a gradient that is roughly 3,000 times smaller than the gradient at the output layer. In a 10-layer network with the same conditions, the factor becomes $$0.2^{10} = 1.024 \times 10^{-7}$$, and in a 20-layer network, it becomes $$0.2^{20} \approx 1.05 \times 10^{-14}$$. At such magnitudes, the resulting weight updates are typically smaller than the spacing between adjacent single-precision floating-point values near the weights themselves, so they are rounded away and the early layers effectively stop learning.

The table below shows how gradient magnitudes decay with increasing network depth when the per-layer derivative factor is 0.2:

| Number of Layers | Gradient Factor | Approximate Magnitude |
|---|---|---|
| 1 | $$0.2^1$$ | $$2.0 \times 10^{-1}$$ |
| 5 | $$0.2^5$$ | $$3.2 \times 10^{-4}$$ |
| 10 | $$0.2^{10}$$ | $$1.0 \times 10^{-7}$$ |
| 15 | $$0.2^{15}$$ | $$3.3 \times 10^{-11}$$ |
| 20 | $$0.2^{20}$$ | $$1.05 \times 10^{-14}$$ |
| 50 | $$0.2^{50}$$ | $$\sim 1.1 \times 10^{-35}$$ |
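The table values, and the way such small updates are lost in single precision, can be reproduced with a few lines. This is a sketch under stated assumptions: a weight of exactly 1.0, an update equal to the raw factor $$0.2^{20}$$, and float32 rounding emulated with the standard `struct` module (the helper `to_float32` is a name introduced here).

```python
import struct

def to_float32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

PER_LAYER = 0.2  # per-layer derivative factor assumed in the table

for n in (1, 5, 10, 15, 20, 50):
    print(f"{n:2d} layers: {PER_LAYER ** n:.2e}")

# A weight of 1.0 updated by a step of size 0.2**20, stored in float32.
weight = to_float32(1.0)
update = to_float32(PER_LAYER ** 20)
unchanged = to_float32(weight - update) == weight
print("update rounded away in float32:", unchanged)
```

The update itself is representable, but it is far below the gap between 1.0 and its neighbouring float32 values, so the stored weight does not change.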

## Vanishing gradients across architectures

The vanishing gradient problem manifests differently depending on the network architecture. While the underlying mathematical cause is the same (repeated multiplication of small derivatives), the severity and practical consequences vary significantly.

### Feedforward networks

In standard feedforward (fully connected) networks, the gradient for layer *l* passes through every intermediate layer between *l* and the output. The depth of the network directly determines how many multiplicative factors are involved. During the 1990s and early 2000s, feedforward networks with more than a few hidden layers were generally considered impractical to train with gradient descent. Networks deeper than about 5 to 7 layers often failed to converge when using sigmoid or tanh activations. The introduction of ReLU, proper initialization, and batch normalization made deep feedforward networks practical starting around 2010 to 2012.

### Convolutional neural networks

Convolutional neural networks (CNNs) share the same vulnerability to vanishing gradients, but several features of their architecture partially mitigate it. Weight sharing across spatial locations means each convolutional filter receives gradient contributions from many spatial positions, effectively aggregating gradient information. Pooling layers reduce spatial dimensions, shortening the effective depth of the gradient path. Additionally, CNNs historically tended to be shallower than the feedforward networks studied in early vanishing gradient research. Nonetheless, as CNNs grew deeper (VGGNet at 19 layers, GoogLeNet at 22 layers), the vanishing gradient problem became significant. The introduction of residual connections in ResNet (2015) was motivated directly by the observation that very deep plain CNNs suffered from degradation in training accuracy, a symptom closely linked to vanishing gradients.[12]

### Recurrent neural networks

The vanishing gradient problem is especially acute in recurrent neural networks, where the same set of weights is applied at each time step. When an RNN processes a sequence of length *T*, the gradient of the loss with respect to the hidden state at time step *t* involves a product of $$T - t$$ Jacobian matrices:

$$
\frac{\partial h_T}{\partial h_t} = \prod_{k=t}^{T-1} \frac{\partial h_{k+1}}{\partial h_k}
$$

Each Jacobian matrix involves the recurrent weight matrix and the derivative of the activation function. If the largest singular value (spectral norm) of each Jacobian is less than 1, the product of these matrices contracts exponentially, and the network cannot learn dependencies between events separated by many time steps.[9] This is why standard RNNs struggle to learn long-range temporal dependencies in tasks such as language modeling and speech recognition.

A modest RNN may process sequences of 200 to 400 time steps, which is conceptually equivalent to a feedforward network with that many layers. Experimental studies have shown that gradient magnitudes in vanilla RNNs can drop by a factor exceeding $$10^4$$ between the last and first time steps, making it nearly impossible to learn dependencies at the beginning of a sequence. This severe limitation directly motivated the development of [LSTM](/wiki/long_short-term_memory_lstm)[3] and GRU[11] architectures.
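The contraction of the Jacobian product can be seen in a toy example. The sketch below assumes a hypothetical two-unit tanh RNN with a fixed, made-up weight matrix `W` (largest singular value about 0.55) and a constant input; it accumulates $$\partial h_T / \partial h_t$$ step by step and records its Frobenius norm.

```python
import math

W = [[0.5, 0.2],
     [-0.1, 0.4]]  # illustrative recurrent weights; largest singular value ~0.55

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def step(h, x):
    """One step of h_{k+1} = tanh(W h_k + x)."""
    return [math.tanh(W[i][0] * h[0] + W[i][1] * h[1] + x) for i in range(2)]

def jacobian(h_next):
    """d h_{k+1} / d h_k = diag(1 - h_{k+1}^2) @ W for a tanh RNN."""
    return [[(1.0 - h_next[i] ** 2) * W[i][j] for j in range(2)]
            for i in range(2)]

def frobenius(m):
    return math.sqrt(sum(v * v for row in m for v in row))

h = [0.0, 0.0]
product = [[1.0, 0.0], [0.0, 1.0]]  # accumulates the product of Jacobians
norms = []
for k in range(50):
    h = step(h, x=0.1)
    product = matmul(jacobian(h), product)  # newest Jacobian multiplies on the left
    norms.append(frobenius(product))

print(f"after  1 step : {norms[0]:.3e}")
print(f"after 10 steps: {norms[9]:.3e}")
print(f"after 50 steps: {norms[49]:.3e}")
```

Since the tanh derivative never exceeds 1, every Jacobian here has spectral norm below 1, and the norm of the accumulated product falls at every step.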

### Transformer networks

Transformer architectures largely avoid the vanishing gradient problem through their use of residual connections around every sublayer (self-attention and feed-forward blocks), combined with layer normalization. Because every sublayer has a direct additive shortcut to the input, gradients can flow from the output back through any number of layers without being repeatedly multiplied by small factors. The self-attention mechanism also provides direct gradient paths between any two positions in the input sequence, avoiding the sequential gradient propagation bottleneck of RNNs.

However, vanishing gradients have not been entirely eliminated in transformers. Research on very deep transformers (Xiong et al., 2020; Liu et al., 2020) has shown that the placement of layer normalization matters significantly.[24] Post-LN transformers (where layer normalization is applied after the residual addition) can still suffer from gradient instability in very deep configurations. Pre-LN transformers (where layer normalization is applied before the attention or feed-forward computation, inside the residual branch) produce more stable gradient flow and have become the default choice for training large language models with many layers.[24]
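The two layouts can be written side by side. This is a sketch in notation that does not appear in the cited papers verbatim: $$x_l$$ is the input to sublayer *l*, $$F_l$$ is the attention or feed-forward computation, and $$\mathrm{LN}$$ is layer normalization.

$$
\text{Post-LN:}\quad x_{l+1} = \mathrm{LN}\big(x_l + F_l(x_l)\big)
$$

$$
\text{Pre-LN:}\quad x_{l+1} = x_l + F_l\big(\mathrm{LN}(x_l)\big)
$$

In the Pre-LN form the identity path from $$x_l$$ to $$x_{l+1}$$ bypasses normalization, so the Jacobian $$\partial x_{l+1} / \partial x_l$$ always contains an identity term. In the Post-LN form every gradient, including the one on the shortcut, passes through the Jacobian of $$\mathrm{LN}$$ at each layer.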

## How does the vanishing gradient problem differ from the exploding gradient problem?

The **[exploding gradient problem](/wiki/exploding_gradient_problem)** is the opposite phenomenon, where gradients grow exponentially rather than shrinking during backpropagation. This occurs when the intermediate derivatives in the chain rule product are consistently greater than 1 in magnitude. In recurrent neural networks, a spectral radius of the recurrent weight matrix greater than 1 is a necessary condition for exploding gradients.[9]

The consequences of exploding gradients are different from those of vanishing gradients but equally problematic:

- **Unstable training:** Extremely large gradients produce enormous weight updates, which can make the loss oscillate wildly or diverge to infinity.
- **Numerical overflow:** Gradient values can exceed the representable range of floating-point numbers, producing NaN (Not a Number) values that halt training.
- **Parameter explosion:** Weights can grow to very large values, pushing activations into saturation regions and further destabilizing training.

Exploding gradients are often easier to detect than vanishing gradients because they produce obvious symptoms such as NaN loss values or wildly fluctuating training curves. They are also somewhat easier to mitigate, primarily through gradient clipping.[9]

It is worth noting that vanishing and exploding gradients are not mutually exclusive within a single network. Different paths through the network can experience different gradient magnitudes, and some layers may have vanishing gradients while others have exploding gradients simultaneously.

## Effects on training

The vanishing gradient problem manifests in several observable ways during the training of deep neural networks:

- **Slow convergence:** Earlier layers in the network receive vanishingly small gradient updates, so their weights change extremely slowly compared to later layers. This creates an asymmetry where later layers learn quickly while earlier layers remain nearly at their initial values.
- **Poor feature learning:** Because the earliest layers are responsible for extracting low-level features from the input data, their inability to learn effectively means the network fails to develop useful feature representations. This leads to poor generalization.
- **Effective depth limitation:** Even though a network may have many layers, the vanishing gradient problem means that only the last few layers are effectively learning. The earlier layers act almost as fixed random projections, wasting the representational capacity of the deep architecture.
- **Training instability:** The interaction between vanishing gradients in some layers and potentially exploding gradients in others can make training unpredictable and sensitive to hyperparameter choices.
- **Inability to learn long-range dependencies:** In recurrent networks, vanishing gradients prevent the model from connecting events that are separated by many time steps. The network effectively has a limited "memory horizon" beyond which it cannot learn temporal relationships.

## How do you diagnose vanishing gradients?

Identifying vanishing gradients in practice requires monitoring gradient statistics during training. Several diagnostic approaches are commonly used:

- **Gradient norm monitoring:** Plotting the L2 norm of gradients for each layer over training steps reveals whether early layers receive significantly smaller gradients than later layers. A gradient norm that decreases by orders of magnitude from the last layer to the first is a strong indicator of vanishing gradients.
- **Weight update ratios:** Computing the ratio of the update magnitude to the weight magnitude for each layer provides a normalized view of how much each layer is learning. A ratio far below approximately $$10^{-3}$$, a commonly used rule of thumb, suggests that the layer is barely updating.
- **Activation statistics:** Monitoring the mean and variance of activations at each layer can reveal saturation. If activations cluster near the saturation regions of sigmoid or tanh (near 0 or 1 for sigmoid, near -1 or 1 for tanh), the derivatives at those points will be near zero.
- **Training loss plateaus:** A training loss that decreases rapidly at first (as later layers learn) and then plateaus for extended periods (as earlier layers stagnate) can indicate vanishing gradients.

Modern deep learning frameworks such as PyTorch and TensorFlow provide built-in tools and hooks for extracting gradient statistics. Third-party libraries and visualization platforms like TensorBoard and Weights & Biases offer gradient histogram plots and per-layer gradient tracking dashboards.
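The first two diagnostics can be sketched without any framework. In the example below, `layer_diagnostics` is a hypothetical helper, the weights and gradients are made-up numbers chosen to mimic a network with vanishing gradients, and plain SGD is assumed so that the update is the learning rate times the gradient.

```python
import math

def l2_norm(values):
    return math.sqrt(sum(v * v for v in values))

def layer_diagnostics(weights, grads, learning_rate):
    """Return {layer: (gradient norm, update-to-weight ratio)}.

    `weights` and `grads` map a layer name to a flat list of floats.
    Plain SGD is assumed, so the update magnitude is lr * ||grad||.
    """
    report = {}
    for name, w in weights.items():
        g_norm = l2_norm(grads[name])
        ratio = learning_rate * g_norm / (l2_norm(w) + 1e-12)
        report[name] = (g_norm, ratio)
    return report

# Made-up values: gradients shrink by orders of magnitude toward layer1.
weights = {"layer1": [0.5, -0.3, 0.8],
           "layer2": [0.2, 0.7, -0.4],
           "layer3": [0.6, -0.1, 0.3]}
grads = {"layer1": [2e-7, -1e-7, 3e-7],
         "layer2": [4e-4, 1e-4, -2e-4],
         "layer3": [0.05, -0.02, 0.03]}

report = layer_diagnostics(weights, grads, learning_rate=0.1)
for name, (g_norm, ratio) in report.items():
    flag = "  <- barely updating" if ratio < 1e-3 else ""
    print(f"{name}: grad norm {g_norm:.2e}, update/weight {ratio:.2e}{flag}")
```

In a real training loop the same quantities would be read from the framework's parameter and gradient tensors after each backward pass and logged over time rather than computed once.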

## How do you fix the vanishing gradient problem?

Over the past three decades, researchers have developed numerous techniques to address or circumvent the vanishing gradient problem. These solutions have collectively enabled the training of networks with hundreds or even thousands of layers.

| Technique | Year Introduced | Key Authors | Mechanism | Applicable To |
|---|---|---|---|---|
| [LSTM](/wiki/long_short-term_memory_lstm) | 1997 | Hochreiter, Schmidhuber | Gated cell state with additive updates preserves gradient flow | RNNs |
| [GRU](/wiki/recurrent_neural_network) | 2014 | Cho et al. | Simplified gating mechanism with update and reset gates | RNNs |
| [ReLU](/wiki/rectified_linear_unit_relu) Activation | 2010 | Nair, Hinton | Constant gradient of 1 for positive inputs; no saturation | Feedforward, CNNs |
| Leaky ReLU | 2013 | Maas et al. | Small positive slope for negative inputs prevents dead neurons | Feedforward, CNNs |
| Parametric ReLU (PReLU) | 2015 | He et al. | Learnable slope parameter for negative inputs | Feedforward, CNNs |
| ELU | 2015 | Clevert et al. | Smooth exponential curve for negative inputs; pushes mean activations toward zero | Feedforward, CNNs |
| Xavier/Glorot Initialization | 2010 | Glorot, Bengio | Scales initial weights based on fan-in and fan-out to preserve variance | All network types |
| He Initialization | 2015 | He et al. | Variance scaling adapted for ReLU activations ($$2/\text{fan-in}$$) | Networks using ReLU |
| LSUV Initialization | 2016 | Mishkin, Matas | Layer-sequential unit-variance; orthonormal init + variance normalization | All network types |
| [Batch Normalization](/wiki/batch_normalization) | 2015 | Ioffe, Szegedy | Normalizes layer inputs to reduce internal covariate shift | All network types |
| Layer Normalization | 2016 | Ba, Kiros, Hinton | Normalizes across features rather than batch dimension | RNNs, Transformers |
| Residual Connections (Skip Connections) | 2015 | He et al. | Identity shortcut paths allow gradients to bypass layers | Very deep CNNs, Transformers |
| Highway Networks | 2015 | Srivastava, Greff, Schmidhuber | Learned gating mechanisms for information and gradient flow | Deep feedforward networks |
| Gradient Clipping | 2012/2013 | Pascanu et al. | Caps gradient norm at a threshold to prevent explosion | All network types |

### Activation function solutions

The Rectified Linear Unit (ReLU), popularized by Vinod Nair and Geoffrey Hinton in 2010, was one of the most important breakthroughs in addressing the vanishing gradient problem. ReLU is defined as $$f(x) = \max(0, x)$$, which gives a derivative of exactly 1 for all positive inputs and 0 for negative inputs. Unlike the sigmoid and tanh functions, ReLU does not saturate for positive values, meaning gradients can flow through the network without being attenuated.[7] The practical payoff was demonstrated in 2012, when AlexNet used ReLU activations to win the ImageNet challenge with a top-5 error rate of 15.3%, compared to 26.2% for the next best entry.[8]

However, ReLU has its own limitation known as the "dying ReLU" problem: a neuron whose pre-activation is negative for essentially every input outputs zero, receives zero gradient, and stops learning entirely. Several variants have been developed to address this:

- **Leaky ReLU** (Maas et al., 2013): Uses a small positive slope (typically 0.01) for negative inputs instead of zero, ensuring that gradients are never completely blocked.[10]
- **Parametric ReLU (PReLU)** (He et al., 2015): Similar to Leaky ReLU, but the slope for negative inputs is a learnable parameter optimized during training.[13]
- **Exponential Linear Unit (ELU)** (Clevert et al., 2015): Uses an exponential function for negative inputs, which provides smoother gradients near zero and pushes mean activations closer to zero, helping to stabilize training.[15]
- **Scaled Exponential Linear Unit (SELU)** (Klambauer et al., 2017): A scaled version of ELU that, under certain conditions, automatically normalizes activations toward zero mean and unit variance.[22]
- **Gaussian Error Linear Unit (GELU)** (Hendrycks and Gimpel, 2016): Defined as $$f(x) = x \cdot \Phi(x)$$, where $$\Phi$$ is the standard Gaussian cumulative distribution function, so each input is weighted by its percentile under a Gaussian distribution. GELU has become the default activation in many transformer models, including BERT and GPT.
- **Swish/SiLU** (Ramachandran et al., 2017): Defined as $$f(x) = x \cdot \mathrm{sigmoid}(x)$$, Swish is a smooth, non-monotonic function discovered through automated search that has shown strong performance in deep networks.
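The gradient behavior of these variants can be compared directly. The sketch below uses closed-form derivatives for the piecewise-linear functions and a central finite difference (the helper `numeric_grad` is introduced here for illustration) for GELU and Swish; the default `alpha` values are common choices, not values prescribed by this article.

```python
import math

def relu_grad(x):
    return 1.0 if x > 0 else 0.0

def leaky_relu_grad(x, alpha=0.01):
    return 1.0 if x > 0 else alpha

def elu_grad(x, alpha=1.0):
    return 1.0 if x > 0 else alpha * math.exp(x)

def gelu(x):
    # Exact form x * Phi(x), with Phi the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def swish(x):
    return x / (1.0 + math.exp(-x))  # x * sigmoid(x)

def numeric_grad(f, x, eps=1e-6):
    """Central finite difference approximation of f'(x)."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

for x in (-3.0, -0.5, 0.5, 3.0):
    print(f"x={x:+.1f}  relu={relu_grad(x):.3f}  leaky={leaky_relu_grad(x):.3f}  "
          f"elu={elu_grad(x):.3f}  gelu={numeric_grad(gelu, x):+.3f}  "
          f"swish={numeric_grad(swish, x):+.3f}")
```

For negative inputs, ReLU passes no gradient at all, while the other functions keep a nonzero (if sometimes small) gradient, which is the property that lets a neuron recover instead of dying.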

### Gated architectures: LSTM and GRU

Long Short-Term Memory (LSTM) networks, introduced by Hochreiter and Schmidhuber in 1997, were specifically designed to solve the vanishing gradient problem in recurrent neural networks.[3] The key innovation is the cell state, a separate memory pathway that runs through the entire sequence with only linear interactions (additions and element-wise multiplications by gate values). This creates what Hochreiter called a "Constant Error Carousel" (CEC) that allows gradients to flow backward through time without exponential decay. In their original paper, Hochreiter and Schmidhuber reported that LSTM "can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units," a range far beyond what standard RNNs could reach.[3]

LSTM achieves this through three gating mechanisms:

- **Forget gate:** Decides what information to discard from the cell state by outputting values between 0 and 1 for each element. The forget gate was added by Gers, Schmidhuber, and Cummins in 1999.[4]
- **Input gate:** Controls which new information is written to the cell state.
- **Output gate:** Determines which parts of the cell state are exposed as the hidden state output.

Because the cell state is updated by a gated sum rather than by a full matrix multiplication followed by a squashing nonlinearity ($$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$, where $$\tilde{c}_t$$ is the candidate update), the gradient with respect to the cell state at an earlier time step does not involve repeated multiplication by the same weight matrix. This allows LSTM networks to learn dependencies spanning hundreds of time steps.
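A short derivation makes the role of the forget gate explicit. As a simplifying assumption, treat the gate values as constants, that is, ignore their indirect dependence on $$c_{t-1}$$ through the hidden state. Differentiating the cell update along the direct cell-to-cell path then gives:

$$
\frac{\partial c_t}{\partial c_{t-1}} \approx \mathrm{diag}(f_t), \qquad \frac{\partial c_T}{\partial c_t} \approx \prod_{k=t+1}^{T} \mathrm{diag}(f_k)
$$

When the forget gates stay close to 1, this product stays close to the identity. For example, a constant gate value of 0.99 over 100 steps gives $$0.99^{100} \approx 0.37$$, whereas a per-step factor of 0.5 would give $$0.5^{100} \approx 7.9 \times 10^{-31}$$.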

The Gated Recurrent Unit (GRU), proposed by Kyunghyun Cho and colleagues in 2014, is a simplified variant of LSTM that combines the forget and input gates into a single "update gate" and merges the cell state with the hidden state.[11] GRU uses only two gates (update and reset) instead of three, making it computationally more efficient while achieving comparable performance to LSTM on many tasks.

### Residual connections

Residual connections (also called skip connections), introduced by Kaiming He and colleagues in their 2015 paper "Deep Residual Learning for Image Recognition," represent one of the most important architectural innovations for training very deep networks.[12] The core idea is to reformulate layers as learning residual functions with reference to the layer inputs.

In a standard network, a block of layers learns a mapping $$H(x)$$. In a residual network, the block instead learns the residual $$F(x) = H(x) - x$$, and the output is computed as $$F(x) + x$$. The addition of the identity shortcut connection means that during backpropagation, the gradient always has a direct path through the identity connection (with a gradient of 1), regardless of how small the gradient through the learned layers $$F(x)$$ might be.
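The effect of the identity path can be illustrated with scalars. This sketch assumes, purely for illustration, that the derivative through every learned branch $$F$$ is a tiny constant (0.01) and compares a plain stack, whose gradient is the product of $$F'(x)$$, with a residual stack, whose gradient is the product of $$1 + F'(x)$$.

```python
def plain_chain(layer_derivs):
    """Gradient through stacked layers y = F(x): product of F'(x)."""
    g = 1.0
    for d in layer_derivs:
        g *= d
    return g

def residual_chain(layer_derivs):
    """Gradient through stacked blocks y = F(x) + x: product of (1 + F'(x))."""
    g = 1.0
    for d in layer_derivs:
        g *= 1.0 + d
    return g

derivs = [0.01] * 20  # assumed tiny derivative through every learned branch

print(f"plain   : {plain_chain(derivs):.3e}")
print(f"residual: {residual_chain(derivs):.3e}")
```

The plain product collapses toward zero, while the residual product stays of order 1 because each factor contains the constant 1 contributed by the shortcut.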

This simple modification had a profound impact. ResNet demonstrated that networks with 152 layers could be trained effectively, winning the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2015. An ensemble of residual networks achieved a top-5 error rate of 3.57% on the ImageNet test set, while a single 152-layer model reached a 4.49% top-5 validation error that already outperformed all previous ensemble results.[12] Follow-up work showed that networks with over 1,000 layers could be trained using residual connections.

Residual connections have since become a standard component in nearly all deep architectures, including transformer models, where they are used around both the self-attention and feed-forward sublayers.

### Highway networks

Highway networks, introduced by Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber in May 2015, predated ResNet by several months and were directly inspired by the gating mechanisms of LSTM.[16] In a highway layer, the output is computed as:

$$
y = H(x) \cdot T(x) + x \cdot (1 - T(x))
$$

Here, $$H(x)$$ is a nonlinear transformation, $$T(x)$$ is a learned "transform gate" with values between 0 and 1, and $$(1 - T(x))$$ serves as the "carry gate." When $$T(x)$$ is close to 0, the layer simply passes the input through (the "carry" path), allowing unimpeded gradient flow. When $$T(x)$$ is close to 1, the layer applies the full nonlinear transformation.

The Jacobian of a highway layer includes an identity term from the carry path, which helps preserve gradient magnitude across many layers. Srivastava and colleagues demonstrated that highway networks with over 900 layers could be trained with simple stochastic gradient descent with momentum.[16] While residual connections eventually became more widely adopted due to their simplicity (a residual block is essentially a highway layer with the gating removed, so that both the transform and carry paths are always open), highway networks provided important theoretical insight into how gating mechanisms can facilitate gradient flow in very deep networks.
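The two gate regimes can be demonstrated with a scalar highway layer. This is a minimal sketch: the choice of tanh for $$H(x)$$, the sigmoid gate, and all parameter values are assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def highway_layer(x, w_h, b_h, w_t, b_t):
    """Scalar highway layer: y = H(x) * T(x) + x * (1 - T(x))."""
    h = math.tanh(w_h * x + b_h)  # nonlinear transform H(x); tanh is an assumption
    t = sigmoid(w_t * x + b_t)    # transform gate T(x), always in (0, 1)
    return h * t + x * (1.0 - t)

x = 0.7
carry = highway_layer(x, w_h=1.5, b_h=0.0, w_t=0.0, b_t=-10.0)     # T close to 0
transform = highway_layer(x, w_h=1.5, b_h=0.0, w_t=0.0, b_t=10.0)  # T close to 1

print(f"T near 0: y = {carry:.4f} (input was {x})")
print(f"T near 1: y = {transform:.4f} (H(x) = {math.tanh(1.5 * x):.4f})")
```

With a strongly negative gate bias the layer is nearly the identity, which is the regime in which gradients pass through almost unchanged.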

### Weight initialization

Proper weight initialization is critical for preventing both vanishing and exploding gradients at the start of training. Three initialization strategies have become standard:

**Xavier/Glorot Initialization** (Glorot and Bengio, 2010): Proposed in the paper "Understanding the Difficulty of Training Deep Feedforward Neural Networks,"[6] this method initializes weights from a distribution with variance scaled according to the number of input and output connections (fan-in and fan-out) of each layer:

$$
\mathrm{Var}(w) = \frac{2}{\text{fan\_in} + \text{fan\_out}}
$$

This keeps the variance of activations and gradients approximately constant across layers when using sigmoid or tanh activation functions. The derivation assumes that activations are approximately linear near their operating point, which is valid for tanh near zero but not for ReLU.

**He Initialization** (He et al., 2015): Proposed in "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification,"[13] this method adjusts the variance for ReLU activations, which zero out roughly half of their inputs:

$$
\mathrm{Var}(w) = \frac{2}{\text{fan\_in}}
$$

When fan-in and fan-out are equal, He initialization doubles the variance relative to Xavier initialization to compensate for the halving effect of ReLU. It has become the default initialization for networks using ReLU and its variants.

**LSUV Initialization** (Mishkin and Matas, 2016): Layer-Sequential Unit-Variance (LSUV) initialization, proposed in the paper "All You Need is a Good Init,"[18] takes a data-driven approach. The method consists of two steps: first, weights in each layer are initialized with orthonormal matrices; second, a mini-batch of data is passed through the network, and the weights in each layer are iteratively rescaled so that the output variance of each layer equals one. LSUV has been shown to match or outperform more complex training schemes on datasets including CIFAR-10/100 and ImageNet, and it works well across different activation functions (ReLU, maxout, tanh).

| Initialization Method | Target Activation | Variance Formula | Key Assumption |
|---|---|---|---|
| Xavier/Glorot (2010) | Sigmoid, Tanh | $$2 / (\text{fan\_in} + \text{fan\_out})$$ | Linear activations |
| He/Kaiming (2015) | [ReLU](/wiki/rectified_linear_unit_relu) and variants | $$2 / \text{fan\_in}$$ | Half of inputs zeroed by ReLU |
| LSUV (2016) | Any | Data-driven (unit variance per layer) | None (empirical normalization) |
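The two closed-form variance rules translate directly into code. The sketch below assumes zero-mean Gaussian sampling (a uniform distribution with the same variance is an equally valid choice) and a hypothetical 256-by-256 layer; the helper names are introduced here for illustration.

```python
import math
import random

def xavier_std(fan_in, fan_out):
    """Standard deviation for Var(w) = 2 / (fan_in + fan_out)."""
    return math.sqrt(2.0 / (fan_in + fan_out))

def he_std(fan_in):
    """Standard deviation for Var(w) = 2 / fan_in."""
    return math.sqrt(2.0 / fan_in)

def sample_weights(fan_in, fan_out, std, rng):
    """Draw a fan_out x fan_in matrix from a zero-mean Gaussian."""
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]

def empirical_variance(matrix):
    values = [v for row in matrix for v in row]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

rng = random.Random(0)  # fixed seed so the run is repeatable
fan_in, fan_out = 256, 256
w_xavier = sample_weights(fan_in, fan_out, xavier_std(fan_in, fan_out), rng)
w_he = sample_weights(fan_in, fan_out, he_std(fan_in), rng)

print(f"Xavier: target {2 / (fan_in + fan_out):.5f}, sampled {empirical_variance(w_xavier):.5f}")
print(f"He    : target {2 / fan_in:.5f}, sampled {empirical_variance(w_he):.5f}")
```

With equal fan-in and fan-out, the He target variance is exactly twice the Xavier target, which the sampled values reflect.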

### Batch normalization

[Batch normalization](/wiki/batch_normalization), introduced by Sergey Ioffe and Christian Szegedy in 2015, normalizes the inputs to each layer by subtracting the batch mean and dividing by the batch standard deviation. The technique then applies learned scale and shift parameters to restore representational capacity. Ioffe and Szegedy originally described the motivation as reducing "internal covariate shift" (the change in the distribution of layer inputs during training), though later research has questioned whether this is the true mechanism behind its effectiveness.[14]

Regardless of the theoretical explanation, batch normalization has a clear practical effect on the vanishing gradient problem: by keeping layer inputs in a normalized range, it prevents activations from drifting into the saturated regions of sigmoid or tanh functions, where derivatives are near zero. Batch normalization also allows the use of higher learning rates and reduces sensitivity to weight initialization, both of which contribute to more stable and faster training.
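The effect on saturation can be shown with a single feature. This sketch implements only the training-time normalization of one feature across a mini-batch (no running statistics), using made-up pre-activation values that sit deep in the saturated region of the sigmoid.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale and shift."""
    mean = sum(batch) / len(batch)
    var = sum((v - mean) ** 2 for v in batch) / len(batch)  # biased batch variance
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in batch]

def sigmoid_prime(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

pre_activations = [6.0, 8.0, 7.5, 9.0]  # made-up values deep in saturation
normalized = batch_norm(pre_activations)

print("sigmoid' before:", [f"{sigmoid_prime(z):.4f}" for z in pre_activations])
print("sigmoid' after :", [f"{sigmoid_prime(z):.4f}" for z in normalized])
```

After normalization the values are centered near zero, where the sigmoid derivative is far larger than it was at the original inputs.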

Variants of normalization have been developed for settings where batch normalization is less effective. Layer normalization (Ba, Kiros, and Hinton, 2016) normalizes across features rather than across the batch dimension, making it suitable for recurrent networks and transformers.[17] Group normalization (Wu and He, 2018) divides channels into groups and normalizes within each group, working well with small batch sizes.

### Gradient clipping

Gradient clipping is a technique that directly addresses the exploding gradient problem by capping the magnitude of gradients during backpropagation. Two common approaches exist:

- **Gradient norm clipping:** If the L2 norm of the gradient vector exceeds a threshold, the entire vector is rescaled so its norm equals the threshold. This preserves the direction of the gradient while limiting its magnitude.
- **Gradient value clipping:** Individual gradient values are clamped to a specified range (for example, $$[-1, 1]$$).

Gradient norm clipping, as proposed by Pascanu, Mikolov, and Bengio (2013), has become the standard approach and is widely used when training RNNs and large transformer models.[9] Clipping targets exploding gradients and cannot restore a gradient that has already vanished; its benefit is indirect, since bounding the size of each update keeps training stable.
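
Both clipping rules fit in a few lines. The sketch below assumes the gradients arrive as a list of NumPy arrays, one per parameter tensor; the helper names are illustrative and do not correspond to any particular library's API.

```python
import numpy as np

def clip_by_norm(grads, max_norm):
    """Rescale all gradients together if their global L2 norm exceeds max_norm."""
    total = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))   # never scales gradients up
    return [g * scale for g in grads], total

def clip_by_value(grads, limit):
    """Clamp every gradient entry independently to [-limit, limit]."""
    return [np.clip(g, -limit, limit) for g in grads]

grads = [np.array([3.0, 4.0]), np.array([12.0])]   # global L2 norm is 13
by_norm, total = clip_by_norm(grads, max_norm=1.0)
by_value = clip_by_value(grads, limit=1.0)
```

Norm clipping keeps the ratios between entries (the direction of the update), whereas value clipping maps this example to all ones and so changes the direction.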

## Impact on deep learning history

The vanishing gradient problem had a profound influence on the history and development of deep learning and, by extension, the broader field of artificial intelligence.

### The shallow network era (1991 to 2006)

After Hochreiter's 1991 analysis[1] and the 1994 paper by Bengio, Simard, and Frasconi,[2] many researchers concluded that training deep networks with gradient descent was fundamentally impractical. This contributed to a shift toward shallow architectures, typically with only one or two hidden layers. During this period, support vector machines and other kernel methods gained prominence as alternatives that did not require deep architectures.

The difficulty of training deep networks also contributed to reduced interest and funding for neural network research in the late 1990s, a period sometimes characterized as part of the broader "AI winter." Researchers who continued working on neural networks, such as Yann LeCun, Geoffrey Hinton, and Yoshua Bengio (later recognized as the "Godfathers of Deep Learning"), often faced skepticism from the wider machine learning community.

### The pretraining era (2006 to 2012)

A partial breakthrough came in 2006 when Geoffrey Hinton and Ruslan Salakhutdinov demonstrated that deep belief networks could be trained effectively using a layer-by-layer unsupervised pretraining strategy.[5] By pretraining each layer as a restricted Boltzmann machine (RBM) and then fine-tuning the entire network with backpropagation, they circumvented the vanishing gradient problem during the initial phase of training. This approach showed that deep architectures could learn useful hierarchical representations, reigniting interest in deep learning.

Schmidhuber had explored a related layer-wise pretraining approach for recurrent neural networks as early as 1992.[23] Other pretraining methods followed, including the use of denoising autoencoders (Vincent et al., 2008).

### The modern deep learning era (2012 onward)

The true revolution came around 2012, driven by the convergence of multiple solutions to the vanishing gradient problem together with increases in computational power from GPUs. Key milestones include:

- **2010:** Nair and Hinton popularized ReLU,[7] and Glorot and Bengio introduced Xavier initialization.[6]
- **2011-2012:** Dan Ciresan and colleagues at IDSIA demonstrated that plain deep convolutional neural networks could be trained successfully on GPUs without pretraining.
- **2012:** AlexNet (Krizhevsky, Sutskever, and Hinton) won the ImageNet challenge by a large margin, cutting the top-5 error rate to 15.3% from the 26.2% of the next best entry, using ReLU activations and dropout.[8] This is widely regarded as the event that launched the modern deep learning era.
- **2015:** Batch normalization,[14] ResNet,[12] Highway Networks,[16] and He initialization[13] were introduced, enabling networks with over 100 layers.
- **2017:** The transformer architecture (Vaswani et al.) combined residual connections, layer normalization, and attention mechanisms to build deep sequence models without recurrence.[21]

Each of these advances directly or indirectly addressed the vanishing gradient problem, collectively transforming deep learning from a niche research topic into the dominant paradigm in artificial intelligence.

## Modern relevance

Although the vanishing gradient problem has been largely managed in standard architectures through the techniques described above, it remains an active concern in several areas of modern deep learning research.

### Very deep transformer models

Transformer architectures used in large language models (LLMs) such as GPT-4, Claude, and Gemini stack many layers; published models of comparable scale range from dozens of layers to more than a hundred. Residual connections and layer normalization mitigate the vanishing gradient problem, but training these models still requires careful architectural choices. Pre-norm (Pre-LN) architectures, which apply layer normalization before the attention and feed-forward blocks, have been found to produce more stable gradient flow in very deep transformers than post-norm (Post-LN) architectures, which apply it after the residual addition. The choice between the two can determine whether a deep transformer trains successfully or suffers from gradient degradation.[24]
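
The structural difference between the two placements is small. The sketch below is a simplified illustration in which `sublayer` stands in for either attention or the feed-forward network, and layer normalization has no learned parameters; all names are assumptions for the example.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_ln_block(x, sublayer):
    # Post-LN: normalization sits on the main path, after the residual addition.
    return layer_norm(x + sublayer(x))

def pre_ln_block(x, sublayer):
    # Pre-LN: normalization sits inside the branch; the skip path is untouched.
    return x + sublayer(layer_norm(x))

def zero_sublayer(v):
    # Toy branch that contributes nothing, used to expose the skip path.
    return np.zeros_like(v)

rng = np.random.default_rng(0)
x = 5.0 + 3.0 * rng.standard_normal((2, 16))

pre = x
post = x
for _ in range(24):                      # 24 stacked blocks
    pre = pre_ln_block(pre, zero_sublayer)
    post = post_ln_block(post, zero_sublayer)
```

With a zero branch, the Pre-LN stack returns its input unchanged, showing an unobstructed identity path from the last layer to the first. In the Post-LN stack every block renormalizes the main path, so gradients must pass through each normalization on the way back.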

### Extremely deep networks

Research into networks with thousands of layers has revealed that even residual connections are not a complete solution at extreme depths. Techniques such as stochastic depth (Huang et al., 2016), which randomly drops entire layers during training,[19] and dense connections (as in DenseNet, Huang et al., 2017) have been explored to further improve gradient flow.[20] ReZero (Bachlechner et al., 2020), which multiplies each residual branch by a learnable scalar initialized to zero so that every block starts out as the identity function, has also shown promise for training very deep networks.
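
Two of these ideas, stochastic depth and ReZero, are small modifications of the residual update $$x + f(x)$$. A minimal sketch, where `f` is any residual branch and the function and parameter names are assumptions for illustration:

```python
import numpy as np

def rezero_block(x, f, alpha):
    """ReZero: scale the branch by a learnable scalar alpha, initialized to 0."""
    return x + alpha * f(x)

def stochastic_depth_block(x, f, survival_prob, rng, training=True):
    """Stochastic depth: during training, skip the branch with prob. 1 - survival_prob."""
    if training:
        if rng.random() < survival_prob:
            return x + f(x)
        return x                          # branch dropped: the block is the identity
    # At inference every block is kept, with the branch scaled by its survival probability.
    return x + survival_prob * f(x)

def branch(v):
    # Toy residual branch standing in for a real layer.
    return 2.0 * v

rng = np.random.default_rng(0)
x = np.ones(4)
at_init = rezero_block(x, branch, alpha=0.0)    # equals x: the block starts as identity
dropped = stochastic_depth_block(x, branch, survival_prob=0.0, rng=rng)
```

In both cases a block can act as a pure identity mapping, so the gradient reaches earlier layers without passing through the branch.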

### Signal propagation theory

Theoretical work on signal propagation in deep networks (sometimes called "mean field theory" for neural networks) has provided a deeper understanding of the vanishing gradient problem. Depending on the variance of its weights and biases at initialization, a randomly initialized network falls into an ordered phase (where signals and gradients vanish with depth) or a chaotic phase (where they explode); the boundary between the two is called the "edge of chaos." Initialization and normalization strategies that place the network at this critical boundary tend to produce the most trainable networks.
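
The phase boundary can be located numerically. The sketch below follows the formulation commonly used in the mean-field literature, which is an assumption here and not something the article defines: a wide fully connected tanh network with weight variance $$\sigma_w^2 / \text{fan\_in}$$ and bias variance $$\sigma_b^2$$, whose pre-activation variance settles at a fixed point $$q^*$$. The quantity $$\chi = \sigma_w^2 \, \mathbb{E}[\tanh'(\sqrt{q^*} z)^2]$$ with $$z \sim \mathcal{N}(0, 1)$$ is the average per-layer factor applied to the squared gradient norm: $$\chi < 1$$ is the ordered phase, $$\chi > 1$$ the chaotic phase.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# Gauss-Hermite quadrature for expectations over z ~ N(0, 1).
_nodes, _weights = hermegauss(101)
_weights = _weights / np.sqrt(2.0 * np.pi)      # normalize so the weights sum to 1

def gaussian_mean(f):
    return float(np.sum(_weights * f(_nodes)))

def chi(sigma_w, sigma_b=0.0, iters=500):
    """Per-layer gradient gain of a wide tanh network at its variance fixed point."""
    q = 1.0
    for _ in range(iters):                      # iterate the variance map to its fixed point
        q = sigma_w ** 2 * gaussian_mean(lambda z: np.tanh(np.sqrt(q) * z) ** 2) + sigma_b ** 2
    # tanh'(u) = 1 - tanh(u)^2, so the squared derivative is (1 - tanh(u)^2)^2.
    return sigma_w ** 2 * gaussian_mean(lambda z: (1.0 - np.tanh(np.sqrt(q) * z) ** 2) ** 2)

ordered = chi(0.5)    # below 1: gradients shrink with depth
chaotic = chi(2.0)    # above 1: gradients grow with depth
print(f"chi(0.5) = {ordered:.3f}, chi(2.0) = {chaotic:.3f}")
```

Under these assumptions and with zero bias variance, the boundary $$\chi = 1$$ sits at $$\sigma_w = 1$$; after $$L$$ layers the squared gradient norm scales roughly as $$\chi^L$$.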

### Training stability in large-scale models

As models scale to billions or trillions of parameters, training stability becomes increasingly important. Gradient-related instabilities, including both vanishing and exploding gradients, are among the leading causes of training runs failing or producing suboptimal results. Techniques such as gradient accumulation, mixed-precision training, and careful learning rate scheduling all interact with gradient flow dynamics and must be tuned to ensure stable training.
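
Mixed precision gives a literal example of a gradient vanishing for numerical reasons: a value that float32 can represent may round to zero in float16. The sketch below illustrates this with loss scaling, a common remedy that the text above does not name and which is included here as an assumption; the gradient value and the scale factor are arbitrary illustrative choices.

```python
import numpy as np

true_grad = np.float32(1e-8)             # a small but nonzero gradient

# Casting directly to float16 underflows: the value is below half of the
# smallest positive float16 number (about 6e-8).
naive = np.float16(true_grad)

# Loss scaling: multiply before the cast, divide again in float32 afterwards.
loss_scale = np.float32(2.0 ** 16)
scaled = np.float16(true_grad * loss_scale)
recovered = np.float32(scaled) / loss_scale

print(f"naive float16: {float(naive)}  recovered: {float(recovered):.3e}")
```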

## Explain like I'm 5 (ELI5)

Imagine a long line of children playing the "telephone" game, where each child whispers a message to the next. By the time the message reaches the end of the line, it has become so faint and garbled that the last child can barely hear it. The vanishing gradient problem works the same way. When a deep neural network is learning, it sends a correction signal backward through its layers, but that signal gets weaker and weaker at each layer. By the time it reaches the first layers, the signal is so tiny that those layers have no idea how to improve.

To fix this, researchers came up with several clever tricks. One trick (ReLU) is like replacing the children with ones who do not accidentally muffle the message. Another trick (residual connections) is like running a straight telephone wire from the beginning of the line to the end so the message can skip all the children in between. A third trick (LSTM) gives each child a notebook to write the message down, so it does not get lost. Together, these tricks let neural networks have hundreds or even thousands of layers while still getting the message through loud and clear.

## References

1. Hochreiter, S. (1991). "Untersuchungen zu dynamischen neuronalen Netzen." Diploma thesis, Technische Universität München.
2. Bengio, Y., Simard, P., & Frasconi, P. (1994). "Learning Long-Term Dependencies with Gradient Descent is Difficult." *IEEE Transactions on Neural Networks*, 5(2), 157-166.
3. Hochreiter, S. & Schmidhuber, J. (1997). "Long Short-Term Memory." *Neural Computation*, 9(8), 1735-1780.
4. Gers, F.A., Schmidhuber, J., & Cummins, F. (1999). "Learning to Forget: Continual Prediction with LSTM." *Proceedings of the 9th International Conference on Artificial Neural Networks (ICANN)*.
5. Hinton, G.E. & Salakhutdinov, R.R. (2006). "Reducing the Dimensionality of Data with Neural Networks." *Science*, 313(5786), 504-507.
6. Glorot, X. & Bengio, Y. (2010). "Understanding the Difficulty of Training Deep Feedforward Neural Networks." *Proceedings of AISTATS*.
7. Nair, V. & Hinton, G.E. (2010). "Rectified Linear Units Improve Restricted Boltzmann Machines." *Proceedings of ICML*.
8. Krizhevsky, A., Sutskever, I., & Hinton, G.E. (2012). "ImageNet Classification with Deep Convolutional Neural Networks." *Advances in Neural Information Processing Systems (NeurIPS)*.
9. Pascanu, R., Mikolov, T., & Bengio, Y. (2013). "On the Difficulty of Training Recurrent Neural Networks." *Proceedings of ICML*, PMLR 28(3):1310-1318.
10. Maas, A.L., Hannun, A.Y., & Ng, A.Y. (2013). "Rectifier Nonlinearities Improve Neural Network Acoustic Models." *Proceedings of ICML Workshop on Deep Learning for Audio, Speech, and Language Processing*.
11. Cho, K., van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation." *Proceedings of EMNLP*.
12. He, K., Zhang, X., Ren, S., & Sun, J. (2015). "Deep Residual Learning for Image Recognition." *Proceedings of CVPR*.
13. He, K., Zhang, X., Ren, S., & Sun, J. (2015). "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification." *Proceedings of ICCV*.
14. Ioffe, S. & Szegedy, C. (2015). "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift." *Proceedings of ICML*.
15. Clevert, D.A., Unterthiner, T., & Hochreiter, S. (2015). "Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)." *arXiv:1511.07289*.
16. Srivastava, R.K., Greff, K., & Schmidhuber, J. (2015). "Highway Networks." *arXiv:1505.00387*.
17. Ba, J.L., Kiros, J.R., & Hinton, G.E. (2016). "Layer Normalization." *arXiv:1607.06450*.
18. Mishkin, D. & Matas, J. (2016). "All You Need is a Good Init." *Proceedings of ICLR*.
19. Huang, G., Sun, Y., Liu, Z., Sedra, D., & Weinberger, K.Q. (2016). "Deep Networks with Stochastic Depth." *Proceedings of ECCV*.
20. Huang, G., Liu, Z., van der Maaten, L., & Weinberger, K.Q. (2017). "Densely Connected Convolutional Networks." *Proceedings of CVPR*.
21. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., & Polosukhin, I. (2017). "Attention Is All You Need." *Advances in Neural Information Processing Systems (NeurIPS)*.
22. Klambauer, G., Unterthiner, T., Maas, A., & Hochreiter, S. (2017). "Self-Normalizing Neural Networks." *Advances in Neural Information Processing Systems (NeurIPS)*.
23. Schmidhuber, J. (2015). "Deep Learning in Neural Networks: An Overview." *Neural Networks*, 61, 85-117.
24. Xiong, R., Yang, Y., He, D., Zheng, K., Zheng, S., Xing, C., Zhang, H., Lan, Y., Wang, L., & Liu, T. (2020). "On Layer Normalization in the Transformer Architecture." *Proceedings of ICML*.