The vanishing gradient problem is a fundamental difficulty encountered when training deep neural networks using backpropagation and gradient-based optimization algorithms. It occurs when gradients of the loss function shrink exponentially as they are propagated backward through the layers of a network, causing the weights in earlier layers to receive negligibly small updates. This results in very slow learning or, in severe cases, no learning at all in those layers. The problem was one of the primary reasons that training deep networks remained impractical for nearly two decades and played a central role in shaping the trajectory of deep learning research.
The vanishing gradient problem was first formally identified by Sepp Hochreiter in his 1991 diploma thesis at the Technische Universität München, titled Untersuchungen zu dynamischen neuronalen Netzen ("Investigations into Dynamic Neural Networks"), supervised by Jürgen Schmidhuber. Hochreiter's thesis provided a rigorous mathematical analysis showing that error signals in recurrent neural networks (RNNs) and deep feedforward networks either shrink or grow exponentially during backpropagation. Because the thesis was written entirely in German, it was not widely read by the broader international research community at the time.
In 1994, Yoshua Bengio, Patrice Simard, and Paolo Frasconi published the influential paper "Learning Long-Term Dependencies with Gradient Descent is Difficult," which independently arrived at conclusions similar to Hochreiter's 1991 analysis. Bengio and colleagues approached the problem from a dynamical systems perspective, demonstrating that for recurrent networks, either the dynamics of the system are robust to noise (in which case gradients vanish), or the gradients explode. Their work formalized the theoretical barriers to learning long-range temporal dependencies with standard gradient descent and brought widespread international attention to the problem.
Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio further expanded on these findings in their 2013 paper "On the Difficulty of Training Recurrent Neural Networks," providing geometric and dynamical systems analyses of both the vanishing and exploding gradient problems. They showed that a sufficient condition for vanishing gradients in RNNs is that the largest singular value of the recurrent weight matrix be less than 1/γ, where γ bounds the magnitude of the activation function's derivative (γ = 1 for tanh, γ = 0.25 for sigmoid).
The vanishing gradient problem arises directly from the chain rule of calculus, which is the mathematical foundation of the backpropagation algorithm. In a deep feedforward neural network with L layers, the gradient of the loss function C with respect to the weights in layer l is computed by multiplying a chain of partial derivatives from the output layer back to layer l:
∂C/∂w_l = ∂C/∂a_L * ∂a_L/∂a_{L-1} * ∂a_{L-1}/∂a_{L-2} * ... * ∂a_{l+1}/∂a_l * ∂a_l/∂w_l
Each term ∂a_{k}/∂a_{k-1} involves the derivative of the activation function at layer k multiplied by the weights of that layer. When many of these intermediate derivatives are small (less than 1), their product shrinks exponentially as the number of layers increases. For a network with n layers where each intermediate derivative has magnitude d (where 0 < d < 1), the gradient reaching the first layer is proportional to d^n, which approaches zero rapidly as n grows.
The problem is particularly severe when using the sigmoid function or the hyperbolic tangent (tanh) as activation functions. The sigmoid function is defined as:
σ(x) = 1 / (1 + e^{-x})
Its derivative is:
σ'(x) = σ(x) * (1 - σ(x))
The maximum value of this derivative occurs at x = 0, where σ(0) = 0.5, giving a maximum derivative of 0.25. For all other input values, the derivative is strictly less than 0.25. As the input moves away from zero in either direction, the sigmoid function saturates (its output approaches 0 or 1), and the derivative approaches zero.
This means that during backpropagation, each layer multiplies the gradient by a factor no greater than 0.25 (and typically much smaller). In a network with 10 layers using sigmoid activations, the gradient reaching the first layer could be attenuated to as little as 0.25^10 ≈ 9.5 x 10^{-7} of its magnitude at the output, effectively making it vanish.
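This attenuation is easy to observe empirically. The following PyTorch sketch (the depth, width, and loss are arbitrary illustrative choices) performs no training; it simply prints the per-layer gradient norms of a deep sigmoid network after a single backward pass:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

depth, width = 10, 64
# A deep MLP with a sigmoid after every hidden layer.
layers = []
for _ in range(depth):
    layers += [nn.Linear(width, width), nn.Sigmoid()]
net = nn.Sequential(*layers)

x = torch.randn(32, width)      # a random mini-batch
loss = net(x).pow(2).mean()     # an arbitrary scalar loss
loss.backward()

# Gradient norms shrink by orders of magnitude toward the first layer.
for i, module in enumerate(net):
    if isinstance(module, nn.Linear):
        print(f"layer {i // 2:2d}: grad norm = {module.weight.grad.norm():.3e}")
```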
The tanh activation function has a maximum derivative of 1.0 at x = 0, but its derivative falls off quickly for inputs away from zero. In practice, the average derivative during training is still well below 1, meaning that tanh networks can also suffer from vanishing gradients, though typically less severely than sigmoid networks.
The following table compares the gradient behavior of common activation functions:
| Activation Function | Maximum Derivative | Derivative Range | Saturation Behavior | Vanishing Gradient Risk |
|---|---|---|---|---|
| Sigmoid | 0.25 (at x = 0) | (0, 0.25] | Saturates at both extremes | Severe |
| Tanh | 1.0 (at x = 0) | (0, 1] | Saturates at both extremes | Moderate |
| ReLU | 1.0 (for x > 0) | {0, 1} | Saturates only for x < 0 | Low |
| Leaky ReLU | 1.0 (for x > 0) | {α, 1} where α ≈ 0.01 | Never fully saturates | Very low |
| ELU | 1.0 (for x > 0) | (0, 1] | Approaches -α for large negative x | Low |
| GELU | ≈1.13 (at x ≈ 1.4) | ≈[-0.13, 1.13] | Smooth; saturates for large negative x | Low |
Consider a simple feedforward network with 5 hidden layers, each using sigmoid activations. Suppose that during a particular forward pass, the sigmoid derivatives at each layer are approximately 0.2 (roughly the value of σ' at |x| = 1). The gradient flowing back from the output to the first hidden layer would be multiplied by:
0.2 * 0.2 * 0.2 * 0.2 * 0.2 = 0.2^5 = 0.00032
This means the first layer receives a gradient that is roughly 3,000 times smaller than the gradient at the output layer. In a 10-layer network with the same conditions, the factor becomes 0.2^10 = 1.024 x 10^{-7}, and in a 20-layer network, it becomes 0.2^20 ≈ 1.05 x 10^{-14}. At such magnitudes, floating-point precision limits make meaningful weight updates impossible.
The table below shows how gradient magnitudes decay with increasing network depth when the per-layer derivative factor is 0.2:
| Number of Layers | Gradient Factor | Approximate Magnitude |
|---|---|---|
| 1 | 0.2^1 | 2.0 x 10^{-1} |
| 5 | 0.2^5 | 3.2 x 10^{-4} |
| 10 | 0.2^10 | 1.0 x 10^{-7} |
| 15 | 0.2^15 | 3.3 x 10^{-11} |
| 20 | 0.2^20 | 1.05 x 10^{-14} |
| 50 | 0.2^50 | ~1.1 x 10^{-35} |
The vanishing gradient problem manifests differently depending on the network architecture. While the underlying mathematical cause is the same (repeated multiplication of small derivatives), the severity and practical consequences vary significantly.
In standard feedforward (fully connected) networks, the gradient for layer l passes through every intermediate layer between l and the output. The depth of the network directly determines how many multiplicative factors are involved. During the 1990s and early 2000s, feedforward networks with more than a few hidden layers were generally considered impractical to train with gradient descent. Networks deeper than about 5 to 7 layers often failed to converge when using sigmoid or tanh activations. The introduction of ReLU, proper initialization, and batch normalization made deep feedforward networks practical starting around 2010 to 2012.
Convolutional neural networks (CNNs) share the same vulnerability to vanishing gradients, but several features of their architecture partially mitigate it. Weight sharing across spatial locations means each convolutional filter receives gradient contributions from many spatial positions, effectively aggregating gradient information. Pooling layers reduce spatial dimensions, shortening the effective depth of the gradient path. Additionally, CNNs historically tended to be shallower than the feedforward networks studied in early vanishing gradient research. Nonetheless, as CNNs grew deeper (VGGNet at 19 layers, GoogLeNet at 22 layers), the vanishing gradient problem became significant. The introduction of residual connections in ResNet (2015) was motivated directly by the observation that very deep plain CNNs suffered from degradation in training accuracy, a symptom closely linked to vanishing gradients.
The vanishing gradient problem is especially acute in recurrent neural networks, where the same set of weights is applied at each time step. When an RNN processes a sequence of length T, the gradient of the loss with respect to the hidden state at time step t involves a product of T - t Jacobian matrices:
∂h_T/∂h_t = ∏_{k=t}^{T-1} ∂h_{k+1}/∂h_k
Each Jacobian matrix involves the recurrent weight matrix and the derivative of the activation function. If the largest singular value of each Jacobian is less than 1, the product of these matrices contracts exponentially, and the network cannot learn dependencies between events separated by many time steps. This is why standard RNNs struggle to learn long-range temporal dependencies in tasks such as language modeling and speech recognition.
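This contraction can be checked numerically. The sketch below (dimensions, input scale, and the 0.9 spectral norm are illustrative assumptions) accumulates the product of per-step Jacobians for a tanh RNN whose recurrent matrix has largest singular value below 1:

```python
import torch

torch.manual_seed(0)
n, T = 32, 100

# Recurrent weight matrix rescaled so its largest singular value is 0.9.
W = torch.randn(n, n)
W = 0.9 * W / torch.linalg.matrix_norm(W, ord=2)

h = torch.zeros(n)
J = torch.eye(n)  # running product of per-step Jacobians
for t in range(T):
    h = torch.tanh(W @ h + 0.1 * torch.randn(n))  # random input term
    # Jacobian of one step: diag(1 - tanh^2(pre-activation)) @ W
    J = torch.diag(1 - h**2) @ W @ J
    if (t + 1) % 20 == 0:
        norm = torch.linalg.matrix_norm(J, ord=2)
        print(f"step {t + 1:3d}: ||dh_t/dh_0|| = {norm:.3e}")
```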
A modest RNN may process sequences of 200 to 400 time steps, which is conceptually equivalent to a feedforward network with that many layers. Experimental studies have shown that gradient magnitudes in vanilla RNNs can drop by a factor exceeding 10^4 between the last and first time steps, making it nearly impossible to learn dependencies at the beginning of a sequence. This severe limitation directly motivated the development of LSTM and GRU architectures.
Transformer architectures largely avoid the vanishing gradient problem through their use of residual connections around every sublayer (self-attention and feed-forward blocks), combined with layer normalization. Because every sublayer has a direct additive shortcut to the input, gradients can flow from the output back through any number of layers without being repeatedly multiplied by small factors. The self-attention mechanism also provides direct gradient paths between any two positions in the input sequence, avoiding the sequential gradient propagation bottleneck of RNNs.
However, vanishing gradients have not been entirely eliminated in transformers. Research on very deep transformers (Xiong et al., 2020; Liu et al., 2020) has shown that the placement of layer normalization matters significantly. Post-LN transformers (where layer normalization is applied after the residual addition) can still suffer from gradient instability in very deep configurations. Pre-LN transformers (where layer normalization is applied before the attention or feed-forward computation, inside the residual branch) produce more stable gradient flow and have become the default choice for training large language models with many layers.
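The difference between the two orderings is easiest to see in code. A minimal sketch of the self-attention sublayer in both styles (the class names are ours; feed-forward sublayers follow the same pattern):

```python
import torch
import torch.nn as nn

class PostLNBlock(nn.Module):
    """Post-LN: normalization applied after the residual addition."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        out, _ = self.attn(x, x, x)
        return self.norm(x + out)   # LN sits directly on the residual path

class PreLNBlock(nn.Module):
    """Pre-LN: normalization applied inside the residual branch."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        y = self.norm(x)
        out, _ = self.attn(y, y, y)
        return x + out              # the identity path bypasses LN entirely
```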
The exploding gradient problem is the opposite phenomenon, where gradients grow exponentially rather than shrinking during backpropagation. This occurs when the intermediate derivatives in the chain rule product are consistently greater than 1 in magnitude. In recurrent neural networks, exploding gradients can occur when the spectral radius of the recurrent weight matrix exceeds 1; Pascanu and colleagues showed this is a necessary condition for explosion.
The consequences of exploding gradients are different from those of vanishing gradients but equally problematic:

- Weight updates become extremely large, causing the optimizer to overshoot minima and destroy previously learned features.
- The loss fluctuates wildly from step to step instead of decreasing steadily.
- Numerical overflow can produce NaN or infinite values in the weights or the loss, halting training entirely.
Exploding gradients are often easier to detect than vanishing gradients because they produce obvious symptoms such as NaN loss values or wildly fluctuating training curves. They are also somewhat easier to mitigate, primarily through gradient clipping.
It is worth noting that vanishing and exploding gradients are not mutually exclusive within a single network. Different paths through the network can experience different gradient magnitudes, and some layers may have vanishing gradients while others have exploding gradients simultaneously.
The vanishing gradient problem manifests in several observable ways during the training of deep neural networks:

- Training loss decreases very slowly or plateaus early, even though the model has unused capacity.
- Weights in the early layers change very little between epochs, while later layers continue to update.
- Per-layer gradient norms decay sharply, often by several orders of magnitude, from the output layer back to the input layer.
- Adding more layers degrades performance rather than improving it.
Identifying vanishing gradients in practice requires monitoring gradient statistics during training. Several diagnostic approaches are commonly used:

- Logging the L2 norm of the gradient for each layer's weights at regular intervals and comparing early layers to late layers.
- Plotting per-layer gradient histograms to see whether early-layer distributions collapse toward zero.
- Tracking the ratio of each update's magnitude to the corresponding weight's magnitude (a common rule of thumb flags ratios far below about 10^{-3}).
- Monitoring activation statistics for signs of saturation in sigmoid or tanh layers.
Modern deep learning frameworks such as PyTorch and TensorFlow provide built-in tools and hooks for extracting gradient statistics. Third-party libraries and visualization platforms like TensorBoard and Weights & Biases offer gradient histogram plots and per-layer gradient tracking dashboards.
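As a concrete example, a helper along the following lines (the names and the 10^{-8} threshold are illustrative, not taken from any particular library) can be called after each backward pass in PyTorch:

```python
import torch

def log_grad_norms(model: torch.nn.Module) -> dict:
    """Collect the L2 gradient norm of every parameter after backward()."""
    return {
        name: p.grad.norm().item()
        for name, p in model.named_parameters()
        if p.grad is not None
    }

# Typical usage inside a training loop (model and loss assumed to exist):
#   loss.backward()
#   norms = log_grad_norms(model)
#   tiny = {k: v for k, v in norms.items() if v < 1e-8}
#   if tiny:
#       print("possible vanishing gradients in:", sorted(tiny))
```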
Over the past three decades, researchers have developed numerous techniques to address or circumvent the vanishing gradient problem. These solutions have collectively enabled the training of networks with hundreds or even thousands of layers.
| Technique | Year Introduced | Key Authors | Mechanism | Applicable To |
|---|---|---|---|---|
| LSTM | 1997 | Hochreiter, Schmidhuber | Gated cell state with additive updates preserves gradient flow | RNNs |
| GRU | 2014 | Cho et al. | Simplified gating mechanism with update and reset gates | RNNs |
| ReLU Activation | 2010 | Nair, Hinton | Constant gradient of 1 for positive inputs; no saturation | Feedforward, CNNs |
| Leaky ReLU | 2013 | Maas et al. | Small positive slope for negative inputs prevents dead neurons | Feedforward, CNNs |
| Parametric ReLU (PReLU) | 2015 | He et al. | Learnable slope parameter for negative inputs | Feedforward, CNNs |
| ELU | 2015 | Clevert et al. | Smooth exponential curve for negative inputs pushes mean activations toward zero | Feedforward, CNNs |
| Xavier/Glorot Initialization | 2010 | Glorot, Bengio | Scales initial weights based on fan-in and fan-out to preserve variance | All network types |
| He Initialization | 2015 | He et al. | Variance scaling adapted for ReLU activations (2/fan-in) | Networks using ReLU |
| LSUV Initialization | 2016 | Mishkin, Matas | Layer-sequential unit-variance; orthonormal init + variance normalization | All network types |
| Batch Normalization | 2015 | Ioffe, Szegedy | Normalizes layer inputs to reduce internal covariate shift | All network types |
| Layer Normalization | 2016 | Ba, Kiros, Hinton | Normalizes across features rather than batch dimension | RNNs, Transformers |
| Residual Connections (Skip Connections) | 2015 | He et al. | Identity shortcut paths allow gradients to bypass layers | Very deep CNNs, Transformers |
| Highway Networks | 2015 | Srivastava, Greff, Schmidhuber | Learned gating mechanisms for information and gradient flow | Deep feedforward networks |
| Gradient Clipping | 2012/2013 | Pascanu et al. | Caps gradient norm at a threshold to prevent explosion | All network types |
The Rectified Linear Unit (ReLU), popularized by Vinod Nair and Geoffrey Hinton in 2010, was one of the most important breakthroughs in addressing the vanishing gradient problem. ReLU is defined as f(x) = max(0, x), which gives a derivative of exactly 1 for all positive inputs and 0 for negative inputs. Unlike the sigmoid and tanh functions, ReLU does not saturate for positive values, meaning gradients can flow through the network without being attenuated.
However, ReLU has its own limitation known as the "dying ReLU" problem, where neurons that receive negative inputs consistently output zero and stop learning entirely. Several variants have been developed to address this:

- Leaky ReLU (Maas et al., 2013) replaces the zero slope for negative inputs with a small constant slope (typically 0.01), so gradients never vanish completely.
- Parametric ReLU (PReLU) (He et al., 2015) makes the negative-side slope a learnable parameter.
- ELU (Clevert et al., 2015) uses a smooth exponential curve for negative inputs, which keeps mean activations closer to zero.
Long Short-Term Memory (LSTM) networks, introduced by Hochreiter and Schmidhuber in 1997, were specifically designed to solve the vanishing gradient problem in recurrent neural networks. The key innovation is the cell state, a separate memory pathway that runs through the entire sequence with only linear interactions (additions and element-wise multiplications by gate values). This creates what Hochreiter called a "Constant Error Carousel" (CEC) that allows gradients to flow backward through time without exponential decay.
LSTM achieves this through three gating mechanisms:

- The forget gate decides what information to discard from the cell state.
- The input gate decides what new information to write to the cell state.
- The output gate decides how much of the cell state to expose as the hidden state.
Because the cell state update is additive rather than multiplicative (c_t = f_t * c_{t-1} + i_t * candidate), the gradient with respect to the cell state at an earlier time step does not involve repeated multiplication by the same weight matrix. This allows LSTM networks to learn dependencies spanning hundreds of time steps.
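A toy sketch of just this update makes the gated, additive gradient path explicit (shapes and gate values here are illustrative):

```python
import torch

def cell_state_update(c_prev, f_gate, i_gate, candidate):
    """One LSTM cell-state update: additive, with element-wise gating.

    All arguments are tensors of the same shape; the gates are assumed
    to already lie in (0, 1) via a sigmoid.
    """
    # Gradients w.r.t. c_prev flow through f_gate element-wise, with no
    # repeated multiplication by a recurrent weight matrix.
    return f_gate * c_prev + i_gate * candidate

c = torch.ones(4, requires_grad=True)
f = torch.full((4,), 0.95)   # a forget gate near 1 preserves the gradient
c_next = cell_state_update(c, f, torch.rand(4), torch.rand(4))
c_next.sum().backward()
print(c.grad)                # equals f: the gradient is gated, not crushed
```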
The Gated Recurrent Unit (GRU), proposed by Kyunghyun Cho and colleagues in 2014, is a simplified variant of LSTM that combines the forget and input gates into a single "update gate" and merges the cell state with the hidden state. GRU uses only two gates (update and reset) instead of three, making it computationally more efficient while achieving comparable performance to LSTM on many tasks.
Residual connections (also called skip connections), introduced by Kaiming He and colleagues in their 2015 paper "Deep Residual Learning for Image Recognition," represent one of the most important architectural innovations for training very deep networks. The core idea is to reformulate layers as learning residual functions with reference to the layer inputs.
In a standard network, a block of layers learns a mapping H(x). In a residual network, the block instead learns the residual F(x) = H(x) - x, and the output is computed as F(x) + x. The addition of the identity shortcut connection means that during backpropagation, the gradient always has a direct path through the identity connection (with a gradient of 1), regardless of how small the gradient through the learned layers F(x) might be.
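In code, the reformulation is a one-line change to the forward pass. A minimal fully connected residual block is sketched below (He et al. used convolutional blocks; linear layers are used here for brevity):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = F(x) + x: the identity shortcut gives gradients a direct path."""
    def __init__(self, dim: int):
        super().__init__()
        self.f = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x):
        # Even if the gradient through self.f(x) is tiny, the "+ x" term
        # contributes an identity Jacobian, so the gradient cannot vanish.
        return self.f(x) + x
```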
This simple modification had a profound impact. ResNet demonstrated that networks with 152 layers could be trained effectively, winning the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2015 with a top-5 error rate of 3.57%, surpassing human-level performance. Follow-up work showed that networks with over 1,000 layers could be trained using residual connections.
Residual connections have since become a standard component in nearly all deep architectures, including transformer models, where they are used around both the self-attention and feed-forward sublayers.
Highway networks, introduced by Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber in May 2015, predated ResNet by several months and were directly inspired by the gating mechanisms of LSTM. In a highway layer, the output is computed as:
y = H(x) * T(x) + x * (1 - T(x))
Here, H(x) is a nonlinear transformation, T(x) is a learned "transform gate" with values between 0 and 1, and (1 - T(x)) serves as the "carry gate." When T(x) is close to 0, the layer simply passes the input through (the "carry" path), allowing unimpeded gradient flow. When T(x) is close to 1, the layer applies the full nonlinear transformation.
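A minimal sketch of a single highway layer follows (the ReLU choice for H and the initial gate bias of -2 reflect common practice from the paper, but are otherwise assumptions):

```python
import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    """y = H(x) * T(x) + x * (1 - T(x)), after Srivastava et al. (2015)."""
    def __init__(self, dim: int):
        super().__init__()
        self.h = nn.Linear(dim, dim)   # nonlinear transform H
        self.t = nn.Linear(dim, dim)   # transform gate T
        # A negative gate bias makes early training favor the carry path.
        nn.init.constant_(self.t.bias, -2.0)

    def forward(self, x):
        h = torch.relu(self.h(x))
        t = torch.sigmoid(self.t(x))
        return h * t + x * (1 - t)
```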
The Jacobian of a highway layer includes an identity term from the carry path, which helps preserve gradient magnitude across many layers. Srivastava and colleagues demonstrated that highway networks with over 900 layers could be trained with simple stochastic gradient descent with momentum. While residual connections eventually became more widely adopted due to their simplicity (they can be viewed as highway networks with the gating removed, so that both the transform and carry paths are always fully open), highway networks provided important theoretical insight into how gating mechanisms can facilitate gradient flow in very deep networks.
Proper weight initialization is critical for preventing both vanishing and exploding gradients at the start of training. Three initialization strategies have become standard:
Xavier/Glorot Initialization (Glorot and Bengio, 2010): Proposed in the paper "Understanding the Difficulty of Training Deep Feedforward Neural Networks," this method initializes weights from a distribution with variance scaled according to the number of input and output connections (fan-in and fan-out) of each layer:
Var(w) = 2 / (fan_in + fan_out)
This keeps the variance of activations and gradients approximately constant across layers when using sigmoid or tanh activation functions. The derivation assumes that activations are approximately linear near their operating point, which is valid for tanh near zero but not for ReLU.
He Initialization (He et al., 2015): Proposed in "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification," this method adjusts the variance for ReLU activations, which zero out roughly half of their inputs:
Var(w) = 2 / fan_in
He initialization doubles the variance compared to Xavier initialization to compensate for the halving effect of ReLU, and it has become the default initialization for networks using ReLU and its variants.
LSUV Initialization (Mishkin and Matas, 2016): Layer-Sequential Unit-Variance (LSUV) initialization, proposed in the paper "All You Need is a Good Init," takes a data-driven approach. The method consists of two steps: first, weights in each layer are initialized with orthonormal matrices; second, a mini-batch of data is passed through the network, and the weights in each layer are iteratively rescaled so that the output variance of each layer equals one. LSUV has been shown to match or outperform more complex training schemes on datasets including CIFAR-10/100 and ImageNet, and it works well across different activation functions (ReLU, maxout, tanh).
| Initialization Method | Target Activation | Variance Formula | Key Assumption |
|---|---|---|---|
| Xavier/Glorot (2010) | Sigmoid, Tanh | 2 / (fan_in + fan_out) | Linear activations |
| He/Kaiming (2015) | ReLU and variants | 2 / fan_in | Half of inputs zeroed by ReLU |
| LSUV (2016) | Any | Data-driven (unit variance per layer) | None (empirical normalization) |
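Both Xavier and He initialization are available as built-ins in modern frameworks; LSUV, being data-driven, additionally requires forward passes over a mini-batch. A PyTorch sketch of applying the first two to a model's linear layers (the helper function is ours):

```python
import torch.nn as nn

def init_weights(module: nn.Module, scheme: str = "he") -> None:
    """Apply Xavier or He initialization to every linear layer."""
    if isinstance(module, nn.Linear):
        if scheme == "xavier":
            nn.init.xavier_normal_(module.weight)   # Var = 2/(fan_in + fan_out)
        else:
            nn.init.kaiming_normal_(module.weight,  # Var = 2/fan_in
                                    nonlinearity="relu")
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# Usage: model.apply(lambda m: init_weights(m, scheme="he"))
```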
Batch normalization, introduced by Sergey Ioffe and Christian Szegedy in 2015, normalizes the inputs to each layer by subtracting the batch mean and dividing by the batch standard deviation. The technique then applies learned scale and shift parameters to restore representational capacity. Ioffe and Szegedy originally described the motivation as reducing "internal covariate shift" (the change in the distribution of layer inputs during training), though later research has questioned whether this is the true mechanism behind its effectiveness.
Regardless of the theoretical explanation, batch normalization has a clear practical effect on the vanishing gradient problem: by keeping layer inputs in a normalized range, it prevents activations from drifting into the saturated regions of sigmoid or tanh functions, where derivatives are near zero. Batch normalization also allows the use of higher learning rates and reduces sensitivity to weight initialization, both of which contribute to more stable and faster training.
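A small numerical sketch illustrates this effect (the scale factor of 8 is an arbitrary choice to force saturation): normalizing pre-activations keeps the sigmoid derivative much closer to its 0.25 maximum:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Pre-activations with a large scale push a sigmoid deep into saturation.
x = torch.randn(256, 32) * 8.0

bn = nn.BatchNorm1d(32)          # training mode: uses batch statistics
sig = torch.sigmoid

raw = sig(x) * (1 - sig(x))      # sigma'(x) without normalization
xb = bn(x)
normed = sig(xb) * (1 - sig(xb)) # sigma'(x) after batch normalization

print(f"mean sigma' without BN: {raw.mean():.4f}")     # close to 0 (saturated)
print(f"mean sigma' with BN:    {normed.mean():.4f}")  # much closer to 0.25
```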
Variants of normalization have been developed for settings where batch normalization is less effective. Layer normalization (Ba, Kiros, and Hinton, 2016) normalizes across features rather than across the batch dimension, making it suitable for recurrent networks and transformers. Group normalization (Wu and He, 2018) divides channels into groups and normalizes within each group, working well with small batch sizes.
Gradient clipping is a technique that directly addresses the exploding gradient problem by capping the magnitude of gradients during backpropagation. Two common approaches exist:

- Value clipping clamps each individual gradient component to a fixed interval such as [-c, c].
- Norm clipping rescales the entire gradient vector whenever its global L2 norm exceeds a threshold, preserving the gradient's direction while limiting its magnitude.
Gradient norm clipping, as proposed by Pascanu, Mikolov, and Bengio (2013), has become the standard approach and is widely used when training RNNs and large transformer models. While gradient clipping primarily targets the exploding gradient problem, by preventing extreme gradient values it also contributes to more stable training overall.
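In PyTorch, norm clipping is a single call placed between the backward pass and the optimizer step. A sketch of a training step with clipping (the function and its arguments are illustrative; max_norm = 1.0 is a common but problem-dependent choice):

```python
import torch

def training_step(model, optimizer, loss_fn, x, y, max_norm=1.0):
    """One optimization step with gradient-norm clipping."""
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    # Rescale all gradients together if their global L2 norm exceeds
    # max_norm, preserving direction while capping magnitude.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    return loss.item()
```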
The vanishing gradient problem had a profound influence on the history and development of deep learning and, by extension, the broader field of artificial intelligence.
After Hochreiter's 1991 analysis and Bengio's 1994 paper, many researchers concluded that training deep networks with gradient descent was fundamentally impractical. This contributed to a shift toward shallow architectures, typically with only one or two hidden layers. During this period, support vector machines and other kernel methods gained prominence as alternatives that did not require deep architectures.
The difficulty of training deep networks also contributed to reduced interest and funding for neural network research in the late 1990s, a period sometimes characterized as part of the broader "AI winter." Researchers who continued working on neural networks, such as Yann LeCun, Geoffrey Hinton, and Yoshua Bengio (later recognized as the "Godfathers of Deep Learning"), often faced skepticism from the wider machine learning community.
A partial breakthrough came in 2006 when Geoffrey Hinton and Ruslan Salakhutdinov demonstrated that deep belief networks could be trained effectively using a layer-by-layer unsupervised pretraining strategy. By pretraining each layer as a restricted Boltzmann machine (RBM) and then fine-tuning the entire network with backpropagation, they circumvented the vanishing gradient problem during the initial phase of training. This approach showed that deep architectures could learn useful hierarchical representations, reigniting interest in deep learning.
Schmidhuber had explored a related layer-wise pretraining approach for recurrent neural networks as early as 1992. Other pretraining methods followed, including the use of denoising autoencoders (Vincent et al., 2008).
The true revolution came around 2012, driven by the convergence of multiple solutions to the vanishing gradient problem together with increases in computational power from GPUs. Key milestones include:

- AlexNet (2012), which won the ImageNet competition by a wide margin using ReLU activations, dropout, and GPU training.
- Principled initialization schemes, Xavier/Glorot (2010) and He (2015), which kept gradients well scaled from the start of training.
- Batch normalization (2015), which stabilized the distributions of layer inputs.
- Residual networks (2015), which made networks with more than a hundred layers trainable.
Each of these advances directly or indirectly addressed the vanishing gradient problem, collectively transforming deep learning from a niche research topic into the dominant paradigm in artificial intelligence.
Although the vanishing gradient problem has been largely managed in standard architectures through the techniques described above, it remains an active concern in several areas of modern deep learning research.
Transformer architectures used in large language models (LLMs) such as GPT-4, Claude, and Gemini can have dozens or even hundreds of layers. While residual connections and layer normalization mitigate the vanishing gradient problem, training these models still requires careful attention to architectural choices. Pre-norm architectures (which apply layer normalization before attention and feed-forward blocks rather than after) have been found to produce more stable gradient flow in very deep transformers. The choice of normalization strategy (Pre-LN vs. Post-LN) can determine whether a deep transformer trains successfully or suffers from gradient degradation.
Research into networks with thousands of layers has revealed that even residual connections are not a complete solution at extreme depths. Techniques such as stochastic depth (Huang et al., 2016), which randomly drops entire layers during training, and dense connections (as in DenseNet, Huang et al., 2017) have been explored to further improve gradient flow. ReZero initialization (Bachlechner et al., 2020), which initializes residual connections with a learnable scalar set to zero, has also shown promise for training very deep networks.
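The ReZero idea in particular amounts to a small change to the residual block, sketched below (the choice of F here is illustrative):

```python
import torch
import torch.nn as nn

class ReZeroBlock(nn.Module):
    """Residual block gated by a learnable scalar initialized to zero.

    y = x + alpha * F(x), with alpha = 0 at initialization, so every
    block starts out as the identity and gradients flow unimpeded.
    """
    def __init__(self, dim: int):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                               nn.Linear(dim, dim))
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        return x + self.alpha * self.f(x)
```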
Recent theoretical work on signal propagation in deep networks (sometimes called "mean field theory" for neural networks) has provided a deeper understanding of the vanishing gradient problem. Researchers have identified that networks exist on an "edge of chaos" between ordered phases (where signals and gradients vanish) and chaotic phases (where they explode). Initialization and normalization strategies that place the network at this critical boundary tend to produce the most trainable networks.
As models scale to billions or trillions of parameters, training stability becomes increasingly important. Gradient-related instabilities, including both vanishing and exploding gradients, are among the leading causes of training runs failing or producing suboptimal results. Techniques such as gradient accumulation, mixed-precision training, and careful learning rate scheduling all interact with gradient flow dynamics and must be tuned to ensure stable training.
Imagine a long line of children playing the "telephone" game, where each child whispers a message to the next. By the time the message reaches the end of the line, it has become so faint and garbled that the last child can barely hear it. The vanishing gradient problem works the same way. When a deep neural network is learning, it sends a correction signal backward through its layers, but that signal gets weaker and weaker at each layer. By the time it reaches the first layers, the signal is so tiny that those layers have no idea how to improve.
To fix this, researchers came up with several clever tricks. One trick (ReLU) is like replacing the children with ones who do not accidentally muffle the message. Another trick (residual connections) is like running a straight telephone wire from the beginning of the line to the end so the message can skip all the children in between. A third trick (LSTM) gives each child a notebook to write the message down, so it does not get lost. Together, these tricks let neural networks have hundreds or even thousands of layers while still getting the message through loud and clear.