A residual connection (also called a skip connection or shortcut connection) is a structural element in neural networks that adds the input of a layer or block directly to its output, producing y = F(x) + x instead of y = F(x). Introduced by Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun in their 2015 paper on deep residual learning, residual connections solved the degradation problem that had prevented effective training of very deep networks [1]. The idea was first demonstrated in convolutional neural networks for image recognition, where it enabled training of networks with over 100 layers and won the ILSVRC 2015 competition. Residual connections have since become a universal component of modern deep learning architectures, appearing in every transformer, every large language model, and most other deep network designs.
Before residual connections, the deep learning community faced a frustrating paradox. In theory, a deeper network should never perform worse than a shallower one: the extra layers could simply learn to be identity mappings, passing the input through unchanged, and the deeper network would at minimum match the shallower network's performance.
In practice, this did not happen. When researchers trained plain (non-residual) networks with increasing depth, both training and test error increased beyond a certain depth. A 56-layer plain network performed worse than a 20-layer plain network on CIFAR-10, not because of overfitting (the training error was also higher) but because the optimization process could not find a good solution [1].
This phenomenon, which He et al. called the "degradation problem," is distinct from the vanishing gradient problem. Even with techniques like batch normalization that ensure gradients do not literally vanish, deep plain networks still degraded. The problem is that standard nonlinear layers make it surprisingly difficult for the optimizer to learn identity-like mappings when that is the appropriate solution.
He et al. proposed a simple but powerful reformulation. Instead of asking a stack of layers to directly learn a desired mapping H(x), they asked the layers to learn the residual function:
F(x) = H(x) - x
The original mapping is then recovered as:
H(x) = F(x) + x
The key insight is that if the optimal transformation is close to the identity (i.e., H(x) is close to x), then F(x) is close to zero. Learning a function close to zero is much easier for the optimizer than learning the identity through a stack of nonlinear layers. The optimizer can push the weights of F toward zero to approximate the identity, whereas achieving the same result through a chain of convolutional layers, activation functions, and normalizations is much harder [1].
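This intuition can be checked directly. The numpy sketch below (sizes, scales, and names are illustrative) builds a residual block whose residual function F has near-zero weights; the block as a whole is then almost exactly the identity, which is the easy-to-reach solution the reformulation provides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy residual block: F is a small two-layer ReLU MLP. When its weights
# are near zero (which the optimizer can easily arrange), F(x) ~ 0 and
# the block computes y = F(x) + x ~ x, i.e. the identity mapping.
d = 4
W1 = rng.normal(scale=1e-3, size=(d, d))
W2 = rng.normal(scale=1e-3, size=(d, d))

def F(x):
    return W2 @ np.maximum(W1 @ x, 0.0)   # the residual function

def residual_block(x):
    return F(x) + x                       # identity shortcut

x = np.array([1.0, -2.0, 0.5, 3.0])
y = residual_block(x)
print(np.max(np.abs(y - x)))  # tiny: the block is nearly the identity
```

Reaching the same near-identity behavior with a plain stack would require the nonlinear layers themselves to represent the identity, which small random weights do not do.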
The residual connection itself is simply the addition y = F(x) + x, where x is the input to the block and F(x) is the output of the stacked layers. This addition does not introduce any extra parameters or computational complexity. The shortcut is sometimes called an "identity shortcut" because it implements the identity function: the input passes through unchanged and is added to the output.
When the dimensions of F(x) and x do not match (for example, when the number of channels changes between blocks), a linear projection W_s is applied to x:
y = F(x) + W_s * x
In practice, this projection is implemented as a 1x1 convolution (in CNNs) or a linear layer (in transformers).
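A minimal numpy sketch of the projection shortcut, with plain matrices standing in for both the block's layers and the 1x1 convolution (all names and sizes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# When F changes the feature dimension (here 4 -> 8), the shortcut needs
# a learned linear projection W_s so the addition is well-defined.
# In a CNN, W_s would be a 1x1 convolution; in a transformer, a linear layer.
d_in, d_out = 4, 8
W_f = rng.normal(scale=0.1, size=(d_out, d_in))  # stands in for the stacked layers
W_s = rng.normal(scale=0.1, size=(d_out, d_in))  # projection shortcut

def F(x):
    return W_f @ x

def block(x):
    return F(x) + W_s @ x   # y = F(x) + W_s * x

x = np.ones(d_in)
print(block(x).shape)  # (8,)
```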
The most important mathematical property of residual connections is their effect on gradient flow during backpropagation. Consider a network with L residual blocks. Let x_l denote the input to block l, and let F_l denote the function computed by block l. Then:
x_{l+1} = x_l + F_l(x_l)
Unrolling this recursion from layer l to a later layer L:
x_L = x_l + sum(F_i(x_i)) for i = l to L-1
Taking the gradient of a loss function E with respect to x_l:
dE/dx_l = dE/dx_L * (1 + d/dx_l * sum(F_i(x_i)))
The crucial term is the "1" inside the parentheses. This means the gradient always has a direct path from the loss back to any earlier layer, even if the gradient through the residual blocks d/dx_l * sum(F_i(x_i)) is small. The residual connection acts as a "gradient highway" that prevents gradients from vanishing entirely, regardless of network depth [1].
Without residual connections, the gradient must flow through every layer's transformation sequentially. If each layer's Jacobian has eigenvalues less than 1, the gradient shrinks exponentially with depth (vanishing gradients). If the eigenvalues are greater than 1, the gradient grows exponentially (exploding gradients). Residual connections add the identity to this chain, ensuring the gradient always has a component that neither shrinks nor grows.
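A scalar caricature makes this concrete. Treat each layer's Jacobian as a single small number j: a plain network multiplies the gradient by j at every layer, while a residual network multiplies it by (1 + j). The numbers below are illustrative, not from any trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Backprop through depth L, scalar caricature. Each plain layer scales
# the gradient by its local derivative j; each residual layer scales it
# by (1 + j) because of the identity term contributed by the shortcut.
L = 50
j = rng.uniform(-0.1, 0.1, size=L)   # per-layer derivatives with |j| << 1

plain_grad = np.prod(j)          # shrinks like 0.1^50: numerically zero
residual_grad = np.prod(1 + j)   # stays within a modest factor of 1

print(abs(plain_grad), residual_grad)
```

The plain product vanishes to numerical zero, while the residual product remains usable at any depth, which is the "gradient highway" effect in miniature.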
Veit et al. (2016) proposed an alternative perspective: residual networks can be understood as an implicit ensemble of many shallow networks. Because each residual block can be either "used" or "bypassed" (via the shortcut), a network with n residual blocks implicitly represents 2^n possible paths of different lengths. Experiments showed that removing individual blocks from a trained ResNet causes only modest performance degradation, consistent with the ensemble interpretation [2].
This view helps explain why residual networks are robust to layer dropout and why they exhibit graceful degradation rather than catastrophic failure when individual components are perturbed.
The original ResNet paper proposed several architectures of increasing depth:
| Architecture | Layers | Parameters | Top-5 Error (ImageNet) |
|---|---|---|---|
| ResNet-18 | 18 | 11.7M | 10.92% |
| ResNet-34 | 34 | 21.8M | 9.46% |
| ResNet-50 | 50 | 25.6M | 7.48% |
| ResNet-101 | 101 | 44.5M | 6.58% |
| ResNet-152 | 152 | 60.2M | 6.16% |
Each residual block in ResNet-18 and ResNet-34 contains two 3x3 convolutional layers. ResNet-50 and deeper variants use a "bottleneck" design with three layers: a 1x1 convolution to reduce dimensionality, a 3x3 convolution, and a 1x1 convolution to restore dimensionality. This bottleneck design reduces computational cost while maintaining representational capacity [1].
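The savings can be checked with a rough parameter count (ignoring biases and batch normalization), using the paper's 256 -> 64 -> 256 bottleneck widths:

```python
# Rough per-block parameter counts comparing the "basic" two-layer design
# (ResNet-18/34) with the bottleneck design (ResNet-50 and deeper).
# Biases and batch-norm parameters are ignored for simplicity.

def conv_params(k, c_in, c_out):
    return k * k * c_in * c_out   # weights of a k x k convolution

basic = 2 * conv_params(3, 256, 256)       # two 3x3 convs at full width

bottleneck = (conv_params(1, 256, 64)      # 1x1 reduce dimensionality
              + conv_params(3, 64, 64)     # 3x3 at reduced width
              + conv_params(1, 64, 256))   # 1x1 restore dimensionality

print(basic, bottleneck)  # 1179648 69632
```

At these widths the bottleneck block uses roughly 17x fewer weights than a basic block, which is what makes 100+ layer networks affordable.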
ResNet-152 won first place in the ILSVRC 2015 classification task with a top-5 error rate of 3.57% (using an ensemble), well below the estimated human-level error of around 5% on this benchmark. The same residual learning framework also won the ILSVRC 2015 detection and localization tasks, as well as the COCO 2015 detection and segmentation tasks [1].
The transformer architecture, introduced by Vaswani et al. (2017), uses residual connections around every sublayer [3]. Each transformer layer contains two sublayers: a multi-head self-attention mechanism and a position-wise feed-forward network (FFN).
Both sublayers are wrapped in residual connections. In the original transformer (Post-Norm configuration), the computation for each sublayer is:
y = LayerNorm(x + Sublayer(x))
Here, x is the input, Sublayer(x) is the output of the attention or FFN computation, x + Sublayer(x) is the residual addition, and LayerNorm normalizes the result.
In the Pre-Norm configuration used by most modern models (including GPT, LLaMA, and Mistral), the order is rearranged:
y = x + Sublayer(LayerNorm(x))
The layer normalization is applied to the input before the sublayer, and the raw (unnormalized) input is used for the residual addition. This means the residual stream carries unnormalized activations, which has been shown to improve gradient flow and training stability [4].
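The two wirings can be sketched side by side in numpy; `sublayer` below is an illustrative stand-in for attention or the FFN, not any particular library's API:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize to zero mean and unit variance (learned scale/bias omitted).
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def sublayer(x):
    return 0.1 * x            # stand-in for attention or the FFN

def post_norm(x):
    # Original transformer / BERT: normalize AFTER the residual addition.
    return layer_norm(x + sublayer(x))

def pre_norm(x):
    # GPT-2, LLaMA, Mistral: normalize the sublayer input only; the raw
    # (unnormalized) x is what flows through the residual addition.
    return x + sublayer(layer_norm(x))

x = np.array([1.0, 2.0, 3.0, 4.0])
print(post_norm(x).mean())  # ~0: the stream itself gets normalized
print(pre_norm(x).mean())   # ~2.5: the raw input survives in the stream
```

The printed means show the structural difference: Post-Norm renormalizes the entire stream each sublayer, while Pre-Norm leaves the accumulated stream untouched.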
In transformer architectures, the sequence of residual additions creates what researchers call the "residual stream." This is a conceptual pathway through the network where information accumulates additively. Each attention layer and each FFN layer contributes an additive update to the residual stream:
x_0 -> x_0 + attn_1(x_0) -> x_0 + attn_1(x_0) + ffn_1(...) -> ... -> x_0 + sum of all layer outputs
This view has been influential in mechanistic interpretability research, where the residual stream is treated as a shared communication channel that all layers read from and write to. The residual connections ensure that information from early layers remains accessible to later layers without being corrupted by intervening transformations [5].
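A toy numpy loop shows the additive structure exactly: the stream at any depth decomposes into the initial embedding plus the sum of every sublayer's contribution (sizes and the stand-in update function are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

# The residual stream as an accumulator: each layer's output equals the
# initial embedding plus the sum of all sublayer contributions so far.
d, n_layers = 8, 6
x0 = rng.normal(size=d)   # initial embedding written to the stream
updates = []

x = x0
for _ in range(n_layers):
    W = rng.normal(scale=0.1, size=(d, d))
    u = W @ x             # stand-in for an attention or FFN output
    updates.append(u)
    x = x + u             # additive "write" to the residual stream

# The stream decomposes exactly into x0 plus every layer's contribution.
print(np.allclose(x, x0 + np.sum(updates, axis=0)))
```

This exact decomposition is what lets interpretability work attribute parts of the final activation to individual layers.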
The placement of layer normalization relative to the residual connection significantly affects training dynamics.
In the Post-Norm pattern, normalization comes after the residual addition:
y = LayerNorm(x + Sublayer(x))
This was the design used in the original transformer and in BERT. It provides stronger normalization of the residual stream but can lead to gradient instability at initialization, especially in deep networks. Post-Norm transformers typically require a careful learning rate warmup schedule to train successfully [4].
In the Pre-Norm pattern, normalization comes before the sublayer:
y = x + Sublayer(LayerNorm(x))
Xiong et al. (2020) proved theoretically that Pre-Norm transformers have well-behaved gradients at initialization, explaining why they are easier to train. The Pre-Norm configuration has become the standard for large language models, with GPT-2 being one of the earliest prominent models to adopt it [4].
The tradeoff is that some studies have found Post-Norm achieves slightly better final performance when training is successful, because the normalization after the residual addition provides a stronger regularization effect. Researchers continue to explore alternative placements and hybrid approaches.
| Configuration | Formula | Stability | Final Quality |
|---|---|---|---|
| Post-Norm | LayerNorm(x + Sublayer(x)) | Requires warmup | Slightly higher (some evidence) |
| Pre-Norm | x + Sublayer(LayerNorm(x)) | Stable at init | Standard quality |
| Sandwich Norm | x + LayerNorm(Sublayer(LayerNorm(x))) | Most stable | Mixed results |
DenseNet, proposed by Huang et al. (2017), extends the skip connection concept in a different direction. Instead of adding the input to the output (as in ResNet), DenseNet concatenates the outputs of all preceding layers as the input to each layer [6].
In a DenseNet block with layers 1 through L, layer l receives as input the concatenation of the feature maps from all preceding layers: x_0, x_1, ..., x_{l-1}. This creates much denser connectivity than ResNet's block-level shortcuts.
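The difference between the two connection styles is easy to see on toy feature vectors; the stand-in functions below are illustrative, with a DenseNet-style growth rate of 4 new features per layer:

```python
import numpy as np

# Additive (ResNet-style) vs concatenative (DenseNet-style) skips.
# Addition keeps the feature width fixed; concatenation grows it with
# every layer, which is why DenseNet stores many more intermediate features.

def resnet_step(x, f):
    return x + f(x)                  # width stays len(x)

def densenet_step(features, f):
    x = np.concatenate(features)     # layer sees ALL earlier outputs
    return features + [f(x)]         # append this layer's new features

f_add = lambda x: 0.1 * x
x = np.ones(4)
print(resnet_step(x, f_add).shape)   # (4,): unchanged

features = [np.ones(4)]
for _ in range(3):
    # growth rate k=4: each layer emits 4 new features from all inputs
    features = densenet_step(features, lambda x: np.ones(4) * x.mean())
print(np.concatenate(features).shape)  # (16,): 4 initial + 3 layers x 4
```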
| Property | ResNet | DenseNet |
|---|---|---|
| Connection type | Addition (x + F(x)) | Concatenation ([x, F(x)]) |
| Connections per block | One shortcut | All-to-all within block |
| Feature reuse | Implicit | Explicit |
| Parameter efficiency | Standard | Higher (fewer parameters for similar accuracy) |
| Memory usage | Standard | Higher (stores all intermediate features) |
| Used in transformers | Universal | Rare |
DenseNet achieves comparable accuracy to ResNet with significantly fewer parameters. For example, a DenseNet-201 with 20M parameters matches a ResNet-101 with 44.5M parameters on ImageNet. However, the concatenation of all intermediate feature maps increases memory consumption, and the dense connectivity pattern has not been widely adopted in transformers, where additive residual connections remain the standard [6].
Residual connections have become so universal that it is easier to list architectures that do not use them. Nearly every architecture achieving state-of-the-art results across domains employs residual connections in some form:
Every transformer model uses residual connections around attention and feed-forward sublayers. This includes encoder-only models (BERT, RoBERTa), decoder-only models (GPT series, LLaMA, Mistral, Claude), and encoder-decoder models (T5, BART).
ResNet's influence reshaped convolutional network design. Architectures following ResNet, including ResNeXt, SE-ResNet, EfficientNet, and ConvNeXt, all incorporate residual connections. ConvNeXt (2022) demonstrated that a pure convolutional architecture with modern design choices (including residual connections) can match the performance of vision transformers [7].
State space models such as Mamba also use residual connections around each layer, despite replacing the attention mechanism with a recurrent formulation. The residual connection pattern of y = x + Block(x) is retained because its gradient flow benefits are independent of the specific computation performed by the block.
U-Net architectures used in diffusion models for image generation incorporate both residual connections within blocks and skip connections between encoder and decoder stages at matching resolutions.
| Architecture Family | Uses Residual Connections | Connection Style |
|---|---|---|
| Transformers (all variants) | Yes | Additive around each sublayer |
| ResNet / ConvNeXt | Yes | Additive around each block |
| DenseNet | Variant | Concatenation (all-to-all) |
| State Space Models (Mamba) | Yes | Additive around each block |
| U-Net / Diffusion Models | Yes | Additive within blocks, skip across scales |
| Highway Networks | Yes | Gated additive |
| Plain MLPs | Sometimes | Additive when used |
Several variations on the basic residual connection have been explored:
Highway Networks (Srivastava et al., 2015) introduced gated shortcuts where a learned gating function controls how much of the input versus the transformed output flows through:
y = T(x) * F(x) + (1 - T(x)) * x
where T(x) is a gating function producing values between 0 and 1. Highway Networks predated ResNet but were more complex due to the additional gating parameters. The simplicity of the identity shortcut in ResNet proved more practical [8].
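A numpy sketch of the gated shortcut (weights and the strongly negative gate bias are chosen for illustration): with the gate biased far negative, T(x) is near zero and the block reduces to the identity, mirroring how a residual block with small F approximates the identity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
d = 4
W_f = rng.normal(scale=0.5, size=(d, d))
W_t = rng.normal(scale=0.5, size=(d, d))
b_t = -12.0 * np.ones(d)   # strong negative bias: gate nearly closed

def highway(x):
    t = sigmoid(W_t @ x + b_t)     # gate T(x) in (0, 1)
    f = np.tanh(W_f @ x)           # candidate transform F(x)
    return t * f + (1.0 - t) * x   # y = T(x)*F(x) + (1 - T(x))*x

x = np.array([0.5, -1.0, 2.0, -0.5])
print(np.max(np.abs(highway(x) - x)))  # tiny: block ~ identity
```

Note that approximating the identity here requires learning the gate bias, whereas ResNet's identity shortcut gives the same behavior for free.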
Some architectures scale either the residual or the shortcut path. GPT-2 initializes the output projection of each sublayer with a scaling factor of 1/sqrt(2N), where N is the number of layers (each layer contributes two residual additions, one for attention and one for the FFN, so 2N is the total number of writes to the stream). This prevents the residual stream's magnitude from growing as sqrt(N), which would happen with unscaled additions, and helps stabilize training of deep models.
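The sqrt-growth argument is a simple variance calculation: summing N independent unit-variance contributions gives a standard deviation of sqrt(N), and scaling each contribution by 1/sqrt(N) cancels it. Below, N plays the role of the total number of residual additions (the values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)

# N independent unit-variance "sublayer outputs" written to a stream of
# width d. Unscaled, their sum has standard deviation sqrt(N); scaling
# each contribution by 1/sqrt(N) keeps the stream's magnitude O(1).
N, d = 100, 512
updates = rng.normal(size=(N, d))

unscaled = updates.sum(axis=0)               # std grows like sqrt(N) = 10
scaled = (updates / np.sqrt(N)).sum(axis=0)  # std stays near 1

print(unscaled.std(), scaled.std())
```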
Huang et al. (2016) proposed stochastic depth training, where entire residual blocks are randomly dropped during training (replaced by identity shortcuts). This acts as a form of regularization similar to dropout but at the block level, and it also reduces training time since dropped blocks do not need to be computed. At test time, all blocks are used with appropriate rescaling [9].
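A sketch of the train/test asymmetry (the keep probability and the stand-in block function are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)

# Stochastic depth: during training each residual block survives with
# probability p_keep; a dropped block reduces to the identity shortcut.
# At test time every block runs, with its output scaled by p_keep so the
# expected contribution matches training.
p_keep = 0.8

def block_train(x, f):
    if rng.random() < p_keep:
        return x + f(x)       # block active this step
    return x                  # block dropped: pure identity shortcut

def block_test(x, f):
    return x + p_keep * f(x)  # expectation-matching rescale

f = lambda x: 0.1 * x
x = np.ones(4)
print(block_test(x, f))       # deterministic: every entry is 1.08
```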
Residual connections introduce minimal overhead. The addition y = F(x) + x is an element-wise operation with negligible computational cost compared to the attention or feed-forward computations inside F. The main cost is memory: the input x must be kept in memory until F(x) has been computed so that the addition can be performed. For large transformers with long sequences, this contributes to the overall memory footprint, but the training stability benefits far outweigh this cost.
When implementing residual connections, care must be taken to ensure dimensional compatibility. The input x and the output F(x) must have the same shape for the addition to work. In transformers, this is naturally satisfied because both attention and FFN sublayers are designed to preserve the hidden dimension d_model. In convolutional networks, dimension mismatches at downsampling stages require projection shortcuts.
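A minimal guard for this invariant might look like the following (an illustrative sketch, not any framework's API):

```python
import numpy as np

def residual(x, sublayer):
    """Residual addition with an explicit shape-compatibility check.

    The sublayer must preserve the input's shape, as attention and FFN
    sublayers do in transformers (both map d_model -> d_model).
    """
    out = sublayer(x)
    if out.shape != x.shape:
        raise ValueError(f"shape mismatch: {out.shape} vs {x.shape}")
    return x + out

d_model = 16
x = np.zeros(d_model)
residual(x, lambda v: 0.1 * v)         # fine: shapes match
# residual(x, lambda v: np.zeros(8))   # would raise ValueError
```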
The ResNet paper by He et al. (2015) is one of the most cited papers in the history of computer science, with over 200,000 citations. Its impact extends far beyond the original image classification application. By demonstrating that depth could be scaled effectively with a simple architectural modification, residual connections opened the door to the very deep networks that define modern deep learning.
The combination of residual connections and layer normalization, established in the original transformer, has proven to be one of the most robust and scalable architectural patterns in deep learning. This pairing allows models to be scaled from millions to trillions of parameters while maintaining trainability, and it underpins every major large language model in use today.