A residual connection (also called a skip connection or shortcut connection) is a structural element in neural networks that adds the input of a layer or block directly to its output, producing y = F(x) + x instead of y = F(x). Introduced by Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun in their 2015 paper on deep residual learning, residual connections solved the degradation problem that had prevented effective training of very deep networks [1]. The idea was first demonstrated in convolutional neural networks for image recognition, where it enabled training of networks with over 100 layers and won the ILSVRC 2015 competition. Residual connections have since become a universal component of modern deep learning architectures, appearing in every transformer, every large language model, and most other deep network designs.
Before residual connections, the deep learning community faced a frustrating paradox. In theory, a deeper network should never perform worse than a shallower one: the extra layers could simply learn to be identity mappings, passing the input through unchanged, and the deeper network would at minimum match the shallower network's performance.
In practice, this did not happen. When researchers trained plain (non-residual) networks with increasing depth, both training and test error increased beyond a certain depth. A 56-layer plain network performed worse than a 20-layer plain network on CIFAR-10, not because of overfitting (the training error was also higher) but because the optimization process could not find a good solution [1].
This phenomenon, which He et al. called the "degradation problem," is distinct from the vanishing gradient problem. Even with techniques like batch normalization that ensure gradients do not literally vanish, deep plain networks still degraded. The problem is that standard nonlinear layers make it surprisingly difficult for the optimizer to learn identity-like mappings when that is the appropriate solution.
He et al. proposed a simple but powerful reformulation. Instead of asking a stack of layers to directly learn a desired mapping H(x), they asked the layers to learn the residual function:
F(x) = H(x) - x
The original mapping is then recovered as:
H(x) = F(x) + x
The key insight is that if the optimal transformation is close to the identity (i.e., H(x) is close to x), then F(x) is close to zero. Learning a function close to zero is much easier for the optimizer than learning the identity through a stack of nonlinear layers. The optimizer can push the weights of F toward zero to approximate the identity, whereas achieving the same result through a chain of convolutional layers, activation functions, and normalizations is much harder [1].
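This intuition can be checked directly. The numpy sketch below (sizes, scales, and names are illustrative) builds a residual block whose residual function F has near-zero weights; the block as a whole is then almost exactly the identity, which is the easy-to-reach solution the reformulation provides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy residual block: F is a small two-layer ReLU MLP. When its weights
# are near zero (which the optimizer can easily arrange), F(x) ~ 0 and
# the block computes y = F(x) + x ~ x, i.e. the identity mapping.
d = 4
W1 = rng.normal(scale=1e-3, size=(d, d))
W2 = rng.normal(scale=1e-3, size=(d, d))

def F(x):
    return W2 @ np.maximum(W1 @ x, 0.0)   # the residual function

def residual_block(x):
    return F(x) + x                       # identity shortcut

x = np.array([1.0, -2.0, 0.5, 3.0])
y = residual_block(x)
print(np.max(np.abs(y - x)))  # tiny: the block is nearly the identity
```

Reaching the same near-identity behavior with a plain stack would require the nonlinear layers themselves to represent the identity, which small random weights do not do.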
The residual connection itself is simply the addition y = F(x) + x, where x is the input to the block and F(x) is the output of the stacked layers. This addition does not introduce any extra parameters or computational complexity. The shortcut is sometimes called an "identity shortcut" because it implements the identity function: the input passes through unchanged and is added to the output.
When the dimensions of F(x) and x do not match (for example, when the number of channels changes between blocks), a linear projection W_s is applied to x:
y = F(x) + W_s * x
In practice, this projection is implemented as a 1x1 convolution (in CNNs) or a linear layer (in transformers).
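A minimal numpy sketch of the projection shortcut, with plain matrices standing in for both the block's layers and the 1x1 convolution (all names and sizes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# When F changes the feature dimension (here 4 -> 8), the shortcut needs
# a learned linear projection W_s so the addition is well-defined.
# In a CNN, W_s would be a 1x1 convolution; in a transformer, a linear layer.
d_in, d_out = 4, 8
W_f = rng.normal(scale=0.1, size=(d_out, d_in))  # stands in for the stacked layers
W_s = rng.normal(scale=0.1, size=(d_out, d_in))  # projection shortcut

def F(x):
    return W_f @ x

def block(x):
    return F(x) + W_s @ x   # y = F(x) + W_s * x

x = np.ones(d_in)
print(block(x).shape)  # (8,)
```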
The most important mathematical property of residual connections is their effect on gradient flow during backpropagation. Consider a network with L residual blocks. Let x_l denote the input to block l, and let F_l denote the function computed by block l. Then:
x_{l+1} = x_l + F_l(x_l)
Unrolling this recursion from layer l to a later layer L:
x_L = x_l + sum(F_i(x_i)) for i = l to L-1
Taking the gradient of a loss function E with respect to x_l:
dE/dx_l = dE/dx_L * (1 + d/dx_l * sum(F_i(x_i)))
The crucial term is the "1" inside the parentheses. This means the gradient always has a direct path from the loss back to any earlier layer, even if the gradient through the residual blocks d/dx_l * sum(F_i(x_i)) is small. The residual connection acts as a "gradient highway" that prevents gradients from vanishing entirely, regardless of network depth [1].
Without residual connections, the gradient must flow through every layer's transformation sequentially. If each layer's Jacobian has eigenvalues less than 1, the gradient shrinks exponentially with depth (vanishing gradients). If the eigenvalues are greater than 1, the gradient grows exponentially (exploding gradients). Residual connections add the identity to this chain, ensuring the gradient always has a component that neither shrinks nor grows.
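A scalar caricature makes this concrete. Treat each layer's Jacobian as a single small number j: a plain network multiplies the gradient by j at every layer, while a residual network multiplies it by (1 + j). The numbers below are illustrative, not from any trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Backprop through depth L, scalar caricature. Each plain layer scales
# the gradient by its local derivative j; each residual layer scales it
# by (1 + j) because of the identity term contributed by the shortcut.
L = 50
j = rng.uniform(-0.1, 0.1, size=L)   # per-layer derivatives with |j| << 1

plain_grad = np.prod(j)          # shrinks like 0.1^50: numerically zero
residual_grad = np.prod(1 + j)   # stays within a modest factor of 1

print(abs(plain_grad), residual_grad)
```

The plain product vanishes to numerical zero, while the residual product remains usable at any depth, which is the "gradient highway" effect in miniature.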
Veit et al. (2016) proposed an alternative perspective: residual networks can be understood as an implicit ensemble of many shallow networks. Because each residual block can be either "used" or "bypassed" (via the shortcut), a network with n residual blocks implicitly represents 2^n possible paths of different lengths. Experiments showed that removing individual blocks from a trained ResNet causes only modest performance degradation, consistent with the ensemble interpretation [2].
This view helps explain why residual networks are robust to layer dropout and why they exhibit graceful degradation rather than catastrophic failure when individual components are perturbed.
The original ResNet paper proposed several architectures of increasing depth:
| Architecture | Layers | Parameters | Top-5 Error (ImageNet) |
|---|---|---|---|
| ResNet-18 | 18 | 11.7M | 10.92% |
| ResNet-34 | 34 | 21.8M | 9.46% |
| ResNet-50 | 50 | 25.6M | 7.48% |
| ResNet-101 | 101 | 44.5M | 6.58% |
| ResNet-152 | 152 | 60.2M | 6.16% |
Each residual block in ResNet-18 and ResNet-34 contains two 3x3 convolutional layers. ResNet-50 and deeper variants use a "bottleneck" design with three layers: a 1x1 convolution to reduce dimensionality, a 3x3 convolution, and a 1x1 convolution to restore dimensionality. This bottleneck design reduces computational cost while maintaining representational capacity [1].
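The savings can be checked with a rough parameter count (ignoring biases and batch normalization), using the paper's 256 -> 64 -> 256 bottleneck widths:

```python
# Rough per-block parameter counts comparing the "basic" two-layer design
# (ResNet-18/34) with the bottleneck design (ResNet-50 and deeper).
# Biases and batch-norm parameters are ignored for simplicity.

def conv_params(k, c_in, c_out):
    return k * k * c_in * c_out   # weights of a k x k convolution

basic = 2 * conv_params(3, 256, 256)       # two 3x3 convs at full width

bottleneck = (conv_params(1, 256, 64)      # 1x1 reduce dimensionality
              + conv_params(3, 64, 64)     # 3x3 at reduced width
              + conv_params(1, 64, 256))   # 1x1 restore dimensionality

print(basic, bottleneck)  # 1179648 69632
```

At these widths the bottleneck block uses roughly 17x fewer weights than a basic block, which is what makes 100+ layer networks affordable.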
ResNet-152 won first place in the ILSVRC 2015 classification task with a top-5 error rate of 3.57% (using an ensemble), well below the estimated human-level error of around 5% on this benchmark. The same residual learning framework also won the ILSVRC 2015 detection and localization tasks, as well as the COCO 2015 detection and segmentation tasks [1].
The transformer architecture, introduced by Vaswani et al. (2017), uses residual connections around every sublayer [3]. Each transformer layer contains two sublayers: a multi-head self-attention mechanism and a position-wise feed-forward network (FFN).
Both sublayers are wrapped in residual connections. In the original transformer (Post-Norm configuration), the computation for each sublayer is:
y = LayerNorm(x + Sublayer(x))
Here, x is the input, Sublayer(x) is the output of the attention or FFN computation, x + Sublayer(x) is the residual addition, and LayerNorm normalizes the result.
In the Pre-Norm configuration used by most modern models (including GPT, LLaMA, and Mistral), the order is rearranged:
y = x + Sublayer(LayerNorm(x))
The layer normalization is applied to the input before the sublayer, and the raw (unnormalized) input is used for the residual addition. This means the residual stream carries unnormalized activations, which has been shown to improve gradient flow and training stability [4].
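The two wirings can be sketched side by side in numpy; `sublayer` below is an illustrative stand-in for attention or the FFN, not any particular library's API:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize to zero mean and unit variance (learned scale/bias omitted).
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def sublayer(x):
    return 0.1 * x            # stand-in for attention or the FFN

def post_norm(x):
    # Original transformer / BERT: normalize AFTER the residual addition.
    return layer_norm(x + sublayer(x))

def pre_norm(x):
    # GPT-2, LLaMA, Mistral: normalize the sublayer input only; the raw
    # (unnormalized) x is what flows through the residual addition.
    return x + sublayer(layer_norm(x))

x = np.array([1.0, 2.0, 3.0, 4.0])
print(post_norm(x).mean())  # ~0: the stream itself gets normalized
print(pre_norm(x).mean())   # ~2.5: the raw input survives in the stream
```

The printed means show the structural difference: Post-Norm renormalizes the entire stream each sublayer, while Pre-Norm leaves the accumulated stream untouched.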
In transformer architectures, the sequence of residual additions creates what researchers call the "residual stream." This is a conceptual pathway through the network where information accumulates additively. Each attention layer and each FFN layer contributes an additive update to the residual stream:
x_0 -> x_0 + attn_1(x_0) -> x_0 + attn_1(x_0) + ffn_1(...) -> ... -> x_0 + sum of all layer outputs
This view has been influential in mechanistic interpretability research, where the residual stream is treated as a shared communication channel that all layers read from and write to. The residual connections ensure that information from early layers remains accessible to later layers without being corrupted by intervening transformations [5].
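A toy numpy loop shows the additive structure exactly: the stream at any depth decomposes into the initial embedding plus the sum of every sublayer's contribution (sizes and the stand-in update function are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

# The residual stream as an accumulator: each layer's output equals the
# initial embedding plus the sum of all sublayer contributions so far.
d, n_layers = 8, 6
x0 = rng.normal(size=d)   # initial embedding written to the stream
updates = []

x = x0
for _ in range(n_layers):
    W = rng.normal(scale=0.1, size=(d, d))
    u = W @ x             # stand-in for an attention or FFN output
    updates.append(u)
    x = x + u             # additive "write" to the residual stream

# The stream decomposes exactly into x0 plus every layer's contribution.
print(np.allclose(x, x0 + np.sum(updates, axis=0)))
```

This exact decomposition is what lets interpretability work attribute parts of the final activation to individual layers.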
The placement of layer normalization relative to the residual connection significantly affects training dynamics.
In the Post-Norm pattern, normalization comes after the residual addition:
y = LayerNorm(x + Sublayer(x))
This was the design used in the original transformer and in BERT. It provides stronger normalization of the residual stream but can lead to gradient instability at initialization, especially in deep networks. Post-Norm transformers typically require a careful learning rate warmup schedule to train successfully [4].
In the Pre-Norm pattern, normalization comes before the sublayer:
y = x + Sublayer(LayerNorm(x))
Xiong et al. (2020) proved theoretically that Pre-Norm transformers have well-behaved gradients at initialization, explaining why they are easier to train. The Pre-Norm configuration has become the standard for large language models, with GPT-2 being one of the earliest prominent models to adopt it [4].
The tradeoff is that some studies have found Post-Norm achieves slightly better final performance when training is successful, because the normalization after the residual addition provides a stronger regularization effect. Researchers continue to explore alternative placements and hybrid approaches.
| Configuration | Formula | Stability | Final Quality |
|---|---|---|---|
| Post-Norm | LayerNorm(x + Sublayer(x)) | Requires warmup | Slightly higher (some evidence) |
| Pre-Norm | x + Sublayer(LayerNorm(x)) | Stable at init | Standard quality |
| Sandwich Norm | x + LayerNorm(Sublayer(LayerNorm(x))) | Most stable | Mixed results |
DenseNet, proposed by Huang et al. (2017), extends the skip connection concept in a different direction. Instead of adding the input to the output (as in ResNet), DenseNet concatenates the outputs of all preceding layers as the input to each layer [6].
In a DenseNet block with layers 1 through L, layer l receives as input the concatenation of the feature maps from all preceding layers: x_0, x_1, ..., x_{l-1}. This creates much denser connectivity than ResNet's block-level shortcuts.
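The difference between the two connection styles is easy to see on toy feature vectors; the stand-in functions below are illustrative, with a DenseNet-style growth rate of 4 new features per layer:

```python
import numpy as np

# Additive (ResNet-style) vs concatenative (DenseNet-style) skips.
# Addition keeps the feature width fixed; concatenation grows it with
# every layer, which is why DenseNet stores many more intermediate features.

def resnet_step(x, f):
    return x + f(x)                  # width stays len(x)

def densenet_step(features, f):
    x = np.concatenate(features)     # layer sees ALL earlier outputs
    return features + [f(x)]         # append this layer's new features

f_add = lambda x: 0.1 * x
x = np.ones(4)
print(resnet_step(x, f_add).shape)   # (4,): unchanged

features = [np.ones(4)]
for _ in range(3):
    # growth rate k=4: each layer emits 4 new features from all inputs
    features = densenet_step(features, lambda x: np.ones(4) * x.mean())
print(np.concatenate(features).shape)  # (16,): 4 initial + 3 layers x 4
```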
| Property | ResNet | DenseNet |
|---|---|---|
| Connection type | Addition (x + F(x)) | Concatenation ([x, F(x)]) |
| Connections per block | One shortcut | All-to-all within block |
| Feature reuse | Implicit | Explicit |
| Parameter efficiency | Standard | Higher (fewer parameters for similar accuracy) |
| Memory usage | Standard | Higher (stores all intermediate features) |
| Used in transformers | Universal | Rare |
DenseNet achieves comparable accuracy to ResNet with significantly fewer parameters. For example, a DenseNet-201 with 20M parameters matches a ResNet-101 with 44.5M parameters on ImageNet. However, the concatenation of all intermediate feature maps increases memory consumption, and the dense connectivity pattern has not been widely adopted in transformers, where additive residual connections remain the standard [6].
Residual connections have become so universal that it is easier to list architectures that do not use them. Nearly every architecture achieving state-of-the-art results across domains employs residual connections in some form:
Every transformer model uses residual connections around attention and feed-forward sublayers. This includes encoder-only models (BERT, RoBERTa), decoder-only models (GPT series, LLaMA, Mistral, Claude), and encoder-decoder models (T5, BART).
ResNet's influence reshaped convolutional network design. Architectures following ResNet, including ResNeXt, SE-ResNet, EfficientNet, and ConvNeXt, all incorporate residual connections. ConvNeXt (2022) demonstrated that a pure convolutional architecture with modern design choices (including residual connections) can match the performance of vision transformers [7].
State space models such as Mamba also use residual connections around each layer, despite replacing the attention mechanism with a recurrent formulation. The residual connection pattern of y = x + Block(x) is retained because its gradient flow benefits are independent of the specific computation performed by the block.
U-Net architectures used in diffusion models for image generation incorporate both residual connections within blocks and skip connections between encoder and decoder stages at matching resolutions.
| Architecture Family | Uses Residual Connections | Connection Style |
|---|---|---|
| Transformers (all variants) | Yes | Additive around each sublayer |
| ResNet / ConvNeXt | Yes | Additive around each block |
| DenseNet | Variant | Concatenation (all-to-all) |
| State Space Models (Mamba) | Yes | Additive around each block |
| U-Net / Diffusion Models | Yes | Additive within blocks, skip across scales |
| Highway Networks | Yes | Gated additive |
| Plain MLPs | Sometimes | Additive when used |
Several variations on the basic residual connection have been explored:
Highway Networks (Srivastava et al., 2015) introduced gated shortcuts where a learned gating function controls how much of the input versus the transformed output flows through:
y = T(x) * F(x) + (1 - T(x)) * x
where T(x) is a gating function producing values between 0 and 1. Highway Networks predated ResNet but were more complex due to the additional gating parameters. The simplicity of the identity shortcut in ResNet proved more practical [8].
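A numpy sketch of the gated shortcut (weights and the strongly negative gate bias are chosen for illustration): with the gate biased far negative, T(x) is near zero and the block reduces to the identity, mirroring how a residual block with small F approximates the identity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
d = 4
W_f = rng.normal(scale=0.5, size=(d, d))
W_t = rng.normal(scale=0.5, size=(d, d))
b_t = -12.0 * np.ones(d)   # strong negative bias: gate nearly closed

def highway(x):
    t = sigmoid(W_t @ x + b_t)     # gate T(x) in (0, 1)
    f = np.tanh(W_f @ x)           # candidate transform F(x)
    return t * f + (1.0 - t) * x   # y = T(x)*F(x) + (1 - T(x))*x

x = np.array([0.5, -1.0, 2.0, -0.5])
print(np.max(np.abs(highway(x) - x)))  # tiny: block ~ identity
```

Note that approximating the identity here requires learning the gate bias, whereas ResNet's identity shortcut gives the same behavior for free.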
Some architectures scale either the residual or the shortcut path. GPT-2 initializes the output projection of each sublayer with a scaling factor of 1/sqrt(2N), where N is the number of layers (each layer contributes two residual additions, one for attention and one for the FFN, so 2N is the total number of writes to the stream). This prevents the residual stream's magnitude from growing as sqrt(N), which would happen with unscaled additions, and helps stabilize training of deep models.
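The sqrt-growth argument is a simple variance calculation: summing N independent unit-variance contributions gives a standard deviation of sqrt(N), and scaling each contribution by 1/sqrt(N) cancels it. Below, N plays the role of the total number of residual additions (the values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)

# N independent unit-variance "sublayer outputs" written to a stream of
# width d. Unscaled, their sum has standard deviation sqrt(N); scaling
# each contribution by 1/sqrt(N) keeps the stream's magnitude O(1).
N, d = 100, 512
updates = rng.normal(size=(N, d))

unscaled = updates.sum(axis=0)               # std grows like sqrt(N) = 10
scaled = (updates / np.sqrt(N)).sum(axis=0)  # std stays near 1

print(unscaled.std(), scaled.std())
```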
Huang et al. (2016) proposed stochastic depth training, where entire residual blocks are randomly dropped during training (replaced by identity shortcuts). This acts as a form of regularization similar to dropout but at the block level, and it also reduces training time since dropped blocks do not need to be computed. At test time, all blocks are used with appropriate rescaling [9].
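A sketch of the train/test asymmetry (the keep probability and the stand-in block function are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)

# Stochastic depth: during training each residual block survives with
# probability p_keep; a dropped block reduces to the identity shortcut.
# At test time every block runs, with its output scaled by p_keep so the
# expected contribution matches training.
p_keep = 0.8

def block_train(x, f):
    if rng.random() < p_keep:
        return x + f(x)       # block active this step
    return x                  # block dropped: pure identity shortcut

def block_test(x, f):
    return x + p_keep * f(x)  # expectation-matching rescale

f = lambda x: 0.1 * x
x = np.ones(4)
print(block_test(x, f))       # deterministic: every entry is 1.08
```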
Residual connections introduce minimal overhead. The addition y = F(x) + x is an element-wise operation with negligible computational cost compared to the attention or feed-forward computations inside F. The main cost is memory: the input x must be kept in memory until F(x) has been computed so that the addition can be performed. For large transformers with long sequences, this contributes to the overall memory footprint, but the training stability benefits far outweigh this cost.
When implementing residual connections, care must be taken to ensure dimensional compatibility. The input x and the output F(x) must have the same shape for the addition to work. In transformers, this is naturally satisfied because both attention and FFN sublayers are designed to preserve the hidden dimension d_model. In convolutional networks, dimension mismatches at downsampling stages require projection shortcuts.
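A minimal guard for this invariant might look like the following (an illustrative sketch, not any framework's API):

```python
import numpy as np

def residual(x, sublayer):
    """Residual addition with an explicit shape-compatibility check.

    The sublayer must preserve the input's shape, as attention and FFN
    sublayers do in transformers (both map d_model -> d_model).
    """
    out = sublayer(x)
    if out.shape != x.shape:
        raise ValueError(f"shape mismatch: {out.shape} vs {x.shape}")
    return x + out

d_model = 16
x = np.zeros(d_model)
residual(x, lambda v: 0.1 * v)         # fine: shapes match
# residual(x, lambda v: np.zeros(8))   # would raise ValueError
```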
The ResNet paper by He et al. (2015) is one of the most cited papers in the history of computer science, with over 200,000 citations. Its impact extends far beyond the original image classification application. By demonstrating that depth could be scaled effectively with a simple architectural modification, residual connections opened the door to the very deep networks that define modern deep learning.
The combination of residual connections and layer normalization, established in the original transformer, has proven to be one of the most robust and scalable architectural patterns in deep learning. This pairing allows models to be scaled from millions to trillions of parameters while maintaining trainability, and it underpins every major large language model in use today.