# ResNet

> Source: https://aiwiki.ai/wiki/resnet
> Updated: 2026-06-20
> Categories: Computer Vision, Deep Learning, Neural Networks
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**ResNet** (Residual Network) is a deep [convolutional neural network](/wiki/convolutional_neural_network) architecture introduced by Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, then at [Microsoft Research Asia](/wiki/microsoft_research_asia) (MSRA), in December 2015. The paper, "Deep Residual Learning for Image Recognition," introduced *identity shortcut* (or *residual*) connections that bypass two or three stacked layers and add the input directly to the output. As the authors put it, "We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions," and they reported that "these residual networks are easier to optimize, and can gain accuracy from considerably increased depth."[^1] This deceptively simple design dissolved the *degradation problem* that had blocked the training of networks much deeper than 20-30 layers and made networks of 100+ layers practical for the first time.[^1] ResNet won every major track of the 2015 [ImageNet](/wiki/imagenet) Large Scale Visual Recognition Challenge (ILSVRC) and the COCO 2015 challenges, reaching a 3.57% top-5 error on ImageNet classification with a 152-layer model, and it received the Best Paper Award at the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).[^2][^3] The winning model used "a depth of up to 152 layers, 8x deeper than VGG nets but still having lower complexity."[^1] As of 2026, the ResNet paper has been cited more than 250,000 times on Google Scholar, and a 2025 Nature analysis ranked it the single most-cited scientific paper of the 21st century across all fields.[^4][^28]

The architecture's defining unit is the residual block, which computes `y = F(x, {W_i}) + x`, where `F` is a small stack of weight layers and `+ x` is an identity skip connection. The skip path adds no parameters and almost no computation, but it gives gradients a direct route from later layers back to earlier ones, which dramatically eases optimization. ResNet-50, the bottleneck variant most used in practice, became the de facto vision backbone of the late 2010s and remains a reference benchmark in MLPerf and in nearly every paper that introduces a new image model.[^5] Skip connections themselves have outlasted any specific ResNet configuration: virtually every modern [Transformer](/wiki/transformer), [large language model](/wiki/large_language_model), and diffusion image generator places a residual addition around each block, a direct inheritance from the 2015 paper.

## Background

By 2015, [convolutional neural networks](/wiki/convolutional_neural_network) had become the dominant tool for image recognition. The progression of [ImageNet](/wiki/imagenet) winners showed a clear trend toward depth: AlexNet (2012) had 8 layers, [VGG](/wiki/vgg) (Simonyan and Zisserman, 2014) used 16 or 19 layers stacked from small 3x3 filters, and GoogLeNet (Szegedy et al., 2014) reached 22 layers using *inception modules*.[^6][^7] VGG in particular established the principle that small filters stacked deeply produced strong features for transfer learning.

### What is the degradation problem?

In principle, a network of 50 layers should be at least as expressive as one of 20 layers: the extra 30 layers could simply learn the identity function and replicate the shallower network's behavior. Empirically, though, the opposite happened. As researchers added more layers to standard *plain* networks, accuracy first improved, then saturated, and then degraded rapidly.[^1] The 56-layer plain network reported by He et al. on CIFAR-10 produced higher *training* error than a 20-layer plain network using the same recipe. Because the error grew on the training set itself, this was not [overfitting](/wiki/overfitting); the deeper networks were failing to optimize, even with [batch normalization](/wiki/batch_normalization) in place to control activations.

He et al. called this *the degradation problem* and made it the central motivation of the paper. They observed that "with the network depth increasing, accuracy gets saturated (which might be unsurprising) and then degrades rapidly," and stressed that this degradation "is not caused by overfitting."[^1] They argued that if a stack of layers could easily learn identity mappings, no degradation should appear. But standard nonlinear layers, initialized to small random weights and threaded through ReLUs, evidently struggled to express identity precisely. The residual reformulation removed that burden by *making identity the default*.

### Highway Networks

A few months earlier, Srivastava, Greff, and Schmidhuber had introduced [Highway Networks](/wiki/highway_networks), which used *gated* shortcut connections inspired by LSTM cells.[^8] A Highway block computed `y = T(x) * H(x) + (1 - T(x)) * x`, where `T(x)` is a learned transform gate. ResNet's key simplification was to remove the gate altogether: the shortcut is *always open*, with no learned modulation. This made the architecture easier to train, easier to analyze, and slightly cheaper, and it turned out to scale better with depth than the gated variant.
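To make the contrast concrete, the two block forms can be sketched on scalars. This is only an illustration: the `transform` function, the sigmoid gate parameters `gate_w` and `gate_b`, and all function names are assumptions chosen for the example, not details taken from either paper.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def transform(x: float) -> float:
    """Hypothetical nonlinear transform, standing in for H(x) or F(x)."""
    return math.tanh(0.5 * x)


def highway_block(x: float, gate_w: float, gate_b: float) -> float:
    """y = T(x) * H(x) + (1 - T(x)) * x, with a learned sigmoid gate T."""
    t = sigmoid(gate_w * x + gate_b)
    return t * transform(x) + (1.0 - t) * x


def residual_block(x: float) -> float:
    """y = F(x) + x: the shortcut has no gate and no parameters."""
    return transform(x) + x


x = 2.0
# Strongly negative gate bias: the transform path closes, output is close to x.
print(highway_block(x, gate_w=0.0, gate_b=-20.0))
# Strongly positive gate bias: the carry path closes, the input x is lost.
print(highway_block(x, gate_w=0.0, gate_b=20.0))
# The residual block always carries x forward and adds the transform on top.
print(residual_block(x))
```

In the Highway form the network has to learn when to carry the input; in the residual form the carry is unconditional, so there is nothing to learn on the shortcut.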

## Who created ResNet?

All four authors of the ResNet paper were at [Microsoft Research Asia](/wiki/microsoft_research_asia) in Beijing when the work was submitted to arXiv on December 10, 2015 (arXiv:1512.03385).[^1] The paper was accepted at CVPR 2016 and presented in Las Vegas in June 2016, where it received the Best Paper Award.[^3]

| Author | Role at MSRA (2015) | Later affiliations |
|---|---|---|
| [Kaiming He](/wiki/kaiming_he) | Lead author | Facebook AI Research (2016), then MIT EECS faculty (2024) |
| Xiangyu Zhang | Co-author | MEGVII (Face++) Research |
| Shaoqing Ren | Co-author | Co-author of Faster R-CNN; autonomous driving industry |
| Jian Sun | Senior researcher | MEGVII Chief Scientist until his death in 2022 |

The same four-author team had previously produced the *He initialization* paper, which derived the variance-preserving initialization for [ReLU](/wiki/relu) networks and was itself a prerequisite for training the very deep models in ResNet.[^9] Kaiming He also co-authored Faster R-CNN (with Ren and Sun), Mask R-CNN, MoCo, MAE, and a long sequence of other influential vision papers, making him one of the most-cited computer scientists alive. In the 2025 Nature analysis of the most-cited papers of the century, He was an author of the top-ranked paper, ResNet.[^28]

## The residual block

The mathematical core of ResNet fits on a single line. Let `x` be the input to a block and `H(x)` be the desired mapping the block should compute. Instead of asking the stacked layers to learn `H` directly, He et al. asked them to learn the residual function

```
F(x, {W_i}) = H(x) - x
```

and to recover `H(x)` by adding back the input:

```
y = F(x, {W_i}) + x
```

If, for some block, the optimal `H` is close to the identity, the optimizer only needs to drive `F` toward zero, which is easy when `F` is a small stack of weight layers initialized near zero. Learning small deviations from identity is much easier than reconstructing identity from scratch through nonlinear ReLU layers. The paper hypothesized that "it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping."[^1]
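The "identity by default" argument can be checked with a toy block. The sketch below assumes a two-layer residual function `F(x) = W2 * relu(W1 * x)` on small vectors and omits batch normalization and the ReLU that follows the addition in the published design; the helper names are illustrative.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def relu(v: Vector) -> Vector:
    return [max(0.0, a) for a in v]


def linear(weight: Matrix, v: Vector) -> Vector:
    """Matrix-vector product, one output per weight row."""
    return [sum(w * a for w, a in zip(row, v)) for row in weight]


def plain_block(x: Vector, w1: Matrix, w2: Matrix) -> Vector:
    """H(x) learned directly by the stacked layers."""
    return linear(w2, relu(linear(w1, x)))


def residual_block(x: Vector, w1: Matrix, w2: Matrix) -> Vector:
    """y = F(x) + x, where F is the same two-layer stack."""
    f = plain_block(x, w1, w2)
    return [fi + xi for fi, xi in zip(f, x)]


zeros = [[0.0] * 3 for _ in range(3)]
x = [1.5, -2.0, 0.25]
print(residual_block(x, zeros, zeros))  # [1.5, -2.0, 0.25]
print(plain_block(x, zeros, zeros))     # [0.0, 0.0, 0.0]
```

With all weights at zero the residual block already reproduces its input, while the plain block outputs zeros and would have to learn a precise identity through the ReLU.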

### Identity vs projection shortcuts

When the input `x` and the output of `F` have the same dimensions, the addition is element-wise and the shortcut adds zero parameters. When the channel count changes (at the start of each new stage) or the spatial resolution changes (when stride 2 is used), a linear projection `W_s` is applied to the skip path:

```
y = F(x, {W_i}) + W_s * x
```

In practice `W_s` is implemented as a 1x1 convolution, with stride 2 when the spatial resolution is halved and stride 1 when only the channel count changes. The original paper tested three shortcut options: (A) zero-padding for extra channels, (B) projections only when dimensions change, and (C) projections everywhere. Option B is the standard; option C added cost with little gain.[^1]
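A 1x1 convolution is just a per-pixel mix of channels, optionally sampled on a strided grid. The following sketch spells that out on nested lists; the `[channels][height][width]` layout, the function name, and the tiny weight matrix are assumptions made for the example.

```python
from typing import List

Tensor3 = List[List[List[float]]]  # [channels][height][width]


def projection_shortcut(x: Tensor3, weight: List[List[float]], stride: int = 1) -> Tensor3:
    """1x1 convolution W_s: mix channels at each pixel, visiting every `stride`-th pixel."""
    c_in, height, width = len(x), len(x[0]), len(x[0][0])
    out = []
    for row_w in weight:  # one row of weights per output channel
        plane = []
        for i in range(0, height, stride):
            plane.append([
                sum(row_w[c] * x[c][i][j] for c in range(c_in))
                for j in range(0, width, stride)
            ])
        out.append(plane)
    return out


# 2 input channels on a 4x4 map -> 3 output channels on a 2x2 map.
x = [[[float(c + 1)] * 4 for _ in range(4)] for c in range(2)]
w_s = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
y = projection_shortcut(x, w_s, stride=2)
print(len(y), len(y[0]), len(y[0][0]))  # 3 2 2
print(y[2][0][0])                       # 1.5
```

The projection needs only `c_in * c_out` weights (6 here), which is why it is cheap compared with the 3x3 convolutions on the residual branch.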

### Basic block and bottleneck block

ResNet has two block templates:

- **Basic block** (used in ResNet-18 and ResNet-34): two stacked 3x3 convolutions, each followed by [batch normalization](/wiki/batch_normalization). A [ReLU](/wiki/relu) follows the first BN; the skip connection is added after the second BN, and the final ReLU is applied to the sum.
- **Bottleneck block** (used in ResNet-50, -101, -152): a 1x1 convolution that reduces the channel count by a factor of four, a 3x3 convolution operating on this narrow representation, and a 1x1 convolution that restores the original width. The skip path again adds the original input.

The bottleneck design is the trick that makes ResNet-50 only marginally more expensive than ResNet-34 despite the extra layers. Most parameters and FLOPs of a deep CNN sit in 3x3 convolutions on wide feature maps; the 1x1 reductions move computation into a much narrower 256/4 = 64-channel core where the 3x3 is cheap, then expand back.[^1]
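The saving is easy to verify by counting convolution weights. The sketch below ignores biases and batch-normalization parameters and compares a 256-channel bottleneck with two hypothetical alternatives: a basic block at the same 256-channel width and a basic block at 64 channels.

```python
def conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a kernel x kernel convolution (no bias, no batch norm)."""
    return kernel * kernel * c_in * c_out


width = 256
narrow = width // 4  # the 64-channel core of the bottleneck

bottleneck = (
    conv_params(1, width, narrow)     # 1x1 reduce
    + conv_params(3, narrow, narrow)  # 3x3 on the narrow representation
    + conv_params(1, narrow, width)   # 1x1 restore
)
basic_wide = 2 * conv_params(3, width, width)      # two 3x3 convs at 256 channels
basic_narrow = 2 * conv_params(3, narrow, narrow)  # two 3x3 convs at 64 channels

print(bottleneck)    # 69632
print(basic_wide)    # 1179648
print(basic_narrow)  # 73728
```

Under these assumptions a bottleneck block that reads and writes 256 channels costs about as much as a basic block on 64 channels, and roughly 17 times less than a basic block on 256 channels.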

## Architecture variants

The original paper introduced five ImageNet configurations, all sharing a common stem (a 7x7 conv with stride 2, 64 channels, followed by 3x3 max pooling with stride 2) and four stages operating at progressively halved spatial resolution (56x56 -> 28x28 -> 14x14 -> 7x7) with doubled channel counts.

| Model | Block | Layers | Blocks per stage | Params | FLOPs | ImageNet top-1 | ImageNet top-5 |
|---|---|---|---|---|---|---|---|
| ResNet-18 | Basic | 18 | 2, 2, 2, 2 | 11.7M | 1.8G | 30.2% | 10.9% |
| ResNet-34 | Basic | 34 | 3, 4, 6, 3 | 21.8M | 3.6G | 26.7% | 8.6% |
| ResNet-50 | Bottleneck | 50 | 3, 4, 6, 3 | 25.6M | 3.8G | 24.0% | 7.1% |
| ResNet-101 | Bottleneck | 101 | 3, 4, 23, 3 | 44.5M | 7.6G | 22.4% | 6.2% |
| ResNet-152 | Bottleneck | 152 | 3, 8, 36, 3 | 60.2M | 11.3G | 21.7% | 5.7% |

Block counts and FLOPs follow the original paper.[^1] The error rates are approximate single-crop validation figures of the kind reported for common reference implementations; the paper itself reports lower errors under 10-crop and multi-scale testing. The layer counts include the initial 7x7 conv, all conv layers inside residual blocks, and the final fully connected classifier; batch normalization layers, the max pool, and global average pool are not counted. Notably, ResNet-152 is still cheaper than [VGG](/wiki/vgg)-16 (15.3 GFLOPs) or VGG-19 (19.6 GFLOPs) despite being roughly eight times deeper; the paper describes its 152-layer net as "8x deeper than VGG nets but still having lower complexity."[^1]
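The counting rule can be checked against the "Blocks per stage" column of the table. This short sketch assumes two counted convolutions per basic block and three per bottleneck block, plus the stem convolution and the classifier; the dictionary and function names are illustrative.

```python
CONFIGS = {
    "ResNet-18": ("basic", [2, 2, 2, 2]),
    "ResNet-34": ("basic", [3, 4, 6, 3]),
    "ResNet-50": ("bottleneck", [3, 4, 6, 3]),
    "ResNet-101": ("bottleneck", [3, 4, 23, 3]),
    "ResNet-152": ("bottleneck", [3, 8, 36, 3]),
}
CONVS_PER_BLOCK = {"basic": 2, "bottleneck": 3}


def count_layers(block: str, blocks_per_stage: list) -> int:
    """Stem 7x7 conv + convs inside residual blocks + final fully connected layer."""
    return 1 + CONVS_PER_BLOCK[block] * sum(blocks_per_stage) + 1


for name, (block, stages) in CONFIGS.items():
    print(name, count_layers(block, stages))
# ResNet-18 18
# ResNet-34 34
# ResNet-50 50
# ResNet-101 101
# ResNet-152 152
```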

A standard ResNet-50 ends with global average pooling and a single fully-connected layer that maps the 2048-dimensional pooled feature to 1000 ImageNet logits, followed by softmax.

### Training recipe

The original models were trained on ImageNet with [stochastic gradient descent](/wiki/stochastic_gradient_descent) (momentum 0.9), weight decay 1e-4, mini-batch size 256, and an initial learning rate of 0.1 divided by 10 when validation error plateaued, for up to 600,000 iterations (roughly 120 epochs at this batch size). Standard data augmentation included random crops, horizontal flips, per-pixel mean subtraction, and AlexNet-style color jittering.[^1] Weights were initialized with the He scheme from the earlier 2015 paper.[^9]
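The optimizer settings above can be sketched in a few lines. The update below is one common formulation of SGD with momentum and L2 weight decay, and the `patience` rule for detecting a plateau is an assumption added for the example, since the recipe only says the rate is divided by 10 when the error plateaus.

```python
from typing import List, Tuple


def sgd_momentum_step(
    weights: List[float],
    grads: List[float],
    velocity: List[float],
    lr: float,
    momentum: float = 0.9,
    weight_decay: float = 1e-4,
) -> Tuple[List[float], List[float]]:
    """One SGD step; weight decay is added to the gradient before the momentum update."""
    new_velocity = [
        momentum * v + (g + weight_decay * w)
        for w, g, v in zip(weights, grads, velocity)
    ]
    new_weights = [w - lr * v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


class PlateauDecay:
    """Divide the learning rate by 10 after `patience` evaluations without improvement."""

    def __init__(self, lr: float = 0.1, factor: float = 0.1, patience: int = 2):
        self.lr, self.factor, self.patience = lr, factor, patience
        self.best = float("inf")
        self.stale = 0

    def update(self, val_error: float) -> float:
        if val_error < self.best:
            self.best, self.stale = val_error, 0
        else:
            self.stale += 1
            if self.stale >= self.patience:
                self.lr *= self.factor
                self.stale = 0
        return self.lr


schedule = PlateauDecay()
for err in [30.0, 28.0, 28.0, 28.0, 27.5]:
    lr = schedule.update(err)
print(lr)  # about 0.01 after one plateau
```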

## ILSVRC 2015 and superhuman ImageNet

The headline result of the paper was the 2015 [ImageNet](/wiki/imagenet) Large Scale Visual Recognition Challenge. An ensemble of six ResNet models of varying depth produced a top-5 error of **3.57%** on the ImageNet test set, winning the classification track.[^2] In the authors' words, "an ensemble of these residual nets achieves 3.57% error on the ImageNet test set," a result that "won the 1st place on the ILSVRC 2015 classification task."[^1] A single ResNet-152 achieved 4.49% top-5 on the validation set.

The estimated human top-5 error on ImageNet, measured by [Andrej Karpathy](/wiki/andrej_karpathy) when he trained himself as a one-person baseline against GoogLeNet, was about 5.1%.[^10] ResNet's 3.57% beat that human reference number by a wide margin; earlier 2015 results, including the same team's PReLU network at 4.94%, had only narrowly passed it.[^9] The progression of ILSVRC winners from 2012 to 2015 tells the story:

| Year | Winner | Top-5 error | Depth | Key idea |
|---|---|---|---|---|
| 2012 | AlexNet | 15.3% | 8 | GPU training, ReLU, dropout |
| 2013 | ZFNet | 11.7% | 8 | Visualization-guided tuning |
| 2014 | GoogLeNet | 6.7% | 22 | Inception modules |
| 2014 (2nd) | VGG | 7.3% | 19 | Stacked 3x3 filters |
| 2015 | **ResNet** | **3.57%** | **152** | **Residual connections** |

In three years, top-5 error had been cut by more than a factor of four, and the network depth had grown by an order of magnitude.

### COCO 2015 and detection sweep

ResNet did not just win classification. The MSRA team entered ResNet-based systems into the detection, localization, and segmentation tracks of ILSVRC 2015 and the 2015 COCO challenges, and won them all.[^1] The abstract attributes the detection gains directly to depth: "solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset."[^1] On COCO detection, replacing VGG-16 with ResNet-101 inside Faster R-CNN raised mean average precision (mAP@[.5, .95]) on the validation set from 21.2% to 27.2%, which is the 28% relative improvement quoted in the abstract. ResNet-based systems took first place in ImageNet detection (62.1% mAP), ImageNet localization, COCO detection, and COCO segmentation, confirming that residual learning was a backbone-level improvement rather than a classification-only trick.[^1]

## Why does ResNet work?

In the years after publication, a number of follow-up papers tried to explain *why* the simple identity shortcut had such an outsized effect.

### Gradient flow

For a single residual block `y = F(x) + x`, the gradient of the loss `L` with respect to the input is

```
dL/dx = dL/dy * (1 + dF/dx)
```

The `1` term is the contribution of the identity skip. During backpropagation through a deep stack of residual blocks, this gives a direct, unattenuated path for gradients from the top of the network to the bottom; even if `dF/dx` becomes small for many blocks, the additive `1` ensures the gradient does not vanish.[^11] This was the original explanation, and it generalizes the [vanishing gradient](/wiki/vanishing_gradient_problem) argument that motivated LSTM-style gating in earlier work.
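Chaining the single-block formula through many blocks gives a product of `(1 + dF/dx)` factors instead of a product of bare `dF/dx` factors. The scalar sketch below illustrates the difference; treating each block as a scalar with a fixed local derivative of 0.01 is a simplifying assumption for the example.

```python
from typing import List


def plain_gradient(local_grads: List[float]) -> float:
    """Backpropagated factor through plain layers: product of dF_i/dx."""
    g = 1.0
    for d in local_grads:
        g *= d
    return g


def residual_gradient(local_grads: List[float]) -> float:
    """Backpropagated factor through residual blocks: product of (1 + dF_i/dx)."""
    g = 1.0
    for d in local_grads:
        g *= 1.0 + d
    return g


depth = 50
local = [0.01] * depth
print(plain_gradient(local))     # on the order of 1e-100: the signal vanishes
print(residual_gradient(local))  # about 1.64: the signal survives
```

Expanding the residual product always yields a leading `1`, which corresponds to the path that passes through every identity shortcut.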

### Identity mappings (pre-activation analysis)

In a 2016 follow-up paper, *Identity Mappings in Deep Residual Networks*, He, Zhang, Ren, and Sun analyzed what happens when both the shortcut path *and* the after-addition activation are exact identity mappings.[^11] They showed that the forward signal can then be written as a clean recursive sum:

```
x_L = x_l + Σ_{i=l..L-1} F(x_i, W_i)
```

so any deeper feature `x_L` is the shallow feature `x_l` plus an explicit sum of residuals. Differentiating gives a gradient expression in which the identity term again propagates back unattenuated. The same paper introduced the *pre-activation* block ordering (BN -> ReLU -> conv, applied twice) so that the output of the addition is passed on unchanged rather than through a final ReLU. The modified architecture, sometimes called *ResNet-v2*, trained more stably at extreme depths: a 1001-layer ResNet-v2 reached 4.62% test error on CIFAR-10.[^11]
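The recursive sum can be confirmed numerically. The sketch below stacks five scalar residual functions (arbitrary scaled `tanh` functions, chosen only for illustration) with pure identity shortcuts and checks that `x_L` equals `x_l` plus the intermediate residuals.

```python
import math
from typing import Callable, List, Tuple


def make_residual_fn(scale: float) -> Callable[[float], float]:
    """Hypothetical residual function F_i(x) = scale * tanh(x)."""
    return lambda x: scale * math.tanh(x)


def forward(x0: float, fns: List[Callable[[float], float]]) -> Tuple[List[float], List[float]]:
    """Return the features [x_0, ..., x_L] and the residuals [F(x_0), ..., F(x_{L-1})]."""
    xs, residuals = [x0], []
    for f in fns:
        r = f(xs[-1])
        residuals.append(r)
        xs.append(xs[-1] + r)  # identity shortcut, no activation after the addition
    return xs, residuals


fns = [make_residual_fn(0.1 * (i + 1)) for i in range(5)]
xs, residuals = forward(0.7, fns)

l, L = 1, 5
lhs = xs[L]
rhs = xs[l] + sum(residuals[l:L])  # x_l + sum_{i=l}^{L-1} F(x_i)
print(abs(lhs - rhs) < 1e-12)      # True
```

The identity holds only because nothing is applied after each addition; inserting a ReLU there would break the telescoping sum.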

### Ensembles of paths

Veit, Wilber, and Belongie (NeurIPS 2016) argued that residual networks behave like *unrolled ensembles* of many shorter networks.[^12] An n-block ResNet has `2^n` possible paths through the network (each block can be skipped via its identity branch or traversed via `F`). They showed empirically that deleting individual residual blocks from a trained ResNet has little effect on accuracy, while deleting a layer from a plain VGG-like network is catastrophic. By this view, ResNet works partly because it implicitly ensembles a large number of shallow models.
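The path count is straightforward to enumerate for a small network. In this sketch each block contributes a binary choice (0 for the identity branch, 1 for the residual branch `F`), and the histogram groups paths by how many residual branches they traverse; the function name is illustrative.

```python
from collections import Counter
from itertools import product
from math import comb


def path_length_histogram(n_blocks: int) -> dict:
    """Count paths by the number of residual branches they pass through."""
    counts = Counter()
    for choice in product((0, 1), repeat=n_blocks):
        counts[sum(choice)] += 1
    return dict(sorted(counts.items()))


hist = path_length_histogram(4)
print(hist)                # {0: 1, 1: 4, 2: 6, 3: 4, 4: 1}
print(sum(hist.values()))  # 16, i.e. 2**4
```

The counts follow the binomial coefficients `comb(n, k)`, so most paths have intermediate length, well short of the full depth.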

### Smoother loss landscape

Li, Xu, Taylor, Studer, and Goldstein (NeurIPS 2018) visualized loss surfaces along filter-normalized random directions and showed that residual connections dramatically smooth the loss landscape relative to deep plain networks, which exhibit chaotic, highly non-convex surfaces.[^13] The smoother landscape is easier for SGD to traverse, which is a third complementary explanation alongside gradient flow and ensemble behavior.

## Pre-activation ResNet (ResNet-v2)

The 2016 follow-up paper *Identity Mappings in Deep Residual Networks* (He et al., arXiv:1603.05027, ECCV 2016) is usually the second paper everyone reads on this topic.[^11] Its main practical contribution is the rearranged block:

```
Original (v1):       Conv -> BN -> ReLU -> Conv -> BN -> add -> ReLU
Pre-activation (v2): BN -> ReLU -> Conv -> BN -> ReLU -> Conv -> add
```

In v2, the output of the addition is passed on unchanged (no ReLU on top), so deeper-to-shallower gradient flow along the shortcuts is completely unattenuated. Pre-activation ResNet became the default for very deep networks (1000+ layers on CIFAR) and is offered by several libraries as a separate "v2" model alongside the original. The two variants differ by only a permutation of BN/ReLU/conv ordering, but the analytical cleanness of v2 made it the canonical form for theoretical work.
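The practical difference between the two orderings shows up when the residual branch contributes nothing. The sketch below is deliberately crude: `conv` is a stand-in that scales its input by a single weight and `batch_norm` is an identity placeholder, both assumptions made so that the ordering is the only thing being compared.

```python
from typing import List

Vector = List[float]


def relu(v: Vector) -> Vector:
    return [max(0.0, a) for a in v]


def batch_norm(v: Vector) -> Vector:
    """Placeholder: real batch norm normalizes with batch statistics."""
    return list(v)


def conv(v: Vector, weight: float) -> Vector:
    """Stand-in for a convolution: scale every element by one weight."""
    return [weight * a for a in v]


def block_v1(x: Vector, w1: float, w2: float) -> Vector:
    """Conv -> BN -> ReLU -> Conv -> BN -> add -> ReLU."""
    out = relu(batch_norm(conv(x, w1)))
    out = batch_norm(conv(out, w2))
    return relu([o + xi for o, xi in zip(out, x)])


def block_v2(x: Vector, w1: float, w2: float) -> Vector:
    """BN -> ReLU -> Conv -> BN -> ReLU -> Conv -> add."""
    out = conv(relu(batch_norm(x)), w1)
    out = conv(relu(batch_norm(out)), w2)
    return [o + xi for o, xi in zip(out, x)]


x = [1.0, -2.0, 0.5]
print(block_v1(x, 0.0, 0.0))  # [1.0, 0.0, 0.5]
print(block_v2(x, 0.0, 0.0))  # [1.0, -2.0, 0.5]
```

With a zero residual branch, v1 still clips negative values through its final ReLU, whereas v2 returns the input exactly.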

## What is ResNet used for?

ResNet's most important practical role from 2016 onward was as a *backbone*: a pretrained feature extractor that other tasks could plug into. The same set of ResNet-50 weights, trained once on ImageNet, became the starting point for thousands of downstream computer vision systems.

### Object detection

Within a year of publication, the dominant two-stage detector, [Faster R-CNN](/wiki/faster_rcnn), had switched its VGG-16 backbone to ResNet-101, with a large jump in mAP on PASCAL VOC and COCO.[^14] In 2017, Lin, Dollar, Girshick, He, Hariharan, and Belongie introduced the Feature Pyramid Network (FPN), which built a multi-scale feature pyramid on top of ResNet's hierarchical feature maps. ResNet-FPN became the standard backbone for two-stage detectors throughout the late 2010s.[^15]

### Instance and semantic segmentation

Mask R-CNN (He, Gkioxari, Dollar, Girshick; ICCV 2017) extended Faster R-CNN with a per-instance mask branch and used ResNet-50-FPN or ResNet-101-FPN as the standard backbone. It outperformed the winners of the COCO 2016 challenges, received the Best Paper Award (Marr Prize) at ICCV 2017, and became the reference instance-segmentation framework for years.[^16] The DeepLab series (Chen et al., 2017-2018) used ResNet with *atrous* (dilated) convolutions and atrous spatial pyramid pooling for semantic segmentation, and shipped pretrained ResNet-based DeepLab models that anchored the segmentation literature.[^17] In Facebook's Detectron and Detectron2 frameworks, "ResNet-50-FPN" and "ResNet-101-FPN" remain default backbone choices.

### Transfer learning

For practitioners, the most important use of ResNet has always been [transfer learning](/wiki/transfer_learning): download ImageNet-pretrained ResNet-50, replace the final classifier, fine-tune on a domain dataset, ship. Medical imaging (chest X-ray classification, retinal disease detection), remote sensing, satellite imagery, manufacturing defect detection, plant disease classification, and many other specialized domains adopted this pattern. Even in domains visually unlike ImageNet, the hierarchical features learned by ResNet usually beat training from scratch when labeled data is limited.

### MLPerf and hardware benchmarking

ResNet-50 on ImageNet remains a standard MLPerf training benchmark. Hardware vendors from [NVIDIA](/wiki/nvidia) to Google (TPU) to Cerebras and Graphcore have reported ResNet-50 numbers since 2018, which makes performance directly comparable across generations. As a result, ResNet-50 functions as the unit cell of "how fast does this accelerator train a vision model"; many AI chips have been tuned in part around the operations ResNet-50 stresses (3x3 and 1x1 convolutions on batches of 256+).[^5]

## Followups and descendants

The success of ResNet kicked off a long line of "ResNet plus X" architectures that explored what aspects of the design could be improved.

### ResNeXt (2017)

[ResNeXt](/wiki/resnext) (Xie, Girshick, Dollar, Tu, He; CVPR 2017) added a new design axis called *cardinality*: instead of one 3x3 convolution in the middle of a bottleneck block, ResNeXt splits the operation into 32 parallel low-dimensional paths and sums their outputs.[^18] In practice this is implemented as a single *grouped convolution* with 32 groups. Xie et al. showed that increasing cardinality was a more efficient way to spend parameters than increasing depth or width. ResNeXt-101 (32x4d) became a popular detection backbone, and the grouped/depthwise convolution idea propagated into MobileNet and EfficientNet.
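Counting weights shows why cardinality is cheap. The sketch below compares a 256-channel ResNet bottleneck (64-channel core) with a 32x4d block whose grouped 3x3 convolution has 32 groups of 4 channels; the 256-channel input and output width is an assumption corresponding to the first bottleneck stage, and biases and batch norm are ignored.

```python
def conv_params(kernel: int, c_in: int, c_out: int, groups: int = 1) -> int:
    """Weights in a (possibly grouped) convolution, ignoring biases."""
    assert c_in % groups == 0 and c_out % groups == 0
    return kernel * kernel * (c_in // groups) * c_out


resnet_bottleneck = (
    conv_params(1, 256, 64)
    + conv_params(3, 64, 64)
    + conv_params(1, 64, 256)
)
resnext_32x4d = (
    conv_params(1, 256, 128)
    + conv_params(3, 128, 128, groups=32)  # 32 groups x 4 channels each
    + conv_params(1, 128, 256)
)

print(resnet_bottleneck)  # 69632
print(resnext_32x4d)      # 70144
```

The grouped block is twice as wide in the middle yet has nearly the same weight count, because each 3x3 filter sees only 4 input channels instead of 128.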

### Wide ResNet (2016)

Zagoruyko and Komodakis (BMVC 2016) argued that for CIFAR-scale problems, *width* was a better resource than depth.[^19] A 16-layer wide network with widening factor 8 matched the accuracy of a 1000-layer thin network, while training roughly 8x faster on the same hardware. Wide ResNets became the standard CIFAR baseline for years.

### DenseNet (2017)

[DenseNet](/wiki/densenet) (Huang, Liu, van der Maaten, Weinberger; CVPR 2017, Best Paper Award) replaced the additive residual `y = F(x) + x` with a *concatenative* skip pattern: each layer receives as input the concatenation of all preceding layers' feature maps within a block.[^20] Concatenation preserves more information than addition and encourages feature reuse, which lets DenseNet match ResNet's accuracy with fewer parameters. DenseNet was widely used in medical imaging.
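The structural difference between adding and concatenating shows up in the channel bookkeeping. The sketch below tracks the number of input channels seen by each layer; the starting width of 64 and the growth rate of 32 are illustrative values, not figures from the text above.

```python
def additive_widths(c0: int, n_layers: int) -> list:
    """Residual addition: every layer sees the same channel count."""
    return [c0 for _ in range(n_layers + 1)]


def dense_widths(c0: int, growth_rate: int, n_layers: int) -> list:
    """Concatenation: each layer appends `growth_rate` new channels."""
    return [c0 + i * growth_rate for i in range(n_layers + 1)]


print(additive_widths(64, 4))   # [64, 64, 64, 64, 64]
print(dense_widths(64, 32, 4))  # [64, 96, 128, 160, 192]
```

Because earlier feature maps stay available verbatim, each dense layer can be narrow, which is one way to read the parameter savings described above.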

### SE-ResNet (2018)

Squeeze-and-Excitation (SE) blocks (Hu, Shen, Sun; CVPR 2018) added a light-weight channel-attention module to each residual block, rescaling the output of the residual branch before it is added to the shortcut.[^21] An SE block applies global average pooling to compress spatial information, then learns per-channel gating weights via a tiny two-layer MLP. Plugging SE blocks into ResNet-50 produced SE-ResNet-50, which improved ImageNet top-5 error noticeably with very little extra compute. An SE-ResNeXt-152 variant from the same authors won the ILSVRC 2017 classification challenge, the final year ILSVRC ran in its classic form.
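The squeeze and excitation steps can be written out directly on nested lists. This is a minimal sketch: biases are omitted, the `[channels][height][width]` layout is an assumption, and the zero weight matrices are placeholders chosen so the result is easy to check by hand.

```python
import math
from typing import List

FeatureMap = List[List[List[float]]]  # [channels][height][width]
Matrix = List[List[float]]


def squeeze(x: FeatureMap) -> List[float]:
    """Global average pooling: one scalar per channel."""
    return [
        sum(sum(row) for row in plane) / (len(plane) * len(plane[0]))
        for plane in x
    ]


def excite(z: List[float], w1: Matrix, w2: Matrix) -> List[float]:
    """Two-layer MLP (ReLU, then sigmoid) producing one gate per channel."""
    hidden = [max(0.0, sum(w * a for w, a in zip(row, z))) for row in w1]
    return [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]


def se_block(x: FeatureMap, w1: Matrix, w2: Matrix) -> FeatureMap:
    """Rescale every channel of x by its learned gate."""
    gates = excite(squeeze(x), w1, w2)
    return [[[g * v for v in row] for row in plane] for g, plane in zip(gates, x)]


# 4 channels of size 2x2, reduced to 2 hidden units.
x = [[[float(c)] * 2 for _ in range(2)] for c in range(1, 5)]
w1 = [[0.0] * 4 for _ in range(2)]  # hidden x channels
w2 = [[0.0] * 2 for _ in range(4)]  # channels x hidden
print(squeeze(x))                   # [1.0, 2.0, 3.0, 4.0]
print(se_block(x, w1, w2)[3][0][0])  # 2.0, since sigmoid(0) = 0.5
```

The extra cost is two small matrix products on a vector of channel means, which is why the module adds little compute.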

### ResNet-RS and modernized training

Bello, Fedus, Du, Cubuk, Srinivas, Lin, Shlens, and Zoph (2021) revisited ResNet under modern training recipes.[^22] Using longer training schedules, label smoothing, stochastic depth, RandAugment, and weight EMA, they raised a canonical ResNet-200 from 79.0% to 82.2% top-1 accuracy through training changes alone, then to 83.4% with two small architectural tweaks. *ResNet-RS* models were faster on TPUs than EfficientNet at matched accuracy, suggesting that much of the apparent "gap" between ResNet and later architectures was actually a training-recipe gap.

### ConvNeXt (2022)

[ConvNeXt](/wiki/convnext) (Liu, Mao, Wu, Feichtenhofer, Darrell, Xie; CVPR 2022) modernized ResNet step by step using design choices borrowed from Vision Transformers: a patchified stem, inverted bottleneck blocks, depthwise convolutions with 7x7 kernels, [layer normalization](/wiki/layer_normalization) in place of batch normalization, GELU instead of ReLU, and Transformer-style training schedules.[^23] The result was a pure-convolutional network that matched [Swin Transformer](/wiki/swin_transformer) on ImageNet (87.8% top-1), COCO, and ADE20K. ConvNeXt is in many ways "what ResNet would look like if it were designed in 2022."

## Skip connections in other architectures

ResNet's most influential legacy is not a specific configuration but the *residual addition* primitive. Once He et al. showed that `y = F(x) + x` made gradient flow trivial in 100-layer networks, the same trick was added to nearly every subsequent deep architecture.

### U-Net

[U-Net](/wiki/u_net) (Ronneberger, Fischer, Brox; MICCAI 2015) had been published a few months *before* ResNet but used a different kind of skip: long-range concatenative connections from encoder layers to symmetric decoder layers, designed to preserve high-resolution spatial detail through a U-shaped network.[^24] U-Net's skip pattern is structural (across an encoder-decoder) rather than per-block, but it shares ResNet's intuition that giving signals a direct path from earlier to later parts of the network helps.

### Transformer

The [Transformer](/wiki/transformer) (Vaswani et al., NeurIPS 2017) is in many ways a residual network with attention.[^25] Each Transformer block has the structure

```
x' = LayerNorm(x + MultiHeadAttention(x))
x'' = LayerNorm(x' + FFN(x'))
```

The `x + ...` additions are exactly the ResNet residual pattern, applied around the attention sublayer and the feed-forward sublayer. Without them, the deep stacks of self-attention layers in modern LLMs would be very hard to optimize. Transformer-based language models such as [GPT](/wiki/gpt), [Claude](/wiki/claude), [Gemini](/wiki/gemini), and [LLaMA](/wiki/llama) inherit this pattern from the ResNet paper, although many later models apply the normalization before each sublayer rather than after the addition.
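The two formulas translate almost line for line into code. In the sketch below, `layer_norm` omits the learned gain and bias, and `toy_mixing` and `toy_ffn` are hypothetical stand-ins for `MultiHeadAttention` and `FFN`; only the residual-then-normalize wiring is meant to be representative.

```python
import math
from typing import Callable, List

Vector = List[float]


def layer_norm(v: Vector, eps: float = 1e-5) -> Vector:
    """Normalize to zero mean and unit variance (no learned gain or bias)."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]


def post_norm_residual(x: Vector, sublayer: Callable[[Vector], Vector]) -> Vector:
    """LayerNorm(x + Sublayer(x)): residual addition first, normalization second."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def toy_mixing(x: Vector) -> Vector:
    """Hypothetical stand-in for the attention sublayer."""
    mean = sum(x) / len(x)
    return [mean for _ in x]


def toy_ffn(x: Vector) -> Vector:
    """Hypothetical stand-in for the feed-forward sublayer."""
    return [max(0.0, a) for a in x]


def transformer_block(x: Vector) -> Vector:
    x1 = post_norm_residual(x, toy_mixing)  # x'  = LayerNorm(x + MultiHeadAttention(x))
    return post_norm_residual(x1, toy_ffn)  # x'' = LayerNorm(x' + FFN(x'))


y = transformer_block([0.5, -1.0, 2.0, 0.0])
print(len(y))               # 4
print(abs(sum(y)) < 1e-9)   # True: the output is re-centered by the final LayerNorm
```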

### Vision Transformer and ConvNeXt

[Vision Transformer](/wiki/vision_transformer) (ViT; Dosovitskiy et al., 2021) is a Transformer that operates on image patches; like the original Transformer, it places residual additions around every attention and MLP sublayer.[^26] ConvNeXt does the same in a convolutional setting.[^23] Even though both architectures otherwise look very different from ResNet, the residual connection survives unchanged.

### Other modern uses

Beyond classification, residual connections appear in essentially every modern deep architecture:

- **Diffusion models** (Stable Diffusion, DALL-E 2, Imagen) use ResNet-style blocks throughout their U-Net denoisers.
- **Speech recognition** systems such as Wav2Vec 2.0 and Whisper use residual connections in both their convolutional encoders and Transformer stacks.
- **Reinforcement learning** policy and value networks (AlphaGo Zero, MuZero) used deep ResNets trained on millions of self-play games.
- **Graph neural networks** apply residual connections to enable deeper message-passing.

## Modern legacy

Between 2016 and roughly 2020, ResNet (and its bottleneck variant ResNet-50 in particular) was the default backbone for almost every computer vision system. Starting around 2020-2021, Vision Transformers began to overtake ResNet at the very largest scales, and ConvNeXt-style modernized convnets caught up with Transformers at the same compute budgets. By 2026, state-of-the-art ImageNet accuracy is achieved by Vision Transformer or ConvNeXt variants rather than plain ResNet.

### Is ResNet still used in 2026?

Yes. ResNet has not disappeared, and ResNet-50 remains:

- a reference accuracy/efficiency baseline in nearly every new vision paper;
- the canonical MLPerf training benchmark, against which AI accelerators are evaluated;
- a default backbone in production deployments where Vision Transformers are still considered too memory-hungry or where well-understood inference latency is important;
- a teaching example in nearly every deep-learning course.

More importantly, the residual connection itself is now so basic that papers rarely cite it. Skip connections sit inside every Transformer block, every diffusion denoiser, and every modern speech model. The residual connection is, alongside [batch normalization](/wiki/batch_normalization) and the [attention mechanism](/wiki/attention), one of a small handful of architectural primitives invented in the 2010s that show no signs of being replaced.

## Citation impact

The ResNet paper is one of the most-cited works in the history of computer science. As of 2026, it has accumulated more than 250,000 citations on Google Scholar.[^4] In April 2025, a Nature analysis of the most-cited scientific papers of the 21st century, which measured academic citations across five major databases covering tens of millions of papers published since 2000, ranked *Deep Residual Learning for Image Recognition* as the single most-cited paper of the century across all fields, ahead of every other AI paper and of landmark works in physics, chemistry, and biology.[^28]

For comparison within AI: the [*Attention Is All You Need*](/wiki/attention_is_all_you_need) paper that introduced the Transformer (Vaswani et al., 2017) ranked seventh on the same Nature list, and the original AlexNet paper, *ImageNet Classification with Deep Convolutional Neural Networks* (Krizhevsky, Sutskever, Hinton, 2012), ranked eighth.[^25][^27][^28] ResNet exceeds these AI references and also surpasses landmark papers in other fields, such as the CRISPR-Cas9 papers.

## See also

- [Convolutional neural network](/wiki/convolutional_neural_network)
- [VGG](/wiki/vgg)
- [ImageNet](/wiki/imagenet)
- [Batch normalization](/wiki/batch_normalization)
- [DenseNet](/wiki/densenet)
- [ResNeXt](/wiki/resnext)
- [U-Net](/wiki/u_net)
- [Transformer](/wiki/transformer)
- [Vision Transformer](/wiki/vision_transformer)
- [ConvNeXt](/wiki/convnext)
- [Highway Networks](/wiki/highway_networks)
- [Kaiming He](/wiki/kaiming_he)
- [Microsoft Research Asia](/wiki/microsoft_research_asia)
- [Transfer learning](/wiki/transfer_learning)

## References

[^1]: He, K., Zhang, X., Ren, S., Sun, J. *Deep Residual Learning for Image Recognition*. arXiv:1512.03385 (Dec 10, 2015); CVPR 2016. https://arxiv.org/abs/1512.03385
[^2]: Russakovsky, O., Deng, J., Su, H., et al. *ImageNet Large Scale Visual Recognition Challenge*. IJCV, 2015. https://arxiv.org/abs/1409.0575
[^3]: CVPR 2016 Best Paper Award announcement. IEEE Computer Society. https://www.thecvf.com/?page_id=413
[^4]: Google Scholar entry for *Deep Residual Learning for Image Recognition*. https://scholar.google.com/scholar?q=Deep+Residual+Learning+for+Image+Recognition
[^5]: MLCommons. *MLPerf Training Benchmark Suite*. https://mlcommons.org/benchmarks/training/
[^6]: Simonyan, K., Zisserman, A. *Very Deep Convolutional Networks for Large-Scale Image Recognition*. ICLR 2015. https://arxiv.org/abs/1409.1556
[^7]: Szegedy, C., Liu, W., Jia, Y., et al. *Going Deeper with Convolutions*. CVPR 2015. https://arxiv.org/abs/1409.4842
[^8]: Srivastava, R. K., Greff, K., Schmidhuber, J. *Training Very Deep Networks*. NeurIPS 2015. https://arxiv.org/abs/1507.06228
[^9]: He, K., Zhang, X., Ren, S., Sun, J. *Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification*. ICCV 2015. https://arxiv.org/abs/1502.01852
[^10]: Karpathy, A. *What I learned from competing against a ConvNet on ImageNet*. 2014. https://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/
[^11]: He, K., Zhang, X., Ren, S., Sun, J. *Identity Mappings in Deep Residual Networks*. arXiv:1603.05027; ECCV 2016. https://arxiv.org/abs/1603.05027
[^12]: Veit, A., Wilber, M., Belongie, S. *Residual Networks Behave Like Ensembles of Relatively Shallow Networks*. NeurIPS 2016. https://arxiv.org/abs/1605.06431
[^13]: Li, H., Xu, Z., Taylor, G., Studer, C., Goldstein, T. *Visualizing the Loss Landscape of Neural Nets*. NeurIPS 2018. https://arxiv.org/abs/1712.09913
[^14]: Ren, S., He, K., Girshick, R., Sun, J. *Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks*. NeurIPS 2015. https://arxiv.org/abs/1506.01497
[^15]: Lin, T., Dollar, P., Girshick, R., He, K., Hariharan, B., Belongie, S. *Feature Pyramid Networks for Object Detection*. CVPR 2017. https://arxiv.org/abs/1612.03144
[^16]: He, K., Gkioxari, G., Dollar, P., Girshick, R. *Mask R-CNN*. ICCV 2017. https://arxiv.org/abs/1703.06870
[^17]: Chen, L., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A. *DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs*. IEEE TPAMI, 2018. https://arxiv.org/abs/1606.00915
[^18]: Xie, S., Girshick, R., Dollar, P., Tu, Z., He, K. *Aggregated Residual Transformations for Deep Neural Networks*. CVPR 2017. https://arxiv.org/abs/1611.05431
[^19]: Zagoruyko, S., Komodakis, N. *Wide Residual Networks*. BMVC 2016. https://arxiv.org/abs/1605.07146
[^20]: Huang, G., Liu, Z., van der Maaten, L., Weinberger, K. Q. *Densely Connected Convolutional Networks*. CVPR 2017. https://arxiv.org/abs/1608.06993
[^21]: Hu, J., Shen, L., Sun, G. *Squeeze-and-Excitation Networks*. CVPR 2018. https://arxiv.org/abs/1709.01507
[^22]: Bello, I., Fedus, W., Du, X., Cubuk, E. D., Srinivas, A., Lin, T., Shlens, J., Zoph, B. *Revisiting ResNets: Improved Training and Scaling Strategies*. NeurIPS 2021. https://arxiv.org/abs/2103.07579
[^23]: Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S. *A ConvNet for the 2020s*. CVPR 2022. https://arxiv.org/abs/2201.03545
[^24]: Ronneberger, O., Fischer, P., Brox, T. *U-Net: Convolutional Networks for Biomedical Image Segmentation*. MICCAI 2015. https://arxiv.org/abs/1505.04597
[^25]: Vaswani, A., Shazeer, N., Parmar, N., et al. *Attention Is All You Need*. NeurIPS 2017. https://arxiv.org/abs/1706.03762
[^26]: Dosovitskiy, A., Beyer, L., Kolesnikov, A., et al. *An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale*. ICLR 2021. https://arxiv.org/abs/2010.11929
[^27]: Krizhevsky, A., Sutskever, I., Hinton, G. E. *ImageNet Classification with Deep Convolutional Neural Networks*. NeurIPS 2012. https://papers.nips.cc/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html
[^28]: *Exclusive: the most-cited papers of the twenty-first century*. Nature, April 2025. https://www.nature.com/articles/d41586-025-01125-9

