ResNet
Last reviewed
May 18, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v5 · 4,495 words
ResNet (Residual Network) is a deep convolutional neural network architecture introduced by Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, then at Microsoft Research Asia (MSRA), in December 2015. The paper, "Deep Residual Learning for Image Recognition," introduced identity shortcut (or residual) connections that bypass two or three stacked layers and add the input directly to the output. This simple design addressed the degradation problem that had blocked the training of networks much deeper than 20-30 layers and made networks of 100+ layers practical for the first time.[1] ResNet-based entries took first place in the classification, detection, and localization tracks of the 2015 ImageNet Large Scale Visual Recognition Challenge (ILSVRC) and in the COCO 2015 detection and segmentation challenges, reaching a 3.57% top-5 error on ImageNet classification with an ensemble of residual networks up to 152 layers deep. The paper received the Best Paper Award at the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).[2][3] As of 2026, the ResNet paper has been cited more than 250,000 times on Google Scholar, making it among the most cited scientific papers of the 21st century.[4]
The architecture's defining unit is the residual block, which computes y = F(x, {W_i}) + x, where F is a small stack of weight layers and + x is an identity skip connection. The skip path adds no parameters and almost no computation, but it gives gradients a direct route from later layers back to earlier ones, which dramatically eases optimization. ResNet-50, the bottleneck variant most used in practice, became the de facto vision backbone of the late 2010s and remains a reference benchmark in MLPerf and a common baseline in papers that introduce new image models.[5] Skip connections themselves have outlasted any specific ResNet configuration: Transformers, large language models, and diffusion image generators almost universally place a residual addition around each block, a pattern popularized by the 2015 paper.
By 2015, convolutional neural networks had become the dominant tool for image recognition. The progression of ImageNet winners showed a clear trend toward depth: AlexNet (2012) had 8 layers, VGG (Simonyan and Zisserman, 2014) used 16 or 19 layers stacked from small 3x3 filters, and GoogLeNet (Szegedy et al., 2014) reached 22 layers using inception modules.[6][7] VGG in particular established the principle that small filters stacked deeply produced strong features for transfer learning.
In principle, a network of 50 layers should be at least as expressive as one of 20 layers: the extra 30 layers could simply learn the identity function and replicate the shallower network's behavior. Empirically, though, the opposite happened. As researchers added more layers to standard plain networks, accuracy first improved, then saturated, and then degraded rapidly.[1] The 56-layer plain network reported by He et al. on CIFAR-10 produced higher training error than a 20-layer plain network using the same recipe. Because the error grew on the training set itself, this was not overfitting; the deeper networks were failing to optimize, even with batch normalization in place to control activations.
He et al. called this the degradation problem and made it the central motivation of the paper. They argued that if a stack of layers could easily learn identity mappings, no degradation should appear. But standard nonlinear layers, initialized to small random weights and threaded through ReLUs, evidently struggled to express identity precisely. The residual reformulation removed that burden by making identity the default.
A few months earlier, Srivastava, Greff, and Schmidhuber had introduced Highway Networks, which used gated shortcut connections inspired by LSTM cells.[8] A Highway block computed y = T(x) * H(x) + (1 - T(x)) * x, where T(x) is a learned transform gate. ResNet's key simplification was to fix the gate to identity: the shortcut is always open, with no learned modulation. This made the architecture easier to train, easier to analyze, and slightly cheaper, and it turned out to scale better with depth than the gated variant.
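The difference between the two formulas can be seen in a toy scalar sketch. The `transform` function and the constant gate used below are hypothetical stand-ins for learned layers, not code from either paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def highway_block(x, h, gate_logit):
    """Highway: y = T(x) * H(x) + (1 - T(x)) * x, with T = sigmoid(gate_logit(x))."""
    t = sigmoid(gate_logit(x))
    return t * h(x) + (1.0 - t) * x

def residual_block(x, f):
    """ResNet: y = F(x) + x; the shortcut is always open, with no gate."""
    return f(x) + x

# Hypothetical toy transform standing in for the stacked weight layers.
transform = lambda x: 0.5 * x

# With a gate logit of 0, T = 0.5, so half of the input is mixed away.
print(highway_block(2.0, transform, lambda x: 0.0))  # 1.5
# The residual block always passes the full input through.
print(residual_block(2.0, transform))                # 3.0
```

In the Highway block the share of the input that survives depends on a learned gate; in the residual block it is fixed at one.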
All four authors of the ResNet paper were at Microsoft Research Asia in Beijing when the work was submitted to arXiv on December 10, 2015 (arXiv:1512.03385).[1] The paper was accepted at CVPR 2016 and presented in Las Vegas in June 2016, where it received the Best Paper Award.[3]
| Author | Role at MSRA (2015) | Later affiliations |
|---|---|---|
| Kaiming He | Lead author | Facebook AI Research (2016), then MIT EECS faculty (2024) |
| Xiangyu Zhang | Co-author | MEGVII (Face++) Research |
| Shaoqing Ren | Co-author | Co-author of Faster R-CNN; autonomous driving industry |
| Jian Sun | Senior researcher | MEGVII Chief Scientist until his death in 2022 |
The same four-author team had previously produced the He initialization paper, which derived the variance-preserving initialization for ReLU networks and was itself a prerequisite for training the very deep models in ResNet.[9] Kaiming He also co-authored Faster R-CNN (with Ren and Sun), Mask R-CNN, MoCo, MAE, and a long sequence of other influential vision papers, making him one of the most-cited computer scientists alive.
The mathematical core of ResNet fits on a single line. Let x be the input to a block and H(x) be the desired mapping the block should compute. Instead of asking the stacked layers to learn H directly, He et al. asked them to learn the residual function
F(x, {W_i}) = H(x) - x
and to recover H(x) by adding back the input:
y = F(x, {W_i}) + x
If, for some block, the optimal H is close to the identity, the optimizer only needs to drive F toward zero, which is easy when F is a small stack of weight layers initialized near zero. Learning small deviations from identity is much easier than reconstructing identity from scratch through nonlinear ReLU layers.[1]
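A minimal sketch of why identity becomes the default, using plain Python lists instead of a tensor library. The two-layer form of F and the all-zero weights are assumptions chosen to make the effect obvious.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def linear(weights, v):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in weights]

def residual_block(x, w1, w2):
    """y = F(x) + x, with F(x) = W2 * relu(W1 * x)."""
    f = linear(w2, relu(linear(w1, x)))
    return [fi + xi for fi, xi in zip(f, x)]

def plain_block(x, w1, w2):
    """The same layers without the shortcut: y = W2 * relu(W1 * x)."""
    return linear(w2, relu(linear(w1, x)))

zeros = [[0.0, 0.0], [0.0, 0.0]]
x = [1.5, -2.0]
print(residual_block(x, zeros, zeros))  # [1.5, -2.0]: exact identity
print(plain_block(x, zeros, zeros))     # [0.0, 0.0]: the input is lost
```

With F at zero the residual block reproduces its input exactly, whereas the plain block would have to learn specific nonzero weights to approximate the same mapping.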
When the input x and the output of F have the same dimensions, the addition is element-wise and the shortcut adds zero parameters. When the channel count changes (at the start of each new stage) or the spatial resolution changes (when stride 2 is used), a linear projection W_s is applied to the skip path:
y = F(x, {W_i}) + W_s * x
In practice W_s is implemented as a 1x1 convolution, with stride 2 wherever the spatial resolution is halved. The original paper tested three shortcut options: (A) zero-padding for extra channels, (B) projections only when dimensions change, and (C) projections everywhere. Option B is the standard; option C gave only a marginal gain at extra cost.[1]
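A rough sketch of shortcut options A and B on a feature map stored as nested lists `[channel][row][col]`. The function names and the list layout are assumptions made for illustration; real implementations use a tensor library.

```python
def subsample(x, stride):
    """Keep every stride-th row and column of each channel."""
    return [[row[::stride] for row in ch[::stride]] for ch in x]

def shortcut_zero_pad(x, out_channels, stride=2):
    """Option A: subsample, then append all-zero channels (no parameters)."""
    sub = subsample(x, stride)
    h, w = len(sub[0]), len(sub[0][0])
    extra = [[[0.0] * w for _ in range(h)] for _ in range(out_channels - len(x))]
    return sub + extra

def shortcut_projection(x, w_s, stride=2):
    """Option B: 1x1 convolution W_s ([out][in]) evaluated at strided positions."""
    sub = subsample(x, stride)
    h, w = len(sub[0]), len(sub[0][0])
    return [[[sum(w_s[o][c] * sub[c][i][j] for c in range(len(sub)))
              for j in range(w)] for i in range(h)]
            for o in range(len(w_s))]

# One 4x4 input channel mapped to two 2x2 output channels.
x = [[[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0],
      [9.0, 10.0, 11.0, 12.0], [13.0, 14.0, 15.0, 16.0]]]
print(shortcut_zero_pad(x, out_channels=2))
print(shortcut_projection(x, w_s=[[2.0], [-1.0]]))
```

Both variants produce a tensor with the shape that F outputs, so the element-wise addition is well defined; only option B introduces learned weights.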
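ResNet has two block templates: the basic block, which stacks two 3x3 convolutions and is used in ResNet-18 and ResNet-34, and the bottleneck block, which stacks a 1x1 reduction, a 3x3 convolution, and a 1x1 expansion and is used in ResNet-50 and deeper models.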
The bottleneck design is the trick that makes ResNet-50 only marginally more expensive than ResNet-34 despite the extra layers. Most parameters and FLOPs of a deep CNN sit in 3x3 convolutions on wide feature maps; the 1x1 reductions move computation into a much narrower 256/4 = 64-channel core where the 3x3 is cheap, then expand back.[1]
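The saving can be checked with simple weight counting. The sketch below ignores biases and batch normalization, and the full-width comparison block is a hypothetical construction used only to show the gap.

```python
def conv_weights(kernel, c_in, c_out):
    """Weight count of a kernel x kernel convolution, ignoring biases."""
    return kernel * kernel * c_in * c_out

def bottleneck_weights(channels=256, reduction=4):
    mid = channels // reduction                  # 256 / 4 = 64
    return (conv_weights(1, channels, mid)       # 1x1 reduce
            + conv_weights(3, mid, mid)          # 3x3 on the narrow core
            + conv_weights(1, mid, channels))    # 1x1 expand

def full_width_weights(channels=256):
    """Hypothetical block of two 3x3 convolutions kept at full width."""
    return 2 * conv_weights(3, channels, channels)

print(bottleneck_weights())   # 69632
print(full_width_weights())   # 1179648
```

Under these assumptions the three-layer bottleneck uses roughly 17 times fewer weights than two full-width 3x3 convolutions on the same 256-channel input.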
The original paper introduced five ImageNet configurations, all sharing a common stem (a 7x7 conv with stride 2, 64 channels, followed by 3x3 max pooling with stride 2) and four stages operating at progressively halved spatial resolution (56x56 → 28x28 → 14x14 → 7x7) with doubled channel counts.
| Model | Block | Layers | Blocks per stage | Params | FLOPs (multiply-adds) | ImageNet top-1 error | ImageNet top-5 error |
|---|---|---|---|---|---|---|---|
| ResNet-18 | Basic | 18 | 2, 2, 2, 2 | 11.7M | 1.8G | 30.2% | 10.9% |
| ResNet-34 | Basic | 34 | 3, 4, 6, 3 | 21.8M | 3.6G | 26.7% | 8.6% |
| ResNet-50 | Bottleneck | 50 | 3, 4, 6, 3 | 25.6M | 3.8G | 24.0% | 7.1% |
| ResNet-101 | Bottleneck | 101 | 3, 4, 23, 3 | 44.5M | 7.6G | 22.4% | 6.2% |
| ResNet-152 | Bottleneck | 152 | 3, 8, 36, 3 | 60.2M | 11.3G | 21.7% | 5.7% |
Numbers are from the original paper (single-crop validation error).[1] The layer counts include the initial 7x7 conv, all conv layers inside residual blocks, and the final fully connected classifier; batch normalization layers, the max pool, and global average pool are not counted. Notably, ResNet-152 is still cheaper than VGG-16 (15.3 GFLOPs) or VGG-19 (19.6 GFLOPs) despite being roughly eight times deeper.[1]
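The counting rule can be checked against the table. The sketch assumes two convolutions per basic block and three per bottleneck block, and, following the rule stated above, counts only the stem convolution, the convolutions inside residual blocks, and the final classifier.

```python
CONVS_PER_BLOCK = {"basic": 2, "bottleneck": 3}

def counted_layers(block, blocks_per_stage):
    """Stem conv + convs inside residual blocks + final fully connected layer."""
    return 1 + CONVS_PER_BLOCK[block] * sum(blocks_per_stage) + 1

configs = {
    "ResNet-18": ("basic", (2, 2, 2, 2)),
    "ResNet-34": ("basic", (3, 4, 6, 3)),
    "ResNet-50": ("bottleneck", (3, 4, 6, 3)),
    "ResNet-101": ("bottleneck", (3, 4, 23, 3)),
    "ResNet-152": ("bottleneck", (3, 8, 36, 3)),
}

for name, (block, stages) in configs.items():
    print(name, counted_layers(block, stages))
```

Each computed count matches the number in the model's name, which is where the names come from.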
A standard ResNet-50 ends with global average pooling and a single fully-connected layer that maps the 2048-dimensional pooled feature to 1000 ImageNet logits, followed by softmax.
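The path from input image to logits can be traced with shape arithmetic alone. This sketch assumes a 224x224 input, padding of 3 for the stem and 1 for the max pool (common implementation choices, not stated above), and the usual bottleneck stage widths of 256, 512, 1024, and 2048 channels.

```python
def window_out(size, kernel, stride, padding):
    """Output side length of a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1

def resnet50_trace(input_size=224, num_classes=1000):
    trace = []
    s = window_out(input_size, kernel=7, stride=2, padding=3)
    trace.append(("stem conv", 64, s))
    s = window_out(s, kernel=3, stride=2, padding=1)
    trace.append(("max pool", 64, s))
    channels = 256
    for stage in range(1, 5):
        if stage > 1:
            # Each later stage halves resolution with a stride-2 layer
            # and doubles the channel count.
            s = window_out(s, kernel=1, stride=2, padding=0)
            channels *= 2
        trace.append((f"stage {stage}", channels, s))
    trace.append(("global avg pool", channels, 1))
    fc_params = channels * num_classes + num_classes  # weights + biases
    return trace, fc_params

trace, fc_params = resnet50_trace()
for name, channels, side in trace:
    print(f"{name:16s} {channels:5d} channels  {side}x{side}")
print("classifier parameters:", fc_params)  # 2049000
```

Under these assumptions the classifier accounts for about 2 million of the network's parameters; the rest sit in the convolutional stages.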
The original models were trained on ImageNet with stochastic gradient descent (momentum 0.9), weight decay 1e-4, mini-batch size 256, and an initial learning rate of 0.1 divided by 10 whenever the error plateaued, for up to 600,000 iterations (roughly 120 epochs at this batch size). Standard data augmentation included random crops, horizontal flips, per-pixel mean subtraction, and AlexNet-style color augmentation.[1] Weights were initialized with the He scheme from the earlier 2015 paper.[9]
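A minimal sketch of a divide-by-10-on-plateau schedule. The `patience` parameter is hypothetical: the recipe above does not say how a plateau was detected, so the rule below is only one plausible reading.

```python
class PlateauStepLR:
    """Divide the learning rate by 10 when the error stops improving."""

    def __init__(self, lr=0.1, factor=0.1, patience=3):
        self.lr = lr
        self.factor = factor
        self.patience = patience   # hypothetical: checks allowed without improvement
        self.best = float("inf")
        self.stale = 0

    def step(self, val_error):
        """Record one validation result and return the learning rate to use next."""
        if val_error < self.best:
            self.best = val_error
            self.stale = 0
        else:
            self.stale += 1
            if self.stale >= self.patience:
                self.lr *= self.factor
                self.stale = 0
        return self.lr

sched = PlateauStepLR(lr=0.1, patience=2)
for err in [0.50, 0.40, 0.40, 0.41, 0.39]:
    print(err, sched.step(err))
```

In this run the rate drops from 0.1 to about 0.01 after two checks without improvement, then stays there when the error improves again.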
The headline result of the paper was the 2015 ImageNet Large Scale Visual Recognition Challenge. An ensemble of six ResNet models of varying depth produced a top-5 error of 3.57% on the ImageNet test set, winning the classification track.[2] A single ResNet-152 achieved 4.49% top-5 on the validation set.
The estimated human top-5 error on ImageNet, measured by Andrej Karpathy when he trained himself as a one-person baseline against GoogLeNet in 2014, was about 5.1%.[10] Earlier 2015 results, including the same team's PReLU network at 4.94%, had already edged below that reference; ResNet's 3.57% passed it by a wide margin. The progression of ILSVRC winners over four years tells the story:
| Year | Winner | Top-5 error | Depth | Key idea |
|---|---|---|---|---|
| 2012 | AlexNet | 15.3% | 8 | GPU training, ReLU, dropout |
| 2013 | ZFNet | 11.7% | 8 | Visualization-guided tuning |
| 2014 | GoogLeNet | 6.7% | 22 | Inception modules |
| 2014 (2nd) | VGG | 7.3% | 19 | Stacked 3x3 filters |
| 2015 | ResNet | 3.57% | 152 | Residual connections |
In three years, top-5 error had been cut by more than a factor of four, and the network depth had grown by an order of magnitude.
ResNet did not just win classification. The MSRA team's ResNet-based systems also took first place in ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.[1] On COCO detection, replacing VGG-16 with ResNet-101 inside Faster R-CNN gave a 28% relative improvement in mean average precision (mAP@[.5, .95]), and the competition entry reached 37.3%. The ImageNet detection entry reached 62.1% mAP. Together these results confirmed that residual learning was a backbone-level improvement rather than a classification-only trick.[1]
In the years after publication, a number of follow-up papers tried to explain why the simple identity shortcut had such an outsized effect.
For a single residual block y = F(x) + x, the gradient of the loss L with respect to the input is
dL/dx = dL/dy * (1 + dF/dx)
The 1 term is the contribution of the identity skip. During backpropagation through a deep stack of residual blocks, this gives a direct, unattenuated path for gradients from the top of the network to the bottom; even if dF/dx becomes small for many blocks, the additive 1 ensures the gradient does not vanish.[11] This was the original explanation, and it generalizes the vanishing gradient argument that motivated LSTM-style gating in earlier work.
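A toy scalar illustration of the argument, assuming every block has the same small local derivative dF/dx = 0.1. Real networks have matrix-valued Jacobians whose entries vary in sign and size, so this shows only the qualitative effect.

```python
def backprop_scale(local_grads, residual):
    """Multiply per-block derivatives from the top of the stack to the bottom."""
    scale = 1.0
    for g in local_grads:
        # Residual block: d(F(x) + x)/dx = 1 + dF/dx. Plain block: dF/dx.
        scale *= (1.0 + g) if residual else g
    return scale

depth = 50
small = [0.1] * depth  # assumed local derivative of every block

print(backprop_scale(small, residual=False))  # about 1e-50: the gradient vanishes
print(backprop_scale(small, residual=True))   # about 117: the gradient survives
```

Setting every local derivative to zero makes the contrast starker: the plain product is exactly 0, while the residual product is exactly 1.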
In a 2016 follow-up paper, Identity Mappings in Deep Residual Networks, He, Zhang, Ren, and Sun analyzed what happens when both the shortcut path and the after-addition activation are exact identity mappings.[11] They showed that the forward signal can then be written as a clean recursive sum:
x_L = x_l + Σ_{i=l..L-1} F(x_i, W_i)
so any deeper feature x_L is the shallow feature x_l plus an explicit sum of residuals. Differentiating gives a gradient expression in which the identity term again propagates back unattenuated. The same paper introduced the pre-activation block ordering (BN → ReLU → conv, applied twice) so that the output of the addition is passed to the next block unchanged rather than through a final ReLU. The modified architecture, sometimes called ResNet-v2, trained more stably at extreme depths: a 1001-layer ResNet-v2 reached 4.62% test error on CIFAR-10.[11]
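Written out with the chain rule, in the same symbols as the recursive sum and with \mathcal{L} standing for the loss, the gradient expression is:

```latex
\frac{\partial \mathcal{L}}{\partial x_l}
  = \frac{\partial \mathcal{L}}{\partial x_L}
    \left( 1 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} F(x_i, W_i) \right)
```

The leading 1 carries the gradient at x_L straight back to x_l without passing through any weight layer; the sum carries the part that flows through the residual branches.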
Veit, Wilber, and Belongie (NeurIPS 2016) argued that residual networks behave like unrolled ensembles of many shorter networks.[12] An n-block ResNet has 2^n possible paths through the network (each block can be skipped via its identity branch or traversed via F). They showed empirically that deleting individual residual blocks from a trained ResNet has little effect on accuracy, while deleting a layer from a plain VGG-like network is catastrophic. By this view, ResNet works partly because it implicitly ensembles a large number of shallow models.
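The path count follows from a binomial argument, sketched below: a path is fixed by choosing which blocks to traverse, so the number of paths that pass through exactly k residual branches is C(n, k). The function name is an illustrative choice.

```python
from math import comb

def path_length_counts(n_blocks):
    """Number of paths traversing exactly k residual branches, for k = 0..n."""
    return [comb(n_blocks, k) for k in range(n_blocks + 1)]

counts = path_length_counts(4)
print(counts)       # [1, 4, 6, 4, 1]
print(sum(counts))  # 16, which equals 2**4
```

Because the binomial distribution peaks at n/2, most paths are much shorter than the full depth of the network.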
Li, Xu, Taylor, Studer, and Goldstein (NeurIPS 2018) visualized loss surfaces along filter-normalized random directions and showed that residual connections dramatically smooth the loss landscape relative to deep plain networks, which exhibit chaotic, non-convex surfaces.[13] The smoother landscape is easier for SGD to traverse, which is a third complementary explanation alongside gradient flow and ensemble behavior.
The main practical contribution of the 2016 follow-up paper Identity Mappings in Deep Residual Networks (He et al., arXiv:1603.05027, ECCV 2016) is the rearranged block:[11]
Original (v1): Conv -> BN -> ReLU -> Conv -> BN -> add -> ReLU
Pre-activation (v2): BN -> ReLU -> Conv -> BN -> ReLU -> Conv -> add
In v2, the addition output is pure identity (no ReLU on top), so the identity term carries gradients from deeper to shallower blocks unattenuated. Pre-activation ResNet became the default for very deep networks (1000+ layers on CIFAR), and some libraries ship it as a separate "ResNet v2" variant. The two variants differ only in the ordering of BN, ReLU, and conv, but the analytical cleanness of v2 made it the canonical form for theoretical work.
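A toy scalar sketch of the two orderings. It omits batch normalization and treats each convolution as multiplication by a single weight, both strong simplifications made only to isolate the effect of the final ReLU.

```python
def relu(a):
    return max(0.0, a)

def block_v1(x, w1, w2):
    """Original ordering: conv -> ReLU -> conv -> add -> ReLU (BN omitted)."""
    return relu(w2 * relu(w1 * x) + x)

def block_v2(x, w1, w2):
    """Pre-activation ordering: ReLU -> conv -> ReLU -> conv -> add (BN omitted)."""
    return x + w2 * relu(w1 * relu(x))

# With zero residual weights, each block should ideally act as the identity.
print(block_v1(-1.0, 0.0, 0.0))  # 0.0: the ReLU after the addition clips the shortcut
print(block_v2(-1.0, 0.0, 0.0))  # -1.0: the shortcut passes through untouched
```

In the v1 ordering the shortcut signal still passes through a nonlinearity at every block, so the skip path is not a pure identity across the stack; in v2 it is.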
ResNet's most important practical role from 2016 onward was as a backbone: a pretrained feature extractor that other tasks could plug into. The same set of ResNet-50 weights, trained once on ImageNet, became the starting point for thousands of downstream computer vision systems.
Within a year of publication, the dominant two-stage detector, Faster R-CNN, had switched its VGG-16 backbone to ResNet-101, with a large jump in mAP on PASCAL VOC and COCO.[14] In 2017, Lin, Dollar, Girshick, He, Hariharan, and Belongie introduced the Feature Pyramid Network (FPN), which built a multi-scale feature pyramid on top of ResNet's hierarchical feature maps. ResNet-FPN became the standard backbone for two-stage detectors throughout the late 2010s.[15]
Mask R-CNN (He, Gkioxari, Dollar, Girshick; ICCV 2017) extended Faster R-CNN with a per-instance mask branch and used ResNet-50-FPN or ResNet-101-FPN as the standard backbone. It outperformed the winners of the COCO 2016 challenges and became the reference instance-segmentation framework for years.[16] The DeepLab series (Chen et al., 2017-2018) used ResNet with atrous (dilated) convolutions and atrous spatial pyramid pooling for semantic segmentation, and shipped pretrained ResNet-based DeepLab models that anchored the segmentation literature.[17] In Facebook's Detectron and Detectron2 frameworks, "ResNet-50-FPN" and "ResNet-101-FPN" remain default backbone choices.
For practitioners, the most important use of ResNet has always been transfer learning: download ImageNet-pretrained ResNet-50, replace the final classifier, fine-tune on a domain dataset, ship. Medical imaging (chest X-ray classification, retinal disease detection), remote sensing, satellite imagery, manufacturing defect detection, plant disease classification, and many other specialized domains adopted this pattern. Even in domains visually unlike ImageNet, the hierarchical features learned by ResNet usually beat training from scratch when labeled data is limited.
ResNet-50 on ImageNet remains a standard MLPerf training benchmark. Hardware vendors from NVIDIA to Google (TPU) to Cerebras and Graphcore have reported ResNet-50 numbers since 2018, which makes performance directly comparable across generations. As a result, ResNet-50 functions as the unit cell of "how fast does this accelerator train a vision model"; many AI chips have been tuned in part around the operations ResNet-50 stresses (3x3 and 1x1 convolutions on batches of 256+).[5]
The success of ResNet kicked off a long line of "ResNet plus X" architectures that explored what aspects of the design could be improved.
ResNeXt (Xie, Girshick, Dollar, Tu, He; CVPR 2017) added a new design axis called cardinality: instead of one dense 3x3 convolution in the middle of a bottleneck block, ResNeXt splits the transformation into 32 parallel, narrower branches and sums their outputs.[18] This is equivalent to a single grouped 3x3 convolution with 32 groups. Xie et al. showed that increasing cardinality was a more efficient way to spend parameters than increasing depth or width. ResNeXt-101 (32x4d) became a popular detection backbone, and grouped and depthwise convolutions are also central to MobileNet and EfficientNet.
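Weight counting shows how cardinality trades against width. The sketch below compares a 256-channel ResNet bottleneck with a 64-channel core against a ResNeXt 32x4d block with a 128-channel core split into 32 groups; biases and batch normalization are ignored, and the helper names are illustrative.

```python
def grouped_conv_weights(kernel, c_in, c_out, groups=1):
    """Each output channel sees only c_in / groups input channels."""
    assert c_in % groups == 0 and c_out % groups == 0
    return kernel * kernel * (c_in // groups) * c_out

def block_weights(channels, mid, groups=1):
    return (grouped_conv_weights(1, channels, mid)           # 1x1 reduce
            + grouped_conv_weights(3, mid, mid, groups)      # 3x3, possibly grouped
            + grouped_conv_weights(1, mid, channels))        # 1x1 expand

print(block_weights(256, 64))              # 69632  ResNet bottleneck
print(block_weights(256, 128, groups=32))  # 70144  ResNeXt 32x4d block
```

Grouping makes the 3x3 convolution 32 times cheaper per channel pair, which pays for a core that is twice as wide at almost the same weight budget.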
Zagoruyko and Komodakis (BMVC 2016) argued that for CIFAR-scale problems, width was a better resource than depth.[19] A 16-layer wide network with widening factor 8 matched the accuracy of a 1000-layer thin network while training several times faster. Wide ResNets became the standard CIFAR baseline for years.
DenseNet (Huang, Liu, van der Maaten, Weinberger; CVPR 2017, Best Paper Award) replaced the additive residual y = F(x) + x with a concatenative skip pattern: each layer receives as input the concatenation of all preceding layers' feature maps within a block.[20] Concatenation preserves more information than addition and encourages feature reuse, which lets DenseNet match ResNet's accuracy with fewer parameters. DenseNet was widely used in medical imaging.
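The structural difference shows up in channel bookkeeping. In the sketch below, the growth rate (the number of feature maps each layer contributes) and the example values of 64 input channels, growth rate 32, and 6 layers are assumptions chosen for illustration.

```python
def additive_channels(c0, n_layers):
    """Additive skips (ResNet style): the channel count stays fixed."""
    return [c0] * (n_layers + 1)

def concat_channels(c0, growth_rate, n_layers):
    """Concatenative skips (DenseNet style): layer l sees c0 + l * growth_rate channels."""
    return [c0 + l * growth_rate for l in range(n_layers + 1)]

print(additive_channels(64, 6))    # [64, 64, 64, 64, 64, 64, 64]
print(concat_channels(64, 32, 6))  # [64, 96, 128, 160, 192, 224, 256]
```

Each dense layer can therefore be narrow, since every earlier feature map stays directly available to it instead of being folded into a running sum.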
Squeeze-and-Excitation (SE) blocks (Hu, Shen, Sun; CVPR 2018) added a lightweight channel-attention module to each residual block, rescaling the residual branch before the addition.[21] An SE block applies global average pooling to compress spatial information, then learns per-channel gating weights via a tiny two-layer MLP. Plugging SE blocks into ResNet-50 produced SE-ResNet-50, which improved ImageNet top-5 error noticeably with very little extra compute. An SE-ResNeXt-152 variant from the same authors won the ILSVRC 2017 classification challenge, the final year ILSVRC ran in its classic form.
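A minimal sketch of the squeeze, excitation, and scale steps on a feature map stored as nested lists `[channel][row][col]`. Biases are omitted, and the weight shapes and function name are assumptions made for illustration.

```python
import math

def se_scale(x, w1, w2):
    """x: [C][H][W]; w1: [C/r][C]; w2: [C][C/r], where r is the reduction ratio."""
    # Squeeze: global average pool, one number per channel.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]
    # Excitation: FC -> ReLU -> FC -> sigmoid gives one gate per channel.
    hidden = [max(0.0, sum(w * a for w, a in zip(row, z))) for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * a for w, a in zip(row, hidden))))
             for row in w2]
    # Scale: multiply every value in a channel by that channel's gate.
    return [[[g * v for v in row] for row in ch] for g, ch in zip(gates, x)]

x = [[[2.0, 4.0]], [[6.0, 8.0]]]   # 2 channels, each 1x2
w1 = [[1.0, 1.0]]                  # squeeze 2 channels down to 1 hidden unit
w2 = [[0.0], [0.0]]                # zero weights, so every gate is sigmoid(0) = 0.5
print(se_scale(x, w1, w2))         # [[[1.0, 2.0]], [[3.0, 4.0]]]
```

The extra cost is two small matrix multiplications per block, independent of spatial resolution, which is why the module adds so little compute.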
Bello, Fedus, Du, Cubuk, Srinivas, Lin, Shlens, and Zoph (2021) revisited ResNet under modern training recipes.[22] Using long training schedules, label smoothing, stochastic depth, RandAugment, and weight EMA, they pushed a ResNet-200 from 79.0% to 82.2% top-1 accuracy through training changes alone, then to 83.4% with two small architectural tweaks. ResNet-RS models were faster on TPUs than EfficientNet at matched accuracy, suggesting that much of the apparent "gap" between ResNet and later architectures was actually a training-recipe gap.
ConvNeXt (Liu, Mao, Wu, Feichtenhofer, Darrell, Xie; CVPR 2022) modernized ResNet step by step using design choices borrowed from Vision Transformers: a patchified stem, inverted bottleneck blocks, depthwise convolutions with 7x7 kernels, layer normalization in place of batch normalization, GELU instead of ReLU, and Transformer-style training schedules.[23] The result was a pure-convolutional network that matched Swin Transformer on ImageNet (87.8% top-1), COCO, and ADE20K. ConvNeXt is in many ways "what ResNet would look like if it were designed in 2022."
ResNet's most influential legacy is not a specific configuration but the residual addition primitive. Once He et al. showed that y = F(x) + x made gradient flow trivial in 100-layer networks, the same trick was added to nearly every subsequent deep architecture.
U-Net (Ronneberger, Fischer, Brox; MICCAI 2015) had been published a few months before ResNet but used a different kind of skip: long-range concatenative connections from encoder layers to symmetric decoder layers, designed to preserve high-resolution spatial detail through a U-shaped network.[24] U-Net's skip pattern is structural (across an encoder-decoder) rather than per-block, but it shares ResNet's intuition that giving signals a direct path from earlier to later parts of the network helps.
The Transformer (Vaswani et al., NeurIPS 2017) is in many ways a residual network with attention.[25] Each Transformer block has the structure
x' = LayerNorm(x + MultiHeadAttention(x))
x'' = LayerNorm(x' + FFN(x'))
The x + ... additions are exactly the ResNet residual pattern, applied around the attention sublayer and the feed-forward sublayer. Without them, the deep stacks of self-attention layers in modern LLMs would not optimize. Every block of GPT, Claude, Gemini, and LLaMA inherits this design directly from the ResNet paper.
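A minimal sketch of the two-line block structure on a single token vector. The attention and feed-forward sublayers are passed in as arbitrary callables, and the layer normalization has no learned gain or bias; both are simplifying assumptions.

```python
import math

def layer_norm(v, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned parameters)."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def post_ln_block(x, attention, ffn):
    """x' = LayerNorm(x + Attention(x)); x'' = LayerNorm(x' + FFN(x'))."""
    x1 = layer_norm(add(x, attention(x)))   # residual around the attention sublayer
    return layer_norm(add(x1, ffn(x1)))     # residual around the feed-forward sublayer

# Hypothetical sublayers that output zeros: the block then only normalizes x.
zero = lambda v: [0.0] * len(v)
print(post_ln_block([1.0, 2.0, 3.0, 4.0], zero, zero))
```

When a sublayer contributes nothing, the input still reaches the output through the two additions, which is the same default-to-identity behavior as in a residual block.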
Vision Transformer (ViT; Dosovitskiy et al., 2021) is a Transformer that operates on image patches; like the original Transformer, it places residual additions around every attention and MLP sublayer.[26] ConvNeXt does the same in a convolutional setting.[23] Even though both architectures otherwise look very different from ResNet, the residual connection survives unchanged.
Beyond classification, residual connections appear in essentially every modern deep architecture, including Transformers, large language models, and diffusion image generators.
Between 2016 and roughly 2020, ResNet (and its bottleneck variant ResNet-50 in particular) was the default backbone for almost every computer vision system. Starting around 2020-2021, Vision Transformers began to overtake ResNet at the largest scales, and ConvNeXt-style modernized convnets caught up with Transformers at the same compute budgets. By 2026, state-of-the-art ImageNet accuracy is achieved by Vision Transformer or ConvNeXt variants rather than plain ResNet.
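But ResNet has not disappeared. ResNet-50 remains a standard MLPerf benchmark, a default backbone in frameworks such as Detectron2, and a common starting point for transfer learning.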
More importantly, the residual connection itself is now so basic that papers rarely cite it. Skip connections sit inside every Transformer block, every diffusion denoiser, every modern speech model. The residual connection is, alongside batch normalization and the attention mechanism, one of the small handful of architectural primitives invented in the 2010s that show no signs of being replaced.
The ResNet paper is one of the most-cited works in the history of computer science. As of 2026, it has accumulated more than 250,000 citations on Google Scholar.[4] A 2025 Nature analysis of the most-cited scientific papers of the 21st century, which aggregated rankings from Google Scholar, Semantic Scholar, Web of Science, Scopus, and Crossref, placed Deep Residual Learning for Image Recognition at or near the top across all fields.
For comparison within AI: the Attention Is All You Need paper that introduced the Transformer (Vaswani et al., 2017) is itself among the most cited papers in CS, with citation counts somewhat below ResNet's; Adam: A Method for Stochastic Optimization (Kingma and Ba, 2015) is similarly close behind; the original ImageNet classification paper (Krizhevsky, Sutskever, Hinton, 2012) holds a comparable position.[25][27] ResNet's citation count exceeds all of these and also surpasses landmark papers in adjacent fields, such as the CRISPR-Cas9 papers.