ResNet (Residual Network) is a deep learning architecture introduced by Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun of Microsoft Research in December 2015. The paper, titled "Deep Residual Learning for Image Recognition," won the Best Paper Award at CVPR 2016 and presented a solution to a fundamental problem in training very deep neural networks: the degradation problem [1]. By introducing skip connections (also called residual connections or shortcut connections), ResNet enabled the successful training of networks with over 100 layers, a feat that had previously been impractical. The architecture won the ILSVRC 2015 image classification challenge with a top-5 error rate of 3.57%, below the estimated human-level error of 5.1% [2]. As of 2025, the ResNet paper has accumulated over 250,000 citations on Google Scholar, making it the single most cited academic paper of the 21st century and the most influential publication in computer vision [3].
The four authors of the ResNet paper were all affiliated with Microsoft Research Asia (MSRA) in Beijing at the time of publication.
| Author | Role | Later Affiliations |
|---|---|---|
| Kaiming He | Lead researcher | Facebook AI Research (FAIR), MIT, Google DeepMind |
| Xiangyu Zhang | Co-author | Megvii (Face++) |
| Shaoqing Ren | Co-author | Megvii, autonomous driving |
| Jian Sun | Senior researcher | Megvii co-founder |
Kaiming He received his Bachelor of Science degree from Tsinghua University in 2007 and his PhD from the Chinese University of Hong Kong in 2011. He is also known for co-developing Faster R-CNN and for the He (Kaiming) weight initialization method, which became the standard initialization for networks using ReLU activations [15]. He later joined Facebook AI Research (FAIR), then became an associate professor at MIT, and as of 2025 also works as a Distinguished Scientist at Google DeepMind.
The ResNet paper was submitted to arXiv on December 10, 2015, and officially published at CVPR in June 2016. The CVPR 2016 Best Paper Award committee selected it from among hundreds of accepted papers, recognizing the work's fundamental contribution to the field.
Before ResNet, researchers observed an unexpected phenomenon when building deeper neural networks. Conventional wisdom suggested that adding more layers to a network should improve its representational capacity and, consequently, its accuracy. In practice, the opposite often occurred. Networks with more layers frequently produced higher training error than their shallower counterparts, even when given sufficient data and computation.
This was not caused by overfitting, since the error increased on the training set itself, not just on the validation set. The phenomenon became known as the degradation problem. Intuitively, a deeper network should be able to perform at least as well as a shallower one, because the extra layers could simply learn identity mappings (passing input through unchanged). However, standard stacked nonlinear layers struggled to approximate identity mappings efficiently.
He et al. demonstrated the degradation problem with a direct experiment on CIFAR-10. They trained two plain networks (without skip connections): one with 20 layers and one with 56 layers. Despite having far greater capacity, the 56-layer network produced higher training error and higher test error than the 20-layer network. This was a striking result because it showed the problem was not about generalization; the deeper network was failing to optimize effectively on the training data itself.
On ImageNet, the authors observed similar behavior with plain networks of 18 and 34 layers. The 34-layer plain network exhibited higher validation error than the 18-layer network throughout training. When the same architectures were equipped with residual connections, the situation reversed: the 34-layer ResNet outperformed the 18-layer ResNet, and adding depth consistently improved accuracy rather than degrading it.
Earlier architectures like VGGNet (2014) had pushed depth to 16 and 19 layers with strong results [4], and GoogLeNet (2014) reached 22 layers using inception modules [5]. But experiments consistently showed that simply stacking more convolutional layers beyond a certain point led to degraded performance. He et al. set out to solve this problem directly.
The core insight behind ResNet is a reformulation of what each layer learns. Instead of asking a stack of layers to learn a desired mapping H(x) directly, the authors proposed letting the layers learn a residual mapping F(x) = H(x) - x. The original mapping then becomes F(x) + x, which is realized through a skip connection that adds the input x to the output of the stacked layers.
This reformulation is motivated by the degradation problem. If the optimal function for a set of additional layers is close to an identity mapping, it is easier for the network to push the residual F(x) toward zero than it is to learn the identity through a stack of nonlinear transformations. In other words, it is easier to learn small deviations from the identity than to learn the identity itself through multiple weight layers.
Mathematically, a residual block computes:
y = F(x, {W_i}) + x
where x is the input, F represents the residual function learned by two or three stacked convolutional layers, and the addition is performed element-wise. Crucially, this shortcut connection introduces neither extra parameters nor extra computational cost (beyond the negligible addition operation).
When the dimensions of x and F(x) differ (for example, when the number of channels changes), a linear projection W_s is applied to the shortcut:
y = F(x, {W_i}) + W_s * x
This projection is typically implemented as a 1x1 convolution. The authors experimented with three types of shortcut connections: (A) zero-padding for dimension matching, (B) projection shortcuts only when dimensions change, and (C) projection shortcuts for all blocks. Option B became the standard practice, as option C added parameters without significant accuracy gains, and option A performed slightly worse due to the lack of residual learning on the padded dimensions.
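The residual computation and the option-B projection shortcut can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper's implementation: the residual function F is stood in for by two 1x1 channel-mixing layers rather than real 3x3 convolutions, and batch normalization is omitted.

```python
import numpy as np

def conv1x1(x, w):
    # A 1x1 convolution is per-pixel channel mixing: (C_out, C_in) applied to (C_in, H, W)
    return np.einsum("oc,chw->ohw", w, x)

def residual_block(x, w1, w2, w_s=None):
    """y = F(x) + x, with an optional projection W_s on the shortcut
    (option B from the paper: project only when dimensions change).
    F here is a toy two-layer stack of 1x1 channel mixings standing in
    for the real 3x3 convolutions; batch norm is omitted for brevity."""
    f = conv1x1(np.maximum(conv1x1(x, w1), 0.0), w2)  # conv -> ReLU -> conv
    shortcut = x if w_s is None else conv1x1(x, w_s)  # identity or 1x1 projection
    return np.maximum(f + shortcut, 0.0)              # add, then final ReLU

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 8, 8))                   # (channels, H, W)

# Identity shortcut: channel count unchanged (64 -> 64)
y = residual_block(x, rng.standard_normal((64, 64)) * 0.01,
                      rng.standard_normal((64, 64)) * 0.01)
assert y.shape == (64, 8, 8)

# Projection shortcut: channel count doubles (64 -> 128), so W_s is required
y2 = residual_block(x, rng.standard_normal((128, 64)) * 0.01,
                       rng.standard_normal((128, 128)) * 0.01,
                       w_s=rng.standard_normal((128, 64)) * 0.01)
assert y2.shape == (128, 8, 8)
```

Note that when the weights of F are pushed to zero, the block reduces to (a ReLU of) the identity, which is exactly the easy-to-learn fallback the residual formulation provides.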
The identity shortcut is the defining feature that separates residual networks from earlier approaches like Highway Networks (Srivastava et al., 2015), which used gated shortcut connections [16]. In Highway Networks, the shortcut path is modulated by a learned gating function, meaning it can be partially or fully closed during training. In ResNet, the shortcut is always open (an unmodified identity), which provides a guaranteed, unimpeded path for both information and gradients. This architectural simplicity proved more effective and easier to optimize than the gated approach.
Residual connections address the degradation problem through several complementary mechanisms.
Gradient flow. During backpropagation, the skip connection provides a direct path for gradients to flow from later layers back to earlier layers. Because the gradient of the addition operation distributes equally to both branches, the gradient through the shortcut path is unattenuated, regardless of how many layers it passes through. This mitigates the vanishing gradient problem that plagued very deep networks trained with traditional architectures.
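A toy scalar calculation illustrates the gradient-flow argument. Assume each layer's learned branch has a small local derivative; in a plain stack these factors multiply directly, while the skip connection adds 1 to each factor, since d/dx [x + F(x)] = 1 + F'(x):

```python
# Toy scalar model of gradient flow through many layers (illustrative only).
depth = 50
branch_grad = 0.01        # assumed small local derivative of each layer's branch

plain = branch_grad ** depth           # factors multiply: vanishes (~1e-100)
residual = (1 + branch_grad) ** depth  # skip adds 1 to each factor: stays O(1)

assert plain < 1e-90
assert 1.0 < residual < 2.0
```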
Ease of optimization. The residual formulation makes it easier for the optimizer to find good solutions. If a layer's optimal contribution is near zero, the weights can simply converge to small values, and the skip connection ensures the signal passes through. This lowers the optimization difficulty compared to learning a complex mapping from scratch.
Ensemble-like behavior. Later analysis by Veit et al. (2016) suggested that ResNets behave like ensembles of many shallower networks. Because information can take different paths through the residual blocks (through the shortcut, through the layers, or through combinations), the network effectively learns a collection of paths of varying lengths, providing robustness similar to an ensemble [6]. Veit et al. showed that removing individual layers from a trained ResNet caused only minor performance degradation, unlike plain networks where removing a single layer could be catastrophic.
Smoother loss landscape. Li et al. (2018) showed that residual connections produce significantly smoother loss surfaces compared to plain networks [17]. The smooth, nearly convex loss landscape of ResNets makes them easier to optimize with standard gradient-based methods. Plain networks, by contrast, tend to have chaotic loss surfaces with many sharp local minima, explaining their difficulty in training at greater depths.
The original ResNet paper presented five main model variants, all designed for the ImageNet classification task (1000 classes, 224x224 pixel input images). All variants share a common structure: an initial 7x7 convolution with stride 2 (64 filters), followed by a 3x3 max pooling layer with stride 2, then a series of residual blocks organized into four stages (conv2 through conv5), and finally a global average pooling layer followed by a fully connected layer with softmax.
The four stages operate at progressively reduced spatial resolutions (56x56, 28x28, 14x14, 7x7) and progressively increased channel counts (64, 128, 256, 512 for the basic block; 256, 512, 1024, 2048 for the bottleneck). Spatial downsampling occurs at the first block of each stage (except conv2) using convolutions with stride 2.
The basic block, used in ResNet-18 and ResNet-34, consists of two 3x3 convolutional layers. Each convolution is followed by batch normalization and a ReLU activation. The skip connection adds the input to the output after the second batch normalization but before the final ReLU. The structure is:
Input -> Conv 3x3 -> BN -> ReLU -> Conv 3x3 -> BN -> (+input) -> ReLU
Each basic block with C input and output channels contains 2 * (3 * 3 * C * C) = 18C^2 weights, ignoring batch normalization parameters and biases.
The bottleneck block, used in ResNet-50, ResNet-101, and ResNet-152, consists of three convolutional layers arranged in a 1x1, 3x3, 1x1 pattern. The first 1x1 convolution reduces the number of channels (typically by a factor of 4), the 3x3 convolution operates on this reduced representation, and the final 1x1 convolution restores the original channel count. This bottleneck design significantly reduces computation while maintaining representational capacity. The structure is:
Input -> Conv 1x1 -> BN -> ReLU -> Conv 3x3 -> BN -> ReLU -> Conv 1x1 -> BN -> (+input) -> ReLU
For example, a bottleneck block with 256 output channels first reduces to 64 channels, applies the 3x3 convolution on 64 channels, and then expands back to 256 channels. The parameter count is proportional to (1 * 1 * C * C/4) + (3 * 3 * C/4 * C/4) + (1 * 1 * C/4 * C) = C^2 * (1/4 + 9/16 + 1/4) = 17C^2/16, significantly less than the 18C^2 of the two 3x3 layers in the basic block, despite having three layers instead of two.
The expansion factor in bottleneck blocks is 4: the output of a bottleneck block always has 4 times the number of channels as the intermediate 3x3 convolution layer.
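The two parameter counts above can be verified with a few lines of arithmetic (weights only; batch normalization parameters and biases ignored):

```python
# Weight counts for one basic block vs. one bottleneck block at C channels,
# matching the 18C^2 and 17C^2/16 figures derived in the text.
def basic_params(C):
    return 2 * (3 * 3 * C * C)                                   # two 3x3 convs

def bottleneck_params(C):
    r = C // 4                                                   # bottleneck width
    return (1 * 1 * C * r) + (3 * 3 * r * r) + (1 * 1 * r * C)   # 1x1, 3x3, 1x1

C = 256
assert basic_params(C) == 18 * C * C           # 1,179,648
assert bottleneck_params(C) == 17 * C * C // 16  # 69,632
```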
The following table summarizes the five standard ResNet configurations from the original paper.
| Model | Block Type | Layers | Blocks per Stage (conv2-conv5) | Parameters | FLOPs (approx.) | Top-1 Error (%) | Top-5 Error (%) |
|---|---|---|---|---|---|---|---|
| ResNet-18 | Basic | 18 | 2, 2, 2, 2 | 11.7M | 1.8G | 30.2 | 10.9 |
| ResNet-34 | Basic | 34 | 3, 4, 6, 3 | 21.8M | 3.6G | 26.7 | 8.6 |
| ResNet-50 | Bottleneck | 50 | 3, 4, 6, 3 | 25.6M | 3.8G | 24.0 | 7.1 |
| ResNet-101 | Bottleneck | 101 | 3, 4, 23, 3 | 44.5M | 7.6G | 22.4 | 6.2 |
| ResNet-152 | Bottleneck | 152 | 3, 8, 36, 3 | 60.2M | 11.3G | 21.7 | 5.7 |
Note that ResNet-50 has only slightly more parameters than ResNet-34 despite being significantly deeper. This is because the bottleneck design is highly parameter-efficient. Also notable is that ResNet-152, with 11.3 billion FLOPs, is still less computationally expensive than VGG-16 (15.3 billion FLOPs) or VGG-19 (19.6 billion FLOPs), despite being roughly eight times deeper [1].
The layer count convention deserves clarification. ResNet-50, for instance, has 50 weight layers: one initial 7x7 conv, (3 + 4 + 6 + 3) * 3 = 48 conv layers in residual blocks, and one final fully connected layer. Batch normalization layers and the max pooling layer are not counted.
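The convention can be checked mechanically for all five variants: one stem convolution, plus the convolutions inside the residual blocks, plus one fully connected layer.

```python
# Layer counts: (blocks per stage, convs per block) for each named depth.
configs = {
    18:  ([2, 2, 2, 2], 2),    # basic blocks: 2 convs each
    34:  ([3, 4, 6, 3], 2),
    50:  ([3, 4, 6, 3], 3),    # bottleneck blocks: 3 convs each
    101: ([3, 4, 23, 3], 3),
    152: ([3, 8, 36, 3], 3),
}
for depth, (blocks, convs) in configs.items():
    # stem conv + block convs + final fully connected layer
    assert 1 + sum(blocks) * convs + 1 == depth
```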
All ResNet variants make extensive use of batch normalization, applied after every convolutional layer and before the ReLU activation. Batch normalization normalizes the activations within each mini-batch, stabilizing training and allowing the use of higher learning rates. He et al. adopted batch normalization throughout the architecture following the technique introduced by Ioffe and Szegedy in 2015 [7]. This design choice was essential for training networks as deep as 152 layers. The combination of residual connections and batch normalization was what ultimately allowed networks of unprecedented depth to be trained successfully.
The original ResNet models were trained on ImageNet using stochastic gradient descent (SGD) with momentum of 0.9, weight decay of 0.0001, and a mini-batch size of 256. The learning rate started at 0.1 and was divided by 10 when the error plateaued (typically at 30 and 60 epochs, over a total of roughly 90 epochs). Training used standard data augmentation: random crops from resized images (scale augmentation), horizontal flipping, and per-pixel mean subtraction. Color augmentation as described by Krizhevsky et al. (2012) was also applied [18]. Weights were initialized using the He initialization method [15], which draws values from a Gaussian distribution with standard deviation sqrt(2/n_in), designed specifically for ReLU-based networks.
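A minimal NumPy sketch of He initialization for a 3x3 convolutional layer (the shapes chosen here are illustrative): the fan-in is the kernel area times the input channel count, and weights are drawn from a zero-mean Gaussian with standard deviation sqrt(2/fan_in).

```python
import numpy as np

# He initialization for a 3x3 conv with 64 input channels.
fan_in = 3 * 3 * 64
std = np.sqrt(2.0 / fan_in)

rng = np.random.default_rng(0)
w = rng.normal(0.0, std, size=(128, 64, 3, 3))  # (out, in, kH, kW)

# The empirical standard deviation should closely match the target.
assert abs(w.std() - std) / std < 0.05
```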
ResNet's victory in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2015 was a landmark moment. The team submitted an ensemble of six ResNet models of different depths that achieved a top-5 error rate of 3.57% on the test set, winning first place in the classification track [2]. A single ResNet-152 model achieved 4.49% top-5 error on validation, already below human-level performance.
To appreciate the significance, consider the progression of ILSVRC winners:
| Year | Winner | Top-5 Error (%) | Depth | Key Innovation |
|---|---|---|---|---|
| 2012 | AlexNet | 15.3 | 8 | GPU training, ReLU, dropout |
| 2013 | ZFNet | 11.7 | 8 | Visualization-guided tuning |
| 2014 | GoogLeNet | 6.7 | 22 | Inception modules |
| 2014 | VGGNet (runner-up) | 7.3 | 19 | Small 3x3 filters throughout |
| 2015 | ResNet | 3.57 | 152 | Residual connections |
The estimated human-level top-5 error rate on ImageNet, as benchmarked by Andrej Karpathy, was approximately 5.1% [8]. ResNet's 3.57% error rate made it the first ILSVRC winning entry to fall below this human benchmark on the ImageNet classification task. This milestone attracted widespread media attention and helped establish deep learning as a dominant paradigm in artificial intelligence.
Beyond classification, ResNet-based models also won first place in every major track of the ILSVRC 2015 and COCO 2015 competitions:
| Competition | Task | Key Metric |
|---|---|---|
| ILSVRC 2015 | Image Classification | 3.57% top-5 error |
| ILSVRC 2015 | Object Detection | 62.1% mAP |
| ILSVRC 2015 | Object Localization | First place |
| COCO 2015 | Object Detection | 37.3% mAP@[.5,.95] |
| COCO 2015 | Semantic Segmentation | First place |
On the COCO dataset, the ResNet-based system achieved a 28% relative improvement over the previous state of the art in the standard mAP metric. This sweep across five competition tracks demonstrated that the residual learning framework was not just a classification trick but a genuinely better feature representation that generalized across vision tasks [1].
The success of ResNet inspired numerous variants that explored different aspects of the residual learning framework.
In a follow-up paper published at ECCV 2016, He et al. proposed a modified residual block where batch normalization and ReLU are applied before the convolutional layers, rather than after [9]. This "pre-activation" design rearranges the block from the original "conv -> BN -> ReLU" order to "BN -> ReLU -> conv". The change creates a cleaner information path through the shortcut connections, allowing the identity mapping to propagate signals without any nonlinear transformation.
The authors showed formally that when both the skip connection and the after-addition activation are identity mappings, the forward signal propagates directly from any block l to any deeper block L as x_L = x_l + sum_{i=l}^{L-1} F(x_i, {W_i}), and the gradient decomposes analogously during backpropagation, with one term flowing back unattenuated through the identity path. This "clean" information path was the theoretical justification for the pre-activation design.
The pre-activation variant showed improved results on CIFAR-10 and CIFAR-100, with the benefits becoming more pronounced in very deep networks. A pre-activation ResNet-1001 achieved 4.62% error on CIFAR-10, training smoothly and converging faster than the original ResNet architecture at the same depth. (The original paper had already pushed to a 1202-layer network with 19.4M parameters, though that model showed mild overfitting on CIFAR-10 due to the dataset's small size.) The architecture is sometimes referred to as ResNet-v2.
ResNeXt, introduced by Xie et al. at Facebook AI Research in 2017, extended ResNet by adding a new dimension called "cardinality" [10]. Instead of a single path through each residual block, ResNeXt uses multiple parallel pathways (typically 32) with identical topology. Each pathway performs a narrower transformation, and the results are aggregated. This is mathematically equivalent to performing grouped convolutions, a technique originally introduced in AlexNet for distributing computation across multiple GPUs but repurposed in ResNeXt as an architectural design choice.
ResNeXt demonstrated that increasing cardinality (the number of parallel paths) was a more effective way to improve accuracy than increasing depth or width alone. A ResNeXt-101 with 32x4d configuration (32 groups, each 4 channels wide) matched or exceeded the performance of a significantly deeper or wider ResNet with comparable parameter count. ResNeXt models formed the backbone of the team's entry to ILSVRC 2016, where they placed second. The design principle of grouped convolutions later influenced architectures like MobileNet and EfficientNet through depthwise separable convolutions.
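A rough weights-only count illustrates the "comparable complexity" claim. The block shapes below follow the 32x4d template at the first stage (1x1 down to a width of 128, grouped 3x3 with 32 groups, 1x1 back up to 256); the figures are approximate and ignore batch normalization and biases.

```python
# Weights-only comparison: ResNet-50 bottleneck block vs. ResNeXt 32x4d block
# at 256 channels. A grouped 3x3 conv with G groups divides that conv's
# parameter count by G, which is what lets ResNeXt afford a wider transform.
def bottleneck(C, width, groups=1):
    return C * width + (3 * 3 * width * width) // groups + width * C

resnet = bottleneck(256, 64)               # 1x1 -> 3x3 -> 1x1 at width 64
resnext = bottleneck(256, 128, groups=32)  # 32 groups x 4 channels = width 128

assert resnet == 69_632
assert resnext == 70_144   # within ~1% of the ResNet block
```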
Zagoruyko and Komodakis (2016) proposed Wide Residual Networks (WRN), which challenged the assumption that deeper is always better [11]. By widening the residual blocks (using more feature channels per layer) rather than adding more layers, they showed that a 16-layer wide network could match or outperform a 1000-layer thin network on CIFAR datasets. A WRN-40-4 (40 layers, widening factor 4, 8.9M parameters) achieved better accuracy than ResNet-1001 (10.2M parameters) while training roughly 8 times faster. The faster training was because wider architectures are more amenable to GPU parallelism than extremely deep, thin architectures. Wide ResNets demonstrated that the depth vs. width tradeoff in residual networks was more nuanced than the original ResNet paper suggested.
DenseNet (Densely Connected Convolutional Networks), introduced by Huang et al. in 2017, took the concept of skip connections further [12]. Instead of adding the input to the output (as in ResNet), DenseNet concatenated feature maps from all preceding layers in a block. Every layer in a DenseNet block receives the feature maps from all previous layers as input. This design maximized feature reuse, reduced the number of parameters needed, and strengthened gradient flow. While architecturally distinct from ResNet, DenseNet was directly inspired by the success of residual connections.
Hu et al. (2018) introduced Squeeze-and-Excitation (SE) blocks that could be inserted into ResNet architectures to adaptively recalibrate channel-wise feature responses [13]. An SE block uses global average pooling to compress spatial information, then applies two fully connected layers to learn channel-wise attention weights. When added to ResNet-50, the resulting SE-ResNet-50 achieved notable accuracy improvements with minimal additional computation. SE-Net won the ILSVRC 2017 classification challenge.
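The SE mechanism is simple enough to sketch in NumPy. This is an illustrative toy (biases omitted, arbitrary weights), not the paper's implementation:

```python
import numpy as np

def se_block(x, w1, w2):
    """Squeeze-and-Excitation on a (C, H, W) feature map: global average
    pooling (squeeze), two fully connected layers with a reduction
    bottleneck, a sigmoid gate, then channel-wise rescaling (excite)."""
    z = x.mean(axis=(1, 2))                  # squeeze: (C,)
    s = np.maximum(w1 @ z, 0.0)              # FC + ReLU, down to C/r
    s = 1.0 / (1.0 + np.exp(-(w2 @ s)))      # FC + sigmoid, back to C, in (0, 1)
    return x * s[:, None, None]              # rescale each channel by its gate

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 8, 8))
w1 = rng.standard_normal((4, 64)) * 0.1      # reduction ratio r = 16
w2 = rng.standard_normal((64, 4)) * 0.1

y = se_block(x, w1, w2)
assert y.shape == x.shape
```

Because the gates lie strictly between 0 and 1, the block can only attenuate channels, learning to emphasize informative ones relative to the rest.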
Bello et al. (2021) at Google Research published "Revisiting ResNets," demonstrating that training methodology and scaling strategies can matter as much as architectural changes [19]. By applying modern training techniques (longer training, increased regularization, label smoothing, stochastic depth, and RandAugment), they improved a standard ResNet-50 from 76.1% to 79.0% top-1 accuracy on ImageNet, then to 82.2% through further training improvements, and to 83.4% with two small architectural modifications. The resulting ResNet-RS models were 1.7x to 2.7x faster on TPUs than EfficientNet at similar accuracy levels. This work showed that much of the perceived architectural advantage of newer models was actually due to improved training recipes rather than fundamental architectural superiority.
ConvNeXt (Liu et al., 2022) took the modernization of ResNet to its logical conclusion by systematically incorporating design choices from Vision Transformers into a pure convolutional neural network [20]. Starting from a standard ResNet-50, the authors applied changes including patchified stems, inverted bottleneck blocks, depthwise convolutions, larger kernel sizes (7x7), and layer normalization. The resulting ConvNeXt models achieved 87.8% ImageNet top-1 accuracy and outperformed Swin Transformer on COCO detection and ADE20K segmentation while maintaining the simplicity and efficiency of pure convolutions. ConvNeXt demonstrated that the core ResNet framework, when properly modernized, could remain competitive with attention-based architectures.
| Variant | Year | Key Modification | Main Benefit |
|---|---|---|---|
| Pre-activation ResNet (v2) | 2016 | BN and ReLU before convolutions | Cleaner gradient path; better for very deep nets |
| ResNeXt | 2017 | Parallel grouped convolutions (cardinality) | More efficient capacity scaling |
| Wide ResNet | 2016 | Wider blocks instead of deeper stacking | Faster training; strong on CIFAR |
| DenseNet | 2017 | Concatenation of all preceding feature maps | Parameter efficiency; feature reuse |
| SE-ResNet | 2018 | Channel attention via squeeze-and-excitation | Better feature calibration |
| ResNet-RS | 2021 | Modern training recipe and scaling | Matched state-of-the-art without new architecture |
| ConvNeXt | 2022 | Transformer-inspired modernization | Competitive with Vision Transformers |
One of ResNet's most significant roles has been as a feature extraction backbone in downstream vision tasks. Rather than designing separate feature extractors for each task, researchers discovered that using a ResNet pre-trained on ImageNet as a shared backbone yielded strong features for a wide variety of problems.
Faster R-CNN (Ren et al., 2015) originally used VGG-16 as its backbone, but switching to ResNet-101 produced large accuracy improvements [21]. The Feature Pyramid Network (FPN), introduced by Lin et al. in 2017, built a multi-scale feature pyramid on top of ResNet's hierarchical feature maps, enabling effective detection at multiple scales [22]. The combination of ResNet + FPN became the standard backbone for two-stage object detectors, including Faster R-CNN, Cascade R-CNN, and many others.
Mask R-CNN (He et al., 2017) extended Faster R-CNN with a pixel-level mask prediction branch and used ResNet-50-FPN or ResNet-101-FPN as its backbone [23]. This architecture became the dominant framework for instance segmentation. For semantic segmentation, the DeepLab family (Chen et al., 2017, 2018) used ResNet with atrous (dilated) convolutions and Atrous Spatial Pyramid Pooling (ASPP) to capture multi-scale context [24]. Panoptic segmentation methods like Panoptic FPN and Panoptic-DeepLab also relied on ResNet backbones. ResNet-50 and ResNet-101 remain the most common backbones in the Detectron2 framework.
ResNet backbones have been used in human pose estimation (e.g., Simple Baselines by Xiao et al., 2018), action recognition (e.g., SlowFast Networks by Feichtenhofer et al., 2019), depth estimation, image generation, and numerous other tasks. The architecture's versatility as a general-purpose feature extractor is one reason for its longevity.
Pre-trained ResNet models became one of the most popular starting points for transfer learning in computer vision. Researchers and practitioners would download ResNet weights pre-trained on ImageNet and fine-tune them for domain-specific tasks, from medical image analysis to satellite imagery classification. This practice significantly reduced the data and computation required to achieve strong performance on specialized tasks.
Several properties make ResNet particularly effective for transfer learning. Its hierarchical feature representations (from low-level edges in early layers to high-level semantic features in later layers) transfer well across domains. The architecture is available in multiple sizes (ResNet-18 through ResNet-152), allowing practitioners to choose the right tradeoff between accuracy and computational cost for their specific application. Additionally, the extensive batch normalization in ResNet means the features are well-conditioned, making fine-tuning with small learning rates stable and effective.
ResNet-based transfer learning has been applied across a wide range of domains:
| Domain | Example Application | Typical Model |
|---|---|---|
| Medical Imaging | Chest X-ray classification, retinal disease detection, cancer diagnosis | ResNet-50, ResNet-18 |
| Remote Sensing | Land use classification, crop monitoring, disaster assessment | ResNet-50, ResNet-101 |
| Manufacturing | Surface defect detection, quality control | ResNet-34, ResNet-50 |
| Agriculture | Plant disease identification, crop species classification | ResNet-50 |
| Autonomous Driving | Road scene understanding, traffic sign recognition | ResNet-101 |
Studies have shown that even for medical imaging, where the visual domain differs significantly from ImageNet's natural images, ImageNet pre-trained ResNet features provide a strong initialization that outperforms training from scratch, especially when labeled data is limited.
ResNet's influence extends far beyond image classification. The residual connection has become one of the most fundamental building blocks in modern deep learning, appearing in architectures across domains.
The transformer architecture, introduced by Vaswani et al. in 2017, uses residual connections around every self-attention and feed-forward sublayer [14]. This design choice was directly inspired by ResNet and is critical for training deep transformer stacks. Without residual connections, transformers with dozens or hundreds of layers would face the same gradient degradation issues that plagued pre-ResNet convolutional networks. Every major large language model, from GPT to BERT to Claude, relies on residual connections inherited from the ResNet lineage. The standard transformer block formula "output = LayerNorm(x + Sublayer(x))" is a direct application of the residual principle.
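The post-norm pattern quoted above can be sketched directly. A minimal NumPy version, with LayerNorm's learned scale and shift omitted for brevity:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the feature dimension (learned scale/shift omitted)
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def residual_sublayer(x, sublayer):
    # The post-norm transformer pattern: LayerNorm(x + Sublayer(x))
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16))             # (tokens, d_model)
w = rng.standard_normal((16, 16)) * 0.1

# Any shape-preserving sublayer works; here a toy feed-forward map stands in
# for self-attention or the MLP block.
y = residual_sublayer(x, lambda h: np.maximum(h @ w, 0.0))
assert y.shape == x.shape
```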
Residual connections now appear in virtually every modern deep learning architecture, from convolutional networks and transformers to diffusion models and speech recognition systems.
The ResNet paper's citation count places it in a category of its own among computer science publications. As of late 2025, "Deep Residual Learning for Image Recognition" has accumulated over 250,000 citations on Google Scholar, making it one of the most cited academic papers ever published in any field [3].
A Nature analysis identified it as the most cited paper of the 21st century across all scientific disciplines when measured by median ranking across five major citation databases. The paper consistently ranks in the top two or three positions across Google Scholar, Semantic Scholar, Web of Science, Scopus, and Crossref.
To put this in context, the paper's citation count exceeds those of landmark works in other fields, such as the CRISPR gene-editing paper by Doudna and Charpentier. Within AI, it surpasses the citation counts of the attention mechanism paper ("Attention Is All You Need"), the batch normalization paper, and the Adam optimizer paper, each of which is itself among the most cited computer science publications.
Despite its enormous impact, ResNet has several recognized limitations.
The residual connections require that the input and output of each block have compatible dimensions for the addition operation. When dimensions change (at stage boundaries), projection shortcuts or zero-padding must be used, which adds design complexity.
ResNet's convolutional backbone inherently operates with local receptive fields. While deeper networks gradually expand the effective receptive field, ResNet cannot efficiently capture long-range spatial dependencies the way attention-based architectures can.
The fixed, uniform architecture of standard ResNet models (same block structure at each depth) may not be optimal. Neural architecture search methods have since discovered architectures that outperform hand-designed ResNets while using fewer parameters.
Finally, while ResNet significantly improved depth scalability, there are diminishing returns beyond a certain depth. The difference between ResNet-101 and ResNet-152 is much smaller than between ResNet-34 and ResNet-50, suggesting that simply adding layers is not the most efficient way to improve performance.
As of 2026, ResNet continues to hold a significant position in the deep learning landscape. Although Vision Transformers and architectures like ConvNeXt have achieved higher accuracy on major benchmarks, ResNet remains widely used for several reasons.
ResNet-50 is among the most commonly used models in industry for production computer vision systems due to its favorable balance of accuracy, speed, and well-understood behavior. It serves as a standard benchmark model: nearly every new architecture paper compares against ResNet variants. Pre-trained ResNet models are available in every major deep learning framework, including PyTorch and TensorFlow, making them accessible for rapid prototyping.
The residual connection itself, rather than any specific ResNet configuration, is the paper's most enduring contribution. It is now so ubiquitous that it is rarely even cited explicitly; researchers simply treat it as a fundamental design principle, as basic as convolution or backpropagation. From diffusion models to speech recognition to reinforcement learning, residual connections appear in virtually every modern deep learning system.
MLPerf, the industry-standard benchmark suite for measuring hardware and software performance in machine learning, uses ResNet-50 as one of its core benchmark models. This ensures that ResNet-50 performance is a key metric in evaluating new hardware accelerators, from NVIDIA GPUs to Google TPUs to custom AI chips.