ConvNeXt is a family of pure convolutional neural network (CNN) models introduced by Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie in the paper "A ConvNet for the 2020s," published at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) in 2022. The paper challenged the prevailing assumption that Vision Transformers (ViTs) had rendered CNNs obsolete for visual recognition tasks. By systematically modernizing a standard ResNet architecture using design principles borrowed from Transformers, the authors demonstrated that a pure ConvNet can match or surpass the performance of hierarchical Vision Transformers such as the Swin Transformer on ImageNet classification, COCO object detection, and ADE20K semantic segmentation.
The ConvNeXt project originated at Facebook AI Research (FAIR), now part of Meta AI, in collaboration with UC Berkeley. The work has since accumulated thousands of citations and renewed interest in CNN-based architectures for deep learning research and practical applications.
The early 2020s saw a rapid shift in computer vision research away from CNNs and toward Transformer-based architectures. The original Vision Transformer (ViT) by Dosovitskiy et al. (2020) showed that a pure Transformer applied directly to sequences of image patches could achieve excellent results on image classification, particularly when pre-trained on large datasets. Subsequent models like DeiT and the Swin Transformer refined this approach, incorporating hierarchical feature maps and efficient attention mechanisms that made Transformers competitive even when trained solely on ImageNet-1K.
During this same period, CNN-based models received comparatively little attention in terms of architectural innovation. Most improvements in CNN performance came from better training procedures rather than fundamental design changes. The ConvNeXt paper set out to answer a specific question: how much of the performance gap between modern CNNs and Vision Transformers is attributable to the Transformer architecture itself, and how much comes from associated training strategies and design decisions that could, in principle, be applied to any architecture?
The authors hypothesized that many of the design choices commonly associated with Transformers (such as larger receptive fields, inverted bottleneck structures, and specific normalization schemes) are not inherently tied to the self-attention mechanism and can be incorporated into a pure convolutional framework.
The core contribution of the ConvNeXt paper is a systematic, step-by-step modernization of a standard ResNet-50 model. Rather than proposing a single novel architecture, the authors incrementally applied a series of modifications, measuring the impact of each change on ImageNet-1K top-1 accuracy. This methodical approach provides a roadmap showing exactly which design decisions matter most.
The first step involved no architectural changes at all. The authors simply updated the training procedure for a standard ResNet-50 from the original 90-epoch schedule to a modern recipe inspired by DeiT and Swin Transformer training. The updated recipe included:

- an extended training schedule of 300 epochs;
- the AdamW optimizer in place of SGD;
- stronger data augmentation, including Mixup, CutMix, RandAugment, and Random Erasing;
- additional regularization in the form of Stochastic Depth and Label Smoothing.
This change alone raised the ResNet-50 top-1 accuracy from 76.1% to 78.8%, a gain of 2.7 percentage points without touching the architecture. This result highlighted how much modern training strategies contributed to the reported performance gap between CNNs and Transformers.
Stage compute ratio. ResNet-50 distributes its blocks across four stages in a (3, 4, 6, 3) pattern. The Swin Transformer uses a roughly 1:1:3:1 ratio across its stages. The authors adjusted the ResNet block distribution to (3, 3, 9, 3), concentrating more computation in the third stage. This change improved accuracy from 78.8% to 79.4%.
Patchify stem. The standard ResNet stem uses a 7x7 convolution with stride 2 followed by a 3x3 max pooling layer to downsample the input by a factor of 4. The authors replaced this with a single non-overlapping 4x4 convolution with stride 4, analogous to the patch embedding layer used in Vision Transformers. Accuracy improved slightly from 79.4% to 79.5%.
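The patchify stem is equivalent to slicing the image into non-overlapping 4x4 patches and applying a single linear projection to each, exactly like a ViT patch-embedding layer. A minimal numpy sketch (channels-last layout, illustrative random weights):

```python
import numpy as np

def patchify_stem(x, weight, bias):
    """Non-overlapping 4x4 stride-4 'convolution' as a per-patch linear map.

    x: (H, W, C_in); weight: (4*4*C_in, C_out); bias: (C_out,).
    """
    H, W, C = x.shape
    # Slice into a (H/4, W/4) grid of 4x4 patches and flatten each patch
    patches = x.reshape(H // 4, 4, W // 4, 4, C).transpose(0, 2, 1, 3, 4)
    patches = patches.reshape(H // 4, W // 4, 4 * 4 * C)
    return patches @ weight + bias  # one shared linear projection per patch

rng = np.random.default_rng(0)
x = rng.standard_normal((224, 224, 3))
w = rng.standard_normal((48, 96)) * 0.02
b = np.zeros(96)
out = patchify_stem(x, w, b)
# 224 / 4 = 56: the stem downsamples by 4x and projects to 96 channels
assert out.shape == (56, 56, 96)
```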
The authors adopted depthwise convolutions, a special case of grouped convolution in which the number of groups equals the number of channels. This separates spatial mixing (done by the depthwise convolution) from channel mixing (done by 1x1 pointwise convolutions), mirroring how a Transformer block mixes information across either the spatial dimension (self-attention) or the channel dimension (MLP), but never both at once. The depthwise convolution itself reduced both FLOPs and accuracy, but increasing the base channel width from 64 to 96 (matching Swin-T) compensated for the loss. The net result was an improvement from 79.5% to 80.5%.
Transformer blocks use feed-forward networks (FFNs) where the hidden dimension is four times the input dimension. This inverted bottleneck design, also seen in MobileNetV2, expands channels before compressing them. The authors applied the same principle: starting with 96 channels, the hidden layer expanded to 384 channels. Despite increasing FLOPs in the depthwise convolution layer, the overall network FLOPs decreased because the expensive 1x1 convolutions operated on narrower inputs. Accuracy improved marginally from 80.5% to 80.6%.
Vision Transformers benefit from large receptive fields through self-attention; Swin Transformers use attention windows of at least 7x7. To achieve a similar effect in a ConvNet, the authors increased the depthwise convolution kernel from 3x3 to 7x7. Before making this change, they moved the depthwise convolution earlier in the block (ahead of the 1x1 layers), so that the expensive pointwise convolutions operate on features that have already been mixed spatially. The repositioning temporarily dropped accuracy to 79.9%, but enlarging the kernel to 7x7 restored it to 80.6% at reduced FLOPs.
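The FLOP trade-off behind this reordering can be checked with a back-of-the-envelope count of multiplies per spatial position (illustrative numbers: base width 96, expansion ratio 4, 7x7 depthwise kernel):

```python
# Rough per-pixel multiply counts (biases and activations ignored),
# illustrating why moving the depthwise conv before the 1x1 layers
# reduces cost: the 7x7 depthwise conv then runs on d channels
# instead of the expanded 4*d channels.
d = 96

def dw(c):           # depthwise 7x7 on c channels
    return 7 * 7 * c

def pw(cin, cout):   # pointwise 1x1: cin -> cout
    return cin * cout

# Depthwise in the middle of the inverted bottleneck (on 4*d channels)
mid = pw(d, 4 * d) + dw(4 * d) + pw(4 * d, d)
# Depthwise moved up (on the narrow d channels)
up = dw(d) + pw(d, 4 * d) + pw(4 * d, d)

assert mid == 92544
assert up == 78432
assert up < mid  # the reordered block is cheaper per position
```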
Several smaller modifications further closed the gap with Transformers:
Replacing ReLU with GELU. The Gaussian Error Linear Unit (GELU) activation, used in Transformers such as GPT and BERT, replaced ReLU throughout the network.
Fewer activation functions. In a Transformer block, only one activation function exists within the FFN (between the two linear layers). The authors mimicked this by keeping only a single GELU between the two 1x1 convolutions and removing all other activations from the block.
Fewer normalization layers. Transformers typically use one normalization layer per sub-block. The authors removed two of the three BatchNorm layers, keeping only one before the first 1x1 convolution.
Replacing BatchNorm with LayerNorm. Layer Normalization, the standard in Transformers, replaced BatchNorm. While LayerNorm is less common in CNNs and can sometimes hurt performance, it worked well in the ConvNeXt architecture.
Separate downsampling layers. Instead of using strided convolutions within residual blocks for spatial downsampling (as in ResNet), the authors added explicit 2x2 convolution layers with stride 2 between stages, with a LayerNorm layer for stability.
These micro design changes collectively raised accuracy from 80.6% to 82.0%.
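One subtlety in the normalization change above: ConvNeXt applies LayerNorm to each spatial position over the channel dimension only, unlike BatchNorm's per-channel statistics computed over the batch. A minimal numpy sketch of this channel-wise normalization on a channels-last feature map:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    """Normalize each (h, w) position over its C channels. x: (H, W, C)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps) * gamma + beta

x = np.random.default_rng(0).standard_normal((56, 56, 96))
y = layer_norm(x, np.ones(96), np.zeros(96))
assert y.shape == x.shape
# Each spatial position is normalized to zero mean across channels
assert abs(y[0, 0].mean()) < 1e-6
```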
The following table summarizes the incremental accuracy improvements measured on ImageNet-1K:
| Step | Modification | Top-1 Accuracy | GFLOPs |
|---|---|---|---|
| Baseline | ResNet-50 (modern training) | 78.8% | 4.09 |
| 1 | Stage compute ratio (3,3,9,3) | 79.4% | 4.53 |
| 2 | Patchify stem (4x4, stride 4) | 79.5% | 4.42 |
| 3 | Depthwise conv + width 96 | 80.5% | 5.27 |
| 4 | Inverted bottleneck | 80.6% | 4.64 |
| 5 | Move depthwise conv up | 79.9% | 4.07 |
| 6 | Large kernel (7x7) | 80.6% | 4.15 |
| 7 | GELU, fewer activations | 81.3% | 4.15 |
| 8 | Fewer norms, BN to LN | 81.5% | 4.46 |
| 9 | Separate downsampling | 82.0% | 4.49 |
| Reference | Swin-T | 81.3% | 4.50 |
The final result, 82.0% for ConvNeXt-T, surpassed the Swin-T baseline of 81.3% while using a comparable number of FLOPs.
The ConvNeXt block is the fundamental building unit of the architecture. Each block consists of the following layers in sequence:

- a 7x7 depthwise convolution;
- Layer Normalization;
- a 1x1 convolution expanding the channel dimension by a factor of 4;
- a GELU activation;
- a 1x1 convolution projecting back to the original channel dimension.
A residual connection adds the input of the block to its output, following the standard residual learning paradigm introduced by ResNet. The block also incorporates Layer Scale, a learnable per-channel scaling factor initialized to a small value (1e-6), applied to the residual branch before addition.
This block design bears a structural resemblance to a Transformer block, where the depthwise convolution plays the role of the self-attention layer (mixing spatial information) and the two 1x1 convolutions with GELU activation function as the feed-forward network (mixing channel information).
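Putting these pieces together, a framework-agnostic numpy sketch of a single ConvNeXt block (illustrative random weights; the official implementation is in PyTorch):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def depthwise_conv7x7(x, k):
    """x: (H, W, C); k: (7, 7, C). Zero padding of 3 preserves resolution."""
    H, W, C = x.shape
    xp = np.pad(x, ((3, 3), (3, 3), (0, 0)))
    out = np.zeros_like(x)
    for i in range(7):
        for j in range(7):
            out += xp[i:i + H, j:j + W] * k[i, j]  # per-channel spatial mixing
    return out

def convnext_block(x, k, gamma, beta, w_up, w_down, scale):
    h = depthwise_conv7x7(x, k)                       # spatial mixing
    h = (h - h.mean(-1, keepdims=True)) / np.sqrt(h.var(-1, keepdims=True) + 1e-6)
    h = h * gamma + beta                              # channel-wise LayerNorm
    h = gelu(h @ w_up)                                # 1x1 conv: C -> 4C, GELU
    h = h @ w_down                                    # 1x1 conv: 4C -> C
    return x + scale * h                              # Layer Scale + residual

C = 96
rng = np.random.default_rng(0)
x = rng.standard_normal((14, 14, C))
out = convnext_block(
    x,
    rng.standard_normal((7, 7, C)) * 0.01,            # depthwise kernel
    np.ones(C), np.zeros(C),                          # LayerNorm affine
    rng.standard_normal((C, 4 * C)) * 0.01,           # expand weights
    rng.standard_normal((4 * C, C)) * 0.01,           # project weights
    np.full(C, 1e-6),                                 # Layer Scale init
)
assert out.shape == x.shape
```

Note that a 1x1 convolution on a channels-last map is just a per-pixel matrix multiply, which is why the channel-mixing steps reduce to `@`.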
ConvNeXt follows a four-stage hierarchical design. The input image passes through a patchify stem (4x4 convolution with stride 4) that reduces spatial resolution by a factor of 4 and projects to the initial channel dimension. Each subsequent stage begins with a downsampling layer (2x2 convolution, stride 2, preceded by LayerNorm) that halves the spatial resolution and doubles the channel count. Within each stage, a series of ConvNeXt blocks process the feature maps at a fixed spatial resolution.
The number of blocks per stage and the channel dimensions at each stage define the model variant.
ConvNeXt follows the scaling convention established by Swin Transformer, offering five model sizes. Each variant shares the same block design and overall network structure but differs in depth (number of blocks per stage) and width (channel dimensions).
| Model | Depths | Channel Dimensions | Parameters | FLOPs (224x224) | IN-1K Acc. | IN-22K Acc. |
|---|---|---|---|---|---|---|
| ConvNeXt-T (Tiny) | [3, 3, 9, 3] | [96, 192, 384, 768] | 29M | 4.5G | 82.1% | 82.9% |
| ConvNeXt-S (Small) | [3, 3, 27, 3] | [96, 192, 384, 768] | 50M | 8.7G | 83.1% | 84.6% |
| ConvNeXt-B (Base) | [3, 3, 27, 3] | [128, 256, 512, 1024] | 89M | 15.4G | 83.8% | 85.8% |
| ConvNeXt-L (Large) | [3, 3, 27, 3] | [192, 384, 768, 1536] | 198M | 34.4G | 84.3% | 86.6% |
| ConvNeXt-XL (XLarge) | [3, 3, 27, 3] | [256, 512, 1024, 2048] | 350M | 60.9G | N/A | 87.0% |
The "IN-1K Acc." column reports top-1 accuracy when trained solely on ImageNet-1K at 224x224 resolution. The "IN-22K Acc." column reports accuracy when pre-trained on ImageNet-22K and fine-tuned to ImageNet-1K at 224x224 resolution. The largest model, ConvNeXt-XL, achieved 87.8% top-1 accuracy when fine-tuned at 384x384 resolution with ImageNet-22K pre-training.
The Tiny variant uses a (3, 3, 9, 3) block distribution, while all larger variants use (3, 3, 27, 3), tripling the depth of the third stage for greater capacity. Channel doubling across stages follows the standard convention: each stage's channel count is twice the previous stage's.
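The variant table can be captured as a small configuration mapping (the dictionary and its key names are a hypothetical helper, mirroring the table above):

```python
# (depths per stage, channel dims per stage) for each ConvNeXt variant
CONVNEXT_CONFIGS = {
    "tiny":   ([3, 3, 9, 3],  [96, 192, 384, 768]),
    "small":  ([3, 3, 27, 3], [96, 192, 384, 768]),
    "base":   ([3, 3, 27, 3], [128, 256, 512, 1024]),
    "large":  ([3, 3, 27, 3], [192, 384, 768, 1536]),
    "xlarge": ([3, 3, 27, 3], [256, 512, 1024, 2048]),
}

# Every variant follows the channel-doubling convention across stages
for depths, dims in CONVNEXT_CONFIGS.values():
    assert all(dims[i + 1] == 2 * dims[i] for i in range(3))
```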
ConvNeXt consistently matched or exceeded Swin Transformer accuracy at every comparable model size when trained on ImageNet-1K:
| Model | Parameters | FLOPs | Top-1 Accuracy |
|---|---|---|---|
| Swin-T | 28M | 4.5G | 81.3% |
| ConvNeXt-T | 29M | 4.5G | 82.1% |
| Swin-S | 50M | 8.7G | 83.0% |
| ConvNeXt-S | 50M | 8.7G | 83.1% |
| Swin-B | 88M | 15.4G | 83.5% |
| ConvNeXt-B | 89M | 15.4G | 83.8% |
With ImageNet-22K pre-training and fine-tuning at 384x384 resolution, ConvNeXt-B achieved 86.8% compared to Swin-B's 86.4%.
Using Cascade Mask R-CNN with a 3x training schedule on the COCO dataset, ConvNeXt backbones delivered competitive or superior results compared to Swin Transformer backbones:
| Backbone | Box AP | Mask AP |
|---|---|---|
| Swin-T | 50.4 | 43.7 |
| ConvNeXt-T | 50.4 | 43.7 |
| Swin-S | 51.9 | 45.0 |
| ConvNeXt-S | 51.9 | 45.0 |
| Swin-B | 51.9 | 45.0 |
| ConvNeXt-B | 52.7 | 45.6 |
| ConvNeXt-L (22K) | 54.8 | 47.6 |
At the Base model size and above, ConvNeXt showed clear advantages. ConvNeXt-B outperformed Swin-B by 0.8 box AP and 0.6 mask AP. With ImageNet-22K pre-training, ConvNeXt-L reached 54.8 box AP.
Using the UPerNet framework for semantic segmentation on ADE20K, ConvNeXt again outperformed its Swin Transformer counterparts:
| Backbone | Crop Size | mIoU |
|---|---|---|
| Swin-T (1K) | 512x512 | 45.8 |
| ConvNeXt-T (1K) | 512x512 | 46.7 |
| Swin-B (22K) | 640x640 | 51.7 |
| ConvNeXt-B (22K) | 640x640 | 53.1 |
| ConvNeXt-L (22K) | 640x640 | 53.7 |
ConvNeXt-T improved over Swin-T by 0.9 mIoU, and ConvNeXt-B surpassed Swin-B by 1.4 mIoU.
Beyond accuracy, ConvNeXt models showed favorable inference throughput. Because ConvNeXt uses only standard convolutional operations without specialized modules like shifted windows or relative position biases, it benefits from highly optimized GPU implementations of convolutions. On NVIDIA A100 GPUs, ConvNeXt models were reported to be up to 49% faster than Swin Transformers of comparable size.
In January 2023, a follow-up paper titled "ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders" was published by Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, and Saining Xie. This work appeared at CVPR 2023 and extended the ConvNeXt architecture with two key innovations: a fully convolutional masked autoencoder (FCMAE) framework for self-supervised pre-training and a new normalization layer called Global Response Normalization (GRN).
Masked autoencoders (MAEs), originally designed for Vision Transformers, learn visual representations by masking a large portion of input patches and training the model to reconstruct them. Directly applying this approach to ConvNets posed challenges because standard convolutions cannot naturally skip over masked regions the way Transformer attention can ignore masked tokens.
FCMAE addressed this by converting standard convolution layers to sparse convolutions during pre-training. Sparse convolutions operate only on visible (unmasked) pixels, allowing the ConvNet to process masked inputs efficiently. A masking ratio of 60% was applied to 32x32 patches. During fine-tuning, the sparse convolutions convert back to standard dense convolutions without any special handling.
The decoder in FCMAE is intentionally lightweight, consisting of a single ConvNeXt block with a decoder dimension of 512. Training uses mean squared error (MSE) loss computed only on masked patches with patch-wise normalized targets.
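The masking scheme can be sketched as follows: a 224x224 input gives a 7x7 grid of 32x32 patches, of which 60% are randomly hidden (function name and layout are illustrative, not the official API):

```python
import numpy as np

def random_mask(grid=7, ratio=0.6, rng=None):
    """Boolean mask over a grid x grid patch layout; True = masked."""
    rng = rng or np.random.default_rng()
    n = grid * grid
    n_masked = int(n * ratio)
    mask = np.zeros(n, dtype=bool)
    mask[rng.permutation(n)[:n_masked]] = True  # hide a random subset
    return mask.reshape(grid, grid)

m = random_mask(rng=np.random.default_rng(0))
assert m.shape == (7, 7)
assert m.sum() == int(49 * 0.6)  # 29 of 49 patches masked
```

During pre-training, sparse convolutions compute only at the unmasked positions of such a grid; at fine-tuning time the mask is simply absent.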
When the authors first tried training the original ConvNeXt architecture with FCMAE, they observed feature collapse: many output channels became inactive, producing near-zero activations. This problem did not occur with supervised training. To address this, they introduced GRN, a normalization layer that promotes inter-channel feature competition.
GRN operates in three steps:

1. Global feature aggregation: a spatial L2 norm is computed for each channel, producing one aggregate value per channel.
2. Feature normalization: each channel's aggregate value is divided by the mean aggregate across all channels, yielding a relative importance score per channel.
3. Feature calibration: the input features are scaled channel-wise by these normalized scores.
GRN encourages each channel to develop distinct, diverse features rather than collapsing to redundant representations.
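A minimal numpy sketch of GRN following the three-step description above; with the learnable affine parameters gamma and beta initialized to zero (as in the paper), the layer starts as the identity thanks to its internal residual connection:

```python
import numpy as np

def grn(x, gamma, beta, eps=1e-6):
    """Global Response Normalization on a channels-last (H, W, C) map."""
    g = np.linalg.norm(x, axis=(0, 1))   # 1) spatial L2 norm per channel
    n = g / (g.mean() + eps)             # 2) divisive normalization over channels
    return gamma * (x * n) + beta + x    # 3) calibration + residual

x = np.random.default_rng(0).standard_normal((14, 14, 96))
out = grn(x, np.zeros(96), np.zeros(96))
# With gamma = beta = 0, GRN reduces to the identity mapping
assert np.allclose(out, x)
```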
ConvNeXt V2 expanded the model family with four smaller variants (Atto, Femto, Pico, Nano) designed for resource-constrained settings, while retaining the original Tiny, Base, Large sizes and adding a Huge variant:
| Model | Depths | Channel Dimensions | Parameters | FLOPs (224x224) | IN-1K Acc. (FCMAE) |
|---|---|---|---|---|---|
| ConvNeXt V2-Atto | [2, 2, 6, 2] | [40, 80, 160, 320] | 3.7M | 0.55G | 76.7% |
| ConvNeXt V2-Femto | [2, 2, 6, 2] | [48, 96, 192, 384] | 5.2M | 0.78G | 78.5% |
| ConvNeXt V2-Pico | [2, 2, 6, 2] | [64, 128, 256, 512] | 9.1M | 1.37G | 80.3% |
| ConvNeXt V2-Nano | [2, 2, 8, 2] | [80, 160, 320, 640] | 15.6M | 2.45G | 81.9% |
| ConvNeXt V2-Tiny | [3, 3, 9, 3] | [96, 192, 384, 768] | 28.6M | 4.47G | 83.0% |
| ConvNeXt V2-Base | [3, 3, 27, 3] | [128, 256, 512, 1024] | 89M | 15.4G | 84.9% |
| ConvNeXt V2-Large | [3, 3, 27, 3] | [192, 384, 768, 1536] | 198M | 34.4G | 85.8% |
| ConvNeXt V2-Huge | [3, 3, 27, 3] | [352, 704, 1408, 2816] | 660M | 115G | 86.3% |
With intermediate fine-tuning on ImageNet-22K and evaluation at 512x512 resolution, the ConvNeXt V2-Huge model achieved 88.9% top-1 accuracy on ImageNet-1K, setting a new record among models trained exclusively with publicly available data at the time of publication.
ConvNeXt V2 demonstrated strong transfer learning performance:
COCO Detection (Mask R-CNN):
| Backbone | Box AP | Mask AP |
|---|---|---|
| ConvNeXt V2-Base | 52.9 | 46.6 |
| ConvNeXt V2-Large | 54.4 | 47.7 |
| ConvNeXt V2-Huge | 55.7 | 48.9 |
ADE20K Segmentation (UPerNet):
| Backbone | mIoU |
|---|---|
| ConvNeXt V2-Base | 52.1 |
| ConvNeXt V2-Large | 53.7 |
| ConvNeXt V2-Huge (22K) | 57.0 |
The co-design of architecture (adding GRN) and training framework (FCMAE) proved essential. ConvNeXt V2-Large with FCMAE pre-training achieved 85.8% top-1 accuracy on ImageNet-1K, compared to 84.3% for the original ConvNeXt V1-Large trained with supervised learning, demonstrating that the architectural change and the self-supervised framework work synergistically.
ConvNeXt models are trained for 300 epochs on ImageNet-1K using the AdamW optimizer with a base learning rate of 4e-3 and weight decay of 0.05. A batch size of 4096 is used across multiple GPUs. The learning rate follows a cosine decay schedule after a 20-epoch linear warmup.
Data augmentation includes Mixup (alpha=0.8), CutMix (alpha=1.0), RandAugment (magnitude 9, standard deviation 0.5), and Random Erasing (probability 0.25). Regularization techniques include Stochastic Depth (with drop rates varying by model size, from 0.1 for Tiny to 0.5 for Large/XL) and Label Smoothing (0.1). Layer Scale with an initial value of 1e-6 is used in all ConvNeXt blocks.
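Collected in one place, the hyperparameters above can be grouped as a plain config dictionary (an illustrative summary, not the official config format):

```python
# ImageNet-1K training recipe for ConvNeXt, as reported in the paper
TRAIN_CONFIG = {
    "epochs": 300,
    "optimizer": "AdamW",
    "base_lr": 4e-3,
    "weight_decay": 0.05,
    "batch_size": 4096,
    "warmup_epochs": 20,
    "lr_schedule": "cosine",
    "mixup_alpha": 0.8,
    "cutmix_alpha": 1.0,
    "rand_augment": {"magnitude": 9, "mstd": 0.5},
    "random_erasing_prob": 0.25,
    "label_smoothing": 0.1,
    "layer_scale_init": 1e-6,
    # stochastic depth rate varies by model size (0.1 Tiny .. 0.5 Large/XL)
    "stochastic_depth": {"tiny": 0.1, "large": 0.5},
}
```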
For larger models, pre-training on the full ImageNet-22K dataset (approximately 14 million images across 21,841 classes) is performed for 90 epochs with a 5-epoch warmup. The pre-trained models are then fine-tuned on ImageNet-1K at higher resolutions (224x224 or 384x384) for 30 epochs.
ConvNeXt V2 models are pre-trained with FCMAE for 300 to 800 epochs (depending on model size) on ImageNet-1K without labels, followed by supervised fine-tuning for 100 epochs. The Huge model uses 800 epochs of FCMAE pre-training. For the best results, intermediate fine-tuning on ImageNet-22K is applied before final ImageNet-1K evaluation.
The official ConvNeXt implementation is available in PyTorch through the Facebook Research GitHub repository. ConvNeXt models are also integrated into major deep learning libraries:
- Hugging Face Transformers provides the ConvNextModel and ConvNextV2Model classes, with support for image classification, feature extraction, and backbone use in detection/segmentation frameworks.
- torchvision (since version 0.12) and the timm library also ship ConvNeXt implementations with pre-trained weights.

The official repository was archived in October 2023, with the codebase considered stable and complete.
ConvNeXt's primary contribution is not a single novel component but rather the demonstration that a carefully designed pure ConvNet can compete with state-of-the-art Vision Transformers. This finding had several important implications for the field:
CNNs remain competitive. Before ConvNeXt, a growing consensus held that Transformers would replace CNNs for most vision tasks. ConvNeXt proved that much of the Transformers' advantage came from improved training recipes and specific design choices (larger kernels, fewer activations, Layer Normalization) rather than the self-attention mechanism itself.
Simplicity and efficiency. ConvNeXt achieves its results using only standard convolutional operations, which are highly optimized on modern hardware. Unlike Swin Transformer, which requires custom implementations for shifted window attention, ConvNeXt works out of the box with any deep learning framework.
Design principles transfer across paradigms. The paper showed that design innovations developed for one architectural paradigm (Transformers) can often be adapted to another (CNNs). This cross-pollination of ideas has inspired subsequent research exploring the boundary between convolutional and attention-based models.
Influence on subsequent architectures. ConvNeXt inspired several follow-up works, including hybrid architectures that combine convolutional and attention layers. Models such as ConvFormer and CAFormer (which achieved 85.5% on ImageNet-1K without extra data) built on ConvNeXt's design principles. ConvNeXt has also been widely adopted as a backbone in domain-specific applications, including medical imaging, remote sensing, and autonomous driving.
Practical deployment. The pure convolutional design of ConvNeXt makes it particularly well-suited for deployment on edge devices and specialized hardware (such as FPGAs and mobile processors) where convolutional operations are better optimized than attention computations. The smaller ConvNeXt V2 variants (Atto through Nano) specifically target these resource-constrained settings.
While ConvNeXt demonstrated that CNNs can match Transformers on standard benchmarks, several limitations are worth noting: