# ConvNeXt

> Source: https://aiwiki.ai/wiki/convnext
> Updated: 2026-06-22
> Categories: Computer Vision, Deep Learning, Neural Networks
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**ConvNeXt** is a family of pure [convolutional neural network](/wiki/convolutional_neural_network) (CNN) models that match or beat Vision Transformers on standard vision benchmarks, reaching **87.8% ImageNet top-1 accuracy** while using only conventional convolutional building blocks. It was introduced by Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie in the paper "A ConvNet for the 2020s," published at the IEEE/CVF Conference on [Computer Vision](/wiki/computer_vision) and Pattern Recognition ([CVPR](/wiki/cvpr)) in 2022.[1] The authors state that ConvNeXts "compete favorably with Transformers in terms of accuracy and scalability, achieving 87.8% ImageNet top-1 accuracy and outperforming Swin Transformers on COCO detection and ADE20K segmentation, while maintaining the simplicity and efficiency of standard ConvNets."[1]

The paper challenged the prevailing assumption that [Vision Transformers](/wiki/vision_transformer) (ViTs) had rendered CNNs obsolete for visual recognition tasks. By systematically modernizing a standard [ResNet](/wiki/resnet) architecture using design principles borrowed from Transformers, the authors demonstrated that a pure ConvNet can match or surpass the performance of hierarchical Vision Transformers such as the [Swin Transformer](/wiki/swin_transformer) on [ImageNet](/wiki/imagenet) classification, [COCO](/wiki/coco_dataset) object detection, and ADE20K semantic segmentation.[1][3]

The ConvNeXt project originated at Facebook AI Research (FAIR), now part of [Meta](/wiki/meta) AI, in collaboration with UC Berkeley.[1] The work has since accumulated thousands of citations and renewed interest in CNN-based architectures for [deep learning](/wiki/deep_learning) research and practical applications.

## What problem did ConvNeXt set out to solve?

The early 2020s saw a rapid shift in [computer vision](/wiki/computer_vision) research away from CNNs and toward [Transformer](/wiki/transformer)-based architectures. The original [Vision Transformer](/wiki/vision_transformer) (ViT) by Dosovitskiy et al. (2020) showed that a pure Transformer applied directly to sequences of image patches could achieve excellent results on image classification, particularly when pre-trained on large datasets.[4] Subsequent models like [DeiT](/wiki/deit) and the [Swin Transformer](/wiki/swin_transformer) refined this approach, incorporating hierarchical feature maps and efficient attention mechanisms that made Transformers competitive even when trained solely on ImageNet-1K.[9][3]

During this same period, CNN-based models received comparatively little attention in terms of architectural innovation. Most improvements in CNN performance came from better training procedures rather than fundamental design changes. The ConvNeXt paper set out to answer a specific question: how much of the performance gap between modern CNNs and Vision Transformers is attributable to the Transformer architecture itself, and how much comes from associated training strategies and design decisions that could, in principle, be applied to any architecture?[1]

The authors hypothesized that many of the design choices commonly associated with Transformers (such as larger receptive fields, inverted bottleneck structures, and specific normalization schemes) are not inherently tied to the [self-attention](/wiki/attention) mechanism and can be incorporated into a pure convolutional framework.[1]

## How was ConvNeXt built by modernizing ResNet-50?

The core contribution of the ConvNeXt paper is a systematic, step-by-step modernization of a standard ResNet-50 model. Rather than proposing a single novel architecture, the authors incrementally applied a series of modifications, measuring the impact of each change on ImageNet-1K top-1 accuracy.[1] This methodical approach provides a roadmap showing exactly which design decisions matter most.

### Step 1: Modernized Training Recipe

The first step involved no architectural changes at all. The authors simply updated the training procedure for a standard ResNet-50 from the original 90-epoch schedule to a modern recipe inspired by DeiT and Swin Transformer training.[1][9] The updated recipe included:

- Extending training from 90 to 300 epochs
- Using the [AdamW](/wiki/adam_optimizer) optimizer instead of SGD
- Applying a cosine learning rate schedule with a learning rate of 4e-3 and 20-epoch linear warmup
- [Data augmentation](/wiki/data_augmentation) techniques: [Mixup](/wiki/data_augmentation) (alpha=0.8), CutMix (alpha=1.0), [RandAugment](/wiki/data_augmentation), and Random Erasing (probability=0.25)
- [Regularization](/wiki/regularization) with [Stochastic Depth](/wiki/stochastic_depth) and Label Smoothing (0.1)

This change alone raised the ResNet-50 top-1 accuracy from **76.1% to 78.8%**, a gain of 2.7 percentage points without touching the architecture.[1] This result highlighted how much modern training strategies contributed to the reported performance gap between CNNs and Transformers.

### Step 2: Macro Design Changes

**Stage compute ratio.** ResNet-50 distributes its blocks across four stages in a (3, 4, 6, 3) pattern.[5] The Swin Transformer uses a roughly 1:1:3:1 ratio across its stages.[3] The authors adjusted the ResNet block distribution to (3, 3, 9, 3), concentrating more computation in the third stage. This change improved accuracy from **78.8% to 79.4%**.[1]

**Patchify stem.** The standard ResNet stem uses a 7x7 convolution with stride 2 followed by a 3x3 max pooling layer to downsample the input by a factor of 4.[5] The authors replaced this with a single non-overlapping 4x4 convolution with stride 4, analogous to the patch embedding layer used in Vision Transformers. Accuracy improved slightly from **79.4% to 79.5%**.[1]
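The downsampling arithmetic of the two stems can be checked with the standard convolution output-size formula. This is a minimal sketch: the padding values (3 for the 7x7 convolution, 1 for the max pooling layer), the stride of 2 for the pooling layer, and the helper names are assumptions based on the usual ResNet settings, not details stated above.

```python
def conv_out(size: int, kernel: int, stride: int, padding: int = 0) -> int:
    """Output length of a convolution or pooling layer along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def resnet_stem(size: int) -> int:
    size = conv_out(size, kernel=7, stride=2, padding=3)  # 7x7 conv, stride 2
    return conv_out(size, kernel=3, stride=2, padding=1)  # 3x3 max pool, stride 2 (assumed)


def patchify_stem(size: int) -> int:
    return conv_out(size, kernel=4, stride=4)             # non-overlapping 4x4 patches


print(resnet_stem(224), patchify_stem(224))  # 56 56
```

Under these assumptions both stems map a 224x224 input to a 56x56 feature map, so the patchify stem changes how the 4x reduction is computed, not how much reduction happens.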

### Step 3: ResNeXt-ify with Depthwise Convolution

The authors adopted [depthwise separable convolutions](/wiki/convolutional_neural_network), built around a depthwise convolution in which the number of groups equals the number of channels. This separates spatial mixing (done by the depthwise convolution) from channel mixing (done by 1x1 pointwise convolutions), mirroring the weighted-sum operation in self-attention, which works on a per-channel basis and mixes information only across spatial positions. The depthwise convolution itself reduced both FLOPs and accuracy, but increasing the base channel width from 64 to 96 (matching Swin-T) compensated for the loss. The net result was an improvement from **79.5% to 80.5%**.[1]
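To illustrate why splitting spatial and channel mixing is cheap, the sketch below counts weights for a dense 3x3 convolution and for a depthwise 3x3 convolution followed by a 1x1 pointwise convolution. The width of 96 and the equal input and output widths are chosen for illustration only; the function names are hypothetical and biases are ignored.

```python
def dense_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a standard k x k convolution (bias omitted)."""
    return c_in * c_out * k * k


def depthwise_conv_params(channels: int, k: int) -> int:
    """Weights in a depthwise k x k convolution: one k x k filter per channel."""
    return channels * k * k


def pointwise_conv_params(c_in: int, c_out: int) -> int:
    """Weights in a 1x1 convolution, which mixes channels only."""
    return c_in * c_out


c = 96
dense = dense_conv_params(c, c, 3)                                      # 82,944
separable = depthwise_conv_params(c, 3) + pointwise_conv_params(c, c)   # 864 + 9,216 = 10,080
print(dense, separable, round(dense / separable, 1))
```

At this width the separable pair needs roughly an eighth of the weights of the dense layer, which is the kind of saving that left room to widen the network.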

### Step 4: Inverted Bottleneck

Transformer blocks use feed-forward networks (FFNs) where the hidden dimension is four times the input dimension. This inverted bottleneck design, also seen in [MobileNetV2](/wiki/mobilenet), expands channels before compressing them.[6] The authors applied the same principle: starting with 96 channels, the hidden layer expanded to 384 channels. Despite increasing FLOPs in the depthwise convolution layer, the overall network FLOPs decreased, which the paper attributes largely to the cheaper 1x1 shortcut convolutions in the downsampling residual blocks. Accuracy improved marginally from **80.5% to 80.6%**.[1]

### Step 5: Large Kernel Size

Vision Transformers benefit from large receptive fields through self-attention, and Swin Transformers use a window size of at least 7x7.[3] To achieve a similar effect in a ConvNet, the authors increased the depthwise convolution kernel from 3x3 to 7x7. Before making this change, they moved the depthwise convolution layer to the top of the block (before the 1x1 layers), mirroring the way a Transformer block places self-attention before its MLP. With this ordering, the large-kernel layer operates on the narrower, unexpanded features while the efficient 1x1 layers handle the expanded width. The repositioning alone temporarily dropped accuracy to **79.9%**; enlarging the kernel to 7x7 then brought it back to **80.6%**.[1]
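The cost of a depthwise convolution scales with the number of channels it sees, so its position inside the inverted bottleneck matters once the kernel grows. The following rough estimate counts multiply-accumulates at a single output position; the function name and this simplified accounting are illustrative assumptions, not the paper's FLOP measurement.

```python
def depthwise_macs_per_position(channels: int, kernel: int) -> int:
    """Multiply-accumulates of a depthwise conv at one output position."""
    return channels * kernel * kernel


narrow, wide = 96, 384  # block width and its 4x expanded hidden width

print(depthwise_macs_per_position(wide, 3))    # 3456: 3x3 on the expanded features
print(depthwise_macs_per_position(wide, 7))    # 18816: 7x7 on the expanded features
print(depthwise_macs_per_position(narrow, 7))  # 4704: 7x7 on the unexpanded features
```

In this estimate, a 7x7 kernel on the 96-channel features costs about the same order as a 3x3 kernel on the 384-channel features, whereas a 7x7 kernel on the expanded features would be several times more expensive.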

### Step 6: Micro Design Changes

Several smaller modifications further closed the gap with Transformers:

**Replacing ReLU with GELU.** The [Gaussian Error Linear Unit](/wiki/activation_function) (GELU) activation, used in Transformers such as [GPT](/wiki/gpt) and [BERT](/wiki/bert), replaced [ReLU](/wiki/relu) throughout the network.[1]

**Fewer activation functions.** In a Transformer block, only one activation function exists within the FFN (between the two linear layers). The authors mimicked this by keeping only a single GELU between the two 1x1 convolutions and removing all other activations from the block.[1]

**Fewer normalization layers.** Transformers typically use one normalization layer per sub-block. The authors removed two of the three [BatchNorm](/wiki/batch_normalization) layers, keeping only one before the first 1x1 convolution.[1]

**Replacing BatchNorm with LayerNorm.** [Layer Normalization](/wiki/layer_normalization), the standard in Transformers, replaced BatchNorm. While LayerNorm is less common in CNNs and can sometimes hurt performance, it worked well in the ConvNeXt architecture.[1]

**Separate downsampling layers.** Instead of using strided convolutions within residual blocks for spatial downsampling (as in ResNet), the authors added explicit 2x2 convolution layers with stride 2 between stages, with a LayerNorm layer for stability.[1]

These micro design changes collectively raised accuracy from **80.6% to 82.0%**.[1]
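For readers unfamiliar with GELU, the sketch below compares it with ReLU using the exact definition GELU(x) = x · Φ(x), where Φ is the standard normal cumulative distribution function. The sample inputs are arbitrary; frameworks often use a tanh approximation instead of the `erf` form shown here.

```python
import math


def relu(x: float) -> float:
    return max(0.0, x)


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"{x:+.1f}  relu={relu(x):.4f}  gelu={gelu(x):.4f}")
```

Unlike ReLU, GELU is smooth and lets small negative inputs pass through with a small negative output, while behaving almost identically to ReLU for large positive or negative values.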

### Summary of Modernization Steps

The following table summarizes the incremental accuracy improvements measured on ImageNet-1K:

| Step | Modification | Top-1 Accuracy | GFLOPs |
|------|-------------|---------------|--------|
| Baseline | ResNet-50 (modern training) | 78.8% | 4.09 |
| 1 | Stage compute ratio (3,3,9,3) | 79.4% | 4.53 |
| 2 | Patchify stem (4x4, stride 4) | 79.5% | 4.42 |
| 3 | Depthwise conv + width 96 | 80.5% | 5.27 |
| 4 | Inverted bottleneck | 80.6% | 4.64 |
| 5 | Move depthwise conv up | 79.9% | 4.07 |
| 6 | Large kernel (7x7) | 80.6% | 4.15 |
| 7 | GELU, fewer activations | 81.3% | 4.15 |
| 8 | Fewer norms, BN to LN | 81.5% | 4.46 |
| 9 | Separate downsampling | 82.0% | 4.49 |
| Reference | Swin-T | 81.3% | 4.50 |

The final result, 82.0% for ConvNeXt-T, surpassed the Swin-T baseline of 81.3% while using a comparable number of FLOPs.[1]

## Architecture

### ConvNeXt Block

The ConvNeXt block is the fundamental building unit of the architecture. Each block consists of the following layers in sequence:

1. **7x7 depthwise convolution** with padding to preserve spatial dimensions
2. **Layer Normalization**
3. **1x1 pointwise convolution** expanding the channel dimension by a factor of 4
4. **GELU activation**
5. **1x1 pointwise convolution** projecting back to the original channel dimension

A residual connection adds the input of the block to its output, following the standard residual learning paradigm introduced by [ResNet](/wiki/resnet).[5] The block also incorporates Layer Scale, a learnable per-channel scaling factor initialized to a small value (1e-6), applied to the residual branch before addition.[1]

This block design bears a structural resemblance to a Transformer block: the depthwise convolution plays the role of the self-attention layer (mixing spatial information), and the two 1x1 convolutions, with the GELU activation between them, act as the feed-forward network (mixing channel information).[1]
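The layer sequence described above can be sketched as a NumPy forward pass for a single image. This is an illustrative sketch rather than the reference implementation: the weight initialization, the dictionary-based parameter layout, the tanh approximation of GELU, and all function names are assumptions. The 1x1 convolutions are written as matrix multiplications over the channel axis, which is mathematically equivalent.

```python
import numpy as np


def gelu(x):
    # tanh approximation of GELU (an assumption; the exact form uses erf)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))


def layer_norm(x, weight, bias, eps=1e-6):
    # normalize each spatial position over its channel axis (last axis)
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps) * weight + bias


def depthwise_conv7x7(x, weight, bias):
    # x: (C, H, W); weight: (C, 7, 7). Zero padding of 3 preserves H and W.
    _, h, w = x.shape
    padded = np.pad(x, ((0, 0), (3, 3), (3, 3)))
    out = np.zeros_like(x)
    for i in range(7):
        for j in range(7):
            out += weight[:, i, j][:, None, None] * padded[:, i:i + h, j:j + w]
    return out + bias[:, None, None]


def init_block(dim, rng, layer_scale_init=1e-6):
    # random weights for illustration only
    return {
        "dw_w": rng.normal(0, 0.02, (dim, 7, 7)), "dw_b": np.zeros(dim),
        "ln_w": np.ones(dim), "ln_b": np.zeros(dim),
        "w1": rng.normal(0, 0.02, (dim, 4 * dim)), "b1": np.zeros(4 * dim),
        "w2": rng.normal(0, 0.02, (4 * dim, dim)), "b2": np.zeros(dim),
        "gamma": np.full(dim, layer_scale_init),
    }


def convnext_block(x, p):
    y = depthwise_conv7x7(x, p["dw_w"], p["dw_b"])  # 1. 7x7 depthwise conv (spatial mixing)
    y = y.transpose(1, 2, 0)                        # (C, H, W) -> (H, W, C)
    y = layer_norm(y, p["ln_w"], p["ln_b"])         # 2. LayerNorm over channels
    y = y @ p["w1"] + p["b1"]                       # 3. 1x1 conv as a matmul: C -> 4C
    y = gelu(y)                                     # 4. the block's single activation
    y = y @ p["w2"] + p["b2"]                       # 5. 1x1 conv: 4C -> C
    y = p["gamma"] * y                              # Layer Scale on the residual branch
    return x + y.transpose(2, 0, 1)                 # residual connection


rng = np.random.default_rng(0)
x = rng.normal(size=(96, 14, 14))
out = convnext_block(x, init_block(96, rng))
print(out.shape)
```

Because Layer Scale starts at 1e-6, the block is close to an identity mapping at initialization: the output has the same shape as the input and differs from it only slightly until the scale parameters are trained.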

### Overall Network Structure

ConvNeXt follows a four-stage hierarchical design. The input image passes through a patchify stem (4x4 convolution with stride 4) that reduces spatial resolution by a factor of 4 and projects to the initial channel dimension. Each subsequent stage begins with a downsampling layer (2x2 convolution, stride 2, preceded by LayerNorm) that halves the spatial resolution and doubles the channel count. Within each stage, a series of ConvNeXt blocks process the feature maps at a fixed spatial resolution.[1]
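The resulting feature-map shapes can be traced with a few lines of arithmetic. This sketch assumes an input size divisible by 32 and uses the channel dimensions of the Tiny variant; the function name is hypothetical.

```python
def stage_shapes(image_size: int, dims: list[int]) -> list[tuple[int, int, int]]:
    """Return (channels, height, width) at the output of each of the four stages."""
    shapes = []
    size = image_size // 4      # patchify stem: 4x4 conv, stride 4
    for i, dim in enumerate(dims):
        if i > 0:
            size //= 2          # downsampling layer: 2x2 conv, stride 2
        shapes.append((dim, size, size))
    return shapes


for shape in stage_shapes(224, [96, 192, 384, 768]):
    print(shape)
```

For a 224x224 input this gives 56x56, 28x28, 14x14, and 7x7 feature maps, the same pyramid of resolutions that detection and segmentation frameworks expect from a hierarchical backbone.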

The number of blocks per stage and the channel dimensions at each stage define the model variant.

## What are the ConvNeXt model variants?

ConvNeXt follows the scaling convention established by Swin Transformer, offering five model sizes.[3] Each variant shares the same block design and overall network structure but differs in depth (number of blocks per stage) and width (channel dimensions).

### ConvNeXt V1 Model Sizes

| Model | Depths | Channel Dimensions | Parameters | FLOPs (224x224) | IN-1K Acc. | IN-22K Acc. |
|-------|--------|-------------------|------------|-----------------|------------|-------------|
| ConvNeXt-T (Tiny) | [3, 3, 9, 3] | [96, 192, 384, 768] | 29M | 4.5G | 82.1% | 82.9% |
| ConvNeXt-S (Small) | [3, 3, 27, 3] | [96, 192, 384, 768] | 50M | 8.7G | 83.1% | 84.6% |
| ConvNeXt-B (Base) | [3, 3, 27, 3] | [128, 256, 512, 1024] | 89M | 15.4G | 83.8% | 85.8% |
| ConvNeXt-L (Large) | [3, 3, 27, 3] | [192, 384, 768, 1536] | 198M | 34.4G | 84.3% | 86.6% |
| ConvNeXt-XL (XLarge) | [3, 3, 27, 3] | [256, 512, 1024, 2048] | 350M | 60.9G | N/A | 87.0% |

The "IN-1K Acc." column reports top-1 accuracy when trained solely on ImageNet-1K at 224x224 resolution. The "IN-22K Acc." column reports accuracy when pre-trained on ImageNet-22K and fine-tuned to ImageNet-1K at 224x224 resolution. The largest model, ConvNeXt-XL, achieved 87.8% top-1 accuracy when fine-tuned at 384x384 resolution with ImageNet-22K pre-training.[1]

The Tiny variant uses a (3, 3, 9, 3) block distribution, while all larger variants use (3, 3, 27, 3), tripling the depth of the third stage for greater capacity.[1] Channel doubling across stages follows the standard convention: each stage's channel count is twice the previous stage's.
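The parameter counts in the table can be approximately reproduced from the depths and channel dimensions alone. The sketch below assumes that every convolution and linear layer has a bias, that each block carries one Layer Scale parameter per channel, and that a final LayerNorm precedes a 1000-class linear head; these details and the function names are assumptions of the sketch.

```python
def block_params(dim: int) -> int:
    dwconv = dim * 7 * 7 + dim          # 7x7 depthwise conv + bias
    norm = 2 * dim                      # LayerNorm weight + bias
    expand = dim * 4 * dim + 4 * dim    # 1x1 conv, dim -> 4*dim
    project = 4 * dim * dim + dim       # 1x1 conv, 4*dim -> dim
    layer_scale = dim                   # one learnable scale per channel
    return dwconv + norm + expand + project + layer_scale


def convnext_params(depths, dims, num_classes=1000, in_chans=3):
    total = in_chans * dims[0] * 4 * 4 + dims[0] + 2 * dims[0]  # stem conv + LayerNorm
    for i in range(1, 4):                                       # three downsampling layers
        total += 2 * dims[i - 1]                                # LayerNorm
        total += dims[i - 1] * dims[i] * 2 * 2 + dims[i]        # 2x2 conv, stride 2
    total += sum(d * block_params(c) for d, c in zip(depths, dims))
    total += 2 * dims[-1]                                       # final LayerNorm (assumed)
    total += dims[-1] * num_classes + num_classes               # linear classifier head
    return total


variants = {
    "ConvNeXt-T": ([3, 3, 9, 3], [96, 192, 384, 768]),
    "ConvNeXt-S": ([3, 3, 27, 3], [96, 192, 384, 768]),
    "ConvNeXt-B": ([3, 3, 27, 3], [128, 256, 512, 1024]),
}
for name, (depths, dims) in variants.items():
    print(name, f"{convnext_params(depths, dims) / 1e6:.1f}M")
```

Under these assumptions the counts come to about 28.6M, 50.2M, and 88.6M, in line with the rounded figures of 29M, 50M, and 89M in the table. Most of the parameters sit in the 1x1 convolutions of the third and fourth stages.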

## How does ConvNeXt compare with Vision Transformers?

### ImageNet Classification

ConvNeXt consistently matched or exceeded Swin Transformer accuracy at every comparable model size when trained on ImageNet-1K:

| Model | Parameters | FLOPs | Top-1 Accuracy |
|-------|-----------|-------|----------------|
| Swin-T | 28M | 4.5G | 81.3% |
| ConvNeXt-T | 29M | 4.5G | 82.1% |
| Swin-S | 50M | 8.7G | 83.0% |
| ConvNeXt-S | 50M | 8.7G | 83.1% |
| Swin-B | 88M | 15.4G | 83.5% |
| ConvNeXt-B | 89M | 15.4G | 83.8% |

With ImageNet-22K pre-training and fine-tuning at 384x384 resolution, ConvNeXt-B achieved 86.8% compared to Swin-B's 86.4%.[1]

### COCO Object Detection and Instance Segmentation

Using Cascade Mask R-CNN with a 3x training schedule on the [COCO](/wiki/coco_dataset) dataset, ConvNeXt backbones delivered competitive or superior results compared to Swin Transformer backbones:

| Backbone | Box AP | Mask AP |
|----------|--------|---------|
| Swin-T | 50.4 | 43.7 |
| ConvNeXt-T | 50.4 | 43.7 |
| Swin-S | 51.9 | 45.0 |
| ConvNeXt-S | 51.9 | 45.0 |
| Swin-B | 51.9 | 45.0 |
| ConvNeXt-B | 52.7 | 45.6 |
| ConvNeXt-L (22K) | 54.8 | 47.6 |

The Tiny and Small backbones matched their Swin counterparts, while at the Base model size ConvNeXt showed a clear advantage: ConvNeXt-B outperformed Swin-B by 0.8 box AP and 0.6 mask AP. With ImageNet-22K pre-training, ConvNeXt-L reached 54.8 box AP.[1]

### ADE20K Semantic Segmentation

Using the UPerNet framework for semantic segmentation on ADE20K, ConvNeXt again outperformed its Swin Transformer counterparts:

| Backbone | Crop Size | mIoU |
|----------|----------|------|
| Swin-T (1K) | 512x512 | 45.8 |
| ConvNeXt-T (1K) | 512x512 | 46.7 |
| Swin-B (22K) | 640x640 | 51.7 |
| ConvNeXt-B (22K) | 640x640 | 53.1 |
| ConvNeXt-L (22K) | 640x640 | 53.7 |

ConvNeXt-T improved over Swin-T by 0.9 mIoU, and ConvNeXt-B surpassed Swin-B by 1.4 mIoU.[1]

### Throughput Advantages

Beyond accuracy, ConvNeXt models showed favorable inference throughput. Because ConvNeXt uses only standard convolutional operations without specialized modules like shifted windows or relative position biases, it benefits from highly optimized GPU implementations of convolutions. On NVIDIA A100 GPUs, ConvNeXt models were reported to be up to 49% faster than Swin Transformers of comparable size.[1]

## What is ConvNeXt V2?

In January 2023, a follow-up paper titled "ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders" was published by Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, and Saining Xie.[2] This work appeared at CVPR 2023 and extended the ConvNeXt architecture with two key innovations: a fully convolutional masked autoencoder (FCMAE) framework for self-supervised pre-training and a new normalization layer called Global Response Normalization (GRN).[2] The authors describe the two contributions as "a fully convolutional masked autoencoder framework and a new Global Response Normalization (GRN) layer that can be added to the ConvNeXt architecture to enhance inter-channel feature competition."[2]

### Fully Convolutional Masked Autoencoder (FCMAE)

Masked autoencoders (MAEs), originally designed for Vision Transformers, learn visual representations by masking a large portion of input patches and training the model to reconstruct them.[10] Directly applying this approach to ConvNets posed challenges because standard convolutions cannot naturally skip over masked regions the way Transformer attention can ignore masked tokens.[2]

FCMAE addressed this by converting standard convolution layers to sparse convolutions during pre-training. Sparse convolutions operate only on visible (unmasked) pixels, allowing the ConvNet to process masked inputs efficiently. A masking ratio of 60% was applied to 32x32 patches. During fine-tuning, the sparse convolutions convert back to standard dense convolutions without any special handling.[2]
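The masking step can be sketched with the standard library. This is a simplified illustration of random patch masking only, not of sparse convolution: the function names, the seeding, and the rule of rounding the number of visible patches down are assumptions.

```python
import random


def random_patch_mask(image_size=224, patch_size=32, mask_ratio=0.6, seed=0):
    """Return a 2D grid of 0/1 flags, one per patch, where 1 means masked."""
    grid = image_size // patch_size
    num_patches = grid * grid
    num_keep = int(num_patches * (1 - mask_ratio))  # visible patches, rounded down (assumed)
    order = list(range(num_patches))
    random.Random(seed).shuffle(order)
    visible = set(order[:num_keep])
    return [[0 if r * grid + c in visible else 1 for c in range(grid)]
            for r in range(grid)]


def upsample_mask(mask, factor):
    """Repeat each patch flag over a factor x factor block of pixels."""
    return [[flag for flag in row for _ in range(factor)]
            for row in mask for _ in range(factor)]


mask = random_patch_mask()
pixel_mask = upsample_mask(mask, 32)
print(sum(map(sum, mask)), len(pixel_mask), len(pixel_mask[0]))
```

A 224x224 image yields a 7x7 grid of 32x32 patches. With the rounding rule assumed here, 19 patches stay visible and 30 are masked, and the upsampled mask marks the pixels that a sparse convolution would skip.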

The decoder in FCMAE is intentionally lightweight, consisting of a single ConvNeXt block with a decoder dimension of 512. Training uses mean squared error (MSE) loss computed only on masked patches with patch-wise normalized targets.[2]
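The reconstruction loss described above can be sketched in NumPy. The array layout (one row of flattened pixels per patch), the epsilon value, and the function names are assumptions made for illustration.

```python
import numpy as np


def normalize_patches(target, eps=1e-6):
    """Normalize each patch (row) to zero mean and unit variance."""
    mean = target.mean(axis=-1, keepdims=True)
    var = target.var(axis=-1, keepdims=True)
    return (target - mean) / np.sqrt(var + eps)


def masked_mse(pred, target, mask):
    """MSE against patch-normalized targets, averaged over masked patches only.

    pred, target: (num_patches, pixels_per_patch); mask: (num_patches,), 1 = masked.
    """
    per_patch = ((pred - normalize_patches(target)) ** 2).mean(axis=-1)
    return float((per_patch * mask).sum() / mask.sum())


rng = np.random.default_rng(0)
target = rng.normal(size=(4, 12))
mask = np.array([1.0, 0.0, 1.0, 0.0])
pred = normalize_patches(target)  # a perfect reconstruction of the normalized targets
pred[1] += 5.0                    # error on a visible patch, which the loss ignores
print(masked_mse(pred, target, mask))
```

Errors on visible patches contribute nothing, so the model is rewarded only for predicting content it could not see.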

### Global Response Normalization (GRN)

When the authors first tried training the original ConvNeXt architecture with FCMAE, they observed feature collapse: many output channels became inactive, producing near-zero activations. This problem did not occur with supervised training. To address this, they introduced GRN, a normalization layer that promotes inter-channel feature competition.[2]

GRN operates in three steps:

1. **Global feature aggregation:** For each channel, compute a global statistic (L2-norm) across the spatial dimensions, reducing the feature map from H x W x C to a C-dimensional vector.
2. **Feature normalization:** Apply divisive normalization across channels, computing the ratio of each channel's global response to the sum of all channels' responses.
3. **Feature calibration:** Scale the original feature map by the normalized response values, with learnable parameters (gamma and beta) and a residual connection.

GRN encourages each channel to develop distinct, diverse features rather than collapsing to redundant representations.[2]
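The three steps can be sketched in NumPy for a single feature map. As a stated assumption, this sketch divides each channel norm by the mean of the channel norms (the sum divided by C) rather than by the sum itself; the two differ only by a constant factor that the learnable gamma can absorb. The epsilon value and function name are also assumptions.

```python
import numpy as np


def grn(x, gamma, beta, eps=1e-6):
    """Global Response Normalization sketch. x: (H, W, C); gamma, beta: (C,)."""
    # 1. Global feature aggregation: L2 norm per channel over H and W, shape (1, 1, C)
    gx = np.sqrt((x ** 2).sum(axis=(0, 1), keepdims=True))
    # 2. Feature normalization: each channel's response relative to the channel mean
    nx = gx / (gx.mean(axis=-1, keepdims=True) + eps)
    # 3. Feature calibration with learnable scale and shift, plus a residual connection
    return gamma * (x * nx) + beta + x


x = np.ones((2, 2, 2))
x[..., 1] = 3.0  # channel 1 responds three times as strongly as channel 0
out = grn(x, gamma=np.ones(2), beta=np.zeros(2))
print(out[0, 0])
```

In this toy input the stronger channel is amplified more than the weaker one, which is the competitive effect the layer is meant to produce. With gamma and beta set to zero the layer reduces to the identity.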

### ConvNeXt V2 Model Sizes

ConvNeXt V2 expanded the model family with four smaller variants (Atto, Femto, Pico, Nano) designed for resource-constrained settings, while retaining the original Tiny, Base, and Large sizes and adding a Huge variant:[2]

| Model | Depths | Channel Dimensions | Parameters | FLOPs (224x224) | IN-1K Acc. (FCMAE) |
|-------|--------|-------------------|------------|-----------------|---------------------|
| ConvNeXt V2-Atto | [2, 2, 6, 2] | [40, 80, 160, 320] | 3.7M | 0.55G | 76.7% |
| ConvNeXt V2-Femto | [2, 2, 6, 2] | [48, 96, 192, 384] | 5.2M | 0.78G | 78.5% |
| ConvNeXt V2-Pico | [2, 2, 6, 2] | [64, 128, 256, 512] | 9.1M | 1.37G | 80.3% |
| ConvNeXt V2-Nano | [2, 2, 8, 2] | [80, 160, 320, 640] | 15.6M | 2.45G | 81.9% |
| ConvNeXt V2-Tiny | [3, 3, 9, 3] | [96, 192, 384, 768] | 28.6M | 4.47G | 83.0% |
| ConvNeXt V2-Base | [3, 3, 27, 3] | [128, 256, 512, 1024] | 89M | 15.4G | 84.9% |
| ConvNeXt V2-Large | [3, 3, 27, 3] | [192, 384, 768, 1536] | 198M | 34.4G | 85.8% |
| ConvNeXt V2-Huge | [3, 3, 27, 3] | [352, 704, 1408, 2816] | 660M | 115G | 86.3% |

With intermediate fine-tuning on ImageNet-22K and evaluation at 512x512 resolution, the ConvNeXt V2-Huge model achieved **88.9% top-1 accuracy** on ImageNet-1K. The V2 paper reports a "650M Huge model that achieves a state-of-the-art 88.9% accuracy using only public training data," setting a new record among models trained exclusively with publicly available data at the time of publication.[2]

### ConvNeXt V2 Downstream Results

ConvNeXt V2 demonstrated strong transfer learning performance:

**COCO Detection (Mask R-CNN):**

| Backbone | Box AP | Mask AP |
|----------|--------|---------|
| ConvNeXt V2-Base | 52.9 | 46.6 |
| ConvNeXt V2-Large | 54.4 | 47.7 |
| ConvNeXt V2-Huge | 55.7 | 48.9 |

**ADE20K Segmentation (UPerNet):**

| Backbone | mIoU |
|----------|------|
| ConvNeXt V2-Base | 52.1 |
| ConvNeXt V2-Large | 53.7 |
| ConvNeXt V2-Huge (22K) | 57.0 |

The co-design of architecture (adding GRN) and training framework (FCMAE) proved essential. ConvNeXt V2-Large with FCMAE pre-training achieved 85.8% top-1 accuracy on ImageNet-1K, compared to 84.3% for the original ConvNeXt V1-Large trained with supervised learning, demonstrating that the architectural change and the self-supervised framework work synergistically.[2]

## Training Details

### ImageNet-1K Training

ConvNeXt models are trained for 300 epochs on ImageNet-1K using the AdamW optimizer with a base learning rate of 4e-3 and weight decay of 0.05. A batch size of 4096 is used across multiple GPUs. The learning rate follows a cosine decay schedule after a 20-epoch linear warmup.[1]
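The learning rate schedule can be written as a small function. This sketch works per epoch rather than per iteration, starts the warmup from zero, and decays to a minimum of zero; those three choices and the function name are assumptions, since the text above specifies only the base rate, the warmup length, and the cosine shape.

```python
import math


def learning_rate(epoch: float, base_lr: float = 4e-3, warmup_epochs: int = 20,
                  total_epochs: int = 300, min_lr: float = 0.0) -> float:
    """Linear warmup followed by cosine decay."""
    if epoch < warmup_epochs:
        return base_lr * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for epoch in (0, 10, 20, 160, 300):
    print(epoch, f"{learning_rate(epoch):.5f}")
```

The rate climbs linearly to 4e-3 over the first 20 epochs, falls to half that value midway through the remaining 280 epochs, and reaches the minimum at epoch 300.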

Data augmentation includes Mixup (alpha=0.8), CutMix (alpha=1.0), RandAugment (magnitude 9, standard deviation 0.5), and Random Erasing (probability 0.25). Regularization techniques include Stochastic Depth (with drop rates varying by model size, from 0.1 for Tiny to 0.5 for Large/XL) and Label Smoothing (0.1). Layer Scale with an initial value of 1e-6 is used in all ConvNeXt blocks.[1]

### ImageNet-22K Pre-training

For larger models, pre-training on the full ImageNet-22K dataset (approximately 14 million images across 21,841 classes) is performed for 90 epochs with a 5-epoch warmup. The pre-trained models are then fine-tuned on ImageNet-1K at 224x224 or 384x384 resolution for 30 epochs.[1]

### ConvNeXt V2 Self-Supervised Pre-training

ConvNeXt V2 models are pre-trained with FCMAE for 300 to 800 epochs (depending on model size) on ImageNet-1K without labels, followed by supervised fine-tuning for 100 epochs. The Huge model uses 800 epochs of FCMAE pre-training. For the best results, intermediate fine-tuning on ImageNet-22K is applied before final ImageNet-1K evaluation.[2]

## Is ConvNeXt open source?

The official ConvNeXt implementation is available in [PyTorch](/wiki/pytorch) through the Facebook Research GitHub repository.[7] ConvNeXt models are also integrated into major deep learning libraries:

- **[Hugging Face](/wiki/hugging_face) Transformers:** Pre-trained ConvNeXt and ConvNeXt V2 models are available through the `ConvNextModel` and `ConvNextV2Model` classes, with support for image classification, feature extraction, and backbone use in detection/segmentation frameworks.
- **timm (PyTorch Image Models):** Ross Wightman's timm library includes ConvNeXt variants with extensive pre-trained weight options.
- **TorchVision:** PyTorch's official vision library provides ConvNeXt-Tiny, Small, Base, and Large as built-in models.
- **Keras/TensorFlow:** The Keras Applications module includes ConvNeXt Tiny through XLarge with ImageNet pre-trained weights.

The official repository was archived in October 2023, with the codebase considered stable and complete.[7][8]

## Why does ConvNeXt matter?

ConvNeXt's primary contribution is not a single novel component but rather the demonstration that a carefully designed pure ConvNet can compete with state-of-the-art Vision Transformers.[1] This finding had several important implications for the field:

**CNNs remain competitive.** Before ConvNeXt, a growing consensus held that Transformers would replace CNNs for most vision tasks. ConvNeXt provided strong evidence that much of Transformers' advantage came from improved training recipes and specific design choices (larger kernels, fewer activations, Layer Normalization) rather than from the self-attention mechanism itself.[1]

**Simplicity and efficiency.** ConvNeXt achieves its results using only standard convolutional operations, which are highly optimized on modern hardware. Unlike Swin Transformer, which requires custom implementations for shifted window attention, ConvNeXt can be built entirely from layers that mainstream deep learning frameworks already provide.[1]

**Design principles transfer across paradigms.** The paper showed that design innovations developed for one architectural paradigm (Transformers) can often be adapted to another (CNNs). This cross-pollination of ideas has inspired subsequent research exploring the boundary between convolutional and attention-based models.

**Influence on subsequent architectures.** ConvNeXt inspired several follow-up works, including hybrid architectures that combine convolutional and attention layers. Models such as ConvFormer and CAFormer (which achieved 85.5% on ImageNet-1K without extra data) built on ConvNeXt's design principles. ConvNeXt has also been widely adopted as a backbone in domain-specific applications, including medical imaging, remote sensing, and autonomous driving.

**Practical deployment.** The pure convolutional design of ConvNeXt makes it particularly well-suited for deployment on edge devices and specialized hardware (such as FPGAs and mobile processors) where convolutional operations are better optimized than attention computations. The smaller ConvNeXt V2 variants (Atto through Nano) specifically target these resource-constrained settings.[2]

## What are the limitations of ConvNeXt?

While ConvNeXt demonstrated that CNNs can match Transformers on standard benchmarks, several limitations are worth noting:

- **Receptive field constraints.** Even with 7x7 kernels, the effective receptive field of a ConvNeXt model grows more slowly with depth compared to the global attention available in standard ViT models.[4] For tasks requiring very long-range dependencies across an image, Transformers may still hold an advantage.
- **Scaling behavior.** At very large scales (billions of parameters), the relative performance of pure ConvNets versus Transformers remains an open research question. Most comparisons have been conducted in the sub-billion parameter range.
- **[Self-supervised learning](/wiki/self-supervised_learning).** The original ConvNeXt V1 was designed primarily for supervised learning. ConvNeXt V2 addressed this gap, but the need for architectural modifications (GRN) to make masked autoencoder pre-training work suggests that Transformers may have a structural advantage for certain self-supervised paradigms.[2]

## See Also

- [Convolutional Neural Networks](/wiki/convolutional_neural_network)
- [Vision Transformer](/wiki/vision_transformer)
- [Swin Transformer](/wiki/swin_transformer)
- [ResNet](/wiki/resnet)
- [DenseNet](/wiki/densenet)
- [EfficientNet](/wiki/efficientnet)
- [MobileNet](/wiki/mobilenet)
- [ImageNet](/wiki/imagenet)

## References

1. Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., & Xie, S. (2022). "A ConvNet for the 2020s." *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)*, 11976-11986. [arXiv:2201.03545](https://arxiv.org/abs/2201.03545)

2. Woo, S., Debnath, S., Hu, R., Chen, X., Liu, Z., Kweon, I. S., & Xie, S. (2023). "ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders." *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)*, 16133-16142. [arXiv:2301.00808](https://arxiv.org/abs/2301.00808)

3. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., & Guo, B. (2021). "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows." *Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)*, 10012-10022. [arXiv:2103.14030](https://arxiv.org/abs/2103.14030)

4. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2021). "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale." *International Conference on Learning Representations (ICLR)*. [arXiv:2010.11929](https://arxiv.org/abs/2010.11929)

5. He, K., Zhang, X., Ren, S., & Sun, J. (2016). "Deep Residual Learning for Image Recognition." *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)*, 770-778. [arXiv:1512.03385](https://arxiv.org/abs/1512.03385)

6. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., & Chen, L.-C. (2018). "MobileNetV2: Inverted Residuals and Linear Bottlenecks." *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)*, 4510-4520. [arXiv:1801.04381](https://arxiv.org/abs/1801.04381)

7. Facebook Research. "ConvNeXt: Code release for ConvNeXt model." GitHub. [https://github.com/facebookresearch/ConvNeXt](https://github.com/facebookresearch/ConvNeXt)

8. Facebook Research. "ConvNeXt-V2: Code release for ConvNeXt V2 model." GitHub. [https://github.com/facebookresearch/ConvNeXt-V2](https://github.com/facebookresearch/ConvNeXt-V2)

9. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., & Jegou, H. (2021). "Training data-efficient image transformers & distillation through attention." *Proceedings of the 38th International Conference on Machine Learning (ICML)*. [arXiv:2012.12877](https://arxiv.org/abs/2012.12877)

10. He, K., Chen, X., Xie, S., Li, Y., Dollar, P., & Girshick, R. (2022). "Masked Autoencoders Are Scalable Vision Learners." *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)*, 16000-16009. [arXiv:2111.06377](https://arxiv.org/abs/2111.06377)