The Vision Transformer (ViT) is a deep learning architecture that applies the transformer model, originally designed for natural language processing, to image recognition tasks. Instead of processing images through convolutional neural networks (CNNs), ViT splits an image into fixed-size patches, treats each patch as a token (analogous to a word in a sentence), and feeds the resulting sequence into a standard transformer encoder. Introduced by Alexey Dosovitskiy and colleagues at Google Brain in October 2020, the approach demonstrated that a pure transformer, applied directly to sequences of image patches, can achieve state-of-the-art results on image classification benchmarks when pre-trained on large datasets [1]. The original paper, titled "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale," became one of the most influential publications in computer vision and sparked a wave of transformer-based architectures across the field.
Before ViT, convolutional neural networks dominated computer vision. Architectures such as AlexNet, VGGNet, ResNet, and EfficientNet relied on learned convolutional filters to extract local features from images in a hierarchical manner. While CNNs proved highly effective, they carried strong inductive biases: locality (each filter operates on a small spatial region) and translation equivariance (the same filter is applied across the entire image). These biases helped CNNs learn efficiently from limited data, but they also constrained the model's ability to capture long-range dependencies across an image without stacking many layers.
Meanwhile, the transformer architecture had revolutionized NLP. Models like BERT and GPT demonstrated that self-attention mechanisms could model complex relationships between tokens in a sequence with remarkable effectiveness. Several researchers explored ways to bring attention mechanisms into vision, including hybrid approaches that combined convolutions with self-attention. However, Dosovitskiy et al. took a more radical approach: they asked whether a pure transformer, with minimal image-specific modifications, could match or exceed CNN performance on image classification.
The answer, it turned out, was yes, provided the model had access to sufficient training data.
The ViT architecture follows a straightforward pipeline that maps an image to a class label through a sequence of well-defined steps.
Given an input image of resolution H x W with C color channels, ViT divides it into a grid of non-overlapping patches, each of size P x P pixels. For a standard 224 x 224 image with a patch size of 16 x 16, this yields 196 patches (14 rows by 14 columns). Each patch is flattened into a one-dimensional vector of length P x P x C. For 16 x 16 RGB patches, each vector has 768 values. These flattened patches are then projected through a trainable linear layer (a single matrix multiplication) into a fixed-dimensional embedding space. The result is a sequence of patch embeddings, each representing one spatial region of the image [1].
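The patchify-and-project step can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the projection matrix `E` here is randomly initialized, whereas in a real ViT it is a trained parameter.

```python
import numpy as np

def patchify_and_embed(image, patch_size=16, embed_dim=768, rng=None):
    """Split an image (H, W, C) into non-overlapping patches and
    linearly project each flattened patch into embed_dim dimensions."""
    rng = rng or np.random.default_rng(0)
    H, W, C = image.shape
    P = patch_size
    # Reshape into a grid of (H/P) x (W/P) patches, each P x P x C.
    patches = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    patches = patches.reshape(-1, P * P * C)   # (196, 768) for a 224x224 RGB image
    E = rng.standard_normal((P * P * C, embed_dim)) * 0.02  # trainable in practice
    return patches @ E                          # (num_patches, embed_dim)

img = np.zeros((224, 224, 3))
emb = patchify_and_embed(img)
print(emb.shape)  # (196, 768)
```

The reshape/transpose trick recovers the 14 x 14 patch grid without any explicit loops, which is also how the operation is typically vectorized in practice.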
Because the transformer architecture is permutation-invariant (it has no built-in notion of spatial order), positional information must be explicitly provided. ViT adds a learnable position embedding to each patch embedding. These position embeddings allow the model to learn where each patch is located relative to others in the original image. The original ViT paper used standard 1D learnable position embeddings and found that they performed comparably to more complex 2D-aware positional encoding schemes [1].
Following the convention established by BERT, ViT prepends a special learnable [CLS] token to the sequence of patch embeddings. This token does not correspond to any image patch. Instead, its representation at the output of the transformer encoder serves as the aggregate image representation used for classification. The final state of the [CLS] token is passed through a classification head (typically a small multilayer perceptron) to produce the predicted class probabilities.
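The two steps above, prepending the [CLS] token and adding position embeddings, amount to a concatenation and an addition. A minimal sketch (all three arrays are random stand-ins for what would be learned parameters and patch embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, embed_dim = 196, 768   # e.g. a 224x224 image with 16x16 patches

patch_emb = rng.standard_normal((num_patches, embed_dim))
cls_token = rng.standard_normal((1, embed_dim))              # learnable in practice
pos_emb = rng.standard_normal((num_patches + 1, embed_dim))  # 1D learnable embeddings

# Prepend [CLS], then add a position embedding to every token.
tokens = np.concatenate([cls_token, patch_emb], axis=0) + pos_emb
print(tokens.shape)  # (197, 768)
```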
The sequence of patch embeddings (plus the [CLS] token and position embeddings) is fed into a standard transformer encoder, identical in design to the one proposed by Vaswani et al. in 2017 [2]. The encoder consists of L identical layers, each containing two sub-layers:

- a multi-head self-attention (MSA) mechanism, which lets every token attend to every other token in the sequence; and
- a position-wise feed-forward network (an MLP with two linear layers and a GELU activation).
Both sub-layers use Layer Normalization (applied before the sub-layer, known as Pre-Norm) and residual connections. The complete forward pass through layer l can be summarized as:

z'_l = MSA(LN(z_{l-1})) + z_{l-1}, for l = 1, ..., L

z_l = MLP(LN(z'_l)) + z'_l, for l = 1, ..., L

y = LN(z_L^0)

where z_L^0 is the output state of the [CLS] token after all L layers and y is the final image representation.
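One pre-norm encoder layer can be sketched in NumPy as follows. For brevity this uses a single attention head and randomly initialized weights; a real ViT layer uses multi-head attention and trained parameters.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def encoder_layer(z, params):
    """One pre-norm transformer layer: z' = MSA(LN(z)) + z; z = MLP(LN(z')) + z'."""
    Wq, Wk, Wv, Wo, W1, W2 = params
    # --- self-attention sub-layer (Pre-Norm + residual) ---
    h = layer_norm(z)
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (tokens, tokens)
    z = z + (attn @ v) @ Wo
    # --- MLP sub-layer (Pre-Norm + residual), tanh approximation of GELU ---
    h = layer_norm(z) @ W1
    h = 0.5 * h * (1 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))
    return z + h @ W2

rng = np.random.default_rng(0)
d, d_mlp, n_tokens = 64, 256, 197
params = [rng.standard_normal((d, d)) * 0.02 for _ in range(4)]
params += [rng.standard_normal((d, d_mlp)) * 0.02,
           rng.standard_normal((d_mlp, d)) * 0.02]
z = rng.standard_normal((n_tokens, d))
out = encoder_layer(z, params)
print(out.shape)  # (197, 64)
```

Stacking L such layers and reading off the [CLS] row of the final output reproduces the pipeline described above.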
During pre-training, the classification head is a small MLP with one hidden layer and a tanh activation. During fine-tuning on a downstream task, it is replaced by a single linear layer.
The original paper defined three model sizes, borrowing naming conventions from BERT. Each variant can be combined with different patch sizes, denoted as ViT-{size}/{patch}, for example ViT-B/16 (Base model with 16x16 patches) or ViT-L/32 (Large model with 32x32 patches). Smaller patch sizes produce longer sequences and higher computational cost, but generally yield better accuracy.
| Model | Layers | Hidden Dim | MLP Dim | Attention Heads | Parameters |
|---|---|---|---|---|---|
| ViT-Base (ViT-B) | 12 | 768 | 3,072 | 12 | ~86M |
| ViT-Large (ViT-L) | 24 | 1,024 | 4,096 | 16 | ~307M |
| ViT-Huge (ViT-H) | 32 | 1,280 | 5,120 | 16 | ~632M |
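The patch-size trade-off noted above can be quantified with a back-of-envelope calculation: halving the patch side roughly quadruples the token count, and the self-attention matrix grows with the square of that.

```python
def seq_stats(image_size=224, patch_size=16):
    """Tokens and attention-matrix entries for a square image and patch size."""
    n = (image_size // patch_size) ** 2 + 1    # patches + the [CLS] token
    return n, n * n                            # tokens, attention entries per head

for p in (16, 32):
    tokens, attn = seq_stats(patch_size=p)
    print(f"patch {p}x{p}: {tokens} tokens, {attn:,} attention entries per head")
# patch 16x16: 197 tokens, 38,809 attention entries per head
# patch 32x32: 50 tokens, 2,500 attention entries per head
```

At the same hidden dimension, ViT-B/16 therefore pays roughly 16x the attention cost of ViT-B/32 per layer.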
Later work scaled ViT even further. Google Research published "Scaling Vision Transformers to 22 Billion Parameters" in 2023, demonstrating that ViT continues to benefit from increased scale with a ViT-22B model [3].
One of the key findings from the original ViT paper is that transformers lack the strong inductive biases of CNNs. Without convolutions enforcing locality and translation equivariance, ViTs need substantially more training data to learn these patterns from scratch. When trained only on ImageNet-1K (approximately 1.3 million images), ViT underperformed comparable ResNet models. However, when pre-trained on larger datasets such as ImageNet-21K (14 million images) or the proprietary JFT-300M dataset (300 million images), ViT surpassed all CNN baselines [1].
This data hunger initially limited ViT's practical appeal. Subsequent research addressed this limitation through improved training strategies, data augmentation, knowledge distillation, and self-supervised pre-training.
The relationship between ViTs and CNNs reveals fundamental trade-offs in model design for visual understanding.
| Aspect | Vision Transformer (ViT) | Convolutional Neural Networks |
|---|---|---|
| Inductive bias | Minimal; learns spatial relationships from data | Strong; built-in locality and translation equivariance |
| Receptive field | Global from the first layer (self-attention) | Local, grows gradually with depth |
| Data efficiency | Requires large-scale pre-training data | Trains effectively on smaller datasets |
| Scalability | Performance scales strongly with more data and compute | Improvements plateau at very large scales |
| Computational cost | Quadratic in sequence length (number of patches) | Linear in image resolution |
| Interpretability | Attention maps provide some spatial interpretability | Feature maps and gradient-based methods |
| Edge deployment | More expensive; requires optimization | Efficient variants widely deployed on mobile |
In practice, the choice between ViTs and CNNs depends heavily on the available data, computational budget, and deployment constraints. For large-scale applications with abundant data, ViTs tend to deliver superior accuracy. For resource-constrained settings or smaller datasets, CNNs and hybrid architectures remain competitive.
The success of ViT inspired a proliferation of transformer-based vision models, each addressing specific limitations or targeting new applications.
| Model | Year | Organization | Key Innovation | ImageNet Top-1 |
|---|---|---|---|---|
| ViT (original) [1] | 2020 | Google Brain | Pure transformer for image classification | 88.55% (ViT-H/14, JFT pre-trained) |
| DeiT [4] | 2021 | Meta AI (Facebook) | Knowledge distillation; data-efficient training on ImageNet only | 85.2% (with distillation) |
| Swin Transformer [5] | 2021 | Microsoft Research | Hierarchical features; shifted window attention | 87.3% (Swin-L, ImageNet-22K pre-trained) |
| BEiT [6] | 2021 | Microsoft Research | BERT-style masked image modeling pre-training | 86.3% (BEiT-L) |
| MAE [7] | 2021 | Meta AI | Masked autoencoder; reconstructs 75% masked patches | 87.8% (ViT-H) |
| DINO [8] | 2021 | Meta AI | Self-distillation with no labels | 80.1% (linear eval, ViT-B) |
| DINOv2 [9] | 2023 | Meta AI | Scaled self-supervised training on 142M images | Strong on diverse benchmarks |
| EVA [10] | 2022 | BAAI | Masked image modeling with CLIP features | 89.6% (EVA, 336px) |
| EVA-02 [10] | 2023 | BAAI | Improved architecture; language-aligned features | 90.0% (304M params) |
| DINOv3 | 2025 | Meta AI | 7B parameters; image-text alignment; Gram anchoring | +6 mIoU over DINOv2 on ADE20K |
DeiT (Data-efficient Image Transformers), introduced by Hugo Touvron and colleagues at Meta AI in January 2021, demonstrated that ViTs could be trained competitively using only ImageNet-1K, without requiring massive external datasets [4]. The key contribution was a knowledge distillation approach where a strong CNN teacher (typically a RegNet) guided the transformer student's learning. DeiT introduced a special distillation token alongside the [CLS] token, which learned to mimic the teacher's output.
A DeiT-Base model achieved 83.1% top-1 accuracy on ImageNet without external data, and with distillation reached 85.2%. Critically, training could be completed on a single 8-GPU machine in under three days, making ViT research accessible to a much broader community.
The Swin Transformer, proposed by Ze Liu and colleagues at Microsoft Research in March 2021, addressed two major limitations of the original ViT: its single-resolution feature map and the quadratic computational complexity of global self-attention [5].
Instead of computing self-attention across all patches globally, Swin Transformer partitions the image into non-overlapping local windows and computes self-attention within each window. This reduces computational complexity from quadratic to linear with respect to image size. To enable cross-window information flow, the window partitions are shifted by half the window size in alternating layers. This simple yet effective "shifted window" strategy allows each patch to attend to patches from neighboring windows across successive layers.
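The window partition and the shift can both be expressed as array manipulations. The sketch below (toy sizes, hypothetical function name) uses a reshape for partitioning and a cyclic roll for the shift, which is also how the official implementation realizes shifted windows efficiently.

```python
import numpy as np

def window_partition(x, win):
    """Split a (H, W, C) feature map into non-overlapping (win, win) windows."""
    H, W, C = x.shape
    x = x.reshape(H // win, win, W // win, win, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, win * win, C)   # (num_windows, tokens_per_window, C)

# Toy 8x8 feature map, window size 4: attention runs inside each of the
# four 4x4 windows instead of across all 64 positions at once.
feat = np.arange(8 * 8).reshape(8, 8, 1).astype(float)
windows = window_partition(feat, win=4)
print(windows.shape)  # (4, 16, 1)

# Next layer: cyclically roll the map by half the window size so the new
# partition straddles the previous window boundaries.
shifted = np.roll(feat, shift=(-2, -2), axis=(0, 1))
shifted_windows = window_partition(shifted, win=4)
print(shifted_windows.shape)  # (4, 16, 1)
```

In the real model, an attention mask prevents tokens that were wrapped around by the roll from attending to spatially distant tokens; that detail is omitted here.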
Swin Transformer produces multi-scale feature maps by merging patches at each stage, similar to how CNNs downsample spatial resolution through pooling layers. Starting from small patches (typically 4x4 pixels), the model progressively merges neighboring patches at each hierarchical stage, producing feature maps at resolutions of 1/4, 1/8, 1/16, and 1/32 of the input. This hierarchical design makes Swin Transformer suitable as a general-purpose backbone for dense prediction tasks such as object detection and semantic segmentation, where multi-scale features are essential.
Swin Transformer achieved 87.3% top-1 accuracy on ImageNet with ImageNet-22K pre-training and set new records on COCO object detection and ADE20K segmentation at the time of publication. Its successor, Swin Transformer V2, further improved training stability and scaled to larger image resolutions using techniques like log-spaced continuous relative position bias and residual post-normalization.
Self-supervised learning has become one of the most important paradigms for training vision transformers, reducing or eliminating the need for labeled data during pre-training.
Masked Autoencoders (MAE), proposed by Kaiming He and colleagues at Meta AI in November 2021, adapted the masked language modeling concept from BERT to the visual domain [7]. The approach is elegant in its simplicity: randomly mask a large proportion (75%) of image patches and train the model to reconstruct the missing pixels.
MAE uses an asymmetric encoder-decoder design. The encoder operates only on the visible (unmasked) patches, which dramatically reduces computation during training. A lightweight decoder then takes the encoded visible patches along with mask tokens and reconstructs the full image. The high masking ratio is crucial; it creates a challenging task that forces the model to learn rich semantic representations rather than relying on simple interpolation from nearby patches.
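The masking step itself is just a random permutation. A minimal sketch of selecting the 25% of patch tokens that actually enter the encoder (function name and return convention are illustrative, not MAE's exact API):

```python
import numpy as np

def random_mask(tokens, mask_ratio=0.75, rng=None):
    """Keep a random (1 - mask_ratio) fraction of patch tokens; return the
    visible tokens plus the index sets the decoder needs to restore order."""
    rng = rng or np.random.default_rng(0)
    n = tokens.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    perm = rng.permutation(n)
    keep_idx = np.sort(perm[:n_keep])   # visible patches, in original order
    mask_idx = np.sort(perm[n_keep:])   # masked patches, filled with mask tokens
    return tokens[keep_idx], keep_idx, mask_idx

patch_tokens = np.random.default_rng(1).standard_normal((196, 768))
visible, keep_idx, mask_idx = random_mask(patch_tokens)
print(visible.shape)  # (49, 768): only 25% of the patches enter the encoder
```

Since encoder cost is quadratic in sequence length, shrinking the input from 196 to 49 tokens is where most of MAE's training speedup comes from.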
Pre-training with MAE followed by fine-tuning yielded 87.8% top-1 accuracy on ImageNet using a ViT-Huge model. The approach also accelerated training by 3x or more compared to methods that process all patches, since 75% of the patches are excluded from the encoder.
DINO (self-DIstillation with NO labels), introduced by Mathilde Caron and colleagues at Meta AI in April 2021, demonstrated remarkable emergent properties in self-supervised vision transformers [8]. DINO trains a student network and a teacher network with identical architectures. The student learns by matching the output distribution of the teacher, while the teacher's weights are updated as an exponential moving average of the student's weights.
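The teacher update is a one-line exponential moving average. A sketch with toy parameters (the momentum value 0.996 is in the range used by DINO, which schedules it toward 1 during training):

```python
import numpy as np

def ema_update(teacher_params, student_params, momentum=0.996):
    """DINO-style teacher update: an exponential moving average of the
    student's weights. No gradients ever flow into the teacher."""
    return [momentum * t + (1 - momentum) * s
            for t, s in zip(teacher_params, student_params)]

teacher = [np.ones((2, 2))]
student = [np.zeros((2, 2))]
teacher = ema_update(teacher, student)
print(teacher[0][0, 0])  # 0.996
```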
A striking discovery was that self-supervised ViT features trained with DINO contain explicit information about semantic segmentation, even though the model was never trained with segmentation labels or objectives. Attention maps from DINO-trained ViTs clearly delineate object boundaries and distinguish foreground from background.
DINOv2, released by Meta AI in 2023, scaled this approach to 142 million curated images and produced general-purpose visual features that transferred strongly across a wide range of tasks without fine-tuning [9]. DINOv2 models outperformed OpenCLIP and other foundation models on linear evaluation across classification, segmentation, and depth estimation benchmarks.
In August 2025, Meta released DINOv3 with 7 billion parameters, trained on 1.7 billion images. DINOv3 introduced image-text alignment (similar to CLIP), Gram anchoring for teacher-student self-distillation, and axial RoPE (Rotary Positional Embeddings). It outperformed DINOv2 by +6 mIoU on ADE20K semantic segmentation, demonstrating particularly strong performance on dense prediction tasks.
Vision transformers have expanded far beyond image classification, becoming foundational components across virtually all areas of computer vision.
DEtection TRansformer (DETR), introduced by Nicolas Carion and colleagues at Meta AI in 2020, reimagined object detection as a direct set prediction problem [11]. DETR uses a CNN backbone to extract features, then passes them through a transformer encoder-decoder architecture. The decoder outputs a fixed set of predictions in parallel, eliminating the need for hand-designed components like anchor boxes, non-maximum suppression, and region proposal networks that were central to earlier detectors like Faster R-CNN.
While the original DETR was slower to converge than traditional detectors, subsequent variants (Deformable DETR, DINO-DETR, Co-DETR, RT-DETR) addressed convergence speed and achieved state-of-the-art detection results. The DETR paradigm fundamentally simplified the object detection pipeline.
The Segment Anything Model (SAM), released by Meta AI in 2023, demonstrated the power of vision transformers for interactive and promptable segmentation. SAM uses a ViT-based image encoder to produce image embeddings, which are then decoded into segmentation masks based on user prompts (points, bounding boxes, or text). Trained on over 1 billion masks from 11 million images, SAM could segment virtually any object in any image in a zero-shot manner.
SAM 2 extended the approach to video, enabling consistent object tracking and segmentation across frames. SAM 3, announced in 2025, introduced concept-based segmentation, where a single text prompt or image exemplar can find and segment every instance of a visual concept across images and videos.
CLIP (Contrastive Language-Image Pre-training), introduced by OpenAI in January 2021, trained a vision transformer (or CNN) jointly with a text encoder using contrastive learning on 400 million image-text pairs from the internet [12]. By learning to align visual and textual representations in a shared embedding space, CLIP enabled zero-shot image classification: the model could classify images into categories it had never explicitly been trained on, simply by comparing image embeddings with text embeddings of category descriptions.
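The zero-shot classification step reduces to a cosine-similarity lookup in the shared embedding space. The sketch below uses tiny dummy vectors standing in for the outputs of CLIP's image and text encoders:

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, labels):
    """Pick the label whose text embedding is most cosine-similar to the image."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = txt @ img                      # cosine similarity per label
    return labels[int(np.argmax(sims))]

# Dummy embeddings standing in for real encoder outputs.
labels = ["a photo of a dog", "a photo of a cat"]
text_embs = np.array([[1.0, 0.0], [0.0, 1.0]])
image_emb = np.array([0.2, 0.9])          # closer to the "cat" direction
print(zero_shot_classify(image_emb, text_embs, labels))  # a photo of a cat
```

Because the label set exists only as text prompts, new categories can be added at inference time without retraining, which is what makes the classification "zero-shot".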
CLIP's vision encoder (often a ViT-L/14) has become one of the most widely used visual backbones in the field. It serves as the visual component in multimodal models such as LLaVA, GPT-4V, and numerous other vision-language models.
Vision transformers have also found roles in generative models. The Diffusion Transformer (DiT), introduced in 2023, replaced the U-Net backbone commonly used in diffusion models with a transformer architecture, demonstrating improved scaling behavior. DiT forms the backbone of several leading image and video generation systems, including Sora by OpenAI and Stable Diffusion 3 by Stability AI.
The following table summarizes representative results on the ImageNet-1K benchmark across CNN and ViT-based architectures. Pre-training datasets and model sizes vary, so direct comparisons should be interpreted with context.
| Model | Type | Parameters | Pre-training Data | ImageNet Top-1 (%) |
|---|---|---|---|---|
| ResNet-50 | CNN | 25M | ImageNet-1K | 76.1 |
| ResNet-152 | CNN | 60M | ImageNet-1K | 78.3 |
| EfficientNet-B7 | CNN | 66M | ImageNet-1K | 84.3 |
| EfficientNetV2-L | CNN | 120M | ImageNet-21K | 85.7 |
| ViT-B/16 | ViT | 86M | ImageNet-1K | 77.9 |
| ViT-B/16 | ViT | 86M | ImageNet-21K | 84.0 |
| ViT-L/16 | ViT | 307M | ImageNet-21K | 85.3 |
| ViT-H/14 | ViT | 632M | JFT-300M | 88.55 |
| DeiT-B (distilled) | ViT | 86M | ImageNet-1K | 85.2 |
| Swin-L | ViT (hierarchical) | 197M | ImageNet-22K | 87.3 |
| BEiT-L | ViT | 307M | ImageNet-21K (self-supervised) | 86.3 |
| MAE (ViT-H) | ViT | 632M | ImageNet-1K (self-supervised) | 87.8 |
| EVA | ViT | 1.0B | Merged (MIM + CLIP) | 89.6 |
| EVA-02 | ViT | 304M | ImageNet-22K (MIM) | 90.0 |
| CoCa | ViT + Text | 2.1B | Multimodal | 91.0 |
A clear trend emerges from these results. ViTs trained only on ImageNet-1K lag behind well-optimized CNNs of similar size. But with larger pre-training datasets or self-supervised objectives, ViTs consistently outperform CNNs. The best-performing models in 2025 are either pure ViTs or multimodal systems with ViT visual encoders.
The introduction of ViT triggered a paradigm shift in computer vision research. Several developments can be traced directly to its influence.
First, ViT demonstrated that domain-specific architectural inductive biases (like convolutions) are not strictly necessary for strong visual understanding. Given enough data and compute, a general-purpose architecture can learn the relevant patterns. This insight aligned computer vision with the broader trend in AI toward scaling general architectures rather than engineering task-specific ones.
Second, ViT unified the architectural foundations of vision and language. Because both modalities now use transformer encoders, building multimodal systems became significantly more straightforward. Models like CLIP, Flamingo, and GPT-4V leverage shared transformer components for both visual and textual processing, enabling capabilities that would have been difficult to achieve with separate CNN and RNN pipelines.
Third, ViT accelerated the adoption of self-supervised pre-training in vision. Techniques like MAE, DINO, and BEiT drew direct inspiration from masked language modeling in NLP, and these methods proved highly effective precisely because the transformer architecture is shared across domains. Self-supervised ViT features now serve as general-purpose visual representations across dozens of downstream tasks.
Fourth, the success of ViT contributed to the rise of foundation models in vision. Rather than training specialized models for each task, the field moved toward training large, general-purpose visual encoders once and then adapting them to specific tasks through fine-tuning, linear probing, or prompting.
As of early 2026, vision transformers have firmly established themselves as the dominant architecture in computer vision research and are increasingly deployed in production systems.
The strict dichotomy between CNNs and transformers has given way to a more nuanced landscape. Hybrid architectures like ConvNeXt V2 incorporate design principles from both paradigms. Many modern "vision transformers" include convolutional elements in their patch embedding layers or use depthwise convolutions in their feed-forward networks. Conversely, recent CNN designs borrow attention mechanisms and training recipes from the ViT literature.
A major area of active research involves making ViTs practical for deployment on edge devices and in latency-sensitive applications. Token pruning and routing techniques allow models to dynamically allocate computation only to informative image regions, reducing inference time by up to 50% while maintaining accuracy. Quantization, distillation, and architectural simplifications have produced compact ViT variants suitable for mobile deployment.
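The core idea behind many token-pruning schemes can be sketched as a top-k selection over a per-token saliency score, here taken to be the attention the [CLS] token pays to each patch. This is a simplified illustration, not any specific published method:

```python
import numpy as np

def prune_tokens(tokens, cls_attention, keep_ratio=0.5):
    """Drop the least-attended patch tokens, always keeping [CLS] (index 0).
    cls_attention[i] scores how much [CLS] attends to patch token i."""
    n_patches = tokens.shape[0] - 1
    n_keep = int(n_patches * keep_ratio)
    top = np.argsort(cls_attention)[::-1][:n_keep]   # most salient patches
    keep = np.concatenate([[0], 1 + np.sort(top)])   # [CLS] + kept patches
    return tokens[keep]

rng = np.random.default_rng(0)
tokens = rng.standard_normal((197, 768))             # [CLS] + 196 patch tokens
scores = rng.random(196)
pruned = prune_tokens(tokens, scores)
print(pruned.shape)  # (99, 768): 1 [CLS] + 98 kept patches
```

Halving the token count in later layers roughly quarters their attention cost, which is where the reported inference-time savings come from.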
The largest vision transformers now serve as universal visual backbones. DINOv2 and DINOv3 produce features that transfer effectively to classification, segmentation, depth estimation, and other tasks without any fine-tuning. EVA-02 achieves 90.0% ImageNet accuracy with only 304 million parameters. These models, along with multimodal systems like CLIP and SigLIP, have become standard building blocks in the AI stack.
Vision transformers have expanded beyond 2D images into video understanding, 3D point cloud processing, medical imaging, satellite imagery analysis, and autonomous driving perception. The flexibility of the patch-based tokenization scheme allows ViTs to process diverse spatial data formats with minimal architectural changes. Video models treat temporal frames as additional tokens, while 3D models tokenize voxels or point cloud patches.
Despite their success, several challenges remain. The quadratic complexity of self-attention with respect to sequence length limits the resolution at which ViTs can efficiently process images. Training large ViTs from scratch still requires substantial computational resources. And while ViTs excel at capturing global patterns, they can struggle with fine-grained local details compared to CNNs, particularly at lower data scales.
Research continues to address these limitations through linear attention approximations, mixture-of-experts architectures, and improved training methodologies. The ViT-5 initiative proposed in 2026 explores next-generation activation functions, relative positional encodings, and improved attention normalization for the next era of vision transformers.