The Vision Transformer (ViT) is a deep learning architecture that applies the transformer model, originally designed for natural language processing, to image recognition tasks. Instead of processing images through convolutional neural networks (CNNs), ViT splits an image into fixed-size patches, treats each patch as a token (analogous to a word in a sentence), and feeds the resulting sequence into a standard transformer encoder. Introduced by Alexey Dosovitskiy and colleagues at Google Brain in October 2020, the approach demonstrated that a pure transformer, applied directly to sequences of image patches, can achieve state-of-the-art results on image classification benchmarks when pre-trained on large datasets [1]. The original paper, titled "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale," became one of the most influential publications in computer vision and sparked a wave of transformer-based architectures across the field.
The paper appeared at ICLR 2021 and is now the dominant backbone behind nearly every modern vision foundation model, multimodal large language model, image generator, and segmentation system. By 2025-2026, ViT and its derivatives (DeiT, Swin, MAE, DINOv2, DINOv3, EVA, CLIP, SigLIP, SAM) had effectively replaced CNNs as the default choice for new computer vision research at scale.
Before ViT, convolutional neural networks dominated computer vision. Architectures such as AlexNet, VGGNet, ResNet, and EfficientNet relied on learned convolutional filters to extract local features from images in a hierarchical manner. While CNNs proved highly effective, they carried strong inductive biases: locality (each filter operates on a small spatial region) and translation equivariance (the same filter is applied across the entire image). These biases helped CNNs learn efficiently from limited data, but they also constrained the model's ability to capture long-range dependencies across an image without stacking many layers.
Meanwhile, the transformer architecture had revolutionized NLP. Models like BERT and GPT demonstrated that self-attention mechanisms could model complex relationships between tokens in a sequence with remarkable effectiveness. Several researchers explored ways to bring attention mechanisms into vision, including hybrid approaches that combined convolutions with self-attention. However, Dosovitskiy et al. took a more radical approach: they asked whether a pure transformer, with minimal image-specific modifications, could match or exceed CNN performance on image classification.
The answer, it turned out, was yes, provided the model had access to sufficient training data. The central experimental finding was that ViT underperformed comparable ResNets when pre-trained on ImageNet-1K alone, roughly matched them on ImageNet-21K, and clearly surpassed them when pre-trained on the proprietary JFT-300M dataset. The takeaway was that scale, not architecture-specific inductive bias, was the dominant factor at the high end.
The ViT architecture follows a straightforward pipeline that maps an image to a class label through a sequence of well-defined steps.
Given an input image of resolution H x W with C color channels, ViT divides it into a grid of non-overlapping patches, each of size P x P pixels. For a standard 224 x 224 image with a patch size of 16 x 16, this yields 196 patches (14 rows by 14 columns). Each patch is flattened into a one-dimensional vector of length P x P x C. For 16 x 16 RGB patches, each vector has 768 values. These flattened patches are then projected through a trainable linear layer (a single matrix multiplication) into a fixed-dimensional embedding space. The result is a sequence of patch embeddings, each representing one spatial region of the image [1].
In practice, this linear projection is implemented as a 2D convolution with kernel size P and stride P, which is mathematically equivalent to slicing patches and applying a dense layer but is more efficient on GPU. This convolutional view also explains why some authors describe ViT as "a transformer applied to a single convolutional layer."
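The sketch below illustrates this equivalence, assuming PyTorch; shapes follow the ViT-B/16 numbers used above (224 x 224 RGB input, 16 x 16 patches, 768-dimensional embeddings), and the variable names are illustrative rather than taken from any particular library.

```python
# Minimal sketch of ViT patch embedding: explicit patch slicing + a linear
# projection, and the equivalent strided Conv2d formulation described above.
import torch
import torch.nn as nn

B, C, H, W = 1, 3, 224, 224      # batch, channels, height, width
P, D = 16, 768                   # patch size, embedding dimension
N = (H // P) * (W // P)          # 196 patches for a 224x224 image

x = torch.randn(B, C, H, W)

# View 1: slice non-overlapping patches, flatten, apply a dense projection.
linear = nn.Linear(P * P * C, D)
patches = x.unfold(2, P, P).unfold(3, P, P)            # (B, C, 14, 14, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, N, C * P * P)
tokens_linear = linear(patches)                        # (B, 196, 768)

# View 2: the same projection expressed as a convolution with kernel = stride = P.
conv = nn.Conv2d(C, D, kernel_size=P, stride=P)
with torch.no_grad():                                  # share weights so outputs match
    conv.weight.copy_(linear.weight.view(D, C, P, P))
    conv.bias.copy_(linear.bias)
tokens_conv = conv(x).flatten(2).transpose(1, 2)       # (B, 196, 768)

print(torch.allclose(tokens_linear, tokens_conv, atol=1e-4))  # should print True
```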
Because the transformer architecture is permutation-invariant (it has no built-in notion of spatial order), positional information must be explicitly provided. ViT adds a learnable position embedding to each patch embedding. These position embeddings allow the model to learn where each patch is located relative to others in the original image. The original ViT paper used standard 1D learnable position embeddings and found that they performed comparably to more complex 2D-aware positional encoding schemes [1].
A practical consequence of using learnable position embeddings is that the embedding table is tied to a fixed sequence length, and therefore to a fixed image resolution at the chosen patch size. To fine-tune at a higher resolution, the standard recipe is to perform 2D interpolation of the pre-trained position embedding grid (typically bicubic interpolation) so that it matches the new sequence length. This trick is essential to the standard ViT recipe and is now built into most vision transformer libraries. Later work explored alternatives such as relative position bias (Swin, BEiT), conditional positional encoding (CPVT), and rotary position embedding for vision (RoPE-ViT), each aimed at making position handling more resolution-flexible.
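A minimal sketch of this interpolation trick, assuming PyTorch; the function name and shapes are illustrative (a ViT-B/16 going from 224 px to 512 px input), not a specific library's API.

```python
# Sketch of 2D bicubic interpolation of learnable position embeddings for
# fine-tuning at a higher resolution; the [CLS] position entry is kept as-is.
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed, old_grid=14, new_grid=32):
    """pos_embed: (1, 1 + old_grid**2, D), with the [CLS] entry first."""
    cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    D = patch_pos.shape[-1]
    # Reshape the flat token sequence back into its 2D grid, interpolate, re-flatten.
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, D).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode="bicubic", align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, D)
    return torch.cat([cls_pos, patch_pos], dim=1)

# 224 px / 16 px patches = 14x14 grid  ->  512 px / 16 px patches = 32x32 grid
pos = torch.randn(1, 1 + 14 * 14, 768)
print(interpolate_pos_embed(pos).shape)   # torch.Size([1, 1025, 768])
```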
Following the convention established by BERT, ViT prepends a special learnable [CLS] token to the sequence of patch embeddings. This token does not correspond to any image patch. Instead, its representation at the output of the transformer encoder serves as the aggregate image representation used for classification. The final state of the [CLS] token is passed through a classification head (typically a small multilayer perceptron) to produce the predicted class probabilities. Subsequent work showed that simple global average pooling over patch tokens can replace the [CLS] token with negligible performance change, and many modern variants (DINOv2, EVA-02, MAE for fine-tuning) use mean pooling either alongside or instead of the [CLS] token.
The sequence of patch embeddings (plus the [CLS] token and position embeddings) is fed into a standard transformer encoder, identical in design to the one proposed by Vaswani et al. in 2017 [2]. The encoder consists of L identical layers, each containing two sub-layers:

1. A multi-head self-attention (MSA) block, in which every token attends to every other token.
2. A position-wise MLP with two linear layers and a GELU activation in between.

Both sub-layers use Layer Normalization (applied before the sub-layer, known as Pre-Norm) and residual connections. The complete forward pass can be summarized as:

z_0 = [x_cls; x_p^1 E; x_p^2 E; ...; x_p^N E] + E_pos
z'_l = MSA(LN(z_{l-1})) + z_{l-1},   for l = 1...L
z_l = MLP(LN(z'_l)) + z'_l,          for l = 1...L
y = LN(z_L^0)

where E is the patch projection, E_pos the position embeddings, and z_L^0 is the output state of the [CLS] token after all L layers.
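A compact sketch of one such Pre-Norm encoder block, assuming PyTorch and using nn.MultiheadAttention as a stand-in for the MSA sub-layer; dimensions match ViT-B/16 and the class name is illustrative.

```python
# Minimal Pre-Norm ViT encoder block corresponding to the equations above
# (a sketch, not any particular library's implementation).
import torch
import torch.nn as nn

class ViTBlock(nn.Module):
    def __init__(self, dim=768, heads=12, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, z):
        # z'_l = MSA(LN(z_{l-1})) + z_{l-1}
        h = self.norm1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]
        # z_l = MLP(LN(z'_l)) + z'_l
        return z + self.mlp(self.norm2(z))

tokens = torch.randn(2, 197, 768)        # [CLS] + 196 patch tokens (ViT-B/16)
print(ViTBlock()(tokens).shape)          # torch.Size([2, 197, 768])
```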
During pre-training, the classification head is a small MLP with one hidden layer and a tanh activation. During fine-tuning on a downstream task, it is replaced by a single linear layer applied to the [CLS] token (or to mean-pooled patch features). This minimal head means almost all of the model's capacity sits in the transformer trunk, which is what makes the same backbone usable for classification, retrieval, segmentation, and generation by simply swapping the head.
For an image of N patches, the dominant cost in each transformer layer is the self-attention operation, which is O(N^2 * D) in time and O(N^2) in memory for the attention matrix. The MLP block is O(N * D^2) and is typically the larger term until the sequence becomes very long. This quadratic dependence on N is the central scaling problem of vanilla ViT: doubling the input resolution at fixed patch size quadruples the number of patches and so multiplies the attention cost by sixteen. Most follow-up work, including Swin Transformer, MViT, and the various linear-attention vision transformers, exists primarily to break this quadratic wall.
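A back-of-the-envelope calculation makes the scaling concrete; the constants below are rough multiply-accumulate counts for illustration only, not exact FLOP figures.

```python
# Rough per-layer cost scaling with sequence length N (constants illustrative):
# attention is O(N^2 * D), the 4x-expansion MLP is roughly 8 * D^2 MACs per token.
def per_layer_cost(n_tokens, dim=768):
    attention = n_tokens ** 2 * dim
    mlp = n_tokens * 8 * dim ** 2
    return attention, mlp

for res in (224, 448):
    n = (res // 16) ** 2                 # number of patches at patch size 16
    attn, mlp = per_layer_cost(n)
    print(res, n, f"attention={attn:.2e}", f"mlp={mlp:.2e}")
# Doubling resolution: N goes 196 -> 784 (4x), attention grows 16x, the MLP 4x.
```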
The original paper defined three model sizes, borrowing naming conventions from BERT. Each variant can be combined with different patch sizes, denoted as ViT-{size}/{patch}, for example ViT-B/16 (Base model with 16x16 patches) or ViT-L/32 (Large model with 32x32 patches). Smaller patch sizes produce longer sequences and higher computational cost, but generally yield better accuracy.
| Model | Layers | Hidden Dim | MLP Dim | Attention Heads | Parameters |
|---|---|---|---|---|---|
| ViT-Base (ViT-B) | 12 | 768 | 3,072 | 12 | ~86M |
| ViT-Large (ViT-L) | 24 | 1,024 | 4,096 | 16 | ~307M |
| ViT-Huge (ViT-H) | 32 | 1,280 | 5,120 | 16 | ~632M |
The community later extended this naming with smaller and larger sizes:
| Variant | Layers | Hidden Dim | Heads | Parameters | Notes |
|---|---|---|---|---|---|
| ViT-Tiny (ViT-Ti) | 12 | 192 | 3 | ~5.7M | Introduced in DeiT for mobile and ablation |
| ViT-Small (ViT-S) | 12 | 384 | 6 | ~22M | DeiT and DINO standard small variant |
| ViT-Base (ViT-B) | 12 | 768 | 12 | ~86M | Original Dosovitskiy et al. base size |
| ViT-Large (ViT-L) | 24 | 1,024 | 16 | ~307M | Standard size for CLIP, MAE, BEiT |
| ViT-Huge (ViT-H) | 32 | 1,280 | 16 | ~632M | Standard size for SAM, MAE, EVA |
| ViT-g (small g) | 40 | 1,408 | 16 | ~1.1B | Used by DINOv2 and EVA |
| ViT-G (capital G) | 48 | 1,664 | 16 | ~1.8B | Used by Google scaling work |
| ViT-e | 56 | 1,792 | 16 | ~4B | Pre-ViT-22B Google scaling step |
| ViT-22B | 48 | 6,144 | 48 | 22B | Dehghani et al. 2023 [3] |
| SwinV2-G | (hierarchical) | 512 (stage 1) | (varied) | 3B | Liu et al. 2022 [13] |
| DINOv3 7B | 40 | 4,096 | 32 | 7B | Meta AI 2025 self-supervised |
Later work scaled ViT even further. Google Research published "Scaling Vision Transformers to 22 Billion Parameters" in 2023, demonstrating that ViT continues to benefit from increased scale with a ViT-22B model [3]. ViT-22B applies three changes that proved crucial for stable training at this scale: parallel attention and MLP blocks (both run in parallel from the same input rather than sequentially), QK-LayerNorm (applying LayerNorm to the queries and keys before the dot product to prevent the attention logit explosion seen near the 8B-parameter regime), and removing biases in the QKV projections and LayerNorms (which improved hardware utilization by about 3%). It was trained on a JFT extension of roughly four billion semi-automatically labeled images, using 14x14 pixel patches on 224x224 images (256 visual tokens per image), with a model FLOPs utilization of 54.9% on TPU v4.
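The sketch below illustrates two of these stability changes (the parallel attention/MLP formulation and QK-LayerNorm) in PyTorch; the dimensions, class name, and shared input LayerNorm are illustrative assumptions, not the paper's actual code.

```python
# Schematic sketch of a parallel attention + MLP block with QK-LayerNorm and
# no QKV biases, as described for ViT-22B above (shapes are illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParallelBlock(nn.Module):
    def __init__(self, dim=1024, heads=16, mlp_ratio=4.0):
        super().__init__()
        self.heads, self.head_dim = heads, dim // heads
        self.norm = nn.LayerNorm(dim)
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)    # no QKV biases
        self.q_norm = nn.LayerNorm(self.head_dim)          # QK-LayerNorm
        self.k_norm = nn.LayerNorm(self.head_dim)
        self.attn_out = nn.Linear(dim, dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)), nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, z):
        B, N, D = z.shape
        h = self.norm(z)
        # Attention branch with LayerNorm applied to queries and keys.
        q, k, v = self.qkv(h).chunk(3, dim=-1)
        q = self.q_norm(q.view(B, N, self.heads, self.head_dim)).transpose(1, 2)
        k = self.k_norm(k.view(B, N, self.heads, self.head_dim)).transpose(1, 2)
        v = v.view(B, N, self.heads, self.head_dim).transpose(1, 2)
        attn = F.scaled_dot_product_attention(q, k, v)
        attn = self.attn_out(attn.transpose(1, 2).reshape(B, N, D))
        # Parallel formulation: attention and MLP read the same normalized input.
        return z + attn + self.mlp(h)

print(ParallelBlock()(torch.randn(2, 256, 1024)).shape)   # torch.Size([2, 256, 1024])
```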
One of the key findings from the original ViT paper is that transformers lack the strong inductive biases of CNNs. Without convolutions enforcing locality and translation equivariance, ViTs need substantially more training data to learn these patterns from scratch. When trained only on ImageNet-1K (approximately 1.3 million images), ViT underperformed comparable ResNet models. However, when pre-trained on larger datasets such as ImageNet-21K (14 million images) or the proprietary JFT-300M dataset (300 million images), ViT surpassed all CNN baselines [1].
This data hunger initially limited ViT's practical appeal. Subsequent research addressed this limitation through three families of techniques:

1. Improved training recipes: strong data augmentation, regularization, and knowledge distillation, exemplified by DeiT (discussed below).
2. Self-supervised pre-training objectives such as masked image modeling and self-distillation (BEiT, MAE, DINO), which exploit large amounts of unlabeled data.
3. Architectural changes that reintroduce some convolutional or hierarchical inductive bias, such as hybrid convolutional stems and windowed attention (Swin).
The combination of these techniques means that a modern ViT can be trained to competitive accuracy on ImageNet-1K alone, and self-supervised pre-training on large unlabeled corpora regularly produces backbones that outperform JFT-pretrained ViTs on downstream tasks.
The relationship between ViTs and CNNs reveals fundamental trade-offs in model design for visual understanding.
| Aspect | Vision Transformer (ViT) | Convolutional Neural Networks |
|---|---|---|
| Inductive bias | Minimal; learns spatial relationships from data | Strong; built-in locality and translation equivariance |
| Receptive field | Global from the first layer (self-attention) | Local, grows gradually with depth |
| Data efficiency | Requires large-scale pre-training data | Trains effectively on smaller datasets |
| Scalability | Performance scales strongly with more data and compute | Improvements plateau at very large scales |
| Computational cost | Quadratic in sequence length (number of patches) | Linear in image resolution |
| Memory | Attention matrix is O(N^2); high resolution is expensive | Convolutions are local; memory grows linearly |
| Robustness | More shape-biased; often stronger robustness to adversarial and out-of-distribution inputs in published studies | Often biased toward texture cues |
| Interpretability | Attention maps provide some spatial interpretability | Feature maps and gradient-based methods (Grad-CAM, etc.) |
| Edge deployment | More expensive; requires optimization | Efficient variants widely deployed on mobile |
In practice, the choice between ViTs and CNNs depends heavily on the available data, computational budget, and deployment constraints. For large-scale applications with abundant data, ViTs tend to deliver superior accuracy. For resource-constrained settings or smaller datasets, CNNs and hybrid architectures remain competitive.
A related line of work argued that the modern ViT recipe (heavy augmentation, AdamW with weight decay, long schedules, patch-based input) is at least as important as the architecture itself. ConvNeXt (Liu et al. 2022) modernized a ResNet with techniques borrowed from ViT and Swin and matched their accuracy with a pure-convolutional model, suggesting that the ViT vs CNN gap is partly a recipe gap rather than a pure architectural one.
The success of ViT inspired a proliferation of transformer-based vision models, each addressing specific limitations or targeting new applications.
| Model | Year | Organization | Key Innovation | ImageNet Top-1 |
|---|---|---|---|---|
| ViT (original) [1] | 2020 | Google Brain | Pure transformer for image classification | 88.55% (ViT-H/14, JFT pre-trained) |
| DeiT [4] | 2021 | Meta AI (Facebook) | Knowledge distillation; data-efficient training on ImageNet only | 85.2% (with distillation) |
| Swin Transformer [5] | 2021 | Microsoft Research | Hierarchical features; shifted window attention | 87.3% (Swin-L, ImageNet-22K pre-trained) |
| Swin Transformer V2 [13] | 2022 | Microsoft Research | Residual-post-norm, log-spaced position bias, SwinV2-G at 3B params | 84.0% (SwinV2-G, ImageNet-V2) |
| PVT (Pyramid ViT) | 2021 | Nanjing Univ. + others | Pyramid feature maps; spatial-reduction attention | 81.7% (PVT-Large) |
| MViT (Multi-scale ViT) | 2021 | Meta AI | Pooling attention for multi-scale features | 84.1% (MViT-L) |
| BEiT [6] | 2021 | Microsoft Research | BERT-style masked image modeling pre-training | 86.3% (BEiT-L) |
| MAE [7] | 2021 | Meta AI | Masked autoencoder; reconstructs 75% masked patches | 87.8% (ViT-H) |
| DINO [8] | 2021 | Meta AI | Self-distillation with no labels | 80.1% (linear eval, ViT-B) |
| DINOv2 [9] | 2023 | Meta AI | Scaled self-supervised training on 142M curated images; ViT-g/14 | Strong on diverse benchmarks |
| EVA [10] | 2022 | BAAI | Masked image modeling with CLIP features as targets | 89.6% (EVA, 336px) |
| EVA-02 [10] | 2023 | BAAI | Updated ViT (SwiGLU, RoPE, sub-LN); language-aligned MIM | 90.0% (304M params) |
| ConvNeXt | 2022 | Meta AI | Modernized CNN matching ViT accuracy | 87.8% (ConvNeXt-XL, IN-22K) |
| DINOv3 | 2025 | Meta AI | 7B parameters; image-text alignment; Gram anchoring | +6 mIoU over DINOv2 on ADE20K |
DeiT (Data-efficient Image Transformers), introduced by Hugo Touvron and colleagues at Meta AI in January 2021, demonstrated that ViTs could be trained competitively using only ImageNet-1K, without requiring massive external datasets [4]. The key contribution was a knowledge distillation approach where a strong CNN teacher (typically a RegNet) guided the transformer student's learning. DeiT introduced a special distillation token alongside the [CLS] token, which learned to mimic the teacher's output.
A DeiT-Base model achieved 83.1% top-1 accuracy on ImageNet without external data, and with distillation reached 85.2%. Critically, training could be completed on a single 8-GPU machine in under three days, making ViT research accessible to a much broader community.
Two properties of DeiT's distillation are worth noting. First, the optimal teacher is a CNN, not another transformer; the authors argue that the student inherits useful inductive bias from the teacher rather than just smoother labels. Second, the distillation token sits next to the [CLS] token in the input sequence and is supervised by the teacher's predictions through a separate loss term, which differs from classical soft-label distillation by giving the model an extra read-out path it can specialize. DeiT III (Touvron et al. 2022) revisited the recipe with stronger augmentation, longer schedules, and Lamb optimizer settings, and achieved 87.7% on a ViT-H/14 trained from scratch on ImageNet-1K, further weakening the case that ViTs must be pre-trained on JFT-scale data.
The Swin Transformer, proposed by Ze Liu and colleagues at Microsoft Research in March 2021, addressed two major limitations of the original ViT: its single-resolution feature map and the quadratic computational complexity of global self-attention [5]. Swin won the Marr Prize for best paper at ICCV 2021.
Instead of computing self-attention across all patches globally, Swin Transformer partitions the image into non-overlapping local windows and computes self-attention within each window. This reduces computational complexity from quadratic to linear with respect to image size. To enable cross-window information flow, the window partitions are shifted by half the window size in alternating layers. This simple yet effective shifted window strategy allows each patch to attend to patches from neighboring windows across successive layers.
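A sketch of the window bookkeeping, assuming PyTorch; the attention computation itself and the masking of wrapped-around positions after the shift are omitted for brevity, and the shapes follow Swin's typical first-stage configuration (56 x 56 tokens, window size 7).

```python
# Swin-style windowed attention bookkeeping: partition the token grid into
# non-overlapping windows, and cyclically shift it in alternating layers so
# information can cross window boundaries (attention and shift masking omitted).
import torch

def window_partition(x, window):
    """x: (B, H, W, D) -> (num_windows * B, window * window, D)"""
    B, H, W, D = x.shape
    x = x.view(B, H // window, window, W // window, window, D)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window * window, D)

B, H, W, D, window = 1, 56, 56, 96, 7
feat = torch.randn(B, H, W, D)

regular = window_partition(feat, window)                              # layer l
shifted = window_partition(torch.roll(feat, shifts=(-3, -3), dims=(1, 2)),
                           window)                                    # layer l+1
print(regular.shape, shifted.shape)   # (64, 49, 96) for both
```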
Swin Transformer produces multi-scale feature maps by merging patches at each stage, similar to how CNNs downsample spatial resolution through pooling layers. Starting from small patches (typically 4x4 pixels), the model progressively merges neighboring patches at each hierarchical stage, producing feature maps at resolutions of 1/4, 1/8, 1/16, and 1/32 of the input. This hierarchical design makes Swin Transformer suitable as a general-purpose backbone for dense prediction tasks such as object detection and semantic segmentation, where multi-scale features are essential.
Swin Transformer achieved 87.3% top-1 accuracy on ImageNet with ImageNet-22K pre-training and set new records on COCO object detection and ADE20K segmentation at the time of publication. Its successor, Swin Transformer V2 (Liu et al., CVPR 2022), scaled to 3 billion parameters (SwinV2-G), trained on images up to 1,536 x 1,536 in resolution, and introduced three stability tricks: residual post-norm with cosine attention to keep deep transformer activations bounded, log-spaced continuous relative position bias to transfer position encoding from low-resolution pre-training to high-resolution fine-tuning, and SimMIM, a self-supervised masked image modeling method that reduces the need for large labeled datasets [13]. SwinV2-G set state-of-the-art records on ImageNet-V2, COCO detection, ADE20K segmentation, and Kinetics-400 action recognition while using roughly forty times less labeled data and forty times less training time than concurrent billion-parameter Google models.
Self-supervised learning has become one of the most important paradigms for training vision transformers, reducing or eliminating the need for labeled data during pre-training.
Masked Autoencoders (MAE), proposed by Kaiming He and colleagues at Meta AI in November 2021, adapted the masked language modeling concept from BERT to the visual domain [7]. The approach is elegant in its simplicity: randomly mask a large proportion (75%) of image patches and train the model to reconstruct the missing pixels.
MAE uses an asymmetric encoder-decoder design. The encoder operates only on the visible (unmasked) patches, which dramatically reduces computation during training. A lightweight decoder (typically 8 transformer blocks, around 9% of the encoder's per-token compute) then takes the encoded visible patches along with mask tokens and reconstructs the full image. The high masking ratio is crucial; it creates a challenging task that forces the model to learn rich semantic representations rather than relying on simple interpolation from nearby patches. Reconstruction targets are normalized pixel values within each patch, which the authors found worked better than predicting raw RGB.
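The core bookkeeping is simple, as the sketch below shows (assuming PyTorch); it keeps the 25% visible tokens for the encoder and remembers the shuffle so the decoder can restore the original order. Shapes are illustrative, and the decoder and reconstruction loss are omitted.

```python
# Sketch of MAE-style random masking with a 75% mask ratio.
import torch

def random_masking(tokens, mask_ratio=0.75):
    B, N, D = tokens.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                          # one random score per token
    ids_shuffle = noise.argsort(dim=1)                # first n_keep indices are kept
    ids_restore = ids_shuffle.argsort(dim=1)          # lets the decoder un-shuffle
    ids_keep = ids_shuffle[:, :n_keep]
    visible = torch.gather(tokens, 1,
                           ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N)
    mask.scatter_(1, ids_keep, 0)                     # 0 = visible, 1 = masked
    return visible, mask, ids_restore

visible, mask, ids_restore = random_masking(torch.randn(2, 196, 768))
print(visible.shape, mask.sum(dim=1))   # (2, 49, 768); 147 masked tokens per image
```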
Pre-training with MAE followed by fine-tuning yielded 87.8% top-1 accuracy on ImageNet using a ViT-Huge model. The approach also accelerated training by 3x or more compared to methods that process all patches, since 75% of the patches are excluded from the encoder. MAE-style pre-training has since been adopted as a default pre-training step for many production vision pipelines, in part because it requires no labels and scales smoothly to ViT-Large and ViT-Huge backbones.
BEiT (BERT pre-training of Image Transformers), introduced by Hangbo Bao, Li Dong, and Furu Wei at Microsoft Research in June 2021, was the first paper to show that self-supervised pre-training could outperform supervised pre-training for ViTs [6]. BEiT borrows the BERT recipe directly: it masks roughly 40% of image patches and trains the model to predict, for each masked patch, a discrete visual token from a fixed vocabulary.
The visual tokens are produced by a separately trained discrete variational autoencoder (dVAE), originally borrowed from OpenAI's DALL-E codebook, which maps each image to a 14 x 14 grid of integer tokens drawn from an 8,192-entry codebook. Because the targets are discrete, BEiT trains with a standard cross-entropy loss that is identical in form to BERT's masked language modeling loss. BEiT v2 later replaced the dVAE with a vector-quantized teacher trained jointly with a perceptual loss, and BEiT v3 unified vision, language, and multimodal pre-training with a multiway transformer.
DINO (self-DIstillation with NO labels), introduced by Mathilde Caron and colleagues at Meta AI in April 2021, demonstrated remarkable emergent properties in self-supervised vision transformers [8]. DINO trains a student network and a teacher network with identical architectures. The student learns by matching the output distribution of the teacher, while the teacher's weights are updated as an exponential moving average of the student's weights.
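The sketch below illustrates the two ingredients just described (assuming PyTorch): a cross-entropy between sharpened teacher and student output distributions, and an exponential-moving-average update of the teacher. The multi-crop augmentation, projection heads, and centering update are simplified away, and the temperatures are illustrative.

```python
# Sketch of the DINO update: student matches the teacher's (centered, sharpened)
# output distribution; teacher weights are an EMA of the student's.
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(momentum).add_(ps, alpha=1 - momentum)

def dino_loss(student_logits, teacher_logits, center,
              student_temp=0.1, teacher_temp=0.04):
    teacher_probs = F.softmax((teacher_logits - center) / teacher_temp, dim=-1)
    student_logp = F.log_softmax(student_logits / student_temp, dim=-1)
    return -(teacher_probs * student_logp).sum(dim=-1).mean()

# Toy stand-ins for the two identical networks.
student = torch.nn.Linear(768, 4096)
teacher = torch.nn.Linear(768, 4096)
teacher.load_state_dict(student.state_dict())

feats = torch.randn(8, 768)
loss = dino_loss(student(feats), teacher(feats).detach(), center=torch.zeros(4096))
loss.backward()
ema_update(teacher, student)
print(float(loss))
```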
A striking discovery was that self-supervised ViT features trained with DINO contain explicit information about semantic segmentation, even though the model was never trained with segmentation labels or objectives. Attention maps from DINO-trained ViTs clearly delineate object boundaries and distinguish foreground from background. This emergent property has been used directly for unsupervised object discovery, copy detection, and dense feature extraction without any task-specific labels.
DINOv2, released by Meta AI in 2023, scaled this approach to 142 million curated images and produced general-purpose visual features that transferred strongly across a wide range of tasks without fine-tuning [9]. The curated LVD-142M dataset was assembled by retrieving images from a large pool of uncurated web data that were close in feature space to images in several smaller curated datasets, an automated curation procedure that effectively reproduced the quality benefits of human-curated data at much larger scale. The largest released model is a ViT-g/14 with about 1.1 billion parameters. DINOv2 features outperformed OpenCLIP and other foundation models on linear evaluation across classification, segmentation, depth estimation, and instance retrieval benchmarks.
In August 2025, Meta released DINOv3 with 7 billion parameters, trained on 1.7 billion images. DINOv3 introduced image-text alignment (similar to CLIP) following the LiT recipe, in which a text encoder is trained from scratch to match a frozen visual encoder's features through a contrastive loss. It also introduced Gram anchoring for teacher-student self-distillation, a regularization technique that preserves patch-level Gram-matrix correlations during long training schedules to prevent dense feature degradation. Combined with axial RoPE (Rotary Positional Embeddings) and a high-resolution post-training phase that fine-tunes on 512 px and 768 px crops, DINOv3 outperformed DINOv2 by +6 mIoU on ADE20K semantic segmentation and showed particularly strong gains on dense prediction tasks where patch-level feature quality matters most.
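The sketch below gives a loose illustration of the Gram-anchoring idea as described above: penalizing drift between the patch-feature Gram matrices of the current student and an earlier anchor teacher. It is an assumption-laden simplification of the concept, not DINOv3's actual formulation.

```python
# Gram-anchoring-style regularizer (illustrative): keep the student's patch-level
# feature correlations close to those of a frozen anchor so dense features do not degrade.
import torch
import torch.nn.functional as F

def gram(patch_feats):                        # (B, N, D) -> (B, N, N)
    patch_feats = F.normalize(patch_feats, dim=-1)
    return patch_feats @ patch_feats.transpose(1, 2)

student_patches = torch.randn(2, 196, 768, requires_grad=True)
anchor_patches = torch.randn(2, 196, 768)     # frozen earlier-teacher features

gram_loss = F.mse_loss(gram(student_patches), gram(anchor_patches))
gram_loss.backward()
print(float(gram_loss))
```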
EVA, proposed by Yuxin Fang and colleagues at BAAI in late 2022, took the masked image modeling idea in a different direction: rather than reconstructing pixels (MAE) or discrete tokens (BEiT), EVA reconstructs the visible-image-conditioned features of a frozen CLIP vision encoder [10]. This means the pre-training target is itself a learned, language-aligned representation, which transferred unusually well to downstream tasks. EVA scaled to one billion parameters and reached 89.6% top-1 on ImageNet at 336 px.
EVA-02 (Fang et al. 2023) updated the architecture with SwiGLU feed-forward layers, rotary position embeddings, and a sub-LN normalization scheme, and reached 90.0% ImageNet top-1 with only 304 million parameters by pre-training on ImageNet-22K with masked image modeling using EVA-CLIP as the teacher. EVA-CLIP itself, a CLIP variant trained with EVA features, reached 80.4% zero-shot top-1 on ImageNet using only about one-sixth of the parameters of the previous best open CLIP, and has since become a popular open-source vision encoder for multimodal LLMs.
A major reason ViT now dominates vision research is that the same architecture used for text in modern large language models can read images directly, allowing them to be combined trivially. Almost every multimodal model released since 2022 uses a ViT-derived vision encoder.
CLIP (Contrastive Language-Image Pre-training), introduced by OpenAI in January 2021, trained a vision transformer (or CNN) jointly with a text encoder using contrastive learning on 400 million image-text pairs from the internet [12]. By learning to align visual and textual representations in a shared embedding space, CLIP enabled zero-shot image classification: the model could classify images into categories it had never explicitly been trained on, simply by comparing image embeddings with text embeddings of category descriptions.
The largest CLIP model trained on this set was a ViT-L/14, with a higher-resolution variant (ViT-L/14@336px) fine-tuned for one extra epoch at 336 x 336 pixels. The largest ViT took roughly 12 days to train on 256 V100 GPUs. CLIP's vision encoder has become one of the most widely used visual backbones in the field. It serves as the visual component in multimodal models such as LLaVA, GPT-4V, and numerous other vision-language models.
LiT (Locked-image Tuning), introduced by Zhai et al. at Google Research in 2022, observed that you can keep a strong pre-trained image encoder frozen and only train the text tower to align to it, which often outperforms training both encoders jointly from scratch. LiT became the basis for several follow-up image-text models including DINOv3's text alignment phase.
SigLIP (Sigmoid Loss for Language Image Pre-Training), introduced by Xiaohua Zhai and colleagues at Google in March 2023, replaced the softmax-based contrastive loss of CLIP with a sigmoid loss applied independently to each image-text pair [14]. Because the sigmoid loss does not require global normalization across the batch, it eliminates the need to materialize the full N x N similarity matrix and reduces inter-GPU communication. Practical consequences include better performance at small batch sizes, more memory headroom, and the ability to scale to very large effective batches when desired. SigLIP performed best at a batch size of 32k, while CLIP's softmax loss needed 98k for its optimum and still did not match the sigmoid variant. SigLIP's vision tower includes ViT-B/16, ViT-L/16, and SoViT-400m/14 (a shape-optimized variant from a separate Google paper).
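The contrast between the two losses is easiest to see side by side. The sketch below (assuming PyTorch) shows CLIP's batch-normalized softmax contrastive loss next to a per-pair sigmoid loss; temperatures and the bias initialization are illustrative simplifications of the published recipes.

```python
# CLIP-style softmax contrastive loss vs. SigLIP-style sigmoid loss (sketch).
import torch
import torch.nn.functional as F

def clip_softmax_loss(img, txt, temperature=0.07):
    img, txt = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
    logits = img @ txt.t() / temperature                  # full (N, N) similarity matrix
    targets = torch.arange(len(img))
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

def siglip_sigmoid_loss(img, txt, temperature=0.07, bias=-10.0):
    img, txt = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
    logits = img @ txt.t() / temperature + bias
    labels = 2 * torch.eye(len(img)) - 1                  # +1 on the diagonal, -1 elsewhere
    return -F.logsigmoid(labels * logits).mean()          # independent binary decisions

img, txt = torch.randn(16, 512), torch.randn(16, 512)
print(float(clip_softmax_loss(img, txt)), float(siglip_sigmoid_loss(img, txt)))
```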
Flamingo (DeepMind, 2022) was an early demonstration that you could glue a frozen vision encoder to a frozen large language model with a small bridging module and achieve strong few-shot performance on visual question answering, captioning, and OCR. Flamingo used a Normalizer-Free ResNet (NFNet-F6) as its vision encoder rather than a pure ViT, but its core pattern (freeze a vision encoder, freeze a language model, and learn a thin connector with cross-attention) became the template for almost all later vision-language models. Its Perceiver Resampler module compressed variable-length visual feature maps into a fixed-size set of visual tokens consumed by the LLM.
LLaVA (Liu et al., NeurIPS 2023) replaced Flamingo's NFNet with a CLIP ViT-L vision encoder and used a simple projection (initially linear, then a two-layer MLP in LLaVA-1.5) to feed visual features into a Vicuna LLM. LLaVA's two-stage training (first projection-only feature alignment, then visual instruction tuning on GPT-generated multimodal instruction data) delivered an 85.1% relative score against GPT-4 on multimodal tasks at a tiny fraction of the training cost. The same recipe powers most open-source vision-language models such as Qwen-VL, and proprietary systems including GPT-4V, Gemini, and Claude with vision likewise use ViT-style vision encoders feeding their LLM decoders.
Vision transformers have expanded far beyond image classification, becoming foundational components across virtually all areas of computer vision.
DEtection TRansformer (DETR), introduced by Nicolas Carion and colleagues at Meta AI in 2020, reimagined object detection as a direct set prediction problem [11]. DETR uses a CNN backbone to extract features, then passes them through a transformer encoder-decoder architecture. The decoder outputs a fixed set of predictions in parallel, eliminating the need for hand-designed components like anchor boxes, non-maximum suppression, and region proposal networks that were central to earlier detectors like Faster R-CNN.
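The set-prediction idea can be illustrated with a few lines of code (assuming PyTorch and SciPy): a fixed set of query predictions is matched one-to-one to ground-truth boxes with the Hungarian algorithm, so no anchors or NMS are needed. DETR's real matching cost combines class probabilities and box terms; a plain L1 box cost stands in here.

```python
# Sketch of DETR-style bipartite matching between predictions and ground truth.
import torch
from scipy.optimize import linear_sum_assignment

num_queries, num_gt = 100, 3
pred_boxes = torch.rand(num_queries, 4)        # (cx, cy, w, h) in [0, 1]
gt_boxes = torch.rand(num_gt, 4)

cost = torch.cdist(pred_boxes, gt_boxes, p=1)            # (100, 3) pairwise L1 cost
pred_idx, gt_idx = linear_sum_assignment(cost.numpy())   # optimal one-to-one matching
print(list(zip(pred_idx.tolist(), gt_idx.tolist())))     # the 3 matched query indices
# All unmatched queries are trained to predict the "no object" class.
```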
While the original DETR was slower to converge than traditional detectors, subsequent variants (Deformable DETR, DINO-DETR, Co-DETR, RT-DETR) addressed convergence speed and achieved state-of-the-art detection results. The DETR paradigm fundamentally simplified the object detection pipeline. Many modern detectors now use a Swin or ViT backbone (for example ViTDet, Mask DINO, and Co-DETR with Swin-L) and a DETR-style transformer head.
The Segment Anything Model (SAM), released by Meta AI in April 2023, demonstrated the power of vision transformers for interactive and promptable segmentation. SAM uses a ViT-based image encoder (ViT-B, ViT-L, or ViT-H, with the largest ViT-H having about 636 million parameters and 32 transformer layers) to produce image embeddings, which are then decoded into segmentation masks based on user prompts (points, bounding boxes, or text). Trained on the SA-1B dataset of more than one billion masks across 11 million licensed images, SAM could segment virtually any object in any image in a zero-shot manner.
SAM 2 (Meta AI, August 2024) extended the approach to video by adding a streaming memory architecture: a memory attention module conditions the current frame's features on past frames and earlier prompts, a memory encoder produces compact representations of past predictions, and a memory bank stores spatial features and object pointers. The result is a single model that handles both image and video segmentation, runs at real-time speed on a single GPU, segments video with roughly three times fewer interactions than prior approaches, and is six times faster than the original SAM on still images. SAM 3, announced in 2025, introduced concept-based segmentation, where a single text prompt or image exemplar can find and segment every instance of a visual concept across images and videos.
Beyond CLIP, ViT vision encoders feed essentially every production vision-language model: GPT-4V uses a vision encoder pre-trained at OpenAI, Gemini uses a Google vision encoder closely related to SigLIP, Claude with vision uses Anthropic's internal encoder, and most open-source VLMs (LLaVA, Qwen-VL, InternVL, MiniCPM-V) use CLIP ViT-L, SigLIP, or EVA-CLIP. The standard pattern is: ViT vision encoder, then a connector (linear, MLP, Q-Former, or Perceiver Resampler), then an LLM. The choice of vision encoder is one of the most consequential design decisions in modern multimodal AI.
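A minimal sketch of that standard pattern, assuming PyTorch; the dimensions (ViT-L-like features projected into a 4,096-wide LLM, a 24 x 24 patch grid) are illustrative, and the encoders and LLM are replaced by random stand-ins.

```python
# Sketch of the ViT -> connector -> LLM pattern: a frozen ViT produces patch
# features, a small MLP projects them into the LLM's hidden size, and the
# resulting visual tokens are concatenated with the text embeddings.
import torch
import torch.nn as nn

vit_dim, llm_dim = 1024, 4096                       # e.g. ViT-L features -> LLM width

connector = nn.Sequential(                          # LLaVA-1.5-style two-layer MLP
    nn.Linear(vit_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim),
)

image_features = torch.randn(1, 576, vit_dim)       # e.g. 24x24 patch grid from the ViT
text_embeddings = torch.randn(1, 32, llm_dim)       # embedded text prompt tokens

visual_tokens = connector(image_features)           # (1, 576, 4096)
llm_input = torch.cat([visual_tokens, text_embeddings], dim=1)
print(llm_input.shape)                              # torch.Size([1, 608, 4096])
```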
Vision transformers have also found roles in generative models. U-ViT (Bao et al. 2022) was an early ViT backbone for diffusion that treated time, condition, and noisy image patches as a single token sequence and added long skip connections in the spirit of U-Net, achieving record FID scores of 2.29 on class-conditional ImageNet 256 x 256 and 5.48 on text-to-image MS-COCO. The Diffusion Transformer (DiT), introduced by William Peebles and Saining Xie in 2023, took an even more vanilla path: it used standard ViT blocks with adaLN-Zero conditioning to inject the diffusion timestep and class label, and demonstrated that the FID-50K on ImageNet 256 x 256 dropped from 3.60 (LDM, U-Net) to 2.27 with DiT-XL/2. DiT and its descendants (DiT-XL/2, MM-DiT) form the backbone of several leading image and video generation systems, including Sora by OpenAI and Stable Diffusion 3 by Stability AI.
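The adaLN-Zero mechanism mentioned above is compact enough to sketch (assuming PyTorch): scale, shift, and gate parameters are regressed from the conditioning embedding, and the regression layer is zero-initialized so each block starts as the identity. This is a simplified single-branch illustration, not the full DiT block.

```python
# Sketch of adaLN-Zero conditioning as used in DiT-style diffusion transformers.
import torch
import torch.nn as nn

class AdaLNZero(nn.Module):
    def __init__(self, dim=1152, cond_dim=1152):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_mod = nn.Linear(cond_dim, 3 * dim)
        nn.init.zeros_(self.to_mod.weight)            # zero-init => identity at start
        nn.init.zeros_(self.to_mod.bias)

    def forward(self, x, cond, sublayer):
        shift, scale, gate = self.to_mod(cond).unsqueeze(1).chunk(3, dim=-1)
        h = self.norm(x) * (1 + scale) + shift         # adaptive LayerNorm
        return x + gate * sublayer(h)                  # gated residual, starts at zero

tokens = torch.randn(2, 256, 1152)                    # noisy image patch tokens
cond = torch.randn(2, 1152)                           # timestep + class embedding
print(AdaLNZero()(tokens, cond, sublayer=nn.Linear(1152, 1152)).shape)
```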
ViTs extend naturally to video by treating the temporal dimension as additional tokens. TimeSformer (Bertasius et al. 2021) introduced divided space-time attention, factorizing 3D self-attention into separate temporal and spatial attention blocks, and reached state-of-the-art on Kinetics-400 and Kinetics-600 without any convolutions. ViViT (Arnab et al., ICCV 2021) explored several factorization strategies, including factorized encoder, factorized self-attention, and factorized dot-product, and showed how to leverage pre-trained image ViTs for video by inflating spatial weights along the time axis. Video Swin Transformer extended Swin's shifted windows to space-time. VideoMAE (Tong et al. 2022) extended MAE pre-training to video with very high masking ratios (90 to 95%) and showed strong scaling on Something-Something v2 and Kinetics-400.
The ViT recipe of "slice into patches, add positional encoding, run a transformer" generalized to many other spatial inputs. Medical imaging uses ViTs for radiology and pathology, often pre-trained with MAE on large unlabeled image archives. Satellite imagery analysis uses ViTs (Prithvi, Clay) trained on Sentinel-2 and Landsat data. Point cloud processing tokenizes voxels or local point patches; Point-MAE and PointBERT applied masked-token pre-training to 3D points. Even tabular and time-series models have borrowed the ViT recipe for sequence-of-patches input.
The following table summarizes representative results on the ImageNet-1K benchmark across CNN and ViT-based architectures. Pre-training datasets and model sizes vary, so direct comparisons should be interpreted with context.
| Model | Type | Parameters | Pre-training Data | ImageNet Top-1 (%) |
|---|---|---|---|---|
| ResNet-50 | CNN | 25M | ImageNet-1K | 76.1 |
| ResNet-152 | CNN | 60M | ImageNet-1K | 78.3 |
| EfficientNet-B7 | CNN | 66M | ImageNet-1K | 84.3 |
| EfficientNetV2-L | CNN | 120M | ImageNet-21K | 85.7 |
| ConvNeXt-XL | CNN | 350M | ImageNet-22K | 87.8 |
| ViT-B/16 | ViT | 86M | ImageNet-1K | 77.9 |
| ViT-B/16 | ViT | 86M | ImageNet-21K | 84.0 |
| ViT-L/16 | ViT | 307M | ImageNet-21K | 85.3 |
| ViT-H/14 | ViT | 632M | JFT-300M | 88.55 |
| DeiT-B (distilled) | ViT | 86M | ImageNet-1K | 85.2 |
| Swin-L | ViT (hierarchical) | 197M | ImageNet-22K | 87.3 |
| SwinV2-G | ViT (hierarchical) | 3B | ImageNet-22K + ext | 84.0 (ImageNet-V2) |
| BEiT-L | ViT | 307M | ImageNet-21K (self-supervised) | 86.3 |
| MAE (ViT-H) | ViT | 632M | ImageNet-1K (self-supervised) | 87.8 |
| EVA | ViT | 1.0B | Merged (MIM + CLIP) | 89.6 |
| EVA-02 | ViT | 304M | ImageNet-22K (MIM) | 90.0 |
| CoCa | ViT + Text | 2.1B | Multimodal | 91.0 |
A clear trend emerges from these results. ViTs trained only on ImageNet-1K lag behind well-optimized CNNs of similar size. But with larger pre-training datasets or self-supervised objectives, ViTs consistently outperform CNNs. The best-performing models in 2025 are either pure ViTs or multimodal systems with ViT visual encoders.
ViT is supported by every major deep learning framework and a number of dedicated libraries:
| Library | Maintainer | Notes |
|---|---|---|
| timm (PyTorch Image Models) | Ross Wightman / Hugging Face | Reference high-quality implementations of ViT, DeiT, Swin, BEiT, EVA, ConvNeXt, plus thousands of pretrained weights |
| torchvision.models.vit_* | PyTorch / Meta | First-party ViT-B/L/H implementations with ImageNet weights |
| Hugging Face transformers | Hugging Face | ViTModel, ViTForImageClassification, plus DeiT, Swin, BEiT, MAE, DINOv2 wrappers |
| Hugging Face diffusers | Hugging Face | DiT and U-ViT implementations for image and video generation |
| Big Vision (Google) | Google Research | JAX/Flax implementation of ViT, MAE, SigLIP, used by the original ViT and ViT-22B papers |
| Lucidrains vit-pytorch | Phil Wang | Compact PyTorch reference for ViT and many follow-up variants |
| TensorFlow Model Garden | Google / TensorFlow | ViT and Swin reference implementations |
For production inference, ViTs are commonly deployed through ONNX Runtime, TensorRT, or vLLM-style serving stacks; for edge deployment they are typically distilled, quantized to INT8 or 4-bit, and slimmed with token-reduction methods such as ToMe (Token Merging) and DynamicViT (token pruning).
The choice of patch size P is the dominant lever for trading accuracy against compute in ViT. Smaller patches produce longer sequences (N grows quadratically with 1/P), which boosts spatial detail but multiplies the cost of self-attention and the size of intermediate activations. The most common choices and their typical use cases are summarized below.
| Patch size | Sequence length at 224 px | Typical use |
|---|---|---|
| 32 x 32 | 49 | ViT-B/32 fast baseline; classical CLIP variant |
| 16 x 16 | 196 | Default in original ViT, MAE, BEiT, SigLIP |
| 14 x 14 | 256 | CLIP ViT-L/14, DINOv2, EVA-CLIP |
| 8 x 8 | 784 | Detail-sensitive tasks (fine-grained classification, dense matching) |
| 4 x 4 | 3,136 | Swin Transformer initial stage |
Resolution further compounds this scaling. Doubling the resolution from 224 to 448 quadruples the sequence length and roughly multiplies attention cost by sixteen, which is why high-resolution ViTs almost always use hierarchical attention (Swin), windowed attention with global tokens (Hiera, MViT), or efficient-attention substitutes (Linformer, Performer, Mamba-style state-space backbones).
The introduction of ViT triggered a paradigm shift in computer vision research. Several developments can be traced directly to its influence.
First, ViT demonstrated that domain-specific architectural inductive biases (like convolutions) are not strictly necessary for strong visual understanding. Given enough data and compute, a general-purpose architecture can learn the relevant patterns. This insight aligned computer vision with the broader trend in AI toward scaling general architectures rather than engineering task-specific ones.
Second, ViT unified the architectural foundations of vision and language. Because both modalities now use transformer encoders, building multimodal systems became significantly more straightforward. Models like CLIP, Flamingo, and GPT-4V leverage shared transformer components for both visual and textual processing, enabling capabilities that would have been difficult to achieve with separate CNN and RNN pipelines.
Third, ViT accelerated the adoption of self-supervised pre-training in vision. Techniques like MAE, DINO, and BEiT drew direct inspiration from masked language modeling in NLP, and these methods proved highly effective precisely because the transformer architecture is shared across domains. Self-supervised ViT features now serve as general-purpose visual representations across dozens of downstream tasks.
Fourth, the success of ViT contributed to the rise of foundation models in vision. Rather than training specialized models for each task, the field moved toward training large, general-purpose visual encoders once and then adapting them to specific tasks through fine-tuning, linear probing, or prompting. DINOv2 and DINOv3 are explicit examples: a single set of frozen weights produces features that drive classification, segmentation, depth estimation, and retrieval on dozens of benchmarks.
Despite its dominance, ViT has well-known weaknesses that motivate active research. They include:

- Quadratic self-attention cost, which makes high-resolution inputs expensive in both compute and memory.
- Heavy data and compute requirements when training from scratch without large-scale or self-supervised pre-training.
- Weaker fine-grained local detail than CNNs at smaller data scales.
- Position embeddings tied to the training resolution, requiring interpolation or alternative encoding schemes when the input size changes.
- Costly edge and mobile deployment without distillation, quantization, or token reduction.
As of early 2026, vision transformers have firmly established themselves as the dominant architecture in computer vision research and are increasingly deployed in production systems.
The strict dichotomy between CNNs and transformers has given way to a more nuanced landscape. Hybrid architectures like ConvNeXt V2 incorporate design principles from both paradigms. Many modern vision transformers include convolutional elements in their patch embedding layers or use depthwise convolutions in their feed-forward networks. Conversely, recent CNN designs borrow attention mechanisms and training recipes from the ViT literature.
A major area of active research involves making ViTs practical for deployment on edge devices and in latency-sensitive applications. Token pruning and routing techniques (DynamicViT, A-ViT, ToMe) allow models to dynamically allocate computation only to informative image regions, reducing inference time by up to 50% while maintaining accuracy. Quantization, distillation, and architectural simplifications have produced compact ViT variants suitable for mobile deployment. FlashAttention and FlashAttention-2 have made global attention practical at much higher resolutions on modern GPUs.
The largest vision transformers now serve as universal visual backbones. DINOv2 and DINOv3 produce features that transfer effectively to classification, segmentation, depth estimation, and other tasks without any fine-tuning. EVA-02 achieves 90.0% ImageNet accuracy with only 304 million parameters. These models, along with multimodal systems like CLIP and SigLIP, have become standard building blocks in the AI stack. Vision encoders are now treated as a commodity layer in the multimodal LLM stack: pick a strong frozen ViT, project its features into the LLM's token space, and train only the connector and the LLM.
Vision transformers have expanded beyond 2D images into video understanding, 3D point cloud processing, medical imaging, satellite imagery analysis, and autonomous driving perception. The flexibility of the patch-based tokenization scheme allows ViTs to process diverse spatial data formats with minimal architectural changes. Video models treat temporal frames as additional tokens, while 3D models tokenize voxels or point cloud patches. Tesla's autopilot, Waymo's perception stack, and most modern surgical robotics systems include ViT or Swin backbones somewhere in their pipeline.
Despite their success, several challenges remain. The quadratic complexity of self-attention with respect to sequence length limits the resolution at which ViTs can efficiently process images. Training large ViTs from scratch still requires substantial computational resources. And while ViTs excel at capturing global patterns, they can struggle with fine-grained local details compared to CNNs, particularly at lower data scales. Recent work on linear attention approximations, mixture-of-experts ViTs, state-space models such as Vision Mamba, and hybrid attention-convolution backbones such as Hiera and FastViT continues to attack these limits.