The Vision Transformer (ViT) is a deep learning architecture that applies the transformer model, originally designed for natural language processing, to image recognition tasks. Instead of processing images through convolutional neural networks (CNNs), ViT splits an image into fixed-size patches, treats each patch as a token (analogous to a word in a sentence), and feeds the resulting sequence into a standard transformer encoder. Introduced by Alexey Dosovitskiy and colleagues at Google Brain in October 2020, the approach demonstrated that a pure transformer, applied directly to sequences of image patches, can achieve state-of-the-art results on image classification benchmarks when pre-trained on large datasets [1]. The original paper, titled "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale," became one of the most influential publications in computer vision and sparked a wave of transformer-based architectures across the field.
The paper appeared at ICLR 2021 and is now the dominant backbone behind nearly every modern vision foundation model, multimodal large language model, image generator, and segmentation system. By 2025-2026, ViT and its derivatives (DeiT, Swin, MAE, DINOv2, DINOv3, EVA, CLIP, SigLIP, SAM) had effectively replaced CNNs as the default choice for new computer vision research at scale.
Before ViT, convolutional neural networks dominated computer vision. Architectures such as AlexNet, VGGNet, ResNet, and EfficientNet relied on learned convolutional filters to extract local features from images in a hierarchical manner. While CNNs proved highly effective, they carried strong inductive biases: locality (each filter operates on a small spatial region) and translation equivariance (the same filter is applied across the entire image). These biases helped CNNs learn efficiently from limited data, but they also constrained the model's ability to capture long-range dependencies across an image without stacking many layers.
Meanwhile, the transformer architecture had revolutionized NLP. Models like BERT and GPT demonstrated that self-attention mechanisms could model complex relationships between tokens in a sequence with remarkable effectiveness. Several researchers explored ways to bring attention mechanisms into vision, including hybrid approaches that combined convolutions with self-attention. However, Dosovitskiy et al. took a more radical approach: they asked whether a pure transformer, with minimal image-specific modifications, could match or exceed CNN performance on image classification.
The answer, it turned out, was yes, provided the model had access to sufficient training data. The central experimental finding was that ViT underperformed comparable ResNets when pre-trained on ImageNet-1K alone, roughly matched them on ImageNet-21K, and clearly surpassed them when pre-trained on the proprietary JFT-300M dataset. The takeaway was that scale, not architecture-specific inductive bias, was the dominant factor at the high end.
The ViT architecture follows a straightforward pipeline that maps an image to a class label through a sequence of well-defined steps.
Given an input image of resolution H x W with C color channels, ViT divides it into a grid of non-overlapping patches, each of size P x P pixels. For a standard 224 x 224 image with a patch size of 16 x 16, this yields 196 patches (14 rows by 14 columns). Each patch is flattened into a one-dimensional vector of length P x P x C. For 16 x 16 RGB patches, each vector has 768 values. These flattened patches are then projected through a trainable linear layer (a single matrix multiplication) into a fixed-dimensional embedding space. The result is a sequence of patch embeddings, each representing one spatial region of the image [1].
In practice, this linear projection is implemented as a 2D convolution with kernel size P and stride P, which is mathematically equivalent to slicing patches and applying a dense layer but is more efficient on GPU. This convolutional view also explains why some authors describe ViT as "a transformer applied to a single convolutional layer."
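The sketch below illustrates this equivalence, assuming PyTorch; shapes follow the ViT-B/16 numbers used above (224 x 224 RGB input, 16 x 16 patches, 768-dimensional embeddings), and the variable names are illustrative rather than taken from any particular library.

```python
# Minimal sketch of ViT patch embedding: explicit patch slicing + a linear
# projection, and the equivalent strided Conv2d formulation described above.
import torch
import torch.nn as nn

B, C, H, W = 1, 3, 224, 224      # batch, channels, height, width
P, D = 16, 768                   # patch size, embedding dimension
N = (H // P) * (W // P)          # 196 patches for a 224x224 image

x = torch.randn(B, C, H, W)

# View 1: slice non-overlapping patches, flatten, apply a dense projection.
linear = nn.Linear(P * P * C, D)
patches = x.unfold(2, P, P).unfold(3, P, P)            # (B, C, 14, 14, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, N, C * P * P)
tokens_linear = linear(patches)                        # (B, 196, 768)

# View 2: the same projection expressed as a convolution with kernel = stride = P.
conv = nn.Conv2d(C, D, kernel_size=P, stride=P)
with torch.no_grad():                                  # share weights so outputs match
    conv.weight.copy_(linear.weight.view(D, C, P, P))
    conv.bias.copy_(linear.bias)
tokens_conv = conv(x).flatten(2).transpose(1, 2)       # (B, 196, 768)

print(torch.allclose(tokens_linear, tokens_conv, atol=1e-4))  # should print True
```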
Because the transformer architecture is permutation-invariant (it has no built-in notion of spatial order), positional information must be explicitly provided. ViT adds a learnable position embedding to each patch embedding. These position embeddings allow the model to learn where each patch is located relative to others in the original image. The original ViT paper used standard 1D learnable position embeddings and found that they performed comparably to more complex 2D-aware positional encoding schemes [1].
A practical consequence of using learnable position embeddings is that the embedding table is tied to a fixed sequence length, and therefore to a fixed image resolution at the chosen patch size. To fine-tune at a higher resolution, the standard recipe is to perform 2D interpolation of the pre-trained position embedding grid (typically bicubic interpolation) so that it matches the new sequence length. This trick is essential to the standard ViT recipe and is now built into most vision transformer libraries. Later work explored alternatives such as relative position bias (Swin, BEiT), conditional positional encoding (CPVT), and rotary position embedding for vision (RoPE-ViT), each aimed at making position handling more resolution-flexible.
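A minimal sketch of this interpolation trick, assuming PyTorch; the function name and shapes are illustrative (a ViT-B/16 going from 224 px to 512 px input), not a specific library's API.

```python
# Sketch of 2D bicubic interpolation of learnable position embeddings for
# fine-tuning at a higher resolution; the [CLS] position entry is kept as-is.
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed, old_grid=14, new_grid=32):
    """pos_embed: (1, 1 + old_grid**2, D), with the [CLS] entry first."""
    cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    D = patch_pos.shape[-1]
    # Reshape the flat token sequence back into its 2D grid, interpolate, re-flatten.
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, D).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode="bicubic", align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, D)
    return torch.cat([cls_pos, patch_pos], dim=1)

# 224 px / 16 px patches = 14x14 grid  ->  512 px / 16 px patches = 32x32 grid
pos = torch.randn(1, 1 + 14 * 14, 768)
print(interpolate_pos_embed(pos).shape)   # torch.Size([1, 1025, 768])
```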
Following the convention established by BERT, ViT prepends a special learnable [CLS] token to the sequence of patch embeddings. This token does not correspond to any image patch. Instead, its representation at the output of the transformer encoder serves as the aggregate image representation used for classification. The final state of the [CLS] token is passed through a classification head (typically a small multilayer perceptron) to produce the predicted class probabilities. Subsequent work showed that simple global average pooling over patch tokens can replace the [CLS] token with negligible performance change, and many modern variants (DINOv2, EVA-02, MAE for fine-tuning) use mean pooling either alongside or instead of the [CLS] token.
The sequence of patch embeddings (plus the [CLS] token and position embeddings) is fed into a standard transformer encoder, identical in design to the one proposed by Vaswani et al. in 2017 [2]. The encoder consists of L identical layers, each containing two sub-layers:

1. A multi-head self-attention (MSA) block, in which every token attends to every other token.
2. A position-wise MLP with two linear layers and a GELU activation in between.

Both sub-layers use Layer Normalization (applied before the sub-layer, known as Pre-Norm) and residual connections. The complete forward pass can be summarized as:

z_0 = [x_cls; x_p^1 E; x_p^2 E; ...; x_p^N E] + E_pos
z'_l = MSA(LN(z_{l-1})) + z_{l-1},   for l = 1...L
z_l = MLP(LN(z'_l)) + z'_l,          for l = 1...L
y = LN(z_L^0)

where E is the patch projection, E_pos the position embeddings, and z_L^0 is the output state of the [CLS] token after all L layers.
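A compact sketch of one such Pre-Norm encoder block, assuming PyTorch and using nn.MultiheadAttention as a stand-in for the MSA sub-layer; dimensions match ViT-B/16 and the class name is illustrative.

```python
# Minimal Pre-Norm ViT encoder block corresponding to the equations above
# (a sketch, not any particular library's implementation).
import torch
import torch.nn as nn

class ViTBlock(nn.Module):
    def __init__(self, dim=768, heads=12, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, z):
        # z'_l = MSA(LN(z_{l-1})) + z_{l-1}
        h = self.norm1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]
        # z_l = MLP(LN(z'_l)) + z'_l
        return z + self.mlp(self.norm2(z))

tokens = torch.randn(2, 197, 768)        # [CLS] + 196 patch tokens (ViT-B/16)
print(ViTBlock()(tokens).shape)          # torch.Size([2, 197, 768])
```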
During pre-training, the classification head is a small MLP with one hidden layer and a tanh activation. During fine-tuning on a downstream task, it is replaced by a single linear layer applied to the [CLS] token (or to mean-pooled patch features). This minimal head means almost all of the model's capacity sits in the transformer trunk, which is what makes the same backbone usable for classification, retrieval, segmentation, and generation by simply swapping the head.
For an image of N patches, the dominant cost in each transformer layer is the self-attention operation, which is O(N^2 * D) in time and O(N^2) in memory for the attention matrix. The MLP block is O(N * D^2) and is typically the larger term until the sequence becomes very long. This quadratic dependence on N is the central scaling problem of vanilla ViT: doubling the input resolution at fixed patch size quadruples the number of patches and so multiplies the attention cost by sixteen. Most follow-up work, including Swin Transformer, MViT, and the various linear-attention vision transformers, exists primarily to break this quadratic wall.
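A back-of-the-envelope calculation makes the scaling concrete; the constants below are rough multiply-accumulate counts for illustration only, not exact FLOP figures.

```python
# Rough per-layer cost scaling with sequence length N (constants illustrative):
# attention is O(N^2 * D), the 4x-expansion MLP is roughly 8 * D^2 MACs per token.
def per_layer_cost(n_tokens, dim=768):
    attention = n_tokens ** 2 * dim
    mlp = n_tokens * 8 * dim ** 2
    return attention, mlp

for res in (224, 448):
    n = (res // 16) ** 2                 # number of patches at patch size 16
    attn, mlp = per_layer_cost(n)
    print(res, n, f"attention={attn:.2e}", f"mlp={mlp:.2e}")
# Doubling resolution: N goes 196 -> 784 (4x), attention grows 16x, the MLP 4x.
```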
The original paper defined three model sizes, borrowing naming conventions from BERT. Each variant can be combined with different patch sizes, denoted as ViT-{size}/{patch}, for example ViT-B/16 (Base model with 16x16 patches) or ViT-L/32 (Large model with 32x32 patches). Smaller patch sizes produce longer sequences and higher computational cost, but generally yield better accuracy.
| Model | Layers | Hidden Dim | MLP Dim | Attention Heads | Parameters |
|---|---|---|---|---|---|
| ViT-Base (ViT-B) | 12 | 768 | 3,072 | 12 | ~86M |
| ViT-Large (ViT-L) | 24 | 1,024 | 4,096 | 16 | ~307M |
| ViT-Huge (ViT-H) | 32 | 1,280 | 5,120 | 16 | ~632M |
The community later extended this naming with smaller and larger sizes:
| Variant | Layers | Hidden Dim | Heads | Parameters | Notes |
|---|---|---|---|---|---|
| ViT-Tiny (ViT-Ti) | 12 | 192 | 3 | ~5.7M | Introduced in DeiT for mobile and ablation |
| ViT-Small (ViT-S) | 12 | 384 | 6 | ~22M | DeiT and DINO standard small variant |
| ViT-Base (ViT-B) | 12 | 768 | 12 | ~86M | Original Dosovitskiy et al. base size |
| ViT-Large (ViT-L) | 24 | 1,024 | 16 | ~307M | Standard size for CLIP, MAE, BEiT |
| ViT-Huge (ViT-H) | 32 | 1,280 | 16 | ~632M | Standard size for SAM, MAE, EVA |
| ViT-g (small g) | 40 | 1,408 | 16 | ~1.1B | Used by DINOv2 and EVA |
| ViT-G (capital G) | 48 | 1,664 | 16 | ~1.8B | Used by Google scaling work |
| ViT-e | 56 | 1,792 | 16 | ~4B | Pre-ViT-22B Google scaling step |
| ViT-22B | 48 | 6,144 | 48 | 22B | Dehghani et al. 2023 [3] |
| SwinV2-G | (hierarchical) | 512 (stage 1) | (varied) | 3B | Liu et al. 2022 [13] |
| DINOv3 7B | 40 | 4,096 | 32 | 7B | Meta AI 2025 self-supervised |
Later work scaled ViT even further. Google Research published "Scaling Vision Transformers to 22 Billion Parameters" in 2023, demonstrating that ViT continues to benefit from increased scale with a ViT-22B model [3]. ViT-22B applies three changes that proved crucial for stable training at this scale: parallel attention and MLP blocks (both run in parallel from the same input rather than sequentially), QK-LayerNorm (applying LayerNorm to the queries and keys before the dot product to prevent the attention logit explosion seen near the 8B-parameter regime), and removing biases in the QKV projections and LayerNorms (which improved hardware utilization by about 3%). It was trained on a JFT extension of roughly four billion semi-automatically labeled images, using 14x14 pixel patches on 224x224 images (256 visual tokens per image), with a model FLOPs utilization of 54.9% on TPU v4.
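The sketch below illustrates two of these stability changes (the parallel attention/MLP formulation and QK-LayerNorm) in PyTorch; the dimensions, class name, and shared input LayerNorm are illustrative assumptions, not the paper's actual code.

```python
# Schematic sketch of a parallel attention + MLP block with QK-LayerNorm and
# no QKV biases, as described for ViT-22B above (shapes are illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParallelBlock(nn.Module):
    def __init__(self, dim=1024, heads=16, mlp_ratio=4.0):
        super().__init__()
        self.heads, self.head_dim = heads, dim // heads
        self.norm = nn.LayerNorm(dim)
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)    # no QKV biases
        self.q_norm = nn.LayerNorm(self.head_dim)          # QK-LayerNorm
        self.k_norm = nn.LayerNorm(self.head_dim)
        self.attn_out = nn.Linear(dim, dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)), nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, z):
        B, N, D = z.shape
        h = self.norm(z)
        # Attention branch with LayerNorm applied to queries and keys.
        q, k, v = self.qkv(h).chunk(3, dim=-1)
        q = self.q_norm(q.view(B, N, self.heads, self.head_dim)).transpose(1, 2)
        k = self.k_norm(k.view(B, N, self.heads, self.head_dim)).transpose(1, 2)
        v = v.view(B, N, self.heads, self.head_dim).transpose(1, 2)
        attn = F.scaled_dot_product_attention(q, k, v)
        attn = self.attn_out(attn.transpose(1, 2).reshape(B, N, D))
        # Parallel formulation: attention and MLP read the same normalized input.
        return z + attn + self.mlp(h)

print(ParallelBlock()(torch.randn(2, 256, 1024)).shape)   # torch.Size([2, 256, 1024])
```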
One of the key findings from the original ViT paper is that transformers lack the strong inductive biases of CNNs. Without convolutions enforcing locality and translation equivariance, ViTs need substantially more training data to learn these patterns from scratch. When trained only on ImageNet-1K (approximately 1.3 million images), ViT underperformed comparable ResNet models. However, when pre-trained on larger datasets such as ImageNet-21K (14 million images) or the proprietary JFT-300M dataset (300 million images), ViT surpassed all CNN baselines [1].
This data hunger initially limited ViT's practical appeal. Subsequent research addressed this limitation through three families of techniques:

1. Improved training recipes: strong data augmentation, regularization, and knowledge distillation, exemplified by DeiT (discussed below).
2. Self-supervised pre-training objectives such as masked image modeling and self-distillation (BEiT, MAE, DINO), which exploit large amounts of unlabeled data.
3. Architectural changes that reintroduce some convolutional or hierarchical inductive bias, such as hybrid convolutional stems and windowed attention (Swin).
The combination of these techniques means that a modern ViT can be trained to competitive accuracy on ImageNet-1K alone, and self-supervised pre-training on large unlabeled corpora regularly produces backbones that outperform JFT-pretrained ViTs on downstream tasks.
The relationship between ViTs and CNNs reveals fundamental trade-offs in model design for visual understanding.
| Aspect | Vision Transformer (ViT) | Convolutional Neural Networks |
|---|---|---|
| Inductive bias | Minimal; learns spatial relationships from data | Strong; built-in locality and translation equivariance |
| Receptive field | Global from the first layer (self-attention) | Local, grows gradually with depth |
| Data efficiency | Requires large-scale pre-training data | Trains effectively on smaller datasets |
| Scalability | Performance scales strongly with more data and compute | Improvements plateau at very large scales |
| Computational cost | Quadratic in sequence length (number of patches) | Linear in image resolution |
| Memory | Attention matrix is O(N^2); high resolution is expensive | Convolutions are local; memory grows linearly |
| Robustness | More shape-biased; often stronger robustness to adversarial and out-of-distribution inputs in published studies | Often biased toward texture cues |
| Interpretability | Attention maps provide some spatial interpretability | Feature maps and gradient-based methods (Grad-CAM, etc.) |
| Edge deployment | More expensive; requires optimization | Efficient variants widely deployed on mobile |
In practice, the choice between ViTs and CNNs depends heavily on the available data, computational budget, and deployment constraints. For large-scale applications with abundant data, ViTs tend to deliver superior accuracy. For resource-constrained settings or smaller datasets, CNNs and hybrid architectures remain competitive.
A related line of work argued that the modern ViT recipe (heavy augmentation, AdamW with weight decay, long schedules, patch-based input) is at least as important as the architecture itself. ConvNeXt (Liu et al. 2022) modernized a ResNet with techniques borrowed from ViT and Swin and matched their accuracy with a pure-convolutional model, suggesting that the ViT vs CNN gap is partly a recipe gap rather than a pure architectural one.
The success of ViT inspired a proliferation of transformer-based vision models, each addressing specific limitations or targeting new applications.
| Model | Year | Organization | Key Innovation | ImageNet Top-1 |
|---|---|---|---|---|
| ViT (original) [1] | 2020 | Google Brain | Pure transformer for image classification | 88.55% (ViT-H/14, JFT pre-trained) |
| DeiT [4] | 2021 | Meta AI (Facebook) | Knowledge distillation; data-efficient training on ImageNet only | 85.2% (with distillation) |
| Swin Transformer [5] | 2021 | Microsoft Research | Hierarchical features; shifted window attention | 87.3% (Swin-L, ImageNet-22K pre-trained) |
| Swin Transformer V2 [13] | 2022 | Microsoft Research | Residual-post-norm, log-spaced position bias, SwinV2-G at 3B params | 84.0% (SwinV2-G, ImageNet-V2) |
| PVT (Pyramid ViT) | 2021 | Nanjing Univ. + others | Pyramid feature maps; spatial-reduction attention | 81.7% (PVT-Large) |
| MViT (Multi-scale ViT) | 2021 | Meta AI | Pooling attention for multi-scale features | 84.1% (MViT-L) |
| BEiT [6] | 2021 | Microsoft Research | BERT-style masked image modeling pre-training | 86.3% (BEiT-L) |
| MAE [7] | 2021 | Meta AI | Masked autoencoder; reconstructs 75% masked patches | 87.8% (ViT-H) |
| DINO [8] | 2021 | Meta AI | Self-distillation with no labels | 80.1% (linear eval, ViT-B) |
| DINOv2 [9] | 2023 | Meta AI | Scaled self-supervised training on 142M curated images; ViT-g/14 | Strong on diverse benchmarks |
| EVA [10] | 2022 | BAAI | Masked image modeling with CLIP features as targets | 89.6% (EVA, 336px) |
| EVA-02 [10] | 2023 | BAAI | Updated ViT (SwiGLU, RoPE, sub-LN); language-aligned MIM | 90.0% (304M params) |
| ConvNeXt | 2022 | Meta AI | Modernized CNN matching ViT accuracy | 87.8% (ConvNeXt-XL, IN-22K) |
| DINOv3 | 2025 | Meta AI | 7B parameters; image-text alignment; Gram anchoring | +6 mIoU over DINOv2 on ADE20K |
DeiT (Data-efficient Image Transformers), introduced by Hugo Touvron and colleagues at Meta AI in January 2021, demonstrated that ViTs could be trained competitively using only ImageNet-1K, without requiring massive external datasets [4]. The key contribution was a knowledge distillation approach where a strong CNN teacher (typically a RegNet) guided the transformer student's learning. DeiT introduced a special distillation token alongside the [CLS] token, which learned to mimic the teacher's output.
A DeiT-Base model achieved 83.1% top-1 accuracy on ImageNet without external data, and with distillation reached 85.2%. Critically, training could be completed on a single 8-GPU machine in under three days, making ViT research accessible to a much broader community.
Two properties of DeiT's distillation are worth noting. First, the optimal teacher is a CNN, not another transformer; the authors argue that the student inherits useful inductive bias from the teacher rather than just smoother labels. Second, the distillation token sits next to the [CLS] token in the input sequence and is supervised by the teacher's predictions through a separate loss term, which differs from classical soft-label distillation by giving the model an extra read-out path it can specialize. DeiT III (Touvron et al. 2022) revisited the recipe with stronger augmentation, longer schedules, and Lamb optimizer settings, and achieved 87.7% on a ViT-H/14 trained from scratch on ImageNet-1K, further weakening the case that ViTs must be pre-trained on JFT-scale data.
The Swin Transformer, proposed by Ze Liu and colleagues at Microsoft Research in March 2021, addressed two major limitations of the original ViT: its single-resolution feature map and the quadratic computational complexity of global self-attention [5]. Swin won the Marr Prize for best paper at ICCV 2021.
Instead of computing self-attention across all patches globally, Swin Transformer partitions the image into non-overlapping local windows and computes self-attention within each window. This reduces computational complexity from quadratic to linear with respect to image size. To enable cross-window information flow, the window partitions are shifted by half the window size in alternating layers. This simple yet effective shifted window strategy allows each patch to attend to patches from neighboring windows across successive layers.
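A sketch of the window bookkeeping, assuming PyTorch; the attention computation itself and the masking of wrapped-around positions after the shift are omitted for brevity, and the shapes follow Swin's typical first-stage configuration (56 x 56 tokens, window size 7).

```python
# Swin-style windowed attention bookkeeping: partition the token grid into
# non-overlapping windows, and cyclically shift it in alternating layers so
# information can cross window boundaries (attention and shift masking omitted).
import torch

def window_partition(x, window):
    """x: (B, H, W, D) -> (num_windows * B, window * window, D)"""
    B, H, W, D = x.shape
    x = x.view(B, H // window, window, W // window, window, D)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window * window, D)

B, H, W, D, window = 1, 56, 56, 96, 7
feat = torch.randn(B, H, W, D)

regular = window_partition(feat, window)                              # layer l
shifted = window_partition(torch.roll(feat, shifts=(-3, -3), dims=(1, 2)),
                           window)                                    # layer l+1
print(regular.shape, shifted.shape)   # (64, 49, 96) for both
```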
Swin Transformer produces multi-scale feature maps by merging patches at each stage, similar to how CNNs downsample spatial resolution through pooling layers. Starting from small patches (typically 4x4 pixels), the model progressively merges neighboring patches at each hierarchical stage, producing feature maps at resolutions of 1/4, 1/8, 1/16, and 1/32 of the input. This hierarchical design makes Swin Transformer suitable as a general-purpose backbone for dense prediction tasks such as object detection and semantic segmentation, where multi-scale features are essential.
Swin Transformer achieved 87.3% top-1 accuracy on ImageNet with ImageNet-22K pre-training and set new records on COCO object detection and ADE20K segmentation at the time of publication. Its successor, Swin Transformer V2 (Liu et al., CVPR 2022), scaled to 3 billion parameters (SwinV2-G), trained on images up to 1,536 x 1,536 in resolution, and introduced three stability tricks: residual post-norm with cosine attention to keep deep transformer activations bounded, log-spaced continuous relative position bias to transfer position encoding from low-resolution pre-training to high-resolution fine-tuning, and SimMIM, a self-supervised masked image modeling method that reduces the need for large labeled datasets [13]. SwinV2-G set state-of-the-art records on ImageNet-V2, COCO detection, ADE20K segmentation, and Kinetics-400 action recognition while using roughly forty times less labeled data and forty times less training time than concurrent billion-parameter Google models.
Self-supervised learning has become one of the most important paradigms for training vision transformers, reducing or eliminating the need for labeled data during pre-training.
Masked Autoencoders (MAE), proposed by Kaiming He and colleagues at Meta AI in November 2021, adapted the masked language modeling concept from BERT to the visual domain [7]. The approach is elegant in its simplicity: randomly mask a large proportion (75%) of image patches and train the model to reconstruct the missing pixels.
MAE uses an asymmetric encoder-decoder design. The encoder operates only on the visible (unmasked) patches, which dramatically reduces computation during training. A lightweight decoder (typically 8 transformer blocks, around 9% of the encoder's per-token compute) then takes the encoded visible patches along with mask tokens and reconstructs the full image. The high masking ratio is crucial; it creates a challenging task that forces the model to learn rich semantic representations rather than relying on simple interpolation from nearby patches. Reconstruction targets are normalized pixel values within each patch, which the authors found worked better than predicting raw RGB.
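The core bookkeeping is simple, as the sketch below shows (assuming PyTorch); it keeps the 25% visible tokens for the encoder and remembers the shuffle so the decoder can restore the original order. Shapes are illustrative, and the decoder and reconstruction loss are omitted.

```python
# Sketch of MAE-style random masking with a 75% mask ratio.
import torch

def random_masking(tokens, mask_ratio=0.75):
    B, N, D = tokens.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                          # one random score per token
    ids_shuffle = noise.argsort(dim=1)                # first n_keep indices are kept
    ids_restore = ids_shuffle.argsort(dim=1)          # lets the decoder un-shuffle
    ids_keep = ids_shuffle[:, :n_keep]
    visible = torch.gather(tokens, 1,
                           ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N)
    mask.scatter_(1, ids_keep, 0)                     # 0 = visible, 1 = masked
    return visible, mask, ids_restore

visible, mask, ids_restore = random_masking(torch.randn(2, 196, 768))
print(visible.shape, mask.sum(dim=1))   # (2, 49, 768); 147 masked tokens per image
```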
Pre-training with MAE followed by fine-tuning yielded 87.8% top-1 accuracy on ImageNet using a ViT-Huge model. The approach also accelerated training by 3x or more compared to methods that process all patches, since 75% of the patches are excluded from the encoder. MAE-style pre-training has since been adopted as a default pre-training step for many production vision pipelines, in part because it requires no labels and scales smoothly to ViT-Large and ViT-Huge backbones.
BEiT (BERT pre-training of Image Transformers), introduced by Hangbo Bao, Li Dong, and Furu Wei at Microsoft Research in June 2021, was the first paper to show that self-supervised pre-training could outperform supervised pre-training for ViTs [6]. BEiT borrows the BERT recipe directly: it masks roughly 40% of image patches and trains the model to predict, for each masked patch, a discrete visual token from a fixed vocabulary.
The visual tokens are produced by a separately trained discrete variational autoencoder (dVAE), originally borrowed from OpenAI's DALL-E codebook, which maps each image to a 14 x 14 grid of integer tokens drawn from an 8,192-entry codebook. Because the targets are discrete, BEiT trains with a standard cross-entropy loss that is identical in form to BERT's masked language modeling loss. BEiT v2 later replaced the dVAE with a vector-quantized teacher trained jointly with a perceptual loss, and BEiT v3 unified vision, language, and multimodal pre-training with a multiway transformer.
DINO (self-DIstillation with NO labels), introduced by Mathilde Caron and colleagues at Meta AI in April 2021, demonstrated remarkable emergent properties in self-supervised vision transformers [8]. DINO trains a student network and a teacher network with identical architectures. The student learns by matching the output distribution of the teacher, while the teacher's weights are updated as an exponential moving average of the student's weights.
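The sketch below illustrates the two ingredients just described (assuming PyTorch): a cross-entropy between sharpened teacher and student output distributions, and an exponential-moving-average update of the teacher. The multi-crop augmentation, projection heads, and centering update are simplified away, and the temperatures are illustrative.

```python
# Sketch of the DINO update: student matches the teacher's (centered, sharpened)
# output distribution; teacher weights are an EMA of the student's.
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(momentum).add_(ps, alpha=1 - momentum)

def dino_loss(student_logits, teacher_logits, center,
              student_temp=0.1, teacher_temp=0.04):
    teacher_probs = F.softmax((teacher_logits - center) / teacher_temp, dim=-1)
    student_logp = F.log_softmax(student_logits / student_temp, dim=-1)
    return -(teacher_probs * student_logp).sum(dim=-1).mean()

# Toy stand-ins for the two identical networks.
student = torch.nn.Linear(768, 4096)
teacher = torch.nn.Linear(768, 4096)
teacher.load_state_dict(student.state_dict())

feats = torch.randn(8, 768)
loss = dino_loss(student(feats), teacher(feats).detach(), center=torch.zeros(4096))
loss.backward()
ema_update(teacher, student)
print(float(loss))
```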
A striking discovery was that self-supervised ViT features trained with DINO contain explicit information about semantic segmentation, even though the model was never trained with segmentation labels or objectives. Attention maps from DINO-trained ViTs clearly delineate object boundaries and distinguish foreground from background. This emergent property has been used directly for unsupervised object discovery, copy detection, and dense feature extraction without any task-specific labels.
DINOv2, released by Meta AI in 2023, scaled this approach to 142 million curated images and produced general-purpose visual features that transferred strongly across a wide range of tasks without fine-tuning [9]. The curated LVD-142M dataset was assembled by retrieving images from a large pool of uncurated web data that were close in feature space to images in several smaller curated datasets, an automated curation procedure that effectively reproduced the quality benefits of human-curated data at much larger scale. The largest released model is a ViT-g/14 with about 1.1 billion parameters. DINOv2 features outperformed OpenCLIP and other foundation models on linear evaluation across classification, segmentation, depth estimation, and instance retrieval benchmarks.
In August 2025, Meta released DINOv3 with 7 billion parameters, trained on 1.7 billion images. DINOv3 introduced image-text alignment (similar to CLIP) following the LiT recipe, in which a text encoder is trained from scratch to match a frozen visual encoder's features through a contrastive loss. It also introduced Gram anchoring for teacher-student self-distillation, a regularization technique that preserves patch-level Gram-matrix correlations during long training schedules to prevent dense feature degradation. Combined with axial RoPE (Rotary Positional Embeddings) and a high-resolution post-training phase that fine-tunes on 512 px and 768 px crops, DINOv3 outperformed DINOv2 by +6 mIoU on ADE20K semantic segmentation and showed particularly strong gains on dense prediction tasks where patch-level feature quality matters most.
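The sketch below gives a loose illustration of the Gram-anchoring idea as described above: penalizing drift between the patch-feature Gram matrices of the current student and an earlier anchor teacher. It is an assumption-laden simplification of the concept, not DINOv3's actual formulation.

```python
# Gram-anchoring-style regularizer (illustrative): keep the student's patch-level
# feature correlations close to those of a frozen anchor so dense features do not degrade.
import torch
import torch.nn.functional as F

def gram(patch_feats):                        # (B, N, D) -> (B, N, N)
    patch_feats = F.normalize(patch_feats, dim=-1)
    return patch_feats @ patch_feats.transpose(1, 2)

student_patches = torch.randn(2, 196, 768, requires_grad=True)
anchor_patches = torch.randn(2, 196, 768)     # frozen earlier-teacher features

gram_loss = F.mse_loss(gram(student_patches), gram(anchor_patches))
gram_loss.backward()
print(float(gram_loss))
```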
EVA, proposed by Yuxin Fang and colleagues at BAAI in late 2022, took the masked image modeling idea in a different direction: rather than reconstructing pixels (MAE) or discrete tokens (BEiT), EVA reconstructs the visible-image-conditioned features of a frozen CLIP vision encoder [10]. This means the pre-training target is itself a learned, language-aligned representation, which transferred unusually well to downstream tasks. EVA scaled to one billion parameters and reached 89.6% top-1 on ImageNet at 336 px.
EVA-02 (Fang et al. 2023) updated the architecture with SwiGLU feed-forward layers, rotary position embeddings, and a sub-LN normalization scheme, and reached 90.0% ImageNet top-1 with only 304 million parameters by pre-training on ImageNet-22K with masked image modeling using EVA-CLIP as the teacher. EVA-CLIP itself, a CLIP variant trained with EVA features, reached 80.4% zero-shot top-1 on ImageNet using only about one-sixth of the parameters of the previous best open CLIP, and has since become a popular open-source vision encoder for multimodal LLMs.
A major reason ViT now dominates vision research is that the same architecture used for text in modern large language models can read images directly, allowing them to be combined trivially. Almost every multimodal model released since 2022 uses a ViT-derived vision encoder.
CLIP (Contrastive Language-Image Pre-training), introduced by OpenAI in January 2021, trained a vision transformer (or CNN) jointly with a text encoder using contrastive learning on 400 million image-text pairs from the internet [12]. By learning to align visual and textual representations in a shared embedding space, CLIP enabled zero-shot image classification: the model could classify images into categories it had never explicitly been trained on, simply by comparing image embeddings with text embeddings of category descriptions.
The largest CLIP model trained on this set was a ViT-L/14, with a higher-resolution variant (ViT-L/14@336px) fine-tuned for one extra epoch at 336 x 336 pixels. The largest ViT took roughly 12 days to train on 256 V100 GPUs. CLIP's vision encoder has become one of the most widely used visual backbones in the field. It serves as the visual component in multimodal models such as LLaVA, GPT-4V, and numerous other vision-language models.
LiT (Locked-image Tuning), introduced by Zhai et al. at Google Research in 2022, observed that you can keep a strong pre-trained image encoder frozen and only train the text tower to align to it, which often outperforms training both encoders jointly from scratch. LiT became the basis for several follow-up image-text models including DINOv3's text alignment phase.
SigLIP (Sigmoid Loss for Language Image Pre-Training), introduced by Xiaohua Zhai and colleagues at Google in March 2023, replaced the softmax-based contrastive loss of CLIP with a sigmoid loss applied independently to each image-text pair [14]. Because the sigmoid loss does not require global normalization across the batch, it eliminates the need to materialize the full N x N similarity matrix and reduces inter-GPU communication. Practical consequences include better performance at small batch sizes, more memory headroom, and the ability to scale to very large effective batches when desired. SigLIP performed best at a batch size of 32k, while CLIP's softmax loss needed 98k for its optimum and still did not match the sigmoid variant. SigLIP's vision tower includes ViT-B/16, ViT-L/16, and SoViT-400m/14 (a shape-optimized variant from a separate Google paper).
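The contrast between the two losses is easiest to see side by side. The sketch below (assuming PyTorch) shows CLIP's batch-normalized softmax contrastive loss next to a per-pair sigmoid loss; temperatures and the bias initialization are illustrative simplifications of the published recipes.

```python
# CLIP-style softmax contrastive loss vs. SigLIP-style sigmoid loss (sketch).
import torch
import torch.nn.functional as F

def clip_softmax_loss(img, txt, temperature=0.07):
    img, txt = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
    logits = img @ txt.t() / temperature                  # full (N, N) similarity matrix
    targets = torch.arange(len(img))
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

def siglip_sigmoid_loss(img, txt, temperature=0.07, bias=-10.0):
    img, txt = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
    logits = img @ txt.t() / temperature + bias
    labels = 2 * torch.eye(len(img)) - 1                  # +1 on the diagonal, -1 elsewhere
    return -F.logsigmoid(labels * logits).mean()          # independent binary decisions

img, txt = torch.randn(16, 512), torch.randn(16, 512)
print(float(clip_softmax_loss(img, txt)), float(siglip_sigmoid_loss(img, txt)))
```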
Flamingo (DeepMind, 2022) was an early demonstration that you could glue a frozen vision encoder to a frozen large language model with a small bridging module and achieve strong few-shot performance on visual question answering, captioning, and OCR. Flamingo used a Normalizer-Free ResNet (NFNet-F6) as its vision encoder rather than a pure ViT, but its core pattern (freeze a vision encoder, freeze a language model, and learn a thin connector with cross-attention) became the template for almost all later vision-language models. Its Perceiver Resampler module compressed variable-length visual feature maps into a fixed-size set of visual tokens consumed by the LLM.
LLaVA (Liu et al., NeurIPS 2023) replaced Flamingo's NFNet with a CLIP ViT-L vision encoder and used a simple projection (initially linear, then a two-layer MLP in LLaVA-1.5) to feed visual features into a Vicuna LLM. LLaVA's two-stage training (first projection-only feature alignment, then visual instruction tuning on GPT-generated multimodal instruction data) delivered an 85.1% relative score against GPT-4 on multimodal tasks at a tiny fraction of the training cost. The same recipe powers most open-source vision-language models such as Qwen-VL, and proprietary systems including GPT-4V, Gemini, and Claude with vision likewise use ViT-style vision encoders feeding their LLM decoders.
Vision transformers have expanded far beyond image classification, becoming foundational components across virtually all areas of computer vision.
DEtection TRansformer (DETR), introduced by Nicolas Carion and colleagues at Meta AI in 2020, reimagined object detection as a direct set prediction problem [11]. DETR uses a CNN backbone to extract features, then passes them through a transformer encoder-decoder architecture. The decoder outputs a fixed set of predictions in parallel, eliminating the need for hand-designed components like anchor boxes, non-maximum suppression, and region proposal networks that were central to earlier detectors like Faster R-CNN.
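The set-prediction idea can be illustrated with a few lines of code (assuming PyTorch and SciPy): a fixed set of query predictions is matched one-to-one to ground-truth boxes with the Hungarian algorithm, so no anchors or NMS are needed. DETR's real matching cost combines class probabilities and box terms; a plain L1 box cost stands in here.

```python
# Sketch of DETR-style bipartite matching between predictions and ground truth.
import torch
from scipy.optimize import linear_sum_assignment

num_queries, num_gt = 100, 3
pred_boxes = torch.rand(num_queries, 4)        # (cx, cy, w, h) in [0, 1]
gt_boxes = torch.rand(num_gt, 4)

cost = torch.cdist(pred_boxes, gt_boxes, p=1)            # (100, 3) pairwise L1 cost
pred_idx, gt_idx = linear_sum_assignment(cost.numpy())   # optimal one-to-one matching
print(list(zip(pred_idx.tolist(), gt_idx.tolist())))     # the 3 matched query indices
# All unmatched queries are trained to predict the "no object" class.
```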
While the original DETR was slower to converge than traditional detectors, subsequent variants (Deformable DETR, DINO-DETR, Co-DETR, RT-DETR) addressed convergence speed and achieved state-of-the-art detection results. The DETR paradigm fundamentally simplified the object detection pipeline. Many modern detectors now use a Swin or ViT backbone (for example ViTDet, Mask DINO, and Co-DETR with Swin-L) and a DETR-style transformer head.
The Segment Anything Model (SAM), released by Meta AI in April 2023, demonstrated the power of vision transformers for interactive and promptable segmentation. SAM uses a ViT-based image encoder (ViT-B, ViT-L, or ViT-H, with the largest ViT-H having about 636 million parameters and 32 transformer layers) to produce image embeddings, which are then decoded into segmentation masks based on user prompts (points, bounding boxes, or text). Trained on the SA-1B dataset of more than one billion masks across 11 million licensed images, SAM could segment virtually any object in any image in a zero-shot manner.
SAM 2 (Meta AI, August 2024) extended the approach to video by adding a streaming memory architecture: a memory attention module conditions the current frame's features on past frames and earlier prompts, a memory encoder produces compact representations of past predictions, and a memory bank stores spatial features and object pointers. The result is a single model that handles both image and video segmentation, runs at real-time speed on a single GPU, segments video with roughly three times fewer interactions than prior approaches, and is six times faster than the original SAM on still images. SAM 3, announced in 2025, introduced concept-based segmentation, where a single text prompt or image exemplar can find and segment every instance of a visual concept across images and videos.
Beyond CLIP, ViT vision encoders feed essentially every production vision-language model: GPT-4V uses a vision encoder pre-trained at OpenAI, Gemini uses a Google vision encoder closely related to SigLIP, Claude with vision uses Anthropic's internal encoder, and most open-source VLMs (LLaVA, Qwen-VL, InternVL, MiniCPM-V) use CLIP ViT-L, SigLIP, or EVA-CLIP. The standard pattern is: ViT vision encoder, then a connector (linear, MLP, Q-Former, or Perceiver Resampler), then an LLM. The choice of vision encoder is one of the most consequential design decisions in modern multimodal AI.
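A minimal sketch of that standard pattern, assuming PyTorch; the dimensions (ViT-L-like features projected into a 4,096-wide LLM, a 24 x 24 patch grid) are illustrative, and the encoders and LLM are replaced by random stand-ins.

```python
# Sketch of the ViT -> connector -> LLM pattern: a frozen ViT produces patch
# features, a small MLP projects them into the LLM's hidden size, and the
# resulting visual tokens are concatenated with the text embeddings.
import torch
import torch.nn as nn

vit_dim, llm_dim = 1024, 4096                       # e.g. ViT-L features -> LLM width

connector = nn.Sequential(                          # LLaVA-1.5-style two-layer MLP
    nn.Linear(vit_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim),
)

image_features = torch.randn(1, 576, vit_dim)       # e.g. 24x24 patch grid from the ViT
text_embeddings = torch.randn(1, 32, llm_dim)       # embedded text prompt tokens

visual_tokens = connector(image_features)           # (1, 576, 4096)
llm_input = torch.cat([visual_tokens, text_embeddings], dim=1)
print(llm_input.shape)                              # torch.Size([1, 608, 4096])
```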
Vision transformers have also found roles in generative models. U-ViT (Bao et al. 2022) was an early ViT backbone for diffusion that treated time, condition, and noisy image patches as a single token sequence and added long skip connections in the spirit of U-Net, achieving record FID scores of 2.29 on class-conditional ImageNet 256 x 256 and 5.48 on text-to-image MS-COCO. The Diffusion Transformer (DiT), introduced by William Peebles and Saining Xie in 2023, took an even more vanilla path: it used standard ViT blocks with adaLN-Zero conditioning to inject the diffusion timestep and class label, and demonstrated that the FID-50K on ImageNet 256 x 256 dropped from 3.60 (LDM, U-Net) to 2.27 with DiT-XL/2. DiT and its descendants (DiT-XL/2, MM-DiT) form the backbone of several leading image and video generation systems, including Sora by OpenAI and Stable Diffusion 3 by Stability AI.
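The adaLN-Zero mechanism mentioned above is compact enough to sketch (assuming PyTorch): scale, shift, and gate parameters are regressed from the conditioning embedding, and the regression layer is zero-initialized so each block starts as the identity. This is a simplified single-branch illustration, not the full DiT block.

```python
# Sketch of adaLN-Zero conditioning as used in DiT-style diffusion transformers.
import torch
import torch.nn as nn

class AdaLNZero(nn.Module):
    def __init__(self, dim=1152, cond_dim=1152):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_mod = nn.Linear(cond_dim, 3 * dim)
        nn.init.zeros_(self.to_mod.weight)            # zero-init => identity at start
        nn.init.zeros_(self.to_mod.bias)

    def forward(self, x, cond, sublayer):
        shift, scale, gate = self.to_mod(cond).unsqueeze(1).chunk(3, dim=-1)
        h = self.norm(x) * (1 + scale) + shift         # adaptive LayerNorm
        return x + gate * sublayer(h)                  # gated residual, starts at zero

tokens = torch.randn(2, 256, 1152)                    # noisy image patch tokens
cond = torch.randn(2, 1152)                           # timestep + class embedding
print(AdaLNZero()(tokens, cond, sublayer=nn.Linear(1152, 1152)).shape)
```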
ViTs extend naturally to video by treating the temporal dimension as additional tokens. TimeSformer (Bertasius et al. 2021) introduced divided space-time attention, factorizing 3D self-attention into separate temporal and spatial attention blocks, and reached state-of-the-art on Kinetics-400 and Kinetics-600 without any convolutions. ViViT (Arnab et al., ICCV 2021) explored several factorization strategies, including factorized encoder, factorized self-attention, and factorized dot-product, and showed how to leverage pre-trained image ViTs for video by inflating spatial weights along the time axis. Video Swin Transformer extended Swin's shifted windows to space-time. VideoMAE (Tong et al. 2022) extended MAE pre-training to video with very high masking ratios (90 to 95%) and showed strong scaling on Something-Something v2 and Kinetics-400.
The ViT recipe of "slice into patches, add positional encoding, run a transformer" generalized to many other spatial inputs. Medical imaging uses ViTs for radiology and pathology, often pre-trained with MAE on large unlabeled image archives. Satellite imagery analysis uses ViTs (Prithvi, Clay) trained on Sentinel-2 and Landsat data. Point cloud processing tokenizes voxels or local point patches; Point-MAE and PointBERT applied masked-token pre-training to 3D points. Even tabular and time-series models have borrowed the ViT recipe for sequence-of-patches input.
The following table summarizes representative results on the ImageNet-1K benchmark across CNN and ViT-based architectures. Pre-training datasets and model sizes vary, so direct comparisons should be interpreted with context.
| Model | Type | Parameters | Pre-training Data | ImageNet Top-1 (%) |
|---|---|---|---|---|
| ResNet-50 | CNN | 25M | ImageNet-1K | 76.1 |
| ResNet-152 | CNN | 60M | ImageNet-1K | 78.3 |
| EfficientNet-B7 | CNN | 66M | ImageNet-1K | 84.3 |
| EfficientNetV2-L | CNN | 120M | ImageNet-21K | 85.7 |
| ConvNeXt-XL | CNN | 350M | ImageNet-22K | 87.8 |
| ViT-B/16 | ViT | 86M | ImageNet-1K | 77.9 |
| ViT-B/16 | ViT | 86M | ImageNet-21K | 84.0 |
| ViT-L/16 | ViT | 307M | ImageNet-21K | 85.3 |
| ViT-H/14 | ViT | 632M | JFT-300M | 88.55 |
| DeiT-B (distilled) | ViT | 86M | ImageNet-1K | 85.2 |
| Swin-L | ViT (hierarchical) | 197M | ImageNet-22K | 87.3 |
| SwinV2-G | ViT (hierarchical) | 3B | ImageNet-22K + ext | 84.0 (ImageNet-V2) |
| BEiT-L | ViT | 307M | ImageNet-21K (self-supervised) | 86.3 |
| MAE (ViT-H) | ViT | 632M | ImageNet-1K (self-supervised) | 87.8 |
| EVA | ViT | 1.0B | Merged (MIM + CLIP) | 89.6 |
| EVA-02 | ViT | 304M | ImageNet-22K (MIM) | 90.0 |
| CoCa | ViT + Text | 2.1B | Multimodal | 91.0 |
A clear trend emerges from these results. ViTs trained only on ImageNet-1K lag behind well-optimized CNNs of similar size. But with larger pre-training datasets or self-supervised objectives, ViTs consistently outperform CNNs. The best-performing models in 2025 are either pure ViTs or multimodal systems with ViT visual encoders.
ViT is supported by every major deep learning framework and a number of dedicated libraries:
| Library | Maintainer | Notes |
|---|---|---|
| timm (PyTorch Image Models) | Ross Wightman / Hugging Face | Reference high-quality implementations of ViT, DeiT, Swin, BEiT, EVA, ConvNeXt, plus thousands of pretrained weights |
| torchvision.models.vit_* | PyTorch / Meta | First-party ViT-B/L/H implementations with ImageNet weights |
| Hugging Face transformers | Hugging Face | ViTModel, ViTForImageClassification, plus DeiT, Swin, BEiT, MAE, DINOv2 wrappers |
| Hugging Face diffusers | Hugging Face | DiT and U-ViT implementations for image and video generation |
| Big Vision (Google) | Google Research | JAX/Flax implementation of ViT, MAE, SigLIP, used by the original ViT and ViT-22B papers |
| Lucidrains vit-pytorch | Phil Wang | Compact PyTorch reference for ViT and many follow-up variants |
| TensorFlow Model Garden | Google / TensorFlow | ViT and Swin reference implementations |
For production inference, ViTs are commonly deployed through ONNX Runtime, TensorRT, or vLLM-style serving stacks; for edge deployment they are typically distilled, quantized to INT8 or 4-bit, and slimmed with token-reduction methods such as ToMe (Token Merging) and DynamicViT (token pruning).
The choice of patch size P is the dominant lever for trading accuracy against compute in ViT. Smaller patches produce longer sequences (N grows quadratically with 1/P), which boosts spatial detail but multiplies the cost of self-attention and the size of intermediate activations. The most common choices and their typical use cases are summarized below.
| Patch size | Sequence length at 224 px | Typical use |
|---|---|---|
| 32 x 32 | 49 | ViT-B/32 fast baseline; classical CLIP variant |
| 16 x 16 | 196 | Default in original ViT, MAE, BEiT, SigLIP |
| 14 x 14 | 256 | CLIP ViT-L/14, DINOv2, EVA-CLIP |
| 8 x 8 | 784 | Detail-sensitive tasks (fine-grained classification, dense matching) |
| 4 x 4 | 3,136 | Swin Transformer initial stage |
Resolution further compounds this scaling. Doubling the resolution from 224 to 448 quadruples the sequence length and roughly multiplies attention cost by sixteen, which is why high-resolution ViTs almost always use hierarchical attention (Swin), windowed attention with global tokens (Hiera, MViT), or efficient-attention substitutes (Linformer, Performer, Mamba-style state-space backbones).
The introduction of ViT triggered a paradigm shift in computer vision research. Several developments can be traced directly to its influence.
First, ViT demonstrated that domain-specific architectural inductive biases (like convolutions) are not strictly necessary for strong visual understanding. Given enough data and compute, a general-purpose architecture can learn the relevant patterns. This insight aligned computer vision with the broader trend in AI toward scaling general architectures rather than engineering task-specific ones.
Second, ViT unified the architectural foundations of vision and language. Because both modalities now use transformer encoders, building multimodal systems became significantly more straightforward. Models like CLIP, Flamingo, and GPT-4V leverage shared transformer components for both visual and textual processing, enabling capabilities that would have been difficult to achieve with separate CNN and RNN pipelines.
Third, ViT accelerated the adoption of self-supervised pre-training in vision. Techniques like MAE, DINO, and BEiT drew direct inspiration from masked language modeling in NLP, and these methods proved highly effective precisely because the transformer architecture is shared across domains. Self-supervised ViT features now serve as general-purpose visual representations across dozens of downstream tasks.
Fourth, the success of ViT contributed to the rise of foundation models in vision. Rather than training specialized models for each task, the field moved toward training large, general-purpose visual encoders once and then adapting them to specific tasks through fine-tuning, linear probing, or prompting. DINOv2 and DINOv3 are explicit examples: a single set of frozen weights produces features that drive classification, segmentation, depth estimation, and retrieval on dozens of benchmarks.
Despite its dominance, ViT has well-known weaknesses that motivate active research. They include:

- Quadratic self-attention cost, which makes high-resolution inputs expensive in both compute and memory.
- Heavy data and compute requirements when training from scratch without large-scale or self-supervised pre-training.
- Weaker fine-grained local detail than CNNs at smaller data scales.
- Position embeddings tied to the training resolution, requiring interpolation or alternative encoding schemes when the input size changes.
- Costly edge and mobile deployment without distillation, quantization, or token reduction.
As of early 2026, vision transformers have firmly established themselves as the dominant architecture in computer vision research and are increasingly deployed in production systems.
The strict dichotomy between CNNs and transformers has given way to a more nuanced landscape. Hybrid architectures like ConvNeXt V2 incorporate design principles from both paradigms. Many modern vision transformers include convolutional elements in their patch embedding layers or use depthwise convolutions in their feed-forward networks. Conversely, recent CNN designs borrow attention mechanisms and training recipes from the ViT literature.
A major area of active research involves making ViTs practical for deployment on edge devices and in latency-sensitive applications. Token pruning and routing techniques (DynamicViT, A-ViT, ToMe) allow models to dynamically allocate computation only to informative image regions, reducing inference time by up to 50% while maintaining accuracy. Quantization, distillation, and architectural simplifications have produced compact ViT variants suitable for mobile deployment. FlashAttention and FlashAttention-2 have made global attention practical at much higher resolutions on modern GPUs.
The largest vision transformers now serve as universal visual backbones. DINOv2 and DINOv3 produce features that transfer effectively to classification, segmentation, depth estimation, and other tasks without any fine-tuning. EVA-02 achieves 90.0% ImageNet accuracy with only 304 million parameters. These models, along with multimodal systems like CLIP and SigLIP, have become standard building blocks in the AI stack. Vision encoders are now treated as a commodity layer in the multimodal LLM stack: pick a strong frozen ViT, project its features into the LLM's token space, and train only the connector and the LLM.
Vision transformers have expanded beyond 2D images into video understanding, 3D point cloud processing, medical imaging, satellite imagery analysis, and autonomous driving perception. The flexibility of the patch-based tokenization scheme allows ViTs to process diverse spatial data formats with minimal architectural changes. Video models treat temporal frames as additional tokens, while 3D models tokenize voxels or point cloud patches. Tesla's autopilot, Waymo's perception stack, and most modern surgical robotics systems include ViT or Swin backbones somewhere in their pipeline.
Despite their success, several challenges remain. The quadratic complexity of self-attention with respect to sequence length limits the resolution at which ViTs can efficiently process images. Training large ViTs from scratch still requires substantial computational resources. And while ViTs excel at capturing global patterns, they can struggle with fine-grained local details compared to CNNs, particularly at lower data scales. Recent work on linear attention approximations, mixture-of-experts ViTs, state-space models such as Vision Mamba, and hybrid attention-convolution backbones such as Hiera and FastViT continues to attack these limits.