The Vision Transformer (ViT) is a deep learning architecture that applies the transformer model, originally designed for natural language processing (NLP), directly to image recognition tasks. Introduced by Alexey Dosovitskiy and colleagues at Google Research in the 2020 paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale," ViT demonstrated that a pure transformer applied to sequences of image patches can perform image classification at a level that matches or exceeds state-of-the-art convolutional neural networks (CNNs), provided the model is pre-trained on sufficiently large datasets.
Before ViT, CNNs had dominated computer vision for nearly a decade following the success of AlexNet in 2012. While several researchers had explored incorporating self-attention mechanisms into convolutional architectures or using attention alongside convolutions, ViT was among the first models to show that convolutions could be removed entirely. The model treats an image as a sequence of flattened patches, much like a sentence is a sequence of words, and processes these patches using a standard transformer encoder. This conceptually simple approach opened an entirely new research direction in computer vision.
ViT was published at ICLR 2021 and has since become one of the most cited papers in machine learning, spawning a large family of vision transformer variants including DeiT, Swin Transformer, BEiT, MAE, DINO, and DINOv2.
Imagine you have a big picture of a cat. Instead of looking at the whole picture at once, you cut it into a grid of small squares (like puzzle pieces). Then you line up all those little squares in a row and give them to a really smart reader that normally reads sentences. The reader looks at all the puzzle pieces together and figures out what the picture is showing. That is basically what a Vision Transformer does: it chops up a picture into patches and reads them like words in a sentence to understand what the image contains.
The ViT architecture closely follows the encoder portion of the original transformer proposed by Vaswani et al. (2017). The key innovation lies in how images are converted into a sequence of tokens that the transformer can process. The architecture consists of several distinct stages: patch embedding, position embedding, a class token, multiple transformer encoder layers, and a classification head.
An input image of resolution H x W with C color channels is divided into a grid of non-overlapping patches, each of size P x P pixels. For a standard 224 x 224 image with a patch size of 16 x 16, this produces a sequence of N = (224 / 16) x (224 / 16) = 196 patches. Each patch is then flattened into a one-dimensional vector of length P x P x C. For a 16 x 16 RGB patch, this gives a vector of length 16 x 16 x 3 = 768 values.
Each flattened patch vector is projected through a trainable linear layer (a matrix multiplication) to produce a D-dimensional embedding vector. In ViT-Base, D = 768. This linear projection is mathematically equivalent to applying a single convolutional layer with kernel size and stride both equal to the patch size. The result is a sequence of N patch embeddings, each of dimension D.
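The patchify-and-project step above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the reference implementation: the random matrix `W_embed` stands in for the trained projection weights, and all variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

H = W = 224   # input resolution
C = 3         # RGB channels
P = 16        # patch size
D = 768       # embedding dimension (ViT-Base)

image = rng.standard_normal((H, W, C))

# Split the image into non-overlapping P x P patches and flatten each one.
n_side = H // P                              # 14 patches per side
patches = image.reshape(n_side, P, n_side, P, C)
patches = patches.transpose(0, 2, 1, 3, 4)   # (14, 14, 16, 16, 3)
patches = patches.reshape(-1, P * P * C)     # (196, 768)

# Trainable linear projection (a random stand-in for the learned matrix).
W_embed = rng.standard_normal((P * P * C, D)) * 0.02
patch_embeddings = patches @ W_embed         # (196, 768)

print(patches.shape)            # (196, 768)
print(patch_embeddings.shape)   # (196, 768)
```

Note that flattened patch length (16 x 16 x 3 = 768) and embedding dimension D = 768 coincide only for this particular configuration; in general they differ.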
Following the convention established by BERT in NLP, ViT prepends a special learnable [CLS] (class) token to the sequence of patch embeddings. This token does not correspond to any image patch. Instead, it serves as a summary representation of the entire image. After passing through all transformer encoder layers, the output vector corresponding to the [CLS] token position is used as the input to the classification head. This approach allows the model to aggregate information from all patches into a single representation through the self-attention mechanism.
Because the transformer architecture is permutation-invariant (it treats its input as a set rather than a sequence), spatial information about where each patch was located in the original image must be added explicitly. ViT uses learnable 1D positional embeddings, one for each position in the sequence (including the [CLS] token position). These position embeddings are added element-wise to the corresponding patch embeddings before the sequence enters the transformer encoder.
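Prepending the [CLS] token and adding position embeddings amounts to a concatenation followed by an element-wise sum. A minimal sketch, with random arrays standing in for the learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 196, 768   # number of patches and embedding dim (ViT-B/16 at 224 x 224)

patch_embeddings = rng.standard_normal((N, D))

# Learnable parameters (random stand-ins here).
cls_token = rng.standard_normal((1, D))
pos_embed = rng.standard_normal((N + 1, D))   # one slot per position, incl. [CLS]

# Prepend the [CLS] token, then add position embeddings element-wise.
tokens = np.concatenate([cls_token, patch_embeddings], axis=0)  # (197, 768)
tokens = tokens + pos_embed

print(tokens.shape)  # (197, 768)
```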
The original paper experimented with both 1D and 2D positional embeddings and found no significant difference in performance, suggesting that the model can learn to infer 2D spatial structure from 1D positional information. During fine-tuning at higher resolutions (which increases the number of patches), the pre-trained position embeddings are interpolated using 2D interpolation to accommodate the longer sequence.
The core of ViT is a stack of L identical transformer encoder layers. Each layer consists of two sub-layers:
Multi-head self-attention (MSA): The input sequence is projected into queries (Q), keys (K), and values (V) using learned linear projections. Attention scores are computed as the scaled dot product of queries and keys, then used to compute a weighted sum of values. Multiple attention heads operate in parallel, each attending to different aspects of the input. The outputs of all heads are concatenated and projected through another linear layer.
MLP (feed-forward network): A two-layer feed-forward neural network with a GELU activation function between the layers. The hidden dimension of the MLP is typically 4 times the model dimension D (for example, 3072 for ViT-Base with D = 768).
Each sub-layer is preceded by layer normalization and wrapped with a residual connection. This "pre-norm" arrangement differs from the original transformer, which applies layer normalization after each sub-layer (post-norm); the pre-norm design has been found to improve training stability.
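The two sub-layers can be sketched as a single pre-norm encoder layer in NumPy. This is a simplified illustration under stated assumptions: random matrices stand in for learned weights, layer normalization omits the learned scale and shift, and biases are dropped throughout.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, heads = 197, 768, 12   # ViT-Base sizes; per-head dim = 64
d_h = D // heads

def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

# Random stand-ins for the learned weight matrices.
Wq, Wk, Wv, Wo = (rng.standard_normal((D, D)) * 0.02 for _ in range(4))
W1 = rng.standard_normal((D, 4 * D)) * 0.02   # MLP hidden dim = 4 x D
W2 = rng.standard_normal((4 * D, D)) * 0.02

def encoder_layer(x):
    # --- multi-head self-attention sub-layer (pre-norm + residual) ---
    h = layer_norm(x)
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    # split into heads: (heads, N, d_h)
    q, k, v = (t.reshape(N, heads, d_h).transpose(1, 0, 2) for t in (q, k, v))
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_h))   # (heads, N, N)
    out = (attn @ v).transpose(1, 0, 2).reshape(N, D)         # concat heads
    x = x + out @ Wo
    # --- MLP sub-layer (pre-norm + residual) ---
    h = layer_norm(x)
    x = x + gelu(h @ W1) @ W2
    return x

tokens = rng.standard_normal((N, D))
out = encoder_layer(tokens)
print(out.shape)  # (197, 768)
```

Stacking L such layers (12 for ViT-Base) gives the full encoder.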
During pre-training, the classification head is a small MLP with one hidden layer. During fine-tuning, it is replaced by a single linear layer that maps the [CLS] token output to the number of target classes.
The original ViT paper defined three model sizes, with naming conventions borrowed from BERT:
| Model | Layers | Hidden size (D) | MLP size | Attention heads | Parameters |
|---|---|---|---|---|---|
| ViT-Base (ViT-B) | 12 | 768 | 3072 | 12 | 86M |
| ViT-Large (ViT-L) | 24 | 1024 | 4096 | 16 | 307M |
| ViT-Huge (ViT-H) | 32 | 1280 | 5120 | 16 | 632M |
The notation ViT-B/16 or ViT-L/16 indicates the model size followed by the patch size. Smaller patch sizes (such as 14 x 14) increase the sequence length and computational cost but generally improve performance because the model processes the image at higher effective resolution.
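The effect of patch size and input resolution on sequence length follows directly from N = (H/P) x (W/P). A small illustrative helper (the function name is an assumption):

```python
# Sequence length (patch count) for a square image of side `res`
# divided into patches of side `patch`.
def num_patches(res: int, patch: int) -> int:
    assert res % patch == 0, "resolution must be divisible by the patch size"
    return (res // patch) ** 2

print(num_patches(224, 16))  # 196 -> ViT-B/16 at 224 x 224
print(num_patches(224, 14))  # 256 -> a /14 model at 224 x 224
print(num_patches(384, 16))  # 576 -> fine-tuning at higher resolution
```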
A defining characteristic of ViT is its dependence on large-scale pre-training. Unlike CNNs, which incorporate inductive biases such as locality and translation equivariance through convolutional filters, ViT has very few built-in assumptions about image structure. This means the model must learn spatial relationships entirely from data, requiring substantially more training examples to reach competitive performance.
The original paper evaluated ViT using three pre-training datasets of increasing size:
| Dataset | Images | Description |
|---|---|---|
| ImageNet-1K (ILSVRC-2012) | ~1.3M | Standard classification benchmark with 1,000 classes |
| ImageNet-21K | ~14M | Superset of ImageNet with 21,843 classes |
| JFT-300M | ~300M | Google's internal dataset with 18,291 classes |
The key finding was that ViT models underperformed comparable CNN models (such as BiT, a ResNet-based model) when pre-trained only on ImageNet-1K or ImageNet-21K. However, when pre-trained on JFT-300M, ViT models matched or exceeded the best CNN results while using fewer computational resources.
The following table summarizes top-1 accuracy on ImageNet (after pre-training on the indicated dataset and fine-tuning on ImageNet-1K):
| Model | Pre-training dataset | ImageNet top-1 accuracy |
|---|---|---|
| ViT-B/16 | ImageNet-21K | ~84.0% |
| ViT-L/16 | ImageNet-21K | ~85.3% |
| ViT-L/16 | JFT-300M | 87.76% |
| ViT-H/14 | JFT-300M | 88.55% |
| BiT-L (ResNet152x4) | JFT-300M | 87.54% |
| Noisy Student (EfficientNet-L2) | JFT-300M + unlabeled | 88.4% |
ViT-H/14 pre-trained on JFT-300M achieved 88.55% top-1 accuracy on ImageNet, setting a new state of the art at the time of publication. It also required substantially fewer TPUv3-core-days to train (approximately 2,500) compared to prior state-of-the-art models (9,900 to 12,300 TPUv3-core-days).
The relationship between ViTs and CNNs is one of the most studied topics in modern computer vision. The two architectures differ in several fundamental ways:
| Property | CNN | Vision Transformer |
|---|---|---|
| Core operation | Convolution (local filters) | Self-attention (global pairwise) |
| Inductive bias | Strong: locality, translation equivariance, weight sharing | Weak: only sequence ordering via position embeddings |
| Receptive field | Grows gradually across layers | Global from the first layer |
| Data efficiency | Higher; learns well from smaller datasets | Lower; needs large-scale pre-training |
| Scalability | Accuracy saturates earlier with more data | Continues to improve with more data |
| Computational cost | Linear in image resolution | Quadratic in number of patches (sequence length) |
| Feature hierarchy | Built-in through pooling and stride | Flat (unless using hierarchical variants like Swin) |
CNNs embed strong assumptions about visual data into their architecture. Convolutional filters enforce locality (each filter looks at a small region), weight sharing (the same filter slides across the entire image), and translation equivariance (a shifted input produces a correspondingly shifted output). These biases make CNNs data-efficient because the model does not need to learn these properties from scratch.
ViT, by contrast, has almost no image-specific inductive bias. The only spatial information comes from the position embeddings. The self-attention mechanism allows every patch to attend to every other patch from the very first layer, giving the model a global receptive field but requiring it to learn local patterns (edges, textures) from data rather than from architectural constraints.
When trained on mid-sized datasets like ImageNet-1K alone, ViT models underperform equivalent CNNs. This is because ViT must use its large capacity to learn the basic visual priors that CNNs get for free from convolutions. However, as the training data grows beyond approximately 10 to 30 million images, ViT models begin to outperform CNNs and continue to improve with more data, while CNN performance tends to plateau.
This scaling behavior has been interpreted as evidence that the transformer architecture has higher capacity than CNNs; it simply needs enough data to realize that capacity.
Research has shown that ViTs tend to be more robust to certain types of input perturbations, including adversarial patches and image corruptions, compared to CNNs. ViTs also appear to rely more on shape-based features and global structure rather than local texture, which may contribute to their robustness.
Since the original ViT paper, numerous variants have been proposed to address its limitations. These include improved training strategies, hierarchical architectures, and self-supervised pre-training methods.
Published by Touvron et al. at Facebook AI Research in 2021, DeiT demonstrated that vision transformers can be trained competitively on ImageNet-1K alone, without requiring hundreds of millions of images. The key contributions were a carefully tuned training recipe with strong data augmentation and regularization, and a transformer-specific distillation procedure in which a learnable distillation token lets the student transformer learn from a convolutional teacher network.
DeiT-Base achieved 83.1% top-1 accuracy on ImageNet without any external data, and 85.2% with distillation. The model was trained on a single machine in less than three days, making vision transformers accessible to researchers without access to massive compute clusters.
Proposed by Ze Liu et al. at Microsoft Research in 2021, the Swin Transformer introduced a hierarchical vision transformer architecture that computes self-attention within local windows rather than globally. Its main innovations are a hierarchical feature pyramid, built by progressively merging neighboring patches in deeper stages, and shifted window partitioning, which offsets the window grid between consecutive layers so that information can flow across window boundaries while attention cost remains linear in image size.
Swin Transformer won the ICCV 2021 Marr Prize (best paper award). Swin-L achieved 87.3% top-1 on ImageNet-1K and set new state-of-the-art results on COCO object detection (58.7 box AP) and ADE20K semantic segmentation (53.5 mIoU).
Swin Transformer V2, published in 2022, scaled the architecture to 3 billion parameters and introduced techniques such as residual post-normalization, cosine attention, and log-spaced continuous position bias to stabilize training at large scale. SwinV2-G achieved 90.2% top-1 accuracy on ImageNet.
Proposed by Bao et al. at Microsoft Research in 2021, BEiT adapted the masked language modeling approach from BERT to vision. The model pre-trains by masking a subset of image patches and training the transformer to predict discrete visual tokens at the masked positions, where the target tokens come from a pre-trained discrete VAE image tokenizer.
BEiT was the first method to demonstrate that self-supervised pre-training of vision transformers could outperform supervised pre-training. BEiT-Base achieved 83.2% top-1 accuracy on ImageNet-1K (compared to 81.8% for DeiT-Base trained from scratch), and BEiT-Large reached 86.3% using only ImageNet-1K data.
Introduced by Kaiming He et al. at Facebook AI Research in 2022, MAE is a self-supervised pre-training method that masks a very high proportion (75%) of image patches and trains the model to reconstruct the raw pixel values of the masked patches. The key design choice is an asymmetric encoder-decoder architecture: the encoder processes only the visible 25% of patches, and a lightweight decoder reconstructs the full image from the encoded patches and mask tokens, which makes pre-training computationally efficient.
MAE pre-training with ViT-Huge on ImageNet-1K achieved 87.8% top-1 accuracy after fine-tuning, outperforming supervised pre-training and other self-supervised methods like DINO, MoCo v3, and BEiT on the same backbone.
Developed by Mathilde Caron et al. at Facebook AI Research in 2021, DINO is a self-supervised learning method based on self-distillation between a student and a teacher network that share the same architecture: the teacher's weights are an exponential moving average of the student's, the two networks receive different augmented crops of the same image, and centering and sharpening of the teacher outputs prevent representational collapse without the need for negative pairs.
DINO revealed several interesting properties of self-supervised ViT features. The attention maps of the [CLS] token in a self-supervised ViT spontaneously learn to segment objects, producing attention patterns that closely follow object boundaries. This property does not emerge as clearly in supervised ViTs or in CNNs. DINO with ViT-Base achieved 80.1% top-1 accuracy on ImageNet using linear evaluation (training only a linear classifier on frozen features).
Published by Maxime Oquab et al. at Meta AI in 2023, DINOv2 scaled the DINO approach to produce general-purpose visual features. Key advances include an automatically curated pre-training dataset of 142 million images (LVD-142M), a combination of image-level and patch-level self-supervised objectives, and the distillation of smaller models from the largest pre-trained model.
DINOv2 models have been widely adopted as feature extractors in both research and production systems.
| Variant | Year | Key idea |
|---|---|---|
| CaiT (Class-Attention in Image Transformers) | 2021 | Separates self-attention among patches from class-attention between patches and [CLS] token |
| PVT (Pyramid Vision Transformer) | 2021 | Hierarchical ViT with spatial reduction attention for dense prediction |
| CrossViT | 2021 | Dual-branch ViT that processes patches at two different scales |
| CSWin Transformer | 2022 | Cross-shaped window self-attention for efficient global modeling |
| EVA | 2022 | Billion-parameter ViT pre-trained with masked image modeling, achieving 89.6% on ImageNet |
| FlexiViT | 2023 | ViT that supports variable patch sizes at inference time |
| ViT-22B | 2023 | Scaled ViT to 22 billion parameters for vision-language tasks |
| SigLIP | 2023 | Improved vision-language pre-training using sigmoid loss instead of softmax |
A major research theme in the ViT literature is self-supervised pre-training, which allows models to learn visual representations from unlabeled images. Three main paradigms have emerged:
Methods like MoCo v3 and DINO train the model to produce similar representations for different augmented views of the same image while pushing apart representations of different images. These methods work with both ViTs and CNNs and typically produce features well-suited for linear evaluation.
Inspired by masked language modeling in NLP, methods like BEiT, MAE, and SimMIM mask portions of the input image and train the model to predict the masked content. The prediction target can be discrete visual tokens (BEiT), raw pixels (MAE), or other representations. Masked image modeling has proven highly effective for pre-training large ViT models and often outperforms contrastive methods when the pre-trained model is fine-tuned end-to-end.
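The random masking at the heart of these methods is straightforward to sketch. A minimal MAE-style example using only the standard library (variable names are illustrative):

```python
import random

random.seed(0)
num_patches = 196      # e.g. a 224 x 224 image with 16 x 16 patches
mask_ratio = 0.75      # MAE masks 75% of patches

num_masked = int(num_patches * mask_ratio)   # 147
indices = list(range(num_patches))
random.shuffle(indices)
masked = sorted(indices[:num_masked])        # positions to reconstruct
visible = sorted(indices[num_masked:])       # positions the encoder sees

print(len(visible))  # 49
```

Because the encoder sees only the visible patches, the effective sequence length during MAE pre-training drops from 196 to 49 tokens.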
Methods like DINO and DINOv2 use a teacher-student framework where both networks share the same architecture. The teacher is updated via an exponential moving average of the student weights, creating a bootstrapping signal. Self-distillation methods produce features with strong emergent properties, including object-level segmentation in attention maps.
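The exponential moving average update that produces the teacher can be written in one line per parameter. A toy sketch with scalar "parameters" (the momentum value 0.996 is one commonly used setting, not a universal constant):

```python
# One EMA update of the teacher parameters from the student parameters,
# as used in DINO-style self-distillation (momentum m close to 1).
def ema_update(teacher, student, m=0.996):
    return [m * t + (1 - m) * s for t, s in zip(teacher, student)]

teacher = [1.0, 0.0]
student = [0.0, 1.0]
teacher = ema_update(teacher, student)
print(teacher)  # approximately [0.996, 0.004]
```

Because m is close to 1, the teacher changes slowly, providing a stable bootstrapping target for the student.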
The self-attention mechanism in ViT has quadratic computational complexity with respect to the number of input tokens (patches). For an image of resolution H x W with patch size P, the number of tokens is N = (H/P) x (W/P). The self-attention operation requires O(N^2 x D) computation, where D is the embedding dimension. This quadratic scaling poses challenges for processing high-resolution images.
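The quadratic growth is easy to quantify: the number of pairwise attention entries per head is N^2. A small helper illustrating the scaling (function name is an assumption):

```python
# Attention cost scales quadratically with token count N = (H/P) * (W/P).
def attention_tokens_and_pairs(res: int, patch: int = 16):
    n = (res // patch) ** 2
    return n, n * n   # tokens, pairwise attention entries per head

for res in (224, 384, 1024):
    n, pairs = attention_tokens_and_pairs(res)
    print(res, n, pairs)
# 224  ->  196 tokens,     38,416 pairs
# 384  ->  576 tokens,    331,776 pairs
# 1024 -> 4096 tokens, 16,777,216 pairs
```

Going from 224 x 224 to 1024 x 1024 input multiplies resolution by about 21x in pixel count but attention cost by over 400x.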
Several strategies have been proposed to address this, including restricting self-attention to local windows (as in the Swin Transformer), hierarchical designs that progressively reduce the number of tokens, token pruning and merging, and linear-complexity approximations of attention.
Despite the higher computational cost of self-attention, ViT models can be more efficient than CNNs in practice at large scale because transformer operations (matrix multiplications) are highly optimized on modern hardware (GPUs and TPUs).
Vision transformers have been adopted across a wide range of computer vision tasks and domains:
This is the original application of ViT. Modern ViT variants and their pre-trained checkpoints consistently achieve top results on benchmarks like ImageNet, CIFAR-100, and Oxford Flowers.
The DETR (Detection Transformer) family uses a transformer decoder to directly predict object bounding boxes and class labels. Hierarchical ViT backbones like Swin Transformer have become standard feature extractors for object detection frameworks such as Mask R-CNN and Cascade R-CNN.
ViT-based models serve as backbones for semantic segmentation architectures. Models like SegFormer and Mask2Former combine ViT encoders with lightweight decoders to produce per-pixel class predictions. DINOv2 features, used with simple linear decoders, achieve competitive segmentation results without task-specific training.
Vision transformers have been widely adopted in medical image analysis for tasks including tumor classification from MRI scans, skin lesion segmentation from dermatoscopic images, and pathology slide analysis. Fine-tuned ViT models have achieved accuracy above 98% on certain brain tumor classification benchmarks.
Extensions like TimeSformer and Video Swin Transformer adapt the ViT architecture for video by applying attention across both spatial and temporal dimensions. TimeSformer uses factorized space-time attention to keep computational costs manageable.
ViT backbones serve as components in modern generative models. Diffusion models such as DiT (Diffusion Transformer) replace the U-Net backbone with a transformer, and vision-language models like CLIP use a ViT as the image encoder.
Models like DPT (Dense Prediction Transformer) use ViT features for monocular depth estimation. DINOv2 features have also shown strong performance on depth estimation tasks using only linear probes.
Despite their success, vision transformers have several known limitations, including their reliance on large-scale pre-training data, the quadratic cost of self-attention at high resolution, the sensitivity of learned position embeddings to changes in input resolution, and the absence of built-in spatial inductive biases, which makes them weaker than CNNs in low-data regimes.
Several approaches combine the strengths of CNNs and transformers, for example by replacing the linear patchify stem with a small stack of convolutional layers, or by interleaving convolutional and attention blocks as in architectures such as CvT, CoAtNet, and LeViT.
These hybrid designs often achieve better performance than either pure CNNs or pure ViTs, especially in data-limited or latency-sensitive settings.
| Year | Development |
|---|---|
| 2017 | Vaswani et al. publish "Attention Is All You Need," introducing the transformer for NLP |
| 2020 | iGPT (Image GPT) applies a GPT-style autoregressive transformer to pixel sequences |
| 2020 | Dosovitskiy et al. introduce ViT, showing pure transformers can match CNNs on image classification |
| 2020 | DETR applies transformers to object detection |
| 2021 | DeiT enables training ViTs on ImageNet-1K without external data |
| 2021 | Swin Transformer introduces hierarchical window-based attention (ICCV best paper) |
| 2021 | BEiT pioneers masked image modeling for ViT pre-training |
| 2021 | DINO shows self-supervised ViTs learn object segmentation |
| 2022 | MAE achieves 87.8% on ImageNet with self-supervised ViT-Huge |
| 2022 | Swin Transformer V2 scales to 3 billion parameters |
| 2022 | EVA reaches 89.6% on ImageNet with a billion-parameter ViT |
| 2023 | DINOv2 produces universal visual features from 142M curated images |
| 2023 | ViT-22B scales vision transformers to 22 billion parameters |
| 2024 | ViT-based architectures used in 113-billion-parameter models for weather and climate prediction |