DeiT (Data-efficient Image Transformers) is a family of vision transformer models developed by Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Herve Jegou at Facebook AI Research (FAIR) and Sorbonne University. Published in December 2020 and presented at ICML 2021, DeiT demonstrated that Vision Transformers (ViTs) could be trained competitively on ImageNet alone, without the massive proprietary datasets (such as JFT-300M) that the original ViT required. The paper introduced a transformer-specific knowledge distillation strategy built around a novel distillation token, reaching up to 85.2% top-1 accuracy on ImageNet using only publicly available data; the reference model trains on a single 8-GPU server in under three days.
The original Vision Transformer (ViT), introduced by Dosovitskiy et al. at Google Research in 2020, showed that a pure transformer architecture could match or exceed convolutional neural networks (CNNs) on image classification when pre-trained on very large datasets. However, ViT-B/16 achieved only 77.9% top-1 accuracy on ImageNet when trained on ImageNet-1k alone, compared to 84.15% when pre-trained on the proprietary JFT-300M dataset containing 300 million labeled images. This reliance on massive datasets limited the accessibility and reproducibility of ViT research.
Convolutional neural networks benefit from built-in inductive biases such as translation equivariance and locality that help them learn effectively from smaller datasets. Transformers, which process images as sequences of patches using global self-attention, lack these spatial priors and therefore need more training data to learn equivalent representations from scratch.
DeiT addressed this gap by combining a carefully engineered training recipe with a novel distillation approach, showing that proper data augmentation, regularization, and optimization strategies could compensate for the absence of large-scale pre-training data.
DeiT shares the same core architecture as ViT. An input image of resolution 224x224 is divided into non-overlapping patches of size 16x16, producing a sequence of 196 patch tokens. Each patch is linearly projected into an embedding vector, and a learnable class token ([CLS]) is prepended to the sequence. Learnable positional embeddings are added to encode spatial information. The resulting sequence passes through a stack of transformer encoder layers, each consisting of multi-head self-attention (MHSA) and a feed-forward network (FFN) with GELU activation.
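The tokenization step above can be sketched in a few lines of numpy. This is a minimal illustration of the shapes involved (196 patch tokens of dimension 16x16x3 = 768, plus a prepended class token); the random projection and positional embeddings stand in for learned parameters and are not the official implementation.

```python
import numpy as np

def patchify(image, patch=16):
    """Split an HxWxC image into flattened, non-overlapping patch vectors."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0
    p = image.reshape(H // patch, patch, W // patch, patch, C)
    p = p.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)
    return p  # (num_patches, patch*patch*C)

rng = np.random.default_rng(0)
img = rng.standard_normal((224, 224, 3))
tokens = patchify(img)                       # (196, 768): 14x14 patches
embed_dim = 768                              # DeiT-B embedding width
W_proj = rng.standard_normal((tokens.shape[1], embed_dim)) * 0.02
x = tokens @ W_proj                          # linear patch projection
cls = np.zeros((1, embed_dim))               # stand-in for the learnable [CLS] token
x = np.concatenate([cls, x], axis=0)         # (197, embed_dim)
x = x + rng.standard_normal(x.shape) * 0.02  # stand-in for positional embeddings
```

After this step the sequence of 197 tokens is what the stack of transformer encoder layers consumes.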
The key architectural addition in the distilled variant (denoted DeiT-B‡) is the distillation token. This is a second learnable token appended to the input sequence alongside the class token. Both tokens interact with all patch tokens and with each other through the self-attention mechanism. At the output of the transformer, two separate linear classifiers are attached: one to the class token embedding and one to the distillation token embedding. During training, the class token head is supervised by the ground-truth labels, while the distillation token head learns from the teacher model's predictions. At inference time, the logits from both heads are averaged to produce the final prediction.
DeiT comes in three sizes that differ in embedding dimension, number of attention heads, and total parameter count. All three variants use 12 transformer layers and a patch size of 16x16:
| Model | Embedding Dim | Heads | Layers | Parameters | Throughput (img/s) |
|---|---|---|---|---|---|
| DeiT-Ti (Tiny) | 192 | 3 | 12 | 5M | 2536 |
| DeiT-S (Small) | 384 | 6 | 12 | 22M | 940 |
| DeiT-B (Base) | 768 | 12 | 12 | 86M | 292 |
The dimension per attention head remains constant at 64 across all variants. DeiT-B has the same configuration as ViT-B/16, making direct comparison straightforward. DeiT-Ti and DeiT-S provide lighter alternatives with faster inference throughput.
A central contribution of DeiT was demonstrating that the right combination of data augmentation, regularization, and optimization could bridge the performance gap between ImageNet-only training and large-scale pre-training. The training recipe draws from best practices developed for CNN training and adapts them for transformers.
DeiT uses the AdamW optimizer with a base learning rate of 5 x 10^-4, scaled linearly with batch size as lr = 5 x 10^-4 x (batch_size / 512). The default batch size is 1024. A cosine learning rate schedule is applied over 300 training epochs, preceded by a 5-epoch linear warmup period. Weight decay is set to 0.05.
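The schedule described above (linear batch-size scaling, 5-epoch linear warmup, then cosine decay over 300 epochs) can be written as a small helper. This is a sketch of the schedule shape only; the function name and per-epoch granularity are illustrative, not taken from the official code, which steps the schedule more finely.

```python
import math

def deit_lr(epoch, batch_size=1024, base_lr=5e-4,
            warmup_epochs=5, total_epochs=300):
    """Learning rate at a given epoch: linearly scaled peak LR,
    linear warmup for the first epochs, cosine decay afterwards."""
    peak = base_lr * batch_size / 512              # lr = 5e-4 * (batch/512)
    if epoch < warmup_epochs:
        return peak * (epoch + 1) / warmup_epochs  # linear warmup
    t = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * peak * (1 + math.cos(math.pi * t))  # cosine decay to 0
```

With the default batch size of 1024, the peak learning rate is 1e-3, reached at the end of warmup and decayed to near zero by epoch 300.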
The training procedure employs several augmentation techniques:
| Augmentation | Setting |
|---|---|
| RandAugment | magnitude 9, magnitude std 0.5 |
| Mixup | alpha = 0.8 |
| CutMix | probability = 1.0 |
| Random Erasing | probability = 0.25 |
| Color Jitter | 0.3 |
| Repeated Augmentation | 3 repetitions |
Repeated Augmentation, based on work by Hoffer et al. (2020) and Berman et al. (2019), places three independently augmented copies of each sampled image in the same batch, effectively multiplying the diversity of training data without collecting new images. This technique proved particularly beneficial for transformer training, where data efficiency is critical.
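Of the augmentations in the table, Mixup is the easiest to state precisely: two training examples are blended with a coefficient drawn from a Beta(alpha, alpha) distribution, and their one-hot labels are blended with the same coefficient. The following is a minimal numpy sketch with DeiT's alpha = 0.8; the function name and the batchless two-sample interface are illustrative.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.8, rng=None):
    """Mixup (Zhang et al. 2018): convex blend of two images and
    their one-hot labels, with lambda ~ Beta(alpha, alpha)."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)       # DeiT uses alpha = 0.8
    x = lam * x1 + (1 - lam) * x2
    y = lam * y1 + (1 - lam) * y2
    return x, y

rng = np.random.default_rng(0)
a, b = np.ones((224, 224, 3)), np.zeros((224, 224, 3))
ya, yb = np.eye(1000)[3], np.eye(1000)[7]
x, y = mixup(a, ya, b, yb, rng=rng)
# y is a soft label: mass split between classes 3 and 7, summing to 1
```

CutMix follows the same label-blending idea but pastes a rectangular crop of one image onto the other instead of blending pixel values.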
DeiT applies several regularization techniques to prevent overfitting:
| Technique | Setting |
|---|---|
| Stochastic Depth | drop rate = 0.1 |
| Label Smoothing | epsilon = 0.1 |
| Dropout | 0.0 (disabled) |
Notably, dropout is set to zero. The authors found that the combination of strong augmentation with stochastic depth and label smoothing provided sufficient regularization, and adding dropout degraded performance.
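Stochastic depth, the main regularizer retained, drops an entire residual branch per sample during training rather than individual activations. A minimal sketch of the idea (following the common "inverted scaling" convention used by timm's DropPath; names are illustrative):

```python
import numpy as np

def drop_path(x, branch, drop_rate=0.1, training=True, rng=None):
    """Stochastic depth (Huang et al. 2016): during training, skip the
    residual branch with probability drop_rate and rescale the surviving
    branch by 1/keep so its expectation is unchanged; at eval time the
    branch is always applied."""
    if not training or drop_rate == 0.0:
        return x + branch(x)
    keep = 1.0 - drop_rate
    rng = rng or np.random.default_rng()
    if rng.random() < drop_rate:
        return x                   # whole branch skipped for this sample
    return x + branch(x) / keep    # inverted scaling keeps E[output] fixed
```

With drop_rate = 0.1, roughly one in ten residual branches is skipped per forward pass, which both regularizes and slightly shortens the effective network depth during training.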
The most distinctive contribution of DeiT is its transformer-specific distillation strategy. Traditional knowledge distillation methods, introduced by Hinton et al. (2015), train a student model to mimic the output distribution of a larger teacher model. DeiT adapts this concept for vision transformers through its distillation token mechanism.
In standard ViT, only the class token embedding is used for classification. DeiT introduces a second special token, the distillation token, that is appended to the patch token sequence alongside the class token. Both tokens participate fully in the self-attention layers, allowing the distillation token to attend to all patch tokens and to the class token.
At the output of the transformer encoder, each special token has its own linear classification head: the class head is supervised by the ground-truth labels, while the distillation head is supervised by the teacher's predictions.
This design allows the student model to simultaneously learn from the labeled data and from the teacher's knowledge, with the two objectives handled by separate, dedicated pathways within the same architecture. The authors observed that the class token and distillation token converge to different representations. Their cosine similarity is high but not equal to 1, indicating that the two tokens capture complementary information.
DeiT explores two forms of distillation:
Soft distillation minimizes the Kullback-Leibler divergence between the softmax outputs of the student (distillation head) and the teacher. The total loss combines the standard cross-entropy with the ground-truth labels and the KL divergence with the teacher's soft predictions, weighted by a temperature parameter and a mixing coefficient.
Hard distillation replaces the teacher's soft probability distribution with its hard predicted label (argmax of the teacher's output). The distillation loss becomes a standard cross-entropy between the student's distillation head output and the teacher's hard prediction. Label smoothing is applied to this hard label, converting it into a mixture of the hard decision and a uniform distribution.
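The two objectives above can be sketched in numpy for a single example. The soft loss combines cross-entropy with a temperature-scaled KL term; the hard loss trains the class head on the label and the distillation head on the teacher's argmax with equal weight. The temperature tau = 3.0 and mixing coefficient lam = 0.1 match the values reported in the paper; function names and the single-example interface are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def soft_distill_loss(student_logits, teacher_logits, y_onehot,
                      tau=3.0, lam=0.1):
    """(1-lam)*CE(student, y) + lam*tau^2*KL(teacher_tau || student_tau)."""
    ce = -np.sum(y_onehot * np.log(softmax(student_logits) + 1e-12))
    p_t = softmax(teacher_logits / tau)
    p_s = softmax(student_logits / tau)
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)))
    return (1 - lam) * ce + lam * tau**2 * kl

def hard_distill_loss(cls_logits, dist_logits, teacher_logits, y_onehot):
    """Class head vs. ground truth, distillation head vs. teacher argmax,
    averaged with equal weight."""
    ce_cls = -np.log(softmax(cls_logits)[np.argmax(y_onehot)] + 1e-12)
    ce_dist = -np.log(softmax(dist_logits)[np.argmax(teacher_logits)] + 1e-12)
    return 0.5 * ce_cls + 0.5 * ce_dist
```

When the student already matches the teacher exactly, the KL term vanishes and the soft loss reduces to (1 - lam) times the plain cross-entropy, which is one way to see that the teacher signal only matters where the two models disagree.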
The authors found that hard distillation consistently outperformed soft distillation across their experiments:
| Method | Top-1 Accuracy (DeiT-B, 224x224) |
|---|---|
| No distillation | 81.8% |
| Soft distillation | 81.8% |
| Hard distillation | 83.0% |
| Hard distillation + distillation token | 83.4% |
The hard distillation approach yielded a 1.2 percentage point improvement over soft distillation, and adding the dedicated distillation token pushed performance further to 83.4%.
The choice of teacher model significantly impacts distillation quality. The authors experimented with both CNN and transformer teachers and found that convolutional teachers, in particular RegNetY models, produced better students than transformer teachers of comparable accuracy.
This finding was an important insight: the student transformer benefits from the convolutional inductive biases of the CNN teacher, gaining a form of spatial awareness that would otherwise require large-scale pre-training to develop.
The following table summarizes DeiT's ImageNet results at different scales and resolutions:
| Model | Resolution | Parameters | Top-1 Accuracy |
|---|---|---|---|
| DeiT-Ti | 224x224 | 5M | 72.2% |
| DeiT-Ti‡ | 224x224 | 6M | 74.5% |
| DeiT-S | 224x224 | 22M | 79.8% |
| DeiT-S‡ | 224x224 | 22M | 81.2% |
| DeiT-B | 224x224 | 86M | 81.8% |
| DeiT-B‡ | 224x224 | 87M | 83.4% |
| DeiT-B | 384x384 | 86M | 83.1% |
| DeiT-B‡ | 384x384 | 87M | 84.5% |
| DeiT-B‡ (1000 epochs) | 384x384 | 87M | 85.2% |
The ‡ symbol denotes models trained with the distillation token and hard distillation using a RegNetY-16GF teacher. Distillation consistently improves performance across all model sizes, with gains ranging from 1.4 to 2.3 percentage points.
DeiT-B achieved a dramatic improvement over the original ViT-B trained on ImageNet-1k alone (81.8% vs. 77.9%, a gain of +3.9 points), and when combined with distillation and extended training at 384x384 resolution, DeiT-B‡ reached 85.2% top-1, surpassing even ViT-B/16 pre-trained on the proprietary JFT-300M dataset (84.15%).
| Model | Pre-training Data | Parameters | Top-1 Accuracy |
|---|---|---|---|
| ResNet-50 | ImageNet-1k | 25M | 76.2% |
| ResNet-152 | ImageNet-1k | 60M | 78.3% |
| EfficientNet-B5 | ImageNet-1k | 30M | 83.6% |
| EfficientNet-B7 | ImageNet-1k | 66M | 84.3% |
| ViT-B/16 | ImageNet-1k | 86M | 77.9% |
| ViT-B/16 | JFT-300M | 86M | 84.15% |
| ViT-L/16 | JFT-300M | 307M | 87.1% |
| DeiT-B | ImageNet-1k | 86M | 81.8% |
| DeiT-B‡ (384) | ImageNet-1k | 87M | 85.2% |
DeiT models also transfer well to downstream classification tasks. The following results were obtained by fine-tuning ImageNet-pretrained models:
| Model | CIFAR-10 | CIFAR-100 | Oxford Flowers | Stanford Cars |
|---|---|---|---|---|
| DeiT-B | 99.1% | 90.8% | 98.4% | 92.1% |
| DeiT-B (384) | 99.1% | 90.8% | 98.5% | 93.3% |
| DeiT-B‡ | 99.1% | 91.3% | 98.8% | 92.9% |
| DeiT-B‡ (384) | 99.2% | 91.4% | 98.9% | 93.9% |
These results confirm that the distillation-based training approach produces representations that generalize effectively to diverse visual recognition tasks.
In April 2022, Touvron, Cord, and Jegou published DeiT III: Revenge of the ViT, presented at ECCV 2022. This follow-up work focused not on architectural changes but on further refining the training recipe to push vanilla ViT models to state-of-the-art performance. The central insight was that much of the performance gap between plain ViTs and more complex architectures like Swin Transformer could be attributed to suboptimal training procedures rather than architectural limitations.
DeiT III replaced the heavy augmentation pipeline of the original DeiT with a simpler scheme called 3-Augment. For each training image, one of three augmentations is selected uniformly at random:

- Grayscale conversion
- Solarization
- Gaussian blur
These three augmentations are complemented by standard color jitter and horizontal flip. This approach was inspired by self-supervised learning methods like DINO and BYOL, which similarly use minimal augmentation pipelines. The simplification removed techniques like RandAugment, Mixup, and CutMix that were central to the original DeiT recipe.
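A minimal numpy sketch of the 3-Augment selection (grayscale, solarization, or Gaussian blur, chosen uniformly per image). The per-augmentation implementations here are simplified stand-ins, not the official ones, and assume a float image in [0, 1]:

```python
import numpy as np

def three_augment(image, rng=None):
    """3-Augment (DeiT III, sketch): apply exactly one of grayscale,
    solarization, or Gaussian blur, chosen uniformly at random."""
    rng = rng or np.random.default_rng()
    choice = rng.integers(3)
    if choice == 0:                                   # grayscale
        g = image @ np.array([0.299, 0.587, 0.114])   # luminance weights
        return np.repeat(g[..., None], 3, axis=-1)
    if choice == 1:                                   # solarization
        return np.where(image >= 0.5, 1.0 - image, image)
    # Gaussian blur via a small separable kernel (simplified)
    k = np.array([1.0, 4.0, 6.0, 4.0, 1.0])
    k /= k.sum()
    out = image
    for axis in (0, 1):
        out = np.apply_along_axis(
            lambda v: np.convolve(v, k, mode="same"), axis, out)
    return out
```

Each image receives exactly one of the three transforms per sampling, in contrast to RandAugment-style pipelines that compose several operations.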
DeiT III introduced several other changes to the training procedure, among them stochastic depth rates that increase with model size:
| Model | Stochastic Depth Rate (ImageNet-1k) | Stochastic Depth Rate (ImageNet-21k) |
|---|---|---|
| ViT-Ti | 0.0 | 0.0 |
| ViT-S | 0.0 | 0.0 |
| ViT-B | 0.1 | 0.1 |
| ViT-L | 0.4 | 0.3 |
| ViT-H | 0.5 | 0.5 |
With the improved training recipe, vanilla ViT models achieved results competitive with or superior to architectures that incorporate hierarchical designs and local attention mechanisms:
ImageNet-1k only training:
| Model | Resolution | Top-1 Accuracy | ImageNet-V2 |
|---|---|---|---|
| ViT-S (DeiT III) | 224x224 | 81.4% | 70.5% |
| ViT-S (DeiT III) | 384x384 | 83.4% | 73.1% |
| ViT-B (DeiT III) | 224x224 | 83.8% | 73.6% |
| ViT-B (DeiT III) | 384x384 | 85.0% | 74.8% |
| ViT-L (DeiT III) | 224x224 | 84.9% | 75.1% |
| ViT-L (DeiT III) | 384x384 | 85.8% | 76.7% |
| ViT-H (DeiT III) | 224x224 | 85.2% | 75.9% |
ViT-B and ViT-L trained for different numbers of epochs showed consistent gains with longer training:
| Epochs | ViT-B Top-1 | ViT-L Top-1 |
|---|---|---|
| 300 | 82.8% | 84.1% |
| 400 | 83.1% | 84.2% |
| 600 | 83.2% | 84.4% |
| 800 | 83.7% | 84.5% |
ImageNet-21k pre-training followed by ImageNet-1k fine-tuning:
| Model | Resolution | Top-1 Accuracy | ImageNet-V2 |
|---|---|---|---|
| ViT-S (DeiT III) | 224x224 | 83.1% | 73.8% |
| ViT-B (DeiT III) | 224x224 | 85.7% | 76.5% |
| ViT-L (DeiT III) | 224x224 | 87.0% | 78.6% |
| ViT-H (DeiT III) | 224x224 | 87.2% | 79.2% |
The ViT-H result of 85.2% on ImageNet-1k only training represented a +5.1 percentage point improvement over the best previously reported supervised ViT-H training result at 224x224 resolution.
The original ViT and DeiT share the same architecture, making their comparison a direct study of training methodology. ViT-B/16 trained on ImageNet-1k alone reached only 77.9% top-1 accuracy, while DeiT-B reached 81.8% with no architectural changes, a gain of 3.9 points purely from improved training. The distilled DeiT-B‡ pushed this further to 83.4% at 224x224 and 85.2% at 384x384, exceeding ViT-B/16 pre-trained on JFT-300M (84.15%) despite using 250 times less pre-training data.
The Swin Transformer, introduced by Liu et al. (Microsoft Research, 2021), takes a different approach to making vision transformers practical. Swin uses a hierarchical architecture with shifted window attention, introducing locality and multi-scale feature maps that parallel the design of CNN feature pyramids. This architectural design gives Swin built-in inductive biases that help with smaller datasets and dense prediction tasks.
At comparable model sizes, Swin and DeiT achieve similar performance on ImageNet classification:
| Model | Parameters | ImageNet Top-1 |
|---|---|---|
| DeiT-S (DeiT III) | 22M | 81.4% |
| Swin-T | 29M | 81.3% |
| DeiT-B (DeiT III) | 86M | 83.8% |
| Swin-S | 50M | 83.0% |
| Swin-B | 88M | 83.5% |
| ConvNeXt-B | 89M | 83.8% |
With the DeiT III training recipe, vanilla ViT models match or exceed Swin at similar scales, suggesting that much of Swin's advantage came from better training rather than from the shifted window attention mechanism itself. However, Swin's hierarchical design remains advantageous for downstream tasks requiring multi-scale feature maps, such as object detection and semantic segmentation.
DeiT III showed that for ViT-B and ViT-L, its fully supervised training approach performed on par with BERT-like self-supervised pre-training methods such as BEiT and MAE. With ImageNet-21k pre-training, DeiT III ViT-L reached 87.0% top-1, competitive with self-supervised approaches that require additional pre-training stages.
DeiT had a significant impact on the field of computer vision by demonstrating that vision transformers could be trained effectively without massive datasets or enormous compute budgets. Several areas were directly influenced:
Before DeiT, training a competitive ViT required access to proprietary datasets like JFT-300M (Google) or Instagram-scale data. DeiT showed that ImageNet-1k, a freely available dataset of 1.2 million images, was sufficient to train transformers that matched or exceeded CNN performance. Training could be completed on a single 8-GPU node in under three days, making ViT research accessible to academic labs and smaller organizations.
DeiT established training methodology as a primary research axis for vision transformers. The ConvNeXt paper (Liu et al., 2022) later adopted DeiT's training recipe to modernize ResNet architectures, achieving 82.0% top-1 on a ResNet-50 (up from the baseline of 76.1%), directly demonstrating that training procedures contribute as much to performance as architectural innovations. This insight catalyzed a broader rethinking of how models are trained across computer vision.
The distillation token approach opened a new line of research in transformer-specific knowledge distillation. Follow-up works such as DeiT-LT (Rangwani et al., CVPR 2024) extended DeiT's distillation framework to handle long-tailed data distributions. The concept of using CNN teachers to inject convolutional inductive biases into transformer students has been adopted in numerous subsequent papers.
DeiT's data-efficient training approach proved especially valuable in domains where labeled data is scarce, such as medical imaging. Researchers applied DeiT-style training to chest X-ray classification, histopathology, and other clinical tasks where collecting hundreds of millions of labeled images is not feasible.
The augmentation and regularization combinations documented in DeiT became standard reference points for subsequent vision transformer papers. Techniques like repeated augmentation, the specific combination of Mixup with CutMix, and the effectiveness of stochastic depth over dropout for transformers are now widely used defaults in the vision transformer community.
The official DeiT implementation is available on GitHub at facebookresearch/deit, built on PyTorch and the timm (PyTorch Image Models) library. Pre-trained model weights for all DeiT variants (including DeiT III) are available through both the official repository and Hugging Face Transformers. The Hugging Face integration supports both the DeiT and DeiT-with-distillation variants through dedicated model classes (DeiTModel and DeiTForImageClassificationWithTeacher).
While DeiT significantly improved the data efficiency of vision transformers, several limitations remain: