DeiT (Data-efficient Image Transformers) is a family of vision transformer models developed by Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Herve Jegou at Facebook AI Research (FAIR) and Sorbonne University. Published in December 2020 and presented at ICML 2021, DeiT demonstrated that Vision Transformers (ViTs) could be trained competitively on ImageNet alone, without the massive proprietary datasets (such as JFT-300M) that the original ViT required. The paper introduced a transformer-specific knowledge distillation strategy built around a novel distillation token, reaching up to 85.2% top-1 accuracy on ImageNet using only publicly available data; the reference model trains on a single 8-GPU server in under three days.
The original Vision Transformer (ViT), introduced by Dosovitskiy et al. at Google Research in 2020, showed that a pure transformer architecture could match or exceed convolutional neural networks (CNNs) on image classification when pre-trained on very large datasets. However, ViT-B/16 achieved only 77.9% top-1 accuracy on ImageNet when trained on ImageNet-1k alone, compared to 84.15% when pre-trained on the proprietary JFT-300M dataset containing 300 million labeled images. This reliance on massive datasets limited the accessibility and reproducibility of ViT research.
Convolutional neural networks benefit from built-in inductive biases such as translation equivariance and locality that help them learn effectively from smaller datasets. Transformers, which process images as sequences of patches using global self-attention, lack these spatial priors and therefore need more training data to learn equivalent representations from scratch.
DeiT addressed this gap by combining a carefully engineered training recipe with a novel distillation approach, showing that proper data augmentation, regularization, and optimization strategies could compensate for the absence of large-scale pre-training data.
DeiT shares the same core architecture as ViT. An input image of resolution 224x224 is divided into non-overlapping patches of size 16x16, producing a sequence of 196 patch tokens. Each patch is linearly projected into an embedding vector, and a learnable class token ([CLS]) is prepended to the sequence. Learnable positional embeddings are added to encode spatial information. The resulting sequence passes through a stack of transformer encoder layers, each consisting of multi-head self-attention (MHSA) and a feed-forward network (FFN) with GELU activation.
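The tokenization step above can be sketched in a few lines of numpy. This is a minimal illustration of the shapes involved (196 patch tokens of dimension 16x16x3 = 768, plus a prepended class token); the random projection and positional embeddings stand in for learned parameters and are not the official implementation.

```python
import numpy as np

def patchify(image, patch=16):
    """Split an HxWxC image into flattened, non-overlapping patch vectors."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0
    p = image.reshape(H // patch, patch, W // patch, patch, C)
    p = p.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)
    return p  # (num_patches, patch*patch*C)

rng = np.random.default_rng(0)
img = rng.standard_normal((224, 224, 3))
tokens = patchify(img)                       # (196, 768): 14x14 patches
embed_dim = 768                              # DeiT-B embedding width
W_proj = rng.standard_normal((tokens.shape[1], embed_dim)) * 0.02
x = tokens @ W_proj                          # linear patch projection
cls = np.zeros((1, embed_dim))               # stand-in for the learnable [CLS] token
x = np.concatenate([cls, x], axis=0)         # (197, embed_dim)
x = x + rng.standard_normal(x.shape) * 0.02  # stand-in for positional embeddings
```

After this step the sequence of 197 tokens is what the stack of transformer encoder layers consumes.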
The key architectural addition in the distilled variant (denoted DeiT-B‡) is the distillation token. This is a second learnable token appended to the input sequence alongside the class token. Both tokens interact with all patch tokens and with each other through the self-attention mechanism. At the output of the transformer, two separate linear classifiers are attached: one to the class token embedding and one to the distillation token embedding. During training, the class token head is supervised by the ground-truth labels, while the distillation token head learns from the teacher model's predictions. At inference time, the logits from both heads are averaged to produce the final prediction.
DeiT comes in three sizes that differ in embedding dimension, number of attention heads, and total parameter count. All three variants use 12 transformer layers and a patch size of 16x16:
| Model | Embedding Dim | Heads | Layers | Parameters | Throughput (img/s) |
|---|---|---|---|---|---|
| DeiT-Ti (Tiny) | 192 | 3 | 12 | 5M | 2536 |
| DeiT-S (Small) | 384 | 6 | 12 | 22M | 940 |
| DeiT-B (Base) | 768 | 12 | 12 | 86M | 292 |
The dimension per attention head remains constant at 64 across all variants. DeiT-B has the same configuration as ViT-B/16, making direct comparison straightforward. DeiT-Ti and DeiT-S provide lighter alternatives with faster inference throughput.
A central contribution of DeiT was demonstrating that the right combination of data augmentation, regularization, and optimization could bridge the performance gap between ImageNet-only training and large-scale pre-training. The training recipe draws from best practices developed for CNN training and adapts them for transformers.
DeiT uses the AdamW optimizer with a base learning rate of 5 x 10^-4, scaled linearly with batch size as lr = 5 x 10^-4 x (batch_size / 512). The default batch size is 1024. A cosine learning rate schedule is applied over 300 training epochs, preceded by a 5-epoch linear warmup period. Weight decay is set to 0.05.
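The schedule described above (linear batch-size scaling, 5-epoch linear warmup, then cosine decay over 300 epochs) can be written as a small helper. This is a sketch of the schedule shape only; the function name and per-epoch granularity are illustrative, not taken from the official code, which steps the schedule more finely.

```python
import math

def deit_lr(epoch, batch_size=1024, base_lr=5e-4,
            warmup_epochs=5, total_epochs=300):
    """Learning rate at a given epoch: linearly scaled peak LR,
    linear warmup for the first epochs, cosine decay afterwards."""
    peak = base_lr * batch_size / 512              # lr = 5e-4 * (batch/512)
    if epoch < warmup_epochs:
        return peak * (epoch + 1) / warmup_epochs  # linear warmup
    t = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * peak * (1 + math.cos(math.pi * t))  # cosine decay to 0
```

With the default batch size of 1024, the peak learning rate is 1e-3, reached at the end of warmup and decayed to near zero by epoch 300.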
The training procedure employs several augmentation techniques:
| Augmentation | Setting |
|---|---|
| RandAugment | magnitude 9, magnitude std 0.5 |
| Mixup | alpha = 0.8 |
| CutMix | probability = 1.0 |
| Random Erasing | probability = 0.25 |
| Color Jitter | 0.3 |
| Repeated Augmentation | 3 repetitions |
Repeated Augmentation, based on work by Hoffer et al. (2020) and Berman et al. (2019), places three independently augmented copies of each sampled image in the same batch, effectively multiplying the diversity of training data without collecting new images. This technique proved particularly beneficial for transformer training, where data efficiency is critical.
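Of the augmentations in the table, Mixup is the easiest to state precisely: two training examples are blended with a coefficient drawn from a Beta(alpha, alpha) distribution, and their one-hot labels are blended with the same coefficient. The following is a minimal numpy sketch with DeiT's alpha = 0.8; the function name and the batchless two-sample interface are illustrative.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.8, rng=None):
    """Mixup (Zhang et al. 2018): convex blend of two images and
    their one-hot labels, with lambda ~ Beta(alpha, alpha)."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)       # DeiT uses alpha = 0.8
    x = lam * x1 + (1 - lam) * x2
    y = lam * y1 + (1 - lam) * y2
    return x, y

rng = np.random.default_rng(0)
a, b = np.ones((224, 224, 3)), np.zeros((224, 224, 3))
ya, yb = np.eye(1000)[3], np.eye(1000)[7]
x, y = mixup(a, ya, b, yb, rng=rng)
# y is a soft label: mass split between classes 3 and 7, summing to 1
```

CutMix follows the same label-blending idea but pastes a rectangular crop of one image onto the other instead of blending pixel values.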
DeiT applies several regularization techniques to prevent overfitting:
| Technique | Setting |
|---|---|
| Stochastic Depth | drop rate = 0.1 |
| Label Smoothing | epsilon = 0.1 |
| Dropout | 0.0 (disabled) |
Notably, dropout is set to zero. The authors found that the combination of strong augmentation with stochastic depth and label smoothing provided sufficient regularization, and adding dropout degraded performance.
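Stochastic depth, the main regularizer retained, drops an entire residual branch per sample during training rather than individual activations. A minimal sketch of the idea (following the common "inverted scaling" convention used by timm's DropPath; names are illustrative):

```python
import numpy as np

def drop_path(x, branch, drop_rate=0.1, training=True, rng=None):
    """Stochastic depth (Huang et al. 2016): during training, skip the
    residual branch with probability drop_rate and rescale the surviving
    branch by 1/keep so its expectation is unchanged; at eval time the
    branch is always applied."""
    if not training or drop_rate == 0.0:
        return x + branch(x)
    keep = 1.0 - drop_rate
    rng = rng or np.random.default_rng()
    if rng.random() < drop_rate:
        return x                   # whole branch skipped for this sample
    return x + branch(x) / keep    # inverted scaling keeps E[output] fixed
```

With drop_rate = 0.1, roughly one in ten residual branches is skipped per forward pass, which both regularizes and slightly shortens the effective network depth during training.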
The most distinctive contribution of DeiT is its transformer-specific distillation strategy. Traditional knowledge distillation methods, introduced by Hinton et al. (2015), train a student model to mimic the output distribution of a larger teacher model. DeiT adapts this concept for vision transformers through its distillation token mechanism.
In standard ViT, only the class token embedding is used for classification. DeiT introduces a second special token, the distillation token, that is appended to the patch token sequence alongside the class token. Both tokens participate fully in the self-attention layers, allowing the distillation token to attend to all patch tokens and to the class token.
At the output of the transformer encoder, each special token has its own linear classification head: the class head is supervised by the ground-truth labels, while the distillation head is supervised by the teacher's predictions.
This design allows the student model to simultaneously learn from the labeled data and from the teacher's knowledge, with the two objectives handled by separate, dedicated pathways within the same architecture. The authors observed that the class token and distillation token converge to different representations. Their cosine similarity is high but not equal to 1, indicating that the two tokens capture complementary information.
DeiT explores two forms of distillation:
Soft distillation minimizes the Kullback-Leibler divergence between the softmax outputs of the student (distillation head) and the teacher. The total loss combines the standard cross-entropy with the ground-truth labels and the KL divergence with the teacher's soft predictions, weighted by a temperature parameter and a mixing coefficient.
Hard distillation replaces the teacher's soft probability distribution with its hard predicted label (argmax of the teacher's output). The distillation loss becomes a standard cross-entropy between the student's distillation head output and the teacher's hard prediction. Label smoothing is applied to this hard label, converting it into a mixture of the hard decision and a uniform distribution.
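The two objectives above can be sketched in numpy for a single example. The soft loss combines cross-entropy with a temperature-scaled KL term; the hard loss trains the class head on the label and the distillation head on the teacher's argmax with equal weight. The temperature tau = 3.0 and mixing coefficient lam = 0.1 match the values reported in the paper; function names and the single-example interface are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def soft_distill_loss(student_logits, teacher_logits, y_onehot,
                      tau=3.0, lam=0.1):
    """(1-lam)*CE(student, y) + lam*tau^2*KL(teacher_tau || student_tau)."""
    ce = -np.sum(y_onehot * np.log(softmax(student_logits) + 1e-12))
    p_t = softmax(teacher_logits / tau)
    p_s = softmax(student_logits / tau)
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)))
    return (1 - lam) * ce + lam * tau**2 * kl

def hard_distill_loss(cls_logits, dist_logits, teacher_logits, y_onehot):
    """Class head vs. ground truth, distillation head vs. teacher argmax,
    averaged with equal weight."""
    ce_cls = -np.log(softmax(cls_logits)[np.argmax(y_onehot)] + 1e-12)
    ce_dist = -np.log(softmax(dist_logits)[np.argmax(teacher_logits)] + 1e-12)
    return 0.5 * ce_cls + 0.5 * ce_dist
```

When the student already matches the teacher exactly, the KL term vanishes and the soft loss reduces to (1 - lam) times the plain cross-entropy, which is one way to see that the teacher signal only matters where the two models disagree.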
The authors found that hard distillation consistently outperformed soft distillation across their experiments:
| Method | Top-1 Accuracy (DeiT-B, 224x224) |
|---|---|
| No distillation | 81.8% |
| Soft distillation | 81.8% |
| Hard distillation | 83.0% |
| Hard distillation + distillation token | 83.4% |
The hard distillation approach yielded a 1.2 percentage point improvement over soft distillation, and adding the dedicated distillation token pushed performance further to 83.4%.
The choice of teacher model significantly impacts distillation quality. The authors experimented with both CNN and transformer teachers and found that convolutional teachers, in particular RegNetY models, produced better students than transformer teachers of comparable accuracy.
This finding was an important insight: the student transformer benefits from the convolutional inductive biases of the CNN teacher, gaining a form of spatial awareness that would otherwise require large-scale pre-training to develop.
The following table summarizes DeiT's ImageNet results at different scales and resolutions:
| Model | Resolution | Parameters | Top-1 Accuracy |
|---|---|---|---|
| DeiT-Ti | 224x224 | 5M | 72.2% |
| DeiT-Ti‡ | 224x224 | 6M | 74.5% |
| DeiT-S | 224x224 | 22M | 79.8% |
| DeiT-S‡ | 224x224 | 22M | 81.2% |
| DeiT-B | 224x224 | 86M | 81.8% |
| DeiT-B‡ | 224x224 | 87M | 83.4% |
| DeiT-B | 384x384 | 86M | 83.1% |
| DeiT-B‡ | 384x384 | 87M | 84.5% |
| DeiT-B‡ (1000 epochs) | 384x384 | 87M | 85.2% |
The ‡ symbol denotes models trained with the distillation token and hard distillation using a RegNetY-16GF teacher. Distillation consistently improves performance across all model sizes, with gains ranging from 1.4 to 2.3 percentage points.
DeiT-B achieved a dramatic improvement over the original ViT-B trained on ImageNet-1k alone (81.8% vs. 77.9%, a gain of +3.9 points), and when combined with distillation and extended training at 384x384 resolution, DeiT-B‡ reached 85.2% top-1, surpassing even ViT-B/16 pre-trained on the proprietary JFT-300M dataset (84.15%).
| Model | Pre-training Data | Parameters | Top-1 Accuracy |
|---|---|---|---|
| ResNet-50 | ImageNet-1k | 25M | 76.2% |
| ResNet-152 | ImageNet-1k | 60M | 78.3% |
| EfficientNet-B5 | ImageNet-1k | 30M | 83.6% |
| EfficientNet-B7 | ImageNet-1k | 66M | 84.3% |
| ViT-B/16 | ImageNet-1k | 86M | 77.9% |
| ViT-B/16 | JFT-300M | 86M | 84.15% |
| ViT-L/16 | JFT-300M | 307M | 87.1% |
| DeiT-B | ImageNet-1k | 86M | 81.8% |
| DeiT-B‡ (384) | ImageNet-1k | 87M | 85.2% |
DeiT models also transfer well to downstream classification tasks. The following results were obtained by fine-tuning ImageNet-pretrained models:
| Model | CIFAR-10 | CIFAR-100 | Oxford Flowers | Stanford Cars |
|---|---|---|---|---|
| DeiT-B | 99.1% | 90.8% | 98.4% | 92.1% |
| DeiT-B (384) | 99.1% | 90.8% | 98.5% | 93.3% |
| DeiT-B‡ | 99.1% | 91.3% | 98.8% | 92.9% |
| DeiT-B‡ (384) | 99.2% | 91.4% | 98.9% | 93.9% |
These results confirm that the distillation-based training approach produces representations that generalize effectively to diverse visual recognition tasks.
In April 2022, Touvron, Cord, and Jegou published DeiT III: Revenge of the ViT, presented at ECCV 2022. This follow-up work focused not on architectural changes but on further refining the training recipe to push vanilla ViT models to state-of-the-art performance. The central insight was that much of the performance gap between plain ViTs and more complex architectures like Swin Transformer could be attributed to suboptimal training procedures rather than architectural limitations.
DeiT III replaced the heavy augmentation pipeline of the original DeiT with a simpler scheme called 3-Augment. For each training image, one of three augmentations is selected uniformly at random:

- Grayscale conversion
- Solarization
- Gaussian blur
These three augmentations are complemented by standard color jitter and horizontal flip. This approach was inspired by self-supervised learning methods like DINO and BYOL, which similarly use minimal augmentation pipelines. The simplification removed techniques like RandAugment, Mixup, and CutMix that were central to the original DeiT recipe.
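A minimal numpy sketch of the 3-Augment selection (grayscale, solarization, or Gaussian blur, chosen uniformly per image). The per-augmentation implementations here are simplified stand-ins, not the official ones, and assume a float image in [0, 1]:

```python
import numpy as np

def three_augment(image, rng=None):
    """3-Augment (DeiT III, sketch): apply exactly one of grayscale,
    solarization, or Gaussian blur, chosen uniformly at random."""
    rng = rng or np.random.default_rng()
    choice = rng.integers(3)
    if choice == 0:                                   # grayscale
        g = image @ np.array([0.299, 0.587, 0.114])   # luminance weights
        return np.repeat(g[..., None], 3, axis=-1)
    if choice == 1:                                   # solarization
        return np.where(image >= 0.5, 1.0 - image, image)
    # Gaussian blur via a small separable kernel (simplified)
    k = np.array([1.0, 4.0, 6.0, 4.0, 1.0])
    k /= k.sum()
    out = image
    for axis in (0, 1):
        out = np.apply_along_axis(
            lambda v: np.convolve(v, k, mode="same"), axis, out)
    return out
```

Each image receives exactly one of the three transforms per sampling, in contrast to RandAugment-style pipelines that compose several operations.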
DeiT III introduced several other changes to the training procedure, among them stochastic depth rates that increase with model size:
| Model | Stochastic Depth Rate (ImageNet-1k) | Stochastic Depth Rate (ImageNet-21k) |
|---|---|---|
| ViT-Ti | 0.0 | 0.0 |
| ViT-S | 0.0 | 0.0 |
| ViT-B | 0.1 | 0.1 |
| ViT-L | 0.4 | 0.3 |
| ViT-H | 0.5 | 0.5 |
With the improved training recipe, vanilla ViT models achieved results competitive with or superior to architectures that incorporate hierarchical designs and local attention mechanisms:
ImageNet-1k only training:
| Model | Resolution | Top-1 Accuracy | ImageNet-V2 |
|---|---|---|---|
| ViT-S (DeiT III) | 224x224 | 81.4% | 70.5% |
| ViT-S (DeiT III) | 384x384 | 83.4% | 73.1% |
| ViT-B (DeiT III) | 224x224 | 83.8% | 73.6% |
| ViT-B (DeiT III) | 384x384 | 85.0% | 74.8% |
| ViT-L (DeiT III) | 224x224 | 84.9% | 75.1% |
| ViT-L (DeiT III) | 384x384 | 85.8% | 76.7% |
| ViT-H (DeiT III) | 224x224 | 85.2% | 75.9% |
ViT-B and ViT-L trained for different numbers of epochs showed consistent gains with longer training:
| Epochs | ViT-B Top-1 | ViT-L Top-1 |
|---|---|---|
| 300 | 82.8% | 84.1% |
| 400 | 83.1% | 84.2% |
| 600 | 83.2% | 84.4% |
| 800 | 83.7% | 84.5% |
ImageNet-21k pre-training followed by ImageNet-1k fine-tuning:
| Model | Resolution | Top-1 Accuracy | ImageNet-V2 |
|---|---|---|---|
| ViT-S (DeiT III) | 224x224 | 83.1% | 73.8% |
| ViT-B (DeiT III) | 224x224 | 85.7% | 76.5% |
| ViT-L (DeiT III) | 224x224 | 87.0% | 78.6% |
| ViT-H (DeiT III) | 224x224 | 87.2% | 79.2% |
The ViT-H result of 85.2% on ImageNet-1k only training represented a +5.1 percentage point improvement over the best previously reported supervised ViT-H training result at 224x224 resolution.
The original ViT and DeiT share the same architecture, making their comparison a direct study of training methodology. ViT-B/16 trained on ImageNet-1k alone reached only 77.9% top-1 accuracy, while DeiT-B reached 81.8% with no architectural changes, a gain of 3.9 points purely from improved training. The distilled DeiT-B‡ pushed this further to 83.4% at 224x224 and 85.2% at 384x384, exceeding ViT-B/16 pre-trained on JFT-300M (84.15%) despite using 250 times less pre-training data.
The Swin Transformer, introduced by Liu et al. (Microsoft Research, 2021), takes a different approach to making vision transformers practical. Swin uses a hierarchical architecture with shifted window attention, introducing locality and multi-scale feature maps that parallel the design of CNN feature pyramids. This architectural design gives Swin built-in inductive biases that help with smaller datasets and dense prediction tasks.
At comparable model sizes, Swin and DeiT achieve similar performance on ImageNet classification:
| Model | Parameters | ImageNet Top-1 |
|---|---|---|
| DeiT-S (DeiT III) | 22M | 81.4% |
| Swin-T | 29M | 81.3% |
| DeiT-B (DeiT III) | 86M | 83.8% |
| Swin-S | 50M | 83.0% |
| Swin-B | 88M | 83.5% |
| ConvNeXt-B | 89M | 83.8% |
With the DeiT III training recipe, vanilla ViT models match or exceed Swin at similar scales, suggesting that much of Swin's advantage came from better training rather than from the shifted window attention mechanism itself. However, Swin's hierarchical design remains advantageous for downstream tasks requiring multi-scale feature maps, such as object detection and semantic segmentation.
DeiT III showed that for ViT-B and ViT-L, its fully supervised training approach performed on par with BERT-like self-supervised pre-training methods such as BEiT and MAE. With ImageNet-21k pre-training, DeiT III ViT-L reached 87.0% top-1, competitive with self-supervised approaches that require additional pre-training stages.
DeiT had a significant impact on the field of computer vision by demonstrating that vision transformers could be trained effectively without massive datasets or enormous compute budgets. Several areas were directly influenced:
Before DeiT, training a competitive ViT required access to proprietary datasets like JFT-300M (Google) or Instagram-scale data. DeiT showed that ImageNet-1k, a freely available dataset of 1.2 million images, was sufficient to train transformers that matched or exceeded CNN performance. Training could be completed on a single 8-GPU node in under three days, making ViT research accessible to academic labs and smaller organizations.
DeiT established training methodology as a primary research axis for vision transformers. The ConvNeXt paper (Liu et al., 2022) later adopted DeiT's training recipe to modernize ResNet architectures, achieving 82.0% top-1 on a ResNet-50 (up from the baseline of 76.1%), directly demonstrating that training procedures contribute as much to performance as architectural innovations. This insight catalyzed a broader rethinking of how models are trained across computer vision.
The distillation token approach opened a new line of research in transformer-specific knowledge distillation. Follow-up works such as DeiT-LT (Rangwani et al., CVPR 2024) extended DeiT's distillation framework to handle long-tailed data distributions. The concept of using CNN teachers to inject convolutional inductive biases into transformer students has been adopted in numerous subsequent papers.
DeiT's data-efficient training approach proved especially valuable in domains where labeled data is scarce, such as medical imaging. Researchers applied DeiT-style training to chest X-ray classification, histopathology, and other clinical tasks where collecting hundreds of millions of labeled images is not feasible.
The augmentation and regularization combinations documented in DeiT became standard reference points for subsequent vision transformer papers. Techniques like repeated augmentation, the specific combination of Mixup with CutMix, and the effectiveness of stochastic depth over dropout for transformers are now widely used defaults in the vision transformer community.
The official DeiT implementation is available on GitHub at facebookresearch/deit, built on PyTorch and the timm (PyTorch Image Models) library. Pre-trained model weights for all DeiT variants (including DeiT III) are available through both the official repository and Hugging Face Transformers. The Hugging Face integration supports both the DeiT and DeiT-with-distillation variants through dedicated model classes (DeiTModel and DeiTForImageClassificationWithTeacher).
While DeiT significantly improved the data efficiency of vision transformers, several limitations remain: