DINO (self-DIstillation with NO labels) is a family of self-supervised learning methods for computer vision developed by Meta AI (formerly Facebook AI Research). The original DINO was introduced by Caron et al. in 2021, and the family has since expanded to include DINOv2 (2023) and DINOv3 (2025). These models learn powerful visual representations from unlabeled images using Vision Transformers (ViTs), producing features that rival or surpass those trained with explicit human supervision. The DINO family has become one of the most influential lines of research in self-supervised visual representation learning, with applications spanning image classification, semantic segmentation, depth estimation, object detection, and video understanding.
The original DINO paper, titled "Emerging Properties in Self-Supervised Vision Transformers," was published at the IEEE/CVF International Conference on Computer Vision (ICCV) in October 2021. The authors were Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin, all affiliated with Facebook AI Research (FAIR) and Inria.
The central question of the paper was whether self-supervised learning could unlock properties in Vision Transformers that are not present in supervised ViTs or in convolutional neural networks (ConvNets). The answer turned out to be a resounding yes: self-supervised ViT features contain explicit information about the semantic segmentation of an image, a property that does not emerge as clearly with supervised training or with ConvNet architectures.
DINO uses a student-teacher framework inspired by knowledge distillation, but without any labeled data. Both the student and teacher networks share the same architecture (typically a Vision Transformer), but their parameters evolve through different pathways.
The student network is trained via standard backpropagation. The teacher network, by contrast, is not trained with gradients. Instead, its weights are updated as an exponential moving average (EMA) of the student's weights. This is known as a momentum teacher. At each training step, the teacher parameters are updated according to:
theta_teacher = m * theta_teacher + (1 - m) * theta_student
The momentum parameter m follows a cosine schedule from 0.996 to 1.0 during training, meaning the teacher becomes increasingly stable as training progresses.
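The update rule and cosine momentum schedule above can be sketched in a few lines of PyTorch; the function names here are illustrative, not taken from the official implementation:

```python
import math

import torch


def momentum_schedule(step: int, total_steps: int,
                      base_m: float = 0.996, final_m: float = 1.0) -> float:
    """Cosine schedule for the teacher momentum, rising from base_m to final_m."""
    cos = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return final_m - (final_m - base_m) * cos


@torch.no_grad()
def update_teacher(teacher: torch.nn.Module, student: torch.nn.Module, m: float):
    """theta_teacher <- m * theta_teacher + (1 - m) * theta_student."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(m).add_(p_s, alpha=1.0 - m)
```

At step 0 the schedule returns 0.996; at the final step it returns 1.0, at which point the teacher's weights stop moving entirely.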
DINO applies a multi-crop augmentation strategy to each training image. Two global crops are generated at 224x224 resolution, each covering more than 50% of the original image. Several local crops are generated at 96x96 resolution, each covering less than 50% of the image. The teacher network processes only the global crops, while the student network processes all crops (both global and local). This asymmetry encourages the student to learn "local-to-global" correspondences, aligning its representations of small local patches with the teacher's broader contextual understanding.
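A minimal sketch of the multi-crop strategy, written with plain tensor operations rather than the torchvision transforms of the actual recipe, and omitting its flips and color augmentations; the crop sizes and area fractions follow the description above:

```python
import torch
import torch.nn.functional as F


def random_crop(img: torch.Tensor, out_size: int, scale: tuple) -> torch.Tensor:
    """Take a random square crop covering a `scale` fraction of the image
    area, then resize it to out_size x out_size. img has shape (C, H, W)."""
    _, H, W = img.shape
    frac = torch.empty(1).uniform_(*scale).item()
    side = min(max(1, int((frac * H * W) ** 0.5)), H, W)
    top = torch.randint(0, H - side + 1, (1,)).item()
    left = torch.randint(0, W - side + 1, (1,)).item()
    crop = img[:, top:top + side, left:left + side]
    return F.interpolate(crop[None], size=(out_size, out_size),
                         mode="bilinear", align_corners=False)[0]


def multi_crop(img: torch.Tensor, n_local: int = 6):
    """Two 224x224 global crops (>50% of the image area) and several
    96x96 local crops (<50%), as in the DINO multi-crop scheme."""
    globals_ = [random_crop(img, 224, (0.5, 1.0)) for _ in range(2)]
    locals_ = [random_crop(img, 96, (0.05, 0.5)) for _ in range(n_local)]
    return globals_, locals_
```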
Two critical mechanisms prevent training collapse in DINO:
Centering is applied to the teacher's output before the softmax. DINO maintains a running average (the "center") of the teacher's output vectors across the batch and subtracts it from each output. This prevents any single dimension from dominating, encouraging the model to use the full output space. The center is updated with a momentum of approximately 0.9.
Sharpening is achieved by using a low temperature in the softmax function for the teacher's outputs. This makes the output distribution more peaked and confident. The teacher temperature is warmed up from 0.04 to 0.07 during the early stages of training, while the student uses a fixed, higher temperature (typically 0.1).
Centering and sharpening have complementary effects: centering prevents collapse toward a uniform distribution by ensuring all output dimensions are used, while sharpening prevents collapse toward a single dominant mode by encouraging confident predictions.
The training objective is a cross-entropy loss that aligns the student's output probability distribution with the teacher's output distribution across all cross-view pairs (global crops fed to the teacher, all crops fed to the student), excluding self-pairs where the same crop is fed to both networks.
The most striking finding of the DINO paper was that self-supervised ViTs develop emergent properties not seen in supervised ViTs or ConvNets.
Attention maps as segmentation masks. The self-attention maps of the final layer of DINO-trained ViTs contain explicit information about object boundaries and semantic regions. Different attention heads attend to different semantic parts of an image, effectively producing unsupervised segmentation masks. This property does not appear as clearly in ViTs trained with supervised classification objectives.
Strong k-NN classification. DINO features are excellent k-nearest-neighbor (k-NN) classifiers without any fine-tuning, linear classifier, or data augmentation. A small ViT trained with DINO achieved 78.3% top-1 accuracy on ImageNet using only a k-NN classifier applied directly to frozen features.
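The k-NN evaluation amounts to a weighted vote over cosine-nearest neighbors in the frozen feature space. The sketch below uses k = 20 and temperature T = 0.07, which match the DINO evaluation protocol, but the helper itself is illustrative rather than the official evaluation code:

```python
import torch
import torch.nn.functional as F


def knn_classify(train_feats: torch.Tensor, train_labels: torch.Tensor,
                 test_feats: torch.Tensor, k: int = 20,
                 T: float = 0.07) -> torch.Tensor:
    """Weighted k-NN on frozen features: cosine similarity, with each
    neighbour's vote weighted by exp(similarity / T)."""
    train = F.normalize(train_feats, dim=1)
    test = F.normalize(test_feats, dim=1)
    sims = test @ train.T                      # (n_test, n_train)
    topv, topi = sims.topk(k, dim=1)           # k nearest neighbours each
    weights = (topv / T).exp()
    n_cls = int(train_labels.max()) + 1
    votes = torch.zeros(test.size(0), n_cls)
    votes.scatter_add_(1, train_labels[topi], weights)
    return votes.argmax(dim=1)
```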
The following table summarizes DINO's performance on ImageNet-1k with different architectures:
| Model | Parameters | Patch Size | k-NN Accuracy | Linear Eval Accuracy |
|---|---|---|---|---|
| ResNet-50 | 23M | N/A | 67.5% | 75.3% |
| ViT-S/16 | 21M | 16x16 | 74.5% | 77.0% |
| ViT-S/8 | 21M | 8x8 | 78.3% | 79.7% |
| ViT-B/16 | 85M | 16x16 | 76.1% | 78.2% |
| ViT-B/8 | 85M | 8x8 | 77.4% | 80.1% |
These results demonstrated that Vision Transformers trained with DINO's self-supervised approach achieved competitive performance with methods that relied on labeled data. The ViT-B/8 configuration achieved 80.1% top-1 accuracy in linear evaluation on ImageNet, which was state-of-the-art for self-supervised methods at the time.
DINOv2, introduced in the paper "DINOv2: Learning Robust Visual Features without Supervision" by Maxime Oquab et al., was released by Meta AI in April 2023. The paper had 26 authors from Meta AI Research and Inria, including Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jégou, Julien Mairal, Patrick Labatut, Armand Joulin, and Piotr Bojanowski.
DINOv2 demonstrated that existing self-supervised pretraining methods, when scaled properly with curated data and modern training techniques, could produce all-purpose visual features that work across a wide range of tasks without fine-tuning. The goal was to produce visual features analogous to what foundation models like GPT had achieved for natural language processing: general-purpose representations that transfer to many downstream tasks out of the box.
DINOv2 combines several self-supervised objectives into a unified training framework: an image-level DINO loss between the student's and teacher's class tokens, a patch-level iBOT loss in which the student predicts the teacher's features for masked patches, Sinkhorn-Knopp centering (borrowed from SwAV) in place of the original softmax centering for the teacher, and a KoLeo regularizer that encourages features within a batch to spread out uniformly.
Additional training innovations included a custom FlashAttention implementation for fast and memory-efficient attention, stochastic depth with a 40% drop rate, fully-sharded data parallel (FSDP) training across GPUs with mixed precision, and a high-resolution adaptation phase at 518x518 pixels during the final training iterations.
A major contribution of DINOv2 was the careful curation of training data. Rather than simply using a large uncurated web crawl, the authors built the LVD-142M (Large Vision Dataset, 142 million images) dataset through a systematic pipeline: curated images were gathered from sources such as ImageNet and other task-specific datasets, a large pool of uncurated web images was deduplicated using copy-detection embeddings, and additional images were retrieved from that pool by embedding similarity to the curated seeds.
The result was a diverse, high-quality dataset of 142 million images assembled entirely without manual annotation or text metadata.
DINOv2 was trained across a family of Vision Transformer architectures, all using a patch size of 14x14 pixels:
| Model | Parameters | Embedding Dim | Attention Heads |
|---|---|---|---|
| ViT-S/14 | 21M | 384 | 6 |
| ViT-B/14 | 86M | 768 | 12 |
| ViT-L/14 | 300M | 1,024 | 16 |
| ViT-g/14 | 1.1B | 1,536 | 24 |
The largest model, ViT-g/14 with 1.1 billion parameters, served as the teacher. The three smaller variants (ViT-S, ViT-B, ViT-L) were produced through knowledge distillation from the frozen ViT-g teacher, which proved more effective than training each smaller model from scratch.
| Model | k-NN Accuracy | Linear Probing Accuracy |
|---|---|---|
| ViT-S/14 | 79.0% | 81.1% |
| ViT-B/14 | 82.1% | 84.5% |
| ViT-L/14 | 83.5% | 86.3% |
| ViT-g/14 | 83.5% | 86.5% |
The ViT-g/14 model achieved 86.5% top-1 accuracy on ImageNet-1k with a simple linear probe on frozen features, matching the performance of OpenCLIP ViT-G/14 (86.2%) and EVA-CLIP ViT-g/14 (86.4%) while using no text supervision.
Meta also released DINOv2 variants with registers, an architectural modification introduced in a follow-up paper by Darcet et al. (2024) that adds extra learnable tokens to reduce artifacts in the attention maps. The register variants achieved slightly improved performance, with ViT-g/14 reaching 87.1% linear probing accuracy.
DINOv2 showed strong robustness to distribution shifts:
| Benchmark | DINOv2 ViT-g/14 | OpenCLIP ViT-G/14 |
|---|---|---|
| ImageNet-ReaL | 89.6% | N/A |
| ImageNet-V2 | 78.4% | N/A |
| ImageNet-A | 75.9% | 63.8% |
The 12.1 percentage point advantage on ImageNet-A (natural adversarial examples) was particularly noteworthy, suggesting that DINOv2 features are more robust than language-supervised features.
DINOv2 excelled on tasks requiring spatial understanding, using only simple linear probes on frozen features:
Semantic Segmentation (mIoU, linear probe):
| Dataset | DINOv2 ViT-g/14 |
|---|---|
| ADE20k | 53.0% |
| Cityscapes | 81.0% |
| Pascal VOC | 86.2% |
Monocular Depth Estimation (RMSE, lower is better):
| Dataset | DINOv2 ViT-g/14 |
|---|---|
| NYUd | 0.298 |
| KITTI | 2.35 |
DINOv2 features also transferred well to video tasks without any video-specific training:
| Dataset | DINOv2 ViT-g/14 | OpenCLIP ViT-G/14 |
|---|---|---|
| Kinetics-400 | 78.4% | 78.3% |
| UCF-101 | 91.2% | 90.7% |
| Something-Something v2 | 38.3% | 35.8% |
Training DINOv2 required 22,016 GPU-hours on NVIDIA A100-40GB GPUs. Compared to the iBOT baseline using the same hardware, DINOv2 was 2x faster and used one-third the memory, thanks to engineering optimizations like FlashAttention and sequence packing. The estimated carbon footprint for a full reproducible training run was 3.7 metric tons of CO2 equivalent.
Like its predecessor, DINOv2 exhibited striking emergent properties when patch features are analyzed with PCA (Principal Component Analysis): the first principal component separates foreground objects from the background without any supervision, and subsequent components correspond to object parts that match consistently across different images of the same category.
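This kind of PCA visualisation takes only a few lines: project per-patch features onto their leading principal components and reshape back to the patch grid. The sketch below is illustrative; `patch_feats` would come from a DINOv2-style backbone:

```python
import torch


def pca_patch_map(patch_feats: torch.Tensor, h: int, w: int,
                  k: int = 3) -> torch.Tensor:
    """Project per-patch features (shape (h*w, d)) onto their first k
    principal components and return an (h, w, k) map, e.g. to visualise
    the foreground/background split carried by the first component."""
    x = patch_feats - patch_feats.mean(dim=0, keepdim=True)
    # Rows of Vh are the principal directions (right singular vectors).
    _, _, vh = torch.linalg.svd(x, full_matrices=False)
    proj = x @ vh[:k].T                    # (h*w, k) component scores
    # Min-max normalise each component to [0, 1] for display as an image.
    proj = (proj - proj.min(0).values) / (
        proj.max(0).values - proj.min(0).values + 1e-8)
    return proj.reshape(h, w, k)
```

With k = 3 the result can be rendered directly as an RGB image over the patch grid, which is how the paper's qualitative figures are produced.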
DINOv3 was introduced in August 2025 by a team of 26 researchers at Meta AI led by Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, and Federico Baldassarre. The paper, published as arXiv:2508.10104, represented a major leap in scale and capability over DINOv2.
DINOv3 trained a self-supervised vision model with 7 billion parameters on 1.7 billion unlabeled images, making it roughly 7x larger and trained on 12x more data than DINOv2. For the first time, a single frozen vision backbone outperformed specialized solutions on multiple dense prediction tasks, including object detection and semantic segmentation.
One of the central technical contributions of DINOv3 was Gram anchoring, a regularization technique designed to solve a known but previously unsolved problem: the degradation of dense (patch-level) feature quality during long training schedules. As self-supervised vision models train for more iterations, their global (image-level) features tend to improve, but their local (patch-level) features can deteriorate. This creates a difficult tradeoff between dense and global representation quality.
Gram anchoring works by enforcing patch-level consistency through the Gram matrix of the model's features. The Gram matrix captures all pairwise dot products between patch features. The method aligns the current model's Gram matrix with that of an earlier, more stable version of the teacher (from around 100k to 200k training iterations). The loss is formulated as:
L_gram = ||X_S * X_S^T - X_G * X_G^T||_F^2
where X_S represents the current student features, X_G represents the reference Gram teacher features, and ||.||_F denotes the Frobenius norm. This regularization is applied after 1 million training iterations using high-resolution (2x input resolution) features.
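A sketch of this loss on a single image's patch features, assuming L2-normalised features so that the Gram entries are pairwise cosine similarities between patches (the normalisation is an assumption of this sketch, consistent with the cosine-similarity reading of the Gram matrix):

```python
import torch
import torch.nn.functional as F


def gram_anchoring_loss(student_patches: torch.Tensor,
                        gram_teacher_patches: torch.Tensor) -> torch.Tensor:
    """|| X_S X_S^T - X_G X_G^T ||_F^2 on L2-normalised patch features.
    Both inputs have shape (n_patches, dim); dims need not match, since
    only the (n_patches, n_patches) Gram matrices are compared."""
    xs = F.normalize(student_patches, dim=-1)
    xg = F.normalize(gram_teacher_patches, dim=-1)
    return ((xs @ xs.T - xg @ xg.T) ** 2).sum()
```

Because only pairwise similarities are constrained, the loss is invariant to any rotation of the student's feature space: the student remains free to improve its global features as long as the relative geometry between patches stays close to the Gram teacher's.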
DINOv3 expanded the data curation pipeline from DINOv2 to build a dataset of 1.689 billion images, called LVD-1689M. The approach combined clustering-based balancing of a large uncurated image pool with retrieval of images similar to curated seed datasets.
The teacher model was a ViT-7B with 6.7 billion parameters, 40 transformer blocks, a 4096-dimensional embedding space, 32 attention heads, patch size 16, and RoPE (Rotary Position Embedding) positional encodings with jittering. Training used a constant learning rate over 1 million iterations, departing from the cosine scheduling used in earlier work.
From the ViT-7B teacher, a comprehensive family of student models was distilled:
| Model | Parameters |
|---|---|
| ViT-S | 21M |
| ViT-S+ | 29M |
| ViT-B | 86M |
| ViT-L | 300M |
| ViT-H+ | 800M |
| ConvNeXt-T/S/B/L | Various |
The inclusion of ConvNeXt-based variants was a notable addition, providing deployment-friendly alternatives for resource-constrained environments.
DINOv3 was evaluated across 15 diverse visual tasks and more than 60 benchmarks:
| Task | Benchmark | DINOv3 Result |
|---|---|---|
| Image Classification | ImageNet-1k (linear) | 88.2% |
| Semantic Segmentation | ADE20k (mIoU, frozen) | 63.0% |
| Object Detection | COCO (mAP, frozen) | 66.1 |
| Depth Estimation | NYUv2 (RMSE) | 0.281 |
| OOD Classification | ObjectNet | 72.8% |
Compared to DINOv2, DINOv3 showed significant improvements on dense prediction tasks (ADE20k improved from 53.0% to 63.0% mIoU), while also improving image classification (from 86.5% to 88.2% on ImageNet). The Gram anchoring technique was directly responsible for large gains in segmentation quality; ablation studies showed that it improved Pascal VOC mIoU from 50.3% to 55.7%.
DINOv3 introduced several post-hoc strategies, including adapting the backbone to higher input resolutions, distilling the ViT-7B teacher into the smaller student family, and aligning the frozen backbone with a text encoder to enable CLIP-style zero-shot use.
Meta highlighted several real-world DINOv3 use cases, including tree canopy height mapping with the World Resources Institute to support forest monitoring, and vision systems for Mars exploration robots developed at NASA's Jet Propulsion Laboratory.
The following table compares the DINO family with other prominent vision representation learning approaches:
| Feature | DINO (2021) | DINOv2 (2023) | DINOv3 (2025) | CLIP (2021) | MAE (2022) |
|---|---|---|---|---|---|
| Organization | Facebook AI | Meta AI | Meta AI | OpenAI | Meta AI |
| Supervision Type | Self-supervised | Self-supervised | Self-supervised | Language-supervised | Self-supervised |
| Method | Self-distillation | DINO + iBOT + SwAV | DINO + Gram anchoring | Contrastive (image-text) | Masked image modeling |
| Training Data | ImageNet-1k | LVD-142M (142M images) | LVD-1689M (1.7B images) | 400M image-text pairs | ImageNet-1k |
| Largest Model | ViT-B/8 (85M) | ViT-g/14 (1.1B) | ViT-7B (6.7B) | ViT-L/14 (428M) | ViT-H/16 (632M) |
| ImageNet Linear Probe (best) | 80.1% | 86.5% | 88.2% | 85.4% (ViT-L/14@336px) | 73.5% (ViT-L) |
| ImageNet Fine-tune (best) | N/A | N/A | N/A | N/A | 87.8% (ViT-H, 448px) |
| Zero-Shot Classification | No | No | No | Yes (75.4% ViT-L/14) | No |
| Dense Prediction (frozen) | Emergent attention maps | Strong (segmentation, depth) | State-of-the-art | Weak | Weak |
| Text Understanding | No | No | No | Yes | No |
| Requires Fine-Tuning | No (k-NN works well) | No (linear probe works well) | No (linear probe works well) | No (zero-shot) | Yes (for best results) |
DINO/DINOv2/DINOv3 vs. CLIP: CLIP uses language supervision through contrastive learning on image-text pairs, giving it natural zero-shot classification and text-image retrieval capabilities. However, CLIP features are weaker for dense prediction tasks like segmentation and depth estimation. Studies have shown that CLIP captures high-level semantic information (object categories, text-relevant features), while DINO features are more responsive to low-level visual properties like colors, textures, and spatial structure. When both are used as visual encoders in multimodal language models, CLIP excels at text-intensive tasks, while DINO slightly outperforms on vision-centric ones.
DINO/DINOv2/DINOv3 vs. MAE: MAE (Masked Autoencoders) learns representations by reconstructing masked image patches, an approach inspired by masked language modeling in NLP. While MAE achieves strong results when fine-tuned (87.8% on ImageNet with ViT-H), its frozen features are significantly weaker than DINO's for downstream tasks. MAE achieves only 73.5% linear probing accuracy with ViT-L, compared to DINOv2's 86.3% with the same architecture class. MAE features also lack sufficient semantic information for global understanding, making them less suitable as frozen visual backbones.
The DINO family of models has found widespread use across many computer vision applications:
Semantic and Instance Segmentation. DINO's emergent attention maps and DINOv2's strong dense features enable high-quality segmentation without task-specific training. Researchers have used frozen DINO features for zero-shot and few-shot segmentation, part discovery, and unsupervised object localization.
Monocular Depth Estimation. DINOv2 achieved state-of-the-art results for monocular depth estimation on benchmarks like NYU Depth v2 and KITTI, using only a simple linear probe on frozen features. This capability has practical applications in robotics, warehouse safety systems, and autonomous navigation.
Medical Imaging. Researchers have adapted DINOv2 for radiology and surgical applications. The "Surgical-DINO" system adapts DINOv2 for depth estimation in endoscopic surgery, aiding in 3D reconstruction and surgical navigation.
Remote Sensing. DINOv3 includes a dedicated satellite backbone trained on MAXAR imagery. Applications include deforestation detection, canopy height estimation, and land cover classification.
Visual Backbone for Multimodal Models. DINOv2 features have been used as the visual branch in multimodal large language models. Studies show that DINOv2 provides fine-grained localization information that complements the high-level semantic features from CLIP, and combining both encoders often yields better results than using either alone.
Image Retrieval and Matching. DINOv2 achieved strong results on instance recognition benchmarks (Oxford Hard: 52.3% mAP, Paris Hard: 82.6% mAP), making it useful for visual search, copy detection, and image matching applications.
Video Understanding. Despite being trained only on static images, DINO and DINOv2 features transfer effectively to video tasks including action recognition (91.2% on UCF-101), video object segmentation (strong performance on DAVIS), and temporal understanding.
All three versions of DINO are open source, with code and pretrained weights released on GitHub: DINO and DINOv2 under the Apache 2.0 license, and DINOv3 under its own license.
All models can be loaded with a few lines of PyTorch code via torch.hub or through the Hugging Face transformers library, making them accessible for both research and production use.
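For example, a DINOv2 backbone can be pulled directly from torch.hub; the entry-point name below is one of those published in the facebookresearch/dinov2 repository (running this downloads the pretrained weights on first use):

```python
import torch

# Load the small DINOv2 backbone (ViT-S/14) from torch.hub.
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
model.eval()

# A 224x224 input is split into a 16x16 grid of 14x14 patches; the
# forward pass returns one image-level embedding (384-dim for ViT-S/14).
with torch.no_grad():
    feats = model(torch.randn(1, 3, 224, 224))
print(feats.shape)

# Roughly equivalent via the Hugging Face transformers library:
# from transformers import AutoImageProcessor, AutoModel
# processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
# model = AutoModel.from_pretrained("facebook/dinov2-base")
```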