# DINO (computer vision)

> Source: https://aiwiki.ai/wiki/dino_model
> Updated: 2026-06-25
> Categories: Computer Vision, Deep Learning, Machine Learning, Meta AI
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**DINO** (self-**DI**stillation with **NO** labels) is a family of [self-supervised learning](/wiki/self_supervised_learning) methods for [computer vision](/wiki/computer_vision) from [Meta AI](/wiki/meta_ai) that trains [Vision Transformers](/wiki/vision_transformer) (ViTs) on unlabeled images and produces general-purpose visual features rivaling or surpassing those learned with human-labeled supervision [1][2]. The original DINO, introduced by Caron et al. in 2021, showed that self-supervised ViTs develop emergent properties absent from supervised models: their attention maps act as unsupervised object-segmentation masks, and their frozen features reach 78.3% top-1 accuracy on [ImageNet](/wiki/imagenet) using a simple k-nearest-neighbor classifier with no fine-tuning [1]. The family has since scaled to DINOv2 (2023) and DINOv3 (2025). The latter is a 7-billion-parameter backbone trained on 1.7 billion images whose frozen features match or beat specialized state-of-the-art systems across roughly 60 benchmarks [2][3].

The DINO family has become one of the most influential lines of research in self-supervised visual representation learning, with applications spanning image classification, [semantic segmentation](/wiki/image_segmentation), [depth estimation](/wiki/depth_estimation), [object detection](/wiki/object_detection), and video understanding. (Note: the name "DINO" is also used for an unrelated object detector, DETR with Improved deNoising anchOr boxes; this article covers the self-supervised representation-learning method.)

## ELI5: what is DINO?

Imagine showing a computer millions of photos but never telling it what is in any of them, no labels at all. DINO teaches itself to see by playing a copycat game between two copies of the same network: a "student" and a "teacher." The student looks at small zoomed-in patches of a picture, the teacher looks at the whole picture, and the student tries to guess what the teacher is thinking. To win, the student has to figure out that a paw and an ear belong to the same dog even when it only sees a tiny piece. After this game, the network has quietly learned where objects are and how to tell them apart, so much so that its internal "attention" lights up around each object like an automatic outline, even though nobody ever drew one for it.

## What is DINO (2021)?

The original DINO paper, titled "Emerging Properties in Self-Supervised Vision Transformers," was published at the IEEE/CVF International Conference on Computer Vision (ICCV) in October 2021 [1]. The authors were Mathilde Caron, Hugo Touvron, Ishan Misra, Herve Jegou, Julien Mairal, Piotr Bojanowski, and Armand Joulin, all affiliated with Facebook AI Research ([FAIR](/wiki/fair)) and Inria [1].

The central question of the paper was whether self-supervised learning could unlock properties in Vision Transformers that are not present in supervised ViTs or in [convolutional neural networks](/wiki/convolutional_neural_network) (ConvNets). The answer turned out to be a resounding yes. As the authors put it, "self-supervised ViT features contain explicit information about the semantic segmentation of an image, which does not emerge as clearly with supervised ViTs, nor with convnets" [1].

## How does DINO work?

DINO uses a student-teacher framework inspired by [knowledge distillation](/wiki/knowledge_distillation), but without any labeled data. The authors describe the method as "a form of self-distillation with no labels" [1]. Both the student and teacher networks share the same architecture (typically a [Vision Transformer](/wiki/vision_transformer)), but their parameters evolve through different pathways.

### Student and teacher networks

The student network is trained via standard backpropagation. The teacher network, by contrast, is not trained with gradients. Instead, its weights are updated as an exponential moving average (EMA) of the student's weights. This is known as a **momentum teacher**. At each training step, the teacher parameters are updated according to:

`theta_teacher = m * theta_teacher + (1 - m) * theta_student`

The momentum parameter `m` follows a cosine schedule from 0.996 to 1.0 during training, meaning the teacher becomes increasingly stable as training progresses [1].
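The update rule and schedule can be sketched in a few lines. This is a minimal illustration in plain Python lists rather than framework tensors; the function names and the toy weights are assumptions made for the example, and only the EMA rule and the 0.996 to 1.0 cosine schedule come from the description above.

```python
import math

def teacher_momentum(step, total_steps, base_m=0.996, final_m=1.0):
    """Cosine schedule rising from base_m at step 0 to final_m at the last step."""
    progress = step / max(1, total_steps - 1)
    return final_m - (final_m - base_m) * (math.cos(math.pi * progress) + 1) / 2

def ema_update(teacher, student, m):
    """Element-wise update: m * teacher + (1 - m) * student."""
    return [m * t + (1 - m) * s for t, s in zip(teacher, student)]

# Toy values standing in for flattened network parameters (hypothetical).
teacher = [0.0, 1.0]
student = [1.0, 3.0]
m = teacher_momentum(step=0, total_steps=100)
teacher = ema_update(teacher, student, m)
```

Because `m` approaches 1.0, the teacher moves less and less toward the student late in training, which is the stabilizing effect described above.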

### Multi-crop augmentation strategy

DINO applies a multi-crop augmentation strategy to each training image. Two **global crops** are generated at 224x224 resolution, each covering more than 50% of the original image. Several **local crops** are generated at 96x96 resolution, each covering less than 50% of the image. The teacher network processes only the global crops, while the student network processes all crops (both global and local). This asymmetry encourages the student to learn "local-to-global" correspondences, aligning its representations of small local patches with the teacher's broader contextual understanding [1].
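The routing of crops to the two networks can be sketched as follows. This is a simplified illustration, not the DINO data pipeline: the helper names, the 5% lower bound on local crop area, the fixed aspect ratio, and the default of six local crops are all assumptions of this sketch, and color augmentations and resizing are omitted.

```python
import random

def sample_crop(img_w, img_h, min_frac, max_frac, rng):
    """Sample a crop box covering a fraction of the image area in [min_frac, max_frac]."""
    frac = rng.uniform(min_frac, max_frac)
    # Assumption: keep the image aspect ratio so that area = frac * image area.
    w = img_w * frac ** 0.5
    h = img_h * frac ** 0.5
    x0 = rng.uniform(0, img_w - w)
    y0 = rng.uniform(0, img_h - h)
    return (x0, y0, w, h)

def multi_crop(img_w, img_h, n_local=6, seed=0):
    """Return crop boxes grouped by which network sees them."""
    rng = random.Random(seed)
    global_crops = [sample_crop(img_w, img_h, 0.5, 1.0, rng) for _ in range(2)]
    local_crops = [sample_crop(img_w, img_h, 0.05, 0.5, rng) for _ in range(n_local)]
    return {
        "teacher_views": global_crops,                # would be resized to 224x224
        "student_views": global_crops + local_crops,  # local crops resized to 96x96
    }

views = multi_crop(640, 480)
```

The teacher receives two views and the student receives eight in this configuration, so every local crop is always compared against a wider context.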

### Centering and sharpening

Two critical mechanisms prevent training collapse in DINO [1]:

- **Centering** is applied to the teacher's output before the softmax. DINO maintains a running average (the "center") of the teacher's output vectors across the batch and subtracts it from each output. This prevents any single dimension from dominating, encouraging the model to use the full output space. The center is updated with a momentum of approximately 0.9.

- **Sharpening** is achieved by using a low temperature in the softmax function for the teacher's outputs. This makes the output distribution more peaked and confident. The teacher temperature is warmed up from 0.04 to 0.07 during the early stages of training, while the student uses a fixed, higher temperature (typically 0.1).

Centering and sharpening have complementary effects: centering keeps any single dimension from dominating but, applied alone, pushes the output toward a uniform distribution, while sharpening has the opposite effect and encourages peaked, confident predictions. Applying both balances the two failure modes and avoids collapse [1].
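The two operations can be sketched in plain Python. The function names and toy logits are assumptions of this example; the temperatures (0.04 for the teacher, 0.1 for the student) and the center momentum of 0.9 are the values quoted above.

```python
import math

def softmax(logits, temperature):
    """Numerically stable softmax with a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def teacher_probs(logits, center, temperature=0.04):
    """Center the teacher logits, then sharpen with a low-temperature softmax."""
    centered = [x - c for x, c in zip(logits, center)]
    return softmax(centered, temperature)

def update_center(center, batch_logits, momentum=0.9):
    """Exponential moving average of the batch-mean teacher output."""
    n = len(batch_logits)
    batch_mean = [sum(column) / n for column in zip(*batch_logits)]
    return [momentum * c + (1 - momentum) * b for c, b in zip(center, batch_mean)]

# Toy example (hypothetical values): dimension 0 is consistently large in the batch.
center = update_center([0.0, 0.0, 0.0], [[2.0, 0.0, 0.0], [4.0, 0.0, 0.0]])
probs = teacher_probs([0.1, 0.0, -0.1], [0.0, 0.0, 0.0])
```

The running center grows along the dimension that keeps firing, so subtracting it counteracts dominance, while the low temperature makes `probs` more peaked than a softmax at the student temperature would be.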

### Loss function

The training objective is a cross-entropy loss that aligns the student's output probability distribution with the teacher's output distribution across all cross-view pairs (global crops fed to the teacher, all crops fed to the student), excluding self-pairs where the same crop is fed to both networks [1].
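A minimal sketch of this objective, assuming the teacher and student outputs have already been turned into probability distributions. Averaging over the pairs and the convention that global crops come first in the student list are assumptions of this sketch, and the function names are illustrative.

```python
import math

def cross_entropy(p_teacher, p_student):
    """H(p_t, p_s) = -sum_k p_t[k] * log(p_s[k])."""
    return -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))

def dino_loss(teacher_probs, student_probs):
    """Average cross-entropy over (teacher view, student view) pairs, skipping same-crop pairs.

    teacher_probs: distributions for the global crops only.
    student_probs: distributions for all crops, global crops first and in the
                   same order as in teacher_probs (assumption of this sketch).
    """
    total, n_terms = 0.0, 0
    for t_idx, p_t in enumerate(teacher_probs):
        for s_idx, p_s in enumerate(student_probs):
            if s_idx == t_idx:  # the same crop was fed to both networks
                continue
            total += cross_entropy(p_t, p_s)
            n_terms += 1
    return total / n_terms

# Two global crops for the teacher, two global plus one local crop for the student.
uniform = [0.5, 0.5]
loss = dino_loss([uniform, uniform], [uniform, uniform, uniform])
```

With two teacher views and three student views there are 2 x 3 - 2 = 4 loss terms. In training, gradients flow only through the student distributions.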

## What emergent properties does DINO have?

The most striking finding of the DINO paper was that self-supervised ViTs develop emergent properties not seen in supervised ViTs or ConvNets [1].

**[Attention](/wiki/attention) maps as segmentation masks.** The self-attention maps of the final layer of DINO-trained ViTs contain explicit information about object boundaries and semantic regions. Different attention heads attend to different semantic parts of an image, effectively producing unsupervised segmentation masks. This property does not appear as clearly in ViTs trained with supervised classification objectives [1].

**Strong k-NN classification.** DINO features are excellent k-nearest-neighbor (k-NN) classifiers without any fine-tuning, linear classifier, or data augmentation. A small ViT trained with DINO achieved 78.3% top-1 accuracy on [ImageNet](/wiki/imagenet) using only a k-NN classifier applied directly to frozen features [1].
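A k-NN evaluation on frozen features can be sketched as below. The cosine-similarity ranking with exponentially weighted votes, the values of `k` and `temperature`, and the toy feature bank are assumptions of this example rather than details given above.

```python
import math
from collections import defaultdict

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def knn_predict(query, bank_features, bank_labels, k=3, temperature=0.07):
    """Classify a frozen feature by a similarity-weighted vote of its k nearest neighbors."""
    q = l2_normalize(query)
    sims = []
    for feat, label in zip(bank_features, bank_labels):
        f = l2_normalize(feat)
        sims.append((sum(a * b for a, b in zip(q, f)), label))
    sims.sort(key=lambda pair: pair[0], reverse=True)
    votes = defaultdict(float)
    for sim, label in sims[:k]:
        votes[label] += math.exp(sim / temperature)  # closer neighbors count more
    return max(votes, key=votes.get)

# Hypothetical 2-D "features" for four labeled training images.
bank = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = ["cat", "cat", "dog", "dog"]
prediction = knn_predict([0.8, 0.2], bank, labels)
```

No parameters are trained here: the quality of the prediction depends entirely on how well the frozen features cluster by class.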

### How accurate is DINO on ImageNet?

The following table summarizes DINO's performance on ImageNet-1k with different architectures [1]:

| Model | Parameters | Patch Size | k-NN Accuracy | Linear Eval Accuracy |
|---|---|---|---|---|
| ResNet-50 | 23M | N/A | 67.5% | 75.3% |
| ViT-S/16 | 21M | 16x16 | 74.5% | 77.0% |
| ViT-S/8 | 21M | 8x8 | 78.3% | 79.7% |
| ViT-B/16 | 85M | 16x16 | 76.1% | 78.2% |
| ViT-B/8 | 85M | 8x8 | 77.4% | 80.1% |

These results demonstrated that Vision Transformers trained with DINO's self-supervised approach achieved competitive performance with methods that relied on labeled data. The ViT-B/8 configuration achieved 80.1% top-1 accuracy in linear evaluation on ImageNet, which was state-of-the-art for self-supervised methods at the time [1].


## What is DINOv2 (2023)?

DINOv2, introduced in the paper "DINOv2: Learning Robust Visual Features without Supervision" by Maxime Oquab et al., was released by Meta AI in April 2023 [2][9]. The paper had 26 authors from Meta AI Research and Inria, including Oquab, Timothee Darcet, Theo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, and Piotr Bojanowski [2].

DINOv2 demonstrated that existing self-supervised pretraining methods, when scaled properly with curated data and modern training techniques, could produce "all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning" [2]. The authors reported that DINOv2 features "surpass the best available all-purpose features, OpenCLIP, on most of the benchmarks" [2]. The goal was to produce visual features analogous to what foundation models like [GPT](/wiki/gpt4) had achieved for natural language processing: general-purpose representations that transfer to many downstream tasks out of the box without [fine-tuning](/wiki/fine_tuning).

### How was DINOv2 trained?

DINOv2 combines several self-supervised objectives into a unified training framework [2]:

- **DINO loss**: Applied at the image level using global crop tokens (the [CLS] token), following the same self-distillation approach as the original DINO.
- **iBOT loss**: A masked image modeling objective applied at the patch level. Random patches of the input are masked, and the model must predict the teacher's representation of those patches. This is similar in spirit to [BERT](/wiki/bert)-style masked language modeling, but applied to image patches [7].
- **SwAV-style centering**: The Sinkhorn-Knopp algorithm (run for 3 iterations) replaces the simple centering mechanism from the original DINO, providing more balanced prototype assignment.
- **KoLeo regularizer**: Encourages a uniform distribution of features within each batch, preventing feature collapse.
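As one concrete example from the list above, the KoLeo regularizer can be sketched as the negative mean log-distance from each feature to its nearest neighbor in the batch. The L2 normalization, the `eps` constant, and the toy batches are assumptions of this sketch, written in plain Python rather than as a training-ready implementation.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def koleo_loss(features, eps=1e-8):
    """Negative mean of log nearest-neighbor distances within the batch.

    The loss is low when features are spread out and high when they bunch together.
    """
    feats = [l2_normalize(f) for f in features]
    n = len(feats)
    total = 0.0
    for i in range(n):
        nearest = min(math.dist(feats[i], feats[j]) for j in range(n) if j != i)
        total += math.log(nearest + eps)
    return -total / n

spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
clustered = [[1.0, 0.0], [0.99, 0.01], [1.0, 0.02], [0.98, 0.0]]
spread_loss = koleo_loss(spread)
clustered_loss = koleo_loss(clustered)
```

Minimizing this term pushes each feature away from its closest neighbor, which is how it encourages a uniform spread within the batch.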

Additional training innovations included a custom [FlashAttention](/wiki/flash_attention) implementation for fast and memory-efficient attention, stochastic depth with a 40% drop rate, fully-sharded data parallel (FSDP) training across GPUs with mixed precision, and a high-resolution adaptation phase at 518x518 pixels during the final training iterations [2].

### What is the LVD-142M dataset?

A major contribution of DINOv2 was the careful curation of training data. Rather than simply using a large uncurated web crawl, the authors built the **LVD-142M** (Large Vision Dataset, 142 million images) dataset through a systematic pipeline [2]:

1. **Collection**: Approximately 1.2 billion unique images were gathered from the internet.
2. **Deduplication**: A copy detection pipeline removed near-duplicate images.
3. **Self-supervised retrieval**: Uncurated images were clustered using k-means on features from a pretrained ViT-H/16. For each cluster, images were retrieved that were visually similar to those in curated reference datasets.
4. **Reference datasets**: The curated seeds included ImageNet-22k, the training split of ImageNet-1k, Google Landmarks, and several fine-grained classification datasets.
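The retrieval idea in step 3 can be illustrated with a much simplified sketch: for every curated seed embedding, keep the most similar uncurated images. The k-means clustering stage is omitted, and the function names, the value of `k`, and the toy embeddings are assumptions of this example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve_similar(seed_embeddings, pool_embeddings, k=2):
    """Return sorted indices of pool images that are among the k nearest to any seed."""
    selected = set()
    for seed in seed_embeddings:
        ranked = sorted(
            range(len(pool_embeddings)),
            key=lambda i: cosine(seed, pool_embeddings[i]),
            reverse=True,
        )
        selected.update(ranked[:k])
    return sorted(selected)

# Hypothetical 2-D embeddings: one curated seed, four uncurated candidates.
seeds = [[1.0, 0.05]]
pool = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
kept = retrieve_similar(seeds, pool)
```

Only the two candidates that point in nearly the same direction as the seed are kept, so the curated set steers which part of the web crawl ends up in the training data.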

The result was a diverse, high-quality dataset of 142 million images assembled entirely without manual annotation or text metadata [2].

### Model architectures

DINOv2 was trained across a family of [Vision Transformer](/wiki/vision_transformer) architectures, all using a patch size of 14x14 pixels [2]:

| Model | Parameters | Embedding Dim | Attention Heads |
|---|---|---|---|
| ViT-S/14 | 21M | 384 | 6 |
| ViT-B/14 | 86M | 768 | 12 |
| ViT-L/14 | 300M | 1,024 | 16 |
| ViT-g/14 | 1.1B | 1,536 | 24 |

The largest model, ViT-g/14 with 1.1 billion parameters, served as the teacher. The three smaller variants (ViT-S, ViT-B, ViT-L) were produced through [knowledge distillation](/wiki/knowledge_distillation) from the frozen ViT-g teacher, which proved more effective than training each smaller model from scratch [2].

### How accurate is DINOv2?

#### ImageNet-1k classification (frozen features)

| Model | k-NN Accuracy | Linear Probing Accuracy |
|---|---|---|
| ViT-S/14 | 79.0% | 81.1% |
| ViT-B/14 | 82.1% | 84.5% |
| ViT-L/14 | 83.5% | 86.3% |
| ViT-g/14 | 83.5% | 86.5% |

The ViT-g/14 model achieved 86.5% top-1 accuracy on ImageNet-1k with a simple linear probe on frozen features, matching the performance of OpenCLIP ViT-G/14 (86.2%) and EVA-CLIP ViT-g/14 (86.4%) while using no text supervision [2].

DINOv2 also released models **with registers**, an architectural modification introduced in a follow-up paper by Darcet et al. (2024) that adds extra learnable tokens to reduce attention map artifacts. The register variants achieved slightly improved performance, with ViT-g/14 reaching 87.1% linear probing accuracy [4].

#### Domain generalization

DINOv2 showed strong robustness to distribution shifts [2]:

| Benchmark | DINOv2 ViT-g/14 | OpenCLIP ViT-G/14 |
|---|---|---|
| ImageNet-ReaL | 89.6% | N/A |
| ImageNet-V2 | 78.4% | N/A |
| ImageNet-A | 75.9% | 63.8% |

The 12.1 percentage point advantage on ImageNet-A (a benchmark of natural adversarial examples) was particularly noteworthy, suggesting that DINOv2 features are more robust to this kind of distribution shift than language-supervised features [2].

#### Dense prediction tasks

DINOv2 excelled on tasks requiring spatial understanding, using only simple linear probes on frozen features [2].

**Semantic Segmentation (mIoU, frozen backbone, multi-scale linear setup):**

| Dataset | DINOv2 ViT-g/14 |
|---|---|
| ADE20k | 53.0% |
| Cityscapes | 81.0% |
| Pascal VOC | 86.2% |

**Monocular Depth Estimation (RMSE, lower is better):**

| Dataset | DINOv2 ViT-g/14 |
|---|---|
| NYUd | 0.298 |
| KITTI | 2.35 |

#### Video understanding

DINOv2 features also transferred well to video tasks without any video-specific training [2]:

| Dataset | DINOv2 ViT-g/14 | OpenCLIP ViT-G/14 |
|---|---|---|
| Kinetics-400 | 78.4% | 78.3% |
| UCF-101 | 91.2% | 90.7% |
| Something-Something v2 | 38.3% | 35.8% |

### How expensive was DINOv2 to train?

Training DINOv2 required 22,016 GPU-hours on NVIDIA A100-40GB GPUs. Compared to the iBOT baseline using the same hardware, DINOv2 was 2x faster and used one-third the memory, thanks to engineering optimizations like FlashAttention and sequence packing. The estimated carbon footprint for a full reproducible training run was 3.7 metric tons of CO2 equivalent [2].

### Emergent properties of DINOv2

Like its predecessor, DINOv2 exhibited striking emergent properties when analyzed through PCA (Principal Component Analysis) of patch features [2]:

- **Object part parsing**: The first few principal components of DINOv2 patch features cleanly separate different semantic parts of objects.
- **Foreground/background separation**: The first principal component consistently separates foreground objects from the background.
- **Cross-domain semantic matching**: DINOv2 features can match semantically corresponding parts across very different visual domains. For example, a bird's wing can be matched to an airplane's wing, despite the enormous difference in visual appearance.
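The foreground/background observation can be illustrated with a toy sketch that projects patch features onto their first principal component and thresholds the result. Power iteration in plain Python, the threshold at zero, and the 2-D toy features are assumptions of this example; real patch features have hundreds of dimensions and would normally be handled with a linear algebra library.

```python
def first_principal_component(rows, n_iter=100):
    """Return the leading principal direction and each row's projection onto it."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    # Covariance matrix of the centered features.
    cov = [[sum(c[a] * c[b] for c in centered) / n for b in range(d)] for a in range(d)]
    v = [1.0] * d
    for _ in range(n_iter):  # power iteration converges to the top eigenvector
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5 or 1.0
        v = [x / norm for x in w]
    projections = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, projections

# Hypothetical patch features: three "object" patches, three "background" patches.
patches = [[1.0, 0.1], [1.1, 0.0], [0.9, 0.05], [-1.0, 0.0], [-1.1, 0.1], [-0.9, 0.0]]
direction, projections = first_principal_component(patches)
mask = [p > 0 for p in projections]
```

The sign of a principal component is arbitrary, so which side of the threshold counts as foreground has to be decided separately.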

## What is DINOv3 (2025)?

DINOv3 was introduced in August 2025 (arXiv:2508.10104, posted 13 August 2025) by a team of 26 researchers at Meta AI, led by Oriane Simeoni, Huy V. Vo, Maximilian Seitzer, and Federico Baldassarre, among others [3][10]. The paper represented a major leap in scale and capability over DINOv2.

DINOv3 trained a self-supervised vision model with **7 billion parameters** on **1.7 billion unlabeled images**, making it roughly 7x larger and trained on 12x more data than DINOv2 [3]. For the first time, a single frozen vision backbone outperformed specialized solutions on multiple dense prediction tasks, including object detection and semantic segmentation, matching or surpassing the state of the art across roughly 60 benchmarks and 15 vision tasks [3][10].

### What is Gram anchoring?

One of the central technical contributions of DINOv3 was **Gram anchoring**, a regularization technique designed to solve a known but previously unsolved problem: the degradation of dense (patch-level) feature quality during long training schedules [3]. As self-supervised vision models train for more iterations, their global (image-level) features tend to improve, but their local (patch-level) features can deteriorate. This creates a difficult tradeoff between dense and global representation quality.

Gram anchoring works by enforcing patch-level consistency through the Gram matrix of the model's features. The Gram matrix captures all pairwise dot products between patch features. The method aligns the current model's Gram matrix with that of an earlier, more stable version of the teacher (from around 100k to 200k training iterations). The loss is formulated as:

`L_gram = ||X_S * X_S^T - X_G * X_G^T||_F^2`

where `X_S` represents the current student features, `X_G` represents the reference Gram teacher features, and `||.||_F` denotes the Frobenius norm. This regularization is applied after 1 million training iterations using high-resolution (2x input resolution) features [3].
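A minimal sketch of the Gram loss for a single image, with patch features stored as plain Python lists (one row per patch). L2-normalizing each row before taking dot products is an assumption of this sketch, as are the function names and toy features.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def gram_matrix(patch_features):
    """All pairwise dot products between (normalized) patch features: X @ X^T."""
    rows = [l2_normalize(r) for r in patch_features]
    return [[sum(a * b for a, b in zip(ri, rj)) for rj in rows] for ri in rows]

def gram_loss(student_feats, gram_teacher_feats):
    """Squared Frobenius norm of the difference between the two Gram matrices."""
    g_s = gram_matrix(student_feats)
    g_g = gram_matrix(gram_teacher_feats)
    return sum((s - g) ** 2 for row_s, row_g in zip(g_s, g_g) for s, g in zip(row_s, row_g))

# Rotating every feature leaves pairwise similarities unchanged, so the loss is zero.
rotated = gram_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [-1.0, 0.0]])
# Collapsing two distinct patches onto the same feature is penalized.
collapsed = gram_loss([[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])
```

The loss constrains only the similarity structure between patches, not the features themselves, which leaves the global representation free to keep changing.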

### Training data: LVD-1689M

DINOv3 expanded the data curation pipeline from DINOv2 to build a dataset of 1.689 billion images, called **LVD-1689M**. The approach combined [3]:

- Hierarchical k-means clustering for data organization
- Retrieval-based curation using seed datasets like ImageNet and Mapillary
- Raw datasets including ImageNet-1k and ImageNet-22k
- A sampling strategy mixing 10% homogeneous ImageNet-1k batches with 90% heterogeneous batches

### Model architecture and variants

The teacher model was a ViT-7B with 6.7 billion parameters, 40 transformer blocks, a 4096-dimensional embedding space, 32 attention heads, patch size 16, and RoPE ([Rotary Position Embedding](/wiki/rotary_position_embedding)) positional encodings with jittering. Training used a constant learning rate over 1 million iterations, departing from the cosine scheduling used in earlier work [3].

From the ViT-7B teacher, a comprehensive family of student models was distilled [3]:

| Model | Parameters |
|---|---|
| ViT-S | 21M |
| ViT-S+ | 29M |
| ViT-B | 86M |
| ViT-L | 300M |
| ViT-H+ | 840M |
| ConvNeXt-T/S/B/L | Various |

The inclusion of [ConvNeXt](/wiki/convnext)-based variants was a notable addition, providing deployment-friendly alternatives for resource-constrained environments [3].

### How accurate is DINOv3?

DINOv3 was evaluated across 15 diverse visual tasks and more than 60 benchmarks [3]:

| Task | Benchmark | DINOv3 Result |
|---|---|---|
| Image Classification | ImageNet-1k (linear) | 88.2% |
| Semantic Segmentation | ADE20k (mIoU, frozen) | 63.0% |
| Object Detection | COCO (mAP, frozen) | 66.1 |
| Depth Estimation | NYUv2 (RMSE) | 0.281 |
| OOD Classification | ObjectNet | 72.8% |

Compared to DINOv2, DINOv3 showed significant improvements on dense prediction tasks (ADE20k improved from 53.0% to 63.0% mIoU), while also improving image classification (from 86.5% to 88.2% on ImageNet). The Gram anchoring technique was directly responsible for large gains in segmentation quality; ablation studies showed that it improved Pascal VOC mIoU from 50.3% to 55.7% [3].

### Post-training enhancements

DINOv3 introduced several post-hoc strategies [3]:

- **Resolution scaling**: 10,000 iterations of training with mixed resolutions (512 and 768 pixels for global crops), enabling stable inference at resolutions exceeding 4096x4096 pixels.
- **Multi-student distillation**: Shared teacher inference across multiple student training groups, allowing efficient simultaneous distillation of all student variants.
- **Satellite backbone**: A specialized variant trained on MAXAR satellite imagery for remote sensing applications.

### What is DINOv3 used for?

Meta highlighted several real-world DINOv3 use cases [10]:

- **Environmental monitoring**: Partnership with the World Resources Institute for deforestation detection from satellite imagery.
- **Canopy height estimation**: Reduced estimation error from 4.1 meters to 1.2 meters in Kenya using satellite image analysis.
- **Healthcare, autonomous vehicles, manufacturing, and retail**: Broad applicability as an all-purpose visual backbone.

## How does DINO compare with CLIP and MAE?

The following table compares the DINO family with other prominent vision representation learning approaches:

| Feature | DINO (2021) | DINOv2 (2023) | DINOv3 (2025) | [CLIP](/wiki/clip) (2021) | [MAE](/wiki/masked_autoencoder) (2022) |
|---|---|---|---|---|---|
| Organization | Facebook AI | [Meta AI](/wiki/meta_ai) | Meta AI | [OpenAI](/wiki/openai) | Meta AI |
| Supervision Type | Self-supervised | Self-supervised | Self-supervised | Language-supervised | Self-supervised |
| Method | Self-distillation | DINO + iBOT + SwAV | DINO + Gram anchoring | Contrastive (image-text) | Masked image modeling |
| Training Data | ImageNet-1k | LVD-142M (142M images) | LVD-1689M (1.7B images) | 400M image-text pairs | ImageNet-1k |
| Largest Model | ViT-B/8 (85M) | ViT-g/14 (1.1B) | ViT-7B (6.7B) | ViT-L/14 (428M) | ViT-H/16 (632M) |
| ImageNet Linear Probe (best) | 80.1% | 86.5% | 88.2% | 86.2% (OpenCLIP ViT-G/14, as reported in [2]) | 73.5% (ViT-L) |
| ImageNet Fine-tune (best) | N/A | N/A | N/A | N/A | 87.8% (ViT-H, 448px) |
| Zero-Shot Classification | No | No | No | Yes (75.4% ViT-L/14) | No |
| Dense Prediction (frozen) | Emergent attention maps | Strong (segmentation, depth) | State-of-the-art | Weak | Weak |
| Text Understanding | No | No | No | Yes | No |
| Requires Fine-Tuning | No (k-NN works well) | No (linear probe works well) | No (linear probe works well) | No (zero-shot) | Yes (for best results) |

### Key differences

**DINO/DINOv2/DINOv3 vs. CLIP**: [CLIP](/wiki/clip) uses language supervision through contrastive learning on image-text pairs, giving it natural zero-shot classification and text-image retrieval capabilities [6]. However, CLIP features are weaker for dense prediction tasks like segmentation and depth estimation. Studies have shown that CLIP captures high-level semantic information (object categories, text-relevant features), while DINO features are more responsive to low-level visual properties like colors, textures, and spatial structure. When both are used as visual encoders in [multimodal language models](/wiki/multimodal_ai), CLIP excels at text-intensive tasks, while DINO slightly outperforms on vision-centric ones.

**DINO/DINOv2/DINOv3 vs. MAE**: [MAE](/wiki/masked_autoencoder) (Masked Autoencoders) learns representations by reconstructing masked image patches, an approach inspired by masked language modeling in [NLP](/wiki/natural_language_processing) [5]. While MAE achieves strong results when fine-tuned (87.8% on ImageNet with ViT-H), its frozen features are significantly weaker than DINO's for downstream tasks. MAE achieves only 73.5% linear probing accuracy with ViT-L, compared to DINOv2's 86.3% with the same architecture class [2][5]. MAE features also lack sufficient semantic information for global understanding, making them less suitable as frozen visual backbones.

## What is DINO used for?

The DINO family of models has found widespread use across many computer vision applications:

**Semantic and Instance Segmentation.** DINO's emergent attention maps and DINOv2's strong dense features enable high-quality [segmentation](/wiki/image_segmentation) without task-specific training. Researchers have used frozen DINO features for zero-shot and few-shot segmentation, part discovery, and unsupervised object localization [1][2].

**Monocular Depth Estimation.** DINOv2 achieved state-of-the-art results for [monocular depth estimation](/wiki/depth_estimation) on benchmarks like NYU Depth v2 and KITTI, using only a simple linear probe on frozen features. This capability has practical applications in robotics, warehouse safety systems, and autonomous navigation [2].

**Medical Imaging.** Researchers have adapted DINOv2 for radiology and surgical applications. The "Surgical-DINO" system adapts DINOv2 for depth estimation in endoscopic surgery, aiding in 3D reconstruction and surgical navigation.

**Remote Sensing.** DINOv3 includes a dedicated satellite backbone trained on MAXAR imagery. Applications include deforestation detection, canopy height estimation, and land cover classification [10].

**Visual Backbone for Multimodal Models.** DINOv2 features have been used as the visual branch in multimodal large language models. Studies show that DINOv2 provides fine-grained localization information that complements the high-level semantic features from CLIP, and combining both encoders often yields better results than using either alone.

**Image Retrieval and Matching.** DINOv2 achieved strong results on instance recognition benchmarks (Oxford Hard: 52.3% mAP, Paris Hard: 82.6% mAP), making it useful for visual search, copy detection, and image matching applications [2].

**Video Understanding.** Despite being trained only on static images, DINO and DINOv2 features transfer effectively to video tasks including action recognition (91.2% on UCF-101), video object segmentation (strong performance on DAVIS), and temporal understanding [2].

## Is DINO open source?

Code and pretrained weights for all three versions of DINO are publicly available, although the license terms differ between releases:

- **DINO**: Released under the Apache 2.0 license on GitHub (facebookresearch/dino). Implemented in [PyTorch](/wiki/pytorch) [11].
- **DINOv2**: Released on GitHub (facebookresearch/dinov2) with pre-trained weights available on [Hugging Face](/wiki/hugging_face). Integrated into the Hugging Face Transformers library [12].
- **DINOv3**: Released under a DINOv3 license on GitHub (facebookresearch/dinov3) with weights on Hugging Face Hub. Supported by Hugging Face Transformers [3][10].

All models can be loaded with a few lines of PyTorch code via `torch.hub` or through the Hugging Face `transformers` library, making them accessible for both research and production use.
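A sketch of the `torch.hub` route for DINOv2. It assumes PyTorch is installed and that the machine can download the repository and weights on first use; `dinov2_vits14` is one of the entry points published in the facebookresearch/dinov2 repository, while the helper names and the random input are illustrative only.

```python
HUB_REPO = "facebookresearch/dinov2"

def load_dinov2(model_name="dinov2_vits14"):
    """Fetch a pretrained DINOv2 backbone via torch.hub and switch it to eval mode."""
    import torch  # imported lazily so this file can be read without PyTorch installed
    model = torch.hub.load(HUB_REPO, model_name)
    return model.eval()

def embed(model, images):
    """Run the frozen backbone on a batch of images without tracking gradients."""
    import torch
    with torch.no_grad():
        return model(images)

def demo():
    """Embed one random image; requires network access the first time it runs."""
    import torch
    backbone = load_dinov2()
    # Height and width should be multiples of the 14-pixel patch size (224 = 16 * 14).
    images = torch.randn(1, 3, 224, 224)
    return embed(backbone, images)
```

Calling `demo()` returns one embedding per input image. Real inputs should first be resized and normalized the way the model expects, which this sketch leaves out.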

## See also

- [DINOv2](/wiki/dinov2)
- [DINOv3](/wiki/dinov3)
- [Self-supervised learning](/wiki/self_supervised_learning)
- [Vision Transformer](/wiki/vision_transformer)
- [CLIP](/wiki/clip)
- [Masked Autoencoder (MAE)](/wiki/masked_autoencoder)
- [Knowledge distillation](/wiki/knowledge_distillation)
- [Meta AI](/wiki/meta_ai)

## References

1. Caron, M., Touvron, H., Misra, I., Jegou, H., Mairal, J., Bojanowski, P., & Joulin, A. (2021). "Emerging Properties in Self-Supervised Vision Transformers." *Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)*. arXiv:2104.14294.
2. Oquab, M., Darcet, T., Moutakanni, T., Vo, H.V., Szafraniec, M., Khalidov, V., ... & Bojanowski, P. (2023). "DINOv2: Learning Robust Visual Features without Supervision." *Transactions on Machine Learning Research (TMLR)*. arXiv:2304.07193.
3. Simeoni, O., Vo, H.V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., ... & Bojanowski, P. (2025). "DINOv3." arXiv:2508.10104.
4. Darcet, T., Oquab, M., Mairal, J., & Bojanowski, P. (2024). "Vision Transformers Need Registers." *International Conference on Learning Representations (ICLR)*. arXiv:2309.16588.
5. He, K., Chen, X., Xie, S., Li, Y., Dollar, P., & Girshick, R. (2022). "Masked Autoencoders Are Scalable Vision Learners." *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)*. arXiv:2111.06377.
6. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., ... & Sutskever, I. (2021). "Learning Transferable Visual Models From Natural Language Supervision." *Proceedings of the 38th International Conference on Machine Learning (ICML)*. arXiv:2103.00020.
7. Zhou, J., Wei, C., Wang, H., Shen, W., Xie, C., Yuille, A., & Kong, T. (2022). "iBOT: Image BERT Pre-Training with Online Tokenizer." *International Conference on Learning Representations (ICLR)*. arXiv:2111.07832.
8. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., ... & Houlsby, N. (2021). "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale." *International Conference on Learning Representations (ICLR)*. arXiv:2010.11929.
9. Meta AI. "DINOv2: State-of-the-art computer vision models with self-supervised learning." Meta AI Blog, April 2023. https://ai.meta.com/blog/dino-v2-computer-vision-self-supervised-learning/
10. Meta AI. "DINOv3: Self-supervised learning for vision at unprecedented scale." Meta AI Blog, 2025. https://ai.meta.com/blog/dinov3-self-supervised-vision-model/
11. Facebook Research. "facebook/dino-vitb16." Hugging Face Model Hub, 2021. https://huggingface.co/facebook/dino-vitb16
12. Facebook Research. "facebook/dinov2-large." Hugging Face Model Hub, 2023. https://huggingface.co/facebook/dinov2-large