DINOv2
Last reviewed
May 24, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 6,730 words
DINOv2 is a family of self-supervised Vision Transformer models released by Meta AI Research in April 2023, intended to produce general-purpose visual features that transfer to many downstream tasks without fine-tuning of the backbone.[1][2] The system was introduced in the paper "DINOv2: Learning Robust Visual Features without Supervision" by Maxime Oquab, Timothée Darcet, Théo Moutakanni and 23 co-authors, and combines an automated data curation pipeline that assembled an unlabeled corpus of 142 million images (the LVD-142M dataset) with a knowledge distillation training objective derived from the earlier DINO and iBOT methods.[1][3] A one-billion-parameter ViT-g/14 teacher was trained from scratch, and smaller ViT-S/14, ViT-B/14 and ViT-L/14 students were then produced by distillation; the family is distributed as open-source code and weights under the Apache 2.0 license following an August 2023 relicensing.[2][4][5] DINOv2 has since been adopted as a frozen visual backbone in research on dense prediction, image retrieval, medical imaging, and robotics, and has been evaluated as a vision encoder for multimodal language models.[6][7][8] In August 2025, Meta released a successor, DINOv3, which scales the recipe to a 7-billion-parameter teacher trained on roughly 1.7 billion images and addresses identified shortcomings in DINOv2's late-training dense-feature stability.[9][10]
| Attribute | Value |
|---|---|
| Developer | Meta AI Research (FAIR) |
| Initial paper | arXiv:2304.07193 |
| Submission date (v1) | April 14, 2023 |
| Blog announcement | April 17, 2023 |
| Backbone | Vision Transformer (ViT) |
| Model sizes | ViT-S/14 (21M), ViT-B/14 (86M), ViT-L/14 (300M), ViT-g/14 (1.1B) |
| Training dataset | LVD-142M (142 million curated images) |
| Training objective | DINO + iBOT + KoLeo (self-distillation) |
| Patch size | 14 |
| Initial license | CC BY-NC 4.0 |
| Relicensed | Apache 2.0 (August 31, 2023) |
| Code repository | github.com/facebookresearch/dinov2 |
| Companion paper | "Vision Transformers Need Registers" (arXiv:2309.16588) |
| Successor | DINOv3 (August 2025) |
The DINOv2 project sits at the intersection of two threads of computer-vision research: scaling laws derived from language pre-training, and the line of self-distillation methods that emerged from the original DINO paper in 2021. In natural language processing, large transformer models pre-trained on web text had been shown to learn representations that transfer to a wide range of downstream tasks without fine-tuning; the DINOv2 authors framed their work as an attempt to produce an analogous foundation backbone for images by relying on self-supervised learning rather than text supervision.[1][2] The paper opens with this observation and asks whether the same can be achieved for images by training large Vision Transformers on appropriately curated visual data.[1]
The immediate predecessors were two self-supervised methods developed in part at Meta's research labs. DINO, introduced in Mathilde Caron et al., "Emerging Properties in Self-Supervised Vision Transformers," demonstrated that a student-teacher distillation loss applied to multi-crop augmentations of unlabeled images causes ViTs to develop attention maps that segment foreground objects, and produces features that score 80.1% top-1 accuracy on ImageNet under linear evaluation with a ViT-Base backbone.[11] iBOT, by Jinghao Zhou et al., extended this approach with a patch-level masked image modeling objective and an online tokenizer, reaching 82.3% linear probing accuracy on ImageNet-1k and 87.8% under full fine-tuning.[12] DINOv2 explicitly combines elements of both: the image-level DINO loss is retained for the class token, while the patch-level iBOT loss is retained for masked patch tokens of the student.[1]
A second motivation was the perceived limitation of weakly supervised image-text pre-training as exemplified by CLIP. The DINOv2 authors argue that text alignment, while effective for zero-shot transfer, "discards" information that is not easily described in captions, and that purely visual self-supervision can in principle capture finer pixel-level structure such as depth and part segmentation.[1] The paper benchmarks against the open-source OpenCLIP ViT-G/14 reproduction of CLIP and reports that the DINOv2 distilled ViT-L/14 matches or surpasses OpenCLIP on a broad suite of image-level and pixel-level tasks despite having roughly one-third of the parameter count.[1]
A practical concern that shaped the project was data quality. Earlier self-supervised work had shown that simply scraping more images from the web tends to degrade representation quality, because uncurated crawls contain redundant, low-quality, or off-distribution images. The DINOv2 paper therefore devotes a substantial portion of its method section to an automatic pipeline that turns 1.2 billion unique web images into the 142-million-image curated set called LVD-142M.[1]
LVD-142M (Large Visual Dataset, 142 million images) was assembled without using any human labels, text captions, or external metadata. The pipeline has three stages: gathering uncurated source images, deduplication, and self-supervised retrieval against a set of curated seed images.[1]
The uncurated source is a public web crawl. After URL filtering to remove unsafe content, the team downloaded approximately 1.2 billion unique images. Post-processing applied PCA-hash deduplication, NSFW filtering, and face blurring before any further use.[1] Deduplication then used the copy detection pipeline of Pizzi et al. (2022), known as SSCD (Self-Supervised Copy Detection), to remove near-duplicate images from this uncurated pool, both internally and against the test sets of downstream evaluation benchmarks so that subsequent benchmarking is not biased by training-test overlap.[1][13]
The curated side combines roughly 25 third-party datasets including ImageNet-22k, the training split of ImageNet-1k, Google Landmarks, and several fine-grained classification corpora. These act as seed images: for each curated image, a query is issued against the deduplicated web pool to retrieve visually similar images. Embeddings for retrieval are produced by a self-supervised ViT-H/16 network previously pre-trained on ImageNet-22k. Retrieval uses cosine similarity in this embedding space; the authors typically retain N=4 nearest neighbors per query, a value chosen as a trade-off between coverage and collision rates between queries.[1] When a query has too few near neighbors (rare classes), the pipeline falls back on k-means clustering of the web pool, sampling images from the cluster nearest to the query.[1]
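The retrieval step can be illustrated with a minimal sketch. It assumes embeddings are plain Python lists and uses a brute-force search; the function names and the toy 2-D vectors are hypothetical, and a pipeline at the scale described would rely on an approximate nearest-neighbor index rather than a full scan.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve_neighbors(query, pool, n_neighbors=4):
    """Return indices of the n_neighbors pool embeddings most similar to query."""
    scored = [(cosine_similarity(query, emb), idx) for idx, emb in enumerate(pool)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:n_neighbors]]

# Toy 2-D embeddings (illustrative only); real ones would come from the encoder.
pool = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0], [0.7, 0.7]]
query = [1.0, 0.05]
neighbors = retrieve_neighbors(query, pool, n_neighbors=2)
```

Here `neighbors` holds the indices of the two pool images closest in direction to the curated query; the default of 4 mirrors the value quoted above.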
The entire curation job was distributed across 20 nodes equipped with eight V100-32GB GPUs and ran in under two days, illustrating that automated curation at this scale is tractable on modest infrastructure relative to the cost of subsequent model training.[1][14] Ablations in the paper report that models trained on LVD-142M outperform those trained on raw web crawls of similar size on image classification, retrieval, and dense prediction, supporting the claim that careful self-supervised curation, rather than raw scale alone, drives the gains.[1] LVD-142M is described in detail in the paper but is not distributed as a downloadable dataset, partly for legal reasons relating to web-image redistribution and partly to limit the spread of potentially sensitive content; this asymmetry between published code, published weights, and unpublished training data has shaped much of the subsequent reproduction work.[1][14]
DINOv2 adopts the standard ViT architecture introduced by Dosovitskiy et al. with three modifications: a patch size of 14 (rather than 16) for all variants, the choice of an embedding dimension of 1536 with 24 heads (64 dimensions per head) for the giant variant (in place of 1408 with 16 heads in Zhai et al.'s ViT-G design), and the addition of a SwiGLU feed-forward block in larger variants.[1][15] The patch-size-14 choice produces a 256-token grid for 224 by 224 inputs (16 by 16 patches) and a denser 1369-token grid for 518 by 518 inputs used in the high-resolution finishing stage.
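The token counts above follow directly from the patch size. A small sketch of the arithmetic (the helper name is hypothetical, and class or register tokens are not counted):

```python
def num_patch_tokens(image_size, patch_size=14):
    """Number of patch tokens for a square image split into square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    side = image_size // patch_size
    return side * side

tokens_224 = num_patch_tokens(224)   # 16 x 16 grid
tokens_518 = num_patch_tokens(518)   # 37 x 37 grid
head_dim_giant = 1536 // 24          # per-head dimension of the giant variant
```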
Four backbone sizes are released:[4][14]
| Model | Layers | Embedding dim | Heads | Parameters | ImageNet-1k k-NN | ImageNet-1k linear |
|---|---|---|---|---|---|---|
| ViT-S/14 (distilled) | 12 | 384 | 6 | 21M | 79.0% | 80.9-81.1% |
| ViT-B/14 (distilled) | 12 | 768 | 12 | 86M | 82.0-82.1% | 84.5% |
| ViT-L/14 (distilled) | 24 | 1024 | 16 | 300M | 83.5% | 86.3% |
| ViT-g/14 | 40 | 1536 | 24 | 1100M | 83.5% | 86.5% |
The ViT-g/14 model is the only one trained from scratch on LVD-142M. The ViT-S/14, ViT-B/14, and ViT-L/14 checkpoints are produced by self-distillation from a frozen ViT-g teacher; the authors report that this procedure outperforms training the smaller backbones from scratch on all 12 downstream benchmarks they tested.[1] In September 2023, register-augmented checkpoints were released for each size following the companion paper "Vision Transformers Need Registers."[16][14] Each variant therefore exists in two forms in the official repository: the original 2023 weights and the register weights with four additional learned tokens.[4][14]
The class-token projection head is a multilayer perceptron with bottlenecked output that maps to a high-dimensional prototype space of 128k prototypes (128,000 learnable codes), shared across global crops, used to compute the DINO loss.[1][17] A parallel head with the same architecture but separate weights is used for the patch tokens to compute the iBOT loss; this separation of the image-level and patch-level heads was found in ablations to improve representation quality compared to a shared head.[1] LayerScale and stochastic depth are used throughout the deeper variants to stabilize training; aggressive stochastic depth with drop rates up to about 40% is applied to the giant model.[1][17]
The DINOv2 loss is a weighted combination of three terms, all consistent with a student-teacher self-distillation framework in which the teacher's parameters are an exponential moving average of the student's.[1] The teacher momentum follows a cosine schedule from 0.994 to 1.0 over training, so that the teacher becomes more stable as the student matures.[1][17]
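As a rough sketch of the teacher update, the following assumes parameters are flat lists of floats and that the cosine schedule is evaluated once per step; the function names are hypothetical and the exact schedule granularity in the released code may differ.

```python
import math

def teacher_momentum(step, total_steps, start=0.994, end=1.0):
    """Cosine schedule that rises from start at step 0 to end at the last step."""
    progress = step / max(1, total_steps - 1)
    return end - (end - start) * 0.5 * (1.0 + math.cos(math.pi * progress))

def ema_update(teacher_params, student_params, momentum):
    """Move each teacher parameter a small step toward the student's value."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]

teacher = [1.0, -2.0]
student = [0.0, 0.0]
teacher = ema_update(teacher, student, teacher_momentum(0, 1000))
```

With a momentum close to 1 the teacher changes slowly, and at a momentum of exactly 1.0 it stops changing altogether.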
The image-level loss is taken from DINO. For each input image, two global crops at resolution 224 and several smaller local crops are passed through the student; only the global crops are passed through the teacher. The class-token embeddings are projected into a high-dimensional prototype space and converted into probability distributions via a softmax over learned prototypes. The loss is the cross-entropy between the teacher and student distributions over class tokens:[1][11]
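$$\mathcal{L}_{\mathrm{DINO}} = -\sum_{k} p_{t}^{(k)} \log p_{s}^{(k)}$$
where k indexes the prototypes, p_t and p_s are the teacher and student distributions, and the loss is accumulated over teacher-student view pairs.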
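A minimal sketch of this cross-entropy for a single teacher-student view pair is given below. The temperature values are assumptions chosen for illustration, the function names are hypothetical, and teacher centering is left out.

```python
import math

def softmax(logits, temperature):
    """Temperature-scaled softmax with the usual max-subtraction for stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def dino_cross_entropy(teacher_logits, student_logits,
                       teacher_temp=0.04, student_temp=0.1):
    """Cross-entropy between teacher and student prototype distributions."""
    p_teacher = softmax(teacher_logits, teacher_temp)
    p_student = softmax(student_logits, student_temp)
    return -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))

loss_agree = dino_cross_entropy([2.0, 0.0, 0.0], [2.0, 0.0, 0.0])
loss_disagree = dino_cross_entropy([2.0, 0.0, 0.0], [0.0, 2.0, 0.0])
```

The loss is small when the student puts its mass on the same prototype as the teacher and large when it does not.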
The patch-level loss is taken from iBOT. The student receives a copy of each global crop in which a random subset of patches has been masked; the teacher receives the unmasked image. Each masked patch's token is projected through a separate prototype head and matched, again via cross-entropy, to the teacher's projection of the same patch position from the unmasked image.[1][12]
A KoLeo regularizer (Kozachenko-Leonenko entropy estimator) is added to encourage a uniform spread of the global class-token embeddings inside each mini-batch.[1][17] For a batch of n features, the term is
$$\mathcal{L}_{\mathrm{KoLeo}} = -\frac{1}{n} \sum_{i=1}^{n} \log d_{n,i}$$
where d_{n,i} is the minimum L2 distance from feature i to any other feature in the batch.[1] This term acts as an entropy-style spread loss that discourages representation collapse to a small region of the embedding manifold. Ablations report that adding KoLeo to an iBOT-style baseline gives the largest single contribution to k-NN accuracy among the new training-recipe components, with about a 2.3 percentage-point improvement on ImageNet k-NN.[1][17]
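The regularizer can be written out directly from its definition. The sketch below uses a brute-force nearest-neighbor search over a small batch; the function name and the small `eps` added inside the logarithm are assumptions for numerical safety, not details taken from the paper.

```python
import math

def koleo_loss(features, eps=1e-8):
    """Negative mean log distance from each feature to its nearest neighbor."""
    n = len(features)
    total = 0.0
    for i in range(n):
        nearest = min(math.dist(features[i], features[j])
                      for j in range(n) if j != i)
        total += math.log(nearest + eps)
    return -total / n

spread = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
clustered = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1]]
loss_spread = koleo_loss(spread)
loss_clustered = koleo_loss(clustered)
```

Tightly clustered features produce a larger loss than well-separated ones, which is the pressure toward a uniform spread described above.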
The teacher prototype centering uses Sinkhorn-Knopp normalization rather than a running mean as in DINO. Three Sinkhorn iterations are applied per step, producing a doubly stochastic assignment of teacher tokens to prototypes that the authors find more stable than softmax centering with momentum.[1][17] At inference, the projection heads are discarded; only the ViT backbone is used to produce features.
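The following is a minimal sketch of Sinkhorn-Knopp normalization in the style commonly used for prototype assignment, not a copy of the DINOv2 implementation. The temperature value and function name are assumptions; rows are samples and columns are prototypes.

```python
import math

def sinkhorn_knopp(logits, temperature=0.05, n_iterations=3):
    """Balance prototype usage across a batch; each returned row sums to 1."""
    n_samples, n_protos = len(logits), len(logits[0])
    peak = max(max(row) for row in logits)
    q = [[math.exp((x - peak) / temperature) for x in row] for row in logits]
    for _ in range(n_iterations):
        # Column step: every prototype receives the same total mass (1 / n_protos).
        for k in range(n_protos):
            col_sum = sum(q[b][k] for b in range(n_samples))
            for b in range(n_samples):
                q[b][k] /= col_sum * n_protos
        # Row step: every sample carries the same total mass (1 / n_samples).
        for b in range(n_samples):
            row_sum = sum(q[b])
            for k in range(n_protos):
                q[b][k] /= row_sum * n_samples
    # Rescale so that each sample's assignment is a probability distribution.
    return [[value * n_samples for value in row] for row in q]

# Both samples score prototype 0 highest, so a plain softmax would ignore prototype 1.
assignments = sinkhorn_knopp([[1.0, 0.0], [0.9, 0.1]])
```

In this toy batch the second sample ends up assigned mostly to prototype 1 even though its raw score favors prototype 0, because the column step forces both prototypes to be used.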
Two further elements round out the training recipe. First, the multi-crop augmentation described above (two global crops at resolution 224 and several smaller local crops) forces the student to predict the same representation from views of different scales.[1] Second, a final stage trains the backbone at higher resolution (518 by 518) for an additional 10,000 iterations, which the authors report improves dense prediction tasks at a small fraction of the cost of training at full resolution throughout.[1]
Training the one-billion-parameter ViT-g/14 from scratch is computationally demanding, and the paper describes several engineering optimizations that distinguish DINOv2's implementation from earlier self-supervised codebases. All training is implemented in PyTorch 2.0 on NVIDIA A100 GPUs with mixed-precision fp16.[1][14] The AdamW optimizer is used with a cosine learning-rate schedule, a 100,000-iteration warm-up, and a cosine weight-decay schedule that ramps from 0.04 to 0.2 over training; the ViT-g/14 run uses approximately 625,000 iterations at a batch size of around 3,072 images.[17][18]
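The schedules can be sketched as follows. The iteration counts and weight-decay endpoints come from the paragraph above; the peak learning rate of 1e-3, the linear warm-up shape, the final learning rate of zero, and the function names are assumptions made for illustration.

```python
import math

def cosine_ramp(step, total_steps, start, end):
    """Half-cosine interpolation from start (step 0) to end (last step)."""
    progress = min(1.0, step / max(1, total_steps - 1))
    return end + (start - end) * 0.5 * (1.0 + math.cos(math.pi * progress))

def learning_rate(step, total_steps, warmup_steps, peak_lr, final_lr=0.0):
    """Linear warm-up to peak_lr, then cosine decay to final_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return cosine_ramp(step - warmup_steps, total_steps - warmup_steps,
                       peak_lr, final_lr)

TOTAL_STEPS, WARMUP_STEPS = 625_000, 100_000
PEAK_LR = 1e-3  # hypothetical value, not taken from the source

wd_first = cosine_ramp(0, TOTAL_STEPS, 0.04, 0.2)
wd_last = cosine_ramp(TOTAL_STEPS - 1, TOTAL_STEPS, 0.04, 0.2)
lr_peak = learning_rate(WARMUP_STEPS, TOTAL_STEPS, WARMUP_STEPS, PEAK_LR)
```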
A custom version of FlashAttention is used for the self-attention computation, taking advantage of the fact that the per-head embedding dimension is a multiple of 64. The authors report that their attention implementation is roughly twice as fast as iBOT's original codebase and uses about one third of the memory, on identical hardware.[1]
Sequence packing concatenates token sequences of differing lengths (arising from multi-crop training) into a single long sequence and applies a block-diagonal attention mask so that crops do not attend across boundaries. This avoids padding waste from the smallest local crops.[1]
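A toy sketch of the idea, assuming tokens are arbitrary Python objects and the mask is a nested list of booleans (the function name is hypothetical, and a real implementation would pass the block structure to a fused attention kernel rather than materialize a dense mask):

```python
def pack_sequences(sequences):
    """Concatenate token sequences and build a block-diagonal attention mask."""
    packed = [token for seq in sequences for token in seq]
    # owner[i] records which original sequence token i came from.
    owner = [idx for idx, seq in enumerate(sequences) for _ in seq]
    size = len(packed)
    mask = [[owner[q] == owner[k] for k in range(size)] for q in range(size)]
    return packed, mask

global_crop = ["g0", "g1", "g2"]
local_crop = ["l0", "l1"]
packed, mask = pack_sequences([global_crop, local_crop])
```

`mask[q][k]` is `True` only when tokens `q` and `k` belong to the same crop, so no padding tokens are needed to equalize lengths.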
Stochastic depth is implemented by skipping the computation of dropped residual branches rather than masking their outputs; at high drop rates (around 40%) this gives substantial wall-clock speed-ups.[1]
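One way to realize this is to evaluate the residual branch only on a random subset of the batch and rescale its output so the expected update is unchanged. The sketch below works on scalar "samples" for brevity; the function name, the rescaling choice, and the toy branch are assumptions rather than a description of the released code.

```python
import random

def residual_with_stochastic_depth(batch, branch, drop_rate, rng):
    """Compute x + branch(x), evaluating branch only on the kept samples."""
    n_keep = max(1, round(len(batch) * (1.0 - drop_rate)))
    kept = set(rng.sample(range(len(batch)), n_keep))
    scale = len(batch) / n_keep  # compensates for the skipped samples
    outputs = []
    for index, x in enumerate(batch):
        if index in kept:
            outputs.append(x + scale * branch(x))
        else:
            outputs.append(x)  # branch is never called for dropped samples
    return outputs

calls = []

def toy_branch(x):
    calls.append(x)  # record each evaluation to show the saved compute
    return 1.0

outputs = residual_with_stochastic_depth(
    [0.0] * 10, toy_branch, drop_rate=0.4, rng=random.Random(0))
```

With a drop rate of 0.4 and ten samples, the branch runs six times instead of ten, which is where the wall-clock saving comes from.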
Fully Sharded Data Parallel (FSDP) distributes parameters, gradients, and optimizer states across GPUs. The authors observe an approximate 50% reduction in inter-GPU communication compared to standard data parallel training, because gradient all-reduce operations can be performed in reduced precision while master weights are kept in float32 on each shard.[1] On a 16-GPU node, FSDP allowed training a model with roughly four times the parameter count that fits in plain distributed data parallel for the same memory budget.[1]
The reported training cost for the released ViT-g/14 is 22,016 A100-40GB GPU-hours.[1][17] Total project compute, including ablations, smaller variants, and intermediate checkpoints, runs into hundreds of thousands of GPU-hours. Distilled smaller models reuse the same data pipeline but require substantially less compute, since the teacher is frozen and only the student is updated.[1]
The DINOv2 paper evaluates frozen-backbone features (no fine-tuning of the ViT) across eight task families covering image classification, fine-grained recognition, instance retrieval, video classification by frame, semantic segmentation, depth estimation, video segmentation, and robustness. Results below are quoted from the paper or its v2 revision.[1][17]
| Task | Benchmark | Backbone | DINOv2 result | Comparison |
|---|---|---|---|---|
| Image classification (linear) | ImageNet-1k | ViT-g/14 | 86.5% top-1 | OpenCLIP ViT-G/14: 86.2% |
| Image classification (k-NN) | ImageNet-1k | ViT-g/14 | 83.5% top-1 | |
| Robustness | ImageNet-A | ViT-g/14 | 75.9% | |
| Robustness | ImageNet-R | ViT-g/14 | 78.8% | |
| Robustness | ImageNet-Sketch | ViT-g/14 | 62.5% | |
| Video action recognition | Kinetics-400 (frames + linear) | ViT-g/14 | 78.4% top-1 | matches OpenCLIP |
| Video action recognition | UCF-101 | ViT-g/14 | 91.2% top-1 | matches OpenCLIP |
| Video temporal reasoning | Something-Something v2 | ViT-g/14 | 38.3% top-1 | +2.5 over OpenCLIP |
| Fine-grained classification (12 tasks avg.) | various | ViT-g/14 | 92.1% mean acc. | |
| Semantic segmentation (linear) | ADE20K | ViT-g/14 | 49.0 mIoU | |
| Semantic segmentation (+ multiscale) | ADE20K | ViT-g/14 | 53.0 mIoU | |
| Monocular depth estimation (linear) | NYUd | ViT-g/14 | 0.344 RMSE | |
| Depth estimation (DPT head) | NYUd | ViT-g/14 | 0.279 RMSE | |
| Landmark retrieval | Oxford-Medium | ViT-g/14 | 73.6 mAP | |
| Landmark retrieval | Oxford-Hard | ViT-g/14 | 52.3 mAP | |
| Landmark retrieval | Paris-Medium | ViT-g/14 | 92.1 mAP | |
| Landmark retrieval | Paris-Hard | ViT-g/14 | 82.6 mAP |
The paper emphasizes that these results are obtained without any fine-tuning of the backbone: a linear classifier, a k-nearest-neighbors classifier, or a small dense-prediction head is trained on top of frozen features. This is the primary headline of the DINOv2 release: a single set of frozen features competes with or exceeds task-specialist models that were trained or fine-tuned for each benchmark individually.[1][2]
Linear probing of ImageNet-1k is the most directly comparable head-to-head benchmark with CLIP-family models. DINOv2 ViT-g/14 reaches 86.5%, slightly above the publicly available OpenCLIP ViT-G/14 reproduction at 86.2%; the distilled DINOv2 ViT-L/14 reaches 86.3% with roughly one-third the parameters of OpenCLIP ViT-G/14, which the authors highlight as evidence that purely visual self-supervision can match weakly supervised image-text pre-training at the image-classification level.[1] The advantage is larger on dense pixel-level tasks such as depth estimation, where the OpenCLIP baselines lag behind DINOv2 by larger margins.[1] On fine-grained classification, the DINOv2 ViT-g/14 reaches an average of 92.1% accuracy across twelve datasets including ImageNet-22k, Places205, iNaturalist 2018, iNaturalist 2021, Stanford Cars, FGVC-Aircraft, Caltech-101, and Food-101, with substantial gains over OpenCLIP on natural-world long-tailed benchmarks (+8.6 points on iNaturalist 2018 and +9.7 points on iNaturalist 2021).[1][17]
For video, the paper performs frozen-feature linear classification on uniformly sampled frames and reports 78.4% top-1 on Kinetics-400, 91.2% on UCF-101, and 38.3% on Something-Something v2; the first two are within roughly half a point of OpenCLIP while the SSv2 number is 2.5 points higher, despite DINOv2 not training on any video data.[17] For dense pixel-level evaluation, the paper places a DPT-style decoder head on the frozen ViT and reaches 0.279 RMSE on NYUd, a result described as competitive with task-specialist supervised models trained directly for the task.[1]
In September 2023, Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski (all of whom are DINOv2 authors) published "Vision Transformers Need Registers" (arXiv:2309.16588), which identified an artifact in DINOv2 (and other) ViT feature maps: a small number of patch tokens, typically in low-information background regions, have anomalously high feature norms and appear to be repurposed by the network as scratch memory for global computations.[16]
The proposed fix is to extend the input sequence of the ViT with a small set of additional learnable "register" tokens that have no spatial meaning. These tokens absorb the global-computation role that previously corrupted background patches. The paper uses four register tokens in its main experiments and reports that adding registers eliminates the high-norm artifacts, produces smoother attention maps and dense feature maps, sets new state-of-the-art numbers for self-supervised models on dense prediction tasks, and improves the quality of unsupervised object discovery built on top of the features.[16][19] Ablations in the same paper indicate that even a single register token captures most of the benefit, but four is recommended as a robust default.[16]
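Structurally, registers only change how the token sequence is assembled and read out. The sketch below assumes the registers are placed between the class token and the patch tokens and uses strings as stand-ins for token vectors; the function names are hypothetical.

```python
def build_input_sequence(cls_token, register_tokens, patch_tokens):
    """Assemble the transformer input: class token, registers, then patches."""
    return [cls_token] + list(register_tokens) + list(patch_tokens)

def split_output_sequence(outputs, n_registers):
    """Read out class and patch features; register outputs are discarded."""
    cls_out = outputs[0]
    patch_out = outputs[1 + n_registers:]
    return cls_out, patch_out

registers = ["reg0", "reg1", "reg2", "reg3"]  # four learned tokens, no spatial meaning
patches = ["p0", "p1", "p2"]
sequence = build_input_sequence("cls", registers, patches)
cls_feature, patch_features = split_output_sequence(sequence, len(registers))
```

Downstream heads see the same class and patch outputs as before; the registers exist only to give the network somewhere other than background patches to store global information.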
Following the paper, Meta released DINOv2-with-registers checkpoints for each model size (ViT-S/14, ViT-B/14, ViT-L/14, ViT-g/14) alongside the original DINOv2 weights, exposing both variants in the official GitHub repository and PyTorch Hub interface.[4][14] The registers variant is the recommended default for tasks that rely on per-patch features, such as dense segmentation or robot manipulation policies that read spatial saliency from the backbone.[7]
A 2025 follow-up by Jiang, Dravid, Efros, and Gandelsman, "Vision Transformers Don't Need Trained Registers," argued that the artifacts are caused by a small set of identifiable neurons that concentrate high-norm activations on outlier tokens, and that the same benefit as trained registers can be obtained at inference time by shifting these activations onto a single appended untrained token.[20] The training-free intervention reaches comparable downstream performance to trained registers across a range of models including CLIP and DINOv2 and was accepted as a spotlight paper at NeurIPS 2025, demonstrating that the underlying mechanism, rather than the training procedure, is the relevant lever.[20]
DINOv2 was originally released under a CC BY-NC 4.0 license, which prohibited commercial use of the code and weights. On August 31, 2023, Meta announced a relicensing of the entire DINOv2 release (including training code, all backbone checkpoints, and downstream heads for segmentation and depth) under the Apache 2.0 license, removing the non-commercial restriction.[5][21] The same announcement introduced the FACET evaluation dataset, a 32,000-image benchmark of 50,000 people annotated with demographic and physical attributes intended for fairness audits of vision foundation models (FACET itself is restricted to evaluation and is not licensed for training).[21]
The relicensing made DINOv2 one of the few large vision foundation backbones available for commercial use under a permissive license at the time of release, alongside OpenCLIP reproductions of CLIP and the Segment Anything Model also released by Meta in 2023.[4][21] Some specialized extensions of the DINOv2 codebase released later (Cell-DINO for microscopy, XRay-DINO for medical imaging) carry different licenses appropriate to their domain data, but the core DINOv2 weights and training code remain under Apache 2.0.[14]
The most common use of DINOv2 in the literature treats the released ViT as a frozen feature extractor. The official repository ships pretrained linear classifiers for ImageNet-1k, ImageNet-22k, and Places-205, along with DPT-style heads for monocular depth estimation on NYU and KITTI, and semantic segmentation heads for ADE20K and VOC2012 (including linear, multi-scale, and Mask2Former-style decoders).[4][14] These are intended both as ready-to-use components and as reference baselines for new downstream heads built on the same backbone.
DINOv2 has been evaluated as an alternative vision encoder for multimodal large language models in the LLaVA family. A 2025 comparative study, "LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning," reports mixed results: DINOv2 backbones outperform CLIP variants on certain visual benchmarks when paired with a 2-billion-parameter language model, but image-text pre-trained encoders such as CLIP and SigLIP remain competitive or superior on text-grounded multimodal evaluations, particularly at smaller language-model scales.[6] The general pattern reported in that and related work is that DINOv2 features carry strong spatial structure but lack the language-aligned semantics of CLIP, making concatenation or hybrid encoder schemes a common workaround for LLaVA-style architectures.[6][22]
The Cambrian-1 study by Tong, Brown, Wu and colleagues (NeurIPS 2024 oral) systematically compared more than 20 vision encoders across multimodal benchmarks and proposed a Spatial Vision Aggregator (SVA) connector that combines features from CLIP, SigLIP, OpenCLIP-ConvNeXt, and DINOv2; the authors report that DINOv2 contributes disproportionately to vision-centric and spatial benchmarks despite being weaker on text-grounded ones, and that the optimal multimodal vision tower is typically a hybrid of language-supervised and self-supervised encoders rather than either alone.[22] The complementarity is consistent with the original DINOv2 paper's claim that the self-supervised features encode finer pixel-level structure than CLIP-style features but lack their language alignment.[1][22]
A 2023 study, "Evaluating General Purpose Vision Foundation Models for Medical Image Analysis: An Experimental Study of DINOv2 on Radiology Benchmarks" (arXiv:2312.02366), evaluated frozen DINOv2 features on chest X-ray, computed tomography, and magnetic resonance datasets across more than 200 evaluations covering disease classification and organ segmentation in 2D and 3D, and found that DINOv2 transfers competitively to medical classification tasks despite not having been trained on any medical imagery.[23] Subsequent work has explored domain-specific continued pre-training using the DINOv2 objective on histopathology, fundus images, and other clinical datasets.
The most prominent example is UNI, a general-purpose pathology foundation model from the Mahmood Lab at Brigham and Women's Hospital / Harvard, published in Nature Medicine in 2024. UNI uses the DINOv2 training recipe to pre-train a ViT-L/16 on more than 100 million pathology tile crops sampled from over 100,000 hematoxylin-and-eosin (H&E) whole-slide images, and reports strong performance across a wide range of pathology classification and survival-prediction tasks.[24] A January 2025 successor, UNI 2, scales pre-training to more than 200 million tiles sampled from more than 350,000 whole-slide images including H&E and immunohistochemistry stains.[25] Hibou, by Nechaev, Pchelnikov, and Ivanova (June 2024), independently used the DINOv2 framework to pre-train ViT-B and ViT-L pathology models on more than one million whole-slide images and released them publicly.[26] Other DINOv2-derived medical models include RudolfV (a pathology backbone trained on tiles from 103,849 whole-slide images), HistoDARE (an attention-modified DINO variant for histopathology), and MM-DINOv2 (a multi-modal adaptation presented at MICCAI 2025).[27]
DINOv2's combination of strong spatial features, frozen-backbone usability, and permissive license has made it a frequent choice in robotics pipelines. DINOBot ("DINOBot: Robot Manipulation via Retrieval and Alignment with Vision Foundation Models," arXiv:2402.13181), by Norman Di Palo and Edward Johns at Imperial College London, uses DINO/DINOv2 features as the basis for an imitation-learning framework that retrieves the most visually similar demonstration of an object and then aligns its end-effector with the new object using pixel-level features from a single frozen ViT.[7] A 2025 system for bimanual manipulation uses DINOv2 attention maps as pixel-level saliency scores lifted into a 3D voxel grid to provide semantic cues to a behavior-cloning policy.[28] NASA's Jet Propulsion Laboratory has reported using DINOv2 as the common backbone in a "Visual Perception Engine" for planetary rover prototypes, where a single forward pass through the encoder serves multiple downstream vision tasks without per-task fine-tuning.[8]
Because DINOv2's attention maps localize objects without supervision, the backbone has been used as the front end of unsupervised object-discovery pipelines (descendants of the LOST and CutLER methods), saliency detection, and weakly supervised segmentation.[16] The release of registers further improved the smoothness of feature maps used in these downstream pipelines.[16]
A 2024 collaboration between Meta and the World Resources Institute used DINOv2 as a feature extractor over 18 million 0.5-meter natural-color satellite tiles from Maxar Technologies to produce a global canopy height map at 1-meter resolution, training a convolutional decoder on aerial lidar height measurements to map DINOv2 features to per-pixel canopy heights.[29][38] The released model achieved a mean absolute error of about 2.8 meters and was made publicly available.[29][38] A March 2026 update, Canopy Height Maps v2 (CHMv2), replaced the DINOv2 backbone with DINOv3 pre-trained on a domain-specific 493-million-image satellite corpus (SAT-493M), reporting an R-squared improvement from 0.53 to 0.86 and sharper structure at the canopy level.[30] Beyond canopy mapping, DINOv2 features have been used in remote-sensing few-shot object detection and visual place recognition; Panopticon (2025) extends DINOv2 to multi-sensor satellite data by encoding optical and synthetic-aperture-radar channels using cross-sensor augmentations.[31]
DINOv2 sits in a crowded landscape of vision pre-training paradigms; the table below summarizes how it relates to the main alternatives, including several methods released after it.
| Method | Supervision signal | Architecture family | Key property |
|---|---|---|---|
| CLIP (Radford et al., 2021) | Image-text contrastive | ViT or CNN | Zero-shot text-to-image alignment |
| SimCLR (Chen et al., 2020) | Image-image contrastive | ResNet | Augmentation-based contrastive learning |
| MAE (He et al., 2021) | Masked pixel reconstruction | ViT | Reconstruction in pixel space |
| DINO (Caron et al., 2021) | Self-distillation, class token | ViT | Emergent attention-based segmentation |
| iBOT (Zhou et al., 2021) | Self-distillation + masked patches | ViT | Patch-level + image-level objective |
| DINOv2 (Oquab et al., 2023) | DINO + iBOT + KoLeo on curated 142M | ViT | Frozen backbone competitive with fine-tuned task models |
| SigLIP (Zhai et al., 2023) | Image-text sigmoid contrastive | ViT | Replaces CLIP softmax with pairwise sigmoid |
| I-JEPA (Assran et al., 2023) | Predicted-target self-distillation in feature space | ViT | Predict representations of masked regions |
| V-JEPA (Bardes et al., 2024) | I-JEPA extended to video | ViT | Video-feature self-supervision |
| V-JEPA 2 (Assran, Bardes, Fan et al., 2025) | Video-based world model | ViT | Robot planning from frozen video features |
| DINOv3 (Siméoni et al., 2025) | DINO + iBOT + KoLeo + Gram anchoring on 1.7B | ViT, ConvNeXt | 7B-parameter teacher, stable dense features |
DINOv2's main empirical claim relative to CLIP-family models is that it matches them on image-level tasks while substantially exceeding them on dense pixel-level tasks like depth estimation and segmentation, which require fine spatial structure that text supervision tends to discard.[1] Relative to masked-image methods such as MAE, DINOv2 features are more directly useful as frozen representations (MAE features typically require fine-tuning to compete), at the cost of a more complex training recipe.[1] Relative to its own predecessor DINO, the main improvements come from larger and more carefully curated training data, the addition of the iBOT patch loss, the KoLeo regularizer, and engineering changes that make training a billion-parameter model practical.[1][11]
The Joint Embedding Predictive Architecture line (I-JEPA in early 2023, V-JEPA in 2024, V-JEPA 2 in June 2025) shares the broad goal of learning visual features without text alignment but uses a different mechanism: predicting target representations of masked regions in feature space rather than performing self-distillation on entire image crops.[32][33] V-JEPA 2 in particular is positioned as a video-based world model for robotics, trained on more than one million hours of video and roughly 62 hours of robot data.[33] The two lines coexist within Meta's broader research program on self-supervised vision rather than supplanting each other, with DINO-family models continuing to dominate dense-feature use cases and JEPA-family models targeting video understanding and planning.
In August 2025, Meta released DINOv3 (Siméoni, Vo, Seitzer, Baldassarre, Oquab et al., arXiv:2508.10104), positioned as a direct successor to DINOv2 that addresses two identified limitations: difficulty scaling beyond the one-billion-parameter regime with the original recipe, and gradual degradation of dense-feature quality late in long training schedules.[9][10] The changes introduced in the DINOv3 paper include a 7-billion-parameter teacher trained on roughly 1.7 billion images, a Gram anchoring term intended to stabilize dense features, and a simpler constant-learning-rate training schedule.[9][10]
DINOv3 reports state-of-the-art frozen-feature performance across more than 60 benchmarks covering 15 task families, including 88.2% ImageNet-1k linear probing, 66.1 mAP on COCO object detection, 63.0 mIoU on ADE20K semantic segmentation, and 0.281 RMSE on NYUv2 depth, matching or exceeding both DINOv2 and weakly supervised baselines such as SigLIP 2 and Meta's Perception Encoder.[10][34] The model was released under a commercial license alongside the paper, and the same release introduced DINOv3 variants pretrained on satellite imagery (SAT-493M), which the canopy-height map collaboration with the World Resources Institute later adopted for its v2 product in March 2026.[29][30]
Notwithstanding DINOv3, DINOv2 remains widely used in 2025 and 2026 because it is mature, has well-validated downstream heads, and runs at lower inference cost. The Apache 2.0 license continues to apply to all original DINOv2 weights, and the Hugging Face Transformers library supports DINOv2 as a first-class model alongside the DINOv2-with-registers variants.[14][19][35]
The DINOv2 paper and subsequent literature have noted a number of limitations.
The first is dataset opacity. LVD-142M is described in the paper but is not released as a downloadable dataset, both for legal reasons (the underlying images come from a web crawl) and to mitigate the redistribution of potentially sensitive material. Reproducing DINOv2 from scratch therefore requires either re-running the curation pipeline against a comparable web source or substituting another large image set, and exact reproductions of the published numbers have been difficult outside Meta.[1][14]
Second, even with registers, the released checkpoints still exhibit some texture- and patch-level artifacts under aggressive prompting. The "Vision Transformers Need Registers" paper documents the high-norm artifact in plain DINOv2 and provides a fix; subsequent work ("Vision Transformers Don't Need Trained Registers," 2025) has argued that training-free interventions on attention heads can replicate much of the benefit of registers without retraining, indicating that the underlying phenomenon is a structural property of large ViTs rather than a quirk of the DINOv2 recipe.[16][20]
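The high-norm artifact described above can be illustrated with a small diagnostic that flags patch tokens whose L2 norm is far larger than that of a typical token. This is a minimal sketch and not the procedure used in the registers paper: the function names, the median-based threshold, and the `ratio` value of 5 are illustrative assumptions, and the toy token list stands in for the patch embeddings a real backbone would return.

```python
import math
import statistics


def token_norms(tokens):
    """Return the L2 norm of each patch token (a list of floats)."""
    return [math.sqrt(sum(x * x for x in t)) for t in tokens]


def find_high_norm_tokens(tokens, ratio=5.0):
    """Return indices of tokens whose norm exceeds `ratio` times the median norm.

    The median-based rule and the default ratio are assumptions made for
    illustration; a real analysis would choose the cutoff from the observed
    norm distribution of the model being inspected.
    """
    norms = token_norms(tokens)
    median = statistics.median(norms)
    return [i for i, n in enumerate(norms) if n > ratio * median]


# Toy example: eight ordinary tokens and one outlier with a much larger norm.
toy_tokens = [[1.0, 0.0, 0.0] for _ in range(8)] + [[40.0, 30.0, 0.0]]
outliers = find_high_norm_tokens(toy_tokens)
```

In this toy input only the last token is flagged. On real feature maps, the flagged positions would be compared against image content to check whether they fall in low-information background regions.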
Third, when used as a multimodal vision encoder, DINOv2 has consistently been reported as weaker than image-text pre-trained encoders for tasks that rely on aligning visual content to natural-language prompts, including VQA-style benchmarks used to evaluate LLaVA variants.[6][22] This is a structural limitation of any pure self-supervised vision objective: the resulting features encode visual similarity rather than semantic categories that map onto words. Hybrid encoder configurations (combining DINOv2 with CLIP features) such as those in Cambrian-1 are commonly used to mitigate this.[22]
Fourth, the training recipe is intricate. The combined DINO+iBOT+KoLeo loss with Sinkhorn-Knopp centering, multi-crop augmentation, EMA teacher, and high-resolution finishing stage has many hyperparameters, and the authors note that several of the engineering tricks (sequence packing, stochastic depth implementation, custom FlashAttention) are not exposed at the API level. The DINOv3 follow-up paper explicitly motivates its simpler constant-learning-rate training schedule and the Gram anchoring fix as responses to the brittleness and late-training degradation observed in DINOv2 training runs.[9][10]
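Two of the simpler components named above, the KoLeo regularizer and the EMA teacher update, can be sketched in a few lines. The sketch below is pure Python for readability and is not the official implementation: it assumes the usual formulation of KoLeo as the negative mean log distance from each L2-normalized feature to its nearest neighbour in the batch, and the function names, the `eps` constant, and the momentum value are illustrative choices.

```python
import math


def l2_normalize(v, eps=1e-8):
    """Scale a feature vector to (approximately) unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / (norm + eps) for x in v]


def koleo_loss(features, eps=1e-8):
    """KoLeo regularizer: -(1/n) * sum_i log(min_{j != i} ||x_i - x_j||).

    Features are L2-normalized first. The loss grows when features
    collapse onto each other and shrinks when they spread out.
    """
    xs = [l2_normalize(f) for f in features]
    n = len(xs)
    total = 0.0
    for i in range(n):
        nearest = min(math.dist(xs[i], xs[j]) for j in range(n) if j != i)
        total += math.log(nearest + eps)
    return -total / n


def ema_update(teacher, student, momentum=0.99):
    """Move each teacher parameter toward the matching student parameter."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


# Spread-out features are penalized less than tightly clustered ones.
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
clustered = [[1.0, 0.0], [1.0, 0.01], [1.0, -0.01], [1.0, 0.02]]
loss_spread = koleo_loss(spread)
loss_clustered = koleo_loss(clustered)
```

The comparison at the end shows the intended effect of the regularizer: the clustered batch receives a higher loss than the spread batch, which pushes features within a batch toward a more uniform distribution.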
Fifth, the original CC BY-NC 4.0 release effectively excluded commercial deployments for the first four months, which limited industrial adoption until the August 2023 Apache 2.0 relicensing.[5][21]
Sixth, fairness audits using FACET have surfaced disparities in feature behaviour across demographic groups in DINOv2-family models, motivating continued evaluation of how curation, distillation, and self-supervision influence downstream group fairness; FACET itself is restricted to evaluation rather than training to discourage misuse.[21]
DINOv2's significance in computer vision is twofold. As an artifact, it produced a set of foundation model weights that, combined with the permissive 2023 relicensing, became the default open self-supervised vision backbone for research, robotics, and downstream products that need frozen visual features.[4][5] As a methodology, it demonstrated that self-supervised pre-training on curated image data can equal or exceed CLIP-style image-text pre-training on classification tasks and substantially exceed it on dense prediction, supporting the broader argument from researchers such as Yann LeCun that purely visual self-supervision is a viable foundation for vision and that the Transformer-based joint-embedding predictive paradigm has room to scale further.[10][32]
The release also influenced engineering practice in self-supervised vision: the use of large-scale automated curation rather than raw web scraping, FSDP-based training of billion-parameter ViTs in pure PyTorch, and the now-common pattern of distilling smaller students from a single trained-from-scratch giant teacher have all been widely adopted in successor work, both inside Meta (DINOv3, V-JEPA 2) and in third-party medical and remote-sensing foundation models (UNI, UNI 2, Hibou, RudolfV, Panopticon).[1][10][24][25][26][31][33] The DINOv2 paper accumulated several thousand citations within two years of release, placing it among the most heavily cited vision papers of 2023, and the GitHub repository hosting the official implementation accumulated tens of thousands of stars over the same window.[4][14]