DINOv2

Updated Jul 13, 2026

DINOv2 is a family of self-supervised Vision Transformer models released by Meta AI Research in April 2023 that produces general-purpose visual features transferring to many downstream tasks without fine-tuning of the backbone.^[1]^[2] The system was introduced in the paper "DINOv2: Learning Robust Visual Features without Supervision" by Maxime Oquab, Timothée Darcet, Théo Moutakanni and 23 co-authors, and combines an automated data curation pipeline that assembled an unlabeled corpus of 142 million images (the LVD-142M dataset) with a knowledge distillation training objective derived from the earlier DINO and iBOT methods.^[1]^[3] A one-billion-parameter ViT-g/14 teacher was trained from scratch, and smaller ViT-S/14, ViT-B/14 and ViT-L/14 students were then produced by distillation; the family is distributed as open-source code and weights under the Apache 2.0 license following an August 2023 relicensing.^[2]^[4]^[5] The paper reports that the released features "perform well across domains without any requirement for fine-tuning" and were "pretrained on a dataset of 142 M images without using any labels or annotations," a claim that became the model's defining headline.^[1] DINOv2 has since been adopted as a frozen visual backbone in research on dense prediction, image retrieval, medical imaging, robotics, and as one of the vision encoders evaluated for multimodal language models.^[6]^[7]^[8] Meta released a successor, DINOv3, in August 2025, which scales the recipe to a 7-billion-parameter teacher trained on roughly 1.7 billion images and addresses identified shortcomings in DINOv2's late-training dense-feature stability.^[9]^[10]

Infobox

Attribute	Value
Developer	Meta AI Research (FAIR)
Initial paper	arXiv:2304.07193
Submission date (v1)	April 14, 2023
Blog announcement	April 17, 2023
Journal publication	Transactions on Machine Learning Research (TMLR), January 2024
Backbone	Vision Transformer (ViT)
Model sizes	ViT-S/14 (21M), ViT-B/14 (86M), ViT-L/14 (300M), ViT-g/14 (1.1B)
Training dataset	LVD-142M (142 million curated images)
Training objective	DINO + iBOT + KoLeo (self-distillation)
Patch size	14
Initial license	CC BY-NC 4.0
Relicensed	Apache 2.0 (August 31, 2023)
Code repository	github.com/facebookresearch/dinov2
Companion paper	"Vision Transformers Need Registers" (arXiv:2309.16588)
Successor	DINOv3 (August 2025)

What problem does DINOv2 solve?

The DINOv2 project sits at the intersection of two threads of computer-vision research: scaling laws derived from language pre-training, and the line of self-distillation methods that emerged from the original DINO paper in 2021. The paper opens with the observation that, in natural language processing, foundation-model pre-training on large amounts of text has yielded representations that work across tasks without fine-tuning, and asks whether the same can be achieved for images by training large Vision Transformers on appropriately curated visual data; the authors frame their work as an attempt to produce an analogous foundation backbone for images by relying on self-supervised learning rather than text supervision.^[1]^[2] The paper was first posted to arXiv on April 14, 2023 and was later published in Transactions on Machine Learning Research in January 2024.^[1]^[3]

The immediate predecessors were two self-supervised methods developed in part at Meta's research labs. DINO, introduced in Mathilde Caron et al., "Emerging Properties in Self-Supervised Vision Transformers," demonstrated that a student-teacher distillation loss applied to multi-crop augmentations of unlabeled images causes ViTs to develop attention maps that segment foreground objects, and produces features that score 80.1% top-1 accuracy on ImageNet under linear evaluation with a ViT-Base backbone.^[11] iBOT, by Jinghao Zhou et al., extended this approach with a patch-level masked image modeling objective and an online tokenizer, reaching 82.3% linear probing accuracy on ImageNet-1k and 87.8% under full fine-tuning.^[12] DINOv2 explicitly combines elements of both: the image-level DINO loss is retained for the class token, while the patch-level iBOT loss is retained for masked patch tokens of the student.^[1]

How does DINOv2 differ from CLIP and image-text pre-training?

A second motivation was the perceived limitation of weakly supervised image-text pre-training as exemplified by CLIP. The DINOv2 authors argue that text alignment, while effective for zero-shot transfer, throws away information that is not easily described in captions: in the paper's words, "captions only approximate the rich information in images, and complex pixel-level information may not surface with this supervision."^[1] Purely visual self-supervision can in principle capture finer pixel-level structure such as depth and part segmentation.^[1] The paper benchmarks against the open-source OpenCLIP ViT-G/14 reproduction of CLIP and reports that the DINOv2 distilled ViT-L/14 matches or surpasses OpenCLIP on a broad suite of image-level and pixel-level tasks despite having roughly one-third of the parameter count.^[1]

A practical concern that shaped the project was data quality. Earlier self-supervised work had shown that simply scraping more images from the web tends to degrade representation quality, because uncurated crawls contain redundant, low-quality, or off-distribution images. The DINOv2 paper therefore devotes a substantial portion of its method section to an automatic pipeline that turns 1.2 billion unique web images into the 142-million-image curated set called LVD-142M.^[1]

What is the LVD-142M dataset and how was it curated?

LVD-142M (Large Visual Dataset, 142 million images) was assembled without using any human labels, text captions, or external metadata. The pipeline has three stages: gathering uncurated source images, deduplication, and self-supervised retrieval against a set of curated seed images.^[1]

The uncurated source is a public web crawl. After URL filtering to remove unsafe content, the team downloaded approximately 1.2 billion unique images. Post-processing applied PCA-hash deduplication, NSFW filtering, and face blurring before any further use.^[1] Deduplication then used the copy detection pipeline of Pizzi et al. (2022), known as SSCD (Self-Supervised Copy Detection), to remove near-duplicate images from this uncurated pool, both internally and against the test sets of downstream evaluation benchmarks so that subsequent benchmarking is not biased by training-test overlap.^[1]^[13]

The curated side combines roughly 25 third-party datasets including ImageNet-22k, the training split of ImageNet-1k, Google Landmarks, and several fine-grained classification corpora. These act as seed images: for each curated image, a query is issued against the deduplicated web pool to retrieve visually similar images. Embeddings for retrieval are produced by a self-supervised ViT-H/16 network previously pre-trained on ImageNet-22k. Retrieval uses cosine similarity in this embedding space; the authors typically retain N=4 nearest neighbors per query, a value chosen as a trade-off between coverage and collision rates between queries.^[1] When a query has too few near neighbors (rare classes), the pipeline falls back on k-means clustering of the web pool, sampling images from the cluster nearest to the query.^[1]
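The retrieval step can be illustrated with a minimal pure-Python sketch. It assumes embeddings are already computed and non-zero; the function names, the toy 2-D vectors, and the brute-force search are illustrative assumptions, not the authors' pipeline, which operates at web scale.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve_neighbors(query, pool, n_neighbors=4):
    """Return indices of the n_neighbors pool embeddings most similar to the query."""
    scored = [(cosine_similarity(query, emb), idx) for idx, emb in enumerate(pool)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:n_neighbors]]

# Hypothetical toy embeddings: one curated seed image and six web-pool images.
seed = [1.0, 0.0]
web_pool = [[0.9, 0.1], [0.0, 1.0], [1.0, 0.05], [-1.0, 0.0], [0.7, 0.7], [0.8, -0.2]]

# N=4 follows the value quoted in the text.
selected = retrieve_neighbors(seed, web_pool, n_neighbors=4)
```

Images retrieved by several seeds would be counted once in the final set, which is the collision effect the choice of N trades off against coverage.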

The entire curation job was distributed across 20 nodes equipped with eight V100-32GB GPUs and ran in under two days, illustrating that automated curation at this scale is tractable on modest infrastructure relative to the cost of subsequent model training.^[1]^[14] Ablations in the paper report that models trained on LVD-142M outperform those trained on raw web crawls of similar size on image classification, retrieval, and dense prediction, supporting the claim that careful self-supervised curation, rather than raw scale alone, drives the gains.^[1] LVD-142M is described in detail in the paper but is not distributed as a downloadable dataset, partly for legal reasons relating to web-image redistribution and partly to limit the spread of potentially sensitive content; this asymmetry between published code, published weights, and unpublished training data has shaped much of the subsequent reproduction work.^[1]^[14]

Model architecture

DINOv2 adopts the standard ViT architecture introduced by Dosovitskiy et al. with three modifications: a patch size of 14 (rather than 16) for all variants, the choice of an embedding dimension of 1536 with 24 heads (64 dimensions per head) for the giant variant (in place of 1408 with 16 heads in Zhai et al.'s ViT-G design), and the addition of a SwiGLU feed-forward block in larger variants.^[1]^[15] The patch-size-14 choice produces a 256-token grid for 224 by 224 inputs (16 by 16 patches) and a denser 1369-token grid for 518 by 518 inputs used in the high-resolution finishing stage.
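The token-grid arithmetic above can be checked with a short sketch. The helper name `patch_grid` is hypothetical, and the counts cover patch tokens only (class and register tokens are excluded).

```python
def patch_grid(image_size, patch_size=14):
    """Return (patches per side, total patch tokens) for a square input."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be a multiple of the patch size")
    side = image_size // patch_size
    return side, side * side

low_res = patch_grid(224)    # 224 / 14 = 16 patches per side
high_res = patch_grid(518)   # 518 / 14 = 37 patches per side

# Per-head width of the giant variant: 1536 channels split over 24 heads.
head_dim = 1536 // 24
```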

Four backbone sizes are released:^[4]^[14]

Model	Layers	Embedding dim	Heads	Parameters	ImageNet-1k k-NN	ImageNet-1k linear
ViT-S/14 (distilled)	12	384	6	21M	79.0%	80.9-81.1%
ViT-B/14 (distilled)	12	768	12	86M	82.0-82.1%	84.5%
ViT-L/14 (distilled)	24	1024	16	300M	83.5%	86.3%
ViT-g/14	40	1536	24	1100M	83.5%	86.5%

The ViT-g/14 model is the only one trained from scratch on LVD-142M. The ViT-S/14, ViT-B/14, and ViT-L/14 checkpoints are produced by self-distillation from a frozen ViT-g teacher; the authors report that this procedure outperforms training the smaller backbones from scratch on all 12 downstream benchmarks they tested.^[1] In September 2023, register-augmented checkpoints were released for each size following the companion paper "Vision Transformers Need Registers."^[16]^[14] Each variant therefore exists in two forms in the official repository: the original 2023 weights and the register weights with four additional learned tokens.^[4]^[14]

The class-token projection head is a multilayer perceptron with bottlenecked output that maps to a high-dimensional prototype space of 128k prototypes (128,000 learnable codes), shared across global crops, used to compute the DINO loss.^[1]^[17] A parallel head with the same architecture but separate weights is used for the patch tokens to compute the iBOT loss; this separation of the image-level and patch-level heads was found in ablations to improve representation quality compared to a shared head.^[1] LayerScale and stochastic depth are used throughout the deeper variants to stabilize training; aggressive stochastic depth with drop rates up to about 40% is applied to the giant model.^[1]^[17]

How is DINOv2 trained?

The DINOv2 loss is a weighted combination of three terms, all consistent with a student-teacher self-distillation framework in which the teacher's parameters are an exponential moving average of the student's.^[1] The teacher momentum follows a cosine schedule from 0.994 to 1.0 over training, so that the teacher becomes more stable as the student matures.^[1]^[17]
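A minimal sketch of the teacher update follows, assuming a half-cosine interpolation between the two momentum endpoints quoted above and a flat list of parameters. The function names are hypothetical and the exact schedule shape in the official code may differ.

```python
import math

def teacher_momentum(step, total_steps, base=0.994, final=1.0):
    """Cosine schedule that moves the EMA momentum from base to final."""
    progress = step / total_steps
    return final - (final - base) * 0.5 * (1.0 + math.cos(math.pi * progress))

def ema_update(teacher_params, student_params, momentum):
    """Exponential moving average: the teacher drifts slowly toward the student."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]

# Early in training the teacher still absorbs 0.6% of the student per step.
updated = ema_update([1.0, 2.0], [0.0, 0.0], teacher_momentum(0, 1000))
```

At the end of the schedule the momentum reaches 1.0, so the teacher stops changing.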

The image-level loss is taken from DINO. For each input image, two global crops at resolution 224 and several smaller local crops are passed through the student; only the global crops are passed through the teacher. The class-token embeddings are projected into a high-dimensional prototype space and converted into probability distributions via a softmax over learned prototypes. The loss is the cross-entropy between the teacher and student distributions over class tokens:^[1]^[11]

$$\mathcal{L}_{\text{DINO}} = -\sum_{\text{views}} p_{\text{teacher}} \log p_{\text{student}}$$
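For a single teacher/student pair of views, the cross-entropy term can be sketched as below. The temperature values are illustrative assumptions, teacher centering is omitted, and the logits stand in for prototype scores produced by the projection heads.

```python
import math

def log_softmax(logits, temperature):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_total for x in scaled]

def dino_cross_entropy(teacher_logits, student_logits,
                       teacher_temp=0.04, student_temp=0.1):
    """Cross-entropy between teacher and student prototype distributions."""
    p_teacher = [math.exp(v) for v in log_softmax(teacher_logits, teacher_temp)]
    log_p_student = log_softmax(student_logits, student_temp)
    return -sum(pt * lps for pt, lps in zip(p_teacher, log_p_student))

# A student that agrees with the teacher incurs a much smaller loss.
loss_aligned = dino_cross_entropy([2.0, 0.0, 0.0], [2.0, 0.0, 0.0])
loss_mismatched = dino_cross_entropy([2.0, 0.0, 0.0], [0.0, 2.0, 0.0])
```

The full loss sums this term over the pairs of teacher and student views.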

The patch-level loss is taken from iBOT. The student receives a copy of each global crop in which a random subset of patches has been masked; the teacher receives the unmasked image. Each masked patch's token is projected through a separate prototype head and matched, again via cross-entropy, to the teacher's projection of the same patch position from the unmasked image.^[1]^[12]

A KoLeo regularizer (Kozachenko-Leonenko entropy estimator) is added to encourage a uniform spread of the global class-token embeddings inside each mini-batch.^[1]^[17] For a batch of n features, the term is

$$\mathcal{L}_{\text{KoLeo}} = -\frac{1}{n} \sum_i \log d_{n,i}$$

where $d_{n,i}$ is the minimum L2 distance from feature $i$ to any other feature in the batch.^[1] This term acts as an entropy-style spread loss that discourages representation collapse to a small region of the embedding manifold. Ablations report that adding KoLeo to an iBOT-style baseline gives the largest single contribution to k-NN accuracy among the new training-recipe components, with about a 2.3 percentage-point improvement on ImageNet k-NN.^[1]^[17]
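The regularizer translates almost directly into code. This sketch uses a brute-force nearest-neighbor search and adds a small `eps` inside the logarithm for numerical safety; both `eps` and the toy features are assumptions rather than details from the paper.

```python
import math

def koleo_loss(features, eps=1e-8):
    """Negative mean log distance from each feature to its nearest neighbor."""
    n = len(features)
    total = 0.0
    for i, xi in enumerate(features):
        nearest = min(math.dist(xi, xj) for j, xj in enumerate(features) if j != i)
        total += math.log(nearest + eps)
    return -total / n

# Evenly spread unit vectors versus a tight cluster (hypothetical values).
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
clustered = [[1.0, 0.0], [0.99, 0.01], [0.98, 0.02], [0.97, 0.03]]

loss_spread = koleo_loss(spread)
loss_clustered = koleo_loss(clustered)
```

The clustered batch is penalized more heavily, which is the behavior that pushes embeddings apart within a mini-batch.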

The teacher prototype centering uses Sinkhorn-Knopp normalization rather than a running mean as in DINO. Three Sinkhorn iterations are applied per step, producing a doubly stochastic assignment of teacher tokens to prototypes that the authors find more stable than softmax centering with momentum.^[1]^[17] At inference, the projection heads are discarded; only the ViT backbone is used to produce features.
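The following is a minimal sketch of Sinkhorn-Knopp normalization in the style popularized by SwAV, applied to a small batch of teacher prototype scores. The temperature, the toy logits, and the function name are assumptions; very negative logits could underflow to zero and would need extra care in a real implementation.

```python
import math

def sinkhorn_knopp(teacher_logits, temperature=0.05, n_iterations=3):
    """Balance prototype usage across a batch.

    teacher_logits: B rows of K prototype scores.
    Returns B rows of assignment probabilities, each summing to 1.
    """
    n_samples = len(teacher_logits)
    n_prototypes = len(teacher_logits[0])
    peak = max(max(row) for row in teacher_logits)
    q = [[math.exp((x - peak) / temperature) for x in row] for row in teacher_logits]
    for _ in range(n_iterations):
        # Give every prototype (column) the same total mass, 1/K.
        for k in range(n_prototypes):
            col_sum = sum(q[b][k] for b in range(n_samples))
            for b in range(n_samples):
                q[b][k] /= col_sum * n_prototypes
        # Give every sample (row) the same total mass, 1/B.
        for b in range(n_samples):
            row_sum = sum(q[b])
            q[b] = [v / (row_sum * n_samples) for v in q[b]]
    # Rescale so that each row is a probability distribution.
    return [[v * n_samples for v in row] for row in q]

# All three samples prefer prototype 0 before balancing (hypothetical scores).
toy_logits = [[1.0, 0.2, 0.1], [0.9, 0.3, 0.2], [0.8, 0.1, 0.4]]
assignments = sinkhorn_knopp(toy_logits)
column_mass = [sum(row[k] for row in assignments) for k in range(3)]
```

A plain softmax would send nearly all mass to prototype 0 in this example; after balancing, each prototype receives roughly one sample's worth of mass.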

Two further tricks contribute to training stability. First, the student crop set uses multi-crop augmentation with two global crops at resolution 224 and several smaller local crops, which forces the student to predict the same representation from views of different scales.^[1] Second, a final fine-tuning stage trains the backbone at higher resolution (518 by 518) for an additional 10,000 iterations, which the authors report improves dense prediction tasks at a small fraction of full-resolution training cost.^[1]

Infrastructure and engineering

Training the one-billion-parameter ViT-g/14 from scratch is computationally demanding, and the paper describes several engineering optimizations that distinguish DINOv2's implementation from earlier self-supervised codebases. All training is implemented in PyTorch 2.0 on NVIDIA A100 GPUs with mixed-precision fp16.^[1]^[14] The AdamW optimizer is used with a cosine learning-rate schedule, a 100,000-iteration warm-up, and a cosine weight-decay schedule that ramps from 0.04 to 0.2 over training; the ViT-g/14 run uses approximately 625,000 iterations at a batch size of around 3,072 images.^[17]^[18]
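The two schedules can be sketched as follows. The iteration counts and weight-decay endpoints come from the text; the linear warm-up shape, the zero final learning rate, and the `peak_lr` parameter are assumptions, since the peak learning rate is not stated here.

```python
import math

TOTAL_ITERS = 625_000
WARMUP_ITERS = 100_000

def cosine_ramp(step, total, start, end):
    """Half-cosine interpolation from start (step 0) to end (step total)."""
    progress = min(max(step / total, 0.0), 1.0)
    return end + (start - end) * 0.5 * (1.0 + math.cos(math.pi * progress))

def weight_decay_at(step):
    """Weight decay rising from 0.04 to 0.2 over the run."""
    return cosine_ramp(step, TOTAL_ITERS, 0.04, 0.2)

def learning_rate_at(step, peak_lr, final_lr=0.0):
    """Assumed linear warm-up to peak_lr, then cosine decay to final_lr."""
    if step < WARMUP_ITERS:
        return peak_lr * step / WARMUP_ITERS
    return cosine_ramp(step - WARMUP_ITERS, TOTAL_ITERS - WARMUP_ITERS,
                       peak_lr, final_lr)
```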

A custom version of FlashAttention is used for the self-attention computation, taking advantage of the fact that the per-head embedding dimension is a multiple of 64. The authors report that their attention implementation is roughly twice as fast as iBOT's original codebase and uses about one third of the memory, on identical hardware.^[1]

Sequence packing concatenates token sequences of differing lengths (arising from multi-crop training) into a single long sequence and applies a block-diagonal attention mask so that crops do not attend across boundaries. This avoids padding waste from the smallest local crops.^[1]
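A block-diagonal mask for a packed sequence can be built as in the sketch below, where `True` means "may attend". The function name and the tiny crop lengths are illustrative; real crops would contribute hundreds of tokens each.

```python
def block_diagonal_mask(crop_lengths):
    """Boolean attention mask for crops packed into one long sequence."""
    segment_ids = []
    for crop_index, length in enumerate(crop_lengths):
        segment_ids.extend([crop_index] * length)
    total = len(segment_ids)
    # A query token may attend to a key token only inside the same crop.
    return [[segment_ids[q] == segment_ids[k] for k in range(total)]
            for q in range(total)]

# Two packed crops of 3 and 2 tokens (hypothetical sizes).
mask = block_diagonal_mask([3, 2])
allowed_pairs = sum(sum(row) for row in mask)
```

Here 13 of the 25 query-key pairs are allowed (9 within the first crop, 4 within the second), and no padding tokens are needed.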

Stochastic depth is implemented by skipping the computation of dropped residual branches rather than masking their outputs; at high drop rates (around 40%) this gives substantial wall-clock speed-ups.^[1]
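The idea can be sketched on a toy batch: the residual branch is evaluated only for a random subset of samples, and dropped samples pass through unchanged. Rescaling the kept updates by `n / n_keep` is an assumption made here to preserve the expected update; the function name and the list-based tensors are likewise illustrative.

```python
import random

def residual_with_sample_skipping(batch, branch, drop_rate, rng):
    """Apply a residual branch to a random subset of samples only."""
    n = len(batch)
    n_keep = max(n - int(n * drop_rate), 1)
    kept = rng.sample(range(n), n_keep)
    scale = n / n_keep  # assumed rescaling of the surviving updates
    out = [list(x) for x in batch]
    for idx in kept:
        update = branch(batch[idx])  # computed only for kept samples
        out[idx] = [x + scale * u for x, u in zip(batch[idx], update)]
    return out, kept

# Count how often the (toy) branch actually runs.
calls = []

def toy_branch(x):
    calls.append(1)
    return [1.0 for _ in x]

batch = [[0.0, 0.0] for _ in range(10)]
out, kept = residual_with_sample_skipping(batch, toy_branch, 0.4, random.Random(0))
```

With a 40% drop rate on ten samples the branch runs six times instead of ten, which is where the wall-clock saving comes from.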

Fully Sharded Data Parallel (FSDP) distributes parameters, gradients, and optimizer states across GPUs. The authors observe an approximate 50% reduction in inter-GPU communication compared to standard data parallel training, because gradient all-reduce operations can be performed in reduced precision while master weights are kept in float32 on each shard.^[1] On a 16-GPU node, FSDP allowed training a model with roughly four times the parameter count that fits in plain distributed data parallel for the same memory budget.^[1]

The reported training cost for the released ViT-g/14 is 22,016 A100-40GB GPU-hours.^[1]^[17] Total project compute, including ablations, smaller variants, and intermediate checkpoints, runs into hundreds of thousands of GPU-hours. Distilled smaller models reuse the same data pipeline but require substantially less compute, since the teacher is frozen and only the student is updated.^[1]

How well does DINOv2 perform on downstream tasks?

The DINOv2 paper evaluates frozen-backbone features (no fine-tuning of the ViT) across eight task families covering image classification, fine-grained recognition, instance retrieval, video classification by frame, semantic segmentation, depth estimation, video segmentation, and robustness. Results below are quoted from the paper or its v2 revision.^[1]^[17]

Task	Benchmark	Backbone	DINOv2 result	Comparison
Image classification (linear)	ImageNet-1k	ViT-g/14	86.5% top-1	OpenCLIP ViT-G/14: 86.2%
Image classification (k-NN)	ImageNet-1k	ViT-g/14	83.5% top-1
Robustness	ImageNet-A	ViT-g/14	75.9%
Robustness	ImageNet-R	ViT-g/14	78.8%
Robustness	ImageNet-Sketch	ViT-g/14	62.5%
Video action recognition	Kinetics-400 (frames + linear)	ViT-g/14	78.4% top-1	matches OpenCLIP
Video action recognition	UCF-101	ViT-g/14	91.2% top-1	matches OpenCLIP
Video temporal reasoning	Something-Something v2	ViT-g/14	38.3% top-1	+2.5 over OpenCLIP
Fine-grained classification (12 tasks avg.)	various	ViT-g/14	92.1% mean acc.
Semantic segmentation (linear)	ADE20K	ViT-g/14	49.0 mIoU
Semantic segmentation (+ multiscale)	ADE20K	ViT-g/14	53.0 mIoU
Monocular depth estimation (linear)	NYUd	ViT-g/14	0.344 RMSE
Depth estimation (DPT head)	NYUd	ViT-g/14	0.279 RMSE
Landmark retrieval	Oxford-Medium	ViT-g/14	73.6 mAP
Landmark retrieval	Oxford-Hard	ViT-g/14	52.3 mAP
Landmark retrieval	Paris-Medium	ViT-g/14	92.1 mAP
Landmark retrieval	Paris-Hard	ViT-g/14	82.6 mAP

The paper emphasises that these results are obtained without any fine-tuning of the backbone: a linear classifier, a k-nearest-neighbors classifier, or a small dense-prediction head is trained on top of frozen patch tokens. This is the primary headline of the DINOv2 release: a single set of frozen features competes with or exceeds task-specialist models that were trained or fine-tuned for each benchmark individually.^[1]^[2]
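As an illustration of frozen-feature evaluation, the sketch below classifies a query embedding by majority vote among its nearest training embeddings. It assumes features were already extracted from the frozen backbone; the plain majority vote, the toy 2-D features, and the labels are assumptions and do not reproduce the paper's exact k-NN protocol.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def knn_predict(query, train_features, train_labels, k=3):
    """Majority vote over the k training features closest to the query."""
    ranked = sorted(range(len(train_features)),
                    key=lambda i: cosine_similarity(query, train_features[i]),
                    reverse=True)
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical frozen features for four labeled images.
train_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
train_labels = ["cat", "cat", "dog", "dog"]

prediction = knn_predict([0.8, 0.2], train_features, train_labels, k=3)
```

No backbone weights are touched in this procedure, which is what makes it a direct measure of representation quality.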

Linear probing on ImageNet-1k is the most direct head-to-head comparison with CLIP-family models. DINOv2 ViT-g/14 reaches 86.5%, slightly above the publicly available OpenCLIP ViT-G/14 reproduction at 86.2%; the distilled DINOv2 ViT-L/14 reaches 86.3% with roughly one-third the parameters of OpenCLIP ViT-G/14, which the authors highlight as evidence that purely visual self-supervision can match weakly supervised image-text pre-training at the image-classification level.^[1] DINOv2's advantage grows on dense pixel-level tasks such as depth estimation, where the OpenCLIP baselines trail by wider margins than on classification.^[1] On fine-grained classification, the DINOv2 ViT-g/14 reaches an average of 92.1% accuracy across twelve datasets including ImageNet-22k, Places205, iNaturalist 2018, iNaturalist 2021, Stanford Cars, FGVC-Aircraft, Caltech-101, and Food-101, with substantial gains over OpenCLIP on natural-world long-tailed benchmarks (+8.6 points on iNaturalist 2018 and +9.7 points on iNaturalist 2021).^[1]^[17]

For video, the paper performs frozen-feature linear classification on uniformly sampled frames and reports 78.4% top-1 on Kinetics-400, 91.2% on UCF-101, and 38.3% on Something-Something v2; the first two are within roughly half a point of OpenCLIP while the SSv2 number is 2.5 points higher, despite DINOv2 not training on any video data.^[17] For dense pixel-level evaluation, the paper places a DPT-style decoder head on the frozen ViT and reaches 0.279 RMSE on NYUd, a result described as competitive with task-specialist supervised models trained directly for the task.^[1]

Why do Vision Transformers need registers?

In September 2023, Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski (all four of whom are DINOv2 authors) published "Vision Transformers Need Registers" (arXiv:2309.16588), which identified an artifact in DINOv2 (and other) ViT feature maps: a small number of patch tokens, typically in low-information background regions, have anomalously high feature norms and appear to be repurposed by the network as scratch memory for global computations.^[16]

The proposed fix is to extend the input sequence of the ViT with a small set of additional learnable "register" tokens that have no spatial meaning. These tokens absorb the global-computation role that previously corrupted background patches. The paper uses four register tokens in its main experiments and reports that adding registers eliminates the high-norm artifacts, produces smoother attention maps and dense feature maps, sets new state-of-the-art numbers for self-supervised models on dense prediction tasks, and improves the quality of unsupervised object discovery built on top of the features.^[16]^[19] Ablations in the same paper indicate that even a single register token captures most of the benefit, but four is recommended as a robust default.^[16]
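The bookkeeping for register tokens can be sketched as below: registers are added to the input sequence and dropped again at the output. The token ordering (class token, then registers, then patches), the function names, and the string placeholders standing in for embedding vectors are assumptions for illustration.

```python
def build_input_sequence(cls_token, register_tokens, patch_tokens):
    """Concatenate class token, learnable registers, and patch tokens."""
    return [cls_token] + list(register_tokens) + list(patch_tokens)

def split_output_sequence(output_tokens, n_registers):
    """Keep class and patch outputs; register outputs are discarded."""
    cls_out = output_tokens[0]
    patch_out = output_tokens[1 + n_registers:]
    return cls_out, patch_out

# A 224 by 224 input gives 256 patch tokens; four registers as in the main experiments.
patches = [f"patch_{i}" for i in range(256)]
registers = [f"reg_{i}" for i in range(4)]
sequence = build_input_sequence("cls", registers, patches)

# The transformer itself is omitted here, so the sequence passes through unchanged.
cls_out, patch_out = split_output_sequence(sequence, n_registers=4)
```

The sequence grows from 257 to 261 tokens, so the added compute is small relative to the patch tokens.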

Following the paper, Meta released DINOv2-with-registers checkpoints for each model size (ViT-S/14, ViT-B/14, ViT-L/14, ViT-g/14) alongside the original DINOv2 weights, exposing both variants in the official GitHub repository and PyTorch Hub interface.^[4]^[14] The registers variant is the recommended default for tasks that rely on per-patch features, such as dense segmentation or robot manipulation policies that read spatial saliency from the backbone.^[7]

A 2025 follow-up by Jiang, Dravid, Efros, and Gandelsman, "Vision Transformers Don't Need Trained Registers," argued that the artifacts are caused by a small set of identifiable neurons that concentrate high-norm activations on outlier tokens, and that the same benefit as trained registers can be obtained at inference time by shifting these activations onto a single appended untrained token.^[20] The training-free intervention reaches comparable downstream performance to trained registers across a range of models including CLIP and DINOv2 and was accepted as a spotlight paper at NeurIPS 2025, demonstrating that the underlying mechanism, rather than the training procedure, is the relevant lever.^[20]

Is DINOv2 open source?

DINOv2 was originally released under a CC BY-NC 4.0 license, which prohibited commercial use of the code and weights. On August 31, 2023, Meta announced a relicensing of the entire DINOv2 release (including training code, all backbone checkpoints, and downstream heads for segmentation and depth) under the Apache 2.0 license, removing the non-commercial restriction.^[5]^[21] The same announcement introduced the FACET evaluation dataset, a 32,000-image benchmark of 50,000 people annotated with demographic and physical attributes intended for fairness audits of vision foundation models (FACET itself is restricted to evaluation and is not licensed for training).^[21]

The relicensing made DINOv2 one of the few large vision foundation backbones available for commercial use under a permissive license at the time of release, alongside OpenCLIP reproductions of CLIP and the Segment Anything Model also released by Meta in 2023.^[4]^[21] Some specialized extensions of the DINOv2 codebase released later (Cell-DINO for microscopy, XRay-DINO for medical imaging) carry different licenses appropriate to their domain data, but the core DINOv2 weights and training code remain under Apache 2.0.^[14]

What is DINOv2 used for?

Frozen-backbone transfer

The most common use of DINOv2 in the literature treats the released ViT as a frozen feature extractor. The official repository ships pretrained linear classifiers for ImageNet-1k, ImageNet-22k, and Places-205, along with DPT-style heads for monocular depth estimation on NYU and KITTI, and semantic segmentation heads for ADE20K and VOC2012 (including linear, multi-scale, and Mask2Former-style decoders).^[4]^[14] These are intended both as ready-to-use components and as reference baselines for new downstream heads built on the same backbone.

Multimodal language models

DINOv2 has been evaluated as an alternative vision encoder for multimodal large language models in the LLaVA family. A 2025 comparative study, "LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning," reports mixed results: DINOv2 backbones outperform CLIP variants on certain visual benchmarks when paired with a 2-billion-parameter language model, but image-text pre-trained encoders such as CLIP and SigLIP remain competitive or superior on text-grounded multimodal evaluations, particularly at smaller language-model scales.^[6] The general pattern reported in that and related work is that DINOv2 features carry strong spatial structure but lack the language-aligned semantics of CLIP, making concatenation or hybrid encoder schemes a common workaround for LLaVA-style architectures.^[6]^[22]

The Cambrian-1 study by Tong, Brown, Wu and colleagues (NeurIPS 2024 oral) systematically compared more than 20 vision encoders across multimodal benchmarks and proposed a Spatial Vision Aggregator (SVA) connector that combines features from CLIP, SigLIP, OpenCLIP-ConvNeXt, and DINOv2; the authors report that DINOv2 contributes disproportionately to vision-centric and spatial benchmarks despite being weaker on text-grounded ones, and that the optimal multimodal vision tower is typically a hybrid of language-supervised and self-supervised encoders rather than either alone.^[22] The complementarity is consistent with the original DINOv2 paper's claim that the self-supervised features encode finer pixel-level structure than CLIP-style features but lack their language alignment.^[1]^[22]

Medical imaging

A 2023 study, "Evaluating General Purpose Vision Foundation Models for Medical Image Analysis: An Experimental Study of DINOv2 on Radiology Benchmarks" (arXiv:2312.02366), evaluated frozen DINOv2 features on chest X-ray, computed tomography, and magnetic resonance datasets across more than 200 evaluations covering disease classification and organ segmentation in 2D and 3D, and found that DINOv2 transfers competitively to medical classification tasks despite not having been trained on any medical imagery.^[23] Subsequent work has explored domain-specific continued pre-training using the DINOv2 objective on histopathology, fundus images, and other clinical datasets.

The most prominent example is UNI, a general-purpose pathology foundation model from the Mahmood Lab at Brigham and Women's Hospital / Harvard, published in Nature Medicine in 2024. UNI uses the DINOv2 training recipe to pre-train a ViT-L/16 on more than 100 million pathology tile crops sampled from over 100,000 hematoxylin-and-eosin (H&E) whole-slide images, and reports strong performance across a wide range of pathology classification and survival-prediction tasks.^[24] A January 2025 successor, UNI 2, scales pre-training to more than 200 million tiles sampled from more than 350,000 whole-slide images including H&E and immunohistochemistry stains.^[25] Hibou, by Nechaev, Pchelnikov, and Ivanova (June 2024), independently used the DINOv2 framework to pre-train ViT-B and ViT-L pathology models on more than one million whole-slide images and released them publicly.^[26] Other DINOv2-derived medical models include RudolfV (a pathology backbone trained on tiles from approximately 103,849 whole-slide images), HistoDARE (an attention-modified DINO variant for histopathology), and MM-DINOv2 (a multi-modal adaptation presented at MICCAI 2025).^[27]

Robotics and physical agents

DINOv2's combination of strong spatial features, frozen-backbone usability, and permissive license has made it a frequent choice in robotics pipelines. DINOBot ("DINOBot: Robot Manipulation via Retrieval and Alignment with Vision Foundation Models," arXiv:2402.13181), by Norman Di Palo and Edward Johns at Imperial College London, uses DINO/DINOv2 features as the basis for an imitation-learning framework that retrieves the most visually similar demonstration of an object and then aligns its end-effector with the new object using pixel-level features from a single frozen ViT.^[7] A 2025 system for bimanual manipulation uses DINOv2 attention maps as pixel-level saliency scores lifted into a 3D voxel grid to provide semantic cues to a behavior-cloning policy.^[28] NASA's Jet Propulsion Laboratory has reported using DINOv2 as the common backbone in a "Visual Perception Engine" for planetary rover prototypes, where a single forward pass through the encoder serves multiple downstream vision tasks without per-task fine-tuning.^[8]

Object discovery and dense prediction

Because DINOv2's attention maps localize objects without supervision, the backbone has been used as the front end of unsupervised object-discovery pipelines (descendants of the LOST and CutLER methods), saliency detection, and weakly supervised segmentation.^[16] The release of registers further improved the smoothness of feature maps used in these downstream pipelines.^[16]

Earth observation and forest canopy mapping

A 2024 collaboration between Meta and the World Resources Institute used DINOv2 as a feature extractor over 18 million 0.5-meter natural-color satellite tiles from Maxar Technologies to produce a global canopy height map at 1-meter resolution, training a convolutional decoder on aerial lidar height measurements to map DINOv2 features to per-pixel canopy heights.^[29]^[38] The released model achieved a mean absolute error of about 2.8 meters and was made publicly available.^[29]^[38] A March 2026 update, Canopy Height Maps v2 (CHMv2), replaced the DINOv2 backbone with DINOv3 pre-trained on a domain-specific 493-million-image satellite corpus (SAT-493M), reporting an R-squared improvement from 0.53 to 0.86 and sharper structure at the canopy level.^[30] Beyond canopy mapping, DINOv2 features have been used in remote-sensing few-shot object detection and visual place recognition; Panopticon (2025) extends DINOv2 to multi-sensor satellite data by encoding optical and synthetic-aperture-radar channels using cross-sensor augmentations.^[31]

How does DINOv2 compare with other visual representation methods?

DINOv2 sits in a crowded landscape of vision pre-training paradigms; the table below summarizes how it relates to the main alternatives at the time of its release and to later successors.

Method	Supervision signal	Architecture family	Key property
CLIP (Radford et al., 2021)	Image-text contrastive	ViT or CNN	Zero-shot text-to-image alignment
SimCLR (Chen et al., 2020)	Image-image contrastive	ResNet	Augmentation-based contrastive learning
MAE (He et al., 2021)	Masked pixel reconstruction	ViT	Reconstruction in pixel space
DINO (Caron et al., 2021)	Self-distillation, class token	ViT	Emergent attention-based segmentation
iBOT (Zhou et al., 2021)	Self-distillation + masked patches	ViT	Patch-level + image-level objective
DINOv2 (Oquab et al., 2023)	DINO + iBOT + KoLeo on curated 142M	ViT	Frozen backbone competitive with fine-tuned task models
SigLIP (Zhai et al., 2023)	Image-text sigmoid contrastive	ViT	Replaces CLIP softmax with pairwise sigmoid
I-JEPA (Assran et al., 2023)	Predicted-target self-distillation in feature space	ViT	Predict representations of masked regions
V-JEPA (Bardes et al., 2024)	I-JEPA extended to video	ViT	Video-feature self-supervision
V-JEPA 2 (Assran, Bardes, Fan et al., 2025)	Video-based world model	ViT	Robot planning from frozen video features
DINOv3 (Siméoni et al., 2025)	DINO + iBOT + KoLeo + Gram anchoring on 1.7B	ViT, ConvNeXt	7B-parameter teacher, stable dense features

DINOv2's main empirical claim relative to CLIP-family models is that it matches them on image-level tasks while substantially exceeding them on dense pixel-level tasks like depth estimation and segmentation, which require fine spatial structure that text supervision tends to discard.^[1] Relative to masked-image methods such as MAE, DINOv2 features are more directly useful as frozen representations (MAE features typically require fine-tuning to compete), at the cost of a more complex training recipe.^[1] Relative to its own predecessor DINO, the main improvements come from larger and more carefully curated training data, the addition of the iBOT patch loss, the KoLeo regularizer, and engineering changes that make training a billion-parameter model practical.^[1]^[11]

The Joint Embedding Predictive Architecture line (I-JEPA in early 2023, V-JEPA in 2024, V-JEPA 2 in June 2025) shares the broad goal of learning visual features without text alignment but uses a different mechanism: predicting target representations of masked regions in feature space rather than performing self-distillation on entire image crops.^[32]^[33] V-JEPA 2 in particular is positioned as a video-based world model for robotics, trained on more than one million hours of video and roughly 62 hours of robot data.^[33] The two lines coexist within Meta's broader research program on self-supervised vision rather than supplanting each other, with DINO-family models continuing to dominate dense-feature use cases and JEPA-family models targeting video understanding and planning.
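
The difference in mechanism can be made concrete with a toy sketch of a feature-space prediction loss. It assumes a squared-error distance, chosen here for simplicity, and stands in hand-written lists for the outputs of a target encoder and a predictor. All names are hypothetical and the snippet is not drawn from any JEPA implementation.

```python
def masked_feature_loss(predicted, target, masked_positions):
    """Average squared error between predicted and target features, masked patches only."""
    total, count = 0.0, 0
    for p in masked_positions:
        for a, b in zip(predicted[p], target[p]):
            total += (a - b) ** 2
            count += 1
    return total / count

# Toy 4-patch image with 2-dimensional features per patch.
target_features = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5], [1.0, 1.0]]     # stand-in for a target encoder
predicted_features = [[9.0, 9.0], [1.0, 0.0], [0.5, 0.0], [1.0, 1.0]]  # stand-in for a predictor
masked = [1, 2]  # only these patches were hidden from the context encoder

# Patch 0 is badly predicted but visible, so it does not contribute to the loss.
print(masked_feature_loss(predicted_features, target_features, masked))
```

The loss is computed on representations at masked positions only, with no pixel reconstruction and no comparison of whole-crop embeddings.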

What changed in DINOv3, the August 2025 successor?

In August 2025, Meta released DINOv3 (Siméoni, Vo, Seitzer, Baldassarre, Oquab et al., arXiv:2508.10104), positioned as a direct successor to DINOv2 that addresses two identified limitations: difficulty scaling beyond the one-billion-parameter regime with the original recipe, and gradual degradation of dense-feature quality late in long training schedules.^[9]^[10] Meta described DINOv3 as the first demonstration that self-supervised models can "outperform their weakly supervised counterparts across a wide range of tasks."^[9] The DINOv3 paper introduces several changes:

A 6.7-billion-parameter ViT-7B teacher with 40 transformer blocks and embedding dimension 4096, trained on LVD-1689M, a corpus of approximately 1.7 billion images (about 12 times the size of LVD-142M) curated by combining hierarchical k-means clustering with retrieval against seed datasets (ImageNet-1k, ImageNet-22k, and Mapillary Street-level).^[10]
An axial rotary position embedding (RoPE) variant ("RoPE-box") replacing the absolute position embeddings of DINOv2, with jittering of coordinate scales during training to improve robustness to inference at varying resolutions and aspect ratios.^[10]
A new "Gram anchoring" loss that aligns the Gram matrix (all pairwise patch-token dot products) of the student with that of an earlier teacher checkpoint, stabilizing dense features during long training.^[9]^[10] This addresses the empirical observation that patch features in DINOv2-style training tend to degrade in fine-grained quality as classification metrics continue to improve.
A more aggressive distillation pipeline that produces, from the ViT-7B teacher, a family of distilled students: ViT-S (21M), ViT-S+ (29M), ViT-B (86M), ViT-L (0.3B), and ViT-H+ (0.8B), plus ConvNeXt variants (T, S, B, L) for compute-constrained settings.^[10]^[34]
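
The Gram anchoring idea in the list above can be sketched as follows. This is a minimal illustration and not the DINOv3 implementation: it assumes patch features are L2-normalized before the Gram matrix is formed and that the penalty is the mean squared difference between the two matrices. Function names and the toy features are hypothetical.

```python
import math

def l2_normalize(vec, eps=1e-8):
    """Scale a feature vector to (approximately) unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / (norm + eps) for x in vec]

def gram_matrix(patches):
    """All pairwise dot products between L2-normalized patch features."""
    normed = [l2_normalize(p) for p in patches]
    return [[sum(a * b for a, b in zip(p, q)) for q in normed] for p in normed]

def gram_anchoring_loss(student_patches, anchor_patches):
    """Mean squared difference between the student and anchor Gram matrices."""
    g_student = gram_matrix(student_patches)
    g_anchor = gram_matrix(anchor_patches)
    n = len(g_student)
    return sum(
        (g_student[i][j] - g_anchor[i][j]) ** 2
        for i in range(n) for j in range(n)
    ) / (n * n)

anchor = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]     # patches from an earlier teacher checkpoint
rotated = [[0.0, 1.0], [-1.0, 0.0], [-1.0, 1.0]]  # same geometry, rotated by 90 degrees
collapsed = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]  # all patches identical

print(gram_anchoring_loss(rotated, anchor))    # close to zero
print(gram_anchoring_loss(collapsed, anchor))  # clearly positive
```

Because the Gram matrix depends only on similarities between patches, the loss leaves the student free to move its features globally (the rotated case) while penalizing any change in how patches relate to one another (the collapsed case).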

DINOv3 reports state-of-the-art frozen-feature performance across more than 60 benchmarks covering 15 task families, including 88.2% ImageNet-1k linear probing, 66.1 mAP on COCO object detection, 63.0 mIoU on ADE20K semantic segmentation, and 0.281 RMSE on NYUv2 depth, matching or exceeding both DINOv2 and weakly supervised baselines such as SigLIP 2 and Meta's Perception Encoder.^[10]^[34] The model was released under a commercial license alongside the paper, and the same release introduced DINOv3 variants pretrained on satellite imagery (SAT-493M), which the canopy-height map collaboration with the World Resources Institute later adopted for its v2 product in March 2026.^[29]^[30]

Notwithstanding DINOv3, DINOv2 remains widely used in 2025 and 2026 because it is mature, has well-validated downstream heads, and runs at lower inference cost. The original DINOv2 weights all remain available under the Apache 2.0 license, and Hugging Face Transformers supports DINOv2 as a first-class model alongside the DINOv2-with-registers variants.^[14]^[19]^[35]

Limitations and criticisms

The DINOv2 paper and subsequent literature have noted a number of limitations.

The first is dataset opacity. LVD-142M is described in the paper but is not released as a downloadable dataset, both for legal reasons (the underlying images come from a web crawl) and to mitigate the redistribution of potentially sensitive material. Reproducing DINOv2 from scratch therefore requires either re-running the curation pipeline against a comparable web source or substituting another large image set, and exact reproductions of the published numbers have been difficult outside Meta.^[1]^[14]

Second, even with registers, the released checkpoints still exhibit some texture- and patch-level artifacts under aggressive prompting. The "Vision Transformers Need Registers" paper documents the high-norm artifact in plain DINOv2 and provides a fix; subsequent work ("Vision Transformers Don't Need Trained Registers," 2025) has argued that training-free interventions on attention heads can replicate much of the benefit of registers without retraining, indicating that the underlying phenomenon is a structural property of large ViTs rather than a quirk of the DINOv2 recipe.^[16]^[20]

Third, when used as a multimodal vision encoder, DINOv2 has consistently been reported as weaker than image-text pre-trained encoders for tasks that rely on aligning visual content to natural-language prompts, including VQA-style benchmarks used to evaluate LLaVA variants.^[6]^[22] This is a structural limitation of any pure self-supervised vision objective: the resulting features encode visual similarity rather than semantic categories that map onto words. Hybrid encoder configurations (combining DINOv2 with CLIP features) such as those in Cambrian-1 are commonly used to mitigate this.^[22]
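
One simple way to picture a hybrid encoder is channel-wise concatenation of per-patch features followed by a linear projection. The sketch below assumes both encoders emit the same number of patch tokens and uses toy dimensions and a hand-written projection matrix. It is an illustration of the general idea only and is not a description of how Cambrian-1 aggregates its encoders.

```python
def fuse_patch_features(dino_tokens, clip_tokens):
    """Concatenate two encoders' features patch by patch along the channel axis."""
    if len(dino_tokens) != len(clip_tokens):
        raise ValueError("both encoders must produce the same number of patch tokens")
    return [d + c for d, c in zip(dino_tokens, clip_tokens)]

def project(tokens, weight):
    """Apply a linear projection (weight has out_dim rows and in_dim columns) to every token."""
    return [[sum(w * x for w, x in zip(row, tok)) for row in weight] for tok in tokens]

# Two patches; hypothetical 3-dim self-supervised features and 2-dim image-text features.
dino = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
clip = [[0.5, -0.5], [1.0, 0.0]]

fused = fuse_patch_features(dino, clip)  # each token now has 3 + 2 = 5 channels
weight = [
    [1.0, 0.0, 0.0, 0.0, 0.0],  # output 0 reads a self-supervised channel
    [0.0, 0.0, 0.0, 1.0, 1.0],  # output 1 mixes the image-text channels
]
projected = project(fused, weight)
print(projected)
```

In practice the projection would be learned, which lets the language model draw on spatial detail from one encoder and language-aligned semantics from the other.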

Fourth, the training recipe is intricate. The combined DINO+iBOT+KoLeo loss with Sinkhorn-Knopp centering, multi-crop augmentation, EMA teacher, and high-resolution finishing stage has many hyperparameters, and the authors note that several of the engineering tricks (sequence packing, stochastic depth implementation, custom FlashAttention) are not exposed at the API level. The DINOv3 follow-up paper explicitly motivates its simpler constant-learning-rate training schedule and the Gram anchoring fix as responses to the brittleness and late-training degradation observed in DINOv2 training runs.^[9]^[10]
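
To give a sense of one ingredient in that recipe, the following is a minimal sketch of Sinkhorn-Knopp centering applied to teacher logits. It alternates column and row normalization so that prototypes receive roughly equal total mass across the batch. The temperature, iteration count, and toy logits are illustrative assumptions, and the code is not the DINOv2 implementation.

```python
import math

def sinkhorn_knopp(teacher_logits, temperature=0.05, n_iters=3):
    """Turn teacher logits (batch x prototypes) into balanced soft assignments."""
    n_samples = len(teacher_logits)
    n_protos = len(teacher_logits[0])
    # Subtract the global max before exponentiating for numerical stability.
    top = max(max(row) for row in teacher_logits)
    q = [[math.exp((x - top) / temperature) for x in row] for row in teacher_logits]
    for _ in range(n_iters):
        # Balance prototypes: rescale each column to carry equal total mass.
        col_sums = [sum(q[i][k] for i in range(n_samples)) for k in range(n_protos)]
        q = [[q[i][k] / col_sums[k] / n_protos for k in range(n_protos)]
             for i in range(n_samples)]
        # Balance samples: rescale each row to carry equal total mass.
        row_sums = [sum(row) for row in q]
        q = [[x / row_sums[i] / n_samples for x in q[i]] for i in range(n_samples)]
    # Rescale so each sample's assignment sums to one.
    return [[x * n_samples for x in row] for row in q]

# Every sample initially favors prototype 0.
logits = [[2.0, 1.0, 0.5], [1.9, 1.2, 0.4], [2.1, 0.9, 0.6]]
assignments = sinkhorn_knopp(logits)
for row in assignments:
    print([round(x, 3) for x in row])
```

Each output row is a probability distribution over prototypes, and the column balancing discourages the collapse in which every sample is assigned to the same prototype.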

Fifth, the original CC BY-NC 4.0 release effectively excluded commercial deployments for the first four months, which limited industrial adoption until the August 2023 Apache 2.0 relicensing.^[5]^[21]

Sixth, fairness audits using FACET have surfaced disparities in feature behaviour across demographic groups in DINOv2-family models, motivating continued evaluation of how curation, distillation, and self-supervision influence downstream group fairness; FACET itself is restricted to evaluation rather than training to discourage misuse.^[21]

Significance

DINOv2's significance in computer vision is twofold. As an artifact, it produced a set of foundation model weights that, combined with the permissive 2023 relicensing, became the default open self-supervised vision backbone for research, robotics, and downstream products that need frozen visual features.^[4]^[5] As a methodology, it demonstrated that self-supervised pre-training on curated image data can equal or exceed CLIP-style image-text pre-training on classification tasks and substantially exceed it on dense prediction, supporting the broader argument from researchers such as Yann LeCun that purely visual self-supervision is a viable foundation for vision and that the Transformer-based joint-embedding predictive paradigm has room to scale further.^[10]^[32]

The release also influenced engineering practice in self-supervised vision: the use of large-scale automated curation rather than raw web scraping, FSDP-based training of billion-parameter ViTs in pure PyTorch, and the now-common pattern of distilling smaller students from a single trained-from-scratch giant teacher have all been widely adopted in successor work, both inside Meta (DINOv3, V-JEPA 2) and in third-party medical and remote-sensing foundation models (UNI, UNI 2, Hibou, RudolfV, Panopticon).^[1]^[10]^[24]^[25]^[26]^[31]^[33] The DINOv2 paper accumulated several thousand citations within two years of release, placing it among the most heavily cited vision papers of 2023, and the GitHub repository hosting the official implementation accumulated tens of thousands of stars over the same window.^[4]^[14]

References

1. Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, Piotr Bojanowski, "DINOv2: Learning Robust Visual Features without Supervision", arXiv, 2023-04-14 (published in Transactions on Machine Learning Research, January 2024). https://arxiv.org/abs/2304.07193. Accessed 2026-06-23.
2. Meta AI, "DINOv2: State-of-the-art computer vision models with self-supervised learning", Meta AI Blog, 2023-04-17. https://ai.meta.com/blog/dino-v2-computer-vision-self-supervised-learning/. Accessed 2026-06-23.
3. Maxime Oquab et al., "DINOv2: Learning Robust Visual Features without Supervision (HTML v2)", arXiv, 2024-02-02. https://arxiv.org/html/2304.07193v2. Accessed 2026-06-23.
4. Meta AI Research, "facebookresearch/dinov2 (GitHub repository)", GitHub, 2023-04-17. https://github.com/facebookresearch/dinov2. Accessed 2026-06-23.
5. Meta AI, "DINOv2 releases training code and model weights under Apache-2 license", LinkedIn / Meta AI announcement, 2023-08-31. https://www.linkedin.com/posts/aiatmeta_announcing-the-commercial-relicensing-and-activity-7107141311752785920-Fmcy. Accessed 2026-06-23.
6. Federico Cocchi et al., "LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning", arXiv, 2025-03-19. https://arxiv.org/abs/2503.15621. Accessed 2026-06-23.
7. Norman Di Palo, Edward Johns, "DINOBot: Robot Manipulation via Retrieval and Alignment with Vision Foundation Models", arXiv, 2024-02-20. https://arxiv.org/abs/2402.13181. Accessed 2026-06-23.
8. Meta AI, "Small robots, mighty vision: NASA Jet Propulsion Laboratory's DINOv2-enabled robot rovers and the future of planetary exploration", Meta AI Blog. https://ai.meta.com/blog/nasa-jpl-dino-robot-explorers/. Accessed 2026-06-23.
9. Meta AI, "DINOv3: Self-supervised learning for vision at unprecedented scale", Meta AI Blog, 2025-08-14. https://ai.meta.com/blog/dinov3-self-supervised-vision-model/. Accessed 2026-06-23.
10. Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab et al., "DINOv3", arXiv, 2025-08-13. https://arxiv.org/abs/2508.10104. Accessed 2026-06-23.
11. Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, Armand Joulin, "Emerging Properties in Self-Supervised Vision Transformers", arXiv, 2021-04-29. https://arxiv.org/abs/2104.14294. Accessed 2026-06-23.
12. Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, Tao Kong, "iBOT: Image BERT Pre-Training with Online Tokenizer", arXiv, 2021-11-15. https://arxiv.org/abs/2111.07832. Accessed 2026-06-23.
13. Ed Pizzi, Sreya Dutta Roy, Sugosh Nagavara Ravindra, Priya Goyal, Matthijs Douze, "A Self-Supervised Descriptor for Image Copy Detection (SSCD)", CVPR 2022. https://arxiv.org/abs/2202.10261. Accessed 2026-06-23.
14. Meta AI Research, "DINOv2 Model Card (MODEL_CARD.md)", GitHub repository. https://github.com/facebookresearch/dinov2/blob/main/MODEL_CARD.md. Accessed 2026-06-23.
15. Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby, "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", arXiv, 2020-10-22. https://arxiv.org/abs/2010.11929. Accessed 2026-06-23.
16. Timothée Darcet, Maxime Oquab, Julien Mairal, Piotr Bojanowski, "Vision Transformers Need Registers", arXiv, 2023-09-28. https://arxiv.org/abs/2309.16588. Accessed 2026-06-23.
17. Maxime Oquab et al., "DINOv2: Learning Robust Visual Features without Supervision (PDF)", HAL/arXiv, 2023. https://hal.science/hal-04376640v2/file/CVPR_2023_dinov2%20(4).pdf. Accessed 2026-06-23.
18. Lightly AI, "DINOv2 (LightlyTrain documentation)", Lightly. https://docs.lightly.ai/train/stable/methods/dinov2.html. Accessed 2026-06-23.
19. Hugging Face, "DINOv2 with Registers (Transformers documentation)". https://huggingface.co/docs/transformers/model_doc/dinov2_with_registers. Accessed 2026-06-23.
20. Nick Jiang, Amil Dravid, Alexei A. Efros, Yossi Gandelsman, "Vision Transformers Don't Need Trained Registers", arXiv, 2025-06-09. https://arxiv.org/abs/2506.08010. Accessed 2026-06-23.
21. Meta AI, "Announcing the commercial relicensing and expansion of DINOv2, plus the introduction of FACET", Meta AI Blog, 2023-08-31. https://ai.meta.com/blog/dinov2-facet-computer-vision-fairness-evaluation/. Accessed 2026-06-23.
22. Shengbang Tong, Ellis Brown, Penghao Wu et al., "Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs", arXiv, 2024-06-24. https://arxiv.org/abs/2406.16860. Accessed 2026-06-23.
23. Mohammed Baharoon, Waseem Qureshi, Jiahong Ouyang, Yanwu Xu, Abdulrhman Aljouie, Wei Peng, "Evaluating General Purpose Vision Foundation Models for Medical Image Analysis: An Experimental Study of DINOv2 on Radiology Benchmarks", arXiv, 2023-12-05. https://arxiv.org/abs/2312.02366. Accessed 2026-06-23.
24. Richard J. Chen, Tong Ding, Ming Y. Lu, Drew F. K. Williamson et al., "Towards a general-purpose foundation model for computational pathology (UNI)", Nature Medicine, 2024. https://www.nature.com/articles/s41591-024-02857-3. Accessed 2026-06-23.
25. Mahmood Lab, "UNI: Pathology Foundation Model (GitHub)", GitHub. https://github.com/mahmoodlab/UNI. Accessed 2026-06-23.
26. Dmitry Nechaev, Alexey Pchelnikov, Ekaterina Ivanova, "Hibou: A Family of Foundational Vision Transformers for Pathology", arXiv, 2024-06-07. https://arxiv.org/abs/2406.05074. Accessed 2026-06-23.
27. Authors as listed, "A Survey on Computational Pathology Foundation Models: Datasets, Adaptation Strategies, and Evaluation Tasks", arXiv, 2025. https://arxiv.org/html/2501.15724v1. Accessed 2026-06-23.
28. Authors as listed, "Large Pre-Trained Models for Bimanual Manipulation in 3D", arXiv, 2025-09-25. https://arxiv.org/abs/2509.20579. Accessed 2026-06-23.
29. Meta AI / World Resources Institute, "Canopy Height Maps (AFG dataset)", AI at Meta. https://ai.meta.com/ai-for-good/datasets/canopy-height-maps/. Accessed 2026-06-23.
30. Meta AI, "Mapping the World's Forests with Greater Precision: Introducing Canopy Height Maps v2", Meta AI Blog, 2026-03-10. https://ai.meta.com/blog/world-resources-institute-dino-canopy-height-maps-v2/. Accessed 2026-06-23.
31. Authors as listed, "Panopticon: Advancing Any-Sensor Foundation Models for Earth Observation", arXiv, 2025-03-13. https://arxiv.org/abs/2503.10845. Accessed 2026-06-23.
32. Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, Nicolas Ballas, "Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture (I-JEPA)", arXiv, 2023-01-19. https://arxiv.org/abs/2301.08243. Accessed 2026-06-23.
33. Mido Assran, Adrien Bardes, David Fan et al., "V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning", arXiv, 2025-06-11. https://arxiv.org/abs/2506.09985. Accessed 2026-06-23.
34. Lightly AI, "DINOv3 Explained: Technical Deep Dive", Lightly Blog, 2025. https://www.lightly.ai/blog/dinov3. Accessed 2026-06-23.
35. Hugging Face, "DINOv2 (Transformers documentation)". https://huggingface.co/docs/transformers/en/model_doc/dinov2. Accessed 2026-06-23.
36. Encord, "DINOv2 Explained: Revolutionizing Computer Vision with Self-Supervised Learning", Encord Blog. https://encord.com/blog/dinov2-self-supervised-learning-explained/. Accessed 2026-06-23.
37. Andrey Lukyanenko, "Paper Review: DINOv2: Learning Robust Visual Features without Supervision", andlukyane.com, 2023. https://andlukyane.com/blog/paper-review-dinov2. Accessed 2026-06-23.
38. Meta Sustainability, "Using Artificial Intelligence to Map the Earth's Forests", Meta Sustainability Blog, 2024-04-22. https://sustainability.atmeta.com/blog/2024/04/22/using-artificial-intelligence-to-map-the-earths-forests/. Accessed 2026-06-23.

