DINOv3

Updated Jul 16, 2026

DINOv3 is a family of self-supervised computer vision foundation models released by Meta AI in August 2025. It is the third major iteration in the DINO (self-distillation with no labels) line, following the original DINO (2021) and DINOv2 (2023), and is trained on roughly 1.7 billion images without any human annotations.^[1]^[2] DINOv3 introduces Gram anchoring, a regularization technique that prevents the degradation of dense patch-level features during long, large-scale training, and it scales the DINO recipe to a 6.7-billion-parameter Vision Transformer backbone.^[1]^[3] The released model suite spans Vision Transformers from 21 million to 6.7 billion parameters and a set of ConvNeXt backbones, all distilled from the same 7B teacher and made available under a commercial license through GitHub and Hugging Face.^[4]^[5]

DINOv3 is positioned as a frozen backbone: Meta reports that across roughly sixty benchmarks and fifteen vision tasks, a single DINOv3 backbone without fine-tuning matches or exceeds specialized state-of-the-art systems and previously published self- and weakly-supervised models such as SigLIP 2 and Perception Encoder on most dense-prediction tasks.^[2]^[6] The paper, titled simply DINOv3, was posted to arXiv on 13 August 2025 as preprint 2508.10104, authored by Oriane Siméoni, Huy V. Vo, Maximilian Seitzer and colleagues at Meta AI.^[1]

Infobox

Field	Value
Developer	Meta AI (FAIR)
Initial release	14 August 2025
arXiv preprint	2508.10104
Largest model	ViT-7B/16 (6.7B parameters)
Pretraining dataset	LVD-1689M (~1.7B images)
Satellite dataset	SAT-493M (~493M images)
Architectures	Vision Transformer, ConvNeXt
License	DINOv3 License (commercial use permitted)
Framework	PyTorch >= 2.7.1

Background

From DINO to DINOv2

The DINO lineage began with the 2021 paper Emerging Properties in Self-Supervised Vision Transformers by Mathilde Caron and collaborators at FAIR, which showed that a Vision Transformer trained by self-distillation with no labels developed attention maps that segmented objects without supervision.^[7] The original DINO established the basic recipe: a student network is trained to match the output distribution of an exponentially averaged teacher network across different augmented crops of the same image, producing image-level features useful for downstream tasks. The acronym DINO stands for self-DIstillation with NO labels.^[7]

DINOv2, released in April 2023, scaled this idea to ViT-g/14 (around one billion parameters) and trained on a curated 142-million-image dataset called LVD-142M. It demonstrated that frozen self-supervised features could match or exceed CLIP-style weakly supervised features on many classification and dense prediction benchmarks, but its patch-level features still lagged behind specialized systems on segmentation and depth estimation, and the curated dataset capped further data scaling.^[8] DINOv3 was framed by the authors as the answer to two questions left by DINOv2: whether the self-supervised recipe could scale by another order of magnitude in both model size and data, and whether dense features could be stabilized when training is pushed to very long schedules.^[1]^[3]

Context within Meta's research program

The release sits alongside other Meta foundation work in vision and embodied AI. V-JEPA and V-JEPA 2, also from FAIR, target video understanding through a different Joint Embedding Predictive Architecture objective championed by Yann LeCun, while DINOv3 remains an image-centric model trained with view-matching self-distillation. The two lines are research siblings rather than direct successors: DINO models target still-image representations through teacher-student matching across views, whereas JEPA models target prediction in a learned representation space, typically over time in video.^[9]

DINOv3 is also distinct from Meta's Segment Anything (SAM) family, which trains promptable segmentation models on large mask datasets, and from CLIP-style transfer learning approaches that align images with text. Where SAM produces masks given prompts and CLIP produces image-text similarity scores, DINOv3 produces a frozen visual backbone whose patch and CLS embeddings are reused unchanged across downstream tasks.^[2]^[5]

Publication timeline

The DINOv3 arXiv preprint (2508.10104) was posted on 13 August 2025, with the accompanying Meta AI blog post and product page going live the next day, on 14 August 2025.^[1]^[2]^[9] Code, model weights, and the DINOv3 License appeared on the facebookresearch/dinov3 GitHub repository the same week, and Hugging Face integration through dedicated DINOv3ViTModel and DINOv3ConvNextModel classes shipped in Transformers 4.56.0 shortly afterward.^[4]^[11]

How it works

Self-supervised objective

DINOv3 keeps the core teacher-student structure of DINO and DINOv2. The student network sees several augmented crops (global and local) of an input image, the teacher sees only the global crops, and the student is trained so that its output token distributions match the teacher's on shared content. The teacher is an exponential moving average of the student's weights.^[1]^[7]
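The teacher-student loop described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than Meta's implementation: the temperatures (0.1 and 0.04) and the momentum value are illustrative assumptions, the function names are hypothetical, and the sketch omits the teacher-output normalization that DINO-style training relies on to avoid collapse.

```python
import math

def softmax(logits, temperature):
    """Temperature-scaled softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits,
                      student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between teacher and student distributions.

    The teacher distribution is treated as a fixed target; the
    temperatures are assumed values chosen for illustration.
    """
    p_teacher = softmax(teacher_logits, teacher_temp)
    p_student = softmax(student_logits, student_temp)
    return -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))

def ema_update(teacher_weights, student_weights, momentum=0.99):
    """teacher <- momentum * teacher + (1 - momentum) * student, elementwise."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_weights, student_weights)]
```

The loss is small when the student's distribution on one crop agrees with the teacher's distribution on another crop of the same image, and the teacher changes only through `ema_update`, never through gradients.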

During the main pretraining phase, the per-batch loss combines three terms:

$\mathcal{L}_{\text{Pre}} = \mathcal{L}_{\text{DINO}} + \mathcal{L}_{\text{iBOT}} + 0.1 \cdot \mathcal{L}_{\text{DKoleo}}$

where the DINO loss compares CLS-token distributions between student and teacher on different crops, the iBOT loss adds masked patch prediction by reconstructing teacher latents at masked positions in the student input, and the Koleo regularizer encourages diversity in the batch of CLS embeddings.^[1]^[3] The combination is inherited largely intact from DINOv2; the substantive changes in DINOv3 are in scale, schedules, and the new Gram term added later in training.^[3]
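To make the diversity term concrete, the following is a single-process sketch of a KoLeo-style regularizer, assuming the common nearest-neighbor formulation on L2-normalized CLS embeddings. The function names are hypothetical, and the batching details of the variant used in DINOv3 are not reproduced here.

```python
import math

def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / max(norm, eps) for x in vector]

def koleo_loss(embeddings, eps=1e-8):
    """Negative mean log-distance from each embedding to its nearest neighbor.

    Small nearest-neighbor distances (clumped embeddings) give a large loss,
    so minimizing it spreads the batch out on the unit sphere.
    """
    feats = [l2_normalize(v) for v in embeddings]
    n = len(feats)
    total = 0.0
    for i in range(n):
        nearest = min(math.dist(feats[i], feats[j]) for j in range(n) if j != i)
        total += math.log(nearest + eps)
    return -total / n

def pretraining_loss(l_dino, l_ibot, l_koleo, koleo_weight=0.1):
    """Weighted sum matching the pretraining formula above."""
    return l_dino + l_ibot + koleo_weight * l_koleo

spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
clumped = [[1.0, 0.00], [1.0, 0.01], [1.0, 0.02], [1.0, 0.03]]
# koleo_loss(spread) is lower than koleo_loss(clumped)
```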

Gram anchoring

A practical problem the DINOv3 authors highlight is that when DINOv2-style training is extended to billions of images and millions of iterations, the global classification features keep improving while the dense patch-level features get noisier and visibly degrade. Standard tricks like adjusting the loss weights, changing the iBOT mask ratio, or shrinking the model did not fully solve the problem.^[1]^[3] The symptom is most visible in patch-feature similarity visualizations: in DINOv2-style runs taken to very long schedules, the patches that should be close in feature space (for example, all patches covering a single object) drift apart, and patch attention maps lose the crisp object alignment that made DINO interesting in the first place.^[3]^[10]

The fix introduced in DINOv3 is Gram anchoring. After about one million iterations, a frozen "Gram teacher" snapshot is taken from an earlier point in training where the dense features are still clean. During the refinement phase, the student is regularized so that the Gram matrix of its patch features (the matrix of pairwise dot products between patch embeddings) stays close to the Gram matrix produced by the Gram teacher:^[1]^[3]

$\mathcal{L}_{\text{Gram}} = \left\| X_S X_S^\top - X_G X_G^\top \right\|_F^2$

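Here $X_S$ and $X_G$ are the matrices whose rows are the L2-normalized patch features of the student and the Gram teacher, so each product is the matrix of pairwise patch similarities. The loss constrains the relational structure of the patch features (which patches are similar to which) without pinning the absolute values, so individual patch embeddings remain free to evolve. The refinement loss is then:^[1]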

$\mathcal{L}_{\text{Ref}} = w_D \mathcal{L}_{\text{DINO}} + \mathcal{L}_{\text{iBOT}} + w_{DK} \mathcal{L}_{\text{DKoleo}} + w_{\text{Gram}} \mathcal{L}_{\text{Gram}}$

A subtle aspect of Gram anchoring is that the Gram teacher is a different network from the EMA teacher used for DINO and iBOT. The EMA teacher tracks the current student to provide moving targets for view-matching; the Gram teacher is fixed at an earlier checkpoint chosen for the cleanliness of its dense features.^[1]^[3] In effect, Gram anchoring is closer to a soft anchor against a known-good representation than to a standard contrastive or distillation loss against a current target.
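A small numerical sketch can illustrate why the Gram loss constrains relations between patches rather than the embeddings themselves. It assumes L2-normalized patch features and uses toy two-dimensional vectors; the helper names are hypothetical, and a real implementation would operate on batched tensors.

```python
import math

def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / max(norm, eps) for x in vector]

def gram_matrix(patch_features):
    """P x P matrix of dot products between L2-normalized patch features."""
    feats = [l2_normalize(v) for v in patch_features]
    return [[sum(a * b for a, b in zip(u, v)) for v in feats] for u in feats]

def gram_loss(student_patches, gram_teacher_patches):
    """Squared Frobenius norm of the difference between the two Gram matrices."""
    g_student = gram_matrix(student_patches)
    g_teacher = gram_matrix(gram_teacher_patches)
    return sum((a - b) ** 2
               for row_s, row_t in zip(g_student, g_teacher)
               for a, b in zip(row_s, row_t))

def rotate_2d(vector, angle):
    """Rotate a 2-D vector counter-clockwise by the given angle in radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [c * vector[0] - s * vector[1], s * vector[0] + c * vector[1]]

teacher = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Same pairwise similarities, different absolute coordinates.
rotated_student = [rotate_2d(v, math.pi / 3) for v in teacher]
# All patches identical: the relational structure has been lost.
collapsed_student = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
```

In this toy case the rotated student incurs essentially zero loss, because rotation preserves every pairwise dot product, while the collapsed student is penalized even though each of its vectors is individually a valid unit-norm embedding.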

Independent analyses describe the effect of Gram anchoring as repairing degraded local features so that visualized patch features become sharper and more object-aligned, with measurable gains on dense-task benchmarks such as ADE20K semantic segmentation and depth estimation.^[3]^[10] Meta's blog post characterizes the technique as the central methodological innovation of DINOv3 relative to DINOv2.^[2]

Architecture choices

The flagship DINOv3 ViT-7B/16 has 6,716 million parameters, an embedding dimension of 4096, and a patch size of 16x16.^[4]^[11] Compared with the DINOv2 ViT-g backbone, two architectural changes are notable. First, the model replaces additive positional embeddings with rotary position embedding (RoPE) in two dimensions, augmented with random rescaling and jittering of the position coordinates during training to make the model robust across resolutions. This allows the same backbone to handle inputs from low resolution up to 4096x4096 pixels at inference time.^[1]^[3]^[11] Second, the released open-source variants include four register tokens, learnable embeddings that absorb high-norm artifacts and yield cleaner attention maps and stronger dense predictions; this practice was introduced in earlier FAIR work on "vision transformers need registers" and is now standard in DINOv3 checkpoints.^[11]
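The relationship between input resolution and sequence length follows directly from the patch size and the register count. The helper below is a hypothetical illustration that assumes image sides are exact multiples of the patch size; real preprocessing pipelines may resize or crop instead.

```python
def token_counts(height, width, patch_size=16, num_register_tokens=4):
    """Count the tokens a ViT with one CLS token and register tokens would see."""
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of the patch size")
    grid_h, grid_w = height // patch_size, width // patch_size
    num_patches = grid_h * grid_w
    return {
        "grid": (grid_h, grid_w),
        "patches": num_patches,
        "total": 1 + num_register_tokens + num_patches,  # CLS + registers + patches
    }

small = token_counts(224, 224)    # 14 x 14 grid, 196 patches, 201 tokens
large = token_counts(4096, 4096)  # 256 x 256 grid, 65,536 patches, 65,541 tokens
```

The jump from 196 to 65,536 patch tokens shows why position encodings that generalize across resolutions matter for high-resolution inference.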

Training data

DINOv3's main dataset is called LVD-1689M and contains roughly 1.689 billion images.^[1]^[4] The authors construct it from a pool of about 17 billion public Instagram images using three complementary strategies:

Clustering-based curation. Images are embedded with a DINOv2 backbone and clustered with hierarchical k-means over five levels (200M -> 8M -> 800k -> 100k -> 25k clusters) to balance visual concepts and avoid head-of-distribution bias.^[1]^[3] A finest-level partition into 200 million clusters is built first, those clusters are then grouped into 8 million clusters, and so on, producing a tree from which balanced sampling approximately uniformizes the sampling density across visual content.^[3]
Retrieval-based augmentation. Seed images from public datasets such as ImageNet-1k, ImageNet-22k, and Mapillary Street-level Sequences are used to retrieve nearest neighbors in the Instagram pool, ensuring that downstream-relevant concepts are represented even when they are rare on Instagram.^[1]
Specialized homogeneous batches. Roughly 10% of training batches are pure ImageNet-1k batches, providing a stable benchmark signal during training and a reliable comparison point against prior self-supervised work.^[1]
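The batch-level mixing in the last strategy can be sketched as a simple random choice per batch. The source names and the function names are hypothetical, and treating every non-ImageNet batch as a single "mixed" category is a simplifying assumption.

```python
import random

def sample_batch_source(rng, imagenet_fraction=0.10):
    """Pick the data source for one batch: pure ImageNet-1k or the mixed pool."""
    return "imagenet1k" if rng.random() < imagenet_fraction else "mixed"

def source_schedule(num_batches, seed=0, imagenet_fraction=0.10):
    """Return the sequence of batch sources for a run (seeded for repeatability)."""
    rng = random.Random(seed)
    return [sample_batch_source(rng, imagenet_fraction) for _ in range(num_batches)]

schedule = source_schedule(10_000)
observed_fraction = schedule.count("imagenet1k") / len(schedule)  # close to 0.10
```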

The relative scale-up over DINOv2 is roughly 12x in data (142M to 1.689B images) and 6x in model parameters (1.1B to 6.7B), with the same general teacher-student structure but more sophisticated data balancing.^[2]^[6]

For geospatial applications, the team additionally trains models on SAT-493M, a satellite imagery dataset of about 493 million tiles, partially drawn from Maxar imagery.^[4]^[2] Two backbones (ViT-L/16 and ViT-7B/16) are pretrained on SAT-493M alongside the LVD-1689M variants, allowing geospatial users to start from a backbone whose pretraining distribution already matches overhead imagery.^[4]

Optimization and compute

A distinctive feature of DINOv3 is the use of a constant learning-rate, weight-decay, and EMA-momentum schedule for the main pretraining phase, after a short linear warmup. The authors report that removing cosine schedules allowed them to extend training arbitrarily without retuning, which is important when neither the optimal number of iterations nor the eventual model use cases are known in advance.^[1]^[3] In effect, the schedule choice trades the sharper near-end-of-training boost typical of cosine schedules for the ability to "stop anywhere", which matters when training runs span weeks and downstream needs are still emerging.^[3]
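The difference between the two schedule families is easy to see in code. In the sketch below the base learning rate and warmup length are placeholder values, not figures from the paper; the point is only that the constant schedule needs no total step count, whereas the cosine schedule must know where training ends.

```python
import math

def constant_with_warmup(step, base_lr, warmup_steps):
    """Linear warmup, then a flat learning rate for as long as training runs."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr

def cosine_with_warmup(step, base_lr, warmup_steps, total_steps):
    """Linear warmup, then cosine decay to zero at total_steps."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))

BASE_LR = 1e-4   # placeholder value
WARMUP = 1_000   # placeholder value

# The constant schedule gives the same value however long the run is extended.
lr_late = constant_with_warmup(10_000_000, BASE_LR, WARMUP)
# The cosine schedule has already decayed to zero once total_steps is reached.
lr_cosine_end = cosine_with_warmup(10_000, BASE_LR, WARMUP, total_steps=10_000)
```

Extending a cosine run past `total_steps` requires choosing a new horizon and effectively restarting the decay, which is the retuning the constant schedule avoids.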

The main run uses a batch size of 4096 images split across 256 GPUs for roughly one million iterations.^[3] Total compute is not officially reported in GPU-days; secondary technical summaries note the absence of these figures, which makes external estimates of cost and environmental impact difficult.^[10] The authors note that the multi-student distillation pipeline was designed to amortize teacher inference across multiple student trainings, reducing the marginal cost of producing each additional distilled checkpoint in the released family.^[1]^[3]
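The reported figures imply some back-of-the-envelope quantities, computed below. These are rough estimates derived only from the numbers quoted above (the iteration count is itself approximate), and they count source images per batch rather than individual crops.

```python
global_batch = 4096            # images per optimization step
num_gpus = 256
iterations = 1_000_000         # "roughly one million" in the source
dataset_size = 1_689_000_000   # LVD-1689M

per_gpu_batch = global_batch // num_gpus       # 16 images per GPU per step
images_seen = global_batch * iterations        # about 4.1 billion images
passes_over_data = images_seen / dataset_size  # about 2.4 nominal passes
```

Because part of the training stream comes from retrieval-based and ImageNet-only batches, the "passes" figure is a nominal ratio rather than a literal epoch count.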

Post-hoc adaptation

After the main run, DINOv3 applies three post-hoc strategies:^[1]^[3]

High-resolution adaptation. An additional 10k iterations are run with mixed crops drawn from {512, 768} pixels for global views and {112, 168, 224, 336} for local views, and Gram anchoring is re-applied. This yields stable behavior at inference resolutions up to about 4096x4096.^[3]
Distillation into a model family. The 7B teacher is distilled into smaller students using a multi-student procedure that shares teacher inference across student groups on the same hardware, producing ViT-S (21M), ViT-S+ (29M), ViT-B (86M), ViT-L (300M), and ViT-H+ (840M) backbones, plus four ConvNeXt sizes (Tiny 29M, Small 50M, Base 89M, Large 198M).^[4]
Text alignment ("dino.txt"). A vision-language head is contrastively trained on image-text pairs so that DINOv3 features can be used for zero-shot tasks; the resulting variant is sometimes called DINOv3-text.^[3]^[4]
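The amortization idea behind multi-student distillation can be sketched with toy stand-ins for the networks. Everything here is hypothetical: the "models" are scalar functions, and a squared-error loss replaces the actual distillation objective. The only point illustrated is that the expensive teacher forward pass runs once per batch, however many students consume its output.

```python
class CountingTeacher:
    """Wraps a function and counts how many forward passes were requested."""

    def __init__(self, fn):
        self.fn = fn
        self.calls = 0

    def __call__(self, batch):
        self.calls += 1
        return [self.fn(x) for x in batch]

def squared_error(predictions, targets):
    """Mean squared error between two equal-length lists."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def multi_student_step(batch, teacher, students):
    """Compute one loss per student against a single shared teacher output."""
    targets = teacher(batch)  # the expensive forward pass, done once
    return {name: squared_error([fn(x) for x in batch], targets)
            for name, fn in students.items()}

teacher = CountingTeacher(lambda x: 2.0 * x)
students = {"small": lambda x: 1.5 * x, "base": lambda x: 1.9 * x}
losses = multi_student_step([1.0, 2.0, 3.0], teacher, students)
```

After the step, `teacher.calls` is 1 even though two students were trained against its output.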

Model suite

Backbone	Parameters	Pretraining dataset	Notes
ViT-S/16	21M	LVD-1689M	Distilled from ViT-7B
ViT-S+/16	29M	LVD-1689M	Distilled, custom width
ViT-B/16	86M	LVD-1689M	Distilled
ViT-L/16	300M	LVD-1689M	Distilled
ViT-H+/16	840M	LVD-1689M	Distilled, custom width
ViT-7B/16	6,716M	LVD-1689M	Teacher, trained from scratch
ConvNeXt-T	29M	LVD-1689M	Distilled
ConvNeXt-S	50M	LVD-1689M	Distilled
ConvNeXt-B	89M	LVD-1689M	Distilled
ConvNeXt-L	198M	LVD-1689M	Distilled
ViT-L/16 (SAT)	300M	SAT-493M	Satellite imagery
ViT-7B/16 (SAT)	6,716M	SAT-493M	Satellite imagery

In addition to the bare backbones, the GitHub release ships specialized heads for ImageNet classification, depth estimation on a SYNTHMIX dataset, object detection on COCO 2017, semantic segmentation on ADE20K, dino.txt zero-shot classification, and a Canopy Height Maps v2 head (CHMv2) for geospatial use.^[4]

Benchmarks

DINOv3 is evaluated extensively without fine-tuning the backbone. Selected reported numbers, all using frozen features, include:

Task	Dataset	Metric	DINOv3	DINOv2 (reference)
Image classification	ImageNet-1k linear	Top-1 accuracy	88.4%	87.3%
Semantic segmentation	PASCAL VOC	mean IoU	86.6	83.1
Semantic segmentation	ADE20K	mIoU	63.0	~57
Object detection	COCO 2017	mAP	66.1	lower
Depth estimation	NYU-Depth v2	RMSE	0.281	higher

Sources: paper-reported numbers and independent technical summaries.^[3]^[6]^[10]

Meta reports that on linear ImageNet classification DINOv3 comes within about one point of strong weakly-supervised baselines but does not surpass them; SigLIP 2 and Perception Encoder (PECore) are reported at 89.1% and 89.3% respectively against DINOv3's 88.4%.^[6] On dense prediction tasks (segmentation, depth, 3D keypoint matching), DINOv3 is reported to outperform both the weakly-supervised baselines and previous self-supervised models.^[2]^[6] In an external application reported by Meta, a collaboration with the World Resources Institute on canopy height estimation in Kenya, the canopy height error fell from approximately 4.1 meters with DINOv2 to approximately 1.2 meters with DINOv3.^[2]

A growing set of independent benchmarks evaluates DINOv3 on medical imaging. Huo et al. (2025) report a comprehensive 2D/3D benchmark of DINOv3 on classification, segmentation, and registration tasks, asking whether it sets a new medical-imaging standard; results are mixed across modalities but generally favorable for 2D radiology tasks when DINOv3 is used as a frozen backbone.^[12] A DINOv3-based submission also won the MIDOG 2025 Task 2 competition on atypical mitotic figure classification, using efficient fine-tuning.^[13]

Applications

Because DINOv3 backbones are designed to be used frozen, they have been adopted as drop-in feature extractors in several domains.^[2]^[3]^[4]

Dense vision tasks

DINOv3 is most distinctive on tasks where every patch in the input image must produce a useful prediction: semantic segmentation, depth estimation, object detection, and 3D keypoint matching. The released specialized heads (for ImageNet classification, depth on SYNTHMIX, COCO 2017 detection, and ADE20K segmentation) are intentionally lightweight wrappers around the frozen backbone, illustrating the intended usage pattern: train a small task head on top of DINOv3 features rather than fine-tuning the full backbone.^[4]^[3]
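The frozen-backbone pattern can be illustrated with a toy example in which only a small linear head is trained. The feature values below are made-up stand-ins for patch embeddings, the head is a plain binary logistic regression, and none of this reflects the actual heads shipped in the repository.

```python
import math

def train_linear_head(features, labels, epochs=200, lr=0.5):
    """Fit a binary logistic-regression head on fixed (frozen) features.

    Only the head parameters w and b are updated; the features never change,
    mirroring a small head trained on top of a frozen backbone.
    """
    dim, n = len(features[0]), len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of class 1
            for k in range(dim):
                grad_w[k] += (p - y) * x[k] / n
            grad_b += (p - y) / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    """Return the predicted class (0 or 1) for one feature vector."""
    return int(sum(wi * xi for wi, xi in zip(w, x)) + b > 0)

# Hypothetical frozen patch features for two classes.
frozen_features = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
patch_labels = [0, 0, 1, 1]
w, b = train_linear_head(frozen_features, patch_labels)
predictions = [predict(w, b, x) for x in frozen_features]
```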

Remote sensing and geospatial analysis

The SAT-493M-trained variants enable canopy height mapping, land cover classification, and related earth-observation tasks at scale. Meta highlights a collaboration with the World Resources Institute on canopy height estimation in Kenya, in which the canopy-height error decreased from approximately 4.1 meters using DINOv2 to approximately 1.2 meters using DINOv3. A Canopy Height Maps v2 (CHMv2) head ships in the official repository.^[2]^[4]

Medical imaging

Several independent groups have evaluated DINOv3 as a frozen backbone for medical imaging. Huo et al. published a 2D/3D benchmark on classification, segmentation, and registration, asking whether DINOv3 sets a new medical-imaging standard; results are mixed across modalities but generally favorable for 2D radiology tasks.^[12] DINOv3 was also used in the winning submission of the MIDOG 2025 Task 2 competition on atypical mitotic figure classification, where the authors emphasize efficient fine-tuning of DINOv3 pretrained on natural images for a histopathology task.^[13]

Robotics and visuomotor learning

DINOv3 features have been combined with diffusion-policy controllers for manipulation. The DINOv3-Diffusion Policy paper proposes using DINOv3 visual features as input to a diffusion-based policy for visuomotor control, taking advantage of the strong dense features for object-aware grasping and manipulation.^[14]

Image retrieval, clustering, and zero-shot tasks

The CLS token and patch-level embeddings are widely usable for retrieval and clustering via linear-probe or k-NN protocols, and the dino.txt variant adds a contrastively trained vision-language head that supports zero-shot classification and retrieval through text queries.^[4]^[3]
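A k-NN protocol over frozen embeddings needs no training at all, as the following sketch shows. The embeddings and labels are toy values, the function names are hypothetical, and cosine similarity is assumed as the metric.

```python
import math
from collections import Counter

def cosine_similarity(a, b, eps=1e-12):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / max(norm_a * norm_b, eps)

def knn_retrieve(query, gallery, k=3):
    """Indices of the k gallery embeddings most similar to the query."""
    scored = sorted(((cosine_similarity(query, emb), idx)
                     for idx, emb in enumerate(gallery)), reverse=True)
    return [idx for _, idx in scored[:k]]

def knn_classify(query, gallery, labels, k=3):
    """Majority vote over the labels of the k nearest gallery embeddings."""
    votes = Counter(labels[idx] for idx in knn_retrieve(query, gallery, k))
    return votes.most_common(1)[0][0]

gallery = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = ["cat", "cat", "dog", "dog"]
query = [1.0, 0.05]
nearest = knn_retrieve(query, gallery, k=2)
predicted = knn_classify(query, gallery, labels, k=3)
```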

Developer ecosystem

For developers, DINOv3 is integrated into the Hugging Face Transformers library (v4.56.0 and later) as DINOv3ViTModel and DINOv3ConvNextModel, with matching configuration classes (DINOv3ViTConfig, DINOv3ConvNextConfig) and a dedicated image processor (DINOv3ViTImageProcessor). A standard usage pattern is to load the model with AutoModel.from_pretrained("facebook/dinov3-vits16-pretrain-lvd1689m") and pool the last hidden state by splitting it into a CLS token, register tokens, and patch tokens.^[11]

The model card documents explicit handling of register tokens: the default ViT-S/16 configuration ships with 4 register tokens, which the integration layer separates from CLS and patch tokens at inference time so that downstream code can use whichever subset is appropriate (CLS for retrieval and classification, patch tokens for dense tasks, register tokens for diagnostic visualization).^[11] The Transformers documentation also includes a torchao quantization example reducing the 7B variant to int4 weights for inference on smaller hardware.^[11]
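The token split can be sketched on a plain Python sequence. This assumes the token order is CLS first, then register tokens, then patch tokens, which should be checked against the model documentation; the function name is hypothetical, and with a batched tensor the same slices would be taken along the token dimension.

```python
def split_tokens(hidden_states, num_register_tokens=4):
    """Split one image's token sequence into CLS, register, and patch tokens.

    Assumes the order [CLS, registers..., patches...].
    """
    cls_token = hidden_states[0]
    register_tokens = hidden_states[1:1 + num_register_tokens]
    patch_tokens = hidden_states[1 + num_register_tokens:]
    return cls_token, register_tokens, patch_tokens

# Stand-in for a 224x224 input at patch size 16: 1 CLS + 4 registers + 196 patches.
fake_hidden_states = [[float(i)] for i in range(201)]
cls_token, register_tokens, patch_tokens = split_tokens(fake_hidden_states)
```

Downstream code would then use `cls_token` for retrieval or classification and reshape `patch_tokens` into a 14 x 14 grid for dense tasks.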

A corresponding integration ships in the timm library (PyTorch Image Models, v1.0.20 and later) under DINOv3 model entries, allowing the same checkpoints to be loaded by users who prefer the timm ecosystem.^[4] The official facebookresearch/dinov3 repository on GitHub provides reference training code, evaluation scripts, multi-distillation training support for ConvNeXt backbones, and pre-built task heads for classification, depth, detection, and segmentation, requiring PyTorch 2.7.1 or later.^[4]

Limitations and criticisms

Several limitations are acknowledged in the paper or pointed out by external reviewers:

Not a uniform win over weakly-supervised models. On linear ImageNet classification, DINOv3 underperforms SigLIP 2 (89.1%) and Perception Encoder (89.3%), although it surpasses both on most dense-prediction tasks.^[6]^[10] DINOv3's advantages are clearest on segmentation, depth, and matching, less so on classification benchmarks dominated by language-aligned models.
Training complexity. The recipe requires a two-phase schedule (constant-learning-rate pretraining followed by Gram-anchored refinement), a multi-student distillation pipeline, and a high-resolution adaptation stage, all of which add engineering overhead compared with the simpler DINOv2 recipe.^[6]^[3]
Data provenance. The bulk of LVD-1689M is sourced from public Instagram images; this is consistent with Meta's prior practice but has been a recurring concern with large web-scale self-supervised learning models in general, both in terms of biases inherited from social-media distributions and in terms of consent and copyright. The paper provides limited detail on filtering for sensitive content.^[1]^[3]
License. The release uses a custom "DINOv3 License" rather than a standard open-source license; while it permits commercial use, the precise restrictions differ from Apache or MIT and need to be reviewed by adopters.^[4]
Compute opacity. The paper does not disclose total training compute in GPU-days, which makes external estimates of cost and carbon impact difficult.^[10]
Open questions on dense features. Dense-feature behavior at very large scales remains a research question: Gram anchoring repairs the symptom (degrading patch features) but the underlying cause is not fully characterized in the paper, and the technique is itself a fix introduced part-way through training rather than a property of the base objective.^[1]^[3]

Comparison with related models

Model	Year	Parameters	Training data	Supervision	Primary strength
DINO	2021	up to 85M (ViT-B)	ImageNet	Self-supervised	Emergent segmentation maps
DINOv2	2023	up to 1.1B (ViT-g)	LVD-142M (~142M)	Self-supervised	Frozen features for many tasks
DINOv3	2025	up to 6.7B (ViT-7B)	LVD-1689M (~1.7B)	Self-supervised	Dense features at scale
CLIP	2021	up to ~0.4B	~400M image-text pairs	Weakly supervised (text)	Zero-shot classification
SigLIP 2	2025	varied	image-text pairs	Weakly supervised (text)	High-accuracy classification
Segment Anything (SAM)	2023	up to 636M	SA-1B	Promptable segmentation	Universal segmentation masks
Masked autoencoder (MAE)	2021	ViT-L/H	ImageNet	Self-supervised (reconstructive)	Pretraining for fine-tuning
SimCLR	2020	ResNet-50 to ResNet-200	ImageNet	Self-supervised (contrastive)	Early SSL baseline
V-JEPA 2	2025	varied	video	Self-supervised (JEPA)	Video understanding

DINOv3 differs from CLIP and SigLIP 2 by being fully self-supervised; it does not require image-text pairs. It differs from Segment Anything by being a representation model rather than a promptable segmenter. It differs from masked autoencoders (MAE) by using cross-view distillation rather than pixel-level reconstruction, and from earlier contrastive baselines like SimCLR by replacing batch-level positive/negative contrast with teacher-student distribution matching. Finally, it differs from V-JEPA and V-JEPA 2 by operating on still images with a view-matching objective rather than on video with a joint-embedding predictive objective.^[9]^[1]^[7]

Compared with supervised pretraining and weakly-supervised pretraining, DINOv3 is most distinctive in not requiring any external annotations or paired text. This makes it attractive for domains where labels or text descriptions are expensive or unavailable, such as overhead imagery and certain medical modalities.^[2]^[12]

Significance

DINOv3's importance to the field rests on three claims that researchers and practitioners are now testing:^[1]^[3]^[6]

That self-supervised learning for vision can be scaled to a regime where it outperforms strong weakly-supervised foundation models on the dense tasks that matter for most downstream vision applications.
That a single frozen backbone, used without fine-tuning, can serve as a universal feature extractor across classification, segmentation, depth, detection, retrieval, and geospatial analysis.
That techniques like Gram anchoring can stabilize long-schedule training of vision transformers in a way analogous to how various regularizers stabilized long training of large language models (such as the LLaMA family).

These claims echo, in the vision domain, some of the trajectory of language model scaling: bigger model, bigger curated data, simpler schedule, and frozen reuse downstream. Whether DINOv3 marks a true inflection or a one-step refinement of the DINO line will become clearer as independent benchmarks accumulate.

The release also has a more concrete operational significance for practitioners. Where the DINOv2 generation made self-supervised backbones plausibly competitive with CLIP for many tasks, DINOv3 explicitly targets the dense-prediction regime that has historically been the weak point of self-supervised vision. The shipped specialized heads, the satellite-specific backbones, and the standardized integration into the Hugging Face Transformers and timm libraries lower the barrier to using a frozen DINOv3 backbone in production, in a way that earlier research releases did not.^[4]^[11]

References

1. Siméoni, Oriane et al., "DINOv3", arXiv preprint 2508.10104, 2025-08-13. https://arxiv.org/abs/2508.10104. Accessed 2026-05-20.
2. Meta AI, "DINOv3: Self-supervised learning for vision at unprecedented scale", Meta AI Blog, 2025-08-14. https://ai.meta.com/blog/dinov3-self-supervised-vision-model/. Accessed 2026-05-20.
3. Lukyanenko, Andrey, "Paper Review: DINOv3", andlukyane.com, 2025. https://andlukyane.com/blog/paper-review-dinov3. Accessed 2026-05-20.
4. Meta AI / FAIR, "facebookresearch/dinov3 repository", GitHub, 2025. https://github.com/facebookresearch/dinov3. Accessed 2026-05-20.
5. Meta AI, "DINOv3", AI at Meta product page, 2025. https://ai.meta.com/dinov3/. Accessed 2026-05-20.
6. The Batch / DeepLearning.AI, "Meta's DINOv3 Gets an Updated Loss Term and Improved Vision Performance", deeplearning.ai, 2025-08-27. https://www.deeplearning.ai/the-batch/metas-dinov3-gets-an-updated-loss-term-and-improved-vision-performance/. Accessed 2026-05-20.
7. Caron, Mathilde et al., "Emerging Properties in Self-Supervised Vision Transformers", arXiv preprint 2104.14294, 2021-04-29. https://arxiv.org/abs/2104.14294. Accessed 2026-05-20.
8. Oquab, Maxime et al., "DINOv2: Learning Robust Visual Features without Supervision", arXiv preprint 2304.07193, 2023-04-14. https://arxiv.org/abs/2304.07193. Accessed 2026-05-20.
9. Meta AI Research, "DINOv3", Meta AI Research Publications, 2025-08-14. https://ai.meta.com/research/publications/dinov3/. Accessed 2026-05-20.
10. Lightly AI, "DINOv3 Explained: Technical Deep Dive", lightly.ai blog, 2025. https://www.lightly.ai/blog/dinov3. Accessed 2026-05-20.
11. Hugging Face, "DINOv3 model documentation", Transformers docs, 2025. https://huggingface.co/docs/transformers/main/en/model_doc/dinov3. Accessed 2026-05-20.
12. Huo, Yuankai et al., "Does DINOv3 Set a New Medical Vision Standard? Benchmarking 2D and 3D Classification, Segmentation, and Registration", arXiv preprint 2509.06467, 2025-09. https://arxiv.org/abs/2509.06467. Accessed 2026-05-20.
13. Authors of the MIDOG 2025 Task 2 winning submission, "Efficient Fine-Tuning of DINOv3 Pretrained on Natural Images for Atypical Mitotic Figure Classification", arXiv preprint 2508.21041, 2025-08. https://arxiv.org/abs/2508.21041. Accessed 2026-05-20.
14. Authors of DINOv3-Diffusion Policy, "DINOv3-Diffusion Policy: Self-Supervised Large Visual Model for Visuomotor Diffusion Policy Learning", arXiv preprint 2509.17684, 2025-09. https://arxiv.org/abs/2509.17684. Accessed 2026-05-20.
