DINOv3
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,077 words
DINOv3 is a family of self-supervised computer vision foundation models released by Meta AI in August 2025. It is the third major iteration in the DINO (self-distillation with no labels) line, following the original DINO (2021) and DINOv2 (2023), and is trained on roughly 1.7 billion images without any human annotations.[^1][^2] DINOv3 introduces Gram anchoring, a regularization technique that prevents the degradation of dense patch-level features during long, large-scale training, and it scales the DINO recipe to a 6.7-billion-parameter Vision Transformer backbone.[^1][^3] The released model suite spans Vision Transformers from 21 million to 6.7 billion parameters and a set of ConvNeXt backbones, all distilled from the same 7B teacher and released through GitHub and Hugging Face under a license that permits commercial use.[^4][^5]
DINOv3 is positioned as a frozen backbone: Meta reports that across roughly sixty benchmarks and fifteen vision tasks, a single DINOv3 backbone without fine-tuning matches or exceeds specialized state-of-the-art systems and previously published self- and weakly-supervised models such as SigLIP 2 and Perception Encoder on most dense-prediction tasks.[^2][^6] The paper, titled simply DINOv3, was posted to arXiv on 13 August 2025 as preprint 2508.10104, authored by Oriane Siméoni, Huy V. Vo, Maximilian Seitzer and colleagues at Meta AI.[^1]
| Field | Value |
|---|---|
| Developer | Meta AI (FAIR) |
| Initial release | 14 August 2025 |
| arXiv preprint | 2508.10104 |
| Largest model | ViT-7B/16 (6.7B parameters) |
| Pretraining dataset | LVD-1689M (~1.7B images) |
| Satellite dataset | SAT-493M (~493M images) |
| Architectures | Vision Transformer, ConvNeXt |
| License | DINOv3 License (commercial use permitted) |
| Framework | PyTorch >= 2.7.1 |
The DINO lineage began with the 2021 paper Emerging Properties in Self-Supervised Vision Transformers by Mathilde Caron and collaborators at FAIR, which showed that a Vision Transformer trained by self-distillation with no labels developed attention maps that segmented objects without supervision.[^7] The original DINO established the basic recipe: a student network is trained to match the output distribution of an exponentially averaged teacher network across different augmented crops of the same image, producing image-level features useful for downstream tasks. The acronym DINO stands for self-DIstillation with NO labels.[^7]
DINOv2, released in April 2023, scaled this idea to ViT-g/14 (around one billion parameters) and trained on a curated 142-million-image dataset called LVD-142M. It demonstrated that frozen self-supervised features could match or exceed CLIP-style weakly supervised features on many classification and dense prediction benchmarks, but its patch-level features still lagged behind specialized systems on segmentation and depth estimation, and the curated dataset capped further data scaling.[^8] DINOv3 was framed by the authors as the answer to two questions left by DINOv2: whether the self-supervised recipe could scale by another order of magnitude in both model size and data, and whether dense features could be stabilized when training is pushed to very long schedules.[^1][^3]
The release sits alongside other Meta foundation work in vision and embodied AI. V-JEPA and V-JEPA 2, also from FAIR, target video understanding through a different Joint Embedding Predictive Architecture objective championed by Yann LeCun, while DINOv3 remains an image-centric model trained with view-matching self-distillation. The two lines are research siblings rather than direct successors: DINO models target still-image representations through teacher-student matching across views, whereas JEPA models target prediction in a learned representation space, typically over time in video.[^9]
DINOv3 is also distinct from Meta's Segment Anything (SAM) family, which trains promptable segmentation models on large mask datasets, and from CLIP-style transfer learning approaches that align images with text. Where SAM produces masks given prompts and CLIP produces image-text similarity scores, DINOv3 produces a frozen visual backbone whose patch and CLS embeddings are reused unchanged across downstream tasks.[^2][^5]
The DINOv3 arXiv preprint (2508.10104) was posted on 13 August 2025, with the accompanying Meta AI blog post and product page going live the next day, on 14 August 2025.[^1][^2][^9] Code, model weights, and the DINOv3 License appeared on the facebookresearch/dinov3 GitHub repository the same week, and Hugging Face integration through dedicated DINOv3ViTModel and DINOv3ConvNextModel classes shipped in Transformers 4.56.0 shortly afterward.[^4][^11]
DINOv3 keeps the core teacher-student structure of DINO and DINOv2. The student network sees several augmented crops (global and local) of an input image, the teacher sees only the global crops, and the student is trained so that its output token distributions match the teacher's on shared content. The teacher is an exponential moving average of the student's weights.[^1][^7]
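The exponential-moving-average relationship between teacher and student can be sketched as follows. This is a minimal illustration, not the released training code: the parameter dictionaries, the function name `ema_update`, and the momentum value 0.99 are assumptions made for the example.

```python
import numpy as np

def ema_update(teacher, student, momentum):
    """Move each teacher parameter a small step toward the student (in place)."""
    for name, s_param in student.items():
        teacher[name] = momentum * teacher[name] + (1.0 - momentum) * s_param
    return teacher

# Toy "networks": one weight matrix each, stored in a dict (hypothetical layout).
student = {"w": np.ones((2, 2))}
teacher = {"w": np.zeros((2, 2))}

# One update with an assumed momentum of 0.99: the teacher moves 1% of the way.
ema_update(teacher, student, momentum=0.99)
print(teacher["w"])
```

The teacher receives no gradients; it changes only through this averaging, which is why it provides a slowly moving target for the student.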
During the main pretraining phase, the per-batch loss combines three terms:
$$\mathcal{L}_{\text{Pre}} = \mathcal{L}_{\text{DINO}} + \mathcal{L}_{\text{iBOT}} + 0.1 \cdot \mathcal{L}_{\text{DKoleo}}$$
where the DINO loss compares CLS-token distributions between student and teacher on different crops, the iBOT loss adds masked patch prediction by reconstructing teacher latents at masked positions in the student input, and the Koleo regularizer (the $\mathcal{L}_{\text{DKoleo}}$ term) encourages diversity in the batch of CLS embeddings.[^1][^3] The combination is inherited largely intact from DINOv2; the substantive changes in DINOv3 are in scale, schedules, and the new Gram term added later in training.[^3]
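A simplified sketch of two of these terms and their weighted sum is given below. It makes several assumptions that are not taken from the source: the temperatures (0.1 and 0.04) are illustrative, teacher centering is omitted, the Koleo term is computed on a single batch rather than in a distributed fashion, and the iBOT term is passed in as a precomputed number. All function names are hypothetical.

```python
import numpy as np

def softmax(logits, temperature):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dino_cls_loss(student_logits, teacher_logits, t_student=0.1, t_teacher=0.04):
    """Cross-entropy between teacher and student CLS distributions, batch-averaged."""
    p_teacher = softmax(teacher_logits, t_teacher)  # treated as a fixed target
    z = student_logits / t_student
    z = z - z.max(axis=-1, keepdims=True)
    log_p_student = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return float(-(p_teacher * log_p_student).sum(axis=-1).mean())

def koleo_loss(embeddings, eps=1e-8):
    """Koleo regularizer: minus the mean log distance to each point's nearest neighbour."""
    x = embeddings / np.linalg.norm(embeddings, axis=-1, keepdims=True)
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)  # a point is not its own neighbour
    return float(-np.log(d.min(axis=1) + eps).mean())

def pretraining_loss(l_dino, l_ibot, l_koleo, koleo_weight=0.1):
    """Weighted sum matching the pretraining formula above."""
    return l_dino + l_ibot + koleo_weight * l_koleo

# Well-spread embeddings are penalized less than near-duplicate ones.
spread = np.eye(4)
clustered = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]])
print(koleo_loss(spread), koleo_loss(clustered))
```

The Koleo value is lower for the spread-out batch, which is the sense in which the regularizer encourages diversity among CLS embeddings.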
A practical problem the DINOv3 authors highlight is that when DINOv2-style training is extended to billions of images and millions of iterations, the global classification features keep improving while the dense patch-level features get noisier and visibly degrade. Standard tricks like adjusting the loss weights, changing the iBOT mask ratio, or shrinking the model did not fully solve the problem.[^1][^3] The symptom is most visible in patch-feature similarity visualizations: in DINOv2-style runs taken to very long schedules, the patches that should be close in feature space (for example, all patches covering a single object) drift apart, and patch attention maps lose the crisp object alignment that made DINO interesting in the first place.[^3][^10]
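The kind of patch-similarity map referred to above can be computed in a few lines. The sketch below assumes patch features are available as a `(num_patches, dim)` array laid out row by row over the patch grid; the function name and the toy features are hypothetical.

```python
import numpy as np

def patch_similarity_map(patch_features, grid_hw, ref_index):
    """Cosine similarity between one reference patch and every patch, as a 2D map."""
    h, w = grid_hw
    x = patch_features / np.linalg.norm(patch_features, axis=-1, keepdims=True)
    sims = x @ x[ref_index]  # shape (h * w,)
    return sims.reshape(h, w)

# Toy 2x2 grid: patches 0 and 1 lie on an "object", patches 2 and 3 on "background".
feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
sim = patch_similarity_map(feats, (2, 2), ref_index=0)
print(sim)
```

With clean dense features, patches on the same object score close to 1 against the reference and background patches score near 0; the degradation described above shows up as these maps becoming noisy.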
The fix introduced in DINOv3 is Gram anchoring. After about one million iterations, a frozen "Gram teacher" snapshot is taken from an earlier point in training where the dense features are still clean. During the refinement phase, the student is regularized so that the Gram matrix of its patch features (the matrix of pairwise dot products between patch embeddings) stays close to the Gram matrix produced by the Gram teacher:[^1][^3]
$$\mathcal{L}_{\text{Gram}} = \left\| X_S X_S^\top - X_G X_G^\top \right\|_F^2$$
This constrains the relational structure of the patch features (which patches are similar to which) without pinning the absolute values, so individual patch embeddings remain free to evolve. The refinement loss is then:[^1]
$$\mathcal{L}_{\text{Ref}} = w_D \mathcal{L}_{\text{DINO}} + \mathcal{L}_{\text{iBOT}} + w_{DK} \mathcal{L}_{\text{DKoleo}} + w_{\text{Gram}} \mathcal{L}_{\text{Gram}}$$
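The Gram term can be illustrated with a short sketch. It assumes each row of $X_S$ and $X_G$ is one patch embedding and that rows are L2-normalized before the Gram matrices are formed; the normalization step and the function name are assumptions of this example. The demonstration also shows why the loss constrains only relational structure: rotating every patch feature by the same orthogonal matrix leaves the Gram matrix, and hence the loss, unchanged.

```python
import numpy as np

def gram_loss(student_patches, gram_teacher_patches):
    """Squared Frobenius distance between the two patch Gram matrices."""
    xs = student_patches / np.linalg.norm(student_patches, axis=-1, keepdims=True)
    xg = gram_teacher_patches / np.linalg.norm(gram_teacher_patches, axis=-1, keepdims=True)
    diff = xs @ xs.T - xg @ xg.T  # (num_patches, num_patches)
    return float((diff ** 2).sum())

rng = np.random.default_rng(0)
x_g = rng.normal(size=(6, 4))  # toy Gram-teacher features: 6 patches, dim 4

# Same pairwise similarities, different absolute values: rotate by an orthogonal matrix.
q, _ = np.linalg.qr(rng.normal(size=(4, 4)))
x_rotated = x_g @ q

# Unrelated features: pairwise similarities differ from the Gram teacher's.
x_other = rng.normal(size=(6, 4))

print(gram_loss(x_rotated, x_g), gram_loss(x_other, x_g))
```

The first value is zero up to floating-point error, while the second is clearly positive, so only changes to which patches resemble which are penalized.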
A subtle aspect of Gram anchoring is that the Gram teacher is a different network from the EMA teacher used for DINO and iBOT. The EMA teacher tracks the current student to provide moving targets for view-matching; the Gram teacher is fixed at an earlier checkpoint chosen for the cleanliness of its dense features.[^1][^3] In effect, Gram anchoring is closer to a soft anchor against a known-good representation than to a standard contrastive or distillation loss against a current target.
Independent analyses describe the effect of Gram anchoring as repairing degraded local features so that visualized patch features become sharper and more object-aligned, with measurable gains on dense-task benchmarks such as ADE20K semantic segmentation and depth estimation.[^3][^10] Meta's blog post characterizes the technique as the central methodological innovation of DINOv3 relative to DINOv2.[^2]
The flagship DINOv3 ViT-7B/16 has 6,716 million parameters, an embedding dimension of 4096, and a patch size of 16x16.[^4][^11] Compared with the DINOv2 ViT-g backbone, two architectural changes are notable. First, the model replaces additive positional embeddings with rotary position embedding (RoPE) in two dimensions, augmented with random rescaling and jittering of the position coordinates during training to make the model robust across resolutions. This allows the same backbone to handle inputs from low resolution up to 4096x4096 pixels at inference time.[^1][^3][^11] Second, the released open-source variants include four register tokens, learnable embeddings that absorb high-norm artifacts and yield cleaner attention maps and stronger dense predictions; this practice was introduced in earlier FAIR work on "vision transformers need registers" and is now standard in DINOv3 checkpoints.[^11]
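The sequence lengths implied by these numbers can be worked out directly. The sketch below assumes one CLS token, four register tokens, and non-overlapping 16x16 patches; requiring the input size to be a multiple of the patch size is a simplification made for the example, and the function name is hypothetical.

```python
def dinov3_token_count(height, width, patch_size=16, num_register_tokens=4):
    """Sequence length = 1 CLS token + register tokens + one token per patch."""
    if height % patch_size or width % patch_size:
        raise ValueError("height and width must be multiples of the patch size")
    grid_h, grid_w = height // patch_size, width // patch_size
    return 1 + num_register_tokens + grid_h * grid_w

for side in (224, 512, 4096):
    print(side, dinov3_token_count(side, side))
# 224 201
# 512 1029
# 4096 65541
```

A 4096x4096 input therefore produces a 256x256 patch grid, so attention at that resolution operates over roughly 65 thousand tokens.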
DINOv3's main dataset is called LVD-1689M and contains roughly 1.689 billion images.[^1][^4] The authors construct it from a pool of about 17 billion public Instagram images using three complementary strategies:
The relative scale-up over DINOv2 is roughly 12x in data (142M to 1.689B images) and 6x in model parameters (1.1B to 6.7B), with the same general teacher-student structure but more sophisticated data balancing.[^2][^6]
For geospatial applications, the team additionally trains models on SAT-493M, a satellite imagery dataset of about 493 million tiles, partially drawn from Maxar imagery.[^4][^2] Two backbones (ViT-L/16 and ViT-7B/16) are pretrained on SAT-493M alongside the LVD-1689M variants, allowing geospatial users to start from a backbone whose pretraining distribution already matches overhead imagery.[^4]
A distinctive feature of DINOv3 is the use of a constant learning-rate, weight-decay, and EMA-momentum schedule for the main pretraining phase, after a short linear warmup. The authors report that removing cosine schedules allowed them to extend training arbitrarily without retuning, which is important when neither the optimal number of iterations nor the eventual model use cases are known in advance.[^1][^3] In effect, the schedule choice trades the sharper near-end-of-training boost typical of cosine schedules for the ability to "stop anywhere", which matters when training runs span weeks and downstream needs are still emerging.[^3]
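The difference between the two schedule shapes can be sketched as follows. The base learning rate and warmup length used here are placeholders, not values from the paper, and both function names are hypothetical.

```python
import math

def constant_with_warmup(step, base_lr, warmup_steps):
    """Linear warmup, then a flat value; no total step count is needed."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr

def cosine_with_warmup(step, base_lr, warmup_steps, total_steps):
    """Linear warmup, then cosine decay to zero at a pre-committed total_steps."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))

# Placeholder hyperparameters, for illustration only.
base_lr, warmup = 1e-3, 1000
for step in (0, 999, 500_000, 1_000_000):
    print(step,
          constant_with_warmup(step, base_lr, warmup),
          cosine_with_warmup(step, base_lr, warmup, total_steps=1_000_000))
```

The cosine schedule takes `total_steps` as an argument, so extending the run means redefining the whole curve; the constant schedule has no such argument, which is what allows training to stop or continue at any point.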
The main run uses a batch size of 4096 images split across 256 GPUs for roughly one million iterations.[^3] Total compute is not officially reported in GPU-days; secondary technical summaries note the absence of these figures, which makes external estimates of cost and environmental impact difficult.[^10] The authors note that the multi-student distillation pipeline was designed to amortize teacher inference across multiple student trainings, reducing the marginal cost of producing each additional distilled checkpoint in the released family.[^1][^3]
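As a rough worked figure derived only from the numbers quoted above (an illustration, not a reported statistic), and assuming exactly $10^6$ iterations with the batch size counted in images rather than crops:

$$\frac{4096 \text{ images}}{256 \text{ GPUs}} = 16 \text{ images per GPU}, \qquad 4096 \times 10^{6} \approx 4.1 \times 10^{9} \text{ image samples}$$

Set against the roughly $1.689 \times 10^{9}$ images in LVD-1689M, this corresponds to about 2.4 times the dataset size, although the actual sampling mix may not visit every image equally often.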
After the main run, DINOv3 applies three post-hoc strategies:[^1][^3]
| Backbone | Parameters | Pretraining dataset | Notes |
|---|---|---|---|
| ViT-S/16 | 21M | LVD-1689M | Distilled from ViT-7B |
| ViT-S+/16 | 29M | LVD-1689M | Distilled, custom width |
| ViT-B/16 | 86M | LVD-1689M | Distilled |
| ViT-L/16 | 300M | LVD-1689M | Distilled |
| ViT-H+/16 | 840M | LVD-1689M | Distilled, custom width |
| ViT-7B/16 | 6,716M | LVD-1689M | Teacher, trained from scratch |
| ConvNeXt-T | 29M | LVD-1689M | Distilled |
| ConvNeXt-S | 50M | LVD-1689M | Distilled |
| ConvNeXt-B | 89M | LVD-1689M | Distilled |
| ConvNeXt-L | 198M | LVD-1689M | Distilled |
| ViT-L/16 (SAT) | 300M | SAT-493M | Satellite imagery |
| ViT-7B/16 (SAT) | 6,716M | SAT-493M | Satellite imagery |
In addition to the bare backbones, the GitHub release ships specialized heads for ImageNet classification, depth estimation on a SYNTHMIX dataset, object detection on COCO 2017, semantic segmentation on ADE20K, dino.txt zero-shot classification, and a Canopy Height Maps v2 head (CHMv2) for geospatial use.[^4]
DINOv3 is evaluated extensively without fine-tuning the backbone. Selected reported numbers, all using frozen features, include:
| Task | Dataset | Metric | DINOv3 | DINOv2 (reference) |
|---|---|---|---|---|
| Image classification | ImageNet-1k linear | Top-1 accuracy | 88.4% | 87.3% |
| Semantic segmentation | PASCAL VOC | mean IoU | 86.6 | 83.1 |
| Semantic segmentation | ADE20K | mIoU | 63.0 | ~57 |
| Object detection | COCO 2017 | mAP | 66.1 | lower |
| Depth estimation | NYU-Depth v2 | RMSE | 0.281 | higher |
Sources: paper-reported numbers and independent technical summaries.[^3][^6][^10]
Meta reports that on linear ImageNet classification DINOv3 matches strong weakly-supervised baselines but does not surpass them outright; SigLIP 2 and Perception Encoder (PECore) are reported at 89.1% and 89.3% respectively against DINOv3's 88.4%.[^6] On dense prediction tasks (segmentation, depth, 3D keypoint matching), DINOv3 is reported to outperform both the weakly-supervised baselines and previous self-supervised models.[^2][^6] An external application reported by Meta, with the World Resources Institute on canopy height estimation in Kenya, decreased the canopy height error from approximately 4.1 meters using DINOv2 to approximately 1.2 meters using DINOv3.[^2]
A growing set of independent benchmarks evaluates DINOv3 on medical imaging. Huo et al. (2025) report a comprehensive 2D/3D benchmark of DINOv3 on classification, segmentation, and registration tasks, asking whether it sets a new medical-imaging standard; results are mixed across modalities but generally favorable for 2D radiology tasks when DINOv3 is used as a frozen backbone.[^12] A DINOv3-based submission also won the MIDOG 2025 Task 2 competition on atypical mitotic figure classification, using efficient fine-tuning.[^13]
Because DINOv3 backbones are designed to be used frozen, they have been adopted as drop-in feature extractors in several domains.[^2][^3][^4]
DINOv3 is most distinctive on tasks where every patch in the input image must produce a useful prediction: semantic segmentation, depth estimation, object detection, and 3D keypoint matching. The released specialized heads (for ImageNet classification, depth on SYNTHMIX, COCO 2017 detection, and ADE20K segmentation) are intentionally lightweight wrappers around the frozen backbone, illustrating the intended usage pattern: train a small task head on top of DINOv3 features rather than fine-tuning the full backbone.[^4][^3]
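The usage pattern of a small head on frozen features can be illustrated with a per-patch linear classifier. This is a deliberately minimal sketch and not one of the released heads: the function name, the toy weights, and the two-class setup are assumptions, and in practice only `weight` and `bias` would be trained while the backbone stays fixed.

```python
import numpy as np

def linear_segmentation_head(patch_tokens, weight, bias, grid_hw):
    """Apply one linear layer to each frozen patch token and return a class map."""
    h, w = grid_hw
    logits = patch_tokens @ weight + bias  # (h * w, num_classes)
    return logits.argmax(axis=-1).reshape(h, w)

# Stand-in for frozen backbone output: a 2x2 patch grid with 2-dimensional features.
patch_tokens = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])

# Toy head parameters (hypothetical): class 0 reads feature 0, class 1 reads feature 1.
weight = np.eye(2)
bias = np.zeros(2)

class_map = linear_segmentation_head(patch_tokens, weight, bias, (2, 2))
print(class_map)
# [[0 0]
#  [1 1]]
```

Each entry of the class map corresponds to one 16x16 patch, so a full-resolution prediction would additionally require upsampling the map to the image size.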
The SAT-493M-trained variants enable canopy height mapping, land cover classification, and related earth-observation tasks at scale. Meta highlights a collaboration with the World Resources Institute on canopy height estimation in Kenya, in which the canopy-height error decreased from approximately 4.1 meters using DINOv2 to approximately 1.2 meters using DINOv3. A Canopy Height Maps v2 (CHMv2) head ships in the official repository.[^2][^4]
Several independent groups have evaluated DINOv3 as a frozen backbone for medical imaging. Huo et al. published a 2D/3D benchmark on classification, segmentation, and registration, asking whether DINOv3 sets a new medical-imaging standard; results are mixed across modalities but generally favorable for 2D radiology tasks.[^12] DINOv3 was also used in the winning submission of the MIDOG 2025 Task 2 competition on atypical mitotic figure classification, where the authors emphasize efficient fine-tuning of DINOv3 pretrained on natural images for a histopathology task.[^13]
DINOv3 features have been combined with diffusion-policy controllers for manipulation. The DINOv3-Diffusion Policy paper proposes using DINOv3 visual features as input to a diffusion-based policy for visuomotor control, taking advantage of the strong dense features for object-aware grasping and manipulation.[^14]
The CLS token and patch-level embeddings are widely usable for retrieval and clustering via linear-probe or k-NN protocols, and the dino.txt variant adds a contrastively trained vision-language head that supports zero-shot classification and retrieval through text queries.[^4][^3]
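A k-NN protocol on frozen CLS embeddings can be sketched as below. The example uses plain majority voting over cosine similarity, which is a simplification of the weighted voting often used in published k-NN evaluations; the function name, labels, and toy embeddings are hypothetical.

```python
from collections import Counter

import numpy as np

def knn_predict(query, bank, bank_labels, k=3):
    """Label a query embedding by majority vote among its k most similar bank entries."""
    q = query / np.linalg.norm(query)
    b = bank / np.linalg.norm(bank, axis=-1, keepdims=True)
    sims = b @ q                      # cosine similarity to every bank entry
    top = np.argsort(-sims)[:k]       # indices of the k nearest neighbours
    votes = Counter(bank_labels[i] for i in top)
    return votes.most_common(1)[0][0]

# Stand-in for CLS embeddings of labelled reference images.
bank = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9]])
labels = ["cat", "cat", "cat", "dog", "dog"]

print(knn_predict(np.array([1.0, 0.05]), bank, labels))  # cat
print(knn_predict(np.array([0.0, 1.0]), bank, labels))   # dog
```

The same similarity ranking, without the voting step, serves as a basic image-retrieval routine.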
For developers, DINOv3 is integrated into the Hugging Face Transformers library (v4.56.0 and later) as DINOv3ViTModel and DINOv3ConvNextModel, with matching configuration classes (DINOv3ViTConfig, DINOv3ConvNextConfig) and a dedicated image processor (DINOv3ViTImageProcessor). A standard usage pattern is to load the model with AutoModel.from_pretrained("facebook/dinov3-vits16-pretrain-lvd1689m") and pool the last hidden state by splitting it into a CLS token, register tokens, and patch tokens.[^11]
The model card documents explicit handling of register tokens: the default ViT-S/16 configuration ships with 4 register tokens, which the integration layer separates from CLS and patch tokens at inference time so that downstream code can use whichever subset is appropriate (CLS for retrieval and classification, patch tokens for dense tasks, register tokens for diagnostic visualization).[^11] The Transformers documentation also includes a torchao quantization example reducing the 7B variant to int4 weights for inference on smaller hardware.[^11]
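The token split described above can be sketched as follows. The ordering assumed here (CLS first, then register tokens, then patch tokens) follows the description in the text, and the `extract_features` helper is an untested usage sketch that requires `torch`, a recent `transformers`, and access to the checkpoint; both function names are hypothetical, and the width 384 is used only as a stand-in for a ViT-S-sized output.

```python
import numpy as np

def split_tokens(last_hidden_state, num_register_tokens=4):
    """Split (batch, 1 + R + N, dim) into CLS, register, and patch tokens."""
    cls_token = last_hidden_state[:, 0]
    register_tokens = last_hidden_state[:, 1:1 + num_register_tokens]
    patch_tokens = last_hidden_state[:, 1 + num_register_tokens:]
    return cls_token, register_tokens, patch_tokens

def extract_features(image, name="facebook/dinov3-vits16-pretrain-lvd1689m"):
    """Hedged usage sketch: load the checkpoint and split its output tokens."""
    import torch
    from transformers import AutoImageProcessor, AutoModel

    processor = AutoImageProcessor.from_pretrained(name)
    model = AutoModel.from_pretrained(name).eval()
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state
    return split_tokens(hidden, model.config.num_register_tokens)

# Shape check on a stand-in array: a 224x224 input with patch size 16 gives 196 patches.
dummy = np.zeros((1, 1 + 4 + 196, 384))
cls_token, registers, patches = split_tokens(dummy)
print(cls_token.shape, registers.shape, patches.shape)
# (1, 384) (1, 4, 384) (1, 196, 384)
```

For dense tasks the patch tokens can be reshaped to a 14x14 grid in this example; the CLS token is the usual choice for retrieval and classification.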
A corresponding integration ships in the timm library (PyTorch Image Models, v1.0.20 and later) under DINOv3 model entries, allowing the same checkpoints to be loaded by users who prefer the timm ecosystem.[^4] The official facebookresearch/dinov3 repository on GitHub provides reference training code, evaluation scripts, multi-distillation training support for ConvNeXt backbones, and pre-built task heads for classification, depth, detection, and segmentation, requiring PyTorch 2.7.1 or later.[^4]
Several limitations are acknowledged in the paper or pointed out by external reviewers:
| Model | Year | Parameters | Training data | Supervision | Primary strength |
|---|---|---|---|---|---|
| DINO | 2021 | up to 85M (ViT-B) | ImageNet | Self-supervised | Emergent segmentation maps |
| DINOv2 | 2023 | up to 1.1B (ViT-g) | LVD-142M (~142M) | Self-supervised | Frozen features for many tasks |
| DINOv3 | 2025 | up to 6.7B (ViT-7B) | LVD-1689M (~1.7B) | Self-supervised | Dense features at scale |
| CLIP | 2021 | up to ~0.4B | ~400M image-text pairs | Weakly supervised (text) | Zero-shot classification |
| SigLIP 2 | 2025 | varied | image-text pairs | Weakly supervised (text) | High-accuracy classification |
| Segment Anything (SAM) | 2023 | up to 636M | SA-1B | Promptable segmentation | Universal segmentation masks |
| Masked autoencoder (MAE) | 2021 | ViT-L/H | ImageNet | Self-supervised (reconstructive) | Pretraining for fine-tuning |
| SimCLR | 2020 | ResNet-50 to ResNet-200 | ImageNet | Self-supervised (contrastive) | Early SSL baseline |
| V-JEPA 2 | 2025 | varied | video | Self-supervised (JEPA) | Video understanding |
DINOv3 differs from CLIP and SigLIP 2 by being fully self-supervised; it does not require image-text pairs. It differs from Segment Anything by being a representation model rather than a promptable segmenter. It differs from masked autoencoder (MAE) by using cross-view distillation rather than pixel-level reconstruction, and from earlier contrastive baselines like SimCLR by replacing batch-level positive/negative contrast with teacher-student distribution matching. And it differs from V-JEPA and V-JEPA 2 by operating on still images with a view-matching objective rather than on video with a joint-embedding predictive objective.[^9][^1][^7]
Compared with supervised pretraining and weakly-supervised pretraining, DINOv3 is most distinctive in not requiring any external annotations or paired text. This makes it attractive for domains where labels or text descriptions are expensive or unavailable, such as overhead imagery and certain medical modalities.[^2][^12]
DINOv3's importance to the field rests on three claims that researchers and practitioners are now testing:[^1][^3][^6]
These claims echo, in the vision domain, some of the trajectory of language model scaling: bigger model, bigger curated data, simpler schedule, and frozen reuse downstream. Whether DINOv3 marks a true inflection or a one-step refinement of the DINO line will become clearer as independent benchmarks accumulate.
The release also has a more concrete operational significance for practitioners. Where the DINOv2 generation made self-supervised backbones plausibly competitive with CLIP for many tasks, DINOv3 explicitly targets the dense-prediction regime that has historically been the weak point of self-supervised vision. The shipped specialized heads, the satellite-specific backbones, and the standardized integration into the Hugging Face Transformers and timm libraries lower the barrier to using a frozen DINOv3 backbone in production, in a way that earlier research releases did not.[^4][^11]