Computer Vision Models
Last reviewed
May 13, 2026
Sources
70 citations
Review status
Source-backed
Revision
v2 · 7,997 words
See also: Computer Vision and Models
Computer vision models are machine learning systems that take pixels as input and produce structured outputs such as class labels, bounding boxes, segmentation masks, depth maps, keypoints, or text descriptions. The field has gone through three rough eras since 2012. The first was the convolutional neural network era, kicked off when AlexNet won the ImageNet challenge by a wide margin and convinced the field that deep networks trained on GPUs could beat hand-engineered features. The second was the transformer era, which began when ViT showed that a plain transformer applied to image patches could match or beat ConvNets given enough data. The third, ongoing, is the foundation-model era, in which a small number of large self-supervised or vision-language backbones (DINOv2, DINOv3, CLIP, SigLIP, SAM, SAM 2) are reused across detection, segmentation, depth, pose, and video tasks instead of training a separate architecture for each one.
This article surveys backbone architectures, task-specific models for detection, segmentation, depth, pose, and video, the self-supervised methods that pretrain modern backbones, and the foundation models that increasingly serve as universal vision encoders. It is meant as a road map, not as a substitute for the original papers; each model has a dedicated reference at the end.
A computer vision model is a function that maps an image (or video, or 3D scene representation) to some task-specific output. The same backbone, a ResNet-50 or a ViT-Base or a Hiera, can be repurposed for many tasks by swapping out the head and fine-tuning. The standard tasks are:
| Task | Output | Canonical benchmark |
|---|---|---|
| Image classification | One label per image | ImageNet (1.28 M images, 1,000 classes) |
| Object detection | Bounding boxes plus class labels | COCO detection (118 K images, 80 classes) |
| Semantic segmentation | A class label for every pixel | ADE20K, Cityscapes |
| Instance segmentation | A mask per object instance | COCO segmentation |
| Panoptic segmentation | Pixel labels for both stuff and things, with instance IDs | COCO panoptic |
| Open-vocabulary detection | Boxes for arbitrary text queries | LVIS, ODinW |
| Monocular depth estimation | A per-pixel depth value | NYU Depth v2, KITTI |
| Pose estimation | A set of keypoints per person | COCO keypoints, MPII |
| Action recognition | A class label per video clip | Kinetics-400, Something-Something V2 |
| Novel view synthesis | A rendered image from a new viewpoint | NeRF synthetic, Mip-NeRF 360 |
| Image-text retrieval / zero-shot classification | Similarity scores | ImageNet zero-shot, MSCOCO retrieval |
The organizing question for this article is how the models in each row of that table evolved from hand-designed pipelines to learned representations and, more recently, to foundation models that handle several rows at once.
A backbone is the part of the network that turns the image into a stack of feature maps. Detection heads, segmentation heads, and depth heads all sit on top of a backbone. The history of computer vision is largely the history of better backbones.
The modern story starts with LeNet-5 (1998), a seven-layer convolutional network by Yann LeCun and collaborators that read handwritten zip codes for the US Postal Service. LeNet-5 introduced the canonical convolution, pooling, fully-connected pattern, but on hardware of the time it could not scale to large natural-image datasets.
AlexNet (2012) by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton cut the ImageNet top-5 error from 26.2 percent to 15.3 percent, beating the next entry by more than ten points and ending the era of hand-crafted features. AlexNet had eight learned layers, ReLU activations, dropout, local response normalization, and was trained on two GTX 580 GPUs for about a week. It is now widely treated as the start of the deep learning era in vision.
VGG (Karen Simonyan and Andrew Zisserman, 2014) showed that deeper but simpler networks built from 3x3 convolutions could match or beat AlexNet. VGG-16 and VGG-19 became standard feature extractors for years.
GoogLeNet (Christian Szegedy and colleagues at Google, 2014) introduced the Inception module, which applied parallel 1x1, 3x3, and 5x5 convolutions inside the same block and used 1x1 convolutions for dimensionality reduction. The original Inception network had 22 layers and won ILSVRC 2014. The Inception family went through several iterations (Inception-v2, v3, v4) and merged with the residual idea in Inception-ResNet.
ResNet (Kaiming He and colleagues at Microsoft Research Asia, 2015) was the breakthrough that made very deep networks trainable. By adding identity shortcut connections that let gradients flow around blocks, ResNet trained 152-layer networks that achieved 3.57 percent ImageNet top-5 error and won ILSVRC 2015 classification, detection, and localization, plus COCO detection and segmentation. ResNet remains the most widely used CNN backbone in research, and its residual block is a building component in almost every later network, including transformers.
DenseNet (Gao Huang and colleagues, 2016) replaced ResNet's additive shortcuts with concatenation: each layer received feature maps from all preceding layers in a block. This improved gradient flow and parameter efficiency at the cost of higher memory use.
MobileNet (Andrew Howard and colleagues at Google, 2017) targeted mobile and embedded devices. It replaced standard convolutions with depthwise separable convolutions, which decompose a kernel into a per-channel spatial filter followed by a pointwise mixing step, reducing FLOPs by roughly a factor of eight at comparable accuracy. MobileNetV2 (2018) added inverted residuals and linear bottlenecks; MobileNetV3 (2019) combined neural architecture search with hand-designed components and squeeze-and-excitation blocks.
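The "roughly a factor of eight" saving can be checked with simple arithmetic. The sketch below counts multiply-adds for one layer under assumed conditions (stride 1, same padding, no bias); the layer shape is a hypothetical example, not a configuration taken from the MobileNet paper.

```python
def standard_conv_mults(h, w, c_in, c_out, k):
    """Multiply-adds for a standard k x k convolution (stride 1, same padding)."""
    return h * w * c_in * c_out * k * k


def depthwise_separable_mults(h, w, c_in, c_out, k):
    """Multiply-adds for a depthwise k x k filter followed by a 1x1 pointwise mix."""
    depthwise = h * w * c_in * k * k   # one spatial filter per input channel
    pointwise = h * w * c_in * c_out   # 1x1 convolution mixes channels
    return depthwise + pointwise


# Hypothetical layer: 14x14 feature map, 512 -> 512 channels, 3x3 kernel.
std = standard_conv_mults(14, 14, 512, 512, 3)
sep = depthwise_separable_mults(14, 14, 512, 512, 3)
ratio = std / sep
print(round(ratio, 2))  # 8.84
```

The ratio simplifies to 1 / (1/c_out + 1/k^2), so for 3x3 kernels it approaches nine as the channel count grows.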
EfficientNet (Mingxing Tan and Quoc Le at Google, 2019) introduced compound scaling: depth, width, and resolution were scaled together using a single compound coefficient. EfficientNet-B7 reached 84.4 percent top-1 ImageNet accuracy while being 8.4 times smaller and 6.1 times faster on inference than the best earlier ConvNet.
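As a rough illustration of compound scaling, the sketch below raises three base multipliers to a shared coefficient phi. The constants 1.2, 1.1, and 1.15 are assumed here as the grid-searched values reported for the B0 baseline; the released B1 to B7 models use hand-rounded settings, so this is not a recipe for reproducing them.

```python
# Assumed base multipliers for depth, width, and resolution.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_scale(phi):
    """Return (depth, width, resolution, flops) multipliers for coefficient phi."""
    depth = ALPHA ** phi
    width = BETA ** phi
    resolution = GAMMA ** phi
    # FLOPs grow linearly with depth and quadratically with width and resolution.
    flops = depth * width ** 2 * resolution ** 2
    return depth, width, resolution, flops


for phi in range(4):
    d, w, r, f = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}, FLOPs x{f:.2f}")
```

Because ALPHA * BETA^2 * GAMMA^2 is close to 2, each unit increase in phi roughly doubles the compute budget.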
RegNet (Ilija Radosavovic and colleagues at Meta, 2020) treated network design as a search over populations of architectures and produced simple, regular designs with strong scaling laws. RegNet variants are competitive with EfficientNet at a wider range of compute budgets.
ConvNeXt (Zhuang Liu and colleagues at Meta and UC Berkeley, 2022) was a deliberate attempt to modernize a pure ConvNet using ideas borrowed from vision transformers (larger kernels, depthwise convolutions, layer normalization, GELU activations, fewer activation functions per block). A ConvNeXt-XL reached 87.8 percent ImageNet top-1 accuracy at ImageNet-22K pretraining scale, matching or beating Swin Transformers and proving that the CNN versus transformer gap was largely about training recipes rather than architecture.
ConvNeXt V2 (Sanghyun Woo and colleagues, 2023) co-designed ConvNeXt with masked autoencoder pretraining. The authors introduced Global Response Normalization to fix a feature-collapse issue that arose when combining standard ConvNeXt with MAE, and the resulting models were the first pure ConvNets to outperform similarly sized ViTs across detection and segmentation under self-supervised pretraining.
Vision Transformer (ViT) by Alexey Dosovitskiy and colleagues at Google (paper at ICLR 2021, arXiv October 2020) showed that a standard transformer encoder, fed sequences of 16x16 image patches treated as tokens, could match the best ConvNets on ImageNet given enough pretraining data. ViT-Huge pretrained on JFT-300M reached 88.55 percent ImageNet top-1, lifting the ceiling that ConvNets had defined.
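The patch-to-token step can be sketched in a few lines. This is a minimal illustration using nested Python lists rather than tensors; the function name `patchify` is hypothetical, and a real ViT would follow this step with a learned linear projection, a class token, and position embeddings.

```python
def patchify(image, patch):
    """Split an H x W x C nested-list image into flattened patch tokens."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Flatten one patch in row-major order, channels last.
            token = [
                value
                for row in image[top:top + patch]
                for pixel in row[left:left + patch]
                for value in pixel
            ]
            tokens.append(token)
    return tokens


# A blank 224 x 224 RGB image cut into 16 x 16 patches.
image = [[[0.0, 0.0, 0.0] for _ in range(224)] for _ in range(224)]
tokens = patchify(image, 16)
print(len(tokens), len(tokens[0]))  # 196 768
```

A 224-pixel image therefore becomes a sequence of 196 tokens, each holding 16 x 16 x 3 = 768 raw values.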
DeiT (Hugo Touvron and colleagues at Meta, December 2020) made ViT practical on ImageNet-1K alone by using a distillation token and strong augmentation. DeiT trained a ViT-Base to 81.8 percent top-1 in 53 hours on eight GPUs without external data, removing the dependency on JFT-scale pretraining that had limited ViT adoption.
Swin Transformer (Ze Liu and colleagues at Microsoft Research Asia, 2021) introduced shifted window attention, which limited self-attention to non-overlapping local windows shifted between layers, recovering ConvNet-style hierarchy and linear complexity in image size. Swin won the ICCV 2021 Marr Prize for best paper and became a standard transformer backbone for detection and segmentation. Swin V2 (2022) scaled the design up to three billion parameters at native 1,536-pixel resolution.
MViT (Multiscale Vision Transformer, Haoqi Fan and colleagues at Meta, 2021) built a multiscale feature hierarchy with pooling attention, and MaxViT (Zhengzhong Tu and colleagues at Google, 2022) interleaved block-local and grid attention; both obtained hierarchical features without sliding windows. Hiera (Chaitanya Ryali and colleagues at Meta, 2023) went the opposite direction and stripped a hierarchical transformer down to a plain ViT plus pooling, relying on masked autoencoder pretraining to provide the spatial inductive bias that specialized modules had previously supplied. Hiera was reported to be 30 to 40 percent faster than the best hierarchical models at each scale and slightly more accurate, and it became the backbone for SAM 2.
EVA (Yuxin Fang and colleagues at BAAI, 2022) and EVA-02 (2023) scaled masked image modeling to billion-parameter ViTs by reconstructing CLIP features rather than raw pixels, hitting strong numbers on ImageNet, COCO, and LVIS at a fraction of the data scale of comparable supervised models.
The table below summarizes the major backbone families. "Top-1" refers to ImageNet-1K validation accuracy at a representative model size.
| Model | Year | Author / Lab | Type | Notable result |
|---|---|---|---|---|
| LeNet-5 | 1998 | LeCun et al., AT&T | CNN | First widely deployed ConvNet for digit recognition |
| AlexNet | 2012 | Krizhevsky, Sutskever, Hinton, Toronto | CNN | Won ILSVRC 2012, 15.3 percent top-5 error |
| VGG-16 | 2014 | Simonyan, Zisserman, Oxford VGG | CNN | 92.7 percent top-5 on ImageNet, simple stacked 3x3 design |
| GoogLeNet (Inception v1) | 2014 | Szegedy et al., Google | CNN | Won ILSVRC 2014, Inception module |
| ResNet-152 | 2015 | He et al., Microsoft Research Asia | CNN | 3.57 percent top-5, won ILSVRC 2015 |
| DenseNet-201 | 2016 | Huang et al., Cornell, Tsinghua | CNN | Dense connectivity, parameter efficient |
| MobileNet | 2017 | Howard et al., Google | CNN | Depthwise separable convolutions for mobile |
| EfficientNet-B7 | 2019 | Tan, Le, Google | CNN | 84.4 percent top-1, compound scaling |
| RegNet | 2020 | Radosavovic et al., Meta | CNN | Population-based architecture design |
| Vision Transformer (ViT) | 2020 | Dosovitskiy et al., Google | Transformer | 88.55 percent top-1 with JFT-300M pretraining |
| DeiT | 2020 | Touvron et al., Meta | Transformer | ImageNet-1K-only ViT, distillation token |
| Swin Transformer | 2021 | Liu et al., MSRA | Hierarchical transformer | ICCV 2021 best paper, shifted windows |
| MViT | 2021 | Fan et al., Meta | Multiscale transformer | Strong on Kinetics, ImageNet |
| ConvNeXt | 2022 | Liu et al., Meta, UC Berkeley | Modernized CNN | 87.8 percent top-1 at 22K, matches Swin |
| MaxViT | 2022 | Tu et al., Google | Block-local + grid attention | Linear complexity, strong on detection |
| Hiera | 2023 | Ryali et al., Meta | Hierarchical transformer | 30 to 40 percent faster than prior hierarchical models, MAE pretraining |
| EVA-02 | 2023 | Fang et al., BAAI | ViT, MIM with CLIP target | Billion-parameter scale |
| ConvNeXt V2 | 2023 | Woo et al., Meta, KAIST | CNN + MAE | First pure ConvNet co-designed with MAE |
Object detection means localizing and classifying every instance of a set of categories in an image. The field went through three architectural phases: two-stage detectors built around the R-CNN idea, one-stage anchor-based detectors led by SSD, RetinaNet, and the YOLO family, and end-to-end transformer detectors starting with DETR.
R-CNN (Ross Girshick and colleagues at UC Berkeley, 2014) was the first detector to plug deep CNN features into a region-proposal pipeline. It used selective search to produce around 2,000 candidate regions per image and then ran a CNN (AlexNet) on each crop. R-CNN was slow (47 seconds per image) but raised PASCAL VOC mean average precision from 33.7 to 53.7. Fast R-CNN (Girshick, 2015) ran the CNN feature extractor over the whole image once and pooled per-region features from the shared feature map, cutting per-image inference to about 0.3 seconds. Faster R-CNN (Shaoqing Ren and colleagues, 2015) replaced selective search with a learned Region Proposal Network that shared features with the detection head, making the detector trainable end to end at near real-time speeds. Mask R-CNN (He, Gkioxari, Dollár, Girshick at Meta, 2017) added a parallel mask head and a more precise RoIAlign operation, becoming the dominant instance-segmentation framework for several years.
Single Shot MultiBox Detector (SSD) by Wei Liu and colleagues (ECCV 2016) eliminated the proposal stage entirely. It predicted class scores and box offsets at every position of a multi-scale feature pyramid, achieving 74.3 mAP on VOC 2007 at 59 FPS. RetinaNet (Tsung-Yi Lin and colleagues at Meta, 2017) addressed the class-imbalance problem that hurt early one-stage detectors by introducing focal loss, which down-weighted easy negatives so the loss came mostly from hard examples. RetinaNet matched two-stage accuracy at one-stage speed.
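The effect of focal loss on easy negatives is easiest to see numerically. The sketch below implements the binary form FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t) for a single prediction, assuming the commonly cited defaults gamma = 2 and alpha = 0.25; the clamping constant `eps` is an added safeguard, not part of the formula.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-12):
    """Binary focal loss for one prediction.

    p is the predicted probability of the positive class, y is 1 or 0.
    """
    p = min(max(p, eps), 1.0 - eps)            # avoid log(0)
    p_t = p if y == 1 else 1.0 - p             # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy_negative = focal_loss(0.01, 0)  # background, confidently correct
hard_negative = focal_loss(0.90, 0)  # background, confidently wrong
print(f"{easy_negative:.2e} {hard_negative:.3f}")
```

With gamma set to 0 the modulating factor disappears and the expression reduces to alpha-weighted cross-entropy, which is a convenient sanity check.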
The YOLO family (Joseph Redmon and Ali Farhadi, 2016 through 2018, then continued by Alexey Bochkovskiy and the Ultralytics team) is the most widely deployed line of detectors. YOLO v1 (2016) regressed boxes and classes directly from a grid over the image; YOLOv2 (2016) added anchor boxes and batch normalization; YOLOv3 (2018) introduced a multi-scale prediction head and the Darknet-53 backbone. YOLOv4 (2020) and YOLOv5 (2020, Ultralytics) integrated many small training tricks such as mosaic augmentation and CIoU loss, and later versions of the family moved to anchor-free detection. YOLOv6 (Meituan, 2022), YOLOv7 (Wang et al., 2022), YOLOv8 (Ultralytics, 2023), and YOLO11 (Ultralytics) added attention modules and architectural refinements. YOLO11, released September 30, 2024, retained the SPPF block for multi-scale features, replaced C2f with the more compact C3k2 block, and added a C2PSA spatial attention module; it shipped at five sizes from Nano to X and supports detection, segmentation, pose, classification, and tracking.
DETR (Nicolas Carion and colleagues at Meta, ECCV 2020) reframed object detection as a direct set prediction problem. A CNN backbone produced features, a transformer encoder-decoder used a small set of learned object queries to attend to the features, and bipartite matching trained the model to produce one query per ground-truth box. DETR removed non-maximum suppression and anchor generation, becoming the conceptual basis for almost all subsequent transformer detectors. The original DETR was slow to converge (500 epochs to reach Faster R-CNN parity); Deformable DETR (Xizhou Zhu and colleagues, 2020) used deformable attention to cut training time by an order of magnitude.
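A toy version of the bipartite matching step is sketched below. It is a simplification under stated assumptions: the cost is just an L1 box distance minus the predicted probability of the target class (the published cost also includes a generalized IoU term), and the search is exhaustive, which only works for a handful of queries. Practical implementations use a Hungarian solver such as SciPy's `linear_sum_assignment`. All names and box values are hypothetical.

```python
from itertools import permutations


def match_cost(pred, target):
    """Simplified matching cost: L1 box distance minus target-class probability."""
    l1 = sum(abs(a - b) for a, b in zip(pred["box"], target["box"]))
    return l1 - pred["probs"][target["label"]]


def bipartite_match(preds, targets):
    """Return (query_index, target_index) pairs with minimal total cost.

    Exhaustive search over assignments; assumes len(targets) <= len(preds).
    """
    best_cost, best_assignment = float("inf"), ()
    for query_ids in permutations(range(len(preds)), len(targets)):
        cost = sum(match_cost(preds[q], targets[t]) for t, q in enumerate(query_ids))
        if cost < best_cost:
            best_cost, best_assignment = cost, query_ids
    return sorted((q, t) for t, q in enumerate(best_assignment))


# Three object queries, two ground-truth boxes (cx, cy, w, h in [0, 1]).
preds = [
    {"box": [0.5, 0.5, 0.2, 0.2], "probs": [0.1, 0.9]},
    {"box": [0.1, 0.1, 0.1, 0.1], "probs": [0.8, 0.2]},
    {"box": [0.9, 0.9, 0.1, 0.1], "probs": [0.5, 0.5]},
]
targets = [
    {"box": [0.1, 0.1, 0.1, 0.1], "label": 0},
    {"box": [0.5, 0.5, 0.25, 0.2], "label": 1},
]
matches = bipartite_match(preds, targets)
print(matches)  # [(0, 1), (1, 0)]
```

Query 2 is left unmatched, which in a DETR-style loss means it would be trained to predict the "no object" class.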
DINO (Hao Zhang and colleagues, 2022, a detection framework, not the self-supervised method) added contrastive denoising training and mixed query selection on top of Deformable DETR, reaching state of the art on COCO. Co-DETR (Zhuofan Zong and colleagues, 2023) trained DETR jointly with auxiliary one-to-many supervision and pushed COCO box AP above 64.
Grounding DINO (Shilong Liu and colleagues, March 2023) married the DINO detection framework with grounded language pretraining, producing a model that could detect arbitrary objects from free-form text prompts. It reached 52.5 AP on zero-shot COCO and set a new record on the ODinW open-world benchmark.
OWL-ViT (Matthias Minderer and colleagues at Google, 2022) and OWLv2 (NeurIPS 2023) used CLIP-style vision-language pretraining together with detection fine-tuning to do open-vocabulary detection. OWLv2 scaled to one billion training examples using OWL-ViT v1 itself as a pseudo-labeler, lifting LVIS rare-class AP from 31.2 to 44.6.
| Model | Year | Lab | Class | Notable detail |
|---|---|---|---|---|
| R-CNN | 2014 | Girshick et al., UC Berkeley | Two-stage | Selective search plus CNN features |
| Fast R-CNN | 2015 | Girshick, Microsoft Research | Two-stage | RoI pooling on shared feature map |
| Faster R-CNN | 2015 | Ren et al., MSRA | Two-stage | Region Proposal Network |
| Mask R-CNN | 2017 | He et al., Meta | Two-stage | RoIAlign, mask head |
| SSD | 2016 | Liu et al. | One-stage | Multi-scale feature pyramid |
| RetinaNet | 2017 | Lin et al., Meta | One-stage | Focal loss |
| YOLO v1 | 2016 | Redmon et al., Washington | One-stage | Grid-cell direct regression |
| YOLOv5 | 2020 | Ultralytics | One-stage | First widely deployed Ultralytics release |
| YOLOv8 | 2023 | Ultralytics | Anchor-free | Multi-task: detection, segmentation, pose, classification |
| YOLO11 | Sept 2024 | Ultralytics | Anchor-free | C3k2 blocks, C2PSA attention, five sizes |
| DETR | 2020 | Carion et al., Meta | Transformer | Direct set prediction, bipartite matching |
| Deformable DETR | 2020 | Zhu et al., SenseTime | Transformer | Deformable attention, faster convergence |
| DINO (detection) | 2022 | Zhang et al. | Transformer | Contrastive denoising, mixed queries |
| Co-DETR | 2023 | Zong et al. | Transformer | Collaborative hybrid assignment, 64+ AP on COCO |
| Grounding DINO | March 2023 | Liu et al., IDEA Research | Open-vocab | Free-form text queries, 52.5 zero-shot COCO AP |
| OWLv2 | 2023 | Minderer et al., Google DeepMind | Open-vocab | Scaled to a billion pseudo-labeled examples |
Segmentation assigns labels at pixel granularity. The three flavors are semantic (same label for every pixel of a class), instance (each object instance gets its own ID), and panoptic (combines stuff and things). For decades, segmentation was a custom problem with its own architectures; since 2023, SAM and SAM 2 have made promptable, class-agnostic segmentation a foundation-model service.
Fully Convolutional Networks (Jonathan Long, Evan Shelhamer, Trevor Darrell, 2014) extended classification CNNs to dense prediction by replacing the final fully-connected layers with convolutions and upsampling. FCN-8s set the first deep-learning baseline on PASCAL VOC.
U-Net (Olaf Ronneberger, Philipp Fischer, Thomas Brox at the University of Freiburg, MICCAI 2015) was designed for biomedical microscopy where labeled training data was scarce. The architecture had a contracting encoder, a symmetric expanding decoder, and skip connections at every scale, and it could be trained end-to-end from very few images. U-Net won the 2015 ISBI cell-tracking challenge and remains the default architecture in medical imaging. Almost every recent diffusion-based image generator uses a U-Net (or a U-Net plus transformer) as its noise-prediction network, an unintended second life for the design.
DeepLab (Liang-Chieh Chen and colleagues at Google, 2015 through 2018) used atrous (dilated) convolutions to enlarge the receptive field without losing resolution. DeepLab v3+ (2018) added an encoder-decoder structure with atrous spatial pyramid pooling and remained a standard semantic-segmentation baseline. PSPNet (Hengshuang Zhao and colleagues, 2017) introduced the pyramid pooling module for multi-scale context aggregation.
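The idea behind atrous convolution can be shown in one dimension. In this minimal sketch (function names are illustrative, and it is not DeepLab code), the same three-tap kernel covers a wider span of the input as the dilation rate grows, while the number of weights stays fixed.

```python
def receptive_field(kernel_size, dilation):
    """Input span covered by one output of a dilated kernel."""
    return (kernel_size - 1) * dilation + 1


def dilated_conv1d(signal, kernel, dilation=1):
    """Valid 1D cross-correlation with `dilation` - 1 skipped samples between taps."""
    span = receptive_field(len(kernel), dilation)
    return [
        sum(weight * signal[start + i * dilation] for i, weight in enumerate(kernel))
        for start in range(len(signal) - span + 1)
    ]


signal = list(range(10))
kernel = [1, 1, 1]
dense = dilated_conv1d(signal, kernel, dilation=1)   # sums samples 0, 1, 2 first
atrous = dilated_conv1d(signal, kernel, dilation=2)  # sums samples 0, 2, 4 first
print(dense[0], atrous[0], receptive_field(3, 2))  # 3 6 5
```

In a 2D network the same trick lets later layers see more context without pooling the feature map down to a lower resolution.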
SegFormer (Enze Xie and colleagues at NVIDIA, NeurIPS 2021) paired a lightweight hierarchical transformer encoder with an MLP-only decoder, producing strong results on ADE20K and Cityscapes at a fraction of the parameters of earlier transformer segmenters.
MaskFormer (Bowen Cheng and colleagues at Meta, NeurIPS 2021) reframed segmentation as mask classification: instead of predicting a class per pixel, the model predicted a set of binary masks each tagged with a class label. Mask2Former (Cheng et al., CVPR 2022) added masked attention that restricted cross-attention to predicted mask regions, becoming a universal architecture for semantic, instance, and panoptic segmentation. Mask2Former set state of the art at 57.8 panoptic quality on COCO and 57.7 mIoU on ADE20K. OneFormer (Jitesh Jain and colleagues, 2022) extended the idea further with a single model trained jointly on all three segmentation tasks.
Segment Anything Model (SAM) by Alexander Kirillov and colleagues at Meta AI (April 5, 2023) introduced a promptable segmentation model and a one-billion-mask dataset called SA-1B, collected with a model-in-the-loop pipeline on eleven million licensed images. SAM took points, boxes, or masks as prompts and emitted segmentation masks for arbitrary objects; free-form text prompts were explored only as a proof of concept. The model used a ViT-H image encoder (with MAE pretraining), a lightweight prompt encoder, and a transformer mask decoder. SAM was the first vision foundation model that could generalize to unseen object categories purely from prompts, and it became the de facto labeling tool for downstream segmentation work.
SAM 2 (Nikhila Ravi and colleagues at Meta, July 29, 2024) extended the design to video. It added a memory module that propagated mask predictions across frames and was trained on a new dataset called SA-V with 51,000 videos and over 600,000 masklets. On image segmentation, SAM 2 was reported to be more accurate than the original SAM while running roughly six times faster, and it could segment objects in video at real-time speeds. Its image encoder is a Hiera backbone, which provides multi-scale features without the overhead of windowed attention.
A growing ecosystem extends SAM. MobileSAM (Chaoning Zhang and colleagues, 2023) replaced the heavy image encoder with a TinyViT, cutting inference cost by an order of magnitude. EfficientSAM (Yunyang Xiong and colleagues at Meta, 2023) distilled SAM into a smaller ViT and recovered most of the accuracy. HQ-SAM (Lei Ke and colleagues, NeurIPS 2023) added a high-quality output token for finer mask boundaries.
| Model | Year | Lab | Task | Notable detail |
|---|---|---|---|---|
| FCN | 2014 | Long, Shelhamer, Darrell, Berkeley | Semantic | First end-to-end fully convolutional segmenter |
| U-Net | 2015 | Ronneberger et al., Freiburg | Semantic, biomedical | Encoder-decoder with skip connections |
| DeepLab v3+ | 2018 | Chen et al., Google | Semantic | Atrous convolutions, ASPP |
| PSPNet | 2017 | Zhao et al., CUHK | Semantic | Pyramid pooling module |
| SegFormer | 2021 | Xie et al., NVIDIA | Semantic | Hierarchical transformer, MLP decoder |
| MaskFormer | 2021 | Cheng et al., Meta | Universal | Mask classification framing |
| Mask2Former | 2022 | Cheng et al., Meta | Universal | Masked attention, 57.8 PQ on COCO |
| OneFormer | 2022 | Jain et al., SHI Labs | Universal | Single model for semantic, instance, panoptic |
| SAM | April 2023 | Kirillov et al., Meta | Promptable | SA-1B (1 B masks), ViT-H encoder |
| HQ-SAM | 2023 | Ke et al., ETH Zurich | Promptable | High-quality output token |
| MobileSAM | 2023 | Zhang et al. | Promptable | TinyViT encoder, ~60x smaller |
| EfficientSAM | 2023 | Xiong et al., Meta | Promptable | SAMI distillation |
| SAM 2 | July 2024 | Ravi et al., Meta | Promptable, video | Hiera encoder, memory module, SA-V dataset |
Human pose estimation predicts 2D or 3D keypoints (joints, body landmarks) for one or more people in an image or video. The two main paradigms are top-down (run a person detector, then a single-person pose model on each crop) and bottom-up (predict all keypoints in the image and group them into people).
OpenPose (Zhe Cao and colleagues at Carnegie Mellon, 2017) was the first real-time multi-person 2D pose system. It predicted part confidence maps and part affinity fields and grouped keypoints into individuals using a learned association. OpenPose handled body, hand, and face keypoints in a single forward pass.
AlphaPose (Hao-Shu Fang and colleagues, 2017, then continuously updated) used a top-down pipeline with a symmetric spatial transformer and Pose-NMS. MMPose (OpenMMLab, 2020) is not a model but a popular open-source library that bundles dozens of pose models in a uniform training and evaluation framework.
HRNet (Ke Sun and colleagues at Microsoft Research Asia and the University of Science and Technology of China, CVPR 2019) maintained high-resolution feature maps throughout the network by running parallel multi-resolution streams with repeated multi-scale fusion. HRNet became the standard backbone for keypoint detection and continues to be used in modern pose pipelines.
ViTPose (Yufei Xu and colleagues, 2022) showed that a plain ViT with a simple deconvolution decoder could outperform previous COCO keypoint results, with its largest models reaching about 81 AP on COCO test-dev. Sapiens (Rawal Khirodkar and colleagues at Meta, ECCV 2024) trained a family of human-centric ViTs on 300 million in-the-wild human images using masked autoencoder pretraining at native 1,024-pixel resolution. The Sapiens models scaled from 0.3 to 2 billion parameters and unified pose estimation, body-part segmentation, depth estimation, and surface-normal prediction in one backbone. Sapiens improved over the prior state of the art on Humans-5K pose by 7.6 mAP, on Humans-2K part segmentation by 17.1 mIoU, on Hi4D depth by 22.4 percent relative RMSE, and on THuman2 surface normals by 53.5 percent relative angular error.
For 3D pose, SMPL (Matt Loper and colleagues) and its successor SMPL-X (Georgios Pavlakos and colleagues) parameterize the human body as a mesh controlled by a low-dimensional set of pose and shape parameters, and models such as PIXIE, PyMAF-X, and SMPLer-X regress SMPL-X parameters from images.
Monocular depth estimation predicts a depth value for every pixel from a single RGB image. Stereo and multi-view methods have access to more geometric information, but the monocular case is the one that benefits most from learned priors.
MiDaS (Rene Ranftl and colleagues at Intel, 2020) trained a single model on a mixture of 10 depth datasets using a scale-and-shift-invariant loss, producing the first relative-depth model that generalized robustly to in-the-wild images. MiDaS variants (DPT-Hybrid, DPT-Large, MiDaS v3.1) added transformer backbones.
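The scale-and-shift invariance can be illustrated with a closed-form least-squares alignment. The sketch below is a simplified variant under stated assumptions: it aligns the prediction to the target with one scale and one shift, then reports mean absolute error. The published MiDaS loss includes further components (such as robust alignment and gradient terms) that are not reproduced here, and the function names are hypothetical.

```python
def align_scale_shift(pred, target):
    """Least-squares scale and shift that map pred onto target."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in pred)
    if var_p == 0:
        return 0.0, mean_t  # constant prediction: only a shift can help
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    scale = cov / var_p
    shift = mean_t - scale * mean_p
    return scale, shift


def ssi_error(pred, target):
    """Mean absolute error after optimal scale-and-shift alignment."""
    scale, shift = align_scale_shift(pred, target)
    return sum(abs(scale * p + shift - t) for p, t in zip(pred, target)) / len(pred)


target = [1.0, 2.0, 4.0, 8.0]
same_structure = [3.0 * t + 5.0 for t in target]  # differs only by scale and shift
wrong_structure = [8.0, 1.0, 4.0, 2.0]            # ordering of depths is wrong
print(ssi_error(same_structure, target) < 1e-9, ssi_error(wrong_structure, target) > 1.0)
```

A prediction that is correct up to an unknown scale and offset incurs essentially no penalty, which is what allows datasets with incompatible depth units to be mixed during training.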
Depth Anything (Lihe Yang, Bingyi Kang, and colleagues at HKU and TikTok, CVPR 2024) scaled monocular depth training to 62 million unlabeled images by combining a small labeled set with model-distilled pseudo-labels and a strong perturbation regime. Depth Anything V2 (Yang, Kang, and colleagues, NeurIPS 2024) replaced labeled real images with synthetic ones and used a much larger teacher, producing finer and more robust predictions, especially on transparent and reflective surfaces.
Marigold (Bingxin Ke and colleagues at ETH Zurich, December 2023, CVPR 2024 best paper candidate) repurposed a pretrained Stable Diffusion model for monocular depth estimation by fine-tuning the latent U-Net on depth latents. The result was a depth model that inherited Stable Diffusion's visual prior and could be trained on synthetic data alone, yet generalized to real images at state-of-the-art quality.
Depth Pro (Apple Machine Learning Research, October 2024) was the first monocular model to produce metric, not just relative, depth at high resolution in real time. It used a multi-scale ViT backbone and produced a 2.25-megapixel depth map in about 0.3 seconds on a single GPU, with sharp boundaries on fine structures such as hair and fur.
| Model | Year | Lab | Output type | Notable detail |
|---|---|---|---|---|
| MiDaS | 2020 | Ranftl et al., Intel | Relative | Multi-dataset training, scale-and-shift invariant loss |
| MiDaS v3.1 / DPT | 2022 | Ranftl et al., Intel | Relative | Dense Prediction Transformer backbone |
| Marigold | Dec 2023 | Ke et al., ETH Zurich | Affine-invariant | Diffusion-based, synthetic training data |
| Depth Anything | Jan 2024 | Yang, Kang et al. | Relative | 62 M unlabeled images, semi-supervised |
| Depth Anything V2 | June 2024 | Yang, Kang et al. | Relative | Synthetic-only labels, larger teacher |
| Depth Pro | Oct 2024 | Apple ML Research | Metric | 2.25 MP in 0.3 s, sharp boundaries |
Self-supervised learning lets a model learn representations from unlabeled images, removing the bottleneck of human annotation. Two broad families dominate vision: contrastive and predictive methods (SimCLR, MoCo, BYOL, DINO) and masked image modeling (BEiT, MAE, MaskFeat).
SimCLR (Ting Chen and colleagues at Google Brain, 2020) showed that contrastive learning with strong augmentations and a large batch size could match supervised ImageNet pretraining. The model pulled different augmented views of the same image together in feature space and pushed unrelated images apart.
MoCo (Kaiming He and colleagues at Meta, November 2019) introduced a momentum-updated encoder and a queue of negative samples, decoupling batch size from the number of negatives. MoCo v2 (2020) added the SimCLR augmentation recipe and an MLP projection head; MoCo v3 (2021) adapted MoCo to ViTs.
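The two mechanisms named above, the momentum-updated key encoder and the queue of negatives, are small enough to sketch directly. Parameters are represented as flat lists of floats for illustration; the momentum value 0.999 is assumed as a typical setting, and the tiny queue length is chosen only to make the behavior visible.

```python
from collections import deque


def momentum_update(key_params, query_params, m=0.999):
    """Move the key encoder slowly toward the query encoder (EMA update)."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


def make_queue(size):
    """Fixed-size FIFO of negative keys; the oldest entries are dropped first."""
    return deque(maxlen=size)


def enqueue(queue, keys):
    """Add the newest batch of encoded keys to the queue."""
    queue.extend(keys)


key_params = momentum_update([0.0, 1.0], [1.0, 1.0], m=0.9)
print([round(v, 3) for v in key_params])  # [0.1, 1.0]

queue = make_queue(4)          # real queues hold far more keys
enqueue(queue, ["k0", "k1", "k2"])
enqueue(queue, ["k3", "k4", "k5"])
print(list(queue))             # ['k2', 'k3', 'k4', 'k5']
```

Because the queue persists across iterations, the number of negatives is set by the queue length rather than by the batch size.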
BYOL (Jean-Bastien Grill and colleagues at DeepMind, 2020) showed that this style of self-supervised learning did not actually need negative samples. An online network with a predictor head was trained to match the output of a target network on a second augmented view of the same image, with the target's weights kept as an exponential moving average of the online weights, and this alone was sufficient to learn strong representations. BYOL freed self-supervised methods from the negative-sample bottleneck.
DINO (Mathilde Caron and colleagues at Meta, April 2021) applied a teacher-student self-distillation objective with a momentum teacher whose softmax outputs were centered and sharpened to avoid collapse, producing ViT features that contained explicit object-segmentation information without any label supervision. DINO with ViT-Base reached 80.1 percent top-1 in linear evaluation. DINOv2 (Maxime Oquab and colleagues at Meta, April 2023) scaled DINO to 142 million curated images and combined the DINO and iBOT objectives with the Sinkhorn-Knopp centering used in SwAV, producing all-purpose visual features that beat task-specific methods on classification, segmentation, depth, and instance retrieval without fine-tuning. DINOv3 (Meta AI, arXiv August 13, 2025) scaled to 1.7 billion images and a 7-billion-parameter ViT teacher, added a Gram-anchoring loss to prevent dense-feature collapse during long training, and was distilled into ViT-S, B, L, H+ and ConvNeXt variants. DINOv3 matched or beat SigLIP 2 and Perception Encoder on classification while widening the gap on dense prediction.
BEiT (Hangbo Bao and colleagues at Microsoft Research, 2021) adapted BERT-style masked token prediction to images by tokenizing images with a discrete VAE.
MAE (Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick at Meta, November 2021) showed that an asymmetric encoder-decoder, where the encoder saw only the visible 25 percent of patches and a lightweight decoder reconstructed the missing 75 percent in pixel space, could pretrain ViT-Huge to 87.8 percent top-1 on ImageNet-1K alone. MAE accelerated training by 3x or more and became the standard pretraining method for ViTs.
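The masking step itself is simple. The sketch below, with a hypothetical function name and a fixed seed for repeatability, shuffles patch indices and keeps the first quarter as the visible set passed to the encoder; the rest are the positions the decoder must reconstruct.

```python
import random


def random_masking(num_patches, mask_ratio=0.75, seed=None):
    """Split patch indices into (visible, masked) sets by random shuffling."""
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)
    num_keep = int(num_patches * (1.0 - mask_ratio))
    visible = sorted(order[:num_keep])   # fed to the encoder
    masked = sorted(order[num_keep:])    # reconstructed by the decoder
    return visible, masked


# 196 patches corresponds to a 224-pixel image with 16x16 patches.
visible, masked = random_masking(196, mask_ratio=0.75, seed=0)
print(len(visible), len(masked))  # 49 147
```

Since the encoder processes only 49 of 196 tokens, most of the compute saving comes from this step rather than from any change to the transformer itself.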
MaskFeat (Chen Wei and colleagues at Meta, 2021) replaced pixel reconstruction with HOG feature regression; CAE, MaskGIT, and a series of follow-ups explored other reconstruction targets.
I-JEPA (Mahmoud Assran and colleagues at Meta, including Yann LeCun, January 2023) is a non-generative self-supervised method. Instead of reconstructing pixels, it predicts the latent representations of target blocks from a context block in the same image. I-JEPA with ViT-Huge/14 trained on ImageNet in under 72 hours on 16 A100 GPUs and achieved strong downstream performance from linear classification to depth prediction. V-JEPA (2024) extended the design to video, and V-JEPA 2 (Meta, June 2025) scaled to a 1.2-billion-parameter ViT trained on over one million hours of internet video and one million images; it was then post-trained on robot data from the Droid dataset to support zero-shot robot control.
Some vision encoders are now used as off-the-shelf feature backbones across many tasks instead of being trained per problem. CLIP, SigLIP, DINOv2, DINOv3, and the Florence series are the main names.
CLIP (Alec Radford and colleagues at OpenAI, February 2021) trained an image encoder and a text encoder to align 400 million (image, caption) pairs from the web using a symmetric contrastive loss. The result was a zero-shot image classifier that could be prompted with text strings ("a photo of a dog", "a photo of a cat") and a strong image feature extractor for downstream tasks. CLIP started the era of vision-language models: its text encoder conditions many diffusion-based image generators, and its image encoder is a common vision backbone in VLMs.
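A minimal sketch of the symmetric contrastive loss and of zero-shot classification follows. It is written in plain Python for readability and is not OpenAI's code: the function names are illustrative, the embeddings are assumed to come from some image and text encoder, and the temperature is fixed at 0.07 here although CLIP learns it during training.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def clip_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; image i and text i form the positive pair."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # Cosine similarity matrix, scaled by the temperature.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    # Image-to-text: softmax over each row. Text-to-image: over each column.
    loss_i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    loss_t2i = -sum(log_softmax(columns[k])[k] for k in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)

def zero_shot_classify(image_emb, prompt_embs):
    """Return the index of the prompt embedding closest to the image."""
    img = normalize(image_emb)
    sims = [sum(a * b for a, b in zip(img, normalize(p))) for p in prompt_embs]
    return max(range(len(sims)), key=sims.__getitem__)
```

Zero-shot classification then amounts to embedding one prompt per class and picking the nearest one, with no classifier head trained on the target classes.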
Florence-2 (Bin Xiao and colleagues at Microsoft, November 2023) was a unified vision foundation model trained on FLD-5B, a dataset of 5.4 billion annotations across 126 million images. Florence-2 took text prompts as task instructions and produced text outputs for captioning, detection, grounding, and segmentation in one sequence-to-sequence framework. The 232 M and 771 M parameter variants were released under MIT license in mid-2024.
SigLIP (Xiaohua Zhai and colleagues at Google, 2023) replaced CLIP's softmax cross-entropy with a sigmoid loss applied per pair, which scaled better with batch size and let SigLIP outperform CLIP at comparable compute. SigLIP 2 (Google DeepMind, February 2025) added captioning-based pretraining, self-distillation and masked prediction losses, and online data curation; it covered 109 languages and shipped at four scales from ViT-B (86 M) to g (1 B parameters).
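The per-pair sigmoid loss can be sketched as follows. This is an illustration under assumptions rather than Google's implementation: the embeddings are assumed to be unit-normalized already, the function names are illustrative, and the scale `t` and bias `b` are fixed at plausible starting values although they are learned in practice.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def siglip_loss(image_embs, text_embs, t=10.0, b=-10.0):
    """Pairwise sigmoid loss: every (image, text) pair is an independent
    binary decision, positive on the diagonal and negative elsewhere."""
    n = len(image_embs)
    total = 0.0
    for i, img in enumerate(image_embs):
        for j, txt in enumerate(text_embs):
            sim = sum(a * c for a, c in zip(img, txt))
            label = 1.0 if i == j else -1.0
            total += log_sigmoid(label * (t * sim + b))
    return -total / n
```

Unlike the softmax form, no term here normalizes over the whole batch, so the loss for one pair does not depend on which other pairs happen to share the batch.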
DINOv3 (Meta, August 2025) is currently the strongest pure-vision foundation backbone for dense prediction. EVA-CLIP and OpenCLIP are open-weight reproductions and scale-ups of the original CLIP. Florence-VL adds a Florence encoder to large language models for VLM training.
| Model | Year | Authors / lab | Training signal | Use |
|---|---|---|---|---|
| CLIP | Feb 2021 | Radford et al., OpenAI | Image-text contrastive | Zero-shot classification, VLM backbone |
| SigLIP | 2023 | Zhai et al., Google | Sigmoid image-text | Stronger contrastive vision-language |
| EVA-CLIP | 2022 | Sun et al., BAAI | CLIP with MIM init | Open-weight, larger scale |
| Florence-2 | Nov 2023 | Xiao et al., Microsoft | Multi-task seq2seq | Unified caption / detect / segment |
| DINOv2 | Apr 2023 | Oquab et al., Meta | Self-supervised distillation | All-purpose dense features |
| SigLIP 2 | Feb 2025 | Google DeepMind | Sigmoid + self-distill + masked | 109-language vision-language |
| DINOv3 | Aug 2025 | Meta | Self-supervised, Gram anchoring | Strongest dense-prediction backbone |
Video adds a time axis that conventional 2D ConvNets cannot exploit directly. I3D (Joao Carreira and Andrew Zisserman at DeepMind, 2017) inflated 2D ImageNet-pretrained kernels to 3D and trained on Kinetics-400, setting the first strong action-recognition baseline. SlowFast (Christoph Feichtenhofer and colleagues at Meta, 2019) used two parallel pathways: a slow pathway with low temporal resolution for spatial semantics and a fast pathway with high temporal resolution for motion.
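Kernel inflation is easy to illustrate. The sketch below assumes the scheme described in the I3D paper, in which a 2D kernel is repeated along time and divided by the temporal depth so that a video of identical frames produces the same response as the original 2D filter. The helper names are illustrative.

```python
def inflate_kernel(kernel_2d, time_depth):
    """Repeat a 2D kernel `time_depth` times along time, scaled by 1/time_depth."""
    return [[[w / time_depth for w in row] for row in kernel_2d]
            for _ in range(time_depth)]

def response_2d(patch, kernel_2d):
    """Dot product of one image patch with a 2D kernel."""
    return sum(p * w
               for patch_row, kernel_row in zip(patch, kernel_2d)
               for p, w in zip(patch_row, kernel_row))

def response_3d(clip, kernel_3d):
    """Dot product of a stack of patches (one per frame) with a 3D kernel."""
    return sum(response_2d(frame, k2d) for frame, k2d in zip(clip, kernel_3d))
```

For a static clip, `response_3d([patch] * T, inflate_kernel(k, T))` equals `response_2d(patch, k)` up to floating-point error, which is why ImageNet-pretrained weights remain a sensible starting point after inflation.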
TimeSformer (Gedas Bertasius, Heng Wang, Lorenzo Torresani at Meta, 2021) was among the first pure-transformer video models, applying divided space-time attention to patches of frames. ViViT (Anurag Arnab and colleagues at Google, 2021) explored several factorizations of space-time attention. Video Swin Transformer extended Swin to the time dimension.
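The motivation for factorizing attention is mostly arithmetic. The sketch below counts query-key comparisons under a simplifying assumption: in divided attention each token attends to the tokens at its own spatial position across frames and then to the tokens in its own frame, and the classification token is ignored. The function name and the 8-frame, 196-patch example are illustrative.

```python
def attention_comparisons(num_frames, patches_per_frame):
    """Count query-key comparisons for joint vs. divided space-time attention."""
    tokens = num_frames * patches_per_frame
    # Joint: every token attends to every token in the clip.
    joint = tokens * tokens
    # Divided: temporal attention (num_frames keys), then spatial attention
    # (patches_per_frame keys), for each token.
    divided = tokens * (num_frames + patches_per_frame)
    return joint, divided

joint, divided = attention_comparisons(num_frames=8, patches_per_frame=196)
```

For this configuration joint attention needs roughly 7.7 times as many comparisons as the divided form, and the gap widens as clips get longer or resolution increases.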
VideoMAE (Zhan Tong and colleagues, 2022) applied MAE-style masked autoencoding to video by masking a high fraction of space-time tubelets. VideoMAE V2 (Limin Wang and colleagues, CVPR 2023) introduced a dual-masking strategy that let the model scale to a billion-parameter video ViT and reach 90.0 percent top-1 on Kinetics-400 and 77.0 percent on Something-Something V2.
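A hedged sketch of tube masking follows. It assumes the variant described in the VideoMAE paper, where one spatial mask is drawn and reused at every temporal position so that a masked patch cannot be recovered by copying it from a neighbouring frame. The 90 percent ratio and the token counts are illustrative values.

```python
import random

def tube_mask(num_time_tokens, num_spatial_tokens, mask_ratio=0.9, seed=None):
    """Return a [time][space] grid of booleans, True where a token is masked.

    The same spatial positions are masked at every time step, so each
    masked position forms a tube through the clip.
    """
    rng = random.Random(seed)
    num_masked = round(num_spatial_tokens * mask_ratio)
    masked_positions = set(rng.sample(range(num_spatial_tokens), num_masked))
    row = [s in masked_positions for s in range(num_spatial_tokens)]
    return [list(row) for _ in range(num_time_tokens)]

# Example: 8 temporal tokens, each covering a 14 x 14 grid of patches.
mask = tube_mask(num_time_tokens=8, num_spatial_tokens=196, mask_ratio=0.9, seed=0)
```

Independent per-frame masking would be a one-line change (draw a new `masked_positions` per time step), but it would leak information between adjacent frames that look nearly identical.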
V-JEPA (Meta, 2024) and V-JEPA 2 (Meta, June 2025) are the latest large-scale video foundation models, applying joint-embedding predictive pretraining to video at internet scale and demonstrating zero-shot robot control. SAM 2 (Meta, July 2024) is the equivalent for video segmentation, propagating mask predictions across frames via a memory module.
3D vision has moved from explicit reconstruction (multi-view stereo, structure from motion) toward learned scene representations, a shift that began in the late 2010s, and more recently toward feed-forward reconstruction.
Neural Radiance Fields (NeRF) by Ben Mildenhall, Pratul Srinivasan, Matthew Tancik, Jonathan Barron, Ravi Ramamoorthi, and Ren Ng (ECCV 2020, oral, best paper honorable mention) represented a 3D scene as a small MLP that mapped an (x, y, z) position and a (θ, φ) viewing direction to a color and density value. Differentiable volume rendering let NeRF be trained directly from posed images, producing photoreal novel views. NeRF launched a flood of follow-ups: Mip-NeRF for anti-aliasing, Instant-NGP for fast training via hash grids, Plenoxels and TensoRF for explicit grid representations, NeRF-W for in-the-wild images.
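The volume-rendering step can be sketched for a single ray. The code assumes the standard quadrature used in the NeRF paper, in which each sample contributes its color weighted by its own opacity and by the transmittance of everything in front of it. The densities and colors would come from the MLP, and the function name is illustrative.

```python
import math

def render_ray(densities, colors, deltas):
    """Composite samples along one ray, ordered front to back.

    densities: volume density at each sample
    colors:    RGB triple at each sample
    deltas:    distance between consecutive samples
    Returns the pixel color and the remaining transmittance.
    """
    transmittance = 1.0
    pixel = [0.0, 0.0, 0.0]
    for sigma, rgb, delta in zip(densities, colors, deltas):
        # Opacity of this segment of the ray.
        alpha = 1.0 - math.exp(-sigma * delta)
        weight = transmittance * alpha
        pixel = [p + weight * c for p, c in zip(pixel, rgb)]
        # Light that survives to reach the next sample.
        transmittance *= 1.0 - alpha
    return pixel, transmittance
```

Every operation here is differentiable with respect to the densities and colors, which is what allows a photometric loss on rendered pixels to train the underlying network.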
3D Gaussian Splatting (Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, George Drettakis at Inria, SIGGRAPH 2023) replaced NeRF's implicit MLP with a set of explicit anisotropic 3D Gaussians, optimized via differentiable rasterization. The method achieved real-time rendering (30+ FPS at 1080p) at NeRF-level quality and trained in roughly the same time as Instant-NGP. 3D Gaussian Splatting has largely replaced NeRF as the default scene representation for novel-view synthesis in research code.
DUSt3R (Shuzhe Wang and colleagues at Naver Labs Europe, CVPR 2024) and MASt3R (Naver Labs Europe, 2024) eliminated the need for separate camera calibration and depth estimation. DUSt3R was a transformer that took two unposed images and directly regressed dense point maps in a common coordinate system in a single forward pass; camera intrinsics, relative pose, and depth could then be recovered from those point maps. The successor MASt3R added a matching head for stereo correspondences. Together, these models reframed multi-view 3D as a feed-forward learning problem rather than a multi-stage optimization pipeline.
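To show why a dense point map is enough to recover intrinsics, the sketch below estimates a focal length from a point map expressed in the camera's own frame. It is a plain least-squares simplification and not the estimator used by DUSt3R: it assumes a pinhole camera with square pixels and the principal point at the image center, and `estimate_focal` is a hypothetical helper.

```python
def estimate_focal(pointmap, width, height):
    """Least-squares focal length (in pixels) from a [row][col] grid of
    (x, y, z) points given in the camera frame.

    Under the pinhole model, u - cx = f * x / z and v - cy = f * y / z,
    so f minimizes the squared reprojection error over all pixels.
    """
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    numerator = denominator = 0.0
    for v, row in enumerate(pointmap):
        for u, (x, y, z) in enumerate(row):
            if z <= 0:
                continue  # skip points behind the camera
            a, b = x / z, y / z
            numerator += (u - cx) * a + (v - cy) * b
            denominator += a * a + b * b
    if denominator == 0.0:
        raise ValueError("point map carries no information about the focal length")
    return numerator / denominator
```

A robust estimator would be preferable on real predictions, since a handful of badly regressed points can dominate a least-squares fit.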
The trajectory of computer vision is largely the trajectory of its benchmarks. Better datasets force better models, and ImageNet in particular set off the deep-learning era.
ImageNet (Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, Li Fei-Fei, CVPR 2009) contains 1.28 million training images across 1,000 classes for ILSVRC, plus a larger ImageNet-22K version with 14 million images across 22,000 WordNet synsets. ImageNet remains the de facto pretraining and evaluation set for classification backbones.
COCO (Tsung-Yi Lin and colleagues at Microsoft Research, ECCV 2014) provides roughly 330,000 images with 80 detection categories, dense instance segmentations, and person keypoints; the COCO-Stuff extension later added 91 stuff categories. COCO is the standard detection and segmentation benchmark.
PASCAL VOC (Mark Everingham and colleagues, 2005 through 2012) was the pre-COCO standard for detection and segmentation, with 20 categories and around 11,500 images in VOC 2012. It is still used for ablations and quick studies.
ADE20K (Bolei Zhou and colleagues at MIT, 2017) provides roughly 20,000 training images for scene parsing across 150 semantic categories. It is a standard benchmark for semantic segmentation.
LVIS (Agrim Gupta, Piotr Dollár, Ross Girshick, 2019) provides 1,200 long-tail object categories with federated annotations, used to evaluate detectors on rare classes and open-vocabulary detection.
For video, Kinetics-400, Kinetics-600, and Kinetics-700 (Will Kay and colleagues at DeepMind, 2017 through 2019) provide 400, 600, and 700 action categories from YouTube clips. Something-Something V2 (Twenty Billion Neurons / Qualcomm, 2018) focuses on temporal reasoning. HMDB51 (Brown University, 2011) and UCF101 (University of Central Florida, 2012) are smaller legacy benchmarks still cited as reference points.
For depth and 3D, NYU Depth v2, KITTI, ScanNet, and Matterport3D provide indoor and outdoor benchmarks at varying scales. For pose, COCO keypoints, MPII Human Pose, and CrowdPose are standard, with AGORA and Hi4D for 3D mesh recovery.
The practical issues that keep showing up in computer vision deployments are well known and not yet solved.
Dataset bias is the first. Models trained on ImageNet inherit ImageNet's photographer choices, geographic distribution, and category quirks. Models trained on web-scraped image-text pairs (CLIP, SigLIP) inherit web bias including racial, gender, and cultural skews. Audit studies routinely find that face recognition and pedestrian-detection models perform worse on darker-skinned subjects, and that captioning models reproduce stereotype associations from their training data.
Distribution shift is the second. A model trained on COCO often fails on satellite imagery, medical scans, or industrial inspection footage. Self-supervised foundation models help but do not solve it; a DINOv3 backbone is more robust to distribution shift than an ImageNet ResNet-50, but it still needs domain adaptation for specialist applications.
Adversarial vulnerability is the third. Tiny pixel-level perturbations that are imperceptible to humans can flip the predictions of most undefended models. Mitigation methods (adversarial training, certified robustness) reduce the effect but add accuracy or compute costs that many practitioners decline to pay.
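How small such a perturbation can be is easiest to see on a toy model. The sketch below uses the fast gradient sign method, a standard attack that the text above does not name, against a hypothetical logistic-regression classifier. The weights, input, and epsilon in the usage note are made up for illustration.

```python
import math

def predict(weights, bias, x):
    """Probability of class 1 under a logistic-regression model."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def sign(value):
    return (value > 0) - (value < 0)

def fgsm(weights, bias, x, label, epsilon):
    """Move every input dimension by epsilon in the direction that
    increases the cross-entropy loss for the true label (0 or 1)."""
    p = predict(weights, bias, x)
    # Gradient of the cross-entropy loss with respect to the input.
    grad = [(p - label) * w for w in weights]
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]
```

With 100 input dimensions, weights of plus or minus 0.5, and an input whose logit is 0.5, an epsilon of 0.02 per dimension is enough to push the logit to about minus 0.5 and flip the prediction. The effect accumulates across dimensions, which is why high-resolution images are especially exposed.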
Long-tail and rare-class performance is the fourth. LVIS exists precisely because COCO's 80-category training set is too easy on common classes and provides no signal on rare ones. Open-vocabulary detectors close some of this gap but inherit text-encoder limitations and still struggle on fine-grained discrimination.
Compute and memory are the fifth. SAM-H, DINOv3, Sapiens 2B, and other foundation backbones take hundreds of milliseconds per image on a high-end GPU and carry gigabytes of weights; distilled and mobile variants exist but trade accuracy for speed. Real-time on-device computer vision remains the domain of smaller specialist models such as MobileNet, YOLO, and MobileSAM.
Finally, evaluation is increasingly the bottleneck. Saturated benchmarks (COCO box AP, ImageNet top-1) no longer cleanly separate the best models, and new benchmarks for open-world reasoning, fine-grained categorization, and physical understanding are still consolidating.