Object Detection Models
Last reviewed
May 11, 2026
Sources
18 citations
Review status
Source-backed
Revision
v2 · 2,500 words
Object detection models are computer vision systems that locate and classify multiple objects in an image or video frame, producing both bounding boxes and class labels for each instance. Unlike pure image classification, which assigns one label to a whole image, and unlike semantic segmentation, which labels every pixel without separating instances, object detection must answer both what objects are present and where each one sits. Accuracy is measured against ground-truth boxes using Intersection over Union (IoU), with Average Precision (AP) and mean Average Precision (mAP) as the dominant metrics on benchmarks such as the COCO dataset.
Detectors are grouped along two axes. The first separates two-stage methods, which propose candidate regions and then classify them, from one-stage methods, which predict boxes and classes in a single forward pass. The second distinguishes anchor-based designs, which regress offsets from a dense grid of predefined reference boxes, from anchor-free and set-prediction designs that predict keypoints, centers, or a fixed-size set of object queries directly. Modern leaderboards include strong examples from every one of these categories.
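To make "regressing offsets from predefined reference boxes" concrete, the sketch below lays a small grid of anchors over an image and decodes one predicted offset vector into a box. It assumes the center/log-size parameterization used by the R-CNN family; the helper names `make_anchors` and `decode_box`, the stride, and the anchor sizes are illustrative choices, not taken from any particular codebase.

```python
import math

def make_anchors(grid_w, grid_h, stride, sizes):
    """Place one square anchor per size at the center of every grid cell.

    Returns a list of (cx, cy, w, h) tuples in pixels.
    """
    anchors = []
    for gy in range(grid_h):
        for gx in range(grid_w):
            cx = (gx + 0.5) * stride
            cy = (gy + 0.5) * stride
            for s in sizes:
                anchors.append((cx, cy, float(s), float(s)))
    return anchors

def decode_box(anchor, deltas):
    """Turn predicted offsets (tx, ty, tw, th) into corner coordinates.

    Center shifts are scaled by the anchor size; width and height are
    predicted in log space so that the decoded size is always positive.
    """
    acx, acy, aw, ah = anchor
    tx, ty, tw, th = deltas
    cx = acx + tx * aw
    cy = acy + ty * ah
    w = aw * math.exp(tw)
    h = ah * math.exp(th)
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

# Hypothetical 2x2 feature map with stride 16 and two anchor sizes per cell.
anchors = make_anchors(grid_w=2, grid_h=2, stride=16, sizes=(32, 64))
print(len(anchors))                                   # 2 * 2 cells * 2 sizes = 8
print(decode_box(anchors[0], (0.0, 0.0, 0.0, 0.0)))   # zero offsets give back the anchor
print(decode_box(anchors[0], (0.5, 0.0, math.log(2), 0.0)))
```

An anchor-free head would skip `make_anchors` and predict distances or keypoints relative to each grid location instead.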
Object detection predates convolutional neural networks by more than a decade. The Viola-Jones face detector (2001) used Haar-like features, an integral image, and a cascade of AdaBoost classifiers for real-time face detection, becoming the default in consumer cameras and OpenCV. Dalal and Triggs introduced Histograms of Oriented Gradients (HOG) in 2005, pairing dense gradient descriptors with a linear SVM to detect pedestrians. Felzenszwalb, Girshick, and McAllester extended this with Deformable Part Models (DPM) between 2008 and 2010, modeling each object as a root filter plus deformable part filters; DPM dominated PASCAL VOC for several years. Selective Search (Uijlings et al. 2013) bridged the transition to deep learning by producing class-agnostic region proposals that the first deep detectors fed into a CNN classifier.
The modern era opened in late 2013 when Ross Girshick and colleagues at UC Berkeley released R-CNN (arXiv:1311.2524). R-CNN ran Selective Search to obtain about 2,000 proposals per image, warped each to a fixed size, passed it through an ImageNet-pretrained CNN, and classified the features with class-specific SVMs, establishing CNN feature extraction as the new baseline. Girshick's Fast R-CNN (2015, arXiv:1504.08083) ran the CNN once over the whole image and pooled region-specific features using RoI pooling. Faster R-CNN (Ren, He, Girshick, and Sun, 2015, arXiv:1506.01497) replaced Selective Search with a learnable Region Proposal Network (RPN) sharing convolutional features with the detector, hitting 5 FPS with VGG-16 and 73.2% mAP on VOC 2007. Faster R-CNN became the canonical two-stage architecture.
In 2017, Kaiming He and colleagues at Facebook AI Research (FAIR, now part of Meta AI) released Mask R-CNN (arXiv:1703.06870), adding a parallel branch that predicts per-instance segmentation masks and introducing RoIAlign to fix RoI pooling's quantization errors. The paper reported top results on all three tracks of the COCO challenge suite (instance segmentation, bounding-box detection, and person keypoint detection), outperforming the COCO 2016 challenge winners, and Mask R-CNN remains a standard baseline for instance segmentation. Cascade R-CNN (Cai and Vasconcelos 2017) added multi-stage refinement that progressively tightened the IoU threshold for positives, and Libra R-CNN (Pang et al. 2019) addressed sample, feature, and objective imbalance.
One-stage detectors skip the region-proposal step and predict box coordinates and class scores directly from a dense grid. The original YOLO paper by Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi at the University of Washington (2015, arXiv:1506.02640) framed detection as a single regression problem and ran at 45 FPS, with Fast YOLO at 155 FPS. YOLOv2 (2016) added batch normalization, anchor boxes, and a higher-resolution classifier; YOLOv3 (Redmon and Farhadi 2018, arXiv:1804.02767) added multi-scale prediction over three feature levels and an independent logistic classifier per class.
Redmon stopped after v3, but the line continued under different authors. YOLOv4 (Bochkovskiy, Wang, and Liao 2020, arXiv:2004.10934) bundled CSPDarknet53, mosaic augmentation, and CIoU loss. YOLOv5 (Ultralytics, 2020) was a PyTorch re-implementation that became the most widely deployed YOLO in production. YOLOX (Megvii 2021) decoupled the classification and regression heads and went anchor-free with SimOTA assignment. YOLOv6 (Meituan 2022) and YOLOv7 (Wang, Bochkovskiy, and Liao 2022, arXiv:2207.02696) introduced reparameterized backbones and a trainable bag-of-freebies. YOLOv8 (Ultralytics, January 2023) unified detection, segmentation, classification, and pose under one anchor-free framework. YOLOv9 (Wang, Yeh, and Liao 2024, arXiv:2402.13616) introduced Programmable Gradient Information (PGI) and GELAN. YOLOv10 (Wang et al. at Tsinghua University, May 2024, arXiv:2405.14458) eliminated NMS at inference through consistent dual label assignments. YOLO11 (Ultralytics, September 2024) refined the YOLOv8 architecture, with YOLO11m reaching higher COCO mAP than YOLOv8m while using 22% fewer parameters. YOLOv12 (Tian et al. 2025, arXiv:2502.12524) shifted toward attention while preserving real-time latency, using Area Attention and Residual ELAN modules.
Outside the YOLO line, SSD (Liu et al. December 2015, arXiv:1512.02325) made dense multi-scale anchor prediction practical by placing default boxes on several feature maps of different resolution, hitting 74.3% VOC2007 mAP at 59 FPS. RetinaNet (Lin et al. at FAIR, 2017, arXiv:1708.02002) introduced Focal Loss to handle the extreme foreground-to-background imbalance in dense detection, letting a one-stage detector surpass contemporary two-stage models. FCOS (Tian et al. 2019) and CenterNet (Zhou et al. 2019) eliminated anchor boxes entirely, predicting per-pixel objectness, box regression, or keypoint heatmaps. EfficientDet (Tan, Pang, and Le at Google, 2019, arXiv:1911.09070) scaled the backbone, a bidirectional feature pyramid (BiFPN), and the head jointly with compound scaling, producing models from EfficientDet-D0 through D7.
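The effect of Focal Loss on class imbalance is easiest to see numerically. The sketch below implements the binary form FL(p_t) = -α_t (1 - p_t)^γ log(p_t) for a single prediction, using the defaults α = 0.25 and γ = 2 reported in the RetinaNet paper, and compares it with plain cross-entropy. The function names and the two example probabilities are illustrative assumptions.

```python
import math

def cross_entropy(p, y):
    """Binary cross-entropy for one prediction.

    p: predicted foreground probability in (0, 1); y: 1 foreground, 0 background.
    """
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss: cross-entropy scaled by alpha_t * (1 - p_t) ** gamma."""
    p_t = p if y == 1 else 1.0 - p          # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# An easy background location (confidently rejected) versus a hard foreground
# location (object mostly missed). Values are made up for illustration.
easy_background = (0.01, 0)
hard_foreground = (0.10, 1)

for name, (p, y) in [("easy background", easy_background),
                     ("hard foreground", hard_foreground)]:
    ce, fl = cross_entropy(p, y), focal_loss(p, y)
    print(f"{name}: CE={ce:.5f} FL={fl:.7f} FL/CE={fl / ce:.5f}")
```

The modulating factor shrinks the loss of the easy background example by several orders of magnitude while leaving most of the hard example's loss in place, which is why the many easy negatives in a dense grid stop dominating the gradient.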
DETR (Detection Transformer), introduced by Carion, Massa, and colleagues at FAIR in May 2020 (arXiv:2005.12872), reformulated detection as direct set prediction. A CNN backbone produces image features, a transformer encoder-decoder turns learned object queries into box and class predictions, and a bipartite Hungarian matching loss enforces one-to-one assignment. DETR removed the need for anchor design, NMS, and hand-tuned proposals, but slow convergence (500 epochs on COCO) prompted follow-ups. Deformable DETR (Zhu et al. 2020) replaced dense attention with sparse deformable attention, cutting training to about 50 epochs. DAB-DETR introduced anchor box coordinates as queries, DN-DETR added query denoising, and DINO (Zhang et al. 2022, arXiv:2203.03605) combined contrastive denoising, mixed query selection, and look-forward-twice refinement to reach 51.3 AP on COCO with a ResNet-50 backbone. Scaled up with a larger backbone, DINO was reported by its authors as the first end-to-end transformer detector to lead the COCO leaderboard.
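The one-to-one assignment at the heart of set prediction can be illustrated on a toy example. The sketch below is a simplification under stated assumptions: the matching cost keeps only the negative class probability and an L1 box term (DETR's cost also includes a generalized IoU term and per-term weights), and an exhaustive search over permutations stands in for the Hungarian algorithm, which is only practical here because the inputs are tiny. All names and values are hypothetical.

```python
from itertools import permutations

def box_l1(a, b):
    """L1 distance between two boxes in normalized (cx, cy, w, h) form."""
    return sum(abs(x - y) for x, y in zip(a, b))

def match_cost(pred, target):
    """Cost of assigning one query's prediction to one ground-truth object."""
    class_prob = pred["probs"][target["label"]]
    return -class_prob + box_l1(pred["box"], target["box"])

def best_assignment(preds, targets):
    """Return (total_cost, [(pred_index, target_index), ...]) for the cheapest
    one-to-one matching. Assumes len(preds) >= len(targets)."""
    best_cost, best_pairs = float("inf"), []
    for perm in permutations(range(len(preds)), len(targets)):
        cost = sum(match_cost(preds[p], targets[t]) for t, p in enumerate(perm))
        if cost < best_cost:
            best_cost = cost
            best_pairs = [(p, t) for t, p in enumerate(perm)]
    return best_cost, best_pairs

# Three object queries, two ground-truth objects.
preds = [
    {"probs": [0.1, 0.9], "box": (0.5, 0.5, 0.2, 0.2)},
    {"probs": [0.8, 0.2], "box": (0.2, 0.3, 0.1, 0.1)},
    {"probs": [0.5, 0.5], "box": (0.9, 0.9, 0.1, 0.1)},
]
targets = [
    {"label": 0, "box": (0.2, 0.3, 0.1, 0.1)},
    {"label": 1, "box": (0.5, 0.5, 0.2, 0.2)},
]

cost, pairs = best_assignment(preds, targets)
matched = {p for p, _ in pairs}
unmatched = [i for i in range(len(preds)) if i not in matched]
print(pairs)      # each ground-truth object gets exactly one query
print(unmatched)  # remaining queries are trained to predict "no object"
```

Because every ground-truth object is claimed by exactly one query, duplicate predictions are penalized during training, which is what lets this family drop NMS at inference.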
Grounding DINO (Liu et al. at IDEA Research, March 2023, arXiv:2303.05499) extended DINO to open-vocabulary detection by fusing image and text features in encoder and decoder, achieving 52.5 AP zero-shot on COCO. Co-DETR (Zong et al. 2022) trained a DETR head with auxiliary one-to-many heads. OWL-ViT (Minderer et al. at Google, May 2022, arXiv:2205.06230) used a Vision Transformer pretrained with CLIP-style image-text contrastive learning plus a lightweight detection head for text-queried detection, with OWLv2 (2023) scaling via web self-training. YOLO-World (Cheng et al. 2024, arXiv:2401.17270) brought open-vocabulary detection to YOLO via a Re-parameterizable Vision-Language Path Aggregation Network and region-text contrastive pretraining, reaching 35.4 AP on LVIS at 52 FPS on a V100.
| Model | Released | Origin | Type | Notes |
|---|---|---|---|---|
| R-CNN | Nov 2013 | UC Berkeley | Two-stage | First CNN-based detector; Selective Search proposals |
| Fast R-CNN | Apr 2015 | Microsoft Research | Two-stage | Shared CNN over image, RoI pooling |
| Faster R-CNN | Jun 2015 | MSRA | Two-stage | Learned Region Proposal Network |
| YOLO | Jun 2015 | Univ. of Washington | One-stage | Single-pass regression, 45 FPS |
| SSD | Dec 2015 | UNC Chapel Hill | One-stage | Multi-scale anchor boxes, 59 FPS |
| Mask R-CNN | Mar 2017 | FAIR | Two-stage | Adds mask branch and RoIAlign |
| RetinaNet | Aug 2017 | FAIR | One-stage anchor-based | Focal loss for class imbalance |
| YOLOv3 | Apr 2018 | Univ. of Washington | One-stage anchor-based | Multi-scale, Darknet-53 backbone |
| EfficientDet | Nov 2019 | Google Research | One-stage | Compound scaling, BiFPN |
| DETR | May 2020 | FAIR | Set prediction | Transformer + bipartite matching |
| YOLOv5 | Jun 2020 | Ultralytics | One-stage | PyTorch, widely deployed |
| YOLOX | Jul 2021 | Megvii | One-stage anchor-free | Decoupled head, SimOTA |
| DINO (detection) | Mar 2022 | IDEA Research | Set prediction | First DETR to beat strong CNN baselines |
| OWL-ViT | May 2022 | Google Research | Open-vocab | CLIP-pretrained ViT detector |
| YOLOv7 | Jul 2022 | Academia Sinica / IIS | One-stage | E-ELAN, trainable bag-of-freebies |
| YOLOv8 | Jan 2023 | Ultralytics | One-stage anchor-free | Unified detection, segmentation, pose |
| Grounding DINO | Mar 2023 | IDEA Research | Open-vocab | Text-conditioned, 52.5 AP COCO zero-shot |
| YOLO-World | Jan 2024 | Tencent AI Lab | Open-vocab YOLO | 52 FPS open-vocabulary detection |
| YOLOv9 | Feb 2024 | Academia Sinica / IIS | One-stage | PGI, GELAN |
| YOLOv10 | May 2024 | Tsinghua University | One-stage | NMS-free dual assignment |
| YOLO11 | Sep 2024 | Ultralytics | One-stage | YOLO11m has 22% fewer params than YOLOv8m |
| YOLOv12 | Feb 2025 | Researchers (NeurIPS 2025) | Attention-centric | Area attention, R-ELAN |

| Benchmark | Year | Scope | Notes |
|---|---|---|---|
| PASCAL VOC | 2005-2012 | 20 classes, ~22k images | Standard until 2014; mAP@0.5 metric |
| MS COCO (Lin et al. 2014, arXiv:1405.0312) | 2014 | 80 classes, ~330k images, 2.5M instances | The dominant benchmark; mAP averaged over IoU 0.5-0.95 |
| Open Images | 2017 | 600 classes, ~1.9M images | Google's large-scale detection set |
| LVIS | 2019 | 1,203 categories, long-tail | Federated annotation, rare categories |
| Objects365 | 2019 | 365 classes, 600k images | Larger high-quality object set |
| BDD100K | 2018 | Driving scenes | 100k videos, 10 task types |
| Waymo Open Dataset | 2019 | LiDAR and camera | Self-driving research benchmark |
| ODinW | 2022 | 35 datasets | Open-vocabulary zero-shot evaluation |
The standard metric on COCO is mAP averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05, often broken down by object scale and by maximum detections per image. AP@0.5 (the older VOC-style metric) and AP@0.75 are reported alongside the COCO primary score. Average Recall (AR) measures recall at fixed detection budgets. Speed is reported as inference latency or frames per second on a specified GPU (V100, T4, A100, or A6000), and parameter count appears in efficiency comparisons.
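The relationship between IoU, AP at one threshold, and the COCO-style average over thresholds can be sketched in a few lines. This is a simplified illustration, not the official evaluation: it handles a single class on a single image, matches detections to ground truth greedily in score order, and integrates the full precision-recall curve rather than using COCO's 101-point interpolation. Function names and the example boxes are assumptions made for the sketch.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def average_precision(dets, gts, thr):
    """AP for one class. dets: list of (score, box); gts: list of boxes."""
    if not gts:
        return 0.0
    matched, hits = set(), []
    for _, box in sorted(dets, key=lambda d: d[0], reverse=True):
        best_j, best_iou = -1, thr
        for j, gt in enumerate(gts):
            v = iou(box, gt)
            if j not in matched and v >= best_iou:
                best_j, best_iou = j, v
        if best_j >= 0:
            matched.add(best_j)   # each ground-truth box can be claimed once
        hits.append(1 if best_j >= 0 else 0)
    # Precision and recall after each detection in score order.
    precisions, recalls, tp = [], [], 0
    for k, h in enumerate(hits, start=1):
        tp += h
        precisions.append(tp / k)
        recalls.append(tp / len(gts))
    # Make precision non-increasing in recall, then sum the area under the curve.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_r) * p
        prev_r = r
    return ap

gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
dets = [(0.9, (0, 0, 10, 10)),      # perfect box
        (0.8, (21, 21, 31, 31)),    # slightly shifted box, IoU about 0.68
        (0.3, (50, 50, 60, 60))]    # false positive

thresholds = [0.5 + 0.05 * i for i in range(10)]   # 0.50, 0.55, ..., 0.95
ap50 = average_precision(dets, gts, 0.5)
ap75 = average_precision(dets, gts, 0.75)
ap_avg = sum(average_precision(dets, gts, t) for t in thresholds) / len(thresholds)
print(ap50, ap75, ap_avg)
```

The shifted box counts as correct at loose thresholds and as a miss at strict ones, so the averaged score rewards tight localization in a way AP@0.5 alone does not.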
Several building blocks recur across families. The Feature Pyramid Network (Lin et al. 2016) builds a top-down pathway with lateral connections, producing multi-scale feature maps that help small-object detection. Anchor-based heads place reference boxes at each location and regress offsets; anchor-free heads predict objectness, center-ness, or keypoints directly. Label assignment drives much of the recent gain in accuracy: ATSS selects positives adaptively, OTA frames assignment as optimal transport, and SimOTA in YOLOX approximates it cheaply. Non-maximum suppression (NMS) prunes overlapping detections; Soft-NMS decays confidence rather than dropping boxes, and Cluster-NMS parallelizes the operation. End-to-end detectors trained with one-to-one assignment (Hungarian bipartite matching in DETR, a one-to-one head under consistent dual assignments in YOLOv10) avoid NMS at inference. Common training tricks include mosaic data augmentation introduced in YOLOv4, MixUp, exponential moving averages of model weights, knowledge distillation, and self-supervised pretraining followed by detection fine-tuning.
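The difference between NMS and Soft-NMS is clearest side by side. The sketch below is a minimal single-class version in plain Python, assuming boxes in (x1, y1, x2, y2) form; Soft-NMS uses the Gaussian decay exp(-IoU² / σ). The thresholds, σ, and the example detections are illustrative defaults rather than values prescribed by any specific detector.

```python
import math

def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(dets, iou_thr=0.5):
    """Greedy NMS: keep the top-scoring box, drop everything overlapping it."""
    remaining = sorted(dets, key=lambda d: d[0], reverse=True)
    keep = []
    while remaining:
        best = remaining.pop(0)
        keep.append(best)
        remaining = [d for d in remaining if iou(best[1], d[1]) <= iou_thr]
    return keep

def soft_nms(dets, sigma=0.5, score_thr=0.001):
    """Gaussian Soft-NMS: decay the scores of overlapping boxes instead of
    removing them; boxes are dropped only once their score falls below score_thr."""
    remaining = list(dets)
    keep = []
    while remaining:
        i = max(range(len(remaining)), key=lambda k: remaining[k][0])
        best = remaining.pop(i)
        keep.append(best)
        decayed = []
        for score, box in remaining:
            score *= math.exp(-iou(best[1], box) ** 2 / sigma)
            if score >= score_thr:
                decayed.append((score, box))
        remaining = decayed
    return keep

dets = [(0.9, (0, 0, 10, 10)),
        (0.8, (1, 1, 11, 11)),      # heavy overlap with the first box
        (0.7, (20, 20, 30, 30))]    # separate object

hard = nms(dets)
soft = soft_nms(dets)
print([round(s, 3) for s, _ in hard])   # overlapping box removed
print([round(s, 3) for s, _ in soft])   # overlapping box kept with a reduced score
```

Keeping the overlapping box at reduced confidence helps in crowded scenes, where two heavily overlapping detections may both be real objects.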
Classical detectors are restricted to a fixed label set chosen at training time. Open-vocabulary detectors accept arbitrary natural-language queries at test time, conditioning category prediction on text embeddings produced by the language side of an image-text model. Representative systems include GLIP (Li et al. 2021), Grounding DINO, OWL-ViT, OWLv2, OmDet, and YOLO-World. They are evaluated on benchmarks such as LVIS and ODinW, which include rare categories for which the model never saw labeled boxes. These systems trade off zero-shot accuracy, inference speed, and the cost of encoding text prompts.
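The core scoring step shared by these systems can be sketched as a similarity lookup: each candidate region's embedding is compared with the embedding of every text prompt, and the best-matching prompt becomes the label. The sketch below uses hand-written three-dimensional vectors in place of real encoder outputs; the embeddings, the prompts, the temperature, and the function names are all hypothetical and do not correspond to any specific model's API.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def softmax(xs):
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def classify_regions(region_embs, text_embs, temperature=0.1):
    """For each region embedding, return (best prompt, probability).

    text_embs maps a prompt string to its embedding; the label set is whatever
    prompts the caller supplies at test time.
    """
    prompts = list(text_embs)
    results = []
    for r in region_embs:
        logits = [cosine(r, text_embs[p]) / temperature for p in prompts]
        probs = softmax(logits)
        k = max(range(len(prompts)), key=probs.__getitem__)
        results.append((prompts[k], probs[k]))
    return results

# Toy stand-ins for text-encoder and region-head outputs.
text_embs = {
    "a cat": [1.0, 0.1, 0.0],
    "a traffic cone": [0.0, 1.0, 0.2],
}
region_embs = [[0.9, 0.2, 0.1], [0.1, 0.8, 0.3]]

results = classify_regions(region_embs, text_embs)
for label, prob in results:
    print(label, round(prob, 3))
```

Changing the label set only requires new entries in `text_embs`, which is the practical appeal of the approach; the prompt-encoding cost mentioned above corresponds to producing those embeddings with a real text encoder.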
Object detection underlies many production systems. Autonomous driving and ADAS pipelines use detectors fed by camera and LiDAR streams to track vehicles, pedestrians, signs, and lane markings. Surveillance and physical security run detectors on fixed cameras to count people, flag intrusions, and recognize license plates. Industrial robotics relies on detection for bin picking and grasp planning, and manufacturing uses it to spot defects on production lines. Retail deploys detection for cashierless stores (Amazon Go is the best-known example), shelf monitoring, and planogram compliance. Agriculture uses crop, weed, fruit, and livestock detection for precision spraying and yield estimation. Medical imaging uses detection for lesion, polyp, and nodule localization in radiology and endoscopy pipelines, with stricter false-positive constraints than consumer applications. Sports analytics tracks players, ball, and equipment for broadcast graphics. Augmented reality uses detection plus pose estimation to anchor virtual content. Content moderation runs detectors to flag weapons, logos, or restricted categories at upload time.
Small objects remain the hardest case, especially with low contrast or compression artifacts. Dense crowds with heavy occlusion stress the assignment step and frustrate NMS. Domain shift between training and deployment, for example from synthetic to real data or from daytime to adverse-weather conditions, can sharply reduce accuracy without any warning sign in in-distribution validation metrics. Long-tailed class distributions (LVIS-style) penalize standard cross-entropy and motivate decoupled classifier learning. Real-time deployment forces a constant accuracy-versus-latency trade-off; the YOLO and EfficientDet families target the low-latency end of that curve, while research-leaderboard detectors favor accuracy. Open-vocabulary detectors add prompt-encoding cost and tend to underperform closed-set models on their trained categories. Adversarial patches can fool deployed detectors, and explainability for box predictions is far less developed than for classification.