Object detection

Updated Jun 20, 2026

Object detection is a computer vision task that locates and classifies every object instance in an image or video frame, returning for each one a bounding box, a class label, and a confidence score. Unlike image classification, which assigns a single label to an entire image, object detection answers both "what" and "where" at once, so a single pass over one photo can report multiple objects of different types and pinpoint each position. The dominant academic benchmark, MS COCO, labels 80 object categories across roughly 330,000 images with about 1.5 million annotated instances ^[20]. Modern detectors are typically scored by Average Precision (AP) on COCO, where leading models in 2025 exceed 60 AP and the fastest real-time detectors process video at 100 or more frames per second ^[25].

The field has progressed from handcrafted feature methods in the early 2000s to deep learning-based approaches that now dominate benchmarks and real-world applications. The first deep detector to reach real-time speed, You Only Look Once (YOLO), ran at 45 frames per second in 2016 ^[8]. The broader image recognition market that object detection underpins was valued at about USD 53.3 billion in 2023 and is projected to reach USD 128.3 billion by 2030, a compound annual growth rate of 12.8% ^[26]. Object detection serves as a foundational capability for tasks including image segmentation, object tracking, scene understanding, and activity recognition.

What does an object detector output?

Given an input image, an object detector produces a list of detections. Each detection consists of:

A bounding box defined by four coordinates (typically the top-left and bottom-right corners, or center coordinates plus width and height).
A class label drawn from a predefined vocabulary of object categories (for example, "car," "person," "dog").
A confidence score between 0 and 1, indicating the model's certainty that the bounding box contains an object of the predicted class.
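
The two box conventions mentioned above carry the same information, and converting between them is simple arithmetic. The following is a minimal sketch; the function names and the dictionary layout of the detection record are hypothetical and chosen only for illustration.

```python
def xyxy_to_cxcywh(box):
    """Convert (x_min, y_min, x_max, y_max) to (center_x, center_y, width, height)."""
    x_min, y_min, x_max, y_max = box
    width = x_max - x_min
    height = y_max - y_min
    return (x_min + width / 2.0, y_min + height / 2.0, width, height)


def cxcywh_to_xyxy(box):
    """Convert (center_x, center_y, width, height) back to corner format."""
    cx, cy, width, height = box
    return (cx - width / 2.0, cy - height / 2.0, cx + width / 2.0, cy + height / 2.0)


# A hypothetical detection record: box in corner format, class label, confidence.
detection = {"box": (10.0, 20.0, 50.0, 100.0), "label": "dog", "score": 0.87}
center_form = xyxy_to_cxcywh(detection["box"])
```

Mixing up the two conventions is a common source of silent errors, so it usually helps to convert once at the boundary of a pipeline and keep a single format internally.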

During training, the model learns to minimize a combined loss that includes a localization component (penalizing inaccurate bounding boxes) and a classification component (penalizing wrong labels). Most modern detectors also incorporate a post-processing step called non-maximum suppression (NMS), which removes duplicate detections by suppressing overlapping boxes with lower confidence scores.
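
The suppression step can be sketched as a greedy loop. This is a simplified, class-agnostic version written for illustration (production detectors usually run NMS per class and use optimized library routines); the 0.5 overlap threshold is an assumed, commonly used default rather than a value taken from this article.

```python
def iou(a, b):
    """Intersection over union of two boxes in (x_min, y_min, x_max, y_max) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of boxes to keep, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any box already kept.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

In this example the second box overlaps the first with an IoU of about 0.68, so it is suppressed, while the third box is far away and survives.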

History and traditional methods

Before the deep learning revolution, object detection relied on manually designed features and classical machine learning classifiers. Three methods from this era were especially influential.

Viola-Jones detector (2001)

Paul Viola and Michael Jones introduced a real-time face detection framework in 2001 that became one of the most widely deployed object detectors of its time. The detector uses three key ideas: Haar-like features computed rapidly using an integral image representation, the AdaBoost algorithm to select a small set of discriminative features from a large pool, and a cascade classifier structure that quickly rejects non-face image regions in early stages while spending more computation on ambiguous regions. ^[1]

Running on a 700 MHz Intel Pentium III processor, the Viola-Jones detector achieved real-time face detection without hardware acceleration, processing images tens to hundreds of times faster than competing methods at comparable accuracy levels. ^[1] The framework operates by sliding a detection window across the image at multiple scales and positions.
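
The integral image is what makes Haar-like features cheap: once it is built, the sum of any rectangle needs only four lookups. The sketch below illustrates the idea on a tiny grayscale image stored as nested lists; the function names are hypothetical, and the two-rectangle feature is just one of the feature shapes used by the detector.

```python
def integral_image(img):
    """Build a summed-area table with one extra row and column of zeros."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, top, left, height, width):
    """Sum of pixels in a rectangle using four table lookups."""
    bottom, right = top + height, left + width
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


def two_rect_haar(ii, top, left, height, width):
    """Horizontal two-rectangle feature: left half minus right half."""
    half = width // 2
    return rect_sum(ii, top, left, height, half) - rect_sum(ii, top, left + half, height, half)


image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
table = integral_image(image)
```

Because every rectangle sum costs the same regardless of its size, the detector can evaluate features at many scales without rescaling the image.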

HOG + SVM (2005)

Navneet Dalal and Bill Triggs proposed the Histogram of Oriented Gradients (HOG) descriptor in 2005, motivated primarily by the problem of pedestrian detection. HOG features capture the distribution of gradient orientations in localized portions of an image. The descriptor is computed on a dense grid of uniformly spaced cells, with overlapping local contrast normalization applied over larger blocks to achieve robustness to illumination and shadowing changes. ^[2]

HOG features paired with a linear support vector machine (SVM) classifier established a strong baseline for pedestrian detection on the INRIA Person dataset. ^[2] The approach balanced feature invariance (to translation, scale, and illumination) with the ability to capture fine-grained shape information.
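
As a rough illustration of the descriptor, the sketch below computes a gradient-orientation histogram for a single cell and L2-normalizes it. It simplifies the published method in several ways that should be read as assumptions: it uses hard binning instead of interpolated votes, normalizes one cell rather than overlapping blocks, and skips border pixels. The choice of 9 unsigned orientation bins follows common practice.

```python
import math


def cell_histogram(cell, n_bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations (0 to 180 degrees)."""
    h, w = len(cell), len(cell[0])
    hist = [0.0] * n_bins
    bin_width = 180.0 / n_bins
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]   # centered horizontal difference
            gy = cell[y + 1][x] - cell[y - 1][x]   # centered vertical difference
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle // bin_width) % n_bins] += magnitude
    return hist


def l2_normalize(vec, eps=1e-6):
    """Scale a vector to unit length; eps avoids division by zero on flat regions."""
    norm = math.sqrt(sum(v * v for v in vec) + eps * eps)
    return [v / norm for v in vec]


# A cell containing a vertical edge: all gradient energy falls in the first bin.
edge_cell = [[0, 0, 10, 10] for _ in range(4)]
hist = cell_histogram(edge_cell)
normalized = l2_normalize(hist)
```

Normalization is what gives the descriptor its robustness to illumination: doubling the image contrast doubles every histogram entry but leaves the normalized vector essentially unchanged.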

Deformable Part Models (2008-2010)

Pedro Felzenszwalb and colleagues extended the HOG detector into the Deformable Part-based Model (DPM). DPM represents objects using a coarse root filter combined with several higher-resolution part filters. Each part can shift spatially relative to the root, allowing the model to handle variations in object pose and viewpoint through a latent variable formulation. ^[3]

DPM won the PASCAL VOC detection challenge in 2007, 2008, and 2009, representing the peak of traditional object detection methods. The model uses a star-structured graphical model and is trained with a latent SVM algorithm. ^[3] While effective, DPM-based systems were slow (requiring several seconds per image) and struggled to scale beyond a moderate number of object categories.

Deep learning era: two-stage detectors

The introduction of convolutional neural networks (CNNs) to object detection in 2014 brought dramatic accuracy improvements, launching the modern era of the field.

R-CNN (2014)

Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik introduced Regions with CNN features (R-CNN) in 2014. The system operates in three stages: it first generates approximately 2,000 region proposals using selective search, then extracts a 4,096-dimensional feature vector from each proposal by passing it through an AlexNet-based CNN pretrained on ImageNet, and finally classifies each region with class-specific linear SVMs. ^[4]

R-CNN achieved 53.7% mAP on PASCAL VOC 2010, a large improvement over the prior state of the art. On VOC 2012 the R-CNN paper reported that the method "improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012, achieving a mAP of 53.3%." ^[4] The paper demonstrated two key insights: high-capacity CNNs can be applied to bottom-up region proposals for localization, and supervised pretraining on a large auxiliary dataset followed by domain-specific fine-tuning produces significant performance gains even when labeled detection data is limited. ^[4]

However, R-CNN was computationally expensive because CNN features had to be extracted independently for each of the roughly 2,000 proposals per image, leading to substantial redundant computation. ^[4]

Fast R-CNN (2015)

Ross Girshick addressed the efficiency bottleneck with Fast R-CNN, which processes the entire image through the CNN backbone once and then extracts features for each region proposal using a Region of Interest (RoI) pooling layer. This shared computation made training and inference significantly faster. Fast R-CNN also introduced multi-task learning, jointly optimizing classification and bounding box regression in a single network. ^[5]
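
RoI pooling turns a region of arbitrary size on the shared feature map into a fixed-size grid by max-pooling within sub-windows. The following sketch shows the idea on a single-channel feature map stored as nested lists. The function name, the end-exclusive region format, and the floor/ceil bin boundaries are assumptions made for this illustration, not a reproduction of the original implementation.

```python
def roi_max_pool(feature_map, roi, out_size=2):
    """Max-pool a region (x1, y1, x2, y2), end-exclusive, into an out_size x out_size grid."""
    x1, y1, x2, y2 = roi
    roi_h, roi_w = y2 - y1, x2 - x1
    pooled = []
    for i in range(out_size):
        # Floor for the bin start and ceil for the bin end, so no bin is empty.
        y_start = y1 + (i * roi_h) // out_size
        y_end = y1 + ((i + 1) * roi_h + out_size - 1) // out_size
        row = []
        for j in range(out_size):
            x_start = x1 + (j * roi_w) // out_size
            x_end = x1 + ((j + 1) * roi_w + out_size - 1) // out_size
            row.append(max(feature_map[y][x]
                           for y in range(y_start, y_end)
                           for x in range(x_start, x_end)))
        pooled.append(row)
    return pooled


features = [[r * 4 + c for c in range(4)] for r in range(4)]
whole = roi_max_pool(features, (0, 0, 4, 4))
partial = roi_max_pool(features, (1, 1, 4, 4))
```

Both regions produce a 2x2 output even though they differ in size, which is what lets a single fully connected head process every proposal.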

Fast R-CNN achieved 65.7% mAP on PASCAL VOC 2012 and was roughly 9 times faster than R-CNN at training time and 213 times faster at test time. ^[5] Despite these gains, the system still depended on selective search for region proposals, which remained a computational bottleneck.

Faster R-CNN (2015)

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun eliminated the external proposal mechanism entirely by introducing the Region Proposal Network (RPN). The RPN is a small, fully convolutional network that shares convolutional features with the detection network. It simultaneously predicts objectness scores and bounding box coordinates at each spatial position using a set of reference boxes called anchors. ^[6]
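
Anchors can be generated with a few lines of arithmetic. The sketch below places a set of reference boxes at each position of a feature map. The three scales and three aspect ratios follow the defaults reported in the Faster R-CNN paper, and the stride of 16 is an assumed value typical of a VGG-16 feature map; none of these numbers are stated in this article, and the function names are hypothetical.

```python
import math


def make_anchors(center_x, center_y, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return anchors as (x_min, y_min, x_max, y_max); ratio means height / width."""
    anchors = []
    for scale in scales:
        area = float(scale * scale)
        for ratio in ratios:
            width = math.sqrt(area / ratio)   # keeps the area fixed for every ratio
            height = width * ratio
            anchors.append((center_x - width / 2.0, center_y - height / 2.0,
                            center_x + width / 2.0, center_y + height / 2.0))
    return anchors


def grid_anchors(feat_h, feat_w, stride=16):
    """Place the anchor set at the center of every feature-map cell."""
    anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            anchors.extend(make_anchors((col + 0.5) * stride, (row + 0.5) * stride))
    return anchors


single_position = make_anchors(100.0, 100.0)
all_anchors = grid_anchors(2, 3)
```

The RPN then predicts, for every anchor, an objectness score and four offsets that adjust the anchor toward a nearby object.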

Faster R-CNN unified proposal generation and detection into a single, end-to-end trainable architecture. Using a VGG-16 backbone, it achieved 73.2% mAP on PASCAL VOC 2007 and ran at approximately 5 frames per second on a GPU. With a ResNet-101 backbone, the model reached 27.2% AP on the more challenging MS COCO dataset. ^[6] Faster R-CNN established the two-stage detection paradigm (propose, then classify) that influenced many subsequent architectures.

Feature Pyramid Network (2017)

Tsung-Yi Lin, Piotr Dollar, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie introduced the Feature Pyramid Network (FPN) to address the challenge of detecting objects at different scales. FPN builds a top-down pathway with lateral connections on top of a standard CNN backbone, producing a multi-scale feature pyramid where each level contains semantically strong features at a different spatial resolution. ^[7]

Integrating FPN with Faster R-CNN improved AP on COCO by several points, particularly for small objects, without significant additional computation. ^[7] FPN became a standard component in most later detection architectures.

Single-shot detectors

Two-stage detectors achieve high accuracy but are relatively slow because they process region proposals in a second stage. Single-shot (or one-stage) detectors predict bounding boxes and class probabilities directly from feature maps in a single forward pass, enabling faster inference at some cost to accuracy.

YOLO (2016)

Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi introduced You Only Look Once (YOLO) in 2016. YOLO reframes detection as a single regression problem: the input image is divided into an S x S grid, and each grid cell directly predicts bounding box coordinates, objectness scores, and class probabilities in one evaluation of the network. As the paper states, "A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation." ^[8]
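
The grid formulation can be made concrete with a small sketch. It computes which cell is responsible for an object, based on the object's center, and the size of the network's output. The values S = 7, B = 2 boxes per cell, and C = 20 classes are the PASCAL VOC configuration from the YOLO paper and are not stated in this article; the function names are hypothetical.

```python
def responsible_cell(center_x, center_y, img_w, img_h, S=7):
    """Return (row, col) of the grid cell that contains the object center."""
    col = min(int(center_x / img_w * S), S - 1)   # clamp centers on the far edge
    row = min(int(center_y / img_h * S), S - 1)
    return row, col


def output_size(S=7, B=2, C=20):
    """Each cell predicts B boxes (x, y, w, h, confidence) plus C class probabilities."""
    return S * S * (B * 5 + C)


cell = responsible_cell(224, 224, 448, 448)
total_outputs = output_size()
```

With these settings the output is a 7 x 7 x 30 tensor. Because each cell predicts only a handful of boxes and one set of class probabilities, two small objects whose centers fall in the same cell compete for the same predictions.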

The original YOLOv1 achieved 63.4% mAP on PASCAL VOC 2007 and processed images at 45 frames per second, making it dramatically faster than two-stage methods. The authors report that "Our base YOLO model processes images in real-time at 45 frames per second. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors." ^[8] However, YOLOv1 struggled with small objects and objects in close proximity due to its coarse grid design. ^[8]

YOLO family evolution

The YOLO architecture has been refined extensively over the years by multiple research groups:

Version	Year	Key innovations	AP (COCO unless noted)	Notes
YOLOv1	2016	Grid-based regression, real-time speed	63.4% mAP (VOC)	First real-time deep detector
YOLOv2	2017	Batch normalization, anchor boxes, multi-scale training	21.6% AP	Also called YOLO9000
YOLOv3	2018	Multi-scale predictions, Darknet-53 backbone	33.0% AP	Strong small-object detection
YOLOv4	2020	CSPDarknet53, Mish activation, mosaic augmentation	43.5% AP	Bag of freebies and specials
YOLOv5	2020	PyTorch reimplementation, auto-anchor, Focus layer	50.7% AP	Widely adopted in industry
YOLOv7	2022	Extended efficient layer aggregation (E-ELAN), re-parameterization	51.4% AP	Trainable bag-of-freebies
YOLOv8	2023	Anchor-free head, decoupled head, C2f module	53.9% AP	Ultralytics unified framework
YOLOv9	2024	Programmable Gradient Information (PGI), GELAN architecture	55.6% AP	49% fewer parameters than YOLOv8
YOLOv10	2024	NMS-free via consistent dual assignments, end-to-end	54.4% AP	First end-to-end YOLO
YOLO11	2024	Refined feature extraction, optimized training pipeline	54.7% AP	22% fewer parameters than YOLOv8m

SSD (2016)

Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg introduced the Single Shot MultiBox Detector (SSD) in 2016. SSD predicts bounding boxes and class scores from multiple feature maps at different resolutions within the network, allowing it to handle objects of various sizes without requiring a separate proposal stage. ^[9]

SSD300 (using 300x300 input) achieved 74.3% mAP on PASCAL VOC 2007 at 59 FPS on an NVIDIA Titan X GPU. SSD512 (512x512 input) pushed accuracy to 76.9% mAP, surpassing the then state-of-the-art Faster R-CNN while maintaining real-time speed. On the COCO dataset, SSD512 outperformed Faster R-CNN in both mAP@0.5 and the stricter mAP@[0.5:0.95] metric. ^[9]

RetinaNet and focal loss (2017)

Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollar investigated why single-shot detectors historically lagged behind two-stage methods in accuracy. They identified the extreme foreground-background class imbalance during dense prediction training as the primary cause. To address this, they proposed focal loss, a modified cross-entropy loss that down-weights the contribution of well-classified (easy) examples so the model focuses on hard, misclassified ones. ^[10]
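
The effect of focal loss is easiest to see numerically. The sketch below implements the binary form, FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), for a single prediction. The defaults gamma = 2 and alpha = 0.25 are the values recommended in the RetinaNet paper, not figures from this article, and the small clamp on p_t is an added safeguard.

```python
import math


def cross_entropy(p, y):
    """Binary cross-entropy for predicted foreground probability p and label y (0 or 1)."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(max(p_t, 1e-12))


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Cross-entropy scaled by (1 - p_t)^gamma so easy examples contribute little."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return alpha_t * (1.0 - p_t) ** gamma * cross_entropy(p, y)


# Relative weight of an easy positive (p = 0.9) versus a hard positive (p = 0.1).
easy_weight = focal_loss(0.9, 1) / cross_entropy(0.9, 1)
hard_weight = focal_loss(0.1, 1) / cross_entropy(0.1, 1)
```

Here the easy example keeps only 0.25% of its cross-entropy loss while the hard example keeps about 20%, so the many easy background locations no longer dominate the gradient.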

RetinaNet with a ResNeXt-101-FPN backbone and focal loss surpassed 40 AP on COCO, matching two-stage detectors for the first time while maintaining single-stage inference speed. This result demonstrated that the accuracy gap between one-stage and two-stage detectors was not inherent to the architecture but was caused by the class imbalance problem during training. ^[10]

Anchor-free detectors

Traditional detectors rely on predefined anchor boxes (reference boxes of various sizes and aspect ratios) to generate detection candidates. Anchor-free methods remove this dependency, simplifying the detection pipeline and reducing the number of hyperparameters.

CornerNet (2018)

Hei Law and Jia Deng proposed CornerNet, which detects each object as a pair of keypoints: the top-left corner and the bottom-right corner of its bounding box. The network produces heatmaps for corner locations and embedding vectors for each detected corner. Corners belonging to the same object are grouped by minimizing the distance between their embeddings. CornerNet achieved 40.5% AP on COCO without using anchor boxes. ^[11]

CenterNet (2019)

Two different detectors published in 2019 share the name CenterNet. Xingyi Zhou, Dequan Wang, and Philipp Krahenbuhl proposed "Objects as Points," which represents each object by its center point and regresses the object's width and height directly from the center location. Because each object corresponds to a single peak in the center heatmap, the model needs neither anchor boxes nor NMS, and it can use an hourglass network as its backbone. Separately, Kaiwen Duan, Song Bai, Lingxi Xie, Honggang Qi, Qingming Huang, and Qi Tian extended CornerNet's keypoint-based approach with a triplet representation: top-left corner, bottom-right corner, and center keypoint, with the center keypoint serving as a verification mechanism.

FCOS (2019)

Zhi Tian, Chunhua Shen, Hao Chen, and Tong He introduced FCOS (Fully Convolutional One-Stage Object Detection), which predicts a bounding box at every foreground pixel location by regressing the distances from that pixel to the four sides (left, top, right, bottom) of the bounding box. A "centerness" branch suppresses low-quality detections far from the object center. FCOS demonstrated that simple per-pixel prediction, combined with multi-level FPN features and centerness scoring, can match or exceed the performance of anchor-based detectors on COCO. ^[12]
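
The per-pixel targets and the centerness score are both short formulas. The sketch below computes, for one pixel and one ground truth box, the four distances (l, t, r, b) and the centerness sqrt(min(l, r) / max(l, r) * min(t, b) / max(t, b)) as defined in the FCOS paper. The function names are hypothetical, and the multi-level assignment rules of the full method are omitted.

```python
import math


def fcos_targets(px, py, box):
    """Distances from pixel (px, py) to the left, top, right, and bottom box sides."""
    x_min, y_min, x_max, y_max = box
    return (px - x_min, py - y_min, x_max - px, y_max - py)


def is_foreground(targets):
    """A pixel is a positive sample only if it lies strictly inside the box."""
    return min(targets) > 0


def centerness(targets):
    """1.0 at the box center, falling toward 0 near the box border."""
    l, t, r, b = targets
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))


box = (0.0, 0.0, 100.0, 50.0)
center_score = centerness(fcos_targets(50.0, 25.0, box))
off_center_score = centerness(fcos_targets(10.0, 25.0, box))
```

At inference time the centerness is multiplied with the classification score, so boxes predicted from pixels near the border rank lower and are more likely to be removed by NMS.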

Transformer-based detectors

Transformers, originally developed for natural language processing, were adapted for object detection starting in 2020. These models treat detection as a set prediction problem, often eliminating the need for hand-designed components like anchor boxes and NMS.

DETR (2020)

Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko at Meta AI introduced DEtection TRansformer (DETR), which formulates object detection as a direct set prediction problem. DETR uses a CNN backbone to extract image features, then feeds them to a transformer encoder-decoder architecture. A fixed set of learned object queries attend to the image features through cross-attention. The model is trained using a bipartite matching loss (the Hungarian algorithm) that uniquely assigns each prediction to a ground truth object, eliminating the need for NMS. The authors describe the two main ingredients as "a set-based global loss that forces unique predictions via bipartite matching, and a transformer encoder-decoder architecture." ^[13]
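
The one-to-one assignment can be illustrated on a toy example. The sketch below finds the minimum-cost matching by brute force over permutations, which is a stand-in for the Hungarian algorithm and only practical for a handful of boxes. The cost is also simplified: it combines the negative class probability with a weighted L1 box distance and leaves out the generalized IoU term used in DETR. The data layout, function names, and the weight of 5.0 are assumptions for this illustration.

```python
from itertools import permutations


def match_cost(pred, target, l1_weight=5.0):
    """Lower is better: confident, well-localized predictions are cheap to match."""
    class_cost = -pred["probs"][target["label"]]
    box_cost = sum(abs(p - t) for p, t in zip(pred["box"], target["box"]))
    return class_cost + l1_weight * box_cost


def bipartite_match(preds, targets):
    """Return {target_index: prediction_index} minimizing the total cost."""
    best_cost, best = float("inf"), ()
    for pred_ids in permutations(range(len(preds)), len(targets)):
        cost = sum(match_cost(preds[p], targets[t]) for t, p in enumerate(pred_ids))
        if cost < best_cost:
            best_cost, best = cost, pred_ids
    return {t: p for t, p in enumerate(best)}


# Boxes are normalized (center_x, center_y, width, height).
preds = [
    {"probs": {"cat": 0.1, "dog": 0.8}, "box": (0.5, 0.5, 0.2, 0.2)},
    {"probs": {"cat": 0.7, "dog": 0.2}, "box": (0.2, 0.3, 0.1, 0.1)},
    {"probs": {"cat": 0.3, "dog": 0.3}, "box": (0.9, 0.9, 0.1, 0.1)},
]
targets = [
    {"label": "cat", "box": (0.2, 0.3, 0.1, 0.1)},
    {"label": "dog", "box": (0.5, 0.5, 0.2, 0.2)},
]
assignment = bipartite_match(preds, targets)
```

The third prediction is left unmatched and would be trained toward the "no object" class. Because each ground truth object is assigned to exactly one prediction, the model is penalized for duplicates during training instead of relying on NMS afterward.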

DETR achieved 42 AP on COCO, matching the performance of a well-tuned Faster R-CNN with FPN. However, DETR suffered from slow convergence (requiring 500 training epochs compared to the typical 36 for Faster R-CNN) and poor performance on small objects due to the global nature of the transformer's attention mechanism. ^[13]

Deformable DETR (2021)

Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai addressed DETR's limitations by introducing deformable attention. Instead of attending to all spatial positions in the feature map (which is computationally expensive and spreads attention too thinly), deformable attention attends only to a small set of key sampling points around a reference position. This design enables efficient multi-scale feature processing. ^[14]

Deformable DETR achieved 46.9 AP on COCO while converging 10 times faster than the original DETR (50 epochs versus 500), with particular improvements in small object detection. ^[14]

DINO (2022)

Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M. Ni, and Heung-Yeung Shum proposed DINO (DETR with Improved DeNoising Anchor Boxes). DINO combines contrastive denoising training, mixed query selection, and a "look forward twice" scheme for iterative box refinement. With a Swin Transformer-Large backbone pretrained on Objects365, DINO achieved 63.3 AP on the COCO test-dev benchmark, setting a new state of the art among models that do not use pseudo-labeling or test-time augmentation. ^[15]

RT-DETR (2024)

Yian Zhao, Wenyu Lv, and colleagues introduced RT-DETR (Real-Time DEtection TRansformer), demonstrating that transformer-based detectors can match or surpass YOLO models in real-time settings. RT-DETR-R50 achieved 53.1 AP at 108 FPS, compared to the 5 FPS of DINO with a similar backbone. The paper, titled "DETRs Beat YOLOs on Real-time Object Detection," was presented at CVPR 2024 and showed that efficient hybrid encoder designs and IoU-aware query selection can make transformer detectors viable for latency-sensitive applications. ^[16]

RF-DETR (2025)

Developed by Roboflow, RF-DETR builds on a DINOv2 vision transformer backbone and uses weight-sharing neural architecture search (NAS) to discover optimal accuracy-latency trade-offs. RF-DETR-Large (128M parameters) achieved 60.5 AP on COCO at 25 FPS on an NVIDIA T4 GPU, becoming the first documented real-time model to break the 60 AP barrier on the COCO benchmark. RF-DETR-Base (29M parameters) provides a smaller option for edge deployment scenarios. The model was released under the Apache 2.0 license. ^[25]

Foundation models for open-vocabulary detection

Conventional object detectors can only recognize categories seen during training. A new class of models, often called open-vocabulary or open-set detectors, leverages vision-language pretraining to detect objects described by arbitrary text prompts at inference time, without retraining.

GLIP (2022)

GLIP (Grounded Language-Image Pre-training), developed by Liunian Harold Li and colleagues at Microsoft and UCLA, reformulates object detection as a phrase grounding task. The model learns to align image regions with text phrases during training on a large corpus of 27 million images with region-phrase pairs. GLIP achieved 49.8 AP on COCO in a zero-shot transfer setting (without seeing any COCO training images) and 60.8 AP after fine-tuning on COCO. ^[17]

OWL-ViT (2022)

Google Research introduced OWL-ViT (Open-World Localization with Vision Transformers), which adapts a CLIP-pretrained Vision Transformer for open-vocabulary detection. The model attaches lightweight detection heads to the ViT backbone and uses text embeddings from CLIP's text encoder as classification weights. This allows detection of any object category described in natural language. ^[18]
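
Using text embeddings as classification weights amounts to scoring each image region against each text prompt. The sketch below shows that scoring step with cosine similarity on hand-made three-dimensional vectors. The embeddings, prompts, and function names are entirely hypothetical; a real system would obtain the vectors from the model's image and text encoders, and its exact scoring function may differ.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify_region(region_embedding, text_embeddings):
    """Score one region against every prompt and return the best prompt with all scores."""
    scores = {prompt: cosine(region_embedding, emb) for prompt, emb in text_embeddings.items()}
    return max(scores, key=scores.get), scores


# Toy stand-ins for encoder outputs.
text_embeddings = {
    "a photo of a dog": (0.9, 0.1, 0.0),
    "a photo of a bicycle": (0.0, 0.2, 0.9),
}
best_prompt, scores = classify_region((0.8, 0.2, 0.1), text_embeddings)
```

Adding a new category then only requires encoding a new prompt, which is why the vocabulary can change at inference time without retraining.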

Grounding DINO (2023-2024)

Grounding DINO, developed by Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, and Lei Zhang at IDEA Research, marries the DINO detector with grounded language pretraining. The architecture uses a dual-encoder design with separate image and text backbones, followed by a feature enhancer that fuses image and text features through cross-modal attention. A language-guided query selection module initializes detection queries based on the text input. ^[19]

The follow-up model, Grounding DINO 1.5 Pro, scaled the architecture and training data to over 20 million grounding-annotated images, achieving 54.3 AP on COCO and 55.7 AP on LVIS-minival in zero-shot evaluation. Grounding DINO 1.5 Edge, optimized for deployment, reached 45.0 AP on COCO in zero-shot mode.

Evaluation metrics

Standardized metrics allow fair comparison between object detection models. The most widely used metrics are defined by the COCO evaluation protocol.

Intersection over Union (IoU)

IoU measures the overlap between a predicted bounding box and a ground truth bounding box. It is calculated as the area of their intersection divided by the area of their union. An IoU of 1.0 indicates a perfect match, while 0.0 means no overlap. IoU serves as the basis for determining whether a detection is a true positive or a false positive.
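
A minimal sketch of the computation, assuming boxes are given in (x_min, y_min, x_max, y_max) format:

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (empty if the boxes do not intersect).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    intersection = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


half_overlap = iou((0, 0, 10, 10), (5, 0, 15, 10))
```

Two equal boxes shifted by half their width share an intersection of 50 and a union of 150, giving an IoU of one third, which is below the common 0.5 threshold for a true positive.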

Precision and recall

Precision is the fraction of detections that are correct (true positives divided by all detections). Recall is the fraction of ground truth objects that are successfully detected (true positives divided by all ground truth objects). A precision-recall curve plots precision against recall at varying confidence thresholds.

Average Precision (AP)

Average Precision is the area under the precision-recall curve for a single object category at a given IoU threshold.
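
The sketch below computes AP for one category from a list of detections that have already been marked as true or false positives at some IoU threshold. It uses the all-point variant with a monotonically decreasing precision envelope; the official COCO tooling instead samples precision at 101 recall points, so its numbers can differ slightly. The input format and function name are assumptions for this illustration.

```python
def average_precision(detections, num_ground_truth):
    """detections: list of (confidence, is_true_positive) for a single category."""
    if num_ground_truth == 0:
        return 0.0
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    true_pos = false_pos = 0
    recalls, precisions = [], []
    for _, is_tp in ranked:
        if is_tp:
            true_pos += 1
        else:
            false_pos += 1
        recalls.append(true_pos / num_ground_truth)
        precisions.append(true_pos / (true_pos + false_pos))
    # Precision envelope: each point takes the best precision at equal or higher recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Area under the stepwise precision-recall curve.
    ap, prev_recall = 0.0, 0.0
    for recall, precision in zip(recalls, precisions):
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


ap = average_precision([(0.9, True), (0.8, False), (0.7, True)], num_ground_truth=2)
```

In this example the false positive ranked second lowers the precision at full recall to 2/3, so AP is 0.5 * 1 + 0.5 * 2/3, or about 0.83.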

Metric	Definition
AP@50 (or AP50)	Average Precision at IoU threshold of 0.50
AP@75 (or AP75)	Average Precision at IoU threshold of 0.75 (stricter localization)
AP@[.5:.05:.95]	AP averaged over IoU thresholds from 0.50 to 0.95 in steps of 0.05 (the primary COCO metric)
AP-small	AP for objects with area less than 32x32 pixels
AP-medium	AP for objects with area between 32x32 and 96x96 pixels
AP-large	AP for objects with area greater than 96x96 pixels

Mean Average Precision (mAP)

mAP is the AP averaged across all object categories. In the COCO evaluation protocol, the terms AP and mAP are used interchangeably because the primary metric is already averaged over all 80 categories. The COCO AP@[.5:.05:.95] metric is more demanding than the PASCAL VOC AP@50 metric because it also requires accurate localization at stricter IoU thresholds, up to 0.95. This is one reason AP numbers on COCO are lower than AP@50 numbers on VOC for the same model. ^[20]

Average Recall (AR)

Average Recall measures the maximum recall achievable given a fixed number of detections per image (for example, AR@1, AR@10, AR@100). It provides insight into how many objects a detector can find regardless of confidence calibration.

Benchmark datasets

Progress in object detection has been driven by large, carefully annotated benchmark datasets that enable standardized evaluation.

Dataset	Year	Images	Categories	Annotations	Primary metric
PASCAL VOC	2005-2012	~11,500 (VOC2012)	20	Bounding boxes, segmentation masks	mAP@50
MS COCO	2014	330,000 (200K+ labeled)	80	1.5M object instances, bounding boxes, segmentation masks, captions	AP@[.5:.05:.95]
Open Images V7	2016-2022	~9 million	600 (bounding boxes)	16M+ bounding boxes, 2.8M segmentation masks	mAP@50
LVIS	2019	164,000	1,203	2M+ instance segmentation masks	AP (long-tail)
Objects365	2019	2 million	365	30M+ bounding boxes	AP@[.5:.05:.95]

PASCAL VOC

The PASCAL Visual Object Classes challenge, running from 2005 to 2012, was the first major benchmark for object detection. It defined 20 object categories including person, car, bicycle, dog, and cat. VOC2007 contains 9,963 images with 24,640 annotated objects. The VOC2007 and VOC2012 datasets remain widely used for ablation studies and method comparison, with mAP@50 (AP at 50% IoU) as the standard metric. ^[21]

MS COCO

The Microsoft Common Objects in Context (MS COCO) dataset, introduced by Tsung-Yi Lin and colleagues in 2014, expanded the scope to 80 object categories across 330,000 images, with over 200,000 images labeled with 1.5 million object instances. COCO images show objects in natural, cluttered scenes with significant occlusion and varying scales. The COCO evaluation protocol uses the more stringent AP@[.5:.05:.95] metric and breaks down performance by object size (small, medium, large). COCO has been the primary benchmark for object detection since approximately 2016. ^[20]

Open Images

Google's Open Images dataset (latest version V7) is one of the largest annotated image collections, containing approximately 9 million images with over 16 million bounding boxes across 600 categories. It also provides 2.8 million segmentation masks for 350 categories and 66.4 million point-level labels. The scale and diversity of Open Images make it valuable for pretraining and evaluating detection in the wild. ^[24]

LVIS

The Large Vocabulary Instance Segmentation (LVIS) dataset, created by Agrim Gupta and colleagues at Meta AI, contains over 2 million instance segmentation masks for 1,203 object categories in 164,000 images. The categories follow a natural long-tail distribution, with a few common categories having thousands of examples and many rare categories having fewer than 10 training instances. LVIS is designed to evaluate how well detectors handle rare and infrequent object types. ^[22]

Objects365

Objects365, introduced by Shuai Shao and colleagues in 2019, contains 2 million images with over 30 million bounding boxes across 365 categories. The dataset is denser than COCO (averaging about 5 categories per image) and uses a rigorous three-step annotation pipeline. Models pretrained on Objects365 significantly outperform ImageNet-pretrained models when transferred to other detection tasks. ^[23]

Comparison of major object detectors

The following table summarizes representative object detection models, their architectural approach, and reported performance. COCO AP refers to AP@[.5:.05:.95] unless otherwise noted. Speed measurements vary by hardware and configuration.

Model	Year	Type	Backbone	COCO AP	Speed (FPS)	Key contribution
R-CNN	2014	Two-stage	AlexNet	53.7% mAP (VOC)	~0.02	CNN features for detection
Fast R-CNN	2015	Two-stage	VGG-16	65.7% mAP (VOC)	~0.5	RoI pooling, shared features
Faster R-CNN	2015	Two-stage	VGG-16 / ResNet-101	27.2 (COCO)	~5	Region Proposal Network
SSD512	2016	Single-shot	VGG-16	26.8	~22	Multi-scale feature maps
YOLOv1	2016	Single-shot	Custom (Darknet)	63.4% mAP (VOC)	45	Grid-based regression
RetinaNet	2017	Single-shot	ResNeXt-101-FPN	40.8	~5	Focal loss
YOLOv3	2018	Single-shot	Darknet-53	33.0	~20	Multi-scale predictions
CornerNet	2018	Anchor-free	Hourglass-104	40.5	~4	Keypoint-based detection
FCOS	2019	Anchor-free	ResNeXt-101-FPN	44.7	~18	Per-pixel regression
YOLOv4	2020	Single-shot	CSPDarknet53	43.5	~62	Bag of freebies/specials
DETR	2020	Transformer	ResNet-50	42.0	~28	Set prediction with Hungarian matching
YOLOv5x	2020	Single-shot	CSP-Darknet	50.7	~12	PyTorch ecosystem
Deformable DETR	2021	Transformer	ResNet-50	46.9	~19	Deformable attention
DINO	2022	Transformer	Swin-L (O365 pretrain)	63.3	~5	Denoising anchor boxes
YOLOv8x	2023	Anchor-free	CSP-Darknet	53.9	~15	Unified Ultralytics framework
RT-DETR-R50	2024	Transformer	ResNet-50	53.1	108	Real-time transformer detector
YOLOv9e	2024	Single-shot	GELAN	55.6	~15	Programmable Gradient Info
YOLOv10-X	2024	Anchor-free	CSPDarknet	54.4	~12	NMS-free training
RF-DETR-L	2025	Transformer	DINOv2 ViT	60.5	25	First real-time 60+ AP

What is object detection used for?

Object detection powers a wide range of practical systems across many industries.

Autonomous driving

Self-driving vehicles depend on object detection to identify pedestrians, cyclists, other vehicles, traffic signs, lane markings, and obstacles in real time. Systems such as those from Waymo, Tesla, and Cruise use multi-camera setups, in some cases combined with lidar and radar, and process detections at high frame rates to support safe navigation. Detection accuracy, especially for vulnerable road users like pedestrians and cyclists, directly affects safety.

Video surveillance and security

Surveillance systems use object detection to identify people, vehicles, and suspicious objects in camera feeds. Applications include intrusion detection, crowd counting, abandoned object detection, and license plate recognition. Real-time processing requirements and the need to operate in varying lighting and weather conditions make this a demanding application domain.

Medical imaging

In healthcare, object detection helps radiologists identify abnormalities in X-rays, CT scans, and MRI images. Examples include detecting lung nodules in chest X-rays, identifying tumors in mammograms, and locating polyps in colonoscopy images. Detection models can serve as a "second reader" to reduce missed findings and improve diagnostic efficiency.

Retail and inventory management

Retailers use object detection for automated checkout systems (recognizing products without barcode scanning), shelf monitoring (detecting out-of-stock items or misplaced products), and customer behavior analysis. Amazon Go stores, for instance, pioneered cashier-less retail using computer vision including object detection to track items customers pick up.

Robotics and manufacturing

Industrial robots use object detection to identify and localize parts on assembly lines, guide pick-and-place operations, and perform quality inspection. In warehouse automation, robots detect and grasp items of varying shapes and sizes. Detection accuracy and speed are both important for maintaining production throughput.

Agriculture

Object detection assists in crop monitoring, weed detection, fruit counting, and pest identification. Drones equipped with cameras fly over fields and use detection models to assess crop health and identify areas needing attention, enabling precision agriculture practices that reduce waste and improve yields.

How big is the object detection market?

Object detection is a core component of the broader image recognition and computer vision market, which has grown rapidly with the spread of autonomous systems, smart surveillance, and industrial automation. The global image recognition market was valued at approximately USD 53.3 billion in 2023 and is projected to reach about USD 128.3 billion by 2030, growing at a compound annual growth rate of 12.8% from 2024 to 2030, with object detection cited as one of its principal application segments alongside facial recognition and optical character recognition. ^[26] Key growth drivers include advances in AI algorithms, the expansion of autonomous vehicles, and rising demand for smart surveillance and robotics. ^[26]

Real-time versus accuracy trade-offs

Object detection models exist on a spectrum from lightweight, fast models suitable for edge devices to large, accurate models designed for offline processing.

Speed-optimized models (such as YOLO variants, SSD, and RT-DETR) prioritize inference latency, targeting applications that require processing 30 or more frames per second. These models typically use smaller backbones, reduced input resolutions, and efficient architectural components. They may sacrifice some accuracy on small or occluded objects.

Accuracy-optimized models (such as DINO with a Swin Transformer backbone, Cascade R-CNN, and large-scale foundation models) use deeper backbones, higher input resolutions, and techniques like test-time augmentation and model ensembling. These models achieve the highest AP scores on benchmarks but may process only a few frames per second.

Several techniques help bridge this gap:

Knowledge distillation: Training a smaller "student" model to mimic a larger "teacher" model's predictions.
Model pruning: Removing redundant weights and channels to reduce computation.
Quantization: Reducing weight precision from 32-bit floating point to 8-bit integers, which shrinks weight storage roughly fourfold and speeds up inference on hardware with integer arithmetic support, usually at a small cost in accuracy.
Neural Architecture Search (NAS): Automatically searching for efficient network architectures, as in EfficientDet (whose EfficientNet backbone was found with NAS) and RF-DETR.
Input resolution scaling: Using lower resolution images for faster inference at the cost of missing small objects.
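
To make the quantization item above concrete, here is a minimal sketch of symmetric per-tensor int8 quantization using only the Python standard library. The function names and the sample weights are illustrative assumptions, not any framework's API; production toolchains typically add calibration data, per-channel scales, and zero points.

```python
def quantize_int8(weights):
    """Map floats to signed 8-bit integers with one shared scale (symmetric scheme)."""
    max_abs = max(abs(w) for w in weights)
    # One integer step represents `scale` units of the original float range.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the integer codes."""
    return [v * scale for v in q]


# Hypothetical weights, chosen only for illustration.
weights = [0.82, -1.27, 0.003, 0.41, -0.65]
q, scale = quantize_int8(weights)          # q == [82, -127, 0, 41, -65]
restored = dequantize(q, scale)

# Rounding error per weight is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Note how the smallest weight (0.003) collapses to zero: values much smaller than the largest weight in the tensor lose the most relative precision, which is one reason quantized detectors can lose some accuracy.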

The choice between speed and accuracy depends on the application. Autonomous driving demands both high accuracy and low latency, pushing the development of efficient yet precise models. Offline video analysis or medical image screening can tolerate slower inference in exchange for higher detection rates.
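
One way to frame this choice is as a constrained selection: convert the required frame rate into a per-frame latency budget, then take the most accurate model that fits. The sketch below assumes this framing; the candidate names, AP values, and latencies are made-up placeholders, not measurements of any real detector.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    ap: float          # benchmark Average Precision (placeholder value)
    latency_ms: float  # per-frame latency on the target device (placeholder value)


def frame_budget_ms(target_fps):
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / target_fps


def pick_model(candidates, target_fps):
    """Return the highest-AP candidate that fits the latency budget, or None."""
    budget = frame_budget_ms(target_fps)
    feasible = [c for c in candidates if c.latency_ms <= budget]
    return max(feasible, key=lambda c: c.ap) if feasible else None


# Hypothetical candidates; in practice, measure these on the deployment hardware.
candidates = [
    Candidate("small-detector", ap=40.0, latency_ms=5.0),
    Candidate("medium-detector", ap=50.0, latency_ms=20.0),
    Candidate("large-detector", ap=60.0, latency_ms=250.0),
]

realtime_choice = pick_model(candidates, target_fps=30)  # budget is about 33.3 ms
offline_choice = pick_model(candidates, target_fps=2)    # budget is 500 ms
```

With these placeholder numbers, the 30 frames-per-second budget selects the medium model, while the relaxed offline budget admits the large one.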

Current trends and future directions

As of 2025, several trends are shaping the field of object detection:

Transformer dominance: Transformer-based detectors, particularly DETR variants, have largely closed the speed gap with YOLO models. RT-DETR and RF-DETR report higher accuracy than comparable YOLO models while running in real time. ^[16] ^[25]
Open-vocabulary detection: Models like Grounding DINO, GLIP, and OWL-ViT are enabling detection of arbitrary categories described in natural language, moving beyond fixed category sets.
Vision foundation models: Large pretrained vision models (such as DINOv2 and SAM) are being integrated into detection pipelines as powerful feature extractors, improving generalization across domains.
NMS-free detection: Architectures like DETR and YOLOv10 eliminate the need for non-maximum suppression, simplifying deployment and enabling true end-to-end training.
Edge deployment: Growing demand for object detection on mobile phones, drones, and IoT devices is driving research into extremely efficient models that can run on limited hardware.
3D object detection: Extensions from 2D bounding boxes to 3D detection using point clouds from lidar sensors or monocular depth estimation are increasingly important for autonomous driving and robotics.
Video object detection: Temporal information across video frames is being exploited to improve detection accuracy and consistency, using techniques like feature aggregation and tracking-by-detection.
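
For context on the NMS-free trend listed above, the following is a minimal sketch of the greedy non-maximum suppression step that such architectures remove. It assumes boxes in `(x1, y1, x2, y2)` format and is class-agnostic; the function names, the 0.5 threshold, and the sample boxes are illustrative choices rather than any library's API.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep a box only if no higher-scoring kept box overlaps it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep  # indices into `boxes`, highest score first


# Two near-duplicate boxes on one object, plus one separate object.
boxes = [(10, 10, 110, 110), (12, 12, 112, 112), (200, 200, 300, 300)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)  # kept == [0, 2]
```

The threshold is a hand-tuned hyperparameter and the loop sits outside the network, which is why removing it simplifies deployment.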

References

Viola, P. and Jones, M. "Rapid Object Detection using a Boosted Cascade of Simple Features." *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*, 2001.
Dalal, N. and Triggs, B. "Histograms of Oriented Gradients for Human Detection." *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*, 2005.
Felzenszwalb, P., Girshick, R., McAllester, D., and Ramanan, D. "Object Detection with Discriminatively Trained Part-Based Models." *IEEE Transactions on Pattern Analysis and Machine Intelligence*, 32(9):1627-1645, 2010.
Girshick, R., Donahue, J., Darrell, T., and Malik, J. "Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation." *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*, 2014. arXiv:1311.2524.
Girshick, R. "Fast R-CNN." *Proceedings of the IEEE International Conference on Computer Vision (ICCV)*, 2015.
Ren, S., He, K., Girshick, R., and Sun, J. "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks." *Advances in Neural Information Processing Systems (NeurIPS)*, 2015.
Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B., and Belongie, S. "Feature Pyramid Networks for Object Detection." *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*, 2017.
Redmon, J., Divvala, S., Girshick, R., and Farhadi, A. "You Only Look Once: Unified, Real-Time Object Detection." *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*, 2016. arXiv:1506.02640.
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., and Berg, A.C. "SSD: Single Shot MultiBox Detector." *Proceedings of the European Conference on Computer Vision (ECCV)*, 2016.
Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollár, P. "Focal Loss for Dense Object Detection." *Proceedings of the IEEE International Conference on Computer Vision (ICCV)*, 2017.
Law, H. and Deng, J. "CornerNet: Detecting Objects as Paired Keypoints." *Proceedings of the European Conference on Computer Vision (ECCV)*, 2018.
Tian, Z., Shen, C., Chen, H., and He, T. "FCOS: Fully Convolutional One-Stage Object Detection." *Proceedings of the IEEE International Conference on Computer Vision (ICCV)*, 2019.
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., and Zagoruyko, S. "End-to-End Object Detection with Transformers." *Proceedings of the European Conference on Computer Vision (ECCV)*, 2020. arXiv:2005.12872.
Zhu, X., Su, W., Lu, L., Li, B., Wang, X., and Dai, J. "Deformable DETR: Deformable Transformers for End-to-End Object Detection." *International Conference on Learning Representations (ICLR)*, 2021.
Zhang, H., Li, F., Liu, S., Zhang, L., Su, H., Zhu, J., Ni, L.M., and Shum, H.-Y. "DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection." *International Conference on Learning Representations (ICLR)*, 2023.
Zhao, Y., Lv, W., et al. "DETRs Beat YOLOs on Real-time Object Detection." *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)*, 2024.
Li, L.H., et al. "Grounded Language-Image Pre-training." *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)*, 2022.
Minderer, M., Gritsenko, A., Stone, A., et al. "Simple Open-Vocabulary Object Detection with Vision Transformers." *Proceedings of the European Conference on Computer Vision (ECCV)*, 2022.
Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., et al. "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection." *Proceedings of the European Conference on Computer Vision (ECCV)*, 2024.
Lin, T.-Y., Maire, M., Belongie, S., et al. "Microsoft COCO: Common Objects in Context." *Proceedings of the European Conference on Computer Vision (ECCV)*, 2014.
Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., and Zisserman, A. "The Pascal Visual Object Classes (VOC) Challenge." *International Journal of Computer Vision*, 88(2):303-338, 2010.
Gupta, A., Dollár, P., and Girshick, R. "LVIS: A Dataset for Large Vocabulary Instance Segmentation." *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)*, 2019.
Shao, S., et al. "Objects365: A Large-Scale, High-Quality Dataset for Object Detection." *Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)*, 2019.
Kuznetsova, A., et al. "The Open Images Dataset V4: Unified Image Classification, Object Detection, and Visual Relationship Detection at Scale." *International Journal of Computer Vision*, 128:1956-1981, 2020.
Roboflow. "RF-DETR: Real-Time Detection Transformer." GitHub repository, 2025. https://github.com/roboflow/rf-detr
Grand View Research. "Image Recognition Market Size, Share & Trends Analysis Report, 2024-2030." 2024. https://www.grandviewresearch.com/industry-analysis/image-recognition-market
