Object detection is a computer vision task that involves identifying and localizing instances of objects within images or video frames. Unlike image classification, which assigns a single label to an entire image, object detection outputs a set of bounding boxes, each paired with a class label and a confidence score. This makes it possible to recognize multiple objects of different types and pinpoint their positions in a scene simultaneously.
The field has progressed from handcrafted feature methods in the early 2000s to deep learning-based approaches that now dominate benchmarks and real-world applications. Object detection serves as a foundational capability for tasks including image segmentation, object tracking, scene understanding, and activity recognition.
Given an input image, an object detector produces a list of detections. Each detection consists of:
- A bounding box, typically encoded as corner coordinates (x1, y1, x2, y2) or as a center point with a width and height, that localizes the object.
- A class label identifying the object's category.
- A confidence score indicating how certain the model is that the box contains an object of that class.
During training, the model learns to minimize a combined loss that includes a localization component (penalizing inaccurate bounding boxes) and a classification component (penalizing wrong labels). Most modern detectors also incorporate a post-processing step called non-maximum suppression (NMS), which removes duplicate detections by suppressing overlapping boxes with lower confidence scores.
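The greedy NMS procedure described above can be sketched in a few lines. This is an illustrative NumPy version, not the optimized implementation of any particular framework:

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression.

    boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidences.
    Returns indices of the boxes that survive suppression.
    """
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # highest confidence first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # IoU of the winning box with every remaining box
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # drop boxes that overlap the winner at or above the threshold
        order = order[1:][iou < iou_threshold]
    return keep

boxes = np.array([[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 150, 150]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2]: the near-duplicate second box is suppressed
```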
Before the deep learning revolution, object detection relied on manually designed features and classical machine learning classifiers. Three methods from this era were especially influential.
Paul Viola and Michael Jones introduced a real-time face detection framework in 2001 that became one of the most widely deployed object detectors of its time. The detector uses three key ideas: Haar-like features computed rapidly using an integral image representation, the AdaBoost algorithm to select a small set of discriminative features from a large pool, and a cascade classifier structure that quickly rejects non-face image regions in early stages while spending more computation on ambiguous regions.
Running on a 700 MHz Intel Pentium III processor, the Viola-Jones detector achieved real-time face detection without hardware acceleration, processing images tens to hundreds of times faster than competing methods at comparable accuracy levels. The framework operates by sliding a detection window across the image at multiple scales and positions.
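The integral image trick that makes Haar-like features fast is simple to demonstrate: after one pass of cumulative sums, the sum of any rectangle costs at most four array lookups. A minimal sketch:

```python
import numpy as np

def integral_image(img):
    """Cumulative sums along both axes; ii[r, c] = sum of img[:r+1, :c+1]."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, r1, c1, r2, c2):
    """Sum of img[r1:r2+1, c1:c2+1] using at most four lookups into ii."""
    total = ii[r2, c2]
    if r1 > 0:
        total -= ii[r1 - 1, c2]
    if c1 > 0:
        total -= ii[r2, c1 - 1]
    if r1 > 0 and c1 > 0:
        total += ii[r1 - 1, c1 - 1]
    return total

img = np.arange(16, dtype=float).reshape(4, 4)
ii = integral_image(img)
# A two-rectangle Haar-like feature: left half minus right half of a window
left = rect_sum(ii, 0, 0, 3, 1)
right = rect_sum(ii, 0, 2, 3, 3)
print(left - right)  # -16.0
```

Because every feature evaluation is constant-time regardless of window size, the cascade can scan thousands of windows per frame.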
Navneet Dalal and Bill Triggs proposed the Histogram of Oriented Gradients (HOG) descriptor in 2005, motivated primarily by the problem of pedestrian detection. HOG features capture the distribution of gradient orientations in localized portions of an image. The descriptor is computed on a dense grid of uniformly spaced cells, with overlapping local contrast normalization applied over larger blocks to achieve robustness to illumination and shadowing changes.
HOG features paired with a linear support vector machine (SVM) classifier established a strong baseline for pedestrian detection on the INRIA Person dataset. The approach balanced feature invariance (to translation, scale, and illumination) with the ability to capture fine-grained shape information.
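The core of the descriptor, per-cell histograms of gradient orientations, can be sketched in plain NumPy. This simplification omits the block-level contrast normalization that the full HOG pipeline applies:

```python
import numpy as np

def hog_cell_histograms(img, cell=8, bins=9):
    """Simplified HOG: per-cell histograms of gradient orientations.

    A sketch of the core idea only; the full descriptor also normalizes
    over overlapping blocks of neighboring cells.
    """
    gy, gx = np.gradient(img.astype(float))
    magnitude = np.hypot(gx, gy)
    # unsigned orientation in [0, 180) degrees, as in Dalal & Triggs
    orientation = np.rad2deg(np.arctan2(gy, gx)) % 180.0
    h, w = img.shape
    ch, cw = h // cell, w // cell
    hist = np.zeros((ch, cw, bins))
    bin_width = 180.0 / bins
    for r in range(ch):
        for c in range(cw):
            mag = magnitude[r*cell:(r+1)*cell, c*cell:(c+1)*cell]
            ori = orientation[r*cell:(r+1)*cell, c*cell:(c+1)*cell]
            idx = np.minimum((ori / bin_width).astype(int), bins - 1)
            for b in range(bins):
                hist[r, c, b] = mag[idx == b].sum()
    return hist

img = np.tile(np.arange(32, dtype=float), (32, 1))  # horizontal intensity ramp
hist = hog_cell_histograms(img)
print(hist.shape)  # (4, 4, 9); all gradient energy lands in the 0-degree bin
```

The flattened, normalized histograms form the feature vector fed to the linear SVM.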
Pedro Felzenszwalb and colleagues extended the HOG detector into the Deformable Part-based Model (DPM). DPM represents objects using a coarse root filter combined with several higher-resolution part filters. Each part can shift spatially relative to the root, allowing the model to handle variations in object pose and viewpoint through a latent variable formulation.
DPM won the PASCAL VOC detection challenge in 2007, 2008, and 2009, representing the peak of traditional object detection methods. The model uses a star-structured graphical model and is trained with a latent SVM algorithm. While effective, DPM-based systems were slow (requiring several seconds per image) and struggled to scale beyond a moderate number of object categories.
The introduction of convolutional neural networks (CNNs) to object detection in 2014 brought dramatic accuracy improvements, launching the modern era of the field.
Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik introduced Regions with CNN features (R-CNN) in 2014. The system operates in three stages: it first generates approximately 2,000 region proposals using selective search, then extracts a 4,096-dimensional feature vector from each proposal by passing it through an AlexNet-based CNN pretrained on ImageNet, and finally classifies each region with class-specific linear SVMs.
R-CNN achieved 53.7% mAP on PASCAL VOC 2010, a large improvement over the prior state of the art. The paper demonstrated two key insights: high-capacity CNNs can be applied to bottom-up region proposals for localization, and supervised pretraining on a large auxiliary dataset followed by domain-specific fine-tuning produces significant performance gains even when labeled detection data is limited.
However, R-CNN was computationally expensive because CNN features had to be extracted independently for each of the roughly 2,000 proposals per image, leading to substantial redundant computation.
Ross Girshick addressed the efficiency bottleneck with Fast R-CNN, which processes the entire image through the CNN backbone once and then extracts features for each region proposal using a Region of Interest (RoI) pooling layer. This shared computation made training and inference significantly faster. Fast R-CNN also introduced multi-task learning, jointly optimizing classification and bounding box regression in a single network.
Fast R-CNN achieved 65.7% mAP on PASCAL VOC 2012 and was roughly 9 times faster than R-CNN at training time and 213 times faster at test time. Despite these gains, the system still depended on selective search for region proposals, which remained a computational bottleneck.
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun eliminated the external proposal mechanism entirely by introducing the Region Proposal Network (RPN). The RPN is a small, fully convolutional network that shares convolutional features with the detection network. It simultaneously predicts objectness scores and bounding box coordinates at each spatial position using a set of reference boxes called anchors.
Faster R-CNN unified proposal generation and detection into a single, end-to-end trainable architecture. Using a VGG-16 backbone, it achieved 73.2% mAP on PASCAL VOC 2007 and ran at approximately 5 frames per second on a GPU. With a ResNet-101 backbone, the model reached 27.2% AP on the more challenging MS COCO dataset. Faster R-CNN established the two-stage detection paradigm (propose, then classify) that influenced many subsequent architectures.
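The anchor scheme at the heart of the RPN is easy to make concrete: a fixed set of reference boxes is tiled over every feature-map position. The sketch below uses the 3 scales and 3 aspect ratios from the Faster R-CNN paper; the exact sizes and stride are configuration choices, not fixed by the method:

```python
import numpy as np

def make_anchors(feature_h, feature_w, stride=16,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Generate reference anchor boxes for an RPN-style head.

    One anchor per (scale, ratio) pair is centered at every feature-map
    position; 3 scales x 3 ratios gives the 9 anchors per location used
    in Faster R-CNN. Returns (H * W * 9, 4) boxes as [x1, y1, x2, y2]
    in image coordinates.
    """
    anchors = []
    for y in range(feature_h):
        for x in range(feature_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for s in scales:
                for r in ratios:
                    w = s * np.sqrt(r)       # vary aspect ratio at constant area
                    h = s / np.sqrt(r)
                    anchors.append([cx - w/2, cy - h/2, cx + w/2, cy + h/2])
    return np.array(anchors)

anchors = make_anchors(38, 50)               # e.g. a 600x800 image at stride 16
print(anchors.shape)  # (17100, 4)
```

The RPN then predicts, for each anchor, an objectness score and four offsets that deform the anchor into a proposal.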
Tsung-Yi Lin, Piotr Dollar, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie introduced the Feature Pyramid Network (FPN) to address the challenge of detecting objects at different scales. FPN builds a top-down pathway with lateral connections on top of a standard CNN backbone, producing a multi-scale feature pyramid where each level contains semantically strong features at a different spatial resolution.
Integrating FPN with Faster R-CNN improved AP on COCO by several points, particularly for small objects, without significant additional computation. FPN became a standard component in most later detection architectures.
Two-stage detectors achieve high accuracy but are relatively slow because they process region proposals in a second stage. Single-shot (or one-stage) detectors predict bounding boxes and class probabilities directly from feature maps in a single forward pass, enabling faster inference at some cost to accuracy.
Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi introduced You Only Look Once (YOLO) in 2016. YOLO reframes detection as a single regression problem: the input image is divided into an S x S grid, and each grid cell directly predicts bounding box coordinates, objectness scores, and class probabilities in one evaluation of the network.
The original YOLOv1 achieved 63.4% mAP on PASCAL VOC 2007 and processed images at 45 frames per second (155 FPS with a smaller variant called Fast YOLO), making it dramatically faster than two-stage methods. However, YOLOv1 struggled with small objects and objects in close proximity due to its coarse grid design.
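Decoding YOLOv1-style grid predictions into absolute boxes illustrates the grid design. This is a simplified sketch with one box per cell and no class probabilities; the real head predicts B boxes plus class scores per cell:

```python
import numpy as np

def decode_grid(pred, img_size=448, S=7):
    """Decode YOLOv1-style grid predictions into absolute boxes.

    pred: (S, S, 5) with [x, y, w, h, confidence] per cell, where (x, y)
    is the box center's offset inside its cell in [0, 1] and (w, h) are
    box sizes as fractions of the whole image.
    """
    cell = img_size / S
    boxes = []
    for row in range(S):
        for col in range(S):
            x, y, w, h, conf = pred[row, col]
            cx = (col + x) * cell            # absolute center coordinates
            cy = (row + y) * cell
            bw, bh = w * img_size, h * img_size
            boxes.append([cx - bw/2, cy - bh/2, cx + bw/2, cy + bh/2, conf])
    return np.array(boxes)

pred = np.zeros((7, 7, 5))
pred[3, 3] = [0.5, 0.5, 0.25, 0.25, 0.9]     # object centered in the middle cell
boxes = decode_grid(pred)
best = boxes[boxes[:, 4].argmax()]
print(best[:4])  # [168. 168. 280. 280.]: a 112-pixel box centered at (224, 224)
```

Because each cell owns at most a couple of boxes, two small objects whose centers fall in the same cell compete for the same prediction slots, which is the coarse-grid limitation noted above.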
The YOLO architecture has been refined extensively over the years by multiple research groups:
| Version | Year | Key innovations | COCO AP | Notes |
|---|---|---|---|---|
| YOLOv1 | 2016 | Grid-based regression, real-time speed | 63.4% mAP (VOC) | First real-time deep detector |
| YOLOv2 | 2017 | Batch normalization, anchor boxes, multi-scale training | 21.6% AP | Also called YOLO9000 |
| YOLOv3 | 2018 | Multi-scale predictions, Darknet-53 backbone | 33.0% AP | Strong small-object detection |
| YOLOv4 | 2020 | CSPDarknet53, Mish activation, mosaic augmentation | 43.5% AP | Bag of freebies and specials |
| YOLOv5 | 2020 | PyTorch reimplementation, auto-anchor, Focus layer | 50.7% AP | Widely adopted in industry |
| YOLOv7 | 2022 | Extended efficient layer aggregation (E-ELAN), re-parameterization | 51.4% AP | Trainable bag-of-freebies |
| YOLOv8 | 2023 | Anchor-free head, decoupled head, C2f module | 53.9% AP | Ultralytics unified framework |
| YOLOv9 | 2024 | Programmable Gradient Information (PGI), GELAN architecture | 55.6% AP | 49% fewer parameters than YOLOv8 |
| YOLOv10 | 2024 | NMS-free via consistent dual assignments, end-to-end | 54.4% AP | First end-to-end YOLO |
| YOLO11 | 2024 | Refined feature extraction, optimized training pipeline | 54.7% AP | 22% fewer parameters than YOLOv8m |
Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg introduced the Single Shot MultiBox Detector (SSD) in 2016. SSD predicts bounding boxes and class scores from multiple feature maps at different resolutions within the network, allowing it to handle objects of various sizes without requiring a separate proposal stage.
SSD300 (using 300x300 input) achieved 74.3% mAP on PASCAL VOC 2007 at 59 FPS on an NVIDIA Titan X GPU. SSD512 (512x512 input) pushed accuracy to 76.9% mAP, surpassing the then state-of-the-art Faster R-CNN while maintaining real-time speed. On the COCO dataset, SSD512 outperformed Faster R-CNN in both mAP@0.5 and the stricter mAP@[0.5:0.95] metric.
Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollar investigated why single-shot detectors historically lagged behind two-stage methods in accuracy. They identified the extreme foreground-background class imbalance during dense prediction training as the primary cause. To address this, they proposed focal loss, a modified cross-entropy loss that down-weights the contribution of well-classified (easy) examples so the model focuses on hard, misclassified ones.
RetinaNet with a ResNeXt-101-FPN backbone and focal loss surpassed 40 AP on COCO, matching two-stage detectors for the first time while maintaining single-stage inference speed. This result demonstrated that the accuracy gap between one-stage and two-stage detectors was not inherent to the architecture but was caused by the class imbalance problem during training.
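Focal loss itself is a one-line modification of cross-entropy; the sketch below shows how sharply it down-weights easy examples (the probabilities are hypothetical, chosen for illustration):

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss, FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).

    p: predicted foreground probabilities; y: labels in {0, 1}.
    With gamma = 0 this reduces to alpha-weighted cross-entropy.
    """
    p_t = np.where(y == 1, p, 1 - p)
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)

# An easy background box (scored 0.01) vs. a hard one (scored 0.6)
easy, hard = focal_loss(np.array([0.01, 0.6]), np.array([0, 0]))
print(hard / easy)  # the hard example contributes vastly more loss
```

With gamma = 2, a well-classified example loses most of its loss contribution, so the tens of thousands of easy background anchors no longer swamp the gradient.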
Traditional detectors rely on predefined anchor boxes (reference boxes of various sizes and aspect ratios) to generate detection candidates. Anchor-free methods remove this dependency, simplifying the detection pipeline and reducing the number of hyperparameters.
Hei Law and Jia Deng proposed CornerNet, which detects each object as a pair of keypoints: the top-left corner and the bottom-right corner of its bounding box. The network produces heatmaps for corner locations and embedding vectors for each detected corner. Corners belonging to the same object are grouped by minimizing the distance between their embeddings. CornerNet achieved 40.5% AP on COCO without using anchor boxes.
Kaiwen Duan, Song Bai, Lingxi Xie, Honggang Qi, Qingming Huang, and Qi Tian extended the keypoint-based approach with CenterNet, which detects each object as a triplet of keypoints: the top-left corner, the bottom-right corner, and a center keypoint that serves as a verification mechanism for rejecting implausible corner pairs. A separate model of the same name by Xingyi Zhou, Dequan Wang, and Philipp Krahenbuhl (presented as "Objects as Points") represents each object by its center point alone and regresses the object's width and height directly from the center location; it uses an hourglass network backbone and eliminates the need for both anchor boxes and NMS.
Zhi Tian, Chunhua Shen, Hao Chen, and Tong He introduced FCOS (Fully Convolutional One-Stage Object Detection), which predicts a bounding box at every foreground pixel location by regressing the distances from that pixel to the four sides (left, top, right, bottom) of the bounding box. A "centerness" branch suppresses low-quality detections far from the object center. FCOS demonstrated that simple per-pixel prediction, combined with multi-level FPN features and centerness scoring, can match or exceed the performance of anchor-based detectors on COCO.
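The FCOS regression targets and centerness score for a single foreground pixel can be computed directly from its position and the ground-truth box. A minimal sketch:

```python
import numpy as np

def fcos_targets(px, py, box):
    """FCOS regression target and centerness for a pixel inside a box.

    Returns the (l, t, r, b) distances from the pixel to the box sides and
    the centerness score sqrt(min(l,r)/max(l,r) * min(t,b)/max(t,b)).
    """
    x1, y1, x2, y2 = box
    l, t, r, b = px - x1, py - y1, x2 - px, y2 - py
    centerness = np.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))
    return (l, t, r, b), centerness

box = (10, 10, 110, 60)
_, center = fcos_targets(60, 35, box)        # the exact box center
_, edge = fcos_targets(15, 12, box)          # a pixel near a corner
print(center, edge)  # 1.0 at the center vs. a much smaller score near the edge
```

At inference, the predicted centerness multiplies the classification score, so off-center boxes rank low and are removed by NMS.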
Transformers, originally developed for natural language processing, were adapted for object detection starting in 2020. These models treat detection as a set prediction problem, often eliminating the need for hand-designed components like anchor boxes and NMS.
Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko at Facebook AI Research (now Meta AI) introduced DEtection TRansformer (DETR), which formulates object detection as a direct set prediction problem. DETR uses a CNN backbone to extract image features, then feeds them to a transformer encoder-decoder architecture. A fixed set of learned object queries attends to the image features through cross-attention. The model is trained with a bipartite matching loss, computed via the Hungarian algorithm, that uniquely assigns each prediction to a ground truth object, eliminating the need for NMS.
DETR achieved 42 AP on COCO, matching the performance of a well-tuned Faster R-CNN with FPN. However, DETR suffered from slow convergence (requiring 500 training epochs compared to the typical 36 for Faster R-CNN) and poor performance on small objects due to the global nature of the transformer's attention mechanism.
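The bipartite matching step can be illustrated on a toy cost matrix. The brute-force search below is for clarity only; DETR uses the Hungarian algorithm (for example via scipy.optimize.linear_sum_assignment) to compute the same assignment in polynomial time:

```python
from itertools import permutations

def match(cost):
    """Minimum-cost one-to-one assignment of predictions to ground truths.

    cost[i][j] is the matching cost (class + box terms in DETR) between
    prediction i and ground truth j. Brute force over permutations here;
    the Hungarian algorithm finds the same optimum in O(n^3).
    """
    n = len(cost)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        c = sum(cost[i][perm[i]] for i in range(n))
        if c < best_cost:
            best_perm, best_cost = perm, c
    return best_perm, best_cost

# 3 object queries vs. 3 ground-truth objects (illustrative costs)
cost = [[0.9, 0.1, 0.8],
        [0.2, 0.7, 0.9],
        [0.6, 0.8, 0.1]]
print(match(cost))  # each prediction is matched to a unique ground truth
```

Because the matching is one-to-one, duplicate predictions for the same object are penalized during training, which is why no NMS is needed at inference.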
Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai addressed DETR's limitations by introducing deformable attention. Instead of attending to all spatial positions in the feature map (which is computationally expensive and spreads attention too thinly), deformable attention attends only to a small set of key sampling points around a reference position. This design enables efficient multi-scale feature processing.
Deformable DETR achieved 46.9 AP on COCO while converging 10 times faster than the original DETR (50 epochs versus 500), with particular improvements in small object detection.
Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M. Ni, and Heung-Yeung Shum proposed DINO (DETR with Improved DeNoising Anchor Boxes). DINO combines contrastive denoising training, mixed query selection, and a "look forward twice" scheme for iterative box refinement. With a Swin Transformer-Large backbone pretrained on Objects365, DINO achieved 63.3 AP on the COCO test-dev benchmark, setting a new state of the art among models that do not use pseudo-labeling or test-time augmentation.
Yian Zhao, Wenyu Lv, and colleagues introduced RT-DETR (Real-Time DEtection TRansformer), demonstrating that transformer-based detectors can match or surpass YOLO models in real-time settings. RT-DETR-R50 achieved 53.1 AP at 108 FPS, compared to the 5 FPS of DINO with a similar backbone. The paper, titled "DETRs Beat YOLOs on Real-time Object Detection," was presented at CVPR 2024 and showed that efficient hybrid encoder designs and IoU-aware query selection can make transformer detectors viable for latency-sensitive applications.
Developed by Roboflow, RF-DETR builds on a DINOv2 vision transformer backbone and uses weight-sharing neural architecture search (NAS) to discover optimal accuracy-latency trade-offs. RF-DETR-Large (128M parameters) achieved 60.5 AP on COCO at 25 FPS on an NVIDIA T4 GPU, becoming the first documented real-time model to break the 60 AP barrier on the COCO benchmark. RF-DETR-Base (29M parameters) provides a smaller option for edge deployment scenarios. The model was released under the Apache 2.0 license.
Conventional object detectors can only recognize categories seen during training. A new class of models, often called open-vocabulary or open-set detectors, leverages vision-language pretraining to detect objects described by arbitrary text prompts at inference time, without retraining.
GLIP (Grounded Language-Image Pre-training), developed by Liunian Harold Li and colleagues at Microsoft and UCLA, reformulates object detection as a phrase grounding task. The model learns to align image regions with text phrases during training on a large corpus of 27 million images with region-phrase pairs. GLIP achieved 49.8 AP on COCO in a zero-shot transfer setting (without seeing any COCO training images) and 60.8 AP after fine-tuning on COCO.
Google Research introduced OWL-ViT (Open-World Localization with Vision Transformers), which adapts a CLIP-pretrained Vision Transformer for open-vocabulary detection. The model attaches lightweight detection heads to the ViT backbone and uses text embeddings from CLIP's text encoder as classification weights. This allows detection of any object category described in natural language.
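The classification-by-text-embedding idea common to these models can be sketched with cosine similarity in a shared embedding space. Random vectors stand in for real CLIP embeddings here, so this shows only the mechanism:

```python
import numpy as np

def classify_regions(region_embeds, text_embeds, prompts):
    """Zero-shot region classification in the style of CLIP-based detectors.

    Both embedding sets are assumed to live in a shared vision-language
    space; each region is assigned the prompt whose embedding is most
    similar by cosine similarity.
    """
    r = region_embeds / np.linalg.norm(region_embeds, axis=1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    sims = r @ t.T                           # (num_regions, num_prompts)
    return [prompts[i] for i in sims.argmax(axis=1)]

rng = np.random.default_rng(0)
text_embeds = rng.normal(size=(3, 64))       # stand-ins for "a cat", "a dog", "a bicycle"
region_embeds = text_embeds[[2, 0]] + 0.1 * rng.normal(size=(2, 64))
labels = classify_regions(region_embeds, text_embeds, ["a cat", "a dog", "a bicycle"])
print(labels)  # ['a bicycle', 'a cat']
```

Swapping in a new category requires only encoding a new text prompt, not retraining the detector.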
Grounding DINO, developed by Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, and Lei Zhang at IDEA Research, marries the DINO detector with grounded language pretraining. The architecture uses a dual-encoder design with separate image and text backbones, followed by a feature enhancer that fuses image and text features through cross-modal attention. A language-guided query selection module initializes detection queries based on the text input.
The follow-up model, Grounding DINO 1.5 Pro, scaled the architecture and training data to over 20 million grounding-annotated images, achieving 54.3 AP on COCO and 55.7 AP on LVIS-minival in zero-shot evaluation. Grounding DINO 1.5 Edge, optimized for deployment, reached 45.0 AP on COCO in zero-shot mode.
Standardized metrics allow fair comparison between object detection models. The most widely used metrics are defined by the COCO evaluation protocol.
IoU measures the overlap between a predicted bounding box and a ground truth bounding box. It is calculated as the area of their intersection divided by the area of their union. An IoU of 1.0 indicates a perfect match, while 0.0 means no overlap. IoU serves as the basis for determining whether a detection is a true positive or a false positive.
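IoU is a few lines of arithmetic for axis-aligned boxes in [x1, y1, x2, y2] form:

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou([0, 0, 100, 100], [50, 50, 150, 150]))  # 2500 / 17500, about 0.143
print(iou([0, 0, 100, 100], [0, 0, 100, 100]))    # 1.0
```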
Precision is the fraction of detections that are correct (true positives divided by all detections). Recall is the fraction of ground truth objects that are successfully detected (true positives divided by all ground truth objects). A precision-recall curve plots precision against recall at varying confidence thresholds.
Average Precision is the area under the precision-recall curve for a single object category at a given IoU threshold.
| Metric | Definition |
|---|---|
| AP@50 (or AP50) | Average Precision at IoU threshold of 0.50 |
| AP@75 (or AP75) | Average Precision at IoU threshold of 0.75 (stricter localization) |
| AP@[.5:.05:.95] | AP averaged over IoU thresholds from 0.50 to 0.95 in steps of 0.05 (the primary COCO metric) |
| AP-small | AP for objects with area less than 32x32 pixels |
| AP-medium | AP for objects with area between 32x32 and 96x96 pixels |
| AP-large | AP for objects with area greater than 96x96 pixels |
mAP is the AP averaged across all object categories. In the COCO evaluation protocol, the terms AP and mAP are used interchangeably because the primary metric is already averaged over all 80 categories. The COCO AP@[.5:.05:.95] metric is more demanding than the PASCAL VOC AP@50 metric because it requires accurate localization across multiple IoU thresholds. This explains why AP numbers on COCO appear lower than on VOC for the same model.
Average Recall measures the maximum recall achievable given a fixed number of detections per image (for example, AR@1, AR@10, AR@100). It provides insight into how many objects a detector can find regardless of confidence calibration.
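Putting the pieces together, AP is the area under the interpolated precision-recall curve. The sketch below uses all-point interpolation (as in VOC 2010 and later; COCO's 101-point variant approximates the same area):

```python
import numpy as np

def average_precision(recalls, precisions):
    """Area under the precision-recall curve with all-point interpolation.

    recalls must be sorted ascending. Precision is first made monotonically
    non-increasing: the interpolated precision at recall r is the maximum
    precision achieved at any recall >= r.
    """
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([0.0], precisions, [0.0]))
    for i in range(len(p) - 2, -1, -1):      # enforce monotone precision
        p[i] = max(p[i], p[i + 1])
    changes = np.where(r[1:] != r[:-1])[0]   # steps where recall increases
    return float(np.sum((r[changes + 1] - r[changes]) * p[changes + 1]))

# A toy detector whose precision falls as recall rises
ap = average_precision(np.array([0.2, 0.4, 0.6, 0.8]),
                       np.array([1.0, 0.8, 0.6, 0.5]))
print(round(ap, 3))  # 0.58
```

Repeating this per category and averaging gives mAP; repeating it over IoU thresholds from 0.50 to 0.95 and averaging gives the primary COCO metric.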
Progress in object detection has been driven by large, carefully annotated benchmark datasets that enable standardized evaluation.
| Dataset | Year | Images | Categories | Annotations | Primary metric |
|---|---|---|---|---|---|
| PASCAL VOC | 2005-2012 | ~11,500 (VOC2012) | 20 | Bounding boxes, segmentation masks | mAP@50 |
| MS COCO | 2014 | 330,000 (200K+ labeled) | 80 | 1.5M object instances, bounding boxes, segmentation masks, captions | AP@[.5:.05:.95] |
| Open Images V7 | 2016-2022 | ~9 million | 600 (bounding boxes) | 16M+ bounding boxes, 2.8M segmentation masks | mAP@50 |
| LVIS | 2019 | 164,000 | 1,203 | 2M+ instance segmentation masks | AP (long-tail) |
| Objects365 | 2019 | 2 million | 365 | 30M+ bounding boxes | AP@[.5:.05:.95] |
The PASCAL Visual Object Classes challenge, running from 2005 to 2012, was the first major benchmark for object detection. It defined 20 object categories including person, car, bicycle, dog, and cat. VOC2007 contains 9,963 images with 24,640 annotated objects. The VOC2007 and VOC2012 datasets remain widely used for ablation studies and method comparison, with mAP@50 (AP at 50% IoU) as the standard metric.
The Microsoft Common Objects in Context (MS COCO) dataset, introduced by Tsung-Yi Lin and colleagues in 2014, expanded the scope to 80 object categories across 330,000 images, with over 200,000 images labeled with 1.5 million object instances. COCO images show objects in natural, cluttered scenes with significant occlusion and varying scales. The COCO evaluation protocol uses the more stringent AP@[.5:.05:.95] metric and breaks down performance by object size (small, medium, large). COCO has been the primary benchmark for object detection since approximately 2016.
Google's Open Images dataset (latest version V7) is one of the largest annotated image collections, containing approximately 9 million images with over 16 million bounding boxes across 600 categories. It also provides 2.8 million segmentation masks for 350 categories and 66.4 million point-level labels. The scale and diversity of Open Images make it valuable for pretraining and evaluating detection in the wild.
The Large Vocabulary Instance Segmentation (LVIS) dataset, created by Agrim Gupta and colleagues at Meta AI, contains over 2 million instance segmentation masks for 1,203 object categories in 164,000 images. The categories follow a natural long-tail distribution, with a few common categories having thousands of examples and many rare categories having fewer than 10 training instances. LVIS is designed to evaluate how well detectors handle rare and infrequent object types.
Objects365, introduced by Shuai Shao and colleagues in 2019, contains 2 million images with over 30 million bounding boxes across 365 categories. The dataset is denser than COCO (averaging about 5 categories per image) and uses a rigorous three-step annotation pipeline. Pretrained models on Objects365 significantly outperform ImageNet-pretrained models when transferred to other detection tasks.
The following table summarizes representative object detection models, their architectural approach, and reported performance. COCO AP refers to AP@[.5:.05:.95] unless otherwise noted. Speed measurements vary by hardware and configuration.
| Model | Year | Type | Backbone | COCO AP | Speed (FPS) | Key contribution |
|---|---|---|---|---|---|---|
| R-CNN | 2014 | Two-stage | AlexNet | 53.7% mAP (VOC) | ~0.02 | CNN features for detection |
| Fast R-CNN | 2015 | Two-stage | VGG-16 | 65.7% mAP (VOC) | ~0.5 | RoI pooling, shared features |
| Faster R-CNN | 2015 | Two-stage | VGG-16 / ResNet-101 | 27.2 (COCO) | ~5 | Region Proposal Network |
| SSD512 | 2016 | Single-shot | VGG-16 | 26.8 | ~22 | Multi-scale feature maps |
| YOLOv1 | 2016 | Single-shot | Custom (Darknet) | 63.4% mAP (VOC) | 45 | Grid-based regression |
| YOLOv3 | 2018 | Single-shot | Darknet-53 | 33.0 | ~20 | Multi-scale predictions |
| RetinaNet | 2017 | Single-shot | ResNeXt-101-FPN | 40.8 | ~5 | Focal loss |
| CornerNet | 2018 | Anchor-free | Hourglass-104 | 40.5 | ~4 | Keypoint-based detection |
| FCOS | 2019 | Anchor-free | ResNeXt-101-FPN | 44.7 | ~18 | Per-pixel regression |
| YOLOv4 | 2020 | Single-shot | CSPDarknet53 | 43.5 | ~62 | Bag of freebies/specials |
| DETR | 2020 | Transformer | ResNet-50 | 42.0 | ~28 | Set prediction with Hungarian matching |
| Deformable DETR | 2021 | Transformer | ResNet-50 | 46.9 | ~19 | Deformable attention |
| YOLOv5x | 2020 | Single-shot | CSP-Darknet | 50.7 | ~12 | PyTorch ecosystem |
| DINO | 2022 | Transformer | Swin-L (O365 pretrain) | 63.3 | ~5 | Denoising anchor boxes |
| YOLOv8x | 2023 | Anchor-free | CSP-Darknet | 53.9 | ~15 | Unified Ultralytics framework |
| RT-DETR-R50 | 2024 | Transformer | ResNet-50 | 53.1 | 108 | Real-time transformer detector |
| YOLOv9e | 2024 | Single-shot | GELAN | 55.6 | ~15 | Programmable Gradient Info |
| YOLOv10-X | 2024 | Anchor-free | CSPDarknet | 54.4 | ~12 | NMS-free training |
| RF-DETR-L | 2025 | Transformer | DINOv2 ViT | 60.5 | 25 | First real-time 60+ AP |
Object detection powers a wide range of practical systems across many industries.
Self-driving vehicles depend on object detection to identify pedestrians, cyclists, other vehicles, traffic signs, lane markings, and obstacles in real time. Systems like those from Waymo, Tesla, and Cruise use multi-camera setups combined with lidar and radar, processing detections at high frame rates to support safe navigation. Detection accuracy, especially for vulnerable road users like pedestrians and cyclists, directly impacts safety.
Surveillance systems use object detection to identify people, vehicles, and suspicious objects in camera feeds. Applications include intrusion detection, crowd counting, abandoned object detection, and license plate recognition. Real-time processing requirements and the need to operate in varying lighting and weather conditions make this a demanding application domain.
In healthcare, object detection helps radiologists identify abnormalities in X-rays, CT scans, and MRI images. Examples include detecting lung nodules in chest X-rays, identifying tumors in mammograms, and locating polyps in colonoscopy images. Detection models can serve as a "second reader" to reduce missed findings and improve diagnostic efficiency.
Retailers use object detection for automated checkout systems (recognizing products without barcode scanning), shelf monitoring (detecting out-of-stock items or misplaced products), and customer behavior analysis. Amazon Go stores, for instance, pioneered cashier-less retail using computer vision including object detection to track items customers pick up.
Industrial robots use object detection to identify and localize parts on assembly lines, guide pick-and-place operations, and perform quality inspection. In warehouse automation, robots detect and grasp items of varying shapes and sizes. Detection accuracy and speed are both important for maintaining production throughput.
Object detection assists in crop monitoring, weed detection, fruit counting, and pest identification. Drones equipped with cameras fly over fields and use detection models to assess crop health and identify areas needing attention, enabling precision agriculture practices that reduce waste and improve yields.
Object detection models exist on a spectrum from lightweight, fast models suitable for edge devices to large, accurate models designed for offline processing.
Speed-optimized models (such as YOLO variants, SSD, and RT-DETR) prioritize inference latency, targeting applications that require processing 30 or more frames per second. These models typically use smaller backbones, reduced input resolutions, and efficient architectural components. They may sacrifice some accuracy on small or occluded objects.
Accuracy-optimized models (such as DINO with a Swin Transformer backbone, Cascade R-CNN, and large-scale foundation models) use deeper backbones, higher input resolutions, and techniques like test-time augmentation and model ensembling. These models achieve the highest AP scores on benchmarks but may process only a few frames per second.
Several techniques help bridge this gap:
- Knowledge distillation, in which a compact student model is trained to mimic the outputs of a large, accurate teacher detector.
- Quantization and pruning, which reduce the precision or the number of a trained model's weights to speed up inference on constrained hardware.
- Neural architecture search, which automatically discovers architectures at chosen points on the accuracy-latency curve, as used by RF-DETR.
- Model scaling, in which a single architecture is released in multiple sizes (such as the YOLO variant families) so practitioners can select an appropriate trade-off.
The choice between speed and accuracy depends on the application. Autonomous driving demands both high accuracy and low latency, pushing the development of efficient yet precise models. Offline video analysis or medical image screening can tolerate slower inference in exchange for higher detection rates.
As of 2025, several trends are shaping the field of object detection: