YOLO (You Only Look Once) is a family of object detection models that treat detection as a single regression problem, predicting bounding boxes and class probabilities directly from full images in one forward pass through a neural network. First introduced in 2015 by Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi, YOLO broke away from the two-stage detection paradigm popularized by R-CNN and its successors. Instead of generating region proposals and then classifying them separately, YOLO looks at the entire image once and outputs all detections simultaneously. This single-shot approach made YOLO dramatically faster than competing methods, enabling real-time object detection on standard hardware.
Since the original paper, the YOLO family has grown to include more than a dozen major versions and several specialized variants. Different research groups and companies have contributed to its evolution, making YOLO one of the most widely used object detection frameworks in both academic research and industry applications.
Before YOLO, the dominant approach to object detection relied on a two-stage pipeline. Models like R-CNN (2014), Fast R-CNN (2015), and Faster R-CNN (2015) first generated thousands of candidate regions (region proposals) that might contain objects, then classified each region individually using a convolutional neural network. While this approach achieved strong accuracy on benchmarks like PASCAL VOC and MS COCO, it was slow. Faster R-CNN, the fastest of the group, ran at roughly 7 frames per second (FPS), far below what was needed for applications like autonomous driving or video surveillance.
The sliding window approach used by earlier methods like Deformable Parts Models (DPM) was even slower. There was a clear need for a detection system that could operate at real-time speeds (30+ FPS) without sacrificing too much accuracy.
Joseph Redmon and his collaborators at the University of Washington and the Allen Institute for AI proposed a fundamentally different solution: frame object detection as a single regression problem. Rather than scanning an image with region proposals, a single network would look at the whole image and predict all bounding boxes and their class labels at once.
The original YOLO paper, titled "You Only Look Once: Unified, Real-Time Object Detection," was submitted to arXiv in June 2015 and published at CVPR 2016. The authors were Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi.
YOLOv1 divides the input image into an S x S grid (7 x 7 in the default configuration). Each grid cell is responsible for detecting objects whose center falls within that cell. Each cell predicts B bounding boxes (B=2 in the paper), along with confidence scores and C class probabilities. Each bounding box prediction consists of five values: the x and y coordinates of the box center relative to the grid cell, the width and height relative to the full image, and a confidence score reflecting the probability that the box contains an object and how well the box fits.
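This grid encoding can be sketched in a few lines of Python (an illustrative reconstruction, not the original Darknet code; `encode_box` and its defaults are hypothetical names following the paper's S=7, B=2, C=20 configuration for PASCAL VOC):

```python
# Sketch of YOLOv1's grid-cell target encoding (illustrative, not the
# original implementation). Assumes a square input image; S, B, C follow
# the paper's defaults (S=7, B=2, C=20 for PASCAL VOC).

def encode_box(cx, cy, w, h, img_size=448, S=7):
    """Map an absolute box (center cx, cy; size w, h in pixels) to the
    YOLOv1 target: the responsible grid cell and the five regression
    values (minus confidence) that cell predicts."""
    cell_size = img_size / S
    col = int(cx // cell_size)      # which cell the box center falls in
    row = int(cy // cell_size)
    x_rel = cx / cell_size - col    # center offset within the cell, in [0, 1)
    y_rel = cy / cell_size - row
    w_rel = w / img_size            # width/height relative to the full image
    h_rel = h / img_size
    return row, col, (x_rel, y_rel, w_rel, h_rel)

S, B, C = 7, 2, 20
output_size = S * S * (B * 5 + C)   # 7 * 7 * 30 = 1470 values per image
print(output_size)                  # 1470

row, col, t = encode_box(cx=224, cy=100, w=64, h=128)
print(row, col, t)                  # cell (1, 3) is responsible
```

The output tensor therefore has a fixed size regardless of how many objects the image contains, which is what makes the single regression pass possible.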
The network architecture has 24 convolutional layers followed by 2 fully connected layers. The convolutional layers extract features from the image, while the fully connected layers predict the output probabilities and bounding box coordinates. The design was inspired by GoogLeNet, using 1x1 convolutions followed by 3x3 convolutions instead of the inception modules. The first 20 convolutional layers were pre-trained on ImageNet for classification, and then 4 additional convolutional layers plus 2 fully connected layers were added for detection.
A smaller variant called Fast YOLO used only 9 convolutional layers with fewer filters.
On PASCAL VOC 2007, YOLOv1 achieved 63.4% mAP at 45 FPS. Fast YOLO reached 155 FPS with 52.7% mAP. By comparison, Faster R-CNN with VGG-16 achieved 73.2% mAP but ran at only 7 FPS. YOLOv1 was therefore roughly 6 times faster than Faster R-CNN, though it lagged in accuracy.
YOLOv1 had notable limitations. Because each grid cell could only predict two boxes and one set of class probabilities, it struggled with small objects that appeared in groups (such as flocks of birds). It also had difficulty generalizing to objects with unusual aspect ratios. Localization error was the main source of its accuracy gap with two-stage detectors.
The second version was presented in the paper "YOLO9000: Better, Faster, Stronger," published in December 2016 by Joseph Redmon and Ali Farhadi. YOLOv2 addressed many of the shortcomings of the original model, while YOLO9000 extended the system to detect over 9,000 object categories by jointly training on detection and classification datasets.
Batch normalization. Adding batch normalization to all convolutional layers improved mAP by more than 2 percentage points and eliminated the need for dropout as a regularizer.
High-resolution classifier. YOLOv1 trained the classifier at 224x224 and then switched to 448x448 for detection. YOLOv2 first fine-tuned the classification network at 448x448 for 10 epochs on ImageNet, giving the network time to adjust its filters for higher resolution input.
Anchor boxes. Instead of predicting bounding box coordinates directly, YOLOv2 adopted anchor boxes (predefined template boxes), predicting offsets relative to these anchors. The anchor box dimensions were determined through k-means clustering on the training data rather than being hand-picked, which gave a better starting point for prediction.
New backbone: Darknet-19. YOLOv2 replaced the original backbone with Darknet-19, a 19-layer network using 3x3 convolutions, 1x1 convolutions, and global average pooling. Darknet-19 achieved 72.9% top-1 accuracy on ImageNet with only 5.58 billion floating-point operations, making it significantly more efficient than VGG-16.
Multi-scale training. The network was trained on images of varying sizes (from 320x320 to 608x608) every few batches, allowing it to handle different input resolutions at test time and providing a speed-accuracy tradeoff.
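The anchor-box clustering described above can be sketched as plain-Python k-means using 1 - IoU as the distance metric (a toy reconstruction; `kmeans_anchors`, the naive initialization, and the sample boxes are illustrative, not the paper's code):

```python
# YOLOv2-style anchor selection: k-means over (width, height) pairs with
# an IoU-based distance, as described in the paper. Real implementations
# run this over every box in the training set.

def iou_wh(a, b):
    """IoU of two boxes compared by shape only (aligned at one corner)."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    union = a[0] * a[1] + b[0] * b[1] - inter
    return inter / union

def kmeans_anchors(boxes, k, iters=50):
    centroids = boxes[:k]                       # naive init: first k boxes
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for box in boxes:                       # assign to highest-IoU centroid
            best = max(range(k), key=lambda i: iou_wh(box, centroids[i]))
            clusters[best].append(box)
        centroids = [                           # recompute cluster means
            (sum(b[0] for b in c) / len(c), sum(b[1] for b in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids

# Toy data: a cluster of small squares and a cluster of tall boxes.
boxes = [(10, 10), (12, 11), (11, 9), (30, 60), (28, 64), (32, 58)]
print(kmeans_anchors(boxes, k=2))
```

Using 1 - IoU rather than Euclidean distance keeps large boxes from dominating the clustering, which is the paper's stated motivation for this metric.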
YOLOv2 achieved 76.8% mAP at 67 FPS on PASCAL VOC 2007 (at 416x416 input), and 78.6% mAP at 40 FPS (at 544x544 input). On COCO, it achieved 44.0% mAP at IoU 0.5.
YOLO9000 used a WordTree hierarchy to combine labels from ImageNet and COCO, enabling detection of over 9,000 categories. It trained on both detection and classification data simultaneously, using detection images for the full loss and classification images for only the classification portion.
YOLOv3 was introduced in the paper "YOLOv3: An Incremental Improvement" by Joseph Redmon and Ali Farhadi in April 2018. Redmon described it as a collection of incremental improvements rather than a groundbreaking change, but the cumulative effect was substantial.
Darknet-53 backbone. YOLOv3 replaced Darknet-19 with Darknet-53, a 53-layer network that incorporated residual connections (skip connections) borrowed from ResNet. Darknet-53 was more powerful than Darknet-19 and more efficient than ResNet-101 or ResNet-152 in terms of floating-point operations per unit of accuracy.
Multi-scale predictions. One of the biggest improvements was predicting objects at three different scales. YOLOv3 extracted features at three points in the network (at layers 82, 94, and 106), producing detection maps at 13x13, 26x26, and 52x52 resolutions for a 416x416 input. This approach used a Feature Pyramid Network (FPN)-like structure, where feature maps from deeper layers were upsampled and concatenated with feature maps from earlier layers. The multi-scale design dramatically improved detection of small objects, which was a weakness in earlier versions.
Class prediction. YOLOv3 switched from softmax classification to independent logistic classifiers for each class, using binary cross-entropy loss. This allowed multi-label classification, where a single object could belong to multiple categories (useful for datasets with overlapping labels like "woman" and "person").
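The FPN-like merge at the heart of the multi-scale design can be sketched with NumPy (shapes follow the 416x416 case above; the channel counts and nearest-neighbor upsampling are illustrative simplifications):

```python
# Sketch of YOLOv3's FPN-like merge: a deep, coarse feature map is
# upsampled 2x and concatenated channel-wise with a finer map from an
# earlier layer. Channel counts here are illustrative.
import numpy as np

def upsample2x(x):
    """Nearest-neighbor 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

deep = np.random.rand(256, 13, 13)     # coarse, semantically rich map
earlier = np.random.rand(512, 26, 26)  # finer map from an earlier layer

merged = np.concatenate([upsample2x(deep), earlier], axis=0)
print(merged.shape)                     # (768, 26, 26)
```

The merged map feeds the 26x26 detection branch; repeating the upsample-and-concatenate step once more produces the 52x52 branch used for the smallest objects.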
YOLOv3 achieved 57.9% mAP at IoU 0.5 (AP50) on COCO, running at approximately 30 to 45 FPS depending on input resolution and hardware. At the stricter AP metric (averaging over IoU thresholds from 0.5 to 0.95), its performance was lower, but at AP50 it was competitive with detectors like RetinaNet while being significantly faster.
In February 2020, Joseph Redmon announced on Twitter that he had stopped his computer vision research due to ethical concerns. He cited the military applications of his work and privacy implications of detection and surveillance technology. He expressed regret for "ever believing science was apolitical" and noted that he felt facial recognition technologies had "more downside than upside." This decision meant that subsequent YOLO versions would come from other researchers.
With Redmon stepping away, development of YOLO continued under new leadership. YOLOv4 was released in April 2020 by Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. The paper, "YOLOv4: Optimal Speed and Accuracy of Object Detection," systematically evaluated dozens of techniques for improving detection accuracy and speed.
YOLOv4 introduced a modular architecture with three clear components: a CSPDarknet53 backbone for feature extraction, a neck combining a Spatial Pyramid Pooling (SPP) block with PANet path aggregation, and a YOLOv3-style detection head.
The paper organized training and architectural techniques into two categories:
| Category | Description | Examples |
|---|---|---|
| Bag of Freebies (BoF) | Techniques that improve accuracy during training at no extra inference cost | CutMix and Mosaic data augmentation, DropBlock regularization, class label smoothing, CIoU loss, cosine annealing scheduler |
| Bag of Specials (BoS) | Techniques that slightly increase inference cost but significantly boost accuracy | Mish activation, CSP connections, SPP block, SAM attention, PAN path aggregation |
Mosaic data augmentation, introduced in this paper, combines four training images into one, allowing the model to see objects at smaller scales and reducing the need for large batch sizes. This became one of the most widely adopted augmentation techniques in later detection models.
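The core of Mosaic can be sketched as a fixed 2x2 tiling (a toy version; the real augmentation also randomizes the join point and scales, and remaps bounding-box labels onto the new canvas):

```python
# Toy sketch of Mosaic augmentation: four images tiled into one canvas.
import numpy as np

def mosaic(imgs, size=416):
    """Tile four (H, W, 3) images into one (size, size, 3) canvas."""
    half = size // 2
    canvas = np.zeros((size, size, 3), dtype=imgs[0].dtype)
    slots = [(0, 0), (0, half), (half, 0), (half, half)]
    for img, (y, x) in zip(imgs, slots):
        # naive resize by index sampling (real code would interpolate)
        ys = np.arange(half) * img.shape[0] // half
        xs = np.arange(half) * img.shape[1] // half
        canvas[y:y + half, x:x + half] = img[ys][:, xs]
    return canvas

# Four dummy images of different heights, each filled with its index.
imgs = [np.full((100 + 20 * i, 120, 3), i, dtype=np.uint8) for i in range(4)]
out = mosaic(imgs)
print(out.shape)                        # (416, 416, 3)
```

Because each source image lands in the canvas at half scale, every object effectively shrinks, which is how Mosaic exposes the model to smaller object sizes without extra data.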
YOLOv4 achieved 43.5% AP (COCO test-dev, averaging over IoU 0.5 to 0.95) and 65.7% AP50 at approximately 62 FPS on an NVIDIA V100 GPU. This was a substantial accuracy improvement over YOLOv3 while maintaining real-time speed.
YOLOv5 was released in June 2020 by Glenn Jocher of Ultralytics, just weeks after YOLOv4. Unlike all prior YOLO models, which were implemented in the C-based Darknet framework, YOLOv5 was written entirely in PyTorch.
YOLOv5 generated significant controversy in the computer vision community for several reasons. First, Jocher was not an original YOLO author and did not have a direct lineage to Redmon's work. Second, no accompanying research paper was published. Third, some researchers argued the improvements over YOLOv4 were not sufficiently validated to justify the "v5" designation. Critics on platforms like Hacker News called the naming "bullshit" and questioned whether the model was novel enough to claim the next version number.
Jocher responded that "YOLOv5" was an internal project name and that the community should judge the model by its results rather than its label. Regardless of the naming debate, YOLOv5 became enormously popular due to its user-friendly PyTorch codebase, strong documentation, and active maintenance.
YOLOv5 used a CSPNet (Cross-Stage Partial Network) backbone with a PANet neck and an anchor-based detection head. It came in five model sizes: YOLOv5n (nano), YOLOv5s (small), YOLOv5m (medium), YOLOv5l (large), and YOLOv5x (extra-large), giving users a range of speed-accuracy options. The model included automatic anchor box calculation, mosaic augmentation, and mixed-precision training out of the box.
The PyTorch implementation made it easy to export models to ONNX, CoreML, and TFLite formats, which accelerated deployment on edge devices and mobile platforms.
YOLOv6 was released in September 2022 by the AI team at Meituan, a Chinese technology company. The paper, "YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications," focused on deployment efficiency for production environments.
YOLOv6 introduced several innovations:

- A reparameterizable backbone (EfficientRep) built from RepVGG-style blocks, whose multi-branch training structure collapses into plain convolutions at inference time.
- A Rep-PAN neck and an efficient decoupled detection head.
- An anchor-free detection paradigm.
- Self-distillation, in which the model's own predictions provide an auxiliary training signal for the classification and regression branches.
YOLOv6 provided models at multiple scales. On COCO val2017 with an NVIDIA T4 GPU: YOLOv6-N reached 37.5% AP at 1,187 FPS; YOLOv6-S reached 45.0% AP at 484 FPS; YOLOv6-M reached 50.0% AP at 226 FPS; YOLOv6-L reached 52.8% AP at 116 FPS.
YOLOv7 was published in July 2022 by Chien-Yao Wang, Alexey Bochkovskiy, and Hong-Yuan Mark Liao, the same team behind YOLOv4. The paper, "YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors," emphasized architectural efficiency and training improvements.
YOLOv7 introduced Extended Efficient Layer Aggregation Networks (E-ELAN), which allow the network to learn more diverse feature representations by shuffling and merging feature groups with different cardinality. It also used compound model scaling, adjusting the depth and width of the network in a coordinated way, and reparameterized convolutions similar to RepVGG.
The model used an anchor-based detection paradigm and introduced auxiliary detection heads during training (which are removed during inference) to improve gradient flow and learning without increasing inference cost.
YOLOv7 achieved state-of-the-art results at the time, surpassing all known object detectors in both speed and accuracy in the 5 to 160 FPS range.
YOLOv8 was released in January 2023 by Ultralytics, the same organization behind YOLOv5. It represented a major architectural overhaul and a shift toward a unified computer vision framework.
Anchor-free detection. YOLOv8 dropped the anchor box mechanism used in previous versions, switching to an anchor-free approach that predicts object centers directly. This simplified training by removing the need for anchor box configuration.
C2f modules. The backbone replaced the C3 (CSP Bottleneck with 3 convolutions) modules from YOLOv5 with C2f (Cross-Stage Partial Bottleneck with 2 convolutions) modules, which improved gradient flow and feature extraction.
Decoupled head. YOLOv8 separated the classification and regression tasks into independent branches in the detection head, a technique borrowed from FCOS and other anchor-free detectors.
Multi-task support. Beyond object detection, YOLOv8 natively supports instance segmentation, image classification, pose estimation, and oriented bounding boxes (OBB) from a single unified codebase.
YOLOv8x achieved approximately 53.9% AP on COCO val2017. The smaller variants provided a range of speed-accuracy tradeoffs, with YOLOv8n running at over 1,000 FPS on a T4 GPU.
YOLOv9 was released in February 2024 by Chien-Yao Wang, I-Hau Yeh, and Hong-Yuan Mark Liao. The paper, "YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information," was published at ECCV 2024 and focused on a fundamental problem: information loss in deep networks.
Programmable Gradient Information (PGI). PGI provides the network with complete input information for computing the objective function, ensuring that reliable gradient signals reach all layers during training. This addresses the "information bottleneck" problem where deep layers lose access to the original input information.
Generalized Efficient Layer Aggregation Network (GELAN). GELAN is a lightweight architecture based on gradient path planning that uses only conventional convolution operators. Despite not relying on depth-wise convolutions or other specialized operations, GELAN achieved better parameter utilization than competing architectures.
YOLOv9 demonstrated strong results on COCO, with the YOLOv9-E model achieving 55.6% AP. The authors showed that PGI could be applied to models ranging from lightweight to large configurations, and that train-from-scratch models with PGI could outperform models pre-trained on massive datasets.
YOLOv10 was released in May 2024 by Ao Wang, Hui Chen, and colleagues at Tsinghua University. The paper, "YOLOv10: Real-Time End-to-End Object Detection," was published at NeurIPS 2024.
The headline innovation in YOLOv10 was the elimination of non-maximum suppression (NMS) at inference time. NMS is a post-processing step traditionally required to remove duplicate detections, and it adds latency and complexity to deployment. YOLOv10 introduced consistent dual assignments: during training, the model uses both one-to-many label assignment (for rich supervisory signals) and one-to-one label assignment (for clean predictions). At inference, only the one-to-one head is used, producing unique detections without NMS.
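To make concrete what YOLOv10 removes, here is a minimal greedy NMS in plain Python (an illustrative sketch of the standard algorithm, not YOLOv10's code):

```python
# Minimal greedy NMS, the post-processing step that YOLOv10 eliminates.
# Detections are (x1, y1, x2, y2, score) tuples.

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(dets, iou_thresh=0.5):
    dets = sorted(dets, key=lambda d: d[4], reverse=True)
    keep = []
    while dets:
        best = dets.pop(0)                  # keep the highest-scoring box
        keep.append(best)
        # drop every remaining box that overlaps it too much
        dets = [d for d in dets if iou(best, d) < iou_thresh]
    return keep

dets = [(0, 0, 10, 10, 0.9), (1, 1, 11, 11, 0.8), (50, 50, 60, 60, 0.7)]
print(len(nms(dets)))   # 2: the two overlapping boxes collapse to one
```

With one-to-one assignment, the model learns to emit a single box per object in the first place, so this sort-and-suppress loop (and its data-dependent latency) disappears from the deployment pipeline.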
YOLOv10-S achieved 44.3% AP with significantly lower end-to-end latency than comparable models. YOLOv10-B matched the accuracy of YOLOv9-C while having 46% less latency and 25% fewer parameters.
YOLO11 was officially launched by Ultralytics on September 30, 2024, at the YOLO Vision 2024 (YV24) event. Ultralytics dropped the "v" prefix starting with this release.
YOLO11 introduced the C3k2 block (a variant of CSP bottleneck with two small kernel convolutions) and the C2PSA block (Cross-Stage Partial with Spatial Attention), which improved feature extraction and spatial awareness. The model maintained support for all tasks from YOLOv8: object detection, instance segmentation, image classification, pose estimation, and oriented bounding boxes.
YOLO11n achieved 39.5% AP on COCO with a latency of 1.5 ms on a T4 GPU. Across all model scales, YOLO11 showed improvements over YOLOv8 in both accuracy and efficiency.
YOLOv12 was released in February 2025 by Yunjie Tian, Qixiang Ye, and David Doermann. The paper, "YOLOv12: Attention-Centric Real-Time Object Detectors," was accepted at NeurIPS 2025.
YOLOv12 broke from the CNN-dominated tradition of the YOLO family by adopting an attention-centric design. Previous YOLO models relied almost entirely on convolutional layers for feature extraction; YOLOv12 integrated self-attention mechanisms while maintaining real-time inference speeds.
The three main innovations were:

- Area Attention, which partitions the feature map into regions so attention retains a large receptive field at a fraction of the cost of full self-attention.
- R-ELAN (Residual Efficient Layer Aggregation Networks), which stabilize the training of attention-heavy models through residual shortcuts and a redesigned aggregation structure.
- Architectural refinements for efficient attention, including the use of FlashAttention to reduce memory-access overhead.
YOLOv12-N achieved 40.6% AP with 1.64 ms inference latency on a T4 GPU, outperforming YOLOv10-N and YOLO11-N by 2.1% and 1.2% mAP respectively at comparable speeds.
YOLO26 was released in September 2025 by Ultralytics as the latest member of the YOLO family. It was designed with simplicity and edge deployment as core priorities.
YOLO26 implemented several simplifications:

- NMS-free inference by default, so the model outputs final detections without post-processing.
- Removal of the Distribution Focal Loss (DFL) module, simplifying export to edge runtimes.
- ProgLoss, a progressive loss-balancing strategy applied during training.
- MuSGD, a new hybrid optimizer used for training.
YOLO26 supports all five Ultralytics tasks (detection, segmentation, classification, pose estimation, and OBB), including open-vocabulary versions.
YOLO26n achieved approximately 39.8% AP (up to 40.3% in end-to-end mode) at 38.9 ms on CPU. YOLO26x reached 57.5% AP with 55.7M parameters. Across all scales, YOLO26 was faster and smaller than YOLO11 at comparable or better accuracy.
YOLO-NAS was released in May 2023 by Deci AI. Unlike other YOLO models designed by human researchers, YOLO-NAS was generated using Deci's AutoNAC (Automated Neural Architecture Construction) technology, which uses neural architecture search to find optimal network structures for a given hardware target.
YOLO-NAS incorporates quantization-aware blocks, which means the architecture was designed from the start to work well with INT8 quantization. When quantized, YOLO-NAS models lose only 0.45 to 0.65 mAP points, compared to 1 to 2 mAP points for other models. The model comes in three sizes (S, M, L) and was pre-trained on COCO, Objects365, and Roboflow 100 datasets.
YOLO-NAS-M delivered approximately 50% higher throughput and 1 mAP point better accuracy compared to equivalent YOLOv8 variants on the NVIDIA T4 GPU.
YOLO-World, published at CVPR 2024 by Tianheng Cheng and colleagues, extends YOLO with open-vocabulary detection, meaning it can detect object categories not seen during training by using text descriptions.
The model introduces a Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN) that fuses visual features with language embeddings. It uses a "prompt-then-detect" strategy where text prompts (category names or captions) are encoded into offline vocabulary embeddings, which are then used to guide detection.
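The matching step behind prompt-then-detect can be sketched with NumPy (random vectors stand in for real text and region embeddings; `vocab` and the 64-dimensional embedding size are arbitrary illustrative choices):

```python
# Illustrative sketch of "prompt-then-detect": class names are encoded
# once into an offline vocabulary of embeddings, and each region feature
# is classified by similarity against that vocabulary. Random vectors
# stand in for a real text encoder's output.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["person", "dog", "skateboard"]
text_emb = rng.normal(size=(len(vocab), 64))            # offline vocabulary
text_emb /= np.linalg.norm(text_emb, axis=1, keepdims=True)

# A region feature that (by construction) lies near the "dog" embedding.
region_feat = text_emb[1] + 0.05 * rng.normal(size=64)
region_feat /= np.linalg.norm(region_feat)

scores = text_emb @ region_feat                          # cosine similarities
print(vocab[int(scores.argmax())])                       # "dog"
```

Because the vocabulary is computed offline, changing the detectable categories at deployment time only requires re-encoding the prompt list, not retraining the detector.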
On the LVIS benchmark, YOLO-World achieved 35.4 AP at 52.0 FPS on an NVIDIA V100. YOLO-World-S (13M parameters) achieved 26.2 AP at 74.1 FPS, showing that open-vocabulary detection does not require massive models.
| Version | Year | Authors / Organization | Backbone | Key innovations | mAP (benchmark) | Speed |
|---|---|---|---|---|---|---|
| YOLOv1 | 2015 | Redmon, Divvala, Girshick, Farhadi | Custom (24 conv layers) | Single-shot detection, grid-based prediction | 63.4% mAP (VOC 07) | 45 FPS (Titan X) |
| YOLOv2 | 2016 | Redmon, Farhadi | Darknet-19 | Anchor boxes, batch normalization, multi-scale training | 78.6% mAP (VOC 07) | 40-67 FPS (Titan X) |
| YOLOv3 | 2018 | Redmon, Farhadi | Darknet-53 | Multi-scale predictions (FPN-like), residual connections | 57.9% AP50 (COCO) | 30-45 FPS |
| YOLOv4 | 2020 | Bochkovskiy, Wang, Liao | CSPDarknet53 | Bag of Freebies/Specials, Mosaic augmentation, SPP, PANet | 43.5% AP (COCO) | ~62 FPS (V100) |
| YOLOv5 | 2020 | Glenn Jocher (Ultralytics) | CSPNet | PyTorch implementation, 5 model scales, auto-anchor | ~50.7% AP (COCO, v5x) | Varies by size |
| YOLOv6 | 2022 | Meituan | RepVGG / CSPStackRep | Reparameterizable backbone, self-distillation, BiC | 52.8% AP (COCO, v6-L) | 116-1187 FPS (T4) |
| YOLOv7 | 2022 | Wang, Bochkovskiy, Liao | E-ELAN | E-ELAN, compound scaling, auxiliary heads | State-of-the-art at release | 5-160 FPS range |
| YOLOv8 | 2023 | Ultralytics | C2f backbone | Anchor-free, decoupled head, multi-task framework | ~53.9% AP (COCO, v8x) | 1000+ FPS (T4, v8n) |
| YOLOv9 | 2024 | Wang, Yeh, Liao | GELAN | Programmable Gradient Information, GELAN | 55.6% AP (COCO, v9-E) | Competitive |
| YOLOv10 | 2024 | Tsinghua University | YOLOv8-based | NMS-free inference, consistent dual assignments | 44.3% AP (COCO, v10-S) | 46% less latency than v9-C |
| YOLO11 | 2024 | Ultralytics | C3k2 backbone | C3k2 block, C2PSA spatial attention | 39.5% AP (COCO, 11n) | 1.5 ms (T4, 11n) |
| YOLOv12 | 2025 | Tian, Ye, Zhang | Attention-based | Area Attention, R-ELAN, FlashAttention | 40.6% AP (COCO, 12-N) | 1.64 ms (T4, 12-N) |
| YOLO26 | 2025 | Ultralytics | Simplified CNN | NMS-free default, no DFL, ProgLoss, MuSGD | ~57.5% AP (COCO, 26x) | 38.9 ms CPU (26n) |
| YOLO-NAS | 2023 | Deci AI | NAS-generated | Neural architecture search, quantization-aware | ~0.5-1 AP above YOLOv8 | 50% more throughput |
| YOLO-World | 2024 | Cheng et al. (Tencent) | YOLOv8-based | Open-vocabulary detection, RepVL-PAN | 35.4 AP (LVIS) | 52 FPS (V100) |
The YOLO architecture can be understood as three main components: the backbone (feature extractor), the neck (feature aggregation), and the head (detection output). Each component has evolved significantly over the years.
The backbone extracts features from the input image at multiple spatial resolutions.
| YOLO version | Backbone | Key characteristics |
|---|---|---|
| YOLOv1 | Custom CNN (24 layers) | Inspired by GoogLeNet, 1x1 and 3x3 convolutions |
| YOLOv2 | Darknet-19 | 19 layers, 5.58 billion FLOPs, global average pooling |
| YOLOv3 | Darknet-53 | 53 layers with residual connections, more powerful than ResNet-101 |
| YOLOv4 | CSPDarknet53 | Cross-Stage Partial connections reduce redundant gradients |
| YOLOv5 | CSPNet variants | 5 model scales (nano to extra-large) |
| YOLOv8 | C2f-based CSPNet | Cross-Stage Partial Bottleneck with 2 convolutions |
| YOLOv12 | Attention-based | Area Attention replaces some convolutional blocks |
The neck aggregates features from different backbone layers to capture both fine-grained and semantic information.
The evolution from no neck to FPN to PAN to bidirectional feature pyramids allowed later models to detect objects across a wider range of scales.
The detection head produces the final bounding box predictions and class scores. It has evolved from the coupled, anchor-based heads of the early versions to the decoupled, anchor-free heads used from YOLOv8 onward.
YOLO belongs to the family of single-shot (one-stage) detectors. Other major approaches to object detection include two-stage detectors and transformer-based detectors.
| Method | Type | Typical accuracy (COCO) | Typical speed | Strengths | Weaknesses |
|---|---|---|---|---|---|
| Faster R-CNN | Two-stage | ~42% AP (ResNet-101) | ~7 FPS | High accuracy, strong on small objects | Slow inference, complex pipeline |
| SSD (Liu et al., 2016) | Single-shot | 74.3% mAP (VOC), ~28.8% AP (COCO, SSD512) | 22-59 FPS | Good speed-accuracy balance, multi-scale features | Lower accuracy than two-stage at the time |
| RetinaNet (Lin et al., 2017) | Single-shot | ~40% AP (COCO) | ~5-12 FPS | Focal loss addresses class imbalance | Slower than YOLO |
| DETR (Carion et al., 2020) | Transformer | ~43% AP (COCO) | ~28 FPS (on V100) | No anchors, no NMS, elegant design | Slow training convergence, struggles with small objects |
| RT-DETR (Lv et al., 2023) | Transformer | ~54% AP (COCO) | Real-time on GPU | Transformer with real-time speed | Requires GPU for real-time speed |
| YOLO (latest, e.g., v12) | Single-shot | 40.6-57.5% AP (COCO) | Real-time on GPU and CPU | Fast inference, easy deployment, active ecosystem | Many competing versions can cause confusion |
Two-stage detectors like Faster R-CNN generate region proposals first and then classify them, which gives them an advantage in accuracy on difficult cases but makes them slower. Single-shot detectors like SSD and YOLO process the entire image in one pass, trading some accuracy for much higher speed. Transformer-based methods like DETR eliminate the need for hand-designed components (anchors, NMS) but historically required longer training schedules and more compute.
Recent YOLO versions have narrowed or closed the accuracy gap with two-stage and transformer-based detectors while maintaining their speed advantage, especially for edge deployment.
YOLO models are used across a wide range of domains. The combination of real-time speed, reasonable accuracy, and easy deployment has made YOLO a default choice for many practical object detection systems.
Self-driving vehicles and advanced driver assistance systems (ADAS) use YOLO for detecting pedestrians, vehicles, cyclists, traffic signs, and road markings. The millisecond-scale latency of modern YOLO models allows vehicles to process camera feeds in real time and make rapid driving decisions. Companies and research groups have customized YOLOv5, YOLOv7, and YOLOv8 specifically for autonomous vehicle perception tasks.
Surveillance systems use YOLO for real-time person detection, intrusion detection, and anomaly recognition. The ability to run on edge devices (cameras with embedded processors) rather than requiring cloud servers makes YOLO attractive for security applications. YOLOv5 and YOLOv8 form the backbone of many commercial security camera products.
YOLO has been adapted for detecting tumors in X-ray and MRI images, identifying cells in microscopy, and localizing organs for surgical planning. Applications include COVID-19 detection in chest X-rays, breast cancer identification in mammograms, and fracture detection in bone radiographs. The speed of YOLO is particularly useful when processing large batches of medical images for screening.
Industrial robots use YOLO for quality inspection on production lines, detecting defective products, and guiding pick-and-place operations. The model's ability to handle multiple object classes simultaneously makes it suitable for sorting tasks in warehouses and logistics facilities.
YOLO models have been deployed for crop disease detection, fruit counting, weed identification, and livestock monitoring. Drones equipped with YOLO-based systems can survey fields and identify areas that need attention.
Retail applications include cashierless checkout systems (detecting products as customers pick them up), shelf inventory monitoring, and customer behavior analysis.
The original YOLO models (v1 through v4) were implemented in Darknet, a C-based deep learning framework written by Joseph Redmon. Starting with YOLOv5, Ultralytics moved to PyTorch, which has become the standard framework for most modern YOLO variants. YOLO-NAS uses Deci's SuperGradients library (also PyTorch-based).
YOLO models are typically trained and evaluated on the MS COCO dataset, which contains 118,000 training images and 5,000 validation images across 80 object categories. PASCAL VOC (20 categories) was the standard benchmark for earlier versions. Some models like YOLO-NAS are also pre-trained on Objects365 (365 categories, 2 million images) for stronger feature learning.
The loss function has evolved alongside the architecture. YOLOv1 used a simple sum-of-squared-errors loss. YOLOv2 and v3 used a combination of binary cross-entropy for classification and mean squared error for coordinates. YOLOv4 introduced CIoU (Complete Intersection over Union) loss for better bounding box regression. Later versions have adopted variants like DFL (Distribution Focal Loss) in YOLOv8 and task-aligned loss functions.
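The CIoU term mentioned above can be sketched in plain Python (an illustrative implementation following the published formula; the epsilon guard is a numerical convenience, not part of the definition):

```python
# Sketch of the CIoU loss introduced with YOLOv4. Boxes are
# (x1, y1, x2, y2). CIoU augments plain IoU with a center-distance
# penalty and an aspect-ratio consistency term.
import math

def ciou_loss(pred, gt):
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt
    # plain IoU
    ix1, iy1 = max(px1, gx1), max(py1, gy1)
    ix2, iy2 = min(px2, gx2), min(py2, gy2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((px2 - px1) * (py2 - py1)
             + (gx2 - gx1) * (gy2 - gy1) - inter)
    iou = inter / union
    # squared center distance over squared enclosing-box diagonal
    cx_d = ((px1 + px2) - (gx1 + gx2)) / 2
    cy_d = ((py1 + py2) - (gy1 + gy2)) / 2
    ex1, ey1 = min(px1, gx1), min(py1, gy1)
    ex2, ey2 = max(px2, gx2), max(py2, gy2)
    diag2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2
    # aspect-ratio consistency term
    v = (4 / math.pi ** 2) * (
        math.atan((gx2 - gx1) / (gy2 - gy1))
        - math.atan((px2 - px1) / (py2 - py1))) ** 2
    alpha = v / ((1 - iou) + v + 1e-9)
    return 1 - iou + (cx_d ** 2 + cy_d ** 2) / diag2 + alpha * v

print(ciou_loss((0, 0, 10, 10), (0, 0, 10, 10)))    # 0.0 for a perfect match
print(round(ciou_loss((0, 0, 10, 10), (2, 2, 12, 12)), 3))
```

Unlike plain IoU, the distance and aspect-ratio penalties keep the gradient informative even for non-overlapping boxes, which is why CIoU converges faster for bounding box regression.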
In practice, most users do not train YOLO from scratch. Instead, they start with a model pre-trained on COCO and fine-tune it on their specific dataset. This approach requires far less data and training time. The Ultralytics library provides a straightforward API for fine-tuning, making it accessible to users who are not deep learning specialists.
Despite its popularity, YOLO has several known weaknesses:

- Small objects that appear in dense groups remain harder to detect than for two-stage detectors, a limitation inherited from the original grid-based design.
- Localization precision at strict IoU thresholds lags behind slower, more accurate architectures.
- Crowded scenes with heavily overlapping objects can produce missed or merged detections.
- The proliferation of versions from competing teams makes it difficult to choose a model and to compare published benchmarks fairly.
The original YOLO paper has been cited over 40,000 times, making it one of the most referenced papers in computer vision. The idea of framing detection as a single-pass regression task influenced the design of subsequent single-shot detectors like SSD, RetinaNet, and CenterNet. The YOLO brand itself has become synonymous with real-time object detection.
The Ultralytics YOLO repository on GitHub has accumulated over 55,000 stars and is one of the most popular open-source computer vision projects. The active ecosystem of tutorials, pre-trained models, and community contributions has lowered the barrier to entry for object detection.