YOLO (You Only Look Once) is a family of object detection models that treat detection as a single regression problem, predicting bounding boxes and class probabilities directly from full images in one forward pass through a neural network. First introduced in 2015 by Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi, YOLO broke away from the two-stage detection paradigm popularized by R-CNN and its successors. Instead of generating region proposals and then classifying them separately, YOLO looks at the entire image once and outputs all detections simultaneously. This single-shot approach made YOLO dramatically faster than competing methods, enabling real-time object detection on standard hardware.
Since the original paper, the YOLO family has grown to include more than a dozen major versions and several specialized variants. Different research groups and companies have contributed to its evolution, making YOLO one of the most widely used object detection frameworks in both academic research and industry applications.
Before YOLO, the dominant approach to object detection relied on a two-stage pipeline. Models like R-CNN (2014), Fast R-CNN (2015), and Faster R-CNN (2015) first generated thousands of candidate regions (region proposals) that might contain objects, then classified each region individually using a convolutional neural network. While this approach achieved strong accuracy on benchmarks like PASCAL VOC and MS COCO, it was slow. Faster R-CNN, the fastest of the group, ran at roughly 7 frames per second (FPS), far below what was needed for applications like autonomous driving or video surveillance.
The sliding window approach used by earlier methods like Deformable Parts Models (DPM) was even slower. There was a clear need for a detection system that could operate at real-time speeds (30+ FPS) without sacrificing too much accuracy.
Joseph Redmon and his collaborators at the University of Washington and the Allen Institute for AI proposed a fundamentally different solution: frame object detection as a single regression problem. Rather than scanning an image with region proposals, a single network would look at the whole image and predict all bounding boxes and their class labels at once.
The original YOLO paper, titled "You Only Look Once: Unified, Real-Time Object Detection," was submitted to arXiv in June 2015 and published at CVPR 2016. The authors were Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi.
YOLOv1 divides the input image into an S x S grid (7 x 7 in the default configuration). Each grid cell is responsible for detecting objects whose center falls within that cell. Each cell predicts B bounding boxes (B=2 in the paper), along with confidence scores and C class probabilities. Each bounding box prediction consists of five values: the x and y coordinates of the box center relative to the grid cell, the width and height relative to the full image, and a confidence score reflecting the probability that the box contains an object and how well the box fits.
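This grid encoding can be sketched in a few lines of Python (an illustrative reconstruction, not the original Darknet code; `encode_box` and its defaults are hypothetical names following the paper's S=7, B=2, C=20 configuration for PASCAL VOC):

```python
# Sketch of YOLOv1's grid-cell target encoding (illustrative, not the
# original implementation). Assumes a square input image; S, B, C follow
# the paper's defaults (S=7, B=2, C=20 for PASCAL VOC).

def encode_box(cx, cy, w, h, img_size=448, S=7):
    """Map an absolute box (center cx, cy; size w, h in pixels) to the
    YOLOv1 target: the responsible grid cell and the five regression
    values (minus confidence) that cell predicts."""
    cell_size = img_size / S
    col = int(cx // cell_size)      # which cell the box center falls in
    row = int(cy // cell_size)
    x_rel = cx / cell_size - col    # center offset within the cell, in [0, 1)
    y_rel = cy / cell_size - row
    w_rel = w / img_size            # width/height relative to the full image
    h_rel = h / img_size
    return row, col, (x_rel, y_rel, w_rel, h_rel)

S, B, C = 7, 2, 20
output_size = S * S * (B * 5 + C)   # 7 * 7 * 30 = 1470 values per image
print(output_size)                  # 1470

row, col, t = encode_box(cx=224, cy=100, w=64, h=128)
print(row, col, t)                  # cell (1, 3) is responsible
```

The output tensor therefore has a fixed size regardless of how many objects the image contains, which is what makes the single regression pass possible.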
The network architecture has 24 convolutional layers followed by 2 fully connected layers. The convolutional layers extract features from the image, while the fully connected layers predict the output probabilities and bounding box coordinates. The design was inspired by GoogLeNet, using 1x1 convolutions followed by 3x3 convolutions instead of the inception modules. The first 20 convolutional layers were pre-trained on ImageNet for classification, and then 4 additional convolutional layers plus 2 fully connected layers were added for detection.
A smaller variant called Fast YOLO used only 9 convolutional layers with fewer filters.
On PASCAL VOC 2007, YOLOv1 achieved 63.4% mAP at 45 FPS. Fast YOLO reached 155 FPS with 52.7% mAP. By comparison, Faster R-CNN with VGG-16 achieved 73.2% mAP but ran at only 7 FPS. YOLOv1 was therefore roughly 6 times faster than Faster R-CNN, though it lagged in accuracy.
YOLOv1 had notable limitations. Because each grid cell could only predict two boxes and one set of class probabilities, it struggled with small objects that appeared in groups (such as flocks of birds). It also had difficulty generalizing to objects with unusual aspect ratios. Localization error was the main source of its accuracy gap with two-stage detectors.
The second version was presented in the paper "YOLO9000: Better, Faster, Stronger," published in December 2016 by Joseph Redmon and Ali Farhadi. YOLOv2 addressed many of the shortcomings of the original model, while YOLO9000 extended the system to detect over 9,000 object categories by jointly training on detection and classification datasets.
Batch normalization. Adding batch normalization to all convolutional layers improved mAP by more than 2 percentage points and eliminated the need for dropout as a regularizer.
High-resolution classifier. YOLOv1 trained the classifier at 224x224 and then switched to 448x448 for detection. YOLOv2 first fine-tuned the classification network at 448x448 for 10 epochs on ImageNet, giving the network time to adjust its filters for higher resolution input.
Anchor boxes. Instead of predicting bounding box coordinates directly, YOLOv2 adopted anchor boxes (predefined template boxes), predicting offsets relative to these anchors. The anchor box dimensions were determined through k-means clustering on the training data rather than being hand-picked, which gave a better starting point for prediction.
New backbone: Darknet-19. YOLOv2 replaced the original backbone with Darknet-19, a 19-layer network using 3x3 convolutions, 1x1 convolutions, and global average pooling. Darknet-19 achieved 72.9% top-1 accuracy on ImageNet with only 5.58 billion floating-point operations, making it significantly more efficient than VGG-16.
Multi-scale training. The network was trained on images of varying sizes (from 320x320 to 608x608) every few batches, allowing it to handle different input resolutions at test time and providing a speed-accuracy tradeoff.
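The anchor-box clustering described above can be sketched as plain-Python k-means using 1 - IoU as the distance metric (a toy reconstruction; `kmeans_anchors`, the naive initialization, and the sample boxes are illustrative, not the paper's code):

```python
# YOLOv2-style anchor selection: k-means over (width, height) pairs with
# an IoU-based distance, as described in the paper. Real implementations
# run this over every box in the training set.

def iou_wh(a, b):
    """IoU of two boxes compared by shape only (aligned at one corner)."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    union = a[0] * a[1] + b[0] * b[1] - inter
    return inter / union

def kmeans_anchors(boxes, k, iters=50):
    centroids = boxes[:k]                       # naive init: first k boxes
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for box in boxes:                       # assign to highest-IoU centroid
            best = max(range(k), key=lambda i: iou_wh(box, centroids[i]))
            clusters[best].append(box)
        centroids = [                           # recompute cluster means
            (sum(b[0] for b in c) / len(c), sum(b[1] for b in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids

# Toy data: a cluster of small squares and a cluster of tall boxes.
boxes = [(10, 10), (12, 11), (11, 9), (30, 60), (28, 64), (32, 58)]
print(kmeans_anchors(boxes, k=2))
```

Using 1 - IoU rather than Euclidean distance keeps large boxes from dominating the clustering, which is the paper's stated motivation for this metric.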
YOLOv2 achieved 76.8% mAP at 67 FPS on PASCAL VOC 2007 (at 416x416 input), and 78.6% mAP at 40 FPS (at 544x544 input). On COCO, it achieved 44.0% mAP at IoU 0.5.
YOLO9000 used a WordTree hierarchy to combine labels from ImageNet and COCO, enabling detection of over 9,000 categories. It trained on both detection and classification data simultaneously, using detection images for the full loss and classification images for only the classification portion.
YOLOv3 was introduced in the paper "YOLOv3: An Incremental Improvement" by Joseph Redmon and Ali Farhadi in April 2018. Redmon described it as a collection of incremental improvements rather than a groundbreaking change, but the cumulative effect was substantial.
Darknet-53 backbone. YOLOv3 replaced Darknet-19 with Darknet-53, a 53-layer network that incorporated residual connections (skip connections) borrowed from ResNet. Darknet-53 was more powerful than Darknet-19 and more efficient than ResNet-101 or ResNet-152 in terms of floating-point operations per unit of accuracy.
Multi-scale predictions. One of the biggest improvements was predicting objects at three different scales. YOLOv3 extracted features at three points in the network (at layers 82, 94, and 106), producing detection maps at 13x13, 26x26, and 52x52 resolutions for a 416x416 input. This approach used a Feature Pyramid Network (FPN)-like structure, where feature maps from deeper layers were upsampled and concatenated with feature maps from earlier layers. The multi-scale design dramatically improved detection of small objects, which was a weakness in earlier versions.
Class prediction. YOLOv3 switched from softmax classification to independent logistic classifiers for each class, using binary cross-entropy loss. This allowed multi-label classification, where a single object could belong to multiple categories (useful for datasets with overlapping labels like "woman" and "person").
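The FPN-like merge at the heart of the multi-scale design can be sketched with NumPy (shapes follow the 416x416 case above; the channel counts and nearest-neighbor upsampling are illustrative simplifications):

```python
# Sketch of YOLOv3's FPN-like merge: a deep, coarse feature map is
# upsampled 2x and concatenated channel-wise with a finer map from an
# earlier layer. Channel counts here are illustrative.
import numpy as np

def upsample2x(x):
    """Nearest-neighbor 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

deep = np.random.rand(256, 13, 13)     # coarse, semantically rich map
earlier = np.random.rand(512, 26, 26)  # finer map from an earlier layer

merged = np.concatenate([upsample2x(deep), earlier], axis=0)
print(merged.shape)                     # (768, 26, 26)
```

The merged map feeds the 26x26 detection branch; repeating the upsample-and-concatenate step once more produces the 52x52 branch used for the smallest objects.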
YOLOv3 achieved 57.9% mAP at IoU 0.5 (AP50) on COCO, running at approximately 30 to 45 FPS depending on input resolution and hardware. At the stricter AP metric (averaging over IoU thresholds from 0.5 to 0.95), its performance was lower, but at AP50 it was competitive with detectors like RetinaNet while being significantly faster.
In February 2020, Joseph Redmon announced on Twitter that he had stopped his computer vision research due to ethical concerns. He cited the military applications of his work and privacy implications of detection and surveillance technology. He expressed regret for "ever believing science was apolitical" and noted that he felt facial recognition technologies had "more downside than upside." This decision meant that subsequent YOLO versions would come from other researchers.
With Redmon stepping away, development of YOLO continued under new leadership. YOLOv4 was released in April 2020 by Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. The paper, "YOLOv4: Optimal Speed and Accuracy of Object Detection," systematically evaluated dozens of techniques for improving detection accuracy and speed.
YOLOv4 introduced a modular architecture with three clear components: a CSPDarknet53 backbone for feature extraction, a neck combining a Spatial Pyramid Pooling (SPP) block with PANet path aggregation, and a YOLOv3-style detection head.
The paper organized training and architectural techniques into two categories:
| Category | Description | Examples |
|---|---|---|
| Bag of Freebies (BoF) | Techniques that improve accuracy during training at no extra inference cost | CutMix and Mosaic data augmentation, DropBlock regularization, class label smoothing, CIoU loss, cosine annealing scheduler |
| Bag of Specials (BoS) | Techniques that slightly increase inference cost but significantly boost accuracy | Mish activation, CSP connections, SPP block, SAM attention, PAN path aggregation |
Mosaic data augmentation, introduced in this paper, combines four training images into one, allowing the model to see objects at smaller scales and reducing the need for large batch sizes. This became one of the most widely adopted augmentation techniques in later detection models.
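The core of Mosaic can be sketched as a fixed 2x2 tiling (a toy version; the real augmentation also randomizes the join point and scales, and remaps bounding-box labels onto the new canvas):

```python
# Toy sketch of Mosaic augmentation: four images tiled into one canvas.
import numpy as np

def mosaic(imgs, size=416):
    """Tile four (H, W, 3) images into one (size, size, 3) canvas."""
    half = size // 2
    canvas = np.zeros((size, size, 3), dtype=imgs[0].dtype)
    slots = [(0, 0), (0, half), (half, 0), (half, half)]
    for img, (y, x) in zip(imgs, slots):
        # naive resize by index sampling (real code would interpolate)
        ys = np.arange(half) * img.shape[0] // half
        xs = np.arange(half) * img.shape[1] // half
        canvas[y:y + half, x:x + half] = img[ys][:, xs]
    return canvas

# Four dummy images of different heights, each filled with its index.
imgs = [np.full((100 + 20 * i, 120, 3), i, dtype=np.uint8) for i in range(4)]
out = mosaic(imgs)
print(out.shape)                        # (416, 416, 3)
```

Because each source image lands in the canvas at half scale, every object effectively shrinks, which is how Mosaic exposes the model to smaller object sizes without extra data.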
YOLOv4 achieved 43.5% AP (COCO test-dev, averaging over IoU 0.5 to 0.95) and 65.7% AP50 at approximately 62 FPS on an NVIDIA V100 GPU. This was a substantial accuracy improvement over YOLOv3 while maintaining real-time speed.
YOLOv5 was released in June 2020 by Glenn Jocher of Ultralytics, just weeks after YOLOv4. Unlike all prior YOLO models, which were implemented in the C-based Darknet framework, YOLOv5 was written entirely in PyTorch.
YOLOv5 generated significant controversy in the computer vision community for several reasons. First, Jocher was not an original YOLO author and did not have a direct lineage to Redmon's work. Second, no accompanying research paper was published. Third, some researchers argued the improvements over YOLOv4 were not sufficiently validated to justify the "v5" designation. Critics on platforms like Hacker News called the naming "bullshit" and questioned whether the model was novel enough to claim the next version number.
Jocher responded that "YOLOv5" was an internal project name and that the community should judge the model by its results rather than its label. Regardless of the naming debate, YOLOv5 became enormously popular due to its user-friendly PyTorch codebase, strong documentation, and active maintenance.
YOLOv5 used a CSPNet (Cross-Stage Partial Network) backbone with a PANet neck and an anchor-based detection head. It came in five model sizes: YOLOv5n (nano), YOLOv5s (small), YOLOv5m (medium), YOLOv5l (large), and YOLOv5x (extra-large), giving users a range of speed-accuracy options. The model included automatic anchor box calculation, mosaic augmentation, and mixed-precision training out of the box.
The PyTorch implementation made it easy to export models to ONNX, CoreML, and TFLite formats, which accelerated deployment on edge devices and mobile platforms.
YOLOv6 was released in September 2022 by the AI team at Meituan, a Chinese technology company. The paper, "YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications," focused on deployment efficiency for production environments.
YOLOv6 introduced several innovations:

- A reparameterizable backbone (EfficientRep) built from RepVGG-style blocks, whose multi-branch training structure collapses into plain convolutions at inference time.
- A Rep-PAN neck and an efficient decoupled detection head.
- An anchor-free detection paradigm.
- Self-distillation, in which the model's own predictions provide an auxiliary training signal for the classification and regression branches.
YOLOv6 provided models at multiple scales. On COCO val2017 with an NVIDIA T4 GPU: YOLOv6-N reached 37.5% AP at 1,187 FPS; YOLOv6-S reached 45.0% AP at 484 FPS; YOLOv6-M reached 50.0% AP at 226 FPS; YOLOv6-L reached 52.8% AP at 116 FPS.
YOLOv7 was published in July 2022 by Chien-Yao Wang, Alexey Bochkovskiy, and Hong-Yuan Mark Liao, the same team behind YOLOv4. The paper, "YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors," emphasized architectural efficiency and training improvements.
YOLOv7 introduced Extended Efficient Layer Aggregation Networks (E-ELAN), which allow the network to learn more diverse feature representations by shuffling and merging feature groups with different cardinality. It also used compound model scaling, adjusting the depth and width of the network in a coordinated way, and reparameterized convolutions similar to RepVGG.
The model used an anchor-based detection paradigm and introduced auxiliary detection heads during training (which are removed during inference) to improve gradient flow and learning without increasing inference cost.
YOLOv7 achieved state-of-the-art results at the time, surpassing all known object detectors in both speed and accuracy in the 5 to 160 FPS range.
YOLOv8 was released in January 2023 by Ultralytics, the same organization behind YOLOv5. It represented a major architectural overhaul and a shift toward a unified computer vision framework.
Anchor-free detection. YOLOv8 dropped the anchor box mechanism used in previous versions, switching to an anchor-free approach that predicts object centers directly. This simplified training by removing the need for anchor box configuration.
C2f modules. The backbone replaced the C3 (CSP Bottleneck with 3 convolutions) modules from YOLOv5 with C2f (Cross-Stage Partial Bottleneck with 2 convolutions) modules, which improved gradient flow and feature extraction.
Decoupled head. YOLOv8 separated the classification and regression tasks into independent branches in the detection head, a technique borrowed from FCOS and other anchor-free detectors.
Multi-task support. Beyond object detection, YOLOv8 natively supports instance segmentation, image classification, pose estimation, and oriented bounding boxes (OBB) from a single unified codebase.
YOLOv8x achieved approximately 53.9% AP on COCO val2017. The smaller variants provided a range of speed-accuracy tradeoffs, with YOLOv8n running at over 1,000 FPS on a T4 GPU.
YOLOv9 was released in February 2024 by Chien-Yao Wang, I-Hau Yeh, and Hong-Yuan Mark Liao. The paper, "YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information," was published at ECCV 2024 and focused on a fundamental problem: information loss in deep networks.
Programmable Gradient Information (PGI). PGI provides the network with complete input information for computing the objective function, ensuring that reliable gradient signals reach all layers during training. This addresses the "information bottleneck" problem where deep layers lose access to the original input information.
Generalized Efficient Layer Aggregation Network (GELAN). GELAN is a lightweight architecture based on gradient path planning that uses only conventional convolution operators. Despite not relying on depth-wise convolutions or other specialized operations, GELAN achieved better parameter utilization than competing architectures.
YOLOv9 demonstrated strong results on COCO, with the YOLOv9-E model achieving 55.6% AP. The authors showed that PGI could be applied to models ranging from lightweight to large configurations, and that train-from-scratch models with PGI could outperform models pre-trained on massive datasets.
YOLOv10 was released in May 2024 by Ao Wang, Hui Chen, and colleagues at Tsinghua University. The paper, "YOLOv10: Real-Time End-to-End Object Detection," was published at NeurIPS 2024.
The headline innovation in YOLOv10 was the elimination of non-maximum suppression (NMS) at inference time. NMS is a post-processing step traditionally required to remove duplicate detections, and it adds latency and complexity to deployment. YOLOv10 introduced consistent dual assignments: during training, the model uses both one-to-many label assignment (for rich supervisory signals) and one-to-one label assignment (for clean predictions). At inference, only the one-to-one head is used, producing unique detections without NMS.
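To make concrete what YOLOv10 removes, here is a minimal greedy NMS in plain Python (an illustrative sketch of the standard algorithm, not YOLOv10's code):

```python
# Minimal greedy NMS, the post-processing step that YOLOv10 eliminates.
# Detections are (x1, y1, x2, y2, score) tuples.

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(dets, iou_thresh=0.5):
    dets = sorted(dets, key=lambda d: d[4], reverse=True)
    keep = []
    while dets:
        best = dets.pop(0)                  # keep the highest-scoring box
        keep.append(best)
        # drop every remaining box that overlaps it too much
        dets = [d for d in dets if iou(best, d) < iou_thresh]
    return keep

dets = [(0, 0, 10, 10, 0.9), (1, 1, 11, 11, 0.8), (50, 50, 60, 60, 0.7)]
print(len(nms(dets)))   # 2: the two overlapping boxes collapse to one
```

With one-to-one assignment, the model learns to emit a single box per object in the first place, so this sort-and-suppress loop (and its data-dependent latency) disappears from the deployment pipeline.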
YOLOv10-S achieved 44.3% AP with significantly lower end-to-end latency than comparable models. YOLOv10-B matched the accuracy of YOLOv9-C while having 46% less latency and 25% fewer parameters.
YOLO11 was officially launched by Ultralytics on September 30, 2024, at the YOLO Vision 2024 (YV24) event. Ultralytics dropped the "v" prefix starting with this release.
YOLO11 introduced the C3k2 block (a variant of CSP bottleneck with two small kernel convolutions) and the C2PSA block (Cross-Stage Partial with Spatial Attention), which improved feature extraction and spatial awareness. The model maintained support for all tasks from YOLOv8: object detection, instance segmentation, image classification, pose estimation, and oriented bounding boxes.
YOLO11n achieved 39.5% AP on COCO with a latency of 1.5 ms on a T4 GPU. Across all model scales, YOLO11 showed improvements over YOLOv8 in both accuracy and efficiency.
YOLOv12 was released in February 2025 by Yunjie Tian, Qixiang Ye, and David Doermann. The paper, "YOLOv12: Attention-Centric Real-Time Object Detectors," was accepted at NeurIPS 2025.
YOLOv12 broke from the CNN-dominated tradition of the YOLO family by adopting an attention-centric design. Previous YOLO models relied almost entirely on convolutional layers for feature extraction; YOLOv12 integrated self-attention mechanisms while maintaining real-time inference speeds.
The three main innovations were:

- Area Attention, which partitions the feature map into regions so attention retains a large receptive field at a fraction of the cost of full self-attention.
- R-ELAN (Residual Efficient Layer Aggregation Networks), which stabilize the training of attention-heavy models through residual shortcuts and a redesigned aggregation structure.
- Architectural refinements for efficient attention, including the use of FlashAttention to reduce memory-access overhead.
YOLOv12-N achieved 40.6% AP with 1.64 ms inference latency on a T4 GPU, outperforming YOLOv10-N and YOLO11-N by 2.1% and 1.2% mAP respectively at comparable speeds.
YOLO26 was released in September 2025 by Ultralytics as the latest member of the YOLO family. It was designed with simplicity and edge deployment as core priorities.
YOLO26 implemented several simplifications:

- NMS-free inference by default, so the model outputs final detections without post-processing.
- Removal of the Distribution Focal Loss (DFL) module, simplifying export to edge runtimes.
- ProgLoss, a progressive loss-balancing strategy applied during training.
- MuSGD, a new hybrid optimizer used for training.
YOLO26 supports all five Ultralytics tasks (detection, segmentation, classification, pose estimation, and OBB), including open-vocabulary versions.
YOLO26n achieved approximately 39.8% AP (up to 40.3% in end-to-end mode) at 38.9 ms on CPU. YOLO26x reached 57.5% AP with 55.7M parameters. Across all scales, YOLO26 was faster and smaller than YOLO11 at comparable or better accuracy.
YOLO-NAS was released in May 2023 by Deci AI. Unlike other YOLO models designed by human researchers, YOLO-NAS was generated using Deci's AutoNAC (Automated Neural Architecture Construction) technology, which uses neural architecture search to find optimal network structures for a given hardware target.
YOLO-NAS incorporates quantization-aware blocks, which means the architecture was designed from the start to work well with INT8 quantization. When quantized, YOLO-NAS models lose only 0.45 to 0.65 mAP points, compared to 1 to 2 mAP points for other models. The model comes in three sizes (S, M, L) and was pre-trained on COCO, Objects365, and Roboflow 100 datasets.
YOLO-NAS-M delivered approximately 50% higher throughput and 1 mAP point better accuracy compared to equivalent YOLOv8 variants on the NVIDIA T4 GPU.
YOLO-World, published at CVPR 2024 by Tianheng Cheng and colleagues, extends YOLO with open-vocabulary detection, meaning it can detect object categories not seen during training by using text descriptions.
The model introduces a Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN) that fuses visual features with language embeddings. It uses a "prompt-then-detect" strategy where text prompts (category names or captions) are encoded into offline vocabulary embeddings, which are then used to guide detection.
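The matching step behind prompt-then-detect can be sketched with NumPy (random vectors stand in for real text and region embeddings; `vocab` and the 64-dimensional embedding size are arbitrary illustrative choices):

```python
# Illustrative sketch of "prompt-then-detect": class names are encoded
# once into an offline vocabulary of embeddings, and each region feature
# is classified by similarity against that vocabulary. Random vectors
# stand in for a real text encoder's output.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["person", "dog", "skateboard"]
text_emb = rng.normal(size=(len(vocab), 64))            # offline vocabulary
text_emb /= np.linalg.norm(text_emb, axis=1, keepdims=True)

# A region feature that (by construction) lies near the "dog" embedding.
region_feat = text_emb[1] + 0.05 * rng.normal(size=64)
region_feat /= np.linalg.norm(region_feat)

scores = text_emb @ region_feat                          # cosine similarities
print(vocab[int(scores.argmax())])                       # "dog"
```

Because the vocabulary is computed offline, changing the detectable categories at deployment time only requires re-encoding the prompt list, not retraining the detector.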
On the LVIS benchmark, YOLO-World achieved 35.4 AP at 52.0 FPS on an NVIDIA V100. YOLO-World-S (13M parameters) achieved 26.2 AP at 74.1 FPS, showing that open-vocabulary detection does not require massive models.
| Version | Year | Authors / Organization | Backbone | Key innovations | mAP (benchmark) | Speed |
|---|---|---|---|---|---|---|
| YOLOv1 | 2015 | Redmon, Divvala, Girshick, Farhadi | Custom (24 conv layers) | Single-shot detection, grid-based prediction | 63.4% mAP (VOC 07) | 45 FPS (Titan X) |
| YOLOv2 | 2016 | Redmon, Farhadi | Darknet-19 | Anchor boxes, batch normalization, multi-scale training | 78.6% mAP (VOC 07) | 40-67 FPS (Titan X) |
| YOLOv3 | 2018 | Redmon, Farhadi | Darknet-53 | Multi-scale predictions (FPN-like), residual connections | 57.9% AP50 (COCO) | 30-45 FPS |
| YOLOv4 | 2020 | Bochkovskiy, Wang, Liao | CSPDarknet53 | Bag of Freebies/Specials, Mosaic augmentation, SPP, PANet | 43.5% AP (COCO) | ~62 FPS (V100) |
| YOLOv5 | 2020 | Glenn Jocher (Ultralytics) | CSPNet | PyTorch implementation, 5 model scales, auto-anchor | ~50.7% AP (COCO, v5x) | Varies by size |
| YOLOv6 | 2022 | Meituan | RepVGG / CSPStackRep | Reparameterizable backbone, self-distillation, BiC | 52.8% AP (COCO, v6-L) | 116-1187 FPS (T4) |
| YOLOv7 | 2022 | Wang, Bochkovskiy, Liao | E-ELAN | E-ELAN, compound scaling, auxiliary heads | State-of-the-art at release | 5-160 FPS range |
| YOLOv8 | 2023 | Ultralytics | C2f backbone | Anchor-free, decoupled head, multi-task framework | ~53.9% AP (COCO, v8x) | 1000+ FPS (T4, v8n) |
| YOLOv9 | 2024 | Wang, Yeh, Liao | GELAN | Programmable Gradient Information, GELAN | 55.6% AP (COCO, v9-E) | Competitive |
| YOLOv10 | 2024 | Tsinghua University | YOLOv8-based | NMS-free inference, consistent dual assignments | 44.3% AP (COCO, v10-S) | 46% less latency than v9-C |
| YOLO11 | 2024 | Ultralytics | C3k2 backbone | C3k2 block, C2PSA spatial attention | 39.5% AP (COCO, 11n) | 1.5 ms (T4, 11n) |
| YOLOv12 | 2025 | Tian, Ye, Zhang | Attention-based | Area Attention, R-ELAN, FlashAttention | 40.6% AP (COCO, 12-N) | 1.64 ms (T4, 12-N) |
| YOLO26 | 2025 | Ultralytics | Simplified CNN | NMS-free default, no DFL, ProgLoss, MuSGD | ~57.5% AP (COCO, 26x) | 38.9 ms CPU (26n) |
| YOLO-NAS | 2023 | Deci AI | NAS-generated | Neural architecture search, quantization-aware | ~0.5-1 AP above YOLOv8 | 50% more throughput |
| YOLO-World | 2024 | Cheng et al. (Tencent) | YOLOv8-based | Open-vocabulary detection, RepVL-PAN | 35.4 AP (LVIS) | 52 FPS (V100) |
The YOLO architecture can be understood as three main components: the backbone (feature extractor), the neck (feature aggregation), and the head (detection output). Each component has evolved significantly over the years.
The backbone extracts features from the input image at multiple spatial resolutions.
| YOLO version | Backbone | Key characteristics |
|---|---|---|
| YOLOv1 | Custom CNN (24 layers) | Inspired by GoogLeNet, 1x1 and 3x3 convolutions |
| YOLOv2 | Darknet-19 | 19 layers, 5.58 billion FLOPs, global average pooling |
| YOLOv3 | Darknet-53 | 53 layers with residual connections, more powerful than ResNet-101 |
| YOLOv4 | CSPDarknet53 | Cross-Stage Partial connections reduce redundant gradients |
| YOLOv5 | CSPNet variants | 5 model scales (nano to extra-large) |
| YOLOv8 | C2f-based CSPNet | Cross-Stage Partial Bottleneck with 2 convolutions |
| YOLOv12 | Attention-based | Area Attention replaces some convolutional blocks |
The neck aggregates features from different backbone layers to capture both fine-grained and semantic information.
The evolution from no neck to FPN to PAN to bidirectional feature pyramids allowed later models to detect objects across a wider range of scales.
The detection head produces the final bounding box predictions and class scores. It has evolved from the coupled, anchor-based heads of the early versions to the decoupled, anchor-free heads used from YOLOv8 onward.
YOLO belongs to the family of single-shot (one-stage) detectors. Other major approaches to object detection include two-stage detectors and transformer-based detectors.
| Method | Type | Typical accuracy (COCO) | Typical speed | Strengths | Weaknesses |
|---|---|---|---|---|---|
| Faster R-CNN | Two-stage | ~42% AP (ResNet-101) | ~7 FPS | High accuracy, strong on small objects | Slow inference, complex pipeline |
| SSD (Liu et al., 2016) | Single-shot | 74.3% mAP (VOC), ~28.8% AP (COCO, SSD512) | 22-59 FPS | Good speed-accuracy balance, multi-scale features | Lower accuracy than two-stage at the time |
| RetinaNet (Lin et al., 2017) | Single-shot | ~40% AP (COCO) | ~5-12 FPS | Focal loss addresses class imbalance | Slower than YOLO |
| DETR (Carion et al., 2020) | Transformer | ~43% AP (COCO) | ~28 FPS (on V100) | No anchors, no NMS, elegant design | Slow training convergence, struggles with small objects |
| RT-DETR (Lv et al., 2023) | Transformer | ~54% AP (COCO) | Real-time on GPU | Transformer with real-time speed | Requires GPU for real-time speed |
| YOLO (latest, e.g., v12) | Single-shot | 40.6-57.5% AP (COCO) | Real-time on GPU and CPU | Fast inference, easy deployment, active ecosystem | Many competing versions can cause confusion |
Two-stage detectors like Faster R-CNN generate region proposals first and then classify them, which gives them an advantage in accuracy on difficult cases but makes them slower. Single-shot detectors like SSD and YOLO process the entire image in one pass, trading some accuracy for much higher speed. Transformer-based methods like DETR eliminate the need for hand-designed components (anchors, NMS) but historically required longer training schedules and more compute.
Recent YOLO versions have narrowed or closed the accuracy gap with two-stage and transformer-based detectors while maintaining their speed advantage, especially for edge deployment.
YOLO models are used across a wide range of domains. The combination of real-time speed, reasonable accuracy, and easy deployment has made YOLO a default choice for many practical object detection systems.
Self-driving vehicles and advanced driver assistance systems (ADAS) use YOLO for detecting pedestrians, vehicles, cyclists, traffic signs, and road markings. The millisecond-scale latency of modern YOLO models allows vehicles to process camera feeds in real time and make rapid driving decisions. Companies and research groups have customized YOLOv5, YOLOv7, and YOLOv8 specifically for autonomous vehicle perception tasks.
Surveillance systems use YOLO for real-time person detection, intrusion detection, and anomaly recognition. The ability to run on edge devices (cameras with embedded processors) rather than requiring cloud servers makes YOLO attractive for security applications. YOLOv5 and YOLOv8 form the backbone of many commercial security camera products.
YOLO has been adapted for detecting tumors in X-ray and MRI images, identifying cells in microscopy, and localizing organs for surgical planning. Applications include COVID-19 detection in chest X-rays, breast cancer identification in mammograms, and fracture detection in bone radiographs. The speed of YOLO is particularly useful when processing large batches of medical images for screening.
Industrial robots use YOLO for quality inspection on production lines, detecting defective products, and guiding pick-and-place operations. The model's ability to handle multiple object classes simultaneously makes it suitable for sorting tasks in warehouses and logistics facilities.
YOLO models have been deployed for crop disease detection, fruit counting, weed identification, and livestock monitoring. Drones equipped with YOLO-based systems can survey fields and identify areas that need attention.
Retail applications include cashierless checkout systems (detecting products as customers pick them up), shelf inventory monitoring, and customer behavior analysis.
The original YOLO models (v1 through v4) were implemented in Darknet, a C-based deep learning framework written by Joseph Redmon. Starting with YOLOv5, Ultralytics moved to PyTorch, which has become the standard framework for most modern YOLO variants. YOLO-NAS uses Deci's SuperGradients library (also PyTorch-based).
YOLO models are typically trained and evaluated on the MS COCO dataset, which contains 118,000 training images and 5,000 validation images across 80 object categories. PASCAL VOC (20 categories) was the standard benchmark for earlier versions. Some models like YOLO-NAS are also pre-trained on Objects365 (365 categories, 2 million images) for stronger feature learning.
The loss function has evolved alongside the architecture. YOLOv1 used a simple sum-of-squared-errors loss. YOLOv2 and v3 used a combination of binary cross-entropy for classification and mean squared error for coordinates. YOLOv4 introduced CIoU (Complete Intersection over Union) loss for better bounding box regression. Later versions have adopted variants like DFL (Distribution Focal Loss) in YOLOv8 and task-aligned loss functions.
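The CIoU term mentioned above can be sketched in plain Python (an illustrative implementation following the published formula; the epsilon guard is a numerical convenience, not part of the definition):

```python
# Sketch of the CIoU loss introduced with YOLOv4. Boxes are
# (x1, y1, x2, y2). CIoU augments plain IoU with a center-distance
# penalty and an aspect-ratio consistency term.
import math

def ciou_loss(pred, gt):
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt
    # plain IoU
    ix1, iy1 = max(px1, gx1), max(py1, gy1)
    ix2, iy2 = min(px2, gx2), min(py2, gy2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((px2 - px1) * (py2 - py1)
             + (gx2 - gx1) * (gy2 - gy1) - inter)
    iou = inter / union
    # squared center distance over squared enclosing-box diagonal
    cx_d = ((px1 + px2) - (gx1 + gx2)) / 2
    cy_d = ((py1 + py2) - (gy1 + gy2)) / 2
    ex1, ey1 = min(px1, gx1), min(py1, gy1)
    ex2, ey2 = max(px2, gx2), max(py2, gy2)
    diag2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2
    # aspect-ratio consistency term
    v = (4 / math.pi ** 2) * (
        math.atan((gx2 - gx1) / (gy2 - gy1))
        - math.atan((px2 - px1) / (py2 - py1))) ** 2
    alpha = v / ((1 - iou) + v + 1e-9)
    return 1 - iou + (cx_d ** 2 + cy_d ** 2) / diag2 + alpha * v

print(ciou_loss((0, 0, 10, 10), (0, 0, 10, 10)))    # 0.0 for a perfect match
print(round(ciou_loss((0, 0, 10, 10), (2, 2, 12, 12)), 3))
```

Unlike plain IoU, the distance and aspect-ratio penalties keep the gradient informative even for non-overlapping boxes, which is why CIoU converges faster for bounding box regression.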
In practice, most users do not train YOLO from scratch. Instead, they start with a model pre-trained on COCO and fine-tune it on their specific dataset. This approach requires far less data and training time. The Ultralytics library provides a straightforward API for fine-tuning, making it accessible to users who are not deep learning specialists.
Despite its popularity, YOLO has several known weaknesses:

- Small objects that appear in dense groups remain harder to detect than for two-stage detectors, a limitation inherited from the original grid-based design.
- Localization precision at strict IoU thresholds lags behind slower, more accurate architectures.
- Crowded scenes with heavily overlapping objects can produce missed or merged detections.
- The proliferation of versions from competing teams makes it difficult to choose a model and to compare published benchmarks fairly.
The original YOLO paper has been cited over 40,000 times, making it one of the most referenced papers in computer vision. The idea of framing detection as a single-pass regression task influenced the design of subsequent single-shot detectors like SSD, RetinaNet, and CenterNet. The YOLO brand itself has become synonymous with real-time object detection.
The Ultralytics YOLO repository on GitHub has accumulated over 55,000 stars and is one of the most popular open-source computer vision projects. The active ecosystem of tutorials, pre-trained models, and community contributions has lowered the barrier to entry for object detection.