Bounding Box
Last reviewed
Jun 2, 2026
Sources
34 citations
Review status
Source-backed
Revision
v3 · 7,212 words
A bounding box is a rectangular region defined by a set of coordinates that encloses an object of interest within an image, video frame, or three-dimensional space. In computer vision and machine learning, bounding boxes serve as the standard method for representing the location and spatial extent of detected objects. They are central to tasks such as object detection, object tracking, and image segmentation, providing a simple geometric approximation that balances computational efficiency with spatial precision.
Bounding boxes can be two-dimensional (enclosing objects in flat images) or three-dimensional (enclosing objects in volumetric data or point clouds). Despite their simplicity, they remain one of the most widely used spatial representations in modern visual recognition systems, from autonomous vehicles to medical imaging pipelines.
Imagine you are looking at a photo with a dog in it. If someone asked you to show where the dog is, you might draw a rectangle around it with a marker. That rectangle is a bounding box. It does not trace the exact outline of the dog; it just draws the smallest box that fits around the whole dog.
Computers do the same thing. When a computer looks at a photo and tries to find objects like dogs, cars, or people, it draws rectangles around them. Each rectangle tells the computer two things: what the object is (like "dog") and where the object is in the picture (the rectangle's position and size).
To check whether the computer drew a good rectangle, people compare it with a rectangle that a human drew around the same object. If the two rectangles overlap a lot, the computer did a good job. If they barely overlap, the computer needs more practice. The way we measure that overlap is called Intersection over Union, or IoU.
A bounding box in two dimensions is defined by a minimal set of values that specify a rectangle's position and size within an image coordinate system. Several coordinate conventions exist, each suited to different frameworks and annotation standards.
| Format | Notation | Description | Used by |
|---|---|---|---|
| Corner format (xyxy) | (x_min, y_min, x_max, y_max) | Top-left and bottom-right corner coordinates in absolute pixels | Pascal VOC, PyTorch (torchvision) |
| Corner + dimensions (xywh) | (x_min, y_min, width, height) | Top-left corner plus width and height in absolute pixels | COCO dataset |
| Center format (cxcywh) | (cx, cy, width, height) | Center point plus width and height, often normalized to [0, 1] | YOLO |
| Normalized format | Values in [0, 1] | Any of the above with coordinates divided by image width/height | YOLO, some TensorFlow pipelines |
In the corner format, (x_min, y_min) is the top-left corner and (x_max, y_max) is the bottom-right corner. The width equals x_max minus x_min, and the height equals y_max minus y_min. The center format represents the same rectangle using its center coordinates (cx, cy) and its width and height. Normalized coordinates express all values as fractions of the image dimensions, making annotations resolution-independent.
Conversion between these formats is straightforward. For example, to convert from (x_min, y_min, x_max, y_max) to center format:
cx = (x_min + x_max) / 2
cy = (y_min + y_max) / 2
width = x_max - x_min
height = y_max - y_min
Libraries like torchvision provide utility functions such as box_convert for switching between these representations programmatically.
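As a minimal sketch of these conversions in plain Python, the functions below convert between corner and center formats and normalize a center-format box by the image size. The function names are illustrative and do not come from any library; boxes are assumed to be tuples of numbers in absolute pixels.

```python
def xyxy_to_cxcywh(box):
    """Corner format (x_min, y_min, x_max, y_max) -> center format (cx, cy, w, h)."""
    x_min, y_min, x_max, y_max = box
    w = x_max - x_min
    h = y_max - y_min
    return (x_min + w / 2, y_min + h / 2, w, h)


def cxcywh_to_xyxy(box):
    """Center format (cx, cy, w, h) -> corner format (x_min, y_min, x_max, y_max)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def normalize_cxcywh(box, img_w, img_h):
    """Divide x-related values by image width and y-related values by image height."""
    cx, cy, w, h = box
    return (cx / img_w, cy / img_h, w / img_w, h / img_h)


center = xyxy_to_cxcywh((10, 20, 50, 80))      # (30.0, 50.0, 40, 60)
print(center, normalize_cxcywh(center, 100, 200))
```

For batched tensors, a library routine such as torchvision's box_convert is usually the better choice; the sketch only makes the arithmetic explicit.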
In 3D applications, bounding boxes extend to cuboids (also called 3D bounding boxes or 3D bounding cuboids). A 3D bounding box is typically defined by its center coordinates (cx, cy, cz), dimensions (length, width, height), and a rotation angle (yaw). This representation is standard in autonomous driving systems, where LiDAR point clouds and camera images are used together to localize vehicles, pedestrians, and other objects in three-dimensional space.
An axis-aligned bounding box has edges that are parallel to the coordinate axes of the image or scene. Because the rectangle's sides are always horizontal and vertical, an AABB is simple to compute and store. It requires only four values in 2D (or six in 3D). AABBs are the default type used in most object detection systems.
The main limitation of AABBs is that they can include a large amount of background area when the enclosed object is elongated or rotated at an angle. For example, a diagonally oriented ship in a satellite image would have an AABB that contains substantial empty space around the actual ship.
An oriented bounding box (also called a rotated bounding box) is a rectangle that can be rotated to align with the orientation of the object it encloses. An OBB is typically parameterized by its center (cx, cy), width, height, and a rotation angle (theta). This additional degree of freedom allows the bounding box to fit tightly around elongated or angled objects, reducing the amount of background included.
OBBs are particularly useful in remote sensing (for detecting ships, aircraft, and vehicles in aerial imagery), scene text detection (where text lines may appear at arbitrary angles), and industrial inspection. Models like YOLOv8-OBB and oriented versions of Faster R-CNN have been developed specifically to predict oriented bounding boxes.
| Property | Axis-aligned (AABB) | Oriented (OBB) |
|---|---|---|
| Parameters | 4 (x_min, y_min, x_max, y_max) | 5 (cx, cy, w, h, theta) |
| Rotation support | No | Yes |
| Tightness of fit | Loose for rotated objects | Tight for rotated objects |
| Computational cost | Lower | Higher |
| Common use cases | General object detection | Aerial/satellite imagery, text detection |
| IoU calculation | Simple | Requires polygon intersection |
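To make the tightness difference concrete, the following sketch computes the four corners of an oriented box and the axis-aligned box that encloses them. The helper names are hypothetical, and the angle convention (radians, positive theta rotating the x axis toward the y axis) is an assumption; real datasets and frameworks differ on angle ranges and direction.

```python
import math


def obb_corners(cx, cy, w, h, theta):
    """Corners of a rotated rectangle; theta is in radians (assumed convention)."""
    c, s = math.cos(theta), math.sin(theta)
    half = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    # Rotate each corner offset, then translate by the box center.
    return [(cx + dx * c - dy * s, cy + dx * s + dy * c) for dx, dy in half]


def enclosing_aabb(points):
    """Smallest axis-aligned box (x_min, y_min, x_max, y_max) around the points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))


# A 100 x 20 object (for example a ship) rotated by 45 degrees.
corners = obb_corners(0.0, 0.0, 100.0, 20.0, math.pi / 4)
x0, y0, x1, y1 = enclosing_aabb(corners)
aabb_area = (x1 - x0) * (y1 - y0)
print(f"OBB area: {100.0 * 20.0:.0f}, AABB area: {aabb_area:.0f}")
```

In this example the oriented box covers 2,000 square pixels while the axis-aligned box around the same object covers about 7,200, so most of the AABB is background.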
Object detection is the task of identifying and localizing objects in images or videos. Bounding box prediction is a core component of virtually every modern object detection system. These systems output a list of detections, where each detection consists of a bounding box, a class label (such as "car" or "person"), and a confidence score indicating the model's certainty.
Two-stage detectors split the detection process into two phases. In the first stage, the model generates a set of region proposals, which are candidate bounding boxes that might contain objects. In the second stage, the model classifies each proposal and refines its bounding box coordinates.
R-CNN (2014). The Regions with Convolutional Neural Network features (R-CNN) method, introduced by Ross Girshick and colleagues, was one of the first approaches to combine CNNs with region proposals for object detection. It used selective search to generate approximately 2,000 candidate regions per image, warped each region to a fixed size, extracted CNN features from each region, and classified them with a class-specific linear support vector machine. R-CNN achieved a mean average precision (mAP) of 53.3% on PASCAL VOC 2012, a relative improvement of more than 30% over the prior best result. [1] A separate bounding box regressor refined the proposal coordinates to reduce localization error.
Fast R-CNN (2015). Fast R-CNN improved on R-CNN by introducing Region of Interest (RoI) pooling, which allowed the CNN features to be computed once for the entire image rather than separately for each region proposal. It also combined the classification and bounding box regression tasks into a single multi-task loss function, trained end to end. This made training 9 times faster and inference 213 times faster than R-CNN, while also improving accuracy. [2]
Faster R-CNN (2015). Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun introduced the Region Proposal Network (RPN), a fully convolutional network that generates region proposals directly from feature maps. The RPN shares convolutional features with the detection network, eliminating the computational bottleneck of external proposal methods such as selective search. Faster R-CNN achieved near real-time performance at 5 frames per second on a GPU while maintaining high accuracy, and reached 73.2% mAP on PASCAL VOC 2007. It formed the foundation of many winning entries at the ILSVRC and COCO 2015 competitions. [3]
Single-stage detectors predict bounding boxes and class probabilities directly from the full image in a single pass through the network, without a separate region proposal step. This design typically results in faster inference at the cost of some accuracy compared to two-stage methods.
YOLO (2016). YOLO (You Only Look Once), introduced by Joseph Redmon and colleagues at CVPR 2016, reframed object detection as a single regression problem. The model divides the input image into a grid of cells. Each cell predicts a fixed number of bounding boxes along with confidence scores and class probabilities. The base YOLO model processed images at 45 frames per second, while the smaller Fast YOLO variant reached 155 fps. [4] Subsequent versions (YOLOv2 through YOLOv12 and beyond) introduced anchor boxes, multi-scale feature maps, and improved loss functions, and are described in detail in the YOLO family evolution section below.
SSD (2016). The Single Shot MultiBox Detector, proposed by Wei Liu and colleagues, discretizes the output space of bounding boxes into a set of default boxes with different aspect ratios and scales at each feature map location. SSD combines predictions from multiple feature maps with different resolutions, allowing it to detect objects of various sizes. On VOC 2007, SSD achieved 72.1% mAP at 58 fps with 300x300 input and 75.1% mAP with 500x500 input. [5]
RetinaNet (2017). Tsung-Yi Lin and colleagues identified that the extreme class imbalance between foreground and background examples in single-stage detectors was the primary obstacle to matching two-stage detector accuracy. They introduced focal loss, which down-weights the loss contribution of well-classified (easy) examples so training focuses on hard cases. RetinaNet matched the speed of earlier single-stage detectors while surpassing the accuracy of the two-stage detectors of that era. [6]
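The down-weighting behind focal loss can be illustrated with a small sketch for a single binary prediction. The formula used here, -alpha_t * (1 - p_t)^gamma * log(p_t), and the defaults alpha = 0.25 and gamma = 2 are the commonly cited form; the function name and the scalar interface are assumptions made for illustration.

```python
import math


def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Focal loss for one binary prediction.

    p: predicted probability of the positive (object) class
    y: ground-truth label, 1 for object and 0 for background
    """
    p_t = p if y == 1 else 1.0 - p            # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    p_t = max(p_t, 1e-12)                     # avoid log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss(0.95, 1)   # well-classified positive: tiny loss
hard = focal_loss(0.10, 1)   # badly classified positive: much larger loss
print(easy, hard)
```

Setting gamma to 0 removes the modulating factor and leaves an alpha-weighted cross-entropy, which is a quick way to check an implementation.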
DETR (2020). The DEtection Transformer, introduced by Nicolas Carion and colleagues at Facebook AI Research, eliminated the need for anchor boxes and non-maximum suppression entirely. DETR treats object detection as a direct set prediction problem, using a transformer encoder-decoder architecture on top of a CNN backbone. A fixed set of learned object queries is fed to the decoder, and each query produces either a box-and-class prediction or a "no object" label. DETR employs a set-based global loss using the Hungarian algorithm to find a bipartite matching between predicted and ground truth objects, ensuring each prediction maps to at most one ground truth object. The box regression term in this loss combines an L1 distance with the GIoU loss; because GIoU is scale-invariant, it offsets the L1 term's tendency to weight large boxes more heavily than small ones. DETR matched a well-tuned Faster R-CNN baseline in accuracy on the COCO dataset with a conceptually simpler architecture, though it converged slowly (500 training epochs) and struggled on small objects. [10]
A family of follow-up models addressed DETR's slow convergence and improved its bounding box predictions. The table below summarizes the most influential variants.
| Model | Year | Key contribution to bounding box prediction |
|---|---|---|
| Deformable DETR | 2021 | Deformable attention sampling a few key points per query; multi-scale features; roughly 10x faster convergence than DETR [16] |
| DAB-DETR | 2022 | Formulates each query explicitly as a 4D dynamic anchor box (x, y, w, h) that is refined layer by layer [17] |
| DN-DETR | 2022 | Adds a query denoising task that feeds noised ground-truth boxes to the decoder, stabilizing bipartite matching and speeding convergence [18] |
| DINO | 2022 | Combines dynamic anchor boxes, contrastive denoising, and a look-forward-twice box update; first DETR variant to top the COCO leaderboard [19] |
| RT-DETR | 2023 | First real-time end-to-end detector; an efficient hybrid encoder and IoU-aware query selection let it beat comparable YOLO models in both speed and accuracy [20] |
Deformable DETR replaced the global attention of the original model with deformable attention, which attends to only a small set of sampled locations around each reference point, cutting training time from 500 epochs to about 50 while improving small-object accuracy. [16] DAB-DETR reinterpreted the abstract object queries as explicit 4D anchor boxes that are progressively refined through the decoder layers, giving the box predictions a clear geometric meaning. [17] DN-DETR introduced query denoising, feeding noised versions of ground-truth boxes into the decoder so the network learns to reconstruct them, which stabilizes the otherwise unstable Hungarian matching during early training. [18] DINO unified dynamic anchor boxes, a contrastive variant of denoising, and a refined box-update scheme; it was the first DETR-style model to reach the top of the COCO detection leaderboard. [19]
RT-DETR (2023). RT-DETR (Real-Time DEtection TRansformer), from a team at Baidu, was the first transformer detector to run in real time while remaining end to end (no NMS). It uses an efficient hybrid encoder that decouples intra-scale interaction from cross-scale fusion, plus IoU-aware query selection to initialize the decoder with high-quality object queries. RT-DETR-R50 reached 53.1% AP on COCO at 108 frames per second on a T4 GPU, reportedly outperforming similarly sized YOLO detectors and the DINO-Deformable-DETR baseline in both speed and accuracy. [20]
| Model | Year | Type | Key bounding box innovation | Speed (fps) | mAP (COCO/VOC) |
|---|---|---|---|---|---|
| R-CNN | 2014 | Two-stage | CNN features + selective search proposals | ~0.03 | 53.3% (VOC 2012) |
| Fast R-CNN | 2015 | Two-stage | RoI pooling, multi-task loss | ~0.5 | 66.9% (VOC 2007) |
| Faster R-CNN | 2015 | Two-stage | Region Proposal Network (RPN) | ~5 | 73.2% (VOC 2007) |
| YOLO | 2016 | Single-stage | Grid-based regression | 45 | 63.4% (VOC 2007) |
| SSD | 2016 | Single-stage | Multi-scale default boxes | 58 | 72.1% (VOC 2007) |
| RetinaNet | 2017 | Single-stage | Focal loss | ~5 | 40.8% (COCO) |
| YOLOv4 | 2020 | Single-stage | CIoU loss, mosaic augmentation | ~65 | 43.5% (COCO) |
| DETR | 2020 | Transformer | Anchor-free, set prediction, Hungarian matching | ~28 | 42.0% (COCO) |
| YOLOv7 | 2022 | Single-stage | Trainable bag-of-freebies, E-ELAN | 36-160 | up to 56.8% (COCO) |
| DINO | 2022 | Transformer | Denoising + dynamic anchor box queries | ~10 | 63.3% (COCO) |
| RT-DETR | 2023 | Transformer | Real-time end-to-end, IoU-aware query selection | ~108 | 53.1% (COCO) |
| YOLOv10 | 2024 | Single-stage | NMS-free training via dual label assignment | ~100+ | 54.4% (COCO) |
(Speed and mAP figures are drawn from each model's original paper and vary with backbone, input resolution, and hardware; they are best read as indicative rather than directly comparable.)
The YOLO (You Only Look Once) line is the most widely used family of single-stage detectors, and the way each version predicts bounding boxes has changed substantially over time. The original three versions were led by Joseph Redmon, with Ali Farhadi as a co-author on all three; later versions were produced by different research groups and companies, so the name now refers to a lineage of related architectures rather than a single team's work. [4][9]
| Version | Year | Authors / maintainer | Bounding box approach and notable changes |
|---|---|---|---|
| YOLOv1 | 2016 | Redmon, Divvala, Girshick, Farhadi | Direct grid-based box regression; 2 boxes per cell, no anchors [4] |
| YOLOv2 / YOLO9000 | 2016 | Redmon, Farhadi | Anchor boxes from k-means clustering, batch normalization, higher-resolution training [21] |
| YOLOv3 | 2018 | Redmon, Farhadi | Predictions at three scales, 9 anchor boxes total, logistic class predictions [22] |
| YOLOv4 | 2020 | Bochkovskiy, Wang, Liao | CIoU loss, mosaic augmentation, CSPDarknet53 backbone; 43.5% AP on COCO [9] |
| YOLOv5 | 2020 | Ultralytics (Glenn Jocher) | PyTorch reimplementation, auto-anchor learning; no formal paper [23] |
| YOLOv6 | 2022 | Meituan | Anchor-free design, decoupled head, hardware-aware backbone [24] |
| YOLOv7 | 2022 | Wang, Bochkovskiy, Liao | Trainable "bag-of-freebies", extended ELAN (E-ELAN); up to 56.8% AP on COCO [25] |
| YOLOv8 | 2023 | Ultralytics | Anchor-free, decoupled head, unified detection/segmentation/pose framework [26] |
| YOLOv9 | 2024 | Wang, Yeh, Liao | Programmable Gradient Information (PGI) and Generalized ELAN (GELAN) to reduce information loss [27] |
| YOLOv10 | 2024 | Wang et al. (Tsinghua) | NMS-free training via consistent dual label assignment, native end-to-end inference [28] |
| YOLO11 | 2024 | Ultralytics | Efficiency-focused redesign; higher mAP with fewer parameters than YOLOv8 [29] |
| YOLOv12 | 2025 | Tian, Ye, Doermann | Attention-centric architecture (area attention) matching CNN-level speed [30] |
Several trends are visible across the lineage. Early versions (YOLOv2 through YOLOv5) relied on anchor boxes whose dimensions were chosen by k-means clustering on the training set. [21][22] Starting around YOLOv6 and YOLOv8, the family shifted to anchor-free prediction with decoupled heads that separate the classification and box-regression branches. [24][26] YOLOv4 popularized the CIoU loss and mosaic data augmentation for bounding box training. [9] YOLOv9 attacked the information bottleneck in deep networks with Programmable Gradient Information and the GELAN architecture, improving parameter efficiency. [27] YOLOv10 removed non-maximum suppression from inference entirely by training with both a one-to-many and a one-to-one label assignment (consistent dual assignment), making the detector natively end to end and lowering latency. [28] YOLOv12 reintroduced attention mechanisms through an "area attention" module while keeping inference speed competitive with the CNN-based versions. [30]
Anchor boxes (also called prior boxes or default boxes) are a set of predefined bounding boxes with fixed aspect ratios and scales that serve as reference templates for detection. Instead of predicting bounding box coordinates from scratch, the model predicts offsets (adjustments) relative to these anchor boxes.
In Faster R-CNN, the Region Proposal Network uses anchors at each spatial position in the feature map. The original implementation used 9 anchors per position (3 scales and 3 aspect ratios). For each anchor, the network predicts an objectness score (whether the anchor contains an object or background) and four bounding box refinement values (delta_x, delta_y, delta_w, delta_h). [3]
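A minimal sketch of how such a 3 x 3 anchor set can be generated at one feature-map position is shown below. The scale values (128, 256, 512 pixels) and ratios (1:2, 1:1, 2:1) follow the commonly cited Faster R-CNN configuration; the function name and the choice to keep each anchor's area equal to scale squared are illustrative assumptions.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return anchors as (x_min, y_min, x_max, y_max) centered at (cx, cy).

    Each anchor has area scale**2 and height / width equal to ratio.
    """
    anchors = []
    for scale in scales:
        for ratio in ratios:
            w = scale / math.sqrt(ratio)
            h = scale * math.sqrt(ratio)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


anchors = make_anchors(0.0, 0.0)
print(len(anchors))   # 9 anchors: 3 scales x 3 aspect ratios
```

In a full detector the same template set is repeated at every spatial position of the feature map, with (cx, cy) mapped back to image coordinates by the feature stride.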
In YOLO versions 2 through 5, anchor box dimensions were determined using k-means clustering on the training set bounding boxes. The cluster centroids served as the anchor dimensions, ensuring that the anchors matched the distribution of object sizes in the data. YOLOv3, for instance, used 9 clusters split evenly across its three detection scales. [22]
SSD uses a similar concept called default boxes. At each feature map cell, SSD places default boxes of multiple aspect ratios (typically 1:1, 2:1, 1:2, 3:1, and 1:3), plus one additional 1:1 box at a larger scale. Because SSD uses feature maps at multiple resolutions, small default boxes on high-resolution feature maps detect small objects, while large default boxes on low-resolution feature maps detect large objects. [5]
Anchor-free methods predict bounding boxes without relying on predefined anchors, and they fall into two broad styles. Keypoint-based detectors such as CornerNet (2018) represent each box by a pair of corner keypoints, and CenterNet (2019) detects objects as a single center point and regresses the box size from it. Dense per-pixel detectors such as FCOS (Fully Convolutional One-Stage detection, 2019) treat every feature-map location as a training sample and regress the four distances from that point to the top, bottom, left, and right edges of the enclosing box; FCOS reached 38.7% AP with a ResNet-50 backbone, surpassing a comparable Faster R-CNN while removing all anchor-related hyperparameters. [31] DETR and its successors take a different route, predicting a set of boxes directly from learned queries. Anchor-free designs reduce the number of hand-tuned hyperparameters (anchor scales, aspect ratios, and matching thresholds) and have become the default in recent detectors such as YOLOv8 and YOLO11.
Bounding box regression is the process by which a neural network learns to predict the coordinates of a bounding box around a detected object. Rather than predicting absolute coordinates directly, most models predict offsets or transformations relative to anchor boxes or predefined reference points.
In the Faster R-CNN family of detectors, bounding box regression targets are parameterized as:
t_x = (x - x_a) / w_a
t_y = (y - y_a) / h_a
t_w = log(w / w_a)
t_h = log(h / h_a)
Here, (x, y, w, h) are the predicted box's center coordinates and dimensions, and (x_a, y_a, w_a, h_a) are the corresponding anchor box values. The logarithmic transform for width and height ensures that scale changes are symmetric (doubling and halving produce equal magnitude offsets). [3]
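The parameterization and its inverse can be sketched as a pair of functions. The names encode and decode are illustrative, and boxes and anchors are assumed to be in center format (x, y, w, h) with positive width and height.

```python
import math


def encode(box, anchor):
    """Regression targets (t_x, t_y, t_w, t_h) of a box relative to an anchor."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))


def decode(deltas, anchor):
    """Recover a center-format box from predicted offsets and the anchor."""
    tx, ty, tw, th = deltas
    xa, ya, wa, ha = anchor
    return (xa + tx * wa, ya + ty * ha, wa * math.exp(tw), ha * math.exp(th))


anchor = (100.0, 100.0, 100.0, 100.0)
wide = encode((100.0, 100.0, 200.0, 50.0), anchor)
print(wide)   # width doubled gives +log(2); height halved gives -log(2)
```

Center offsets are divided by the anchor size, so the same pixel error counts for less on a large anchor than on a small one.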
The choice of loss function for bounding box regression has a significant effect on detection performance. Several loss functions have been developed, each addressing specific limitations of its predecessors.
| Loss function | Year | Key idea | Advantage |
|---|---|---|---|
| Smooth L1 loss | 2015 | Piecewise combination of L1 and L2 loss | Robust to outliers; used in Fast/Faster R-CNN |
| IoU loss | 2016 | Directly optimizes the IoU between predicted and ground truth boxes | Scale-invariant; optimizes the evaluation metric itself |
| GIoU loss | 2019 | Adds penalty based on the area of the smallest enclosing box | Provides gradient signal even when boxes do not overlap |
| DIoU loss | 2020 | Adds penalty based on the normalized distance between box centers | Faster convergence than GIoU |
| CIoU loss | 2020 | Combines overlap area, center distance, and aspect ratio penalty | Considers all three geometric factors; used in YOLOv4 and later |
Smooth L1 loss was introduced in Fast R-CNN. It behaves like L2 loss when the error is small and like L1 loss when the error is large, making it less sensitive to outliers than pure L2 loss. [2]
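A minimal sketch of the piecewise definition for a single coordinate error is shown below. With beta = 1 it is 0.5 * x^2 for |x| < 1 and |x| - 0.5 otherwise; the beta parameter that moves the switch point is a generalization found in some frameworks and is included here as an assumption.

```python
def smooth_l1(x, beta=1.0):
    """Smooth L1 loss of a scalar error x; quadratic near zero, linear in the tails."""
    ax = abs(x)
    if ax < beta:
        return 0.5 * ax * ax / beta
    return ax - 0.5 * beta


for err in (0.1, 0.5, 1.0, 3.0):
    print(err, smooth_l1(err))
```

The two branches meet at |x| = beta with matching value and slope, which keeps gradients bounded for large errors.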
IoU loss treats the IoU metric itself as a loss function (loss = 1 - IoU). Because IoU is scale-invariant and directly corresponds to the evaluation metric, optimizing it aligns training with the evaluation objective. However, standard IoU loss has a critical flaw: when two boxes do not overlap at all, the IoU is zero regardless of how far apart they are, providing no gradient signal for optimization.
Generalized IoU (GIoU) loss, introduced by Rezatofighi and colleagues in 2019, addresses this by incorporating the smallest enclosing box that contains both the predicted and ground truth boxes. The GIoU penalty ensures that even non-overlapping boxes receive a meaningful gradient, guiding the predicted box toward the target. [7]
Distance IoU (DIoU) loss adds a penalty based on the normalized Euclidean distance between the centers of the predicted and ground truth boxes. This explicit center-distance term accelerates convergence compared to GIoU. [8]
Complete IoU (CIoU) loss, proposed by Zheng and colleagues in 2020 in the same line of work as DIoU, combines three geometric factors: overlap area, center distance, and aspect ratio consistency. CIoU was adopted in YOLOv4 and has since become a standard choice in many detection frameworks. [8][9] More recent variants such as EIoU (which decomposes the aspect-ratio term into separate width and height penalties) and SIoU (which adds an angle-aware penalty) have been proposed to further refine box regression.
IoU (also known as the Jaccard index in set theory) is the standard metric for evaluating bounding box predictions. It measures the degree of overlap between two bounding boxes by dividing the area of their intersection by the area of their union:
IoU = Area of Intersection / Area of Union
The IoU value ranges from 0 (no overlap) to 1 (perfect overlap). In practice, a predicted bounding box is typically considered a true positive if its IoU with a ground truth box exceeds a predefined threshold. Common thresholds include 0.5 (used in the PASCAL VOC challenge) [12] and 0.5:0.05:0.95 (a range of thresholds used in the COCO evaluation protocol). [11]
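For two axis-aligned bounding boxes A = (x1_a, y1_a, x2_a, y2_a) and B = (x1_b, y1_b, x2_b, y2_b):
intersection width = max(0, min(x2_a, x2_b) - max(x1_a, x1_b))
intersection height = max(0, min(y2_a, y2_b) - max(y1_a, y1_b))
intersection area = intersection width * intersection height
union area = area(A) + area(B) - intersection area
IoU = intersection area / union area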
This calculation runs in constant time, making it efficient even when evaluating thousands of detection pairs.
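A plain-Python sketch of the calculation for two corner-format boxes follows. The function name is illustrative, and coordinates are treated as continuous values (no +1 pixel correction, which some older evaluation code applies).

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


print(iou((0, 0, 10, 10), (5, 5, 15, 15)))   # 25 / 175
```

Clamping the intersection width and height at zero is what makes disjoint boxes yield an IoU of exactly 0 rather than a spurious positive value.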
| Variant | Formula modification | Purpose |
|---|---|---|
| Standard IoU | Intersection / Union | Basic overlap measurement |
| GIoU | IoU - (C - Union) / C, where C is the area of the smallest enclosing box | Handles non-overlapping boxes [7] |
| DIoU | IoU - (d^2 / c^2), where d is center distance and c is enclosing box diagonal | Penalizes center distance [8] |
| CIoU | DIoU - alpha * v, where v captures aspect ratio consistency | Penalizes center distance and aspect ratio [8] |
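The formulas in the table can be sketched in one function that returns all four values for a pair of corner-format boxes. The aspect-ratio term v = (4 / pi^2) * (atan(w_b / h_b) - atan(w_a / h_a))^2 and the weight alpha = v / ((1 - IoU) + v) follow the usual CIoU definition; the function name is illustrative and both boxes are assumed to have positive width and height.

```python
import math


def iou_variants(a, b):
    """Return (IoU, GIoU, DIoU, CIoU) for boxes given as (x1, y1, x2, y2)."""
    wa, ha = a[2] - a[0], a[3] - a[1]
    wb, hb = b[2] - b[0], b[3] - b[1]
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = wa * ha + wb * hb - inter
    iou = inter / union

    # Smallest axis-aligned box enclosing both inputs.
    ew = max(a[2], b[2]) - min(a[0], b[0])
    eh = max(a[3], b[3]) - min(a[1], b[1])
    giou = iou - (ew * eh - union) / (ew * eh)

    # Squared center distance over squared enclosing-box diagonal.
    d2 = ((a[0] + a[2]) / 2 - (b[0] + b[2]) / 2) ** 2 \
        + ((a[1] + a[3]) / 2 - (b[1] + b[3]) / 2) ** 2
    c2 = ew ** 2 + eh ** 2
    diou = iou - d2 / c2

    # Aspect-ratio consistency term and its trade-off weight.
    v = (4 / math.pi ** 2) * (math.atan(wb / hb) - math.atan(wa / ha)) ** 2
    alpha = v / ((1 - iou) + v) if v > 0 else 0.0
    ciou = diou - alpha * v
    return iou, giou, diou, ciou


near = iou_variants((0, 0, 10, 10), (20, 0, 30, 10))
far = iou_variants((0, 0, 10, 10), (40, 0, 50, 10))
print(near, far)
```

Both pairs have an IoU of 0, but GIoU and DIoU are lower for the more distant pair, which is the property that gives non-overlapping boxes a useful training signal.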
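Object detection models typically produce many overlapping bounding box predictions for a single object. Non-maximum suppression (NMS) is a post-processing step that removes redundant detections by keeping only the highest-confidence box among a group of overlapping predictions. The standard greedy procedure sorts the detections by confidence, selects the highest-scoring box, discards every remaining box whose IoU with it exceeds a threshold, and repeats on the boxes that are left until none remain.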
The result is a reduced set of bounding boxes where each detected object is represented by a single high-confidence prediction.
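Greedy NMS can be sketched in a few lines: repeatedly keep the highest-scoring remaining box and drop every other box that overlaps it too much. The function names and the default threshold of 0.5 are illustrative assumptions; in practice NMS is usually applied per class and through an optimized library routine.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes kept by greedy non-maximum suppression."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # highest-scoring box still in play
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))   # the second box overlaps the first and is suppressed
```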
Standard NMS uses a hard threshold: any box whose IoU with a higher-scoring kept box exceeds the threshold is discarded outright. This can be problematic when objects are close together or overlapping, as legitimate detections may be suppressed. Soft-NMS, introduced by Bodla and colleagues in 2017, addresses this by reducing the confidence score of overlapping boxes instead of removing them entirely. The paper proposes two rescoring functions: a linear penalty that decreases a box's score in proportion to its overlap with the higher-scoring box, and a Gaussian penalty that applies a smooth exponential decay. The change requires only a single line of code and no retraining, and the paper reported consistent gains of roughly 1 percent COCO-style mAP across detectors such as Faster R-CNN and R-FCN. [13]
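The two rescoring rules can be sketched as follows. The linear rule multiplies a score by (1 - IoU) when the IoU exceeds a threshold, and the Gaussian rule multiplies it by exp(-IoU^2 / sigma). The function names, the default values for sigma, the IoU threshold, and the final score cutoff are assumptions chosen for illustration.

```python
import math


def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, method="gaussian", sigma=0.5,
             iou_threshold=0.3, score_threshold=0.001):
    """Return (index, rescored confidence) pairs in the order boxes are selected."""
    scores = list(scores)                    # work on a copy
    remaining = list(range(len(boxes)))
    selected = []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        remaining.remove(best)
        selected.append(best)
        for i in remaining:
            overlap = iou(boxes[best], boxes[i])
            if method == "linear":
                if overlap > iou_threshold:
                    scores[i] *= 1.0 - overlap
            else:
                scores[i] *= math.exp(-(overlap * overlap) / sigma)
        # Boxes whose score has decayed below the cutoff are dropped.
        remaining = [i for i in remaining if scores[i] >= score_threshold]
    return [(i, scores[i]) for i in selected]


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
print(soft_nms(boxes, [0.9, 0.8, 0.7]))
```

Unlike hard NMS, the overlapping second box survives with a reduced score, so a downstream confidence threshold decides whether it is reported.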
DIoU-NMS, introduced alongside the DIoU loss, replaces the standard IoU criterion in NMS with the DIoU metric, which considers both overlap and center distance. Two boxes with the same IoU but different center distances receive different treatment: boxes with more distant centers are less likely to be suppressed, preserving detections of adjacent objects. [8]
NMS is a heuristic post-processing step rather than a learned component, and several modern detectors avoid it. DETR and other set-prediction transformers produce a fixed set of non-duplicate boxes through bipartite matching, so they require no NMS at all. [10] YOLOv10 achieves the same effect for a CNN detector by training with a consistent dual label assignment, which teaches the network to emit a single box per object and removes NMS from the inference path, reducing end-to-end latency. [28]
Bounding box annotations are stored in various file formats depending on the detection framework and dataset. The annotation file encodes the class label and bounding box coordinates for each object in each image.
| Format | File type | Coordinate convention | Normalization | Notable users |
|---|---|---|---|---|
| Pascal VOC | XML | (x_min, y_min, x_max, y_max) | Absolute pixels | Pascal VOC dataset |
| COCO | JSON | (x_min, y_min, width, height) | Absolute pixels | COCO dataset |
| YOLO Darknet | TXT | (cx, cy, width, height) | Normalized [0, 1] | Darknet, YOLOv3, YOLOv4 |
| YOLO PyTorch | TXT + YAML | (cx, cy, width, height) | Normalized [0, 1] | YOLOv5, YOLOv8 (Ultralytics) |
| TensorFlow TFRecord | Binary protobuf | (y_min, x_min, y_max, x_max) | Normalized [0, 1] | TensorFlow Object Detection API |
| CVAT | XML/JSON | (x_min, y_min, x_max, y_max) | Absolute pixels | CVAT annotation tool |
Format conversion tools such as Roboflow and FiftyOne allow datasets to be imported in one format and exported in another, reducing the manual effort of reformatting annotations when switching between detection frameworks.
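As a minimal sketch of what such a conversion involves, the functions below turn a COCO-style box (x_min, y_min, width, height in absolute pixels) into a YOLO-style label line (class index followed by normalized cx, cy, width, height). The function names are hypothetical, and the class index is assumed to be already remapped to a zero-based contiguous id, which a real converter must handle separately.

```python
def coco_to_yolo(bbox, img_w, img_h):
    """COCO (x_min, y_min, w, h) in pixels -> YOLO (cx, cy, w, h) normalized to [0, 1]."""
    x_min, y_min, w, h = bbox
    return ((x_min + w / 2) / img_w, (y_min + h / 2) / img_h, w / img_w, h / img_h)


def yolo_label_line(class_id, bbox, img_w, img_h):
    """Format one object as a line of a YOLO .txt label file."""
    cx, cy, w, h = coco_to_yolo(bbox, img_w, img_h)
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


print(yolo_label_line(0, (100, 50, 200, 100), 640, 480))
```

Each image gets one text file with one such line per object, so a full converter also needs the image dimensions for every annotated image.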
Pascal VOC. The PASCAL Visual Object Classes challenge, active from 2005 to 2012, was one of the earliest large-scale benchmarks for object detection. It contains 20 object categories and uses a single IoU threshold of 0.5 for evaluation. Detection performance is reported as mAP at IoU 0.5. [12]
COCO. The COCO dataset (Common Objects in Context), introduced by Lin and colleagues in 2014, contains roughly 330,000 images (over 200,000 of them labeled) with about 1.5 million object instances spanning 80 object categories. [11] It uses a more stringent evaluation protocol than Pascal VOC, averaging mAP across IoU thresholds from 0.5 to 0.95 in steps of 0.05. This multi-threshold evaluation provides a more nuanced assessment of bounding box accuracy. The commonly used 2017 split divides the labeled data into about 118,000 training, 5,000 validation, and 41,000 test images.
Open Images. Google's Open Images dataset contains over 9 million images, with around 16 million bounding box annotations spanning 600 object categories in its V7 release. It is one of the largest publicly available detection datasets. [33]
Creating bounding box annotations for training data is a labor-intensive process. Several software tools have been developed to streamline this workflow.
| Tool | Type | Key features |
|---|---|---|
| LabelImg | Desktop | Lightweight, supports Pascal VOC and YOLO formats |
| CVAT | Web-based | Supports 19+ export formats, AI-assisted annotation, video tracking |
| Labelbox | Cloud platform | Team collaboration, model-assisted labeling, quality assurance workflows |
| Roboflow | Cloud platform | Annotation, data augmentation, format conversion, model training |
| VGG Image Annotator (VIA) | Browser-based | No installation required, supports bounding boxes and polygons |
| Label Studio | Web-based (open source) | Multi-modal annotation (images, text, audio), customizable labeling interfaces |
Semi-automated and AI-assisted annotation tools use pre-trained models to suggest initial bounding boxes, which human annotators then review and correct. This approach can reduce annotation time significantly, though human oversight remains necessary to ensure accuracy.
Beyond IoU, several aggregate metrics are used to evaluate how well a detection model predicts bounding boxes across an entire dataset.
Precision measures the fraction of predicted bounding boxes that are correct (true positives out of all positive predictions). Recall measures the fraction of ground truth objects that are successfully detected (true positives out of all ground truth objects). A predicted box is counted as a true positive if its IoU with a ground truth box exceeds the threshold and it has the correct class label.
Average precision (AP) summarizes the precision-recall curve for a single object class. It is computed as the area under the precision-recall curve, where predictions are ranked by confidence score. Mean average precision (mAP) is the mean of AP values across all object classes. In the COCO evaluation protocol, the primary metric is mAP averaged over ten IoU thresholds (0.5 to 0.95), often written as AP@[.5:.95].
| Metric | Description |
|---|---|
| AP@0.5 | Average precision at IoU threshold 0.5 (also called AP50) |
| AP@0.75 | Average precision at IoU threshold 0.75 (stricter localization) |
| AP_small | AP for small objects (area less than 32x32 pixels) |
| AP_medium | AP for medium objects (area between 32x32 and 96x96 pixels) |
| AP_large | AP for large objects (area greater than 96x96 pixels) |
| AR@[1,10,100] | Average recall with 1, 10, or 100 maximum detections per image |
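As a rough illustration of AP as the area under the precision-recall curve, the sketch below ranks predictions for one class by confidence and integrates the curve. It assumes each prediction has already been marked as a true or false positive at a single IoU threshold, and it uses all-point interpolation with a monotone precision envelope; the official COCO tooling samples precision at fixed recall points instead, so its numbers can differ slightly. The function names are hypothetical.

```python
def average_precision(scored_hits, num_gt):
    """AP for one class.

    scored_hits: list of (confidence, is_true_positive) pairs.
    num_gt: number of ground truth objects of this class.
    """
    if num_gt == 0 or not scored_hits:
        return 0.0
    ranked = sorted(scored_hits, key=lambda h: h[0], reverse=True)
    precisions, recalls = [], []
    tp = 0
    for rank, (_score, is_tp) in enumerate(ranked, start=1):
        tp += int(is_tp)
        precisions.append(tp / rank)
        recalls.append(tp / num_gt)
    # Replace each precision by the best precision at any higher recall,
    # which makes the curve non-increasing (the interpolation envelope).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum the rectangles under the curve wherever recall increases.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(ap_per_class):
    """Mean of per-class AP values; ap_per_class maps class name to AP."""
    return sum(ap_per_class.values()) / len(ap_per_class)


# The ten IoU thresholds behind AP@[.5:.95]: 0.50, 0.55, ..., 0.95.
coco_iou_thresholds = [round(0.5 + 0.05 * i, 2) for i in range(10)]

hits = [(0.9, True), (0.8, False), (0.7, True)]
ap_dog = average_precision(hits, num_gt=2)
print(ap_dog)
print(mean_average_precision({"dog": ap_dog, "cat": 0.5}))
```

To approximate AP@[.5:.95], the true-positive flags would be recomputed at each threshold in `coco_iou_thresholds` and the resulting AP values averaged.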
Bounding boxes are used across a wide range of domains wherever objects need to be localized in visual data.
Autonomous driving systems use both 2D and 3D bounding boxes to detect and track vehicles, pedestrians, cyclists, and other road users. Camera-based detection produces 2D bounding boxes in image space, while LiDAR-based detection produces 3D bounding cuboids in world coordinates. Sensor fusion methods combine both modalities to improve detection reliability. Datasets such as KITTI, nuScenes, and Waymo Open Dataset provide 3D bounding box annotations for training and evaluation.
In medical imaging, bounding boxes are used to localize abnormalities such as tumors, lesions, and fractures in X-rays, CT scans, and MRI images. Detection models trained with bounding box annotations can assist radiologists by highlighting regions of interest for further examination. This approach is particularly valuable for screening applications where large numbers of images must be reviewed efficiently.
Surveillance systems use bounding boxes to detect and track people, vehicles, and other objects of interest across video feeds. Real-time object detection models like YOLO are commonly deployed in these systems due to their speed. Multi-object tracking algorithms maintain bounding box predictions across frames to follow individual objects over time.
Retail applications use bounding box detection to count products on shelves, detect out-of-stock conditions, and analyze customer behavior in stores. Automated checkout systems use object detection to identify products placed on a scanner or conveyor belt.
Robotic systems use bounding boxes as part of their perception pipeline to identify and localize objects for grasping, manipulation, and navigation. In warehouse robotics, bounding box detection helps robots identify packages, bins, and obstacles.
Bounding boxes are used in optical character recognition (OCR) and document analysis to localize text regions, tables, figures, and other structural elements within scanned documents or images. Text detection models output bounding boxes (often oriented) around individual words, lines, or text blocks.
Several popular software libraries provide bounding box utilities for detection, evaluation, and visualization.
| Library | Language | Bounding box features |
|---|---|---|
| torchvision (PyTorch) | Python | box_convert, box_iou, nms, batched_nms, generalized_box_iou, complete_box_iou, distance_box_iou |
| TensorFlow Object Detection API | Python | Bounding box encoding/decoding, evaluation metrics, pre-trained detection models |
| OpenCV | C++/Python | boundingRect, minAreaRect, contour-based bounding boxes |
| Detectron2 (Meta) | Python | Modular detection framework with built-in box operations, NMS, and evaluation |
| MMDetection (OpenMMLab) | Python | Support for 50+ detection models, comprehensive box utilities |
| Ultralytics | Python | YOLOv5, YOLOv8, and YOLO11 implementations with built-in box format handling, plus RT-DETR support |
| Albumentations | Python | Bounding-box-aware image augmentation (keeps boxes consistent under transforms) |
While bounding boxes are the most common spatial representation in object detection, they have several limitations.
Poor fit for irregular shapes. Axis-aligned bounding boxes are rectangular, so they include background pixels whenever the object is elongated, diagonal, or non-convex. For example, a tight bounding box around a diagonally placed banana includes a large amount of background.
Ambiguity with occlusion. When objects overlap, their bounding boxes may also overlap substantially, making it difficult to determine which pixels belong to which object.
Lack of shape information. Bounding boxes provide no information about the actual shape or contour of the object. They indicate only position and spatial extent.
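The first limitation can be made concrete by measuring how much of a tight box is actually covered by the object. The sketch below is a minimal, assumption-laden illustration: the object is given as a binary mask stored as a list of rows, and `tight_box` and `fill_ratio` are hypothetical helper names, not functions from any particular library.

```python
def tight_box(mask):
    """Smallest axis-aligned box around the nonzero cells of a 2D mask.

    Returns (x_min, y_min, x_max, y_max) as inclusive pixel indices,
    or None if the mask is empty.
    """
    rows = [y for y, row in enumerate(mask) if any(row)]
    cols = [x for row in mask for x, value in enumerate(row) if value]
    if not rows:
        return None
    return min(cols), min(rows), max(cols), max(rows)


def fill_ratio(mask):
    """Fraction of the tight bounding box that is covered by the object."""
    box = tight_box(mask)
    if box is None:
        return 0.0
    x0, y0, x1, y1 = box
    box_area = (x1 - x0 + 1) * (y1 - y0 + 1)
    object_area = sum(1 for row in mask for value in row if value)
    return object_area / box_area


# A thin diagonal object: 5 object pixels inside a 5x5 box.
diagonal = [[1 if x == y else 0 for x in range(5)] for y in range(5)]
# A solid square object fills its box completely.
square = [[1] * 4 for _ in range(4)]

print(tight_box(diagonal), fill_ratio(diagonal))
print(tight_box(square), fill_ratio(square))
```

The diagonal object covers only a fifth of its box while the square covers all of it, which is the gap that segmentation masks or oriented boxes are meant to close.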
| Representation | Description | Trade-off |
|---|---|---|
| Instance segmentation masks | Per-pixel labels for each object instance | More precise but much more expensive to annotate |
| Polygons | Ordered set of vertices outlining the object | More precise than boxes; faster to annotate than pixel masks |
| Keypoints / landmarks | A set of named points on the object (e.g., joints of a human body) | Captures structure; does not define a region |
| Ellipses | Elliptical bounding regions | Better fit for rounded objects; less common |
| 3D meshes | Full 3D shape representation | Highest precision; most expensive to create |
Despite these limitations, bounding boxes remain dominant because they offer a practical balance between annotation cost, computational efficiency, and usefulness for downstream tasks. In many applications, knowing that an object lies within a particular rectangular region is sufficient for the task at hand.
The use of bounding boxes in computer vision predates deep learning. Early object detection methods in the 2000s, such as the Viola-Jones face detector (2001), used sliding window approaches where a fixed-size bounding box was scanned across the image at multiple scales and positions. [14] The histogram of oriented gradients (HOG) detector by Dalal and Triggs (2005) used a similar sliding window approach for pedestrian detection. [15]
The selective search algorithm (2013) introduced a more efficient approach by generating a smaller set of candidate bounding box proposals based on image segmentation. This method was adopted by R-CNN and became the standard proposal generation technique until it was replaced by learned proposal networks in Faster R-CNN. [1][3]
The introduction of anchor-based single-stage detectors (SSD and YOLOv2 in 2016) and the subsequent development of anchor-free methods (CornerNet in 2018, FCOS in 2019, DETR in 2020) represent the two major shifts in how bounding boxes are predicted. [5][31][10] More recently, the field has converged on three ideas: anchor-free per-pixel or keypoint regression, transformer-based set prediction, and the removal of non-maximum suppression from the inference path (as in DETR and YOLOv10). [10][28] Current research is exploring open-vocabulary and text-conditioned detectors such as Grounding DINO [32] and YOLO-World, which predict bounding boxes for object categories specified at inference time by a text prompt rather than fixed at training time. [34]