Bounding Box

A bounding box is a rectangular region defined by a set of coordinates that encloses an object of interest within an image, video frame, or three-dimensional space. In computer vision and machine learning, bounding boxes serve as the standard method for representing the location and spatial extent of detected objects. They are central to tasks such as object detection, object tracking, and image segmentation, providing a simple geometric approximation that balances computational efficiency with spatial precision.

Bounding boxes can be two-dimensional (enclosing objects in flat images) or three-dimensional (enclosing objects in volumetric data or point clouds). Despite their simplicity, they remain one of the most widely used spatial representations in modern visual recognition systems, from autonomous vehicles to medical imaging pipelines.

ELI5

Imagine you are looking at a photo with a dog in it. If someone asked you to show where the dog is, you might draw a rectangle around it with a marker. That rectangle is a bounding box. It does not trace the exact outline of the dog; it just draws the smallest box that fits around the whole dog.

Computers do the same thing. When a computer looks at a photo and tries to find objects like dogs, cars, or people, it draws rectangles around them. Each rectangle tells the computer two things: what the object is (like "dog") and where the object is in the picture (the rectangle's position and size).

To check whether the computer drew a good rectangle, people compare it with a rectangle that a human drew around the same object. If the two rectangles overlap a lot, the computer did a good job. If they barely overlap, the computer needs more practice. The way we measure that overlap is called Intersection over Union, or IoU.

Definition and coordinate representation

A bounding box in two dimensions is defined by a minimal set of values that specify a rectangle's position and size within an image coordinate system. Several coordinate conventions exist, each suited to different frameworks and annotation standards.

Common coordinate formats

Format	Notation	Description	Used by
Corner format (xyxy)	(x_min, y_min, x_max, y_max)	Top-left and bottom-right corner coordinates in absolute pixels	Pascal VOC, PyTorch (torchvision)
Corner + dimensions (xywh)	(x_min, y_min, width, height)	Top-left corner plus width and height in absolute pixels	COCO dataset
Center format (cxcywh)	(cx, cy, width, height)	Center point plus width and height, often normalized to [0, 1]	YOLO
Normalized format	Values in [0, 1]	Any of the above with coordinates divided by image width/height	YOLO, some TensorFlow pipelines

In the corner format, (x_min, y_min) is the top-left corner and (x_max, y_max) is the bottom-right corner. The width equals x_max minus x_min, and the height equals y_max minus y_min. The center format represents the same rectangle using its center coordinates (cx, cy) and its width and height. Normalized coordinates express all values as fractions of the image dimensions, making annotations resolution-independent.

Conversion between these formats is straightforward. For example, to convert from (x_min, y_min, x_max, y_max) to center format:

cx = (x_min + x_max) / 2
cy = (y_min + y_max) / 2
width = x_max - x_min
height = y_max - y_min

Libraries like torchvision provide utility functions such as box_convert for switching between these representations programmatically.
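As a minimal sketch, the conversions above can be written in plain Python. The function names here are illustrative only and are not taken from any library.

```python
def xyxy_to_cxcywh(box):
    """Convert (x_min, y_min, x_max, y_max) to (cx, cy, width, height)."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2, (y_min + y_max) / 2, x_max - x_min, y_max - y_min)


def cxcywh_to_xyxy(box):
    """Convert (cx, cy, width, height) back to corner format."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def xyxy_to_xywh(box):
    """Convert corner format to (x_min, y_min, width, height)."""
    x_min, y_min, x_max, y_max = box
    return (x_min, y_min, x_max - x_min, y_max - y_min)


def normalize_xyxy(box, image_width, image_height):
    """Divide pixel coordinates by the image size to get values in [0, 1]."""
    x_min, y_min, x_max, y_max = box
    return (x_min / image_width, y_min / image_height,
            x_max / image_width, y_max / image_height)


box = (100, 50, 300, 250)
print(xyxy_to_cxcywh(box))            # (200.0, 150.0, 200, 200)
print(normalize_xyxy(box, 400, 500))  # (0.25, 0.1, 0.75, 0.5)
```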

Three-dimensional bounding boxes

In 3D applications, bounding boxes extend to cuboids (also called 3D bounding boxes or 3D bounding cuboids). A 3D bounding box is typically defined by its center coordinates (cx, cy, cz), dimensions (length, width, height), and a rotation angle (yaw). This representation is standard in autonomous driving systems, where LiDAR point clouds and camera images are used together to localize vehicles, pedestrians, and other objects in three-dimensional space.
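The following sketch shows how the eight corners of such a cuboid could be derived from its parameters. It assumes that z is the vertical axis, that yaw is a rotation in radians about that axis, and that length runs along the box's local x axis; datasets differ on these conventions, so treat the axis choices as assumptions.

```python
import math


def box3d_corners(cx, cy, cz, length, width, height, yaw):
    """Return the 8 corners of a cuboid rotated by `yaw` about the vertical (z) axis."""
    cos_y, sin_y = math.cos(yaw), math.sin(yaw)
    corners = []
    for dx in (-length / 2, length / 2):
        for dy in (-width / 2, width / 2):
            for dz in (-height / 2, height / 2):
                # Rotate the local (dx, dy) offset in the ground plane, then translate.
                x = cx + dx * cos_y - dy * sin_y
                y = cy + dx * sin_y + dy * cos_y
                corners.append((x, y, cz + dz))
    return corners


# A car-sized box at the origin, first unrotated and then turned by 90 degrees.
unrotated = box3d_corners(0, 0, 0, 4.0, 2.0, 1.5, 0.0)
rotated = box3d_corners(0, 0, 0, 4.0, 2.0, 1.5, math.pi / 2)
```

With zero yaw the box extends 2 units along x; after a 90 degree yaw the same extent lies along y.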

Types of bounding boxes

Axis-aligned bounding box (AABB)

An axis-aligned bounding box has edges that are parallel to the coordinate axes of the image or scene. Because the rectangle's sides are always horizontal and vertical, an AABB is simple to compute and store. It requires only four values in 2D (or six in 3D). AABBs are the default type used in most object detection systems.

The main limitation of AABBs is that they can include a large amount of background area when the enclosed object is elongated or rotated at an angle. For example, a diagonally oriented ship in a satellite image would have an AABB that contains substantial empty space around the actual ship.

Oriented bounding box (OBB)

An oriented bounding box (also called a rotated bounding box) is a rectangle that can be rotated to align with the orientation of the object it encloses. An OBB is typically parameterized by its center (cx, cy), width, height, and a rotation angle (theta). This additional degree of freedom allows the bounding box to fit tightly around elongated or angled objects, reducing the amount of background included.
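To illustrate the difference in tightness, the sketch below computes the corners of an oriented box and the axis-aligned box that encloses them. It assumes theta is in radians and measured counter-clockwise in a standard mathematical frame; angle conventions vary between tools, and the helper names are illustrative.

```python
import math


def obb_corners(cx, cy, w, h, theta):
    """Return the four corners of a rectangle rotated by `theta` radians about its center."""
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    corners = []
    for dx, dy in ((-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)):
        corners.append((cx + dx * cos_t - dy * sin_t, cy + dx * sin_t + dy * cos_t))
    return corners


def enclosing_aabb(points):
    """Return the axis-aligned box (x_min, y_min, x_max, y_max) around a set of points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))


# A long, thin object (100 x 20) lying at 45 degrees.
corners = obb_corners(0, 0, 100, 20, math.radians(45))
x_min, y_min, x_max, y_max = enclosing_aabb(corners)
obb_area = 100 * 20
aabb_area = (x_max - x_min) * (y_max - y_min)
print(round(aabb_area / obb_area, 2))  # 3.6
```

In this example the axis-aligned box covers 3.6 times the area of the oriented box, so most of it is background.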

OBBs are particularly useful in remote sensing (for detecting ships, aircraft, and vehicles in aerial imagery), scene text detection (where text lines may appear at arbitrary angles), and industrial inspection. Models like YOLOv8-OBB and oriented versions of Faster R-CNN have been developed specifically to predict oriented bounding boxes.

Comparison of AABB and OBB

Property	Axis-aligned (AABB)	Oriented (OBB)
Parameters	4 (x_min, y_min, x_max, y_max)	5 (cx, cy, w, h, theta)
Rotation support	No	Yes
Tightness of fit	Loose for rotated objects	Tight for rotated objects
Computational cost	Lower	Higher
Common use cases	General object detection	Aerial/satellite imagery, text detection
IoU calculation	Simple	Requires polygon intersection

Bounding boxes in object detection

Object detection is the task of identifying and localizing objects in images or videos. Bounding box prediction is a core component of virtually every modern object detection system. These systems output a list of detections, where each detection consists of a bounding box, a class label (such as "car" or "person"), and a confidence score indicating the model's certainty.

Two-stage detectors

Two-stage detectors split the detection process into two phases. In the first stage, the model generates a set of region proposals, which are candidate bounding boxes that might contain objects. In the second stage, the model classifies each proposal and refines its bounding box coordinates.

R-CNN (2014). The Regions with Convolutional Neural Network features (R-CNN) method, introduced by Ross Girshick and colleagues, was one of the first approaches to combine CNNs with region proposals for object detection. It used selective search to generate approximately 2,000 candidate regions per image, extracted CNN features from each region, and classified them with a support vector machine. R-CNN achieved a mean average precision (mAP) of 53.3% on PASCAL VOC 2012, a relative improvement of more than 30% over prior methods.

Fast R-CNN (2015). Fast R-CNN improved on R-CNN by introducing Region of Interest (RoI) pooling, which allowed the CNN features to be computed once for the entire image rather than separately for each region proposal. It also combined the classification and bounding box regression tasks into a single multi-task loss function. With a VGG16 backbone, this made training 9 times faster and inference 213 times faster than R-CNN.

Faster R-CNN (2015). Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun introduced the Region Proposal Network (RPN), a fully convolutional network that generates region proposals directly from feature maps. The RPN shares convolutional features with the detection network, eliminating the computational bottleneck of external proposal methods. Faster R-CNN achieved near real-time performance at 5 frames per second on a GPU while maintaining high accuracy. It formed the foundation of many winning entries at the ILSVRC and COCO 2015 competitions.

Single-stage detectors

Single-stage detectors predict bounding boxes and class probabilities directly from the full image in a single pass through the network, without a separate region proposal step. This design typically results in faster inference at the cost of some accuracy compared to two-stage methods.

YOLO (2016). YOLO (You Only Look Once), introduced by Joseph Redmon and colleagues, reframed object detection as a single regression problem. The model divides the input image into a grid of cells. Each cell predicts a fixed number of bounding boxes along with confidence scores and class probabilities. The base YOLO model processed images at 45 frames per second, while the smaller Fast YOLO variant reached 155 fps. Subsequent versions (YOLOv2 through YOLOv11 and beyond) introduced anchor boxes, multi-scale feature maps, and improved loss functions.

SSD (2016). The Single Shot MultiBox Detector, proposed by Wei Liu and colleagues, discretizes the output space of bounding boxes into a set of default boxes with different aspect ratios and scales at each feature map location. SSD combines predictions from multiple feature maps with different resolutions, allowing it to detect objects of various sizes. On VOC 2007, SSD achieved 72.1% mAP at 58 fps with 300x300 input and 75.1% mAP with 500x500 input.

RetinaNet (2017). Tsung-Yi Lin and colleagues identified that the extreme class imbalance between foreground and background examples in single-stage detectors was the primary obstacle to matching two-stage detector accuracy. They introduced focal loss, which down-weights the loss contribution of well-classified (easy) examples so training focuses on hard cases. RetinaNet matched the speed of single-stage detectors while surpassing the accuracy of two-stage detectors of that era.
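A small numeric sketch makes the down-weighting concrete. It implements the binary focal loss for a single prediction, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t), with gamma = 2 and alpha = 0.25 as defaults; the function names are illustrative and a real detector would apply this over tensors of anchors.

```python
import math


def cross_entropy(p, y):
    """Binary cross-entropy for one prediction; p is the predicted foreground probability."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one prediction (y is 1 for foreground, 0 for background)."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # The (1 - p_t) ** gamma factor shrinks the loss of confident, correct predictions.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# Ratio of focal loss to plain cross-entropy for an easy and a hard positive example.
easy_ratio = focal_loss(0.95, 1) / cross_entropy(0.95, 1)
hard_ratio = focal_loss(0.10, 1) / cross_entropy(0.10, 1)
print(easy_ratio, hard_ratio)
```

The easy example keeps well under 1% of its cross-entropy loss, while the hard example keeps about 20%, so hard cases dominate the total.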

Anchor-free and transformer-based detectors

DETR (2020). The DEtection Transformer eliminated the need for anchor boxes and non-maximum suppression entirely. DETR treats object detection as a direct set prediction problem, using a transformer encoder-decoder architecture. It employs a set-based global loss using the Hungarian algorithm to find a bipartite matching between predicted and ground truth objects, ensuring each prediction maps to at most one ground truth object. DETR matched Faster R-CNN in accuracy on the COCO dataset with a simpler architecture.
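The idea of one-to-one matching can be sketched with a brute-force search over assignments. This is a stand-in for the Hungarian algorithm (which finds the same optimum far more efficiently), and the cost used here, 1 - IoU, is a simplifying assumption; DETR's actual matching cost also includes class and box-distance terms.

```python
from itertools import permutations


def iou(a, b):
    """IoU of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_predictions(pred_boxes, gt_boxes):
    """Return (pred_index, gt_index) pairs minimizing the total (1 - IoU) cost.

    Assumes there are at least as many predictions as ground truth boxes.
    Brute force: only practical for a handful of boxes.
    """
    best_cost, best_assignment = float("inf"), ()
    for assignment in permutations(range(len(pred_boxes)), len(gt_boxes)):
        cost = sum(1.0 - iou(pred_boxes[p], gt_boxes[g]) for g, p in enumerate(assignment))
        if cost < best_cost:
            best_cost, best_assignment = cost, assignment
    return [(p, g) for g, p in enumerate(best_assignment)]


preds = [(50, 50, 100, 100), (0, 0, 10, 10), (200, 200, 220, 220)]
gts = [(1, 1, 11, 11), (48, 52, 98, 102)]
matches = match_predictions(preds, gts)
print(matches)  # [(1, 0), (0, 1)]
```

Prediction 2 is left unmatched, which in a set prediction framework corresponds to a "no object" target.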

Summary of major object detection architectures

Model	Year	Type	Key bounding box innovation	Speed (fps)	mAP (COCO/VOC)
R-CNN	2014	Two-stage	CNN features + selective search proposals	~0.03	53.3% (VOC 2012)
Fast R-CNN	2015	Two-stage	RoI pooling, multi-task loss	~0.5	66.9% (VOC 2007)
Faster R-CNN	2015	Two-stage	Region Proposal Network (RPN)	~5	73.2% (VOC 2007)
YOLO	2016	Single-stage	Grid-based regression	45	63.4% (VOC 2007)
SSD	2016	Single-stage	Multi-scale default boxes	58	72.1% (VOC 2007)
RetinaNet	2017	Single-stage	Focal loss	~5	40.8% (COCO)
YOLOv4	2020	Single-stage	CIoU loss, mosaic augmentation	~65	43.5% (COCO)
DETR	2020	Transformer	Anchor-free, set prediction, Hungarian matching	~28	42.0% (COCO)

Anchor boxes

Anchor boxes (also called prior boxes or default boxes) are a set of predefined bounding boxes with fixed aspect ratios and scales that serve as reference templates for detection. Instead of predicting bounding box coordinates from scratch, the model predicts offsets (adjustments) relative to these anchor boxes.

In Faster R-CNN, the Region Proposal Network uses anchors at each spatial position in the feature map. The original implementation used 9 anchors per position (3 scales and 3 aspect ratios). For each anchor, the network predicts an objectness score (whether the anchor contains an object or background) and four bounding box refinement values (delta_x, delta_y, delta_w, delta_h).
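A minimal sketch of generating the 9 anchors for one position is shown below. The scale and ratio values are the ones commonly quoted for the original Faster R-CNN setup (anchor areas of 128^2, 256^2, and 512^2 pixels with ratios 1:2, 1:1, and 2:1), but the function itself and the ratio-as-height/width convention are assumptions for illustration.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (x_min, y_min, x_max, y_max) anchors centered at (cx, cy).

    Each anchor has area scale**2 and aspect ratio height / width == ratio.
    """
    anchors = []
    for scale in scales:
        for ratio in ratios:
            w = scale / math.sqrt(ratio)
            h = scale * math.sqrt(ratio)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


anchors = make_anchors(100, 100)
print(len(anchors))  # 9
print(anchors[1])    # (36.0, 36.0, 164.0, 164.0)
```

In a full detector this function would be evaluated at every feature map position, with (cx, cy) mapped back to image coordinates through the feature stride.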

In YOLO versions 2 through 5, anchor box dimensions were determined using k-means clustering on the training set bounding boxes. The cluster centroids served as the anchor dimensions, ensuring that the anchors matched the distribution of object sizes in the data.

SSD uses a similar concept called default boxes. At each feature map cell, SSD places default boxes of several aspect ratios (typically 1:1, 2:1, 1:2, 3:1, and 1:3), plus one additional 1:1 box at a larger scale. Because SSD uses feature maps at multiple resolutions, small default boxes on high-resolution feature maps detect small objects, while large default boxes on low-resolution feature maps detect large objects.

Anchor-free methods such as FCOS (Fully Convolutional One-Stage detection), CenterNet, and DETR have emerged as alternatives that predict bounding boxes without relying on predefined anchors. FCOS regresses the distances from a location inside the object to the four edges of its bounding box, CenterNet predicts an object's center point together with its width and height, and DETR outputs box coordinates directly from learned object queries.

Bounding box regression

Bounding box regression is the process by which a neural network learns to predict the coordinates of a bounding box around a detected object. Rather than predicting absolute coordinates directly, most models predict offsets or transformations relative to anchor boxes or predefined reference points.

Parameterization

In the Faster R-CNN family of detectors, bounding box regression targets are parameterized as:

t_x = (x - x_a) / w_a
t_y = (y - y_a) / h_a
t_w = log(w / w_a)
t_h = log(h / h_a)

Here, (x, y, w, h) are the predicted box's center coordinates and dimensions, and (x_a, y_a, w_a, h_a) are the corresponding anchor box values. The logarithmic transform for width and height ensures that scale changes are symmetric (doubling and halving produce equal magnitude offsets).
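The parameterization and its inverse can be sketched as a pair of functions. Boxes and anchors are given in center format (x, y, w, h); the names encode and decode are illustrative.

```python
import math


def encode(box, anchor):
    """Compute (t_x, t_y, t_w, t_h) for a box relative to an anchor."""
    x, y, w, h = box
    x_a, y_a, w_a, h_a = anchor
    return ((x - x_a) / w_a, (y - y_a) / h_a, math.log(w / w_a), math.log(h / h_a))


def decode(deltas, anchor):
    """Invert encode: recover (x, y, w, h) from offsets and the anchor."""
    t_x, t_y, t_w, t_h = deltas
    x_a, y_a, w_a, h_a = anchor
    return (x_a + t_x * w_a, y_a + t_y * h_a, w_a * math.exp(t_w), h_a * math.exp(t_h))


anchor = (100.0, 100.0, 50.0, 50.0)
deltas = encode((110.0, 95.0, 100.0, 25.0), anchor)
print(deltas)                  # t_x = 0.2, t_y = -0.1, t_w = log 2, t_h = -log 2
print(decode(deltas, anchor))  # approximately (110.0, 95.0, 100.0, 25.0)
```

Doubling the width gives t_w = log 2 and halving the height gives t_h = -log 2, which is the symmetry described above.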

Loss functions for bounding box regression

The choice of loss function for bounding box regression has a significant effect on detection performance. Several loss functions have been developed, each addressing specific limitations of its predecessors.

Loss function	Year	Key idea	Advantage
Smooth L1 loss	2015	Piecewise combination of L1 and L2 loss	Robust to outliers; used in Fast/Faster R-CNN
IoU loss	2016	Directly optimizes the IoU between predicted and ground truth boxes	Scale-invariant; optimizes the evaluation metric itself
GIoU loss	2019	Adds penalty based on the area of the smallest enclosing box	Provides gradient signal even when boxes do not overlap
DIoU loss	2020	Adds penalty based on the normalized distance between box centers	Faster convergence than GIoU
CIoU loss	2020	Combines overlap area, center distance, and aspect ratio penalty	Considers all three geometric factors; used in YOLOv4 and later

Smooth L1 loss was introduced in Fast R-CNN. It behaves like L2 loss when the error is small and like L1 loss when the error is large, making it less sensitive to outliers than pure L2 loss.
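As a sketch, the piecewise form used in Fast R-CNN (with the transition at an absolute error of 1) can be written for a single scalar error as follows; in practice it is summed over the four regression targets.

```python
def smooth_l1(x):
    """Smooth L1 loss of a scalar error x: quadratic near zero, linear elsewhere."""
    abs_x = abs(x)
    if abs_x < 1.0:
        return 0.5 * x * x   # L2-like region: small gradient for small errors
    return abs_x - 0.5       # L1-like region: gradient magnitude capped at 1


print(smooth_l1(0.5))  # 0.125
print(smooth_l1(3.0))  # 2.5
```

For comparison, a pure L2 loss of 0.5 * x**2 would give 4.5 for an error of 3, so a single outlier contributes far more to the total.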

IoU loss treats the IoU metric itself as a loss function (loss = 1 - IoU). Because IoU is scale-invariant and directly corresponds to the evaluation metric, optimizing it aligns training with the evaluation objective. However, standard IoU loss has a critical flaw: when two boxes do not overlap at all, the IoU is zero regardless of how far apart they are, providing no gradient signal for optimization.

Generalized IoU (GIoU) loss, introduced by Rezatofighi and colleagues in 2019, addresses this by incorporating the smallest enclosing box that contains both the predicted and ground truth boxes. The GIoU penalty ensures that even non-overlapping boxes receive a meaningful gradient, guiding the predicted box toward the target.

Distance IoU (DIoU) loss adds a penalty based on the normalized Euclidean distance between the centers of the predicted and ground truth boxes. This explicit center-distance term accelerates convergence compared to GIoU.

Complete IoU (CIoU) loss, proposed by Zheng and colleagues in 2020, combines three geometric factors: overlap area, center distance, and aspect ratio consistency. CIoU was adopted in YOLOv4 and has since become a standard choice in many detection frameworks.

Intersection over Union (IoU)

IoU (also known as the Jaccard index in set theory) is the standard metric for evaluating bounding box predictions. It measures the degree of overlap between two bounding boxes by dividing the area of their intersection by the area of their union:

IoU = Area of Intersection / Area of Union

The IoU value ranges from 0 (no overlap) to 1 (perfect overlap). In practice, a predicted bounding box is typically considered a true positive if its IoU with a ground truth box exceeds a predefined threshold. Common thresholds include 0.5 (used in the PASCAL VOC challenge) and 0.5:0.05:0.95 (a range of thresholds used in the COCO evaluation protocol).

IoU calculation for axis-aligned boxes

For two axis-aligned bounding boxes A = (x1_a, y1_a, x2_a, y2_a) and B = (x1_b, y1_b, x2_b, y2_b):

1. Compute the intersection rectangle:
   - x1_i = max(x1_a, x1_b)
   - y1_i = max(y1_a, y1_b)
   - x2_i = min(x2_a, x2_b)
   - y2_i = min(y2_a, y2_b)
2. Compute the intersection area: max(0, x2_i - x1_i) * max(0, y2_i - y1_i)
3. Compute the union area: Area(A) + Area(B) - Intersection area
4. IoU = Intersection area / Union area

This calculation runs in constant time, making it efficient even when evaluating thousands of detection pairs.
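The calculation translates directly into a short function. This is a minimal sketch for boxes in (x1, y1, x2, y2) format; returning 0 when the union is empty is a choice made here for degenerate boxes.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned (x1, y1, x2, y2) boxes."""
    # Intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    # Clamp at zero so that disjoint boxes produce no negative "overlap"
    intersection = max(0.0, x2 - x1) * max(0.0, y2 - y1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


print(iou((0, 0, 10, 10), (5, 5, 15, 15)))    # 25 / 175, about 0.143
print(iou((0, 0, 10, 10), (20, 20, 30, 30)))  # 0.0
```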

IoU variants

Variant	Formula modification	Purpose
Standard IoU	Intersection / Union	Basic overlap measurement
GIoU	IoU - (C - Union) / C, where C is the area of the smallest enclosing box	Handles non-overlapping boxes
DIoU	IoU - (d^2 / c^2), where d is center distance and c is enclosing box diagonal	Penalizes center distance
CIoU	DIoU - alpha * v, where v captures aspect ratio consistency	Penalizes center distance and aspect ratio
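The formulas in the table can be sketched in one function. The aspect ratio term follows the usual CIoU definition, v = (4 / pi^2) * (arctan(w_b / h_b) - arctan(w_a / h_a))^2 with alpha = v / ((1 - IoU) + v); the function assumes both boxes have positive width and height, and the name iou_variants is illustrative.

```python
import math


def iou_variants(a, b):
    """Return (IoU, GIoU, DIoU, CIoU) for two (x1, y1, x2, y2) boxes with positive area."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    w_a, h_a = a[2] - a[0], a[3] - a[1]
    w_b, h_b = b[2] - b[0], b[3] - b[1]
    union = w_a * h_a + w_b * h_b - inter
    iou = inter / union

    # Smallest axis-aligned box enclosing both inputs
    ex1, ey1 = min(a[0], b[0]), min(a[1], b[1])
    ex2, ey2 = max(a[2], b[2]), max(a[3], b[3])
    enclosing_area = (ex2 - ex1) * (ey2 - ey1)
    giou = iou - (enclosing_area - union) / enclosing_area

    # Squared center distance over squared enclosing-box diagonal
    d2 = ((a[0] + a[2]) / 2 - (b[0] + b[2]) / 2) ** 2 + ((a[1] + a[3]) / 2 - (b[1] + b[3]) / 2) ** 2
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2
    diou = iou - d2 / c2

    # Aspect ratio consistency term
    v = (4 / math.pi ** 2) * (math.atan(w_b / h_b) - math.atan(w_a / h_a)) ** 2
    alpha = v / ((1 - iou) + v) if v > 0 else 0.0
    ciou = diou - alpha * v
    return iou, giou, diou, ciou


# Two disjoint 10 x 10 boxes: IoU is 0, but GIoU and DIoU still reflect the gap.
print(iou_variants((0, 0, 10, 10), (20, 0, 30, 10)))
```

For the disjoint pair, IoU is 0 while GIoU is about -0.333 and DIoU is -0.4, so a loss built on either variant still changes as the boxes move closer together.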

Non-maximum suppression (NMS)

Object detection models typically produce many overlapping bounding box predictions for a single object. Non-maximum suppression (NMS) is a post-processing step that removes redundant detections by keeping only the highest-confidence box among a group of overlapping predictions.

Standard NMS algorithm

1. Sort all predicted bounding boxes by their confidence scores in descending order.
2. Select the box with the highest confidence score and add it to the final output list.
3. Compute the IoU between this selected box and all remaining boxes.
4. Remove any remaining box whose IoU with the selected box exceeds a threshold (commonly 0.5).
5. Repeat steps 2 through 4 until no boxes remain.

The result is a reduced set of bounding boxes where each detected object is represented by a single high-confidence prediction.
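The procedure can be sketched in plain Python as follows. This is a straightforward, unoptimized version for a single class; production implementations are vectorized and usually applied per class.

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes kept by standard NMS, highest score first."""
    # Sort indices by descending confidence
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        # Take the most confident remaining box
        best = order.pop(0)
        keep.append(best)
        # Drop every remaining box that overlaps it too much
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

The second box overlaps the first with an IoU of about 0.68, so it is suppressed; the third box does not overlap and is kept.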

Soft-NMS

Standard NMS uses a hard threshold: any box with IoU above the threshold is completely discarded. This can be problematic when objects are close together or overlapping, as legitimate detections may be suppressed. Soft-NMS, introduced by Bodla and colleagues in 2017, addresses this by reducing the confidence score of overlapping boxes instead of removing them entirely. The score decays as a function of the overlap, using either a linear or a Gaussian penalty, so that closely spaced objects can still be detected.
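A minimal sketch of Soft-NMS with a Gaussian penalty is shown below: each remaining score is multiplied by exp(-IoU^2 / sigma) after a box is selected. The default values for sigma and the final score cutoff are assumptions chosen for illustration.

```python
import math


def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_threshold=0.001):
    """Return (index, rescored confidence) pairs in the order boxes are selected."""
    scores = list(scores)  # work on a copy so the caller's scores are untouched
    remaining = list(range(len(boxes)))
    keep = []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        remaining.remove(best)
        keep.append((best, scores[best]))
        # Decay, rather than delete, the boxes that overlap the selected one
        for i in remaining:
            scores[i] *= math.exp(-iou(boxes[best], boxes[i]) ** 2 / sigma)
        remaining = [i for i in remaining if scores[i] > score_threshold]
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
result = soft_nms(boxes, [0.9, 0.8, 0.7])
print([index for index, _ in result])  # [0, 2, 1]
```

Unlike standard NMS, the overlapping second box survives, but with a reduced score that ranks it below the non-overlapping third box.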

DIoU-NMS

DIoU-NMS replaces the standard IoU criterion in NMS with the DIoU metric, which considers both overlap and center distance. Two boxes with the same IoU but different center distances receive different treatment: boxes with more distant centers are less likely to be suppressed, preserving detections of adjacent objects.

Annotation formats and datasets

Bounding box annotations are stored in various file formats depending on the detection framework and dataset. The annotation file encodes the class label and bounding box coordinates for each object in each image.

Major annotation formats

Format	File type	Coordinate convention	Normalization	Notable users
Pascal VOC	XML	(x_min, y_min, x_max, y_max)	Absolute pixels	Pascal VOC dataset
COCO	JSON	(x_min, y_min, width, height)	Absolute pixels	COCO dataset
YOLO Darknet	TXT	(cx, cy, width, height)	Normalized [0, 1]	Darknet, YOLOv3, YOLOv4
YOLO PyTorch	TXT + YAML	(cx, cy, width, height)	Normalized [0, 1]	YOLOv5, YOLOv8 (Ultralytics)
TensorFlow TFRecord	Binary protobuf	(y_min, x_min, y_max, x_max)	Normalized [0, 1]	TensorFlow Object Detection API
CVAT	XML/JSON	(x_min, y_min, x_max, y_max)	Absolute pixels	CVAT annotation tool

Format conversion tools such as Roboflow and FiftyOne allow datasets to be imported in one format and exported in another, reducing the manual effort of reformatting annotations when switching between detection frameworks.
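For simple cases the conversion is a few lines of arithmetic. The sketch below converts a COCO-style box to a YOLO-style label line; the function names are illustrative, and a real converter would also need to handle class ID mapping and file layout.

```python
def coco_to_yolo(bbox, image_width, image_height):
    """COCO (x_min, y_min, width, height) in pixels -> YOLO normalized (cx, cy, w, h)."""
    x_min, y_min, w, h = bbox
    return ((x_min + w / 2) / image_width,
            (y_min + h / 2) / image_height,
            w / image_width,
            h / image_height)


def voc_to_coco(bbox):
    """Pascal VOC (x_min, y_min, x_max, y_max) -> COCO (x_min, y_min, width, height)."""
    x_min, y_min, x_max, y_max = bbox
    return (x_min, y_min, x_max - x_min, y_max - y_min)


def yolo_label_line(class_id, bbox, image_width, image_height):
    """Format one object as a YOLO text label: 'class cx cy w h'."""
    cx, cy, w, h = coco_to_yolo(bbox, image_width, image_height)
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


line = yolo_label_line(0, voc_to_coco((100, 50, 300, 150)), 640, 480)
print(line)  # 0 0.312500 0.208333 0.312500 0.208333
```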

Benchmark datasets

Pascal VOC. The PASCAL Visual Object Classes challenge, active from 2005 to 2012, was one of the earliest large-scale benchmarks for object detection. It contains 20 object categories and uses a single IoU threshold of 0.5 for evaluation. Detection performance is reported as mAP at IoU 0.5.

COCO. The COCO dataset (Common Objects in Context) contains over 200,000 labeled images spanning 80 object categories. It uses a more stringent evaluation protocol than Pascal VOC, averaging mAP across IoU thresholds from 0.5 to 0.95 in steps of 0.05. This multi-threshold evaluation provides a more nuanced assessment of bounding box accuracy.

Open Images. Google's Open Images dataset contains about 9 million images, of which roughly 1.9 million carry bounding box annotations covering 600 object categories. It is one of the largest publicly available detection datasets.

Annotation tools

Creating bounding box annotations for training data is a labor-intensive process. Several software tools have been developed to streamline this workflow.

Tool	Type	Key features
LabelImg	Desktop	Lightweight, supports Pascal VOC and YOLO formats
CVAT	Web-based	Supports 19+ export formats, AI-assisted annotation, video tracking
Labelbox	Cloud platform	Team collaboration, model-assisted labeling, quality assurance workflows
Roboflow	Cloud platform	Annotation, data augmentation, format conversion, model training
VGG Image Annotator (VIA)	Browser-based	No installation required, supports bounding boxes and polygons
Label Studio	Web-based (open source)	Multi-modal annotation (images, text, audio), customizable labeling interfaces

Semi-automated and AI-assisted annotation tools use pre-trained models to suggest initial bounding boxes, which human annotators then review and correct. This approach can reduce annotation time significantly, though human oversight remains necessary to ensure accuracy.

Evaluation metrics

Beyond IoU, several aggregate metrics are used to evaluate how well a detection model predicts bounding boxes across an entire dataset.

Precision and recall

Precision measures the fraction of predicted bounding boxes that are correct (true positives out of all positive predictions). Recall measures the fraction of ground truth objects that are successfully detected (true positives out of all ground truth objects). A predicted box is counted as a true positive if its IoU with a ground truth box exceeds the threshold and it has the correct class label.
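The counting can be sketched for a single image and a single class as follows. The greedy matching shown here (each prediction, in order of confidence, claims the best still-unmatched ground truth box) is a simplification; official evaluation toolkits differ in details such as duplicate handling and "difficult" or "crowd" annotations.

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall(predictions, ground_truths, iou_threshold=0.5):
    """predictions: list of (box, score); ground_truths: list of boxes of the same class."""
    matched = set()
    true_positives = 0
    for box, _score in sorted(predictions, key=lambda p: p[1], reverse=True):
        best_iou, best_gt = 0.0, None
        for gt_index, gt in enumerate(ground_truths):
            if gt_index in matched:
                continue  # each ground truth box can be matched only once
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_iou, best_gt = overlap, gt_index
        if best_gt is not None and best_iou > iou_threshold:
            matched.add(best_gt)
            true_positives += 1
    false_positives = len(predictions) - true_positives
    false_negatives = len(ground_truths) - true_positives
    precision = true_positives / (true_positives + false_positives) if predictions else 0.0
    recall = true_positives / (true_positives + false_negatives) if ground_truths else 0.0
    return precision, recall


gts = [(0, 0, 10, 10), (50, 50, 60, 60)]
preds = [((1, 1, 11, 11), 0.9), ((100, 100, 110, 110), 0.8), ((0, 0, 10, 10), 0.3)]
precision, recall = precision_recall(preds, gts)
print(precision, recall)  # one true positive out of three predictions and two objects
```

The third prediction overlaps a ground truth box perfectly but counts as a false positive, because that object was already claimed by a more confident prediction.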

Average precision (AP) and mean average precision (mAP)

Average precision (AP) summarizes the precision-recall curve for a single object class. It is computed as the area under the precision-recall curve, where predictions are ranked by confidence score. Mean average precision (mAP) is the mean of AP values across all object classes. In the COCO evaluation protocol, the primary metric is mAP averaged over ten IoU thresholds (0.5 to 0.95), often written as AP@[.5:.95].
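As a sketch, AP for one class can be computed from a ranked list of true/false positive flags. This version uses all-point interpolation (precision is first made monotonically non-increasing, then integrated over recall); other protocols, such as 11-point or 101-point sampling, give slightly different numbers. It assumes at least one ground truth object.

```python
def average_precision(is_true_positive, num_ground_truths):
    """AP from detections sorted by descending confidence.

    is_true_positive: list of booleans, one per detection, in ranked order.
    num_ground_truths: number of ground truth objects of this class (must be > 0).
    """
    tp = fp = 0
    precisions, recalls = [], []
    for hit in is_true_positive:
        if hit:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_ground_truths)

    # Precision envelope: each point takes the best precision at any higher recall
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Sum rectangle areas under the enveloped precision-recall curve
    ap, previous_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - previous_recall) * p
        previous_recall = r
    return ap


def mean_average_precision(ap_per_class):
    """mAP is the unweighted mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


ap = average_precision([True, False, True], num_ground_truths=2)
print(round(ap, 4))  # 0.8333
```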

Additional COCO metrics

Metric	Description
AP@0.5	Average precision at IoU threshold 0.5 (also called AP50)
AP@0.75	Average precision at IoU threshold 0.75 (stricter localization)
AP_small	AP for small objects (area less than 32x32 pixels)
AP_medium	AP for medium objects (area between 32x32 and 96x96 pixels)
AP_large	AP for large objects (area greater than 96x96 pixels)
AR@[1,10,100]	Average recall with 1, 10, or 100 maximum detections per image

Applications

Bounding boxes are used across a wide range of domains wherever objects need to be localized in visual data.

Autonomous driving

Autonomous driving systems use both 2D and 3D bounding boxes to detect and track vehicles, pedestrians, cyclists, and other road users. Camera-based detection produces 2D bounding boxes in image space, while LiDAR-based detection produces 3D bounding cuboids in world coordinates. Sensor fusion methods combine both modalities to improve detection reliability. Datasets such as KITTI, nuScenes, and Waymo Open Dataset provide 3D bounding box annotations for training and evaluation.

Medical imaging

In medical imaging, bounding boxes are used to localize abnormalities such as tumors, lesions, and fractures in X-rays, CT scans, and MRI images. Detection models trained with bounding box annotations can assist radiologists by highlighting regions of interest for further examination. This approach is particularly valuable for screening applications where large numbers of images must be reviewed efficiently.

Video surveillance and security

Surveillance systems use bounding boxes to detect and track people, vehicles, and other objects of interest across video feeds. Real-time object detection models like YOLO are commonly deployed in these systems due to their speed. Multi-object tracking algorithms maintain bounding box predictions across frames to follow individual objects over time.

Retail and inventory

Retail applications use bounding box detection to count products on shelves, detect out-of-stock conditions, and analyze customer behavior in stores. Automated checkout systems use object detection to identify products placed on a scanner or conveyor belt.

Robotics

Robotic systems use bounding boxes as part of their perception pipeline to identify and localize objects for grasping, manipulation, and navigation. In warehouse robotics, bounding box detection helps robots identify packages, bins, and obstacles.

Document analysis

Bounding boxes are used in optical character recognition (OCR) and document analysis to localize text regions, tables, figures, and other structural elements within scanned documents or images. Text detection models output bounding boxes (often oriented) around individual words, lines, or text blocks.

Software libraries and implementations

Several popular software libraries provide bounding box utilities for detection, evaluation, and visualization.

Library	Language	Bounding box features
torchvision (PyTorch)	Python	`box_convert`, `box_iou`, `nms`, `batched_nms`, `generalized_box_iou`, `complete_box_iou`, `distance_box_iou`
TensorFlow Object Detection API	Python	Bounding box encoding/decoding, evaluation metrics, pre-trained detection models
OpenCV	C++/Python	`boundingRect`, `minAreaRect`, contour-based bounding boxes
Detectron2 (Meta)	Python	Modular detection framework with built-in box operations, NMS, and evaluation
MMDetection (OpenMMLab)	Python	Support for 50+ detection models, comprehensive box utilities
Ultralytics	Python	YOLOv5/v8/v11 implementation with built-in box format handling

Limitations and alternatives

While bounding boxes are the most common spatial representation in object detection, they have several limitations.

Poor fit for irregular shapes. Bounding boxes are rectangular, so they include background pixels when the object has an irregular or non-convex shape. For example, a bounding box around a banana includes a large amount of background.

Ambiguity with occlusion. When objects overlap, their bounding boxes may also overlap substantially, making it difficult to determine which pixels belong to which object.

Lack of shape information. Bounding boxes provide no information about the actual shape or contour of the object. They indicate only position and spatial extent.

Alternatives to bounding boxes

Representation	Description	Trade-off
Instance segmentation masks	Per-pixel labels for each object instance	More precise but much more expensive to annotate
Polygons	Ordered set of vertices outlining the object	More precise than boxes; faster to annotate than pixel masks
Keypoints / landmarks	A set of named points on the object (e.g., joints of a human body)	Captures structure; does not define a region
Ellipses	Elliptical bounding regions	Better fit for rounded objects; less common
3D meshes	Full 3D shape representation	Highest precision; most expensive to create

Despite these limitations, bounding boxes remain dominant because they offer a practical balance between annotation cost, computational efficiency, and usefulness for downstream tasks. In many applications, knowing that an object is located within a particular rectangular region is sufficient for the task at hand.

History and development

The use of bounding boxes in computer vision predates deep learning. Early object detection methods in the 2000s, such as the Viola-Jones face detector (2001), used sliding window approaches where a fixed-size bounding box was scanned across the image at multiple scales and positions. The histogram of oriented gradients (HOG) detector by Dalal and Triggs (2005) used a similar sliding window approach for pedestrian detection.

The selective search algorithm (2013) introduced a more efficient approach by generating a smaller set of candidate bounding box proposals based on image segmentation. This method was adopted by R-CNN and became the standard proposal generation technique until it was replaced by learned proposal networks in Faster R-CNN.

The introduction of anchor-based single-stage detectors (SSD and YOLOv2 in 2016) and the subsequent development of anchor-free methods (CornerNet in 2018, FCOS in 2019, DETR in 2020) represent the two major shifts in how bounding boxes are predicted. The field continues to evolve, with recent work exploring end-to-end detection using vision transformers that output bounding boxes as part of a set prediction framework.

References

Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). "Rich feature hierarchies for accurate object detection and semantic segmentation." *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*.
Girshick, R. (2015). "Fast R-CNN." *Proceedings of the IEEE International Conference on Computer Vision (ICCV)*.
Ren, S., He, K., Girshick, R., & Sun, J. (2015). "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks." *Advances in Neural Information Processing Systems (NeurIPS)*. arXiv:1506.01497.
Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). "You Only Look Once: Unified, Real-Time Object Detection." *Proceedings of CVPR*. arXiv:1506.02640.
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., & Berg, A. C. (2016). "SSD: Single Shot MultiBox Detector." *Proceedings of the European Conference on Computer Vision (ECCV)*. arXiv:1512.02325.
Lin, T.-Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). "Focal Loss for Dense Object Detection." *Proceedings of ICCV*. arXiv:1708.02002.
Rezatofighi, H., Tsoi, N., Gwak, J., Sadeghian, A., Reid, I., & Savarese, S. (2019). "Generalized Intersection over Union: A Metric and a Loss for Bounding Box Regression." *Proceedings of CVPR*. arXiv:1902.09630.
Zheng, Z., Wang, P., Liu, W., Li, J., Ye, R., & Ren, D. (2020). "Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression." *Proceedings of AAAI*. arXiv:1911.08287.
Bochkovskiy, A., Wang, C.-Y., & Liao, H.-Y. M. (2020). "YOLOv4: Optimal Speed and Accuracy of Object Detection." arXiv:2004.10934.
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020). "End-to-End Object Detection with Transformers." *Proceedings of ECCV*. arXiv:2005.12872.
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). "Microsoft COCO: Common Objects in Context." *Proceedings of ECCV*. arXiv:1405.0312.
Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2010). "The Pascal Visual Object Classes (VOC) Challenge." *International Journal of Computer Vision*, 88(2), 303-338.
Bodla, N., Singh, B., Chellappa, R., & Davis, L. S. (2017). "Soft-NMS: Improving Object Detection with One Line of Code." *Proceedings of ICCV*. arXiv:1704.04503.
Viola, P. & Jones, M. (2001). "Rapid Object Detection Using a Boosted Cascade of Simple Features." *Proceedings of CVPR*.
Dalal, N. & Triggs, B. (2005). "Histograms of Oriented Gradients for Human Detection." *Proceedings of CVPR*.
