See also: machine learning terms
Intersection over Union (IoU) is a similarity metric that measures the overlap between two regions, typically a model's predicted region and a ground-truth region. It is the workhorse evaluation metric for object detection, semantic segmentation, instance segmentation, and multi-object tracking. The same quantity is known in set theory as the Jaccard index, after the Swiss botanist Paul Jaccard, who introduced it in 1901 and used it in his 1912 study comparing the floras of different alpine zones.
IoU is defined as the area of intersection divided by the area of union of two sets. It returns a value in the range [0, 1], where 0 means no overlap and 1 means perfect overlap. Because it is bounded, invariant to the overall scale of the regions, and easy to compute, IoU has become the default way to ask the question "how well does the predicted box, mask, or region match the truth?"
For any two sets A and B, intersection over union is:
IoU(A, B) = |A ∩ B| / |A ∪ B|
Using the identity |A ∪ B| = |A| + |B| - |A ∩ B|, the same metric can be written as:
IoU(A, B) = |A ∩ B| / (|A| + |B| - |A ∩ B|)
In classification terms, if A is the prediction and B is the ground truth, the intersection counts true positives (TP) while the union counts TP plus false positives (FP) plus false negatives (FN). This gives the equivalent form:
IoU = TP / (TP + FP + FN)
This last form makes it clear why IoU is symmetric: it punishes both over-prediction (extra area outside the truth, FP) and under-prediction (missed area inside the truth, FN) equally.
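A quick sanity check of this equivalence on binary masks (a minimal NumPy sketch; the masks themselves are arbitrary illustrations):

```python
import numpy as np

# Two 1-D binary masks: A is the prediction, B the ground truth.
A = np.array([1, 1, 1, 0, 0], dtype=bool)
B = np.array([0, 1, 1, 1, 0], dtype=bool)

iou_sets = (A & B).sum() / (A | B).sum()   # |A ∩ B| / |A ∪ B|

tp = (A & B).sum()    # predicted 1, truth 1
fp = (A & ~B).sum()   # predicted 1, truth 0
fn = (~A & B).sum()   # predicted 0, truth 1
iou_counts = tp / (tp + fp + fn)

assert iou_sets == iou_counts == 0.5       # both give 2 / 4
```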
Most object detectors emit axis-aligned bounding boxes. For two boxes A and B parameterised by corner coordinates (x1, y1, x2, y2), the intersection rectangle has width and height:
```
inter_w = max(0, min(A.x2, B.x2) - max(A.x1, B.x1))
inter_h = max(0, min(A.y2, B.y2) - max(A.y1, B.y1))
inter = inter_w * inter_h
```
The max(0, ...) clamps prevent negative intersection when boxes do not overlap. The union is:
```
area_A = (A.x2 - A.x1) * (A.y2 - A.y1)
area_B = (B.x2 - B.x1) * (B.y2 - B.y1)
union = area_A + area_B - inter
IoU = inter / union
```
A minimal PyTorch implementation that returns the pairwise (N, M) IoU matrix for two batches of boxes:

```python
import torch

def box_iou(boxes_a, boxes_b):
    # boxes shape: (N, 4) and (M, 4) in (x1, y1, x2, y2) format
    area_a = (boxes_a[:, 2] - boxes_a[:, 0]) * (boxes_a[:, 3] - boxes_a[:, 1])
    area_b = (boxes_b[:, 2] - boxes_b[:, 0]) * (boxes_b[:, 3] - boxes_b[:, 1])
    # Broadcast to all (N, M) pairs: top-left and bottom-right of each intersection.
    lt = torch.max(boxes_a[:, None, :2], boxes_b[None, :, :2])
    rb = torch.min(boxes_a[:, None, 2:], boxes_b[None, :, 2:])
    wh = (rb - lt).clamp(min=0)            # zero width/height for disjoint pairs
    inter = wh[..., 0] * wh[..., 1]
    union = area_a[:, None] + area_b[None, :] - inter
    return inter / union.clamp(min=1e-7)   # clamp guards against division by zero
```
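As a quick check, two 10x10 boxes offset by 5 pixels in each direction share a 5x5 intersection, so the expected IoU is 25 / 175 ≈ 0.143:

```python
a = torch.tensor([[0.0, 0.0, 10.0, 10.0]])
b = torch.tensor([[5.0, 5.0, 15.0, 15.0]])
print(box_iou(a, b))  # tensor([[0.1429]])
```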
For segmentation masks, IoU is computed pixel-wise on binary masks: count the pixels where both masks are 1 (intersection) and divide by the count of pixels where either mask is 1 (union). The same logic extends to volumetric masks in 3D segmentation by replacing pixel counts with voxel counts.
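A minimal sketch of mask IoU on binary NumPy arrays (the convention of scoring two empty masks as 1.0, and the helper name `mask_iou`, are choices made here, not a fixed standard):

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU of two same-shape binary masks (2-D pixels or 3-D voxels)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return 1.0 if union == 0 else inter / union  # empty vs empty: perfect match
```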
A detection is usually scored as a true positive only when its IoU against the matched ground-truth box exceeds a threshold. The threshold turns IoU into a binary decision, which feeds into precision, recall, and average precision (AP). Different benchmarks use different thresholds.
| benchmark | iou threshold | metric reported | notes |
|---|---|---|---|
| PASCAL VOC 2007 / 2012 | 0.5 | mAP@0.5 | Single threshold; introduced in the 2007 challenge |
| COCO | 0.50, 0.55, ..., 0.95 | mAP@[.5:.95] | Average of AP at 10 thresholds in 0.05 steps |
| Open Images | 0.5 (with hierarchy) | mAP@0.5 | Hierarchical class matching |
| LVIS | 0.50, 0.55, ..., 0.95 | mAP plus AP for rare classes | COCO-style averaging with long-tail evaluation |
| Cityscapes (segmentation) | per-pixel | mean IoU per class | Reports both class-level and category-level mIoU |
| ADE20K (segmentation) | per-pixel | mean IoU over 150 classes | Pixel accuracy reported alongside |
The COCO scheme of averaging AP across IoU thresholds from 0.5 to 0.95 was introduced by Tsung-Yi Lin and colleagues in the 2014 "Microsoft COCO: Common Objects in Context" paper. It rewards detectors that produce tight boxes, not just boxes that are roughly in the right place. This is why a model can score 60% mAP@0.5 on PASCAL VOC but only 35% mAP@[.5:.95] on COCO: the stricter thresholds penalise sloppy localisation.
IoU is the matching criterion that turns raw detector outputs into the inputs for mAP. For each predicted box, the evaluator finds the ground-truth box of the same class with the highest IoU. If that IoU is above the threshold and the ground-truth box has not already been matched, the prediction counts as a true positive. Otherwise it is a false positive. Unmatched ground-truth boxes become false negatives.
Sorting predictions by their confidence score and walking down the list produces a precision-recall curve. The area under that curve is the average precision (AP) for one class at one IoU threshold. Averaging AP across classes gives mAP. Averaging mAP across IoU thresholds (the COCO style) gives mAP@[.5:.95]. Without IoU as the matching rule, none of this machinery works.
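A simplified sketch of that matching step for one image and one class (real evaluators such as COCO's also handle crowd regions and per-category bookkeeping; `match_detections` and its signature are illustrative):

```python
import numpy as np

def match_detections(ious, scores, thr=0.5):
    """Greedy IoU matching for one image and one class (simplified sketch).

    ious   : (num_pred, num_gt) IoU matrix between predictions and ground truth
    scores : (num_pred,) confidence scores
    Returns a boolean array: True where a prediction counts as a true positive.
    """
    num_pred, num_gt = ious.shape
    gt_taken = np.zeros(num_gt, dtype=bool)
    is_tp = np.zeros(num_pred, dtype=bool)
    for i in np.argsort(-scores):            # most confident predictions first
        free = np.flatnonzero(~gt_taken)     # ground truths not yet matched
        if free.size == 0:
            break                            # remaining predictions stay FP
        j = free[ious[i, free].argmax()]     # best still-available ground truth
        if ious[i, j] >= thr:
            gt_taken[j] = True
            is_tp[i] = True
    return is_tp
```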
Object detectors typically produce many overlapping predictions for the same object. Non-maximum suppression (NMS) trims this down. The classic algorithm is:

1. Sort all candidate boxes by confidence score.
2. Take the highest-scoring box and add it to the output list.
3. Remove every remaining box whose IoU with that box exceeds a threshold (commonly 0.5).
4. Repeat from step 2 until no candidates remain.
A lower threshold suppresses more aggressively (fewer duplicate boxes, more risk of merging two truly distinct objects). A higher threshold keeps more candidates (more duplicates, more crowded output). The choice is dataset-dependent: crowd-heavy benchmarks like CrowdHuman tend to use higher thresholds.
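A minimal sketch of the classic algorithm, reusing the `box_iou` helper defined earlier (in practice most codebases call `torchvision.ops.nms` instead):

```python
import torch

def nms(boxes, scores, iou_thr=0.5):
    """Greedy NMS: boxes (N, 4) in (x1, y1, x2, y2), scores (N,)."""
    order = scores.argsort(descending=True)   # best boxes first
    keep = []
    while order.numel() > 0:
        i = order[0]
        keep.append(i.item())
        if order.numel() == 1:
            break
        # Suppress every remaining box that overlaps the kept box too much.
        ious = box_iou(boxes[i].unsqueeze(0), boxes[order[1:]]).squeeze(0)
        order = order[1:][ious <= iou_thr]
    return torch.tensor(keep)
```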
Variants such as Soft-NMS and DIoU-NMS modify how IoU is used. Soft-NMS decays the score of overlapping boxes instead of removing them outright. DIoU-NMS, proposed alongside the DIoU loss by Zheng et al. in 2020, replaces vanilla IoU with the distance-aware DIoU score, which improves results in dense scenes where two correct boxes are physically close.
Using IoU directly as a training loss is appealing because it is the same quantity used at evaluation time. The most common forms are L_IoU = 1 - IoU and the closely related -ln(IoU). The first published use of IoU as a regression loss for object detection is the UnitBox network of Yu et al., presented at ACM Multimedia 2016. UnitBox regressed all four box bounds jointly through an IoU loss, in contrast to the prior practice of treating each coordinate as an independent target with smooth L1 loss.
Vanilla IoU loss has two well-known failure modes. First, when the predicted and ground-truth boxes do not overlap at all, IoU is zero and the gradient is zero everywhere, so the network gets no signal about which direction to move. Second, two pairs of non-overlapping boxes with very different distances between them have the same IoU loss of 1, even though one pair is much closer to a useful prediction than the other. A series of follow-up losses fixes these problems by adding penalty terms.
| loss | year / venue | extra term added | what problem it fixes |
|---|---|---|---|
| IoU loss (UnitBox) | Yu et al., ACM MM 2016 | none, just 1 - IoU or -log(IoU) | Joint regression of all four bounds |
| GIoU | Rezatofighi et al., CVPR 2019 | Penalty for the empty area of the smallest enclosing box | Non-zero gradient when boxes do not overlap |
| DIoU | Zheng et al., AAAI 2020 | Normalised squared distance between box centres | Faster convergence, better behaviour for distant boxes |
| CIoU | Zheng et al., AAAI 2020 | DIoU plus an aspect-ratio consistency term | Three geometric factors at once: overlap, distance, shape |
| EIoU | Zhang et al., arXiv 2021 (Neurocomputing 2022) | CIoU's aspect-ratio term replaced by explicit width and height penalties | More accurate aspect alignment; pairs with Focal-EIoU |
| Alpha-IoU | He et al., NeurIPS 2021 | Power transform with parameter α applied to IoU and the regularisation term | Stronger gradient on high-IoU samples; tunable accuracy |
| SIoU | Gevorgyan, arXiv 2022 | Angle, distance, shape, and overlap costs combined | Considers the angle between predicted and target box vectors |
GIoU works by computing C, the smallest axis-aligned box that encloses both A and B, then subtracting (|C| - |A ∪ B|) / |C| from the IoU. When the two boxes do not overlap, this penalty term still has a gradient that pulls them together. DIoU adds the squared distance between the two box centres, normalised by the squared diagonal length of C, which gives even faster convergence. CIoU keeps the DIoU distance term and adds a term measuring the difference in aspect ratio, which Zheng et al. reported as the missing third geometric factor. Several modern one-stage detectors, including Ultralytics' YOLOv5 and YOLOv8, use CIoU as the default regression loss; YOLOv8 also adds a Distribution Focal Loss (DFL) component on top.
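A sketch of the GIoU loss for matched box pairs, following the formula above (element-wise over N pred/target pairs; `torchvision.ops` also ships a pairwise `generalized_box_iou`):

```python
import torch

def giou_loss(pred, target, eps=1e-7):
    """1 - GIoU for paired boxes, both (N, 4) in (x1, y1, x2, y2) format."""
    # Plain IoU of each pair.
    wh = (torch.min(pred[:, 2:], target[:, 2:])
          - torch.max(pred[:, :2], target[:, :2])).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    union = area_p + area_t - inter
    iou = inter / (union + eps)
    # Smallest enclosing box C; the GIoU penalty is its wasted area.
    wh_c = (torch.max(pred[:, 2:], target[:, 2:])
            - torch.min(pred[:, :2], target[:, :2])).clamp(min=0)
    area_c = wh_c[:, 0] * wh_c[:, 1]
    giou = iou - (area_c - union) / (area_c + eps)
    return 1 - giou  # in [0, 2]; still has gradient for disjoint boxes
```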
Alpha-IoU generalises this whole family by raising the IoU and the regularisation term to a power α. Setting α greater than 1 puts more weight on already-good predictions, which the authors showed improves robustness to noisy labels and small datasets. SIoU, proposed by Zhora Gevorgyan in 2022, introduces an angular cost: the loss decreases when the line connecting the two box centres aligns with the x or y axis, which the author argued helps the predicted box converge to the target along a more direct path.
The Dice coefficient (also called Sørensen-Dice index, or F1 score in the binary classification setting) is the other dominant overlap metric in segmentation. The two are closely related but not identical:
| metric | formula | range | sensitivity to small errors |
|---|---|---|---|
| IoU (Jaccard) | \|A ∩ B\| / \|A ∪ B\| | [0, 1] | More sensitive: drops faster as overlap shrinks |
| Dice / F1 | 2 \|A ∩ B\| / (\|A\| + \|B\|) | [0, 1] | Less sensitive: gives more credit to partial overlap |
The two are monotonically related by the closed-form expressions Dice = 2·IoU / (1 + IoU) and, equivalently, IoU = Dice / (2 - Dice). For the same prediction, Dice is always at least as large as IoU, with the gap widest at low overlap: an IoU of 0.5 corresponds to a Dice of 0.667, and an IoU of 0.8 to a Dice of 0.889. Medical imaging communities tend to prefer Dice because it is more forgiving of boundary errors on the small structures that dominate their tasks and gives a smoother optimisation surface when used as a loss, while general computer vision tends to prefer IoU because it is a stricter measure of localisation quality.
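The conversion in code, checked against the worked values above:

```python
def dice_from_iou(iou: float) -> float:
    return 2 * iou / (1 + iou)

def iou_from_dice(dice: float) -> float:
    return dice / (2 - dice)

assert abs(dice_from_iou(0.5) - 0.667) < 1e-3   # IoU 0.5   -> Dice 0.667
assert abs(iou_from_dice(0.889) - 0.8) < 1e-3   # Dice 0.889 -> IoU 0.8
```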
Beyond box IoU, several application-specific variants are in common use.
Mean IoU (mIoU) is the standard semantic segmentation metric. It computes IoU for each class across the entire dataset and averages the per-class scores, weighting all classes equally regardless of frequency. This is what makes ADE20K, with its 150 categories and heavy class imbalance, so demanding: a model that does well on common classes like sky and road but poorly on rare classes like microwave will have a much lower mIoU than its pixel accuracy would suggest.
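A sketch of mIoU from integer label maps via a confusion matrix (`mean_iou` is illustrative; real evaluators accumulate the matrix over the whole dataset before dividing):

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Per-class IoU from flat integer label arrays, then an unweighted mean."""
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(conf, (gt.ravel(), pred.ravel()), 1)  # rows: truth, cols: prediction
    tp = np.diag(conf)
    fp = conf.sum(axis=0) - tp   # predicted as class c, truth is something else
    fn = conf.sum(axis=1) - tp   # truth is class c, predicted as something else
    union = tp + fp + fn
    iou = np.where(union > 0, tp / np.maximum(union, 1), np.nan)
    return np.nanmean(iou)       # classes absent from both maps are skipped
```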
Mask IoU is the COCO-style metric for instance segmentation. It works exactly like box IoU but on the predicted and ground-truth pixel masks of each instance. COCO reports mask AP using the same 0.5 to 0.95 averaging as box AP.
Boundary IoU and trimap IoU focus the metric on the contour of the segmented region rather than the interior. They were proposed to fight the tendency of mIoU to saturate on large, easy-to-segment objects. Panoptic Quality (PQ), the metric for panoptic segmentation, uses an IoU threshold of 0.5 for matching predicted segments to ground truth, then multiplies the average IoU of matched segments by the F1 score of segment matching.
Multi-object trackers like SORT and ByteTrack use IoU between predicted and detected boxes as the assignment cost in the Hungarian algorithm: each detection in the new frame is matched to the existing track whose predicted location overlaps it the most. IoU-based assignment is cheap and works surprisingly well when objects do not move dramatically between frames.
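A sketch of that assignment step, assuming the IoU cost matrix has already been computed (SciPy's `linear_sum_assignment` is the standard Hungarian solver; the 0.3 minimum-IoU gate mirrors SORT's default but is otherwise a free parameter):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(iou_matrix: np.ndarray, iou_min: float = 0.3):
    """iou_matrix is (num_tracks, num_detections); returns matched index pairs."""
    tracks, dets = linear_sum_assignment(-iou_matrix)  # maximise total IoU
    # Reject pairings whose overlap is too low to be a plausible match.
    return [(t, d) for t, d in zip(tracks, dets) if iou_matrix[t, d] >= iou_min]
```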
Beyond computer vision, IoU shows up wherever two sets need to be compared: text similarity (Jaccard over n-gram or token sets), recommendation systems (overlap between user item lists), genomics (overlap between gene sets), and clustering evaluation. Because the metric is symmetric, insensitive to absolute set size, and well-defined for any two finite sets, it travels well across domains.
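The set form travels directly into code; a token-level text similarity sketch:

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard index of two finite sets (two empty sets are scored 1.0 here)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

s1 = set("the quick brown fox".split())
s2 = set("the lazy brown dog".split())
print(jaccard(s1, s2))  # 2 shared tokens / 6 distinct tokens = 0.333...
```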
IoU is well understood, but it has real shortcomings.
It treats over-prediction and under-prediction identically: FP and FN enter the denominator of TP / (TP + FP + FN) in exactly the same way, so 100 pixels of spurious prediction cost as much as 100 pixels of missed ground truth whenever the true-positive area is the same. In safety-critical applications like medical segmentation, where missing tumour tissue is much worse than flagging some healthy tissue as tumour, this even-handedness can be misleading.
Vanilla IoU provides no learning signal when boxes do not overlap: the metric is constant at zero there, so its gradient vanishes. This was the explicit motivation for GIoU.
IoU itself is a similarity score, not a distance. Its complement, the Jaccard distance 1 - IoU, is a true metric in the standard set-theoretic sense: it is symmetric, zero only for identical sets, and satisfies the triangle inequality. What it cannot do is separate pairs of disjoint sets, all of which sit at the maximum distance of 1.
At very small object sizes, IoU is extremely unforgiving: shifting a 4x4 box by a single pixel along one axis already drops IoU to 12/20 = 0.6. This is part of why small-object detection scores so much lower than large-object detection on COCO, and why benchmarks report AP_small, AP_medium, and AP_large separately.
Finally, IoU only measures localisation. Two detectors with identical IoU on every box can have wildly different classification accuracies. mAP combines IoU-based matching with classification scoring to address this.
Imagine you and a friend each draw a box around the same toy in a photo. Now lay your two boxes on top of each other. The part where both boxes overlap is the intersection. The total area covered by either box (or both) is the union. IoU is the intersection divided by the union. If your boxes match exactly, IoU is 1. If they do not touch at all, IoU is 0. Computers use the same rule to grade themselves on how well they find objects in pictures: they draw a box, compare it with the right answer, and an IoU above 0.5 usually counts as "close enough."