See also: machine learning terms
Intersection over Union (IoU) is a similarity metric that measures the overlap between two regions, typically a model's predicted region and a ground-truth region. It is the workhorse evaluation metric for object detection, semantic segmentation, instance segmentation, and multi-object tracking. The same quantity is known in set theory as the Jaccard index, after the Swiss botanist Paul Jaccard, who introduced it in 1901 and used it in his 1912 study comparing the floras of different alpine zones.
IoU is defined as the area of intersection divided by the area of union of two sets. It returns a value in the range [0, 1], where 0 means no overlap and 1 means perfect overlap. Because it is bounded, invariant to the overall scale of the regions, and easy to compute, IoU has become the default way to ask the question "how well does the predicted box, mask, or region match the truth?"
For any two sets A and B, intersection over union is:
IoU(A, B) = |A ∩ B| / |A ∪ B|
Using the identity |A ∪ B| = |A| + |B| - |A ∩ B|, the same metric can be written as:
IoU(A, B) = |A ∩ B| / (|A| + |B| - |A ∩ B|)
In classification terms, if A is the prediction and B is the ground truth, the intersection counts true positives (TP) while the union counts TP plus false positives (FP) plus false negatives (FN). This gives the equivalent form:
IoU = TP / (TP + FP + FN)
This last form makes it clear why IoU is symmetric: it punishes both over-prediction (extra area outside the truth, FP) and under-prediction (missed area inside the truth, FN) equally.
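A quick sanity check of this equivalence on binary masks (a minimal NumPy sketch; the masks themselves are arbitrary illustrations):

```python
import numpy as np

# Two 1-D binary masks: A is the prediction, B the ground truth.
A = np.array([1, 1, 1, 0, 0], dtype=bool)
B = np.array([0, 1, 1, 1, 0], dtype=bool)

iou_sets = (A & B).sum() / (A | B).sum()   # |A ∩ B| / |A ∪ B|

tp = (A & B).sum()    # predicted 1, truth 1
fp = (A & ~B).sum()   # predicted 1, truth 0
fn = (~A & B).sum()   # predicted 0, truth 1
iou_counts = tp / (tp + fp + fn)

assert iou_sets == iou_counts == 0.5       # both give 2 / 4
```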
Most object detectors emit axis-aligned bounding boxes. For two boxes A and B parameterised by corner coordinates (x1, y1, x2, y2), the intersection rectangle has width and height:
```
inter_w = max(0, min(A.x2, B.x2) - max(A.x1, B.x1))
inter_h = max(0, min(A.y2, B.y2) - max(A.y1, B.y1))
inter = inter_w * inter_h
```
The max(0, ...) clamps prevent negative intersection when boxes do not overlap. The union is:
```
area_A = (A.x2 - A.x1) * (A.y2 - A.y1)
area_B = (B.x2 - B.x1) * (B.y2 - B.y1)
union = area_A + area_B - inter
IoU = inter / union
```
A minimal PyTorch implementation that returns the pairwise (N, M) IoU matrix for two batches of boxes:

```python
import torch

def box_iou(boxes_a, boxes_b):
    # boxes shape: (N, 4) and (M, 4) in (x1, y1, x2, y2) format
    area_a = (boxes_a[:, 2] - boxes_a[:, 0]) * (boxes_a[:, 3] - boxes_a[:, 1])
    area_b = (boxes_b[:, 2] - boxes_b[:, 0]) * (boxes_b[:, 3] - boxes_b[:, 1])
    # Broadcast to all (N, M) pairs: top-left and bottom-right of each intersection.
    lt = torch.max(boxes_a[:, None, :2], boxes_b[None, :, :2])
    rb = torch.min(boxes_a[:, None, 2:], boxes_b[None, :, 2:])
    wh = (rb - lt).clamp(min=0)            # zero width/height for disjoint pairs
    inter = wh[..., 0] * wh[..., 1]
    union = area_a[:, None] + area_b[None, :] - inter
    return inter / union.clamp(min=1e-7)   # clamp guards against division by zero
```
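As a quick check, two 10x10 boxes offset by 5 pixels in each direction share a 5x5 intersection, so the expected IoU is 25 / 175 ≈ 0.143:

```python
a = torch.tensor([[0.0, 0.0, 10.0, 10.0]])
b = torch.tensor([[5.0, 5.0, 15.0, 15.0]])
print(box_iou(a, b))  # tensor([[0.1429]])
```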
For segmentation masks, IoU is computed pixel-wise on binary masks: count the pixels where both masks are 1 (intersection) and divide by the count of pixels where either mask is 1 (union). The same logic extends to volumetric masks in 3D segmentation by replacing pixel counts with voxel counts.
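A minimal sketch of mask IoU on binary NumPy arrays (the convention of scoring two empty masks as 1.0, and the helper name `mask_iou`, are choices made here, not a fixed standard):

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU of two same-shape binary masks (2-D pixels or 3-D voxels)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return 1.0 if union == 0 else inter / union  # empty vs empty: perfect match
```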
A detection is usually scored as a true positive only when its IoU against the matched ground-truth box exceeds a threshold. The threshold turns IoU into a binary decision, which feeds into precision, recall, and average precision (AP). Different benchmarks use different thresholds.
| benchmark | iou threshold | metric reported | notes |
|---|---|---|---|
| PASCAL VOC 2007 / 2012 | 0.5 | mAP@0.5 | Single threshold; introduced in the 2007 challenge |
| COCO | 0.50, 0.55, ..., 0.95 | mAP@[.5:.95] | Average of AP at 10 thresholds in 0.05 steps |
| Open Images | 0.5 (with hierarchy) | mAP@0.5 | Hierarchical class matching |
| LVIS | 0.50, 0.55, ..., 0.95 | mAP plus AP for rare classes | COCO-style averaging with long-tail evaluation |
| Cityscapes (segmentation) | per-pixel | mean IoU per class | Reports both class-level and category-level mIoU |
| ADE20K (segmentation) | per-pixel | mean IoU over 150 classes | Pixel accuracy reported alongside |
The COCO scheme of averaging AP across IoU thresholds from 0.5 to 0.95 was introduced by Tsung-Yi Lin and colleagues in the 2014 "Microsoft COCO: Common Objects in Context" paper. It rewards detectors that produce tight boxes, not just boxes that are roughly in the right place. This is why a model can score 60% mAP@0.5 on PASCAL VOC but only 35% mAP@[.5:.95] on COCO: the stricter thresholds penalise sloppy localisation.
IoU is the matching criterion that turns raw detector outputs into the inputs for mAP. For each predicted box, the evaluator finds the ground-truth box of the same class with the highest IoU. If that IoU is above the threshold and the ground-truth box has not already been matched, the prediction counts as a true positive. Otherwise it is a false positive. Unmatched ground-truth boxes become false negatives.
Sorting predictions by their confidence score and walking down the list produces a precision-recall curve. The area under that curve is the average precision (AP) for one class at one IoU threshold. Averaging AP across classes gives mAP. Averaging mAP across IoU thresholds (the COCO style) gives mAP@[.5:.95]. Without IoU as the matching rule, none of this machinery works.
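A simplified sketch of that matching step for one image and one class (real evaluators such as COCO's also handle crowd regions and per-category bookkeeping; `match_detections` and its signature are illustrative):

```python
import numpy as np

def match_detections(ious, scores, thr=0.5):
    """Greedy IoU matching for one image and one class (simplified sketch).

    ious   : (num_pred, num_gt) IoU matrix between predictions and ground truth
    scores : (num_pred,) confidence scores
    Returns a boolean array: True where a prediction counts as a true positive.
    """
    num_pred, num_gt = ious.shape
    gt_taken = np.zeros(num_gt, dtype=bool)
    is_tp = np.zeros(num_pred, dtype=bool)
    for i in np.argsort(-scores):            # most confident predictions first
        free = np.flatnonzero(~gt_taken)     # ground truths not yet matched
        if free.size == 0:
            break                            # remaining predictions stay FP
        j = free[ious[i, free].argmax()]     # best still-available ground truth
        if ious[i, j] >= thr:
            gt_taken[j] = True
            is_tp[i] = True
    return is_tp
```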
Object detectors typically produce many overlapping predictions for the same object. Non-maximum suppression (NMS) trims this down. The classic algorithm is:

1. Sort all candidate boxes by confidence score.
2. Take the highest-scoring box and add it to the output list.
3. Remove every remaining box whose IoU with that box exceeds a threshold (commonly 0.5).
4. Repeat from step 2 until no candidates remain.
A lower threshold suppresses more aggressively (fewer duplicate boxes, more risk of merging two truly distinct objects). A higher threshold keeps more candidates (more duplicates, more crowded output). The choice is dataset-dependent: crowd-heavy benchmarks like CrowdHuman tend to use higher thresholds.
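A minimal sketch of the classic algorithm, reusing the `box_iou` helper defined earlier (in practice most codebases call `torchvision.ops.nms` instead):

```python
import torch

def nms(boxes, scores, iou_thr=0.5):
    """Greedy NMS: boxes (N, 4) in (x1, y1, x2, y2), scores (N,)."""
    order = scores.argsort(descending=True)   # best boxes first
    keep = []
    while order.numel() > 0:
        i = order[0]
        keep.append(i.item())
        if order.numel() == 1:
            break
        # Suppress every remaining box that overlaps the kept box too much.
        ious = box_iou(boxes[i].unsqueeze(0), boxes[order[1:]]).squeeze(0)
        order = order[1:][ious <= iou_thr]
    return torch.tensor(keep)
```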
Variants such as Soft-NMS and DIoU-NMS modify how IoU is used. Soft-NMS decays the score of overlapping boxes instead of removing them outright. DIoU-NMS, proposed alongside the DIoU loss by Zheng et al. in 2020, replaces vanilla IoU with the distance-aware DIoU score, which improves results in dense scenes where two correct boxes are physically close.
Using IoU directly as a training loss is appealing because it is the same quantity used at evaluation time. The most common forms are L_IoU = 1 - IoU and the closely related -ln(IoU). The first published use of IoU as a regression loss for object detection is the UnitBox network of Yu et al., presented at ACM Multimedia 2016. UnitBox regressed all four box bounds jointly through an IoU loss, in contrast to the prior practice of treating each coordinate as an independent target with smooth L1 loss.
Vanilla IoU loss has two well-known failure modes. First, when the predicted and ground-truth boxes do not overlap at all, IoU is zero and the gradient is zero everywhere, so the network gets no signal about which direction to move. Second, two pairs of non-overlapping boxes with very different distances between them have the same IoU loss of 1, even though one pair is much closer to a useful prediction than the other. A series of follow-up losses fixes these problems by adding penalty terms.
| loss | year / venue | extra term added | what problem it fixes |
|---|---|---|---|
| IoU loss (UnitBox) | Yu et al., ACM MM 2016 | none, just 1 - IoU or -log(IoU) | Joint regression of all four bounds |
| GIoU | Rezatofighi et al., CVPR 2019 | Penalty for the empty area of the smallest enclosing box | Non-zero gradient when boxes do not overlap |
| DIoU | Zheng et al., AAAI 2020 | Normalised squared distance between box centres | Faster convergence, better behaviour for distant boxes |
| CIoU | Zheng et al., AAAI 2020 | DIoU plus an aspect-ratio consistency term | Three geometric factors at once: overlap, distance, shape |
| EIoU | Zhang et al., arXiv 2021 (Neurocomputing 2022) | CIoU's aspect-ratio term replaced by explicit width and height penalties | More accurate aspect alignment; pairs with Focal-EIoU |
| Alpha-IoU | He et al., NeurIPS 2021 | Power transform with parameter α applied to IoU and the regularisation term | Stronger gradient on high-IoU samples; tunable accuracy |
| SIoU | Gevorgyan, arXiv 2022 | Angle, distance, shape, and overlap costs combined | Considers the angle between predicted and target box vectors |
GIoU works by computing C, the smallest axis-aligned box that encloses both A and B, then subtracting (|C| - |A ∪ B|) / |C| from the IoU. When the two boxes do not overlap, this penalty term still has a gradient that pulls them together. DIoU adds the squared distance between the two box centres, normalised by the squared diagonal length of C, which gives even faster convergence. CIoU keeps the DIoU distance term and adds a term measuring the difference in aspect ratio, which Zheng et al. reported as the missing third geometric factor. Several modern one-stage detectors, including Ultralytics' YOLOv5 and YOLOv8, use CIoU as the default regression loss; YOLOv8 also adds a Distribution Focal Loss (DFL) component on top.
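A sketch of the GIoU loss for matched box pairs, following the formula above (element-wise over N pred/target pairs; `torchvision.ops` also ships a pairwise `generalized_box_iou`):

```python
import torch

def giou_loss(pred, target, eps=1e-7):
    """1 - GIoU for paired boxes, both (N, 4) in (x1, y1, x2, y2) format."""
    # Plain IoU of each pair.
    wh = (torch.min(pred[:, 2:], target[:, 2:])
          - torch.max(pred[:, :2], target[:, :2])).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    union = area_p + area_t - inter
    iou = inter / (union + eps)
    # Smallest enclosing box C; the GIoU penalty is its wasted area.
    wh_c = (torch.max(pred[:, 2:], target[:, 2:])
            - torch.min(pred[:, :2], target[:, :2])).clamp(min=0)
    area_c = wh_c[:, 0] * wh_c[:, 1]
    giou = iou - (area_c - union) / (area_c + eps)
    return 1 - giou  # in [0, 2]; still has gradient for disjoint boxes
```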
Alpha-IoU generalises this whole family by raising the IoU and the regularisation term to a power α. Setting α greater than 1 puts more weight on already-good predictions, which the authors showed improves robustness to noisy labels and small datasets. SIoU, proposed by Zhora Gevorgyan in 2022, introduces an angular cost: the loss decreases when the line connecting the two box centres aligns with the x or y axis, which the author argued helps the predicted box converge to the target along a more direct path.
The Dice coefficient (also called Sørensen-Dice index, or F1 score in the binary classification setting) is the other dominant overlap metric in segmentation. The two are closely related but not identical:
| metric | formula | range | sensitivity to small errors |
|---|---|---|---|
| IoU (Jaccard) | \|A ∩ B\| / \|A ∪ B\| | [0, 1] | More sensitive: drops faster as overlap shrinks |
| Dice / F1 | 2 \|A ∩ B\| / (\|A\| + \|B\|) | [0, 1] | Less sensitive: gives more credit to partial overlap |
The two are monotonically related by the closed-form expressions Dice = 2·IoU / (1 + IoU) and, equivalently, IoU = Dice / (2 - Dice). For the same prediction, Dice is always at least as large as IoU, with the gap widest at low overlap: an IoU of 0.5 corresponds to a Dice of 0.667, and an IoU of 0.8 to a Dice of 0.889. Medical imaging communities tend to prefer Dice because it is more forgiving of boundary errors on the small structures that dominate their tasks and gives a smoother optimisation surface when used as a loss, while general computer vision tends to prefer IoU because it is a stricter measure of localisation quality.
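The conversion in code, checked against the worked values above:

```python
def dice_from_iou(iou: float) -> float:
    return 2 * iou / (1 + iou)

def iou_from_dice(dice: float) -> float:
    return dice / (2 - dice)

assert abs(dice_from_iou(0.5) - 0.667) < 1e-3   # IoU 0.5   -> Dice 0.667
assert abs(iou_from_dice(0.889) - 0.8) < 1e-3   # Dice 0.889 -> IoU 0.8
```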
Beyond box IoU, several application-specific variants are in common use.
Mean IoU (mIoU) is the standard semantic segmentation metric. It computes IoU for each class across the entire dataset and averages the per-class scores, weighting all classes equally regardless of frequency. This is what makes ADE20K, with its 150 categories and heavy class imbalance, so demanding: a model that does well on common classes like sky and road but poorly on rare classes like microwave will have a much lower mIoU than its pixel accuracy would suggest.
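A sketch of mIoU from integer label maps via a confusion matrix (`mean_iou` is illustrative; real evaluators accumulate the matrix over the whole dataset before dividing):

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Per-class IoU from flat integer label arrays, then an unweighted mean."""
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(conf, (gt.ravel(), pred.ravel()), 1)  # rows: truth, cols: prediction
    tp = np.diag(conf)
    fp = conf.sum(axis=0) - tp   # predicted as class c, truth is something else
    fn = conf.sum(axis=1) - tp   # truth is class c, predicted as something else
    union = tp + fp + fn
    iou = np.where(union > 0, tp / np.maximum(union, 1), np.nan)
    return np.nanmean(iou)       # classes absent from both maps are skipped
```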
Mask IoU is the COCO-style metric for instance segmentation. It works exactly like box IoU but on the predicted and ground-truth pixel masks of each instance. COCO reports mask AP using the same 0.5 to 0.95 averaging as box AP.
Boundary IoU and trimap IoU focus the metric on the contour of the segmented region rather than the interior. They were proposed to fight the tendency of mIoU to saturate on large, easy-to-segment objects. Panoptic Quality (PQ), the metric for panoptic segmentation, uses an IoU threshold of 0.5 for matching predicted segments to ground truth, then multiplies the average IoU of matched segments by the F1 score of segment matching.
Multi-object trackers like SORT and ByteTrack use IoU between predicted and detected boxes as the assignment cost in the Hungarian algorithm: each detection in the new frame is matched to the existing track whose predicted location overlaps it the most. IoU-based assignment is cheap and works surprisingly well when objects do not move dramatically between frames.
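A sketch of that assignment step, assuming the IoU cost matrix has already been computed (SciPy's `linear_sum_assignment` is the standard Hungarian solver; the 0.3 minimum-IoU gate mirrors SORT's default but is otherwise a free parameter):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(iou_matrix: np.ndarray, iou_min: float = 0.3):
    """iou_matrix is (num_tracks, num_detections); returns matched index pairs."""
    tracks, dets = linear_sum_assignment(-iou_matrix)  # maximise total IoU
    # Reject pairings whose overlap is too low to be a plausible match.
    return [(t, d) for t, d in zip(tracks, dets) if iou_matrix[t, d] >= iou_min]
```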
Beyond computer vision, IoU shows up wherever two sets need to be compared: text similarity (Jaccard over n-gram or token sets), recommendation systems (overlap between user item lists), genomics (overlap between gene sets), and clustering evaluation. Because the metric is symmetric, insensitive to absolute set size, and well-defined for any two finite sets, it travels well across domains.
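The set form travels directly into code; a token-level text similarity sketch:

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard index of two finite sets (two empty sets are scored 1.0 here)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

s1 = set("the quick brown fox".split())
s2 = set("the lazy brown dog".split())
print(jaccard(s1, s2))  # 2 shared tokens / 6 distinct tokens = 0.333...
```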
IoU is well understood, but it has real shortcomings.
It treats over-prediction and under-prediction identically: FP and FN enter the denominator of TP / (TP + FP + FN) in exactly the same way, so 100 pixels of spurious prediction cost as much as 100 pixels of missed ground truth whenever the true-positive area is the same. In safety-critical applications like medical segmentation, where missing tumour tissue is much worse than flagging some healthy tissue as tumour, this even-handedness can be misleading.
Vanilla IoU provides no learning signal when boxes do not overlap: the metric is constant at zero there, so its gradient vanishes. This was the explicit motivation for GIoU.
IoU itself is a similarity score, not a distance. Its complement, the Jaccard distance 1 - IoU, is a true metric in the standard set-theoretic sense: it is symmetric, zero only for identical sets, and satisfies the triangle inequality. What it cannot do is separate pairs of disjoint sets, all of which sit at the maximum distance of 1.
At very small object sizes, IoU is extremely unforgiving: shifting a 4x4 box by a single pixel along one axis already drops IoU to 12/20 = 0.6. This is part of why small-object detection scores so much lower than large-object detection on COCO, and why benchmarks report AP_small, AP_medium, and AP_large separately.
Finally, IoU only measures localisation. Two detectors with identical IoU on every box can have wildly different classification accuracies. mAP combines IoU-based matching with classification scoring to address this.
Imagine you and a friend each draw a box around the same toy in a photo. Now lay your two boxes on top of each other. The part where both boxes overlap is the intersection. The total area covered by either box (or both) is the union. IoU is the intersection divided by the union. If your boxes match exactly, IoU is 1. If they do not touch at all, IoU is 0. Computers use the same rule to grade themselves on how well they find objects in pictures: they draw a box, compare it with the right answer, and an IoU above 0.5 usually counts as "close enough."