Intersection over Union (IoU), also known as the Jaccard index or Jaccard similarity coefficient, is a statistic used to compare the similarity of two sets. In computer vision and machine learning, it is the dominant metric for measuring how well a predicted region (a bounding box or a segmentation mask) overlaps with a ground truth region. IoU is computed as the area of overlap between two regions divided by the area of their union, producing a value between 0 (no overlap) and 1 (perfect overlap).
The metric appears across nearly every modern visual recognition pipeline. It defines the matching criterion in benchmarks such as PASCAL VOC, Microsoft COCO, Cityscapes, and KITTI. It serves as the suppression criterion inside non-maximum suppression. It is the basis of a large family of regression losses used to train object detectors, including IoU loss, GIoU, DIoU, CIoU, SIoU, EIoU, and Wise-IoU. The same quantity is also used outside vision, for example in document deduplication, set similarity in databases, and ecological community comparison, which was the original setting in which Paul Jaccard introduced the index in 1901.[^jaccard1901]
For two finite sets $A$ and $B$, the Jaccard index is defined as
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}.$$
The value satisfies $0 \leq J(A, B) \leq 1$, with $J(A, B) = 1$ if and only if $A = B$ (and both are non-empty), and $J(A, B) = 0$ when $A \cap B = \emptyset$. By convention $J(\emptyset, \emptyset) = 1$.[^jaccard_wiki]
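As a quick illustration of the definition, the index can be evaluated directly on small Python sets; the values below are arbitrary and the helper name is just for this article.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard index of two finite sets; by convention J(empty, empty) = 1."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

print(jaccard({1, 2, 3}, {2, 3, 4}))  # 2 shared / 4 total -> 0.5
print(jaccard({1, 2}, {3, 4}))        # disjoint sets -> 0.0
```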
In object detection, $A$ and $B$ are typically axis-aligned rectangles in image coordinates. Letting $A = (x_1^A, y_1^A, x_2^A, y_2^A)$ and $B = (x_1^B, y_1^B, x_2^B, y_2^B)$ denote the top-left and bottom-right corners, the intersection is itself a rectangle with corners
$$x_1^I = \max(x_1^A, x_1^B), \quad y_1^I = \max(y_1^A, y_1^B),$$
$$x_2^I = \min(x_2^A, x_2^B), \quad y_2^I = \min(y_2^A, y_2^B).$$
If $x_2^I > x_1^I$ and $y_2^I > y_1^I$, the intersection area is $(x_2^I - x_1^I)(y_2^I - y_1^I)$; otherwise it is zero. The IoU is then
$$\text{IoU}(A, B) = \frac{\text{Area}(A \cap B)}{\text{Area}(A) + \text{Area}(B) - \text{Area}(A \cap B)}.$$
For binary segmentation, $A$ and $B$ are sets of foreground pixels. With true positives $TP$, false positives $FP$, and false negatives $FN$ counted at the pixel level,
$$\text{IoU} = \frac{TP}{TP + FP + FN}.$$
This is the form used in PASCAL VOC's segmentation evaluation and in Cityscapes' per-class IoU.[^pascal_voc_2010][^cityscapes]
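The TP/FP/FN form can be computed on binary masks in a few lines of NumPy. The sketch below is illustrative, not any benchmark's reference implementation, and assumes two boolean masks of the same shape.

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU of two boolean masks of the same shape, as TP / (TP + FP + FN)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    denom = tp + fp + fn
    return float(tp) / denom if denom > 0 else 1.0  # empty vs empty -> 1 by convention

pred = np.zeros((4, 4), dtype=bool); pred[1:3, 1:3] = True  # 4 foreground pixels
gt = np.zeros((4, 4), dtype=bool); gt[1:3, 1:4] = True      # 6 foreground pixels
print(mask_iou(pred, gt))  # intersection 4, union 6 -> 0.666...
```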
IoU answers a single question: of all the area covered by the prediction and the ground truth combined, how much do they share? Two boxes that overlap completely score 1. Two boxes that touch at a single edge score 0. A prediction that is twice as large as the ground truth and fully contains it scores 0.5, because the union is twice the intersection. The same logic applies to pixel masks: doubling the prediction's footprint while keeping the same true positive area halves the IoU.
A useful property is that IoU is symmetric: $\text{IoU}(A, B) = \text{IoU}(B, A)$. It is also scale-aware in a relative sense: shifting the predicted box a few pixels relative to the ground truth has a much larger effect when the boxes are small than when they are large, so a 5-pixel localisation error on a 20-pixel-wide pedestrian lowers the IoU far more than the same error on a 200-pixel-wide bus.
The IoU has several properties that explain why it is preferred over older similarity measures such as the simple overlap coefficient or pixel accuracy:
| Property | Description |
|---|---|
| Range | $[0, 1]$ for any pair of non-empty sets |
| Symmetry | $J(A, B) = J(B, A)$ |
| Identity | $J(A, A) = 1$ for any non-empty $A$ |
| Triangle-related | $1 - J(A, B)$ is the Jaccard distance, a true metric on the space of finite sets |
| Scale-invariance | $J(\lambda A, \lambda B) = J(A, B)$ for any positive scaling $\lambda$ |
| Class imbalance robustness | Unlike pixel accuracy, IoU does not become trivially high when one class dominates the image |
The Jaccard distance $d_J(A, B) = 1 - J(A, B)$ satisfies the triangle inequality and is a proper metric, which is why it is used in clustering and retrieval contexts.[^jaccard_wiki]
The relationship between IoU and the Dice coefficient (also called F1 score in the binary case) is monotonic:
$$\text{Dice} = \frac{2 \cdot \text{IoU}}{1 + \text{IoU}}, \quad \text{IoU} = \frac{\text{Dice}}{2 - \text{Dice}}.$$
Because both metrics are monotonically related, ranking models by IoU and ranking them by Dice yields the same ordering. The two are not equal, however: for any prediction with $0 < \text{IoU} < 1$, Dice is strictly larger than IoU.
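The conversion is easy to check numerically; the sketch below simply applies the two formulas above.

```python
def dice_from_iou(iou: float) -> float:
    return 2 * iou / (1 + iou)

def iou_from_dice(dice: float) -> float:
    return dice / (2 - dice)

for iou in (0.25, 0.5, 0.75):
    dice = dice_from_iou(iou)
    print(f"IoU={iou:.2f} -> Dice={dice:.3f} -> IoU={iou_from_dice(dice):.2f}")
# Dice is always >= IoU on (0, 1), and the round trip recovers the original value.
```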
The quantity now called IoU was first published in its modern form by the Swiss botanist Paul Jaccard in 1901, in a study comparing alpine flora. He called it the coefficient de communauté and used it to compare which plant species occurred together in different mountain plots.[^jaccard1901] An equivalent ratio had been described earlier by the geologist Grove Karl Gilbert in 1884 as a "ratio of verification" for weather forecasts. The same statistic was independently rediscovered by Taffee Tadashi Tanimoto at IBM in the 1950s, and is therefore sometimes called the Tanimoto coefficient, especially in cheminformatics.[^jaccard_wiki]
The Jaccard index entered modern computer vision through the PASCAL VOC challenge. The first VOC paper that codified the evaluation procedure was published in the International Journal of Computer Vision in 2010 by Mark Everingham and colleagues. They specified that, for an object detection result to count as a true positive, the predicted bounding box must overlap a ground truth box with IoU greater than 0.5.[^pascal_voc_2010] This 0.5 threshold became the de facto standard for detection evaluation throughout the early deep learning era and is still the default for many practical reports.
When the Microsoft COCO dataset was released in 2014, its authors argued that a single threshold rewards loose localisation. They introduced the now-standard mAP@[0.5:0.95], averaged over ten IoU thresholds from 0.5 to 0.95 in steps of 0.05.[^coco_paper] This stricter protocol forced subsequent detectors to produce tighter, better-fitting boxes.
In the object detection setting, a predicted box is matched to a ground truth box if their IoU exceeds a threshold $\tau$. The match defines true positives, false positives (predictions that do not match) and false negatives (ground truth boxes with no matching prediction). Average precision (AP) is then computed from the resulting precision-recall curve.
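A minimal sketch of this matching step for a single image and class might look like the following. It assumes the scalar `box_iou` helper defined in the NumPy example near the end of this article; real evaluators additionally sort predictions by confidence across the whole dataset and handle ignored or crowd annotations.

```python
def match_detections(pred_boxes, pred_scores, gt_boxes, iou_thr=0.5):
    """Greedy matching: each ground truth box can be claimed by at most one prediction."""
    order = sorted(range(len(pred_boxes)), key=lambda i: pred_scores[i], reverse=True)
    matched_gt = set()
    tp, fp = 0, 0
    for i in order:
        best_iou, best_j = 0.0, -1
        for j, gt in enumerate(gt_boxes):
            if j in matched_gt:
                continue
            iou = box_iou(pred_boxes[i], gt)
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_iou >= iou_thr:
            tp += 1
            matched_gt.add(best_j)
        else:
            fp += 1
    fn = len(gt_boxes) - len(matched_gt)  # unmatched ground truth boxes
    return tp, fp, fn
```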
Different benchmarks pick different IoU thresholds:
| Benchmark | Threshold | Notes |
|---|---|---|
| PASCAL VOC | 0.5 | Single fixed threshold; mAP averaged over 20 classes |
| COCO | 0.5 to 0.95 in steps of 0.05 | Average of 10 IoU values, called mAP@[.5:.95] or simply AP |
| LVIS | 0.5 to 0.95 in steps of 0.05 | Same as COCO but emphasises long-tail classes |
| KITTI (cars, 2D and 3D) | 0.7 | Stricter threshold reflects driving-safety requirements |
| KITTI (pedestrians, cyclists) | 0.5 | Smaller objects allow looser matching |
| Open Images | 0.5 | Hierarchical class structure with relaxed matching |
A single value such as AP@0.5 reports how often a model finds objects loosely; AP@0.75, often abbreviated AP75, measures stricter localisation; AP@[.5:.95] averages over the full range of thresholds and so rewards both detection and tight localisation.[^coco_metrics]
For semantic segmentation, the standard metric is mean Intersection over Union (mIoU), sometimes also written Jaccard index. The IoU is computed for each class across the entire test set, treating the predicted and ground truth pixel masks as sets, and then averaged over classes:
$$\text{mIoU} = \frac{1}{C} \sum_{c=1}^{C} \frac{TP_c}{TP_c + FP_c + FN_c}.$$
The mean is taken across $C$ classes so that minority classes such as poles or traffic signs are not drowned out by dominant classes such as road or sky. The Cityscapes benchmark reports both IoU per class and IoU per category (a coarser grouping), and Cityscapes' leaderboard ranks models by mean class IoU.[^cityscapes] PASCAL VOC's segmentation track uses the same per-class IoU formulation.[^pascal_voc_2010]
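One common way to compute this over a whole test set is to accumulate a $C \times C$ confusion matrix and derive the per-class IoU from it. The sketch below assumes integer label maps and omits void-label handling.

```python
import numpy as np

def miou_from_confusion(conf: np.ndarray) -> float:
    """Mean IoU from a C x C confusion matrix, conf[i, j] = pixels of class i predicted as class j."""
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp          # predicted as class c but labelled otherwise
    fn = conf.sum(axis=1) - tp          # labelled class c but predicted otherwise
    denom = tp + fp + fn
    iou = np.where(denom > 0, tp / np.maximum(denom, 1), np.nan)
    return float(np.nanmean(iou))       # classes absent from both pred and gt are skipped

# The confusion matrix can be accumulated over the test set with, per image:
#   conf += np.bincount(gt.ravel() * C + pred.ravel(), minlength=C * C).reshape(C, C)
```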
In instance segmentation (for example, the Mask R-CNN style of output), each predicted mask is matched to a ground truth mask using mask IoU rather than box IoU, and the COCO-style AP@[.5:.95] is reported on the resulting matches. COCO maintains separate AP scores for bounding boxes and masks.[^coco_paper]
Object detectors usually emit hundreds of overlapping candidate boxes per object. Non-maximum suppression (NMS) trims these to a single best box per object using IoU as the suppression criterion. A typical NMS routine sorts candidates by class score, picks the highest-scoring box, and removes any other box whose IoU with the chosen box exceeds a threshold (commonly 0.45 or 0.5). The procedure is repeated until no candidates remain. Soft-NMS, introduced by Bodla et al., reduces (rather than removes) the score of overlapping boxes proportionally to their IoU, which preserves recall when objects of the same class genuinely overlap.[^soft_nms]
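A minimal greedy NMS for a single class can be sketched as follows, assuming a `pairwise_iou` helper such as the vectorised NumPy example near the end of this article; libraries such as torchvision.ops.nms provide optimised versions of the same idea.

```python
import numpy as np

def nms(boxes: np.ndarray, scores: np.ndarray, iou_thr: float = 0.5) -> list:
    """Greedy NMS for one class. boxes: (N, 4) in xyxy format; returns kept indices."""
    order = scores.argsort()[::-1]           # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        ious = pairwise_iou(boxes[i:i + 1], boxes[order[1:]])[0]  # IoU of i vs the rest
        order = order[1:][ious <= iou_thr]    # drop boxes that overlap too much with i
    return keep
```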
IoU is also used as a matching cost in tracking: simple trackers such as the IOU tracker rely on it exclusively, and DeepSORT and ByteTrack use IoU-based association stages to link a track and a detection across consecutive frames when their boxes overlap by more than a chosen IoU threshold.
For most of the 2010s, object detectors were trained with smooth L1 or L2 loss applied to the four box coordinates. This is suboptimal because two boxes can have a small coordinate-wise distance and still a poor IoU, and vice versa. UnitBox, introduced by Yu et al. at ACM Multimedia in 2016, was the first detector to back-propagate directly through the IoU.[^unitbox] The IoU loss is
$$\mathcal{L}_{\text{IoU}} = 1 - \text{IoU}(A, B).$$
This loss has two well-known shortcomings. First, when the predicted and ground truth boxes do not overlap, the IoU is zero regardless of how far apart they are, so the loss is flat and provides no gradient to pull the prediction closer. Second, the IoU value alone does not distinguish between different ways two boxes can overlap: predictions with quite different alignments or aspect ratios can yield the same IoU.
A family of variants, summarised below, has been proposed to address these issues.
| Variant | Year and venue | Authors | Key idea |
|---|---|---|---|
| IoU loss (UnitBox) | 2016, ACM MM | Yu et al. | Back-propagate through IoU directly instead of L2 on coordinates |
| GIoU | 2019, CVPR | Rezatofighi, Tsoi, Gwak, Sadeghian, Reid, Savarese | Adds a penalty for the empty area of the smallest enclosing box, providing gradient when boxes do not overlap |
| DIoU | 2020, AAAI | Zheng, Wang, Liu, Li, Ye, Ren | Adds a normalised distance term between box centers; converges faster than GIoU |
| CIoU | 2020, AAAI | Zheng et al. | Adds a third term for aspect-ratio consistency on top of DIoU |
| EIoU and Focal-EIoU | 2022, Neurocomputing | Zhang, Ren, Zhang, Jia, Wang, Tan | Replaces CIoU's aspect-ratio term with explicit width and height differences and adds focal weighting |
| SIoU | 2022, arXiv | Gevorgyan | Adds an angle cost between the line connecting box centers and the image axes, plus a redefined distance and shape cost |
| Wise-IoU (WIoU) | 2023, arXiv | Tong, Chen and others | Dynamic, non-monotonic focusing weight based on outlier degree of each anchor |
| Probabilistic IoU (ProbIoU) | 2021, arXiv | Llerena, Zeni, Kristen, Jung | Models boxes as 2D Gaussians, uses Hellinger or Bhattacharyya distance for differentiable IoU on rotated boxes |
| PIoU (Pixels-IoU) | 2020, ECCV | Chen, Yang, Zhang and others | Pixel-wise approximation suitable for oriented bounding boxes |
GIoU was introduced by Hamid Rezatofighi and colleagues at CVPR 2019. It addresses the vanishing-gradient problem of plain IoU when boxes do not overlap. Letting $C$ be the smallest axis-aligned box that contains both $A$ and $B$, GIoU is defined as
$$\text{GIoU} = \text{IoU} - \frac{|C \setminus (A \cup B)|}{|C|}.$$
The additional term penalises the empty area inside the enclosing box, so that even when boxes do not overlap, smaller enclosing boxes give higher GIoU values. GIoU lies in the range $[-1, 1]$. The authors reported consistent gains over smooth-L1 and plain IoU on PASCAL VOC and COCO when GIoU was plugged into Faster R-CNN, Mask R-CNN, and YOLOv3.[^giou]
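For axis-aligned boxes in xyxy format, the definition above can be turned into a short sketch; this is illustrative rather than the authors' reference code.

```python
def giou(box_a, box_b):
    """Generalised IoU for two axis-aligned boxes in xyxy format."""
    # Plain IoU
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    iou = inter / union if union > 0 else 0.0
    # Smallest enclosing box C
    cx1, cy1 = min(box_a[0], box_b[0]), min(box_a[1], box_b[1])
    cx2, cy2 = max(box_a[2], box_b[2]), max(box_a[3], box_b[3])
    c_area = (cx2 - cx1) * (cy2 - cy1)
    return iou - (c_area - union) / c_area if c_area > 0 else iou

print(giou([0, 0, 2, 2], [3, 3, 5, 5]))  # disjoint boxes: IoU = 0, GIoU < 0
```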
DIoU and CIoU were proposed in the same paper by Zhaohui Zheng and colleagues at AAAI 2020. DIoU adds the normalised squared distance between the centers of the two boxes:
$$\text{DIoU} = \text{IoU} - \frac{\rho^2(\mathbf{b}, \mathbf{b}^{gt})}{c^2},$$
where $\rho$ denotes Euclidean distance, $\mathbf{b}$ and $\mathbf{b}^{gt}$ are the centers of the predicted and ground truth boxes, and $c$ is the diagonal of the smallest enclosing box. CIoU adds a further consistency term $\alpha v$ that penalises differences in aspect ratio:
$$\text{CIoU} = \text{IoU} - \frac{\rho^2(\mathbf{b}, \mathbf{b}^{gt})}{c^2} - \alpha v,$$
where $v$ measures aspect-ratio dissimilarity and $\alpha$ is a positive trade-off coefficient. The authors showed that DIoU also improves NMS by replacing IoU as the suppression criterion (DIoU-NMS), preserving boxes that overlap but have distinct centers (for example, two adjacent pedestrians).[^diou]
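The DIoU and CIoU penalties can likewise be sketched on top of a plain IoU value, following the formulas from the AAAI 2020 paper; the helper below assumes non-degenerate xyxy boxes and a precomputed IoU.

```python
import math

def diou_ciou(box_a, box_b, iou):
    """DIoU and CIoU given a precomputed IoU for two xyxy boxes (illustrative sketch)."""
    # Squared distance between box centers
    ax, ay = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    bx, by = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    rho2 = (ax - bx) ** 2 + (ay - by) ** 2
    # Squared diagonal of the smallest enclosing box
    cw = max(box_a[2], box_b[2]) - min(box_a[0], box_b[0])
    ch = max(box_a[3], box_b[3]) - min(box_a[1], box_b[1])
    c2 = cw ** 2 + ch ** 2
    diou = iou - rho2 / c2
    # Aspect-ratio consistency term and its trade-off weight
    wa, ha = box_a[2] - box_a[0], box_a[3] - box_a[1]
    wb, hb = box_b[2] - box_b[0], box_b[3] - box_b[1]
    v = (4 / math.pi ** 2) * (math.atan(wb / hb) - math.atan(wa / ha)) ** 2
    alpha = v / (1 - iou + v) if (1 - iou + v) > 0 else 0.0
    ciou = diou - alpha * v
    return diou, ciou
```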
The SCYLLA-IoU (SIoU) loss, introduced by Zhora Gevorgyan in 2022, adds an angle cost. The intuition is that earlier IoU losses do not consider the direction of the offset between the predicted and ground truth box centers, so the predicted box can wander before converging. SIoU's combined loss includes an angle cost, a distance cost re-weighted by that angle, a shape cost on width and height differences, and the IoU term itself.[^siou]
Wise-IoU (WIoU) was proposed by Zanjia Tong, Yuhang Chen and colleagues in 2023. Earlier focal-style losses such as Focal-EIoU upweight harder examples monotonically. WIoU uses a dynamic, non-monotonic focusing function based on each anchor's outlier degree, so that very low-quality anchors (likely mislabelled or extremely poor crops) are also down-weighted. The authors reported AP gains on YOLOv7 trained on MS-COCO.[^wise_iou]
For oriented bounding boxes used in aerial imagery and text detection, the standard rectangular IoU is hard to differentiate. The Pixels-IoU loss (Chen et al., ECCV 2020) approximates IoU pixel by pixel.[^piou] Llerena and colleagues took a different approach, modelling each box as a 2D Gaussian distribution and using Hellinger or Bhattacharyya distance as a probabilistic analogue of IoU, giving the ProbIoU family of losses.[^probiou]
A practical distinction: IoU as an evaluation metric is computed on hard predictions and requires no gradient. IoU as a training loss must be differentiable with respect to the predicted box parameters. This is why training and evaluation often use slightly different IoU variants. Models commonly train with CIoU or GIoU and are then evaluated using vanilla IoU at one or several thresholds.
| Benchmark | Domain | IoU usage |
|---|---|---|
| PASCAL VOC (2007 to 2012) | 2D detection, segmentation | mAP@0.5 for detection; per-class IoU for segmentation |
| Microsoft COCO (2014 onwards) | Detection, instance segmentation, keypoints | mAP@[.5:.95], AP@0.5, AP@0.75; mask IoU for segmentation |
| Cityscapes (2016) | Urban driving scenes | Mean IoU per class and per category |
| ADE20K (2017) | Scene parsing | Mean IoU over 150 classes |
| KITTI (2012, 2017) | Autonomous driving 2D and 3D | IoU thresholds 0.7 (cars) and 0.5 (pedestrians, cyclists) |
| LVIS (2019) | Long-tail detection | COCO-style AP@[.5:.95] |
| Open Images (2018) | Large-scale detection | mAP@0.5 with hierarchy-aware matching |
| nuScenes (2019) | 3D detection in autonomous driving | Uses center distance instead of IoU for matching, but reports IoU as a secondary metric |
| Waymo Open Dataset (2020) | 3D detection and tracking | 3D IoU thresholds 0.7 and 0.5 |
The torchvision.ops module provides differentiable and non-differentiable IoU utilities. The basic call computes a pairwise IoU matrix between two sets of boxes in xyxy format:[^torchvision_box_iou]
```python
import torch
from torchvision.ops import box_iou, generalized_box_iou_loss, complete_box_iou_loss

boxes1 = torch.tensor([[0, 0, 100, 100], [50, 50, 150, 150]], dtype=torch.float32)
boxes2 = torch.tensor([[10, 10, 110, 110]], dtype=torch.float32)

iou_matrix = box_iou(boxes1, boxes2)
# iou_matrix has shape (2, 1)

# Loss variants for training
pred = torch.tensor([[10.0, 10.0, 90.0, 90.0]], requires_grad=True)
target = torch.tensor([[0.0, 0.0, 100.0, 100.0]])
loss_giou = generalized_box_iou_loss(pred, target, reduction="mean")
loss_ciou = complete_box_iou_loss(pred, target, reduction="mean")
loss_giou.backward()
```
Beyond box_iou, torchvision exposes generalized_box_iou, distance_box_iou, and complete_box_iou, together with the corresponding generalized_box_iou_loss, distance_box_iou_loss, and complete_box_iou_loss functions.
The tf.keras.metrics.MeanIoU metric maintains a confusion matrix and computes the mean per-class IoU. It is most often used for semantic segmentation:[^tf_meaniou]
```python
import tensorflow as tf

m = tf.keras.metrics.MeanIoU(num_classes=2)
m.update_state([0, 0, 1, 1], [0, 1, 0, 1])
print(m.result().numpy())  # 0.33333334

# As a compiled metric on an existing segmentation model. Note that MeanIoU
# expects integer class IDs, so with a softmax output you typically wrap
# predictions in an argmax or use tf.keras.metrics.OneHotMeanIoU instead.
model.compile(
    optimizer="sgd",
    loss="categorical_crossentropy",
    metrics=[tf.keras.metrics.MeanIoU(num_classes=21)],
)
```
For bounding boxes, the TensorFlow Models library and TF Addons historically provided tfa.losses.GIoULoss, and Keras CV provides keras_cv.losses.IoULoss and keras_cv.losses.CIoULoss for modern detector training.
A bare implementation of box IoU in NumPy is short enough to be illustrative:
```python
import numpy as np

def box_iou(box_a, box_b):
    """Compute IoU between two boxes in xyxy format."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```
Vectorised implementations replace the scalar max and min with element-wise NumPy or PyTorch operations and broadcast across all $N \times M$ pairs of boxes.
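A broadcast NumPy version along those lines might look like this; the helper name pairwise_iou is just for this article.

```python
import numpy as np

def pairwise_iou(boxes_a: np.ndarray, boxes_b: np.ndarray) -> np.ndarray:
    """IoU matrix of shape (N, M) for two arrays of xyxy boxes with shapes (N, 4) and (M, 4)."""
    tl = np.maximum(boxes_a[:, None, :2], boxes_b[None, :, :2])  # (N, M, 2) intersection top-left
    br = np.minimum(boxes_a[:, None, 2:], boxes_b[None, :, 2:])  # (N, M, 2) intersection bottom-right
    wh = np.clip(br - tl, 0.0, None)
    inter = wh[..., 0] * wh[..., 1]
    area_a = (boxes_a[:, 2] - boxes_a[:, 0]) * (boxes_a[:, 3] - boxes_a[:, 1])
    area_b = (boxes_b[:, 2] - boxes_b[:, 0]) * (boxes_b[:, 3] - boxes_b[:, 1])
    union = area_a[:, None] + area_b[None, :] - inter
    return np.where(union > 0, inter / np.maximum(union, 1e-9), 0.0)
```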
IoU is dominant in computer vision because it is simple, scale-aware, and aligned with human notions of overlap, but it has well-known weaknesses. A single IoU value says nothing about where the error lies, so a prediction that misses a thin boundary and one that misses a compact blob of the same area can score identically; it collapses to zero for every non-overlapping pair regardless of how far apart the regions are; and it is harsh on small objects, where a few pixels of misalignment cost a large fraction of the score.
IoU is therefore usually combined with other metrics: precision, recall, F1, average precision, and confusion-matrix style breakdowns.
| Metric | Relationship to IoU |
|---|---|
| Dice coefficient (F1 in the binary case) | $\text{Dice} = \frac{2 \cdot \text{IoU}}{1 + \text{IoU}}$; monotonically related to IoU |
| Pixel accuracy | Fraction of correctly classified pixels; biased by class imbalance, so IoU is preferred |
| Frequency-weighted IoU (FWIoU) | Weighted average of per-class IoU, weighted by class frequency |
| Boundary IoU | Computed only on a thin band around object boundaries; emphasises edge accuracy |
| Tversky index | Generalisation of Jaccard with separately weighted FP and FN; reduces to IoU when both weights are 1 |
| Hausdorff distance | Distance-based dissimilarity, complementary to IoU for boundary quality in medical segmentation |
| Average Precision (AP) | Computed over precision-recall pairs derived from IoU-based matching |
The Jaccard form of IoU is widely used beyond images, for example in near-duplicate document detection, set-similarity joins in databases, and the ecological community comparisons for which Jaccard originally devised it.
IoU is the workhorse spatial-overlap metric in computer vision. It is simple to compute, intuitive to interpret, and aligned with the way humans judge whether two regions match. Its limitations have driven a productive line of research into gradient-friendly variants (GIoU, DIoU, CIoU, EIoU, SIoU, WIoU, ProbIoU) and matching-aware NMS schemes (Soft-NMS, DIoU-NMS). It defines the matching criterion in PASCAL VOC, COCO, KITTI, Cityscapes, and most modern benchmarks, and it sits inside both the training loss and the post-processing of essentially every contemporary object detector.
Imagine you and your friend both draw a square around a cat in the same picture. You want to check how well the two squares match. Put one drawing on top of the other. The piece of paper that is covered by both squares is the intersection. The piece of paper that is covered by either square is the union. Divide the small (intersection) by the big (union). If the squares match perfectly you get 1. If they do not overlap at all you get 0. Computers use that same number, called IoU, to grade themselves when they try to find objects in pictures.