See also: precision, recall, F1 score, confusion matrix, AUC, precision-recall curve
Average precision (AP) is an evaluation metric used to measure the quality of ranked retrieval results in information retrieval, object detection, and classification tasks. It summarizes a precision-recall curve into a single scalar value by computing the weighted mean of precision values at each recall threshold, where the weight is the increase in recall from the previous threshold. When averaged across multiple queries or object classes, it produces the metric known as mean average precision (mAP).
The metric was first popularized in the information retrieval community through the Text Retrieval Conference (TREC), organized by the National Institute of Standards and Technology (NIST) beginning in 1992. TREC standardized evaluation procedures for search systems, and mean average precision became the primary measure for comparing retrieval system quality. The trec_eval program, written by Chris Buckley, provided a common implementation that ensured consistent handling of interpolation and ranking calculations across the research community.
In computer vision, average precision gained widespread adoption through the PASCAL Visual Object Classes (VOC) Challenge, which ran annually from 2005 to 2012, and the Microsoft Common Objects in Context (COCO) dataset introduced in 2014. These benchmarks adopted AP as the standard metric for evaluating object detection and image segmentation models, including widely used architectures such as YOLO, Faster R-CNN, and Mask R-CNN.
Average precision ranges from 0 to 1. A score of 1 indicates perfect precision at every recall level (all relevant items are ranked before all irrelevant items), while a score near 0 indicates that relevant items are ranked poorly among irrelevant ones.
Understanding average precision requires familiarity with two foundational metrics.
Precision measures the fraction of retrieved items that are relevant:
Precision = TP / (TP + FP)
Recall measures the fraction of relevant items that are retrieved:
Recall = TP / (TP + FN)
Where TP is the number of true positives, FP is the number of false positives, and FN is the number of false negatives.
These two metrics have an inherent tension. Increasing recall (by returning more results) typically decreases precision (because more irrelevant items get included), and vice versa. The precision-recall curve visualizes this trade-off by plotting precision on the y-axis against recall on the x-axis at different ranking thresholds. Average precision captures the overall shape of this curve as a single number.
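As a small illustration of how these quantities evolve down a ranked list (a minimal sketch; the relevance labels below are made up):

```python
import numpy as np

# Hypothetical ranked list: 1 = relevant, 0 = not relevant (highest-ranked first)
relevance = np.array([1, 0, 1, 1, 0])
total_relevant = 3  # assume these are all the relevant items in the collection

hits = np.cumsum(relevance)                    # relevant items seen up to each rank
ranks = np.arange(1, len(relevance) + 1)
precision_at_k = hits / ranks                  # TP / (TP + FP) at each cutoff
recall_at_k = hits / total_relevant            # TP / (TP + FN) at each cutoff
print(precision_at_k)  # approximately [1.0, 0.5, 0.67, 0.75, 0.6]
print(recall_at_k)     # approximately [0.33, 0.33, 0.67, 1.0, 1.0]
```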
Average precision is formally defined as the area under the precision-recall curve:
AP = integral from 0 to 1 of p(r) dr
In practice, this integral is approximated by a finite sum over the ranked list of predictions:
AP = sum over k from 1 to n of [P(k) * delta_r(k)]
Where:
| Symbol | Meaning |
|---|---|
| n | Total number of predictions in the ranked list |
| P(k) | Precision at rank position k |
| delta_r(k) | Change in recall between rank k and rank k-1 |
Since recall only changes when a relevant item is encountered, delta_r(k) equals zero for irrelevant items. This means only the precision values at positions where relevant items appear contribute to the sum. An equivalent formulation makes this explicit:
AP = (1 / R) * sum over k from 1 to n of [P(k) * rel(k)]
Where R is the total number of relevant items and rel(k) is an indicator function that equals 1 if the item at rank k is relevant and 0 otherwise.
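This formulation translates directly into code. The sketch below assumes a binary relevance vector already ordered by rank; the function name is illustrative:

```python
import numpy as np

def average_precision(relevance, total_relevant=None):
    """Non-interpolated AP from a ranked binary relevance vector (1 = relevant)."""
    relevance = np.asarray(relevance)
    if total_relevant is None:
        total_relevant = relevance.sum()  # assume every relevant item appears in the ranking
    hits = np.cumsum(relevance)
    precision_at_k = hits / np.arange(1, len(relevance) + 1)
    # Sum precision only at ranks where a relevant item appears, then divide by R
    return (precision_at_k * relevance).sum() / total_relevant
```

For example, `average_precision([1, 0, 1, 0, 1])` returns (1 + 2/3 + 3/5) / 3 ≈ 0.756.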
The raw precision-recall curve has a characteristic saw-tooth pattern: when a non-relevant item is retrieved, recall stays the same but precision drops; when a relevant item is retrieved, both precision and recall increase, causing the curve to jag upward and to the right. To smooth out this pattern, the interpolated precision at a given recall level r is defined as the maximum precision at any recall level greater than or equal to r:
p_interp(r) = max p(r') for all r' >= r
This interpolation is based on a practical justification described in the Stanford Introduction to Information Retrieval textbook: a user who has seen the results up to a certain point would be willing to examine a few more documents if doing so would increase the fraction of relevant documents in their reviewed set.
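Because the envelope is a running maximum taken from the high-recall end of the curve, it can be computed in one pass (a minimal sketch, assuming precision values ordered by increasing recall):

```python
import numpy as np

def interpolate_precision(precision):
    """Monotonically decreasing envelope: p_interp(r) = max precision at recall >= r."""
    # Running maximum from right to left over precisions ordered by increasing recall
    return np.maximum.accumulate(precision[::-1])[::-1]

precision = np.array([1.0, 0.5, 0.667, 0.5, 0.6])
print(interpolate_precision(precision))  # [1.0, 0.667, 0.667, 0.6, 0.6]
```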
Several methods exist for computing average precision, and the differences between them can produce noticeably different scores for the same set of predictions. Researchers have identified at least five distinct definitions used in the literature, so it is important to specify which method is being used when reporting results.
The most straightforward approach computes AP by summing the precision at each rank where a relevant item appears, weighted by the change in recall:
AP = sum over k from 1 to n of [P(k) * delta_r(k)]
This is the method used by scikit-learn's average_precision_score function. The scikit-learn documentation explicitly notes that this implementation is "not interpolated" and differs from computing the area under the precision-recall curve with the trapezoidal rule, which uses linear interpolation and "can be too optimistic."
The 11-point interpolation method, used in early TREC evaluations and the PASCAL VOC Challenge from 2007 to 2009, samples the interpolated precision at 11 equally spaced recall levels: {0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0}.
AP_11 = (1/11) * sum over r in {0, 0.1, ..., 1.0} of p_interp(r)
At each of the 11 recall levels, the interpolated precision is the maximum precision at any recall value greater than or equal to the current level. This method reduces the impact of small variations ("wiggles") in the precision-recall curve caused by minor changes in the ranking of examples.
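A minimal sketch of the 11-point computation, assuming `precision` and `recall` are NumPy arrays ordered by increasing recall:

```python
import numpy as np

def ap_11_point(precision, recall):
    """11-point interpolated AP (a sketch; arrays ordered by increasing recall)."""
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):
        # Interpolated precision: max precision at any recall >= r (0 if the curve never gets there)
        mask = recall >= r
        p_interp = precision[mask].max() if mask.any() else 0.0
        ap += p_interp / 11.0
    return ap
```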
Starting with VOC 2010, the PASCAL VOC Challenge switched to sampling the precision-recall curve at all unique recall values where the maximum precision drops, rather than only at 11 fixed points. This computes the exact area under the interpolated (monotonically decreasing) precision-recall curve:
AP_all = sum over k of [(r_k - r_{k-1}) * p_interp(r_k)]
This method is more precise than 11-point interpolation and generally produces slightly higher AP values because it accounts for the full shape of the curve. For example, on a sample detection task at IoU threshold 0.5, the 11-point method yielded AP = 88.64% while the all-point method yielded AP = 89.58%.
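A sketch of the all-point computation in the spirit of common VOC-style implementations (the padding with sentinel points at recall 0 and 1 is one conventional choice, not a fixed standard):

```python
import numpy as np

def ap_all_point(precision, recall):
    """All-point interpolated AP: exact area under the monotonic envelope (a sketch)."""
    # Pad with sentinel points so the envelope spans the full recall range
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # Monotonically decreasing precision envelope (right-to-left running maximum)
    p = np.maximum.accumulate(p[::-1])[::-1]
    # Sum the rectangles at every point where recall changes
    idx = np.where(r[1:] != r[:-1])[0]
    return np.sum((r[idx + 1] - r[idx]) * p[idx + 1])
```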
| Method | Used by | Recall sample points | Notes |
|---|---|---|---|
| Non-interpolated (exact) | scikit-learn, trec_eval | All ranks with relevant items | Sums precision only at relevant positions |
| 11-point interpolation | TREC (early), PASCAL VOC 2007-2009 | {0, 0.1, 0.2, ..., 1.0} | Averages max precision at 11 fixed points |
| All-point interpolation | PASCAL VOC 2010-2012; COCO (101-point variant) | All unique recall transition points | Area under monotonically decreasing envelope |
Consider a search query with 5 relevant documents in a collection. A retrieval system returns a ranked list of 10 documents, where R denotes a relevant document and N denotes a non-relevant document:
Rank 1: R, Rank 2: N, Rank 3: R, Rank 4: N, Rank 5: N, Rank 6: R, Rank 7: N, Rank 8: R, Rank 9: N, Rank 10: R
The precision and recall at each relevant position are:
| Rank | Document | Precision at k | Recall at k | Precision contributes to AP? |
|---|---|---|---|---|
| 1 | Relevant | 1/1 = 1.000 | 1/5 = 0.20 | Yes |
| 2 | Non-relevant | 1/2 = 0.500 | 1/5 = 0.20 | No |
| 3 | Relevant | 2/3 = 0.667 | 2/5 = 0.40 | Yes |
| 4 | Non-relevant | 2/4 = 0.500 | 2/5 = 0.40 | No |
| 5 | Non-relevant | 2/5 = 0.400 | 2/5 = 0.40 | No |
| 6 | Relevant | 3/6 = 0.500 | 3/5 = 0.60 | Yes |
| 7 | Non-relevant | 3/7 = 0.429 | 3/5 = 0.60 | No |
| 8 | Relevant | 4/8 = 0.500 | 4/5 = 0.80 | Yes |
| 9 | Non-relevant | 4/9 = 0.444 | 4/5 = 0.80 | No |
| 10 | Relevant | 5/10 = 0.500 | 5/5 = 1.00 | Yes |
Average precision is the mean of the precision values at each relevant position:
AP = (1/5) * (1.000 + 0.667 + 0.500 + 0.500 + 0.500) = (1/5) * 3.167 = 0.633
If the system had returned all five relevant documents at the top of the ranking (ranks 1 through 5), the AP would be 1.0. The penalty for interleaving non-relevant documents among relevant ones is what drives AP below 1.0.
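The example can be checked against scikit-learn by encoding the ranking as descending scores (a minimal sketch):

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Ranked list from the example: R N R N N R N R N R, encoded top-to-bottom
y_true = np.array([1, 0, 1, 0, 0, 1, 0, 1, 0, 1])
y_scores = np.arange(10, 0, -1)  # higher score = higher rank

print(average_precision_score(y_true, y_scores))  # approximately 0.6333
```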
Mean average precision extends AP to evaluate system performance across multiple queries or classes. It is simply the arithmetic mean of the AP values:
MAP = (1/Q) * sum over q from 1 to Q of AP(q)
Where Q is the total number of queries (in information retrieval) or classes (in object detection).
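In code, mAP is simply the arithmetic mean of per-query AP values; the sketch below uses scikit-learn with made-up labels and scores:

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Hypothetical relevance labels and system scores for two queries
queries = [
    (np.array([1, 0, 1, 0, 1]), np.array([0.9, 0.8, 0.7, 0.6, 0.5])),  # query 1
    (np.array([0, 1, 1, 0, 0]), np.array([0.9, 0.8, 0.7, 0.6, 0.5])),  # query 2
]
per_query_ap = [average_precision_score(y, s) for y, s in queries]
map_score = float(np.mean(per_query_ap))
print(per_query_ap, map_score)
```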
In information retrieval, MAP averages AP across a set of test queries. For example, if a search engine is evaluated on 50 queries and produces AP values ranging from 0.1 to 0.9 for individual queries, the MAP is the mean of those 50 values. According to the Stanford Introduction to Information Retrieval textbook, MAP has several desirable properties: it has good discrimination and stability, it requires no fixed recall levels or interpolation (in the non-interpolated form), and it provides a single-figure quality measure across all recall levels. Typical MAP scores range from 0.1 to 0.7 across different information needs.
In object detection, mAP averages AP across all object categories. If a detector is evaluated on a dataset with 20 categories (as in PASCAL VOC) or 80 categories (as in COCO), the mAP is the mean of AP values computed separately for each category. Note that the COCO benchmark does not distinguish between AP and mAP in its notation; the primary metric labeled "AP" is actually the mean AP across all 80 categories and multiple IoU thresholds.
MAP@K (mean average precision at K) restricts the evaluation to the top K results in the ranked list. This variant is commonly used in recommendation systems and web search, where users rarely look beyond the first page of results. For instance, MAP@10 evaluates the average precision considering only the top 10 results for each query. The metric takes values from 0 to 1, where 1 indicates all relevant items appear at the top of the list.
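A minimal sketch of AP@K and MAP@K; note that implementations differ in the normalization term (some divide by min(K, R), others by R), so the convention below is one common choice rather than a fixed standard:

```python
import numpy as np

def ap_at_k(relevance, k, total_relevant):
    """AP@K over the top k of a ranked binary relevance vector (one common convention)."""
    relevance = np.asarray(relevance)[:k]
    hits = np.cumsum(relevance)
    precision_at_k = hits / np.arange(1, len(relevance) + 1)
    # Normalize by the number of relevant items that could possibly appear in the top k
    return (precision_at_k * relevance).sum() / min(k, total_relevant)

def map_at_k(rankings, totals, k):
    """Mean of AP@K across queries."""
    return float(np.mean([ap_at_k(r, k, t) for r, t in zip(rankings, totals)]))
```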
Object detection adds an extra dimension to AP computation: spatial localization. A predicted bounding box is not simply "correct" or "incorrect" but must be evaluated against ground-truth boxes using a spatial overlap criterion.
Intersection over Union (IoU) measures the spatial overlap between a predicted bounding box and a ground-truth bounding box:
IoU = Area of Intersection / Area of Union
An IoU of 1.0 indicates perfect overlap, and 0.0 indicates no overlap. A prediction counts as a true positive only if its IoU with a ground-truth box exceeds a predetermined threshold. The most common threshold is 0.5, meaning the overlap area must be at least half of the combined (union) area of the predicted and ground-truth boxes.
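A minimal sketch of IoU for axis-aligned boxes given in (x1, y1, x2, y2) format:

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2); a minimal sketch."""
    # Intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175, approximately 0.143
```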
The PASCAL Visual Object Classes (VOC) Challenge, described by Everingham et al. (2010), defined the evaluation protocol that became standard in object detection research.
Key characteristics of the PASCAL VOC evaluation:
| Feature | Specification |
|---|---|
| IoU threshold | 0.5 (fixed) |
| Number of classes | 20 |
| AP interpolation (2007-2009) | 11-point |
| AP interpolation (2010-2012) | All-point |
| Primary metric | mAP (mean AP across 20 classes) |
| Matching rule | Each ground-truth box matches at most one prediction |
The change from 11-point to all-point interpolation in 2010 was motivated by the desire to measure the exact area under the precision-recall curve rather than an approximation.
The COCO dataset, introduced by Lin et al. (2014), refined the evaluation protocol in several ways to provide a more thorough assessment of detector performance.
COCO's primary metric, denoted AP, averages over 10 IoU thresholds from 0.50 to 0.95 in steps of 0.05. This penalizes detections that are only roughly localized and rewards models with tighter bounding boxes. The full suite of COCO metrics is:
| Metric | IoU threshold | Area range | Max detections | Description |
|---|---|---|---|---|
| AP | 0.50:0.95 | All | 100 | Primary metric; averaged over 10 IoU thresholds and all classes |
| AP50 | 0.50 | All | 100 | AP at IoU 0.50 (equivalent to PASCAL VOC metric) |
| AP75 | 0.75 | All | 100 | AP at a strict IoU threshold |
| AP_small | 0.50:0.95 | < 32x32 pixels | 100 | AP for small objects |
| AP_medium | 0.50:0.95 | 32x32 to 96x96 pixels | 100 | AP for medium objects |
| AP_large | 0.50:0.95 | > 96x96 pixels | 100 | AP for large objects |
| AR1 | 0.50:0.95 | All | 1 | Average recall with 1 detection per image |
| AR10 | 0.50:0.95 | All | 10 | Average recall with 10 detections per image |
| AR100 | 0.50:0.95 | All | 100 | Average recall with 100 detections per image |
| AR_small | 0.50:0.95 | < 32x32 pixels | 100 | AR for small objects |
| AR_medium | 0.50:0.95 | 32x32 to 96x96 pixels | 100 | AR for medium objects |
| AR_large | 0.50:0.95 | > 96x96 pixels | 100 | AR for large objects |
The distinction between AP at different object sizes is valuable because detectors often struggle with small objects. A model might achieve high AP_large but very low AP_small, and a single mAP number would mask this discrepancy.
| Feature | PASCAL VOC | COCO |
|---|---|---|
| Number of object categories | 20 | 80 |
| IoU thresholds | 0.5 only | 0.50 to 0.95 (10 thresholds) |
| Interpolation method | 11-point (2007-2009), all-point (2010+) | 101-point interpolation |
| Size-specific metrics | No | Yes (small, medium, large) |
| Max detections per image | Unlimited | 1, 10, or 100 |
| Primary metric | mAP@0.5 | mAP@[0.5:0.95] |
| Difficulty level for models | Easier (single, lenient IoU) | Harder (averaged over stricter IoU values) |
Outside of object detection and information retrieval, average precision is used to evaluate binary classifiers and multi-label classification systems, particularly when the classes are imbalanced.
In this context, the classifier outputs a continuous score (probability estimate, confidence value, or decision function output) for each sample, and AP summarizes the precision-recall trade-off across all possible decision thresholds. This is particularly useful when the positive class is rare, because the precision-recall curve and AP are more informative than ROC AUC on highly imbalanced datasets. The ROC curve can appear overly optimistic when the negative class is very large, because the false positive rate denominator (TN + FP) is dominated by true negatives.
Scikit-learn implements this through sklearn.metrics.average_precision_score, which takes true labels and predicted scores as input:
```python
import numpy as np
from sklearn.metrics import average_precision_score

y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])

ap = average_precision_score(y_true, y_scores)
print(ap)  # 0.8333...
```
For multilabel classification, scikit-learn supports computing AP per class with different averaging strategies: macro (unweighted mean), weighted (weighted by class support), micro (global precision-recall), and per-sample averaging.
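A brief sketch of the multilabel usage; the labels and scores below are made up:

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Three samples, two labels (multilabel indicator format); scores are hypothetical
y_true = np.array([[1, 0], [0, 1], [1, 1]])
y_scores = np.array([[0.8, 0.3], [0.4, 0.7], [0.2, 0.9]])

print(average_precision_score(y_true, y_scores, average='macro'))    # unweighted mean over labels
print(average_precision_score(y_true, y_scores, average='micro'))    # pooled precision-recall
print(average_precision_score(y_true, y_scores, average='samples'))  # per-sample averaging
```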
Average precision belongs to a family of evaluation metrics, each capturing different aspects of ranking or classification quality.
| Metric | What it measures | Relationship to AP |
|---|---|---|
| Precision at K (P@K) | Fraction of relevant items in the top K results | Does not account for the ranking order within K |
| R-Precision | Precision at rank R, where R is the number of relevant items | Equivalent to both precision and recall at rank R |
| F1 score | Harmonic mean of precision and recall at a single threshold | Single operating point, whereas AP spans all thresholds |
| ROC AUC | Area under the ROC curve (TPR vs. FPR) | Uses false positive rate instead of precision; less sensitive to class imbalance |
| NDCG | Normalized discounted cumulative gain | Handles graded (multi-level) relevance; AP uses binary relevance |
| BLEU | N-gram overlap for text generation | Measures text similarity, not ranking quality |
| ROUGE | Recall-oriented n-gram overlap for summarization | Measures text overlap, not ranking quality |
| Average Recall (AR) | Mean of recall values at multiple IoU thresholds or detection limits | Complementary to AP in COCO evaluation |
Both AP (area under the precision-recall curve) and ROC AUC (area under the ROC curve) summarize a classifier's threshold-independent performance. The key difference lies in how they handle class imbalance. ROC AUC uses the false positive rate (FP / (FP + TN)), which is insensitive to the ratio of positive to negative samples because it is normalized by the number of negatives. When negatives vastly outnumber positives, even a large number of false positives produces a small FPR, making the ROC curve look favorable. AP uses precision (TP / (TP + FP)), which is directly affected by false positives regardless of the number of true negatives. For this reason, AP is generally preferred over ROC AUC when evaluating performance on imbalanced datasets where the positive class is of primary interest.
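The difference is easy to see on a synthetic, highly imbalanced dataset (a minimal sketch; exact values depend on the random seed and model):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, average_precision_score

# Highly imbalanced binary problem: roughly 1% positives
X, y = make_classification(n_samples=20000, n_features=20, weights=[0.99, 0.01], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
print("ROC AUC:", roc_auc_score(y_te, scores))                       # typically looks high
print("Average precision:", average_precision_score(y_te, scores))   # usually much lower
```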
Average precision has several properties that make it a widely adopted evaluation metric:
Threshold-free evaluation. AP evaluates performance across all possible decision thresholds, removing the need to choose a specific operating point. This allows fair comparison between models that might perform differently at different threshold settings.
Rank sensitivity. AP rewards systems that rank relevant items higher. Two systems might retrieve the same number of relevant items, but the one that places them at the top of the list will receive a higher AP score.
Single-number summary. AP compresses the full precision-recall curve into a single value, making it easy to compare systems across experiments and in leaderboard-style competitions.
Handles variable set sizes. Unlike metrics such as precision at K that require choosing a fixed cutoff, AP naturally handles situations where different queries have different numbers of relevant items.
Strong statistical properties. According to the Stanford Introduction to Information Retrieval textbook, MAP (the mean of AP across queries) has "good discrimination and stability," meaning it reliably distinguishes between better and worse systems.
Despite its widespread use, average precision has known weaknesses that practitioners should be aware of.
Not all precision-recall pairs matter equally. In real-world applications, there is usually a "sensible operating range" where both precision and recall must meet minimum thresholds. AP weights all parts of the precision-recall curve equally, including regions that may be irrelevant for a particular deployment scenario. Two models with identical AP values can have vastly different practical utility if one performs better in the operating range that matters.
Lack of consensus on calculation method. Researchers have identified at least five different definitions of AP in the literature. Results computed with one method are not directly comparable to results computed with another. Papers that report AP or mAP without specifying the exact calculation method create confusion.
Arbitrary extrapolation. In some implementations, precision values at recall levels beyond the highest achieved recall are set to zero. This can penalize detectors that legitimately achieve high precision at lower recall levels.
Binary relevance assumption. AP treats relevance as binary (relevant or not relevant). It cannot distinguish between partially relevant and highly relevant items. Metrics like Normalized Discounted Cumulative Gain (NDCG) handle graded relevance.
Masking component-level performance. In object detection, mAP aggregates performance across all classes. A detector that performs well on 18 of 20 classes but fails completely on 2 classes might still report a respectable mAP. Similarly, mAP does not distinguish whether errors stem from poor localization, incorrect classification, or missed detections.
IoU threshold dependency. In object detection, the AP value depends heavily on the chosen IoU threshold. A detector might achieve AP = 85% at IoU 0.5 but only AP = 40% at IoU 0.75. The COCO evaluation protocol addresses this by averaging over multiple IoU thresholds, but the choice of threshold range itself remains a design decision.
Sensitivity to dataset composition. AP can be influenced by the ratio of easy to hard examples in the dataset. A test set dominated by easy instances will inflate AP values, potentially obscuring poor performance on difficult cases.
Always specify the calculation method. When reporting AP or mAP, state whether 11-point interpolation, all-point interpolation, or non-interpolated AP was used. State the IoU threshold if applicable.
Report per-class AP alongside mAP. The per-class breakdown reveals which categories the model handles well and which it struggles with.
Use size-specific metrics. For object detection, COCO-style AP_small, AP_medium, and AP_large provide more actionable insights than a single mAP number.
Complement AP with other metrics. Consider reporting precision at specific recall levels, the full precision-recall curve, or confusion matrices to give readers a more complete picture of performance.
Consider application-specific recall ranges. When the deployment scenario has known precision or recall requirements, restrict the AP calculation to the relevant portion of the curve.
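One way to do this is to integrate the precision-recall curve only over the recall range of interest. The sketch below is one possible convention (the normalization by the range width is a choice made here for illustration, not a standard):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def ap_in_recall_range(y_true, y_scores, r_min, r_max):
    """Area under the PR curve restricted to recall in [r_min, r_max], normalized by the range width."""
    precision, recall, _ = precision_recall_curve(y_true, y_scores)
    # precision_recall_curve returns recall in decreasing order; sort by increasing recall
    order = np.argsort(recall)
    recall, precision = recall[order], precision[order]
    area = 0.0
    for i in range(1, len(recall)):
        lo, hi = max(recall[i - 1], r_min), min(recall[i], r_max)
        if hi > lo:
            # Step-wise sum using the precision at the higher-recall end of each step
            area += (hi - lo) * precision[i]
    return area / (r_max - r_min)
```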
Average precision is implemented in all major machine learning libraries.
| Library | Function / Class | Notes |
|---|---|---|
| scikit-learn | sklearn.metrics.average_precision_score | Non-interpolated AP; supports binary and multilabel |
| scikit-learn | sklearn.metrics.precision_recall_curve | Returns raw precision-recall pairs for custom AP computation |
| COCO API (pycocotools) | COCOeval | All-point interpolation with 101 recall thresholds |
| PyTorch (TorchMetrics) | torchmetrics.detection.MeanAveragePrecision | Supports VOC and COCO evaluation protocols |
| TensorFlow Object Detection API | object_detection.metrics | COCO-compatible evaluation |
| trec_eval | Command-line tool | Standard TREC evaluation with multiple AP variants |
| Detectron2 | COCOEvaluator | COCO-style evaluation for detection and segmentation |
Example: computing AP for a binary classifier with scikit-learn and plotting the precision-recall curve:

```python
from sklearn.metrics import average_precision_score, precision_recall_curve
import matplotlib.pyplot as plt

# Binary classification example
y_true = [0, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_scores = [0.1, 0.35, 0.4, 0.8, 0.25, 0.65, 0.3, 0.15, 0.55, 0.9]

ap = average_precision_score(y_true, y_scores)
print(f"Average Precision: {ap:.4f}")

# Plot precision-recall curve
precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
plt.plot(recall, precision, marker='.')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title(f'Precision-Recall Curve (AP = {ap:.4f})')
plt.show()
```
Example: running the standard COCO-style evaluation with pycocotools:

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Load ground truth and detections
coco_gt = COCO('annotations/instances_val2017.json')
coco_dt = coco_gt.loadRes('detections_val2017.json')

# Run COCO evaluation
coco_eval = COCOeval(coco_gt, coco_dt, 'bbox')
coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()
# Prints AP@[0.50:0.95], AP@0.50, AP@0.75, and size-specific metrics
```
Imagine you have a big basket of balls. Some are red (the ones you want to find) and most are blue (the ones you do not care about). You reach in and pull out balls one at a time, and you line them up in a row.
Every time you pull out a red ball, you check: "Out of all the balls I have pulled out so far, how many are red?" That fraction is your precision at that point.
Average precision looks at your precision every time you pull out a red ball and then takes the average of those numbers. If you pull out all the red balls first before any blue balls, your average precision is perfect (1.0). But if you pull out a bunch of blue balls before finding any red ones, your precision is low when you finally find a red ball, and your average precision goes down.
So average precision measures two things at once: did you find the red balls (recall), and did you avoid picking up blue balls along the way (precision)?