See also: precision, recall, F1 score, confusion matrix, AUC, precision-recall curve
Average precision (AP) is an evaluation metric used to measure the quality of ranked retrieval results in information retrieval, object detection, and classification tasks. It summarizes a precision-recall curve into a single scalar value by computing the weighted mean of precision values at each recall threshold, where the weight is the increase in recall from the previous threshold. When averaged across multiple queries or object classes, it produces the metric known as mean average precision (mAP).
The metric was first popularized in the information retrieval community through the Text Retrieval Conference (TREC), organized by the National Institute of Standards and Technology (NIST) beginning in 1992. TREC standardized evaluation procedures for search systems, and mean average precision became the primary measure for comparing retrieval system quality. The trec_eval program, written by Chris Buckley, provided a common implementation that ensured consistent handling of interpolation and ranking calculations across the research community.
In computer vision, average precision gained widespread adoption through the PASCAL Visual Object Classes (VOC) Challenge, which ran annually from 2005 to 2012, and the Microsoft Common Objects in Context (COCO) dataset introduced in 2014. These benchmarks adopted AP as the standard metric for evaluating object detection and image segmentation models, including widely used architectures such as YOLO, Faster R-CNN, and Mask R-CNN.
Average precision ranges from 0 to 1. A score of 1 indicates perfect precision at every recall level (all relevant items are ranked before all irrelevant items), while a score near 0 indicates that relevant items are ranked poorly among irrelevant ones.
Understanding average precision requires familiarity with two foundational metrics.
Precision measures the fraction of retrieved items that are relevant:
Precision = TP / (TP + FP)
Recall measures the fraction of relevant items that are retrieved:
Recall = TP / (TP + FN)
Where TP is the number of true positives, FP is the number of false positives, and FN is the number of false negatives.
These two metrics have an inherent tension. Increasing recall (by returning more results) typically decreases precision (because more irrelevant items get included), and vice versa. The precision-recall curve visualizes this trade-off by plotting precision on the y-axis against recall on the x-axis at different ranking thresholds. Average precision captures the overall shape of this curve as a single number.
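As a small illustration of how these quantities evolve down a ranked list (a minimal sketch; the relevance labels below are made up):

```python
import numpy as np

# Hypothetical ranked list: 1 = relevant, 0 = not relevant (highest-ranked first)
relevance = np.array([1, 0, 1, 1, 0])
total_relevant = 3  # assume these are all the relevant items in the collection

hits = np.cumsum(relevance)                    # relevant items seen up to each rank
ranks = np.arange(1, len(relevance) + 1)
precision_at_k = hits / ranks                  # TP / (TP + FP) at each cutoff
recall_at_k = hits / total_relevant            # TP / (TP + FN) at each cutoff
print(precision_at_k)  # approximately [1.0, 0.5, 0.67, 0.75, 0.6]
print(recall_at_k)     # approximately [0.33, 0.33, 0.67, 1.0, 1.0]
```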
Average precision is formally defined as the area under the precision-recall curve:
AP = integral from 0 to 1 of p(r) dr
In practice, this integral is approximated by a finite sum over the ranked list of predictions:
AP = sum over k from 1 to n of [P(k) * delta_r(k)]
Where:
| Symbol | Meaning |
|---|---|
| n | Total number of predictions in the ranked list |
| P(k) | Precision at rank position k |
| delta_r(k) | Change in recall between rank k and rank k-1 |
Since recall only changes when a relevant item is encountered, delta_r(k) equals zero for irrelevant items. This means only the precision values at positions where relevant items appear contribute to the sum. An equivalent formulation makes this explicit:
AP = (1 / R) * sum over k from 1 to n of [P(k) * rel(k)]
Where R is the total number of relevant items and rel(k) is an indicator function that equals 1 if the item at rank k is relevant and 0 otherwise.
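This formulation translates directly into code. The sketch below assumes a binary relevance vector already ordered by rank; the function name is illustrative:

```python
import numpy as np

def average_precision(relevance, total_relevant=None):
    """Non-interpolated AP from a ranked binary relevance vector (1 = relevant)."""
    relevance = np.asarray(relevance)
    if total_relevant is None:
        total_relevant = relevance.sum()  # assume every relevant item appears in the ranking
    hits = np.cumsum(relevance)
    precision_at_k = hits / np.arange(1, len(relevance) + 1)
    # Sum precision only at ranks where a relevant item appears, then divide by R
    return (precision_at_k * relevance).sum() / total_relevant
```

For example, `average_precision([1, 0, 1, 0, 1])` returns (1 + 2/3 + 3/5) / 3 ≈ 0.756.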
The raw precision-recall curve has a characteristic saw-tooth pattern: when a non-relevant item is retrieved, recall stays the same but precision drops; when a relevant item is retrieved, both precision and recall increase, causing the curve to jag upward and to the right. To smooth out this pattern, the interpolated precision at a given recall level r is defined as the maximum precision at any recall level greater than or equal to r:
p_interp(r) = max p(r') for all r' >= r
This interpolation is based on a practical justification described in the Stanford Introduction to Information Retrieval textbook: a user who has seen the results up to a certain point would be willing to examine a few more documents if doing so would increase the fraction of relevant documents in their reviewed set.
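Because the envelope is a running maximum taken from the high-recall end of the curve, it can be computed in one pass (a minimal sketch, assuming precision values ordered by increasing recall):

```python
import numpy as np

def interpolate_precision(precision):
    """Monotonically decreasing envelope: p_interp(r) = max precision at recall >= r."""
    # Running maximum from right to left over precisions ordered by increasing recall
    return np.maximum.accumulate(precision[::-1])[::-1]

precision = np.array([1.0, 0.5, 0.667, 0.5, 0.6])
print(interpolate_precision(precision))  # [1.0, 0.667, 0.667, 0.6, 0.6]
```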
Several methods exist for computing average precision, and the differences between them can produce noticeably different scores for the same set of predictions. Researchers have identified at least five distinct definitions used in the literature, so it is important to specify which method is being used when reporting results.
The most straightforward approach computes AP by summing the precision at each rank where a relevant item appears, weighted by the change in recall:
AP = sum over k from 1 to n of [P(k) * delta_r(k)]
This is the method used by scikit-learn's average_precision_score function. The scikit-learn documentation explicitly notes that this implementation is "not interpolated" and differs from computing the area under the precision-recall curve with the trapezoidal rule, which uses linear interpolation and "can be too optimistic."
The 11-point interpolation method, used in early TREC evaluations and the PASCAL VOC Challenge from 2007 to 2009, samples the interpolated precision at 11 equally spaced recall levels: {0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0}.
AP_11 = (1/11) * sum over r in {0, 0.1, ..., 1.0} of p_interp(r)
At each of the 11 recall levels, the interpolated precision is the maximum precision at any recall value greater than or equal to the current level. This method reduces the impact of small variations ("wiggles") in the precision-recall curve caused by minor changes in the ranking of examples.
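A minimal sketch of the 11-point computation, assuming `precision` and `recall` are NumPy arrays ordered by increasing recall:

```python
import numpy as np

def ap_11_point(precision, recall):
    """11-point interpolated AP (a sketch; arrays ordered by increasing recall)."""
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):
        # Interpolated precision: max precision at any recall >= r (0 if the curve never gets there)
        mask = recall >= r
        p_interp = precision[mask].max() if mask.any() else 0.0
        ap += p_interp / 11.0
    return ap
```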
Starting with VOC 2010, the PASCAL VOC Challenge switched to sampling the precision-recall curve at all unique recall values where the maximum precision drops, rather than only at 11 fixed points. This computes the exact area under the interpolated (monotonically decreasing) precision-recall curve:
AP_all = sum over k of [(r_k - r_{k-1}) * p_interp(r_k)]
This method is more precise than 11-point interpolation and generally produces slightly higher AP values because it accounts for the full shape of the curve. For example, on a sample detection task at IoU threshold 0.5, the 11-point method yielded AP = 88.64% while the all-point method yielded AP = 89.58%.
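A sketch of the all-point computation in the spirit of common VOC-style implementations (the padding with sentinel points at recall 0 and 1 is one conventional choice, not a fixed standard):

```python
import numpy as np

def ap_all_point(precision, recall):
    """All-point interpolated AP: exact area under the monotonic envelope (a sketch)."""
    # Pad with sentinel points so the envelope spans the full recall range
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # Monotonically decreasing precision envelope (right-to-left running maximum)
    p = np.maximum.accumulate(p[::-1])[::-1]
    # Sum the rectangles at every point where recall changes
    idx = np.where(r[1:] != r[:-1])[0]
    return np.sum((r[idx + 1] - r[idx]) * p[idx + 1])
```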
| Method | Used by | Recall sample points | Notes |
|---|---|---|---|
| Non-interpolated (exact) | scikit-learn, trec_eval | All ranks with relevant items | Sums precision only at relevant positions |
| 11-point interpolation | TREC (early), PASCAL VOC 2007-2009 | {0, 0.1, 0.2, ..., 1.0} | Averages max precision at 11 fixed points |
| All-point interpolation | PASCAL VOC 2010-2012; COCO (101-point variant) | All unique recall transition points | Area under monotonically decreasing envelope |
Consider a search query with 5 relevant documents in a collection. A retrieval system returns a ranked list of 10 documents, where R denotes a relevant document and N denotes a non-relevant document:
Rank 1: R, Rank 2: N, Rank 3: R, Rank 4: N, Rank 5: N, Rank 6: R, Rank 7: N, Rank 8: R, Rank 9: N, Rank 10: R
The precision and recall at each relevant position are:
| Rank | Document | Precision at k | Recall at k | Precision contributes to AP? |
|---|---|---|---|---|
| 1 | Relevant | 1/1 = 1.000 | 1/5 = 0.20 | Yes |
| 2 | Non-relevant | 1/2 = 0.500 | 1/5 = 0.20 | No |
| 3 | Relevant | 2/3 = 0.667 | 2/5 = 0.40 | Yes |
| 4 | Non-relevant | 2/4 = 0.500 | 2/5 = 0.40 | No |
| 5 | Non-relevant | 2/5 = 0.400 | 2/5 = 0.40 | No |
| 6 | Relevant | 3/6 = 0.500 | 3/5 = 0.60 | Yes |
| 7 | Non-relevant | 3/7 = 0.429 | 3/5 = 0.60 | No |
| 8 | Relevant | 4/8 = 0.500 | 4/5 = 0.80 | Yes |
| 9 | Non-relevant | 4/9 = 0.444 | 4/5 = 0.80 | No |
| 10 | Relevant | 5/10 = 0.500 | 5/5 = 1.00 | Yes |
Average precision is the mean of the precision values at each relevant position:
AP = (1/5) * (1.000 + 0.667 + 0.500 + 0.500 + 0.500) = (1/5) * 3.167 = 0.633
If the system had returned all five relevant documents at the top of the ranking (ranks 1 through 5), the AP would be 1.0. The penalty for interleaving non-relevant documents among relevant ones is what drives AP below 1.0.
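The example can be checked against scikit-learn by encoding the ranking as descending scores (a minimal sketch):

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Ranked list from the example: R N R N N R N R N R, encoded top-to-bottom
y_true = np.array([1, 0, 1, 0, 0, 1, 0, 1, 0, 1])
y_scores = np.arange(10, 0, -1)  # higher score = higher rank

print(average_precision_score(y_true, y_scores))  # approximately 0.6333
```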
Mean average precision extends AP to evaluate system performance across multiple queries or classes. It is simply the arithmetic mean of the AP values:
MAP = (1/Q) * sum over q from 1 to Q of AP(q)
Where Q is the total number of queries (in information retrieval) or classes (in object detection).
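In code, mAP is simply the arithmetic mean of per-query AP values; the sketch below uses scikit-learn with made-up labels and scores:

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Hypothetical relevance labels and system scores for two queries
queries = [
    (np.array([1, 0, 1, 0, 1]), np.array([0.9, 0.8, 0.7, 0.6, 0.5])),  # query 1
    (np.array([0, 1, 1, 0, 0]), np.array([0.9, 0.8, 0.7, 0.6, 0.5])),  # query 2
]
per_query_ap = [average_precision_score(y, s) for y, s in queries]
map_score = float(np.mean(per_query_ap))
print(per_query_ap, map_score)
```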
In information retrieval, MAP averages AP across a set of test queries. For example, if a search engine is evaluated on 50 queries and produces AP values ranging from 0.1 to 0.9 for individual queries, the MAP is the mean of those 50 values. According to the Stanford Introduction to Information Retrieval textbook, MAP has several desirable properties: it has good discrimination and stability, it requires no fixed recall levels or interpolation (in the non-interpolated form), and it provides a single-figure quality measure across all recall levels. Typical MAP scores range from 0.1 to 0.7 across different information needs.
In object detection, mAP averages AP across all object categories. If a detector is evaluated on a dataset with 20 categories (as in PASCAL VOC) or 80 categories (as in COCO), the mAP is the mean of AP values computed separately for each category. Note that the COCO benchmark does not distinguish between AP and mAP in its notation; the primary metric labeled "AP" is actually the mean AP across all 80 categories and multiple IoU thresholds.
MAP@K (mean average precision at K) restricts the evaluation to the top K results in the ranked list. This variant is commonly used in recommendation systems and web search, where users rarely look beyond the first page of results. For instance, MAP@10 evaluates the average precision considering only the top 10 results for each query. The metric takes values from 0 to 1, where 1 indicates all relevant items appear at the top of the list.
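A minimal sketch of AP@K and MAP@K; note that implementations differ in the normalization term (some divide by min(K, R), others by R), so the convention below is one common choice rather than a fixed standard:

```python
import numpy as np

def ap_at_k(relevance, k, total_relevant):
    """AP@K over the top k of a ranked binary relevance vector (one common convention)."""
    relevance = np.asarray(relevance)[:k]
    hits = np.cumsum(relevance)
    precision_at_k = hits / np.arange(1, len(relevance) + 1)
    # Normalize by the number of relevant items that could possibly appear in the top k
    return (precision_at_k * relevance).sum() / min(k, total_relevant)

def map_at_k(rankings, totals, k):
    """Mean of AP@K across queries."""
    return float(np.mean([ap_at_k(r, k, t) for r, t in zip(rankings, totals)]))
```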
Object detection adds an extra dimension to AP computation: spatial localization. A predicted bounding box is not simply "correct" or "incorrect" but must be evaluated against ground-truth boxes using a spatial overlap criterion.
Intersection over Union (IoU) measures the spatial overlap between a predicted bounding box and a ground-truth bounding box:
IoU = Area of Intersection / Area of Union
An IoU of 1.0 indicates perfect overlap, and 0.0 indicates no overlap. A prediction counts as a true positive only if its IoU with a ground-truth box exceeds a predetermined threshold. The most common threshold is 0.5, meaning the overlap area must be at least half of the combined (union) area of the predicted and ground-truth boxes.
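A minimal sketch of IoU for axis-aligned boxes given in (x1, y1, x2, y2) format:

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2); a minimal sketch."""
    # Intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175, approximately 0.143
```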
The PASCAL Visual Object Classes (VOC) Challenge, described by Everingham et al. (2010), defined the evaluation protocol that became standard in object detection research.
Key characteristics of the PASCAL VOC evaluation:
| Feature | Specification |
|---|---|
| IoU threshold | 0.5 (fixed) |
| Number of classes | 20 |
| AP interpolation (2007-2009) | 11-point |
| AP interpolation (2010-2012) | All-point |
| Primary metric | mAP (mean AP across 20 classes) |
| Matching rule | Each ground-truth box matches at most one prediction |
The change from 11-point to all-point interpolation in 2010 was motivated by the desire to measure the exact area under the precision-recall curve rather than an approximation.
The COCO dataset, introduced by Lin et al. (2014), refined the evaluation protocol in several ways to provide a more thorough assessment of detector performance.
COCO's primary metric, denoted AP, averages over 10 IoU thresholds from 0.50 to 0.95 in steps of 0.05. This penalizes detections that are only roughly localized and rewards models with tighter bounding boxes. The full suite of COCO metrics is:
| Metric | IoU threshold | Area range | Max detections | Description |
|---|---|---|---|---|
| AP | 0.50:0.95 | All | 100 | Primary metric; averaged over 10 IoU thresholds and all classes |
| AP50 | 0.50 | All | 100 | AP at IoU 0.50 (equivalent to PASCAL VOC metric) |
| AP75 | 0.75 | All | 100 | AP at a strict IoU threshold |
| AP_small | 0.50:0.95 | < 32x32 pixels | 100 | AP for small objects |
| AP_medium | 0.50:0.95 | 32x32 to 96x96 pixels | 100 | AP for medium objects |
| AP_large | 0.50:0.95 | > 96x96 pixels | 100 | AP for large objects |
| AR1 | 0.50:0.95 | All | 1 | Average recall with 1 detection per image |
| AR10 | 0.50:0.95 | All | 10 | Average recall with 10 detections per image |
| AR100 | 0.50:0.95 | All | 100 | Average recall with 100 detections per image |
| AR_small | 0.50:0.95 | < 32x32 pixels | 100 | AR for small objects |
| AR_medium | 0.50:0.95 | 32x32 to 96x96 pixels | 100 | AR for medium objects |
| AR_large | 0.50:0.95 | > 96x96 pixels | 100 | AR for large objects |
The distinction between AP at different object sizes is valuable because detectors often struggle with small objects. A model might achieve high AP_large but very low AP_small, and a single mAP number would mask this discrepancy.
| Feature | PASCAL VOC | COCO |
|---|---|---|
| Number of object categories | 20 | 80 |
| IoU thresholds | 0.5 only | 0.50 to 0.95 (10 thresholds) |
| Interpolation method | 11-point (2007-2009), all-point (2010+) | 101-point interpolation |
| Size-specific metrics | No | Yes (small, medium, large) |
| Max detections per image | Unlimited | 1, 10, or 100 |
| Primary metric | mAP@0.5 | mAP@[0.5:0.95] |
| Difficulty level for models | Easier (single, lenient IoU) | Harder (averaged over stricter IoU values) |
Outside of object detection and information retrieval, average precision is used to evaluate binary classifiers and multi-label classification systems, particularly when the classes are imbalanced.
In this context, the classifier outputs a continuous score (probability estimate, confidence value, or decision function output) for each sample, and AP summarizes the precision-recall trade-off across all possible decision thresholds. This is particularly useful when the positive class is rare, because the precision-recall curve and AP are more informative than ROC AUC on highly imbalanced datasets. The ROC curve can appear overly optimistic when the negative class is very large, because the false positive rate denominator (TN + FP) is dominated by true negatives.
Scikit-learn implements this through sklearn.metrics.average_precision_score, which takes true labels and predicted scores as input:
```python
import numpy as np
from sklearn.metrics import average_precision_score

y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])

ap = average_precision_score(y_true, y_scores)
print(ap)  # 0.8333...
```
For multilabel classification, scikit-learn supports computing AP per class with different averaging strategies: macro (unweighted mean), weighted (weighted by class support), micro (global precision-recall), and per-sample averaging.
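A brief sketch of the multilabel usage; the labels and scores below are made up:

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Three samples, two labels (multilabel indicator format); scores are hypothetical
y_true = np.array([[1, 0], [0, 1], [1, 1]])
y_scores = np.array([[0.8, 0.3], [0.4, 0.7], [0.2, 0.9]])

print(average_precision_score(y_true, y_scores, average='macro'))    # unweighted mean over labels
print(average_precision_score(y_true, y_scores, average='micro'))    # pooled precision-recall
print(average_precision_score(y_true, y_scores, average='samples'))  # per-sample averaging
```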
Average precision belongs to a family of evaluation metrics, each capturing different aspects of ranking or classification quality.
| Metric | What it measures | Relationship to AP |
|---|---|---|
| Precision at K (P@K) | Fraction of relevant items in the top K results | Does not account for the ranking order within K |
| R-Precision | Precision at rank R, where R is the number of relevant items | Equivalent to both precision and recall at rank R |
| F1 score | Harmonic mean of precision and recall at a single threshold | Single operating point, whereas AP spans all thresholds |
| ROC AUC | Area under the ROC curve (TPR vs. FPR) | Uses false positive rate instead of precision; less sensitive to class imbalance |
| NDCG | Normalized discounted cumulative gain | Handles graded (multi-level) relevance; AP uses binary relevance |
| BLEU | N-gram overlap for text generation | Measures text similarity, not ranking quality |
| ROUGE | Recall-oriented n-gram overlap for summarization | Measures text overlap, not ranking quality |
| Average Recall (AR) | Mean of recall values at multiple IoU thresholds or detection limits | Complementary to AP in COCO evaluation |
Both AP (area under the precision-recall curve) and ROC AUC (area under the ROC curve) summarize a classifier's threshold-independent performance. The key difference lies in how they handle class imbalance. ROC AUC uses the false positive rate (FP / (FP + TN)), which is insensitive to the ratio of positive to negative samples because it is normalized by the number of negatives. When negatives vastly outnumber positives, even a large number of false positives produces a small FPR, making the ROC curve look favorable. AP uses precision (TP / (TP + FP)), which is directly affected by false positives regardless of the number of true negatives. For this reason, AP is generally preferred over ROC AUC when evaluating performance on imbalanced datasets where the positive class is of primary interest.
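The difference is easy to see on a synthetic, highly imbalanced dataset (a minimal sketch; exact values depend on the random seed and model):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, average_precision_score

# Highly imbalanced binary problem: roughly 1% positives
X, y = make_classification(n_samples=20000, n_features=20, weights=[0.99, 0.01], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
print("ROC AUC:", roc_auc_score(y_te, scores))                       # typically looks high
print("Average precision:", average_precision_score(y_te, scores))   # usually much lower
```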
Average precision has several properties that make it a widely adopted evaluation metric:
Threshold-free evaluation. AP evaluates performance across all possible decision thresholds, removing the need to choose a specific operating point. This allows fair comparison between models that might perform differently at different threshold settings.
Rank sensitivity. AP rewards systems that rank relevant items higher. Two systems might retrieve the same number of relevant items, but the one that places them at the top of the list will receive a higher AP score.
Single-number summary. AP compresses the full precision-recall curve into a single value, making it easy to compare systems across experiments and in leaderboard-style competitions.
Handles variable set sizes. Unlike metrics such as precision at K that require choosing a fixed cutoff, AP naturally handles situations where different queries have different numbers of relevant items.
Strong statistical properties. According to the Stanford Introduction to Information Retrieval textbook, MAP (the mean of AP across queries) has "good discrimination and stability," meaning it reliably distinguishes between better and worse systems.
Despite its widespread use, average precision has known weaknesses that practitioners should be aware of.
Not all precision-recall pairs matter equally. In real-world applications, there is usually a "sensible operating range" where both precision and recall must meet minimum thresholds. AP weights all parts of the precision-recall curve equally, including regions that may be irrelevant for a particular deployment scenario. Two models with identical AP values can have vastly different practical utility if one performs better in the operating range that matters.
Lack of consensus on calculation method. Researchers have identified at least five different definitions of AP in the literature. Results computed with one method are not directly comparable to results computed with another. Papers that report AP or mAP without specifying the exact calculation method create confusion.
Arbitrary extrapolation. In some implementations, precision values at recall levels beyond the highest achieved recall are set to zero. This can penalize detectors that legitimately achieve high precision at lower recall levels.
Binary relevance assumption. AP treats relevance as binary (relevant or not relevant). It cannot distinguish between partially relevant and highly relevant items. Metrics like Normalized Discounted Cumulative Gain (NDCG) handle graded relevance.
Masking component-level performance. In object detection, mAP aggregates performance across all classes. A detector that performs well on 18 of 20 classes but fails completely on 2 classes might still report a respectable mAP. Similarly, mAP does not distinguish whether errors stem from poor localization, incorrect classification, or missed detections.
IoU threshold dependency. In object detection, the AP value depends heavily on the chosen IoU threshold. A detector might achieve AP = 85% at IoU 0.5 but only AP = 40% at IoU 0.75. The COCO evaluation protocol addresses this by averaging over multiple IoU thresholds, but the choice of threshold range itself remains a design decision.
Sensitivity to dataset composition. AP can be influenced by the ratio of easy to hard examples in the dataset. A test set dominated by easy instances will inflate AP values, potentially obscuring poor performance on difficult cases.
Always specify the calculation method. When reporting AP or mAP, state whether 11-point interpolation, all-point interpolation, or non-interpolated AP was used. State the IoU threshold if applicable.
Report per-class AP alongside mAP. The per-class breakdown reveals which categories the model handles well and which it struggles with.
Use size-specific metrics. For object detection, COCO-style AP_small, AP_medium, and AP_large provide more actionable insights than a single mAP number.
Complement AP with other metrics. Consider reporting precision at specific recall levels, the full precision-recall curve, or confusion matrices to give readers a more complete picture of performance.
Consider application-specific recall ranges. When the deployment scenario has known precision or recall requirements, restrict the AP calculation to the relevant portion of the curve.
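One way to do this is to integrate the precision-recall curve only over the recall range of interest. The sketch below is one possible convention (the normalization by the range width is a choice made here for illustration, not a standard):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def ap_in_recall_range(y_true, y_scores, r_min, r_max):
    """Area under the PR curve restricted to recall in [r_min, r_max], normalized by the range width."""
    precision, recall, _ = precision_recall_curve(y_true, y_scores)
    # precision_recall_curve returns recall in decreasing order; sort by increasing recall
    order = np.argsort(recall)
    recall, precision = recall[order], precision[order]
    area = 0.0
    for i in range(1, len(recall)):
        lo, hi = max(recall[i - 1], r_min), min(recall[i], r_max)
        if hi > lo:
            # Step-wise sum using the precision at the higher-recall end of each step
            area += (hi - lo) * precision[i]
    return area / (r_max - r_min)
```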
Average precision is implemented in all major machine learning libraries.
| Library | Function / Class | Notes |
|---|---|---|
| scikit-learn | sklearn.metrics.average_precision_score | Non-interpolated AP; supports binary and multilabel |
| scikit-learn | sklearn.metrics.precision_recall_curve | Returns raw precision-recall pairs for custom AP computation |
| COCO API (pycocotools) | COCOeval | All-point interpolation with 101 recall thresholds |
| PyTorch (TorchMetrics) | torchmetrics.detection.MeanAveragePrecision | Supports VOC and COCO evaluation protocols |
| TensorFlow Object Detection API | object_detection.metrics | COCO-compatible evaluation |
| trec_eval | Command-line tool | Standard TREC evaluation with multiple AP variants |
| Detectron2 | COCOEvaluator | COCO-style evaluation for detection and segmentation |
Example: computing AP for a binary classifier with scikit-learn and plotting the precision-recall curve:

```python
from sklearn.metrics import average_precision_score, precision_recall_curve
import matplotlib.pyplot as plt

# Binary classification example
y_true = [0, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_scores = [0.1, 0.35, 0.4, 0.8, 0.25, 0.65, 0.3, 0.15, 0.55, 0.9]

ap = average_precision_score(y_true, y_scores)
print(f"Average Precision: {ap:.4f}")

# Plot precision-recall curve
precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
plt.plot(recall, precision, marker='.')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title(f'Precision-Recall Curve (AP = {ap:.4f})')
plt.show()
```
Example: running the standard COCO-style evaluation with pycocotools:

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Load ground truth and detections
coco_gt = COCO('annotations/instances_val2017.json')
coco_dt = coco_gt.loadRes('detections_val2017.json')

# Run COCO evaluation
coco_eval = COCOeval(coco_gt, coco_dt, 'bbox')
coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()
# Prints AP@[0.50:0.95], AP@0.50, AP@0.75, and size-specific metrics
```
Imagine you have a big basket of balls. Some are red (the ones you want to find) and most are blue (the ones you do not care about). You reach in and pull out balls one at a time, and you line them up in a row.
Every time you pull out a red ball, you check: "Out of all the balls I have pulled out so far, how many are red?" That fraction is your precision at that point.
Average precision looks at your precision every time you pull out a red ball and then takes the average of those numbers. If you pull out all the red balls first before any blue balls, your average precision is perfect (1.0). But if you pull out a bunch of blue balls before finding any red ones, your precision is low when you finally find a red ball, and your average precision goes down.
So average precision measures two things at once: did you find the red balls (recall), and did you avoid picking up blue balls along the way (precision)?