See also: precision, recall, precision-recall curve, average precision, ROC AUC, F1 score, imbalanced dataset, binary classification, classifier, threshold
PR AUC, short for area under the precision-recall curve, is a scalar evaluation metric for binary classifiers. It summarizes the trade-off between precision and recall across every possible decision threshold into a single number between 0 and 1. PR AUC is particularly useful for imbalanced datasets where the positive class is rare, because it focuses on how well a classifier handles the minority class and ignores the (typically large) pool of true negatives.
In modern practice the metric is almost always computed as average precision (AP), a finite-sum approximation that does not interpolate the curve. The trapezoidal rule, which does interpolate, is also seen in older literature but tends to give optimistic estimates and has been discouraged since at least Davis and Goadrich (2006) and Boyd, Eng, and Page (2013).
PR AUC is widely used in fraud detection, medical diagnosis, anomaly detection, information retrieval, and object detection. In object detection in particular, the metric appears under the name mean average precision (mAP) and forms the headline number for benchmarks such as PASCAL VOC and Microsoft COCO.
A binary classifier typically outputs a continuous score for each example: a probability, a logit, or a generic decision value. To turn that score into a hard 0-or-1 prediction, you pick a threshold and label every example above the threshold as positive. Different thresholds yield different confusion matrices, and therefore different precision and recall values.
The precision-recall curve plots precision on the y-axis against recall on the x-axis as the threshold sweeps from high to low. Each point on the curve corresponds to one threshold. As the threshold drops, more examples are predicted positive, recall generally rises, and precision usually falls.
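Concretely, sweeping a threshold over a handful of made-up scores shows how the hard predictions change:

```python
import numpy as np

# Hypothetical scores from a classifier, one per example.
y_score = np.array([0.95, 0.80, 0.60, 0.40, 0.10])

for t in (0.9, 0.5, 0.2):                 # sweep the decision threshold
    y_pred = (y_score >= t).astype(int)   # everything at or above t is positive
    print(f"threshold {t}: predictions {y_pred}")
```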
Between them, precision and recall use three of the four confusion-matrix cells:
| Quantity | Formula | What it measures |
|---|---|---|
| Precision | TP / (TP + FP) | Fraction of predicted positives that are correct |
| Recall (sensitivity, TPR) | TP / (TP + FN) | Fraction of actual positives that are detected |
| F1 score | 2 * Precision * Recall / (Precision + Recall) | Harmonic mean of precision and recall |
Notice that true negatives (TN) appear in none of these formulas. That is the central reason PR-based metrics behave so differently from ROC-based metrics on imbalanced data.
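These formulas are one-liners. A minimal sketch (the helper name pr_metrics is ours), using counts from the worked example that follows:

```python
def pr_metrics(tp, fp, fn):
    """Precision, recall, and F1 from confusion-matrix counts.

    TN is deliberately not a parameter: none of these metrics use it.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts from the top-5 row of the worked example below: TP=4, FP=1, FN=1.
print(pr_metrics(4, 1, 1))  # (0.8, 0.8, 0.8)
```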
Consider a small test set with 10 examples. The classifier produces a probability score for each, and the true label is known.
| Example | Score | True label |
|---|---|---|
| A | 0.95 | 1 |
| B | 0.90 | 1 |
| C | 0.80 | 0 |
| D | 0.70 | 1 |
| E | 0.60 | 1 |
| F | 0.55 | 0 |
| G | 0.40 | 1 |
| H | 0.30 | 0 |
| I | 0.20 | 0 |
| J | 0.10 | 0 |
There are 5 positives and 5 negatives. Sorting by score (highest first) and walking down the list, we recompute precision and recall after each example is added to the predicted-positive set:
| Top-k | Last added | TP | FP | FN | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | A | 1 | 0 | 4 | 1.000 | 0.20 |
| 2 | B | 2 | 0 | 3 | 1.000 | 0.40 |
| 3 | C | 2 | 1 | 3 | 0.667 | 0.40 |
| 4 | D | 3 | 1 | 2 | 0.750 | 0.60 |
| 5 | E | 4 | 1 | 1 | 0.800 | 0.80 |
| 6 | F | 4 | 2 | 1 | 0.667 | 0.80 |
| 7 | G | 5 | 2 | 0 | 0.714 | 1.00 |
| 8 | H | 5 | 3 | 0 | 0.625 | 1.00 |
| 9 | I | 5 | 4 | 0 | 0.556 | 1.00 |
| 10 | J | 5 | 5 | 0 | 0.500 | 1.00 |
Using the average precision formula (see below), AP is the sum, over the rows where recall increased (rows where a new positive was added), of the precision at that row weighted by the recall increment:
AP = (0.20 - 0.00) * 1.000 + (0.40 - 0.20) * 1.000 + (0.60 - 0.40) * 0.750 + (0.80 - 0.60) * 0.800 + (1.00 - 0.80) * 0.714
AP = 0.200 + 0.200 + 0.150 + 0.160 + 0.143 = 0.853
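A quick cross-check with scikit-learn's average_precision_score reproduces the hand computation:

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Examples A through J from the table above, in the same order.
y_true = np.array([1, 1, 0, 1, 1, 0, 1, 0, 0, 0])
y_score = np.array([0.95, 0.90, 0.80, 0.70, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10])

print(average_precision_score(y_true, y_score))  # 0.8528... ~ 0.853
```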
A random classifier on the same data would score around 0.50, since the positive class prevalence is 5 / 10.
There is more than one way to turn a precision-recall curve into a single area, and the choice matters. The two main approaches are linear (trapezoidal) interpolation and average precision.
The trapezoidal rule treats consecutive precision-recall points as the corners of a trapezoid and sums the areas. It is the default for ROC AUC and is intuitive, but for PR curves it can be misleadingly optimistic. The PR curve is not linear between adjacent operating points; in fact, the correct interpolation between two points (R1, P1) and (R2, P2) follows a non-linear shape derived from the underlying confusion matrix counts. Linear interpolation can sit far above this true curve, inflating the area.
Davis and Goadrich (2006) gave a worked example showing that linear interpolation between PR points can produce an entirely fictitious bulge in the curve. They derived the correct interpolation formula and showed it is not a straight line.
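Their interpolation can be written directly in terms of the counts. A sketch (the function name is ours; it assumes false positives accrue at a constant rate between the two operating points, which is the Davis and Goadrich assumption):

```python
import numpy as np

def pr_interpolation(tp_a, fp_a, tp_b, fp_b, n_pos, n_points=5):
    """Non-linear PR interpolation between two operating points.

    Assumes false positives accrue at a constant rate as true positives
    are added between the points, per Davis and Goadrich (2006).
    """
    slope = (fp_b - fp_a) / (tp_b - tp_a)        # FPs gained per extra TP
    x = np.linspace(0.0, tp_b - tp_a, n_points)  # extra TPs along the segment
    tp, fp = tp_a + x, fp_a + slope * x
    return tp / n_pos, tp / (tp + fp)            # (recall, precision)

# Between top-2 (TP=2, FP=0) and top-7 (TP=5, FP=2) of the worked example:
recall, precision = pr_interpolation(2, 0, 5, 2, n_pos=5)
print(precision)  # [1.0, 0.846, 0.778, 0.739, 0.714]; the straight line
                  # from 1.0 to 0.714 sits above every interior value
```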
Average precision is the recommended way to compute PR AUC and is what scikit-learn, COCO, and most modern libraries return when asked for area under the precision-recall curve. AP is defined as a finite weighted mean rather than an integral:
AP = sum over n of (R_n - R_{n-1}) * P_n
where P_n and R_n are the precision and recall at the n-th threshold and the sum runs over all distinct thresholds in the sorted score list. Each term weights the precision at threshold n by the gain in recall achieved by lowering the threshold to that point. Because there is no interpolation, the metric is conservative compared to the trapezoidal rule.
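Assuming no tied scores, the formula reduces to a few lines of NumPy. A sketch, not the library implementation:

```python
import numpy as np

def average_precision(y_true, y_score):
    """AP = sum over n of (R_n - R_{n-1}) * P_n, assuming no tied scores."""
    order = np.argsort(-np.asarray(y_score))    # rank by descending score
    y = np.asarray(y_true)[order]
    tp = np.cumsum(y)                           # true positives in the top-k
    precision = tp / np.arange(1, len(y) + 1)   # precision after each example
    # Recall rises by 1/n_pos exactly when a positive enters the top-k, so
    # the recall increments pick out the precisions at the positive ranks.
    return precision[y == 1].sum() / y.sum()

print(average_precision([1, 1, 0, 1, 1, 0, 1, 0, 0, 0],
                        [0.95, 0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1]))  # 0.853
```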
Boyd, Eng, and Page (2013) studied several estimators of PR AUC, including the upper trapezoid, lower trapezoid, average precision, and an interpolated median. They concluded that average precision and the lower trapezoid are unbiased in the limit and have lower bias for small samples than the upper trapezoid. Average precision has since become the de facto standard.
| Estimator | How it works | Bias | Common use |
|---|---|---|---|
| Average precision (AP) | Weighted mean of precisions, no interpolation | Low; unbiased in the limit | scikit-learn, COCO, modern ML |
| Lower trapezoid | Trapezoidal rule using lower precision at each step | Low; unbiased in the limit | Less common, recommended by Boyd et al. |
| Upper trapezoid | Standard trapezoidal rule between PR points | Optimistic, biased high | Older sklearn (pre-0.19), legacy code |
| 11-point interpolation | Average of max precision at recall = 0, 0.1, ..., 1.0 | Coarse, simple | PASCAL VOC 2007 to 2009 |
| 101-point interpolation | Average of max precision at 101 recall points | Smooth approximation | Microsoft COCO |
A random classifier (one that produces scores independent of the true label) traces out a PR curve that is approximately a horizontal line at y = pi, where pi is the positive class prevalence in the test set. The PR AUC of such a classifier is therefore equal to pi.
This is fundamentally different from ROC AUC, where the random baseline is always 0.5 regardless of class balance. On a dataset with 1% positives, a random model has PR AUC around 0.01 and ROC AUC around 0.5. A model with PR AUC = 0.10 looks unimpressive in absolute terms, but it is actually 10 times better than chance.
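Both baselines are easy to verify by simulation; the exact values wobble with the random draw:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
n = 100_000
y_true = (rng.random(n) < 0.01).astype(int)  # ~1% positives
y_score = rng.random(n)                      # scores independent of the labels

print(average_precision_score(y_true, y_score))  # ~0.01, the prevalence
print(roc_auc_score(y_true, y_score))            # ~0.5, regardless of balance
```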
A practical consequence is that PR AUC values are not directly comparable across datasets with different prevalence. A PR AUC of 0.40 on a dataset with 5% positives represents a much stronger lift over the baseline than a PR AUC of 0.40 on a dataset with 35% positives. When reporting PR AUC, also report the positive prevalence so readers can interpret the number.
| Score | Meaning |
|---|---|
| 1.0 | Perfect classifier; ranks every positive above every negative |
| Above pi | Better than random |
| Equal to pi | No skill (random ranking) |
| Below pi | Worse than random; the model is somehow anti-correlated with the label |
ROC AUC integrates the curve of true positive rate (recall, y-axis) against false positive rate (FPR = FP / (FP + TN), x-axis). Both metrics summarize a curve into a single number, but they emphasize very different parts of the confusion matrix.
Davis and Goadrich (2006) proved a one-to-one correspondence between ROC space and PR space: a curve that dominates another in ROC space also dominates in PR space, and vice versa. Despite this equivalence at the curve level, the summary statistics behave differently because they integrate different quantities.
Saito and Rehmsmeier (2015) demonstrated through simulation that the PR plot is more informative than ROC for highly imbalanced classification. Their main observation: when the negative class is overwhelming, the FPR denominator (FP + TN) stays large no matter how many false positives the model produces. The FPR therefore stays small, the ROC curve hugs the top-left corner, and ROC AUC looks great even when most predicted positives are wrong. PR AUC, which uses precision rather than FPR, exposes this failure mode directly.
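The failure mode is easy to reproduce with synthetic scores; the exact numbers depend on the random draw:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
n_pos, n_neg = 100, 10_000                       # 1% positives
y_true = np.r_[np.ones(n_pos), np.zeros(n_neg)]
y_score = np.r_[rng.normal(1.0, 1.0, n_pos),     # positives score modestly higher
                rng.normal(0.0, 1.0, n_neg)]

print(roc_auc_score(y_true, y_score))            # ~0.76: looks respectable
print(average_precision_score(y_true, y_score))  # far lower: most predicted
                                                 # positives are still wrong
```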
| Property | PR AUC | ROC AUC |
|---|---|---|
| Axes | Recall (x), Precision (y) | FPR (x), TPR (y) |
| Uses true negatives | No | Yes (in FPR denominator) |
| Random baseline | Positive class prevalence | 0.5 (always) |
| Best on imbalanced data | Yes | Often misleadingly high |
| Sensitive to FP when negatives are common | Strong | Weak |
| Standard estimator | Average precision | Trapezoidal rule (Mann-Whitney U) |
| Comparable across datasets | No (baseline shifts) | Yes |
| Common in object detection | Yes (mAP) | No |
A simple rule of thumb: when the cost of false positives matters and the positive class is rare, prefer PR AUC. When the dataset is roughly balanced and you care about ranking quality regardless of class skew, ROC AUC is fine.
Every point on the precision-recall curve corresponds to a single threshold and therefore to a single F1 score. F1 is the harmonic mean of precision and recall, so curves that bow toward the top-right corner contain points with high F1.
A common workflow uses the two together: compare and select models with the threshold-free PR AUC, then choose the deployment threshold for the winning model by maximizing F1 (or whichever operating-point metric the application demands) along its curve.
Iso-F1 contour lines on a PR plot make it easy to read off the maximum-F1 point. Each contour is the set of (precision, recall) pairs sharing the same F1 value, and the highest contour that touches the curve gives the best achievable F1.
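Reading the maximum-F1 operating point off the curve takes a few lines; a sketch reusing the worked-example scores from earlier:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([1, 1, 0, 1, 1, 0, 1, 0, 0, 0])
y_score = np.array([0.95, 0.90, 0.80, 0.70, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
# The curve arrays carry one more entry than thresholds (a final R=0, P=1
# point); drop it so every F1 value lines up with an actual threshold.
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best = f1.argmax()
print(f"max F1 = {f1[best]:.3f} at threshold {thresholds[best]:.3f}")  # 0.833 at 0.40
```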
In object detection, the headline metric on most leaderboards is mean average precision (mAP), which averages PR-based AP across object categories. The metric was popularized by the PASCAL VOC and COCO benchmarks.
The PASCAL Visual Object Classes Challenge (Everingham et al., 2010) used a fixed Intersection over Union (IoU) threshold of 0.5 to decide which detections counted as true positives. Detections were ranked by confidence, and AP was computed using 11-point interpolation: the average, over r = 0, 0.1, 0.2, ..., 1.0, of the maximum precision achieved at any recall of at least r. Starting in 2010, the VOC challenge switched to all-point interpolation, which integrates the area under the monotonically decreasing envelope of the curve.
The Common Objects in Context (COCO) benchmark refined the protocol. COCO computes AP using 101-point interpolation across recall values 0.00, 0.01, 0.02, ..., 1.00 and averages over 10 IoU thresholds from 0.50 to 0.95 in steps of 0.05. The primary metric, often written as AP@[0.5:0.95] or simply AP in COCO papers, rewards detectors with tight bounding boxes more than the lenient single-threshold AP@0.5 from PASCAL VOC.
COCO also reports several breakdowns: AP at IoU 0.50, AP at IoU 0.75, and AP for small (area below 32 x 32 pixels), medium (between 32 x 32 and 96 x 96), and large (above 96 x 96) objects. Together these give a much more detailed picture than a single mAP figure.
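The recall-interpolation step itself is short. A simplified sketch (the function name is ours; real COCO evaluation first matches detections to ground truth by IoU before any curve exists):

```python
import numpy as np

def interpolated_ap(precision, recall, n_points=101):
    """Mean of interpolated precision over evenly spaced recall points.

    n_points=101 mimics COCO; n_points=11 mimics PASCAL VOC 2007.
    Interpolated precision at recall r is the maximum precision achieved
    at any recall >= r, i.e. the monotone envelope of the raw curve.
    """
    precision, recall = np.asarray(precision), np.asarray(recall)
    points = np.linspace(0.0, 1.0, n_points)
    envelope = [precision[recall >= r].max() if (recall >= r).any() else 0.0
                for r in points]
    return float(np.mean(envelope))
```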
| Benchmark | IoU threshold | Recall sample points | Primary metric |
|---|---|---|---|
| PASCAL VOC 2007-2009 | 0.50 | 11 (0, 0.1, ..., 1.0) | mAP@0.5 |
| PASCAL VOC 2010-2012 | 0.50 | All curve transitions | mAP@0.5 |
| Microsoft COCO | 0.50 to 0.95 in 0.05 steps | 101 (0.00, 0.01, ..., 1.00) | AP@[0.5:0.95] |
| Open Images | Varies by class | All curve transitions | mAP |
| LVIS | 0.50 to 0.95 in 0.05 steps | 101 | AP across rare/common/frequent classes |
PR AUC is a useful summary, but it is not a complete description of model behavior, and it has several known weaknesses.
Insensitivity to true negatives. Because TN never enters the formulas, PR AUC says nothing about specificity: it cannot reward a model for confidently rejecting negatives, and adding easy negatives that score below every positive leaves the score unchanged. This is exactly the property that makes it useful on imbalanced data, but it means a PR AUC figure alone tells you nothing about how the negative class is handled.
Not comparable across datasets. The random baseline equals the positive prevalence, so a PR AUC of 0.30 on a 2% positive dataset means something very different from a PR AUC of 0.30 on a 30% positive dataset. Compare the lift over baseline rather than raw values.
Threshold-free interpretation can hide deployment realities. Most production systems run at one threshold. A high PR AUC does not guarantee that the chosen operating point has acceptable precision or recall, and two models with identical PR AUCs can perform very differently at a fixed threshold.
All parts of the curve are weighted equally. AP integrates over the full recall range. In many real applications only a sliver of the curve matters: a fraud team may only care about precision in the top 1% of scored transactions. Restricted-range metrics like AP at recall above 0.5, or precision at fixed recall, capture this better.
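For example, the best precision achievable at a recall of at least 0.5 can be read straight off the curve; a sketch reusing the worked-example scores:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Worked-example scores from earlier in the article.
y_true = np.array([1, 1, 0, 1, 1, 0, 1, 0, 0, 0])
y_score = np.array([0.95, 0.90, 0.80, 0.70, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10])

precision, recall, _ = precision_recall_curve(y_true, y_score)
feasible = recall >= 0.5                 # operating points meeting the recall floor
print(precision[feasible].max())         # best precision with recall >= 0.5: 0.8
```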
Sensitivity to small changes when there are few positives. With only a handful of positive examples, a single misclassified instance can shift PR AUC noticeably. The Boyd, Eng, and Page (2013) paper provides confidence-interval procedures for this case.
Estimator confusion. Reported PR AUC values can come from average precision, trapezoidal interpolation, 11-point interpolation, 101-point interpolation, or other approximations. These are not interchangeable. Always state which estimator was used.
The scikit-learn library provides a small set of functions for computing and visualizing PR-based metrics. As of version 0.19, average_precision_score returns a non-interpolated AP, matching the formula above.
| Function or class | Purpose |
|---|---|
| sklearn.metrics.average_precision_score(y_true, y_score) | Returns AP as a scalar; the recommended PR AUC implementation |
| sklearn.metrics.precision_recall_curve(y_true, y_score) | Returns arrays of precision, recall, and thresholds for plotting or custom AUC computation |
| sklearn.metrics.PrecisionRecallDisplay.from_estimator(...) | Plots the PR curve directly from a fitted estimator |
| sklearn.metrics.PrecisionRecallDisplay.from_predictions(...) | Plots the PR curve from precomputed scores |
| sklearn.metrics.auc(recall, precision) | Generic trapezoidal AUC; not recommended for PR curves |
```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import (
    average_precision_score,
    precision_recall_curve,
    PrecisionRecallDisplay,
)

# Ground-truth labels and classifier scores for ten examples.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0, 1, 1])
y_score = np.array([0.1, 0.35, 0.4, 0.8, 0.25, 0.65, 0.3, 0.15, 0.55, 0.9])

# Scalar PR AUC via non-interpolated average precision.
ap = average_precision_score(y_true, y_score)
print(f"PR AUC (average precision): {ap:.4f}")

# Full curve for plotting or custom analysis.
precision, recall, thresholds = precision_recall_curve(y_true, y_score)
disp = PrecisionRecallDisplay(precision=precision, recall=recall, average_precision=ap)
disp.plot()
plt.show()
```
For multilabel classification, average_precision_score accepts a 2D array of scores and an average argument that controls whether per-class AP values are combined with macro, micro, weighted, or sample averaging. The default is macro averaging.
For multiclass classification, scikit-learn does not directly compute a single multiclass PR AUC. The standard practice is one-versus-rest: compute AP per class and then average, or use the same average_precision_score with binarized labels.
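A minimal multilabel sketch with three classes and made-up scores:

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Rows are examples, columns are classes; an example can carry several labels.
Y_true = np.array([[1, 0, 0],
                   [0, 1, 1],
                   [1, 1, 0],
                   [0, 0, 1]])
Y_score = np.array([[0.9, 0.2, 0.1],
                    [0.1, 0.8, 0.7],
                    [0.7, 0.6, 0.2],
                    [0.2, 0.1, 0.9]])

print(average_precision_score(Y_true, Y_score, average="macro"))  # mean of per-class APs
print(average_precision_score(Y_true, Y_score, average="micro"))  # pool all decisions first
```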
The auc function in scikit-learn applies the trapezoidal rule to any pair of x and y arrays. It is sometimes used with precision_recall_curve outputs, but linear interpolation between PR points is not justified (see above), and the result will generally differ from the AP returned by average_precision_score. For PR curves, prefer average_precision_score.
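On the worked example from earlier, the two estimators visibly disagree (a quick check; auc handles the decreasing recall axis itself):

```python
import numpy as np
from sklearn.metrics import auc, average_precision_score, precision_recall_curve

y_true = np.array([1, 1, 0, 1, 1, 0, 1, 0, 0, 0])
y_score = np.array([0.95, 0.90, 0.80, 0.70, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10])

precision, recall, _ = precision_recall_curve(y_true, y_score)
print(auc(recall, precision))                    # ~0.835, trapezoidal
print(average_precision_score(y_true, y_score))  # ~0.853, average precision
```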
| Tool | Function or class | Notes |
|---|---|---|
| TorchMetrics | torchmetrics.classification.AveragePrecision | Non-interpolated AP, mirrors sklearn |
| TorchMetrics | torchmetrics.detection.MeanAveragePrecision | COCO-style mAP for object detection |
| TensorFlow | tf.keras.metrics.AUC(curve='PR') | Trapezoidal AUC over the PR curve; can also use Riemann sums |
| pycocotools | COCOeval | Standard 101-point COCO evaluation |
| Detectron2 | COCOEvaluator | Wraps pycocotools for PyTorch detection models |
| MMDetection | eval_map | PASCAL VOC and COCO style evaluation |
Imagine you are playing a game where you have to find special rocks among a pile of normal rocks. You have a tool that helps you identify the special rocks, but it is not always correct.
The PR AUC is a number that tells you how good your tool is at finding the special rocks without accidentally picking up too many normal rocks.
It is based on two things: how many of the rocks your tool calls special really are special (precision), and how many of the special rocks your tool actually finds out of all the special rocks in the pile (recall). If the PR AUC is close to 1, your tool is great at finding the special rocks without making many mistakes. If it is close to the share of special rocks in the pile, your tool is no better than guessing.
The trick with PR AUC is that it does not care how many normal rocks your tool correctly leaves alone. It only cares about how well it handles the special ones. That is exactly why it is the right tool for the job when special rocks are very rare, like one special rock in every hundred.