See also: precision, recall, precision-recall curve, average precision, ROC AUC, F1 score, imbalanced dataset, binary classification, classifier, threshold
PR AUC, short for area under the precision-recall curve, is a scalar evaluation metric for binary classifiers. It summarizes the trade-off between precision and recall across every possible decision threshold into a single number between 0 and 1. PR AUC is particularly useful for imbalanced datasets where the positive class is rare, because it focuses on how well a classifier handles the minority class and ignores the (typically large) pool of true negatives.
In modern practice the metric is almost always computed as average precision (AP), a finite-sum approximation that does not interpolate the curve. The trapezoidal rule, which does interpolate, is also seen in older literature but tends to give optimistic estimates and has been discouraged since at least Davis and Goadrich (2006) and Boyd, Eng, and Page (2013).
PR AUC is widely used in fraud detection, medical diagnosis, anomaly detection, information retrieval, and object detection. In object detection in particular, the metric appears under the name mean average precision (mAP) and forms the headline number for benchmarks such as PASCAL VOC and Microsoft COCO.
A binary classifier typically outputs a continuous score for each example: a probability, a logit, or a generic decision value. To turn that score into a hard 0-or-1 prediction, you pick a threshold and label every example above the threshold as positive. Different thresholds yield different confusion matrices, and therefore different precision and recall values.
The precision-recall curve plots precision on the y-axis against recall on the x-axis as the threshold sweeps from high to low. Each point on the curve corresponds to one threshold. As the threshold drops, more examples are predicted positive, recall generally rises, and precision usually falls.
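Concretely, sweeping a threshold over a handful of made-up scores shows how the hard predictions change:

```python
import numpy as np

# Hypothetical scores from a classifier, one per example.
y_score = np.array([0.95, 0.80, 0.60, 0.40, 0.10])

for t in (0.9, 0.5, 0.2):                 # sweep the decision threshold
    y_pred = (y_score >= t).astype(int)   # everything at or above t is positive
    print(f"threshold {t}: predictions {y_pred}")
```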
Between them, precision and recall use three of the four confusion-matrix cells:
| Quantity | Formula | What it measures |
|---|---|---|
| Precision | TP / (TP + FP) | Fraction of predicted positives that are correct |
| Recall (sensitivity, TPR) | TP / (TP + FN) | Fraction of actual positives that are detected |
| F1 score | 2 * Precision * Recall / (Precision + Recall) | Harmonic mean of precision and recall |
Notice that true negatives (TN) appear in none of these formulas. That is the central reason PR-based metrics behave so differently from ROC-based metrics on imbalanced data.
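These formulas are one-liners. A minimal sketch (the helper name pr_metrics is ours), using counts from the worked example that follows:

```python
def pr_metrics(tp, fp, fn):
    """Precision, recall, and F1 from confusion-matrix counts.

    TN is deliberately not a parameter: none of these metrics use it.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts from the top-5 row of the worked example below: TP=4, FP=1, FN=1.
print(pr_metrics(4, 1, 1))  # (0.8, 0.8, 0.8)
```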
Consider a small test set with 10 examples. The classifier produces a probability score for each, and the true label is known.
| Example | Score | True label |
|---|---|---|
| A | 0.95 | 1 |
| B | 0.90 | 1 |
| C | 0.80 | 0 |
| D | 0.70 | 1 |
| E | 0.60 | 1 |
| F | 0.55 | 0 |
| G | 0.40 | 1 |
| H | 0.30 | 0 |
| I | 0.20 | 0 |
| J | 0.10 | 0 |
There are 5 positives and 5 negatives. Sorting by score (highest first) and walking down the list, we recompute precision and recall after each example is added to the predicted-positive set:
| Top-k | Last added | TP | FP | FN | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | A | 1 | 0 | 4 | 1.000 | 0.20 |
| 2 | B | 2 | 0 | 3 | 1.000 | 0.40 |
| 3 | C | 2 | 1 | 3 | 0.667 | 0.40 |
| 4 | D | 3 | 1 | 2 | 0.750 | 0.60 |
| 5 | E | 4 | 1 | 1 | 0.800 | 0.80 |
| 6 | F | 4 | 2 | 1 | 0.667 | 0.80 |
| 7 | G | 5 | 2 | 0 | 0.714 | 1.00 |
| 8 | H | 5 | 3 | 0 | 0.625 | 1.00 |
| 9 | I | 5 | 4 | 0 | 0.556 | 1.00 |
| 10 | J | 5 | 5 | 0 | 0.500 | 1.00 |
Using the average precision formula (see below), AP is the sum, over the rows where recall increased (rows where a new positive was added), of the precision at that row weighted by the recall increment:
AP = (0.20 - 0.00) * 1.000 + (0.40 - 0.20) * 1.000 + (0.60 - 0.40) * 0.750 + (0.80 - 0.60) * 0.800 + (1.00 - 0.80) * 0.714
AP = 0.200 + 0.200 + 0.150 + 0.160 + 0.143 = 0.853
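A quick cross-check with scikit-learn's average_precision_score reproduces the hand computation:

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Examples A through J from the table above, in the same order.
y_true = np.array([1, 1, 0, 1, 1, 0, 1, 0, 0, 0])
y_score = np.array([0.95, 0.90, 0.80, 0.70, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10])

print(average_precision_score(y_true, y_score))  # 0.8528... ~ 0.853
```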
A random classifier on the same data would score around 0.50, since the positive class prevalence is 5 / 10.
There is more than one way to turn a precision-recall curve into a single area, and the choice matters. The two main approaches are linear (trapezoidal) interpolation and average precision.
The trapezoidal rule treats consecutive precision-recall points as the corners of a trapezoid and sums the areas. It is the default for ROC AUC and is intuitive, but for PR curves it can be misleadingly optimistic. The PR curve is not linear between adjacent operating points; in fact, the correct interpolation between two points (R1, P1) and (R2, P2) follows a non-linear shape derived from the underlying confusion matrix counts. Linear interpolation can sit far above this true curve, inflating the area.
Davis and Goadrich (2006) gave a worked example showing that linear interpolation between PR points can produce an entirely fictitious bulge in the curve. They derived the correct interpolation formula and showed it is not a straight line.
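Their interpolation can be written directly in terms of the counts. A sketch (the function name is ours; it assumes false positives accrue at a constant rate between the two operating points, which is the Davis and Goadrich assumption):

```python
import numpy as np

def pr_interpolation(tp_a, fp_a, tp_b, fp_b, n_pos, n_points=5):
    """Non-linear PR interpolation between two operating points.

    Assumes false positives accrue at a constant rate as true positives
    are added between the points, per Davis and Goadrich (2006).
    """
    slope = (fp_b - fp_a) / (tp_b - tp_a)        # FPs gained per extra TP
    x = np.linspace(0.0, tp_b - tp_a, n_points)  # extra TPs along the segment
    tp, fp = tp_a + x, fp_a + slope * x
    return tp / n_pos, tp / (tp + fp)            # (recall, precision)

# Between top-2 (TP=2, FP=0) and top-7 (TP=5, FP=2) of the worked example:
recall, precision = pr_interpolation(2, 0, 5, 2, n_pos=5)
print(precision)  # [1.0, 0.846, 0.778, 0.739, 0.714]; the straight line
                  # from 1.0 to 0.714 sits above every interior value
```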
Average precision is the recommended way to compute PR AUC and is what scikit-learn, COCO, and most modern libraries return when asked for area under the precision-recall curve. AP is defined as a finite weighted mean rather than an integral:
AP = sum over n of (R_n - R_{n-1}) * P_n
where P_n and R_n are the precision and recall at the n-th threshold and the sum runs over all distinct thresholds in the sorted score list. Each term weights the precision at threshold n by the gain in recall achieved by lowering the threshold to that point. Because there is no interpolation, the metric is conservative compared to the trapezoidal rule.
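Assuming no tied scores, the formula reduces to a few lines of NumPy. A sketch, not the library implementation:

```python
import numpy as np

def average_precision(y_true, y_score):
    """AP = sum over n of (R_n - R_{n-1}) * P_n, assuming no tied scores."""
    order = np.argsort(-np.asarray(y_score))    # rank by descending score
    y = np.asarray(y_true)[order]
    tp = np.cumsum(y)                           # true positives in the top-k
    precision = tp / np.arange(1, len(y) + 1)   # precision after each example
    # Recall rises by 1/n_pos exactly when a positive enters the top-k, so
    # the recall increments pick out the precisions at the positive ranks.
    return precision[y == 1].sum() / y.sum()

print(average_precision([1, 1, 0, 1, 1, 0, 1, 0, 0, 0],
                        [0.95, 0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1]))  # 0.853
```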
Boyd, Eng, and Page (2013) studied several estimators of PR AUC, including the upper trapezoid, lower trapezoid, average precision, and an interpolated median. They concluded that average precision and the lower trapezoid are unbiased in the limit and have lower bias for small samples than the upper trapezoid. Average precision has since become the de facto standard.
| Estimator | How it works | Bias | Common use |
|---|---|---|---|
| Average precision (AP) | Weighted mean of precisions, no interpolation | Low; unbiased in the limit | scikit-learn, COCO, modern ML |
| Lower trapezoid | Trapezoidal rule using lower precision at each step | Low; unbiased in the limit | Less common, recommended by Boyd et al. |
| Upper trapezoid | Standard trapezoidal rule between PR points | Optimistic, biased high | Older sklearn (pre-0.19), legacy code |
| 11-point interpolation | Average of max precision at recall = 0, 0.1, ..., 1.0 | Coarse, simple | PASCAL VOC 2007 to 2009 |
| 101-point interpolation | Average of max precision at 101 recall points | Smooth approximation | Microsoft COCO |
A random classifier (one that produces scores independent of the true label) traces out a PR curve that is approximately a horizontal line at y = pi, where pi is the positive class prevalence in the test set. The PR AUC of such a classifier is therefore equal to pi.
This is fundamentally different from ROC AUC, where the random baseline is always 0.5 regardless of class balance. On a dataset with 1% positives, a random model has PR AUC around 0.01 and ROC AUC around 0.5. A model with PR AUC = 0.10 looks unimpressive in absolute terms, but it is actually 10 times better than chance.
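Both baselines are easy to verify by simulation; the exact values wobble with the random draw:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
n = 100_000
y_true = (rng.random(n) < 0.01).astype(int)  # ~1% positives
y_score = rng.random(n)                      # scores independent of the labels

print(average_precision_score(y_true, y_score))  # ~0.01, the prevalence
print(roc_auc_score(y_true, y_score))            # ~0.5, regardless of balance
```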
A practical consequence is that PR AUC values are not directly comparable across datasets with different prevalence. A PR AUC of 0.40 on a dataset with 5% positives represents a much stronger lift over the baseline than a PR AUC of 0.40 on a dataset with 35% positives. When reporting PR AUC, also report the positive prevalence so readers can interpret the number.
| Score | Meaning |
|---|---|
| 1.0 | Perfect classifier; ranks every positive above every negative |
| Above pi | Better than random |
| Equal to pi | No skill (random ranking) |
| Below pi | Worse than random; the model is somehow anti-correlated with the label |
ROC AUC integrates the curve of true positive rate (recall, y-axis) against false positive rate (FPR = FP / (FP + TN), x-axis). Both metrics summarize a curve into a single number, but they emphasize very different parts of the confusion matrix.
Davis and Goadrich (2006) proved a one-to-one correspondence between ROC space and PR space: a curve that dominates another in ROC space also dominates in PR space, and vice versa. Despite this equivalence at the curve level, the summary statistics behave differently because they integrate different quantities.
Saito and Rehmsmeier (2015) demonstrated through simulation that the PR plot is more informative than ROC for highly imbalanced classification. Their main observation: when the negative class is overwhelming, the FPR denominator (FP + TN) stays large no matter how many false positives the model produces. The FPR therefore stays small, the ROC curve hugs the top-left corner, and ROC AUC looks great even when most predicted positives are wrong. PR AUC, which uses precision rather than FPR, exposes this failure mode directly.
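The failure mode is easy to reproduce with synthetic scores; the exact numbers depend on the random draw:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
n_pos, n_neg = 100, 10_000                       # 1% positives
y_true = np.r_[np.ones(n_pos), np.zeros(n_neg)]
y_score = np.r_[rng.normal(1.0, 1.0, n_pos),     # positives score modestly higher
                rng.normal(0.0, 1.0, n_neg)]

print(roc_auc_score(y_true, y_score))            # ~0.76: looks respectable
print(average_precision_score(y_true, y_score))  # far lower: most predicted
                                                 # positives are still wrong
```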
| Property | PR AUC | ROC AUC |
|---|---|---|
| Axes | Recall (x), Precision (y) | FPR (x), TPR (y) |
| Uses true negatives | No | Yes (in FPR denominator) |
| Random baseline | Positive class prevalence | 0.5 (always) |
| Best on imbalanced data | Yes | Often misleadingly high |
| Sensitive to FP when negatives are common | Strong | Weak |
| Standard estimator | Average precision | Trapezoidal rule (Mann-Whitney U) |
| Comparable across datasets | No (baseline shifts) | Yes |
| Common in object detection | Yes (mAP) | No |
A simple rule of thumb: when the cost of false positives matters and the positive class is rare, prefer PR AUC. When the dataset is roughly balanced and you care about ranking quality regardless of class skew, ROC AUC is fine.
Every point on the precision-recall curve corresponds to a single threshold and therefore to a single F1 score. F1 is the harmonic mean of precision and recall, so curves that bow toward the top-right corner contain points with high F1.
A common workflow uses the two together: compare and select models with the threshold-free PR AUC, then choose the deployment threshold for the winning model by maximizing F1 (or whichever operating-point metric the application demands) along its curve.
Iso-F1 contour lines on a PR plot make it easy to read off the maximum-F1 point. Each contour is the set of (precision, recall) pairs sharing the same F1 value, and the highest contour that touches the curve gives the best achievable F1.
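Reading the maximum-F1 operating point off the curve takes a few lines; a sketch reusing the worked-example scores from earlier:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([1, 1, 0, 1, 1, 0, 1, 0, 0, 0])
y_score = np.array([0.95, 0.90, 0.80, 0.70, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
# The curve arrays carry one more entry than thresholds (a final R=0, P=1
# point); drop it so every F1 value lines up with an actual threshold.
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best = f1.argmax()
print(f"max F1 = {f1[best]:.3f} at threshold {thresholds[best]:.3f}")  # 0.833 at 0.40
```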
In object detection, the headline metric on most leaderboards is mean average precision (mAP), which averages PR-based AP across object categories. The metric was popularized by the PASCAL VOC and COCO benchmarks.
The PASCAL Visual Object Classes Challenge (Everingham et al., 2010) used a fixed Intersection over Union (IoU) threshold of 0.5 to decide which detections counted as true positives. Detections were ranked by confidence, and AP was computed using 11-point interpolation: the average, over r = 0, 0.1, 0.2, ..., 1.0, of the maximum precision achieved at any recall of at least r. Starting in 2010, the VOC challenge switched to all-point interpolation, which integrates the area under the monotonically decreasing envelope of the curve.
The Common Objects in Context (COCO) benchmark refined the protocol. COCO computes AP using 101-point interpolation across recall values 0.00, 0.01, 0.02, ..., 1.00 and averages over 10 IoU thresholds from 0.50 to 0.95 in steps of 0.05. The primary metric, often written as AP@[0.5:0.95] or simply AP in COCO papers, rewards detectors with tight bounding boxes more than the lenient single-threshold AP@0.5 from PASCAL VOC.
COCO also reports several breakdowns: AP at IoU 0.50, AP at IoU 0.75, and AP for small (area below 32 x 32 pixels), medium (between 32 x 32 and 96 x 96), and large (above 96 x 96) objects. Together these give a much more detailed picture than a single mAP figure.
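The recall-interpolation step itself is short. A simplified sketch (the function name is ours; real COCO evaluation first matches detections to ground truth by IoU before any curve exists):

```python
import numpy as np

def interpolated_ap(precision, recall, n_points=101):
    """Mean of interpolated precision over evenly spaced recall points.

    n_points=101 mimics COCO; n_points=11 mimics PASCAL VOC 2007.
    Interpolated precision at recall r is the maximum precision achieved
    at any recall >= r, i.e. the monotone envelope of the raw curve.
    """
    precision, recall = np.asarray(precision), np.asarray(recall)
    points = np.linspace(0.0, 1.0, n_points)
    envelope = [precision[recall >= r].max() if (recall >= r).any() else 0.0
                for r in points]
    return float(np.mean(envelope))
```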
| Benchmark | IoU threshold | Recall sample points | Primary metric |
|---|---|---|---|
| PASCAL VOC 2007-2009 | 0.50 | 11 (0, 0.1, ..., 1.0) | mAP@0.5 |
| PASCAL VOC 2010-2012 | 0.50 | All curve transitions | mAP@0.5 |
| Microsoft COCO | 0.50 to 0.95 in 0.05 steps | 101 (0.00, 0.01, ..., 1.00) | AP@[0.5:0.95] |
| Open Images | Varies by class | All curve transitions | mAP |
| LVIS | 0.50 to 0.95 in 0.05 steps | 101 | AP across rare/common/frequent classes |
PR AUC is a useful summary, but it is not a complete description of model behavior, and it has several known weaknesses.
Insensitivity to true negatives. Because TN never enters the formulas, PR AUC says nothing about specificity: it cannot reward a model for confidently rejecting negatives, and adding easy negatives that score below every positive leaves the score unchanged. This is exactly the property that makes it useful on imbalanced data, but it means a PR AUC figure alone tells you nothing about how the negative class is handled.
Not comparable across datasets. The random baseline equals the positive prevalence, so a PR AUC of 0.30 on a 2% positive dataset means something very different from a PR AUC of 0.30 on a 30% positive dataset. Compare the lift over baseline rather than raw values.
Threshold-free interpretation can hide deployment realities. Most production systems run at one threshold. A high PR AUC does not guarantee that the chosen operating point has acceptable precision or recall, and two models with identical PR AUCs can perform very differently at a fixed threshold.
All parts of the curve are weighted equally. AP integrates over the full recall range. In many real applications only a sliver of the curve matters: a fraud team may only care about precision in the top 1% of scored transactions. Restricted-range metrics like AP at recall above 0.5, or precision at fixed recall, capture this better.
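For example, the best precision achievable at a recall of at least 0.5 can be read straight off the curve; a sketch reusing the worked-example scores:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Worked-example scores from earlier in the article.
y_true = np.array([1, 1, 0, 1, 1, 0, 1, 0, 0, 0])
y_score = np.array([0.95, 0.90, 0.80, 0.70, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10])

precision, recall, _ = precision_recall_curve(y_true, y_score)
feasible = recall >= 0.5                 # operating points meeting the recall floor
print(precision[feasible].max())         # best precision with recall >= 0.5: 0.8
```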
Sensitivity to small changes when there are few positives. With only a handful of positive examples, a single misclassified instance can shift PR AUC noticeably. The Boyd, Eng, and Page (2013) paper provides confidence-interval procedures for this case.
Estimator confusion. Reported PR AUC values can come from average precision, trapezoidal interpolation, 11-point interpolation, 101-point interpolation, or other approximations. These are not interchangeable. Always state which estimator was used.
The scikit-learn library provides a small set of functions for computing and visualizing PR-based metrics. As of version 0.19, average_precision_score returns a non-interpolated AP, matching the formula above.
| Function or class | Purpose |
|---|---|
| sklearn.metrics.average_precision_score(y_true, y_score) | Returns AP as a scalar; the recommended PR AUC implementation |
| sklearn.metrics.precision_recall_curve(y_true, y_score) | Returns arrays of precision, recall, and thresholds for plotting or custom AUC computation |
| sklearn.metrics.PrecisionRecallDisplay.from_estimator(...) | Plots the PR curve directly from a fitted estimator |
| sklearn.metrics.PrecisionRecallDisplay.from_predictions(...) | Plots the PR curve from precomputed scores |
| sklearn.metrics.auc(recall, precision) | Generic trapezoidal AUC; not recommended for PR curves |
```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import (
    average_precision_score,
    precision_recall_curve,
    PrecisionRecallDisplay,
)

# Ground-truth labels and classifier scores for ten examples.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0, 1, 1])
y_score = np.array([0.1, 0.35, 0.4, 0.8, 0.25, 0.65, 0.3, 0.15, 0.55, 0.9])

# Scalar PR AUC via non-interpolated average precision.
ap = average_precision_score(y_true, y_score)
print(f"PR AUC (average precision): {ap:.4f}")

# Full curve for plotting or custom analysis.
precision, recall, thresholds = precision_recall_curve(y_true, y_score)
disp = PrecisionRecallDisplay(precision=precision, recall=recall, average_precision=ap)
disp.plot()
plt.show()
```
For multilabel classification, average_precision_score accepts a 2D array of scores and an average argument that controls whether per-class AP values are combined with macro, micro, weighted, or sample averaging. The default is macro averaging.
For multiclass classification, scikit-learn does not directly compute a single multiclass PR AUC. The standard practice is one-versus-rest: compute AP per class and then average, or use the same average_precision_score with binarized labels.
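A minimal multilabel sketch with three classes and made-up scores:

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Rows are examples, columns are classes; an example can carry several labels.
Y_true = np.array([[1, 0, 0],
                   [0, 1, 1],
                   [1, 1, 0],
                   [0, 0, 1]])
Y_score = np.array([[0.9, 0.2, 0.1],
                    [0.1, 0.8, 0.7],
                    [0.7, 0.6, 0.2],
                    [0.2, 0.1, 0.9]])

print(average_precision_score(Y_true, Y_score, average="macro"))  # mean of per-class APs
print(average_precision_score(Y_true, Y_score, average="micro"))  # pool all decisions first
```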
The auc function in scikit-learn applies the trapezoidal rule to any pair of x and y arrays. It is sometimes used with precision_recall_curve outputs, but linear interpolation between PR points is not justified (see above), and the result will generally differ from the AP returned by average_precision_score. For PR curves, prefer average_precision_score.
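On the worked example from earlier, the two estimators visibly disagree (a quick check; auc handles the decreasing recall axis itself):

```python
import numpy as np
from sklearn.metrics import auc, average_precision_score, precision_recall_curve

y_true = np.array([1, 1, 0, 1, 1, 0, 1, 0, 0, 0])
y_score = np.array([0.95, 0.90, 0.80, 0.70, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10])

precision, recall, _ = precision_recall_curve(y_true, y_score)
print(auc(recall, precision))                    # ~0.835, trapezoidal
print(average_precision_score(y_true, y_score))  # ~0.853, average precision
```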
| Tool | Function or class | Notes |
|---|---|---|
| TorchMetrics | torchmetrics.classification.AveragePrecision | Non-interpolated AP, mirrors sklearn |
| TorchMetrics | torchmetrics.detection.MeanAveragePrecision | COCO-style mAP for object detection |
| TensorFlow | tf.keras.metrics.AUC(curve='PR') | Trapezoidal AUC over the PR curve; can also use Riemann sums |
| pycocotools | COCOeval | Standard 101-point COCO evaluation |
| Detectron2 | COCOEvaluator | Wraps pycocotools for PyTorch detection models |
| MMDetection | eval_map | PASCAL VOC and COCO style evaluation |
Imagine you are playing a game where you have to find special rocks among a pile of normal rocks. You have a tool that helps you identify the special rocks, but it is not always correct.
The PR AUC is a number that tells you how good your tool is at finding the special rocks without accidentally picking up too many normal rocks.
It is based on two things: how many of the rocks your tool calls special really are special (precision), and how many of the special rocks your tool actually finds out of all the special rocks in the pile (recall). If the PR AUC is close to 1, your tool is great at finding the special rocks without making many mistakes. If it is close to the share of special rocks in the pile, your tool is no better than guessing.
The trick with PR AUC is that it does not care how many normal rocks your tool correctly leaves alone. It only cares about how well it handles the special ones. That is exactly why it is the right tool for the job when special rocks are very rare, like one special rock in every hundred.