Area under the PR curve

See also: Machine learning terms

The area under the precision-recall curve (AUPRC), also known as average precision (AP) or PR-AUC, is a scalar summary of a binary classifier's performance across every decision threshold. It is computed from the precision and recall values that the model produces as the threshold sweeps from very permissive to very strict. AUPRC is one of the most common single-number metrics for tasks where the positive class is rare, and it is especially popular in information retrieval, object detection, fraud detection, and medical screening.

Unlike accuracy, AUPRC is threshold-free: it does not commit the user to any particular cutoff. Unlike the area under the ROC curve (AUROC), it does not credit the model for correctly rejecting easy negatives, which makes it more discriminating on highly skewed data. The metric was popularized in machine learning by the Davis and Goadrich paper at ICML 2006, which proved a formal correspondence between the ROC and PR spaces and warned against linear interpolation in PR space.

Precision, recall, and the PR curve

For a binary classification problem, a confusion matrix is built from four counts: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). Two ratios fall out of that table.

Metric	Formula	Reads as
Precision	TP / (TP + FP)	Of everything I called positive, how much was actually positive?
Recall	TP / (TP + FN)	Of all the real positives, how many did I catch?

Precision is also called positive predictive value. Recall is also called sensitivity or the true positive rate. Most classifiers output a score, probability, or rank rather than a hard label, so to get a label you compare the score against a threshold. Lowering the threshold tends to raise recall (you accept more candidates, so you catch more true positives) but lower precision (you also accept more junk). Raising it does the opposite.
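The threshold trade-off can be sketched on a handful of made-up scores. The helper name `precision_recall_at`, the toy data, and the convention of returning precision 1.0 when nothing is flagged are assumptions made for illustration only.

```python
def precision_recall_at(y_true, y_score, threshold):
    """Return (precision, recall) when every score >= threshold is labeled positive."""
    pairs = list(zip(y_true, y_score))
    tp = sum(1 for y, s in pairs if s >= threshold and y == 1)
    fp = sum(1 for y, s in pairs if s >= threshold and y == 0)
    fn = sum(1 for y, s in pairs if s < threshold and y == 1)
    # Assumed convention: precision is 1.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical toy data: 4 positives and 6 negatives, already sorted by score.
y_true = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
y_score = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]

strict = precision_recall_at(y_true, y_score, 0.85)  # (1.0, 0.5)
loose = precision_recall_at(y_true, y_score, 0.35)   # (4/7, 1.0)
```

The strict threshold flags only the two most confident examples, so precision is perfect but half the positives are missed. The loose threshold catches all four positives at the cost of three false alarms.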

The precision-recall curve plots precision on the y axis and recall on the x axis as the threshold moves through every value the model produces. Each operating point on the curve corresponds to one threshold. A perfect classifier sits at the top right corner (recall = 1, precision = 1). A random classifier produces a roughly flat line at the height of the positive class prevalence, since precision for a random ranking equals the base rate of positives in the data.

Definition of AUPRC

AUPRC is the area under that curve. The intuitive picture is the area beneath a piecewise plot of (recall, precision) points sorted from low recall to high recall. The precise definition depends on how the area is summed, and there are three implementations worth knowing.

Average precision as a discrete sum

The most common modern definition, used by scikit-learn's average_precision_score, is the weighted mean of precisions, with the change in recall used as the weight:

AP = sum over n of (R_n - R_{n-1}) * P_n

Here P_n and R_n are the precision and recall at the n-th threshold, taken in order. Scikit-learn's documentation states this implementation "is not interpolated and is different from computing the area under the precision-recall curve with the trapezoidal rule, which uses linear interpolation and can be too optimistic." Since version 0.19, scikit-learn explicitly avoids linear interpolation between operating points and weights precisions by the change in recall since the last point.

Intuitively, AP is the expected precision over a uniform distribution on recall: you sample a recall level uniformly between 0 and 1 and ask what precision the model achieved at that level. The discrete sum above is the corresponding Riemann sum.
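A minimal pure-Python sketch of the discrete sum, assuming distinct scores (no tie handling) and at least one positive. The function name and the toy ranking are illustrative; this is not scikit-learn's own code.

```python
def average_precision(y_true, y_score):
    """AP = sum over n of (R_n - R_{n-1}) * P_n, walking down the ranking.

    Assumes distinct scores and at least one positive label.
    """
    ranked = sorted(zip(y_score, y_true), key=lambda pair: -pair[0])
    n_pos = sum(y_true)
    tp = 0
    ap = 0.0
    prev_recall = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        tp += label
        precision = tp / rank
        recall = tp / n_pos
        # Recall only moves when a positive is found, so negatives add zero.
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


# Hypothetical ranking: positives land at ranks 1, 3, 4 and 6.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]

ap = average_precision(y_true, y_score)  # (1 + 2/3 + 3/4 + 2/3) / 4 = 37/48
```

Because each positive raises recall by the same amount, the sum reduces to the mean of the precisions measured at the rank of each positive.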

11-point interpolation (PASCAL VOC 2007)

The original PASCAL VOC object detection challenge defined a simpler version. Pick 11 recall values evenly spaced from 0 to 1.0 (so 0, 0.1, 0.2, ..., 1.0). At each recall value r, take the maximum precision observed at any recall greater than or equal to r. Average those 11 numbers:

AP_11 = (1/11) * sum over r in {0, 0.1, ..., 1.0} of max_{r' >= r} P(r')

The maximum-precision interpolation smooths out the wiggles that come from a noisy ranking near the decision boundary. The 11 fixed points keep the math cheap and the values comparable across papers.
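The 11-point recipe can be written directly from that description. The sketch below assumes distinct scores, treats a recall level that is never reached as precision 0, and uses hypothetical helper names and toy data.

```python
def pr_points(y_true, y_score):
    """(recall, precision) after each item in the ranking; distinct scores assumed."""
    ranked = sorted(zip(y_score, y_true), key=lambda pair: -pair[0])
    n_pos = sum(y_true)
    points, tp = [], 0
    for rank, (_, label) in enumerate(ranked, start=1):
        tp += label
        points.append((tp / n_pos, tp / rank))
    return points


def ap_11_point(points):
    """Mean over r = 0.0, 0.1, ..., 1.0 of the best precision at recall >= r."""
    total = 0.0
    for i in range(11):
        r = i / 10
        candidates = [p for rec, p in points if rec >= r]
        total += max(candidates) if candidates else 0.0
    return total / 11


# Hypothetical ranking: positives land at ranks 1, 3, 4 and 6.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]

ap11 = ap_11_point(pr_points(y_true, y_score))  # (3*1.0 + 5*0.75 + 3*(2/3)) / 11
```

On this ranking the eleven sampled precisions are 1.0 three times, 0.75 five times, and 2/3 three times, giving roughly 0.795.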

All-point interpolation (PASCAL VOC 2010 and after)

From 2010 onward, the PASCAL VOC challenge replaced the 11-point method with all-point interpolation, which integrates the same monotonic envelope (the running maximum of precision) across all recall levels actually observed in the data. The COCO benchmark generalizes this further with 101-point interpolation and averages AP across 10 IoU thresholds from 0.5 to 0.95 in steps of 0.05 to produce its main mean average precision score.
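A sketch of all-point interpolation on a toy ranking, assuming distinct scores. The helper names are hypothetical and this is not the official VOC or COCO evaluation code.

```python
def pr_points(y_true, y_score):
    """(recall, precision) after each item in the ranking; distinct scores assumed."""
    ranked = sorted(zip(y_score, y_true), key=lambda pair: -pair[0])
    n_pos = sum(y_true)
    points, tp = [], 0
    for rank, (_, label) in enumerate(ranked, start=1):
        tp += label
        points.append((tp / n_pos, tp / rank))
    return points


def ap_all_point(points):
    """Area under the running-maximum precision envelope over observed recalls."""
    pts = sorted(points)
    recalls = [r for r, _ in pts]
    precisions = [p for _, p in pts]
    # Sweep right to left so each precision becomes the max at any recall >= its own.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    area, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        area += (r - prev_recall) * p
        prev_recall = r
    return area


# Hypothetical ranking: positives land at ranks 1, 3, 4 and 6.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]

ap_all = ap_all_point(pr_points(y_true, y_score))  # (1 + 0.75 + 0.75 + 2/3) / 4 = 19/24
```

The envelope lifts the precision at recall 0.5 from 2/3 to 0.75, because a higher precision is still reachable at recall 0.75. That is why this variant is never lower than the uninterpolated sum on the same ranking.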

These three flavors of AP can give noticeably different numbers on the same predictions, which is why precise benchmark reports always say which definition they used.

Why linear interpolation between PR points is wrong

If you connect adjacent (recall, precision) points with straight lines and integrate under that polygon, you get an estimate that is biased upward. Davis and Goadrich (2006) showed that the precision between two real operating points does not generally trace a straight line; it follows a curve that depends on the ratio of true positives to false positives gained between those points. Linear interpolation overshoots that curve, especially in the high-recall region where precision tends to drop sharply. This is one reason scikit-learn's auc(recall, precision), which does use the trapezoidal rule, can disagree with average_precision_score on the same inputs, and the documentation warns users against trusting the trapezoidal version as an AUPRC estimate.
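The gap between the two interpolations can be seen on a small worked case. The sketch below follows the counting argument attributed to Davis and Goadrich: for each extra true positive gained between points A and B, false positives are added at the local rate (FP_B - FP_A) / (TP_B - TP_A). The function names and the numbers are illustrative assumptions, and the sketch requires TP_B > TP_A.

```python
def pr_interpolate(tp_a, fp_a, tp_b, fp_b, n_pos):
    """Intermediate (recall, precision) points between operating points A and B."""
    fp_per_tp = (fp_b - fp_a) / (tp_b - tp_a)
    points = []
    for x in range(tp_b - tp_a + 1):
        tp = tp_a + x
        fp = fp_a + fp_per_tp * x
        points.append((tp / n_pos, tp / (tp + fp)))
    return points


def linear_precision(points, recall):
    """Precision read off the straight line joining the first and last points."""
    (r0, p0), (r1, p1) = points[0], points[-1]
    return p0 + (p1 - p0) * (recall - r0) / (r1 - r0)


# Illustrative counts: 20 positives; A has TP=5, FP=5; B has TP=10, FP=30.
curve = pr_interpolate(5, 5, 10, 30, n_pos=20)

# At recall 0.30 the count-based curve gives 6 / (6 + 10) = 0.375,
# while the straight line between A and B gives 0.45.
count_based = curve[1][1]
straight_line = linear_precision(curve, curve[1][0])
```

Every interior point of the count-based curve sits below the straight line, so the trapezoid over this segment overstates the area.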

AUPRC versus ROC AUC

ROC curves plot the true positive rate (recall) against the false positive rate, where FPR = FP / (FP + TN). Because the denominator of FPR contains true negatives, ROC AUC counts an improvement on the negative class as a real improvement, even if the model has not actually gotten any better at finding rare positives. AUPRC ignores true negatives entirely. The two metrics therefore answer different questions.

Property	ROC AUC	AUPRC
Axes	TPR vs FPR	Precision vs Recall
Uses true negatives?	Yes	No
Random baseline	0.5	Prevalence of positives
Sensitive to class imbalance?	Largely invariant	Yes, baseline scales with imbalance
Best when...	Both classes matter	Positives are rare and costly to miss
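The different treatment of true negatives can be checked on a toy ranking. The sketch appends 100 trivially easy negatives (scored below everything else) and recomputes both metrics. The function names and data are hypothetical, and the AP helper assumes distinct scores.

```python
def average_precision(y_true, y_score):
    """Discrete-sum AP; assumes distinct scores and at least one positive."""
    ranked = sorted(zip(y_score, y_true), key=lambda pair: -pair[0])
    n_pos = sum(y_true)
    tp, ap, prev_recall = 0, 0.0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        tp += label
        recall = tp / n_pos
        ap += (recall - prev_recall) * (tp / rank)
        prev_recall = recall
    return ap


def auroc(y_true, y_score):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]

# 100 easy negatives, all scored below every original example.
easy_scores = [0.001 * i for i in range(1, 101)]
y_true_big = y_true + [0] * len(easy_scores)
y_score_big = y_score + easy_scores

ap_before, ap_after = average_precision(y_true, y_score), average_precision(y_true_big, y_score_big)
auroc_before, auroc_after = auroc(y_true, y_score), auroc(y_true_big, y_score_big)
```

AUROC climbs from 0.75 to about 0.99 even though the ordering of the positives relative to the hard negatives has not changed, while AP stays at 37/48.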

The random-classifier baseline for AUPRC is the fraction of positives in the dataset, often written as P / (P + N). On a fraud detection problem with 1% positives, a model that simply ranks examples randomly gets an AUPRC of about 0.01. An AUPRC of 0.4 therefore sounds modest in the abstract but is a 40x improvement over random. AUROC's baseline, by contrast, is fixed at 0.5 regardless of prevalence; a 0.85 AUROC means roughly the same thing whether the positive rate is 50% or 0.1%, which is part of why it can flatter classifiers on heavily skewed data.
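A quick simulation illustrates the baseline. The sketch scores 50,000 synthetic examples (1% positive) with uniform random numbers and computes the discrete-sum AP. The seed, sample size, and function name are arbitrary choices for illustration.

```python
import random


def average_precision(y_true, y_score):
    """Discrete-sum AP; assumes distinct scores and at least one positive."""
    ranked = sorted(zip(y_score, y_true), key=lambda pair: -pair[0])
    n_pos = sum(y_true)
    tp, ap, prev_recall = 0, 0.0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        tp += label
        recall = tp / n_pos
        ap += (recall - prev_recall) * (tp / rank)
        prev_recall = recall
    return ap


random.seed(0)
n, n_pos = 50_000, 500
y_true = [1] * n_pos + [0] * (n - n_pos)
y_score = [random.random() for _ in range(n)]  # scores carry no information

prevalence = n_pos / n                          # 0.01
ap_random = average_precision(y_true, y_score)  # lands near the prevalence, not exactly on it
lift = 0.4 / prevalence                         # an AUPRC of 0.4 is about 40x the baseline
```

Dividing an observed AUPRC by the prevalence is a simple way to express it as a lift over random ranking.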

Davis and Goadrich proved that a curve dominates in ROC space if and only if it dominates in PR space, so the two metrics agree about which model is strictly better when one model's curve is everywhere above another's. They disagree about how much better, and they disagree about ranking when the curves cross. Optimizing AUROC does not guarantee an optimal AUPRC. For imbalanced data, most practitioners now report AUPRC alongside or instead of AUROC.

A 2024 paper by Matthew McDermott and coauthors pushed back on the conventional wisdom, arguing that AUROC is not blind to imbalance in the way critics often claim, and that PR-AUC is hard to disentangle from the base rate. The debate is real, but in applied work the rule of thumb still holds: if you care about the minority class, watch the PR curve.

Practical example: scikit-learn

The canonical computation in Python looks like this:

from sklearn.metrics import average_precision_score, precision_recall_curve

# y_true: 0/1 labels, y_score: model output scores or probabilities (toy values)
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
ap = average_precision_score(y_true, y_score)
precision, recall, thresholds = precision_recall_curve(y_true, y_score)

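average_precision_score returns the discrete-sum AP described above. precision_recall_curve returns the raw arrays you can plot: precision and recall each hold one entry per threshold plus a final point at recall = 0, precision = 1 that has no corresponding threshold, so both arrays are one element longer than thresholds. For multilabel or multiclass problems, average_precision_score accepts average='macro', 'micro', 'weighted', or 'samples' to aggregate per-class AP into one number.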

In deep learning, the same metric appears under different names. torchmetrics.AveragePrecision (from the TorchMetrics library for PyTorch), TensorFlow's tf.keras.metrics.AUC(curve='PR'), and Detectron2's COCO evaluator all compute closely related quantities, but they differ in how they handle ties, threshold discretization, and interpolation, so their numbers are not always interchangeable.

When AUPRC is a good metric, and when it is not

AUPRC is most useful when the positive class is rare and missing positives is expensive. Cancer screening, intrusion detection, defect inspection on a production line, and retrieval ranking all fit that pattern. The metric is also useful when you do not yet know the threshold you will deploy with and want a summary that respects every possible operating point.

It is less useful in three cases. First, when classes are roughly balanced, ROC AUC is easier to interpret and not much less informative. Second, when you have a fixed operating threshold that is dictated by business cost rather than tunable, a single point estimate of precision and recall at that threshold is more relevant than the area under the curve. Third, AUPRC moves around when the base rate moves, so comparing AUPRC numbers across datasets or time periods with different prevalence requires care. A 0.7 AUPRC on a 50%-positive dataset is not the same achievement as a 0.7 AUPRC on a 1%-positive dataset.

Most teams report several metrics together: AUPRC, AUROC, and a small table of precision and recall at one or two thresholds that match the deployment cost structure. That triangulates what the model is actually doing instead of relying on any one summary statistic.
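One row of such an operating-point table could be produced as in the sketch below, which finds the strictest threshold that still meets a recall target and reports the precision there. The function name, the returned dictionary layout, and the toy data are assumptions for illustration, and distinct scores are assumed.

```python
def operating_point_for_recall(y_true, y_score, target_recall):
    """Highest score threshold whose recall meets the target, with its precision.

    Returns None if the target cannot be reached. Assumes distinct scores
    and at least one positive label.
    """
    ranked = sorted(zip(y_score, y_true), key=lambda pair: -pair[0])
    n_pos = sum(y_true)
    tp = 0
    for rank, (score, label) in enumerate(ranked, start=1):
        tp += label
        recall = tp / n_pos
        if recall >= target_recall:
            # Labeling everything with score >= this value as positive hits the target.
            return {"threshold": score, "precision": tp / rank, "recall": recall}
    return None


y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]

point = operating_point_for_recall(y_true, y_score, target_recall=0.75)
# {"threshold": 0.6, "precision": 0.75, "recall": 0.75}
```

Reporting a few such points next to AUPRC and AUROC ties the summary statistics to the thresholds that would actually be deployed.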

Key facts at a glance

Aspect	Value or note
Other names	Average precision (AP), PR-AUC
Range	0 to 1
Perfect classifier	1.0
Random baseline	Fraction of positives in the dataset
Best for	Imbalanced data where positives matter
Common sklearn function	`sklearn.metrics.average_precision_score`
Trapezoidal rule on PR curve	Discouraged: biased upward
11-point version	PASCAL VOC 2007
All-point version	PASCAL VOC 2010 onward (COCO uses a 101-point variant)
Foundational paper	Davis and Goadrich, ICML 2006

ELI5

Imagine a spam filter that ranks every email by how likely it thinks each one is spam. You start at the top of the list, where the filter is most confident, and work your way down. Two questions matter as you go. Out of the emails you have flagged so far, what fraction really are spam? That is precision. Out of all the spam emails that exist in your inbox, how many have you flagged so far? That is recall. As you accept more emails, precision usually drops and recall usually rises. The area under the precision-recall curve is the average precision you keep, measured over the whole journey from catching no spam to catching all of it. A higher number means the filter stayed accurate even as it tried to catch everything.

References

Davis, J., and Goadrich, M. (2006). *The Relationship Between Precision-Recall and ROC Curves*. Proceedings of the 23rd International Conference on Machine Learning, pp. 233-240. https://pages.cs.wisc.edu/~jdavis/davisgoadrichcamera2.pdf
scikit-learn developers. *sklearn.metrics.average_precision_score*. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html
scikit-learn developers. *Precision-Recall example*. https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html
Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., and Zisserman, A. (2010). *The PASCAL Visual Object Classes (VOC) Challenge*. International Journal of Computer Vision, 88(2), 303-338.
Lin, T.-Y. et al. (2014). *Microsoft COCO: Common Objects in Context*. ECCV 2014. https://cocodataset.org
Saito, T., and Rehmsmeier, M. (2015). *The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets*. PLOS ONE 10(3): e0118432.
Wikipedia. *Precision and recall*. https://en.wikipedia.org/wiki/Precision_and_recall
Wikipedia. *Evaluation of binary classifiers*. https://en.wikipedia.org/wiki/Evaluation_of_binary_classifiers
McDermott, M. B. A. et al. (2024). *A Closer Look at AUROC and AUPRC under Class Imbalance*. NeurIPS 2024. https://arxiv.org/abs/2401.06091
Glass Box Medicine. *Measuring Performance: AUPRC and Average Precision*. https://glassboxmedicine.com/2019/03/02/measuring-performance-auprc/
