Area under the PR curve
Last reviewed
May 11, 2026
Sources
10 citations
Review status
Source-backed
Revision
v2 · 2,043 words
See also: Machine learning terms
The area under the precision-recall curve (AUPRC), also known as average precision (AP) or PR-AUC, is a scalar summary of a binary classifier's performance across every decision threshold. It is computed from the precision and recall values that the model produces as the threshold sweeps from very permissive to very strict. AUPRC is one of the most common single-number metrics for tasks where the positive class is rare, and it is especially popular in information retrieval, object detection, fraud detection, and medical screening.
Unlike accuracy, AUPRC is threshold-free: it does not commit the user to any particular cutoff. Unlike the area under the ROC curve (AUROC), it does not credit the model for correctly rejecting easy negatives, which makes it more discriminating on highly skewed data. The metric was popularized in machine learning by the Davis and Goadrich paper at ICML 2006, which proved a formal correspondence between the ROC and PR spaces and warned against linear interpolation in PR space.
For a binary classification problem, a confusion matrix is built from four counts: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). Two ratios fall out of that table.
| Metric | Formula | Reads as |
|---|---|---|
| Precision | TP / (TP + FP) | Of everything I called positive, how much was actually positive? |
| Recall | TP / (TP + FN) | Of all the real positives, how many did I catch? |
Precision is also called positive predictive value. Recall is also called sensitivity or the true positive rate. Most classifiers output a score, probability, or rank rather than a hard label, so to get a label you compare the score against a threshold. Lowering the threshold tends to raise recall (you accept more candidates, so you catch more true positives) but lower precision (you also accept more junk). Raising it does the opposite.
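The threshold trade-off described above can be sketched in a few lines of plain Python. The function name `precision_recall_at`, the labels, and the scores are all invented for illustration; returning a precision of 1.0 when nothing is predicted positive is an assumed convention, not something the definitions above dictate.

```python
def precision_recall_at(y_true, y_score, threshold):
    """Precision and recall when every score >= threshold is labeled positive."""
    tp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, y_score) if s < threshold and y == 1)
    # Assumed convention: precision is 1.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels and scores, invented for illustration.
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]

strict = precision_recall_at(y_true, y_score, threshold=0.75)
loose = precision_recall_at(y_true, y_score, threshold=0.25)
print("strict threshold:", strict)
print("loose threshold: ", loose)
```

On this toy data the strict threshold keeps precision high at the cost of recall, and the loose threshold does the reverse.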
The precision-recall curve plots precision on the y axis and recall on the x axis as the threshold moves through every value the model produces. Each operating point on the curve corresponds to one threshold. A perfect classifier sits at the top right corner (recall = 1, precision = 1). A random classifier produces a roughly flat line at the height of the positive class prevalence, since precision for a random ranking equals the base rate of positives in the data.
AUPRC is the area under that curve. The intuitive picture is the area beneath a piecewise plot of (recall, precision) points sorted from low recall to high recall. The precise definition depends on how the area is summed, and there are three implementations worth knowing.
The most common modern definition, used by scikit-learn's average_precision_score, is the weighted mean of precisions, with the change in recall used as the weight:
AP = sum over n of (R_n - R_{n-1}) * P_n
Here P_n and R_n are the precision and recall at the n-th threshold, taken in order. Scikit-learn's documentation states this implementation "is not interpolated and is different from computing the area under the precision-recall curve with the trapezoidal rule, which uses linear interpolation and can be too optimistic." Since version 0.19, scikit-learn explicitly avoids linear interpolation between operating points and weights precisions by the change in recall since the last point.
Intuitively, AP is the expected precision over a uniform distribution on recall: you sample a recall level uniformly between 0 and 1 and ask what precision the model achieved at that level. The discrete sum above is the corresponding Riemann sum.
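As a rough illustration of the discrete sum, the sketch below computes it directly in plain Python. The function name `average_precision_sum` and the sample data are hypothetical. The code assumes at least one positive label and groups tied scores so that each distinct score acts as one threshold; it is meant to mirror the formula, not to reproduce any library's exact implementation.

```python
def average_precision_sum(y_true, y_score):
    """Non-interpolated AP: sum over thresholds of (R_n - R_{n-1}) * P_n."""
    total_pos = sum(y_true)  # assumed to be at least 1
    ranked = sorted(zip(y_score, y_true), key=lambda pair: -pair[0])
    ap, tp, fp, prev_recall = 0.0, 0, 0, 0.0
    i = 0
    while i < len(ranked):
        score = ranked[i][0]
        # Consume every item that shares this score: one threshold per distinct score.
        while i < len(ranked) and ranked[i][0] == score:
            tp += ranked[i][1]
            fp += 1 - ranked[i][1]
            i += 1
        precision = tp / (tp + fp)
        recall = tp / total_pos
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


# Hypothetical labels and scores, invented for illustration.
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
ap = average_precision_sum(y_true, y_score)
print(f"AP = {ap:.4f}")
```

Only thresholds that add a true positive change recall, so only those steps contribute to the sum.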
The original PASCAL VOC object detection challenge defined a simpler version. Pick 11 recall values evenly spaced from 0 to 1.0 (so 0, 0.1, 0.2, ..., 1.0). At each recall value r, take the maximum precision observed at any recall greater than or equal to r. Average those 11 numbers:
AP_11 = (1/11) * sum over r in {0, 0.1, ..., 1.0} of max_{r' >= r} P(r')
The maximum-precision interpolation smooths out the wiggles that come from a noisy ranking near the decision boundary. The 11 fixed points keep the math cheap and the values comparable across papers.
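The 11-point recipe can be sketched as follows. The helper names `pr_points` and `ap_11_point` and the data are hypothetical. For simplicity the sketch records one operating point per ranked item and does not treat tied scores specially, and it assigns a precision of 0 to any recall level the ranking never reaches (an assumed convention).

```python
def pr_points(y_true, y_score):
    """(recall, precision) after each item, walking down the ranking."""
    total_pos = sum(y_true)  # assumed to be at least 1
    ranked = [y for _, y in sorted(zip(y_score, y_true), key=lambda pair: -pair[0])]
    points, tp = [], 0
    for k, y in enumerate(ranked, start=1):
        tp += y
        points.append((tp / total_pos, tp / k))
    return points


def ap_11_point(points):
    """Mean of the max precision at recall >= r, for r in 0, 0.1, ..., 1.0."""
    total = 0.0
    for i in range(11):
        r = i / 10
        candidates = [prec for rec, prec in points if rec >= r]
        # Assumed convention: precision 0 if this recall level is never reached.
        total += max(candidates) if candidates else 0.0
    return total / 11


# Hypothetical labels and scores, invented for illustration.
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
ap11 = ap_11_point(pr_points(y_true, y_score))
print(f"11-point AP = {ap11:.4f}")
```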
From 2010 onward, the PASCAL VOC challenge replaced the 11-point method with all-point interpolation, which integrates the same monotonic envelope (the running maximum of precision) across all recall levels actually observed in the data. The COCO benchmark generalizes this further with 101-point interpolation and averages AP across 10 IoU thresholds from 0.5 to 0.95 in steps of 0.05 to produce its main mean average precision score.
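A minimal sketch of all-point interpolation, assuming the (recall, precision) points are already sorted by ascending recall. The function name `ap_all_point` and the sample points are hypothetical; the padding with sentinel points at recall 0 and 1 is one common way to write it, not necessarily how any particular benchmark toolkit does.

```python
def ap_all_point(recalls, precisions):
    """Area under the running-max precision envelope; recalls must be ascending."""
    # Pad with sentinel points so the envelope covers recall 0 through 1.
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Sweep right to left so each precision becomes the max at any recall >= r.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum the rectangles wherever recall actually increases.
    return sum((r[i] - r[i - 1]) * p[i] for i in range(1, len(r)))


# Hypothetical operating points for a ranking with labels 1,0,1,1,0,1,0,0.
recalls = [0.25, 0.25, 0.5, 0.75, 0.75, 1.0, 1.0, 1.0]
precisions = [1.0, 0.5, 2 / 3, 0.75, 0.6, 4 / 6, 4 / 7, 0.5]
ap = ap_all_point(recalls, precisions)
print(f"all-point AP = {ap:.4f}")
```

On these points the envelope lifts the precision at recall 0.5 from 2/3 to 0.75, so the result is slightly higher than the non-interpolated sum over the same points.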
These three flavors of AP can give noticeably different numbers on the same predictions, which is why precise benchmark reports always say which definition they used.
If you connect adjacent (recall, precision) points with straight lines and integrate under that polygon, you get an estimate that is biased upward. Davis and Goadrich (2006) showed that the precision between two real operating points does not generally trace a straight line; it follows a curve that depends on the ratio of true positives to false positives gained between those points. Linear interpolation overshoots that curve, especially in the high-recall region where precision tends to drop sharply. This is one reason scikit-learn's auc(recall, precision), which does use the trapezoidal rule, can disagree with average_precision_score on the same inputs, and the documentation warns users against trusting the trapezoidal version as an AUPRC estimate.
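The gap between a straight line and the achievable curve can be illustrated with a small sketch. It assumes that, between two operating points A and B, each additional true positive brings along false positives at the average rate observed between A and B. The function name `pr_interpolate` and the counts are illustrative choices, not taken from any library.

```python
def pr_interpolate(tp_a, fp_a, tp_b, fp_b, total_pos):
    """Achievable (recall, precision) points from A to B; requires tp_b > tp_a."""
    # Average number of false positives accepted per extra true positive.
    fp_per_tp = (fp_b - fp_a) / (tp_b - tp_a)
    points = []
    for extra in range(tp_b - tp_a + 1):
        tp = tp_a + extra
        fp = fp_a + fp_per_tp * extra
        points.append((tp / total_pos, tp / (tp + fp)))
    return points


# Illustrative counts: A has 5 TP and 5 FP, B has 10 TP and 30 FP, 20 positives total.
points = pr_interpolate(tp_a=5, fp_a=5, tp_b=10, fp_b=30, total_pos=20)

# Precision a straight line from A to B would claim at the same recall values.
(r_a, p_a), (r_b, p_b) = points[0], points[-1]
linear = [p_a + (p_b - p_a) * (r - r_a) / (r_b - r_a) for r, _ in points]

for (r, p), lin in zip(points, linear):
    print(f"recall={r:.2f}  achievable={p:.3f}  linear={lin:.3f}")
```

In this example the straight line sits above the achievable precision at every interior recall value, which is the optimism the trapezoidal rule inherits.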
ROC curves plot the true positive rate (recall) against the false positive rate, where FPR = FP / (FP + TN). Because the denominator of FPR contains true negatives, ROC AUC counts an improvement on the negative class as a real improvement, even if the model has not actually gotten any better at finding rare positives. AUPRC ignores true negatives entirely. The two metrics therefore answer different questions.
| | ROC AUC | AUPRC |
|---|---|---|
| Axes | TPR vs FPR | Precision vs Recall |
| Uses true negatives? | Yes | No |
| Random baseline | 0.5 | Prevalence of positives |
| Sensitive to class imbalance? | Largely invariant | Yes, baseline scales with imbalance |
| Best when... | Both classes matter | Positives are rare and costly to miss |
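To make the role of true negatives concrete, the sketch below compares one operating point before and after adding easy negatives that the model rejects correctly. The function name `rates` and all counts are hypothetical.

```python
def rates(tp, fp, tn, fn):
    """Recall, false positive rate, and precision from confusion-matrix counts."""
    return {
        "recall": tp / (tp + fn),
        "fpr": fp / (fp + tn),
        "precision": tp / (tp + fp),
    }


# Hypothetical counts at a fixed threshold.
before = rates(tp=80, fp=20, tn=900, fn=20)
# Same model and threshold, plus 9,000 easy negatives that are all rejected.
after = rates(tp=80, fp=20, tn=9900, fn=20)

print("before:", before)
print("after: ", after)
```

The false positive rate shrinks because its denominator grew, while precision and recall stay exactly where they were: the ROC point moves and the PR point does not.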
The random-classifier baseline for AUPRC is the fraction of positives in the dataset, often written as P / (P + N). On a fraud detection problem with 1% positives, a model that ranks examples randomly gets an AUPRC of about 0.01, so an AUPRC of 0.4 sounds modest in the abstract but is a 40x improvement over random. AUROC, by contrast, has a random baseline fixed at 0.5 regardless of prevalence; a 0.85 AUROC means roughly the same thing whether the positive rate is 50% or 0.1%, which is part of why it can flatter classifiers on heavily skewed data.
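A quick simulation can make the baseline tangible. The sketch below shuffles a hypothetical dataset with 1% positives to mimic a random ranking and computes the non-interpolated average precision. The function name, dataset size, and seed are arbitrary choices, and the result will vary somewhat from seed to seed.

```python
import random


def average_precision_ranked(ranked_labels):
    """Non-interpolated AP for 0/1 labels listed in ranked order (no ties)."""
    total_pos = sum(ranked_labels)  # assumed to be at least 1
    tp, ap = 0, 0.0
    for k, y in enumerate(ranked_labels, start=1):
        if y:
            tp += 1
            ap += tp / k  # precision at this positive; each recall step is 1/total_pos
    return ap / total_pos


# Hypothetical dataset: 200 positives among 20,000 examples (1% prevalence).
n_total, n_pos = 20_000, 200
prevalence = n_pos / n_total
labels = [1] * n_pos + [0] * (n_total - n_pos)

rng = random.Random(0)  # arbitrary seed
rng.shuffle(labels)     # a random ranking carries no information about the labels
baseline_ap = average_precision_ranked(labels)

print(f"prevalence = {prevalence:.4f}")
print(f"AP of a random ranking = {baseline_ap:.4f}")
```

Dividing a model's AUPRC by the prevalence gives a rough lift over random, which is a convenient way to compare against this baseline.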
Davis and Goadrich proved that a curve dominates in ROC space if and only if it dominates in PR space, so the two metrics agree about which model is strictly better when one model's curve is everywhere above another's. They disagree about how much better, and they disagree about ranking when the curves cross. Optimizing AUROC does not guarantee an optimal AUPRC. For imbalanced data, most practitioners now report AUPRC alongside or instead of AUROC.
A 2024 paper by Matthew McDermott and coauthors pushed back on the conventional wisdom, arguing that AUROC is not blind to imbalance in the way critics often claim, and that PR-AUC is hard to disentangle from the base rate. The debate is real, but in applied work the rule of thumb still holds: if you care about the minority class, watch the PR curve.
The canonical computation in Python looks like this:
```python
from sklearn.metrics import average_precision_score, precision_recall_curve

# y_true: 0/1 labels, y_score: model output scores or probabilities
# (small illustrative values so the snippet runs as written)
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

ap = average_precision_score(y_true, y_score)
precision, recall, thresholds = precision_recall_curve(y_true, y_score)
```
average_precision_score returns the discrete-sum AP described above. precision_recall_curve returns the raw arrays you can plot: thresholds has one entry per unique threshold, and precision and recall each have one extra entry because a final point (precision = 1, recall = 0) is appended. For multilabel or multiclass problems, average_precision_score supports average='macro', 'micro', 'weighted', and 'samples' to aggregate per-class AP into one number.
In deep learning, the same metric appears under different names. The torchmetrics library's AveragePrecision (commonly used with PyTorch), TensorFlow's tf.keras.metrics.AUC(curve='PR'), and Detectron2's COCO evaluator all compute closely related quantities, but they differ in how they handle ties, threshold discretization, and interpolation, so their values are not always directly comparable.
AUPRC is most useful when the positive class is rare and missing positives is expensive. Cancer screening, intrusion detection, defect inspection on a production line, and retrieval ranking all fit that pattern. The metric is also useful when you do not yet know the threshold you will deploy with and want a summary that respects every possible operating point.
It is less useful in three cases. First, when classes are roughly balanced, ROC AUC is easier to interpret and not much less informative. Second, when you have a fixed operating threshold that is dictated by business cost rather than tunable, a single point estimate of precision and recall at that threshold is more relevant than the area under the curve. Third, AUPRC moves around when the base rate moves, so comparing AUPRC numbers across datasets or time periods with different prevalence requires care. A 0.7 AUPRC on a 50%-positive dataset is not the same achievement as a 0.7 AUPRC on a 1%-positive dataset.
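The base-rate dependence in the third case follows from Bayes' rule: for a fixed true positive rate and false positive rate, precision = TPR * pi / (TPR * pi + FPR * (1 - pi)), where pi is the prevalence. The sketch below evaluates that identity at two prevalences; the function name `precision_at` and the operating point are hypothetical.

```python
def precision_at(tpr, fpr, prevalence):
    """Precision implied by a fixed (TPR, FPR) operating point at a given prevalence."""
    true_pos_mass = tpr * prevalence         # fraction of all examples that are TP
    false_pos_mass = fpr * (1 - prevalence)  # fraction of all examples that are FP
    return true_pos_mass / (true_pos_mass + false_pos_mass)


# Hypothetical operating point: the model itself is identical in both settings.
balanced = precision_at(tpr=0.8, fpr=0.05, prevalence=0.5)
rare = precision_at(tpr=0.8, fpr=0.05, prevalence=0.01)

print(f"precision at 50% prevalence: {balanced:.3f}")
print(f"precision at  1% prevalence: {rare:.3f}")
```

Because every point on the PR curve shifts this way, the area under it shifts too, even though the model's ranking ability has not changed.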
A common practice is to report several metrics together: AUPRC, AUROC, and a small table of precision and recall at one or two thresholds that match the deployment cost structure. That triangulates what the model is actually doing instead of relying on any one summary statistic.
| Aspect | Value or note |
|---|---|
| Other names | AP, AUPRC, PR-AUC, average precision |
| Range | 0 to 1 |
| Perfect classifier | 1.0 |
| Random baseline | Fraction of positives in the dataset |
| Best for | Imbalanced data where positives matter |
| Common sklearn function | sklearn.metrics.average_precision_score |
| Trapezoidal rule on PR curve | Discouraged: biased upward |
| 11-point version | PASCAL VOC 2007 |
| All-point version | PASCAL VOC 2010 onward; COCO uses a 101-point variant |
| Foundational paper | Davis and Goadrich, ICML 2006 |
Imagine a spam filter that ranks every email by how likely it thinks each one is spam. You start at the top of the list, where the filter is most confident, and work your way down. Two questions matter as you go. Out of the emails you have flagged so far, what fraction really are spam? That is precision. Out of all the spam emails that exist in your inbox, how many have you flagged so far? That is recall. As you accept more emails, precision usually drops and recall usually rises. The area under the precision-recall curve is the average precision you keep, measured over the whole journey from catching no spam to catching all of it. A higher number means the filter stayed accurate even as it tried to catch everything.
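The walk down the ranked list can be written out directly. The inbox below is made up, and the variable names are illustrative only.

```python
# Hypothetical inbox, already sorted from most to least spam-like.
ranked_is_spam = [True, True, False, True, False, False, True, False]

total_spam = sum(ranked_is_spam)
flagged_spam = 0
journey = []  # (emails flagged so far, precision, recall)

for flagged, is_spam in enumerate(ranked_is_spam, start=1):
    flagged_spam += is_spam
    precision = flagged_spam / flagged   # of the emails flagged, how many are spam
    recall = flagged_spam / total_spam   # of all spam, how much has been flagged
    journey.append((flagged, round(precision, 2), round(recall, 2)))
    print(f"flagged {flagged}: precision={precision:.2f} recall={recall:.2f}")
```

Recall only moves when a newly flagged email really is spam, and those are the steps that the area under the curve averages over.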