See also: precision, recall, ROC curve, AUC, F1 score, confusion matrix, precision-recall curve
PR AUC (Precision-Recall Area Under the Curve), also referred to as AUPRC or AUC-PR, is a classification evaluation metric that quantifies the area beneath a precision-recall curve. The precision-recall curve plots precision on the y-axis against recall on the x-axis at varying classification thresholds. PR AUC compresses this entire curve into a single scalar value between 0 and 1, where higher values indicate better classifier performance at distinguishing the positive class from the negative class.
Unlike ROC AUC, which plots the true positive rate against the false positive rate, PR AUC focuses exclusively on the positive class. It does not take true negatives into account. This property makes PR AUC particularly well-suited for evaluating classifiers on imbalanced datasets where the positive class is rare, such as fraud detection, disease screening, or information retrieval tasks. On such datasets, ROC AUC can present an overly optimistic picture of performance because the large number of true negatives keeps the false positive rate (FP / (FP + TN)) low, making even a mediocre classifier appear strong.
The precision-recall framework has roots in information retrieval, where researchers have measured precision and recall since at least the 1960s to evaluate search and document retrieval systems. The formalization of the relationship between PR curves and ROC curves in machine learning was significantly advanced by Davis and Goadrich in their 2006 paper "The Relationship Between Precision-Recall and ROC Curves," which established key theorems connecting the two evaluation spaces and demonstrated proper interpolation methods for PR curves.
Before defining PR AUC, it is necessary to define its two component metrics. Both are derived from the confusion matrix.
Precision (also called positive predictive value) measures the fraction of positive predictions that are actually correct:
Precision = TP / (TP + FP)
Recall (also called sensitivity or true positive rate) measures the fraction of actual positive instances that the model correctly identified:
Recall = TP / (TP + FN)
Where:
| Symbol | Name | Meaning |
|--------|------|---------|
| TP | True Positive | A positive instance correctly predicted as positive |
| FP | False Positive | A negative instance incorrectly predicted as positive |
| FN | False Negative | A positive instance incorrectly predicted as negative |
| TN | True Negative | A negative instance correctly predicted as negative |
Notice that neither precision nor recall uses TN (true negatives). This is the fundamental reason PR AUC behaves differently from ROC AUC on imbalanced data.
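As a concrete illustration, both metrics can be computed directly from label and prediction arrays. The `precision_recall` helper below is a hypothetical function written for this sketch, not part of any library:

```python
import numpy as np

def precision_recall(y_true, y_pred):
    """Compute precision and recall from binary labels and predictions."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])
p, r = precision_recall(y_true, y_pred)
# Here TP = 3, FP = 1, FN = 1, so precision = 0.75 and recall = 0.75;
# the four TN instances play no role in either value.
```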
A precision-recall curve is constructed by varying the decision threshold of a binary classifier that outputs continuous scores or probabilities. At each threshold, instances with scores above the threshold are predicted as positive, and precision and recall are computed for that threshold. Plotting all (recall, precision) pairs produces the curve.
As the threshold decreases, more instances are predicted as positive. Recall generally increases (more true positives are captured), but precision may decrease (more false positives are introduced). The resulting curve typically slopes downward from left to right, though unlike the ROC curve, the precision-recall curve is not guaranteed to be monotonic. Precision can increase or decrease as the threshold changes, leading to the characteristic "sawtooth" shape often observed in PR curves.
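The threshold sweep described above can be sketched in a few lines; `pr_curve_points` is a hypothetical helper written for illustration (scikit-learn's `precision_recall_curve` does this in practice):

```python
import numpy as np

def pr_curve_points(y_true, y_scores):
    """Sweep each distinct score as a threshold; collect (recall, precision)."""
    points = []
    for t in sorted(set(y_scores), reverse=True):
        y_pred = (y_scores >= t).astype(int)
        tp = np.sum((y_pred == 1) & (y_true == 1))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        fn = np.sum((y_pred == 0) & (y_true == 1))
        points.append((tp / (tp + fn), tp / (tp + fp)))
    return points

y_true = np.array([0, 1, 1, 0, 1])
y_scores = np.array([0.2, 0.9, 0.6, 0.5, 0.4])
pts = pr_curve_points(y_true, y_scores)
# Precision drops to 2/3 at the third threshold, then rises to 0.75 at the
# fourth: the non-monotonic "sawtooth" behavior described above.
```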
PR AUC is the area under the precision-recall curve. It can be interpreted as the average precision of the classifier across all recall levels. A perfect classifier achieves PR AUC = 1.0, meaning it has 100% precision at every level of recall. A random (no-skill) classifier has a PR AUC equal to the prevalence of the positive class in the dataset, which is P / (P + N), where P is the number of positive instances and N is the number of negative instances.
This is a key distinction from ROC AUC. The ROC AUC of a random classifier is always 0.5, regardless of class distribution. The PR AUC baseline shifts depending on the data. For example:
| Positive class prevalence | Random classifier PR AUC | Random classifier ROC AUC |
|---|---|---|
| 50% (balanced) | 0.50 | 0.50 |
| 10% | 0.10 | 0.50 |
| 1% | 0.01 | 0.50 |
| 0.1% | 0.001 | 0.50 |
This means that when interpreting PR AUC values, one must always consider the class distribution. A PR AUC of 0.30 on a dataset where only 1% of examples are positive represents a substantial improvement over random, while the same value on a balanced dataset indicates poor performance.
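The shifting baseline can be verified with a quick simulation. The sketch below generates synthetic labels at 1% prevalence and scores that carry no information about the labels; the sample size and seed are arbitrary choices for this example:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
n = 100_000
prevalence = 0.01
y_true = (rng.random(n) < prevalence).astype(int)
y_scores = rng.random(n)  # random scores: a no-skill classifier

ap = average_precision_score(y_true, y_scores)
roc = roc_auc_score(y_true, y_scores)
# ap lands near the 0.01 prevalence, while roc stays near 0.5
```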
Computing PR AUC is more nuanced than computing ROC AUC because of the non-linear relationship between precision and recall. Several approaches exist, and they can produce different results.
The simplest approach uses the trapezoidal rule to compute the area under the curve by linearly interpolating between adjacent operating points. While this method works well for ROC curves, Davis and Goadrich (2006) showed that linear interpolation in precision-recall space is incorrect and yields overly optimistic estimates of performance.
The problem arises because the relationship between precision and recall is non-linear when expressed in terms of the underlying true positive and false positive counts. Between two operating points on a PR curve, the true interpolation follows a hyperbolic path, not a straight line. A straight line between two PR points cuts through regions of PR space that may not be achievable by any classifier, inflating the estimated area.
In scikit-learn, the sklearn.metrics.auc() function uses this trapezoidal method when given precision and recall arrays. The scikit-learn documentation explicitly warns that this approach "uses linear interpolation and can be too optimistic."
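A quick check with scikit-learn shows that the two summaries of the same curve produce different numbers (the toy labels and scores below are arbitrary):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, auc, average_precision_score

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.75, 0.05, 0.6, 0.9, 0.55])

precision, recall, _ = precision_recall_curve(y_true, y_scores)
trapezoidal = auc(recall, precision)            # linear interpolation
ap = average_precision_score(y_true, y_scores)  # step-function weighting
print(f"trapezoidal={trapezoidal:.4f}  AP={ap:.4f}")
```

On this small sample the two values differ by about a point; the direction and size of the gap depend on the data, which is exactly why the computation method should always be reported.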
Average precision (AP) is the recommended method for summarizing a precision-recall curve. It is computed as the weighted mean of precisions at each threshold, with the increase in recall from the previous threshold used as the weight:
AP = sum over n of (R_n - R_{n-1}) * P_n
where P_n and R_n are the precision and recall at the nth threshold, and (R_0, P_0) = (0, 1).
This formulation uses a step-function (piecewise constant) interpolation rather than linear interpolation. At each threshold where a new positive example is retrieved, precision is evaluated and weighted by the marginal gain in recall. This avoids the overestimation inherent in linear interpolation.
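The step-function formula can be implemented in a few lines. `manual_ap` is a hypothetical helper written for this sketch; on the example data it agrees with scikit-learn's implementation:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score

def manual_ap(y_true, y_scores):
    """AP = sum_n (R_n - R_{n-1}) * P_n with step-function interpolation."""
    precision, recall, _ = precision_recall_curve(y_true, y_scores)
    recall, precision = recall[::-1], precision[::-1]  # make recall increasing
    return float(np.sum(np.diff(recall, prepend=0.0) * precision))

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.75, 0.05, 0.6, 0.9, 0.55])
manual = manual_ap(y_true, y_scores)
sk = average_precision_score(y_true, y_scores)
```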
In scikit-learn, this is implemented as sklearn.metrics.average_precision_score(). The function takes the true labels and predicted scores (probabilities or decision function values) as inputs and returns the average precision.
While AP and PR AUC are conceptually related, they are not identical. AP uses piecewise constant interpolation, while a naive PR AUC calculation uses trapezoidal interpolation. In practice, the two values are often close, but they can diverge noticeably on highly skewed datasets, which is exactly the situation where PR curves are most needed.
Davis and Goadrich (2006) proposed a principled interpolation method specifically designed for precision-recall space. Their approach works by interpolating in terms of the underlying true positive (TP) and false positive (FP) counts rather than directly interpolating precision and recall values.
Between two adjacent operating points (recall_a, precision_a) and (recall_b, precision_b), the method computes the "local skew," defined as the ratio (FP_b - FP_a) / (TP_b - TP_a). This ratio captures how many false positives are introduced for each new true positive as the threshold changes. Intermediate precision-recall points are then generated by incrementing TP one at a time and computing the corresponding FP using the local skew, yielding the correct non-linear interpolation path.
This method produces the most accurate estimate of the true area under the precision-recall curve and is the basis for the AUCCalculator software tool released alongside the Davis and Goadrich paper.
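A minimal sketch of the interpolation step described above, assuming the TP and FP counts at two adjacent operating points are known; the `dg_interpolate` helper and the example counts are invented for illustration:

```python
def dg_interpolate(tp_a, fp_a, tp_b, fp_b, total_pos):
    """Generate Davis-Goadrich intermediate PR points between two
    operating points by stepping TP one at a time and adding false
    positives at the local skew rate."""
    skew = (fp_b - fp_a) / (tp_b - tp_a)  # FPs introduced per new TP
    points = []
    for x in range(tp_b - tp_a + 1):
        tp = tp_a + x
        fp = fp_a + skew * x
        points.append((tp / total_pos, tp / (tp + fp)))  # (recall, precision)
    return points

# Hypothetical operating points A = (TP=5, FP=5) and B = (TP=10, FP=30)
# on a dataset with 10 positives: the path from (0.5, 0.5) to (1.0, 0.25)
# is a curve, not the straight line a trapezoid would assume.
pts = dg_interpolate(5, 5, 10, 30, total_pos=10)
```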
| Method | Interpolation type | Accuracy | Implementation |
|---|---|---|---|
| Trapezoidal rule | Linear | Overestimates (optimistic) | sklearn.metrics.auc(recall, precision) |
| Average precision | Piecewise constant (step function) | Accurate for discrete thresholds | sklearn.metrics.average_precision_score() |
| Davis-Goadrich | Non-linear (TP/FP based) | Most accurate for continuous curves | AUCCalculator, PRROC R package |
Davis and Goadrich (2006) established a fundamental theorem connecting precision-recall space and ROC space.
Dominance theorem: A curve dominates in ROC space if and only if it dominates in precision-recall space. In other words, if classifier A's ROC curve is everywhere above classifier B's ROC curve, then classifier A's PR curve is also everywhere above classifier B's PR curve, and vice versa.
This theorem means that the two representations are consistent in their rankings of classifiers when one clearly dominates the other. However, when ROC curves cross (which happens frequently in practice), the PR curves may provide a different, and often more informative, perspective on the relative strengths of the classifiers.
| Property | PR AUC | ROC AUC |
|----------|--------|---------|
| Axes | Precision vs. Recall | True Positive Rate vs. False Positive Rate |
| Uses true negatives | No | Yes |
| Random baseline | Equals positive class prevalence P/(P+N) | Always 0.5 |
| Monotonicity | Curve is not necessarily monotonic | Curve is monotonically increasing |
| Interpolation | Non-linear (requires special handling) | Linear interpolation is valid |
| Convex hull | "Achievable PR curve" (not a convex hull) | Standard convex hull applies |
| Sensitivity to class imbalance | High (baseline shifts with prevalence) | Low (baseline is fixed) |
| Best suited for | Imbalanced data, positive-class-focused evaluation | Balanced data, overall discrimination |
Saito and Rehmsmeier (2015) demonstrated through simulation studies and real-world re-analysis that PR plots are more informative than ROC plots when evaluating binary classifiers on imbalanced datasets. Their key finding was that ROC plots can be "visually deceptive" because the false positive rate (used on the x-axis of ROC curves) is diluted by the large number of true negatives in imbalanced datasets. This dilution can make a poorly performing classifier appear adequate in ROC space while PR space reveals its deficiencies.
In one striking example from their paper, they re-analyzed the MiRFinder microRNA discovery tool. ROC analysis suggested reasonable performance across multiple tools, but PR analysis exposed that most tools performed barely above random chance in terms of precision, calling into question their practical utility.
However, the choice between PR AUC and ROC AUC is not always straightforward. When both the positive and negative classes matter equally, and when the cost of false positives is comparable to the cost of false negatives, ROC AUC may be more appropriate because it considers all four cells of the confusion matrix.
Because the random baseline for PR AUC depends on class prevalence, absolute PR AUC values cannot be interpreted in isolation. A PR AUC of 0.20 on a dataset with 1% positive prevalence represents a 20-fold improvement over random (baseline 0.01), while a PR AUC of 0.20 on a balanced dataset is worse than random chance. Always compare against the baseline when evaluating PR AUC.
To facilitate comparison across datasets with different class distributions, some practitioners compute a normalized version of PR AUC:
Normalized PR AUC = (PR AUC - baseline) / (1 - baseline)
where baseline = P / (P + N). This rescales the metric so that 0 corresponds to random performance and 1 corresponds to perfect performance, regardless of class prevalence.
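The rescaling is a one-liner; the function name below is a choice made for this sketch:

```python
def normalized_pr_auc(pr_auc, n_pos, n_neg):
    """Rescale PR AUC so that 0 = random and 1 = perfect performance."""
    baseline = n_pos / (n_pos + n_neg)
    return (pr_auc - baseline) / (1 - baseline)

# A raw PR AUC of 0.20 at 1% prevalence (baseline 0.01) normalizes to ~0.19,
# while the same 0.20 on a balanced dataset (baseline 0.50) is below random.
high_imbalance = normalized_pr_auc(0.20, 10, 990)
balanced = normalized_pr_auc(0.20, 500, 500)
```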
The following table provides rough interpretation guidelines, assuming the PR AUC is compared against the appropriate baseline:
| Normalized PR AUC range | Interpretation |
|---|---|
| 0.90 - 1.00 | Excellent discrimination |
| 0.70 - 0.90 | Good discrimination |
| 0.50 - 0.70 | Moderate discrimination |
| 0.30 - 0.50 | Fair discrimination |
| 0.00 - 0.30 | Poor discrimination |
| < 0.00 | Worse than random |
Boyd, Eng, and Page (2013) developed methods for computing point estimates and confidence intervals for PR AUC. Their approach uses stratified bootstrap resampling to generate confidence intervals that account for the non-linear properties of precision-recall space. Reporting confidence intervals alongside PR AUC values is considered best practice, especially when comparing classifiers or when the test set is small.
A common approach is the stratified bootstrap: resample the positive and negative instances of the test set separately with replacement (preserving the class ratio), recompute average precision on each resample, and take the 2.5th and 97.5th percentiles of the resulting distribution as a 95% confidence interval.
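A sketch of a stratified bootstrap confidence interval for average precision; the function name, defaults, and toy data are choices made for this example, not the exact procedure of Boyd et al.:

```python
import numpy as np
from sklearn.metrics import average_precision_score

def bootstrap_pr_auc_ci(y_true, y_scores, n_boot=2000, alpha=0.05, seed=0):
    """Stratified bootstrap percentile CI for average precision."""
    rng = np.random.default_rng(seed)
    pos = np.where(y_true == 1)[0]
    neg = np.where(y_true == 0)[0]
    aps = []
    for _ in range(n_boot):
        # Resample positives and negatives separately to preserve prevalence
        idx = np.concatenate([rng.choice(pos, size=len(pos), replace=True),
                              rng.choice(neg, size=len(neg), replace=True)])
        aps.append(average_precision_score(y_true[idx], y_scores[idx]))
    lo, hi = np.percentile(aps, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.75, 0.05, 0.6, 0.9, 0.55])
lo, hi = bootstrap_pr_auc_ci(y_true, y_scores, n_boot=500)
```

With a test set this small the interval is very wide, which is itself a useful diagnostic.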
The F1 score is the harmonic mean of precision and recall:
F1 = 2 * (Precision * Recall) / (Precision + Recall)
While PR AUC summarizes performance across all thresholds, the F1 score evaluates performance at a single threshold. Each point on the precision-recall curve corresponds to a particular F1 score, and curves of constant F1 (called iso-F1 curves) form hyperbolic arcs in PR space.
The maximum F1 score achievable by a classifier can be read from its precision-recall curve by finding the point that is tangent to the highest iso-F1 curve. PR AUC provides a more complete picture because it considers all possible precision-recall trade-offs, while F1 captures only one.
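Finding the maximum F1 score and its threshold from scikit-learn's curve outputs can be sketched as follows (the toy data is arbitrary):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.75, 0.05, 0.6, 0.9, 0.55])

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
# Drop the final (recall=0, precision=1) point, which has no threshold
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1])
best = np.argmax(f1)
best_threshold = thresholds[best]
```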
The F-beta score generalizes F1 by allowing different weights for precision and recall:
F_beta = (1 + beta^2) * (Precision * Recall) / (beta^2 * Precision + Recall)
When beta > 1, recall is weighted more heavily. When beta < 1, precision is weighted more heavily. The choice of beta reflects the relative importance of false negatives versus false positives in a given application.
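A direct transcription of the formula, with toy numbers chosen to show the effect of beta on a classifier whose recall is its weak side:

```python
def f_beta(precision, recall, beta):
    """F-beta score: beta > 1 favors recall, beta < 1 favors precision."""
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

p, r = 0.8, 0.4
f1 = f_beta(p, r, 1.0)    # plain F1
f2 = f_beta(p, r, 2.0)    # recall-weighted: penalized, recall is weak here
f_half = f_beta(p, r, 0.5)  # precision-weighted: rewarded, precision is strong
```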
PR AUC is widely used across domains where the positive class is rare or where identifying positive instances is the primary goal.
In clinical settings, disease screening models must detect conditions that affect a small fraction of the population. For example, a cancer screening model might operate on a dataset where only 0.5% of patients actually have cancer. ROC AUC would give substantial credit for correctly classifying the 99.5% of healthy patients, potentially masking the model's failure to detect actual cancer cases. PR AUC focuses on how well the model identifies diseased patients (recall) while minimizing false alarms (precision), making it more informative for clinical decision-making.
Financial institutions use classifiers to detect fraudulent transactions, which typically constitute less than 1% of all transactions. The cost of missing a fraudulent transaction (false negative) can be high, but flagging too many legitimate transactions (false positives) creates operational overhead and customer friction. PR AUC helps evaluate how well a model balances detecting fraud against avoiding false alarms in this inherently imbalanced setting.
PR AUC has deep roots in information retrieval, where it evaluates how well a search engine retrieves relevant documents. In a typical search task, only a tiny fraction of all documents in a corpus are relevant to a given query. Precision measures the relevance of returned results, and recall measures the completeness of retrieval. The average precision metric, which is closely related to PR AUC, is the foundation of the mean average precision (MAP) metric widely used in information retrieval evaluation.
In computer vision, object detection models are evaluated using mean average precision (mAP), which extends the concept of PR AUC to multiple object classes. For each class, a precision-recall curve is computed based on the overlap (Intersection over Union, or IoU) between predicted and ground-truth bounding boxes. The AP for each class is computed, and mAP is the mean across all classes.
The COCO benchmark uses a 101-point interpolated AP definition and averages over 10 IoU thresholds (from 0.50 to 0.95 in steps of 0.05) and 80 object categories. This evaluation protocol has become the standard for comparing object detection models.
In named entity recognition, relation extraction, and other structured prediction tasks in NLP, the entity or relation of interest is often sparse relative to the total text. PR AUC provides a threshold-independent measure of how well models identify these rare structures.
PR AUC is commonly used to evaluate classifiers for gene prediction, protein function annotation, and variant pathogenicity scoring, where the positive class (e.g., pathogenic variants) is typically much rarer than the negative class (benign variants). Saito and Rehmsmeier (2015) found that 66.7% of surveyed bioinformatics studies using SVMs on genome-wide datasets relied on ROC evaluation, while only 12.1% used PR curves, despite the class imbalance inherent in most genomic datasets.
Sofaer, Hoeting, and Jarnevich (2019) advocated for using PR AUC in species distribution modeling, where the goal is to predict the presence of (typically rare) species across geographic areas. They showed that PR AUC is robust to changes in geographic extent and the number of background points sampled, problems that can bias ROC AUC in ecological applications.
PR AUC is fundamentally a binary classification metric, but it can be extended to multi-class and multi-label settings through averaging strategies.
In the one-vs-rest approach, each class is treated as the positive class in turn, with all other classes grouped as the negative class. A separate PR curve and AP score are computed for each class. This produces a per-class breakdown that reveals which classes the model handles well and which ones it struggles with.
Micro-averaging pools all true positives, false positives, and false negatives across all classes before computing a single precision-recall curve. This gives equal weight to each prediction rather than each class. Micro-averaged PR AUC tends to be dominated by the performance on more prevalent classes.
In scikit-learn, micro-averaging is achieved by flattening the label indicator matrix and treating each element as a binary prediction:
from sklearn.metrics import average_precision_score
average_precision_score(Y_test, y_score, average="micro")
Macro-averaging computes PR AUC independently for each class and then takes the arithmetic mean. This gives equal weight to each class regardless of its prevalence, which can be useful when all classes are equally important even if some are rare.
| Strategy | Weight given to | Best when |
|---|---|---|
| Micro-average | Each instance | Class sizes vary and frequent classes matter more |
| Macro-average | Each class | All classes are equally important |
| Weighted average | Each class, proportional to support | A compromise between micro and macro |
| Per-class | Individual class analysis | Detailed per-class diagnostics are needed |
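The averaging strategies can be compared on a small synthetic multi-label example; the indicator and score matrices below are invented for illustration, with the third class both rarer and harder than the others:

```python
import numpy as np
from sklearn.metrics import average_precision_score

# 6 samples, 3 classes; class 2 has a single positive, ranked third
Y_true = np.array([[1, 0, 0],
                   [1, 1, 0],
                   [1, 0, 0],
                   [0, 1, 1],
                   [1, 0, 0],
                   [0, 1, 0]])
Y_score = np.array([[0.9, 0.2, 0.10],
                    [0.8, 0.7, 0.20],
                    [0.7, 0.1, 0.30],
                    [0.2, 0.8, 0.25],
                    [0.6, 0.3, 0.15],
                    [0.3, 0.9, 0.40]])

micro = average_precision_score(Y_true, Y_score, average="micro")
macro = average_precision_score(Y_true, Y_score, average="macro")
per_class = average_precision_score(Y_true, Y_score, average=None)
# Classes 0 and 1 are ranked perfectly (AP = 1.0); class 2 scores only 1/3,
# so the macro average drops to 7/9 while micro is dominated by the two
# frequent, well-handled classes.
```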
Scikit-learn provides multiple functions for computing precision-recall curves and PR AUC.
Computing average precision (recommended):
import numpy as np
from sklearn.metrics import average_precision_score
# True labels (0 or 1) and predicted probabilities
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.75, 0.05, 0.6, 0.9, 0.55])
ap = average_precision_score(y_true, y_scores)
print(f"Average Precision: {ap:.4f}")
Computing the precision-recall curve and trapezoidal AUC:
from sklearn.metrics import precision_recall_curve, auc
precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
pr_auc = auc(recall, precision)
print(f"PR AUC (trapezoidal): {pr_auc:.4f}")
Plotting the precision-recall curve:
from sklearn.metrics import PrecisionRecallDisplay
import matplotlib.pyplot as plt
disp = PrecisionRecallDisplay.from_predictions(
y_true, y_scores, name="Classifier", plot_chance_level=True
)
disp.ax_.set_title("Precision-Recall Curve")
plt.show()
The plot_chance_level=True argument draws the random baseline (a horizontal line at y = prevalence), which is helpful for visually assessing whether the classifier performs better than random.
The PRROC package in R, developed by Grau, Grosse, and Keilwagen (2015), provides functions for computing both PR and ROC curves with proper interpolation following the Davis-Goadrich method:
library(PRROC)
scores_pos <- c(0.9, 0.8, 0.75, 0.65, 0.55)
scores_neg <- c(0.4, 0.35, 0.3, 0.2, 0.1)
result <- pr.curve(scores.class0 = scores_pos,
scores.class1 = scores_neg,
curve = TRUE)
print(result$auc.davis.goadrich)
plot(result)
In TensorFlow and Keras, PR AUC can be tracked during training using the tf.keras.metrics.AUC metric with the curve parameter set to "PR":
import tensorflow as tf
model.compile(
optimizer="adam",
loss="binary_crossentropy",
metrics=[tf.keras.metrics.AUC(curve="PR", name="pr_auc")]
)
Despite its advantages for imbalanced data, PR AUC has several important limitations.
Because neither precision nor recall uses true negatives, PR AUC completely ignores the model's ability to correctly identify negative instances. In applications where correctly classifying negatives matters (e.g., ensuring that healthy patients are not subjected to unnecessary treatment), PR AUC alone is insufficient. In such cases, combining PR AUC with specificity-based metrics or ROC AUC provides a more complete evaluation.
Unlike ROC AUC, which has a clean probabilistic interpretation (the probability that a randomly chosen positive instance is scored higher than a randomly chosen negative instance), PR AUC lacks a similarly intuitive interpretation. This can make it harder to explain to non-technical stakeholders.
The shifting baseline of PR AUC means that values cannot be directly compared across datasets with different class distributions. A PR AUC of 0.40 on one dataset is not equivalent to a PR AUC of 0.40 on another unless both have the same positive class prevalence. Normalized PR AUC addresses this but is not commonly reported.
Not every point in precision-recall space corresponds to an achievable classifier. Boyd and Page (2012) showed that certain regions of PR space are unachievable for any classifier on a given dataset. This is unlike ROC space, where every point is achievable. When averaging PR curves across cross-validation folds with different class distributions, failing to account for these unachievable regions can lead to misleading results.
As discussed in the computation section, naive linear interpolation in PR space produces incorrect results. Practitioners who are unaware of this issue may compute overly optimistic PR AUC values. While average precision avoids this problem, confusion between the different computation methods persists in practice.
PR curves are not guaranteed to be monotonically decreasing. As the threshold changes, precision can increase or decrease, creating a "sawtooth" pattern. This non-monotonicity makes PR curves harder to interpret visually than ROC curves and complicates the definition of dominance between classifiers.
Use average precision instead of trapezoidal PR AUC. The average_precision_score function in scikit-learn provides a more accurate estimate than computing PR AUC with the trapezoidal rule.
Always report the class distribution. Because PR AUC's baseline depends on prevalence, reporting the positive class fraction alongside the PR AUC value is necessary for proper interpretation.
Compare against the random baseline. A horizontal line at y = P/(P+N) in PR space represents random performance. Any useful classifier should have a PR curve substantially above this line.
Report confidence intervals. Bootstrap confidence intervals help determine whether differences in PR AUC between models are statistically meaningful.
Plot the full curve, not just the summary statistic. Two classifiers can have identical PR AUC values but very different curves. One might have high precision at low recall, while the other has moderate precision across all recall levels. The full curve reveals these differences.
Consider PR AUC alongside other metrics. PR AUC is most useful when combined with ROC AUC, F1 score, or other metrics. No single metric captures all aspects of classifier performance.
Be aware of the computation method. Verify which interpolation method your software uses. Different libraries may produce different PR AUC values for the same data.
Use proper interpolation for visualization. When plotting PR curves, use step interpolation (step or post style in matplotlib) rather than linear interpolation to accurately represent the discrete nature of the curve.
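A minimal sketch of step-style plotting with matplotlib; the Agg backend and output filename are choices made for this scripted example:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted use
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.75, 0.05, 0.6, 0.9, 0.55])
precision, recall, _ = precision_recall_curve(y_true, y_scores)

fig, ax = plt.subplots()
# Step interpolation ("post") reflects the discrete threshold structure;
# a plain line plot would draw misleading linear segments between points.
ax.step(recall, precision, where="post")
ax.set_xlabel("Recall")
ax.set_ylabel("Precision")
fig.savefig("pr_curve.png")
```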
Imagine you are playing a game where you have to find gold coins hidden in a big sandbox full of regular rocks. When you dig something up, two things matter. First, how many of the things you dug up are actually gold coins and not just rocks? That is precision. Second, out of all the gold coins hidden in the sandbox, how many did you find? That is recall.
Now, you can be really careful and only dig where you are very sure there is gold. You will get mostly gold coins (high precision), but you will miss a lot of them (low recall). Or you can dig everywhere, finding all the gold coins (high recall), but also pulling up tons of rocks (low precision).
PR AUC measures how good you are at this game overall. It looks at all the different ways you could play (being very careful, being somewhat careful, or digging everywhere) and gives you one score for how well you balance finding gold without grabbing too many rocks. A higher score means you are better at the game.