See also: precision, recall, ROC curve, AUC, F1 score, confusion matrix, precision-recall curve
PR AUC (Precision-Recall Area Under the Curve), also referred to as AUPRC or AUC-PR, is a classification evaluation metric that quantifies the area beneath a precision-recall curve. The precision-recall curve plots precision on the y-axis against recall on the x-axis at varying classification thresholds. PR AUC compresses this entire curve into a single scalar value between 0 and 1, where higher values indicate better classifier performance at distinguishing the positive class from the negative class.
Unlike ROC AUC, which plots the true positive rate against the false positive rate, PR AUC focuses exclusively on the positive class. It does not take true negatives into account. This property makes PR AUC particularly well-suited for evaluating classifiers on imbalanced datasets where the positive class is rare, such as fraud detection, disease screening, or information retrieval tasks. On such datasets, ROC AUC can present an overly optimistic picture of performance because the large number of true negatives keeps the false positive rate (FP / (FP + TN)) low, making even a mediocre classifier appear strong.
The precision-recall framework has roots in information retrieval, where researchers have measured precision and recall since at least the 1960s to evaluate search and document retrieval systems. The formalization of the relationship between PR curves and ROC curves in machine learning was significantly advanced by Davis and Goadrich in their 2006 paper "The Relationship Between Precision-Recall and ROC Curves," which established key theorems connecting the two evaluation spaces and demonstrated proper interpolation methods for PR curves.
Before defining PR AUC, it is necessary to define its two component metrics. Both are derived from the confusion matrix.
Precision (also called positive predictive value) measures the fraction of positive predictions that are actually correct:
Precision = TP / (TP + FP)
Recall (also called sensitivity or true positive rate) measures the fraction of actual positive instances that the model correctly identified:
Recall = TP / (TP + FN)
Where:
| Symbol | Name | Meaning |
|--------|------|---------|
| TP | True Positive | A positive instance correctly predicted as positive |
| FP | False Positive | A negative instance incorrectly predicted as positive |
| FN | False Negative | A positive instance incorrectly predicted as negative |
| TN | True Negative | A negative instance correctly predicted as negative |
Notice that neither precision nor recall uses TN (true negatives). This is the fundamental reason PR AUC behaves differently from ROC AUC on imbalanced data.
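As a concrete illustration, both metrics can be computed directly from label and prediction arrays. The `precision_recall` helper below is a hypothetical function written for this sketch, not part of any library:

```python
import numpy as np

def precision_recall(y_true, y_pred):
    """Compute precision and recall from binary labels and predictions."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])
p, r = precision_recall(y_true, y_pred)
# Here TP = 3, FP = 1, FN = 1, so precision = 0.75 and recall = 0.75;
# the four TN instances play no role in either value.
```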
A precision-recall curve is constructed by varying the decision threshold of a binary classifier that outputs continuous scores or probabilities. At each threshold, instances with scores above the threshold are predicted as positive, and precision and recall are computed for that threshold. Plotting all (recall, precision) pairs produces the curve.
As the threshold decreases, more instances are predicted as positive. Recall generally increases (more true positives are captured), but precision may decrease (more false positives are introduced). The resulting curve typically slopes downward from left to right, though unlike the ROC curve, the precision-recall curve is not guaranteed to be monotonic. Precision can increase or decrease as the threshold changes, leading to the characteristic "sawtooth" shape often observed in PR curves.
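The threshold sweep described above can be sketched in a few lines; `pr_curve_points` is a hypothetical helper written for illustration (scikit-learn's `precision_recall_curve` does this in practice):

```python
import numpy as np

def pr_curve_points(y_true, y_scores):
    """Sweep each distinct score as a threshold; collect (recall, precision)."""
    points = []
    for t in sorted(set(y_scores), reverse=True):
        y_pred = (y_scores >= t).astype(int)
        tp = np.sum((y_pred == 1) & (y_true == 1))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        fn = np.sum((y_pred == 0) & (y_true == 1))
        points.append((tp / (tp + fn), tp / (tp + fp)))
    return points

y_true = np.array([0, 1, 1, 0, 1])
y_scores = np.array([0.2, 0.9, 0.6, 0.5, 0.4])
pts = pr_curve_points(y_true, y_scores)
# Precision drops to 2/3 at the third threshold, then rises to 0.75 at the
# fourth: the non-monotonic "sawtooth" behavior described above.
```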
PR AUC is the area under the precision-recall curve. It can be interpreted as the average precision of the classifier across all recall levels. A perfect classifier achieves PR AUC = 1.0, meaning it has 100% precision at every level of recall. A random (no-skill) classifier has a PR AUC equal to the prevalence of the positive class in the dataset, which is P / (P + N), where P is the number of positive instances and N is the number of negative instances.
This is a key distinction from ROC AUC. The ROC AUC of a random classifier is always 0.5, regardless of class distribution. The PR AUC baseline shifts depending on the data. For example:
| Positive class prevalence | Random classifier PR AUC | Random classifier ROC AUC |
|---|---|---|
| 50% (balanced) | 0.50 | 0.50 |
| 10% | 0.10 | 0.50 |
| 1% | 0.01 | 0.50 |
| 0.1% | 0.001 | 0.50 |
This means that when interpreting PR AUC values, one must always consider the class distribution. A PR AUC of 0.30 on a dataset where only 1% of examples are positive represents a substantial improvement over random, while the same value on a balanced dataset indicates poor performance.
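The shifting baseline can be verified with a quick simulation. The sketch below generates synthetic labels at 1% prevalence and scores that carry no information about the labels; the sample size and seed are arbitrary choices for this example:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
n = 100_000
prevalence = 0.01
y_true = (rng.random(n) < prevalence).astype(int)
y_scores = rng.random(n)  # random scores: a no-skill classifier

ap = average_precision_score(y_true, y_scores)
roc = roc_auc_score(y_true, y_scores)
# ap lands near the 0.01 prevalence, while roc stays near 0.5
```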
Computing PR AUC is more nuanced than computing ROC AUC because of the non-linear relationship between precision and recall. Several approaches exist, and they can produce different results.
The simplest approach uses the trapezoidal rule to compute the area under the curve by linearly interpolating between adjacent operating points. While this method works well for ROC curves, Davis and Goadrich (2006) showed that linear interpolation in precision-recall space is incorrect and yields overly optimistic estimates of performance.
The problem arises because the relationship between precision and recall is non-linear when expressed in terms of the underlying true positive and false positive counts. Between two operating points on a PR curve, the true interpolation follows a hyperbolic path, not a straight line. A straight line between two PR points cuts through regions of PR space that may not be achievable by any classifier, inflating the estimated area.
In scikit-learn, the sklearn.metrics.auc() function uses this trapezoidal method when given precision and recall arrays. The scikit-learn documentation explicitly warns that this approach "uses linear interpolation and can be too optimistic."
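A quick check with scikit-learn shows that the two summaries of the same curve produce different numbers (the toy labels and scores below are arbitrary):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, auc, average_precision_score

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.75, 0.05, 0.6, 0.9, 0.55])

precision, recall, _ = precision_recall_curve(y_true, y_scores)
trapezoidal = auc(recall, precision)            # linear interpolation
ap = average_precision_score(y_true, y_scores)  # step-function weighting
print(f"trapezoidal={trapezoidal:.4f}  AP={ap:.4f}")
```

On this small sample the two values differ by about a point; the direction and size of the gap depend on the data, which is exactly why the computation method should always be reported.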
Average precision (AP) is the recommended method for summarizing a precision-recall curve. It is computed as the weighted mean of precisions at each threshold, with the increase in recall from the previous threshold used as the weight:
AP = sum over n of (R_n - R_{n-1}) * P_n
where P_n and R_n are the precision and recall at the nth threshold, and (R_0, P_0) = (0, 1).
This formulation uses a step-function (piecewise constant) interpolation rather than linear interpolation. At each threshold where a new positive example is retrieved, precision is evaluated and weighted by the marginal gain in recall. This avoids the overestimation inherent in linear interpolation.
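The step-function formula can be implemented in a few lines. `manual_ap` is a hypothetical helper written for this sketch; on the example data it agrees with scikit-learn's implementation:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score

def manual_ap(y_true, y_scores):
    """AP = sum_n (R_n - R_{n-1}) * P_n with step-function interpolation."""
    precision, recall, _ = precision_recall_curve(y_true, y_scores)
    recall, precision = recall[::-1], precision[::-1]  # make recall increasing
    return float(np.sum(np.diff(recall, prepend=0.0) * precision))

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.75, 0.05, 0.6, 0.9, 0.55])
manual = manual_ap(y_true, y_scores)
sk = average_precision_score(y_true, y_scores)
```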
In scikit-learn, this is implemented as sklearn.metrics.average_precision_score(). The function takes the true labels and predicted scores (probabilities or decision function values) as inputs and returns the average precision.
While AP and PR AUC are conceptually related, they are not identical. AP uses piecewise constant interpolation, while a naive PR AUC calculation uses trapezoidal interpolation. In practice, the two values are often close, but they can diverge noticeably on highly skewed datasets, which is exactly the situation where PR curves are most needed.
Davis and Goadrich (2006) proposed a principled interpolation method specifically designed for precision-recall space. Their approach works by interpolating in terms of the underlying true positive (TP) and false positive (FP) counts rather than directly interpolating precision and recall values.
Between two adjacent operating points (recall_a, precision_a) and (recall_b, precision_b), the method computes the "local skew," defined as the ratio (FP_b - FP_a) / (TP_b - TP_a). This ratio captures how many false positives are introduced for each new true positive as the threshold changes. Intermediate precision-recall points are then generated by incrementing TP one at a time and computing the corresponding FP using the local skew, yielding the correct non-linear interpolation path.
This method produces the most accurate estimate of the true area under the precision-recall curve and is the basis for the AUCCalculator software tool released alongside the Davis and Goadrich paper.
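A minimal sketch of the interpolation step described above, assuming the TP and FP counts at two adjacent operating points are known; the `dg_interpolate` helper and the example counts are invented for illustration:

```python
def dg_interpolate(tp_a, fp_a, tp_b, fp_b, total_pos):
    """Generate Davis-Goadrich intermediate PR points between two
    operating points by stepping TP one at a time and adding false
    positives at the local skew rate."""
    skew = (fp_b - fp_a) / (tp_b - tp_a)  # FPs introduced per new TP
    points = []
    for x in range(tp_b - tp_a + 1):
        tp = tp_a + x
        fp = fp_a + skew * x
        points.append((tp / total_pos, tp / (tp + fp)))  # (recall, precision)
    return points

# Hypothetical operating points A = (TP=5, FP=5) and B = (TP=10, FP=30)
# on a dataset with 10 positives: the path from (0.5, 0.5) to (1.0, 0.25)
# is a curve, not the straight line a trapezoid would assume.
pts = dg_interpolate(5, 5, 10, 30, total_pos=10)
```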
| Method | Interpolation type | Accuracy | Implementation |
|---|---|---|---|
| Trapezoidal rule | Linear | Overestimates (optimistic) | sklearn.metrics.auc(recall, precision) |
| Average precision | Piecewise constant (step function) | Accurate for discrete thresholds | sklearn.metrics.average_precision_score() |
| Davis-Goadrich | Non-linear (TP/FP based) | Most accurate for continuous curves | AUCCalculator, PRROC R package |
Davis and Goadrich (2006) established a fundamental theorem connecting precision-recall space and ROC space.
Dominance theorem: A curve dominates in ROC space if and only if it dominates in precision-recall space. In other words, if classifier A's ROC curve is everywhere above classifier B's ROC curve, then classifier A's PR curve is also everywhere above classifier B's PR curve, and vice versa.
This theorem means that the two representations are consistent in their rankings of classifiers when one clearly dominates the other. However, when ROC curves cross (which happens frequently in practice), the PR curves may provide a different, and often more informative, perspective on the relative strengths of the classifiers.
| Property | PR AUC | ROC AUC |
|----------|--------|---------|
| Axes | Precision vs. Recall | True Positive Rate vs. False Positive Rate |
| Uses true negatives | No | Yes |
| Random baseline | Equals positive class prevalence P/(P+N) | Always 0.5 |
| Monotonicity | Curve is not necessarily monotonic | Curve is monotonically increasing |
| Interpolation | Non-linear (requires special handling) | Linear interpolation is valid |
| Convex hull | "Achievable PR curve" (not a convex hull) | Standard convex hull applies |
| Sensitivity to class imbalance | High (baseline shifts with prevalence) | Low (baseline is fixed) |
| Best suited for | Imbalanced data, positive-class-focused evaluation | Balanced data, overall discrimination |
Saito and Rehmsmeier (2015) demonstrated through simulation studies and real-world re-analysis that PR plots are more informative than ROC plots when evaluating binary classifiers on imbalanced datasets. Their key finding was that ROC plots can be "visually deceptive" because the false positive rate (used on the x-axis of ROC curves) is diluted by the large number of true negatives in imbalanced datasets. This dilution can make a poorly performing classifier appear adequate in ROC space while PR space reveals its deficiencies.
In one striking example from their paper, they re-analyzed the MiRFinder microRNA discovery tool. ROC analysis suggested reasonable performance across multiple tools, but PR analysis exposed that most tools performed barely above random chance in terms of precision, calling into question their practical utility.
However, the choice between PR AUC and ROC AUC is not always straightforward. When both the positive and negative classes matter equally, and when the cost of false positives is comparable to the cost of false negatives, ROC AUC may be more appropriate because it considers all four cells of the confusion matrix.
Because the random baseline for PR AUC depends on class prevalence, absolute PR AUC values cannot be interpreted in isolation. A PR AUC of 0.20 on a dataset with 1% positive prevalence represents a 20-fold improvement over random (baseline 0.01), while a PR AUC of 0.20 on a balanced dataset is worse than random chance. Always compare against the baseline when evaluating PR AUC.
To facilitate comparison across datasets with different class distributions, some practitioners compute a normalized version of PR AUC:
Normalized PR AUC = (PR AUC - baseline) / (1 - baseline)
where baseline = P / (P + N). This rescales the metric so that 0 corresponds to random performance and 1 corresponds to perfect performance, regardless of class prevalence.
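The rescaling is a one-liner; the function name below is a choice made for this sketch:

```python
def normalized_pr_auc(pr_auc, n_pos, n_neg):
    """Rescale PR AUC so that 0 = random and 1 = perfect performance."""
    baseline = n_pos / (n_pos + n_neg)
    return (pr_auc - baseline) / (1 - baseline)

# A raw PR AUC of 0.20 at 1% prevalence (baseline 0.01) normalizes to ~0.19,
# while the same 0.20 on a balanced dataset (baseline 0.50) is below random.
high_imbalance = normalized_pr_auc(0.20, 10, 990)
balanced = normalized_pr_auc(0.20, 500, 500)
```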
The following table provides rough interpretation guidelines, assuming the PR AUC is compared against the appropriate baseline:
| Normalized PR AUC range | Interpretation |
|---|---|
| 0.90 - 1.00 | Excellent discrimination |
| 0.70 - 0.90 | Good discrimination |
| 0.50 - 0.70 | Moderate discrimination |
| 0.30 - 0.50 | Fair discrimination |
| 0.00 - 0.30 | Poor discrimination |
| < 0.00 | Worse than random |
Boyd, Eng, and Page (2013) developed methods for computing point estimates and confidence intervals for PR AUC. Their approach uses stratified bootstrap resampling to generate confidence intervals that account for the non-linear properties of precision-recall space. Reporting confidence intervals alongside PR AUC values is considered best practice, especially when comparing classifiers or when the test set is small.
A common approach is the stratified bootstrap: resample the positive and negative instances of the test set separately with replacement (preserving the class ratio), recompute average precision on each resample, and take the 2.5th and 97.5th percentiles of the resulting distribution as a 95% confidence interval.
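A sketch of a stratified bootstrap confidence interval for average precision; the function name, defaults, and toy data are choices made for this example, not the exact procedure of Boyd et al.:

```python
import numpy as np
from sklearn.metrics import average_precision_score

def bootstrap_pr_auc_ci(y_true, y_scores, n_boot=2000, alpha=0.05, seed=0):
    """Stratified bootstrap percentile CI for average precision."""
    rng = np.random.default_rng(seed)
    pos = np.where(y_true == 1)[0]
    neg = np.where(y_true == 0)[0]
    aps = []
    for _ in range(n_boot):
        # Resample positives and negatives separately to preserve prevalence
        idx = np.concatenate([rng.choice(pos, size=len(pos), replace=True),
                              rng.choice(neg, size=len(neg), replace=True)])
        aps.append(average_precision_score(y_true[idx], y_scores[idx]))
    lo, hi = np.percentile(aps, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.75, 0.05, 0.6, 0.9, 0.55])
lo, hi = bootstrap_pr_auc_ci(y_true, y_scores, n_boot=500)
```

With a test set this small the interval is very wide, which is itself a useful diagnostic.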
The F1 score is the harmonic mean of precision and recall:
F1 = 2 * (Precision * Recall) / (Precision + Recall)
While PR AUC summarizes performance across all thresholds, the F1 score evaluates performance at a single threshold. Each point on the precision-recall curve corresponds to a particular F1 score, and curves of constant F1 (called iso-F1 curves) form hyperbolic arcs in PR space.
The maximum F1 score achievable by a classifier can be read from its precision-recall curve by finding the point that is tangent to the highest iso-F1 curve. PR AUC provides a more complete picture because it considers all possible precision-recall trade-offs, while F1 captures only one.
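Finding the maximum F1 score and its threshold from scikit-learn's curve outputs can be sketched as follows (the toy data is arbitrary):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.75, 0.05, 0.6, 0.9, 0.55])

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
# Drop the final (recall=0, precision=1) point, which has no threshold
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1])
best = np.argmax(f1)
best_threshold = thresholds[best]
```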
The F-beta score generalizes F1 by allowing different weights for precision and recall:
F_beta = (1 + beta^2) * (Precision * Recall) / (beta^2 * Precision + Recall)
When beta > 1, recall is weighted more heavily. When beta < 1, precision is weighted more heavily. The choice of beta reflects the relative importance of false negatives versus false positives in a given application.
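A direct transcription of the formula, with toy numbers chosen to show the effect of beta on a classifier whose recall is its weak side:

```python
def f_beta(precision, recall, beta):
    """F-beta score: beta > 1 favors recall, beta < 1 favors precision."""
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

p, r = 0.8, 0.4
f1 = f_beta(p, r, 1.0)    # plain F1
f2 = f_beta(p, r, 2.0)    # recall-weighted: penalized, recall is weak here
f_half = f_beta(p, r, 0.5)  # precision-weighted: rewarded, precision is strong
```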
PR AUC is widely used across domains where the positive class is rare or where identifying positive instances is the primary goal.
In clinical settings, disease screening models must detect conditions that affect a small fraction of the population. For example, a cancer screening model might operate on a dataset where only 0.5% of patients actually have cancer. ROC AUC would give substantial credit for correctly classifying the 99.5% of healthy patients, potentially masking the model's failure to detect actual cancer cases. PR AUC focuses on how well the model identifies diseased patients (recall) while minimizing false alarms (precision), making it more informative for clinical decision-making.
Financial institutions use classifiers to detect fraudulent transactions, which typically constitute less than 1% of all transactions. The cost of missing a fraudulent transaction (false negative) can be high, but flagging too many legitimate transactions (false positives) creates operational overhead and customer friction. PR AUC helps evaluate how well a model balances detecting fraud against avoiding false alarms in this inherently imbalanced setting.
PR AUC has deep roots in information retrieval, where it evaluates how well a search engine retrieves relevant documents. In a typical search task, only a tiny fraction of all documents in a corpus are relevant to a given query. Precision measures the relevance of returned results, and recall measures the completeness of retrieval. The average precision metric, which is closely related to PR AUC, is the foundation of the mean average precision (MAP) metric widely used in information retrieval evaluation.
In computer vision, object detection models are evaluated using mean average precision (mAP), which extends the concept of PR AUC to multiple object classes. For each class, a precision-recall curve is computed based on the overlap (Intersection over Union, or IoU) between predicted and ground-truth bounding boxes. The AP for each class is computed, and mAP is the mean across all classes.
The COCO benchmark uses a 101-point interpolated AP definition and averages over 10 IoU thresholds (from 0.50 to 0.95 in steps of 0.05) and 80 object categories. This evaluation protocol has become the standard for comparing object detection models.
In named entity recognition, relation extraction, and other structured prediction tasks in NLP, the entity or relation of interest is often sparse relative to the total text. PR AUC provides a threshold-independent measure of how well models identify these rare structures.
PR AUC is commonly used to evaluate classifiers for gene prediction, protein function annotation, and variant pathogenicity scoring, where the positive class (e.g., pathogenic variants) is typically much rarer than the negative class (benign variants). Saito and Rehmsmeier (2015) found that 66.7% of surveyed bioinformatics studies using SVMs on genome-wide datasets relied on ROC evaluation, while only 12.1% used PR curves, despite the class imbalance inherent in most genomic datasets.
Sofaer, Hoeting, and Jarnevich (2019) advocated for using PR AUC in species distribution modeling, where the goal is to predict the presence of (typically rare) species across geographic areas. They showed that PR AUC is robust to changes in geographic extent and the number of background points sampled, problems that can bias ROC AUC in ecological applications.
PR AUC is fundamentally a binary classification metric, but it can be extended to multi-class and multi-label settings through averaging strategies.
In the one-vs-rest approach, each class is treated as the positive class in turn, with all other classes grouped as the negative class. A separate PR curve and AP score are computed for each class. This produces a per-class breakdown that reveals which classes the model handles well and which ones it struggles with.
Micro-averaging pools all true positives, false positives, and false negatives across all classes before computing a single precision-recall curve. This gives equal weight to each prediction rather than each class. Micro-averaged PR AUC tends to be dominated by the performance on more prevalent classes.
In scikit-learn, micro-averaging is achieved by flattening the label indicator matrix and treating each element as a binary prediction:
from sklearn.metrics import average_precision_score
average_precision_score(Y_test, y_score, average="micro")
Macro-averaging computes PR AUC independently for each class and then takes the arithmetic mean. This gives equal weight to each class regardless of its prevalence, which can be useful when all classes are equally important even if some are rare.
| Strategy | Weight given to | Best when |
|---|---|---|
| Micro-average | Each instance | Class sizes vary and frequent classes matter more |
| Macro-average | Each class | All classes are equally important |
| Weighted average | Each class, proportional to support | A compromise between micro and macro |
| Per-class | Individual class analysis | Detailed per-class diagnostics are needed |
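The averaging strategies can be compared on a small synthetic multi-label example; the indicator and score matrices below are invented for illustration, with the third class both rarer and harder than the others:

```python
import numpy as np
from sklearn.metrics import average_precision_score

# 6 samples, 3 classes; class 2 has a single positive, ranked third
Y_true = np.array([[1, 0, 0],
                   [1, 1, 0],
                   [1, 0, 0],
                   [0, 1, 1],
                   [1, 0, 0],
                   [0, 1, 0]])
Y_score = np.array([[0.9, 0.2, 0.10],
                    [0.8, 0.7, 0.20],
                    [0.7, 0.1, 0.30],
                    [0.2, 0.8, 0.25],
                    [0.6, 0.3, 0.15],
                    [0.3, 0.9, 0.40]])

micro = average_precision_score(Y_true, Y_score, average="micro")
macro = average_precision_score(Y_true, Y_score, average="macro")
per_class = average_precision_score(Y_true, Y_score, average=None)
# Classes 0 and 1 are ranked perfectly (AP = 1.0); class 2 scores only 1/3,
# so the macro average drops to 7/9 while micro is dominated by the two
# frequent, well-handled classes.
```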
Scikit-learn provides multiple functions for computing precision-recall curves and PR AUC.
Computing average precision (recommended):
import numpy as np
from sklearn.metrics import average_precision_score
# True labels (0 or 1) and predicted probabilities
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.75, 0.05, 0.6, 0.9, 0.55])
ap = average_precision_score(y_true, y_scores)
print(f"Average Precision: {ap:.4f}")
Computing the precision-recall curve and trapezoidal AUC:
from sklearn.metrics import precision_recall_curve, auc
precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
pr_auc = auc(recall, precision)
print(f"PR AUC (trapezoidal): {pr_auc:.4f}")
Plotting the precision-recall curve:
from sklearn.metrics import PrecisionRecallDisplay
import matplotlib.pyplot as plt
disp = PrecisionRecallDisplay.from_predictions(
y_true, y_scores, name="Classifier", plot_chance_level=True
)
disp.ax_.set_title("Precision-Recall Curve")
plt.show()
The plot_chance_level=True argument draws the random baseline (a horizontal line at y = prevalence), which is helpful for visually assessing whether the classifier performs better than random.
The PRROC package in R, developed by Grau, Grosse, and Keilwagen (2015), provides functions for computing both PR and ROC curves with proper interpolation following the Davis-Goadrich method:
library(PRROC)
scores_pos <- c(0.9, 0.8, 0.75, 0.65, 0.55)
scores_neg <- c(0.4, 0.35, 0.3, 0.2, 0.1)
result <- pr.curve(scores.class0 = scores_pos,
scores.class1 = scores_neg,
curve = TRUE)
print(result$auc.davis.goadrich)
plot(result)
In TensorFlow and Keras, PR AUC can be tracked during training using the tf.keras.metrics.AUC metric with the curve parameter set to "PR":
import tensorflow as tf
model.compile(
optimizer="adam",
loss="binary_crossentropy",
metrics=[tf.keras.metrics.AUC(curve="PR", name="pr_auc")]
)
Despite its advantages for imbalanced data, PR AUC has several important limitations.
Because neither precision nor recall uses true negatives, PR AUC completely ignores the model's ability to correctly identify negative instances. In applications where correctly classifying negatives matters (e.g., ensuring that healthy patients are not subjected to unnecessary treatment), PR AUC alone is insufficient. In such cases, combining PR AUC with specificity-based metrics or ROC AUC provides a more complete evaluation.
Unlike ROC AUC, which has a clean probabilistic interpretation (the probability that a randomly chosen positive instance is scored higher than a randomly chosen negative instance), PR AUC lacks a similarly intuitive interpretation. This can make it harder to explain to non-technical stakeholders.
The shifting baseline of PR AUC means that values cannot be directly compared across datasets with different class distributions. A PR AUC of 0.40 on one dataset is not equivalent to a PR AUC of 0.40 on another unless both have the same positive class prevalence. Normalized PR AUC addresses this but is not commonly reported.
Not every point in precision-recall space corresponds to an achievable classifier. Boyd and Page (2012) showed that certain regions of PR space are unachievable for any classifier on a given dataset. This is unlike ROC space, where every point is achievable. When averaging PR curves across cross-validation folds with different class distributions, failing to account for these unachievable regions can lead to misleading results.
As discussed in the computation section, naive linear interpolation in PR space produces incorrect results. Practitioners who are unaware of this issue may compute overly optimistic PR AUC values. While average precision avoids this problem, confusion between the different computation methods persists in practice.
PR curves are not guaranteed to be monotonically decreasing. As the threshold changes, precision can increase or decrease, creating a "sawtooth" pattern. This non-monotonicity makes PR curves harder to interpret visually than ROC curves and complicates the definition of dominance between classifiers.
Use average precision instead of trapezoidal PR AUC. The average_precision_score function in scikit-learn provides a more accurate estimate than computing PR AUC with the trapezoidal rule.
Always report the class distribution. Because PR AUC's baseline depends on prevalence, reporting the positive class fraction alongside the PR AUC value is necessary for proper interpretation.
Compare against the random baseline. A horizontal line at y = P/(P+N) in PR space represents random performance. Any useful classifier should have a PR curve substantially above this line.
Report confidence intervals. Bootstrap confidence intervals help determine whether differences in PR AUC between models are statistically meaningful.
Plot the full curve, not just the summary statistic. Two classifiers can have identical PR AUC values but very different curves. One might have high precision at low recall, while the other has moderate precision across all recall levels. The full curve reveals these differences.
Consider PR AUC alongside other metrics. PR AUC is most useful when combined with ROC AUC, F1 score, or other metrics. No single metric captures all aspects of classifier performance.
Be aware of the computation method. Verify which interpolation method your software uses. Different libraries may produce different PR AUC values for the same data.
Use proper interpolation for visualization. When plotting PR curves, use step interpolation (step or post style in matplotlib) rather than linear interpolation to accurately represent the discrete nature of the curve.
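A minimal sketch of step-style plotting with matplotlib; the Agg backend and output filename are choices made for this scripted example:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted use
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.75, 0.05, 0.6, 0.9, 0.55])
precision, recall, _ = precision_recall_curve(y_true, y_scores)

fig, ax = plt.subplots()
# Step interpolation ("post") reflects the discrete threshold structure;
# a plain line plot would draw misleading linear segments between points.
ax.step(recall, precision, where="post")
ax.set_xlabel("Recall")
ax.set_ylabel("Precision")
fig.savefig("pr_curve.png")
```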
Imagine you are playing a game where you have to find gold coins hidden in a big sandbox full of regular rocks. When you dig something up, two things matter. First, how many of the things you dug up are actually gold coins and not just rocks? That is precision. Second, out of all the gold coins hidden in the sandbox, how many did you find? That is recall.
Now, you can be really careful and only dig where you are very sure there is gold. You will get mostly gold coins (high precision), but you will miss a lot of them (low recall). Or you can dig everywhere, finding all the gold coins (high recall), but also pulling up tons of rocks (low precision).
PR AUC measures how good you are at this game overall. It looks at all the different ways you could play (being very careful, being somewhat careful, or digging everywhere) and gives you one score for how well you balance finding gold without grabbing too many rocks. A higher score means you are better at the game.