A precision-recall curve (PR curve) is a graph that plots precision on the y-axis against recall on the x-axis at every possible classification threshold for a binary classification model. The curve captures the trade-off between the two metrics: as the threshold is lowered, the classifier labels more examples as positive, which tends to increase recall but decrease precision. Because precision-recall curves focus exclusively on the performance of the positive class, they are especially informative when evaluating classifiers on class-imbalanced datasets where the positive class is rare.
Unlike the ROC curve, which plots the true positive rate against the false positive rate, the PR curve does not use true negatives in either axis. This makes it a more reliable evaluation tool when the number of negative examples vastly outnumbers the positive examples, as is common in fraud detection, disease screening, and information retrieval.
Imagine you have a big toy box full of red balls and blue balls, and your job is to pick out all the red ones.
Precision asks: "Of the balls you picked, how many were actually red?" If you picked 10 balls and 8 were red, your precision is 8 out of 10.
Recall asks: "Of all the red balls in the box, how many did you find?" If there are 20 red balls total and you found 8, your recall is 8 out of 20.
The tricky part is that being really careful (high precision) means you might miss some red balls (low recall). And grabbing lots of balls to find more red ones (high recall) means you also grab more blue ones by mistake (low precision).
A precision-recall curve draws a picture of this trade-off. A really good ball-picker has a curve that stays high up near the top-right corner, meaning they find lots of red balls without grabbing too many blue ones.
The concepts of precision and recall have roots in information retrieval, where they were first formalized to evaluate search and retrieval systems. Kent, Berry, Luehrs, and Perry introduced these notions in 1955 in their work on operational criteria for information retrieval system design, though the specific term "precision" appeared somewhat later in the literature. Cyril Cleverdon's Cranfield experiments in the 1960s established a foundational methodology for evaluating retrieval systems using precision and recall on standardized test collections with predetermined relevant items.
Cleverdon's approach formed a blueprint for the Text Retrieval Conference (TREC), organized by the National Institute of Standards and Technology (NIST) beginning in 1992. TREC adopted precision-recall analysis as a core evaluation framework and popularized the 11-point interpolated average precision method. Mean Average Precision (MAP), the average of AP scores across a set of queries, became the standard single-figure metric for comparing retrieval systems in the TREC community.
C.J. van Rijsbergen's 1979 book Information Retrieval provided a theoretical foundation by introducing the effectiveness measure E, which was later reformulated as the F1 score. Van Rijsbergen showed that the harmonic mean is the appropriate method for combining precision and recall, a principle rooted in decreasing marginal relevance. The F-measure was subsequently adopted outside information retrieval when it was proposed for evaluation at the fourth Message Understanding Conference (MUC-4) in 1992.
In machine learning, Davis and Goadrich's 2006 paper at the International Conference on Machine Learning (ICML), "The Relationship Between Precision-Recall and ROC Curves," proved a formal correspondence between ROC space and PR space. They demonstrated that a curve dominates in ROC space if and only if it dominates in PR space, and they introduced the concept of an achievable PR curve, analogous to the convex hull in ROC space. Their work also showed that linear interpolation between operating points in PR space is inappropriate because precision does not vary linearly with recall.
Saito and Rehmsmeier's 2015 study in PLoS ONE further demonstrated that PR plots are more informative than ROC plots when evaluating classifiers on imbalanced datasets. They provided empirical evidence that the visual interpretability of ROC plots can be deceptive regarding classification reliability, owing to an intuitive but incorrect interpretation of specificity on skewed data.
Both precision and recall are derived from the confusion matrix, which summarizes the predictions of a binary classifier into four categories:
| | Predicted positive | Predicted negative |
|---|---|---|
| Actual positive | True Positive (TP) | False Negative (FN) |
| Actual negative | False Positive (FP) | True Negative (TN) |
Precision (also called positive predictive value) is defined as:
Precision = TP / (TP + FP)
It answers the question: "Of all instances the model labeled as positive, what fraction is truly positive?"
Recall (also called sensitivity or true positive rate) is defined as:
Recall = TP / (TP + FN)
It answers: "Of all actual positive instances, what fraction did the model correctly identify?"
Neither metric uses the true negative (TN) count. This is the fundamental reason why PR curves focus entirely on positive-class performance and are unaffected by the number of negative examples in the dataset.
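Since both formulas use only three of the four counts, a minimal sketch with arbitrary illustrative counts makes the computation concrete:

```python
# Minimal sketch: precision and recall from confusion-matrix counts.
# The counts below are arbitrary illustrative values.
tp, fp, fn, tn = 80, 20, 40, 860

precision = tp / (tp + fp)  # 80 / 100 = 0.80
recall = tp / (tp + fn)     # 80 / 120 ≈ 0.67

# Note that tn never appears in either formula.
print(f"precision={precision:.2f}, recall={recall:.2f}")
```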
Most classifiers output a continuous score or probability for each instance rather than a hard binary label. A classification threshold (also called a decision threshold) converts these continuous scores into binary predictions: instances with scores at or above the threshold are predicted positive, and those below are predicted negative.
As the threshold changes:

- Raising the threshold makes the classifier more conservative: fewer instances are predicted positive, which tends to raise precision and lower recall.
- Lowering the threshold makes the classifier more permissive: more instances are predicted positive, which tends to raise recall and lower precision.
This inverse relationship between precision and recall at different thresholds is exactly what the PR curve visualizes. It is worth noting that the relationship is not perfectly monotonic in every case. At certain thresholds, lowering the threshold may leave recall unchanged while precision fluctuates, depending on the distribution of scores.
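A small sketch illustrates how a decision threshold turns scores into labels and how lowering it trades precision for recall; the six labels and scores here are made up for illustration:

```python
import numpy as np

# Hypothetical ground-truth labels and model scores for six instances.
y_true = np.array([1, 0, 1, 1, 0, 0])
scores = np.array([0.95, 0.85, 0.70, 0.40, 0.30, 0.10])

for threshold in (0.9, 0.5, 0.2):
    y_pred = (scores >= threshold).astype(int)  # scores at/above threshold -> positive
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

Running this prints precision 1.00 / recall 0.33 at threshold 0.9, 0.67 / 0.67 at 0.5, and 0.60 / 1.00 at 0.2, tracing the trade-off directly.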
The process for building a PR curve follows these steps:

1. Use the model to assign a score to every instance in the evaluation set.
2. Sweep the decision threshold across the distinct score values, from highest to lowest.
3. At each threshold, convert the scores to binary predictions and compute precision and recall.
4. Plot each (recall, precision) pair and connect successive points.
The resulting curve generally starts near the top-left region (high precision, low recall at a very high threshold) and moves toward the bottom-right (lower precision, higher recall at a lower threshold). An ideal classifier would produce a curve that stays close to the top-right corner, where both precision and recall equal 1.0.
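A minimal sketch of this construction, assuming y_true and scores are NumPy arrays (in practice, sklearn.metrics.precision_recall_curve performs this computation):

```python
import numpy as np

def pr_curve(y_true, scores):
    """Compute (precision, recall) pairs by sweeping the threshold over
    the ranked scores; ties between equal scores are ignored for simplicity."""
    order = np.argsort(-scores)        # rank instances by score, descending
    y_sorted = y_true[order]
    tp = np.cumsum(y_sorted)           # true positives above each cut-off
    fp = np.cumsum(1 - y_sorted)       # false positives above each cut-off
    precision = tp / (tp + fp)
    recall = tp / y_sorted.sum()       # denominator: total number of positives
    return precision, recall
```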
PR curves characteristically exhibit a saw-tooth (zigzag) pattern rather than a smooth curve. This happens because:

- When the next-ranked instance is a true positive, both recall and precision increase, moving the curve up and to the right.
- When the next-ranked instance is a false positive, recall is unchanged while precision drops, producing a sharp vertical dip.
This jagged appearance is a natural artifact of the ranking process and is a well-documented property described in Manning, Raghavan, and Schütze's Introduction to Information Retrieval. It distinguishes PR curves from the typically smoother ROC curves.
A PR curve that stays high and close to the upper-right corner indicates a strong classifier that maintains high precision even as recall grows. A curve that drops quickly as recall increases suggests the classifier introduces many false positives when it tries to capture more of the positive class.
When comparing two models, the one whose curve lies above and to the right of the other is generally superior, because it achieves higher precision at every recall level.
The precision values at very low recall levels indicate how confident the model's top-ranked predictions are. High precision at the left end means the model's most confident positive predictions are reliable.
If precision drops sharply at high recall, the model is introducing many false positives to find the last remaining positives. This region is often the most informative for understanding the practical limits of a classifier.
If an application requires a minimum recall level (for example, at least 90% recall for a medical screening test), a useful comparison strategy is to check the precision each model achieves at that fixed recall value.
The baseline for a PR curve corresponds to a no-skill classifier that assigns positive labels at random. For such a classifier, precision equals the prevalence of the positive class:
Baseline precision = P / (P + N)
where P is the number of positive examples and N is the number of negative examples. This baseline appears as a horizontal line on the PR plot.
This context-dependent baseline is an important distinction from the ROC curve, where the random baseline is always a diagonal line with an AUC of 0.5, regardless of class distribution. Because the PR baseline changes with prevalence, it is important to plot or note the baseline when presenting PR curves so that readers can gauge how much better a model performs relative to chance.
For example, if only 2% of examples are positive, the random baseline sits at precision = 0.02. An AUPRC of 0.30 in this setting represents a substantial improvement over chance, whereas the same value on a dataset with 50% prevalence would be poor.
A perfect classifier would produce a single point at (recall = 1.0, precision = 1.0), representing the upper-right corner of the plot.
Summarizing the full PR curve into a single scalar value is useful for model comparison. Two related but distinct approaches are commonly used.
Average precision is defined as the weighted mean of precisions at each threshold, where the weight is the increase in recall from the previous threshold:
AP = sum over n of (R_n - R_{n-1}) * P_n
where P_n and R_n are the precision and recall at the n-th threshold, and R_0 = 0. This formulation computes a step-function approximation of the area under the PR curve. In scikit-learn, this is the method used by average_precision_score.
AP can also be understood intuitively: it is the average of the precision values obtained after each relevant (positive) document or example is retrieved. Higher AP indicates a better model.
The AUPRC can alternatively be computed using trapezoidal integration (e.g., via sklearn.metrics.auc). While related to AP, the trapezoidal method yields slightly different numerical values because it linearly interpolates between operating points rather than using a step function. Davis and Goadrich (2006) showed that the trapezoidal rule tends to overestimate the area in PR space because precision does not vary linearly with recall. For this reason, the step-function AP formulation is generally preferred.
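The two estimates can be compared directly. A small sketch on synthetic data; the make_classification parameters and the logistic regression model are arbitrary choices for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, average_precision_score, auc
from sklearn.model_selection import train_test_split

# Synthetic imbalanced problem: ~95% negatives (illustrative parameters only).
X, y = make_classification(n_samples=5000, weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

precision, recall, _ = precision_recall_curve(y_te, scores)
ap = average_precision_score(y_te, scores)   # step-function estimate
trap = auc(recall, precision)                # trapezoidal estimate

# The trapezoidal value is typically slightly larger (optimistic).
print(f"AP (step) = {ap:.4f}, AUC (trapezoid) = {trap:.4f}")
```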
Boyd, Eng, and Page (2013) performed a computational analysis of common AUPRC estimators and their confidence intervals. They found that some commonly used estimation procedures are invalid and recommended two simple interval estimation methods that are robust under various assumptions. Their work highlights the importance of reporting confidence intervals alongside point estimates of AUPRC, especially on small datasets.
Several interpolation methods have been developed to smooth the jagged PR curve for standardized comparison:
| Method | Description | Used by |
|---|---|---|
| All-point interpolation | Computes AP by summing precision at every threshold where recall changes; uses step-function approximation | scikit-learn (without the max-precision envelope), PASCAL VOC (2010 onward) |
| 11-point interpolation | Evaluates the maximum precision at 11 fixed recall levels: 0.0, 0.1, 0.2, ..., 1.0, and averages them | TREC (traditional), PASCAL VOC (2007) |
| 101-point interpolation | Evaluates the maximum precision at 101 recall levels: 0.00, 0.01, 0.02, ..., 1.00 | COCO benchmark |
| Trapezoidal rule | Uses linear interpolation between operating points; can overestimate area in PR space | General numerical integration |
For the interpolated methods (11-point, 101-point, and VOC all-point), the interpolated precision at a given recall level r is defined as the maximum precision value at any recall level r' >= r. This monotonically decreasing envelope eliminates the saw-tooth artifacts and produces a smoother curve; scikit-learn's average_precision_score deliberately omits the envelope and uses the raw step function.
The choice of interpolation method can affect the computed AP value. The 11-point method produces coarser estimates than the all-point method. The difference between methods (sometimes called "average precision distortion") can be non-trivial for certain classifiers, which means AP values computed using different methods are not directly comparable. The interpolation method should always be reported alongside AP results.
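A sketch of the interpolation envelope and the 11-point AP, assuming precision and recall arrays such as those returned by precision_recall_curve:

```python
import numpy as np

def interpolated_envelope(precision, recall):
    """Monotone envelope: interpolated precision at recall r is the
    maximum precision at any recall r' >= r."""
    order = np.argsort(recall)                    # sort by increasing recall
    p = precision[order]
    # running maximum taken from the right = max precision over recalls >= r
    return np.maximum.accumulate(p[::-1])[::-1], recall[order]

def eleven_point_ap(precision, recall):
    """11-point interpolated AP (traditional TREC / VOC 2007 style)."""
    p_env, r_sorted = interpolated_envelope(precision, recall)
    levels = np.linspace(0.0, 1.0, 11)            # recall levels 0.0, 0.1, ..., 1.0
    interp = [p_env[r_sorted >= r][0] if np.any(r_sorted >= r) else 0.0
              for r in levels]
    return np.mean(interp)
```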
Both the precision-recall curve and the ROC curve visualize classifier performance across thresholds, but they emphasize different aspects of performance.
| Aspect | PR curve | ROC curve |
|---|---|---|
| Y-axis | Precision | True positive rate (recall) |
| X-axis | Recall | False positive rate |
| Uses true negatives? | No | Yes (in FPR calculation) |
| Focus | Positive class only | Both classes equally |
| Random baseline | Horizontal line at prevalence (P / (P + N)) | Diagonal line (AUC = 0.5) |
| Sensitivity to class imbalance | High; reflects class distribution directly | Low; can mask poor precision on minority class |
| Convex hull property | Not guaranteed convex; achievable curve is the analog | Convex hull is well-defined |
| Comparable across datasets? | Only with same prevalence | Yes (baseline is fixed at 0.5) |
| Best suited for | Imbalanced datasets, rare event detection | Balanced datasets, general model comparison |
When the negative class vastly outnumbers the positive class, even a small false positive rate on the ROC curve can correspond to a large number of false positives in absolute terms. The ROC curve does not reveal this because its x-axis (FPR) divides by the large number of negatives, making even many false positives appear negligible.
The PR curve exposes this problem directly because precision divides by the total number of positive predictions (TP + FP). A large number of false positives will dramatically reduce precision, making the issue visible on the curve.
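A back-of-the-envelope example makes the effect concrete; the counts are hypothetical:

```python
# Hypothetical screening problem: 100 positives, 10,000 negatives.
P, N = 100, 10_000
tp = 90                  # the classifier finds 90 of the 100 positives
fpr = 0.01               # a seemingly tiny 1% false positive rate...
fp = fpr * N             # ...but that is 100 false positives in absolute terms

recall = tp / P                    # 0.90 -- looks excellent on a ROC plot
precision = tp / (tp + fp)         # 90 / 190 ≈ 0.47 -- fewer than half correct
print(f"recall={recall:.2f}, precision={precision:.2f}")
```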
Saito and Rehmsmeier (2015) demonstrated this effect quantitatively. In their experiments, a classifier with an ROC AUC of 0.957 had a PR AUC of only 0.708, showing that the ROC curve painted an overly optimistic picture of the classifier's actual utility for identifying the positive class. Their general finding was that the stronger the class imbalance, the bigger the gap between ROC AUC and PR AUC tends to be.
Davis and Goadrich (2006) proved the dominance equivalence theorem: a curve dominates in ROC space if and only if it dominates in PR space. This means that whenever one classifier's curve dominates another's, the two representations agree on which classifier is better. However, the visual appearance of the curves can be very different; a seemingly strong ROC curve may correspond to a mediocre PR curve on imbalanced data.
| Scenario | Recommended curve | Reason |
|---|---|---|
| Balanced binary classification | ROC curve | Both classes are equally represented; ROC gives a complete picture |
| Imbalanced binary classification | PR curve | Focuses on the minority (positive) class; exposes precision problems hidden by ROC |
| Rare event detection (fraud, rare disease) | PR curve | True negatives are abundant and uninformative; PR ignores them |
| Cost-asymmetric classification | PR curve | Directly examines the precision-recall trade-off relevant to cost functions |
| Information retrieval and ranking | PR curve | Standard evaluation framework; MAP is the established metric |
| General model comparison across datasets | ROC curve | Fixed baseline makes cross-dataset comparison possible |
| Both classes matter equally | ROC curve | FPR captures negative-class performance that PR ignores |
The F1 score is the harmonic mean of precision and recall:
F1 = 2 * (Precision * Recall) / (Precision + Recall)
Every point on a PR curve corresponds to a particular F1 score. The threshold that maximizes the F1 score is the one that provides the best balance between precision and recall under equal weighting.
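A sketch of locating the F1-optimal threshold from the arrays returned by precision_recall_curve, assuming y_true (binary labels) and y_scores (model scores) are already defined:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)

# precision and recall have one more element than thresholds;
# drop the final (recall=0, precision=1) point before pairing them up.
# A tiny epsilon guards against division by zero.
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best = np.argmax(f1)
print(f"best F1 = {f1[best]:.3f} at threshold {thresholds[best]:.3f}")
```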
The more general F-beta score allows adjusting the relative importance of precision versus recall:
F_beta = (1 + beta^2) * (Precision * Recall) / (beta^2 * Precision + Recall)
This generalization was introduced by van Rijsbergen (1979). When beta > 1, recall is weighted more heavily (useful in medical screening where missing a case is costly). When beta < 1, precision receives more weight (useful in spam filtering where false alarms are costly). Setting beta = 1 yields the standard F1 score.
| Beta value | Weighting | Typical use case |
|---|---|---|
| beta = 0.5 | Precision weighted twice as much as recall | Spam filtering, content moderation |
| beta = 1 | Equal weighting | General-purpose balanced evaluation |
| beta = 2 | Recall weighted twice as much as precision | Medical screening, safety-critical search |
Iso-F1 curves are contour lines on the precision-recall plot where every point yields the same F1 score. They are defined by the equation:
Precision = (F1 * Recall) / (2 * Recall - F1)
These curves form hyperbolic arcs on the PR plot. Plotting iso-F1 curves at several values (for example, 0.2, 0.4, 0.6, 0.8) provides visual reference lines that help practitioners see which F1 region a classifier's PR curve falls into. Points near the upper-right corner lie on iso-F1 curves close to 1.0; points near the axes lie on iso-F1 curves closer to 0.
Scikit-learn's precision-recall documentation includes code for overlaying iso-F1 curves on PR plots, making them a convenient visual aid when comparing multiple classifiers.
In object detection, mean average precision (mAP) extends the PR curve concept to multi-class localization tasks. The computation proceeds as follows:

1. Match each predicted bounding box to a ground-truth box of the same class, counting the prediction as a true positive when its intersection over union (IoU) with an unmatched ground-truth box exceeds a chosen threshold.
2. Rank the predictions for each class by confidence score and compute the class's PR curve and AP.
3. Average the per-class AP values to obtain mAP.
Different benchmarks use different conventions:
| Benchmark | IoU threshold(s) | Interpolation method | Notation |
|---|---|---|---|
| PASCAL VOC 2007 | 0.50 | 11-point interpolation | mAP |
| PASCAL VOC 2010+ | 0.50 | All-point interpolation | mAP |
| COCO (primary) | 0.50 to 0.95, step 0.05 | 101-point interpolation | mAP@[.50:.05:.95] |
| COCO (loose) | 0.50 | 101-point interpolation | AP50 |
| COCO (strict) | 0.75 | 101-point interpolation | AP75 |
The COCO benchmark averages AP across 10 IoU thresholds (from 0.50 to 0.95 in steps of 0.05) and over all 80 object categories, resulting in a single mAP number that rewards both correct classification and precise localization. This is widely considered the current standard metric for evaluating object detection models.
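A simplified sketch of steps 2 and 3 above, assuming detections have already been matched to ground truth at a fixed IoU threshold (the matching of step 1 is omitted); per_class is a hypothetical list of (scores, is_tp, n_ground_truth) tuples, one per class:

```python
import numpy as np

def average_precision(scores, is_tp, n_ground_truth):
    """All-point-interpolation AP for one class, given detection confidence
    scores and a boolean array marking which detections matched a ground truth."""
    order = np.argsort(-scores)                  # rank detections by confidence
    tp = np.cumsum(is_tp[order])
    fp = np.cumsum(~is_tp[order])
    recall = tp / n_ground_truth
    precision = tp / (tp + fp)
    # monotone precision envelope, then sum precision * recall increments
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    return np.sum(np.diff(recall, prepend=0.0) * precision)

def mean_average_precision(per_class):
    """mAP: unweighted mean of per-class AP values."""
    return np.mean([average_precision(*args) for args in per_class])
```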
PR curves have been used in information retrieval since the field's earliest formal evaluations. In this context, precision measures the proportion of retrieved documents that are relevant, and recall measures the proportion of relevant documents that have been retrieved. Mean Average Precision (MAP) remains the standard metric for comparing retrieval systems in TREC evaluations. The 11-point interpolated precision-recall curve was the traditional visualization method, though all-point AP has largely replaced it.
In clinical settings, PR curves help evaluate diagnostic tests and predictive models, particularly for rare diseases where prevalence is low. Ozenne, Subtil, and Maucort-Boulch (2015) showed that the PR curve overcame the optimism of the ROC curve for rare diseases. With a disease prevalence of 1%, they found that classifiers could achieve ROC AUC values above 0.9 while having AUPRC values below 0.2, illustrating that the ROC metric failed to reflect the practical difficulty of accurate diagnosis.
In medical applications, threshold selection is especially important because the consequences of both false positives (unnecessary procedures, patient anxiety, wasted resources) and false negatives (missed diagnoses, delayed treatment) are significant. The PR curve helps clinicians visualize this trade-off and choose an operating point appropriate for the clinical context.
Financial fraud detection operates on highly imbalanced data because fraudulent transactions are typically a tiny fraction of all transactions. PR curves are preferred over ROC curves in this domain because they directly measure how well the model identifies fraud (precision) without being diluted by the massive number of legitimate transactions that serve as true negatives.
PR curves are used to evaluate text classification, named entity recognition, relation extraction, and other NLP tasks where class distributions are often skewed. In sentiment analysis, PR curves can help compare models and select appropriate thresholds for different deployment contexts.
As described in the mAP section above, PR curves are the foundation of the primary evaluation metrics in object detection and image segmentation benchmarks including PASCAL VOC, COCO, and Open Images.
Scikit-learn provides several functions for computing and displaying PR curves:
| Function / Class | Purpose |
|---|---|
| sklearn.metrics.precision_recall_curve | Returns arrays of precision, recall, and thresholds |
| sklearn.metrics.average_precision_score | Computes average precision directly from labels and scores |
| sklearn.metrics.PrecisionRecallDisplay | Generates PR curve plots; supports from_estimator and from_predictions class methods |
| sklearn.metrics.auc | Computes area under any curve using the trapezoidal rule |
Basic usage:
```python
from sklearn.metrics import precision_recall_curve, average_precision_score
from sklearn.metrics import PrecisionRecallDisplay

# y_true holds the binary labels; y_scores holds the model's continuous scores

# Compute precision-recall pairs for every threshold
precision, recall, thresholds = precision_recall_curve(y_true, y_scores)

# Compute average precision
ap = average_precision_score(y_true, y_scores)

# Plot the curve with the chance-level baseline
disp = PrecisionRecallDisplay.from_predictions(
    y_true, y_scores,
    name="My Classifier",
    plot_chance_level=True,
)
```
The plot_chance_level=True parameter draws the baseline horizontal line at the prevalence of the positive class, making it easy to assess performance relative to a random classifier.
For multi-class problems, scikit-learn supports the one-vs-rest decomposition using OneVsRestClassifier. Separate PR curves are computed for each class by binarizing the labels, and a micro-averaged PR curve can aggregate performance across all classes:
```python
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import precision_recall_curve, average_precision_score

# Binarize labels for multi-class (Y_test and y_score are assumed to come
# from a train/test split and a fitted OneVsRestClassifier's scores)
Y = label_binarize(y, classes=[0, 1, 2])
n_classes = Y.shape[1]

precision, recall, average_precision = {}, {}, {}

# Compute per-class PR curves and AP
for i in range(n_classes):
    precision[i], recall[i], _ = precision_recall_curve(Y_test[:, i], y_score[:, i])
    average_precision[i] = average_precision_score(Y_test[:, i], y_score[:, i])

# Micro-averaged AP across all classes
average_precision["micro"] = average_precision_score(Y_test, y_score, average="micro")
```
Iso-F1 contours can be overlaid on an existing PR plot as follows:

```python
import numpy as np
import matplotlib.pyplot as plt

# Each contour solves Precision = F1 * Recall / (2 * Recall - F1)
f_scores = np.linspace(0.2, 0.8, num=4)
for f in f_scores:
    x = np.linspace(0.01, 1)                     # recall values (50 points)
    y = f * x / (2 * x - f)                      # corresponding precision
    plt.plot(x[y >= 0], y[y >= 0], color="gray", alpha=0.3)
    plt.annotate(f"F1={f:.1f}", xy=(0.9, y[45] + 0.02))
```
The TorchMetrics library (part of the PyTorch Lightning ecosystem) provides AveragePrecision for computing AP from model outputs. It supports binary, multiclass, and multilabel settings.
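A minimal usage sketch, assuming a recent TorchMetrics version that supports the task argument; the tensors are toy values:

```python
import torch
from torchmetrics import AveragePrecision

# Binary task: raw probabilities and integer labels (made-up values).
metric = AveragePrecision(task="binary")
preds = torch.tensor([0.95, 0.10, 0.80, 0.40])
target = torch.tensor([1, 0, 1, 0])
print(metric(preds, target))  # tensor(1.) for this perfectly ranked toy case
```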
In R, the yardstick package in the tidymodels ecosystem provides pr_curve() and average_precision() functions. The PRROC package offers dedicated tools for computing and plotting PR and ROC curves with proper interpolation.
PR curves are fundamentally a binary classification tool. Extending them to multi-class or multi-label settings requires decomposition strategies:
| Strategy | Description | Weighting |
|---|---|---|
| One-vs-rest (OvR) | Treat each class as positive vs. all others; compute a separate PR curve and AP per class | Depends on aggregation |
| Micro-averaging | Aggregate TP, FP, and FN counts across all classes before computing a single PR curve | More weight to frequent classes |
| Macro-averaging | Compute AP for each class independently, then average the AP values | Equal weight to all classes |
In multi-label classification, where each instance can belong to multiple classes simultaneously, the one-vs-rest approach is standard. The micro-averaged AP provides an overall performance measure, while per-class AP values reveal performance differences across individual labels.
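Both aggregation schemes are available directly in scikit-learn. A short sketch, assuming binarized labels Y_test and a score matrix y_score as in the one-vs-rest code above:

```python
from sklearn.metrics import average_precision_score

# Micro: pool all (label, score) pairs across classes, then compute one AP.
ap_micro = average_precision_score(Y_test, y_score, average="micro")
# Macro: compute AP per class, then take the unweighted mean.
ap_macro = average_precision_score(Y_test, y_score, average="macro")
print(f"micro AP = {ap_micro:.3f}, macro AP = {ap_macro:.3f}")
```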