# Precision-Recall Curve

> Source: https://aiwiki.ai/wiki/precision-recall_curve
> Updated: 2026-06-23
> Categories: Machine Learning, Model Evaluation
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

A **precision-recall curve** (PR curve) is a graph that plots [precision](/wiki/precision) on the y-axis against [recall](/wiki/recall) on the x-axis at every possible [classification threshold](/wiki/classification_threshold) for a [binary classification](/wiki/binary_classification) model. The curve captures the trade-off between the two metrics: as the threshold is lowered, the classifier labels more examples as positive, which tends to increase recall but decrease precision. Because precision-recall curves focus exclusively on the performance of the positive class, they are especially informative when evaluating classifiers on [class-imbalanced datasets](/wiki/class-imbalanced_dataset) where the positive class is rare. A widely cited justification for the PR curve comes from Davis and Goadrich (2006), who concluded that "when dealing with highly skewed datasets, Precision-Recall (PR) curves give a more informative picture of an algorithm's performance" than ROC curves [2].

Unlike the [ROC curve](/wiki/roc_receiver_operating_characteristic_curve), which plots the true positive rate against the [false positive rate](/wiki/false_positive_rate_fpr), the PR curve does not use true negatives in either axis. This makes it a more reliable evaluation tool when the number of negative examples vastly outnumbers the positive examples, as is common in fraud detection, disease screening, and [information retrieval](/wiki/information_retrieval) [1][2]. The whole curve is usually summarized into one number, the average precision (AP) or area under the PR curve (AUPRC), so that models can be ranked directly.

## ELI5 (Explain Like I'm 5)

Imagine you have a big toy box full of red balls and blue balls, and your job is to pick out all the red ones.

**Precision** asks: "Of the balls you picked, how many were actually red?" If you picked 10 balls and 8 were red, your precision is 8 out of 10.

**Recall** asks: "Of all the red balls in the box, how many did you find?" If there are 20 red balls total and you found 8, your recall is 8 out of 20.

The tricky part is that being really careful (high precision) means you might miss some red balls (low recall). And grabbing lots of balls to find more red ones (high recall) means you also grab more blue ones by mistake (low precision).

A precision-recall curve draws a picture of this trade-off. A really good ball-picker has a curve that stays high up near the top-right corner, meaning they find lots of red balls without grabbing too many blue ones.

## What is a precision-recall curve used for?

A precision-recall curve is used to evaluate and compare binary classifiers, especially on tasks where the positive class is rare and false positives are costly. It serves four main purposes:

- **Threshold selection.** The curve shows the precision a model achieves at every possible recall level, so a practitioner can pick the operating point that satisfies a requirement (for example, the highest precision available at a fixed recall of 0.90).
- **Model comparison.** When summarized as average precision or AUPRC, the curve gives a single number for ranking models; the curve that lies above and to the right is the stronger classifier at every recall.
- **Diagnosing imbalanced performance.** Because precision divides by the number of positive predictions, a flood of false positives on a rare-positive task collapses the curve, exposing a weakness that an ROC curve can hide.
- **Benchmark scoring.** PR-curve area underlies the mean average precision (mAP) metric used by the PASCAL VOC and COCO [object detection](/wiki/object_detection) benchmarks and the Mean Average Precision (MAP) metric used in [information retrieval](/wiki/information_retrieval).

## Background and history

The concepts of precision and recall have roots in information retrieval, where they were first formalized to evaluate search and retrieval systems. Kent, Berry, Luehrs, and Perry introduced these notions in 1955 in their work on operational criteria for information retrieval system design, though the specific term "precision" appeared somewhat later in the literature [11]. Cyril Cleverdon's experiments at the Cranfield Institute in the 1960s established a foundational methodology for evaluating retrieval systems using precision and recall on standardized test collections with predetermined relevant items [12].

Cleverdon's approach formed a blueprint for the Text Retrieval Conference (TREC), organized by the National Institute of Standards and Technology (NIST) beginning in 1992. TREC adopted precision-recall analysis as a core evaluation framework and popularized the 11-point interpolated average precision method [17]. Mean Average Precision (MAP), the average of AP scores across a set of queries, became the standard single-figure metric for comparing retrieval systems in the TREC community.

C.J. van Rijsbergen's 1979 book *Information Retrieval* provided a theoretical foundation by introducing the effectiveness measure E, which was later reformulated as the [F1 score](/wiki/f1_score). Van Rijsbergen showed that the harmonic mean is the appropriate method for combining precision and recall, a principle rooted in decreasing marginal relevance [4]. The F-measure was subsequently adopted outside information retrieval when it was proposed for evaluation at the fourth Message Understanding Conference (MUC-4) in 1992.

In [machine learning](/wiki/machine_learning), Davis and Goadrich's 2006 paper at the International Conference on Machine Learning (ICML), "The Relationship Between Precision-Recall and ROC Curves," proved a formal correspondence between ROC space and PR space. They demonstrated that a curve dominates in ROC space if and only if it dominates in PR space, and they introduced the concept of an achievable PR curve, analogous to the convex hull in ROC space. Their work also showed that linear interpolation between operating points in PR space is inappropriate because precision does not vary linearly with recall [2].

Saito and Rehmsmeier's 2015 study in *PLoS ONE*, titled "The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets," further demonstrated that PR plots are more informative than ROC plots on imbalanced data. They warned that "the visual interpretability of ROC plots in the context of imbalanced datasets can be deceptive with respect to conclusions about the reliability of classification performance, owing to an intuitive but wrong interpretation of specificity" [1].

## Mathematical foundation

### Precision and recall from the confusion matrix

Both precision and recall are derived from the [confusion matrix](/wiki/confusion_matrix), which summarizes the predictions of a binary classifier into four categories [3][15]:

| | Predicted positive | Predicted negative |
|---|---|---|
| **Actual positive** | True Positive (TP) | False Negative (FN) |
| **Actual negative** | False Positive (FP) | True Negative (TN) |

**Precision** (also called positive predictive value) is defined as:

Precision = TP / (TP + FP)

It answers the question: "Of all instances the model labeled as positive, what fraction is truly positive?"

**Recall** (also called sensitivity or true positive rate) is defined as:

Recall = TP / (TP + FN)

It answers: "Of all actual positive instances, what fraction did the model correctly identify?"

Neither metric uses the true negative (TN) count. This is the fundamental reason why PR curves focus entirely on positive-class performance: correctly rejected negatives, however numerous, do not move the curve. Negatives still matter when they are misclassified, because every false positive lowers precision.
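
As a quick worked example, the sketch below computes both metrics from raw confusion-matrix counts. The counts are made up for illustration (they reuse the 8-of-10 and 8-of-20 figures from the ELI5 section), the helper names are hypothetical rather than taken from any library, and returning 0.0 for an empty denominator is a convention chosen here.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of predicted positives that are truly positive."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of actual positives that the model found."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0


# Illustrative counts: 10 predicted positives (8 correct), 20 actual positives.
tp, fp, fn = 8, 2, 12

p = precision(tp, fp)  # 8 / 10
r = recall(tp, fn)     # 8 / 20

# TN appears in neither formula, so it is not even an input here.
print(f"precision={p:.2f}, recall={r:.2f}")
```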

### The role of the classification threshold

Most classifiers output a continuous score or probability for each instance rather than a hard binary label. A classification threshold (also called a decision threshold) converts these continuous scores into binary predictions: instances with scores at or above the threshold are predicted positive, and those below are predicted negative.

As the threshold changes:

- **Lowering the threshold** generally increases recall (more true positives are captured) but may decrease precision (more false positives are included).
- **Raising the threshold** generally increases precision (fewer false positives) but may decrease recall (more true positives are missed).

This trade-off between precision and recall across thresholds is exactly what the PR curve visualizes. The relationship is not strictly monotonic, however. Recall can only stay the same or rise as the threshold is lowered, whereas precision can move in either direction: admitting a negative leaves recall unchanged and lowers precision, while admitting a positive raises recall and also raises precision unless it is already 1.0.
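
The effect of moving the threshold can be seen on a handful of scores. The following is a minimal sketch with hypothetical labels and scores; `precision_recall_at` is an illustrative helper, not a library function.

```python
import numpy as np

# Hypothetical labels (1 = positive) and classifier scores.
y_true = np.array([1, 1, 0, 1, 0, 1, 0, 0])
y_score = np.array([0.95, 0.85, 0.70, 0.60, 0.45, 0.35, 0.20, 0.10])


def precision_recall_at(threshold: float) -> tuple[float, float]:
    """Precision and recall when scores >= threshold are predicted positive."""
    y_pred = y_score >= threshold
    tp = int(np.sum(y_pred & (y_true == 1)))
    fp = int(np.sum(y_pred & (y_true == 0)))
    fn = int(np.sum(~y_pred & (y_true == 1)))
    # Both thresholds used below leave at least one predicted positive,
    # so the precision denominator is never zero in this sketch.
    return tp / (tp + fp), tp / (tp + fn)


high = precision_recall_at(0.80)  # strict threshold: few, confident positives
low = precision_recall_at(0.30)   # lenient threshold: more positives admitted

print("threshold 0.80 -> precision, recall =", high)
print("threshold 0.30 -> precision, recall =", low)
```

With these made-up scores the strict threshold gives perfect precision at half the recall, and the lenient one recovers every positive at the cost of two false positives.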

### Correspondence between ROC space and PR space

Davis and Goadrich (2006) established a precise structural link between the two evaluation spaces. Their Theorem 3.1 states that for a given dataset of positive and negative examples, there exists a one-to-one correspondence between a curve in ROC space and a curve in PR space, such that the curves contain exactly the same confusion matrices, provided recall is not zero [2]. The reasoning is that a point in ROC space fixes a unique confusion matrix when the dataset size is fixed, and although a point in PR space ignores the true negative count, that count is uniquely determined once the other three cells (TP, FP, FN) and the fixed totals are known. The single exception is recall equal to zero, where FP cannot be recovered and the mapping breaks down.

This correspondence is the foundation for the rest of their results. Because each ROC point maps to exactly one PR point and vice versa, a curve can be translated freely between the two spaces, and the dominance and convex-hull arguments carry over. It also clarifies why the curves look so different despite encoding identical information: the axes rescale the same underlying confusion matrices in ways that emphasize different error trade-offs.
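
The mapping can be made concrete with a little arithmetic. The sketch below converts one operating point between the two spaces for fixed class counts; the function names and the example numbers are hypothetical, chosen only to illustrate the round trip.

```python
def roc_to_pr(fpr: float, tpr: float, n_pos: int, n_neg: int) -> tuple[float, float]:
    """Map an ROC point (FPR, TPR) to a PR point (recall, precision)."""
    tp = tpr * n_pos
    fp = fpr * n_neg
    if tp + fp == 0:
        raise ValueError("no positive predictions: precision is undefined")
    return tpr, tp / (tp + fp)


def pr_to_roc(recall: float, precision: float, n_pos: int, n_neg: int) -> tuple[float, float]:
    """Map a PR point back to ROC space; impossible when recall is zero."""
    if recall == 0:
        raise ValueError("recall is zero: FP cannot be recovered")
    tp = recall * n_pos
    fp = tp * (1 - precision) / precision  # from precision = TP / (TP + FP)
    return fp / n_neg, recall


# Illustrative dataset: 50 positives, 5,000 negatives.
n_pos, n_neg = 50, 5_000
recall, precision = roc_to_pr(fpr=0.02, tpr=0.75, n_pos=n_pos, n_neg=n_neg)
fpr_back, tpr_back = pr_to_roc(recall, precision, n_pos, n_neg)

print(f"PR point: recall={recall:.2f}, precision={precision:.3f}")
print(f"Back in ROC space: FPR={fpr_back:.3f}, TPR={tpr_back:.2f}")
```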

## Constructing a precision-recall curve

The process for building a PR curve follows these steps:

1. **Obtain prediction scores.** Run the classifier on the evaluation dataset to get a continuous score (probability or decision function value) for each instance.
2. **Sort instances by score.** Rank all instances from highest to lowest predicted score.
3. **Sweep the threshold.** Starting from the highest score, lower the threshold step by step so that one additional instance (or one group of instances with tied scores) crosses it and is classified as positive at each step.
4. **Compute precision and recall.** At each threshold, calculate precision and recall from the resulting confusion matrix.
5. **Plot coordinate pairs.** Plot each (recall, precision) pair as a point on the graph with recall on the x-axis and precision on the y-axis.
6. **Connect the points.** Connect adjacent points to form the curve, typically using a step function rather than linear interpolation.

The resulting curve generally starts near the top-left region (high precision, low recall at a very high threshold) and moves toward the bottom-right (lower precision, higher recall at a lower threshold). An ideal classifier would produce a curve that stays close to the top-right corner, where both precision and recall equal 1.0.
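
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration that assumes no tied scores; `pr_curve_points` and the toy data are hypothetical, and library routines such as scikit-learn's `precision_recall_curve` handle ties and edge cases that this sketch ignores.

```python
import numpy as np


def pr_curve_points(y_true, y_score):
    """Return (recall, precision) arrays with one point per ranked instance."""
    y_true = np.asarray(y_true)
    # Rank instances from highest to lowest score.
    order = np.argsort(-np.asarray(y_score), kind="stable")
    hits = y_true[order]
    # Lowering the threshold past rank k admits the top-k instances as positive.
    tp = np.cumsum(hits)
    fp = np.cumsum(1 - hits)
    precision = tp / (tp + fp)
    recall = tp / hits.sum()
    return recall, precision


y_true = [1, 1, 0, 1, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]

recall, precision = pr_curve_points(y_true, y_score)
for r, p in zip(recall, precision):
    print(f"recall={r:.2f}  precision={p:.2f}")
```

Plotting these pairs with a step function (for example `matplotlib.pyplot.step`) gives the curve; the printed values already show precision dipping whenever a negative is admitted and recovering when the next positive arrives.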

### The saw-tooth shape

PR curves characteristically exhibit a saw-tooth (zigzag) pattern rather than a smooth curve. This happens because:

- When a relevant (positive) instance is encountered at the next rank position, recall increases and precision increases too (or stays at 1.0 if no negative has been admitted yet), causing the curve to move up and to the right.
- When a non-relevant (negative) instance is encountered, recall stays the same but precision drops, causing the curve to jag downward.

This jagged appearance is a natural artifact of the ranking process and is a well-documented property described in Manning, Raghavan, and Schutze's *Introduction to Information Retrieval* [6]. It distinguishes PR curves from the typically smoother ROC curves.

## Interpreting the curve

### Shape and position

A PR curve that stays high and close to the upper-right corner indicates a strong classifier that maintains high precision even as recall grows. A curve that drops quickly as recall increases suggests the classifier introduces many false positives when it tries to capture more of the positive class.

When comparing two models, the one whose curve lies above and to the right of the other is generally superior, because it achieves higher precision at every recall level.

### The left end of the curve

The precision values at very low recall levels indicate how confident the model's top-ranked predictions are. High precision at the left end means the model's most confident positive predictions are reliable.

### The right end of the curve

If precision drops sharply at high recall, the model is introducing many false positives to find the last remaining positives. This region is often the most informative for understanding the practical limits of a classifier.

### Comparing at fixed recall

If an application requires a minimum recall level (for example, at least 90% recall for a medical screening test), a useful comparison strategy is to check the precision each model achieves at that fixed recall value.
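
One way to carry out this comparison with scikit-learn is sketched below. The helper `precision_at_recall` and the two sets of scores are hypothetical; the sketch takes the best precision among all operating points whose recall meets the requirement.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def precision_at_recall(y_true, y_score, min_recall: float) -> float:
    """Highest precision among operating points with recall >= min_recall."""
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    eligible = recall >= min_recall
    return float(precision[eligible].max())


# Hypothetical labels and scores from two models on the same examples.
y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0, 0, 0])
model_a = np.array([0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10, 0.05])
model_b = np.array([0.90, 0.80, 0.30, 0.70, 0.20, 0.10, 0.60, 0.05, 0.04, 0.03])

pa = precision_at_recall(y_true, model_a, min_recall=0.90)
pb = precision_at_recall(y_true, model_b, min_recall=0.90)
print(f"precision at recall >= 0.90: model A = {pa:.3f}, model B = {pb:.3f}")
```

Here model B ranks every positive above every negative, so it keeps perfect precision at full recall, while model A has to admit three negatives to reach the same recall.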

## Baseline and random classifier

The baseline for a PR curve corresponds to a random classifier that assigns positive labels with no skill. For such a classifier, precision equals the prevalence of the positive class:

Baseline precision = P / (P + N)

where P is the number of positive examples and N is the number of negative examples. This baseline appears as a horizontal line on the PR plot, and Saito and Rehmsmeier (2015) write it explicitly as y = P / (P + N), noting that for a 1-to-10 positive-to-negative ratio it sits at y = 0.09 [1]. The intuition is that a classifier assigning labels at random, or a model that simply predicts every example positive, attains a precision equal to the fraction of examples that are genuinely positive.

This context-dependent baseline is an important distinction from the ROC curve, where the random baseline is always a diagonal line with an [AUC](/wiki/auc_area_under_the_roc_curve) of 0.5, regardless of class distribution. Because the PR baseline changes with prevalence, it is important to plot or note the baseline when presenting PR curves so that readers can gauge how much better a model performs relative to chance.

For example, if only 2% of examples are positive, the random baseline sits at precision = 0.02. An AUPRC of 0.30 in this setting represents a substantial improvement over chance, whereas the same value on a dataset with 50% prevalence would be poor.

A perfect classifier ranks every positive above every negative, so its curve holds precision at 1.0 across all recall levels and reaches the upper-right corner of the plot at (recall = 1.0, precision = 1.0).
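
The prevalence baseline is easy to check numerically. The sketch below uses a synthetic label vector with 2% positives; the seed, the sample sizes, and the choice of inspecting the top 2,000 random scores are all arbitrary assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

n_pos, n_neg = 200, 9_800       # 2% prevalence
y_true = np.concatenate([np.ones(n_pos, dtype=int), np.zeros(n_neg, dtype=int)])

baseline = n_pos / (n_pos + n_neg)

# Predicting every example positive gives TP = P and FP = N.
y_pred = np.ones_like(y_true)
all_positive_precision = np.sum(y_pred & y_true) / y_pred.sum()

# Random scores carry no information about the label, so the precision of
# any top-k slice hovers around the prevalence (up to sampling noise).
random_scores = rng.random(y_true.size)
top_k = np.argsort(-random_scores)[:2000]
random_precision = y_true[top_k].mean()

print(f"baseline={baseline:.3f}")
print(f"all-positive precision={all_positive_precision:.3f}")
print(f"random top-2000 precision={random_precision:.3f}")
```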

## Average precision and area under the PR curve

Summarizing the full PR curve into a single scalar value is useful for model comparison. Two related but distinct approaches are commonly used.

### Average precision (AP)

Average precision is defined as the weighted mean of precisions at each threshold, where the weight is the increase in recall from the previous threshold:

AP = sum over n of (R_n - R_{n-1}) * P_n

where P_n and R_n are the precision and recall at the n-th threshold, and R_0 = 0. This formulation computes a step-function approximation of the area under the PR curve. In [scikit-learn](/wiki/scikit-learn), this is the exact definition used by `average_precision_score` [10]. The library notes that since version 0.19, precisions are weighted by the change in recall since the last operating point rather than being linearly interpolated, and it warns that the trapezoidal alternative "uses linear interpolation and can be too optimistic" [10].

AP can also be understood intuitively: it is the average of the precision values obtained after each relevant (positive) document or example is retrieved. Higher AP indicates a better model.
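
The sum can be evaluated by hand and checked against the library. The sketch below is a minimal version that assumes no tied scores; `average_precision_manual` is a hypothetical helper written for this comparison, and the toy data are invented.

```python
import numpy as np
from sklearn.metrics import average_precision_score


def average_precision_manual(y_true, y_score) -> float:
    """AP = sum_n (R_n - R_{n-1}) * P_n over the ranked list (no tied scores)."""
    y_true = np.asarray(y_true)
    order = np.argsort(-np.asarray(y_score), kind="stable")
    hits = y_true[order]
    tp = np.cumsum(hits)
    precision = tp / np.arange(1, len(hits) + 1)
    recall = tp / hits.sum()
    recall_prev = np.concatenate([[0.0], recall[:-1]])  # R_0 = 0
    return float(np.sum((recall - recall_prev) * precision))


y_true = [1, 1, 0, 1, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]

ap_manual = average_precision_manual(y_true, y_score)
ap_sklearn = average_precision_score(y_true, y_score)

# Recall only changes when a positive is retrieved, so AP also equals the
# mean of the precision values observed at each positive.
ap_mean_at_hits = (1.0 + 1.0 + 0.75 + 4 / 7) / 4

print(ap_manual, ap_sklearn, ap_mean_at_hits)
```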

### Area under the precision-recall curve (AUPRC)

The AUPRC can alternatively be computed using trapezoidal integration (e.g., via `sklearn.metrics.auc`). While related to AP, the trapezoidal method yields slightly different numerical values because it linearly interpolates between operating points rather than using a step function. Davis and Goadrich (2006) showed that the trapezoidal rule tends to overestimate the area in PR space because precision does not vary linearly with recall [2]. For this reason, the step-function AP formulation is generally preferred.

Davis and Goadrich gave a striking quantitative illustration of how badly naive interpolation can mislead. On a dataset with 433 positive and 56,164 negative examples, a curve passing through a single operating point at (recall = 0.02, precision = 1.0) and extended to the endpoints has a correctly computed AUC-PR of about 0.031, whereas connecting the points with straight lines inflates the estimate to roughly 0.50 [2]. They also presented a paired example in which optimizing for area in one space selects the wrong model for the other: two overlapping curves on a set with 20 positives and 2,000 negatives had AUC-ROC values of 0.813 and 0.875, so an algorithm maximizing AUC-ROC would prefer the second, yet their AUC-PR values were 0.514 and 0.038, making the first far better for the minority class [2]. This is the concrete basis for their conclusion that "algorithms that optimize the area under the ROC curve are not guaranteed to optimize the area under the PR curve" [2].
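
The gap between the two interpolation styles can be reproduced on a small scale. The sketch below interpolates the confusion-matrix counts between two operating points, in the spirit of the approach attributed to Davis and Goadrich above, and compares the resulting area with a straight line drawn in PR space. The counts, the function names, and the number of interpolation steps are illustrative assumptions.

```python
import numpy as np


def interpolate_pr(tp_a, fp_a, tp_b, fp_b, n_pos, steps=200):
    """PR points between operating points A and B, interpolating counts.

    Assumes tp_b > tp_a. For every extra true positive, a fixed number of
    false positives is added, so precision follows a curve, not a line.
    """
    fp_per_tp = (fp_b - fp_a) / (tp_b - tp_a)
    x = np.linspace(0, tp_b - tp_a, steps)
    tp = tp_a + x
    fp = fp_a + fp_per_tp * x
    return tp / n_pos, tp / (tp + fp)


def trapezoid_area(x, y) -> float:
    """Plain trapezoidal rule, written out to avoid NumPy version differences."""
    return float(np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2))


# Illustrative points on a dataset with 20 positives.
recall, precision = interpolate_pr(tp_a=5, fp_a=5, tp_b=10, fp_b=30, n_pos=20)

# Fine-grained sampling of the count-based curve approximates its true area.
area_counts = trapezoid_area(recall, precision)
# A single straight segment from A to B in PR space.
area_linear = (recall[-1] - recall[0]) * (precision[0] + precision[-1]) / 2

print(f"count-based area={area_counts:.4f}, straight-line area={area_linear:.4f}")
```

Because the count-based curve sags below the chord between A and B, the straight-line estimate is the larger of the two.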

Boyd, Eng, and Page (2013) performed a computational analysis of common AUPRC estimators and their confidence intervals. They found that some commonly used estimation procedures are invalid and recommended two simple interval estimation methods that are robust under various assumptions [5]. Their work highlights the importance of reporting confidence intervals alongside point estimates of AUPRC, especially on small datasets.

### Interpolation methods

Several interpolation methods have been developed to smooth the jagged PR curve for standardized comparison. The all-point and 101-point variants underpin the PASCAL VOC and COCO object-detection benchmarks respectively [8][9], while the 11-point method has a long history in TREC information-retrieval evaluation [6]:

| Method | Description | Used by |
|---|---|---|
| All-point interpolation | Computes AP by summing precision at every threshold where recall changes; uses step-function approximation | scikit-learn, PASCAL VOC (2010 onward) |
| 11-point interpolation | Evaluates the maximum precision at 11 fixed recall levels: 0.0, 0.1, 0.2, ..., 1.0, and averages them | TREC (traditional), PASCAL VOC (2007) |
| 101-point interpolation | Evaluates the maximum precision at 101 recall levels: 0.00, 0.01, 0.02, ..., 1.00 | COCO benchmark |
| Trapezoidal rule | Uses linear interpolation between operating points; can overestimate area in PR space | General numerical integration |

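For the 11-point, 101-point, and VOC-style all-point methods, the interpolated precision at a given recall level r is defined as the maximum precision value at any recall level r' >= r. This monotonically non-increasing envelope eliminates the saw-tooth artifacts and produces a smoother curve [6][13]. The trapezoidal rule does not use this envelope, and scikit-learn's `average_precision_score` is documented as non-interpolated: it applies the step-function sum to the raw precision values [10].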

The choice of interpolation method can affect the computed AP value. The 11-point method produces coarser estimates than the all-point method. The difference between methods (sometimes called "average precision distortion") can be non-trivial for certain classifiers, which means AP values computed using different methods are not directly comparable [13]. The interpolation method should always be reported alongside AP results.
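
To see how the variants diverge, the sketch below applies the interpolated-precision envelope to a small ranked list and compares the raw step-function AP, an all-point interpolated AP, and an 11-point AP. The function names and the ranking are hypothetical, and the code is a simplified illustration rather than a reimplementation of any benchmark toolkit.

```python
import numpy as np


def interpolated_precision(precision):
    """Envelope p_interp[i] = max(precision[i:]) for points sorted by recall."""
    return np.maximum.accumulate(precision[::-1])[::-1]


def step_ap(recall, precision) -> float:
    """Sum of (recall increment) * precision, with R_0 = 0."""
    recall_prev = np.concatenate([[0.0], recall[:-1]])
    return float(np.sum((recall - recall_prev) * precision))


def eleven_point_ap(recall, precision) -> float:
    """Mean of the max precision at recall >= r for r = 0.0, 0.1, ..., 1.0."""
    values = []
    for r in np.linspace(0.0, 1.0, 11):
        mask = recall >= r
        values.append(precision[mask].max() if mask.any() else 0.0)
    return float(np.mean(values))


# Hypothetical ranking, highest score first (1 = positive).
hits = np.array([1, 0, 0, 1, 1, 1, 0, 0])
tp = np.cumsum(hits)
precision = tp / np.arange(1, len(hits) + 1)
recall = tp / hits.sum()

ap_raw = step_ap(recall, precision)
ap_interp = step_ap(recall, interpolated_precision(precision))
ap_11 = eleven_point_ap(recall, precision)

print(f"raw AP={ap_raw:.4f}, all-point interpolated={ap_interp:.4f}, 11-point={ap_11:.4f}")
```

The three numbers differ on the same ranking, which is why the method needs to be stated whenever an AP value is reported.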

## How does a PR curve differ from an ROC curve?

Both the precision-recall curve and the [ROC curve](/wiki/roc_receiver_operating_characteristic_curve) visualize classifier performance across thresholds, but they emphasize different aspects of performance.

| Aspect | PR curve | ROC curve |
|---|---|---|
| Y-axis | Precision | True positive rate (recall) |
| X-axis | Recall | False positive rate |
| Uses true negatives? | No | Yes (in FPR calculation) |
| Focus | Positive class only | Both classes equally |
| Random baseline | Horizontal line at prevalence (P / (P + N)) | Diagonal line (AUC = 0.5) |
| Sensitivity to class imbalance | High; reflects class distribution directly | Low; can mask poor precision on minority class |
| Convex hull property | Not guaranteed convex; achievable curve is the analog | Convex hull is well-defined |
| Comparable across datasets? | Only with same prevalence | Yes (baseline is fixed at 0.5) |
| Best suited for | Imbalanced datasets, rare event detection | Balanced datasets, general model comparison |

### Why are PR curves preferred for imbalanced data?

When the negative class vastly outnumbers the positive class, even a small false positive rate on the ROC curve can correspond to a large number of false positives in absolute terms. The ROC curve does not reveal this because its x-axis (FPR) divides by the large number of negatives, making even many false positives appear negligible.

The PR curve exposes this problem directly because precision divides by the total number of positive predictions (TP + FP). A large number of false positives will dramatically reduce precision, making the issue visible on the curve.
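
A short calculation makes the contrast concrete. The sketch below holds one ROC operating point fixed and varies only the number of negatives; all figures are hypothetical.

```python
n_pos = 100
tpr, fpr = 0.90, 0.01  # an operating point that looks excellent in ROC space

precisions = {}
for n_neg in (100, 1_000, 10_000, 100_000):
    tp = tpr * n_pos   # unchanged: recall stays at 0.90
    fp = fpr * n_neg   # grows in proportion to the number of negatives
    precisions[n_neg] = tp / (tp + fp)
    print(f"negatives={n_neg:>7,}  false positives={fp:>6.0f}  "
          f"precision={precisions[n_neg]:.3f}")
```

The ROC point never moves, yet precision falls from nearly 0.99 with balanced classes to below 0.10 when negatives outnumber positives a thousand to one.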

Saito and Rehmsmeier (2015) demonstrated this effect quantitatively. In their re-analysis of the MiRFinder microRNA classifier on an imbalanced dataset, the tool reached an ROC AUC of 0.957 but a PR AUC of only 0.708, showing that the ROC curve painted an overly optimistic picture of the classifier's actual utility for identifying the positive class [1]. Their general finding was that the stronger the class imbalance, the bigger the gap between ROC AUC and PR AUC tends to be.

Davis and Goadrich (2006) proved the dominance equivalence theorem: a curve dominates in ROC space if and only if it dominates in PR space [2]. This means the two representations agree on which classifier is better overall. However, the visual appearance of the curves can be very different; a seemingly strong ROC curve may correspond to a mediocre PR curve on imbalanced data. Fawcett's widely cited introduction to ROC analysis provides the complementary background on ROC space, the convex hull, and the trade-offs that PR analysis reframes for the positive class [3].

### When is each curve appropriate?

| Scenario | Recommended curve | Reason |
|---|---|---|
| Balanced binary classification | ROC curve | Both classes are equally represented; ROC gives a complete picture |
| Imbalanced binary classification | PR curve | Focuses on the minority (positive) class; exposes precision problems hidden by ROC |
| Rare event detection (fraud, rare disease) | PR curve | True negatives are abundant and uninformative; PR ignores them |
| Cost-asymmetric classification | PR curve | Directly examines the precision-recall trade-off relevant to cost functions |
| Information retrieval and ranking | PR curve | Standard evaluation framework; MAP is the established metric |
| General model comparison across datasets | ROC curve | Fixed baseline makes cross-dataset comparison possible |
| Both classes matter equally | ROC curve | FPR captures negative-class performance that PR ignores |

## The F1 score and iso-F1 curves

The [F1 score](/wiki/f1_score) is the harmonic mean of precision and recall:

F1 = 2 * (Precision * Recall) / (Precision + Recall)

Every point on a PR curve corresponds to a particular F1 score. The threshold that maximizes the F1 score is the one that provides the best balance between precision and recall under equal weighting.
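
Finding that threshold is a short computation on top of `precision_recall_curve`. The following sketch uses invented labels and scores; the small constant in the denominator is an assumption added to avoid a 0/0 division.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical labels and scores.
y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# The final (precision=1, recall=0) point has no threshold, so drop it
# to align the arrays with `thresholds`.
p, r = precision[:-1], recall[:-1]
f1 = 2 * p * r / np.maximum(p + r, 1e-12)

best = int(np.argmax(f1))
best_threshold = float(thresholds[best])
best_f1 = float(f1[best])
print(f"best threshold={best_threshold:.2f}, F1={best_f1:.3f}")
```

A threshold chosen this way is tuned to the evaluation data, so it is usually selected on a validation split rather than on the final test set.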

### The F-beta generalization

The more general F-beta score allows adjusting the relative importance of precision versus recall:

F_beta = (1 + beta^2) * (Precision * Recall) / (beta^2 * Precision + Recall)

This generalization was introduced by van Rijsbergen (1979) [4]. When beta > 1, recall is weighted more heavily (useful in medical screening where missing a case is costly). When beta < 1, precision receives more weight (useful in spam filtering where false alarms are costly). Setting beta = 1 yields the standard F1 score.

| Beta value | Weighting | Typical use case |
|---|---|---|
| beta = 0.5 | Precision weighted twice as much as recall | Spam filtering, content moderation |
| beta = 1 | Equal weighting | General-purpose balanced evaluation |
| beta = 2 | Recall weighted twice as much as precision | Medical screening, safety-critical search |
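
The effect of beta is easiest to see at a single operating point. The sketch below evaluates the formula above for a hypothetical high-precision, low-recall classifier; `f_beta` is an illustrative helper, and returning 0.0 for a zero denominator is a convention chosen here.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score from precision and recall."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom > 0 else 0.0


# Hypothetical operating point: precise but misses half of the positives.
precision, recall = 0.9, 0.5

scores = {beta: f_beta(precision, recall, beta) for beta in (0.5, 1.0, 2.0)}
for beta, value in scores.items():
    print(f"beta={beta}: F={value:.3f}")
```

Because this point is strong on precision and weak on recall, the score falls as beta grows and recall takes on more weight.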

### Iso-F1 curves

Iso-F1 curves are contour lines on the precision-recall plot where every point yields the same F1 score. They are defined by the equation:

Precision = (F1 * Recall) / (2 * Recall - F1)

These curves form hyperbolic arcs on the PR plot. Plotting iso-F1 curves at several values (for example, 0.2, 0.4, 0.6, 0.8) provides visual reference lines that help practitioners see which F1 region a classifier's PR curve falls into. Points near the upper-right corner lie on iso-F1 curves close to 1.0; points near the axes lie on iso-F1 curves closer to 0.

Scikit-learn's precision-recall documentation includes code for overlaying iso-F1 curves on PR plots, making them a convenient visual aid when comparing multiple classifiers [16].

### Precision-Recall-Gain curves

Flach and Kull (2015) argued that the conventional area under the PR curve rests on an incoherent scale assumption: the area takes an arithmetic mean of precision values, while the F-beta score they are meant to summarize is built on a harmonic mean [14]. As a remedy they proposed Precision-Recall-Gain curves, which re-express precision and recall as "gains" relative to the baseline prevalence, so that the always-positive classifier has zero precision gain whatever the class distribution. The area under the Precision-Recall-Gain curve (AUPRG) then conveys an expected F1 score on a harmonic scale, in the same way that the area under the ROC curve relates to expected accuracy [14]. They showed experimentally that the area under a traditional PR curve can favor a model with a lower expected F1 score than a competitor, and that switching to the gain formulation leads to better model selection. The approach has not displaced ordinary AP and AUPRC in everyday practice, but it is a useful corrective when the absolute area value, rather than just the ranking of curves, is being interpreted.

## Mean average precision (mAP) in object detection

In [object detection](/wiki/object_detection), mean average precision (mAP) extends the PR curve concept to multi-class localization tasks [13]. The computation proceeds as follows:

1. For each object class, detections are matched to ground-truth bounding boxes using an Intersection over Union (IoU) threshold. A detection counts as a true positive if its IoU with a ground-truth box exceeds the threshold and that ground-truth box has not already been matched to a higher-confidence detection.
2. Detections are ranked by confidence score, and a PR curve is computed for each class.
3. The area under each class's PR curve (using the specified interpolation method) gives the average precision (AP) for that class.
4. The mean of these per-class AP values is the mAP.
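
The matching in step 1 can be sketched as follows for a single image and a single class. This is a simplified greedy matcher with hypothetical function names and boxes in `(x1, y1, x2, y2)` form; real benchmark toolkits differ in details such as how ties, crowd regions, and "difficult" objects are handled.

```python
def iou(a, b) -> float:
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_detections(detections, gt_boxes, iou_threshold=0.5):
    """Flag each detection as TP (True) or FP (False), highest confidence first.

    detections: list of (confidence, box) pairs.
    """
    matched = set()
    flags = []
    for _, box in sorted(detections, key=lambda d: d[0], reverse=True):
        # Find the best ground-truth box that has not been claimed yet.
        best_iou, best_j = 0.0, None
        for j, gt in enumerate(gt_boxes):
            if j in matched:
                continue
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_iou, best_j = overlap, j
        if best_j is not None and best_iou > iou_threshold:
            matched.add(best_j)
            flags.append(True)
        else:
            flags.append(False)  # duplicate, poorly localized, or spurious
    return flags


gt_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
detections = [
    (0.9, (1, 1, 10, 10)),     # good match for the first object
    (0.8, (0, 0, 9, 9)),       # duplicate of the first object
    (0.7, (21, 21, 31, 31)),   # good match for the second object
    (0.3, (50, 50, 60, 60)),   # no object here
]

flags = label_detections(detections, gt_boxes)
print(flags)
```

The resulting true/false flags, taken in confidence order, play the role of the ranked labels from which the per-class PR curve and AP are computed.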

Different benchmarks use different conventions, and the PASCAL VOC and COCO challenge papers define these in detail [8][9]:

| Benchmark | IoU threshold(s) | Interpolation method | Notation |
|---|---|---|---|
| PASCAL VOC 2007 | 0.50 | 11-point interpolation | mAP |
| PASCAL VOC 2010+ | 0.50 | All-point interpolation | mAP |
| COCO (primary) | 0.50 to 0.95, step 0.05 | 101-point interpolation | mAP@[.50:.05:.95] |
| COCO (loose) | 0.50 | 101-point interpolation | AP50 |
| COCO (strict) | 0.75 | 101-point interpolation | AP75 |

The COCO benchmark averages AP across 10 IoU thresholds (from 0.50 to 0.95 in steps of 0.05) and over all 80 object categories, resulting in a single mAP number that rewards both correct classification and precise localization [9]. This is widely considered the current standard metric for evaluating object detection models.

## Applications

### Information retrieval

PR curves have been used in [information retrieval](/wiki/information_retrieval) since the field's earliest formal evaluations. In this context, precision measures the proportion of retrieved documents that are relevant, and recall measures the proportion of relevant documents that have been retrieved. Mean Average Precision (MAP) remains the standard metric for comparing retrieval systems in TREC evaluations [17]. The 11-point interpolated precision-recall curve was the traditional visualization method, though all-point AP has largely replaced it [6]. NIST's own evaluation documentation now describes the 11-point average as obsolete and recommends MAP for current runs [17].

### Medical diagnosis

In clinical settings, PR curves help evaluate diagnostic tests and predictive models, particularly for rare diseases where prevalence is low. Ozenne, Subtil, and Maucort-Boulch (2015) showed that the PR curve overcame the optimism of the ROC curve for rare diseases [7]. With a disease prevalence of 1%, they found that classifiers could achieve ROC AUC values above 0.9 while having AUPRC values below 0.2, illustrating that the ROC metric failed to reflect the practical difficulty of accurate diagnosis [7].

In medical applications, threshold selection is especially important because the consequences of both false positives (unnecessary procedures, patient anxiety, wasted resources) and false negatives (missed diagnoses, delayed treatment) are significant. The PR curve helps clinicians visualize this trade-off and choose an operating point appropriate for the clinical context.

### Fraud detection

Financial fraud detection operates on highly [imbalanced data](/wiki/class-imbalanced_dataset) because fraudulent transactions are typically a tiny fraction of all transactions. PR curves are preferred over ROC curves in this domain because they directly measure how well the model identifies fraud (precision) without being diluted by the massive number of legitimate transactions that serve as true negatives.

### Natural language processing

PR curves are used to evaluate [text classification](/wiki/sentiment_analysis), [named entity recognition](/wiki/named_entity_recognition), relation extraction, and other NLP tasks where class distributions are often skewed. In [sentiment analysis](/wiki/sentiment_analysis), PR curves can help compare models and select appropriate thresholds for different deployment contexts.

### Object detection and computer vision

As described in the mAP section above, PR curves are the foundation of the primary evaluation metrics in [object detection](/wiki/object_detection) and [image segmentation](/wiki/image_segmentation) benchmarks including PASCAL VOC [8], COCO [9], and Open Images.

## Practical implementation

### Python with scikit-learn

[Scikit-learn](/wiki/scikit-learn) provides several functions for computing and displaying PR curves [10]:

| Function / Class | Purpose |
|---|---|
| `sklearn.metrics.precision_recall_curve` | Returns arrays of precision, recall, and thresholds |
| `sklearn.metrics.average_precision_score` | Computes average precision directly from labels and scores |
| `sklearn.metrics.PrecisionRecallDisplay` | Generates PR curve plots; supports `from_estimator` and `from_predictions` class methods |
| `sklearn.metrics.auc` | Computes area under any curve using the trapezoidal rule |

Basic usage:

```python
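import numpy as np
from sklearn.metrics import (
    PrecisionRecallDisplay,
    average_precision_score,
    precision_recall_curve,
)

# Illustrative inputs: replace with real labels and classifier scores
y_true = np.array([0, 0, 1, 0, 1, 0, 0, 1])
y_scores = np.array([0.1, 0.3, 0.35, 0.4, 0.65, 0.2, 0.5, 0.9])

# Compute precision-recall pairs
precision, recall, thresholds = precision_recall_curve(y_true, y_scores)

# Compute average precision
ap = average_precision_score(y_true, y_scores)

# Plot with chance-level baseline (needs a scikit-learn release that
# supports plot_chance_level)
disp = PrecisionRecallDisplay.from_predictions(
    y_true, y_scores,
    name="My Classifier",
    plot_chance_level=True,
)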
```

The `plot_chance_level=True` parameter draws the baseline horizontal line at the prevalence of the positive class, making it easy to assess performance relative to a random classifier.
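To see where that baseline comes from, consider a classifier that assigns the same score to every example. It has a single operating point, at which everything is predicted positive, so its precision equals the prevalence. The sketch below checks this with toy labels; the 5% prevalence and the constant score of 0.5 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.metrics import average_precision_score

# 5 positives among 100 examples, so the prevalence is 0.05 (toy labels)
y_true = np.zeros(100, dtype=int)
y_true[:5] = 1
prevalence = y_true.mean()

# An uninformative classifier: the same score for every example
constant_scores = np.full(y_true.shape, 0.5)
ap_constant = average_precision_score(y_true, constant_scores)

print(f"prevalence={prevalence:.2f} AP of constant scorer={ap_constant:.2f}")
```

Both values are 0.05, which is why an AP should be read relative to the prevalence of the dataset it was computed on.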

### Multi-class extension in scikit-learn

For multi-class problems, scikit-learn supports the one-vs-rest decomposition using `OneVsRestClassifier`. Separate PR curves are computed for each class by binarizing the labels, and a micro-averaged PR curve can aggregate performance across all classes:

```python
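from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import label_binarize

# Example data: the three-class iris dataset (any multi-class data works)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y
)

# Fit one binary classifier per class and score the test set
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_train, y_train)
y_score = clf.decision_function(X_test)

# Binarize the test labels: one indicator column per class
Y_test = label_binarize(y_test, classes=[0, 1, 2])
n_classes = Y_test.shape[1]

# Compute per-class PR curves and AP
precision, recall, average_precision = {}, {}, {}
for i in range(n_classes):
    precision[i], recall[i], _ = precision_recall_curve(Y_test[:, i], y_score[:, i])
    average_precision[i] = average_precision_score(Y_test[:, i], y_score[:, i])

# Micro-averaged AP
average_precision["micro"] = average_precision_score(Y_test, y_score, average="micro")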
```

### Plotting iso-F1 curves

```python
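import numpy as np
import matplotlib.pyplot as plt

# An iso-F1 curve joins all (recall, precision) points that share one F1 value.
# Solving F1 = 2PR / (P + R) for precision gives P = F1 * R / (2R - F1).
f_scores = np.linspace(0.2, 0.8, num=4)
for f in f_scores:
    x = np.linspace(0.01, 1)  # recall values
    y = f * x / (2 * x - f)   # precision on the iso-F1 curve
    plt.plot(x[y >= 0], y[y >= 0], color="gray", alpha=0.3)
    plt.annotate(f"F1={f:.1f}", xy=(0.9, y[-5] + 0.02))

# Clip the axes: precision values above 1 near recall = F1/2 are not meaningful
plt.xlim(0.0, 1.0)
plt.ylim(0.0, 1.05)
plt.xlabel("Recall")
plt.ylabel("Precision")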
```

### PyTorch

The TorchMetrics library (part of the PyTorch Lightning ecosystem) provides `AveragePrecision` for computing AP from model outputs. It supports binary, multiclass, and multilabel settings.

### R

The `yardstick` package in the tidymodels ecosystem provides `pr_curve()` and `average_precision()` functions. The `PRROC` package offers dedicated tools for computing and plotting PR and ROC curves with proper interpolation.

## Multi-class and multi-label extensions

PR curves are fundamentally a binary classification tool. Extending them to multi-class or multi-label settings requires decomposition strategies:

| Strategy | Description | Weighting |
|---|---|---|
| One-vs-rest (OvR) | Treat each class as positive vs. all others; compute a separate PR curve and AP per class | Depends on aggregation |
| Micro-averaging | Aggregate TP, FP, and FN counts across all classes before computing a single PR curve | More weight to frequent classes |
| Macro-averaging | Compute AP for each class independently, then average the AP values | Equal weight to all classes |

In multi-label classification, where each instance can belong to multiple classes simultaneously, the one-vs-rest approach is standard. The micro-averaged AP provides an overall performance measure, while per-class AP values reveal performance differences across individual labels.
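The difference between the two averaging strategies can be sketched on a tiny multi-label problem in which one label is frequent and ranked perfectly, while the other is rare and ranked poorly. The indicator matrix and scores below are invented for illustration.

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Toy multi-label indicator matrix: column 0 is frequent, column 1 is rare
Y_true = np.array([[1, 0], [1, 0], [1, 0], [1, 1], [0, 0], [0, 0]])
Y_score = np.array([
    [0.9, 0.65],
    [0.8, 0.75],
    [0.7, 0.25],
    [0.6, 0.35],
    [0.2, 0.55],
    [0.1, 0.45],
])

per_class = average_precision_score(Y_true, Y_score, average=None)
macro = average_precision_score(Y_true, Y_score, average="macro")
micro = average_precision_score(Y_true, Y_score, average="micro")

print("per-class AP:", per_class)
print(f"macro AP={macro:.3f} micro AP={micro:.3f}")
```

Here the per-class AP values are 1.0 and 0.2, so the macro average is 0.6, while the micro average is roughly 0.79 because the frequent, well-ranked label contributes four of the five positive entries.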

## Limitations and considerations

- **Binary classification only.** PR curves do not natively handle multi-class problems. Multi-class extension requires one-vs-rest decomposition, which can obscure inter-class confusion patterns.
- **Prevalence dependence.** Because the baseline depends on the positive class prevalence, AUPRC values from datasets with different class ratios are not directly comparable. A model with AUPRC of 0.30 on a dataset with 1% prevalence may be performing better than one with AUPRC of 0.60 on a dataset with 30% prevalence: the first is 30 times its chance baseline of 0.01, while the second is only twice its baseline of 0.30.
- **No single optimal threshold.** The PR curve shows performance across all thresholds but does not prescribe which threshold to use. The optimal threshold depends on the application's cost structure and must be determined separately, often using domain-specific criteria.
- **Requires continuous scores.** PR curves need a continuous prediction score (probability or decision function output) from the classifier. Models that produce only hard binary predictions yield a single point rather than a curve.
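- **Sensitivity to calibration.** The PR curve itself depends only on how the model ranks examples, so any strictly increasing rescaling of the scores leaves its shape and the AP unchanged. Calibration matters when an operating point is selected: the threshold value attached to each point on the curve can be interpreted as a probability only if the model's predicted probabilities are well calibrated, so a threshold chosen on poorly calibrated scores can be misleading even when the ranking ability is good.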
- **Sample size effects.** With small datasets, the PR curve can be noisy and unstable. Boyd et al. (2013) demonstrated the importance of reporting confidence intervals around AUPRC estimates to convey statistical uncertainty [5].
- **Interpolation method sensitivity.** Different interpolation methods (11-point, all-point, 101-point, trapezoidal) can yield different AP values for the same classifier. AP values computed using different methods are not directly comparable, and the interpolation method should always be reported [13].
- **Incoherent scale for the area.** The area under a PR curve is an arithmetic average of precision values, whereas the F-score combines precision and recall through a harmonic mean. As a result, the raw area can rank models inconsistently with their expected F1. Flach and Kull (2015) proposed Precision-Recall-Gain curves specifically to fix this mismatch [14].
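- **Non-convexity.** In ROC space, any point on the line between two operating points is achievable, so the convex hull is a natural upper envelope. Linear interpolation between points is not valid in PR space, where empirical curves are often jagged, which complicates the construction of convex hull analogs. Davis and Goadrich (2006) addressed this by defining the achievable PR curve, but this concept is less widely implemented in standard software [2].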
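The interpolation-method point above can be checked directly by summarizing one set of scores in two ways: scikit-learn's `average_precision_score`, which uses a step-wise sum with no interpolation, and `auc` applied to the same precision-recall points, which uses the trapezoidal rule. This is a minimal sketch on invented labels and scores; the size and direction of the gap depend on the data.

```python
import numpy as np
from sklearn.metrics import auc, average_precision_score, precision_recall_curve

# Toy labels and scores (illustrative only)
y_true = np.array([1, 0, 1, 0, 0, 1, 0, 0])
y_scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2])

precision, recall, _ = precision_recall_curve(y_true, y_scores)

# Step-wise summary: sum of (recall increment) * (precision at that point)
ap_step = average_precision_score(y_true, y_scores)

# Trapezoidal summary: linear interpolation between consecutive points
auc_trapezoid = auc(recall, precision)

print(f"average precision (step-wise): {ap_step:.4f}")
print(f"area by trapezoidal rule:      {auc_trapezoid:.4f}")
```

On this data the step-wise value is about 0.722 and the trapezoidal value about 0.678, so two reports of "area under the PR curve" for the same classifier can disagree unless the method is stated.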

## See also

- [Precision](/wiki/precision)
- [Recall](/wiki/recall)
- [F1 score](/wiki/f1_score)
- [Confusion matrix](/wiki/confusion_matrix)
- [ROC curve](/wiki/roc_receiver_operating_characteristic_curve)
- [AUC (area under the ROC curve)](/wiki/auc_area_under_the_roc_curve)
- [Classification threshold](/wiki/classification_threshold)
- [Binary classification](/wiki/binary_classification)
- [Class-imbalanced dataset](/wiki/class-imbalanced_dataset)
- [Object detection](/wiki/object_detection)
- [Information retrieval](/wiki/information_retrieval)

## References

1. Saito, T. and Rehmsmeier, M. (2015). "The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets." *PLoS ONE*, 10(3): e0118432. https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0118432
2. Davis, J. and Goadrich, M. (2006). "The Relationship Between Precision-Recall and ROC Curves." *Proceedings of the 23rd International Conference on Machine Learning (ICML)*, pp. 233-240. https://www.biostat.wisc.edu/~page/rocpr.pdf
3. Fawcett, T. (2006). "An Introduction to ROC Analysis." *Pattern Recognition Letters*, 27(8), pp. 861-874. https://doi.org/10.1016/j.patrec.2005.10.010
4. Van Rijsbergen, C.J. (1979). *Information Retrieval*, 2nd edition. Butterworths, London. https://openlib.org/home/krichel/courses/lis618/readings/rijsbergen79_infor_retriev.pdf
5. Boyd, K., Eng, K.H., and Page, C.D. (2013). "Area Under the Precision-Recall Curve: Point Estimates and Confidence Intervals." *Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD)*, pp. 451-466. https://pages.cs.wisc.edu/~boyd/aucpr_final.pdf
6. Manning, C.D., Raghavan, P., and Schütze, H. (2008). *Introduction to Information Retrieval*. Cambridge University Press. Chapter 8: Evaluation in Information Retrieval. https://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-ranked-retrieval-results-1.html
7. Ozenne, B., Subtil, F., and Maucort-Boulch, D. (2015). "The Precision-Recall Curve Overcame the Optimism of the Receiver Operating Characteristic Curve in Rare Diseases." *Journal of Clinical Epidemiology*, 68(8), pp. 855-859. https://doi.org/10.1016/j.jclinepi.2015.02.010
8. Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., and Zisserman, A. (2010). "The Pascal Visual Object Classes (VOC) Challenge." *International Journal of Computer Vision*, 88(2), pp. 303-338. https://doi.org/10.1007/s11263-009-0275-4
9. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C.L. (2014). "Microsoft COCO: Common Objects in Context." *Proceedings of the European Conference on Computer Vision (ECCV)*, pp. 740-755. https://arxiv.org/abs/1405.0312
10. Pedregosa, F. et al. (2011). "Scikit-learn: Machine Learning in Python." *Journal of Machine Learning Research*, 12, pp. 2825-2830. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html
11. Kent, A., Berry, M.M., Luehrs, F.U., and Perry, J.W. (1955). "Machine Literature Searching VIII: Operational Criteria for Designing Information Retrieval Systems." *American Documentation*, 6(2), pp. 93-101. https://doi.org/10.1002/asi.5090060209
12. Cleverdon, C.W. (1967). "The Cranfield Tests on Index Language Devices." *Aslib Proceedings*, 19(6), pp. 173-194. https://doi.org/10.1108/eb050097
13. Padilla, R., Netto, S.L., and da Silva, E.A.B. (2020). "A Survey on Performance Metrics for Object-Detection Algorithms." *International Conference on Systems, Signals and Image Processing (IWSSIP)*, pp. 237-242. https://doi.org/10.1109/IWSSIP48289.2020.9145130
14. Flach, P. and Kull, M. (2015). "Precision-Recall-Gain Curves: PR Analysis Done Right." *Advances in Neural Information Processing Systems (NeurIPS)*, 28. https://proceedings.neurips.cc/paper/2015/file/33e8075e9970de0cfea955afd4644bb2-Paper.pdf
15. Powers, D.M.W. (2011). "Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness and Correlation." *Journal of Machine Learning Technologies*, 2(1), pp. 37-63. https://arxiv.org/abs/2010.16061
16. Scikit-learn developers. "Precision-Recall." *scikit-learn documentation* (example gallery). Retrieved 2026. https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html
17. National Institute of Standards and Technology (NIST). "Common Evaluation Measures." *Text REtrieval Conference (TREC) Appendices*. https://trec.nist.gov/pubs/trec15/appendices/CE.MEASURES06.pdf
