A precision-recall curve (PR curve) is a graph that plots precision on the y-axis against recall on the x-axis at every possible classification threshold for a binary classification model. The curve captures the trade-off between the two metrics: as the threshold is lowered, the classifier labels more examples as positive, which tends to increase recall but decrease precision. Because precision-recall curves focus exclusively on the performance of the positive class, they are especially informative when evaluating classifiers on class-imbalanced datasets where the positive class is rare.
Unlike the ROC curve, which plots the true positive rate against the false positive rate, the PR curve does not use true negatives in either axis. This makes it a more reliable evaluation tool when the number of negative examples vastly outnumbers the positive examples, as is common in fraud detection, disease screening, and information retrieval.
Imagine you have a big toy box full of red balls and blue balls, and your job is to pick out all the red ones.
Precision asks: "Of the balls you picked, how many were actually red?" If you picked 10 balls and 8 were red, your precision is 8 out of 10.
Recall asks: "Of all the red balls in the box, how many did you find?" If there are 20 red balls total and you found 8, your recall is 8 out of 20.
The tricky part is that being really careful (high precision) means you might miss some red balls (low recall). And grabbing lots of balls to find more red ones (high recall) means you also grab more blue ones by mistake (low precision).
A precision-recall curve draws a picture of this trade-off. A really good ball-picker has a curve that stays high up near the top-right corner, meaning they find lots of red balls without grabbing too many blue ones.
The concepts of precision and recall have roots in information retrieval, where they were first formalized to evaluate search and retrieval systems. Kent, Berry, Luehrs, and Perry introduced these notions in 1955 in their work on operational criteria for information retrieval system design, though the specific term "precision" appeared somewhat later in the literature. Cyril Cleverdon's Cranfield experiments in the 1960s established a foundational methodology for evaluating retrieval systems using precision and recall on standardized test collections with predetermined relevant items.
Cleverdon's approach formed a blueprint for the Text Retrieval Conference (TREC), organized by the National Institute of Standards and Technology (NIST) beginning in 1992. TREC adopted precision-recall analysis as a core evaluation framework and popularized the 11-point interpolated average precision method. Mean Average Precision (MAP), the average of AP scores across a set of queries, became the standard single-figure metric for comparing retrieval systems in the TREC community.
C.J. van Rijsbergen's 1979 book Information Retrieval provided a theoretical foundation by introducing the effectiveness measure E, which was later reformulated as the F1 score. Van Rijsbergen showed that the harmonic mean is the appropriate method for combining precision and recall, a principle rooted in decreasing marginal relevance. The F-measure was subsequently adopted outside information retrieval when it was proposed for evaluation at the fourth Message Understanding Conference (MUC-4) in 1992.
In machine learning, Davis and Goadrich's 2006 paper at the International Conference on Machine Learning (ICML), "The Relationship Between Precision-Recall and ROC Curves," proved a formal correspondence between ROC space and PR space. They demonstrated that a curve dominates in ROC space if and only if it dominates in PR space, and they introduced the concept of an achievable PR curve, analogous to the convex hull in ROC space. Their work also showed that linear interpolation between operating points in PR space is inappropriate because precision does not vary linearly with recall.
Saito and Rehmsmeier's 2015 study in PLoS ONE further demonstrated that PR plots are more informative than ROC plots when evaluating classifiers on imbalanced datasets. They provided empirical evidence that the visual interpretability of ROC plots can be deceptive regarding classification reliability, owing to an intuitive but incorrect interpretation of specificity on skewed data.
Both precision and recall are derived from the confusion matrix, which summarizes the predictions of a binary classifier into four categories:
| | Predicted positive | Predicted negative |
|---|---|---|
| Actual positive | True Positive (TP) | False Negative (FN) |
| Actual negative | False Positive (FP) | True Negative (TN) |
Precision (also called positive predictive value) is defined as:
Precision = TP / (TP + FP)
It answers the question: "Of all instances the model labeled as positive, what fraction is truly positive?"
Recall (also called sensitivity or true positive rate) is defined as:
Recall = TP / (TP + FN)
It answers: "Of all actual positive instances, what fraction did the model correctly identify?"
Neither metric uses the true negative (TN) count. This is the fundamental reason why PR curves focus entirely on positive-class performance and are unaffected by the number of negative examples in the dataset.
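Since both formulas use only three of the four counts, a minimal sketch with arbitrary illustrative counts makes the computation concrete:

```python
# Minimal sketch: precision and recall from confusion-matrix counts.
# The counts below are arbitrary illustrative values.
tp, fp, fn, tn = 80, 20, 40, 860

precision = tp / (tp + fp)  # 80 / 100 = 0.80
recall = tp / (tp + fn)     # 80 / 120 ≈ 0.67

# Note that tn never appears in either formula.
print(f"precision={precision:.2f}, recall={recall:.2f}")
```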
Most classifiers output a continuous score or probability for each instance rather than a hard binary label. A classification threshold (also called a decision threshold) converts these continuous scores into binary predictions: instances with scores at or above the threshold are predicted positive, and those below are predicted negative.
As the threshold changes:

- Raising the threshold makes the classifier more conservative: fewer instances are predicted positive, which tends to raise precision and lower recall.
- Lowering the threshold makes the classifier more permissive: more instances are predicted positive, which tends to raise recall and lower precision.
This inverse relationship between precision and recall at different thresholds is exactly what the PR curve visualizes. It is worth noting that the relationship is not perfectly monotonic in every case. At certain thresholds, lowering the threshold may leave recall unchanged while precision fluctuates, depending on the distribution of scores.
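A small sketch illustrates how a decision threshold turns scores into labels and how lowering it trades precision for recall; the six labels and scores here are made up for illustration:

```python
import numpy as np

# Hypothetical ground-truth labels and model scores for six instances.
y_true = np.array([1, 0, 1, 1, 0, 0])
scores = np.array([0.95, 0.85, 0.70, 0.40, 0.30, 0.10])

for threshold in (0.9, 0.5, 0.2):
    y_pred = (scores >= threshold).astype(int)  # scores at/above threshold -> positive
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

Running this prints precision 1.00 / recall 0.33 at threshold 0.9, 0.67 / 0.67 at 0.5, and 0.60 / 1.00 at 0.2, tracing the trade-off directly.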
The process for building a PR curve follows these steps:

1. Use the model to assign a score to every instance in the evaluation set.
2. Sweep the decision threshold across the distinct score values, from highest to lowest.
3. At each threshold, convert the scores to binary predictions and compute precision and recall.
4. Plot each (recall, precision) pair and connect successive points.
The resulting curve generally starts near the top-left region (high precision, low recall at a very high threshold) and moves toward the bottom-right (lower precision, higher recall at a lower threshold). An ideal classifier would produce a curve that stays close to the top-right corner, where both precision and recall equal 1.0.
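A minimal sketch of this construction, assuming y_true and scores are NumPy arrays (in practice, sklearn.metrics.precision_recall_curve performs this computation):

```python
import numpy as np

def pr_curve(y_true, scores):
    """Compute (precision, recall) pairs by sweeping the threshold over
    the ranked scores; ties between equal scores are ignored for simplicity."""
    order = np.argsort(-scores)        # rank instances by score, descending
    y_sorted = y_true[order]
    tp = np.cumsum(y_sorted)           # true positives above each cut-off
    fp = np.cumsum(1 - y_sorted)       # false positives above each cut-off
    precision = tp / (tp + fp)
    recall = tp / y_sorted.sum()       # denominator: total number of positives
    return precision, recall
```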
PR curves characteristically exhibit a saw-tooth (zigzag) pattern rather than a smooth curve. This happens because:

- When the next-ranked instance is a true positive, both recall and precision increase, moving the curve up and to the right.
- When the next-ranked instance is a false positive, recall is unchanged while precision drops, producing a sharp vertical dip.
This jagged appearance is a natural artifact of the ranking process and is a well-documented property described in Manning, Raghavan, and Schütze's Introduction to Information Retrieval. It distinguishes PR curves from the typically smoother ROC curves.
A PR curve that stays high and close to the upper-right corner indicates a strong classifier that maintains high precision even as recall grows. A curve that drops quickly as recall increases suggests the classifier introduces many false positives when it tries to capture more of the positive class.
When comparing two models, the one whose curve lies above and to the right of the other is generally superior, because it achieves higher precision at every recall level.
The precision values at very low recall levels indicate how confident the model's top-ranked predictions are. High precision at the left end means the model's most confident positive predictions are reliable.
If precision drops sharply at high recall, the model is introducing many false positives to find the last remaining positives. This region is often the most informative for understanding the practical limits of a classifier.
If an application requires a minimum recall level (for example, at least 90% recall for a medical screening test), a useful comparison strategy is to check the precision each model achieves at that fixed recall value.
The baseline for a PR curve corresponds to a no-skill classifier that assigns positive labels at random. For such a classifier, precision equals the prevalence of the positive class:
Baseline precision = P / (P + N)
where P is the number of positive examples and N is the number of negative examples. This baseline appears as a horizontal line on the PR plot.
This context-dependent baseline is an important distinction from the ROC curve, where the random baseline is always a diagonal line with an AUC of 0.5, regardless of class distribution. Because the PR baseline changes with prevalence, it is important to plot or note the baseline when presenting PR curves so that readers can gauge how much better a model performs relative to chance.
For example, if only 2% of examples are positive, the random baseline sits at precision = 0.02. An AUPRC of 0.30 in this setting represents a substantial improvement over chance, whereas the same value on a dataset with 50% prevalence would be poor.
A perfect classifier would produce a single point at (recall = 1.0, precision = 1.0), representing the upper-right corner of the plot.
Summarizing the full PR curve into a single scalar value is useful for model comparison. Two related but distinct approaches are commonly used.
Average precision is defined as the weighted mean of precisions at each threshold, where the weight is the increase in recall from the previous threshold:
AP = sum over n of (R_n - R_{n-1}) * P_n
where P_n and R_n are the precision and recall at the n-th threshold, and R_0 = 0. This formulation computes a step-function approximation of the area under the PR curve. In scikit-learn, this is the method used by average_precision_score.
AP can also be understood intuitively: it is the average of the precision values obtained after each relevant (positive) document or example is retrieved. Higher AP indicates a better model.
The AUPRC can alternatively be computed using trapezoidal integration (e.g., via sklearn.metrics.auc). While related to AP, the trapezoidal method yields slightly different numerical values because it linearly interpolates between operating points rather than using a step function. Davis and Goadrich (2006) showed that the trapezoidal rule tends to overestimate the area in PR space because precision does not vary linearly with recall. For this reason, the step-function AP formulation is generally preferred.
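The two estimates can be compared directly. A small sketch on synthetic data; the make_classification parameters and the logistic regression model are arbitrary choices for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, average_precision_score, auc
from sklearn.model_selection import train_test_split

# Synthetic imbalanced problem: ~95% negatives (illustrative parameters only).
X, y = make_classification(n_samples=5000, weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

precision, recall, _ = precision_recall_curve(y_te, scores)
ap = average_precision_score(y_te, scores)   # step-function estimate
trap = auc(recall, precision)                # trapezoidal estimate

# The trapezoidal value is typically slightly larger (optimistic).
print(f"AP (step) = {ap:.4f}, AUC (trapezoid) = {trap:.4f}")
```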
Boyd, Eng, and Page (2013) performed a computational analysis of common AUPRC estimators and their confidence intervals. They found that some commonly used estimation procedures are invalid and recommended two simple interval estimation methods that are robust under various assumptions. Their work highlights the importance of reporting confidence intervals alongside point estimates of AUPRC, especially on small datasets.
Several interpolation methods have been developed to smooth the jagged PR curve for standardized comparison:
| Method | Description | Used by |
|---|---|---|
| All-point interpolation | Computes AP by summing precision at every threshold where recall changes; uses step-function approximation | scikit-learn (without the max-precision envelope), PASCAL VOC (2010 onward) |
| 11-point interpolation | Evaluates the maximum precision at 11 fixed recall levels: 0.0, 0.1, 0.2, ..., 1.0, and averages them | TREC (traditional), PASCAL VOC (2007) |
| 101-point interpolation | Evaluates the maximum precision at 101 recall levels: 0.00, 0.01, 0.02, ..., 1.00 | COCO benchmark |
| Trapezoidal rule | Uses linear interpolation between operating points; can overestimate area in PR space | General numerical integration |
For the interpolated methods (11-point, 101-point, and VOC all-point), the interpolated precision at a given recall level r is defined as the maximum precision value at any recall level r' >= r. This monotonically decreasing envelope eliminates the saw-tooth artifacts and produces a smoother curve; scikit-learn's average_precision_score deliberately omits the envelope and uses the raw step function.
The choice of interpolation method can affect the computed AP value. The 11-point method produces coarser estimates than the all-point method. The difference between methods (sometimes called "average precision distortion") can be non-trivial for certain classifiers, which means AP values computed using different methods are not directly comparable. The interpolation method should always be reported alongside AP results.
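A sketch of the interpolation envelope and the 11-point AP, assuming precision and recall arrays such as those returned by precision_recall_curve:

```python
import numpy as np

def interpolated_envelope(precision, recall):
    """Monotone envelope: interpolated precision at recall r is the
    maximum precision at any recall r' >= r."""
    order = np.argsort(recall)                    # sort by increasing recall
    p = precision[order]
    # running maximum taken from the right = max precision over recalls >= r
    return np.maximum.accumulate(p[::-1])[::-1], recall[order]

def eleven_point_ap(precision, recall):
    """11-point interpolated AP (traditional TREC / VOC 2007 style)."""
    p_env, r_sorted = interpolated_envelope(precision, recall)
    levels = np.linspace(0.0, 1.0, 11)            # recall levels 0.0, 0.1, ..., 1.0
    interp = [p_env[r_sorted >= r][0] if np.any(r_sorted >= r) else 0.0
              for r in levels]
    return np.mean(interp)
```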
Both the precision-recall curve and the ROC curve visualize classifier performance across thresholds, but they emphasize different aspects of performance.
| Aspect | PR curve | ROC curve |
|---|---|---|
| Y-axis | Precision | True positive rate (recall) |
| X-axis | Recall | False positive rate |
| Uses true negatives? | No | Yes (in FPR calculation) |
| Focus | Positive class only | Both classes equally |
| Random baseline | Horizontal line at prevalence (P / (P + N)) | Diagonal line (AUC = 0.5) |
| Sensitivity to class imbalance | High; reflects class distribution directly | Low; can mask poor precision on minority class |
| Convex hull property | Not guaranteed convex; achievable curve is the analog | Convex hull is well-defined |
| Comparable across datasets? | Only with same prevalence | Yes (baseline is fixed at 0.5) |
| Best suited for | Imbalanced datasets, rare event detection | Balanced datasets, general model comparison |
When the negative class vastly outnumbers the positive class, even a small false positive rate on the ROC curve can correspond to a large number of false positives in absolute terms. The ROC curve does not reveal this because its x-axis (FPR) divides by the large number of negatives, making even many false positives appear negligible.
The PR curve exposes this problem directly because precision divides by the total number of positive predictions (TP + FP). A large number of false positives will dramatically reduce precision, making the issue visible on the curve.
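A back-of-the-envelope example makes the effect concrete; the counts are hypothetical:

```python
# Hypothetical screening problem: 100 positives, 10,000 negatives.
P, N = 100, 10_000
tp = 90                  # the classifier finds 90 of the 100 positives
fpr = 0.01               # a seemingly tiny 1% false positive rate...
fp = fpr * N             # ...but that is 100 false positives in absolute terms

recall = tp / P                    # 0.90 -- looks excellent on a ROC plot
precision = tp / (tp + fp)         # 90 / 190 ≈ 0.47 -- fewer than half correct
print(f"recall={recall:.2f}, precision={precision:.2f}")
```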
Saito and Rehmsmeier (2015) demonstrated this effect quantitatively. In their experiments, a classifier with an ROC AUC of 0.957 had a PR AUC of only 0.708, showing that the ROC curve painted an overly optimistic picture of the classifier's actual utility for identifying the positive class. Their general finding was that the stronger the class imbalance, the bigger the gap between ROC AUC and PR AUC tends to be.
Davis and Goadrich (2006) proved the dominance equivalence theorem: a curve dominates in ROC space if and only if it dominates in PR space. This means that whenever one classifier's curve dominates another's, the two representations agree on which classifier is better. However, the visual appearance of the curves can be very different; a seemingly strong ROC curve may correspond to a mediocre PR curve on imbalanced data.
| Scenario | Recommended curve | Reason |
|---|---|---|
| Balanced binary classification | ROC curve | Both classes are equally represented; ROC gives a complete picture |
| Imbalanced binary classification | PR curve | Focuses on the minority (positive) class; exposes precision problems hidden by ROC |
| Rare event detection (fraud, rare disease) | PR curve | True negatives are abundant and uninformative; PR ignores them |
| Cost-asymmetric classification | PR curve | Directly examines the precision-recall trade-off relevant to cost functions |
| Information retrieval and ranking | PR curve | Standard evaluation framework; MAP is the established metric |
| General model comparison across datasets | ROC curve | Fixed baseline makes cross-dataset comparison possible |
| Both classes matter equally | ROC curve | FPR captures negative-class performance that PR ignores |
The F1 score is the harmonic mean of precision and recall:
F1 = 2 * (Precision * Recall) / (Precision + Recall)
Every point on a PR curve corresponds to a particular F1 score. The threshold that maximizes the F1 score is the one that provides the best balance between precision and recall under equal weighting.
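A sketch of locating the F1-optimal threshold from the arrays returned by precision_recall_curve, assuming y_true (binary labels) and y_scores (model scores) are already defined:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)

# precision and recall have one more element than thresholds;
# drop the final (recall=0, precision=1) point before pairing them up.
# A tiny epsilon guards against division by zero.
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best = np.argmax(f1)
print(f"best F1 = {f1[best]:.3f} at threshold {thresholds[best]:.3f}")
```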
The more general F-beta score allows adjusting the relative importance of precision versus recall:
F_beta = (1 + beta^2) * (Precision * Recall) / (beta^2 * Precision + Recall)
This generalization was introduced by van Rijsbergen (1979). When beta > 1, recall is weighted more heavily (useful in medical screening where missing a case is costly). When beta < 1, precision receives more weight (useful in spam filtering where false alarms are costly). Setting beta = 1 yields the standard F1 score.
| Beta value | Weighting | Typical use case |
|---|---|---|
| beta = 0.5 | Precision weighted twice as much as recall | Spam filtering, content moderation |
| beta = 1 | Equal weighting | General-purpose balanced evaluation |
| beta = 2 | Recall weighted twice as much as precision | Medical screening, safety-critical search |
Iso-F1 curves are contour lines on the precision-recall plot where every point yields the same F1 score. They are defined by the equation:
Precision = (F1 * Recall) / (2 * Recall - F1)
These curves form hyperbolic arcs on the PR plot. Plotting iso-F1 curves at several values (for example, 0.2, 0.4, 0.6, 0.8) provides visual reference lines that help practitioners see which F1 region a classifier's PR curve falls into. Points near the upper-right corner lie on iso-F1 curves close to 1.0; points near the axes lie on iso-F1 curves closer to 0.
Scikit-learn's precision-recall documentation includes code for overlaying iso-F1 curves on PR plots, making them a convenient visual aid when comparing multiple classifiers.
In object detection, mean average precision (mAP) extends the PR curve concept to multi-class localization tasks. The computation proceeds as follows:

1. Match each predicted bounding box to a ground-truth box of the same class, counting the prediction as a true positive when its intersection over union (IoU) with an unmatched ground-truth box exceeds a chosen threshold.
2. Rank the predictions for each class by confidence score and compute the class's PR curve and AP.
3. Average the per-class AP values to obtain mAP.
Different benchmarks use different conventions:
| Benchmark | IoU threshold(s) | Interpolation method | Notation |
|---|---|---|---|
| PASCAL VOC 2007 | 0.50 | 11-point interpolation | mAP |
| PASCAL VOC 2010+ | 0.50 | All-point interpolation | mAP |
| COCO (primary) | 0.50 to 0.95, step 0.05 | 101-point interpolation | mAP@[.50:.05:.95] |
| COCO (loose) | 0.50 | 101-point interpolation | AP50 |
| COCO (strict) | 0.75 | 101-point interpolation | AP75 |
The COCO benchmark averages AP across 10 IoU thresholds (from 0.50 to 0.95 in steps of 0.05) and over all 80 object categories, resulting in a single mAP number that rewards both correct classification and precise localization. This is widely considered the current standard metric for evaluating object detection models.
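A simplified sketch of steps 2 and 3 above, assuming detections have already been matched to ground truth at a fixed IoU threshold (the matching of step 1 is omitted); per_class is a hypothetical list of (scores, is_tp, n_ground_truth) tuples, one per class:

```python
import numpy as np

def average_precision(scores, is_tp, n_ground_truth):
    """All-point-interpolation AP for one class, given detection confidence
    scores and a boolean array marking which detections matched a ground truth."""
    order = np.argsort(-scores)                  # rank detections by confidence
    tp = np.cumsum(is_tp[order])
    fp = np.cumsum(~is_tp[order])
    recall = tp / n_ground_truth
    precision = tp / (tp + fp)
    # monotone precision envelope, then sum precision * recall increments
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    return np.sum(np.diff(recall, prepend=0.0) * precision)

def mean_average_precision(per_class):
    """mAP: unweighted mean of per-class AP values."""
    return np.mean([average_precision(*args) for args in per_class])
```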
PR curves have been used in information retrieval since the field's earliest formal evaluations. In this context, precision measures the proportion of retrieved documents that are relevant, and recall measures the proportion of relevant documents that have been retrieved. Mean Average Precision (MAP) remains the standard metric for comparing retrieval systems in TREC evaluations. The 11-point interpolated precision-recall curve was the traditional visualization method, though all-point AP has largely replaced it.
In clinical settings, PR curves help evaluate diagnostic tests and predictive models, particularly for rare diseases where prevalence is low. Ozenne, Subtil, and Maucort-Boulch (2015) showed that the PR curve overcame the optimism of the ROC curve for rare diseases. With a disease prevalence of 1%, they found that classifiers could achieve ROC AUC values above 0.9 while having AUPRC values below 0.2, illustrating that the ROC metric failed to reflect the practical difficulty of accurate diagnosis.
In medical applications, threshold selection is especially important because the consequences of both false positives (unnecessary procedures, patient anxiety, wasted resources) and false negatives (missed diagnoses, delayed treatment) are significant. The PR curve helps clinicians visualize this trade-off and choose an operating point appropriate for the clinical context.
Financial fraud detection operates on highly imbalanced data because fraudulent transactions are typically a tiny fraction of all transactions. PR curves are preferred over ROC curves in this domain because they directly measure how well the model identifies fraud (precision) without being diluted by the massive number of legitimate transactions that serve as true negatives.
PR curves are used to evaluate text classification, named entity recognition, relation extraction, and other NLP tasks where class distributions are often skewed. In sentiment analysis, PR curves can help compare models and select appropriate thresholds for different deployment contexts.
As described in the mAP section above, PR curves are the foundation of the primary evaluation metrics in object detection and image segmentation benchmarks including PASCAL VOC, COCO, and Open Images.
Scikit-learn provides several functions for computing and displaying PR curves:
| Function / Class | Purpose |
|---|---|
| sklearn.metrics.precision_recall_curve | Returns arrays of precision, recall, and thresholds |
| sklearn.metrics.average_precision_score | Computes average precision directly from labels and scores |
| sklearn.metrics.PrecisionRecallDisplay | Generates PR curve plots; supports from_estimator and from_predictions class methods |
| sklearn.metrics.auc | Computes area under any curve using the trapezoidal rule |
Basic usage:
```python
from sklearn.metrics import precision_recall_curve, average_precision_score
from sklearn.metrics import PrecisionRecallDisplay

# y_true holds the binary labels; y_scores holds the model's continuous scores

# Compute precision-recall pairs for every threshold
precision, recall, thresholds = precision_recall_curve(y_true, y_scores)

# Compute average precision
ap = average_precision_score(y_true, y_scores)

# Plot the curve with the chance-level baseline
disp = PrecisionRecallDisplay.from_predictions(
    y_true, y_scores,
    name="My Classifier",
    plot_chance_level=True,
)
```
The plot_chance_level=True parameter draws the baseline horizontal line at the prevalence of the positive class, making it easy to assess performance relative to a random classifier.
For multi-class problems, scikit-learn supports the one-vs-rest decomposition using OneVsRestClassifier. Separate PR curves are computed for each class by binarizing the labels, and a micro-averaged PR curve can aggregate performance across all classes:
```python
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import precision_recall_curve, average_precision_score

# Binarize labels for multi-class (Y_test and y_score are assumed to come
# from a train/test split and a fitted OneVsRestClassifier's scores)
Y = label_binarize(y, classes=[0, 1, 2])
n_classes = Y.shape[1]

precision, recall, average_precision = {}, {}, {}

# Compute per-class PR curves and AP
for i in range(n_classes):
    precision[i], recall[i], _ = precision_recall_curve(Y_test[:, i], y_score[:, i])
    average_precision[i] = average_precision_score(Y_test[:, i], y_score[:, i])

# Micro-averaged AP across all classes
average_precision["micro"] = average_precision_score(Y_test, y_score, average="micro")
```
Iso-F1 contours can be overlaid on an existing PR plot as follows:

```python
import numpy as np
import matplotlib.pyplot as plt

# Each contour solves Precision = F1 * Recall / (2 * Recall - F1)
f_scores = np.linspace(0.2, 0.8, num=4)
for f in f_scores:
    x = np.linspace(0.01, 1)                     # recall values (50 points)
    y = f * x / (2 * x - f)                      # corresponding precision
    plt.plot(x[y >= 0], y[y >= 0], color="gray", alpha=0.3)
    plt.annotate(f"F1={f:.1f}", xy=(0.9, y[45] + 0.02))
```
The TorchMetrics library (part of the PyTorch Lightning ecosystem) provides AveragePrecision for computing AP from model outputs. It supports binary, multiclass, and multilabel settings.
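A minimal usage sketch, assuming a recent TorchMetrics version that supports the task argument; the tensors are toy values:

```python
import torch
from torchmetrics import AveragePrecision

# Binary task: raw probabilities and integer labels (made-up values).
metric = AveragePrecision(task="binary")
preds = torch.tensor([0.95, 0.10, 0.80, 0.40])
target = torch.tensor([1, 0, 1, 0])
print(metric(preds, target))  # tensor(1.) for this perfectly ranked toy case
```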
In R, the yardstick package in the tidymodels ecosystem provides pr_curve() and average_precision() functions. The PRROC package offers dedicated tools for computing and plotting PR and ROC curves with proper interpolation.
PR curves are fundamentally a binary classification tool. Extending them to multi-class or multi-label settings requires decomposition strategies:
| Strategy | Description | Weighting |
|---|---|---|
| One-vs-rest (OvR) | Treat each class as positive vs. all others; compute a separate PR curve and AP per class | Depends on aggregation |
| Micro-averaging | Aggregate TP, FP, and FN counts across all classes before computing a single PR curve | More weight to frequent classes |
| Macro-averaging | Compute AP for each class independently, then average the AP values | Equal weight to all classes |
In multi-label classification, where each instance can belong to multiple classes simultaneously, the one-vs-rest approach is standard. The micro-averaged AP provides an overall performance measure, while per-class AP values reveal performance differences across individual labels.
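Both aggregation schemes are available directly in scikit-learn. A short sketch, assuming binarized labels Y_test and a score matrix y_score as in the one-vs-rest code above:

```python
from sklearn.metrics import average_precision_score

# Micro: pool all (label, score) pairs across classes, then compute one AP.
ap_micro = average_precision_score(Y_test, y_score, average="micro")
# Macro: compute AP per class, then take the unweighted mean.
ap_macro = average_precision_score(Y_test, y_score, average="macro")
print(f"micro AP = {ap_micro:.3f}, macro AP = {ap_macro:.3f}")
```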