The Receiver Operating Characteristic (ROC) curve is a graphical tool that illustrates the diagnostic ability of a binary classification system as its discrimination threshold is varied. It plots the true positive rate (TPR, also called sensitivity or recall) on the y-axis against the false positive rate (FPR, equal to 1 minus specificity) on the x-axis at every possible threshold setting. The result is a curve that reveals the tradeoff between correctly identifying positive instances and incorrectly flagging negative instances as positive.
ROC curves are one of the most widely used evaluation tools in machine learning, statistics, medicine, and signal processing. They provide a threshold-independent view of classifier performance, making them especially valuable when the operating conditions (i.e., the cost of false positives versus false negatives) are unknown at training time or may change after deployment.
The ROC curve traces its roots to World War II. Following the attack on Pearl Harbor in 1941, the United States military launched research programs to improve the accuracy of radar-based detection of enemy aircraft. Early radar operators struggled to distinguish genuine targets (enemy planes) from noise (birds, weather patterns, and other environmental clutter). Engineers developed what they called the "receiver operating characteristic" to quantify how well a radar receiver could separate true signals from false alarms at different sensitivity settings. The name comes directly from this radar context, where the "receiver" refers to the radar receiver and "operating characteristic" describes the performance curve at various operating points.
After the war, the concept migrated into psychology through signal detection theory (SDT), which modeled human perception as a noisy detection problem. In SDT, any detection task is framed as distinguishing a "signal" embedded in noise from noise alone, with the observer applying an internal decision criterion. Psychologists W. P. Tanner and John A. Swets formalized the ROC framework in the 1950s and 1960s. Their 1954 paper "A decision-making theory of visual detection" introduced the mathematical basis connecting signal detection to ROC analysis. In 1966, David M. Green and John A. Swets published Signal Detection Theory and Psychophysics, a landmark text that consolidated the theory and established ROC analysis as a rigorous analytical method across multiple disciplines. The book showed how the ROC curve could isolate a pure measure of discrimination accuracy by separating sensory ability from the observer's response bias (the placement of the decision criterion).
By the 1970s and 1980s, ROC analysis had become standard practice in radiology and clinical medicine for evaluating diagnostic tests. Lee B. Lusted's 1971 paper "Signal detectability and medical decision-making" in Science is widely credited with introducing ROC analysis to medicine. Lusted demonstrated its value for comparing the accuracy of different radiologists and imaging techniques, and he employed a five-category confidence rating scale for observer studies. Charles E. Metz later developed parametric ROC fitting methods and widely used software (ROCFIT, LABROC) that became standard in radiological research through the 1980s and 1990s.
In the 1990s, the machine learning community adopted ROC curves as a primary tool for assessing classifier performance. Tom Fawcett's 2006 tutorial paper "An introduction to ROC analysis" in Pattern Recognition Letters became a widely cited reference that helped standardize ROC methodology in the ML community. Today, ROC analysis remains central to model evaluation across machine learning, medical diagnostics, meteorological forecasting, biometric authentication, and credit scoring.
Before constructing an ROC curve, it helps to review several foundational ideas from binary classification.
A confusion matrix summarizes the outcomes of a classifier at a single threshold:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |
From these four counts, several metrics follow; the two rates that define each point on the ROC curve are the TPR and FPR:
| Metric | Formula | Also known as |
|---|---|---|
| True Positive Rate (TPR) | TP / (TP + FN) | Sensitivity, Recall, Hit Rate |
| False Positive Rate (FPR) | FP / (FP + TN) | 1 - Specificity, Fall-out |
| Specificity | TN / (TN + FP) | True Negative Rate, Selectivity |
| Positive Likelihood Ratio | TPR / FPR | Sensitivity / (1 - Specificity) |
TPR measures the proportion of actual positives that the classifier correctly identifies. FPR measures the proportion of actual negatives that the classifier incorrectly labels as positive. The positive likelihood ratio, which equals the slope of the line from the origin to a given operating point on the ROC curve, indicates how much more likely a positive result is in someone with the condition compared to someone without it.
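As a quick illustration, the sketch below derives these rates from a single confusion matrix with scikit-learn; the y_true and y_pred arrays are hypothetical hard predictions at one fixed threshold, not data from this article.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical ground truth and hard predictions at a single threshold
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 1])

# confusion_matrix returns [[TN, FP], [FN, TP]] for binary 0/1 labels
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

tpr = tp / (tp + fn)                              # sensitivity / recall
fpr = fp / (fp + tn)                              # fall-out, 1 - specificity
specificity = tn / (tn + fp)                      # true negative rate
lr_pos = tpr / fpr if fpr > 0 else float("inf")   # positive likelihood ratio

print(f"TPR={tpr:.2f}  FPR={fpr:.2f}  specificity={specificity:.2f}  LR+={lr_pos:.2f}")
```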
Most classifiers produce a continuous score or probability for each instance rather than a hard label. The classification threshold is the cutoff value above which an instance is predicted positive and below which it is predicted negative. Lowering the threshold makes the classifier more liberal (more positives predicted, increasing both TPR and FPR), while raising it makes the classifier more conservative (fewer positives predicted, decreasing both TPR and FPR). The ROC curve captures this entire spectrum.
Building an ROC curve involves the following steps:

1. Score every instance in the test set with the classifier.
2. Sweep the decision threshold across the range of scores, typically using every distinct score as a candidate threshold.
3. At each threshold, classify the instances, form the confusion matrix, and compute TPR and FPR.
4. Plot each (FPR, TPR) pair and connect consecutive points to form the curve.
The curve always starts at the origin (0, 0), corresponding to a threshold so high that nothing is predicted positive, and ends at the point (1, 1), corresponding to a threshold so low that everything is predicted positive.
For a classifier that produces a continuous score X, let f_1(x) denote the probability density of scores for the positive class and f_0(x) denote the density for the negative class. At a given threshold T, an instance is predicted positive when its score exceeds T, so TPR(T) = integral from T to infinity of f_1(x) dx and FPR(T) = integral from T to infinity of f_0(x) dx.
The ROC curve is generated parametrically by varying T from positive infinity (where TPR = FPR = 0) to negative infinity (where TPR = FPR = 1). This parametric formulation shows that the ROC curve depends only on the relative positions and shapes of the two score distributions, not on any particular threshold choice.
When the positive and negative class scores follow Gaussian distributions with means mu_1 and mu_0 and variances sigma_1^2 and sigma_0^2, the ROC curve follows a binormal model. In this case, the curve can be described by two parameters: a = (mu_1 - mu_0) / sigma_1 and b = sigma_0 / sigma_1. The binormal model has been widely used in radiology for fitting smooth parametric ROC curves to empirical data.
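To make the binormal model concrete, the sketch below (with hypothetical parameters, not values from this article) evaluates the parametric curve TPR = Phi(a + b x Phi^-1(FPR)) and the closed-form AUC implied by the Gaussian assumption.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical binormal parameters: negatives ~ N(0, 1), positives ~ N(1.5, 1.25^2)
mu_0, sigma_0 = 0.0, 1.0
mu_1, sigma_1 = 1.5, 1.25
a = (mu_1 - mu_0) / sigma_1          # 1.2
b = sigma_0 / sigma_1                # 0.8

# Parametric ROC curve: TPR = Phi(a + b * Phi^-1(FPR))
fpr_grid = np.linspace(1e-6, 1 - 1e-6, 200)
tpr_grid = norm.cdf(a + b * norm.ppf(fpr_grid))

# Closed-form AUC under the binormal model
auc = norm.cdf((mu_1 - mu_0) / np.sqrt(sigma_0**2 + sigma_1**2))
print(f"Binormal AUC = {auc:.3f}")   # about 0.83 for these parameters
```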
Consider a simple test set with 5 positive and 5 negative instances ranked by predicted score:
| Rank | Score | True label |
|---|---|---|
| 1 | 0.95 | Positive |
| 2 | 0.88 | Positive |
| 3 | 0.80 | Negative |
| 4 | 0.75 | Positive |
| 5 | 0.70 | Positive |
| 6 | 0.60 | Negative |
| 7 | 0.55 | Negative |
| 8 | 0.40 | Positive |
| 9 | 0.30 | Negative |
| 10 | 0.20 | Negative |
At a threshold of 0.95, only rank 1 is classified positive: TPR = 1/5 = 0.2, FPR = 0/5 = 0.0. At a threshold of 0.80, ranks 1 through 3 are classified positive: TPR = 2/5 = 0.4, FPR = 1/5 = 0.2. Continuing this process for all thresholds generates the full set of (FPR, TPR) points that form the ROC curve.
The resulting ROC points for this example are:
| Threshold | TPR | FPR |
|---|---|---|
| > 0.95 | 0.0 | 0.0 |
| 0.95 | 0.2 | 0.0 |
| 0.88 | 0.4 | 0.0 |
| 0.80 | 0.4 | 0.2 |
| 0.75 | 0.6 | 0.2 |
| 0.70 | 0.8 | 0.2 |
| 0.60 | 0.8 | 0.4 |
| 0.55 | 0.8 | 0.6 |
| 0.40 | 1.0 | 0.6 |
| 0.30 | 1.0 | 0.8 |
| 0.20 | 1.0 | 1.0 |
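These points can be reproduced with scikit-learn's roc_curve, as in the sketch below; drop_intermediate=False keeps one point per distinct score so the output matches the table row for row (the first threshold is a sentinel above the maximum score, corresponding to the "> 0.95" row).

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Scores and labels from the worked example above (1 = positive, 0 = negative)
y_true   = np.array([1, 1, 0, 1, 1, 0, 0, 1, 0, 0])
y_scores = np.array([0.95, 0.88, 0.80, 0.75, 0.70, 0.60, 0.55, 0.40, 0.30, 0.20])

fpr, tpr, thresholds = roc_curve(y_true, y_scores, drop_intermediate=False)
for t, f, s in zip(thresholds, fpr, tpr):
    print(f"threshold={t}  FPR={f:.1f}  TPR={s:.1f}")

print("AUC =", roc_auc_score(y_true, y_scores))   # 0.80 for this example
```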
Several landmarks on the ROC plot carry specific meanings:
| Point or region | Coordinates | Meaning |
|---|---|---|
| Bottom-left corner | (0, 0) | Threshold is so high that the classifier predicts everything as negative |
| Top-left corner | (0, 1) | Perfect classifier: all positives caught, no false alarms |
| Top-right corner | (1, 1) | Threshold is so low that the classifier predicts everything as positive |
| Diagonal line | TPR = FPR | Performance equivalent to random guessing |
| Below the diagonal | TPR < FPR | Classifier performs worse than random (inverting predictions would improve it) |
A curve that hugs the top-left corner indicates strong discriminative ability. The closer the curve stays to the upper-left, the better the classifier separates the two classes. A curve that follows the diagonal line from (0, 0) to (1, 1) offers no discriminative value; it performs no better than flipping a coin.
When a curve dips below the diagonal, it means the model's predictions are inversely correlated with the true labels. Simply flipping the model's outputs would produce a curve above the diagonal.
The slope of the ROC curve at any point equals the likelihood ratio at the corresponding threshold. Points where the curve is steep (high slope) represent thresholds where positives are much more likely than negatives to receive that score, indicating high diagnostic value. Points where the curve is flat (low slope) represent thresholds with poor discriminative power.
The Area Under the ROC Curve, commonly abbreviated as AUC or AUC-ROC, condenses the entire ROC curve into a single scalar value. AUC ranges from 0 to 1 and has a useful probabilistic interpretation: it equals the probability that the classifier will assign a higher score to a randomly chosen positive instance than to a randomly chosen negative instance. This interpretation was formally established by Hanley and McNeil in their 1982 paper in Radiology, where they showed the equivalence between AUC and the Wilcoxon-Mann-Whitney U statistic.
While interpretation depends on the specific domain and application, general guidelines are:
| AUC range | Interpretation |
|---|---|
| 0.90 - 1.00 | Excellent discrimination |
| 0.80 - 0.90 | Good discrimination |
| 0.70 - 0.80 | Fair discrimination |
| 0.60 - 0.70 | Poor discrimination |
| 0.50 - 0.60 | Near-random; little discriminative value |
| Below 0.50 | Worse than random (predictions are inverted) |
These ranges are only rough guidelines. An AUC of 0.75 might be considered excellent for predicting rare geopolitical events but inadequate for a medical screening test that already has established alternatives with AUC above 0.90.
AUC can be computed in several ways:
| Method | Description |
|---|---|
| Trapezoidal rule | Approximate the area under the piecewise-linear ROC curve using trapezoids between consecutive points. This is the most common numerical method and is used by scikit-learn. Applied to the full empirical ROC curve (one point per distinct score), it coincides with the Mann-Whitney estimate below. |
| Mann-Whitney U statistic | AUC = U / (n_pos x n_neg), where U counts the number of positive-negative pairs in which the positive instance receives a higher score. Ties contribute 0.5. Hanley and McNeil (1982) proved this equivalence. |
| Wilcoxon rank-sum | Equivalent to the Mann-Whitney approach. Rank all instances by predicted score, sum the ranks of the positive instances, then compute: AUC = (sum_of_positive_ranks - n_pos x (n_pos + 1) / 2) / (n_pos x n_neg). |
| Analytical (parametric) | When score distributions follow known distributions (e.g., Gaussian), AUC can be computed in closed form: AUC = Phi((mu_1 - mu_0) / sqrt(sigma_1^2 + sigma_0^2)), where Phi is the standard normal CDF. |
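The equivalence of the first three methods can be verified directly. The sketch below computes AUC by the trapezoidal rule, by Mann-Whitney pair counting, and by the Wilcoxon rank-sum formula; y_true and y_scores are assumed to be arrays of binary labels and predicted scores (for instance, the ten instances from the worked example, which give AUC = 0.80 by all three routes).

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.metrics import roc_curve, auc as trapezoid_auc

def auc_three_ways(y_true, y_scores):
    y_true, y_scores = np.asarray(y_true), np.asarray(y_scores)

    # 1. Trapezoidal rule on the empirical ROC curve
    fpr, tpr, _ = roc_curve(y_true, y_scores)
    a_trap = trapezoid_auc(fpr, tpr)

    # 2. Mann-Whitney pair counting (ties contribute 0.5)
    pos, neg = y_scores[y_true == 1], y_scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    a_mw = ((diff > 0) + 0.5 * (diff == 0)).mean()

    # 3. Wilcoxon rank-sum formula (average ranks handle ties)
    ranks = rankdata(y_scores)
    n_pos, n_neg = len(pos), len(neg)
    a_ranksum = (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

    return a_trap, a_mw, a_ranksum   # all three agree
```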
The Gini coefficient, widely used in credit scoring and economics, is related to AUC by the formula:
Gini = 2 x AUC - 1
A Gini of 0 corresponds to AUC = 0.5 (random), and a Gini of 1 corresponds to AUC = 1.0 (perfect). This relationship means that any statement about AUC can be directly translated into a statement about the Gini coefficient and vice versa. The Gini coefficient is sometimes called the Accuracy Ratio in credit risk modeling or Somers' D in ordinal statistics.
In survival analysis, the concordance index (c-statistic or c-index), first proposed by Frank Harrell, measures how well a model discriminates between subjects who experience an event and those who do not. For binary outcomes without censoring, the c-index is mathematically identical to the AUC. With right-censored survival data, the c-index generalizes the AUC concept by considering only "comparable" pairs of subjects (pairs where the ordering of event times can be determined). Harrell's c-index is the most commonly used version, though Uno et al. proposed an inverse-probability-of-censoring weighted estimator that is less biased when censoring is heavy.
In many practical settings, only a restricted region of the ROC curve is operationally relevant. For example, a population screening test might only be viable at FPR values below 0.05. Partial AUC (pAUC) measures the area under the ROC curve within a specified FPR range [0, max_fpr], providing a more focused measure of performance in the region that matters.
McClish (1989) introduced the standardized partial AUC, which normalizes the pAUC so that it ranges from 0.5 (random) to 1.0 (perfect) within the restricted region, making it directly comparable to full AUC. In scikit-learn, partial AUC is computed via the max_fpr parameter in roc_auc_score. Note that partial AUC is currently limited to binary classification; multi-class partial AUC is not yet supported in most standard libraries.
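In code, the partial AUC looks like the following sketch; y_true and y_scores are assumed to be binary labels and predicted scores, and scikit-learn's max_fpr returns the McClish-standardized value.

```python
from sklearn.metrics import roc_auc_score

# Standardized partial AUC over the region FPR <= 0.05 (McClish 1989)
pauc = roc_auc_score(y_true, y_scores, max_fpr=0.05)
print(f"Standardized pAUC (FPR <= 0.05): {pauc:.3f}")
```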
One of the most popular methods for choosing an operating point on the ROC curve is Youden's J statistic (also called the Youden Index), introduced by W. J. Youden in 1950. It is defined as:
J = Sensitivity + Specificity - 1 = TPR - FPR
The J statistic ranges from -1 to 1. A value of 0 means the test is no better than random, while a value of 1 indicates a perfect test with no false positives or false negatives. The optimal threshold is the one that maximizes J.
Geometrically, the maximum J value corresponds to the point on the ROC curve that is farthest (in the vertical direction) from the diagonal line of random chance. Because the ROC curve is generally convex, this point tends to be near the upper-left "elbow" of the curve.
Youden's J statistic is widely used in medical diagnostics to determine cutoff values for clinical tests, such as the optimal PSA level for prostate cancer screening or the optimal blood glucose level for diabetes diagnosis. A notable advantage is that J does not depend on disease prevalence, making it transportable across populations with different base rates (unlike predictive values such as PPV and NPV). However, J assumes that false positives and false negatives are equally costly. In applications where one type of error carries greater consequences, the optimal threshold should be adjusted accordingly.
When the costs of different errors are unequal (for example, missing a cancer diagnosis is far more costly than a false alarm), the optimal threshold shifts. One approach is to define a cost function:
C = c_FP x FPR x (n_negative / N) + c_FN x (1 - TPR) x (n_positive / N)
where c_FP and c_FN are the costs of false positives and false negatives, n_negative and n_positive are the class counts, and N is the total number of instances. The optimal operating point minimizes this expected cost along the ROC curve. Graphically, this corresponds to drawing iso-performance lines (lines of equal expected cost) in ROC space. The slope of these lines equals (c_FP / c_FN) x (n_negative / n_positive), and the optimal point is where the ROC curve is tangent to the lowest-cost iso-performance line.
In medical screening for serious conditions, c_FN is typically set much higher than c_FP, which pushes the optimal threshold toward higher sensitivity (lower on the threshold scale) at the expense of more false positives. Conversely, in spam filtering, a false positive (marking a legitimate email as spam) may be considered more costly than a false negative (letting a spam email through), pushing the threshold toward higher specificity.
An alternative geometric approach selects the threshold whose ROC point is closest to the ideal point (0, 1) in Euclidean distance. The selected threshold minimizes:
d = sqrt((1 - TPR)^2 + FPR^2)
This method implicitly gives equal weight to sensitivity and specificity, similar to Youden's J, but uses Euclidean distance rather than vertical distance from the diagonal.
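The sketch below illustrates both criteria, assuming y_true and y_scores are binary labels and predicted scores and using a hypothetical cost ratio in which a false negative is five times as costly as a false positive.

```python
import numpy as np
from sklearn.metrics import roc_curve

c_fp, c_fn = 1.0, 5.0                 # hypothetical error costs
fpr, tpr, thresholds = roc_curve(y_true, y_scores)

pi_pos = np.mean(y_true)              # prevalence of the positive class
pi_neg = 1.0 - pi_pos

# Expected cost per instance at each candidate threshold
cost = c_fp * pi_neg * fpr + c_fn * pi_pos * (1 - tpr)
cost_optimal = thresholds[np.argmin(cost)]

# Closest-to-(0, 1) criterion for comparison
dist = np.sqrt((1 - tpr) ** 2 + fpr ** 2)
closest_to_ideal = thresholds[np.argmin(dist)]

print(f"Cost-minimizing threshold: {cost_optimal}")
print(f"Closest-to-(0, 1) threshold: {closest_to_ideal}")
```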
ROC curves enable direct visual comparison of multiple classifiers on the same dataset. If one model's ROC curve lies entirely above another's, the first model dominates the second at every threshold. When curves cross, neither model is uniformly better, and the choice depends on the operating region of interest.
To determine whether the difference in AUC between two models is statistically significant, practitioners commonly use the DeLong test, proposed by DeLong, DeLong, and Clarke-Pearson in 1988. This nonparametric method exploits the equivalence between AUC and the Mann-Whitney U statistic. The procedure works as follows:

1. Compute the AUC of each model on the same test set.
2. Estimate the variance of each AUC and the covariance between them from per-instance placement values (the structural components of the U statistic).
3. Form the z-statistic z = (AUC_1 - AUC_2) / sqrt(Var(AUC_1) + Var(AUC_2) - 2 Cov(AUC_1, AUC_2)) and compare it to the standard normal distribution to obtain a two-sided p-value.
A p-value below 0.05 is typically taken as evidence that the two AUCs differ significantly. The DeLong test is available in the pROC package in R and, in Python, through packages such as MLstatkit or standalone NumPy/SciPy implementations (a minimal sketch follows below). One important caveat: the DeLong test can be unreliable when applied to nested models (where one model's features are a subset of the other's). In such cases, the null distribution of the test statistic becomes non-normal, and the likelihood ratio test or bootstrap-based tests may be more appropriate.
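A minimal NumPy/SciPy sketch of the DeLong z-test follows; scores_a and scores_b are assumed to be two models' scores for the same test instances, and the covariance is estimated from the placement values described above.

```python
import numpy as np
from scipy.stats import norm

def _placements(pos_scores, neg_scores):
    """Placement values V10 (one per positive) and V01 (one per negative)."""
    diff = pos_scores[:, None] - neg_scores[None, :]
    psi = (diff > 0).astype(float) + 0.5 * (diff == 0)   # 1 if positive outranks negative, 0.5 on ties
    return psi.mean(axis=1), psi.mean(axis=0)

def delong_test(y_true, scores_a, scores_b):
    """Two-sided DeLong test for two correlated AUCs measured on the same test set."""
    y_true = np.asarray(y_true)
    pos, neg = y_true == 1, y_true == 0
    m, n = int(pos.sum()), int(neg.sum())

    aucs, v10s, v01s = [], [], []
    for s in (np.asarray(scores_a, float), np.asarray(scores_b, float)):
        v10, v01 = _placements(s[pos], s[neg])
        aucs.append(v10.mean())          # AUC = mean placement of the positives
        v10s.append(v10)
        v01s.append(v01)

    # DeLong covariance of the two AUC estimates
    s10 = np.cov(np.vstack(v10s))        # 2x2 covariance over positives
    s01 = np.cov(np.vstack(v01s))        # 2x2 covariance over negatives
    var = s10 / m + s01 / n

    z = (aucs[0] - aucs[1]) / np.sqrt(var[0, 0] + var[1, 1] - 2 * var[0, 1])
    p = 2 * norm.sf(abs(z))              # two-sided p-value
    return aucs, z, p
```

A small p-value from this test suggests the observed AUC difference is unlikely to be explained by sampling variability alone.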
Before the DeLong test became standard, Hanley and McNeil (1983) published a method for comparing two correlated AUCs. Their approach uses exponential approximations to estimate the correlation between AUC values from two tests applied to the same subjects. While simpler to implement, the Hanley-McNeil method is generally less accurate than the DeLong method and has been largely superseded in modern practice.
Reporting a point estimate of AUC without a confidence interval can be misleading, especially with small test sets. Several methods exist for constructing confidence intervals:
| Method | Type | Description |
|---|---|---|
| Hanley-McNeil | Parametric | Uses a closed-form variance estimate: SE = sqrt{ [A(1 - A) + (n_a - 1)(Q_1 - A^2) + (n_n - 1)(Q_2 - A^2)] / (n_a x n_n) }, where A is the AUC, n_a and n_n are the number of positives and negatives, Q_1 = A / (2 - A), and Q_2 = 2A^2 / (1 + A). Simple but can underestimate variance when there are ties in scores. |
| DeLong | Nonparametric | Based on the Mann-Whitney U-statistic variance. More accurate than Hanley-McNeil for most practical settings. Recommended as the default method. |
| Bootstrap | Nonparametric | Resamples the test set with replacement (typically 2,000 to 10,000 iterations) and computes AUC on each resample. Produces bias-corrected and accelerated (BCa) confidence intervals. Flexible and makes no distributional assumptions but is computationally intensive. First introduced by Bradley Efron. |
| Logit transformation | Nonparametric | Applies a logit transform to AUC before computing the interval, then transforms back. Ensures the interval stays within [0, 1], which is particularly useful when AUC is close to 0 or 1. |
A narrow 95% confidence interval indicates that the AUC estimate is stable, while a wide interval suggests that more test data may be needed before drawing conclusions.
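As an illustration of the bootstrap approach, the sketch below computes a simple percentile interval (rather than the BCa interval mentioned in the table); y_true and y_scores are assumed to be binary labels and predicted scores, and the helper name bootstrap_auc_ci is hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y_true, y_scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for AUC."""
    rng = np.random.default_rng(seed)
    y_true, y_scores = np.asarray(y_true), np.asarray(y_scores)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))   # resample with replacement
        if len(np.unique(y_true[idx])) < 2:               # AUC needs both classes present
            continue
        aucs.append(roc_auc_score(y_true[idx], y_scores[idx]))
    lower, upper = np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lower, upper
```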
The ROC Convex Hull (ROCCH) is the upper boundary of the convex hull of the empirical ROC curve, equivalently its least concave majorant. It represents the best achievable performance through randomized combinations of classifiers or operating points.
Given two operating points A and B on an ROC curve, any point on the straight line segment between them can be achieved by randomly choosing between the two classifiers with appropriate probabilities. If the original ROC curve is non-convex (dips below the line connecting two of its points), the ROCCH "fills in" those concavities, since a randomized mixture of the endpoints would outperform the non-convex portion.
The ROCCH is useful for classifier selection under varying operating conditions. Each point on the hull corresponds to either a single operating point or a stochastic mixture of two adjacent operating points that is optimal under some cost ratio or class distribution. Classifiers whose operating points lie inside (below) the convex hull are dominated and should not be used under any cost scenario.
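One way to compute the hull from empirical ROC points is a standard upper-hull sweep, sketched below; fpr and tpr are assumed to come from roc_curve, and the helper name roc_convex_hull is hypothetical.

```python
import numpy as np

def roc_convex_hull(fpr, tpr):
    """Upper boundary of the convex hull of ROC points, from (0, 0) to (1, 1)."""
    pts = sorted(set(zip(fpr, tpr)) | {(0.0, 0.0), (1.0, 1.0)})
    hull = []
    for p in pts:
        # Pop the last point while it lies on or below the chord to the new point
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            if (x2 - x1) * (p[1] - y1) - (y2 - y1) * (p[0] - x1) >= 0:
                hull.pop()
            else:
                break
        hull.append(p)
    return np.array(hull)
```

Points removed during the sweep are exactly the dominated operating points that fall below a line segment between two retained points.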
When the FPR and TPR axes are transformed using the inverse normal (z-score) function, the resulting plot is called a zROC curve. Under the binormal model (both class score distributions are Gaussian), the zROC curve becomes a straight line. The intercept of this line corresponds to the sensitivity index d' (d-prime), and the slope reflects the ratio of standard deviations of the two class distributions.
In cognitive psychology and memory research, zROC analysis reveals properties of the underlying recognition process. Under pure signal detection theory with equal variance, the zROC should have a slope of 1. Empirical zROC slopes typically fall between 0.5 and 0.9 (commonly around 0.8), indicating that the target ("old item") distribution has about 25% greater variability than the lure ("new item") distribution. Deviations from linearity in the zROC suggest that the simple equal-variance Gaussian model does not fully capture the decision process.
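The probit linearization can be checked with simulated binormal scores, as in the sketch below; the parameters are hypothetical and chosen so that sigma_0 / sigma_1 = 0.8, mirroring the empirically typical slope.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
neg = rng.normal(0.0, 1.0, 5000)      # lure / noise scores ~ N(0, 1)
pos = rng.normal(1.5, 1.25, 5000)     # target / signal scores ~ N(1.5, 1.25^2)

# Empirical TPR and FPR over a grid of thresholds, then probit-transform both axes
thresholds = np.linspace(-3, 4, 50)
fpr = np.array([(neg > t).mean() for t in thresholds])
tpr = np.array([(pos > t).mean() for t in thresholds])
ok = (fpr > 0) & (fpr < 1) & (tpr > 0) & (tpr < 1)    # ppf is undefined at 0 and 1
z_fpr, z_tpr = norm.ppf(fpr[ok]), norm.ppf(tpr[ok])

# Under the binormal model the zROC is a line with slope sigma_0/sigma_1 = 0.8
# and intercept (mu_1 - mu_0)/sigma_1 = 1.2
slope, intercept = np.polyfit(z_fpr, z_tpr, 1)
print(f"zROC slope ~ {slope:.2f}, intercept ~ {intercept:.2f}")
```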
The DET curve is an alternative to the ROC curve that plots the false negative rate (FNR = 1 - TPR) against the false positive rate (FPR), both on axes transformed by the normal quantile function (probit scale). Introduced by Martin, Doddington, Kamm, Ordowski, and Przybocki in 1997, DET curves have become the standard evaluation format in speaker recognition and other biometric verification tasks.
DET curves offer two advantages over standard ROC curves. First, when the underlying score distributions are approximately Gaussian, DET curves appear as straight lines, making visual comparison between systems easier. Second, the non-linear axis scaling allocates more visual space to the low-error region of the plot, which is typically where systems operate in practice.
Despite their popularity, ROC curves have several well-known limitations.
The most frequently cited limitation concerns imbalanced datasets. When the negative class vastly outnumbers the positive class (for example, in fraud detection where only 0.1% of transactions are fraudulent), a large absolute number of false positives translates into a very small FPR because the denominator (FP + TN) is enormous. This can make the ROC curve and AUC look optimistically high even when the classifier produces many false alarms in absolute terms. In such settings, the precision-recall curve is often a more informative alternative because precision (TP / (TP + FP)) is directly affected by the number of false positives regardless of the size of the negative class.
That said, recent research has argued that the ROC curve is not inherently flawed for imbalanced data and that the issue is more nuanced than commonly presented. The key insight is that the ROC curve accurately reflects discriminative ability (ranking quality), but it does not reflect the predictive value of positive predictions, which is what matters in many imbalanced applications. A 2024 paper in Patterns (Cell Press) argued that the ROC curve "accurately assesses imbalanced datasets" because it measures discrimination independently of class proportions; the apparent problem arises when practitioners conflate discrimination with precision or predictive value.
AUC measures the ranking quality of a classifier (whether positives tend to score higher than negatives) but says nothing about whether the predicted probabilities are well-calibrated. Two models with identical AUC can have very different calibration properties. A model with AUC = 0.90 might predict a probability of 0.8 for instances that are actually positive only 40% of the time.
Calibration and discrimination are complementary aspects of model quality. The Brier score, which can be decomposed into a calibration component and a refinement (discrimination) component, captures both. Notably, the refinement component of the Brier score is directly related to the area under the ROC curve. If well-calibrated probabilities are needed (e.g., for risk estimation in medicine), additional calibration analysis using tools like reliability diagrams, calibration curves, or the Brier score is necessary. Importantly, AUC is invariant to monotonic transformations of the predicted scores, meaning any monotonic recalibration (such as Platt scaling or isotonic regression) will not change the AUC.
While the threshold-independent nature of ROC analysis is often cited as a strength, it can also be a weakness. In practice, a model must operate at a single threshold. If the relevant operating region is a narrow range of FPR values (for example, FPR < 0.01 in airport security screening), the full AUC may not reflect performance in that region. Partial AUC, which measures the area under the curve only within a specified FPR range, addresses this issue.
The AUC integrates over the entire FPR range from 0 to 1, including regions where operating would be impractical. For instance, a medical screening test would never be deployed at FPR = 0.8 (80% of healthy patients receiving false positives), yet that region contributes to the AUC just as much as the clinically relevant region near FPR = 0.05. This has led some researchers, notably David Hand (2009), to propose alternative metrics like the H-measure that weight different operating points according to a beta distribution of costs rather than treating all costs as equally likely.
A classifier achieving AUC of 0.9 may have precision as low as 0.2 if the positive class is rare. AUC summarizes sensitivity and specificity but provides no information about precision (positive predictive value) or negative predictive value. This means that a high AUC does not guarantee that the model's positive predictions are reliable, particularly in low-prevalence settings.
Several alternative single-number summaries have been proposed to address AUC limitations:
| Metric | Description |
|---|---|
| H-measure (Hand, 2009) | Weights operating points by a beta distribution of cost ratios rather than uniformly |
| Matthews Correlation Coefficient | Balanced measure that uses all four confusion matrix entries; ranges from -1 to +1 |
| Informedness (Youden's J) | Equals TPR - FPR at a single operating point (equivalently, 2 x AUC - 1 for the one-threshold ROC curve); measures how informed a decision is beyond random |
| Average Precision (AP) | Area under the precision-recall curve; emphasizes positive-class performance |
The precision-recall curve plots precision (y-axis) against recall (x-axis) at varying thresholds. Because precision and recall do not involve true negatives, the PR curve focuses entirely on the positive class and is not inflated by a large number of easy negatives.
| Feature | ROC curve | Precision-recall curve |
|---|---|---|
| X-axis | False Positive Rate | Recall (True Positive Rate) |
| Y-axis | True Positive Rate | Precision |
| Baseline (random) | Diagonal line (AUC = 0.5) | Horizontal line at y = prevalence |
| Sensitivity to class imbalance | Lower (FPR diluted by large TN count) | Higher (precision directly affected by FP count) |
| Best used when | Classes are roughly balanced, or both classes matter | Positive class is rare, or precision of positive predictions matters |
| Dominance relationship | If ROC curve A dominates B, A also dominates B in PR space | PR dominance implies ROC dominance |
Davis and Goadrich (2006) proved a formal correspondence between ROC and PR space: a curve dominates in ROC space if and only if it dominates in PR space. This means the two views are consistent in their ranking of classifiers when one is uniformly better than another. In practice, it is often valuable to examine both curves together.
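For a side-by-side check in code, the sketch below computes both the ROC AUC and the average precision (area under the PR curve) for the same scores; y_true and y_scores are assumed to be binary labels and predicted scores.

```python
from sklearn.metrics import average_precision_score, precision_recall_curve, roc_auc_score

precision, recall, pr_thresholds = precision_recall_curve(y_true, y_scores)
ap = average_precision_score(y_true, y_scores)

print(f"ROC AUC = {roc_auc_score(y_true, y_scores):.3f}")
print(f"Average precision = {ap:.3f}")
# With a rare positive class, average precision is typically much lower than ROC AUC,
# because precision is penalized by every false positive regardless of how many
# true negatives the model gets right.
```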
The standard ROC curve is defined for binary classification, but it can be extended to multi-class problems using two main strategies.
For each class, treat it as the positive class and all other classes combined as the negative class. This yields one ROC curve per class. Individual per-class AUCs can then be aggregated using macro, micro, or weighted averaging.
For every pair of classes, compute the AUC using only the instances belonging to those two classes. With C classes, this yields C x (C - 1) / 2 pairwise AUC values, which are then averaged. OvO can reveal specific pairwise confusion patterns but becomes computationally expensive with many classes.
| Averaging method | Description | When to use |
|---|---|---|
| Macro-average | Compute AUC for each class independently, then take the unweighted mean | When all classes are equally important regardless of frequency |
| Micro-average | Pool all true labels and predicted scores across classes into a single curve and AUC; computed from the sum of all true positives and false positives across all classes | When overall per-instance performance matters and classes are imbalanced |
| Weighted average | Compute per-class AUC, then average weighted by class support (number of true instances per class) | When class sizes differ and you want proportional representation |
Micro-averaging tends to be dominated by the most frequent classes, while macro-averaging treats all classes equally. For highly imbalanced multi-class problems, micro-averaging is often preferred because it reflects overall accuracy across all instances.
For problems with more than two classes, the ROC concept can be generalized to higher dimensions as the Volume Under Surface (VUS). This measures the probability of correctly ordering one random example from each of the C classes. However, VUS is difficult to interpret and computationally expensive to estimate, so the OvR and OvO decomposition approaches are more commonly used in practice.
ROC analysis has found applications across many fields:
ROC analysis is the standard method for evaluating diagnostic tests in clinical medicine. It is routinely used to determine optimal cutoff values for laboratory tests (e.g., blood glucose for diabetes, PSA for prostate cancer, troponin for myocardial infarction) and to compare the diagnostic accuracy of different tests or imaging modalities. In radiology, ROC observer studies assess how well radiologists detect abnormalities in medical images. Metz's ROCFIT software and its successors have been used in hundreds of radiology studies since the 1980s.
ROC curves are the primary evaluation tool for comparing deep learning models against human radiologists. In breast cancer screening, deep learning models have achieved AUCs of 0.85 to 0.91 on mammography datasets, with some studies showing that radiologists assisted by AI achieve higher AUC than radiologists alone (AUC 0.852 vs. 0.805 in one multi-reader study). In lung cancer detection on CT scans, deep learning algorithms have reported AUCs of 0.92 to 0.99 on validation sets.
Financial institutions use ROC curves and the Gini coefficient to evaluate credit risk models. A scorecard that ranks high-risk borrowers above low-risk borrowers will produce a high AUC. Regulators often require model validation reports to include ROC curves and Gini coefficients alongside other performance metrics.
Weather forecasting services use ROC analysis to evaluate probabilistic forecasts of binary events such as precipitation, severe storms, or temperature threshold exceedances. The ROC curve measures a forecast system's ability to discriminate between events and non-events across different probability thresholds.
In speaker recognition, face recognition, and fingerprint verification, system performance is typically reported using DET curves (a variant of ROC) or Equal Error Rate (EER), which is the operating point where FPR equals FNR. The National Institute of Standards and Technology (NIST) uses DET curves as the standard format for evaluating biometric systems in its regular benchmark evaluations.
Python's scikit-learn library provides convenient functions for ROC analysis:
| Function | Purpose |
|---|---|
| sklearn.metrics.roc_curve(y_true, y_score) | Computes FPR, TPR, and thresholds for binary classification |
| sklearn.metrics.auc(fpr, tpr) | Calculates area under the curve from FPR and TPR arrays |
| sklearn.metrics.roc_auc_score(y_true, y_score) | Computes AUC directly; supports multi-class via multi_class='ovr' or 'ovo' and averaging via average='macro', 'micro', or 'weighted' |
| sklearn.metrics.roc_auc_score(y_true, y_score, max_fpr=0.1) | Computes partial AUC up to a specified maximum FPR |
| sklearn.metrics.RocCurveDisplay.from_predictions() | Plots the ROC curve with optional chance-level line |
A minimal binary classification example:
```python
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

# y_true: ground truth binary labels (0 or 1)
# y_scores: predicted probabilities for the positive class
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
auc_value = roc_auc_score(y_true, y_scores)

plt.plot(fpr, tpr, label=f'ROC curve (AUC = {auc_value:.3f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()
```
Finding the optimal threshold using Youden's J:
```python
import numpy as np
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_true, y_scores)
j_scores = tpr - fpr
best_idx = np.argmax(j_scores)

print(f'Optimal threshold: {thresholds[best_idx]:.3f}')
print(f'TPR: {tpr[best_idx]:.3f}, FPR: {fpr[best_idx]:.3f}')
```
For multi-class ROC with one-vs-rest macro averaging:
```python
from sklearn.metrics import roc_auc_score

# y_true: integer-encoded class labels
# y_score: predicted probabilities, shape (n_samples, n_classes)
macro_auc = roc_auc_score(y_true, y_score, multi_class='ovr', average='macro')
```
Other tools for ROC analysis include the pROC package in R (which provides the DeLong test, bootstrap confidence intervals, and partial AUC natively), tf.keras.metrics.AUC in TensorFlow, and the AUROC class in TorchMetrics, the metrics library maintained by the PyTorch Lightning team.
Imagine you have a machine that tries to tell the difference between apples and oranges on a conveyor belt. You can turn a dial on the machine to make it pickier or more relaxed about what counts as an "apple."
When the dial is turned all the way to "picky," the machine barely calls anything an apple. It misses a lot of real apples, but it also almost never mistakes an orange for an apple. When the dial is turned all the way to "relaxed," the machine calls almost everything an apple. It catches every real apple, but it also wrongly calls many oranges apples.
The ROC curve is a picture that shows what happens at every position of the dial. Along the bottom of the picture, we see how often the machine makes mistakes (calling oranges apples). Along the side, we see how often the machine catches the real apples. A really good machine has a curve that jumps up quickly to the top and stays there, meaning it catches almost all the apples without making many mistakes.
The "area under the curve" (AUC) gives you a single number for how good the machine is overall. If the number is close to 1, the machine is great at telling apples from oranges. If the number is around 0.5, the machine is basically guessing randomly.