Area under the curve
Last reviewed
May 26, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v3 · 3,412 words
Area under the ROC curve (AUC-ROC, or simply AUC) is a scalar summary of the performance of a binary classifier or a diagnostic test across all possible decision thresholds. It is defined as the area beneath the receiver operating characteristic (ROC) curve, which plots the true positive rate (sensitivity) against the false positive rate (1 minus specificity) as the discrimination threshold is varied.[1][2] AUC has an equivalent probabilistic interpretation, established by Bamber in 1975 and popularized in medicine by Hanley and McNeil in 1982, as the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative instance, which makes it numerically identical to the normalized Mann-Whitney U statistic.[3][1] The metric ranges from 0 to 1 with 0.5 corresponding to a non-informative classifier, and is widely used in machine learning model evaluation, medical diagnostic studies, credit scoring, and information retrieval. AUC has also been criticized, most prominently by David Hand in 2009, for using an implicit cost distribution that depends on the classifier being evaluated, and for being optimistic relative to precision-based metrics when class distributions are highly skewed.[4][5]
The receiver operating characteristic curve has its roots in signal detection theory developed during World War II for radar engineering, where operators had to choose between declaring a faint signal a real target (aircraft) or a noise artifact (bird, weather echo).[6][2] The English-language term "receiver operating characteristic" reflects this radar engineering heritage, with the curve describing how the operating point of a receiver (its sensitivity to weak signals) traded off correct detections against false alarms. Egan and others formalized the framework in psychophysics during the 1950s and 1960s, and Charles Metz, Lee Lusted, David Green, and John Swets transferred the methodology into radiology and experimental psychology.[2][6]
Two papers from the 1970s and 1980s consolidated the scalar AUC as the dominant summary measure of an ROC curve. Donald Bamber's 1975 paper in the Journal of Mathematical Psychology, "The area above the ordinal dominance graph and the area below the receiver operating characteristic graph," proved that the area below an empirical ROC curve equals the probability that a randomly chosen positive observation has a higher value than a randomly chosen negative observation.[3] James Hanley and Barbara McNeil's 1982 paper in Radiology, "The meaning and use of the area under a receiver operating characteristic (ROC) curve," made the same probabilistic interpretation accessible to a medical audience, provided a closed-form standard error for the empirical AUC together with a simple approximation based on a negative exponential score model, and gave sample-size guidance for diagnostic studies.[1] Both papers explicitly identified the equivalence between the empirical AUC and the Mann-Whitney U statistic (also known as the Wilcoxon rank-sum statistic), which gave the metric a firm grounding in nonparametric statistics.[1][3]
By the late 1990s and early 2000s, AUC had become a default model-selection criterion in machine learning, partly because the KDD Cup and other competitions adopted it for tasks with highly imbalanced class priors where raw classification accuracy was uninformative. Tom Fawcett's 2006 article "An introduction to ROC analysis" in Pattern Recognition Letters codified the visualization and the AUC computation algorithm for a machine-learning audience and is among the most-cited methodological tutorials in the field.[2]
Let a binary classifier or diagnostic test produce a real-valued score s(x) for each instance x, with class labels y in {0, 1}. The decision rule predicts the positive class if s(x) is above a threshold t. As t varies, the classifier produces a family of (false positive rate, true positive rate) operating points. The ROC curve is the locus of these points in the unit square, where:
TPR(t) = P(s(X) > t | y = 1), the true positive rate (sensitivity);
FPR(t) = P(s(X) > t | y = 0), the false positive rate (1 minus specificity).
The area under the ROC curve is the integral
AUC = integral over [0,1] of TPR(FPR^{-1}(u)) du.
Equivalently, by integration by parts and a change of variables, AUC equals the double expectation
AUC = P(s(X_p) > s(X_n)) + (1/2) P(s(X_p) = s(X_n)),
where X_p denotes a random positive instance, X_n a random negative instance, and the second term accounts for ties.[3][1] In words: AUC is the probability that the classifier ranks a randomly chosen positive example above a randomly chosen negative example. This probabilistic interpretation is the most widely cited intuitive meaning of the metric and underlies the connection to nonparametric statistics described in the next section.
The empirical ROC curve for a finite sample is a step function joining the operating points produced as the threshold sweeps through the observed scores. The empirical AUC is the area under this step function, which corresponds to numerical integration by the trapezoidal rule.[2][7]
A diagonal line from (0,0) to (1,1) has AUC = 0.5 and corresponds to a classifier whose ranking is no better than a coin flip on the positive-versus-negative discrimination task. An AUC of 1 corresponds to perfect ranking: every positive instance scores above every negative one. An AUC below 0.5 indicates that the classifier ranks negatives above positives more often than chance; reversing the sign of the score produces a complementary classifier whose AUC is one minus the original AUC.[2]
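These reference values can be checked with a direct pair-counting estimate. The following is a minimal sketch; the function name pairwise_auc and the toy scores are illustrative assumptions, not part of any library.

```python
from itertools import product

def pairwise_auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    total = 0.0
    for p, n in product(pos_scores, neg_scores):
        if p > n:
            total += 1.0
        elif p == n:
            total += 0.5
    return total / (len(pos_scores) * len(neg_scores))

# Toy scores (hypothetical): one positive (0.4) falls below one negative (0.7).
pos = [0.9, 0.8, 0.4]
neg = [0.7, 0.3, 0.2]

auc = pairwise_auc(pos, neg)                                  # 8 of 9 pairs correct
flipped = pairwise_auc([-s for s in pos], [-s for s in neg])  # sign-reversed scores
perfect = pairwise_auc([0.9, 0.8], [0.2, 0.1])                # every positive on top
constant = pairwise_auc([0.5, 0.5], [0.5, 0.5])               # all pairs tied

print(auc, flipped, perfect, constant)
```

On this toy data the sign-reversed scores give the complement of the original value, the fully separated scores give 1, and the constant scores give 0.5.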
Bamber's 1975 result, restated by Hanley and McNeil in 1982, places AUC in the family of two-sample rank statistics.[3][1] Given n_p positive instances with scores s_1, ..., s_{n_p} and n_n negative instances with scores t_1, ..., t_{n_n}, define the indicator
I(s_i, t_j) = 1 if s_i > t_j; 1/2 if s_i = t_j; 0 if s_i < t_j.
Then the empirical AUC is
AUC_hat = (1 / (n_p * n_n)) * sum over i,j of I(s_i, t_j).
This expression is identical to the Mann-Whitney U statistic divided by n_p * n_n, and Mann-Whitney U is in turn linearly related to the Wilcoxon rank-sum statistic by W = U + n_p(n_p + 1)/2.[1][3] The rank-sum equivalence has three practical consequences. First, the empirical AUC can be computed in O(N log N) time by sorting all scores, ranking the combined sample, and summing the positive-class ranks, which is faster than the O(n_p * n_n) double loop implied by the direct pair-counting formula.[2] Second, hypothesis testing on AUC can use the well-developed asymptotic theory of U-statistics, leading to closed-form variance estimators and the DeLong test described later. Third, AUC inherits invariance to monotonic transformations of the score from the rank-sum statistic: any strictly increasing function applied to s(x) leaves the empirical AUC unchanged, because it preserves all pairwise comparisons.[2]
This last property explains a feature of AUC that is sometimes mistaken for a defect. Two classifiers can produce identical AUC values while one is well-calibrated (outputs probabilities matching empirical positive rates) and the other is wildly miscalibrated, because AUC only measures the ranking of scores, not their absolute values.[2] Calibration requires distinct diagnostics such as reliability diagrams or the Brier score, and a classifier optimized for AUC need not produce probabilities suitable for cost-sensitive decision making.
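The separation between ranking and calibration can be illustrated with a small sketch. The helper names auc and brier, the toy labels, and the choice of p**4 as the distorting transformation are all assumptions made for illustration.

```python
def auc(labels, scores):
    """Empirical AUC by pair counting; ties between classes count 1/2."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def brier(labels, probs):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(labels, probs)) / len(labels)

labels = [1, 1, 0, 1, 0, 0]
probs = [0.9, 0.7, 0.6, 0.4, 0.3, 0.1]

# p -> p**4 is strictly increasing on [0, 1], so the ordering of the scores
# is preserved while the probability values themselves are pushed toward 0.
distorted = [p ** 4 for p in probs]

auc_original, auc_distorted = auc(labels, probs), auc(labels, distorted)
brier_original, brier_distorted = brier(labels, probs), brier(labels, distorted)

print(auc_original, auc_distorted)      # identical rankings, identical AUC
print(brier_original, brier_distorted)  # the Brier score changes
```

The AUC is the same for both score vectors, while the Brier score is worse for the distorted one, which is why a calibration diagnostic has to be reported separately.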
Two algorithms dominate practical computation of the empirical AUC.
The first is trapezoidal integration on the empirical ROC curve. Sort the predicted scores in decreasing order. Sweep a threshold through the sorted scores; at each step, increment TPR if the next instance is positive and FPR if it is negative. The cumulative trajectory of (FPR, TPR) values forms a piecewise-linear path from (0,0) to (1,1), and summing the areas of the trapezoids beneath each segment yields the AUC.[7][2] Instances with tied scores are processed together, so TPR and FPR advance simultaneously along a single diagonal segment; the trapezoid under that segment gives each tied positive-negative pair the same 1/2 weight that ties receive in the rank-sum formula.
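The sweep can be sketched as follows. This is an illustrative implementation under the assumption that labels are coded 0 and 1 and both classes are present; the names roc_points and trapezoid_auc are hypothetical.

```python
def roc_points(labels, scores):
    """Return ROC vertices as (fpr, tpr) pairs, grouping tied scores."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every instance sharing this score before emitting a vertex,
        # so that a tie produces one diagonal segment.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

def trapezoid_auc(points):
    """Sum the trapezoid areas under consecutive ROC vertices."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Toy data (hypothetical) with one positive and one negative tied at 0.7.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.7, 0.7, 0.4, 0.3, 0.1]

points = roc_points(labels, scores)
auc = trapezoid_auc(points)
print(points)
print(auc)
```

For this data, pair counting gives 7.5 correctly ordered pairs out of 9 (the tied pair contributing 1/2), and the trapezoidal sum returns the same value.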
The second is the rank-based formula. Compute the ranks of all N = n_p + n_n scores (using average ranks for ties). Sum the ranks assigned to positive instances to obtain R_p. Then
AUC_hat = (R_p - n_p(n_p + 1)/2) / (n_p * n_n).
This formula is mathematically equivalent to the trapezoidal area, runs in O(N log N) sort time, and is used internally by many libraries.[2][1]
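A minimal sketch of the rank-based computation follows, assuming 0/1 labels with both classes present. The helper names average_ranks and rank_auc are illustrative; production libraries typically use vectorized equivalents.

```python
def average_ranks(values):
    """1-based ascending ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def rank_auc(labels, scores):
    """AUC from the sum of positive-class ranks R_p."""
    ranks = average_ranks(scores)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)

# Toy data (hypothetical) with one positive and one negative tied at 0.7.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.7, 0.7, 0.4, 0.3, 0.1]

auc = rank_auc(labels, scores)
print(auc)
```

Here the positive ranks are 3, 4.5, and 6, so R_p = 13.5 and the formula gives (13.5 - 6) / 9, matching the pair-counting value of 7.5 / 9.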
In the scikit-learn library (version 1.8.0 as of 2026), the function sklearn.metrics.roc_auc_score(y_true, y_score) implements both binary and multiclass AUC, accepts a max_fpr argument that returns the standardized partial AUC over a restricted range of false positive rates, and exposes multi_class options ovr (one-versus-rest) and ovo (one-versus-one) for problems with more than two classes.[8] R packages such as pROC and ROCR implement the same computation along with bootstrap and DeLong-based variance estimation.[9]
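The calls described above can be exercised as in the following sketch, which assumes that scikit-learn and NumPy are installed. The toy arrays are hypothetical; only the function and argument names mentioned in the text are used.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Binary case: labels and real-valued scores.
y_true = np.array([1, 1, 0, 1, 0, 0])
y_score = np.array([0.9, 0.7, 0.7, 0.4, 0.3, 0.1])

full_auc = roc_auc_score(y_true, y_score)
# Standardized partial AUC over false positive rates in [0, 0.5].
partial_auc = roc_auc_score(y_true, y_score, max_fpr=0.5)

# Multiclass case: each row holds per-class probabilities summing to 1.
y_multi = np.array([0, 1, 2, 0, 1, 2])
y_proba = np.array([
    [0.7, 0.2, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
    [0.5, 0.3, 0.2],
    [0.3, 0.4, 0.3],
    [0.2, 0.2, 0.6],
])

ovo_auc = roc_auc_score(y_multi, y_proba, multi_class="ovo", average="macro")
ovr_auc = roc_auc_score(y_multi, y_proba, multi_class="ovr", average="macro")

print(full_auc, partial_auc, ovo_auc, ovr_auc)
```

For multiclass input the function expects a probability matrix rather than a single score column, and the multi_class argument must be set explicitly.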
Because the empirical AUC is a U-statistic, it has known asymptotic properties that support hypothesis testing and confidence intervals.
Hanley and McNeil gave a closed-form expression for the variance of the empirical AUC.[1] The expression involves AUC itself and two pair-ordering probabilities: Q_1, the probability that two randomly chosen positives both score above a randomly chosen negative, and Q_2, the probability that one randomly chosen positive scores above two randomly chosen negatives. Both can be estimated from the data or approximated from AUC alone under a negative exponential score model. The resulting standard error supports approximate normal-theory confidence intervals; it is a large-sample approximation rather than an exact result.
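A sketch of the resulting standard error is given below. It assumes the commonly quoted closed-form approximations Q_1 = A / (2 - A) and Q_2 = 2A^2 / (1 + A), which are not spelled out in the text above; the function name and the example numbers are hypothetical.

```python
import math

def hanley_mcneil_se(auc, n_pos, n_neg):
    """Approximate standard error of an empirical AUC (Hanley-McNeil form)."""
    # Assumed closed-form approximations for the two pair-ordering terms.
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc ** 2 / (1.0 + auc)
    variance = (
        auc * (1.0 - auc)
        + (n_pos - 1) * (q1 - auc ** 2)
        + (n_neg - 1) * (q2 - auc ** 2)
    ) / (n_pos * n_neg)
    return math.sqrt(variance)

# Hypothetical study: AUC of 0.85 estimated from 50 positives and 50 negatives.
auc, n_pos, n_neg = 0.85, 50, 50
se = hanley_mcneil_se(auc, n_pos, n_neg)

# Normal-theory 95 percent interval; it can exceed 1 for AUC values near 1.
lower, upper = auc - 1.96 * se, auc + 1.96 * se
print(se, lower, upper)
```

Because the interval is symmetric on the AUC scale, it can extend past 1 when AUC is close to its upper bound, so it is best treated as a rough guide in that regime.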
For comparing the AUCs of two or more classifiers evaluated on the same individuals, the standard tool is the DeLong test, introduced by Elizabeth DeLong, David DeLong, and Daniel Clarke-Pearson in their 1988 Biometrics paper "Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach."[9] The DeLong test treats each AUC as a U-statistic, derives an unbiased estimator of the covariance matrix between the AUCs using the theory of generalized U-statistics, and constructs a multivariate normal test for differences. It accounts for the correlation induced by evaluating multiple classifiers on the same sample, and is widely implemented in statistical software including pROC, MedCalc, and SAS.[9]
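The structure of the test for two classifiers can be sketched in plain Python. This is an illustrative, quadratic-time version written from the description above, not the implementation used by any of the named packages; all function names and the toy data are assumptions, and it presumes the estimated variance of the difference is positive.

```python
import math

def _psi(x, y):
    return 1.0 if x > y else 0.5 if x == y else 0.0

def _components(pos, neg):
    """AUC plus the per-positive and per-negative placement values."""
    v10 = [sum(_psi(p, n) for n in neg) / len(neg) for p in pos]
    v01 = [sum(_psi(p, n) for p in pos) / len(pos) for n in neg]
    return sum(v10) / len(v10), v10, v01

def _cov(a, b):
    """Sample covariance with an (n - 1) denominator."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)

def delong_test(labels, scores_a, scores_b):
    """Two-sided z-test for the difference of two correlated AUCs."""
    pos_idx = [i for i, y in enumerate(labels) if y == 1]
    neg_idx = [i for i, y in enumerate(labels) if y == 0]
    parts = []
    for scores in (scores_a, scores_b):
        pos = [scores[i] for i in pos_idx]
        neg = [scores[i] for i in neg_idx]
        parts.append(_components(pos, neg))
    (auc_a, v10_a, v01_a), (auc_b, v10_b, v01_b) = parts
    m, n = len(pos_idx), len(neg_idx)
    # Variance of the AUC difference, including the covariance between models.
    var_diff = (
        (_cov(v10_a, v10_a) + _cov(v10_b, v10_b) - 2 * _cov(v10_a, v10_b)) / m
        + (_cov(v01_a, v01_a) + _cov(v01_b, v01_b) - 2 * _cov(v01_a, v01_b)) / n
    )
    z = (auc_a - auc_b) / math.sqrt(var_diff)  # assumes var_diff > 0
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
    return auc_a, auc_b, z, p_value

# Hypothetical scores from two models on the same ten individuals.
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
scores_a = [0.9, 0.8, 0.7, 0.6, 0.35, 0.5, 0.4, 0.3, 0.2, 0.1]
scores_b = [0.9, 0.4, 0.7, 0.3, 0.6, 0.8, 0.5, 0.45, 0.2, 0.1]

auc_a, auc_b, z, p_value = delong_test(labels, scores_a, scores_b)
print(auc_a, auc_b, z, p_value)
```

The covariance terms are what distinguish this from comparing two independent AUCs: when both models rank the same individuals similarly, the variance of the difference shrinks and the test gains power.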
A complementary approach uses Efron's bootstrap. By drawing bootstrap samples with replacement from the evaluation set, recomputing the empirical AUC on each, and taking the empirical quantiles of the resampled values, one obtains a percentile bootstrap confidence interval for AUC without distributional assumptions; the bias-corrected and accelerated (BCa) variant adjusts the quantile levels to correct for bias and skewness in the bootstrap distribution.[10] Cross-validation schemes such as stratified k-fold are commonly combined with bootstrap or with permutation tests when sample sizes are small.
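The resampling loop can be sketched as follows. This version computes a plain percentile interval from the empirical quantiles, without the BCa adjustment; the function names, the synthetic Gaussian scores, and the choice of 1,000 resamples are assumptions made for illustration.

```python
import random

def auc(labels, scores):
    """Empirical AUC by pair counting; ties between classes count 1/2."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_interval(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUC (no bias or skewness correction)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        sample_labels = [labels[i] for i in idx]
        # Skip resamples that contain only one class: AUC is undefined there.
        if 0 < sum(sample_labels) < n:
            stats.append(auc(sample_labels, [scores[i] for i in idx]))
    stats.sort()
    low = stats[int(n_boot * alpha / 2)]
    high = stats[int(n_boot * (1 - alpha / 2)) - 1]
    return low, high

# Synthetic evaluation set: positives score about one standard deviation higher.
data_rng = random.Random(42)
labels = [1] * 30 + [0] * 30
scores = ([data_rng.gauss(1.0, 1.0) for _ in range(30)]
          + [data_rng.gauss(0.0, 1.0) for _ in range(30)])

point = auc(labels, scores)
low, high = bootstrap_auc_interval(labels, scores)
print(point, low, high)
```

Resampling whole instances keeps each label paired with its score; a stratified variant that resamples positives and negatives separately is a common alternative when one class is very small.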
The original definition of AUC is intrinsically binary because the ROC plot has two axes (TPR and FPR) tied to two-class confusion outcomes. Several generalizations to multiclass problems have been proposed.
David Hand and Robert Till's 2001 paper "A simple generalisation of the area under the ROC curve for multiple class classification problems," published in Machine Learning, defined the Hand-Till multiclass AUC as the average of one-versus-one AUCs over all pairs of classes.[11] For c classes, this requires c(c-1)/2 pairwise AUCs, each computed using only the instances belonging to the two classes in question. The Hand-Till measure reduces to the standard binary AUC when c = 2, and is insensitive to class prior distributions because pair-wise probabilities are estimated from the relevant pair only.[11] scikit-learn implements this as multi_class='ovo' with average='macro'.[8]
An alternative is the one-versus-rest macro AUC: for each class, compute a binary AUC treating that class as positive and all others as negative, then average. This is the multi_class='ovr' option and is sensitive to class prevalence because the negative class is a mixture whose composition depends on the data.[8] Provost and Domingos proposed a variant that weights each one-versus-rest AUC by the prevalence of its reference class. Other authors have proposed volume-under-the-surface generalizations that extend the geometric area to a higher-dimensional ROC manifold, but these are computationally expensive and rarely used in practice.
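The two averaging schemes can be compared in a short sketch. It assumes class labels are the integers 0 to c - 1 and index the columns of a probability matrix; the function names and the toy data are hypothetical.

```python
from itertools import combinations

def binary_auc(pos, neg):
    """Pair-counting AUC; ties count 1/2."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def column(labels, proba, col, keep):
    """Scores from one probability column for instances whose label passes keep."""
    return [row[col] for y, row in zip(labels, proba) if keep(y)]

def ovo_macro_auc(labels, proba):
    """Average of pairwise AUCs over all unordered class pairs."""
    pair_aucs = []
    for a, b in combinations(sorted(set(labels)), 2):
        # Each direction uses only the instances of the two classes involved.
        a_vs_b = binary_auc(column(labels, proba, a, lambda y: y == a),
                            column(labels, proba, a, lambda y: y == b))
        b_vs_a = binary_auc(column(labels, proba, b, lambda y: y == b),
                            column(labels, proba, b, lambda y: y == a))
        pair_aucs.append((a_vs_b + b_vs_a) / 2.0)
    return sum(pair_aucs) / len(pair_aucs)

def ovr_macro_auc(labels, proba):
    """Average of per-class AUCs, each class against all remaining instances."""
    aucs = [binary_auc(column(labels, proba, c, lambda y: y == c),
                       column(labels, proba, c, lambda y: y != c))
            for c in sorted(set(labels))]
    return sum(aucs) / len(aucs)

# Toy three-class problem (hypothetical probabilities, rows sum to 1).
labels = [0, 0, 1, 1, 2, 2]
proba = [
    [0.6, 0.3, 0.1],
    [0.3, 0.5, 0.2],
    [0.2, 0.6, 0.2],
    [0.4, 0.4, 0.2],
    [0.1, 0.2, 0.7],
    [0.3, 0.3, 0.4],
]

ovo = ovo_macro_auc(labels, proba)
ovr = ovr_macro_auc(labels, proba)
print(ovo, ovr)
```

The one-versus-one version never mixes a third class into the negatives, which is the source of its insensitivity to class priors; the one-versus-rest version pools all other classes, so its value shifts when their relative frequencies change.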
In many applications, especially medical screening, only a portion of the ROC curve is clinically relevant. A diagnostic test that requires a 50 percent false positive rate to achieve high sensitivity may be useless for population screening even if it has high overall AUC, because the human and economic costs of large numbers of false positives dominate.
Donna McClish's 1989 paper "Analyzing a portion of the ROC curve" in Medical Decision Making formalized the partial AUC (pAUC) as the area beneath the ROC curve restricted to a clinically meaningful range of false positive rates, typically [0, f] for some small f.[12] The standardized partial AUC rescales pAUC so that the minimum possible value (corresponding to a non-informative classifier on the restricted region) maps to 0.5 and the maximum (perfect classifier on the restricted region) to 1, placing it on the same scale as full AUC. Variants include partial AUC restricted by a sensitivity range rather than a specificity range, and area under the precision-recall curve restricted to a recall range. scikit-learn supports partial AUC via the max_fpr argument of roc_auc_score.[8]
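The restriction and the rescaling can be sketched as follows, assuming the ROC curve is given as a list of (FPR, TPR) vertices in increasing FPR order and that the region of interest is [0, f]. The function names and the example curve are hypothetical.

```python
def partial_auc(points, max_fpr):
    """Area under a piecewise-linear ROC curve for FPR in [0, max_fpr]."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Interpolate the TPR where this segment crosses max_fpr.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

def standardized_partial_auc(points, max_fpr):
    """Rescale pAUC so that chance maps to 0.5 and a perfect curve to 1."""
    raw = partial_auc(points, max_fpr)
    chance = max_fpr ** 2 / 2.0  # area under the diagonal on [0, max_fpr]
    perfect = max_fpr            # area under TPR = 1 on [0, max_fpr]
    return 0.5 * (1.0 + (raw - chance) / (perfect - chance))

# Hypothetical ROC vertices, ordered by increasing false positive rate.
roc = [(0.0, 0.0), (0.0, 0.4), (0.1, 0.6), (0.3, 0.9), (1.0, 1.0)]

raw = partial_auc(roc, 0.2)
std = standardized_partial_auc(roc, 0.2)
print(raw, std)
```

Setting max_fpr to 1 makes the standardized value coincide with the full AUC, which is a convenient sanity check on the rescaling.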
Jesse Davis and Mark Goadrich's 2006 ICML paper "The relationship between precision-recall and ROC curves" established that, for a fixed dataset, there is a one-to-one correspondence between points in ROC space and points in precision-recall space, and proved that a curve dominates another in ROC space if and only if it dominates in PR space.[4] However, the two summary areas (ROC-AUC and PR-AUC, the latter commonly estimated by average precision) reward different aspects of classifier behavior. ROC-AUC averages the true positive rate over all false positive rates with equal weight, while PR-AUC averages precision over all recall levels, and precision depends directly on the ratio of positives to negatives. When negatives vastly outnumber positives, the false positive rate can stay near zero even as the absolute number of false positives grows, which makes ROC-AUC look flattering compared to PR-AUC.[4][5]
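A small arithmetic example makes the effect concrete. The counts below are hypothetical: 100 positives, 100,000 negatives, and two operating points with the same number of true positives but a tenfold difference in false positives.

```python
n_pos, n_neg = 100, 100_000  # hypothetical class sizes (1 positive per 1,000 negatives)

def rates(tp, fp):
    """Return (TPR, FPR, precision) for given true and false positive counts."""
    tpr = tp / n_pos
    fpr = fp / n_neg
    precision = tp / (tp + fp)
    return tpr, fpr, precision

# Same recall at both operating points, ten times as many false positives at B.
tpr_a, fpr_a, prec_a = rates(tp=80, fp=100)
tpr_b, fpr_b, prec_b = rates(tp=80, fp=1_000)

print(f"A: TPR={tpr_a:.2f} FPR={fpr_a:.3f} precision={prec_a:.3f}")
print(f"B: TPR={tpr_b:.2f} FPR={fpr_b:.3f} precision={prec_b:.3f}")
```

Moving from A to B shifts the ROC point by less than 0.01 along the FPR axis, while precision drops from about 0.44 to about 0.07, so the two operating points are nearly indistinguishable on an ROC plot and far apart on a precision-recall plot.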
Takaya Saito and Marc Rehmsmeier's 2015 PLoS ONE paper "The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets" provided empirical evidence on biomedical microRNA prediction tasks, showing that ROC-AUC varied by less than 0.02 across classifiers with very different practical utility while PR-AUC varied by more than 0.50.[5] Their analysis of 58 genome-wide imbalanced-classifier studies found that 66.7 percent reported ROC curves while only 12.1 percent reported PR curves, and they argued that this contributed to over-optimistic published performance claims.[5] Modern best practice in binary classification with highly imbalanced classes is to report both ROC-AUC and PR-AUC, and to specify the class prior so that PR-AUC values can be compared meaningfully across datasets.[5][4]
AUC has several attractive theoretical properties beyond the probabilistic interpretation. It is invariant under any strictly increasing transformation of scores, so it does not depend on calibration. It is threshold-free and so does not require a particular operating point to be selected before evaluation. It has a well-developed inferential theory through its U-statistic representation.[2][1]
Limitations have been the subject of substantial methodological debate. David Hand's 2009 paper "Measuring classifier performance: a coherent alternative to the area under the ROC curve," published in Machine Learning, presented the most cited critique.[13] Hand showed that AUC can be written as a weighted average of misclassification cost ratios, with weights that depend on the empirical distribution of scores produced by the classifier being evaluated. As a result, two different classifiers are implicitly evaluated against different cost distributions when their AUCs are compared, which Hand argued is incoherent because the relative costs of false positives and false negatives are a property of the problem rather than of the classifier. He proposed the H-measure, which fixes the cost distribution to a Beta(2,2) prior (or any other application-relevant prior) so that classifiers are compared on a common scale.[13] The H-measure is implemented in the R package hmeasure.
A separate line of criticism concerns the behavior of ROC-AUC under class imbalance, summarized in the Davis-Goadrich and Saito-Rehmsmeier papers above.[4][5] When the negative class is much larger than the positive class, large absolute increases in false positives correspond to small changes in FPR, producing a flattering ROC curve. In contrast, precision-based measures react directly to the absolute number of false positives.
A further consideration is that AUC averages performance over operating points that may never be used in practice. A medical screening program that operates only at very low false positive rates does not benefit from high AUC if that AUC is achieved mainly by good performance at high false positive rates. Partial AUC and the H-measure address this concern by restricting or weighting the integration region. Empirical studies have shown that AUC estimates from finite samples can be noisy, with confidence interval widths that exceed the typical differences reported between competing classifiers, which suggests caution when ranking models by AUC alone.[2]
AUC is among the most reported metrics in published machine learning evaluations. It serves as the headline summary in many tasks within the OpenML and Kaggle ecosystems, in the binary problems within the SuperGLUE-derived benchmarks for natural language inference, and in clinical prediction model evaluation. Logistic regression and many deep neural network classifiers can be trained to maximize a smooth surrogate of AUC. A surrogate is needed because the pairwise indicator in the rank-sum formula is piecewise constant in the scores, so its gradient is zero almost everywhere; replacing the indicator with a smooth or convex function of the score difference, such as a sigmoid or a hinge, yields a differentiable objective.[14] Ranking losses such as the pairwise RankNet and the listwise ListNet were motivated in part by the desire to optimize AUC-like quantities for classification and information retrieval.
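One common way to build such a surrogate is to replace the pairwise indicator with a sigmoid of the score difference. The sketch below is illustrative only: the function names, the temperature parameter, and the toy scores are assumptions, and it is not the specific method of any cited paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def soft_auc(pos_scores, neg_scores, temperature=1.0):
    """Smooth stand-in for AUC: mean sigmoid of positive-minus-negative gaps."""
    total = sum(sigmoid((p - n) / temperature)
                for p in pos_scores for n in neg_scores)
    return total / (len(pos_scores) * len(neg_scores))

def soft_auc_grad_pos(pos_scores, neg_scores, temperature=1.0):
    """Gradient of soft_auc with respect to each positive score."""
    scale = 1.0 / (len(pos_scores) * len(neg_scores) * temperature)
    grads = []
    for p in pos_scores:
        g = 0.0
        for n in neg_scores:
            s = sigmoid((p - n) / temperature)
            g += s * (1.0 - s)  # derivative of the sigmoid at this gap
        grads.append(g * scale)
    return grads

# Toy scores (hypothetical); the hard pair-counting AUC here is 8/9.
pos = [0.9, 0.8, 0.4]
neg = [0.7, 0.3, 0.2]

smooth = soft_auc(pos, neg, temperature=1.0)
sharp = soft_auc(pos, neg, temperature=0.01)  # close to the hard AUC
grads = soft_auc_grad_pos(pos, neg, temperature=1.0)
print(smooth, sharp, grads)
```

A small temperature makes the surrogate track the hard AUC closely but shrinks the gradient for pairs that are already well separated, so the temperature trades fidelity against trainability.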
In scikit-learn, AUC is available as a scoring function for cross-validated model selection via the scoring='roc_auc' argument to cross_val_score and GridSearchCV; it must be requested explicitly, because the default scorer for classifiers is accuracy.[8] TensorFlow exposes AUC as a metric class (tf.keras.metrics.AUC) with running estimation suitable for streaming evaluation during training, and comparable metrics for PyTorch are provided by add-on packages such as TorchMetrics. Production ML systems typically monitor AUC alongside accuracy, precision, recall, and PR-AUC to detect drift in model quality.
In radiology, pathology, cardiology, and clinical chemistry, AUC is the standard summary of diagnostic accuracy for a continuous-valued biomarker or imaging score, with values typically interpreted using the rough rubric: 0.90 to 1.00 excellent, 0.80 to 0.90 good, 0.70 to 0.80 fair, 0.60 to 0.70 poor, 0.50 to 0.60 fail.[1] Regulatory guidelines from the U.S. Food and Drug Administration and the European Medicines Agency reference AUC when assessing diagnostic devices, although they require additional reporting of sensitivity and specificity at fixed operating points appropriate to the intended clinical use.
Credit scoring, fraud detection, and churn prediction in industrial settings also rely heavily on AUC and on its near-equivalent the Gini coefficient (Gini = 2 * AUC - 1).[2] Information retrieval often reports area under the precision-recall curve rather than ROC-AUC, reflecting the heavy class imbalance typical of relevance judgments and the precision focus of users browsing ranked results.[4]