AUC (Area Under the ROC Curve) is one of the most widely used evaluation metrics in machine learning for assessing the performance of classification models. It quantifies the entire two-dimensional area underneath the ROC (Receiver Operating Characteristic) curve, providing a single scalar value that summarizes a classifier's ability to distinguish between positive and negative classes across all possible classification thresholds. AUC is threshold-invariant and scale-invariant, making it especially valuable for model comparison and selection in domains such as medical diagnostics, credit scoring, fraud detection, and information retrieval.
Imagine you have a basket of red balls and blue balls all mixed together. You ask a robot to sort them: red balls on the left, blue balls on the right. The robot isn't perfect, so sometimes it puts a blue ball on the red side or a red ball on the blue side.
AUC is a score that tells you how good the robot is at sorting. If you pick one red ball and one blue ball at random, AUC is the chance that the robot correctly puts the red ball on the red side and the blue ball on the blue side. A score of 1.0 means the robot sorts perfectly every time. A score of 0.5 means the robot is just guessing randomly, like flipping a coin. The higher the AUC, the better the robot is at telling the two groups apart.
The ROC curve is a graphical plot that illustrates the diagnostic ability of a binary classification system as its discrimination threshold varies. It plots the True Positive Rate (TPR) on the y-axis against the False Positive Rate (FPR) on the x-axis at every possible threshold setting.
| Metric | Formula | Also Known As |
|---|---|---|
| True Positive Rate (TPR) | TP / (TP + FN) | Sensitivity, Recall, Hit Rate |
| False Positive Rate (FPR) | FP / (FP + TN) | 1 - Specificity, Fall-out |
| True Negative Rate (TNR) | TN / (TN + FP) | Specificity |
| False Negative Rate (FNR) | FN / (FN + TP) | Miss Rate |
Here, TP, FP, TN, and FN refer to values from the confusion matrix: True Positives, False Positives, True Negatives, and False Negatives, respectively.
A perfect classifier produces a ROC curve that passes through the upper-left corner of the plot (the point where FPR = 0 and TPR = 1), meaning it achieves 100% sensitivity with zero false positives. A random classifier, by contrast, produces a diagonal line from (0, 0) to (1, 1).
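As a quick illustration of the formulas in the table, the four rates can be computed directly from confusion-matrix counts (the counts below are made up for illustration):

```python
# Hypothetical confusion-matrix counts at a single threshold (illustrative)
TP, FP, TN, FN = 80, 10, 90, 20

tpr = TP / (TP + FN)  # sensitivity / recall       -> 0.8
fpr = FP / (FP + TN)  # fall-out (1 - specificity) -> 0.1
tnr = TN / (TN + FP)  # specificity                -> 0.9
fnr = FN / (FN + TP)  # miss rate                  -> 0.2

print(tpr, fpr, tnr, fnr)
```

Note that TPR + FNR = 1 and FPR + TNR = 1, so the ROC curve's two axes fully determine the other two rates.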
The ROC curve was first developed by electrical and radar engineers during World War II for detecting enemy objects on the battlefield, beginning in 1941. The name "Receiver Operating Characteristic" comes from its original use in signal detection theory. In the 1950s, ROC analysis was adopted in psychophysics for studying human detection of weak signals. It later became a standard tool in medical diagnostics and, starting in the late 1980s, in machine learning. Tom Fawcett's influential 2006 tutorial and the growing availability of open-source implementations helped popularize ROC analysis across the data science community.
The AUC represents the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance. This probabilistic interpretation makes AUC an intuitive and meaningful metric.
| AUC Range | Classification Quality | Interpretation |
|---|---|---|
| 0.90 - 1.00 | Excellent | Outstanding discrimination; the model almost always ranks positives above negatives |
| 0.80 - 0.90 | Good | Strong discrimination; suitable for most practical applications |
| 0.70 - 0.80 | Fair | Acceptable discrimination; may need improvement for critical applications |
| 0.60 - 0.70 | Poor | Weak discrimination; the model struggles to separate classes |
| 0.50 | No Discrimination | Random guessing; the model provides no useful information |
| Below 0.50 | Inverse Predictions | Worse than random; reversing predictions would improve performance |
For example, an AUC of 0.85 means that if you randomly select one positive and one negative example, there is an 85% probability that the model assigns a higher predicted score to the positive example.
These guideline ranges should be interpreted with caution. What counts as a "good" AUC depends heavily on the application domain. In medical diagnostics, regulatory standards may require AUC values above 0.90 for screening tests. In some social science applications, an AUC of 0.70 may represent a meaningful advance over prior methods.
In credit scoring and some other financial domains, a related metric called the Gini coefficient is used. The Gini coefficient is computed from AUC as:
Gini = 2 x AUC - 1
This rescales the AUC so that random performance corresponds to a Gini of 0 and perfect performance corresponds to a Gini of 1. Note that this Gini coefficient is distinct from the Gini impurity used in decision trees and from the Gini index used in economics to measure income inequality.
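The conversion is a one-liner; here is a sketch using toy labels and scores made up for illustration:

```python
from sklearn.metrics import roc_auc_score

# Toy labels and scores (illustrative)
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_score(y_true, y_score)  # 0.75
gini = 2 * auc - 1                    # 0.5: random -> 0, perfect -> 1
print(gini)
```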
An important caveat is that AUC measures only the ranking quality of a model, not whether its predicted probabilities are well-calibrated. A model can achieve a perfect AUC of 1.0 while producing wildly inaccurate probability estimates. For instance, a model that assigns a probability of 0.99 to every positive example and 0.01 to every negative example achieves AUC = 1.0, but so does a model assigning 0.51 and 0.49 respectively. If calibrated probability estimates matter (for example, when predicted probabilities are used for downstream decision-making or risk stratification), practitioners should supplement AUC with calibration metrics such as the Brier score, expected calibration error, or calibration curves.
There are several mathematically equivalent approaches for computing AUC, each offering a different perspective on the metric.
The most common computational method is the trapezoidal rule. After sorting all predicted scores and computing TPR and FPR at each distinct threshold, the area under the resulting piecewise-linear ROC curve is computed by summing the areas of trapezoids formed between consecutive operating points:
AUC = sum over i of (FPR_{i+1} - FPR_i) x (TPR_i + TPR_{i+1}) / 2
This is the method used by scikit-learn's roc_auc_score function. It is straightforward to implement and efficient to compute, with time complexity O(n log n) dominated by the sorting step, where n is the number of predictions.
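A minimal sketch of the trapezoidal computation (toy data made up for illustration), checked against sklearn.metrics.auc:

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

# Toy data (illustrative)
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Operating points of the empirical ROC curve
fpr, tpr, _ = roc_curve(y_true, y_score)

# Trapezoidal rule: sum the areas between consecutive operating points
area = np.sum(np.diff(fpr) * (tpr[:-1] + tpr[1:]) / 2)

print(area, auc(fpr, tpr))  # both 0.75
```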
Hanley and McNeil (1982) showed that the AUC is equivalent to the Mann-Whitney U statistic (also known as the Wilcoxon rank-sum statistic). Given a set of positive examples P and negative examples N, the AUC equals the proportion of all (positive, negative) pairs where the classifier assigns a higher score to the positive example:
AUC = (number of concordant pairs) / (|P| x |N|)
A pair is concordant when the positive example receives a higher predicted score than the negative example; tied pairs are counted as 0.5. This equivalence gives AUC a natural non-parametric statistical interpretation, connects it to classical hypothesis testing, and makes the treatment of tied scores explicit.
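The pairwise definition can be coded directly. This brute-force sketch (O(|P| x |N|), fine for small data; toy inputs made up for illustration) agrees with scikit-learn's trapezoidal computation:

```python
from sklearn.metrics import roc_auc_score

def pairwise_auc(y_true, y_score):
    """AUC as the fraction of concordant (positive, negative) pairs;
    tied pairs count as 0.5 (Mann-Whitney formulation)."""
    pos = [s for s, t in zip(y_score, y_true) if t == 1]
    neg = [s for s, t in zip(y_score, y_true) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y_true, y_score))   # 0.75
print(roc_auc_score(y_true, y_score))  # 0.75
```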
Formally, AUC can be expressed as the integral:
AUC = integral from 0 to 1 of TPR(FPR^-1(x)) dx
where TPR is viewed as a function of FPR through the varying threshold parameter. In practice, this integral is approximated using the trapezoidal rule on the empirical ROC curve.
Reporting a point estimate of AUC without uncertainty information can be misleading. Confidence intervals quantify the precision of the AUC estimate and are essential for rigorous model evaluation.
| Method | Assumptions | Strengths | Weaknesses |
|---|---|---|---|
| Hanley-McNeil (1982) | Continuous scores, no ties | Closed-form, fast to compute | Underestimates variance near AUC = 0 or 1 |
| DeLong (1988) | Nonparametric | No distributional assumptions, supports two-AUC comparison | More complex to implement |
| Bootstrap | Nonparametric | Best coverage for extreme AUC values, flexible | Computationally expensive |
| Logit-transformed | Nonparametric | Better coverage near boundaries than Hanley-McNeil | Less widely implemented |
Hanley and McNeil (1982) derived a closed-form variance estimator for AUC based on exponential approximations. Their formula expresses the variance of AUC in terms of the AUC value itself, the number of positive cases (n_p), and the number of negative cases (n_n). This approach is computationally efficient but may underestimate variance for extreme AUC values (near 0 or 1) and assumes continuous test scores with no ties.
Bootstrap resampling provides a nonparametric alternative for constructing confidence intervals. The procedure repeatedly resamples the dataset with replacement, computes AUC on each bootstrap sample, and uses the distribution of bootstrap AUC values to derive percentile-based or bias-corrected confidence intervals. Bootstrap methods tend to provide better coverage than the Hanley-McNeil formula, particularly for extreme AUC values. A typical implementation uses 1,000 to 10,000 bootstrap iterations.
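A percentile-bootstrap confidence interval along these lines can be sketched as follows (synthetic Gaussian scores; the distributions, seed, and iteration count are illustrative choices):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic scores: positives drawn from a slightly higher distribution
y_true = np.array([0] * 100 + [1] * 100)
y_score = np.concatenate([rng.normal(0.0, 1.0, 100), rng.normal(1.0, 1.0, 100)])

n = len(y_true)
boot_aucs = []
for _ in range(2000):
    idx = rng.integers(0, n, n)                # resample rows with replacement
    if y_true[idx].min() == y_true[idx].max():
        continue                               # skip single-class resamples
    boot_aucs.append(roc_auc_score(y_true[idx], y_score[idx]))

lo, hi = np.percentile(boot_aucs, [2.5, 97.5])  # percentile 95% CI
print(f"AUC = {roc_auc_score(y_true, y_score):.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

Bias-corrected and accelerated (BCa) intervals follow the same resampling loop but adjust the percentiles before reading off the interval.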
DeLong et al. (1988) proposed a nonparametric variance estimator based on the theory of U-statistics. This method computes structural components called "placement values" for each observation and uses them to estimate variance without distributional assumptions. It is more robust than the Hanley-McNeil approach and forms the basis of DeLong's test for comparing two AUCs.
AUC and accuracy are both used to evaluate classifiers, but they differ in important ways.
| Property | AUC | Accuracy |
|---|---|---|
| Threshold dependence | Threshold-independent; evaluates all thresholds | Requires a fixed threshold |
| Scale sensitivity | Scale-invariant; cares about ranking, not calibration | Sensitive to predicted probability values |
| Class imbalance | Less affected by moderate imbalance | Can be misleading with imbalanced classes |
| Interpretation | Probability of correct ranking | Proportion of correct predictions |
| Use case | Model comparison and selection | Final performance reporting at a chosen threshold |
Because AUC evaluates the model's ranking quality across all thresholds, it is particularly useful during model development and comparison. Accuracy, on the other hand, depends on a single chosen threshold and can be misleading when classes are imbalanced. For instance, in a dataset where 95% of examples are negative, a trivial model predicting all negatives achieves 95% accuracy but an AUC of only 0.5.
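The 95%-negative example can be reproduced with a quick simulation (synthetic data; the prevalence and sample size are illustrative):

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(0)

# ~5% positives; a trivial majority-class model and a skill-free scorer
y_true = (rng.random(1000) < 0.05).astype(int)
y_pred = np.zeros_like(y_true)   # hard predictions: always the majority class
y_score = rng.random(1000)       # random scores with no ranking skill

acc = accuracy_score(y_true, y_pred)   # ~0.95: looks impressive
auc = roc_auc_score(y_true, y_score)   # ~0.5: reveals no discrimination
print(acc, auc)
```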
While AUC is generally robust to moderate class imbalance, it can give an overly optimistic view of model performance when datasets are severely imbalanced. This happens because the FPR, which forms the x-axis of the ROC curve, uses True Negatives in its denominator. When the negative class heavily outnumbers the positive class, even a large number of false positives translates to a relatively small FPR, which inflates the ROC curve upward.
For example, a model evaluated on a dataset with a 1:100 positive-to-negative ratio might show an AUC of 0.95 while still missing many positive examples in absolute terms. Researchers have documented cases where a model achieves an AUC-ROC of 0.957 but only a PR-AUC of 0.708, highlighting a substantial gap between the two metrics.
The Precision-Recall AUC (PR-AUC) computes the area under the precision-recall curve. Because precision and recall do not involve true negatives, PR-AUC is unaffected by the size of the negative class, which makes it the more informative metric when the positive class is rare:
| Metric | What It Measures | Baseline for Random Classifier | Best For |
|---|---|---|---|
| AUC-ROC | Discrimination across all thresholds | 0.5 (always) | Balanced or moderately imbalanced datasets |
| PR-AUC | Precision-recall trade-off | Equals the positive class prevalence | Severely imbalanced datasets |
It is worth noting that recent research presents a more nuanced picture. Saito and Rehmsmeier (2015) argued that PR curves are more informative for imbalanced data, but a 2024 study by Yang et al. found that AUPRC can unduly favor model improvements in subpopulations with more frequent positive labels. Meanwhile, Cook (2024) argued that the ROC curve does accurately assess imbalanced datasets when properly understood. Practitioners should consider both metrics together for a complete picture rather than treating one as universally superior.
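The gap between the two metrics is easy to observe on synthetic, heavily imbalanced data (all dataset and model parameters below are illustrative choices, not from the studies cited above):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Heavily imbalanced synthetic data (~1% positives before label noise)
X, y = make_classification(n_samples=20000, weights=[0.99], flip_y=0.02,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]

roc = roc_auc_score(y_te, scores)
pr = average_precision_score(y_te, scores)  # PR-AUC (average precision)
print(f"ROC-AUC = {roc:.3f}, PR-AUC = {pr:.3f}")
```

The ROC-AUC comes out noticeably higher than the PR-AUC here because the large negative class keeps the FPR small even when false positives are numerous relative to the positives.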
Sometimes the full ROC curve is not equally relevant across all FPR values. In medical screening, for example, only very low false positive rates may be clinically acceptable. Partial AUC (pAUC) restricts the computation to a specific region of the ROC space.
| Variant | Description | Use Case |
|---|---|---|
| FPR-constrained pAUC | Area under the ROC curve where FPR is within [FPR_low, FPR_high] | When a maximum FPR is clinically mandated |
| TPR-constrained pAUC | Area under the ROC curve where TPR exceeds a minimum threshold | When a minimum sensitivity is required |
| Two-way pAUC | Area restricted by both FPR and TPR bounds simultaneously | When both sensitivity and specificity constraints apply |
To make partial AUC values comparable across different FPR ranges, McClish (1989) proposed a standardized partial AUC that normalizes the value to a [0.5, 1] scale, where 0.5 represents random performance and 1 represents perfect performance within the specified region. The standardization formula is:
pAUC_standardized = 0.5 x (1 + (pAUC - pAUC_min) / (pAUC_max - pAUC_min))
The Ratio of Relevant Areas (RRA) is another normalization approach that divides the pAUC by the area of the region of interest.
In scikit-learn, partial AUC can be computed by setting the max_fpr parameter in roc_auc_score. The returned value is the standardized partial AUC, as proposed by McClish.
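For example (using the breast-cancer dataset as a convenient stand-in, and evaluating on the training data purely for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Training-set evaluation, for illustration only
X, y = load_breast_cancer(return_X_y=True)
scores = LogisticRegression(max_iter=5000).fit(X, y).predict_proba(X)[:, 1]

full = roc_auc_score(y, scores)
# Restrict to the low-FPR region; returns McClish-standardized pAUC in [0.5, 1]
partial = roc_auc_score(y, scores, max_fpr=0.1)
print(f"full AUC = {full:.3f}, standardized pAUC (FPR <= 0.1) = {partial:.3f}")
```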
While AUC is originally defined for binary classification, it can be extended to multi-class classification problems using two main strategies.
In the OvR approach, the AUC is computed for each class against all other classes combined. Each class gets its own ROC curve, and the per-class AUC values are then aggregated. This method can be affected by class imbalance because the "rest" group composition changes for each class, creating artificial imbalance even when the original classes are balanced.
In the OvO approach, introduced by Hand and Till (2001), AUC is computed for every unique pair of classes. For k classes, this produces k(k-1)/2 pairwise AUC values that are then averaged. Hand and Till showed that when macro averaging is used, the OvO approach is insensitive to class distribution, similar to the binary AUC case. This makes OvO the preferred method when class imbalance is a concern.
| Averaging Method | Description | When to Use |
|---|---|---|
| Macro | Unweighted mean of per-class AUCs | When all classes are equally important |
| Weighted | Weighted mean using class frequencies as weights | When larger classes should contribute more |
| Micro | Computed globally by treating each element of the label indicator matrix as a separate binary prediction | When individual observation accuracy matters more than per-class accuracy |
In scikit-learn, multi-class AUC is computed using roc_auc_score with the multi_class parameter set to 'ovr' or 'ovo' and the average parameter set to 'macro', 'weighted', or 'micro' (micro averaging is only supported for OvR).
When comparing two classifiers, observing a higher AUC for one model does not necessarily mean it is significantly better. DeLong's test (DeLong et al., 1988) provides a rigorous statistical framework for determining whether the difference between two correlated AUC values is statistically significant.
DeLong's test is a non-parametric method that leverages the connection between AUC and the Mann-Whitney U statistic. It computes a z-score from the difference between two AUC values, accounting for the correlation between them (since both models are evaluated on the same dataset). The steps are: (1) compute each model's AUC via the Mann-Whitney formulation; (2) compute placement values for every observation under both models; (3) estimate the variance of the AUC difference, including the covariance term induced by the shared dataset; and (4) form the z-statistic as the AUC difference divided by its standard error.
A z-score exceeding 1.96 in absolute value (corresponding to p < 0.05) suggests a statistically significant difference between the two AUCs.
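The procedure can be sketched in code. This is an illustrative toy implementation of the idea (synthetic scores; the seed and noise levels are made up), not a replacement for vetted implementations such as pROC's roc.test:

```python
import numpy as np
from math import erf, sqrt

def _placements(y_true, y_score):
    """AUC plus per-observation placement values."""
    pos = y_score[y_true == 1][:, None]
    neg = y_score[y_true == 0][None, :]
    psi = (pos > neg) + 0.5 * (pos == neg)  # pairwise concordance, ties = 0.5
    return psi.mean(), psi.mean(axis=1), psi.mean(axis=0)

def delong_test(y_true, score_a, score_b):
    """z-test for two correlated AUCs measured on the same dataset."""
    y_true = np.asarray(y_true)
    auc_a, v10_a, v01_a = _placements(y_true, np.asarray(score_a))
    auc_b, v10_b, v01_b = _placements(y_true, np.asarray(score_b))
    m, n = len(v10_a), len(v01_a)           # positives, negatives
    s10 = np.cov(v10_a, v10_b)              # covariance over positive placements
    s01 = np.cov(v01_a, v01_b)              # covariance over negative placements
    var = (s10[0, 0] + s10[1, 1] - 2 * s10[0, 1]) / m \
        + (s01[0, 0] + s01[1, 1] - 2 * s01[0, 1]) / n
    z = (auc_a - auc_b) / sqrt(var)
    p = 1 - erf(abs(z) / sqrt(2))           # two-sided p under a normal null
    return auc_a, auc_b, z, p

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 400)
score_a = y + rng.normal(0, 0.8, 400)  # stronger (less noisy) model
score_b = y + rng.normal(0, 2.5, 400)  # weaker (noisier) model
auc_a, auc_b, z, p = delong_test(y, score_a, score_b)
print(f"AUC_A = {auc_a:.3f}, AUC_B = {auc_b:.3f}, z = {z:.2f}, p = {p:.4f}")
```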
DeLong's test should not be applied to nested models evaluated on the same data used to fit the parameters. When one model is a subset of another (for example, logistic regression with and without an additional predictor), the asymptotic theory underlying DeLong's test degenerates, potentially producing incorrect p-values. For nested models, the likelihood ratio test or a bootstrap-based approach is more appropriate.
DeLong's test is available in R through the pROC package (roc.test function) and in Python through third-party packages such as mlstatkit. An efficient O(n log n) algorithm by Sun and Xu (2014) makes the test practical for large datasets containing millions of observations.
scikit-learn provides comprehensive support for AUC computation through sklearn.metrics.roc_auc_score.
sklearn.metrics.roc_auc_score(
y_true, y_score, *,
average='macro',
sample_weight=None,
max_fpr=None,
multi_class='raise',
labels=None
)
| Parameter | Description | Values |
|---|---|---|
| y_true | True binary labels or label indicator matrix | array-like |
| y_score | Predicted scores (probabilities or decision function values) | array-like |
| average | Averaging method for multiclass/multilabel | 'micro', 'macro', 'weighted', 'samples', None |
| max_fpr | Upper bound on FPR for partial AUC computation | float in (0, 1] or None |
| multi_class | Strategy for multiclass AUC | 'raise', 'ovr', 'ovo' |
Binary classification:
from sklearn.metrics import roc_auc_score
import numpy as np
y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])
auc = roc_auc_score(y_true, y_scores) # Returns 0.75
With a trained model:
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score
X, y = load_breast_cancer(return_X_y=True)
clf = LogisticRegression(solver='newton-cholesky', random_state=0).fit(X, y)
auc = roc_auc_score(y, clf.predict_proba(X)[:, 1]) # Returns ~0.99
# Cross-validated AUC (recommended to avoid overfitting)
cv_scores = cross_val_score(clf, X, y, cv=5, scoring='roc_auc')
print(f"Mean AUC: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
Multiclass classification:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
X, y = load_iris(return_X_y=True)
clf = LogisticRegression(solver='newton-cholesky').fit(X, y)
auc = roc_auc_score(y, clf.predict_proba(X), multi_class='ovr') # Returns ~0.99
Plotting the ROC curve:
from sklearn.metrics import roc_curve, RocCurveDisplay
import matplotlib.pyplot as plt
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
RocCurveDisplay(fpr=fpr, tpr=tpr).plot()
plt.title('ROC Curve')
plt.show()
| Function | Purpose |
|---|---|
| sklearn.metrics.roc_curve | Compute FPR, TPR, and thresholds for plotting the ROC curve |
| sklearn.metrics.auc | Compute area under any curve using the trapezoidal rule |
| sklearn.metrics.average_precision_score | Compute the PR-AUC (area under the precision-recall curve) |
| sklearn.metrics.RocCurveDisplay | Visualize ROC curves with AUC annotation |
| sklearn.metrics.brier_score_loss | Compute the Brier score for assessing probability calibration |
AUC is used across a wide range of domains where binary classification is central to the task.
| Domain | Application | Why AUC Is Preferred |
|---|---|---|
| Medical diagnostics | Evaluating diagnostic tests and biomarkers | Threshold-independent comparison of tests across different operating points |
| Credit scoring | Assessing default prediction models | Gini coefficient (derived from AUC) is an industry-standard metric |
| Fraud detection | Ranking transactions by fraud risk | Imbalanced datasets benefit from a ranking-based metric |
| Information retrieval | Evaluating search ranking quality | AUC naturally captures ranking performance |
| Drug discovery | Screening compounds for biological activity | Virtual screening involves ranking large compound libraries |
| Natural language processing | Spam filtering and sentiment analysis | Comparing models without committing to a classification threshold |
Best practices:
- Evaluate with cross-validated AUC (scoring='roc_auc' in scikit-learn) to avoid threshold selection bias and overfitting.
- Report a confidence interval (DeLong or bootstrap) alongside the point estimate.
- For severely imbalanced data, report PR-AUC alongside AUC-ROC.

Common pitfalls:
- Treating a high AUC as evidence that predicted probabilities are well-calibrated.
- Relying on AUC-ROC alone when the negative class heavily outnumbers the positive class.
- Declaring one model better than another on the basis of a higher AUC without a significance test such as DeLong's.
- Applying DeLong's test to nested models fitted on the same data.