# AUC-ROC

> Source: https://aiwiki.ai/wiki/auc_area_under_the_curve
> Updated: 2026-06-01
> Categories: Machine Learning, Model Evaluation, Statistics
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

## Introduction

In [machine learning](/wiki/machine_learning), the **Area Under the ROC Curve (AUC-ROC)** is a widely used [evaluation metric](/wiki/evaluation_metrics) for assessing the performance of [binary classification](/wiki/binary_classification) models. This measure evaluates a model's ability to discriminate between positive and negative [classes](/wiki/class) across all possible classification thresholds, providing a single scalar value that summarizes the classifier's overall ranking performance. The AUC-ROC metric has become a standard tool in fields such as medical diagnostics, credit scoring, fraud detection, and natural language processing.

Unlike threshold-dependent metrics such as [accuracy](/wiki/accuracy) or [F1 score](/wiki/f1_score), the AUC-ROC evaluates the classifier's ability to rank positive instances higher than negative instances, independent of any specific decision threshold. This threshold-invariant property makes it particularly valuable during model selection and comparison. AUC also has a clean probabilistic interpretation that ties it to classical non-parametric statistics, namely the Mann-Whitney U statistic and the [Wilcoxon rank-sum test](/wiki/wilcoxon_rank_sum_test).

Keep two usages of the acronym in mind. "AUC" alone most commonly refers to the area under the ROC curve, but it can also denote the area under a related curve such as the precision-recall curve (AUC-PR or AP). When ambiguous, the full names AUC-ROC and AUC-PR are preferred.

## History

The ROC curve predates machine learning by several decades. Its origin lies in signal detection theory, which emerged during and after World War II as engineers and psychologists studied how to separate true signals from background noise in radar systems and human perception experiments.

### Radar origins (1940s and 1950s)

During World War II, radar operators needed to decide whether a faint blip on a screen was an enemy aircraft or random electromagnetic noise. Engineers at MIT and other laboratories began formalizing the trade-off between hit rate and false alarm rate. The earliest fully developed mathematical treatment is generally attributed to W. W. Peterson, T. G. Birdsall, and W. C. Fox, whose 1954 Transactions of the IRE Professional Group on Information Theory paper, "The theory of signal detectability," laid out the formal framework that would become signal detection theory.[1] Their work introduced the receiver operating characteristic as a curve relating the conditional probability of detection to the conditional probability of false alarm under a likelihood ratio decision rule.[1]

David Green and John Swets consolidated and expanded this framework in their 1966 book *Signal Detection Theory and Psychophysics*, which became the canonical reference for ROC analysis in experimental psychology and audiology.[2] Green and Swets demonstrated that ROC curves could be used to compare the discrimination ability of human observers, animals, and machines on a common footing, regardless of their internal decision criteria.[2]

### Adoption in medical diagnostics (1960s through 1980s)

Medicine adopted ROC analysis in the 1960s for evaluating radiological and laboratory diagnostic tests. Lee Lusted, a radiologist, was among the earliest to advocate its use, writing in *Science* in 1971 and in his 1968 book *Introduction to Medical Decision Making*.[3] The technique was attractive to clinicians because it disentangled a test's intrinsic accuracy from the subjective decision threshold that a particular reader or laboratory might apply.

The paper that turned AUC into a standard medical statistic is J. A. Hanley and B. J. McNeil's 1982 *Radiology* article, "The meaning and use of the area under a receiver operating characteristic (ROC) curve."[4] Hanley and McNeil gave the probabilistic interpretation in unambiguous terms, derived a variance estimator, and demonstrated the equivalence between AUC and the Mann-Whitney U statistic.[4] Their 1983 follow-up provided a method for comparing AUCs from correlated samples.[5] John Swets's 1988 *Science* review, "Measuring the accuracy of diagnostic systems," further popularized the metric and gave the familiar verbal scale (0.5 random, 0.7 fair, 0.9 excellent) that many practitioners still cite.[8]

### Adoption in machine learning (1990s and 2000s)

ROC analysis was for most of its history a tool of psychology and clinical medicine. It entered mainstream machine learning through a small group of papers in the late 1990s. Andrew P. Bradley's 1997 *Pattern Recognition* article, "The use of the area under the ROC curve in the evaluation of machine learning algorithms," argued that AUC was a more discriminating metric than accuracy for comparing classifiers, especially under class imbalance or unequal misclassification costs.[10] Foster Provost and Tom Fawcett's 2001 *Machine Learning* article "Robust Classification for Imprecise Environments" introduced the ROC convex hull and showed how it could be used to choose between classifiers without committing to a single cost ratio.[11]

Tom Fawcett's 2006 tutorial paper, "An introduction to ROC analysis" in *Pattern Recognition Letters*, became the most cited introduction for ML practitioners.[14] By the early 2010s, AUC-ROC had become the default headline metric for binary classification in many Kaggle contests with rare positive classes.

The metric also acquired a parallel life in survival analysis, where Frank Harrell's c-index for survival models (introduced in his 1982 *JAMA* paper with Califf, Pryor, Lee, and Rosati) is the natural generalization of AUC to right-censored time-to-event data.[6] The c-index reduces to AUC when there is no censoring.[6]

## Mathematical definition

### The ROC curve and AUC as an integral

Let $f$ be a scoring function that maps an input $x$ to a real number, and let the binary label $y$ take values 0 (negative) and 1 (positive). For a threshold $\tau$, define:

- True Positive Rate: $\mathrm{TPR}(\tau) = P(f(X) \ge \tau \mid Y = 1)$
- False Positive Rate: $\mathrm{FPR}(\tau) = P(f(X) \ge \tau \mid Y = 0)$

The ROC curve is the parametric curve $(\mathrm{FPR}(\tau), \mathrm{TPR}(\tau))$ traced out as $\tau$ ranges from $+\infty$ down to $-\infty$. The AUC is the area under this curve in the unit square:

$$\mathrm{AUC} = \int_0^1 \mathrm{TPR}(\mathrm{FPR}^{-1}(u))\, du$$

Equivalently, expressed as an integral with respect to the threshold:

$$\mathrm{AUC} = -\int_{-\infty}^{\infty} \mathrm{TPR}(\tau)\, d\mathrm{FPR}(\tau)$$

### Probabilistic interpretation

The single most useful identity in ROC analysis is the probabilistic interpretation of AUC. If $X_+$ is a randomly drawn positive example and $X_-$ is an independently drawn negative example, then:

$$\mathrm{AUC} = P(f(X_+) > f(X_-)) + \tfrac{1}{2} P(f(X_+) = f(X_-))$$

The half-weight on ties makes the identity exact even for scoring functions with positive probability of tied scores. In words: the AUC is the probability that the classifier scores a random positive higher than a random negative, with ties counted as half-wins. An AUC of 0.85 therefore means that on roughly 85 percent of randomly chosen positive-negative pairs, the model ranks them correctly.

This probabilistic interpretation was made precise by Hanley and McNeil in 1982.[4] It also explains why a random classifier has AUC 0.5 (a random ranking is correct on half the pairs by symmetry) and why a perfectly inverted classifier has AUC 0.
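The pairwise reading of AUC can be checked directly on a toy example. The following is a minimal sketch, not a production implementation: the function name `pairwise_auc` and the scores are illustrative assumptions, and the double loop is the naive approach that only suits small samples.

```python
from itertools import product


def pairwise_auc(pos_scores, neg_scores):
    """Fraction of positive-negative pairs ranked correctly; ties count one half."""
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for s, t in product(pos_scores, neg_scores):
        if s > t:
            wins += 1.0
        elif s == t:
            wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical scores: three positives, three negatives, one tied pair (0.4 vs 0.4).
pos = [0.9, 0.8, 0.4]
neg = [0.7, 0.4, 0.1]

auc = pairwise_auc(pos, neg)        # 7.5 concordant pairs out of 9
inverted = pairwise_auc(neg, pos)   # swapping the roles of the classes gives 1 - AUC
print(round(auc, 4), round(inverted, 4))
```

Swapping the two score lists illustrates the symmetry behind the inverted-classifier case: the two values sum to one.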

### Relation to the Mann-Whitney U statistic

Let $\{s_1, \dots, s_m\}$ be the scores assigned to the $m$ positive examples and $\{t_1, \dots, t_n\}$ the scores assigned to the $n$ negative examples. The Mann-Whitney U statistic counts the number of pairs where the positive score exceeds the negative score, with each tied pair counted as one half:

$$U = \sum_{i=1}^{m} \sum_{j=1}^{n} \left[ \mathbb{1}(s_i > t_j) + \tfrac{1}{2}\mathbb{1}(s_i = t_j) \right]$$

The normalized U statistic is exactly the empirical AUC:

$$\widehat{\mathrm{AUC}} = \frac{U}{mn}$$

The Wilcoxon rank-sum statistic gives the same value through a different bookkeeping: rank all $m+n$ scores together (using average ranks for ties), sum the ranks of the positives to get $R_+$, and apply the linear transformation $U = R_+ - m(m+1)/2$. The equivalence between empirical AUC and the Wilcoxon-Mann-Whitney statistic ties ROC analysis to decades of non-parametric statistics (the Wilcoxon and Mann-Whitney papers date from the 1940s) and gives it well-understood distributional properties.
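The rank-sum bookkeeping can be sketched as follows. This is an illustrative implementation under stated assumptions: the helper names `midranks` and `rank_sum_auc` are hypothetical, ties receive average ranks, and the result is compared against a brute-force pair count on made-up scores.

```python
def midranks(values):
    """1-based ranks; tied values share the average of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_auc(pos_scores, neg_scores):
    """AUC from the Wilcoxon rank sum: U = R_pos - m(m+1)/2, AUC = U / (mn)."""
    m, n = len(pos_scores), len(neg_scores)
    ranks = midranks(list(pos_scores) + list(neg_scores))
    r_pos = sum(ranks[:m])  # positives occupy the first m slots of the pooled list
    u = r_pos - m * (m + 1) / 2.0
    return u / (m * n)


pos = [0.9, 0.8, 0.4]
neg = [0.7, 0.4, 0.1]
auc_from_ranks = rank_sum_auc(pos, neg)

# Brute-force pair count for comparison (ties count one half).
brute = sum(1.0 if s > t else 0.5 if s == t else 0.0 for s in pos for t in neg) / 9
print(auc_from_ranks, brute)
```

The cost is dominated by the sort, so this approach scales as $O((m+n)\log(m+n))$ rather than $O(mn)$.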

### Relation to the concordance index

In survival analysis, [Harrell's c-index](/wiki/c_index) generalizes AUC to right-censored time-to-event data. The c-index is the proportion of usable pairs in which the subject with the higher predicted risk also experienced the event earlier. When there is no censoring, the c-index reduces to the AUC computed against the binary outcome at each time, and even under censoring it can be interpreted as a weighted average of time-specific AUCs. The c-index is the standard discrimination metric for the Cox proportional hazards model and for modern deep-learning-based survival models.[6]

## Construction of the ROC curve

The **Receiver Operating Characteristic (ROC) curve** is a graphical representation of a classifier's diagnostic ability across all classification thresholds. This section covers how to construct and read the curve in practice.

### Step by step construction

To construct an ROC curve:

1. Train a binary classifier that produces a continuous score or probability for each instance.
2. Sort all instances by predicted probability in descending order.
3. For each unique predicted probability value, use it as a classification threshold.
4. At each threshold, compute the True Positive Rate (TPR) and False Positive Rate (FPR).
5. Plot each (FPR, TPR) pair, then connect the points.

The resulting curve starts at the origin (0, 0), where the threshold is set so high that no instances are predicted as positive, and ends at (1, 1), where the threshold is set so low that all instances are predicted as positive. Consider a small dataset with 5 positives and 5 negatives. As the threshold decreases, each time a positive is encountered the TPR steps upward; each time a negative is encountered the FPR steps rightward. The resulting staircase pattern is the empirical ROC curve.
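The construction steps above can be sketched in a few lines. This is a minimal illustration, assuming 0/1 integer labels; the function name `roc_points` and the ten scores are hypothetical, and tied scores are consumed together so that each threshold yields one vertex.

```python
def roc_points(labels, scores):
    """Return the empirical ROC vertices [(fpr, tpr), ...] from (0, 0) to (1, 1)."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    # Step 2: sort by score, highest first.
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Steps 3-4: lower the threshold to the next unique score and
        # count every instance that shares it.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical dataset with 5 positives and 5 negatives.
labels = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0]
scores = [0.95, 0.90, 0.85, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20]
points = roc_points(labels, scores)
for fpr, tpr in points:
    print(f"FPR={fpr:.1f}  TPR={tpr:.1f}")
```

Each positive moves the curve up by 1/5 and each negative moves it right by 1/5, which produces the staircase described above.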

### True Positive Rate and False Positive Rate

The two axes of the ROC curve are defined by two fundamental metrics from the [confusion matrix](/wiki/confusion_matrix):

**True Positive Rate (TPR)**, also known as [sensitivity](/wiki/sensitivity) or [recall](/wiki/recall):

**TPR = TP / (TP + FN)**

TPR measures the proportion of actual positive instances that the classifier correctly identifies. A TPR of 1.0 means the classifier catches every positive instance.

**False Positive Rate (FPR)**, also known as the fall-out or (1 - [specificity](/wiki/specificity)):

**FPR = FP / (FP + TN)**

FPR measures the proportion of actual negative instances that the classifier incorrectly labels as positive. A FPR of 0.0 means the classifier never raises a false alarm.

| Component | Definition or formula | Interpretation | Medical Example |
|---|---|---|---|
| True Positive (TP) | Correctly predicted positive | Hit | Sick patient correctly diagnosed as sick |
| False Positive (FP) | Incorrectly predicted positive | False alarm | Healthy patient incorrectly diagnosed as sick |
| True Negative (TN) | Correctly predicted negative | Correct rejection | Healthy patient correctly diagnosed as healthy |
| False Negative (FN) | Incorrectly predicted negative | Miss | Sick patient incorrectly diagnosed as healthy |
| TPR (Sensitivity) | TP / (TP + FN) | Detection rate | Proportion of sick patients correctly identified |
| FPR (1 - Specificity) | FP / (FP + TN) | False alarm rate | Proportion of healthy patients incorrectly flagged |
| Specificity | TN / (TN + FP) | Correct rejection rate | Proportion of healthy patients correctly cleared |
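As a small worked example of the formulas in the table, the sketch below computes the three rates from confusion-matrix counts. The function name `rates` and the counts are illustrative assumptions.

```python
def rates(tp, fp, tn, fn):
    """Compute TPR, FPR and specificity from confusion-matrix counts."""
    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("need at least one actual positive and one actual negative")
    return {
        "tpr": tp / (tp + fn),          # sensitivity / recall
        "fpr": fp / (fp + tn),          # fall-out
        "specificity": tn / (tn + fp),  # equals 1 - FPR
    }


# Hypothetical screening result: 100 sick patients, 100 healthy patients.
r = rates(tp=80, fp=10, tn=90, fn=20)
print(r)
```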

### Reading the ROC curve

The ROC curve provides a visual summary of the trade-off between sensitivity and specificity at every threshold. Key reference points on the plot include:

- **Top-left corner (0, 1):** Represents a perfect classifier that achieves a TPR of 1.0 and an FPR of 0.0. The closer the curve hugs this corner, the better the model performs.
- **Diagonal line from (0, 0) to (1, 1):** Represents a random classifier that has no discriminative ability. A classifier whose ROC curve falls along this line performs no better than flipping a coin.
- **Below the diagonal:** A curve that falls below the diagonal indicates a classifier that performs worse than random chance, which typically means the predictions are inverted.
- **Point (0, 0):** Corresponds to a threshold so high that nothing is classified as positive. TPR = 0, FPR = 0.
- **Point (1, 1):** Corresponds to a threshold so low that everything is classified as positive. TPR = 1, FPR = 1.

A classifier whose curve lies below the diagonal everywhere is as informative as its mirror image above the diagonal, because flipping its predictions yields an AUC of one minus the original value. Values below 0.5 are therefore rarely reported in practice.

## What is AUC?

The **AUC (Area Under the Curve)** is the total area underneath the ROC curve. It provides a single number that summarizes the overall performance of the classifier across all possible thresholds.

### AUC interpretation

The AUC has a direct probabilistic interpretation: it represents the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance, with tied scores counted as one half. This is sometimes called the concordance statistic or the c-statistic. In formal terms, if *pos* denotes a randomly selected positive example and *neg* denotes a randomly selected negative example, and tied scores have probability zero:

**AUC = P(score(pos) > score(neg))**

This interpretation makes AUC especially intuitive. An AUC of 0.85 means that if you randomly pick one positive and one negative example, there is an 85% chance the model assigns a higher score to the positive example.

| AUC Value | Interpretation | Practical Meaning |
|---|---|---|
| 1.0 | Perfect classifier | The model perfectly separates all positive and negative instances |
| 0.9 - 1.0 | Excellent | The model has outstanding discriminative ability |
| 0.8 - 0.9 | Good | The model has strong discriminative ability |
| 0.7 - 0.8 | Fair | The model has acceptable discriminative ability |
| 0.6 - 0.7 | Poor | The model has weak discriminative ability |
| 0.5 | Random | The model has no discriminative ability; equivalent to random guessing |
| Below 0.5 | Inverted | The model's predictions are inversely related to the actual classes |

### Properties of AUC

AUC has several formal properties that explain its popularity and also delimit its usefulness.

- **Bounded in [0, 1].** A value of 0.5 corresponds to random performance and 1 to perfect ranking.
- **Invariant to monotonic transformations of scores.** If $g$ is any strictly increasing function, then the classifier defined by $g(f(x))$ has the same AUC as the classifier defined by $f(x)$. In particular, AUC depends only on the ranks of the predictions, not on their absolute values. This is why AUC does not measure [calibration](/wiki/calibration).
- **Invariant to class prior shift in the ROC space.** Because TPR and FPR are class-conditional, changing the ratio of positives to negatives in the evaluation set leaves the ROC curve unchanged in expectation, though it does affect its sample variance. AUC-PR, in contrast, depends on prevalence.
- **Dominance translates to AUC ordering.** If one ROC curve lies on or above another at every FPR in [0, 1], the dominating classifier has an AUC at least as large. The converse is not true: a larger AUC does not imply dominance, because the two curves can cross, and crossing curves can even have equal AUCs.
- **Differentiability.** The empirical AUC is a piecewise-constant function of the model parameters (it changes only when two scores swap order), so its gradient is zero wherever it is defined. This is why most learning algorithms optimize a surrogate loss that gradient-based methods can handle (log loss, hinge loss) rather than AUC directly. Pairwise ranking losses such as the one used in [RankNet](/wiki/ranknet) are smooth surrogates for AUC.

## Estimation and statistical inference

In finite samples we never observe the population AUC; we estimate it. Several estimators and inference procedures are in common use.

### Trapezoidal rule

The simplest estimator linearly interpolates between successive points on the empirical ROC curve and sums the trapezoid areas. For a curve with $K$ vertices $(\mathrm{FPR}_k, \mathrm{TPR}_k)$ sorted by FPR:

$$\widehat{\mathrm{AUC}} = \sum_{k=1}^{K-1} \tfrac{1}{2}(\mathrm{FPR}_{k+1} - \mathrm{FPR}_k)(\mathrm{TPR}_{k+1} + \mathrm{TPR}_k)$$

This is the default in [scikit-learn](/wiki/scikit_learn)'s `roc_auc_score` and in most software packages.[30] The trapezoidal estimator is numerically identical to the Wilcoxon-Mann-Whitney statistic when ties are handled consistently.
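The trapezoidal sum can be written out directly. The sketch below is a plain-Python illustration, not scikit-learn's code; the function name `trapezoid_auc` and the ROC vertices are assumptions, with the vertices corresponding to hypothetical positive scores 0.9 and 0.6 and negative scores 0.7 and 0.3.

```python
def trapezoid_auc(points):
    """Area under a piecewise-linear ROC curve given (fpr, tpr) vertices."""
    pts = sorted(points)  # sort by FPR, then TPR
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += 0.5 * (x1 - x0) * (y1 + y0)
    return area


# Vertices for positives scored 0.9 and 0.6, negatives scored 0.7 and 0.3.
vertices = [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
auc_trap = trapezoid_auc(vertices)

# Pair counting on the same scores: 3 of the 4 pairs are concordant.
auc_pairs = sum(1.0 if s > t else 0.5 if s == t else 0.0
                for s in (0.9, 0.6) for t in (0.7, 0.3)) / 4
print(auc_trap, auc_pairs)
```

On this example both routes give the same value, which illustrates the agreement between the trapezoidal estimator and the pair-counting statistic.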

### Wilcoxon-Mann-Whitney estimator

A more direct estimator counts concordant pairs over all positive-negative combinations:

$$\widehat{\mathrm{AUC}} = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} \psi(s_i, t_j),\quad \psi(a, b) = \begin{cases} 1 & a > b \\ 1/2 & a = b \\ 0 & a < b \end{cases}$$

This estimator is exact for the sample at hand, requires no interpolation, and is an unbiased estimate of the population AUC under the probabilistic definition. Its naive $O(mn)$ implementation is fine for small samples; for large samples an $O((m+n)\log(m+n))$ algorithm based on sorting and rank counting is used.

### DeLong's variance estimator

E. R. DeLong, D. M. DeLong, and D. L. Clarke-Pearson's 1988 *Biometrics* paper introduced a non-parametric variance estimator for the AUC based on the theory of generalized U-statistics.[7] Their formula expresses the variance of $\widehat{\mathrm{AUC}}$ in terms of the variances of certain placement values $V_{10}(s_i)$ and $V_{01}(t_j)$:

$$\widehat{\mathrm{Var}}(\widehat{\mathrm{AUC}}) = \frac{S_{10}}{m} + \frac{S_{01}}{n}$$

where $S_{10}$ and $S_{01}$ are the empirical variances of the placement values. The DeLong estimator also gives the covariance between two AUCs estimated on the same sample, which is the basis for testing whether two classifiers' AUCs differ significantly.[7] DeLong's test remains the standard comparison procedure in medical statistics and is implemented in the R packages [pROC](/wiki/proc) and [ROCR](/wiki/rocr).[20]

A fast $O((m+n)\log(m+n))$ algorithm for the DeLong estimator was given by X. Sun and W. Xu in 2014, replacing the original $O(mn)$ computation.[21] This is what most modern implementations use.
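The variance formula above can be sketched with the naive $O(mn)$ computation of placement values. This is an illustrative sketch, not the fast algorithm and not the code of any particular package; the function names and scores are assumptions, $V_{10}(s_i)$ is taken as the mean of $\psi(s_i, t_j)$ over negatives, $V_{01}(t_j)$ as the mean over positives, and $S_{10}$, $S_{01}$ as sample variances with $m-1$ and $n-1$ denominators.

```python
from statistics import mean, variance


def psi(a, b):
    """Pair kernel: 1 if a > b, 1/2 if tied, 0 otherwise."""
    return 1.0 if a > b else 0.5 if a == b else 0.0


def delong_auc_variance(pos_scores, neg_scores):
    """Return (AUC, estimated variance) using placement values; needs m, n >= 2."""
    m, n = len(pos_scores), len(neg_scores)
    if m < 2 or n < 2:
        raise ValueError("need at least two positives and two negatives")
    # Placement value of each positive: share of negatives it outranks.
    v10 = [sum(psi(s, t) for t in neg_scores) / n for s in pos_scores]
    # Placement value of each negative: share of positives that outrank it.
    v01 = [sum(psi(s, t) for s in pos_scores) / m for t in neg_scores]
    auc = mean(v10)
    s10 = variance(v10)  # sample variance, denominator m - 1
    s01 = variance(v01)  # sample variance, denominator n - 1
    return auc, s10 / m + s01 / n


pos = [0.9, 0.8, 0.4]
neg = [0.7, 0.4, 0.1]
auc_hat, var_hat = delong_auc_variance(pos, neg)
std_err = var_hat ** 0.5
print(round(auc_hat, 4), round(var_hat, 4), round(std_err, 4))
```

With only three examples per class the standard error is large, as expected; the example is meant to show the mechanics rather than a realistic sample size.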

### Bootstrap confidence intervals

Resampling is a flexible alternative when the DeLong assumptions are uncomfortable or when joint inference is needed over many quantities. The standard procedure samples $B$ bootstrap replicates (commonly 1,000 to 10,000) of the test set with replacement, computes $\widehat{\mathrm{AUC}}^{(b)}$ on each replicate, and reports the percentile or BCa interval. The bootstrap also handles paired comparisons across non-overlapping subsets, stratified sampling, and partial AUC. The trade-off is computational cost and the usual caveats under heavy class imbalance or near-zero variance.
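A percentile bootstrap interval can be sketched as follows. The function names, the data, and the choice to redraw replicates that contain only one class are assumptions made for illustration; stratified resampling or a BCa interval would be reasonable alternatives.

```python
import random


def auc(labels, scores):
    """Pair-counting AUC for 0/1 labels; ties count one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if s > t else 0.5 if s == t else 0.0 for s in pos for t in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUC (resampling whole test rows)."""
    if not 0 < sum(labels) < len(labels):
        raise ValueError("both classes must be present")
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        # Assumption: replicates missing a class are redrawn, since AUC is undefined.
        if 0 < sum(ys) < n:
            stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * (n_boot - 1))]
    hi = stats[int((1 - alpha / 2) * (n_boot - 1))]
    return lo, hi


# Hypothetical test set with 6 positives and 6 negatives.
labels = [1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.95, 0.90, 0.85, 0.80, 0.70, 0.65, 0.60, 0.50, 0.40, 0.35, 0.20, 0.10]

point = auc(labels, scores)
lo, hi = bootstrap_auc_ci(labels, scores, n_boot=500, seed=0)
print(f"AUC = {point:.3f}, 95% percentile interval = [{lo:.3f}, {hi:.3f}]")
```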

### Hanley and McNeil parametric variance

Hanley and McNeil's 1982 paper gave an earlier variance formula assuming exponential score distributions in each class, a reasonable approximation for many medical test scores.[4] It is still cited in the radiology literature but has largely been superseded by DeLong's distribution-free estimator.[7]

## AUC-PR (Precision-Recall AUC)

The **Precision-Recall (PR) curve** is an alternative to the ROC curve that focuses specifically on the positive class. Instead of plotting TPR vs. FPR, the PR curve plots [precision](/wiki/precision) (y-axis) against [recall](/wiki/recall) (x-axis).

**Precision = TP / (TP + FP)**

Precision measures how many of the predicted positives are actually positive. Unlike FPR, precision is directly affected by the class distribution, making it more sensitive to false positives when the positive class is rare.

**AUC-PR** is the area under the Precision-Recall curve. A random classifier achieves an AUC-PR approximately equal to the prevalence of the positive class (e.g., 0.001 for a 0.1% prevalence), whereas a perfect classifier achieves an AUC-PR of 1.0.

### Davis and Goadrich 2006: the formal connection

The formal relationship between ROC and PR curves was established by Jesse Davis and Mark Goadrich in their 2006 ICML paper, "The relationship between Precision-Recall and ROC curves."[15] Their key results are:

- For a fixed dataset, every point in ROC space corresponds to a unique point in PR space, and vice versa, given the class prior.
- One ROC curve dominates another (lies above it everywhere) if and only if the corresponding PR curve dominates the other in PR space. Dominance is preserved across the two representations.
- Crucially, a ROC curve can have higher AUC than another while the *PR* curve has lower AUC, because the two areas integrate over different quantities. This means a model with the best AUC-ROC may not be the best choice when measured by AUC-PR, and the difference matters in imbalanced regimes.

Davis and Goadrich also gave a correct interpolation scheme for PR space.[15] Naive linear interpolation between PR points overestimates the curve; the correct interpolation traces out a hyperbolic arc that reflects the underlying confusion matrix counts. Failure to use the right interpolation is a common source of inflated AUC-PR values in the wild.

### Saito and Rehmsmeier 2015: empirical case

T. Saito and M. Rehmsmeier's 2015 *PLOS ONE* paper, "The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets," demonstrated empirically that the PR plot exposes performance differences that the ROC plot can hide.[22] Their simulation studies showed that ROC curves can look impressively close to the top-left corner even when precision at useful recall levels is low, because the abundance of true negatives keeps the FPR small even when a large absolute number of false positives accumulates.[22] The paper has been widely cited as practical justification for reporting both metrics in imbalanced settings.

### Average precision and stepwise computation

AUC-PR is most often computed via the average precision (AP) estimator, which avoids interpolation entirely:

$$\mathrm{AP} = \sum_{n} (R_n - R_{n-1}) P_n$$

where $P_n$ and $R_n$ are precision and recall at the $n$-th threshold, with thresholds taken in descending order of score and $R_0 = 0$. This is the formula used by scikit-learn's `average_precision_score`.[30] The Pascal VOC and COCO object detection benchmarks use interpolated variants of the same idea (11-point interpolation in early Pascal VOC, 101-point interpolation in COCO). Average precision and the trapezoidal AUC-PR generally agree closely on well-resolved curves; they can differ on small samples or curves with long flat segments.
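The stepwise sum can be sketched in plain Python. This is an illustration of the formula rather than any library's implementation; the function name `average_precision`, the 0/1 integer labels, and the scores are assumptions, and tied scores are grouped into a single threshold.

```python
def average_precision(labels, scores):
    """AP = sum over thresholds of (R_n - R_{n-1}) * P_n, with R_0 = 0."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive example")
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    ap = 0.0
    tp = fp = 0
    prev_recall = 0.0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Admit every instance that shares the current score.
        while i < len(ranked) and ranked[i][0] == threshold:
            tp += ranked[i][1]
            fp += 1 - ranked[i][1]
            i += 1
        recall = tp / n_pos
        precision = tp / (tp + fp)
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


# Hypothetical ranking: positives at ranks 1 and 3 out of 5.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5]
ap = average_precision(labels, scores)  # 0.5 * 1 + 0.5 * (2/3)
print(round(ap, 4))
```

Only thresholds at which recall increases contribute to the sum, so negatives ranked below the last positive do not change the result.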

## Partial AUC

The **partial AUC** restricts the integration to a clinically or operationally meaningful slice of the ROC curve, usually a low-FPR or high-TPR region. The standard formulation, introduced by McClish in 1989, integrates TPR from $\mathrm{FPR} = 0$ to $\mathrm{FPR} = e$ for some user-specified $e$:[9]

$$\mathrm{pAUC}(0, e) = \int_0^e \mathrm{TPR}(\mathrm{FPR}^{-1}(u))\, du$$

A standardized version divides by the maximum possible area in that strip so the result is again on a 0 to 1 scale. Partial AUC is the appropriate metric in regimes where false positives are intolerable, such as cancer screening tests that must operate at very low FPR to keep follow-up biopsies manageable, or industrial defect detection where false alarms shut down production lines. It is also used in pharmaceutical screening (high-throughput chemical assays), security applications (intrusion detection at low false alarm rates), and any task where only the top of a ranking matters.
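The partial area and the simple standardization described above can be sketched as follows. The function name `partial_auc` and the ROC vertices are hypothetical, the curve is assumed to be piecewise linear between vertices, and the last segment is clipped at the FPR limit by linear interpolation. Dividing by $e$ is only one possible normalization; other standardizations exist.

```python
def partial_auc(points, max_fpr):
    """Area under a piecewise-linear ROC curve for FPR in [0, max_fpr]."""
    if not 0.0 < max_fpr <= 1.0:
        raise ValueError("max_fpr must be in (0, 1]")
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Clip the segment at max_fpr using linear interpolation.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += 0.5 * (x1 - x0) * (y1 + y0)
    return area


vertices = [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
e = 0.25
pauc = partial_auc(vertices, e)   # strip of width 0.25 at height 0.5
standardized = pauc / e           # maximum possible area in the strip is e
print(pauc, standardized)
```

Setting the limit to 1 recovers the full trapezoidal AUC, which is a convenient sanity check.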

Partial AUC inherits the probabilistic interpretation in a restricted form: it estimates the joint probability that a positive scores above a negative and that the negative is itself among the higher-scoring negatives. Bandos, Guo, and Gur in 2017 proposed extensions for two-region partial AUC and other shape constraints.[24]

The `pROC` R package and scikit-learn's `roc_auc_score` with `max_fpr` argument both support partial AUC computation.[20][30] The `max_fpr` argument was added to scikit-learn in version 0.20 (2018).[30]

## Multiclass and multi-label AUC

The standard AUC-ROC is defined for binary classification, but it can be extended to multi-class problems using several strategies. Two influential papers laid out the alternatives.

### Hand and Till 2001

David Hand and Robert Till's 2001 *Machine Learning* article, "A Simple Generalisation of the Area Under the ROC Curve for Multiple Class Classification Problems," proposed averaging the pairwise AUCs across all class pairs.[12] For $C$ classes the metric is:

$$\mathrm{M} = \frac{2}{C(C-1)} \sum_{i < j} \widehat{A}(i, j)$$

where $\widehat{A}(i, j)$ is the AUC computed restricted to examples of classes $i$ and $j$. The Hand-Till generalization is the basis of the `multi_class='ovo'` option in scikit-learn.[12][30]
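A sketch of the pairwise average follows. It assumes, following Hand and Till's construction, that $\widehat{A}(i, j)$ is the mean of two directed AUCs: one using the class-$i$ probability to separate class $i$ from class $j$, and one using the class-$j$ probability for the reverse. The function names and the probability vectors are illustrative, and class labels are assumed to be column indices into the probability vectors.

```python
from itertools import combinations


def binary_auc(pos_scores, neg_scores):
    """Pair-counting AUC; ties count one half."""
    wins = sum(1.0 if s > t else 0.5 if s == t else 0.0
               for s in pos_scores for t in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))


def hand_till_m(labels, probs):
    """Average pairwise AUC over all class pairs (labels index into probs rows)."""
    classes = sorted(set(labels))
    total = 0.0
    for i, j in combinations(classes, 2):
        # Class-i probability evaluated on examples of class i versus class j.
        a_i_given_j = binary_auc(
            [p[i] for y, p in zip(labels, probs) if y == i],
            [p[i] for y, p in zip(labels, probs) if y == j],
        )
        # Class-j probability evaluated on examples of class j versus class i.
        a_j_given_i = binary_auc(
            [p[j] for y, p in zip(labels, probs) if y == j],
            [p[j] for y, p in zip(labels, probs) if y == i],
        )
        total += 0.5 * (a_i_given_j + a_j_given_i)
    c = len(classes)
    return 2.0 * total / (c * (c - 1))


# Hypothetical three-class problem with well-separated predictions.
labels = [0, 0, 1, 1, 2, 2]
probs = [
    [0.8, 0.1, 0.1], [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1], [0.2, 0.6, 0.2],
    [0.1, 0.1, 0.8], [0.2, 0.2, 0.6],
]
m_separated = hand_till_m(labels, probs)

# An uninformative model that gives every class the same probability.
m_uninformative = hand_till_m(labels, [[1 / 3, 1 / 3, 1 / 3]] * 6)
print(m_separated, m_uninformative)
```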

### Averaging strategies

| Strategy | Description | When to use |
|---|---|---|
| **One-vs-Rest (OvR)** | Compute the ROC curve and AUC for each class against all other classes combined. The final AUC is the average across all classes. | When individual class performance matters |
| **One-vs-One (OvO)** | Compute the AUC for every pair of classes and average the results. This produces C*(C-1)/2 pairwise AUC values for C classes. | When pairwise discrimination is important |
| **Macro-averaging** | Compute AUC for each class independently and take the unweighted mean. Treats all classes equally. | When all classes are equally important |
| **Weighted averaging** | Compute AUC for each class and take a weighted mean based on the number of instances in each class. | When class frequency should influence the metric |
| **Micro-averaging** | Pool all per-class predictions and compute a single AUC. | When the global ranking matters more than per-class behavior |

The one-vs-rest and one-vs-one strategies are not interchangeable. OvR is sensitive to class imbalance because the rest-of-classes pool is dominated by frequent classes; OvO is symmetric in pairs but can be optimistic when some pairs are easy to separate.

[Scikit-learn](/wiki/scikit_learn)'s `roc_auc_score` function supports multi-class AUC computation through the `multi_class` parameter, accepting both 'ovr' and 'ovo' strategies, combined with the `average` parameter ('macro', 'weighted', or `None`).[30]

### Multi-label AUC

For multi-label problems where each example can belong to multiple classes, AUC is computed per label and then averaged. Micro-averaging pools predictions across all labels into one binary task before computing AUC; macro-averaging computes a separate AUC for each label and averages. Micro-averaging is dominated by frequent labels; macro-averaging treats all labels equally.

## Class imbalance and AUC

While AUC-ROC is a powerful metric, it has notable limitations, especially when dealing with [imbalanced datasets](/wiki/imbalanced_data).

### Why AUC-ROC can mislead

When the negative class vastly outnumbers the positive class (for example, in fraud detection where only 0.1% of transactions are fraudulent), the AUC-ROC can present an overly optimistic picture of model performance. This happens because the FPR denominator (FP + TN) is dominated by the large number of true negatives. Even a significant number of false positives may appear as a small FPR.

For example, if there are 10,000 negative instances and the model incorrectly classifies 100 of them as positive, the FPR is only 0.01 (1%). But in absolute terms, those 100 false positives may be unacceptable in a real-world setting. The ROC curve and AUC may not reflect this problem.
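The arithmetic behind this example can be made explicit. The sketch below adds two assumptions that are not in the example itself: the dataset contains 10 positives (roughly the 0.1% prevalence mentioned earlier), and the model catches all of them.

```python
negatives = 10_000
false_positives = 100

# Assumptions for illustration: 10 positives (about 0.1% prevalence), all detected.
positives = 10
true_positives = 10

fpr = false_positives / negatives
tpr = true_positives / positives
precision = true_positives / (true_positives + false_positives)

print(f"FPR = {fpr:.2%}, TPR = {tpr:.0%}, precision = {precision:.1%}")
```

Under these assumptions the operating point sits at FPR 1% and TPR 100%, which looks nearly perfect in ROC space, yet fewer than one in ten flagged instances is a true positive.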

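This limitation was formally analyzed by Jesse Davis and Mark Goadrich in their influential 2006 paper, which demonstrated the mathematical relationship between ROC and PR curves. They showed that dominance carries over between the two spaces, but that a higher AUC-ROC does not guarantee a higher AUC-PR, so an algorithm that optimizes the area under the ROC curve is not guaranteed to optimize the area under the PR curve.[15]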

The practical recommendation for imbalanced problems is to report AUC-PR (or average precision) alongside AUC-ROC, since the former is anchored to the positive-class prevalence and reveals the difficulty of achieving high precision at high recall.

### AUC-ROC vs. AUC-PR comparison

| Aspect | AUC-ROC | AUC-PR |
|---|---|---|
| Axes | FPR (x) vs. TPR (y) | Recall (x) vs. Precision (y) |
| Random Baseline | 0.5 (always) | Approximately equal to positive class prevalence |
| Perfect Score | 1.0 | 1.0 |
| Sensitivity to Imbalance | Low (can be overly optimistic) | High (reflects true difficulty) |
| Focus | Both classes equally | Positive (minority) class |
| Best Used When | Classes are roughly balanced | Positive class is rare or costly to miss |
| Common Domains | General model comparison | Medical diagnosis, fraud detection, information retrieval |
| Interpolation | Linear interpolation between points | Non-linear interpolation (stepped) |

As a general guideline, when the positive class prevalence is below 10-20%, the AUC-PR provides a more informative evaluation. The stronger the class imbalance, the larger the gap between AUC-ROC and AUC-PR tends to be. Saito and Rehmsmeier (2015) provided empirical evidence that the PR curve is more informative than the ROC curve for evaluating binary classifiers on imbalanced datasets.[22]

## Relation to calibration

AUC is a measure of ranking quality, not of probability calibration. A model that replaces each predicted probability with its rank among all predictions, scaled to the unit interval, has the same AUC as the original model even though its outputs are no longer probabilities at all. More formally, because AUC is invariant under strictly increasing transformations, two models that agree on the ranking of all examples have the same AUC even if one is perfectly calibrated and the other is not.

This distinction matters in deployment. A medical risk model with AUC 0.85 may rank patients perfectly well for triage but still be unsafe to interpret as "the probability you have this disease is 0.32" if its probabilities are systematically biased. Calibration is what makes the absolute number trustworthy, and is measured with different tools.
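The gap between discrimination and calibration can be illustrated with a toy example. In the sketch below, the labels, the probabilities, and the cubing distortion are all hypothetical; cubing is used only because it is strictly increasing on the unit interval, so it preserves the ranking while changing the probability values.

```python
def auc(labels, probs):
    """Pair-counting AUC for 0/1 labels; ties count one half."""
    pos = [p for y, p in zip(labels, probs) if y == 1]
    neg = [p for y, p in zip(labels, probs) if y == 0]
    wins = sum(1.0 if s > t else 0.5 if s == t else 0.0 for s in pos for t in neg)
    return wins / (len(pos) * len(neg))


def brier(labels, probs):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(labels, probs)) / len(labels)


labels = [1, 1, 0, 1, 0, 0]
probs = [0.9, 0.7, 0.6, 0.4, 0.2, 0.1]
distorted = [p ** 3 for p in probs]  # same ranking, different probability values

auc_original, auc_distorted = auc(labels, probs), auc(labels, distorted)
brier_original, brier_distorted = brier(labels, probs), brier(labels, distorted)
print(auc_original, auc_distorted)
print(round(brier_original, 4), round(brier_distorted, 4))
```

The two AUC values coincide while the Brier scores differ, which is why a ranking metric alone cannot certify that predicted probabilities are trustworthy.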

### Calibration metrics

| Metric | What it measures | Sensitive to ranking? | Sensitive to calibration? |
|---|---|---|---|
| AUC-ROC | Discrimination | Yes | No |
| AUC-PR | Discrimination on rare class | Yes | No |
| Brier score | Mean squared error of probabilities | Yes | Yes |
| Log loss | Expected negative log likelihood | Yes | Yes |
| Expected Calibration Error (ECE) | Average bin-wise gap between probability and accuracy | No | Yes |
| Reliability diagram | Visual calibration assessment | No | Yes |

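A common practice is to report AUC alongside one calibration metric (Brier score or ECE) and to inspect a reliability diagram. Post-hoc calibration methods such as [Platt scaling](/wiki/platt_scaling) and [isotonic regression](/wiki/isotonic_regression) can improve calibration with little or no effect on the AUC. Platt scaling applies a sigmoid that is strictly monotonic in the score, so when the fitted sigmoid is increasing it preserves the ranking and the AUC exactly; isotonic regression is only non-decreasing and can merge neighboring scores into ties, which may change the AUC slightly. This is convenient: you can fix calibration after the fact with little or no loss of discrimination.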

## AUC for ranking and information retrieval

AUC's probabilistic interpretation as a pairwise ranking probability makes it a natural metric for [learning-to-rank](/wiki/learning_to_rank) tasks and [information retrieval](/wiki/information_retrieval). In an IR setting, the "positives" are relevant documents for a query and the "negatives" are irrelevant documents. The AUC then measures the probability that a relevant document is ranked above an irrelevant one.

In practice, IR has largely moved away from AUC toward rank-position-aware metrics that emphasize the top of the list, because users rarely look beyond the first ten or twenty results. The dominant metrics in modern IR are [NDCG](/wiki/ndcg) (Normalized Discounted Cumulative Gain), Mean Average Precision (MAP), Mean Reciprocal Rank (MRR), and Precision/Recall at k.

For recommender systems, AUC remains popular for offline evaluation of collaborative filtering models because it has a clean pairwise interpretation: AUC equals the probability that a held-out positive item is ranked above a held-out negative item for the same user. Bayesian Personalized Ranking (BPR), introduced by Rendle et al. in 2009 for implicit feedback, optimizes a smooth surrogate of this per-user AUC: the non-differentiable pairwise indicator is replaced by the log of a sigmoid of the score difference.[18]
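The relationship between per-user AUC and the BPR objective can be sketched in a few lines. The function names are hypothetical, the scores are made up, and the regularization term of the full BPR criterion is omitted; this is an illustration of the pairwise idea, not a training implementation.

```python
import numpy as np

def per_user_auc(pos_scores, neg_scores):
    """Fraction of one user's (positive, negative) item pairs with the positive ranked higher."""
    pos = np.asarray(pos_scores, dtype=float)[:, None]
    neg = np.asarray(neg_scores, dtype=float)[None, :]
    return float((pos > neg).mean() + 0.5 * (pos == neg).mean())

def bpr_loss(pos_scores, neg_scores):
    """Mean of -log(sigmoid(x_ui - x_uj)) over the given (positive, negative) pairs."""
    diff = np.asarray(pos_scores, dtype=float) - np.asarray(neg_scores, dtype=float)
    # -log(sigmoid(d)) = log(1 + exp(-d)), computed stably with logaddexp
    return float(np.mean(np.logaddexp(0.0, -diff)))

pos = np.array([2.0, 0.5])          # scores of one user's held-out positive items
neg = np.array([1.0, -1.0, 0.0])    # scores of the same user's held-out negatives

auc_u = per_user_auc(pos, neg)                    # 5 of 6 pairs ordered correctly
loss_u = bpr_loss(pos[:, None], neg[None, :])     # smooth surrogate over all 6 pairs
print(auc_u, loss_u)
```

Unlike the AUC, the surrogate keeps decreasing as the margin between positive and negative scores grows, which is what makes it usable with gradient-based training.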

## Limitations and criticisms

Three streams of criticism have shaped how researchers think about AUC since the late 2000s.

### Lobo, Jimenez-Valverde, and Real (2008)

Jorge Lobo, Alberto Jimenez-Valverde, and Raimundo Real published "AUC: a misleading measure of the performance of predictive distribution models" in *Global Ecology and Biogeography* in 2008.[16] They listed five problems with AUC for species distribution modeling: it ignores the predicted probability values themselves; it summarizes performance over regions of ROC space that are not of interest; it weights omission and commission errors equally; it does not show the spatial distribution of model errors; and the total extent of the study area heavily influences AUC by changing the pool of true absences.[16] Their critique is most directly relevant to ecology, but several points apply elsewhere.

### Hand (2009)

David Hand's 2009 *Machine Learning* article, "Measuring classifier performance: a coherent alternative to the area under the ROC curve," gave the most theoretically pointed criticism.[17] Hand showed that AUC implicitly uses different misclassification cost ratios for different classifiers because the weighting at each operating point depends on the classifier's own ROC curve.[17] He proposed the H-measure as a coherent alternative that fixes a single cost-weighting distribution shared across classifiers.[17] The H-measure has been adopted in some sub-fields, especially credit scoring, but has not displaced AUC as the standard metric. Flach, Hernandez-Orallo, and Ferri (2011) argued that AUC's implicit cost distribution can be interpreted coherently as long as one is careful about the population over which it averages.[19]

### Pitfalls in practice

Beyond the theoretical critiques, AUC has well-known practical failure modes:

- **Crossing curves.** Two classifiers with the same AUC can have very different ROC curves, including curves that cross. Neither is uniformly better and the choice depends on the operating point.
- **Sample-size dependence of variance.** The standard error of AUC is dominated by the smaller class, which is exactly the class one cares about under imbalance. AUC estimates from small minority-class samples can have wide confidence intervals.
- **Misuse with thresholded predictions.** If a classifier outputs only the hard label, the empirical ROC curve has one operating point in the interior, and AUC degenerates to balanced accuracy, (TPR + TNR) / 2, at that single threshold. Computing AUC on hard labels is almost always a mistake.
- **Train-test leakage.** AUC is particularly sensitive to leakage because it rewards any improvement in ranking. A feature that carries even a small amount of target information can lift AUC across the whole curve.
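The hard-label pitfall is easy to reproduce. The following sketch uses synthetic Gaussian scores and an arbitrary threshold of 1.0 (both assumptions for illustration) and shows that AUC computed on thresholded predictions equals balanced accuracy and discards ranking information that the raw scores carry.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

rng = np.random.default_rng(0)
y_true = (rng.uniform(size=5000) < 0.1).astype(int)   # about 10% positives
scores = rng.normal(loc=1.5 * y_true, scale=1.0)      # positives score higher on average

hard_labels = (scores > 1.0).astype(int)              # thresholded predictions

auc_scores = roc_auc_score(y_true, scores)            # uses the full ranking
auc_hard = roc_auc_score(y_true, hard_labels)         # one interior ROC point
bal_acc = balanced_accuracy_score(y_true, hard_labels)

print(f"AUC on scores: {auc_scores:.3f}")
print(f"AUC on labels: {auc_hard:.3f}  balanced accuracy: {bal_acc:.3f}")
```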

## Alternatives and complements

Many metrics measure aspects of classifier quality that AUC misses. Most production teams report a basket of metrics rather than rely on AUC alone.

| Metric | What it measures | Threshold-free? | Sensitive to imbalance? |
|---|---|---|---|
| [Accuracy](/wiki/accuracy) | Fraction of correct hard predictions | No | Yes (misleading on imbalance) |
| [Precision](/wiki/precision) | Fraction of predicted positives that are positive | No | Yes |
| [Recall](/wiki/recall) (sensitivity) | Fraction of true positives detected | No | Less |
| [F1 score](/wiki/f1_score) | Harmonic mean of precision and recall | No | Better than accuracy |
| F-beta score | Weighted harmonic mean with bias toward recall (beta>1) or precision (beta<1) | No | Configurable |
| [Matthews Correlation Coefficient (MCC)](/wiki/matthews_correlation_coefficient) | Correlation between predicted and true labels | No | Robust |
| [Brier score](/wiki/brier_score) | Mean squared error of predicted probabilities | Yes | Calibration-aware |
| [Log loss](/wiki/log_loss) | Negative log likelihood | Yes | Probability-aware |
| Expected Calibration Error (ECE) | Average gap between confidence and accuracy | Yes | Calibration only |
| Cohen's kappa | Agreement adjusted for chance | No | Moderate |
| AUC-PR / average precision | Area under PR curve | Yes | High |
| H-measure | Cost-coherent alternative to AUC | Yes | High |
| Top-k accuracy | Accuracy of the highest-confidence k predictions | No | Depends on k |

The Matthews Correlation Coefficient is the Pearson correlation between binary predicted and true labels and is robust to class imbalance. Chicco and Jurman (2020) argued in *BMC Genomics* that MCC should be preferred over F1 and accuracy for binary classification under imbalance.[26]
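The identity between MCC and the Pearson correlation of the two binary vectors can be confirmed on a toy example (the labels below are made up for illustration):

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

# Toy confusion matrix: TP=2, FN=1, FP=1, TN=6
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0, 0, 0, 0, 0])

mcc = matthews_corrcoef(y_true, y_pred)
pearson = np.corrcoef(y_true, y_pred)[0, 1]

# Closed form: (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)) = 11 / 21
print(mcc, pearson)
```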

The Brier score, introduced by Glenn W. Brier in 1950 for verifying weather forecasts, is the mean squared difference between predicted probabilities and binary outcomes.[29] It is a proper scoring rule, meaning the expected Brier score is minimized when predicted probabilities equal true probabilities, and it penalizes both poor ranking and poor calibration.[29] Log loss (binary cross-entropy) is the negative log likelihood of the true labels given the predicted probabilities and is the loss function for [logistic regression](/wiki/logistic_regression) and most binary classification [neural networks](/wiki/neural_network).[28] Log loss is more aggressive than Brier score in penalizing confidently wrong predictions.
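A small worked example makes the difference in penalties concrete. The four predictions below are invented; the last one is confidently wrong. Both scores are computed by hand and compared with scikit-learn's `brier_score_loss` and `log_loss`.

```python
import numpy as np
from sklearn.metrics import brier_score_loss, log_loss

y_true = np.array([1, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.6, 0.99])   # last prediction is confidently wrong

# Per-example penalties
brier_terms = (y_prob - y_true) ** 2
logloss_terms = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

brier_manual = brier_terms.mean()
logloss_manual = logloss_terms.mean()

brier_sklearn = brier_score_loss(y_true, y_prob)
logloss_sklearn = log_loss(y_true, y_prob)

print("Brier terms:   ", np.round(brier_terms, 3))
print("Log loss terms:", np.round(logloss_terms, 3))
```

The Brier penalty for a single example is bounded by 1, whereas the log loss term for the confidently wrong prediction is -ln(0.01), roughly 4.6, and grows without bound as the predicted probability approaches 1.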

## Practical guidance

### When to use AUC vs. accuracy vs. F1

The choice of metric should follow from the deployment context.

- Use **accuracy** only when classes are balanced, error costs are roughly equal, and a single threshold is committed in advance.
- Use **AUC-ROC** for general-purpose model comparison during development, especially when the operating threshold is not yet decided.
- Use **AUC-PR** when the positive class is rare or false positives are expensive.
- Use **F1 or F-beta** when you need a single number tied to a specific operating point.
- Use **log loss or Brier score** when the absolute probability values matter.
- Use **partial AUC** when only a low-FPR or high-TPR region is operationally relevant.

In many real applications the right answer is to report several metrics. AUC-ROC alone is fine for a benchmark leaderboard; it is rarely enough for a production decision.

### Common pitfalls

| Pitfall | Description | Solution |
|---|---|---|
| Confusing AUC with accuracy | AUC measures ranking, not correctness at a specific threshold | Use AUC for model comparison, threshold-dependent metrics for deployment |
| Ignoring class imbalance | AUC-ROC may be misleading for rare positive classes | Supplement with AUC-PR |
| Overfitting to AUC | Tuning exclusively for AUC may harm calibration | Monitor calibration metrics alongside AUC |
| Comparing AUC across datasets | AUC values are not comparable across different datasets | Compare models on the same data; use relative differences |
| Not reporting confidence intervals | A single AUC number hides uncertainty | Report bootstrap confidence intervals or DeLong CIs |
| Computing AUC on hard labels | Degenerates to balanced accuracy at a single threshold | Always use predicted scores or probabilities |
| Data leakage | Inflates AUC dramatically because ranking is easy to perturb | Audit features for target information and time-order violations |
| Wrong PR interpolation | Linear interpolation in PR space overestimates AUC-PR | Use average precision instead of trapezoidal area |
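For the confidence-interval row, a percentile bootstrap is one simple option. The helper below is a minimal sketch with a hypothetical name and defaults; it resamples cases with replacement and redraws any resample that lacks one of the classes. It does not implement DeLong's method.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for AUC-ROC, resampling cases with replacement."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    if np.unique(y_true).size < 2:
        raise ValueError("AUC needs both classes in y_true")
    rng = np.random.default_rng(seed)
    n = y_true.size
    aucs = []
    while len(aucs) < n_boot:
        idx = rng.integers(0, n, size=n)
        if np.unique(y_true[idx]).size < 2:
            continue   # AUC is undefined without both classes; redraw
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lower, upper = np.quantile(aucs, [alpha / 2, 1 - alpha / 2])
    return roc_auc_score(y_true, y_score), float(lower), float(upper)

# Synthetic example with roughly 40 positives out of 400 cases
rng = np.random.default_rng(1)
y = (rng.uniform(size=400) < 0.1).astype(int)
s = rng.normal(loc=1.0 * y, scale=1.0)

auc, lo, hi = bootstrap_auc_ci(y, s, n_boot=500)
print(f"AUC = {auc:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

With so few positives the interval is wide, which is the point of reporting it.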

### AUC in cross-validation

AUC is commonly used as the scoring metric in [cross-validation](/wiki/cross-validation) to select models and tune hyperparameters. Because it is threshold-independent, it avoids the need to choose a threshold during the model selection phase, which could otherwise bias the results. In scikit-learn, this is achieved by passing `scoring='roc_auc'` to cross-validation functions.

For heavily imbalanced data, stratified k-fold cross-validation should be used so that each fold has at least a few positive examples; otherwise the per-fold AUC can be undefined or have extremely high variance. Repeated stratified k-fold cross-validation is recommended for small datasets.
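A minimal sketch of this setup in scikit-learn follows. The synthetic dataset, the logistic regression model, and the choice of 5 folds with 3 repeats are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic imbalanced data (about 5% positives), for illustration only
X, y = make_classification(
    n_samples=1000, n_features=20, weights=[0.95, 0.05], random_state=0
)

# Stratification keeps the positive rate roughly constant in every fold
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, scoring="roc_auc", cv=cv
)

print(f"AUC-ROC: {scores.mean():.3f} +/- {scores.std():.3f} over {scores.size} folds")
```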

### Factors that affect AUC

The AUC score of a classifier is influenced by several factors:

- **Training data quality and quantity.** Noisy or insufficient data leads to poor class discrimination.
- **[Feature engineering](/wiki/feature_engineering) and [feature selection](/wiki/feature_selection).** Informative features improve separation.
- **Model choice.** [Gradient boosting](/wiki/gradient_boosting) methods like [XGBoost](/wiki/xgboost) and [LightGBM](/wiki/lightgbm) often achieve high AUC on tabular data, while [neural networks](/wiki/neural_network) may excel on unstructured data.
- **[Hyperparameter tuning](/wiki/hyperparameter_tuning).** [Regularization](/wiki/regularization) strength, learning rate, and tree depth all affect discriminative power.
- **Data leakage.** Inadvertent leakage of target information into the features can produce artificially high AUC scores that do not generalize to production.

## Implementations

### Python and scikit-learn

In [scikit-learn](/wiki/scikit_learn), AUC-ROC can be computed using `roc_auc_score` from `sklearn.metrics`. The function accepts true labels and either predicted probabilities or decision function scores.

```python
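from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.metrics import roc_auc_score, roc_curve

# Synthetic binary problem, scored in-sample for illustration only;
# y_scores holds the predicted probability of the positive class
X, y_true = make_classification(n_samples=500, random_state=0)
y_scores = LogisticRegression(max_iter=1000).fit(X, y_true).predict_proba(X)[:, 1]

# Binary AUC-ROC
auc = roc_auc_score(y_true, y_scores)

# Partial AUC up to FPR=0.1 (binary only, standardized)
pauc = roc_auc_score(y_true, y_scores, max_fpr=0.1)

# Average precision (AUC-PR)
ap = average_precision_score(y_true, y_scores)

# ROC and PR curves for plotting
fpr, tpr, thresholds_roc = roc_curve(y_true, y_scores)
precision, recall, thresholds_pr = precision_recall_curve(y_true, y_scores)

# Multiclass with averaging: scores must be class probabilities of
# shape (n_samples, n_classes), so a separate 3-class problem is used
X_mc, y_true_mc = make_classification(
    n_samples=500, n_classes=3, n_informative=4, random_state=0
)
y_proba_mc = LogisticRegression(max_iter=1000).fit(X_mc, y_true_mc).predict_proba(X_mc)
auc_macro = roc_auc_score(y_true_mc, y_proba_mc, multi_class='ovr', average='macro')
auc_weighted = roc_auc_score(y_true_mc, y_proba_mc, multi_class='ovo', average='weighted')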
```

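The `roc_auc_score` function uses the trapezoidal rule and supports `multi_class={'raise', 'ovr', 'ovo'}` and `average={'micro', 'macro', 'samples', 'weighted', None}`. In the multiclass case the scores must be per-class probability estimates of shape `(n_samples, n_classes)`. The `max_fpr` argument enables standardized partial AUC for binary problems.[30]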

### PyTorch and torchmetrics

The [torchmetrics](/wiki/torchmetrics) library provides GPU-accelerated AUC computation suitable for use inside [PyTorch](/wiki/pytorch) training loops:

```python
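import torch
from torchmetrics.classification import BinaryAUROC, BinaryAveragePrecision

preds = torch.tensor([0.1, 0.4, 0.35, 0.8])  # predicted probabilities (illustrative)
target = torch.tensor([0, 0, 1, 1])          # integer 0/1 labels
auroc = BinaryAUROC()
auroc.update(preds, target)  # call once per batch; state accumulates
value = auroc.compute()      # AUROC over all batches seen so far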
```

Multi-class and multi-label variants (`MulticlassAUROC`, `MultilabelAUROC`) follow the same pattern and take `num_classes` and `num_labels` arguments respectively. TensorFlow and Keras expose `tf.keras.metrics.AUC`, whose `curve` argument selects ROC AUC or PR AUC and whose `num_thresholds` argument sets the number of threshold buckets; the Keras implementation is an approximation over a fixed grid of thresholds rather than the exact rank-based computation.

### R packages

The R ecosystem has two dominant packages.

- **pROC.** Implements DeLong's test, bootstrap CIs, partial AUC, smoothing, and ROC curve comparison. Authored by Xavier Robin and colleagues, published in *BMC Bioinformatics* in 2011.[20] The standard reference in medical biostatistics.
- **ROCR.** An older package with strong support for plotting performance curves and extracting threshold-by-threshold metrics. Useful for exploration and pedagogy.

Both packages are commonly used alongside `caret` and `tidymodels` for model evaluation pipelines.

### Other languages and platforms

Most ML platforms expose AUC computation natively: MATLAB (`perfcurve`), Julia (`MLJ.jl`), Java (`Weka`), Spark MLlib (`BinaryClassificationMetrics`), H2O, and cloud-hosted training services (Google Vertex AI, AWS SageMaker, Azure ML). Implementations are essentially equivalent for the binary case; minor differences appear in tie handling and the choice of trapezoidal vs. step interpolation.

## Applications

AUC-ROC is the dominant evaluation metric in several fields, each of which has shaped how the metric is interpreted and reported.

### Medical diagnosis

Medical research adopted ROC analysis early and remains the largest single source of AUC results in the literature. In radiology, reader studies compare radiologists or AI models on tasks such as detecting lung nodules on CT or breast cancer on mammograms. The 2020 *Nature* paper by McKinney et al. on a Google Health breast cancer screening model reported AUC 0.889 on a UK dataset and AUC 0.881 on a US dataset, comparable to human radiologists.[27] Esteva et al.'s 2017 *Nature* paper on convolutional networks for skin cancer reported AUC 0.96 on melanoma classification, on par with board-certified dermatologists.[23] ECG-based deep learning models, such as the arrhythmia classifier reported by Hannun et al. in *Nature Medicine* in 2019, are routinely benchmarked by AUC against cardiologists.[25] Clinical risk scores such as HEART for chest pain triage and MELD for liver transplant prioritization are evaluated using AUC, typically with DeLong CIs.

Regulatory submissions to the US FDA for diagnostic AI devices commonly use AUC with DeLong-derived 95 percent confidence intervals as primary endpoints.

### Credit scoring and finance

Credit scoring has used a related quantity called the Gini coefficient since the 1950s. The Gini coefficient in credit risk is mathematically equivalent to $2 \cdot \mathrm{AUC} - 1$ and is sometimes called the Accuracy Ratio. For example, a scorecard with Gini 0.6 has AUC $(0.6 + 1)/2 = 0.8$.
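The conversion is a one-line affine map in each direction. The helper names below are hypothetical and exist only to make the arithmetic explicit:

```python
def gini_from_auc(auc: float) -> float:
    """Gini coefficient (Accuracy Ratio) corresponding to a given AUC-ROC."""
    return 2.0 * auc - 1.0

def auc_from_gini(gini: float) -> float:
    """AUC-ROC corresponding to a given Gini coefficient."""
    return (gini + 1.0) / 2.0

print(auc_from_gini(0.6))   # Gini 0.6 corresponds to AUC 0.8
print(gini_from_auc(0.5))   # a random ranker has Gini 0
```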

Basel Committee guidance on internal ratings-based models and most regulatory backtesting frameworks use AUC or Gini for discrimination assessment. Modern credit scoring models from companies like Experian, Equifax, FICO, and Upstart routinely report AUC on holdout populations. Fraud detection systems at major payment networks use AUC-PR alongside AUC-ROC because of extreme class imbalance (typically less than 0.1 percent of transactions are fraud).

### Bioinformatics

Protein function prediction, gene-disease association ranking, and drug-target interaction prediction rely heavily on AUC. The CAFA and DREAM challenges use AUC and AUC-PR as primary metrics. Whole-genome variant effect predictors such as REVEL, CADD, and PrimateAI are compared on AUC against curated benchmarks like ClinVar.

### Information retrieval and recommender systems

Classical IR moved beyond AUC in favor of position-aware metrics like NDCG, but AUC remains a useful summary in research papers and offline recommender system evaluation. Spotify, Netflix, and YouTube have published papers using AUC as one of several offline metrics for ranking models.

### Anomaly detection and security

Intrusion detection systems, malware classifiers, and anti-spam filters operate where false positives are extremely costly. They are usually evaluated with partial AUC at low FPR (commonly FPR less than 0.001) or with detection rate at a fixed FPR.
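Both quantities can be read off the empirical ROC curve. The sketch below uses synthetic Gaussian scores with 1,000 positives and 100,000 negatives (an assumption; a very low FPR target needs many negatives to be estimated at all) and a hypothetical helper `tpr_at_fpr`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def tpr_at_fpr(y_true, y_score, target_fpr):
    """Highest empirical TPR among operating points with FPR <= target_fpr."""
    fpr, tpr, _ = roc_curve(y_true, y_score, drop_intermediate=False)
    return float(tpr[fpr <= target_fpr].max())

rng = np.random.default_rng(0)
y = np.concatenate([np.zeros(100_000, dtype=int), np.ones(1_000, dtype=int)])
s = np.concatenate([rng.normal(0.0, 1.0, 100_000), rng.normal(3.0, 1.0, 1_000)])

detection_rate = tpr_at_fpr(y, s, target_fpr=0.001)   # detection rate at fixed FPR
pauc = roc_auc_score(y, s, max_fpr=0.001)             # standardized partial AUC

print(f"TPR at FPR<=0.001: {detection_rate:.3f}, standardized pAUC: {pauc:.3f}")
```

At FPR 0.001 only about 100 of the 100,000 negatives may be flagged, so the estimate rests on a small number of false positives and should be reported with an interval.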

### Object detection and segmentation

Object detection benchmarks use average precision (AUC-PR) rather than AUC-ROC because the negative class (background) is enormous and ill-defined. PASCAL VOC, COCO, and Open Images all report AP as their primary metric. Instance segmentation follows the same convention with mask AP, while semantic segmentation is usually scored with mean intersection-over-union (mIoU) instead.

## Visualizations

Three types of plot are standard companions to an AUC number.

- **ROC curve.** TPR vs. FPR with the diagonal random-baseline line drawn for reference. The AUC is the shaded area. Multiple models can be overlaid for comparison.
- **Precision-recall curve.** Precision vs. recall, with a horizontal baseline at the positive class prevalence. The AUC-PR is the shaded area. Stepwise interpolation is correct; smooth interpolation is misleading.
- **Calibration plot (reliability diagram).** Predicted probability binned on the x-axis vs. observed positive fraction on the y-axis. The diagonal is perfect calibration. Show this alongside AUC whenever absolute probabilities matter.

More advanced visualizations include the lift chart and gain chart (popular in marketing analytics), the cost curve (Drummond and Holte 2006), and the cumulative accuracy profile (used in credit scoring).[13]

## Explain like I'm 5 (ELI5)

AUC is a score that tells us how good a robot is at telling things apart. Say it has been trained to distinguish between cats and dogs. Imagine you have a pile of pictures, half cats and half dogs. You ask the robot to give each picture a "how-much-it-looks-like-a-cat" score. Then you randomly pick one cat picture and one dog picture. The AUC is the chance that the cat got a higher score than the dog. If the robot is perfect, the cat always wins, so AUC is 1. If the robot is guessing, the cat wins about half the time, so AUC is 0.5.

## See also

- [Confusion matrix](/wiki/confusion_matrix)
- [Precision](/wiki/precision) and [recall](/wiki/recall)
- [F1 score](/wiki/f1_score)
- [Sensitivity](/wiki/sensitivity) and [specificity](/wiki/specificity)
- [Brier score](/wiki/brier_score)
- [Log loss](/wiki/log_loss)
- [Matthews correlation coefficient](/wiki/matthews_correlation_coefficient)
- [Evaluation metrics](/wiki/evaluation_metrics)
- [Imbalanced data](/wiki/imbalanced_data)
- [Calibration](/wiki/calibration)
- [Cross-validation](/wiki/cross-validation)
- [Signal detection theory](/wiki/signal_detection_theory)
- [Wilcoxon rank-sum test](/wiki/wilcoxon_rank_sum_test)
- [Mann-Whitney U test](/wiki/mann_whitney_u_test)
- [c-index](/wiki/c_index)
- [NDCG](/wiki/ndcg)
- [Average precision](/wiki/average_precision)

## References

1. Peterson, W.W., Birdsall, T.G., and Fox, W.C. (1954). "The theory of signal detectability." *Transactions of the IRE Professional Group on Information Theory*, 4(4), 171-212.
2. Green, D.M. and Swets, J.A. (1966). *Signal Detection Theory and Psychophysics*. New York: Wiley.
3. Lusted, L.B. (1971). "Signal detectability and medical decision-making." *Science*, 171(3977), 1217-1219.
4. Hanley, J.A. and McNeil, B.J. (1982). "The meaning and use of the area under a receiver operating characteristic (ROC) curve." *Radiology*, 143(1), 29-36.
5. Hanley, J.A. and McNeil, B.J. (1983). "A method of comparing the areas under receiver operating characteristic curves derived from the same cases." *Radiology*, 148(3), 839-843.
6. Harrell, F.E., Califf, R.M., Pryor, D.B., Lee, K.L., and Rosati, R.A. (1982). "Evaluating the yield of medical tests." *JAMA*, 247(18), 2543-2546.
7. DeLong, E.R., DeLong, D.M., and Clarke-Pearson, D.L. (1988). "Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach." *Biometrics*, 44(3), 837-845.
8. Swets, J.A. (1988). "Measuring the accuracy of diagnostic systems." *Science*, 240(4857), 1285-1293.
9. McClish, D.K. (1989). "Analyzing a portion of the ROC curve." *Medical Decision Making*, 9(3), 190-195.
10. Bradley, A.P. (1997). "The use of the area under the ROC curve in the evaluation of machine learning algorithms." *Pattern Recognition*, 30(7), 1145-1159.
11. Provost, F. and Fawcett, T. (2001). "Robust Classification for Imprecise Environments." *Machine Learning*, 42(3), 203-231.
12. Hand, D.J. and Till, R.J. (2001). "A Simple Generalisation of the Area Under the ROC Curve for Multiple Class Classification Problems." *Machine Learning*, 45(2), 171-186.
13. Drummond, C. and Holte, R.C. (2006). "Cost curves: An improved method for visualizing classifier performance." *Machine Learning*, 65(1), 95-130.
14. Fawcett, T. (2006). "An introduction to ROC analysis." *Pattern Recognition Letters*, 27(8), 861-874.
15. Davis, J. and Goadrich, M. (2006). "The relationship between Precision-Recall and ROC curves." *Proceedings of the 23rd International Conference on Machine Learning (ICML)*, 233-240.
16. Lobo, J.M., Jimenez-Valverde, A., and Real, R. (2008). "AUC: a misleading measure of the performance of predictive distribution models." *Global Ecology and Biogeography*, 17(2), 145-151.
17. Hand, D.J. (2009). "Measuring classifier performance: a coherent alternative to the area under the ROC curve." *Machine Learning*, 77(1), 103-123.
18. Rendle, S., Freudenthaler, C., Gantner, Z., and Schmidt-Thieme, L. (2009). "BPR: Bayesian Personalized Ranking from implicit feedback." *UAI 2009*.
19. Flach, P., Hernandez-Orallo, J., and Ferri, C. (2011). "A coherent interpretation of AUC as a measure of aggregated classification performance." *Proceedings of ICML 2011*.
20. Robin, X., Turck, N., Hainard, A., Tiberti, N., Lisacek, F., Sanchez, J.C., and Muller, M. (2011). "pROC: an open-source package for R and S+ to analyze and compare ROC curves." *BMC Bioinformatics*, 12, 77.
21. Sun, X. and Xu, W. (2014). "Fast implementation of DeLong's algorithm for comparing the areas under correlated receiver operating characteristic curves." *IEEE Signal Processing Letters*, 21(11), 1389-1393.
22. Saito, T. and Rehmsmeier, M. (2015). "The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets." *PLOS ONE*, 10(3), e0118432.
23. Esteva, A., Kuprel, B., Novoa, R.A., Ko, J., Swetter, S.M., Blau, H.M., and Thrun, S. (2017). "Dermatologist-level classification of skin cancer with deep neural networks." *Nature*, 542(7639), 115-118.
24. Bandos, A.I., Guo, B., and Gur, D. (2017). "Estimating the area under ROC curve when the fitted binormal curves demonstrate improper shape." *Academic Radiology*, 24(2), 209-219.
25. Hannun, A.Y., et al. (2019). "Cardiologist-level arrhythmia detection and classification in ambulatory electrocardiograms using a deep neural network." *Nature Medicine*, 25(1), 65-69.
26. Chicco, D. and Jurman, G. (2020). "The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation." *BMC Genomics*, 21(1), 6.
27. McKinney, S.M., et al. (2020). "International evaluation of an AI system for breast cancer screening." *Nature*, 577(7788), 89-94.
28. Hastie, T., Tibshirani, R., and Friedman, J. (2009). *The Elements of Statistical Learning: Data Mining, Inference, and Prediction*, 2nd edition. Springer. Section 9.2.5 and Chapter 18.
29. Brier, G.W. (1950). "Verification of forecasts expressed in terms of probability." *Monthly Weather Review*, 78(1), 1-3.
30. Scikit-learn developers. "sklearn.metrics.roc_auc_score." Scikit-learn documentation, accessed 2024.

