# AUC (Area Under the ROC Curve)

> Source: https://aiwiki.ai/wiki/auc_area_under_the_roc_curve
> Updated: 2026-07-12
> Categories: Machine Learning, Model Evaluation
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**AUC (Area Under the ROC Curve)** is a classifier evaluation metric equal to the probability that a model ranks a randomly chosen positive instance higher than a randomly chosen negative instance.[1] It collapses the entire two-dimensional area beneath the [ROC (Receiver Operating Characteristic) curve](/wiki/roc_receiver_operating_characteristic_curve) into a single scalar between 0 and 1 that summarizes a [classification model's](/wiki/classification_model) ability to separate positive from negative cases across all possible [classification thresholds](/wiki/classification_threshold). An AUC of 1.0 denotes a perfect ranker, 0.5 denotes random guessing, and values below 0.5 mean the model ranks negatives above positives more often than not, so inverting its scores would improve it. Because AUC is both threshold-invariant (it does not depend on any single cutoff) and scale-invariant (it depends only on the ordering of the scores), it is one of the most widely used metrics in [machine learning](/wiki/machine_learning) for model comparison and selection in domains such as medical diagnostics, credit scoring, fraud detection, and information retrieval.

The probabilistic definition traces to James Hanley and Barbara McNeil, who in their seminal 1982 *Radiology* paper described the area as "the probability that a randomly chosen diseased subject is (correctly) rated or ranked with greater suspicion than a randomly chosen non-diseased subject."[1]

## ELI5 (Explain Like I'm 5)

Imagine you have a basket of red balls and blue balls all mixed together. You ask a robot to sort them: red balls on the left, blue balls on the right. The robot isn't perfect, so sometimes it puts a blue ball on the red side or a red ball on the blue side.

AUC is a score that tells you how good the robot is at sorting. Suppose the robot gives every ball a "how red does this look?" score. If you pick one red ball and one blue ball at random, AUC is the chance that the robot gave the red ball the higher score. A score of 1.0 means the robot always scores red balls higher than blue ones. A score of 0.5 means the robot is just guessing randomly, like flipping a coin. The higher the AUC, the better the robot is at telling the two groups apart.

## How does AUC relate to the ROC curve?

The [ROC curve](/wiki/roc_receiver_operating_characteristic_curve) is a graphical plot that illustrates the diagnostic ability of a [binary classification](/wiki/binary_classification) system as its discrimination threshold varies. It plots the **True Positive Rate (TPR)** on the y-axis against the **False Positive Rate (FPR)** on the x-axis at every possible threshold setting. AUC is the single number that summarizes that curve: it is literally the area underneath it. For the full mathematics, geometry, and history of the curve itself, see the dedicated [ROC (Receiver Operating Characteristic) curve](/wiki/roc_receiver_operating_characteristic_curve) page; this article focuses on the area metric.

| Metric | Formula | Also Known As |
|---|---|---|
| True Positive Rate (TPR) | $$\text{TP} / (\text{TP} + \text{FN})$$ | Sensitivity, [Recall](/wiki/recall), Hit Rate |
| False Positive Rate (FPR) | $$\text{FP} / (\text{FP} + \text{TN})$$ | 1 - Specificity, Fall-out |
| True Negative Rate (TNR) | $$\text{TN} / (\text{TN} + \text{FP})$$ | Specificity |
| False Negative Rate (FNR) | $$\text{FN} / (\text{FN} + \text{TP})$$ | Miss Rate |

Here, TP, FP, TN, and FN refer to values from the [confusion matrix](/wiki/confusion_matrix): True Positives, False Positives, True Negatives, and False Negatives, respectively.

A perfect classifier produces a ROC curve that passes through the upper-left corner of the plot (the point where FPR = 0 and TPR = 1), giving an AUC of 1.0, meaning it achieves 100% sensitivity with zero false positives. A random classifier, by contrast, produces a diagonal line from (0, 0) to (1, 1) with an AUC of 0.5.[5] The metric was popularized in machine learning in part by Tom Fawcett's influential 2006 tutorial "An introduction to ROC analysis," which has been cited more than 20,000 times.[5]

## What do AUC values mean?

The AUC represents the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance.[1] This probabilistic interpretation gives every AUC value a concrete meaning that does not depend on any particular threshold.

| AUC Range | Classification Quality | Interpretation |
|---|---|---|
| 0.90 - 1.00 | Excellent | Outstanding discrimination; the model almost always ranks positives above negatives |
| 0.80 - 0.90 | Good | Strong discrimination; suitable for most practical applications |
| 0.70 - 0.80 | Fair | Acceptable discrimination; may need improvement for critical applications |
| 0.60 - 0.70 | Poor | Weak discrimination; the model struggles to separate classes |
| 0.50 | No Discrimination | Random guessing; the model provides no useful information |
| Below 0.50 | Inverse Predictions | Worse than random; reversing predictions would improve performance |

For example, an AUC of 0.85 means that if you randomly select one positive and one negative example, there is an 85% probability that the model assigns a higher predicted score to the positive example.

These guideline ranges should be interpreted with caution. What counts as a "good" AUC depends heavily on the application domain. In medical diagnostics, regulatory standards may require AUC values above 0.90 for screening tests. In some social science applications, an AUC of 0.70 may represent a meaningful advance over prior methods.

### AUC and the Gini Coefficient

In credit scoring and some other financial domains, a related metric called the Gini coefficient is used. The Gini coefficient is computed from AUC as:

$$
\text{Gini} = 2 \times \text{AUC} - 1
$$

This rescales the AUC so that random performance corresponds to a Gini of 0 and perfect performance corresponds to a Gini of 1. Note that this Gini coefficient is distinct from the Gini impurity used in [decision trees](/wiki/decision_tree) and from the Gini index used in economics to measure income inequality.
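The rescaling can be checked with a few lines of Python. This is a minimal sketch; the helper name `gini_from_auc` is introduced here only for illustration.

```python
def gini_from_auc(auc: float) -> float:
    """Rescale an AUC in [0, 1] to the credit-scoring Gini coefficient in [-1, 1]."""
    return 2.0 * auc - 1.0


for auc in (0.5, 0.75, 1.0):
    print(f"AUC = {auc:.2f} -> Gini = {gini_from_auc(auc):.2f}")
```

An AUC below 0.5 maps to a negative Gini, which signals the same inverted ranking described in the table above.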

### Does a high AUC mean the model is well-calibrated?

No. An important caveat is that AUC measures only the ranking quality of a model, not whether its predicted probabilities are well-calibrated. A model can achieve a perfect AUC of 1.0 while producing wildly inaccurate probability estimates. For instance, a model that assigns a probability of 0.99 to every positive example and 0.01 to every negative example achieves AUC = 1.0, but so does a model assigning 0.51 and 0.49 respectively. If calibrated probability estimates matter (for example, when predicted probabilities are used for downstream decision-making or risk stratification), practitioners should supplement AUC with calibration metrics such as the Brier score, expected calibration error, or calibration curves.
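The 0.99/0.01 versus 0.51/0.49 example can be reproduced with scikit-learn. The sketch below uses a tiny hand-made label vector and two hypothetical models (named `confident` and `timid` purely for illustration) that rank the examples identically but output very different probabilities.

```python
import numpy as np
from sklearn.metrics import brier_score_loss, roc_auc_score

y_true = np.array([0, 0, 0, 1, 1, 1])

# Two hypothetical models: same ranking, very different probability values.
confident = np.where(y_true == 1, 0.99, 0.01)
timid = np.where(y_true == 1, 0.51, 0.49)

for name, probs in [("confident", confident), ("timid", timid)]:
    auc = roc_auc_score(y_true, probs)       # depends only on the ordering
    brier = brier_score_loss(y_true, probs)  # depends on the probability values
    print(f"{name}: AUC = {auc:.2f}, Brier = {brier:.4f}")
```

Both models obtain the same AUC, while the Brier score separates them, which is why a calibration metric is needed alongside AUC when the probabilities themselves are used.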

## How is AUC calculated?

There are several mathematically equivalent approaches for computing AUC, each offering a different perspective on the metric.

### Trapezoidal Rule

The most common computational method is the trapezoidal rule. After sorting all predicted scores and computing TPR and FPR at each distinct threshold, the area under the resulting piecewise-linear ROC curve is computed by summing the areas of trapezoids formed between consecutive operating points:

$$
\text{AUC} = \sum_i (\text{FPR}_{i+1} - \text{FPR}_i) \times \frac{\text{TPR}_i + \text{TPR}_{i+1}}{2}
$$

This is the method used by scikit-learn's `roc_auc_score` function.[9] It is straightforward to implement and efficient to compute, with time complexity $$O(n \log n)$$ dominated by the sorting step, where $$n$$ is the number of predictions.
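As an illustration of the formula, the following NumPy sketch builds the empirical ROC points and sums the trapezoids. It is not scikit-learn's implementation; the helper names `roc_points` and `trapezoid_auc` are hypothetical, and the code assumes labels coded as 0/1 with at least one example of each class.

```python
import numpy as np


def roc_points(y_true, y_score):
    """Return (fpr, tpr) at every distinct score threshold, starting at (0, 0)."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    order = np.argsort(-y_score, kind="mergesort")  # descending scores
    y_true, y_score = y_true[order], y_score[order]
    # Keep the last index of each run of tied scores so ties share one ROC point.
    last_of_run = np.r_[np.where(np.diff(y_score))[0], y_true.size - 1]
    tps = np.cumsum(y_true)[last_of_run]  # true positives at each threshold
    fps = (1 + last_of_run) - tps         # false positives at each threshold
    tpr = np.r_[0.0, tps / tps[-1]]
    fpr = np.r_[0.0, fps / fps[-1]]
    return fpr, tpr


def trapezoid_auc(fpr, tpr):
    """Sum the trapezoid areas between consecutive ROC points."""
    return float(np.sum(np.diff(fpr) * (tpr[:-1] + tpr[1:]) / 2.0))


fpr, tpr = roc_points([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(trapezoid_auc(fpr, tpr))
```

The sort dominates the running time, which is where the $$O(n \log n)$$ cost comes from; everything after it is a single pass over the sorted scores.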

### Mann-Whitney U Statistic

Hanley and McNeil (1982) showed that the AUC is equivalent to the normalized Mann-Whitney U statistic (closely related to the Wilcoxon rank-sum statistic).[1] Given a set of positive examples $$P$$ and negative examples $$N$$, the AUC equals the proportion of all (positive, negative) pairs in which the classifier assigns a higher score to the positive example, with tied pairs counted as half:

$$
\text{AUC} = \frac{\text{number of concordant pairs} + 0.5 \times \text{number of tied pairs}}{|P| \times |N|}
$$

A pair is concordant when the positive example receives a higher predicted score than the negative example. This equivalence gives AUC a natural non-parametric statistical interpretation and connects it to classical hypothesis testing. With ties counted as 0.5, the pairwise formulation yields the same value as the trapezoidal rule applied to the empirical ROC curve, because a group of tied scores produces a diagonal segment whose area credits exactly half of the tied pairs.
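The pair-counting view translates directly into code. The sketch below is a brute-force illustration (the function name `pairwise_auc` is hypothetical) that compares every positive with every negative and counts ties as 0.5; it needs memory proportional to $$|P| \times |N|$$, so it is meant for small examples rather than production use.

```python
import numpy as np


def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]  # one entry per (positive, negative) pair
    concordant = np.sum(diff > 0)
    tied = np.sum(diff == 0)
    return float((concordant + 0.5 * tied) / (pos.size * neg.size))


# 3 of the 4 pairs are concordant.
print(pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
# 3 concordant pairs plus 1 tied pair out of 4.
print(pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.4, 0.8]))
```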

### Integration Under the ROC Curve

Formally, AUC can be expressed as the integral:

$$
\text{AUC} = \int_0^1 \text{TPR}(\text{FPR}^{-1}(x)) \, dx
$$

where TPR is viewed as a function of FPR through the varying threshold parameter. In practice, this integral is approximated using the trapezoidal rule on the empirical ROC curve.

## Confidence Intervals for AUC

Reporting a point estimate of AUC without uncertainty information can be misleading. Confidence intervals quantify the precision of the AUC estimate and are essential for rigorous model evaluation.

| Method | Assumptions | Strengths | Weaknesses |
|---|---|---|---|
| Hanley-McNeil (1982) | Continuous scores, no ties | Closed-form, fast to compute | Underestimates variance near AUC = 0 or 1 |
| DeLong (1988) | Nonparametric | No distributional assumptions, supports two-AUC comparison | More complex to implement |
| Bootstrap | Nonparametric | Best coverage for extreme AUC values, flexible | Computationally expensive |
| Logit-transformed | Nonparametric | Better coverage near boundaries than Hanley-McNeil | Less widely implemented |

### Hanley-McNeil Method

Hanley and McNeil (1982) derived a closed-form variance estimator for AUC whose auxiliary terms are approximated under a negative-exponential model for the scores.[1] Their formula expresses the variance of AUC in terms of the AUC value itself, the number of positive cases ($$n_p$$), and the number of negative cases ($$n_n$$). The approach is computationally efficient, but it may underestimate variance for extreme AUC values (near 0 or 1) and assumes continuous test scores with no ties.
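A minimal sketch of the closed-form estimate follows, using the approximations $$Q_1 = \text{AUC} / (2 - \text{AUC})$$ and $$Q_2 = 2\,\text{AUC}^2 / (1 + \text{AUC})$$ from the 1982 paper. The function name `hanley_mcneil_se`, the example sample sizes, and the normal-approximation (Wald) interval are choices made for this illustration.

```python
import math


def hanley_mcneil_se(auc: float, n_pos: int, n_neg: int) -> float:
    """Standard error of an AUC estimate from the Hanley-McNeil closed form."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc**2 / (1.0 + auc)
    variance = (
        auc * (1.0 - auc)
        + (n_pos - 1) * (q1 - auc**2)
        + (n_neg - 1) * (q2 - auc**2)
    ) / (n_pos * n_neg)
    return math.sqrt(variance)


# Hypothetical study: AUC of 0.85 from 50 positive and 50 negative cases.
auc, n_pos, n_neg = 0.85, 50, 50
se = hanley_mcneil_se(auc, n_pos, n_neg)
# Wald interval, clipped to the valid [0, 1] range.
lower, upper = max(0.0, auc - 1.96 * se), min(1.0, auc + 1.96 * se)
print(f"SE = {se:.4f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

The clipping step hints at the weakness near the boundaries: a symmetric interval around an AUC close to 1 can extend past 1, which is one reason bootstrap or logit-transformed intervals are preferred there.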

### Bootstrap Method

Bootstrap resampling provides a nonparametric alternative for constructing confidence intervals. The procedure repeatedly resamples the dataset with replacement, computes AUC on each bootstrap sample, and uses the distribution of bootstrap AUC values to derive percentile-based or bias-corrected confidence intervals. Bootstrap methods tend to provide better coverage than the Hanley-McNeil formula, particularly for extreme AUC values. A typical implementation uses 1,000 to 10,000 bootstrap iterations.
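The procedure can be sketched as a percentile bootstrap. The function name `bootstrap_auc_ci`, the synthetic data, and the choice to skip resamples that contain only one class are assumptions of this example, not a prescribed method.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def bootstrap_auc_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for AUC."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    n = y_true.size
    aucs = []
    while len(aucs) < n_boot:
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        if np.unique(y_true[idx]).size < 2:
            continue  # AUC is undefined when a resample has only one class
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lower, upper = np.quantile(aucs, [alpha / 2, 1 - alpha / 2])
    return float(lower), float(upper)


# Synthetic scores: positives are shifted upward by one unit on average.
rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=300)
y_score = y_true + rng.normal(scale=1.0, size=300)

point = roc_auc_score(y_true, y_score)
lower, upper = bootstrap_auc_ci(y_true, y_score)
print(f"AUC = {point:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

Resampling whole rows keeps each label paired with its score. When the positive class is very small, a stratified variant that resamples positives and negatives separately may be preferable.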

### DeLong's Variance Estimator

DeLong et al. (1988) proposed a nonparametric variance estimator based on the theory of U-statistics.[2] This method computes structural components called "placement values" for each observation and uses them to estimate variance without distributional assumptions. It is more robust than the Hanley-McNeil approach and forms the basis of DeLong's test for comparing two AUCs.[2]

## How does AUC differ from accuracy?

AUC and [accuracy](/wiki/accuracy) are both used to evaluate classifiers, but they differ in important ways.

| Property | AUC | [Accuracy](/wiki/accuracy) |
|---|---|---|
| Threshold dependence | Threshold-independent; evaluates all thresholds | Requires a fixed threshold |
| Scale sensitivity | Scale-invariant; cares about ranking, not calibration | Depends on which side of the threshold each predicted score falls |
| Class imbalance | Less affected by moderate imbalance | Can be misleading with imbalanced classes |
| Interpretation | Probability of correct ranking | Proportion of correct predictions |
| Use case | Model comparison and selection | Final performance reporting at a chosen threshold |

Because AUC evaluates the model's ranking quality across all thresholds, it is particularly useful during model development and comparison. [Accuracy](/wiki/accuracy), on the other hand, depends on a single chosen threshold and can be misleading when classes are imbalanced. For instance, in a dataset where 95% of examples are negative, a trivial model predicting all negatives achieves 95% accuracy but an AUC of only 0.5.
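The 95%-negative example can be verified directly. In this sketch the "trivial model" is represented by a constant score of 0 for every example, thresholded at 0.5.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# 95 negatives and 5 positives.
y_true = np.array([0] * 95 + [1] * 5)

# Trivial model: the same score for everyone, so every prediction is negative.
constant_score = np.zeros(100)
y_pred = (constant_score >= 0.5).astype(int)

accuracy = accuracy_score(y_true, y_pred)
auc = roc_auc_score(y_true, constant_score)
print(f"accuracy = {accuracy:.2f}, AUC = {auc:.2f}")
```

Because all scores are tied, every (positive, negative) pair counts as half, and the AUC falls to the random baseline even though accuracy looks high.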

## Is AUC reliable on imbalanced datasets?

While AUC is generally robust to moderate class imbalance, it can give an overly optimistic view of model performance when datasets are severely imbalanced. This happens because the FPR, which forms the x-axis of the ROC curve, has the total number of negatives (FP + TN) in its denominator. When the negative class heavily outnumbers the positive class, even a large absolute number of false positives translates to a small FPR, so the ROC curve stays close to the left edge of the plot and the area under it remains high.

For example, a model evaluated on a dataset with a 1:100 positive-to-negative ratio might show an AUC of 0.95 while still missing many positive examples in absolute terms. Researchers have documented cases where a model achieves an AUC-ROC of 0.957 but only a PR-AUC of 0.708, highlighting a substantial gap between the two metrics.[6]

### What is the difference between AUC-ROC and PR-AUC?

The **Precision-Recall AUC (PR-AUC)** computes the area under the [precision](/wiki/precision)-[recall](/wiki/recall) curve. Because neither [precision](/wiki/precision) nor [recall](/wiki/recall) uses true negatives, PR-AUC is not inflated by a large pool of correctly rejected negatives, although it does depend on class prevalence through the number of false positives.[6] This makes it a more informative metric when:

- The positive class is rare (e.g., disease detection, fraud detection)
- False positives are costly relative to false negatives
- The focus is on the model's ability to correctly identify positive instances

| Metric | What It Measures | Baseline for Random Classifier | Best For |
|---|---|---|---|
| AUC-ROC | Discrimination across all thresholds | 0.5 (always) | Balanced or moderately imbalanced datasets |
| PR-AUC | [Precision](/wiki/precision)-[recall](/wiki/recall) trade-off | Equals the positive class prevalence | Severely imbalanced datasets |

A key practical difference is the random baseline: AUC-ROC is always 0.5 for a random classifier regardless of class balance, whereas the PR-AUC baseline equals the positive class prevalence, so on a dataset that is 1% positive a random model scores a PR-AUC of about 0.01. This is why PR-AUC scores cannot be compared across datasets with different prevalence without context.
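The two baselines can be illustrated by scoring a synthetic dataset with random numbers. This sketch assumes a prevalence of roughly 1% and uses scikit-learn's `average_precision_score` as the PR-AUC estimate.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
n = 200_000
y_true = (rng.random(n) < 0.01).astype(int)  # roughly 1% positives
random_score = rng.random(n)                 # scores unrelated to the labels

prevalence = y_true.mean()
roc = roc_auc_score(y_true, random_score)
pr = average_precision_score(y_true, random_score)
print(f"prevalence = {prevalence:.4f}, AUC-ROC = {roc:.3f}, PR-AUC = {pr:.4f}")
```

The AUC-ROC lands near 0.5 while the PR-AUC lands near the prevalence, so a PR-AUC of 0.30 on this data would be a large improvement over chance even though the number looks low in isolation.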

Recent research presents a more nuanced picture. Saito and Rehmsmeier (2015) argued that PR curves are more informative for imbalanced data, but a 2024 study by Yang et al. found that AUPRC can unduly favor model improvements in subpopulations with more frequent positive labels.[8][10] Meanwhile, Cook (2024) argued that the ROC curve does accurately assess imbalanced datasets when properly understood.[11] Practitioners should consider both metrics together for a complete picture rather than treating one as universally superior.

## What is Partial AUC (pAUC)?

Sometimes the full ROC curve is not equally relevant across all FPR values. In medical screening, for example, only very low false positive rates may be clinically acceptable. **Partial AUC (pAUC)** restricts the computation to a specific region of the ROC space.

### Types of Partial AUC

| Variant | Description | Use Case |
|---|---|---|
| FPR-constrained pAUC | Area under the ROC curve where FPR is within [FPR_low, FPR_high] | When a maximum FPR is clinically mandated |
| TPR-constrained pAUC | Area under the ROC curve where TPR exceeds a minimum threshold | When a minimum sensitivity is required |
| Two-way pAUC | Area restricted by both FPR and TPR bounds simultaneously | When both sensitivity and specificity constraints apply |

To make partial AUC values comparable across different FPR ranges, McClish (1989) proposed a standardized partial AUC that normalizes the value to a [0.5, 1] scale, where 0.5 represents random performance and 1 represents perfect performance within the specified region.[3] The standardization formula is:

$$
\text{pAUC}_{\text{standardized}} = 0.5 \times \left(1 + \frac{\text{pAUC} - \text{pAUC}_{\text{min}}}{\text{pAUC}_{\text{max}} - \text{pAUC}_{\text{min}}}\right)
$$

The **Ratio of Relevant Areas (RRA)** is another normalization approach that divides the pAUC by the area of the region of interest.

In scikit-learn, partial AUC can be computed by setting the `max_fpr` parameter in `roc_auc_score`. The returned value is the standardized partial AUC, as proposed by McClish, computed over the range [0, `max_fpr`] with the bounding areas defined as `min_area = 0.5 * max_fpr**2` and `max_area = max_fpr`.[3][9]
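To make the standardization concrete, the sketch below recomputes the value by hand and compares it with `roc_auc_score(..., max_fpr=...)`. The helper name `standardized_pauc` and the synthetic data are assumptions of this example, and the helper is written for `0 < max_fpr < 1` only.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve


def standardized_pauc(y_true, y_score, max_fpr):
    """McClish-standardized partial AUC over FPR in [0, max_fpr], for 0 < max_fpr < 1."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    # Cut the curve at max_fpr, interpolating TPR linearly at the boundary.
    stop = np.searchsorted(fpr, max_fpr, side="right")
    tpr_at_cut = np.interp(max_fpr, fpr[stop - 1:stop + 1], tpr[stop - 1:stop + 1])
    fpr_cut = np.append(fpr[:stop], max_fpr)
    tpr_cut = np.append(tpr[:stop], tpr_at_cut)
    pauc = np.sum(np.diff(fpr_cut) * (tpr_cut[:-1] + tpr_cut[1:]) / 2.0)
    min_area = 0.5 * max_fpr**2  # area under the chance diagonal in the region
    max_area = max_fpr           # area of a perfect classifier in the region
    return float(0.5 * (1.0 + (pauc - min_area) / (max_area - min_area)))


rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)
y_score = y_true + rng.normal(size=500)

manual = standardized_pauc(y_true, y_score, max_fpr=0.2)
library = roc_auc_score(y_true, y_score, max_fpr=0.2)
print(f"manual = {manual:.6f}, scikit-learn = {library:.6f}")
```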

## How is AUC extended to multi-class problems?

While AUC is originally defined for [binary classification](/wiki/binary_classification), it can be extended to [multi-class classification](/wiki/multi-class_classification) problems using two main strategies.

### One-vs-Rest (OvR)

In the OvR approach, the AUC is computed for each class against all other classes combined. Each class gets its own ROC curve, and the per-class AUC values are then aggregated. This method can be affected by class imbalance because the "rest" group composition changes for each class, creating artificial imbalance even when the original classes are balanced.

### One-vs-One (OvO)

In the OvO approach, introduced by Hand and Till (2001), AUC is computed for every unique pair of classes.[4] For $$k$$ classes, this produces $$k(k-1)/2$$ pairwise AUC values that are then averaged. Hand and Till showed that when macro averaging is used, the OvO approach is insensitive to class distribution, similar to the binary AUC case.[4] This makes OvO the preferred method when class imbalance is a concern.

### Averaging Strategies

| Averaging Method | Description | When to Use |
|---|---|---|
| Macro | Unweighted mean of per-class AUCs | When all classes are equally important |
| Weighted | Weighted mean using class frequencies as weights | When larger classes should contribute more |
| Micro | Computed globally by treating each element of the label indicator matrix as a separate binary prediction | When individual observation accuracy matters more than per-class accuracy |

In scikit-learn, multi-class AUC is computed using `roc_auc_score` with the `multi_class` parameter set to `'ovr'` or `'ovo'` and the `average` parameter set to `'macro'`, `'weighted'`, or `'micro'` (micro averaging is only supported for OvR).[9]
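The OvR macro average can be reproduced by hand as the unweighted mean of one binary AUC per class. The sketch below uses the iris dataset and a logistic regression purely as a convenient example, and it scores the training data, so the values are optimistic.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X, y = load_iris(return_X_y=True)
proba = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)

# One-vs-rest: treat class k as positive and every other class as negative.
per_class = [roc_auc_score((y == k).astype(int), proba[:, k]) for k in np.unique(y)]
manual_macro = float(np.mean(per_class))

library_macro = roc_auc_score(y, proba, multi_class="ovr", average="macro")
library_ovo = roc_auc_score(y, proba, multi_class="ovo", average="macro")
print(f"manual OvR = {manual_macro:.4f}, OvR = {library_macro:.4f}, OvO = {library_ovo:.4f}")
```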

## How do you test whether two AUCs differ significantly?

When comparing two classifiers, observing a higher AUC for one model does not necessarily mean it is significantly better. **DeLong's test** (DeLong et al., 1988) provides a rigorous statistical framework for determining whether the difference between two correlated AUC values is statistically significant.[2]

### How DeLong's Test Works

DeLong's test is a non-parametric method that leverages the connection between AUC and the Mann-Whitney U statistic. It computes a z-score from the difference between two AUC values, accounting for the correlation between them (since both models are evaluated on the same dataset). The steps are:

1. Compute the empirical AUC for each model using the trapezoidal rule.
2. Calculate structural components (placement values for each observation).
3. Estimate the variance and covariance of the two AUC estimates.
4. Compute the z-statistic: $$z = \frac{\text{AUC}_1 - \text{AUC}_2}{\sqrt{\text{Var}_1 + \text{Var}_2 - 2 \times \text{Cov}}}$$
5. Derive a p-value from the standard normal distribution.

An absolute z-score above 1.96 (corresponding to a two-sided p < 0.05) suggests a statistically significant difference between the two AUCs at the 5% level.
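The steps above can be sketched in NumPy as follows. This is a brute-force illustration that builds the full positive-by-negative comparison matrix, not the fast algorithm of Sun and Xu; the function names `placement_values` and `delong_test` and the synthetic scores are assumptions of this example, and the code does not guard against two identical score vectors (zero variance of the difference).

```python
import math

import numpy as np


def placement_values(pos, neg):
    """Return the structural components: one value per positive, one per negative."""
    diff = pos[:, None] - neg[None, :]
    psi = (diff > 0) + 0.5 * (diff == 0)  # 1 if concordant, 0.5 if tied, else 0
    return psi.mean(axis=1), psi.mean(axis=0)


def delong_test(y_true, score_a, score_b):
    """Two-sided DeLong test for two models scored on the same examples."""
    y_true = np.asarray(y_true)
    v10, v01, aucs = [], [], []
    for score in (score_a, score_b):
        score = np.asarray(score, dtype=float)
        per_pos, per_neg = placement_values(score[y_true == 1], score[y_true == 0])
        v10.append(per_pos)
        v01.append(per_neg)
        aucs.append(float(per_pos.mean()))  # empirical AUC of this model
    n_pos, n_neg = v10[0].size, v01[0].size
    # 2x2 covariance matrix of the two AUC estimates.
    cov = np.cov(np.vstack(v10)) / n_pos + np.cov(np.vstack(v01)) / n_neg
    var_diff = cov[0, 0] + cov[1, 1] - 2.0 * cov[0, 1]
    z = float((aucs[0] - aucs[1]) / math.sqrt(var_diff))
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
    return aucs[0], aucs[1], z, p_value


# Synthetic example: model A is much less noisy than model B.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
score_a = y_true + rng.normal(scale=0.5, size=1000)
score_b = y_true + rng.normal(scale=3.0, size=1000)

auc_a, auc_b, z, p_value = delong_test(y_true, score_a, score_b)
print(f"AUC A = {auc_a:.3f}, AUC B = {auc_b:.3f}, z = {z:.2f}, p = {p_value:.3g}")
```

For real analyses, a maintained implementation such as the `roc.test` function in `pROC` is a safer choice than hand-rolled code.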

### Limitations and Caveats

DeLong's test should not be applied to nested models evaluated on the same data used to fit the parameters. When one model is a subset of another (for example, [logistic regression](/wiki/logistic_regression) with and without an additional predictor), the asymptotic theory underlying DeLong's test degenerates, potentially producing incorrect p-values. For nested models, the likelihood ratio test or a bootstrap-based approach is more appropriate.

### Implementation

DeLong's test is available in R through the `pROC` package (`roc.test` function) and in Python through third-party packages such as `mlstatkit`. An efficient $$O(n \log n)$$ algorithm by Sun and Xu (2014) makes the test practical for large datasets containing millions of observations.[7]

## How do you compute AUC in scikit-learn?

scikit-learn provides comprehensive support for AUC computation through `sklearn.metrics.roc_auc_score`.[9]

### Function Signature

```python
sklearn.metrics.roc_auc_score(
    y_true, y_score, *,
    average='macro',
    sample_weight=None,
    max_fpr=None,
    multi_class='raise',
    labels=None
)
```

### Key Parameters

| Parameter | Description | Values |
|---|---|---|
| `y_true` | True binary labels or label indicator matrix | array-like |
| `y_score` | Predicted scores (probabilities or decision function values) | array-like |
| `average` | Averaging method for multiclass/multilabel | 'micro', 'macro', 'weighted', 'samples', None |
| `max_fpr` | Upper bound on FPR for partial AUC computation | float in (0, 1] or None |
| `multi_class` | Strategy for multiclass AUC | 'raise', 'ovr', 'ovo' |

### Usage Examples

**Binary classification:**

```python
from sklearn.metrics import roc_auc_score
import numpy as np

y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])
auc = roc_auc_score(y_true, y_scores)  # Returns 0.75
```

**With a trained model:**

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score

X, y = load_breast_cancer(return_X_y=True)
clf = LogisticRegression(solver='newton-cholesky', random_state=0).fit(X, y)
auc = roc_auc_score(y, clf.predict_proba(X)[:, 1])  # In-sample AUC, ~0.99 (optimistic)

# Cross-validated AUC: a less biased estimate than scoring the training data
cv_scores = cross_val_score(clf, X, y, cv=5, scoring='roc_auc')
print(f"Mean AUC: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```

**Multiclass classification:**

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(solver='newton-cholesky').fit(X, y)
auc = roc_auc_score(y, clf.predict_proba(X), multi_class='ovr')  # Returns ~0.99
```

**Plotting the ROC curve:**

```python
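import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score, RocCurveDisplay

# Same toy data as the binary classification example
y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, thresholds = roc_curve(y_true, y_scores)
# Passing roc_auc adds the AUC value to the plot legend
RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=roc_auc_score(y_true, y_scores)).plot()
plt.title('ROC Curve')
plt.show()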
```

### Related scikit-learn Functions

| Function | Purpose |
|---|---|
| `sklearn.metrics.roc_curve` | Compute FPR, TPR, and thresholds for plotting the ROC curve |
| `sklearn.metrics.auc` | Compute area under any curve using the trapezoidal rule |
| `sklearn.metrics.average_precision_score` | Compute average precision, a step-wise summary of the [precision](/wiki/precision)-[recall](/wiki/recall) curve that is commonly reported as PR-AUC |
| `sklearn.metrics.RocCurveDisplay` | Visualize ROC curves with AUC annotation |
| `sklearn.metrics.brier_score_loss` | Compute the Brier score for assessing probability calibration |

## What is AUC used for?

AUC is used across a wide range of domains where binary classification is central to the task.

| Domain | Application | Why AUC Is Preferred |
|---|---|---|
| Medical diagnostics | Evaluating diagnostic tests and biomarkers | Threshold-independent comparison of tests across different operating points |
| Credit scoring | Assessing default prediction models | Gini coefficient (derived from AUC) is an industry-standard metric |
| Fraud detection | Ranking transactions by fraud risk | Imbalanced datasets benefit from a ranking-based metric |
| Information retrieval | Evaluating search ranking quality | AUC naturally captures ranking performance |
| Drug discovery | Screening compounds for biological activity | Virtual screening involves ranking large compound libraries |
| [Natural language processing](/wiki/natural_language_processing) | Spam filtering and [sentiment analysis](/wiki/sentiment_analysis) | Comparing models without committing to a classification threshold |

## Best Practices and Common Pitfalls

**Best practices:**

- Use AUC as the scoring metric during cross-validation (`scoring='roc_auc'` in scikit-learn) so that model selection does not depend on an arbitrarily chosen threshold.
- Report confidence intervals alongside AUC values. Bootstrap resampling or DeLong's method can provide these intervals.
- Consider both AUC-ROC and PR-AUC for a comprehensive evaluation, especially with imbalanced data.
- When comparing models, use DeLong's test or a paired bootstrap test to determine statistical significance.
- For [multi-class classification](/wiki/multi-class_classification) problems, choose the averaging strategy (macro, weighted, micro) that aligns with the problem requirements.
- Supplement AUC with calibration metrics (Brier score, calibration plots) when probability estimates matter.

**Common pitfalls:**

- Confusing AUC with [accuracy](/wiki/accuracy). AUC measures ranking quality, not classification correctness at a specific threshold.
- Reporting AUC without confidence intervals, which makes it difficult to assess whether differences between models are meaningful.
- Relying solely on AUC-ROC for severely imbalanced datasets, where PR-AUC may be more informative.
- Using AUC when the cost of different types of errors varies greatly. In such cases, a cost-sensitive evaluation may be more appropriate.
- Applying DeLong's test to nested models, which violates the test's assumptions and can produce incorrect p-values.
- Assuming high AUC implies well-calibrated probabilities. AUC only reflects ranking quality, not probabilistic accuracy.

## See Also

- [ROC (Receiver Operating Characteristic) Curve](/wiki/roc_receiver_operating_characteristic_curve)
- [Binary Classification](/wiki/binary_classification)
- [Confusion Matrix](/wiki/confusion_matrix)
- [Precision](/wiki/precision)
- [Recall](/wiki/recall)
- [Accuracy](/wiki/accuracy)
- [F1 Score](/wiki/f1_score)
- [Classification Model](/wiki/classification_model)
- [Logistic Regression](/wiki/logistic_regression)

## References

1. Hanley, J. A., & McNeil, B. J. (1982). "The meaning and use of the area under a receiver operating characteristic (ROC) curve." *Radiology*, 143(1), 29-36.
2. DeLong, E. R., DeLong, D. M., & Clarke-Pearson, D. L. (1988). "Comparing the areas under two or more correlated receiver operating characteristic curves: A nonparametric approach." *Biometrics*, 44(3), 837-845.
3. McClish, D. K. (1989). "Analyzing a portion of the ROC curve." *Medical Decision Making*, 9(3), 190-195.
4. Hand, D. J., & Till, R. J. (2001). "A simple generalisation of the area under the ROC curve for multiple class classification problems." *Machine Learning*, 45(2), 171-186.
5. Fawcett, T. (2006). "An introduction to ROC analysis." *Pattern Recognition Letters*, 27(8), 861-874.
6. Davis, J., & Goadrich, M. (2006). "The relationship between Precision-Recall and ROC curves." *Proceedings of the 23rd International Conference on Machine Learning*, 233-240.
7. Sun, X., & Xu, W. (2014). "Fast implementation of DeLong's algorithm for comparing the areas under correlated receiver operating characteristic curves." *IEEE Signal Processing Letters*, 21(11), 1389-1393.
8. Saito, T., & Rehmsmeier, M. (2015). "The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets." *PLOS ONE*, 10(3), e0118432.
9. Pedregosa, F., et al. (2011). "Scikit-learn: Machine learning in Python." *Journal of Machine Learning Research*, 12, 2825-2830.
10. Yang, J., et al. (2024). "A closer look at AUROC and AUPRC under class imbalance." *arXiv preprint arXiv:2401.06091*.
11. Cook, N. R. (2024). "The receiver operating characteristic curve accurately assesses imbalanced datasets." *Patterns*, 5(7), 100994.