AUC (Area Under the ROC Curve) is one of the most widely used evaluation metrics in machine learning for assessing the performance of classification models. It quantifies the entire two-dimensional area underneath the ROC (Receiver Operating Characteristic) curve, providing a single scalar value that summarizes a classifier's ability to distinguish between positive and negative classes across all possible classification thresholds. AUC is threshold-invariant and scale-invariant, making it especially valuable for model comparison and selection in domains such as medical diagnostics, credit scoring, fraud detection, and information retrieval.
Imagine you have a basket of red balls and blue balls all mixed together. You ask a robot to sort them: red balls on the left, blue balls on the right. The robot isn't perfect, so sometimes it puts a blue ball on the red side or a red ball on the blue side.
AUC is a score that tells you how good the robot is at sorting. If you pick one red ball and one blue ball at random, AUC is the chance that the robot correctly puts the red ball on the red side and the blue ball on the blue side. A score of 1.0 means the robot sorts perfectly every time. A score of 0.5 means the robot is just guessing randomly, like flipping a coin. The higher the AUC, the better the robot is at telling the two groups apart.
The ROC curve is a graphical plot that illustrates the diagnostic ability of a binary classification system as its discrimination threshold varies. It plots the True Positive Rate (TPR) on the y-axis against the False Positive Rate (FPR) on the x-axis at every possible threshold setting.
| Metric | Formula | Also Known As |
|---|---|---|
| True Positive Rate (TPR) | TP / (TP + FN) | Sensitivity, Recall, Hit Rate |
| False Positive Rate (FPR) | FP / (FP + TN) | 1 - Specificity, Fall-out |
| True Negative Rate (TNR) | TN / (TN + FP) | Specificity |
| False Negative Rate (FNR) | FN / (FN + TP) | Miss Rate |
Here, TP, FP, TN, and FN refer to values from the confusion matrix: True Positives, False Positives, True Negatives, and False Negatives, respectively.
A perfect classifier produces a ROC curve that passes through the upper-left corner of the plot (the point where FPR = 0 and TPR = 1), meaning it achieves 100% sensitivity with zero false positives. A random classifier, by contrast, produces a diagonal line from (0, 0) to (1, 1).
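As a quick illustration of the formulas in the table, the four rates can be computed directly from confusion-matrix counts (the counts below are made up for illustration):

```python
# Hypothetical confusion-matrix counts at a single threshold (illustrative)
TP, FP, TN, FN = 80, 10, 90, 20

tpr = TP / (TP + FN)  # sensitivity / recall       -> 0.8
fpr = FP / (FP + TN)  # fall-out (1 - specificity) -> 0.1
tnr = TN / (TN + FP)  # specificity                -> 0.9
fnr = FN / (FN + TP)  # miss rate                  -> 0.2

print(tpr, fpr, tnr, fnr)
```

Note that TPR + FNR = 1 and FPR + TNR = 1, so the ROC curve's two axes fully determine the other two rates.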
The ROC curve was first developed by electrical and radar engineers during World War II for detecting enemy objects on the battlefield, beginning in 1941. The name "Receiver Operating Characteristic" comes from its original use in signal detection theory. In the 1950s, ROC analysis was adopted in psychophysics for studying human detection of weak signals. It later became a standard tool in medical diagnostics and, starting in the late 1980s, in machine learning. Tom Fawcett's influential 2006 tutorial and the growing availability of open-source implementations helped popularize ROC analysis across the data science community.
The AUC represents the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance. This probabilistic interpretation makes AUC an intuitive and meaningful metric.
| AUC Range | Classification Quality | Interpretation |
|---|---|---|
| 0.90 - 1.00 | Excellent | Outstanding discrimination; the model almost always ranks positives above negatives |
| 0.80 - 0.90 | Good | Strong discrimination; suitable for most practical applications |
| 0.70 - 0.80 | Fair | Acceptable discrimination; may need improvement for critical applications |
| 0.60 - 0.70 | Poor | Weak discrimination; the model struggles to separate classes |
| 0.50 | No Discrimination | Random guessing; the model provides no useful information |
| Below 0.50 | Inverse Predictions | Worse than random; reversing predictions would improve performance |
For example, an AUC of 0.85 means that if you randomly select one positive and one negative example, there is an 85% probability that the model assigns a higher predicted score to the positive example.
These guideline ranges should be interpreted with caution. What counts as a "good" AUC depends heavily on the application domain. In medical diagnostics, regulatory standards may require AUC values above 0.90 for screening tests. In some social science applications, an AUC of 0.70 may represent a meaningful advance over prior methods.
In credit scoring and some other financial domains, a related metric called the Gini coefficient is used. The Gini coefficient is computed from AUC as:
Gini = 2 x AUC - 1
This rescales the AUC so that random performance corresponds to a Gini of 0 and perfect performance corresponds to a Gini of 1. Note that this Gini coefficient is distinct from the Gini impurity used in decision trees and from the Gini index used in economics to measure income inequality.
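The conversion is a one-liner; here is a sketch using toy labels and scores made up for illustration:

```python
from sklearn.metrics import roc_auc_score

# Toy labels and scores (illustrative)
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_score(y_true, y_score)  # 0.75
gini = 2 * auc - 1                    # 0.5: random -> 0, perfect -> 1
print(gini)
```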
An important caveat is that AUC measures only the ranking quality of a model, not whether its predicted probabilities are well-calibrated. A model can achieve a perfect AUC of 1.0 while producing wildly inaccurate probability estimates. For instance, a model that assigns a probability of 0.99 to every positive example and 0.01 to every negative example achieves AUC = 1.0, but so does a model assigning 0.51 and 0.49 respectively. If calibrated probability estimates matter (for example, when predicted probabilities are used for downstream decision-making or risk stratification), practitioners should supplement AUC with calibration metrics such as the Brier score, expected calibration error, or calibration curves.
There are several mathematically equivalent approaches for computing AUC, each offering a different perspective on the metric.
The most common computational method is the trapezoidal rule. After sorting all predicted scores and computing TPR and FPR at each distinct threshold, the area under the resulting piecewise-linear ROC curve is computed by summing the areas of trapezoids formed between consecutive operating points:
AUC = sum over i of (FPR_{i+1} - FPR_i) x (TPR_i + TPR_{i+1}) / 2
This is the method used by scikit-learn's roc_auc_score function. It is straightforward to implement and efficient to compute, with time complexity O(n log n) dominated by the sorting step, where n is the number of predictions.
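A minimal sketch of the trapezoidal computation (toy data made up for illustration), checked against sklearn.metrics.auc:

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

# Toy data (illustrative)
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Operating points of the empirical ROC curve
fpr, tpr, _ = roc_curve(y_true, y_score)

# Trapezoidal rule: sum the areas between consecutive operating points
area = np.sum(np.diff(fpr) * (tpr[:-1] + tpr[1:]) / 2)

print(area, auc(fpr, tpr))  # both 0.75
```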
Hanley and McNeil (1982) showed that the AUC is equivalent to the Mann-Whitney U statistic (also known as the Wilcoxon rank-sum statistic). Given a set of positive examples P and negative examples N, the AUC equals the proportion of all (positive, negative) pairs where the classifier assigns a higher score to the positive example:
AUC = (number of concordant pairs) / (|P| x |N|)
A pair is concordant when the positive example receives a higher predicted score than the negative example; tied pairs are counted as 0.5. This equivalence gives AUC a natural non-parametric statistical interpretation, connects it to classical hypothesis testing, and makes the treatment of tied scores explicit.
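The pairwise definition can be coded directly. This brute-force sketch (O(|P| x |N|), fine for small data; toy inputs made up for illustration) agrees with scikit-learn's trapezoidal computation:

```python
from sklearn.metrics import roc_auc_score

def pairwise_auc(y_true, y_score):
    """AUC as the fraction of concordant (positive, negative) pairs;
    tied pairs count as 0.5 (Mann-Whitney formulation)."""
    pos = [s for s, t in zip(y_score, y_true) if t == 1]
    neg = [s for s, t in zip(y_score, y_true) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y_true, y_score))   # 0.75
print(roc_auc_score(y_true, y_score))  # 0.75
```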
Formally, AUC can be expressed as the integral:
AUC = integral from 0 to 1 of TPR(FPR^-1(x)) dx
where TPR is viewed as a function of FPR through the varying threshold parameter. In practice, this integral is approximated using the trapezoidal rule on the empirical ROC curve.
Reporting a point estimate of AUC without uncertainty information can be misleading. Confidence intervals quantify the precision of the AUC estimate and are essential for rigorous model evaluation.
| Method | Assumptions | Strengths | Weaknesses |
|---|---|---|---|
| Hanley-McNeil (1982) | Continuous scores, no ties | Closed-form, fast to compute | Underestimates variance near AUC = 0 or 1 |
| DeLong (1988) | Nonparametric | No distributional assumptions, supports two-AUC comparison | More complex to implement |
| Bootstrap | Nonparametric | Best coverage for extreme AUC values, flexible | Computationally expensive |
| Logit-transformed | Nonparametric | Better coverage near boundaries than Hanley-McNeil | Less widely implemented |
Hanley and McNeil (1982) derived a closed-form variance estimator for AUC based on exponential approximations. Their formula expresses the variance of AUC in terms of the AUC value itself, the number of positive cases (n_p), and the number of negative cases (n_n). This approach is computationally efficient but may underestimate variance for extreme AUC values (near 0 or 1) and assumes continuous test scores with no ties.
Bootstrap resampling provides a nonparametric alternative for constructing confidence intervals. The procedure repeatedly resamples the dataset with replacement, computes AUC on each bootstrap sample, and uses the distribution of bootstrap AUC values to derive percentile-based or bias-corrected confidence intervals. Bootstrap methods tend to provide better coverage than the Hanley-McNeil formula, particularly for extreme AUC values. A typical implementation uses 1,000 to 10,000 bootstrap iterations.
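A percentile-bootstrap confidence interval along these lines can be sketched as follows (synthetic Gaussian scores; the distributions, seed, and iteration count are illustrative choices):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic scores: positives drawn from a slightly higher distribution
y_true = np.array([0] * 100 + [1] * 100)
y_score = np.concatenate([rng.normal(0.0, 1.0, 100), rng.normal(1.0, 1.0, 100)])

n = len(y_true)
boot_aucs = []
for _ in range(2000):
    idx = rng.integers(0, n, n)                # resample rows with replacement
    if y_true[idx].min() == y_true[idx].max():
        continue                               # skip single-class resamples
    boot_aucs.append(roc_auc_score(y_true[idx], y_score[idx]))

lo, hi = np.percentile(boot_aucs, [2.5, 97.5])  # percentile 95% CI
print(f"AUC = {roc_auc_score(y_true, y_score):.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

Bias-corrected and accelerated (BCa) intervals follow the same resampling loop but adjust the percentiles before reading off the interval.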
DeLong et al. (1988) proposed a nonparametric variance estimator based on the theory of U-statistics. This method computes structural components called "placement values" for each observation and uses them to estimate variance without distributional assumptions. It is more robust than the Hanley-McNeil approach and forms the basis of DeLong's test for comparing two AUCs.
AUC and accuracy are both used to evaluate classifiers, but they differ in important ways.
| Property | AUC | Accuracy |
|---|---|---|
| Threshold dependence | Threshold-independent; evaluates all thresholds | Requires a fixed threshold |
| Scale sensitivity | Scale-invariant; cares about ranking, not calibration | Sensitive to predicted probability values |
| Class imbalance | Less affected by moderate imbalance | Can be misleading with imbalanced classes |
| Interpretation | Probability of correct ranking | Proportion of correct predictions |
| Use case | Model comparison and selection | Final performance reporting at a chosen threshold |
Because AUC evaluates the model's ranking quality across all thresholds, it is particularly useful during model development and comparison. Accuracy, on the other hand, depends on a single chosen threshold and can be misleading when classes are imbalanced. For instance, in a dataset where 95% of examples are negative, a trivial model predicting all negatives achieves 95% accuracy but an AUC of only 0.5.
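The 95%-negative example can be reproduced with a quick simulation (synthetic data; the prevalence and sample size are illustrative):

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(0)

# ~5% positives; a trivial majority-class model and a skill-free scorer
y_true = (rng.random(1000) < 0.05).astype(int)
y_pred = np.zeros_like(y_true)   # hard predictions: always the majority class
y_score = rng.random(1000)       # random scores with no ranking skill

acc = accuracy_score(y_true, y_pred)   # ~0.95: looks impressive
auc = roc_auc_score(y_true, y_score)   # ~0.5: reveals no discrimination
print(acc, auc)
```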
While AUC is generally robust to moderate class imbalance, it can give an overly optimistic view of model performance when datasets are severely imbalanced. This happens because the FPR, which forms the x-axis of the ROC curve, uses True Negatives in its denominator. When the negative class heavily outnumbers the positive class, even a large number of false positives translates to a relatively small FPR, which inflates the ROC curve upward.
For example, a model evaluated on a dataset with a 1:100 positive-to-negative ratio might show an AUC of 0.95 while still missing many positive examples in absolute terms. Researchers have documented cases where a model achieves an AUC-ROC of 0.957 but only a PR-AUC of 0.708, highlighting a substantial gap between the two metrics.
The Precision-Recall AUC (PR-AUC) computes the area under the precision-recall curve. Because precision and recall do not involve true negatives, PR-AUC is unaffected by the size of the negative class, which makes it the more informative metric when the positive class is rare:
| Metric | What It Measures | Baseline for Random Classifier | Best For |
|---|---|---|---|
| AUC-ROC | Discrimination across all thresholds | 0.5 (always) | Balanced or moderately imbalanced datasets |
| PR-AUC | Precision-recall trade-off | Equals the positive class prevalence | Severely imbalanced datasets |
It is worth noting that recent research presents a more nuanced picture. Saito and Rehmsmeier (2015) argued that PR curves are more informative for imbalanced data, but a 2024 study by Yang et al. found that AUPRC can unduly favor model improvements in subpopulations with more frequent positive labels. Meanwhile, Cook (2024) argued that the ROC curve does accurately assess imbalanced datasets when properly understood. Practitioners should consider both metrics together for a complete picture rather than treating one as universally superior.
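The gap between the two metrics is easy to observe on synthetic, heavily imbalanced data (all dataset and model parameters below are illustrative choices, not from the studies cited above):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Heavily imbalanced synthetic data (~1% positives before label noise)
X, y = make_classification(n_samples=20000, weights=[0.99], flip_y=0.02,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]

roc = roc_auc_score(y_te, scores)
pr = average_precision_score(y_te, scores)  # PR-AUC (average precision)
print(f"ROC-AUC = {roc:.3f}, PR-AUC = {pr:.3f}")
```

The ROC-AUC comes out noticeably higher than the PR-AUC here because the large negative class keeps the FPR small even when false positives are numerous relative to the positives.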
Sometimes the full ROC curve is not equally relevant across all FPR values. In medical screening, for example, only very low false positive rates may be clinically acceptable. Partial AUC (pAUC) restricts the computation to a specific region of the ROC space.
| Variant | Description | Use Case |
|---|---|---|
| FPR-constrained pAUC | Area under the ROC curve where FPR is within [FPR_low, FPR_high] | When a maximum FPR is clinically mandated |
| TPR-constrained pAUC | Area under the ROC curve where TPR exceeds a minimum threshold | When a minimum sensitivity is required |
| Two-way pAUC | Area restricted by both FPR and TPR bounds simultaneously | When both sensitivity and specificity constraints apply |
To make partial AUC values comparable across different FPR ranges, McClish (1989) proposed a standardized partial AUC that normalizes the value to a [0.5, 1] scale, where 0.5 represents random performance and 1 represents perfect performance within the specified region. The standardization formula is:
pAUC_standardized = 0.5 x (1 + (pAUC - pAUC_min) / (pAUC_max - pAUC_min))
The Ratio of Relevant Areas (RRA) is another normalization approach that divides the pAUC by the area of the region of interest.
In scikit-learn, partial AUC can be computed by setting the max_fpr parameter in roc_auc_score. The returned value is the standardized partial AUC, as proposed by McClish.
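For example (using the breast-cancer dataset as a convenient stand-in, and evaluating on the training data purely for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Training-set evaluation, for illustration only
X, y = load_breast_cancer(return_X_y=True)
scores = LogisticRegression(max_iter=5000).fit(X, y).predict_proba(X)[:, 1]

full = roc_auc_score(y, scores)
# Restrict to the low-FPR region; returns McClish-standardized pAUC in [0.5, 1]
partial = roc_auc_score(y, scores, max_fpr=0.1)
print(f"full AUC = {full:.3f}, standardized pAUC (FPR <= 0.1) = {partial:.3f}")
```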
While AUC is originally defined for binary classification, it can be extended to multi-class classification problems using two main strategies.
In the OvR approach, the AUC is computed for each class against all other classes combined. Each class gets its own ROC curve, and the per-class AUC values are then aggregated. This method can be affected by class imbalance because the "rest" group composition changes for each class, creating artificial imbalance even when the original classes are balanced.
In the OvO approach, introduced by Hand and Till (2001), AUC is computed for every unique pair of classes. For k classes, this produces k(k-1)/2 pairwise AUC values that are then averaged. Hand and Till showed that when macro averaging is used, the OvO approach is insensitive to class distribution, similar to the binary AUC case. This makes OvO the preferred method when class imbalance is a concern.
| Averaging Method | Description | When to Use |
|---|---|---|
| Macro | Unweighted mean of per-class AUCs | When all classes are equally important |
| Weighted | Weighted mean using class frequencies as weights | When larger classes should contribute more |
| Micro | Computed globally by treating each element of the label indicator matrix as a separate binary prediction | When individual observation accuracy matters more than per-class accuracy |
In scikit-learn, multi-class AUC is computed using roc_auc_score with the multi_class parameter set to 'ovr' or 'ovo' and the average parameter set to 'macro', 'weighted', or 'micro' (micro averaging is only supported for OvR).
When comparing two classifiers, observing a higher AUC for one model does not necessarily mean it is significantly better. DeLong's test (DeLong et al., 1988) provides a rigorous statistical framework for determining whether the difference between two correlated AUC values is statistically significant.
DeLong's test is a non-parametric method that leverages the connection between AUC and the Mann-Whitney U statistic. It computes a z-score from the difference between two AUC values, accounting for the correlation between them (since both models are evaluated on the same dataset). The steps are: (1) compute each model's AUC via the Mann-Whitney formulation; (2) compute placement values for every observation under both models; (3) estimate the variance of the AUC difference, including the covariance term induced by the shared dataset; and (4) form the z-statistic as the AUC difference divided by its standard error.
A z-score exceeding 1.96 in absolute value (corresponding to p < 0.05) suggests a statistically significant difference between the two AUCs.
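The procedure can be sketched in code. This is an illustrative toy implementation of the idea (synthetic scores; the seed and noise levels are made up), not a replacement for vetted implementations such as pROC's roc.test:

```python
import numpy as np
from math import erf, sqrt

def _placements(y_true, y_score):
    """AUC plus per-observation placement values."""
    pos = y_score[y_true == 1][:, None]
    neg = y_score[y_true == 0][None, :]
    psi = (pos > neg) + 0.5 * (pos == neg)  # pairwise concordance, ties = 0.5
    return psi.mean(), psi.mean(axis=1), psi.mean(axis=0)

def delong_test(y_true, score_a, score_b):
    """z-test for two correlated AUCs measured on the same dataset."""
    y_true = np.asarray(y_true)
    auc_a, v10_a, v01_a = _placements(y_true, np.asarray(score_a))
    auc_b, v10_b, v01_b = _placements(y_true, np.asarray(score_b))
    m, n = len(v10_a), len(v01_a)           # positives, negatives
    s10 = np.cov(v10_a, v10_b)              # covariance over positive placements
    s01 = np.cov(v01_a, v01_b)              # covariance over negative placements
    var = (s10[0, 0] + s10[1, 1] - 2 * s10[0, 1]) / m \
        + (s01[0, 0] + s01[1, 1] - 2 * s01[0, 1]) / n
    z = (auc_a - auc_b) / sqrt(var)
    p = 1 - erf(abs(z) / sqrt(2))           # two-sided p under a normal null
    return auc_a, auc_b, z, p

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 400)
score_a = y + rng.normal(0, 0.8, 400)  # stronger (less noisy) model
score_b = y + rng.normal(0, 2.5, 400)  # weaker (noisier) model
auc_a, auc_b, z, p = delong_test(y, score_a, score_b)
print(f"AUC_A = {auc_a:.3f}, AUC_B = {auc_b:.3f}, z = {z:.2f}, p = {p:.4f}")
```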
DeLong's test should not be applied to nested models evaluated on the same data used to fit the parameters. When one model is a subset of another (for example, logistic regression with and without an additional predictor), the asymptotic theory underlying DeLong's test degenerates, potentially producing incorrect p-values. For nested models, the likelihood ratio test or a bootstrap-based approach is more appropriate.
DeLong's test is available in R through the pROC package (roc.test function) and in Python through third-party packages such as mlstatkit. An efficient O(n log n) algorithm by Sun and Xu (2014) makes the test practical for large datasets containing millions of observations.
scikit-learn provides comprehensive support for AUC computation through sklearn.metrics.roc_auc_score.
sklearn.metrics.roc_auc_score(
y_true, y_score, *,
average='macro',
sample_weight=None,
max_fpr=None,
multi_class='raise',
labels=None
)
| Parameter | Description | Values |
|---|---|---|
| y_true | True binary labels or label indicator matrix | array-like |
| y_score | Predicted scores (probabilities or decision function values) | array-like |
| average | Averaging method for multiclass/multilabel | 'micro', 'macro', 'weighted', 'samples', None |
| max_fpr | Upper bound on FPR for partial AUC computation | float in (0, 1] or None |
| multi_class | Strategy for multiclass AUC | 'raise', 'ovr', 'ovo' |
Binary classification:
from sklearn.metrics import roc_auc_score
import numpy as np
y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])
auc = roc_auc_score(y_true, y_scores) # Returns 0.75
With a trained model:
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score
X, y = load_breast_cancer(return_X_y=True)
clf = LogisticRegression(solver='newton-cholesky', random_state=0).fit(X, y)
auc = roc_auc_score(y, clf.predict_proba(X)[:, 1]) # Returns ~0.99
# Cross-validated AUC (recommended to avoid overfitting)
cv_scores = cross_val_score(clf, X, y, cv=5, scoring='roc_auc')
print(f"Mean AUC: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
Multiclass classification:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
X, y = load_iris(return_X_y=True)
clf = LogisticRegression(solver='newton-cholesky').fit(X, y)
auc = roc_auc_score(y, clf.predict_proba(X), multi_class='ovr') # Returns ~0.99
Plotting the ROC curve:
from sklearn.metrics import roc_curve, RocCurveDisplay
import matplotlib.pyplot as plt
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
RocCurveDisplay(fpr=fpr, tpr=tpr).plot()
plt.title('ROC Curve')
plt.show()
| Function | Purpose |
|---|---|
| sklearn.metrics.roc_curve | Compute FPR, TPR, and thresholds for plotting the ROC curve |
| sklearn.metrics.auc | Compute area under any curve using the trapezoidal rule |
| sklearn.metrics.average_precision_score | Compute the PR-AUC (area under the precision-recall curve) |
| sklearn.metrics.RocCurveDisplay | Visualize ROC curves with AUC annotation |
| sklearn.metrics.brier_score_loss | Compute the Brier score for assessing probability calibration |
AUC is used across a wide range of domains where binary classification is central to the task.
| Domain | Application | Why AUC Is Preferred |
|---|---|---|
| Medical diagnostics | Evaluating diagnostic tests and biomarkers | Threshold-independent comparison of tests across different operating points |
| Credit scoring | Assessing default prediction models | Gini coefficient (derived from AUC) is an industry-standard metric |
| Fraud detection | Ranking transactions by fraud risk | Imbalanced datasets benefit from a ranking-based metric |
| Information retrieval | Evaluating search ranking quality | AUC naturally captures ranking performance |
| Drug discovery | Screening compounds for biological activity | Virtual screening involves ranking large compound libraries |
| Natural language processing | Spam filtering and sentiment analysis | Comparing models without committing to a classification threshold |
Best practices:
- Evaluate with cross-validated AUC (scoring='roc_auc' in scikit-learn) to avoid threshold selection bias and overfitting.
- Report a confidence interval (DeLong or bootstrap) alongside the point estimate.
- For severely imbalanced data, report PR-AUC alongside AUC-ROC.

Common pitfalls:
- Treating a high AUC as evidence that predicted probabilities are well-calibrated.
- Relying on AUC-ROC alone when the negative class heavily outnumbers the positive class.
- Declaring one model better than another on the basis of a higher AUC without a significance test such as DeLong's.
- Applying DeLong's test to nested models fitted on the same data.