A classification threshold (also called a decision threshold or cut-off point) is a numeric value used to convert the continuous probability output of a classification model into a discrete class label. In binary classification, a model such as logistic regression produces a probability score between 0 and 1 for each input. If the predicted probability exceeds the threshold, the instance is assigned to the positive class; otherwise, it is assigned to the negative class. The default threshold in most implementations is 0.5, but this value is rarely optimal for real-world applications, and selecting an appropriate threshold is a separate modeling decision that can significantly affect a classifier's performance.
The threshold is not learned during model training. Instead, it is set by the practitioner after training, based on domain requirements, the relative costs of different types of errors, and the desired balance between precision and recall. Adjusting the threshold changes the entries of the confusion matrix, shifting the tradeoff between false positives and false negatives.
Imagine you have a robot that looks at photos and tries to decide if each photo shows a cat or a dog. The robot is not always sure. For each photo, it gives a number from 0 to 100 that represents how confident it is that the photo shows a cat. A score of 90 means "I'm really sure this is a cat," while a score of 30 means "I don't think this is a cat."
Now you need to pick a "magic number" (the threshold). If the robot's confidence is above your magic number, you say "cat." If it is below, you say "dog." If you pick a high magic number like 80, the robot only says "cat" when it is very sure, so it almost never makes a mistake calling a dog a cat. But it might miss some real cats because it was not confident enough. If you pick a low magic number like 20, the robot catches almost every cat, but it might accidentally call some dogs "cats" too.
The threshold is that magic number. You choose it based on what matters more to you: catching every cat, or never making a wrong call.
Most probabilistic classifiers do not directly output class labels. Instead, they output a continuous score, typically a probability estimate P(y = 1 | X), that reflects the model's confidence that a given instance belongs to the positive class. The classification threshold converts this continuous score into a binary decision.
The decision rule for binary classification is:

y_hat = 1 if P(y = 1 | X) >= t, otherwise y_hat = 0

where t is the classification threshold.
For example, consider a spam detection model that outputs the probability that an email is spam. With a threshold of 0.5, an email with a spam probability of 0.7 is classified as spam, while an email with a probability of 0.3 is classified as not spam. Raising the threshold to 0.8 would mean only emails with very high predicted spam probability get filtered, reducing false positives (legitimate emails sent to the spam folder) but increasing false negatives (spam emails that slip through).
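As a minimal sketch of the decision rule (the probability values below are made up for illustration), applying a threshold is a single vectorized comparison:

```python
import numpy as np

# Hypothetical predicted spam probabilities for five emails
y_proba = np.array([0.70, 0.30, 0.95, 0.55, 0.10])

# Decision rule: predict spam (1) when probability >= threshold
threshold = 0.5
y_pred = (y_proba >= threshold).astype(int)
print(y_pred)  # [1 0 1 1 0]

# Raising the threshold makes the classifier more conservative
y_pred_strict = (y_proba >= 0.8).astype(int)
print(y_pred_strict)  # [0 0 1 0 0]
```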
Changing the classification threshold directly changes the four counts in the confusion matrix:
| Threshold change | True positives | False positives | False negatives | True negatives |
|---|---|---|---|---|
| Threshold increased | Decrease | Decrease | Increase | Increase |
| Threshold decreased | Increase | Increase | Decrease | Decrease |
When the threshold is raised, the model becomes more conservative. It predicts the positive class less often, so both true positives and false positives decrease while false negatives and true negatives increase. Lowering the threshold has the opposite effect, making the model more liberal in assigning the positive label.
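A small sketch (with made-up labels and probabilities) makes this shift visible by comparing confusion matrices at two thresholds:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predicted probabilities
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_proba = np.array([0.9, 0.6, 0.4, 0.7, 0.3, 0.2, 0.8, 0.55])

for t in (0.5, 0.7):
    y_pred = (y_proba >= t).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(f"t={t}: TP={tp} FP={fp} FN={fn} TN={tn}")
# Raising t from 0.5 to 0.7 lowers TP and FP while raising FN and TN
```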
The terms "classification threshold" and "decision boundary" are related but distinct. The decision boundary is a geometric concept: it is the surface (a line in two dimensions, a hyperplane in higher dimensions) that separates the feature space into regions corresponding to different classes. In logistic regression, the decision boundary is the set of points where P(y = 1 | X) = t. When t = 0.5, the boundary lies where the log-odds equal zero. Adjusting the threshold shifts the decision boundary in feature space, expanding or contracting the region predicted as positive.
The classification threshold directly controls the tradeoff between precision and recall, two of the most commonly used metrics for evaluating classifiers.
In general, raising the threshold increases precision at the expense of recall, and lowering the threshold increases recall at the expense of precision. This inverse relationship is known as the precision-recall tradeoff.
The F1 score is the harmonic mean of precision and recall and provides a single number that balances both metrics. When neither precision nor recall is more important than the other, practitioners sometimes select the threshold that maximizes the F1 score on a validation set.
In many applications, precision and recall are not equally important. The F-beta score generalizes the F1 score by introducing a parameter beta that controls the relative weight of recall versus precision:
F_beta = (1 + beta^2) * (precision * recall) / (beta^2 * precision + recall)
| Beta value | Weighting | Typical use case |
|---|---|---|
| beta = 0.5 | Precision is weighted twice as much as recall | Spam filtering, where false positives (blocking legitimate email) are costly |
| beta = 1 | Equal weight (standard F1) | General-purpose evaluation |
| beta = 2 | Recall is weighted twice as much as precision | Medical diagnosis, where missing a disease (false negative) is dangerous |
Selecting the threshold that maximizes the appropriate F-beta score allows practitioners to encode domain-specific cost preferences directly into the threshold selection process.
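As an illustrative sketch of this idea (the labels and probabilities below are made up), the threshold can be chosen by sweeping candidate values and keeping the one with the highest F-beta score:

```python
import numpy as np
from sklearn.metrics import fbeta_score

def best_fbeta_threshold(y_true, proba, beta):
    """Return the threshold (and score) that maximizes F-beta."""
    candidates = np.linspace(0.01, 0.99, 99)
    scores = [fbeta_score(y_true, (proba >= t).astype(int), beta=beta, zero_division=0)
              for t in candidates]
    best = int(np.argmax(scores))
    return candidates[best], scores[best]

# Hypothetical validation labels and predicted probabilities
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
proba = np.array([0.8, 0.4, 0.35, 0.6, 0.2, 0.55, 0.7, 0.1, 0.45, 0.3])

# beta = 2 weights recall more heavily, as in medical screening
t, score = best_fbeta_threshold(y_true, proba, beta=2.0)
print(f"F2-optimal threshold: {t:.2f} (F2 = {score:.3f})")
```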
Several systematic methods exist for choosing an optimal classification threshold. The best method depends on the problem domain, the class distribution, and the relative costs of different error types.
The receiver operating characteristic (ROC) curve plots the true positive rate (TPR, also called recall) on the y-axis against the false positive rate (FPR) on the x-axis at every possible threshold value. Each point on the curve corresponds to a different threshold. A perfect classifier would have a point at the top-left corner (TPR = 1, FPR = 0), and a random classifier would follow the diagonal line from (0, 0) to (1, 1).
The area under the ROC curve (AUC) summarizes the classifier's overall discriminative ability across all thresholds. However, AUC does not identify a specific threshold. To select one, practitioners can use additional criteria applied to the ROC curve.
Youden's J statistic. Proposed by W. J. Youden in 1950 (though a similar formula was published by C. S. Peirce in 1884), Youden's J statistic identifies the threshold that maximizes the vertical distance between the ROC curve and the chance line. The formula is:
J = sensitivity + specificity - 1 = TPR - FPR
The optimal threshold is the one that maximizes J. This method gives equal weight to sensitivity and specificity and is widely used in medical diagnostic testing. The index ranges from -1 to 1, where 0 indicates a useless test and 1 indicates a perfect test.
Closest-to-(0,1) method. An alternative approach selects the threshold corresponding to the point on the ROC curve that is geometrically closest to the top-left corner (0, 1). This minimizes the Euclidean distance:
d = sqrt((1 - TPR)^2 + FPR^2)
This method also balances sensitivity and specificity, and it often selects a threshold very close to the Youden J optimum.
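Both criteria can be computed directly from the arrays returned by scikit-learn's roc_curve; a sketch with hypothetical validation labels and probabilities:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical validation labels and predicted probabilities
y_val = np.array([1, 1, 0, 1, 0, 0, 1, 0, 1, 0])
proba_val = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.35, 0.3, 0.2, 0.1])

# fpr, tpr, and thresholds are aligned arrays, one entry per candidate threshold
fpr, tpr, thresholds = roc_curve(y_val, proba_val)

# Youden's J: maximize TPR - FPR
t_youden = thresholds[np.argmax(tpr - fpr)]

# Closest-to-(0,1): minimize Euclidean distance to the top-left corner
t_closest = thresholds[np.argmin(np.sqrt((1 - tpr) ** 2 + fpr ** 2))]

print(f"Youden's J threshold: {t_youden:.2f}")
print(f"Closest-to-(0,1) threshold: {t_closest:.2f}")
```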
The precision-recall curve plots precision on the y-axis against recall on the x-axis at various thresholds. It is especially useful for imbalanced datasets where the positive class is rare. In such settings, the ROC curve can appear overly optimistic because the large pool of negatives keeps the false positive rate low even when the absolute number of false positives is substantial, while the precision-recall curve focuses exclusively on performance with respect to the positive class.
From the precision-recall curve, a common threshold selection strategy is to choose the threshold that maximizes the F1 score (or the appropriate F-beta score). This is the point on the curve closest to the top-right corner (precision = 1, recall = 1).
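A matching sketch uses precision_recall_curve, reusing the y_val and proba_val arrays from the ROC example above:

```python
from sklearn.metrics import precision_recall_curve

precision, recall, pr_thresholds = precision_recall_curve(y_val, proba_val)

# precision and recall have one more entry than pr_thresholds; drop the final point
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
t_f1 = pr_thresholds[np.argmax(f1)]
print(f"F1-optimal threshold: {t_f1:.2f}")
```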
The geometric mean (G-mean) balances the true positive rate and the true negative rate (specificity):
G-mean = sqrt(TPR * TNR) = sqrt(sensitivity * specificity)
Maximizing the G-mean selects a threshold that achieves good performance on both classes simultaneously. This metric is particularly useful for imbalanced classification because it penalizes a model that achieves high recall only by sacrificing specificity.
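Because the true negative rate equals 1 - FPR, the G-mean optimum follows directly from the fpr, tpr, and thresholds arrays computed in the ROC sketch above:

```python
# TNR (specificity) = 1 - FPR, so G-mean = sqrt(TPR * (1 - FPR))
gmean = np.sqrt(tpr * (1 - fpr))
t_gmean = thresholds[np.argmax(gmean)]
print(f"G-mean-optimal threshold: {t_gmean:.2f}")
```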
When the costs of false positives and false negatives are known and unequal, decision theory provides a principled framework for threshold selection. Given a cost matrix:
| | Actual positive | Actual negative |
|---|---|---|
| Predicted positive | C_TP (often 0) | C_FP |
| Predicted negative | C_FN | C_TN (often 0) |
The expected cost of predicting positive for an instance with predicted probability p is:
E[cost | predict positive] = p * C_TP + (1 - p) * C_FP
The expected cost of predicting negative is:
E[cost | predict negative] = p * C_FN + (1 - p) * C_TN
Setting these equal and solving for p yields the cost-optimal threshold:
t* = (C_FP - C_TN) / ((C_FP - C_TN) + (C_FN - C_TP))
If there is no cost or benefit for correct predictions (C_TP = C_TN = 0), this simplifies to:
t* = C_FP / (C_FP + C_FN)
For example, if a false negative costs five times as much as a false positive (C_FN = 5, C_FP = 1), the optimal threshold is 1 / (1 + 5) = 0.167. This low threshold ensures the model catches most positives, accepting more false positives to avoid the expensive false negatives.
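A small sketch of the cost-optimal formula, using the example costs above as illustrative values:

```python
def cost_optimal_threshold(c_fp, c_fn, c_tp=0.0, c_tn=0.0):
    """Threshold that minimizes expected misclassification cost."""
    return (c_fp - c_tn) / ((c_fp - c_tn) + (c_fn - c_tp))

# A false negative costs five times as much as a false positive
t_star = cost_optimal_threshold(c_fp=1.0, c_fn=5.0)
print(f"Cost-optimal threshold: {t_star:.3f}")  # 0.167
```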
Cost-sensitive threshold selection is a special case of Bayesian decision theory. In the Bayesian framework, the optimal decision rule minimizes the expected risk (also called the Bayes risk). For a two-class problem with prior probabilities P(w_1) and P(w_2), loss values lambda, and class-conditional likelihoods p(x | w_i), the Bayes decision rule assigns an observation x to class w_1 if:
p(x | w_1) / p(x | w_2) > (lambda_12 * P(w_2)) / (lambda_21 * P(w_1))
where lambda_12 is the loss for misclassifying w_2 as w_1, and lambda_21 is the loss for misclassifying w_1 as w_2. The right-hand side of this inequality defines the likelihood ratio threshold. When the classifier outputs calibrated posterior probabilities, this framework reduces to comparing the posterior probability to a cost-dependent threshold, connecting directly to the cost-sensitive threshold formula above.
In practice, many practitioners select the threshold empirically by evaluating classifier performance on a held-out validation set. The procedure is:

1. Train the model on the training set.
2. Obtain predicted probabilities for the validation set.
3. Sweep a range of candidate thresholds and compute the chosen metric at each one.
4. Select the threshold with the best metric value and use it for future predictions.
The threshold should never be tuned on the same data used to train the model, as this can lead to overfitting. Cross-validation can be used to obtain more robust threshold estimates.
The following table summarizes the most commonly used threshold selection methods, their formulas, and their typical use cases.
| Method | Optimizes | Formula or criterion | Best suited for |
|---|---|---|---|
| Youden's J statistic | Balanced sensitivity and specificity | J = TPR - FPR (maximize) | Medical screening, balanced datasets |
| F1-score maximization | Harmonic mean of precision and recall | F1 = 2 * (P * R) / (P + R) (maximize) | General-purpose, moderate imbalance |
| F-beta maximization | Weighted precision-recall balance | F_beta = (1 + beta^2) * (P * R) / (beta^2 * P + R) | Domain-specific cost asymmetry |
| G-mean maximization | Geometric mean of TPR and TNR | G = sqrt(TPR * TNR) (maximize) | Imbalanced datasets |
| Cost-sensitive threshold | Minimum expected misclassification cost | t* = C_FP / (C_FP + C_FN) | Known, unequal error costs |
| Closest to (0,1) on ROC | Minimum distance to perfect classifier | d = sqrt((1-TPR)^2 + FPR^2) (minimize) | Balanced sensitivity-specificity tradeoff |
| Precision at fixed recall | Maintains minimum recall level | Highest precision where recall >= target | Safety-sensitive applications |
In multiclass classification, models typically output a vector of scores or probabilities for each class. The standard approach uses the softmax function to convert raw logits into a probability distribution that sums to 1, and then assigns the class with the highest probability using the argmax operation. In this setting, there is no single scalar threshold; instead, the "threshold" is implicit in the comparison between class probabilities.
However, practitioners sometimes apply per-class thresholds in multiclass problems. For example, a model might require a minimum confidence of 0.6 before making any prediction, and if no class exceeds this confidence level, the model abstains or flags the instance for human review. This approach is common in high-stakes applications where low-confidence predictions should be avoided.
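A sketch of this abstention rule (the 0.6 confidence floor mirrors the example above; probs is a hypothetical softmax output):

```python
import numpy as np

def predict_with_abstention(probs, min_confidence=0.6):
    """Return the argmax class, or -1 (abstain) if no class is confident enough."""
    top_class = np.argmax(probs, axis=1)
    top_prob = np.max(probs, axis=1)
    return np.where(top_prob >= min_confidence, top_class, -1)

# Hypothetical softmax outputs for three instances over three classes
probs = np.array([[0.70, 0.20, 0.10],
                  [0.40, 0.35, 0.25],
                  [0.05, 0.90, 0.05]])
print(predict_with_abstention(probs))  # [ 0 -1  1]
```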
In the one-vs-rest (OvR) approach to multiclass classification, each class is treated as a separate binary classification problem, and each sub-classifier has its own threshold. The final prediction can be made by selecting the class whose binary classifier outputs the highest probability above its respective threshold.
In multi-label classification, each instance can belong to multiple classes simultaneously. Unlike multiclass classification (where classes are mutually exclusive), multi-label models use independent sigmoid functions for each label, producing independent probabilities. Each label then has its own threshold, and an instance is assigned to every label whose predicted probability exceeds its threshold.
Threshold selection in the multi-label setting is more complex because there are as many thresholds to tune as there are labels. Common strategies include:

- Using a single global threshold shared by all labels.
- Tuning each label's threshold independently, for example to maximize that label's F1 score on a validation set (sketched below).
- Searching for the threshold vector that maximizes an aggregate metric such as micro- or macro-averaged F1.
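A sketch of per-label threshold tuning, assuming a hypothetical binary indicator matrix Y_val and a matrix proba_val with one column of predicted probabilities per label:

```python
import numpy as np
from sklearn.metrics import f1_score

def tune_per_label_thresholds(Y_val, proba_val, candidates=np.linspace(0.05, 0.95, 19)):
    """Pick, for each label independently, the threshold that maximizes its F1 score."""
    n_labels = Y_val.shape[1]
    thresholds = np.zeros(n_labels)
    for j in range(n_labels):
        scores = [f1_score(Y_val[:, j], (proba_val[:, j] >= t).astype(int), zero_division=0)
                  for t in candidates]
        thresholds[j] = candidates[int(np.argmax(scores))]
    return thresholds
```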
Threshold selection assumes that the model's output scores are meaningful probabilities. If a model predicts a probability of 0.8, one would expect that approximately 80% of instances receiving that score truly belong to the positive class. However, many classifiers produce poorly calibrated probabilities. For example, support vector machines and gradient-boosted trees often produce overconfident or underconfident predictions.
Probability calibration methods transform the raw model outputs into better-calibrated probabilities, which can improve the effectiveness of threshold selection.
Platt scaling, proposed by John Platt in 1999, fits a logistic regression model to the raw classifier scores. A logistic function with learned parameters A and B maps the original scores s to calibrated probabilities:
P(y = 1 | s) = 1 / (1 + exp(A * s + B))
The parameters A and B are estimated by maximizing the likelihood on a held-out calibration set. Platt scaling works well when the score distribution has a sigmoidal shape, which is common for SVMs and neural networks. It requires relatively little calibration data.
Isotonic regression is a non-parametric calibration method that fits a piecewise-constant, monotonically increasing function to map raw scores to calibrated probabilities. It makes no assumptions about the shape of the calibration curve, giving it more flexibility than Platt scaling. However, because it has more degrees of freedom, it requires more data (typically at least 1,000 calibration examples) to avoid overfitting.
| Method | Type | Assumptions | Data requirements | Best for |
|---|---|---|---|---|
| Platt scaling | Parametric | Sigmoid-shaped distortion | Small calibration sets | SVMs, boosted trees, neural networks |
| Isotonic regression | Non-parametric | Monotonic mapping only | Large calibration sets (1,000+ samples) | Complex, non-sigmoid distortions |
| Beta calibration | Parametric | Beta distribution family | Moderate calibration sets | A flexible middle ground |
Calibrating the model before selecting a threshold generally leads to better threshold choices because the probability values are more trustworthy. A well-calibrated model makes it easier to reason about the expected costs and benefits of different threshold settings.
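One way to combine the two steps in scikit-learn is to calibrate first with CalibratedClassifierCV and then tune the threshold on the calibrated probabilities; this is a sketch under the assumption that X_train and y_train already exist, not a prescribed recipe:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import TunedThresholdClassifierCV

# Calibrate the raw scores: method="sigmoid" applies Platt scaling and needs little data;
# method="isotonic" is more flexible but wants a larger calibration set
calibrated = CalibratedClassifierCV(GradientBoostingClassifier(), method="sigmoid", cv=5)

# Tune the decision threshold on the calibrated probabilities
tuned = TunedThresholdClassifierCV(calibrated, scoring="f1", cv=5)
tuned.fit(X_train, y_train)
print(f"Tuned threshold after calibration: {tuned.best_threshold_:.3f}")
```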
The choice of classification threshold varies widely depending on the application domain, the consequences of different error types, and regulatory or business requirements.
In medical screening (for example, cancer detection from imaging), a false negative (missing a disease) can be life-threatening, while a false positive (flagging a healthy patient for follow-up testing) is inconvenient but generally not dangerous. Practitioners therefore set a low threshold to maximize recall, accepting more false positives. For a mammography screening model, a threshold of 0.2 or even lower might be used to ensure that nearly all cancerous cases are detected. The downstream cost of a false positive is an additional biopsy, while the cost of a false negative could be a delayed cancer diagnosis.
In financial fraud detection, both false positives and false negatives carry significant costs. A false positive (flagging a legitimate transaction) may irritate the customer and require manual investigation, while a false negative (missing actual fraud) results in direct financial loss. The optimal threshold depends on the specific economics: the average fraud loss per missed case, the cost of investigating a flagged transaction, and the customer friction caused by declined legitimate transactions. Many fraud detection systems use relatively low thresholds (often 0.3 to 0.4) and route flagged transactions to a secondary review system rather than blocking them outright.
In email spam filtering, a false positive (blocking a legitimate email) is typically worse than a false negative (letting a spam email through), because users may miss important messages. Practitioners tend to set a higher threshold (for example, 0.7 or higher), accepting that some spam will reach the inbox in order to minimize the chance of filtering out real email. The F0.5 score, which weights precision more heavily than recall, is a natural metric for optimizing the threshold in this domain.
In autonomous driving, object detection classifiers must decide whether a detected object is a pedestrian, another vehicle, or a non-hazardous feature. The cost of a false negative (failing to detect a pedestrian) is catastrophic, so thresholds for safety-related detections are set extremely low. Models may also use multiple thresholds: a low threshold for initial detection followed by a higher threshold for classification refinement.
Online platforms use classifiers to detect harmful content such as hate speech, violence, or misinformation. Thresholds must balance the competing goals of removing harmful content (favoring low thresholds and high recall) and preserving free expression (favoring high thresholds and high precision). Many platforms use a tiered approach, with a low threshold for flagging content for human review and a higher threshold for automatic removal.
| Application | Priority metric | Typical threshold | Reasoning |
|---|---|---|---|
| Medical screening | Recall | Low (0.1 to 0.3) | Missing disease is more dangerous than extra testing |
| Fraud detection | Balanced F1 or cost function | Moderate-low (0.3 to 0.5) | Both error types have significant financial costs |
| Spam filtering | Precision | High (0.7 to 0.9) | Blocking legitimate email is worse than letting spam through |
| Autonomous driving | Recall for safety objects | Very low (0.05 to 0.2) | Missing a pedestrian is life-threatening |
| Content moderation | Varies by severity | Tiered (0.3 for review, 0.8 for auto-removal) | Balances safety and free expression |
The scikit-learn library provides built-in tools for tuning classification thresholds.
Introduced in scikit-learn 1.5, TunedThresholdClassifierCV automatically finds the optimal threshold using cross-validation. It wraps any classifier that provides predict_proba or decision_function and tunes the threshold to maximize a specified metric.
```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TunedThresholdClassifierCV
from sklearn.metrics import make_scorer, f1_score

base_model = LogisticRegression()
scorer = make_scorer(f1_score, pos_label=1)

tuned_model = TunedThresholdClassifierCV(
    base_model,
    scoring=scorer,
    cv=5
)
tuned_model.fit(X_train, y_train)

# The tuned threshold is stored in the fitted model
print(f"Optimal threshold: {tuned_model.best_threshold_}")
y_pred = tuned_model.predict(X_test)
```
By default, TunedThresholdClassifierCV uses 5-fold stratified cross-validation and optimizes balanced accuracy. The scoring parameter accepts any scikit-learn scorer or a custom scoring function.
When the desired threshold is already known (for example, from a cost analysis), FixedThresholdClassifier applies a fixed threshold without searching:
```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import FixedThresholdClassifier

# Apply a known threshold (e.g., from a cost analysis) without any search
model = FixedThresholdClassifier(
    LogisticRegression(),
    threshold=0.3
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
```
For full control, practitioners can manually sweep through thresholds:
```python
import numpy as np
from sklearn.metrics import f1_score

# Get predicted probabilities on the validation set
y_proba = model.predict_proba(X_val)[:, 1]

# Sweep thresholds and keep the one with the highest F1
thresholds = np.arange(0.1, 0.9, 0.01)
best_threshold = 0.5
best_f1 = 0.0
for t in thresholds:
    y_pred = (y_proba >= t).astype(int)
    score = f1_score(y_val, y_pred)
    if score > best_f1:
        best_f1 = score
        best_threshold = t

print(f"Best threshold: {best_threshold}, F1: {best_f1:.4f}")
```
Several mistakes frequently occur when working with classification thresholds.
Tuning on training data. Selecting the threshold on the same data used to train the model leads to overfitting. The threshold should always be tuned on a separate validation set or through cross-validation.
Assuming 0.5 is optimal. The default threshold of 0.5 is a convention, not an optimization. It is only optimal when the classes are balanced and the costs of false positives and false negatives are equal.
Ignoring class imbalance. On imbalanced datasets, a threshold of 0.5 often results in the model predicting the majority class for nearly all instances. Lowering the threshold can significantly improve recall for the minority class.
Neglecting calibration. Threshold selection works best when the model's probability outputs are well-calibrated. Applying threshold tuning to poorly calibrated models can produce unstable or suboptimal results. Calibrating the model first (using Platt scaling or isotonic regression) is recommended.
Using the wrong evaluation metric. The choice of metric for threshold optimization should reflect the actual costs and priorities of the application. Optimizing F1 when the application requires high recall (as in medical screening) can lead to a threshold that misses too many positive cases.
Fixed thresholds in production. Data distributions can shift over time, making a previously optimal threshold suboptimal. Periodic re-evaluation and recalibration of the threshold is important for maintaining model performance in production systems.
The concept of using thresholds in decision-making has roots in signal detection theory, which was developed in the 1940s and 1950s for radar signal processing. The ROC curve itself originated from the analysis of radar operators' ability to distinguish enemy aircraft signals from noise during World War II. The term "receiver operating characteristic" comes from this radar context.
In the statistical and medical literature, Youden's J statistic was introduced in 1950 as a way to summarize the performance of diagnostic tests. The connection between cost-sensitive thresholds and Bayesian decision theory was formalized in the foundational work on pattern recognition by Richard Duda and Peter Hart in the 1970s.
The practical importance of threshold tuning in machine learning became more widely recognized in the 2000s as practitioners worked with increasingly imbalanced datasets in applications such as fraud detection, medical diagnosis, and information retrieval. Scikit-learn's addition of TunedThresholdClassifierCV in version 1.5 (2024) reflected the growing demand for built-in threshold tuning tools.