Calibration (machine learning)

Updated Jul 11, 2026

Calibration in machine learning is the property that the probability scores produced by a probabilistic classifier match the empirical frequency of the predicted event: a model that assigns a confidence of 0.8 to a set of inputs should be correct on about 80 percent of them. A classifier is said to be perfectly calibrated when, among all inputs assigned a predicted probability of $p$ , the long-run fraction of inputs whose true label is positive equals exactly $p$ . Formally, for a binary classifier $f$ that outputs $f(x) \in [0,1]$ and a true label $Y \in \{0,1\}$ , the model is calibrated if

$P(Y = 1 \mid f(X) = p) = p \quad \text{for all } p \in [0,1].$

Calibration is distinct from accuracy, discrimination, or ranking ability: a model can rank examples perfectly (high AUC) yet still produce probability scores that are systematically too confident or too cautious. Conversely a model can be perfectly calibrated and still inaccurate, for example by always predicting the marginal base rate. The most-cited modern result on the topic, Guo et al. 2017, reports that "modern neural networks, unlike those from a decade ago, are poorly calibrated," almost always in the over-confident direction, and that a one-parameter temperature rescaling of the softmax can cut a network's Expected Calibration Error from roughly 16 percent to about 1 percent.^[13]

The study of calibration began in meteorology with Brier's 1950 paper on verification of probabilistic forecasts^[1] and the formal calibration framework of DeGroot and Fienberg in 1983.^[2] It was carried into machine learning by Platt's 1999 sigmoid post-processing of support vector machine outputs^[3] and by Zadrozny and Elkan's isotonic regression work in 2001 and 2002.^[4]^[5] Niculescu-Mizil and Caruana's 2005 ICML paper systematized the comparison of calibration methods across classifier families.^[6] Interest exploded again after Guo et al. showed in 2017 that modern deep neural networks such as ResNet and DenseNet are systematically over-confident on ImageNet despite their high accuracy, and that temperature scaling fixes most of the gap.^[13] Calibration is now a standard concern in Bayesian inference, risk-sensitive deployment, fairness analysis, and language model evaluation.

Why does calibration matter?

Probability scores feed into downstream decisions, and miscalibrated scores break the math of those decisions. A few concrete reasons calibration matters in practice:

Cost-sensitive thresholding. When the cost of a false positive differs from the cost of a false negative, the optimal Bayes classification threshold depends on the absolute probability, not on rank. A model that ranks correctly but inflates its probabilities will threshold in the wrong place.
Risk-sensitive applications. In medical decision support, autonomous driving, and credit scoring, downstream actions are conditioned on probability estimates. A model that says "95 percent chance of malignant" had better be right about 95 percent of the time on inputs where it said that, or doctors and regulators will lose trust.
Combining outputs. Stacking, mixture-of-experts, ensembling, and Bayesian model averaging all rely on probabilities being on the same numerical scale. Pooling miscalibrated probabilities produces nonsense.
Active learning and selective prediction. Methods that pick the most uncertain examples to label, or that abstain on low-confidence inputs, only work if the reported probability genuinely reflects uncertainty.
Fairness analysis. Calibration-by-group is one of the standard fairness criteria for risk scores. Whether a score is calibrated within demographic subgroups is a separate question from whether the underlying rates match across groups.
Trust and interpretability. A score is a quantitative claim. If users learn that "90 percent" really means "about 75 percent," they will either silently recalibrate the number themselves or stop trusting it.
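
The cost-sensitive thresholding point can be made concrete with a small sketch. It assumes that correct decisions cost nothing and uses hypothetical costs `cost_fp` and `cost_fn`; under that assumption, predicting positive is optimal when the probability exceeds `cost_fp / (cost_fp + cost_fn)`. The function names are illustrative only.

```python
def bayes_threshold(cost_fp: float, cost_fn: float) -> float:
    """Probability above which predicting positive has lower expected cost.

    Assumes correct decisions cost nothing (an illustrative simplification).
    """
    return cost_fp / (cost_fp + cost_fn)


def expected_cost(p_true: float, predict_positive: bool,
                  cost_fp: float, cost_fn: float) -> float:
    """Expected cost of a decision when the true positive probability is p_true."""
    return (1.0 - p_true) * cost_fp if predict_positive else p_true * cost_fn


cost_fp, cost_fn = 1.0, 9.0            # hypothetical costs
threshold = bayes_threshold(cost_fp, cost_fn)

p_true = 0.08                          # actual event probability for some input
p_reported = 0.15                      # an inflated (over-confident) score

act_on_true = p_true > threshold           # decision from a calibrated score
act_on_reported = p_reported > threshold   # decision from the inflated score

cost_calibrated = expected_cost(p_true, act_on_true, cost_fp, cost_fn)
cost_inflated = expected_cost(p_true, act_on_reported, cost_fp, cost_fn)
```

In this toy setting the inflated score crosses the threshold while the true probability does not, so the decision flips and its expected cost rises, even though the ranking of inputs is untouched.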

What are the types of calibration?

In the binary case, calibration is a single condition. In the multiclass case there are several inequivalent definitions, and a model can satisfy a weak one without satisfying a stronger one.

Notion	Condition	Use
Perfect (binary) calibration	$P(Y=1 \mid \hat p = p) = p$ for all $p$	Binary classification
Top-label / confidence calibration	$P(Y = \hat y \mid \max_k \hat p_k = p) = p$	Whether the model's most-likely class call is correct $p$ of the time
Class-wise calibration	For each class $k$ : $P(Y = k \mid \hat p_k = p) = p$	Independent calibration of each entry of the predicted distribution
Canonical / distributional calibration	$P(Y = k \mid \hat{\boldsymbol p}) = \hat p_k$ for all $k$	Strongest notion: the entire predicted distribution matches the conditional distribution given that prediction

Canonical (distributional) calibration implies both class-wise calibration and top-label calibration, but the reverse implications fail, and in general neither of the two weaker notions implies the other. Most of the deep-learning literature reports top-label calibration because it is the easiest to estimate and the most directly relevant to confidence claims.

How is calibration measured?

No single number captures calibration; researchers use several metrics with complementary blind spots.

Metric	Formula or idea	Notes
Brier score (Brier 1950)	$\frac{1}{N}\sum_i (\hat p_i - y_i)^2$ for binary	Strictly proper scoring rule; decomposes into reliability, resolution, and uncertainty
Negative log-likelihood / log-loss	$-\frac{1}{N}\sum_i \log \hat p_{i, y_i}$	Strictly proper; sensitive to extreme low-probability predictions
Reliability diagram	Plot of empirical accuracy vs binned mean confidence	Visual diagnostic; gap between curve and diagonal indicates miscalibration
Expected Calibration Error (ECE)	$\sum_m \tfrac{\lvert B_m \rvert}{N} \bigl\lvert \text{acc}(B_m) - \text{conf}(B_m) \bigr\rvert$ over $M$ bins	Most commonly reported; biased by binning choice
Maximum Calibration Error (MCE)	$\max_m \lvert \text{acc}(B_m) - \text{conf}(B_m) \rvert$	Worst-case bin gap; relevant for safety-critical applications
Static Calibration Error (SCE)	Per-class extension of ECE averaged across classes	For multiclass class-wise calibration (Nixon et al. 2019)
Adaptive Calibration Error (ACE)	ECE-like but with equal-mass adaptive bins	Reduces bias from sparse bins (Nixon et al. 2019)
KS calibration error	Kolmogorov-Smirnov distance between cumulative empirical and predicted distributions	Binning-free; used by Gupta et al. 2020 spline calibration

The Brier score, introduced by Glenn W. Brier in 1950 in Monthly Weather Review, is the mean squared error between the predicted probability and the indicator of the realized outcome.^[1] It is a strictly proper scoring rule: it is uniquely minimized in expectation by reporting the true conditional probability. DeGroot and Fienberg's 1983 calibration-refinement framework, together with the closely related three-term decomposition usually credited to Murphy (1973), shows that the Brier score equals a reliability term (the calibration error) minus a resolution term (how far the outcome frequency conditional on each forecast departs from the base rate) plus an irreducible uncertainty term equal to the variance of the label.^[2] This is why proper scoring rules reward both calibration and refinement at once.
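
The three-term decomposition of the Brier score can be checked numerically. The sketch below assumes forecasts that take only a few distinct values, so that grouping by forecast value makes the identity exact; the function name and the toy data are illustrative, not taken from any cited paper.

```python
import numpy as np


def brier_decomposition(p, y):
    """Return (reliability, resolution, uncertainty) for binary forecasts.

    Forecasts are grouped by their exact value, which makes
    brier = reliability - resolution + uncertainty hold exactly.
    """
    p = np.asarray(p, dtype=float)
    y = np.asarray(y, dtype=float)
    base_rate = y.mean()
    reliability = 0.0
    resolution = 0.0
    for value in np.unique(p):
        mask = p == value
        weight = mask.mean()          # share of forecasts with this value
        freq = y[mask].mean()         # observed positive rate in the group
        reliability += weight * (value - freq) ** 2
        resolution += weight * (freq - base_rate) ** 2
    uncertainty = base_rate * (1.0 - base_rate)
    return reliability, resolution, uncertainty


# Toy forecasts (hypothetical data): four at 0.1, six at 0.8.
p = np.array([0.1, 0.1, 0.1, 0.1, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8])
y = np.array([0, 0, 0, 1, 1, 1, 1, 0, 1, 1])

brier = np.mean((p - y) ** 2)
rel, res, unc = brier_decomposition(p, y)
```

With continuous forecasts the same computation is usually done on bins, in which case the identity holds only approximately.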

Expected Calibration Error is the metric most often quoted in the deep-learning calibration literature, made canonical by Guo et al. 2017.^[13] ECE partitions predicted confidences into $M$ equal-width bins, computes the absolute gap between average accuracy and average confidence in each bin, and weights by bin size. It is intuitive and easy to plot, but it has well-documented pathologies. The estimate depends sharply on the number of bins, can be near-zero by accident on coarse bins, and is statistically biased downward when bins are sparsely populated. Naeini, Cooper and Hauskrecht discussed bias in binning estimators in their 2015 AAAI paper,^[8] and Gupta and Ramdas's 2021 ICML work on distribution-free histogram binning provides finite-sample guarantees that older ECE estimators do not have.^[27]
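
A minimal sketch of the equal-width ECE estimator described above, written with NumPy. The function name, the bin convention (right-closed bins), and the toy data are assumptions made for illustration; the second call shows how a coarser binning can hide miscalibration by letting opposite gaps cancel.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width ECE: bin-size-weighted mean of |accuracy - confidence|."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges only; bin m covers (edges[m], edges[m + 1]].
    bin_ids = np.digitize(confidences, edges[1:-1], right=True)
    ece = 0.0
    for m in range(n_bins):
        mask = bin_ids == m
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece


# Hypothetical predictions: ten at confidence 0.6 (8 correct),
# ten at confidence 0.9 (7 correct).
confidences = np.array([0.6] * 10 + [0.9] * 10)
correct = np.array([1] * 8 + [0] * 2 + [1] * 7 + [0] * 3)

ece_fine = expected_calibration_error(confidences, correct, n_bins=10)
ece_coarse = expected_calibration_error(confidences, correct, n_bins=2)
```

With ten bins the two groups fall into separate bins and each contributes a gap of 0.2; with two bins they share a bin whose average accuracy and average confidence both equal 0.75, so the estimate collapses to roughly zero.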

Reliability diagrams plot the binned empirical accuracy against binned mean confidence on the unit square. A perfectly calibrated model traces the diagonal. Bars above the diagonal indicate under-confidence, bars below indicate over-confidence. The diagram is the visual companion to ECE.

Why are modern deep neural networks miscalibrated?

Guo, Pleiss, Sun and Weinberger's 2017 ICML paper On Calibration of Modern Neural Networks is the most cited result on the topic.^[13] Its opening finding is blunt: "modern neural networks, unlike those from a decade ago, are poorly calibrated."^[13] The authors observed that older shallow networks like LeNet were already well calibrated, but modern deep architectures such as ResNet, DenseNet, and wide ResNets on CIFAR-100 and ImageNet exhibit ECE between roughly 5 and 20 percent, almost always in the over-confident direction.^[13] They traced the effect to the interplay between high model capacity, the cross-entropy training objective, and reduced explicit regularization, reporting that depth, width, weight decay, and batch normalization are the main factors influencing calibration.^[13] As models got deeper and wider, training NLL kept improving, but the held-out NLL and ECE worsened.

The paper's signature contribution is temperature scaling, which it describes as "a single-parameter variant of Platt Scaling" that is "surprisingly effective at calibrating predictions."^[13] The method divides the pre-softmax logits by a single scalar $T > 0$ tuned by minimizing NLL on a held-out set. Because $T$ is shared across classes and does not change the argmax, temperature scaling preserves accuracy while sharpening or softening the predicted distribution. On CIFAR-100 it brings ECE for ResNet-110 from around 16 percent down to about 1 percent.^[13] The simplicity of the method, combined with the size of the effect, made temperature scaling the default first thing to try whenever a deep classifier feels overconfident.
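
Temperature scaling is simple enough to sketch in a few lines of NumPy. The example below is an illustration under stated assumptions, not the authors' implementation: it uses synthetic logits that are made over-confident by a factor of 3, and a plain grid search over `T` in place of a gradient-based optimizer. All function names are hypothetical.

```python
import numpy as np


def log_softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def nll(logits, labels, temperature=1.0):
    """Mean negative log-likelihood of the labels under softmax(logits / T)."""
    log_probs = log_softmax(logits / temperature)
    return -log_probs[np.arange(len(labels)), labels].mean()


def fit_temperature(logits, labels, grid=np.linspace(0.1, 10.0, 991)):
    """Pick the temperature on a grid that minimizes held-out NLL."""
    losses = [nll(logits, labels, t) for t in grid]
    return float(grid[int(np.argmin(losses))])


rng = np.random.default_rng(0)
true_logits = rng.normal(scale=2.0, size=(5000, 4))
probs = np.exp(log_softmax(true_logits))
labels = np.array([rng.choice(4, p=p) for p in probs])

# A synthetic over-confident model: correct ranking, logits 3x too large.
model_logits = 3.0 * true_logits

T = fit_temperature(model_logits, labels)
nll_before = nll(model_logits, labels)
nll_after = nll(model_logits, labels, T)
```

Because dividing by a positive scalar never changes which logit is largest, the predicted class is identical before and after scaling; only the confidence attached to it changes. In practice `T` would be fit on a validation split separate from the data used to report the final calibration metrics.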

Later work has nuanced the picture. Mukhoti et al. 2020 showed in Calibrating Deep Neural Networks using Focal Loss that training with focal loss instead of cross-entropy yields models that are already nearly calibrated, especially under distribution shift, and that focal loss combined with temperature scaling can outperform either alone.^[23] Mueller, Kornblith and Hinton's 2019 paper When Does Label Smoothing Help? showed that label smoothing improves top-label calibration but distorts inter-class similarity in ways that hurt knowledge distillation.^[19]

What methods recalibrate a model?

Calibration methods fall into two broad families: post-hoc methods that take a trained classifier as fixed and learn a correction map on a held-out set, and training-time methods that bake calibration into the loss or architecture.

Method	Type	Year	Main idea
Platt scaling	Post-hoc, parametric	1999	Fit a one-dimensional logistic regression on the classifier's score
Histogram binning	Post-hoc, non-parametric	2001	Bin scores; output the empirical positive rate per bin
Isotonic regression	Post-hoc, non-parametric	2002	Fit a monotone non-decreasing step function via pool-adjacent-violators
BBQ (Bayesian Binning into Quantiles)	Post-hoc, non-parametric	2015	Bayesian model average over many binnings
Temperature scaling	Post-hoc, parametric	2017	Single scalar $T$ on softmax logits
Vector / matrix scaling	Post-hoc, parametric	2017	Per-class affine transform on logits
Beta calibration	Post-hoc, parametric	2017	Three-parameter family generalizing Platt for skewed score distributions
Dirichlet calibration	Post-hoc, parametric, multiclass	2019	Linear layer on log-probabilities followed by softmax
Non-parametric Gaussian-process calibration	Post-hoc	2020	Latent GP over confidence; AISTATS 2020
Spline calibration	Post-hoc, binning-free	2020-2021	Differentiable spline fit to empirical CDF
Label smoothing	Training-time	2016	Replace one-hot labels with $(1-\epsilon)\delta_y + \tfrac{\epsilon}{K}\mathbf{1}$
Mixup	Training-time	2018	Train on convex combinations of input-label pairs
Focal loss	Training-time	2017 / 2020	Down-weight easy examples; reduces overconfidence
MC dropout	Training-time	2016	Monte-Carlo forward passes with dropout active at test time
Deep ensembles	Training-time	2017	Average $M$ networks trained from different seeds
SWAG	Training-time	2019	Gaussian fit to SGD iterates; sample for Bayesian model averaging

Post-hoc methods

Platt scaling comes from John C. Platt's 1999 chapter Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods in Advances in Large Margin Classifiers.^[3] The original target was the SVM, whose decision-function output $f(x)$ has no probabilistic interpretation. Platt fits a sigmoid $P(Y=1 \mid f(x)) = 1/(1 + \exp(A f(x) + B))$ on a held-out validation set with maximum likelihood.^[3] The method is two-parameter, fast, and works well when the un-calibrated scores have a roughly sigmoidal distortion, which is typical of max-margin methods. Lin, Lin and Weng 2007 fixed numerical issues with the Levenberg-Marquardt fit Platt originally used.^[7]
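
The sigmoid fit can be sketched as follows. This is a minimal illustration that uses plain Newton iterations on the negative log-likelihood, not Platt's original model-trust procedure or the Lin, Lin and Weng variant, and it omits the regularized targets Platt recommends. The synthetic scores and the name `fit_platt` are assumptions.

```python
import numpy as np


def fit_platt(scores, labels, n_iter=25):
    """Fit A, B in P(y=1 | f) = 1 / (1 + exp(A * f + B)) by Newton's method."""
    f = np.asarray(scores, dtype=float)
    y = np.asarray(labels, dtype=float)
    X = np.column_stack([f, np.ones_like(f)])
    theta = np.zeros(2)                      # (A, B)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(X @ theta))
        grad = X.T @ (y - p)                 # gradient of the NLL in (A, B)
        hess = (X * (p * (1.0 - p))[:, None]).T @ X + 1e-9 * np.eye(2)
        theta = theta - np.linalg.solve(hess, grad)
    return theta


def platt_predict(theta, scores):
    A, B = theta
    return 1.0 / (1.0 + np.exp(A * np.asarray(scores, dtype=float) + B))


# Synthetic held-out scores whose true link is a sigmoid with A=-1.5, B=0.5.
rng = np.random.default_rng(0)
scores = rng.normal(scale=1.5, size=5000)
true_p = 1.0 / (1.0 + np.exp(-1.5 * scores + 0.5))
labels = (rng.random(5000) < true_p).astype(int)

A_hat, B_hat = fit_platt(scores, labels)
calibrated = platt_predict((A_hat, B_hat), scores)
```

Note the sign convention: with Platt's parameterization a classifier whose score increases with the positive class ends up with a negative `A`.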

Isotonic regression, advocated by Bianca Zadrozny and Charles Elkan in 2001 and 2002, fits a monotone non-decreasing piecewise-constant function to the (score, label) pairs using the pool-adjacent-violators algorithm.^[4]^[5] The 2002 KDD paper Transforming Classifier Scores into Accurate Multiclass Probability Estimates extends the method to multiclass via one-versus-rest reductions.^[5] Isotonic regression is more flexible than Platt scaling because it makes no parametric assumption about the shape of the distortion, but it requires more data to avoid overfitting and can produce flat regions where probabilities collapse to a constant.
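
The pool-adjacent-violators step is short enough to write out. The sketch below is a generic textbook version in pure Python, not code from the cited papers; it takes labels already sorted by classifier score and returns the fitted non-decreasing values. The toy scores and labels are hypothetical.

```python
def pav(values):
    """Pool-adjacent-violators: least-squares non-decreasing fit to a sequence."""
    blocks = []  # each block is [sum of values, count]
    for v in values:
        blocks.append([float(v), 1])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted


scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]   # sorted classifier scores
labels = [0, 0, 1, 0, 0, 1, 1, 1]                   # labels in the same order

calibrated = pav(labels)
# calibrated == [0.0, 0.0, 1/3, 1/3, 1/3, 1.0, 1.0, 1.0]
```

The three middle scores are pooled into one block with value 1/3, which is an instance of the flat regions mentioned above: distinct scores receive the same calibrated probability.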

Histogram binning sorts scores, partitions them into bins of equal width or equal frequency, and outputs the empirical positive rate inside the bin containing a new score. Gupta and Ramdas 2021 prove that, with the right sample-splitting strategy, uniform-mass histogram binning achieves distribution-free finite-sample calibration guarantees, and they show how to relax the sample-splitting requirement using a Markov property of order statistics.^[27]
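
A minimal sketch of equal-frequency (uniform-mass) histogram binning, assuming synthetic scores whose true positive rate is the square of the score. It illustrates the basic fit-and-lookup procedure only and does not implement the sample-splitting scheme or the guarantees of Gupta and Ramdas; function names are hypothetical.

```python
import numpy as np


def fit_histogram_binning(scores, labels, n_bins=10):
    """Return (interior bin edges, per-bin empirical positive rates)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    # Interior edges at quantiles give bins with (nearly) equal counts.
    edges = np.quantile(scores, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    bin_ids = np.searchsorted(edges, scores, side="right")
    rates = np.array([labels[bin_ids == b].mean() for b in range(n_bins)])
    return edges, rates


def apply_histogram_binning(edges, rates, scores):
    """Map each new score to the positive rate of the bin it falls into."""
    return rates[np.searchsorted(edges, np.asarray(scores, dtype=float), side="right")]


rng = np.random.default_rng(0)
scores = rng.uniform(size=20000)
labels = (rng.uniform(size=20000) < scores ** 2).astype(int)  # miscalibrated scores

# Fit on one half, evaluate on the other.
edges, rates = fit_histogram_binning(scores[:10000], labels[:10000])
calibrated_test = apply_histogram_binning(edges, rates, scores[10000:])

brier_raw = np.mean((scores[10000:] - labels[10000:]) ** 2)
brier_binned = np.mean((calibrated_test - labels[10000:]) ** 2)
```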

BBQ, Bayesian Binning into Quantiles, was introduced by Naeini, Cooper and Hauskrecht at AAAI 2015.^[8] BBQ averages predictions from many histogram-binning models with different bin counts and boundaries, weighting each by its posterior under a uniform prior over partitions and a multinomial-Dirichlet likelihood.^[8] It outperforms any single binning and is usually competitive with isotonic regression while reporting better-behaved Brier scores.

Temperature, vector, and matrix scaling were introduced together by Guo et al. 2017.^[13] Vector scaling allows a separate gain and bias per class, and matrix scaling allows a full affine transform of logits. The paper finds that vector and matrix scaling can overfit on small validation sets, while temperature scaling almost never does.^[13] Temperature scaling is now standard practice for post-hoc calibration of deep classifiers.

Beta calibration, introduced by Kull, Silva Filho and Flach in Beyond Sigmoids (Electronic Journal of Statistics, 2017), generalizes Platt's logistic family by modeling each class's score distribution as a Beta.^[16] It includes Platt scaling as a special case but also handles the inverse-sigmoid distortions seen in some boosted models and naive Bayes.

Dirichlet calibration, by Kull, Perello-Nieto, Kangsepp, Silva Filho, Song and Flach at NeurIPS 2019 in Beyond Temperature Scaling, extends beta calibration to multiclass.^[20] It is implemented as a linear layer on log-probabilities followed by softmax, which makes it trivial to plug in after a neural network. The authors report improvements in classwise ECE and log-loss across many architectures and datasets.^[20]

Spline calibration by Gupta, Rahimi, Ajanthan, Mensink, Sminchisescu and Hartley (ICLR 2021) fits a smooth differentiable spline to the empirical cumulative distribution of confidences, sidestepping the binning artifacts of ECE.^[28] It introduces a binning-free Kolmogorov-Smirnov calibration error and shows the spline recalibration map is consistently competitive on ImageNet ResNets.^[28]

Non-parametric Gaussian-process calibration by Wenger, Kjellstrom and Triebel at AISTATS 2020 places a latent Gaussian process over the pre-calibration confidence and uses variational inference to obtain a posterior calibration map applicable to any classifier output, not only neural networks.^[24]

Training-time methods

Label smoothing (Szegedy et al. 2016, originally introduced as part of Inception-v3) replaces the one-hot training target with a mixture of the one-hot label and the uniform distribution.^[10] It improves test accuracy on many large image classifiers and, as Mueller, Kornblith and Hinton 2019 showed, also improves top-label calibration by preventing the network from pushing logits to extreme magnitudes.^[19] The same paper documents the awkward side effect that label smoothing erases the inter-class similarity structure that knowledge distillation depends on.^[19]
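
The smoothed target is a one-line transformation of the one-hot label. The sketch below builds it with NumPy for a hypothetical batch; `epsilon` and the function name are illustrative choices.

```python
import numpy as np


def smooth_labels(labels, n_classes, epsilon=0.1):
    """Return (1 - epsilon) * one_hot + epsilon / n_classes for each label."""
    one_hot = np.eye(n_classes)[np.asarray(labels)]
    return (1.0 - epsilon) * one_hot + epsilon / n_classes


targets = smooth_labels([2, 0], n_classes=4, epsilon=0.1)
# The row for label 2 puts 0.925 on class 2 and 0.025 on each other class.
```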

Mixup (Zhang, Cisse, Dauphin and Lopez-Paz, ICLR 2018) trains on convex combinations of pairs of training examples and their one-hot labels.^[17] The resulting models tend to be both more accurate and better calibrated, since the soft mixed labels prevent the network from collapsing all of its mass onto a single class. Mixup also improves robustness to label noise and adversarial perturbations.
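
A sketch of how a mixup batch can be formed. Drawing the mixing weight from a Beta(alpha, alpha) distribution follows the original mixup paper; that detail is not stated above, and the function name, `alpha` value, and random toy batch are assumptions for illustration.

```python
import numpy as np


def mixup_batch(x, y_onehot, alpha=0.2, rng=None):
    """Blend each example with a randomly chosen partner from the same batch."""
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)           # mixing weight in [0, 1]
    perm = rng.permutation(len(x))         # partner index for each example
    x_mixed = lam * x + (1.0 - lam) * x[perm]
    y_mixed = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mixed, y_mixed, lam


rng = np.random.default_rng(0)
x = rng.normal(size=(8, 5))                      # hypothetical inputs
y = np.eye(3)[rng.integers(3, size=8)]           # one-hot labels, 3 classes

x_mixed, y_mixed, lam = mixup_batch(x, y, alpha=0.2, rng=rng)
```

Each mixed label remains a valid probability vector, which is what keeps the training targets away from hard 0/1 values.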

Focal loss was introduced by Lin et al. 2017 for object detection, where extreme class imbalance dominated training.^[15] Mukhoti et al. 2020 found that, beyond detection, focal loss significantly improves classifier calibration on its own, and that combining focal loss with a final temperature-scaling pass yields among the best ECE numbers on standard benchmarks.^[23] Focal loss can be viewed as an upper bound on a regularized objective: the KL divergence to the target distribution minus a multiple of the entropy of the predicted distribution, so minimizing it implicitly rewards higher-entropy, less over-confident predictions.
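
The down-weighting of easy examples is visible directly in the formula. The sketch below implements the standard per-example focal loss, $-(1 - p_t)^{\gamma} \log p_t$, where $p_t$ is the probability assigned to the true class; the class-balancing weight used in the detection setting is omitted, and the toy probabilities are hypothetical.

```python
import numpy as np


def focal_loss(probs, labels, gamma=2.0):
    """Per-example focal loss; gamma = 0 recovers ordinary cross-entropy."""
    p_t = probs[np.arange(len(labels)), labels]
    return -((1.0 - p_t) ** gamma) * np.log(p_t)


probs = np.array([[0.9, 0.1],    # easy example: true class gets 0.9
                  [0.6, 0.4],    # medium
                  [0.2, 0.8]])   # hard: true class gets only 0.2
labels = np.array([0, 0, 0])

cross_entropy = focal_loss(probs, labels, gamma=0.0)
focal = focal_loss(probs, labels, gamma=2.0)
weights = focal / cross_entropy   # equals (1 - p_t) ** 2 for each example
```

With `gamma = 2` the easy example keeps about 1 percent of its cross-entropy loss while the hard one keeps 64 percent, so gradient signal shifts away from examples the network is already confident about.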

Bayesian neural networks and MC dropout. A fully Bayesian neural network maintains a posterior over weights and integrates over it to produce predictions, which under suitable priors yields well-calibrated uncertainty by construction. Exact Bayesian inference is intractable in general, so practitioners use approximations. Yarin Gal and Zoubin Ghahramani's 2016 ICML paper Dropout as a Bayesian Approximation showed that dropout, applied at both training and inference time, can be interpreted as a variational approximation to a deep Gaussian process, giving rise to MC dropout: average $M$ stochastic forward passes through the network.^[9] Sampling-based posterior approximations including Hamiltonian Monte Carlo and stochastic gradient Langevin dynamics are also used. See Bayesian Neural Network for the broader picture and Bayesian Optimization for a related use of calibrated uncertainty.

Deep ensembles. Lakshminarayanan, Pritzel and Blundell's 2017 NeurIPS paper Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles trains a small number of independent neural networks from different random initializations and averages their predictive distributions.^[14] Ensembles produce strikingly well-calibrated uncertainty in practice, often beating MC dropout and matching or beating sophisticated Bayesian methods, at a cost that scales linearly with ensemble size.^[14] Whether deep ensembles are better viewed as Bayesian or as just a strong baseline remains debated. See ensemble learning and ensemble for related ideas.
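
The averaging step itself is simple. The sketch below uses random logits as stand-ins for the outputs of independently trained networks (an assumption made so the example is self-contained) and averages in probability space. By Jensen's inequality the ensemble's negative log-likelihood can never exceed the average of the members' values.

```python
import numpy as np


def softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def nll(probs, labels):
    return -np.log(probs[np.arange(len(labels)), labels]).mean()


rng = np.random.default_rng(0)
n_members, n_examples, n_classes = 5, 200, 3

# Stand-ins for the logits of five separately trained networks.
member_logits = rng.normal(size=(n_members, n_examples, n_classes))
labels = rng.integers(n_classes, size=n_examples)

member_probs = softmax(member_logits)
ensemble_probs = member_probs.mean(axis=0)     # average the distributions

ensemble_nll = nll(ensemble_probs, labels)
mean_member_nll = np.mean([nll(p, labels) for p in member_probs])
```

The same averaging applies to MC dropout, with the members replaced by stochastic forward passes through a single network.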

SWA-Gaussian (SWAG). Maddox, Garipov, Izmailov, Vetrov and Wilson 2019 fit a Gaussian to the trajectory of SGD iterates near the end of training.^[21] Sampling weights from this Gaussian produces an ensemble for free, with calibration competitive with deep ensembles at a fraction of the compute.^[21]

Calibration in regression

In regression the analog of calibration is quantile calibration or interval coverage: when a predictor outputs a 90 percent prediction interval, that interval should contain the true value 90 percent of the time on average. Two large families of techniques target this property.

Quantile regression trains the model to predict specified quantiles directly under the pinball loss. This produces self-consistent intervals for the predicted distribution but does not by itself give frequentist coverage; recalibration on a held-out set is usually needed.
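
The pinball loss has a simple closed form, and its minimizer over constant predictions is the corresponding empirical quantile. The sketch below checks this numerically on synthetic data; the function name and data are illustrative assumptions.

```python
import numpy as np


def pinball_loss(y_true, y_pred, tau):
    """Mean pinball (quantile) loss at level tau in (0, 1)."""
    diff = np.asarray(y_true, dtype=float) - y_pred
    # Under-prediction is charged tau per unit, over-prediction 1 - tau.
    return np.mean(np.maximum(tau * diff, (tau - 1.0) * diff))


rng = np.random.default_rng(0)
y = rng.normal(size=2001)
tau = 0.9

# Brute-force search over constant predictions taken from the data itself.
candidates = np.sort(y)
losses = np.array([pinball_loss(y, c, tau) for c in candidates])
best_constant = candidates[np.argmin(losses)]

empirical_quantile = np.quantile(y, tau)
```

A quantile regressor replaces the constant with a function of the inputs, but the asymmetry of the loss is what steers it toward the requested quantile.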

Conformal prediction is the dominant distribution-free technique. Originating with Vovk, Gammerman and Shafer in the early 2000s^[34] and popularized for the deep-learning era by Anastasios Angelopoulos and Stephen Bates's 2021 tutorial A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification, conformal methods construct prediction sets with finite-sample marginal coverage guarantees under the assumption of exchangeable data.^[25] Conformal prediction works on top of any base model, makes no assumption about the base model's accuracy or calibration, and the coverage guarantee holds in finite samples for any data distribution. Conformal techniques have been adapted to time series, structured outputs, distribution shift, and large language models.
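
A minimal sketch of split conformal prediction for regression, assuming absolute residuals as the nonconformity score and a deliberately crude base model to emphasize that coverage does not depend on model quality. The data-generating process and all names are hypothetical.

```python
import math

import numpy as np


def conformal_quantile(scores, alpha):
    """The ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:                      # too few calibration points for this alpha
        return float("inf")
    return float(np.sort(scores)[k - 1])


rng = np.random.default_rng(0)


def sample(n):
    x = rng.uniform(-3, 3, size=n)
    y = np.sin(x) + rng.normal(scale=0.3, size=n)
    return x, y


def base_model(x):
    return 0.8 * x                 # a poor fit to sin(x), on purpose


x_cal, y_cal = sample(1000)        # held-out calibration set
x_test, y_test = sample(5000)

alpha = 0.1
qhat = conformal_quantile(np.abs(y_cal - base_model(x_cal)), alpha)

lower = base_model(x_test) - qhat
upper = base_model(x_test) + qhat
coverage = np.mean((y_test >= lower) & (y_test <= upper))
```

The empirical coverage lands near the 90 percent target even though the base model is badly misspecified; the price of a poor model is wider intervals, not lost coverage. The guarantee is marginal, so coverage can still vary across regions of the input space.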

How does calibration relate to fairness?

Calibration by group asks that the calibration condition hold within each demographic subgroup: $P(Y=1 \mid \hat p, G = g) = \hat p$ for every group $g$ . The criterion is also known as sufficiency and is closely related to predictive parity, which requires equal positive predictive value across groups at a decision threshold. The constraint is appealing because it says the score means the same thing regardless of group membership.

Kleinberg, Mullainathan and Raghavan's 2017 ITCS paper Inherent Trade-Offs in the Fair Determination of Risk Scores^[12] and Alexandra Chouldechova's 2017 Big Data paper Fair Prediction with Disparate Impact^[11] independently proved an impossibility theorem: when groups have different base rates, calibration by group, equal false-positive rates, and equal false-negative rates cannot all hold simultaneously, except in degenerate cases. The result has shaped the algorithmic-fairness literature ever since, because it forces practitioners to choose which fairness criterion to prioritize. See algorithmic fairness for the wider context, and the AI Fairness 360 and Fairlearn toolkits for software implementations of the standard criteria.
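
A small numerical illustration of the trade-off, using two hypothetical groups whose scores are perfectly calibrated within each group but whose base rates differ. The score values, group compositions, and threshold are invented for the example.

```python
def error_rates(score_mix, threshold=0.5):
    """Error rates for a group whose scores are calibrated: P(Y=1 | score=s) = s.

    score_mix maps each score value to the fraction of the group receiving it.
    """
    positives = sum(w * s for s, w in score_mix.items())
    negatives = 1.0 - positives
    false_pos = sum(w * (1.0 - s) for s, w in score_mix.items() if s > threshold)
    false_neg = sum(w * s for s, w in score_mix.items() if s <= threshold)
    return {
        "base_rate": positives,
        "fpr": false_pos / negatives,
        "fnr": false_neg / positives,
    }


group_a = error_rates({0.2: 0.5, 0.8: 0.5})   # half the group scores high
group_b = error_rates({0.2: 0.8, 0.8: 0.2})   # only a fifth scores high
```

Group A has a base rate of 0.5 with false-positive and false-negative rates of 0.2 each; group B has a base rate of 0.32, a false-positive rate of about 0.06, and a false-negative rate of 0.5. Both groups are calibrated, yet their error rates diverge, as the impossibility result predicts when base rates differ.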

Are large language models calibrated?

Large language models present new calibration questions because they output token distributions rather than calibrated class probabilities, and because their inputs often have no fixed schema. Several recent results frame the field:

Multiple-choice probabilities. Kadavath et al. 2022 in Language Models (Mostly) Know What They Know found that "larger models are well-calibrated on diverse multiple choice and true/false questions when they are provided in the right format."^[29] They also studied self-evaluation P(True) and found it scales with model size.^[29]
RLHF degrades calibration. Reinforcement-learning-from-human-feedback fine-tuning tends to make probabilities sharper and less calibrated. OpenAI's GPT-4 technical report states that "the pre-trained model is highly calibrated" on MMLU, but that "after the post-training process, the calibration is reduced."^[35]
Verbalized confidence. Lin, Hilton and Evans 2022 trained GPT-3 to express its confidence in words ("I'm 70 percent sure") and showed that verbalized probabilities can be well calibrated and can generalize under distribution shift better than logit-based confidences.^[30]
Just ask for calibration. Tian, Mitchell, Zhou, Sharma, Rafailov, Yao, Finn and Manning's EMNLP 2023 paper Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback showed that for ChatGPT, GPT-4 and Claude, prompting the model to verbalize a confidence yields better-calibrated numbers than reading conditional token probabilities, contrary to what one might expect from a probabilistic model.^[31]
Ensembling and consistency-based calibration. Sampling many completions and measuring agreement, or running multiple paraphrases of the same prompt, can give surprisingly well-calibrated confidence proxies even when the underlying model's logits are not calibrated.
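
The agreement-based proxy in the last item can be sketched in a few lines. The example assumes the completions have already been sampled and reduced to short answer strings; the normalization (strip and lowercase) is a simplistic placeholder, and the sampled answers are invented.

```python
from collections import Counter


def agreement_confidence(answers):
    """Return the majority answer and the fraction of samples that agree with it."""
    normalized = [a.strip().lower() for a in answers]
    answer, votes = Counter(normalized).most_common(1)[0]
    return answer, votes / len(normalized)


# Hypothetical answers from five sampled completions of the same prompt.
samples = ["Paris", "paris", "Paris ", "Lyon", "Paris"]
answer, confidence = agreement_confidence(samples)
# answer == "paris", confidence == 0.8
```

Whether such an agreement rate is itself calibrated has to be checked empirically, for example with a reliability diagram over a labeled evaluation set.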

Related topics include AI deception, where miscalibrated or strategically over-confident outputs become a safety concern, and the role of temperature sampling at decoding time, which interacts with calibration in nontrivial ways.

Calibration in medical and clinical AI

Van Calster, McLernon, van Smeden, Wynants and Steyerberg's 2019 BMC Medicine paper Calibration: the Achilles Heel of Predictive Analytics argued that calibration is the single most under-checked property of clinical prediction models.^[22] They distinguish four increasingly strict levels of calibration: mean calibration (agreement of the average predicted probability with the overall event rate), then weak, moderate, and strong calibration.^[22] The paper made calibration plots a near-mandatory companion to AUC in clinical model validation. Regulators in oncology, cardiology and intensive care have since incorporated calibration assessment into model approval workflows.

Calibration of generative models

Calibration concepts apply differently to generative models because there is no single label to compare to. For a generative model used as a classifier, such as an energy-based model or a normalizing flow followed by a softmax head, the standard binary or multiclass calibration definitions apply. For unconditional generative models such as diffusion models, the relevant property is whether the model's likelihood function (where it has one) corresponds to the data distribution; this is closer to a goodness-of-fit question than to calibration in the classifier sense. Diffusion models do not attach a calibrated probability to a sampled image, since each sample is just one trajectory through the reverse process. Density-based generative models such as flows do assign an explicit likelihood to each sample, but the relationship between log-likelihood and perceptual quality is famously loose.

What are the common pitfalls?

Calibrating on training data. Post-hoc methods like Platt scaling, isotonic regression, and temperature scaling must be fit on a held-out set the model has never seen, otherwise they overfit to the training noise and underestimate the calibration error.
Distribution shift. A perfectly calibrated model on the validation set can become miscalibrated under covariate shift, label shift, or domain shift. Conformal prediction with distribution-shift corrections is one principled response.
Multiclass calibration is hard. A model that is calibrated in the top-label sense can be very miscalibrated in the class-wise or distributional sense. Reporting only ECE on the most-likely class hides this.
ECE is biased and binning-dependent. Two analysts can run "the same" ECE on the same data and report different numbers because they binned differently. The Adaptive Calibration Error and Static Calibration Error metrics of Nixon et al. 2019 partly address this, and KS-based or spline-based metrics avoid binning entirely.^[18]
Class imbalance. When one class dominates, naive calibration on the rare class produces noisy estimates. Stratified binning or class-weighted scoring rules help.
Confusing rank and probability. AUC is invariant to monotone transformations of the score; calibration is not. A model can have AUC of 0.95 and still be uselessly miscalibrated.
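
The last pitfall is easy to reproduce. The sketch below builds synthetic scores that are calibrated by construction, applies a strictly increasing distortion, and compares AUC and Brier score before and after. The pairwise AUC helper and the choice of distortion (fourth power) are illustrative assumptions.

```python
import numpy as np


def auc(scores, labels):
    """AUC as the probability that a positive outranks a negative (ties count half)."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return greater + 0.5 * ties


rng = np.random.default_rng(0)
calibrated = rng.uniform(size=2000)
labels = (rng.uniform(size=2000) < calibrated).astype(int)  # P(y=1 | s) = s

distorted = calibrated ** 4     # same ranking, badly under-estimated probabilities

auc_calibrated = auc(calibrated, labels)
auc_distorted = auc(distorted, labels)
brier_calibrated = np.mean((calibrated - labels) ** 2)
brier_distorted = np.mean((distorted - labels) ** 2)
```

The two AUC values coincide because the ranking is unchanged, while the Brier score of the distorted scores is markedly worse.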

Software libraries

Library	Language	Coverage
scikit-learn `sklearn.calibration`	Python	`CalibratedClassifierCV` (Platt and isotonic), `calibration_curve`, `CalibrationDisplay`
netcal	Python (PyTorch / TensorFlow)	Binning, scaling, and regularization methods; classification and regression calibration
Uncertainty Toolbox (Chung et al. 2021)	Python	Regression-focused uncertainty metrics, visualizations, and recalibration
pycalib (Wenger et al.)	Python	Non-parametric GP calibration and other multiclass methods
MAPIE	Python	Conformal prediction for classification, regression, and time series
TorchMetrics `CalibrationError`	Python (PyTorch)	ECE, MCE, and class-wise variants

The scikit-learn CalibratedClassifierCV is the most widely used entry point in classical machine learning: it cross-validates a base estimator, fits Platt or isotonic calibration on the held-out folds, and averages the resulting probabilities.^[32] The companion calibration_curve and CalibrationDisplay produce reliability diagrams.^[32]
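
A short usage sketch of this entry point, assuming a recent scikit-learn release. The synthetic dataset, the choice of Gaussian naive Bayes as the base estimator, and the hyperparameters are arbitrary illustrations.

```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic binary problem with redundant features, which tends to make
# naive Bayes over-confident.
X, y = make_classification(n_samples=4000, n_features=20, n_informative=5,
                           n_redundant=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# Cross-validated isotonic calibration wrapped around the base estimator.
calibrated = CalibratedClassifierCV(GaussianNB(), method="isotonic", cv=5)
calibrated.fit(X_train, y_train)
probs = calibrated.predict_proba(X_test)[:, 1]

# Points of the reliability diagram: observed frequency vs mean prediction.
frac_pos, mean_pred = calibration_curve(y_test, probs, n_bins=10)
```

Passing `method="sigmoid"` instead selects Platt scaling, which is usually the safer choice when the calibration folds are small.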

Connection to model selection and proper scoring rules

Proper scoring rules are loss functions that are minimized in expectation by reporting the true conditional probability. The Brier score and the negative log-likelihood are the two most common; both are strictly proper. DeGroot and Fienberg 1983 showed that a strictly proper scoring rule decomposes into a calibration component and a refinement component; for the Brier score, the refinement term is the irreducible uncertainty minus the resolution.^[2] Optimizing a proper scoring rule therefore rewards both calibration and refinement at once. This is why log-loss and Brier score, rather than accuracy or AUC, are the right defaults for evaluating probabilistic predictors. Accuracy depends only on which side of the decision threshold each score falls, and AUC is invariant to strictly increasing transforms of the score, so neither can detect miscalibration.

In model selection it is now standard to track at least one proper scoring rule alongside accuracy and AUC. For multiclass problems classwise ECE (or its smoother spline-based counterparts) is reported alongside log-loss. For regression, mean continuous ranked probability score (CRPS) and prediction-interval coverage are reported alongside RMSE.

Recent developments

The last few years have seen calibration broaden out from a post-hoc fixup into a first-class evaluation criterion across the field.

Distribution-free guarantees. Gupta and Ramdas 2021 and follow-ups gave finite-sample calibration guarantees for histogram binning that hold without distributional assumptions, partly closing the gap between empirical practice and theory.^[27]
Conformal prediction adoption. Conformal methods went from a niche idea in the early 2000s to a default tool for distribution-free uncertainty in 2021 onward, driven by the Angelopoulos and Bates tutorial and the MAPIE software ecosystem.^[25]
LLM calibration as an active area. With language models in production, calibration of verbalized confidence and of self-evaluation has become a topic of its own. The picture from 2022 to 2024 is that pretrained base models are roughly calibrated on multiple-choice tasks at scale, RLHF tends to harm calibration, and verbal probability prompts can recover much of what was lost.
Calibrating foundation models for downstream tasks. As foundation models are adapted to specific applications, lightweight calibration layers (small learned maps on top of a frozen backbone) are being explored as an inexpensive way to recover trustworthy probabilities without full retraining.
Connections to evaluation infrastructure. Calibration is increasingly built into the standard evaluation suite for classifiers and rankers, reported alongside ROC (receiver operating characteristic) curves and proper scoring rules.

References

Brier, G. W. (1950). "Verification of Forecasts Expressed in Terms of Probability." *Monthly Weather Review*, 78(1), 1-3.
DeGroot, M. H., & Fienberg, S. E. (1983). "The Comparison and Evaluation of Forecasters." *Journal of the Royal Statistical Society: Series D (The Statistician)*, 32(1-2), 12-22.
Platt, J. C. (1999). "Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods." In *Advances in Large Margin Classifiers*, MIT Press, pp. 61-74.
Zadrozny, B., & Elkan, C. (2001). "Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers." *ICML 2001*.
Zadrozny, B., & Elkan, C. (2002). "Transforming Classifier Scores into Accurate Multiclass Probability Estimates." *Proceedings of the 8th ACM SIGKDD*, 694-699.
Niculescu-Mizil, A., & Caruana, R. (2005). "Predicting Good Probabilities With Supervised Learning." *Proceedings of the 22nd ICML*, 625-632.
Lin, H.-T., Lin, C.-J., & Weng, R. C. (2007). "A note on Platt's probabilistic outputs for support vector machines." *Machine Learning*, 68(3), 267-276.
Naeini, M. P., Cooper, G. F., & Hauskrecht, M. (2015). "Obtaining Well Calibrated Probabilities Using Bayesian Binning." *Proceedings of AAAI 2015*.
Gal, Y., & Ghahramani, Z. (2016). "Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning." *Proceedings of ICML 2016*, 1050-1059.
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). "Rethinking the Inception Architecture for Computer Vision." *CVPR 2016* (introduces label smoothing).
Chouldechova, A. (2017). "Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments." *Big Data*, 5(2), 153-163.
Kleinberg, J., Mullainathan, S., & Raghavan, M. (2017). "Inherent Trade-Offs in the Fair Determination of Risk Scores." *ITCS 2017*.
Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). "On Calibration of Modern Neural Networks." *Proceedings of ICML 2017*. arXiv:1706.04599.
Lakshminarayanan, B., Pritzel, A., & Blundell, C. (2017). "Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles." *NeurIPS 2017*. arXiv:1612.01474.
Lin, T.-Y., Goyal, P., Girshick, R., He, K., & Dollar, P. (2017). "Focal Loss for Dense Object Detection." *ICCV 2017*.
Kull, M., Silva Filho, T. M., & Flach, P. (2017). "Beyond sigmoids: How to obtain well-calibrated probabilities from binary classifiers with beta calibration." *Electronic Journal of Statistics*, 11(2), 5052-5080.
Zhang, H., Cisse, M., Dauphin, Y. N., & Lopez-Paz, D. (2018). "mixup: Beyond Empirical Risk Minimization." *ICLR 2018*. arXiv:1710.09412.
Nixon, J., Dusenberry, M. W., Zhang, L., Jerfel, G., & Tran, D. (2019). "Measuring Calibration in Deep Learning." *CVPR Workshops 2019*. arXiv:1904.01685.
Mueller, R., Kornblith, S., & Hinton, G. (2019). "When Does Label Smoothing Help?" *NeurIPS 2019*. arXiv:1906.02629.
Kull, M., Perello-Nieto, M., Kangsepp, M., Silva Filho, T. M., Song, H., & Flach, P. (2019). "Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with Dirichlet calibration." *NeurIPS 2019*. arXiv:1910.12656.
Maddox, W. J., Garipov, T., Izmailov, P., Vetrov, D., & Wilson, A. G. (2019). "A Simple Baseline for Bayesian Uncertainty in Deep Learning." *NeurIPS 2019*. arXiv:1902.02476.
Van Calster, B., McLernon, D. J., van Smeden, M., Wynants, L., & Steyerberg, E. W. (2019). "Calibration: the Achilles heel of predictive analytics." *BMC Medicine*, 17, 230.
Mukhoti, J., Kulharia, V., Sanyal, A., Golodetz, S., Torr, P. H. S., & Dokania, P. K. (2020). "Calibrating Deep Neural Networks using Focal Loss." *NeurIPS 2020*. arXiv:2002.09437.
Wenger, J., Kjellstrom, H., & Triebel, R. (2020). "Non-Parametric Calibration for Classification." *AISTATS 2020*. arXiv:1906.04933.
Angelopoulos, A. N., & Bates, S. (2021). "A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification." arXiv:2107.07511.
Chung, Y., Char, I., Guo, H., Schneider, J., & Neiswanger, W. (2021). "Uncertainty Toolbox: an Open-Source Library for Assessing, Visualizing, and Improving Uncertainty Quantification." arXiv:2109.10254.
Gupta, C., & Ramdas, A. (2021). "Distribution-Free Calibration Guarantees for Histogram Binning without Sample Splitting." *ICML 2021*. arXiv:2105.04656.
Gupta, K., Rahimi, A., Ajanthan, T., Mensink, T., Sminchisescu, C., & Hartley, R. (2021). "Calibration of Neural Networks using Splines." *ICLR 2021*. arXiv:2006.12800.
Kadavath, S., Conerly, T., Askell, A., et al. (2022). "Language Models (Mostly) Know What They Know." arXiv:2207.05221.
Lin, S. C., Hilton, J., & Evans, O. (2022). "Teaching Models to Express Their Uncertainty in Words." *Transactions on Machine Learning Research*. arXiv:2205.14334.
Tian, K., Mitchell, E., Zhou, A., Sharma, A., Rafailov, R., Yao, H., Finn, C., & Manning, C. D. (2023). "Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback." *EMNLP 2023*. arXiv:2305.14975.
scikit-learn developers. "1.16. Probability calibration." scikit-learn documentation.
EFS-OpenSource. "netcal calibration framework." https://github.com/EFS-OpenSource/calibration-framework
Vovk, V., Gammerman, A., & Shafer, G. (2005). *Algorithmic Learning in a Random World*. Springer (foundational text for conformal prediction).
OpenAI (2023). "GPT-4 Technical Report." arXiv:2303.08774 (Section 5, calibration on MMLU before and after post-training/RLHF).
