Calibration (machine learning)
Last reviewed
Apr 30, 2026
Sources
34 citations
Review status
Source-backed
Revision
v1 · 5,280 words
Calibration in machine learning is the property that the probability scores produced by a probabilistic classifier match the empirical frequency of the predicted event. A classifier is said to be perfectly calibrated when, among all inputs assigned a predicted probability of $p$, the long-run fraction of inputs whose true label is positive equals exactly $p$. Formally, for a binary classifier $f$ that outputs $f(x) \in [0,1]$ and a true label $Y \in \{0,1\}$, the model is calibrated if
$$P(Y = 1 \mid f(X) = p) = p \quad \text{for all } p \in [0,1].$$
A model that says "0.8" should be right about 80 percent of the time over the inputs where it says 0.8, no more and no less. Calibration is distinct from accuracy, discrimination, or ranking ability: a model can rank examples perfectly (high AUC) yet still produce probability scores that are systematically too confident or too cautious. Conversely, a model can be perfectly calibrated and still inaccurate, for example by always predicting the marginal base rate.
The study of calibration began in meteorology with Brier's 1950 paper on verification of probabilistic forecasts and the formal calibration framework of DeGroot and Fienberg in 1983. It was carried into machine learning by Platt's 1999 sigmoid post-processing of support vector machine outputs and by Zadrozny and Elkan's isotonic regression work in 2001 and 2002. Niculescu-Mizil and Caruana's 2005 ICML paper systematized the comparison of calibration methods across classifier families. Interest exploded again after Guo et al. showed in 2017 that modern deep neural networks such as ResNet and DenseNet are systematically over-confident on ImageNet despite their high accuracy, and that a one-parameter temperature rescaling of the softmax fixes most of the gap. Calibration is now a standard concern in Bayesian inference, risk-sensitive deployment, fairness analysis, and language model evaluation.
Probability scores feed into downstream decisions, and miscalibrated scores break the math of those decisions. A few concrete reasons calibration matters in practice:
In the binary case, calibration is a single condition. In the multiclass case there are several inequivalent definitions, and a model can satisfy a weak one without satisfying a stronger one.
| Notion | Condition | Use |
|---|---|---|
| Perfect (binary) calibration | $P(Y=1 \mid \hat p = p) = p$ for all $p$ | Binary classification |
| Top-label / confidence calibration | $P(Y = \hat y \mid \max_k \hat p_k = p) = p$ | Whether the model's most-likely class call is correct $p$ of the time |
| Class-wise calibration | For each class $k$: $P(Y = k \mid \hat p_k = p) = p$ | Independent calibration of each entry of the predicted distribution |
| Canonical / distributional calibration | $P(Y = k \mid \hat{\boldsymbol p}) = \hat p_k$ for all $k$ | Strongest notion: the entire predicted distribution matches the conditional distribution given that prediction |
Canonical (distributional) calibration implies both class-wise and top-label calibration, but neither converse holds, and class-wise calibration does not in general imply top-label calibration. Most of the deep-learning literature reports top-label calibration because it is the easiest to estimate and the most directly relevant to confidence claims.
No single number captures calibration; researchers use several metrics with complementary blind spots.
| Metric | Formula or idea | Notes |
|---|---|---|
| Brier score (Brier 1950) | $\frac{1}{N}\sum_i (\hat p_i - y_i)^2$ for binary | Strictly proper scoring rule; decomposes into reliability, resolution, and uncertainty |
| Negative log-likelihood / log-loss | $-\frac{1}{N}\sum_i \log \hat p_{i, y_i}$ | Strictly proper; sensitive to extreme low-probability predictions |
| Reliability diagram | Plot of empirical accuracy vs binned mean confidence | Visual diagnostic; gap between curve and diagonal indicates miscalibration |
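| Expected Calibration Error (ECE) | $\sum_m \tfrac{\lvert B_m \rvert}{N} \, \lvert \text{acc}(B_m) - \text{conf}(B_m) \rvert$ | Bin-size-weighted average gap between accuracy and confidence; sensitive to the number of bins |
| Maximum Calibration Error (MCE) | $\max_m \lvert \text{acc}(B_m) - \text{conf}(B_m) \rvert$ | Largest gap over all bins; worst-case counterpart of ECE |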
| Static Calibration Error (SCE) | Per-class extension of ECE averaged across classes | For multiclass class-wise calibration (Nixon et al. 2019) |
| Adaptive Calibration Error (ACE) | ECE-like but with equal-mass adaptive bins | Reduces bias from sparse bins (Nixon et al. 2019) |
| KS calibration error | Kolmogorov-Smirnov distance between cumulative empirical and predicted distributions | Binning-free; used by Gupta et al. 2020 spline calibration |
The Brier score introduced by Glenn W. Brier in 1950 in Monthly Weather Review is the mean squared error between predicted probability and the indicator of the realized outcome. It is a strictly proper scoring rule: it is uniquely minimized in expectation by reporting the true conditional probability. DeGroot and Fienberg's 1983 decomposition shows that the Brier score splits into a reliability term (the calibration error), a resolution term (how much predictions vary across outcomes), and an irreducible uncertainty term equal to the variance of the label. This is why proper scoring rules reward both calibration and refinement at once.
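The three-term split can be checked numerically. The following is a minimal sketch, assuming binary labels and forecasts that take only a few distinct values, so that grouping by forecast value makes the identity Brier = reliability − resolution + uncertainty hold exactly. The function name `brier_decomposition` and the toy data are illustrative, not taken from any cited paper.

```python
import numpy as np

def brier_decomposition(p, y):
    """Return (brier, reliability, resolution, uncertainty) for binary labels.

    Forecasts are grouped by their distinct values, so the identity
    brier = reliability - resolution + uncertainty holds exactly.
    """
    p = np.asarray(p, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    base_rate = y.mean()
    brier = float(np.mean((p - y) ** 2))
    reliability = 0.0
    resolution = 0.0
    for value in np.unique(p):
        mask = p == value
        freq = y[mask].mean()      # observed positive rate for this forecast value
        weight = mask.sum() / n    # share of examples receiving this forecast
        reliability += weight * (value - freq) ** 2
        resolution += weight * (freq - base_rate) ** 2
    uncertainty = base_rate * (1.0 - base_rate)  # variance of the label
    return brier, float(reliability), float(resolution), float(uncertainty)

# Toy data (made up): four forecasts of 0.1 and six forecasts of 0.8.
p = np.array([0.1, 0.1, 0.1, 0.1, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8])
y = np.array([0, 0, 0, 1, 1, 1, 1, 0, 1, 1])
bs, rel, res, unc = brier_decomposition(p, y)
print(bs, rel - res + unc)  # the two numbers agree up to rounding
```

With continuous forecasts the same computation is usually done on bins, in which case the identity holds only approximately.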
Expected Calibration Error is the metric most often quoted in the deep-learning calibration literature, made canonical by Guo et al. 2017. ECE partitions predicted confidences into $M$ equal-width bins, computes the absolute gap between average accuracy and average confidence in each bin, and weights by bin size. It is intuitive and easy to plot, but it has well-documented pathologies. The estimate depends sharply on the number of bins and can be near zero by accident on coarse bins, where over- and under-confidence cancel inside a bin. It is also statistically biased when bins are sparsely populated: sampling noise in each bin's accuracy passes through the absolute value and inflates the measured gap. Naeini, Cooper and Hauskrecht discussed bias in binning estimators in their 2015 AAAI paper, and Gupta and Ramdas's 2021 ICML work on distribution-free histogram binning provides finite-sample guarantees that older ECE estimators do not have.
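The binned computation described above can be sketched in a few lines. This is an illustrative implementation, not the reference code of any paper: the name `calibration_errors` is hypothetical, bins are half-open on the right (a convention chosen here for simplicity), and empty bins are skipped.

```python
import numpy as np

def calibration_errors(confidences, correct, n_bins=15):
    """Equal-width binned ECE and MCE for top-label confidences.

    confidences: predicted probability of the chosen class, in [0, 1]
    correct:     1 if the chosen class was right, else 0
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Using only the interior edges yields bin indices 0 .. n_bins - 1.
    idx = np.digitize(confidences, edges[1:-1])
    ece, mce = 0.0, 0.0
    for b in range(n_bins):
        mask = idx == b
        if not mask.any():
            continue  # empty bins contribute nothing
        gap = abs(correct[mask].mean() - confidences[mask].mean())
        ece += mask.mean() * gap  # weight by the fraction of samples in the bin
        mce = max(mce, gap)
    return float(ece), float(mce)

# Ten predictions at confidence 0.9, of which seven are correct.
conf = np.array([0.9] * 10)
hit = np.array([1] * 7 + [0] * 3)
ece, mce = calibration_errors(conf, hit, n_bins=10)
print(ece, mce)  # both are approximately 0.2
```

Rerunning the same function with different `n_bins` on real predictions is a quick way to see how strongly the reported number depends on the binning.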
Reliability diagrams plot the binned empirical accuracy against binned mean confidence on the unit square. A perfectly calibrated model traces the diagonal. Bars above the diagonal indicate under-confidence, bars below indicate over-confidence. The diagram is the visual companion to ECE.
Guo, Pleiss, Sun and Weinberger's 2017 ICML paper On Calibration of Modern Neural Networks is the most cited result on the topic. They observed that older shallow networks like LeNet were already well calibrated, but modern deep architectures such as ResNet, DenseNet, and wide ResNets on CIFAR-100 and ImageNet exhibit ECE between roughly 5 and 20 percent, almost always in the over-confident direction. The authors traced the effect to the interplay between high model capacity, the cross-entropy training objective, and reduced explicit regularization. As models got deeper and wider, training NLL kept improving, but the held-out NLL and ECE worsened. Modern training tricks like batch normalization helped accuracy but appeared to hurt calibration.
The paper's signature contribution is temperature scaling: dividing the pre-softmax logits by a single scalar $T > 0$ tuned by minimizing NLL on a held-out set. Because $T$ is shared across classes and does not change the argmax, temperature scaling preserves accuracy while sharpening or softening the predicted distribution. On CIFAR-100 it brings ECE for ResNet-110 from around 16 percent down to about 1 percent. The simplicity of the method, combined with the size of the effect, made temperature scaling the default first thing to try whenever a deep classifier feels overconfident.
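A minimal sketch of temperature scaling on synthetic data follows. It assumes NumPy only and replaces a gradient-based optimizer with a crude grid search over $T$, which is a simplification made here for brevity. The helper names (`softmax`, `nll`, `fit_temperature`) and the synthetic "over-confident model" are illustrative assumptions.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)  # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(logits, labels, T=1.0):
    """Mean negative log-likelihood of the labels under softmax(logits / T)."""
    probs = softmax(logits / T)
    return float(-np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12)))

def fit_temperature(val_logits, val_labels):
    """Pick T > 0 minimising held-out NLL over a log-spaced grid (grid search is a stand-in)."""
    grid = np.append(np.geomspace(0.1, 10.0, 200), 1.0)  # include T = 1 explicitly
    losses = [nll(val_logits, val_labels, T) for T in grid]
    return float(grid[int(np.argmin(losses))])

# Synthetic over-confident model: labels are drawn from softmax(z),
# but the "model" reports logits 3 * z, i.e. it is too sharp by a factor of 3.
rng = np.random.default_rng(0)
z = rng.normal(size=(5000, 5))
u = rng.random(5000)
labels = (softmax(z).cumsum(axis=1) > u[:, None]).argmax(axis=1)
logits = 3.0 * z

T = fit_temperature(logits, labels)
calibrated = softmax(logits / T)
print(T)  # expected to land near 3 for this synthetic setup
```

Because every logit is divided by the same positive scalar, the predicted class is unchanged; only the confidence attached to it moves.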
Later work has nuanced the picture. Mukhoti et al. 2020 showed in Calibrating Deep Neural Networks using Focal Loss that training with focal loss instead of cross-entropy yields models that are already nearly calibrated, especially under distribution shift, and that focal loss combined with temperature scaling can outperform either alone. Mueller, Kornblith and Hinton's 2019 paper When Does Label Smoothing Help? showed that label smoothing improves top-label calibration but distorts inter-class similarity in ways that hurt knowledge distillation.
Calibration methods fall into two broad families: post-hoc methods that take a trained classifier as fixed and learn a correction map on a held-out set, and training-time methods that bake calibration into the loss or architecture.
| Method | Type | Year | Main idea |
|---|---|---|---|
| Platt scaling | Post-hoc, parametric | 1999 | Fit a one-dimensional logistic regression on the classifier's score |
| Histogram binning | Post-hoc, non-parametric | 2001 | Bin scores; output the empirical positive rate per bin |
| Isotonic regression | Post-hoc, non-parametric | 2002 | Fit a monotone non-decreasing step function via pool-adjacent-violators |
| BBQ (Bayesian Binning into Quantiles) | Post-hoc, non-parametric | 2015 | Bayesian model average over many binnings |
| Temperature scaling | Post-hoc, parametric | 2017 | Single scalar $T$ on softmax logits |
| Vector / matrix scaling | Post-hoc, parametric | 2017 | Per-class affine transform on logits |
| Beta calibration | Post-hoc, parametric | 2017 | Three-parameter family generalizing Platt for skewed score distributions |
| Dirichlet calibration | Post-hoc, parametric, multiclass | 2019 | Linear layer on log-probabilities followed by softmax |
| Non-parametric Gaussian-process calibration | Post-hoc | 2020 | Latent GP over confidence; AISTATS 2020 |
| Spline calibration | Post-hoc, binning-free | 2020-2021 | Differentiable spline fit to empirical CDF |
| Label smoothing | Training-time | 2016 | Replace one-hot labels with $(1-\epsilon)\delta_y + \tfrac{\epsilon}{K}\mathbf{1}$ |
| Mixup | Training-time | 2018 | Train on convex combinations of input-label pairs |
| Focal loss | Training-time | 2017 / 2020 | Down-weight easy examples; reduces overconfidence |
| MC dropout | Training-time | 2016 | Monte-Carlo forward passes with dropout active at test time |
| Deep ensembles | Training-time | 2017 | Average $M$ networks trained from different seeds |
| SWAG | Training-time | 2019 | Gaussian fit to SGD iterates; sample for Bayesian model averaging |
Platt scaling comes from John C. Platt's 1999 chapter Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods in Advances in Large Margin Classifiers. The original target was the SVM, whose decision-function output $f(x)$ has no probabilistic interpretation. Platt fits a sigmoid $P(Y=1 \mid f(x)) = 1/(1 + \exp(A f(x) + B))$ on a held-out validation set with maximum likelihood. The method is two-parameter, fast, and works well when the uncalibrated scores have a roughly sigmoidal distortion, which is typical of max-margin methods. Lin, Lin and Weng 2007 fixed numerical issues with the Levenberg-Marquardt fit Platt originally used.
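The sigmoid fit can be sketched as a two-parameter logistic regression on the raw score. The version below is a simplified illustration, not Platt's original procedure: it uses plain Newton steps, omits any smoothing of the 0/1 targets, and runs on synthetic scores. The name `fit_platt` is hypothetical.

```python
import numpy as np

def fit_platt(scores, labels, n_iter=25):
    """Fit P(Y=1 | f) = 1 / (1 + exp(A*f + B)) by maximum likelihood.

    Internally fits sigmoid(a*f + b) with Newton steps, then returns
    A = -a and B = -b to match the sign convention in the text.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    X = np.column_stack([scores, np.ones_like(scores)])
    w = np.zeros(2)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        grad = X.T @ (p - labels)                     # gradient of the NLL
        hess = (X * (p * (1.0 - p))[:, None]).T @ X   # Hessian of the NLL
        w -= np.linalg.solve(hess + 1e-8 * np.eye(2), grad)
    return float(-w[0]), float(-w[1])

# Synthetic held-out set: the true positive rate is sigmoid(2*f - 0.5).
rng = np.random.default_rng(0)
scores = rng.normal(size=4000)
true_prob = 1.0 / (1.0 + np.exp(-(2.0 * scores - 0.5)))
labels = (rng.random(4000) < true_prob).astype(float)

A, B = fit_platt(scores, labels)
calibrated = 1.0 / (1.0 + np.exp(A * scores + B))
print(A, B)  # should be close to -2 and 0.5 for this synthetic data
```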
Isotonic regression, advocated by Bianca Zadrozny and Charles Elkan in 2001 and 2002, fits a monotone non-decreasing piecewise-constant function to the (score, label) pairs using the pool-adjacent-violators algorithm. The 2002 KDD paper Transforming Classifier Scores into Accurate Multiclass Probability Estimates extends the method to multiclass via one-versus-rest reductions. Isotonic regression is more flexible than Platt scaling because it makes no parametric assumption about the shape of the distortion, but it requires more data to avoid overfitting and can produce flat regions where probabilities collapse to a constant.
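The pool-adjacent-violators step is short enough to write out. The sketch below is a plain-Python illustration with unit weights; the function names are hypothetical, and tied scores are not given any special treatment. It keeps a stack of blocks and merges the last two whenever their means are out of order.

```python
def pool_adjacent_violators(values):
    """Return the non-decreasing sequence closest to `values` in squared error."""
    blocks = []  # each block is [sum of values, number of values]
    for v in values:
        blocks.append([float(v), 1])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted

def fit_isotonic(scores, labels):
    """Sort by score, then fit calibrated probabilities with PAV."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    fitted = pool_adjacent_violators([labels[i] for i in order])
    return [scores[i] for i in order], fitted

sorted_scores, probs = fit_isotonic([0.9, 0.1, 0.4, 0.35, 0.8], [1, 0, 0, 1, 1])
print(sorted_scores)  # [0.1, 0.35, 0.4, 0.8, 0.9]
print(probs)          # [0.0, 0.5, 0.5, 1.0, 1.0]
```

The two middle points are pooled into a flat segment at 0.5, which is the kind of flat region mentioned above.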
Histogram binning sorts scores, partitions them into bins of equal width or equal frequency, and outputs the empirical positive rate inside the bin containing a new score. Gupta and Ramdas 2021 prove that, with the right sample-splitting strategy, uniform-mass histogram binning achieves distribution-free finite-sample calibration guarantees, and they show how to relax the sample-splitting requirement using a Markov property of order statistics.
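As an illustration, equal-frequency (uniform-mass) histogram binning might look like the sketch below. It does not implement the sample-splitting scheme analysed by Gupta and Ramdas; the function names and the fallback value for an empty bin are assumptions made for this example.

```python
import numpy as np

def fit_histogram_binning(scores, labels, n_bins=10):
    """Learn equal-frequency bin edges and the positive rate inside each bin."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.quantile(scores, np.linspace(0.0, 1.0, n_bins + 1))
    inner = edges[1:-1]  # interior edges map any score to a bin 0 .. n_bins - 1
    idx = np.searchsorted(inner, scores, side="right")
    rates = np.array([
        labels[idx == b].mean() if np.any(idx == b) else 0.5  # 0.5 is an arbitrary fallback
        for b in range(n_bins)
    ])
    return inner, rates

def apply_histogram_binning(inner, rates, scores):
    """Replace each score by the positive rate of the bin it falls into."""
    return rates[np.searchsorted(inner, np.asarray(scores, dtype=float), side="right")]

# Synthetic miscalibrated scores: the true positive rate is score ** 2.
rng = np.random.default_rng(0)
scores = rng.random(2000)
labels = (rng.random(2000) < scores ** 2).astype(float)

inner, rates = fit_histogram_binning(scores, labels, n_bins=10)
calibrated = apply_histogram_binning(inner, rates, scores)
```

The output takes at most `n_bins` distinct values, which is the price paid for making no assumption about the shape of the distortion.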
BBQ, Bayesian Binning into Quantiles, was introduced by Naeini, Cooper and Hauskrecht at AAAI 2015. BBQ averages predictions from many histogram-binning models with different bin counts and boundaries, weighting each by its posterior under a uniform prior over partitions and a multinomial-Dirichlet likelihood. It outperforms any single binning and is usually competitive with isotonic regression while reporting better-behaved Brier scores.
Temperature, vector, and matrix scaling were introduced together by Guo et al. 2017. Vector scaling allows a separate gain and bias per class, and matrix scaling allows a full affine transform of logits. The paper finds that vector and matrix scaling can overfit on small validation sets, while temperature scaling almost never does. Temperature scaling is now standard practice for post-hoc calibration of deep classifiers.
Beta calibration, introduced by Kull, Silva Filho and Flach in Beyond Sigmoids (Electronic Journal of Statistics, 2017), generalizes Platt's logistic family by modeling each class's score distribution as a Beta. It includes Platt scaling as a special case but also handles the inverse-sigmoid distortions seen in some boosted models and naive Bayes.
Dirichlet calibration, by Kull, Perello-Nieto, Kangsepp, Silva Filho, Song and Flach at NeurIPS 2019 in Beyond Temperature Scaling, extends beta calibration to multiclass. It is implemented as a linear layer on log-probabilities followed by softmax, which makes it trivial to plug in after a neural network. The authors report improvements in classwise ECE and log-loss across many architectures and datasets.
Spline calibration by Gupta, Rahimi, Ajanthan, Mensink, Sminchisescu and Hartley (ICLR 2021) fits a smooth differentiable spline to the empirical cumulative distribution of confidences, sidestepping the binning artifacts of ECE. It introduces a binning-free Kolmogorov-Smirnov calibration error and shows the spline recalibration map is consistently competitive on ImageNet ResNets.
Non-parametric Gaussian-process calibration by Wenger, Kjellstrom and Triebel at AISTATS 2020 places a latent Gaussian process over the pre-calibration confidence and uses variational inference to obtain a posterior calibration map applicable to any classifier output, not only neural networks.
Label smoothing (Szegedy et al. 2016, originally introduced as part of Inception-v3) replaces the one-hot training target with a mixture of the one-hot label and the uniform distribution. It improves test accuracy on many large image classifiers and, as Mueller, Kornblith and Hinton 2019 showed, also improves top-label calibration by preventing the network from pushing logits to extreme magnitudes. The same paper documents the awkward side effect that label smoothing erases the inter-class similarity structure that knowledge distillation depends on.
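The smoothed target is simple to construct. The sketch below follows the mixture $(1-\epsilon)\delta_y + \tfrac{\epsilon}{K}\mathbf{1}$ given in the methods table; the function name `smooth_labels` and the default $\epsilon = 0.1$ are illustrative choices.

```python
import numpy as np

def smooth_labels(y, num_classes, eps=0.1):
    """Mix one-hot targets with the uniform distribution over classes."""
    onehot = np.eye(num_classes)[np.asarray(y)]
    return (1.0 - eps) * onehot + eps / num_classes

targets = smooth_labels(np.array([2, 0]), num_classes=4, eps=0.1)
print(targets[0])  # approximately [0.025, 0.025, 0.925, 0.025]
```

Each row still sums to one, but the true class can no longer receive probability exactly 1, which removes the incentive to push logits to extreme magnitudes.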
Mixup (Zhang, Cisse, Dauphin and Lopez-Paz, ICLR 2018) trains on convex combinations of pairs of training examples and their one-hot labels. The resulting models tend to be both more accurate and better calibrated, since the soft mixed labels prevent the network from collapsing all of its mass onto a single class. Mixup also improves robustness to label noise and adversarial perturbations.
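The mixing step can be sketched as follows, assuming one mixing coefficient per batch drawn from a Beta($\alpha$, $\alpha$) distribution and partners chosen by a random permutation of the batch. The function name `mixup_batch` and the default $\alpha = 0.2$ are assumptions for illustration.

```python
import numpy as np

def mixup_batch(x, y_onehot, alpha=0.2, rng=None):
    """Return convex combinations of examples and of their one-hot labels."""
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)     # mixing coefficient in [0, 1]
    perm = rng.permutation(len(x))   # partner index for each example
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mix, y_mix, lam

x = np.arange(12.0).reshape(4, 3)    # four toy inputs with three features each
y = np.eye(2)[[0, 1, 1, 0]]          # one-hot labels for two classes
x_mix, y_mix, lam = mixup_batch(x, y, rng=np.random.default_rng(0))
```

The mixed labels are soft whenever an example is paired with one from a different class, which is the mechanism credited above for the calibration benefit.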
Focal loss was introduced by Lin et al. 2017 for object detection, where extreme class imbalance dominated training. Mukhoti et al. 2020 found that, beyond detection, focal loss significantly improves classifier calibration on its own, and that combining focal loss with a final temperature-scaling pass yields among the best ECE numbers on standard benchmarks. Focal loss can be viewed as an upper bound on the regularized cross-entropy loss with an entropy penalty.
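In its basic form the focal loss multiplies the per-example cross-entropy by $(1 - p_t)^\gamma$, where $p_t$ is the probability assigned to the true class. The sketch below omits the optional class-balancing weight used in detection, and the function name and toy probabilities are illustrative.

```python
import numpy as np

def focal_loss(probs, labels, gamma=2.0):
    """Mean focal loss; gamma = 0 recovers ordinary cross-entropy."""
    p_t = probs[np.arange(len(labels)), labels]  # probability of the true class
    return float(np.mean(-((1.0 - p_t) ** gamma) * np.log(p_t)))

probs = np.array([[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]])
labels = np.array([0, 0, 1])

cross_entropy = focal_loss(probs, labels, gamma=0.0)
focal = focal_loss(probs, labels, gamma=2.0)
print(cross_entropy, focal)  # the focal value is smaller
```

Confident, correct examples (large $p_t$) are down-weighted the most, so the gradient pressure to make already-confident predictions even more confident is reduced.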
Bayesian neural networks and MC dropout. A fully Bayesian neural network maintains a posterior over weights and integrates over it to produce predictions, which under suitable priors yields well-calibrated uncertainty by construction. Exact Bayesian inference is intractable in general, so practitioners use approximations. Yarin Gal and Zoubin Ghahramani's 2016 ICML paper Dropout as a Bayesian Approximation showed that dropout, applied at both training and inference time, can be interpreted as a variational approximation to a deep Gaussian process, giving rise to MC dropout: average $M$ stochastic forward passes through the network. Sampling-based posterior approximations including Hamiltonian Monte Carlo and stochastic gradient Langevin dynamics are also used. See Bayesian Neural Network for the broader picture and Bayesian Optimization for a related use of calibrated uncertainty.
Deep ensembles. Lakshminarayanan, Pritzel and Blundell's 2017 NeurIPS paper Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles trains a small number of independent neural networks from different random initializations and averages their predictive distributions. Ensembles produce strikingly well-calibrated uncertainty in practice, often beating MC dropout and matching or beating sophisticated Bayesian methods, at a cost that scales linearly with ensemble size. Whether deep ensembles are better viewed as Bayesian or as just a strong baseline remains debated. See ensemble learning and ensemble for related ideas.
SWA-Gaussian (SWAG). Maddox, Garipov, Izmailov, Vetrov and Wilson 2019 fit a Gaussian to the trajectory of SGD iterates near the end of training. Sampling weights from this Gaussian produces an ensemble for free, with calibration competitive with deep ensembles at a fraction of the compute.
In regression the analog of calibration is quantile calibration or interval coverage: when a predictor outputs a 90 percent prediction interval, that interval should contain the true value 90 percent of the time on average. Two large families of techniques target this property.
Quantile regression trains the model to predict specified quantiles directly under the pinball loss. This produces self-consistent intervals for the predicted distribution but does not by itself give frequentist coverage; recalibration on a held-out set is usually needed.
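For reference, the pinball loss at quantile level $\tau$ is $\max(\tau (y - q), (\tau - 1)(y - q))$ averaged over examples. The sketch below is illustrative (the function name and toy data are assumptions) and checks that the best constant prediction sits at the empirical quantile.

```python
import numpy as np

def pinball_loss(y, q, tau):
    """Mean pinball (quantile) loss of predictions q at quantile level tau."""
    diff = np.asarray(y, dtype=float) - np.asarray(q, dtype=float)
    return float(np.mean(np.maximum(tau * diff, (tau - 1.0) * diff)))

# For y = 1..100 and tau = 0.9, the best constant lies between 90 and 91,
# where the loss is flat.
y = np.arange(1, 101)
candidates = np.arange(1, 101)
losses = [pinball_loss(y, np.full(100, c), tau=0.9) for c in candidates]
best = int(candidates[int(np.argmin(losses))])
print(best)
```

Under-prediction is penalised with weight $\tau$ and over-prediction with weight $1 - \tau$, which is what pulls the minimiser toward the $\tau$-quantile rather than the mean.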
Conformal prediction is the dominant distribution-free technique. Originating with Vovk, Gammerman and Shafer in the early 2000s and popularized for the deep-learning era by Anastasios Angelopoulos and Stephen Bates's 2021 tutorial A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification, conformal methods construct prediction sets with finite-sample marginal coverage guarantees under the assumption of exchangeable data. Conformal prediction works on top of any base model, makes no assumption about the base model's accuracy or calibration, and the coverage guarantee holds in finite samples for any data distribution. Conformal techniques have been adapted to time series, structured outputs, distribution shift, and large language models.
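One of the simplest variants, split conformal prediction for regression with absolute-residual scores, can be sketched as follows. The base model, the synthetic data, and the function name `conformal_radius` are all hypothetical; the only ingredient taken from the method itself is the finite-sample-corrected rank $\lceil (n+1)(1-\alpha) \rceil$.

```python
import math
import numpy as np

def conformal_radius(cal_residuals, alpha=0.1):
    """Half-width q so that [pred - q, pred + q] has marginal coverage >= 1 - alpha."""
    scores = np.sort(np.abs(cal_residuals))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected rank
    return float(scores[k - 1]) if k <= n else float("inf")

def model(x):
    """Hypothetical fixed base model; deliberately misspecified (true slope is 2.5)."""
    return 2.0 * x

rng = np.random.default_rng(0)
x_cal, x_test = rng.uniform(-1, 1, 1000), rng.uniform(-1, 1, 2000)
y_cal = 2.5 * x_cal + rng.normal(scale=0.3, size=1000)
y_test = 2.5 * x_test + rng.normal(scale=0.3, size=2000)

# Calibrate on held-out residuals, then check coverage on fresh data.
q = conformal_radius(y_cal - model(x_cal), alpha=0.1)
covered = np.abs(y_test - model(x_test)) <= q
coverage = float(covered.mean())
print(q, coverage)  # coverage should be close to 0.9
```

The base model is wrong on purpose: the intervals become wider to compensate, but the coverage target is still met, which illustrates why the guarantee does not depend on the model's accuracy.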
Calibration by group asks that the calibration condition holds within each demographic subgroup: $P(Y=1 \mid \hat p, G = g) = \hat p$ for every group $g$. It is closely related to predictive parity, which imposes the analogous condition on thresholded decisions by requiring equal positive predictive value across groups. The constraint is appealing because it says the score means the same thing regardless of group membership.
Kleinberg, Mullainathan and Raghavan's 2017 ITCS paper Inherent Trade-Offs in the Fair Determination of Risk Scores and Alexandra Chouldechova's 2017 Big Data paper Fair Prediction with Disparate Impact independently proved an impossibility theorem: when groups have different base rates, calibration by group, equal false-positive rates, and equal false-negative rates cannot all hold simultaneously, except in degenerate cases. The result has shaped the algorithmic-fairness literature ever since, because it forces practitioners to choose which fairness criterion to prioritize. See algorithmic fairness for the wider context, and the AI Fairness 360 and Fairlearn toolkits for software implementations of the standard criteria.
Large language models present new calibration questions because they output token distributions rather than calibrated class probabilities, and because their inputs often have no fixed schema. Several recent results frame the field:
Related topics include AI deception, where miscalibrated or strategically over-confident outputs become a safety concern, and the role of temperature sampling at decoding time, which interacts with calibration in nontrivial ways.
Van Calster, McLernon, van Smeden, Wynants and Steyerberg's 2019 BMC Medicine paper Calibration: the Achilles Heel of Predictive Analytics argued that calibration is the single most under-checked property of clinical prediction models. They distinguish four levels of calibration: mean calibration, which is agreement of the mean predicted probability with the overall event rate, followed by weak, moderate, and strong calibration. The paper made calibration plots a near-mandatory companion to AUC in clinical model validation. Regulators in oncology, cardiology and intensive care have since incorporated calibration assessment into model approval workflows.
Calibration concepts apply differently to generative models because there is no single label to compare to. For a generative model used as a classifier, such as an energy-based model or a normalizing flow followed by a softmax head, the standard binary or multiclass calibration definitions apply. For unconditional generative models such as diffusion models, the relevant property is whether the model's likelihood function (where it has one) corresponds to the data distribution; this is closer to a goodness-of-fit question than to calibration in the classifier sense. Diffusion models do not produce calibrated probabilities for sampled images, since each sample is just one trajectory through the reverse process. Density-based generative models such as flows do, but the relationship between log-likelihood and perceptual quality is famously loose.
| Library | Language | Coverage |
|---|---|---|
| scikit-learn sklearn.calibration | Python | CalibratedClassifierCV (Platt and isotonic), calibration_curve, CalibrationDisplay |
| netcal | Python (PyTorch / TensorFlow) | Binning, scaling, and regularization methods; classification and regression calibration |
| Uncertainty Toolbox (Chung et al. 2021) | Python | Regression-focused uncertainty metrics, visualizations, and recalibration |
| pycalib (Wenger et al.) | Python | Non-parametric GP calibration and other multiclass methods |
| MAPIE | Python | Conformal prediction for classification, regression, and time series |
| TorchMetrics CalibrationError | Python (PyTorch) | ECE, MCE, and class-wise variants |
The scikit-learn CalibratedClassifierCV is the most widely used entry point in classical machine learning: it cross-validates a base estimator, fits Platt or isotonic calibration on the held-out folds, and averages the resulting probabilities. The companion calibration_curve and CalibrationDisplay produce reliability diagrams.
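A short usage sketch, assuming scikit-learn is installed, is shown below. The synthetic dataset, the choice of `LinearSVC` as base estimator, and the bin count are illustrative choices rather than recommendations.

```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# LinearSVC has no predict_proba; the wrapper adds one through
# cross-validated sigmoid (Platt) calibration. Use method="isotonic"
# for the non-parametric alternative.
clf = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=5)
clf.fit(X_train, y_train)
prob_pos = clf.predict_proba(X_test)[:, 1]

# Points of a reliability diagram: observed positive rate per bin
# against the mean predicted probability in that bin.
frac_pos, mean_pred = calibration_curve(y_test, prob_pos, n_bins=10)
```

Plotting `mean_pred` against `frac_pos` gives the reliability diagram; points near the diagonal indicate good calibration on the test split.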
Proper scoring rules are loss functions that are minimized in expectation by reporting the true conditional probability. The Brier score and the negative log-likelihood are the two most common; both are strictly proper. DeGroot and Fienberg 1983 showed that any proper scoring rule decomposes into a calibration component and a refinement (resolution) component, plus an irreducible uncertainty term. Optimizing a proper scoring rule therefore rewards both calibration and refinement at once. This is why log-loss and Brier score, rather than accuracy or AUC, are the right defaults for evaluating probabilistic predictors. Accuracy ignores miscalibration entirely; AUC depends only on the ranking of the scores, so it is unchanged by any strictly monotone transform and cannot detect miscalibration.
In model selection it is now standard to track at least one proper scoring rule alongside accuracy and AUC. For multiclass problems classwise ECE (or its smoother spline-based counterparts) is reported alongside log-loss. For regression, mean continuous ranked probability score (CRPS) and prediction-interval coverage are reported alongside RMSE.
The last few years have seen calibration broaden out from a post-hoc fixup into a first-class evaluation criterion across the field.