Expected calibration error

Updated Jul 13, 2026

Expected calibration error (ECE) is a metric that measures how well a classifier's predicted confidence matches its observed accuracy. A model is well calibrated when the probabilities it reports line up with how often it is right: among all the predictions it makes with 70 percent confidence, about 70 percent should turn out correct. ECE summarizes the gap between confidence and accuracy as a single number by sorting predictions into bins and averaging the per-bin discrepancy. It became one of the most widely used tools for evaluating the reliability of machine learning classifiers, especially deep neural networks, after Chuan Guo and colleagues used it in a 2017 paper at the International Conference on Machine Learning (ICML) to show that "modern neural networks, unlike those from a decade ago, are poorly calibrated" ^[1].

Calibration matters whenever a downstream decision depends on the probability a model reports rather than just its top guess. A medical triage system that flags a scan as 90 percent likely to be malignant, a self-driving stack that weighs detections by confidence, or a language model that should know when to defer to a human all rely on probabilities that mean what they say. Accuracy alone does not capture this. Two models can reach the same accuracy while one reports honest probabilities and the other is systematically overconfident.

What does it mean for a model to be calibrated?

Formally, let a model produce a predicted label $\hat{y}$ and an associated confidence $\hat{p}$ for an input whose true label is $Y$. The model is perfectly calibrated if, for every confidence level $p \in [0, 1]$, the probability that the prediction is correct given that confidence equals $p$:

$$P(\hat{y} = Y \mid \hat{p} = p) = p \quad \text{for all } p \in [0, 1]$$

This is the definition used by Guo et al. ^[1], following earlier work on calibrated probabilities by Naeini, Cooper, and Hauskrecht ^[2]. The condition cannot be checked directly from a finite sample: the confidence $\hat{p}$ is a continuous quantity, so no single value of $p$ has enough predictions attached to it to estimate the conditional accuracy. The standard workaround is to group predictions into a small number of confidence intervals and check calibration within each group.

A few terms recur in this literature. Confidence usually refers to the probability the model assigns to its predicted (top-1) class, although some metrics consider the full probability vector. Overconfidence means confidence exceeds accuracy, the most common failure mode for deep networks. Underconfidence is the reverse. These notions are distinct from discrimination, which is a model's ability to separate correct from incorrect cases; a model can rank examples well yet still report poorly scaled probabilities.

Reliability diagrams

The reliability diagram is the visual companion to ECE. Predictions are partitioned into $M$ equal-width confidence bins, for example the ten intervals [0, 0.1), [0.1, 0.2), and so on up to [0.9, 1.0]. For each bin the diagram plots the average confidence on the x axis against the empirical accuracy on the y axis, usually as a bar. A perfectly calibrated model produces bars whose tops lie on the diagonal line $y = x$. Bars that fall below the diagonal indicate overconfidence (the model claims more certainty than it earns), and bars above it indicate underconfidence. The gap between each bar and the diagonal is the quantity ECE averages, with each bin weighted by the share of predictions it holds. Reliability diagrams trace back to forecast verification work by DeGroot and Fienberg, whose 1983 study of the comparison and evaluation of forecasters formalized the calibration and refinement of probability forecasts ^[9], and they remain the most direct way to see the shape of a model's miscalibration rather than just its magnitude ^[1].
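The numbers behind such a diagram can be produced with a few lines of code. The following is a minimal sketch in plain Python, assuming top-label confidences in [0, 1] and a parallel list of 0/1 correctness flags; the function name `reliability_bins` and the toy data are illustrative, not taken from any cited paper.

```python
def reliability_bins(confidences, correct, n_bins=10):
    """Return (count, mean confidence, accuracy) for each equal-width bin."""
    stats = [[0, 0.0, 0] for _ in range(n_bins)]  # count, confidence sum, hits
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin, which is closed.
        idx = min(int(conf * n_bins), n_bins - 1)
        stats[idx][0] += 1
        stats[idx][1] += conf
        stats[idx][2] += int(hit)
    rows = []
    for count, conf_sum, hits in stats:
        if count == 0:
            rows.append((0, None, None))  # empty bin: nothing to plot
        else:
            rows.append((count, conf_sum / count, hits / count))
    return rows


# Toy data (hypothetical): five predictions and whether each was right.
confidences = [0.95, 0.92, 0.65, 0.61, 0.55]
correct = [1, 0, 1, 1, 0]
rows = reliability_bins(confidences, correct)
for m, (count, mean_conf, accuracy) in enumerate(rows):
    if count:
        print(f"bin {m}: n={count} confidence={mean_conf:.3f} accuracy={accuracy:.3f}")
```

Each non-empty row gives one bar: the mean confidence is the x position and the accuracy is the bar height.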

How is ECE defined and computed?

To compute ECE, the $n$ predictions are divided into $M$ bins $B_1$ through $B_M$ by their confidence. For each bin two quantities are measured. The accuracy of bin $B_m$ is the fraction of its examples that are classified correctly:

$$\text{acc}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \mathbf{1}(\hat{y}_i = y_i)$$

and the confidence of the bin is the mean predicted probability of its members:

$$\text{conf}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \hat{p}_i$$

Expected calibration error is the weighted average of the absolute gap between these two, where each bin is weighted by the share of data it contains ^[1]:

$$\text{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \text{acc}(B_m) - \text{conf}(B_m) \right|$$

The result is a number in [0, 1], and lower is better, with 0 meaning the binned estimate detects no miscalibration. Guo et al. used $M = 15$ equal-width bins in their experiments, and 10 or 15 bins remain common defaults ^[1]. ECE in this top-label form was introduced by Naeini et al. in 2015, in the paper that proposed the Bayesian Binning into Quantiles (BBQ) calibration method, and then popularized for deep learning by Guo et al. in 2017 ^[1]^[2].
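The three formulas translate almost line for line into code. The sketch below is one possible plain-Python rendering, assuming equal-width bins, top-label confidences in [0, 1], and 0/1 correctness flags; the function name and the toy inputs are hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=15):
    """Binned top-label ECE with equal-width bins."""
    n = len(confidences)
    counts = [0] * n_bins
    conf_sums = [0.0] * n_bins
    hits = [0] * n_bins
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # 1.0 falls in the last bin
        counts[idx] += 1
        conf_sums[idx] += conf
        hits[idx] += int(hit)
    ece = 0.0
    for count, conf_sum, hit_count in zip(counts, conf_sums, hits):
        if count == 0:
            continue  # empty bins carry zero weight
        accuracy = hit_count / count          # acc(B_m)
        mean_conf = conf_sum / count          # conf(B_m)
        ece += (count / n) * abs(accuracy - mean_conf)
    return ece


# Toy data: two confident predictions (one wrong), two hesitant ones (both right).
confidences = [0.95, 0.95, 0.55, 0.55]
correct = [1, 0, 1, 1]
ece = expected_calibration_error(confidences, correct)
print(ece)
```

In this toy case the high-confidence bin is overconfident and the low-confidence bin is underconfident by the same amount. Because ECE takes absolute values per bin, the two errors add up rather than cancel.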

Maximum calibration error

Maximum calibration error (MCE) replaces the average with the worst case. Instead of weighting bins by their population, it reports the single largest gap across all bins ^[1]:

$$\text{MCE} = \max_{m \in \{1, \ldots, M\}} \left| \text{acc}(B_m) - \text{conf}(B_m) \right|$$

MCE is the more conservative metric and is used in high-stakes settings where any region of severe miscalibration is a problem, even if it covers few examples. ECE answers "how miscalibrated is the model on average," while MCE answers "how bad does it get anywhere." Because MCE ignores how many predictions fall in the offending bin, a single sparse bin can dominate it, which makes MCE noisier than ECE on small datasets.
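The sensitivity of MCE to a sparse bin is easy to reproduce. The following sketch, with a hypothetical helper `bin_gaps` and made-up data, builds 100 well-calibrated predictions plus a single confident mistake and compares the two metrics.

```python
def bin_gaps(confidences, correct, n_bins=10):
    """Return (weight, |accuracy - confidence|) for each non-empty bin."""
    n = len(confidences)
    counts = [0] * n_bins
    conf_sums = [0.0] * n_bins
    hits = [0] * n_bins
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        counts[idx] += 1
        conf_sums[idx] += conf
        hits[idx] += int(hit)
    return [
        (c / n, abs(h / c - s / c))
        for c, s, h in zip(counts, conf_sums, hits)
        if c
    ]


# 100 predictions at 0.75 confidence, 75 of them right (calibrated),
# plus one prediction at 0.95 confidence that is wrong.
confidences = [0.75] * 100 + [0.95]
correct = [1] * 75 + [0] * 25 + [0]

gaps = bin_gaps(confidences, correct)
ece = sum(weight * gap for weight, gap in gaps)  # weighted average gap
mce = max(gap for _, gap in gaps)                # worst single bin
print(f"ECE={ece:.4f}  MCE={mce:.2f}")
```

One stray example is enough to push MCE to 0.95 here, while ECE stays below 0.01 because that bin holds under one percent of the data.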

Brier score and negative log-likelihood

ECE and MCE are convenient and interpretable, but they are not proper scoring rules: a model can drive them toward zero without producing genuinely useful probabilities (a constant predictor that outputs the base rate can score a low ECE). Proper scoring rules avoid this. A scoring rule is proper if reporting the true probabilities minimizes its expected value, and strictly proper if the true probabilities are the only minimizer ^[11].

The Brier score, proposed by Glenn W. Brier in 1950 in "Verification of forecasts expressed in terms of probability," is the mean squared error between predicted probabilities and outcomes ^[3]. For binary predictions with forecast probability $f_t$ and outcome $o_t \in \{0, 1\}$,

$$\text{BS} = \frac{1}{N} \sum_t (f_t - o_t)^2$$

which ranges from 0 to 1, lower being better. It extends to multiple classes by summing the squared error over every class, in which case the range becomes 0 to 2. The Brier score is strictly proper, and Allan Murphy showed in 1973 that it decomposes additively into reliability, resolution, and uncertainty terms, $\text{BS} = \text{reliability} - \text{resolution} + \text{uncertainty}$ ^[10]. The reliability term is itself a calibration measure, which links the Brier score directly to the idea ECE captures.
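A small numeric check makes the decomposition concrete. The sketch below groups binary forecasts by their distinct values, which is the setting in which the three-term identity holds exactly; the function names and the data are illustrative assumptions. It also compares an informative forecaster with a constant base-rate forecaster, both of which are perfectly reliable.

```python
from collections import defaultdict


def brier_score(forecasts, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)


def murphy_decomposition(forecasts, outcomes):
    """Return (reliability, resolution, uncertainty), grouping by forecast value."""
    n = len(forecasts)
    base_rate = sum(outcomes) / n
    groups = defaultdict(list)
    for f, o in zip(forecasts, outcomes):
        groups[f].append(o)
    reliability = 0.0
    resolution = 0.0
    for f, observed in groups.items():
        freq = sum(observed) / len(observed)  # observed frequency in this group
        reliability += len(observed) * (f - freq) ** 2
        resolution += len(observed) * (freq - base_rate) ** 2
    return reliability / n, resolution / n, base_rate * (1 - base_rate)


outcomes = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]
informative = [0.2] * 5 + [0.8] * 5  # says 0.2 where 1 in 5 occur, 0.8 where 4 in 5 occur
constant = [0.5] * 10                # always reports the base rate

for name, forecasts in [("informative", informative), ("constant", constant)]:
    bs = brier_score(forecasts, outcomes)
    rel, res, unc = murphy_decomposition(forecasts, outcomes)
    print(f"{name}: BS={bs:.3f} reliability={rel:.3f} resolution={res:.3f} uncertainty={unc:.3f}")
```

Both forecasters have a reliability term of zero, so a pure calibration measure cannot tell them apart. The Brier score separates them through the resolution term, which rewards forecasts that differ from the base rate in the right direction.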

Negative log-likelihood (NLL), also called cross-entropy loss, is the other standard proper scoring rule. For predicted distribution $\hat{\pi}$ over labels,

$$\text{NLL} = -\sum_{i=1}^{n} \log \hat{\pi}(y_i \mid x_i)$$

and it is the objective most classifiers are trained on. Guo et al. note that NLL is minimized exactly when the model recovers the true conditional distribution, and they track the gap between NLL and classification error as a sign of overfitting in probability space rather than in accuracy ^[1]. Both Brier and NLL conflate calibration with discrimination, so they reward a model that both ranks examples well and scales its probabilities honestly. ECE isolates the calibration piece but loses the propriety guarantee. In practice researchers report several of these together.
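To see how NLL penalizes badly scaled probabilities even when accuracy is unchanged, consider the following sketch. The two "models" and the labels are invented for illustration: both always predict class 0 and are right 7 times out of 10, but one reports 70 percent confidence and the other 99 percent.

```python
import math


def negative_log_likelihood(prob_rows, labels):
    """Sum of -log(probability assigned to the true label)."""
    return -sum(math.log(row[y]) for row, y in zip(prob_rows, labels))


labels = [0] * 7 + [1] * 3           # class 0 is correct 70% of the time
honest = [[0.70, 0.30]] * 10         # confidence matches accuracy
overconfident = [[0.99, 0.01]] * 10  # same argmax, inflated confidence

nll_honest = negative_log_likelihood(honest, labels)
nll_overconfident = negative_log_likelihood(overconfident, labels)
print(f"honest: {nll_honest:.2f}  overconfident: {nll_overconfident:.2f}")
```

The overconfident model gains little on the seven examples it gets right and pays heavily on the three it gets wrong, so its NLL is more than twice that of the honest model despite identical accuracy.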

Why are deep networks miscalibrated, and how does temperature scaling help?

A central finding of Guo et al. is that the accurate deep networks of the mid-2010s were substantially more overconfident than the shallower, less accurate models of a decade earlier ^[1]. They traced this to several design choices. Increasing network depth and width improves accuracy but worsens calibration. Batch normalization, despite helping optimization, tends to make predictions more overconfident. Reducing weight decay (less regularization) also degrades calibration. The common thread is that high-capacity networks trained to minimize NLL keep pushing probabilities toward 0 and 1 long after classification error has plateaued, overfitting the loss in a way that does not show up as a drop in accuracy.

Their proposed remedy is temperature scaling, a post-hoc method and a single-parameter special case of Platt scaling ^[4]. After training, the logits $z_i$ for an example are divided by a scalar temperature $T$ before the softmax, and the calibrated confidence is

$$\hat{q}_i = \max_k \mathrm{softmax}(z_i / T)_k$$

The temperature $T > 0$ is fit by minimizing NLL on a held-out validation set, with the network's weights frozen. Because a positive $T$ scales all logits uniformly, it never changes which class is the argmax, so accuracy is untouched while the probabilities are softened ($T > 1$) or sharpened ($T < 1$). Guo et al. found that this one extra parameter was often enough to match or outperform more elaborate calibration schemes such as isotonic regression or histogram binning, which is why temperature scaling became a default baseline ^[1]. Earlier scaling approaches include Platt scaling for support vector machines ^[4] and the comparison of Platt scaling against isotonic regression by Niculescu-Mizil and Caruana ^[5].
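The fitting step can be sketched without any deep learning framework. The example below is a plain-Python illustration under stated assumptions: the validation logits are a made-up toy set, and the one-dimensional optimizer is a golden-section search chosen here for simplicity (the source does not prescribe an optimizer). The search relies on validation NLL having a single minimum in $T$, which holds because NLL is convex in $1/T$.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Log-probabilities of softmax(logits / temperature)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    log_total = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_total for s in scaled]


def validation_nll(logit_rows, labels, temperature):
    return -sum(log_softmax(row, temperature)[y] for row, y in zip(logit_rows, labels))


def fit_temperature(logit_rows, labels, lo=0.05, hi=20.0, iters=80):
    """Golden-section search for the T in [lo, hi] that minimizes validation NLL."""
    ratio = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    for _ in range(iters):
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if validation_nll(logit_rows, labels, c) < validation_nll(logit_rows, labels, d):
            b = d  # the minimum lies in [a, d]
        else:
            a = c  # the minimum lies in [c, b]
    return (a + b) / 2


# Toy validation set: the model always favours class 0 with logits (4, 0),
# about 98% confidence, but is right only 3 times out of 4.
val_logits = [[4.0, 0.0]] * 4
val_labels = [0, 0, 0, 1]

T = fit_temperature(val_logits, val_labels)
raw_conf = math.exp(max(log_softmax(val_logits[0], 1.0)))
calibrated_conf = math.exp(max(log_softmax(val_logits[0], T)))
print(f"T={T:.3f}  confidence before={raw_conf:.3f}  after={calibrated_conf:.3f}")
```

In this toy case the fitted temperature is greater than 1 and pulls the confidence down to the observed accuracy of 0.75, while the predicted class stays the same.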

This narrative was later qualified. In a 2021 study titled "Revisiting the Calibration of Modern Neural Networks," Minderer et al. systematically compared calibration and accuracy across newer image classifiers and found that "the most recent models, notably those not using convolutions, are among the best calibrated" ^[13]. Non-convolutional architectures such as Vision Transformers and MLP-Mixer were simultaneously more accurate and better calibrated than the convolutional networks Guo et al. had studied, and the earlier decay of calibration with model size was much less pronounced. Because model size and amount of pretraining did not fully account for the difference, the authors concluded that architecture is a major determinant of calibration ^[13]. The practical takeaway is that overconfidence is a property of particular designs rather than an inevitable cost of accuracy or scale, while temperature scaling remains a cheap, accuracy-preserving correction across architectures.

A worked example

The table below shows a small reliability computation with five confidence bins over 1,000 predictions. The per-bin contribution to ECE is the bin weight times the absolute confidence-accuracy gap.

Bin (confidence)	Samples	Avg confidence	Accuracy	Gap	Weight	Contribution
[0.5, 0.6)	100	0.55	0.58	0.03	0.10	0.0030
[0.6, 0.7)	150	0.65	0.60	0.05	0.15	0.0075
[0.7, 0.8)	200	0.75	0.68	0.07	0.20	0.0140
[0.8, 0.9)	250	0.85	0.74	0.11	0.25	0.0275
[0.9, 1.0]	300	0.96	0.82	0.14	0.30	0.0420

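Summing the contributions gives ECE = 0.094, while MCE is the largest gap, 0.14, found in the most populated high-confidence bin. The Gap column is an absolute value: in the lowest bin accuracy (0.58) slightly exceeds confidence (0.55), a mild underconfidence, whereas in the other four bins accuracy trails confidence. That pattern, with the gap widening as confidence rises, is the textbook signature of an overconfident network.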
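The arithmetic in the table can be checked directly. This short sketch re-enters the table's sample counts, average confidences, and accuracies, and recomputes the weights, gaps, ECE, and MCE; the variable names are illustrative.

```python
# (samples, average confidence, accuracy) for each row of the table above
table = [
    (100, 0.55, 0.58),
    (150, 0.65, 0.60),
    (200, 0.75, 0.68),
    (250, 0.85, 0.74),
    (300, 0.96, 0.82),
]

n = sum(samples for samples, _, _ in table)
gaps = [abs(acc - conf) for _, conf, acc in table]
weights = [samples / n for samples, _, _ in table]

ece = sum(w * g for w, g in zip(weights, gaps))  # weighted average gap
mce = max(gaps)                                  # worst bin
signed = [acc - conf for _, conf, acc in table]  # positive means underconfident

print(f"n={n}  ECE={ece:.3f}  MCE={mce:.2f}")
```

The signed differences are positive only for the first bin, which is the single bin where the model is underconfident.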

What are the known pitfalls of ECE?

The binning that makes ECE tractable is also its main weakness. The estimate depends on the number of bins and the binning scheme, and there is no canonical choice. Too few bins hide miscalibration by averaging opposing errors together; too many bins leave each bin with so few samples that the per-bin accuracy estimate becomes noisy, and that sampling noise inflates the ECE estimate. Equal-width binning is especially fragile for deep networks, whose confidences pile up near 1, so most predictions land in the last bin or two while the low-confidence bins sit nearly empty and contribute little ^[6].

Several variants try to repair this. Adaptive (equal-mass) binning, used in the Adaptive Calibration Error (ACE) of Nixon et al., draws bin edges so that each bin holds roughly the same number of samples, giving every region equal statistical weight ^[6]. The same work introduces a Thresholded Adaptive Calibration Error (TACE) that drops near-zero probabilities, which matters when there are hundreds of classes, and a Static Calibration Error (SCE) that averages over all class probabilities rather than only the top one. A more recent line replaces hard bins entirely: the smooth ECE (smECE) of Jaroslaw Blasiok and Preetum Nakkiran (2023) applies Gaussian kernel smoothing with a principled bandwidth, producing a continuous calibration measure and a reliability diagram without an arbitrary bin count ^[7]. They connect smECE to a notion of a consistent calibration measure that provably tracks the Wasserstein distance to perfect calibration. Common practical advice is to report the bin count, check robustness across several settings, and pair ECE with a proper scoring rule ^[6]^[7].
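Equal-mass binning is straightforward to sketch. The example below is a simplified top-label illustration of the binning idea only, not the exact ACE of Nixon et al. (which also ranges over classes); the function name and data are hypothetical.

```python
def equal_mass_ece(confidences, correct, n_bins=5):
    """Top-label calibration error with bins holding roughly equal counts."""
    pairs = sorted(zip(confidences, correct))  # order predictions by confidence
    n = len(pairs)
    total = 0.0
    for m in range(n_bins):
        # Integer arithmetic splits the sorted list into near-equal chunks.
        chunk = pairs[m * n // n_bins:(m + 1) * n // n_bins]
        if not chunk:
            continue
        mean_conf = sum(conf for conf, _ in chunk) / len(chunk)
        accuracy = sum(int(hit) for _, hit in chunk) / len(chunk)
        total += (len(chunk) / n) * abs(accuracy - mean_conf)
    return total


# Confidences that pile up near 1, as is typical for deep networks.
confidences = [0.6, 0.7, 0.8, 0.9, 0.9, 0.99, 0.99, 0.99, 0.99, 0.99]
correct = [1, 0, 1, 1, 1, 1, 1, 1, 1, 0]
print(equal_mass_ece(confidences, correct, n_bins=2))
```

With two equal-mass bins, the five predictions at 0.99 get a bin of their own instead of sharing an equal-width bin with everything above 0.9, so the overconfidence at the top end is measured on its own.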

How well calibrated are large language models?

Calibration studies have moved from image classifiers to large language models. The most cited result comes from OpenAI's GPT-4 technical report, which examined how well the model's reported probability on multiple-choice questions matched its accuracy. The report states that "the pre-trained model is highly calibrated (its predicted confidence in an answer generally matches the probability of being correct)," reporting an ECE of about 0.007 on a subset of MMLU and a reliability curve close to the diagonal ^[8]. After post-training with reinforcement learning from human feedback, "the calibration is reduced," with ECE rising roughly tenfold to about 0.074 ^[8]. RLHF makes the model more helpful and better at following instructions, but it tends to push the model's probabilities toward overconfident, assertive answers, which is a recognized tension between alignment and calibration. Similar findings appear across instruction-tuned and chat models, where pre-trained next-token probabilities are often better calibrated than the post-trained policy. Calibrating language models is harder than calibrating image classifiers because confidence can be read from token log-probabilities, from verbalized statements like "I am 80 percent sure," or from sampling consistency, and these signals do not always agree.

Relation to uncertainty quantification and selective prediction

Calibration is one component of the broader field of uncertainty quantification, which also covers separating aleatoric (data) from epistemic (model) uncertainty, out-of-distribution detection, and methods such as Bayesian neural networks, deep ensembles, and conformal prediction. A well-calibrated confidence score is the input that makes many of these usable. It is especially central to selective prediction (also called classification with a reject option), studied for deep networks by Geifman and El-Yaniv ^[12], where a model abstains on inputs whose confidence falls below a threshold and defers them to a human or a fallback system. The quality of that abstention policy depends directly on whether the confidence scores are calibrated, since the threshold is only meaningful if a stated probability reflects a real error rate. ECE, reliability diagrams, and proper scoring rules together give a working picture of whether a model's stated certainty can be trusted.
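The abstention rule described here can be illustrated with a short sketch. It assumes top-label confidences and 0/1 correctness flags; the function name `selective_metrics`, the thresholds, and the data are illustrative and not taken from Geifman and El-Yaniv.

```python
def selective_metrics(confidences, correct, threshold):
    """Coverage and accuracy when predictions below the threshold are deferred."""
    kept = [int(hit) for conf, hit in zip(confidences, correct) if conf >= threshold]
    coverage = len(kept) / len(confidences)
    accuracy = sum(kept) / len(kept) if kept else None  # undefined if all deferred
    return coverage, accuracy


confidences = [0.95, 0.90, 0.70, 0.60, 0.55]
correct = [1, 1, 1, 0, 0]

for threshold in (0.5, 0.8):
    coverage, accuracy = selective_metrics(confidences, correct, threshold)
    print(f"threshold={threshold}: coverage={coverage:.2f} accuracy={accuracy:.2f}")
```

Raising the threshold trades coverage for accuracy on the retained predictions. The trade is only predictable in advance if the confidences are calibrated, since otherwise a threshold of 0.8 says little about the error rate above it.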

References

1. Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On Calibration of Modern Neural Networks. Proceedings of the 34th International Conference on Machine Learning (ICML). arXiv:1706.04599. https://arxiv.org/abs/1706.04599
2. Naeini, M. P., Cooper, G. F., & Hauskrecht, M. (2015). Obtaining Well Calibrated Probabilities Using Bayesian Binning. Proceedings of the AAAI Conference on Artificial Intelligence, 29(1). https://ojs.aaai.org/index.php/AAAI/article/view/9602
3. Brier, G. W. (1950). Verification of Forecasts Expressed in Terms of Probability. Monthly Weather Review, 78(1), 1-3. https://journals.ametsoc.org/view/journals/mwre/78/1/1520-0493_1950_078_0001_vofeit_2_0_co_2.xml
4. Platt, J. C. (1999). Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods. Advances in Large Margin Classifiers, 10(3), 61-74. https://www.cs.colorado.edu/~mozer/Teaching/syllabi/6622/papers/Platt1999.pdf
5. Niculescu-Mizil, A., & Caruana, R. (2005). Predicting Good Probabilities with Supervised Learning. Proceedings of the 22nd International Conference on Machine Learning (ICML). https://www.cs.cornell.edu/~alexn/papers/calibration.icml05.crc.rev3.pdf
6. Nixon, J., Dusenberry, M. W., Zhang, L., Jerfel, G., & Tran, D. (2019). Measuring Calibration in Deep Learning. CVPR Workshops. arXiv:1904.01685. https://arxiv.org/abs/1904.01685
7. Blasiok, J., & Nakkiran, P. (2023). Smooth ECE: Principled Reliability Diagrams via Kernel Smoothing. International Conference on Learning Representations (ICLR 2024). arXiv:2309.12236. https://arxiv.org/abs/2309.12236
8. OpenAI. (2023). GPT-4 Technical Report. arXiv:2303.08774. https://arxiv.org/abs/2303.08774
9. DeGroot, M. H., & Fienberg, S. E. (1983). The Comparison and Evaluation of Forecasters. Journal of the Royal Statistical Society: Series D (The Statistician), 32(1-2), 12-22. https://www.jstor.org/stable/2987588
10. Murphy, A. H. (1973). A New Vector Partition of the Probability Score. Journal of Applied Meteorology, 12(4), 595-600. https://journals.ametsoc.org/view/journals/apme/12/4/1520-0450_1973_012_0595_anvpot_2_0_co_2.xml
11. Gneiting, T., & Raftery, A. E. (2007). Strictly Proper Scoring Rules, Prediction, and Estimation. Journal of the American Statistical Association, 102(477), 359-378. https://doi.org/10.1198/016214506000001437
12. Geifman, Y., & El-Yaniv, R. (2017). Selective Classification for Deep Neural Networks. Advances in Neural Information Processing Systems (NeurIPS). arXiv:1705.08500. https://arxiv.org/abs/1705.08500
13. Minderer, M., Djolonga, J., Romijnders, R., Hubis, F., Zhai, X., Houlsby, N., Tran, D., & Lucic, M. (2021). Revisiting the Calibration of Modern Neural Networks. Advances in Neural Information Processing Systems (NeurIPS). arXiv:2106.07998. https://arxiv.org/abs/2106.07998
