Expected calibration error
Last reviewed
May 31, 2026
Sources
13 citations
Review status
Source-backed
Revision
v1 · 2,551 words
Expected calibration error (ECE) is a metric that measures how well a classifier's predicted confidence matches its observed accuracy. A model is well calibrated when the probabilities it reports line up with how often it is right: among all the predictions it makes with 70 percent confidence, about 70 percent should turn out correct. ECE summarizes the gap between confidence and accuracy as a single number by sorting predictions into bins and averaging the per-bin discrepancy. It became one of the most widely cited tools for evaluating the reliability of machine learning classifiers, especially deep neural network models, after Chuan Guo and colleagues used it to show that modern networks tend to be badly miscalibrated [1].
Calibration matters whenever a downstream decision depends on the probability a model reports rather than just its top guess. A medical triage system that flags a scan as 90 percent likely to be malignant, a self-driving stack that weighs detections by confidence, or a language model that should know when to defer to a human all rely on probabilities that mean what they say. Accuracy alone does not capture this. Two models can reach the same accuracy while one reports honest probabilities and the other is systematically overconfident.
Formally, let a model produce a predicted label y-hat and an associated confidence p-hat for an input. The model is perfectly calibrated if, for every confidence level p in [0, 1], the probability that the prediction is correct given that confidence equals p:
P(y-hat = Y | p-hat = p) = p for all p in [0, 1].
This is the definition used by Guo et al. [1], following earlier work on calibrated probabilities by Naeini, Cooper, and Hauskrecht [2]. The condition cannot be checked directly because the confidence p-hat is a continuous quantity and no single value of p has enough samples attached to it to estimate the conditional accuracy. The standard workaround is to group predictions into a small number of confidence intervals and check calibration within each group.
A few terms recur in this literature. Confidence usually refers to the probability the model assigns to its predicted (top-1) class, although some metrics consider the full probability vector. Overconfidence means confidence exceeds accuracy, the most common failure mode for deep networks. Underconfidence is the reverse. These notions are distinct from discrimination, which is a model's ability to separate correct from incorrect cases; a model can rank examples well yet still report poorly scaled probabilities.
The reliability diagram is the visual companion to ECE. Predictions are partitioned into M equal-width confidence bins, for example the ten intervals [0, 0.1), [0.1, 0.2), and so on up to [0.9, 1.0]. For each bin the diagram plots the average confidence on the x axis against the empirical accuracy on the y axis, usually as a bar. A perfectly calibrated model produces bars whose tops lie on the diagonal line y = x. Bars that fall below the diagonal indicate overconfidence (the model claims more certainty than it earns), and bars above it indicate underconfidence. The vertical gap between each bar and the diagonal is the per-bin quantity that ECE averages. Reliability diagrams trace back to forecast verification work by DeGroot and Fienberg and others, and they remain the most direct way to see the shape of a model's miscalibration rather than just its magnitude [1][2].
To compute ECE, the n predictions are divided into M bins B_1 through B_M by their confidence. For each bin two quantities are measured. The accuracy of bin B_m is the fraction of its examples that are classified correctly:
acc(B_m) = (1 / |B_m|) * sum over i in B_m of 1(y-hat_i = y_i),
and the confidence of the bin is the mean predicted probability of its members:
conf(B_m) = (1 / |B_m|) * sum over i in B_m of p-hat_i.
Expected calibration error is the weighted average of the absolute gap between these two, where each bin is weighted by the share of data it contains [1]:
ECE = sum over m = 1 to M of (|B_m| / n) * |acc(B_m) - conf(B_m)|.
The result is a number in [0, 1], and lower is better, with 0 meaning the binned estimate detects no miscalibration. Guo et al. used M = 15 equal-width bins in their experiments, and 10 or 15 bins remain common defaults [1]. ECE in this top-label form was introduced by Naeini et al. in 2015 and then popularized for deep learning by Guo et al. in 2017 [1][2].
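The definition above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the function name, the argument names, and the choice to place a confidence of exactly 1.0 in the last bin are assumptions made here, and values that sit exactly on a bin edge may land on either side because of floating-point rounding.

```python
def expected_calibration_error(confidences, correct, n_bins=15):
    """Top-label ECE with equal-width bins over [0, 1].

    confidences: confidence p-hat_i of each predicted label.
    correct: 1/True where y-hat_i equals y_i, else 0/False.
    """
    n = len(confidences)
    conf_sum = [0.0] * n_bins  # running sum of confidences per bin
    hits = [0] * n_bins        # correct predictions per bin
    count = [0] * n_bins       # |B_m|
    for p, c in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        m = min(int(p * n_bins), n_bins - 1)
        conf_sum[m] += p
        hits[m] += int(c)
        count[m] += 1

    ece = 0.0
    for m in range(n_bins):
        if count[m] == 0:
            continue  # empty bins carry zero weight
        acc = hits[m] / count[m]
        conf = conf_sum[m] / count[m]
        ece += (count[m] / n) * abs(acc - conf)
    return ece


# Two predictions at 0.95 (both right) and two at 0.65 (one right).
ece = expected_calibration_error([0.95, 0.95, 0.65, 0.65], [1, 1, 1, 0], n_bins=10)
print(round(ece, 3))  # 0.1
```

In the toy call, the high bin contributes 0.5 * |1.0 - 0.95| and the low bin 0.5 * |0.5 - 0.65|, which sum to 0.1.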
Maximum calibration error (MCE) replaces the average with the worst case. Instead of weighting bins by their population, it reports the single largest gap across all bins [1]:
MCE = max over m in {1, ..., M} of |acc(B_m) - conf(B_m)|.
MCE is the more conservative metric and is used in high-stakes settings where any region of severe miscalibration is a problem, even if it covers few examples. ECE answers "how miscalibrated is the model on average," while MCE answers "how bad does it get anywhere." Because MCE ignores how many predictions fall in the offending bin, a single sparse bin can dominate it, which makes MCE noisier than ECE on small datasets.
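The sensitivity of MCE to a sparse bin is easy to see in a small sketch. The function name and the toy data below are hypothetical, and the binning follows the same equal-width convention as the ECE definition; the function assumes at least one prediction is supplied.

```python
def maximum_calibration_error(confidences, correct, n_bins=15):
    """Largest per-bin |accuracy - confidence| gap, equal-width bins."""
    bins = {}  # bin index -> [sum of confidences, hits, count]
    for p, c in zip(confidences, correct):
        m = min(int(p * n_bins), n_bins - 1)
        stats = bins.setdefault(m, [0.0, 0, 0])
        stats[0] += p
        stats[1] += int(c)
        stats[2] += 1
    # Only occupied bins are considered; population is ignored on purpose.
    return max(abs(hits / k - conf_sum / k) for conf_sum, hits, k in bins.values())


# 98 predictions at 0.95 (93 correct) and only 2 at 0.55 (both wrong).
confidences = [0.95] * 98 + [0.55] * 2
correct = [1] * 93 + [0] * 5 + [0] * 2
print(round(maximum_calibration_error(confidences, correct, n_bins=10), 2))  # 0.55
```

The two low-confidence predictions make up 2 percent of the data, yet they alone determine the result, because the well-populated bin has a gap of about 0.001.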
ECE and MCE are convenient and interpretable, but they are not proper scoring rules: a model can drive them toward zero without producing genuinely useful probabilities. A constant predictor that always outputs the base rate, for example, can score a near-zero ECE while carrying no information about individual inputs. Proper scoring rules avoid this. A scoring rule is proper if reporting the true probabilities minimizes its expected value, and strictly proper if the true probabilities are the only minimizer.
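The constant-predictor case can be checked by hand. The sketch below uses a made-up binary dataset with a 70 percent base rate; the mean squared error it computes at the end is the Brier score, a proper scoring rule, included only to show that such a rule does not award this predictor a perfect score.

```python
# Hypothetical binary dataset: 70 positives and 30 negatives.
labels = [1] * 70 + [0] * 30
base_rate = sum(labels) / len(labels)

# Constant predictor: P(y = 1) = base_rate for every input.
predicted_label = 1 if base_rate >= 0.5 else 0
confidence = max(base_rate, 1 - base_rate)  # top-label confidence
accuracy = sum(y == predicted_label for y in labels) / len(labels)

# Every prediction shares one confidence value, so a single bin is occupied
# and ECE reduces to that bin's gap with weight 1.
ece = abs(accuracy - confidence)

# Mean squared error of the same forecasts (a perfect predictor scores 0).
brier = sum((base_rate - y) ** 2 for y in labels) / len(labels)

print(ece, round(brier, 2))  # 0.0 0.21
```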
The Brier score, proposed by Glenn W. Brier in 1950 in "Verification of forecasts expressed in terms of probability," is the mean squared error between predicted probabilities and outcomes [3]. For binary predictions with forecast probability f_t and outcome o_t in {0, 1},
BS = (1 / N) * sum over t of (f_t - o_t)^2,
which ranges from 0 to 1, lower being better. It extends to multiple classes by summing the squared error over every class. The Brier score is strictly proper, and Allan Murphy showed in 1973 that it decomposes additively into reliability, resolution, and uncertainty terms, BS = reliability - resolution + uncertainty [3]. The reliability term is itself a calibration measure, which links the Brier score directly to the idea ECE captures.
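The binary Brier score and its three-term decomposition can be illustrated as follows. This is a sketch under one stated assumption: forecasts are grouped by their exact value, which makes the identity hold exactly. With continuous forecasts the groups would have to be bins and the identity would hold only approximately. The function names are chosen here for illustration.

```python
from collections import defaultdict


def brier_score(forecasts, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)


def murphy_decomposition(forecasts, outcomes):
    """Return (reliability, resolution, uncertainty), grouping by forecast value."""
    n = len(forecasts)
    base_rate = sum(outcomes) / n
    groups = defaultdict(list)  # forecast value -> observed outcomes
    for f, o in zip(forecasts, outcomes):
        groups[f].append(o)

    reliability = 0.0
    resolution = 0.0
    for f, obs in groups.items():
        observed_freq = sum(obs) / len(obs)
        reliability += len(obs) * (f - observed_freq) ** 2
        resolution += len(obs) * (observed_freq - base_rate) ** 2
    uncertainty = base_rate * (1 - base_rate)
    return reliability / n, resolution / n, uncertainty


# Toy forecasts: 0.2 issued five times (one event), 0.8 five times (four events).
forecasts = [0.2] * 5 + [0.8] * 5
outcomes = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]
rel, res, unc = murphy_decomposition(forecasts, outcomes)
print(round(brier_score(forecasts, outcomes), 4), round(rel - res + unc, 4))  # 0.16 0.16
```

In this toy case the reliability term is zero because each forecast value matches its observed frequency, so the whole score comes from uncertainty minus resolution.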
Negative log-likelihood (NLL), also called cross-entropy loss, is the other standard proper scoring rule. For predicted distribution pi-hat over labels,
NLL = - sum over i = 1 to n of log pi-hat(y_i | x_i),
and it is the objective most classifiers are trained on. Guo et al. note that NLL is minimized exactly when the model recovers the true conditional distribution, and they track the gap between NLL and classification error as a sign of overfitting in probability space rather than in accuracy [1]. Both Brier and NLL conflate calibration with discrimination, so they reward a model that both ranks examples well and scales its probabilities honestly. ECE isolates the calibration piece but loses the propriety guarantee. In practice researchers report several of these together.
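A short sketch shows how NLL separates two models that make identical top-1 predictions. The probability vectors are invented for illustration, and the function assumes every true label receives a strictly positive probability, since the logarithm of zero is undefined.

```python
import math


def negative_log_likelihood(probs, labels):
    """Summed NLL: minus the log-probability assigned to each true label."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels))


labels = [0, 1]
# Both models predict class 0 twice, so both get the second example wrong.
hedged = [[0.9, 0.1], [0.6, 0.4]]
overconfident = [[0.9, 0.1], [0.99, 0.01]]

nll_hedged = negative_log_likelihood(hedged, labels)
nll_overconfident = negative_log_likelihood(overconfident, labels)
print(nll_hedged < nll_overconfident)  # True
```

Accuracy is 50 percent for both models, but the one that is confidently wrong pays a much larger penalty, which is the behavior that makes NLL sensitive to calibration as well as to ranking.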
A central finding of Guo et al. is that the accurate deep networks of the mid-2010s were substantially more overconfident than the shallower, less accurate models that preceded them; their headline comparison sets a 110-layer ResNet against a 5-layer LeNet on CIFAR-100 [1]. They traced this to several design choices. Increasing network depth and width improves accuracy but worsens calibration. Batch normalization, despite helping optimization, tends to make predictions more overconfident. Reducing weight decay (less regularization) also degrades calibration. The common thread is that high-capacity networks trained to minimize NLL keep pushing probabilities toward 0 and 1 long after classification error has plateaued, overfitting the loss in a way that does not show up as a drop in accuracy.
Their proposed remedy is temperature scaling, a post-hoc method and a single-parameter special case of Platt scaling [4]. After training, the logits z_i for an example are divided by a scalar temperature T before the softmax, and the calibrated confidence is
q-hat_i = max over k of softmax(z_i / T)_k.
The temperature T is fit by minimizing NLL on a held-out validation set, with the network's weights frozen. Because T scales all logits uniformly, it never changes which class is the argmax, so accuracy is untouched while the probabilities are softened (T > 1) or sharpened (T < 1). Guo et al. found that this one extra parameter was often enough to match or outperform more elaborate calibration schemes such as isotonic regression or histogram binning, which is why temperature scaling became a default baseline [1]. Earlier scaling approaches include Platt scaling for support vector machines [4] and the comparison of Platt scaling against isotonic regression by Niculescu-Mizil and Caruana [5].
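The fitting step can be sketched without any deep learning library. The code below is an illustration under stated assumptions: the logits and labels are invented, the function names are hypothetical, and a plain grid search over T stands in for whatever optimizer a real pipeline would use.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits divided by the temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_norm for s in scaled]


def nll(logit_rows, labels, temperature):
    """Summed negative log-likelihood at a given temperature."""
    return -sum(log_softmax(z, temperature)[y] for z, y in zip(logit_rows, labels))


def fit_temperature(val_logits, val_labels, grid=None):
    """Pick the grid value of T with the lowest validation NLL (assumed search)."""
    if grid is None:
        grid = [0.05 * k for k in range(1, 201)]  # 0.05, 0.10, ..., 10.0
    return min(grid, key=lambda t: nll(val_logits, val_labels, t))


def confidence(logits, temperature=1.0):
    """Top-label confidence q-hat after temperature scaling."""
    return math.exp(max(log_softmax(logits, temperature)))


# Hypothetical validation set: the model always favors class 0 by a logit
# margin of 4 (about 98% confidence) but is right only 7 times out of 10.
val_logits = [[4.0, 0.0]] * 10
val_labels = [0] * 7 + [1] * 3

T = fit_temperature(val_logits, val_labels)
print(T > 1.0)                                                      # True
print(round(confidence([4.0, 0.0]), 2))                             # 0.98
print(abs(confidence([4.0, 0.0], T) - 0.7) < 0.01)                  # True
```

Dividing by T leaves the ordering of the logits unchanged, so the predicted class is the same before and after scaling; only the confidence moves, here from about 0.98 to roughly the observed accuracy of 0.7.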
The table below shows a small reliability computation with five confidence bins over 1,000 predictions. The per-bin contribution to ECE is the bin weight times the absolute confidence-accuracy gap.
| Bin (confidence) | Samples | Avg confidence | Accuracy | Gap | Weight | Contribution |
|---|---|---|---|---|---|---|
| [0.5, 0.6) | 100 | 0.55 | 0.58 | 0.03 | 0.10 | 0.0030 |
| [0.6, 0.7) | 150 | 0.65 | 0.60 | 0.05 | 0.15 | 0.0075 |
| [0.7, 0.8) | 200 | 0.75 | 0.68 | 0.07 | 0.20 | 0.0140 |
| [0.8, 0.9) | 250 | 0.85 | 0.74 | 0.11 | 0.25 | 0.0275 |
| [0.9, 1.0] | 300 | 0.96 | 0.82 | 0.14 | 0.30 | 0.0420 |
Summing the contributions gives ECE = 0.094, while MCE is the largest gap, 0.14, found in the most populated high-confidence bin. The lowest bin is slightly underconfident (accuracy 0.58 against confidence 0.55), but in every bin from 0.6 upward accuracy trails confidence and the gap widens as confidence rises, which is the textbook signature of an overconfident network.
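The arithmetic in the table can be reproduced directly from its sample counts, average confidences, and accuracies. The variable names are chosen here for illustration.

```python
# (samples, average confidence, accuracy) for each row of the table.
bins = [
    (100, 0.55, 0.58),
    (150, 0.65, 0.60),
    (200, 0.75, 0.68),
    (250, 0.85, 0.74),
    (300, 0.96, 0.82),
]
n = sum(samples for samples, _, _ in bins)

# Weighted average of the absolute gaps, and the single largest gap.
ece = sum(samples / n * abs(acc - conf) for samples, conf, acc in bins)
mce = max(abs(acc - conf) for _, conf, acc in bins)

# Signed gaps (confidence minus accuracy): positive means overconfident.
signed_gaps = [round(conf - acc, 2) for _, conf, acc in bins]

print(n, round(ece, 3), round(mce, 2))  # 1000 0.094 0.14
print(signed_gaps)                      # [-0.03, 0.05, 0.07, 0.11, 0.14]
```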
The binning that makes ECE tractable is also its main weakness. The estimate depends on the number of bins and the binning scheme, and there is no canonical choice. Too few bins hide miscalibration by averaging opposing errors together; too many bins leave each bin with so few samples that the accuracy estimate becomes noisy and ECE is biased. Equal-width binning is especially fragile for deep networks, whose confidences pile up near 1, so most predictions land in the last bin or two while the low-confidence bins sit nearly empty and contribute little [6].
Several variants try to repair this. Adaptive (equal-mass) binning, used in the Adaptive Calibration Error (ACE) of Nixon et al., draws bin edges so that each bin holds roughly the same number of samples, giving every region equal statistical weight [6]. The same work introduces a Thresholded Adaptive Calibration Error (TACE) that drops near-zero probabilities, which matters when there are hundreds of classes, and a Static Calibration Error (SCE) that averages over all class probabilities rather than only the top one. A more recent line replaces hard bins entirely: the smooth ECE (smECE) of Jaroslaw Blasiok and Preetum Nakkiran (2023) applies Gaussian kernel smoothing with a principled bandwidth, producing a continuous calibration measure and a reliability diagram without an arbitrary bin count [7]. They connect smECE to a notion of a consistent calibration measure that provably tracks the Wasserstein distance to perfect calibration. Common practical advice is to report the bin count, check robustness across several settings, and pair ECE with a proper scoring rule [6][7].
Calibration studies have moved from image classifiers to large language models. A widely cited result comes from OpenAI's GPT-4 technical report, which examined how well the model's reported probability on multiple-choice questions matched its accuracy. The pre-trained base model was highly calibrated on a subset of MMLU, with an ECE of about 0.007, close to the diagonal of a reliability diagram [8]. After post-training with reinforcement learning from human feedback, the calibration degraded sharply, to an ECE of roughly 0.074 [8]. RLHF makes the model more helpful and better at following instructions, but it tends to push the model's probabilities toward overconfident, assertive answers, which is a recognized tension between alignment and calibration. Similar findings appear across instruction-tuned and chat models, where pre-trained next-token probabilities are often better calibrated than the post-trained policy. Calibrating language models is harder than calibrating image classifiers because confidence can be read from token log-probabilities, from verbalized statements like "I am 80 percent sure," or from sampling consistency, and these signals do not always agree.
Calibration is one component of the broader field of uncertainty quantification, which also covers separating aleatoric (data) from epistemic (model) uncertainty, out-of-distribution detection, and methods such as Bayesian neural networks, deep ensembles, and conformal prediction. A well-calibrated confidence score is the input that makes many of these usable. It is especially central to selective prediction (also called classification with a reject option), where a model abstains on inputs whose confidence falls below a threshold and defers them to a human or a fallback system. The quality of that abstention policy depends directly on whether the confidence scores are calibrated, since the threshold is only meaningful if a stated probability reflects a real error rate. ECE, reliability diagrams, and proper scoring rules together give a working picture of whether a model's stated certainty can be trusted.
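The abstention policy described for selective prediction can be sketched as a simple threshold rule. The function name, the threshold, and the toy data are hypothetical; the function reports coverage (the share of inputs the model answers) and the accuracy on the answered inputs.

```python
def selective_prediction(confidences, correct, threshold):
    """Answer only when confidence >= threshold; defer everything else."""
    kept = [int(c) for p, c in zip(confidences, correct) if p >= threshold]
    coverage = len(kept) / len(confidences)
    # Accuracy is undefined when the model abstains on every input.
    accuracy = sum(kept) / len(kept) if kept else None
    return coverage, accuracy


confidences = [0.95, 0.90, 0.85, 0.60, 0.55]
correct = [1, 1, 1, 0, 1]
print(selective_prediction(confidences, correct, threshold=0.8))  # (0.6, 1.0)
```

If the confidences were miscalibrated, the same threshold would either defer too many correct predictions or let too many errors through, which is why the threshold is only meaningful for calibrated scores.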