# Calibration Layer

> Source: https://aiwiki.ai/wiki/calibration_layer
> Updated: 2026-07-13
> Categories: Deep Learning, Machine Learning, Model Evaluation, Neural Networks
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

A **calibration layer** is a post-prediction adjustment appended to a trained [machine learning](/wiki/machine_learning) model that rescales its raw output scores or predicted probabilities so they better reflect the true likelihood of outcomes. Google's Machine Learning Glossary defines it concisely as "a post-prediction adjustment, typically to account for prediction bias," adding that "the adjusted predictions and probabilities should match the distribution of an observed set of labels." [16] In classification tasks, a model is considered well-calibrated when its confidence scores match observed frequencies: among all predictions made with 80% confidence, roughly 80% should actually belong to the predicted class. [17] Calibration layers and related [calibration](/wiki/calibration) methods are widely used in [deep learning](/wiki/deep_neural_network), medical diagnosis, [autonomous driving](/wiki/autonomous_driving), weather forecasting, and other safety-sensitive applications where trustworthy uncertainty estimates are needed.

## Explain like I'm 5 (ELI5)

Imagine you have a friend who guesses the weather every day. They say things like "I'm 90% sure it will rain." But when you check, it only rains about half the time they say 90%. Your friend is too confident in their guesses. A calibration layer is like giving your friend a special pair of glasses that helps them see more clearly. After wearing the glasses, when they say "I'm 90% sure," it actually rains about 90% of the time. The glasses do not change what your friend sees; they just help your friend describe what they see more honestly.

## What is calibration in machine learning?

A probabilistic classifier outputs a confidence score for each class. Calibration measures how well those confidence scores correspond to actual correctness rates. Formally, a classifier is perfectly calibrated if:

> $$P(Y = y \mid \hat{p} = p) = p$$, for all $$p \in [0, 1]$$

This means that among all instances where the model predicts class y with probability p, the true fraction of instances that are indeed class y should equal p. Equivalently, of all the events a well-calibrated model predicts with probability 0.8, about 80% should occur in practice. [17]

Calibration is distinct from [accuracy](/wiki/accuracy). A model can be highly accurate (making correct predictions most of the time) while still being poorly calibrated (assigning confidence values that do not match actual correctness rates). Conversely, a model can be perfectly calibrated but have low accuracy.

### Why are modern neural networks miscalibrated?

Research by Guo et al. (2017), presented at ICML, demonstrated that modern [neural networks](/wiki/neural_network) are significantly more miscalibrated than their predecessors from the early 2000s. The paper's central finding is stated directly in its abstract: "We discover that modern neural networks, unlike those from a decade ago, are poorly calibrated." [6] Despite achieving higher accuracy, these deeper and wider networks tend to produce overconfident predictions. Several factors contribute to this miscalibration:

| Factor | Effect on calibration |
|---|---|
| Increased model depth | Deeper networks have more capacity to minimize training loss beyond the point of correct classification, pushing confidence toward extreme values |
| Increased model width | Wider layers provide more parameters, enabling the model to overfit confidence scores on training data |
| [Batch normalization](/wiki/batch_normalization) | Enables training of very deep networks, indirectly contributing to overcapacity and overconfidence |
| Insufficient [weight decay](/wiki/regularization) | Without proper regularization, models can memorize training data and produce uncalibrated scores |
| Negative log-likelihood training | After a model correctly classifies most training samples, continued training further minimizes NLL by making predictions more extreme rather than more accurate |

Guo et al. identified depth, width, weight decay, and batch normalization as the principal architectural factors driving this degradation. [6] This finding was significant because earlier, shallower networks (such as LeNet-style architectures) tended to be reasonably well-calibrated. The shift toward deeper architectures like [ResNet](/wiki/resnet), DenseNet, and [Inception](/wiki/inception) introduced a systematic calibration problem.

## What are the main post-hoc calibration methods?

Post-hoc calibration methods are applied after training. They learn a mapping from the model's raw outputs to calibrated probabilities using a held-out calibration dataset. The base model's weights remain frozen, and only the calibration parameters are learned. This modularity is one of their main advantages: any trained classifier can be calibrated without retraining.

### Platt scaling

Platt scaling (also called sigmoid calibration) was introduced by John Platt in 1999 for [support vector machines](/wiki/support_vector_machine_svm) (SVMs). [1] It fits a [logistic regression](/wiki/logistic_regression) model to the classifier's output scores.

Given a classifier's output $$f(x)$$, Platt scaling transforms it into a calibrated probability:

$$
P(y = 1 \mid f(x)) = \frac{1}{1 + \exp(A \cdot f(x) + B)}
$$

The parameters $$A$$ and $$B$$ are learned by maximizing the log-likelihood on a held-out calibration set. Platt originally recommended using the Levenberg-Marquardt algorithm for optimization, though later work by Lin et al. (2007) proposed a Newton's method variant that offers improved numerical stability. [4]

Platt scaling is effective for classifiers that produce sigmoidal distortions in their output distributions, which is common with max-margin methods like SVMs and [boosted trees](/wiki/gradient_boosting). Niculescu-Mizil and Caruana (2005) observed that maximum-margin methods such as boosted trees and boosted stumps push probability mass away from 0 and 1, yielding a characteristic sigmoid-shaped distortion that Platt scaling is well suited to correct. [3] It is less effective for models that already produce well-calibrated probabilities, such as logistic regression and [random forests](/wiki/random_forest).

**Advantages:** Simple to implement, requires learning only two parameters, and works well with limited calibration data.

**Limitations:** Assumes the calibration function follows a sigmoid shape. Cannot correct non-sigmoidal miscalibration patterns.
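The fitting step can be illustrated with a minimal sketch. It assumes plain gradient descent on the mean negative log-likelihood rather than the Levenberg-Marquardt or Newton-style optimizers mentioned above, and the function names and calibration data are hypothetical:

```python
import math

def platt_probability(score, A, B):
    """P(y = 1 | f(x)) = 1 / (1 + exp(A * f(x) + B)), using Platt's sign convention."""
    return 1.0 / (1.0 + math.exp(A * score + B))

def fit_platt(scores, labels, lr=0.1, steps=5000):
    """Fit A and B by gradient descent on the mean negative log-likelihood."""
    A, B = 0.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_A = grad_B = 0.0
        for f, y in zip(scores, labels):
            p = platt_probability(f, A, B)
            # Under this sign convention: d(NLL)/dA = (y - p) * f, d(NLL)/dB = (y - p)
            grad_A += (y - p) * f
            grad_B += (y - p)
        A -= lr * grad_A / n
        B -= lr * grad_B / n
    return A, B

# Hypothetical held-out calibration scores (e.g. SVM margins) with binary labels
scores = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
labels = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

A, B = fit_platt(scores, labels)
calibrated = [platt_probability(f, A, B) for f in scores]
```

Because the formula uses $$\exp(A \cdot f(x) + B)$$ in the denominator, a classifier whose higher scores indicate the positive class ends up with a negative $$A$$.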

### Temperature scaling

Temperature scaling is a single-parameter extension of Platt scaling introduced by Guo et al. (2017) specifically for [neural networks](/wiki/neural_network). The authors describe it as "a single-parameter variant of Platt Scaling" that is "surprisingly effective at calibrating predictions." [6] It divides the logit vector (the output of the network's final layer before the [softmax](/wiki/softmax) function) by a learned scalar parameter $$T$$ called the temperature:

$$
q_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}
$$

where $$z_i$$ are the original logits, $$T > 0$$ is the temperature parameter, and $$q_i$$ are the calibrated probabilities.

The temperature T is optimized by minimizing the negative log-likelihood ([cross-entropy](/wiki/cross-entropy) loss) on a held-out validation set. When $$T > 1$$, the softmax distribution becomes softer (less confident). When $$T < 1$$, predictions become sharper (more confident). When $$T = 1$$, the output is unchanged.

A key property of temperature scaling is that it does not change the model's predictions. Since dividing all logits by the same positive scalar preserves their relative ordering, the $$\arg\max$$ (predicted class) remains the same. Only the confidence values change. This means temperature scaling preserves accuracy while improving calibration.

Despite having only a single learnable parameter, temperature scaling was shown by Guo et al. to outperform more complex methods (including matrix scaling and vector scaling) on a range of image classification benchmarks. [6] This is partly because methods with more parameters are prone to overfitting on the typically small calibration set.
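As a rough illustration, the sketch below applies the temperature-scaled softmax and picks $$T$$ by minimizing validation NLL. The grid search is a simple stand-in for the gradient-based optimization normally used, and the validation logits and labels are made up for the example:

```python
import math

def softmax_with_temperature(logits, T):
    """q_i = exp(z_i / T) / sum_j exp(z_j / T), computed in a numerically stable way."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def mean_nll(logit_rows, labels, T):
    """Mean negative log-likelihood of the true classes at temperature T."""
    return -sum(
        math.log(softmax_with_temperature(z, T)[y])
        for z, y in zip(logit_rows, labels)
    ) / len(labels)

def fit_temperature(logit_rows, labels, candidates=None):
    """Choose T from a grid by minimizing validation NLL (assumed search strategy)."""
    if candidates is None:
        candidates = [0.05 * k for k in range(1, 201)]  # 0.05, 0.10, ..., 10.0
    return min(candidates, key=lambda T: mean_nll(logit_rows, labels, T))

# Hypothetical validation logits from an overconfident 3-class model
val_logits = [
    [8.0, 1.0, 0.0], [7.0, 0.5, 0.0], [0.0, 9.0, 1.0],
    [6.0, 5.0, 0.0], [1.0, 0.0, 7.5], [0.5, 8.0, 0.0],
]
val_labels = [0, 1, 1, 1, 2, 0]

T = fit_temperature(val_logits, val_labels)
before = softmax_with_temperature(val_logits[0], 1.0)
after = softmax_with_temperature(val_logits[0], T)
```

Half of these confident predictions are wrong, so the fitted temperature comes out above 1 and the top-class probability shrinks while the predicted class stays the same.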

### Isotonic regression

Isotonic regression is a non-parametric calibration method that fits a piecewise-constant, monotonically non-decreasing function to the relationship between predicted scores and observed frequencies. Unlike Platt scaling, it makes no assumption about the shape of the calibration function.

The method works by solving the following optimization problem:

$$
\min \sum_i (y_i - m(f_i))^2 \quad \text{subject to: } m(f_i) \le m(f_j) \text{ whenever } f_i \le f_j
$$

where $$f_i$$ are the model's predicted scores, $$y_i$$ are the true labels, and $$m$$ is the calibration mapping function constrained to be monotonically non-decreasing.

The result is a step function that maps raw scores to calibrated probabilities. The Pool Adjacent Violators (PAV) algorithm solves this optimization in $$O(n)$$ time once the scores are sorted.

Niculescu-Mizil and Caruana (2005) showed that isotonic regression outperforms Platt scaling when sufficient calibration data is available, because its non-parametric nature allows it to correct arbitrary monotonic miscalibration patterns. [3] However, with small calibration datasets, isotonic regression is prone to overfitting. The scikit-learn documentation gives the practical guideline that "'isotonic' will perform as well as or better than 'sigmoid' when there is enough data (greater than ~ 1000 samples) to avoid overfitting." [18]
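A compact sketch of the Pool Adjacent Violators procedure is given below. It is an illustrative implementation with hypothetical names and data, not the code used by any particular library; the initial sort costs $$O(n \log n)$$, while the pooling pass itself is linear:

```python
def pav_isotonic(scores, labels):
    """Pool Adjacent Violators: return sorted scores and a non-decreasing fit to the labels."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    # Each block stores [sum of labels, count]; its fitted value is sum / count
    blocks = []
    for i in order:
        blocks.append([float(labels[i]), 1])
        # Merge backwards while the previous block's mean exceeds the newest block's mean
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return [scores[i] for i in order], fitted

# Hypothetical calibration scores and binary labels
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
labels = [0, 0, 1, 0, 1, 0, 1, 1]

sorted_scores, calibrated = pav_isotonic(scores, labels)
# calibrated == [0.0, 0.0, 0.5, 0.5, 0.5, 0.5, 1.0, 1.0]
```

The out-of-order labels in the middle of the range are pooled into blocks with value 0.5, which is what gives the calibration map its step shape.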

### Histogram binning

Histogram binning divides the predicted probability range [0, 1] into a fixed number of bins and replaces each prediction with the empirical fraction of positive samples among the calibration examples that fall into that bin. While simple, this method requires choosing the number of bins and can produce discontinuous calibration maps.
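A minimal sketch of equal-width histogram binning for a binary classifier follows. The midpoint fallback for empty bins and all names and data are assumptions made for the example:

```python
def fit_histogram_binning(probs, labels, n_bins=5):
    """Return one calibrated value per bin: the fraction of positives observed in that bin."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for p, y in zip(probs, labels):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 is placed in the last bin
        sums[b] += y
        counts[b] += 1
    # Empty bins fall back to the bin midpoint (an assumption; other fallbacks are possible)
    return [
        sums[b] / counts[b] if counts[b] else (b + 0.5) / n_bins
        for b in range(n_bins)
    ]

def apply_histogram_binning(p, bin_values):
    """Replace a raw probability with the calibrated value of the bin it falls into."""
    n_bins = len(bin_values)
    return bin_values[min(int(p * n_bins), n_bins - 1)]

# Hypothetical held-out predictions and binary labels
cal_probs = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]
cal_labels = [0, 0, 0, 1, 0, 1, 1, 0, 1, 1]

bin_values = fit_histogram_binning(cal_probs, cal_labels)
# bin_values == [0.0, 0.5, 0.5, 0.5, 1.0]
```

Every raw probability inside a bin maps to the same value, so the calibration map jumps at the bin edges.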

### Bayesian binning into quantiles (BBQ)

Bayesian Binning into Quantiles, proposed by Naeini, Cooper, and Hauskrecht (2015), extends histogram binning by considering multiple binning schemes simultaneously and combining them using Bayesian model averaging. [5] Instead of committing to a single binning, BBQ evaluates multiple equal-frequency binning models with different numbers of bins and weights them according to a Bayesian score (derived from the BDeu score used in Bayesian network structure learning). In the authors' experiments, BBQ was statistically superior to the compared calibration methods in terms of both ECE and MCE. [5]

### Beta calibration

Beta calibration is a parametric method based on the Beta distribution that provides a more flexible alternative to Platt scaling for binary classifiers. [7] It fits a function with three parameters that can model a wider range of calibration distortions than the two-parameter sigmoid used in Platt scaling, including asymmetric distortions where the calibration error differs between low-confidence and high-confidence predictions.

### Spline calibration

Spline calibration uses cubic splines to fit a smooth calibration function, providing greater flexibility than sigmoid-based methods while avoiding the discontinuities of histogram binning and isotonic regression. [12] The spline approach can capture complex non-linear miscalibration patterns and produces a continuous, differentiable calibration map.

## How do the post-hoc calibration methods compare?

| Method | Type | Parameters | Strengths | Weaknesses | Best suited for |
|---|---|---|---|---|---|
| [Platt scaling](/wiki/platt_scaling) | Parametric | 2 (A, B) | Simple, low data requirement, preserves ranking | Assumes sigmoid shape | SVMs, boosted models, small calibration sets |
| [Temperature scaling](/wiki/temperature_scaling) | Parametric | 1 (T) | Very simple, preserves accuracy, native multiclass support | Limited expressiveness | Deep neural networks, multiclass tasks |
| Isotonic regression | Non-parametric | O(n) | Handles arbitrary monotonic patterns, no shape assumption | Overfits on small datasets, produces step functions | Large calibration sets (1,000+ samples) |
| Histogram binning | Non-parametric | Number of bins | Easy to understand and implement | Discontinuous, bin count must be chosen | Quick baseline calibration |
| BBQ | Non-parametric (Bayesian) | Multiple binning models | Robust, combines multiple binnings | More complex to implement | General-purpose, research settings |
| Beta calibration | Parametric | 3 | Handles asymmetric distortions | More complex than Platt scaling | Binary classifiers with asymmetric errors |
| Spline calibration | Semi-parametric | Knot positions + coefficients | Smooth, flexible, continuous output | Requires selecting number of knots | Complex miscalibration patterns |

## How is calibration extended to multiclass problems?

Many calibration methods were originally designed for binary classification and require adaptation for multiclass settings.

### One-vs-all calibration

The simplest approach is to decompose the multiclass problem into multiple binary calibration problems. For a K-class problem, K separate calibrators are trained, one per class, each treating its class as positive and all others as negative. The calibrated probabilities are then renormalized to sum to 1. This approach is used by scikit-learn's CalibratedClassifierCV with the sigmoid and isotonic methods. [18]

### Vector scaling

Vector scaling extends temperature scaling by learning a separate temperature parameter for each class. Instead of a single scalar T, a diagonal matrix $$W$$ and bias vector $$b$$ are applied to the logit vector:

$$
q = \mathrm{softmax}(W z + b)
$$

where $$W$$ is a $$K \times K$$ diagonal matrix. This provides more flexibility than temperature scaling but requires learning $$2K$$ parameters ($$K$$ diagonal entries and $$K$$ biases) instead of one, which may lead to overfitting. [6]

### Matrix scaling

Matrix scaling goes further by allowing a full $$K \times K$$ transformation matrix $$W$$ (not restricted to be diagonal), along with a bias vector $$b$$. This is the most expressive linear post-hoc method but requires learning $$K^2 + K$$ parameters. In practice, matrix scaling tends to overfit severely when the calibration set is small relative to the number of classes. [6]
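The relationship between temperature, vector, and matrix scaling can be sketched as follows. Only the forward pass is shown; the parameter values are arbitrary placeholders rather than fitted values:

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def vector_scaling(logits, w, b):
    """softmax(W z + b) with W diagonal; w holds the K diagonal entries."""
    return softmax([w_k * z_k + b_k for w_k, z_k, b_k in zip(w, logits, b)])

def matrix_scaling(logits, W, b):
    """softmax(W z + b) with a full K x K matrix W."""
    return softmax([
        sum(W_kj * z_j for W_kj, z_j in zip(row, logits)) + b_k
        for row, b_k in zip(W, b)
    ])

logits = [4.0, 1.0, -2.0]
K = len(logits)
T = 2.0  # hypothetical temperature

# Temperature scaling is the special case w_k = 1 / T and b = 0
as_temperature = vector_scaling(logits, [1.0 / T] * K, [0.0] * K)

# With a diagonal W, matrix scaling coincides with vector scaling
w, b = [0.5, 0.8, 1.2], [0.1, 0.0, -0.1]
W_diag = [[w[k] if j == k else 0.0 for j in range(K)] for k in range(K)]
from_matrix = matrix_scaling(logits, W_diag, b)
from_vector = vector_scaling(logits, w, b)

n_params_vector = 2 * K      # K diagonal entries plus K biases
n_params_matrix = K * K + K  # full matrix plus K biases
```

For 1,000 classes the matrix variant would already have about a million parameters, which helps explain why it overfits small calibration sets.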

### Dirichlet calibration

Dirichlet calibration, proposed by Kull et al. (2019), provides a principled multiclass extension derived from Dirichlet distributions. [8] It generalizes beta calibration from binary to multiclass settings. The method log-transforms the uncalibrated probabilities, applies a linear layer, and passes the result through a softmax function. Regularization is needed to prevent overfitting, and even with regularization, Dirichlet calibration can underperform simpler methods like temperature scaling on datasets with many classes. [8]
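The three steps (log-transform, linear layer, softmax) can be sketched as below. The weights are placeholders rather than fitted values, the small clamp `eps` is an added assumption to avoid taking the logarithm of zero, and regularization is omitted:

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def dirichlet_calibrate(probs, W, b, eps=1e-12):
    """Apply softmax(W log(p) + b) to a vector of uncalibrated probabilities."""
    log_p = [math.log(max(p, eps)) for p in probs]
    z = [sum(w * lp for w, lp in zip(row, log_p)) + b_k for row, b_k in zip(W, b)]
    return softmax(z)

probs = [0.7, 0.2, 0.1]
zeros = [0.0, 0.0, 0.0]

# The identity matrix with zero bias leaves the probabilities unchanged
identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
unchanged = dirichlet_calibrate(probs, identity, zeros)

# Scaling the identity by 0.5 softens the distribution, like a temperature of 2
half = [[0.5 if i == j else 0.0 for j in range(3)] for i in range(3)]
softened = dirichlet_calibrate(probs, half, zeros)
```

The identity case works because the softmax of log-probabilities returns the original probabilities, so the linear layer only needs to learn the correction.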

## Can models be calibrated during training?

Instead of applying a calibration fix after training, some methods modify the training process itself to produce better-calibrated models from the start.

### Label smoothing

[Label smoothing](/wiki/label_smoothing) replaces hard one-hot target labels with soft targets. Instead of training against a target of 1 for the correct class and 0 for all others, label smoothing uses $$(1 - \epsilon)$$ for the correct class and $$\epsilon / (K - 1)$$ for each incorrect class, where $$\epsilon$$ is a small constant (typically 0.1). This prevents the model from becoming overly confident during training, which can improve calibration. However, the relationship between label smoothing and calibration is nuanced; excessive smoothing can lead to underconfidence.
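The soft targets described above can be built with a few lines. The function name is hypothetical:

```python
def smooth_labels(true_class, num_classes, epsilon=0.1):
    """Return (1 - epsilon) for the true class and epsilon / (K - 1) for every other class."""
    off_value = epsilon / (num_classes - 1)
    return [
        1.0 - epsilon if k == true_class else off_value
        for k in range(num_classes)
    ]

targets = smooth_labels(true_class=2, num_classes=5, epsilon=0.1)
```

With five classes and $$\epsilon = 0.1$$, the true class receives 0.9 and each of the four other classes receives 0.025, so the target vector still sums to 1.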

### Focal loss

Mukhoti et al. (2020) demonstrated that replacing standard [cross-entropy loss](/wiki/cross-entropy) with focal loss during training produces naturally well-calibrated neural networks. [10] Focal loss down-weights the loss contribution from well-classified (easy) examples, focusing training on hard examples. This acts as an implicit maximum-entropy regularizer, preventing the overconfident predictions that characterize miscalibrated networks. When combined with post-hoc temperature scaling, focal loss training achieves state-of-the-art calibration results. [10]

The focal loss formula is:

$$
\mathrm{FL}(p_t) = -(1 - p_t)^\gamma \log(p_t)
$$

where $$p_t$$ is the predicted probability for the true class and $$\gamma \ge 0$$ is a focusing parameter. When $$\gamma = 0$$, focal loss reduces to standard cross-entropy.
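A short sketch makes the down-weighting concrete. It evaluates the formula for a single example; $$\gamma = 2$$ is an illustrative choice, not a recommendation from the source:

```python
import math

def focal_loss(p_t, gamma=2.0):
    """FL(p_t) = -(1 - p_t)^gamma * log(p_t) for the true-class probability p_t."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

def cross_entropy(p_t):
    return -math.log(p_t)

easy, hard = 0.9, 0.3

# Ratio of focal loss to cross-entropy, i.e. the modulating factor (1 - p_t)^gamma
weight_easy = focal_loss(easy) / cross_entropy(easy)
weight_hard = focal_loss(hard) / cross_entropy(hard)
```

With $$\gamma = 2$$, the easy example ($$p_t = 0.9$$) keeps about 1% of its cross-entropy loss, while the hard example ($$p_t = 0.3$$) keeps about 49%.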

### Mixup training

[Mixup](/wiki/data_augmentation) is a data augmentation technique that creates synthetic training examples by taking convex combinations of pairs of training inputs and their labels. Thulasidasan et al. (2019) showed that mixup training improves calibration by acting as a form of regularization that prevents the model from memorizing hard decision boundaries and producing overconfident predictions near those boundaries. [9]
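The convex combination can be sketched as follows. Drawing the mixing weight from a Beta($$\alpha$$, $$\alpha$$) distribution follows the original mixup formulation rather than anything stated above, and the value of `alpha` and the toy inputs are assumptions:

```python
import random

def mixup(x1, y1, x2, y2, alpha=0.2, rng=random):
    """Blend two inputs and their one-hot labels with a weight lam ~ Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam

rng = random.Random(0)  # seeded only to make the sketch reproducible
x_mix, y_mix, lam = mixup(
    [1.0, 0.0, 2.0], [1.0, 0.0],   # first example, class 0
    [3.0, 4.0, 0.0], [0.0, 1.0],   # second example, class 1
    rng=rng,
)
```

The mixed label is itself a soft target (`[lam, 1 - lam]` here), so the network is never trained toward full confidence on a blended input.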

## How is calibration quality measured?

Several metrics and visualization tools exist for assessing whether a model is well-calibrated.

### Reliability diagrams

A reliability diagram (also called a calibration curve or calibration plot) is a graphical tool for visualizing calibration quality. Predictions are grouped into bins based on their confidence level (e.g., predictions between 0.7 and 0.8 go into one bin). For each bin, the average predicted confidence is plotted on the x-axis and the actual fraction of positive outcomes is plotted on the y-axis. A perfectly calibrated model produces points that lie along the diagonal ($$y = x$$). Points above the diagonal indicate underconfidence (the model is more accurate than it claims), while points below the diagonal indicate overconfidence (the model is less accurate than it claims).

### Expected calibration error (ECE)

The [Expected Calibration Error](/wiki/expected_calibration_error) is the most widely used scalar metric for measuring calibration. It partitions predictions into M equally-spaced bins based on confidence and computes a weighted average of the absolute difference between accuracy and confidence within each bin:

$$
\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|
$$

where $$B_m$$ is the set of predictions in bin m, n is the total number of predictions, $$\mathrm{acc}(B_m)$$ is the accuracy within bin m, and $$\mathrm{conf}(B_m)$$ is the average confidence within bin m.

Typical implementations use $$M = 15$$ bins, following the convention established by Guo et al. (2017). [6] While widely adopted, ECE has known limitations: it is sensitive to bin width, it is a discontinuous functional, and it can produce misleading results when bins contain very few samples. Because the metric depends on the choice of bin count and boundaries, ECE values are difficult to compare across studies that use different binning protocols. [11]

### Maximum calibration error (MCE)

The Maximum Calibration Error reports the worst-case calibration gap across all bins:

$$
\mathrm{MCE} = \max_{m \in \{1,\ldots,M\}} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|
$$

MCE is relevant in safety-sensitive applications where even a single poorly calibrated confidence region could lead to dangerous decisions.
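Both binned metrics can be computed in one pass, as in the sketch below. It uses equal-width bins that are closed on the left, which is an implementation assumption (bin-edge conventions differ between implementations), and the toy predictions are made up:

```python
def calibration_errors(confidences, correct, n_bins=15):
    """Return (ECE, MCE) computed over equal-width confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        b = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[b].append((conf, hit))
    n = len(confidences)
    ece, mce = 0.0, 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        gap = abs(accuracy - avg_conf)
        ece += len(members) / n * gap
        mce = max(mce, gap)
    return ece, mce

# Four predictions at 95% confidence (three correct), two at 55% (one correct)
confidences = [0.95, 0.95, 0.95, 0.95, 0.55, 0.55]
correct = [1, 1, 1, 0, 1, 0]

ece, mce = calibration_errors(confidences, correct, n_bins=10)
```

Here the high-confidence bin has a gap of 0.20 and the other bin a gap of 0.05, giving an ECE of about 0.15 (the size-weighted average) and an MCE of about 0.20 (the worst bin).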

### Brier score

The [Brier score](/wiki/brier_score) measures the mean squared error between predicted probabilities and actual outcomes:

$$
\mathrm{BS} = \frac{1}{n} \sum_{i=1}^{n} (p_i - y_i)^2
$$

where $$p_i$$ is the predicted probability and $$y_i$$ is the true label (0 or 1). Lower Brier scores indicate better probabilistic predictions. The score was introduced by the American meteorologist Glenn W. Brier in 1950 in the paper "Verification of Forecasts Expressed in Terms of Probability," published in Monthly Weather Review (volume 78, pages 1-3). [19]

The Brier score can be decomposed into three components:

| Component | Meaning |
|---|---|
| Reliability (calibration) | Measures how well predicted probabilities match observed frequencies. Lower is better. |
| Resolution | Measures how much the observed outcome frequencies for each forecast value differ from the overall base rate. Higher is better. |
| Uncertainty | Reflects the inherent difficulty of the prediction task based on the class distribution. Cannot be controlled by the model. |

Unlike ECE, the Brier score is a proper scoring rule, meaning its expected value is minimized when the predicted probabilities equal the true conditional probabilities.
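The score and its three components can be checked numerically. The sketch below groups forecasts by their distinct values and uses the standard identity BS = reliability - resolution + uncertainty, which is not spelled out above; the forecasts and outcomes are invented for the example:

```python
from collections import defaultdict

def brier_score(probs, labels):
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)

def brier_decomposition(probs, labels):
    """Return (reliability, resolution, uncertainty), grouping by distinct forecast value."""
    n = len(probs)
    base_rate = sum(labels) / n
    groups = defaultdict(list)
    for p, y in zip(probs, labels):
        groups[p].append(y)
    reliability = sum(
        len(ys) * (p - sum(ys) / len(ys)) ** 2 for p, ys in groups.items()
    ) / n
    resolution = sum(
        len(ys) * (sum(ys) / len(ys) - base_rate) ** 2 for ys in groups.values()
    ) / n
    uncertainty = base_rate * (1.0 - base_rate)
    return reliability, resolution, uncertainty

# Hypothetical forecasts: 20% forecasts verify 25% of the time, 80% forecasts 75%
probs = [0.2, 0.2, 0.2, 0.2, 0.8, 0.8, 0.8, 0.8]
labels = [0, 0, 0, 1, 1, 1, 1, 0]

bs = brier_score(probs, labels)
reliability, resolution, uncertainty = brier_decomposition(probs, labels)
```

For this data the Brier score is 0.19, which matches 0.0025 (reliability) - 0.0625 (resolution) + 0.25 (uncertainty).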

### Negative log-likelihood

Negative log-likelihood (NLL), also known as log loss or cross-entropy loss, can also serve as a calibration metric. It is defined as:

$$
\mathrm{NLL} = -\frac{1}{n} \sum_{i=1}^{n} \log(p_{i,y_i})
$$

where $$p_{i,y_i}$$ is the predicted probability for the true class of sample i. NLL is a proper scoring rule and is the metric most commonly optimized when learning post-hoc calibration parameters (such as the temperature in temperature scaling).
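A small example shows how NLL separates two models with identical accuracy. The probability vectors are hypothetical:

```python
import math

def negative_log_likelihood(prob_rows, labels):
    """Mean of -log(probability assigned to the true class)."""
    return -sum(math.log(row[y]) for row, y in zip(prob_rows, labels)) / len(labels)

labels = [0, 1]

# Both models predict class 0 for both samples, so each gets one right and one wrong
hedged = [[0.6, 0.4], [0.6, 0.4]]
overconfident = [[0.99, 0.01], [0.99, 0.01]]

nll_hedged = negative_log_likelihood(hedged, labels)
nll_overconfident = negative_log_likelihood(overconfident, labels)
```

The hedged model scores roughly 0.71, while the overconfident one scores roughly 2.31, because the single confident mistake contributes $$-\log(0.01) \approx 4.6$$ to the sum.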

## Comparison of calibration metrics

| Metric | Proper scoring rule | Sensitivity to extreme errors | Bin-dependent | Common use |
|---|---|---|---|---|
| ECE | No | Low (averages over bins) | Yes | General calibration assessment |
| MCE | No | High (reports worst bin) | Yes | Safety-critical applications |
| Brier score | Yes | Moderate | No | Overall probability quality |
| NLL | Yes | High (penalizes confident wrong predictions heavily) | No | Optimizing calibration parameters |

## What is a calibration layer used for?

### Medical diagnosis

In clinical decision support systems, calibrated probabilities directly inform treatment decisions. A physician presented with a "92% probability of malignancy" needs that number to reflect reality. Miscalibrated models can lead to unnecessary biopsies (if malignancy probabilities are overestimated for benign cases) or missed diagnoses (if they are underestimated for malignant cases). Post-hoc calibration methods like temperature scaling and isotonic regression are routinely applied to medical imaging models.

### Autonomous driving

[Self-driving](/wiki/autonomous_driving) vehicles use neural networks for [object detection](/wiki/object_detection), pedestrian detection, and lane recognition. Calibrated confidence scores are needed so that the vehicle's planning system can appropriately weigh conflicting perceptions. For example, a miscalibrated pedestrian detector that assigns 99% confidence to false positives could cause unnecessary emergency braking, while one that assigns low confidence to true positives could miss actual pedestrians.

### Weather forecasting

Probabilistic weather forecasting has a long history of calibration evaluation. When a weather model predicts a 30% chance of rain, it should rain on approximately 30% of such occasions over time. The Brier score was originally developed in 1950 by Glenn Brier specifically for evaluating weather probability forecasts. [19]

### Natural language processing

[Large language models](/wiki/large_language_model) (LLMs) also exhibit calibration issues. Research has shown that LLMs can be overconfident in incorrect answers and underconfident in correct ones, with calibration quality varying significantly depending on prompt framing, task difficulty, and model size. [15] Calibration of LLMs remains an active research area, with techniques like verbalized confidence estimation and consistency-based methods being explored alongside traditional post-hoc approaches. [15]

### Financial risk modeling

In credit scoring, insurance pricing, and fraud detection, calibrated probability estimates directly translate to monetary outcomes. Overconfident fraud detection models may cause excessive false alerts, while underconfident models miss actual fraud. Regulatory frameworks in finance often require that risk models produce well-calibrated probability estimates.

## Software implementations

| Library | Language | Calibration methods supported |
|---|---|---|
| [scikit-learn](/wiki/scikit-learn) (CalibratedClassifierCV) | Python | Platt scaling (sigmoid), isotonic regression, temperature scaling |
| [TensorFlow](/wiki/tensorflow) Probability | Python | Temperature scaling, custom calibration layers |
| [PyTorch](/wiki/pytorch) (torch.nn) | Python | Temperature scaling via custom modules |
| Netcal | Python | Over 20 methods including BBQ, beta calibration, spline calibration |
| calibration (R package) | R | Platt scaling, isotonic regression |

In scikit-learn, calibration can be applied as follows: [18]

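```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic binary data so the example runs end to end
X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Unfitted base classifier; SVC exposes decision_function scores
base_clf = SVC()

# Calibrate using Platt scaling (sigmoid). With cv=5, clones of base_clf are
# fit on each training split and calibrated on the corresponding held-out fold,
# so the base classifier does not need to be fit beforehand.
calibrated_clf = CalibratedClassifierCV(base_clf, method='sigmoid', cv=5)
calibrated_clf.fit(X_train, y_train)

# Get calibrated probabilities, shape (n_samples, n_classes)
calibrated_probs = calibrated_clf.predict_proba(X_test)
```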

For neural networks in PyTorch, temperature scaling can be implemented as a simple calibration layer:

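```python
import torch
import torch.nn as nn

class TemperatureScaling(nn.Module):
    """Divides logits by a single learned temperature T, initialized to 1."""

    def __init__(self):
        super().__init__()
        # The only trainable parameter; the base model's weights stay frozen
        self.temperature = nn.Parameter(torch.ones(1))

    def forward(self, logits):
        # Returns scaled logits; apply softmax afterwards to obtain probabilities
        return logits / self.temperature
```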

## Historical development

| Year | Development | Authors |
|---|---|---|
| 1950 | Brier score introduced for weather probability verification | Glenn W. Brier |
| 1999 | Platt scaling proposed for SVMs | John Platt |
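| 2001 | Histogram binning proposed for calibrating decision trees and naive Bayesian classifiers | Zadrozny and Elkan |
| 2002 | Isotonic regression applied to classifier calibration | Zadrozny and Elkan |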
| 2005 | Comparative study of calibration across multiple classifiers | Niculescu-Mizil and Caruana |
| 2015 | Bayesian Binning into Quantiles (BBQ) | Naeini, Cooper, and Hauskrecht |
| 2017 | Discovery that modern deep networks are miscalibrated; temperature scaling proposed | Guo, Pleiss, Sun, and Weinberger |
| 2017 | Beta calibration proposed | Kull, Silva Filho, and Flach |
| 2019 | Dirichlet calibration for multiclass problems | Kull, Perello Nieto, Kangsepp, et al. |
| 2020 | Focal loss shown to improve calibration during training | Mukhoti, Kulharia, Sanyal, et al. |
| 2021 | Large-scale study revisiting the calibration of modern neural networks | Minderer, Djolonga, Romijnders, et al. |

## See also

- [Calibration (machine learning)](/wiki/calibration)
- [Expected calibration error](/wiki/expected_calibration_error)
- [Softmax](/wiki/softmax)
- [Cross-entropy](/wiki/cross-entropy)
- [Logistic regression](/wiki/logistic_regression)
- [Loss function](/wiki/loss_function)
- [Overfitting](/wiki/overfitting)
- [Regularization](/wiki/regularization)
- [Confusion matrix](/wiki/confusion_matrix)
- [ROC curve](/wiki/roc_receiver_operating_characteristic_curve)
- [Brier score](/wiki/brier_score)
- [Batch normalization](/wiki/batch_normalization)

## References

1. Platt, J. C. (1999). "Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods." *Advances in Large Margin Classifiers*, 10(3), 61-74.
2. Zadrozny, B., & Elkan, C. (2001). "Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers." *Proceedings of the 18th International Conference on Machine Learning (ICML)*, 609-616.
3. Niculescu-Mizil, A., & Caruana, R. (2005). "Predicting Good Probabilities with Supervised Learning." *Proceedings of the 22nd International Conference on Machine Learning (ICML)*, 625-632.
4. Lin, H.-T., Lin, C.-J., & Weng, R. C. (2007). "A Note on Platt's Probabilistic Outputs for Support Vector Machines." *Machine Learning*, 68(3), 267-276.
5. Naeini, M. P., Cooper, G. F., & Hauskrecht, M. (2015). "Obtaining Well Calibrated Probabilities Using Bayesian Binning." *Proceedings of the AAAI Conference on Artificial Intelligence*, 29(1).
6. Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). "On Calibration of Modern Neural Networks." *Proceedings of the 34th International Conference on Machine Learning (ICML)*, 1321-1330. https://arxiv.org/abs/1706.04599
7. Kull, M., Silva Filho, T., & Flach, P. (2017). "Beta calibration: a well-founded and easily implemented improvement on logistic calibration for binary classifiers." *Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS)*, 623-631.
8. Kull, M., Perello Nieto, M., Kangsepp, M., Silva Filho, T., Song, H., & Flach, P. (2019). "Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with Dirichlet calibration." *Advances in Neural Information Processing Systems (NeurIPS)*, 32.
9. Thulasidasan, S., Chennupati, G., Bilmes, J., Bhattacharya, T., & Michalak, S. (2019). "On Mixup Training: Improved Calibration and Predictive Uncertainty for Deep Neural Networks." *Advances in Neural Information Processing Systems (NeurIPS)*, 32.
10. Mukhoti, J., Kulharia, V., Sanyal, A., Golodetz, S., Torr, P. H. S., & Dokania, P. K. (2020). "Calibrating Deep Neural Networks using Focal Loss." *Advances in Neural Information Processing Systems (NeurIPS)*, 33.
11. Minderer, M., Djolonga, J., Romijnders, R., Hubis, F., Zhai, X., Houlsby, N., Tran, D., & Lucic, M. (2021). "Revisiting the Calibration of Modern Neural Networks." *Advances in Neural Information Processing Systems (NeurIPS)*, 34.
12. Gupta, K., Rahimi, A., Ajanthan, T., Mensink, T., Sminchisescu, C., & Hartley, R. (2021). "Calibration of Neural Networks using Splines." *International Conference on Learning Representations (ICLR)*.
13. Wang, C. (2023). "Calibration in Deep Learning: A Survey of the State-of-the-Art." *arXiv preprint arXiv:2308.01222*.
14. Silva Filho, T., Song, H., Perello Nieto, M., Santos-Rodriguez, R., Kull, M., & Flach, P. (2023). "Classifier calibration: a survey on how to assess and improve predicted class probabilities." *Machine Learning*, 112, 3211-3260.
15. Xiong, M., Hu, Z., Lu, X., Li, Y., Fu, J., He, J., & Hooi, B. (2024). "A Survey of Confidence Estimation and Calibration in Large Language Models." *Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)*.
16. Google for Developers. "Machine Learning Glossary: calibration layer." https://developers.google.com/machine-learning/glossary
17. scikit-learn developers. "1.16. Probability calibration." https://scikit-learn.org/stable/modules/calibration.html
18. scikit-learn developers. "CalibratedClassifierCV and the sigmoid vs isotonic guideline." https://scikit-learn.org/stable/modules/calibration.html
19. Brier, G. W. (1950). "Verification of Forecasts Expressed in Terms of Probability." *Monthly Weather Review*, 78(1), 1-3. https://doi.org/10.1175/1520-0493(1950)078<0001:VOFEIT>2.0.CO;2