Prediction bias is a systematic error in a machine learning model's outputs where the average of the model's predictions differs from the average of the actual observed values (ground truth labels) in the data. In a well-calibrated model, the mean predicted value should equal the mean observed value. When these two averages diverge, the model exhibits prediction bias, which signals underlying problems in the data, the training process, or the model itself.
Prediction bias is distinct from the statistical concept of estimator bias and from the bias term in a neural network. It is also different from the broader notion of algorithmic fairness bias, though prediction bias can contribute to unfair outcomes when it affects certain demographic subgroups more than others.
Prediction bias is formally defined as the difference between the mean of a model's predictions and the mean of the ground-truth labels:
Prediction Bias = Mean of Predictions - Mean of Observed Labels
For a classification model, consider a binary classifier predicting whether an email is spam. If 5% of emails in the dataset are actually spam, a model with zero prediction bias should predict approximately 5% spam on average across the dataset. If the model instead predicts an average spam probability of 8%, the prediction bias is +0.03 (or +3 percentage points), indicating the model systematically overestimates spam likelihood.
A prediction bias of zero does not guarantee that the model is perfect. A model can have zero prediction bias overall while still making large errors on individual examples; the errors simply cancel out when averaged. However, a nonzero prediction bias is a reliable warning sign that something is wrong.
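Because prediction bias is just a difference of two means, it is simple to compute. The sketch below uses made-up predicted probabilities and labels for the spam example above; the numbers are placeholders, not real model output:

```python
import numpy as np

# Hypothetical spam classifier outputs: predicted probabilities and true labels (1 = spam)
predicted_probs = np.array([0.15, 0.05, 0.90, 0.10, 0.40, 0.05, 0.70, 0.05])
true_labels     = np.array([0,    0,    1,    0,    0,    0,    1,    0])

# Prediction bias = mean of predictions - mean of observed labels
bias = predicted_probs.mean() - true_labels.mean()
print(f"mean prediction = {predicted_probs.mean():.3f}")   # 0.300
print(f"mean label      = {true_labels.mean():.3f}")       # 0.250
print(f"prediction bias = {bias:+.3f}")                     # +0.050 (overestimation)
```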
Prediction bias can arise from several causes. Identifying the root cause is essential for choosing the right corrective strategy.
| Source | Description | Example |
|---|---|---|
| Biased or unrepresentative training data | The training set does not reflect the true distribution of the target population. Sampling biases, historical biases, or label noise can skew the data. | A loan default model trained mostly on high-income applicants underestimates default risk for low-income applicants. |
| Missing or insufficient features | The model lacks access to important predictor variables that influence the outcome. This is sometimes called omitted variable bias. | A disease prediction model missing a patient's smoking status may systematically misjudge risk. |
| Buggy data preprocessing | Errors in the data pipeline, such as flawed imputation, incorrect scaling, or dropped records, introduce systematic distortions. | A feature normalization bug that clips values above a threshold causes the model to underestimate high-value predictions. |
| Excessive regularization | Too-strong regularization forces the model toward overly simple solutions, preventing it from capturing real patterns in the data. | Heavy L2 regularization shrinks coefficients so aggressively that the model underfits and predicts values closer to the mean. |
| Wrong model complexity | A model that is too simple (underfitting) or too complex (overfitting) relative to the true data-generating process will produce biased predictions. | A linear model applied to a nonlinear relationship consistently misestimates outcomes in certain value ranges. |
| Training pipeline bugs | Software errors during training, such as incorrect loss function implementation, data leakage, or improper shuffling. | A classification pipeline that accidentally includes the label as a feature produces misleadingly perfect predictions on training data but biased predictions on new data. |
Detecting prediction bias requires looking beyond single-example errors and examining patterns across groups of examples.
The most common diagnostic tool is the calibration plot (also called a reliability diagram). To build one:

1. Group the model's predictions into bins by predicted probability (for example, 10 equal-width bins spanning 0 to 1).
2. For each bin, compute the mean predicted probability and the observed fraction of positive labels.
3. Plot the observed fraction (y-axis) against the mean predicted probability (x-axis). A perfectly calibrated model produces points along the diagonal y = x.
Deviations from the diagonal reveal prediction bias. Points above the diagonal indicate the model underestimates the true rate in that range; points below indicate overestimation.
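As a concrete illustration, the sketch below draws a calibration plot with scikit-learn's `calibration_curve`. The synthetic dataset and logistic regression model are placeholders standing in for whatever model is being audited:

```python
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder data and model (illustrative only)
X, y = make_classification(n_samples=5000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

# Observed positive fraction vs. mean predicted probability, per bin
prob_true, prob_pred = calibration_curve(
    y_test, model.predict_proba(X_test)[:, 1], n_bins=10
)

plt.plot(prob_pred, prob_true, marker="o", label="model")
plt.plot([0, 1], [0, 1], linestyle="--", label="perfect calibration")
plt.xlabel("Mean predicted probability")
plt.ylabel("Observed fraction of positives")
plt.legend()
plt.show()
```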
The Expected Calibration Error (ECE) is a single-number summary of calibration quality. It computes a weighted average of the absolute difference between predicted confidence and actual accuracy across bins:
ECE = Sum over all bins of (fraction of samples in bin) * |accuracy in bin - confidence in bin|
More concretely, predictions are divided into M bins. For each bin b, the ECE computes the absolute difference between the fraction of positive examples (accuracy) and the average predicted probability (confidence), weighted by the proportion of samples falling in that bin.
A perfectly calibrated model has an ECE of zero. Lower ECE values indicate better calibration. However, ECE has known limitations: it is sensitive to the number of bins chosen, and it can mask within-bin miscalibration. Variants such as Adaptive Calibration Error (ACE) use flexible bin boundaries to address some of these shortcomings.
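Under the binary-classification definition above (per-bin fraction of positives versus per-bin mean predicted probability), a minimal ECE implementation with equal-width bins might look like the following; the example labels and probabilities are made up:

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """ECE for a binary classifier: weighted average over equal-width bins of
    |fraction of positives in bin - mean predicted probability in bin|."""
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each prediction to a bin using the interior edges
    bin_ids = np.clip(np.digitize(y_prob, bin_edges[1:-1]), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            accuracy = y_true[mask].mean()     # fraction of positives in the bin
            confidence = y_prob[mask].mean()   # mean predicted probability in the bin
            ece += mask.mean() * abs(accuracy - confidence)
    return ece

# Toy example
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
y_prob = [0.2, 0.1, 0.9, 0.3, 0.8, 0.7, 0.4, 0.95]
print(f"ECE: {expected_calibration_error(y_true, y_prob):.3f}")
```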
Overall prediction bias can be zero even when the model is severely biased for specific subgroups. Practitioners should compute prediction bias separately for meaningful slices of the data, such as demographic groups, geographic regions, or value ranges of key features. This disaggregated evaluation is critical for fairness auditing.
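One simple way to perform this disaggregated check, assuming the evaluation results sit in a pandas DataFrame with a hypothetical `group` column identifying the slice:

```python
import pandas as pd

# Hypothetical evaluation results; "group" marks the slice (e.g., a demographic group)
results = pd.DataFrame({
    "group":      ["A", "A", "A", "B", "B", "B"],
    "prediction": [0.30, 0.60, 0.20, 0.70, 0.80, 0.90],
    "label":      [0,    1,    0,    1,    1,    0],
})

# Prediction bias per slice: mean prediction minus mean observed label
means = results.groupby("group")[["prediction", "label"]].mean()
means["prediction_bias"] = means["prediction"] - means["label"]
print(means)
```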
When a trained model exhibits prediction bias, two broad strategies exist: fix the root cause (better data, better features, better model) or apply post-hoc calibration to the model's outputs.
| Strategy | Details |
|---|---|
| Improve data quality | Collect more representative training data, fix label errors, and address class imbalance through oversampling or undersampling. |
| Feature engineering | Add missing predictor variables, create interaction features, or apply domain-specific transformations to give the model access to the signal it needs. |
| Adjust model complexity | Use cross-validation to choose the right balance between underfitting and overfitting. Reduce regularization strength if the model is too constrained; add regularization if it is too flexible. |
| Fix preprocessing bugs | Audit the entire data pipeline for errors in imputation, encoding, scaling, and feature extraction. |
| Model selection | Try models with different inductive biases. For example, switch from a linear model to a tree-based ensemble if the relationship is nonlinear. |
Post-hoc calibration methods learn a mapping from the model's raw output scores to better-calibrated probabilities. These methods are applied after training is complete and do not require retraining the model.
Platt scaling, introduced by John Platt in 1999 in the context of support vector machines, fits a logistic regression model to the classifier's raw output scores. The calibrated probability is:
P(y = 1 | f(x)) = 1 / (1 + exp(A * f(x) + B))
where f(x) is the uncalibrated model output and A and B are scalar parameters learned by maximum likelihood estimation on a held-out calibration set.
| Aspect | Platt Scaling |
|---|---|
| Type | Parametric (sigmoid/logistic) |
| Parameters | 2 (A and B) |
| Data requirement | Works well with small calibration sets |
| Best for | Models with sigmoidal distortion (SVMs, boosted models) |
| Limitation | Assumes the calibration curve has a sigmoid shape; may not correct non-sigmoid distortions |
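The following sketch shows the idea behind Platt scaling by fitting a one-dimensional logistic regression to a classifier's raw scores on a held-out calibration set. The data, the LinearSVC base model, and the split are placeholders, and this is not the exact procedure scikit-learn uses internally (its LogisticRegression applies L2 regularization by default, and its learned weight and intercept correspond to -A and -B in the formula above):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Placeholder data split into training and held-out calibration sets
X, y = make_classification(n_samples=5000, random_state=0)
X_train, X_calib, y_train, y_calib = train_test_split(X, y, random_state=0)

# Train an uncalibrated margin classifier
svm = LinearSVC().fit(X_train, y_train)

# Fit a logistic (sigmoid) mapping from raw scores to probabilities on the calibration set
scores = svm.decision_function(X_calib).reshape(-1, 1)
platt = LogisticRegression().fit(scores, y_calib)

# Calibrated probabilities for the calibration-set scores
calibrated = platt.predict_proba(scores)[:, 1]
print(calibrated[:5])
```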
Isotonic regression fits a non-parametric, monotonically non-decreasing step function to map raw scores to calibrated probabilities. It minimizes the sum of squared differences between the actual labels and the calibrated outputs, subject to the constraint that the mapping is monotonically non-decreasing.
| Aspect | Isotonic Regression |
|---|---|
| Type | Non-parametric (step function) |
| Parameters | Variable (depends on data) |
| Data requirement | Needs approximately 1,000+ calibration samples to avoid overfitting |
| Best for | Any monotonic distortion; more flexible than Platt scaling |
| Limitation | Can overfit on small datasets; may introduce ties that affect ranking metrics |
Research has shown that isotonic regression generally outperforms Platt scaling when sufficient calibration data is available, because it can correct arbitrary monotonic distortions rather than only sigmoid-shaped ones.
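A minimal sketch of isotonic calibration with scikit-learn's `IsotonicRegression`, using the same kind of placeholder setup as the Platt scaling sketch above:

```python
from sklearn.datasets import make_classification
from sklearn.isotonic import IsotonicRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Placeholder data split into training and held-out calibration sets
X, y = make_classification(n_samples=5000, random_state=0)
X_train, X_calib, y_train, y_calib = train_test_split(X, y, random_state=0)

svm = LinearSVC().fit(X_train, y_train)
scores = svm.decision_function(X_calib)

# Fit a monotonically non-decreasing step function from raw scores to probabilities
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
calibrated = iso.fit_transform(scores, y_calib)
print(calibrated[:5])
```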
Temperature scaling is particularly popular for deep neural networks and multiclass classification. It divides the model's logits by a single learned scalar parameter T (the temperature) before applying the softmax function:
calibrated output = softmax(z / T)
where z is the vector of logits. A temperature T > 1 softens the probability distribution (reducing overconfidence), while T < 1 sharpens it. Temperature scaling does not change the model's top-1 accuracy because it does not alter the ranking of classes; it only adjusts the confidence of predictions.
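A common way to fit T is to minimize the negative log-likelihood of a held-out validation set. The sketch below does this with NumPy and SciPy; the logits and labels are made-up placeholders rather than output from a real network:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import log_softmax

def nll_at_temperature(T, logits, labels):
    """Negative log-likelihood of the labels under temperature-scaled softmax."""
    log_probs = log_softmax(logits / T, axis=1)
    return -log_probs[np.arange(len(labels)), labels].mean()

# Hypothetical validation logits (n_samples x n_classes) and integer class labels
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 1.5,  0.3],
                   [3.0, 2.8,  2.9],
                   [0.2, 0.1,  2.5]])
labels = np.array([0, 1, 2, 2])

# Fit the single temperature parameter on the held-out data
result = minimize_scalar(nll_at_temperature, bounds=(0.05, 10.0), method="bounded",
                         args=(logits, labels))
T = result.x
print(f"Learned temperature: {T:.2f}")

# Apply: calibrated output = softmax(z / T)
calibrated = np.exp(log_softmax(logits / T, axis=1))
```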
| Method | Parametric | Parameters | Multiclass Support | Ranking Preserved | Minimum Data |
|---|---|---|---|---|---|
| Platt Scaling | Yes | 2 | Via one-vs-rest | Yes | Small |
| Isotonic Regression | No | Variable | Via one-vs-rest | Sometimes | ~1,000+ samples |
| Temperature Scaling | Yes | 1 | Native | Yes | Moderate |
In scikit-learn, the CalibratedClassifierCV class implements both Platt scaling (method='sigmoid') and isotonic regression (method='isotonic'). It uses cross-validation to produce unbiased calibrated probabilities:
```python
from sklearn.calibration import CalibratedClassifierCV

calibrated_model = CalibratedClassifierCV(
    estimator=base_classifier,
    method='isotonic',
    cv=5
)
calibrated_model.fit(X_train, y_train)
probabilities = calibrated_model.predict_proba(X_test)
```
Calibration curves can be visualized using CalibrationDisplay.from_estimator() to compare the calibration quality of different models or calibration methods.
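Continuing the hypothetical example above (and assuming `base_classifier`, `X_train`, `y_train`, `X_test`, and `y_test` exist), a comparison plot might look like this:

```python
import matplotlib.pyplot as plt
from sklearn.calibration import CalibrationDisplay

# from_estimator requires a fitted estimator, so fit the base classifier for comparison
base_classifier.fit(X_train, y_train)

fig, ax = plt.subplots()
CalibrationDisplay.from_estimator(base_classifier, X_test, y_test,
                                  n_bins=10, name="uncalibrated", ax=ax)
CalibrationDisplay.from_estimator(calibrated_model, X_test, y_test,
                                  n_bins=10, name="isotonic", ax=ax)
plt.show()
```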
Prediction bias becomes a fairness concern when it varies across demographic or sensitive subgroups. A model might exhibit zero prediction bias overall but systematically overestimate outcomes for one group and underestimate them for another.
Calibration within groups requires that for every predicted probability p, approximately a p fraction of individuals in each group who receive that score actually belong to the positive class. If a model assigns a risk score of 0.8, that score should correspond to an 80% actual positive rate regardless of which subgroup the individual belongs to. Violations of this property can lead to disparate impact, where one group is systematically disadvantaged by the model's predictions.
An important theoretical result in algorithmic fairness is that three desirable properties cannot all hold simultaneously (except in trivial cases): calibration within groups, balance for the positive class, and balance for the negative class. This impossibility result, established by Chouldechova (2017) and Kleinberg, Mullainathan, and Raghavan (2016), means that practitioners must make deliberate choices about which fairness criteria to prioritize.
Best practices for addressing group-level prediction bias include:

- Computing prediction bias and calibration metrics separately for each relevant subgroup rather than relying on aggregate numbers.
- Recalibrating within groups when calibration curves differ materially across them.
- Improving the representativeness of training data for groups where the model is miscalibrated.
- Making an explicit, documented choice about which fairness criteria to prioritize, since the impossibility result above rules out satisfying all of them simultaneously.
While prediction bias is most commonly discussed in the context of classification (where it measures how well predicted probabilities match observed frequencies), it also applies to regression tasks. In regression, prediction bias manifests as a systematic tendency to overpredict or underpredict across certain value ranges.
Research has shown that machine learning regression models often exhibit a characteristic bias pattern: predictions for large-valued outcomes tend to be negatively biased (underestimated), while predictions for small-valued outcomes tend to be positively biased (overestimated). This phenomenon, sometimes called regression to the mean in predictions, is particularly pronounced in models that optimize mean squared error, as the loss function naturally encourages predictions near the center of the distribution.
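This pattern can be reproduced with a small synthetic experiment. The sketch below fits ordinary least squares to a noisy linear relationship and reports the bias conditional on the observed value; the data-generating process is purely illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Simple noisy linear relationship (illustrative only)
x = rng.normal(size=(5000, 1))
y = 2.0 * x[:, 0] + rng.normal(scale=1.0, size=5000)

pred = LinearRegression().fit(x, y).predict(x)

# Bias conditional on the observed value: underprediction for large y,
# overprediction for small y, even though the overall bias is near zero
for name, mask in [("smallest 10% of y", y <= np.quantile(y, 0.1)),
                   ("largest 10% of y",  y >= np.quantile(y, 0.9))]:
    print(f"{name}: mean(pred - y) = {(pred[mask] - y[mask]).mean():+.2f}")
print(f"overall: mean(pred - y) = {(pred - y).mean():+.2f}")
```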
Imagine you have a robot that guesses how many jellybeans are in different jars. You show it 100 jars, and it makes a guess for each one. After checking all the guesses, you notice that the robot's guesses average out to 250 jellybeans, but the actual average is only 200 jellybeans. The robot consistently guesses too high. That difference of 50 is the prediction bias.
A good robot should get individual jars wrong sometimes (guessing too high for some and too low for others), but on average, its guesses should land right around the true number. If the average guess is always too high or too low, something is off: maybe the robot was shown mostly big jars during practice, or maybe it is missing some clue (like jar shape) that would help it guess better.
Fixing prediction bias is like giving the robot better practice jars that look like the real ones, showing it more useful clues, or adding an adjustment step at the end that nudges all its guesses slightly down to compensate.
Prediction bias is related to but distinct from several other concepts in machine learning:

- Statistical estimator bias, the expected difference between an estimator and the parameter it estimates.
- The bias term (intercept) in a linear model or neural network layer, which is a learned parameter rather than an error.
- The bias component of the bias-variance tradeoff, which describes error arising from overly simple models.
- Algorithmic fairness bias, the broader concern of systematically disparate treatment of, or impact on, demographic groups.