Prediction bias is a systematic error in a machine learning model's outputs where the average of the model's predictions differs from the average of the actual observed values (ground truth labels) in the data. In a well-calibrated model, the mean predicted value should equal the mean observed value. When these two averages diverge, the model exhibits prediction bias, which signals underlying problems in the data, the training process, or the model itself.
Prediction bias is distinct from the statistical concept of estimator bias and from the bias term in a neural network. It is also different from the broader notion of algorithmic fairness bias, though prediction bias can contribute to unfair outcomes when it affects certain demographic subgroups more than others.
Prediction bias is formally defined as the difference between the mean of a model's predictions and the mean of the ground-truth labels:
Prediction Bias = Mean of Predictions - Mean of Observed Labels
For a classification model, consider a binary classifier predicting whether an email is spam. If 5% of emails in the dataset are actually spam, a model with zero prediction bias should predict approximately 5% spam on average across the dataset. If the model instead predicts an average spam probability of 8%, the prediction bias is +0.03 (or +3 percentage points), indicating the model systematically overestimates spam likelihood.
A prediction bias of zero does not guarantee that the model is perfect. A model can have zero prediction bias overall while still making large errors on individual examples; the errors simply cancel out when averaged. However, a nonzero prediction bias is a reliable warning sign that something is wrong.
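Because prediction bias is just a difference of two means, it is simple to compute. The sketch below uses made-up predicted probabilities and labels for the spam example above; the numbers are placeholders, not real model output:

```python
import numpy as np

# Hypothetical spam classifier outputs: predicted probabilities and true labels (1 = spam)
predicted_probs = np.array([0.15, 0.05, 0.90, 0.10, 0.40, 0.05, 0.70, 0.05])
true_labels     = np.array([0,    0,    1,    0,    0,    0,    1,    0])

# Prediction bias = mean of predictions - mean of observed labels
bias = predicted_probs.mean() - true_labels.mean()
print(f"mean prediction = {predicted_probs.mean():.3f}")   # 0.300
print(f"mean label      = {true_labels.mean():.3f}")       # 0.250
print(f"prediction bias = {bias:+.3f}")                     # +0.050 (overestimation)
```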
Prediction bias can arise from several causes. Identifying the root cause is essential for choosing the right corrective strategy.
| Source | Description | Example |
|---|---|---|
| Biased or unrepresentative training data | The training set does not reflect the true distribution of the target population. Sampling biases, historical biases, or label noise can skew the data. | A loan default model trained mostly on high-income applicants underestimates default risk for low-income applicants. |
| Missing or insufficient features | The model lacks access to important predictor variables that influence the outcome. This is sometimes called omitted variable bias. | A disease prediction model missing a patient's smoking status may systematically misjudge risk. |
| Buggy data preprocessing | Errors in the data pipeline, such as flawed imputation, incorrect scaling, or dropped records, introduce systematic distortions. | A feature normalization bug that clips values above a threshold causes the model to underestimate high-value predictions. |
| Excessive regularization | Too-strong regularization forces the model toward overly simple solutions, preventing it from capturing real patterns in the data. | Heavy L2 regularization shrinks coefficients so aggressively that the model underfits and predicts values closer to the mean. |
| Wrong model complexity | A model that is too simple (underfitting) or too complex (overfitting) relative to the true data-generating process will produce biased predictions. | A linear model applied to a nonlinear relationship consistently misestimates outcomes in certain value ranges. |
| Training pipeline bugs | Software errors during training, such as incorrect loss function implementation, data leakage, or improper shuffling. | A classification pipeline that accidentally includes the label as a feature produces misleadingly perfect predictions on training data but biased predictions on new data. |
Detecting prediction bias requires looking beyond single-example errors and examining patterns across groups of examples.
The most common diagnostic tool is the calibration plot (also called a reliability diagram). To build one:

1. Group the model's predictions into bins by predicted probability (for example, 10 equal-width bins spanning 0 to 1).
2. For each bin, compute the mean predicted probability and the observed fraction of positive labels.
3. Plot the observed fraction (y-axis) against the mean predicted probability (x-axis). A perfectly calibrated model produces points along the diagonal y = x.
Deviations from the diagonal reveal prediction bias. Points above the diagonal indicate the model underestimates the true rate in that range; points below indicate overestimation.
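As a concrete illustration, the sketch below draws a calibration plot with scikit-learn's `calibration_curve`. The synthetic dataset and logistic regression model are placeholders standing in for whatever model is being audited:

```python
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder data and model (illustrative only)
X, y = make_classification(n_samples=5000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

# Observed positive fraction vs. mean predicted probability, per bin
prob_true, prob_pred = calibration_curve(
    y_test, model.predict_proba(X_test)[:, 1], n_bins=10
)

plt.plot(prob_pred, prob_true, marker="o", label="model")
plt.plot([0, 1], [0, 1], linestyle="--", label="perfect calibration")
plt.xlabel("Mean predicted probability")
plt.ylabel("Observed fraction of positives")
plt.legend()
plt.show()
```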
The Expected Calibration Error (ECE) is a single-number summary of calibration quality. It computes a weighted average of the absolute difference between predicted confidence and actual accuracy across bins:
ECE = Sum over all bins of (fraction of samples in bin) * |accuracy in bin - confidence in bin|
More concretely, predictions are divided into M bins. For each bin b, the ECE computes the absolute difference between the fraction of positive examples (accuracy) and the average predicted probability (confidence), weighted by the proportion of samples falling in that bin.
A perfectly calibrated model has an ECE of zero. Lower ECE values indicate better calibration. However, ECE has known limitations: it is sensitive to the number of bins chosen, and it can mask within-bin miscalibration. Variants such as Adaptive Calibration Error (ACE) use flexible bin boundaries to address some of these shortcomings.
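Under the binary-classification definition above (per-bin fraction of positives versus per-bin mean predicted probability), a minimal ECE implementation with equal-width bins might look like the following; the example labels and probabilities are made up:

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """ECE for a binary classifier: weighted average over equal-width bins of
    |fraction of positives in bin - mean predicted probability in bin|."""
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each prediction to a bin using the interior edges
    bin_ids = np.clip(np.digitize(y_prob, bin_edges[1:-1]), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            accuracy = y_true[mask].mean()     # fraction of positives in the bin
            confidence = y_prob[mask].mean()   # mean predicted probability in the bin
            ece += mask.mean() * abs(accuracy - confidence)
    return ece

# Toy example
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
y_prob = [0.2, 0.1, 0.9, 0.3, 0.8, 0.7, 0.4, 0.95]
print(f"ECE: {expected_calibration_error(y_true, y_prob):.3f}")
```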
Overall prediction bias can be zero even when the model is severely biased for specific subgroups. Practitioners should compute prediction bias separately for meaningful slices of the data, such as demographic groups, geographic regions, or value ranges of key features. This disaggregated evaluation is critical for fairness auditing.
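One simple way to perform this disaggregated check, assuming the evaluation results sit in a pandas DataFrame with a hypothetical `group` column identifying the slice:

```python
import pandas as pd

# Hypothetical evaluation results; "group" marks the slice (e.g., a demographic group)
results = pd.DataFrame({
    "group":      ["A", "A", "A", "B", "B", "B"],
    "prediction": [0.30, 0.60, 0.20, 0.70, 0.80, 0.90],
    "label":      [0,    1,    0,    1,    1,    0],
})

# Prediction bias per slice: mean prediction minus mean observed label
means = results.groupby("group")[["prediction", "label"]].mean()
means["prediction_bias"] = means["prediction"] - means["label"]
print(means)
```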
When a trained model exhibits prediction bias, two broad strategies exist: fix the root cause (better data, better features, better model) or apply post-hoc calibration to the model's outputs.
| Strategy | Details |
|---|---|
| Improve data quality | Collect more representative training data, fix label errors, and address class imbalance through oversampling or undersampling. |
| Feature engineering | Add missing predictor variables, create interaction features, or apply domain-specific transformations to give the model access to the signal it needs. |
| Adjust model complexity | Use cross-validation to choose the right balance between underfitting and overfitting. Reduce regularization strength if the model is too constrained; add regularization if it is too flexible. |
| Fix preprocessing bugs | Audit the entire data pipeline for errors in imputation, encoding, scaling, and feature extraction. |
| Model selection | Try models with different inductive biases. For example, switch from a linear model to a tree-based ensemble if the relationship is nonlinear. |
Post-hoc calibration methods learn a mapping from the model's raw output scores to better-calibrated probabilities. These methods are applied after training is complete and do not require retraining the model.
Platt scaling, introduced by John Platt in 1999 in the context of support vector machines, fits a logistic regression model to the classifier's raw output scores. The calibrated probability is:
P(y = 1 | f(x)) = 1 / (1 + exp(A * f(x) + B))
where f(x) is the uncalibrated model output and A and B are scalar parameters learned by maximum likelihood estimation on a held-out calibration set.
| Aspect | Platt Scaling |
|---|---|
| Type | Parametric (sigmoid/logistic) |
| Parameters | 2 (A and B) |
| Data requirement | Works well with small calibration sets |
| Best for | Models with sigmoidal distortion (SVMs, boosted models) |
| Limitation | Assumes the calibration curve has a sigmoid shape; may not correct non-sigmoid distortions |
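The following sketch shows the idea behind Platt scaling by fitting a one-dimensional logistic regression to a classifier's raw scores on a held-out calibration set. The data, the LinearSVC base model, and the split are placeholders, and this is not the exact procedure scikit-learn uses internally (its LogisticRegression applies L2 regularization by default, and its learned weight and intercept correspond to -A and -B in the formula above):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Placeholder data split into training and held-out calibration sets
X, y = make_classification(n_samples=5000, random_state=0)
X_train, X_calib, y_train, y_calib = train_test_split(X, y, random_state=0)

# Train an uncalibrated margin classifier
svm = LinearSVC().fit(X_train, y_train)

# Fit a logistic (sigmoid) mapping from raw scores to probabilities on the calibration set
scores = svm.decision_function(X_calib).reshape(-1, 1)
platt = LogisticRegression().fit(scores, y_calib)

# Calibrated probabilities for the calibration-set scores
calibrated = platt.predict_proba(scores)[:, 1]
print(calibrated[:5])
```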
Isotonic regression fits a non-parametric, monotonically non-decreasing step function to map raw scores to calibrated probabilities. It minimizes the sum of squared differences between the actual labels and the calibrated outputs, subject to the constraint that the mapping is monotonically non-decreasing.
| Aspect | Isotonic Regression |
|---|---|
| Type | Non-parametric (step function) |
| Parameters | Variable (depends on data) |
| Data requirement | Needs approximately 1,000+ calibration samples to avoid overfitting |
| Best for | Any monotonic distortion; more flexible than Platt scaling |
| Limitation | Can overfit on small datasets; may introduce ties that affect ranking metrics |
Research has shown that isotonic regression generally outperforms Platt scaling when sufficient calibration data is available, because it can correct arbitrary monotonic distortions rather than only sigmoid-shaped ones.
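A minimal sketch of isotonic calibration with scikit-learn's `IsotonicRegression`, using the same kind of placeholder setup as the Platt scaling sketch above:

```python
from sklearn.datasets import make_classification
from sklearn.isotonic import IsotonicRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Placeholder data split into training and held-out calibration sets
X, y = make_classification(n_samples=5000, random_state=0)
X_train, X_calib, y_train, y_calib = train_test_split(X, y, random_state=0)

svm = LinearSVC().fit(X_train, y_train)
scores = svm.decision_function(X_calib)

# Fit a monotonically non-decreasing step function from raw scores to probabilities
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
calibrated = iso.fit_transform(scores, y_calib)
print(calibrated[:5])
```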
Temperature scaling is particularly popular for deep neural networks and multiclass classification. It divides the model's logits by a single learned scalar parameter T (the temperature) before applying the softmax function:
calibrated output = softmax(z / T)
where z is the vector of logits. A temperature T > 1 softens the probability distribution (reducing overconfidence), while T < 1 sharpens it. Temperature scaling does not change the model's top-1 accuracy because it does not alter the ranking of classes; it only adjusts the confidence of predictions.
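A common way to fit T is to minimize the negative log-likelihood of a held-out validation set. The sketch below does this with NumPy and SciPy; the logits and labels are made-up placeholders rather than output from a real network:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import log_softmax

def nll_at_temperature(T, logits, labels):
    """Negative log-likelihood of the labels under temperature-scaled softmax."""
    log_probs = log_softmax(logits / T, axis=1)
    return -log_probs[np.arange(len(labels)), labels].mean()

# Hypothetical validation logits (n_samples x n_classes) and integer class labels
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 1.5,  0.3],
                   [3.0, 2.8,  2.9],
                   [0.2, 0.1,  2.5]])
labels = np.array([0, 1, 2, 2])

# Fit the single temperature parameter on the held-out data
result = minimize_scalar(nll_at_temperature, bounds=(0.05, 10.0), method="bounded",
                         args=(logits, labels))
T = result.x
print(f"Learned temperature: {T:.2f}")

# Apply: calibrated output = softmax(z / T)
calibrated = np.exp(log_softmax(logits / T, axis=1))
```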
| Method | Parametric | Parameters | Multiclass Support | Ranking Preserved | Minimum Data |
|---|---|---|---|---|---|
| Platt Scaling | Yes | 2 | Via one-vs-rest | Yes | Small |
| Isotonic Regression | No | Variable | Via one-vs-rest | Sometimes | ~1,000+ samples |
| Temperature Scaling | Yes | 1 | Native | Yes | Moderate |
In scikit-learn, the CalibratedClassifierCV class implements both Platt scaling (method='sigmoid') and isotonic regression (method='isotonic'). It uses cross-validation to produce unbiased calibrated probabilities:
```python
from sklearn.calibration import CalibratedClassifierCV

calibrated_model = CalibratedClassifierCV(
    estimator=base_classifier,
    method='isotonic',
    cv=5
)
calibrated_model.fit(X_train, y_train)
probabilities = calibrated_model.predict_proba(X_test)
```
Calibration curves can be visualized using CalibrationDisplay.from_estimator() to compare the calibration quality of different models or calibration methods.
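Continuing the hypothetical example above (and assuming `base_classifier`, `X_train`, `y_train`, `X_test`, and `y_test` exist), a comparison plot might look like this:

```python
import matplotlib.pyplot as plt
from sklearn.calibration import CalibrationDisplay

# from_estimator requires a fitted estimator, so fit the base classifier for comparison
base_classifier.fit(X_train, y_train)

fig, ax = plt.subplots()
CalibrationDisplay.from_estimator(base_classifier, X_test, y_test,
                                  n_bins=10, name="uncalibrated", ax=ax)
CalibrationDisplay.from_estimator(calibrated_model, X_test, y_test,
                                  n_bins=10, name="isotonic", ax=ax)
plt.show()
```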
Prediction bias becomes a fairness concern when it varies across demographic or sensitive subgroups. A model might exhibit zero prediction bias overall but systematically overestimate outcomes for one group and underestimate them for another.
Calibration within groups requires that for every predicted probability p, approximately a p fraction of individuals in each group who receive that score actually belong to the positive class. If a model assigns a risk score of 0.8, that score should correspond to an 80% actual positive rate regardless of which subgroup the individual belongs to. Violations of this property can lead to disparate impact, where one group is systematically disadvantaged by the model's predictions.
An important theoretical result in algorithmic fairness is that three desirable properties cannot all hold simultaneously (except in trivial cases): calibration within groups, balance for the positive class, and balance for the negative class. This impossibility result, established by Chouldechova (2017) and Kleinberg, Mullainathan, and Raghavan (2016), means that practitioners must make deliberate choices about which fairness criteria to prioritize.
Best practices for addressing group-level prediction bias include:

- Computing prediction bias and calibration metrics separately for each relevant subgroup rather than relying on aggregate numbers.
- Recalibrating within groups when calibration curves differ materially across them.
- Improving the representativeness of training data for groups where the model is miscalibrated.
- Making an explicit, documented choice about which fairness criteria to prioritize, since the impossibility result above rules out satisfying all of them simultaneously.
While prediction bias is most commonly discussed in the context of classification (where it measures how well predicted probabilities match observed frequencies), it also applies to regression tasks. In regression, prediction bias manifests as a systematic tendency to overpredict or underpredict across certain value ranges.
Research has shown that machine learning regression models often exhibit a characteristic bias pattern: predictions for large-valued outcomes tend to be negatively biased (underestimated), while predictions for small-valued outcomes tend to be positively biased (overestimated). This phenomenon, sometimes called regression to the mean in predictions, is particularly pronounced in models that optimize mean squared error, as the loss function naturally encourages predictions near the center of the distribution.
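This pattern can be reproduced with a small synthetic experiment. The sketch below fits ordinary least squares to a noisy linear relationship and reports the bias conditional on the observed value; the data-generating process is purely illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Simple noisy linear relationship (illustrative only)
x = rng.normal(size=(5000, 1))
y = 2.0 * x[:, 0] + rng.normal(scale=1.0, size=5000)

pred = LinearRegression().fit(x, y).predict(x)

# Bias conditional on the observed value: underprediction for large y,
# overprediction for small y, even though the overall bias is near zero
for name, mask in [("smallest 10% of y", y <= np.quantile(y, 0.1)),
                   ("largest 10% of y",  y >= np.quantile(y, 0.9))]:
    print(f"{name}: mean(pred - y) = {(pred[mask] - y[mask]).mean():+.2f}")
print(f"overall: mean(pred - y) = {(pred - y).mean():+.2f}")
```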
Imagine you have a robot that guesses how many jellybeans are in different jars. You show it 100 jars, and it makes a guess for each one. After checking all the guesses, you notice that the robot's guesses average out to 250 jellybeans, but the actual average is only 200 jellybeans. The robot consistently guesses too high. That difference of 50 is the prediction bias.
A good robot should get individual jars wrong sometimes (guessing too high for some and too low for others), but on average, its guesses should land right around the true number. If the average guess is always too high or too low, something is off: maybe the robot was shown mostly big jars during practice, or maybe it is missing some clue (like jar shape) that would help it guess better.
Fixing prediction bias is like giving the robot better practice jars that look like the real ones, showing it more useful clues, or adding an adjustment step at the end that nudges all its guesses slightly down to compensate.
Prediction bias is related to but distinct from several other concepts in machine learning:

- Statistical estimator bias, the expected difference between an estimator and the parameter it estimates.
- The bias term (intercept) in a linear model or neural network layer, which is a learned parameter rather than an error.
- The bias component of the bias-variance tradeoff, which describes error arising from overly simple models.
- Algorithmic fairness bias, the broader concern of systematically disparate treatment of, or impact on, demographic groups.