The bias-variance tradeoff is a foundational concept in machine learning and statistics that describes the tension between two competing sources of error in predictive models. Any supervised learning algorithm must balance bias (error from overly simplistic assumptions) against variance (error from excessive sensitivity to fluctuations in the training data). Understanding this tradeoff is central to building models that generalize well to unseen data.
The bias-variance tradeoff refers to the observation that, as a model's complexity increases, its bias tends to decrease while its variance tends to increase, and vice versa. The total prediction error on unseen data can be decomposed into three components: bias squared, variance, and irreducible error (noise). The goal of any learning algorithm is to minimize the sum of these three quantities, but reducing bias typically raises variance and reducing variance typically raises bias.
This concept applies broadly across regression, classification, and other prediction tasks. It provides a theoretical lens for understanding why simple models underfit, why complex models overfit, and why the best model lies somewhere in between (at least in the classical view).
Bias refers to the error introduced by approximating a real-world problem with a simplified model. A model with high bias pays little attention to the training data and makes strong assumptions about the form of the underlying function. This leads to systematic errors that persist regardless of how much training data is provided.
Characteristics of high-bias models:
| Property | Description |
|---|---|
| Training error | High |
| Test error | High |
| Gap between training and test error | Small |
| Model complexity | Low |
| Common name | Underfitting |
For example, fitting a straight line to data that follows a quadratic curve produces high bias. No matter how many data points are collected, the linear model cannot capture the curvature. The model is "biased" toward a linear relationship that does not match reality.
Common high-bias models include linear regression with too few features, shallow decision trees, and heavily constrained models. Bias can also arise when important features are omitted from the model.
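The structural nature of bias can be seen in a minimal NumPy sketch (the data here is synthetic and illustrative): a straight line fit to quadratic data keeps a large error that no amount of additional data removes, while a model of the right form drives the error down to the noise level.

```python
import numpy as np

rng = np.random.default_rng(0)

# True relationship is quadratic; a linear model cannot represent it.
x = np.linspace(-3, 3, 200)
y = x**2 + rng.normal(0, 0.1, size=x.shape)  # small noise

# Fit a straight line (degree-1 polynomial): a high-bias model.
slope, intercept = np.polyfit(x, y, deg=1)
linear_pred = slope * x + intercept

# Fit a quadratic (degree-2): matches the true functional form.
quad_pred = np.polyval(np.polyfit(x, y, deg=2), x)

linear_mse = np.mean((y - linear_pred) ** 2)
quad_mse = np.mean((y - quad_pred) ** 2)

# The linear model's error is structural, not statistical: it persists
# no matter how many points are drawn from the same quadratic curve.
print(f"linear MSE: {linear_mse:.3f}, quadratic MSE: {quad_mse:.3f}")
```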
Variance refers to the amount by which the model's predictions change when trained on different subsets of the data. A model with high variance is highly sensitive to the specific training examples it sees. It captures not only the true signal in the data but also the random noise, leading to poor performance on new data.
Characteristics of high-variance models:
| Property | Description |
|---|---|
| Training error | Very low (often near zero) |
| Test error | High |
| Gap between training and test error | Large |
| Model complexity | High |
| Common name | Overfitting |
For example, fitting a high-degree polynomial to a small dataset might pass through every training point exactly, but the resulting curve could oscillate wildly between data points. When new data arrives, the model's predictions are unreliable because they were shaped by noise rather than signal.
Common high-variance models include deep neural networks without regularization, very deep decision trees, and k-nearest neighbors with k=1.
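The polynomial example above can be reproduced in a few lines of NumPy (synthetic data, illustrative constants): with 15 training points, a degree-14 polynomial passes through every point, so its training error is essentially zero while its error on fresh data is much larger.

```python
import numpy as np

rng = np.random.default_rng(1)

# A small noisy sample from a smooth underlying function.
n = 15
x_train = np.sort(rng.uniform(-1, 1, n))
y_train = np.sin(3 * x_train) + rng.normal(0, 0.2, n)

x_test = np.linspace(-1, 1, 200)
y_test = np.sin(3 * x_test)  # noiseless targets for evaluation

def poly_mse(deg):
    # Train/test mean squared error of a degree-`deg` polynomial fit.
    coeffs = np.polyfit(x_train, y_train, deg)
    train_mse = np.mean((y_train - np.polyval(coeffs, x_train)) ** 2)
    test_mse = np.mean((y_test - np.polyval(coeffs, x_test)) ** 2)
    return train_mse, test_mse

low_train, low_test = poly_mse(3)   # moderate complexity
hi_train, hi_test = poly_mse(14)    # degree n-1: interpolates the sample

# The degree-14 fit memorizes the noise: near-zero training error,
# a large train/test gap, and worse generalization than the cubic.
```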
The bias-variance decomposition provides a formal framework for understanding prediction error. Consider a regression setting where the true relationship between input x and output y is:
y = f(x) + epsilon
where f(x) is the true function and epsilon is noise with mean zero and variance sigma squared. A learning algorithm trained on dataset D produces a model h_D(x). The expected prediction error, taken over all possible training sets D and all test points, can be decomposed as follows.
Start with the expected mean squared error (MSE) for a single test point x:
E[(h_D(x) - y)^2]
Step 1: Introduce the expected prediction h-bar(x) = E_D[h_D(x)], which is the average prediction across all possible training sets. Add and subtract h-bar(x) inside the squared term:
E[((h_D(x) - h-bar(x)) + (h-bar(x) - y))^2]
Step 2: Expand the square:
E[(h_D(x) - h-bar(x))^2] + 2 * E[(h_D(x) - h-bar(x)) * (h-bar(x) - y)] + E[(h-bar(x) - y)^2]
The cross-term vanishes: taking the expectation over D first, E_D[h_D(x) - h-bar(x)] = 0 by definition of h-bar, and the other factor (h-bar(x) - y) does not depend on D because the test noise is independent of the training set. This leaves:
E_D[(h_D(x) - h-bar(x))^2] + E_y[(h-bar(x) - y)^2]
The first term is the variance of the model.
Step 3: Decompose the second term by introducing y-bar(x) = E[y|x] = f(x), the true conditional mean. Add and subtract y-bar(x):
E[((h-bar(x) - y-bar(x)) + (y-bar(x) - y))^2]
Expand and note that the cross-term again vanishes:
(h-bar(x) - y-bar(x))^2 + E[(y-bar(x) - y)^2]
The first term is the bias squared, and the second term is the irreducible error (noise variance, sigma squared).
The complete decomposition is:
Expected Test Error = Bias^2 + Variance + Irreducible Error
| Component | Formula | Interpretation |
|---|---|---|
| Bias^2 | (h-bar(x) - f(x))^2 | Systematic error from model assumptions; how far the average prediction is from the true function |
| Variance | E_D[(h_D(x) - h-bar(x))^2] | Sensitivity of predictions to training set variation; how much predictions fluctuate across different training sets |
| Irreducible Error | sigma^2 = E[(y - f(x))^2] | Noise inherent in the data; cannot be eliminated by any model |
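The decomposition can be checked numerically with a Monte Carlo sketch (NumPy, synthetic data, illustrative constants): train the same model on many independently drawn training sets, measure bias squared and variance at one test point, and compare their sum plus sigma squared against the empirical expected error.

```python
import numpy as np

rng = np.random.default_rng(42)

def f(x):
    return np.sin(2 * np.pi * x)  # the true function

sigma = 0.3                       # noise std; sigma^2 is the irreducible error
x0 = 0.35                         # a fixed test point
n_train, n_sets, degree = 30, 2000, 3

preds = np.empty(n_sets)
for i in range(n_sets):
    # Draw a fresh training set D and fit h_D.
    x = rng.uniform(0, 1, n_train)
    y = f(x) + rng.normal(0, sigma, n_train)
    preds[i] = np.polyval(np.polyfit(x, y, degree), x0)

h_bar = preds.mean()              # average prediction h-bar(x0)
bias_sq = (h_bar - f(x0)) ** 2
variance = preds.var()

# Empirical expected test error at x0, using fresh noisy targets.
y_new = f(x0) + rng.normal(0, sigma, n_sets)
emp_mse = np.mean((preds - y_new) ** 2)

# The decomposition predicts emp_mse is close to bias_sq + variance + sigma^2.
```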
This decomposition holds for squared loss. For other loss functions (such as 0-1 loss in classification), analogous but more complex decompositions exist. The classification case involves different definitions of bias and variance, as proposed by Domingos (2000) and others.
The classical view of the bias-variance tradeoff can be visualized as a U-shaped curve when total error is plotted against model complexity.
Low complexity (high bias, low variance): Simple models make strong assumptions. They produce consistent predictions across different training sets (low variance) but systematically miss the true pattern (high bias).
High complexity (low bias, high variance): Complex models can capture intricate patterns, reducing bias. However, they also fit noise in the training data, causing predictions to vary substantially across different training sets (high variance).
Optimal complexity: The best generalization performance occurs at an intermediate level of complexity where the sum of bias squared and variance is minimized. At this point, the model is flexible enough to capture the underlying signal but constrained enough to avoid fitting noise.
| Model Complexity | Bias | Variance | Training Error | Test Error |
|---|---|---|---|---|
| Very low (e.g., linear model on nonlinear data) | High | Low | High | High |
| Low-moderate | Moderate | Low-moderate | Moderate | Moderate-low |
| Optimal | Balanced | Balanced | Low-moderate | Lowest |
| High-moderate | Low | Moderate-high | Low | Moderate |
| Very high (e.g., overfitting polynomial) | Very low | Very high | Near zero | Very high |
Concrete examples of model complexity include the degree of a polynomial in polynomial regression, the depth of a decision tree, the number of hidden layers and neurons in a neural network, and the number of neighbors k in k-nearest neighbors (where lower k means higher complexity).
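The k-nearest-neighbors case is a compact way to see complexity in action. A minimal NumPy implementation (illustrative synthetic data) shows that k=1 memorizes the training set, giving zero training error but a large test error, while a larger k averages away the noise.

```python
import numpy as np

rng = np.random.default_rng(4)

n = 200
x_train = rng.uniform(0, 1, n)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.3, n)
x_test = rng.uniform(0, 1, 500)
y_test = np.sin(2 * np.pi * x_test)  # noiseless targets for evaluation

def knn_predict(x_query, k):
    # For each query point, average the y-values of its k nearest neighbors.
    dists = np.abs(x_query[:, None] - x_train[None, :])
    nearest = np.argsort(dists, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

# k=1 (highest complexity): each training point is its own nearest neighbor,
# so training error is exactly zero, but test predictions track the noise.
mse_train_k1 = np.mean((y_train - knn_predict(x_train, 1)) ** 2)
mse_test_k1 = np.mean((y_test - knn_predict(x_test, 1)) ** 2)

# k=15 (lower complexity): averaging over neighbors reduces variance.
mse_test_k15 = np.mean((y_test - knn_predict(x_test, 15)) ** 2)
```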
Regularization is one of the primary tools for managing the bias-variance tradeoff. Regularization techniques add a penalty to the model's loss function that discourages overly complex solutions, effectively trading a small increase in bias for a larger decrease in variance.
Ridge regression adds the sum of squared weights to the loss function:
Loss = (1/n) * sum((y_i - h(x_i))^2) + lambda * sum(w_j^2)
The regularization parameter lambda controls the strength of the penalty. Larger lambda values shrink the weights toward zero, increasing bias but decreasing variance. Ridge regularization shrinks all weights toward zero but does not set any of them exactly to zero.
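Ridge regression has a closed-form solution, which makes the shrinkage effect easy to demonstrate in NumPy (synthetic data; the lambda value is illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic regression problem with a few informative features.
n, p = 50, 10
X = rng.normal(size=(n, p))
true_w = np.zeros(p)
true_w[:3] = [2.0, -1.0, 0.5]
y = X @ true_w + rng.normal(0, 0.5, n)

def ridge_fit(X, y, lam):
    # Closed-form ridge solution: w = (X^T X + lam * I)^{-1} X^T y
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

w_ols = ridge_fit(X, y, 0.0)     # ordinary least squares (lam = 0)
w_ridge = ridge_fit(X, y, 10.0)  # penalized: weights shrunk toward zero

# The ridge weight vector always has a smaller norm than the OLS one:
# a controlled increase in bias bought in exchange for lower variance.
```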
Lasso regression adds the sum of absolute values of weights:
Loss = (1/n) * sum((y_i - h(x_i))^2) + lambda * sum(|w_j|)
L1 regularization can drive some weights to exactly zero, performing implicit feature selection. This produces sparser models that may have slightly higher bias but substantially lower variance.
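The sparsity of Lasso can be demonstrated with a short coordinate-descent sketch (a standard algorithm for L1 problems; the implementation and constants below are illustrative, not a production solver):

```python
import numpy as np

rng = np.random.default_rng(6)

def soft_threshold(z, t):
    # Shrink z toward zero by t; values within [-t, t] become exactly zero.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    # Cyclic coordinate descent for (1/(2n)) * ||y - Xw||^2 + lam * ||w||_1.
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual: remove feature j's current contribution.
            r = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ r / n
            w[j] = soft_threshold(rho, lam) / col_sq[j]
    return w

# Only the first two features actually matter.
n, p = 100, 8
X = rng.normal(size=(n, p))
w_true = np.array([3.0, -2.0, 0, 0, 0, 0, 0, 0])
y = X @ w_true + rng.normal(0, 0.1, n)

w_lasso = lasso_cd(X, y, lam=0.1)
# The L1 penalty drives most of the irrelevant weights to exactly zero.
```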
Elastic net combines L1 and L2 penalties, providing a balance between the sparsity of Lasso and the stability of Ridge.
For deep learning models, common regularization techniques include:
| Technique | How It Works | Effect on Bias-Variance |
|---|---|---|
| Dropout | Randomly sets a fraction of neuron activations to zero during training | Reduces variance by preventing co-adaptation of neurons |
| Weight decay | Adds an L2 penalty to the weights (equivalent to Ridge for plain SGD) | Reduces variance by constraining weight magnitudes |
| Early stopping | Halts training before the model fully converges on training data | Reduces variance by limiting effective model complexity |
| Data augmentation | Creates additional training examples through transformations | Reduces variance by providing more diverse training signals |
| Batch normalization | Normalizes layer inputs during training | Acts as a mild regularizer; reduces variance |
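The mechanics of dropout are simple enough to sketch directly in NumPy (this is the standard "inverted dropout" formulation; the rate and array sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)

def dropout(activations, rate, training):
    # Inverted dropout: zero a fraction `rate` of units during training and
    # rescale the survivors by 1/(1-rate) so the expected activation is
    # unchanged. At inference time the layer is an identity.
    if not training or rate == 0.0:
        return activations
    mask = rng.random(activations.shape) >= rate
    return activations * mask / (1.0 - rate)

a = np.ones((4, 1000))
out = dropout(a, rate=0.5, training=True)

# Roughly half the units are zeroed; survivors become 2.0, so the mean
# activation stays near 1 in expectation.
```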
The regularization parameter (commonly called lambda or alpha) directly controls where a model sits on the bias-variance spectrum:
- Small lambda: weak penalty; the model stays flexible (low bias, high variance).
- Large lambda: strong penalty; weights are heavily shrunk (high bias, low variance).
- Lambda approaching infinity: weights approach zero and the model degenerates toward predicting a constant.
The optimal lambda is typically chosen via cross-validation, selecting the value that minimizes estimated test error.
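A k-fold cross-validation search over lambda can be sketched as follows (NumPy only; the synthetic data, fold count, and lambda grid are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)

n, p = 60, 15
X = rng.normal(size=(n, p))
w_true = np.concatenate([np.ones(3), np.zeros(p - 3)])
y = X @ w_true + rng.normal(0, 1.0, n)

def ridge_fit(X, y, lam):
    # Closed-form ridge solution.
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def cv_mse(lam, k=5):
    # k-fold cross-validation estimate of test MSE for a given lambda.
    folds = np.array_split(np.arange(n), k)
    errs = []
    for fold in folds:
        mask = np.ones(n, dtype=bool)
        mask[fold] = False                      # hold this fold out
        w = ridge_fit(X[mask], y[mask], lam)
        errs.append(np.mean((y[fold] - X[fold] @ w) ** 2))
    return float(np.mean(errs))

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
scores = [cv_mse(lam) for lam in lambdas]
best_lam = lambdas[int(np.argmin(scores))]      # minimizes estimated test error
```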
Learning curves are a practical diagnostic tool for determining whether a model suffers from high bias, high variance, or a combination of both. A learning curve plots training error and validation error as a function of the number of training examples (or, alternatively, as a function of training iterations).
When a model has high bias:
- Training error is high and quickly plateaus as more examples are added.
- Validation error sits close to training error, so the gap between the curves is small.
- Adding more data does not help, because both curves converge to the same high plateau.
What to do: Increase model complexity, add features, reduce regularization, or use a more expressive model family.
When a model has high variance:
- Training error is low, often far below validation error.
- Validation error is high, leaving a large gap between the two curves.
- The gap narrows as more training data is added, so collecting more data can help.
What to do: Add regularization, reduce model complexity, gather more training data, use ensemble methods like bagging, or apply dropout.
| Diagnostic Signal | Likely Problem | Recommended Actions |
|---|---|---|
| High training error, high validation error, small gap | High bias (underfitting) | Increase complexity, add features, reduce regularization |
| Low training error, high validation error, large gap | High variance (overfitting) | Add regularization, simplify model, gather more data |
| Low training error, low validation error, small gap | Good fit | Model is well-calibrated |
| Both errors high initially, both decrease with more data | Insufficient data | Collect more training examples |
Beyond simple learning curves, practitioners also use validation curves (plotting error against a hyperparameter like regularization strength) and cross-validation to diagnose and address bias-variance issues.
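Computing a learning curve is straightforward. The sketch below (NumPy; synthetic cubic data, illustrative sizes, averaged over repetitions to smooth out sampling noise) shows the signature of a well-matched model: training error rises toward the noise floor, validation error falls toward it, and the gap shrinks.

```python
import numpy as np

rng = np.random.default_rng(9)

def errors_at(m, reps=30, deg=3, sigma=0.2):
    # Average train/validation MSE of a degree-3 fit over `reps` fresh
    # datasets of m training points each.
    tr_sum = va_sum = 0.0
    for _ in range(reps):
        x_tr = rng.uniform(-1, 1, m)
        y_tr = x_tr ** 3 - x_tr + rng.normal(0, sigma, m)
        x_va = rng.uniform(-1, 1, 200)
        y_va = x_va ** 3 - x_va + rng.normal(0, sigma, 200)
        c = np.polyfit(x_tr, y_tr, deg)
        tr_sum += np.mean((y_tr - np.polyval(c, x_tr)) ** 2)
        va_sum += np.mean((y_va - np.polyval(c, x_va)) ** 2)
    return tr_sum / reps, va_sum / reps

sizes = [8, 20, 50, 200]
curve = [errors_at(m) for m in sizes]
train_errs = [tr for tr, _ in curve]
val_errs = [va for _, va in curve]

# With more data, both curves approach the noise floor (sigma^2 = 0.04)
# and the train/validation gap closes: the model family fits the problem.
```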
The classical bias-variance tradeoff predicts a U-shaped test error curve as model complexity increases: error decreases as bias drops, reaches a minimum, and then increases as variance dominates. However, research in the late 2010s revealed a more nuanced picture, particularly for modern overparameterized models such as deep neural networks.
In 2019, Mikhail Belkin and colleagues published a landmark paper, "Reconciling modern machine learning practice and the classical bias-variance trade-off," which introduced the concept of double descent. They showed that as model complexity increases past a critical point called the interpolation threshold (where the model has just enough parameters to perfectly fit the training data), test error first follows the classical U-shape but then decreases again, forming a second descent.
The double descent curve has three regimes:
| Regime | Model Complexity | Behavior |
|---|---|---|
| Under-parameterized | Fewer parameters than training examples | Classical U-shaped curve; bias decreases, variance increases |
| Interpolation threshold | Parameters approximately equal training examples | Peak in test error; the model barely fits the data, and any noise forces distorted solutions |
| Over-parameterized | Many more parameters than training examples | Test error decreases again; the model finds smooth interpolating solutions |
At the interpolation threshold, there is essentially only one function that passes through all training points. If the data contains any noise or label errors, this unique interpolating solution must contort itself to fit the noise, resulting in poor generalization. Beyond this threshold, when the model has many more parameters than needed, there are infinitely many functions that interpolate the training data. Among these, optimization procedures (like stochastic gradient descent) tend to find solutions with desirable properties, such as minimum norm or maximum smoothness, which generalize well.
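The minimum-norm idea can be demonstrated concretely for linear regression (a NumPy sketch with random synthetic data): with more parameters than examples, infinitely many weight vectors interpolate the training data, and the pseudoinverse selects the one with the smallest norm.

```python
import numpy as np

rng = np.random.default_rng(11)

# Over-parameterized linear regression: more features (p) than examples (n).
n, p = 20, 60
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Minimum-norm interpolating solution via the pseudoinverse.
w_min = np.linalg.pinv(X) @ y          # X @ w_min == y: zero training error

# Since p > n, X has a nontrivial null space; any null-space direction
# yields a different interpolant with the same training predictions.
U, s, Vt = np.linalg.svd(X)
v_null = Vt[-1]                        # X @ v_null is (numerically) zero
w_other = w_min + 5.0 * v_null

# w_min has no null-space component, so every alternative interpolant
# like w_other necessarily has a strictly larger norm.
```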
In 2019, researchers at OpenAI (Nakkiran et al.) extended Belkin's findings to deep neural networks, showing that double descent occurs along three axes:
- Model-wise: test error follows a double-descent curve as model width or size grows.
- Epoch-wise: test error can descend, rise, and descend again over the course of training.
- Sample-wise: in a certain regime, adding more training data can temporarily hurt test performance.
They demonstrated these effects using ResNets, transformers, and other architectures on standard benchmarks such as CIFAR-10 and CIFAR-100.
The double descent phenomenon does not invalidate the bias-variance tradeoff. Rather, it extends it. The classical tradeoff remains valid in the under-parameterized regime. In the over-parameterized regime, implicit regularization from the optimization algorithm plays a role that is not captured by the traditional analysis.
Practical takeaways include:
- The classical U-shaped tradeoff remains a reliable guide for small and moderately sized models.
- Test error often peaks near the interpolation threshold, so model sizes in that region should be avoided or heavily regularized.
- In the heavily over-parameterized regime, making a model larger can improve generalization rather than harm it, provided training uses methods (such as SGD) whose implicit regularization favors well-behaved solutions.
Different machine learning algorithms handle the tradeoff in distinct ways:
| Algorithm | Bias Characteristics | Variance Characteristics | How the Tradeoff Is Managed |
|---|---|---|---|
| Linear regression | Can be high if the true relationship is nonlinear | Generally low | Regularization (Ridge, Lasso) |
| Decision trees | Low (can fit complex boundaries) | High (sensitive to training data) | Pruning, maximum depth limits |
| Random forests | Slightly higher than individual trees | Much lower than individual trees | Bagging reduces variance |
| Gradient boosting | Low (sequentially reduces bias) | Can be high without tuning | Learning rate, tree depth, number of estimators |
| K-nearest neighbors | Low for small k | High for small k; low for large k | Choosing k via cross-validation |
| Support vector machines | Depends on kernel and C parameter | Controlled by C and kernel bandwidth | C parameter, kernel selection |
| Neural networks | Low (universal approximators) | Can be very high | Dropout, weight decay, early stopping, architecture choice |
Ensemble methods are specifically designed to improve the bias-variance tradeoff:
- Bagging (as in random forests) trains many high-variance models on bootstrap resamples and averages their predictions, reducing variance while leaving bias roughly unchanged.
- Boosting (as in gradient boosting) fits models sequentially, each one correcting the residual errors of the ensemble so far, primarily reducing bias; variance is kept in check through the learning rate and depth limits.
The bias-variance decomposition has roots in classical statistics. The decomposition of mean squared error into bias and variance terms has been known in estimation theory for decades. In the context of machine learning, the tradeoff was formalized and popularized through the work of Stuart Geman, Elie Bienenstock, and Rene Doursat in their 1992 paper "Neural Networks and the Bias/Variance Dilemma." The concept was further developed by researchers including Tom Mitchell, who discussed it in his 1997 textbook "Machine Learning," and Trevor Hastie, Robert Tibshirani, and Jerome Friedman in "The Elements of Statistical Learning" (2001).
The modern reconsideration of the tradeoff through the lens of double descent began with Belkin et al. (2019) and was expanded by Nakkiran et al. (2019) at OpenAI, sparking renewed interest in understanding generalization in overparameterized models.