Double descent is a phenomenon in machine learning and statistical learning theory where a model's test error, plotted against increasing model complexity, follows a two-part pattern: it first traces the classical U-shaped bias-variance tradeoff curve, peaks sharply at the interpolation threshold, and then decreases again as the model becomes heavily overparameterized. This second drop in error beyond the interpolation threshold contradicts the classical expectation that models with too many parameters will always overfit and generalize poorly.
The term "double descent" was introduced by Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal in their 2019 paper "Reconciling modern machine-learning practice and the classical bias-variance trade-off." The phenomenon has since been confirmed across a wide range of model families, including neural networks, decision trees, random forests, kernel methods, and linear models.
Imagine you are trying to draw a line through a set of dots on a page. If you use a really simple tool (like a ruler), your line will be straight and miss a lot of dots. If you use a slightly bendier tool, you can hit more dots, but at some point your tool is just bendy enough to pass through every single dot. That exact "just enough" tool makes a really wiggly, ugly line because it is stretching itself to reach every dot, including ones that are in weird places because of mistakes.
Now here is the surprising part: if you use an even bendier tool, one that could easily pass through all the dots in many different ways, it actually picks a smoother, nicer path through them. It does not need to stretch or wiggle because it has so many options for how to connect the dots. That is double descent. The error gets bad right when the tool barely fits all the dots, but then gets better again when the tool has way more flexibility than it needs.
The classical view of model selection, articulated in foundational textbooks by Hastie, Tibshirani, and Friedman (2009) and Bishop (2006), predicts a U-shaped test error curve as model complexity increases. In this framework, total prediction error decomposes into three components.
| Component | Description |
|---|---|
| Bias squared | Systematic error from the model's simplifying assumptions. Decreases as model complexity increases. |
| Variance | Error from sensitivity to the particular training set. Increases as model complexity increases. |
| Irreducible error | Noise inherent in the data that no model can eliminate. |
As complexity grows, bias decreases (the model can capture more of the true function) while variance increases (the model starts fitting noise). The optimal model sits at the complexity level that minimizes the sum of bias squared and variance. Beyond that point, classical theory predicts that test error should rise monotonically. This prediction holds well for small, underparameterized models but breaks down in the modern regime where models have far more parameters than training samples.
The interpolation threshold is the critical point at which a model has just enough capacity (parameters, depth, features, or other complexity measures) to perfectly fit every training example, achieving zero training error. At this point, the model transitions from the underparameterized regime (more training samples than parameters) to the overparameterized regime (more parameters than training samples).
At the interpolation threshold, the model is forced into a unique or nearly unique interpolating solution. If the training data contains any noise or mislabeled examples, this solution must contort itself to accommodate every noisy data point exactly. The result is a highly irregular function that generalizes poorly, producing a sharp spike in test error.
Formally, for a model with p parameters and n training samples, the interpolation threshold occurs near the ratio p/n = 1. In practice, the exact location depends on the model family, the effective degrees of freedom (which may differ from the raw parameter count), and the data distribution.
The double descent curve extends the classical U-shaped curve into a broader performance landscape with three distinct regimes.
| Regime | Parameter-to-sample ratio (p/n) | Behavior |
|---|---|---|
| Underparameterized | p/n < 1 | Follows the classical U-curve. Increasing complexity first reduces bias and test error, then begins raising variance and test error. |
| Interpolation threshold | p/n approximately equal to 1 | Test error peaks sharply. The model barely fits the training data, and noise forces distorted solutions. |
| Overparameterized | p/n > 1 | Test error decreases again. Among the many interpolating solutions, optimization selects smooth, low-norm solutions that generalize well. |
The key insight is that beyond the interpolation threshold, there exist infinitely many functions that perfectly fit the training data. Optimization algorithms such as stochastic gradient descent exhibit an implicit bias toward solutions with certain desirable properties (such as minimum norm in parameter space), and these solutions tend to generalize well despite achieving zero training error.
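The multiplicity of interpolating solutions can be exhibited directly in an overparameterized linear model: any null-space component can be added to an interpolator without changing the training fit, and the minimum-norm choice generalizes best among them. A synthetic sketch (sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 30, 120
w_true = rng.normal(size=p) / np.sqrt(p)
X = rng.normal(size=(n, p))
y = X @ w_true                            # noiseless labels for clarity
X_test = rng.normal(size=(2000, p))
y_test = X_test @ w_true

w_min = np.linalg.pinv(X) @ y             # the minimum-norm interpolator

# Rows of Vt beyond the rank span the null space of X: adding any such
# direction leaves the training fit unchanged, giving another interpolator.
_, _, Vt = np.linalg.svd(X)
w_other = w_min + Vt[n:].T @ rng.normal(size=p - n)

mse = lambda w: np.mean((X_test @ w - y_test) ** 2)
print(np.allclose(X @ w_min, y), np.allclose(X @ w_other, y))  # both interpolate
print(mse(w_min) < mse(w_other))          # min-norm generalizes far better
```

Both vectors fit the training data exactly, yet their test errors differ by orders of magnitude, which is why the particular solution an optimizer selects matters so much.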
Research by Nakkiran, Kaplun, Bansal, Yang, Barak, and Sutskever (circulated as a preprint in 2019 and published in the Journal of Statistical Mechanics: Theory and Experiment in 2021) extended the double descent phenomenon beyond model size, identifying three distinct axes along which it manifests.
Model-wise double descent is the original form of the phenomenon. When a family of models is ordered by increasing capacity (for example, by widening a convolutional neural network or adding more random features), the test error follows the double descent pattern. Nakkiran et al. demonstrated this using parameterized ResNet-18 architectures with layer widths scaled as [k, 2k, 4k, 8k], where varying k controls the total parameter count. As k increases, test error first drops, spikes near the interpolation threshold, and then drops again for large k values.
For a model of fixed size, test error can also exhibit double descent as a function of the number of training epochs. Early in training, the model learns the signal in the data and test error decreases. As training continues, the model begins to fit noise, and test error increases. With further training, the model transitions through the interpolation threshold (achieving zero training error) and eventually finds a smoother interpolating solution, causing test error to decrease once more.
This form of double descent challenges the conventional wisdom that early stopping always prevents overfitting. In some cases, training past the point of apparent overfitting leads to better final performance. The phenomenon is most pronounced in models operating near the interpolation threshold and in the presence of label noise.
Sample-wise double descent occurs when, for a model of fixed size, adding more training data temporarily increases test error before eventually improving it. This happens because additional samples shift the effective interpolation threshold: a model that was comfortably overparameterized with n samples may be pushed into the critical regime when trained on 2n samples. Nakkiran et al. demonstrated that there exist regimes where quadrupling the number of training samples actually hurts test performance for models in the critical zone.
This finding has practical consequences for data collection. It suggests that, for models of intermediate complexity, adding moderate amounts of data can temporarily degrade performance, though sufficient additional data eventually resolves the issue.
To unify the three forms of double descent, Nakkiran et al. introduced the concept of effective model complexity (EMC). EMC is defined as the maximum number of training samples on which a given training procedure (architecture plus optimizer plus hyperparameters) can achieve approximately zero training error.
EMC differs from raw parameter count in that it accounts for the entire training pipeline: the architecture, the optimizer (Adam, SGD), the learning rate schedule, regularization, and the number of training epochs. Two models with the same parameter count can have very different EMC values depending on how they are trained.
The double descent conjecture states that test error peaks when EMC is approximately equal to the number of training samples n. This framework explains all three forms of double descent.
| Axis | How EMC changes relative to n |
|---|---|
| Model-wise | Increasing model size raises EMC through the critical point EMC approximately equal to n |
| Epoch-wise | Training longer raises EMC (model fits more data over time) through the critical point |
| Sample-wise | Adding more data raises n, potentially pushing a previously overparameterized model into the critical regime |
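Because EMC is defined operationally, it can be estimated by direct search: train on increasingly large datasets and record the largest size that still reaches near-zero training error. A toy sketch with a minimum-norm linear classifier, whose EMC is roughly its feature count (the helper names, coarse grid, and tolerance are illustrative, not from the paper):

```python
import numpy as np

def estimate_emc(train_fn, make_data, candidate_ns, eps=0.01):
    """Largest n in the (coarse) grid on which the training procedure
    reaches train error <= eps -- Nakkiran et al.'s EMC, up to tolerance."""
    emc = 0
    for n in candidate_ns:
        X, y = make_data(n)
        predict = train_fn(X, y)
        if np.mean(predict(X) != y) <= eps:
            emc = n
    return emc

def min_norm_linear(X, y):
    """Toy procedure: minimum-norm least squares, thresholded to a label."""
    w = np.linalg.pinv(X) @ y
    return lambda Xq: np.sign(Xq @ w)

rng = np.random.default_rng(0)
p = 30                                    # feature count
make_data = lambda n: (rng.normal(size=(n, p)),
                       rng.choice([-1.0, 1.0], size=n))

# The procedure interpolates any n <= p random points, so its EMC is about p.
print(estimate_emc(min_norm_linear, make_data, [10, 20, 30, 60, 90]))
```

The same search applies unchanged to a neural network: swap in a training loop for `train_fn`, and EMC then reflects the optimizer, schedule, and epoch budget, not just the parameter count.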
Belkin et al. (2019) presented double descent as a reconciliation of two seemingly contradictory observations. Classical statistical learning theory (Vapnik, 1998) prescribes regularization and model selection to avoid overfitting. Modern deep learning practice, however, routinely trains massively overparameterized networks to zero training error and achieves excellent generalization.
The reconciliation lies in recognizing that the classical theory describes only the underparameterized regime, where the U-shaped curve is valid. The overparameterized regime operates under different dynamics. Belkin et al. demonstrated the double descent curve across diverse model families.
| Model family | Dataset(s) | Key finding |
|---|---|---|
| Random Fourier features | MNIST | Clear double descent with peak at interpolation threshold |
| Fully connected neural networks | MNIST | Two-layer networks exhibit second descent in overparameterized regime |
| Decision trees | Multiple | Interpolating trees can generalize well when sufficiently deep |
| Random forests | Multiple | Ensemble of interpolating trees achieves good generalization |
| AdaBoost | Multiple | Continued boosting past interpolation improves performance |
This work was published in the Proceedings of the National Academy of Sciences (PNAS), volume 116, number 32, pages 15849-15854.
Nakkiran et al. (2019) extended Belkin's findings to modern deep learning architectures and larger-scale experiments. Their work, initially published as a preprint and later in the Journal of Statistical Mechanics: Theory and Experiment (2021), tested double descent across multiple architectures and datasets.
| Architecture | Configuration | Datasets |
|---|---|---|
| ResNet-18 | Widths scaled as [k, 2k, 4k, 8k]; standard uses k=64 | CIFAR-10, CIFAR-100 |
| 5-layer CNN | Convolutional widths [k, 2k, 4k, 8k] plus fully connected output | CIFAR-10, CIFAR-100 |
| Transformer | 6-layer encoder-decoder, embedding dimension d_model scaled, d_ff = 4 * d_model | IWSLT'14 (de-en), WMT'14 (en-fr, subsampled to 200K sentences) |
| Random Fourier features | Varying number of random features | Fashion-MNIST |
Double descent appears most strongly in settings with label noise, though Nakkiran et al. argued that this reflects model misspecification more broadly rather than noise per se. Adding label noise shifts the interpolation threshold rightward and amplifies the peak in test error. However, the authors also demonstrated double descent without any added noise (for example, ResNets on CIFAR-100 and CNNs on CIFAR-100), confirming that noise is not a prerequisite for the phenomenon.
Double descent is closely tied to the broader question of why overparameterized models generalize well. Classical learning theory bounds (such as VC dimension and Rademacher complexity) suggest that models with more parameters than training samples should generalize poorly. In practice, the opposite often holds: larger models frequently achieve lower test error.
Several theoretical perspectives help explain this.
Implicit regularization. Optimization algorithms like SGD and its variants do not simply find any interpolating solution; they find specific ones. In linear regression, gradient descent converges to the minimum-norm interpolating solution. In neural networks, the implicit bias of SGD favors solutions that lie in flat regions of the loss landscape, which tend to generalize better.
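The linear-regression case is easy to verify numerically: gradient descent initialized at zero stays in the row space of the data and converges to the same solution the pseudoinverse gives in closed form (sizes, step size, and iteration count below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 100                        # underdetermined: many interpolators
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Plain gradient descent on mean squared error, initialized at zero.
w = np.zeros(p)
for _ in range(20000):
    w -= 0.01 * X.T @ (X @ w - y) / n

w_min_norm = np.linalg.pinv(X) @ y    # closed-form minimum-norm interpolator

print(np.allclose(X @ w, y, atol=1e-4))       # gradient descent interpolates
print(np.allclose(w, w_min_norm, atol=1e-4))  # and finds the min-norm solution
```

Among the infinitely many weight vectors that fit this data exactly, gradient descent lands on the one with smallest norm; it never received an explicit penalty telling it to do so.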
Effective dimension. The number of parameters that are functionally important for prediction may be much smaller than the total parameter count. Overparameterized models may have many redundant parameters, and the effective dimension (measured by the curvature of the loss landscape or the rank of the Hessian) can be much smaller than p. This helps explain why models with millions of parameters do not necessarily overfit.
Neural tangent kernel (NTK) perspective. In the infinite-width limit, neural networks trained with gradient descent behave like kernel regression with the NTK. Analysis of NTK regression reveals double descent behavior as a function of the ratio of parameters to samples, providing a theoretical foundation for the phenomenon in neural networks. Adlam and Pennington (2020) showed that the NTK in high dimensions can exhibit not just double descent but even triple descent, where an additional peak appears at a second critical ratio.
Benign overfitting is a closely related concept that describes the situation where a model perfectly interpolates noisy training data yet still achieves near-optimal test performance. The term was formalized by Peter Bartlett, Philip Long, Gabor Lugosi, and Alexander Tsigler in their 2020 paper "Benign overfitting in linear regression," published in PNAS.
Bartlett et al. showed that benign overfitting in linear regression requires the data covariance matrix to satisfy specific spectral conditions. In particular, the number of "unimportant" directions in parameter space (those with small eigenvalues in the covariance matrix) must significantly exceed the sample size. When this condition holds, the minimum-norm interpolating solution places the noise component of its fit into these unimportant directions, where it has little effect on predictions.
The relationship between benign overfitting and double descent is as follows: the overparameterized regime of the double descent curve is precisely the regime where benign overfitting can occur. In the underparameterized regime, the model cannot interpolate, so benign overfitting is not applicable. At the interpolation threshold, overfitting is "catastrophic" (harmful). Beyond the threshold, overfitting becomes "benign" (harmless) under the right conditions.
| Regime | Overfitting type | Generalization |
|---|---|---|
| Underparameterized (p < n) | No interpolation possible | Depends on bias-variance balance |
| Interpolation threshold (p approximately equal to n) | Catastrophic overfitting | Poor generalization; peak test error |
| Mildly overparameterized (p slightly greater than n) | Potentially harmful overfitting | May still generalize poorly depending on data structure |
| Heavily overparameterized (p much greater than n) | Benign overfitting | Good generalization; smooth interpolating solutions |
Several lines of theoretical work have provided rigorous foundations for the double descent phenomenon.
The earliest theoretical observations of non-monotonic generalization curves came from the statistical physics community. Krogh and Hertz (1992) provided theoretical explanations for model-wise double descent in linear models using methods from statistical mechanics. Opper (1995) and Opper and Kinzel (1996) further analyzed generalization in neural networks using the replica method and related techniques. These early results showed that, at intermediate complexity levels where the model size equals the number of training examples, the model is very sensitive to noise and generalizes poorly.
Hastie, Montanari, Rosset, and Tibshirani (2022) provided a precise analysis of double descent in "ridgeless" (unregularized) least squares regression. Their paper, "Surprises in high-dimensional ridgeless least squares interpolation" (Annals of Statistics, vol. 50, no. 2), showed that in the proportional asymptotic regime (where p and n grow together with p/n approaching a constant gamma), the test risk of the minimum-norm interpolating estimator diverges as gamma approaches 1 from either side and decreases for gamma well above 1. This provides an exact mathematical characterization of the double descent peak.
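In the isotropic special case of their analysis the limiting risk has a closed form, which makes the divergence at gamma = 1 and the second descent explicit. A sketch (signal strength r2 and noise sigma2 are free parameters; this is the isotropic formula only, not their general anisotropic result):

```python
def ridgeless_risk(gamma, r2=1.0, sigma2=0.25):
    """Asymptotic test risk of the minimum-norm least squares estimator
    with isotropic features, in the proportional limit p/n -> gamma."""
    if gamma < 1:
        # Underparameterized: pure variance term, blowing up as gamma -> 1.
        return sigma2 * gamma / (1 - gamma)
    # Overparameterized: bias from the unfitted part of the signal
    # plus a variance term that shrinks as gamma grows.
    return r2 * (1 - 1 / gamma) + sigma2 / (gamma - 1)

for g in [0.5, 0.9, 0.99, 1.01, 1.1, 2.0, 10.0]:
    print(f"gamma = {g}: risk = {ridgeless_risk(g):.3f}")
```

The printed values climb without bound as gamma approaches 1 from either side and fall again for gamma well above 1, exactly the double descent peak in analytic form.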
Mei and Montanari (2022) extended this analysis to random features regression in "The generalization error of random features regression: Precise asymptotics and the double descent curve" (Communications on Pure and Applied Mathematics, vol. 75, no. 4, pages 667-766). They showed that the double descent curve emerges naturally in the proportional limit and that the global minimum of test risk can lie in the extremely overparameterized regime.
Recent work has refined the bias-variance decomposition for overparameterized settings. In the classical regime, variance increases monotonically with complexity. In the overparameterized regime, both bias and variance can decrease as complexity increases beyond the interpolation threshold. This occurs because the minimum-norm interpolating solution spreads its weight across many parameters, reducing the influence of any single noisy training point (which lowers variance) while maintaining the ability to capture the true signal (keeping bias low).
Regularization interacts with double descent in nuanced ways.
Nakkiran et al. and subsequent work (Kobak, Lomond, and Sanchez, 2020) showed that appropriately tuned L2 regularization (weight decay) can smooth out the double descent peak, converting the double-descent-shaped risk curve into a monotonically decreasing one. For linear regression with isotropic data, optimally tuned L2 regularization achieves monotonic test performance as either sample size or model size grows.
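A quick numerical check of this smoothing effect at the threshold p = n, comparing a near-zero penalty with a moderate one (sizes, noise level, and the penalty values are illustrative, not optimally tuned):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma = 40, 40, 0.5             # exactly at the interpolation threshold
w_true = rng.normal(size=p) / np.sqrt(p)
X = rng.normal(size=(n, p))
y = X @ w_true + sigma * rng.normal(size=n)
X_test = rng.normal(size=(2000, p))
y_test = X_test @ w_true

def ridge_test_mse(lam):
    """Test error of ridge regression with L2 penalty lam."""
    beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    return np.mean((X_test @ beta - y_test) ** 2)

print(round(ridge_test_mse(1e-8), 2))  # near-ridgeless: the peak bites
print(round(ridge_test_mse(1.0), 2))   # moderate ridge: peak suppressed
```

At p = n the design matrix is nearly singular, so the almost-unregularized fit amplifies label noise along weak directions; even a modest penalty keeps the test error bounded.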
Fernandez, Pinto, and von Luxburg (2022) demonstrated that double descent can also occur as a function of regularization strength. As the L2 penalty decreases from a large value toward zero, the test error can exhibit a non-monotonic pattern analogous to model-wise double descent. They showed that this regularization-wise double descent can be understood as a superposition of bias-variance tradeoffs corresponding to different components of the model. Per-layer or per-component regularization can eliminate this effect.
Early stopping acts as a form of implicit regularization that limits the effective model complexity. By halting training before the model reaches the interpolation threshold, early stopping can avoid the double descent peak entirely. However, as epoch-wise double descent shows, stopping at the apparent overfitting point (where test error first rises) may not be optimal; training further can lead to a second descent in test error. This creates a tension between conservative early stopping and the potential benefits of extended training.
Subsequent research has revealed that the double descent pattern can itself be extended. Adlam and Pennington (2020) showed that the NTK in high dimensions can exhibit triple descent, where a third regime of decreasing test error appears after a second peak. This triple descent arises from the interaction of multiple scales in the feature map.
d'Ascoli, Refinetti, Biroli, and Krzakala (2020) investigated triple descent in the context of sample-wise risk, showing that at high noise levels, the risk profile can exhibit peaks both when the number of samples equals the number of parameters and when it equals the input dimension. The relative prominence of these peaks depends on the nonlinearity of the activation function.
For models with multiple feature components (such as concatenations of different random feature types), Rocks and Mehta showed that the risk curve can exhibit (K+1)-fold descent for K distinct feature types, establishing the concept of multiple descent.
The double descent phenomenon has several practical consequences for machine learning practitioners.
Classical model selection criteria (such as AIC, BIC, and cross-validation at moderate complexities) assume a U-shaped test error curve and may select suboptimal models. In the presence of double descent, the globally best model may be the largest available one, well beyond the interpolation threshold. Practitioners should consider evaluating very large models rather than restricting their search to the classical "sweet spot."
Epoch-wise double descent suggests that the common practice of early stopping based on validation error may sometimes be premature. If the model is in the critical regime and validation error has begun to rise, continued training may eventually yield better performance. Monitoring training and validation error for longer periods, even past apparent overfitting, can be informative.
Sample-wise double descent implies that adding a moderate amount of data to a model operating near the interpolation threshold can temporarily worsen performance. This does not mean more data is harmful in general. It means that when increasing dataset size, practitioners should also consider scaling the model to stay in the overparameterized regime.
Optimal regularization can eliminate the double descent peak. For practitioners who cannot afford to train very large models, tuning regularization (weight decay, dropout, data augmentation) is especially important to smooth the transition through the critical regime.
Label noise amplifies the double descent peak. Cleaning training labels and reducing noise in the data can mitigate the worst effects of double descent, particularly for models near the interpolation threshold.
| Concern | Recommendation |
|---|---|
| Model size | Prefer larger, overparameterized models when computationally feasible |
| Training epochs | Do not stop training solely because validation error begins to rise; monitor for a potential second descent |
| Adding data | When adding data, also consider scaling model capacity to avoid the critical regime |
| Regularization | Tune regularization carefully; optimal weight decay can eliminate the double descent peak |
| Label noise | Invest in data quality; noisy labels amplify the peak at the interpolation threshold |
| Model selection | Evaluate models across a wide range of complexities, including far beyond the interpolation threshold |
Double descent is part of a broader family of surprising behaviors observed in modern machine learning.
Grokking. Grokking refers to the phenomenon where a model trained on a small dataset achieves perfect training accuracy early but does not generalize until much later in training, when test accuracy suddenly jumps. Grokking shares conceptual similarities with epoch-wise double descent in that extended training can produce unexpected improvements in generalization.
Scaling laws. Empirical scaling laws (Kaplan et al., 2020) show that test loss decreases as a power law with increasing model size, dataset size, and compute. Double descent adds nuance to these scaling laws by showing that the decrease is not always monotonic and that there are critical regimes where bigger models or more data can temporarily hurt.
Neural network pruning. The observation that heavily overparameterized networks can be pruned to a fraction of their size without losing accuracy (the lottery ticket hypothesis) is consistent with the effective dimension perspective on double descent. The full model may use only a small fraction of its parameters for prediction, with the rest providing the "room" needed for benign overfitting.
| Year | Contribution | Researchers |
|---|---|---|
| 1992 | First theoretical explanations of non-monotonic generalization in linear models using statistical mechanics | Krogh and Hertz |
| 1995 | Analysis of generalization in neural networks showing sensitivity at interpolation | Opper; Opper and Kinzel (1996) |
| 2017 | Investigation of high-dimensional dynamics of generalization error in neural networks | Advani and Saxe |
| 2018 | Term "double descent" coined; demonstrated in decision trees, random features, two-layer neural networks | Belkin, Hsu, Ma, Mandal (arXiv) |
| 2019 | Published in PNAS with formal treatment and broader experimental evidence | Belkin, Hsu, Ma, Mandal |
| 2019 | Deep double descent demonstrated in ResNets, CNNs, Transformers; three axes identified | Nakkiran, Kaplun, Bansal, Yang, Barak, Sutskever |
| 2020 | Benign overfitting formalized for linear regression | Bartlett, Long, Lugosi, Tsigler |
| 2020 | Triple descent in NTK at high dimensions | Adlam and Pennington |
| 2020 | Optimal regularization shown to mitigate double descent | Nakkiran et al. |
| 2022 | Precise asymptotics for ridgeless regression in high dimensions | Hastie, Montanari, Rosset, Tibshirani |
| 2022 | Precise asymptotics for random features regression | Mei and Montanari |
| 2022 | Regularization-wise double descent characterized | Fernandez, Pinto, von Luxburg |