Double descent is a phenomenon in machine learning and statistical learning theory where a model's test error, plotted against increasing model complexity, follows a two-part pattern: it first traces the classical U-shaped bias-variance tradeoff curve, peaks sharply at the interpolation threshold, and then decreases again as the model becomes heavily overparameterized. This second drop in error beyond the interpolation threshold contradicts the classical expectation that models with too many parameters will always overfit and generalize poorly.
The term "double descent" was introduced by Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal in their 2019 paper "Reconciling modern machine-learning practice and the classical bias-variance trade-off." The phenomenon has since been confirmed across a wide range of model families, including neural networks, decision trees, random forests, kernel methods, and linear models.
Imagine you are trying to draw a line through a set of dots on a page. If you use a really simple tool (like a ruler), your line will be straight and miss a lot of dots. If you use a slightly bendier tool, you can hit more dots, but at some point your tool is just bendy enough to pass through every single dot. That exact "just enough" tool makes a really wiggly, ugly line because it is stretching itself to reach every dot, including ones that are in weird places because of mistakes.
Now here is the surprising part: if you use an even bendier tool, one that could easily pass through all the dots in many different ways, it actually picks a smoother, nicer path through them. It does not need to stretch or wiggle because it has so many options for how to connect the dots. That is double descent. The error gets bad right when the tool barely fits all the dots, but then gets better again when the tool has way more flexibility than it needs.
The classical view of model selection, articulated in foundational textbooks by Hastie, Tibshirani, and Friedman (2009) and Bishop (2006), predicts a U-shaped test error curve as model complexity increases. In this framework, total prediction error decomposes into three components.
| Component | Description |
|---|---|
| Bias squared | Systematic error from the model's simplifying assumptions. Decreases as model complexity increases. |
| Variance | Error from sensitivity to the particular training set. Increases as model complexity increases. |
| Irreducible error | Noise inherent in the data that no model can eliminate. |
As complexity grows, bias decreases (the model can capture more of the true function) while variance increases (the model starts fitting noise). The optimal model sits at the complexity level that minimizes the sum of bias squared and variance. Beyond that point, classical theory predicts that test error should rise monotonically. This prediction holds well for small, underparameterized models but breaks down in the modern regime where models have far more parameters than training samples.
The interpolation threshold is the critical point at which a model has just enough capacity (parameters, depth, features, or other complexity measures) to perfectly fit every training example, achieving zero training error. At this point, the model transitions from the underparameterized regime (more training samples than parameters) to the overparameterized regime (more parameters than training samples).
At the interpolation threshold, the model is forced into a unique or nearly unique interpolating solution. If the training data contains any noise or mislabeled examples, this solution must contort itself to accommodate every noisy data point exactly. The result is a highly irregular function that generalizes poorly, producing a sharp spike in test error.
Formally, for a model with p parameters and n training samples, the interpolation threshold occurs near the ratio p/n = 1. In practice, the exact location depends on the model family, the effective degrees of freedom (which may differ from the raw parameter count), and the data distribution.
The double descent curve extends the classical U-shaped curve into a broader performance landscape with three distinct regimes.
| Regime | Parameter-to-sample ratio (p/n) | Behavior |
|---|---|---|
| Underparameterized | p/n < 1 | Follows the classical U-curve. Increasing complexity first reduces bias and test error, then begins raising variance and test error. |
| Interpolation threshold | p/n approximately equal to 1 | Test error peaks sharply. The model barely fits the training data, and noise forces distorted solutions. |
| Overparameterized | p/n > 1 | Test error decreases again. Among the many interpolating solutions, optimization selects smooth, low-norm solutions that generalize well. |
The key insight is that beyond the interpolation threshold, there exist infinitely many functions that perfectly fit the training data. Optimization algorithms such as stochastic gradient descent exhibit an implicit bias toward solutions with certain desirable properties (such as minimum norm in parameter space), and these solutions tend to generalize well despite achieving zero training error.
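The multiplicity of interpolating solutions can be exhibited directly in an overparameterized linear model: any null-space component can be added to an interpolator without changing the training fit, and the minimum-norm choice generalizes best among them. A synthetic sketch (sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 30, 120
w_true = rng.normal(size=p) / np.sqrt(p)
X = rng.normal(size=(n, p))
y = X @ w_true                            # noiseless labels for clarity
X_test = rng.normal(size=(2000, p))
y_test = X_test @ w_true

w_min = np.linalg.pinv(X) @ y             # the minimum-norm interpolator

# Rows of Vt beyond the rank span the null space of X: adding any such
# direction leaves the training fit unchanged, giving another interpolator.
_, _, Vt = np.linalg.svd(X)
w_other = w_min + Vt[n:].T @ rng.normal(size=p - n)

mse = lambda w: np.mean((X_test @ w - y_test) ** 2)
print(np.allclose(X @ w_min, y), np.allclose(X @ w_other, y))  # both interpolate
print(mse(w_min) < mse(w_other))          # min-norm generalizes far better
```

Both vectors fit the training data exactly, yet their test errors differ by orders of magnitude, which is why the particular solution an optimizer selects matters so much.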
Research by Nakkiran, Kaplun, Bansal, Yang, Barak, and Sutskever (circulated as a preprint in 2019 and published in the Journal of Statistical Mechanics: Theory and Experiment in 2021) extended the double descent phenomenon beyond model size, identifying three distinct axes along which it manifests.
Model-wise double descent is the original form of the phenomenon. When a family of models is ordered by increasing capacity (for example, by widening a convolutional neural network or adding more random features), the test error follows the double descent pattern. Nakkiran et al. demonstrated this using parameterized ResNet-18 architectures with layer widths scaled as [k, 2k, 4k, 8k], where varying k controls the total parameter count. As k increases, test error first drops, spikes near the interpolation threshold, and then drops again for large k values.
For a model of fixed size, test error can also exhibit double descent as a function of the number of training epochs. Early in training, the model learns the signal in the data and test error decreases. As training continues, the model begins to fit noise, and test error increases. With further training, the model transitions through the interpolation threshold (achieving zero training error) and eventually finds a smoother interpolating solution, causing test error to decrease once more.
This form of double descent challenges the conventional wisdom that early stopping always prevents overfitting. In some cases, training past the point of apparent overfitting leads to better final performance. The phenomenon is most pronounced in models operating near the interpolation threshold and in the presence of label noise.
Sample-wise double descent occurs when, for a model of fixed size, adding more training data temporarily increases test error before eventually improving it. This happens because additional samples shift the effective interpolation threshold: a model that was comfortably overparameterized with n samples may be pushed into the critical regime when trained on 2n samples. Nakkiran et al. demonstrated that there exist regimes where quadrupling the number of training samples actually hurts test performance for models in the critical zone.
This finding has practical consequences for data collection. It suggests that, for models of intermediate complexity, adding moderate amounts of data can temporarily degrade performance, though sufficient additional data eventually resolves the issue.
To unify the three forms of double descent, Nakkiran et al. introduced the concept of effective model complexity (EMC). EMC is defined as the maximum number of training samples on which a given training procedure (architecture plus optimizer plus hyperparameters) can achieve approximately zero training error.
EMC differs from raw parameter count in that it accounts for the entire training pipeline: the architecture, the optimizer (Adam, SGD), the learning rate schedule, regularization, and the number of training epochs. Two models with the same parameter count can have very different EMC values depending on how they are trained.
The double descent conjecture states that test error peaks when EMC is approximately equal to the number of training samples n. This framework explains all three forms of double descent.
| Axis | How EMC changes relative to n |
|---|---|
| Model-wise | Increasing model size raises EMC through the critical point EMC approximately equal to n |
| Epoch-wise | Training longer raises EMC (model fits more data over time) through the critical point |
| Sample-wise | Adding more data raises n, potentially pushing a previously overparameterized model into the critical regime |
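Because EMC is defined operationally, it can be estimated by direct search: train on increasingly large datasets and record the largest size that still reaches near-zero training error. A toy sketch with a minimum-norm linear classifier, whose EMC is roughly its feature count (the helper names, coarse grid, and tolerance are illustrative, not from the paper):

```python
import numpy as np

def estimate_emc(train_fn, make_data, candidate_ns, eps=0.01):
    """Largest n in the (coarse) grid on which the training procedure
    reaches train error <= eps -- Nakkiran et al.'s EMC, up to tolerance."""
    emc = 0
    for n in candidate_ns:
        X, y = make_data(n)
        predict = train_fn(X, y)
        if np.mean(predict(X) != y) <= eps:
            emc = n
    return emc

def min_norm_linear(X, y):
    """Toy procedure: minimum-norm least squares, thresholded to a label."""
    w = np.linalg.pinv(X) @ y
    return lambda Xq: np.sign(Xq @ w)

rng = np.random.default_rng(0)
p = 30                                    # feature count
make_data = lambda n: (rng.normal(size=(n, p)),
                       rng.choice([-1.0, 1.0], size=n))

# The procedure interpolates any n <= p random points, so its EMC is about p.
print(estimate_emc(min_norm_linear, make_data, [10, 20, 30, 60, 90]))
```

The same search applies unchanged to a neural network: swap in a training loop for `train_fn`, and EMC then reflects the optimizer, schedule, and epoch budget, not just the parameter count.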
Belkin et al. (2019) presented double descent as a reconciliation of two seemingly contradictory observations. Classical statistical learning theory (Vapnik, 1998) prescribes regularization and model selection to avoid overfitting. Modern deep learning practice, however, routinely trains massively overparameterized networks to zero training error and achieves excellent generalization.
The reconciliation lies in recognizing that the classical theory describes only the underparameterized regime, where the U-shaped curve is valid. The overparameterized regime operates under different dynamics. Belkin et al. demonstrated the double descent curve across diverse model families.
| Model family | Dataset(s) | Key finding |
|---|---|---|
| Random Fourier features | MNIST | Clear double descent with peak at interpolation threshold |
| Fully connected neural networks | MNIST | Two-layer networks exhibit second descent in overparameterized regime |
| Decision trees | Multiple | Interpolating trees can generalize well when sufficiently deep |
| Random forests | Multiple | Ensemble of interpolating trees achieves good generalization |
| AdaBoost | Multiple | Continued boosting past interpolation improves performance |
This work was published in the Proceedings of the National Academy of Sciences (PNAS), volume 116, number 32, pages 15849-15854.
Nakkiran et al. (2019) extended Belkin's findings to modern deep learning architectures and larger-scale experiments. Their work, initially published as a preprint and later in the Journal of Statistical Mechanics: Theory and Experiment (2021), tested double descent across multiple architectures and datasets.
| Architecture | Configuration | Datasets |
|---|---|---|
| ResNet-18 | Widths scaled as [k, 2k, 4k, 8k]; standard uses k=64 | CIFAR-10, CIFAR-100 |
| 5-layer CNN | Convolutional widths [k, 2k, 4k, 8k] plus fully connected output | CIFAR-10, CIFAR-100 |
| Transformer | 6-layer encoder-decoder, embedding dimension d_model scaled, d_ff = 4 * d_model | IWSLT'14 (de-en), WMT'14 (en-fr, subsampled to 200K sentences) |
| Random Fourier features | Varying number of random features | Fashion-MNIST |
Double descent appears most strongly in settings with label noise, though Nakkiran et al. argued that this reflects model misspecification more broadly rather than noise per se. Adding label noise shifts the interpolation threshold rightward and amplifies the peak in test error. However, the authors also demonstrated double descent without any added noise (for example, ResNets on CIFAR-100 and CNNs on CIFAR-100), confirming that noise is not a prerequisite for the phenomenon.
Double descent is closely tied to the broader question of why overparameterized models generalize well. Classical learning theory bounds (such as VC dimension and Rademacher complexity) suggest that models with more parameters than training samples should generalize poorly. In practice, the opposite often holds: larger models frequently achieve lower test error.
Several theoretical perspectives help explain this.
Implicit regularization. Optimization algorithms like SGD and its variants do not simply find any interpolating solution; they find specific ones. In linear regression, gradient descent converges to the minimum-norm interpolating solution. In neural networks, the implicit bias of SGD favors solutions that lie in flat regions of the loss landscape, which tend to generalize better.
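The linear-regression case is easy to verify numerically: gradient descent initialized at zero stays in the row space of the data and converges to the same solution the pseudoinverse gives in closed form (sizes, step size, and iteration count below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 100                        # underdetermined: many interpolators
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Plain gradient descent on mean squared error, initialized at zero.
w = np.zeros(p)
for _ in range(20000):
    w -= 0.01 * X.T @ (X @ w - y) / n

w_min_norm = np.linalg.pinv(X) @ y    # closed-form minimum-norm interpolator

print(np.allclose(X @ w, y, atol=1e-4))       # gradient descent interpolates
print(np.allclose(w, w_min_norm, atol=1e-4))  # and finds the min-norm solution
```

Among the infinitely many weight vectors that fit this data exactly, gradient descent lands on the one with smallest norm; it never received an explicit penalty telling it to do so.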
Effective dimension. The number of parameters that are functionally important for prediction may be much smaller than the total parameter count. Overparameterized models may have many redundant parameters, and the effective dimension (measured by the curvature of the loss landscape or the rank of the Hessian) can be much smaller than p. This helps explain why models with millions of parameters do not necessarily overfit.
Neural tangent kernel (NTK) perspective. In the infinite-width limit, neural networks trained with gradient descent behave like kernel regression with the NTK. Analysis of NTK regression reveals double descent behavior as a function of the ratio of parameters to samples, providing a theoretical foundation for the phenomenon in neural networks. Adlam and Pennington (2020) showed that the NTK in high dimensions can exhibit not just double descent but even triple descent, where an additional peak appears at a second critical ratio.
Benign overfitting is a closely related concept that describes the situation where a model perfectly interpolates noisy training data yet still achieves near-optimal test performance. The term was formalized by Peter Bartlett, Philip Long, Gabor Lugosi, and Alexander Tsigler in their 2020 paper "Benign overfitting in linear regression," published in PNAS.
Bartlett et al. showed that benign overfitting in linear regression requires the data covariance matrix to satisfy specific spectral conditions. In particular, the number of "unimportant" directions in parameter space (those with small eigenvalues in the covariance matrix) must significantly exceed the sample size. When this condition holds, the minimum-norm interpolating solution places the noise component of its fit into these unimportant directions, where it has little effect on predictions.
The relationship between benign overfitting and double descent is as follows: the overparameterized regime of the double descent curve is precisely the regime where benign overfitting can occur. In the underparameterized regime, the model cannot interpolate, so benign overfitting is not applicable. At the interpolation threshold, overfitting is "catastrophic" (harmful). Beyond the threshold, overfitting becomes "benign" (harmless) under the right conditions.
| Regime | Overfitting type | Generalization |
|---|---|---|
| Underparameterized (p < n) | No interpolation possible | Depends on bias-variance balance |
| Interpolation threshold (p approximately equal to n) | Catastrophic overfitting | Poor generalization; peak test error |
| Mildly overparameterized (p slightly greater than n) | Potentially harmful overfitting | May still generalize poorly depending on data structure |
| Heavily overparameterized (p much greater than n) | Benign overfitting | Good generalization; smooth interpolating solutions |
Several lines of theoretical work have provided rigorous foundations for the double descent phenomenon.
The earliest theoretical observations of non-monotonic generalization curves came from the statistical physics community. Krogh and Hertz (1992) provided theoretical explanations for model-wise double descent in linear models using methods from statistical mechanics. Opper (1995) and Opper and Kinzel (1996) further analyzed generalization in neural networks using the replica method and related techniques. These early results showed that, at intermediate complexity levels where the model size equals the number of training examples, the model is very sensitive to noise and generalizes poorly.
Hastie, Montanari, Rosset, and Tibshirani (2022) provided a precise analysis of double descent in "ridgeless" (unregularized) least squares regression. Their paper, "Surprises in high-dimensional ridgeless least squares interpolation" (Annals of Statistics, vol. 50, no. 2), showed that in the proportional asymptotic regime (where p and n grow together with p/n approaching a constant gamma), the test risk of the minimum-norm interpolating estimator diverges as gamma approaches 1 from either side and decreases for gamma well above 1. This provides an exact mathematical characterization of the double descent peak.
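In the isotropic special case of their analysis the limiting risk has a closed form, which makes the divergence at gamma = 1 and the second descent explicit. A sketch (signal strength r2 and noise sigma2 are free parameters; this is the isotropic formula only, not their general anisotropic result):

```python
def ridgeless_risk(gamma, r2=1.0, sigma2=0.25):
    """Asymptotic test risk of the minimum-norm least squares estimator
    with isotropic features, in the proportional limit p/n -> gamma."""
    if gamma < 1:
        # Underparameterized: pure variance term, blowing up as gamma -> 1.
        return sigma2 * gamma / (1 - gamma)
    # Overparameterized: bias from the unfitted part of the signal
    # plus a variance term that shrinks as gamma grows.
    return r2 * (1 - 1 / gamma) + sigma2 / (gamma - 1)

for g in [0.5, 0.9, 0.99, 1.01, 1.1, 2.0, 10.0]:
    print(f"gamma = {g}: risk = {ridgeless_risk(g):.3f}")
```

The printed values climb without bound as gamma approaches 1 from either side and fall again for gamma well above 1, exactly the double descent peak in analytic form.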
Mei and Montanari (2022) extended this analysis to random features regression in "The generalization error of random features regression: Precise asymptotics and the double descent curve" (Communications on Pure and Applied Mathematics, vol. 75, no. 4, pages 667-766). They showed that the double descent curve emerges naturally in the proportional limit and that the global minimum of test risk can lie in the extremely overparameterized regime.
Recent work has refined the bias-variance decomposition for overparameterized settings. In the classical regime, variance increases monotonically with complexity. In the overparameterized regime, both bias and variance can decrease as complexity increases beyond the interpolation threshold. This occurs because the minimum-norm interpolating solution spreads its weight across many parameters, reducing the influence of any single noisy training point (which lowers variance) while maintaining the ability to capture the true signal (keeping bias low).
Regularization interacts with double descent in nuanced ways.
Nakkiran et al. and subsequent work (Kobak, Lomond, and Sanchez, 2020) showed that appropriately tuned L2 regularization (weight decay) can smooth out the double descent peak, converting the double-descent-shaped risk curve into a monotonically decreasing one. For linear regression with isotropic data, optimally tuned L2 regularization achieves monotonic test performance as either sample size or model size grows.
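A quick numerical check of this smoothing effect at the threshold p = n, comparing a near-zero penalty with a moderate one (sizes, noise level, and the penalty values are illustrative, not optimally tuned):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma = 40, 40, 0.5             # exactly at the interpolation threshold
w_true = rng.normal(size=p) / np.sqrt(p)
X = rng.normal(size=(n, p))
y = X @ w_true + sigma * rng.normal(size=n)
X_test = rng.normal(size=(2000, p))
y_test = X_test @ w_true

def ridge_test_mse(lam):
    """Test error of ridge regression with L2 penalty lam."""
    beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    return np.mean((X_test @ beta - y_test) ** 2)

print(round(ridge_test_mse(1e-8), 2))  # near-ridgeless: the peak bites
print(round(ridge_test_mse(1.0), 2))   # moderate ridge: peak suppressed
```

At p = n the design matrix is nearly singular, so the almost-unregularized fit amplifies label noise along weak directions; even a modest penalty keeps the test error bounded.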
Fernandez, Pinto, and von Luxburg (2022) demonstrated that double descent can also occur as a function of regularization strength. As the L2 penalty decreases from a large value toward zero, the test error can exhibit a non-monotonic pattern analogous to model-wise double descent. They showed that this regularization-wise double descent can be understood as a superposition of bias-variance tradeoffs corresponding to different components of the model. Per-layer or per-component regularization can eliminate this effect.
Early stopping acts as a form of implicit regularization that limits the effective model complexity. By halting training before the model reaches the interpolation threshold, early stopping can avoid the double descent peak entirely. However, as epoch-wise double descent shows, stopping at the apparent overfitting point (where test error first rises) may not be optimal; training further can lead to a second descent in test error. This creates a tension between conservative early stopping and the potential benefits of extended training.
Subsequent research has revealed that the double descent pattern can itself be extended. Adlam and Pennington (2020) showed that the NTK in high dimensions can exhibit triple descent, where a third regime of decreasing test error appears after a second peak. This triple descent arises from the interaction of multiple scales in the feature map.
d'Ascoli, Refinetti, Biroli, and Krzakala (2020) investigated triple descent in the context of sample-wise risk, showing that at high noise levels, the risk profile can exhibit peaks both when the number of samples equals the number of parameters and when it equals the input dimension. The relative prominence of these peaks depends on the nonlinearity of the activation function.
For models with multiple feature components (such as concatenations of different random feature types), Rocks and Mehta showed that the risk curve can exhibit (K+1)-fold descent for K distinct feature types, establishing the concept of multiple descent.
The double descent phenomenon has several practical consequences for machine learning practitioners.
Classical model selection criteria (such as AIC, BIC, and cross-validation at moderate complexities) assume a U-shaped test error curve and may select suboptimal models. In the presence of double descent, the globally best model may be the largest available one, well beyond the interpolation threshold. Practitioners should consider evaluating very large models rather than restricting their search to the classical "sweet spot."
Epoch-wise double descent suggests that the common practice of early stopping based on validation error may sometimes be premature. If the model is in the critical regime and validation error has begun to rise, continued training may eventually yield better performance. Monitoring training and validation error for longer periods, even past apparent overfitting, can be informative.
Sample-wise double descent implies that adding a moderate amount of data to a model operating near the interpolation threshold can temporarily worsen performance. This does not mean more data is harmful in general. It means that when increasing dataset size, practitioners should also consider scaling the model to stay in the overparameterized regime.
Optimal regularization can eliminate the double descent peak. For practitioners who cannot afford to train very large models, tuning regularization (weight decay, dropout, data augmentation) is especially important to smooth the transition through the critical regime.
Label noise amplifies the double descent peak. Cleaning training labels and reducing noise in the data can mitigate the worst effects of double descent, particularly for models near the interpolation threshold.
| Concern | Recommendation |
|---|---|
| Model size | Prefer larger, overparameterized models when computationally feasible |
| Training epochs | Do not stop training solely because validation error begins to rise; monitor for a potential second descent |
| Adding data | When adding data, also consider scaling model capacity to avoid the critical regime |
| Regularization | Tune regularization carefully; optimal weight decay can eliminate the double descent peak |
| Label noise | Invest in data quality; noisy labels amplify the peak at the interpolation threshold |
| Model selection | Evaluate models across a wide range of complexities, including far beyond the interpolation threshold |
Double descent is part of a broader family of surprising behaviors observed in modern machine learning.
Grokking. Grokking refers to the phenomenon where a model trained on a small dataset achieves perfect training accuracy early but does not generalize until much later in training, when test accuracy suddenly jumps. Grokking shares conceptual similarities with epoch-wise double descent in that extended training can produce unexpected improvements in generalization.
Scaling laws. Empirical scaling laws (Kaplan et al., 2020) show that test loss decreases as a power law with increasing model size, dataset size, and compute. Double descent adds nuance to these scaling laws by showing that the decrease is not always monotonic and that there are critical regimes where bigger models or more data can temporarily hurt.
Neural network pruning. The observation that heavily overparameterized networks can be pruned to a fraction of their size without losing accuracy (the lottery ticket hypothesis) is consistent with the effective dimension perspective on double descent. The full model may use only a small fraction of its parameters for prediction, with the rest providing the "room" needed for benign overfitting.
| Year | Contribution | Researchers |
|---|---|---|
| 1992 | First theoretical explanations of non-monotonic generalization in linear models using statistical mechanics | Krogh and Hertz |
| 1995 | Analysis of generalization in neural networks showing sensitivity at interpolation | Opper; Opper and Kinzel (1996) |
| 2017 | Investigation of high-dimensional dynamics of generalization error in neural networks | Advani and Saxe |
| 2018 | Term "double descent" coined; demonstrated in decision trees, random features, two-layer neural networks | Belkin, Hsu, Ma, Mandal (arXiv) |
| 2019 | Published in PNAS with formal treatment and broader experimental evidence | Belkin, Hsu, Ma, Mandal |
| 2019 | Deep double descent demonstrated in ResNets, CNNs, Transformers; three axes identified | Nakkiran, Kaplun, Bansal, Yang, Barak, Sutskever |
| 2020 | Benign overfitting formalized for linear regression | Bartlett, Long, Lugosi, Tsigler |
| 2020 | Triple descent in NTK at high dimensions | Adlam and Pennington |
| 2020 | Optimal regularization shown to mitigate double descent | Nakkiran et al. |
| 2022 | Precise asymptotics for ridgeless regression in high dimensions | Hastie, Montanari, Rosset, Tibshirani |
| 2022 | Precise asymptotics for random features regression | Mei and Montanari |
| 2022 | Regularization-wise double descent characterized | Fernandez, Pinto, von Luxburg |