See also: Machine learning terms
Overfitting is a phenomenon that occurs in machine learning when a model becomes excessively tailored to the training set, resulting in a decrease in its generalization performance on unseen data. In essence, the model learns the noise and peculiarities present in the training data, which negatively impacts its ability to make accurate predictions for new, unseen data. Overfitting is a common challenge in machine learning and can lead to models with poor predictive performance and reduced utility in real-world applications.
A model that overfits has low training loss but high validation or test set loss. The gap between training performance and test performance is the hallmark of overfitting. In statistical terms, the model has high variance and low bias, meaning it is highly sensitive to the specific training examples it has seen.
Formally, consider a function $f$ learned from a training set $D_{\text{train}}$. The model overfits when its error on $D_{\text{train}}$ is substantially lower than its error on unseen data drawn from the same distribution. That is, $E_{\text{train}}(f) \ll E_{\text{test}}(f)$, where the difference $E_{\text{test}}(f) - E_{\text{train}}(f)$ is called the generalization gap.
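As a concrete illustration, the generalization gap can be measured directly by comparing error on the training set with error on held-out data. The following sketch (scikit-learn on a synthetic dataset; the dataset and model choice are illustrative assumptions, not part of the original text) fits a deliberately flexible model and reports both errors.

```python
# Minimal sketch: measuring the generalization gap E_test(f) - E_train(f).
# The dataset and model choice are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import zero_one_loss

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# An unconstrained decision tree is flexible enough to fit the training set almost exactly.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

e_train = zero_one_loss(y_train, model.predict(X_train))
e_test = zero_one_loss(y_test, model.predict(X_test))
print(f"E_train = {e_train:.3f}, E_test = {e_test:.3f}, gap = {e_test - e_train:.3f}")
```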
The bias-variance tradeoff is one of the most fundamental concepts in supervised learning and is directly connected to overfitting. Every prediction error a model makes can be decomposed into three components: bias, variance, and irreducible error (noise).
Bias measures the error introduced by approximating a complex real-world problem with a simplified model. High bias means the model makes strong assumptions about the data and misses important patterns. This leads to underfitting.
Variance measures how much the model's predictions change when trained on different subsets of the data. High variance means the model is overly sensitive to the specific training set and captures noise as if it were signal. This leads to overfitting.
The irreducible error represents noise inherent in the data that no model can eliminate, regardless of complexity.
These components relate through the equation:
$\text{Total Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}$
As model complexity increases, bias generally decreases (the model can capture more patterns) while variance increases (the model becomes more sensitive to training data). The optimal model complexity sits at the point where the sum of bias and variance is minimized. Past this point, additional complexity only increases test error through overfitting, even though training error keeps dropping.
| Component | When low | When high |
|---|---|---|
| Bias | Model captures true patterns well | Model is too simple, misses patterns (underfitting) |
| Variance | Predictions are stable across datasets | Predictions change wildly with different training data (overfitting) |
| Total error | Good generalization | Poor performance on new data |
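The decomposition can be illustrated empirically by repeatedly fitting the same model class on different resampled training sets and measuring how much its predictions move around. The following sketch is a simplified Monte Carlo estimate on a synthetic 1-D regression problem; the true function, noise level, and polynomial degrees are illustrative assumptions.

```python
# Sketch: Monte Carlo estimate of bias^2 and variance for two model classes
# on a synthetic 1-D problem. The true function and noise level are assumptions.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
true_f = lambda x: np.sin(2 * np.pi * x)          # the unknown "true" pattern
x_grid = np.linspace(0, 1, 50).reshape(-1, 1)     # points where predictions are evaluated

def simulate(degree, n_datasets=200, n_samples=30, noise=0.3):
    preds = np.empty((n_datasets, len(x_grid)))
    for i in range(n_datasets):
        x = rng.uniform(0, 1, n_samples).reshape(-1, 1)
        y = true_f(x).ravel() + rng.normal(0, noise, n_samples)
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        preds[i] = model.fit(x, y).predict(x_grid)
    bias_sq = np.mean((preds.mean(axis=0) - true_f(x_grid).ravel()) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

for degree in (1, 15):                            # simple vs. very flexible model
    b2, var = simulate(degree)
    print(f"degree={degree:>2}  bias^2={b2:.3f}  variance={var:.3f}")
```

The low-degree model shows high bias and low variance (underfitting), while the high-degree model shows the opposite pattern (overfitting).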
Several indicators suggest a model is overfitting:

- Training loss (or error) is substantially lower than validation or test loss.
- Validation loss stops improving and begins to rise while training loss keeps falling.
- Performance drops noticeably when the model is evaluated on new data or deployed.
- Predictions change significantly when the model is retrained on a slightly different sample of the data.
Overfitting can be caused by a variety of factors, including:
Model complexity: A complex model with many parameters may be more prone to overfitting, as it has the capacity to fit the training data too closely. A neural network with millions of parameters can memorize a small dataset rather than learning generalizable features. This is sometimes described as the model having too many degrees of freedom relative to the amount of training data.
Insufficient training data: A small training dataset increases the likelihood of overfitting, as the model may not have enough information to learn the underlying patterns in the data. This can lead to a model that is overly influenced by random noise or unique features of the training data. As a rough guideline, the number of training examples should be several times larger than the number of model parameters, though this ratio varies depending on the task.
Noise in the data: Noisy or corrupted data can exacerbate overfitting, as the model may learn to fit the noise rather than the true underlying patterns in the data. Label noise (incorrect labels on training examples) is particularly harmful because the model tries to accommodate contradictory information.
Training for too many epochs: Continuing to train beyond the point where validation performance peaks allows the model to progressively memorize training data. Each additional epoch past this point typically widens the gap between training and validation performance.
Inappropriate feature engineering: Including too many features, especially noisy or irrelevant ones, gives the model more dimensions in which to overfit. This is sometimes called the "curse of dimensionality," where the feature space becomes so large relative to the number of samples that the model can find spurious patterns.
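The interaction between model complexity and data size can be seen directly. The sketch below (synthetic data; the sample size and range of polynomial degrees are illustrative assumptions) fits polynomials of increasing degree to a small noisy sample: training error keeps falling while held-out error eventually rises.

```python
# Sketch: overfitting driven by model complexity on a small, noisy dataset.
# Data generation and the range of polynomial degrees are illustrative assumptions.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
x_train = rng.uniform(0, 1, 20).reshape(-1, 1)               # very little data
y_train = np.sin(2 * np.pi * x_train).ravel() + rng.normal(0, 0.2, 20)
x_test = rng.uniform(0, 1, 200).reshape(-1, 1)
y_test = np.sin(2 * np.pi * x_test).ravel() + rng.normal(0, 0.2, 200)

for degree in (1, 3, 9, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    mse_train = mean_squared_error(y_train, model.predict(x_train))
    mse_test = mean_squared_error(y_test, model.predict(x_test))
    print(f"degree={degree:>2}  train MSE={mse_train:.3f}  test MSE={mse_test:.3f}")
```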
Overfitting and underfitting sit on opposite ends of the model complexity spectrum. Understanding both is necessary for building models that generalize well.
| Aspect | Underfitting | Overfitting |
|---|---|---|
| Also known as | High bias | High variance |
| Training performance | Poor | Excellent |
| Validation/test performance | Poor | Poor |
| Model complexity | Too low for the task | Too high for the data available |
| Typical cause | Model is too simple; not enough features or capacity | Model is too complex; too many parameters relative to data |
| Training-validation gap | Small (both are bad) | Large (training is good, validation is bad) |
| How to fix | Increase model capacity, add features, train longer | Add regularization, get more data, simplify model |
| Example | Linear model on highly nonlinear data | Deep network memorizing 100 training examples |
The goal is to find the sweet spot between these two extremes, where the model is complex enough to capture the true underlying patterns but not so complex that it memorizes noise.
Overfitting manifests differently depending on the type of model being used. The following table summarizes how overfitting behaves across common model families and what specific remedies apply to each.
| Model type | How overfitting occurs | Typical remedies |
|---|---|---|
| Linear models (linear regression, logistic regression) | When too many features are included (especially with collinear or irrelevant features), the model assigns large coefficients to fit noise in the training data. Polynomial regression with high-degree terms is a classic example. | L1 regularization (Lasso) for feature selection, L2 regularization (Ridge) for coefficient shrinkage, elastic net, removing irrelevant features |
| Decision trees | An unpruned tree will grow until every training example is correctly classified, creating branches for individual outliers and noise. Deep trees have low bias but very high variance. | Pruning (pre-pruning by limiting depth or minimum samples per leaf; post-pruning by removing branches that do not improve validation performance), using ensemble methods like random forests |
| Neural networks | Large networks with millions or billions of parameters can memorize entire training sets. Overfitting in neural networks often shows up as a steadily widening gap between training and validation loss curves. | Dropout, weight decay, early stopping, batch normalization, data augmentation, reducing layer count or width |
| k-Nearest Neighbors (k-NN) | With k=1, the model memorizes the training data perfectly, since each prediction is based on the single closest training example. Small values of k create complex, jagged decision boundaries that are highly sensitive to noise. | Increasing k (larger k values smooth predictions), using distance-weighted voting, feature scaling, dimensionality reduction |
| Support vector machines | Overfitting occurs when the kernel maps data into an excessively high-dimensional space, or when the regularization parameter C is too large, penalizing margin violations so heavily that the decision boundary bends to fit individual training points. | Tuning the C parameter, choosing an appropriate kernel and kernel parameters, cross-validation for hyperparameter selection |
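The effects described in the decision tree and k-NN rows can be reproduced in a few lines. The sketch below (scikit-learn on a synthetic dataset with injected label noise; all settings are illustrative assumptions) compares an unconstrained tree with a depth-limited one, and k-NN with k=1 versus a larger k.

```python
# Sketch: how two model-specific remedies from the table affect overfitting.
# Dataset and hyperparameter values are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, n_features=20, flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

models = {
    "tree, unlimited depth": DecisionTreeClassifier(random_state=0),
    "tree, max_depth=3": DecisionTreeClassifier(max_depth=3, random_state=0),
    "k-NN, k=1": KNeighborsClassifier(n_neighbors=1),
    "k-NN, k=15": KNeighborsClassifier(n_neighbors=15),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(f"{name:22s} train={model.score(X_tr, y_tr):.2f}  test={model.score(X_te, y_te):.2f}")
```

The unconstrained tree and k=1 typically show near-perfect training accuracy with a large drop on the test set, while the constrained variants trade a little training accuracy for a smaller gap.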
Several techniques can be employed to prevent or mitigate overfitting in machine learning models:
| Technique | How it works | When to use |
|---|---|---|
| Regularization (L1/L2) | Adds a penalty on the magnitude of model weights to the loss function, discouraging overly complex models | Linear models, neural networks; L1 for feature selection, L2 for weight shrinkage |
| Elastic net | Combines L1 and L2 penalties, balancing feature selection with coefficient shrinkage | When features are correlated; when p >> n (more features than samples) |
| Dropout | Randomly deactivates a fraction of neurons during each training step, preventing co-adaptation | Deep learning models, especially fully connected layers |
| Early stopping | Monitors validation performance during training and stops when it begins to degrade | Almost any iterative training process; requires a validation set |
| Data augmentation | Artificially expands the training set by applying transformations (rotations, flips, crops, noise) to existing examples | Computer vision, NLP, audio tasks where transformed data remains valid |
| Cross-validation | Divides data into multiple folds, trains on different combinations, and averages results to get a more reliable estimate of generalization | Model selection and hyperparameter tuning, especially with limited data |
| Ensemble methods | Combines predictions from multiple models (e.g., random forests, bagging, boosting) to reduce variance | When single models show high variance; bagging specifically reduces variance |
| Batch normalization | Normalizes activations within each layer, which has a mild regularizing effect and stabilizes training | Deep neural networks; often used alongside dropout |
| Pruning | Removes unnecessary parameters, branches, or neurons from a trained model, reducing its effective complexity | Decision trees (branch pruning), neural networks (weight pruning) |
| Reducing model size | Uses fewer layers, fewer neurons per layer, or fewer parameters to limit the model's capacity to memorize | When the model is clearly too large for the dataset |
| Getting more data | Increases the training set size so the model has more examples to learn general patterns from | Whenever possible; the most reliable way to reduce overfitting |
| Weight decay | Shrinks weights toward zero at each update step, functionally similar to L2 regularization | Standard practice in deep learning optimizers like AdamW |
Regularization techniques add a penalty term to the model's loss function, which discourages the model from becoming overly complex.
L1 regularization (also called Lasso) adds the sum of absolute values of the weights to the loss: $L_{\text{total}} = L_{\text{data}} + \lambda \sum |w_i|$. This drives some weights to exactly zero, performing automatic feature selection. L1 is especially useful when many features are irrelevant or redundant.
L2 regularization (also called Ridge) adds the sum of squared weights: $L_{\text{total}} = L_{\text{data}} + \lambda \sum w_i^2$. This shrinks all weights toward zero without eliminating any entirely. L2 is effective at preventing any single weight from growing too large.
Elastic net, introduced by Zou and Hastie (2005), combines both penalties: $L_{\text{total}} = L_{\text{data}} + \lambda_1 \sum |w_i| + \lambda_2 \sum w_i^2$. This approach provides the feature selection benefits of L1 while retaining L2's ability to handle correlated features. When predictors are strongly correlated, L1 alone tends to arbitrarily select one and drop the others, while elastic net selects correlated predictors together and assigns them similar coefficients.
The hyperparameter $\lambda$ (or $\alpha$ in some implementations) controls the strength of regularization. A larger value produces a simpler model with more regularization, while a smaller value allows the model to fit the training data more closely.
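In practice these penalties are usually applied through off-the-shelf estimators rather than hand-written loss terms. The sketch below (scikit-learn on synthetic data with many irrelevant features; the alpha values and data dimensions are illustrative assumptions) compares ordinary least squares with Ridge, Lasso, and elastic net.

```python
# Sketch: L1, L2, and elastic net regularization with scikit-learn.
# The dataset and regularization strengths (alpha) are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet

# Many features, few of them informative, and relatively few samples.
X, y = make_regression(n_samples=100, n_features=80, n_informative=10,
                       noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

models = {
    "OLS (no penalty)": LinearRegression(),
    "Ridge (L2)": Ridge(alpha=1.0),
    "Lasso (L1)": Lasso(alpha=1.0),
    "Elastic net": ElasticNet(alpha=1.0, l1_ratio=0.5),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    n_zero = int(np.sum(np.isclose(model.coef_, 0.0)))
    print(f"{name:18s} test R^2={model.score(X_te, y_te):.2f}  zeroed coefficients={n_zero}")
```

Lasso and elastic net zero out many of the irrelevant coefficients, while Ridge shrinks all of them without eliminating any.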
Dropout, proposed by Srivastava et al. (2014), randomly sets a fraction of neuron activations to zero during each training step. A typical dropout rate is 0.2 to 0.5, meaning 20% to 50% of neurons are deactivated on each forward pass. This prevents neurons from co-adapting: since any neuron might be absent on a given training step, the network is forced to learn redundant representations that do not depend on specific neurons. At inference time, all neurons are active, and their outputs are scaled to account for the missing activations during training. Dropout can be interpreted as training an implicit ensemble of exponentially many sub-networks, which is why it reduces variance.
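A minimal way to see the mechanism is the "inverted dropout" formulation used in most modern frameworks, where the scaling happens during training so that no adjustment is needed at inference. The NumPy sketch below is illustrative only; real implementations (e.g., `torch.nn.Dropout`) handle this inside the framework.

```python
# Sketch: inverted dropout on a layer's activations (NumPy, illustrative only).
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, rate=0.5, training=True):
    """Randomly zero a fraction `rate` of activations during training.

    With inverted dropout, surviving activations are scaled by 1/(1-rate)
    so the expected activation is unchanged and inference needs no rescaling.
    """
    if not training or rate == 0.0:
        return activations
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

h = np.ones((2, 8))                           # stand-in for hidden-layer activations
print(dropout(h, rate=0.5, training=True))    # roughly half the units zeroed, survivors scaled to 2.0
print(dropout(h, rate=0.5, training=False))   # unchanged at inference time
```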
Early stopping involves monitoring the model's performance on a validation dataset during the training process and stopping training when the performance starts to degrade. In practice, a "patience" parameter specifies how many epochs of no improvement to wait before halting. This prevents the model from overfitting the training data by limiting the total number of training updates. Early stopping is one of the simplest and most widely used regularization techniques, applicable to virtually any model trained with iterative optimization.
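The logic is simple enough to implement by hand. The sketch below runs early stopping with a patience counter on a toy full-batch gradient-descent problem; the synthetic task and all hyperparameters are illustrative assumptions.

```python
# Sketch: early stopping with a patience counter on a toy gradient-descent loop.
# The synthetic task and all hyperparameters are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_features = 50
X_train = rng.normal(size=(40, n_features))        # few samples, many features: easy to overfit
true_w = np.zeros(n_features)
true_w[:5] = 1.0                                   # only five features actually matter
y_train = X_train @ true_w + rng.normal(0, 0.5, 40)
X_val = rng.normal(size=(200, n_features))
y_val = X_val @ true_w + rng.normal(0, 0.5, 200)

w = np.zeros(n_features)
lr, patience = 0.01, 10
best_val, best_w, bad_epochs = np.inf, w.copy(), 0

for epoch in range(2000):
    grad = 2 * X_train.T @ (X_train @ w - y_train) / len(y_train)
    w -= lr * grad                                 # one full-batch gradient step
    val_loss = np.mean((X_val @ w - y_val) ** 2)
    if val_loss < best_val:
        best_val, best_w, bad_epochs = val_loss, w.copy(), 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                 # no improvement for `patience` consecutive epochs
            print(f"stopped at epoch {epoch}")
            break

w = best_w                                         # restore the best checkpoint, not the last one
print(f"best validation MSE: {best_val:.3f}")
```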
Cross-validation involves dividing the dataset into multiple subsets and training the model on different combinations of these subsets. This helps assess the model's generalization performance and can be used to identify overfitting. The most common form is k-fold cross-validation, where the data is split into k equally sized folds. The model trains k times, each time holding out a different fold for validation. The average performance across all folds gives a more reliable estimate of how the model will perform on unseen data.
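With scikit-learn this amounts to a single call. The sketch below (synthetic data, 5 folds; all settings are illustrative assumptions) also shows how comparing the cross-validated score with the training-set score exposes overfitting.

```python
# Sketch: 5-fold cross-validation as a more reliable estimate of generalization.
# Dataset and model settings are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, flip_y=0.1, random_state=0)

model = DecisionTreeClassifier(random_state=0)
cv_scores = cross_val_score(model, X, y, cv=5)              # 5-fold cross-validation
train_score = model.fit(X, y).score(X, y)                   # accuracy on the data it was trained on

print(f"training accuracy:        {train_score:.2f}")        # typically near 1.0 for an unpruned tree
print(f"cross-validated accuracy: {cv_scores.mean():.2f} +/- {cv_scores.std():.2f}")
```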
Data augmentation artificially increases the size and diversity of the training set by applying transformations to existing examples. In computer vision, common augmentations include random rotations, horizontal flips, cropping, color jittering, and adding Gaussian noise. In natural language processing, augmentation techniques include synonym replacement, back-translation, and random insertion or deletion of words. By presenting the model with varied versions of the same underlying data, augmentation forces it to learn more robust, generalizable features rather than memorizing specific details of the training examples.
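In the image domain this is typically a pipeline of random transforms applied on the fly during training. The sketch below uses torchvision's transforms API, a common choice; the specific transforms and their parameters are illustrative assumptions.

```python
# Sketch: a typical image augmentation pipeline with torchvision.
# The specific transforms and their parameters are illustrative assumptions.
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),                     # random crop, rescaled to 224x224
    transforms.RandomHorizontalFlip(p=0.5),                # flip half of the images
    transforms.RandomRotation(degrees=15),                 # small random rotations
    transforms.ColorJitter(brightness=0.2, contrast=0.2),  # mild color perturbation
    transforms.ToTensor(),
])

# Validation and test data are NOT augmented; only deterministic preprocessing is applied.
eval_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
```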
Pruning reduces model complexity by removing unnecessary components after training. For decision trees, pruning cuts branches that add little predictive value on validation data, preventing the tree from fitting noise in the training set. For neural networks, pruning involves removing weights, neurons, or entire layers with small magnitudes, under the assumption that low-magnitude parameters contribute little to the model's output. The lottery ticket hypothesis (Frankle and Carbin, 2019) demonstrated that dense neural networks contain sparse subnetworks ("winning tickets") that can match or exceed the full network's accuracy when trained in isolation. These subnetworks are typically 10-20% of the original size for architectures tested on MNIST and CIFAR-10. Pruning reduces overfitting by limiting the effective number of parameters while also improving inference efficiency.
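For neural networks, magnitude-based weight pruning is available off the shelf. The PyTorch sketch below removes the 30% of weights with the smallest absolute value from a single linear layer; the layer sizes and pruning amount are illustrative assumptions.

```python
# Sketch: magnitude-based weight pruning of a single layer with PyTorch.
# Layer sizes and the pruning amount are illustrative assumptions.
import torch
import torch.nn.utils.prune as prune

layer = torch.nn.Linear(128, 64)

# Zero out the 30% of weights with the smallest absolute value (L1 magnitude).
prune.l1_unstructured(layer, name="weight", amount=0.3)

sparsity = float((layer.weight == 0).float().mean())
print(f"fraction of weights set to zero: {sparsity:.2f}")   # ~0.30

# Make the pruning permanent (removes the mask and re-parametrization).
prune.remove(layer, "weight")
```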
Choosing an appropriate model with the right level of complexity for the given problem can help prevent overfitting. Models that are too complex may overfit the data, while models that are too simple may underfit the data. Techniques like cross-validation and information criteria (AIC, BIC) help compare models of different complexities on a fair basis.
Learning curves are a practical diagnostic tool for detecting overfitting (and underfitting). They plot model performance (loss or accuracy) on the y-axis against either the number of training epochs or the training set size on the x-axis.
Loss vs. epochs curves: In a well-fitting model, both the training loss and the validation loss decrease together and converge to similar values. When overfitting occurs, the training loss continues to decrease while the validation loss reaches a minimum and then begins to increase. The point where the validation loss starts rising is the optimal stopping point. The gap between the two curves grows wider as overfitting gets worse.
Performance vs. training set size curves: When plotted against training set size, a model that overfits will show high training performance but low validation performance at small dataset sizes. As the dataset grows, the training performance typically drops slightly (it becomes harder to memorize more data) while the validation performance improves. The two curves converge as the dataset becomes large enough for the model's complexity. If the curves have converged and performance is still poor, the model is underfitting and needs more capacity.
Learning curves help practitioners decide whether to collect more data (if the gap is still large and the curves have not converged), simplify the model (if overfitting is severe), or increase model complexity (if both curves have converged at poor performance).
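scikit-learn can generate the performance-versus-training-set-size curve directly. The sketch below (synthetic data; the model, fold count, and grid of training sizes are illustrative assumptions) computes the two curves whose gap is the diagnostic described above.

```python
# Sketch: computing a performance-vs-training-set-size learning curve.
# Dataset, model, and the grid of training sizes are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.1, random_state=0)

train_sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5,
)

for n, tr, va in zip(train_sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    # A large, persistent gap between the two columns indicates overfitting.
    print(f"n_train={n:4d}  train acc={tr:.2f}  validation acc={va:.2f}")
```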
The following step-by-step approach helps practitioners identify and resolve overfitting in a systematic way:

1. Confirm the diagnosis by plotting learning curves and comparing training and validation performance; a large, growing gap indicates overfitting.
2. Check the data: look for label noise, duplicated or leaked examples, and whether the training set is simply too small for the model's capacity.
3. Apply inexpensive remedies first, such as early stopping and standard regularization (L1/L2, weight decay, dropout).
4. If the gap persists, reduce model capacity, add data augmentation, or collect more training data.
5. Re-evaluate with cross-validation to confirm that the changes actually improved generalization.
A landmark study by Zhang et al. (2017), titled "Understanding Deep Learning Requires Rethinking Generalization," demonstrated that standard deep neural networks have enough capacity to perfectly memorize entirely random labels. The authors trained state-of-the-art convolutional networks on image classification tasks where the true labels were replaced with random assignments. The networks achieved zero training error on these random labels, proving that their effective capacity is large enough to memorize any arbitrary mapping from inputs to outputs.
The key findings of this work include:

- Standard architectures reached zero training error on completely random labels, and also on images whose pixels were replaced with random noise.
- Explicit regularizers such as weight decay, dropout, and data augmentation were neither necessary nor sufficient to explain why the same networks generalize on real labels.
- Fitting random labels required only modestly more training time than fitting the true labels.
This paper, which received the Best Paper Award at ICLR 2017, fundamentally challenged the classical view that overfitting should be the default outcome for overparameterized models. If a network can memorize random data just as easily as structured data, then the architecture and optimizer together must implicitly favor solutions that generalize, even without explicit regularization. The question of exactly how and why neural networks generalize despite their enormous capacity remains an active area of research.
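The core observation can be reproduced in miniature without a deep network: any model flexible enough to interpolate will fit shuffled labels perfectly while generalizing at chance level. The sketch below (scikit-learn on synthetic data; settings are illustrative assumptions) demonstrates this with a decision tree, as an analogy to the paper's experiment rather than a reproduction of it.

```python
# Sketch: a flexible model reaching zero training error on randomly shuffled labels,
# in the spirit of Zhang et al. (2017). An illustrative analogy, not their experiment.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
y_random = rng.permutation(y)                              # destroy any relationship between X and labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y_random, test_size=0.5, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print(f"training accuracy on random labels: {model.score(X_tr, y_tr):.2f}")   # 1.00: pure memorization
print(f"test accuracy on random labels:     {model.score(X_te, y_te):.2f}")   # ~0.50: chance level
```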
The classical understanding of overfitting assumes a U-shaped relationship between model complexity and test error: as complexity increases, test error first decreases and then increases. However, research from 2019-2020 revealed a more nuanced picture called double descent.
Belkin et al. (2019) first documented model-wise double descent in their paper "Reconciling Modern Machine-Learning Practice and the Classical Bias-Variance Trade-off." They showed that for many models, including neural networks, random forests, and kernel methods, test error follows a double descent curve:

- In the underparameterized regime, test error traces the classical U shape: it first decreases and then increases as model capacity grows.
- Test error peaks at the interpolation threshold, the point where the model is just barely able to fit the training data exactly.
- Beyond the threshold, in the overparameterized regime, test error decreases a second time as capacity continues to grow.
This second descent explains why modern deep learning models with far more parameters than training examples can generalize well, contradicting classical statistical wisdom.
Nakkiran et al. (2020) from OpenAI extended this finding to training dynamics, showing that epoch-wise double descent occurs as well. For a fixed model size, as training proceeds:

- Test error first decreases as the model learns useful patterns.
- It then rises as the model begins to overfit the training data.
- With continued training, test error decreases a second time.
This means that the conventional practice of stopping training when validation error starts to rise may actually be premature. In some cases, training through the overfitting region leads to better generalization on the other side.
| Double descent type | What varies | Observation |
|---|---|---|
| Model-wise | Number of parameters (model size) | Test error decreases, spikes at interpolation threshold, then decreases again |
| Epoch-wise | Number of training epochs | Test error decreases, rises (overfitting), then decreases again with further training |
| Sample-wise | Number of training samples | Adding more data can temporarily hurt test performance near the interpolation threshold |
Double descent is most pronounced when:

- The model size or training duration is near the interpolation threshold.
- The training labels contain noise.
- Little or no explicit regularization is applied.
For modern large language models with billions of parameters trained on trillions of tokens, double descent is less of a concern because these models are deep in the overparameterized regime. The interpolation threshold is far below the actual model size.
Grokking is a phenomenon first described by Power et al. (2022) in their paper "Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets." In grokking, a model first memorizes the training data (achieving perfect training accuracy but random-level test accuracy) and then, after a prolonged period of additional training, suddenly transitions to perfect generalization.
The key feature of grokking is the delay between memorization and generalization. Unlike standard training where generalization follows closely behind memorization, grokking involves a gap that can span thousands or even millions of additional training steps. The model appears to have fully overfit, yet continued training eventually leads to a sudden, sharp improvement in test performance.
Grokking has been observed primarily in:

- Small, synthetic algorithmic datasets, such as modular arithmetic and other group operations.
- Small transformer and MLP models trained far beyond the point of perfect training accuracy.
- Settings that use weight decay or similar regularization, which appears important for the transition to occur.
Research by Nanda et al. (2023) used mechanistic interpretability to show that during grokking, the model transitions from memorization-based circuits (lookup tables) to algorithmic circuits (general computation). Weight decay slowly penalizes the large weights needed for memorization, eventually causing the model to discover more efficient, generalizing solutions.
Grokking raises important questions about when to stop training. Early stopping based on validation loss would halt training before grokking occurs, preventing the model from discovering the generalizing solution. However, grokking has been primarily observed in small, structured datasets, and it remains unclear how often it occurs in practical, large-scale training scenarios.
Benign overfitting is a phenomenon where a model perfectly interpolates (memorizes) the training data, including noise, yet still generalizes well to unseen data. This contradicts the classical expectation that memorizing noisy data leads to poor generalization.
Bartlett et al. (2020) provided the first theoretical analysis of benign overfitting in their paper "Benign Overfitting in Linear Regression." They showed that in high-dimensional linear regression, the minimum-norm interpolating solution can achieve near-optimal test error even when the training data is noisy, provided the data distribution satisfies certain conditions related to the eigenvalue spectrum of the covariance matrix.
The key insight is that in high-dimensional spaces, noise gets "spread out" across many dimensions. When the model interpolates the data, it assigns a small amount of weight to each of many noise dimensions, rather than a large amount to a few. The noise contribution to predictions on new data averages out, resulting in good generalization despite perfect training set memorization.
Benign overfitting is most likely when:

- The number of parameters (or input dimensions) is much larger than the number of training examples.
- The data covariance has many directions of small variance across which the noise can be spread.
- The learning algorithm selects a low-norm (for example, minimum-norm) solution among all interpolating solutions.
| Concept | Classical expectation | Actual behavior in overparameterized models |
|---|---|---|
| Memorizing training data | Leads to poor generalization | Can still generalize well (benign overfitting) |
| Adding more parameters beyond needed | Increases test error | Can decrease test error (double descent) |
| Training past overfitting | Wastes compute, hurts model | Can lead to sudden generalization (grokking) |
| Zero training loss with noisy data | Model has learned noise | Noise contribution averages out in high dimensions |
These phenomena collectively challenge the traditional view that overfitting is always harmful and that simpler models are always better. In the modern deep learning regime:

- Heavily overparameterized models can reach zero training loss and still generalize well.
- Implicit regularization from the optimizer, architecture, and training procedure often matters as much as explicit penalties.
- Classical complexity measures such as raw parameter count are poor predictors of generalization on their own.
Imagine you are trying to learn about different types of animals by looking at pictures in a book. Overfitting is like memorizing every tiny detail in each picture, like the color of a leaf behind the animal or a small scratch on the page. This would make it hard for you to recognize the same animal in a different picture because you focused too much on the little details that don't matter.
Here is another way to think about it. Suppose your teacher shows you five cats and five dogs. All the cats happen to be sitting on blue blankets, and all the dogs happen to be on red blankets. If you "overfit," you learn the rule "blue blanket means cat." This works perfectly for the pictures you have seen, but when someone shows you a cat on a green couch, you get confused because you learned the wrong pattern.
In machine learning, overfitting is when a model learns the training data so well that it gets confused when it sees new data because it paid too much attention to the unimportant details. To avoid overfitting, we can use different techniques (like regularization, dropout, and getting more training data) to help the model focus on the important patterns and not get distracted by the small, unimportant details.