See also: Machine learning terms
Overfitting is a phenomenon that occurs in machine learning when a model becomes excessively tailored to the training set, resulting in a decrease in its generalization performance on unseen data. In essence, the model learns the noise and peculiarities present in the training data, which negatively impacts its ability to make accurate predictions for new, unseen data. Overfitting is a common challenge in machine learning and can lead to models with poor predictive performance and reduced utility in real-world applications.
A model that overfits has low training loss but high validation or test set loss. The gap between training performance and test performance is the hallmark of overfitting. In statistical terms, the model has high variance and low bias, meaning it is highly sensitive to the specific training examples it has seen.
Formally, consider a function $f$ learned from a training set $D_{\text{train}}$. The model overfits when its error on $D_{\text{train}}$ is substantially lower than its error on unseen data drawn from the same distribution. That is, $E_{\text{train}}(f) \ll E_{\text{test}}(f)$, where the difference $E_{\text{test}}(f) - E_{\text{train}}(f)$ is called the generalization gap.
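As a concrete illustration, the generalization gap can be measured directly by comparing error on the training set with error on held-out data. The following sketch (scikit-learn on a synthetic dataset; the dataset and model choice are illustrative assumptions, not part of the original text) fits a deliberately flexible model and reports both errors.

```python
# Minimal sketch: measuring the generalization gap E_test(f) - E_train(f).
# The dataset and model choice are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import zero_one_loss

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# An unconstrained decision tree is flexible enough to fit the training set almost exactly.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

e_train = zero_one_loss(y_train, model.predict(X_train))
e_test = zero_one_loss(y_test, model.predict(X_test))
print(f"E_train = {e_train:.3f}, E_test = {e_test:.3f}, gap = {e_test - e_train:.3f}")
```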
The bias-variance tradeoff is one of the most fundamental concepts in supervised learning and is directly connected to overfitting. Every prediction error a model makes can be decomposed into three components: bias, variance, and irreducible error (noise).
Bias measures the error introduced by approximating a complex real-world problem with a simplified model. High bias means the model makes strong assumptions about the data and misses important patterns. This leads to underfitting.
Variance measures how much the model's predictions change when trained on different subsets of the data. High variance means the model is overly sensitive to the specific training set and captures noise as if it were signal. This leads to overfitting.
The irreducible error represents noise inherent in the data that no model can eliminate, regardless of complexity.
These components relate through the equation:
$\text{Total Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}$
As model complexity increases, bias generally decreases (the model can capture more patterns) while variance increases (the model becomes more sensitive to training data). The optimal model complexity sits at the point where the sum of bias and variance is minimized. Past this point, additional complexity only increases test error through overfitting, even though training error keeps dropping.
| Component | When low | When high |
|---|---|---|
| Bias | Model captures true patterns well | Model is too simple, misses patterns (underfitting) |
| Variance | Predictions are stable across datasets | Predictions change wildly with different training data (overfitting) |
| Total error | Good generalization | Poor performance on new data |
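The decomposition can be illustrated empirically by repeatedly fitting the same model class on different resampled training sets and measuring how much its predictions move around. The following sketch is a simplified Monte Carlo estimate on a synthetic 1-D regression problem; the true function, noise level, and polynomial degrees are illustrative assumptions.

```python
# Sketch: Monte Carlo estimate of bias^2 and variance for two model classes
# on a synthetic 1-D problem. The true function and noise level are assumptions.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
true_f = lambda x: np.sin(2 * np.pi * x)          # the unknown "true" pattern
x_grid = np.linspace(0, 1, 50).reshape(-1, 1)     # points where predictions are evaluated

def simulate(degree, n_datasets=200, n_samples=30, noise=0.3):
    preds = np.empty((n_datasets, len(x_grid)))
    for i in range(n_datasets):
        x = rng.uniform(0, 1, n_samples).reshape(-1, 1)
        y = true_f(x).ravel() + rng.normal(0, noise, n_samples)
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        preds[i] = model.fit(x, y).predict(x_grid)
    bias_sq = np.mean((preds.mean(axis=0) - true_f(x_grid).ravel()) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

for degree in (1, 15):                            # simple vs. very flexible model
    b2, var = simulate(degree)
    print(f"degree={degree:>2}  bias^2={b2:.3f}  variance={var:.3f}")
```

The low-degree model shows high bias and low variance (underfitting), while the high-degree model shows the opposite pattern (overfitting).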
Several indicators suggest a model is overfitting:

- Training loss (or error) is substantially lower than validation or test loss.
- Validation loss stops improving and begins to rise while training loss keeps falling.
- Performance drops noticeably when the model is evaluated on new data or deployed.
- Predictions change significantly when the model is retrained on a slightly different sample of the data.
Overfitting can be caused by a variety of factors, including:
Model complexity: A complex model with many parameters may be more prone to overfitting, as it has the capacity to fit the training data too closely. A neural network with millions of parameters can memorize a small dataset rather than learning generalizable features. This is sometimes described as the model having too many degrees of freedom relative to the amount of training data.
Insufficient training data: A small training dataset increases the likelihood of overfitting, as the model may not have enough information to learn the underlying patterns in the data. This can lead to a model that is overly influenced by random noise or unique features of the training data. As a rough guideline, the number of training examples should be several times larger than the number of model parameters, though this ratio varies depending on the task.
Noise in the data: Noisy or corrupted data can exacerbate overfitting, as the model may learn to fit the noise rather than the true underlying patterns in the data. Label noise (incorrect labels on training examples) is particularly harmful because the model tries to accommodate contradictory information.
Training for too many epochs: Continuing to train beyond the point where validation performance peaks allows the model to progressively memorize training data. Each additional epoch past this point typically widens the gap between training and validation performance.
Inappropriate feature engineering: Including too many features, especially noisy or irrelevant ones, gives the model more dimensions in which to overfit. This is sometimes called the "curse of dimensionality," where the feature space becomes so large relative to the number of samples that the model can find spurious patterns.
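The interaction between model complexity and data size can be seen directly. The sketch below (synthetic data; the sample size and range of polynomial degrees are illustrative assumptions) fits polynomials of increasing degree to a small noisy sample: training error keeps falling while held-out error eventually rises.

```python
# Sketch: overfitting driven by model complexity on a small, noisy dataset.
# Data generation and the range of polynomial degrees are illustrative assumptions.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
x_train = rng.uniform(0, 1, 20).reshape(-1, 1)               # very little data
y_train = np.sin(2 * np.pi * x_train).ravel() + rng.normal(0, 0.2, 20)
x_test = rng.uniform(0, 1, 200).reshape(-1, 1)
y_test = np.sin(2 * np.pi * x_test).ravel() + rng.normal(0, 0.2, 200)

for degree in (1, 3, 9, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    mse_train = mean_squared_error(y_train, model.predict(x_train))
    mse_test = mean_squared_error(y_test, model.predict(x_test))
    print(f"degree={degree:>2}  train MSE={mse_train:.3f}  test MSE={mse_test:.3f}")
```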
Overfitting and underfitting sit on opposite ends of the model complexity spectrum. Understanding both is necessary for building models that generalize well.
| Aspect | Underfitting | Overfitting |
|---|---|---|
| Also known as | High bias | High variance |
| Training performance | Poor | Excellent |
| Validation/test performance | Poor | Poor |
| Model complexity | Too low for the task | Too high for the data available |
| Typical cause | Model is too simple; not enough features or capacity | Model is too complex; too many parameters relative to data |
| Training-validation gap | Small (both are bad) | Large (training is good, validation is bad) |
| How to fix | Increase model capacity, add features, train longer | Add regularization, get more data, simplify model |
| Example | Linear model on highly nonlinear data | Deep network memorizing 100 training examples |
The goal is to find the sweet spot between these two extremes, where the model is complex enough to capture the true underlying patterns but not so complex that it memorizes noise.
Overfitting manifests differently depending on the type of model being used. The following table summarizes how overfitting behaves across common model families and what specific remedies apply to each.
| Model type | How overfitting occurs | Typical remedies |
|---|---|---|
| Linear models (linear regression, logistic regression) | When too many features are included (especially with collinear or irrelevant features), the model assigns large coefficients to fit noise in the training data. Polynomial regression with high-degree terms is a classic example. | L1 regularization (Lasso) for feature selection, L2 regularization (Ridge) for coefficient shrinkage, elastic net, removing irrelevant features |
| Decision trees | An unpruned tree will grow until every training example is correctly classified, creating branches for individual outliers and noise. Deep trees have low bias but very high variance. | Pruning (pre-pruning by limiting depth or minimum samples per leaf; post-pruning by removing branches that do not improve validation performance), using ensemble methods like random forests |
| Neural networks | Large networks with millions or billions of parameters can memorize entire training sets. Overfitting in neural networks often shows up as a steadily widening gap between training and validation loss curves. | Dropout, weight decay, early stopping, batch normalization, data augmentation, reducing layer count or width |
| k-Nearest Neighbors (k-NN) | With k=1, the model memorizes the training data perfectly, since each prediction is based on the single closest training example. Small values of k create complex, jagged decision boundaries that are highly sensitive to noise. | Increasing k (larger k values smooth predictions), using distance-weighted voting, feature scaling, dimensionality reduction |
| Support vector machines | Overfitting occurs when the kernel maps data into an excessively high-dimensional space, or when the regularization parameter C is too large, penalizing margin violations so heavily that the decision boundary bends to fit individual training points. | Tuning the C parameter, choosing an appropriate kernel and kernel parameters, cross-validation for hyperparameter selection |
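The effects described in the decision tree and k-NN rows can be reproduced in a few lines. The sketch below (scikit-learn on a synthetic dataset with injected label noise; all settings are illustrative assumptions) compares an unconstrained tree with a depth-limited one, and k-NN with k=1 versus a larger k.

```python
# Sketch: how two model-specific remedies from the table affect overfitting.
# Dataset and hyperparameter values are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, n_features=20, flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

models = {
    "tree, unlimited depth": DecisionTreeClassifier(random_state=0),
    "tree, max_depth=3": DecisionTreeClassifier(max_depth=3, random_state=0),
    "k-NN, k=1": KNeighborsClassifier(n_neighbors=1),
    "k-NN, k=15": KNeighborsClassifier(n_neighbors=15),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(f"{name:22s} train={model.score(X_tr, y_tr):.2f}  test={model.score(X_te, y_te):.2f}")
```

The unconstrained tree and k=1 typically show near-perfect training accuracy with a large drop on the test set, while the constrained variants trade a little training accuracy for a smaller gap.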
Several techniques can be employed to prevent or mitigate overfitting in machine learning models:
| Technique | How it works | When to use |
|---|---|---|
| Regularization (L1/L2) | Adds a penalty on the magnitude of model weights to the loss function, discouraging overly complex models | Linear models, neural networks; L1 for feature selection, L2 for weight shrinkage |
| Elastic net | Combines L1 and L2 penalties, balancing feature selection with coefficient shrinkage | When features are correlated; when p >> n (more features than samples) |
| Dropout | Randomly deactivates a fraction of neurons during each training step, preventing co-adaptation | Deep learning models, especially fully connected layers |
| Early stopping | Monitors validation performance during training and stops when it begins to degrade | Almost any iterative training process; requires a validation set |
| Data augmentation | Artificially expands the training set by applying transformations (rotations, flips, crops, noise) to existing examples | Computer vision, NLP, audio tasks where transformed data remains valid |
| Cross-validation | Divides data into multiple folds, trains on different combinations, and averages results to get a more reliable estimate of generalization | Model selection and hyperparameter tuning, especially with limited data |
| Ensemble methods | Combines predictions from multiple models (e.g., random forests, bagging, boosting) to reduce variance | When single models show high variance; bagging specifically reduces variance |
| Batch normalization | Normalizes activations within each layer, which has a mild regularizing effect and stabilizes training | Deep neural networks; often used alongside dropout |
| Pruning | Removes unnecessary parameters, branches, or neurons from a trained model, reducing its effective complexity | Decision trees (branch pruning), neural networks (weight pruning) |
| Reducing model size | Uses fewer layers, fewer neurons per layer, or fewer parameters to limit the model's capacity to memorize | When the model is clearly too large for the dataset |
| Getting more data | Increases the training set size so the model has more examples to learn general patterns from | Whenever possible; the most reliable way to reduce overfitting |
| Weight decay | Shrinks weights toward zero at each update step, functionally similar to L2 regularization | Standard practice in deep learning optimizers like AdamW |
Regularization techniques add a penalty term to the model's loss function, which discourages the model from becoming overly complex.
L1 regularization (also called Lasso) adds the sum of absolute values of the weights to the loss: $L_{\text{total}} = L_{\text{data}} + \lambda \sum |w_i|$. This drives some weights to exactly zero, performing automatic feature selection. L1 is especially useful when many features are irrelevant or redundant.
L2 regularization (also called Ridge) adds the sum of squared weights: $L_{\text{total}} = L_{\text{data}} + \lambda \sum w_i^2$. This shrinks all weights toward zero without eliminating any entirely. L2 is effective at preventing any single weight from growing too large.
Elastic net, introduced by Zou and Hastie (2005), combines both penalties: $L_{\text{total}} = L_{\text{data}} + \lambda_1 \sum |w_i| + \lambda_2 \sum w_i^2$. This approach provides the feature selection benefits of L1 while retaining L2's ability to handle correlated features. When predictors are strongly correlated, L1 alone tends to arbitrarily select one and drop the others, while elastic net selects correlated predictors together and assigns them similar coefficients.
The hyperparameter $\lambda$ (or $\alpha$ in some implementations) controls the strength of regularization. A larger value produces a simpler model with more regularization, while a smaller value allows the model to fit the training data more closely.
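In practice these penalties are usually applied through off-the-shelf estimators rather than hand-written loss terms. The sketch below (scikit-learn on synthetic data with many irrelevant features; the alpha values and data dimensions are illustrative assumptions) compares ordinary least squares with Ridge, Lasso, and elastic net.

```python
# Sketch: L1, L2, and elastic net regularization with scikit-learn.
# The dataset and regularization strengths (alpha) are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet

# Many features, few of them informative, and relatively few samples.
X, y = make_regression(n_samples=100, n_features=80, n_informative=10,
                       noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

models = {
    "OLS (no penalty)": LinearRegression(),
    "Ridge (L2)": Ridge(alpha=1.0),
    "Lasso (L1)": Lasso(alpha=1.0),
    "Elastic net": ElasticNet(alpha=1.0, l1_ratio=0.5),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    n_zero = int(np.sum(np.isclose(model.coef_, 0.0)))
    print(f"{name:18s} test R^2={model.score(X_te, y_te):.2f}  zeroed coefficients={n_zero}")
```

Lasso and elastic net zero out many of the irrelevant coefficients, while Ridge shrinks all of them without eliminating any.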
Dropout, proposed by Srivastava et al. (2014), randomly sets a fraction of neuron activations to zero during each training step. A typical dropout rate is 0.2 to 0.5, meaning 20% to 50% of neurons are deactivated on each forward pass. This prevents neurons from co-adapting: since any neuron might be absent on a given training step, the network is forced to learn redundant representations that do not depend on specific neurons. At inference time, all neurons are active, and their outputs are scaled to account for the missing activations during training. Dropout can be interpreted as training an implicit ensemble of exponentially many sub-networks, which is why it reduces variance.
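A minimal way to see the mechanism is the "inverted dropout" formulation used in most modern frameworks, where the scaling happens during training so that no adjustment is needed at inference. The NumPy sketch below is illustrative only; real implementations (e.g., `torch.nn.Dropout`) handle this inside the framework.

```python
# Sketch: inverted dropout on a layer's activations (NumPy, illustrative only).
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, rate=0.5, training=True):
    """Randomly zero a fraction `rate` of activations during training.

    With inverted dropout, surviving activations are scaled by 1/(1-rate)
    so the expected activation is unchanged and inference needs no rescaling.
    """
    if not training or rate == 0.0:
        return activations
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

h = np.ones((2, 8))                           # stand-in for hidden-layer activations
print(dropout(h, rate=0.5, training=True))    # roughly half the units zeroed, survivors scaled to 2.0
print(dropout(h, rate=0.5, training=False))   # unchanged at inference time
```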
Early stopping involves monitoring the model's performance on a validation dataset during the training process and stopping training when the performance starts to degrade. In practice, a "patience" parameter specifies how many epochs of no improvement to wait before halting. This prevents the model from overfitting the training data by limiting the total number of training updates. Early stopping is one of the simplest and most widely used regularization techniques, applicable to virtually any model trained with iterative optimization.
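The logic is simple enough to implement by hand. The sketch below runs early stopping with a patience counter on a toy full-batch gradient-descent problem; the synthetic task and all hyperparameters are illustrative assumptions.

```python
# Sketch: early stopping with a patience counter on a toy gradient-descent loop.
# The synthetic task and all hyperparameters are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_features = 50
X_train = rng.normal(size=(40, n_features))        # few samples, many features: easy to overfit
true_w = np.zeros(n_features)
true_w[:5] = 1.0                                   # only five features actually matter
y_train = X_train @ true_w + rng.normal(0, 0.5, 40)
X_val = rng.normal(size=(200, n_features))
y_val = X_val @ true_w + rng.normal(0, 0.5, 200)

w = np.zeros(n_features)
lr, patience = 0.01, 10
best_val, best_w, bad_epochs = np.inf, w.copy(), 0

for epoch in range(2000):
    grad = 2 * X_train.T @ (X_train @ w - y_train) / len(y_train)
    w -= lr * grad                                 # one full-batch gradient step
    val_loss = np.mean((X_val @ w - y_val) ** 2)
    if val_loss < best_val:
        best_val, best_w, bad_epochs = val_loss, w.copy(), 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                 # no improvement for `patience` consecutive epochs
            print(f"stopped at epoch {epoch}")
            break

w = best_w                                         # restore the best checkpoint, not the last one
print(f"best validation MSE: {best_val:.3f}")
```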
Cross-validation involves dividing the dataset into multiple subsets and training the model on different combinations of these subsets. This helps assess the model's generalization performance and can be used to identify overfitting. The most common form is k-fold cross-validation, where the data is split into k equally sized folds. The model trains k times, each time holding out a different fold for validation. The average performance across all folds gives a more reliable estimate of how the model will perform on unseen data.
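With scikit-learn this amounts to a single call. The sketch below (synthetic data, 5 folds; all settings are illustrative assumptions) also shows how comparing the cross-validated score with the training-set score exposes overfitting.

```python
# Sketch: 5-fold cross-validation as a more reliable estimate of generalization.
# Dataset and model settings are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, flip_y=0.1, random_state=0)

model = DecisionTreeClassifier(random_state=0)
cv_scores = cross_val_score(model, X, y, cv=5)              # 5-fold cross-validation
train_score = model.fit(X, y).score(X, y)                   # accuracy on the data it was trained on

print(f"training accuracy:        {train_score:.2f}")        # typically near 1.0 for an unpruned tree
print(f"cross-validated accuracy: {cv_scores.mean():.2f} +/- {cv_scores.std():.2f}")
```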
Data augmentation artificially increases the size and diversity of the training set by applying transformations to existing examples. In computer vision, common augmentations include random rotations, horizontal flips, cropping, color jittering, and adding Gaussian noise. In natural language processing, augmentation techniques include synonym replacement, back-translation, and random insertion or deletion of words. By presenting the model with varied versions of the same underlying data, augmentation forces it to learn more robust, generalizable features rather than memorizing specific details of the training examples.
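In the image domain this is typically a pipeline of random transforms applied on the fly during training. The sketch below uses torchvision's transforms API, a common choice; the specific transforms and their parameters are illustrative assumptions.

```python
# Sketch: a typical image augmentation pipeline with torchvision.
# The specific transforms and their parameters are illustrative assumptions.
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),                     # random crop, rescaled to 224x224
    transforms.RandomHorizontalFlip(p=0.5),                # flip half of the images
    transforms.RandomRotation(degrees=15),                 # small random rotations
    transforms.ColorJitter(brightness=0.2, contrast=0.2),  # mild color perturbation
    transforms.ToTensor(),
])

# Validation and test data are NOT augmented; only deterministic preprocessing is applied.
eval_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
```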
Pruning reduces model complexity by removing unnecessary components after training. For decision trees, pruning cuts branches that add little predictive value on validation data, preventing the tree from fitting noise in the training set. For neural networks, pruning involves removing weights, neurons, or entire layers with small magnitudes, under the assumption that low-magnitude parameters contribute little to the model's output. The lottery ticket hypothesis (Frankle and Carbin, 2019) demonstrated that dense neural networks contain sparse subnetworks ("winning tickets") that can match or exceed the full network's accuracy when trained in isolation. These subnetworks are typically 10-20% of the original size for architectures tested on MNIST and CIFAR-10. Pruning reduces overfitting by limiting the effective number of parameters while also improving inference efficiency.
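For neural networks, magnitude-based weight pruning is available off the shelf. The PyTorch sketch below removes the 30% of weights with the smallest absolute value from a single linear layer; the layer sizes and pruning amount are illustrative assumptions.

```python
# Sketch: magnitude-based weight pruning of a single layer with PyTorch.
# Layer sizes and the pruning amount are illustrative assumptions.
import torch
import torch.nn.utils.prune as prune

layer = torch.nn.Linear(128, 64)

# Zero out the 30% of weights with the smallest absolute value (L1 magnitude).
prune.l1_unstructured(layer, name="weight", amount=0.3)

sparsity = float((layer.weight == 0).float().mean())
print(f"fraction of weights set to zero: {sparsity:.2f}")   # ~0.30

# Make the pruning permanent (removes the mask and re-parametrization).
prune.remove(layer, "weight")
```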
Choosing an appropriate model with the right level of complexity for the given problem can help prevent overfitting. Models that are too complex may overfit the data, while models that are too simple may underfit the data. Techniques like cross-validation and information criteria (AIC, BIC) help compare models of different complexities on a fair basis.
Learning curves are a practical diagnostic tool for detecting overfitting (and underfitting). They plot model performance (loss or accuracy) on the y-axis against either the number of training epochs or the training set size on the x-axis.
Loss vs. epochs curves: In a well-fitting model, both the training loss and the validation loss decrease together and converge to similar values. When overfitting occurs, the training loss continues to decrease while the validation loss reaches a minimum and then begins to increase. The point where the validation loss starts rising is the optimal stopping point. The gap between the two curves grows wider as overfitting gets worse.
Performance vs. training set size curves: When plotted against training set size, a model that overfits will show high training performance but low validation performance at small dataset sizes. As the dataset grows, the training performance typically drops slightly (it becomes harder to memorize more data) while the validation performance improves. The two curves converge as the dataset becomes large enough for the model's complexity. If the curves have converged and performance is still poor, the model is underfitting and needs more capacity.
Learning curves help practitioners decide whether to collect more data (if the gap is still large and the curves have not converged), simplify the model (if overfitting is severe), or increase model complexity (if both curves have converged at poor performance).
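scikit-learn can generate the performance-versus-training-set-size curve directly. The sketch below (synthetic data; the model, fold count, and grid of training sizes are illustrative assumptions) computes the two curves whose gap is the diagnostic described above.

```python
# Sketch: computing a performance-vs-training-set-size learning curve.
# Dataset, model, and the grid of training sizes are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.1, random_state=0)

train_sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5,
)

for n, tr, va in zip(train_sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    # A large, persistent gap between the two columns indicates overfitting.
    print(f"n_train={n:4d}  train acc={tr:.2f}  validation acc={va:.2f}")
```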
The following step-by-step approach helps practitioners identify and resolve overfitting in a systematic way:

1. Confirm the diagnosis by plotting learning curves and comparing training and validation performance; a large, growing gap indicates overfitting.
2. Check the data: look for label noise, duplicated or leaked examples, and whether the training set is simply too small for the model's capacity.
3. Apply inexpensive remedies first, such as early stopping and standard regularization (L1/L2, weight decay, dropout).
4. If the gap persists, reduce model capacity, add data augmentation, or collect more training data.
5. Re-evaluate with cross-validation to confirm that the changes actually improved generalization.
A landmark study by Zhang et al. (2017), titled "Understanding Deep Learning Requires Rethinking Generalization," demonstrated that standard deep neural networks have enough capacity to perfectly memorize entirely random labels. The authors trained state-of-the-art convolutional networks on image classification tasks where the true labels were replaced with random assignments. The networks achieved zero training error on these random labels, proving that their effective capacity is large enough to memorize any arbitrary mapping from inputs to outputs.
The key findings of this work include:

- Standard architectures reached zero training error on completely random labels, and also on images whose pixels were replaced with random noise.
- Explicit regularizers such as weight decay, dropout, and data augmentation were neither necessary nor sufficient to explain why the same networks generalize on real labels.
- Fitting random labels required only modestly more training time than fitting the true labels.
This paper, which received the Best Paper Award at ICLR 2017, fundamentally challenged the classical view that overfitting should be the default outcome for overparameterized models. If a network can memorize random data just as easily as structured data, then the architecture and optimizer together must implicitly favor solutions that generalize, even without explicit regularization. The question of exactly how and why neural networks generalize despite their enormous capacity remains an active area of research.
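The core observation can be reproduced in miniature without a deep network: any model flexible enough to interpolate will fit shuffled labels perfectly while generalizing at chance level. The sketch below (scikit-learn on synthetic data; settings are illustrative assumptions) demonstrates this with a decision tree, as an analogy to the paper's experiment rather than a reproduction of it.

```python
# Sketch: a flexible model reaching zero training error on randomly shuffled labels,
# in the spirit of Zhang et al. (2017). An illustrative analogy, not their experiment.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
y_random = rng.permutation(y)                              # destroy any relationship between X and labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y_random, test_size=0.5, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print(f"training accuracy on random labels: {model.score(X_tr, y_tr):.2f}")   # 1.00: pure memorization
print(f"test accuracy on random labels:     {model.score(X_te, y_te):.2f}")   # ~0.50: chance level
```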
The classical understanding of overfitting assumes a U-shaped relationship between model complexity and test error: as complexity increases, test error first decreases and then increases. However, research from 2019-2020 revealed a more nuanced picture called double descent.
Belkin et al. (2019) first documented model-wise double descent in their paper "Reconciling Modern Machine-Learning Practice and the Classical Bias-Variance Trade-off." They showed that for many models, including neural networks, random forests, and kernel methods, test error follows a double descent curve:

- In the underparameterized regime, test error traces the classical U shape: it first decreases and then increases as model capacity grows.
- Test error peaks at the interpolation threshold, the point where the model is just barely able to fit the training data exactly.
- Beyond the threshold, in the overparameterized regime, test error decreases a second time as capacity continues to grow.
This second descent explains why modern deep learning models with far more parameters than training examples can generalize well, contradicting classical statistical wisdom.
Nakkiran et al. (2020) from OpenAI extended this finding to training dynamics, showing that epoch-wise double descent occurs as well. For a fixed model size, as training proceeds:

- Test error first decreases as the model learns useful patterns.
- It then rises as the model begins to overfit the training data.
- With continued training, test error decreases a second time.
This means that the conventional practice of stopping training when validation error starts to rise may actually be premature. In some cases, training through the overfitting region leads to better generalization on the other side.
| Double descent type | What varies | Observation |
|---|---|---|
| Model-wise | Number of parameters (model size) | Test error decreases, spikes at interpolation threshold, then decreases again |
| Epoch-wise | Number of training epochs | Test error decreases, rises (overfitting), then decreases again with further training |
| Sample-wise | Number of training samples | Adding more data can temporarily hurt test performance near the interpolation threshold |
Double descent is most pronounced when:

- The model size or training duration is near the interpolation threshold.
- The training labels contain noise.
- Little or no explicit regularization is applied.
For modern large language models with billions of parameters trained on trillions of tokens, double descent is less of a concern because these models are deep in the overparameterized regime. The interpolation threshold is far below the actual model size.
Grokking is a phenomenon first described by Power et al. (2022) in their paper "Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets." In grokking, a model first memorizes the training data (achieving perfect training accuracy but random-level test accuracy) and then, after a prolonged period of additional training, suddenly transitions to perfect generalization.
The key feature of grokking is the delay between memorization and generalization. Unlike standard training where generalization follows closely behind memorization, grokking involves a gap that can span thousands or even millions of additional training steps. The model appears to have fully overfit, yet continued training eventually leads to a sudden, sharp improvement in test performance.
Grokking has been observed primarily in:

- Small, synthetic algorithmic datasets, such as modular arithmetic and other group operations.
- Small transformer and MLP models trained far beyond the point of perfect training accuracy.
- Settings that use weight decay or similar regularization, which appears important for the transition to occur.
Research by Nanda et al. (2023) used mechanistic interpretability to show that during grokking, the model transitions from memorization-based circuits (lookup tables) to algorithmic circuits (general computation). Weight decay slowly penalizes the large weights needed for memorization, eventually causing the model to discover more efficient, generalizing solutions.
Grokking raises important questions about when to stop training. Early stopping based on validation loss would halt training before grokking occurs, preventing the model from discovering the generalizing solution. However, grokking has been primarily observed in small, structured datasets, and it remains unclear how often it occurs in practical, large-scale training scenarios.
Benign overfitting is a phenomenon where a model perfectly interpolates (memorizes) the training data, including noise, yet still generalizes well to unseen data. This contradicts the classical expectation that memorizing noisy data leads to poor generalization.
Bartlett et al. (2020) provided the first theoretical analysis of benign overfitting in their paper "Benign Overfitting in Linear Regression." They showed that in high-dimensional linear regression, the minimum-norm interpolating solution can achieve near-optimal test error even when the training data is noisy, provided the data distribution satisfies certain conditions related to the eigenvalue spectrum of the covariance matrix.
The key insight is that in high-dimensional spaces, noise gets "spread out" across many dimensions. When the model interpolates the data, it assigns a small amount of weight to each of many noise dimensions, rather than a large amount to a few. The noise contribution to predictions on new data averages out, resulting in good generalization despite perfect training set memorization.
Benign overfitting is most likely when:

- The number of parameters (or input dimensions) is much larger than the number of training examples.
- The data covariance has many directions of small variance across which the noise can be spread.
- The learning algorithm selects a low-norm (for example, minimum-norm) solution among all interpolating solutions.
| Concept | Classical expectation | Actual behavior in overparameterized models |
|---|---|---|
| Memorizing training data | Leads to poor generalization | Can still generalize well (benign overfitting) |
| Adding more parameters beyond needed | Increases test error | Can decrease test error (double descent) |
| Training past overfitting | Wastes compute, hurts model | Can lead to sudden generalization (grokking) |
| Zero training loss with noisy data | Model has learned noise | Noise contribution averages out in high dimensions |
These phenomena collectively challenge the traditional view that overfitting is always harmful and that simpler models are always better. In the modern deep learning regime:

- Heavily overparameterized models can reach zero training loss and still generalize well.
- Implicit regularization from the optimizer, architecture, and training procedure often matters as much as explicit penalties.
- Classical complexity measures such as raw parameter count are poor predictors of generalization on their own.
Imagine you are trying to learn about different types of animals by looking at pictures in a book. Overfitting is like memorizing every tiny detail in each picture, like the color of a leaf behind the animal or a small scratch on the page. This would make it hard for you to recognize the same animal in a different picture because you focused too much on the little details that don't matter.
Here is another way to think about it. Suppose your teacher shows you five cats and five dogs. All the cats happen to be sitting on blue blankets, and all the dogs happen to be on red blankets. If you "overfit," you learn the rule "blue blanket means cat." This works perfectly for the pictures you have seen, but when someone shows you a cat on a green couch, you get confused because you learned the wrong pattern.
In machine learning, overfitting is when a model learns the training data so well that it gets confused when it sees new data because it paid too much attention to the unimportant details. To avoid overfitting, we can use different techniques (like regularization, dropout, and getting more training data) to help the model focus on the important patterns and not get distracted by the small, unimportant details.