# Generalization

> Source: https://aiwiki.ai/wiki/generalization
> Updated: 2026-06-21
> Categories: Deep Learning, Machine Learning, Model Evaluation
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

*See also: [Machine learning terms](/wiki/machine_learning_terms), [Bias-variance tradeoff](/wiki/bias_variance_tradeoff)*

## What is generalization in machine learning?

Generalization in [machine learning](/wiki/machine_learning) is the ability of a trained model to perform accurately on new, unseen data drawn from the same distribution as its training set, rather than merely reproducing the examples it was trained on. A model that generalizes well has learned the true underlying patterns in the data instead of memorizing specific training examples. Generalization is widely regarded as the central goal of machine learning, because a model that works only on data it has already seen provides little practical value.

The difference between a model's error on its training data and its error on unseen data is called the generalization gap; a small gap signals that the model has captured genuine structure, while a large gap typically indicates [overfitting](/wiki/overfitting). One of the defining puzzles of modern deep learning is that neural networks with far more parameters than training examples often generalize well anyway: Zhang et al. (2017) showed that the same architectures can also achieve zero training error on completely random labels, where no pattern exists to learn.[5]

## Introduction

When a model is trained, its parameters are optimized to minimize a loss function on the [training set](/wiki/training_set). The hope is that low training error will translate into low error on new data drawn from the same distribution. In practice, however, models can fall into two failure modes. [Overfitting](/wiki/overfitting) occurs when a model fits the [noise](/wiki/noise) in the training data rather than the signal, producing a complex function that performs well on training data but poorly on new inputs. [Underfitting](/wiki/underfitting) occurs when a model is too simple to capture the true structure in the data, resulting in high error on both training and test examples. The pursuit of good generalization involves finding the right balance between these two extremes.

## Generalization error

Generalization error (also called out-of-sample error) measures how well a model performs on data it has never seen. Formally, suppose a model is trained to minimize some loss function L over training data. The **training error** (or empirical risk) is the average loss computed on the training set. The **generalization error** (or expected risk) is the expected loss over the full data-generating distribution.

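The gap between these two quantities is called the **generalization gap**. Because the expected risk cannot be computed directly, it is estimated in practice by the error on a held-out test set:

> Generalization gap = Test error - Training error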

A small generalization gap indicates that the model's performance on training data is a reliable predictor of its performance on unseen data. A large gap typically signals overfitting: the model has exploited specific patterns in the training set that do not hold in general.
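
The gap can be illustrated with a deliberately extreme case. The sketch below is a toy example, not a realistic model: the data-generating rule (`y = 2x` plus Gaussian noise), the `memorizer` lookup-table model, and all function names are assumptions chosen for illustration. A model that only stores its training pairs reaches zero training error but a large test error.

```python
import random

def mean_squared_error(model, data):
    """Average squared loss of `model` over (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

def make_data(n, rng):
    """Toy data: y = 2x + Gaussian noise (an assumption for illustration)."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, 2.0 * x + rng.gauss(0.0, 0.5)) for x in xs]

def memorizer(train):
    """Lookup-table 'model': returns the stored label for a seen x, else 0."""
    table = dict(train)
    return lambda x: table.get(x, 0.0)

rng = random.Random(0)
train, test = make_data(50, rng), make_data(50, rng)

model = memorizer(train)
train_error = mean_squared_error(model, train)   # empirical risk
test_error = mean_squared_error(model, test)     # estimate of expected risk
generalization_gap = test_error - train_error
```

Here the training error says nothing about performance on new inputs, which is exactly the situation a large gap signals.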

In statistical learning theory, the expected risk of a hypothesis h is defined as:

> R(h) = E[L(h(x), y)]

where the expectation is taken over the joint distribution of inputs x and labels y. The empirical risk is the sample average of the loss over the training set. The goal of learning is to find a hypothesis that minimizes the expected risk, but since the true distribution is unknown, learning algorithms minimize the empirical risk and rely on theoretical guarantees (or empirical validation) to ensure the expected risk is also low.

## Bias-variance tradeoff

The [bias](/wiki/bias_math_or_bias_term)-variance tradeoff is a classical framework for understanding generalization. For a given learning algorithm, the expected prediction error on new data can be decomposed into three components:

| Component | Definition | Effect on generalization |
|---|---|---|
| Bias | Error from incorrect assumptions in the learning algorithm. High bias causes the model to miss relevant patterns. | Leads to underfitting; systematically wrong predictions |
| Variance | Error from sensitivity to small fluctuations in the training data. High variance causes the model to capture noise as if it were signal. | Leads to overfitting; predictions vary widely across different training sets |
| Irreducible error (noise) | Error from inherent noise in the data that no model can eliminate. | Sets a floor on achievable error |

The total expected error can be written as:

> Expected error = Bias^2 + Variance + Irreducible error

Simple models (few parameters, strong assumptions) tend to have high bias and low variance. Complex models (many parameters, flexible functional forms) tend to have low bias and high variance. The classical prescription is to choose a model of intermediate complexity that minimizes the sum of squared bias and variance. This tradeoff has guided model selection for decades, though modern deep learning has complicated this picture, as discussed below.
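
The decomposition can be estimated by simulation. The following is a minimal sketch under assumed conditions: the true function `sin(2*pi*x)`, the noise level, the query point `x0`, and the two toy models are all illustrative choices, not part of any standard benchmark. It retrains each model on many fresh training sets and measures the squared bias and the variance of its prediction at a single point.

```python
import math
import random
import statistics

def true_function(x):
    return math.sin(2 * math.pi * x)

def sample_training_set(n, noise_std, rng):
    xs = [rng.random() for _ in range(n)]
    return [(x, true_function(x) + rng.gauss(0.0, noise_std)) for x in xs]

def constant_model(train, x0):
    """High-bias model: ignores x0 and predicts the mean training label."""
    return statistics.fmean(y for _, y in train)

def nearest_neighbor_model(train, x0):
    """High-variance model: copies the label of the closest training point."""
    return min(train, key=lambda pair: abs(pair[0] - x0))[1]

def bias_variance(model, x0, trials=500, n=30, noise_std=0.3, seed=0):
    """Estimate squared bias and variance of `model` at x0 over many training sets."""
    rng = random.Random(seed)
    predictions = [model(sample_training_set(n, noise_std, rng), x0)
                   for _ in range(trials)]
    bias_squared = (statistics.fmean(predictions) - true_function(x0)) ** 2
    variance = statistics.pvariance(predictions)
    return bias_squared, variance

x0 = 0.25
simple_bias2, simple_var = bias_variance(constant_model, x0)
flexible_bias2, flexible_var = bias_variance(nearest_neighbor_model, x0)
```

In this setup the constant model should show the larger squared bias and the nearest-neighbor model the larger variance, mirroring the two rows of the table above.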

## Statistical learning theory and generalization bounds

Statistical learning theory provides mathematical tools for characterizing when and why learning algorithms generalize. Several key frameworks have been developed.

### PAC learning

The Probably Approximately Correct (PAC) learning framework, introduced by Leslie Valiant in 1984, formalizes the notion of learnability.[1] A concept class is PAC-learnable if there exists an algorithm that, given enough training examples, produces a hypothesis that is approximately correct (low error) with high probability. More precisely, for any desired accuracy epsilon and confidence delta, the algorithm must find a hypothesis with generalization error at most epsilon with probability at least 1 - delta, using a number of samples that is polynomial in 1/epsilon, 1/delta, and the complexity of the concept class. Valiant's 1984 paper, "A theory of the learnable," later contributed to his receiving the 2010 ACM A.M. Turing Award.[1]

PAC learning provides a rigorous answer to the question: how much training data does a learning algorithm need to generalize well?
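
As a concrete illustration, one standard textbook bound covers the simplest case: a finite hypothesis class in the realizable setting, where a learner that outputs any hypothesis consistent with the training data needs at most (ln|H| + ln(1/delta)) / epsilon samples. The sketch below just evaluates that expression; the function name and example numbers are assumptions, and the bound is sufficient rather than tight.

```python
import math

def pac_sample_bound(num_hypotheses, epsilon, delta):
    """Samples sufficient for a consistent learner over a finite class
    (realizable case): m >= (ln|H| + ln(1/delta)) / epsilon."""
    return math.ceil((math.log(num_hypotheses) + math.log(1.0 / delta)) / epsilon)

# 1,000 hypotheses, 5% error tolerance, 99% confidence.
m = pac_sample_bound(num_hypotheses=1000, epsilon=0.05, delta=0.01)
```

The required sample size grows only logarithmically with the number of hypotheses and with 1/delta, but linearly with 1/epsilon.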

### VC dimension

The Vapnik-Chervonenkis (VC) dimension, introduced by Vladimir Vapnik and Alexey Chervonenkis in 1971, measures the capacity (complexity) of a hypothesis class.[2] Formally, the VC dimension of a hypothesis class H is the size of the largest set of data points that H can shatter, meaning that H can realize every possible labeling of those points. In the realizable setting, a hypothesis class with VC dimension d requires on the order of d/epsilon training samples (up to logarithmic factors) to guarantee generalization error at most epsilon.
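
Shattering can be checked by brute force on small examples. The sketch below is illustrative only: it restricts attention to a five-point integer domain and to two hypothetical classes (one-sided thresholds and closed intervals on the line), with boundaries placed at half-integers so that every distinct labeling of the domain is represented.

```python
from itertools import combinations

def can_shatter(points, hypotheses):
    """True if the hypotheses realize every +/-1 labeling of `points`."""
    realized = {tuple(h(x) for x in points) for h in hypotheses}
    return len(realized) == 2 ** len(points)

def vc_dimension_on(domain, hypotheses, max_size):
    """Largest k <= max_size such that some k-subset of `domain` is shattered."""
    best = 0
    for k in range(1, max_size + 1):
        if any(can_shatter(subset, hypotheses)
               for subset in combinations(domain, k)):
            best = k
    return best

domain = [0, 1, 2, 3, 4]
cuts = [c - 0.5 for c in range(6)]  # boundaries between the integer points

# Thresholds: h(x) = +1 if x >= t, else -1.
thresholds = [lambda x, t=t: 1 if x >= t else -1 for t in cuts]
# Intervals: h(x) = +1 if a <= x <= b, else -1.
intervals = [lambda x, a=a, b=b: 1 if a <= x <= b else -1
             for a in cuts for b in cuts if a <= b]

vc_thresholds = vc_dimension_on(domain, thresholds, max_size=3)
vc_intervals = vc_dimension_on(domain, intervals, max_size=3)
```

Thresholds cannot label a left point positive and a right point negative, so no pair is shattered; intervals shatter pairs but cannot produce the positive, negative, positive pattern on three points.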

The VC dimension connects model complexity to sample complexity through generalization bounds. For a hypothesis class with VC dimension d, the generalization gap is bounded (with high probability) by a term proportional to sqrt(d/n), where n is the number of training samples.[2] This means that more complex hypothesis classes (higher VC dimension) require more training data to generalize well.

| Concept | What it measures | Key result |
|---|---|---|
| PAC learning | Learnability of a concept class | For binary classification, a class is PAC-learnable if and only if it has finite VC dimension |
| VC dimension | Capacity of a hypothesis class | Generalization gap scales as O(sqrt(d/n)) |
| Rademacher complexity | Data-dependent measure of hypothesis class richness | Tighter, distribution-dependent generalization bounds |

### Rademacher complexity

Rademacher complexity is a data-dependent measure of the richness of a hypothesis class, introduced as a refinement of VC-based bounds. It measures how well a function class can fit random noise. Unlike VC dimension, which provides a single number characterizing a hypothesis class across all possible data distributions, Rademacher complexity depends on the specific data distribution. As a result, Rademacher-based bounds are often tighter and more informative in practice.

Given a dataset of size n and a hypothesis class H, the empirical Rademacher complexity measures the expected correlation between the hypotheses in H and a set of random labels drawn uniformly from {+1, -1}. A function class with lower Rademacher complexity is easier to learn because it cannot fit random noise as easily, which implies better generalization.
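
The definition above can be approximated by Monte Carlo sampling when the hypothesis class is small enough to enumerate. In this sketch the two classes (a single constant hypothesis and all threshold classifiers on n ordered points), the trial count, and the function name are assumptions made for illustration.

```python
import random

def empirical_rademacher(predictions, trials=2000, seed=0):
    """Monte Carlo estimate of E_sigma[max_h (1/n) * sum_i sigma_i * h(x_i)].

    `predictions` holds one output vector in {-1, +1} per hypothesis.
    """
    rng = random.Random(seed)
    n = len(predictions[0])
    total = 0.0
    for _ in range(trials):
        sigma = [rng.choice((-1, 1)) for _ in range(n)]  # random labels
        total += max(sum(s * p for s, p in zip(sigma, row)) / n
                     for row in predictions)
    return total / trials

n = 20
# Small class: a single constant hypothesis.
single = [[1] * n]
# Richer class: every threshold classifier on n ordered points.
thresholds = [[1 if i >= t else -1 for i in range(n)] for t in range(n + 1)]

r_single = empirical_rademacher(single)
r_thresholds = empirical_rademacher(thresholds)
```

A single hypothesis cannot adapt to the random labels, so its estimate should sit near zero, while the richer threshold class correlates with noise noticeably better.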

## Evaluating generalization in practice

### Train, validation, and test splits

The standard approach to evaluating generalization involves splitting available data into three subsets:

- **Training set**: Used to fit model parameters. The model learns directly from this data.
- **[Validation set](/wiki/validation_set)**: Used to tune hyperparameters and make model selection decisions. The model does not train on this data, but choices about the model are influenced by validation performance.
- **[Test set](/wiki/test_set)**: Used only for the final evaluation of generalization performance. It must remain untouched until all modeling decisions are finalized.

A common split ratio is 60/20/20 or 70/15/15 for training, validation, and test sets respectively. The key principle is that the test set provides an unbiased estimate of generalization performance only if it is not used during any stage of model development.
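
A 70/15/15 split can be sketched in a few lines. The helper below is a hypothetical illustration (its name and defaults are not from any library); it shuffles once with a fixed seed and slices the result, which assumes the examples are independent and can be shuffled freely.

```python
import random

def train_val_test_split(items, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle a copy of `items` and cut it into train/validation/test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

train, val, test = train_val_test_split(range(100))
```

For time series or grouped data, a random shuffle like this can leak information between subsets, so a different splitting rule would be needed.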

### Cross-validation

[Cross-validation](/wiki/cross-validation) is a resampling technique used to estimate generalization performance when data is limited. In k-fold cross-validation, the dataset is divided into k equally sized subsets (folds). The model is trained on k - 1 folds and evaluated on the remaining fold. This process is repeated k times, with each fold serving as the evaluation set exactly once. The final performance estimate is the average across all k runs.

| Variant | Description | When to use |
|---|---|---|
| k-fold CV (k = 5 or 10) | Standard approach; good bias-variance balance | General-purpose model evaluation |
| Stratified k-fold | Preserves class distribution in each fold | Imbalanced classification problems |
| Leave-one-out CV (LOOCV) | k equals the number of samples; minimal bias but high variance and computational cost | Very small datasets |
| Repeated k-fold | Runs k-fold multiple times with different random splits and averages results | When more stable estimates are needed |

Cross-validation gives a more reliable estimate of generalization than a single train-test split, especially when data is scarce. Values of k = 5 or k = 10 are common choices that offer a reasonable tradeoff between computational cost and estimation accuracy.
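
The k-fold procedure can be sketched without any library support. In the example below, `k_fold_indices` and `cross_validate` are hypothetical helper names, and the "model" is just the mean of the training labels, chosen to keep the example short.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, eval_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # fold sizes differ by at most one
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]

def cross_validate(fit, score, xs, ys, k=5):
    """Average `score` over k folds; `fit` and `score` are caller-supplied."""
    scores = []
    for train_idx, eval_idx in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(score(model,
                            [xs[i] for i in eval_idx],
                            [ys[i] for i in eval_idx]))
    return sum(scores) / len(scores)

# Toy usage: the "model" is the mean training label, scored by squared error.
xs = list(range(20))
ys = [2.0 * x for x in xs]
fit_mean = lambda xs_train, ys_train: sum(ys_train) / len(ys_train)
mse = lambda model, xs_eval, ys_eval: sum((model - y) ** 2 for y in ys_eval) / len(ys_eval)
cv_error = cross_validate(fit_mean, mse, xs, ys, k=5)
```

Each index lands in the evaluation fold exactly once, so every example contributes to the estimate without ever being scored by a model that trained on it.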

## Regularization techniques that improve generalization

[Regularization](/wiki/regularization) encompasses a family of techniques that constrain or penalize model complexity to prevent overfitting and improve generalization. The core idea is that among all models that fit the training data well, simpler models tend to generalize better.

### Explicit regularization

**L1 regularization (Lasso)** adds a penalty proportional to the sum of absolute values of model weights. This encourages sparsity, driving some weights to exactly zero and performing implicit feature selection.

**L2 regularization (Ridge, weight decay)** adds a penalty proportional to the sum of squared weights. This discourages large individual weights and produces smoother decision boundaries.

**Elastic net** combines L1 and L2 penalties, balancing sparsity with smoothness.
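
The three penalties differ only in how they measure the size of the weights. The sketch below uses one possible parameterization, which is an assumption: a single `strength` multiplier and an `l1_ratio` mixing weight. Libraries differ in their exact scaling conventions, so these functions should not be read as matching any particular implementation.

```python
def l1_penalty(weights, strength):
    """Lasso-style penalty: strength * sum of absolute weights."""
    return strength * sum(abs(w) for w in weights)

def l2_penalty(weights, strength):
    """Ridge-style penalty: strength * sum of squared weights."""
    return strength * sum(w * w for w in weights)

def elastic_net_penalty(weights, strength, l1_ratio):
    """Convex mix of the two; l1_ratio = 1 is pure L1, 0 is pure L2."""
    return (l1_ratio * l1_penalty(weights, strength)
            + (1.0 - l1_ratio) * l2_penalty(weights, strength))

def regularized_loss(data_loss, weights, strength, l1_ratio=0.0):
    """Training objective: data-fit term plus the complexity penalty."""
    return data_loss + elastic_net_penalty(weights, strength, l1_ratio)

weights = [0.5, -2.0, 0.0]
l1_value = l1_penalty(weights, strength=0.1)   # 0.1 * (0.5 + 2.0 + 0.0)
l2_value = l2_penalty(weights, strength=0.1)   # 0.1 * (0.25 + 4.0 + 0.0)
```

The squared penalty weighs the large weight far more heavily than the small one, while the absolute-value penalty charges both in proportion to their size, which is the usual intuition for why L1 is more willing to push small weights all the way to zero.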

### Dropout

Dropout, introduced by Srivastava et al. (2014), randomly sets a fraction of neuron activations to zero during each training iteration.[4] Each training step effectively trains a different sub-network, forcing the [neural network](/wiki/neural_network) to develop redundant representations. At test time, all neurons are active (with scaled weights), and the result approximates an ensemble of many sub-networks. The authors described the central idea directly: "The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much."[4] Published in the *Journal of Machine Learning Research* (volume 15, pages 1929-1958), the method reported state-of-the-art results across vision, speech recognition, and document-classification benchmarks, and remains one of the most widely used regularization techniques in deep learning.[4]
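
A minimal sketch of the mechanism follows. Note one substitution: it implements the "inverted dropout" variant, which rescales the surviving activations during training instead of scaling the weights at test time as described above. The two are equivalent in expectation. The function name and arguments are illustrative.

```python
import random

def dropout(activations, drop_prob, training, rng):
    """Inverted dropout: during training, zero each unit with probability
    `drop_prob` and rescale survivors by 1 / (1 - drop_prob); at test time,
    return the activations unchanged."""
    if not training or drop_prob == 0.0:
        return list(activations)
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

rng = random.Random(0)
activations = [1.0] * 10_000
train_out = dropout(activations, drop_prob=0.5, training=True, rng=rng)
test_out = dropout(activations, drop_prob=0.5, training=False, rng=rng)
```

Because survivors are scaled up, the average activation during training stays close to its test-time value, so no extra correction is needed at inference.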

### Early stopping

Early stopping monitors the model's performance on a validation set during training and halts the process when validation performance begins to degrade. This prevents the model from reaching the point of overfitting, where training loss continues to decrease but validation loss increases.
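
One common way to decide when performance "begins to degrade" is a patience rule: stop once validation loss has failed to improve for a fixed number of consecutive epochs. The sketch below applies that rule to a pre-recorded list of validation losses; the function name, the `patience` parameter, and the loss values are assumptions for illustration.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the best epoch, stopping once validation loss has
    failed to improve for `patience` consecutive epochs."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # training would halt here
    return best_epoch

val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.40]
best = early_stopping_epoch(val_losses, patience=3)
```

With a patience of 3 the loop stops after epoch 6 and never sees the later improvement at epoch 7, which shows the tradeoff: a small patience saves compute but can stop before a delayed improvement.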

### Data augmentation

[Data augmentation](/wiki/data_augmentation) artificially expands the training set by applying transformations to existing data points. In computer vision, common augmentations include random cropping, horizontal flipping, rotation, color jitter, and scaling. In natural language processing, augmentations include synonym replacement, back-translation, and random insertion or deletion of words. By exposing the model to a wider variety of plausible inputs during training, data augmentation reduces overfitting and improves generalization to new examples.
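
Two of the vision augmentations mentioned above, horizontal flipping and random cropping, can be sketched on a plain list-of-rows image. The helper names, crop size, and flip probability are illustrative assumptions; real pipelines typically operate on array or tensor types.

```python
import random

def horizontal_flip(image):
    """Mirror a 2-D image (list of rows) left to right."""
    return [row[::-1] for row in image]

def random_crop(image, height, width, rng):
    """Cut a random height x width window out of the image."""
    top = rng.randint(0, len(image) - height)
    left = rng.randint(0, len(image[0]) - width)
    return [row[left:left + width] for row in image[top:top + height]]

def augment(image, rng, crop_size=3, flip_prob=0.5):
    """Apply a random crop, then flip with probability `flip_prob`."""
    out = random_crop(image, crop_size, crop_size, rng)
    return horizontal_flip(out) if rng.random() < flip_prob else out

image = [[r * 4 + c for c in range(4)] for r in range(4)]
augmented = augment(image, random.Random(0))
```

Both transformations are only appropriate when they preserve the label; flipping a photograph of a dog is safe, while flipping an image of a handwritten character may not be.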

### Batch normalization

Batch normalization normalizes the inputs to each layer within a mini-batch during training. While originally proposed to address internal covariate shift, it has been shown to have a regularizing effect that improves generalization. The noise introduced by computing statistics over mini-batches acts as an implicit regularizer.
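
The training-time computation for a single feature can be sketched as follows. This is a simplified illustration: it omits the running statistics used at inference, and the parameter names `gamma`, `beta`, and `eps` follow common convention rather than any specific library.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale and shift."""
    mean = sum(batch) / len(batch)
    variance = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [gamma * (x - mean) / math.sqrt(variance + eps) + beta
            for x in batch]

normalized = batch_norm([2.0, 4.0, 6.0, 8.0])
```

Since the mean and variance are computed from whichever examples happen to share a mini-batch, each example's normalized value depends slightly on its batch-mates, which is the source of the noise described above.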

## Generalization in deep learning: open questions

Deep learning has challenged many classical assumptions about generalization. Modern neural networks routinely use far more parameters than training examples, yet they often generalize remarkably well. Understanding why this happens remains one of the most active areas of research in machine learning theory.

### Generalization vs. memorization

Zhang et al. (2017) demonstrated a striking result that reshaped the field's understanding of generalization. They showed that standard deep neural networks can perfectly fit (memorize) training data with completely random labels, achieving zero training error on noise.[5] As the paper states, "when trained on a completely random labeling of the true data, neural networks achieve 0 training error," yet test performance on such data is no better than chance because random labels contain no learnable pattern.[5] This means that the effective capacity of these networks is large enough to memorize the entire training sets used in the experiments.

The key puzzle is this: the same architectures that can memorize random labels also generalize well on real data with true labels. Traditional measures of model complexity (such as VC dimension or parameter count) cannot distinguish between these two cases, since the model architecture is the same in both. The authors concluded that "explicit regularization may improve generalization performance, but is neither necessary nor by itself sufficient for controlling generalization error."[5] This implies that something about the interaction between the data, the architecture, and the training algorithm is responsible for generalization, not model capacity alone.

### Implicit bias of gradient descent

One leading explanation for why overparameterized neural networks generalize involves the implicit bias of optimization algorithms. When a model has more parameters than needed to fit the data, there are many possible solutions (global minima of the training loss) that achieve zero training error. Most of these solutions will not generalize well. However, gradient descent and stochastic gradient descent (SGD) do not find arbitrary solutions. Instead, they are implicitly biased toward solutions with specific structural properties.[12]

For linear least-squares models initialized at zero, gradient descent converges to the minimum-norm solution among all those that fit the data. For neural networks, the picture is more complex, but research has shown that SGD tends to find solutions that are "simpler" in various senses. The stochasticity of SGD (arising from mini-batch sampling) acts as an implicit regularizer, with smaller batch sizes and larger learning rates providing stronger regularization. Soudry et al. (2018) showed that for linearly separable data, gradient descent on logistic loss converges in the direction of the maximum-margin classifier, connecting the implicit bias of gradient descent to the well-known support vector machine solution.[7]
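
The linear case is small enough to verify directly. The sketch below is a toy setup chosen for illustration: it fits two weights to a single equation, so infinitely many weight vectors reach zero training error. Starting from zero, gradient descent on the squared error lands on the one with the smallest Euclidean norm.

```python
def gradient_descent_min_norm(x, y, lr=0.05, steps=500):
    """Fit w to the single equation w . x = y by gradient descent on the
    squared error, starting from w = 0."""
    w = [0.0] * len(x)
    for _ in range(steps):
        residual = sum(wi * xi for wi, xi in zip(w, x)) - y
        w = [wi - lr * 2.0 * residual * xi for wi, xi in zip(w, x)]
    return w

x, y = [1.0, 2.0], 5.0
w_gd = gradient_descent_min_norm(x, y)

# Closed-form minimum-norm solution of w . x = y: w = y * x / ||x||^2.
norm_sq = sum(xi * xi for xi in x)
w_min_norm = [y * xi / norm_sq for xi in x]

# Another exact solution with a larger norm, which gradient descent never reaches.
w_other = [5.0, 0.0]
```

Every update is a multiple of `x`, so the iterates never leave the line spanned by `x`, and the only exact solution on that line is the minimum-norm one.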

### Flatness of minima

Another line of research connects generalization to the geometry of the loss landscape. Hochreiter and Schmidhuber (1997) proposed that "flat" minima (regions of weight space where the loss remains approximately constant over a large neighborhood) correspond to simpler models and better generalization.[3] The intuition is that a flat minimum is robust to small perturbations in the weights, suggesting that the model captures genuine patterns rather than noise.

Keskar et al. (2017) provided empirical evidence that large-batch SGD tends to converge to sharp minima (where the loss changes rapidly with small weight perturbations), while small-batch SGD converges to flat minima.[6] They argued that this explains why small-batch training often generalizes better than large-batch training.

However, the flatness-generalization connection is not without controversy. Dinh et al. (2017) showed that sharpness measures can be manipulated through reparameterization without changing the function computed by the network, complicating the theoretical picture. Despite this debate, the practical success of Sharpness-Aware Minimization (SAM), proposed by Foret et al. (2021), which explicitly seeks flat minima during training, has provided additional evidence that flatness is correlated with generalization in practice.[11] SAM minimizes the worst-case loss within a neighborhood of the current weights and reported state-of-the-art results at the time, including 0.30% test error on CIFAR-10 and improved top-1 accuracy on ImageNet.[11]
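
The core of a SAM-style update can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the reference implementation: it uses full-batch gradients on a toy quadratic loss, and the names `sam_step`, `lr`, and `rho` (the neighborhood radius) are chosen here for readability.

```python
import math

def sam_step(w, grad_fn, lr=0.02, rho=0.05):
    """One SAM-style update: step to the approximate worst-case point within
    distance `rho`, then descend from the original weights using the
    gradient evaluated at that perturbed point."""
    g = grad_fn(w)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return list(w)
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]  # ascent step
    g_adv = grad_fn(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]       # descent step

def loss(w):
    """Toy quadratic with one flat and one sharp direction."""
    return w[0] ** 2 + 10.0 * w[1] ** 2

def grad(w):
    return [2.0 * w[0], 20.0 * w[1]]

w = [1.0, 1.0]
for _ in range(200):
    w = sam_step(w, grad)
final_loss = loss(w)
```

Each update costs two gradient evaluations instead of one, which is the main practical overhead of the method. On this toy loss the iterates settle into a small neighborhood of the minimum rather than converging exactly, because the perturbation radius stays fixed.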

### Double descent

The double descent phenomenon, characterized by Belkin et al. (2019) in the *Proceedings of the National Academy of Sciences*, challenged the classical U-shaped bias-variance tradeoff curve.[8] In the classical picture, test error first decreases as model complexity increases (reducing bias) and then increases (as variance dominates). Double descent adds a twist: beyond the point where the model can exactly interpolate the training data (the interpolation threshold), test error decreases again as the model becomes even more overparameterized. While the underlying behavior had been observed in earlier work, the term "double descent" was popularized by Belkin et al.'s 2019 paper.[8]

The double descent curve has three regimes:

1. **Underparameterized regime**: The model has fewer parameters than training examples. The classical bias-variance tradeoff applies: test error first falls as complexity increases, then rises as the model approaches the interpolation threshold.
2. **Interpolation threshold**: The model has just enough capacity to fit the training data exactly. Test error often peaks here because the model is forced into a complex, possibly noisy, interpolating solution.
3. **Overparameterized regime**: The model has far more parameters than training examples. Among the many interpolating solutions, the optimization algorithm (e.g., gradient descent) selects one with favorable properties, and test error decreases again.

Nakkiran et al. (2021) showed that double descent occurs not only as a function of model size but also as a function of training time (epoch-wise double descent) and dataset size; their paper is titled "Deep double descent: Where bigger models and more data hurt," and it identified regimes where adding training samples can actually hurt test performance.[10] This phenomenon has been observed across various architectures, including decision trees, random features models, and deep neural networks.

### What is grokking?

Grokking is a delayed-generalization phenomenon in which a neural network first memorizes its training set, reaching near-perfect training accuracy with test accuracy no better than chance, and then, after a long period of further training, abruptly transitions to strong generalization. The term was introduced by Power et al. (2022) at OpenAI in "Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets," submitted to arXiv on January 6, 2022.[13] Studying neural networks on small, algorithmically generated tasks such as modular arithmetic, the authors observed that "neural networks learn through a process of 'grokking' a pattern in the data, improving generalization performance from random chance level to perfect generalization, and that this improvement in generalization can happen well past the point of overfitting."[13]

The defining feature of grokking is the size of the delay. In the modular-arithmetic experiments, validation accuracy reached a comparable level only after roughly 1,000 times more optimization steps than training accuracy needed to approach its maximum, with little sign of generalization for most of that period.[13] Power et al. also found that smaller datasets require increasing amounts of optimization to generalize, making grokking a striking example of generalization emerging long after a model appears to have finished learning.[13] Grokking is closely related to the broader puzzle of overparameterized generalization and to double descent: in each case, test performance keeps improving well beyond the point at which the training loss has been driven close to zero.

### Scaling laws

Kaplan et al. (2020) discovered that the test loss of neural language models follows smooth power-law relationships with model size, dataset size, and the amount of compute used for training.[9] The paper reports that "the loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude."[9] The fitted power-law exponent relating loss to parameter count is roughly 0.076.[9] These scaling laws have several implications for generalization:

- Larger models are more sample-efficient, achieving lower test loss with the same amount of data. The authors state that "larger models are significantly more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence."[9]
- Performance scales predictably with resources, allowing researchers to extrapolate from small experiments to larger ones.
- Performance on text from other distributions closely tracks the in-distribution test loss, and is largely independent of architectural details such as network width or depth (within a wide range).

These findings have influenced the strategy behind modern [large language models](/wiki/large_language_model), where training very large models on large datasets has consistently produced better generalization across a wide range of downstream tasks.
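
The practical meaning of the 0.076 exponent quoted above is easiest to see as a ratio. The sketch below assumes a pure power law in parameter count, with data and compute not acting as bottlenecks, and computes how much the loss shrinks when the model grows by a given factor. The function and constant names are illustrative.

```python
ALPHA_N = 0.076  # exponent for parameter count quoted in the text

def loss_ratio(scale_factor, alpha=ALPHA_N):
    """Multiplicative change in loss when model size grows by `scale_factor`,
    assuming loss is proportional to N ** (-alpha)."""
    return scale_factor ** (-alpha)

doubling = loss_ratio(2.0)    # effect of a 2x larger model
tenfold = loss_ratio(10.0)    # effect of a 10x larger model
```

Under this assumption, doubling the parameter count lowers the loss by roughly 5%, and a tenfold increase lowers it by roughly 16%, which is why the gains are smooth but require large increases in scale.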

## Domain generalization and distribution shift

Standard generalization assumes that training and test data come from the same distribution. In practice, this assumption is often violated. Domain generalization addresses the more challenging scenario where the test data may come from a different distribution than the training data.

**Distribution shift** occurs when the statistical properties of the data change between training and deployment. Common forms include:

- **Covariate shift**: The input distribution changes, but the relationship between inputs and outputs remains the same.
- **Label shift**: The distribution of outputs changes, while the distribution of inputs given each output remains the same.
- **Concept drift**: The relationship between inputs and outputs itself changes over time.
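
Covariate shift can be demonstrated with a toy regression problem. In the sketch below, every choice is an assumption made for illustration: labels always follow the same rule (`y = x^2` plus noise), a straight line is fitted on inputs from [0, 1], and the fitted line is then evaluated on inputs from [2, 3].

```python
import random

def fit_line(points):
    """Ordinary least squares for y = a * x + b on (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = sxy / sxx
    return a, mean_y - a * mean_x

def mse(params, points):
    a, b = params
    return sum((a * x + b - y) ** 2 for x, y in points) / len(points)

def sample(low, high, n, rng):
    """Inputs from U(low, high); labels always follow y = x^2 + noise."""
    xs = [rng.uniform(low, high) for _ in range(n)]
    return [(x, x * x + rng.gauss(0.0, 0.05)) for x in xs]

rng = random.Random(0)
train = sample(0.0, 1.0, 200, rng)
same_dist_test = sample(0.0, 1.0, 200, rng)
shifted_test = sample(2.0, 3.0, 200, rng)   # only the input range changes

model = fit_line(train)
in_dist_error = mse(model, same_dist_test)
shifted_error = mse(model, shifted_test)
```

The relationship between inputs and outputs never changes here, yet the error on the shifted inputs is far larger, because the line was only a good approximation within the range it was fitted on.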

Approaches to domain generalization include:

| Approach | Description |
|---|---|
| Domain alignment | Learning representations that are invariant across different source domains |
| Data augmentation | Generating synthetic training examples that simulate distribution shifts |
| Meta-learning | Training the model to learn how to adapt quickly to new domains |
| Ensemble methods | Combining models trained on different domains to improve robustness |
| Invariant risk minimization | Learning representations that yield optimal classifiers across all training environments |

Domain generalization remains an active research area because real-world deployment rarely guarantees that the test distribution will match the training distribution exactly.

## How can you improve generalization in practice?

Based on both theory and empirical findings, several strategies are known to improve generalization in practice:

1. **Use more training data.** Larger and more diverse training sets almost always improve generalization. When additional real data is unavailable, data augmentation can help.
2. **Apply regularization.** Techniques such as weight decay, dropout, and early stopping constrain model complexity and reduce overfitting.
3. **Choose appropriate model complexity.** Use cross-validation or a held-out validation set to select model architecture and hyperparameters.
4. **Use ensembles.** Combining predictions from multiple models (bagging, boosting, or model averaging) typically reduces variance and improves generalization.
5. **Tune the learning rate and batch size.** Smaller batch sizes and appropriately tuned learning rates have been linked to flatter minima and better generalization.[6]
6. **Apply data augmentation.** Augmentations that preserve label semantics expand the effective training set and reduce overfitting.
7. **Monitor the generalization gap.** Track both training and validation performance during training. A growing gap between the two is a warning sign of overfitting.
8. **Leverage pre-trained models.** Transfer learning from large pre-trained models provides strong inductive biases and often improves generalization, especially when task-specific data is limited.
9. **Use normalization techniques.** Batch normalization and layer normalization can stabilize training and provide regularization benefits.
10. **Consider Sharpness-Aware Minimization.** SAM and related methods that explicitly seek flat minima have been shown to improve generalization across architectures.[11]

## Explain like I'm 5 (ELI5)

Imagine you are learning to recognize animals. Your parents show you pictures of many different dogs: big ones, small ones, fluffy ones, and short-haired ones. After seeing enough examples, you start to understand what makes a dog a dog.

Now someone shows you a dog you have never seen before. You can still tell it is a dog because you learned the general idea, not just the specific dogs in your picture book. That is generalization.

But if you had only seen one dog (say, a golden retriever), you might think every golden-colored animal is a dog. That is like a machine learning model that memorized too few examples instead of learning the general pattern. And if you had only been told "animals have four legs," you might call a cat or a horse a dog, too. That is like a model that is too simple to capture the differences.

Good generalization means learning just the right amount: enough to recognize new dogs you have never seen, but not so rigid that you confuse them with other animals.

## References

1. Valiant, L. G. (1984). "A theory of the learnable." *Communications of the ACM*, 27(11), 1134-1142.
2. Vapnik, V. N., & Chervonenkis, A. Y. (1971). "On the uniform convergence of relative frequencies of events to their probabilities." *Theory of Probability and Its Applications*, 16(2), 264-280.
3. Hochreiter, S., & Schmidhuber, J. (1997). "Flat minima." *Neural Computation*, 9(1), 1-42.
4. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). "Dropout: A simple way to prevent neural networks from overfitting." *Journal of Machine Learning Research*, 15(1), 1929-1958.
5. Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2017). "Understanding deep learning requires rethinking generalization." *International Conference on Learning Representations (ICLR)*. arXiv:1611.03530.
6. Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., & Tang, P. T. P. (2017). "On large-batch training for deep learning: Generalization gap and sharp minima." *International Conference on Learning Representations (ICLR)*.
7. Soudry, D., Hoffer, E., Nacson, M. S., Gunasekar, S., & Srebro, N. (2018). "The implicit bias of gradient descent on separable data." *Journal of Machine Learning Research*, 19(1), 2822-2878.
8. Belkin, M., Hsu, D., Ma, S., & Mandal, S. (2019). "Reconciling modern machine-learning practice and the classical bias-variance trade-off." *Proceedings of the National Academy of Sciences*, 116(32), 15849-15854.
9. Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., & Amodei, D. (2020). "Scaling laws for neural language models." *arXiv preprint arXiv:2001.08361*.
10. Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., & Sutskever, I. (2021). "Deep double descent: Where bigger models and more data hurt." *Journal of Statistical Mechanics: Theory and Experiment*, 2021(12), 124003. arXiv:1912.02292.
11. Foret, P., Kleiner, A., Mobahi, H., & Neyshabur, B. (2021). "Sharpness-aware minimization for efficiently improving generalization." *International Conference on Learning Representations (ICLR)*. arXiv:2010.01412.
12. Vardi, G. (2023). "On the implicit bias in deep-learning algorithms." *Communications of the ACM*, 66(6), 86-93.
13. Power, A., Burda, Y., Edwards, H., Babuschkin, I., & Misra, V. (2022). "Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets." *arXiv preprint arXiv:2201.02177*.

