A generalization curve (also called a learning curve) is a plot that visualizes how a machine learning model's performance on training data and unseen data changes as a function of some varying quantity, such as the number of training examples, training epochs, or model complexity. Generalization curves are one of the most widely used diagnostic tools in machine learning, providing direct visual evidence of whether a model is underfitting, overfitting, or achieving a good balance between the two.
The term "generalization" refers to a model's ability to perform well on data it has not encountered during training. A model that memorizes its training set but fails on new inputs has poor generalization, while a model that captures the true underlying patterns in the data generalizes well. The generalization curve makes this distinction visible by plotting both training and validation (or test) metrics side by side.
Imagine you are studying for a spelling test. Your teacher gives you a list of 20 words to practice. After studying really hard, you can spell all 20 words perfectly. But what happens when the test includes 5 new words you never practiced? If you can spell those new words too, that means you actually learned how spelling works, not just those 20 specific words.
A generalization curve is like a chart that tracks two things while you study: how well you do on your practice words, and how well you do on surprise words you have never seen. If both lines on the chart go up together, you are really learning. If only the practice line goes up but the surprise line stays flat or goes down, you are just memorizing without understanding.
In machine learning, the computer is the student, the practice words are the training data, and the surprise words are the validation set. The generalization curve helps engineers figure out whether the computer is actually learning useful patterns or just memorizing answers.
Generalization curves can be constructed by varying different quantities along the x-axis. Each type provides different diagnostic information.
| Type | X-axis variable | What it reveals | Common use case |
|---|---|---|---|
| Sample learning curve | Number of training examples | Whether more data would improve performance | Deciding whether to collect additional data |
| Epoch learning curve | Number of training iterations (epochs) | When the model begins to overfit during training | Determining when to stop training |
| Complexity curve (validation curve) | Model complexity or hyperparameter value | Optimal model capacity for the task | Hyperparameter tuning and model selection |
| Compute curve | Amount of training compute (FLOPs) | Efficiency of scaling compute resources | Resource allocation and budgeting |
A sample learning curve plots model performance (such as accuracy or loss) against the number of training samples used. As the number of training examples increases, the training error typically rises (because it becomes harder to fit every point perfectly) while the validation error decreases (because the model has more information to learn from). The gap between the two curves is called the generalization gap.
When both curves converge to a low error value, the model has sufficient capacity and enough data. When both curves plateau at a high error value, the model is too simple for the task and more data alone will not help. When the training error is low but the validation error remains high with a large gap between them, the model is overfitting and would benefit from either more data or regularization.
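For illustration, such a curve can be generated with scikit-learn's learning_curve utility (listed in the tooling table later in this article). This is a minimal sketch rather than a prescribed recipe; the digits dataset and logistic regression model are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_digits(return_X_y=True)

# Evaluate the model at increasing training-set sizes, with 5-fold CV
# at each size to estimate validation performance.
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 8),
    cv=5,
    scoring="accuracy",
)

# Convert accuracy to error and average over folds.
train_err = 1.0 - train_scores.mean(axis=1)
val_err = 1.0 - val_scores.mean(axis=1)

for n, tr, va in zip(train_sizes, train_err, val_err):
    print(f"n={n:5d}  train error={tr:.3f}  val error={va:.3f}  gap={va - tr:.3f}")
```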
An epoch learning curve tracks training and validation performance over successive passes through the training data. In a typical scenario, both training loss and validation loss decrease during early epochs. At some point, the training loss continues to decrease while the validation loss stops improving or begins to increase. This divergence point signals the onset of overfitting, and it is the basis for the early stopping technique.
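A minimal way to trace such a curve without a deep learning framework is to log losses after every pass through the data. The sketch below uses scikit-learn's SGDClassifier with partial_fit; the dataset, epoch count, and hyperparameters are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

clf = SGDClassifier(loss="log_loss", random_state=0)
classes = np.unique(y_train)

train_hist, val_hist = [], []
for epoch in range(50):
    # One pass over the training data counts as one epoch.
    clf.partial_fit(X_train, y_train, classes=classes)
    train_hist.append(log_loss(y_train, clf.predict_proba(X_train), labels=classes))
    val_hist.append(log_loss(y_val, clf.predict_proba(X_val), labels=classes))

# The epoch at which validation loss bottoms out marks the onset of overfitting.
best_epoch = int(np.argmin(val_hist))
print(f"validation loss minimized at epoch {best_epoch}")
```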
A complexity curve (sometimes called a validation curve) plots model performance as a function of a hyperparameter that controls model complexity, such as the depth of a decision tree, the number of hidden units in a neural network, or the degree of a polynomial. These curves directly illustrate the bias-variance tradeoff: simple models sit on the left side with high bias and low variance, while complex models sit on the right side with low bias and high variance.
The bias-variance tradeoff is one of the most important concepts in statistical learning theory and lies at the heart of generalization curve analysis.
For a regression problem, the expected prediction error at a point x can be decomposed into three components:
E[(y - f̂(x))²] = Bias²[f̂(x)] + Var[f̂(x)] + σ²
where Bias²[f̂(x)] is the squared difference between the model's average prediction (over different training sets) and the true function value, Var[f̂(x)] is the variability of the model's predictions across different training sets, and σ² is the irreducible error arising from noise in the data itself.
This decomposition was formalized by Stuart Geman and colleagues in 1992, building on earlier statistical work. It shows that minimizing prediction error requires balancing bias against variance, since reducing one typically increases the other.
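The decomposition can be estimated empirically by refitting the same model class on many independently drawn training sets and measuring how its predictions at fixed test points vary. The following sketch assumes a synthetic sine-wave ground truth and a polynomial model; every numeric choice is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    return np.sin(2 * np.pi * x)

def sample_training_set(n=30, noise=0.3):
    x = rng.uniform(0, 1, n)
    y = true_f(x) + rng.normal(0, noise, n)
    return x, y

x_test = np.linspace(0, 1, 100)
degree = 3          # model complexity knob (illustrative choice)
n_trials = 500

# Fit the same model class to many independent training sets and
# collect its predictions at fixed test points.
preds = np.empty((n_trials, x_test.size))
for t in range(n_trials):
    x, y = sample_training_set()
    coeffs = np.polyfit(x, y, deg=degree)
    preds[t] = np.polyval(coeffs, x_test)

mean_pred = preds.mean(axis=0)
bias_sq = ((mean_pred - true_f(x_test)) ** 2).mean()   # Bias^2, averaged over x
variance = preds.var(axis=0).mean()                    # Var, averaged over x

print(f"degree={degree}: bias^2={bias_sq:.4f}, variance={variance:.4f}")
```

Rerunning with higher polynomial degrees shows the tradeoff directly: bias² shrinks while variance grows.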
On a complexity curve, the bias-variance tradeoff produces a characteristic U-shaped test error curve:
| Region | Model complexity | Bias | Variance | Training error | Validation error | Diagnosis |
|---|---|---|---|---|---|---|
| Left (underfitting) | Low | High | Low | High | High (close to training) | Model too simple |
| Middle (optimal) | Moderate | Moderate | Moderate | Moderate | Lowest point | Best generalization |
| Right (overfitting) | High | Low | High | Low | High (far from training) | Model too complex |
The optimal model complexity corresponds to the minimum of the validation error curve. This is the point where the model is complex enough to capture the true patterns in the data but not so complex that it fits noise.
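In practice this minimum can be located with scikit-learn's validation_curve function by sweeping a single complexity hyperparameter. The dataset and the choice of tree depth as the complexity axis below are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
depths = np.arange(1, 16)

# Sweep tree depth (the complexity axis) and score each setting with 5-fold CV.
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=0),
    X, y,
    param_name="max_depth",
    param_range=depths,
    cv=5,
    scoring="accuracy",
)

# The depth with the lowest mean validation error is the optimum.
val_err = 1.0 - val_scores.mean(axis=1)
best = depths[np.argmin(val_err)]
print(f"validation error minimized at max_depth={best}")
```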
The training error (or empirical risk) is computed directly on the training dataset:
R_emp = (1/n) Σ_{i=1}^{n} L(x_i, y_i, f(x_i))
where L is the loss function, n is the number of training samples, and f is the learned model.
The generalization error (or population risk) is the expected loss over the entire data-generating distribution:
R = E_{(x,y) ~ P} [L(x, y, f(x))]
Since the true distribution P is unknown, the generalization error is estimated using a held-out validation set or test set that the model has never seen during training.
The difference between validation error and training error is the generalization gap:
Generalization gap = Validation error - Training error
A small generalization gap (with both errors being low) indicates good generalization. A large generalization gap (low training error, high validation error) indicates overfitting. When both errors are high with a small gap, the model is underfitting.
The generalization gap typically widens as training progresses beyond the optimal point, because the model begins to memorize training-specific noise rather than learning generalizable patterns.
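Estimating the gap in code amounts to subtracting two error measurements, as in the following sketch (the dataset and model are arbitrary illustrative choices):

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import zero_one_loss
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

train_error = zero_one_loss(y_train, model.predict(X_train))  # empirical risk R_emp
val_error = zero_one_loss(y_val, model.predict(X_val))        # estimate of R
print(f"generalization gap = {val_error - train_error:.3f}")
```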
Different generalization curve shapes correspond to different model pathologies. The following table summarizes the most common patterns.
| Curve pattern | Training error | Validation error | Gap | Problem | Remedy |
|---|---|---|---|---|---|
| Both high, converging | High | High | Small | Underfitting (high bias) | Increase model complexity, add features, reduce regularization |
| Training low, validation high | Low | High | Large | Overfitting (high variance) | Add more data, increase regularization, use dropout, simplify model |
| Both low, converging | Low | Low | Small | Good fit | No changes needed |
| Validation decreasing, not yet stable | Decreasing | Decreasing | Moderate | Needs more training | Continue training for more epochs |
| Both high, validation still decreasing | High | Decreasing | Varies | Needs more data | Collect additional training samples |
| Validation oscillating | Low | Unstable | Varies | Learning rate too high or batch size too small | Reduce learning rate, increase batch size |
The Vapnik-Chervonenkis (VC) dimension, introduced by Vladimir Vapnik and Alexey Chervonenkis in the 1970s, provides a formal measure of the capacity (or complexity) of a hypothesis class. The VC dimension of a hypothesis class H is the largest number of data points that can be shattered by H, meaning that for every possible labeling of those points, some hypothesis in H realizes it.
VC theory gives an upper bound on the generalization error:
R(h) ≤ R_emp(h) + O(sqrt((d * log(n/d) + log(1/δ)) / n))
where d is the VC dimension, n is the number of training samples, and δ is the confidence parameter. This bound tells us that as the number of training samples n grows relative to the VC dimension d, the generalization gap shrinks.
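To get a feel for how the bound shrinks with n, the complexity term can be evaluated numerically. The sketch below treats the hidden constant in the O-notation as 1, an arbitrary simplification for illustration only.

```python
import math

def vc_gap_bound(d, n, delta=0.05):
    """Complexity term of the VC bound, with the O(.) constant taken as 1."""
    return math.sqrt((d * math.log(n / d) + math.log(1 / delta)) / n)

# For a fixed VC dimension d, the bound on the generalization gap
# shrinks as the number of training samples n grows.
for n in (1_000, 10_000, 100_000, 1_000_000):
    print(f"d=100, n={n:>9,}: gap bound ~ {vc_gap_bound(100, n):.3f}")
```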
For generalization curves, VC theory predicts that models with higher VC dimension (greater complexity) require more training data to generalize well. This relationship is directly visible in sample learning curves: more complex models need their training set size to be proportionally larger before the validation error converges to the training error.
The Probably Approximately Correct (PAC) learning framework, proposed by Leslie Valiant in 1984, formalizes the conditions under which learning is feasible. A concept class is PAC-learnable if there exists an algorithm that, given enough training samples, can find a hypothesis with low generalization error with high probability.
The sample complexity of a learning problem is the minimum number of training examples required to achieve a given level of generalization error with a given confidence level. For finite hypothesis classes, the sample complexity scales logarithmically with the size of the hypothesis class. For infinite hypothesis classes, the sample complexity depends on the VC dimension.
Sample complexity bounds directly relate to generalization curves: they predict the shape and rate at which the validation error decreases as the number of training samples increases.
Rademacher complexity provides data-dependent generalization bounds that can be tighter than VC-based bounds in practice. While VC dimension measures the worst-case capacity of a hypothesis class, Rademacher complexity measures the ability of the hypothesis class to fit random noise in the specific dataset at hand. This makes Rademacher bounds more informative for understanding the generalization behavior of specific models on specific datasets.
The double descent phenomenon challenges the classical U-shaped generalization curve predicted by the bias-variance tradeoff. Instead of test error increasing monotonically as model complexity grows beyond the optimal point, double descent shows that test error first increases, peaks near the interpolation threshold, and then decreases again as complexity continues to grow.
Early hints of double descent appeared in specific models as far back as 1989 in work by Vallet and colleagues. However, the phenomenon gained widespread attention in 2019 when Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal published their paper "Reconciling Modern Machine-Learning Practice and the Classical Bias-Variance Trade-Off" in the Proceedings of the National Academy of Sciences. They coined the term "double descent" and demonstrated the phenomenon across a range of model families, including random forests, neural networks, and boosted trees.
The discovery was motivated by a seeming contradiction: classical theory warned that models with too many parameters would overfit catastrophically, yet practitioners in deep learning routinely observed that very large models trained to interpolate (perfectly fit) the training data still generalized well to unseen data.
The peak in test error between the two descents typically occurs near the interpolation threshold, the point where the number of model parameters roughly equals the number of training data points. At this threshold, the model has just enough capacity to fit the training data perfectly, but it does so in a way that is extremely sensitive to noise. As the parameter count grows beyond this threshold into the overparameterized regime, many solutions interpolate the training data, and gradient descent tends to find solutions with favorable properties (such as minimum norm), leading to improved generalization.
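Model-wise double descent can be reproduced in a small random-features regression experiment, sketched below. All settings (dimensions, noise level, ReLU feature map, minimum-norm fit via the pseudoinverse) are illustrative assumptions rather than the setup of any particular paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear ground truth observed through noise; random ReLU features as the model.
n_train, n_test, d = 100, 1000, 20
w_true = rng.normal(size=d)

X_train = rng.normal(size=(n_train, d))
y_train = X_train @ w_true + 0.5 * rng.normal(size=n_train)
X_test = rng.normal(size=(n_test, d))
y_test = X_test @ w_true + 0.5 * rng.normal(size=n_test)

V = rng.normal(size=(d, 2000))  # pool of random projection directions

for p in (10, 50, 90, 100, 110, 200, 1000, 2000):
    # p random ReLU features; p close to n_train is the interpolation threshold.
    F_train = np.maximum(X_train @ V[:, :p], 0.0)
    F_test = np.maximum(X_test @ V[:, :p], 0.0)
    # Minimum-norm least-squares fit: pinv selects the min-norm interpolant.
    w = np.linalg.pinv(F_train) @ y_train
    test_mse = np.mean((F_test @ w - y_test) ** 2)
    print(f"p={p:5d}  test MSE={test_mse:8.3f}")
```

Because the pseudoinverse returns the minimum-norm solution, test error typically spikes when the feature count p is near the number of training points and falls again as p grows well beyond it.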
Nakkiran, Kaplun, Bansal, Yang, Barak, and Sutskever (2021) extended the double descent framework by demonstrating that the phenomenon occurs along three axes:
| Axis | What varies | Description |
|---|---|---|
| Model-wise double descent | Number of parameters | Test error peaks near the interpolation threshold as model size increases, then decreases in the overparameterized regime |
| Epoch-wise double descent | Number of training epochs | Test error initially decreases, then increases as the model begins to overfit, then decreases again with further training |
| Sample-wise double descent | Number of training samples | Adding more training data can temporarily hurt performance before eventually helping, particularly near the interpolation threshold |
Epoch-wise double descent is particularly notable because it means that training a model for longer, even past the point of apparent overfitting, can sometimes lead to improved test performance. This contradicts the conventional wisdom behind early stopping.
Several theories have been proposed to explain double descent:

- Implicit regularization: in the overparameterized regime, gradient descent tends to select minimum-norm (or otherwise simple) solutions among the many that interpolate the training data, and these solutions generalize well.
- Ill-conditioning at the interpolation threshold: when the parameter count approximately equals the sample count, the model is forced to contort itself to fit every point, making it maximally sensitive to noise and producing the test error peak.
- Effective model complexity: Nakkiran et al. argued that double descent is best understood in terms of a complexity measure that depends on the training procedure (including training time), rather than on parameter count alone.
Grokking is a related but distinct phenomenon first described by Alethea Power and colleagues at OpenAI in 2022. In grokking, a neural network initially memorizes the training data (achieving near-perfect training accuracy but poor test accuracy), and then, after many additional epochs of training with no apparent improvement, suddenly transitions to strong generalization.
This behavior is surprising because it contradicts the typical expectation that generalization either improves gradually alongside training performance or degrades once overfitting begins. Grokking has been observed primarily on small algorithmic datasets (such as modular arithmetic tasks) and tends to occur more frequently with smaller datasets and stronger regularization.
Proposed explanations include a transition from a "lazy training" regime (where weights remain close to initialization) to a "rich" regime (where weights move substantially in task-relevant directions), and the gradual formation of structured internal representations that replace the initial memorized solution.
Grokking appears on generalization curves as a dramatic, delayed drop in validation error long after the training error has plateaued at near zero.
Neural scaling laws, first systematically characterized by Jared Kaplan and colleagues at OpenAI in 2020, describe how the generalization performance of large language models scales as a power law with model size, dataset size, and training compute.
The main results from Kaplan et al. (2020) relevant to generalization curves include:

- Test loss falls predictably as a power law in each of model size, dataset size, and training compute, provided the other two factors are not bottlenecks.
- Performance depends strongly on total scale and only weakly on architectural details such as network depth versus width.
- Larger models are more sample-efficient: they reach a given loss level with fewer training examples and fewer optimization steps.
Subsequent work by Hoffmann et al. (2022), known as the "Chinchilla" scaling laws, refined these findings by showing that model size and training data should be scaled roughly equally for optimal compute efficiency.
Scaling laws suggest that in the regime of modern large-scale models, generalization curves are smooth and predictable rather than exhibiting the classical U-shape. The dominant factor determining generalization is the ratio between model capacity and data availability, with both needing to grow together for optimal performance.
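Because a power law is a straight line in log-log space, the scaling exponent can be recovered with a simple linear fit. The loss values below are synthetic stand-ins generated from an assumed power law, not measurements from any real model.

```python
import numpy as np

# Hypothetical (synthetic) loss measurements at increasing model sizes,
# generated from an assumed power law L(N) = (Nc / N)**alpha plus noise.
N = np.array([1e6, 3e6, 1e7, 3e7, 1e8, 3e8])
rng = np.random.default_rng(0)
loss = (5e4 / N) ** 0.076 * np.exp(rng.normal(0, 0.01, N.size))

# A power law is a straight line in log-log space: log L = -alpha*log N + c.
slope, intercept = np.polyfit(np.log(N), np.log(loss), deg=1)
print(f"fitted exponent alpha ~ {-slope:.3f}")

# The fit can then extrapolate the generalization curve to larger models.
N_big = 1e9
print(f"predicted loss at N=1e9: {np.exp(intercept + slope * np.log(N_big)):.3f}")
```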
Regularization techniques are used to improve a model's generalization performance by constraining the model in ways that reduce overfitting. Their effects are directly visible on generalization curves as a narrowing of the gap between training and validation error.
| Technique | Mechanism | Effect on generalization curve |
|---|---|---|
| L1 regularization (Lasso) | Adds sum of absolute weight values to the loss | Increases training error slightly, reduces validation error; promotes sparse solutions |
| L2 regularization (Ridge/weight decay) | Adds sum of squared weight values to the loss | Smoothly constrains weight magnitudes; equivalent to early stopping in some settings |
| Dropout | Randomly sets a fraction of neuron activations to zero during training | Acts as approximate ensemble averaging; reduces co-adaptation of neurons |
| Early stopping | Halts training when validation error stops improving | Prevents the model from entering the overfitting region of the epoch learning curve |
| Data augmentation | Creates modified copies of training examples | Effectively increases dataset size; reduces variance without changing model complexity |
| Batch normalization | Normalizes layer inputs within each mini-batch | Provides mild regularization effect; stabilizes training and allows higher learning rates |
Early stopping is one of the simplest and most effective regularization methods, and it is directly motivated by the shape of the epoch learning curve. The idea is to monitor the validation error during training and stop when it begins to increase (or stops decreasing). In practice, a patience parameter specifies how many epochs the training process should continue without improvement before stopping.
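A minimal, framework-agnostic sketch of patience-based early stopping follows; train_one_epoch, validate, state_snapshot, and restore are hypothetical hooks standing in for whatever the training framework provides.

```python
def early_stopping_loop(model, train_one_epoch, validate,
                        max_epochs=200, patience=10):
    """Train until validation loss fails to improve for `patience` epochs.

    `train_one_epoch(model)` and `validate(model)` are hypothetical hooks
    assumed to run one training pass and return the validation loss.
    """
    best_loss = float("inf")
    best_state = None
    epochs_without_improvement = 0

    for epoch in range(max_epochs):
        train_one_epoch(model)
        val_loss = validate(model)

        if val_loss < best_loss:
            best_loss = val_loss
            best_state = model.state_snapshot()   # hypothetical checkpoint hook
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation error has stopped improving

    model.restore(best_state)  # hypothetical: roll back to the best checkpoint
    return best_loss
```

In Keras, equivalent behavior is available out of the box via tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10, restore_best_weights=True).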
Early stopping is mathematically related to L2 regularization. Bishop (1995) and others showed that, under certain conditions, the number of training iterations plays a role analogous to the inverse of the regularization strength in L2 regularization. Stopping earlier corresponds to stronger regularization.
However, the discovery of epoch-wise double descent complicates the use of early stopping: in some cases, training past the initial overfitting peak can lead to improved generalization during the second descent phase.
The simplest approach to estimating generalization performance is to split the available data into a training set, a validation set, and a test set. The model is trained on the training set, hyperparameters are tuned based on validation set performance, and the final generalization estimate is obtained from the test set. A common split ratio is 60/20/20 or 80/10/10.
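With scikit-learn, a 60/20/20 split can be obtained by splitting twice, as in this sketch (the digits dataset is an arbitrary example):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)

# First carve off 20% as the final test set, then split the remaining
# 80% in a 75/25 ratio, giving a 60/20/20 train/validation/test split.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)
```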
Cross-validation provides a more robust estimate of generalization error, particularly when data is limited. In k-fold cross-validation, the dataset is divided into k equally sized folds. The model is trained k times, each time using k-1 folds for training and the remaining fold for validation. The generalization error estimate is the average validation error across all k folds.
The choice of k involves its own bias-variance tradeoff: larger k values (such as leave-one-out, where k equals the number of samples) produce estimates with low bias but high variance, while smaller k values produce estimates with higher bias but lower variance. Empirical studies suggest that k = 5 or k = 10 provides a good balance for most practical applications.
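A sketch of 5-fold cross-validation with scikit-learn's cross_val_score follows; the dataset and model are illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# 5-fold cross-validation: each fold serves exactly once as the validation set.
scores = cross_val_score(LogisticRegression(max_iter=5000), X, y,
                         cv=5, scoring="accuracy")
print(f"estimated generalization accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```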
Bootstrap methods estimate generalization error by repeatedly sampling the training data with replacement to create multiple training sets. The model is trained on each bootstrap sample, and the error on out-of-sample data points (those not selected in a given bootstrap sample) provides the generalization estimate. The .632 bootstrap and .632+ bootstrap are refined variants that correct for the optimistic bias of basic bootstrap estimates.
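A basic (uncorrected) out-of-bag bootstrap estimate can be sketched in a few lines; the model, dataset, and number of bootstrap rounds are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)
n = len(y)
errors = []

for _ in range(100):
    # Sample n indices with replacement; roughly 36.8% of points are left out.
    idx = rng.integers(0, n, size=n)
    oob = np.setdiff1d(np.arange(n), idx)
    if oob.size == 0:
        continue
    model = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
    errors.append(np.mean(model.predict(X[oob]) != y[oob]))

print(f"bootstrap out-of-bag error estimate: {np.mean(errors):.3f}")
```

The .632 and .632+ variants mentioned above combine this out-of-bag estimate with the (optimistic) training error to reduce its bias.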
Most machine learning frameworks provide built-in support for generating generalization curves.
| Framework | Relevant functionality |
|---|---|
| scikit-learn | learning_curve() and validation_curve() functions in sklearn.model_selection |
| TensorFlow / Keras | Training history object (history.history) stores per-epoch metrics; TensorBoard provides real-time visualization |
| PyTorch | Manual logging during training loop; integration with TensorBoard, Weights & Biases, or MLflow |
| MLflow | Experiment tracking with automatic metric logging and comparison across runs |
The study of generalization in learning systems has a long history. The concept of a learning curve originated in psychology, where Hermann Ebbinghaus studied memory retention in the 1880s. In machine learning, the mathematical foundations were established through the work of Vapnik and Chervonenkis in the 1960s and 1970s, Valiant's PAC learning framework in 1984, and the extensive development of statistical learning theory throughout the 1990s and 2000s.
The classical view, codified in textbooks by Hastie, Tibshirani, and Friedman (2001) and Bishop (2006), held that generalization curves follow a U-shaped pattern with respect to model complexity. This view was fundamentally challenged in the late 2010s by the double descent phenomenon (Belkin et al., 2019; Nakkiran et al., 2021) and the empirical success of massively overparameterized deep learning models.
Today, the study of generalization curves continues to evolve, with active research into scaling laws, grokking, the role of implicit regularization in overparameterized models, and the conditions under which classical versus modern generalization behaviors emerge.