# Model Capacity

> Source: https://aiwiki.ai/wiki/model_capacity
> Updated: 2026-07-12
> Categories: Machine Learning, Model Evaluation
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Model capacity** is the size and richness of the family of functions a [machine learning](/wiki/machine_learning) model can represent and learn, which determines how complex a pattern the model can fit. A high-capacity model can approximate intricate, highly nonlinear mappings, while a low-capacity model is restricted to simpler functions. Capacity is the central lever in the [bias-variance tradeoff](/wiki/bias_variance_tradeoff): too little capacity makes a model [underfit](/wiki/underfitting), and too much lets it [overfit](/wiki/overfitting). A 2025 study quantified the raw capacity of GPT-style language models at roughly 3.6 bits of memorized information per parameter, one of the first hard numbers for a quantity that was historically described only qualitatively. [1]

## What is model capacity?

In statistical learning theory, model capacity (sometimes called model complexity or expressiveness) describes the richness of the hypothesis space from which a learning algorithm selects its final function. More formally, a hypothesis class with greater capacity contains a larger and more diverse set of functions, giving the learner more flexibility to match the training data. [2]

Consider fitting data points with polynomials. A linear model (degree 1) can only represent straight lines, so its capacity is low. A degree-10 polynomial can bend and twist to pass through many more points, so its capacity is much higher. If the true relationship in the data is moderately nonlinear, the linear model will [underfit](/wiki/underfitting) because it lacks sufficient capacity, while the degree-10 polynomial may [overfit](/wiki/overfitting) because it has excess capacity and begins fitting noise rather than signal.

Capacity is not solely determined by the number of [parameters](/wiki/parameter). The architecture, the type of functions the model can represent, the training algorithm, and [regularization](/wiki/regularization) techniques all influence a model's effective capacity. [8]

## How is model capacity measured?

Several theoretical frameworks have been developed to quantify model capacity. The two most prominent are the VC dimension and Rademacher complexity.

### VC Dimension

The [Vapnik-Chervonenkis (VC) dimension](/wiki/vc_dimension), introduced by Vladimir Vapnik and Alexey Chervonenkis in a foundational 1971 paper, is one of the central measures of capacity in statistical learning theory. [2] The VC dimension of a hypothesis class is defined as the largest number of data points that the class can **shatter**, meaning the class contains functions that can perfectly classify those points under every possible binary labeling.
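
The shattering definition can be checked by brute force for very small cases. The sketch below assumes a one-dimensional threshold classifier of the form "predict 1 if x >= t, else 0"; the function names are illustrative and not taken from any library.

```python
from itertools import product

def achievable_labelings(points):
    """All labelings that h_t(x) = 1 if x >= t else 0 can produce on `points`."""
    # Thresholds at each point, plus one above the maximum (labels everything 0),
    # are enough to enumerate every distinct behavior on this finite set.
    thresholds = sorted(points) + [max(points) + 1.0]
    return {tuple(int(x >= t) for x in points) for t in thresholds}

def is_shattered(points):
    """True if every binary labeling of `points` is achievable."""
    all_labelings = set(product((0, 1), repeat=len(points)))
    return achievable_labelings(points) == all_labelings

print(is_shattered([0.3]))       # one point: both labelings are reachable
print(is_shattered([0.3, 0.7]))  # two points: the labeling (1, 0) is unreachable
```

Because a single point is shattered and no pair of points is, the VC dimension of this threshold class is 1.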

| Classifier | VC Dimension | Explanation |
|---|---|---|
| Threshold function (1D) | 1 | Can shatter any single point on the real line, but cannot realize every labeling of two points |
| Linear classifier (2D) | 3 | Can shatter any 3 non-collinear points in the plane |
| Linear classifier in d dimensions | $$d + 1$$ | Generalizes to d-dimensional feature spaces |
| Sinusoidal classifier | Infinite | A sine function with adjustable frequency can shatter arbitrarily many points |
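| [Neural network](/wiki/neural_network) with W weights | $$O(W \log W)$$ | Upper bound for feedforward networks of linear threshold units; for piecewise-linear activations such as ReLU with $$L$$ layers, the known bound is $$O(W L \log W)$$ |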

The VC dimension connects directly to [generalization](/wiki/generalization) through Vapnik's generalization bound. For a hypothesis class with VC dimension $$d$$ trained on $$N$$ samples, the gap between training error and true error is bounded, up to constants and logarithmic factors, by a term proportional to $$\sqrt{d / N}$$. Holding that gap fixed therefore requires the number of training samples to grow in proportion to the VC dimension. [10]
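
A quick numerical reading of the $$\sqrt{d / N}$$ scaling is sketched below. It drops the constants and logarithmic factors of the real bound, so the numbers indicate scale only; `vc_gap` is a hypothetical helper name.

```python
import math

def vc_gap(d, n):
    """Order-of-magnitude generalization gap sqrt(d / n); constants and log factors omitted."""
    return math.sqrt(d / n)

# Increasing capacity tenfold needs ten times the data to keep the same gap.
for d, n in [(10, 1_000), (100, 1_000), (100, 10_000)]:
    print(f"d={d:>4}  N={n:>6}  gap ~ {vc_gap(d, n):.3f}")
```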

### Rademacher Complexity

Rademacher complexity, named after the mathematician Hans Rademacher, provides a data-dependent measure of function class richness. Rather than considering worst-case shattering behavior as VC dimension does, Rademacher complexity measures how well functions in the hypothesis class can correlate with random noise. [3]

Given a sample $$S = (x_1, \dots, x_m)$$ of size $$m$$ and a function class $$F$$, the empirical Rademacher complexity is defined as:

$$
\mathrm{Rad}_S(F) = \frac{1}{m} \mathbb{E}_\sigma \left[\sup_{f \in F} \sum_{i=1}^{m} \sigma_i f(x_i)\right]
$$

where $$\sigma_i$$ are independent Rademacher random variables taking values +1 or -1 with equal probability. Intuitively, a function class with high Rademacher complexity can fit random labels well, indicating that it has high capacity and may be prone to overfitting. [3]
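
For a finite function class, the expectation over $$\sigma$$ in the definition above can be approximated by Monte Carlo sampling. The following is a minimal sketch under that assumption; the threshold functions, sample points, and number of draws are illustrative choices, not part of the definition.

```python
import random

def empirical_rademacher(functions, xs, n_draws=2000, seed=0):
    """Monte Carlo estimate of (1/m) * E_sigma[ sup_f sum_i sigma_i * f(x_i) ]."""
    rng = random.Random(seed)
    m = len(xs)
    # Evaluate every function on the sample once, up front.
    outputs = [[f(x) for x in xs] for f in functions]
    total = 0.0
    for _ in range(n_draws):
        sigma = [rng.choice((-1, 1)) for _ in range(m)]
        total += max(sum(s * o for s, o in zip(sigma, out)) for out in outputs)
    return total / (n_draws * m)

def make_threshold(t, sign):
    """Classifier with outputs in {-1, +1}: sign * (+1 if x >= t else -1)."""
    return lambda x: sign * (1.0 if x >= t else -1.0)

xs = [i / 10 for i in range(10)]
rich = [make_threshold(t / 10 - 0.05, s) for t in range(11) for s in (1.0, -1.0)]
single = rich[:1]  # a class containing just one fixed function

rad_single = empirical_rademacher(single, xs)
rad_rich = empirical_rademacher(rich, xs)
print(rad_single, rad_rich)
```

A single fixed function cannot track random signs, so its estimate sits near zero, while the richer class correlates with the noise more strongly.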

Rademacher complexity offers tighter, data-dependent generalization bounds compared to VC dimension, which relies on worst-case analysis. With high probability, the gap between true risk and empirical risk is bounded, uniformly over the function class, by twice the Rademacher complexity plus a concentration term that shrinks as the sample size grows. Because it adapts to the actual data distribution, Rademacher complexity gives more practical estimates of how well a model will generalize in specific scenarios. [3]

## How does capacity relate to the bias-variance tradeoff?

The classical [bias-variance tradeoff](/wiki/bias_variance_tradeoff) is intimately connected to model capacity.

| Capacity Level | Bias | Variance | Typical Outcome |
|---|---|---|---|
| Too low | High | Low | [Underfitting](/wiki/underfitting): model misses important patterns |
| Optimal | Balanced | Balanced | Good [generalization](/wiki/generalization) |
| Too high | Low | High | [Overfitting](/wiki/overfitting): model memorizes noise |

**Bias** represents the systematic error introduced by approximating a complex real-world problem with a simpler model. Low-capacity models impose strong assumptions about the data-generating process, leading to high bias. **Variance** reflects how much the model's predictions change when trained on different subsets of data. High-capacity models are sensitive to the specific training set, leading to high variance.

The classical view suggests an optimal capacity somewhere in the middle, where total expected error (squared bias plus variance, on top of irreducible noise) is minimized. For decades, this U-shaped test error curve guided model selection: practitioners aimed to find the sweet spot where the model was complex enough to capture real patterns but constrained enough to avoid fitting noise. [8]
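
The two error components can be estimated by simulation. The sketch below assumes a quadratic target, Gaussian label noise, and two deliberately extreme predictors (a constant model and a 1-nearest-neighbor model); all of these are illustrative choices.

```python
import random
import statistics

def true_f(x):
    return x * x

def simulate(predict, x0=0.9, n_train=20, noise_sd=0.3, n_trials=2000, seed=0):
    """Estimate squared bias and variance of `predict` at x0 over many noisy training sets."""
    rng = random.Random(seed)
    xs = [-1 + 2 * i / (n_train - 1) for i in range(n_train)]  # fixed input grid
    preds = []
    for _ in range(n_trials):
        ys = [true_f(x) + rng.gauss(0, noise_sd) for x in xs]
        preds.append(predict(xs, ys, x0))
    bias_sq = (statistics.fmean(preds) - true_f(x0)) ** 2
    variance = statistics.pvariance(preds)
    return bias_sq, variance

def constant_model(xs, ys, x0):
    # Low capacity: ignores x0 and predicts the mean training label.
    return statistics.fmean(ys)

def nearest_neighbor(xs, ys, x0):
    # High capacity: copies the label of the closest training input.
    i = min(range(len(xs)), key=lambda j: abs(xs[j] - x0))
    return ys[i]

bias_low, var_low = simulate(constant_model)
bias_high, var_high = simulate(nearest_neighbor)
print("constant model:   bias^2 =", bias_low, " variance =", var_low)
print("nearest neighbor: bias^2 =", bias_high, " variance =", var_high)
```

In this setup the constant model shows the larger squared bias and the nearest-neighbor model the larger variance, mirroring the first and last rows of the table.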

### Underfitting

[Underfitting](/wiki/underfitting) occurs when a model's capacity is too low to capture the underlying structure of the data. Symptoms include high error on both the training set and the test set. For example, fitting a linear model to data that follows a quadratic relationship will systematically miss the curvature, producing poor predictions regardless of how much data is available.
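
The quadratic example can be reproduced in a few lines. This sketch assumes a noiseless target y = x² on an evenly spaced grid over [-1, 1] and fits a straight line by ordinary least squares; the helper names are illustrative.

```python
def fit_line(xs, ys):
    """Closed-form ordinary least squares for y = a + b * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

def linear_mse_on_quadratic(n):
    """Mean squared error of the best-fit line on n points of y = x^2."""
    xs = [-1 + 2 * i / (n - 1) for i in range(n)]
    ys = [x * x for x in xs]
    a, b = fit_line(xs, ys)
    return sum((a + b * x - y) ** 2 for x, y in zip(xs, ys)) / n

for n in (10, 100, 10_000):
    print(n, round(linear_mse_on_quadratic(n), 4))
```

The error does not approach zero as `n` grows, because no straight line can represent the curvature.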

### Overfitting

[Overfitting](/wiki/overfitting) occurs when a model's capacity exceeds what is needed, allowing it to memorize noise and idiosyncrasies in the training data. The model achieves very low training error but performs poorly on new data. Classic examples include fitting a high-degree polynomial through a small number of data points, where the polynomial oscillates wildly between the points.
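
The polynomial case can be sketched with Lagrange interpolation, which builds the unique degree-9 polynomial through 10 training points. The quadratic target, noise level, and evaluation grid below are assumptions made for illustration.

```python
import random

def lagrange_predict(xs, ys, x):
    """Evaluate the unique degree-(n-1) polynomial passing through all n points."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

rng = random.Random(0)
train_x = [-1 + 2 * i / 9 for i in range(10)]
train_y = [x * x + rng.gauss(0, 0.1) for x in train_x]  # noisy quadratic

def mse(xs, ys):
    """Mean squared error of the interpolating polynomial on (xs, ys)."""
    return sum((lagrange_predict(train_x, train_y, x) - y) ** 2
               for x, y in zip(xs, ys)) / len(xs)

# Evaluate on a finer grid inside the training range, against the noiseless target.
test_x = [-0.95 + 0.1 * i for i in range(20)]
test_y = [x * x for x in test_x]

train_mse = mse(train_x, train_y)
test_mse = mse(test_x, test_y)
print("train MSE:", train_mse)
print("test MSE: ", test_mse)
```

The interpolant reproduces every noisy training label, so training error is zero, while its error away from the training inputs is strictly larger.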

## How is capacity controlled in neural networks?

The capacity of a [neural network](/wiki/neural_network) depends on multiple architectural factors.

| Factor | Effect on Capacity |
|---|---|
| Depth (number of layers) | Deeper networks can represent hierarchical compositions of features; for some function families this yields an exponential gain in expressiveness over shallow networks of comparable size |
| Width (neurons per layer) | Wider layers increase the number of parameters and the representational power within each level of abstraction |
| Number of [parameters](/wiki/parameter) | More weights and biases provide more degrees of freedom for fitting complex functions |
| Activation functions | Nonlinear activations (ReLU, sigmoid, tanh) enable representation of nonlinear functions; linear activations collapse multi-layer networks into a single linear map |
| Skip connections | Residual connections in architectures like ResNet allow gradients to flow more easily, enabling effective training of very deep (high-capacity) networks |
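
The activation-function row can be verified directly: without a nonlinearity, two stacked linear layers equal one linear layer whose weight matrix is the product of the two. The weights below are arbitrary illustrative values and biases are omitted for brevity.

```python
def matvec(W, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Matrix-matrix product for list-of-rows matrices."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def relu(v):
    return [max(0.0, t) for t in v]

W1 = [[1.0, -2.0], [0.5, 1.0], [-1.0, 3.0]]  # 2 inputs -> 3 hidden units
W2 = [[2.0, -1.0, 0.5]]                       # 3 hidden units -> 1 output
x = [1.0, 2.0]

two_layer_linear = matvec(W2, matvec(W1, x))      # linear activation between layers
single_linear = matvec(matmul(W2, W1), x)         # one layer with merged weights
two_layer_relu = matvec(W2, relu(matvec(W1, x)))  # ReLU between layers

print(two_layer_linear, single_linear, two_layer_relu)
```

The first two results coincide, while inserting ReLU changes the output, which is what allows depth to add capacity.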

A [deep neural network](/wiki/deep_neural_network) with millions or billions of parameters has enormous theoretical capacity. In an ICLR 2017 best-paper study, Zhang et al. famously demonstrated that standard [deep neural networks](/wiki/deep_neural_network) can perfectly fit a random labeling of the training data, and showed that this behavior is "qualitatively unaffected by explicit regularization," confirming that their capacity far exceeds what is needed for typical tasks. [4] Despite this, these same networks generalize well on real data when trained with standard methods, a phenomenon that challenged classical learning theory.

### Representational vs. Effective Capacity

An important distinction exists between a model's **representational capacity** and its **effective capacity**, formalized in the Goodfellow, Bengio, and Courville textbook *Deep Learning*. Representational capacity refers to the family of functions a model can theoretically express given its architecture: the set of functions the learning algorithm can choose from when varying its parameters. Effective capacity describes the smaller subset of those functions that the learning algorithm can actually reach in practice. As the textbook puts it, "the imperfection of the optimization algorithm" means that effective capacity "may be less than the representational capacity of the model family." [8]

Several factors cause effective capacity to be lower than representational capacity:

- **Optimization limitations**: [Stochastic gradient descent](/wiki/stochastic_gradient_descent) (SGD) and its variants do not explore the full parameter space; they follow particular trajectories that favor certain solutions.
- **Implicit regularization**: SGD has been shown to act as an implicit regularizer, preferring simpler, flatter minima in the loss landscape. This biases the model toward lower-complexity solutions even without explicit regularization.
- **Training duration**: The number of epochs determines how much of the parameter space is explored. [Early stopping](/wiki/early_stopping) effectively reduces capacity by halting training before the model can fit noise.
- **Data characteristics**: The structure and size of the training data constrain which solutions are reachable.

## Structural Risk Minimization

Structural Risk Minimization (SRM) is a principle formalized by Vapnik and Chervonenkis that provides a systematic framework for selecting model capacity. SRM addresses the fundamental question: given data of a certain size, what level of model complexity will produce the best generalization? [10][11]

The SRM principle works by organizing hypothesis classes into a nested sequence of increasing complexity:

$$
H_1 \subset H_2 \subset H_3 \subset \cdots \subset H_k
$$

where each successive class has a higher VC dimension ($$h_1 < h_2 < \cdots < h_k$$). For each class, the learning algorithm minimizes the empirical risk (training error). Then, the optimal class is selected by minimizing the **guaranteed risk**, which is the sum of the empirical risk and a confidence interval that grows with the VC dimension.

This creates a tradeoff: as model capacity increases, the minimum achievable training error decreases, but the confidence interval (penalty for complexity) increases. The SRM principle selects the capacity level where this total bound is minimized. [Support Vector Machines](/wiki/support_vector_machine) (SVMs) are a notable practical realization of the SRM principle, as they explicitly control capacity through the margin. [10]
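
The selection step can be sketched as follows. The confidence term uses one commonly quoted form of Vapnik's bound, $$\sqrt{(h(\ln(2N/h) + 1) + \ln(4/\delta)) / N}$$; the nested classes and their empirical risks are made-up numbers for illustration only.

```python
import math

def vc_confidence(h, n, delta=0.05):
    """One common form of the VC confidence term for VC dimension h and n samples."""
    return math.sqrt((h * (math.log(2 * n / h) + 1) + math.log(4 / delta)) / n)

def select_by_srm(classes, n, delta=0.05):
    """classes: list of (vc_dimension, empirical_risk). Returns the pair with the smallest bound."""
    return min(classes, key=lambda c: c[1] + vc_confidence(c[0], n, delta))

# Hypothetical nested classes: training error falls as VC dimension rises.
classes = [(2, 0.30), (5, 0.15), (20, 0.08), (100, 0.05), (500, 0.01)]
n_samples = 1000

for h, risk in classes:
    print(h, risk, round(risk + vc_confidence(h, n_samples), 3))

best = select_by_srm(classes, n_samples)
print("selected class:", best)
```

With these numbers the bound is minimized by an intermediate class rather than by the class with the lowest training error.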

## Regularization as Capacity Control

[Regularization](/wiki/regularization) techniques provide practical methods for controlling effective model capacity without changing the model architecture. By adding constraints or penalties, regularization steers the learning algorithm away from overly complex solutions. [8]

| Regularization Method | Mechanism of Capacity Control |
|---|---|
| L1 regularization (Lasso) | Adds a penalty proportional to the absolute value of weights, encouraging sparsity and effectively reducing the number of active parameters |
| L2 regularization (Ridge) | Adds a penalty proportional to the squared magnitude of weights, shrinking weights toward zero and smoothing the learned function |
| [Dropout](/wiki/dropout_regularization) | Randomly deactivates neurons during training, preventing co-adaptation and simulating an ensemble of smaller networks |
| [Early stopping](/wiki/early_stopping) | Halts training before the model fully fits the training data, limiting the effective complexity of the learned function |
| Data augmentation | Increases the effective training set size, raising the bar for how much capacity is needed to overfit |
| Weight decay | Equivalent to L2 regularization under plain SGD, but not under adaptive optimizers such as Adam; shrinks weights multiplicatively at each update |

The strength of regularization directly controls the tradeoff between fitting the training data and maintaining simplicity. Stronger regularization reduces effective capacity, increasing bias but decreasing variance.
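
For L2 regularization the effect of penalty strength has a closed form in the simplest case. This sketch assumes a one-dimensional linear model without an intercept, with toy data chosen for illustration.

```python
def ridge_slope(xs, ys, lam):
    """Minimizer of sum((y - w * x)^2) + lam * w^2 for a 1-D model with no intercept."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]  # roughly y = 2x

lambdas = (0.0, 1.0, 10.0, 100.0)
slopes = [ridge_slope(xs, ys, lam) for lam in lambdas]
for lam, w in zip(lambdas, slopes):
    print(f"lambda={lam:>6}: w={w:.3f}")
```

As the penalty grows the fitted slope shrinks toward zero: the model becomes less able to follow the data (more bias) and less sensitive to it (less variance).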

## What is double descent and overparameterization?

Modern deep learning has revealed a surprising departure from the classical U-shaped test error curve. The **double descent** phenomenon, documented by Belkin et al. (2019) and Nakkiran et al. (2021), shows that as model capacity increases past the **interpolation threshold** (the point where the model has just enough parameters to perfectly fit the training data), the test error first peaks and then decreases again. [5][6] Belkin and colleagues described their result as showing how increasing model capacity beyond the point of interpolation "results in improved performance." [5]

The classical view predicts that test error should increase monotonically beyond the interpolation threshold. Instead, in the overparameterized regime, adding more capacity actually improves generalization. This creates a curve with two descents: the first descent follows the classical bias-variance tradeoff, the peak occurs near the interpolation threshold, and the second descent occurs in the heavily overparameterized regime.

Several mechanisms explain this behavior:

- **Implicit bias of optimization**: SGD and related algorithms tend to favor low-norm solutions among the many interpolating functions available in overparameterized models; in linear models, gradient descent started from zero converges exactly to the minimum-norm interpolant. These low-norm solutions tend to be smoother and generalize better.
- **Solution geometry**: In overparameterized models, the set of interpolating solutions forms a large manifold. The optimization algorithm selects solutions from this manifold based on its inductive biases, often finding well-generalizing solutions.
- **Benign overfitting**: Under certain conditions, even models that perfectly interpolate noisy training data can achieve near-optimal test error because the noise component of the learned function has vanishing effect on test predictions.

The Nakkiran et al. study showed that [double descent](/wiki/double_descent) is broad: it has been observed across a wide range of architectures, including fully connected networks, convolutional neural networks (CNNs), ResNets, and [Transformers](/wiki/transformer). [6] It can manifest along three axes: model-wise (increasing parameters), epoch-wise (increasing training time), and sample-wise (increasing data), which is why the paper is subtitled "Where Bigger Models and More Data Hurt." [6] This phenomenon has had major implications for how practitioners think about model capacity, suggesting that in many cases, making a model larger is preferable to trying to find the exact optimal capacity.

## How much can a model store per parameter?

A 2025 paper by John X. Morris and collaborators, "How much do language models memorize?", produced one of the first concrete capacity numbers for [large language models](/wiki/large_language_model). By training hundreds of GPT-2-style transformers on synthetic random bitstrings and on the FineWeb text corpus, the authors separated what a model stores about specific training points (unintended memorization) from what it learns about the underlying data distribution (generalization), then measured the saturation point. [1]

The headline finding is that "GPT-style models have a capacity of approximately 3.6 bits per parameter." [1] More precisely, the paper reports that "our models consistently memorize between 3.5 and 3.6 bits per parameter"; the measured value is about 3.51 bits per parameter for models trained in bfloat16 (half) precision and rises to about 3.83 bits per parameter in fp32 (full) precision. [1] This ratio held steady across models spanning roughly 100K to 20M parameters, indicating that bits-per-parameter is a stable property of the architecture rather than an artifact of a single model size. [1]
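
The bits-per-parameter figure converts directly into a storage estimate. The sketch below uses the 3.6 figure quoted above; applying it to model sizes outside the measured range is an extrapolation and should be treated as an assumption.

```python
BITS_PER_PARAM = 3.6  # headline estimate quoted above

def capacity_bits(n_params, bits_per_param=BITS_PER_PARAM):
    """Estimated memorization capacity in bits."""
    return n_params * bits_per_param

def capacity_megabytes(n_params, bits_per_param=BITS_PER_PARAM):
    """Same estimate expressed in megabytes (8 bits per byte, 10^6 bytes per MB)."""
    return capacity_bits(n_params, bits_per_param) / 8 / 1e6

for n in (100_000, 20_000_000):
    print(f"{n:>12,} params -> {capacity_megabytes(n):.3f} MB")
```

By this estimate a 20M-parameter model stores about 9 MB of raw information about its training set.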

The study also connects capacity to the [grokking](/wiki/grokking) transition: "Models memorize until their capacity fills, at which point grokking begins, and unintended memorization decreases as models begin to generalize." [1] In other words, once a dataset contains more information than the model can store, the model is forced to compress, and that pressure to compress is what pushes it from memorizing examples toward learning generalizable structure.

## Capacity of Modern Large Language Models

Modern [large language models](/wiki/large_language_model) (LLMs) operate firmly in the overparameterized regime, with parameter counts ranging from billions to over a trillion.

| Model | Approximate Parameters | Year |
|---|---|---|
| [GPT-2](/wiki/gpt2) | 1.5 billion | 2019 |
| [GPT-3](/wiki/gpt3) | 175 billion | 2020 |
| [PaLM](/wiki/palm) | 540 billion | 2022 |
| [GPT-4](/wiki/gpt4) | Estimated 1+ trillion (mixture of experts) | 2023 |
| [Llama 3](/wiki/llama3) | Up to 405 billion | 2024 |

[Scaling laws](/wiki/scaling_laws) discovered by Kaplan et al. (2020) at OpenAI showed that the test loss of neural language models follows a power-law relationship with model size, dataset size, and compute budget, with trends spanning more than seven orders of magnitude across a sweep of over 200 transformer models. [7] These laws suggest that increasing capacity (parameters) continues to yield predictable performance improvements, though with diminishing returns at larger scales. Kaplan and colleagues also found that "larger models are significantly more sample-efficient," so the compute-optimal strategy is to train very large models and stop well before convergence. [7]
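
The diminishing returns of a power law can be made concrete. The sketch below assumes the parameter-count form $$L(N) = (N_c / N)^{\alpha}$$; the constants `ALPHA` and `N_C` are illustrative assumptions and are not taken from this article.

```python
ALPHA = 0.076  # assumed exponent, for illustration only
N_C = 8.8e13   # assumed scale constant, for illustration only

def power_law_loss(n_params, n_c=N_C, alpha=ALPHA):
    """Loss predicted by L(N) = (N_c / N) ** alpha."""
    return (n_c / n_params) ** alpha

def loss_ratio_per_doubling(alpha=ALPHA):
    """Factor by which the loss is multiplied each time N doubles."""
    return 2 ** (-alpha)

for n in (1e8, 1e9, 1e10, 1e11):
    print(f"N={n:.0e}  loss={power_law_loss(n):.3f}")
print("loss multiplier per doubling:", loss_ratio_per_doubling())
```

Under these assumptions every doubling of parameters multiplies the loss by the same factor, so each absolute improvement is smaller than the one before it.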

At the same time, the efficiency with which models use their capacity has been rising fast. The "densing law" of LLMs, proposed by Xiao et al. (2024), defines capability density as the ratio of an LLM's effective parameter size to its actual parameter size and reports that this density has been doubling approximately every 3.5 months. [9] In practical terms, equivalent performance can be achieved with exponentially fewer parameters over time as architectures and training methods become more efficient at utilizing model capacity. For example, the authors note that the 2.4B-parameter MiniCPM matched the performance of the earlier 7B-parameter Mistral model using roughly 35 percent of the parameters. [9]
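
If capability density doubles every 3.5 months, the parameter count needed for a fixed level of performance halves over the same period. The sketch below assumes this idealized exponential trend holds exactly; `equivalent_params` is a hypothetical helper.

```python
def equivalent_params(initial_params, months, doubling_months=3.5):
    """Parameters needed for equal performance after `months`, under an exact exponential trend."""
    return initial_params / 2 ** (months / doubling_months)

start = 7e9  # a 7B-parameter reference model
for months in (0, 3.5, 7, 12):
    print(f"{months:>4} months -> {equivalent_params(start, months) / 1e9:.2f}B parameters")
```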

## Explain Like I'm 5 (ELI5)

Imagine you have a box of crayons for drawing pictures. If you only have 2 crayons, you can draw simple pictures but not ones with lots of colors and details. If you have 200 crayons, you can draw much more detailed and colorful pictures. The number of crayons is like "model capacity."

But here is the tricky part. If you have too many crayons and you try to copy a photo exactly, you might spend all your time coloring in tiny scratches and smudges on the photo that do not matter. That is like overfitting: you are paying too much attention to the unimportant details. If you only have 2 crayons, your drawing will be too simple and miss important things. That is like underfitting.

The goal is to have just the right number of crayons (or just enough model capacity) so your drawing captures the important parts of the picture without copying all the little mistakes. Scientists have found that sometimes having way more crayons than you need can actually work well too, as long as you use them wisely. That surprising discovery is called "double descent."

## See Also

- [Overfitting](/wiki/overfitting)
- [Underfitting](/wiki/underfitting)
- [Regularization](/wiki/regularization)
- [Bias-Variance Tradeoff](/wiki/bias_variance_tradeoff)
- [Generalization](/wiki/generalization)
- [Double Descent](/wiki/double_descent)
- [Scaling Laws](/wiki/scaling_laws)
- [Neural Network](/wiki/neural_network)
- [Deep Neural Network](/wiki/deep_neural_network)

## References

1. Morris, J. X., Sitawarin, C., Guo, C., Kokhlikyan, N., Suh, G. E., Rush, A. M., Chaudhuri, K., & Mahloujifar, S. (2025). "How much do language models memorize?" *arXiv preprint arXiv:2505.24832*. https://arxiv.org/abs/2505.24832
2. Vapnik, V. N., & Chervonenkis, A. Y. (1971). "On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities." *Theory of Probability and Its Applications*, 16(2), 264-280.
3. Bartlett, P. L., & Mendelson, S. (2002). "Rademacher and Gaussian Complexities: Risk Bounds and Structural Results." *Journal of Machine Learning Research*, 3, 463-482.
4. Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2017). "Understanding Deep Learning Requires Rethinking Generalization." *International Conference on Learning Representations (ICLR)*. (ICLR 2017 Best Paper Award.) https://arxiv.org/abs/1611.03530
5. Belkin, M., Hsu, D., Ma, S., & Mandal, S. (2019). "Reconciling Modern Machine-Learning Practice and the Classical Bias-Variance Trade-Off." *Proceedings of the National Academy of Sciences*, 116(32), 15849-15854. https://www.pnas.org/doi/10.1073/pnas.1903070116
6. Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., & Sutskever, I. (2021). "Deep Double Descent: Where Bigger Models and More Data Hurt." *Journal of Statistical Mechanics: Theory and Experiment*, 2021(12), 124003. (First released as ICLR 2020.) https://arxiv.org/abs/1912.02292
7. Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., & Amodei, D. (2020). "Scaling Laws for Neural Language Models." *arXiv preprint arXiv:2001.08361*. https://arxiv.org/abs/2001.08361
8. Goodfellow, I., Bengio, Y., & Courville, A. (2016). *Deep Learning*. MIT Press. Chapter 5: Machine Learning Basics.
9. Xiao, C., Cai, J., Zhao, W., Zeng, G., Lin, B., Zhou, J., Han, X., Liu, Z., & Sun, M. (2024). "Densing Law of LLMs." *arXiv preprint arXiv:2412.04315*; published in *Nature Machine Intelligence* (2025). https://arxiv.org/abs/2412.04315
10. Shalev-Shwartz, S., & Ben-David, S. (2014). *Understanding Machine Learning: From Theory to Algorithms*. Cambridge University Press.
11. Koltchinskii, V. (2001). "Rademacher Penalties and Structural Risk Minimization." *IEEE Transactions on Information Theory*, 47(5), 1902-1914.