Width

Machine Learning

11 min read

Updated Jul 17, 2026

Suggest edit History Talk

RawGraph

Last edited

Jul 17, 2026

Fact-checked

In review queue

Sources

17 citations

Revision

v3 · 2,198 words

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

See also: Machine learning terms

width in machine learning

Width refers to the number of neurons in a specific layer of a neural network. In modern transformer language models, the dominant width parameter is usually called hidden_size or d_model, and it sets the dimensionality of the vector flowing through every residual stream position. Width is one of two primary axes of network shape; the other is depth, the number of stacked layers. Together they determine parameter count, compute per forward pass, and how the model represents information.

A network that is too narrow cannot fit complex functions even with unlimited training data, and a network that is excessively wide wastes memory and FLOPs without improving generalization. Width also interacts with optimization: initialization schemes, learning rate schedules, and even the Adam update rule behave differently as width grows, which is why parameterizations such as muP exist to keep training stable as width changes.^[7]

width in feed forward networks

In a fully connected network, width is the number of hidden units in a layer. A network with input dimension 784 (a flattened MNIST image) and hidden layers of size 512, 512, and 10 has widths 512 and 512 in its hidden layers. Each unit computes a weighted sum of the previous layer's activations followed by a nonlinearity such as ReLU or GELU. Wider layers represent more linearly independent directions in feature space.

Parameter count of a fully connected layer scales linearly with both its input width and output width. A layer mapping size $d_\text{in}$ to size $d_\text{out}$ has $d_\text{in} \cdot d_\text{out}$ weights plus $d_\text{out}$ biases. When adjacent layers share width $d$, parameter count is quadratic: $d^2$ weights per layer. This is why wider models become memory bound quickly.

width in convolutional networks

In a convolutional neural network, width usually refers to the number of channels in a feature map rather than spatial dimensions. A convolutional layer with 64 output channels has a width of 64. The Wide Residual Networks paper by Sergey Zagoruyko and Nikos Komodakis (BMVC 2016) made this precise by introducing a widening factor $k$ that multiplies the number of channels in every residual block. The original ResNet corresponds to $k = 1$.^[2]

Zagoruyko and Komodakis argued that very deep thin ResNets suffered from diminishing feature reuse, where most residual blocks contributed little after training. They proposed shallower wider networks, trading layers for channels. A 16 layer WRN with $k = 8$ outperformed thousand layer ResNets on CIFAR while training much faster. WRN-28-10 reached 20.0% error on CIFAR-100, and WRN-40-4 matched ResNet-1001 in about six hours on a single Titan X.^[2] The paper is one of the most influential demonstrations that width can substitute for depth.

width in transformers

For transformer language models, width is a single number called hidden_size, d_model, or n_embd depending on the codebase. It sets the size of token embeddings, the residual stream between attention and feed forward sublayers, and the input and output of every linear projection inside an attention head or MLP. Vaswani et al. (2017) set the feed forward sublayer's intermediate dimension to $4 \times d_\text{model}$, a ratio that became a near universal convention.^[1]

Standard practice keeps each attention head 64 dimensional, so the number of heads scales with $d_\text{model} / 64$. SwiGLU based models such as Llama use a different ratio: because SwiGLU gates one projection, the effective MLP capacity is doubled, so Llama style models use an intermediate size around $2.7 \times d_\text{model}$.^[14]

width of famous models

The table below lists the hidden size of several widely studied models. Parameter counts are approximate and refer to the published configuration.

Model	Hidden size ($d_\text{model}$)	Layers	Heads	Parameters
BERT Base	768	12	12	110M^[11]
BERT Large	1024	24	16	340M^[11]
GPT-2 Small	768	12	12	117M
GPT-2 Medium	1024	24	16	345M
GPT-2 Large	1280	36	20	762M
GPT-2 XL	1600	48	25	1.5B
GPT-3 175B	12288	96	96	175B^[12]
PaLM 540B	18432	118	48	540B^[13]
Llama 3 8B	4096	32	32	8B^[14]
Llama 3 70B	8192	80	64	70B^[14]
Llama 3.1 405B	16384	126	128	405B^[14]

GPT-3's hidden size of 12288 is roughly three times that of Turing-NLG; because parameter count scales quadratically with width per block, that change accounts for most of the parameter growth over earlier OpenAI models.^[12] PaLM 540B pushed width further to 18432 with 48 heads of dimension 384.^[13] GPT-4's architecture has never been officially disclosed, but reports attributed to George Hotz describe it as a mixture of experts model with roughly 1.8 trillion total parameters across 16 experts and 120 layers. OpenAI has not confirmed the hidden size.

width versus depth

The tradeoff between width and depth is one of the central questions of neural architecture design. Deeper networks learn hierarchical compositions of features, where layer $\ell + 1$ acts on representations built by layer $\ell$. Wider networks learn more features in parallel at the same level of abstraction.

For computer vision, WideResNet pushed practice toward wider, shallower networks. For transformers, the situation is more nuanced. Very deep narrow models lose gradient signal in their lower layers; very wide shallow models can become memory bound at inference. Most large language models settle in a middle range with both axes scaled together. PaLM 540B uses 118 layers and width 18432, GPT-3 uses 96 and 12288, and Llama 3 405B uses 126 and 16384. The pattern is approximately $L \propto d_\text{model}^{2/3}$, meaning width grows faster than depth at scale.

Latency scales linearly with layers because layers run sequentially, but sublinearly with width because wider matrix multiplications parallelize well on modern accelerators. For inference critical applications, a wider shallower model often serves traffic faster than a narrower deeper one at the same parameter count.

the universal approximation theorem

The theoretical motivation for studying width is the universal approximation theorem. The classical version, proved independently by George Cybenko in 1989 for sigmoidal activations^[3] and by Kurt Hornik, Maxwell Stinchcombe, and Halbert White also in 1989 for more general activations,^[4] states that a feed forward network with one hidden layer can approximate any continuous function on a compact set to arbitrary accuracy, provided the hidden layer is wide enough.^[17] Hornik extended this in 1991 to show the property comes from the multilayer architecture itself rather than the specific activation.^[5]

The theorem does not bound how wide "wide enough" actually is. For a hard target function, the required hidden units can grow exponentially in the input dimension; the theorem guarantees representability, not practicality.

A dual line of work studies the arbitrary depth version. For ReLU networks approximating any Lebesgue integrable function on $\mathbb{R}^n$, Lu et al. (2017) showed that width $n + 4$ suffices,^[6] and later work by Park, Yun, Lee, and Shin (2021) and Cai (2023) tightened the bound for continuous functions to a minimum width of $\max{n + 1, m}$, where $n$ is the input dimension and $m$ is the output dimension.

width and optimization

Width changes the geometry of training. Signal propagation depends on how variance flows through layers, and as width grows, naive initialization can cause activations to explode or collapse. The classical fix is variance preserving initialization such as He or Xavier, which scales weights by $1 / \sqrt{\text{fan_in}}$ to keep pre activation variance constant.

A deeper phenomenon is that as width approaches infinity, training dynamics simplify. In the standard parameterization, this limit is the neural tangent kernel (NTK) regime studied by Jacot et al. (2018), where the network behaves like a linear model in feature space and weights move only infinitesimally during training.^[16] The NTK regime captures something real about very wide networks but rules out feature learning, since the features stay fixed at initialization.

muP and hyperparameter transfer

Greg Yang and colleagues at Microsoft Research introduced the maximal update parameterization, abbreviated muP, to address this issue. The foundation appears in Yang's Tensor Programs V paper (Tuning Large Neural Networks via Zero Shot Hyperparameter Transfer), presented at NeurIPS 2021 and posted to arXiv in March 2022. Collaborators included researchers at OpenAI.^[7]

muP rescales initialization variance and per layer learning rates so that activations, gradients, and parameter updates remain order one regardless of width. Three properties hold in the muP limit:

Activation coordinates stay $\Theta(1)$ throughout training.
Network outputs remain $O(1)$ at every width.
Parameters undergo maximal updates, so the learned features actually change during training instead of being frozen at initialization.

The practical payoff is hyperparameter transfer, called muTransfer.^[8] Because muP networks of different widths share similar training dynamics, optimal hyperparameters tuned on a small proxy carry over to a larger model without retuning. Yang and collaborators tuned a 40 million parameter proxy and applied the result to a 6.7 billion parameter model, beating the GPT-3 paper's numbers at that scale while spending only about 7% of pretraining compute on tuning.^[7] Microsoft released a PyTorch package called mup (installed with pip install mup) and applied via set_base_shapes() with a smaller base as the reference shape.^[9] EleutherAI and Cerebras have published practitioner guides,^[10] and unit scaled variants such as u-muP from Graphcore extend the approach to mixed precision.

practical guidance for choosing width

A practical starting point is a reference architecture in the same family. For transformer language models, BERT Base, GPT-2 Small, or Llama 3 8B are common reference points with hidden sizes 768, 768, and 4096. Typical adjustments:

For memory limited inference, prefer a narrower hidden size and compensate with more layers or knowledge distillation.
For long context targets, attention memory scales quadratically with sequence length but linearly with hidden size per token, so long context regimes sometimes favor a narrower model.
If tuning compute is scarce, parameterize with muP so a small sweep on a proxy width transfers to production.
For grouped query attention models, the number of key value heads is another lever; the same hidden size can be split into more or fewer KV groups.

The Chinchilla scaling laws of Hoffmann et al. (2022) prescribe how parameter count should grow with training tokens but do not specify the split between width and depth.^[15] Compute optimal models tend to keep an approximately constant aspect ratio as scale grows, but the exact ratio remains an open research question.

explain like i'm 5 (ELI5)

Imagine a group of friends trying to figure out what is in a picture. Each friend notices one thing. Width is the number of friends in the group. Two friends miss most of the picture. A thousand friends notice everything but take forever to compare notes.

In a language model like ChatGPT, the "friends" are numbers in a giant list (the hidden vector), and the length of that list is the width. GPT-3 uses lists with 12,288 numbers per word;^[12] Llama 3 8B uses 4,096.^[14] A wider list helps the model remember more about each word, but everything else has to be bigger too, which is why these systems need racks of GPUs.

Depth: the number of stacked layers, the other axis of model shape.
Parameters: total count of learnable weights.
Capacity: how complex a function a model can represent.
Overfitting and underfitting: failure modes correlated with width being too large or too small for the data.
Scaling laws: empirical relationships between model size, data size, and loss.
Neural tangent kernel: the infinite width limit of standard parameterization.

references

Vaswani, A., et al. (2017). Attention Is All You Need. NeurIPS 2017. ↩
Zagoruyko, S., and Komodakis, N. (2016). Wide Residual Networks. BMVC 2016. ↩
Cybenko, G. (1989). Approximation by Superpositions of a Sigmoidal Function. Mathematics of Control, Signals, and Systems, 2(4). ↩
Hornik, K., Stinchcombe, M., and White, H. (1989). Multilayer Feedforward Networks are Universal Approximators. Neural Networks, 2(5). ↩
Hornik, K. (1991). Approximation Capabilities of Multilayer Feedforward Networks. Neural Networks, 4(2). ↩
Lu, Z., et al. (2017). The Expressive Power of Neural Networks: A View from the Width. NeurIPS 2017. ↩
Yang, G., et al. (2022). Tensor Programs V: Tuning Large Neural Networks via Zero Shot Hyperparameter Transfer. ↩
Microsoft Research. μTransfer: A technique for hyperparameter tuning of enormous neural networks. ↩
Microsoft. GitHub: microsoft/mup. ↩
EleutherAI Blog. The Practitioner's Guide to the Maximal Update Parameterization. ↩
Devlin, J., et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers. NAACL 2019. ↩
Brown, T., et al. (2020). Language Models are Few-Shot Learners. NeurIPS 2020. ↩
Chowdhery, A., et al. (2022). PaLM: Scaling Language Modeling with Pathways. JMLR. ↩
Grattafiori, A., et al. (2024). The Llama 3 Herd of Models. Meta AI. ↩
Hoffmann, J., et al. (2022). Training Compute-Optimal Large Language Models (Chinchilla). ↩
Jacot, A., Gabriel, F., and Hongler, C. (2018). Neural Tangent Kernel: Convergence and Generalization in Neural Networks. NeurIPS 2018. ↩
Universal approximation theorem. Wikipedia. ↩

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

2 revisions by 1 contributors · full history

Suggest edit

What links here

Machine learning terms/All Terms

width in machine learning

width in feed forward networks

width in convolutional networks

width in transformers

width of famous models

width versus depth

the universal approximation theorem

width and optimization

muP and hyperparameter transfer

practical guidance for choosing width

explain like i'm 5 (ELI5)

related concepts

references

Improve this article

Related Articles

A/B Testing

Diffusion model

Dimension Reduction

Dimensions

Discrete Feature

Discriminative Model

What links here

Related Articles

A/B Testing

Diffusion model

Dimension Reduction

Dimensions

Discrete Feature

Discriminative Model

What links here