Width
Last reviewed
May 11, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 2,198 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 11, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 2,198 words
Add missing citations, update stale details, or suggest a clearer explanation.
See also: Machine learning terms
Width refers to the number of neurons in a specific layer of a neural network. In modern transformer language models, the dominant width parameter is usually called hidden_size or d_model, and it sets the dimensionality of the vector flowing through every residual stream position. Width is one of two primary axes of network shape; the other is depth, the number of stacked layers. Together they determine parameter count, compute per forward pass, and how the model represents information.
A network that is too narrow cannot fit complex functions even with unlimited training data, and a network that is excessively wide wastes memory and FLOPs without improving generalization. Width also interacts with optimization: initialization schemes, learning rate schedules, and even the Adam update rule behave differently as width grows, which is why parameterizations such as muP exist to keep training stable as width changes.
In a fully connected network, width is the number of hidden units in a layer. A network with input dimension 784 (a flattened MNIST image) and hidden layers of size 512, 512, and 10 has widths 512 and 512 in its hidden layers. Each unit computes a weighted sum of the previous layer's activations followed by a nonlinearity such as ReLU or GELU. Wider layers represent more linearly independent directions in feature space.
Parameter count of a fully connected layer scales linearly with both its input width and output width. A layer mapping size $d_\text{in}$ to size $d_\text{out}$ has $d_\text{in} \cdot d_\text{out}$ weights plus $d_\text{out}$ biases. When adjacent layers share width $d$, parameter count is quadratic: $d^2$ weights per layer. This is why wider models become memory bound quickly.
In a convolutional neural network, width usually refers to the number of channels in a feature map rather than spatial dimensions. A convolutional layer with 64 output channels has a width of 64. The Wide Residual Networks paper by Sergey Zagoruyko and Nikos Komodakis (BMVC 2016) made this precise by introducing a widening factor $k$ that multiplies the number of channels in every residual block. The original ResNet corresponds to $k = 1$.
Zagoruyko and Komodakis argued that very deep thin ResNets suffered from diminishing feature reuse, where most residual blocks contributed little after training. They proposed shallower wider networks, trading layers for channels. A 16 layer WRN with $k = 8$ outperformed thousand layer ResNets on CIFAR while training much faster. WRN-28-10 reached 20.0% error on CIFAR-100, and WRN-40-4 matched ResNet-1001 in about six hours on a single Titan X. The paper is one of the most influential demonstrations that width can substitute for depth.
For transformer language models, width is a single number called hidden_size, d_model, or n_embd depending on the codebase. It sets the size of token embeddings, the residual stream between attention and feed forward sublayers, and the input and output of every linear projection inside an attention head or MLP. Vaswani et al. (2017) set the feed forward sublayer's intermediate dimension to $4 \times d_\text{model}$, a ratio that became a near universal convention.
Standard practice keeps each attention head 64 dimensional, so the number of heads scales with $d_\text{model} / 64$. SwiGLU based models such as Llama use a different ratio: because SwiGLU gates one projection, the effective MLP capacity is doubled, so Llama style models use an intermediate size around $2.7 \times d_\text{model}$.
The table below lists the hidden size of several widely studied models. Parameter counts are approximate and refer to the published configuration.
| Model | Hidden size ($d_\text{model}$) | Layers | Heads | Parameters |
|---|---|---|---|---|
| BERT Base | 768 | 12 | 12 | 110M |
| BERT Large | 1024 | 24 | 16 | 340M |
| GPT-2 Small | 768 | 12 | 12 | 117M |
| GPT-2 Medium | 1024 | 24 | 16 | 345M |
| GPT-2 Large | 1280 | 36 | 20 | 762M |
| GPT-2 XL | 1600 | 48 | 25 | 1.5B |
| GPT-3 175B | 12288 | 96 | 96 | 175B |
| PaLM 540B | 18432 | 118 | 48 | 540B |
| Llama 3 8B | 4096 | 32 | 32 | 8B |
| Llama 3 70B | 8192 | 80 | 64 | 70B |
| Llama 3.1 405B | 16384 | 126 | 128 | 405B |
GPT-3's hidden size of 12288 is roughly three times that of Turing-NLG; because parameter count scales quadratically with width per block, that change accounts for most of the parameter growth over earlier OpenAI models. PaLM 540B pushed width further to 18432 with 48 heads of dimension 384. GPT-4's architecture has never been officially disclosed, but reports attributed to George Hotz describe it as a mixture of experts model with roughly 1.8 trillion total parameters across 16 experts and 120 layers. OpenAI has not confirmed the hidden size.
The tradeoff between width and depth is one of the central questions of neural architecture design. Deeper networks learn hierarchical compositions of features, where layer $\ell + 1$ acts on representations built by layer $\ell$. Wider networks learn more features in parallel at the same level of abstraction.
For computer vision, WideResNet pushed practice toward wider, shallower networks. For transformers, the situation is more nuanced. Very deep narrow models lose gradient signal in their lower layers; very wide shallow models can become memory bound at inference. Most large language models settle in a middle range with both axes scaled together. PaLM 540B uses 118 layers and width 18432, GPT-3 uses 96 and 12288, and Llama 3 405B uses 126 and 16384. The pattern is approximately $L \propto d_\text{model}^{2/3}$, meaning width grows faster than depth at scale.
Latency scales linearly with layers because layers run sequentially, but sublinearly with width because wider matrix multiplications parallelize well on modern accelerators. For inference critical applications, a wider shallower model often serves traffic faster than a narrower deeper one at the same parameter count.
The theoretical motivation for studying width is the universal approximation theorem. The classical version, proved independently by George Cybenko in 1989 for sigmoidal activations and by Kurt Hornik, Maxwell Stinchcombe, and Halbert White also in 1989 for more general activations, states that a feed forward network with one hidden layer can approximate any continuous function on a compact set to arbitrary accuracy, provided the hidden layer is wide enough. Hornik extended this in 1991 to show the property comes from the multilayer architecture itself rather than the specific activation.
The theorem does not bound how wide "wide enough" actually is. For a hard target function, the required hidden units can grow exponentially in the input dimension; the theorem guarantees representability, not practicality.
A dual line of work studies the arbitrary depth version. For ReLU networks approximating any Lebesgue integrable function on $\mathbb{R}^n$, Lu et al. (2017) showed that width $n + 4$ suffices, and later work by Park, Yun, Lee, and Shin (2021) and Cai (2023) tightened the bound for continuous functions to a minimum width of $\max{n + 1, m}$, where $n$ is the input dimension and $m$ is the output dimension.
Width changes the geometry of training. Signal propagation depends on how variance flows through layers, and as width grows, naive initialization can cause activations to explode or collapse. The classical fix is variance preserving initialization such as He or Xavier, which scales weights by $1 / \sqrt{\text{fan_in}}$ to keep pre activation variance constant.
A deeper phenomenon is that as width approaches infinity, training dynamics simplify. In the standard parameterization, this limit is the neural tangent kernel (NTK) regime studied by Jacot et al. (2018), where the network behaves like a linear model in feature space and weights move only infinitesimally during training. The NTK regime captures something real about very wide networks but rules out feature learning, since the features stay fixed at initialization.
Greg Yang and colleagues at Microsoft Research introduced the maximal update parameterization, abbreviated muP, to address this issue. The foundation appears in Yang's Tensor Programs V paper (Tuning Large Neural Networks via Zero Shot Hyperparameter Transfer), presented at NeurIPS 2021 and posted to arXiv in March 2022. Collaborators included researchers at OpenAI.
muP rescales initialization variance and per layer learning rates so that activations, gradients, and parameter updates remain order one regardless of width. Three properties hold in the muP limit:
The practical payoff is hyperparameter transfer, called muTransfer. Because muP networks of different widths share similar training dynamics, optimal hyperparameters tuned on a small proxy carry over to a larger model without retuning. Yang and collaborators tuned a 40 million parameter proxy and applied the result to a 6.7 billion parameter model, beating the GPT-3 paper's numbers at that scale while spending only about 7% of pretraining compute on tuning. Microsoft released a PyTorch package called mup (installed with pip install mup) and applied via set_base_shapes() with a smaller base as the reference shape. EleutherAI and Cerebras have published practitioner guides, and unit scaled variants such as u-muP from Graphcore extend the approach to mixed precision.
A practical starting point is a reference architecture in the same family. For transformer language models, BERT Base, GPT-2 Small, or Llama 3 8B are common reference points with hidden sizes 768, 768, and 4096. Typical adjustments:
The Chinchilla scaling laws of Hoffmann et al. (2022) prescribe how parameter count should grow with training tokens but do not specify the split between width and depth. Compute optimal models tend to keep an approximately constant aspect ratio as scale grows, but the exact ratio remains an open research question.
Imagine a group of friends trying to figure out what is in a picture. Each friend notices one thing. Width is the number of friends in the group. Two friends miss most of the picture. A thousand friends notice everything but take forever to compare notes.
In a language model like ChatGPT, the "friends" are numbers in a giant list (the hidden vector), and the length of that list is the width. GPT-3 uses lists with 12,288 numbers per word; Llama 3 8B uses 4,096. A wider list helps the model remember more about each word, but everything else has to be bigger too, which is why these systems need racks of GPUs.