muP (Maximal Update Parametrization)

muP, short for Maximal Update Parametrization (also written μP or mu-P), is a parametrization scheme for deep neural networks in which a small set of optimization hyperparameters, most importantly the learning rate, remains approximately optimal when transferred from a small proxy model to a much larger target model of the same architecture family[^1][^2]. The scheme was introduced by Greg Yang and Edward J. Hu as part of the Tensor Programs theoretical series, and it underpins a practical recipe called muTransfer in which practitioners sweep hyperparameters on a network of (for example) 40 million parameters and reuse the resulting settings unchanged on a network of several billion parameters[^2][^3]. muP is derived from a width-limit analysis of feature learning: a parametrization is "maximal" when every layer continues to update its features (rather than collapsing into a kernel regression) as the width tends to infinity[^1]. The reference implementation, mup, is published by Microsoft Research and is installable via pip install mup[^4]. The methodology has been adopted by groups training large transformers, including Cerebras Systems (Cerebras-GPT, with muP variants up to 2.7B parameters)[^5], and the GPT-4 technical report from OpenAI describes a related approach in which optimization infrastructure is built to behave predictably across many orders of magnitude of scale[^6].

Background

Hyperparameter tuning is one of the most expensive activities in modern deep learning. A practitioner who wants to train a several-billion-parameter language model cannot reasonably afford to sweep dozens of candidate learning rates at full scale, because each run consumes enormous amounts of compute. The empirical workaround used for years was to scale the learning rate by hand using rules of thumb (typically: decrease the learning rate as the model grows) and accept that the resulting hyperparameters might not be optimal. This problem becomes more acute as the gap between proxy-model tuning and target-model deployment widens, because the optimal learning rate under the conventional (PyTorch-default) "standard parametrization" (SP) generally shifts with model width[^2][^7].

The theoretical roots of muP lie in the study of infinite-width neural networks. Two limit regimes had been well known prior to muP. In the Neural Tangent Kernel (NTK) regime, a network with carefully chosen initialization scales behaves, at infinite width, like a fixed kernel-regression model; gradient descent in this limit becomes kernel gradient descent and the network does not learn its internal representations. In the so-called mean-field regime studied in two-layer networks, the network does learn features but the analysis does not extend straightforwardly to deeper architectures[^1]. Greg Yang and Edward J. Hu, in "Feature Learning in Infinite-Width Neural Networks" (also known as Tensor Programs IV; arXiv:2011.14522, ICML 2021), provided a more general classification of parametrizations[^1][^8]. They derived a "Dynamical Dichotomy Theorem" stating that, within a broad family of stable parametrizations, any choice either admits feature learning or has infinite-width training dynamics equivalent to kernel gradient descent, but not both[^1][^8]. The set of feature-learning parametrizations forms a face of a polyhedron in the parameter-scaling space; muP is identified as a particular vertex of this set, the unique parametrization in which the contribution of every layer to feature updates remains of order one as width grows[^1][^8].

The companion paper, "Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer" (arXiv:2203.03466, NeurIPS 2021), by Greg Yang, Edward J. Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao, made the practical observation that this width-independent feature-learning property also leads to width-independent optimal hyperparameters[^2]. In particular, the optimal learning rate under muP, as a function of width, converges to a nonzero constant in the infinite-width limit, providing a theoretical underpinning for transferring it from a small model to a large one[^2][^7]. The authors verified this on Transformers and ResNets and reported that, by transferring hyperparameters from a 13M-parameter proxy to a 350M BERT-large, and from a 40M-parameter proxy to a 6.7B GPT-3 sized model, the resulting full-scale models outperformed the published baselines while consuming only a small fraction of the original tuning compute[^2].

Standard parametrization versus muP

To understand what muP actually changes, it helps to compare it to "standard parametrization" (SP), the default behavior of common frameworks. In SP, the initialization standard deviation of a hidden weight matrix typically scales as 1/sqrt(fan_in), and the learning rate is treated as a single scalar that applies uniformly to every weight tensor[^7][^9]. SP describes only the initialization variance; the learning rate is left to the practitioner[^9]. As the network is widened, the per-coordinate magnitude of activations and updates under SP does not stay constant: hidden activations can grow or shrink with width, and the optimal learning rate must be re-tuned for each new model size[^2][^9].

muP modifies SP by introducing three coordinated changes[^2][^4][^9].

The initialization variance of hidden weights is scaled as a function of layer width in a way that depends on the layer type (input, hidden, or output).
The learning rate per layer is scaled by an explicit factor that depends on width, again with separate rules for input embeddings, hidden weights, and output (readout) layers, and the rule differs between SGD and adaptive optimizers like Adam.
Multiplicative scalars (often called alpha-multipliers) are introduced on the input embedding, on the output logits, and on the attention logits, replacing the conventional 1/sqrt(d_head) attention scaling with 1/d_head in attention blocks[^9].

The combined effect is that, under muP, the average magnitude of activations at every layer is approximately independent of width across training (verifiable empirically via the "coordinate check" test), and the optimal learning rate stays approximately fixed as width changes[^4][^9]. The Cerebras and EleutherAI practitioners' guide makes the SP/muP comparison concrete for a decoder-only transformer trained with Adam: relative to a baseline width d_base, with multiplier m_d = d/d_base, hidden weights are initialized with variance sigma_base^2 / m_d, and the Adam learning rate for hidden weights is scaled as eta_base / m_d; embeddings keep the SP initialization and learning rate but acquire a forward-pass alpha-input multiplier; the output (logit) layer acquires an alpha-output multiplier and its weights are scaled by 1/m_d; biases and layer-norm parameters need no additional corrections[^9].

A useful intuition is that SP focuses on preventing forward-pass activations from exploding or vanishing at initialization, but it does not constrain what happens after several optimization steps; muP, by contrast, asks for the additional property that each weight tensor contributes an update of order one (in coordinate magnitude) to the change in features at every layer at every step, which forces a particular set of width-dependent scalings on initialization, learning rate, and multipliers[^1][^9].
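The attention-logit change listed among the three modifications can be motivated with a small calculation. The sketch below assumes the extreme case in which a query and a key are perfectly aligned, a stand-in for the correlation that develops during training; the function name and the toy vectors are illustrative and not taken from any library.

```python
import math


def attention_logit(q, k, scale):
    """Scaled dot product between one query and one key vector."""
    return scale * sum(qi * ki for qi, ki in zip(q, k))


for d_head in (16, 64, 256):
    # Toy assumption: query and key are identical all-ones vectors,
    # so their dot product equals d_head.
    q = [1.0] * d_head
    k = [1.0] * d_head
    sp_logit = attention_logit(q, k, 1.0 / math.sqrt(d_head))
    mup_logit = attention_logit(q, k, 1.0 / d_head)
    print(f"d_head={d_head:4d}  1/sqrt(d) scaling: {sp_logit:6.2f}  "
          f"1/d scaling: {mup_logit:4.2f}")
```

With aligned vectors the 1/sqrt(d_head) logit grows like sqrt(d_head), while the 1/d_head logit stays at one for every head dimension, which is the width-independence muP asks for.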

Another way to read the difference is in terms of the so-called abc-parametrization framework introduced in Tensor Programs IV. Each layer's weight matrix is associated with three exponents (a, b, c) describing, respectively, its multiplier in the forward pass, its initialization scale, and the scaling of its learning rate, all expressed as powers of the width. Both NTK parametrization and SP correspond to particular points in this three-exponent space, but those points have the property that the resulting infinite-width limit is a kernel and the features of intermediate layers freeze during training[^1]. muP corresponds to the unique stable choice in which the contribution of the weight update to the change in features remains of order one at every layer, including the embeddings and the readout, simultaneously and for all widths. The "maximal" qualifier refers to this property: any larger contribution would blow up, and any smaller contribution would fail to learn features at the relevant layer[^1][^8].
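As a rough sketch of the notation, assume the MLP-with-SGD setting of Tensor Programs IV, with width n and L hidden layers. The symbols n, w, and η follow that paper's convention as recalled here and are not defined elsewhere in this article. The muP exponent values in the second display are likewise stated from memory and should be checked against the paper before being relied on.

```latex
% abc-parametrization for layer l (sketch; n = width, \eta = base learning rate)
W^{l} = n^{-a_l}\, w^{l}, \qquad
w^{l}_{\alpha\beta} \sim \mathcal{N}\!\left(0,\; n^{-2 b_l}\right), \qquad
\text{learning rate} = \eta\, n^{-c}.

% muP for an MLP with L hidden layers trained with SGD (assumed values)
a_l =
\begin{cases}
  -\tfrac{1}{2} & l = 1, \\
  0             & 2 \le l \le L, \\
  \tfrac{1}{2}  & l = L + 1,
\end{cases}
\qquad
b_l = \tfrac{1}{2} \ \text{for all } l,
\qquad
c = 0.
```

Here a governs the forward multiplier, b the initialization scale, and c the learning rate.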

Scaling rules summary

The following table summarizes one common formulation of the muP scaling rules for a transformer trained with Adam, against a baseline width d_base, where m_d = d/d_base is the width multiplier[^9].

Layer	Init variance under muP	Adam LR under muP	Forward multiplier
Input embedding	sigma_base^2	eta_base	alpha_input
Hidden (e.g., Q, K, V, FFN)	sigma_base^2 / m_d	eta_base / m_d	(none)
Attention logits	(standard)	(standard)	1/d_head instead of 1/sqrt(d_head)
Output / readout	(standard)	(standard)	alpha_output / m_d

Different presentations of muP differ slightly in where alpha-multipliers are placed (on the operation versus the weight), but all amount to the same width-asymptotic prescription[^4][^10].
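The table can be turned into a small helper that computes per-layer settings for a given width. This is a minimal sketch rather than the API of any library: the names LayerRule, mup_adam_rules, and attention_scale are hypothetical, and the "(standard)" entries for the output layer are read here as "keep the base values", which is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerRule:
    init_variance: float       # variance of the weight initialization
    adam_lr: float             # Adam learning rate for this parameter group
    forward_multiplier: float  # scalar applied in the forward pass


def mup_adam_rules(d, d_base, sigma_base, eta_base,
                   alpha_input=1.0, alpha_output=1.0):
    """Per-layer muP settings for width d relative to baseline width d_base."""
    m_d = d / d_base  # width multiplier from the table
    return {
        "input_embedding": LayerRule(sigma_base ** 2, eta_base, alpha_input),
        "hidden": LayerRule(sigma_base ** 2 / m_d, eta_base / m_d, 1.0),
        # Assumption: "(standard)" is interpreted as the unscaled base values.
        "output": LayerRule(sigma_base ** 2, eta_base, alpha_output / m_d),
    }


def attention_scale(d_head):
    """Attention-logit multiplier under muP: 1/d_head, not 1/sqrt(d_head)."""
    return 1.0 / d_head


# Illustrative numbers only: a 4x wider model than the tuned baseline.
example = mup_adam_rules(d=1024, d_base=256, sigma_base=0.02, eta_base=6e-3)
for name, rule in example.items():
    print(name, rule)
print("attention scale for d_head=64:", attention_scale(64))
```

In a real training script the three entries would typically map to separate optimizer parameter groups, so that each group receives its own learning rate.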

How muTransfer works in practice

muTransfer is the workflow that exploits muP[^2][^3]. The user begins by writing the target model architecture in a muP-compatible form, typically by using a library like mup or by replacing the readout layer with MuReadout and using MuAdam or MuSGD as the optimizer wrapper. The same model code is then instantiated at a much smaller "proxy" width (for example, 256 or 512 hidden units instead of the target's 8192 or larger), and set_base_shapes() is called to declare which dimensions are being scaled[^4].

The practitioner sweeps the desired hyperparameters (learning rate, weight initialization scale, alpha-multipliers, and any optimizer constants such as Adam betas) on the proxy model. Because the optimal hyperparameters are by construction approximately width-invariant under muP, the values found on the proxy can then be plugged into the full-size target model without further tuning, hence the term "zero-shot" hyperparameter transfer[^2][^3]. Microsoft Research reports that this yields approximately a 10x reduction in the cost of hyperparameter tuning relative to direct sweeps at full size, and the original muTransfer paper reports a tuning cost equivalent to roughly 7% of the pretraining cost of a 6.7B parameter GPT-3 sized model when the proxy is 40M parameters[^2][^3].

The Microsoft Research blog post accompanying the paper notes that the technique was verified across widths ranging from 128 to 4,096 hidden units and depths from 2 to 32 layers on transformers, and was applied to a 6.7B GPT-3 sized run for which the muTransfer hyperparameters yielded performance "comparable to GPT-3 13B" from the original GPT-3 paper at roughly half the parameter count[^3]. The same blog quotes Colin Raffel of the University of North Carolina describing muP as "an impressive step toward removing some of the black magic from scaling up neural networks"[^3].

Two diagnostic tools

The mup reference implementation ships two diagnostic procedures that practitioners use to validate that muP has been applied correctly[^4][^9].

The first is the "coordinate check" or "coord check". The user trains the model at several different widths for a small number of steps (typically about ten) and records the per-coordinate magnitude of activations in each layer at each step. Under a correct muP implementation, these magnitudes should be approximately equal across widths; under SP, they typically diverge or vanish[^4][^9].
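One way to turn the coordinate check into a pass/fail number is to fit the slope of log(magnitude) against log(width) for a given layer and step. The sketch below assumes the magnitudes have already been recorded; the helper names, the tolerance, and the sample values are all hypothetical and are not part of the mup library.

```python
import math


def width_slope(magnitudes_by_width):
    """Least-squares slope of log(magnitude) versus log(width)."""
    xs = [math.log(w) for w in magnitudes_by_width]
    ys = [math.log(m) for m in magnitudes_by_width.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def passes_coord_check(magnitudes_by_width, tol=0.1):
    """True when the magnitude is roughly flat in width (slope near zero)."""
    return abs(width_slope(magnitudes_by_width)) <= tol


# Made-up recordings of mean |activation| for one layer at one step.
flat = {256: 0.98, 512: 1.01, 1024: 1.00, 2048: 0.99}     # muP-like
growing = {256: 1.00, 512: 1.41, 1024: 2.00, 2048: 2.83}  # grows like sqrt(width)

print("flat slope:", round(width_slope(flat), 3), passes_coord_check(flat))
print("growing slope:", round(width_slope(growing), 3), passes_coord_check(growing))
```

A slope near zero means the layer's coordinates are width-independent. A slope near 0.5 or 1 suggests a missing width factor somewhere in the initialization, learning rate, or multipliers.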

The second is the "muTransfer test": run a small random search over learning rates at the proxy width, then again at a wider width, and verify that the optimal learning rate is approximately the same in both sweeps. If the optimum drifts with width, the muP implementation likely contains an error[^9].
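The comparison of the two sweeps can be sketched as follows, measuring how far the best learning rate moves in units of doublings. The function names and the loss values are hypothetical; real sweeps would supply measured validation losses.

```python
import math


def best_lr(sweep):
    """Learning rate with the lowest loss in a {learning_rate: loss} sweep."""
    return min(sweep, key=sweep.get)


def optimum_shift_in_doublings(proxy_sweep, wide_sweep):
    """Absolute log2 ratio between the best learning rates of two sweeps."""
    return abs(math.log2(best_lr(wide_sweep) / best_lr(proxy_sweep)))


# Made-up sweep results for illustration only.
proxy = {1e-3: 3.90, 2e-3: 3.70, 4e-3: 3.60, 8e-3: 3.80}
wide_stable = {1e-3: 3.50, 2e-3: 3.30, 4e-3: 3.20, 8e-3: 3.60}
wide_drifted = {1e-3: 3.25, 2e-3: 3.40, 4e-3: 3.90, 8e-3: 5.00}

print("stable shift:", optimum_shift_in_doublings(proxy, wide_stable))
print("drifted shift:", optimum_shift_in_doublings(proxy, wide_drifted))
```

A shift of zero or one grid step is usually within sweep noise. A shift that grows steadily with width is the drift this test is meant to catch.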

Adoption and implementations

Microsoft's mup library

The reference implementation, hosted at github.com/microsoft/mup, provides PyTorch wrappers for the most common layer types and optimizers: MuReadout and MuSharedReadout for the output projection, MuAdam and MuSGD for the optimizer, and set_base_shapes() for declaring the scaling axes[^4]. Examples cover MLPs, transformers, ResNets, and a separate mutransformers submodule that ports the Hugging Face Transformers library to muP[^4]. The library is compatible with PyTorch DistributedDataParallel-based training[^4].

Cerebras-GPT

In April 2023, Cerebras Systems released Cerebras-GPT, a family of open compute-optimal language models trained on the Cerebras Wafer-Scale Cluster, with sizes ranging from 111M to 13B parameters[^5]. The paper, by Nolan Dey and colleagues at Cerebras, reports that for the muP variants of the family the team followed the muTransfer recipe: they first ran a hyperparameter sweep on a 40M-parameter muP model and then transferred the resulting learning rate, rescaled according to the muP rules, to models of up to 2.7B parameters[^5]. The reported result was that, averaged across model sizes from 111M to 2.7B, the muP models achieved a 0.43% lower Pile test loss and a 1.7% higher average downstream-task accuracy than the matched SP models. The muP family also showed about 16x lower run-to-run standard deviation in its fitted scaling-law parameters (0.04% versus 0.66%), enabling more predictable extrapolation to larger sizes[^5]. Cerebras's training documentation contains tutorials for applying muP to its model zoo[^11].

GPT-4 infrastructure

The GPT-4 technical report, released by OpenAI in March 2023, devotes a substantial section to "predictable scaling," in which the authors describe building optimization and training infrastructure designed to behave predictably across many orders of magnitude of compute, in particular allowing aspects of GPT-4's final loss to be extrapolated from training runs using as little as one ten-thousandth of the compute of the final model[^6]. The report cites Yang et al.'s work on hyperparameter transfer as part of this scaling infrastructure[^6]. Several of the authors of the original muTransfer paper (Babuschkin, Sidor, Farhi, Ryder, Pachocki) were affiliated with OpenAI at the time of the paper or subsequently[^2].

Other follow-on work

Beyond the reference implementation, muP has been picked up in a number of subsequent papers, libraries, and external products. The mutransformers extension covers BERT, RoBERTa, and GPT-2 architectures in the Hugging Face style[^4]. Diffusion researchers have adapted muP to scale diffusion transformers efficiently (arXiv:2505.15270)[^12]. Other recent extensions include a treatment for Mixture of Experts models (arXiv:2508.09752)[^13] and the depth-transfer extension known as Depth-muP, discussed below.

Tensor Programs VI and Depth-muP

The original muP paper handles transfer across width but does not directly address transfer across depth. "Tensor Programs VI: Feature Learning in Infinite-Depth Neural Networks" (arXiv:2310.02244, October 2023), by Greg Yang, Dingli Yu, Chen Zhu, and Soufiane Hayou, classifies depthwise parametrizations of deep residual networks under the infinite-width-then-infinite-depth limit and identifies a unique optimal scheme called Depth-muP[^14]. The recipe extends muP by rescaling each residual block (and parameter update within the block) by 1/sqrt(L) where L is the network's depth, which (for blocks of unit depth) admits depthwise hyperparameter transfer in the same sense that muP admits widthwise transfer[^14]. The paper notes that, in current transformer architectures whose residual blocks contain attention and feedforward sub-blocks of nontrivial depth, the recipe has further fundamental limitations, and the analysis identifies absolute-value activations as a special case that maximizes "feature diversity" across depth[^14]. A related contemporary paper on depthwise hyperparameter transfer in residual networks is Bordelon, Noci, Li, and Pehlevan, "Depthwise Hyperparameter Transfer in Residual Networks" (arXiv:2309.16620)[^15].
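The role of the 1/sqrt(L) factor can be illustrated with a back-of-the-envelope variance calculation. The sketch assumes that the L residual branches contribute independent, zero-mean outputs of equal variance, which is a simplification that holds at best at initialization; the function names are hypothetical.

```python
import math


def branch_multiplier(depth):
    """Depth-muP style multiplier applied to each residual branch."""
    return 1.0 / math.sqrt(depth)


def unscaled_variance(depth, block_variance=1.0):
    """Variance of the sum of `depth` independent branch outputs, no scaling."""
    return depth * block_variance


def scaled_variance(depth, block_variance=1.0):
    """Same sum when every branch is multiplied by 1/sqrt(depth)."""
    return depth * branch_multiplier(depth) ** 2 * block_variance


for depth in (4, 16, 64):
    print(f"L={depth:3d}  unscaled: {unscaled_variance(depth):5.1f}  "
          f"scaled: {scaled_variance(depth):4.2f}")
```

Without the multiplier the accumulated variance grows linearly in depth. With it, the total stays at the per-block value for every L, which is the kind of depth-independence the recipe aims for.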

u-muP (Unit-Scaled muP)

In July 2024, researchers at Graphcore Research and Aleph Alpha published "u-muP: The Unit-Scaled Maximal Update Parametrization" (arXiv:2407.17465, ICLR 2025), by Charlie Blake and colleagues[^16]. The paper combines muP with the Unit Scaling technique to design models in which activations, weights, and gradients all begin training with a scale of approximately one, which makes the resulting models train robustly in FP8 low-precision arithmetic[^16]. The authors argue that the resulting parametrization removes a number of incidental hyperparameters (such as base-shape choice and initialization scale), makes the remaining hyperparameters more interpretable and less interdependent, and reaches a lower loss than comparable muP models[^16]. u-muP is available as part of Graphcore Research's unit-scaling PyTorch library[^16].

Significance

muP matters because it changes the economics of training very large models in two distinct ways.

First, it makes hyperparameter tuning cheap relative to pretraining. A 10x or larger reduction in tuning cost (relative to direct sweeps at full size) is reported in the original muTransfer paper, and Cerebras reports tuning a 40M proxy and successfully transferring to 2.7B, which corresponds to roughly a 70x parameter ratio[^2][^3][^5]. For models in the tens of billions of parameters and larger, the alternative (direct sweeps at full scale) is generally infeasible, so muP turns "tune the small model and transfer" from a heuristic into a principled procedure with theoretical backing[^1][^2].

Second, muP makes scaling laws more reliable. When optimal hyperparameters drift with model size under SP, fitting a scaling law to small-model runs and extrapolating to large models can systematically misjudge the achievable loss, because the small runs were not at their own optimum or the large runs are not at theirs. Cerebras observes that under muP the fitted scaling-law parameters become substantially less noisy, which improves the accuracy of extrapolation[^5]. The same logic underlies the predictable-scaling section of the GPT-4 technical report, which presents the ability to forecast frontier-model loss from much smaller proxy runs as one of the project's core engineering contributions[^6].

Beyond these two economic effects, muP also makes a more theoretical contribution: it gives a clean operational definition of "maximal feature learning" in the infinite-width limit, which provides a reference point against which other parametrizations (NTK, mean-field, hybrid schemes) can be compared, and against which the choice of optimizer, normalization, and depth can be analyzed[^1][^14].

Limitations and criticisms

muP is not a closed chapter. Several empirical and theoretical critiques have appeared since 2022.

Lucas Lingle, in "A Large-Scale Exploration of mu-Transfer" (arXiv:2404.05728, April 2024), evaluated mu-Transfer on decoder-only transformer language models with up to 1.2B parameters trained on 33B tokens, plus a large-scale study at 10B parameters and 190B tokens, and found that for many of the most common architectural choices the method gives near-optimal learning rates, but for some configurations it does not[^17]. Lingle identifies several practical pitfalls including trainable normalization gains, particular optimizer choices, and per-layer bias inclusion, all of which can disrupt the muP scaling law if not aligned with the prescription[^17]. The paper recommends a number of small architectural adjustments to make mu-Transfer reliable in practice for transformer language models.

In October 2025, the paper "Weight Decay may matter more than muP for Learning Rate Transfer in Practice" (arXiv:2510.19093) argued that for the bulk of typical large-language-model training runs, it is decoupled weight decay rather than the muP parametrization itself that stabilizes the update dynamics of internal representations across widths, facilitating learning rate transfer[^18]. The paper does not contradict the theoretical content of muP but suggests that the practical "where does learning-rate transfer come from?" attribution is more subtle than it first appears: muP guarantees the right scaling in the infinite-width limit, but in finite-width practice the contribution of weight decay is significant, and indeed a properly tuned weight-decay setting can produce learning-rate transfer under standard parametrization as well[^18].

A related line of empirical work argues that all parametrizations (not just muP) can be induced to achieve hyperparameter transfer if the per-layer learning rate prescription is right, and that for some setups a tuned per-layer SP recipe can match or exceed muP[^7][^19]. Other contemporary papers have studied learning-rate transfer across the "token horizon" (the number of training tokens) and observed that the optimal learning rate is not invariant to training duration even when widthwise muP is applied; this has motivated further extensions and recipes[^20].

In addition to these empirical critiques, muP has several practical limitations[^9].

Architecture or training procedure changes can necessitate re-tuning; muP's guarantees are tightest when the proxy and the target share architecture, optimizer, and data distribution.
muP does not eliminate all sources of training instability; precision issues, numerical stability problems, and data quality problems are not addressed.
Effective transfer requires that proxy-model batch sizes be large enough to be in the same data-noise regime as the target model.
Several alpha-multipliers and a "base shape" must be set, and incorrect base-shape declarations silently break the scaling.

Finally, muP's theoretical derivations rely on assumptions about the geometric alignment of layer inputs, weights, and gradient updates that, in practical training runs, hold cleanly only at the start of training. The most recent theoretical work (Bordelon et al.; Ghosh et al., "Understanding the Mechanisms of Fast Hyperparameter Transfer", 2025) seeks to extend the analysis beyond initialization[^15][^21].

Comparison with adjacent techniques

muP is not the only attempt to make training large neural networks more predictable.

Scaling laws, beginning with Kaplan et al.'s "Scaling Laws for Neural Language Models"[^22] and continuing with the "Chinchilla" compute-optimal analysis by Hoffmann et al.[^23], focus on the question "given a compute budget, how should I split it between model size and tokens?", whereas muP focuses on the orthogonal question "how do I choose hyperparameters that remain optimal as I scale?". The two approaches are complementary: scaling laws assume that the hyperparameters used at each scale are reasonably optimal, and muP is one way to make that assumption justified[^5][^6].

NTK parametrization, in contrast to muP, is a stable parametrization whose infinite-width limit is a fixed kernel-regression model, so the network does not learn features. NTK is mathematically convenient for some analyses, but the corresponding training dynamics are not what large language models actually do; in particular, transfer learning and pretraining presuppose feature learning, which muP supplies and NTK does not[^1][^8].

Unit Scaling, the technique combined with muP in u-muP, is another scale-invariance discipline whose goal is to keep tensor magnitudes near one throughout training, primarily to enable low-precision (FP8) training[^16]. It is concerned with numerical scales, not with hyperparameter transfer per se, but as the u-muP paper notes, the two concerns mesh naturally.

References
