# muP (Maximal Update Parametrization)

> Source: https://aiwiki.ai/wiki/mup
> Updated: 2026-07-11
> Categories: Deep Learning, Training & Optimization
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

muP, short for Maximal Update Parametrization (also written μP or mu-P), is a parametrization scheme for deep [neural networks](/wiki/neural_network) in which a small set of optimization hyperparameters, most importantly the learning rate, transfers approximately unchanged from a small proxy model to a much larger target model of the same architecture family[^1][^2]. The scheme was introduced by Greg Yang and Edward J. Hu as part of the Tensor Programs theoretical series, and it underpins a practical recipe called muTransfer in which practitioners sweep hyperparameters on a network of (for example) 40 million parameters and reuse the resulting settings unchanged on a network of several billion parameters[^2][^3]. muP is derived from a width-limit analysis of feature learning: a parametrization is "maximal" when every layer continues to update its features as much as it can without diverging (rather than collapsing into a kernel regression) as the width tends to infinity[^1]. The reference implementation `mup` is published by Microsoft Research and is installable via `pip install mup`[^4]. The methodology has been adopted by groups training large [transformers](/wiki/transformer) including [Cerebras Systems](/wiki/cerebras) (Cerebras-GPT, with muP variants up to 2.7B parameters)[^5], and the [GPT-4](/wiki/gpt-4) technical report from [OpenAI](/wiki/openai) describes a related approach in which optimization infrastructure is built to behave predictably across many orders of magnitude of scale[^6].

## Background

Hyperparameter tuning is one of the most expensive activities in modern deep learning. A practitioner who wants to train a several-billion-parameter language model cannot reasonably afford to sweep dozens of candidate learning rates at full scale, because each run consumes enormous amounts of compute. The empirical workaround used for years was to scale the learning rate by hand using rules of thumb (typically: decrease the learning rate as the model grows) and accept that the resulting hyperparameters might not be optimal. This problem becomes more acute as the gap between proxy-model tuning and target-model deployment widens, because the optimal learning rate under the conventional ([PyTorch](/wiki/pytorch)-default) "standard parametrization" (SP) generally shifts with model width[^2][^7].

The theoretical roots of muP lie in the study of infinite-width neural networks. Two limit regimes had been well known prior to muP. In the Neural Tangent Kernel (NTK) regime, a network with carefully chosen initialization scales behaves, at infinite width, like a fixed kernel-regression model; gradient descent in this limit becomes kernel gradient descent and the network does not learn its internal representations. In the so-called mean-field regime studied in two-layer networks, the network does learn features but the analysis does not extend straightforwardly to deeper architectures[^1]. Greg Yang and Edward J. Hu, in "Feature Learning in Infinite-Width Neural Networks" (also known as Tensor Programs IV; arXiv:2011.14522, ICML 2021), provided a more general classification of parametrizations[^1][^8]. They derived a "Dynamical Dichotomy Theorem" stating that, within a broad family of stable parametrizations, any choice either admits feature learning or has infinite-width training dynamics equivalent to kernel gradient descent, but not both[^1][^8]. The set of feature-learning parametrizations forms a face of a polyhedron in the parameter-scaling space; muP is identified as a particular vertex of this set, the unique parametrization in which the contribution of every layer to feature updates remains of order one as width grows[^1][^8].

The companion paper, "Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer" (arXiv:2203.03466, NeurIPS 2021), by Greg Yang, Edward J. Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao, made the practical observation that this width-independent feature-learning property also leads to width-independent optimal hyperparameters[^2]. In particular, the optimal [learning rate](/wiki/learning_rate) under muP, as a function of width, converges to a nonzero constant in the infinite-width limit, providing a theoretical underpinning for transferring it from a small model to a large one[^2][^7]. The authors verified this on [Transformers](/wiki/transformer) and [ResNets](/wiki/resnet) and reported that, by transferring hyperparameters from a 13M-parameter proxy to a 350M [BERT](/wiki/bert)-large, and from a 40M-parameter proxy to a 6.7B GPT-3 sized model, the resulting full-scale models outperformed the published baselines while consuming only a small fraction of the original tuning compute[^2].

## Standard parametrization versus muP

To understand what muP actually changes, it helps to compare it to "standard parametrization" (SP), the default behavior of common frameworks. In SP, the initialization standard deviation of a hidden weight matrix typically scales as $$1/\sqrt{\text{fan\_in}}$$, and the [learning rate](/wiki/learning_rate) is treated as a single scalar that applies uniformly to every weight tensor[^7][^9]. SP describes only the initialization variance; the learning rate is left to the practitioner[^9]. As the network is widened, the per-coordinate magnitude of activations and updates under SP does not stay constant: hidden activations can grow or shrink with width, and the optimal learning rate must be re-tuned for each new model size[^2][^9].

muP modifies SP by introducing three coordinated changes[^2][^4][^9].

1. The initialization variance of hidden weights is scaled as a function of layer width in a way that depends on the layer type (input, hidden, or output).
2. The learning rate per layer is scaled by an explicit factor that depends on width, again with separate rules for input embeddings, hidden weights, and output (readout) layers, and the rule differs between SGD and adaptive optimizers like [Adam](/wiki/adam_optimizer).
3. Multiplicative scalars (often called alpha-multipliers) are introduced on the input embedding and on the output logits, and the attention logits in [attention](/wiki/attention) blocks are scaled by $$1/d_{\text{head}}$$ instead of the conventional $$1/\sqrt{d_{\text{head}}}$$[^9].
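One commonly given motivation for the $$1/d_{\text{head}}$$ attention scaling is that queries and keys become correlated during training, so their dot product grows like $$d_{\text{head}}$$ and not like $$\sqrt{d_{\text{head}}}$$. The sketch below illustrates this with a toy case that is an assumption of this example, not something taken from the cited sources: query and key coordinates are all $$\pm 1$$, and "trained" is modelled as the key being equal to the query. The helper name `attention_logits` is hypothetical.

```python
import math
import random


def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def attention_logits(d_head, aligned, rng):
    """Return (q.k / sqrt(d_head), q.k / d_head) for toy +/-1 vectors."""
    q = [rng.choice((-1.0, 1.0)) for _ in range(d_head)]
    # aligned=True models fully correlated q and k (k equals q);
    # aligned=False draws k independently, as at random initialization.
    k = list(q) if aligned else [rng.choice((-1.0, 1.0)) for _ in range(d_head)]
    raw = dot(q, k)
    return raw / math.sqrt(d_head), raw / d_head


rng = random.Random(0)
for d_head in (64, 256, 1024):
    sqrt_scaled, mup_scaled = attention_logits(d_head, aligned=True, rng=rng)
    # With aligned vectors the raw dot product equals d_head, so the
    # 1/sqrt(d_head) logit grows with width while the 1/d_head logit stays put.
    print(d_head, sqrt_scaled, mup_scaled)
```

In this idealized aligned case the $$1/\sqrt{d_{\text{head}}}$$ logit grows with head dimension while the $$1/d_{\text{head}}$$ logit stays constant, which is the width-independence muP asks for.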

The combined effect is that, under muP, the average magnitude of activations at every layer is approximately independent of width across training (verifiable empirically via the "coordinate check" test), and the optimal learning rate stays approximately fixed as width changes[^4][^9]. The Cerebras and EleutherAI practitioners' guide makes the SP/muP comparison concrete for a decoder-only transformer trained with Adam: relative to a baseline width $$d_{\text{base}}$$, with multiplier $$m_d = d/d_{\text{base}}$$, hidden weights are initialized with variance $$\sigma_{\text{base}}^2 / m_d$$, and the Adam learning rate for hidden weights is scaled as $$\eta_{\text{base}} / m_d$$; embeddings keep the SP initialization and learning rate but acquire a forward-pass alpha-input multiplier; the output (logit) layer acquires an alpha-output multiplier and its weights are scaled by $$1/m_d$$; biases and layer-norm parameters need no additional corrections[^9].

A useful intuition is that SP focuses on preventing forward-pass activations from exploding or vanishing at initialization, but it does not constrain what happens after several optimization steps; muP, by contrast, asks for the additional property that each weight tensor contributes an update of order one (in coordinate magnitude) to the change in features at every layer at every step, which forces a particular set of width-dependent scalings on initialization, learning rate, and multipliers[^1][^9].

Another way to read the difference is in terms of the so-called abc-parametrization framework introduced in Tensor Programs IV. Each layer's weight matrix is associated with three exponents (a, b, c) describing, respectively, its multiplier in the forward pass, its initialization scale, and the scaling of its learning rate, all expressed as powers of the width. Both NTK parametrization and SP correspond to particular points in this three-exponent space, but those points have the property that the resulting infinite-width limit is a kernel and the features of intermediate layers freeze during training[^1]. muP corresponds to the unique stable choice in which the contribution of the weight update to the change in features remains of order one at every layer, including the embeddings and the readout, simultaneously and for all widths. The "maximal" qualifier refers to this property: any larger contribution would blow up, and any smaller contribution would fail to learn features at the relevant layer[^1][^8].

### Scaling rules summary

The following table summarizes one common formulation of the muP scaling rules for a transformer trained with Adam, against a baseline width $$d_{\text{base}}$$, where $$m_d = d/d_{\text{base}}$$ is the width multiplier[^9].

| Layer | Init variance under muP | Adam LR under muP | Forward multiplier |
|---|---|---|---|
| Input embedding | $$\sigma_{\text{base}}^2$$ | $$\eta_{\text{base}}$$ | $$\alpha_{\text{input}}$$ |
| Hidden (e.g., Q, K, V, FFN) | $$\sigma_{\text{base}}^2 / m_d$$ | $$\eta_{\text{base}} / m_d$$ | (none) |
| Attention logits | n/a (no weights) | n/a (no weights) | $$1/d_{\text{head}}$$ instead of $$1/\sqrt{d_{\text{head}}}$$ |
| Output / readout | (standard) | (standard) | $$\alpha_{\text{output}} / m_d$$ |

Different presentations of muP differ slightly in where alpha-multipliers are placed (on the operation versus the weight), but all amount to the same width-asymptotic prescription[^4][^10].
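The table can be turned into a small helper that computes per-layer settings from values tuned at the base width. This is a minimal sketch, not the API of the `mup` library: the names `LayerRule` and `mup_adam_rules` are hypothetical, the numeric values are made up, and the readout row's "(standard)" entries are assumed here to mean the base variance and base learning rate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerRule:
    init_variance: float
    adam_lr: float
    forward_multiplier: float


def mup_adam_rules(d, d_base, sigma_base, eta_base,
                   alpha_input=1.0, alpha_output=1.0):
    """Per-layer muP settings for width d, relative to a tuned base width."""
    m_d = d / d_base  # width multiplier
    return {
        # Embedding: base init and LR, plus a forward multiplier.
        "input_embedding": LayerRule(sigma_base**2, eta_base, alpha_input),
        # Hidden matrices: variance and Adam LR both shrink by 1/m_d.
        "hidden": LayerRule(sigma_base**2 / m_d, eta_base / m_d, 1.0),
        # Readout: assumed to keep base init and LR, with the logits
        # multiplied by alpha_output / m_d in the forward pass.
        "readout": LayerRule(sigma_base**2, eta_base, alpha_output / m_d),
    }


# Hypothetical numbers: hyperparameters tuned at width 256, target width 4096.
rules = mup_adam_rules(d=4096, d_base=256, sigma_base=0.02, eta_base=1e-2)
for name, rule in rules.items():
    print(name, rule)
```

At `d == d_base` the multiplier is one and every rule reduces to the base values, which is a quick sanity check that the proxy model itself is unaffected by the rescaling. The attention-logit scaling depends on $$d_{\text{head}}$$ and is therefore left out of this helper.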

## How muTransfer works in practice

muTransfer is the workflow that exploits muP[^2][^3]. The user begins by writing the target model architecture in a muP-compatible form, typically by using a library like `mup` or by replacing the readout layer with `MuReadout` and using `MuAdam` or `MuSGD` as the optimizer wrapper. The same model code is then instantiated at a much smaller "proxy" width (for example, 256 or 512 hidden units instead of the target's 8192 or larger), and `set_base_shapes()` is called to declare which dimensions are being scaled[^4].

The practitioner sweeps the desired hyperparameters (learning rate, weight initialization scale, alpha-multipliers, and any optimizer constants such as Adam betas) on the proxy model. Because the optimal hyperparameters are by construction approximately width-invariant under muP, the values found on the proxy can then be plugged into the full-size target model without further tuning, hence the term "zero-shot" hyperparameter transfer[^2][^3]. Microsoft Research reports that this yields approximately a 10x reduction in the cost of hyperparameter tuning relative to direct sweeps at full size, and the original muTransfer paper reports a tuning cost equivalent to roughly 7% of the pretraining cost of a 6.7B parameter GPT-3 sized model when the proxy is 40M parameters[^2][^3].

The Microsoft Research blog post accompanying the paper notes that the technique was verified across widths ranging from 128 to 4,096 hidden units and depths from 2 to 32 layers on transformers, and was applied to a 6.7B GPT-3 sized run for which the muTransfer hyperparameters yielded performance "comparable to GPT-3 13B" from the original [GPT-3](/wiki/gpt-3) paper at roughly half the parameter count[^3]. The same blog quotes Colin Raffel of the University of North Carolina describing muP as "an impressive step toward removing some of the black magic from scaling up neural networks"[^3].

### Two diagnostic tools

The `mup` reference implementation ships two diagnostic procedures that practitioners use to validate that muP has been applied correctly[^4][^9].

The first is the "coordinate check" or "coord check". The user trains the model at several different widths for a small number of steps (typically about ten) and records the per-coordinate magnitude of activations in each layer at each step. Under a correct muP implementation, these magnitudes should be approximately equal across widths; under SP, they typically diverge or vanish[^4][^9].
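One way to turn a coordinate check into a pass/fail signal is to fit the slope of log-magnitude against log-width for a given layer and step, and flag slopes far from zero. The sketch below assumes the per-width magnitudes have already been recorded; the helper names, the tolerance of 0.1, and the example numbers are all hypothetical and are not part of the `mup` library.

```python
import math


def loglog_slope(widths, magnitudes):
    """Least-squares slope of log(magnitude) against log(width)."""
    xs = [math.log(w) for w in widths]
    ys = [math.log(m) for m in magnitudes]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    var = sum((x - x_mean) ** 2 for x in xs)
    return cov / var


def passes_coord_check(widths, magnitudes, tol=0.1):
    """True if activation magnitude is roughly flat in width (slope near 0)."""
    return abs(loglog_slope(widths, magnitudes)) <= tol


widths = [256, 512, 1024, 2048]
# Made-up measurements for one layer at one training step.
flat = [0.98, 1.01, 1.00, 0.99]      # what a correct muP setup should resemble
growing = [1.00, 1.41, 2.00, 2.83]   # magnitude growing like sqrt(width)

print(passes_coord_check(widths, flat))
print(passes_coord_check(widths, growing))
print(loglog_slope(widths, growing))
```

In practice the check would be repeated for every recorded layer and step, since a parametrization can look flat at initialization and still drift after a few updates.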

The second is the "muTransfer test": run a small random search over learning rates at the proxy width, then again at a wider width, and verify that the optimal learning rate is approximately the same in both sweeps. If the optima drift with width, a muP implementation error is likely present[^9].
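The comparison at the heart of the muTransfer test can be sketched as follows, assuming each sweep has been summarized as a mapping from learning rate to final loss. The function names and the loss values are hypothetical, chosen only to illustrate a stable optimum versus a drifting one.

```python
import math


def best_lr(sweep):
    """Return the learning rate with the lowest loss in a {lr: loss} sweep."""
    return min(sweep, key=sweep.get)


def optimum_shift_log2(proxy_sweep, wide_sweep):
    """Shift of the optimal LR from proxy to wide model, in powers of two."""
    return math.log2(best_lr(wide_sweep) / best_lr(proxy_sweep))


# Made-up sweep results on a power-of-two learning-rate grid.
proxy = {2**-10: 3.9, 2**-9: 3.7, 2**-8: 3.6, 2**-7: 3.8}
wide_stable = {2**-10: 3.5, 2**-9: 3.3, 2**-8: 3.2, 2**-7: 3.5}
wide_drifting = {2**-10: 3.3, 2**-9: 3.4, 2**-8: 3.9, 2**-7: 5.0}

print(optimum_shift_log2(proxy, wide_stable))    # no shift: optimum transfers
print(optimum_shift_log2(proxy, wide_drifting))  # optimum moved to a smaller LR
```

A shift near zero is the behavior expected under a correct muP implementation; a shift of several grid steps suggests that some layer's initialization, learning rate, or multiplier is not following the prescription.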

## Adoption and implementations

### Microsoft's mup library

The reference implementation, hosted at `github.com/microsoft/mup`, provides PyTorch wrappers for the most common layer types and optimizers: `MuReadout` and `MuSharedReadout` for the output projection, `MuAdam` and `MuSGD` for the optimizer, and `set_base_shapes()` for declaring the scaling axes[^4]. Examples cover MLPs, transformers, and ResNets, and a separate `mutransformers` package ports selected Hugging Face Transformers models to muP[^4]. The library is compatible with [PyTorch](/wiki/pytorch) `DistributedDataParallel`-based training[^4].

### Cerebras-GPT

In April 2023, [Cerebras Systems](/wiki/cerebras) released Cerebras-GPT, a family of open compute-optimal language models trained on the Cerebras Wafer-Scale Cluster, with sizes ranging from 111M to 13B parameters[^5]. The paper, by Nolan Dey and colleagues at Cerebras, reports that for the muP variants of the family the team followed the muTransfer recipe: they first ran a hyperparameter sweep on a 40M-parameter muP model and then transferred the resulting learning rate (along the muP scaling law) up to 2.7B parameters[^5]. The reported result was that, averaged across model sizes from 111M to 2.7B, the muP models achieved a 0.43% lower Pile test loss and a 1.7% higher average downstream-task accuracy than the matched SP models, with the muP family also showing about 16x lower run-to-run standard deviation in their fitted scaling-law parameters (0.04% versus 0.66%), enabling more predictable extrapolation to larger sizes[^5]. Cerebras's training documentation contains tutorials for applying muP to its model zoo[^11].

### GPT-4 infrastructure

The GPT-4 technical report, released by [OpenAI](/wiki/openai) in March 2023, devotes a substantial section to "predictable scaling," in which the authors describe building optimization and training infrastructure designed to behave predictably across many orders of magnitude of compute, in particular allowing aspects of GPT-4's final loss to be extrapolated from training runs using as little as one ten-thousandth of the compute of the final model[^6]. The report cites Yang et al.'s work on hyperparameter transfer as part of this scaling infrastructure[^6]. Several co-authors of the original muTransfer paper (Babuschkin, Sidor, Farhi, Ryder, and Pachocki) were affiliated with OpenAI at the time of the paper or subsequently[^2].

### Other follow-on work

Beyond the reference implementation, muP has been picked up in a number of subsequent papers, libraries, and external products. The mutransformers extension covers BERT, RoBERTa, and GPT-2 architectures in the Hugging Face style[^4]. Diffusion researchers have adapted muP to scale diffusion transformers efficiently (arXiv:2505.15270)[^12]. Other recent extensions include a treatment for [Mixture of Experts](/wiki/mixture_of_experts) models (arXiv:2508.09752)[^13] and the depth-transfer extension known as Depth-muP, discussed below.

### Tensor Programs VI and Depth-muP

The original muP paper handles transfer across width but does not directly address transfer across depth. "Tensor Programs VI: Feature Learning in Infinite-Depth Neural Networks" (arXiv:2310.02244, October 2023), by Greg Yang, Dingli Yu, Chen Zhu, and Soufiane Hayou, classifies depthwise parametrizations of deep residual networks under the infinite-width-then-infinite-depth limit and identifies a unique optimal scheme called Depth-muP[^14]. The recipe extends muP by rescaling each residual block (and parameter update within the block) by $$1/\sqrt{L}$$ where $$L$$ is the network's depth, which (for blocks of unit depth) admits depthwise hyperparameter transfer in the same sense that muP admits widthwise transfer[^14]. The paper notes that the recipe has fundamental limitations in current transformer architectures, whose residual blocks contain attention and feedforward sub-blocks of nontrivial depth, and the analysis identifies absolute-value activations as a special case that maximizes "feature diversity" across depth[^14]. A related contemporary paper on depthwise hyperparameter transfer in residual networks is Bordelon, Noci, Li, Hanin, and Pehlevan, "Depthwise Hyperparameter Transfer in Residual Networks" (arXiv:2309.16620)[^15].
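A toy variance count may help show why a $$1/\sqrt{L}$$ branch multiplier is a natural choice. The sketch assumes each residual block adds an independent, zero-mean contribution of fixed variance to the residual stream; this independence assumption is an idealization introduced for this example and is not the paper's analysis, and `residual_stream_variance` is a hypothetical helper.

```python
import math


def residual_stream_variance(depth, branch_multiplier,
                             input_variance=1.0, branch_variance=1.0):
    """Variance after `depth` blocks of x <- x + branch_multiplier * g,
    with each g assumed independent, zero-mean, of variance branch_variance."""
    return input_variance + depth * branch_multiplier**2 * branch_variance


for depth in (16, 64, 256):
    unscaled = residual_stream_variance(depth, 1.0)
    sqrt_scaled = residual_stream_variance(depth, 1.0 / math.sqrt(depth))
    over_scaled = residual_stream_variance(depth, 1.0 / depth)
    # unscaled grows linearly in depth, sqrt_scaled stays fixed, and
    # over_scaled falls back toward the input variance (branches vanish).
    print(depth, unscaled, sqrt_scaled, over_scaled)
```

Under this simplified model, a multiplier of one lets the residual stream grow with depth, $$1/L$$ makes the blocks' total contribution vanish, and $$1/\sqrt{L}$$ keeps it at order one regardless of depth.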

### u-muP (Unit-Scaled muP)

In July 2024, researchers at Graphcore Research and Aleph Alpha published "u-muP: The Unit-Scaled Maximal Update Parametrization" (arXiv:2407.17465, ICLR 2025), by Charlie Blake and colleagues[^16]. The paper combines muP with the Unit Scaling technique to design models in which activations, weights, and gradients all begin training with a scale of approximately one, which makes the resulting models train robustly in FP8 low-precision arithmetic[^16]. The authors argue that the resulting parametrization removes a number of incidental hyperparameters (such as base-shape choice and initialization scale), makes the remaining hyperparameters more interpretable and less interdependent, and reaches a lower loss than comparable muP models[^16]. u-muP is available as part of Graphcore Research's `unit-scaling` PyTorch library[^16].

## Significance

muP matters because it changes the economics of training very large models in two practical ways.

First, it makes hyperparameter tuning cheap relative to pretraining. A 10x or larger reduction in tuning cost (relative to direct sweeps at full size) is reported in the original muTransfer paper, and Cerebras reports tuning a 40M proxy and successfully transferring to 2.7B, which corresponds to roughly a 70x parameter ratio[^2][^3][^5]. For models in the tens of billions of parameters and larger, the alternative (direct sweeps at full scale) is generally infeasible, so muP turns "tune the small model and transfer" from a heuristic into a principled procedure with theoretical backing[^1][^2].
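The parameter ratios behind these transfers are simple to check. The snippet below only redoes the arithmetic on the model sizes quoted in this article; the dictionary labels are informal descriptions added for this example.

```python
def param_ratio(target_params, proxy_params):
    """How many times larger the target model is than the proxy."""
    return target_params / proxy_params


# (target, proxy) parameter counts as quoted in this article.
transfers = {
    "Cerebras-GPT muP, 40M to 2.7B": (2_700_000_000, 40_000_000),
    "GPT-3 sized run, 40M to 6.7B": (6_700_000_000, 40_000_000),
    "BERT-large, 13M to 350M": (350_000_000, 13_000_000),
}

for label, (target, proxy) in transfers.items():
    print(f"{label}: {param_ratio(target, proxy):.1f}x")
```

The Cerebras ratio comes out slightly under 70x, consistent with the "roughly 70x" figure above.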

Second, muP makes scaling laws more reliable. When optimal hyperparameters drift with model size under SP, fitting a [scaling law](/wiki/scaling_laws) to small-model runs and extrapolating to large models can systematically misjudge the achievable loss, because the small runs were not at their own optimum or the large runs are not at theirs. Cerebras observes that under muP the fitted scaling-law parameters become substantially less noisy, which improves the accuracy of extrapolation[^5]. The same logic underlies the predictable-scaling section of the GPT-4 technical report, which presents the ability to forecast frontier-model loss from much smaller proxy runs as one of the project's core engineering contributions[^6].

A third, more theoretical contribution is that muP gives a clean operational definition of "maximal feature learning" in the infinite-width limit, which provides a reference point against which other parametrizations (NTK, mean-field, hybrid schemes) can be compared, and against which the choice of optimizer, normalization, and depth can be analyzed[^1][^14].

## Limitations and criticisms

muP remains an active research topic, and several empirical and theoretical critiques have appeared since 2022.

Lucas Lingle, in "A Large-Scale Exploration of mu-Transfer" (arXiv:2404.05728, April 2024), evaluated mu-Transfer on decoder-only transformer language models with up to 1.2B parameters trained on 33B tokens, plus a large-scale study at 10B parameters and 190B tokens, and found that for many of the most common architectural choices the method gives near-optimal learning rates, but for some configurations it does not[^17]. Lingle identifies several practical pitfalls including trainable normalization gains, particular optimizer choices, and per-layer bias inclusion, all of which can disrupt the muP scaling law if not aligned with the prescription[^17]. The paper recommends a number of small architectural adjustments to make mu-Transfer reliable in practice for transformer language models.

In October 2025, the paper "Weight Decay may matter more than muP for Learning Rate Transfer in Practice" (arXiv:2510.19093) argued that for the bulk of typical large-language-model training runs, it is decoupled weight decay rather than the muP parametrization itself that stabilizes the update dynamics of internal representations across widths, facilitating learning rate transfer[^18]. The paper does not contradict the theoretical content of muP but suggests that the practical "where does learning-rate transfer come from?" attribution is more subtle than it first appears: muP guarantees the right scaling in the infinite-width limit, but in finite-width practice the contribution of [weight decay](/wiki/weight_decay) is significant, and indeed a properly tuned weight-decay setting can produce learning-rate transfer under standard parametrization as well[^18].

A related line of empirical work argues that all parameterizations (not just muP) can be induced to achieve hyperparameter transfer if the per-layer learning rate prescription is right, and that for some setups a tuned per-layer SP recipe can match or exceed muP[^7][^19]. Other contemporary papers have studied learning-rate transfer across the "token horizon" (the number of training tokens), and observed that the optimal learning rate is not invariant to training duration even when widthwise muP is applied; this has motivated further extensions and recipes[^20].

In addition to these empirical critiques, muP has several practical limitations[^9].

- Architecture or training procedure changes can necessitate re-tuning; muP's guarantees are tightest when the proxy and the target share architecture, optimizer, and data distribution.
- muP does not eliminate all sources of training instability; precision issues, numerical stability problems, and data quality problems are not addressed.
- Effective transfer requires that proxy-model batch sizes be large enough to be in the same data-noise regime as the target model.
- Several alpha-multipliers and a "base shape" must be set, and incorrect base-shape declarations silently break the scaling.

Finally, muP's theoretical derivations rely on assumptions about the geometric alignment of layer inputs, weights, and gradient updates that, in practical training runs, hold cleanly only at the start of training. The most recent theoretical work (Bordelon et al.; Ghosh et al., "Understanding the Mechanisms of Fast Hyperparameter Transfer", 2025) seeks to extend the analysis beyond initialization[^15][^21].

## Comparison with adjacent techniques

muP is not the only attempt to make training large [neural networks](/wiki/neural_network) more predictable.

[Scaling laws](/wiki/scaling_laws), beginning with the Kaplan et al. paper "Scaling Laws for Neural Language Models"[^22] and continuing with the "[Chinchilla](/wiki/chinchilla)" compute-optimal analysis by Hoffmann et al.[^23], focus on the question "given a compute budget, how should I split it between model size and tokens?", whereas muP focuses on the orthogonal question "how do I choose hyperparameters that remain optimal as I scale?". The two approaches are complementary: scaling laws assume that the hyperparameters used at each scale are reasonably optimal, and muP is one way to justify that assumption[^5][^6].

NTK parametrization, in contrast to muP, is a stable parametrization whose infinite-width limit is a fixed kernel-regression model in which the network does not learn features. It is mathematically convenient for some analyses, but the corresponding training dynamics are not what large [language models](/wiki/language_model) actually do; in particular, transfer learning and pretraining presuppose feature learning, which muP supplies and NTK does not[^1][^8].

Unit Scaling, the technique combined with muP in u-muP, is another scale-invariance discipline whose goal is to keep tensor magnitudes near one throughout training, primarily to enable [low-precision (FP8) training](/wiki/mixed_precision_training)[^16]. It is concerned with numerical scales, not with hyperparameter transfer per se, but as the u-muP paper notes, the two concerns mesh naturally.

## See also

- [scaling laws](/wiki/scaling_laws)
- [chinchilla scaling](/wiki/chinchilla_scaling)
- [scaling laws paper](/wiki/scaling_laws_paper)
- [learning rate](/wiki/learning_rate)
- [hyperparameter tuning](/wiki/hyperparameter_tuning)
- [adam optimizer](/wiki/adam_optimizer)
- [adamw](/wiki/adamw)
- [transformer](/wiki/transformer)
- [bert](/wiki/bert)
- [gpt-3](/wiki/gpt-3)
- [gpt-4](/wiki/gpt-4)
- [resnet](/wiki/resnet)
- [attention](/wiki/attention)
- [mixture of experts](/wiki/mixture_of_experts)
- [transfer learning](/wiki/transfer_learning)
- [weight decay](/wiki/weight_decay)
- [deep learning](/wiki/deep_learning)
- [pytorch](/wiki/pytorch)
- [cerebras](/wiki/cerebras)
- [openai](/wiki/openai)
- [microsoft research](/wiki/microsoft_research)

## References

[^1]: Greg Yang and Edward J. Hu, "Feature Learning in Infinite-Width Neural Networks" (Tensor Programs IV), arXiv preprint 2011.14522 (ICML 2021), 2020-11-30. https://arxiv.org/abs/2011.14522. Accessed 2026-05-20.
[^2]: Greg Yang, Edward J. Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao, "Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer", arXiv preprint 2203.03466 (NeurIPS 2021), 2022-03-07. https://arxiv.org/abs/2203.03466. Accessed 2026-05-20.
[^3]: Edward Hu, Greg Yang, and Jianfeng Gao, "muTransfer: A technique for hyperparameter tuning of enormous neural networks", Microsoft Research Blog, 2022-03-08. https://www.microsoft.com/en-us/research/blog/%C2%B5transfer-a-technique-for-hyperparameter-tuning-of-enormous-neural-networks/. Accessed 2026-05-20.
[^4]: Microsoft Research, "microsoft/mup: maximal update parametrization (muP)", GitHub repository, 2022. https://github.com/microsoft/mup. Accessed 2026-05-20.
[^5]: Nolan Dey, Gurpreet Gosal, Zhiming Chen, Hemant Khachane, William Marshall, Ribhu Pathria, Marvin Tom, and Joel Hestness, "Cerebras-GPT: Open Compute-Optimal Language Models Trained on the Cerebras Wafer-Scale Cluster", arXiv preprint 2304.03208, 2023-04-06. https://arxiv.org/abs/2304.03208. Accessed 2026-05-20.
[^6]: OpenAI, "GPT-4 Technical Report", arXiv preprint 2303.08774, 2023-03-15. https://arxiv.org/abs/2303.08774. Accessed 2026-05-20.
[^7]: Katie Everett, Lechao Xiao, Mitchell Wortsman, Alexander A. Alemi, Roman Novak, Peter J. Liu, Izzeddin Gur, Jascha Sohl-Dickstein, Leslie Pack Kaelbling, Jaehoon Lee, and Jeffrey Pennington, "Scaling Exponents Across Parameterizations and Optimizers", arXiv preprint 2407.05872, 2024-07-08. https://arxiv.org/abs/2407.05872. Accessed 2026-05-20.
[^8]: Greg Yang and Edward J. Hu, "Tensor Programs IV: Feature Learning in Infinite-Width Neural Networks", Proceedings of the 38th International Conference on Machine Learning (PMLR 139), 2021. https://proceedings.mlr.press/v139/yang21c.html. Accessed 2026-05-20.
[^9]: Nolan Dey, Quentin Anthony, and Joel Hestness, "The Practitioner's Guide to the Maximal Update Parameterization", EleutherAI Blog, 2024-09-19. https://blog.eleuther.ai/mutransfer/. Accessed 2026-05-20.
[^10]: Nolan Dey, Quentin Anthony, and Joel Hestness, "The Practitioner's Guide to the Maximal Update Parameterization", Cerebras Blog, 2024-09-23. https://www.cerebras.ai/blog/the-practitioners-guide-to-the-maximal-update-parameterization. Accessed 2026-05-20.
[^11]: Cerebras Systems, "Train an LLM using Maximal Update Parameterization", Cerebras Developer Documentation. https://docs.cerebras.net/en/2.1.1/wsc/how_to_guides/mup_docs.html. Accessed 2026-05-20.
[^12]: Anonymous authors, "Scaling Diffusion Transformers Efficiently via muP", arXiv preprint 2505.15270, 2025. https://arxiv.org/abs/2505.15270. Accessed 2026-05-20.
[^13]: Anonymous authors, "muP-Parametrization for Mixture of Experts", arXiv preprint 2508.09752, 2025. https://arxiv.org/abs/2508.09752. Accessed 2026-05-20.
[^14]: Greg Yang, Dingli Yu, Chen Zhu, and Soufiane Hayou, "Tensor Programs VI: Feature Learning in Infinite-Depth Neural Networks", arXiv preprint 2310.02244, 2023-10-03. https://arxiv.org/abs/2310.02244. Accessed 2026-05-20.
[^15]: Blake Bordelon, Lorenzo Noci, Mufan Bill Li, Boris Hanin, and Cengiz Pehlevan, "Depthwise Hyperparameter Transfer in Residual Networks: Dynamics and Scaling Limit", arXiv preprint 2309.16620, 2023-09-28. https://arxiv.org/abs/2309.16620. Accessed 2026-05-20.
[^16]: Charlie Blake, Constantin Eichenberg, Josef Dean, Lukas Balles, Luke Y. Prince, Bjorn Deiseroth, Andres Felipe Cruz-Salinas, Carlo Luschi, Samuel Weinbach, and Douglas Orr, "u-muP: The Unit-Scaled Maximal Update Parametrization", arXiv preprint 2407.17465 (ICLR 2025), 2024-07-24. https://arxiv.org/abs/2407.17465. Accessed 2026-05-20.
[^17]: Lucas Lingle, "A Large-Scale Exploration of mu-Transfer", arXiv preprint 2404.05728, 2024-04-08. https://arxiv.org/abs/2404.05728. Accessed 2026-05-20.
[^18]: Authors of "Weight Decay may matter more than muP for Learning Rate Transfer in Practice", arXiv preprint 2510.19093, 2025-10. https://arxiv.org/abs/2510.19093. Accessed 2026-05-20.
[^19]: "How To Scale" (companion page to Everett et al., 2024). https://howtoscalenn.github.io/. Accessed 2026-05-20.
[^20]: Authors of "Scaling Optimal LR Across Token Horizon", arXiv preprint 2409.19913 (ICLR 2025). https://arxiv.org/abs/2409.19913. Accessed 2026-05-20.
[^21]: Nikhil Ghosh and co-authors, "Understanding the Mechanisms of Fast Hyperparameter Transfer", arXiv preprint 2512.22768, 2025. https://arxiv.org/abs/2512.22768. Accessed 2026-05-20.
[^22]: Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei, "Scaling Laws for Neural Language Models", arXiv preprint 2001.08361, 2020-01-23. https://arxiv.org/abs/2001.08361. Accessed 2026-05-20.
[^23]: Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, et al., "Training Compute-Optimal Large Language Models" (Chinchilla), arXiv preprint 2203.15556, 2022-03-29. https://arxiv.org/abs/2203.15556. Accessed 2026-05-20.