Muon (optimizer)
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,194 words
Muon (short for MomentUm Orthogonalized by Newton-Schulz) is an optimizer designed for the two-dimensional weight matrices of neural network hidden layers. The algorithm performs a standard SGD update with Nesterov momentum, then post-processes the momentum buffer through a short matrix Newton-Schulz iteration that approximates the matrix sign function, replacing the singular values of the update with ones before applying it.[1][2] Muon was developed in 2024 by Keller Jordan together with Yuchen Jin, Vlado Boza, You Jiacheng, Franz Cesista, Laker Newhouse, and Jeremy Bernstein, originally as a tool for shaving seconds off the modded-nanogpt training-speed benchmark on eight NVIDIA H100 GPUs.[1][2] It has since been adopted at production scale: Moonshot AI's "Muon is Scalable for LLM Training" report (February 2025) demonstrated roughly 2x compute efficiency over AdamW when training a 16-billion-parameter Mixture-of-Experts model on 5.7 trillion tokens, and a stabilised variant called MuonClip was used to pre-train Kimi K2, a one-trillion-parameter MoE model, with zero loss spikes.[3][4][5]
| Property | Value |
|---|---|
| Full name | MomentUm Orthogonalized by Newton-Schulz |
| First public release | 4 October 2024 (Keller Jordan, X post and GitHub) |
| Core authors | Keller Jordan, Yuchen Jin, Vlado Boza, You Jiacheng, Franz Cesista, Laker Newhouse, Jeremy Bernstein |
| Target parameters | 2D hidden-layer weight matrices (not embeddings, classifier heads, or biases) |
| Default hyperparameters | momentum = 0.95, Nesterov = true, Newton-Schulz steps = 5, lr = 0.02, weight decay = 0 |
| Newton-Schulz quintic coefficients | (a, b, c) = (3.4445, -4.7750, 2.0315) |
| Reference write-up | Keller Jordan, "Muon: An optimizer for hidden layers in neural networks", kellerjordan.github.io, 2024 |
| Reference scaling paper | Liu et al., "Muon is Scalable for LLM Training", arXiv:2502.16982, 2025 |
| Reference implementation | github.com/KellerJordan/Muon (MIT) |
By 2024, AdamW had been the dominant optimizer for large language model pretraining for roughly half a decade, retaining the per-parameter moment estimates of Adam together with the decoupled weight decay introduced by Loshchilov and Hutter. Adam's per-parameter second-moment normalisation is a coordinate-wise operation: it treats every scalar in a matrix as an independent variable, ignoring the fact that the parameters in a linear layer collectively define a matrix that maps one activation space to another. A parallel research thread, beginning with Carlson et al. (2015) and continuing through Shazeer and Stern's Adafactor (2018), Shampoo (Gupta, Koren, and Singer, 2018), and more recently SOAP (Vyas et al., 2024), proposed updates that exploit this matrix structure, typically by preconditioning gradients with statistics of one or both factors. These methods were known to converge faster than Adam in many regimes, but they were memory-heavy, slow per step, and rarely adopted at frontier pretraining scale.[6]
Muon emerged from the modded-nanogpt speedrun, a community benchmark started in May 2024 that races to train GPT-2-small to a validation loss of 3.28 on the FineWeb dataset using eight NVIDIA H100 GPUs.[7] The starting point was Andrej Karpathy's llm.c GPT-2 replication, which reached the target in roughly 45 minutes; by April 2026, descendant runs had pushed the same task below 1.5 minutes.[7][8] Keller Jordan, a researcher who maintains the modded-nanogpt leaderboard, introduced Muon as Record 3 on 4 October 2024, dropping the wall-clock time from 31.4 minutes to 24.9 minutes by replacing AdamW on the hidden weights.[1][7] Muon has remained the optimizer of choice for every subsequent record on the leaderboard.[1]
The theoretical inspiration came from a parallel line of work by Jeremy Bernstein and Laker Newhouse on modular duality, which casts different optimizers as steepest descent under different choices of norm.[9] Bernstein and Newhouse argued that, under an RMS-to-RMS operator norm on dense linear layers, the optimal update direction is the matrix sign of the gradient, computed by replacing the singular values of the gradient with ones. Muon is essentially a fast, GPU-friendly approximation of this sign operation applied to the momentum buffer rather than to the raw gradient.[9][10]
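The claim that the matrix sign is the steepest-descent direction can be checked numerically on a small example. The sketch below is illustrative only: it uses the plain spectral norm (ignoring the RMS-to-RMS scale factor), computes the sign with an exact SVD, and the helper name `matrix_sign` is hypothetical. It compares how well two unit-spectral-norm candidates align with a random stand-in gradient.

```python
import numpy as np

def matrix_sign(G):
    """Return U @ V^T and the singular values from the reduced SVD of G."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt, s

rng = np.random.default_rng(0)
G = rng.standard_normal((6, 4))   # stand-in for a gradient matrix
O, s = matrix_sign(G)

# Every singular value of the sign matrix is 1 (up to rounding).
sign_singular_values = np.linalg.svd(O, compute_uv=False)

# The alignment <G, O> equals the sum of G's singular values (nuclear norm).
alignment_sign = np.sum(G * O)

# Baseline: the raw gradient rescaled to spectral norm 1.
alignment_raw = np.sum(G * (G / s[0]))

print(alignment_sign, alignment_raw)
```

Both candidates have spectral norm 1, but the sign matrix achieves the larger inner product with the gradient, which is the sense in which it is the steepest direction under this norm.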
The historical reference for orthogonalising gradients in deep learning is older than the modular-duality framework. Carlson and colleagues (2015) studied "preconditioned stochastic gradient" methods that whitened updates by the gradient covariance, and Flynn (2017) explored matrix-sign style updates for two-layer networks. These earlier methods were rarely used in practice because the orthogonalisation step was implemented with eigendecompositions or full SVDs that did not scale to billion-parameter models. The novel contribution of the 2024 Muon authors was algorithmic rather than purely conceptual: a quintic Newton-Schulz iteration with hand-tuned coefficients runs in bfloat16 on GPU tensor cores and adds less than two percent to the cost of a forward-backward step at NanoGPT scale.[1][9]
Muon is described compactly by the reference implementation at github.com/KellerJordan/Muon.[2] For every two-dimensional parameter matrix W of shape A x B:
The momentum update G' is passed through a routine zeropower_via_newtonschulz5 that returns an approximately orthogonal matrix O of the same shape.
The Newton-Schulz routine is the algorithmic heart of Muon. Direct singular value decomposition would correctly produce the matrix sign U V^T of G' = U Sigma V^T, but SVD is too slow for an inner-loop optimizer. Instead, Muon iterates a quintic polynomial that commutes with the SVD because it acts only through G' G'^T. The reference iteration normalises X = G' / ||G'||_F so its singular values lie in (0, 1], then performs five updates of the form:
A = X @ X.T
B = b * A + c * (A @ A)
X = a * X + B @ X
with the tuned coefficients (a, b, c) = (3.4445, -4.7750, 2.0315).[1][2] Five iterations are sufficient to drive the singular values of X to lie inside the band [0.5, 1.5] in bfloat16 arithmetic, which is empirically close enough to true orthogonality for optimization purposes.[1][2] Because every operation is a matrix-matrix multiply with no division or square root, the iteration runs entirely on the GPU's tensor cores and remains numerically stable at low precision.[1]
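The iteration above can be tried out in a few lines of NumPy. This is a minimal float64 sketch rather than the reference bfloat16 PyTorch routine; the function name `newton_schulz5`, the `eps` guard, and the transpose for tall matrices (so that X @ X.T is the smaller Gram matrix) are choices made here for illustration.

```python
import numpy as np

def newton_schulz5(G, steps=5, eps=1e-7):
    """Approximately orthogonalise G with the quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315
    # Frobenius normalisation puts every singular value in (0, 1].
    X = G / (np.linalg.norm(G) + eps)
    # Work with the wide orientation so the Gram matrix A is as small as possible.
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X.T if transposed else X

rng = np.random.default_rng(0)
G = rng.standard_normal((32, 128))   # stand-in for a momentum buffer
O = newton_schulz5(G)
singular_values = np.linalg.svd(O, compute_uv=False)
print(singular_values.min(), singular_values.max())
```

For this random input the singular values of the output stay inside the band quoted above instead of converging exactly to 1.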
Two features distinguish Muon from a strict implementation of "matrix sign of the gradient":
Scaling the orthogonalised update by sqrt(max(A, B)) (or equivalently max(1, A/B)^0.5 after the normalisation inside the iteration) keeps the per-element update RMS approximately constant across rectangular matrices, which is needed for AdamW-compatible learning rates to transfer.[2][3]
Default hyperparameters in the reference implementation are momentum 0.95, Nesterov enabled, five Newton-Schulz steps, base learning rate 0.02, and weight decay zero; the GitHub README states that only learning rate and weight decay typically need tuning.[2]
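Putting the pieces together, one Muon step for a single matrix might look like the sketch below. Several things here are assumptions rather than a transcription of the reference code: the momentum buffer uses the plain `buf = momentum * buf + grad` form, the Nesterov combination is `grad + momentum * buf`, the max(1, A/B)^0.5 shape factor is used, and an exact SVD stands in for the Newton-Schulz iteration so that the example stays short. The names `muon_step` and `orthogonalize_exact` are hypothetical.

```python
import numpy as np

def orthogonalize_exact(M):
    """Exact matrix sign via SVD (stand-in for the Newton-Schulz approximation)."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

def muon_step(W, grad, buf, lr=0.02, momentum=0.95, nesterov=True):
    """Return the updated weights and momentum buffer for one A x B matrix."""
    buf = momentum * buf + grad                       # SGD-style momentum buffer
    g = grad + momentum * buf if nesterov else buf    # Nesterov look-ahead
    O = orthogonalize_exact(g)                        # singular values -> 1
    rows, cols = W.shape
    scale = max(1.0, rows / cols) ** 0.5              # shape-dependent factor
    return W - lr * scale * O, buf

rng = np.random.default_rng(0)
W = rng.standard_normal((128, 32))
grad = rng.standard_normal((128, 32))
buf = np.zeros_like(W)

W_new, buf_new = muon_step(W, grad, buf)
step_singular_values = np.linalg.svd(W - W_new, compute_uv=False)
```

Because the step direction is orthogonal, every singular value of the applied update equals lr times the shape factor, regardless of the gradient's magnitude.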
The coefficients (3.4445, -4.7750, 2.0315) were not derived analytically but found by direct numerical search against a target objective that pushes the singular values of the input toward unity in as few steps as possible. Franz Cesista, one of the named co-authors, has documented follow-up work that squeezes an additional 1 to 2 percent of efficiency out of Muon by re-tuning the coefficients per step or per training stage rather than reusing the same triple for all five iterations.[1] The reference implementation does not adopt these per-step coefficients, on the grounds that the additional engineering complexity is rarely worth the marginal gain.
A subtle but important property of the iteration is that it is a polynomial in X X^T, which means it commutes with the singular value decomposition: if X = U Sigma V^T, then zeropower_via_newtonschulz5(X) = U f(Sigma) V^T for some scalar function f. The Newton-Schulz design problem is to pick coefficients so that f maps the interval (0, 1] as close to the constant function 1 as possible after five iterations. The choice (3.4445, -4.7750, 2.0315) overshoots slightly so that the output singular values land in roughly [0.5, 1.5] rather than collapsing exactly to 1; Keller Jordan reports that this small spread is empirically beneficial, perhaps because it preserves a little of the original spectral information.[1]
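The commutation property can be verified directly: running the matrix iteration should give the same result as applying the scalar quintic to each singular value and reassembling with the original singular vectors. The sketch below does this in float64 NumPy; the helper names are hypothetical and the input is a random stand-in matrix.

```python
import numpy as np

A_COEF, B_COEF, C_COEF = 3.4445, -4.7750, 2.0315

def iterate_matrix(X, steps=5):
    """Quintic Newton-Schulz iteration applied to the matrix itself."""
    for _ in range(steps):
        A = X @ X.T
        X = A_COEF * X + (B_COEF * A + C_COEF * (A @ A)) @ X
    return X

def iterate_scalar(s, steps=5):
    """The same quintic applied elementwise to singular values."""
    for _ in range(steps):
        s = A_COEF * s + B_COEF * s**3 + C_COEF * s**5
    return s

rng = np.random.default_rng(1)
G = rng.standard_normal((8, 16))
X0 = G / np.linalg.norm(G)                     # Frobenius normalisation
U, s0, Vt = np.linalg.svd(X0, full_matrices=False)

via_matrix = iterate_matrix(X0)
via_scalar = U @ np.diag(iterate_scalar(s0)) @ Vt   # U f(Sigma) V^T
max_abs_diff = np.max(np.abs(via_matrix - via_scalar))
print(max_abs_diff)
```

The two routes agree up to floating-point rounding, so the design of the coefficients reduces to a one-dimensional question about the scalar map f.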
Muon is explicitly designed for the 2D weight matrices of hidden layers. The reference implementation and writeup recommend leaving embeddings, classifier heads, and biases to AdamW.
The argument for the carve-out is that embedding and unembedding rows correspond to individual tokens whose update statistics differ sharply from one another, so per-coordinate adaptive scaling is genuinely useful there; the orthogonalization step also has no obvious meaning when one dimension has length one. The reference Muon repository ships a MuonWithAuxAdam helper that performs parameter grouping automatically.[2]
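A simple way to picture the carve-out is a routing function over parameter names and shapes. The sketch below is not the MuonWithAuxAdam helper; the function `split_for_muon`, the keyword list, and the example parameter names are all hypothetical and would need adapting to a real model's naming scheme.

```python
def split_for_muon(named_shapes, aux_keywords=("embed", "lm_head")):
    """Split parameter names into (Muon group, AdamW group).

    A parameter goes to Muon only if it is two-dimensional and its name does
    not match any keyword marking an embedding or classifier head.
    """
    muon, adamw = [], []
    for name, shape in named_shapes.items():
        is_matrix = len(shape) == 2
        is_aux_by_name = any(key in name for key in aux_keywords)
        if is_matrix and not is_aux_by_name:
            muon.append(name)
        else:
            adamw.append(name)
    return muon, adamw

# Hypothetical parameter names and shapes for a small GPT-style model.
example = {
    "embed.weight": (50304, 768),
    "blocks.0.attn.qkv.weight": (2304, 768),
    "blocks.0.mlp.fc.weight": (3072, 768),
    "blocks.0.mlp.fc.bias": (3072,),
    "lm_head.weight": (50304, 768),
}
muon_names, adamw_names = split_for_muon(example)
print(muon_names)
print(adamw_names)
```

Note that the embedding and head are two-dimensional, so a shape check alone is not enough; some name-based or module-based rule is needed to exclude them.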
The most visible early adopter of Muon was the modded-nanogpt speedrun.[7] The challenge fixes the model architecture family (a GPT-2-small-class transformer), the validation target (3.28 on a held-out FineWeb shard), and the hardware budget (eight H100s); competitors race to minimise wall-clock training time. Selected records illustrate Muon's role:
| Record | Date | Wall-clock | Headline change |
|---|---|---|---|
| 1 | May 2024 | ~45 min | Karpathy llm.c GPT-2 baseline |
| 2 | 6 Jun 2024 | 31.4 min | Tuned LR plus rotary position embeddings |
| 3 | 4 Oct 2024 | 24.9 min | Switched hidden weights from AdamW to Muon |
| 5 | 14 Oct 2024 | 15.2 min | Architectural refinements on top of Muon |
| 13 | 19 Nov 2024 | 5.03 min | FlexAttention integration |
| 20 | 16 Jan 2025 | 2.99 min | Merged QKV projections plus attention tweaks |
| 50 | 18 Dec 2025 | 2.13 min | Further iteration of Muon-based stack |
| 62 | 19 Jan 2026 | 1.66 min | Bigram hash embedding |
| 80 | 8 Apr 2026 | 1.41 min | Combined optimisations |
Sources: modded-nanogpt repository commit history and Keller Jordan's Muon write-up.[1][7]
Muon's introduction at Record 3 produced a 1.35x speedup over the previous AdamW-based record at the same loss target.[1] At larger scales, Keller Jordan's write-up reports that training a 1.5B-parameter transformer to GPT-2 XL quality required 10 hours on 8xH100 with Muon versus 13.3 hours with AdamW, a comparable ratio.[1] Muon was also used to set training-speed records for CIFAR-10, pushing a long-standing benchmark from 3.3 to 2.6 A100-seconds at fixed accuracy.[2]
The speedrun setting is unusual in several ways that matter for interpreting Muon's results. The model size is small (124M parameters), the dataset is fixed, the validation target is a single loss value rather than a benchmark suite, and the hardware budget is generous relative to model size. Some optimisations that help at this scale, including extremely large batch sizes, aggressive learning-rate schedules, and architectural compression, do not transfer directly to multi-billion-parameter pretraining. Nonetheless, the speedrun has been an unusually useful research environment because it forces every change to demonstrably reduce wall-clock training time on identical hardware, exposing the difference between methods that look better in offline plots and methods that actually run faster on a real GPU. Muon survived this test through 80-plus consecutive records, which is a strong signal that the speedup is real rather than an artefact of cherry-picked seeds or hyperparameter searches.[1][7][8]
The most important external validation of Muon came in February 2025 when researchers at Moonshot AI posted "Muon is Scalable for LLM Training" (arXiv:2502.16982).[3] The paper, with 29 named authors including Jingyuan Liu and Jianlin Su, identified two adjustments that allowed Muon to train large models out-of-the-box without per-scale hyperparameter sweeps: adding weight decay, and adjusting the per-parameter update scale so that the update RMS stays consistent across matrices of different shapes.[3]
With these changes, scaling-law experiments on dense transformers up to several billion parameters showed that Muon reached the AdamW compute-optimal loss using approximately 52 percent of the FLOPs, i.e. roughly 2x compute efficiency at compute-optimal token counts.[3]
The paper also introduced Moonlight, an open-weight 16B-total / 2.24B-activated Mixture-of-Experts model trained with Muon on 5.7T tokens.[3] Moonlight reuses the DeepSeek V3-Small architecture (18 layers, hidden size 2,560, 16 experts) so that its results can be compared directly to the AdamW-trained DSV3-Small baseline. Selected benchmarks at roughly equal token budgets:[3]
| Benchmark | DSV3-Small (AdamW, 1.33T tokens) | Moonlight (Muon, 1.2T tokens) |
|---|---|---|
| MMLU | 53.3 | 60.4 |
| HumanEval | 26.8 | 37.2 |
| GSM8K | 31.4 | 45.0 |
Moonshot also described a distributed implementation, Distributed Muon, layered on top of ZeRO-1-style optimizer-state sharding. The added communication relative to Distributed AdamW is reported to be at most 1.25x, while peak memory is roughly halved because Muon stores only one momentum buffer rather than two moment estimates.[3] Moonlight, intermediate checkpoints, and the Distributed Muon code were released under the MIT license through Moonshot's GitHub and Hugging Face organisations, with vLLM and SGLang integrations.[11]
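The memory argument is simple bookkeeping, sketched below under stated assumptions: optimizer states are stored in 32-bit floats, every parameter is handled by the optimizer in question, and sharding is ignored. In a real run the embeddings and head stay on AdamW, so the saving on optimizer state is somewhat smaller than the idealised factor shown. The function name is hypothetical.

```python
def optimizer_state_bytes(num_params, buffers_per_param, bytes_per_value=4):
    """Bytes of optimizer state, assuming one value per parameter per buffer."""
    return num_params * buffers_per_param * bytes_per_value

num_params = 16_000_000_000   # a 16B-parameter model, for illustration

adamw_bytes = optimizer_state_bytes(num_params, buffers_per_param=2)  # m and v
muon_bytes = optimizer_state_bytes(num_params, buffers_per_param=1)   # momentum only
ratio = muon_bytes / adamw_bytes

print(adamw_bytes / 1e9, "GB vs", muon_bytes / 1e9, "GB")
```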
The key engineering challenge for Distributed Muon was that the Newton-Schulz iteration is a dense matrix operation on the full update tensor, which is awkward under tensor-parallel or pipeline-parallel sharding. The Moonlight team resolved this by gathering the sharded momentum buffer onto a single rank, running the Newton-Schulz iteration there in bfloat16, and then scattering the orthogonalised update back to its original shards. Because each parameter matrix is processed independently, the gather and scatter operations can be overlapped with computation for other matrices, keeping the additional communication bounded.[3] This pattern has since been refined in NorMuon (Li et al., 2025) and in independent implementations layered on PyTorch's FSDP2, which use more aggressive overlapping and per-row normalisation to handle very large matrices.[14]
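The need for the gather step can be seen in a single-process simulation. The sketch below is not distributed code: the "ranks" are row slices of one NumPy array, concatenation and splitting stand in for gather and scatter, and an exact SVD stands in for Newton-Schulz. It also shows that orthogonalising each shard on its own gives a different answer.

```python
import numpy as np

def orthogonalize_exact(M):
    """Exact matrix sign via SVD (stand-in for the Newton-Schulz iteration)."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(0)
full_momentum = rng.standard_normal((8, 6))
num_ranks = 4

# Each simulated rank holds a 2 x 6 row slice of the momentum buffer.
shards = np.array_split(full_momentum, num_ranks, axis=0)

# "Gather" the full matrix, orthogonalise once, then "scatter" the result.
gathered = np.concatenate(shards, axis=0)
update = orthogonalize_exact(gathered)
update_shards = np.array_split(update, num_ranks, axis=0)

# Naive alternative: orthogonalise each shard independently.
naive = np.concatenate([orthogonalize_exact(s) for s in shards], axis=0)

print(np.linalg.svd(update, compute_uv=False))
print(np.linalg.svd(naive, compute_uv=False))
```

Only the gathered version has all singular values equal to one; the per-shard version orthogonalises the rows within each slice but not the matrix as a whole.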
Moonshot's follow-on system, Kimi K2, is a one-trillion-parameter MoE language model with 32B activated parameters released in 2025 and described in detail in the Kimi K2 technical report (arXiv:2507.20534).[4][5] The report states that Kimi K2 was pre-trained on 15.5 trillion tokens "with zero training instability" using a Muon variant called MuonClip.[4][5]
MuonClip addresses a failure mode that the Kimi team observed when scaling Muon: exploding attention logits. As model size and learning-rate-times-batch-size grew, the dot products between the orthogonalised query and key projections occasionally drifted upward, eventually causing softmax saturation and loss spikes. Logit soft-capping and query-key normalisation, the standard defences in AdamW training, were reported to be inadequate in the Muon regime.[5] MuonClip therefore adds a post-update step called QK-Clip: after each Muon step, for every attention head, the maximum attention logit is measured, and whenever it exceeds a threshold tau (the team reports tau = 100), the corresponding query and key weight matrices are rescaled in place so that the bound is restored.[4][5] This brings the logit magnitude back under control at the source without altering the forward computation.
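The QK-Clip step described above can be sketched for a single attention head. The details below are assumptions made for illustration: logits are computed as scaled dot products on a small batch of activations `X`, and the rescaling factor tau / max_logit is split evenly between the query and key weights as a square root on each. The function names are hypothetical, and a full implementation would repeat this per head.

```python
import numpy as np

def max_attention_logit(W_q, W_k, X):
    """Largest scaled dot-product logit for one head on activations X."""
    d_head = W_q.shape[1]
    logits = (X @ W_q) @ (X @ W_k).T / np.sqrt(d_head)
    return np.max(logits)

def qk_clip(W_q, W_k, X, tau=100.0):
    """Rescale W_q and W_k if the max logit exceeds tau; otherwise return them unchanged."""
    max_logit = max_attention_logit(W_q, W_k, X)
    if max_logit > tau:
        gamma = tau / max_logit
        # Splitting gamma evenly scales every logit by exactly gamma.
        W_q = W_q * np.sqrt(gamma)
        W_k = W_k * np.sqrt(gamma)
    return W_q, W_k, max_logit

rng = np.random.default_rng(0)
X = rng.standard_normal((16, 32))            # 16 tokens, model width 32
W_q = 10.0 * rng.standard_normal((32, 8))    # deliberately oversized weights
W_k = 10.0 * rng.standard_normal((32, 8))

W_q_new, W_k_new, before = qk_clip(W_q, W_k, X)
after = max_attention_logit(W_q_new, W_k_new, X)
print(before, after)
```

The forward pass is untouched; only the stored weights change, and only for heads whose logits have actually crossed the threshold.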
The Kimi team reports that MuonClip enabled the entire Kimi K2 pretraining run to complete without a single loss spike, a property they attribute to QK-Clip rather than to Muon alone.[4][5] Kimi K2 benchmarks reported in the technical report include 65.8 on SWE-Bench Verified, 53.7 on LiveCodeBench v6, 49.5 on AIME 2025, 75.1 on GPQA-Diamond, and 66.1 on Tau2-Bench, placing it competitively among open-weight non-thinking models at release.[4]
Beyond Moonshot, Muon has been picked up by several other groups:
The reference implementation is pip-installable directly from GitHub; a pull-request-style discussion exists in the PyTorch repository (issue #148819) proposing to add Muon to torch.optim.[2][15]
Open-source ecosystems also include:
The theoretical case for Muon is most fully developed in Jeremy Bernstein's "Deriving Muon" essay and in Bernstein and Newhouse's "Modular Duality in Deep Learning" (arXiv:2410.21265, October 2024).[9][10] Their framing is that any first-order optimizer can be viewed as steepest descent under some choice of norm on the parameter space; the gradient itself is a dual vector and must be mapped back to the primal weight space before it can be subtracted from the weights. For dense linear layers, equipping the layer with the RMS-to-RMS operator norm (the spectral norm scaled by sqrt(B/A) for an A x B matrix) and dualising yields exactly the matrix sign update implemented by Muon.[9][10] In this view:
Subsequent papers have begun to formalise Muon's convergence. Among them are "A Note on the Convergence of Muon" (arXiv:2502.02900), "On the Convergence Analysis of Muon" (arXiv:2505.23737), and "Towards Understanding Orthogonalization in Muon" (OpenReview, 2025), which analyse Muon as a non-Euclidean stochastic gradient method under the spectral or Frobenius norm.[16] Thinking Machines Lab's "Modular Manifolds" essay places Muon inside a broader framework that treats each layer's weight space as a manifold with its own metric.[17]
Muon's significance is twofold. Algorithmically, it is the first matrix-aware optimizer that has been demonstrated at frontier-scale language model training, breaking what had been a decade-long AdamW monoculture in production pretraining. The roughly 2x compute saving reported by Moonshot for compute-optimal scaling, if it holds across other architectures and data mixtures, represents the largest published improvement in pretraining optimizer efficiency since Adam itself.[3] Conceptually, it has popularised the view that optimizers should respect the geometry of the parameters they update: a transformer hidden layer is a matrix, and the natural update direction is also a matrix object, not a coordinate-wise rescaling of one.[9][10]
For practitioners, Muon also has favourable systems properties:
Muon is not without caveats:
torch.optim, and distributed implementations vary in quality across the open-source landscape, though the gap is closing rapidly.[15]
| Optimizer | Type of update | Per-parameter memory | Notable property |
|---|---|---|---|
| SGD with momentum | First-order, coordinate-wise | 1 buffer | Baseline; no adaptive scaling |
| Adam / AdamW | First-order, coordinate-wise adaptive | 2 buffers | Industry standard for transformers |
| Shampoo | Second-order, block-diagonal preconditioner | Matrix factor statistics | Matrix-aware; expensive inverses |
| SOAP | Shampoo eigenbasis with Adam inside | Adam plus Shampoo factors | Combines first- and second-order signals |
| Lion | First-order, sign of momentum | 1 buffer | Coordinate-wise sign update; memory-light |
| Muon | First-order on momentum, matrix sign | 1 buffer | Matrix-aware via Newton-Schulz; orthogonal step direction |
| MuonClip | Muon plus QK-Clip post-step | 1 buffer | Stabilises attention logits at trillion-parameter scale |
The most important contrasts: Muon shares Lion's memory frugality but operates on matrices rather than coordinates; it shares Shampoo's matrix awareness but is dramatically cheaper per step because Newton-Schulz replaces full preconditioner inverses.[6]