SOAP (optimizer)
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,323 words
SOAP (ShampoO with Adam in the Preconditioner's eigenbasis) is a second-order optimization algorithm for training deep neural networks, introduced by Nikhil Vyas, Depen Morwani, Rosie Zhao, Mujin Kwun, Itai Shapira, David Brandfonbrener, Lucas Janson, and Sham Kakade in a September 2024 paper titled "SOAP: Improving and Stabilizing Shampoo using Adam".[^1] The method runs an Adam-style adaptive update inside the eigenbasis of the Shampoo preconditioner, so that per-coordinate second-moment statistics are tracked in a slowly rotating coordinate system instead of the raw parameter space. Vyas and coauthors prove that, when its preconditioner is refreshed every step, the canonical 1/2-power Shampoo update is mathematically equivalent to running Adafactor in that eigenbasis; SOAP exploits this equivalence by replacing the implicit Adafactor inside Shampoo with an explicit Adam.[^1] In language-model pre-training experiments at 360M and 660M parameters, SOAP reduces the number of optimizer steps to a target loss by roughly 40% over AdamW and 20% over Shampoo, with comparable savings in wall-clock time, while introducing only one new hyperparameter (the preconditioning frequency) beyond standard Adam.[^1][^2]
The name spells out the recipe. SOAP stands for ShampoO with Adam in the Preconditioner's eigenbasis: the S and the final O come from "ShampoO", the A from "Adam", and the P from "Preconditioner". In other words, take Shampoo's eigenbasis and run Adam inside it. The "preconditioner" refers to the implicit metric induced by Shampoo's per-mode Gram-matrix accumulators, not to a classical Newton or Gauss-Newton preconditioner. Throughout the paper and the implementation, "Adam" is used loosely to mean "AdamW" (i.e., with decoupled weight decay), which is the version SOAP actually invokes inside the rotated basis.[^1][^3]
Adam is a first-order adaptive optimizer that maintains exponential moving averages of the gradient (the first moment) and the squared gradient (the second moment), then divides the momentum by the square root of the second moment to produce a step. AdamW modifies Adam by decoupling weight decay from the gradient-based update, which has become the de facto recipe for training transformer language models.[^3] Adam is cheap to run (its optimizer state is just two scalars per parameter) and treats every coordinate independently, ignoring any curvature structure that couples different coordinates. This independence is part of why it works robustly across architectures, but it also means Adam cannot exploit correlations between parameters of the same weight matrix.
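The update described above can be written in a few lines. The following is a minimal NumPy sketch, not taken from any particular library; the function name `adamw_step` and the default hyperparameter values are illustrative assumptions.

```python
import numpy as np

def adamw_step(W, G, M, V, t, lr=1e-3, b1=0.9, b2=0.95, eps=1e-8, wd=0.0):
    """One AdamW update; t is the 1-based step count (illustrative defaults)."""
    M = b1 * M + (1 - b1) * G              # first moment: EMA of the gradient
    V = b2 * V + (1 - b2) * G * G          # second moment: EMA of the squared gradient
    M_hat = M / (1 - b1 ** t)              # bias correction
    V_hat = V / (1 - b2 ** t)
    # Per-coordinate step plus decoupled weight decay.
    W = W - lr * M_hat / (np.sqrt(V_hat) + eps) - lr * wd * W
    return W, M, V

W = np.array([[1.0, -2.0], [0.5, 3.0]])
G = np.array([[0.1, -0.4], [2.0, -0.03]])
W1, M1, V1 = adamw_step(W, G, np.zeros_like(W), np.zeros_like(W), t=1)

# On the first step the bias-corrected update is approximately lr * sign(G),
# regardless of each gradient entry's magnitude.
print(np.round((W - W1) / 1e-3, 3))
```

Every entry is scaled on its own, which is exactly the independence that a rotated basis is meant to relax.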
For matrix-valued parameters such as the dense layers of a Transformer, the second moment $V$ that Adam stores is the same shape as the weight, $\mathbb{R}^{m\times n}$. AdamW holds three matrices of that shape (parameters, momentum, second moment), and modern large language model training accordingly devotes a significant fraction of accelerator memory just to optimizer state.[^3]
Full-matrix preconditioned methods such as full-matrix AdaGrad would form and invert an $mn \times mn$ matrix per layer, which is intractable for any nontrivial model. Shampoo, introduced by Vineet Gupta, Tomer Koren, and Yoram Singer at ICML 2018, addresses this by approximating the full preconditioner by a Kronecker product of two smaller per-mode preconditioners.[^4] For an $m \times n$ weight matrix $W$ with gradient $G$, Shampoo accumulates:
$$L_t = L_{t-1} + G_t G_t^\top \quad \in \mathbb{R}^{m\times m}$$
$$R_t = R_{t-1} + G_t^\top G_t \quad \in \mathbb{R}^{n\times n}$$
and applies the update $W_{t+1} = W_t - \eta\, L_t^{-1/4}\, G_t\, R_t^{-1/4}$ (with exponent 1/4 in the original formulation).[^4] The inverse fourth roots are computed by eigendecomposing $L_t$ and $R_t$, which is the expensive step. Shampoo was originally derived as a structure-aware approximation to full-matrix AdaGrad and came with convergence guarantees in the stochastic convex setting.[^4]
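As a rough illustration of the accumulate-then-precondition step, here is a NumPy sketch. The helper names, the small `eps` guard, and the $\epsilon I$ initialization of the accumulators are assumptions made so the toy example is well defined; they are not taken from the Shampoo paper.

```python
import numpy as np

def inverse_root(A, p, eps=1e-12):
    """Return A^(-1/p) for a symmetric PSD matrix A via eigendecomposition."""
    eigvals, eigvecs = np.linalg.eigh(A)
    eigvals = np.maximum(eigvals, 0.0) + eps   # guard against tiny negative eigenvalues
    return (eigvecs * eigvals ** (-1.0 / p)) @ eigvecs.T

def shampoo_step(W, G, L, R, eta=0.1, p=4):
    """One Shampoo update with exponent 1/p on each side (p=4 is the original form)."""
    L = L + G @ G.T          # left accumulator,  m x m
    R = R + G.T @ G          # right accumulator, n x n
    W = W - eta * inverse_root(L, p) @ G @ inverse_root(R, p)
    return W, L, R

rng = np.random.default_rng(0)
m, n = 4, 3
W = rng.standard_normal((m, n))
# Assumption: start from a small multiple of the identity so the roots exist.
L, R = 1e-4 * np.eye(m), 1e-4 * np.eye(n)
for _ in range(5):
    G = W                     # gradient of the toy loss 0.5 * ||W||_F^2
    W, L, R = shampoo_step(W, G, L, R)
print(W.shape, np.isfinite(W).all())
```

Passing `p=2` gives the 1/2-power variant; the two eigendecompositions inside `inverse_root` are the cost that practical implementations amortize.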
A 2020 Google paper by Rohan Anil, Vineet Gupta, Tomer Koren, Kevin Regan, and Yoram Singer, "Scalable Second Order Optimization for Deep Learning", described a distributed implementation of Shampoo that ran the inverse-root step asynchronously on CPUs while the accelerators computed gradients, and demonstrated wall-clock-time wins on production-scale workloads.[^5] Meta later released a PyTorch reimplementation called Distributed Shampoo, which is the implementation the SOAP authors used as their Shampoo baseline.[^6] In modern usage, the 1/2 power $L_t^{-1/2}\, G_t\, R_t^{-1/2}$ has become standard, because Shampoo's preconditioner is better viewed as an approximation of a single-mode whitening operator than of full-matrix AdaGrad.[^1]
Adafactor, introduced by Noam Shazeer and Mitchell Stern in 2018, sidesteps the memory cost of Adam's second moment by storing only the row and column sums of the squared gradients, $A \in \mathbb{R}^m$ and $C \in \mathbb{R}^n$, then reconstructing a rank-one factored second moment $V_{ij} \approx A_i C_j / \sum_k A_k$.[^7] This drops the optimizer-state cost for a weight matrix from $O(mn)$ to $O(m+n)$, which is critical at billion-parameter scale and is part of why Adafactor was used to train Google's 11B-parameter T5.[^7] Adafactor also includes update clipping and a slow second-moment decay schedule for stability.[^7]
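The rank-one reconstruction can be checked numerically. The sketch below covers only the factored second moment for a single gradient; it omits the moving averages, update clipping, and decay schedule, and the function name is an illustrative assumption.

```python
import numpy as np

def factored_second_moment(G):
    """Rank-one reconstruction of G**2 from its row and column sums."""
    sq = G * G
    A = sq.sum(axis=1)                 # row sums,    shape (m,)
    C = sq.sum(axis=0)                 # column sums, shape (n,)
    V_hat = np.outer(A, C) / A.sum()   # V_ij ~ A_i * C_j / sum_k A_k
    return A, C, V_hat

rng = np.random.default_rng(0)
G = rng.standard_normal((6, 4))
A, C, V_hat = factored_second_moment(G)

# The reconstruction preserves the row and column sums of the true G**2 ...
print(np.allclose(V_hat.sum(axis=1), A), np.allclose(V_hat.sum(axis=0), C))

# ... and is exact whenever G**2 is itself rank one.
u, v = rng.random(6) + 0.1, rng.random(4) + 0.1
G1 = np.sqrt(np.outer(u, v))
print(np.allclose(factored_second_moment(G1)[2], G1 * G1))
```

Only `A` and `C` need to be stored, which is where the $O(m+n)$ state cost comes from.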
The SOAP paper observes that this factored second moment is structurally similar to Shampoo's $L$ and $R$ accumulators, and shows that the similarity is not a coincidence.[^1]
A central contribution of Vyas et al. is a precise statement of the relationship between Shampoo and Adafactor. The paper proves that Shampoo with the 1/2 power is equivalent to running Adafactor in the eigenbasis of Shampoo's preconditioner when the preconditioner is recomputed at every step.[^1]
The proof is short. Eigendecompose $L = Q_L \Lambda Q_L^\top$ and $R = Q_R M Q_R^\top$, with diagonal eigenvalues $\lambda_i$ and $\mu_j$. The Shampoo update with exponent 1/2 acts on the gradient $G$ as $L^{-1/2} G R^{-1/2}$. In the basis $(Q_L, Q_R)$, this is a coordinate-wise rescaling: the $(i,j)$ entry of the rotated gradient is divided by $(\lambda_i \mu_j)^{1/2}$. Adafactor's rank-one factored second moment, applied in the same rotated basis, also produces a coordinate-wise rescaling: entry $(i,j)$ is divided by $(A_i C_j / \sum_k A_k)^{1/2}$. In that basis the row and column sums of the squared rotated gradients are exactly the eigenvalues, $A_i = \lambda_i$ and $C_j = \mu_j$, and $\sum_k A_k = \operatorname{tr}(L)$, so Adafactor divides by $(\lambda_i \mu_j / \operatorname{tr}(L))^{1/2}$. The two updates therefore coincide up to the scalar factor $\operatorname{tr}(L)^{1/2}$, which is removed by normalizing the Shampoo update accordingly.[^1]
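The argument can be sanity-checked numerically. The sketch below is an illustration under simplifying assumptions: the accumulators are plain sums over a small random batch of gradients rather than moving averages, and the preconditioner is recomputed from that batch. In this setting the two updates agree up to a single scalar that depends only on $\operatorname{tr}(L)$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 3
grads = rng.standard_normal((8, m, n))          # a small batch of gradients

# Shampoo accumulators and their eigenbases.
L = sum(g @ g.T for g in grads)
R = sum(g.T @ g for g in grads)
lam, Q_L = np.linalg.eigh(L)
mu, Q_R = np.linalg.eigh(R)

# Every gradient expressed in the eigenbasis: Q_L^T g Q_R.
rotated = np.einsum("ia,kij,jb->kab", Q_L, grads, Q_R)

# Adafactor statistics in the rotated basis are the eigenvalues themselves.
A = (rotated ** 2).sum(axis=(0, 2))             # row sums
C = (rotated ** 2).sum(axis=(0, 1))             # column sums
print(np.allclose(A, lam), np.allclose(C, mu))

# Compare the two updates on the most recent gradient.
G, G_rot = grads[-1], rotated[-1]
L_inv_sqrt = (Q_L / np.sqrt(lam)) @ Q_L.T
R_inv_sqrt = (Q_R / np.sqrt(mu)) @ Q_R.T
shampoo = L_inv_sqrt @ G @ R_inv_sqrt
adafactor = Q_L @ (G_rot / np.sqrt(np.outer(A, C) / A.sum())) @ Q_R.T
print(np.allclose(adafactor, np.sqrt(np.trace(L)) * shampoo))
```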
The corollary is striking: Shampoo, often viewed as a fundamentally different "second-order" algorithm, is in fact a factored (rank-one) approximation of an adaptive first-order method, just operating in a rotated basis chosen to align with the dominant curvature directions of the layer's Gram matrices. Distributed Shampoo's various heuristics, including grafting onto Adam's per-parameter scale, can then be reinterpreted as efforts to compensate for the gap between Shampoo's rank-one Adafactor and the full per-coordinate Adam.[^1]
This insight is closely related to a parallel result by Depen Morwani and coauthors, "A New Perspective on Shampoo's Preconditioner", which derived the 1/2-power Shampoo update as an optimal Kronecker approximation to a whitening preconditioner rather than to full-matrix AdaGrad.[^8] Read together, the two papers reposition Shampoo as a rotated, factored adaptive method rather than as an approximation to a true second-order Newton-type step.
If Shampoo is "Adafactor in Shampoo's eigenbasis", then there is a natural strict improvement: replace the rank-one Adafactor with full per-coordinate Adam, while keeping the rotation. That is SOAP.[^1]
For an $m \times n$ weight matrix $W$, SOAP maintains six pieces of state:
| State | Shape | Purpose |
|---|---|---|
| $L$ | $m \times m$ | Left Shampoo accumulator (EMA of $GG^\top$) |
| $R$ | $n \times n$ | Right Shampoo accumulator (EMA of $G^\top G$) |
| $Q_L$ | $m \times m$ | Cached left eigenvectors of $L$ |
| $Q_R$ | $n \times n$ | Cached right eigenvectors of $R$ |
| $M$ | $m \times n$ | Adam first moment in the rotated basis |
| $V$ | $m \times n$ | Adam second moment in the rotated basis |
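Per step, given gradient $G_t$:
1. Rotate the gradient into the cached eigenbasis: $\tilde{G}_t = Q_L^\top G_t Q_R$.
2. Run one Adam update on $\tilde{G}_t$, updating $M$ and $V$ in the rotated basis to obtain a rotated step $\tilde{U}_t$.
3. Rotate the step back to parameter space, $U_t = Q_L \tilde{U}_t Q_R^\top$, and apply it with decoupled weight decay.
4. Update the Shampoo accumulators $L$ and $R$ with exponential moving averages of $G_t G_t^\top$ and $G_t^\top G_t$.
5. Every $f$ steps, recompute the eigenvectors $Q_L$ and $Q_R$ of $L$ and $R$.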
When the rotation matrices are refreshed, the Adam first moment $M$ is re-expressed in the new basis so that the accumulated momentum is not lost; the second moment $V$ is an elementwise quantity that a linear rotation does not preserve (its entries could even turn negative), so it is carried over as is.[^1] One-dimensional parameters (biases, layer-norm scales) fall back to plain AdamW because there is only one mode to precondition. For tensors with more than two axes, SOAP generalizes naturally: maintain an accumulator per mode.[^2]
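One way to picture the per-mode generalization is through mode unfoldings. The sketch below is an illustration, not the reference implementation (which may, for instance, merge or skip dimensions); `mode_gram` and `rotate` are hypothetical helper names.

```python
import numpy as np

def mode_gram(G, k):
    """Gram matrix of the mode-k unfolding of G, shape (G.shape[k], G.shape[k])."""
    Gk = np.moveaxis(G, k, 0).reshape(G.shape[k], -1)
    return Gk @ Gk.T

def rotate(G, Qs):
    """Apply Q_k^T along every mode k; for a matrix this is Q_L^T G Q_R."""
    for k, Q in enumerate(Qs):
        G = np.moveaxis(np.tensordot(Q.T, G, axes=(1, k)), 0, k)
    return G

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 3, 2))                      # a 3-axis gradient
Qs = [np.linalg.eigh(mode_gram(G, k))[1] for k in range(G.ndim)]
G_rot = rotate(G, Qs)

# In the rotated coordinates every per-mode Gram matrix is diagonal.
for k in range(G.ndim):
    D = mode_gram(G_rot, k)
    print(k, np.allclose(D, np.diag(np.diag(D)), atol=1e-8))
```

For a two-axis gradient, `mode_gram(G, 0)` and `mode_gram(G, 1)` reduce to $GG^\top$ and $G^\top G$, matching the $L$ and $R$ accumulators in the table.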
The single new hyperparameter is the preconditioning frequency $f$, the number of optimizer steps between eigenvector refreshes. Setting $f = 1$ refreshes the basis at every step, the regime in which Shampoo's equivalence to Adafactor in the eigenbasis holds; large $f$ amortizes the eigendecomposition cost across many cheap rotated-Adam steps.[^1]
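    import numpy as np

    def soap(W, grad_fn, steps, eta, b1, b2, wd, f, eps):
        """SOAP on one m x n matrix W. Inputs: lr eta, betas (b1, b2),
        weight decay wd, refresh frequency f, epsilon eps."""
        m, n = W.shape
        L, R = np.zeros((m, m)), np.zeros((n, n))     # Shampoo accumulators
        Q_L, Q_R = np.eye(m), np.eye(n)               # cached eigenbases, identity at start
        M, V = np.zeros((m, n)), np.zeros((m, n))     # Adam moments in the rotated basis
        for t in range(1, steps + 1):
            G = grad_fn(W)
            # 1. Rotate gradient
            Gtil = Q_L.T @ G @ Q_R
            # 2. Adam in rotated basis (with bias correction)
            M = b1 * M + (1 - b1) * Gtil
            V = b2 * V + (1 - b2) * Gtil * Gtil
            Mhat = M / (1 - b1**t)
            Vhat = V / (1 - b2**t)
            Util = Mhat / (np.sqrt(Vhat) + eps)
            # 3. Rotate update back
            U = Q_L @ Util @ Q_R.T
            # 4. AdamW step (decoupled weight decay)
            W = W - eta * U - eta * wd * W
            # 5. Shampoo EMAs
            L = b2 * L + (1 - b2) * (G @ G.T)
            R = b2 * R + (1 - b2) * (G.T @ G)
            # 6. Refresh basis every f steps
            if t % f == 0:
                Q_L_new = np.linalg.eigh(L)[1]
                Q_R_new = np.linalg.eigh(R)[1]
                # M is linear in the gradient, so it can be rotated into the new basis.
                M = Q_L_new.T @ Q_L @ M @ Q_R.T @ Q_R_new
                # V holds elementwise squares; rotating it linearly could produce
                # negative entries, so it is carried over unchanged.
                Q_L, Q_R = Q_L_new, Q_R_new
        return W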
The above follows Algorithm 3 in the paper.[^1]
The frequency $f$ trades off two costs. A small $f$ means eigendecompositions happen often, so the cached $Q_L, Q_R$ closely track the moving Shampoo accumulators, but each refresh costs $O(m^3 + n^3)$. A large $f$ amortizes the eigendecomposition cost over many cheap rotated-Adam steps, but allows the basis to grow stale. The recommended default in the reference implementation is $f = 10$ for batch-size-2M Transformer pre-training; values up to $f = 80$ are reported as still effective at smaller batch sizes.[^1][^2] In practice, the eigendecomposition is implemented with a power-iteration warm start using the cached eigenvectors, which converges in a small number of inner iterations because the underlying matrix changes slowly between calls.[^1]
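A warm-started refresh of this kind can be sketched as orthogonal (block power) iteration followed by a QR re-orthonormalization. The code below is an illustration with hypothetical names and a synthetic accumulator; the number of inner iterations is left as a parameter and is not a claim about the reference implementation's setting.

```python
import numpy as np

def refresh_basis(P, Q, iters=1):
    """Warm-started orthogonal iteration: multiply by P, re-orthonormalise with QR."""
    for _ in range(iters):
        Q, _ = np.linalg.qr(P @ Q)
    return Q

def offdiag_fraction(P, Q):
    """Relative Frobenius mass of Q^T P Q off the diagonal (0 for an exact eigenbasis)."""
    D = Q.T @ P @ Q
    return np.linalg.norm(D - np.diag(np.diag(D))) / np.linalg.norm(D)

rng = np.random.default_rng(0)
n = 6
Q0, _ = np.linalg.qr(rng.standard_normal((n, n)))
P_old = Q0 @ np.diag(np.arange(n, 0, -1.0)) @ Q0.T      # eigenvalues 6, 5, ..., 1

# Cached eigenvectors, ordered by decreasing eigenvalue (eigh returns ascending order).
Q_cached = np.linalg.eigh(P_old)[1][:, ::-1]

# One EMA-style change to the accumulator along a unit-norm direction.
g = rng.standard_normal((n, 1))
g /= np.linalg.norm(g)
P_new = 0.95 * P_old + 0.5 * (g @ g.T)

print("stale basis:", offdiag_fraction(P_new, Q_cached))
print("one step   :", offdiag_fraction(P_new, refresh_basis(P_new, Q_cached)))
print("many steps :", offdiag_fraction(P_new, refresh_basis(P_new, Q_cached, iters=300)))
```

Each inner iteration costs one matrix product and one QR factorization, and the cached basis is already a fixed point whenever the accumulator has not changed.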
A subtle implementation choice is what to do with the Adam moments when the basis is refreshed: re-express them in the new eigenbasis, or reset them. The SOAP paper rotates the first moment $M$ rather than resetting it, treating the momentum as a fixed matrix expressed in different coordinate systems, so that no momentum is discarded at a basis refresh; the second moment $V$, being elementwise, is not a linear quantity and is carried over rather than rotated.[^1] Empirically this is important: naively resetting the moments at refresh reintroduces most of the degradation that Shampoo shows with infrequent updates.[^1]
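For the first moment, the change of basis is a purely linear operation, which the following sketch illustrates with random orthogonal matrices standing in for the old and refreshed eigenbases (all names here are illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 3

def random_orthogonal(k):
    """Random orthogonal matrix from the QR factorization of a Gaussian matrix."""
    Q, _ = np.linalg.qr(rng.standard_normal((k, k)))
    return Q

Q_L, Q_R = random_orthogonal(m), random_orthogonal(n)          # old basis
Q_L_new, Q_R_new = random_orthogonal(m), random_orthogonal(n)  # refreshed basis

M = rng.standard_normal((m, n))                  # first moment in old rotated coordinates
M_new = Q_L_new.T @ Q_L @ M @ Q_R.T @ Q_R_new    # the same momentum in new coordinates

# Mapping either version back to parameter space gives the same matrix.
before = Q_L @ M @ Q_R.T
after = Q_L_new @ M_new @ Q_R_new.T
print(np.allclose(before, after))   # True
```

Resetting `M` to zero instead would discard `before` entirely, which is the information a rotation keeps.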
Shampoo's effective per-coordinate learning rate $(\lambda_i \mu_j)^{-1/2}$ only changes when the preconditioner is recomputed, because the magnitudes $\lambda_i, \mu_j$ are baked into the matrix-root. Between refreshes the algorithm uses a stale Adafactor approximation. SOAP, by contrast, keeps the rotation fixed between refreshes but updates the Adam second moment $V$ in the rotated basis at every step. This means SOAP behaves like full Adam in a coordinate system that happens to be aligned with the dominant curvature directions of the weight matrix, and it remains adaptive between expensive eigendecompositions.[^1]
The authors report that Shampoo's loss curves degrade rapidly when the preconditioning frequency is raised, while SOAP "degrades significantly slower" because of this continual second-moment tracking.[^1]
Letting $m \geq n$ for an $m \times n$ layer, the paper gives the following per-step costs (excluding the periodic eigendecomposition):[^1]
| Quantity | AdamW | Distributed Shampoo | SOAP |
|---|---|---|---|
| Optimizer state per layer | $3mn$ | $2m^2 + 2n^2 + 3mn$ | $2m^2 + 2n^2 + 3mn$ |
| Per-step matmul cost | $\Theta(mn)$ | $m^3 + n^3 + m^2 n + m n^2$ | $m^3 + n^3 + 2m^2 n + 2 m n^2$ |
| Eigendecomposition every | n/a | $f$ steps | $f$ steps |
SOAP has the same memory footprint as Distributed Shampoo (the two are dominated by the $L$, $R$, $Q_L$, $Q_R$ matrices), and a slightly higher per-step matmul cost than Shampoo because of the extra rotate-in and rotate-out passes. In wall-clock terms this overhead is small relative to the forward-backward pass for typical Transformer dimensions, and is more than offset by the reduced number of steps needed to reach a given loss.[^1] When a layer dimension is so large that an eigendecomposition becomes prohibitive, the SOAP implementation falls back to identity rotation matrices on that axis, recovering AdamW behavior locally.[^2]
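To get a feel for the magnitudes, the table's expressions can be evaluated directly. The layer shapes below are hypothetical examples, and the functions simply transcribe the formulas above without adding any information about real implementations.

```python
def optimizer_state_entries(m, n):
    """Stored entries per m x n layer, using the expressions in the table above."""
    kron = 2 * m**2 + 2 * n**2 + 3 * m * n
    return {"AdamW": 3 * m * n, "Distributed Shampoo": kron, "SOAP": kron}

def per_step_matmul_cost(m, n):
    """Per-step matmul cost expressions, again copied from the table."""
    return {
        "Distributed Shampoo": m**3 + n**3 + m**2 * n + m * n**2,
        "SOAP": m**3 + n**3 + 2 * m**2 * n + 2 * m * n**2,
    }

# Hypothetical layer shapes chosen only for illustration.
for m, n in [(1024, 1024), (4096, 1024)]:
    state = optimizer_state_entries(m, n)
    cost = per_step_matmul_cost(m, n)
    print(f"{m}x{n}: SOAP/AdamW state = {state['SOAP'] / state['AdamW']:.2f}, "
          f"SOAP/Shampoo matmul = {cost['SOAP'] / cost['Distributed Shampoo']:.2f}")
```

For a square layer ($m = n$) the expressions reduce to a state ratio of $7/3$ against AdamW and a matmul ratio of $3/2$ against Distributed Shampoo.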
The paper evaluates SOAP on decoder-only Transformer language models with 360M and 660M parameters, trained on a language modeling corpus.[^1] Experiments use a single NVIDIA H100 GPU with gradient accumulation to reach large effective batch sizes.[^1] Two regimes are studied: a large-batch regime with 2M-token batches and a small-batch regime.
Both Chinchilla-optimal token counts ($\approx 20\times$ model parameters) and longer 100$\times$ runs are reported.[^1] Baselines are AdamW with standard recipe ($\beta_1 = 0.9$, $\beta_2 = 0.95$) and Meta's Distributed Shampoo with $\beta_2 = 0.95$ and 1/2 power.[^1][^6]
In the large-batch regime, SOAP reaches the same final validation loss as AdamW in 40% fewer iterations and 35% less wall-clock time, and outperforms Shampoo by approximately 20% on both metrics.[^1][^2] In the small-batch regime, the gap narrows but remains material: at least 25% iteration savings versus AdamW and roughly 10% versus Shampoo.[^1] As is generally the case for second-order methods, the gain over first-order baselines grows with batch size, which is consistent with the broader literature on the "critical batch size" beyond which adaptive first-order methods saturate.[^1]
The most informative ablation is the sweep over preconditioning frequency $f$.[^1] At $f = 1$, Shampoo and SOAP behave nearly identically, as predicted by the theoretical equivalence. As $f$ increases, Shampoo's loss curve degrades noticeably by $f = 25$ and badly by $f = 100$, because its learning-rate scales become stale. SOAP, by contrast, remains close to its $f = 1$ performance well past $f = 100$, because the Adam second moment in the rotated basis continues to adapt. This is the operational reason SOAP can keep its eigendecompositions infrequent (and therefore cheap in amortized terms) without giving up the adaptivity that makes second-order methods effective.[^1]
Shortly after the paper appeared, the SOAP authors released a fork of Keller Jordan's modded-nanoGPT benchmark, replacing the OrthogonalNesterov optimizer with SOAP for the 2D layers (keeping AdamW for the input and output projections).[^9] On a 124M-parameter GPT-2-style model trained on the FineWeb corpus, the SOAP fork reports reaching a validation loss of 3.2564 using 3.67B tokens and roughly 10% fewer iterations than the AdamW-equivalent baseline.[^9] The hyperparameters that work in this setting are learning rate 0.0018 to 0.003, $\beta_1 = \beta_2 = 0.95$, zero weight decay, and preconditioning frequency 10.[^9] This is consistent with the recommended defaults in the official SOAP implementation: learning rate $3 \times 10^{-3}$, betas $(0.95, 0.95)$, weight decay $0.01$, precondition frequency $10$.[^2]
The reference implementation, released in September 2024 at github.com/nikhilvyas/SOAP, is a single-file PyTorch optimizer that handles 2D layers natively and exposes additional hyperparameters for higher-dimensional tensors.[^2] The authors describe it as a "preliminary" implementation, noting plans to add support for lower-precision arithmetic and distributed training.[^2] A community JAX port by Haydn Jones exists but has not been verified by the original authors.[^2] As of the v2 arXiv revision in January 2025, the paper still reports single-GPU experiments and lists distributed and low-precision implementations as open work.[^1]
The paper was published at ICLR 2025 as a poster, and an earlier version appeared at the OPT 2024 workshop at NeurIPS 2024.[^10][^11]
SOAP can be situated against several other adaptive optimizers used in modern LLM pre-training:
| Optimizer | Per-coord adaptivity | Cross-coord preconditioning | Extra state vs Adam | Notes |
|---|---|---|---|---|
| Adam / AdamW | Yes | No | None | Baseline |
| AdaGrad | Yes (cumulative) | No | None | Predates Adam |
| Adafactor | Approximate (rank-1 factored) | No | Saves memory vs Adam | Used for T5 |
| Shampoo | Implicit (Adafactor in rotated basis) | Yes (Kronecker) | $O(m^2 + n^2)$ per layer | Original 2018 paper |
| Distributed Shampoo | Same as Shampoo | Yes | Same as Shampoo | Heuristics for stability and scale |
| Lion | Sign-based | No | Saves memory vs Adam | Discovered by program search |
| SOAP | Yes (full Adam in rotated basis) | Yes (Kronecker) | Same as Shampoo | Combines Adam and Shampoo |
A few comparisons deserve more detail:
SOAP is also closely related to two ideas in the prior literature that the authors explicitly call out:
The SOAP authors note that "we are the first to systematically evaluate" the specific combination of Adam moments inside the Shampoo eigenbasis, despite the conceptual precedents.[^1]
A 2025 follow-up by Vyas, "Improving SOAP using Iterative Whitening and Muon", explores combining SOAP with Newton-Schulz-based orthogonalization in the style of Keller Jordan's Muon optimizer, suggesting that the Shampoo-Adam axis and the orthogonalization axis are largely complementary.[^12]
The SOAP paper is upfront about the regime in which its claims are validated.[^1] The reported experiments are at 360M and 660M parameters, "two orders of magnitude" smaller than frontier models.[^1] The strongest results require large batches (2M tokens), where second-order methods generally have more headroom over first-order ones.[^1] All measurements are made on a single H100 with gradient accumulation, so the distributed efficiency story (which is what made Shampoo practical at Google scale) is unresolved.[^1][^2] The reference PyTorch implementation runs in fp32 and does not yet support sharded optimizer state across data-parallel ranks, both of which the authors flag as planned future work.[^2]
There are also subtler open questions that the paper raises but does not fully resolve:
Independent third-party reproductions at trillion-token scale are still scarce as of mid-2026, and the most rigorous evidence remains at the 360M / 660M scale plus the NanoGPT-speedrun-style benchmarks.[^9]
The original SOAP preprint was posted on arXiv on 17 September 2024, with code released on GitHub the same day.[^1][^2] The work was performed primarily at Harvard University (Vyas, Morwani, Zhao, Kwun, Shapira, Brandfonbrener, Janson, Kakade), with some co-authors also connected to the Massachusetts Institute of Technology and Google DeepMind through Sham Kakade's group at Harvard's Kempner Institute.[^1] A revised version 2 was uploaded on 31 January 2025, contemporaneously with acceptance to ICLR 2025 as a poster.[^1][^10] An earlier version of the same content was presented at the OPT 2024 optimization workshop at NeurIPS 2024, where Depen Morwani delivered the contributed talk.[^11]
In the months after release, SOAP became a frequent reference point in two communities. The first was the broader optimizer-research community, where the Shampoo-equals-Adafactor result was widely cited and prompted follow-up theoretical work, including the Kullback-Leibler-minimization reformulation of Shampoo and the gradient-whitening view of SOAP itself.[^8][^12] The second was the modded-nanoGPT speedrun community organized around Keller Jordan's benchmark, where SOAP entries appeared on the leaderboard alongside the contemporaneous Muon optimizer and the older AdamW baselines.[^9]
SOAP has had two distinct kinds of impact since its release in September 2024.[^1]
First, theoretically: it makes precise the long-suspected relationship between Shampoo and Adam-family methods. Combined with the parallel "New Perspective on Shampoo's Preconditioner" paper,[^8] it reframes Shampoo from "an approximation to a true second-order method" to "an Adam-style first-order method operating in a rotated basis". This reframing is now standard in the optimizer literature and has informed subsequent work on whitening-based optimizers.[^12]
Second, empirically: SOAP has become a frequent reference point in pre-training benchmarks. On the modded-nanoGPT speedrun leaderboard it features in several entries; the 124M-parameter loss target of 3.28 was reached with roughly 10% fewer tokens than the AdamW-equivalent baseline once SOAP was dropped in for the 2D layers.[^9] At larger scales the public evidence is still limited to the paper's 360M and 660M experiments, but the algorithm is simple enough that several open-source pre-training stacks have adopted it as an option.[^2]
The broader trend SOAP exemplifies is the partial rehabilitation of preconditioned methods for deep learning. Second-order methods were largely abandoned in the LLM era because Adam was robust, cheap, and "good enough"; SOAP and contemporary methods such as Muon argue that, in the large-batch pre-training regime that frontier labs actually operate in, the extra optimizer-state memory of a Kronecker-factored preconditioner buys back significant wall-clock time, even after accounting for the eigendecomposition overhead.