SOAP (ShampoO with Adam in the Preconditioner's eigenbasis)

SOAP (ShampoO with Adam in the Preconditioner's eigenbasis) is a second-order optimization algorithm for training deep neural networks, introduced by Nikhil Vyas, Depen Morwani, Rosie Zhao, Mujin Kwun, Itai Shapira, David Brandfonbrener, Lucas Janson, and Sham Kakade in a September 2024 paper titled "SOAP: Improving and Stabilizing Shampoo using Adam".[^1] The method runs an Adam-style adaptive update inside the eigenbasis of the Shampoo preconditioner, so that per-coordinate second-moment statistics are tracked in a slowly rotating coordinate system instead of the raw parameter space. Vyas and coauthors prove that, when its preconditioner is refreshed every step, the canonical 1/2-power Shampoo update is mathematically equivalent to running Adafactor in that eigenbasis; SOAP exploits this equivalence by replacing the implicit Adafactor inside Shampoo with an explicit Adam.[^1] In language-model pre-training experiments at 360M and 660M parameters, SOAP reduces the number of optimizer steps to a target loss by roughly 40% over AdamW and 20% over Shampoo, with comparable savings in wall-clock time, while introducing only one new hyperparameter (the preconditioning frequency) beyond standard Adam.[^1][^2]

Background

A note on terminology

The acronym SOAP is spelled out by the capitalized letters in "ShampoO with Adam in the Preconditioner's eigenbasis". The capitalization is unusual, but it makes the algorithmic recipe explicit in the name: take Shampoo's eigenbasis, run Adam inside it. The "preconditioner" refers to the implicit metric induced by Shampoo's per-mode Gram-matrix accumulators, not to a classical Newton or Gauss-Newton preconditioner. Throughout the paper and the implementation, "Adam" is used loosely to mean "AdamW" (i.e., with decoupled weight decay), which is the version SOAP actually invokes inside the rotated basis.[^1][^3]

The first-order baseline: Adam and AdamW

Adam is a first-order adaptive optimizer that maintains exponential moving averages of the gradient (the first moment) and the squared gradient (the second moment), then divides the momentum by the square root of the second moment to produce a step. AdamW modifies Adam by decoupling weight decay from the gradient-based update, which has become the de facto recipe for training transformer language models.[^3] Adam is cheap to run (it stores just two scalars of state per parameter) and treats every coordinate independently, ignoring any correlation structure in the curvature of the loss surface. This independence is part of why it works robustly across architectures, but it also means Adam cannot exploit correlations between parameters of the same weight matrix.
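The update described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code of any particular library: the function name `adamw_step` and its default hyperparameters are assumptions made for the example.

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.95, eps=1e-8, wd=0.0):
    """One AdamW update on arrays of identical shape; t is the 1-based step count."""
    m = b1 * m + (1 - b1) * g          # first moment: EMA of the gradient
    v = b2 * v + (1 - b2) * g * g      # second moment: EMA of the squared gradient
    m_hat = m / (1 - b1 ** t)          # bias correction for the zero initialization
    v_hat = v / (1 - b2 ** t)
    # Per-coordinate step plus decoupled weight decay (the "W" in AdamW).
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps) - lr * wd * w
    return w, m, v

w = np.array([1.0, -2.0, 3.0])
g = np.array([0.5, -100.0, 1e-3])      # gradients of very different magnitude
w1, m1, v1 = adamw_step(w, g, np.zeros(3), np.zeros(3), t=1)
```

On the very first step the bias-corrected ratio is approximately the sign of the gradient, so each coordinate moves by about `lr` regardless of gradient scale. That is the per-coordinate independence the paragraph refers to.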

For matrix-valued parameters such as the dense layers of a Transformer, the second moment $V$ that Adam stores is the same shape as the weight, $\mathbb{R}^{m\times n}$. AdamW holds three matrices of that shape (parameters, momentum, second moment), and modern large language model training accordingly devotes a significant fraction of accelerator memory just to optimizer state.[^3]

Shampoo and full-matrix preconditioning

Full-matrix preconditioned methods such as full-matrix AdaGrad would form and invert an $mn \times mn$ matrix per layer, which is intractable for any nontrivial model. Shampoo, introduced by Vineet Gupta, Tomer Koren, and Yoram Singer at ICML 2018, addresses this by approximating the full preconditioner by a Kronecker product of two smaller per-mode preconditioners.[^4] For an $m \times n$ weight matrix $W$ with gradient $G$, Shampoo accumulates:

$$L_t = L_{t-1} + G_t G_t^\top \quad \in \mathbb{R}^{m\times m}$$

$$R_t = R_{t-1} + G_t^\top G_t \quad \in \mathbb{R}^{n\times n}$$

and applies the update $W_{t+1} = W_t - \eta\, L_t^{-1/4}\, G_t\, R_t^{-1/4}$ (with exponent 1/4 in the original formulation).[^4] The inverse fourth roots are computed by eigendecomposing $L_t$ and $R_t$, which is the expensive step. Shampoo was originally derived as a structure-aware approximation to full-matrix AdaGrad and came with convergence guarantees in the stochastic convex setting.[^4]
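A minimal NumPy sketch of this update, assuming dense eigendecompositions at every step, may help make the shapes concrete. The helper names `inv_root` and `shampoo_step`, the eigenvalue floor, and the small multiple of the identity used to initialize the accumulators are assumptions of the sketch rather than details taken from the text above.

```python
import numpy as np

def inv_root(mat, power, eps=1e-12):
    """Return mat^(-1/power) for a symmetric PSD matrix via eigendecomposition."""
    evals, evecs = np.linalg.eigh(mat)
    evals = np.maximum(evals, eps)            # guard against tiny or negative eigenvalues
    return (evecs * evals ** (-1.0 / power)) @ evecs.T

def shampoo_step(W, G, L, R, lr=0.1, power=4):
    """Accumulate the per-mode Gram matrices, then precondition G on both sides."""
    L = L + G @ G.T                           # m x m left accumulator
    R = R + G.T @ G                           # n x n right accumulator
    W = W - lr * inv_root(L, power) @ G @ inv_root(R, power)
    return W, L, R

rng = np.random.default_rng(0)
m, n = 3, 2
W = rng.standard_normal((m, n))
G = rng.standard_normal((m, n))
# Starting the accumulators at a small multiple of the identity is an assumption here.
W1, L1, R1 = shampoo_step(W, G, 1e-6 * np.eye(m), 1e-6 * np.eye(n))
```

Passing `power=2` gives the 1/2-power variant. The two calls to `inv_root` are where the cubic cost in $m$ and $n$ arises.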

A 2020 Google paper by Rohan Anil, Vineet Gupta, Tomer Koren, Kevin Regan, and Yoram Singer, "Scalable Second Order Optimization for Deep Learning", described a distributed implementation of Shampoo that ran the inverse-root step asynchronously on CPUs while the GPUs computed gradients, and demonstrated wall-clock-time wins on production-scale workloads.[^5] Meta later released a PyTorch reimplementation called Distributed Shampoo, which is the implementation the SOAP authors used as their Shampoo baseline.[^6] In modern usage, the 1/2 power $L_t^{-1/2}\, G_t\, R_t^{-1/2}$ has become standard, because Shampoo's preconditioner is better viewed as an approximation of a single-mode whitening operator than of full-matrix AdaGrad.[^1]

Adafactor and factored moments

Adafactor, introduced by Noam Shazeer and Mitchell Stern in 2018, sidesteps the memory cost of Adam's second moment by storing only the row and column sums of the squared gradients, $A \in \mathbb{R}^m$ and $C \in \mathbb{R}^n$, then reconstructing a rank-one factored second moment $V_{ij} \approx A_i C_j / \sum_k A_k$.[^7] This drops the optimizer-state cost for a weight matrix from $O(mn)$ to $O(m+n)$, which is critical at billion-parameter scale and is part of why Adafactor was used to train Google's 11B-parameter T5.[^7] Adafactor also includes update clipping and a slow second-moment decay schedule for stability.[^7]
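The rank-one reconstruction can be illustrated with a short NumPy sketch. It covers only the factorization itself. The moving averages, update clipping, and decay schedule mentioned above are omitted, and the function name `factored_second_moment` is hypothetical.

```python
import numpy as np

def factored_second_moment(sq_grads):
    """Rank-one reconstruction V_ij ~ A_i * C_j / sum_k A_k from row and column sums."""
    A = sq_grads.sum(axis=1)                  # row sums, shape (m,)
    C = sq_grads.sum(axis=0)                  # column sums, shape (n,)
    return np.outer(A, C) / A.sum()

# The reconstruction is exact whenever the squared-gradient matrix is itself rank one.
row_scale = np.array([1.0, 2.0, 3.0])
col_scale = np.array([0.5, 4.0])
V_true = np.outer(row_scale, col_scale)
V_hat = factored_second_moment(V_true)
```

Only the $m + n$ numbers in `A` and `C` need to be stored. The $m \times n$ matrix is rebuilt on the fly when the update is formed.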

The SOAP paper observes that this factored second moment is structurally similar to Shampoo's $L$ and $R$ accumulators, and shows that the similarity is not a coincidence.[^1]

The theoretical connection

A central contribution of Vyas et al. is a precise statement of the relationship between Shampoo and Adafactor. The paper proves that Shampoo with the 1/2 power is equivalent to running Adafactor in the eigenbasis of Shampoo's preconditioner when the preconditioner is recomputed at every step.[^1]

The proof is short. Eigendecompose $L = Q_L \Lambda Q_L^\top$ and $R = Q_R M Q_R^\top$, with diagonal eigenvalues $\lambda_i$ and $\mu_j$. The Shampoo update with exponent 1/2 acts on the gradient $G$ as $L^{-1/2} G R^{-1/2}$. In the basis $(Q_L, Q_R)$, this is a coordinate-wise rescaling: the $(i,j)$ entry of the rotated gradient is divided by $(\lambda_i \mu_j)^{1/2}$. Adafactor's rank-one factored second moment, applied in the same rotated basis, also produces a coordinate-wise rescaling, dividing entry $(i,j)$ by $(A_i C_j / \sum_k A_k)^{1/2}$. When the statistics are accumulated over the same gradients, the row and column sums in the rotated basis are exactly the eigenvalues, $A_i = \lambda_i$ and $C_j = \mu_j$, so the two rescalings differ only by the scalar factor $(\sum_k \lambda_k)^{1/2}$. Once that factor is normalized away (or absorbed into the learning rate), the two updates coincide.[^1]
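The argument can be checked numerically. The following sketch assumes plain sums (no moving averages) over a handful of random gradients and dense eigendecompositions, and all variable names are illustrative. In this setting the Adafactor statistics collected in the rotated basis come out equal to the eigenvalues, and the two updates agree up to the scalar $\sqrt{\operatorname{tr}(L)}$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, num_grads = 3, 4, 8
grads = rng.standard_normal((num_grads, m, n))

# Shampoo accumulators summed over several gradients so that L and R are full rank.
L = sum(G @ G.T for G in grads)
R = sum(G.T @ G for G in grads)
lam, Q_L = np.linalg.eigh(L)
mu, Q_R = np.linalg.eigh(R)

# Adafactor statistics collected in the rotated basis.
rotated = np.array([Q_L.T @ G @ Q_R for G in grads])
A = (rotated ** 2).sum(axis=(0, 2))           # row sums of squared rotated gradients
C = (rotated ** 2).sum(axis=(0, 1))           # column sums of squared rotated gradients

P_L = (Q_L * lam ** -0.5) @ Q_L.T             # L^{-1/2}
P_R = (Q_R * mu ** -0.5) @ Q_R.T              # R^{-1/2}
G = grads[-1]
shampoo_rot = Q_L.T @ (P_L @ G @ P_R) @ Q_R   # Shampoo update viewed in the eigenbasis
adafactor_rot = (Q_L.T @ G @ Q_R) / np.sqrt(np.outer(A, C) / A.sum())
scale = np.sqrt(np.trace(L))                  # the only difference between the two
```

Comparing `adafactor_rot` with `scale * shampoo_rot` shows the two arrays matching entry by entry, which is the coordinate-wise statement of the equivalence.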

The corollary is striking: Shampoo, often viewed as a fundamentally different "second-order" algorithm, is in fact a memory-light approximation of an adaptive first-order method, just operating in a rotated basis chosen to align with the dominant curvature directions of the layer's Gram matrices. Distributed Shampoo's various heuristics, including grafting onto Adam's per-parameter scale, can then be reinterpreted as efforts to compensate for the gap between Shampoo's rank-one Adafactor and the full per-coordinate Adam.[^1]

This insight is closely related to a parallel result by Depen Morwani and coauthors, "A New Perspective on Shampoo's Preconditioner", which derived the 1/2-power Shampoo update as an optimal Kronecker approximation to a whitening preconditioner rather than to full-matrix AdaGrad.[^8] Read together, the two papers reposition Shampoo as a rotated, factored adaptive method rather than as an approximation to a true second-order Newton-type step.

The SOAP algorithm

If Shampoo is "Adafactor in Shampoo's eigenbasis", then there is a natural strict improvement: replace the rank-one Adafactor with full per-coordinate Adam, while keeping the rotation. That is SOAP.[^1]

For an $m \times n$ weight matrix $W$, SOAP maintains six pieces of state:

State	Shape	Purpose
$L$	$m \times m$	Left Shampoo accumulator (EMA of $GG^\top$)
$R$	$n \times n$	Right Shampoo accumulator (EMA of $G^\top G$)
$Q_L$	$m \times m$	Cached left eigenvectors of $L$
$Q_R$	$n \times n$	Cached right eigenvectors of $R$
$M$	$m \times n$	Adam first moment in the rotated basis
$V$	$m \times n$	Adam second moment in the rotated basis

Per step, given gradient $G_t$:

Rotate: $\tilde G_t = Q_L^\top G_t Q_R$, projecting the gradient into the cached eigenbasis.
Adam in the rotated basis: update the rotated momentum and second moment with the usual $\beta_1$ and $\beta_2$ EMAs on $\tilde G_t$, then form the preconditioned rotated update $\tilde U_t = \hat M_t / (\sqrt{\hat V_t} + \epsilon)$ with bias correction.
Rotate back: $U_t = Q_L \tilde U_t Q_R^\top$.
Apply update: $W_{t+1} = W_t - \eta\, U_t - \eta\, \lambda\, W_t$ (AdamW-style decoupled weight decay $\lambda$).
Update Shampoo accumulators: $L \gets \beta_2 L + (1-\beta_2)\, G_t G_t^\top$ and $R \gets \beta_2 R + (1-\beta_2)\, G_t^\top G_t$ as exponential moving averages.
Refresh eigenvectors every $f$ steps: recompute $Q_L$ and $Q_R$, typically via one step of subspace iteration plus a QR decomposition rather than a full eigendecomposition.

When the rotation matrices are refreshed, the Adam moments $M, V$ are themselves rotated into the new basis so that no statistics are lost.[^1] One-dimensional parameters (biases, layer-norm scales) fall back to plain AdamW because there is only one mode to precondition. For higher-order tensors with more than two axes, SOAP generalizes naturally: maintain one accumulator per mode.[^2]
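The "one accumulator per mode" generalization can be sketched with `np.tensordot`, which contracts every axis except the one being preconditioned. This is an illustrative sketch only: the function names `mode_gram` and `update_accumulators` are hypothetical, and the reference implementation may organize the computation differently.

```python
import numpy as np

def mode_gram(G, mode):
    """Gram matrix of the mode-`mode` unfolding of G: contracts every other axis."""
    other = [ax for ax in range(G.ndim) if ax != mode]
    return np.tensordot(G, G, axes=(other, other))

def update_accumulators(accs, G, b2=0.95):
    """EMA update of one accumulator per tensor mode."""
    return [b2 * acc + (1 - b2) * mode_gram(G, k) for k, acc in enumerate(accs)]

rng = np.random.default_rng(0)
G2 = rng.standard_normal((3, 4))              # ordinary matrix gradient
G3 = rng.standard_normal((2, 3, 4))           # three-axis gradient
accs = update_accumulators([np.zeros((d, d)) for d in G3.shape], G3)
```

For a matrix, `mode_gram(G2, 0)` and `mode_gram(G2, 1)` reduce to $GG^\top$ and $G^\top G$, so the two-dimensional case is recovered as a special case.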

The single new hyperparameter is the preconditioning frequency $f$, the number of optimizer steps between eigenvector refreshes. Setting $f = 1$ recovers a Shampoo-equivalent regime; large $f$ amortizes the eigendecomposition cost across many cheap rotated-Adam steps.[^1]

Pseudocode (simplified, one matrix layer)

Inputs: lr eta, betas (b1, b2), weight decay lambda, frequency f, epsilon eps
Initialize W; L, R, M, V <- zeros; Q_L, Q_R <- identity

for t = 1, 2, ...:
    G = compute_gradient(W)

    # 1. Rotate gradient
    Gtil = Q_L.T @ G @ Q_R

    # 2. Adam in rotated basis
    M = b1*M + (1-b1)*Gtil
    V = b2*V + (1-b2)*Gtil*Gtil
    Mhat = M / (1 - b1**t)
    Vhat = V / (1 - b2**t)
    Util = Mhat / (sqrt(Vhat) + eps)

    # 3. Rotate update back
    U = Q_L @ Util @ Q_R.T

    # 4. AdamW step
    W = W - eta * U - eta * lambda * W

    # 5. Shampoo EMAs
    L = b2*L + (1-b2)*(G @ G.T)
    R = b2*R + (1-b2)*(G.T @ G)

    # 6. Refresh basis every f steps (rotate M, V into new basis)
    if t % f == 0:
        Q_L_new = eigenvectors(L); Q_R_new = eigenvectors(R)
        M = Q_L_new.T @ Q_L @ M @ Q_R.T @ Q_R_new
        V = Q_L_new.T @ Q_L @ V @ Q_R.T @ Q_R_new
        Q_L, Q_R = Q_L_new, Q_R_new

The above follows Algorithm 3 in the paper.[^1]
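For readers who want something executable, steps 1 to 5 of the pseudocode translate almost line for line into NumPy. The sketch below is a simplified illustration under stated assumptions, not the reference implementation: the names `soap_step` and `init_state` are hypothetical, the basis refresh (step 6) is deliberately left out, and weight decay defaults to zero.

```python
import numpy as np

def soap_step(W, G, state, t, lr=3e-3, b1=0.95, b2=0.95, eps=1e-8, wd=0.0):
    """Steps 1 to 5 of the pseudocode for one m x n layer; state is updated in place."""
    Q_L, Q_R = state["Q_L"], state["Q_R"]
    G_rot = Q_L.T @ G @ Q_R                                   # 1. rotate the gradient
    state["M"] = b1 * state["M"] + (1 - b1) * G_rot           # 2. Adam moments, rotated basis
    state["V"] = b2 * state["V"] + (1 - b2) * G_rot ** 2
    m_hat = state["M"] / (1 - b1 ** t)
    v_hat = state["V"] / (1 - b2 ** t)
    U = Q_L @ (m_hat / (np.sqrt(v_hat) + eps)) @ Q_R.T        # 3. rotate the update back
    W = W - lr * U - lr * wd * W                              # 4. decoupled weight decay
    state["L"] = b2 * state["L"] + (1 - b2) * (G @ G.T)       # 5. Shampoo EMAs
    state["R"] = b2 * state["R"] + (1 - b2) * (G.T @ G)
    return W

def init_state(m, n):
    """Zero accumulators and moments, identity rotations."""
    return {"L": np.zeros((m, m)), "R": np.zeros((n, n)),
            "Q_L": np.eye(m), "Q_R": np.eye(n),
            "M": np.zeros((m, n)), "V": np.zeros((m, n))}

rng = np.random.default_rng(0)
W0, G0 = rng.standard_normal((3, 2)), rng.standard_normal((3, 2))
W1 = soap_step(W0, G0, init_state(3, 2), t=1)
```

With identity rotations the function is an ordinary Adam step, which is a convenient sanity check before plugging in non-trivial $Q_L$ and $Q_R$.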

Choosing the preconditioning frequency

The frequency $f$ trades off two costs. A small $f$ means eigendecompositions happen often, so the cached $Q_L, Q_R$ closely track the moving Shampoo accumulators, but each refresh costs $O(m^3 + n^3)$. A large $f$ amortizes the eigendecomposition cost over many cheap rotated-Adam steps, but allows the basis to grow stale. The recommended default in the reference implementation is $f = 10$ for batch-size-2M Transformer pre-training; values up to $f = 80$ are reported as still effective at smaller batch sizes.[^1][^2] In practice, the eigendecomposition is implemented with a power-iteration warm start using the cached eigenvectors, which converges in a small number of inner iterations because the underlying matrix changes slowly between calls.[^1]
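The warm-started refresh can be sketched as a single multiplication followed by a QR factorization. The function name `refresh_basis` and the test matrix with hand-picked eigenvalues are assumptions made for illustration. The routine in the reference implementation may differ in details such as precision handling or eigenvalue ordering.

```python
import numpy as np

def refresh_basis(P, Q):
    """One power-iteration step warm-started at Q, re-orthonormalized with QR."""
    Q_new, _ = np.linalg.qr(P @ Q)
    return Q_new

rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((4, 4)))
# Symmetric PSD stand-in for L or R, with well-separated eigenvalues chosen by hand.
P = (U * np.array([4.0, 3.0, 2.0, 1.0])) @ U.T

Q = np.eye(4)                                 # cold start; SOAP would pass the cached basis
for _ in range(100):
    Q = refresh_basis(P, Q)

# Q stays orthonormal, and Q.T @ P @ Q approaches a diagonal matrix as iterations accumulate.
rayleigh = Q.T @ P @ Q
off_diag = rayleigh - np.diag(np.diag(rayleigh))
```

Each call costs one matrix product and one QR factorization. Because the cached basis is already close to the current eigenvectors when the accumulator drifts slowly, a single call per refresh can be enough in practice, whereas the cold start above needs many iterations.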

A subtle implementation choice is whether to rotate the Adam moments $M, V$ into the new eigenbasis when the basis is refreshed, or to reset them. The SOAP paper rotates rather than resets, treating the moments as fixed vectors expressed in different coordinate systems, so that no statistical information is discarded at a basis refresh.[^1] Empirically this is important: dropping the rotation step (i.e., naively resetting the moments at refresh) brings back most of the degradation that Shampoo shows with infrequent updates.[^1]

Why this is faster than Shampoo

Shampoo's effective per-coordinate learning rate $(\lambda_i \mu_j)^{-1/2}$ only changes when the preconditioner is recomputed, because the magnitudes $\lambda_i, \mu_j$ are baked into the matrix-root. Between refreshes the algorithm uses a stale Adafactor approximation. SOAP, by contrast, keeps the rotation fixed between refreshes but updates the Adam second moment $V$ in the rotated basis at every step. This means SOAP behaves like full Adam in a coordinate system that happens to be aligned with the dominant curvature directions of the weight matrix, and it remains adaptive between expensive eigendecompositions.[^1]

The authors report that Shampoo's loss curves degrade rapidly when the preconditioning frequency is raised, while SOAP "degrades significantly slower" because of this continual second-moment tracking.[^1]

Compute and memory cost

Letting $m \geq n$ for an $m \times n$ layer, the paper gives the following per-step costs (excluding the periodic eigendecomposition):[^1]

Quantity	AdamW	Distributed Shampoo	SOAP
Optimizer state per layer	$3mn$	$2m^2 + 2n^2 + 3mn$	$2m^2 + 2n^2 + 3mn$
Per-step matmul cost	$\Theta(mn)$	$m^3 + n^3 + m^2 n + m n^2$	$m^3 + n^3 + 2m^2 n + 2 m n^2$
Eigendecomposition every	n/a	$f$ steps	$f$ steps

SOAP has the same memory footprint as Distributed Shampoo (the two are dominated by the $L$, $R$, $Q_L$, $Q_R$ matrices), and a slightly higher per-step matmul cost than Shampoo because of the extra rotate-in and rotate-out passes. In wall-clock terms this overhead is small relative to the forward-backward pass for typical Transformer dimensions, and is more than offset by the reduced number of steps needed to reach a given loss.[^1] When a layer dimension is so large that an eigendecomposition becomes prohibitive, the SOAP implementation falls back to identity rotation matrices on that axis, recovering AdamW behavior locally.[^2]
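To get a feel for the memory overhead, the state formulas from the table can be evaluated for a few layer shapes. The shapes below are hypothetical and chosen only for illustration. The counts are numbers of stored entries, not bytes.

```python
def optimizer_state_entries(m, n):
    """Entries stored per m x n layer, using the formulas from the cost table."""
    adamw = 3 * m * n                          # parameters, momentum, second moment
    soap = 2 * m * m + 2 * n * n + 3 * m * n   # adds L, R, Q_L, Q_R
    return adamw, soap

# Hypothetical layer sizes chosen only for illustration.
for m, n in [(1024, 1024), (4096, 1024), (4096, 4096)]:
    adamw, soap = optimizer_state_entries(m, n)
    print(f"{m}x{n}: AdamW {adamw:,} entries, SOAP {soap:,} entries, "
          f"ratio {soap / adamw:.2f}")
```

For a square layer the ratio is $7/3$ whatever the size, and it grows as the layer becomes more rectangular because the $m^2$ term dominates.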

Experimental results

Setup

The paper evaluates SOAP on decoder-only Transformer language models with 360M and 660M parameters, trained on a language modeling corpus.[^1] Experiments use a single NVIDIA H100 GPU with gradient accumulation to reach large effective batch sizes.[^1] The two regimes studied are:

Large-batch: 2,097,152 (2M) tokens per step.
Small-batch: 262,144 (256K) tokens per step.

Both Chinchilla-optimal token counts ($\approx 20\times$ model parameters) and longer 100$\times$ runs are reported.[^1] Baselines are AdamW with standard recipe ($\beta_1 = 0.9$, $\beta_2 = 0.95$) and Meta's Distributed Shampoo with $\beta_2 = 0.95$ and 1/2 power.[^1][^6]

Headline numbers

In the large-batch regime, SOAP reaches the same final validation loss as AdamW in 40% fewer iterations and 35% less wall-clock time, and outperforms Shampoo by approximately 20% on both metrics.[^1][^2] In the small-batch regime, the gap narrows but remains material: at least 25% iteration savings versus AdamW and roughly 10% versus Shampoo.[^1] As is generally the case for second-order methods, the gain over first-order baselines grows with batch size, which is consistent with the broader literature on the "critical batch size" beyond which adaptive first-order methods saturate.[^1]

Preconditioning-frequency ablation

The most informative ablation is the sweep over preconditioning frequency $f$.[^1] At $f = 1$, Shampoo and SOAP behave nearly identically, as predicted by the theoretical equivalence. As $f$ increases, Shampoo's loss curve degrades noticeably by $f = 25$ and badly by $f = 100$, because its learning-rate scales become stale. SOAP, by contrast, remains close to its $f = 1$ performance well past $f = 100$, because the Adam second moment in the rotated basis continues to adapt. This is the operational reason SOAP can keep its eigendecompositions infrequent (and therefore cheap in amortized terms) without giving up the adaptivity that makes second-order methods effective.[^1]

NanoGPT speedrun

Shortly after the paper appeared, the SOAP authors released a fork of Keller Jordan's modded-nanoGPT benchmark, replacing the OrthogonalNesterov optimizer with SOAP for the 2D layers (keeping AdamW for the input and output projections).[^9] On a 124M-parameter GPT-2-style model trained on the FineWeb corpus, the SOAP fork reports reaching a target validation loss of 3.2564 using 3.67B tokens and roughly 10% fewer iterations than the AdamW-equivalent baseline.[^9] The hyperparameters that work in this setting are learning rate 0.0018 to 0.003, $\beta_1 = \beta_2 = 0.95$, zero weight decay, and preconditioning frequency 10.[^9] Apart from the weight decay, this matches the recommended defaults in the official SOAP implementation: learning rate $3 \times 10^{-3}$, betas $(0.95, 0.95)$, weight decay $0.01$, precondition frequency $10$.[^2]

Implementation

The reference implementation, released in September 2024 at github.com/nikhilvyas/SOAP, is a single-file PyTorch optimizer that handles 2D layers natively and exposes additional hyperparameters for higher-dimensional tensors.[^2] The authors describe it as a "preliminary" implementation, noting plans to add support for lower-precision arithmetic and distributed training.[^2] A community JAX port by Haydn Jones exists but has not been verified by the original authors.[^2] As of the v2 arXiv revision in January 2025, the paper still reports single-GPU experiments and lists distributed and low-precision implementations as open work.[^1]

The paper is published at ICLR 2025 as a poster, and an earlier version appeared at the OPT 2024 workshop at NeurIPS 2024.[^10][^11]

Relationship to other optimizers

SOAP can be situated against several other adaptive optimizers used in modern LLM pre-training:

Optimizer	Per-coord adaptivity	Cross-coord preconditioning	Extra state vs Adam	Notes
Adam / AdamW	Yes	No	None	Baseline
AdaGrad	Yes (cumulative)	No	None	Predates Adam
Adafactor	Approximate (rank-1 factored)	No	Saves memory vs Adam	Used for T5
Shampoo	Implicit (Adafactor in rotated basis)	Yes (Kronecker)	$O(m^2 + n^2)$ per layer	Original 2018 paper
Distributed Shampoo	Same as Shampoo	Yes	Same as Shampoo	Heuristics for stability and scale
Lion	Sign-based	No	Saves memory vs Adam	Discovered by program search
SOAP	Yes (full Adam in rotated basis)	Yes (Kronecker)	Same as Shampoo	Combines Adam and Shampoo

A few comparisons deserve more detail:

vs. Adam / AdamW: SOAP is strictly more expensive per step but reaches the same loss in substantially fewer steps. It dominates AdamW in the large-batch regime where AdamW saturates.[^1]
vs. Shampoo: SOAP and Shampoo have the same optimizer state and the same eigendecomposition cost. SOAP adds a small overhead per step for the rotate-in / rotate-out passes, in exchange for sharply better tolerance to infrequent preconditioner updates. When preconditioning is infrequent (as is necessary at scale), SOAP wins clearly.[^1]
vs. Adafactor: Adafactor was designed primarily to save memory, not to accelerate convergence; SOAP is on the opposite end of the trade-off, spending more memory than Adam to converge faster than either Adam or Shampoo.[^1][^7]
vs. Lion: Lion is a sign-momentum method discovered by an evolutionary search; it uses less optimizer memory than AdamW but does not use any cross-coordinate information. The two sit at opposite ends of the cost trade-off: Lion minimizes per-step cost and memory, while SOAP spends more of both to converge in fewer steps.

SOAP is also closely related to two ideas in the prior literature that the authors explicitly call out:

E-KFAC runs a diagonal preconditioner between expensive Kronecker-factored inversions, similar in spirit to running Adam between Shampoo eigendecompositions.[^1]
GaLore maintains optimizer momentum in a preconditioned low-rank subspace; SOAP keeps the full rotation but applies the same conceptual move of "track moments in a basis chosen by the curvature structure".[^1]

The SOAP authors note that "we are the first to systematically evaluate" the specific combination of Adam moments inside the Shampoo eigenbasis, despite the conceptual precedents.[^1]

A 2025 follow-up by Vyas, "Improving SOAP using Iterative Whitening and Muon", explores combining SOAP with Newton-Schulz-based orthogonalization in the style of Keller Jordan's Muon optimizer, suggesting that the Shampoo-Adam axis and the orthogonalization axis are largely complementary.[^12]

Limitations and open questions

The SOAP paper is upfront about the regime in which its claims are validated.[^1] The reported experiments are at 360M and 660M parameters, "two orders of magnitude" smaller than frontier models.[^1] The strongest results require large batches (2M tokens), where second-order methods generally have more headroom over first-order ones.[^1] All measurements are made on a single H100 with gradient accumulation, so the distributed efficiency story (which is what made Shampoo practical at Google scale) is unresolved.[^1][^2] The reference PyTorch implementation runs in fp32 and does not yet support sharded optimizer state across data-parallel ranks, both of which the authors flag as planned future work.[^2]

There are also subtler open questions that the paper raises but does not fully resolve:

Stability at scale. It is unclear whether the eigenbasis trick remains numerically stable when layer dimensions reach into the tens of thousands, and how it interacts with mixed-precision training; the reference implementation uses identity rotations for very large dimensions, which gives up some of the benefit.[^2]
Generalization beyond language. All reported experiments are autoregressive language modeling on a single corpus. The paper does not test vision, multimodal, or RL workloads.[^1]
Critical-batch-size behavior. The gains over AdamW shrink as the batch size shrinks. For small-batch regimes (for example, single-node fine-tuning), the case for SOAP versus AdamW is weaker.[^1]
Hyperparameter robustness across model scales. The OpenReview discussion at ICLR highlighted that the optimizer's relative ranking versus AdamW and Shampoo is sensitive to whether each is given its own well-tuned learning rate; the SOAP comparisons rely on careful two-dimensional sweeps for each baseline.[^10]

Independent third-party reproductions at trillion-token scale are still scarce as of mid-2026, and the most rigorous evidence remains at the 360M / 660M scale plus the NanoGPT-speedrun-style benchmarks.[^9]

History and reception

The original SOAP preprint was posted on arXiv on 17 September 2024, with code released on GitHub the same day.[^1][^2] The work was performed primarily at Harvard University (Vyas, Morwani, Zhao, Kwun, Shapira, Brandfonbrener, Janson, Kakade), with some co-authors also connected to the Massachusetts Institute of Technology and Google DeepMind through Sham Kakade's group at Harvard's Kempner Institute.[^1] A revised version 2 was uploaded on 31 January 2025, contemporaneously with acceptance to ICLR 2025 as a poster.[^1][^10] An earlier version of the same content was presented at the OPT 2024 optimization workshop at NeurIPS 2024, where Depen Morwani delivered the contributed talk.[^11]

In the months after release, SOAP became a frequent reference point in two communities. The first was the broader optimizer-research community, where the Shampoo-equals-Adafactor result was widely cited and prompted follow-up theoretical work, including the Kullback-Leibler-minimization reformulation of Shampoo and the gradient-whitening view of SOAP itself.[^8][^12] The second was the modded-nanoGPT speedrun community organized around Keller Jordan's benchmark, where SOAP entries appeared on the leaderboard alongside the contemporaneous Muon optimizer and the older AdamW baselines.[^9]

Significance

SOAP has had two distinct kinds of impact since its release in September 2024.[^1]

First, theoretically: it makes precise the long-suspected relationship between Shampoo and Adam-family methods. Combined with the parallel "New Perspective on Shampoo's Preconditioner" paper,[^8] it reframes Shampoo from "an approximation to a true second-order method" to "an Adam-style first-order method operating in a rotated basis". This reframing is now standard in the optimizer literature and has informed subsequent work on whitening-based optimizers.[^12]

Second, empirically: SOAP has become a frequent reference point in pre-training benchmarks. On the modded-nanoGPT speedrun leaderboard it features in several entries; the 124M-parameter loss target of 3.28 was reached with roughly 10% fewer tokens than the AdamW-equivalent baseline once SOAP was dropped in for the 2D layers.[^9] At larger scales the public evidence is still limited to the paper's 360M and 660M experiments, but the algorithm is simple enough that several open-source pre-training stacks have adopted it as an option.[^2]

The broader trend SOAP exemplifies is the partial rehabilitation of preconditioned methods for deep learning. Second-order methods were largely abandoned in the LLM era because Adam was robust, cheap, and "good enough"; SOAP and contemporary methods such as Muon argue that, in the large-batch pre-training regime that frontier labs actually operate in, the extra optimizer-state memory of a Kronecker-factored preconditioner buys back significant wall-clock time, even after accounting for the eigendecomposition overhead.

See also

Adam optimizer: the first-order adaptive baseline SOAP is built on top of.
AdamW: decoupled weight decay variant of Adam, used as the strong first-order baseline in the SOAP paper.
AdaGrad: historical ancestor of both Adam and Shampoo.
Lion (optimizer): a contemporary alternative discovered by program search, on the memory-saving side of the trade-off.
GaLore (Gradient Low-Rank Projection): a related "moments in a preconditioned subspace" idea, applied to memory-efficient training.
RMSProp: early per-coordinate adaptive method.
Momentum: first-moment EMA underlying both Adam and SOAP.
Gradient Descent and Stochastic Gradient Descent (SGD): the broader algorithmic family.

References
