Shampoo (optimizer)

Shampoo is a second-order stochastic optimization algorithm for training neural networks that maintains, for each parameter tensor, a set of small full-matrix preconditioners (one per tensor mode) computed from outer products of the stochastic gradient. The algorithm was introduced by Vineet Gupta, Tomer Koren, and Yoram Singer at Google in the 2018 paper "Shampoo: Preconditioned Stochastic Tensor Optimization" (arXiv:1802.09568), and updates parameters using inverse matrix roots (the inverse fourth root for matrix-shaped parameters) of these factors.[^1] Shampoo belongs to the AdaGrad family and can be viewed as a structured approximation to full-matrix AdaGrad: instead of storing one giant preconditioning matrix per tensor, it stores one Kronecker factor per mode, giving memory cost O(n_1^2 + n_2^2 + ... + n_k^2) rather than O(n_1^2 * n_2^2 * ... * n_k^2).[^1] A distributed, hardware-aware variant developed by Anil et al. at Google was presented in "Scalable Second Order Optimization for Deep Learning" (arXiv:2002.09018) and demonstrated wall-clock speedups on Transformer, BERT, ImageNet, and click-through workloads.[^2] A PyTorch reference implementation, Distributed Shampoo, is maintained by Meta in the public repository facebookresearch/optimizers and won the external-tuning track of the inaugural MLCommons AlgoPerf: Training Algorithms Benchmark Competition in 2024, training the eight benchmark workloads roughly 28% faster than the NAdamW baseline.[^3][^4][^5]

Infobox

Field	Value
Algorithm family	Adaptive preconditioned gradient methods (AdaGrad family)
Original paper	Gupta, Koren, Singer, "Shampoo: Preconditioned Stochastic Tensor Optimization"
First arXiv version	2018-02-26 (arXiv:1802.09568)
Conference	ICML 2018
Scalable variant	Anil et al., "Scalable Second Order Optimization for Deep Learning" (arXiv:2002.09018, 2020)
Reference implementation	Distributed Shampoo, Meta (github.com/facebookresearch/optimizers)
Notable result	Won the AlgoPerf 2024 external-tuning track (about 28% faster than NAdamW baseline)
Memory per tensor	O(sum of squared mode sizes), not O(product of squared mode sizes)
Update root	Inverse fourth root for matrices; inverse 1/(2k) root for order-k tensors

Background

Stochastic optimization for deep networks is overwhelmingly dominated by first-order methods such as SGD with momentum, Adam, and AdamW. These methods use only diagonal information about gradient curvature: they multiply each coordinate of the gradient by a scalar that depends on its own running second moment. Diagonal preconditioning is cheap, but it cannot rotate the gradient into a more favorable coordinate system, which classical second-order methods such as Newton's method or natural gradient descent achieve by multiplying the gradient by a full inverse Hessian or inverse Fisher matrix.[^1]

The obstacle to full-matrix preconditioning is scale. For a layer with a weight matrix W of size n by m, the full Hessian or full-matrix AdaGrad preconditioner has size (nm) by (nm), so storing it costs O(n^2 * m^2) numbers and inverting it costs O(n^3 * m^3). Even a single dense fully connected layer of a modest transformer can have n*m in the millions, so the full-matrix approach is infeasible. Earlier work on approximate second-order methods, most prominently K-FAC by Martens and Grosse, approximated the Fisher matrix of feedforward layers as a Kronecker product A ⊗ B of two smaller matrices derived from layer activations and back-propagated derivatives.[^6] Shampoo borrows this Kronecker structure but obtains it directly from gradient statistics without any architectural assumption: the factors are accumulated outer products of the stochastic gradient with itself, contracted along all but one tensor mode at a time.[^1]

The Shampoo paper was published at ICML in 2018. The name is a pun: the algorithm "preconditions" gradients in the same way that hair shampoo "conditions" hair.[^1] The authors framed Shampoo as filling the algorithmic gap between purely diagonal preconditioners such as AdaGrad and full-matrix preconditioners that are mathematically attractive but computationally infeasible at deep learning scale.

How it works

Matrix case

Consider a matrix parameter W in R^{n by m} that receives a stochastic gradient G_t in R^{n by m} at step t. Vanilla full-matrix AdaGrad would treat vec(W) as a vector of length nm, accumulate the outer product H_t = sum_{s <= t} vec(G_s) vec(G_s)^T, and update vec(W) by H_t^{-1/2} vec(G_t). Shampoo replaces this single (nm by n*m) preconditioner with two much smaller matrices.[^1] At each step Shampoo maintains a "left" preconditioner L_t in R^{n by n} and a "right" preconditioner R_t in R^{m by m}:

L_t = L_{t-1} + G_t G_t^T R_t = R_{t-1} + G_t^T G_t

where the initial L_0 and R_0 are small multiples of the identity for numerical stability.[^1] L_t accumulates row covariance and R_t accumulates column covariance of all gradients seen so far. The parameter update uses the inverse fourth root of each factor:

W_{t+1} = W_t - eta * L_t^{-1/4} * G_t * R_t^{-1/4}

Here L_t^{-1/4} and R_t^{-1/4} are computed via spectral decomposition of the symmetric positive definite matrices L_t and R_t, raising the eigenvalues to the -1/4 power and reconstructing. The implicit full preconditioner is the Kronecker product L_t^{-1/4} ⊗ R_t^{-1/4}, which is equivalent to applying (L_t ⊗ R_t)^{-1/4}.[^1] The choice of the fourth root, rather than the square root used in AdaGrad, is exactly what makes the two-factor Kronecker product approximate the square root of the full AdaGrad statistic: by the matrix arithmetic-geometric mean inequality used in the convergence proof, L_t ⊗ R_t upper-bounds the full statistic, so taking its inverse fourth root yields the equivalent of an inverse square root on the bounded full statistic.[^1]

Memory drops from O(n^2 * m^2) for the full preconditioner to O(n^2 + m^2) for the two Kronecker factors. Compute per step (excluding the eigendecomposition) is dominated by the two matrix products L_t G_t and (L_t G_t) R_t, which cost O(n^2 * m + n * m^2), comparable to a single linear-algebra pass over the parameter and far cheaper than O(n^3 * m^3) full-matrix inversion.[^1]

Higher-order tensor case

The construction generalizes naturally to any order-k tensor parameter W in R^{n_1 by n_2 by ... by n_k}, such as the four-dimensional weight tensor of a convolutional layer. For each mode i, Shampoo maintains a preconditioner H_t^{(i)} in R^{n_i by n_i} formed by contracting the gradient tensor G_t with itself over every mode except i:

H_t^{(i)} = H_{t-1}^{(i)} + G_t (contracted over modes != i) G_t

Each mode preconditioner is raised to the inverse 1/(2k) power, and the update applies all of them in sequence via mode-wise tensor contractions:

W_{t+1} = W_t - eta * H_t^{(1) -1/(2k)} *_1 H_t^{(2) -1/(2k)} *_2 ... *_k G_t

where _i denotes contraction along mode i.[^1] For matrices (k = 2) this collapses back to the inverse fourth root rule above, since 1/(22) = 1/4. For an order-4 convolutional kernel of shape (C_out, C_in, K, K), Shampoo stores four small matrices of sizes C_out^2, C_in^2, K^2, K^2 instead of one matrix of size (C_out * C_in * K * K)^2.[^1]

Convergence

Shampoo carries a formal regret bound in the stochastic convex setting. Gupta, Koren, and Singer prove that for an online convex optimization problem with bounded diameter D and gradients of bounded rank, the regret of Shampoo after T steps is bounded by an expression of the form O(D * sqrt(T * tr(L_T)^{1/2} * tr(R_T)^{1/2})), giving the standard O(sqrt(T)) regret rate of adaptive online algorithms.[^1] The proof relies on matrix trace inequalities, specifically a Lieb-style concavity inequality and a Kronecker-product matrix arithmetic-geometric mean inequality showing that L_t ⊗ R_t dominates the full AdaGrad statistic vec(G_t) vec(G_t)^T summed over t.[^1] The convergence theorem does not assume convexity of the loss for typical deep learning applications, but it provides the formal grounding that motivated the algorithm.

Eigendecomposition and root inverse computation

The most numerically delicate step in Shampoo is computing the inverse fourth root of the preconditioners. The reference Distributed Shampoo implementation offers several options for this step, including symmetric eigendecomposition (the default), an approximate eigendecomposition via the QR algorithm, coupled Newton iteration for the inverse pth root, and higher-order coupled iterations using an estimate of the largest eigenvalue to choose the relative epsilon.[^3] The eigendecomposition path is preferred because it is more numerically stable for ill-conditioned matrices that arise routinely in deep learning, but it is also the most expensive step. The original Shampoo paper performs the root inverse at every step; in practice this is too expensive for large models, which motivated the scalable variant described below.[^2]

Distributed and scalable Shampoo

The 2018 algorithm was a step forward over diagonal methods, but applying it directly at deep-learning scale ran into three problems: the cost of computing inverse matrix roots every step, the memory cost of storing preconditioners for very large layers (such as token embedding tables with vocabulary in the tens of thousands), and the difficulty of fitting these computations efficiently into the heterogeneous hardware used to train large models. The 2020 paper "Scalable Second Order Optimization for Deep Learning" by Rohan Anil, Vineet Gupta, Tomer Koren, Kevin Regan, and Yoram Singer at Google introduced a series of algorithmic and systems improvements that turned Shampoo into a competitive optimizer at large scale.[^2] The paper validated the scalable version on Transformer machine translation, BERT pretraining, Criteo click-through prediction, and ImageNet classification, with improvements in both convergence and wall-clock time over first-order baselines.[^2]

Asynchronous root inverse and stale preconditioners

The most important systems trick is to decouple the frequent gradient accumulation step from the rare and expensive eigendecomposition step. Preconditioner statistics L_t and R_t are still updated every step, but the inverse fourth roots L_t^{-1/4} and R_t^{-1/4} are recomputed only once every N steps, with N typically in the tens or low hundreds.[^2] Between recomputations, the optimizer uses a stale inverse root, which is empirically nearly as effective as the fresh one because L_t and R_t change slowly relative to G_t. The recomputation is also performed asynchronously on a CPU (or a separate accelerator stream) so that the main training loop is not blocked by the linear-algebra work.[^2]

Block-diagonal preconditioning for very large layers

Layers such as embedding tables can have one mode (the vocabulary dimension) on the order of 30000 or larger, which makes even one Kronecker factor too big to store or to eigendecompose. Scalable Shampoo addresses this by partitioning very large layers into smaller blocks along the large dimension and running Shampoo independently within each block.[^2] This "block-diagonal" approximation gives up some cross-block curvature information but keeps memory and compute manageable. The PyTorch Distributed Shampoo implementation exposes a max_preconditioner_dim configuration parameter that triggers this blocking automatically.[^3]

Grafting

A practical issue in moving from Adam or AdamW to Shampoo is that hyperparameter schedules, especially the learning rate, are tuned for the magnitudes that Adam-style updates produce. Shampoo updates can have very different magnitudes because they are routed through Kronecker-factored inverse roots. Agarwal et al. introduced "learning rate grafting", in which the direction of the update comes from one optimizer (here Shampoo) and the per-parameter magnitude comes from another optimizer (typically Adam, AdaGrad, or AdamW).[^3] The Distributed Shampoo README documents that grafting is performed only on the second moment or diagonal preconditioner, and that momentum and first-moment updates are kept separate from grafting, which allows existing learning-rate schedules to transfer to Shampoo with little retuning.[^3]

Distribution across data-parallel workers

Distributed Shampoo as implemented by Meta uses PyTorch's FSDP and DDP infrastructure, plus the DTensor data structure introduced in PyTorch 2.x, to spread the per-parameter Kronecker factors and their inverse roots across data-parallel workers.[^3][^4] In effect each worker becomes responsible for a subset of preconditioner blocks, computing their statistics and root inverses locally; the resulting search directions are reassembled with an AllGather communication primitive at each step.[^4] This is conceptually a ZeRO-1 style optimizer-state sharding applied not only to first-moment statistics but to the (much larger in aggregate) preconditioner statistics as well.[^3] For FSDP specifically, the implementation uses FSDP metadata to recover the original per-parameter tensor shape that Shampoo needs, because the default FSDP parameter flattening would otherwise destroy the mode structure that Shampoo's Kronecker factorization exploits.[^3] For FSDP2 (per-parameter sharding), Distributed Shampoo creates preconditioner tensor blocks from the rank-local tensors of the dim-0 sharded DTensor parameters.[^3]

The reference Distributed Shampoo paper (Shi et al., 2023) reports that the implementation achieves at most a 10% increase in per-step wall-clock time over standard diagonal adaptive gradient methods such as AdamW, while providing the convergence benefits of full-matrix preconditioning.[^4]

AlgoPerf 2024 benchmark result

The MLCommons AlgoPerf: Training Algorithms Benchmark Competition was announced as a community effort to find optimization algorithms that train standard deep learning workloads faster than well-tuned baselines, measured in wall-clock time rather than in optimizer steps. The inaugural competition results were announced by MLCommons on 2024-08-01.[^5] The competition had two tracks: an "external tuning" ruleset, in which submissions had to provide a workload-agnostic hyperparameter search space, and a "self-tuning" ruleset, in which submissions had to be fully hyperparameter-free.[^5] The baseline algorithm against which submissions were measured was NAdamW. According to MLCommons, the inaugural competition attracted 18 submissions from 10 teams, with 15 of them being scorable, and required more than 4000 individual training runs across 14 workloads to evaluate.[^5] A 25,000 US dollar prize was awarded to the external-tuning winner, drawn from a 50,000 US dollar total prize pool.[^5]

The external-tuning track was won by a Distributed Shampoo submission from Meta researchers Hao-Jun Michael Shi, Tsung-Hsien Lee, Anna Cai, Shintaro Iwasaki, Wenyin Fu, Yuchen Hao, and Mike Rabbat.[^5] The submission trained the benchmark workloads about 28% faster than the NAdamW baseline in wall-clock time.[^5] The self-tuning track was won by a Schedule-Free AdamW submission from Aaron Defazio and Alice Yang at Meta and Konstantin Mishchenko at Samsung AI, achieving an approximately 8% wall-clock speedup over the baseline.[^5]

A follow-up retrospective by the AlgoPerf organizers, "Accelerating Neural Network Training: An Analysis of the AlgoPerf Competition" (arXiv:2502.15015), positions Shampoo's win as evidence that non-diagonal preconditioning can beat well-tuned Adam-style methods on real wall-clock metrics, and notes that the top-scoring submissions were unusually robust across the benchmark workloads.[^7] The result drew further attention to Kronecker-factored second-order methods and helped catalyze a wave of follow-up work, including SOAP and Muon.[^7]

Variants and follow-up work

SOAP

In 2024 Nikhil Vyas, Depen Morwani, Rosie Zhao, Mujin Kwun, Itai Shapira, David Brandfonbrener, Lucas Janson, and Sham Kakade introduced SOAP, "Shampoo with Adam in the Preconditioner's eigenbasis" (arXiv:2409.11321).[^8] The paper establishes a formal connection between Shampoo (with the 1/2 power) and Adafactor (a memory-efficient approximation of Adam): Shampoo is equivalent to running Adafactor in the eigenbasis defined by Shampoo's preconditioners. SOAP exploits this observation by maintaining the slowly evolving Shampoo eigenbasis explicitly and then running Adam in that basis. In large-batch language-model training experiments the authors report over 40% fewer iterations than AdamW and over 35% lower wall-clock time, with roughly 20% improvements in both metrics over plain Shampoo, while requiring only one extra hyperparameter (the preconditioner update frequency).[^8]

Convergence-rate analyses and "purified" Shampoo

The empirical success of Shampoo motivated a wave of theoretical work. "Purifying Shampoo: Investigating Shampoo's Heuristics by Decomposing its Preconditioner" (arXiv:2506.03595) analyzes how individual heuristics in Distributed Shampoo, such as learning-rate grafting, stale preconditioning, and the choice of root, contribute to performance.[^9] A 2026 convergence-rate analysis of an AdamW-style Shampoo unifies one-sided and two-sided preconditioning under a single theoretical framework (arXiv:2601.07326).[^10] A separate line of work, including "A New Perspective on Shampoo's Preconditioner" (Morwani et al., 2024), shows that Shampoo's preconditioner can be re-derived as an approximation to a one-step optimal preconditioner.[^11]

4-bit and memory-efficient Shampoo

Shampoo's preconditioner statistics dominate optimizer memory once layers grow large. Wang et al. proposed 4-bit Shampoo, a quantized variant in which the eigenvectors and eigenvalues of L_t and R_t are stored in 4-bit form between root recomputations, reducing optimizer memory at modest accuracy cost.[^12] A related thread is "Memory-Efficient 4-bit Preconditioned Stochastic Optimization" by the same group, exploring 4-bit precision for both Shampoo and related preconditioned optimizers.[^13]

DASH and other inverse-root accelerators

DASH ("Faster Shampoo via Batched Block Preconditioning and Efficient Inverse-Root Solvers", arXiv:2602.02016) targets the root-inverse bottleneck directly, replacing eigendecomposition with a Newton-DB iteration and Chebyshev polynomial approximations, and batching block preconditioning across layers of similar shape so that GPU matrix kernels achieve higher utilization.[^14]

Comparison with other optimizers

Optimizer	Preconditioner structure	Per-parameter memory	Compute per step	Notes
SGD with momentum	none (identity)	1x parameters (for momentum buffer)	O(parameters)	Standard baseline for vision (e.g. ImageNet ResNet)
AdaGrad	diagonal, accumulating	1x parameters	O(parameters)	First adaptive method, tends to over-decay learning rate
RMSProp	diagonal, exponential moving average	1x parameters	O(parameters)	Drop-in fix for AdaGrad's decay
Adam / AdamW	diagonal, EMA of squared gradients	2x parameters	O(parameters)	De facto default for Transformer training
Adafactor	rank-1 factored second moment	sublinear for big matrices	O(parameters)	Sublinear-memory alternative to Adam
Lion	sign of momentum	1x parameters	O(parameters)	Discovered by program search, no second moment
K-FAC	per-layer Kronecker factors of the Fisher matrix	O(n^2 + m^2) per layer	O(n^3 + m^3) for inverse	Requires architecture-specific Fisher derivation
Shampoo	per-mode factors from gradient outer products	O(sum of n_i^2)	O(n_i^3) for root inverse, amortized	Optionally with grafting and stale roots
SOAP	Adam in Shampoo's eigenbasis	O(sum of n_i^2) plus Adam state	dominated by basis update	About 35% wall-clock savings vs AdamW on LMs (paper claim)

The conceptual relationships are summarized as follows. Adam is a diagonal approximation to AdaGrad with exponential-moving-average statistics. Adafactor is a rank-1 factored approximation to Adam's second moment. K-FAC factorizes the layer-wise Fisher information matrix as a Kronecker product computed from activations and back-propagated derivatives. Shampoo factorizes the layer-wise empirical AdaGrad statistic as a Kronecker product computed from gradient outer products alone, requires no architecture-specific bookkeeping, and applies the inverse fourth root rather than the inverse square root. SOAP runs Adam in the eigenbasis of Shampoo's preconditioner, exploiting the equivalence shown in the SOAP paper.[^8]

Applications

Shampoo and Distributed Shampoo are used primarily for training neural networks at scale. The 2020 scalable Shampoo paper reported gains on Transformer machine translation on WMT, on BERT pretraining, on Criteo click-through-rate prediction (a workload dominated by very large embedding tables), and on ImageNet ResNet-50 classification.[^2] The 2023 PyTorch Distributed Shampoo paper used the optimizer on ImageNet ResNet-50 and on internal large-model workloads.[^4] Meta has used Distributed Shampoo internally for training recommendation models, where layers dominated by large embedding tables benefit from Kronecker preconditioning of the dense interaction tower.[^3][^4] Outside Meta, the optimizer has been picked up by deep-learning practitioners interested in non-diagonal preconditioning, and reference implementations exist in both PyTorch (Meta's facebookresearch/optimizers, the dominant reference) and in JAX (Google's google-research/scalable_shampoo).[^3][^15]

Limitations and criticisms

Shampoo's strengths come with several real costs and engineering risks.

Computing the inverse fourth root via eigendecomposition is the main computational and numerical bottleneck. Symmetric eigendecomposition is the most stable choice but is expensive enough that the root must be cached and recomputed only occasionally. The PyTorch Distributed Shampoo documentation specifically warns that some PyTorch / CUDA version combinations produce numerical instabilities in torch.linalg.eigh, so the project pins compatible CUDA versions and explicitly avoids known-bad ranges.[^3]

The algorithm has several heuristics whose individual roles are not fully understood. The "Purifying Shampoo" analysis decomposes Distributed Shampoo into its components (grafting, stale preconditioners, blocking, root choice) and asks which of these are actually necessary; the authors find that some heuristics provide most of the empirical benefit and that simplified variants can sometimes match the full Distributed Shampoo recipe.[^9] The follow-up "Understanding and Improving the Shampoo Optimizer via Kullback-Leibler Minimization" reinterprets Shampoo's two-factor approximation as a particular KL minimization and suggests modifications based on that perspective.[^16]

Memory is still a concern at the largest model scales. Although Kronecker-factored preconditioners are vastly smaller than the full-matrix preconditioner, they are still O(n^2) per mode dimension. For embedding tables of size 30000 or larger, even one factor would be ~10^9 entries, so blocking is mandatory. Quantization (4-bit Shampoo and successors) and aggressive sharding through DTensor / FSDP are required to keep state size manageable on top-of-the-line accelerators.[^3][^12]

The convergence theory applies to convex problems and to a specific online setting. As with most adaptive methods used for deep networks, Shampoo's strong empirical performance on non-convex deep learning problems is justified by experiments rather than by theorems. The AlgoPerf result is one of the most rigorous large-scale wall-clock comparisons available, but it covers a fixed set of 14 workloads under a specific evaluation protocol, so it should not be read as a universal claim that Shampoo dominates Adam on every task.[^5][^7]

Finally, Shampoo's interaction with other tricks of the trade is intricate. Learning-rate grafting from Adam or AdamW is recommended for transferring schedules but adds another preconditioner whose statistics must be maintained. Effective use of Shampoo typically requires careful selection of block size, eigendecomposition cadence, and the grafting source optimizer, and the recipe that worked best at AlgoPerf is documented in the Distributed Shampoo README rather than in any single paper.[^3]

Adaptive first-order optimizers in the AdaGrad family include RMSProp, Adam, AdamW, and Lion. K-FAC is the closest classical relative on the second-order side, factorizing the Fisher matrix rather than the gradient outer-product statistic. SOAP and Muon, both motivated in part by AlgoPerf's results, are recent non-diagonal preconditioners that draw on the Shampoo line of work.[^7][^8] Distributed Shampoo runs inside PyTorch data-parallel and FSDP training setups and is part of Meta's facebookresearch/optimizers research repository.[^3]

The algorithm's original setting is online learning, and its convergence proof uses tools from online convex optimization (regret minimization with adaptive learning rates). The Shampoo paper appeared at ICML 2018, and the scalable version was released in 2020.[^1][^2]

References

Shampoo (optimizer)

Infobox

Background

How it works

Matrix case

Higher-order tensor case

Convergence

Eigendecomposition and root inverse computation

Distributed and scalable Shampoo

Asynchronous root inverse and stale preconditioners

Block-diagonal preconditioning for very large layers

Grafting

Distribution across data-parallel workers

AlgoPerf 2024 benchmark result

Variants and follow-up work

SOAP

Convergence-rate analyses and "purified" Shampoo

4-bit and memory-efficient Shampoo

DASH and other inverse-root accelerators

Comparison with other optimizers

Applications

Limitations and criticisms

Related work

See also

References

Improve this article

Shampoo (optimizer)

Infobox

Background

How it works

Matrix case

Higher-order tensor case

Convergence

Eigendecomposition and root inverse computation

Distributed and scalable Shampoo

Asynchronous root inverse and stale preconditioners

Block-diagonal preconditioning for very large layers

Grafting

Distribution across data-parallel workers

AlgoPerf 2024 benchmark result

Variants and follow-up work

SOAP

Convergence-rate analyses and "purified" Shampoo

4-bit and memory-efficient Shampoo

DASH and other inverse-root accelerators

Comparison with other optimizers

Applications

Limitations and criticisms

Related work

See also

References