Sophia (optimizer)

Sophia (a backronym for Second-order Clipped Stochastic Optimization with Adaptive estimator) is a stochastic second-order optimization algorithm introduced in May 2023 by Hong Liu, Zhiyuan Li, David Hall, Percy Liang, and Tengyu Ma at Stanford University for pre-training large language models.[^1] The method combines a cheap stochastic estimate of the diagonal of the Hessian as a per-coordinate pre-conditioner with element-wise clipping that bounds the worst-case update size, and updates the Hessian estimate only every handful of steps to keep per-iteration cost close to that of AdamW.[^1] The authors reported a roughly two times speed-up over Adam in steps, total compute, and wall-clock time on autoregressive GPT-2 style models at sizes from 125 million to 1.5 billion parameters, while preserving the same final validation perplexity.[^1] Sophia comes in two forms differing in how the diagonal Hessian is estimated: Sophia-H, which uses a Hutchinson-style stochastic trace estimator based on Hessian-vector products, and Sophia-G, which uses a Gauss-Newton-Bartlett (GNB) estimator that resamples labels from the model's own output distribution and is guaranteed to be positive semi-definite.[^1] The paper was published at ICLR 2024 and the official implementation is released under an MIT license at the GitHub repository Liuhong99/Sophia.[^2][^3]

Background and motivation

Pre-training of large autoregressive transformer language models dominates the cost of building modern systems such as GPT-style models and similar large language models. In that regime, the de-facto optimizer is AdamW, a variant of Adam in which the L2 regularization term is decoupled into a separate weight decay update.[^4] AdamW is a diagonal preconditioned stochastic gradient descent method: each parameter is rescaled by an exponential moving average of the square of the per-coordinate gradient, which acts as a rough estimate of curvature. Because the cost of computing the full or block Hessian is prohibitive for billion-parameter models, more sophisticated second-order methods such as K-FAC and Shampoo were historically considered too expensive for routine pretraining, despite the long-standing intuition from classical numerical optimization that a Newton-style preconditioner adapts better to heterogeneous curvatures of the loss function than a gradient-magnitude preconditioner does.[^1][^5]

The Sophia paper formalizes that intuition for transformer pre-training. The authors argue that AdamW essentially scales each coordinate by the magnitude of the gradient signal, which gives a noisy proxy for curvature and tends to allocate too small an effective step in flat directions and too large a step in sharp directions.[^1] A diagonal Hessian rescales updates by the second derivative along each axis, equalizing the per-direction descent and in principle converging in a number of steps that depends on the dynamic range of the diagonal rather than on the full conditioning of the problem.[^1] The challenge is that estimating even the diagonal of the Hessian by direct probing costs additional Hessian-vector products, and naive use of a stochastic estimate is unstable in non-convex settings where the local Hessian can be indefinite. Sophia addresses both issues at once. It computes a stochastic diagonal Hessian estimate only every k steps (k = 10 in the released code), and it clips the resulting per-coordinate update so that an arbitrarily small or even negative Hessian estimate cannot blow up the step.[^1][^3]

The paper builds on a line of diagonal second-order methods that includes AdaHessian, which used a Hutchinson estimator for a similar diagonal Hessian preconditioner but updated it every step and did not employ element-wise clipping in the same way.[^1] Sophia's distinguishing claim is that the combination of infrequent stochastic Hessian estimation plus per-coordinate clipping is exactly what makes a diagonal second-order method work at billion-parameter language model scale with negligible overhead per step.[^1]

Timeline

The first arXiv preprint of Sophia, version 1, appeared on 23 May 2023.[^6] Subsequent revisions integrated additional GPT-2 medium and large experiments, the GNB estimator analysis, and the final theoretical results; the fourth and final arXiv version is dated 5 March 2024.[^6] The paper was accepted as a poster at the Twelfth International Conference on Learning Representations (ICLR) 2024, with the OpenReview record published on 16 January 2024 and last modified on 9 April 2024.[^2] The official PyTorch implementation, hosted at github.com/Liuhong99/Sophia, was released under an MIT license and is based on the PyTorch nanoGPT codebase and on the JAX-based Levanter language modeling framework from Stanford CRFM.[^3]

How Sophia works

Sophia maintains, in addition to the parameters, two persistent state tensors: an exponential moving average m of the gradients and an exponential moving average h of a diagonal Hessian estimate. At each step it computes the stochastic gradient g of a language model training loss, updates m, and produces a clipped update direction using both m and h.[^1]

Algorithm

Pseudo-code for the generic Sophia algorithm, following Algorithm 1 of the paper, is shown below.[^1]

input: parameters theta_0, learning rate eta, betas (beta1, beta2),
       Hessian update interval k, clipping threshold rho, epsilon, weight decay lambda
m_0 = 0, h_0 = 0
for t = 1, 2, ...:
    g_t = grad of mini-batch loss at theta_{t-1}
    m_t = beta1 * m_{t-1} + (1 - beta1) * g_t
    if t mod k == 1:
        hhat_t = diagonal Hessian estimate at theta_{t-1}
        h_t = beta2 * h_{t-k} + (1 - beta2) * hhat_t
    else:
        h_t = h_{t-1}
    theta_t = theta_{t-1} - eta * lambda * theta_{t-1}            # decoupled weight decay
    theta_t = theta_t - eta * clip( m_t / max(h_t, epsilon), rho )

The two interchangeable choices for the diagonal Hessian estimate hhat_t give Sophia-H (Hutchinson) and Sophia-G (Gauss-Newton-Bartlett).
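
The pseudo-code above can be sketched in a few lines of plain Python. This is a minimal illustration on lists of floats, not the released implementation: the function name sophia_step, the hess_estimate hook, and the toy quadratic objective are all assumptions made for the example, and the exact diagonal Hessian of the toy stands in for the Hutchinson or GNB estimator.

```python
import math


def clip(x, rho):
    """Entry-wise clip: sign(x) * min(|x|, rho)."""
    return math.copysign(min(abs(x), rho), x)


def sophia_step(theta, m, h, grad, t, hess_estimate, lr,
                beta1=0.965, beta2=0.99, k=10, rho=0.05,
                eps=1e-12, weight_decay=0.0):
    """Apply one step of the pseudo-code; returns new (theta, m, h)."""
    # Gradient EMA.
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, grad)]
    # Refresh the Hessian EMA only on steps 1, k+1, 2k+1, ...
    if t % k == 1:
        hhat = hess_estimate(theta)
        h = [beta2 * hi + (1 - beta2) * hh for hi, hh in zip(h, hhat)]
    # Decoupled weight decay, then the clipped preconditioned update.
    theta = [th - lr * weight_decay * th for th in theta]
    theta = [th - lr * clip(mi / max(hi, eps), rho)
             for th, mi, hi in zip(theta, m, h)]
    return theta, m, h


# Toy objective (an assumption for illustration): f = 0.5*(x0^2 + 100*x1^2).
CURVATURE = [1.0, 100.0]


def loss(theta):
    return 0.5 * sum(c * th * th for c, th in zip(CURVATURE, theta))


def grad(theta):
    return [c * th for c, th in zip(CURVATURE, theta)]


def exact_diag_hessian(theta):
    # Stand-in for Hutchinson or GNB: the toy's true diagonal Hessian.
    return list(CURVATURE)


theta, m, h = [1.0, -2.0], [0.0, 0.0], [0.0, 0.0]
initial_loss = loss(theta)
for t in range(1, 51):
    theta, m, h = sophia_step(theta, m, h, grad(theta), t,
                              exact_diag_hessian, lr=0.1)
final_loss = loss(theta)
print(initial_loss, final_loss)
```

In this run the Hessian EMA starts at zero and grows slowly, so every coordinate is clipped during the first 50 steps and each parameter moves toward zero by exactly lr * rho per step.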

Sophia-H: Hutchinson estimator

The Hutchinson variant draws a random vector u with independent Rademacher entries (each coordinate plus or minus one with equal probability), computes a single Hessian-vector product H u using reverse-mode automatic differentiation, and forms the estimate u ⊙ (H u), where ⊙ denotes element-wise multiplication.[^1][^7] In expectation this equals the true diagonal of the Hessian. Modern frameworks such as PyTorch and JAX expose Hessian-vector products through automatic differentiation without forming the full Hessian, so the cost of one estimator call is comparable to one extra backward pass through the transformer.[^7] Because Sophia evaluates the estimator only every k steps (the released code uses k = 10), the amortized per-step overhead relative to AdamW is approximately one extra backward pass divided by k, in practice about five percent or less.[^1][^3]

The Hutchinson estimator does not require any particular structure of the loss; it is valid for any twice-differentiable objective. However, its variance can be high, and individual coordinates of the estimate can be negative when the true Hessian is indefinite, which is common in non-convex regions of a deep learning loss landscape. The clipping step in Sophia is designed to keep updates well-behaved when this happens.[^1]
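
The estimator can be illustrated on a small explicit matrix. In the sketch below, the matrix H and the helper names are assumptions for the example, and a plain matrix-vector product stands in for the Hessian-vector product that an automatic differentiation framework would supply. Averaging the estimate over every possible Rademacher vector recovers the diagonal exactly, while a single probe can be off and can be negative.

```python
import itertools
import random


def hvp(H, u):
    """Hessian-vector product for an explicit matrix (stand-in for autodiff)."""
    return [sum(hij * uj for hij, uj in zip(row, u)) for row in H]


def hutchinson_diag(H, u):
    """Single-probe estimate u * (H u) of diag(H)."""
    return [ui * hui for ui, hui in zip(u, hvp(H, u))]


def rademacher(n, rng):
    """Vector of independent +1/-1 entries with equal probability."""
    return [rng.choice((-1.0, 1.0)) for _ in range(n)]


# Symmetric, indefinite toy Hessian (hypothetical values).
H = [[2.0, 0.5, -1.0],
     [0.5, -3.0, 0.25],
     [-1.0, 0.25, 4.0]]

# One noisy estimate from a random probe.
single = hutchinson_diag(H, rademacher(3, random.Random(0)))

# Exact expectation: average over all 2^3 sign vectors.
probes = [list(u) for u in itertools.product((-1.0, 1.0), repeat=3)]
mean = [sum(hutchinson_diag(H, u)[i] for u in probes) / len(probes)
        for i in range(3)]
print(single, mean)
```

The off-diagonal terms cancel in the average because each product u_i * u_j with i different from j is equally often +1 and -1.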

Sophia-G: Gauss-Newton-Bartlett estimator

The Gauss-Newton-Bartlett variant exploits the fact that a language model training loss is a negative log-likelihood of a categorical distribution over next tokens. For such losses, the Gauss-Newton matrix can be written as an expectation over labels drawn from the model's current output distribution, a fact related to Bartlett's identity in statistics.[^1] Sophia-G implements this by sampling a fresh set of labels from the model's softmax outputs and computing the mini-batch gradient of the loss against those resampled labels; the element-wise square of that surrogate gradient, scaled by the batch size, gives a stochastic estimate of the diagonal of the Gauss-Newton matrix.[^1] The estimator is biased relative to the full Hessian (the Gauss-Newton matrix discards a second-derivative term that vanishes near a minimum), but it is provably positive semi-definite, which guarantees that the preconditioned update is a descent direction whenever the gradient itself is non-zero.[^1]

In the official repository, Sophia-G is the variant used to reproduce the headline GPT-2 results, and the README notes that Sophia-G is what was scaled to the 1.5 billion parameter run on TPU using the JAX-based Levanter codebase.[^3]
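
A toy version of the GNB estimator makes the construction concrete. The sketch below assumes, purely for illustration, that the trainable parameters are a single vector of logits shared by every example in the batch; a real language model would backpropagate through the network instead. Under that assumption the cross-entropy gradient for a label y is softmax(logits) minus the one-hot vector of y, and the exact diagonal of the Hessian is p_i * (1 - p_i), which the estimator should match in expectation.

```python
import itertools
import math
import random


def softmax(logits):
    mx = max(logits)
    exps = [math.exp(z - mx) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gnb_diag(logits, sampled_labels):
    """B * (mini-batch mean gradient)^2, with labels drawn from the model."""
    p = softmax(logits)
    batch = len(sampled_labels)
    # Cross-entropy gradient w.r.t. the logits for label y is p - onehot(y).
    g = [sum(p[i] - (1.0 if y == i else 0.0) for y in sampled_labels) / batch
         for i in range(len(p))]
    return [batch * gi * gi for gi in g]


logits = [2.0, 0.0, -1.0]
p = softmax(logits)

# One stochastic estimate: sample B = 4 labels from the model's own outputs.
rng = random.Random(0)
labels = rng.choices(range(len(p)), weights=p, k=4)
one_estimate = gnb_diag(logits, labels)

# Exact expectation for B = 2, enumerating every possible label pair.
expected = [0.0] * len(p)
for a, b in itertools.product(range(len(p)), repeat=2):
    est = gnb_diag(logits, [a, b])
    for i in range(len(p)):
        expected[i] += p[a] * p[b] * est[i]

true_diag = [pi * (1.0 - pi) for pi in p]  # diagonal of diag(p) - p p^T
print(one_estimate, expected, true_diag)
```

Every entry of the estimate is a square and therefore non-negative, which is the property that distinguishes this estimator from the Hutchinson one.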

Per-coordinate clipping

Both estimators produce noisy diagonal Hessian estimates, and the EMA h smooths them over time. To guard against the residual cases where h is still very small, zero, or has the wrong sign, Sophia normalizes the gradient EMA m by max(h, epsilon) and then applies element-wise clipping with threshold rho:[^1]

update = clip( m_t / max(h_t, epsilon), rho )

The function clip(x, rho) is applied entry-wise and returns sign(x) * min(|x|, rho). When a coordinate's Hessian estimate is very small, the ratio m_t / h_t would be enormous and the clip saturates, so the update for that coordinate degenerates into a sign-momentum step of fixed magnitude rho, similar in spirit to Lion or signSGD. When the Hessian estimate is large and accurate, the clip is inactive and Sophia behaves like a true diagonal Newton step. The authors emphasize that this dual behavior is precisely what makes the algorithm robust to inaccurate Hessian estimates and to the rapid changes in curvature characteristic of non-convex transformer training.[^1]

A second consequence of clipping is that the worst-case absolute update per coordinate is bounded by eta * rho, regardless of the gradient magnitude. This serves as an implicit form of gradient clipping, although it operates after the second-order preconditioning rather than on the raw gradient.[^1]
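
The three regimes described above can be seen on a hand-picked example. The values of m and h below are hypothetical: the first coordinate has a large, trustworthy curvature estimate, the second a tiny one, and the third a negative one.

```python
import math


def clip(x, rho):
    """Entry-wise clip: sign(x) * min(|x|, rho)."""
    return math.copysign(min(abs(x), rho), x)


def sophia_update(m, h, rho=0.05, eps=1e-12):
    """clip(m / max(h, eps), rho), applied coordinate by coordinate."""
    return [clip(mi / max(hi, eps), rho) for mi, hi in zip(m, h)]


m = [0.5, -0.5, 0.001]
h = [100.0, 1e-6, -1.0]
update = sophia_update(m, h)
print(update)  # prints [0.005, -0.05, 0.05]
```

The first coordinate takes an unclipped Newton-like step m / h. The other two saturate at magnitude rho and keep the sign of m, so even the coordinate with a negative curvature estimate still moves in the direction indicated by the gradient EMA.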

Hyperparameters

The authors document tuned hyperparameter ranges in the paper and in the GitHub repository's README. Typical recommended values for GPT-2 pre-training are listed below.[^1][^3]

Hyperparameter	Symbol	Typical value	Notes
Gradient EMA decay	beta1	0.965	Higher than AdamW's default 0.9; tuned for GPT-2.
Hessian EMA decay	beta2	0.99	Smooths the noisy diagonal Hessian estimate.
Hessian update interval	k	10 steps	The estimator is called every 10 iterations.
Clipping threshold	rho	0.05 to 0.08	Tuned per model size; range taken from the released GPT-2 configurations.
Numerical floor	epsilon	1e-12	Lower than AdamW's 1e-8 because h has different scaling.
Weight decay	lambda	about 0.2	Roughly twice the value used with AdamW.
Learning rate	eta	model-dependent	See size-specific table in the repository README.

The repository's README reports the following learning rate, rho, and weight decay settings for the headline GPT-2 experiments.[^3]

Model size	LR (Adam)	LR (Lion)	LR (Sophia)	rho	Weight decay
125M	6e-4	1e-4	6e-4	0.05	0.2
355M	3e-4	1e-4	7e-4	0.08	0.2
770M	2e-4	8e-5	3e-4	0.05	0.2

Sophia still requires tuning: rho and the learning rate differ across model sizes in the released configurations, and the authors recommend monitoring the fraction of coordinates whose updates are clipped at each step. A clipping fraction of roughly a third to a half is reported as healthy in the paper; higher fractions indicate that rho is too small or the Hessian estimate is too noisy, and lower fractions indicate that the second-order information is dominating and rho could be tightened.[^1]
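
Monitoring the clipped fraction takes only a few lines. The helper below is a hypothetical sketch, not a function from the released code: it counts the coordinates whose pre-clip ratio m / max(h, epsilon) exceeds rho in magnitude, using made-up values for m and h.

```python
def clipped_fraction(m, h, rho, eps=1e-12):
    """Share of coordinates whose update saturates at magnitude rho."""
    clipped = sum(1 for mi, hi in zip(m, h)
                  if abs(mi / max(hi, eps)) > rho)
    return clipped / len(m)


# Hypothetical optimizer state for four coordinates.
m = [0.5, -0.5, 0.001, 0.2]
h = [100.0, 1e-6, -1.0, 10.0]
fraction = clipped_fraction(m, h, rho=0.05)
print(fraction)  # prints 0.5
```

Logging this number alongside the loss gives a direct signal for adjusting rho in the direction described above.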

Theoretical analysis

The Sophia paper proves several formal results to justify the algorithm. The most prominent is a runtime analysis showing that, for a deterministic version of Sophia on a class of strictly convex objectives with heterogeneous curvature, the number of iterations needed to reach a given suboptimality does not depend on the condition number of the loss, unlike vanilla gradient descent, whose rate is governed by that condition number.[^1] The intuition is the same as for classical preconditioned gradient descent: rescaling each coordinate by the local second derivative makes the effective optimization landscape isotropic along the diagonal, so off-diagonal coupling becomes the only remaining source of ill-conditioning.[^1]
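
The intuition can be checked on a toy problem, which is far simpler than the setting of the paper's theorem. The sketch below assumes a separable quadratic f = 0.5 * sum(c_i * x_i^2) whose diagonal curvature is known exactly, and compares plain gradient descent with a deterministic clipped diagonal-Newton step; the function names, the starting point, and the tolerance are choices made for this example.

```python
import math


def clip(x, rho):
    return math.copysign(min(abs(x), rho), x)


def loss(theta, curv):
    return 0.5 * sum(c * th * th for c, th in zip(curv, theta))


def gd_steps(curv, tol=1e-6, max_steps=100_000):
    """Plain gradient descent with step size 1 / max curvature."""
    theta = [1.0] * len(curv)
    lr = 1.0 / max(curv)
    for step in range(1, max_steps + 1):
        theta = [th - lr * c * th for th, c in zip(theta, curv)]
        if loss(theta, curv) <= tol:
            return step
    return max_steps


def clipped_newton_steps(curv, rho=0.05, tol=1e-6, max_steps=100_000):
    """Deterministic clipped diagonal-Newton step with exact curvature."""
    theta = [1.0] * len(curv)
    for step in range(1, max_steps + 1):
        # Gradient c * th divided by curvature c, then clipped at rho.
        theta = [th - clip((c * th) / c, rho) for th, c in zip(theta, curv)]
        if loss(theta, curv) <= tol:
            return step
    return max_steps


for curv in ([1.0, 10.0], [1.0, 1000.0]):
    print(curv, gd_steps(curv), clipped_newton_steps(curv))
```

In this toy, the gradient descent step count grows with the curvature ratio, while the clipped Newton count is set by the starting distance divided by rho and stays the same for both ratios.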

The paper also analyzes the behavior of the clipping mechanism. In a non-convex setting with potentially negative Hessian estimates, the authors show that the clipped update remains bounded and that Sophia continues to make progress as long as the gradient magnitude is non-trivially larger than zero, even in regions where the Hessian estimate is uninformative.[^1] This is the key argument that the same algorithm can run safely through the highly non-convex early phase of transformer training and the smoother later phase without changing hyperparameters.[^1]

The analysis is local in nature, treats the Hessian estimate as a noisy black-box, and is not a global convergence guarantee for the full non-convex transformer pre-training loss; the authors are explicit that Sophia is justified primarily by its empirical performance on language modeling and that the theory is a clarifying tool rather than a sharp prediction.[^1]

Experimental results

The Sophia paper's main empirical claim is a roughly two times reduction in steps, compute, and wall-clock time relative to AdamW (and a comparable improvement over Lion) on GPT-2 style autoregressive language model pre-training at sizes from 125 million to 1.5 billion parameters, when training to the same validation perplexity on OpenWebText or The Pile.[^1][^3]

GPT-2 small, medium, large

For GPT-2 small (125M), medium (355M), and large (770M), the authors trained on OpenWebText using a fixed sequence length of 1,024 tokens and a global batch size of approximately 480, and matched the AdamW recipes of the open-source nanoGPT codebase as closely as possible.[^1][^3] In each setting they report that Sophia (Sophia-G in the released configurations) reaches the same final validation loss as AdamW using roughly half the number of optimization steps, and that because the per-step cost is essentially the same (the Hessian estimator runs once every ten steps), the wall-clock saving is also approximately a factor of two.[^1][^3] The paper reports that a 540 million parameter Sophia-trained model, when trained for 100,000 steps, matches the validation perplexity of a 770 million parameter AdamW model trained on the same data, illustrating an effective reduction in required model size for a given loss target.[^1]

GPT-2 1.5B

The headline large-scale experiment is on a 1.5 billion parameter GPT-2 architecture, trained on The Pile using the JAX-based Levanter framework on TPUs at Stanford CRFM.[^3] At this scale the authors observed the same qualitative two times advantage over AdamW in steps and wall-clock time. This experiment was a major motivator for upstreaming the Sophia-G implementation into Levanter, where it became a default-available option.[^8]

Wall-clock and memory overhead

Because the Hessian estimate is recomputed only once every k = 10 steps, the additional compute is roughly one extra backward pass amortized over ten optimizer steps. In the paper's measurements this corresponds to about a five percent increase in per-step time over AdamW, which is more than compensated by the halving of the number of steps required.[^1] Sophia adds no memory over AdamW: both optimizers keep a first-moment EMA of the gradient plus one further tensor of the same shape as the parameters (the squared-gradient EMA for AdamW, the diagonal Hessian EMA for Sophia).[^1] Sophia's optimizer state therefore has the same total size as AdamW's, namely two state tensors per parameter tensor.
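
The arithmetic behind these figures is short enough to write out. The sketch below uses the numbers quoted above (a twofold step reduction and about five percent per-step overhead); the assumption that one estimator call costs half of an ordinary training step is hypothetical and chosen only because it reproduces the five percent figure at k = 10.

```python
def amortized_overhead(estimator_cost, k):
    """Extra per-step cost when the estimator runs once every k steps.

    estimator_cost is expressed as a fraction of one ordinary training step.
    """
    return estimator_cost / k


def net_speedup(step_reduction, per_step_overhead):
    """Wall-clock speed-up after paying the per-step overhead."""
    return step_reduction / (1.0 + per_step_overhead)


overhead = amortized_overhead(estimator_cost=0.5, k=10)  # assumed cost
speedup = net_speedup(step_reduction=2.0, per_step_overhead=overhead)
print(overhead, round(speedup, 3))
```

With these inputs the net wall-clock speed-up comes out slightly below the twofold reduction in steps, at about 1.9 times.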

Comparison to AdamW and Lion

The paper compares Sophia head-to-head with AdamW, with Lion (a sign-momentum optimizer discovered by symbolic search at Google), and with several second-order baselines including diagonal Adagrad and AdaHessian.[^1] In the GPT-2 experiments Sophia reaches a given validation loss with the fewest steps in all configurations, with AdamW second and Lion close behind AdamW on the smaller sizes but slightly worse on the largest. The paper also includes ablations on the choice of estimator (Sophia-H vs Sophia-G), on the value of the update interval k, and on the clipping threshold rho.[^1]

Variants

Variant	Hessian estimator	PSD guarantee	Extra compute per estimator call	Notes
Sophia-H	Hutchinson with Rademacher probe vector	No	One Hessian-vector product (about one extra backward pass)	Unbiased; works for any twice-differentiable loss.
Sophia-G	Gauss-Newton-Bartlett with resampled labels	Yes	One extra forward and backward pass with sampled labels	Biased toward Gauss-Newton; designed for softmax cross-entropy losses such as those used for next-token prediction in language models.

In the released code, Sophia-G is the default for language model pre-training because the Gauss-Newton-Bartlett estimator is positive semi-definite and tends to be more numerically stable in practice; Sophia-H is exposed primarily as a reference implementation and for non-cross-entropy losses.[^3]

Implementations and adoption

The official PyTorch implementation by the first author lives at github.com/Liuhong99/Sophia and is released under the MIT license.[^3] It includes both SophiaG and SophiaH classes implementing torch.optim.Optimizer, integration scripts for the PyTorch nanoGPT codebase, and configuration files matching the GPT-2 small, medium, and large experiments from the paper.[^3]

For larger-scale training, the algorithm has been integrated into Levanter, the JAX-based language modeling framework maintained by the Stanford Center for Research on Foundation Models. In a March 2024 announcement, Percy Liang noted that Levanter included Sophia as an option, citing the roughly two times wall-clock reduction reported in the paper.[^8] At least one mirrored or unofficial port (kyegomez/Sophia) is also publicly available, but the canonical reference implementations are the original Liuhong99/Sophia repository for PyTorch and Levanter for JAX.[^9][^8]

Sophia has been the subject of independent re-implementations and class projects, including a Stanford CS 224N final project that explored variations on the estimator and on the update frequency, reporting that the basic algorithm replicated as described on small GPT-2 runs but that performance was sensitive to careful tuning of rho.[^10] Adoption in production pretraining pipelines for frontier large language models has been more limited; AdamW continues to dominate published frontier-scale recipes such as those reported for the LLaMA and Mistral families, where Sophia has not been the headline optimizer.

Comparison with related optimizers

The optimization literature for deep learning is broad, and Sophia sits at a specific intersection of three threads.

Optimizer	Order	Preconditioner	Per-step extra cost vs SGD	Notes
SGD	First	None	0	Baseline.
RMSProp	First	EMA of squared gradient	One element-wise square and divide	Diagonal scaling by gradient magnitude.
AdaGrad	First	Sum of squared gradient	One element-wise square and divide	Aggressive learning-rate decay per coordinate.
Adam	First	EMA of squared gradient with bias correction	Two EMAs and bias correction	Combines RMSProp-style scaling with first-moment EMA.
AdamW	First	EMA of squared gradient	Two EMAs and bias correction; decoupled weight decay	De-facto standard for transformer pre-training.
Lion	First	None (sign of EMA)	One EMA	Symbolic search discovery; uses only the sign of the momentum.
AdaHessian	Diagonal second	Hutchinson diagonal Hessian estimate every step	One extra backward per step	A precursor to Sophia.
K-FAC	Block second	Kronecker-factored Fisher approximation	Periodic large matrix solves	More accurate but expensive.
Shampoo	Block second	Per-tensor Kronecker preconditioner	Periodic SVDs or root solves	Used in some large-scale language model runs.
Sophia	Diagonal second	Hutchinson or GNB diagonal Hessian, every k steps, with clipping	Roughly one extra backward every k steps	Designed specifically for transformer pre-training.

Compared with AdamW, Sophia replaces the EMA of squared gradients (a first-order proxy for curvature) with an actual stochastic diagonal Hessian estimate, and replaces an implicit reliance on bias correction and tuned epsilon with explicit per-coordinate clipping.[^1] Compared with Lion, Sophia degenerates into a similar sign-momentum step only on coordinates whose Hessian estimate is small or unreliable; on other coordinates it makes a true preconditioned step.[^1] Compared with K-FAC and Shampoo, Sophia gives up the modeling power of a non-diagonal preconditioner in exchange for radically lower per-step cost and far simpler implementation.[^1]

Limitations and criticisms

Although the Sophia paper reports a clean two times speed-up over AdamW on GPT-2 scale, several subsequent independent investigations have qualified the picture.

First, the speed-up depends on hyperparameter retuning. Sophia is not a drop-in replacement for AdamW; rho, weight decay, learning rate, and warmup all need to be tuned, and the released configurations show that the optimal AdamW and Sophia learning rates for GPT-2 medium differ by more than a factor of two (3e-4 versus 7e-4).[^3] When practitioners apply Sophia with AdamW-tuned settings, the gains can disappear.

Second, the headline experiments target final validation loss on a fixed dataset and architecture rather than downstream task accuracy on benchmarks. There is no claim in the paper that the same Sophia-trained model achieves better downstream zero-shot or few-shot performance per training token than an AdamW model; the claim is about reaching the same validation loss more quickly.[^1]

Third, benchmarking work in 2025 has reported reproducibility issues at scales and configurations that differ from the paper's. A September 2025 benchmark of optimizers for large language model pretraining reported that Sophia diverged in some small-batch settings on a 124 million parameter model and degraded beyond approximately 130,000 steps in longer-horizon runs, even at sizes (around 720 million parameters with very large batch sizes) where it could outperform AdamW for the first several thousand iterations.[^11] The authors of that work excluded Sophia from parts of their small-batch evaluation because of those divergence issues and concluded that AdamW remained a more consistently reliable choice for their long-horizon protocols.[^11]

Fourth, the second-order machinery is most valuable when the curvature genuinely varies across coordinates. In smaller transformer pre-training settings with moderate batch sizes, the gap between Sophia and well-tuned sign-momentum methods or AdamW has been reported to shrink, indicating that the headline advantage may be concentrated in particular regimes of model size and batch size rather than holding uniformly.[^11]

Finally, Sophia is more complex than AdamW and adds a non-trivial implementation surface (Hessian-vector products, label resampling for Sophia-G, clipping logic, per-step bookkeeping for the Hessian update cadence k), which raises the operational cost of trying it in an established pretraining pipeline. Sophia's tooling and the community's experience with it have not yet reached the operational maturity of AdamW's.[^3][^11]

Significance

Within the deep learning optimization literature, Sophia is notable for being a stochastic second-order method that is competitive in wall-clock time with first-order baselines at billion-parameter language model scale, and for showing concretely how cheap per-step second-order information (via a single Hutchinson probe or a Gauss-Newton resampling step) plus per-coordinate clipping can substitute for the more expensive block-structured preconditioners used in Shampoo or K-FAC.[^1] In doing so it revived interest in diagonal second-order methods for transformer pre-training and motivated subsequent work on benchmarking optimizers under realistic LLM training conditions.[^11] Even where Sophia has not displaced AdamW in production training of frontier large language models, its design ideas, in particular the use of infrequent Hessian estimation and an explicit clip that gracefully degrades to sign-momentum, have entered the working vocabulary of optimizer design for transformer training.

References
