Adafactor

Updated Jul 11, 2026

Adafactor is a memory-efficient adaptive learning-rate optimizer for training deep neural networks, introduced by Noam Shazeer and Mitchell Stern in the 2018 paper Adafactor: Adaptive Learning Rates with Sublinear Memory Cost.^[1] Its central contribution is a rank-1 factorization of the per-parameter second-moment estimator that Adam stores for every weight: instead of keeping one running average of squared gradients per parameter, Adafactor stores only a per-row vector and a per-column vector for each weight matrix and reconstructs the full second-moment matrix via a normalized outer product.^[1] This reduces the optimizer state for a matrix of shape $n \times m$ from $O(nm)$ to $O(n+m)$, which is sublinear in the number of parameters.^[1]^[2] Adafactor also introduces a relative step-size schedule, update clipping based on the root-mean-square of the proposed update, and an optional rescaling by the RMS of the parameter tensor itself.^[1] The optimizer became the standard choice for training the T5 family of text-to-text transformers,^[3] was used in slightly modified form for PaLM,^[4] and remains the default in several large-scale Google training pipelines for Transformer models on TPU hardware.^[3]^[4]

Background and motivation

Adaptive gradient methods such as AdaGrad, RMSProp, and Adam scale each parameter's update by an estimate of the second moment of its gradient, typically maintained as an exponential moving average of squared gradients.^[1] These methods made training of deep neural networks substantially more robust to learning-rate choice and helped popularize the Transformer architecture introduced in Attention Is All You Need.^[5] The price is optimizer state that scales with model size: Adam and AdamW both maintain one first-moment value and one second-moment value per parameter, so the optimizer state is twice the size of the parameters themselves.^[6] For a model with one billion parameters and 32-bit optimizer state, this corresponds to roughly 4 GB for the second-moment buffer alone and roughly 8 GB for both moment buffers together.^[2]
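A back-of-the-envelope sketch of this arithmetic follows. The function name is hypothetical and the 4-byte default assumes dense float32 optimizer state; neither comes from any library.

```python
def adam_state_bytes(num_params: int, bytes_per_value: int = 4) -> dict:
    """Size in bytes of Adam's two moment buffers (assumes dense state, float32 by default)."""
    per_buffer = num_params * bytes_per_value
    return {
        "first_moment": per_buffer,
        "second_moment": per_buffer,
        "total": 2 * per_buffer,
    }


sizes = adam_state_bytes(1_000_000_000)
for name, size in sizes.items():
    print(f"{name}: {size / 1e9:.0f} GB")
```

For one billion parameters this gives 4 GB per moment buffer and 8 GB for the pair, before counting the parameters and gradients themselves.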

As Transformer models scaled past the BERT-base regime in 2018 and 2019, optimizer memory became a binding constraint. The original Attention Is All You Need models used Adam, and the BERT release continued that tradition,^[5] but training larger seq2seq Transformers on TPU pods made the second-moment storage uncomfortably expensive. Shazeer, who had already co-authored the sparsely-gated Mixture of Experts layer and was building the Mesh-TensorFlow distribution library at Google Brain, and Stern, then a PhD student at UC Berkeley working as an intern at Google, sought an optimizer that retained Adam-like per-parameter learning rates while eliminating most of the per-parameter state.^[1]^[7] Their April 2018 arXiv preprint, 1804.04235, presented Adafactor as the result.^[1] The paper was accepted to the International Conference on Machine Learning (ICML) 2018, held in Stockholm, with a poster presentation in July of that year.^[8]

The work occupies a specific design point in optimizer history: rather than designing a new update rule from scratch (as later work on Lion or Sophia would do), Adafactor begins from Adam's update and asks how much of its memory can be removed without breaking the empirical behavior on Transformer training. The answer in the paper is that the per-parameter second-moment buffer can be replaced by $O(n+m)$ statistics for each matrix-shaped weight, that the first-moment buffer can be dropped entirely with care, and that several additional pieces (update clipping, a slowly increasing decay rate, and relative step sizes) are needed to make the method stable on machine translation benchmarks.^[1]

Technical details

The Adam baseline

Adam^[6] maintains, for each parameter, a first-moment estimate $m$ (a running mean of gradients) and a second-moment estimate $v$ (a running mean of squared gradients), both as exponential moving averages with decay rates $\beta_1$ and $\beta_2$. The update at step $t$ is proportional to $\hat{m} / (\sqrt{\hat{v}} + \epsilon)$, where the hats denote the bias-corrected moments that matter early in training.^[6] For a weight matrix $W$ of shape $n \times m$ (here $m$ is the number of columns, not the first moment), Adam therefore stores two additional matrices, one per moment, each of shape $n \times m$, on top of $W$ itself.
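For reference, the standard Adam update can be written out as follows. The symbols $g_t$ (gradient), $w_t$ (parameter), and $\alpha$ (learning rate) are the conventional ones and are introduced here for illustration:

$$
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 \\
\hat{m}_t &= \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \\
w_t &= w_{t-1} - \alpha \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\end{aligned}
$$

Every quantity here is elementwise, which is why both moment buffers have the same shape as the parameter tensor.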

Factorization of the second moment

Adafactor's key observation, derived in the paper's Section 3, is that a non-negative second-moment matrix $V$ can be approximated using only its row sums $R$ and column sums $C$, with $V_{\text{approx}} = R C / \mathrm{sum}(R)$, where $R$ is an $n \times 1$ column and $C$ is a $1 \times m$ row.^[1] Concretely, instead of maintaining a full matrix $V$ of squared-gradient moving averages, the optimizer maintains:

a vector $R$ of shape $n \times 1$, holding an exponential moving average of the row sums of squared gradients,
a vector $C$ of shape $1 \times m$, holding an exponential moving average of the column sums of squared gradients.^[1]

At step t, Adafactor computes:

$$R_t = \hat{\beta}_{2t} R_{t-1} + (1 - \hat{\beta}_{2t})(G_t^2 + \epsilon_1)\mathbf{1}_m$$

$$C_t = \hat{\beta}_{2t} C_{t-1} + (1 - \hat{\beta}_{2t})\mathbf{1}_n^\top (G_t^2 + \epsilon_1)$$

$$\hat{V}_t = \frac{R_t C_t}{\mathbf{1}_n^\top R_t}$$

where $G_t$ is the gradient at step $t$, $G_t^2$ is its elementwise square, $\epsilon_1$ is a small constant added to every entry, $\mathbf{1}_m$ and $\mathbf{1}_n$ are vectors of ones, and $\hat{\beta}_{2t}$ is a time-dependent decay rate.^[1]^[9] The update direction is then $G_t / \sqrt{\hat{V}_t}$ (elementwise), the same Adam-style normalization but with $\hat{V}_t$ derived from the rank-1 factors rather than stored directly.^[1] The Cornell Optimization Wiki notes that this factorization is the minimizer of the generalized Kullback-Leibler divergence (I-divergence) between the true second-moment matrix and its rank-1 approximation under the non-negativity constraint, which is why the row-times-column-over-sum form appears rather than an unnormalized outer product.^[9]
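The reconstruction from row and column sums can be sketched in a few lines of plain Python. The function name and the nested-list matrix representation are choices made for this illustration only, and the moving averages are left out so that the factorization itself is visible.

```python
def factored_second_moment(v):
    """Rank-1 reconstruction of a non-negative matrix from its row and column sums."""
    row_sums = [sum(row) for row in v]            # R: one entry per row
    col_sums = [sum(col) for col in zip(*v)]      # C: one entry per column
    total = sum(row_sums)                         # 1^T R, the sum of all entries
    return [[r * c / total for c in col_sums] for r in row_sums]


# A rank-1 matrix (outer product of [1, 2] and [1, 2, 3]) is recovered exactly.
v_rank1 = [[1.0, 2.0, 3.0],
           [2.0, 4.0, 6.0]]
print(factored_second_moment(v_rank1))

# A matrix that is not rank-1 is only approximated: the identity is smoothed out,
# but its row sums and column sums are preserved.
v_diag = [[1.0, 0.0],
          [0.0, 1.0]]
print(factored_second_moment(v_diag))
```

The second case shows the price of the compression: information about which entries within a row or column are large is lost, and only the marginal sums survive.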

For tensors with more than two dimensions, Adafactor's implementations typically apply the factorization to the last two axes, treating any leading axes as a batch dimension.^[10] Vectors (rank-1 tensors) and small dimensions are usually left in unfactored form: Optax's reference implementation, for example, exposes a min_dim_size_to_factor parameter that defaults to 128, below which Adafactor falls back to a full per-parameter second-moment estimate.^[11]

Memory savings

For an $n \times m$ matrix, the second-moment state shrinks from one $n \times m$ matrix (Adam's $v$) to two vectors of total size $n + m$. With Adafactor's default of dropping the first moment entirely (beta1 = None), the total state per matrix weight is $n + m$ entries rather than Adam's $2nm$.^[1]^[2] On a dense feed-forward Transformer layer of shape $4096 \times 16384$, for example, Adam's $v$ alone has $4096 \times 16384 \approx 67\text{M}$ entries, while Adafactor's $R$ and $C$ together have $4096 + 16384 = 20480$ entries, a roughly 3300x reduction for that tensor.^[2] When momentum (a first-moment buffer) is enabled in Adafactor, that buffer is a full $n \times m$ tensor, so the state is $nm + n + m$ entries, about half of Adam's; the first moment can optionally be stored in lower precision to recover some of the difference.^[11]
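The entry counts for a single matrix can be checked with a small sketch. Both helper functions are hypothetical names introduced here, and they count stored values rather than bytes.

```python
def adam_state_entries(n: int, m: int) -> int:
    """Adam keeps a full first-moment and a full second-moment matrix."""
    return 2 * n * m


def adafactor_state_entries(n: int, m: int, momentum: bool = False) -> int:
    """Adafactor keeps row and column statistics, plus a full first moment if momentum is on."""
    return n + m + (n * m if momentum else 0)


n, m = 4096, 16384
print("Adam second moment only:", n * m)
print("Adafactor R + C:", adafactor_state_entries(n, m))
print("second-moment reduction:", round(n * m / adafactor_state_entries(n, m), 1))
print("Adam total:", adam_state_entries(n, m))
print("Adafactor with momentum:", adafactor_state_entries(n, m, momentum=True))
```

With momentum enabled the factored statistics become a rounding error next to the full first-moment buffer, so the saving relative to Adam is close to a factor of two rather than several thousand.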

Update clipping

A side effect of the factored approximation is that the implied per-parameter second-moment estimate can drift from the true one, occasionally producing very large normalized updates. The paper diagnoses this in Section 4 and proposes update clipping: after computing the proposed update $U_t = G_t / \sqrt{\hat{V}_t}$, Adafactor scales it down whenever its root-mean-square exceeds a threshold $d$, replacing $U_t$ with $U_t / \max(1, \mathrm{RMS}(U_t) / d)$.^[1] The default threshold is 1.0 in both the original paper and the HuggingFace and Optax implementations.^[10]^[11] This update clipping is distinct from gradient clipping (which acts on the raw gradient) and is reported by the HuggingFace docs as essential for stability when training Transformer models with Adafactor.^[10]
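A minimal sketch of the clipping rule, operating on a flat list of update values, is shown below. The helper names `rms` and `clip_update` are illustrative and are not taken from any of the cited implementations.

```python
import math


def rms(values):
    """Root-mean-square of a flat list of numbers."""
    return math.sqrt(sum(x * x for x in values) / len(values))


def clip_update(update, threshold=1.0):
    """Scale the update down so that its RMS does not exceed `threshold`."""
    scale = max(1.0, rms(update) / threshold)
    return [u / scale for u in update]


large = [3.0, -4.0, 0.0, 0.0]      # RMS = sqrt(25 / 4) = 2.5, so it is scaled by 1 / 2.5
small = [0.1, -0.2, 0.05, 0.0]     # RMS below the threshold, so it is left unchanged

print(clip_update(large))
print(rms(clip_update(large)))
print(clip_update(small) == small)
```

Because the whole tensor is divided by a single scalar, the direction of the update is preserved and only its overall magnitude is capped.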

Slowly increasing decay rate

The paper also identifies a more general failure mode of Adam-style optimizers: when the second-moment decay rate $\beta_2$ is held fixed near 1, the running estimate can lag behind sudden increases in gradient magnitude, producing oversized updates. Adafactor instead uses a time-dependent decay rate $\hat{\beta}_{2t}$ that approaches 1 as training progresses, defined in the paper as $\hat{\beta}_{2t} = 1 - t^{-c}$ for a constant $c$, with $c = 0.8$ as the usual choice. Implementations differ in sign convention: HuggingFace stores the exponent itself as decay_rate=-0.8, while Optax stores the magnitude as decay_rate=0.8.^[1]^[10]^[11] This is a separate mitigation from update clipping; the paper shows it also helps Adam itself when applied as a drop-in modification.^[1]
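To get a feel for how quickly the schedule approaches 1, the sketch below evaluates $1 - t^{-c}$ at a few steps, assuming $c = 0.8$. The function name is hypothetical.

```python
def beta2_hat(step: int, c: float = 0.8) -> float:
    """Time-dependent second-moment decay rate 1 - t^(-c), for step t >= 1."""
    return 1.0 - step ** (-c)


for t in (1, 10, 100, 10_000):
    print(t, round(beta2_hat(t), 4))
```

At $t = 1$ the rate is exactly 0, so the first estimate is simply the current squared gradient; by step 10,000 the rate is above 0.999, comparable to a typical fixed Adam $\beta_2$.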

Relative step size

In place of a fixed external learning rate, Adafactor introduces a relative step size $\alpha_t = \max(\epsilon_2, \mathrm{RMS}(W_{t-1})) \cdot \rho_t$, where $W_{t-1}$ is the current parameter tensor and $\rho_t$ is a base step size that depends only on the iteration count, typically $\rho_t = \min(0.01, 1/\sqrt{t})$.^[1]^[9] The intuition is that the appropriate magnitude of a parameter update should scale with the magnitude of the parameter itself, so a layer with very small weights gets very small updates while a layer with large weights gets correspondingly larger ones.^[1] In the HuggingFace implementation the multiplication by $\mathrm{RMS}(W_{t-1})$ is controlled by scale_parameter and the schedule $\rho_t$ by relative_step; with both enabled, Adafactor can be run with no externally supplied learning rate at all, which is the configuration used during T5 pretraining.^[3]^[10]
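The relative step size can be sketched directly from the formula. The function names are illustrative, the weights are a flat list for simplicity, and the default `eps2=1e-3` is an assumption matching the second epsilon listed for the HuggingFace implementation.

```python
import math


def rms(values):
    """Root-mean-square of a flat list of numbers."""
    return math.sqrt(sum(x * x for x in values) / len(values))


def relative_step_size(weights, step, eps2=1e-3):
    """alpha_t = max(eps2, RMS(W)) * rho_t, with rho_t = min(0.01, 1 / sqrt(t))."""
    rho = min(1e-2, 1.0 / math.sqrt(step))
    return max(eps2, rms(weights)) * rho


weights = [0.5, -0.5, 0.5, -0.5]                    # RMS = 0.5
print(relative_step_size(weights, step=100))        # rho is capped at 0.01
print(relative_step_size(weights, step=1_000_000))  # rho has decayed to 0.001
print(relative_step_size([0.0] * 4, step=100))      # eps2 keeps zero weights trainable
```

The last call shows the role of $\epsilon_2$: without the floor, a tensor initialized to zero would have a step size of zero and could never move.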

Algorithm summary

Combining these pieces, the per-step update for a matrix-shaped weight W with gradient G is approximately:

Compute $G_t$ .
Update $R_t$ and $C_t$ exponential moving averages of row and column sums of $G_t^2$ .
Form $\hat{V}_t = R_t C_t / \mathrm{sum}(R_t)$ .
Compute proposed update $U_t = G_t / \sqrt{\hat{V}_t}$ .
Clip: $\hat{U}_t = U_t / \max(1, \mathrm{RMS}(U_t) / d)$ .
Scale by step size $\alpha_t$ (relative or external).
Update parameters: $W_t = W_{t-1} - \alpha_t \hat{U}_t$ (with optional weight decay and optional first-moment momentum).^[1]^[9]

For tensors that are vectors or have a dimension smaller than the factoring threshold, the same algorithm is run with $V_t$ maintained directly rather than factored.^[11]
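The steps above can be put together in a compact pure-Python sketch for a single matrix-shaped weight. This is an illustration under stated assumptions, not any library's implementation: matrices are nested lists, momentum and weight decay are omitted, the relative step size is always used, and the function name and defaults (`eps1`, `eps2`, `d`, `decay`) are chosen here to mirror the values quoted in this article.

```python
import math


def adafactor_step(w, g, r, c, step, eps1=1e-30, eps2=1e-3, d=1.0, decay=0.8):
    """One Adafactor update for an n x m weight; returns the new (w, r, c)."""
    n, m = len(w), len(w[0])
    beta2 = 1.0 - step ** (-decay)                       # slowly increasing decay rate
    sq = [[g[i][j] ** 2 + eps1 for j in range(m)] for i in range(n)]

    # Moving averages of the row sums and column sums of the squared gradient.
    r = [beta2 * r[i] + (1 - beta2) * sum(sq[i]) for i in range(n)]
    c = [beta2 * c[j] + (1 - beta2) * sum(sq[i][j] for i in range(n)) for j in range(m)]

    # Rank-1 second-moment estimate and the normalized update.
    total = sum(r)
    u = [[g[i][j] / math.sqrt(r[i] * c[j] / total) for j in range(m)] for i in range(n)]

    # Update clipping on the RMS of the whole update tensor.
    rms_u = math.sqrt(sum(x * x for row in u for x in row) / (n * m))
    scale = max(1.0, rms_u / d)

    # Relative step size, scaled by the RMS of the current weights.
    rms_w = math.sqrt(sum(x * x for row in w for x in row) / (n * m))
    alpha = max(eps2, rms_w) * min(1e-2, 1.0 / math.sqrt(step))

    w = [[w[i][j] - alpha * u[i][j] / scale for j in range(m)] for i in range(n)]
    return w, r, c


w0 = [[0.5, -0.5], [0.25, 0.75]]
g = [[0.1, -0.2], [0.3, 0.4]]          # a fixed toy gradient, reused at every step
w, r, c = w0, [0.0, 0.0], [0.0, 0.0]
for t in range(1, 4):
    w, r, c = adafactor_step(w, g, r, c, t)
print(w)
```

Only `r` and `c` persist between steps, which is the whole point: the full matrix `u` is a temporary that can be discarded after the weights are updated.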

Comparison with AdamW

The most natural baseline for Adafactor is AdamW, the variant of Adam with decoupled weight decay that has become the default optimizer in modern LLM training.^[6] Both methods produce per-parameter adaptive step sizes via second-moment normalization, but they differ in important ways.

Property	AdamW	Adafactor
First moment storage	One full tensor per parameter	Optional; default off in the original paper
Second moment storage	One full tensor per parameter	Two vectors per matrix (rank-1 factor)
State size for $n \times m$ matrix	$2nm$	$n + m$ (default), or $nm + n + m$ with momentum
Learning rate	External, typically with warmup + decay	Internal "relative step", or external if scale_parameter=False
Update clipping	Not built in	Built in, default RMS threshold 1.0
Decay rate schedule	Fixed $\beta_2$ (commonly 0.999)	Time-varying, approaches 1 over training
Convergence on small models	Reliable	Often slightly worse, can be unstable without care
Convergence on very large Transformers	Stable but expensive in memory	Stable, much cheaper in memory
Typical use today	Most LLM and PEFT training	T5, PaLM, large MoE, GaLore-Adafactor, memory-constrained training

Sources: ^[1]^[6]^[10]^[11]^[12].

The trade-off is roughly that Adafactor sacrifices some convergence consistency, particularly on small models or short training runs, in exchange for substantial memory savings on the optimizer state. On the Transformer-Big WMT 2014 English-German task used in the original paper, Adafactor's published numbers match Adam's BLEU score while using only the per-row and per-column statistics described above.^[1] On smaller fine-tuning runs, practitioners frequently report that AdamW converges faster, which is one reason the HuggingFace documentation warns that training without learning-rate warmup or update clipping with Adafactor "is not recommended".^[10]

Hyperparameters and common configurations

The HuggingFace transformers.Adafactor implementation, a PyTorch implementation adapted from the fairseq code, exposes the following hyperparameters:^[10]

Parameter	Default	Description
`lr`	None	External learning rate; ignored when `relative_step=True`
`eps`	(1e-30, 0.001)	Tuple $(\epsilon_1, \epsilon_2)$ ; regularization constants for squared-gradient and parameter-scale denominators
`clip_threshold`	1.0	RMS threshold for update clipping
`decay_rate`	-0.8	Exponent controlling the time-varying second-moment decay
`beta1`	None	If set, enables a first-moment momentum buffer
`weight_decay`	0.0	L2 weight decay applied to updates
`scale_parameter`	True	If True, scales the learning rate by the RMS of the parameter
`relative_step`	True	If True, uses time-dependent relative step size and ignores `lr`
`warmup_init`	False	If True, modifies the relative step schedule to include warmup

The Optax (JAX) implementation has a similar surface, with parameters learning_rate, min_dim_size_to_factor=128, decay_rate=0.8, decay_offset=0, multiply_by_parameter_scale=True, clipping_threshold=1.0, momentum=None, dtype_momentum=float32, weight_decay_rate=None, eps=1e-30, and factored=True.^[11] The Keras Adafactor adds an epsilon_2=0.001 parameter that plays the role of the second epsilon in the HuggingFace tuple, with otherwise similar defaults.^[13]

T5 fine-tuning settings

The HuggingFace documentation explicitly calls out recommended settings for T5 fine-tuning, derived from community experience on the HuggingFace forums:^[10]

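```python
from transformers.optimization import Adafactor

# `model` is the T5 model being fine-tuned, defined elsewhere.
optimizer = Adafactor(
    model.parameters(),
    scale_parameter=False,  # do not rescale the step by the parameter RMS
    relative_step=False,    # use the external learning rate below
    warmup_init=False,
    lr=1e-3,
)
```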

The same docs note that fine-tuning T5 without LR warmup or update clipping is not recommended, that clip_threshold=1.0 should be used, and that other gradient-clipping operations should not be combined with Adafactor.^[10] An alternative configuration, used when no external scheduler is available, is scale_parameter=True, relative_step=True, warmup_init=True, lr=None, paired with HuggingFace's AdafactorSchedule helper which exposes the optimizer's internal learning rate to the Trainer's scheduling hooks.^[10]

Pretraining settings

For pretraining from scratch, the original paper uses the relative step size with no external learning rate.^[1] The T5 paper inherits this: T5-Base, T5-Large, T5-3B, and T5-11B were all pretrained with Adafactor using the relative step schedule, for one million steps at a batch size of $2^{11}$ sequences of length 512.^[3] PaLM uses a custom variant of Adafactor "without factorization", which Chowdhery et al. describe as "effectively equivalent to Adam with parameter scaling": the optimizer scales each parameter's learning rate by the root-mean-square of that parameter, keeping the relative-step and parameter-scale machinery from Adafactor but storing the full second-moment buffer.^[4] This unusual configuration suggests that, at PaLM's scale, the team valued the parameter-scaling behavior more than the memory savings, since they could afford the optimizer state with a model-parallel layout on TPU v4 pods.^[4]
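As a rough check on the scale of that schedule, and assuming every sequence is packed to the full 512 tokens (an assumption made here purely for the arithmetic), the number of tokens processed is

$$10^{6} \times 2^{11} \times 512 = 10^{6} \times 2^{20} \approx 1.05 \times 10^{12},$$

which is on the order of one trillion tokens.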

Adoption in large language models

T5 and the text-to-text family

Adafactor's largest single application is the T5 family, introduced in Raffel et al.'s 2019 paper Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.^[3] The paper describes the optimization setup briefly: "We use the AdaFactor optimizer for training. To make our results easier to reproduce, we use a 'inverse square root' learning rate schedule" or, in the relative-step configuration, no external schedule at all.^[3] T5-11B's 11-billion parameter encoder-decoder was trained on the C4 corpus for approximately one trillion tokens using Adafactor on TPU v3 pods, and the same optimizer choice carried over to subsequent releases such as T5 v1.1, the multilingual variant mT5 introduced by Xue et al.,^[14] and instruction-tuned descendants such as Flan-T5.

The success of T5 cemented Adafactor's reputation as the optimizer of choice for encoder-decoder Transformers at scale, and HuggingFace's documentation specifically calls out the optimizer as the recommended tool for fine-tuning T5 derivatives.^[10] One reason it appears so often in T5-related codebases is that switching to AdamW for T5 fine-tuning often produces worse loss curves and divergence under the same hyperparameter budget, a quirk that is widely reported in community write-ups but only partially explained in the literature; the dominant hypothesis is that the original pretraining used a relative step schedule, so post-training optimization with a fixed external learning rate of the wrong magnitude leads to drift.^[10]

PaLM

PaLM, Google's 540-billion-parameter dense Transformer described by Chowdhery et al. in 2022, used the unfactored variant of Adafactor described above.^[4] The paper reports a learning rate held at $10^{-2}$ for the first 10,000 steps and then decayed in proportion to the inverse square root of the step number, with the parameter-scale rescaling from Adafactor applied throughout, and reports 57.8% hardware FLOPs utilization on TPU v4 pods at the largest scale.^[4] Later Google technical reports, including those for PaLM 2 and Gemini, no longer fully disclose optimizer choices, so it is unclear how far Adafactor-derived configurations carry over.

Switch Transformer and other MoE

The Switch Transformer, Fedus et al.'s 2021 Mixture of Experts (MoE) architecture that scaled to over a trillion parameters, used Adafactor for optimization but encountered new instabilities specific to the sparse-routing regime.^[15] The paper addressed these with measures such as selective float32 precision in the router, a reduced parameter-initialization scale, and an auxiliary load-balancing loss;^[15] the "router z-loss" often associated with this line of work was introduced in the later ST-MoE paper (Zoph et al. 2022). These instabilities were handled through such changes to the model and loss rather than by changing optimizers.

Other adoption

Beyond Google, Adafactor sees use in a long tail of memory-constrained training settings. The HuggingFace Transformers integration registers Adafactor under optim="adafactor" in TrainingArguments, making it a one-line switch from AdamW in any fine-tuning workflow built on the library's Trainer.^[16] GaLore (Gradient Low-Rank Projection), a 2024 memory-efficient training method, registers a galore_adafactor variant that composes GaLore's low-rank gradient projection with Adafactor's factored second moments for cumulative memory savings.^[16]^[17] The Cornell Optimization Wiki lists additional applications including ResNet50 on ImageNet and several multilingual classification tasks.^[9] When fine-tuning T5-v1.1 or mT5 with LoRA or QLoRA, practitioners often pair the parameter-efficient adapter with Adafactor for the trainable subset to keep the full optimizer state below the available GPU memory.

Significance and place in the optimizer landscape

Adafactor's lasting significance is twofold. First, it demonstrated that the per-parameter second-moment buffer that had become standard in adaptive optimizers since AdaGrad and RMSProp could be drastically compressed without giving up Transformer-scale performance.^[1] This insight opened a family of follow-up optimizers, including CAME (Confidence-guided Adaptive Memory Efficient Optimization), which keeps Adafactor's factorization and addresses the resulting instability via a confidence-weighted update,^[12] and 8-bit and lower-precision optimizer variants that store the moment buffers in reduced precision rather than reducing their count.^[17] StableAdamW, registered in HuggingFace Transformers as stable_adamw, ports Adafactor's update-clipping mechanism back into AdamW so that gradient clipping is no longer necessary, demonstrating that some of Adafactor's contributions can be valuable even outside the factored regime.^[16]

Second, Adafactor served as the workhorse optimizer for several of the most influential LLM training programs of the 2018 to 2022 period, including T5, mT5, and PaLM.^[3]^[4]^[14] This longevity contrasts with later memory-efficient optimizers such as Lion (Chen et al. 2023) and Sophia (Liu et al. 2023), neither of which has yet reached the breadth of large-model deployment Adafactor achieved. Architectural changes in the dominant LLM training stacks of 2024 and 2025, particularly the rise of fully decoder-only Transformers trained on enormous token budgets, have shifted defaults back toward AdamW (often in 8-bit form), but Adafactor remains the recommended choice when memory is the binding constraint or when continuing a pretraining run that started with it.

The optimizer also influenced engineering choices outside its own algorithmic family. The Muon optimizer and the Schedule-Free optimizer cite Adafactor's mixed history (efficient but sometimes slow to converge) as part of their motivation for re-examining what the right baseline should be, and the explicit decoupling of optimizer state precision in DeepSpeed ZeRO and similar systems is partly a response to the same problem Adafactor first targeted: how to make optimizer memory not the limiting factor at scale.

Limitations and criticisms

Adafactor has several well-documented weaknesses, some inherent to the factored second moment and some related to the auxiliary heuristics:

Slower or noisier convergence. The CAME paper notes that Adafactor "suffers a performance degradation in the training of large language models compared with conventional adaptive gradient-based optimization methods", attributing this to errors introduced by the non-negative matrix factorization step.^[12] Community reports on smaller fine-tuning tasks echo this, finding that AdamW often reaches a lower validation loss faster under matched compute.^[10]^[18]
Sensitivity to hyperparameters and to whether the relative step is active. A persistent source of confusion is whether to use relative_step=True with lr=None, or relative_step=False with an external learning rate. The HuggingFace documentation lists both as plausible, and an open issue on the HuggingFace Transformers tracker explicitly flagged that the documentation of Adafactor "is at odds with Google implementations" because some Google implementations set defaults that differ from the published paper.^[18]
No first moment by default. While dropping the first-moment buffer is a major memory saving, it removes the implicit smoothing that Adam-style optimizers gain from a momentum term. The original paper's experiments include the no-momentum configuration and report acceptable BLEU, but later large-scale training runs frequently re-enable momentum (beta1 != None in HuggingFace, or momentum in Optax), trading some of the memory savings for stability.^[4]^[11]
Update-clipping interactions. Combining Adafactor with separate gradient clipping is explicitly discouraged in the HuggingFace docs, since the optimizer already clips its own proposed updates and adding a second clipping step changes the effective step distribution in ways that are hard to reason about.^[10]
Factorization assumes matrix structure. Adafactor's savings only apply when the parameter has at least two factorizable dimensions of nontrivial size. Vectors (e.g., biases, gradient-scale parameters, and 1D LayerNorm scales) and small embeddings are stored unfactored, so the per-parameter savings depend heavily on the model architecture.^[11]
Less benefit when optimizer state is already sharded. With DeepSpeed ZeRO-2/3 or FSDP, Adam's optimizer state is sharded across data-parallel workers, so the per-worker memory cost falls roughly linearly with the worker count. Adafactor's absolute savings remain, but their relative importance is diminished compared to single-host training.
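The sharding point can be made concrete with an idealized calculation. The sketch assumes float32 moment buffers split perfectly evenly across workers and ignores everything else that such systems shard; the function name is hypothetical.

```python
def adam_state_gb_per_worker(num_params: int, workers: int, bytes_per_value: int = 4) -> float:
    """Per-worker size in GB of Adam's two moment buffers under ideal even sharding."""
    return 2 * num_params * bytes_per_value / workers / 1e9


for workers in (1, 8, 64):
    print(workers, adam_state_gb_per_worker(1_000_000_000, workers))
```

For a one-billion-parameter model the per-worker cost drops from 8 GB on a single device to 0.125 GB across 64 workers, at which point the additional reduction from factoring the second moment matters much less.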

Implementations and software

Reference and widely used implementations include:

Framework	Entry point	Notes
Original (TensorFlow / Mesh-TensorFlow)	`tf.contrib.opt.AdafactorOptimizer` and `mesh_tensorflow.optimize.AdafactorOptimizer`	Used in original T5 pretraining^[3]
fairseq (PyTorch)	`fairseq.optim.adafactor.Adafactor`	First PyTorch port; basis for many later ports^[10]
HuggingFace Transformers (PyTorch)	`transformers.Adafactor`, `transformers.optimization.AdafactorSchedule`; also `TrainingArguments(optim="adafactor")`	Most common modern usage; handles low-precision values^[10]^[16]
Optax (JAX)	`optax.adafactor`	Pure-JAX implementation; used in T5X, JAX-based PaLM reproductions^[11]
Keras	`keras.optimizers.Adafactor`	Standalone Keras implementation with similar defaults^[13]

The HuggingFace Adafactor class explicitly documents support for FP16 and bfloat16 values, though the docs note this has not been extensively tested.^[10] Optax's implementation exposes a dtype_momentum parameter that allows the first-moment buffer (when enabled) to be stored in lower precision, an extension beyond the original paper's design.^[11]

Related work

Memory-efficient adaptive optimizers

Adafactor sits in a small family of optimizers that explicitly trade some statistical fidelity for reduced state size:

CAME (Luo et al., ACL 2023) keeps Adafactor's factorization but adds a confidence-weighted update, modulating step size by the agreement between the running moving average and the current update.^[12]
GaLore (Zhao et al. 2024) projects gradients into a low-rank subspace before passing them to an underlying optimizer; its galore_adafactor variant composes the low-rank projection with Adafactor's factored second moments for additional savings.^[16]^[17]
8-bit optimizers (Dettmers et al. 2022) keep the per-parameter buffer but quantize it to 8 bits, an orthogonal approach that complements rather than competes with Adafactor.

Direct competitors

AdamW remains the dominant baseline against which Adafactor is judged.^[6] Newer optimizers such as Lion (Chen et al. 2023, EvoLved Sign Momentum), Sophia (Liu et al. 2023, second-order via Hutchinson estimator), Muon (Jordan 2024, Newton-Schulz orthogonalization), Shampoo (Gupta et al. 2018, full second-order), and Schedule-Free (Defazio et al. 2024) all occupy slightly different points in the trade-off space among memory, second-order information, learning-rate schedule, and convergence behavior, and most have been benchmarked against Adafactor as part of their evaluations.

Theoretical analyses

The Cornell Optimization Wiki notes that the row-times-column-over-sum factorization Adafactor uses is exactly the minimizer of the generalized Kullback-Leibler divergence (I-divergence) between the true second-moment matrix and its rank-1 non-negative approximation, connecting Adafactor to the broader literature on non-negative matrix factorization.^[9] The relative step size and update clipping are more ad hoc and have inspired several follow-up papers that attempt to justify them in terms of trust-region behavior or per-layer learning-rate adaptation.^[12]
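A short sketch of why this form arises, using notation introduced here for illustration ($r_i$ and $c_j$ for the entries of the two factors): the I-divergence between a non-negative matrix $V$ and a rank-1 candidate with entries $r_i c_j$ is

$$D(V \,\|\, r c^\top) = \sum_{i,j} \left( V_{ij} \log \frac{V_{ij}}{r_i c_j} - V_{ij} + r_i c_j \right).$$

Setting the partial derivatives with respect to $r_i$ and $c_j$ to zero gives

$$r_i \sum_{j} c_j = \sum_{j} V_{ij}, \qquad c_j \sum_{i} r_i = \sum_{i} V_{ij},$$

so any stationary point reproduces the row sums and column sums of $V$. One solution is $r_i = \sum_j V_{ij}$ and $c_j = \sum_i V_{ij} / \sum_{i,j} V_{ij}$, whose product is the row-sum times column-sum over total-sum form; the overall scale can be moved freely between the two factors without changing the product.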

References

1. Noam Shazeer and Mitchell Stern, "Adafactor: Adaptive Learning Rates with Sublinear Memory Cost", arXiv, 2018-04-11. https://arxiv.org/abs/1804.04235. Accessed 2026-05-20.
2. Sulbha Jain, "Optimizers in LLM Fine-Tuning: AdamW vs Adafactor", Medium, 2024-08-08. https://sulbhajain.medium.com/optimizers-in-llm-adamw-vs-adafactor-54fc3cb37671. Accessed 2026-05-20.
3. Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu, "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer", arXiv, 2019-10-23. https://arxiv.org/abs/1910.10683. Accessed 2026-05-20.
4. Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, et al., "PaLM: Scaling Language Modeling with Pathways", arXiv, 2022-04-05. https://arxiv.org/abs/2204.02311. Accessed 2026-05-20.
5. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin, "Attention Is All You Need", arXiv, 2017-06-12. https://arxiv.org/abs/1706.03762. Accessed 2026-05-20.
6. Ilya Loshchilov and Frank Hutter, "Decoupled Weight Decay Regularization", arXiv, 2017-11-14. https://arxiv.org/abs/1711.05101. Accessed 2026-05-20.
7. Noam Shazeer, "Noam Shazeer Personal Site", noamshazeer.com, 2024. https://www.noamshazeer.com/. Accessed 2026-05-20.
8. ICML, "Adafactor: Adaptive Learning Rates with Sublinear Memory Cost (Poster)", International Conference on Machine Learning 2018, 2018-07-12. https://icml.cc/virtual/2018/poster/2446. Accessed 2026-05-20.
9. Cornell University Computational Optimization Open Textbook, "Adafactor", optimization.cbe.cornell.edu, 2023-12-15. https://optimization.cbe.cornell.edu/index.php?title=Adafactor. Accessed 2026-05-20.
10. HuggingFace, "Optimization: Adafactor", HuggingFace Transformers documentation, 2026-04-15. https://huggingface.co/docs/transformers/en/main_classes/optimizer_schedules. Accessed 2026-05-20.
11. Google DeepMind, "Optax Optimizers: adafactor", Optax documentation, 2026-03-10. https://optax.readthedocs.io/en/latest/api/optimizers.html. Accessed 2026-05-20.
12. Yang Luo, Xiaozhe Ren, Zangwei Zheng, Zhuo Jiang, Xin Jiang, and Yang You, "CAME: Confidence-guided Adaptive Memory Efficient Optimization", arXiv, 2023-07-05. https://arxiv.org/abs/2307.02047. Accessed 2026-05-20.
13. Keras, "Adafactor optimizer", keras.io API documentation, 2026-02-20. https://keras.io/api/optimizers/adafactor/. Accessed 2026-05-20.
14. Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel, "mT5: A massively multilingual pre-trained text-to-text transformer", arXiv, 2020-10-22. https://arxiv.org/abs/2010.11934. Accessed 2026-05-20.
15. William Fedus, Barret Zoph, and Noam Shazeer, "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity", arXiv, 2021-01-11. https://arxiv.org/abs/2101.03961. Accessed 2026-05-20.
16. HuggingFace, "Optimizers and schedulers", HuggingFace Transformers documentation, 2026-04-15. https://huggingface.co/docs/transformers/en/optimizers. Accessed 2026-05-20.
17. Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, and Yuandong Tian, "GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection", arXiv, 2024-03-06. https://arxiv.org/abs/2403.03507. Accessed 2026-05-20.
18. HuggingFace, "Documentation of Adafactor is at odds with Google implementations (Issue #19387)", GitHub, 2022-10-05. https://github.com/huggingface/transformers/issues/19387. Accessed 2026-05-20.
