Adafactor
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,395 words
Adafactor is a memory-efficient adaptive learning-rate optimizer for training deep neural networks, introduced by Noam Shazeer and Mitchell Stern in the 2018 paper Adafactor: Adaptive Learning Rates with Sublinear Memory Cost.[1] Its central contribution is a rank-1 factorization of the per-parameter second-moment estimator that Adam stores for every weight: instead of keeping one running average of squared gradients per parameter, Adafactor stores only a per-row vector and a per-column vector for each weight matrix and reconstructs the full second-moment matrix via an outer product.[1] This reduces the optimizer state for a matrix of shape n by m from O(nm) to O(n+m), which is sublinear in the number of parameters.[1][2] Adafactor also introduces a relative step-size schedule, update clipping based on the root-mean-square of the proposed update, and an optional rescaling by the RMS of the parameter tensor itself.[1] The optimizer became the standard choice for training the T5 family of text-to-text transformers,[3] was used in slightly modified form for PaLM,[4] and remains the default in several large-scale Google training pipelines for Transformer models on TPU hardware.[3][4]
Adaptive gradient methods such as AdaGrad, RMSProp, and Adam scale each parameter's update by an estimate of the second moment of its gradient, typically maintained as an exponential moving average of squared gradients.[1] These methods made training of deep neural networks substantially more robust to learning-rate choice and helped popularize the Transformer architecture introduced in Attention Is All You Need.[5] The price is optimizer memory that scales with model size: Adam and AdamW both maintain one first-moment value and one second-moment value per parameter, so the optimizer state is twice the size of the parameters themselves.[6] For a model with one billion parameters trained in 32-bit precision, this corresponds to roughly 8 GB for the two moment buffers, of which 4 GB is the second-moment buffer alone.[2]
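The memory arithmetic can be checked directly. The snippet below is a back-of-the-envelope sketch that assumes 4-byte (32-bit) optimizer state and decimal gigabytes; the helper name `adam_state_bytes` is illustrative and not part of any library.

```python
def adam_state_bytes(num_params: int, bytes_per_value: int = 4) -> dict:
    """Bytes held by Adam's two moment buffers for `num_params` weights."""
    per_buffer = num_params * bytes_per_value
    return {
        "first_moment": per_buffer,
        "second_moment": per_buffer,
        "total": 2 * per_buffer,
    }


state = adam_state_bytes(1_000_000_000)
for name, size in state.items():
    print(f"{name}: {size / 1e9:.1f} GB")
```

The parameters themselves, gradients, and activations come on top of these figures.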
As Transformer models scaled past the BERT-base regime in 2018 and 2019, optimizer memory became a binding constraint. The original Attention Is All You Need models used Adam, and the BERT release continued that tradition,[5] but training larger seq2seq Transformers on TPU pods made the second-moment storage uncomfortably expensive. Shazeer, who had already co-developed the sparsely gated Mixture of Experts layer and was building the Mesh-TensorFlow distribution library at Google Brain, and Stern, then a PhD student at UC Berkeley working as an intern at Google, sought an optimizer that retained Adam-like per-parameter learning rates while eliminating most of the per-parameter state.[1][7] Their April 2018 arXiv preprint, 1804.04235, presented Adafactor as the result.[1] The paper was accepted to the International Conference on Machine Learning (ICML) 2018, held in Stockholm, with a poster presentation in July of that year.[8]
The work occupies a specific design point in optimizer history: rather than designing a new update rule from scratch (as later work on Lion or Sophia would do), Adafactor begins from Adam's update and asks how much of its memory can be removed without breaking the empirical behavior on Transformer training. The answer in the paper is that the per-parameter second-moment buffer can be replaced by O(n+m) statistics for each matrix-shaped weight, that the first-moment buffer can be dropped entirely with care, and that several additional pieces (update clipping, a slowly increasing decay rate, and relative step sizes) are needed to make the method stable on machine translation benchmarks.[1]
Adam[6] maintains, for each parameter, a first-moment estimate m (running mean of gradients) and a second-moment estimate v (running mean of squared gradients), both as exponential moving averages with decay rates beta_1 and beta_2. The update at step t is roughly proportional to m / (sqrt(v) + epsilon), with bias-correction terms applied to both moments early in training.[6] For a weight matrix W with n rows and m columns (m here denotes a dimension, not the first moment), Adam therefore stores two additional matrices, one per moment estimate, each of shape n by m, on top of W itself.
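Adafactor's key observation, derived in the paper's Section 3, is that for non-negative matrices the second-moment matrix V can be approximated using only its row sums R and column sums C, with V_approx = R C^T / sum(R).[1] Concretely, instead of maintaining a full matrix V of squared-gradient moving averages, the optimizer maintains:

- R_t, a vector of length n holding a moving average of the row sums of the squared gradient
- C_t, a vector of length m holding a moving average of the column sums of the squared gradient

At step t, Adafactor computes:

R_t = beta_2_hat_t * R_{t-1} + (1 - beta_2_hat_t) * (G_t^2 + epsilon_1 * 1_n 1_m^T) 1_m

C_t = beta_2_hat_t * C_{t-1} + (1 - beta_2_hat_t) * (G_t^2 + epsilon_1 * 1_n 1_m^T)^T 1_n

V_hat_t = R_t C_t^T / (1_n^T R_t)
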
where G_t is the gradient at step t, G_t^2 is the elementwise square, 1_m and 1_n are vectors of ones, and beta_2_hat_t is a time-dependent decay rate.[1][9] The update direction is then G_t / sqrt(V_hat_t), the same Adam-style normalization but with V_hat_t derived from the rank-1 factors rather than stored directly.[1] The Cornell Optimization Wiki notes that this factorization is the minimizer of the generalized Kullback-Leibler divergence (I-divergence) between the true second-moment matrix and its rank-1 approximation under the non-negativity constraint, which is why the row-times-column-over-sum form appears rather than a more naive outer product.[9]
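The rank-1 reconstruction can be illustrated on a single matrix. The following is a minimal NumPy sketch, not the reference implementation: it omits the moving average and epsilon_1 and factors one squared-gradient matrix; the function name `factored_second_moment` is hypothetical.

```python
import numpy as np


def factored_second_moment(v: np.ndarray) -> np.ndarray:
    """Rank-1 approximation R C^T / sum(R) of a non-negative matrix."""
    r = v.sum(axis=1)  # row sums, shape (n,)
    c = v.sum(axis=0)  # column sums, shape (m,)
    return np.outer(r, c) / r.sum()


rng = np.random.default_rng(0)

# A matrix that is exactly rank-1 is reconstructed exactly.
rank1 = np.outer(rng.uniform(0.1, 1.0, size=4), rng.uniform(0.1, 1.0, size=6))
print("rank-1 input recovered:", np.allclose(factored_second_moment(rank1), rank1))

# A general squared-gradient matrix is only approximated, but its row and
# column sums are preserved by construction.
v = rng.normal(size=(4, 6)) ** 2
v_approx = factored_second_moment(v)
print("row sums preserved:", np.allclose(v_approx.sum(axis=1), v.sum(axis=1)))
print("column sums preserved:", np.allclose(v_approx.sum(axis=0), v.sum(axis=0)))
print("max abs error:", np.abs(v - v_approx).max())
```

Only the n + m values in `r` and `c` would need to be stored between steps; the full matrix is rebuilt on demand.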
For tensors with more than two dimensions, Adafactor's implementations typically apply the factorization to the last two axes, treating any leading axes as a batch dimension.[10] Vectors (rank-1 tensors) and small dimensions are usually left in unfactored form: Optax's reference implementation, for example, exposes a min_dim_size_to_factor parameter that defaults to 128, below which Adafactor falls back to a full per-parameter second-moment estimate.[11]
For an n by m matrix, the second-moment state shrinks from one n by m matrix (Adam's v) to two vectors of total size n + m. With Adafactor's default of dropping the first moment entirely (beta1 = None), the total state per matrix weight is n + m values rather than Adam's 2nm.[1][2] On a dense feed-forward Transformer layer of shape 4096 by 16384, for example, Adam's v alone is 4096 * 16384 = 67,108,864 (about 67M) entries, while Adafactor's R and C together hold 4096 + 16384 = 20,480 entries, a roughly 3300x reduction for that tensor.[2] When momentum (a first-moment buffer) is enabled in Adafactor, a full n by m buffer is stored again, so the saving falls to roughly half of Adam's state; the first moment can optionally be stored in lower precision to recover part of the difference.[11]
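The entry counts above can be reproduced with a few lines of Python. This is a simplified sketch: it assumes the last two axes are factored, that leading axes act as a batch dimension, and that a tensor falls back to a full estimate when either factored axis is below the threshold. Real implementations differ in how they pick the factored axes, and `second_moment_entries` is a hypothetical helper.

```python
from math import prod


def second_moment_entries(shape, factored=True, min_dim_size_to_factor=128):
    """Number of second-moment values stored for one tensor of `shape`."""
    full = prod(shape)
    too_small = len(shape) < 2 or min(shape[-2:]) < min_dim_size_to_factor
    if not factored or too_small:
        return full  # keep one value per parameter
    batch = prod(shape[:-2])  # leading axes are treated as a batch
    rows, cols = shape[-2], shape[-1]
    return batch * (rows + cols)


ffn = (4096, 16384)
adam_v = second_moment_entries(ffn, factored=False)
adafactor_rc = second_moment_entries(ffn)
print(adam_v, adafactor_rc, adam_v / adafactor_rc)

# A bias vector and a narrow matrix are left unfactored under these assumptions.
print(second_moment_entries((4096,)), second_moment_entries((64, 4096)))
```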
A side effect of the factored approximation is that the implied per-parameter second-moment estimate can drift from the true one, occasionally producing very large normalized updates. The paper diagnoses this in Section 4 and proposes update clipping: after computing the proposed update U_t = G_t / sqrt(V_hat_t), Adafactor scales it down whenever its root-mean-square exceeds a threshold d, replacing U_t with U_t / max(1, RMS(U_t) / d).[1] The default threshold is 1.0 in the original paper and in both the HuggingFace and Optax implementations.[10][11] This update clipping is distinct from gradient clipping (which acts on the raw gradient) and is reported by the HuggingFace docs as essential for stability when training Transformer models with Adafactor.[10]
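The clipping rule is short enough to write out. The sketch below applies U / max(1, RMS(U) / d) to a NumPy array; the helper names `rms` and `clip_update` are illustrative only.

```python
import numpy as np


def rms(x: np.ndarray) -> float:
    """Root-mean-square of all entries of `x`."""
    return float(np.sqrt(np.mean(np.square(x))))


def clip_update(u: np.ndarray, d: float = 1.0) -> np.ndarray:
    """Scale `u` down so that its RMS does not exceed the threshold `d`."""
    return u / max(1.0, rms(u) / d)


small = np.full((2, 3), 0.5)  # RMS 0.5, below the threshold: unchanged
large = np.full((2, 3), 4.0)  # RMS 4.0, above the threshold: rescaled
print(rms(clip_update(small)), rms(clip_update(large)))
```

Because the whole tensor is divided by one scalar, the direction of the update is preserved and only its magnitude changes.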
The paper also identifies a more general failure mode of Adam-style optimizers: when the second-moment decay rate beta_2 is held fixed near 1, the running estimate can lag behind sudden increases in gradient magnitude, producing oversized updates. Adafactor instead uses a time-dependent decay rate beta_2_hat_t that approaches 1 as training progresses, defined in the paper as 1 - t^{-c} for a constant c.[1] Implementations differ in how they store the exponent: the HuggingFace decay_rate hyperparameter defaults to -0.8, with the sign already included, while the Optax decay_rate defaults to 0.8.[10][11] This is a separate mitigation from update clipping; the paper shows it also helps Adam itself when applied as a drop-in modification.[1]
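A short sketch makes the shape of this schedule concrete. It assumes c = 0.8 and a 1-based step count; `beta2_hat` is an illustrative name.

```python
def beta2_hat(t: int, c: float = 0.8) -> float:
    """Time-dependent second-moment decay 1 - t^(-c) for step t >= 1."""
    return 1.0 - t ** (-c)


for t in (1, 10, 100, 1_000, 10_000):
    print(t, round(beta2_hat(t), 4))
```

At t = 1 the decay is exactly 0, so the first estimate is simply the first squared gradient, and the value then rises toward 1 without ever reaching it.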
In place of a fixed external learning rate, Adafactor introduces a relative step size alpha_t = max(epsilon_2, RMS(W_{t-1})) * rho_t, where W_{t-1} is the current parameter tensor and rho_t is a base step size that depends only on the iteration count, typically rho_t = min(0.01, 1 / sqrt(t)).[1][9] The intuition is that the appropriate magnitude of a parameter update should scale with the magnitude of the parameter itself, so a layer with very small weights gets very small updates while a layer with large weights gets correspondingly larger ones.[1] In the HuggingFace implementation the parameter-scale factor corresponds to the scale_parameter option and the schedule rho_t to relative_step; with both enabled, Adafactor can be run with no externally supplied learning rate at all, which is the configuration used during T5 pretraining.[3][10]
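The relative step size can be sketched in plain Python. The function names are hypothetical, and epsilon_2 is assumed to take the 0.001 default listed for the HuggingFace implementation.

```python
import math


def rho(t: int) -> float:
    """Base step size min(0.01, 1 / sqrt(t)) for a 1-based step count."""
    return min(1e-2, 1.0 / math.sqrt(t))


def relative_step_size(param_rms: float, t: int, eps2: float = 1e-3) -> float:
    """alpha_t = max(eps2, RMS(W)) * rho_t."""
    return max(eps2, param_rms) * rho(t)


# A zero-initialised tensor still moves, because eps2 floors the scale.
print(relative_step_size(param_rms=0.0, t=1))
# A tensor with larger weights takes proportionally larger steps.
print(relative_step_size(param_rms=0.05, t=100))
# After step 10,000 the base step starts to decay as 1 / sqrt(t).
print(rho(10_000), rho(40_000))
```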
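Combining these pieces, the per-step update for a matrix-shaped weight W with gradient G_t is approximately:

1. alpha_t = max(epsilon_2, RMS(W_{t-1})) * rho_t (relative step size)
2. beta_2_hat_t = 1 - t^{-c} (time-dependent decay)
3. R_t = beta_2_hat_t * R_{t-1} + (1 - beta_2_hat_t) * (G_t^2 + epsilon_1 * 1_n 1_m^T) 1_m, and C_t = beta_2_hat_t * C_{t-1} + (1 - beta_2_hat_t) * (G_t^2 + epsilon_1 * 1_n 1_m^T)^T 1_n (factored statistics)
4. V_hat_t = R_t C_t^T / (1_n^T R_t) (rank-1 reconstruction)
5. U_t = G_t / sqrt(V_hat_t) (normalized update)
6. U_hat_t = U_t / max(1, RMS(U_t) / d) (update clipping)
7. W_t = W_{t-1} - alpha_t * U_hat_t
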
For tensors that are vectors or have a dimension smaller than the factoring threshold, the same algorithm is run with V_t maintained directly rather than factored.[11]
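The components described above can be combined into a single step. The following NumPy sketch handles one 2-D weight without momentum or weight decay; it is an illustration under the defaults quoted in this article, not a port of any library's implementation, and `adafactor_step` is a hypothetical name.

```python
import numpy as np


def rms(x):
    return float(np.sqrt(np.mean(np.square(x))))


def adafactor_step(w, g, r, c, t, eps1=1e-30, eps2=1e-3, d=1.0, decay_c=0.8):
    """One Adafactor update for a 2-D weight `w` with gradient `g`.

    `r` (length n) and `c` (length m) are the running row and column
    statistics; `t` is the 1-based step count.
    """
    rho_t = min(1e-2, 1.0 / np.sqrt(t))   # base relative step
    alpha_t = max(eps2, rms(w)) * rho_t   # scaled by the parameter RMS
    beta2_t = 1.0 - t ** (-decay_c)       # time-dependent decay

    sq = np.square(g) + eps1
    r = beta2_t * r + (1.0 - beta2_t) * sq.sum(axis=1)
    c = beta2_t * c + (1.0 - beta2_t) * sq.sum(axis=0)

    v_hat = np.outer(r, c) / r.sum()      # rank-1 reconstruction
    u = g / np.sqrt(v_hat)                # normalized update
    u = u / max(1.0, rms(u) / d)          # update clipping
    return w - alpha_t * u, r, c


# Toy problem: minimise 0.5 * ||w||^2, whose gradient is w itself.
rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16))
r, c = np.zeros(8), np.zeros(16)
loss_before = 0.5 * np.sum(w ** 2)
for t in range(1, 101):
    w, r, c = adafactor_step(w, w, r, c, t)
loss_after = 0.5 * np.sum(w ** 2)
print(loss_before, loss_after)
```

The optimizer state carried between steps is just `r` and `c`, that is 8 + 16 values for the 8 by 16 weight in this example.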
The most natural baseline for Adafactor is AdamW, the variant of Adam with decoupled weight decay that has become the default optimizer in modern LLM training.[6] Both methods produce per-parameter adaptive step sizes via second-moment normalization, but they differ in important ways.
| Property | AdamW | Adafactor |
|---|---|---|
| First moment storage | One full tensor per parameter | Optional; default off in the original paper |
| Second moment storage | One full tensor per parameter | Two vectors per matrix (rank-1 factor) |
| State size for n by m matrix | 2nm | n + m (default), or nm + n + m with momentum |
| Learning rate | External, typically with warmup + decay | Internal "relative step", or external if scale_parameter=False |
| Update clipping | Not built in | Built in, default RMS threshold 1.0 |
| Decay rate schedule | Fixed beta_2 (commonly 0.999) | Time-varying, approaches 1 over training |
| Convergence on small models | Reliable | Often slightly worse, can be unstable without care |
| Convergence on very large Transformers | Stable but expensive in memory | Stable, much cheaper in memory |
| Typical use today | Most LLM and PEFT training | T5, PaLM, large MoE, GaLore-Adafactor, memory-constrained training |
The trade-off is roughly that Adafactor sacrifices some convergence consistency, particularly on small models or short training runs, in exchange for substantial memory savings on the optimizer state. On the Transformer-Big WMT 2014 English-German task used in the original paper, Adafactor's published numbers match Adam's BLEU score while using only the per-row and per-column statistics described above.[1] On smaller fine-tuning runs, practitioners frequently report that AdamW converges faster, which is one reason the HuggingFace documentation warns that training without learning-rate warmup or update clipping with Adafactor "is not recommended".[10]
The HuggingFace transformers.Adafactor implementation, a PyTorch implementation adapted from the fairseq code, exposes the following hyperparameters:[10]
| Parameter | Default | Description |
|---|---|---|
| lr | None | External learning rate; ignored when relative_step=True |
| eps | (1e-30, 0.001) | Tuple (epsilon_1, epsilon_2); regularization constants for squared-gradient and parameter-scale denominators |
| clip_threshold | 1.0 | RMS threshold for update clipping |
| decay_rate | -0.8 | Exponent controlling the time-varying second-moment decay |
| beta1 | None | If set, enables a first-moment momentum buffer |
| weight_decay | 0.0 | L2 weight decay applied to updates |
| scale_parameter | True | If True, scales the learning rate by the RMS of the parameter |
| relative_step | True | If True, uses time-dependent relative step size and ignores lr |
| warmup_init | False | If True, modifies the relative step schedule to include warmup |
The Optax (JAX) implementation has a similar surface, with parameters learning_rate, min_dim_size_to_factor=128, decay_rate=0.8, decay_offset=0, multiply_by_parameter_scale=True, clipping_threshold=1.0, momentum=None, dtype_momentum=float32, weight_decay_rate=None, eps=1e-30, and factored=True.[11] The Keras Adafactor adds an epsilon_2=0.001 parameter that plays the role of the second epsilon in the HuggingFace tuple, with otherwise similar defaults.[13]
The HuggingFace documentation explicitly calls out recommended settings for T5 fine-tuning, derived from community experience on the HuggingFace forums:[10]
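```python
from transformers.optimization import Adafactor

# `model` is the torch.nn.Module being fine-tuned (defined elsewhere).
optimizer = Adafactor(
    model.parameters(),
    scale_parameter=False,  # do not rescale by the parameter RMS
    relative_step=False,    # use the external learning rate below
    warmup_init=False,
    lr=1e-3,
)
```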
The same docs note that fine-tuning T5 without LR warmup or update clipping is not recommended, that clip_threshold=1.0 should be used, and that other gradient-clipping operations should not be combined with Adafactor.[10] An alternative configuration, used when no external scheduler is available, is scale_parameter=True, relative_step=True, warmup_init=True, lr=None, paired with HuggingFace's AdafactorSchedule helper which exposes the optimizer's internal learning rate to the Trainer's scheduling hooks.[10]
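The effect of warmup_init on the internal schedule can be sketched as follows. The constants reflect the HuggingFace implementation as it is commonly described (a cap of 1e-6 * step with warmup, 1e-2 without) and should be treated as an assumption to check against the installed version; `relative_step` here is a standalone illustrative function, not a library call.

```python
import math


def relative_step(step: int, warmup_init: bool = False) -> float:
    """Base relative step before parameter scaling (step is 1-based)."""
    cap = 1e-6 * step if warmup_init else 1e-2
    return min(cap, 1.0 / math.sqrt(step))


for step in (1, 1_000, 10_000, 40_000):
    print(step, relative_step(step), relative_step(step, warmup_init=True))
```

Under these assumptions the warmup variant rises linearly until it meets the 1 / sqrt(step) curve at step 10,000, after which both variants coincide.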
For pretraining from scratch, the original paper uses the relative step size with no external learning rate.[1] The T5 paper inherits this: T5-Base, T5-Large, T5-3B, and T5-11B were all pretrained with Adafactor using the relative step schedule, for one million steps at a batch size of 2^11 sequences of length 512.[3] PaLM uses a custom variant of Adafactor "without factorization", which Chowdhery et al. describe as "effectively equivalent to Adam with parameter scaling": the optimizer scales each parameter's learning rate by the root-mean-square of that parameter, keeping the relative-step and parameter-scale machinery from Adafactor but storing the full second-moment buffer.[4] This unusual configuration suggests that, at PaLM's scale, the team valued the parameter-scaling behavior more than the memory savings, since they could afford the optimizer state with a model-parallel layout on TPU v4 pods.[4]
Adafactor's largest single application is the T5 family, introduced in Raffel et al.'s 2019 paper Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.[3] The paper describes the optimization setup briefly: "We use the AdaFactor optimizer for training. To make our results easier to reproduce, we use a 'inverse square root' learning rate schedule" or, in the relative-step configuration, no external schedule at all.[3] T5-11B's 11-billion parameter encoder-decoder was trained on the C4 corpus for approximately one trillion tokens using Adafactor on TPU v3 pods, and the same optimizer choice carried over to subsequent releases such as T5 v1.1, the multilingual variant mT5 introduced by Xue et al.,[14] and instruction-tuned descendants such as Flan-T5.
The success of T5 cemented Adafactor's reputation as the optimizer of choice for encoder-decoder Transformers at scale, and HuggingFace's documentation specifically calls out the optimizer as the recommended tool for fine-tuning T5 derivatives.[10] One reason it appears so often in T5-related codebases is that switching to AdamW for T5 fine-tuning often produces worse loss curves and divergence under the same hyperparameter budget, a quirk that is widely reported in community write-ups but only partially explained in the literature; the dominant hypothesis is that the original pretraining used a relative step schedule, so post-training optimization with a fixed external learning rate of the wrong magnitude leads to drift.[10]
PaLM, Google's 540-billion-parameter dense Transformer described by Chowdhery et al. in 2022, used the unfactored variant of Adafactor described above.[4] The paper reports a learning rate of 10^-2 held constant for the first 10,000 steps and then decayed as 1/sqrt(k), where k is the step number, with the parameter-scale rescaling from Adafactor applied throughout, and reports 57.8% hardware FLOPs utilization on TPU v4 pods at the largest scale.[4] PaLM 2 and the larger Gemini family continue to use Adafactor-derived configurations for some experiments, though the precise optimizer choices are no longer fully disclosed in subsequent Google technical reports.
The Switch Transformer, Fedus et al.'s 2021 Mixture of Experts (MoE) architecture that scaled to over a trillion parameters, used Adafactor for optimization but encountered new instabilities specific to the sparse-routing regime.[15] The paper addressed these with measures such as selective precision in the router and a reduced initialization scale,[15] and the follow-up ST-MoE work by Zoph et al. (2022) introduced an auxiliary "router z-loss" to stabilize high-FLOP sparse models that had previously been unstable when trained with Adafactor in encoder-decoder configurations. This instability stemmed less from Adafactor itself than from the interaction between the optimizer's update-clipping behavior and the discrete expert-routing gradient pathways, and was addressed by these stabilization measures rather than by changing optimizers.[15]
Beyond Google, Adafactor sees use in a long tail of memory-constrained training settings. The HuggingFace Transformers integration registers Adafactor under optim="adafactor" in TrainingArguments, making it a one-line switch from AdamW in any fine-tuning workflow built on the library's Trainer.[16] GaLore (Gradient Low-Rank Projection), a 2024 memory-efficient optimizer family, registers a galore_adafactor variant that composes GaLore's low-rank gradient projection with Adafactor's factored second moments for cumulative memory savings.[16][17] The Cornell Optimization Wiki lists additional applications including ResNet50 on ImageNet and several multilingual classification tasks.[9] When fine-tuning T5-v1.1 or mT5 with LoRA or QLoRA, practitioners often pair the parameter-efficient adapter with Adafactor for the trainable subset to keep the full optimizer state below the available GPU memory.
Adafactor's lasting significance is twofold. First, it demonstrated that the per-parameter second-moment buffer that had become standard in adaptive optimizers since AdaGrad and RMSProp could be drastically compressed without giving up Transformer-scale performance.[1] This insight opened a family of follow-up optimizers, including CAME (Confidence-guided Adaptive Memory Efficient Optimization), which keeps Adafactor's factorization and addresses the resulting instability via a confidence-weighted update,[12] and 8-bit and lower-precision optimizer variants that store the moment buffers in reduced precision rather than reducing their count.[17] StableAdamW, registered in HuggingFace Transformers as stable_adamw, ports Adafactor's update-clipping mechanism back into AdamW so that gradient clipping is no longer necessary, demonstrating that some of Adafactor's contributions can be valuable even outside the factored regime.[16]
Second, Adafactor served as the workhorse optimizer for several of the most influential LLM training programs of the 2018 to 2022 period, including T5, mT5, and PaLM.[3][4][14] This longevity contrasts with later memory-efficient optimizers such as Lion (Chen et al. 2023) and Sophia (Liu et al. 2023), neither of which has yet reached the breadth of large-model deployment Adafactor achieved. Architectural changes in the dominant LLM training stacks of 2024 and 2025, particularly the rise of fully decoder-only Transformers trained on enormous token budgets, have shifted defaults back toward AdamW (often in 8-bit form), but Adafactor remains the recommended choice when memory is the binding constraint or when continuing a pretraining run that started with it.
The optimizer also influenced engineering choices outside its own algorithmic family. The Muon optimizer and the Schedule-Free optimizer cite Adafactor's mixed history (efficient but sometimes slow to converge) as part of their motivation for re-examining what the right baseline should be, and the explicit decoupling of optimizer state precision in DeepSpeed ZeRO and similar systems is partly a response to the same problem Adafactor first targeted: how to make optimizer memory not the limiting factor at scale.
Adafactor has several well-documented weaknesses, some inherent to the factored second moment and some related to the auxiliary heuristics:
- Configuration ambiguity: Adafactor can be run with relative_step=True and lr=None, or with relative_step=False and an external learning rate. The HuggingFace documentation lists both as plausible, and an open issue on the HuggingFace Transformers tracker explicitly flagged that the documentation of Adafactor "is at odds with Google implementations" because some Google implementations set defaults that differ from the published paper.[18]
- No momentum by default: the first-moment buffer can be re-enabled (beta1 != None in HuggingFace, or momentum in Optax), trading some of the memory savings for stability.[4][11]

Reference and widely used implementations include:
| Framework | Entry point | Notes |
|---|---|---|
| Original (TensorFlow / Mesh-TensorFlow) | tf.contrib.opt.AdafactorOptimizer and mesh_tensorflow.optimize.AdafactorOptimizer | Used in original T5 pretraining[3] |
| fairseq (PyTorch) | fairseq.optim.adafactor.Adafactor | First PyTorch port; basis for many later ports[10] |
| HuggingFace Transformers (PyTorch) | transformers.Adafactor, transformers.optimization.AdafactorSchedule; also TrainingArguments(optim="adafactor") | Most common modern usage; handles low-precision values[10][16] |
| Optax (JAX) | optax.adafactor | Pure-JAX implementation; used in T5X, JAX-based PaLM reproductions[11] |
| Keras | keras.optimizers.Adafactor | Standalone Keras implementation with similar defaults[13] |
The HuggingFace Adafactor class explicitly documents support for FP16 and bfloat16 values, though the docs note this has not been extensively tested.[10] Optax's implementation exposes a dtype_momentum parameter that allows the first-moment buffer (when enabled) to be stored in lower precision, an extension beyond the original paper's design.[11]
Adafactor sits in a small family of optimizers that explicitly trade some statistical fidelity for reduced state size:
- GaLore: the galore_adafactor variant composes the low-rank projection with Adafactor's factored second moments for additional savings.[16][17]

AdamW remains the dominant baseline against which Adafactor is judged.[6] Newer optimizers such as Lion (Chen et al. 2023, EvoLved Sign Momentum), Sophia (Liu et al. 2023, second-order via Hutchinson estimator), Muon (Jordan 2024, Newton-Schulz orthogonalization), Shampoo (Gupta et al. 2018, full second-order), and Schedule-Free (Defazio et al. 2024) all occupy slightly different points in the trade-off space among memory, second-order information, learning-rate schedule, and convergence behavior, and most have been benchmarked against Adafactor as part of their evaluations.
The Cornell Optimization Wiki notes that the row-times-column-over-sum factorization Adafactor uses is exactly the minimizer of the generalized Kullback-Leibler divergence (I-divergence) between the true second-moment matrix and its rank-1 non-negative approximation, connecting Adafactor to the broader literature on non-negative matrix factorization.[9] The relative step size and update clipping are more ad hoc and have inspired several follow-up papers that attempt to justify them in terms of trust-region behavior or per-layer learning-rate adaptation.[12]