# Adafactor

> Source: https://aiwiki.ai/wiki/adafactor
> Updated: 2026-06-07
> Categories: Deep Learning, Training & Optimization
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**Adafactor** is a memory-efficient adaptive learning-rate optimizer for training deep neural networks, introduced by Noam Shazeer and Mitchell Stern in the 2018 paper *Adafactor: Adaptive Learning Rates with Sublinear Memory Cost*.[^1] Its central contribution is a rank-1 factorization of the per-parameter second-moment estimator that Adam stores for every weight: instead of keeping one running average of squared gradients per parameter, Adafactor stores only a per-row vector and a per-column vector for each weight matrix and reconstructs the full second-moment matrix via an outer product.[^1] This reduces the optimizer state for a matrix of shape n by m from O(nm) to O(n+m), which is sublinear in the number of parameters.[^1][^2] Adafactor also introduces a relative step-size schedule, update clipping based on the root-mean-square of the proposed update, and an optional rescaling by the RMS of the parameter tensor itself.[^1] The optimizer became the standard choice for training the [T5](/wiki/t5) family of text-to-text transformers,[^3] was used in slightly modified form for [PaLM](/wiki/palm),[^4] and remains the default in several large-scale [Google](/wiki/google_brain) training pipelines for [Transformer](/wiki/transformer) models on [TPU](/wiki/cloud_tpu) hardware.[^3][^4]

## Background and motivation

Adaptive gradient methods such as AdaGrad, RMSProp, and Adam scale each parameter's update by an estimate of the second moment of its gradient, typically maintained as an exponential moving average of squared gradients.[^1] These methods made training of deep neural networks substantially more robust to learning-rate choice and helped popularize the Transformer architecture introduced in *Attention Is All You Need*.[^5] The price is optimizer state twice the size of the model: [AdamW](/wiki/adamw) and Adam both maintain one first-moment value and one second-moment value per parameter, on top of the parameters themselves.[^6] For a model with one billion parameters and 32-bit optimizer state, this corresponds to roughly 8 GB for the two moment buffers, of which 4 GB is the second-moment buffer alone.[^2]
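
The arithmetic behind these figures can be checked with a short sketch. The helper name `moment_buffer_gb` is hypothetical, and the calculation assumes 4 bytes per value and decimal gigabytes (10^9 bytes):

```python
def moment_buffer_gb(num_params: int, bytes_per_value: int = 4, num_buffers: int = 2) -> float:
    """Memory in GB (10^9 bytes) for `num_buffers` per-parameter optimizer buffers."""
    return num_params * bytes_per_value * num_buffers / 1e9


one_billion = 1_000_000_000

# Adam keeps a first-moment and a second-moment value for every parameter.
both_moments_gb = moment_buffer_gb(one_billion)
# The second-moment buffer alone is the part Adafactor's factorization targets.
second_moment_gb = moment_buffer_gb(one_billion, num_buffers=1)

print(both_moments_gb, second_moment_gb)  # 8.0 4.0
```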

As [Transformer](/wiki/transformer) models scaled past the BERT-base regime in 2018 and 2019, optimizer memory became a binding constraint. The original [Attention Is All You Need](/wiki/attention_is_all_you_need_transformer) models used Adam, and the BERT release continued that tradition,[^5] but training larger seq2seq Transformers on TPU pods made the second-moment storage uncomfortably expensive. Shazeer, who had already co-authored the sparsely gated Mixture of Experts layer and was building the Mesh-TensorFlow distribution library at Google Brain, and Stern, then a PhD student at UC Berkeley working as an intern at Google, sought an optimizer that retained Adam-like per-parameter learning rates while eliminating most of the per-parameter state.[^1][^7] Their April 2018 arXiv preprint, 1804.04235, presented Adafactor as the result.[^1] The paper was accepted to the International Conference on Machine Learning ([ICML](/wiki/icml)) 2018, held in Stockholm, with a poster presentation in July of that year.[^8]

The work occupies a specific design point in optimizer history: rather than designing a new update rule from scratch (as later work on Lion or Sophia would do), Adafactor begins from Adam's update and asks how much of its memory can be removed without breaking the empirical behavior on Transformer training. The answer in the paper is that the per-parameter second-moment buffer can be replaced by O(n+m) statistics for each matrix-shaped weight, that the first-moment buffer can be dropped entirely with care, and that several additional pieces (update clipping, a slowly increasing decay rate, and relative step sizes) are needed to make the method stable on machine translation benchmarks.[^1]

## Technical details

### The Adam baseline

Adam[^6] maintains, for each parameter, a first-moment estimate m (running mean of gradients) and a second-moment estimate v (running mean of squared gradients), both as exponential moving averages with decay rates beta_1 and beta_2. The update at step t is proportional to m_hat / (sqrt(v_hat) + epsilon), where m_hat and v_hat are the bias-corrected moments; the correction matters mainly early in training.[^6] For a weight matrix W with n rows and m columns, Adam therefore stores two additional n by m matrices, one per moment, on top of W itself.
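
As a point of reference, the following is a minimal sketch of one Adam step on a flat list of parameters. The function name `adam_step` and the default hyperparameter values are illustrative assumptions, not taken from any particular library:

```python
import math


def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step on flat lists; returns the updated (w, m, v)."""
    new_w, new_m, new_v = [], [], []
    for wi, gi, mi, vi in zip(w, g, m, v):
        mi = beta1 * mi + (1 - beta1) * gi        # first moment: mean of gradients
        vi = beta2 * vi + (1 - beta2) * gi * gi   # second moment: mean of squared gradients
        m_hat = mi / (1 - beta1 ** t)             # bias correction for step t >= 1
        v_hat = vi / (1 - beta2 ** t)
        new_w.append(wi - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(mi)
        new_v.append(vi)
    return new_w, new_m, new_v


# Both state lists have the same length as the parameter list.
w, m, v = adam_step([1.0, -2.0], [0.5, -0.1], [0.0, 0.0], [0.0, 0.0], t=1)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so each parameter moves by roughly `lr` regardless of gradient magnitude.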

### Factorization of the second moment

Adafactor's key observation, derived in the paper's Section 3, is that a non-negative second-moment matrix V can be approximated using only its row sums R and column sums C, with V_approx = R C / sum(R).[^1] Concretely, instead of maintaining a full matrix V of squared-gradient moving averages, the optimizer maintains:

- a vector R of shape n by 1, holding an exponential moving average of the row sums of squared gradients (one entry per row),
- a vector C of shape 1 by m, holding an exponential moving average of the column sums of squared gradients (one entry per column).[^1]

At step t, Adafactor computes:

- R_t = beta_2_hat_t * R_{t-1} + (1 - beta_2_hat_t) * (G_t^2 + epsilon_1) * 1_m
- C_t = beta_2_hat_t * C_{t-1} + (1 - beta_2_hat_t) * 1_n^T * (G_t^2 + epsilon_1)
- V_hat_t = R_t * C_t / (1_n^T * R_t)

where G_t is the gradient at step t, G_t^2 is the elementwise square, 1_m and 1_n are vectors of ones, and beta_2_hat_t is a time-dependent decay rate.[^1][^9] The update direction is then G_t / sqrt(V_hat_t), the same Adam-style normalization but with V_hat_t derived from the rank-1 factors rather than stored directly.[^1] The Cornell Optimization Wiki notes that this factorization is the minimizer of the generalized Kullback-Leibler divergence (I-divergence) between the true second-moment matrix and its rank-1 approximation under the non-negativity constraint, which is why the row-times-column-over-sum form appears rather than a more naive outer product.[^9]
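
The reconstruction step can be illustrated on a small matrix. This is a sketch in plain Python with a hypothetical helper name, `factored_estimate`; it omits the moving average and applies the row-times-column-over-sum formula to a single non-negative matrix:

```python
def factored_estimate(v):
    """Rank-1 estimate of a non-negative matrix from its row and column sums."""
    row_sums = [sum(row) for row in v]          # R: one entry per row
    col_sums = [sum(col) for col in zip(*v)]    # C: one entry per column
    total = sum(row_sums)                       # same value as sum(col_sums)
    return [[r * c / total for c in col_sums] for r in row_sums]


# A rank-1 matrix (outer product of [1, 2] and [3, 4, 5]) is recovered exactly.
v = [[3.0, 4.0, 5.0], [6.0, 8.0, 10.0]]
v_hat = factored_estimate(v)
print(v_hat == v)  # True
```

For a matrix that is not rank-1 the estimate is only approximate, but its row sums and column sums still match those of the original.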

For tensors with more than two dimensions, Adafactor's implementations typically apply the factorization to the last two axes, treating any leading axes as a batch dimension.[^10] Vectors (rank-1 tensors) and small dimensions are usually left in unfactored form: Optax's reference implementation, for example, exposes a `min_dim_size_to_factor` parameter that defaults to 128, below which Adafactor falls back to a full per-parameter second-moment estimate.[^11]
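
A threshold rule of this kind can be sketched as follows. The function `factored_dims` is hypothetical and simplified: it assumes the two largest axes are the ones factored and that both must reach the threshold, which may differ from what any given library does (an implementation that always uses the last two axes would choose differently):

```python
def factored_dims(shape, min_dim_size_to_factor=128):
    """Return the two axes to factor over, or None to keep a full estimate.

    Simplified assumption: factor over the two largest axes, and only when
    both have at least `min_dim_size_to_factor` entries.
    """
    if len(shape) < 2:
        return None  # vectors and scalars are never factored
    order = sorted(range(len(shape)), key=lambda axis: shape[axis])
    second_largest, largest = order[-2], order[-1]
    if shape[second_largest] < min_dim_size_to_factor:
        return None  # one side is too small for factoring to pay off
    return second_largest, largest


print(factored_dims((4096,)))           # None
print(factored_dims((4096, 16384)))     # (0, 1)
print(factored_dims((4096, 64)))        # None
print(factored_dims((12, 512, 2048)))   # (1, 2)
```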

### Memory savings

For an n by m matrix, the second-moment state shrinks from one n by m matrix (Adam's v) to two vectors of total size n + m. With Adafactor's default of dropping the first moment entirely (beta1 = None), the total state per matrix weight is n + m entries rather than Adam's 2nm.[^1][^2] On a dense feed-forward Transformer layer of shape 4096 by 16384, for example, Adam's v alone is 4096 * 16384 = 67M entries, while Adafactor's R + C is 4096 + 16384 = 20480 entries, a roughly 3300x reduction for that tensor.[^2] When momentum (a first-moment buffer) is enabled in Adafactor, the state grows to nm + n + m entries, roughly half of Adam's, and the first moment can optionally be stored in lower precision to reduce it further.[^11]
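
The entry counts for a single matrix can be reproduced with a few lines. The helper names are hypothetical, and the sketch counts stored values only, ignoring precision:

```python
def adam_state_entries(n: int, m: int) -> int:
    """Adam stores a full first-moment and a full second-moment matrix."""
    return 2 * n * m


def adafactor_state_entries(n: int, m: int, momentum: bool = False) -> int:
    """Adafactor stores n + m factored statistics, plus n * m if momentum is on."""
    return (n + m) + (n * m if momentum else 0)


n, m = 4096, 16384
print(adam_state_entries(n, m))                        # 134217728
print(adafactor_state_entries(n, m))                   # 20480
print(adafactor_state_entries(n, m, momentum=True))    # 67129344
print((n * m) / (n + m))                               # 3276.8
```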

### Update clipping

A side effect of the factored approximation is that the implied per-parameter second-moment estimate can drift from the true one, occasionally producing very large normalized updates. The paper diagnoses this in Section 4 and proposes update clipping: after computing the proposed update U_t = G_t / sqrt(V_hat_t), Adafactor scales it down whenever its root-mean-square exceeds a threshold d, replacing U_t with U_t / max(1, RMS(U_t) / d).[^1] The default threshold is 1.0 in the original paper and in both the HuggingFace and Optax implementations.[^10][^11] Update clipping is distinct from gradient clipping, which acts on the raw gradient, and the HuggingFace docs describe it as essential for stability when training Transformer models with Adafactor.[^10]
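
The clipping rule is small enough to sketch directly. The helper names `rms` and `clip_update` are illustrative, and the update is represented as a flat list:

```python
import math


def rms(values):
    """Root-mean-square of a flat list of numbers."""
    return math.sqrt(sum(x * x for x in values) / len(values))


def clip_update(update, threshold=1.0):
    """Scale the update down only if its RMS exceeds `threshold`."""
    scale = max(1.0, rms(update) / threshold)
    return [u / scale for u in update]


clipped = clip_update([3.0, -4.0, 0.0, 0.0])   # RMS is 2.5, so everything is divided by 2.5
unchanged = clip_update([0.1, -0.2])           # RMS is below 1.0, so nothing changes
```

Because the whole update is divided by one scalar, clipping preserves its direction and only limits its overall magnitude.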

### Slowly increasing decay rate

The paper also identifies a more general failure mode of Adam-style optimizers: when the second-moment decay rate beta_2 is held fixed near 1, the running estimate can lag behind sudden increases in gradient magnitude, producing oversized updates. Adafactor instead uses a time-dependent decay rate beta_2_hat_t that approaches 1 as training progresses, defined in the paper as 1 - t^{-c} for a constant c, with c = 0.8 as the proposed value. Implementations differ in sign convention: HuggingFace stores the exponent itself as `decay_rate=-0.8`, while Optax takes `decay_rate=0.8`.[^1][^10][^11] This is a separate mitigation from update clipping; the paper shows it also helps Adam itself when applied as a drop-in modification.[^1]
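
A short sketch shows how quickly this schedule approaches 1. The function name `beta2_hat` is illustrative, and c = 0.8 is assumed:

```python
def beta2_hat(t: int, c: float = 0.8) -> float:
    """Time-dependent second-moment decay rate 1 - t^(-c), for step t >= 1."""
    return 1.0 - t ** (-c)


for step in (1, 10, 100, 10_000):
    print(step, round(beta2_hat(step), 4))
```

At t = 1 the decay rate is exactly 0, so the first estimate is just the current squared gradient; later steps weight the history progressively more heavily.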

### Relative step size

In place of a fixed external learning rate, Adafactor introduces a *relative step size* alpha_t = max(epsilon_2, RMS(W_{t-1})) * rho_t, where W_{t-1} is the current parameter tensor and rho_t is a base step size that depends only on the iteration count, typically rho_t = min(0.01, 1 / sqrt(t)).[^1][^9] The intuition is that the appropriate magnitude of a parameter update should scale with the magnitude of the parameter itself, so a layer with very small weights gets very small updates while a layer with large weights gets correspondingly larger ones.[^1] In the HuggingFace implementation the iteration-dependent rho_t is enabled by `relative_step` and the RMS(W) factor by `scale_parameter`; with both enabled, Adafactor can be run with no externally supplied learning rate at all, which is the configuration used during T5 pretraining.[^3][^10]
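
The step-size formula can be written out as a small function. This is a sketch with illustrative names (`rms`, `relative_step_size`), using epsilon_2 = 0.001 and treating the parameter tensor as a flat list:

```python
import math


def rms(values):
    """Root-mean-square of a flat list of numbers."""
    return math.sqrt(sum(x * x for x in values) / len(values))


def relative_step_size(params, t, eps2=1e-3):
    """alpha_t = max(eps2, RMS(W)) * min(0.01, 1 / sqrt(t)), for step t >= 1."""
    rho = min(1e-2, 1.0 / math.sqrt(t))
    return max(eps2, rms(params)) * rho


# All-zero parameters fall back to eps2, so the step is tiny but nonzero.
print(relative_step_size([0.0, 0.0], t=1))
# Larger weights get proportionally larger steps; rho has decayed to 0.005 by t = 40000.
print(relative_step_size([2.0, -2.0], t=40_000))
```

The epsilon_2 floor is what lets a tensor initialized at zero start moving at all.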

### Algorithm summary

Combining these pieces, the per-step update for a matrix-shaped weight W with gradient G is approximately:

1. Compute G_t.
2. Update R_t and C_t exponential moving averages of row and column sums of G_t^2.
3. Form V_hat_t = R_t * C_t / sum(R_t).
4. Compute proposed update U_t = G_t / sqrt(V_hat_t).
5. Clip: U_hat_t = U_t / max(1, RMS(U_t) / d).
6. Scale by step size alpha_t (relative or external).
7. Update parameters: W_t = W_{t-1} - alpha_t * U_hat_t (with optional weight decay and optional first-moment momentum).[^1][^9]

For tensors that are vectors or have a dimension smaller than the factoring threshold, the same algorithm is run with V_t maintained directly rather than factored.[^11]
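
The seven steps above can be sketched end to end in plain Python for a single matrix. This is an illustrative reading of the summary rather than a reference implementation: it assumes the relative step size, no momentum, no weight decay, and a hypothetical function name `adafactor_step`:

```python
import math


def adafactor_step(w, g, r, c, t, d=1.0, eps1=1e-30, eps2=1e-3, decay=0.8):
    """One factored Adafactor step on an n-by-m matrix stored as lists of lists.

    r holds the n row statistics, c the m column statistics, and t >= 1 is the
    step count. Returns the updated (w, r, c).
    """
    n, m = len(w), len(w[0])
    beta2 = 1.0 - t ** (-decay)
    sq = [[gij * gij + eps1 for gij in row] for row in g]

    # Step 2: moving averages of the row sums and column sums of G^2.
    r = [beta2 * ri + (1 - beta2) * sum(row) for ri, row in zip(r, sq)]
    c = [beta2 * cj + (1 - beta2) * sum(col) for cj, col in zip(c, zip(*sq))]

    # Steps 3-4: V_hat = R C / sum(R), then U = G / sqrt(V_hat).
    total = sum(r)
    u = [[g[i][j] / math.sqrt(r[i] * c[j] / total) for j in range(m)] for i in range(n)]

    # Step 5: clip the update by its root-mean-square.
    rms_u = math.sqrt(sum(x * x for row in u for x in row) / (n * m))
    scale = max(1.0, rms_u / d)

    # Step 6: relative step size from the parameter RMS and the step count.
    rms_w = math.sqrt(sum(x * x for row in w for x in row) / (n * m))
    alpha = max(eps2, rms_w) * min(1e-2, 1.0 / math.sqrt(t))

    # Step 7: apply the clipped, scaled update.
    w = [[w[i][j] - alpha * u[i][j] / scale for j in range(m)] for i in range(n)]
    return w, r, c


w0 = [[1.0, -1.0], [0.5, 2.0]]
g0 = [[0.1, -0.2], [0.3, 0.4]]
w1, r1, c1 = adafactor_step(w0, g0, r=[0.0, 0.0], c=[0.0, 0.0], t=1)
```

Only `r` and `c` persist between steps; the full matrix `u` is temporary and can be discarded after the update.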

## Comparison with AdamW

The most natural baseline for Adafactor is [AdamW](/wiki/adamw), the variant of Adam with decoupled weight decay that has become the default optimizer in modern LLM training.[^6] Both methods produce per-parameter adaptive step sizes via second-moment normalization, but they differ in important ways.

| Property | AdamW | Adafactor |
|---|---|---|
| First moment storage | One full tensor per parameter | Optional; default off in the original paper |
| Second moment storage | One full tensor per parameter | Two vectors per matrix (rank-1 factor) |
| State size for n by m matrix | 2nm | n + m (default), or nm + n + m with momentum |
| Learning rate | External, typically with warmup + decay | Internal "relative step", or external if scale_parameter=False |
| Update clipping | Not built in | Built in, default RMS threshold 1.0 |
| Decay rate schedule | Fixed beta_2 (commonly 0.999) | Time-varying, approaches 1 over training |
| Convergence on small models | Reliable | Often slightly worse, can be unstable without care |
| Convergence on very large Transformers | Stable but expensive in memory | Stable, much cheaper in memory |
| Typical use today | Most LLM and PEFT training | T5, PaLM, large MoE, GaLore-Adafactor, memory-constrained training |

Sources: [^1][^6][^10][^11][^12].

The trade-off is roughly that Adafactor sacrifices some convergence consistency, particularly on small models or short training runs, in exchange for substantial memory savings on the optimizer state. On the Transformer-Big WMT 2014 English-German task used in the original paper, Adafactor's published numbers match Adam's BLEU score while using only the per-row and per-column statistics described above.[^1] On smaller fine-tuning runs, practitioners frequently report that AdamW converges faster, which is one reason the HuggingFace documentation warns that training without learning-rate warmup or update clipping with Adafactor "is not recommended".[^10]

## Hyperparameters and common configurations

The HuggingFace `transformers.Adafactor` implementation, which is adapted from the fairseq PyTorch port of the original code, exposes the following hyperparameters:[^10]

| Parameter | Default | Description |
|---|---|---|
| `lr` | None | External learning rate; ignored when `relative_step=True` |
| `eps` | (1e-30, 0.001) | Tuple (epsilon_1, epsilon_2); regularization constants for squared-gradient and parameter-scale denominators |
| `clip_threshold` | 1.0 | RMS threshold for update clipping |
| `decay_rate` | -0.8 | Exponent controlling the time-varying second-moment decay |
| `beta1` | None | If set, enables a first-moment momentum buffer |
| `weight_decay` | 0.0 | L2 weight decay applied to updates |
| `scale_parameter` | True | If True, scales the learning rate by the RMS of the parameter |
| `relative_step` | True | If True, uses time-dependent relative step size and ignores `lr` |
| `warmup_init` | False | If True, modifies the relative step schedule to include warmup |

The Optax (JAX) implementation has a similar surface, with parameters `learning_rate`, `min_dim_size_to_factor=128`, `decay_rate=0.8`, `decay_offset=0`, `multiply_by_parameter_scale=True`, `clipping_threshold=1.0`, `momentum=None`, `dtype_momentum=float32`, `weight_decay_rate=None`, `eps=1e-30`, and `factored=True`.[^11] The Keras Adafactor adds an `epsilon_2=0.001` parameter that plays the role of the second epsilon in the HuggingFace tuple, with otherwise similar defaults.[^13]

### T5 fine-tuning settings

The HuggingFace documentation explicitly calls out recommended settings for T5 fine-tuning, derived from community experience on the HuggingFace forums:[^10]

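```python
from transformers.optimization import Adafactor

# `model` is assumed to be an already-loaded T5 model (a torch.nn.Module).
optimizer = Adafactor(
    model.parameters(),
    scale_parameter=False,  # do not rescale the step by the parameter RMS
    relative_step=False,    # use the external learning rate given below
    warmup_init=False,
    lr=1e-3,
)
```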

The same docs note that fine-tuning T5 without LR warmup or update clipping is not recommended, that `clip_threshold=1.0` should be used, and that other gradient-clipping operations should not be combined with Adafactor.[^10] An alternative configuration, used when no external scheduler is available, is `scale_parameter=True, relative_step=True, warmup_init=True, lr=None`, paired with HuggingFace's `AdafactorSchedule` helper which exposes the optimizer's internal learning rate to the Trainer's scheduling hooks.[^10]
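
The alternative relative-step configuration might be wired up as in the sketch below. The constant `RELATIVE_STEP_KWARGS` and the function `build_relative_step_adafactor` are hypothetical names; `Adafactor` and `AdafactorSchedule` are the classes named above, imported lazily here on the assumption that `transformers` and PyTorch are installed wherever the function is actually called:

```python
# Keyword arguments for the relative-step configuration described above.
RELATIVE_STEP_KWARGS = dict(
    scale_parameter=True,
    relative_step=True,
    warmup_init=True,
    lr=None,  # no external learning rate in relative-step mode
)


def build_relative_step_adafactor(params):
    """Return an (optimizer, scheduler) pair for the given model parameters."""
    # Lazy import: only needed when the optimizer is actually constructed.
    from transformers.optimization import Adafactor, AdafactorSchedule

    optimizer = Adafactor(params, **RELATIVE_STEP_KWARGS)
    # The schedule object reports the optimizer's internal step size.
    scheduler = AdafactorSchedule(optimizer)
    return optimizer, scheduler
```

With a `Trainer`, the resulting pair would typically be passed as `optimizers=(optimizer, scheduler)`.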

### Pretraining settings

For pretraining from scratch, the original paper uses the relative step size with no external learning rate.[^1] The T5 paper inherits this: T5-Base, T5-Large, T5-3B, and T5-11B were all pretrained with Adafactor using the relative step schedule, for one million steps at a batch size of 2^11 sequences of length 512.[^3] PaLM uses a custom variant of Adafactor "without factorization", which Chowdhery et al. describe as "effectively equivalent to Adam with parameter scaling": the optimizer scales each parameter's learning rate by the root-mean-square of that parameter, keeping Adafactor's parameter-scaling machinery but storing the full second-moment buffer.[^4] This unusual configuration suggests that, at PaLM's scale, the team valued the parameter-scaling behavior more than the memory savings, since they could afford the optimizer state with a model-parallel layout on TPU v4 pods.[^4]

## Adoption in large language models

### T5 and the text-to-text family

Adafactor's largest single application is the T5 family, introduced in Raffel et al.'s 2019 paper *Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer*.[^3] The paper describes the optimization setup only briefly: training uses the AdaFactor optimizer with an "inverse square root" learning rate schedule or, in the relative-step configuration, with no external schedule at all.[^3] T5-11B's 11-billion parameter encoder-decoder was trained on the C4 corpus for approximately one trillion tokens using Adafactor on TPU v3 pods, and the same optimizer choice carried over to subsequent releases such as T5 v1.1, the multilingual variant [mT5](/wiki/mt5) introduced by Xue et al.,[^14] and instruction-tuned descendants such as Flan-T5.

The success of T5 cemented Adafactor's reputation as the optimizer of choice for encoder-decoder Transformers at scale, and HuggingFace's documentation specifically calls out the optimizer as the recommended tool for fine-tuning T5 derivatives.[^10] One reason it appears so often in T5-related codebases is that switching to AdamW for T5 fine-tuning often produces worse loss curves and divergence under the same hyperparameter budget, a quirk that is widely reported in community write-ups but only partially explained in the literature; the dominant hypothesis is that the original pretraining used a relative step schedule, so post-training optimization with a fixed external learning rate of the wrong magnitude leads to drift.[^10]

### PaLM

[PaLM](/wiki/palm), Google's 540-billion-parameter dense Transformer described by Chowdhery et al. in 2022, used the unfactored variant of Adafactor described above.[^4] The paper reports a learning rate of 10^-2 held constant for the first 10,000 steps and then decayed at an inverse-square-root rate, with the parameter-scale rescaling from Adafactor applied throughout, and reports 57.8% hardware FLOPs utilization on TPU v4 pods at the largest scale.[^4] Later Google technical reports, including those for PaLM 2 and the Gemini family, no longer fully disclose optimizer choices, so the extent to which Adafactor-derived configurations remain in use there is not publicly documented.

### Switch Transformer and other MoE

The [Switch Transformer](/wiki/switch_transformer), Fedus et al.'s 2021 [Mixture of Experts](/wiki/mixture_of_experts) (MoE) architecture that scaled to over a trillion parameters, used Adafactor for optimization but encountered new instabilities specific to the sparse-routing regime.[^15] The paper addressed these with measures on the model side: selective float32 precision inside the router, a smaller parameter-initialization scale, and an auxiliary load-balancing loss.[^15] The auxiliary "router z-loss" often associated with this line of work was introduced in the later ST-MoE follow-up (Zoph et al. 2022) rather than in the Switch Transformer paper itself. In both cases the instability was treated as a property of sparse routing, and it was addressed through these model-side changes rather than by changing optimizers.[^15]

### Other adoption

Beyond Google, Adafactor sees use in a long tail of memory-constrained training settings. The HuggingFace Transformers integration registers Adafactor under `optim="adafactor"` in `TrainingArguments`, making it a one-line switch from AdamW in any fine-tuning workflow built on the library's Trainer.[^16] GaLore (Gradient Low-Rank Projection), a 2024 memory-efficient optimizer family, registers a `galore_adafactor` variant that composes GaLore's low-rank gradient projection with Adafactor's factored second moments for cumulative memory savings.[^16][^17] The Cornell Optimization Wiki lists additional applications including ResNet50 on ImageNet and several multilingual classification tasks.[^9] When fine-tuning T5-v1.1 or mT5 with [LoRA](/wiki/lora) or [QLoRA](/wiki/qlora), practitioners often pair the parameter-efficient adapter with Adafactor for the trainable subset to keep the full optimizer state below the available GPU memory.

## Significance and place in the optimizer landscape

Adafactor's lasting significance is twofold. First, it demonstrated that the per-parameter second-moment buffer that had become standard in adaptive optimizers since AdaGrad and RMSProp could be drastically compressed without giving up Transformer-scale performance.[^1] This insight opened a family of follow-up optimizers, including CAME (Confidence-guided Adaptive Memory Efficient Optimization), which keeps Adafactor's factorization and addresses the resulting instability via a confidence-weighted update,[^12] and 8-bit and lower-precision optimizer variants that store the moment buffers in reduced precision rather than reducing their count.[^17] StableAdamW, registered in HuggingFace Transformers as `stable_adamw`, ports Adafactor's update-clipping mechanism back into AdamW so that gradient clipping is no longer necessary, demonstrating that some of Adafactor's contributions can be valuable even outside the factored regime.[^16]

Second, Adafactor served as the workhorse optimizer for several of the most influential LLM training programs of the 2018 to 2022 period, including T5, mT5, and PaLM.[^3][^4][^14] This longevity contrasts with later optimizers such as [Lion](/wiki/lion_optimizer) (Chen et al. 2023) and [Sophia](/wiki/sophia_optimizer) (Liu et al. 2023), neither of which has yet reached the breadth of large-model deployment Adafactor achieved. Architectural changes in the dominant LLM training stacks of 2024 and 2025, particularly the rise of fully decoder-only Transformers trained on enormous token budgets, have shifted defaults back toward AdamW (often in 8-bit form), but Adafactor remains the recommended choice when memory is the binding constraint or when continuing a pretraining run that started with it.

The optimizer also influenced engineering choices outside its own algorithmic family. The [Muon](/wiki/muon_optimizer) optimizer and the [Schedule-Free](/wiki/schedule_free) optimizer cite Adafactor's mixed history (efficient but sometimes slow to converge) as part of their motivation for re-examining what the right baseline should be, and the partitioning of optimizer state across data-parallel workers in [DeepSpeed](/wiki/deepspeed) ZeRO and similar systems is partly a response to the same problem Adafactor first targeted: how to keep optimizer memory from being the limiting factor at scale.

## Limitations and criticisms

Adafactor has several well-documented weaknesses, some inherent to the factored second moment and some related to the auxiliary heuristics:

- **Slower or noisier convergence.** The CAME paper notes that Adafactor "suffers a performance degradation in the training of large language models compared with conventional adaptive gradient-based optimization methods", attributing this to errors introduced by the non-negative matrix factorization step.[^12] Community reports on smaller fine-tuning tasks describe the same pattern at small scale, finding that AdamW reaches a lower validation loss faster under matched compute.[^10][^18]
- **Sensitivity to hyperparameters and to whether the relative step is active.** A persistent source of confusion is whether to use `relative_step=True` with `lr=None`, or `relative_step=False` with an external learning rate. The HuggingFace documentation lists both as plausible, and an open issue on the HuggingFace Transformers tracker explicitly flagged that the documentation of Adafactor "is at odds with Google implementations" because some Google implementations set defaults that differ from the published paper.[^18]
- **No first moment by default.** While dropping the first-moment buffer is a major memory saving, it removes the implicit smoothing that Adam-style optimizers gain from a momentum term. The original paper's experiments include the no-momentum configuration and report acceptable BLEU, but later large-scale training runs frequently re-enable momentum (`beta1 != None` in HuggingFace, or `momentum` in Optax), trading some of the memory savings for stability.[^4][^11]
- **Update-clipping interactions.** Combining Adafactor with separate gradient clipping is explicitly discouraged in the HuggingFace docs, since the optimizer already clips its own proposed updates and adding a second clipping step changes the effective step distribution in ways that are hard to reason about.[^10]
- **Factorization assumes matrix structure.** Adafactor's savings only apply when the parameter has at least two factorizable dimensions of nontrivial size. Vectors (e.g., biases, [gradient](/wiki/gradient)-scale parameters, and 1D LayerNorm scales) and small embeddings are stored unfactored, so the per-parameter savings depend heavily on the model architecture.[^11]
- **Less benefit when optimizer state is already sharded.** With [DeepSpeed](/wiki/deepspeed) ZeRO-2/3 or FSDP, Adam's optimizer state is sharded across data-parallel workers, so the per-worker memory cost falls roughly linearly with the worker count. Adafactor's absolute savings remain, but their relative importance is diminished compared to single-host training.

## Implementations and software

Reference and widely used implementations include:

| Framework | Entry point | Notes |
|---|---|---|
| Original (TensorFlow / Mesh-TensorFlow) | `tensor2tensor.utils.adafactor.AdafactorOptimizer` and `mesh_tensorflow.optimize.AdafactorOptimizer` | Used in original T5 pretraining[^3] |
| fairseq (PyTorch) | `fairseq.optim.adafactor.Adafactor` | First PyTorch port; basis for many later ports[^10] |
| HuggingFace Transformers (PyTorch) | `transformers.Adafactor`, `transformers.optimization.AdafactorSchedule`; also `TrainingArguments(optim="adafactor")` | Most common modern usage; handles low-precision values[^10][^16] |
| Optax (JAX) | `optax.adafactor` | Pure-JAX implementation; used in T5X, JAX-based PaLM reproductions[^11] |
| Keras | `keras.optimizers.Adafactor` | Standalone Keras implementation with similar defaults[^13] |

The HuggingFace `Adafactor` class explicitly documents support for FP16 and bfloat16 values, though the docs note this has not been extensively tested.[^10] Optax's implementation exposes a `dtype_momentum` parameter that allows the first-moment buffer (when enabled) to be stored in lower precision, an extension beyond the original paper's design.[^11]

## Related work

### Memory-efficient adaptive optimizers

Adafactor sits in a small family of optimizers that explicitly trade some statistical fidelity for reduced state size:

- **CAME** (Luo et al., ACL 2023) keeps Adafactor's factorization but adds a confidence-weighted update, modulating step size by the agreement between the running moving average and the current update.[^12]
- **GaLore** (Zhao et al. 2024) projects gradients into a low-rank subspace before passing them to an underlying optimizer; its `galore_adafactor` variant composes the low-rank projection with Adafactor's factored second moments for additional savings.[^16][^17]
- **8-bit optimizers** (Dettmers et al. 2022) keep the per-parameter buffer but quantize it to 8 bits, an orthogonal approach that complements rather than competes with Adafactor.

### Direct competitors

[AdamW](/wiki/adamw) remains the dominant baseline against which Adafactor is judged.[^6] Other optimizers such as [Lion](/wiki/lion_optimizer) (Chen et al. 2023, EvoLved Sign Momentum), [Sophia](/wiki/sophia_optimizer) (Liu et al. 2023, second-order via Hutchinson estimator), [Muon](/wiki/muon_optimizer) (Jordan 2024, Newton-Schulz orthogonalization), [Shampoo](/wiki/shampoo_optimizer) (Gupta et al. 2018, full second-order), and [Schedule-Free](/wiki/schedule_free) (Defazio et al. 2024) all occupy slightly different points in the trade-off space among memory, second-order information, learning-rate schedule, and convergence behavior, and most have been benchmarked against Adafactor as part of their evaluations.

### Theoretical analyses

The Cornell Optimization Wiki notes that the row-times-column-over-sum factorization Adafactor uses is exactly the minimizer of the generalized Kullback-Leibler divergence (I-divergence) between the true second-moment matrix and its rank-1 non-negative approximation, connecting Adafactor to the broader literature on non-negative matrix factorization.[^9] The relative step size and update clipping are more ad hoc and have inspired several follow-up papers that attempt to justify them in terms of trust-region behavior or per-layer learning-rate adaptation.[^12]

## See also

- [AdamW](/wiki/adamw)
- [Adam optimizer](/wiki/adam_optimizer)
- [Stochastic Gradient Descent (SGD)](/wiki/stochastic_gradient_descent_sgd)
- [RMSProp](/wiki/rmsprop)
- [Lion (optimizer)](/wiki/lion_optimizer)
- [Sophia (optimizer)](/wiki/sophia_optimizer)
- [Muon (optimizer)](/wiki/muon_optimizer)
- [Shampoo (optimizer)](/wiki/shampoo_optimizer)
- [Schedule-Free optimizer](/wiki/schedule_free)
- [GaLore](/wiki/galore)
- [T5 (language model)](/wiki/t5)
- [mT5](/wiki/mt5)
- [PaLM](/wiki/palm)
- [Switch Transformer](/wiki/switch_transformer)
- [Mixture of Experts (MoE)](/wiki/mixture_of_experts)
- [Noam Shazeer](/wiki/noam_shazeer)
- [Transformer](/wiki/transformer)
- [Attention Is All You Need](/wiki/attention_is_all_you_need_transformer)
- [Gradient clipping](/wiki/gradient_clipping)
- [Learning Rate](/wiki/learning_rate)
- [Weight Decay](/wiki/weight_decay)
- [ICML](/wiki/icml)
- [Google Brain](/wiki/google_brain)
- [Cloud TPU](/wiki/cloud_tpu)
- [JAX](/wiki/jax)
- [PyTorch](/wiki/pytorch)
- [TensorFlow](/wiki/tensorflow)
- [DeepSpeed](/wiki/deepspeed)
- [bfloat16](/wiki/bfloat16)
- [Pretraining](/wiki/pretraining)
- [Transfer Learning](/wiki/transfer_learning)

## References

[^1]: Noam Shazeer and Mitchell Stern, "Adafactor: Adaptive Learning Rates with Sublinear Memory Cost", arXiv, 2018-04-11. https://arxiv.org/abs/1804.04235. Accessed 2026-05-20.

[^2]: Sulbha Jain, "Optimizers in LLM Fine-Tuning: AdamW vs Adafactor", Medium, 2024-08-08. https://sulbhajain.medium.com/optimizers-in-llm-adamw-vs-adafactor-54fc3cb37671. Accessed 2026-05-20.

[^3]: Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu, "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer", arXiv, 2019-10-23. https://arxiv.org/abs/1910.10683. Accessed 2026-05-20.

[^4]: Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, et al., "PaLM: Scaling Language Modeling with Pathways", arXiv, 2022-04-05. https://arxiv.org/abs/2204.02311. Accessed 2026-05-20.

[^5]: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin, "Attention Is All You Need", arXiv, 2017-06-12. https://arxiv.org/abs/1706.03762. Accessed 2026-05-20.

[^6]: Ilya Loshchilov and Frank Hutter, "Decoupled Weight Decay Regularization", arXiv, 2017-11-14. https://arxiv.org/abs/1711.05101. Accessed 2026-05-20.

[^7]: Noam Shazeer, "Noam Shazeer Personal Site", noamshazeer.com, 2024. https://www.noamshazeer.com/. Accessed 2026-05-20.

[^8]: ICML, "Adafactor: Adaptive Learning Rates with Sublinear Memory Cost (Poster)", International Conference on Machine Learning 2018, 2018-07-12. https://icml.cc/virtual/2018/poster/2446. Accessed 2026-05-20.

[^9]: Cornell University Computational Optimization Open Textbook, "Adafactor", optimization.cbe.cornell.edu, 2023-12-15. https://optimization.cbe.cornell.edu/index.php?title=Adafactor. Accessed 2026-05-20.

[^10]: HuggingFace, "Optimization: Adafactor", HuggingFace Transformers documentation, 2026-04-15. https://huggingface.co/docs/transformers/en/main_classes/optimizer_schedules. Accessed 2026-05-20.

[^11]: Google DeepMind, "Optax Optimizers: adafactor", Optax documentation, 2026-03-10. https://optax.readthedocs.io/en/latest/api/optimizers.html. Accessed 2026-05-20.

[^12]: Yang Luo, Xiaozhe Ren, Zangwei Zheng, Zhuo Jiang, Xin Jiang, and Yang You, "CAME: Confidence-guided Adaptive Memory Efficient Optimization", arXiv, 2023-07-05. https://arxiv.org/abs/2307.02047. Accessed 2026-05-20.

[^13]: Keras, "Adafactor optimizer", keras.io API documentation, 2026-02-20. https://keras.io/api/optimizers/adafactor/. Accessed 2026-05-20.

[^14]: Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel, "mT5: A massively multilingual pre-trained text-to-text transformer", arXiv, 2020-10-22. https://arxiv.org/abs/2010.11934. Accessed 2026-05-20.

[^15]: William Fedus, Barret Zoph, and Noam Shazeer, "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity", arXiv, 2021-01-11. https://arxiv.org/abs/2101.03961. Accessed 2026-05-20.

[^16]: HuggingFace, "Optimizers and schedulers", HuggingFace Transformers documentation, 2026-04-15. https://huggingface.co/docs/transformers/en/optimizers. Accessed 2026-05-20.

[^17]: Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, and Yuandong Tian, "GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection", arXiv, 2024-03-06. https://arxiv.org/abs/2403.03507. Accessed 2026-05-20.

[^18]: HuggingFace, "Documentation of Adafactor is at odds with Google implementations (Issue #19387)", GitHub, 2022-10-05. https://github.com/huggingface/transformers/issues/19387. Accessed 2026-05-20.

