GaLore (Gradient Low-Rank Projection)
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,277 words
GaLore (Gradient Low-Rank Projection) is a memory-efficient training strategy for large neural networks that projects each weight matrix's gradient into a low-rank subspace before the Adam-style optimizer state is computed, then projects the resulting update back to the original parameter space. Unlike adapter methods such as LoRA, GaLore trains every weight of the model (full-parameter learning); the low-rank approximation lives in the optimizer's view of the gradient rather than in the weights themselves. The technique was introduced by Jiawei Zhao and collaborators at Caltech, Meta AI, UT Austin, and Carnegie Mellon University in March 2024 and accepted as an Oral presentation at the 41st International Conference on Machine Learning (ICML) the same year.[1][2] Combined with 8-bit optimizer state quantization and per-layer weight updates, GaLore enables pretraining a 7-billion-parameter LLaMA-style model from scratch on a single 24 GB consumer GPU without model parallelism, sharding, or CPU offloading.[1][3]
Training large language models (LLMs) is memory-bound because weights, gradients, optimizer states, and activations all reside in GPU memory simultaneously. For AdamW, the dominant component is the optimizer state: every trainable parameter requires two extra floating-point values (the first-moment estimate m and the second-moment estimate v). A model with n parameters stored in BF16 therefore consumes roughly 2n bytes for weights plus 8n bytes for FP32 AdamW state (two 4-byte values per parameter), before accounting for gradients and activations.[1] A 7-billion-parameter LLaMA model in BF16 needs about 14 GB for weights and roughly 56 GB for FP32 Adam state, a footprint that already exceeds the 24 GB available on an NVIDIA RTX 4090.
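The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch: the function name `training_memory_gb` and its parameters are illustrative, and it assumes decimal gigabytes (1 GB = 10^9 bytes), BF16 weights, and two FP32 state values per parameter, as stated in the text.

```python
def training_memory_gb(n_params, weight_bytes=2, state_bytes=4, n_states=2):
    """Return (weights_gb, optimizer_state_gb) in decimal gigabytes.

    Hypothetical helper: BF16 weights (2 bytes each) and two FP32
    Adam moments (4 bytes each) per parameter, as assumed in the text.
    Gradients and activations are deliberately left out.
    """
    weights = n_params * weight_bytes
    states = n_params * state_bytes * n_states
    return weights / 1e9, states / 1e9


weights_gb, adam_gb = training_memory_gb(7_000_000_000)
print(f"weights: {weights_gb:.0f} GB, Adam state: {adam_gb:.0f} GB")
```

Even before gradients and activations are counted, the two terms together are well beyond a 24 GB card, which is why the optimizer state is the natural target for compression.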
Two prior families of techniques addressed this. The first, parameter-efficient fine-tuning (PEFT), is exemplified by low-rank adaptation (LoRA) and its quantized variant QLoRA: only a small set of newly inserted low-rank adapter matrices is trained, so optimizer state grows with the adapter rank rather than with the base model size.[1] The drawback is that LoRA is not full-parameter learning: the base weights stay frozen, which has been shown to underperform full-parameter training in pretraining settings and to limit the expressivity of fine-tuned models.[1] The second family, exemplified by ReLoRA, periodically merges low-rank updates into the base weights so that pretraining can proceed in a low-rank regime, but it requires a full-rank warmup phase to match dense baselines.[1]
GaLore takes a different tack: it leaves the parameterization unchanged and instead exploits a structural property of the gradient matrix itself. The authors prove that during training of typical reversible network blocks the gradient with respect to a weight matrix becomes low-rank, and that the subspace spanned by the dominant singular vectors evolves slowly.[1] Consequently, an SVD-derived projection matrix P can be reused for many optimizer steps before it needs to be refreshed, and the optimizer state can live in the low-rank subspace rather than in full parameter space. For a rank-r approximation this shrinks optimizer-state memory from O(mn) per layer to O(mr + nr), while the weights themselves remain O(mn) and every entry of W is still updated.[1]
The paper "GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection" by Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang ("Atlas") Wang, Anima Anandkumar, and Yuandong Tian was posted to arXiv as preprint 2403.03507 on 6 March 2024.[2] The v2 revision followed on 2 June 2024.[2] At the time of submission the authors were affiliated with the California Institute of Technology, the University of Texas at Austin, Carnegie Mellon University, and Meta AI / FAIR.[1][2]
The reference implementation galore-torch was released alongside the paper to the public GitHub repository jiaweizzhao/GaLore under the Apache-2.0 license.[3] Hugging Face engineer Younes Belkada opened pull request 29588 to the Transformers repository on 14 March 2024 to integrate the optimizer; the PR was merged on 19 March 2024 and the optimizer shipped publicly in the v4.39.0 release of the library on 20 March 2024.[4][5] On the same day, Hugging Face published a joint blog post by Titus von Koeller, Jiawei Zhao, Matthew Douglas, Yaowei Zheng, Younes Belkada, Zachary Mueller, Amy Roberts, Sourab Mangrulkar, and Benjamin Bossan titled "GaLore: Advancing Large Model Training on Consumer-grade Hardware," which documented usage from the Trainer API.[5]
The work was accepted to ICML 2024 with an Oral presentation slot in the "Low Rank Learning" session.[6] A first quantized follow-up, Q-GaLore by Zhenyu Zhang, Ajay Jaiswal, Lu Yin, Shiwei Liu, Jiawei Zhao, Yuandong Tian, and Zhangyang Wang, appeared as arXiv:2407.08296 on 11 July 2024 and pushed pretraining of a 7B model down to a single 16 GB RTX 4060 Ti.[7] A more recent successor, "GaLore 2: Large-Scale LLM Pre-Training by Gradient Low-Rank Projection" by DiJia Su, Andrew Gu, Jane Xu, Yuandong Tian, and Jiawei Zhao, appeared as arXiv:2504.20437 in April 2025, focusing on randomized SVD, compatibility with Fully Sharded Data Parallel, and a demonstrated pretraining run of LLaMA 7B with up to 500 billion tokens.[8]
For a single linear layer with weight matrix W in R^(m x n) (m <= n by convention), the gradient G_t in R^(m x n) at step t has the same shape as W. Full-rank AdamW maintains two state tensors M_t and V_t, both in R^(m x n), holding exponential moving averages of G_t and of its elementwise square (the first and second moments). The total memory for one parameter tensor under AdamW is therefore proportional to 3mn (weight plus two state tensors), ignoring gradients and master copies.[1]
GaLore introduces a single tall-skinny projection matrix P_t in R^(m x r) with r << m. At each step the gradient is first projected into the low-rank subspace:
R_t = P_t^T G_t # R_t in R^(r x n)
The optimizer state lives at the projected scale. M_t and V_t are stored as R^(r x n) tensors and are updated using the standard Adam recurrences applied to R_t. After Adam produces a normalized update N_t in R^(r x n), GaLore projects it back to the original shape:
G_tilde_t = alpha * P_t N_t
W_{t+1} = W_t - eta * G_tilde_t
where alpha is a fixed scaling factor (typically 0.25) that controls update strength independently of the rank, and eta is the learning rate.[9] In the paper's standard "left" projection setting only P is used; a "right" projection variant uses Q in R^(n x r) instead, switching which dimension is compressed based on which is larger.[1]
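The shapes involved in the projection and back-projection can be checked on a toy example. The following is a sketch in plain Python (lists of rows, no tensor library); the helper names `project` and `project_back` and the toy sizes m = 3, n = 4, r = 2 are assumptions for illustration, and the projected gradient is reused as a stand-in for Adam's normalized update N_t.

```python
def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def project(p, g):
    """R = P^T G: compress an (m x n) gradient to (r x n)."""
    return matmul(transpose(p), g)


def project_back(p, n_update, alpha):
    """G_tilde = alpha * P N: expand an (r x n) update back to (m x n)."""
    return [[alpha * x for x in row] for row in matmul(p, n_update)]


# Toy sizes (assumed for illustration): m = 3, n = 4, r = 2.
P = [[1, 0],
     [0, 1],
     [0, 0]]                 # orthonormal columns spanning the first two coordinates
G = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12]]

R = project(P, G)                    # shape (2, 4): this is what Adam would see
G_tilde = project_back(P, R, 0.25)   # shape (3, 4); R stands in for Adam's N_t
```

Only the r x n matrix R needs optimizer state; the third row of the back-projected update is zero because that direction lies outside the span of P.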
The projection matrix is built from the singular value decomposition of the most recent full gradient. Concretely, every T steps (the update_proj_gap hyperparameter), GaLore computes a truncated SVD G_t approx U Sigma V^T and sets P_t = U[:, :r], the columns corresponding to the r largest singular values.[1][9] Between refreshes the projection matrix is held fixed, so the cost of SVD is amortized across T steps. The authors prove that under the assumption of stable rank gradients the subspace spanned by U[:, :r] drifts slowly enough that this lazy update incurs only a bounded loss compared to recomputing P_t every step.[1]
In code, the truncated SVD is implemented with torch.linalg.svd on the gradient, with the projection matrix kept in the dtype of the gradient (usually BF16) and converted lazily during the matrix multiplication.[3]
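The lazy refresh schedule can be sketched without any tensor library. Note the substitution: instead of `torch.linalg.svd`, this illustration approximates the top-r left singular vectors by orthogonal (subspace) iteration on G G^T, purely so that it runs on the Python standard library. The function names, the seeded random start, and the iteration count are all assumptions, not part of galore-torch.

```python
import math
import random


def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def orthonormalize_columns(a):
    """Gram-Schmidt on the columns of an (m x r) matrix."""
    basis = []
    for v in transpose(a):
        for q in basis:
            dot = sum(x * y for x, y in zip(v, q))
            v = [x - dot * y for x, y in zip(v, q)]
        norm = math.sqrt(sum(x * x for x in v))
        basis.append([x / norm for x in v])
    return transpose(basis)


def top_left_subspace(g, r, iters=50, seed=0):
    """Approximate an orthonormal basis for the top-r left singular
    subspace of g (stand-in for U[:, :r] from a truncated SVD)."""
    rng = random.Random(seed)
    ggt = matmul(g, transpose(g))                      # (m x m)
    p = [[rng.gauss(0.0, 1.0) for _ in range(r)] for _ in range(len(g))]
    for _ in range(iters):
        p = orthonormalize_columns(matmul(ggt, p))
    return p


def refresh_projection(step, gap, g, r, p_prev):
    """Recompute P every `gap` steps; otherwise reuse the previous one."""
    if p_prev is None or step % gap == 0:
        return top_left_subspace(g, r)
    return p_prev


# Toy gradient with singular values 3, 2, 1 along coordinates 3, 2, 1.
G = [[0.0, 0.0, 1.0, 0.0],
     [0.0, 2.0, 0.0, 0.0],
     [3.0, 0.0, 0.0, 0.0]]
P = refresh_projection(step=0, gap=200, g=G, r=2, p_prev=None)
projector = matmul(P, transpose(P))   # approximately diag(0, 1, 1)
```

Between refreshes the same P object is returned unchanged, which is what amortizes the decomposition cost over `gap` optimizer steps.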
The paper provides a formal argument for why the projection-based approach should converge. The central lemma states that for a class of reversible networks (which the authors show includes typical transformer blocks under mild conditions on the activations), the time evolution of the weight matrix under gradient descent leaves the rank of the gradient bounded by a slowly growing function of the training step.[1] Concretely, Lemma 3.1 of the paper shows that if the loss is smooth and the network is reversible, the gradient G_t can be approximated arbitrarily well by its rank-r truncated SVD with error that depends only on r and on the spectral decay rate of the gradient matrix.[1] Empirically the authors plot the singular value distribution of gradients across training steps for LLaMA blocks and observe a heavy concentration in the top few hundred singular directions, validating that the assumption holds in practice.[1]
A second theoretical claim concerns the validity of holding P_t fixed for many steps. The authors prove that if the subspace drift rate (the angular velocity of the dominant r-dimensional eigenspace) is bounded, then the cumulative error introduced by reusing P_t for T consecutive steps is proportional to T times that drift rate.[1] In practice this means that update_proj_gap can be set to several hundred steps with negligible effect on convergence, which is what makes the amortized SVD cost manageable.[1]
The original paper presents a per-layer accounting that makes the savings explicit. For a weight matrix W in R^(m x n) (m <= n) and rank r, the memory required per layer is approximately:[1]
| Component | AdamW | LoRA (rank r) | GaLore (rank r) |
|---|---|---|---|
| Weights | mn | mn + mr + nr | mn |
| Optimizer states | 2mn | 2mr + 2nr | mr + 2nr |
| Total | 3mn | mn + 3mr + 3nr | mn + mr + 2nr |
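The table's formulas can be evaluated for a concrete layer. The sketch below counts tensor elements (not bytes) for a square 4096 x 4096 layer at rank 1024; the function name `per_layer_elements` is hypothetical and the layer shape is chosen only for illustration.

```python
def per_layer_elements(m, n, r):
    """Element counts per layer, following the table's formulas."""
    return {
        "AdamW":  {"weights": m * n,                 "optimizer": 2 * m * n},
        "LoRA":   {"weights": m * n + m * r + n * r, "optimizer": 2 * m * r + 2 * n * r},
        "GaLore": {"weights": m * n,                 "optimizer": m * r + 2 * n * r},
    }


counts = per_layer_elements(m=4096, n=4096, r=1024)
totals = {name: c["weights"] + c["optimizer"] for name, c in counts.items()}

# Fraction of AdamW optimizer state saved by GaLore on this one layer.
saving = 1 - counts["GaLore"]["optimizer"] / counts["AdamW"]["optimizer"]
print(totals, f"optimizer-state saving: {saving:.1%}")
```

For this single square layer the optimizer state shrinks by 62.5 percent. The figure is a per-layer illustration and is not meant to reproduce whole-model measurements, which depend on the mix of layer shapes in the network.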
For a LLaMA-style 7B model, the paper reports that switching from BF16 AdamW to GaLore reduces optimizer state memory by up to 65.5 percent.[1] Stacking 8-bit quantization of M and V on top (yielding the galore_adamw_8bit configuration) reduces optimizer state memory by up to 82.5 percent and total training memory by 63.3 percent relative to a BF16 AdamW baseline.[1][5]
A second optimization, often listed as the _layerwise variant in implementations, eliminates the need to ever hold the full gradient tensor of the model in memory. Normally backpropagation produces gradients for every weight before the optimizer step runs at the end of the backward pass. GaLore's per-layer mode hooks into the backward graph and runs the projection and the AdamW update for each layer immediately after its gradient is computed, freeing that layer's full-shape gradient before the next layer's gradient materializes.[5][9] The original paper credits per-layer updates with saving roughly 13.5 GB on a LLaMA 7B pretraining run, which is what closes the gap between the 8-bit AdamW state footprint and the 24 GB capacity of an RTX 4090.[1] The trade-off is that this mode is incompatible with gradient accumulation in its naive form, because accumulation requires retaining the gradient across micro-batches.[5]
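The effect of per-layer updates on peak gradient memory can be illustrated with a toy accounting model. This sketch does not use real backward hooks; it only tracks how many gradient elements are alive at once, and the function name and layer sizes are made up for the example.

```python
def peak_gradient_elements(layer_sizes, layerwise):
    """Peak number of gradient elements alive during one backward pass.

    layer_sizes: parameter count of each layer, in forward order.
    layerwise:   if True, each layer is updated and its gradient freed
                 as soon as that gradient has been computed.
    """
    live = 0
    peak = 0
    for size in reversed(layer_sizes):   # backward visits the last layer first
        live += size                     # this layer's gradient materializes
        peak = max(peak, live)
        if layerwise:
            live -= size                 # update applied, gradient released
    return peak


sizes = [100, 400, 200]                  # hypothetical per-layer parameter counts
standard_peak = peak_gradient_elements(sizes, layerwise=False)
layerwise_peak = peak_gradient_elements(sizes, layerwise=True)
```

In the standard mode the peak equals the sum over all layers, while in the per-layer mode it equals the largest single layer. The same model also shows why accumulation conflicts with this mode: once a gradient is released there is nothing left to accumulate into.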
Inputs: weight W in R^(m x n), rank r, update interval T,
        scale alpha, learning rate eta
Initialize M_0 = 0, V_0 = 0, both in R^(r x n)
Initialize P_0 from SVD of grad(W) at step 0
for t = 1, 2, ... do
    G_t = backward(W_t)                         # full gradient, (m, n), BF16
    if t mod T == 0 then
        U, _, _ = svd(G_t)
        P_t = U[:, :r]                          # refresh projection, (m, r)
    else
        P_t = P_{t-1}                           # reuse previous projection
    R_t = P_t^T @ G_t                           # projected gradient, (r, n), BF16
    M_t = beta1 * M_{t-1} + (1 - beta1) * R_t
    V_t = beta2 * V_{t-1} + (1 - beta2) * R_t * R_t
    N_t = M_t / (sqrt(V_t) + eps)               # Adam normalized update
    G_til = alpha * P_t @ N_t                   # back to (m, n)
    W_{t+1} = W_t - eta * G_til
Hyperparameters used in the paper for LLaMA pretraining include r = 1024 for the 7B model (matrix dimension 4096) and r = 512 for the 1B model (dimension 2048), update_proj_gap between 200 and 500 steps, and scale = 0.25.[1][9]
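One step of the update loop listed above can be written as runnable Python. This is a sketch rather than the galore-torch implementation: it uses plain lists instead of tensors, takes the projection P as an input, and, like the listing, omits Adam's bias correction and weight decay. The class name `ProjectedAdam` and the toy values are assumptions.

```python
import math


def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


class ProjectedAdam:
    """Adam moments stored in the projected (r x n) space only."""

    def __init__(self, r, n, beta1=0.9, beta2=0.999, eps=1e-8):
        self.beta1, self.beta2, self.eps = beta1, beta2, eps
        self.m = [[0.0] * n for _ in range(r)]   # first moment, (r, n)
        self.v = [[0.0] * n for _ in range(r)]   # second moment, (r, n)

    def step(self, w, g, p, lr, alpha=0.25):
        """Return updated weights given gradient g (m x n) and projection p (m x r)."""
        r_t = matmul(transpose(p), g)            # project: (r, n)
        for i, row in enumerate(r_t):
            for j, x in enumerate(row):
                self.m[i][j] = self.beta1 * self.m[i][j] + (1 - self.beta1) * x
                self.v[i][j] = self.beta2 * self.v[i][j] + (1 - self.beta2) * x * x
        n_t = [[mi / (math.sqrt(vi) + self.eps) for mi, vi in zip(m_row, v_row)]
               for m_row, v_row in zip(self.m, self.v)]
        g_tilde = matmul(p, n_t)                 # project back: (m, n)
        return [[wi - lr * alpha * gi for wi, gi in zip(w_row, g_row)]
                for w_row, g_row in zip(w, g_tilde)]


# Toy problem (assumed): m = 2, n = 3, r = 1, P selects the first row.
W = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
G = [[1.0, -2.0, 0.0], [5.0, 5.0, 5.0]]
P = [[1.0], [0.0]]
opt = ProjectedAdam(r=1, n=3)
W_new = opt.step(W, G, P, lr=0.01)
```

The second row of W is left untouched in this example because its gradient lies entirely outside the one-dimensional subspace spanned by P; it would only be updated once a later refresh rotates the subspace toward it.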
The paper conducts ablations over each of the three GaLore-specific hyperparameters. The rank r exhibits monotone but saturating effects: doubling r improves perplexity but with diminishing returns past roughly r = d / 4, where d is the matrix dimension. The authors recommend r = d / 4 as a starting point for a balanced memory and accuracy trade-off.[1] The update interval update_proj_gap shows a U-shaped sensitivity curve: very small values (under 50 steps) are wasteful because the subspace barely drifts between refreshes, so the extra SVDs buy little, while very large values (over 1000 steps) cause the subspace to lag behind the true gradient distribution and degrade convergence.[1] The scale alpha is set to a constant 0.25 in the pretraining experiments and to a rank-dependent value such as 4.0 / r for fine-tuning, so that smaller-rank projections receive a larger effective step size; this differs from LoRA's standard alpha / r scaling rule.[1][9]
In practice GaLore is run with the model weights, gradients, and projection matrix in BF16, while the Adam optimizer state in the projected subspace is held in either BF16 (default) or 8-bit (via the bitsandbytes blockwise quantization scheme). The reference implementation casts the projection matrix P to the dtype of the gradient at multiplication time so that the matrix multiplication kernel can dispatch to tensor cores without an explicit upcast.[3] The SVD itself is computed in FP32 internally because torch.linalg.svd does not support BF16 inputs reliably; in galore-torch this is handled by an explicit .float() cast at SVD time and a downcast immediately afterward.[3]
The original paper pretrains LLaMA-style decoders ranging from 60M to 7B parameters on the C4 (Colossal Clean Crawled Corpus) dataset and reports validation perplexity. The headline numbers from Table 1 of the paper are:[1][9]
| Model | Full-rank AdamW | GaLore | LoRA | ReLoRA |
|---|---|---|---|---|
| 60M | 34.06 | 34.88 | 34.99 | 37.04 |
| 130M | 25.08 | 25.36 | 33.92 | 29.37 |
| 350M | 18.80 | 18.95 | 25.58 | 29.08 |
| 1B | 15.56 | 15.64 | 19.21 | 18.33 |
| 7B | (not run by authors) | 14.65 | (not run) | (not run) |
GaLore tracks full-rank AdamW closely at every scale (the gap is 0.08 perplexity at 1B), while LoRA and ReLoRA fall meaningfully behind from 130M parameters upward.[1] The 7B run consumed roughly 19.7 billion tokens and was the first reported single-GPU LLaMA-7B from-scratch pretraining at that footprint.[1]
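The gaps can be read directly off the table. The short sketch below recomputes each method's perplexity difference from the full-rank baseline using only the values listed in the table; the dictionary layout and key names are illustrative.

```python
# Validation perplexities copied from the table above (7B row omitted
# because no full-rank baseline is listed for it).
perplexity = {
    "60M":  {"full": 34.06, "galore": 34.88, "lora": 34.99, "relora": 37.04},
    "130M": {"full": 25.08, "galore": 25.36, "lora": 33.92, "relora": 29.37},
    "350M": {"full": 18.80, "galore": 18.95, "lora": 25.58, "relora": 29.08},
    "1B":   {"full": 15.56, "galore": 15.64, "lora": 19.21, "relora": 18.33},
}

# Gap to the full-rank baseline, rounded to the table's two decimals.
gaps = {
    size: {method: round(value - row["full"], 2)
           for method, value in row.items() if method != "full"}
    for size, row in perplexity.items()
}

for size, row in gaps.items():
    print(size, row)
```

GaLore's gap stays below one perplexity point at every listed scale, whereas the LoRA and ReLoRA gaps are several points from 130M onward.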
For RoBERTa-base fine-tuning on the GLUE benchmark, the paper reports that GaLore matches full fine-tuning within statistical noise across MNLI, QQP, SST-2, CoLA, QNLI, MRPC, RTE, and STS-B at a fraction of the optimizer-state memory; the GaLore average score is 85.89 versus 85.61 for full fine-tuning and 85.21 for LoRA.[1]
The combined effect of low-rank optimizer state, 8-bit AdamW quantization (built on bitsandbytes), and per-layer weight updates brings the pretraining footprint of a LLaMA 7B model to approximately 22.0 GB, fitting within a single consumer-grade NVIDIA RTX 4090 without sharding or CPU offloading.[1] This was the principal demonstration that gave the paper its widespread attention.[5]
The reference implementation is published as the galore-torch package on PyPI and as the GitHub repository jiaweizzhao/GaLore (Apache-2.0).[3] It exposes three optimizer classes that mimic PyTorch's standard optimizer interface: GaLoreAdamW, GaLoreAdamW8bit, and GaLoreAdafactor.[3] Each accepts the GaLore-specific keyword arguments rank, update_proj_gap, scale, and proj_type, supplied per parameter group through the optimizer's param_groups entries.[3] The 8-bit variant wraps the bitsandbytes AdamW8bit kernel; the Adafactor variant builds on the Hugging Face Transformers Adafactor implementation.[3]
PR 29588, authored by Younes Belkada, merged on 19 March 2024 and exposed GaLore as a first-class option in the Trainer API through three string identifiers: optim="galore_adamw", "galore_adamw_8bit", and "galore_adafactor".[4] These shipped publicly in the v4.39.0 release.[5] The integration adds two new TrainingArguments fields, optim_target_modules for selecting which submodules to attach GaLore to (accepting a list, a regex, or a fully qualified module path) and optim_args for passing the GaLore hyperparameters as a comma-separated string.[5] The Hugging Face blog post demonstrates the typical use:[5]
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./galore-run",
    optim="galore_adamw",                      # GaLore-wrapped AdamW
    optim_target_modules=["attn", "mlp"],      # submodules that receive GaLore
    optim_args="rank=128, update_proj_gap=200, scale=0.25",
    per_device_train_batch_size=2,
    max_steps=100,
)
Layer-wise variants are selected by appending _layerwise to the optimizer name, for example "galore_adamw_layerwise".[5] In that mode the framework disables gradient accumulation and runs the projection plus weight update inside backward hooks; the documentation explicitly warns that throughput is lower than the non-layerwise variant in exchange for the further memory saving.[5]
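Because the optimizer name and the hyperparameter string are both plain strings, they are easy to assemble programmatically. The helpers below are hypothetical conveniences, not part of Transformers; they only encode the naming rule and the comma-separated key=value format described in the text.

```python
# The three base identifiers named in the text.
GALORE_OPTIMS = ("galore_adamw", "galore_adamw_8bit", "galore_adafactor")


def galore_optim_name(base, layerwise=False):
    """Return the optimizer identifier, optionally with the _layerwise suffix."""
    if base not in GALORE_OPTIMS:
        raise ValueError(f"unknown GaLore optimizer: {base!r}")
    return base + "_layerwise" if layerwise else base


def format_optim_args(**hyperparams):
    """Render keyword arguments as a comma-separated 'key=value' string."""
    return ", ".join(f"{key}={value}" for key, value in hyperparams.items())


optim = galore_optim_name("galore_adamw", layerwise=True)
optim_args = format_optim_args(rank=128, update_proj_gap=200, scale=0.25)
print(optim, "|", optim_args)
```

The resulting strings are intended to be passed as the `optim` and `optim_args` fields of `TrainingArguments`; rejecting unknown base names early avoids a typo surfacing only when training starts.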
The same optim keys are accepted by the SFT trainer in trl.SFTTrainer via SFTConfig, with identical semantics. The blog's worked example trains Mistral-7B on IMDB with optim="galore_adamw" and target modules ["attn", "mlp"].[5]
The LLaMA-Factory project, a popular YAML-driven fine-tuning framework that supports more than 100 base models, lists GaLore as one of its optimizer choices alongside LoRA, QLoRA, full freeze-tuning, and 32-bit full fine-tuning.[10] Setting optim: galore_adamw_8bit in the LLaMA-Factory training YAML invokes the same Hugging Face Trainer integration described above.[10]
Axolotl, another widely used training framework, also supports GaLore through the underlying Transformers integration.[11] The Axolotl issue tracker has documented out-of-memory edge cases when the user requests galore_adamw_8bit together with full gradient accumulation, reflecting the layerwise/accumulation incompatibility noted by the upstream maintainers.[11]
The Hugging Face PEFT and accelerate teams collaborated on enabling layerwise variants under their respective abstractions, and the Graphcore Research group reproduced the C4 results on IPU hardware shortly after publication.[5]
Q-GaLore by Zhenyu Zhang, Ajay Jaiswal, Lu Yin, Shiwei Liu, Jiawei Zhao, Yuandong Tian, and Zhangyang Wang (arXiv:2407.08296) extends GaLore with two ideas: keep the projection matrices P themselves in INT4 instead of BF16, and keep the underlying weights in INT8, with stochastic rounding used to preserve accumulated gradient information across quantization boundaries.[7] The authors also introduce a layer-adaptive subspace refresh schedule: layers whose gradient subspace has converged are updated less often, reducing the average SVD cost.[7] The headline claim is that Q-GaLore can pretrain a LLaMA-7B from scratch on a single NVIDIA RTX 4060 Ti with only 16 GB of memory, and that for fine-tuning it reduces memory by up to 50 percent relative to LoRA and GaLore while outperforming QLoRA at matched memory.[7] A reference implementation is available at github.com/VITA-Group/Q-GaLore.[7]
GaLore 2 (arXiv:2504.20437) by DiJia Su, Andrew Gu, Jane Xu, Yuandong Tian, and Jiawei Zhao addresses the principal complaint about GaLore in production settings: SVD is expensive, and full SVD of a 4096 x 4096 BF16 gradient is so slow on GPU that it dominates wall-clock time at large model sizes. GaLore 2 swaps the exact truncated SVD for a randomized SVD construction, restructures the projection step to be compatible with FSDP sharding, and demonstrates a from-scratch pretraining run of LLaMA 7B with up to 500 billion training tokens, which is roughly 25 times the token budget used in the original paper.[8]
Several smaller variants have been published in the wider literature. GoLore (PKU) replaces the SVD-derived P with a random orthogonal projection sampled uniformly on the Stiefel manifold, providing convergence guarantees that the deterministic SVD-based construction lacks in late training stages, and showing improved late-stage perplexity on LLaMA pretraining.[12] "Online Subspace Descent" generalizes GaLore into a continuous subspace tracking scheme that updates P at every step via a low-cost stochastic rule, narrowing the gap to full-rank baselines on C4 pretraining.[13] GALE proposes replacing the SVD with a fast randomized QR decomposition for a reported 23x speedup on the subspace update step.[14]
The principal limitations of vanilla GaLore are well documented in the literature.
First, the SVD step is computationally expensive. On a LLaMA 7B model, one SVD pass over all weight gradients can take on the order of ten to twenty minutes on a consumer GPU, which dominates wall-clock time when update_proj_gap is small.[14][15] This cost motivates both Q-GaLore's layer-adaptive refresh schedule and GaLore 2's randomized SVD.[7][8]
Second, the per-layer weight-update trick that delivers the 24 GB pretraining demonstration is fundamentally incompatible with multi-step gradient accumulation, because the gradient is freed immediately after the layer's weights have been updated.[5] In practice this caps effective batch size at what fits in one micro-batch, which is small on consumer hardware and can hurt convergence stability.[11]
Third, GaLore is more memory-efficient than LoRA only on the optimizer state and not on the weights themselves; full BF16 weight storage is still required, so for very large models GaLore does not entirely eliminate the need for sharding.[1][8]
Fourth, the theoretical convergence guarantees in the original paper rest on the assumption that gradients have a stable low-rank structure. Follow-up work has shown that this assumption can break down in late training stages, where the dominant singular directions of the gradient can shift faster than the update_proj_gap allows, leading to degraded convergence; GoLore's random-projection alternative was proposed in part to address this case.[12]
Fifth, the SVD-based projection is non-trivial to combine with FSDP-style sharded training because the SVD must be computed on the full unsharded gradient, which adds communication overhead in distributed settings. The GaLore 2 paper explicitly cites this as a motivation for its restructuring.[8]
| Property | Full AdamW | LoRA / QLoRA | GaLore | Q-GaLore |
|---|---|---|---|---|
| Trains all weights | Yes | No (adapter only) | Yes | Yes |
| Optimizer state size per layer | 2mn | 2(m+n)r | (m+2n)r | INT4-quantized projection plus low-rank Adam state |
| Extra projection matrices stored | None | BA adapters (mr + nr) | P (mr) | P in INT4 |
| Weight dtype | BF16/FP16 | BF16 weights + adapter | BF16 | INT8 |
| Demonstrated single-GPU 7B pretraining | No | No | Yes (24 GB) | Yes (16 GB) |
| SVD overhead | None | None | Significant | Reduced via adaptive schedule |
The cleanest one-line distinction between GaLore and LoRA: LoRA decomposes the weight delta as Delta W = B A, freezes W, and trains B, A. GaLore decomposes the gradient as G approx P (P^T G), trains W itself, and stores Adam moments only of the smaller P^T G. The two methods can be stacked (GaLore-LoRA-style hybrids exist in the literature) but they are conceptually independent.[1]
GaLore changed the practical conversation around LLM pretraining hardware. Before its publication the consensus position was that pretraining a 7-billion-parameter model required at minimum a multi-GPU node with sharding or aggressive CPU/NVMe offloading.[1] By demonstrating that a single 24 GB consumer GPU was sufficient when optimizer state was compressed in the right way, GaLore widened the population of researchers and small organizations who could attempt from-scratch pretraining experiments at the 1B to 7B scale.[5][16]
Within the Hugging Face ecosystem the GaLore integration was the first time the Transformers library shipped a non-trivial new optimizer behind a string identifier in TrainingArguments, setting a template that was reused for APOLLO and other subsequent memory-efficient optimizers.[17] The "optim_target_modules" abstraction introduced by PR 29588 is now part of the standard configuration surface for swapping in low-rank methods across PEFT-style and full-parameter training paths.[4][5]
In a broader research framing, GaLore validated the empirical claim, central to a growing line of work, that the gradient of an over-parameterized neural network has approximate low-rank structure that can be exploited at training time, not merely for compression of trained models.[1] This idea now underpins several other memory-efficient training proposals, including online subspace descent and gradient wavelet transforms.[13]