GaLore (Gradient Low-Rank Projection)

Updated Jun 27, 2026

GaLore (Gradient Low-Rank Projection) is a memory-efficient training strategy for large neural networks that projects each weight matrix's gradient into a low-rank subspace, computes the Adam-style optimizer state inside that smaller subspace, and then projects the resulting update back to the original full-size parameter space. Because only the optimizer's view of the gradient is compressed, GaLore trains every weight of the model (full-parameter learning), which is the key contrast with adapter methods such as LoRA that constrain the weight update itself to low rank. Introduced by Jiawei Zhao and collaborators in the paper "GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection," the method reduces optimizer-state memory by up to 65.5 percent and, for the first time, made it feasible to pretrain a 7-billion-parameter LLaMA-style model from scratch on a single 24 GB consumer GPU (for example an NVIDIA RTX 4090) without model parallelism, checkpointing, or CPU offloading.^[1] The work was posted to arXiv on 6 March 2024 and accepted as an Oral presentation at the 41st International Conference on Machine Learning (ICML) the same year.^[1]^[2]^[6]

The paper frames the contribution in one sentence: "we propose Gradient Low-Rank Projection (GaLore), a training strategy that allows full-parameter learning but is more memory-efficient than common low-rank adaptation methods such as LoRA."^[1]

What is GaLore?

GaLore is an optimizer-level technique, not a change to model architecture or to the parameters being trained. At each step it takes the gradient G of a weight matrix W, projects it with a tall-skinny matrix P derived from the singular value decomposition (SVD) of a recent gradient, and runs the standard Adam moment updates on the much smaller projected gradient P^T G. The normalized Adam update is then projected back to the shape of W and applied to the full weight. The low-rank approximation therefore lives in the optimizer's view of the gradient rather than in the weights themselves, so the model still learns at full rank while the expensive Adam first- and second-moment buffers shrink to the size of the subspace.^[1] Combined with 8-bit optimizer-state quantization and per-layer weight updates, this is what brings a 7B pretraining footprint inside 24 GB.^[1]^[3]

Why is training large models memory-bound?

Training large LLMs is memory-bound because optimizer states, activations, weights, and gradients all reside in GPU memory simultaneously. For AdamW, the dominant component is the optimizer state: every trainable parameter requires two extra floating-point values (the first-moment estimate m and the second-moment estimate v), so a model with n parameters in BF16 consumes roughly 2n bytes for weights plus 8n bytes for the AdamW state in FP32, before accounting for gradients and activations.^[1] A 7-billion-parameter LLaMA model in BF16 therefore needs about 14 GB for weights and roughly 56 GB for FP32 Adam state, a footprint that already exceeds the 24 GB available on an NVIDIA RTX 4090.
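The arithmetic in this paragraph can be checked with a short sketch. The helper name `adamw_training_memory_gb` is hypothetical, and the calculation assumes 1 GB = 1e9 bytes, BF16 weights, and two FP32 moment buffers. Gradients and activations are deliberately left out.

```python
def adamw_training_memory_gb(n_params, weight_bytes=2, state_bytes=4):
    """Rough weight and AdamW-state footprint in GB (assumes 1 GB = 1e9 bytes)."""
    weights = n_params * weight_bytes          # BF16 weights: 2 bytes per parameter
    adam_state = 2 * n_params * state_bytes    # m and v in FP32: 2 x 4 bytes per parameter
    return weights / 1e9, adam_state / 1e9

weights_gb, state_gb = adamw_training_memory_gb(7e9)
print(f"weights: {weights_gb:.0f} GB, AdamW state: {state_gb:.0f} GB")
```

Both figures already exceed or nearly fill a 24 GB card on their own, which is why the optimizer state is the natural target for compression.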

Two prior families of techniques addressed this. The first, parameter-efficient fine-tuning (PEFT), is exemplified by low-rank adaptation and its quantized variant QLoRA: only a small set of newly inserted low-rank adapter matrices are trained, so optimizer state grows with the adapter rank rather than with the base model size.^[1] The drawback is that LoRA is not full-parameter learning; the base weights stay frozen, which has been shown to underperform full-parameter training in pretraining settings and to bound the expressivity of fine-tunes.^[1] The second family, exemplified by ReLoRA, periodically merges low-rank updates into the base weights so that pretraining can proceed in a low-rank regime, but this approach requires a full-rank warmup phase to match dense baselines.^[1]

GaLore takes a different tack: it leaves the parameterization unchanged and instead exploits a structural property of the gradient matrix itself. The authors prove that during training of typical reversible network blocks the gradient with respect to a weight matrix becomes low-rank, and that the subspace spanned by the dominant singular vectors evolves slowly.^[1] Consequently, an SVD-derived projection matrix P can be reused for many optimizer steps before it needs to be refreshed, and the optimizer state can live in the low-rank subspace rather than in full parameter space. This shrinks optimizer-state memory from O(mn) per layer to O(mr + nr) for a rank-r projection, while the weights themselves still occupy O(mn) and every entry of W is still updated.^[1]

When was GaLore released, and who wrote it?

The paper "GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection" by Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang ("Atlas") Wang, Anima Anandkumar, and Yuandong Tian was posted to arXiv as preprint 2403.03507 on 6 March 2024.^[2] The v2 revision followed on 2 June 2024.^[2] At the time of submission the authors were affiliated with the California Institute of Technology, the University of Texas at Austin, Carnegie Mellon University, and Meta AI / FAIR.^[1]^[2]

The reference implementation galore-torch was released alongside the paper to the public GitHub repository jiaweizzhao/GaLore under the Apache-2.0 license.^[3] Hugging Face engineer Younes Belkada opened pull request 29588 to the Transformers repository on 14 March 2024 to integrate the optimizer; the PR was merged on 19 March 2024 and the optimizer shipped publicly in the v4.39.0 release of the library on 20 March 2024.^[4]^[5] On the same day, Hugging Face published a joint blog post by Titus von Koeller, Jiawei Zhao, Matthew Douglas, Yaowei Zheng, Younes Belkada, Zachary Mueller, Amy Roberts, Sourab Mangrulkar, and Benjamin Bossan titled "GaLore: Advancing Large Model Training on Consumer-grade Hardware," which documented usage from the Trainer API.^[5]

The work was accepted to ICML 2024 with an Oral presentation slot in the "Low Rank Learning" session.^[6] A first quantized follow-up, Q-GaLore by Zhenyu Zhang, Ajay Jaiswal, Lu Yin, Shiwei Liu, Jiawei Zhao, Yuandong Tian, and Zhangyang Wang, appeared as arXiv:2407.08296 on 11 July 2024 and pushed pretraining of a 7B model down to a single 16 GB RTX 4060 Ti.^[7] A more recent successor, "GaLore 2: Large-Scale LLM Pre-Training by Gradient Low-Rank Projection" by DiJia Su, Andrew Gu, Jane Xu, Yuandong Tian, and Jiawei Zhao, appeared as arXiv:2504.20437 in April 2025, focusing on randomized SVD, compatibility with Fully Sharded Data Parallel, and a demonstrated pretraining run of LLaMA 7B with up to 500 billion tokens.^[8]

How does GaLore work?

Setup and notation

For a single linear layer with weight matrix W in R^(m x n) (m <= n by convention), the gradient G_t in R^(m x n) at step t has the same shape as W. Full-rank AdamW maintains two state tensors M_t and V_t, both of shape R^(m x n), holding running estimates of the first and second moments of G_t (bias correction is applied when the update is formed). The total memory for one parameter tensor under AdamW is therefore proportional to 3mn (weight plus two state tensors), ignoring gradients and master copies.^[1]

GaLore introduces a single tall-skinny projection matrix P_t in R^(m x r) with r << m. At each step the gradient is first projected into the low-rank subspace:

R_t = P_t^T G_t        # R_t in R^(r x n)

The optimizer state lives at the projected scale. M_t and V_t are stored as R^(r x n) tensors and are updated using the standard Adam recurrences applied to R_t. After Adam produces a normalized update N_t in R^(r x n), GaLore projects it back to the original shape:

G_tilde_t = alpha * P_t N_t
W_{t+1}   = W_t - eta * G_tilde_t

where alpha is a fixed scaling factor (typically 0.25) that controls update strength independently of the rank, and eta is the learning rate.^[9] In the paper's standard "left" projection setting only P is used; a "right" projection variant uses Q in R^(n x r) instead, switching which dimension is compressed based on which is larger.^[1]

Constructing P from SVD

The projection matrix is built from the singular value decomposition of the most recent full gradient. Concretely, every T steps (the update_proj_gap hyperparameter), GaLore computes a truncated SVD G_t approx U Sigma V^T and sets P_t = U[:, :r], the columns corresponding to the r largest singular values.^[1]^[9] Between refreshes the projection matrix is held fixed, so the cost of SVD is amortized across T steps. The authors prove that under the assumption of stable rank gradients the subspace spanned by U[:, :r] drifts slowly enough that this lazy update incurs only a bounded loss compared to recomputing P_t every step.^[1]
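The construction can be illustrated with NumPy. This is a minimal sketch rather than the reference implementation: `galore_projection` is a hypothetical helper, and the toy gradient is made low-rank on purpose so that the projection loses little information.

```python
import numpy as np

def galore_projection(grad, rank):
    """Return the top-`rank` left singular vectors of `grad` as P, shape (m, rank)."""
    U, _, _ = np.linalg.svd(grad, full_matrices=False)
    return U[:, :rank]

rng = np.random.default_rng(0)
m, n, r = 64, 96, 8
# Toy gradient (an assumption for illustration): rank r plus small noise.
G = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
G += 1e-3 * rng.standard_normal((m, n))

P = galore_projection(G, r)
R = P.T @ G          # projected gradient, shape (r, n)
G_hat = P @ R        # back-projection to the original shape (m, n)

rel_err = np.linalg.norm(G - G_hat) / np.linalg.norm(G)
print(P.shape, R.shape, f"relative error {rel_err:.1e}")
```

Because the columns of P are orthonormal, P^T P is the identity on the subspace, so projecting down and back up leaves any component that already lies in the subspace unchanged.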

In code, the truncated SVD is implemented with torch.linalg.svd on the gradient, with the projection matrix kept in the dtype of the gradient (usually BF16) and converted lazily during the matrix multiplication.^[3]

Theoretical justification

The paper provides a formal argument for why the projection-based approach should converge. The central lemma states that for a class of reversible networks (which the authors show includes typical transformer blocks under mild conditions on the activations), the time evolution of the weight matrix under gradient descent leaves the rank of the gradient bounded by a slowly growing function of the training step.^[1] Concretely, Lemma 3.1 of the paper shows that if the loss is smooth and the network is reversible, the gradient G_t can be approximated arbitrarily well by its rank-r truncated SVD with error that depends only on r and on the spectral decay rate of the gradient matrix.^[1] Empirically the authors plot the singular value distribution of gradients across training steps for LLaMA blocks and observe a heavy concentration in the top few hundred singular directions, validating that the assumption holds in practice.^[1]

A second theoretical claim concerns the validity of holding P_t fixed for many steps. The authors prove that if the subspace drift rate (the angular velocity of the dominant r-dimensional eigenspace) is bounded, then the cumulative error introduced by reusing P_t for T consecutive steps is proportional to T times that drift rate.^[1] In practice this means that update_proj_gap can be set to several hundred steps with negligible effect on convergence, which is what makes the amortized SVD cost manageable.^[1]

How does GaLore save memory?

The original paper presents a per-layer accounting that makes the savings explicit. For a weight matrix W in R^(m x n) (m <= n) and rank r, the memory required per layer is approximately:^[1]

Component	AdamW	LoRA (rank r)	GaLore (rank r)
Weights	`mn`	`mn + mr + nr`	`mn`
Optimizer states	`2mn`	`2mr + 2nr`	`mr + 2nr`
Total	`3mn`	`mn + 3mr + 3nr`	`mn + mr + 2nr`

For a LLaMA-style 7B model, the paper reports that switching from BF16 AdamW to GaLore reduces optimizer state memory by up to 65.5 percent.^[1] Stacking 8-bit quantization of M and V on top (yielding the galore_adamw_8bit configuration) reduces optimizer state memory by up to 82.5 percent and total training memory by 63.3 percent relative to a BF16 AdamW baseline.^[1]^[5] The abstract states the headline result directly: "Our 8-bit GaLore further reduces optimizer memory by up to 82.5% and total training memory by 63.3%, compared to a BF16 baseline."^[1]
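The per-layer formulas in the table above can be evaluated directly. The sketch below counts tensor elements only (not bytes), and the values m = n = 4096 and r = 1024 are illustrative choices; `per_layer_elements` is a hypothetical helper name.

```python
def per_layer_elements(m, n, r):
    """Element counts per layer (weights plus optimizer states), per the table's totals."""
    return {
        "AdamW": 3 * m * n,
        "LoRA": m * n + 3 * m * r + 3 * n * r,
        "GaLore": m * n + m * r + 2 * n * r,
    }

counts = per_layer_elements(m=4096, n=4096, r=1024)
for name, c in counts.items():
    share = c / counts["AdamW"]
    print(f"{name:7s} {c:>12,d} elements ({share:.0%} of AdamW)")
```

The whole-model percentages quoted in this section come from the paper's measurements, so they need not match this single-layer count exactly.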

Per-layer weight updates

A second optimization, often listed as the _layerwise variant in implementations, eliminates the need to ever hold the full gradient tensor of the model in memory. Normally backpropagation produces gradients for every weight before the optimizer step runs at the end of the backward pass. GaLore's per-layer mode hooks into the backward graph and runs the projection and the AdamW update for each layer immediately after its gradient is computed, freeing that layer's full-shape gradient before the next layer's gradient materializes.^[5]^[9] The original paper credits per-layer updates with saving roughly 13.5 GB on a LLaMA 7B pretraining run, which is what closes the gap between the 8-bit AdamW state footprint and the 24 GB capacity of an RTX 4090.^[1] The trade-off is that this mode is incompatible with gradient accumulation in its naive form, because accumulation requires retaining the gradient across micro-batches.^[5]
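The effect on peak gradient memory can be seen in a toy accounting model. This is only a sketch of the idea, not how any framework measures memory: the function `peak_gradient_elements` and the eight equal-sized layers are hypothetical, and activations, weights, and optimizer state are ignored.

```python
def peak_gradient_elements(layer_sizes, per_layer):
    """Peak number of gradient elements alive during one backward pass.

    Toy model: backward visits layers from last to first. In the standard
    mode every gradient stays alive until the optimizer step at the end;
    in per-layer mode each gradient is consumed and freed as soon as it
    has been produced.
    """
    alive = peak = 0
    for size in reversed(layer_sizes):
        alive += size                  # this layer's gradient materializes
        peak = max(peak, alive)
        if per_layer:
            alive -= size              # update applied in a hook, gradient freed
    return peak

layers = [4096 * 4096] * 8             # eight hypothetical square layers
standard_peak = peak_gradient_elements(layers, per_layer=False)
per_layer_peak = peak_gradient_elements(layers, per_layer=True)
print(standard_peak, per_layer_peak)
```

The same model shows why naive gradient accumulation conflicts with this mode: accumulation needs each gradient to survive across micro-batches, which is exactly what the per-layer path gives up.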

Algorithm 1 in pseudocode

Inputs: weight W in R^(m x n), rank r, update interval T,
        scale alpha, learning rate eta, Adam constants beta1, beta2, eps
Initialize M_0 = 0, V_0 = 0, both in R^(r x n)
Initialize P_0 from SVD of grad(W) at step 0
for t = 1, 2, ... do
    G_t = backward(W_t)                       # full grad, BF16
    if t mod T == 0 then
        U, _, _ = svd(G_t)
        P_t = U[:, :r]                        # refresh the subspace
    else
        P_t = P_{t-1}                         # reuse the previous projection
    end if
    R_t   = P_t^T @ G_t                       # (r, n), BF16
    M_t   = beta1 * M_{t-1} + (1 - beta1) * R_t
    V_t   = beta2 * V_{t-1} + (1 - beta2) * R_t * R_t
    M_hat = M_t / (1 - beta1^t)               # Adam bias correction
    V_hat = V_t / (1 - beta2^t)
    N_t   = M_hat / (sqrt(V_hat) + eps)       # Adam normalized update
    G_til = alpha * P_t @ N_t                 # back to (m, n)
    W_{t+1} = W_t - eta * G_til
end for

Hyperparameters used in the paper for LLaMA pretraining include r = 1024 for the 7B model (matrix dimension 4096) and r = 512 for the 1B model (dimension 2048), update_proj_gap between 200 and 500 steps, and scale = 0.25.^[1]^[9]
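For readers who want something executable, the loop can be sketched in NumPy on a toy problem. This is an illustrative sketch under stated assumptions, not the galore-torch code: `galore_adam_step`, the state dictionary layout, the quadratic toy loss, and the tiny sizes are all hypothetical, and the standard Adam bias correction is included.

```python
import numpy as np

def galore_adam_step(W, G, state, rank, T, alpha=0.25, lr=1e-2,
                     beta1=0.9, beta2=0.999, eps=1e-8):
    """One GaLore-Adam step on a single weight matrix (left projection)."""
    t = state["t"]
    if t % T == 0:                                   # refresh the subspace every T steps
        U, _, _ = np.linalg.svd(G, full_matrices=False)
        state["P"] = U[:, :rank]
    P = state["P"]
    R = P.T @ G                                      # projected gradient, shape (r, n)
    state["M"] = beta1 * state["M"] + (1 - beta1) * R
    state["V"] = beta2 * state["V"] + (1 - beta2) * R * R
    t += 1
    m_hat = state["M"] / (1 - beta1 ** t)            # Adam bias correction
    v_hat = state["V"] / (1 - beta2 ** t)
    N = m_hat / (np.sqrt(v_hat) + eps)               # normalized update, shape (r, n)
    state["t"] = t
    return W - lr * alpha * (P @ N)                  # project back to (m, n)

rng = np.random.default_rng(0)
m, n, r = 32, 48, 4
W_true = rng.standard_normal((m, n))                 # toy target (assumption)
W = np.zeros((m, n))
state = {"t": 0, "P": None, "M": np.zeros((r, n)), "V": np.zeros((r, n))}

def loss_and_grad(W):
    diff = W - W_true                                # gradient of 0.5 * ||W - W_true||^2
    return 0.5 * np.sum(diff ** 2), diff

loss_start, _ = loss_and_grad(W)
for _ in range(500):
    _, G = loss_and_grad(W)
    W = galore_adam_step(W, G, state, rank=r, T=20)
loss_end, _ = loss_and_grad(W)
print(f"loss {loss_start:.1f} -> {loss_end:.1f}")
```

Note that the moment buffers have shape (r, n) throughout, while W is updated in all of its m x n entries. The moments are not reset when P is refreshed in this sketch.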

How sensitive is GaLore to its hyperparameters?

The paper conducts ablations over each of the three GaLore-specific hyperparameters. The rank r exhibits monotone but saturating effects: doubling r improves perplexity but with diminishing returns past roughly r = d / 4, where d is the matrix dimension. The authors recommend r = d / 4 as a starting point for a balanced memory and accuracy trade.^[1] The update interval update_proj_gap shows a U-shaped sensitivity curve: very small values (under 50 steps) are wasteful because the subspace cannot drift meaningfully between refreshes, while very large values (over 1000 steps) cause the subspace to lag behind the true gradient distribution and degrade convergence.^[1] The scale alpha is set to a constant 0.25 in the pretraining experiments and a rank-dependent value such as 4.0 / r for fine-tuning, where smaller-rank projections benefit from a slightly larger effective step size; this differs from LoRA's standard alpha / r scaling rule.^[1]^[9]

Compatibility with mixed precision

In practice GaLore is run with the model weights, gradients, and projection matrix in BF16, while the Adam optimizer state in the projected subspace is held in either BF16 (default) or 8-bit (via the bitsandbytes blockwise quantization scheme). The reference implementation casts the projection matrix P to the dtype of the gradient at multiplication time so that the matrix multiplication kernel can dispatch to tensor cores without an explicit upcast.^[3] The SVD itself is computed in FP32 internally because torch.linalg.svd does not support BF16 inputs reliably; in galore-torch this is handled by an explicit .float() cast at SVD time and a downcast immediately afterward.^[3]

How well does GaLore perform?

LLaMA pretraining on C4

The original paper pretrains LLaMA-style decoders ranging from 60M to 7B parameters on the C4 (Colossal Clean Crawled Corpus) dataset and reports validation perplexity. The headline numbers from Table 1 of the paper are:^[1]^[9]

Model	Full-rank AdamW	GaLore	LoRA	ReLoRA
60M	34.06	34.88	34.99	37.04
130M	25.08	25.36	33.92	29.37
350M	18.80	18.95	25.58	29.08
1B	15.56	15.64	19.21	18.33
7B	(not run by authors)	14.65	(not run)	(not run)

GaLore tracks full-rank AdamW closely at every scale (the gap is 0.08 perplexity at 1B), while LoRA and ReLoRA fall meaningfully behind from the 130M scale upward; at 130M, for example, LoRA trails full-rank AdamW by 8.84 perplexity.^[1] The paper's pretraining experiments use the C4 dataset "with up to 19.7B tokens," and the 7B run was the first reported single-GPU LLaMA-7B from-scratch pretraining at that footprint.^[1]
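The gaps to the full-rank baseline can be recomputed from the table above. The snippet is plain arithmetic on the reported numbers; the 7B row is omitted because it has no baseline to compare against.

```python
# Validation perplexities copied from the table above (lower is better).
perplexity = {
    "60M":  {"AdamW": 34.06, "GaLore": 34.88, "LoRA": 34.99, "ReLoRA": 37.04},
    "130M": {"AdamW": 25.08, "GaLore": 25.36, "LoRA": 33.92, "ReLoRA": 29.37},
    "350M": {"AdamW": 18.80, "GaLore": 18.95, "LoRA": 25.58, "ReLoRA": 29.08},
    "1B":   {"AdamW": 15.56, "GaLore": 15.64, "LoRA": 19.21, "ReLoRA": 18.33},
}

gaps_by_size = {}
for size, row in perplexity.items():
    # Gap of each method to full-rank AdamW at the same model size.
    gaps_by_size[size] = {
        method: round(value - row["AdamW"], 2)
        for method, value in row.items() if method != "AdamW"
    }
    print(size, gaps_by_size[size])
```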

RoBERTa GLUE fine-tuning

For RoBERTa-base fine-tuning on the GLUE benchmark, the paper reports that GaLore comes close to full fine-tuning across MNLI, QQP, SST-2, CoLA, QNLI, MRPC, RTE, and STS-B at a fraction of the optimizer-state memory; at rank 4 the GaLore average score is 85.89 versus 85.61 for LoRA, with full fine-tuning at 86.28.^[1]

Memory and hardware results

The combined effect of low-rank optimizer state, 8-bit AdamW quantization (built on bitsandbytes), and per-layer weight updates brings the pretraining footprint of a LLaMA 7B model to approximately 22.0 GB, fitting within a single consumer-grade NVIDIA RTX 4090 without sharding or CPU offloading.^[1] As the abstract puts it, the authors "demonstrate, for the first time, the feasibility of pre-training a 7B model on consumer GPUs with 24GB memory (e.g., NVIDIA RTX 4090) without model parallel, checkpointing, or offloading strategies."^[1] This was the principal demonstration that gave the paper its widespread attention.^[5]

Which frameworks support GaLore?

galore-torch

The reference implementation is published as the galore-torch package on PyPI and as the GitHub repository jiaweizzhao/GaLore (Apache-2.0).^[3] It exposes three optimizer classes that mimic PyTorch's standard optimizer interface: GaLoreAdamW, GaLoreAdamW8bit, and GaLoreAdafactor.^[3] Each accepts the GaLore-specific keyword arguments rank, update_proj_gap, scale, and proj_type, applied to a per-parameter param_groups entry.^[3] The 8-bit variant wraps the bitsandbytes AdamW8bit kernel; the Adafactor variant builds on the Hugging Face Transformers Adafactor implementation.^[3]
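A typical way to prepare the per-parameter groups is to route only the 2-D weights of selected modules to GaLore and leave everything else in a plain group. The sketch below shows that split in pure Python. `build_galore_param_groups`, `FakeParam`, and the parameter names are hypothetical, the default `proj_type="std"` is an assumption about the package's default, and the commented lines indicate how the result would be handed to the real optimizer.

```python
def build_galore_param_groups(named_params, target_keywords, rank=128,
                              update_proj_gap=200, scale=0.25, proj_type="std"):
    """Split (name, param) pairs into a regular group and a GaLore group.

    Only 2-D parameters whose name contains one of `target_keywords` go to
    the GaLore group; biases, norms, and unmatched weights stay regular.
    """
    galore, regular = [], []
    for name, p in named_params:
        is_matrix = len(getattr(p, "shape", ())) == 2
        if is_matrix and any(k in name for k in target_keywords):
            galore.append(p)
        else:
            regular.append(p)
    return [
        {"params": regular},
        {"params": galore, "rank": rank, "update_proj_gap": update_proj_gap,
         "scale": scale, "proj_type": proj_type},
    ]

class FakeParam:
    """Stand-in for a tensor; only the shape matters for this demo."""
    def __init__(self, *shape):
        self.shape = shape

demo = [
    ("layers.0.attn.q_proj.weight", FakeParam(64, 64)),
    ("layers.0.attn.q_proj.bias", FakeParam(64)),
    ("layers.0.mlp.up_proj.weight", FakeParam(256, 64)),
    ("embed.weight", FakeParam(1000, 64)),
]
groups = build_galore_param_groups(demo, ["attn", "mlp"])
print(len(groups[0]["params"]), len(groups[1]["params"]))

# With the real package installed (not executed here):
#   from galore_torch import GaLoreAdamW
#   groups = build_galore_param_groups(model.named_parameters(), ["attn", "mlp"])
#   optimizer = GaLoreAdamW(groups, lr=1e-2)
```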

Hugging Face Transformers and TRL

PR 29588, authored by Younes Belkada, merged on 19 March 2024 and exposed GaLore as a first-class option in the Trainer API through three string identifiers: optim="galore_adamw", "galore_adamw_8bit", and "galore_adafactor".^[4] These shipped publicly in the v4.39.0 release.^[5] The integration adds two new TrainingArguments fields, optim_target_modules for selecting which submodules to attach GaLore to (accepting a list, a regex, or a fully qualified module path) and optim_args for passing the GaLore hyperparameters as a comma-separated string.^[5] The Hugging Face blog post demonstrates the typical use:^[5]

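from transformers import TrainingArguments

# GaLore is selected through `optim`; its hyperparameters travel in `optim_args`.
args = TrainingArguments(
    output_dir="./galore-run",
    optim="galore_adamw",
    optim_target_modules=["attn", "mlp"],  # submodules whose weights use GaLore
    optim_args="rank=128, update_proj_gap=200, scale=0.25",
    per_device_train_batch_size=2,
    max_steps=100,
)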

Layer-wise variants are selected by appending _layerwise to the optimizer name, for example "galore_adamw_layerwise".^[5] In that mode the framework disables gradient accumulation and runs the projection plus weight update inside backward hooks; the documentation explicitly warns that throughput is lower than the non-layerwise variant in exchange for the further memory saving.^[5]

The same optim keys are accepted by the SFT trainer in trl.SFTTrainer via SFTConfig, with identical semantics. The blog's worked example trains Mistral-7B on IMDB with optim="galore_adamw" and target modules ["attn", "mlp"].^[5]

LLaMA-Factory

The LLaMA-Factory project, a popular YAML-driven fine-tuning framework that supports more than 100 base models, lists GaLore as one of its optimizer choices alongside LoRA, QLoRA, full freeze-tuning, and 32-bit full fine-tuning.^[10] Setting optim: galore_adamw_8bit in the LLaMA-Factory training YAML invokes the same Hugging Face Trainer integration described above.^[10]

Axolotl

Axolotl, another widely used training framework, also supports GaLore through the underlying Transformers integration.^[11] The Axolotl issue tracker has documented out-of-memory edge cases when the user requests galore_adamw_8bit together with full gradient accumulation, reflecting the layerwise/accumulation incompatibility noted by the upstream maintainers.^[11]

Other ecosystems

The Hugging Face PEFT and accelerate teams collaborated on enabling layerwise variants under their respective abstractions, and the Graphcore Research group reproduced the C4 results on IPU hardware shortly after publication.^[5]

What variants and follow-ups exist?

Q-GaLore (July 2024)

Q-GaLore by Zhenyu Zhang, Ajay Jaiswal, Lu Yin, Shiwei Liu, Jiawei Zhao, Yuandong Tian, and Zhangyang Wang (arXiv:2407.08296) extends GaLore with two ideas: keep the projection matrices P themselves in INT4 instead of BF16, and keep the underlying weights in INT8, with stochastic rounding used to preserve accumulated gradient information across quantization boundaries.^[7] The authors also introduce a layer-adaptive subspace refresh schedule: layers whose gradient subspace has converged are updated less often, reducing the average SVD cost.^[7] The headline claim is that Q-GaLore can pretrain a LLaMA-7B from scratch on a single NVIDIA RTX 4060 Ti with only 16 GB of memory, and that for fine-tuning it reduces memory by up to 50 percent relative to LoRA and GaLore while outperforming QLoRA at matched memory.^[7] A reference implementation is available at github.com/VITA-Group/Q-GaLore.^[7]

GaLore 2 (April 2025)

GaLore 2 (arXiv:2504.20437) by DiJia Su, Andrew Gu, Jane Xu, Yuandong Tian, and Jiawei Zhao addresses the principal complaint about GaLore in production settings: SVD is expensive, and full SVD of a 4096 x 4096 BF16 gradient is so slow on GPU that it dominates wall-clock time at large model sizes. GaLore 2 swaps the exact truncated SVD for a randomized SVD construction, restructures the projection step to be compatible with FSDP sharding, and demonstrates a from-scratch pretraining run of LLaMA 7B with up to 500 billion training tokens, which is roughly 25 times the token budget used in the original paper.^[8]
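The general idea behind a randomized SVD can be sketched with NumPy. The article does not give GaLore 2's exact construction, so what follows is a generic randomized range finder with power iterations, offered as an assumption about the flavor of the technique. The function name and the `oversample` and `n_iter` parameters are hypothetical.

```python
import numpy as np

def randomized_left_subspace(G, rank, oversample=8, n_iter=2, seed=0):
    """Approximate the top-`rank` left singular vectors of G."""
    rng = np.random.default_rng(seed)
    m, n = G.shape
    k = min(rank + oversample, m, n)
    Y = G @ rng.standard_normal((n, k))       # random sketch of the column space
    for _ in range(n_iter):                   # power iterations sharpen the spectrum
        Q, _ = np.linalg.qr(Y)
        Y = G @ (G.T @ Q)
    Q, _ = np.linalg.qr(Y)                    # orthonormal basis, shape (m, k)
    # Exact SVD only on the small (k x n) matrix instead of the full (m x n) one.
    U_small, _, _ = np.linalg.svd(Q.T @ G, full_matrices=False)
    return (Q @ U_small)[:, :rank]

rng = np.random.default_rng(1)
m, n, r = 128, 192, 8
# Toy gradient (assumption): rank r plus small noise.
G = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
G += 1e-3 * rng.standard_normal((m, n))

P_fast = randomized_left_subspace(G, rank=r)
P_exact = np.linalg.svd(G, full_matrices=False)[0][:, :r]
# Compare the two subspaces through their projectors (sign and order invariant).
subspace_gap = np.linalg.norm(P_fast @ P_fast.T - P_exact @ P_exact.T)
print(P_fast.shape, f"projector difference {subspace_gap:.1e}")
```

The expensive decomposition is confined to a matrix with only `rank + oversample` rows, which is where the saving over a full SVD comes from when the rank is much smaller than the matrix dimension.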

GaLore-Adam and related

Several smaller variants have been published in the wider literature. GoLore (PKU) replaces the SVD-derived P with a random orthogonal projection sampled uniformly on the Stiefel manifold, providing convergence guarantees that the deterministic SVD-based construction lacks in late training stages, and showing improved late-stage perplexity on LLaMA pretraining.^[12] "Online Subspace Descent" generalizes GaLore into a continuous subspace tracking scheme that updates P at every step via a low-cost stochastic rule, narrowing the gap to full-rank baselines on C4 pretraining.^[13] GALE proposes replacing the SVD with a fast randomized QR decomposition for a reported 23x speedup on the subspace update step.^[14]

What are the limitations of GaLore?

The principal limitations of vanilla GaLore are well documented in the literature.

First, the SVD step is computationally expensive. On a LLaMA 7B model, a single round of SVDs over all weight gradients can take on the order of ten to twenty minutes on a consumer GPU, which dominates wall-clock time when update_proj_gap is small.^[14]^[15] This is what motivates both Q-GaLore's layer-adaptive refresh schedule and GaLore 2's randomized SVD.^[7]^[8]

Second, the per-layer weight-update trick that delivers the 24 GB pretraining demonstration is fundamentally incompatible with multi-step gradient accumulation, because the gradient is freed immediately after the layer's weights have been updated.^[5] In practice this caps effective batch size at what fits in one micro-batch, which is small on consumer hardware and can hurt convergence stability.^[11]

Third, GaLore is more memory-efficient than LoRA only on the optimizer state and not on the weights themselves; full BF16 weight storage is still required, so for very large models GaLore does not entirely eliminate the need for sharding.^[1]^[8]

Fourth, the theoretical convergence guarantees in the original paper rest on the assumption that gradients have a stable low-rank structure. Follow-up work has shown that this assumption can break down in late training stages, where the dominant singular directions of the gradient can shift faster than the update_proj_gap allows, leading to degraded convergence; GoLore's random-projection alternative was proposed in part to address this case.^[12]

Fifth, the SVD-based projection is non-trivial to combine with FSDP-style sharded training because the SVD must be computed on the full unsharded gradient, which adds communication overhead in distributed settings. The GaLore 2 paper explicitly cites this as a motivation for its restructuring.^[8]

How does GaLore differ from LoRA?

Property	Full AdamW	LoRA / QLoRA	GaLore	Q-GaLore
Trains all weights	Yes	No (adapter only)	Yes	Yes
Optimizer state size per layer	`2mn`	`2(m+n)r`	`(m+2n)r`	INT4-quantized projection plus low-rank Adam state
Extra projection matrices stored	None	`BA` adapters (`mr + nr`)	`P` (`mr`)	`P` in INT4
Weight dtype	BF16/FP16	BF16 weights + adapter	BF16	INT8
Demonstrated single-GPU 7B pretraining	No	No	Yes (24 GB)	Yes (16 GB)
SVD overhead	None	None	Significant	Reduced via adaptive schedule

The cleanest one-line distinction between GaLore and LoRA: LoRA decomposes the weight delta as Delta W = B A, freezes W, and trains B, A (so the learned update is constrained to low rank). GaLore decomposes the gradient as G approx P (P^T G), trains W itself at full rank, and stores Adam moments only of the smaller P^T G. The two methods can be stacked (GaLore-LoRA-style hybrids exist in the literature) but they are conceptually independent.^[1] This is the distinction the abstract draws when it describes GaLore as a strategy "that allows full-parameter learning but is more memory-efficient than common low-rank adaptation methods such as LoRA."^[1]
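The difference in what each method can learn shows up in the rank of the accumulated weight change. The NumPy sketch below uses random stand-in gradients and three hypothetical refresh periods, and it skips the Adam normalization for brevity, so it illustrates the rank argument only.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 32, 48, 4

# LoRA: the learned delta is B @ A, so its rank can never exceed r.
B, A = rng.standard_normal((m, r)), rng.standard_normal((r, n))
lora_delta = B @ A

# GaLore: each individual update P @ (P^T G) has rank <= r, but P is
# refreshed over time, so updates from different subspaces accumulate
# into a change of W whose rank can exceed r.
galore_delta = np.zeros((m, n))
for _ in range(3):                                   # three hypothetical refresh periods
    G = rng.standard_normal((m, n))                  # stand-in gradient (assumption)
    P = np.linalg.svd(G, full_matrices=False)[0][:, :r]
    galore_delta += P @ (P.T @ G)                    # projected update, shape (m, n)

lora_rank = np.linalg.matrix_rank(lora_delta)
galore_rank = np.linalg.matrix_rank(galore_delta)
print(lora_rank, galore_rank)
```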

Why does GaLore matter?

GaLore changed the practical conversation around LLM pretraining hardware. Before its publication the consensus position was that pretraining a 7-billion-parameter model required at minimum a multi-GPU node with sharding or aggressive CPU/NVMe offloading.^[1] By demonstrating that a single 24 GB consumer GPU was sufficient when optimizer state was compressed in the right way, GaLore widened the population of researchers and small organizations who could attempt from-scratch pretraining experiments at the 1B to 7B scale.^[5]^[16]

Within the Hugging Face ecosystem the GaLore integration was the first time the Transformers library shipped a non-trivial new optimizer behind a string identifier in TrainingArguments, setting a template that was reused for APOLLO and other subsequent memory-efficient optimizers.^[17] The "optim_target_modules" abstraction introduced by PR 29588 is now part of the standard configuration surface for swapping in low-rank methods across PEFT-style and full-parameter training paths.^[4]^[5]

In a broader research framing, GaLore validated the empirical claim, central to a growing line of work, that the gradient of an over-parameterized neural network has approximate low-rank structure that can be exploited at training time, not merely for compression of trained models.^[1] This idea now underpins several other memory-efficient training proposals, including online subspace descent and gradient wavelet transforms.^[13]

References

1. Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, Yuandong Tian, "GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection", arXiv, 2024-03-06. https://arxiv.org/abs/2403.03507. Accessed 2026-05-20.
2. Jiawei Zhao et al., "GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection (v2 revision)", arXiv:2403.03507v2, 2024-06-02. https://arxiv.org/abs/2403.03507v2. Accessed 2026-05-20.
3. Jiawei Zhao, "GaLore reference implementation (galore-torch)", GitHub repository jiaweizzhao/GaLore, Apache-2.0 license, 2024-03-06. https://github.com/jiaweizzhao/GaLore. Accessed 2026-05-20.
4. Younes Belkada, "FEAT / Optim: Add GaLore optimizer", Hugging Face Transformers Pull Request 29588, merged 2024-03-19. https://github.com/huggingface/transformers/pull/29588. Accessed 2026-05-20.
5. Titus von Koeller, Jiawei Zhao, Matthew Douglas, Yaowei Zheng, Younes Belkada, Zachary Mueller, Amy Roberts, Sourab Mangrulkar, Benjamin Bossan, "GaLore: Advancing Large Model Training on Consumer-grade Hardware", Hugging Face Blog, 2024-03-20. https://huggingface.co/blog/galore. Accessed 2026-05-20.
6. ICML 2024 Program Committee, "GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection (Oral, Track 6B Low Rank Learning)", ICML 2024, 2024-07-25. https://icml.cc/virtual/2024/oral/35485. Accessed 2026-05-20.
7. Zhenyu Zhang, Ajay Jaiswal, Lu Yin, Shiwei Liu, Jiawei Zhao, Yuandong Tian, Zhangyang Wang, "Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients", arXiv:2407.08296, 2024-07-11. https://arxiv.org/abs/2407.08296. Accessed 2026-05-20.
8. DiJia Su, Andrew Gu, Jane Xu, Yuandong Tian, Jiawei Zhao, "GaLore 2: Large-Scale LLM Pre-Training by Gradient Low-Rank Projection", arXiv:2504.20437, 2025-04-29. https://arxiv.org/abs/2504.20437. Accessed 2026-05-20.
9. Jiawei Zhao et al., "GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection (HTML rendering)", arXiv HTML view, 2024. https://arxiv.org/html/2403.03507v1. Accessed 2026-05-20.
10. Yaowei Zheng et al., "LLaMA-Factory project documentation: training configuration and supported optimizers", LLaMA-Factory project (covered in third-party guide by Superteams.ai), 2024. https://www.superteams.ai/blog/a-definitive-guide-to-fine-tuning-llms-using-axolotl-and-llama-factory. Accessed 2026-05-20.
11. OpenAccess-AI-Collective contributors, "OOM On Galore Axolotl (issue 1448)", Axolotl GitHub issue tracker, 2024-04. https://github.com/OpenAccess-AI-Collective/axolotl/issues/1448. Accessed 2026-05-20.
12. He et al., "GoLore: Random projection alternative to GaLore on the Stiefel manifold", GitHub repository pkumelon/Golore, 2024. https://github.com/pkumelon/Golore. Accessed 2026-05-20.
13. Liang et al., "Online Subspace Descent for Memory-Efficient LLM Training", arXiv preprint discussed in survey of memory-efficient optimizers, 2024. https://arxiv.org/html/2605.09176v1. Accessed 2026-05-20.
14. GALE authors, "GALE: Gradient Activation Low-rank Extraction for Fast Memory Efficient Large Language Model Training", OpenReview, 2024. https://openreview.net/forum?id=D9Oq3c5iHn. Accessed 2026-05-20.
15. SubTrack++ authors, "SubTrack++: Gradient Subspace Tracking for Scalable LLM Training (discusses SVD overhead in GaLore)", OpenReview, 2024. https://openreview.net/pdf?id=6geRIdlFWJ. Accessed 2026-05-20.
16. Graphcore Research, "GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection (reproduction note)", Graphcore Research Blog, 2024. https://graphcore-research.github.io/galore/. Accessed 2026-05-20.
17. Hanqing Zhu, "Optim: APOLLO optimizer integration (PR 36062, references GaLore template)", Hugging Face Transformers Pull Request 36062, 2024-12. https://github.com/huggingface/transformers/pull/36062. Accessed 2026-05-20.
