# GaLore (Gradient Low-Rank Projection)

> Source: https://aiwiki.ai/wiki/galore
> Updated: 2026-06-07
> Categories: Machine Learning, Training & Optimization
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

GaLore (Gradient Low-Rank Projection) is a memory-efficient training strategy for large neural networks that projects each weight matrix's gradient into a low-rank subspace before the [Adam-style](/wiki/adam_optimizer) optimizer state is computed, then projects the resulting update back to the original parameter space. Unlike adapter methods such as [LoRA](/wiki/lora), GaLore trains every weight of the model (full-parameter learning); the low-rank approximation lives in the optimizer's view of the gradient rather than in the weights themselves. The technique was introduced by Jiawei Zhao and collaborators at Caltech, Meta AI, UT Austin, and Carnegie Mellon University in March 2024 and accepted as an Oral presentation at the 41st International Conference on Machine Learning ([ICML](/wiki/icml)) the same year.[^1][^2] Combined with 8-bit optimizer state quantization and per-layer weight updates, GaLore enables pretraining a 7-billion-parameter [LLaMA](/wiki/llama)-style model from scratch on a single 24 GB consumer GPU without model parallelism, sharding, or CPU offloading.[^1][^3]

## Background

Training large [LLMs](/wiki/llm) is memory-bound because optimizer states, activations, weights, and gradients all reside in GPU memory simultaneously. For [AdamW](/wiki/adamw), the dominant component is the optimizer state: every trainable parameter requires two extra floating-point values (the first-moment estimate `m` and the second-moment estimate `v`), so a model with `n` parameters in BF16 consumes roughly `2n` bytes for weights plus `8n` bytes for the AdamW state in FP32, before accounting for gradients and activations.[^1] A 7-billion-parameter LLaMA model in BF16 therefore needs about 14 GB for weights and roughly 56 GB for FP32 Adam state, a footprint that already exceeds the 24 GB available on an [NVIDIA RTX 4090](/wiki/gpu).
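
The arithmetic above can be reproduced in a few lines. This is only a back-of-the-envelope sketch: the helper name `adamw_training_bytes` is made up for illustration, and it assumes 2-byte (BF16) weights and two 4-byte (FP32) moment values per parameter, ignoring gradients and activations.

```python
def adamw_training_bytes(n_params, weight_bytes=2, state_bytes=4):
    """Return (weight bytes, AdamW state bytes) for a model with n_params parameters."""
    weights = n_params * weight_bytes          # BF16 weights: 2 bytes each
    adam_state = 2 * n_params * state_bytes    # m and v: two FP32 values each
    return weights, adam_state


weights, adam_state = adamw_training_bytes(7_000_000_000)
print(f"weights:     {weights / 1e9:.0f} GB")     # 14 GB
print(f"AdamW state: {adam_state / 1e9:.0f} GB")  # 56 GB
```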

Two prior families of techniques addressed this. The first, parameter-efficient fine-tuning ([PEFT](/wiki/peft)), is exemplified by [low-rank adaptation](/wiki/low-rank_adaptation) and its quantized variant [QLoRA](/wiki/qlora): only a small set of newly inserted low-rank adapter matrices are trained, so optimizer state grows with the adapter rank rather than with the base model size.[^1] The drawback is that LoRA is not full-parameter learning; the base weights stay frozen, which has been shown to underperform full-parameter training in pretraining settings and to bound the expressivity of fine-tunes.[^1] The second family, exemplified by ReLoRA, periodically merges low-rank updates into the base weights so that pretraining can proceed in a low-rank regime, but this approach requires a full-rank warmup phase to match dense baselines.[^1]

GaLore takes a different tack: it leaves the parameterization unchanged and instead exploits a structural property of the gradient matrix itself. The authors prove that during training of typical reversible network blocks the gradient with respect to a weight matrix becomes low-rank, and that the subspace spanned by the dominant singular vectors evolves slowly.[^1] Consequently, an SVD-derived projection matrix `P` can be reused for many optimizer steps before it needs to be refreshed, and the optimizer state can live in the low-rank subspace rather than in full parameter space. For an `m x n` weight matrix with `m <= n`, this shrinks the per-layer optimizer state from `2mn` entries to `mr + 2nr` for a rank-`r` projection, while the `mn` weight entries are still stored and updated in full.[^1]

## History

The paper "GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection" by Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang ("Atlas") Wang, Anima Anandkumar, and Yuandong Tian was posted to arXiv as preprint 2403.03507 on 6 March 2024.[^2] The v2 revision followed on 2 June 2024.[^2] At the time of submission the authors were affiliated with the California Institute of Technology, the University of Texas at Austin, Carnegie Mellon University, and Meta AI / FAIR.[^1][^2]

The reference implementation `galore-torch` was released alongside the paper to the public GitHub repository `jiaweizzhao/GaLore` under the Apache-2.0 license.[^3] Hugging Face engineer Younes Belkada opened pull request 29588 to the [Transformers](/wiki/transformers_library) repository on 14 March 2024 to integrate the optimizer; the PR was merged on 19 March 2024 and the optimizer shipped publicly in the v4.39.0 release of the library on 20 March 2024.[^4][^5] On the same day, Hugging Face published a joint blog post by Titus von Koeller, Jiawei Zhao, Matthew Douglas, Yaowei Zheng, Younes Belkada, Zachary Mueller, Amy Roberts, Sourab Mangrulkar, and Benjamin Bossan titled "GaLore: Advancing Large Model Training on Consumer-grade Hardware," which documented usage from the `Trainer` API.[^5]

The work was accepted to ICML 2024 with an Oral presentation slot in the "Low Rank Learning" session.[^6] A first quantized follow-up, Q-GaLore by Zhenyu Zhang, Ajay Jaiswal, Lu Yin, Shiwei Liu, Jiawei Zhao, Yuandong Tian, and Zhangyang Wang, appeared as arXiv:2407.08296 on 11 July 2024 and pushed pretraining of a 7B model down to a single 16 GB RTX 4060 Ti.[^7] A more recent successor, "GaLore 2: Large-Scale LLM Pre-Training by Gradient Low-Rank Projection" by DiJia Su, Andrew Gu, Jane Xu, Yuandong Tian, and Jiawei Zhao, appeared as arXiv:2504.20437 in April 2025, focusing on randomized SVD, compatibility with [Fully Sharded Data Parallel](/wiki/fsdp), and a demonstrated pretraining run of LLaMA 7B with up to 500 billion tokens.[^8]

## How it works

### Setup and notation

For a single linear layer with weight matrix `W in R^(m x n)` (`m <= n` by convention), the gradient `G_t in R^(m x n)` at step `t` has the same shape as `W`. Full-rank AdamW maintains two state tensors `M_t` and `V_t`, both in `R^(m x n)`, holding exponential moving averages of `G_t` and of its elementwise square (the first and second moments); bias correction is applied when the update is formed. The total memory for one parameter tensor under AdamW is therefore proportional to `3mn` (weight plus two state tensors), ignoring gradients and master copies.[^1]

GaLore introduces a single tall-skinny projection matrix `P_t in R^(m x r)` with `r << m`. At each step the gradient is first projected into the low-rank subspace:

```
R_t = P_t^T G_t        # R_t in R^(r x n)
```

The optimizer state lives at the projected scale. `M_t` and `V_t` are stored as `R^(r x n)` tensors and are updated using the standard Adam recurrences applied to `R_t`. After Adam produces a normalized update `N_t in R^(r x n)`, GaLore projects it back to the original shape:

```
G_tilde_t = alpha * P_t N_t
W_{t+1}   = W_t - eta * G_tilde_t
```

where `alpha` is a fixed scaling factor (typically `0.25`) that controls update strength independently of the rank, and `eta` is the learning rate.[^9] In the paper's standard "left" projection setting only `P` is used; a "right" projection variant uses `Q in R^(n x r)` instead, switching which dimension is compressed based on which is larger.[^1]
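
A minimal NumPy sketch of the projection and back-projection shapes follows. The function names `project` and `project_back` are hypothetical, and the rule used to pick a side (left when `m <= n`, right otherwise) is an assumption that follows the `m <= n` convention stated above, not a transcription of the reference code.

```python
import numpy as np


def project(G, r):
    """Project gradient G to rank r; returns (low-rank gradient, basis, side)."""
    m, n = G.shape
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    if m <= n:
        P = U[:, :r]                  # (m, r) left basis
        return P.T @ G, P, "left"     # projected gradient is (r, n)
    Q = Vt[:r].T                      # (n, r) right basis
    return G @ Q, Q, "right"          # projected gradient is (m, r)


def project_back(N, basis, side, alpha=0.25):
    """Map a low-rank update N back to the full (m, n) shape and scale it."""
    full = basis @ N if side == "left" else N @ basis.T
    return alpha * full


rng = np.random.default_rng(0)
G_wide = rng.standard_normal((8, 16))   # m <= n: left projection
G_tall = rng.standard_normal((16, 8))   # m > n: right projection

R_wide, P, side_wide = project(G_wide, r=2)
R_tall, Q, side_tall = project(G_tall, r=2)
print(side_wide, R_wide.shape, project_back(R_wide, P, side_wide).shape)
print(side_tall, R_tall.shape, project_back(R_tall, Q, side_tall).shape)
```

In both cases the low-rank tensor has `r` times the larger dimension entries, and that tensor is the shape at which the Adam moments would be stored.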

### Constructing P from SVD

The projection matrix is built from the singular value decomposition of the most recent full gradient. Concretely, every `T` steps (the `update_proj_gap` hyperparameter), GaLore computes a truncated SVD `G_t approx U Sigma V^T` and sets `P_t = U[:, :r]`, the columns corresponding to the `r` largest singular values.[^1][^9] Between refreshes the projection matrix is held fixed, so the cost of SVD is amortized across `T` steps. The authors prove that under the assumption of stable rank gradients the subspace spanned by `U[:, :r]` drifts slowly enough that this lazy update incurs only a bounded loss compared to recomputing `P_t` every step.[^1]

In code, the [truncated SVD](/wiki/singular_value_decomposition) is implemented with `torch.linalg.svd` on the gradient, with the projection matrix kept in the dtype of the gradient (usually BF16) and converted lazily during the matrix multiplication.[^3]
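
To make the construction concrete, here is a small NumPy sketch (not the reference implementation, which uses `torch.linalg.svd`). It builds a synthetic gradient with a rapidly decaying spectrum, takes `P` as the top-`r` left singular vectors, and checks that the part of the gradient lost by projecting is exactly the energy in the discarded singular values. The helper name and the synthetic spectrum are assumptions for illustration.

```python
import numpy as np


def projection_from_gradient(G, r):
    """Return (P, singular values), where P holds the top-r left singular vectors of G."""
    U, S, _ = np.linalg.svd(G, full_matrices=False)
    return U[:, :r], S


rng = np.random.default_rng(0)
m, n, r = 32, 64, 4

# Synthetic gradient G = U0 diag(spectrum) V0^T with singular values 1, 1/2, 1/4, ...
U0, _ = np.linalg.qr(rng.standard_normal((m, m)))
V0, _ = np.linalg.qr(rng.standard_normal((n, m)))
spectrum = 2.0 ** -np.arange(m)
G = (U0 * spectrum) @ V0.T

P, S = projection_from_gradient(G, r)
residual = np.linalg.norm(G - P @ (P.T @ G))   # what the projection throws away
expected = np.sqrt(np.sum(S[r:] ** 2))         # energy in the discarded directions
print(residual, expected)
```

The faster the singular values decay, the smaller this residual is for a given `r`, which is the sense in which a low-rank gradient makes the projection cheap in accuracy terms.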

### Theoretical justification

The paper provides a formal argument for why the projection-based approach should converge. The central lemma states that for a class of reversible networks (which the authors show includes typical transformer blocks under mild conditions on the activations), the time evolution of the weight matrix under gradient descent leaves the rank of the gradient bounded by a slowly growing function of the training step.[^1] Concretely, Lemma 3.1 of the paper shows that if the loss is smooth and the network is reversible, the gradient `G_t` can be approximated arbitrarily well by its rank-`r` truncated SVD with error that depends only on `r` and on the spectral decay rate of the gradient matrix.[^1] Empirically the authors plot the singular value distribution of gradients across training steps for LLaMA blocks and observe a heavy concentration in the top few hundred singular directions, validating that the assumption holds in practice.[^1]

A second theoretical claim concerns the validity of holding `P_t` fixed for many steps. The authors prove that if the subspace drift rate (the angular velocity of the dominant `r`-dimensional eigenspace) is bounded, then the cumulative error introduced by reusing `P_t` for `T` consecutive steps is proportional to `T` times that drift rate.[^1] In practice this means that `update_proj_gap` can be set to several hundred steps with negligible effect on convergence, which is what makes the amortized SVD cost manageable.[^1]
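
One simple way to get a feel for subspace drift is to compare the projection matrices obtained from two nearby gradients. The sketch below uses the mean squared cosine of the principal angles as an overlap score; this particular metric and the helper names are assumptions chosen for illustration, not the quantity analyzed in the paper.

```python
import numpy as np


def subspace_overlap(P_old, P_new):
    """Mean squared cosine of the principal angles between two r-dim subspaces.

    Both inputs must have orthonormal columns. The score is 1.0 when the
    subspaces coincide and 0.0 when they are orthogonal.
    """
    r = P_old.shape[1]
    return float(np.linalg.norm(P_old.T @ P_new) ** 2 / r)


def top_r_basis(G, r):
    """Top-r left singular vectors of G."""
    U, _, _ = np.linalg.svd(G, full_matrices=False)
    return U[:, :r]


rng = np.random.default_rng(0)
m, n, r = 32, 64, 4
G_old = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
G_new = G_old + 0.01 * rng.standard_normal((m, n))   # slightly drifted gradient

P_old = top_r_basis(G_old, r)
P_new = top_r_basis(G_new, r)
same = subspace_overlap(P_old, P_old)       # identical subspaces
drifted = subspace_overlap(P_old, P_new)    # stays near 1 when drift is small
print(same, drifted)
```

An overlap that stays close to 1 between refreshes would suggest that `update_proj_gap` could be increased; a sharp drop would suggest the opposite.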

### Memory math

The original paper presents a per-layer accounting that makes the savings explicit. For a weight matrix `W in R^(m x n)` (`m <= n`) and rank `r`, the memory required per layer is approximately:[^1]

| Component | AdamW | LoRA (rank r) | GaLore (rank r) |
|---|---|---|---|
| Weights | `mn` | `mn + mr + nr` | `mn` |
| Optimizer states | `2mn` | `2mr + 2nr` | `mr + 2nr` |
| Total | `3mn` | `mn + 3mr + 3nr` | `mn + mr + 2nr` |

For a LLaMA-style 7B model, the paper reports that switching from BF16 AdamW to GaLore reduces optimizer state memory by up to 65.5 percent.[^1] Stacking 8-bit quantization of `M` and `V` on top (yielding the `galore_adamw_8bit` configuration) reduces optimizer state memory by up to 82.5 percent and total training memory by 63.3 percent relative to a BF16 AdamW baseline.[^1][^5]
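
The per-layer formulas in the table can be evaluated directly. The sketch below counts elements (not bytes) for one square `4096 x 4096` layer at rank 1024; the function name is hypothetical, and a single-layer count is not expected to reproduce the model-wide percentages quoted above, which cover every layer of the network.

```python
def per_layer_elements(m, n, r):
    """Element counts per layer, following the memory table (assumes m <= n)."""
    return {
        "AdamW": {"weights": m * n, "optimizer": 2 * m * n},
        "LoRA": {"weights": m * n + m * r + n * r, "optimizer": 2 * m * r + 2 * n * r},
        "GaLore": {"weights": m * n, "optimizer": m * r + 2 * n * r},
    }


counts = per_layer_elements(m=4096, n=4096, r=1024)
for method, c in counts.items():
    total = c["weights"] + c["optimizer"]
    print(f"{method:7s} weights={c['weights']:>11,} optimizer={c['optimizer']:>11,} total={total:>11,}")

# Fraction of the AdamW optimizer state that GaLore keeps for this one layer
ratio = counts["GaLore"]["optimizer"] / counts["AdamW"]["optimizer"]
print(f"GaLore / AdamW optimizer state: {ratio:.3f}")
```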

### Per-layer weight updates

A second optimization, often listed as the `_layerwise` variant in implementations, eliminates the need to ever hold the full gradient tensor of the model in memory. Normally [backpropagation](/wiki/backpropagation) produces gradients for every weight before the optimizer step runs at the end of the backward pass. GaLore's per-layer mode hooks into the backward graph and runs the projection and the AdamW update for each layer immediately after its gradient is computed, freeing that layer's full-shape gradient before the next layer's gradient materializes.[^5][^9] The original paper credits per-layer updates with saving roughly 13.5 GB on a LLaMA 7B pretraining run, which is what closes the gap between the 8-bit AdamW state footprint and the 24 GB capacity of an RTX 4090.[^1] The trade-off is that this mode is incompatible with [gradient accumulation](/wiki/gradient_accumulation) in its naive form, because accumulation requires retaining the gradient across micro-batches.[^5]
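
The effect on gradient memory can be illustrated with a toy accounting model. The sketch below is not how any framework implements layerwise updates; it only tracks how many gradient elements are alive at once when layers are visited in backward order, with made-up layer sizes.

```python
def peak_gradient_elements(layer_sizes, per_layer):
    """Peak number of gradient elements alive at once during one backward pass."""
    live = 0
    peak = 0
    for size in reversed(layer_sizes):   # backward visits the last layer first
        live += size                     # this layer's gradient materializes
        peak = max(peak, live)
        if per_layer:
            live -= size                 # update applied immediately, gradient freed
    return peak


layer_sizes = [100, 200, 300]            # hypothetical gradient sizes per layer
standard = peak_gradient_elements(layer_sizes, per_layer=False)
layerwise = peak_gradient_elements(layer_sizes, per_layer=True)
print(standard, layerwise)               # 600 300
```

With per-layer updates the peak is bounded by the largest single layer rather than the sum over all layers. In PyTorch this kind of immediate update is commonly wired up with per-parameter hooks such as `Tensor.register_post_accumulate_grad_hook`; that wiring is omitted here.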

### Algorithm 1 in pseudocode

```
Inputs: weight W in R^(m x n), rank r, update interval T,
        scale alpha, learning rate eta, Adam constants beta1, beta2, eps
Initialize M_0 = 0, V_0 = 0, both in R^(r x n)
Initialize P_0 from SVD of grad(W) at step 0
for t = 1, 2, ... do
    G_t = backward(W_t)                       # full grad, BF16
    if t mod T == 0 then
        U, _, _ = svd(G_t)
        P_t = U[:, :r]                        # refresh the subspace
    else
        P_t = P_{t-1}                         # reuse the previous subspace
    end if
    R_t   = P_t^T @ G_t                       # (r, n), BF16
    M_t   = beta1 * M_{t-1} + (1 - beta1) * R_t
    V_t   = beta2 * V_{t-1} + (1 - beta2) * R_t * R_t
    M_hat = M_t / (1 - beta1^t)               # Adam bias correction
    V_hat = V_t / (1 - beta2^t)
    N_t   = M_hat / (sqrt(V_hat) + eps)       # Adam normalized update
    G_til = alpha * P_t @ N_t                 # back to (m, n)
    W_{t+1} = W_t - eta * G_til
end for
```

Hyperparameters used in the paper for LLaMA pretraining include `r = 1024` for the 7B model (matrix dimension 4096) and `r = 512` for the 1B model (dimension 2048), `update_proj_gap` between 200 and 500 steps, and `scale = 0.25`.[^1][^9]
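
The pseudocode can be turned into a small runnable NumPy sketch for a single weight matrix. This is an illustrative toy, not the `galore-torch` implementation: the class name `GaLoreAdamSketch` is invented, everything runs in float64, weight decay is omitted, and the test problem (pulling `W` toward a rank-2 target) is chosen so that a rank-2 projection can capture the whole gradient.

```python
import numpy as np


class GaLoreAdamSketch:
    """Toy GaLore + Adam for one (m, n) matrix with m <= n (left projection)."""

    def __init__(self, shape, rank, update_proj_gap=200, scale=0.25,
                 lr=1e-2, betas=(0.9, 0.999), eps=1e-8):
        m, n = shape
        self.rank, self.gap, self.scale = rank, update_proj_gap, scale
        self.lr, self.betas, self.eps = lr, betas, eps
        self.M = np.zeros((rank, n))   # first moment, stored at projected shape
        self.V = np.zeros((rank, n))   # second moment, stored at projected shape
        self.P = None                  # (m, rank) projection matrix
        self.t = 0

    def step(self, W, G):
        # Refresh the projection every `gap` steps from the current gradient.
        if self.P is None or self.t % self.gap == 0:
            U, _, _ = np.linalg.svd(G, full_matrices=False)
            self.P = U[:, :self.rank]
        self.t += 1
        b1, b2 = self.betas
        R = self.P.T @ G                                   # project: (rank, n)
        self.M = b1 * self.M + (1 - b1) * R
        self.V = b2 * self.V + (1 - b2) * R * R
        m_hat = self.M / (1 - b1 ** self.t)                # bias correction
        v_hat = self.V / (1 - b2 ** self.t)
        N = m_hat / (np.sqrt(v_hat) + self.eps)            # normalized update
        return W - self.lr * self.scale * (self.P @ N)     # project back, apply


rng = np.random.default_rng(0)
m, n, r = 8, 16, 2
W_target = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
W = np.zeros((m, n))
opt = GaLoreAdamSketch((m, n), rank=r, update_proj_gap=100, lr=0.2)


def loss(W):
    return 0.5 * np.sum((W - W_target) ** 2)


initial_loss = loss(W)
for _ in range(500):
    G = W - W_target          # gradient of the quadratic loss
    W = opt.step(W, G)
final_loss = loss(W)
print(initial_loss, final_loss)
```

Note that `M` and `V` are carried over unchanged when `P` is refreshed, as in the pseudocode, even though the new basis may be rotated relative to the old one.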

### Hyperparameter sensitivity

The paper conducts ablations over each of the three GaLore-specific hyperparameters. The rank `r` exhibits monotone but saturating effects: doubling `r` improves perplexity but with diminishing returns past roughly `r = d / 4`, where `d` is the matrix dimension. The authors recommend `r = d / 4` as a starting point for a balanced memory and accuracy trade.[^1] The update interval `update_proj_gap` shows a U-shaped sensitivity curve: very small values (under 50 steps) are wasteful because the subspace cannot drift meaningfully between refreshes, while very large values (over 1000 steps) cause the subspace to lag behind the true gradient distribution and degrade convergence.[^1] The scale `alpha` is set to a constant `0.25` in the pretraining experiments and to a rank-dependent value such as `4.0 / r` for fine-tuning, where smaller-rank projections benefit from a slightly larger effective step size; this differs from LoRA's standard `alpha / r` scaling rule.[^1][^9]

### Compatibility with mixed precision

In practice GaLore is run with the model weights, gradients, and projection matrix in BF16, while the Adam optimizer state in the projected subspace is held in either BF16 (default) or 8-bit (via the `bitsandbytes` blockwise quantization scheme). The reference implementation casts the projection matrix `P` to the dtype of the gradient at multiplication time so that the matrix multiplication kernel can dispatch to tensor cores without an explicit upcast.[^3] The SVD itself is computed in FP32 internally because `torch.linalg.svd` does not support BF16 inputs reliably; in `galore-torch` this is handled by an explicit `.float()` cast at SVD time and a downcast immediately afterward.[^3]

## Results

### LLaMA pretraining on C4

The original paper pretrains LLaMA-style decoders ranging from 60M to 7B parameters on the [C4 (Colossal Clean Crawled Corpus)](/wiki/c4_dataset) dataset and reports validation perplexity (lower is better). The headline numbers from Table 2 of the paper are:[^1][^9]

| Model | Full-rank AdamW | GaLore | LoRA | ReLoRA |
|---|---|---|---|---|
| 60M  | 34.06 | 34.88 | 34.99 | 37.04 |
| 130M | 25.08 | 25.36 | 33.92 | 29.37 |
| 350M | 18.80 | 18.95 | 25.58 | 29.08 |
| 1B   | 15.56 | 15.64 | 19.21 | 18.33 |
| 7B   | (not run by authors) | 14.65 | (not run) | (not run) |

GaLore tracks full-rank AdamW closely at every scale (the gap is 0.08 perplexity at 1B) while LoRA and ReLoRA fall meaningfully behind from 130M parameters upward.[^1] The 7B run consumed roughly 19.7 billion tokens and was the first reported from-scratch LLaMA-7B pretraining run to fit on a single 24 GB GPU.[^1]

### RoBERTa GLUE fine-tuning

For [RoBERTa](/wiki/roberta)-base fine-tuning on the [GLUE benchmark](/wiki/glue_benchmark), the paper reports that GaLore comes close to full fine-tuning across MNLI, QQP, SST-2, CoLA, QNLI, MRPC, RTE, and STS-B at a fraction of the optimizer-state memory; at rank 4 the GaLore average score is 85.89 versus 85.61 for LoRA, against 86.28 for full fine-tuning.[^1]

### Memory and hardware results

The combined effect of low-rank optimizer state, 8-bit AdamW quantization (built on `bitsandbytes`), and per-layer weight updates brings the pretraining footprint of a LLaMA 7B model to approximately 22.0 GB, fitting within a single consumer-grade [NVIDIA RTX 4090](/wiki/gpu) without sharding or CPU offloading.[^1] This was the principal demonstration that gave the paper its widespread attention.[^5]

## Implementations and adoption

### galore-torch

The reference implementation is published as the `galore-torch` package on PyPI and as the GitHub repository `jiaweizzhao/GaLore` (Apache-2.0).[^3] It exposes three optimizer classes that mimic PyTorch's standard optimizer interface: `GaLoreAdamW`, `GaLoreAdamW8bit`, and `GaLoreAdafactor`.[^3] Each accepts the GaLore-specific keyword arguments `rank`, `update_proj_gap`, `scale`, and `proj_type`, applied to a per-parameter `param_groups` entry.[^3] The 8-bit variant wraps the bitsandbytes `AdamW8bit` kernel; the Adafactor variant builds on the [Hugging Face Transformers](/wiki/transformers_library) Adafactor implementation.[^3]
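
The per-group keyword arguments suggest a common pattern: put the 2-D weights of selected modules into one parameter group carrying the GaLore settings, and leave everything else in a plain group. The sketch below shows only that grouping logic with stand-in parameter records, so it runs without PyTorch. `ParamInfo`, `build_param_groups`, the name-matching rule, and the `"std"` value for `proj_type` are assumptions for illustration; with real tensors the resulting list would be passed to one of the optimizer classes named above.

```python
from dataclasses import dataclass


@dataclass
class ParamInfo:
    """Stand-in for a named parameter (a real setup would use tensors)."""
    name: str
    shape: tuple


def build_param_groups(params, target_keys=("attn", "mlp"), rank=128,
                       update_proj_gap=200, scale=0.25, proj_type="std"):
    """Split parameters into a plain group and a GaLore group."""
    galore, regular = [], []
    for p in params:
        is_matrix = len(p.shape) == 2                       # only 2-D weights are projected
        is_target = any(key in p.name for key in target_keys)
        (galore if is_matrix and is_target else regular).append(p)
    return [
        {"params": regular},
        {"params": galore, "rank": rank, "update_proj_gap": update_proj_gap,
         "scale": scale, "proj_type": proj_type},
    ]


params = [
    ParamInfo("layers.0.attn.q_proj.weight", (4096, 4096)),
    ParamInfo("layers.0.mlp.up_proj.weight", (11008, 4096)),
    ParamInfo("layers.0.input_norm.weight", (4096,)),
    ParamInfo("embed_tokens.weight", (32000, 4096)),
]
groups = build_param_groups(params)
print([p.name for p in groups[1]["params"]])
```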

### Hugging Face Transformers and TRL

PR 29588, authored by Younes Belkada, merged on 19 March 2024 and exposed GaLore as a first-class option in the `Trainer` API through three string identifiers: `optim="galore_adamw"`, `"galore_adamw_8bit"`, and `"galore_adafactor"`.[^4] These shipped publicly in the v4.39.0 release.[^5] The integration adds two new `TrainingArguments` fields, `optim_target_modules` for selecting which submodules to attach GaLore to (accepting a list, a regex, or a fully qualified module path) and `optim_args` for passing the GaLore hyperparameters as a comma-separated string.[^5] The Hugging Face blog post demonstrates the typical use:[^5]

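```python
from transformers import TrainingArguments

# The GaLore optimizers require the galore-torch package to be installed.
args = TrainingArguments(
    output_dir="./galore-run",
    optim="galore_adamw",                  # or "galore_adamw_8bit", "galore_adafactor"
    optim_target_modules=["attn", "mlp"],  # submodules whose weights use GaLore
    optim_args="rank=128, update_proj_gap=200, scale=0.25",  # GaLore hyperparameters
    per_device_train_batch_size=2,
    max_steps=100,
)
```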

Layer-wise variants are selected by appending `_layerwise` to the optimizer name, for example `"galore_adamw_layerwise"`.[^5] In that mode the framework disables gradient accumulation and runs the projection plus weight update inside backward hooks; the documentation explicitly warns that throughput is lower than the non-layerwise variant in exchange for the further memory saving.[^5]

The same `optim` keys are accepted by the [SFT](/wiki/supervised_fine-tuning) trainer in `trl.SFTTrainer` via `SFTConfig`, with identical semantics. The blog's worked example trains [Mistral-7B](/wiki/mistral_7b) on IMDB with `optim="galore_adamw"` and target modules `["attn", "mlp"]`.[^5]

### LLaMA-Factory

The LLaMA-Factory project, a popular YAML-driven fine-tuning framework that supports more than 100 base models, lists GaLore among its supported training options alongside LoRA, QLoRA, freeze-tuning, and 32-bit full fine-tuning.[^10] Setting `optim: galore_adamw_8bit` in the LLaMA-Factory training YAML invokes the same Hugging Face `Trainer` integration described above.[^10]

### Axolotl

[Axolotl](/wiki/axolotl), another widely used training framework, also supports GaLore through the underlying Transformers integration.[^11] The Axolotl issue tracker has documented out-of-memory edge cases when the user requests `galore_adamw_8bit` together with full gradient accumulation, reflecting the layerwise/accumulation incompatibility noted by the upstream maintainers.[^11]

### Other ecosystems

The Hugging Face PEFT and accelerate teams collaborated on enabling layerwise variants under their respective abstractions, and the Graphcore Research group reproduced the C4 results on IPU hardware shortly after publication.[^5]

## Variants and follow-ups

### Q-GaLore (July 2024)

Q-GaLore by Zhenyu Zhang, Ajay Jaiswal, Lu Yin, Shiwei Liu, Jiawei Zhao, Yuandong Tian, and Zhangyang Wang (arXiv:2407.08296) extends GaLore with two ideas: keep the projection matrices `P` themselves in INT4 instead of BF16, and keep the underlying weights in INT8, with stochastic rounding used to preserve accumulated gradient information across quantization boundaries.[^7] The authors also introduce a layer-adaptive subspace refresh schedule: layers whose gradient subspace has converged are updated less often, reducing the average SVD cost.[^7] The headline claim is that Q-GaLore can pretrain a LLaMA-7B from scratch on a single NVIDIA RTX 4060 Ti with only 16 GB of memory, and that for fine-tuning it reduces memory by up to 50 percent relative to LoRA and GaLore while outperforming QLoRA at matched memory.[^7] A reference implementation is available at `github.com/VITA-Group/Q-GaLore`.[^7]

### GaLore 2 (April 2025)

GaLore 2 (arXiv:2504.20437) by DiJia Su, Andrew Gu, Jane Xu, Yuandong Tian, and Jiawei Zhao addresses the principal complaint about GaLore in production settings: SVD is expensive, and a full SVD of each 4096 x 4096 gradient matrix is slow enough on GPU to dominate wall-clock time at large model sizes. GaLore 2 swaps the exact truncated SVD for a randomized SVD construction, restructures the projection step to be compatible with [FSDP](/wiki/fsdp) sharding, and demonstrates a from-scratch pretraining run of LLaMA 7B with up to 500 billion training tokens, which is roughly 25 times the token budget used in the original paper.[^8]
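
For readers unfamiliar with randomized SVD, the sketch below shows the generic range-finder construction: sample the column space of the gradient with a random matrix, orthonormalize the sample, and run a small SVD in the reduced space. It illustrates the general technique only and is not taken from the GaLore 2 code; the function name, oversampling amount, and number of power iterations are assumptions.

```python
import numpy as np


def randomized_left_basis(G, r, oversample=8, n_power_iter=1, seed=0):
    """Approximate the top-r left singular vectors of G without a full SVD."""
    rng = np.random.default_rng(seed)
    m, n = G.shape
    k = min(r + oversample, m, n)
    Y = G @ rng.standard_normal((n, k))      # (m, k) random sample of range(G)
    for _ in range(n_power_iter):
        Y = G @ (G.T @ Y)                    # power iteration sharpens the spectrum
    Q, _ = np.linalg.qr(Y)                   # orthonormal basis of the sample
    B = Q.T @ G                              # small (k, n) matrix
    U_small, _, _ = np.linalg.svd(B, full_matrices=False)
    return (Q @ U_small)[:, :r]              # lift back to (m, r)


rng = np.random.default_rng(1)
m, n, r = 64, 96, 4
G = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))   # exactly rank r

P = randomized_left_basis(G, r)
reconstruction_error = np.linalg.norm(G - P @ (P.T @ G)) / np.linalg.norm(G)
print(P.shape, reconstruction_error)
```

The expensive decomposition is applied only to the `k x n` matrix `B`, so the cost scales with the target rank rather than with the full matrix dimension.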

### GoLore and related variants

Several smaller variants have been published in the wider literature. GoLore (PKU) replaces the SVD-derived `P` with a random orthogonal projection sampled uniformly on the Stiefel manifold, providing convergence guarantees that the deterministic SVD-based construction lacks in late training stages, and showing improved late-stage perplexity on LLaMA pretraining.[^12] "Online Subspace Descent" generalizes GaLore into a continuous subspace tracking scheme that updates `P` at every step via a low-cost stochastic rule, narrowing the gap to full-rank baselines on C4 pretraining.[^13] GALE proposes replacing the SVD with a fast randomized QR decomposition for a reported 23x speedup on the subspace update step.[^14]
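
A random orthonormal projection of the kind described for GoLore can be drawn by orthonormalizing a Gaussian matrix, which is a standard way to sample from the Stiefel manifold. The sketch below shows that sampling step only; it is not GoLore's reference code, and the function name is hypothetical.

```python
import numpy as np


def random_orthonormal_projection(m, r, rng):
    """Sample an (m, r) matrix with orthonormal columns from a Gaussian via QR."""
    A = rng.standard_normal((m, r))
    Q, R = np.linalg.qr(A)
    # Fix column signs using the diagonal of R so the sample is uniformly distributed.
    return Q * np.sign(np.diag(R))


rng = np.random.default_rng(0)
P = random_orthonormal_projection(16, 4, rng)
G = rng.standard_normal((16, 32))
R_low = P.T @ G                       # projected gradient, no SVD of G required
print(P.shape, R_low.shape)
```

Unlike the SVD-based construction, this projection does not depend on the gradient at all, so refreshing it costs one small QR factorization.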

## Limitations

The principal limitations of vanilla GaLore are well documented in the literature.

First, the SVD step is computationally expensive. On a LLaMA 7B model, one round of truncated SVDs over all weight gradients can take on the order of ten to twenty minutes on a consumer GPU, which dominates wall-clock time when `update_proj_gap` is small.[^14][^15] This is what motivates both Q-GaLore's layer-adaptive refresh schedule and GaLore 2's randomized SVD.[^7][^8]

Second, the per-layer weight-update trick that delivers the 24 GB pretraining demonstration is fundamentally incompatible with multi-step gradient accumulation, because the gradient is freed immediately after the layer's weights have been updated.[^5] In practice this caps effective batch size at what fits in one micro-batch, which is small on consumer hardware and can hurt convergence stability.[^11]

Third, GaLore is more memory-efficient than LoRA only on the optimizer state and not on the weights themselves; full BF16 weight storage is still required, so for very large models GaLore does not entirely eliminate the need for sharding.[^1][^8]

Fourth, the theoretical convergence guarantees in the original paper rest on the assumption that gradients have a stable low-rank structure. Follow-up work has shown that this assumption can break down in late training stages, where the dominant singular directions of the gradient can shift faster than the `update_proj_gap` allows, leading to degraded convergence; GoLore's random-projection alternative was proposed in part to address this case.[^12]

Fifth, the SVD-based projection is non-trivial to combine with FSDP-style sharded training because the SVD must be computed on the full unsharded gradient, which adds communication overhead in distributed settings. The GaLore 2 paper explicitly cites this as a motivation for its restructuring.[^8]

## Comparison

| Property | Full AdamW | LoRA / QLoRA | GaLore | Q-GaLore |
|---|---|---|---|---|
| Trains all weights | Yes | No (adapter only) | Yes | Yes |
| Optimizer state size per layer | `2mn` | `2(m+n)r` | `(m+2n)r` | INT4-quantized projection plus low-rank Adam state |
| Extra projection matrices stored | None | `BA` adapters (`mr + nr`) | `P` (`mr`) | `P` in INT4 |
| Weight dtype | BF16/FP16 | BF16 weights + adapter | BF16 | INT8 |
| Demonstrated single-GPU 7B pretraining | No | No | Yes (24 GB) | Yes (16 GB) |
| SVD overhead | None | None | Significant | Reduced via adaptive schedule |

The cleanest one-line distinction between GaLore and LoRA: LoRA decomposes the weight delta as `Delta W = B A`, freezes `W`, and trains `B, A`. GaLore decomposes the gradient as `G approx P (P^T G)`, trains `W` itself, and stores Adam moments only of the smaller `P^T G`. The two methods can be stacked (GaLore-LoRA-style hybrids exist in the literature) but they are conceptually independent.[^1]

## Significance

GaLore changed the practical conversation around LLM pretraining hardware. Before its publication the consensus position was that pretraining a 7-billion-parameter model required at minimum a multi-GPU node with sharding or aggressive CPU/NVMe offloading.[^1] By demonstrating that a single 24 GB consumer GPU was sufficient when optimizer state was compressed in the right way, GaLore widened the population of researchers and small organizations who could attempt from-scratch pretraining experiments at the 1B to 7B scale.[^5][^16]

Within the Hugging Face ecosystem the GaLore integration was the first time the Transformers library shipped a non-trivial new optimizer behind a string identifier in `TrainingArguments`, setting a template that was reused for APOLLO and other subsequent memory-efficient optimizers.[^17] The `optim_target_modules` abstraction introduced by PR 29588 is now part of the standard configuration surface for swapping in low-rank methods across PEFT-style and full-parameter training paths.[^4][^5]

In a broader research framing, GaLore validated the empirical claim, central to a growing line of work, that the gradient of an over-parameterized neural network has approximate low-rank structure that can be exploited at training time, not merely for compression of trained models.[^1] This idea now underpins several other memory-efficient training proposals, including online subspace descent and gradient wavelet transforms.[^13]

## See also

- [LoRA (Low-Rank Adaptation)](/wiki/lora)
- [QLoRA](/wiki/qlora)
- [Low-rank adaptation](/wiki/low-rank_adaptation)
- [PEFT](/wiki/peft)
- [AdamW](/wiki/adamw)
- [Adam optimizer](/wiki/adam_optimizer)
- [Singular value decomposition](/wiki/singular_value_decomposition)
- [Quantization](/wiki/quantization)
- [Fully Sharded Data Parallel (FSDP)](/wiki/fsdp)
- [Hugging Face Transformers](/wiki/transformers_library)
- [Axolotl](/wiki/axolotl)
- [Unsloth](/wiki/unsloth)
- [DeepSpeed](/wiki/deepspeed)
- [C4 (Colossal Clean Crawled Corpus)](/wiki/c4_dataset)
- [RoBERTa](/wiki/roberta)
- [GLUE benchmark](/wiki/glue_benchmark)
- [Mistral 7B](/wiki/mistral_7b)
- [LLaMA](/wiki/llama)
- [ICML](/wiki/icml)

## References

[^1]: Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, Yuandong Tian, "GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection", arXiv, 2024-03-06. https://arxiv.org/abs/2403.03507. Accessed 2026-05-20.
[^2]: Jiawei Zhao et al., "GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection (v2 revision)", arXiv:2403.03507v2, 2024-06-02. https://arxiv.org/abs/2403.03507v2. Accessed 2026-05-20.
[^3]: Jiawei Zhao, "GaLore reference implementation (galore-torch)", GitHub repository jiaweizzhao/GaLore, Apache-2.0 license, 2024-03-06. https://github.com/jiaweizzhao/GaLore. Accessed 2026-05-20.
[^4]: Younes Belkada, "FEAT / Optim: Add GaLore optimizer", Hugging Face Transformers Pull Request 29588, merged 2024-03-19. https://github.com/huggingface/transformers/pull/29588. Accessed 2026-05-20.
[^5]: Titus von Koeller, Jiawei Zhao, Matthew Douglas, Yaowei Zheng, Younes Belkada, Zachary Mueller, Amy Roberts, Sourab Mangrulkar, Benjamin Bossan, "GaLore: Advancing Large Model Training on Consumer-grade Hardware", Hugging Face Blog, 2024-03-20. https://huggingface.co/blog/galore. Accessed 2026-05-20.
[^6]: ICML 2024 Program Committee, "GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection (Oral, Track 6B Low Rank Learning)", ICML 2024, 2024-07-25. https://icml.cc/virtual/2024/oral/35485. Accessed 2026-05-20.
[^7]: Zhenyu Zhang, Ajay Jaiswal, Lu Yin, Shiwei Liu, Jiawei Zhao, Yuandong Tian, Zhangyang Wang, "Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients", arXiv:2407.08296, 2024-07-11. https://arxiv.org/abs/2407.08296. Accessed 2026-05-20.
[^8]: DiJia Su, Andrew Gu, Jane Xu, Yuandong Tian, Jiawei Zhao, "GaLore 2: Large-Scale LLM Pre-Training by Gradient Low-Rank Projection", arXiv:2504.20437, 2025-04-29. https://arxiv.org/abs/2504.20437. Accessed 2026-05-20.
[^9]: Jiawei Zhao et al., "GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection (HTML rendering)", arXiv HTML view, 2024. https://arxiv.org/html/2403.03507v1. Accessed 2026-05-20.
[^10]: Yaowei Zheng et al., "LLaMA-Factory project documentation: training configuration and supported optimizers", LLaMA-Factory project (covered in third-party guide by Superteams.ai), 2024. https://www.superteams.ai/blog/a-definitive-guide-to-fine-tuning-llms-using-axolotl-and-llama-factory. Accessed 2026-05-20.
[^11]: OpenAccess-AI-Collective contributors, "OOM On Galore Axolotl (issue 1448)", Axolotl GitHub issue tracker, 2024-04. https://github.com/OpenAccess-AI-Collective/axolotl/issues/1448. Accessed 2026-05-20.
[^12]: He et al., "GoLore: Random projection alternative to GaLore on the Stiefel manifold", GitHub repository pkumelon/Golore, 2024. https://github.com/pkumelon/Golore. Accessed 2026-05-20.
[^13]: Liang et al., "Online Subspace Descent for Memory-Efficient LLM Training", arXiv preprint discussed in survey of memory-efficient optimizers, 2024. https://arxiv.org/html/2605.09176v1. Accessed 2026-05-20.
[^14]: GALE authors, "GALE: Gradient Activation Low-rank Extraction for Fast Memory Efficient Large Language Model Training", OpenReview, 2024. https://openreview.net/forum?id=D9Oq3c5iHn. Accessed 2026-05-20.
[^15]: SubTrack++ authors, "SubTrack++: Gradient Subspace Tracking for Scalable LLM Training (discusses SVD overhead in GaLore)", OpenReview, 2024. https://openreview.net/pdf?id=6geRIdlFWJ. Accessed 2026-05-20.
[^16]: Graphcore Research, "GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection (reproduction note)", Graphcore Research Blog, 2024. https://graphcore-research.github.io/galore/. Accessed 2026-05-20.
[^17]: Hanqing Zhu, "Optim: APOLLO optimizer integration (PR 36062, references GaLore template)", Hugging Face Transformers Pull Request 36062, 2024-12. https://github.com/huggingface/transformers/pull/36062. Accessed 2026-05-20.

