rsLoRA (Rank-Stabilized LoRA)

rsLoRA (rank-stabilized LoRA) is a one-line modification to the low-rank adaptation fine-tuning method for large neural networks. It replaces the conventional LoRA scaling factor $\gamma_r = \alpha/r$ with $\gamma_r = \alpha/\sqrt{r}$, where $r$ is the adapter rank and $\alpha$ is a hyperparameter. The change was proposed by Damjan Kalajdzievski in a December 2023 arXiv preprint, "A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA", which shows analytically that the original $1/r$ scaling causes adapter gradients to collapse as the rank is increased, so larger ranks fail to improve performance.[^1] Under the $1/\sqrt{r}$ scaling, both forward activations and backward gradients remain of order $\Theta(1)$ for any rank, restoring the expected compute-versus-performance trade-off and allowing practitioners to use higher LoRA ranks without saturation.[^1] rsLoRA is implemented in Hugging Face's PEFT library as a single boolean flag, use_rslora=True in LoraConfig, with no change to inference cost.[^2]

Background

LoRA and its scaling factor

Standard LoRA, introduced by Hu et al. (2022), attaches a trainable rank-$r$ "adapter" matrix product $BA$ to selected linear sub-modules of a pre-trained transformer while keeping the base weights frozen.[^3] Concretely, a linear layer that originally computes $x_{\text{out}} = Wx_{\text{in}} + b$ is replaced by $$x_{\text{out}} = (W + \gamma_r BA)x_{\text{in}} + b,$$ with $A \in \mathbb{R}^{r \times d_1}$, $B \in \mathbb{R}^{d_2 \times r}$, and a positive scaling factor $\gamma_r \in \mathbb{R}^{+}$ that depends on the rank.[^1] At initialization $B = 0$ and $A$ has independent zero-mean entries with variance independent of $r$, so the adapter term $\gamma_r BA$ is zero, the adapted layer starts out identical to the pre-trained one, and the update is learned gradually during fine-tuning.[^1]

In the original LoRA implementation and most popular libraries, the scaling factor is set to $\gamma_r = \alpha/r$ for a hyperparameter $\alpha$.[^1][^2] After fine-tuning, the sum $W + \gamma_r BA$ can be folded back into a single weight matrix, so inference cost matches that of the unmodified base model regardless of $r$.[^1]
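The adapted forward pass and the merge step can be illustrated with a small dependency-free sketch. The helper names (`matvec`, `lora_forward`, `merge_lora`, `random_matrix`) and the toy dimensions are illustrative assumptions, not part of any library, and $B$ is given random values here only so that the adapter term is nonzero.

```python
import random


def matvec(M, v):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def lora_forward(W, b, B, A, x, alpha, r):
    """Compute (W + gamma_r * B A) x + b with the conventional gamma_r = alpha / r."""
    gamma = alpha / r
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))  # B (A x): the d2 x d1 product is never formed
    return [y + gamma * z + b_i for y, z, b_i in zip(base, low_rank, b)]


def merge_lora(W, B, A, alpha, r):
    """Fold the adapter into a single matrix W + gamma_r * B A."""
    gamma = alpha / r
    d2, d1 = len(W), len(W[0])
    return [
        [W[i][j] + gamma * sum(B[i][k] * A[k][j] for k in range(r)) for j in range(d1)]
        for i in range(d2)
    ]


def random_matrix(rows, cols, rng):
    """Matrix with i.i.d. standard normal entries (toy data)."""
    return [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d1, d2, r, alpha = 6, 5, 2, 4.0
W = random_matrix(d2, d1, rng)
A = random_matrix(r, d1, rng)   # A has shape r x d1
B = random_matrix(d2, r, rng)   # B has shape d2 x r (it is zero at initialization in LoRA)
b = [0.1] * d2
x = [rng.gauss(0, 1) for _ in range(d1)]

adapted = lora_forward(W, b, B, A, x, alpha, r)
merged = [y + b_i for y, b_i in zip(matvec(merge_lora(W, B, A, alpha, r), x), b)]
max_gap = max(abs(u - v) for u, v in zip(adapted, merged))
print(max_gap)  # difference is at floating-point rounding level
```

The two paths agree up to rounding, which is why a merged model has the same inference cost as the base model whatever the rank.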

The $\alpha/r$ convention has a long-standing folk justification: as $r$ grows, the product $BA$ is a sum of $r$ rank-one terms, so dividing by $r$ keeps the average term size constant. Kalajdzievski's paper shows that this intuition is wrong, because it ignores how the variance of the matrix product scales when $B$ and $A$ are jointly trained from a degenerate initialization.[^1]

The rank-collapse problem

Practitioners working with LoRA had repeatedly observed an empirical puzzle: increasing the adapter rank $r$ beyond small values (typically $r = 4, 8, 16$) brought essentially no improvement in fine-tuning quality, even though the trainable parameter count grew linearly with $r$.[^1] The LoRA paper itself reported that "very low ranks suffice" and provided ablations showing that rank 64 fine-tuning performed no better than rank 4 (Table 6 of Hu et al.).[^3] This observation was widely interpreted as evidence that the fine-tuning manifold for large language models has very low intrinsic dimension, an idea consistent with earlier results by Aghajanyan, Zettlemoyer, and Gupta on intrinsic dimensionality of fine-tuning.[^1]

The rsLoRA analysis shows that this interpretation conflates two distinct phenomena. While the intrinsic-dimensionality hypothesis may still hold to some degree, a separate effect dominates the empirical picture: with $\gamma_r = \alpha/r$, the gradient signal flowing through each adapter collapses as $r$ increases, so the network simply cannot exploit the extra parameters even when they would be useful.[^1] The conclusion that "low ranks suffice" is therefore an artifact of overly aggressive scaling, not a fundamental limit of the LoRA family.[^1]

Scaling-initialization-update schemes

The mathematical setup of rsLoRA is borrowed from infinite-width neural network theory and traces back to the analysis of "abc-parametrizations" by Yang and Hu, the same line of work that produced maximal update parametrization for pretraining.[^1] In that framework, a network with hidden dimension $d$ is described by per-layer exponents $a_l$ and $b_l$ and a global exponent $c$: weights are multiplied by $1/d^{a_l}$, initialized with standard deviation $1/d^{b_l}$ (variance $1/d^{2b_l}$), and updated with effective learning rate $\eta/d^{c}$. Yang and Hu showed that only certain choices of these exponents yield stable, non-collapsing learning in the limit $d \to \infty$, and they used this to recommend specific scaling choices for pretraining wide networks.

Kalajdzievski applies an analogous learning-trajectory analysis to LoRA's rank parameter $r$, treating $r$ as the analog of the infinite hidden dimension.[^1] Because LoRA already fixes the initialization scheme ($B = 0$, $A$ with rank-independent variance), the only free degree of freedom is the scaling factor $\gamma_r$. The paper then asks: which $\gamma_r$ keeps activations and gradients well-behaved as $r$ grows?

Technical details

Definition of rank-stability

The paper introduces a formal definition that captures what "well-behaved" should mean for an adapter:

An adapter $\gamma_r BA$ is rank-stabilized if (1) when inputs are i.i.d. with $m$-th moment $\Theta_r(1)$ in each entry, the outputs of the adapter also have $m$-th moment $\Theta_r(1)$ in each entry; and (2) when loss gradients with respect to adapter outputs are $\Theta_r(1)$ in each entry, the loss gradients with respect to adapter inputs are also $\Theta_r(1)$ in each entry.[^1]

Here $\Theta_r(\cdot)$ is the standard big-Theta asymptotic notation in the rank $r$, meaning the quantity is bounded above and below by constants times the argument as $r \to \infty$. The two conditions together require that adapters neither inflate nor deflate signal magnitudes in the forward pass, and neither inflate nor deflate gradient magnitudes in the backward pass, regardless of rank.[^1]

Rank-stability is the right notion to demand because adapters in a deep network compose. If the first adapter inflates activations by a factor of $\sqrt{r}$, the second adapter receives inputs that are themselves $r$-dependent, and the cascade can multiply through many layers; the same logic applies in reverse for gradients. Demanding $\Theta_r(1)$ behavior at every adapter is what allows the paper to chain stability through arbitrary depth using a simple inductive argument over the forward and backward passes.[^1]

Main theorem

The central result is Theorem 3.2:

Consider LoRA adapters of the form $\gamma_r BA$, with $A \in \mathbb{R}^{r \times d_1}$, $B \in \mathbb{R}^{d_2 \times r}$, $B$ initialized to zero, entries of $A$ i.i.d. with mean zero and variance $\sigma_A$ not depending on $r$, and $\gamma_r \to 0$ as $r \to \infty$. In expectation over initialization, all adapters are rank-stabilized if and only if $\gamma_r \in \Theta_r(1/\sqrt{r})$. Furthermore, this holds at any point in the learning trajectory, and unless $\gamma_r \in \Theta_r(1/\sqrt{r})$ there is unstable or collapsing learning for sufficiently large values of $r$.[^1]

The "if and only if" is essential: any scaling that decays asymptotically more slowly than $1/\sqrt{r}$ produces blowup, any scaling that decays faster produces collapse, and only $\Theta(1/\sqrt{r})$ satisfies both conditions simultaneously. The conventional choice $\alpha/r$ therefore decays too quickly, by a factor of $\sqrt{r}$.[^1]

The proof in Appendix A computes the variance of $\gamma_r BA x_{\text{in}}$ entry-by-entry, observes that this variance scales as $\gamma_r^2 \cdot r$ when $B$ and $A$ each have $\Theta(1)$ variance per entry, and concludes that $\gamma_r^2 r = \Theta(1)$ forces $\gamma_r = \Theta(1/\sqrt{r})$.[^1] A parallel argument for the backward pass confirms the same exponent for gradients. The induction step that propagates the result from one adapter to the next uses the fact that the model's input data does not depend on $r$, so stability can be carried forward layer by layer.[^1]
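The origin of the $\gamma_r^2 \cdot r$ factor can be made concrete with a short first-order calculation. What follows is a sketch under simplifying assumptions introduced here rather than a reproduction of the appendix: plain SGD with learning rate $\eta$, adapter inputs $x_k$ and output gradients $v_k$ at step $k$ treated as independent of the initial matrix $A_0$, and terms of higher order in $\gamma_r$ dropped (consistent with $\gamma_r \to 0$).

```latex
\begin{aligned}
\nabla_B \mathcal{L} &= \gamma_r\, v\, x^{\top} A^{\top},
\qquad
\nabla_A \mathcal{L} = \gamma_r\, B^{\top} v\, x^{\top}, \\
B_0 = 0 \;&\Longrightarrow\;
B_n \approx -\eta\, \gamma_r \sum_{k<n} v_k\, x_k^{\top} A_0^{\top},
\qquad
A_n \approx A_0, \\
\gamma_r B_n A_n\, x_{\text{in}} &\approx
-\eta\, \gamma_r^{2} \sum_{k<n} v_k\, x_k^{\top} A_0^{\top} A_0\, x_{\text{in}}, \\
\mathbb{E}\!\left[A_0^{\top} A_0\right] = r\, \sigma_A\, I
\;&\Longrightarrow\;
\mathbb{E}\!\left[\gamma_r B_n A_n\, x_{\text{in}}\right] \approx
-\gamma_r^{2}\, r\, \sigma_A\, \eta \sum_{k<n} v_k \left(x_k^{\top} x_{\text{in}}\right).
\end{aligned}
```

Under these assumptions the only rank dependence in the adapter output is the prefactor $\gamma_r^2 r$, which is $\Theta_r(1)$ exactly when $\gamma_r \in \Theta_r(1/\sqrt{r})$ and decays like $1/r$ for $\gamma_r = \alpha/r$.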

A useful intuition is to think of the matrix product $BA$ as a sum of $r$ rank-one outer products $\sum_{i=1}^{r} b_i a_i^{\top}$, where $b_i$ is the $i$-th column of $B$ and $a_i$ is the $i$-th row of $A$. Once training begins, the entries of $B$ and $A$ both become $\Theta(1)$ in magnitude, so each rank-one term contributes $\Theta(1)$ to the entry-wise variance of the sum, and the variance of the full sum is $\Theta(r)$. To bring the product back to $\Theta(1)$ variance one must divide by the standard deviation of the sum, which is $\sqrt{r}$, not by $r$.[^1] Dividing by $r$ is the normalization for an average rather than for a sum of roughly independent zero-mean terms, so it overcorrects by a factor of $\sqrt{r}$, suppressing the adapter's contribution faster than the $\sqrt{r}$ growth of the sum warrants, and the gradients inherit the same overcorrection. This is the central conceptual error in the original LoRA scaling that rsLoRA repairs.
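The rank-one-sum intuition can be checked numerically. The sketch below is an idealization, not the paper's experiment: it assumes the entries of $B$, $A$, and the input are independent standard normal draws, and the function names and sizes are chosen here for illustration.

```python
import math
import random
import statistics


def adapter_output(B, A, x, gamma):
    """Return gamma * B (A x) for list-of-rows matrices B (d x r) and A (r x d)."""
    h = [sum(a * x_j for a, x_j in zip(row, x)) for row in A]          # A x, length r
    return [gamma * sum(b * h_k for b, h_k in zip(row, h)) for row in B]


def output_std(r, gamma, d=8, trials=100, seed=0):
    """Empirical standard deviation of adapter output entries at rank r."""
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        A = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(r)]
        B = [[rng.gauss(0, 1) for _ in range(r)] for _ in range(d)]
        x = [rng.gauss(0, 1) for _ in range(d)]
        samples.extend(adapter_output(B, A, x, gamma))
    return statistics.pstdev(samples)


for r in (4, 64, 256):
    lora = output_std(r, 1.0 / r)              # conventional 1/r scaling
    rslora = output_std(r, 1.0 / math.sqrt(r)) # rank-stabilized 1/sqrt(r) scaling
    print(f"r={r:4d}  1/r: {lora:.3f}  1/sqrt(r): {rslora:.3f}")
```

With these assumptions the output spread under $1/\sqrt{r}$ stays roughly level across ranks, while under $1/r$ it shrinks by about $\sqrt{r}$.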

The corrected scaling factor

Setting the constant in front to a free hyperparameter, rsLoRA prescribes $$\gamma_r = \frac{\alpha}{\sqrt{r}}.$$ This is the only change. The architecture, initialization, optimizer, and inference path are all identical to standard LoRA.[^1]

Practitioners who want to enable rsLoRA while keeping a familiar $\alpha$ scale can equivalently leave $\gamma_r = \alpha'/r$ but set $\alpha' = \alpha\sqrt{r}$, since $\alpha\sqrt{r}/r = \alpha/\sqrt{r}$.[^4] The two formulations are arithmetically identical, but most libraries now expose rsLoRA through a boolean flag that handles the substitution internally.[^2]
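The equivalence between the two formulations is simple arithmetic, sketched below. The function names are hypothetical helpers written for this illustration and do not correspond to any library API.

```python
import math


def lora_scaling(alpha, r):
    """Conventional LoRA scaling factor alpha / r."""
    return alpha / r


def rslora_scaling(alpha, r):
    """rsLoRA scaling factor alpha / sqrt(r)."""
    return alpha / math.sqrt(r)


def equivalent_lora_alpha(alpha, r):
    """alpha' for which alpha' / r equals the rsLoRA factor alpha / sqrt(r)."""
    return alpha * math.sqrt(r)


alpha, r = 16, 256
alpha_prime = equivalent_lora_alpha(alpha, r)
print(alpha_prime)                   # 256.0
print(lora_scaling(alpha_prime, r))  # 1.0
print(rslora_scaling(alpha, r))      # 1.0
```

At a single fixed rank the two parametrizations describe the same adapter; the difference lies in how one value of $\alpha$ behaves as the rank changes.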

Theoretical caveats

Theorem 3.2 is asymptotic and applies in the limit of $r \to \infty$. The paper is careful to note that the result speaks to stability and collapse of learning, not to the quality of features that learning produces.[^1] If the features learned at different ranks are themselves rank-dependent, the assumption that gradients into the adapter are $\Theta_r(1)$ may not hold and the conclusion would need empirical validation.[^1] Conversely, if feature quality is genuinely rank-independent, there is no benefit to higher ranks regardless of the scaling factor. The practical implications of the theorem therefore require experimental confirmation, which the paper provides.

Comparison of scaling regimes

Regime	Scaling	Forward variance	Gradient norm	Behavior at large $r$
Naive (no scaling)	$\gamma_r = \alpha$	grows like $r$	grows like $\sqrt{r}$	activations and gradients blow up
Standard LoRA	$\gamma_r = \alpha/r$	shrinks like $1/r$	shrinks like $1/\sqrt{r}$	gradients collapse, no learning at high $r$
rsLoRA	$\gamma_r = \alpha/\sqrt{r}$	$\Theta(1)$	$\Theta(1)$	stable for all $r$

The middle row is the case originally implemented in LoRA, and the bottom row is the rsLoRA correction.[^1]
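The rates in the table can be reproduced with two simple expressions. This is a sketch under assumptions made here: the forward column is taken to follow $\gamma_r^2 \cdot r$ (the factor from the variance argument), and the gradient column is taken to follow $\gamma_r \sqrt{r}$, which is how the norm of the gradient with respect to $B$ scales at initialization when $A$ has rank-independent entry variance. The names `REGIMES`, `forward_factor`, and `grad_factor` are illustrative.

```python
import math

# Scaling factor gamma_r for each regime in the table.
REGIMES = {
    "naive": lambda alpha, r: alpha,
    "lora": lambda alpha, r: alpha / r,
    "rslora": lambda alpha, r: alpha / math.sqrt(r),
}


def forward_factor(regime, alpha, r):
    """gamma_r**2 * r: rank dependence of the forward column (assumed form)."""
    gamma = REGIMES[regime](alpha, r)
    return gamma ** 2 * r


def grad_factor(regime, alpha, r):
    """gamma_r * sqrt(r): rank dependence of the gradient column (assumed form)."""
    gamma = REGIMES[regime](alpha, r)
    return gamma * math.sqrt(r)


for r in (4, 32, 256, 2048):
    cells = ", ".join(
        f"{name}: fwd={forward_factor(name, 1.0, r):.4g} grad={grad_factor(name, 1.0, r):.4g}"
        for name in REGIMES
    )
    print(f"r={r}: {cells}")
```

Only the rsLoRA row stays constant in both columns as $r$ grows.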

Experimental validation

Setup

The paper validates the theorem on Llama 2 (7B), fine-tuned on 20,000 instruction-tuning examples from the OpenOrca dataset.[^1] The optimizer is AdamW with the Hugging Face default learning rate of $5 \times 10^{-5}$ on a constant schedule, and adapters are placed in all linear sub-modules of every transformer block (both attention projections and feed-forward MLP layers), since prior work showed that targeting all linear layers maximizes performance per parameter for LoRA.[^1] Ranks tested are $r \in \{4, 8, 32, 128, 512, 2048\}$, spanning more than two orders of magnitude.[^1]

Perplexity versus rank

In Figure 2 of the paper, fine-tuning perplexity is plotted against training step for each rank, for both LoRA and rsLoRA.[^1] The LoRA curves collapse onto essentially the same loss trajectory regardless of rank, and some larger ranks even perform slightly worse than the smallest.[^1] In contrast, rsLoRA produces a monotone improvement with rank: $r = 2048$ converges to a lower perplexity than $r = 512$, which is lower than $r = 128$, and so on down to $r = 4$.[^1] This is exactly the compute-versus-performance trade-off that practitioners had expected from LoRA and that the original $\alpha/r$ scaling had silently denied them.

Gradient norm tracking

Figure 3 plots the average parameter gradient norm throughout training.[^1] For LoRA, gradient norms decay as $r$ grows, consistent with the predicted $1/\sqrt{r}$ collapse implied by $\gamma_r = \alpha/r$.[^1] For rsLoRA, the gradient norms at the onset of training are essentially identical across all six ranks, and they remain within the same order of magnitude throughout fine-tuning.[^1] This is the most direct empirical confirmation of Theorem 3.2.
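The behavior at the onset of training can be imitated on a single toy adapter. The sketch below is not the paper's experiment: it assumes one linear adapter at initialization ($B = 0$, standard normal entries in $A$), a fixed input, and a fixed upstream gradient, and the function name and sizes are chosen here for illustration.

```python
import math
import random


def grad_B_norm(r, gamma, d_in=16, d_out=16, trials=50, seed=0):
    """Mean Frobenius norm of dL/dB at initialization for rank r.

    For out = gamma * B (A x) with upstream gradient g = dL/d(out),
    dL/dB = gamma * outer(g, A x), whose Frobenius norm is gamma * |g| * |A x|.
    dL/dA = gamma * B^T g x^T is exactly zero at initialization because B = 0.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(d_in)]    # same input for every rank
    g = [rng.gauss(0, 1) for _ in range(d_out)]   # same upstream gradient for every rank
    g_norm = math.sqrt(sum(v * v for v in g))
    total = 0.0
    for _ in range(trials):
        A = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(r)]
        h = [sum(a * x_j for a, x_j in zip(row, x)) for row in A]  # A x
        total += gamma * g_norm * math.sqrt(sum(v * v for v in h))
    return total / trials


for r in (4, 64, 256):
    lora = grad_B_norm(r, 1.0 / r)
    rslora = grad_B_norm(r, 1.0 / math.sqrt(r))
    print(f"r={r:4d}  LoRA: {lora:.3f}  rsLoRA: {rslora:.3f}")
```

In this toy setting the LoRA gradient norm falls roughly like $1/\sqrt{r}$ while the rsLoRA norm stays close to constant, mirroring the qualitative pattern described for Figure 3.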

Ablations

The paper includes several ablations to rule out confounders.[^1] First, the experiment is repeated with a different base model, dataset, and optimizer, and the same qualitative pattern holds. Second, the gradient-stability result is reproduced with SGD instead of AdamW, showing that the effect is not an artifact of adaptive optimizers. Third, the author sweeps learning rates for rank-4 LoRA and confirms that no learning-rate boost can reproduce the high-rank rsLoRA result, ruling out the hypothesis that rsLoRA is merely a disguised learning-rate change. Fourth, restricting adapters to attention projections (the original LoRA practice) gives the same qualitative pattern. Fifth, correcting the scale only at initialization, without changing the scaling factor, still leaves learning unstable at large rank.[^1]

Hugging Face blog validation

A February 2024 Hugging Face blog post by the same author reports a more applied evaluation using preference fine-tuning of OpenChat 3.5 on the ultrafeedback_binarized dataset with DPO, evaluated on MT-Bench.[^4] The base model scores 7.791, rank-16 LoRA reaches 7.931, rank-256 LoRA reaches 7.963 (essentially no gain over rank 16), and rank-256 rsLoRA reaches 8.088, with only about twelve additional minutes of training time over rank-16 LoRA.[^4] The pattern matches the paper: at rank 16, the difference between LoRA and rsLoRA is small, but at rank 256, only rsLoRA actually exploits the extra parameters.

Method	MT-Bench Turn 1	MT-Bench Turn 2	Average
Base OpenChat 3.5	8.206	7.375	7.791
LoRA rank 16	8.338	7.525	7.931
LoRA rank 256	8.300	7.625	7.963
rsLoRA rank 256	8.425	7.750	8.088

Going by the reported wall-clock times, the training-time overhead was about 12 minutes (2h 19m for rsLoRA at rank 256 versus 2h 7m for LoRA at rank 16): the additional parameters at higher rank still cost some forward and backward compute even though the scaling factor itself is "free".[^4]

Implementation

Hugging Face PEFT

rsLoRA is exposed in the Hugging Face PEFT library as the use_rslora boolean field of LoraConfig.[^2] The current PEFT source code documents the parameter as follows:

use_rslora: bool = field(
    default=False,
    metadata={
        "help": (
            "When set to True, uses Rank-Stabilized LoRA "
            "(https://huggingface.co/papers/2312.03732) which sets the adapter "
            "scaling factor to `lora_alpha/math.sqrt(r)`, since it was proven "
            "to work better. Otherwise, it will use the original default value "
            "of `lora_alpha/r`."
        )
    },
)

When use_rslora=True, the per-layer scaling factor is computed as lora_alpha / math.sqrt(r) instead of lora_alpha / r, and the rest of the LoRA forward and backward path is unchanged.[^2] The default is False to preserve backward compatibility with the original LoRA convention, although the documentation notes the change is "proven to work better".[^2] rsLoRA composes with the other PEFT initialization and adapter options, including PiSSA, LoftQ, EVA, OLoRA, DoRA, and the merge / unload utilities.[^2]

A typical configuration looks like:

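from peft import LoraConfig, get_peft_model

# base_model is assumed to be a transformers model loaded beforehand,
# for example with AutoModelForCausalLM.from_pretrained(...).
config = LoraConfig(
    r=256,               # high rank, where rsLoRA matters most
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "up_proj", "gate_proj", "down_proj"],
    lora_dropout=0.1,
    use_rslora=True,     # scaling becomes lora_alpha / math.sqrt(r)
)
peft_model = get_peft_model(base_model, config)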

Because use_rslora=True substitutes lora_alpha / math.sqrt(r) for the scaling, the user does not need to manually compute the corrected $\alpha$.[^2] However, hyperparameter ranges familiar from LoRA practice (such as "set $\alpha = 2r$") need to be re-tuned, because the effective scaling has changed; community guidance commonly recommends using rsLoRA whenever the desired rank exceeds the small-rank regime (roughly $r > 32$).[^5]

A practical rule that the Unsloth documentation surfaces is to start with $\alpha = r$ or $\alpha = 2r$ as the LoRA baseline, then enable use_rslora=True and consider lowering $\alpha$ slightly to compensate for the larger effective scaling at moderate ranks.[^5] At very high ranks the $1/\sqrt{r}$ rule keeps the effective per-adapter signal larger than the original $1/r$ would have made it, so retuning the learning rate downward by a small amount is also common practice for ranks above a few hundred. Because the optimizer (typically AdamW) is adaptive, modest mistakes in $\alpha$ are usually absorbed without divergence, but careful tuning still pays off for the highest ranks.
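How much the effective scaling changes when the flag is switched on under an inherited recipe is easy to compute. The sketch below only does the arithmetic; `effective_scale` and `alpha_for_scale` are hypothetical helpers rather than PEFT or Unsloth functions, and the target scale of 2.0 is an arbitrary example, not a recommendation.

```python
import math


def effective_scale(alpha, r, use_rslora):
    """Factor applied to B A: alpha / sqrt(r) with rsLoRA, otherwise alpha / r."""
    return alpha / math.sqrt(r) if use_rslora else alpha / r


def alpha_for_scale(target_scale, r, use_rslora):
    """alpha that yields a chosen effective scale at rank r."""
    return target_scale * (math.sqrt(r) if use_rslora else r)


for r in (16, 64, 256):
    alpha = 2 * r  # the common LoRA recipe alpha = 2r
    print(
        f"r={r:3d}  alpha={alpha:3d}  "
        f"LoRA scale={effective_scale(alpha, r, False):.1f}  "
        f"rsLoRA scale={effective_scale(alpha, r, True):.1f}  "
        f"alpha giving rsLoRA scale 2.0: {alpha_for_scale(2.0, r, True):.1f}"
    )
```

Keeping $\alpha = 2r$ while enabling the flag multiplies the effective scale by $\sqrt{r}$, which is why the recipe does not carry over unchanged; how far to lower $\alpha$ in practice remains an empirical question.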

Other libraries

rsLoRA propagated quickly to downstream fine-tuning frameworks that wrap PEFT:

Unsloth: Supports use_rslora=True as a parameter of FastLanguageModel.get_peft_model(). The Unsloth documentation describes rsLoRA as "the optimal approach" of scaling $\alpha$ by the square root of the rank and lists it among the recommended advanced options when training stability is a concern, particularly at higher ranks.[^5]
Axolotl: Lists Rank-Stabilized LoRA among the supported LoRA variants, alongside LoftQ, for improving training stability and integrating quantization.[^6]
LLaMA-Factory: Lists rsLoRA in its set of "fine-tuning tricks" together with FlashAttention-2, Unsloth integration, RoPE scaling, and NEFTune.[^7]

In all three frameworks the user-visible API is the same boolean flag; the underlying implementation simply replaces the scaling factor inside the LoRA forward pass.

Composition with other LoRA variants

Because rsLoRA modifies only the scaling factor, it is orthogonal to most other LoRA improvements that touch the initialization, the parameterization, or the targeting strategy. In particular it composes with:

DoRA (Weight-Decomposed LoRA), which decomposes the update into magnitude and direction parts while leaving the direction component as a LoRA adapter; the rsLoRA scaling can be applied to that LoRA direction without conflict.[^2]
PiSSA, which changes the initialization of $A$ and $B$ to be informed by the top singular vectors of the base weight matrix.[^2]
QLoRA, which quantizes the base model and trains LoRA adapters on top in higher precision; QLoRA does not specify a different scaling rule and so can use rsLoRA as a drop-in replacement.[^2]
LoftQ initialization, EVA, OLoRA, and CorDA, all of which leave the LoRA scaling factor unmodified.[^2]

Because rsLoRA does not change the trainable parameters or the inference path, the resulting models can be merged back into a single weight matrix and served exactly like a LoRA model, with no additional runtime cost.[^1]

Adoption

rsLoRA was integrated into Hugging Face PEFT in early 2024 and has since become a standard option in major fine-tuning toolkits, including Unsloth, Axolotl, and LLaMA-Factory.[^2][^5][^6][^7] The Hugging Face PEFT documentation explicitly recommends use_rslora=True when using higher LoRA ranks, citing the proof that it works better.[^2] The author published a community Hugging Face blog post titled "Rank-Stabilized LoRA: Unlocking the Potential of LoRA Fine-Tuning" on 20 February 2024 to accompany the integration; the blog presents the math, the MT-Bench validation, and worked PEFT examples for practitioners.[^4]

One reason rsLoRA propagated so quickly is the low integration cost. The patch in the PEFT source code amounts to a one-line change that conditionally divides by math.sqrt(r) instead of r in the LoRA scaling computation, and the rest of the PEFT API (LoRA initialization, target-module selection, merging, saving, loading, multi-adapter switching) needs no modification.[^2] Downstream toolkits that wrap PEFT simply expose the same boolean flag in their own configuration files, which keeps the interface stable and predictable across the ecosystem.

The paper appears on Semantic Scholar and Google Scholar with a substantial citation count across subsequent LoRA work, and the rsLoRA scaling rule is cited as the standard fix for rank-collapse in surveys on parameter-efficient fine-tuning for large language models.[^8] Notable directions building on rsLoRA include:

Federated and privacy-preserving fine-tuning: Several federated learning extensions adopt rsLoRA's $\alpha/\sqrt{r}$ scaling to mitigate rank-related gradient instability when client-side LoRA adapters are aggregated server-side. This has been picked up in frameworks like FedIT, FFA-LoRA, and FedSA-LoRA for federated LLM tuning, where the $\gamma_r = \alpha/\sqrt{r}$ rule corrects gradient-collapse at high rank.[^8]
AdaLoRA / dynamic-rank methods: The original paper notes that AdaLoRA, which dynamically allocates rank during training, still uses the old $\alpha/r$ scaling and could potentially benefit from being rebuilt on top of rsLoRA.[^1] Several follow-on works have explored this combination.
Surveys and tutorials: The 2024 survey "A survey on LoRA of large language models" lists rsLoRA among the canonical scaling-factor improvements over the original LoRA recipe.[^8]

The author, Damjan Kalajdzievski, lists Tenyx as his affiliation on the paper and maintains a public GitHub presence including an rsLoRA_Blog repository accompanying the Hugging Face blog post.[^9]

Applications

rsLoRA's practical value is most evident when practitioners want to push LoRA toward full-fine-tuning quality without paying full-fine-tuning memory cost. Because $r$ can be raised much higher without saturation, rsLoRA enables a continuous compute-versus-performance trade-off that LoRA had previously lacked:[^1]

High-rank LoRA fine-tuning: Domains such as instruction tuning, preference optimization with DPO, and specialized adaptation (legal, medical, financial NLP) often benefit from richer adapters than the standard $r = 8$ or $r = 16$ allow. rsLoRA makes ranks like $r = 128$ or $r = 256$ practically useful rather than just nominally larger.[^4]
Quantized fine-tuning: When base model weights are quantized to 4-bit or 8-bit precision (as in QLoRA) and only the LoRA adapters carry higher-precision updates, the rsLoRA scaling helps the adapters absorb more signal at higher rank, which can partially compensate for quantization error.[^2]
Preference and instruction tuning: The Hugging Face blog uses DPO on OpenChat 3.5 with rank-256 rsLoRA to obtain a measurable MT-Bench improvement over rank-16 LoRA, demonstrating that the technique transfers to alignment-style fine-tuning beyond standard supervised fine-tuning.[^4]
Adapter-rich agents and skills: When deploying multiple LoRA adapters for different tasks on top of a single base model, rsLoRA allows each adapter to be larger without becoming useless, expanding the effective design space for adapter ensembles.

Crucially, the inference cost remains identical to a non-LoRA fine-tuned model, because $W + \gamma_r BA$ can be merged into a single matrix after training.[^1] This preserves LoRA's main deployment advantage while removing its rank-saturation problem.

Limitations and criticisms

rsLoRA is a narrowly-scoped fix and does not claim to solve every limitation of LoRA-style fine-tuning. Several caveats apply:

Asymptotic guarantee, not finite-rank optimum: Theorem 3.2 is a statement about the asymptotic behavior in the limit $r \to \infty$. It guarantees that learning does not collapse, but it does not prove that rsLoRA dominates LoRA at every finite rank. At very low ranks, where standard LoRA already works well, the two scalings give similar results in practice; the gap opens up only when $r$ is large enough for rank-collapse to bite.[^1][^4]
Re-tuning of $\alpha$: Hyperparameter recipes inherited from LoRA practice (such as "use $\alpha = 2r$") do not transfer literally to rsLoRA, since the effective scaling at a given $r$ now differs by a factor of $\sqrt{r}$. Practitioners enabling use_rslora=True should expect to retune $\alpha$ for their target rank; Unsloth and other toolkits document this caveat.[^5]
Training compute does grow with rank: rsLoRA removes the rank-collapse barrier but not the actual cost of larger adapters. Backward passes through a rank-2048 adapter cost more than through a rank-16 adapter even though the scaling factor is "free", so the trade-off is genuinely compute-versus-quality rather than a free improvement.[^1][^4]
Feature-quality assumption: The proof assumes that the gradient into each adapter is $\Theta_r(1)$, which can fail if the features learned at different ranks are themselves rank-dependent. The paper notes this caveat explicitly and treats experimental validation as a necessary complement to the theorem.[^1]
Does not address other LoRA problems: rsLoRA does not solve catastrophic forgetting, intruder dimensions, the gap between LoRA and full fine-tuning on out-of-distribution generalization, or domain-shift issues. Other LoRA variants such as DoRA, PiSSA, and full fine-tuning still play complementary roles.

A second-order observation is that the rsLoRA result revises the conventional wisdom that "very low LoRA ranks suffice", which had been cited as evidence for low-intrinsic-dimensionality hypotheses about fine-tuning. The paper concludes that this interpretation was at least partially an artifact of the $\alpha/r$ scaling, and motivates further studies of the true intrinsic dimensionality of fine-tuning manifolds.[^1]

Related work

rsLoRA sits within a dense ecosystem of LoRA variants and stabilization techniques developed since 2022. Some directly relevant comparisons:

LoRA (Hu et al., 2022): The original method that rsLoRA modifies. LoRA's $\gamma_r = \alpha/r$ scaling is exactly what Theorem 3.2 shows to be too aggressive.[^3]
QLoRA (Dettmers et al., 2023): Combines 4-bit base-model quantization with LoRA adapters. QLoRA does not change the scaling factor, so rsLoRA can be applied on top of it.
DoRA (Liu et al., 2024): Decomposes weight updates into magnitude and direction. The direction component is a LoRA adapter, and rsLoRA scaling can be applied to it.
PiSSA: Changes adapter initialization to align with the top singular vectors of the base weight matrix, leaving the scaling factor unchanged.
AdaLoRA: Dynamically allocates rank per layer based on importance. Because AdaLoRA inherits LoRA's $\alpha/r$ scaling, it is potentially improved by switching to rsLoRA's $\alpha/\sqrt{r}$, an extension flagged as future work in the rsLoRA paper.[^1]
GaLore: A separate memory-efficient training approach based on projecting full gradients into a low-rank subspace, rather than adding rank-constrained adapters. GaLore addresses a different problem (memory-efficient pretraining and full fine-tuning) and is largely orthogonal to rsLoRA.
Maximal update parametrization (muP): The infinite-width parametrization framework of Yang and Hu that inspired the rsLoRA analysis. muP prescribes width-stable scaling for pretraining, while rsLoRA prescribes rank-stable scaling for adapter fine-tuning.[^1]

The relationship between rsLoRA and these neighbors can be summarized: rsLoRA fixes the scaling factor specifically, leaving open the orthogonal questions of initialization (PiSSA, LoftQ), decomposition (DoRA), quantization (QLoRA), rank allocation (AdaLoRA), and the projection-based alternative (GaLore).

References
