rsLoRA (Rank-Stabilized LoRA)
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,335 words
rsLoRA (rank-stabilized LoRA) is a one-line modification to the low-rank adaptation fine-tuning method for large neural networks. It replaces the conventional LoRA scaling factor $\gamma_r = \alpha/r$ with $\gamma_r = \alpha/\sqrt{r}$, where $r$ is the adapter rank and $\alpha$ is a hyperparameter. The change was proposed by Damjan Kalajdzievski in a December 2023 arXiv preprint, "A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA", which shows analytically that the original $1/r$ scaling causes adapter gradients to collapse as the rank is increased, so larger ranks fail to improve performance.[^1] Under the $1/\sqrt{r}$ scaling, both forward activations and backward gradients remain of order $\Theta(1)$ for any rank, restoring the expected compute-versus-performance trade-off and allowing practitioners to use higher LoRA ranks without saturation.[^1] rsLoRA is implemented in Hugging Face's PEFT library as a single boolean flag, use_rslora=True in LoraConfig, with no change to inference cost.[^2]
Standard LoRA, introduced by Hu et al. (2022), attaches a trainable rank-$r$ "adapter" matrix product $BA$ to selected linear sub-modules of a pre-trained transformer while keeping the base weights frozen.[^3] Concretely, a linear layer that originally computes $x_{\text{out}} = Wx_{\text{in}} + b$ is replaced by $$x_{\text{out}} = (W + \gamma_r BA)x_{\text{in}} + b,$$ with $A \in \mathbb{R}^{r \times d_1}$, $B \in \mathbb{R}^{d_2 \times r}$, and a positive scaling factor $\gamma_r \in \mathbb{R}^{+}$ that depends on the rank.[^1] At initialization $B = 0$ and $A$ has independent zero-mean entries with variance independent of $r$, so the adapter term $\gamma_r BA$ is zero, the adapted layer starts out identical to the pre-trained layer, and the update is learned gradually during fine-tuning.[^1]
In the original LoRA implementation and most popular libraries, the scaling factor is set to $\gamma_r = \alpha/r$ for a hyperparameter $\alpha$.[^1][^2] After fine-tuning, the sum $W + \gamma_r BA$ can be folded back into a single weight matrix, so inference cost matches that of the unmodified base model regardless of $r$.[^1]
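The folding step can be checked numerically. The following is a minimal NumPy sketch rather than any library's implementation: the dimensions, the seed, and the helper name `lora_forward` are illustrative assumptions, and $B$ is given nonzero values to stand in for a trained adapter.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, r, alpha = 16, 12, 4, 8.0   # illustrative sizes, not from the paper

W = rng.normal(size=(d2, d1))       # frozen base weight
b = rng.normal(size=d2)             # frozen bias
A = rng.normal(size=(r, d1))        # adapter down-projection
B = rng.normal(size=(d2, r))        # adapter up-projection (nonzero, as if trained)
gamma = alpha / r                   # conventional LoRA scaling factor


def lora_forward(x):
    """Adapter kept separate: W x + gamma * B (A x) + b."""
    return W @ x + gamma * (B @ (A @ x)) + b


# Fold the adapter into a single matrix for inference.
W_merged = W + gamma * (B @ A)

x = rng.normal(size=d1)
merged_out = W_merged @ x + b
max_diff = float(np.max(np.abs(lora_forward(x) - merged_out)))
print(max_diff)  # only floating-point rounding error remains
```

The merged matrix has the same shape as $W$, which is why the served model costs the same as the base model whatever rank was used during training. The same check works for any choice of $\gamma_r$.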
The $\alpha/r$ convention has a long folkloric justification: as $r$ grows, the product $BA$ accumulates a sum of $r$ rank-one terms, so dividing by $r$ keeps the average term size constant. The paper by Kalajdzievski shows that this intuition is in fact wrong, because it ignores how the variance of the matrix product scales when $B$ and $A$ are jointly trained from a degenerate initialization.[^1]
Practitioners working with LoRA had repeatedly observed an empirical puzzle: increasing the adapter rank $r$ beyond small values (typically $r = 4, 8, 16$) brought essentially no improvement in fine-tuning quality, even though the trainable parameter count grew linearly with $r$.[^1] The LoRA paper itself reported that "very low ranks suffice" and provided ablations showing that rank 64 fine-tuning performed no better than rank 4 (Table 6 of Hu et al.).[^3] This observation was widely interpreted as evidence that the fine-tuning manifold for large language models has very low intrinsic dimension, an idea consistent with earlier results by Aghajanyan, Zettlemoyer, and Gupta on intrinsic dimensionality of fine-tuning.[^1]
The rsLoRA analysis shows that this interpretation conflates two distinct phenomena. While the intrinsic-dimensionality hypothesis may still hold to some degree, a separate effect dominates the empirical picture: with $\gamma_r = \alpha/r$, the gradient signal flowing through each adapter collapses as $r$ increases, so the network simply cannot exploit the extra parameters even when they would be useful.[^1] The conclusion that "low ranks suffice" is therefore an artifact of overly aggressive scaling, not a fundamental limit of the LoRA family.[^1]
The mathematical setup that rsLoRA borrows from infinite-width neural network theory traces back to the analysis of "abc-parametrizations" by Yang and Hu, the same line of work that produced maximal update parametrization for pretraining.[^1] In that framework, a network with hidden dimension $d$ is described by three exponents: weights are scaled by $1/d^{a_l}$, initialized with variance $1/d^{2b_l}$, and updated with effective learning rate $\eta/d^{c}$. Yang and Hu proved that only certain triples $(a, b, c)$ yield stable, non-collapsing learning in the limit $d \to \infty$, and they used this to recommend specific scaling choices for pretraining wide networks.
Kalajdzievski applies an analogous learning-trajectory analysis to LoRA's rank parameter $r$, treating $r$ as the analog of the infinite hidden dimension.[^1] Because LoRA already fixes the initialization scheme ($B = 0$, $A$ with rank-independent variance), the only remaining degree of freedom is the scaling factor $\gamma_r$. The paper then asks: which $\gamma_r$ keeps activations and gradients well-behaved as $r$ grows?
The paper introduces a formal definition that captures what "well-behaved" should mean for an adapter:
An adapter $\gamma_r BA$ is rank-stabilized if (1) when inputs are i.i.d. with $m$-th moment $\Theta_r(1)$ in each entry, the outputs of the adapter also have $m$-th moment $\Theta_r(1)$ in each entry; and (2) when loss gradients with respect to adapter outputs are $\Theta_r(1)$ in each entry, the loss gradients with respect to adapter inputs are also $\Theta_r(1)$ in each entry.[^1]
Here $\Theta_r(\cdot)$ is the standard big-Theta asymptotic notation in the rank $r$, meaning the quantity is bounded above and below by constants times the argument as $r \to \infty$. The two conditions together require that adapters neither inflate nor deflate signal magnitudes in the forward pass, and neither inflate nor deflate gradient magnitudes in the backward pass, regardless of rank.[^1]
Rank-stability is the right notion to demand because adapters in a deep network compose. If the first adapter inflates activations by a factor of $\sqrt{r}$, the second adapter receives inputs that are themselves $r$-dependent, and the cascade can multiply through many layers; the same logic applies in reverse for gradients. Demanding $\Theta_r(1)$ behavior at every adapter is what allows the paper to chain stability through arbitrary depth using a simple inductive argument over the forward and backward passes.[^1]
The central result is Theorem 3.2:
Consider LoRA adapters of the form $\gamma_r BA$, with $B \in \mathbb{R}^{d_1 \times r}$, $A \in \mathbb{R}^{r \times d_2}$, $B$ initialized to zero, entries of $A$ i.i.d. with mean zero and variance $\sigma_A$ not depending on $r$, and $\gamma_r \to 0$ as $r \to \infty$. In expectation over initialization, all adapters are rank-stabilized if and only if $\gamma_r \in \Theta_r(1/\sqrt{r})$. Furthermore, this holds at any point in the learning trajectory, and unless $\gamma_r \in \Theta_r(1/\sqrt{r})$ there is unstable or collapsing learning for sufficiently large values of $r$.[^1]
The "if and only if" is essential: any scaling that decays asymptotically more slowly than $1/\sqrt{r}$ produces blowup, any scaling that decays faster produces collapse, and only $\Theta(1/\sqrt{r})$ satisfies both conditions simultaneously. The conventional choice $\alpha/r$ therefore decays too quickly, by a factor of $\sqrt{r}$.[^1]
The proof in Appendix A computes the variance of $\gamma_r BA x_{\text{in}}$ entry-by-entry, observes that this variance scales as $\gamma_r^2 \cdot r$ when $B$ and $A$ each have $\Theta(1)$ variance per entry, and concludes that $\gamma_r^2 r = \Theta(1)$ forces $\gamma_r = \Theta(1/\sqrt{r})$.[^1] A parallel argument for the backward pass confirms the same exponent for gradients. The induction step that propagates the result from one adapter to the next uses the fact that the model's input data does not depend on $r$, so stability can be carried forward layer by layer.[^1]
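The forward-pass variance computation can be written out explicitly. The following is a sketch under simplifying assumptions that are not taken verbatim from the paper: the entries of $B$ and $A$ are treated as mutually independent, zero-mean, and independent of $x_{\text{in}}$; $B$ has per-entry variance $\sigma_B$ (a symbol introduced here, following the theorem's convention that $\sigma_A$ denotes a variance); and $d_{\text{in}}$ denotes the input dimension.

$$
\begin{aligned}
(\gamma_r BA x_{\text{in}})_i &= \gamma_r \sum_{k=1}^{r} B_{ik} \sum_{j=1}^{d_{\text{in}}} A_{kj}\, x_{\text{in},j}, \\
\operatorname{Var}\big[(\gamma_r BA x_{\text{in}})_i\big] &= \gamma_r^2 \sum_{k=1}^{r} \mathbb{E}[B_{ik}^2] \sum_{j=1}^{d_{\text{in}}} \mathbb{E}[A_{kj}^2]\, \mathbb{E}[x_{\text{in},j}^2] = \gamma_r^2 \, r \, \sigma_B \sigma_A \sum_{j=1}^{d_{\text{in}}} \mathbb{E}[x_{\text{in},j}^2].
\end{aligned}
$$

The cross terms vanish because distinct entries of $B$ and of $A$ are independent with zero mean. Every factor other than $\gamma_r^2 r$ is independent of the rank, so the variance is $\Theta_r(1)$ exactly when $\gamma_r^2 r \in \Theta_r(1)$, that is, when $\gamma_r \in \Theta_r(1/\sqrt{r})$.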
A useful intuition is to think of the matrix product $BA$ as a sum of $r$ rank-one outer products $\sum_{i=1}^{r} b_i a_i^{\top}$, where $b_i$ is the $i$-th column of $B$ and $a_i$ is the $i$-th row of $A$. Once training begins, the entries of $B$ and $A$ both become $\Theta(1)$ in magnitude, so each rank-one term contributes $\Theta(1)$ to the entry-wise variance of the sum, and the variance of the full sum is $\Theta(r)$. To bring the product back to $\Theta(1)$ variance one must divide by the standard deviation of the sum, which is $\sqrt{r}$, not by $r$.[^1] Dividing by $r$ overcorrects by a factor of $\sqrt{r}$, suppressing the adapter's contribution more than the $\sqrt{r}$ growth of the sum's standard deviation warrants, and the gradients inherit the same overcorrection. This is the central conceptual error in the original LoRA scaling that rsLoRA repairs.
Setting the constant in front to a free hyperparameter, rsLoRA prescribes $$\gamma_r = \frac{\alpha}{\sqrt{r}}.$$ This is the only change. The architecture, initialization, optimizer, and inference path are all identical to standard LoRA.[^1]
Practitioners who want to enable rsLoRA while keeping a familiar $\alpha$ scale can equivalently leave $\gamma_r = \alpha'/r$ but set $\alpha' = \alpha\sqrt{r}$, since $\alpha\sqrt{r}/r = \alpha/\sqrt{r}$.[^4] The two formulations are arithmetically identical, but most libraries now expose rsLoRA through a boolean flag that handles the substitution internally.[^2]
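The equivalence is simple enough to verify in a few lines. This is a small sketch using only the standard library; the function names are hypothetical helpers introduced for illustration and are not part of any library.

```python
import math


def lora_scaling(alpha: float, r: int) -> float:
    """Conventional LoRA scaling factor: alpha / r."""
    return alpha / r


def rslora_scaling(alpha: float, r: int) -> float:
    """rsLoRA scaling factor: alpha / sqrt(r)."""
    return alpha / math.sqrt(r)


def equivalent_lora_alpha(alpha: float, r: int) -> float:
    """The alpha' for which alpha' / r equals alpha / sqrt(r)."""
    return alpha * math.sqrt(r)


alpha = 16.0
for r in (4, 64, 256, 2048):
    alpha_prime = equivalent_lora_alpha(alpha, r)
    # Both formulations must yield the same scaling factor.
    assert math.isclose(lora_scaling(alpha_prime, r), rslora_scaling(alpha, r))
    print(f"r={r:5d}  alpha'={alpha_prime:9.3f}  scaling={rslora_scaling(alpha, r):.4f}")
```

Note that $\alpha'$ depends on the rank, so a manual substitution has to be recomputed whenever $r$ changes, which is one reason a dedicated flag is more convenient.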
Theorem 3.2 is asymptotic and applies in the limit of $r \to \infty$. The paper is careful to note that the result speaks to stability and collapse of learning, not to the quality of features that learning produces.[^1] If the features learned at different ranks are themselves rank-dependent, the assumption that gradients into the adapter are $\Theta_r(1)$ may not hold and the conclusion would need empirical validation.[^1] Conversely, if feature quality is genuinely rank-independent, there is no benefit to higher ranks regardless of the scaling factor. The practical implications of the theorem therefore require experimental confirmation, which the paper provides.
| Regime | Scaling | Forward variance | Gradient norm | Behavior at large $r$ |
|---|---|---|---|---|
| Naive (no scaling) | $\gamma_r = \alpha$ | grows like $r$ | grows like $\sqrt{r}$ | activations and gradients blow up |
| Standard LoRA | $\gamma_r = \alpha/r$ | shrinks like $1/r$ | shrinks like $1/\sqrt{r}$ | gradients collapse, no learning at high $r$ |
| rsLoRA | $\gamma_r = \alpha/\sqrt{r}$ | $\Theta(1)$ | $\Theta(1)$ | stable for all $r$ |
The middle row is the case originally implemented in LoRA, and the bottom row is the rsLoRA correction.[^1]
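
The forward-pass column of the table can be illustrated with a toy simulation. This is a sketch under stated assumptions, not a reproduction of the paper's experiments: $B$ is drawn with random $\Theta(1)$ entries as a stand-in for a trained adapter, $A$ is scaled by $1/\sqrt{d_{\text{in}}}$ so that $Ax$ has $\Theta(1)$ entries, and the sizes, seed, and names are illustrative. It reports the per-entry standard deviation, which is the square root of the variance listed in the table.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, alpha = 256, 256, 1.0   # illustrative sizes

SCALINGS = {
    "naive": lambda r: alpha,
    "lora": lambda r: alpha / r,
    "rslora": lambda r: alpha / np.sqrt(r),
}


def adapter_output_std(r, gamma, n_inputs=512):
    """Empirical per-entry std of gamma * B A x over a batch of random inputs."""
    X = rng.normal(size=(d_in, n_inputs))            # inputs with Theta(1) entries
    A = rng.normal(size=(r, d_in)) / np.sqrt(d_in)   # keeps entries of A x at Theta(1)
    B = rng.normal(size=(d_out, r))                  # stand-in for a trained B
    return float(np.std(gamma * (B @ (A @ X))))


ranks = (4, 64, 1024)
results = {
    name: {r: adapter_output_std(r, scale(r)) for r in ranks}
    for name, scale in SCALINGS.items()
}
for name, by_rank in results.items():
    print(name, {r: round(s, 3) for r, s in by_rank.items()})
```

Under these assumptions the unscaled output grows roughly like $\sqrt{r}$, the $\alpha/r$ output shrinks roughly like $1/\sqrt{r}$, and the $\alpha/\sqrt{r}$ output stays near 1 at every rank.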
The paper validates the theorem on Llama 2 (7B), fine-tuned on 20,000 instruction-tuning examples from the OpenOrca dataset.[^1] The optimizer is AdamW with the Hugging Face default learning rate of $5 \times 10^{-5}$ on a constant schedule, and adapters are placed in all linear sub-modules of every transformer block (both attention projections and feed-forward MLP layers), since prior work showed that targeting all linear layers maximizes performance per parameter for LoRA.[^1] Ranks tested are $r \in \{4, 8, 32, 128, 512, 2048\}$, spanning more than two orders of magnitude.[^1]
In Figure 2 of the paper, fine-tuning perplexity is plotted against training step for each rank, for both LoRA and rsLoRA.[^1] The LoRA curves collapse onto essentially the same loss trajectory regardless of rank, and some larger ranks even perform slightly worse than the smallest.[^1] In contrast, rsLoRA produces a monotone improvement with rank: $r = 2048$ converges to a lower perplexity than $r = 512$, which is lower than $r = 128$, and so on down to $r = 4$.[^1] This is exactly the compute-versus-performance trade-off that practitioners had expected from LoRA and that the original $\alpha/r$ scaling had silently denied them.
Figure 3 plots the average parameter gradient norm throughout training.[^1] For LoRA, gradient norms decay as $r$ grows, consistent with the predicted $1/\sqrt{r}$ collapse implied by $\gamma_r = \alpha/r$.[^1] For rsLoRA, the gradient norms at the onset of training are essentially identical across all six ranks, and they remain within the same order of magnitude throughout fine-tuning.[^1] This is the most direct empirical confirmation of Theorem 3.2.
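The at-initialization part of this behavior can be reproduced on a single toy layer. The sketch below is an illustration under assumptions, not the paper's measurement: it uses random inputs and random upstream gradients with $\Theta(1)$ entries, illustrative sizes, and a hypothetical helper name. With $B = 0$ the gradient with respect to $A$ vanishes at the first step, so only the gradient with respect to $B$, which is $\gamma_r\, g\,(A x)^{\top}$ for one sample, is measured.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, alpha, n = 256, 256, 1.0, 512   # illustrative sizes


def grad_B_norm_at_init(r, gamma):
    """Mean Frobenius norm of dL/dB = gamma * g (A x)^T over n random samples."""
    X = rng.normal(size=(d_in, n))                   # inputs with Theta(1) entries
    G = rng.normal(size=(d_out, n))                  # upstream gradients, Theta(1) entries
    A = rng.normal(size=(r, d_in)) / np.sqrt(d_in)   # init variance does not depend on r
    H = A @ X                                        # adapter hidden activations
    # The Frobenius norm of an outer product g h^T equals ||g|| * ||h||.
    per_sample = np.linalg.norm(G, axis=0) * np.linalg.norm(H, axis=0)
    return float(gamma * per_sample.mean())


ranks = (4, 64, 1024)
lora = {r: grad_B_norm_at_init(r, alpha / r) for r in ranks}
rslora = {r: grad_B_norm_at_init(r, alpha / np.sqrt(r)) for r in ranks}
print("LoRA  ", {r: round(v, 2) for r, v in lora.items()})
print("rsLoRA", {r: round(v, 2) for r, v in rslora.items()})
```

In this toy setting the LoRA gradient norm falls by roughly $\sqrt{1024/4} = 16$ between rank 4 and rank 1024, while the rsLoRA norm stays approximately constant.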
The paper includes several ablations to rule out confounders.[^1] First, the experiment is repeated with a different base model, dataset, and optimizer, and the same qualitative pattern holds. Second, the gradient-stability result is reproduced with SGD instead of AdamW, showing that the effect is not an artifact of adaptive optimizers. Third, the author sweeps learning rates for rank-4 LoRA and confirms that no learning-rate boost can reproduce the high-rank rsLoRA result, ruling out the hypothesis that rsLoRA is merely a disguised learning-rate change. Fourth, restricting adapters to attention projections (the original LoRA practice) gives the same qualitative pattern. Fifth, attempting to fix the problem only at initialization, without changing the scaling factor, still results in unstable learning at large rank.[^1]
A February 2024 Hugging Face blog post by the same author reports a more applied evaluation using preference fine-tuning of OpenChat 3.5 on the ultrafeedback_binarized dataset with DPO, evaluated on MT-Bench.[^4] The base model scores 7.791, rank-16 LoRA reaches 7.931, rank-256 LoRA reaches 7.963 (essentially no gain over rank 16), and rank-256 rsLoRA reaches 8.088, with only roughly thirteen additional minutes of training time over rank-16 LoRA.[^4] The pattern matches the paper: at rank 16, the difference between LoRA and rsLoRA is small, but at rank 256, only rsLoRA actually exploits the extra parameters.
| Method | MT-Bench Turn 1 | MT-Bench Turn 2 | Average |
|---|---|---|---|
| Base OpenChat 3.5 | 8.206 | 7.375 | 7.791 |
| LoRA rank 16 | 8.338 | 7.525 | 7.931 |
| LoRA rank 256 | 8.300 | 7.625 | 7.963 |
| rsLoRA rank 256 | 8.425 | 7.750 | 8.088 |
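
The Average column is the mean of the two turn scores, which makes the gaps between methods easy to recompute. The snippet below is a small arithmetic check on the numbers in the table; the dictionary layout and variable names are illustrative.

```python
# (turn 1, turn 2) MT-Bench scores copied from the table above
scores = {
    "Base OpenChat 3.5": (8.206, 7.375),
    "LoRA rank 16": (8.338, 7.525),
    "LoRA rank 256": (8.300, 7.625),
    "rsLoRA rank 256": (8.425, 7.750),
}

averages = {name: (t1 + t2) / 2 for name, (t1, t2) in scores.items()}
base = averages["Base OpenChat 3.5"]

for name, avg in averages.items():
    print(f"{name:18s} average={avg:.4f}  gain_over_base={avg - base:+.4f}")
```

Going from rank 16 to rank 256 with LoRA moves the average by about 0.03, while rank-256 rsLoRA sits about 0.125 above rank-256 LoRA.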
The reported training-time overhead was approximately 13 minutes; the quoted wall-clock times (2h 19m for rsLoRA at rank 256 versus 2h 7m for LoRA at rank 16) differ by 12 minutes. The overhead arises because the additional parameters at higher rank still cost some forward and backward compute, even though the scaling factor itself is "free".[^4]
rsLoRA is exposed in the Hugging Face PEFT library as the use_rslora boolean field of LoraConfig.[^2] The current PEFT source code documents the parameter as follows:
```python
use_rslora: bool = field(
    default=False,
    metadata={
        "help": (
            "When set to True, uses Rank-Stabilized LoRA "
            "(https://huggingface.co/papers/2312.03732) which sets the adapter "
            "scaling factor to `lora_alpha/math.sqrt(r)`, since it was proven "
            "to work better. Otherwise, it will use the original default value "
            "of `lora_alpha/r`."
        )
    },
)
```
When use_rslora=True, the per-layer scaling factor is computed as lora_alpha / math.sqrt(r) instead of lora_alpha / r, and the rest of the LoRA forward and backward path is unchanged.[^2] The default is False to preserve backward compatibility with the original LoRA convention, although the documentation notes the change is "proven to work better".[^2] rsLoRA composes with the other PEFT initialization and adapter options, including PiSSA, LoftQ, EVA, OLoRA, DoRA, and the merge / unload utilities.[^2]
A typical configuration looks like:
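```python
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=256,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "up_proj", "gate_proj", "down_proj"],
    lora_dropout=0.1,
    use_rslora=True,  # scaling becomes lora_alpha / sqrt(r) = 16 / 16 = 1.0
)
# base_model is assumed to be an already loaded transformers model.
peft_model = get_peft_model(base_model, config)
```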
Because use_rslora=True substitutes lora_alpha / math.sqrt(r) for the scaling, the user does not need to manually compute the corrected $\alpha$.[^2] However, hyperparameter ranges familiar from LoRA practice (such as "set $\alpha = 2r$") need to be re-tuned, because the effective scaling has changed; community guidance commonly recommends using rsLoRA whenever the desired rank exceeds the small-rank regime (roughly $r > 32$).[^5]
A practical rule that the Unsloth documentation surfaces is to start with $\alpha = r$ or $\alpha = 2r$ as the LoRA baseline, then enable use_rslora=True and consider lowering $\alpha$ slightly to compensate for the larger effective scaling at moderate ranks.[^5] At very high ranks the $1/\sqrt{r}$ rule keeps the effective per-adapter signal larger than the original $1/r$ would have made it, so retuning the learning rate downward by a small amount is also common practice for ranks above a few hundred. Because the optimizer (typically AdamW) is adaptive, modest mistakes in $\alpha$ are usually absorbed without divergence, but careful tuning still pays off for the highest ranks.
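The need for retuning becomes concrete when a rank-proportional heuristic is carried over unchanged. The following sketch uses only the standard library; `effective_scaling` and `matching_rslora_alpha` are hypothetical helpers for illustration, and the `use_rslora` argument merely mirrors the name of the PEFT flag.

```python
import math


def effective_scaling(lora_alpha: float, r: int, use_rslora: bool) -> float:
    """Scaling factor applied to B A under either convention."""
    return lora_alpha / math.sqrt(r) if use_rslora else lora_alpha / r


def matching_rslora_alpha(lora_alpha: float, r: int) -> float:
    """alpha that gives rsLoRA the same scaling as LoRA with lora_alpha at rank r."""
    return lora_alpha / math.sqrt(r)


for r in (16, 64, 256):
    lora_alpha = 2 * r  # the "alpha = 2r" heuristic from LoRA practice
    plain = effective_scaling(lora_alpha, r, use_rslora=False)
    stabilized = effective_scaling(lora_alpha, r, use_rslora=True)
    matched = matching_rslora_alpha(lora_alpha, r)
    print(f"r={r:3d}  LoRA scaling={plain:.1f}  "
          f"rsLoRA scaling={stabilized:.1f}  matching rsLoRA alpha={matched:.1f}")
```

With $\alpha = 2r$ the LoRA scaling is a constant 2, whereas the same $\alpha$ under rsLoRA gives $2\sqrt{r}$, which is 32 at rank 256. Matching the LoRA baseline's scaling is only a starting point for tuning, not a recommendation from the source.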
rsLoRA propagated quickly to downstream fine-tuning frameworks that wrap PEFT:
- Unsloth exposes use_rslora=True as a parameter of FastLanguageModel.get_peft_model(). The Unsloth documentation describes rsLoRA as "the optimal approach" of scaling $\alpha$ by the square root of the rank and lists it among the recommended advanced options when training stability is a concern, particularly at higher ranks.[^5]
In all three frameworks the user-visible API is the same boolean flag; the underlying implementation simply replaces the scaling factor inside the LoRA forward pass.
Because rsLoRA modifies only the scaling factor, it is orthogonal to most other LoRA improvements that touch the initialization, the parameterization, or the targeting strategy. In particular it composes with:
Because rsLoRA does not change the trainable parameters or the inference path, the resulting models can be merged back into a single weight matrix and served exactly like a LoRA model, with no additional runtime cost.[^1]
rsLoRA was integrated into Hugging Face PEFT in early 2024 and has since become a standard option in major fine-tuning toolkits, including Unsloth, Axolotl, and LLaMA-Factory.[^2][^5][^6][^7] The Hugging Face PEFT documentation explicitly recommends use_rslora=True when using higher LoRA ranks, citing the proof that it works better.[^2] The author published a community Hugging Face blog post titled "Rank-Stabilized LoRA: Unlocking the Potential of LoRA Fine-Tuning" on 20 February 2024 to accompany the integration; the blog presents the math, the MT-Bench validation, and worked PEFT examples for practitioners.[^4]
One reason rsLoRA propagated so quickly is the low integration cost. The patch in the PEFT source code amounts to a one-line change that conditionally divides by math.sqrt(r) instead of r in the LoRA scaling computation, and the rest of the PEFT API (LoRA initialization, target-module selection, merging, saving, loading, multi-adapter switching) needs no modification.[^2] Downstream toolkits that wrap PEFT simply expose the same boolean flag in their own configuration files, which keeps the interface stable and predictable across the ecosystem.
The paper appears on Semantic Scholar and Google Scholar with a substantial citation count across subsequent LoRA work, and the rsLoRA scaling rule is cited as the standard fix for rank-collapse in surveys on parameter-efficient fine-tuning for large language models.[^8] Notable directions building on rsLoRA include:
The author, Damjan Kalajdzievski, lists Tenyx as his affiliation on the paper and maintains a public GitHub presence, including an rsLoRA_Blog repository that accompanies the Hugging Face blog post.[^9]
rsLoRA's practical value is most evident when practitioners want to push LoRA toward full-fine-tuning quality without paying full-fine-tuning memory cost. Because $r$ can be raised much higher without saturation, rsLoRA enables a continuous compute-versus-performance trade-off that LoRA had previously lacked:[^1]
Crucially, the inference cost remains identical to a non-LoRA fine-tuned model, because $W + \gamma_r BA$ can be merged into a single matrix after training.[^1] This preserves LoRA's main deployment advantage while removing its rank-saturation problem.
rsLoRA is a narrowly-scoped fix and does not claim to solve every limitation of LoRA-style fine-tuning. Several caveats apply:
- Users who set use_rslora=True should expect to retune $\alpha$ for their target rank; Unsloth and other toolkits document this caveat.[^5]
A second-order observation is that the rsLoRA result revises the conventional wisdom that "very low LoRA ranks suffice", which had been cited as evidence for low-intrinsic-dimensionality hypotheses about fine-tuning. The paper concludes that this interpretation was at least partially an artifact of the $\alpha/r$ scaling, and motivates further studies of the true intrinsic dimensionality of fine-tuning manifolds.[^1]
rsLoRA sits within a dense ecosystem of LoRA variants and stabilization techniques developed since 2022. Some directly relevant comparisons:
The relationship between rsLoRA and these neighbors can be summarized: rsLoRA fixes the scaling factor specifically, leaving open the orthogonal questions of initialization (PiSSA, LoftQ), decomposition (DoRA), quantization (QLoRA), rank allocation (AdaLoRA), and the projection-based alternative (GaLore).