LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique for large language models and other deep learning models. Introduced by Edward Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen in 2021, LoRA freezes the pre-trained model weights and injects small, trainable low-rank decomposition matrices into selected layers of a transformer architecture. This allows models with billions of parameters to be adapted for specific tasks using only a fraction of the parameters that full fine-tuning would require. The method was first published as a preprint on June 17, 2021 (arXiv:2106.09685) and later presented at ICLR 2022.
LoRA reduces trainable parameters by up to 10,000x while matching or exceeding full fine-tuning performance, cutting GPU memory requirements by roughly 3x and adding zero inference latency. It has become one of the most widely adopted fine-tuning methods in both research and production settings, with support in libraries such as Hugging Face PEFT, and applications spanning natural language processing, computer vision, and image generation. Platforms like Hugging Face and CivitAI host over 60,000 public LoRA adapters across text-to-image, LLM, and multimodal categories.
Imagine you have a giant coloring book that already has beautiful pictures in it. You want to add some special details, but you do not want to erase or change the original pictures. So instead of getting a whole new coloring book, you just put a thin transparent sheet on top and draw your small additions there. LoRA works the same way: it keeps the original model (the coloring book) exactly as it is, and adds a very thin layer of new information on top. Because this added layer is so small, it takes up almost no extra space and is very fast to create.
Fine-tuning a pre-trained language model is one of the most effective ways to adapt it to new tasks or domains. However, as models have grown from millions to hundreds of billions of parameters, full fine-tuning has become prohibitively expensive in terms of both compute and memory. For a model like GPT-3 with 175 billion parameters, full fine-tuning requires storing and updating a complete copy of all weights for every downstream task, consuming over 1.2 terabytes of GPU memory with mixed-precision training.
Several observations motivate LoRA's approach:
Over-parameterization and intrinsic dimensionality. Research by Aghajanyan et al. (2020) showed that pre-trained language models have a low "intrinsic dimensionality," meaning their behavior can be captured in a much smaller subspace than the full parameter space. Their paper demonstrated that pre-trained language models could be effectively fine-tuned by optimizing in extremely small subspaces, achieving 90% of full fine-tuning performance on RoBERTa with just 200 trainable parameters through random projection. This implies that the weight changes needed for task adaptation also lie in a low-dimensional subspace.
Storage and deployment costs. In production environments, organizations often need to serve dozens or hundreds of task-specific models. If each requires a full copy of the base model's weights, storage and deployment become impractical. LoRA reduces checkpoint sizes by a factor of 10,000 or more.
Task-switching overhead. With full fine-tuning, switching between tasks requires loading entirely different model weights. LoRA adapters can be swapped in and out at minimal cost, or even merged directly into the base weights.
Eight researchers at Microsoft created LoRA, with Edward J. Hu and Yelong Shen as equal first authors alongside Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. The team first submitted the paper "LoRA: Low-Rank Adaptation of Large Language Models" to arXiv on June 17, 2021, and it was subsequently published at the International Conference on Learning Representations (ICLR 2022). Edward Hu, who invented LoRA during his AI residency at Microsoft Research, later joined OpenAI and pursued doctoral studies under Yoshua Bengio at Mila in Montreal.
The work built directly on research by Armen Aghajanyan and colleagues, who published "Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning" in December 2020. Their empirical evidence that learned models reside in low-dimensional subspaces provided the theoretical foundation for LoRA's hypothesis that the weight updates during adaptation also have low intrinsic rank.
The original LoRA paper validated the approach across multiple architectures. On GPT-3 175B, LoRA with rank r=4 on the query and value projections achieved 73.4% accuracy on WikiSQL versus 73.8% for full fine-tuning, while cutting trainable parameters from 175 billion to just 4.7 million (roughly 18 million when the adaptation budget was spread across all attention matrices). On RoBERTa, LoRA with only 0.3 million trainable parameters matched or outperformed full fine-tuning of all 125 million. The merged-weight formulation (W = W₀ + BA) enabled inference with zero latency overhead, unlike adapter methods that added 20 to 30 percent latency.
Adoption expanded rapidly. By late 2022, members of the Stable Diffusion community adapted LoRA for fine-tuning the diffusion model's cross-attention layers, enabling the model to learn new visual concepts and styles from a small number of images without full model retraining. Hugging Face integrated LoRA into their PEFT library in February 2023, making it accessible to millions of developers. By 2024 and 2025, major companies including Apple, Microsoft, OpenAI, NVIDIA, Google, and Meta deployed LoRA in production systems, while the research community generated dozens of variants and extensions addressing specific limitations.
For a pre-trained weight matrix $W_0 \in \mathbb{R}^{d \times k}$, LoRA constrains the weight update $\Delta W$ to a low-rank decomposition:
$$W = W_0 + \Delta W = W_0 + BA$$
where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ are the trainable low-rank matrices, with rank $r \ll \min(d, k)$.
During the forward pass, the modified output for input $x$ is:
$$h = W_0 x + \Delta W x = W_0 x + BAx$$
The pre-trained weights $W_0$ remain frozen throughout training. Only the matrices $B$ and $A$ receive gradient updates. Because $r$ is much smaller than both $d$ and $k$, the number of trainable parameters per adapted layer drops from $d \times k$ to $r \times (d + k)$.
This decomposition achieves dramatic parameter reduction. A full update matrix with dimensions 10,000 by 10,000 requires 100 million parameters, but with rank r=8, the low-rank factorization needs only r(d+k) = 160,000 parameters, a 625x reduction.
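The arithmetic is easy to verify directly:

```python
d = k = 10_000             # dimensions of the weight matrix
r = 8                      # LoRA rank

full_update = d * k        # dense delta-W: 100,000,000 parameters
lora_update = r * (d + k)  # B (d x r) plus A (r x k): 160,000 parameters

print(full_update // lora_update)  # 625x reduction
```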
The initialization strategy ensures that the adaptation starts from zero: $A$ is initialized with random Gaussian values, while $B$ is initialized to all zeros.
This means $\Delta W = BA = 0$ at the start of training, so the model begins with the exact behavior of the pre-trained model and gradually learns the task-specific modifications. This guarantees a stable starting point and prevents abrupt, potentially destabilizing changes to the model's behavior.
The LoRA update is scaled by a factor $\alpha / r$, where $\alpha$ is a constant hyperparameter:
$$h = W_0 x + \frac{\alpha}{r} BAx$$
The scaling factor $\alpha / r$ serves to normalize the contribution of the low-rank update regardless of the chosen rank. When $\alpha$ is set equal to $r$, the scaling factor becomes 1 and has no effect. In practice, $\alpha$ is often set to a value that amplifies or attenuates the adapter's contribution. Common choices include setting $\alpha = 2r$ (doubling the adapter's effect) or $\alpha = r$ (neutral scaling). The scaling factor is applied during training and can be absorbed into the merged weights during inference.
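A minimal PyTorch sketch, not Microsoft's loralib or Hugging Face PEFT, ties these pieces together: a frozen base layer, the Gaussian/zero initialization, and the $\alpha/r$ scaling:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a trainable low-rank update (illustrative sketch)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # W_0 (and its bias) stays frozen
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)  # Gaussian init
        self.B = nn.Parameter(torch.zeros(d, r))         # zero init, so BA = 0 at step 0
        self.scale = alpha / r                           # the alpha/r scaling factor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W_0 x + (alpha/r) * B A x
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768), r=8, alpha=16.0)
out = layer(torch.randn(4, 768))  # initially identical to the frozen base layer's output
```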
When adjusting rank during experimentation, maintaining constant $\alpha$ means the effective scaling changes proportionally: if doubling rank from 8 to 16, the same $\alpha$ value automatically halves the per-parameter influence, stabilizing training dynamics. Recent research introduced alternative scaling $\alpha/\sqrt{r}$ (rsLoRA) that prevents gradient collapse at higher ranks, enabling effective use of r>64 where standard scaling fails.
A key advantage of LoRA is that the adapted weights can be merged back into the base model:
$$W_{\text{merged}} = W_0 + \frac{\alpha}{r} BA$$
Once merged, the model has exactly the same architecture and inference cost as the original. There is no additional latency, no extra computation, and no change to the model's structure. This stands in contrast to adapter-based methods, which insert new layers that add latency during inference.
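Continuing the illustrative LoRALinear sketch above, merging reduces to a single in-place weight update:

```python
@torch.no_grad()
def merge(layer: LoRALinear) -> nn.Linear:
    # Fold (alpha/r) * B A into the frozen base weight; the result is an ordinary Linear
    layer.base.weight += layer.scale * (layer.B @ layer.A)
    return layer.base
```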
The gradient flow during backpropagation follows standard chain rule mechanics but only updates the low-rank matrices. For loss $L$, the gradients are $\partial L/\partial B = (\partial L/\partial h)(Ax)^T$ and $\partial L/\partial A = B^T(\partial L/\partial h)x^T$, while the frozen $W_0$ receives no gradient updates. This dramatically reduces optimizer memory requirements. Adam stores two moment estimates per trainable parameter (roughly 3x the parameter count in total), so LoRA eliminates optimizer states for billions of frozen parameters, storing only $3r(d+k)$ additional values per adapted matrix.
Computational complexity remains tractable. The forward pass for $W_0 x$ requires $O(d^2)$ operations while the LoRA component $BAx$ requires $O(r(d+k)) \approx O(2rd)$ for square matrices, minimal overhead when $r \ll d$. For GPT-3 175B with $d=12{,}288$ and $r=4$, each LoRA-modified layer adds approximately 100,000 operations versus 150 million for the frozen matrix, a negligible 0.07 percent increase. During inference, weight merging eliminates even this small overhead, producing a single matrix $W = W_0 + BA$ with identical dimensions and computational requirements as the original.
The original LoRA paper focused on the self-attention weight matrices in the transformer architecture. Each multi-head attention module contains four weight matrices:
| Matrix | Role | Typical dimensions |
|---|---|---|
| $W_q$ | Query projection | $d_{\text{model}} \times d_{\text{model}}$ |
| $W_k$ | Key projection | $d_{\text{model}} \times d_{\text{model}}$ |
| $W_v$ | Value projection | $d_{\text{model}} \times d_{\text{model}}$ |
| $W_o$ | Output projection | $d_{\text{model}} \times d_{\text{model}}$ |
Hu et al. found that adapting both $W_q$ and $W_v$ together yielded the best results when the total parameter budget was held constant. With 18 million trainable parameters, distributing rank across two weight matrices outperformed concentrating all parameters in a single matrix at higher rank.
However, subsequent research (notably by Sebastian Raschka, 2023, and the QLoRA paper) has shown that applying LoRA to all linear layers in the transformer, including the feed-forward (MLP) projections (gate_proj, up_proj, down_proj) and the output projection, can noticeably improve performance, even though it increases the number of trainable parameters roughly fivefold. Attention-only configurations can significantly underperform this broader targeting while saving relatively little memory. The current best practice for most applications is therefore to apply LoRA to all linear layers rather than just the attention projections, often via auto-detection features such as target_modules="all-linear" in modern PEFT versions.
The rank $r$ is the single most important hyperparameter in LoRA. It controls the expressiveness of the low-rank update.
Hu et al. found that LoRA performs surprisingly well even at very low ranks. On GPT-3 175B, the results across different ranks were:
| Rank ($r$) | Target modules | WikiSQL accuracy | MNLI accuracy |
|---|---|---|---|
| 1 | $W_q$, $W_v$ | 73.4% | 91.3% |
| 2 | $W_q$, $W_v$ | 73.3% | 91.4% |
| 4 | $W_q$, $W_v$ | 73.7% | 91.3% |
| 8 | $W_q$, $W_v$ | 73.8% | 91.6% |
| 64 | $W_q$, $W_v$ | 73.6% | - |
These results show that even a rank of 1 or 2 captures most of the adaptation benefit, and that r=64 (with 301.9M parameters) offered no meaningful gain over r=8 on this task. Subspace analysis confirmed that the subspaces spanned by the top singular vectors of the $r = 8$ and $r = 64$ adapters overlap significantly, suggesting that the adaptation indeed lies in a very low-dimensional subspace.
The appropriate rank depends on the complexity of the adaptation task:
| Use case | Recommended rank | Rationale |
|---|---|---|
| Simple style or format adaptation | $r = 4$ to $8$ | Small changes require few parameters |
| Domain-specific knowledge injection | $r = 16$ to $32$ | Moderate new information |
| Complex multi-task adaptation | $r = 32$ to $64$ | Diverse task requirements |
| Highly complex or data-rich domains | $r = 64$ to $256$ | Datasets significantly different from pre-training data |
A common starting point is $r = 8$ for most tasks (or $r = 16$ as a reliable general default), increasing if validation performance plateaus. For a 7B parameter model, rank 8 adds approximately 4.2 million trainable parameters (0.06%), while rank 256 adds approximately 20.3 million (0.3%).
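The ~4.2 million figure can be reproduced from Llama 2 7B's dimensions (32 layers, hidden size 4,096), assuming LoRA targets only the query and value projections:

```python
n_layers, d_model, r = 32, 4096, 8

per_matrix = r * (d_model + d_model)  # r * (d + k) for one square projection
per_layer = 2 * per_matrix            # W_q and W_v are both adapted
total = n_layers * per_layer

print(total)                          # 4,194,304 -> the ~4.2M (0.06%) quoted above
```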
The $\alpha$ hyperparameter controls the magnitude of the LoRA update relative to the pre-trained weights; practical recommendations are summarized in the table below.
The effective learning rate of the LoRA parameters is proportional to $\alpha / r$, so increasing $\alpha$ while holding $r$ constant has a similar effect to increasing the learning rate. Research shows $\alpha$ and the optimizer learning rate are mathematically interchangeable, so a common workflow is to fix $\alpha$ early then adjust only the learning rate and rank. For extreme cases, alpha can be scaled down post-training (for example multiplying by 0.5) to reduce overfitting, effectively averaging between pre-trained and fine-tuned weights.
| Hyperparameter | Typical range | Recommended start | Purpose |
|---|---|---|---|
| Rank ($r$) | 4 to 256 | 16 | Controls adaptation capacity |
| Alpha ($\alpha$) | $r$ to $4r$ | $2r$ | Scales LoRA influence |
| Learning rate | 1e-4 to 5e-4 | 2e-4 | Update step size |
| Dropout | 0 to 0.1 | 0.05 | Regularization |
| Epochs | 1 to 5 | 1 to 3 | Training iterations |
| Effective batch size | 8 to 32 | 16 | Training stability |
The original paper demonstrated LoRA's effectiveness on GPT-3 175B, one of the largest models available at the time:
| Method | Trainable parameters | WikiSQL | MNLI-m | SAMSum (R1/R2/RL) |
|---|---|---|---|---|
| Full fine-tuning | 175.3B | 73.8% | 89.5% | 52.0 / 28.0 / 44.5 |
| LoRA ($r = 4$) | 4.7M | 73.4% | 91.7% | 53.8 / 29.8 / 45.9 |
| LoRA ($r = 8$) | 37.7M | 74.0% | 91.6% | 53.4 / 29.2 / 45.1 |
LoRA matched or exceeded full fine-tuning on all three benchmarks while training only about 0.003 percent of the parameters. Training was roughly 25 percent faster (43.1 tokens per second with LoRA versus 32.5 for full fine-tuning on a V100), GPU memory dropped from 1.2 TB to 350 GB, and per-task checkpoints shrank from 350 GB to 35 MB.
On GPT-2, LoRA showed zero inference overhead compared to adapter methods:
| Method | Latency (batch=1) | Overhead |
|---|---|---|
| Full fine-tuning / LoRA | 19.8 ms | Baseline |
| Adapter (Lin et al.) | 23.9 ms | +20.7% |
| Adapter (Houlsby et al.) | 25.8 ms | +30.3% |
| Model | Method | Trainable params | MNLI accuracy | WikiSQL accuracy | SAMSum Rouge-L | Storage size | VRAM required |
|---|---|---|---|---|---|---|---|
| GPT-3 175B | Full FT | 175,255.8M | 89.5% | 73.8% | 44.5 | 350 GB | 1,200 GB |
| GPT-3 175B | LoRA (r=4) | 4.7M | 91.7% | 73.4% | 45.9 | 35 MB | 350 GB |
| RoBERTa base | Full FT | 125.0M | 87.6% | - | - | ~500 MB | ~8 GB |
| RoBERTa base | LoRA | 0.3M | 87.5% | - | - | ~2 MB | ~4 GB |
| DeBERTa XXL | Full FT | 1,500M | 91.8% | - | - | ~6 GB | ~50 GB |
| DeBERTa XXL | LoRA | 4.7M | 91.9% | - | - | ~19 MB | ~16 GB |
| Llama 2 7B | Full FT | 6,738M | - | - | - | ~14 GB | 73 GB (2 GPU) |
| Llama 2 7B | LoRA (r=16) | 4.2M | - | - | - | ~8 MB | 14 GB (1 GPU) |
On RoBERTa large, LoRA with 0.8 million parameters reached 89.0 percent average accuracy versus 88.9 percent for full fine-tuning with 355 million parameters, a 444x reduction with superior results. On MNLI with only 100 training examples, LoRA achieved 63.8 percent accuracy versus 60.2 percent for full fine-tuning, demonstrating LoRA's advantage in low-data regimes where the low-rank constraint acts as strong regularization.
QLoRA (Quantized LoRA) was introduced by Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer in May 2023. It combines 4-bit quantization of the base model with LoRA adapters, reducing memory requirements enough to fine-tune a 65-billion-parameter model on a single 48 GB GPU while preserving full 16-bit fine-tuning performance. The paper was presented at NeurIPS 2023.
QLoRA introduces three technical innovations:
| Innovation | Description |
|---|---|
| 4-bit NormalFloat (NF4) | A new data type that is information-theoretically optimal for normally distributed weights. It quantizes the base model to 4 bits using a format specifically designed for the weight distributions found in neural networks. |
| Double quantization | Quantizes the quantization constants themselves, reducing the average memory footprint of the quantization overhead. This second layer of compression saves approximately 0.37 bits per parameter. |
| Paged optimizers | Uses NVIDIA unified memory to manage memory spikes during training. When GPU memory runs out, optimizer states are automatically paged to CPU memory, preventing out-of-memory errors during gradient checkpointing. |
The base model is quantized to 4-bit precision (NF4) and frozen. LoRA adapter matrices ($B$ and $A$) are added in floating-point 16-bit (BFloat16) precision. During the forward pass, the 4-bit weights are dequantized to BFloat16 for computation. Gradients flow through the frozen quantized model into the LoRA adapters, which are the only parameters updated.
QLoRA's flagship model family, Guanaco, outperformed all previously released open models on the Vicuna benchmark, reaching 99.3% of ChatGPT's performance level with only 24 hours of fine-tuning on a single GPU.
Memory and speed trade-offs compared to standard LoRA (for a 7B model):
| Metric | LoRA (16-bit) | QLoRA (4-bit) |
|---|---|---|
| GPU memory | ~21 GB | ~14 GB |
| Training time | ~1.85 hours | ~2.79 hours |
| Memory savings | Baseline | ~33% reduction |
| Speed difference | Baseline | ~39% slower |
QLoRA trades increased training time for significantly lower memory usage, with minimal impact on final model quality.
Since the original paper, several variants have been proposed to address specific limitations of standard LoRA.
AdaLoRA (Adaptive LoRA) was introduced by Qingru Zhang, Minshuo Chen, Alexander Bukharin, Nikos Karampatziakis, Pengcheng He, Yu Cheng, Weizhu Chen, and Tuo Zhao at ICLR 2023. Standard LoRA uses the same rank for all adapted weight matrices, but different layers and weight types may vary significantly in their importance for a given task. AdaLoRA addresses this by parameterizing incremental updates in the form of singular value decomposition (SVD), dividing each update into triplets of singular values and their corresponding left and right singular vectors. During training, an importance scoring mechanism evaluates each triplet's contribution to model performance and prunes low-importance singular values (setting them to zero), effectively reducing the rank for unimportant weight matrices while allocating more capacity to critical ones. AdaLoRA shows notable improvements over standard LoRA in low-budget settings.
DoRA (Weight-Decomposed Low-Rank Adaptation) was introduced by Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen (NVIDIA Research Taiwan) in February 2024 and accepted as an ICML 2024 Oral (1.5 percent acceptance rate). DoRA decomposes pre-trained weights into separate magnitude and direction components:
$$W' = \bar{m} \cdot \frac{W_0 + BA}{\|W_0 + BA\|_c}$$
where $\bar{m}$ is a learnable magnitude vector and $\|\cdot\|_c$ denotes the column-wise norm. The direction component is updated through LoRA, while the magnitude is learned independently. This decomposition more closely mirrors the learning patterns observed in full fine-tuning and yields consistent improvements over LoRA:
| Model | Benchmark | DoRA improvement over LoRA |
|---|---|---|
| LLaMA-7B | Commonsense reasoning | +3.7% |
| LLaMA-13B | Commonsense reasoning | +1.0% |
| LLaMA2-7B | Commonsense reasoning | +2.9% |
| LLaMA3-8B | Commonsense reasoning | +4.4% |
DoRA can be merged into the base weights before deployment, adding zero inference overhead. A half-rank variant (DoRA with $r/2$) can exceed standard LoRA's performance using only 50 percent of the trainable parameters. QDoRA extends the approach to quantized training, combining QLoRA's memory efficiency with DoRA's performance gains.
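A minimal sketch of DoRA's forward computation, assuming a (d, k) weight whose columns are normalized over the output dimension with one learnable magnitude per column (not NVIDIA's reference implementation):

```python
import torch

def dora_forward(x, W0, A, B, m):
    """Illustrative DoRA forward pass (a sketch, not the reference code).

    Shapes: W0 (d, k) frozen; A (r, k) and B (d, r) trainable LoRA pair; m (k,) magnitudes.
    """
    W = W0 + B @ A                               # LoRA-style update determines the direction
    direction = W / W.norm(dim=0, keepdim=True)  # normalize each column
    return x @ (m * direction).T                 # re-apply learned per-column magnitudes

d, k, r = 64, 32, 4
out = dora_forward(torch.randn(2, k), torch.randn(d, k),
                   torch.randn(r, k) * 0.01, torch.zeros(d, r), torch.ones(k))
```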
LoRA+ was proposed by Soufiane Hayou, Nikhil Ghosh, and Bin Yu in 2024 (ICML 2024, ICLR 2025). The key observation is that in standard LoRA, both matrices $A$ and $B$ use the same learning rate, but theoretical analysis of wide networks shows this is suboptimal. Matrix $A$ controls the projection of the input into a lower-dimensional space, while $B$ controls the reverse projection. LoRA+ assigns different learning rates to these two matrices, with matrix B's learning rate typically set 16x higher than matrix A. In experiments, LoRA+ improves fine-tuning speed by up to 2x and performance by 1 to 2 percent at the same computational cost as standard LoRA, at the expense of one additional hyperparameter (the learning rate ratio).
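In a plain PyTorch training loop, the LoRA+ idea reduces to optimizer parameter groups; a sketch assuming a model whose adapter parameters follow PEFT's lora_A/lora_B naming and the paper's suggested 16x ratio:

```python
import torch

lr, ratio = 2e-4, 16   # base learning rate; assumed B-to-A learning-rate ratio

# `model` is assumed to be a LoRA-wrapped model using PEFT's lora_A / lora_B naming
params_A = [p for n, p in model.named_parameters() if "lora_A" in n]
params_B = [p for n, p in model.named_parameters() if "lora_B" in n]

optimizer = torch.optim.AdamW([
    {"params": params_A, "lr": lr},          # A keeps the base rate
    {"params": params_B, "lr": lr * ratio},  # B trains with a 16x higher rate
])
```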
rsLoRA was proposed by Damjan Kalajdzievski in 2023. The standard LoRA scaling factor $\alpha / r$ causes learning to slow down as rank increases, which limits the practical benefit of using higher ranks. rsLoRA replaces the scaling factor with $\alpha / \sqrt{r}$, which maintains gradient stability at higher ranks. This simple change allows practitioners to use larger ranks and trade increased training compute for better fine-tuning performance without any change in inference cost.
PiSSA (Principal Singular Values and Singular Vectors Adaptation) was accepted as a NeurIPS 2024 Spotlight. It initializes adapter matrices with principal components from singular value decomposition of original weights rather than random noise. This "train principal components, freeze residuals" approach contrasts with LoRA's "train noise and zero" strategy, achieving dramatically faster convergence and superior accuracy. On Mistral-7B, PiSSA reached 72.86 percent on GSM8K versus LoRA's 67.7 percent (a 5.16 point gain). On LLaMA-3-70B with quantization (QPiSSA), performance hit 86.05 percent versus QLoRA's 81.73 percent.
VeRA (Vector-based Random Matrix Adaptation) was introduced by Dawid Jan Kopiczko, Tijmen Blankevoort, and Yuki M. Asano at ICLR 2024. VeRA uses a single pair of frozen random matrices shared across all layers and learns only small scaling vectors per layer. This reduces the number of trainable parameters by over 10x compared to LoRA while approaching or matching LoRA's performance, particularly on larger models (7B and 13B parameters). For GPT-3, VeRA reduces adapter parameters from 75.5M to 2.8M (a 97 percent reduction) while maintaining performance, a storage efficiency critical for deploying hundreds of task-specific modules.
| Variant | Year | Key idea | Parameter efficiency vs. LoRA | Performance vs. LoRA |
|---|---|---|---|---|
| LoRA | 2021 | Low-rank decomposition $BA$ | Baseline | Baseline |
| AdaLoRA | 2023 | Adaptive rank allocation via SVD | Similar | Better in low-budget settings |
| QLoRA | 2023 | 4-bit quantization + LoRA | ~33% less memory | Comparable |
| rsLoRA | 2023 | Scaling factor $\alpha/\sqrt{r}$ | Same | Better at high ranks |
| LoRA+ | 2024 | Different learning rates for $A$, $B$ | Same | +1-2%, up to 2x faster |
| DoRA | 2024 | Magnitude/direction decomposition | Better (50% params for same quality) | +1-4% on reasoning |
| PiSSA | 2024 | SVD-based initialization | Same | Faster convergence, +5% on GSM8K |
| VeRA | 2024 | Shared random matrices + learned vectors | 10x fewer parameters | Comparable |
| LongLoRA | 2023 | Sparse attention during fine-tuning | Same | Extends context to 100k tokens |
| S-LoRA | 2023 | Serving-layer batching and memory management | N/A | Enables 1,000 adapters per GPU |
LoRA is one of several parameter-efficient fine-tuning approaches. These methods can be broadly categorized by where they intervene in the model: modifying weights, altering the architecture, or manipulating activations. LoRA's unique strength lies in its ability to modify weights in a powerful yet non-invasive way that disappears at inference time. Each method makes different trade-offs between parameter efficiency, performance, and inference cost.
| Method | Approach | Trainable params (7B model) | Inference latency | Merges into base model |
|---|---|---|---|---|
| Full fine-tuning | Update all weights | 100% (~7B) | Baseline | N/A |
| LoRA ($r = 16$) | Low-rank weight updates | ~0.24% (~16.8M) | No overhead (merged) | Yes |
| LoRA ($r = 64$) | Low-rank weight updates | ~0.96% (~67M) | No overhead (merged) | Yes |
| Adapters (Houlsby) | Insert small neural networks after sublayers | ~0.48% (~33.5M) | +20-30% | No |
| Prefix tuning | Prepend learnable vectors to keys/values | ~0.07% (~5.2M) | Minimal overhead | No |
| Prompt tuning | Prepend learnable embeddings to input | ~0.006% (~0.4M) | Minimal overhead | No |
| IA3 | Learn element-wise scaling vectors | ~0.009% (~0.6M) | No overhead (merged) | Yes |
| BitFit | Fine-tune only bias terms | <0.1% | None | Yes |
Adapter tuning was one of the earliest PEFT techniques, introduced by Houlsby et al. in 2019. It involves inserting small, fully connected "bottleneck" layers between the existing layers of a pre-trained model. These adapter modules typically consist of a down-projection, a non-linearity, and an up-projection. During fine-tuning, only the parameters of these new adapter layers are trained. While both methods add a small number of trainable parameters, adapters add new modules to the model's computational graph that remain during inference, introducing a 20 to 30 percent latency penalty for small batch sizes. LoRA, by contrast, modifies the existing weight matrices through a low-rank update that can be merged back into the original weights, resulting in zero inference latency.
Prefix-tuning (Li & Liang, 2021) and its simplification, prompt tuning, operate on the model's activations rather than its weights. These methods freeze the entire pre-trained model and instead learn a small set of continuous vectors, or "soft prompts," that are prepended to the input sequence or the hidden states of each layer. These learned prefixes steer the model's behavior towards the desired task without modifying any of its parameters. While prefix/prompt tuning can be highly parameter-efficient and can sometimes work even with black-box API access to a model, LoRA is generally considered more expressive and often achieves higher performance because it has more direct control over the model's computations.
BitFit (Bias-term Fine-tuning), introduced by Ben Zaken et al. in 2021, is one of the most minimal PEFT methods. It freezes all of the model's weight matrices and fine-tunes only the bias parameters and the task-specific classification head. Since bias terms constitute a tiny fraction of a model's total parameters (often less than 0.1 percent), this approach is extremely parameter-efficient. On GPT-3 175B, BitFit with 14.2M parameters reached 71.3 percent WikiSQL accuracy, reasonable but short of LoRA's 73.4 percent with 4.7M parameters.
LoRA has found wide adoption across multiple domains.
LoRA is used extensively for instruction tuning, domain adaptation, and task-specific fine-tuning of large language models. Models from GPT-2 to GPT-3 175B, Llama 2 (7B, 13B, 70B), Llama 3.1 (8B, 70B, 405B), Mistral-7B, Falcon, Qwen, RoBERTa, and DeBERTa are routinely adapted with LoRA for specialized applications.
The Alpaca-LoRA project demonstrated that it was possible to replicate the instruction-following capabilities of Stanford's Alpaca model by fine-tuning the LLaMA 7B model using LoRA on a single consumer GPU (an RTX 4090) in a matter of hours. IBM deploys Granite 3.0 models with LoRA for hallucination reduction, while enterprises use multiple LoRA adapters as a "menu" of specialized capabilities built atop shared base models. Performance metrics show 95 to 99 percent of full fine-tuning quality with training times around 3 hours on A100 GPUs for 7B models with 50,000 examples. The Guanaco model family, trained using QLoRA, demonstrated that LoRA-based fine-tuning can produce instruction-following models competitive with proprietary systems.
LoRA has become the standard method for customizing Stable Diffusion and other diffusion models for specific styles, characters, or concepts. In late 2022, the technique was adapted for fine-tuning the cross-attention layers within the diffusion model's U-Net architecture. A LoRA adapter for a Stable Diffusion model is typically only 1 to 10 MB, compared to a base model of 2 to 7 GB. This small size has enabled a large ecosystem of community-created adapters shared on platforms like Civitai and Hugging Face. LoRA can fine-tune a Stable Diffusion model with as few as 10 to 50 images, making personalization accessible to individual artists and creators.
CivitAI hosts a large catalog of popular community adapters covering styles, characters, and concepts, while Hugging Face hosts over 60,000 LoRA models spanning text-to-image, image-to-image, and emerging image-to-video categories.
Vision models employ LoRA for image classification, object detection, segmentation, and synthetic image detection. Fine-tuning Vision Transformers (ViT) with LoRA achieves 96 percent validation accuracy on Food-101 with approximately 147,000 trainable parameters versus 86 million for the full model. DeepLabv3 leverages LoRA adapters for semantic segmentation, while Bi-LORA detects GAN and diffusion-generated images. Medical imaging applications use LoRA-adapted vision models for CT scan and X-ray analysis with domain-specific adaptations that preserve general visual understanding.
LoRA is applied to adapt vision-language models such as LLaVA for visual question answering and instruction following. DoRA has shown improvements of 0.6 to 1.9 percent over standard LoRA on visual question answering and video captioning tasks. Llama-3.2-11B-Vision-Instruct fine-tunes with LoRA for medical diagnosis, processing CT scans alongside text descriptions, and BLIP-2 adapters enable multimodal sentiment analysis combining images and captions.
LoRA adapters are used during the reward modeling and policy optimization stages of RLHF pipelines, reducing the memory required to train multiple model copies simultaneously.
Together AI serves hundreds of LoRA adapters atop single base models with pay-per-token pricing, partnering with Salesforce, Zomato, and The Washington Post. Fireworks AI reports 100x cost-efficiency improvements using multi-LoRA with FireAttention, partnering with Cresta for enterprise AI. Microsoft Azure AI deploys Llama 3.1 fine-tuned LoRAs, while Databricks uses LoRA for product description generation. S-LoRA from UC Berkeley demonstrates serving 1,000 LoRAs on a single GPU via dynamic adapter swapping.
Hugging Face PEFT (Parameter-Efficient Fine-Tuning) serves as the primary LoRA implementation with version 0.17.0 and later supporting LoRA, LoRA+, DoRA, QLoRA, AdaLoRA, LoHa, LoKr, VeRA, and other variants. The library integrates seamlessly with Transformers, Diffusers, and Accelerate, training only about 0.19 percent of parameters for models like bigscience/mt0-large. Advanced features include multiple adapter composition, dynamic adapter switching, and sophisticated initialization methods (Gaussian, LoftQ, EVA, OLoRA, PiSSA, CorDA). Recent releases added DoRA support, AWQ/AQLM quantization compatibility, VB-LoRA, and LoRA-FA optimizer optimizations.
A typical workflow using PEFT involves loading a pre-trained model, creating a LoraConfig specifying the rank, alpha, target modules, and dropout, wrapping the model with get_peft_model(), and training normally:
```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")

lora_config = LoraConfig(
    r=16,                       # rank of the update matrices
    lora_alpha=32,              # scaling numerator (effective scale alpha/r = 2)
    target_modules=["c_attn"],  # GPT-2 fuses Q/K/V into c_attn; use ["q_proj", "v_proj"] on Llama-style models
    fan_in_fan_out=True,        # GPT-2 stores attention weights as (transposed) Conv1D layers
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # trainable fraction is well under 1%
```
After training, the adapter can be saved as a small file (a few megabytes) and later loaded and merged into the base model for deployment.
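A sketch of that workflow, continuing the example above (the adapter directory names are hypothetical):

```python
model.save_pretrained("my-lora-adapter")    # writes only the adapter weights (a few MB)

# Later, for deployment: reload the base model, attach the adapter, and merge it in
from peft import PeftModel
base = AutoModelForCausalLM.from_pretrained("gpt2")
merged = PeftModel.from_pretrained(base, "my-lora-adapter").merge_and_unload()
merged.save_pretrained("gpt2-lora-merged")  # a standard checkpoint with zero inference overhead
```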
Microsoft's original LoRA repository provides the foundational loralib PyTorch implementation validated in the ICLR 2022 paper. While Hugging Face PEFT has become the recommended successor for most use cases, the Microsoft implementation remains valuable for understanding core concepts and reproduces the original research results.
PyTorch's torchtune offers native LoRA implementation optimized for the Llama model family with built-in recipes, gradient checkpointing, and FSDP support across versions 0.2 through 0.6. Alternative PyTorch implementations include LoRA-Torch (supporting nn.MultiheadAttention in OpenCLIP with extensible architecture) and lora-pytorch (version 0.2.0 with zero dependencies beyond PyTorch, compatible with CNNs and MLPs).
TensorFlow/Keras 3 provides multi-backend LoRA support (TensorFlow, JAX, PyTorch) through KerasHub with native integration in the Gemma model family. The model.backbone.enable_lora() method offers simple API access with configurable rank and target modules. JAX optimization proves particularly effective for TPU deployments.
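As a sketch of that API (the preset name follows the Gemma quickstart; details may vary by version):

```python
import keras_hub

# Load a Gemma preset and enable rank-4 LoRA on its backbone
gemma_lm = keras_hub.models.GemmaCausalLM.from_preset("gemma_2b_en")
gemma_lm.backbone.enable_lora(rank=4)  # freezes base weights; trains only the adapters
```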
Distributed training frameworks integrate comprehensively: DeepSpeed works seamlessly with PEFT, FSDP supports QDoRA, and Accelerate enables multi-GPU training. Quantization frameworks including bitsandbytes (4-bit and 8-bit for QLoRA), AWQ (supported in PEFT v0.9.0 and later), AQLM (2-bit quantization), and GPTQ all prove compatible with LoRA training.
Model serving platforms enable production deployment. Text Generation Inference (TGI v2.1.1 and later) serves multiple LoRAs simultaneously from Hugging Face Hub with dynamic adapter loading. Vertex AI offers custom multi-adapter handlers, FriendliAI enables per-request adapter switching, and Optimum Neuron fuses LoRA for AWS Inferentia/Trainium. Diffusers supports LoRA for Stable Diffusion, SDXL, and FLUX with DreamBooth+LoRA combinations.
Platform repositories host thousands of public adapters. Hugging Face Hub serves as the primary repository for text-to-image, LLM, and multimodal adapters with one-click loading via PEFT. CivitAI focuses on Stable Diffusion and FLUX models with on-site training, supporting SD1.5, SDXL, Flux, and Pony Diffusion V6 XL. Community features include auto-captioning with WD Tagger 1.4 and JoyCaption, training data sharing, and model card generation.
Based on extensive experimental work by researchers including Sebastian Raschka (2023):
| Hyperparameter | Recommended starting value | Notes |
|---|---|---|
| Rank ($r$) | 8 (simple tasks), 16 (general), 32-64 (complex tasks) | Increase if validation loss plateaus |
| Alpha ($\alpha$) | $2r$ | Adjust based on results; try $r$ or $0.5r$ if unstable |
| Target modules | All linear layers | Outperforms attention-only targeting |
| Dropout | 0.05 to 0.1 | Helps prevent overfitting on small datasets |
| Learning rate | 1e-4 to 3e-4 (AdamW) | Higher than typical full fine-tuning rates |
| Optimizer | AdamW | SGD offers modest memory savings but similar performance |
| Epochs | 1-3 | Multi-epoch training on static data can cause overfitting |
| Effective batch size | 16 | Use gradient accumulation (e.g., batch=2, accum=8) for memory |
Experiments have consistently shown that data quality matters more than quantity for LoRA fine-tuning. In one study, 1,000 curated examples outperformed 50,000 synthetic examples. Multi-epoch training on the same data tends to cause overfitting, so each training pass should ideally see fresh or augmented data.
Choose standard LoRA when training speed is a priority and GPU memory is sufficient. Choose QLoRA when GPU memory is limited (for example, fine-tuning on consumer GPUs with 16 to 24 GB). QLoRA provides approximately 33 percent memory savings at the cost of roughly 39 percent longer training time, with minimal impact on final quality.
Successful LoRA deployment begins with rank selection based on adaptation complexity. Start with r=16 as a reliable default, then adjust based on training behavior. Undertrained models (high validation loss, poor task performance) benefit from increasing rank to 32 to 64, while overtrained models (training loss below 0.2, large train-validation gap) should reduce rank to 8 or lower.
Set alpha to 2r following widespread best practice, and keep alpha fixed when adjusting rank during experimentation; because alpha and the learning rate are largely interchangeable, tuning only the learning rate and rank is usually sufficient. For high ranks exceeding 64, enable rank-stabilized LoRA (use_rslora=True) with $\alpha/\sqrt{r}$ scaling to prevent gradient collapse.
Target all major linear layers for optimal adaptation quality. Use auto-detection features (target_modules="all-linear") in modern PEFT versions to simplify configuration.
Memory optimization proves critical for consumer hardware deployment. Combine QLoRA's 4-bit quantization with standard LoRA training to enable 65B models on 48 GB GPUs: configure bitsandbytes with load_in_4bit=True, bnb_4bit_quant_type="nf4", and bnb_4bit_use_double_quant=True. Enable gradient checkpointing to trade computation for memory. Start with lower ranks (r=8) during development then scale up once workflows stabilize.
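A sketch of that QLoRA configuration using Transformers and PEFT (the model name and hyperparameters are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4 base weights
    bnb_4bit_use_double_quant=True,        # quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", quantization_config=bnb_config, device_map="auto"
)
model.gradient_checkpointing_enable()      # trade computation for memory
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules="all-linear",           # adapt every linear layer
    task_type="CAUSAL_LM",                 # consider use_rslora=True if pushing r above 64
)
model = get_peft_model(model, lora_config)
```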
Inference deployment offers two strategies. Merged weights ($W = W_0 + BA$) provide zero latency overhead but fix the adapter, ideal for single-task deployment. Separate weights enable dynamic adapter switching with minimal latency (2 to 5 percent depending on batch and sequence length), necessary for multi-task serving where different requests need different adapters. Production systems like Text Generation Inference and Fireworks AI serve hundreds of LoRAs by keeping base models in memory and swapping lightweight adapters per request.
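With PEFT, the separate-weights strategy looks roughly like this (adapter paths and names are hypothetical; `base_model` is a loaded base model):

```python
from peft import PeftModel

# Attach two task adapters to one shared base model, then switch per request
model = PeftModel.from_pretrained(base_model, "adapters/summarize", adapter_name="summarize")
model.load_adapter("adapters/sql", adapter_name="sql")

model.set_adapter("sql")        # route this request to the SQL adapter
model.set_adapter("summarize")  # swap back at negligible cost
```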
Monitor training carefully for stability issues. Unstable training (loss spikes, NaN gradients) warrants reducing learning rate by 2 to 5x or enabling mixed precision (fp16=True or bf16=True). Slow convergence suggests increasing learning rate or rank. Large train-validation gaps indicate overfitting; reduce epochs, increase regularization, or lower rank. Training loss below 0.2 generally signals overfitting; consider early stopping or post-training alpha scaling (multiply by 0.5) to reduce LoRA influence.
Model card documentation should specify rank, alpha, target modules, training dataset size, base model version, intended use cases, and known limitations. Version adapters systematically as base models update; LoRA weights trained on Llama-2-7B may not transfer to Llama-3-8B without retraining. Store adapters alongside base model information for reproducibility, and test adapter merging versus dynamic loading to verify inference behavior matches expectations.
For production serving at scale, leverage multi-LoRA frameworks like S-LoRA (1,000 adapters per GPU), TGI multi-adapter support, or Fireworks AI's FireAttention stack. Implement request routing to direct queries to appropriate task-specific adapters. Monitor memory usage as adapter count scales; hundreds of 35 MB adapters remain manageable while thousands may require more sophisticated caching strategies.
One of the most interesting findings from the original LoRA paper is the empirical analysis of intrinsic rank. The authors found that the weight updates learned by LoRA reside in a very low-dimensional subspace. When comparing the subspaces learned at different ranks, the top singular vectors from a rank-8 adapter overlapped significantly with those from a rank-64 adapter. This means the most important directions of adaptation are captured even at very low ranks.
Further analysis of the 48th transformer layer in GPT-3 revealed that $\Delta W$ amplifies directions that are not emphasized in the original weight matrix $W$. The amplification factor reached approximately 21.5x for $r = 4$, suggesting that LoRA learns to selectively boost features that are relevant for the downstream task but underrepresented in the pre-trained weights.
Despite its widespread adoption, LoRA has several known limitations: low-rank updates can trail full fine-tuning on tasks that differ substantially from the pre-training distribution (the cases where higher ranks are recommended above); results are sensitive to rank, alpha, and target-module choices, which still require task-specific experimentation; adapters are tied to the exact base model version they were trained on and generally do not transfer across model families; and serving batches that mix many different adapters requires specialized infrastructure such as S-LoRA.
The LoRA research landscape in 2024 and 2025 focuses on five primary directions.
Initialization strategies represent active investigation with methods like PiSSA, CorDA (Context-Oriented Decomposition Adaptation achieving faster convergence than PiSSA), EVA (Explained Variance Adaptation with data-driven SVD-based initialization), OLoRA (orthogonal initialization of matrices A and B), and LoftQ (quantization-aware initialization) all demonstrating that proper initialization enables substantially faster convergence and better final performance than LoRA's default random initialization.
Dynamic allocation methods address rank distribution challenges. AdaLoRA and DyLoRA enable adaptive parameter budgets where different layers receive different ranks based on task-specific importance rather than uniform distribution. QDyLoRA combines this flexibility with quantization for memory-constrained scenarios. Research shows optimal rank varies dramatically by layer and task, top attention layers often need higher capacity while middle layers suffice with minimal adaptation.
Extreme efficiency variants push parameter reduction to its limits. LoRA-XS achieves over 100x storage reduction compared to standard LoRA by inserting tiny trainable matrices between frozen SVD components. VeRA demonstrated that a 97 percent parameter reduction (75.5M to 2.8M) remains viable, and 1LoRA explores single-parameter-per-module adaptation, probing the extremes of compression.
Theoretical understanding deepens through computational complexity analysis. Research published June 2024 (updated June 2025) examines phase transition behavior in LoRA efficiency, proving almost linear approximation algorithms exist for certain rank regimes.
Multi-task and reasoning capabilities expand through specialized architectures. X-LoRA introduces mixture-of-experts gating for dynamic LoRA selection at token and layer granularity. Research on "Tina: Tiny Reasoning Models via LoRA" explores memory-efficient reasoning through LoRA decomposition. Agent-based systems use LoRA modules as tools, with research surveying LoRA-driven agents where different adapters represent specialized agent roles or capabilities.
A comprehensive survey published July 2024 (updated October 2024, arXiv:2407.11046) categorizes LoRA methods into downstream adaptation improvements, cross-task generalization approaches, efficiency-improving techniques, and data privacy-preserving methods.
Open problems demanding attention include optimal rank selection heuristics (currently requiring task-specific experimentation), target module selection guidance, hyperparameter tuning complexity, memory optimization for edge deployment, efficient batching with mixed adapters (different samples using different LoRAs), adapter pruning strategies, catastrophic forgetting in sequential multi-task learning, zero-shot adapter transfer across model families, and dynamic adapter selection without routing models. Theoretical challenges include formal approximation quality bounds, understanding intrinsic dimension requirements, convergence guarantees, and generalization bounds from statistical learning theory perspectives.