LoRA

Low-rank Adaptation (LoRA) is a parameter-efficient fine-tuning technique that reduces trainable parameters by up to 10,000× while matching or exceeding full fine-tuning performance. By freezing pre-trained model weights and injecting trainable low-rank decomposition matrices into neural network layers, LoRA cuts the GPU memory required to fine-tune massive models like GPT-3 175B roughly threefold, adds zero inference latency overhead, and brings fine-tuning of smaller models within reach of consumer hardware.[1] Originally developed at Microsoft Research in 2021 for large language models, LoRA has become the dominant adaptation method across AI domains including diffusion models, vision transformers, and multimodal systems, with over 60,000 public adapters deployed on platforms like Hugging Face and CivitAI.[2]

Core concept and mathematical foundations

LoRA addresses a fundamental challenge in modern AI: as models scale to billions of parameters, traditional fine-tuning becomes prohibitively expensive. The technique builds on theoretical work demonstrating that pre-trained language models have low "intrinsic dimensionality"—they can learn effectively in much smaller subspaces than their full parameter count suggests.[3] LoRA applies this insight by hypothesizing that weight updates during adaptation also possess low intrinsic rank, meaning most adaptation can occur along relatively few dimensions.

The mathematical formulation elegantly decomposes weight updates. For any pre-trained weight matrix W₀ ∈ ℝ^(d×k), LoRA constrains updates via low-rank decomposition: W₀ + ΔW = W₀ + BA, where B ∈ ℝ^(d×r) and A ∈ ℝ^(r×k) with rank r ≪ min(d,k). During training, W₀ remains frozen while only the low-rank matrices B and A are updated. The forward pass becomes h = W₀x + (α/r)BAx, where α is a scaling hyperparameter typically set to twice the rank.[1]
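As a minimal illustration, this forward pass can be written in a few lines of PyTorch; the dimensions, rank, and initialization scale below are arbitrary placeholders rather than values from the paper:

import torch

d, k, r, alpha = 512, 512, 8, 16            # illustrative dimensions and hyperparameters
W0 = torch.randn(d, k)                       # frozen pre-trained weight W₀
A = torch.randn(r, k) * 0.01                 # trainable low-rank factor A (small random init)
B = torch.zeros(d, r)                        # trainable low-rank factor B (zero init, so BA = 0)

x = torch.randn(k)                           # input activation
h = W0 @ x + (alpha / r) * (B @ (A @ x))     # h = W₀x + (α/r)BAx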

This decomposition achieves dramatic parameter reduction. A full update matrix with dimensions 10,000×10,000 requires 100 million parameters, but with rank r=8, the low-rank factorization needs only r(d+k) = 160,000 parameters—a 625× reduction. The initialization strategy ensures continuity: matrix A receives a small random initialization (a random Gaussian in the original paper; Kaiming-uniform in common implementations) while B initializes to zero, guaranteeing ΔW = BA = 0 at training start so the model begins from the exact pre-trained weights.[1]

The theoretical justification comes from subspace analysis showing that singular value decompositions of low-rank and higher-rank LoRA matrices share significant overlap in their top singular directions. Research demonstrates that adaptation matrices truly have very low rank—even r=1 or r=2 suffices for many tasks despite full-rank dimensions exceeding 12,000.[1] Importantly, LoRA's learned updates amplify directions that were de-emphasized during pre-training, enabling task-specific feature learning without catastrophic forgetting of pre-trained knowledge.

History and development

Eight researchers at Microsoft Corporation created LoRA, with Edward J. Hu and Yelong Shen as equal first authors alongside Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. The team first submitted their paper "LoRA: Low-Rank Adaptation of Large Language Models" to arXiv on June 17, 2021 (arXiv:2106.09685), with publication at the International Conference on Learning Representations (ICLR 2022).[1][4] Edward Hu, who invented LoRA during his AI residency at Microsoft Research, later joined OpenAI and pursued doctoral studies under Yoshua Bengio at Mila in Montreal.[5]

The work built directly on groundbreaking research by Armen Aghajanyan and colleagues who published "Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning" in December 2020.[3] That paper demonstrated pre-trained language models could be effectively fine-tuned by optimizing in extremely small subspaces—achieving 90% of full fine-tuning performance on RoBERTa with just 200 trainable parameters through random projection. This empirical evidence that learned models reside on low intrinsic dimensions provided the theoretical foundation for LoRA's hypothesis that weight updates during adaptation also have low intrinsic rank.

The original LoRA paper validated the approach across multiple architectures. On GPT-3 175B, LoRA with rank r=4 achieved 73.4% accuracy on WikiSQL versus 73.8% for full fine-tuning while reducing trainable parameters from 175 billion to just 4.7 million (roughly 18 million when applied to all attention projection matrices).[1][6] On RoBERTa, LoRA with only 0.3 million parameters outperformed full fine-tuning's 125 million parameters. The breakthrough enabled inference with zero latency overhead by merging weights (W = W₀ + BA) during deployment, unlike adapter methods that added 20-30% latency.

Adoption exploded rapidly. By late 2022, members of the Stable Diffusion community adapted LoRA for fine-tuning the diffusion model's cross-attention layers, enabling the model to learn new visual concepts and styles from a small number of images without full model retraining.[7] Hugging Face integrated LoRA into their PEFT library in February 2023, making it accessible to millions of developers.[2] By 2024-2025, major companies including Apple, Microsoft, OpenAI, NVIDIA, Google, and Meta deployed LoRA in production systems, while the research community generated dozens of variants and extensions addressing specific limitations.

Technical architecture and algorithm

Adapter Injection and Initialization

LoRA integrates seamlessly into transformer architectures by targeting specific linear layers. For transformer attention mechanisms with model dimension d_model (typically 768 to 12,288), LoRA modifies the query, key, value, and output projection matrices. Each projection matrix W_q ∈ ℝ^(d_model × d_model) receives corresponding LoRA matrices B_q ∈ ℝ^(d_model × r) and A_q ∈ ℝ^(r × d_model), transforming the forward pass to h = W_q x + B_q A_q x.[1]

The initialization strategy is crucial for stable training. Matrix A receives a small random initialization (a random Gaussian in the original paper; Kaiming-uniform in common implementations such as Hugging Face PEFT), while matrix B is initialized to zero. This ensures that at the beginning of training, the adapted model is identical to the base model (since BA = 0), providing a stable starting point and preventing abrupt, potentially destabilizing changes to the model's behavior. The adaptation is learned gradually as B's weights move away from zero during training.[1]
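A hedged sketch of how this injection and initialization might look as a PyTorch module wrapping an existing linear layer (the class name LoRALinear and its details are illustrative, not the reference implementation):

import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear with a trainable low-rank update (illustrative sketch)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)                 # freeze W₀
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.empty(r, base.in_features))
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # B = 0, so BA = 0 at start
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))          # random (Kaiming-uniform) init for A
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)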

Gradient Flow and Backpropagation

The gradient flow during backpropagation follows standard chain rule mechanics but only updates the low-rank matrices. For loss L, the gradients are ∂L/∂B = (∂L/∂h)(Ax)^T and ∂L/∂A = B^T(∂L/∂h)x^T, while the frozen W₀ receives no gradient updates. This dramatically reduces optimizer memory requirements—because Adam stores two moment estimates per trainable parameter (roughly tripling weight memory), LoRA eliminates those optimizer states for billions of frozen parameters, storing only 3r(d+k) additional values (adapter weights plus their two moment estimates) for the low-rank matrices.[1]

Computational complexity remains tractable. The forward pass for W₀x requires O(d²) operations while the LoRA component BAx requires O(r(d+k)) ≈ O(2rd) for square matrices—minimal overhead when r ≪ d. For GPT-3 175B with d=12,288 and r=4, each LoRA-modified layer adds approximately 100,000 operations versus 150 million for the frozen matrix, a negligible 0.07% increase. During inference, weight merging eliminates even this small overhead, producing a single matrix W = W₀ + BA with identical dimensions and computational requirements as the original.[1]

Rank-Stabilized Scaling

The rank-stabilized scaling factor α/r serves critical purposes beyond simple magnitude control. When adjusting rank during experimentation, maintaining constant α means the effective scaling changes proportionally—if doubling rank from 8 to 16, the same α value automatically halves the per-parameter influence, stabilizing training dynamics. Recent research introduced alternative scaling α/√r (rsLoRA) that prevents gradient collapse at higher ranks, enabling effective use of r>64 where standard scaling fails.[8]
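The effect of the two schemes is easy to see numerically; a small illustrative computation (the α value is arbitrary):

alpha = 16
for r in (8, 16, 64, 256):
    standard = alpha / r            # LoRA scaling shrinks quickly as rank grows
    stabilized = alpha / r ** 0.5   # rsLoRA scaling decays more slowly, remaining useful at high ranks
    print(f"r={r:<4}  alpha/r={standard:.3f}  alpha/sqrt(r)={stabilized:.3f}")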

Parameter configuration and hyperparameters

The rank hyperparameter r fundamentally controls LoRA's capacity-efficiency tradeoff. Rank values of 4-8 provide minimal adaptation suitable for style and formatting changes without new knowledge, ranks 16-32 handle standard instruction tuning and domain adaptation, ranks 32-64 enable complex task learning, and ranks 64-256 support scenarios requiring substantial new knowledge injection.[9] The original paper demonstrated that even r=1 performs competitively on many tasks—a remarkable finding suggesting adaptation's intrinsic dimension is extraordinarily low.[1] For a 7B parameter model, rank 8 adds approximately 4.2 million trainable parameters (0.06%), while rank 256 adds approximately 20.3 million (0.3%).

The alpha parameter α scales LoRA updates via α/r before adding to frozen weights. Common practice sets α = 2r, producing an effective scaling factor of 2, though α = r (scaling factor 1) offers more conservative adaptation.[10] Crucially, research shows α and learning rate are mathematically interchangeable—tuning α roughly equals tuning learning rate with appropriate initialization. The recommended workflow fixes α early then adjusts only learning rate and rank, simplifying hyperparameter search. For extreme cases, alpha can be scaled down post-training (multiplying by 0.5) to reduce overfitting, effectively averaging between pre-trained and fine-tuned weights.

Target module selection significantly impacts performance. While original research focused on query and value projection matrices (W_q, W_v) in attention layers, subsequent work—particularly the QLoRA paper—demonstrated that targeting all linear layers including attention projections (q_proj, k_proj, v_proj, o_proj) and MLP layers (gate_proj, up_proj, down_proj) substantially improves results.[11] Attention-only configurations underperform substantially while saving only marginal memory. Modern best practice enables LoRA on all major linear layers for optimal adaptation quality.

Additional hyperparameters include dropout (typically 0.05-0.1, though recent research questions its reliability for short fine-tuning runs), learning rate (2e-4 to 5e-4 for standard LoRA, significantly higher than full fine-tuning's 1e-5 to 3e-5), and training epochs (1-3 recommended to prevent overfitting).[9] For distributed training, effective batch size of 16 balances stability and performance, often achieved via small per-device batches (2-4) with gradient accumulation (4-8 steps).

LoRA Hyperparameter Recommendations
Hyperparameter       | Typical Range  | Recommended Start | Purpose
Rank (r)             | 4-256          | 16                | Controls adaptation capacity
Alpha (α)            | r to 4r        | 2r                | Scales LoRA influence
Learning Rate        | 1e-4 to 5e-4   | 2e-4              | Update step size
Dropout              | 0 to 0.1       | 0.05              | Regularization
Epochs               | 1-5            | 3                 | Training iterations
Effective Batch Size | 8-32           | 16                | Training stability

Advantages over traditional fine-tuning

LoRA achieves 10,000× parameter reduction while maintaining or exceeding full fine-tuning performance. On GPT-3 175B, LoRA with 4.7 million trainable parameters matched full fine-tuning's 175 billion parameters on WikiSQL (73.4% vs 73.8%) and actually surpassed it on MNLI-matched (91.7% vs 89.5%).[1] On RoBERTa base, LoRA with 0.3 million parameters outperformed full fine-tuning's 125 million parameters across the GLUE benchmark (87.2% vs 86.4% average accuracy). This performance parity stems from LoRA's implicit regularization—the low-rank constraint forces models to learn only the most essential adaptations, preventing overfitting particularly on small datasets.

Memory savings transform hardware requirements. Full fine-tuning GPT-3 175B demands 1.2TB VRAM across multiple high-end GPUs, while LoRA reduces this to 350GB—a 3× reduction.[1] For 7B parameter models, full fine-tuning requires 2+ GPUs totaling 73GB VRAM, but LoRA enables single-GPU training on 14-18GB consumer hardware. Storage efficiency proves equally dramatic: LoRA checkpoints occupy 35MB versus 350GB for full models, enabling deployment of 100 task-specific adapters in 3.5GB versus 35TB for 100 full models. Optimizer state savings compound benefits—eliminating Adam states for frozen parameters saves an additional 2/3 VRAM.

Training speed increases alongside memory efficiency. GPT-3 175B processes 43.1 tokens/second per V100 GPU with LoRA versus 32.5 tokens/second for full fine-tuning—a 25% throughput improvement.[1] This acceleration comes from fewer gradient computations and reduced memory bandwidth requirements. Variants like LoRA+ achieve 2× training speedups through differential learning rates for matrices A and B,[12] while specialized frameworks like Unsloth report additional 2× performance gains through kernel optimization.

Zero inference latency represents a unique LoRA advantage. By merging weights before deployment (W = W₀ + BA), the adapted model requires identical computation as the original—no additional forward pass operations, no dynamic adapter selection overhead.[1] This contrasts sharply with adapter-based methods that impose 20-30% latency penalties in online scenarios with small batch sizes. For production serving, a single base model can dynamically load different LoRA adapters (35MB each) to switch tasks in seconds, whereas full fine-tuned models require loading entire 350GB checkpoints.
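A conceptual sketch of the merge, with randomly generated tensors standing in for trained weights:

import torch

d, k, r, alpha = 512, 512, 8, 16
W0 = torch.randn(d, k)                            # frozen pre-trained weight
A, B = torch.randn(r, k), torch.randn(d, r)       # trained LoRA factors (random here for illustration)

with torch.no_grad():
    W_merged = W0 + (alpha / r) * (B @ A)         # single dense matrix, same shape as W₀

# Sanity check: merged and unmerged forward passes agree.
x = torch.randn(k)
assert torch.allclose(W_merged @ x, W0 @ x + (alpha / r) * (B @ (A @ x)), atol=1e-4)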

The technique excels in low-data regimes where full fine-tuning typically overfits. On MNLI with only 100 training examples, LoRA achieved 63.8% accuracy versus 60.2% for full fine-tuning.[1] The low-rank constraint acts as strong regularization, preventing catastrophic forgetting of pre-trained capabilities—LoRA models maintain better performance on out-of-domain tasks and preserve more general knowledge while specializing for target tasks. Research titled "LoRA Learns Less and Forgets Less" confirms this fundamental tradeoff.[13]

Limitations and performance boundaries

Mathematical reasoning and complex programming represent LoRA's most significant performance gaps. Research from Anyscale and Columbia University/Databricks consistently shows full fine-tuning substantially outperforms LoRA on math datasets like GSM8k—the largest gap observed across benchmarks.[14] Complex code generation similarly shows 4-6% lower performance with LoRA, with full fine-tuning requiring fewer training examples to reach equivalent capabilities. The low-rank constraint appears insufficient for capturing intricate multi-step reasoning patterns these domains demand.

Rank selection remains frustratingly task-dependent with no universal heuristics. Optimal rank varies from 1 to 256+ depending on model size, dataset characteristics, and task complexity—practitioners must explore this hyperparameter space empirically for each configuration.[15] Setting rank too low creates insufficient capacity for task-specific information, while excessive rank increases memory usage and risks overfitting with diminishing returns. The original paper demonstrated performance saturation—increasing rank beyond 4-8 provided minimal benefit on tested tasks, yet later research shows higher ranks (64-128) prove essential for complex domains.

Learning stability requires careful hyperparameter tuning. Anyscale research documented "drastic differences in evaluation loss" from small learning rate changes, noting LoRA demands more careful tuning than full fine-tuning for stable training.[14] In SQL task experiments, reducing learning rate from 1e-4 to 3e-5 proved necessary to stabilize training, illustrating the "trickier optimization landscape" created by limited parameters.

Recent research revealed "intruder dimensions"—high-ranking singular vectors in LoRA updates orthogonal to pre-trained weight structure.[16] While LoRA matches target task accuracy, these structural differences cause models to become worse predictors of pre-training distributions and adapt less robustly to sequential multi-task learning. Higher-rank, rank-stabilized LoRA mitigates this issue but sacrifices efficiency advantages.

The low-rank decomposition can discard information, introducing minor quality losses in models that are not heavily over-parameterized.[6] Batch size sensitivity poses practical constraints—research demonstrates LoRA performance degrades faster than full fine-tuning as batch size increases, independent of rank configuration.[17] When dataset size exceeds LoRA's inherent parameter storage capacity, such as in continued pretraining scenarios with millions of examples, full fine-tuning becomes necessary.

Applications across AI domains

Large language models represent LoRA's primary application domain. Models from GPT-2 to GPT-3 175B, Llama 2 (7B, 13B, 70B), Llama 3.1 (8B, 70B, 405B), Mistral-7B, Falcon, Qwen, RoBERTa, and DeBERTa leverage LoRA for instruction tuning, domain adaptation, and task specialization.[1][18] Common use cases include:

  • SQL query generation and semantic parsing
  • Grammar correction and text classification
  • Legal document analysis and medical terminology adaptation
  • Financial report generation and risk analysis
  • Multi-task learning with dynamic adapter switching
  • Question answering and dialogue summarization

The Alpaca-LoRA project demonstrated that it was possible to replicate the instruction-following capabilities of Stanford's Alpaca model by fine-tuning the LLaMA 7B model using LoRA on a single consumer GPU (an RTX 4090) in a matter of hours.[19] IBM deploys Granite 3.0 models with LoRA for hallucination reduction,[6] while enterprises use multiple LoRA adapters as a "menu" of specialized capabilities built atop shared base models. Performance metrics show 95-99% of full fine-tuning quality with training times around 3 hours on A100 GPUs for 7B models with 50,000 examples.

Diffusion model applications exploded in popularity with Stable Diffusion and FLUX adopters creating thousands of style, character, and technique LoRAs. In late 2022, the technique was adapted for fine-tuning the cross-attention layers of diffusion models within the U-Net architecture.[7] CivitAI hosts popular adapters including:

  • Style specialization: Arcane style, Studio Ghibli aesthetic, pixel art, cyberpunk themes, oil painting, anime
  • Artist styles: Greg Rutkowski, WLOP, Jim Lee styles with 1-10MB file sizes
  • Character replication: Shinobu Kochou, Zero Two, Princess Zelda with consistent generation
  • Technical enhancements: Detail Tweaker, Hyper-SD (119,000+ downloads for speed optimization)
  • Control methods: Stability AI's Control-LoRA for depth maps, canny edges, sketch colorization

The small file size of LoRA adapters (often 1-10 MB versus 2-7GB for full checkpoints) makes them easy to distribute and combine.[20] This has led to the rise of platforms where creators can share thousands of LoRA files, allowing users to mix and match adapters to achieve highly customized results. Hugging Face hosts over 60,000 LoRA models with categories spanning text-to-image, image-to-image, and emerging image-to-video.[21]
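A hedged sketch of loading and blending adapters with the Diffusers library; the adapter repository names below are placeholders, not real models:

import torch
from diffusers import AutoPipelineForText2Image

pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Load two adapters (placeholder repository ids) and blend their influence.
pipe.load_lora_weights("some-user/pixel-art-lora", adapter_name="pixel")
pipe.load_lora_weights("some-user/watercolor-lora", adapter_name="watercolor")
pipe.set_adapters(["pixel", "watercolor"], adapter_weights=[0.8, 0.5])

image = pipe("a castle on a hill at sunset").images[0]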

Vision models employ LoRA for image classification, object detection, segmentation, and synthetic image detection. Fine-tuning Vision Transformers (ViT) with LoRA achieves 96% validation accuracy on Food-101 with approximately 147,000 trainable parameters versus 86 million for the full model.[22] DeepLabv3 leverages LoRA adapters for semantic segmentation, while Bi-LORA detects GAN and diffusion-generated images. Medical imaging applications use LoRA-adapted vision models for CT scan and X-ray analysis with domain-specific adaptations that preserve general visual understanding.

Multimodal vision-language models represent rapid growth areas. Llama-3.2-11B-Vision-Instruct fine-tunes with LoRA for medical diagnosis processing CT scans alongside text descriptions.[23] BLIP-2 adapters enable multimodal sentiment analysis combining images and captions. Novel architectures include:

  • Vision as LoRA (VoRA): Integrates vision directly into LLMs through LoRA layers without separate vision encoders[24]
  • Mixture of LoRA (MixLoRA): Handles multimodal instruction tuning with conditional mixtures reducing task interference
  • LoCAL/SV-RAG: Uses dual LoRA adapters for retrieval and question answering on hundreds of visually-rich pages[25]

Production deployments at scale validate LoRA's practical viability. Together AI serves hundreds of LoRA adapters atop single base models with pay-per-token pricing, partnering with Salesforce, Zomato, and The Washington Post.[26] Fireworks AI reports 100× cost-efficiency improvements using multi-LoRA with FireAttention, partnering with Cresta for enterprise AI.[27] Microsoft Azure AI deploys Llama 3.1 fine-tuned LoRAs, while Databricks uses LoRA for product description generation. S-LoRA from UC Berkeley demonstrates serving 1,000 LoRAs on single GPU via dynamic adapter swapping.[28]

Variant methods and extensions

QLoRA (Quantized LoRA) emerged in May 2023 from University of Washington researchers, enabling 65B parameter model fine-tuning on single 48GB GPUs. The technique introduces 4-bit NormalFloat (NF4)—an information-theoretically optimal data type for normally distributed weights—combined with double quantization that reduces memory by quantizing the quantization constants themselves.[11] Paged optimizers manage memory spikes during training. The resulting Guanaco model reached 99.3% of ChatGPT performance with 24 hours training on consumer hardware. QLoRA trades 33% memory reduction for 39% increased runtime due to quantization/dequantization overhead but democratizes large model fine-tuning. Integration into Hugging Face PEFT made QLoRA the de facto standard for memory-constrained scenarios.

DoRA (Weight-Decomposed Low-Rank Adaptation) from NVIDIA Research Taiwan, published February 2024 and accepted as ICML 2024 Oral (1.5% acceptance rate), decomposes pre-trained weights into magnitude and direction components.[29] The formulation W' = m × (V + ΔV)/||V + ΔV|| applies LoRA only to the directional component while training magnitude separately, better mimicking full fine-tuning's learning patterns. DoRA achieves +3.7% improvement on LLaMA-7B and +1.0% on LLaMA-13B for commonsense reasoning, consistently outperforming LoRA across LLaMA, LLaVA, and VL-BART with zero additional inference overhead. QDoRA extends the approach to quantized training, combining QLoRA's memory efficiency with DoRA's performance gains.[30]

AdaLoRA (March 2023, ICLR 2023) dynamically allocates parameter budgets among weight matrices using importance scores derived from singular value decomposition.[31] It parameterizes weight updates using singular value decomposition and prunes the least important singular values during training. Different layers receive different ranks based on task requirements—concentrating parameters where they matter most—outperforming uniform-rank LoRA with identical total parameter budgets. The approach addresses LoRA's suboptimal uniform distribution assumption, though requiring calling update_and_allocate() during training adds implementation complexity.

LoRA+ (February 2024) achieves 1-2% accuracy improvements and up to 2× training speedup through the surprisingly simple modification of using different learning rates for adapter matrices A and B.[12] Setting matrix B's learning rate significantly higher (typically 16×) than matrix A addresses the suboptimality of equal learning rates, particularly for large-width networks. The technique requires no additional computational cost beyond standard LoRA—merely a learning rate scheduler adjustment.
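A hedged sketch of how such a learning-rate split might be configured for a PEFT-wrapped model, grouping parameters by the lora_A/lora_B naming convention; the 16× ratio follows the paper's recommendation, while the base learning rate is illustrative:

import torch

# model: a PEFT-wrapped model (as produced by get_peft_model); not defined in this snippet.
base_lr, ratio = 2e-4, 16
a_params = [p for n, p in model.named_parameters() if "lora_A" in n and p.requires_grad]
b_params = [p for n, p in model.named_parameters() if "lora_B" in n and p.requires_grad]

optimizer = torch.optim.AdamW([
    {"params": a_params, "lr": base_lr},          # matrix A trains at the base rate
    {"params": b_params, "lr": base_lr * ratio},  # matrix B trains ~16× faster (LoRA+)
])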

PiSSA (Principal Singular Values and Singular Vectors Adaptation), accepted as NeurIPS 2024 Spotlight, initializes adapter matrices with principal components from singular value decomposition of original weights rather than random noise.[32] This "train principal components, freeze residuals" approach contrasts with LoRA's "train noise and zero" strategy, achieving dramatically faster convergence and superior accuracy. On Mistral-7B, PiSSA reached 72.86% on GSM8K versus LoRA's 67.7% (+5.16%). On LLaMA-3-70B with quantization (QPiSSA), performance hit 86.05% versus QLoRA's 81.73%.

VeRA (Vector-based Random Matrix Adaptation) freezes a single pair of random low-rank matrices shared across all layers, training only small scaling vectors d and b.[33] This radical approach achieves 10× fewer trainable parameters than LoRA—reducing GPT-3 adapter parameters from 75.5M to 2.8M—while maintaining performance. The storage efficiency proves critical for deploying hundreds of task-specific modules.

Additional notable variants include:

  • LongLoRA: Extends context windows efficiently for long-document tasks through sparse local attention during fine-tuning, adapting models like LLaMA 2 to handle inputs of up to 100k tokens[34]
  • LoRA-FA (Frozen-A LoRA): Provides 1.4× memory reduction by freezing projection-down matrix A[35]
  • DyLoRA: Enables 4-7× faster training through dynamic rank sampling across a range during training[36]
  • LoRA-XS: Achieves 100× storage reduction by inserting small trainable matrices between frozen low-rank SVD components
  • Tied-LoRA: Applies weight tying to the low-rank matrices to further improve parameter efficiency[37]
  • LoHa and LoKr: Use Hadamard and Kronecker products for alternative decompositions
  • MoLE (Mixture of LoRA Experts): Treats layer-wise LoRAs as distinct experts with learnable gating for multi-task scenarios

Comparison of Major LoRA Variants
Variant         | Core Innovation                                                        | Key Benefit(s)                                              | Primary Use Case
LoRA (Baseline) | Low-rank decomposition of weight updates (ΔW = BA)                     | Parameter & memory efficiency, no inference latency         | General-purpose efficient fine-tuning
QLoRA           | Combines LoRA with 4-bit quantization of the base model                | Drastically reduced memory footprint                        | Fine-tuning very large models on single, consumer-grade GPUs
DoRA            | Decomposes weights into magnitude and direction; fine-tunes direction  | Improved training stability and performance over LoRA       | Tasks requiring higher fidelity or more stable training
QDoRA           | Hybrid of QLoRA and DoRA                                               | Memory-efficient high performance                           | Memory-constrained scenarios requiring best accuracy
AdaLoRA         | Dynamically allocates rank based on layer importance scores            | Better performance for a given parameter budget             | Scenarios with very strict parameter or memory constraints
LoRA+           | Uses different learning rates for the A and B matrices                 | Faster convergence and improved performance                 | Optimizing training speed and final model quality
PiSSA           | SVD-based initialization of adapter matrices                           | Dramatically faster convergence, better accuracy            | Complex tasks requiring fast training
LongLoRA        | Uses sparse attention during fine-tuning to extend context             | Efficiently adapts models to handle longer input sequences  | Fine-tuning for long-document summarization, RAG, etc.
VeRA            | Shared frozen matrices with trainable scaling vectors                  | 10× fewer parameters than LoRA                              | Extreme parameter efficiency requirements
S-LoRA          | Optimized memory management and batching for serving                   | Enables serving thousands of LoRA adapters concurrently     | Production environments with many customized models

Comparison with other PEFT methods

LoRA is one of several distinct approaches within the PEFT landscape. These methods can be broadly categorized by where they intervene in the model: modifying weights, altering the architecture, or manipulating activations. LoRA's unique strength lies in its ability to modify weights in a powerful yet non-invasive way that disappears at inference time.

Adapter Tuning

Adapter tuning (or "Adapters") was one of the earliest PEFT techniques, introduced by Houlsby et al. in 2019. It involves inserting small, fully connected "bottleneck" layers between the existing layers of a pre-trained model.[38] These adapter modules typically consist of a down-projection, a non-linearity, and an up-projection. During fine-tuning, only the parameters of these new adapter layers are trained.

Key Difference from LoRA: While both methods add a small number of trainable parameters, adapters add new modules to the model's computational graph. These modules remain during inference, which can introduce a 20-30% latency penalty for small batch sizes. LoRA, by contrast, modifies the existing weight matrices through a low-rank update that can be merged back into the original weights, resulting in zero inference latency. This "train-as-adapter, deploy-as-native" capability is a primary reason for LoRA's widespread adoption over traditional adapters for latency-sensitive applications.

Prefix-Tuning and Prompt Tuning

Prefix-tuning (Li & Liang, 2021) and its simplification, Prompt Tuning, operate on the model's activations rather than its weights.[39] These methods freeze the entire pre-trained model and instead learn a small set of continuous vectors, or "soft prompts," that are prepended to the input sequence or the hidden states of each layer. These learned prefixes steer the model's behavior towards the desired task without modifying any of its parameters.

Key Difference from LoRA: The primary distinction is the target of modification. LoRA updates the model's internal parameters (weights), whereas prefix/prompt tuning manipulates the model's internal states (activations) through modified inputs. While prefix/prompt tuning can be highly parameter-efficient and can sometimes work even with black-box API access to a model, LoRA is generally considered more expressive and often achieves higher performance because it has more direct control over the model's computations.

BitFit

BitFit (Bias-term Fine-tuning) is one of the most spartan PEFT methods, introduced by Ben Zaken et al. in 2021. It involves freezing all of the model's main weight matrices and fine-tuning only the bias parameters and the task-specific classification head.[40] Since bias terms constitute a tiny fraction of a model's total parameters (often <0.1%), this approach is extremely parameter-efficient.

Key Difference from LoRA: BitFit modifies a very sparse subset of existing parameters, whereas LoRA introduces new parameters to approximate a dense update to the main weight matrices. While BitFit is highly efficient, its expressive power is limited because it can only shift the activation functions' outputs, not change the learned relationships within the weight matrices. On GPT-3 175B, BitFit with 14.2M parameters reached 71.3% WikiSQL accuracy—reasonable but underperforming LoRA's 73.4% with 4.7M parameters.

Comparison of PEFT Methods
Method           | Trainable Parameters              | GPU Memory Reduction        | Inference Latency                      | Performance on Benchmarks        | Forgetting of Pre-trained Knowledge
Full Fine-Tuning | 100% (for example 175B for GPT-3) | None                        | None                                   | High (baseline)                  | High[16]
LoRA             | 0.01%-0.1%                        | High (roughly 3× on GPT-3)  | None                                   | Matches/Exceeds FT[1]            | Low
Adapters         | 0.3%-6%                           | Moderate                    | +20-30%                                | Slightly below FT                | Moderate[1]
Prefix-Tuning    | Small (prefix vectors)            | High                        | Reduces usable sequence length         | Non-monotonic; good in low-data  | Low[41]
BitFit           | <0.1%                             | High                        | None                                   | Below LoRA                       | Low

Implementation frameworks and ecosystem

Hugging Face PEFT (Parameter-Efficient Fine-Tuning) serves as the primary LoRA implementation with version 0.17.0+ supporting LoRA, LoRA+, DoRA, QLoRA, AdaLoRA, LoHa, LoKr, VeRA, and other variants.[2] The library integrates seamlessly with Transformers, Diffusers, and Accelerate, training only 0.19% of parameters for models like bigscience/mt0-large. Advanced features include multiple adapter composition, dynamic adapter switching, and sophisticated initialization methods (Gaussian, LoftQ, EVA, OLoRA, PiSSA, CorDA). Recent releases added DoRA support, AWQ/AQLM quantization compatibility, VB-LoRA, and LoRA-FA optimizer optimizations.

Example Python code using PEFT:

from peft import get_peft_model, LoraConfig
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
lora_config = LoraConfig(
    r=16,                        # rank of the low-rank update matrices
    lora_alpha=32,               # scaling numerator (effective scale alpha/r = 2)
    target_modules=["c_attn"],   # GPT-2's fused attention projection; Llama-style models use names like "q_proj", "v_proj"
    lora_dropout=0.05,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA matrices remain trainable

Microsoft's original LoRA repository provides the foundational loralib PyTorch implementation validated in the ICLR 2022 paper.[42] While Hugging Face PEFT has become the recommended successor for most use cases, the Microsoft implementation remains valuable for understanding core concepts and reproduces the original research results.

PyTorch's torchtune offers native LoRA implementation optimized for the Llama model family with built-in recipes, gradient checkpointing, and FSDP support across versions 0.2 through 0.6.[18] Alternative PyTorch implementations include LoRA-Torch (supporting nn.MultiheadAttention in OpenCLIP with extensible architecture) and lora-pytorch (version 0.2.0 with zero dependencies beyond PyTorch, compatible with CNNs and MLPs).

TensorFlow/Keras 3 provides multi-backend LoRA support (TensorFlow, JAX, PyTorch) through KerasHub with native integration in the Gemma model family.[43] The model.backbone.enable_lora() method offers simple API access with configurable rank and target modules. JAX optimization proves particularly effective for TPU deployments.
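A brief sketch of that workflow (the preset name is illustrative and availability depends on the installed KerasHub version):

import keras_hub

gemma_lm = keras_hub.models.GemmaCausalLM.from_preset("gemma_2b_en")  # preset name may vary
gemma_lm.backbone.enable_lora(rank=4)   # freeze base weights and inject rank-4 LoRA adapters
gemma_lm.summary()                      # trainable parameter count drops dramatically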

Distributed training frameworks integrate comprehensively: DeepSpeed works seamlessly with PEFT, FSDP supports QDoRA, and Accelerate enables multi-GPU training. Quantization frameworks including bitsandbytes (4-bit and 8-bit for QLoRA), AWQ (supported in PEFT v0.9.0+), AQLM (2-bit quantization), and GPTQ all prove compatible with LoRA training.

Model serving platforms enable production deployment. Text Generation Inference (TGI v2.1.1+) serves multiple LoRAs simultaneously from Hugging Face Hub with dynamic adapter loading.[44] Vertex AI offers custom multi-adapter handlers, FriendliAI enables per-request adapter switching, and Optimum Neuron fuses LoRA for AWS Inferentia/Trainium. Diffusers supports LoRA for Stable Diffusion, SDXL, and FLUX with DreamBooth+LoRA combinations.[45]

Platform repositories host thousands of public adapters. Hugging Face Hub serves as the primary repository for text-to-image, LLM, and multimodal adapters with one-click loading via PEFT.[21] CivitAI focuses on Stable Diffusion and FLUX models with on-site training, supporting SD1.5, SDXL, Flux, and Pony Diffusion V6 XL.[20] Community features include auto-captioning with WD Tagger 1.4 and JoyCaption, training data sharing, and model card generation.

Performance metrics and comparative analysis

Comprehensive benchmarking reveals LoRA's strengths and boundaries across model scales and task types. On GPT-3 175B, LoRA with 4.7 million trainable parameters (0.0027% of base model) achieved 73.4% WikiSQL accuracy versus 73.8% for full fine-tuning with 175 billion parameters, and 91.7% MNLI-matched accuracy actually surpassing full fine-tuning's 89.5%.[1] On SAMSum dialogue summarization, LoRA scored 53.8/29.8/45.9 (Rouge-1/2/L) compared to 52.0/28.0/44.5 for full fine-tuning.

RoBERTa on GLUE benchmark demonstrates parameter efficiency extremes. Full fine-tuning's 125 million parameters achieved 86.4% average accuracy while LoRA with 0.3 million parameters (0.24% of full model) reached 87.2%—outperforming with 417× fewer parameters.[1] RoBERTa large showed similar patterns: 355 million parameters at 88.9% accuracy versus 0.8 million LoRA parameters at 89.0%, a 444× reduction with superior results.

Rank sensitivity analysis on GPT-3 WikiSQL revealed performance plateaus. Rank r=1 achieved 73.4% accuracy with 4.7M parameters, r=2 reached 73.3% with 9.4M parameters, r=4 scored 73.7% with 18.8M parameters, r=8 hit 73.8% with 37.7M parameters, and r=64 with 301.9M parameters yielded only 73.6%—demonstrating minimal benefit from higher ranks on this task and validating the low intrinsic rank hypothesis.[1]

Memory and speed comparisons quantify practical advantages. GPT-3 175B full fine-tuning requires 1,200GB VRAM at 32.5 tokens/second per V100, while LoRA needs 350GB at 43.1 tokens/second—a simultaneous 3× memory reduction and 25% speed increase.[1] For 7B models, full fine-tuning demands 73.3GB across 2 GPUs whereas LoRA operates on 14.18GB single GPU (rank 8, attention only) or 16.62GB (all layers).

LoRA vs Full Fine-tuning Performance Comparison
Model        | Method      | Trainable Params | MNLI Accuracy | WikiSQL Accuracy | SAMSum Rouge-L | Storage Size | VRAM Required
GPT-3 175B   | Full FT     | 175,255.8M       | 89.5%         | 73.8%            | 44.5           | 350 GB       | 1,200 GB
GPT-3 175B   | LoRA (r=4)  | 4.7M             | 91.7%         | 73.4%            | 45.9           | 35 MB        | 350 GB
RoBERTa base | Full FT     | 125.0M           | 87.6%         | -                | -              | ~500 MB      | ~8 GB
RoBERTa base | LoRA        | 0.3M             | 87.5%         | -                | -              | ~2 MB        | ~4 GB
DeBERTa XXL  | Full FT     | 1,500M           | 91.8%         | -                | -              | ~6 GB        | ~50 GB
DeBERTa XXL  | LoRA        | 4.7M             | 91.9%         | -                | -              | ~19 MB       | ~16 GB
Llama 2 7B   | Full FT     | 6,738M           | -             | -                | -              | ~14 GB       | 73 GB (2 GPU)
Llama 2 7B   | LoRA (r=16) | 4.2M             | -             | -                | -              | ~8 MB        | 14 GB (1 GPU)

Performance gaps emerge in specific domains. Mathematical reasoning on GSM8k shows LoRA's largest underperformance—research consistently reports "substantial gaps" though exact magnitudes vary by configuration.[14] Complex programming tasks demonstrate 4-6% lower scores, with full fine-tuning requiring fewer examples to reach equivalent capability. Typical instruction tuning and domain adaptation tasks show 2% absolute difference (95% vs 97% accuracy patterns), while many NLU tasks favor LoRA by 1-3% due to implicit regularization preventing overfitting.

Low-data regime analysis reveals LoRA advantages. On MNLI with only 100 training examples, LoRA achieved 63.8% accuracy versus 60.2% for full fine-tuning.[1] The low-rank constraint provides strong regularization when data scarcity would cause full fine-tuning to overfit, making LoRA particularly effective for few-shot adaptation scenarios with hundreds to low thousands of examples.

Deployment considerations and best practices

Successful LoRA deployment begins with rank selection based on adaptation complexity. Start with r=16 as a reliable default, then adjust based on training behavior.[9] Undertrained models (high validation loss, poor task performance) benefit from increasing rank to 32-64, while overtrained models (training loss <0.2, large train-validation gap) should reduce rank to 8 or lower. Simple style and formatting tasks succeed with r=4-8, standard instruction tuning works with r=16-32, complex domain adaptation requires r=32-64, and substantial knowledge injection demands r=64-256.

Set alpha to 2r (twice the rank) following widespread best practice, producing an effective scaling factor of 2.[10] Crucially, maintain constant alpha when adjusting rank during experimentation—research demonstrates alpha and learning rate are mathematically interchangeable, so fixing alpha early then tuning only learning rate simplifies hyperparameter search. For high ranks exceeding 64, enable rank-stabilized LoRA (use_rslora=True) with α/√r scaling to prevent gradient collapse.[8]

Target all major linear layers for optimal adaptation quality. While original research focused on query and value projections, comprehensive studies prove targeting all layers (q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj) substantially improves results.[11] Use auto-detection features (target_modules="all-linear") in modern PEFT versions to simplify configuration.

Training hyperparameters require careful tuning. Set learning rate to 2e-4 as the standard LoRA starting point (2-10× higher than full fine-tuning's typical 1e-5 to 3e-5).[9] Train for 1-3 epochs maximum—training beyond 3 epochs typically causes overfitting. Use effective batch size of 16 through gradient accumulation (for example batch_size=2 with gradient_accumulation_steps=8 for memory efficiency). Enable gradient checkpointing to reduce activation memory at minimal computational cost. Apply moderate weight decay (0.01) and dropout (0.05) for regularization.
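A hedged sketch of these settings expressed with the Transformers Trainer API; the output path is a placeholder and the values follow the recommendations above:

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="lora-run",                 # placeholder path
    learning_rate=2e-4,                    # standard LoRA starting point
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,         # effective batch size of 16
    gradient_checkpointing=True,           # trade compute for activation memory
    weight_decay=0.01,
    bf16=True,                             # mixed precision if supported by the hardware
    logging_steps=10,
)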

Memory optimization proves critical for consumer hardware deployment. Combine QLoRA's 4-bit quantization with standard LoRA training to enable 65B models on 48GB GPUs—configure bitsandbytes with load_in_4bit=True, bnb_4bit_quant_type="nf4", and bnb_4bit_use_double_quant=True.[11] Enable gradient checkpointing (gradient_checkpointing=True) to trade computation for memory. Start with lower ranks (r=8) during development then scale up once workflows stabilize.
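A hedged sketch of such a QLoRA setup with bitsandbytes and PEFT; the model id is a placeholder:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # 4-bit NormalFloat
    bnb_4bit_use_double_quant=True,          # quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",              # placeholder; any causal LM checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, target_modules="all-linear"))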

Inference deployment offers two strategies. Merged weights (W = W₀ + BA) provide zero latency overhead but fix the adapter—ideal for single-task deployment.[1] Separate weights enable dynamic adapter switching with minimal latency (2-5% depending on batch and sequence length)—necessary for multi-task serving where different requests need different adapters. Production systems like Text Generation Inference and Fireworks AI serve hundreds of LoRAs by keeping base models in memory and swapping lightweight adapters per request.[44][27]
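A hedged sketch of both strategies using PEFT; the base model and adapter repository ids are placeholders:

from peft import PeftModel
from transformers import AutoModelForCausalLM

# Strategy 1: merge for zero-overhead, single-task deployment.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")    # placeholder id
merged = PeftModel.from_pretrained(base, "my-org/sql-adapter").merge_and_unload()

# Strategy 2: keep adapters separate and switch per request (fresh base model).
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = PeftModel.from_pretrained(base, "my-org/sql-adapter", adapter_name="sql")
model.load_adapter("my-org/summarization-adapter", adapter_name="summarize")
model.set_adapter("summarize")   # route subsequent requests to a different task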

Monitor training carefully for stability issues. Unstable training (loss spikes, NaN gradients) warrants reducing learning rate by 2-5× or enabling mixed precision (fp16=True or bf16=True). Slow convergence suggests increasing learning rate or rank. Large train-validation gaps indicate overfitting—reduce epochs, increase regularization, or lower rank. Training loss <0.2 generally signals overfitting; consider early stopping or post-training alpha scaling (multiply by 0.5) to reduce LoRA influence.

Model card documentation should specify rank, alpha, target modules, training dataset size, base model version, intended use cases, and known limitations. Version adapters systematically as base models update—LoRA weights trained on Llama-2-7B may not transfer to Llama-3-8B without retraining. Store adapters alongside base model information for reproducibility, and test adapter merging versus dynamic loading to verify inference behavior matches expectations.

For production serving at scale, leverage multi-LoRA frameworks like S-LoRA (1,000 adapters per GPU), TGI multi-adapter support, or Fireworks AI's FireAttention stack.[28][44][27] Implement request routing to direct queries to appropriate task-specific adapters. Monitor memory usage as adapter count scales—hundreds of 35MB adapters remain manageable while thousands may require more sophisticated caching strategies.

Current research directions and future work

The LoRA research landscape in 2024-2025 focuses on five primary directions. Initialization strategies represent active investigation with methods like PiSSA, CorDA (Context-Oriented Decomposition Adaptation achieving faster convergence than PiSSA), EVA (Explained Variance Adaptation with data-driven SVD-based initialization), OLoRA (orthogonal initialization of matrices A and B), and LoftQ (quantization-aware initialization) all demonstrating that proper initialization enables substantially faster convergence and better final performance than LoRA's default random initialization.[32][46]

Dynamic allocation methods address rank distribution challenges. AdaLoRA and DyLoRA enable adaptive parameter budgets where different layers receive different ranks based on task-specific importance rather than uniform distribution.[31][36] QDyLoRA combines this flexibility with quantization for memory-constrained scenarios.[47] Research shows optimal rank varies dramatically by layer and task—top attention layers often need higher capacity while middle layers suffice with minimal adaptation.

Extreme efficiency variants push parameter reduction limits. LoRA-XS achieves over 100× storage reduction compared to standard LoRA by inserting tiny trainable matrices between frozen SVD components. VeRA demonstrated that a roughly 97% parameter reduction (75.5M → 2.8M on GPT-3 adapters) remains viable.[33] 1LoRA introduces single-parameter per module adaptation showing extreme compression possibilities.[48]

Theoretical understanding deepens through computational complexity analysis. Research published June 2024 (updated June 2025) examines phase transition behavior in LoRA efficiency, proving almost linear approximation algorithms exist for certain rank regimes.[49] Work titled "LoRA Learns Less and Forgets Less" reveals the fundamental tradeoff: LoRA retains more original capabilities (less forgetting) but learns less efficiently in domains distant from pre-training (less learning).[13]

Multi-task and reasoning capabilities expand through specialized architectures. X-LoRA introduces mixture-of-experts gating for dynamic LoRA selection at token and layer granularity. Research on "Tina: Tiny Reasoning Models via LoRA" explores memory-efficient reasoning through LoRA decomposition. Agent-based systems use LoRA modules as tools, with research surveying LoRA-driven agents where different adapters represent specialized agent roles or capabilities.

A comprehensive survey published July 2024 (updated October 2024, arXiv 2407.11046) categorizes LoRA methods into downstream adaptation improvements, cross-task generalization approaches, efficiency-improving techniques, and data privacy-preserving methods.[46] The maintained GitHub repository tracks rapid developments as the field evolves.

Open problems demanding attention include optimal rank selection heuristics (currently requiring task-specific experimentation), target module selection guidance (which layers to adapt for which tasks), hyperparameter tuning complexity, memory optimization for edge deployment, efficient batching with mixed adapters (different samples using different LoRAs), adapter pruning strategies, catastrophic forgetting in sequential multi-task learning, zero-shot adapter transfer across model families, and dynamic adapter selection without routing models. Theoretical challenges include formal approximation quality bounds, understanding intrinsic dimension requirements, convergence guarantees, and generalization bounds from statistical learning theory perspectives.

References

  1. Hu, Edward J., et al. "LoRA: Low-Rank Adaptation of Large Language Models." arXiv preprint arXiv:2106.09685 (2021). https://arxiv.org/abs/2106.09685
  2. Hugging Face PEFT Documentation. https://huggingface.co/docs/peft
  3. Aghajanyan, Armen, et al. "Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning." arXiv preprint arXiv:2012.13255 (2020). https://arxiv.org/abs/2012.13255
  4. LoRA: Low-Rank Adaptation of Large Language Models. OpenReview ICLR 2022. https://openreview.net/forum?id=nZeVKeeFYf9
  5. Edward Hu personal website. https://edwardjhu.com/about/
  6. IBM. "What is LoRA (Low-Rank Adaption)?" 2024. https://www.ibm.com/think/topics/lora
  7. Cuenca, Pedro; Paul, Sayak (2023). "Using LoRA for Efficient Stable Diffusion Fine-Tuning." Hugging Face blog. https://huggingface.co/blog/lora
  8. Kalajdzievski, Damjan. "Rank-Stabilized LoRA: Unlocking the Potential of LoRA Fine-Tuning." Hugging Face Blog, 2024. https://huggingface.co/blog/damjan-k/rslora
  9. Raschka, Sebastian. "Practical Tips for Finetuning LLMs Using LoRA." Magazine, 2024. https://magazine.sebastianraschka.com/p/practical-tips-for-finetuning-llms
  10. Determined AI. "Finding the best LoRA parameters." 2024. https://www.determined.ai/blog/lora-parameters
  11. Dettmers, Tim, et al. "QLoRA: Efficient Finetuning of Quantized LLMs." arXiv preprint arXiv:2305.14314 (2023). https://arxiv.org/abs/2305.14314
  12. Hayou, Soufiane, et al. "LoRA+: Efficient Low Rank Adaptation of Large Models." arXiv preprint arXiv:2402.12354 (2024).
  13. Biderman, Liu, et al. "LoRA Learns Less and Forgets Less." arXiv preprint (2024).
  14. Anyscale. "Fine-Tuning LLMs: LoRA or Full-Parameter? An In-Depth Analysis with Llama 2." 2023. https://www.anyscale.com/blog/fine-tuning-llms-lora-or-full-parameter-an-in-depth-analysis-with-llama-2
  15. Zhang, Jiaming, et al. "PLoRA: Efficient LoRA Hyperparameter Tuning for Large Models." arXiv preprint arXiv:2508.02932 (2024). https://arxiv.org/html/2508.02932v1
  16. Biderman, Davison, et al. "LoRA vs Full Fine-tuning: An Illusion of Equivalence." arXiv preprint arXiv:2410.21228 (2024).
  17. Gradient Flow. "Customizing LLMs: When to Choose LoRA or Full Fine-Tuning." 2024. https://gradientflow.com/lora-or-full-fine-tuning/
  18. PyTorch. "Fine-Tuning Llama2 with LoRA." torchtune documentation. https://docs.pytorch.org/torchtune/stable/tutorials/lora_finetune.html
  19. tloen. "Alpaca-LoRA: Instruction-tuning LLaMA on consumer hardware." GitHub. https://github.com/tloen/alpaca-lora
  20. CivitAI. "LoRA Models and How to Use Them with Stable Diffusion." 2024. https://civitai.com/articles/2099/
  21. Hugging Face. "Models tagged with 'lora'." https://huggingface.co/models?other=lora
  22. Hugging Face. "Image classification using LoRA." PEFT documentation. https://huggingface.co/docs/peft/main/en/task_guides/image_classification_lora
  23. Jiang, Haixin, et al. "Multimodal Instruction Tuning with Conditional Mixture of LoRA." arXiv preprint arXiv:2402.15896 (2024).
  24. Vision as LoRA. arXiv preprint arXiv:2503.20680 (2025).
  25. LoRA-Contextualizing Adaptation of Large Multimodal Models. arXiv preprint arXiv:2411.01106 (2024).
  26. Together AI. "Serverless Multi-LoRA." 2024. https://www.together.ai/blog/serverless-multi-lora
  27. Fireworks AI. "Multi-LoRA: Personalize AI at scale." 2024. https://fireworks.ai/blog/multi-lora
  28. Sheng, Ying, et al. "S-LoRA: Serving Thousands of Concurrent LoRA Adapters." 2024.
  29. Liu, Shih-Yang, et al. "DoRA: Weight-Decomposed Low-Rank Adaptation." ICML 2024. https://developer.nvidia.com/blog/introducing-dora-a-high-performing-alternative-to-lora-for-fine-tuning/
  30. Pravi. "Efficient Fine-Tuning of LLMs with LoRA, QLoRA and DoRA." 2024. https://pravi.tech/posts/fine-tuning/
  31. Zhang, Qingru, et al. "AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning." ICLR 2023. https://arxiv.org/abs/2303.10512
  32. Meng, Fanxu, et al. "PiSSA: Principal Singular Values and Singular Vectors Adaptation of Large Language Models." NeurIPS 2024.
  33. Kopiczko, Dawid Jan, et al. "VeRA: Vector-based Random Matrix Adaptation." 2024. https://arxiv.org/abs/2310.10196
  34. Chen, Yukang, et al. "LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models." arXiv preprint arXiv:2309.12307 (2023). https://arxiv.org/abs/2309.12307
  35. Zhang, Longteng, et al. "LoRA-FA: Memory-efficient Low-rank Adaptation for Large Language Models Fine-tuning." OpenReview 2024. https://openreview.net/forum?id=RbKThNNFxr
  36. Valipour, Mojtaba, et al. "DyLoRA: Parameter-Efficient Tuning of Pre-trained Models using Dynamic Search-Free Low-Rank Adaptation." EACL 2023. https://aclanthology.org/2023.eacl-main.239/
  37. Konakovsky, Adnan. "Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning." ResearchGate 2024. https://www.researchgate.net/publication/369379949_Adaptive_Budget_Allocation_for_Parameter-Efficient_Fine-Tuning
  38. Houlsby, Neil, et al. "Parameter-Efficient Transfer Learning for NLP." ICML 2019.
  39. Li, Xiang Lisa, and Percy Liang. "Prefix-tuning: Optimizing continuous prompts for generation." ACL 2021.
  40. Ben Zaken, Elad, et al. "BitFit: Simple parameter-efficient fine-tuning for transformer-based masked language-models." ACL 2022.
  41. DataCamp. "Mastering Low-Rank Adaptation (LoRA)." 2024. https://www.datacamp.com/tutorial/mastering-low-rank-adaptation-lora-enhancing-large-language-models-for-efficient-adaptation
  42. Microsoft. "LoRA: Code for loralib." GitHub. https://github.com/microsoft/LoRA
  43. Google AI. "Fine-tune Gemma in Keras using LoRA." 2024. https://ai.google.dev/gemma/docs/core/lora_tuning
  44. Hugging Face. "TGI Multi-LoRA: Deploy Once, Serve 30 Models." 2024. https://huggingface.co/blog/multi-lora-serving
  45. Hugging Face. "LoRA for Diffusion Models." Diffusers documentation. https://huggingface.co/docs/diffusers/en/training/lora
  46. Tian, Yixiao, et al. "A Survey on LoRA of Large Language Models." arXiv preprint arXiv:2407.11046 (2024). https://arxiv.org/abs/2407.11046
  47. QDyLoRA: Quantized Dynamic Low-Rank Adaptation. arXiv preprint arXiv:2402.10462 (2024).
  48. 1LoRA: A Parameter-Efficient Fine-Tuning Approach. arXiv preprint arXiv:2503.08333 (2024). https://www.arxiv.org/pdf/2503.08333
  49. Computational Complexity of Low-Rank Adaptation. arXiv preprint (2024).
