Parameter-efficient fine-tuning (PEFT) refers to a family of methods that adapt pre-trained large language models to downstream tasks by updating only a small fraction of the model's parameters, or by adding a small number of new trainable parameters, while keeping the vast majority of the original weights frozen. These techniques emerged as a practical response to the escalating cost of full fine-tuning, which requires updating every parameter in the model and storing a complete copy of the model weights for each task. For a model with hundreds of billions of parameters, full fine-tuning demands enormous GPU memory, significant compute budgets, and large storage footprints. PEFT methods achieve comparable or near-comparable performance to full fine-tuning while reducing trainable parameters by 100x to 10,000x, making it feasible to customize large models on consumer hardware or adapt a single base model to many tasks simultaneously.
The term PEFT became widely used after Hugging Face released its open-source peft library in early 2023, which packaged several earlier research methods (adapters, prefix tuning, prompt tuning, LoRA, IA3) under a single, consistent API. By late 2023, LoRA and QLoRA had become the standard recipe for fine-tuning open-weight LLMs, and most public fine-tuning projects (Alpaca, Vicuna, OpenAssistant, Guanaco, Zephyr, OpenHermes) used some flavor of PEFT. As of 2025-2026, PEFT is the default starting point for adapting foundation models, and full fine-tuning is reserved for cases where the task genuinely requires it.
The growth in language model scale has created a widening gap between model capability and the resources required to customize those models. GPT-3 had 175 billion parameters. Subsequent models like Llama 3 405B, Gemini Ultra, and DeepSeek-V3 pushed even further. Full fine-tuning of such models requires loading all parameters into GPU memory in a format that supports gradient computation (typically FP32 or mixed-precision with FP32 optimizer states), which can require 4-8x the memory of the model weights alone. For a 70B parameter model in FP16, the weights alone consume around 140 GB; with Adam optimizer states and gradients, the total training memory easily exceeds 500 GB [1].
A concrete memory accounting helps. Adam stores two moment estimates per trainable parameter (first and second moments). At FP32 precision, that is 8 bytes per parameter for the optimizer state alone, plus 4 bytes for the gradient and 4 bytes for the master weight copy. Combined with FP16 model weights, mixed-precision Adam needs roughly 16 bytes per trainable parameter just for training state. For a 7B model, that translates to about 112 GB. For a 70B model, more than 1 TB. PEFT removes most of this cost by reducing the number of trainable parameters from billions to millions, so the optimizer state shrinks proportionally even though the base model itself remains large.
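This arithmetic is easy to sanity-check in a few lines (a sketch; the function name and the 0.3% LoRA-style fraction are illustrative):

def adam_training_state_gb(params_billion, trainable_fraction=1.0):
    # Mixed-precision Adam training state, per the accounting above:
    # 4-byte FP32 master copy + 4-byte gradient + 8 bytes of Adam
    # moments = 16 bytes per *trainable* parameter. The FP16 base
    # weights add a further 2 bytes per parameter on top of this.
    n = params_billion * 1e9
    return 16 * n * trainable_fraction / 1e9

print(adam_training_state_gb(7))          # 112.0 GB (full fine-tuning, 7B)
print(adam_training_state_gb(70))         # 1120.0 GB (full fine-tuning, 70B)
print(adam_training_state_gb(70, 0.003))  # ~3.4 GB (LoRA-style, 0.3% trainable)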
Beyond memory, full fine-tuning creates logistical problems for serving. Each fine-tuned variant is a complete copy of the model. An organization that needs 50 task-specific variants of a 70B model would need to store 50 complete model copies, totaling around 7 TB. PEFT methods solve this by producing small adapter modules (typically 0.1-1% of the full model size) that can be swapped in and out at serving time on top of a single shared base model. With LoRA, an adapter for a 70B model often weighs only 100 to 800 MB depending on rank and which modules are targeted, compared to roughly 140 GB for the full FP16 model.
There is also a generalization argument. With small datasets (a few hundred to a few thousand examples), full fine-tuning of a giant model can overfit aggressively because the optimizer has billions of degrees of freedom and only a handful of constraints. Restricting updates to a low-dimensional subspace, as PEFT methods do, acts as an implicit regularizer. Several empirical studies have found that LoRA generalizes better than full fine-tuning on small target distributions, although it can underperform on tasks that require large structural changes [2].
The idea of parameter-efficient adaptation predates the current LLM era. Feature extraction, where a pre-trained network's representations are used as input to a lightweight classifier, was common in computer vision with networks like VGG and ResNet. In NLP, freezing word embeddings while training task-specific layers was also standard practice. Transfer learning more broadly had been recognized as the key to data-efficient training since the early 2010s.
The modern PEFT paradigm began with adapter modules proposed by Houlsby et al. in 2019, which inserted small bottleneck layers into BERT for natural language understanding tasks. The Houlsby paper, presented at ICML 2019, was the first to show that an order-of-magnitude reduction in trainable parameters could match full fine-tuning quality on the GLUE benchmark [3]. Pfeiffer et al. (2020-2021) refined the placement and design of adapter modules and built the AdapterHub ecosystem.
Prefix tuning (Li and Liang, 2021) and prompt tuning (Lester et al., 2021) extended the idea to operate on the input representations rather than inserting modules between layers. The Stanford prefix-tuning work showed that learnable continuous vectors could steer GPT-2 and BART without modifying any model weights at all [4]. Lester et al. then showed that as model scale grows, even simpler input-only soft prompts catch up with full fine-tuning, a result they called the power of scale [5].
LoRA (Hu et al., 2021) introduced low-rank weight decomposition and became the dominant PEFT method due to its simplicity and the fact that adapter weights can be merged into the base model at inference time, adding zero latency [6]. QLoRA (Dettmers et al., 2023) combined LoRA with 4-bit weight quantization to bring 65B-parameter fine-tuning within reach of a single consumer GPU, which arguably democratized open-source LLM fine-tuning more than any other single technique [7]. DoRA (Liu et al., 2024) decomposed weight updates into magnitude and direction components and consistently outperformed LoRA at the same parameter budget [8].
The lineage from Houlsby 2019 to DoRA 2024 traces a fairly direct evolution: each method either reduces the parameter count further, removes inference overhead, improves convergence, or extends compatibility with quantization and long contexts. By 2025, the field had largely settled on LoRA and its descendants for production use, with adapters and prefix tuning surviving mainly in academic comparisons.
The following table summarizes the major PEFT methods, their core mechanisms, and key characteristics. Years are first arXiv release; conference venues are noted where applicable.
| Method | Year | Core Mechanism | Trainable Params | Inference Overhead | Key Paper |
|---|---|---|---|---|---|
| Adapter modules | 2019 | Bottleneck layers inserted after attention and FFN blocks | ~3.6% | Small (extra layers) | Houlsby et al., ICML 2019 [3] |
| Prefix tuning | 2021 | Learnable continuous vectors prepended to K,V at every layer | ~0.1-1% | Small (extra tokens in attention) | Li & Liang, ACL 2021 [4] |
| Prompt tuning | 2021 | Learnable soft tokens prepended to the input embedding | ~0.01-0.1% | Minimal | Lester et al., EMNLP 2021 [5] |
| BitFit | 2021 | Train only bias terms | ~0.05-0.1% | Zero | Ben Zaken et al., ACL 2022 [9] |
| Compacter | 2021 | Hypercomplex Kronecker-product adapters | ~0.05% | Small | Karimi Mahabadi et al., NeurIPS 2021 [10] |
| P-tuning v2 | 2021 | Deep prefix prompts at every layer | ~0.1-3% | Small | Liu et al., ACL 2022 [11] |
| LoRA | 2021 | Low-rank decomposition of weight update matrices | ~0.1-0.5% | Zero (merged at inference) | Hu et al., ICLR 2022 [6] |
| IA3 | 2022 | Learned rescaling vectors for attention keys, values, and FFN activations | ~0.01% | Minimal | Liu et al., NeurIPS 2022 [12] |
| AdaLoRA | 2023 | SVD-based LoRA with adaptive rank pruning | ~0.1-0.5% | Zero (merged) | Zhang et al., ICLR 2023 [13] |
| QLoRA | 2023 | LoRA applied to 4-bit quantized base model | ~0.1-0.5% | Zero (merged) | Dettmers et al., NeurIPS 2023 [7] |
| LongLoRA | 2023 | LoRA + shifted sparse attention for long context | ~0.1-0.5% | Zero (merged) | Chen et al., ICLR 2024 [14] |
| LoftQ | 2023 | Quantization-aware LoRA initialization | ~0.1-0.5% | Zero (merged) | Li et al., ICLR 2024 [15] |
| VeRA | 2023 | Shared frozen random matrices, learn scaling vectors | ~10x fewer than LoRA | Zero (merged) | Kopiczko et al., ICLR 2024 [16] |
| DoRA | 2024 | Weight-decomposed LoRA with magnitude and direction | ~0.1-0.5% | Zero (merged) | Liu et al., ICML 2024 (Oral) [8] |
| LoRA+ | 2024 | LoRA with separate, higher learning rate for B matrix | ~0.1-0.5% | Zero (merged) | Hayou et al., ICML 2024 [17] |
The original adapter approach inserts small bottleneck modules within each transformer layer. Each adapter consists of a down-projection from the model dimension d to a smaller dimension m, a non-linear activation, and an up-projection back to d, plus a residual connection. These modules are placed after both the multi-head attention sub-layer and the feed-forward sub-layer in each transformer block [3].
In pseudocode, the adapter computes h = h + W_up * activation(W_down * h), where W_down has shape (m, d), W_up has shape (d, m), and m is much smaller than d. The residual connection means the adapter is initialized close to the identity function, so the model starts at the pre-trained behavior and gradually shifts as W_down and W_up are trained.
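A minimal PyTorch sketch of such a module (illustrative, not the reference implementation; the choice of ReLU is a stand-in for the paper's nonlinearity, and the zero-initialized up-projection is one way to start near the identity):

import torch
import torch.nn as nn

class Adapter(nn.Module):
    # Bottleneck adapter: down-project to m, nonlinearity, up-project
    # back to d, plus a residual connection.
    def __init__(self, d, m):
        super().__init__()
        self.down = nn.Linear(d, m)      # weight shape (m, d)
        self.up = nn.Linear(m, d)        # weight shape (d, m)
        nn.init.zeros_(self.up.weight)   # start near the identity function
        nn.init.zeros_(self.up.bias)

    def forward(self, h):
        return h + self.up(torch.relu(self.down(h)))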
The bottleneck dimension m controls the parameter-performance tradeoff. With m = 64 on a BERT-large model (d = 1024), each adapter adds 64 * 1024 + 1024 * 64 = 131,072 parameters per insertion point. Across all layers, this amounts to roughly 3.6% of the total parameters. On the GLUE benchmark, adapters achieved within 0.4% of full fine-tuning performance using only 3.6% of the parameters [3].
The main drawback of adapters is that they add sequential computation during inference. The extra layers cannot be absorbed into the base model weights, so every forward pass incurs a small additional latency, typically on the order of 4-6% per token. This motivated the development of methods like LoRA that can be merged into the base weights for zero-overhead inference.
A related variant by Pfeiffer et al. inserted adapters only after the feed-forward sub-layer, halving the parameter count without much loss in quality. AdapterFusion combined multiple task-specific adapters via a learned attention mechanism for multi-task settings.
Prefix tuning prepends a sequence of learnable continuous vectors (the prefix) to the key and value matrices at every layer of the transformer. These vectors function as virtual tokens that the model attends to, and they can steer the model's behavior without modifying any of the original weights [4].
Unlike discrete text prompts, the prefix vectors live in the continuous embedding space and are optimized via backpropagation. Li and Liang found that directly optimizing the prefix vectors led to instability, so they used a reparameterization trick: the prefix is generated by a small MLP that takes a learnable matrix as input, and after training, the MLP is discarded and only the generated prefix vectors are kept.
Prefix tuning was originally evaluated on GPT-2 for table-to-text generation and BART for summarization. With 0.1% of the parameters, it achieved comparable performance to full fine-tuning on these tasks. However, its effectiveness tends to decrease on more complex tasks, and it is sensitive to the length of the prefix [4]. The longer the prefix, the more virtual tokens consume context window space and add attention cost at inference.
Prompt tuning can be viewed as a simplification of prefix tuning. Instead of adding learnable vectors at every layer, prompt tuning adds a set of learnable soft prompt tokens only at the input embedding layer. The rest of the model processes these soft tokens alongside the actual input tokens using its frozen weights [5].
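A sketch of the mechanism (illustrative; real implementations often initialize the soft tokens from vocabulary embeddings rather than at random):

import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    # Prepends n learnable virtual-token embeddings to the input.
    def __init__(self, n_tokens, d_model):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_tokens, d_model) * 0.02)

    def forward(self, input_embeds):  # (batch, seq, d_model)
        batch = input_embeds.shape[0]
        p = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([p, input_embeds], dim=1)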
Lester et al. showed a compelling scaling result: as model size increases, the gap between prompt tuning and full fine-tuning narrows. For models with 10 billion or more parameters (T5-XXL), prompt tuning essentially matched full fine-tuning performance. This suggested that very large models have enough capacity to be steered effectively by input-level modifications alone. The method requires tuning only around 0.01% of the total parameters; for T5-XXL (11B parameters), the entire learned prompt was about 20,000 numbers, around 80 KB [5].
A caveat: prompt tuning works much less well at smaller scales (under a billion parameters). For a 770M T5 model, prompt tuning lagged full fine-tuning by 5 to 10 points on SuperGLUE. The narrowing gap is partly what motivates the title The Power of Scale.
P-tuning is a parallel line of work from the THUDM group at Tsinghua. The original P-tuning (Liu et al. 2021) used an LSTM or MLP to encode learnable continuous prompts injected at the input layer, making prompt tuning work on smaller models. P-tuning v2 (Liu et al. 2022) extended the technique to insert learnable prompts at every transformer layer, similar to prefix tuning but with a slightly different reparameterization [11].
The authors showed that with 0.1% to 3% of parameters tuned, P-tuning v2 matches full fine-tuning across model scales from 300M to 10B and across tasks from sequence labeling to question answering. This was important because it removed the small-model handicap of the original prompt tuning.
Low-Rank Adaptation (LoRA) has become the most widely adopted PEFT method. Its core insight is that the weight updates during fine-tuning tend to have low intrinsic rank, meaning they can be approximated by the product of two much smaller matrices [6]. The hypothesis builds on earlier work by Aghajanyan et al. on intrinsic dimensionality, which found that fine-tuning of large pre-trained models could be reparameterized in surprisingly low-dimensional subspaces.
For a pre-trained weight matrix W_0 of shape (d, k), LoRA represents the weight update as a low-rank decomposition: the change in weights is expressed as the product B * A, where B has shape (d, r), A has shape (r, k), and r is much smaller than min(d, k). During training, W_0 is frozen and only A and B are updated. The forward pass computes:
output = x * W_0 + (alpha / r) * x * B * A
The scalar alpha is a scaling hyperparameter that controls the magnitude of the LoRA update relative to the base weights. The convention from the original paper is that alpha and r are chosen together; common settings are alpha = r or alpha = 2r.
Matrix A is initialized from a normal distribution and matrix B is initialized to zero, so the adapter initially produces zero output, meaning the model starts from the pre-trained weights. After training, the adapter matrices can be merged into the base weights:
W = W_0 + (alpha / r) * B * A
This merged weight is used during inference, which means LoRA adds exactly zero inference latency compared to the original model [6]. The merge is a one-time operation; once performed, the model behaves like a fully fine-tuned model from a serving perspective.
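A self-contained sketch of a LoRA linear layer, including the one-time merge (shapes follow PyTorch's (out, in) weight convention; illustrative, not the peft implementation):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r=16, alpha=32):
        super().__init__()
        self.base = base                          # frozen W_0
        self.base.weight.requires_grad_(False)
        self.scale = alpha / r
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)  # normal init
        self.B = nn.Parameter(torch.zeros(base.out_features, r))        # zero init

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

    @torch.no_grad()
    def merge(self):
        # Fold the update into the base weight: W = W_0 + (alpha / r) * B * A.
        self.base.weight += self.scale * (self.B @ self.A)
        return self.base  # a plain nn.Linear again: zero inference overhead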
When applied to GPT-3 175B with rank r = 8, LoRA reduced the number of trainable parameters by 10,000x (from 175B to about 17.5M) and GPU memory requirements by 3x compared to full fine-tuning, while matching full fine-tuning quality on multiple benchmarks [6]. The paper was published at ICLR 2022 and has been cited tens of thousands of times.
LoRA adapters are typically applied to the attention weight matrices (W_Q, W_K, W_V, W_O) and sometimes to the feed-forward layers. The rank r is a hyperparameter that controls the expressiveness of the adaptation; common values range from 4 to 64, with 8 or 16 being typical defaults. Recent guidance from practitioner studies recommends targeting both attention and MLP modules (q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj for Llama-style architectures) for best quality [18].
A practical hyperparameter table:
| Setting | Typical value | Notes |
|---|---|---|
| Rank r | 8, 16, 32 | Higher r increases capacity and parameters; 16 is a common default |
| Alpha | r or 2r | Set proportional to r for stable updates |
| LoRA dropout | 0.05 to 0.1 | Mild regularization, often skipped for short runs |
| Target modules | q, k, v, o (attention), and gate, up, down (MLP) | Targeting MLP improves quality at modest extra cost |
| Learning rate | 1e-4 to 5e-4 | Higher than full fine-tuning (which uses 1e-5 to 5e-5) |
| Optimizer | AdamW | Standard; 8-bit AdamW from bitsandbytes saves memory |
Two practical extensions of LoRA are worth knowing. rsLoRA (rank-stabilized LoRA, Kalajdzievski 2023) replaces alpha / r with alpha / sqrt(r), which keeps update magnitudes stable across very different ranks and is the default in some frameworks for high-rank settings. LoRA+ (Hayou et al. 2024) uses different learning rates for the A and B matrices, with B trained at roughly 16x the learning rate of A, improving convergence speed by up to 2x with no extra parameters [17].
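The LoRA+ recipe amounts to a single optimizer tweak, sketched below (the lora_A/lora_B name matching assumes peft's module naming; the 16x ratio follows the paper's ballpark recommendation):

import torch

def lora_plus_optimizer(model, lr=1e-4, b_lr_ratio=16):
    # AdamW with the B matrices trained at a higher learning rate than A.
    a_params, b_params = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        (b_params if "lora_B" in name else a_params).append(param)
    return torch.optim.AdamW([
        {"params": a_params, "lr": lr},
        {"params": b_params, "lr": lr * b_lr_ratio},
    ])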
QLoRA combines LoRA with aggressive base model quantization to enable fine-tuning of very large models on a single consumer GPU. The base model is quantized to 4-bit precision using a novel data type called 4-bit NormalFloat (NF4), and LoRA adapters are applied on top of these quantized weights. Gradients are backpropagated through the quantized model into the LoRA adapters, which are kept in higher precision (BF16) [7].
QLoRA introduces three technical innovations.

The first is 4-bit NormalFloat (NF4), a data type designed to be information-theoretically optimal for normally distributed weights. Pre-trained transformer weights tend to follow a roughly zero-mean Gaussian distribution, and NF4 allocates more quantization levels near zero, where most weights live, and fewer levels in the tails. This outperforms standard 4-bit integer (INT4) and 4-bit floating-point (FP4) formats for LLM weight quantization.

The second is double quantization, which further reduces memory by quantizing the quantization constants themselves. Block-wise quantization stores one scale factor per block of weights (typically 64 weights per block); these scale factors are normally stored in FP32. Double quantization compresses them to 8-bit values with a second-level scale, saving approximately 0.37 bits per parameter on average.

The third is paged optimizers, which use the NVIDIA unified memory feature to swap optimizer state between GPU and CPU memory. When GPU memory pressure spikes (for example, during a long-sequence forward pass), pages of optimizer state automatically migrate to CPU RAM and back, preventing out-of-memory crashes.
The practical impact is substantial. A 65B parameter model that would require over 780 GB of GPU memory for standard 16-bit full fine-tuning, or around 130 GB for LoRA in FP16, can be fine-tuned with QLoRA on a single 48 GB GPU (such as an NVIDIA A6000 or A100). Using a consumer RTX 4090 (24 GB), models up to 33B parameters can be fine-tuned. The Guanaco family of models, fine-tuned with QLoRA on a single GPU in 24 hours, reached 99.3% of ChatGPT's performance on the Vicuna benchmark [7].
QLoRA is implemented in the Hugging Face peft library through integration with the bitsandbytes library. The Answer.AI team in collaboration with Tim Dettmers and Hugging Face later combined QLoRA with PyTorch FSDP to scale this further, training a 70B Llama model at home on a pair of 24 GB consumer GPUs in 2024.
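A typical QLoRA setup with transformers, bitsandbytes, and peft looks roughly like this (a sketch of the common recipe; the model id and target modules are illustrative):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # 4-bit NormalFloat
    bnb_4bit_use_double_quant=True,       # quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf", quantization_config=bnb_config
)
model = prepare_model_for_kbit_training(model)  # cast norms, enable input grads
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
))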
IA3 (Infused Adapter by Inhibiting and Amplifying Inner Activations) takes a different approach from LoRA. Instead of adding low-rank matrices, IA3 learns three vectors per transformer block that rescale the keys and values in the self-attention mechanism and the intermediate activations in the feed-forward network. These learned vectors element-wise multiply the existing activations, requiring an extremely small number of trainable parameters (around 0.01% of the model) [12].
Formally, for a self-attention block IA3 introduces vectors l_k and l_v that scale the keys and values: K' = K * l_k, V' = V * l_v. For the feed-forward network, a vector l_ff scales the output of the up-projection: ff_out = (gelu(x * W_1) * l_ff) * W_2. Because the rescaling vectors can be folded into the existing weight matrices for inference (W_K * diag(l_k), and so on), IA3 also has zero inference overhead.
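The fold is easy to verify numerically (a sketch using the row-vector convention K = x * W_K from the formulas above):

import torch

d, seq = 8, 4
x = torch.randn(seq, d)
W_k = torch.randn(d, d)
l_k = torch.rand(d)                    # learned rescaling vector

scaled = (x @ W_k) * l_k               # IA3: rescale keys elementwise
folded = x @ (W_k * l_k)               # equivalent: scale W_K's columns once
print(torch.allclose(scaled, folded))  # True, hence zero inference overhead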
IA3 was motivated by the goal of making few-shot fine-tuning practical with very limited labeled data. The extreme parameter efficiency means IA3 is less prone to overfitting than methods with more trainable parameters, making it suitable for scenarios with only tens or hundreds of training examples. The method was evaluated with the T-Few recipe, which combined IA3 with specific loss functions for few-shot learning, and it outperformed in-context learning with much larger models on the RAFT benchmark [12].
IA3 also enables mixed-task batching: because each example can be multiplied by its own task vector cheaply, a server can serve multiple IA3 task adapters in a single batch with minimal overhead.
BitFit is the simplest PEFT method on this list. It freezes everything in the model except the bias terms. For a transformer, that includes biases in the attention projections, the feed-forward layers, and the layer norms (where present). Total trainable parameters typically come out to 0.05% to 0.1% of the model [9].
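The entire training setup is a parameter-freezing loop (a sketch; the name test relies on the standard transformers convention of naming these parameters "bias"):

def apply_bitfit(model):
    # Freeze everything except bias terms and count what remains trainable.
    trainable = 0
    for name, param in model.named_parameters():
        param.requires_grad = "bias" in name
        if param.requires_grad:
            trainable += param.numel()
    return trainable  # typically 0.05-0.1% of total parameters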
Despite its simplicity, BitFit is competitive on small to medium datasets for BERT-style encoder models. The authors reported that on GLUE, BitFit matches or beats full fine-tuning with very small data, and stays within a few points on larger datasets. The Ben Zaken paper was published at ACL 2022.
The theoretical implication is striking. BitFit suggests that fine-tuning may largely be about exposing knowledge already present in the pre-trained weights rather than learning genuinely new task-specific knowledge. The biases act as gates that selectively amplify or suppress existing features.
In practice, BitFit works well on encoder-only models for classification, but has been less competitive on decoder-only generative models for harder tasks like instruction following. Most modern PEFT recipes prefer LoRA over BitFit because LoRA gives more capacity at a similar parameter count.
Weight-Decomposed Low-Rank Adaptation (DoRA) analyzes the difference between full fine-tuning and LoRA through the lens of weight decomposition into magnitude and direction components. The authors found that full fine-tuning tends to make nuanced adjustments to both magnitude and direction, while LoRA predominantly alters direction, sometimes at the expense of magnitude accuracy [8].
DoRA decomposes each pre-trained weight matrix into a magnitude vector m (one scalar per output column) and a directional matrix V (the column-normalized weight matrix). The magnitude vector m is made trainable directly, while the directional matrix V is updated using LoRA's low-rank decomposition: V_new = V + B * A. The reconstructed weight is then m * (V_new / norm(V_new, columnwise)). This decomposition allows DoRA to more closely approximate the learning dynamics of full fine-tuning while still keeping the parameter count almost identical to LoRA [8].
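A sketch of the reconstruction for a single weight matrix (column-wise convention as described above; illustrative, not the reference implementation):

import torch
import torch.nn as nn

class DoRALayer(nn.Module):
    def __init__(self, W0: torch.Tensor, r=16, alpha=32):  # W0: (d, k)
        super().__init__()
        self.register_buffer("V", W0.clone())                # frozen direction base
        self.m = nn.Parameter(W0.norm(dim=0, keepdim=True))  # (1, k) magnitudes
        self.B = nn.Parameter(torch.zeros(W0.shape[0], r))
        self.A = nn.Parameter(torch.randn(r, W0.shape[1]) * 0.01)
        self.scale = alpha / r

    def weight(self):
        V_new = self.V + self.scale * (self.B @ self.A)   # update direction via LoRA
        V_dir = V_new / V_new.norm(dim=0, keepdim=True)   # column-normalize
        return self.m * V_dir                             # reapply trained magnitude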
In experiments across language understanding (commonsense reasoning with LLaMA), visual instruction tuning (LLaVA), and multimodal understanding (VL-BART), DoRA consistently outperformed LoRA, often by 1 to 4 percentage points at the same rank. It was presented as an oral paper at ICML 2024 and is supported in the Hugging Face PEFT library, including with bitsandbytes quantized layers. Like LoRA, DoRA can be merged into the base weights for zero-overhead inference [8].
AdaLoRA (Adaptive LoRA) extends LoRA by dynamically allocating the rank of each adapter based on the importance of each weight matrix. Instead of using a fixed rank r for all layers, AdaLoRA parameterizes the LoRA update in singular value decomposition form (B = P * diag(lambda) * Q) and prunes singular values during training based on an importance score derived from gradient magnitudes [13].
This budget allocation lets the method assign more capacity to layers that benefit most from adaptation and less to layers that are already well-suited to the task. Empirically, AdaLoRA performed especially well in low-budget settings, where allocating the limited capacity intelligently makes a larger difference than at high parameter budgets.
LongLoRA targets a specific challenge: extending the context window of a pre-trained LLM. Standard LoRA fine-tuning at long context lengths is expensive because attention is quadratic in sequence length. LongLoRA introduces shifted sparse attention (S2-Attn), which splits the context into groups and computes attention within each group, with half the heads shifted by a half group size to allow information flow across boundaries [14].
During inference, the model uses standard dense attention, but during fine-tuning S2-Attn provides most of the benefit at a fraction of the cost. Combined with LoRA on the attention projections, plus making the embedding and normalization layers trainable, LongLoRA extended Llama 2 7B from 4K to 100K context, and Llama 2 70B to 32K, on a single 8x A100 machine. The paper appeared at ICLR 2024 (Oral).
LoftQ (LoRA-Fine-Tuning-aware Quantization) addresses a subtle problem with QLoRA. When the base model is quantized to 4 bits, there is quantization error: the quantized weight Q(W) differs from the original W. QLoRA initializes the LoRA adapters at zero, so at the start of fine-tuning, the model behaves like the quantized model, not the full-precision model. This degrades initial performance, especially at low quantization bit widths.
LoftQ instead initializes the LoRA matrices A and B such that Q(W) + B * A approximates W as closely as possible. The initialization is found by alternating between quantizing W minus the current low-rank correction and computing a new low-rank approximation of the residual. This iterative procedure is performed once before fine-tuning begins. As a result, the model starts much closer to full-precision behavior, and downstream performance is consistently better than QLoRA, especially at 2 or 3 bit quantization [15].
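The alternating procedure fits in a few lines (a sketch; the rounding quantizer is a stand-in for the real NF4 quantizer):

import torch

def loftq_init(W, r=16, steps=5, quantize=lambda w: w.round()):
    # Alternate between quantizing the residual and refitting a low-rank
    # correction so that Q(W) + B @ A approximates W before training starts.
    B = torch.zeros(W.shape[0], r)
    A = torch.zeros(r, W.shape[1])
    for _ in range(steps):
        Q = quantize(W - B @ A)             # quantize the current residual
        U, S, Vh = torch.linalg.svd(W - Q)  # best rank-r fit of the error
        B = U[:, :r] * S[:r]
        A = Vh[:r, :]
    return Q, B, A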
LoftQ was presented at ICLR 2024 and is implemented in the Hugging Face PEFT library.
VeRA (Vector-based Random Matrix Adaptation) further reduces the parameter count by sharing a pair of frozen random matrices across all layers and learning only per-layer scaling vectors. The random matrices A_random and B_random are sampled once, frozen, and never stored as trainable parameters. Each layer learns two scaling vectors b and d such that the effective update is B_random * diag(d) * A_random with an outer scaling by diag(b).
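A sketch of one VeRA layer (square dimensions for simplicity; the initialization constants are illustrative):

import torch
import torch.nn as nn

r, d_in, d_out = 16, 4096, 4096
B_shared = torch.randn(d_in, r)   # sampled once, frozen, shared by all layers
A_shared = torch.randn(r, d_out)

class VeRALayer(nn.Module):
    # Only the two per-layer scaling vectors are trainable.
    def __init__(self):
        super().__init__()
        self.d_vec = nn.Parameter(torch.full((r,), 0.1))  # scales the rank dimension
        self.b_vec = nn.Parameter(torch.zeros(d_out))     # outer scaling, zero init

    def delta(self, x):  # x: (batch, d_in); added to the frozen layer's output
        return (((x @ B_shared) * self.d_vec) @ A_shared) * self.b_vec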
Because the only trainable parameters are the per-layer scaling vectors, VeRA achieves comparable performance to LoRA while training roughly 10x fewer parameters. This makes it especially attractive for serving thousands of per-user adapters where adapter file size matters. VeRA was presented at ICLR 2024 [16].
Compacter combines adapter modules with parameterized hypercomplex multiplication (PHM) layers and low-rank matrices. Each adapter weight is constructed as a sum of Kronecker products between shared slow weights and per-layer rank-one matrices. The result is an adapter with even fewer trainable parameters than the original Houlsby adapter, around 0.05% of the model. On GLUE and SuperGLUE, Compacter matched or beat full fine-tuning at this very low parameter count [10]. Although less commonly used today than LoRA, Compacter is an interesting reference point for the lower bound of trainable parameters needed to match full fine-tuning.
MoRA challenges the assumption that low rank is always best. The authors observe that LoRA's low-rank update may struggle on tasks that require absorbing genuinely new knowledge (continual pre-training, math, code) rather than simply rerouting existing capabilities. MoRA replaces the two low-rank matrices A and B with a single square matrix M of size r' by r' (where r' is chosen to match the LoRA parameter budget), combined with parameter-free compression and decompression operators. The square matrix is full rank within its dimensions, providing higher representational capacity than a low-rank decomposition with the same parameter count. On continual pre-training and math tasks, MoRA outperformed LoRA at matched parameter budgets.
The peft library from Hugging Face has become the standard open-source implementation for parameter-efficient fine-tuning. Released in early 2023 and led by Sourab Mangrulkar with contributions from Younes Belkada, Sayak Paul, Benjamin Bossan, Marc Sun, and others, it provides a unified interface for applying multiple PEFT methods to any Hugging Face Transformers model. Supported methods include LoRA, QLoRA (via integration with the bitsandbytes library), prefix tuning, prompt tuning, P-tuning, IA3, AdaLoRA, LoftQ, VeRA, and DoRA [19].
The library's design follows a simple pattern. A PEFT configuration object (for example LoraConfig) specifies the method and its hyperparameters. The get_peft_model() function wraps a base model with the specified PEFT modules. Training proceeds with the standard Trainer API or any custom loop. After training, only the small adapter weights are saved with save_pretrained(), typically a few megabytes. At inference time, the base model is loaded once with from_pretrained(), and PeftModel.from_pretrained(base, adapter_path) attaches an adapter. Different adapters can be loaded and swapped dynamically with load_adapter() and set_adapter().
A minimal LoRA training snippet looks like this:
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")  # example base

config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    bias="none", task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, config)
model.print_trainable_parameters()  # reports trainable vs. total parameter counts
# train model with the standard Hugging Face Trainer
model.save_pretrained("./lora_adapter")  # saves only the small adapter weights
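Loading and swapping adapters at inference time follows the same pattern (a sketch; the paths and adapter names are illustrative):

from peft import PeftModel

# attach a first adapter to the shared base model
model = PeftModel.from_pretrained(base_model, "./lora_adapter", adapter_name="task_a")
# load a second adapter and switch between the two dynamically
model.load_adapter("./another_adapter", adapter_name="task_b")
model.set_adapter("task_b")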
This design enables practical multi-tenant serving: a single base model resides in GPU memory, and per-task LoRA adapters (each only a few megabytes) are loaded as needed. Frameworks like vLLM, Lorax, and TGI support serving multiple LoRA adapters simultaneously on a shared base model, batching requests across different adapters for high throughput.
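With vLLM, per-request adapter selection looks roughly like this (a sketch based on vLLM's multi-LoRA API; the model id and adapter paths are illustrative):

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)
params = SamplingParams(max_tokens=128)

# each request can name a different adapter on the shared base model
outputs = llm.generate(
    ["Summarize: ..."], params,
    lora_request=LoRARequest("sql_adapter", 1, "/adapters/sql_lora"),
)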
The library also integrates with the broader Hugging Face stack: transformers for base model classes, accelerate for distributed and mixed-precision training, trl (Transformer Reinforcement Learning) for SFT, DPO, and PPO with PEFT models, bitsandbytes for 8-bit and 4-bit quantization needed for QLoRA, and diffusers for applying LoRA to diffusion models for image generation.
The following table provides a more detailed comparison of the major PEFT methods along several practical dimensions, assuming a 7B parameter model.
| Method | Trainable Params (7B) | GPU Memory (7B) | Performance vs. Full FT | Supports Merging | Multi-task Serving | Ease of Use |
|---|---|---|---|---|---|---|
| Full fine-tuning | 7B (100%) | ~56-112 GB | Baseline | N/A | Requires full copies | Standard |
| LoRA (r=16) | ~17M (0.24%) | ~16-24 GB (FP16) | 90-95% | Yes | Excellent (swap adapters) | Very easy |
| QLoRA (r=16) | ~17M (0.24%) | ~6-10 GB (4-bit) | 85-95% | Yes | Excellent | Easy |
| Adapters (m=64) | ~7M (0.1%) | ~16-24 GB (FP16) | 90-95% | No | Moderate | Moderate |
| Prefix tuning | ~1-5M (0.01-0.07%) | ~16-20 GB (FP16) | 85-95% | No | Moderate | Moderate |
| Prompt tuning | ~0.1-1M (0.001-0.01%) | ~16-18 GB (FP16) | 80-95% (scale-dependent) | No | Good (swap prompts) | Easy |
| IA3 | ~0.5M (0.007%) | ~16-18 GB (FP16) | 85-90% | Yes | Good | Easy |
| BitFit | ~3M (0.05%) | ~16-18 GB (FP16) | 80-90% (small data) | Yes (already in W) | Limited | Very easy |
| DoRA (r=16) | ~18M (0.26%) | ~16-24 GB (FP16) | 92-97% | Yes | Excellent | Easy |
| VeRA | ~1.7M (0.024%) | ~16-18 GB (FP16) | 90-95% | Yes | Excellent | Easy |
Note: GPU memory estimates assume a 7B model in the specified precision. QLoRA's dramatically lower memory comes from the 4-bit quantized base model, not from fewer adapter parameters.
A second comparison focused on the headline tradeoff between full fine-tuning, LoRA, QLoRA, and adapters:
| Aspect | Full fine-tuning | LoRA | QLoRA | Adapter (Houlsby) |
|---|---|---|---|---|
| Trainable parameters | 100% | 0.1-0.5% | 0.1-0.5% | 0.5-3.6% |
| Saved adapter size (70B model) | ~140 GB | ~150-800 MB | ~150-800 MB | ~500 MB-5 GB |
| Min GPU memory (70B) | 1+ TB | ~140 GB (FP16 base) | ~35-48 GB (4-bit base) | ~140 GB |
| Inference latency overhead | None | None (after merge) | None (after merge) | ~5% |
| Quality on hard tasks | Best | Within 1-3% | Within 2-5% | Within 1-2% |
| Generalization on small data | May overfit | Often better than full | Comparable to LoRA | Comparable to LoRA |
| Multi-adapter serving | Requires full model copies | Yes, well-supported | Yes, well-supported | Possible but slower |
| Default in modern open-source recipes | No | Yes | Yes (resource-constrained) | No |
The most concrete way to understand PEFT is through the memory math for a real model. Take Llama 3 70B. The weights alone require about 280 GB in FP32, 140 GB in FP16 or BF16, 70 GB in INT8, and roughly 35 GB in NF4. For full fine-tuning with mixed-precision Adam, you need the weights, gradients, master weights, and Adam moments. A common rule of thumb is around 16 bytes per parameter, giving roughly 1.1 TB for a 70B model. This typically requires multi-GPU clusters with model parallelism.
For LoRA at rank 16 applied to all attention and MLP projections of Llama 3 70B (about 7 such projections per layer, 80 layers, average projection size 8192 by 8192 or 8192 by 28672), the trainable parameter count is roughly 70 to 200 million depending on which modules are targeted. The optimizer state for these adapters takes only 1 to 4 GB. Adding the FP16 base model (140 GB), you can train Llama 3 70B with LoRA on a node with 2 H100 80GB GPUs, or even one H200 (141 GB).
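The count can be checked directly from the projection shapes (a sketch; the dimensions follow Llama 3 70B's published architecture, where grouped-query attention shrinks the k/v projections; targeting every module at r=16 lands just above the top of the range quoted above):

# LoRA adds r * (d_in + d_out) parameters per targeted matrix
r, layers = 16, 80
shapes = {
    "q_proj": (8192, 8192), "o_proj": (8192, 8192),
    "k_proj": (8192, 1024), "v_proj": (8192, 1024),   # GQA: 8 KV heads
    "gate_proj": (8192, 28672), "up_proj": (8192, 28672),
    "down_proj": (28672, 8192),
}
total = layers * sum(r * (din + dout) for din, dout in shapes.values())
print(f"{total / 1e6:.0f}M trainable parameters")  # ~207M at r=16, all modules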
For QLoRA with NF4 quantization, the base model drops to about 35 GB, and total training memory fits in a single 48 GB GPU for 65B-70B models, or 24 GB for 30B models. With FSDP plus QLoRA (the Answer.AI recipe), a 70B model can be fine-tuned on dual 24 GB consumer GPUs.
A quick storage example: an organization that fine-tunes Llama 3 70B for 100 different customer use cases would need 100 * 140 GB = 14 TB of model storage with full fine-tuning. With LoRA, those 100 adapters total about 100 * 500 MB = 50 GB, a 280x storage reduction, plus a single shared 140 GB base model.
A recurring question is how close PEFT comes to full fine-tuning quality. The honest answer depends on the task, the data size, and the method:
DoRA improves consistently over LoRA at the same parameter budget, typically by 1 to 4 points on commonsense reasoning, visual instruction tuning, and multimodal benchmarks [8]. LoRA+ improves convergence speed without changing final accuracy [17]. AdaLoRA helps most when the parameter budget is tight [13].
PEFT changed not just training but also serving. Three patterns are common in 2025. The first is adapter merge for single-task serving. When a single fine-tuned model is served, the LoRA adapter is merged into the base weights with model.merge_and_unload(). The result is a model identical in shape to the base, with no inference overhead. This is the pattern used by most providers when they ship a fine-tuned variant as a single artifact.
The second is multi-LoRA inference. When serving many adapters from one base model, frameworks like vLLM, Lorax (from Predibase), and Text Generation Inference (TGI) keep one copy of the base model in GPU memory and apply different LoRA adapters per request. vLLM supports the LoRA API natively and allows specifying an adapter ID per request; the adapter is loaded into a small GPU buffer if not already cached. Lorax goes further with heterogeneous continuous batching, packing requests for different adapters into the same batch. AWS Bedrock, OpenAI's fine-tuning API, Together AI, Anyscale, and Modal all expose multi-LoRA serving in some form.
The third is LoRA for personalization. The combination of small adapter size, fast loading, and a shared base model has enabled per-user fine-tuning at scale. Predibase and Replicate advertise serving thousands or even tens of thousands of distinct LoRA adapters from a single GPU. The same pattern drives the Civitai and Hugging Face hubs, which host hundreds of thousands of community LoRA adapters for Stable Diffusion, Flux, and other diffusion models. Each diffusion LoRA is typically 10 to 200 MB.
Several open-source frameworks have packaged PEFT recipes for end users:
| Framework | Strength | Typical use case |
|---|---|---|
| Hugging Face PEFT | Reference implementation, broadest method support | Research, custom training loops |
| Axolotl | YAML-based config, broad model support, active community | Production fine-tuning of open-weight LLMs |
| LLaMA-Factory | Web UI plus CLI, supports 100+ models out of the box | Beginners, rapid experimentation, RLHF pipelines |
| Unsloth | Custom Triton kernels, 2-5x faster training, up to 80% less VRAM | Resource-constrained training (single consumer GPU) |
| TorchTune | Native PyTorch, deep customization | PyTorch-first teams, large-scale custom recipes |
| Answer.AI FSDP+QLoRA | FSDP integration with QLoRA for multi-GPU 70B+ training | Training 70B models on 2 consumer GPUs |
Unsloth in particular has driven a step change in single-GPU efficiency through manually written backward kernels and optimized implementations of rotary position embeddings, RMSNorm, and cross-entropy. Its benchmarks on RTX 4090 GPUs show roughly 24% faster training than TorchTune at similar memory.
Although PEFT methods were originally developed for NLP, they have been successfully applied to other domains. LoRA has been widely adopted for fine-tuning diffusion models for image generation, where users create custom LoRA adapters that teach Stable Diffusion new styles, characters, or concepts. The Civitai and Hugging Face model hubs host hundreds of thousands of community-created LoRA adapters for image generation models.
In computer vision, LoRA and adapters have been applied to vision transformers for image classification, object detection, and segmentation. For multimodal models like LLaVA, PEFT methods are used to fine-tune the language model component while keeping the vision encoder frozen. DoRA explicitly evaluates on LLaVA and VL-BART for multimodal benchmarks and outperforms LoRA in this setting [8].
LoRA has also been used for audio models (Whisper, MusicGen) and protein language models (ESM, ProGen2), and for adapting world models in robotics. Anywhere a transformer backbone is fine-tuned, some form of PEFT is now common.
PEFT is not a free lunch. Hyperparameters matter: rank, alpha, target modules, and learning rate all affect quality, and a poor choice (rank too low, alpha mismatched, only attention modules targeted) can cost several points on hard tasks. The rsLoRA finding that the standard alpha / r scaling becomes unstable at high rank is a recent example.
Hard tasks may need full fine-tuning. When the target requires substantially new knowledge (a new programming language not in the pre-training mix, a new natural language, deep mathematical reasoning), low-rank updates can saturate. Higher rank, full fine-tuning, or MoRA are alternatives.
Quantization losses compound. QLoRA and LoftQ work well, but 4-bit quantization plus low-rank updates loses something relative to full-precision fine-tuning, and the gap widens at very low bit widths (2 or 3 bit).
Adapter methods add inference cost. Houlsby adapters, prefix tuning, and prompt tuning all add some overhead at inference (extra layers, extra attention tokens). Only LoRA, IA3, BitFit, and their merge-friendly descendants avoid this.
Multi-adapter batching is non-trivial. Although vLLM and Lorax support batching across different LoRA adapters in the same request batch, this requires careful kernel work (segmented matrix multiplications). Naive implementations serialize adapters and lose much of the benefit.
Catastrophic forgetting still happens. LoRA does not magically prevent the model from forgetting pre-training knowledge during fine-tuning. Standard mitigations (mix in pre-training data, use lower learning rates) still apply.
The choice between PEFT and full fine-tuning depends on several factors:
Use PEFT when:

- GPU memory or compute budget is constrained (QLoRA brings 70B-class models within reach of one or two consumer GPUs).
- Many task- or customer-specific variants must be served from a single base model, where small swappable adapters dominate on storage and serving cost.
- The training set is small (hundreds to a few thousand examples), where restricting updates to a low-dimensional subspace acts as a regularizer.
- Fast iteration matters: adapters train quickly and produce megabyte-scale artifacts.
Use full fine-tuning when:

- The task requires substantially new knowledge or large structural changes (a new programming or natural language, deep mathematical reasoning), where low-rank updates can saturate.
- Data and compute are abundant and the last 1-3% of quality on hard tasks justifies the cost.
- A single model is being deployed, so the multi-adapter storage and serving advantages of PEFT do not apply.
In practice, LoRA and QLoRA have narrowed the gap to the point where many practitioners default to PEFT unless there is a specific reason to perform full fine-tuning. The 2024-2025 trend has been toward increasingly sophisticated PEFT methods (DoRA, VeRA, LoRA+, MoRA, LoftQ) that continue to close the remaining quality gap.
The PEFT landscape in 2025-2026 is characterized by several developments:
LoRA remains the dominant method, but DoRA is gaining adoption as the default recommendation for practitioners who want slightly better performance with no additional inference cost. The Hugging Face PEFT library supports both, making switching trivial.
QLoRA has become the standard approach for fine-tuning on consumer hardware. With tools like Unsloth and Axolotl providing optimized training pipelines, it is now possible to fine-tune a 33B model with QLoRA on a single NVIDIA RTX 4090 (24 GB VRAM) in reasonable time frames, and with FSDP+QLoRA a 70B fine-tune on a pair of RTX 4090s is straightforward.
Multi-adapter serving has matured. vLLM, Lorax (Predibase), Text Generation Inference (TGI), and TensorRT-LLM all support efficient batching across multiple LoRA adapters sharing a single base model. AWS Bedrock, OpenAI's fine-tuning API, Together AI, Anyscale, and Modal all expose multi-LoRA serving as a managed product. This enables cost-effective deployment of personalized models at scale, with services like Predibase advertising thousands of distinct adapters per GPU.
Foundation model providers have built first-class LoRA support into their model architectures. Llama 3.1, Llama 4, Mistral, Mixtral, DeepSeek V3, and Qwen 3 all ship with documented LoRA target modules and recommended ranks. Apple's foundation models for Apple Intelligence are also shipped with adapter slots designed for on-device LoRA personalization.
Research continues on combining PEFT with other efficiency techniques. Quantization-aware PEFT (LoftQ, QuAILoRA), structured pruning combined with LoRA, MoE plus LoRA (MoLE, MoLoRA), and continual learning with PEFT adapters are all active research areas.
The theoretical understanding of why PEFT works so well has also advanced. Studies have shown that pre-trained models develop low-dimensional subspaces during training, and fine-tuning adjustments naturally reside in these subspaces. This explains why low-rank approximations (the foundation of LoRA) are effective: the updates truly are approximately low-rank for many target distributions, although not always [20].
A reasonable bet is that LoRA-family methods will remain the default for the next several years. The base model weights keep getting larger (Llama 4, GPT-5 class, DeepSeek R2), while consumer GPU memory grows much more slowly. PEFT is the bridge between those two trends.