Parameter-efficient fine-tuning (PEFT) refers to a family of methods that adapt pre-trained large language models to downstream tasks by updating only a small fraction of the model's parameters, or by adding a small number of new trainable parameters, while keeping the vast majority of the original weights frozen. These techniques emerged as a practical response to the escalating cost of full fine-tuning, which requires updating every parameter in the model and storing a complete copy of the model weights for each task. For a model with hundreds of billions of parameters, full fine-tuning demands enormous GPU memory, significant compute budgets, and large storage footprints. PEFT methods achieve comparable or near-comparable performance to full fine-tuning while reducing trainable parameters by 100x to 10,000x, making it feasible to customize large models on consumer hardware or adapt a single base model to many tasks simultaneously.
The growth in language model scale has created a widening gap between model capability and the resources required to customize those models. GPT-3 had 175 billion parameters. Subsequent models like Llama 3 405B and Gemini pushed even further. Full fine-tuning of such models requires loading all parameters into GPU memory in a format that supports gradient computation (typically FP32 or mixed-precision with FP32 optimizer states), which can require 4-8x the memory of the model weights alone. For a 70B parameter model in FP16, the weights alone consume around 140 GB; with Adam optimizer states and gradients, the total training memory easily exceeds 500 GB [1].
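The arithmetic behind these figures can be sketched directly. The following is a rough estimate, assuming FP16 weights and gradients with FP32 Adam moments and an FP32 master copy of the weights (a common mixed-precision setup); activation memory is excluded:

```python
def full_finetune_memory_gb(n_params: float) -> dict:
    """Rough training-memory estimate for mixed-precision full fine-tuning.

    Assumes FP16 weights and gradients (2 bytes each), FP32 Adam first and
    second moments, and an FP32 master copy of the weights (4 bytes each).
    Activation memory is excluded, so real usage is higher still.
    """
    gb = 1e9  # decimal gigabytes, for round numbers
    weights = 2 * n_params / gb            # FP16 weights
    grads = 2 * n_params / gb              # FP16 gradients
    adam = (4 + 4) * n_params / gb         # FP32 momentum + variance
    master = 4 * n_params / gb             # FP32 master weights
    return {
        "weights": weights,
        "total": weights + grads + adam + master,
    }

mem = full_finetune_memory_gb(70e9)
# For a 70B model: 140 GB of FP16 weights alone; the training total
# lands far above the 500 GB figure cited above.
```

Under these assumptions the optimizer and master-weight copies, not the weights themselves, dominate the budget, which is exactly the cost PEFT avoids by keeping almost all parameters frozen.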
Beyond memory, full fine-tuning creates logistical problems for serving. Each fine-tuned variant is a complete copy of the model. An organization that needs 50 task-specific variants of a 70B model would need to store 50 complete model copies, totaling around 7 TB. PEFT methods solve this by producing small adapter modules (typically 0.1-1% of the full model size) that can be swapped in and out at serving time on top of a single shared base model.
The idea of parameter-efficient adaptation predates the current LLM era. Feature extraction, where a pre-trained network's representations are used as input to a lightweight classifier, was common in computer vision with networks like VGG and ResNet. In NLP, freezing word embeddings while training task-specific layers was also standard practice.
The modern PEFT paradigm began with adapter modules proposed by Houlsby et al. in 2019, which inserted small bottleneck layers into BERT. This was followed by prefix tuning (Li and Liang, 2021) and prompt tuning (Lester et al., 2021), which operated on the input representations. LoRA (Hu et al., 2021) introduced low-rank weight decomposition and became the dominant PEFT method due to its simplicity and the fact that adapter weights can be merged into the base model at inference time, adding zero latency [2].
The following table summarizes the major PEFT methods, their core mechanisms, and key characteristics.
| Method | Year | Core Mechanism | Trainable Params (%) | Inference Overhead | Key Paper |
|---|---|---|---|---|---|
| Adapter modules | 2019 | Bottleneck layers inserted after attention and FFN blocks | ~3.6% | Small (extra layers) | Houlsby et al., ICML 2019 [3] |
| Prefix tuning | 2021 | Learnable continuous vectors prepended to K,V at every layer | ~0.1-1% | Small (extra tokens in attention) | Li & Liang, ACL 2021 [4] |
| Prompt tuning | 2021 | Learnable soft tokens prepended to the input embedding | ~0.01-0.1% | Minimal | Lester et al., EMNLP 2021 [5] |
| LoRA | 2021 | Low-rank decomposition of weight update matrices | ~0.1-0.5% | Zero (merged at inference) | Hu et al., ICLR 2022 [2] |
| QLoRA | 2023 | LoRA applied to 4-bit quantized base model | ~0.1-0.5% | Zero (merged at inference) | Dettmers et al., NeurIPS 2023 [6] |
| IA3 | 2022 | Learned rescaling vectors for attention keys, values, and FFN activations | ~0.01% | Minimal | Liu et al., 2022 [7] |
| DoRA | 2024 | Weight-decomposed LoRA with separate magnitude and direction updates | ~0.1-0.5% | Zero (merged at inference) | Liu et al., ICML 2024 [8] |
The original adapter approach inserts small bottleneck modules within each transformer layer. Each adapter consists of a down-projection from the model dimension d to a smaller dimension m, a non-linear activation, and an up-projection back to d, plus a residual connection. These modules are placed after both the multi-head attention sub-layer and the feed-forward sub-layer in each transformer block [3].
The bottleneck dimension m controls the parameter-performance tradeoff. With m = 64 on a BERT-large model (d = 1024), each adapter adds 64 * 1024 + 1024 * 64 = 131,072 parameters per insertion point. Across all layers, this amounts to roughly 3.6% of the total parameters. On the GLUE benchmark, adapters achieved within 0.4% of full fine-tuning performance [3].
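A minimal NumPy sketch of one such bottleneck module follows. It is bias-free to match the parameter count above (the published modules include biases and a GELU), and the up-projection is zero-initialized so the adapter starts as an identity function:

```python
import numpy as np

d, m = 1024, 64  # model dim and bottleneck dim (the BERT-large example above)
rng = np.random.default_rng(0)

W_down = rng.normal(0, 0.02, size=(d, m))  # down-projection d -> m
W_up = np.zeros((m, d))                    # up-projection m -> d, zero init:
                                           # the adapter initially passes h through

def adapter(h):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual."""
    z = np.maximum(h @ W_down, 0.0)  # ReLU here; Houlsby et al. used GELU
    return h + z @ W_up              # residual connection

h = rng.normal(size=(4, d))          # a batch of 4 hidden states
out = adapter(h)
n_trainable = W_down.size + W_up.size  # 64*1024 + 1024*64 = 131,072
```

Only `W_down` and `W_up` are trained; everything else in the transformer block stays frozen.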
The main drawback of adapters is that they add sequential computation during inference. The extra layers cannot be absorbed into the base model weights, so every forward pass incurs a small additional latency. This motivated the development of methods like LoRA that can be merged into the base weights.
Prefix tuning prepends a sequence of learnable continuous vectors (the "prefix") to the key and value matrices at every layer of the transformer. These vectors function as "virtual tokens" that the model attends to, and they can steer the model's behavior without modifying any of the original weights [4].
Unlike discrete text prompts, the prefix vectors live in the continuous embedding space and are optimized via backpropagation. Li and Liang found that directly optimizing the prefix vectors led to instability, so they used a reparameterization trick: the prefix is generated by a small MLP that takes a learnable matrix as input, and after training, the MLP is discarded and only the generated prefix vectors are kept.
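The mechanism can be sketched for a single attention head in NumPy (head dimension, sequence length, and prefix length are illustrative; the reparameterization MLP is omitted):

```python
import numpy as np

d_k, seq, p = 64, 10, 5  # head dim, input length, prefix length (illustrative)
rng = np.random.default_rng(0)

# Frozen per-token projections from the base model
Q = rng.normal(size=(seq, d_k))
K = rng.normal(size=(seq, d_k))
V = rng.normal(size=(seq, d_k))

# Learnable prefix vectors, prepended to K and V and trained by backprop
P_k = rng.normal(0, 0.02, size=(p, d_k))
P_v = rng.normal(0, 0.02, size=(p, d_k))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

K_ext = np.concatenate([P_k, K], axis=0)    # (p + seq, d_k)
V_ext = np.concatenate([P_v, V], axis=0)
attn = softmax(Q @ K_ext.T / np.sqrt(d_k))  # queries attend to prefix + input
out = attn @ V_ext                          # (seq, d_k): output length unchanged
```

Note that the prefix extends only the keys and values, so the output sequence length is unaffected; the "virtual tokens" influence the model purely through attention.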
Prefix tuning was originally evaluated on GPT-2 for table-to-text generation and BART for summarization. With 0.1% of the parameters, it achieved comparable performance to full fine-tuning on these tasks. However, its effectiveness tends to decrease on more complex tasks, and it is sensitive to the length of the prefix [4].
Prompt tuning can be viewed as a simplification of prefix tuning. Instead of adding learnable vectors at every layer, prompt tuning adds a set of learnable "soft prompt" tokens only at the input embedding layer. The rest of the model processes these soft tokens alongside the actual input tokens using its frozen weights [5].
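In code, the difference from prefix tuning is that only the input embedding sequence changes; a NumPy sketch with toy sizes:

```python
import numpy as np

vocab, d, p = 100, 32, 8  # vocab size, embedding dim, soft-prompt length (toy)
rng = np.random.default_rng(0)

E = rng.normal(size=(vocab, d))                # frozen token embedding table
soft_prompt = rng.normal(0, 0.5, size=(p, d))  # the only trainable parameters

token_ids = np.array([3, 14, 15, 9, 2])        # the actual input tokens
x = np.concatenate([soft_prompt, E[token_ids]], axis=0)  # (p + len, d)
# x is then processed by the frozen transformer stack unchanged.

n_trainable = soft_prompt.size  # p * d = 256 in this toy setup
```

Because the trainable state is just this one small matrix, swapping tasks amounts to swapping soft prompts.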
Lester et al. showed a compelling scaling result: as model size increases, the gap between prompt tuning and full fine-tuning narrows. For models with 10 billion or more parameters (T5-XXL), prompt tuning essentially matched full fine-tuning performance. This suggested that very large models have enough capacity to be steered effectively by input-level modifications alone. The method requires tuning only around 0.01% of the total parameters, making it one of the most parameter-efficient approaches [5].
Low-Rank Adaptation (LoRA) has become the most widely adopted PEFT method. Its core insight is that the weight updates during fine-tuning tend to have low intrinsic rank, meaning they can be approximated by the product of two much smaller matrices [2].
For a pre-trained weight matrix W_0 with dimensions d x k, LoRA represents the weight update as a low-rank decomposition: the change in weights is expressed as B * A, where B has dimensions d x r and A has dimensions r x k, with the rank r being much smaller than both d and k. During training, W_0 is frozen and only A and B are updated. The forward pass computes: output = x * W_0 + x * B * A, where the second term is the adapter's contribution.
Matrix A is initialized from a normal distribution and matrix B is initialized to zero, so the adapter initially produces zero output, meaning the model starts from the pre-trained weights. After training, the adapter matrices can be merged: W = W_0 + B * A. This merged weight is used during inference, which means LoRA adds exactly zero inference latency compared to the original model [2].
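A minimal NumPy sketch of the forward pass and the post-training merge follows (dimensions are illustrative, and the published method additionally scales the update by a factor α/r, omitted here):

```python
import numpy as np

d, k, r = 512, 512, 8
rng = np.random.default_rng(0)

W0 = rng.normal(0, 0.02, size=(d, k))  # frozen pre-trained weight
A = rng.normal(0, 0.02, size=(r, k))   # trainable, Gaussian init
B = np.zeros((d, r))                   # trainable, zero init

x = rng.normal(size=(4, d))

def lora_forward(x):
    return x @ W0 + x @ B @ A  # adapter contributes nothing at init (B = 0)

assert np.allclose(lora_forward(x), x @ W0)  # starts from the pre-trained model

# After training, B and A hold learned values; merging removes all overhead.
B = rng.normal(0, 0.02, size=(d, r))  # stand-in for trained values
W_merged = W0 + B @ A
assert np.allclose(x @ W_merged, lora_forward(x))  # same outputs, one matmul
```

The parameter saving is r(d + k) versus dk per adapted matrix: 8,192 trainable values here against 262,144 in the full weight.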
When applied to GPT-3 175B with rank r = 8, LoRA reduced the number of trainable parameters by 10,000x (from 175B to about 17.5M) and GPU memory requirements by 3x compared to full fine-tuning, while matching full fine-tuning quality on multiple benchmarks [2].
LoRA adapters are typically applied to the attention weight matrices (W_Q, W_K, W_V, W_O) and sometimes to the feed-forward layers. The rank r is a hyperparameter that controls the expressiveness of the adaptation; common values range from 4 to 64, with 8 or 16 being typical defaults.
QLoRA combines LoRA with aggressive base model quantization to enable fine-tuning of very large models on a single consumer GPU. The base model is quantized to 4-bit precision using a novel data type called 4-bit NormalFloat (NF4), and LoRA adapters are applied on top of these quantized weights. Gradients are backpropagated through the quantized model into the LoRA adapters, which are kept in higher precision (BF16) [6].
QLoRA introduces three technical innovations. First, the NF4 data type is designed to be information-theoretically optimal for normally distributed weights, outperforming standard 4-bit integer and floating-point formats. Second, double quantization further reduces memory by quantizing the quantization constants themselves, saving approximately 0.37 bits per parameter. Third, paged optimizers use CPU memory as overflow when GPU memory spikes occur during training, preventing out-of-memory errors [6].
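The 0.37-bit figure can be reproduced from the block sizes reported in the paper: one quantization constant per block of 64 weights, with the FP32 constants themselves quantized to 8 bits over second-level blocks of 256 constants.

```python
# Per-parameter overhead of storing quantization constants, in bits.
BLOCK = 64    # weights per first-level quantization block
BLOCK2 = 256  # first-level constants per second-level block

single = 32 / BLOCK                         # one FP32 constant per 64 weights = 0.5 bits
double = 8 / BLOCK + 32 / (BLOCK * BLOCK2)  # 8-bit constants + FP32 second-level constants
saving = single - double                    # ~0.373 bits per parameter
```

Over a 65B-parameter model, roughly 0.37 bits per parameter amounts to about 3 GB of memory recovered.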
The practical impact is significant. A 65B parameter model that would require over 780 GB of GPU memory for full fine-tuning in FP32, or around 130 GB for LoRA in FP16, can be fine-tuned with QLoRA on a single 48 GB GPU (such as an NVIDIA A6000 or A100). Using a consumer RTX 4090 (24 GB), models up to 33B parameters can be fine-tuned. The Guanaco family of models, fine-tuned with QLoRA on a single GPU in under 12 hours, reached 99.3% of ChatGPT's performance on the Vicuna benchmark [6].
Infused Adapter by Inhibiting and Amplifying Inner Activations (IA3) takes a different approach from LoRA. Instead of adding low-rank matrices, IA3 learns three vectors that rescale the keys and values in the self-attention mechanism and the intermediate activations in the feed-forward network. These learned vectors element-wise multiply the existing activations, requiring an extremely small number of trainable parameters (around 0.01% of the model) [7].
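A sketch of the rescaling for one attention head and one FFN block (NumPy; shapes are illustrative):

```python
import numpy as np

d_k, d_ff, seq = 64, 256, 10
rng = np.random.default_rng(0)

# Frozen activations produced by the base model
K = rng.normal(size=(seq, d_k))
V = rng.normal(size=(seq, d_k))
ffn_hidden = rng.normal(size=(seq, d_ff))

# The only trainable parameters: three rescaling vectors, initialized to
# ones so the adapted model starts identical to the base model.
l_k = np.ones(d_k)
l_v = np.ones(d_k)
l_ff = np.ones(d_ff)

K_scaled = K * l_k              # element-wise rescaling of keys
V_scaled = V * l_v              # ... and values
ffn_scaled = ffn_hidden * l_ff  # ... and intermediate FFN activations

n_trainable = l_k.size + l_v.size + l_ff.size  # 384 values for this head/FFN
```

Because the adaptation is a per-dimension multiplier rather than a new matrix, the trainable footprint grows linearly with the hidden size instead of quadratically.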
IA3 was motivated by the goal of making few-shot fine-tuning practical with very limited labeled data. The extreme parameter efficiency means that IA3 is less prone to overfitting than methods with more trainable parameters, making it suitable for scenarios with only tens or hundreds of training examples. The method was evaluated with the T-Few recipe, which combined IA3 with specific loss functions for few-shot learning [7].
Weight-Decomposed Low-Rank Adaptation (DoRA) analyzes the difference between full fine-tuning and LoRA through the lens of weight decomposition into magnitude and direction components. The authors found that full fine-tuning tends to make nuanced adjustments to both magnitude and direction, while LoRA predominantly alters direction, sometimes at the expense of magnitude accuracy [8].
DoRA decomposes each pre-trained weight matrix into a magnitude component (a scalar per output neuron) and a directional component (the normalized weight matrix). The magnitude component is made trainable directly, while the directional component is updated using LoRA's low-rank decomposition. This decomposition allows DoRA to more closely approximate the learning dynamics of full fine-tuning [8].
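The decomposition can be sketched in NumPy as follows (column-wise norms, with the LoRA pair updating only the direction; scaling details of the published method are omitted):

```python
import numpy as np

d, k, r = 128, 128, 8
rng = np.random.default_rng(0)

W0 = rng.normal(0, 0.05, size=(d, k))

# Decompose: per-column magnitude (trained directly) and direction (via LoRA).
m = np.linalg.norm(W0, axis=0)        # (k,) trainable magnitude vector
A = rng.normal(0, 0.02, size=(r, k))  # LoRA pair, applied to the direction
B = np.zeros((d, r))

def dora_weight():
    V = W0 + B @ A                         # low-rank directional update
    V_dir = V / np.linalg.norm(V, axis=0)  # normalize each column to unit length
    return V_dir * m                       # re-apply the trainable magnitude

assert np.allclose(dora_weight(), W0)      # exact identity at initialization
```

Because magnitude and direction receive separate gradients, the method can rescale a column without rotating it (and vice versa), which is the behavior the authors observed in full fine-tuning.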
In experiments across language understanding (commonsense reasoning with LLaMA), visual instruction tuning (LLaVA), and multimodal understanding (VL-BART), DoRA consistently outperformed LoRA. It was presented as an oral paper at ICML 2024 and is supported in the Hugging Face PEFT library. Like LoRA, DoRA can be merged into the base weights for zero-overhead inference [8].
The PEFT library from Hugging Face has become the standard open-source implementation for parameter-efficient fine-tuning. Released in early 2023, it provides a unified interface for applying multiple PEFT methods to any Hugging Face Transformers model. Supported methods include LoRA, QLoRA (via integration with the bitsandbytes library), prefix tuning, prompt tuning, IA3, AdaLoRA, and DoRA [9].
The library's design follows a simple pattern: a PEFT configuration object specifies the method and its hyperparameters, and the get_peft_model() function wraps a base model with the specified PEFT modules. After training, only the small adapter weights are saved. At inference time, the base model is loaded once, and different adapters can be loaded and swapped dynamically.
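The pattern looks roughly like the following (the model name and hyperparameter values are illustrative, not a recommendation; requires the `transformers` and `peft` packages):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load the frozen base model (model name is illustrative).
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

config = LoraConfig(
    r=16,                                # adapter rank
    lora_alpha=32,                       # scaling factor for the update
    target_modules=["q_proj", "v_proj"], # which weight matrices get adapters
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # reports only the small adapter count

# After training, model.save_pretrained("my-adapter") writes just the
# adapter weights, typically a few megabytes.
```

Switching to another supported method is mostly a matter of swapping the config class (e.g., a prompt-tuning or IA3 config object) while the rest of the training loop stays the same.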
This design enables practical multi-tenant serving: a single base model resides in GPU memory, and per-task LoRA adapters (each only a few megabytes) are loaded as needed. Frameworks like vLLM and Lorax support serving multiple LoRA adapters simultaneously on a shared base model, batching requests across different adapters for high throughput [9].
The following table provides a more detailed comparison of the major PEFT methods along several practical dimensions.
| Method | Trainable Params (7B model) | GPU Memory (7B model) | Performance vs. Full FT | Supports Merging | Multi-task Serving | Ease of Use |
|---|---|---|---|---|---|---|
| Full fine-tuning | 7B (100%) | ~56-112 GB | Baseline | N/A | Requires full copies | Standard |
| LoRA (r=16) | ~17M (0.24%) | ~16-24 GB (FP16) | 90-95% | Yes | Excellent (swap adapters) | Very easy |
| QLoRA (r=16) | ~17M (0.24%) | ~6-10 GB (4-bit) | 85-95% | Yes | Excellent | Easy |
| Adapters (m=64) | ~7M (0.1%) | ~16-24 GB (FP16) | 90-95% | No | Moderate | Moderate |
| Prefix tuning | ~1-5M (0.01-0.07%) | ~16-20 GB (FP16) | 85-95% | No | Moderate | Moderate |
| Prompt tuning | ~0.1-1M (0.001-0.01%) | ~16-18 GB (FP16) | 80-95% (scale-dependent) | No | Good (swap prompts) | Easy |
| IA3 | ~0.5M (0.007%) | ~16-18 GB (FP16) | 85-90% | Yes | Good | Easy |
| DoRA (r=16) | ~18M (0.26%) | ~16-24 GB (FP16) | 92-97% | Yes | Excellent | Easy |
Note: GPU memory estimates assume a 7B model in the specified precision. QLoRA's dramatically lower memory comes from the 4-bit quantized base model, not from fewer adapter parameters.
AdaLoRA (Adaptive LoRA) extends LoRA by dynamically allocating the rank of each adapter based on the importance of each weight matrix. Instead of using a fixed rank r for all layers, AdaLoRA starts with a higher rank and prunes singular values during training based on an importance score. This allows the method to assign more capacity to layers that benefit most from adaptation and less to layers that are already well-suited to the task [10].
VeRA (Vector-based Random Matrix Adaptation) further reduces the parameter count by sharing a pair of random matrices across all layers and learning only per-layer scaling vectors. The random matrices are frozen and not stored as trainable parameters, so the total number of trainable parameters is dramatically reduced compared to LoRA. VeRA achieves comparable performance to LoRA while training 10x fewer parameters [11].
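A NumPy sketch of the idea (dimensions illustrative; following the paper, one scaling vector is zero-initialized so the update starts at zero):

```python
import numpy as np

d, k, r = 512, 512, 64
rng = np.random.default_rng(0)

# Frozen random matrices, shared by every adapted layer in the model
# and regenerable from a seed, so they need not be stored as parameters.
B = rng.normal(size=(d, r))
A = rng.normal(size=(r, k))

# Per-layer trainable scaling vectors
b = np.zeros(d)                    # zero init: the update starts at zero
dv = rng.normal(0, 0.1, size=r)

def vera_delta():
    # Weight update: diag(b) @ B @ diag(dv) @ A
    return (b[:, None] * B) @ (dv[:, None] * A)

n_trainable = b.size + dv.size  # d + r = 576 per layer
n_lora = r * (d + k)            # 65,536 for LoRA at the same rank
```

The contrast with LoRA is that the expressive matrices are fixed and shared; only the two cheap scaling vectors differ per layer.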
LoRA+ observes that using different learning rates for the A and B matrices in LoRA can significantly improve performance. Specifically, setting the learning rate for B to be much higher (e.g., 16x) than the learning rate for A yields more efficient learning. This simple modification requires no architectural changes and can improve convergence speed by up to 2x [12].
The choice between PEFT and full fine-tuning depends on several factors:
Use PEFT when:

- GPU memory, compute budget, or storage is constrained, for example when fine-tuning on consumer hardware.
- Many task-specific variants must be served from a single shared base model.
- The training set is small, where a smaller trainable parameter count reduces the risk of overfitting.
- Near-full-fine-tuning quality is acceptable for the task.

Use full fine-tuning when:

- The last few points of task quality justify the additional memory, compute, and storage cost.
- Training data and compute are abundant and only a single model variant needs to be stored and served.
In practice, LoRA and QLoRA have narrowed the gap to the point where many practitioners default to PEFT unless there is a specific reason to perform full fine-tuning. The 2024-2025 trend has been toward increasingly sophisticated PEFT methods (DoRA, VeRA, LoRA+) that continue to close the remaining quality gap.
Although PEFT methods were originally developed for NLP, they have been successfully applied to other domains. LoRA has been widely adopted for fine-tuning diffusion models for image generation, where users create custom LoRA adapters that teach Stable Diffusion new styles, characters, or concepts. The Civitai and Hugging Face model hubs host thousands of community-created LoRA adapters for image generation models.
In computer vision, LoRA and adapters have been applied to vision transformers for image classification, object detection, and segmentation. For multimodal models like LLaVA, PEFT methods are used to fine-tune the language model component while keeping the vision encoder frozen.
The PEFT landscape in 2025-2026 is characterized by several developments:
LoRA remains the dominant method, but DoRA is gaining adoption as the default recommendation for practitioners who want slightly better performance with no additional inference cost. The Hugging Face PEFT library supports both, making switching trivial.
QLoRA has become the standard approach for fine-tuning on consumer hardware. With tools like unsloth and Axolotl providing optimized training pipelines, it is now possible to fine-tune models in the 30B range with QLoRA on a single NVIDIA RTX 4090 (24 GB VRAM), and 70B models on a single 48 GB GPU, in reasonable time frames.
Multi-adapter serving has matured, with vLLM, Lorax, and TensorRT-LLM supporting efficient batching across multiple LoRA adapters sharing a single base model. This enables cost-effective deployment of personalized models at scale.
Research continues on combining PEFT with other efficiency techniques. Quantization-aware PEFT (combining quantization of both the base model and the adapter), structured pruning combined with LoRA, and continual learning with PEFT adapters are all active research areas.
The theoretical understanding of why PEFT works so well has also advanced. Studies have shown that pre-trained models develop low-dimensional subspaces during training, and fine-tuning adjustments naturally reside in these subspaces. This explains why low-rank approximations (the foundation of LoRA) are effective: the updates truly are low-rank, not just approximately so [13].