HuggingFace PEFT

Updated Jul 23, 2026

PEFT (Parameter-Efficient Fine-Tuning) is an open-source Python library from Hugging Face that adapts large pretrained models to new tasks by training only a small set of added or selected parameters, often less than 1% of the total, while keeping the base model's weights frozen. Publicly launched in February 2023, it provides unified implementations of methods such as LoRA, QLoRA, prefix tuning, prompt tuning, P-tuning, IA3, AdaLoRA, DoRA, LoHa, and LoKr. It has become the de facto adapter library of the open-source ecosystem, largely because it makes fine-tuning large language models, vision models, and diffusion models feasible on a single consumer GPU.^[1]^[2] The decisive payoff is memory and storage: PEFT adapters are typically a few megabytes rather than the tens of gigabytes of a full fine-tune, and the QLoRA recipe it supports can fine-tune a 65 billion parameter model on a single 48 GB GPU while preserving 16-bit task performance.^[1]^[13] As the official documentation states, PEFT methods "only fine-tune a small number of (extra) model parameters ... while yielding performance comparable to a fully fine-tuned model," which "makes it more accessible to train and store large language models (LLMs) on consumer hardware."^[2]

PEFT is integrated tightly with the broader Hugging Face stack, including Transformers, Accelerate, Diffusers, and TRL.^[1]^[3]^[4] The project is hosted at github.com/huggingface/peft, is licensed under Apache 2.0, and as of mid-2026 has roughly 21,300 GitHub stars and an active release cadence.^[6]^[8] Its catalogue has grown to more than thirty methods, including LoRA, QLoRA, IA3, prefix tuning, prompt tuning, P-tuning, AdaLoRA, X-LoRA, LoftQ, LoHa, LoKr, DoRA, BOFT, VeRA, and VB-LoRA, among many others.^[2]^[5]^[6]

What problem does PEFT solve?

Fine-tuning a pretrained transformer such as BERT, GPT-2, or T5 historically required updating every parameter of the model and storing a full copy of those parameters per task. As models scaled into the tens and hundreds of billions of parameters, this approach became prohibitive both in GPU memory during training and in storage cost for serving many task-specific copies. The original LoRA paper made this point starkly: deploying many independently fine-tuned copies of the 175B parameter GPT-3 is operationally impractical, because every task-specific copy is itself a 175B parameter model.^[7] Between 2019 and 2022 a substantial research literature on parameter-efficient methods emerged, ranging from adapter layers and prefix tuning to LoRA and IA3, but most reference implementations lived in separate repositories with incompatible APIs.

Hugging Face's PEFT library was created to consolidate these methods behind a single API that plugs cleanly into the Transformers ecosystem. The first commit to the huggingface/peft repository on GitHub was made on 25 November 2022, and the project was first released to PyPI as version 0.0.1 on 19 January 2023.^[8] The launch blog post, titled "Parameter-Efficient Fine-Tuning using PEFT," was published on 10 February 2023 by Sourab Mangrulkar and Sayak Paul, who introduced LoRA, prefix tuning, P-tuning, and prompt tuning as the initial set of supported methods.^[1] The same date corresponds to the v0.1.0 tag on GitHub, marked as the "Initial release."^[8]

The motivation given in the launch announcement was twofold. First, PEFT methods produce checkpoints in the megabyte range rather than the gigabyte range of full fine-tunes while attaining performance close to full fine-tuning. Second, because the base model's weights remain frozen, they help avoid catastrophic forgetting.^[1] The post illustrated the storage gap concretely: "bigscience/mt0-xxl takes up 40GB of storage and full fine-tuning will lead to 40GB checkpoints for each downstream dataset whereas using PEFT methods it would be just a few MBs for each downstream dataset," with one worked example producing an adapter_model.bin of just 19 MB.^[1]

When was PEFT released, and how has it evolved?

The PEFT library has followed a steady release cadence since its initial launch, generally adding new adapter families, quantization integrations, and distributed training capabilities at each minor version. The table below summarises major releases as of May 2026.

Version	Date	Key additions
v0.1.0	2023-02-10	Initial public release with LoRA, prefix tuning, prompt tuning, P-tuning^[8]
v0.2.0	2023-03-10	API stabilisation, broader task support^[8]
v0.3.0	2023-05-03	Multi-adapter support, expanded testing suite, new examples^[8]
v0.4.0	2023-07-18	QLoRA integration, IA3 method, AutoPeftModel classes, LoRA for custom (non-transformer) models^[9]
v0.5.0	2023-08-22	GPTQ quantization support, low-level API^[8]
v0.6.0	2023-11-03	Diffusers backend integration, LoHa, LoKr, multitask prompt tuning, 4-bit/8-bit LoRA merging^[10]
v0.7.0	2023-12-06	Orthogonal Fine-Tuning (OFT), Megatron support, safetensors and better initialisation^[8]
v0.8.0	2024-01-30	Poly PEFT method, LoRA improvements^[8]
v0.9.0	2024-02-28	DoRA support, TIES/DARE/magnitude merging via `add_weighted_adapter`, AutoAWQ and AQLM quantization^[11]
v0.10.0	2024-03-21	QLoRA with DeepSpeed ZeRO-3 and FSDP (70B LLaMA on 2x24GB), layer replication, mixed adapter batches, LoftQ initialisation helper^[12]
v0.11.0	2024-05-16	BOFT, VeRA, PiSSA, HQQ and EETQ quantization backends^[8]
v0.12.0	2024-07-24	OLoRA, X-LoRA, FourierFT, HRA, and additional methods^[8]
v0.13.0	2024-09-25	LoRA+, VB-LoRA, and quality-of-life improvements^[8]
v0.14.0	2024-12-06	EVA, Context-aware Prompt Tuning, Bone method^[8]
v0.15.0	2025-03-19	CorDA initialisation, Trainable Tokens for selective embedding training^[8]
v0.16.0	2025-07-03	LoRA-FA optimizer, RandLoRA, C3A, expanded Conv2d support^[8]
v0.17.0	2025-08-01	SHiRA, MiSS, LoRA for `nn.Parameter` (enabling MoE LoRA), Bone deprecated in favour of MiSS^[8]
v0.18.0	2025-11-13	RoAd, ALoRA, Arrow, WaveFT, DeLoRA, OSF, text generation benchmark framework^[8]
v0.19.0	2026-04-14	Multiple additional methods, non-LoRA-to-LoRA checkpoint conversion, Tensor Parallelism support^[8]
v0.19.1	2026-04-16	Patch release^[8]

Version 0.4.0 in July 2023 is widely regarded as a turning point because it brought QLoRA into the standard PEFT API, making 4-bit quantized fine-tuning of large LLaMA-class models accessible without manual integration work.^[9] Version 0.6.0 in November 2023 marked another inflection: the Diffusers library replaced its bespoke LoRA loader with PEFT as its adapter backend, unifying text-to-image LoRA inference and training under PEFT and enabling features like instant adapter switching, scaling, and merge or unmerge for Stable Diffusion checkpoints.^[10]

Supported Methods

PEFT exposes a wide collection of parameter-efficient methods, organised broadly into low-rank update methods, soft prompting methods, orthogonal methods, and a smaller set of structured or sparse approaches. Each method is implemented as a subclass of PeftConfig paired with adapter modules that wrap the relevant layers of a base model.

Low-rank update methods

LoRA (Low-Rank Adaptation), introduced by Edward Hu and colleagues at Microsoft in 2021, freezes the pretrained weight matrix W and parameterises its update as the product of two trainable low-rank matrices B and A. The adapted weight is therefore W + BA, scaled in practice by a constant α/r, where the inner dimension r is much smaller than either dimension of W. The original LoRA paper reports that, relative to full fine-tuning of GPT-3 175B, this reduces the number of trainable parameters by up to 10,000 times while preserving downstream quality.^[7] LoRA is the most heavily used method in PEFT and serves as the basis for many derivative techniques.
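The reparameterisation can be sketched in a few lines of NumPy. This is a minimal illustration, not PEFT's implementation. The sizes are arbitrary, and the sketch assumes the common convention from the LoRA paper: a random A, a zero-initialised B (so training starts from the unmodified model), and an α/r scaling factor. It also works through the parameter arithmetic for a single 4096 × 4096 projection at r = 8.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 32, 4, 8   # illustrative sizes, not from a real model
scaling = alpha / r

W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
A = 0.01 * rng.standard_normal((r, d_in))   # trainable, small random init
B = np.zeros((d_out, r))                    # trainable, zero init


def lora_forward(x, W, A, B, scaling):
    """h = W x + (alpha / r) * B (A x); the d_out x d_in product B @ A is never formed."""
    return W @ x + scaling * (B @ (A @ x))


def lora_trainable_params(d_out, d_in, r):
    """A is r x d_in and B is d_out x r."""
    return r * (d_in + d_out)


x = rng.standard_normal(d_in)
h_init = lora_forward(x, W, A, B, scaling)      # equals W @ x because B is still zero

B_trained = rng.standard_normal((d_out, r))     # stand-in for a B after training
W_merged = W + scaling * (B_trained @ A)        # what merging folds into the base weight
h_adapter = lora_forward(x, W, A, B_trained, scaling)

full = 4096 * 4096
lora = lora_trainable_params(4096, 4096, 8)
print(full, lora, full / lora)  # 16777216 65536 256.0
```

Computing B (A x) instead of forming BA keeps the extra training cost proportional to r. Merging W + (α/r)BA afterwards removes the adapter's inference overhead entirely.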

QLoRA (Quantized LoRA), introduced by Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer in May 2023, combines 4-bit NormalFloat (NF4) quantization of the frozen base model with LoRA adapters trained in higher precision. The paper opens with the claim that QLoRA "reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU while preserving full 16-bit finetuning task performance," and the resulting "Guanaco" model family reached 99.3% of the performance level of ChatGPT on the Vicuna benchmark after just 24 hours of finetuning on a single GPU, evidence widely cited at the time that strong instruction-tuned LLaMA derivatives could be produced on consumer-grade hardware.^[13] QLoRA support landed in PEFT v0.4.0 alongside IA3 in July 2023.^[9]
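A back-of-envelope calculation shows why 4-bit storage of the frozen base brings a 65B model within reach of one 48 GB card. This sketch counts only the base weights. It deliberately ignores quantization constants, the LoRA weights and their optimizer state, and activations, all of which add to the real footprint.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Bytes needed for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 65e9
gpu_gb = 48

bf16_gb = weight_memory_gb(n_params, 16)  # 130.0: far beyond one 48 GB card
nf4_gb = weight_memory_gb(n_params, 4)    # 32.5: fits, leaving room for adapters and activations

for label, gb in [("16-bit", bf16_gb), ("4-bit NF4", nf4_gb)]:
    print(f"{label:>10}: {gb:6.1f} GB, fits on a {gpu_gb} GB GPU: {gb < gpu_gb}")
```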

DoRA (Weight-Decomposed Low-Rank Adaptation), proposed by NVIDIA researchers in early 2024, decomposes each pretrained weight matrix into a magnitude vector and a direction matrix, then applies LoRA only to the direction component. The paper, accepted as an ICML 2024 Oral, reports consistent improvements over LoRA on commonsense reasoning, visual instruction tuning, and image and video text understanding, especially at very low ranks.^[14] DoRA is enabled in PEFT via use_dora=True in LoraConfig and shipped in v0.9.0 in February 2024, with broader Conv2d and bitsandbytes-quantized support added in v0.10.0.^[11]^[12]
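The decomposition can be sketched in NumPy. The sketch assumes the paper's d × k convention, in which the magnitude m holds one entry per column and ‖·‖_c is the column-wise norm. It illustrates the formula only and is not PEFT's use_dora code path. The sizes are arbitrary, and the trained B is simulated with random values.

```python
import numpy as np


def column_norm(M):
    """Norm of each column, kept as a 1 x k row so it broadcasts over the columns."""
    return np.linalg.norm(M, axis=0, keepdims=True)


def dora_weight(W0, m, B, A):
    """W' = m * (W0 + B A) / ||W0 + B A||_c: LoRA updates the direction, m sets the magnitude."""
    V = W0 + B @ A
    return m * V / column_norm(V)


rng = np.random.default_rng(0)
d, k, r = 16, 8, 2
W0 = rng.standard_normal((d, k))           # frozen pretrained weight
m = column_norm(W0)                        # trainable magnitude, initialised from W0
A = 0.01 * rng.standard_normal((r, k))     # trainable LoRA factors acting on the direction
B = np.zeros((d, r))

W_init = dora_weight(W0, m, B, A)          # identical to W0 at initialisation
B_trained = rng.standard_normal((d, r))    # stand-in for a trained B
W_trained = dora_weight(W0, m, B_trained, A)
```

After the update, every column of W_trained still has exactly the norm stored in m. This is the sense in which direction and magnitude are learned separately.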

AdaLoRA, by Qingru Zhang and colleagues, was accepted at ICLR 2023. It parameterises each update in singular value decomposition form and learns to redistribute the rank budget across layers and weight matrices according to per-triplet importance scores. The training schedule consists of an initial phase with no budgeting, a budgeting phase that prunes low-importance triplets, and a final phase that retrains with the redistributed ranks.^[15]^[16]
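The sketch below is a simplified illustration of the two moving parts described above. The first is a rank-budget schedule with three phases: it holds at the initial budget, decays cubically, then holds at the final budget. The second is a global prune that keeps only the most important singular-value triplets. The function names are hypothetical, and the importance scores are passed in as given numbers. AdaLoRA itself derives them from smoothed sensitivity estimates during training, which this sketch does not reproduce.

```python
import numpy as np


def rank_budget(t, b_init, b_final, t_init, t_final, total_steps):
    """Warm-up at b_init, cubic decay, then a final phase at b_final."""
    if t < t_init:
        return b_init
    if t >= total_steps - t_final:
        return b_final
    progress = (t - t_init) / (total_steps - t_init - t_final)
    return int(round(b_final + (b_init - b_final) * (1 - progress) ** 3))


def prune_triplets(singular_values, importance, budget):
    """Keep the `budget` most important (P, lambda, Q) triplets across all layers; zero the rest."""
    scores = np.concatenate([s.ravel() for s in importance])
    if budget >= scores.size:
        return [lam.copy() for lam in singular_values]
    threshold = np.sort(scores)[-budget] if budget > 0 else np.inf
    return [np.where(imp >= threshold, lam, 0.0) for lam, imp in zip(singular_values, importance)]


# Two adapted matrices with 3 and 2 singular values (the diagonals of Lambda in P Lambda Q).
lams = [np.array([3.0, 2.0, 1.0]), np.array([5.0, 0.5])]
scores = [np.array([0.9, 0.1, 0.5]), np.array([0.8, 0.2])]

b = rank_budget(45, b_init=5, b_final=3, t_init=10, t_final=20, total_steps=100)
pruned = prune_triplets(lams, scores, budget=b)
```

Because the prune is global, the first matrix keeps two triplets while the second keeps only one. This is the redistribution of rank across weight matrices that the method relies on.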

X-LoRA (Mixture of LoRA Experts) by Eric L. Buehler and Markus J. Buehler, published in February 2024, performs token-level and layer-level gating over a set of frozen LoRA experts. The model runs a first forward pass without adapters to compute scalings, then re-runs with the dynamically weighted LoRA experts applied. The original paper applies X-LoRA to protein mechanics and molecular design tasks.^[17] X-LoRA support was added in PEFT v0.12.0 in July 2024.^[8]
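The two-pass data flow can be illustrated on a single linear layer. This is a heavily simplified, hypothetical sketch. X-LoRA's real gating classifier produces scalings for every adapted layer of the full model, whereas here a random linear head G followed by a softmax stands in for it.

```python
import numpy as np


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


rng = np.random.default_rng(0)
n_tokens, d, r, n_experts = 5, 16, 4, 3

W = rng.standard_normal((d, d))                        # frozen base layer
experts = [(rng.standard_normal((d, r)), rng.standard_normal((r, d)))
           for _ in range(n_experts)]                  # frozen LoRA experts as (B, A) pairs
G = 0.1 * rng.standard_normal((d, n_experts))          # hypothetical linear gating head (trainable)

x = rng.standard_normal((n_tokens, d))

# Pass 1: base layer only; its hidden states drive the per-token expert scalings.
hidden = x @ W.T
scalings = softmax(hidden @ G)                         # shape (n_tokens, n_experts)

# Pass 2: base layer plus every frozen expert, each weighted per token.
delta = sum(scalings[:, [i]] * (x @ A.T @ B.T) for i, (B, A) in enumerate(experts))
out = x @ W.T + delta
```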

VeRA (Vector-based Random Matrix Adaptation), by Dawid J. Kopiczko, Tijmen Blankevoort, and Yuki M. Asano (ICLR 2024), shares a single pair of frozen random low-rank matrices across all layers and learns only two small scaling vectors per layer. It targets parameter counts considerably smaller than LoRA while preserving accuracy on GLUE, E2E, image classification, and instruction tuning benchmarks.^[18]
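The structure ΔW = Λ_b B Λ_d A can be sketched as follows, where Λ_b and Λ_d are diagonal matrices built from the trainable vectors b and d. The sketch also includes the parameter arithmetic that explains VeRA's savings. Two assumptions are made for illustration: b starts at zero and d at a small constant, and the layer counts assume one square 4096-dimensional matrix adapted per layer.

```python
import numpy as np


def lora_params(n_layers, d, r):
    return n_layers * 2 * d * r          # A (r x d) and B (d x r) in every layer


def vera_params(n_layers, d, r):
    return n_layers * (d + r)            # scaling vectors b (length d) and d (length r) per layer


rng = np.random.default_rng(0)
dim, r = 32, 8
A = rng.standard_normal((r, dim))        # frozen random matrix, shared by every layer
B = rng.standard_normal((dim, r))        # frozen random matrix, shared by every layer


def vera_delta(b_vec, d_vec):
    """Delta W = diag(b) B diag(d) A, using broadcasting instead of building diagonals."""
    return (b_vec[:, None] * B) @ (d_vec[:, None] * A)


delta_init = vera_delta(np.zeros(dim), np.full(r, 0.1))   # b starts at zero, so Delta W = 0

# 32 square 4096-dim layers: LoRA at r=8 versus VeRA at a much larger shared rank r=256.
print(lora_params(32, 4096, 8), vera_params(32, 4096, 256))  # 2097152 139264
```

Because A and B are shared and frozen, VeRA can use a large rank while its per-layer cost stays at d + r numbers.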

VB-LoRA by Yang Li and colleagues introduces a global vector bank from which all adapter low-rank matrices are composed via a differentiable top-k admixture module, enabling parameter sharing across modules and layers and achieving extreme parameter efficiency.^[19]
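The top-k admixture at the heart of VB-LoRA can be sketched in NumPy. The layout is an assumption for illustration: a d × r LoRA factor is split into equal-length sub-vectors, and each sub-vector is chosen from the shared bank by its own logits. The softmax over the selected logits is what makes the selection differentiable in an autodiff framework. NumPy here only evaluates the forward pass.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def compose_subvector(logits, bank, k):
    """Top-k admixture: a softmax-weighted mix of the k highest-scoring bank vectors."""
    top = np.argsort(logits)[-k:]
    return softmax(logits[top]) @ bank[top]


rng = np.random.default_rng(0)
bank_size, sub_dim, k = 64, 8, 2
bank = rng.standard_normal((bank_size, sub_dim))     # global vector bank shared by all adapters

# One d x r LoRA factor built from (d * r / sub_dim) sub-vectors, each with its own logits.
d, r = 32, 4
n_sub = d * r // sub_dim
logits = rng.standard_normal((n_sub, bank_size))     # per-sub-vector selection logits (trainable)
factor = np.stack([compose_subvector(row, bank, k) for row in logits]).reshape(d, r)
```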

LoHa (Low-Rank Hadamard) and LoKr (Low-Rank Kronecker) represent ∆W using a Hadamard product of low-rank matrix pairs (LoHa) or a Kronecker product structure (LoKr). LoHa is based on the FedPara construction by Nam Hyeon-Woo and colleagues (arXiv:2108.06098) originally developed for communication-efficient federated learning.^[20] Both methods are popular in the diffusion adapter community via the LyCORIS project and were added in PEFT v0.6.0.^[5]^[10]
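Both constructions can be compared in a short NumPy sketch with arbitrary sizes. The sketch also illustrates a standard linear-algebra fact behind LoHa's appeal. The Hadamard product of two rank-r matrices can have rank up to r², so with the same parameter count as a rank-2r LoRA it can express higher-rank updates. This is not the PEFT or LyCORIS implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 16, 12, 3

# LoHa: element-wise (Hadamard) product of two rank-r products.
B1, A1 = rng.standard_normal((d_out, r)), rng.standard_normal((r, d_in))
B2, A2 = rng.standard_normal((d_out, r)), rng.standard_normal((r, d_in))
delta_loha = (B1 @ A1) * (B2 @ A2)

# LoKr: Kronecker product of a small dense block C with a low-rank product B A.
C = rng.standard_normal((4, 3))
B, A = rng.standard_normal((4, r)), rng.standard_normal((r, 4))
delta_lokr = np.kron(C, B @ A)            # (4*4) x (3*4) = 16 x 12

loha_params = 2 * r * (d_out + d_in)      # same count as a LoRA of rank 2r
print(delta_loha.shape, np.linalg.matrix_rank(delta_loha))  # rank can exceed 2r
print(delta_lokr.shape, C.size + B.size + A.size)           # 28 parameters for a 16 x 12 update
```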

PiSSA (Principal Singular Values and Singular Vectors Adaptation) by Fanxu Meng and colleagues uses SVD of the pretrained weight matrix to initialise the LoRA A and B matrices with the principal singular components and freezes the residual. The paper reports faster convergence and better final performance than vanilla LoRA, and the method was accepted as a NeurIPS 2024 Spotlight.^[21]
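PiSSA's initialisation can be sketched with a truncated SVD in NumPy. This is a minimal illustration of the idea rather than PEFT's code, which exposes the method through LoraConfig's init_lora_weights option. The sketch omits the LoRA scaling factor and uses an arbitrary random matrix in place of a pretrained weight.

```python
import numpy as np


def pissa_init(W, r):
    """The principal rank-r part becomes the trainable adapter; the rest is a frozen residual."""
    U, S, Vh = np.linalg.svd(W, full_matrices=False)
    sqrt_s = np.sqrt(S[:r])
    B = U[:, :r] * sqrt_s            # (d_out, r)
    A = sqrt_s[:, None] * Vh[:r]     # (r, d_in)
    W_res = W - B @ A                # frozen; keeps only the minor singular components
    return B, A, W_res, S


rng = np.random.default_rng(0)
W = rng.standard_normal((20, 10))
B, A, W_res, S = pissa_init(W, r=3)
```

Unlike the noise-and-zero LoRA initialisation, the adapter here starts out holding the directions along which W varies most. The residual's singular values are exactly the remaining S[3:].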

OLoRA by Kerim Büyükakyüz uses QR decomposition of the base weights to initialise the LoRA adapters with orthonormal matrices and to mutate the base weights accordingly, accelerating convergence relative to default LoRA initialisation.^[22]
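The QR-based initialisation can be sketched as follows. This is an illustration under stated assumptions rather than PEFT's implementation. It assumes d_out ≥ d_in, so that NumPy's reduced QR returns a Q with orthonormal columns, and it treats the LoRA scaling as a plain parameter.

```python
import numpy as np


def olora_init(W, r, scaling=1.0):
    """B gets the first r orthonormal columns of Q and A the matching rows of R (W = Q R)."""
    Q, R = np.linalg.qr(W)                 # reduced QR; for d_out >= d_in, Q is (d_out, d_in)
    B, A = Q[:, :r], R[:r, :]
    W_base = W - scaling * (B @ A)         # mutate the frozen base so the initial output is unchanged
    return B, A, W_base


rng = np.random.default_rng(0)
W = rng.standard_normal((20, 10))
B, A, W_base = olora_init(W, r=4, scaling=2.0)
```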

FourierFT treats the weight change matrix as a 2D spatial signal and learns a small set of discrete Fourier transform coefficients, recovering ∆W via the inverse DFT. The paper reports surpassing LoRA on instruction tuning of LLaMA2-7B using only 0.064 M trainable parameters versus 33.5 M for LoRA.^[23]
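The spectral parameterisation can be sketched with NumPy's inverse 2D FFT. The sizes, the number of coefficients n, and the scaling α are arbitrary choices for illustration. The point is that the only trainable values are the n coefficients at fixed random positions, yet they produce a dense update.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, n, alpha = 64, 64, 100, 1.0

# Fixed random spectral positions (not trained); only the n coefficients are learned.
positions = rng.choice(d_out * d_in, size=n, replace=False)
rows, cols = np.unravel_index(positions, (d_out, d_in))
coeffs = rng.standard_normal(n)


def fourierft_delta(coeffs):
    """Scatter the coefficients into an otherwise empty spectrum, then invert the 2D DFT."""
    spectrum = np.zeros((d_out, d_in), dtype=complex)
    spectrum[rows, cols] = coeffs
    return alpha * np.fft.ifft2(spectrum).real


delta_W = fourierft_delta(coeffs)   # dense d_out x d_in update from only n trainable numbers
```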

Soft prompting methods

Prefix Tuning, by Xiang Lisa Li and Percy Liang (2021), prepends a sequence of continuous task-specific vectors to every layer of a frozen language model. Subsequent tokens attend to these "virtual" prefix vectors as if they were real tokens. The paper reports that learning roughly 0.1% of parameters yields performance comparable to full fine-tuning for GPT-2 table-to-text and BART summarisation tasks, and better than fine-tuning in low-data regimes.^[24]

Prompt Tuning, by Brian Lester, Rami Al-Rfou, and Noah Constant (EMNLP 2021), is a simpler variant that learns a single set of soft prompt embeddings at the input layer only. The paper's key empirical finding is that prompt tuning becomes increasingly competitive with full model tuning as model scale grows; at multi-billion parameter T5 scales, it matches model tuning while training only a tiny fraction of the parameters.^[25]
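The two methods inject trainable vectors in different places, and the difference shows up directly in their parameter counts. The sketch below uses illustrative sizes and counts only the stored prompt or prefix vectors. Li and Liang train prefixes through an additional MLP reparameterisation, which is not counted here.

```python
import numpy as np


def prompt_tuning_params(num_virtual_tokens, hidden):
    return num_virtual_tokens * hidden                      # one soft embedding per virtual token


def prefix_tuning_params(num_virtual_tokens, hidden, num_layers):
    return num_virtual_tokens * num_layers * 2 * hidden     # a key and a value prefix in every layer


rng = np.random.default_rng(0)
hidden, n_virtual, seq_len = 16, 4, 6
soft_prompt = rng.standard_normal((n_virtual, hidden))      # trainable
token_embeds = rng.standard_normal((seq_len, hidden))       # looked up from the frozen embedding table

# Prompt tuning: the frozen model simply sees n_virtual extra "tokens" before the real input.
model_input = np.concatenate([soft_prompt, token_embeds], axis=0)

print(prompt_tuning_params(20, 4096), prefix_tuning_params(20, 4096, 32))  # 81920 5242880
```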

P-Tuning, by Xiao Liu and colleagues at Tsinghua University ("GPT Understands, Too"), learns continuous prompt embeddings interleaved with discrete prompt tokens through a small bidirectional prompt encoder. It reduces sensitivity to the exact wording of discrete prompts and improves performance on LAMA and SuperGLUE for both frozen and tuned base models.^[26]
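The sketch below shows the prompt-encoder idea in minimal form. Raw virtual-token embeddings pass through a small trainable encoder and are then interleaved with real token embeddings according to a template. The template format and function names are hypothetical. The paper used a bidirectional LSTM encoder, but this sketch uses a two-layer MLP for brevity; PEFT's PromptEncoderConfig also lets the encoder type be chosen.

```python
import numpy as np

rng = np.random.default_rng(0)
n_virtual, hidden, enc_hidden = 4, 16, 32

# Trainable: raw virtual-token embeddings plus a small encoder that reparameterises them.
raw_prompts = rng.standard_normal((n_virtual, hidden))
W1, b1 = 0.1 * rng.standard_normal((hidden, enc_hidden)), np.zeros(enc_hidden)
W2, b2 = 0.1 * rng.standard_normal((enc_hidden, hidden)), np.zeros(hidden)


def prompt_encoder(raw):
    """Two-layer ReLU MLP standing in for the paper's bidirectional LSTM."""
    return np.maximum(raw @ W1 + b1, 0.0) @ W2 + b2


def interleave(prompt_embeds, token_embeds, template):
    """Build the input sequence; in the template 'P' is a virtual slot and 'T' a real token."""
    prompts, tokens = iter(prompt_embeds), iter(token_embeds)
    return np.stack([next(prompts) if slot == "P" else next(tokens) for slot in template])


token_embeds = rng.standard_normal((3, hidden))              # frozen embeddings of real tokens
sequence = interleave(prompt_encoder(raw_prompts), token_embeds, "PPTTPPT")
```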

Multitask Prompt Tuning and Llama-Adapter are additional soft-prompt-style methods supported by PEFT. Llama-Adapter, from Renrui Zhang and colleagues, prepends learnable adaption prompts to the upper layers of LLaMA and uses zero-initialised attention plus a learnable gating factor to introduce instruction-following behaviour without overwriting pretrained knowledge.^[5]
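The zero-initialised gating can be illustrated with a simplified single-head attention, in which the adaption prompts get their own softmax and a scalar gate. Masking, multiple heads, and the way the prompt keys and values are produced are all omitted. The point is only that a gate starting at zero leaves the pretrained attention output unchanged.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def gated_attention(q, k_tok, v_tok, k_prompt, v_prompt, gate):
    """Single-head attention with a separate, gated branch over the adaption prompts."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    attn_tok = softmax(q @ k_tok.T * scale)
    attn_prompt = softmax(q @ k_prompt.T * scale)
    return attn_tok @ v_tok + gate * (attn_prompt @ v_prompt)


rng = np.random.default_rng(0)
seq, n_prompt, d = 5, 3, 8
q, k_tok, v_tok = (rng.standard_normal((seq, d)) for _ in range(3))
k_prompt, v_prompt = (rng.standard_normal((n_prompt, d)) for _ in range(2))

plain = softmax(q @ k_tok.T / np.sqrt(d)) @ v_tok
at_init = gated_attention(q, k_tok, v_tok, k_prompt, v_prompt, gate=0.0)   # zero-initialised gate
later = gated_attention(q, k_tok, v_tok, k_prompt, v_prompt, gate=0.5)     # after some training
```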

Activation rescaling

IA3 (Infused Adapter by Inhibiting and Amplifying Inner Activations), by Haokun Liu and colleagues at UNC ("Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning," 2022), introduces learned vectors that rescale activations in the keys, values, and feed-forward intermediate of each transformer block. Because it learns only rescaling vectors rather than low-rank matrices, IA3 typically trains far fewer parameters than LoRA. The paper argues that its few-shot recipe, T-Few, outperforms in-context learning while costing far less compute.^[27] IA3 entered PEFT in v0.4.0 in July 2023.^[9]
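The rescaling can be sketched on one simplified transformer block with a ReLU feed-forward layer. The sizes are arbitrary. The learned vectors start at ones, so the adapted block is initially identical to the frozen one, and the trainable parameter count per block is just 2·d_model + d_ff.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, seq = 8, 32, 5

# Frozen weights of one simplified transformer block.
W_k = rng.standard_normal((d_model, d_model))
W_v = rng.standard_normal((d_model, d_model))
W_ff1 = rng.standard_normal((d_model, d_ff))
W_ff2 = rng.standard_normal((d_ff, d_model))


def ia3_keys_values(x, l_k, l_v):
    """Rescale key and value activations element-wise."""
    return (x @ W_k) * l_k, (x @ W_v) * l_v


def ia3_ffn(x, l_ff):
    """Rescale the feed-forward intermediate activations before the output projection."""
    return (np.maximum(x @ W_ff1, 0.0) * l_ff) @ W_ff2


# The only trainable parameters: three vectors, initialised to ones (an exact no-op).
l_k, l_v, l_ff = np.ones(d_model), np.ones(d_model), np.ones(d_ff)
n_trainable = l_k.size + l_v.size + l_ff.size   # 2 * d_model + d_ff per block

x = rng.standard_normal((seq, d_model))
k, v = ia3_keys_values(x, l_k, l_v)
```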

Orthogonal methods

OFT (Orthogonal Fine-Tuning) constrains the update to an orthogonal transformation of the base weights. This preserves the pairwise angles between neurons and therefore the pretrained model's hyperspherical energy. The original OFT paper (arXiv:2306.07280) focuses on controlling text-to-image diffusion, where preserving subject identity is critical.^[28] OFT support was added in PEFT v0.7.0 in December 2023.^[8]
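The orthogonality constraint, and the property it protects, can be checked numerically. This sketch assumes that neurons are the columns of a d × n weight W0. It builds a block-diagonal orthogonal R from trainable blocks through the Cayley transform and applies W = R W0. OFT's exact parameterisation, and the side on which PEFT applies R, may differ.

```python
import numpy as np


def cayley(raw):
    """Orthogonal matrix (I + Q)(I - Q)^-1 built from the skew-symmetric part Q of `raw`."""
    Q = (raw - raw.T) / 2
    I = np.eye(Q.shape[0])
    return (I + Q) @ np.linalg.inv(I - Q)


def block_diagonal_rotation(raw_blocks):
    """Place one Cayley-parameterised orthogonal block per group along the diagonal."""
    size = sum(b.shape[0] for b in raw_blocks)
    R = np.zeros((size, size))
    start = 0
    for raw in raw_blocks:
        n = raw.shape[0]
        R[start:start + n, start:start + n] = cayley(raw)
        start += n
    return R


def neuron_cosines(W):
    """Pairwise cosine similarity between neurons (columns of W)."""
    Wn = W / np.linalg.norm(W, axis=0, keepdims=True)
    return Wn.T @ Wn


rng = np.random.default_rng(0)
d, n_neurons, block = 12, 6, 4
W0 = rng.standard_normal((d, n_neurons))                                      # frozen, one neuron per column
raw_blocks = [rng.standard_normal((block, block)) for _ in range(d // block)]  # trainable

R = block_diagonal_rotation(raw_blocks)
W = R @ W0                                                                     # fine-tuned weight
```

Zero blocks give R = I, so training starts from the pretrained model. Any learned R leaves the neuron-to-neuron cosines unchanged.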

BOFT (Orthogonal Butterfly) by Weiyang Liu and colleagues (arXiv:2311.06243) generalises OFT by factoring the orthogonal transformation into a sequence of sparse butterfly matrices, inspired by the Cooley-Tukey FFT, yielding O(d log d) parameters while maintaining the orthogonality guarantee.^[29] BOFT shipped in PEFT v0.11.0 in May 2024.^[8]

HRA (Householder Reflection Adaptation), introduced in 2024, parameterises the update as a chain of r trainable Householder reflections. Because each Householder matrix is orthogonal, the chain remains orthogonal, and the construction simultaneously admits a low-rank interpretation.^[5]
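A chain of Householder reflections can be built and checked in NumPy. The sketch applies the chain on the input side, W H, which is one possible convention rather than a statement about PEFT's HRA layer. The final check shows the low-rank view: H − I, and hence the weight change, has rank at most r.

```python
import numpy as np


def householder(u):
    """Reflection across the hyperplane orthogonal to u: I - 2 u u^T / ||u||^2."""
    u = u / np.linalg.norm(u)
    return np.eye(u.size) - 2.0 * np.outer(u, u)


def hra_matrix(U):
    """Product of the r reflections defined by the columns of U (d x r)."""
    H = np.eye(U.shape[0])
    for i in range(U.shape[1]):
        H = H @ householder(U[:, i])
    return H


rng = np.random.default_rng(0)
d, r = 12, 3
W = rng.standard_normal((8, d))      # frozen weight (out x in)
U = rng.standard_normal((d, r))      # trainable reflection vectors
H = hra_matrix(U)
W_adapted = W @ H                    # orthogonal update applied on the input side
```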

Quantization-aware initialisation

LoftQ (LoRA-Fine-Tuning-aware Quantization), by Yixiao Li and colleagues (arXiv:2310.08659), jointly quantizes the base LLM and computes a low-rank initialisation for the LoRA adapters that compensates for quantization error. The construction reduces the performance gap between full fine-tuning and quantization-plus-LoRA on understanding, QA, summarisation, and generation tasks.^[30] PEFT exposes LoftQ via the replace_lora_weights_loftq helper added in v0.10.0.^[12]
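The alternating procedure can be sketched with a deliberately simple quantizer. The fake_quantize function below is a hypothetical per-tensor 4-bit absmax grid, standing in for the formats LoftQ actually targets. loftq_init is an illustration rather than PEFT's implementation. Each iteration quantizes whatever the adapter does not yet explain, then refits the adapter to the remaining quantization error with a truncated SVD.

```python
import numpy as np


def fake_quantize(W, bits=4):
    """Simplified per-tensor absmax uniform quantizer, returned in dequantized form."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max() / qmax
    return np.clip(np.round(W / scale), -qmax, qmax) * scale


def best_rank_r(M, r):
    """Truncated SVD split into LoRA-style factors B (d_out x r) and A (r x d_in)."""
    U, S, Vh = np.linalg.svd(M, full_matrices=False)
    return U[:, :r] * S[:r], Vh[:r]


def loftq_init(W, r, bits=4, num_iters=5):
    B = np.zeros((W.shape[0], r))
    A = np.zeros((r, W.shape[1]))
    for _ in range(num_iters):
        Q = fake_quantize(W - B @ A, bits)   # quantize the part the adapter does not cover
        B, A = best_rank_r(W - Q, r)         # fit the adapter to the quantization error
    return Q, B, A


rng = np.random.default_rng(0)
W = rng.standard_normal((32, 32))
plain_error = np.linalg.norm(W - fake_quantize(W))
Q, B, A = loftq_init(W, r=4)
loftq_error = np.linalg.norm(W - (Q + B @ A))
print(plain_error, loftq_error)
```

Even one iteration cannot do worse than plain quantization. The truncated SVD is the best rank-r approximation of the quantization error, and the all-zero adapter is one of the candidates it improves on.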

Comparative summary

Method	Core idea	Original year	First PEFT version
LoRA^[7]	∆W = BA with rank r	2021	v0.1.0
Prefix Tuning^[24]	Per-layer learned prefix vectors	2021	v0.1.0
Prompt Tuning^[25]	Input-layer soft prompts	2021	v0.1.0
P-Tuning^[26]	Continuous prompts via prompt encoder	2021	v0.1.0
IA3^[27]	Activation rescaling vectors	2022	v0.4.0
QLoRA^[13]	4-bit NF4 base plus LoRA	2023	v0.4.0
AdaLoRA^[15]	SVD-form ∆W with importance pruning	2023	v0.3.0
LoHa^[20]	Hadamard of low-rank pairs	2021 (FedPara)	v0.6.0
LoKr^[5]	Kronecker structure	2023	v0.6.0
OFT^[28]	Orthogonal block-diagonal update	2023	v0.7.0
LoftQ^[30]	Quantization-aware LoRA init	2023	v0.10.0
DoRA^[14]	Magnitude + direction (LoRA)	2024	v0.9.0
BOFT^[29]	Butterfly-factored orthogonal	2023	v0.11.0
VeRA^[18]	Shared frozen B/A plus learned vectors	2024	v0.11.0
PiSSA^[21]	SVD-based LoRA initialisation	2024	v0.11.0
X-LoRA^[17]	Mixture of LoRA experts	2024	v0.12.0
OLoRA^[22]	QR-based LoRA init	2024	v0.12.0
FourierFT^[23]	Sparse DFT coefficients	2024	v0.12.0
VB-LoRA^[19]	Shared global vector bank	2024	v0.13.0

How do you use PEFT? (API surface)

The PEFT library exposes a small, regular surface that is intended to mirror the patterns of Hugging Face Transformers. The three central abstractions are PeftConfig, get_peft_model, and PeftModel.

PeftConfig

PeftConfig is the abstract base class for all method-specific configuration objects. Each adapter method has a corresponding subclass: LoraConfig, IA3Config, PrefixTuningConfig, PromptTuningConfig, PromptEncoderConfig (for P-tuning), AdaLoraConfig, OFTConfig, BOFTConfig, VeRAConfig, LoHaConfig, LoKrConfig, and many more. These configurations carry the hyperparameters that define the adapter (such as rank r, scaling lora_alpha, dropout, target module patterns, and quantization-related flags) along with a task_type that selects the appropriate task head wrapper (causal LM, sequence-to-sequence LM, sequence classification, token classification, question answering, feature extraction).^[9]

get_peft_model and PeftModel

The typical entry point is get_peft_model(model, peft_config). It takes a Hugging Face model, loaded with the from_pretrained method of an AutoModel class, together with a config object. It returns a PeftModel that wraps the base model with the adapter layers inserted at the appropriate target modules.^[1] PeftModel is the base class encompassing the various PEFT methods. It inherits from torch.nn.Module and stores the base model, the peft configuration, the list of modules to save, and (for soft-prompt methods) a prompt_encoder and the virtual prompt tokens.^[31]

PeftModel exposes:

save_pretrained(path) and from_pretrained(model, path) to save and reload the adapter weights only, producing the megabyte-scale checkpoints noted in the launch announcement.^[1]
print_trainable_parameters() to report how many parameters are trainable as both an absolute count and a percentage of the base model.^[1]
merge_and_unload() to fold the LoRA delta into the base weights and return a pure base model, removing any inference overhead from adapter computation.^[10]
add_adapter, set_adapter, disable_adapter, delete_adapter, enable_adapters, and disable_adapters for managing multiple adapters on a single base model.^[10]
add_weighted_adapter(adapters, weights, adapter_name, combination_type) to merge multiple LoRAs into a new adapter. Strategies include linear combination, concatenation, and SVD-based combination, plus the TIES, DARE, and magnitude-prune strategies added in v0.9.0.^[11]
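The TIES strategy itself can be sketched independently of PEFT. It trims each task vector to its largest entries, elects a sign per entry, then averages only the agreeing values. In PEFT it is selected with combination_type="ties" together with a density argument, and it operates on LoRA deltas with per-adapter weights. This standalone version, with hypothetical function names, omits those weights.

```python
import numpy as np


def trim(task_vector, density):
    """Keep the top `density` fraction of entries by magnitude; zero the rest."""
    k = int(round(density * task_vector.size))
    if k == 0:
        return np.zeros_like(task_vector)
    threshold = np.sort(np.abs(task_vector).ravel())[-k]
    return np.where(np.abs(task_vector) >= threshold, task_vector, 0.0)


def ties_merge(task_vectors, density):
    trimmed = np.stack([trim(tv, density) for tv in task_vectors])
    elected_sign = np.sign(trimmed.sum(axis=0))                    # sign with the larger total mass
    agrees = (np.sign(trimmed) == elected_sign) & (trimmed != 0)   # non-zero entries that agree
    counts = agrees.sum(axis=0)
    total = np.where(agrees, trimmed, 0.0).sum(axis=0)
    # Disjoint mean: average only over the agreeing task vectors; 0 where none agree.
    return np.divide(total, counts, out=np.zeros_like(total), where=counts > 0)


tv1 = np.array([1.0, -2.0, 0.1, 3.0])
tv2 = np.array([2.0, 1.0, -0.2, -1.5])
merged = ties_merge([tv1, tv2], density=0.5)
print(merged)
```

In the last entry the two vectors disagree (3.0 versus -1.5). The elected sign is positive, so the merged value is 3.0 rather than the plain average of 0.75.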

AutoPeftModel

PEFT v0.4.0 introduced the AutoPeftModel family (AutoPeftModelForCausalLM, AutoPeftModelForSeq2SeqLM, AutoPeftModelForSequenceClassification, AutoPeftModelForTokenClassification, AutoPeftModelForQuestionAnswering, AutoPeftModelForFeatureExtraction). Calling AutoPeftModelForCausalLM.from_pretrained(adapter_dir) reads the adapter config, identifies the base model and task type, downloads or loads the base, and applies the adapter in a single call.^[9]

Quantization integrations

PEFT integrates with multiple quantization backends rather than implementing quantization itself. The bitsandbytes 4-bit (NF4 and FP4) and 8-bit paths underpin QLoRA-style training. From v0.5.0, PEFT supports applying LoRA on top of GPTQ-quantized models. v0.9.0 added AutoAWQ (4-bit) and AQLM (down to 2-bit). v0.11.0 added HQQ and EETQ. Many of these backends do not support merge_and_unload(), because merging the LoRA delta into a quantized matrix would degrade numerical fidelity.^[8]^[11]
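A toy example shows why folding a LoRA delta back into quantized weights can be lossy. The quantizer here is a hypothetical per-tensor symmetric 4-bit grid. Real backends such as GPTQ, AWQ, or AQLM use different formats and per-group scales, and a real update is not always uniformly below half a quantization step; in that case part of it survives rounding. In this setup, however, re-quantizing the merged weight erases the update completely.

```python
import numpy as np

BITS = 4
QMAX = 2 ** (BITS - 1) - 1   # 7 integer levels either side of zero


def absmax_scale(W):
    """Per-tensor scale of a simplified symmetric 4-bit grid."""
    return np.abs(W).max() / QMAX


def snap_to_grid(W, scale):
    """Round onto the quantization grid and return the dequantized values."""
    return np.clip(np.round(W / scale), -QMAX, QMAX) * scale


rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
scale = absmax_scale(W)
W_q = snap_to_grid(W, scale)                    # the frozen quantized base weight

delta = 0.01 * rng.standard_normal((64, 64))    # a small, illustrative LoRA update B @ A

unmerged = W_q + delta                          # adapter kept separate: the update is applied exactly
merged = snap_to_grid(W_q + delta, scale)       # naive merge back onto the same 4-bit grid

print(np.abs(delta).max() < scale / 2)          # every entry of the update is under half a step
print(np.array_equal(merged, W_q))              # so re-quantizing rounds it all away
```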

How does PEFT power Diffusers and Stable Diffusion?

A pivotal milestone in PEFT's adoption was the November 2023 v0.6.0 release, in which the Diffusers library switched from its own LoRA loader to PEFT as the backend for LoRA adapter management. The announcement explicitly stated: "Diffusers now uses PEFT," and listed the new user-visible capabilities unlocked by the change.^[10]

The integration brought several previously fragmented capabilities under a single API: simultaneous use of multiple LoRA adapters on the same diffusion pipeline, instant switching between adapters at inference time, scalar scaling and combination of adapters, permanent merging and unmerging into the base UNet or text encoder, and on-the-fly enable or disable controls. The implementation reads both Diffusers-native LoRA checkpoint formats and Kohya-style formats popularised by community fine-tuners for Stable Diffusion derivatives, ensuring interoperability with the very large pool of community-uploaded LoRAs on the Hub.^[10] PEFT v0.10.0 in March 2024 added mixed-adapter batches, letting different samples in the same batch use different adapters via syntax such as adapter_names=["adapter1", "adapter2", "__base__"].^[12]
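The idea behind mixed adapter batches can be sketched for a single LoRA-adapted linear layer. Samples are grouped by adapter name, each adapter is applied only to its own rows, and "__base__" rows are left untouched. The adapter registry and mixed_batch_forward are hypothetical and only illustrate the semantics. They are not PEFT's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2
W = rng.standard_normal((d, d))   # frozen base layer
adapters = {                       # hypothetical registry: adapter name -> (B, A)
    "adapter1": (rng.standard_normal((d, r)), rng.standard_normal((r, d))),
    "adapter2": (rng.standard_normal((d, r)), rng.standard_normal((r, d))),
}


def mixed_batch_forward(x, adapter_names):
    """Apply a different LoRA adapter to each sample; '__base__' means no adapter."""
    out = x @ W.T
    for name in set(adapter_names):
        if name == "__base__":
            continue
        rows = [i for i, n in enumerate(adapter_names) if n == name]  # samples routed to `name`
        B, A = adapters[name]
        out[rows] += x[rows] @ A.T @ B.T
    return out


x = rng.standard_normal((3, d))
out = mixed_batch_forward(x, ["adapter1", "adapter2", "__base__"])
```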

The downstream impact has been visible in the diffusion community. A Hugging Face engineering blog post from July 2025 observed that more than 30,000 LoRA adapters trained against a single Flux base model had been published on the Hub, all of which are loadable through the PEFT-backed Diffusers pipeline.^[32]

Distributed Training and Quantized Backends

A recurring theme in the release notes from v0.10.0 onward is making large-scale fine-tuning approachable on commodity multi-GPU systems. v0.10.0 added support for combining QLoRA with both DeepSpeed ZeRO-3 and PyTorch FSDP, with the stated goal of fine-tuning a 70B LLaMA-2 model on two GPUs each with 24 GB of memory. The integration required coordinating updates across bitsandbytes, transformers, and FSDP to handle 4-bit parameter sharding and offloading correctly.^[12] v0.10.0 also added layer replication for LoRA, which duplicates layers with the LoRA adapters applied while keeping the base weights shared, providing a memory-efficient way to grow effective model depth.^[12]
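PEFT exposes layer replication through the layer_replication option of LoraConfig, which takes a list of half-open [start, end) ranges of existing layers. The hypothetical helper below only expands such ranges into the resulting layer order. It uses an example of the kind given in PEFT's documentation, in which a five-layer model with [[0, 4], [2, 5]] becomes layers 0, 1, 2, 3, 2, 3, 4. Repeated positions reuse the same frozen base weights but each receives its own adapter.

```python
import numpy as np


def expand_layers(layer_replication):
    """Layer order after replication; each [start, end) range is appended in sequence."""
    order = []
    for start, end in layer_replication:
        order.extend(range(start, end))
    return order


rng = np.random.default_rng(0)
base_layers = [rng.standard_normal((4, 4)) for _ in range(5)]   # frozen weights of a 5-layer model
order = expand_layers([[0, 4], [2, 5]])
stack = [base_layers[i] for i in order]                # repeated positions share one frozen array
lora_per_position = [np.zeros((4, 4)) for _ in order]  # hypothetical per-position adapter deltas
```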

The library uses prepare_model_for_kbit_training to set up frozen quantized models for stable PEFT training. It replaces the older prepare_model_for_int8_training, which was removed in v0.10.0.^[12] The per-sample adapter selection offered by mixed adapter batches, also introduced in v0.10.0, is useful for evaluation harnesses and routing scenarios.^[12]

How widely is PEFT adopted?

PEFT became a de facto standard for adapter-based fine-tuning within months of its release. Several adoption indicators are visible in public data.

GitHub repository metrics. As of mid-2026, huggingface/peft carries roughly 21,300 stars and about 2,300 forks, with an active commit history and an ongoing release cadence (most recently v0.19.1 in April 2026).^[6]^[8]
PyPI presence. The peft package has been on PyPI continuously since 19 January 2023.^[33]
Library uptake. The Diffusers library uses PEFT as its LoRA backend from late 2023 onward.^[10] The TRL library, which implements reinforcement learning from human feedback (RLHF) and supervised fine-tuning trainers, integrates PEFT for parameter-efficient training of reward and policy models.^[2] Accelerate exposes PEFT-aware code paths for distributed quantized training.^[12]
Diffusion community footprint. More than 30,000 PEFT-loadable LoRA adapters trained against a Flux base model are on the Hub, as reported by Hugging Face engineering in mid-2025.^[32]
Method coverage. As of the v0.19 series in 2026, PEFT integrates more than thirty distinct methods across LoRA-style, orthogonal, soft-prompt, and structured families. Many adapter methods proposed in the academic literature have shipped as official PEFT implementations within months of publication.^[5]^[8]

The library has also become a teaching reference: official Hugging Face course material, the Smol Course, and external curricula consistently use PEFT to demonstrate parameter-efficient fine-tuning of LLaMA-, Mistral-, and T5-class models.^[2]^[3]

Significance

PEFT's significance lies less in any single algorithmic contribution and more in standardisation. Before PEFT, each adapter method (LoRA, prefix tuning, IA3, OFT, and so on) was distributed as a separate research codebase, often tightly coupled to a specific model architecture and training stack. The PEFT API makes the cost of switching between methods small: a practitioner can change peft_config from LoraConfig(r=8, ...) to IA3Config(...) or BOFTConfig(...) without touching the surrounding training loop.^[2] This has accelerated empirical comparisons across methods and lowered the engineering cost of putting parameter-efficient fine-tuning into production pipelines.

For supervised fine-tuning of large language models on consumer hardware, the LoRA-plus-QLoRA path through PEFT remains the most common recipe in open-source post-training as of 2026.^[13]^[9] For diffusion adapters, PEFT supplies the loading, merging, and scaling primitives that the broader image-generation community depends on.^[10] For research, PEFT functions as a reference implementation that authors of new adapter papers target when they want adoption. The time between a paper's release and an official PEFT implementation being merged into main, typically as a new config class or LoRA option, is often measured in weeks.^[5]

Limitations and Criticisms

PEFT is not a one-size-fits-all replacement for full fine-tuning, and the library's documentation and method-specific papers identify several real limitations.

Performance gap on hard tasks. Several papers (notably DoRA and PiSSA) frame themselves explicitly as closing the accuracy gap between LoRA and full fine-tuning, an admission that vanilla LoRA can underperform on some tasks. PiSSA's authors describe LoRA's noise-and-zero adapter initialisation as a likely cause of slow convergence and lower final accuracy.^[14]^[21] DoRA's authors document accuracy gaps for LoRA on commonsense reasoning and visual instruction tuning that DoRA narrows but does not fully close at all ranks.^[14]

Quantization-merge limitations. PEFT can apply LoRA on top of GPTQ, AutoAWQ, AQLM, HQQ, and EETQ quantized models, but in most cases merge_and_unload() is not supported on these quantized backends because materialising the merged delta would lose precision. This means the inference-time overhead of LoRA cannot always be eliminated when the base is quantized to very low precision.^[11]

Catastrophic forgetting can still occur. Although freezing the base model reduces the risk of forgetting, large-rank LoRA on many target modules trained for long schedules can still degrade base-model capabilities; the launch blog and method papers describe this as a tuning choice rather than a guarantee.^[1]

Method proliferation. The library's choice to ship many adapter families (LoRA, LoHa, LoKr, OFT, BOFT, OLoRA, PiSSA, VeRA, VB-LoRA, FourierFT, HRA, X-LoRA, DoRA, MiSS, RoAd, ALoRA, Arrow, WaveFT, DeLoRA, OSF, and more) presents real selection difficulty for new users. The official conceptual guide attempts to triage by recommending LoRA as the default starting point, but in practice picking the right method, rank, and target modules remains an empirical exercise.^[5]

Breaking changes during evolution. Some releases introduce required interface changes (for example, v0.9.0 mandated an update_layer method on custom adapter layer implementations and disallowed rank-zero configurations, while v0.10.0 removed prepare_model_for_int8_training). These changes are documented but can break downstream training scripts that pin loose version ranges.^[11]^[12]

Related Projects

PEFT sits alongside several complementary projects in the Hugging Face stack and the broader open-source ecosystem.

Hugging Face Transformers supplies the base model classes that PEFT wraps. Most PEFT examples assume a model loaded with AutoModelForCausalLM.from_pretrained(...) or similar.^[2]
The Accelerate library handles distributed training, mixed precision, and device placement for PEFT models, and integrates with FSDP and DeepSpeed.^[12]
TRL (Transformer Reinforcement Learning) implements supervised fine-tuning, direct preference optimisation, and reinforcement learning training loops, all of which can wrap a PeftModel to keep memory low during alignment phases.^[2]
The Diffusers library uses PEFT as the LoRA backend for Stable Diffusion and successor models, including SDXL and the diffusion transformer families.^[10]
The reference Microsoft LoRA repository (microsoft/LoRA) remains a primary citation for the algorithm but is no longer the dominant production implementation; PEFT supersedes it for most workflows.^[7]
The LyCORIS project provides additional diffusion-focused variants (LoHa, LoKr, DyLoRA, and others), and PEFT includes implementations of LoHa and LoKr drawn conceptually from this lineage.^[5]

References

Sourab Mangrulkar and Sayak Paul, "Parameter-Efficient Fine-Tuning using PEFT," Hugging Face Blog, 2023-02-10. https://huggingface.co/blog/peft. Accessed 2026-05-20. ↩
Hugging Face, "PEFT documentation index," HuggingFace, 2026. https://huggingface.co/docs/peft/en/index. Accessed 2026-05-20. ↩
Hugging Face, "Parameter-efficient fine-tuning (Transformers integration)," HuggingFace Transformers docs, 2026. https://huggingface.co/docs/transformers/en/peft. Accessed 2026-05-20. ↩
Hugging Face, "huggingface/peft README," GitHub, 2026. https://github.com/huggingface/peft/blob/main/README.md. Accessed 2026-05-20.
Hugging Face, "Adapters conceptual guide," PEFT docs, 2026. https://huggingface.co/docs/peft/main/en/conceptual_guides/adapter. Accessed 2026-05-20.
Hugging Face, "huggingface/peft repository overview," GitHub, 2026. https://github.com/huggingface/peft. Accessed 2026-05-20.
Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen, "LoRA: Low-Rank Adaptation of Large Language Models," arXiv:2106.09685, 2021-06-17. https://arxiv.org/abs/2106.09685. Accessed 2026-05-20.
Hugging Face, "huggingface/peft Releases," GitHub, 2026. https://github.com/huggingface/peft/releases. Accessed 2026-05-20.
Hugging Face, "v0.4.0: QLoRA, IA3 PEFT method, support for QA and Feature Extraction tasks, AutoPeftModel for simplified UX, LoRA for custom models with new added utils," GitHub Releases, 2023-07-18. https://github.com/huggingface/peft/releases/tag/v0.4.0. Accessed 2026-05-20.
Hugging Face, "v0.6.0: Diffusers now uses PEFT, new tuning methods, better quantization support," GitHub Releases, 2023-11-03. https://github.com/huggingface/peft/releases/tag/v0.6.0. Accessed 2026-05-20.
Hugging Face, "v0.9.0: Merging LoRA weights, new quantization options, DoRA support, and more," GitHub Releases, 2024-02-28. https://github.com/huggingface/peft/releases/tag/v0.9.0. Accessed 2026-05-20.
Hugging Face, "v0.10.0: Fine-tune larger QLoRA models with DeepSpeed and FSDP, layer replication, and more," GitHub Releases, 2024-03-21. https://github.com/huggingface/peft/releases/tag/v0.10.0. Accessed 2026-05-20.
Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer, "QLoRA: Efficient Finetuning of Quantized LLMs," arXiv:2305.14314, 2023-05-23. https://arxiv.org/abs/2305.14314. Accessed 2026-05-20.
Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen, "DoRA: Weight-Decomposed Low-Rank Adaptation," arXiv:2402.09353, 2024-02-14. https://arxiv.org/abs/2402.09353. Accessed 2026-05-20.
Qingru Zhang, Minshuo Chen, Alexander Bukharin, Pengcheng He, Yu Cheng, Weizhu Chen, and Tuo Zhao, "AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning," arXiv:2303.10512, 2023-03-18. https://arxiv.org/abs/2303.10512. Accessed 2026-05-20.
Hugging Face, "AdaLoRA conceptual guide entry," PEFT docs, 2026. https://huggingface.co/docs/peft/main/en/conceptual_guides/adapter. Accessed 2026-05-20.
Eric L. Buehler and Markus J. Buehler, "X-LoRA: Mixture of Low-Rank Adapter Experts, a Flexible Framework for Large Language Models with Applications in Protein Mechanics and Molecular Design," arXiv:2402.07148, 2024-02-11. https://arxiv.org/abs/2402.07148. Accessed 2026-05-20.
Dawid J. Kopiczko, Tijmen Blankevoort, and Yuki M. Asano, "VeRA: Vector-based Random Matrix Adaptation," arXiv:2310.11454, 2023-10-17. https://arxiv.org/abs/2310.11454. Accessed 2026-05-20.
Yang Li, Shaobo Han, and Shihao Ji, "VB-LoRA: Extreme Parameter Efficient Fine-Tuning with Vector Banks," arXiv:2405.15179, 2024-05-24. https://arxiv.org/abs/2405.15179. Accessed 2026-05-20.
Nam Hyeon-Woo, Moon Ye-Bin, and Tae-Hyun Oh, "FedPara: Low-Rank Hadamard Product for Communication-Efficient Federated Learning," arXiv:2108.06098, 2021-08-13. https://arxiv.org/abs/2108.06098. Accessed 2026-05-20.
Fanxu Meng, Zhaohui Wang, and Muhan Zhang, "PiSSA: Principal Singular Values and Singular Vectors Adaptation of Large Language Models," arXiv:2404.02948, 2024-04-03. https://arxiv.org/abs/2404.02948. Accessed 2026-05-20.
Kerim Büyükakyüz, "OLoRA: Orthonormal Low-Rank Adaptation of Large Language Models," arXiv:2406.01775, 2024-06-03. https://arxiv.org/abs/2406.01775. Accessed 2026-05-20.
Ziqi Gao, Qichao Wang, Aochuan Chen, Zijing Liu, Bingzhe Wu, Liang Chen, and Jia Li, "Parameter-Efficient Fine-Tuning with Discrete Fourier Transform," arXiv:2405.03003, 2024-05-05. https://arxiv.org/abs/2405.03003. Accessed 2026-05-20.
Xiang Lisa Li and Percy Liang, "Prefix-Tuning: Optimizing Continuous Prompts for Generation," arXiv:2101.00190, 2021-01-01. https://arxiv.org/abs/2101.00190. Accessed 2026-05-20.
Brian Lester, Rami Al-Rfou, and Noah Constant, "The Power of Scale for Parameter-Efficient Prompt Tuning," arXiv:2104.08691 (EMNLP 2021), 2021-04-18. https://arxiv.org/abs/2104.08691. Accessed 2026-05-20.
Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang, "GPT Understands, Too," arXiv:2103.10385, 2021-03-18. https://arxiv.org/abs/2103.10385. Accessed 2026-05-20.
Haokun Liu, Derek Tam, Mohammed Muqeeth, Jay Mohta, Tenghao Huang, Mohit Bansal, and Colin Raffel, "Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning," arXiv:2205.05638, 2022-05-11. https://arxiv.org/abs/2205.05638. Accessed 2026-05-20.
Zeju Qiu, Weiyang Liu, Haiwen Feng, Yuxuan Xue, Yao Feng, Zhen Liu, Dan Zhang, Adrian Weller, and Bernhard Schoelkopf, "Controlling Text-to-Image Diffusion by Orthogonal Finetuning," arXiv:2306.07280, 2023-06-12. https://arxiv.org/abs/2306.07280. Accessed 2026-05-20.
Weiyang Liu, Zeju Qiu, Yao Feng, Yuliang Xiu, Yuxuan Xue, Longhui Yu, Haiwen Feng, Zhen Liu, Juyeon Heo, Songyou Peng, Yandong Wen, Michael J. Black, Adrian Weller, and Bernhard Schoelkopf, "Parameter-Efficient Orthogonal Finetuning via Butterfly Factorization," arXiv:2311.06243, 2023-11-10. https://arxiv.org/abs/2311.06243. Accessed 2026-05-20.
Yixiao Li, Yifan Yu, Chen Liang, Pengcheng He, Nikos Karampatziakis, Weizhu Chen, and Tuo Zhao, "LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models," arXiv:2310.08659, 2023-10-12. https://arxiv.org/abs/2310.08659. Accessed 2026-05-20.
Hugging Face, "PeftModel package reference," PEFT docs, 2026. https://huggingface.co/docs/peft/main/en/package_reference/peft_model. Accessed 2026-05-20.
Sayak Paul and Benjamin Bossan, "Fast LoRA inference for Flux with Diffusers and PEFT," Hugging Face Blog, 2025-07-23. https://huggingface.co/blog/lora-fast. Accessed 2026-05-20.
Python Package Index, "peft project history," PyPI, 2026. https://pypi.org/project/peft/. Accessed 2026-05-20.