QLoRA
Last reviewed
May 31, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v4 · 7,782 words
QLoRA (Quantized Low-Rank Adaptation) is a parameter-efficient fine-tuning method for large language models introduced in May 2023 by Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer at the University of Washington[1]. The technique combines LoRA, the low-rank adapter method published by Hu et al. in 2021[2], with aggressive 4-bit quantization of the frozen base model. The result is a training recipe that allows fine-tuning of an LLM with up to 65 billion parameters on a single GPU with 48 GB of memory, while preserving the predictive quality of full 16-bit fine-tuning[1].
QLoRA's central design idea is that the base model's weights can be stored in a heavily compressed 4-bit representation and dequantized on the fly during the forward pass, while the small LoRA adapter matrices that absorb gradient updates remain in higher precision. The paper introduced three new components to make this work without quality loss: a 4-bit data type called NF4 (NormalFloat 4-bit), a double quantization scheme that compresses the quantization constants themselves, and paged optimizer states that use NVIDIA unified memory to handle gradient checkpointing memory spikes[1]. Using QLoRA, the authors trained the Guanaco family of chat models, which reached 99.3% of ChatGPT performance on the Vicuna benchmark after only 24 hours of fine-tuning on a single GPU[1]. The 99.3% figure is best understood in the 2023 context of the Vicuna benchmark, which used GPT-4 as an automated judge over a fixed 80-prompt set and is no longer considered a reliable measure of frontier model quality.
QLoRA was presented as an oral paper at NeurIPS 2023[22]. Its open-source implementation in the bitsandbytes library and integration with Hugging Face Transformers and the PEFT library underpin a large fraction of community fine-tunes built on Llama, Llama 2, Mistral, and similar models[3][4]. By the end of 2023, the original arXiv preprint had been widely cited and the bitsandbytes library had become a common dependency in public LLM training stacks.
Full fine-tuning of a large language model means computing and storing gradients and optimizer state for every parameter. For a 65-billion-parameter model trained in bfloat16 with the AdamW optimizer in 32-bit precision, the memory budget is dominated by the weights themselves, gradients, and two Adam moment buffers. Adding the activations required for backpropagation, the total comfortably exceeds 780 GB of GPU memory, far beyond what any single accelerator delivers[1]. Training such a model end-to-end therefore requires sharding it across many accelerators, typically several multi-GPU servers, an option available mainly to well-funded laboratories.
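The 780 GB figure can be reproduced with per-parameter arithmetic. The sketch below assumes 2 bytes each for bf16 weights and gradients and 8 bytes for two FP32 Adam moments; the function name and this byte breakdown are illustrative assumptions rather than an accounting taken from the paper, and activations are deliberately left out.

```python
def full_finetune_gb(n_params, weight_bytes=2, grad_bytes=2, adam_bytes=8):
    """Rough memory for full fine-tuning, excluding activations.

    Assumed breakdown: bf16 weights (2 B), bf16 gradients (2 B),
    and two FP32 Adam moment buffers (2 * 4 B) per parameter.
    """
    bytes_per_param = weight_bytes + grad_bytes + adam_bytes
    return n_params * bytes_per_param / 1e9


# 65 billion parameters at 12 bytes each
full_65b_gb = full_finetune_gb(65e9)
print(f"65B full fine-tune, no activations: {full_65b_gb:.0f} GB")
```

Activations come on top of this total, which is why the text describes the requirement as exceeding 780 GB.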
Two earlier lines of work attacked this cost from different angles. The first is parameter-efficient fine-tuning, which freezes most of the model and trains only a small set of additional parameters. LoRA, introduced by Hu et al. in 2021[2], freezes the pretrained weight matrix W and represents the update as a product of two narrow matrices, delta_W = B * A, where A has shape (r, d_in) and B has shape (d_out, r) with rank r much smaller than the model dimensions. The forward pass becomes Y = W X + B A X. Only A and B receive gradients, which typically reduces the trainable parameter count by two to four orders of magnitude. Other parameter-efficient techniques include adapter modules (Houlsby et al. 2019[18]), prefix tuning, and prompt tuning.
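A minimal sketch of the LoRA forward pass and its parameter savings, written with plain Python lists so that it has no dependencies. The helper names and the tiny example matrices are hypothetical and chosen only for illustration.

```python
def matmul(P, Q):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]


def lora_forward(W, A, B, X):
    """Y = W X + B (A X): frozen base path plus low-rank update path."""
    base = matmul(W, X)
    update = matmul(B, matmul(A, X))
    return [[b + u for b, u in zip(rb, ru)] for rb, ru in zip(base, update)]


def lora_param_counts(d_in, d_out, r):
    """Parameters in the full matrix W versus the adapter pair (A, B)."""
    return d_in * d_out, r * d_in + d_out * r


W = [[1, 2, 3], [4, 5, 6]]   # frozen weight, shape (d_out=2, d_in=3)
A = [[0.5, -0.5, 0.25]]      # shape (r=1, d_in=3)
X = [[1], [0], [-1]]         # one input column

# With B at zero the adapter contributes nothing, so Y equals W X.
y_zero_b = lora_forward(W, A, [[0], [0]], X)
# A non-zero B adds the rank-1 update B (A X) on top of W X.
y_trained = lora_forward(W, A, [[1], [2]], X)

full, lora = lora_param_counts(4096, 4096, 8)
reduction = full / lora
```

For a single 4096 by 4096 matrix at rank 8 the adapter holds 256 times fewer parameters than the matrix itself; the larger whole-model reductions quoted above arise when only some of a model's matrices receive adapters.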
LoRA on its own does not eliminate the memory cost of holding the frozen base weights in 16-bit precision. For a 65B model, that is roughly 130 GB simply to store the unmodified weights, before any optimizer state, activations, or adapters are added. This puts vanilla LoRA out of reach for any single consumer or workstation card, which is exactly the gap QLoRA targets.
The second line is post-training quantization of model weights. The LLM.int8() paper by Dettmers et al. in 2022[5] showed that an 8-bit weight representation can match 16-bit inference quality on transformer LLMs if outlier features are handled with a mixed-precision decomposition. Subsequent work pushed quantization to four bits per weight: GPTQ (Frantar et al. 2022[6]) and AWQ (Lin et al. 2023[7]) both demonstrated 4-bit inference with small accuracy loss. These methods, however, are designed for inference; using a naive 4-bit weight matrix during quantization-aware fine-tuning breaks gradient flow and degrades quality because gradients with respect to discretized weights are zero almost everywhere.
QLoRA's contribution is to combine these two threads in a way that avoids both of their failure modes. The base weights stay frozen, so they never receive gradients and can be stored at four bits without harming optimization. The LoRA adapters stay in 16-bit precision and absorb all gradient updates. The only step where the two precisions meet is the forward pass, where the 4-bit base weights are dequantized to 16-bit on the fly to compute the matrix product W X, then immediately discarded.
QLoRA builds on a sequence of efficiency papers from the same author. The 2022 ICLR paper 8-bit Optimizers via Block-wise Quantization[16] showed that the moment buffers of Adam and other adaptive optimizers can be quantized to 8 bits with block-wise scaling without quality loss. The 2022 NeurIPS paper LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale[5] established that vector-wise quantization combined with mixed-precision handling of outlier features lets transformer matrix multiplications run in 8 bits without measurable quality loss. QLoRA inherits the block-wise scaling philosophy of the 8-bit optimizers and applies it to 4-bit storage of frozen weights during fine-tuning.
The QLoRA paper was authored by Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer, all then affiliated with the Paul G. Allen School of Computer Science and Engineering at the University of Washington in Seattle[1]. The first author Tim Dettmers had previously published the LLM.int8() paper at NeurIPS 2022 and authored the bitsandbytes library, the open-source CUDA toolkit that hosts the QLoRA implementation. Ari Holtzman is best known for the 2020 ICLR paper on nucleus sampling. Luke Zettlemoyer is a professor at UW and a research scientist at Meta AI Research, and was at the time a co-supervisor of the OPT, OPT-IML, and LIMA projects.
The paper was first posted on arXiv as preprint 2305.14314 on 23 May 2023[1]. It was accepted as an oral presentation at the Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS) in December 2023, with the talk recorded and made publicly available through the NeurIPS virtual site[22]. The full reference implementation was released at the GitHub repository artidoro/qlora under MIT license, named after second author Artidoro Pagnoni, simultaneously with the arXiv release. The bitsandbytes library, which contains the NF4 kernels, paged optimizers, and double quantization primitives, was concurrently updated to support all three contributions.
Dettmers moved from UW to Carnegie Mellon University as an assistant professor in 2024 and subsequently to AI2 (Allen Institute for AI) as a research scientist, while continuing to maintain bitsandbytes within the newly created bitsandbytes-foundation umbrella that sees contributions from Hugging Face, Intel, AMD, and the original UW group[3].
The first contribution is a new 4-bit data type, the 4-bit NormalFloat (NF4). The paper's name for the type is 4-bit NormalFloat, abbreviated NF4; variants such as "Normal Float 4-bit" or "Normal4" also appear informally[1]. A 4-bit type can represent at most 16 distinct values, since 2^4 = 16. Standard 4-bit integer quantization spaces these values uniformly across the weight range, which is wasteful because the weights of a pretrained transformer are not uniformly distributed. Empirically, the weights of large pretrained models are well approximated by a zero-mean normal distribution with a small standard deviation, so most weights cluster near zero and a uniform grid wastes capacity on regions that are almost never populated[1].
NF4 instead places its 16 quantization levels at the quantiles of a standard normal distribution. Each level represents an equal probability mass under a unit normal, after which the grid is rescaled by a per-block scaling constant to match the actual weight magnitudes. The construction is motivated by an information-theoretic argument: if the weights are normally distributed, each of the 16 levels is used about equally often, so none of the 4-bit code space is wasted, and the paper describes the resulting type as information-theoretically optimal for normally distributed weights[1]. The exact level positions are computed once and stored as a fixed lookup table; quantization reduces to a nearest-level search and dequantization to a table lookup.
Formally, the levels are defined by
q_i = 0.5 * ( Q( i / (2^k + 1) ) + Q( (i+1) / (2^k + 1) ) )
for i = 0, 1, ..., 2^k - 1, where Q is the inverse cumulative distribution function (the quantile function) of a standard normal N(0, 1), and k = 4 for NF4. The 16 resulting values are normalized to the range [-1, 1] and stored as the NF4 codebook.
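Evaluated literally, the formula above needs Q(0) at i = 0, which is negative infinity, so practical implementations truncate the tails. The sketch below builds a 16-level codebook with an asymmetric variant that also makes zero exactly representable: 8 positive levels, 7 negative levels, and an explicit 0. The offset value 0.9677083 and the 8/7 split are assumptions modeled on the bitsandbytes implementation as commonly described, not something the formula above specifies; linspace is a small local helper.

```python
from statistics import NormalDist


def linspace(start, stop, num):
    """Evenly spaced points from start to stop, inclusive."""
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


def nf4_codebook(offset=0.9677083):
    """Sketch of a 16-level NormalFloat codebook (assumed construction)."""
    quantile = NormalDist().inv_cdf  # Q: inverse CDF of N(0, 1)
    # Probabilities run from `offset` down toward 0.5; the 0.5 endpoint
    # (quantile 0) is dropped because zero is added explicitly below.
    positive = [quantile(p) for p in linspace(offset, 0.5, 9)[:-1]]   # 8 levels
    negative = [-quantile(p) for p in linspace(offset, 0.5, 8)[:-1]]  # 7 levels
    levels = sorted(negative + [0.0] + positive)
    peak = max(abs(v) for v in levels)
    return [v / peak for v in levels]  # normalize into [-1, 1]


codebook = nf4_codebook()
print([round(v, 4) for v in codebook])
```

The levels are densest near zero, where most normally distributed weights fall, and sparsest near the ends of the range.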
In the paper's ablations on LLaMA, OPT, BLOOM, and Pythia models from 125M to 13B parameters, NF4 with double quantization achieves a mean perplexity on Pile Common Crawl of 27.41, compared to 29.48 for FP4 with three exponent bits, 31.07 for FP4 with two exponent bits, and 34.34 for plain Int4[1]. Across MMLU and other downstream benchmarks, the gap between NF4 and FP4 corresponds to roughly one percentage point of accuracy.
Quantization is performed block-wise with a block size of 64. Within each block of 64 consecutive weights, the maximum absolute value is computed as a scaling factor c1, and each weight is divided by c1 and mapped to its nearest NF4 level. Block-wise scaling confines the effect of any single outlier to its own block, an idea closely related to the vector-wise scaling that LLM.int8() uses to limit the damage from outlier features.
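A minimal sketch of block-wise absmax quantization against a 16-level codebook. The level values are the NF4 codebook as commonly published, rounded to four decimals, and should be treated as an assumption here; the function names are hypothetical. The example places one outlier in the first block to show that only that block's precision suffers.

```python
import random

# Assumed NF4 levels, rounded to four decimals.
NF4_LEVELS = [-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
              0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0]


def quantize_block(block):
    """Return (4-bit codes, absmax constant c1) for one block of weights."""
    absmax = max(abs(w) for w in block) or 1.0  # avoid dividing by zero
    codes = [min(range(16), key=lambda i: abs(NF4_LEVELS[i] - w / absmax))
             for w in block]
    return codes, absmax


def dequantize_block(codes, absmax):
    """Look each code up in the codebook and rescale by c1."""
    return [NF4_LEVELS[c] * absmax for c in codes]


def mean_abs_error(block):
    codes, absmax = quantize_block(block)
    restored = dequantize_block(codes, absmax)
    return sum(abs(a - b) for a, b in zip(block, restored)) / len(block)


rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(128)]
weights[0] = 1.0  # a single outlier, confined to the first block of 64

error_with_outlier = mean_abs_error(weights[:64])
error_without_outlier = mean_abs_error(weights[64:])
```

The outlier stretches the first block's scale so that most of its weights collapse onto the levels nearest zero, while the second block keeps its full resolution.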
Block-wise quantization introduces a second-order overhead: the scaling constants c1 themselves must be stored, and at one 32-bit constant per 64-weight block, this overhead is 32 / 64 = 0.5 bits per parameter on top of the 4-bit weights. For a 65B model this is about 4 GB of additional storage, a non-trivial fraction of the budget on a 48 GB card.
Double quantization (DQ) addresses this by quantizing the constants themselves[1]. The first-level constants c1 are quantized to 8-bit floating point (E4M3-style) with a second-level scaling constant c2, applied across blocks of 256 first-level constants. With DQ, the average overhead drops from about 0.5 bits per parameter to roughly 0.127 bits per parameter, saving an additional 0.37 bits per parameter on average, roughly 3 GB on a 65B model. The combined storage is approximately 4.127 effective bits per weight. The paper reports that double quantization has no measurable effect on downstream accuracy[1].
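The overhead figures above follow from short arithmetic, sketched below. The function names are illustrative; the block sizes and bit widths are the ones stated in the text.

```python
def overhead_bits(block_size=64, const_bits=32):
    """Per-parameter overhead of one FP32 constant per block."""
    return const_bits / block_size


def overhead_bits_double_quant(block_size=64, first_bits=8,
                               second_block=256, second_bits=32):
    """Overhead with 8-bit first-level constants plus one FP32
    second-level constant per `second_block` first-level constants."""
    return first_bits / block_size + second_bits / (block_size * second_block)


plain = overhead_bits()                    # 32 / 64
dq = overhead_bits_double_quant()          # 8 / 64 + 32 / (64 * 256)
params = 65e9

saved_gb = (plain - dq) * params / 8 / 1e9     # storage saved by DQ
base_gb = (4 + dq) * params / 8 / 1e9          # 4-bit weights plus constants

print(f"overhead: {plain} -> {dq:.3f} bits/param")
print(f"saved on 65B: {saved_gb:.2f} GB, base storage: {base_gb:.1f} GB")
```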
The full dequantization expression is
W_BF16 = doubleDequant(c2_FP32, c1_FP8, W_NF4)
= dequant( dequant(c2_FP32, c1_FP8), W_NF4 )
where the inner dequant reconstructs the first-level constants c1 from their FP8 codes and the second-level FP32 constants c2, and the outer dequant maps each 4-bit code through the NF4 codebook and scales it by the recovered c1 of its block. The resulting weights are held in BF16 only for the matrix multiplication and discarded afterward.
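The second quantization level can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the first-level constants are mean-centered and stored as signed 8-bit integer codes, a simpler stand-in for the FP8 format, with one full-precision constant c2 per block of 256. All function names are hypothetical.

```python
def quantize_constants(c1, block=256):
    """Quantize first-level constants c1 to 8-bit codes (int8 stand-in for FP8)."""
    mean = sum(c1) / len(c1)
    centered = [c - mean for c in c1]  # absmax constants are positive, so center them
    packed = []
    for start in range(0, len(centered), block):
        chunk = centered[start:start + block]
        c2 = max(abs(v) for v in chunk) or 1.0      # second-level constant
        codes = [round(v / c2 * 127) for v in chunk]
        packed.append((codes, c2))
    return packed, mean


def dequantize_constants(packed, mean):
    """Inner dequant: recover approximate c1 values from codes and c2."""
    recovered = []
    for codes, c2 in packed:
        recovered.extend(code / 127 * c2 + mean for code in codes)
    return recovered


# 512 synthetic first-level constants, so two second-level blocks of 256.
c1 = [0.04 + 0.0001 * i for i in range(512)]
packed, mean = quantize_constants(c1)
recovered = dequantize_constants(packed, mean)
max_error = max(abs(a - b) for a, b in zip(c1, recovered))
```

The recovered constants then play the role of c1 in the outer dequantization step, where they rescale the codebook values of each 64-weight block.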
The third contribution addresses a different memory bottleneck. During training with gradient checkpointing, recomputed activations and optimizer states can produce sudden memory spikes that drive the GPU into out-of-memory failure even when average usage is below the device limit. These spikes are particularly common at long sequence lengths where the recomputed attention activations briefly dominate the budget.
QLoRA introduces paged optimizers, which use NVIDIA unified memory to allow optimizer state to spill from GPU memory to CPU memory automatically. The CUDA unified memory subsystem moves pages of optimizer state between host and device transparently when GPU memory pressure rises, then pages them back when needed, a mechanism conceptually similar to the way paged attention later managed key-value cache memory at inference time. The implementation in bitsandbytes provides paged variants of AdamW and other optimizers in the form of PagedAdamW8bit, PagedAdamW32bit, PagedLion, and similar wrappers. In practice this prevents OOM failures during occasional spikes without significantly slowing training, because the spikes are short and pages are returned to the GPU before the next training step needs them. The paper notes that paging events are rare enough that wall-clock impact is hard to measure under typical workloads[1].
The QLoRA recipe is, at a high level, a five-step transformation applied to the standard LoRA training loop.

1. Quantize the base model. Each weight matrix is split into blocks of 64 weights, the maximum absolute value of each block is stored as c1, and the weights are normalized into [-1, 1] and mapped to the nearest NF4 quantile. The c1 constants are then themselves quantized via double quantization into 8-bit floating point with a second-level FP32 constant c2 per block of 256 c1 values. The resulting state (W_NF4, c1_FP8, c2_FP32) is roughly 4.127 bits per weight on average and replaces the original 16-bit weights in GPU memory. The full-precision weights can be discarded[1].
2. Attach the adapters. For each target linear layer, LoRA matrices A (shape r × d_in) and B (shape d_out × r) are added with rank r typically between 8 and 64. A is initialized with Kaiming uniform and B is initialized to zero, so the initial adapter contribution is exactly zero[2]. Both A and B are stored in bfloat16 in the QLoRA reference recipe, although fp16 is also supported on older hardware.
3. Run the forward pass. Each layer dequantizes its weights to BF16, computes Y = W_BF16 X + s * B (A X) where s = alpha / r, and immediately frees the dequantized weights. Peak memory therefore reflects the 4-bit storage cost rather than the full 16-bit footprint, except for a small transient during each layer's matmul[1].
4. Run the backward pass. Backpropagation treats W_NF4 as a constant. Gradients flow from the loss back through the adapter sum B A X, producing gradients only for A and B. These gradients are accumulated in bf16, which shares FP32's exponent range and so normally needs no loss scaling; loss scaling becomes relevant only when fp16 is used as the compute dtype.
5. Update the adapters. Paged AdamW updates A and B only. Because the paged variant can spill to CPU memory under pressure, the recipe survives the activation memory spikes that gradient checkpointing produces at long sequence lengths.

After training, the only changed state is the small set of LoRA matrices, typically a few hundred megabytes to a few gigabytes for a 65B base model, depending on rank and target modules. The frozen 4-bit base is unchanged and can be reused as the starting point for further fine-tunes.
Let W be a pretrained linear layer weight matrix that QLoRA freezes and stores in NF4. Let c1 be the block-wise NF4 scaling constants stored in 8-bit floating point, and c2 be the second-level FP32 constants used to dequantize c1. The doubly-dequantized weight is
W_BF16 = doubleDequant(c2_FP32, c1_FP8, W_NF4)
where doubleDequant first uses c2 to recover the per-block constants c1, then uses c1 to recover the full-precision weights. The LoRA adapters A and B are stored in bf16. For an input activation X, the QLoRA forward pass is
Y = doubleDequant(c2_FP32, c1_FP8, W_NF4) @ X + s * B @ A @ X
where @ denotes matrix multiplication and s = alpha / r is the LoRA scaling factor. Writing W_BF16 for the dequantized weight, this is
Y = W_BF16 @ X + s * B @ A @ X
During the backward pass, the gradient with respect to the LoRA adapters is
dL/dA = s * B^T @ dL/dY @ X^T
dL/dB = s * dL/dY @ (A @ X)^T
No gradient is computed for W_NF4, c1, or c2. The optimizer therefore only updates A and B, both stored in bf16, with optimizer moments stored in 32-bit using paged AdamW.
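The two gradient expressions can be checked numerically. The sketch below uses plain Python lists, a toy loss L = 0.5 * sum(Y^2) (so that dL/dY = Y), and central finite differences; the dimensions, the loss, and every helper name are assumptions made for illustration only.

```python
import random


def matmul(P, Q):
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]


def transpose(M):
    return [list(col) for col in zip(*M)]


def scaled(M, s):
    return [[s * v for v in row] for row in M]


def added(P, Q):
    return [[a + b for a, b in zip(rp, rq)] for rp, rq in zip(P, Q)]


def forward(W, A, B, X, s):
    """Y = W X + s * B (A X), with W treated as a frozen constant."""
    return added(matmul(W, X), scaled(matmul(B, matmul(A, X)), s))


def loss(Y):
    return 0.5 * sum(v * v for row in Y for v in row)


def lora_grads(A, B, X, dY, s):
    """dL/dA = s * B^T dY X^T and dL/dB = s * dY (A X)^T."""
    dA = scaled(matmul(matmul(transpose(B), dY), transpose(X)), s)
    dB = scaled(matmul(dY, transpose(matmul(A, X))), s)
    return dA, dB


def numeric_grad(f, M, i, j, eps=1e-6):
    """Central finite difference of f with respect to M[i][j]."""
    original = M[i][j]
    M[i][j] = original + eps
    up = f()
    M[i][j] = original - eps
    down = f()
    M[i][j] = original
    return (up - down) / (2 * eps)


rng = random.Random(0)


def rand(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


d_in, d_out, r, n, s = 5, 4, 2, 3, 2.0
W, A, B, X = rand(d_out, d_in), rand(r, d_in), rand(d_out, r), rand(d_in, n)

Y = forward(W, A, B, X, s)
dA, dB = lora_grads(A, B, X, Y, s)  # dL/dY equals Y for this toy loss


def current_loss():
    return loss(forward(W, A, B, X, s))


num_dA = numeric_grad(current_loss, A, 0, 1)
num_dB = numeric_grad(current_loss, B, 1, 0)
```

No gradient is formed for W at any point, which mirrors the frozen base weights in QLoRA.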
A simplified pseudocode view of one training step:
for X, target in dataloader:
    # forward: dequantize the frozen NF4 weights to bf16 for this matmul only
    W_bf16 = double_dequantize(c2_fp32, c1_fp8, W_nf4)
    Y = W_bf16 @ X + (alpha / r) * B @ (A @ X)
    loss = loss_fn(Y, target)
    # backward: only A and B receive gradients; W_nf4, c1, and c2 stay frozen
    grads = backward(loss, params=[A, B])
    # paged AdamW update of the adapter matrices
    paged_adamw_step([A, B], grads)
The dequantized weights W_bf16 are kept only for the duration of the matrix multiplication and immediately discarded, so peak memory still reflects the 4-bit storage cost rather than the full 16-bit footprint.
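QLoRA mixes four numeric formats during training: NF4, FP8, BF16, and FP32. The table below shows where each one appears and how many bits it occupies, alongside three related formats (FP16, Int8, and FP4) that serve as alternatives or separate modes.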
| Format | Bits | Range/Levels | Role in QLoRA |
|---|---|---|---|
| FP32 | 32 | ~1.4e-45 to ~3.4e38 | Second-level scaling constants c2; AdamW optimizer moments by default |
| BF16 | 16 | ~1e-38 to ~3e38, 8-bit mantissa | Compute dtype for the dequantized forward pass; LoRA adapters A, B |
| FP16 | 16 | ~6e-5 to ~6e4, 11-bit mantissa | Optional alternative compute dtype, narrower range than BF16 |
| FP8 (E4M3) | 8 | ~2^-9 to 448 | First-level scaling constants c1 after double quantization |
| Int8 | 8 | -128 to 127 | LLM.int8() base weights when used (a separate mode) |
| NF4 | 4 | 16 quantile levels in [-1, 1] | Frozen base model weights |
| FP4 (E2M1) | 4 | 16 floating-point values | Alternative 4-bit format, supported but not preferred |
The QLoRA paper and the reference implementation provide a recipe that has become the default starting point for community fine-tunes[1][3]. The settings below correspond to the Guanaco-65B model.
- Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, and down_proj for Llama-family architectures. The paper found that targeting all linear layers is more important than choosing a particular rank.
- Rank r: 64 in the Guanaco models. Smaller values such as 8 or 16 are common for smaller datasets.
- Alpha: 16, which gives a scaling factor alpha / r = 16 / 64 = 0.25.

Later work has experimented with higher ranks (256 or 512), different alpha conventions (alpha equal to rank, or twice the rank), and other quantization data types, but the original recipe remains a strong baseline. Sebastian Raschka's reproductions on a 7B Llama base reported that rank 256 with alpha 512 gave the best downstream quality in his runs, and, as a separate observation at low rank, that extending the adapters from a subset of projections to all linear layers raised the trainable parameter count from about 4 million to about 20 million while adding only roughly 2.4 GB of memory[17].
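The trainable-parameter figures quoted above (about 4 million and about 20 million) can be reproduced with rank 8 under assumed Llama-7B shapes: hidden size 4096, feed-forward size 11008, 32 layers, and a 32,000-token vocabulary. Those shapes, the choice of two attention projections for the smaller count, and the inclusion of the output head in the larger count are assumptions of this sketch, as is the helper API.

```python
# Assumed Llama-7B shapes (not stated in the article).
HIDDEN, FFN, LAYERS, VOCAB = 4096, 11008, 32, 32000


def lora_params(d_in, d_out, r):
    """Parameters in one adapter pair: A is (r, d_in), B is (d_out, r)."""
    return r * (d_in + d_out)


def trainable_params(r, attn_proj, mlp=False, lm_head=False):
    """Total LoRA parameters for a given rank and set of target modules."""
    per_layer = attn_proj * lora_params(HIDDEN, HIDDEN, r)
    if mlp:
        per_layer += 3 * lora_params(HIDDEN, FFN, r)  # gate, up, down
    total = LAYERS * per_layer
    if lm_head:
        total += lora_params(HIDDEN, VOCAB, r)
    return total


two_proj = trainable_params(r=8, attn_proj=2)
all_linear = trainable_params(r=8, attn_proj=4, mlp=True, lm_head=True)
guanaco_style = trainable_params(r=64, attn_proj=4, mlp=True)

print(f"{two_proj:,} -> {all_linear:,} trainable parameters at rank 8")
print(f"{guanaco_style:,} at rank 64 on the seven projections listed above")
```

Because the count grows linearly in r, moving from rank 8 to rank 64 on the same target modules multiplies the adapter size by eight.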
The memory figures below, which combine numbers reported in the QLoRA paper with rough per-parameter estimates, illustrate the gap between 16-bit and 4-bit base storage[1]:
| Model size | Full fine-tune (BF16 + AdamW FP32) | LoRA on FP16 base | QLoRA (NF4 base) |
|---|---|---|---|
| 7B | ~112 GB | ~28 GB | ~6 GB |
| 13B | ~208 GB | ~52 GB | ~10 GB |
| 33B | ~528 GB | ~132 GB | ~21 GB |
| 65B | >780 GB | ~260 GB | ~41 GB |
The full fine-tune column counts weights, gradients, and AdamW state in 32-bit. The LoRA column counts a frozen FP16 base plus FP16 adapters and 32-bit AdamW state for the adapters. The QLoRA column counts a 4-bit base, bf16 adapters, paged AdamW state, and overhead for activations and gradient checkpointing. Moving from a 16-bit base to NF4 with double quantization shrinks base model storage by a factor of roughly 4 (16 bits down to about 4.127 bits per weight), which lets a 65B model fit on a single 48 GB workstation card such as an NVIDIA RTX A6000, with additional headroom on 80 GB accelerators such as the A100 or H100.
A Hugging Face study comparing QLoRA against alternative methods on consumer hardware found that NF4 with double quantization and bf16 compute can fit a 13B Llama on a 16 GB T4 GPU at sequence length 1024 with gradient checkpointing enabled, a configuration that fails with plain LoRA on the same card[4][8]. On the multi-GPU side, Answer.AI demonstrated in March 2024 that combining QLoRA with Fully Sharded Data Parallel (FSDP) lets two RTX 3090 or 4090 cards fine-tune a 70B model, dropping the hardware floor for that scale to roughly two thousand US dollars in consumer parts[9].
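A back-of-the-envelope check on the 65B number illustrates the construction. A 65B Llama base in NF4 with double quantization is about 65e9 * 4.127 / 8 = 33.5 GB. LoRA adapters with rank 64 across all seven linear projections of an 80-layer model with hidden size 8192 and feed-forward size 22016 hold roughly 0.8 billion parameters, or about 1.6 GB in bf16. Paged AdamW 32-bit state for the adapters stores two FP32 moments per adapter parameter, about 6.4 GB, and is the part that can spill to CPU memory under pressure. Activations with gradient checkpointing at sequence length 512 and batch size 1 are under 5 GB. The total stays within 48 GB, though with less margin than the base-weight figure alone suggests, and the remaining room has to cover the workspace allocations of CUDA kernels and the bf16 dequantization scratch buffer.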
The paper reports a large set of experiments across model sizes, datasets, and benchmarks. The two headline results are quality parity with 16-bit fine-tuning and the Guanaco family.
On the MMLU benchmark, QLoRA fine-tunes of LLaMA 7B through 65B match the accuracy of equivalent 16-bit LoRA fine-tunes within statistical noise. After fine-tuning on FLAN v2, the 5-shot MMLU scores are 44.5 for 7B, 51.4 for 13B, 59.2 for 33B, and 63.9 for 65B, compared to baseline LLaMA scores of 35.1, 46.9, 57.8, and 63.4 respectively[1]. On other benchmarks such as Open Assistant evaluations, QLoRA again matches BF16 LoRA. The authors trained more than 1,000 models across these ablations, an unusually large experimental scale that strengthens the conclusion that 4-bit storage with NF4 and double quantization is essentially lossless for fine-tuning[1].
The second result is the Guanaco chat model family, produced by QLoRA fine-tuning LLaMA on the OASST1 (OpenAssistant Conversations) dataset. Guanaco-65B reached 99.3% of ChatGPT performance on the Vicuna benchmark, an automated evaluation that uses GPT-4 as a judge to score model responses across 80 prompts spanning writing, roleplay, math, coding, and general knowledge[1]. Smaller Guanaco models reached the percentages shown below.
| Model | Vicuna benchmark vs ChatGPT | Notes |
|---|---|---|
| Guanaco 7B | 87.0% | Fits in roughly 5 GB at inference, runs on phones |
| Guanaco 13B | 90.4% | Outperforms Alpaca 13B by ~20 points |
| Guanaco 33B | 97.8% | Trains on a 24 GB consumer GPU in under 12 hours |
| Guanaco 65B | 99.3% | 24 hours of fine-tuning on one 48 GB card |
On the Open Assistant evaluation, Guanaco 65B reached an Elo rating of 1,008, statistically tied with ChatGPT-3.5 Turbo at 1,015[1]. The 65B fine-tune required 24 hours on a single 48 GB GPU. The paper also showed that smaller, higher-quality datasets such as OASST1 (about 9,000 conversations) produce stronger chat models than larger but noisier sources such as FLAN v2 (15 million examples) or Self-Instruct (about 82,000 examples), shifting community focus toward data curation rather than dataset scale.
The "99.3% of ChatGPT" headline figure should be interpreted with care. The Vicuna benchmark evaluates the model with reference to a specific ChatGPT snapshot from early 2023 (the gpt-3.5-turbo checkpoint available at the time of the paper) and uses GPT-4 as the judge with a 1-to-10 score scale; the comparison is therefore a relative ratio against that frozen 2023 baseline rather than against current models. Subsequent academic work has shown that GPT-4-as-judge systematically inflates scores for verbose answers, that the 80-prompt Vicuna set provides limited statistical power, and that modern open-weight models have far surpassed the original Guanaco-65B on harder benchmarks such as MMLU-Pro, GPQA, and LiveBench. The 99.3% claim was accurate as a relative figure at the time of submission and remains the canonical "QLoRA works" headline, but it is not a substitute for evaluation on current benchmarks.
The paper evaluated eight instruction-following datasets, summarized below.
| Dataset | Examples | Source | Notable property |
|---|---|---|---|
| OASST1 | ~9,209 | Crowd-sourced | Best Vicuna scores when small |
| FLAN v2 | ~15M | Task collection | Best for academic benchmarks like MMLU |
| Self-Instruct | ~82,612 | Distilled from GPT | Mid-quality |
| Alpaca | ~51,942 | Distilled from GPT | Standard baseline at the time |
| Chip2 | ~210,289 | Hybrid | Useful for chat |
| HH-RLHF | ~160,800 | Preference-based | Anthropic helpful-harmless |
| Longform | ~23,700 | Hybrid | Open-ended generation |
| Unnatural Instructions | ~240,670 | Distilled | Larger but noisier |
The key finding was that data quality dominates dataset size for chat-oriented evaluation, while broad task coverage from FLAN v2 still helps on academic question-answering benchmarks like MMLU.
The reference implementation lives in the bitsandbytes library, originally written by Tim Dettmers, which provides the 4-bit quantization primitives, the NF4 data type, double quantization, and paged optimizers[3]. The library is built on CUDA kernels and integrates directly with PyTorch. As of 2026 it is maintained by the bitsandbytes-foundation, with sponsorship from Hugging Face and Intel, and supports NVIDIA SM60 and newer, AMD CDNA and RDNA architectures, Intel Data Center GPU Max and Arc series, Intel Gaudi 2 and 3, and ARM64 plus Apple Silicon CPUs[3]. The library exposes a bnb.nn.Linear4bit module that can be substituted for torch.nn.Linear to apply 4-bit storage transparently, plus bnb.optim.PagedAdamW8bit, PagedAdamW32bit, and PagedLion paged optimizer classes.
The 4-bit path in bitsandbytes follows a dequantize-then-matmul scheme: the NF4 weight blocks are unpacked to the 16-bit compute dtype using the NF4 lookup table and the double-dequantized scaling constants, and a standard GEMM then computes the product. The dequantized copy is a temporary buffer that is released after the layer's matmul, so it never accumulates across layers; for single-token inference the library can instead use a fused kernel that dequantizes inside the matrix-vector product. For training workloads, where the GEMM dominates, the latency overhead relative to a pure bf16 matmul is modest.
In the Hugging Face ecosystem, QLoRA is exposed through three components. Transformers accepts a BitsAndBytesConfig at model load time[8]:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 storage with double quantization; matmuls run in bf16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Quantization is applied while the checkpoint is loaded
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
)
The PEFT library wraps this quantized model with LoRA adapters via LoraConfig and get_peft_model, after preparing the model for k-bit training so that LayerNorm and embedding parameters are upcast to FP32 for stability[4]:
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Upcast the non-quantized layers and enable gradient checkpointing
model = prepare_model_for_kbit_training(model)

# Rank-64 adapters on every attention and MLP projection (alpha / r = 0.25)
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
Recent versions of PEFT also accept a shorthand, target_modules="all-linear", which automatically targets every linear layer in the architecture. This shorthand is the recommended setting for QLoRA-style training across model families with different attention naming conventions[4].
This pattern is the de facto standard for community QLoRA training and is documented in the Hugging Face PEFT documentation[4]. Higher-level training frameworks build on top of this stack:
| Framework | Maintainer | Distinguishing feature |
|---|---|---|
| Axolotl | OpenAccess AI Collective | YAML-driven configuration wrapping Transformers, PEFT, and bitsandbytes; supports DPO, IPO, KTO, ORPO, GRPO |
| Unsloth | Daniel Han, Michael Han | Custom Triton kernels reporting up to 2x speed and 70% less VRAM on consumer GPUs |
| Hugging Face TRL | Hugging Face | Built-in SFTTrainer and DPOTrainer with QLoRA support |
| LLaMA-Factory | hiyouga | Unified fine-tuning UI for 100+ LLMs and VLMs |
| DeepSpeed | Microsoft | Distributed training that integrates bitsandbytes for ZeRO + QLoRA |
| Apple MLX | Apple | QLoRA-style training on Apple Silicon with native quantization primitives |
| torchtune | Meta PyTorch | Native PyTorch fine-tuning library with 4-bit and 8-bit recipes |
The original reference code is published under the MIT license at the artidoro/qlora GitHub repository. The bitsandbytes library is also MIT-licensed and is one of the most starred quantization libraries on GitHub.
Several configuration mistakes recur in community QLoRA pipelines.

- Setting bnb_4bit_compute_dtype=torch.float32 defeats most of the point of 4-bit training because matrix multiplications run at FP32 speed and the dequantized buffer requires 32-bit storage. The recommended setting is torch.bfloat16 on Ampere or newer hardware, or torch.float16 on older Turing cards that lack BF16 support[8].
- Setting bnb_4bit_use_double_quant=False increases the base footprint by roughly 0.37 bits per parameter (about 3 GB on a 65B model) and is almost always strictly worse, since DQ is essentially free in quality.
- Leaving model.config.use_cache = True during training produces silent activation memory growth and occasional OOM. Setting it to False explicitly before training is the safe choice; Transformers generally disables the cache with a warning when gradient checkpointing is active.
- Skipping prepare_model_for_kbit_training. This helper upcasts LayerNorm and embedding parameters to FP32 and enables gradient checkpointing. Without it, the layers left in 16-bit precision can be numerically unstable, and training may diverge or stall.
- Targeting q_proj and v_proj only, the original LoRA paper convention, leaves a measurable quality gap on instruction tuning compared with targeting all linear layers. The QLoRA paper explicitly recommends all linear projections for parity with full fine-tuning[1].
- Calling merge_and_unload directly on a 4-bit quantized model loses precision because the addition W + B A is performed at the precision of the dequantized base. The correct procedure is to load the base in BF16, attach the trained adapter, merge, save, and then re-quantize with GPTQ, AWQ, or HQQ for deployment.
- Switching optimizers when resuming a run (for example from paged_adamw_32bit to paged_adamw_8bit) silently re-initializes the moment buffers and discards the warmup state. Practitioners should keep the optimizer flag identical between runs.
- Applying 4-bit quantization outside of BitsAndBytesConfig. The Transformers from_pretrained integration handles this correctly when quantization_config is passed at load time.

QLoRA has inspired and is regularly compared to a family of follow-up methods that refine the quantization, the adapter parameterization, or the optimizer:
| Method | Year | Key idea |
|---|---|---|
| LoRA | 2021 | Original low-rank adapter on FP16 frozen base |
| QLoRA | 2023 | 4-bit NF4 base + LoRA in bf16, single-GPU 65B fine-tuning |
| DoRA | 2024 | Decomposes weight into magnitude and direction, applies LoRA to direction only |
| LoftQ | 2023 | Joint adapter and base quantization initialization to minimize Q(W) + B A - W |
| GaLore | 2024 | Projects gradients into low-rank subspace; full fine-tunes 7B on 24 GB |
| QA-LoRA | 2023 | Quantization-aware adapter that quantizes both base and adapter at deployment |
| ReLoRA | 2023 | Iteratively merges and reinitializes LoRA adapters to reach full-rank updates |
| LongLoRA | 2023 | Sparse attention + LoRA to extend context length on a single GPU |
| MoRA | 2024 | Uses a single square matrix with the same parameter count as low-rank LoRA |
| LoRA-FA | 2023 | Freezes the A matrix, training only B, halving activation memory |
| LoRA+ | 2024 | Different learning rates for A and B matrices to improve convergence |
| AQLM | 2024 | Additive quantization to 2-bit per weight |
| HQQ | 2024 | Half-quadratic, calibration-free 4-bit and 2-bit quantization |
| FSDP+QLoRA | 2024 | Answer.AI's combination with FSDP for two-GPU 70B training |
- DoRA is available in PEFT through the use_dora=True flag in LoraConfig.
- LoftQ initializes the adapters so that Q(W) + B A is closer to the original W at the start of training, improving downstream accuracy on aggressive 2-bit and 3-bit settings. PEFT exposes LoftQ via init_lora_weights="loftq" and a helper replace_lora_weights_loftq for in-place upgrade of an existing LoRA adapter[4].
- MoRA replaces the B A product with a single square matrix of the same parameter count, claiming better convergence on memory-intensive tasks like math reasoning and continual pretraining.
- LoRA-FA freezes the A matrix at its random initialization and trains only B, halving activation memory and removing one of the two adapter matrices from the optimizer state.

The high-level summary in terms of cost-quality trade-off is shown below.
| Method | Trainable params | Base storage | Inference overhead | Quality vs full FT | Typical use |
|---|---|---|---|---|---|
| Full fine-tuning | 100% | FP16 or BF16 | None | Reference | Frontier model post-training |
| LoRA (FP16 base) | 0.01% to 1% | FP16 base + small adapters | None after merge | Within 1% on most tasks | Standard PEFT for mid-size models |
| QLoRA | 0.01% to 1% | NF4 base + bf16 adapters | 20% to 40% if not merged | Within 1% of LoRA | Single-GPU fine-tuning of 30B-plus models |
| DoRA | 0.01% to 1% | FP16 or NF4 base | Similar to LoRA | Slight gains over LoRA | Quality-sensitive instruction tuning |
| Prefix tuning | 0.01% to 0.1% | FP16 base | Small per-token cost | Below LoRA in many tasks | Lightweight task adaptation |
| Prompt tuning | 0.001% to 0.01% | FP16 base | Negligible | Below LoRA at small scale | Soft-prompt adaptation |
| Adapter modules | 0.1% to 1% | FP16 base + adapters | Small per-layer cost | Comparable to LoRA | Multi-task adaptation |
| GPTQ only | 0% (no fine-tune) | 4-bit base | None | Inference only | Quantized deployment |
| GaLore | 100% | FP16 or BF16 | None | Comparable to full FT | Memory-efficient full fine-tuning |
| FSDP+QLoRA | 0.01% to 1% | NF4 base sharded | Same as QLoRA | Same as QLoRA | Multi-GPU 70B-plus on consumer cards |
QLoRA is the method of choice when memory constraints make 16-bit storage of the base model infeasible. For models smaller than about 13B, BF16 LoRA is often preferred because it avoids dequantization overhead and is trivially mergeable. For 30B and larger models, QLoRA dominates the cost-quality trade-off.
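
To see where 16-bit storage becomes infeasible, a rough weights-only estimate is enough. The sketch below assumes the block sizes described in the QLoRA paper (one constant per 64 weights, second-level constants per 256 first-level constants) and ignores activations, adapters, and optimizer state, so real memory use will be higher. The helper name and the list of model sizes are illustrative.

```python
def weight_footprint_gb(n_params: float, bits_per_param: float) -> float:
    """Weights-only storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


BF16_BITS = 16.0

# NF4 stores 4 bits per weight plus quantization constants.
# One FP32 constant per 64-weight block adds 32/64 = 0.5 bits per weight.
NF4_BITS = 4.0 + 32 / 64

# Double quantization stores those constants in 8 bits and adds one FP32
# constant per 256 of them: 8/64 + 32/(64*256), about 0.127 bits per weight.
NF4_DQ_BITS = 4.0 + 8 / 64 + 32 / (64 * 256)

for billions in (7, 13, 33, 65):
    n = billions * 1e9
    bf16 = weight_footprint_gb(n, BF16_BITS)
    nf4 = weight_footprint_gb(n, NF4_BITS)
    nf4_dq = weight_footprint_gb(n, NF4_DQ_BITS)
    print(f"{billions:>3}B  bf16={bf16:6.1f} GB  nf4={nf4:5.1f} GB  nf4+dq={nf4_dq:5.1f} GB")
```

Under these assumptions a 65B base needs 130 GB in BF16 but roughly a quarter of that in NF4 with double quantization, which is what brings it within reach of a single 48 GB GPU.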
QLoRA is now the default parameter-efficient fine-tuning method for models larger than about 30 billion parameters and a common choice even at smaller scales. Trends from 2024 to 2026 cluster into four areas.
Frontier-scale fine-tuning on consumer hardware. The Answer.AI FSDP+QLoRA recipe[9] made it possible to fine-tune a 70B Llama 2 or Llama 3 base on two RTX 3090 or 4090 cards at home. Subsequent work extended this to 8B-to-405B sharded fine-tunes on small clusters of consumer GPUs, dropping the entry cost for high-end LLM adaptation from tens of thousands of dollars to a few thousand. Hugging Face's accelerate and PyTorch's native FSDP both ship configuration paths for this combination.
Domain-specific fine-tunes at scale. Most publicly released community fine-tunes through 2024 and 2025, including the Nous Research family (Hermes, DeepHermes), OpenChat, OpenHermes, Dolphin, Zephyr, MythoMax, and the long tail of role-play and uncensored merges, were trained end-to-end via QLoRA on rented A100 or H100 cards. The technique also underpins many of the medical, legal, and finance-specific Llama variants released by domain-focused startups, where the cost of fine-tuning a 70B-class model with QLoRA is in the low thousands of dollars rather than the hundreds of thousands required for full fine-tuning.
Industrial fine-tuning pipelines. Cloud providers such as AWS SageMaker, Google Vertex AI, and Azure Machine Learning added managed QLoRA training paths. Together AI, Anyscale, Replicate, Modal, and Runpod all offer QLoRA-based fine-tuning as a primitive in their developer-facing platforms, often at hourly rates measured in single dollars for 7B and 13B models. Apple's on-device foundation model team adapted QLoRA-style 4-bit storage for the personalization adapters distributed with Apple Intelligence, and Microsoft's Olive optimization toolkit ships a QLoRA configuration as a default recipe for Phi family models.
Method composition. Modern QLoRA workflows routinely combine QLoRA with DoRA (via use_dora=True), LoftQ initialization, DPO / ORPO / GRPO preference optimization in TRL, LongLoRA-style context extension, and FlashAttention 2 or 3. The Unsloth library (Daniel Han, Michael Han) ships hand-written Triton kernels that fuse the NF4 dequantization with the matmul more aggressively than bitsandbytes, reporting up to 2x training throughput and 70% memory reduction on consumer GPUs. As of 2026, Unsloth is the most popular front-end for QLoRA fine-tuning on a single GPU.
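
A minimal sketch of one such composition, QLoRA with DoRA, is shown below. It assumes recent versions of transformers, peft, and bitsandbytes, and a GPU with BF16 support. The function name, the rank, alpha, and dropout values are illustrative choices rather than recommendations from the sources cited here.

```python
def build_qlora_dora_model(model_id: str):
    """Load a 4-bit NF4 base model and attach a DoRA-style LoRA adapter.

    Heavy dependencies are imported inside the function so that defining
    it does not require a GPU stack to be installed.
    """
    import torch
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    # 4-bit NF4 storage with double quantization and BF16 compute.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,  # assumes Ampere or newer
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
    )

    # Prepares the quantized model for training (see the pitfalls above).
    model = prepare_model_for_kbit_training(model)

    lora_config = LoraConfig(
        r=16,                         # illustrative rank
        lora_alpha=32,                # illustrative scaling
        lora_dropout=0.05,            # illustrative dropout
        target_modules="all-linear",  # adapt every linear projection
        use_dora=True,                # magnitude/direction decomposition
        task_type="CAUSAL_LM",
    )
    return get_peft_model(model, lora_config)
```

The returned model can then be handed to a trainer such as those in TRL. Whether DoRA helps enough to justify its extra compute is a per-task judgment.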
QLoRA is a strong default but has several practical caveats.
Quality variance. Although the paper reports quality parity with BF16 LoRA in aggregate, individual training runs occasionally show small regressions, particularly with smaller LoRA ranks and on tasks sensitive to long-context behavior. The paper explicitly notes that 33B and 65B parity with full 16-bit fine-tuning could not be exhaustively verified due to the cost of the BF16 baseline, and that conclusions at those scales rest on extrapolation from smaller models[1].
Training throughput. QLoRA training is typically 5% to 30% slower than BF16 LoRA at the same configuration because every forward pass dequantizes the base weights to BF16 and discards the result. Individual measurements can exceed that range: Sebastian Raschka reported 1.85 hours for BF16 LoRA versus 2.79 hours for QLoRA on a 7B Llama, described as a 39% increase in training time although the hours as quoted work out to about 51%, alongside a 33% reduction in GPU memory (21.33 GB to 14.18 GB)[17]. Optimized kernels in Unsloth recover most of this gap and in some configurations make QLoRA faster than naive BF16 LoRA[19].
Inference latency. Inference with a QLoRA-trained model is slower than inference with a pure 4-bit quantized model that has no LoRA adapter. The base weights must be dequantized on the fly during the forward pass, the adapter B A X term must be computed in 16-bit, and the two paths added. In practice this costs roughly 20% to 40% additional latency compared to a static 4-bit GPTQ or AWQ deployment. To recover full inference speed, practitioners typically merge the LoRA adapter back into a 16-bit copy of the base model, then re-quantize that merged model with GPTQ or AWQ.
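
The two computation paths can be illustrated with a toy example. The sketch below uses tiny hand-picked matrices in plain Python, treats W as the already dequantized base, and checks that the unmerged output W x + B (A x) matches the merged output (W + B A) x. All values and helper names are made up for illustration.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


# Toy shapes: W is 2 x 3, B is 2 x 1, A is 1 x 3 (rank-1 adapter).
W = [[0.5, -0.25, 0.125], [0.0, 0.75, -0.5]]  # stands in for the dequantized base
B = [[0.1], [-0.2]]
A = [[0.3, 0.1, -0.1]]
x = [1.0, 2.0, 3.0]

# Unmerged inference: base path and adapter path are computed separately
# and then added, which is the extra work paid on every forward pass.
base_out = matvec(W, x)
adapter_out = matvec(B, matvec(A, x))
y_unmerged = [b + a for b, a in zip(base_out, adapter_out)]

# Merged inference: fold B A into the weights once, then do a single matmul.
BA = matmul(B, A)
W_merged = [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, BA)]
y_merged = matvec(W_merged, x)

print("unmerged:", y_unmerged)
print("merged:  ", y_merged)
```

Both paths give the same result up to floating-point rounding, so merging changes only the cost of inference, not the function the model computes.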
Adapter merging. Merging an adapter directly into the 4-bit base is non-trivial. The arithmetic W' = W + B A must be done in higher precision to avoid catastrophic accumulation error, after which the merged weights can be re-quantized. Most workflows therefore keep an unquantized BF16 copy of the base model around for merging and post-training quantization.
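
A toy illustration of why the merge should stay in higher precision: the sketch below uses a uniform 16-level grid with absmax scaling as a stand-in for NF4, and an invented small update in place of B A. It is not how bitsandbytes or PEFT implement merging; it only shows that snapping the merged weights back to a 4-bit grid can introduce error as large as the update itself.

```python
# 16 evenly spaced levels in [-1, 1]; a simplified stand-in for NF4.
LEVELS = [i / 7.5 - 1.0 for i in range(16)]


def quantize_block(values):
    """Absmax-scale a block and snap each value to the nearest grid level."""
    scale = max(abs(v) for v in values) or 1.0
    codes = [
        min(range(16), key=lambda i: abs(v / scale - LEVELS[i]))
        for v in values
    ]
    return codes, scale


def dequantize_block(codes, scale):
    """Map 4-bit codes back to floating-point values."""
    return [LEVELS[c] * scale for c in codes]


# Invented weights and adapter update, for illustration only.
w = [0.80, -0.35, 0.10, -0.62, 0.27, -0.05, 0.44, -0.91]
delta = [0.010, -0.008, 0.006, 0.012, -0.009, 0.007, -0.011, 0.005]

# The adapter was trained against the dequantized base, not the original w.
codes, scale = quantize_block(w)
w_deq = dequantize_block(codes, scale)

# Merge in higher precision: this is exactly the function that was trained.
merged_hp = [a + b for a, b in zip(w_deq, delta)]

# Snap the merged weights straight back onto a 4-bit grid.
codes_4bit, scale_4bit = quantize_block(merged_hp)
merged_4bit = dequantize_block(codes_4bit, scale_4bit)

requant_error = max(abs(a - b) for a, b in zip(merged_hp, merged_4bit))
largest_update = max(abs(d) for d in delta)

print("largest adapter update:     ", largest_update)
print("max error after 4-bit snap: ", requant_error)
```

In this toy case the rounding error exceeds the largest adapter update, which is why the merged weights are usually kept in BF16 and then handed to a dedicated post-training quantizer.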
Hardware support. The bitsandbytes library was originally NVIDIA-only and required CUDA compute capability 7.5 (Turing) or higher for 4-bit support; BF16 compute additionally requires Ampere (8.0) or newer, which is where QLoRA performs best. AMD ROCm support has been added but lags in stability, and Apple Silicon support is provided through MLX or the bitsandbytes Metal backend, the latter labeled slow in the upstream documentation. Intel GPU and Gaudi backends added by 2025 support most QLoRA primitives, though feature coverage is still narrower than on CUDA[3].
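
The compute-capability rule of thumb can be written as a small helper. This is a sketch under the thresholds stated in this article (Turing 7.5 as the minimum, BF16 from Ampere 8.0); the function name is hypothetical, and it returns dtype names as strings so that it runs without PyTorch installed.

```python
def pick_compute_dtype(major: int, minor: int) -> str:
    """Choose a 16-bit compute dtype name from a CUDA compute capability.

    Thresholds follow the article text: BF16 on Ampere (8.0) or newer,
    FP16 on Turing (7.5); anything older is rejected.
    """
    capability = (major, minor)
    if capability >= (8, 0):
        return "bfloat16"
    if capability >= (7, 5):
        return "float16"
    raise ValueError(
        f"compute capability {major}.{minor} is below 7.5 (Turing)"
    )


# In a real script the capability could come from
# torch.cuda.get_device_capability(); here it is hard-coded.
for cap in [(7, 5), (8, 0), (8, 9), (9, 0)]:
    print(cap, "->", pick_compute_dtype(*cap))
```

The returned name would map onto the bnb_4bit_compute_dtype setting, for example torch.bfloat16 or torch.float16.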
Benchmark reliability. The QLoRA paper itself devotes a long limitations section to chatbot benchmarks. The authors found weak agreement between GPT-4 judgments and human raters (Fleiss kappa of 0.25 at the example level) and only moderate agreement among human raters themselves (kappa 0.42), and explicitly state that current chatbot benchmarks are not trustworthy enough to draw fine-grained quality conclusions[1]. The 99.3% Vicuna figure for Guanaco-65B should be read in this context.
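
For readers unfamiliar with the statistic, the sketch below computes Fleiss' kappa from a table of rating counts using the standard formula. The ratings are invented toy data (three raters choosing among "A wins", "tie", "B wins" for six prompts), not data from the QLoRA paper.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for counts[i][j] = raters assigning item i to category j.

    Every item must have the same number of ratings (at least two).
    The statistic is undefined if all ratings fall into a single category.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    # Observed agreement: per item, the fraction of rater pairs that agree.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall share of each category.
    n_categories = len(counts[0])
    totals = [sum(row[j] for row in counts) for j in range(n_categories)]
    p_expected = sum((t / (n_items * n_raters)) ** 2 for t in totals)

    return (p_observed - p_expected) / (1 - p_expected)


# Invented ratings: columns are (A wins, tie, B wins), three raters per prompt.
toy_ratings = [
    [3, 0, 0],
    [2, 1, 0],
    [1, 1, 1],
    [0, 1, 2],
    [0, 0, 3],
    [2, 0, 1],
]
print("kappa on toy data:", round(fleiss_kappa(toy_ratings), 3))
```

A value of 1 means perfect agreement and 0 means agreement no better than chance, which puts the 0.25 and 0.42 figures above in perspective.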
Pretraining from scratch. QLoRA is fundamentally a fine-tuning method. It cannot replace full pretraining because the quantized base must already encode useful representations; quantizing an untrained model to NF4 freezes what is essentially a random initialization, and low-rank adapters alone cannot learn enough on top of it. ReLoRA and GaLore partially close this gap but at higher memory cost.