QLoRA
Last reviewed
May 9, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 6,309 words
QLoRA (Quantized Low-Rank Adaptation) is a parameter-efficient fine-tuning method for large language models introduced in May 2023 by Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer at the University of Washington [1]. The technique combines LoRA, the low-rank adapter method published by Hu et al. in 2021 [2], with aggressive 4-bit quantization of the frozen base model. The result is a training recipe that allows fine-tuning of an LLM with up to 65 billion parameters on a single GPU with 48 GB of memory, while preserving the predictive quality of full 16-bit fine-tuning [1].
QLoRA's central design idea is that the base model's weights can be stored in a heavily compressed 4-bit representation and dequantized on the fly during the forward pass, while the small LoRA adapter matrices that absorb gradient updates remain in higher precision. The paper introduced three new components to make this work without quality loss: a 4-bit data type called NF4 (NormalFloat 4-bit), a double quantization scheme that compresses the quantization constants themselves, and paged optimizer states that use NVIDIA unified memory to handle gradient checkpointing memory spikes [1]. Using QLoRA, the authors trained the Guanaco family of chat models, which reached 99.3% of ChatGPT performance on the Vicuna benchmark after only 24 hours of fine-tuning on a single GPU [1].
QLoRA was presented as an oral paper at NeurIPS 2023. Its open-source implementation in the bitsandbytes library and integration with Hugging Face Transformers and the PEFT library underpin a large fraction of community fine-tunes built on Llama, Mistral, and similar models [3][4]. By the end of 2023, the original arXiv preprint had accumulated thousands of citations and the bitsandbytes library had become a default dependency in nearly every public LLM training stack.
Full fine-tuning of a large language model means computing and storing gradients and optimizer state for every parameter. For a 65-billion-parameter model trained in bfloat16 with the AdamW optimizer in 32-bit precision, the memory budget is dominated by the weights themselves, gradients, and two Adam moment buffers. Adding the activations required for backpropagation, the total comfortably exceeds 780 GB of GPU memory, far beyond what any single accelerator delivers [1]. Training such a model end-to-end therefore requires a multi-GPU cluster (at least ten 80 GB accelerators for the figure above), an option available mainly to well-funded laboratories.
Two earlier lines of work attacked this cost from different angles. The first is parameter-efficient fine-tuning, which freezes most of the model and trains only a small set of additional parameters. LoRA, introduced by Hu et al. in 2021 [2], freezes the pretrained weight matrix W and represents the update as a product of two narrow matrices, delta_W = B * A, where A has shape (r, d_in) and B has shape (d_out, r) with rank r much smaller than the model dimensions. The forward pass becomes Y = W X + B A X. Only A and B receive gradients, which typically reduces the trainable parameter count by two to four orders of magnitude. Other parameter-efficient techniques include adapter modules (Houlsby et al. 2019), prefix tuning, and prompt tuning.
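The parameter saving and the forward pass can be checked on a toy example. The sketch below is illustrative only: the 4096 by 4096 projection with rank 8 and the helper names `matmul`, `add`, and `lora_param_counts` are assumptions chosen for the example, not values from the LoRA paper.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two matrices with the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def lora_param_counts(d_out, d_in, r):
    """Trainable parameters for a full update versus a rank-r LoRA update."""
    full = d_out * d_in
    lora = r * d_in + d_out * r      # A is (r, d_in), B is (d_out, r)
    return full, lora

# Hypothetical layer size: a square 4096 x 4096 projection with rank 8.
full, lora = lora_param_counts(4096, 4096, 8)
reduction = full / lora

# Tiny forward pass with d_out=2, d_in=3, r=1.
W = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]   # frozen pretrained weight
B = [[0.5], [-1.0]]                      # (d_out, r)
A = [[1.0, 0.0, 2.0]]                    # (r, d_in)
X = [[1.0], [2.0], [3.0]]                # one input column

y_adapter = add(matmul(W, X), matmul(B, matmul(A, X)))   # Y = W X + B A X
y_merged = matmul(add(W, matmul(B, A)), X)               # Y = (W + B A) X

print(full, lora, reduction)
print(y_adapter, y_merged)
```

The two outputs agree, which is why a trained adapter can later be folded into the base weight without changing the function the layer computes.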
The second line is post-training quantization of model weights. The LLM.int8() paper by Dettmers et al. in 2022 [5] showed that an 8-bit weight representation can match 16-bit inference quality on transformer LLMs if outlier features are handled with a mixed-precision decomposition. Subsequent work pushed quantization to four bits per weight: GPTQ (Frantar et al. 2022) [6] and AWQ (Lin et al. 2023) [7] both demonstrated 4-bit inference with small accuracy loss. These methods, however, are designed for inference; using a naive 4-bit weight matrix during training breaks gradient flow and degrades quality.
QLoRA's contribution is to combine these two threads in a way that avoids both of their failure modes. The base weights stay frozen, so they never receive gradients and can be stored at four bits without harming optimization. The LoRA adapters stay in 16-bit precision and absorb all gradient updates. The only step where the two precisions meet is the forward pass, where the 4-bit base weights are dequantized to 16-bit on the fly to compute the matrix product W X, then immediately discarded.
QLoRA builds on a sequence of efficiency papers from the same author. The 2022 ICLR paper 8-bit Optimizers via Block-wise Quantization [16] showed that the moment buffers of Adam and other adaptive optimizers can be quantized to 8 bits with block-wise scaling without quality loss. The 2022 NeurIPS paper LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale [5] established that vector-wise quantization combined with mixed-precision handling of outlier features can run transformer matrix multiplications on 8-bit weights without measurable quality loss. QLoRA inherits the block-wise scaling philosophy and extends it from inference to training and from 8 bits to 4 bits.
The first contribution is a new 4-bit data type, the 4-bit NormalFloat (NF4). A 4-bit type can represent at most 16 distinct values, since 2^4 = 16. Standard 4-bit integer quantization spaces these values uniformly across the weight range, which is wasteful because the weights of a pretrained transformer are not uniformly distributed. Empirically, the weights of large pretrained models are well approximated by a zero-mean normal distribution with a small standard deviation, so most weights cluster near zero and a uniform grid wastes capacity on regions that are almost never populated [1].
NF4 instead places its 16 quantization levels at the quantiles of a standard normal distribution. Each level represents an equal probability mass under a unit normal, after which the grid is rescaled by a per-block scaling constant to match the actual weight magnitudes. The paper describes the construction as information-theoretically optimal under the assumption that the weights are normally distributed: each of the 16 bins is expected to receive the same number of weights, so no code is wasted on sparsely populated regions. The exact level positions are computed once and stored as a fixed lookup table; both quantization and dequantization reduce to nearest-neighbor lookup.
Formally, the levels are defined by
q_i = 0.5 * ( Q( i / (2^k + 1) ) + Q( (i+1) / (2^k + 1) ) )
for i = 0, 1, ..., 2^k - 1, where Q is the inverse cumulative distribution function (the quantile function) of a standard normal N(0, 1), and k = 4 for NF4. The 16 resulting values are normalized to the range [-1, 1] and stored as the NF4 codebook.
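A codebook of this kind can be generated with the standard library. Taken literally, the i = 0 term of the formula evaluates Q(0), which is unbounded, so practical constructions start from a probability slightly below 1. The sketch below follows the asymmetric variant described in the paper (eight positive levels, seven negative levels, and an exact zero). The offset value 0.9677083 and the function name `nf4_codebook` are assumptions for illustration, not a verified copy of the reference implementation.

```python
from statistics import NormalDist

def nf4_codebook(offset=0.9677083):
    """Build 16 normal-quantile levels in [-1, 1] with an exact zero (illustrative)."""
    quantile = NormalDist().inv_cdf          # Q in the text: inverse CDF of N(0, 1)

    def linspace(start, stop, n):
        step = (stop - start) / (n - 1)
        return [start + i * step for i in range(n)]

    # Probabilities run from `offset` down toward 0.5; 0.5 itself is dropped
    # because zero is added once, explicitly.
    positive = [quantile(p) for p in linspace(offset, 0.5, 9)[:-1]]    # 8 levels
    negative = [-quantile(p) for p in linspace(offset, 0.5, 8)[:-1]]   # 7 levels
    levels = sorted(negative + [0.0] + positive)

    # Normalize so the outermost levels sit at -1 and +1.
    scale = max(abs(v) for v in levels)
    return [v / scale for v in levels]

codebook = nf4_codebook()
for index, level in enumerate(codebook):
    print(index, round(level, 4))
```

The levels are densest around zero and spread out toward the ends, mirroring where normally distributed weights actually fall.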
In the paper's ablations on LLaMA, OPT, BLOOM, and Pythia models from 125M to 13B parameters, NF4 with double quantization achieves a mean perplexity on Pile Common Crawl of 27.41, compared to 29.48 for FP4 with three exponent bits, 31.07 for FP4 with two exponent bits, and 34.34 for plain Int4 [1]. Across MMLU and other downstream benchmarks, the gap between NF4 and FP4 corresponds to roughly one percentage point of accuracy.
Quantization is performed block-wise with a block size of 64. Within each block of 64 consecutive weights, the maximum absolute value is stored as a scaling constant c1; each weight is divided by c1 so that it falls in [-1, 1] and is then mapped to the nearest NF4 level. Block-wise scaling confines the effect of any single outlier to its own block, the same idea used for optimizer states in the earlier 8-bit optimizers work [16].
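A minimal sketch of block-wise quantization shows how an outlier stays contained. The codebook values are the NF4 levels rounded to four decimals, and the function names and the synthetic weights (including the planted outlier at index 100) are assumptions made for the example.

```python
# NF4 levels rounded to four decimals (approximate, for illustration only).
NF4_LEVELS = [-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
              0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0]

def quantize_blockwise(weights, block_size=64):
    """Return one 4-bit code per weight and one absmax constant per block."""
    codes, scales = [], []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        c1 = max(abs(w) for w in block) or 1.0     # avoid dividing by zero
        scales.append(c1)
        for w in block:
            x = w / c1                              # normalized into [-1, 1]
            codes.append(min(range(16), key=lambda i: abs(NF4_LEVELS[i] - x)))
    return codes, scales

def dequantize_blockwise(codes, scales, block_size=64):
    """Look up each code and rescale by the constant of its block."""
    return [NF4_LEVELS[c] * scales[i // block_size] for i, c in enumerate(codes)]

# Two blocks of small synthetic weights; the second block contains an outlier.
weights = [((i % 7) - 3) / 100 for i in range(128)]
weights[100] = 2.0

codes, scales = quantize_blockwise(weights)
restored = dequantize_blockwise(codes, scales)
errors = [abs(w - r) for w, r in zip(weights, restored)]
err_block0 = max(errors[:64])
err_block1 = max(errors[64:])
print(scales, err_block0, err_block1)
```

The outlier inflates the scaling constant of the second block and degrades only the weights that share it; the first block keeps its fine-grained resolution.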
Block-wise quantization introduces a second-order overhead: the scaling constants c1 themselves must be stored, and at one 32-bit constant per 64-weight block, this overhead is 32 / 64 = 0.5 bits per parameter on top of the 4-bit weights. For a 65B model this is about 4 GB of additional storage, a non-trivial fraction of the budget on a 48 GB card.
Double quantization (DQ) addresses this by quantizing the constants themselves [1]. The first-level constants c1 are quantized to 8-bit floating point with a second-level scaling constant c2, applied across blocks of 256 first-level constants. With DQ, the average overhead drops from about 0.5 bits per parameter to roughly 0.127 bits per parameter, saving an additional 0.37 bits per parameter on average. For a 65B model this is roughly 3 GB. The combined storage is approximately 4.127 effective bits per weight. The paper reports that double quantization has no measurable effect on downstream accuracy [1].
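The overhead figures follow from short arithmetic, reproduced below as a sketch. The function name `constant_overhead_bits` and its parameters are illustrative; the block sizes and bit widths are the ones stated in the text.

```python
def constant_overhead_bits(block_size=64, c1_bits=32, double_quant=False,
                           c2_block=256, c2_bits=32, c1_quant_bits=8):
    """Storage overhead of the scaling constants, in bits per model parameter."""
    if not double_quant:
        # One full-precision constant per block of weights.
        return c1_bits / block_size
    # One 8-bit constant per block, plus one FP32 constant per 256 blocks.
    return c1_quant_bits / block_size + c2_bits / (block_size * c2_block)

plain = constant_overhead_bits()
dq = constant_overhead_bits(double_quant=True)
saved_bits = plain - dq

# Savings for a 65-billion-parameter model, converted from bits to gigabytes.
saved_gb_65b = saved_bits * 65e9 / 8 / 1e9

print(plain, dq, saved_bits, round(saved_gb_65b, 2))
```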
The full dequantization expression is
W_BF16 = doubleDequant(c2_FP32, c1_FP8, W_NF4)
= dequant( dequant(c2_FP32, c1_FP8), W_NF4 )
where the inner dequant uses the second-level FP32 constants c2 to recover the first-level constants c1 from their FP8 codes, and the outer dequant multiplies the NF4 codebook values by those recovered constants to reconstruct the weights. The dequantized weights are kept in BF16 only for the matrix multiplication and discarded afterward.
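The inner step, recovering the first-level constants, can be sketched as follows. This is a simplification under stated assumptions: it uses uniform signed 8-bit codes as a stand-in for the FP8 format, centers the constants by subtracting their mean before quantizing (as the paper describes for the positive-valued constants), and the function names and the synthetic constants are hypothetical.

```python
def quantize_constants(c1, group_size=256, levels=127):
    """Quantize block constants to signed 8-bit codes, one FP32 scale (c2) per group."""
    mean = sum(c1) / len(c1)                 # constants are positive, so center them
    centered = [c - mean for c in c1]
    codes, c2 = [], []
    for start in range(0, len(centered), group_size):
        group = centered[start:start + group_size]
        scale = max(abs(v) for v in group) or 1.0
        c2.append(scale)
        codes.extend(round(v / scale * levels) for v in group)
    return codes, c2, mean

def dequantize_constants(codes, c2, mean, group_size=256, levels=127):
    """Inner dequant: recover approximate first-level constants from codes and c2."""
    return [code / levels * c2[i // group_size] + mean
            for i, code in enumerate(codes)]

# 512 synthetic first-level constants, i.e. two groups of 256.
c1 = [0.02 + 0.0001 * i for i in range(512)]
codes, c2, mean = quantize_constants(c1)
c1_restored = dequantize_constants(codes, c2, mean)
max_error = max(abs(a - b) for a, b in zip(c1, c1_restored))
print(len(c2), max_error)
```

The recovered constants then play the role of c1 in the outer dequant, where each one rescales the NF4 codebook values of its 64-weight block.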
The third contribution addresses a different memory bottleneck. During training with gradient checkpointing, recomputed activations and optimizer states can produce sudden memory spikes that drive the GPU into out-of-memory failure even when average usage is below the device limit. These spikes are particularly common at long sequence lengths where the recomputed attention activations briefly dominate the budget.
QLoRA introduces paged optimizers, which use NVIDIA unified memory to allow optimizer state to spill from GPU memory to CPU memory automatically. The CUDA unified memory subsystem moves pages of optimizer state between host and device transparently when GPU memory pressure rises, then pages them back when needed. The implementation in bitsandbytes provides paged variants of AdamW and other optimizers in the form of PagedAdamW8bit, PagedAdamW32bit, PagedLion, and similar wrappers. In practice this prevents OOM failures during occasional spikes without significantly slowing training, because the spikes are short and pages are returned to the GPU before the next training step needs them. The paper notes that paging events are rare enough that wall-clock impact is hard to measure under typical workloads [1].
Let W be a pretrained linear layer weight matrix that QLoRA freezes and stores in NF4. Let c1 be the block-wise NF4 scaling constants stored in 8-bit floating point, and c2 be the second-level FP32 constants used to dequantize c1. The doubly-dequantized weight is
W_BF16 = doubleDequant(c2_FP32, c1_FP8, W_NF4)
where doubleDequant first uses c2 to recover the per-block constants c1, then uses c1 to recover the full-precision weights. The LoRA adapters A and B are stored in bf16. For an input activation X, the QLoRA forward pass is
Y = doubleDequant(c2_FP32, c1_FP8, W_NF4) @ X + B @ A @ X
where @ denotes matrix multiplication. Writing W_BF16 for the dequantized weight and making the LoRA scaling factor s = alpha / r explicit, the same forward pass reads
Y = W_BF16 @ X + s * B @ A @ X
During the backward pass, the gradient with respect to the LoRA adapters is
dL/dA = s * B^T @ dL/dY @ X^T
dL/dB = s * dL/dY @ (A @ X)^T
No gradient is computed for W_NF4, c1, or c2. The optimizer therefore only updates A and B, both stored in bf16, with optimizer moments stored in 32-bit using paged AdamW.
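The two gradient expressions can be sanity-checked against finite differences. The sketch below is a toy check, not part of the paper: it assumes a squared loss L = 0.5 * sum(Y^2), for which dL/dY equals Y, and all matrices, shapes, and helper names are made up for the example.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def scaled(a, s):
    return [[s * x for x in row] for row in a]

def forward(W, A, B, X, s):
    """Y = W X + s * B A X, with the base weight W treated as a constant."""
    base = matmul(W, X)
    update = scaled(matmul(B, matmul(A, X)), s)
    return [[b + u for b, u in zip(rb, ru)] for rb, ru in zip(base, update)]

def loss(Y):
    return 0.5 * sum(y * y for row in Y for y in row)

def analytic_grads(W, A, B, X, s):
    dY = forward(W, A, B, X, s)              # for the squared loss, dL/dY equals Y
    dA = scaled(matmul(matmul(transpose(B), dY), transpose(X)), s)
    dB = scaled(matmul(dY, transpose(matmul(A, X))), s)
    return dA, dB

def numeric_grad(param, loss_fn, eps=1e-6):
    """Central finite differences, perturbing one entry of `param` at a time."""
    grad = [[0.0] * len(row) for row in param]
    for i, row in enumerate(param):
        for j, old in enumerate(row):
            row[j] = old + eps
            up = loss_fn()
            row[j] = old - eps
            down = loss_fn()
            row[j] = old
            grad[i][j] = (up - down) / (2 * eps)
    return grad

W = [[0.2, -0.1, 0.4], [0.3, 0.5, -0.2]]     # frozen base weight, (d_out, d_in)
A = [[0.1, -0.3, 0.2]]                       # (r, d_in) with r = 1
B = [[0.5], [-0.4]]                          # (d_out, r)
X = [[1.0, 0.5], [-1.0, 2.0], [0.5, 1.5]]    # two input columns
s = 16 / 64                                  # alpha / r

dA, dB = analytic_grads(W, A, B, X, s)
current_loss = lambda: loss(forward(W, A, B, X, s))
dA_num = numeric_grad(A, current_loss)
dB_num = numeric_grad(B, current_loss)

flat = lambda m: [v for row in m for v in row]
max_err = max(abs(a - n) for a, n in zip(flat(dA) + flat(dB), flat(dA_num) + flat(dB_num)))
print(max_err)
```

W appears only as a constant in the forward computation, which matches the statement that the quantized base never needs a gradient.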
A simplified pseudocode view of one training step:
```
for batch in dataloader:
    X, target = batch
    # forward: dequantize the frozen 4-bit base weights on the fly
    W_bf16 = double_dequantize(c2_fp32, c1_fp8, W_nf4)
    Y = W_bf16 @ X + (alpha / r) * B @ A @ X
    loss = loss_fn(Y, target)
    # backward: only A and B receive gradients
    grads = backward(loss, params=[A, B])
    # paged AdamW update of the adapters
    paged_adamw_step(A, B, grads)
```
The dequantized weights W_bf16 are kept only for the duration of the matrix multiplication and immediately discarded, so peak memory still reflects the 4-bit storage cost rather than the full 16-bit footprint.
QLoRA mixes four numeric formats during training: FP32, BF16, FP8, and NF4. The table below shows where each one appears and how many bits it occupies, alongside three related formats (FP16, Int8, and FP4) for comparison.
| Format | Bits | Range/Levels | Role in QLoRA |
|---|---|---|---|
| FP32 | 32 | ~1.4e-45 to ~3.4e38 | Second-level scaling constants c2; AdamW optimizer moments by default |
| BF16 | 16 | ~1e-38 to ~3e38, 8-bit mantissa | Compute dtype for the dequantized forward pass; LoRA adapters A, B |
| FP16 | 16 | ~6e-5 to ~6e4, 11-bit mantissa | Optional alternative compute dtype, narrower range than BF16 |
| FP8 (E4M3) | 8 | ~2^-9 to 448 | First-level scaling constants c1 after double quantization |
| Int8 | 8 | -128 to 127 | LLM.int8() base weights when used (a separate mode) |
| NF4 | 4 | 16 quantile levels in [-1, 1] | Frozen base model weights |
| FP4 (E2M1) | 4 | 16 floating-point values | Alternative 4-bit format, supported but not preferred |
The QLoRA paper and the reference implementation provide a recipe that has become the default starting point for community fine-tunes [1][3]. The settings below correspond to the Guanaco-65B model.
- Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, and down_proj for Llama-family architectures. The paper found that targeting all linear layers is more important than choosing a particular rank.
- Rank r: 64 in the Guanaco models. Smaller values such as 8 or 16 are common for smaller datasets.
- Scaling factor: alpha / r = 16 / 64 = 0.25.

Later work has experimented with higher ranks (256 or 512), different alpha conventions (alpha equal to rank, or twice the rank), and other quantization data types, but the original recipe remains a strong baseline. Sebastian Raschka's reproductions on a 7B Llama base reported that targeting all linear layers and using rank 256 with alpha 512 gave the best downstream quality, raising the trainable parameter count from about 4 million to about 20 million while adding only roughly 2.4 GB of memory [17].
The headline memory numbers from the QLoRA paper illustrate the gap between 16-bit and 4-bit base storage [1]:
| Model size | Full fine-tune (BF16 + AdamW FP32) | LoRA on FP16 base | QLoRA (NF4 base) |
|---|---|---|---|
| 7B | ~112 GB | ~28 GB | ~6 GB |
| 13B | ~208 GB | ~52 GB | ~10 GB |
| 33B | ~528 GB | ~132 GB | ~21 GB |
| 65B | >780 GB | ~260 GB | ~41 GB |
The full fine-tune column counts weights, gradients, and AdamW state in 32-bit. The LoRA column counts a frozen FP16 base plus FP16 adapters and 32-bit AdamW state for the adapters. The QLoRA column counts a 4-bit base, bf16 adapters, paged AdamW state, and overhead for activations and gradient checkpointing. On base-weight storage alone, moving from 16 bits to about 4.127 bits per weight is a reduction of roughly 3.9 times; the table totals, which include more than the base weights, differ by roughly 5 to 6 times. This is what lets a 65B model fit on a single 48 GB workstation card such as an NVIDIA RTX A6000, and with more headroom on an 80 GB data-center card such as an A100 or H100.
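The base-weight share of these totals can be estimated directly from the bit widths. The sketch below counts only the frozen base weights, so its numbers are smaller than the table entries, which also include adapters, optimizer state, and activations. The function name is illustrative, and 4.127 bits is the effective NF4 width with double quantization.

```python
def base_weight_gb(params_billion, bits_per_weight):
    """Storage for the base weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

sizes = [7, 13, 33, 65]
storage = {n: (base_weight_gb(n, 16), base_weight_gb(n, 4.127)) for n in sizes}

for n, (gb_16bit, gb_nf4) in storage.items():
    print(f"{n}B: 16-bit base {gb_16bit:.1f} GB, NF4 + DQ base {gb_nf4:.1f} GB")
```

For the 65B model the NF4 base alone comes to roughly 33.5 GB, which leaves the remainder of a 48 GB card for adapters, optimizer state, and activations.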
A Hugging Face study comparing QLoRA against alternative methods on consumer hardware found that NF4 with double quantization and 16-bit compute can fit a 13B Llama on a 16 GB T4 GPU at sequence length 1024 with gradient checkpointing enabled, a configuration that fails with plain LoRA on the same card [4][8]. The T4 is a Turing card without native bf16 support, so the compute dtype there is fp16. At larger scale, Answer.AI demonstrated in March 2024 that combining QLoRA with Fully Sharded Data Parallel (FSDP) lets two RTX 3090 or 4090 cards train a 70B model end-to-end, dropping the hardware floor for that scale to roughly two thousand US dollars in consumer parts [9].
The paper reports a large set of experiments across model sizes, datasets, and benchmarks. The two headline results are quality parity with 16-bit fine-tuning and the Guanaco family.
On the MMLU benchmark, QLoRA fine-tunes of LLaMA 7B through 65B match the accuracy of equivalent 16-bit LoRA fine-tunes within statistical noise. After fine-tuning on FLAN v2, the 5-shot MMLU scores are 44.5 for 7B, 51.4 for 13B, 59.2 for 33B, and 63.9 for 65B, compared to baseline LLaMA scores of 35.1, 46.9, 57.8, and 63.4 respectively [1]. On other benchmarks such as Open Assistant evaluations, QLoRA again matches BF16 LoRA. The authors trained more than 1,000 models across these ablations, an unusually large experimental scale that strengthens the conclusion that 4-bit storage with NF4 and double quantization is essentially lossless for fine-tuning [1].
The second result is the Guanaco chat model family, produced by QLoRA fine-tuning LLaMA on the OASST1 (OpenAssistant Conversations) dataset. Guanaco-65B reached 99.3% of ChatGPT performance on the Vicuna benchmark, an automated evaluation that uses GPT-4 as a judge to score model responses across 80 prompts spanning writing, roleplay, math, coding, and general knowledge. Smaller Guanaco models reached the percentages shown below.
| Model | Vicuna benchmark vs ChatGPT | Notes |
|---|---|---|
| Guanaco 7B | 87.0% | Fits in roughly 5 GB at inference, runs on phones |
| Guanaco 13B | 90.4% | Outperforms Alpaca 13B by ~20 points |
| Guanaco 33B | 97.8% | Trains on a 24 GB consumer GPU in under 12 hours |
| Guanaco 65B | 99.3% | 24 hours of fine-tuning on one 48 GB card |
On the Open Assistant evaluation, Guanaco 65B reached an Elo rating of 1,008, statistically tied with ChatGPT-3.5 Turbo at 1,015 [1]. The 65B fine-tune required 24 hours on a single 48 GB GPU. The paper also showed that smaller, higher-quality datasets such as OASST1 (about 9,000 conversations) produce stronger chat models than larger but noisier sources such as FLAN v2 (15 million examples) or Self-Instruct (about 82,000 examples), shifting community focus toward data curation rather than dataset scale.
The paper evaluated eight instruction-following datasets, summarized below.
| Dataset | Examples | Source | Notable property |
|---|---|---|---|
| OASST1 | ~9,209 | Crowd-sourced | Best Vicuna scores despite its small size |
| FLAN v2 | ~15M | Task collection | Best for academic benchmarks like MMLU |
| Self-Instruct | ~82,612 | Distilled from GPT | Mid-quality |
| Alpaca | ~51,942 | Distilled from GPT | Standard baseline at the time |
| Chip2 | ~210,289 | Hybrid | Useful for chat |
| HH-RLHF | ~160,800 | Preference-based | Anthropic helpful-harmless |
| Longform | ~23,700 | Hybrid | Open-ended generation |
| Unnatural Instructions | ~240,670 | Distilled | Larger but noisier |
The key finding was that data quality dominates dataset size for chat-oriented evaluation, while broad task coverage from FLAN v2 still helps on academic question-answering benchmarks like MMLU.
The reference implementation lives in the bitsandbytes library by Tim Dettmers, which provides the 4-bit quantization primitives, the NF4 data type, double quantization, and paged optimizers [3]. The library is built on CUDA kernels and integrates directly with PyTorch. As of 2026 it is maintained by the bitsandbytes-foundation, with sponsorship from Hugging Face and Intel, and supports NVIDIA SM60 and newer, AMD CDNA and RDNA architectures, Intel Data Center GPU Max and Arc series, Intel Gaudi 2 and 3, and ARM64 plus Apple Silicon CPUs.
In the Hugging Face ecosystem, QLoRA is exposed through three components. Transformers accepts a BitsAndBytesConfig at model load time:
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Store base weights as NF4 with double quantization; run matmuls in bf16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Weights are quantized to 4-bit as they are loaded.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
)
```
The PEFT library wraps this quantized model with LoRA adapters via LoraConfig and get_peft_model, after preparing the model for k-bit training so that LayerNorm and embedding parameters are upcast to FP32 for stability:
```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Freeze the base model and upcast the remaining 16-bit parameters for stability.
model = prepare_model_for_kbit_training(model)

# Rank-64 adapters on every linear projection of a Llama-style block.
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
```
Recent versions of PEFT also accept a shorthand, target_modules="all-linear", which automatically targets every linear layer in the architecture. This shorthand is the recommended setting for QLoRA-style training across model families with different attention naming conventions [4].
This pattern is the de facto standard for community QLoRA training and is documented in the Hugging Face PEFT documentation [4]. Higher-level training frameworks build on top of this stack:
| Framework | Maintainer | Distinguishing feature |
|---|---|---|
| Axolotl | OpenAccess AI Collective | YAML-driven configuration wrapping Transformers, PEFT, and bitsandbytes; supports DPO, IPO, KTO, ORPO, GRPO |
| Unsloth | Daniel Han, Michael Han | Custom Triton kernels reporting up to 2x speed and 70% less VRAM on consumer GPUs |
| Hugging Face TRL | Hugging Face | Built-in SFTTrainer and DPOTrainer with QLoRA support |
| LLaMA-Factory | hiyouga | Unified fine-tuning UI for 100+ LLMs and VLMs |
| DeepSpeed | Microsoft | Distributed training that integrates bitsandbytes for ZeRO + QLoRA |
| Apple MLX | Apple | QLoRA-style training on Apple Silicon with native quantization primitives |
| torchtune | Meta PyTorch | Native PyTorch fine-tuning library with 4-bit and 8-bit recipes |
The original reference code is published under the MIT license in the artidoro/qlora GitHub repository, hosted under the account of second author Artidoro Pagnoni. The bitsandbytes library is also licensed under MIT and is one of the most starred quantization libraries on GitHub.
QLoRA is a strong default but has several practical caveats.
Quality variance. Although the paper reports quality parity with BF16 LoRA in aggregate, individual training runs occasionally show small regressions, particularly with smaller LoRA ranks and on tasks sensitive to long-context behavior. The paper explicitly notes that 33B and 65B parity with full 16-bit fine-tuning could not be exhaustively verified due to the cost of the BF16 baseline, and that conclusions at those scales rest on extrapolation from smaller models.
Training throughput. QLoRA training is typically 5% to 30% slower than BF16 LoRA at the same configuration because every forward pass dequantizes the base weights to BF16 and discards the result. Sebastian Raschka reported a 39% increase in training time on a 7B Llama alongside a 33% reduction in GPU memory (21.33 GB to 14.18 GB) [17]; the run times quoted for that comparison, 1.85 hours for BF16 LoRA versus 2.79 hours for QLoRA, work out to an increase of about 51%. Optimized kernels in Unsloth recover most of this gap and in some configurations make QLoRA faster than naive BF16 LoRA.
Inference latency. Inference with a QLoRA-trained model is slower than inference with a pure 4-bit quantized model that has no LoRA adapter. The base weights must be dequantized on the fly during the forward pass, the adapter B A X term must be computed in 16-bit, and the two paths added. In practice this costs roughly 20% to 40% additional latency compared to a static 4-bit GPTQ or AWQ deployment. To recover full inference speed, practitioners typically merge the LoRA adapter back into a 16-bit copy of the base model, then re-quantize that merged model with GPTQ or AWQ.
Adapter merging. Merging an adapter directly into the 4-bit base is non-trivial. The arithmetic W' = W + B A must be done in higher precision to avoid catastrophic accumulation error, after which the merged weights can be re-quantized. Most workflows therefore keep an unquantized BF16 copy of the base model around for merging and post-training quantization.
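A toy example illustrates why the merged weights should not be rounded straight back onto the 4-bit grid. Everything here is hypothetical: the block of six weights, the scaling constant, and the small update standing in for the adapter product B A are invented for the sketch, and the codebook is the NF4 table rounded to four decimals.

```python
# NF4 levels rounded to four decimals (approximate, for illustration only).
NF4_LEVELS = [-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
              0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0]

def nearest_code(x):
    """Index of the NF4 level closest to a normalized value x in [-1, 1]."""
    return min(range(16), key=lambda i: abs(NF4_LEVELS[i] - x))

# A tiny quantized block: 4-bit codes plus one absmax scaling constant.
scale = 0.05
codes = [0, 3, 7, 9, 12, 15]
base = [NF4_LEVELS[c] * scale for c in codes]

# Hypothetical adapter update (the entries of s * B A for these weights).
delta = [0.0005, -0.0004, 0.0003, 0.0002, -0.0005, -0.0001]

# Merge in higher precision: every per-weight update is preserved.
merged_hp = [w + d for w, d in zip(base, delta)]

# Re-quantize the merged block to NF4 with a fresh absmax constant.
new_scale = max(abs(w) for w in merged_hp)
requant_codes = [nearest_code(w / new_scale) for w in merged_hp]
merged_4bit = [NF4_LEVELS[c] * new_scale for c in requant_codes]

print(requant_codes == codes)
print(merged_hp[2], merged_4bit[2])
```

In this example every weight snaps back to its original code, so the per-weight update disappears and only the block constant shifts slightly. Keeping the merged model in 16-bit and quantizing it afterward with a calibration-based method avoids depending on this rounding.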
Hardware support. Bitsandbytes was originally NVIDIA-only. Early releases targeted CUDA capability 7.5 (Turing) or higher, and native bf16 compute needs Ampere (8.0) or newer, so Turing cards fall back to fp16 compute. AMD ROCm support has been added but lags in stability, and Apple Silicon support is provided through MLX or the bitsandbytes Metal backend, the latter labeled slow in the upstream documentation. Intel GPU and Gaudi backends added by 2025 support most QLoRA primitives, though feature coverage is still narrower than on CUDA.
Benchmark reliability. The QLoRA paper itself devotes a long limitations section to chatbot benchmarks. The authors found weak agreement between GPT-4 judgments and human raters (Fleiss kappa of 0.25 at the example level) and only moderate agreement among human raters themselves (kappa 0.42), and explicitly state that current chatbot benchmarks are not trustworthy enough to draw fine-grained quality conclusions. The 99.3% Vicuna figure for Guanaco-65B should be read in this context.
QLoRA has inspired a family of follow-up methods that refine the quantization, the adapter parameterization, or the optimizer:
| Method | Year | Key idea |
|---|---|---|
| DoRA | 2024 | Decomposes weight into magnitude and direction, applies LoRA to direction only |
| LoftQ | 2023 | Joint adapter and base quantization initialization that minimizes the Frobenius norm of Q(W) + B A - W |
| GaLore | 2024 | Projects gradients into low-rank subspace; full fine-tunes 7B on 24 GB |
| QA-LoRA | 2023 | Quantization-aware adapter that quantizes both base and adapter at deployment |
| AQLM | 2024 | Additive quantization to 2-bit per weight |
| HQQ | 2024 | Half-quadratic, calibration-free 4-bit and 2-bit quantization |
| FSDP+QLoRA | 2024 | Answer.AI's combination with FSDP for two-GPU 70B training |
| LoRA+ | 2024 | Different learning rates for A and B matrices to improve convergence |
- DoRA is available in PEFT through the use_dora=True flag in LoraConfig.
- LoftQ initializes the quantized base and the adapters together so that Q(W) + B A is closer to the original W at the start of training, improving downstream accuracy on aggressive 2-bit and 3-bit settings. PEFT exposes LoftQ via init_lora_weights="loftq" and a helper replace_lora_weights_loftq for in-place upgrade of an existing LoRA adapter.

QLoRA significantly broadened access to LLM fine-tuning. Before its publication, fine-tuning a 30B-plus model required institutional clusters with tens of GPUs. After QLoRA, a single 48 GB or 80 GB data-center card was sufficient for 65B, and with FSDP+QLoRA two consumer cards became enough for 70B. Within months of release, the open-source community produced thousands of QLoRA fine-tunes on Hugging Face Hub, spanning coding assistants, multilingual chatbots, role-play models, domain-specific medical and legal models, and many more, built on Llama 2, Mistral, Llama 3, Qwen, Phi, and Gemma base models.
The technique also reshaped the economics of community model releases. Many of the most popular fine-tuned models on Hugging Face Hub through 2023 and 2024 (the Nous Research family, OpenChat, OpenHermes, WizardLM, Dolphin, Zephyr, MythoMax, and dozens of role-play merges) were built end-to-end on QLoRA pipelines that ran on rented A100 or A6000 cards. The combination of low cost and parameter-efficient adapters that ship as small files alongside the base model also enabled the practice of releasing multiple specialized adapters per base, a workflow popularized by adapter-marketplace style sites and by the adapter-transformers and PEFT integrations.
The QLoRA paper accumulated several thousand citations within its first year and is a standard baseline in efficient LLM training papers. The bitsandbytes library that hosts the QLoRA implementation is now a core dependency of the Hugging Face training stack and ships under the bitsandbytes-foundation umbrella with multi-vendor backing. The public release of the Guanaco weights helped establish norms around reproducibility in the LLM fine-tuning literature.
The technique also influenced industrial fine-tuning pipelines. Cloud providers such as AWS SageMaker, Google Vertex AI, and Azure Machine Learning added managed QLoRA training paths. Together AI, Anyscale, Replicate, Modal, and Runpod all offer QLoRA-based fine-tuning as a primitive in their developer-facing platforms, often at hourly rates measured in single dollars for 7B and 13B models. Apple's on-device foundation model team adapted QLoRA-style 4-bit storage for the personalization adapters distributed with Apple Intelligence, and Microsoft's Olive optimization toolkit ships a QLoRA configuration as a default recipe for Phi family models.
The table below summarizes how QLoRA compares to other adaptation strategies on the dimensions that matter most for practitioners: trainable parameter share, base model storage, inference cost, and reported quality relative to full fine-tuning.
| Method | Trainable params | Base storage | Inference overhead | Quality vs full FT | Typical use |
|---|---|---|---|---|---|
| Full fine-tuning | 100% | FP16 or BF16 | None | Reference | Frontier model post-training |
| LoRA (FP16 base) | 0.01% to 1% | FP16 base + small adapters | None after merge | Within 1% on most tasks | Standard PEFT for mid-size models |
| QLoRA | 0.01% to 1% | NF4 base + bf16 adapters | 20% to 40% if not merged | Within 1% of LoRA | Single-GPU fine-tuning of 30B-plus models |
| DoRA | 0.01% to 1% | FP16 or NF4 base | Similar to LoRA | Slight gains over LoRA | Quality-sensitive instruction tuning |
| Prefix tuning | 0.01% to 0.1% | FP16 base | Small per-token cost | Below LoRA in many tasks | Lightweight task adaptation |
| Prompt tuning | 0.001% to 0.01% | FP16 base | Negligible | Below LoRA at small scale | Soft-prompt adaptation |
| Adapter modules | 0.1% to 1% | FP16 base + adapters | Small per-layer cost | Comparable to LoRA | Multi-task adaptation |
| GPTQ only | 0% (no fine-tune) | 4-bit base | None | Inference only | Quantized deployment |
| GaLore | 100% | FP16 or BF16 | None | Comparable to full FT | Memory-efficient full fine-tuning |
| FSDP+QLoRA | 0.01% to 1% | NF4 base sharded | Same as QLoRA | Same as QLoRA | Multi-GPU 70B-plus on consumer cards |
QLoRA is the method of choice when memory constraints make 16-bit storage of the base model infeasible. For models smaller than about 13B, BF16 LoRA is often preferred because it avoids dequantization overhead and is trivially mergeable. For 30B and larger models, QLoRA dominates the cost-quality trade-off.
A typical end-to-end QLoRA fine-tuning workflow proceeds in four stages.
1. Load the base model. Use BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_use_double_quant=True, bnb_4bit_compute_dtype=torch.bfloat16) together with AutoModelForCausalLM.from_pretrained to load the base model in 4-bit. Wrap it with prepare_model_for_kbit_training, which freezes the base parameters, upcasts LayerNorm and embeddings, and by default enables gradient checkpointing. Set model.config.use_cache = False for training, because the cache used for autoregressive decoding is incompatible with gradient checkpointing.
2. Attach the adapters. Create a LoraConfig with target_modules="all-linear" and r=64, lora_alpha=16, lora_dropout=0.05, then call get_peft_model. Everything except the LoRA matrices stays frozen. The resulting model exposes only the adapters as trainable parameters, typically a fraction of one percent of the base model count.
3. Train. Use transformers.Trainer or trl.SFTTrainer with a paged AdamW optimizer, gradient checkpointing, and bf16 mixed precision.
4. Save and deploy. After training, save the adapter as a small file (typically a few hundred megabytes for a 65B base). To deploy, either load the adapter on top of the 4-bit base, or merge it into a 16-bit copy of the base and re-quantize with GPTQ, AWQ, or HQQ.

A single training run for a 7B Llama variant on 50,000 instruction examples typically completes in 2 to 4 hours on a single A100, costs less than 10 US dollars on rental cloud GPUs, and produces an adapter file that fits on a USB stick.
Several configuration mistakes recur in community QLoRA pipelines.
- Setting bnb_4bit_compute_dtype=torch.float32 defeats most of the point of 4-bit training because matrix multiplications run at FP32 speed. The recommended setting is torch.bfloat16 on Ampere or newer hardware, or torch.float16 on older Turing cards that lack BF16 support.
- Leaving model.config.use_cache = True produces silent activation memory growth and occasional OOM. The decoding cache is also incompatible with gradient checkpointing, so set it to False explicitly before training.
- Targeting q_proj and v_proj only, the original LoRA paper convention, leaves a measurable quality gap on instruction tuning compared with targeting all linear layers. The QLoRA paper explicitly recommends all linear projections for parity with full fine-tuning.
- Calling merge_and_unload directly on a 4-bit quantized model loses precision because the merged weights W + B A are rounded back onto the 4-bit grid. The correct procedure is to load the base in BF16, attach the trained adapter, merge, save, and then re-quantize with GPTQ, AWQ, or HQQ for deployment.
- Switching the optimizer between resumed runs (for example from paged_adamw_32bit to paged_adamw_8bit) silently re-initializes the moment buffers and discards the warmup state. Practitioners should keep the optimizer flag identical between runs.