QLoRA

QLoRA (Quantized Low-Rank Adaptation) is a parameter-efficient fine-tuning method for large language models introduced in May 2023 by Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer at the University of Washington [1]. The technique combines LoRA, the low-rank adapter method published by Hu et al. in 2021 [2], with aggressive 4-bit quantization of the frozen base model. The result is a training recipe that allows fine-tuning of an LLM with up to 65 billion parameters on a single GPU with 48 GB of memory, while preserving the predictive quality of full 16-bit fine-tuning [1].

QLoRA's central design idea is that the base model's weights can be stored in a heavily compressed 4-bit representation and dequantized on the fly during the forward pass, while the small LoRA adapter matrices that absorb gradient updates remain in higher precision. The paper introduced three new components to make this work without quality loss: a 4-bit data type called NF4 (NormalFloat 4-bit), a double quantization scheme that compresses the quantization constants themselves, and paged optimizer states that use NVIDIA unified memory to handle gradient checkpointing memory spikes [1]. Using QLoRA, the authors trained the Guanaco family of chat models, which reached 99.3% of ChatGPT performance on the Vicuna benchmark after only 24 hours of fine-tuning on a single GPU [1].

QLoRA was presented as an oral paper at NeurIPS 2023. Its open-source implementation in the bitsandbytes library and integration with Hugging Face Transformers and the PEFT library underpin a large fraction of community fine-tunes built on Llama, Mistral, and similar models [3][4]. By the end of 2023, the original arXiv preprint had accumulated thousands of citations and the bitsandbytes library had become a default dependency in nearly every public LLM training stack.

background

Full fine-tuning of a large language model means computing and storing gradients and optimizer state for every parameter. For a 65-billion-parameter model trained in bfloat16 with the AdamW optimizer in 32-bit precision, the memory budget is dominated by the weights themselves, the gradients, and the two Adam moment buffers. Once the activations required for backpropagation are added, the total comfortably exceeds 780 GB of GPU memory, far beyond what any single accelerator provides [1]. Training such a model end-to-end therefore requires a multi-GPU cluster, typically tens of accelerators, an option available mainly to well-funded laboratories.
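The 780 GB figure can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes 2 bytes per weight and per gradient (bfloat16) and 4 bytes for each of the two Adam moments. This byte breakdown is an assumption chosen to be consistent with the paragraph above, not an accounting quoted from the paper, and activations are left out.

```python
# Rough memory estimate for full fine-tuning; activations are not counted.
# Assumed bytes per parameter (illustrative, not quoted from the paper).
BYTES_PER_PARAM = {
    "weights_bf16": 2,
    "gradients_bf16": 2,
    "adam_first_moment_fp32": 4,
    "adam_second_moment_fp32": 4,
}

def full_finetune_gb(num_params: float) -> float:
    """Return the estimated training-state footprint in gigabytes (1e9 bytes)."""
    return num_params * sum(BYTES_PER_PARAM.values()) / 1e9

if __name__ == "__main__":
    for billions in (7, 13, 33, 65):
        print(f"{billions}B parameters: ~{full_finetune_gb(billions * 1e9):.0f} GB")
```

Under these assumptions the 65B case comes to 12 bytes per parameter, which is where a total of about 780 GB before activations would come from.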

Two earlier lines of work attacked this cost from different angles. The first is parameter-efficient fine-tuning, which freezes most of the model and trains only a small set of additional parameters. LoRA, introduced by Hu et al. in 2021 [2], freezes the pretrained weight matrix W and represents the update as a product of two narrow matrices, delta_W = B * A, where A has shape (r, d_in) and B has shape (d_out, r) with rank r much smaller than the model dimensions. The forward pass becomes Y = W X + B A X. Only A and B receive gradients, which typically reduces the trainable parameter count by two to four orders of magnitude. Other parameter-efficient techniques include adapter modules (Houlsby et al. 2019), prefix tuning, and prompt tuning.

The second line is post-training quantization of model weights. The LLM.int8() paper by Dettmers et al. in 2022 [5] showed that an 8-bit weight representation can match 16-bit inference quality on transformer LLMs if outlier features are handled with a mixed-precision decomposition. Subsequent work pushed quantization to four bits per weight: GPTQ (Frantar et al. 2022) [6] and AWQ (Lin et al. 2023) [7] both demonstrated 4-bit inference with small accuracy loss. These methods, however, are designed for inference; using a naive 4-bit weight matrix during training breaks gradient flow and degrades quality.

QLoRA's contribution is to combine these two threads in a way that avoids both of their failure modes. The base weights stay frozen, so they never receive gradients and can be stored at four bits without harming optimization. The LoRA adapters stay in 16-bit precision and absorb all gradient updates. The only step where the two precisions meet is the forward pass, where the 4-bit base weights are dequantized to 16-bit on the fly to compute the matrix product W X, then immediately discarded.

precursor work by Tim Dettmers

QLoRA builds on a sequence of efficiency papers from the same author. The 2022 ICLR paper 8-bit Optimizers via Block-wise Quantization [16] showed that the moment buffers of Adam and other adaptive optimizers can be quantized to 8 bits with block-wise scaling without quality loss. The 2022 NeurIPS paper LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale [5] established that vector-wise quantization combined with mixed-precision handling of outlier features can compress entire transformer weight matrices to 8 bits without degrading inference quality. QLoRA inherits the block-wise scaling of the optimizer work and the quantized weight storage of LLM.int8(), and extends them from 8 bits to 4 bits for the frozen weights used during fine-tuning.

three core contributions

NF4 quantization

The first contribution is a new 4-bit data type, the 4-bit NormalFloat (NF4). A 4-bit type can represent at most 16 distinct values, since 2^4 = 16. Standard 4-bit integer quantization spaces these values uniformly across the weight range, which is wasteful because the weights of a pretrained transformer are not uniformly distributed. Empirically, the weights of large pretrained models are well approximated by a zero-mean normal distribution with a small standard deviation, so most weights cluster near zero and a uniform grid wastes capacity on regions that are almost never populated [1].

NF4 instead places its 16 quantization levels at the quantiles of a standard normal distribution. Each level represents an equal probability mass under a unit normal, after which the grid is rescaled by a per-block scaling constant to match the actual weight magnitudes. The paper describes this construction as information-theoretically optimal for normally distributed weights, in the sense that each of the 16 bins is expected to receive the same number of values, so no code is wasted on sparsely populated regions. The exact level positions are computed once and stored as a fixed lookup table; quantization reduces to a nearest-level search and dequantization to a table lookup.

Formally, the levels are defined by

q_i = 0.5 * ( Q( i / (2^k + 1) ) + Q( (i+1) / (2^k + 1) ) )

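where i indexes the quantile intervals, Q is the inverse cumulative distribution function (the quantile function) of a standard normal N(0, 1), and k = 4 for NF4. Because Q diverges at probabilities 0 and 1, the outermost quantiles cannot be taken at the exact endpoints. The paper also builds the grid asymmetrically, estimating 2^(k-1) levels for the negative half and 2^(k-1) + 1 for the positive half and dropping the duplicated zero, so that zero is represented exactly. The 16 resulting values are normalized to the range [-1, 1] and stored as the NF4 codebook.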
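The quantile construction can be sketched with the Python standard library. The version below follows the asymmetric variant described in the QLoRA paper (more levels on the positive side, plus an exact zero) rather than the formula exactly as written. The `offset` value 0.9677083 and the helper names are assumptions made for illustration, not values taken from the text above.

```python
from statistics import NormalDist

def linspace(start: float, stop: float, num: int) -> list[float]:
    """Evenly spaced values from start to stop, inclusive."""
    step = (stop - start) / (num - 1)
    return [start + step * i for i in range(num)]

def nf4_codebook(offset: float = 0.9677083) -> list[float]:
    """Build a 16-level normal-quantile codebook normalized to [-1, 1].

    `offset` is the outermost probability; it keeps the quantile function
    away from 1.0, where it diverges. The default value is an assumption.
    """
    quantile = NormalDist(mu=0.0, sigma=1.0).inv_cdf
    # 8 positive levels, 7 negative levels, and one exact zero: 16 values.
    positive = [quantile(p) for p in linspace(offset, 0.5, 9)[:-1]]
    negative = [-quantile(p) for p in linspace(offset, 0.5, 8)[:-1]]
    levels = sorted(negative + [0.0] + positive)
    largest = max(abs(v) for v in levels)
    return [v / largest for v in levels]

if __name__ == "__main__":
    for index, level in enumerate(nf4_codebook()):
        print(f"{index:2d}  {level:+.4f}")
```

The levels crowd together near zero and spread out toward the ends, which is the behavior the quantile construction is meant to produce for bell-shaped weight distributions.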

In the paper's ablations on LLaMA, OPT, BLOOM, and Pythia models from 125M to 13B parameters, NF4 with double quantization achieves a mean perplexity on Pile Common Crawl of 27.41, compared to 29.48 for FP4 with three exponent bits, 31.07 for FP4 with two exponent bits, and 34.34 for plain Int4 [1]. Across MMLU and other downstream benchmarks, the gap between NF4 and FP4 corresponds to roughly one percentage point of accuracy.

Quantization is performed block-wise with a block size of 64. Within each block of 64 consecutive weights, the maximum absolute value is computed as a scaling factor c1, each weight is divided by c1, and the result is mapped to its nearest NF4 level. Block-wise scaling confines the effect of any single outlier to its own block, the same idea used in the earlier 8-bit optimizers work [16].
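A minimal pure-Python sketch of block-wise absmax quantization follows. The codebook values are NF4 levels rounded to four decimals and should be treated as illustrative; the function names are hypothetical, and real implementations pack two 4-bit codes per byte and run on the GPU.

```python
import random

# NF4 levels rounded to four decimals (assumed values, for illustration only).
NF4_LEVELS = [-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
              0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0]

def quantize_blockwise(weights, block_size=64):
    """Return (4-bit codes, per-block absmax constants c1)."""
    codes, absmax = [], []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        c1 = max(abs(w) for w in block) or 1.0  # avoid dividing by zero
        absmax.append(c1)
        for w in block:
            scaled = w / c1  # now in [-1, 1]
            codes.append(min(range(16), key=lambda i: abs(NF4_LEVELS[i] - scaled)))
    return codes, absmax

def dequantize_blockwise(codes, absmax, block_size=64):
    """Look up each code and rescale it by its block's constant."""
    return [NF4_LEVELS[code] * absmax[i // block_size] for i, code in enumerate(codes)]

if __name__ == "__main__":
    rng = random.Random(0)
    weights = [rng.gauss(0.0, 0.02) for _ in range(128)]
    weights[5] = 1.0  # a single outlier in the first block
    codes, absmax = quantize_blockwise(weights)
    restored = dequantize_blockwise(codes, absmax)
    for name, lo in (("block 0 (outlier)", 0), ("block 1", 64)):
        err = max(abs(w - r) for w, r in zip(weights[lo:lo + 64], restored[lo:lo + 64]))
        print(f"{name}: absmax={absmax[lo // 64]:.4f}, max error={err:.4f}")
```

The outlier inflates the scale of its own block only, so the second block keeps a tight grid and a much smaller reconstruction error.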

double quantization

Block-wise quantization introduces a second-order overhead: the scaling constants c1 themselves must be stored, and at one 32-bit constant per 64-weight block, this overhead is 32 / 64 = 0.5 bits per parameter on top of the 4-bit weights. For a 65B model this is about 4 GB of additional storage, a non-trivial fraction of the budget on a 48 GB card.

Double quantization (DQ) addresses this by quantizing the constants themselves [1]. The first-level constants c1 are quantized to 8-bit floating point with a second-level scaling constant c2, applied across blocks of 256 first-level constants. With DQ, the average overhead drops from about 0.5 bits per parameter to roughly 0.127 bits per parameter, saving an additional 0.37 bits per parameter on average. For a 65B model this is roughly 3 GB. The combined storage is approximately 4.127 effective bits per weight. The paper reports that double quantization has no measurable effect on downstream accuracy [1].
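The overhead figures above follow directly from the two block sizes, as this short calculation shows. The function names are hypothetical; the block sizes and bit widths are the ones stated in the text.

```python
# Storage overhead of the quantization constants, in bits per model parameter.
FIRST_LEVEL_BLOCK = 64    # weights per first-level constant c1
SECOND_LEVEL_BLOCK = 256  # c1 constants per second-level constant c2

def overhead_without_dq() -> float:
    """One 32-bit constant for every block of 64 weights."""
    return 32 / FIRST_LEVEL_BLOCK

def overhead_with_dq() -> float:
    """One 8-bit c1 per 64 weights plus one 32-bit c2 per 64 * 256 weights."""
    return 8 / FIRST_LEVEL_BLOCK + 32 / (FIRST_LEVEL_BLOCK * SECOND_LEVEL_BLOCK)

def bits_to_gb(bits_per_param: float, num_params: float) -> float:
    """Convert a per-parameter bit cost into gigabytes (1e9 bytes)."""
    return bits_per_param * num_params / 8 / 1e9

if __name__ == "__main__":
    saved = overhead_without_dq() - overhead_with_dq()
    print(f"without DQ: {overhead_without_dq():.3f} bits/param")
    print(f"with DQ:    {overhead_with_dq():.3f} bits/param")
    print(f"saved:      {saved:.3f} bits/param, "
          f"~{bits_to_gb(saved, 65e9):.1f} GB for a 65B model")
```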

The full dequantization expression is

W_BF16 = doubleDequant(c2_FP32, c1_FP8, W_NF4)
        = dequant( dequant(c2_FP32, c1_FP8), W_NF4 )

where the inner dequant reconstructs the first-level constants c1 from their FP8 codes and the second-level FP32 constants c2, and the outer dequant maps each 4-bit code to its NF4 codebook value and scales it by the recovered c1. The resulting BF16 weights are kept only for the matrix multiplication and discarded afterward.
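The inner step, quantizing and recovering the block constants themselves, can be sketched as follows. A symmetric 8-bit integer grid stands in for the FP8 format to keep the example dependency-free, so this is an approximation of the idea rather than the library's implementation. The mean-centering step follows the paper's description of the constants as strictly positive; the helper names are hypothetical.

```python
import random

def quantize_constants(c1, block_size=256):
    """Compress FP32 block constants to 8-bit codes plus one scale c2 per block.

    A symmetric int8 grid stands in for FP8 here (an assumption made to keep
    the sketch dependency-free).
    """
    mean = sum(c1) / len(c1)  # constants are positive, so center them first
    centered = [c - mean for c in c1]
    codes, c2 = [], []
    for start in range(0, len(centered), block_size):
        block = centered[start:start + block_size]
        scale = max(abs(v) for v in block) or 1.0
        c2.append(scale)
        codes.extend(round(v / scale * 127) for v in block)
    return codes, c2, mean

def dequantize_constants(codes, c2, mean, block_size=256):
    """Inner dequant: recover approximate c1 values from codes, c2, and the mean."""
    return [code / 127 * c2[i // block_size] + mean for i, code in enumerate(codes)]

if __name__ == "__main__":
    rng = random.Random(0)
    c1 = [abs(rng.gauss(0.05, 0.01)) for _ in range(512)]  # synthetic absmax values
    codes, c2, mean = quantize_constants(c1)
    restored = dequantize_constants(codes, c2, mean)
    worst = max(abs(a - b) for a, b in zip(c1, restored))
    print(f"{len(c2)} second-level constants, worst c1 error {worst:.2e}")
```

The recovered constants would then feed the outer step, which multiplies each NF4 codebook value by its block's constant.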

paged optimizers

The third contribution addresses a different memory bottleneck. During training with gradient checkpointing, recomputed activations and optimizer states can produce sudden memory spikes that drive the GPU into out-of-memory failure even when average usage is below the device limit. These spikes are particularly common at long sequence lengths where the recomputed attention activations briefly dominate the budget.

QLoRA introduces paged optimizers, which use NVIDIA unified memory to allow optimizer state to spill from GPU memory to CPU memory automatically. The CUDA unified memory subsystem moves pages of optimizer state between host and device transparently when GPU memory pressure rises, then pages them back when needed. The implementation in bitsandbytes provides paged variants of AdamW and other optimizers in the form of PagedAdamW8bit, PagedAdamW32bit, PagedLion, and similar wrappers. In practice this prevents OOM failures during occasional spikes without significantly slowing training, because the spikes are short and pages are returned to the GPU before the next training step needs them. The paper notes that paging events are rare enough that wall-clock impact is hard to measure under typical workloads [1].
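As a minimal sketch, a paged optimizer can be constructed directly from bitsandbytes in place of a standard PyTorch optimizer. The small linear layer is a hypothetical stand-in for the trainable adapter parameters, and the learning rate is illustrative; paging itself only takes effect on a CUDA device with unified memory.

```python
import torch
import bitsandbytes as bnb

# Stand-in for the trainable LoRA parameters (hypothetical shapes).
adapter = torch.nn.Linear(4096, 64, bias=False)

# 32-bit AdamW whose state lives in pageable unified memory.
optimizer = bnb.optim.PagedAdamW32bit(adapter.parameters(), lr=2e-4)
```

The training loop itself is unchanged: the optimizer is stepped and zeroed like any other PyTorch optimizer.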

mathematical formulation

Let W be a pretrained linear layer weight matrix that QLoRA freezes and stores in NF4. Let c1 be the block-wise NF4 scaling constants stored in 8-bit floating point, and c2 be the second-level FP32 constants used to dequantize c1. The doubly-dequantized weight is

W_BF16 = doubleDequant(c2_FP32, c1_FP8, W_NF4)

where doubleDequant first uses c2 to recover the per-block constants c1, then uses c1 to recover the 16-bit weights. The LoRA adapters A and B are stored in bf16. For an input activation X, and with s denoting the LoRA scaling factor alpha / r, the QLoRA forward pass is

Y = doubleDequant(c2_FP32, c1_FP8, W_NF4) @ X + s * B @ A @ X

where @ denotes matrix multiplication. Equivalently, writing W_BF16 for the dequantized weight,

Y = W_BF16 @ X + s * B @ A @ X

During the backward pass, the gradient with respect to the LoRA adapters is

dL/dA = s * B^T @ dL/dY @ X^T
dL/dB = s * dL/dY @ (A @ X)^T

No gradient is computed for W_NF4, c1, or c2. The optimizer therefore only updates A and B, both stored in bf16, with optimizer moments stored in 32-bit using paged AdamW.
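The adapter gradient formulas can be checked numerically against finite differences. The sketch below uses tiny random matrices and the loss L = 0.5 * ||Y||^2, for which dL/dY equals Y. The loss, the shapes, and all helper names are assumptions chosen for illustration.

```python
import random

def matmul(P, Q):
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]

def transpose(M):
    return [list(col) for col in zip(*M)]

def scaled(M, s):
    return [[s * v for v in row] for row in M]

def forward_loss(W, A, B, X, s):
    """L = 0.5 * ||Y||^2 with Y = W X + s B A X; returns (L, Y)."""
    WX, BAX = matmul(W, X), scaled(matmul(B, matmul(A, X)), s)
    Y = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(WX, BAX)]
    return 0.5 * sum(v * v for row in Y for v in row), Y

def max_gradient_gap(seed=0, eps=1e-6):
    """Largest gap between the closed-form and finite-difference gradients."""
    rng = random.Random(seed)
    rand = lambda rows, cols: [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]
    d_out, d_in, r, n, s = 3, 4, 2, 5, 16 / 64
    W, A, B, X = rand(d_out, d_in), rand(r, d_in), rand(d_out, r), rand(d_in, n)

    _, dY = forward_loss(W, A, B, X, s)  # for this loss, dL/dY equals Y
    grad_A = scaled(matmul(matmul(transpose(B), dY), transpose(X)), s)
    grad_B = scaled(matmul(dY, transpose(matmul(A, X))), s)

    gap = 0.0
    for M, grad in ((A, grad_A), (B, grad_B)):
        for i, row in enumerate(M):
            for j, original in enumerate(row):
                row[j] = original + eps
                up, _ = forward_loss(W, A, B, X, s)
                row[j] = original - eps
                down, _ = forward_loss(W, A, B, X, s)
                row[j] = original  # restore the entry before moving on
                gap = max(gap, abs(grad[i][j] - (up - down) / (2 * eps)))
    return gap

if __name__ == "__main__":
    print(f"max |analytic - numeric| = {max_gradient_gap():.2e}")
```

W appears only in the forward computation and is never perturbed, mirroring the fact that the frozen base weights receive no gradient.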

A simplified pseudocode view of one training step:

for X, target in dataloader:
    # forward: dequantize the frozen base weights only for this matmul
    W_bf16 = double_dequantize(c2_fp32, c1_fp8, W_nf4)
    Y = W_bf16 @ X + (alpha / r) * B @ A @ X
    loss = loss_fn(Y, target)

    # backward: only A and B receive gradients
    grads = backward(loss, params=[A, B])

    # paged AdamW update of the adapters
    paged_adamw_step([A, B], grads)

The dequantized weights W_bf16 are kept only for the duration of the matrix multiplication and immediately discarded, so peak memory still reflects the 4-bit storage cost rather than the full 16-bit footprint.

data type comparison

QLoRA mixes four numeric formats during training: FP32, BF16, FP8, and NF4. The table below shows where each one appears and how many bits it occupies, alongside related formats (FP16, Int8, and FP4) that serve as alternatives.

Format	Bits	Range/Levels	Role in QLoRA
FP32	32	~1.4e-45 to ~3.4e38	Second-level scaling constants `c2`; AdamW optimizer moments by default
BF16	16	~1e-38 to ~3e38, 8-bit mantissa	Compute dtype for the dequantized forward pass; LoRA adapters `A`, `B`
FP16	16	~6e-5 to ~6e4, 11-bit mantissa	Optional alternative compute dtype, narrower range than BF16
FP8 (E4M3)	8	~2^-9 to 448	First-level scaling constants `c1` after double quantization
Int8	8	-128 to 127	LLM.int8() base weights when used (a separate mode)
NF4	4	16 quantile levels in `[-1, 1]`	Frozen base model weights
FP4 (E2M1)	4	16 floating-point values	Alternative 4-bit format, supported but not preferred

hyperparameters

The QLoRA paper and the reference implementation provide a recipe that has become the default starting point for community fine-tunes [1][3]. The settings below correspond to the Guanaco models, with per-size values noted where they differ.

Quantization data type: NF4.
Block size: 64 for first-level quantization, 256 for second-level constants.
Compute dtype: bfloat16 for the dequantized forward pass and for LoRA adapters.
Optimizer: paged AdamW 32-bit.
LoRA target modules: all linear projections in each transformer block, including q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, and down_proj for Llama-family architectures. The paper found that targeting all linear layers is more important than choosing a particular rank.
LoRA rank r: 64 in the Guanaco models. Smaller values such as 8 or 16 are common for smaller datasets.
LoRA alpha: 16 in the original recipe, with the effective scaling factor alpha / r = 16 / 64 = 0.25.
LoRA dropout: 0.1 for models up to 13B, 0.05 for 33B and 65B.
Learning rate: 2e-4 for 7B and 13B, 1e-4 for 33B and 65B, with a constant learning rate schedule.
Batch size: small (1 to 4) per device, with gradient accumulation to reach an effective batch of 16 to 64.
Gradient checkpointing: enabled to reduce activation memory.
Sequence length: 512 for instruction tuning, with longer sequences supported on cards with sufficient memory.

Later work has experimented with higher ranks (256 or 512), different alpha conventions (alpha equal to rank, or twice the rank), and other quantization data types, but the original recipe remains a strong baseline. Sebastian Raschka's experiments on a 7B Llama base reported that extending LoRA from a subset of projections to all linear layers raised the trainable parameter count from about 4 million to about 20 million while adding only roughly 2.4 GB of memory, and that rank 256 with alpha 512 gave the best downstream quality among the settings he tried [17].
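The effect of rank and target modules on adapter size can be checked with simple arithmetic: a LoRA pair on a linear layer with d_in inputs and d_out outputs adds r * (d_in + d_out) parameters. The sketch below assumes LLaMA-7B-like shapes (hidden size 4096, MLP width 11008, 32 blocks), which are not stated in the text above, and ignores any adapter on the output head.

```python
# LLaMA-7B-like shapes (assumed): (input features, output features) per projection.
HIDDEN, MLP, NUM_BLOCKS = 4096, 11008, 32
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN), "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN), "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP), "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

def lora_param_count(rank, target_modules):
    """A has rank * d_in entries and B has d_out * rank entries per layer."""
    per_block = sum(rank * (d_in + d_out)
                    for name, (d_in, d_out) in PROJECTIONS.items()
                    if name in target_modules)
    return per_block * NUM_BLOCKS

if __name__ == "__main__":
    all_linear = list(PROJECTIONS)
    print("r=8,  q_proj+v_proj:", lora_param_count(8, ["q_proj", "v_proj"]))
    print("r=8,  all linear:   ", lora_param_count(8, all_linear))
    print("r=64, all linear:   ", lora_param_count(64, all_linear))
    print("scaling alpha / r:  ", 16 / 64)
```

Under these assumptions the first two counts come to roughly 4 million and 20 million, in line with the figures quoted above, while rank 64 on all linear layers gives about 160 million trainable parameters.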

memory savings

The headline memory numbers from the QLoRA paper illustrate the gap between 16-bit and 4-bit base storage [1]:

Model size	Full fine-tune (BF16 + AdamW FP32)	LoRA on FP16 base	QLoRA (NF4 base)
7B	~112 GB	~28 GB	~6 GB
13B	~208 GB	~52 GB	~10 GB
33B	~528 GB	~132 GB	~21 GB
65B	>780 GB	~260 GB	~41 GB

The full fine-tune column counts weights, gradients, and AdamW state in 32-bit. The LoRA column counts a frozen FP16 base plus FP16 adapters and 32-bit AdamW state for the adapters. The QLoRA column counts a 4-bit base, bf16 adapters, paged AdamW state, and overhead for activations and gradient checkpointing. Moving the base weights from 16 bits to NF4 with double quantization shrinks their storage by roughly 4 times (16 bits versus about 4.127 bits per weight), and the totals in the table fall by roughly 5 to 6 times. This lets a 65B model fit on a single GPU with at least 48 GB of memory, such as an RTX A6000 48 GB, an NVIDIA A100 80 GB, or an NVIDIA H100 80 GB.
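As a cross-check on the base-model portion of these numbers, the storage for the frozen weights alone can be computed from the bits per weight. The sketch covers only the base weights, not adapters, optimizer state, or activations, so it is not expected to reproduce the table's totals. The function name is hypothetical.

```python
def base_storage_gb(num_params: float, bits_per_weight: float) -> float:
    """Storage for the frozen base weights alone, in gigabytes (1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

NF4_WITH_DQ_BITS = 4 + 0.127  # 4-bit codes plus double-quantized constants

if __name__ == "__main__":
    for billions in (7, 13, 33, 65):
        n = billions * 1e9
        bf16 = base_storage_gb(n, 16)
        nf4 = base_storage_gb(n, NF4_WITH_DQ_BITS)
        print(f"{billions:>2}B  16-bit: {bf16:6.1f} GB   NF4+DQ: {nf4:5.1f} GB   "
              f"ratio: {bf16 / nf4:.2f}x")
```

For 65B parameters this gives about 130 GB at 16 bits versus roughly 33.5 GB in NF4 with double quantization, which leaves headroom on a 48 GB card for adapters, optimizer state, and activations.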

A Hugging Face study comparing QLoRA against alternative methods on consumer hardware found that NF4 with double quantization and bf16 compute can fit a 13B Llama on a 16 GB T4 GPU at sequence length 1024 with gradient checkpointing enabled, a configuration that fails with plain LoRA on the same card [4][8]. Separately, Answer.AI demonstrated in March 2024 that combining QLoRA with Fully Sharded Data Parallel (FSDP) lets two 24 GB RTX 3090 or 4090 cards fine-tune a 70B model, dropping the hardware floor for that scale to roughly two thousand US dollars in consumer parts [9].

empirical results

The paper reports a large set of experiments across model sizes, datasets, and benchmarks. The two headline results are quality parity with 16-bit fine-tuning and the Guanaco family.

parity with 16-bit fine-tuning

On the MMLU benchmark, QLoRA fine-tunes of LLaMA 7B through 65B match the accuracy of equivalent 16-bit LoRA fine-tunes within statistical noise. After fine-tuning on FLAN v2, the 5-shot MMLU scores are 44.5 for 7B, 51.4 for 13B, 59.2 for 33B, and 63.9 for 65B, compared to baseline LLaMA scores of 35.1, 46.9, 57.8, and 63.4 respectively [1]. On other benchmarks such as Open Assistant evaluations, QLoRA again matches BF16 LoRA. The authors trained more than 1,000 models across these ablations, an unusually large experimental scale that strengthens the conclusion that 4-bit storage with NF4 and double quantization is essentially lossless for fine-tuning [1].

the Guanaco models

The second result is the Guanaco chat model family, produced by QLoRA fine-tuning LLaMA on the OASST1 (OpenAssistant Conversations) dataset. Guanaco-65B reached 99.3% of ChatGPT performance on the Vicuna benchmark, an automated evaluation that uses GPT-4 as a judge to score model responses across 80 prompts spanning writing, roleplay, math, coding, and general knowledge. Smaller Guanaco models reached the percentages shown below.

Model	Vicuna benchmark vs ChatGPT	Notes
Guanaco 7B	87.0%	Fits in roughly 5 GB at inference, runs on phones
Guanaco 13B	90.4%	Outperforms Alpaca 13B by ~20 points
Guanaco 33B	97.8%	Trains on a 24 GB consumer GPU in under 12 hours
Guanaco 65B	99.3%	24 hours of fine-tuning on one 48 GB card

On the Open Assistant evaluation, Guanaco 65B reached an Elo rating of 1,008, statistically tied with ChatGPT-3.5 Turbo at 1,015 [1]. The 65B fine-tune required 24 hours on a single 48 GB GPU. The paper also showed that smaller, higher-quality datasets such as OASST1 (about 9,000 conversations) produce stronger chat models than larger but noisier sources such as FLAN v2 (15 million examples) or Self-Instruct (about 82,000 examples), shifting community focus toward data curation rather than dataset scale.

dataset comparisons

The paper evaluated eight instruction-following datasets, summarized below.

Dataset	Examples	Source	Notable property
OASST1	~9,209	Crowd-sourced	Best Vicuna scores when small
FLAN v2	~15M	Task collection	Best for academic benchmarks like MMLU
Self-Instruct	~82,612	Distilled from GPT	Mid-quality
Alpaca	~51,942	Distilled from GPT	Standard baseline at the time
Chip2	~210,289	Hybrid	Useful for chat
HH-RLHF	~160,800	Preference-based	Anthropic helpful-harmless
Longform	~23,700	Hybrid	Open-ended generation
Unnatural Instructions	~240,670	Distilled	Larger but noisier

The key finding was that data quality dominates dataset size for chat-oriented evaluation, while broad task coverage from FLAN v2 still helps on academic question-answering benchmarks like MMLU.

implementations

The reference implementation lives in the bitsandbytes library by Tim Dettmers, which provides the 4-bit quantization primitives, the NF4 data type, double quantization, and paged optimizers [3]. The library is built on CUDA kernels and integrates directly with PyTorch. As of 2026 it is maintained by the bitsandbytes-foundation, with sponsorship from Hugging Face and Intel, and supports NVIDIA SM60 and newer, AMD CDNA and RDNA architectures, Intel Data Center GPU Max and Arc series, Intel Gaudi 2 and 3, and ARM64 plus Apple Silicon CPUs.

In the Hugging Face ecosystem, QLoRA is exposed through Transformers and PEFT, both building on bitsandbytes. Transformers accepts a BitsAndBytesConfig at model load time:

from transformers import BitsAndBytesConfig, AutoModelForCausalLM
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
)

The PEFT library wraps this quantized model with LoRA adapters via LoraConfig and get_peft_model, after preparing the model for k-bit training so that LayerNorm and embedding parameters are upcast to FP32 for stability:

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=64, lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05, bias="none", task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)

Recent versions of PEFT also accept a shorthand, target_modules="all-linear", which automatically targets every linear layer in the architecture. This shorthand is the recommended setting for QLoRA-style training across model families with different attention naming conventions [4].

This pattern is the de facto standard for community QLoRA training and is documented in the Hugging Face PEFT documentation [4]. Higher-level training frameworks build on top of this stack:

Framework	Maintainer	Distinguishing feature
Axolotl	OpenAccess AI Collective	YAML-driven configuration wrapping Transformers, PEFT, and bitsandbytes; supports DPO, IPO, KTO, ORPO, GRPO
Unsloth	Daniel Han, Michael Han	Custom Triton kernels reporting up to 2x speed and 70% less VRAM on consumer GPUs
Hugging Face TRL	Hugging Face	Built-in `SFTTrainer` and `DPOTrainer` with QLoRA support
LLaMA-Factory	hiyouga	Unified fine-tuning UI for 100+ LLMs and VLMs
DeepSpeed	Microsoft	Distributed training that integrates bitsandbytes for ZeRO + QLoRA
Apple MLX	Apple	QLoRA-style training on Apple Silicon with native quantization primitives
torchtune	Meta PyTorch	Native PyTorch fine-tuning library with 4-bit and 8-bit recipes

The original reference code is published under the MIT license in the artidoro/qlora GitHub repository, hosted under the account of second author Artidoro Pagnoni. The bitsandbytes library is also licensed under MIT and is one of the most starred quantization libraries on GitHub.

limitations

QLoRA is a strong default but has several practical caveats.

Quality variance. Although the paper reports quality parity with BF16 LoRA in aggregate, individual training runs occasionally show small regressions, particularly with smaller LoRA ranks and on tasks sensitive to long-context behavior. The paper explicitly notes that 33B and 65B parity with full 16-bit fine-tuning could not be exhaustively verified due to the cost of the BF16 baseline, and that conclusions at those scales rest on extrapolation from smaller models.

Training throughput. QLoRA training is typically 5% to 30% slower than BF16 LoRA at the same configuration because every forward pass dequantizes the base weights to BF16 and discards the result. Larger gaps have been reported: Sebastian Raschka cites a 39% increase in training time on a 7B Llama alongside a 33% reduction in GPU memory (21.33 GB to 14.18 GB), and the wall-clock times he lists, 1.85 hours for BF16 LoRA versus 2.79 hours for QLoRA, correspond to roughly 50% more time [17]. Optimized kernels in Unsloth recover most of this gap and in some configurations make QLoRA faster than naive BF16 LoRA.

Inference latency. Inference with a QLoRA-trained model is slower than inference with a pure 4-bit quantized model that has no LoRA adapter. The base weights must be dequantized on the fly during the forward pass, the adapter term B A X must be computed in 16-bit, and the two results must be summed. In practice this costs roughly 20% to 40% additional latency compared to a static 4-bit GPTQ or AWQ deployment. To recover full inference speed, practitioners typically merge the LoRA adapter back into a 16-bit copy of the base model, then re-quantize that merged model with GPTQ or AWQ.

Adapter merging. Merging an adapter directly into the 4-bit base is non-trivial. The arithmetic W' = W + B A must be done in higher precision to avoid catastrophic accumulation error, after which the merged weights can be re-quantized. Most workflows therefore keep an unquantized BF16 copy of the base model around for merging and post-training quantization.
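A minimal sketch of the merge step with PEFT is shown below. It loads an unquantized bf16 copy of the base model so that the merge arithmetic runs in 16-bit. The adapter and output directories are hypothetical placeholders, and re-quantizing the merged checkpoint is a separate step not shown here.

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

BASE_ID = "meta-llama/Llama-2-7b-hf"   # base checkpoint used in the examples above
ADAPTER_DIR = "path/to/qlora-adapter"   # hypothetical adapter directory
MERGED_DIR = "path/to/merged-bf16"      # hypothetical output directory

# Load the base model unquantized so that W + s * B A is computed in bf16.
base = AutoModelForCausalLM.from_pretrained(BASE_ID, torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, ADAPTER_DIR)

# Fold every adapter into its base layer and drop the PEFT wrappers.
merged = model.merge_and_unload()
merged.save_pretrained(MERGED_DIR)
```

Because the adapter was trained against the dequantized 4-bit weights rather than the original bf16 weights, the merged model can differ slightly from what was seen during training, which is one reason to evaluate it again after merging.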

Hardware support. Bitsandbytes was originally NVIDIA-only. Native bf16 compute requires Ampere (compute capability 8.0) or newer; Turing (7.5) cards lack BF16 support and fall back to fp16 compute. AMD ROCm support has been added but lags in stability, and Apple Silicon support is provided through MLX or the bitsandbytes Metal backend, the latter labeled slow in the upstream documentation. Intel GPU and Gaudi backends added by 2025 support most QLoRA primitives, though feature coverage is still narrower than on CUDA.

Benchmark reliability. The QLoRA paper itself devotes a long limitations section to chatbot benchmarks. The authors found weak agreement between GPT-4 judgments and human raters (Fleiss kappa of 0.25 at the example level) and only moderate agreement among human raters themselves (kappa 0.42), and explicitly state that current chatbot benchmarks are not trustworthy enough to draw fine-grained quality conclusions. The 99.3% Vicuna figure for Guanaco-65B should be read in this context.

variants and successors

QLoRA has inspired a family of follow-up methods that refine the quantization, the adapter parameterization, or the optimizer:

Method	Year	Key idea
DoRA	2024	Decomposes weight into magnitude and direction, applies LoRA to direction only
LoftQ	2023	Joint adapter and base quantization initialization to minimize the norm of `W - (Q(W) + B A)`
GaLore	2024	Projects gradients into low-rank subspace; full fine-tunes 7B on 24 GB
QA-LoRA	2023	Quantization-aware adapter that quantizes both base and adapter at deployment
AQLM	2024	Additive quantization to 2-bit per weight
HQQ	2024	Half-quadratic, calibration-free 4-bit and 2-bit quantization
FSDP+QLoRA	2024	Answer.AI's combination with FSDP for two-GPU 70B training
LoRA+	2024	Different learning rates for `A` and `B` matrices to improve convergence

DoRA (Weight-Decomposed Low-Rank Adaptation) by Liu et al. 2024 [10] decomposes the pretrained weight into a magnitude vector and a directional matrix, then applies LoRA only to the directional component. DoRA reports small but consistent quality gains over LoRA and QLoRA on commonsense reasoning and visual instruction tuning. PEFT supports DoRA via a use_dora=True flag in LoraConfig.
LoftQ (LoRA-Fine-Tuning-aware Quantization) by Li et al. 2023 [11] jointly initializes the adapters and quantized weights so that Q(W) + B A is closer to the original W at the start of training, improving downstream accuracy on aggressive 2-bit and 3-bit settings. PEFT exposes LoftQ via init_lora_weights="loftq" and a helper replace_lora_weights_loftq for in-place upgrade of an existing LoRA adapter.
GaLore (Gradient Low-Rank Projection) by Zhao et al. 2024 [12] projects gradients themselves into a low-rank subspace and updates the full-precision weights along that projection. GaLore can fully fine-tune a 7B model on a 24 GB consumer GPU.
QA-LoRA (Quantization-Aware Low-Rank Adaptation) by Xu et al. 2023 [13] proposes a quantization-aware variant in which the LoRA adapters are quantized along with the base weights at deployment.
AQLM (Additive Quantization of Language Models) by Egiazarian et al. 2024 [14] uses additive quantization to reach 2-bit per weight while preserving most of the quality of 4-bit methods.
HQQ (Half-Quadratic Quantization) by Mobius Labs [15] is a calibration-free quantization scheme that produces high-quality 4-bit and 2-bit weights without requiring a calibration dataset.
FSDP+QLoRA by Answer.AI [9] combines QLoRA with PyTorch Fully Sharded Data Parallel. A 70B model occupies about 140 GB in bf16 but roughly 35 GB after 4-bit quantization, which can be sharded across two consumer 24 GB GPUs for training.

impact

QLoRA significantly broadened access to LLM fine-tuning. Before its publication, fine-tuning a 30B-plus model required institutional clusters with tens of GPUs. After QLoRA, a single 48 GB or 80 GB data-center card was sufficient for 65B, and with FSDP+QLoRA two consumer cards became enough for 70B. Within months of release, the open-source community produced thousands of QLoRA fine-tunes on Hugging Face Hub, spanning coding assistants, multilingual chatbots, role-play models, domain-specific medical and legal models, and many more, built on Llama 2, Mistral, Llama 3, Qwen, Phi, and Gemma base models.

The technique also reshaped the economics of community model releases. Many of the most popular fine-tuned models on Hugging Face Hub through 2023 and 2024 (the Nous Research family, OpenChat, OpenHermes, WizardLM, Dolphin, Zephyr, MythoMax, and dozens of role-play merges) were built end-to-end on QLoRA pipelines that ran on rented A100 or A6000 cards. The combination of low cost and parameter-efficient adapters that ship as small files alongside the base model also enabled the practice of releasing multiple specialized adapters per base, a workflow popularized by adapter-marketplace style sites and by the adapter-transformers and PEFT integrations.

The QLoRA paper accumulated several thousand citations within its first year and is a standard baseline in efficient LLM training papers. The bitsandbytes library that hosts the QLoRA implementation is now a core dependency of the Hugging Face training stack and is maintained under the bitsandbytes-foundation umbrella with multi-vendor backing. The public release of the Guanaco weights helped establish norms around reproducibility in the LLM fine-tuning literature.

The technique also influenced industrial fine-tuning pipelines. Cloud providers such as AWS SageMaker, Google Vertex AI, and Azure Machine Learning added managed QLoRA training paths. Together AI, Anyscale, Replicate, Modal, and Runpod all offer QLoRA-based fine-tuning as a primitive in their developer-facing platforms, often at hourly rates measured in single dollars for 7B and 13B models. Apple's on-device foundation model team adapted QLoRA-style 4-bit storage for the personalization adapters distributed with Apple Intelligence, and Microsoft's Olive optimization toolkit ships a QLoRA configuration as a default recipe for Phi family models.

comparison with alternatives

The table below summarizes how QLoRA compares to other adaptation strategies on the dimensions that matter most for practitioners: trainable parameter share, base model storage, inference cost, and reported quality relative to full fine-tuning.

Method	Trainable params	Base storage	Inference overhead	Quality vs full FT	Typical use
Full fine-tuning	100%	FP16 or BF16	None	Reference	Frontier model post-training
LoRA (FP16 base)	0.01% to 1%	FP16 base + small adapters	None after merge	Within 1% on most tasks	Standard PEFT for mid-size models
QLoRA	0.01% to 1%	NF4 base + bf16 adapters	20% to 40% if not merged	Within 1% of LoRA	Single-GPU fine-tuning of 30B-plus models
DoRA	0.01% to 1%	FP16 or NF4 base	Similar to LoRA	Slight gains over LoRA	Quality-sensitive instruction tuning
Prefix tuning	0.01% to 0.1%	FP16 base	Small per-token cost	Below LoRA in many tasks	Lightweight task adaptation
Prompt tuning	0.001% to 0.01%	FP16 base	Negligible	Below LoRA at small scale	Soft-prompt adaptation
Adapter modules	0.1% to 1%	FP16 base + adapters	Small per-layer cost	Comparable to LoRA	Multi-task adaptation
GPTQ only	0% (no fine-tune)	4-bit base	None	Inference only	Quantized deployment
GaLore	100%	FP16 or BF16	None	Comparable to full FT	Memory-efficient full fine-tuning
FSDP+QLoRA	0.01% to 1%	NF4 base sharded	Same as QLoRA	Same as QLoRA	Multi-GPU 70B-plus on consumer cards

QLoRA is the method of choice when memory constraints make 16-bit storage of the base model infeasible. For models smaller than about 13B, BF16 LoRA is often preferred because it avoids dequantization overhead and is trivially mergeable. For 30B and larger models, QLoRA dominates the cost-quality trade-off.

practical workflow

A typical end-to-end QLoRA fine-tuning workflow proceeds in four stages.

Data preparation. Collect or curate an instruction dataset. Following the QLoRA paper's findings, smaller and higher-quality datasets such as OASST1 or LIMA generally outperform large noisy collections for chat fine-tuning, while broad task collections such as FLAN v2 are stronger for academic benchmarks. Most practitioners now use the chat-template format introduced with Llama 2 and standardized in the Hugging Face tokenizer API.
Quantized model load. Pass BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_use_double_quant=True, bnb_4bit_compute_dtype=torch.bfloat16) as the quantization_config argument of AutoModelForCausalLM.from_pretrained to load the base model in 4-bit. Wrap the result with prepare_model_for_kbit_training, which freezes the base parameters, upcasts the remaining 16-bit parameters such as LayerNorm weights to FP32, and enables gradient checkpointing by default. Set model.config.use_cache = False as well, because the cache used for autoregressive decoding is incompatible with gradient checkpointing.
Adapter attachment. Build a LoraConfig with target_modules="all-linear", r=64, lora_alpha=16, and lora_dropout=0.05, then call get_peft_model, which freezes everything except the LoRA matrices. The resulting model exposes only the adapters as trainable parameters; at this rank on all linear layers they amount to roughly one to three percent of the base parameter count, and far less at lower ranks or with fewer target modules.
Training and merging. Train with transformers.Trainer or trl.SFTTrainer, using a paged AdamW optimizer, gradient checkpointing, and bf16 mixed precision. After training, save the adapter on its own. At r=64 on all linear layers it is a few hundred megabytes for a 7B base and on the order of a few gigabytes for a 65B base, still far smaller than the base model. To deploy, either load the adapter on top of the 4-bit base, or merge it into a 16-bit copy of the base and re-quantize with GPTQ, AWQ, or HQQ.
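The load and attach stages can be sketched as follows. This is a minimal illustration of the settings named above, not a reference implementation: the function names qlora_settings and build_qlora_model are hypothetical, model_id stands for whichever checkpoint is being tuned, and the heavy imports sit inside the builder so the settings can be inspected on a machine without a GPU. Running build_qlora_model assumes transformers, peft, bitsandbytes, and accelerate are installed and a CUDA device is available.

```python
def qlora_settings():
    """Plain-dict view of the settings named in the four stages above."""
    return {
        "quantization": {
            "load_in_4bit": True,
            "bnb_4bit_quant_type": "nf4",
            "bnb_4bit_use_double_quant": True,
            "bnb_4bit_compute_dtype": "bfloat16",  # resolved to torch.bfloat16 below
        },
        "lora": {
            "r": 64,
            "lora_alpha": 16,
            "lora_dropout": 0.05,
            "target_modules": "all-linear",
            "task_type": "CAUSAL_LM",
        },
        "training": {
            "optim": "paged_adamw_32bit",
            "bf16": True,
            "gradient_checkpointing": True,
        },
    }


def build_qlora_model(model_id):
    """Load a 4-bit base model and attach LoRA adapters (hypothetical helper)."""
    # Imported lazily so qlora_settings() works without these packages.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    cfg = qlora_settings()
    quant = dict(cfg["quantization"])
    quant["bnb_4bit_compute_dtype"] = getattr(torch, quant["bnb_4bit_compute_dtype"])

    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=BitsAndBytesConfig(**quant),
        device_map="auto",
    )
    # The decoding cache is not used in training and conflicts with checkpointing.
    model.config.use_cache = False
    model = prepare_model_for_kbit_training(model)
    model = get_peft_model(model, LoraConfig(**cfg["lora"]))
    model.print_trainable_parameters()
    return model
```

The "training" entries are meant to be passed to transformers.TrainingArguments together with an output directory, batch size, and learning rate, which are left out here because they depend on the dataset and hardware.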

A single training run for a 7B Llama variant on 50,000 instruction examples typically completes in 2 to 4 hours on a single A100, costs less than 10 US dollars on rental cloud GPUs, and produces an adapter file that fits on a USB stick.
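The size of such an adapter follows directly from the layer shapes. The sketch below counts LoRA parameters for a 7B Llama-style model at r=64 with adapters on all seven linear projections of each transformer block. The shapes (hidden size 4096, MLP width 11008, 32 layers, no grouped-query attention) are those of the original LLaMA 7B and are an assumption; other 7B variants differ slightly.

```python
def lora_param_count(hidden, intermediate, layers, r):
    """LoRA parameters when every linear projection in each block gets an adapter."""
    # q, k, v, o projections map hidden -> hidden: A is (r x hidden), B is (hidden x r).
    attn = 4 * r * (hidden + hidden)
    # gate, up, and down projections map between the hidden and intermediate widths.
    mlp = 3 * r * (hidden + intermediate)
    return layers * (attn + mlp)


# Assumed shapes of the original LLaMA 7B architecture.
params_7b = lora_param_count(hidden=4096, intermediate=11008, layers=32, r=64)

print(f"LoRA parameters: {params_7b:,}")                 # 159,907,840
print(f"bf16 adapter:  {params_7b * 2 / 1e6:.0f} MB")
print(f"fp32 adapter:  {params_7b * 4 / 1e6:.0f} MB")
```

Either way the adapter stays in the hundreds of megabytes, compared with roughly 13 GB for the 16-bit base weights.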

common pitfalls

Several configuration mistakes recur in community QLoRA pipelines.

Mismatched compute dtype. Setting bnb_4bit_compute_dtype=torch.float32, or leaving it at that default, gives up much of the speed benefit of 4-bit training because the dequantized matrix multiplications run in FP32. The recommended setting is torch.bfloat16 on Ampere or newer hardware, or torch.float16 on older cards such as Turing that lack BF16 support.
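Cache enabled during training. The KV cache used for inference is incompatible with gradient checkpointing and serves no purpose in a training forward pass. Many Transformers model classes detect the conflict and override use_cache with a warning, but setting model.config.use_cache = False explicitly before training avoids relying on that behavior and on wasted memory where it is absent. Re-enable the cache before generating from the trained model.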
Wrong target modules. Restricting LoRA to q_proj and v_proj only, the original LoRA paper convention, leaves a measurable quality gap on instruction tuning compared with targeting all linear layers. The QLoRA paper explicitly recommends all linear projections for parity with full fine-tuning.
Adapter merge into 4-bit base. Calling merge_and_unload directly on a 4-bit quantized model loses precision because the merged weight W + BA has to be re-quantized to 4-bit, which rounds away part of the update. The usual procedure is to load the base in BF16, attach the trained adapter, merge, save, and then re-quantize with GPTQ, AWQ, or HQQ for deployment.
Optimizer mismatch on resume. Resuming a QLoRA training run with a different optimizer (for example, switching from paged_adamw_32bit to paged_adamw_8bit) means the saved moment buffers no longer match the optimizer that loads them, so the accumulated state may be discarded or restored incorrectly. Practitioners should keep the optimizer flag identical between runs.
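The merge pitfall can be illustrated on a single block of weights. The sketch below uses a simple symmetric absmax quantizer with 15 levels as a stand-in for NF4 (an assumption made to keep the example short; NF4 uses non-uniform levels) and a random perturbation as a stand-in for the merged update BA. It compares a high-precision merge with a merge followed by 4-bit re-quantization, for an update that is small relative to the quantization step.

```python
import random


def quantize_block(values, levels=7):
    """Symmetric absmax quantization to integer codes in [-levels, levels]."""
    scale = max(abs(v) for v in values) or 1.0
    codes = [round(v / scale * levels) for v in values]
    return codes, scale


def dequantize_block(codes, scale, levels=7):
    """Map integer codes back to real values using the block scale."""
    return [c * scale / levels for c in codes]


random.seed(0)
block = [random.gauss(0.0, 1.0) for _ in range(64)]

# Frozen base as seen during training: 4-bit codes plus one scale per block.
base_codes, base_scale = quantize_block(block)
base = dequantize_block(base_codes, base_scale)

# Hypothetical merged LoRA update, small relative to the quantization step.
delta = [random.uniform(-1e-3, 1e-3) for _ in block]

# Path 1: merge in high precision, which keeps the update exactly.
merged_hp = [w + d for w, d in zip(base, delta)]

# Path 2: merge, then re-quantize the result back to 4 bits.
merged_codes, merged_scale = quantize_block(merged_hp)
merged_4bit = dequantize_block(merged_codes, merged_scale)

update_norm = sum(d * d for d in delta) ** 0.5
lost = sum((a - b) ** 2 for a, b in zip(merged_4bit, merged_hp)) ** 0.5

print("codes unchanged after merge:", merged_codes == base_codes)
print(f"update norm {update_norm:.5f}, error after re-quantization {lost:.5f}")
```

In this toy setting the integer codes come out identical before and after the merge, so the re-quantized block reflects the update only through a slight change in its scale. Real adapter updates can be larger than the quantization step, in which case part of the update survives, but the rounding error remains.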

references

Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2023). "QLoRA: Efficient Finetuning of Quantized LLMs." Advances in Neural Information Processing Systems 36 (NeurIPS 2023). arXiv:2305.14314. https://arxiv.org/abs/2305.14314
Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2021). "LoRA: Low-Rank Adaptation of Large Language Models." arXiv:2106.09685. https://arxiv.org/abs/2106.09685
bitsandbytes contributors. "bitsandbytes: 8-bit and 4-bit CUDA functions for PyTorch." GitHub repository. https://github.com/bitsandbytes-foundation/bitsandbytes
Hugging Face. "PEFT: Parameter-Efficient Fine-Tuning." Documentation and GitHub repository. https://github.com/huggingface/peft and https://huggingface.co/docs/peft
Dettmers, T., Lewis, M., Belkada, Y., & Zettlemoyer, L. (2022). "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale." NeurIPS 2022. arXiv:2208.07339. https://arxiv.org/abs/2208.07339
Frantar, E., Ashkboos, S., Hoefler, T., & Alistarh, D. (2022). "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers." arXiv:2210.17323. https://arxiv.org/abs/2210.17323
Lin, J., Tang, J., Tang, H., Yang, S., Dang, X., & Han, S. (2023). "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration." arXiv:2306.00978. https://arxiv.org/abs/2306.00978
Hugging Face. "Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA." Blog post (May 2023). https://huggingface.co/blog/4bit-transformers-bitsandbytes
Howard, J., Warner, B., & Turgutlu, K. (2024). "You can now train a 70b language model at home." Answer.AI blog. https://www.answer.ai/posts/2024-03-06-fsdp-qlora.html
Liu, S.-Y., Wang, C.-Y., Yin, H., Molchanov, P., Wang, Y.-C. F., Cheng, K.-T., & Chen, M.-H. (2024). "DoRA: Weight-Decomposed Low-Rank Adaptation." ICML 2024. arXiv:2402.09353. https://arxiv.org/abs/2402.09353
Li, Y., Yu, Y., Liang, C., He, P., Karampatziakis, N., Chen, W., & Zhao, T. (2023). "LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models." arXiv:2310.08659. https://arxiv.org/abs/2310.08659
Zhao, J., Zhang, Z., Chen, B., Wang, Z., Anandkumar, A., & Tian, Y. (2024). "GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection." ICML 2024. arXiv:2403.03507. https://arxiv.org/abs/2403.03507
Xu, Y., Xie, L., Gu, X., Chen, X., Chang, H., Zhang, H., Chen, Z., Zhang, X., & Tian, Q. (2023). "QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models." arXiv:2309.14717. https://arxiv.org/abs/2309.14717
Egiazarian, V., Panferov, A., Kuznedelev, D., Frantar, E., Babenko, A., & Alistarh, D. (2024). "Extreme Compression of Large Language Models via Additive Quantization." ICML 2024. arXiv:2401.06118. https://arxiv.org/abs/2401.06118
Mobius Labs. "HQQ: Half-Quadratic Quantization of Large Machine Learning Models." Blog and GitHub repository. https://mobiusml.github.io/hqq_blog/ and https://github.com/mobiusml/hqq
Dettmers, T., Lewis, M., Shleifer, S., & Zettlemoyer, L. (2022). "8-bit Optimizers via Block-wise Quantization." ICLR 2022. arXiv:2110.02861. https://arxiv.org/abs/2110.02861
Raschka, S. (2023). "Practical Tips for Finetuning LLMs Using LoRA (Low-Rank Adaptation)." Ahead of AI newsletter. https://magazine.sebastianraschka.com/p/practical-tips-for-finetuning-llms
Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., de Laroussilhe, Q., Gesmundo, A., Attariyan, M., & Gelly, S. (2019). "Parameter-Efficient Transfer Learning for NLP." ICML 2019. arXiv:1902.00751. https://arxiv.org/abs/1902.00751
Unsloth contributors. "Unsloth: 2x faster QLoRA fine-tuning with custom Triton kernels." GitHub repository. https://github.com/unslothai/unsloth
Axolotl contributors. "Axolotl: open-source LLM fine-tuning." GitHub repository. https://github.com/axolotl-ai-cloud/axolotl
Conover, M., Hayes, M., Mathur, A., Xie, J., Wan, J., Shah, S., Ghodsi, A., Wendell, P., Zaharia, M., & Xin, R. (2023). "Free Dolly: Introducing the World's First Truly Open Instruction-Tuned LLM." Databricks blog. https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm.html
NeurIPS. "QLoRA: Efficient Finetuning of Quantized LLMs (Oral)." NeurIPS 2023 Oral presentation. https://neurips.cc/virtual/2023/oral/73855
