NormalFloat 4-bit (NF4)

Updated Jun 25, 2026

NormalFloat 4-bit (NF4) is a 4-bit numerical data type for storing the weights of deep neural networks, introduced in the 2023 QLoRA paper by Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer of the University of Washington as "a new data type that is information theoretically optimal for normally distributed weights."^[1] Instead of spacing its sixteen representable levels uniformly (like INT4) or by a fixed exponent/mantissa split (like FP4), NF4 places its quantization levels at quantiles of the standard normal distribution so that each of the 16 bins captures an equal expected probability mass for zero-mean Gaussian weights.^[1]^[2] Paired with two companion techniques from the same paper, Double Quantization and Paged Optimizers, NF4 reduced memory enough "to finetune a 65B parameter model on a single 48GB GPU while preserving full 16-bit finetuning task performance," and the resulting Guanaco models reached 99.3% of ChatGPT's score on the Vicuna benchmark.^[1] NF4 is implemented in the bitsandbytes library and exposed through the Hugging Face Transformers BitsAndBytesConfig API, where it is the recommended 4-bit type for QLoRA-style fine-tuning and for memory-constrained inference.^[2]^[3]^[4]


Type	4-bit numerical data type for weight quantization
Bits per value	4 (16 codepoints)
Codebook	Quantiles of N(0, 1), normalized to [-1, 1]
Default block size	64 weights per absmax block^[1]
Introduced by	Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer^[1]
First publication	arXiv 2305.14314, 23 May 2023^[1]
Venue	NeurIPS 2023 (oral)^[7]
Reference implementation	`bitsandbytes` (MIT license)^[4]^[10]
Hugging Face API	`BitsAndBytesConfig(bnb_4bit_quant_type="nf4")`^[3]

What is NF4 (4-bit NormalFloat)?

NF4, short for NormalFloat 4-bit, is a 4-bit numerical data type for representing the weights of deep neural networks. It was introduced by Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer of the University of Washington in the 2023 paper "QLoRA: Efficient Finetuning of Quantized LLMs," where it is presented as an information-theoretically optimal code for tensors whose values are drawn from a zero-mean normal distribution.^[1] Rather than spacing its sixteen representable levels uniformly (as INT4 does) or using a fixed exponent/mantissa split (as FP4 does), NF4 places its quantization levels at quantiles of the standard normal distribution so that each bin captures an equal expected probability mass.^[1]^[2] Combined with two other innovations from the same paper, Double Quantization and Paged Optimizers, NF4 made it possible to fine-tune a 65 billion parameter LLaMA model on a single 48 GB GPU while preserving the task performance of full 16-bit fine-tuning.^[1] NF4 is implemented in the bitsandbytes library and exposed through the Hugging Face Transformers BitsAndBytesConfig API, where it is the recommended 4-bit type for QLoRA-style training and for memory-constrained inference; it must be selected explicitly, because bnb_4bit_quant_type defaults to "fp4".^[2]^[3]^[4]

Background

Why quantize neural network weights?

Quantization reduces the bit-width used to store and compute with model parameters, trading a small amount of representational fidelity for large reductions in memory and bandwidth. For very large transformers, the weight matrices dominate memory usage, so post-training and during-training quantization of weights has become central to fitting modern models onto commodity hardware.^[2]^[5] Two simple 4-bit alternatives exist as baselines. Integer 4-bit (INT4) places sixteen evenly spaced levels across the dynamic range of a block of values, and floating point 4-bit (FP4) splits the four bits into a sign, an exponent, and a mantissa, giving non-uniform spacing biased toward small magnitudes.^[1]^[5]

Neither baseline directly exploits the distributional shape of pre-trained network weights. Empirically, transformer weights are well approximated by zero-centred Gaussians with low standard deviation, with most of the mass close to zero and few weights far in the tails.^[1]^[2] A quantization grid that ignores this structure wastes capacity on rarely used regions and under-resolves the dense region near zero, where most of the actual values land.
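The two baseline layouts described above can be enumerated in a few lines. This is only an illustrative sketch: the INT4 grid is idealized as sixteen evenly spaced points on [-1, 1], and the FP4 decoder assumes an E2M1 layout (1 sign bit, 2 exponent bits with bias 1, 1 mantissa bit); the FP4 table actually used by bitsandbytes may differ.

```python
def int4_levels():
    """Idealized INT4 grid: sixteen evenly spaced levels on [-1, 1]."""
    return [-1 + 2 * i / 15 for i in range(16)]


def fp4_e2m1_levels():
    """Decode all sixteen bit patterns of an assumed E2M1 float (bias 1)."""
    values = []
    for bits in range(16):
        sign = -1.0 if bits & 0b1000 else 1.0
        exponent = (bits >> 1) & 0b11
        mantissa = bits & 0b1
        if exponent == 0:
            # Subnormal: no implicit leading one.
            magnitude = mantissa * 0.5
        else:
            # Normal: implicit leading one, exponent bias of 1.
            magnitude = (1 + mantissa / 2) * 2.0 ** (exponent - 1)
        values.append(sign * magnitude)
    return sorted(values)


print("INT4:", [round(v, 3) for v in int4_levels()])
print("FP4 magnitudes:", sorted({abs(v) for v in fp4_e2m1_levels()}))
```

The INT4 gaps are all identical, while the FP4 magnitudes step by 0.5 near zero and by 2 at the top of the range; neither spacing is derived from the distribution of the weights.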

Quantile quantization and Dettmers' earlier work

The idea of using sample or theoretical quantiles to lay out quantization bins predates NF4. In "8-bit Optimizers via Block-wise Quantization" (2021), Dettmers and collaborators used quantile-based dynamic codes together with block-wise normalization to compress optimizer states (Adam moments) from 32 bits to 8 bits without losing convergence quality.^[6] That paper established several primitives reused by QLoRA: a fixed 256-entry lookup table of quantile values, blockwise absmax normalization that bounds the dynamic range per block, and CUDA kernels that dequantize on the fly during matmul.^[6] NF4 can be read as a 4-bit specialization of that machinery, with the empirical eCDF replaced by the theoretical CDF of N(0, 1).^[1]^[6]

When was NF4 introduced (the QLoRA paper)?

QLoRA was first posted to arXiv as 2305.14314 on 23 May 2023.^[1] The paper combines three memory-saving techniques (NF4, Double Quantization, Paged Optimizers) with LoRA low-rank adapters to enable backpropagation through a frozen, 4-bit-quantized base model.^[1] It was accepted as an oral presentation at NeurIPS 2023.^[7] Alongside the paper, the authors released the Guanaco family of QLoRA-tuned chat models at 7B, 13B, 33B, and 65B parameters, trained on the OASST1 instruction dataset.^[8] The reference implementation depends on bitsandbytes for the underlying CUDA kernels that materialize NF4 weights and dequantize them at compute time.^[8]

The QLoRA paper's central claim is that a careful combination of these three techniques makes the gap between 16-bit fine-tuning and 4-bit fine-tuning vanishingly small while shrinking memory requirements by roughly a factor of four. The authors back this claim by fine-tuning over a thousand model and dataset combinations spanning the LLaMA, T5, and Pythia families and instruction datasets including Alpaca, OASST1, FLAN, Self-Instruct, and Chip2.^[1] On the MMLU benchmark, the paper reports that NF4 plus Double Quantization fully recovers the 16-bit LoRA score, while pure INT4 lags substantially.^[1] On the Vicuna pairwise chat benchmark, the abstract states Guanaco "outperforms all previous openly released models on the Vicuna benchmark, reaching 99.3% of the performance level of ChatGPT while only requiring 24 hours of finetuning on a single GPU."^[1] Within the paper, Guanaco 65B reaches that 99.3% figure and Guanaco 33B reaches 97.8%, with the 33B model trainable in under 12 hours on a single 24 GB consumer GPU.^[1]^[8]

How does NF4 work?

Information-theoretic motivation

A scalar quantizer is information-theoretically optimal for a given source when each of its bins carries an equal expected probability mass under that source. In that case the quantizer's output entropy reaches its maximum of log2(n) bits for n bins (4 bits for the sixteen bins of a 4-bit code), and (under the assumption of independent and identically distributed inputs) it minimizes the rate-distortion gap for a fixed budget.^[1]^[9] NF4 takes this as its design principle: assuming a block of weights is drawn from N(0, sigma^2) and then scaled into [-1, 1] by dividing by the block's absolute maximum, each of the sixteen NF4 codepoints is placed at the midpoint of an equal-mass quantile bin of N(0, 1).^[1] The bitsandbytes reference implementation states the principle directly, describing LinearNF4 as "a quantization data type where each bin has equal area under a standard normal distribution N(0, 1) that is normalized into the range [-1, 1]."^[13]
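The equal-mass argument can be checked numerically. The sketch below computes the probability that a Gaussian value rounds to each level of a grid and the entropy of the resulting 4-bit index. Two things are assumptions made for illustration: the NF4 levels are the rounded values quoted in this article, and SIGMA = 0.54 is a hand-picked standard deviation for the normalized weights, not a value taken from the paper.

```python
from math import log2
from statistics import NormalDist

NF4_LEVELS = [-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
              0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0]
UNIFORM_LEVELS = [-1 + 2 * i / 15 for i in range(16)]
SIGMA = 0.54  # assumed spread of the normalized weights (illustrative only)


def bin_masses(levels, sigma):
    """Probability that N(0, sigma^2) rounds to each level (nearest neighbour)."""
    dist = NormalDist(0.0, sigma)
    levels = sorted(levels)
    edges = [(a + b) / 2 for a, b in zip(levels, levels[1:])]
    cdf = [0.0] + [dist.cdf(e) for e in edges] + [1.0]
    return [hi - lo for lo, hi in zip(cdf, cdf[1:])]


def entropy_bits(masses):
    """Shannon entropy of the index distribution, in bits."""
    return -sum(p * log2(p) for p in masses if p > 0)


for name, levels in (("NF4", NF4_LEVELS), ("uniform", UNIFORM_LEVELS)):
    masses = bin_masses(levels, SIGMA)
    print(f"{name:8s} entropy = {entropy_bits(masses):.3f} bits "
          f"(min mass {min(masses):.3f}, max mass {max(masses):.3f})")
```

Under these assumptions the NF4 grid should come out close to the 4-bit ceiling, with the uniform grid somewhat lower because its central bins are overfull and its outer bins nearly empty.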

What are the 16 NF4 levels?

NF4 uses asymmetric quantiles so that an exact zero is representable, which is important for masking and padding operations whose semantic correctness depends on exact zeros. The QLoRA paper computes one set of 2^(k-1) negative quantile midpoints and a second set of 2^(k-1)+1 non-negative midpoints, then unifies the two sets and removes the duplicated zero, yielding 2^k = 16 unique values.^[1] More concretely, the construction proceeds as follows. For a k-bit code with 2^k bins, an idealized quantile quantizer would place codepoint q_i at the midpoint of the i-th equal-probability bin under the source distribution, computed as the average of the (i / 2^k)-th and ((i+1) / 2^k)-th quantiles. This formula is symmetric and therefore does not in general place a codepoint exactly at zero. NF4 enforces zero by computing one side of the code with 2^(k-1) bins and the other with 2^(k-1)+1 bins, then merging the two halves; the resulting 17-entry set has zero in both halves, which is deduplicated to 16.^[1]

The reference implementation in bitsandbytes calls scipy.stats.norm.ppf (the inverse normal CDF) at a sequence of evenly spaced quantile boundaries, with an empirical offset parameter that controls how much probability mass the outermost bins cover before the values are normalized so that the extremes map to -1 and +1.^[10] The sixteen sorted NF4 levels stored in the library lookup table are:^[10]

-1.0000, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
 0.0796,  0.1609,  0.2461,  0.3379,  0.4407,  0.5626,  0.7230, 1.0

(values rounded to four decimals for display; the exact constants are stored at single precision in functional.py).^[10]

These levels are denser near zero and sparser in the tails, in contrast to INT4's uniform layout. Because the codepoints are fixed, NF4 does not require per-block calibration of bin locations; only the per-block scale factor (the absolute maximum of each block) must be stored alongside the quantized indices.^[1]
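The construction can be approximated with the standard library alone. The sketch below follows one reading of create_normal_map: evaluate the inverse normal CDF at evenly spaced probabilities between an offset and 0.5, using one more step on the positive side than on the negative side, then rescale so the extremes land on -1 and +1. The offset value 0.9677083, the helper names, and the use of statistics.NormalDist in place of scipy.stats.norm.ppf are assumptions of this sketch rather than details confirmed by the source.

```python
from statistics import NormalDist


def linspace(start, stop, num):
    """Evenly spaced values from start to stop, inclusive."""
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


def build_nf4(offset=0.9677083):
    """Approximate the 16 NF4 levels from normal quantiles (assumed offset)."""
    ppf = NormalDist().inv_cdf  # inverse CDF of N(0, 1)
    # Positive side: 9 probabilities down to 0.5, dropping 0.5 itself -> 8 levels.
    positive = [ppf(p) for p in linspace(offset, 0.5, 9)[:-1]]
    # Negative side: 8 probabilities down to 0.5, dropping 0.5 itself -> 7 levels.
    negative = [-ppf(p) for p in linspace(offset, 0.5, 8)[:-1]]
    # A single shared zero joins the two halves, giving 7 + 1 + 8 = 16 levels.
    levels = sorted(negative + [0.0] + positive)
    scale = max(levels)
    return [v / scale for v in levels]


for level in build_nf4():
    print(f"{level:+.4f}")
```

If the assumed offset matches the library, the printed values should agree with the table above to about three decimal places; a different offset changes how much probability mass the two outermost levels cover.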

Block-wise normalization

NF4 is always applied block-wise rather than tensor-wide. In QLoRA the weight tensor is split into contiguous blocks of 64 elements; each block is divided by its absolute maximum to bring values into [-1, 1] and then mapped to the nearest of the sixteen NF4 codepoints.^[1] Each block therefore costs 64 indices x 4 bits = 32 bytes for the data, plus one 32-bit scale factor for the block, for an effective storage cost of 4 + 32/64 = 4.5 bits per parameter before any further compression.^[1]

Block size is a deliberate trade-off. Small blocks (like 64) keep the per-block dynamic range tight so that the fixed NF4 grid sees something close to a normalized normal distribution, but they require many scale factors. Larger blocks save metadata but spread the normalized distribution further from a clean N(0, 1) shape, which weakens the information-theoretic argument for NF4.^[11]
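The metadata side of this trade-off is simple arithmetic. The helper below (its name is illustrative) generalizes the 4 + 32/64 calculation to other block sizes, assuming one 32-bit scale per block and no further compression of the scales.

```python
def bits_per_parameter(block_size, index_bits=4, scale_bits=32):
    """Effective storage cost: 4-bit index plus an amortized per-block scale."""
    return index_bits + scale_bits / block_size


for block_size in (32, 64, 128, 256, 4096):
    print(f"block size {block_size:5d}: {bits_per_parameter(block_size):.4f} bits/param")
```

At block size 64 this gives the 4.5 bits quoted above; at 4096 the overhead drops below 0.01 bits, but each block then spans a much wider dynamic range.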

How are NF4 weights encoded and decoded?

To encode a tensor W in NF4, the bitsandbytes kernel performs the following steps for each block of 64 weights:

1. Compute s = max(abs(block)), the per-block absolute maximum, and store it as a 32-bit float.^[1]^[10]
2. Compute the normalized block block / s, which lies in [-1, 1].
3. For each normalized value, find the index i in [0, 15] of the closest NF4 codepoint by binary search over the sorted 16-entry lookup table.^[10]
4. Pack two 4-bit indices per output byte and append them to the quantized buffer.^[13]

Decoding reverses these steps. The kernel reads a packed 4-bit index, looks up the corresponding NF4 codepoint c[i] from the table, multiplies by the stored block scale s, and emits a 16-bit float s * c[i] into the dequantized buffer. This dequantized buffer is then used as one operand in a 16-bit matrix multiplication.^[13] Because the lookup table is small (16 entries, 256 with padding) it stays in registers or shared memory, and the dominant cost is the memory traffic for the 4-bit indices and the per-block scales rather than the arithmetic.^[6]^[13]
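The encode and decode steps can be mirrored in plain Python for a single block. This is a readability sketch, not the CUDA kernel: the codebook uses the rounded values quoted in this article, the scale is held as an ordinary Python float rather than a 32-bit float, and placing the first index of each pair in the high nibble is an assumption about the packing order.

```python
from bisect import bisect_left

NF4_LEVELS = [-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
              0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0]
# Decision boundaries: midpoints between neighbouring codepoints.
EDGES = [(a + b) / 2 for a, b in zip(NF4_LEVELS, NF4_LEVELS[1:])]


def quantize_block(block):
    """Encode one block (even length) as (scale, packed 4-bit indices)."""
    scale = max(abs(w) for w in block) or 1.0  # guard against an all-zero block
    # Binary search over the boundaries yields the nearest codepoint's index.
    indices = [bisect_left(EDGES, w / scale) for w in block]
    # Two indices per byte; high nibble first (assumed order).
    packed = bytes((hi << 4) | lo for hi, lo in zip(indices[0::2], indices[1::2]))
    return scale, packed


def dequantize_block(scale, packed):
    """Decode packed indices back to approximate weights."""
    out = []
    for byte in packed:
        out.append(scale * NF4_LEVELS[byte >> 4])
        out.append(scale * NF4_LEVELS[byte & 0x0F])
    return out


block = [0.0, 0.5, -1.0, 0.25] * 16          # 64 weights, absmax = 1.0
scale, packed = quantize_block(block)
restored = dequantize_block(scale, packed)
print(len(packed), "bytes for", len(block), "weights; first four:", restored[:4])
```

Exact zeros and the block's extreme value survive the round trip unchanged, while every other weight snaps to its nearest codepoint scaled by s.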

What is Double Quantization?

The scale factors themselves are stored as 32-bit floats, which (at one per 64 weights) costs 0.5 bits per parameter in metadata. QLoRA introduces Double Quantization (DQ), which quantizes those FP32 scales again into 8-bit floats using a second-level block size of 256 scales per super-block.^[1] The paper describes this as reducing the overhead "from 32/64 = 0.5 bits, to 8/64 + 32/(64 x 256) = 0.127 bits, a reduction of 0.373 bits per parameter," or roughly 3 GB on a 65B-parameter LLaMA model.^[1] DQ is a lossy compression of the metadata, but the QLoRA paper reports that perplexity on language modeling benchmarks is essentially unchanged with DQ enabled, matching or improving on plain NF4.^[1]
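The overhead figures quoted from the paper can be reproduced directly. The function names below are illustrative, and the 65-billion parameter count is treated as a round number.

```python
def scale_overhead_bits(block_size=64, scale_bits=32):
    """Metadata cost per parameter with one FP32 scale per block."""
    return scale_bits / block_size


def double_quant_overhead_bits(block_size=64, super_block=256,
                               quantized_scale_bits=8, super_scale_bits=32):
    """Metadata cost per parameter with 8-bit scales plus one FP32 scale per super-block."""
    return (quantized_scale_bits / block_size
            + super_scale_bits / (block_size * super_block))


plain = scale_overhead_bits()
dq = double_quant_overhead_bits()
saved_bits = plain - dq
saved_gb = saved_bits * 65e9 / 8 / 1e9  # bits -> bytes -> GB for 65B parameters

print(f"plain: {plain:.3f} bits, DQ: {dq:.3f} bits, saved: {saved_bits:.3f} bits")
print(f"saving on a 65B-parameter model: about {saved_gb:.2f} GB")
```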

In bitsandbytes and Hugging Face Transformers, DQ is exposed as the boolean bnb_4bit_use_double_quant flag on BitsAndBytesConfig, and the Transformers documentation recommends enabling it whenever GPU memory is tight (for example, when fine-tuning a 13B model on a 16 GB T4).^[3]

What are Paged Optimizers?

The third innovation packaged with NF4 is the Paged Optimizer. During fine-tuning with long sequences and gradient checkpointing, memory use can spike transiently beyond what the GPU has available, triggering out-of-memory crashes. Paged Optimizers use NVIDIA's unified memory mechanism to let optimizer state pages migrate between GPU memory and CPU RAM on demand, so that a transient spike slows a step down instead of crashing the run.^[1] The technique is implemented in bitsandbytes as paged variants of AdamW and other optimizers, and the QLoRA paper reports that it absorbs the gradient-checkpointing spikes that would otherwise prevent 33B and 65B fine-tunes on 24 GB and 48 GB cards respectively.^[1]^[12]

Paged Optimizers are independent of NF4 in principle and can be combined with any optimizer state representation, but they are usually deployed together because the same memory pressure that motivates 4-bit weights also motivates spillable optimizer state.^[1]^[12]

Why is NF4 better than plain 4-bit (FP4 and INT4)?

Three 4-bit data types are commonly compared in the LLM quantization literature: uniform INT4, floating-point FP4, and NF4.^[1]^[5] They differ in how their sixteen codepoints are distributed across the dynamic range.

Type	Codebook layout	Strength	Weakness
INT4	16 uniformly spaced levels	Simple to implement; fast on integer ALUs	Wastes capacity in low-density regions; quantization error grows for distributions concentrated near zero^[1]^[5]
FP4 (E2M1 / E3M0)	Sign + small exponent + small mantissa	Non-uniform spacing biased toward small magnitudes; flexible exponent choice	Mantissa precision near zero is still limited; not matched to N(0, 1)^[1]^[5]
NF4	16 quantile midpoints of N(0, 1) normalized to [-1, 1]	Each bin carries equal probability mass for Gaussian inputs^[1]	Optimality argument is approximate under blockwise absmax normalization^[11]

The QLoRA paper presents a head-to-head comparison of NF4, FP4, and INT4 on language modeling, measured as mean perplexity over Pile-derived evaluation suites with several model families. The figures below are the Pile Common Crawl results reported in the paper:^[1]

Data type	Mean perplexity (lower is better)
Int4	34.34
Float4 (E2M1)	31.07
Float4 (E3M0)	29.48
NFloat4 + Double Quantization	27.41

Numbers reported in the QLoRA paper's data-type ablation.^[1] NF4 with Double Quantization beats both INT4 and FP4 at the same 4-bit budget; INT4's uniform grid pays a substantial penalty on weight matrices that concentrate mass near zero, while FP4 sits in between.^[1] Subsequent independent reviews and tutorials reach the same qualitative ranking, with NF4 preferred for QLoRA fine-tuning (where the quantized base model stays frozen) and FP4 or INT4 sometimes used for inference where slightly faster paths matter more than the last fraction of accuracy.^[2]^[5]

The Hugging Face integration blog summarizes the practical guidance succinctly: use NF4 for higher precision, enable Double Quantization if memory is tight, and use a 16-bit compute dtype (typically bfloat16) for faster matmul.^[2]^[3]

Is NF4 really information-theoretically optimal?

The information-theoretic optimality claim in the QLoRA paper assumes that the values entering the quantizer are i.i.d. samples from N(0, 1) after block scaling. Davis Yoshida pointed out in a June 2023 follow-up (arXiv 2306.06965) that this assumption is violated in practice: dividing a block of n weights by its absolute maximum makes the resulting normalized values dependent (the maximum is forced to +/-1), and the conditional distribution of the remaining n - 1 values depends on the block size.^[11] Yoshida derived an alternative 4-bit code (called AF4) by directly minimizing the expected L1 reconstruction error under the conditional distribution, and showed it marginally outperforms NF4 for large block sizes such as 4096, while the two codes are essentially equivalent at the QLoRA block size of 64.^[11] The paper's title (NF4 Isn't Information Theoretically Optimal (and that's Good)) captures the bottom line: NF4 is not literally optimal under its stated criterion, but the gap that remains is small enough that the simpler quantile-based code is a practical winner for typical block sizes.^[11]
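The dependence on block size can be seen in a small simulation. The sketch below draws Gaussian blocks, normalizes each by its absolute maximum, and measures how often a value lands on one of the two outermost NF4 codes. It is an illustration of the effect described above, not a reproduction of Yoshida's analysis; the codebook edges come from the rounded levels quoted in this article, and the sample sizes and seed are arbitrary.

```python
import random

# Midpoints between the two outermost NF4 levels on each side.
LOW_EDGE = (-1.0 + -0.6962) / 2
HIGH_EDGE = (0.7230 + 1.0) / 2


def extreme_code_fraction(block_size, num_blocks, seed=0):
    """Fraction of absmax-normalized Gaussian weights that round to -1 or +1."""
    rng = random.Random(seed)
    extreme = 0
    for _ in range(num_blocks):
        block = [rng.gauss(0.0, 1.0) for _ in range(block_size)]
        absmax = max(abs(w) for w in block)
        # The element that sets absmax always normalizes to exactly -1 or +1.
        extreme += sum(1 for w in block
                       if w / absmax < LOW_EDGE or w / absmax > HIGH_EDGE)
    return extreme / (block_size * num_blocks)


for block_size, num_blocks in ((64, 256), (4096, 4)):
    fraction = extreme_code_fraction(block_size, num_blocks)
    print(f"block size {block_size:4d}: {fraction:.4f} of weights use an extreme code")
```

If the normalized values really were i.i.d. draws from a fixed distribution, the usage of each code would not depend on the block size. In this simulation the share of extreme codes shrinks as blocks grow, because the absolute maximum of a larger block sits further out in the tail.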

How is NF4 implemented in bitsandbytes?

NF4 is implemented in the open-source bitsandbytes library, originally maintained by Tim Dettmers and now hosted under the bitsandbytes-foundation GitHub organization, which describes itself as providing "accessible large language models via k-bit quantization for PyTorch".^[4]^[10] The library offers three closely related capabilities: 8-bit optimizers built on block-wise quantization (the 2021 paper), LLM.int8() 8-bit inference, and the 4-bit QLoRA path that includes NF4.^[4]^[6] All three share the same block-wise normalization scheme, lookup-table-based dequantization, and CUDA kernel infrastructure.^[4]^[6]

At the Python level the data type is exposed through two classes in bitsandbytes.nn:

Linear4bit is the base 4-bit linear layer; it accepts a quant_type argument with values "nf4" or "fp4", a compute_dtype (typically torch.bfloat16 or torch.float16), a compress_statistics flag for Double Quantization, and a quant_storage dtype.^[13]
LinearNF4 is a thin subclass that fixes quant_type="nf4" and documents the data type as "a quantization data type where each bin has equal area under a standard normal distribution N(0, 1)".^[13]

The NF4 lookup table itself is produced by create_normal_map in bitsandbytes/functional.py, which calls scipy.stats.norm.ppf to obtain quantiles, then normalizes the result into [-1, 1] and pads the 16 values into a 256-entry table so the kernel can use 8-bit integer indices.^[10] The dequantization kernel reads 4-bit indices, looks up the corresponding NF4 codepoint, multiplies by the per-block scale, and feeds the result into a 16-bit matmul.^[13]

The bitsandbytes documentation lists hardware support for NF4 as NVIDIA Pascal (compute capability 6.0) and newer GPUs on CUDA, plus Intel XPU, Intel Gaudi (HPU), and CPU back-ends in more recent releases; the library is MIT licensed.^[4]^[14] The Hugging Face Transformers documentation notes that bitsandbytes is supported on CUDA versions 11.8 through 13.0, with ongoing work to extend coverage to additional accelerators.^[3]

Quantization storage format

Beyond the 4-bit indices and the per-block scales, a bitsandbytes 4-bit checkpoint records the quantization configuration (block size, quant type, whether DQ is enabled, and the compute dtype) so that loading the model from disk reconstructs the layer with the same arithmetic. Linear4bit accepts a quant_storage parameter, defaulting to torch.uint8, which controls the dtype used to physically pack the 4-bit indices into a flat tensor.^[13] Two 4-bit indices fit into a single uint8, so a layer with n weights stores ceil(n/2) packed bytes plus the block metadata.^[13]

How do you use NF4 in Hugging Face Transformers?

NF4 reached the broader open-source ecosystem on 24 May 2023, one day after the QLoRA paper appeared, via a Hugging Face blog post and a Transformers integration co-authored by Younes Belkada, Tim Dettmers, Artidoro Pagnoni, Sylvain Gugger, and Sourab Mangrulkar.^[2] The integration adds two main entry points to the Transformers library:

A load_in_4bit=True shortcut on AutoModelForCausalLM.from_pretrained and similar constructors, which replaces every eligible torch.nn.Linear layer with a bitsandbytes 4-bit layer.^[2]^[3]
A BitsAndBytesConfig dataclass that exposes four QLoRA-specific knobs:

Parameter	Values	Default	Effect
`bnb_4bit_quant_type`	`"nf4"`, `"fp4"`	`"fp4"`	Selects NF4 or FP4 code; NF4 is recommended for QLoRA training^[3]
`bnb_4bit_use_double_quant`	`True`, `False`	`False`	Enables Double Quantization (saves ~0.4 bits per parameter)^[2]^[3]
`bnb_4bit_compute_dtype`	`torch.float16`, `torch.bfloat16`, `torch.float32`	`torch.float32`	Dtype for matmul; bf16 is fastest on Ampere and newer^[3]
`load_in_4bit`	`True`, `False`	`False`	Master switch to replace Linear layers with `Linear4bit`^[3]

A typical NF4 inference configuration looks like this (taken from the Transformers quantization guide):^[3]

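import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Placeholder checkpoint name; substitute any causal LM repository ID.
model_id = "facebook/opt-350m"

# Store weights as NF4, compress the block scales, and run matmuls in bfloat16.
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=nf4_config)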

For QLoRA fine-tuning, this configuration is composed with a PEFT LoRA adapter and a TRL trainer; the base model stays frozen in NF4 while the LoRA matrices remain in 16-bit and receive gradients.^[2]^[15] The Hugging Face PEFT library handles wiring the adapter onto the quantized layers, and Hugging Face TRL provides the SFT and DPO training loops.^[2]

Where is NF4 used? (Adoption)

Since 2023, NF4 has become a standard building block in open-source LLM tooling. Beyond bitsandbytes and Transformers, it appears in:

Axolotl, an LLM fine-tuning framework that exposes QLoRA-style NF4 training as a first-class configuration option, with bitsandbytes under the hood.^[16]
Unsloth, a fine-tuning library that uses NF4 weights together with custom kernels and gradient checkpointing for further speedups; it ships pre-quantized NF4 checkpoints for popular LLaMA, Mistral, and other base models.^[17]
Hugging Face PEFT, which integrates with NF4 layers so that LoRA adapters can be inserted into a quantized base.^[2]^[15]
Diffusion model deployment: the bitsandbytes 4-bit integration was extended to the Diffusers library and is widely used for running FLUX.1 image models in NF4, where it reduces VRAM enough to fit the 12B-parameter transformer of FLUX.1 on consumer 8-16 GB GPUs.^[18]
Pre-quantized model hubs: many checkpoints distributed on Hugging Face Hub now include NF4 variants with the quantization config embedded so that downstream users can load them directly without redoing the quantization.^[3]

NF4 is also covered by benchmark studies and survey papers on low-bit LLM quantization, typically as the reference 4-bit code against which newer learned codebooks are compared.^[5]^[11]

What is NF4 used for? (Applications)

Fine-tuning very large models on a single GPU

The original target application for NF4 is fine-tuning large language models on hardware where the base model cannot fit in 16-bit form. The QLoRA paper demonstrated 65B LLaMA fine-tuning on a single 48 GB GPU, 33B LLaMA fine-tuning on a single 24 GB consumer card in under 12 hours, and 7B/13B fine-tuning on T4-class GPUs.^[1]^[2] The released Guanaco models reached 99.3% of ChatGPT's score on the Vicuna benchmark with 65B parameters and 97.8% with 33B, demonstrating that 4-bit base weights are not the limiting factor for instruction-tuned chat quality.^[1]^[8]
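A back-of-the-envelope estimate shows why the 65B case fits. The sketch below counts only the base weights, at 16 bits and at NF4 with Double Quantization (4 + 8/64 + 32/(64 x 256) bits per parameter). It ignores adapters, activations, optimizer state, and any layers left unquantized, and it treats the model sizes as round parameter counts.

```python
NF4_DQ_BITS = 4 + 8 / 64 + 32 / (64 * 256)  # about 4.127 bits per parameter


def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


for billions in (7, 13, 33, 65):
    params = billions * 1e9
    print(f"{billions:2d}B: 16-bit {weight_memory_gb(params, 16):6.1f} GB, "
          f"NF4+DQ {weight_memory_gb(params, NF4_DQ_BITS):5.1f} GB")
```

For 65B parameters this comes to 130 GB at 16 bits versus roughly 33.5 GB in NF4 with Double Quantization, which leaves headroom on a 48 GB card for everything the estimate leaves out.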

Memory-efficient inference

Even outside training, NF4 is used as a memory-efficient inference format for transformer and diffusion models. The Hugging Face documentation notes that for inference the choice between NF4 and FP4 is less consequential than for training, but the same bitsandbytes infrastructure provides both with identical APIs, so NF4 is often kept as the default to stay consistent with whatever code path produced the weights.^[3] In diffusion image generation, NF4-quantized FLUX models have been reported by community benchmarks to be faster than FP8 at the same VRAM budget, with throughput speedups of roughly 1.3x to 2.5x on 6-8 GB cards.^[18]

Research baselines

In the years since 2023, NF4 has settled into a role as the canonical 4-bit baseline for new quantization research. Subsequent methods (Yoshida's AF4, learned codebooks such as "any4," calibration-based weight quantizers like AWQ and GPTQ applied at 4 bits, and Blackwell-era FP4 quantization-aware training) typically report results against NF4 to demonstrate improvements at equal bit-width.^[5]^[11]

Training dynamics with NF4

In QLoRA-style fine-tuning, the NF4 weights are frozen: gradients are propagated through the dequantized weights to reach the adapters, but the 4-bit indices and their scales are never updated. Trainable parameters live entirely in the 16-bit LoRA adapters that are added in parallel to selected linear layers (commonly the q, k, v, and output projections of attention, sometimes the MLP projections).^[1]^[15] The forward pass dequantizes the frozen NF4 weights and computes (W + BA) x; the backward pass updates only A and B.^[1]^[15]
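The forward computation can be written out for a toy layer. In this sketch W stands for the already dequantized base weight, A and B are the LoRA factors, and plain Python lists stand in for tensors. The scaling argument corresponds to the usual LoRA alpha/r factor, which is an addition of this sketch; with the default of 1.0 the function computes exactly (W + BA) x.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(w_dequant, lora_a, lora_b, x, scaling=1.0):
    """y = W x + scaling * B (A x); W is frozen, only A and B would be trained."""
    base = matvec(w_dequant, x)                   # frozen path
    update = matvec(lora_b, matvec(lora_a, x))    # low-rank trainable path
    return [b + scaling * u for b, u in zip(base, update)]


W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]          # 2 x 3 frozen base weight
A = [[1.0, 1.0, 1.0]]          # rank 1: 1 x 3
B_zero = [[0.0], [0.0]]        # B starts at zero, so the adapter is a no-op
B_trained = [[0.5], [-1.0]]    # hypothetical values after some training
x = [1.0, 2.0, 3.0]

print(lora_forward(W, A, B_zero, x))     # equals W x
print(lora_forward(W, A, B_trained, x))  # base output shifted by B (A x)
```

Computing B (A x) rather than forming BA keeps the extra work proportional to the rank, and the zero-initialized B shows why an untrained adapter leaves the quantized model's outputs unchanged.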

This division of labor has two practical consequences. First, because the adapters are tiny relative to the base model (typically less than 1% of base parameters), the optimizer state is small and the dominant remaining memory costs are the activations kept from the forward pass and their gradients during the backward pass.^[1]^[15] Second, the choice of compute dtype (bnb_4bit_compute_dtype) matters more than the storage dtype for training speed; the Hugging Face documentation and the QLoRA reference implementation use torch.bfloat16 on Ampere or newer hardware to maximize tensor core throughput.^[2]^[3]
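The "less than 1%" figure is easy to sanity-check for a single layer. The dimensions and rank below are illustrative choices, not values taken from the paper.

```python
def lora_param_fraction(d_in, d_out, rank):
    """Adapter parameters (A: rank x d_in, B: d_out x rank) relative to the base weight."""
    return rank * (d_in + d_out) / (d_in * d_out)


fraction = lora_param_fraction(d_in=4096, d_out=4096, rank=16)
print(f"rank-16 adapter on a 4096 x 4096 layer: {fraction:.4%} of the base weights")
```

For this layer the adapter adds 0.78% of the base parameter count, and the fraction scales linearly with the rank.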

Composition with KV cache and activation quantization

NF4 quantizes the static weight matrices of a transformer; it does not by itself quantize activations, attention scores, or the KV cache that grows with the context length. In long-context inference, the KV cache often dominates GPU memory, and NF4-quantized weights can be combined with separate KV cache compression (such as 8-bit or 4-bit quantization of cached keys and values) implemented by other libraries.^[3]^[4] The Hugging Face documentation explicitly notes that the bnb 4-bit integration is intended for weights and uses 16-bit compute, leaving activation-level optimization to complementary techniques.^[3]
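A rough size estimate makes the point concrete. The sketch below computes the 16-bit KV cache size for a hypothetical 32-layer model with 32 key/value heads of dimension 128 (loosely 7B-class dimensions, chosen only for illustration) and compares it with about 4.127 bits per parameter for 7 billion NF4 weights.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Bytes for cached keys and values (factor 2) across all layers."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


nf4_weight_bytes = 7e9 * (4 + 8 / 64 + 32 / (64 * 256)) / 8  # assumed 7B parameters

for seq_len in (4096, 32768):
    cache = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=seq_len)
    print(f"{seq_len:6d} tokens: KV cache {cache / 2**30:5.1f} GiB, "
          f"NF4 weights {nf4_weight_bytes / 2**30:4.1f} GiB")
```

With these assumed dimensions the cache takes 2 GiB at 4,096 tokens and grows linearly with context length, so at long contexts it can exceed the quantized weights several times over.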

What are the limitations of NF4?

NF4 has several documented limitations:

Optimality is approximate. As Yoshida (2023) shows, NF4's optimality argument breaks under blockwise absmax normalization because the resulting normalized values are not i.i.d. samples from N(0, 1).^[11] The practical gap is small at block size 64 but grows with block size, so very-large-block NF4 deployments are sub-optimal.^[11]
Weights only, not activations. NF4 is designed for static weight tensors. It does not address dynamic activation quantization, which has different distributional characteristics (heavy-tailed outliers in particular). Methods such as SmoothQuant tackle activation quantization separately, while AWQ uses activation statistics to guide the quantization of weights.^[5]
Inference speed is bound by dequantization. The standard bitsandbytes NF4 path dequantizes to 16-bit before matmul, so it does not in itself produce a faster compute path than 16-bit weights, only a smaller memory footprint. Speedups on memory-bound workloads such as small-batch decoding come from reduced HBM traffic.^[4]^[13]
Training overhead. QLoRA-style training with NF4 is slower per step than plain LoRA on 16-bit weights, with practitioners reporting roughly 30-40% slowdowns due to dequantization in every forward and backward pass.^[2]^[15]
Hardware support. NF4 requires bitsandbytes kernels; on backends without those kernels (older CUDA versions, exotic accelerators) NF4 either falls back to slow paths or is unavailable.^[4]^[14]

Related Work

NF4 sits at the intersection of several adjacent topics:

Other 4-bit codes: GPTQ quantizes weights to a uniform integer grid and compensates for the rounding error using approximate second-order information; AWQ preserves salient weights identified by activation statistics; FP4 and INT4 are the simplest fixed-grid baselines.^[5]
Adapter methods: NF4 was designed to be used together with LoRA adapters; the broader PEFT toolkit also supports DoRA and other variants over NF4 base models.^[15]
Optimizer compression: Dettmers' earlier work on 8-bit optimizers uses block-wise quantile codes for AdamW state, with NF4 as the 4-bit analogue for weights.^[6]
Quantization in deployment formats: GGUF (used by llama.cpp) defines its own family of 4-bit and 5-bit codes including K-quants; these target CPU and Apple Silicon inference and are not byte-identical to NF4, though they share the underlying idea of non-uniform 4-bit codepoints.^[19]

References

1. Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer, "QLoRA: Efficient Finetuning of Quantized LLMs", arXiv, 2023-05-23. https://arxiv.org/abs/2305.14314. Accessed 2026-06-21.
2. Younes Belkada, Tim Dettmers, Artidoro Pagnoni, Sylvain Gugger, Sourab Mangrulkar, "Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA", Hugging Face Blog, 2023-05-24. https://huggingface.co/blog/4bit-transformers-bitsandbytes. Accessed 2026-06-21.
3. Hugging Face, "Bitsandbytes (Transformers quantization documentation)", Hugging Face, 2024-2026 (continuously updated). https://huggingface.co/docs/transformers/quantization/bitsandbytes. Accessed 2026-06-21.
4. bitsandbytes-foundation, "bitsandbytes: Accessible large language models via k-bit quantization for PyTorch", GitHub, 2024-2026 (continuously updated). https://github.com/bitsandbytes-foundation/bitsandbytes. Accessed 2026-06-21.
5. apxml, "Low-Bit LLM Quantization (INT4, NF4, FP4)", apxml Quantized LLM Deployment course, 2024-2026. https://apxml.com/courses/quantized-llm-deployment/chapter-1-advanced-llm-quantization-fundamentals/low-bit-quantization-techniques. Accessed 2026-06-21.
6. Tim Dettmers, Mike Lewis, Sam Shleifer, Luke Zettlemoyer, "8-bit Optimizers via Block-wise Quantization", arXiv, 2021-10-06. https://arxiv.org/abs/2110.02861. Accessed 2026-06-21.
7. NeurIPS, "QLoRA: Efficient Finetuning of Quantized LLMs (Oral)", NeurIPS 2023 virtual program, 2023-12. https://neurips.cc/virtual/2023/oral/73855. Accessed 2026-06-21.
8. Artidoro Pagnoni et al., "artidoro/qlora (QLoRA reference implementation)", GitHub, 2023-2024. https://github.com/artidoro/qlora. Accessed 2026-06-21.
9. Michael Brenndoerfer, "QLoRA: Efficient Fine-Tuning of Quantized Language Models", mbrenndoerfer.com, 2024. https://mbrenndoerfer.com/writing/qlora-efficient-finetuning-quantized-language-models. Accessed 2026-06-21.
10. bitsandbytes-foundation, "bitsandbytes/bitsandbytes/functional.py (create_normal_map and NF4 lookup table)", GitHub, 2024-2026. https://github.com/bitsandbytes-foundation/bitsandbytes/blob/main/bitsandbytes/functional.py. Accessed 2026-06-21.
11. Davis Yoshida, "NF4 Isn't Information Theoretically Optimal (and that's Good)", arXiv, 2023-06-12. https://arxiv.org/abs/2306.06965. Accessed 2026-06-21.
12. apxml, "Paged Optimizers for Memory Efficiency", apxml LoRA/PEFT course, 2024-2026. https://apxml.com/courses/lora-peft-efficient-llm-training/chapter-4-advanced-lora-variants/qlora-paged-optimizers. Accessed 2026-06-21.
13. Hugging Face, "bitsandbytes: 4-bit quantization (Linear4bit, LinearNF4, LinearFP4 reference)", Hugging Face, 2024-2026. https://huggingface.co/docs/bitsandbytes/en/reference/nn/linear4bit. Accessed 2026-06-21.
14. bitsandbytes-foundation, "bitsandbytes/docs/source/installation.mdx (hardware support matrix)", GitHub, 2024-2026. https://github.com/bitsandbytes-foundation/bitsandbytes/blob/main/docs/source/installation.mdx. Accessed 2026-06-21.
15. PyTorch team, "Finetune LLMs on your own consumer hardware using tools from PyTorch and Hugging Face ecosystem", PyTorch Blog, 2024. https://pytorch.org/blog/finetune-llms/. Accessed 2026-06-21.
16. apxml, "Quantization and its effect on Fine-Tuning (QLoRA)", apxml Introduction to LLM Fine-Tuning course, 2024-2026. https://apxml.com/courses/introduction-to-llm-fine-tuning/chapter-4-parameter-efficient-fine-tuning-peft/quantization-and-qlora. Accessed 2026-06-21.
17. Matteo Sorci, "QLoRA Fine-Tuning with Unsloth: A Complete Guide", Medium, 2024-2026. https://medium.com/@matteo28/qlora-fine-tuning-with-unsloth-a-complete-guide-8652c9c7edb3. Accessed 2026-06-21.
18. Hugging Face, "bitsandbytes (Diffusers quantization documentation)", Hugging Face, 2024-2026. https://huggingface.co/docs/diffusers/en/quantization/bitsandbytes. Accessed 2026-06-21.
19. Hugging Face, "Quantization (Transformers main classes documentation)", Hugging Face, 2024-2026. https://huggingface.co/docs/transformers/en/main_classes/quantization. Accessed 2026-06-21.
