NormalFloat 4-bit (NF4)
Last reviewed
May 21, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,421 words
| Type | 4-bit numerical data type for weight quantization |
| Bits per value | 4 (16 codepoints) |
| Codebook | Quantiles of N(0, 1), normalized to [-1, 1] |
| Default block size | 64 weights per absmax block[1] |
| Introduced by | Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer[1] |
| First publication | arXiv 2305.14314, 23 May 2023[1] |
| Venue | NeurIPS 2023 (oral)[7] |
| Reference implementation | bitsandbytes (MIT license)[4][10] |
| Hugging Face API | BitsAndBytesConfig(bnb_4bit_quant_type="nf4")[3] |
NormalFloat 4-bit, commonly abbreviated NF4, is a 4-bit numerical data type for representing the weights of deep neural networks. It was introduced by Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer of the University of Washington in the 2023 paper "QLoRA: Efficient Finetuning of Quantized LLMs," where it is presented as an information-theoretically optimal code for tensors whose values are drawn from a zero-mean normal distribution.[1] Rather than spacing its sixteen representable levels uniformly (as INT4 does) or using a fixed exponent/mantissa split (as FP4 does), NF4 places its quantization levels at quantiles of the standard normal distribution so that each bin captures an equal expected probability mass.[1][2] Combined with two other innovations from the same paper, Double Quantization and Paged Optimizers, NF4 made it possible to fine-tune a 65 billion parameter LLaMA model on a single 48 GB GPU while preserving the task performance of full 16-bit fine-tuning.[1] NF4 is implemented in the bitsandbytes library and exposed through the Hugging Face Transformers BitsAndBytesConfig API, where it is the recommended 4-bit type for QLoRA-style training and for memory-constrained inference, although the API's own default quant type is FP4.[2][3][4]
Quantization reduces the bit-width used to store and compute with model parameters, trading a small amount of representational fidelity for large reductions in memory and bandwidth. For very large transformers, the weight matrices dominate memory usage, so post-training and during-training quantization of weights has become central to fitting modern models onto commodity hardware.[2][5] Two simple 4-bit alternatives exist as baselines. Integer 4-bit (INT4) places sixteen evenly spaced levels across the dynamic range of a block of values, and floating point 4-bit (FP4) splits the four bits into a sign, an exponent, and a mantissa, giving non-uniform spacing biased toward small magnitudes.[1][5]
Neither baseline directly exploits the distributional shape of pre-trained network weights. Empirically, transformer weights are well approximated by zero-centred Gaussians with low standard deviation, with most of the mass close to zero and few weights far in the tails.[1][2] A quantization grid that ignores this structure wastes capacity on rarely used regions and under-resolves the dense region near zero, where most of the actual values land.
The idea of using sample or theoretical quantiles to lay out quantization bins predates NF4. In "8-bit Optimizers via Block-wise Quantization" (2021), Dettmers and collaborators used quantile-based dynamic codes together with block-wise normalization to compress optimizer states (Adam moments) from 32 bits to 8 bits without losing convergence quality.[6] That paper established several primitives reused by QLoRA: a fixed 256-entry lookup table of quantile values, blockwise absmax normalization that bounds the dynamic range per block, and CUDA kernels that dequantize on the fly during matmul.[6] NF4 can be read as a 4-bit specialization of that machinery, with the empirical eCDF replaced by the theoretical CDF of N(0, 1).[1][6]
QLoRA was first posted to arXiv as 2305.14314 on 23 May 2023.[1] The paper combines three memory-saving techniques (NF4, Double Quantization, Paged Optimizers) with LoRA low-rank adapters to enable backpropagation through a frozen, 4-bit-quantized base model.[1] It was accepted as an oral presentation at NeurIPS 2023.[7] Alongside the paper, the authors released the Guanaco family of QLoRA-tuned chat models at 7B, 13B, 33B, and 65B parameters, trained on the OASST1 instruction dataset.[8] The reference implementation depends on bitsandbytes for the underlying CUDA kernels that materialize NF4 weights and dequantize them at compute time.[8]
The QLoRA paper's central claim is that a careful combination of these three techniques makes the gap between 16-bit fine-tuning and 4-bit fine-tuning vanishingly small while shrinking memory requirements by roughly a factor of four. The authors back this claim by fine-tuning over a thousand model and dataset combinations spanning the LLaMA, T5, and Pythia families and instruction datasets including Alpaca, OASST1, FLAN, Self-Instruct, and Chip2.[1] On the MMLU benchmark, the paper reports that NF4 plus Double Quantization fully recovers the 16-bit LoRA score, while pure INT4 lags substantially.[1] On the Vicuna pairwise chat benchmark, Guanaco 65B reaches 99.3% of ChatGPT's score and Guanaco 33B reaches 97.8%, with the 33B model trainable in under 12 hours on a single consumer 24 GB GPU.[1][8]
A scalar quantizer is information-theoretically optimal for a given source when each of its bins carries an equal expected probability mass under that source. In that case the quantizer's output entropy is maximized at log2(k) for k bins, and (under the assumption of independent and identically distributed inputs) it minimizes the rate-distortion gap for a fixed budget.[1][9] NF4 takes this as its design principle: assuming a block of weights is drawn from N(0, sigma^2) and then scaled into [-1, 1] by dividing by the block's absolute maximum, each of the sixteen NF4 codepoints is placed at the midpoint of an equal-mass quantile of N(0, 1).[1]
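The equal-mass property can be checked numerically. The following is a minimal sketch, not code from the paper: it buckets a standard normal variable into 16 bins in two ways and computes the entropy of the resulting bin index. The helper name `bin_entropy` and the choice of [-3, 3] as the range for the uniform bins are assumptions made for illustration.

```python
import math
from statistics import NormalDist


def bin_entropy(edges, dist=NormalDist()):
    """Entropy (in bits) of the bin index when x ~ dist is bucketed by interior `edges`."""
    cdf = [0.0] + [dist.cdf(e) for e in edges] + [1.0]
    masses = [hi - lo for lo, hi in zip(cdf, cdf[1:])]
    return -sum(p * math.log2(p) for p in masses if p > 0)


k = 16  # number of bins for a 4-bit code
std_normal = NormalDist()

# Equal-mass bins: interior edges sit at the i/k quantiles of N(0, 1).
quantile_edges = [std_normal.inv_cdf(i / k) for i in range(1, k)]

# Equal-width bins on [-3, 3] (illustrative range); the outer bins absorb the tails.
uniform_edges = [-3 + 6 * i / k for i in range(1, k)]

h_quantile = bin_entropy(quantile_edges)
h_uniform = bin_entropy(uniform_edges)
print(f"equal-mass bins:  {h_quantile:.3f} bits (upper bound log2({k}) = {math.log2(k):.3f})")
print(f"equal-width bins: {h_uniform:.3f} bits")
```

The equal-mass layout reaches the log2(k) upper bound by construction, while any layout with unequal bin masses falls below it.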
NF4 uses asymmetric quantiles so that an exact zero is representable, which matters for masking and padding operations whose correctness depends on exact zeros. For a k-bit code with 2^k bins, an idealized quantile quantizer would place codepoint q_i at the midpoint of the i-th equal-probability bin under the source distribution, computed as the average of the (i / 2^k)-th and ((i+1) / 2^k)-th quantiles. That layout is symmetric about zero with an even number of codepoints, so it does not in general place a codepoint exactly at zero. The QLoRA paper instead computes one set of 2^(k-1) quantile midpoints for the negative side and a second set of 2^(k-1)+1 for the non-negative side. Both sets contain zero, so for k = 4 their union has 17 entries; removing the duplicated zero leaves 2^k = 16 unique values.[1]
The reference implementation in bitsandbytes calls scipy.stats.norm.ppf (the inverse normal CDF) at a sequence of evenly spaced quantile boundaries, with an empirical offset parameter that controls how much probability mass the outermost bins cover before the values are normalized so that the extremes map to -1 and +1.[10] The sixteen sorted NF4 levels stored in the library lookup table are:[10]
-1.0000, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0
(values rounded to four decimals for display; the exact constants are stored at single precision in functional.py).[10]
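The construction can be approximated in a few lines. This is a sketch under stated assumptions rather than the library's code: it uses the standard library's `NormalDist().inv_cdf` in place of `scipy.stats.norm.ppf`, the function name `nf4_levels` and the helper `linspace` are hypothetical, and the offset value 0.9677083 is an assumption (the text above only says that an empirical offset exists).

```python
from statistics import NormalDist


def linspace(start, stop, n):
    """n evenly spaced points from start to stop, inclusive."""
    step = (stop - start) / (n - 1)
    return [start + i * step for i in range(n)]


def nf4_levels(offset=0.9677083):
    """Approximate the 16 NF4 codepoints from normal quantiles.

    `offset` is the probability assigned to the outermost quantile;
    the value used here is an assumption, not taken from the article.
    """
    ppf = NormalDist().inv_cdf
    # 8 strictly positive levels (9 points, dropping the final 0.5 -> 0.0).
    positive = [ppf(p) for p in linspace(offset, 0.5, 9)[:-1]]
    # 7 strictly negative levels (8 points, dropping the final 0.5 -> 0.0).
    negative = [-ppf(p) for p in linspace(offset, 0.5, 8)[:-1]]
    levels = sorted(negative + [0.0] + positive)
    scale = max(levels)
    # Normalize so the extremes map to -1 and +1.
    return [v / scale for v in levels]


levels = nf4_levels()
print([round(v, 4) for v in levels])
```

Using one more point on the positive side than on the negative side is what makes the code asymmetric while keeping a single exact zero.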
These levels are denser near zero and sparser in the tails, in contrast to INT4's uniform spacing. Because the codepoints are fixed, NF4 does not require per-block calibration of bin locations; only the per-block scale factor (the absolute maximum of each block) must be stored alongside the quantized indices.[1]
NF4 is always applied block-wise rather than tensor-wide. In QLoRA the weight tensor is split into contiguous blocks of 64 elements; each block is divided by its absolute maximum to bring values into [-1, 1] and then mapped to the nearest of the sixteen NF4 codepoints.[1] Each block therefore costs 64 indices x 4 bits = 32 bytes for the data, plus one 32-bit scale factor for the block, for an effective storage cost of 4 + 32/64 = 4.5 bits per parameter before any further compression.[1]
Block size is a deliberate trade-off. Small blocks (like 64) keep the per-block dynamic range tight so that the fixed NF4 grid sees something close to a normalized normal distribution, but they require many scale factors. Larger blocks save metadata but spread the normalized distribution further from a clean N(0, 1) shape, which weakens the information-theoretic argument for NF4.[11]
To encode a tensor W in NF4, the bitsandbytes kernel performs the following steps for each block of 64 weights:
1. Compute s = max(abs(block)), the per-block absolute maximum, and store it as a 32-bit float.[1][10]
2. Normalize the block to block / s, which lies in [-1, 1].
3. Find the index i in [0, 15] of the closest NF4 codepoint by binary search over the sorted 16-entry lookup table.[10]

Decoding reverses these steps. The kernel reads a packed 4-bit index, looks up the corresponding NF4 codepoint c[i] from the table, multiplies by the stored block scale s, and emits a 16-bit float s * c[i] into the dequantized buffer. This dequantized buffer is then used as one operand in a 16-bit matrix multiplication.[13] Because the lookup table is small (16 entries, 256 with padding) it stays in registers or shared memory, and the dominant cost is the memory traffic for the 4-bit indices and the per-block scales rather than the arithmetic.[6][13]
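The encode and decode steps can be sketched in plain Python. This is an illustrative reference, not the CUDA kernel: the function names are hypothetical, the codepoints are the rounded values listed earlier, and the binary search is done over the midpoints between adjacent codepoints, which is one way to find the nearest level.

```python
from bisect import bisect_left

# Rounded NF4 codepoints as listed in the article (the library stores more digits).
NF4_LEVELS = [-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
              0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0]

# Decision boundaries: a value maps to the level whose side of the midpoint it falls on.
MIDPOINTS = [(a + b) / 2 for a, b in zip(NF4_LEVELS, NF4_LEVELS[1:])]


def quantize_block(block):
    """Return (scale, indices) for one block of weights."""
    scale = max(abs(w) for w in block)          # step 1: per-block absmax
    if scale == 0.0:
        return 0.0, [NF4_LEVELS.index(0.0)] * len(block)
    normalized = [w / scale for w in block]     # step 2: values now lie in [-1, 1]
    # step 3: binary search; counts midpoints below x, which is the nearest level's index
    indices = [bisect_left(MIDPOINTS, x) for x in normalized]
    return scale, indices


def dequantize_block(scale, indices):
    """Look up each codepoint and rescale by the stored block scale."""
    return [scale * NF4_LEVELS[i] for i in indices]


block = [0.03, -0.21, 0.11, -0.42, 0.0, 0.27, -0.08, 0.35]
scale, indices = quantize_block(block)
restored = dequantize_block(scale, indices)
print(scale, indices)
print([round(v, 4) for v in restored])
```

The element with the largest magnitude in each block always maps to -1 or +1 and is therefore reconstructed exactly; every other element lands within half a codepoint gap of its original value, measured in units of the block scale.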
The scale factors themselves are stored as 32-bit floats, which (at one per 64 weights) costs 0.5 bits per parameter in metadata. QLoRA introduces Double Quantization (DQ), which quantizes those FP32 scales again into 8-bit floats using a second-level block size of 256 scales per super-block.[1] After DQ the per-parameter overhead drops from 32/64 = 0.5 bits to roughly 8/64 + 32/(64x256) ~= 0.127 bits, a savings of about 0.373 bits per parameter, or roughly 3 GB on a 65B-parameter LLaMA model.[1] DQ is a lossy compression of the metadata, but the QLoRA paper reports that perplexity on language modeling benchmarks is essentially unchanged with DQ enabled, matching or improving on plain NF4.[1]
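The storage arithmetic above can be reproduced with a short calculation. This is a sketch of the bookkeeping only; the function and parameter names are hypothetical, and the defaults simply mirror the numbers quoted in the text (64-weight blocks, 32-bit scales, 8-bit second-level scales, 256 scales per super-block).

```python
def bits_per_param(block_size=64, scale_bits=32, double_quant=False,
                   dq_scale_bits=8, dq_block_size=256, dq_const_bits=32):
    """Effective storage cost per weight for 4-bit indices plus scale metadata."""
    data_bits = 4.0
    if not double_quant:
        # One full-precision scale per block.
        return data_bits + scale_bits / block_size
    # One 8-bit scale per block, plus one 32-bit constant per super-block of scales.
    overhead = dq_scale_bits / block_size + dq_const_bits / (block_size * dq_block_size)
    return data_bits + overhead


plain = bits_per_param()
with_dq = bits_per_param(double_quant=True)
saved_bits = plain - with_dq

n_params = 65e9  # 65B-parameter model, as in the text
saved_gb = saved_bits * n_params / 8 / 1e9

print(f"plain NF4: {plain:.3f} bits/param")
print(f"NF4 + DQ:  {with_dq:.3f} bits/param")
print(f"saved:     {saved_bits:.3f} bits/param, about {saved_gb:.2f} GB at 65B parameters")
```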
In bitsandbytes and Hugging Face Transformers, DQ is exposed as the boolean bnb_4bit_use_double_quant flag on BitsAndBytesConfig, and the Transformers documentation recommends enabling it whenever GPU memory is tight (for example, when fine-tuning a 13B model on a 16 GB T4).[3]
The third innovation packaged with NF4 is the Paged Optimizer. During fine-tuning with long sequences and gradient checkpointing, optimizer state for AdamW can transiently exceed available GPU memory, triggering out-of-memory crashes. Paged Optimizers use NVIDIA's unified memory mechanism to allow optimizer state pages to migrate between GPU memory and pinned CPU host memory on demand, so that transient memory spikes no longer cause the training run to fail.[1] The technique is implemented in bitsandbytes as paged variants of AdamW and other optimizers, and the QLoRA paper reports that it absorbs the gradient-checkpointing spikes that would otherwise prevent 33B and 65B fine-tunes on 24 GB and 48 GB cards respectively.[1][12]
Paged Optimizers are independent of NF4 in principle and can be combined with any optimizer state representation, but they are usually deployed together because the same memory pressure that motivates 4-bit weights also motivates spillable optimizer state.[1][12]
Three 4-bit data types are commonly compared in the LLM quantization literature: uniform INT4, floating-point FP4, and NF4.[1][5] They differ in how their sixteen codepoints are distributed across the dynamic range.
| Type | Codebook layout | Strength | Weakness |
|---|---|---|---|
| INT4 | 16 uniformly spaced levels | Simple to implement; fast on integer ALUs | Wastes capacity in low-density regions; quantization error grows for distributions concentrated near zero[1][5] |
| FP4 (E2M1 / E3M0) | Sign + small exponent + small mantissa | Non-uniform spacing biased toward small magnitudes; flexible exponent choice | Mantissa precision near zero is still limited; not matched to N(0, 1)[1][5] |
| NF4 | 16 quantile midpoints of N(0, 1) normalized to [-1, 1] | Each bin carries equal probability mass for Gaussian inputs[1] | Optimality argument is approximate under blockwise absmax normalization[11] |
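
The layout difference between the uniform and quantile grids can be inspected directly. The sketch below is an illustration under assumptions, not a reproduction of any published benchmark: INT4 is modelled as 16 evenly spaced levels on [-1, 1], the NF4 levels are the rounded values listed earlier, the test data are synthetic Gaussian blocks, and all function names are hypothetical.

```python
import random

NF4_LEVELS = [-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
              0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0]
# Assumption: a symmetric uniform grid stands in for INT4.
INT4_LEVELS = [-1 + 2 * i / 15 for i in range(16)]


def gaps(levels):
    """Spacing between adjacent codepoints."""
    return [b - a for a, b in zip(levels, levels[1:])]


def mean_abs_error(levels, block_size=64, n_blocks=200, seed=0):
    """Mean |x - nearest level| over absmax-normalized synthetic Gaussian blocks."""
    rng = random.Random(seed)
    total, count = 0.0, 0
    for _ in range(n_blocks):
        block = [rng.gauss(0.0, 1.0) for _ in range(block_size)]
        scale = max(abs(w) for w in block)
        for w in block:
            x = w / scale
            nearest = min(levels, key=lambda c: abs(c - x))
            total += abs(nearest - x)
            count += 1
    return total / count


print("NF4 gaps: min %.4f, max %.4f" % (min(gaps(NF4_LEVELS)), max(gaps(NF4_LEVELS))))
print("INT4 gap: %.4f" % gaps(INT4_LEVELS)[0])
print("mean abs error, NF4:  %.4f" % mean_abs_error(NF4_LEVELS))
print("mean abs error, INT4: %.4f" % mean_abs_error(INT4_LEVELS))
```

Errors are reported in normalized units (fractions of the block's absolute maximum), and the result depends on the block size and on how closely the real weights follow a Gaussian.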
The QLoRA paper presents a head-to-head comparison of NF4, FP4, and INT4 on language modeling, measured as mean perplexity over Pile-derived evaluation suites with several model families:[1]
| Data type | Mean perplexity (lower is better) |
|---|---|
| INT4 | 34.34 |
| FP4 | 29.48 |
| NF4 | (between FP4 and NF4 + DQ) |
| NF4 + Double Quantization | 27.41 |
Numbers are those reported in the QLoRA paper's Table 2.[1] NF4 with Double Quantization beats both INT4 and FP4 at the same 4-bit budget; INT4's uniform grid pays a substantial penalty on weight matrices that concentrate mass near zero, while FP4 sits in between.[1] Subsequent independent reviews and tutorials reach the same qualitative ranking, with NF4 preferred for training the base model and FP4 or INT4 sometimes used for inference where slightly faster paths matter more than the last fraction of accuracy.[2][5]
The Hugging Face integration blog summarizes the practical guidance succinctly: use NF4 for higher precision, enable Double Quantization if memory is tight, and use a 16-bit compute dtype (typically bfloat16) for faster matmul.[2][3]
The information-theoretic optimality claim in the QLoRA paper assumes that the values entering the quantizer are i.i.d. samples from N(0, 1) after block scaling. Davis Yoshida pointed out in a June 2023 follow-up (arXiv 2306.06965) that this assumption is violated in practice: dividing a block of n weights by its absolute maximum makes the resulting normalized values dependent (the maximum is forced to +/-1), and the conditional distribution of the remaining n - 1 values depends on the block size.[11] Yoshida derived an alternative 4-bit code (called AF4) by directly minimizing the expected L1 reconstruction error under the conditional distribution, and showed it marginally outperforms NF4 for large block sizes such as 4096, while the two codes are essentially equivalent at the QLoRA block size of 64.[11] The paper's title, "NF4 Isn't Information Theoretically Optimal (and that's Good)," captures the bottom line: NF4 is not literally optimal under its stated criterion, but the gap that remains is small enough that the simpler quantile-based code is a practical winner for typical block sizes.[11]
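The block-size dependence is easy to see in simulation. The sketch below is an illustration, not Yoshida's derivation: it draws synthetic Gaussian blocks, divides each by its absolute maximum, and measures the spread of the normalized values for two block sizes. The function name and sample counts are arbitrary choices.

```python
import random
import statistics


def normalized_std(block_size, n_blocks, seed=0):
    """Standard deviation of absmax-normalized values from synthetic N(0, 1) blocks."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_blocks):
        block = [rng.gauss(0.0, 1.0) for _ in range(block_size)]
        scale = max(abs(w) for w in block)
        # Exactly one value per block ends up at +1 or -1.
        values.extend(w / scale for w in block)
    return statistics.pstdev(values)


# Same total number of samples (16,384) for both block sizes.
std_64 = normalized_std(block_size=64, n_blocks=256)
std_4096 = normalized_std(block_size=4096, n_blocks=4)

print(f"block size 64:   std of normalized values = {std_64:.3f}")
print(f"block size 4096: std of normalized values = {std_4096:.3f}")
```

Larger blocks tend to have a larger absolute maximum, so the normalized values cluster more tightly around zero; a single fixed codebook therefore cannot match the normalized distribution at every block size.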
NF4 is implemented in the open-source bitsandbytes library, originally maintained by Tim Dettmers and now hosted under the bitsandbytes-foundation GitHub organization, which describes itself as providing "accessible large language models via k-bit quantization for PyTorch".[4][10] The library offers three closely related capabilities: 8-bit optimizers built on block-wise quantization (the 2021 paper), LLM.int8() 8-bit inference, and the 4-bit QLoRA path that includes NF4.[4][6] All three share the same block-wise normalization scheme, lookup-table-based dequantization, and CUDA kernel infrastructure.[4][6]
At the Python level the data type is exposed through two classes in bitsandbytes.nn:
- Linear4bit is the base 4-bit linear layer; it accepts a quant_type argument with values "nf4" or "fp4", a compute_dtype (typically torch.bfloat16 or torch.float16), a compress_statistics flag for Double Quantization, and a quant_storage dtype.[13]
- LinearNF4 is a thin subclass that fixes quant_type="nf4" and documents the data type as "a quantization data type where each bin has equal area under a standard normal distribution N(0, 1)".[13]

The NF4 lookup table itself is produced by create_normal_map in bitsandbytes/functional.py, which calls scipy.stats.norm.ppf to obtain quantiles, then normalizes the result into [-1, 1] and pads the 16 values into a 256-entry table so the kernel can use 8-bit integer indices.[10] The dequantization kernel reads 4-bit indices, looks up the corresponding NF4 codepoint, multiplies by the per-block scale, and feeds the result into a 16-bit matmul.[13]
The bitsandbytes documentation lists hardware support for NF4 as NVIDIA Pascal (compute capability 6.0) and newer GPUs on CUDA, plus Intel XPU, Intel Gaudi (HPU), and CPU back-ends in more recent releases; the library is MIT licensed.[4][14] The Hugging Face Transformers documentation notes that bitsandbytes is supported on CUDA versions 11.8 through 13.0, with ongoing work to extend coverage to additional accelerators.[3]
Beyond the 4-bit indices and the per-block scales, a bitsandbytes 4-bit checkpoint records the quantization configuration (block size, quant type, whether DQ is enabled, and the compute dtype) so that loading the model from disk reconstructs the layer with the same arithmetic. Linear4bit accepts a quant_storage parameter, defaulting to torch.uint8, which controls the dtype used to physically pack the 4-bit indices into a flat tensor.[13] Two 4-bit indices fit into a single uint8, so a layer with n weights stores ceil(n/2) packed bytes plus the block metadata.[13]
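Packing two 4-bit indices into one byte can be sketched as follows. This is an illustration of the idea only: the function names are hypothetical, and the nibble order (first index in the high four bits) and the zero padding for odd lengths are assumptions rather than a description of the library's on-disk layout.

```python
def pack_nibbles(indices):
    """Pack 4-bit indices (0..15) two per byte; the first index goes in the high nibble."""
    if any(not 0 <= i <= 15 for i in indices):
        raise ValueError("each index must fit in 4 bits")
    padded = list(indices) + [0] * (len(indices) % 2)  # pad odd lengths with a zero nibble
    return bytes((hi << 4) | lo for hi, lo in zip(padded[0::2], padded[1::2]))


def unpack_nibbles(packed, n):
    """Recover the first n 4-bit indices from packed bytes."""
    out = []
    for byte in packed:
        out.append(byte >> 4)      # high nibble
        out.append(byte & 0x0F)    # low nibble
    return out[:n]


indices = [1, 15, 7]
packed = pack_nibbles(indices)
print(packed.hex(), len(packed))
print(unpack_nibbles(packed, len(indices)))
```

The original length has to be kept alongside the packed bytes, because a padded nibble is otherwise indistinguishable from a real index of zero.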
NF4 reached the broader open-source ecosystem on 24 May 2023, one day after the QLoRA paper appeared, via a Hugging Face blog post and a Transformers integration co-authored by Younes Belkada, Tim Dettmers, Artidoro Pagnoni, Sylvain Gugger, and Sourab Mangrulkar.[2] The integration adds two main entry points to the Transformers library:
- A load_in_4bit=True shortcut on AutoModelForCausalLM.from_pretrained and similar constructors, which replaces every eligible torch.nn.Linear layer with a bitsandbytes 4-bit layer.[2][3]
- A BitsAndBytesConfig dataclass that exposes four QLoRA-specific knobs:

| Parameter | Values | Default | Effect |
|---|---|---|---|
| bnb_4bit_quant_type | "nf4", "fp4" | "fp4" | Selects NF4 or FP4 code; NF4 is recommended for QLoRA training[3] |
| bnb_4bit_use_double_quant | True, False | False | Enables Double Quantization (saves ~0.4 bits per parameter)[2][3] |
| bnb_4bit_compute_dtype | torch.float16, torch.bfloat16, torch.float32 | torch.float32 | Dtype for matmul; bf16 is fastest on Ampere and newer[3] |
| load_in_4bit | True, False | False | Master switch to replace Linear layers with Linear4bit[3] |
A typical NF4 inference configuration looks like this (taken from the Transformers quantization guide):[3]
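```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# Store weights in NF4, compress the block scales, and run matmuls in bfloat16.
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# model_id is a Hugging Face Hub model identifier supplied by the caller.
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=nf4_config)
```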
For QLoRA fine-tuning, this configuration is composed with a PEFT LoRA adapter and a TRL trainer; the base model stays frozen in NF4 while the LoRA matrices remain in 16-bit and receive gradients.[2][15] The Hugging Face PEFT library handles wiring the adapter onto the quantized layers, and Hugging Face TRL provides the SFT and DPO training loops.[2]
Since 2023, NF4 has become a standard building block in open-source LLM tooling. Beyond bitsandbytes and Transformers, it appears in:
- bitsandbytes under the hood.[16]
- The bitsandbytes 4-bit integration was extended to the Diffusers library and is widely used for running FLUX.1 image models in NF4, where it reduces VRAM enough to fit the 12B-parameter transformer of FLUX.1 on consumer 8-16 GB GPUs.[18]

NF4 is also one of the data types covered by bfloat16-vs-int-vs-NF benchmark studies and survey papers on low-bit LLM quantization, typically as the reference 4-bit code against which newer learned codebooks are compared.[5][11]
The original target application for NF4 is fine-tuning large language models on hardware where the base model cannot fit in 16-bit form. The QLoRA paper demonstrated 65B LLaMA fine-tuning on a single 48 GB GPU, 33B LLaMA fine-tuning on a single 24 GB consumer card in under 12 hours, and 7B/13B fine-tuning on T4-class GPUs.[1][2] The released Guanaco models reached 99.3% of ChatGPT's score on the Vicuna benchmark with 65B parameters and 97.8% with 33B, demonstrating that 4-bit base weights are not the limiting factor for instruction-tuned chat quality.[1][8]
Even outside training, NF4 is used as a memory-efficient inference format for transformer and diffusion models. The Hugging Face documentation notes that for inference the choice between NF4 and FP4 is less consequential than for training, but the same bitsandbytes infrastructure provides both with identical APIs, so NF4 is often kept as the default to stay consistent with whatever code path produced the weights.[3] In diffusion image generation, NF4-quantized FLUX models have been reported by community benchmarks to be faster than FP8 at the same VRAM budget, with throughput speedups of roughly 1.3x to 2.5x on 6-8 GB cards.[18]
In the years since 2023, NF4 has settled into a role as the canonical 4-bit baseline for new quantization research. Subsequent methods (Yoshida's AF4, learned codebooks such as "any4," calibration-based post-training schemes like AWQ and GPTQ applied at 4 bits, and Blackwell-era FP4 quantization-aware training) typically report results against NF4 to demonstrate improvements at equal bit-width.[5][11]
In QLoRA-style fine-tuning, NF4 weights are immutable: gradients never flow back into the 4-bit codebook or scales. Trainable parameters live entirely in the 16-bit LoRA adapters that are added in parallel to selected linear layers (commonly the q, k, v, and output projections of attention, sometimes the MLP projections).[1][15] The forward pass dequantizes the frozen NF4 weights, computes (W + BA) x, and the backward pass updates only A and B.[1][15]
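The forward computation can be sketched with plain Python lists. This is a toy illustration under assumptions, not the PEFT implementation: the function names are hypothetical, `W_deq` stands for the already-dequantized frozen weight, and the LoRA scaling factor (alpha / r) is omitted for brevity.

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def lora_forward(W_deq, A, B, x):
    """Compute (W + B A) x without ever forming the dense update B A.

    W_deq: d_out x d_in frozen weights (dequantized from NF4, never updated)
    A:     r x d_in trainable down-projection
    B:     d_out x r trainable up-projection
    """
    base = matvec(W_deq, x)              # frozen path
    update = matvec(B, matvec(A, x))     # low-rank trainable path
    return [b + u for b, u in zip(base, update)]


W_deq = [[1.0, 2.0], [3.0, 4.0]]
A = [[0.5, -0.5]]                        # rank r = 1
x = [2.0, 1.0]

B_init = [[0.0], [0.0]]                  # zero-initialized B: adapter starts as a no-op
B_trained = [[1.0], [2.0]]               # hypothetical values after some training

print(lora_forward(W_deq, A, B_init, x))
print(lora_forward(W_deq, A, B_trained, x))
```

With B initialized to zero the adapted layer reproduces the frozen layer exactly, so training starts from the quantized base model's behaviour and only A and B change afterwards.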
This division of labor has two practical consequences. First, because the adapters are tiny relative to the base model (typically less than 1% of base parameters), the optimizer state is small and the dominant memory cost is the activations held from the forward pass and the gradients computed during the backward pass.[1][15] Second, the choice of compute dtype (bnb_4bit_compute_dtype) matters more than the storage dtype for training speed; the Hugging Face documentation and the QLoRA reference implementation use torch.bfloat16 on Ampere or newer hardware to maximize tensor core throughput.[2][3]
NF4 quantizes the static weight matrices of a transformer; it does not by itself quantize activations, attention scores, or the KV cache that grows with the context length. In long-context inference, the KV cache often dominates GPU memory, and NF4-quantized weights can be combined with separate KV cache compression (such as 8-bit or 4-bit quantization of cached keys and values) implemented by other libraries.[3][4] The Hugging Face documentation explicitly notes that the bnb 4-bit integration is intended for weights and uses 16-bit compute, leaving activation-level optimization to complementary techniques.[3]
NF4 has several documented limitations:
- The bitsandbytes NF4 path dequantizes to 16-bit before matmul, so it does not in itself produce a faster compute path than 16-bit weights, only a smaller memory footprint. Speedups on memory-bound workloads such as small-batch decoding come from reduced HBM traffic.[4][13]
- NF4 support depends on the bitsandbytes kernels; on backends without those kernels (older CUDA versions, exotic accelerators) NF4 either falls back to slow paths or is unavailable.[4][14]

NF4 sits at the intersection of several adjacent topics: