FP4 (4-bit floating point)
Last reviewed
May 21, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,712 words
FP4 is a 4-bit floating-point numerical format used to represent the weights, activations, and (in some recipes) gradients of deep neural networks at extremely low precision. The dominant 4-bit floating-point layout, E2M1, allocates 1 sign bit, 2 exponent bits, and 1 mantissa bit, producing only 16 distinct codes whose values span -6 to +6.[1][2] Because two FP4 elements fit in a single 8-bit byte, the format offers an eight-fold reduction in storage versus FP32 and a two-fold reduction versus FP8, which translates directly into higher arithmetic throughput on hardware that supports it.[3] FP4 became practically important in two steps: in 2023 the Open Compute Project (OCP) standardised the MXFP4 microscaling format alongside several other narrow data types, and in 2024 NVIDIA shipped native FP4 tensor-core support in its Blackwell architecture.[4][5] Subsequent variants, especially NVIDIA's proprietary NVFP4, which shares one FP8 scale across each block of 16 elements, have enabled near-FP8 quality on large language models while consuming roughly half the memory bandwidth.[6][7]
A binary floating-point number is described by the layout ExMy where x is the number of exponent bits and y is the number of mantissa (significand) bits, with one additional sign bit and the constraint x + y + 1 equals the total width.[2] For a 4-bit float the possibilities are E0M3, E1M2, E2M1, and E3M0, but the configuration that has been adopted across vendors for deep learning is E2M1: 1 sign, 2 exponent, 1 mantissa.[2][4]
Following IEEE-style decoding, a non-subnormal E2M1 value equals (-1)^s * 2^(e - bias) * (1 + m/2), where s is the sign bit, e is the 2-bit exponent field, m is the mantissa bit, and the exponent bias is 1. Subnormal numbers (exponent bits all zero) take the form (-1)^s * 2^(1 - bias) * (m/2).[2] Enumerating all 16 codes yields the set {+/-0, +/-0.5, +/-1, +/-1.5, +/-2, +/-3, +/-4, +/-6}; there is no infinity or NaN encoding because all four exponent codes are needed to cover useful magnitudes.[1][2] As a consequence the largest finite value is +/-6 and the smallest positive value, which is also the only positive subnormal, is 0.5.[1]
E2M1 is non-uniformly spaced: the gaps between successive non-negative values are 0.5, 0.5, 0.5, 0.5, 1, 1, and 2, so the step size doubles with each power of two above 2. Because the grid is densest near zero, it maps reasonably well onto the bell-shaped distributions of weight and activation tensors in trained transformers, which is one reason FP4 outperforms uniformly quantised INT4 on outlier-heavy layers.[8]
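The decoding rule and the spacing of the grid can be checked by enumerating all 16 codes. The Python sketch below assumes the conventional bit order (sign, then two exponent bits, then the mantissa bit, from most to least significant); the function name `decode_e2m1` is illustrative and not taken from any library.

```python
def decode_e2m1(code: int) -> float:
    """Decode one 4-bit E2M1 code (0..15) with exponent bias 1."""
    if not 0 <= code <= 0xF:
        raise ValueError("an E2M1 code must fit in 4 bits")
    sign = -1.0 if code & 0b1000 else 1.0
    exponent = (code >> 1) & 0b11   # assumed layout: s e e m
    mantissa = code & 0b1
    bias = 1
    if exponent == 0:
        # Subnormal: no implicit leading one.
        return sign * 2.0 ** (1 - bias) * (mantissa / 2)
    return sign * 2.0 ** (exponent - bias) * (1 + mantissa / 2)


values = [decode_e2m1(c) for c in range(16)]
positive = values[:8]
gaps = [b - a for a, b in zip(positive, positive[1:])]

print(positive)  # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
print(gaps)      # [0.5, 0.5, 0.5, 0.5, 1.0, 1.0, 2.0]
```

Codes 8 to 15 repeat the same magnitudes with the sign bit set, so code 8 decodes to negative zero.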
Sub-8-bit quantisation in deep learning has been studied since at least 2016, but the modern wave of 4-bit floating-point work began in 2023. The QLoRA paper from Tim Dettmers and collaborators (May 2023) introduced the NF4 "NormalFloat" 4-bit data type for fine-tuning frozen weights, and the bitsandbytes library also exposed a software-only FP4 mode using the E2M1 code points decoded with a lookup table; both were used to push 4-bit quantisation into mainstream usage, although the actual matrix multiplications still ran in higher precision.[9] LLM-FP4 (Liu et al., 2023) was an early academic study showing that FP4 weights and activations could be calibrated post-training for transformer models.[10]
In September 2023 the Open Compute Project published the Microscaling Formats (MX) Specification v1.0, the first cross-vendor standard for sub-8-bit number formats with block-shared scales.[4][11] The standard was developed by an alliance that included AMD, Arm, Intel, Meta, Microsoft, NVIDIA, and Qualcomm Technologies. It defined four concrete element types (MXFP8, MXFP6, MXFP4, and MXINT8), each paired with a shared scale factor in the E8M0 format (an 8-bit unsigned exponent ranging from 2^-127 to 2^127) and a block size of 32 elements.[4][5] The companion Microsoft Research paper Microscaling Data Formats for Deep Learning (Rouhani et al., October 2023, arXiv:2310.10537) provided empirical evidence that sub-8-bit MX formats could serve as drop-in replacements for FP32 across more than two dozen workloads with negligible quality loss.[5]
On 18 March 2024 NVIDIA introduced the Blackwell GPU architecture at GTC, including the B200 discrete accelerator and the GB200 Grace-Blackwell superchip; the B200 was the first GPU to expose FP4 directly to the Tensor Cores, advertising 20 PFLOPS of dense FP4 throughput per GPU and a fifth-generation Transformer Engine that automatically routes layers across FP4, FP6, FP8, and higher precisions.[3][12] On 6 January 2025 NVIDIA extended FP4 to consumer hardware with the GeForce RTX 50 series (Blackwell consumer architecture), making the RTX 50 the first consumer GPU family to accelerate FP4 inference natively.[13]
NVIDIA introduced its own micro-scaled FP4 variant, NVFP4, alongside the same Blackwell launch but described it in detail only in a developer blog dated 24 June 2025, which laid out the dual-level scaling, the block size of 16 elements, and the use of an E4M3 FP8 scale per block.[6] A research paper from the NVIDIA team, Pretraining Large Language Models with NVFP4 (Felix Abecassis, Paulius Micikevicius, Asit Mishra, and others; arXiv:2509.25149, 29 September 2025), then demonstrated end-to-end pretraining of a 12-billion-parameter model on 10 trillion tokens at NVFP4 precision, matching FP8 baselines on MMLU-Pro to within 0.04 percentage points.[7]
INT4 maps the 16 codes uniformly across a chosen range, which spends as many codes on the sparsely populated tails as on the dense region around zero, where most values actually lie. FP4 instead places representable values at uneven, log-like spacing, providing fine resolution near zero and coarse resolution near the maximum. For weight tensors in trained transformers, which approximately follow zero-centred Gaussian or Laplace distributions, this layout is closer to information-theoretically efficient and is also more robust to the outlier activations that commonly appear in attention layers.[8][14]
The dynamic range of bare E2M1 (12 to 1, the ratio of the largest magnitude 6 to the smallest non-zero magnitude 0.5) is far too narrow for raw weight or activation tensors. Every practical FP4 pipeline therefore pairs E2M1 elements with a block-shared scale: each contiguous group of k elements is divided by a per-block scale before quantisation and multiplied back after dequantisation. The micro-scaling family fixes k and a scale format and standardises both in hardware. This block-floating-point idea is decades old but only recently became economically viable as an in-silicon datatype.[11]
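The OCP MXFP4 variant uses E2M1 elements of 4 bits each, a block size of 32 elements, and one shared 8-bit E8M0 scale per block.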
Total storage per 32-element block is 32 * 4 + 8 = 136 bits, equal to 4.25 bits per element on average.[11] The E8M0 scale is a pure power of two: the dequantised value is element * 2^(scale - 127). Because the scale can take any integer exponent between -127 and 127, MXFP4 can represent an enormous dynamic range, but the absence of fractional scaling means every block's scale is rounded up to a power of two, which limits precision compared with fractional scale formats.[11]
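A minimal Python sketch of this block scheme follows. It is not the reference algorithm from the specification: the scale-selection rule (the smallest power of two that keeps the block maximum within +/-6) and the tie-breaking in `round_to_e2m1` are assumptions made for illustration, and all function names are hypothetical.

```python
import math

# Non-negative magnitudes representable in E2M1.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def round_to_e2m1(x: float) -> float:
    """Round to the nearest E2M1 value, saturating at +/-6.

    Ties resolve to the smaller magnitude here; real hardware may differ.
    """
    mag = min(E2M1_GRID, key=lambda g: abs(g - abs(x)))
    return math.copysign(mag, x)


def quantize_mxfp4_block(block):
    """Return (biased E8M0 scale, E2M1 elements) for one block."""
    amax = max(abs(v) for v in block)
    # Assumed rule: smallest power of two that keeps amax within +/-6.
    exp = math.ceil(math.log2(amax / 6.0)) if amax > 0 else 0
    exp = max(-127, min(127, exp))
    scale = 2.0 ** exp
    return exp + 127, [round_to_e2m1(v / scale) for v in block]


def dequantize_mxfp4_block(biased_scale, elements):
    """Apply element * 2^(scale - 127) to every element."""
    return [e * 2.0 ** (biased_scale - 127) for e in elements]


block = [0.01 * i for i in range(-16, 16)]   # 32 values in [-0.16, 0.15]
biased_scale, elements = quantize_mxfp4_block(block)
restored = dequantize_mxfp4_block(biased_scale, elements)

print(biased_scale)                      # 122, i.e. a scale of 2^-5
print(32 * 4 + 8, (32 * 4 + 8) / 32)     # 136 4.25
```

In this example the block maximum 0.16 does not land on a code point after scaling, which illustrates the precision cost of a power-of-two scale.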
MXFP4 is natively supported by NVIDIA Blackwell tensor cores, the AMD Instinct MI400 series, AWS Trainium, and several recent neural processing units (NPUs).[1][15] OpenAI's gpt-oss model release (2025) used MXFP4 for mixture-of-experts weight storage, allowing the 120-billion-parameter variant to fit within a single 80 GB GPU.[15]
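NVFP4 is NVIDIA's micro-scaled FP4 variant with two key differences from MXFP4: the block size is 16 elements rather than 32, and scaling is two-level, with an E4M3 FP8 scale per block (fractional rather than a pure power of two) plus a single FP32 scalar per tensor.[6]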
Total NVFP4 storage per 16-element block is 16 * 4 + 8 = 72 bits, equal to 4.5 bits per element on average, plus a negligible FP32 tensor-wide overhead.[6] The format gives roughly a 3.5x memory reduction versus FP16 and a 1.8x reduction versus FP8 while keeping accuracy close to FP8 on tasks NVIDIA has measured.[6] By construction at least one value per block (the block-maximum) is stored near FP8 precision because the E4M3 scale is fitted to it, while the remaining values fall back to native FP4 precision.[6]
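The two-level scaling can be sketched in plain Python as follows. This is one plausible arrangement under stated assumptions, not NVIDIA's implementation: the outer scalar is chosen so that the largest block scale maps onto the E4M3 maximum of 448, Python floats stand in for FP32, and every function name is hypothetical.

```python
import math

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_MAX, E4M3_MAX = 6.0, 448.0


def round_to_e2m1(x):
    """Nearest E2M1 value, saturating at +/-6."""
    mag = min(E2M1_GRID, key=lambda g: abs(g - min(abs(x), E2M1_MAX)))
    return math.copysign(mag, x)


def round_to_e4m3(x):
    """Round a non-negative float to the E4M3 grid (3 mantissa bits, max 448)."""
    if x <= 0.0:
        return 0.0
    _, e = math.frexp(x)          # x = m * 2**e with 0.5 <= m < 1
    lead = max(e - 1, -6)         # clamp to the smallest normal exponent
    step = 2.0 ** (lead - 3)      # spacing of representable values here
    return min(round(x / step) * step, E4M3_MAX)


def quantize_nvfp4(tensor, block_size=16):
    tensor_amax = max(abs(v) for v in tensor)
    # Assumed outer scalar: maps the largest block scale onto 448.
    global_scale = tensor_amax / (E2M1_MAX * E4M3_MAX) or 1.0
    blocks = []
    for i in range(0, len(tensor), block_size):
        block = tensor[i:i + block_size]
        amax = max(abs(v) for v in block)
        block_scale = round_to_e4m3(amax / E2M1_MAX / global_scale)
        eff = block_scale * global_scale   # effective per-block scale
        codes = [round_to_e2m1(v / eff) if eff else 0.0 for v in block]
        blocks.append((block_scale, codes))
    return global_scale, blocks


def dequantize_nvfp4(global_scale, blocks):
    return [c * s * global_scale for s, codes in blocks for c in codes]


# One block of small values next to one block 1000 times larger.
tensor = [0.001 * i for i in range(16)] + [float(i) for i in range(16)]
global_scale, blocks = quantize_nvfp4(tensor)
restored = dequantize_nvfp4(global_scale, blocks)
print(16 * 4 + 8, (16 * 4 + 8) / 16)   # 72 4.5
```

The small-valued block keeps its own fractional scale, so its elements are not flushed to zero by the much larger neighbouring block.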
Because each FP4 code covers a wider value range than its FP8 counterpart, the rounding strategy matters more than at higher precision.
The Blackwell B200 advertises about 20 PFLOPS of dense FP4 throughput per GPU (about 40 PFLOPS with sparsity), exactly twice the FP8 throughput and four times the FP16 throughput.[3][18] On the GB200 NVL72 system, which packs 72 Blackwell GPUs into a single liquid-cooled rack, NVIDIA reported up to 1.4 EFLOPS of FP4 inference at rack scale.[3] Consumer Blackwell silicon (RTX 50 series) inherits the same FP4 tensor cores. FP4 doubles raw throughput and halves model footprint relative to FP8, and NVIDIA reports a 2x speed-up for FLUX image generation in FP4 on the RTX 5090 against FP16 on the RTX 4090.[13]
The 4-bit and 8-bit landscape now includes several closely related but incompatible formats. The most important features are summarised below.
| Format | Element type | Block size | Scale format | Avg bits/value | Hardware native |
|---|---|---|---|---|---|
| INT4 | signed 4-bit integer | per-channel typical | FP16/FP32 | ~4.x | Hopper, Ada, Blackwell (W4A16) |
| NF4 | 4-bit non-uniform code | 64 (typical) | FP32 + 2nd-level FP8 | ~4.1 (with double quant) | Software only (bitsandbytes) |
| MXFP4 | E2M1 (4 bits) | 32 | E8M0 (8 bits, power of two) | 4.25 | Blackwell, MI400, Trainium |
| NVFP4 | E2M1 (4 bits) | 16 | E4M3 FP8 + FP32 tensor scalar | 4.5 + epsilon | Blackwell only |
| FP8 (E4M3) | E4M3 (8 bits) | per-tensor or block | FP32 | 8.x | Hopper, Ada, Blackwell |
| INT8 | signed 8-bit integer | per-channel or per-token | FP16/FP32 | ~8.x | All recent GPUs |
NF4 is a lookup-table quantisation format introduced by QLoRA: the 16 codes are chosen to be quantiles of a standard normal distribution rather than samples of a fixed floating-point grid. Because pretrained transformer weights are approximately normally distributed, NF4 achieves lower perplexity than FP4 and INT4 on most weight-only fine-tuning benchmarks.[9] However, NF4 has no efficient hardware implementation: every NF4 element must be looked up to recover a 16-bit value before computation, which is acceptable for memory-bound fine-tuning but uncompetitive for the compute-bound forward pass at inference time. FP4 (and the MX/NVFP4 variants) inverts the trade-off: its accuracy is slightly worse in weight-only, memory-bound workloads, but it runs far faster on hardware that supports it natively.[9][14]
INT4 weight-only quantisation (W4A16) is the dominant low-precision inference path on Hopper and Ada-Lovelace GPUs through algorithms such as GPTQ and AWQ. INT4 typically stays within 1 to 3 percentage points of FP16 accuracy on standard LLM benchmarks but cannot be used for activations without significant loss because of attention outliers.[14] FP4 with block scaling can quantise both weights and activations (W4A4), unlocking the full doubling of tensor-core throughput; in NVIDIA-published evaluations from 2026 on Llama 3 family models, NVFP4 W4A4 closes most of the accuracy gap relative to FP8.[6]
FP8 became the workhorse low-precision format on the NVIDIA Hopper generation. FP8 is essentially lossless versus FP16 for almost all transformer workloads.[14] FP4 doubles throughput relative to FP8 but adds non-trivial accuracy risk: published NVFP4 results show typical degradation of well under a percentage point on instruction-tuned models, while older or single-level FP4 recipes can lose several percentage points without careful calibration.[6][7] INT8 with proper per-channel and per-token scaling remains competitive with FP8 on many tasks and is widely deployed on edge devices.[14]
NVIDIA's TransformerEngine (open-source, Apache-2.0) is the reference implementation of Blackwell low-precision training and inference. From the 2.x release line it exposes an NVFP4BlockScaling quantisation recipe with options for stochastic rounding, random Hadamard transforms, and per-axis 2D quantisation (separate quantisation grids along the row and column axes of each weight matrix, so that the matmul partner sees a transposed view that is itself well quantised).[16] A helper function is_nvfp4_available() lets user code detect at runtime whether the underlying device has Blackwell-class tensor cores.[16] The same library also implements MXFP8 and MXFP4 recipes following the OCP specification.[16]
TensorRT-LLM gained NVFP4 support in version 0.17, including W4A4 quantised attention and KV-cache compression in FP4.[20] The recommended workflow is post-training calibration with NVIDIA Model Optimizer to compute per-block scales and per-tensor outer scales, followed by engine compilation in TensorRT-LLM. On Blackwell SM 100/103 devices the framework supports the broadest range of formats: NVFP4, MXFP4, FP8 per-tensor, block-scaling, and rowwise variants, plus W4A8 and W4A16 weight-only paths for AWQ and GPTQ.[20]
The llm-compressor library (maintained as part of the vLLM project) provides a QuantizationModifier with an "NVFP4" scheme that converts a HuggingFace transformer checkpoint into a compressed-tensors NVFP4 weight file in a few lines of code.[19] After calibration the resulting checkpoint can be served by vLLM on Blackwell hardware; on devices below SM 100 vLLM falls back to weight-only NVFP4 (W4A16) because activation quantisation requires native FP4 tensor cores.[19]
By early 2026 a substantial catalogue of pre-quantised NVFP4 and MXFP4 models was available on the HuggingFace Hub, including official NVIDIA releases of Llama 3.1 8B, Llama 3.1 405B, Llama 3.3 70B, Llama 4 Scout, DeepSeek-R1, and DeepSeek V3.2 along with Red Hat AI ports of Qwen3 8B/14B/32B and Gemma checkpoints; these were produced with llm-compressor or NVIDIA Model Optimizer and target vLLM or TensorRT-LLM as the inference backend.[19][20]
OpenAI's gpt-oss release used MXFP4 weight storage in its mixture-of-experts layers, fitting a 120-billion-parameter model on a single 80 GB GPU.[15] AMD and Intel have publicly committed to MXFP4 in subsequent generations of accelerators (Instinct MI400 and Gaudi-class hardware respectively), giving the OCP MX family the broader multi-vendor footprint of the two FP4 standards.[1][11]
The primary commercial driver of FP4 is large-language-model inference. By halving weight and KV-cache footprint relative to FP8, FP4 lets a given accelerator fit larger models, longer contexts, and more concurrent sessions; at the same time tensor-core throughput doubles. In practice NVIDIA reports about 2.3 times higher throughput for 4-bit LLMs on Blackwell against FP8 baselines while preserving accuracy.[6] Image-generation models such as FLUX show a 2x speed-up at half the memory footprint when run in FP4 on the RTX 5090 compared to FP16 on the RTX 4090.[13]
Although FP8 has rapidly become the default precision for frontier pretraining, NVIDIA's Pretraining Large Language Models with NVFP4 showed that an entire 10-trillion-token pretrain at 12 billion parameters can be performed predominantly in NVFP4 with downstream accuracy on par with FP8 (62.58% versus 62.62% on MMLU-Pro).[7] The earlier academic FP4 All the Way result (Chmiel et al., May 2025) reached the same conclusion on a 7-billion-parameter run on Intel Gaudi 2 accelerators using a software-emulated NVFP4-shaped format.[14] Training LLMs with MXFP4 (Tseng et al., February 2025) and the Quartet family of papers (Castro Tsizhanovska et al., 2025) extended the recipe to MXFP4 and showed that stochastic rounding plus random Hadamard transforms is the dominant technique for stable FP4 training.[17]
QLoRA-style fine-tuning historically used NF4 rather than FP4 because the matrix multiplication runs in 16-bit anyway. On Blackwell hardware, NVFP4-quantised base weights paired with LoRA adapters compute the forward pass at native FP4 tensor-core speed, restoring the throughput advantage of low precision to the fine-tuning workflow.[16][19]
FP4 has only 16 representable codes, so the per-element quantisation error is large in absolute terms. The MX and NVFP4 strategies hide this by absorbing range into a per-block scale, but each additional layer of scaling adds storage overhead on top of the 4-bit codeword: NVFP4 already spends 0.5 bits per value on its FP8 scale alone, and adding a higher-precision per-tensor scalar increases overhead further. There is an information-theoretic floor below which adding more scale layers cannot recover lost precision.[6][11]
Transformer activations contain a small fraction of outlier values whose magnitudes can be hundreds of times larger than the typical entry. Without intervention, these outliers force the block scale to enlarge, pushing the bulk of the distribution into the codes near zero and destroying useful precision. Random Hadamard transforms, learned per-channel scaling (as in SmoothQuant), and rotation-based schemes have all been used to redistribute outlier mass before FP4 quantisation.[7][17]
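The effect of a random Hadamard transform on an outlier can be illustrated with a small pure-Python sketch. The block size of 16, the single synthetic outlier, and the function name are assumptions chosen for illustration; production kernels apply the transform on the GPU and fuse it with quantisation.

```python
import math
import random


def hadamard_transform(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    y = list(x)
    n = len(y)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b   # butterfly step
        h *= 2
    norm = math.sqrt(n)
    return [v / norm for v in y]


rng = random.Random(0)
block = [1.0] * 16
block[5] = 200.0                                   # a single synthetic outlier
signs = [rng.choice((-1.0, 1.0)) for _ in block]   # random sign flips

rotated = hadamard_transform([s * v for s, v in zip(signs, block)])
# The orthonormal transform is its own inverse, so undo it and the signs.
recovered = [s * v for s, v in zip(signs, hadamard_transform(rotated))]

print(max(abs(v) for v in block))     # 200.0
print(max(abs(v) for v in rotated))   # roughly a quarter of the outlier
```

After the rotation every entry has a similar magnitude, so a shared block scale no longer pushes the bulk of the values into the codes near zero, and the transform can be undone exactly after dequantisation.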
Round-to-nearest in FP4 introduces a measurable bias into accumulated gradients because the codeword spacing is uneven and the rounding error is correlated with the underlying tensor. Stochastic rounding restores zero-mean error in expectation but increases per-step variance, and the Chmiel et al. analysis quantified a training-effectiveness threshold: when the gradient norm falls below roughly sqrt(3) (about 1.73) times the quantisation noise, further FP4 updates contribute almost nothing to convergence.[14] In practice the last few layers of LLMs, where signals are smallest, are usually kept in higher precision (MXFP8 or FP16) even in otherwise FP4 training runs.[7]
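The difference between the two rounding modes can be seen on a single value. The sketch below rounds onto the bare E2M1 grid without any block scale, uses a seeded generator for reproducibility, and uses hypothetical function names.

```python
import math
import random

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def round_nearest(x):
    """Deterministic round-to-nearest onto the E2M1 grid."""
    mag = min(E2M1_GRID, key=lambda g: abs(g - min(abs(x), 6.0)))
    return math.copysign(mag, x)


def round_stochastic(x, rng):
    """Round to a neighbouring grid point with probability proportional to proximity."""
    mag = min(abs(x), 6.0)
    for lo, hi in zip(E2M1_GRID, E2M1_GRID[1:]):
        if lo <= mag <= hi:
            p_up = (mag - lo) / (hi - lo)   # chance of rounding away from zero
            out = hi if rng.random() < p_up else lo
            return math.copysign(out, x)
    raise AssertionError("unreachable: mag is clamped to [0, 6]")


rng = random.Random(0)
x = 2.3
samples = [round_stochastic(x, rng) for _ in range(20000)]
mean_sr = sum(samples) / len(samples)

print(round_nearest(x))   # 2.0: always 0.3 too low
print(mean_sr)            # close to 2.3 on average
```

Each stochastic sample is either 2.0 or 3.0, so the error of a single step is larger than with round-to-nearest, but it averages out across many accumulated updates.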
The NF4-derived idea of "double quantisation" (storing the per-block scale itself in a quantised form) saves about 0.37 bits per parameter in QLoRA configurations and was a key innovation in fitting a 65-billion-parameter model on a single 48 GB GPU.[9] MXFP4 effectively double-quantises by using the cheap E8M0 power-of-two scale, while NVFP4 spends more bits on the inner scale (FP8) and recovers them with the outer FP32 tensor scalar. The choice between these strategies is fundamentally a question of how much variance lives inside each block versus across blocks of the same tensor, and the right answer depends on layer and model.[6][11]
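The per-value cost of each scaling strategy is simple arithmetic. The sketch below assumes the configuration reported in the QLoRA paper (a 32-bit scale per 64-element block, replaced under double quantisation by an 8-bit scale plus a 32-bit second-level constant per 256 blocks); the helper name is hypothetical.

```python
def scale_overhead_bits(block_size, scale_bits):
    """Scale storage amortised over one block, in bits per value."""
    return scale_bits / block_size


# QLoRA: FP32 scale per 64 weights versus double quantisation.
qlora_single = scale_overhead_bits(64, 32)
qlora_double = scale_overhead_bits(64, 8) + scale_overhead_bits(64 * 256, 32)
saving = qlora_single - qlora_double

# Micro-scaled FP4 formats (ignoring NVFP4's single FP32 scalar per tensor).
mxfp4 = scale_overhead_bits(32, 8)
nvfp4 = scale_overhead_bits(16, 8)

print(qlora_single, round(qlora_double, 3), round(saving, 3))  # 0.5 0.127 0.373
print(mxfp4, nvfp4)                                            # 0.25 0.5
```

MXFP4 pays half as much per value as NVFP4 for its scale, which is the price NVFP4 accepts in exchange for smaller blocks and a fractional scale.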
FP4 is currently supported natively on only a small fraction of deployed AI accelerators (NVIDIA Blackwell datacentre and consumer Blackwell, AMD Instinct MI400, AWS Trainium, and a handful of NPUs). NVFP4 in particular is NVIDIA-only. Older Hopper and Ada-Lovelace GPUs can store FP4 weights and dequantise on the fly into FP16 for tensor-core math, but that yields no compute speed-up. Software stacks therefore have to ship both true-FP4 and emulated-FP4 paths.[16][19][20]
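An emulated path of the kind described above amounts to unpacking two 4-bit codes per byte, decoding them through a 16-entry lookup table, and applying the block scale. The sketch below assumes the low nibble holds the first element, which is a packing convention chosen for illustration; real libraries define their own layouts, and the function names are hypothetical.

```python
# E2M1 values indexed by 4-bit code (sign bit is the most significant bit).
E2M1_LUT = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
            -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0]


def pack_fp4(codes):
    """Pack an even-length list of 4-bit codes, two per byte."""
    if len(codes) % 2:
        raise ValueError("need an even number of codes")
    if any(not 0 <= c <= 0xF for c in codes):
        raise ValueError("codes must fit in 4 bits")
    return bytes(lo | (hi << 4) for lo, hi in zip(codes[0::2], codes[1::2]))


def unpack_and_dequantize(packed, scale):
    """Decode packed FP4 codes to floats and apply one shared scale."""
    out = []
    for byte in packed:
        out.append(E2M1_LUT[byte & 0xF] * scale)   # assumed: low nibble first
        out.append(E2M1_LUT[byte >> 4] * scale)
    return out


packed = pack_fp4([1, 7, 15, 2])
print(len(packed))                          # 2 bytes for 4 elements
print(unpack_and_dequantize(packed, 0.5))   # [0.25, 3.0, -3.0, 0.5]
```

The decoded values would then feed an ordinary higher-precision matrix multiplication, which is why this path saves memory but not compute.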
The most cited contemporary 4-bit floating-point papers fall into three groups.
A clear consensus has emerged across these works: native 4-bit floating point is viable both for inference and for pretraining provided that (a) block sizes are small (16 to 32 elements), (b) at least one scale level uses a fractional (non-power-of-two) representation, (c) stochastic rounding and Hadamard or rotation transforms are applied during training, and (d) a handful of the most sensitive layers remain in higher precision.