FP4 (4-bit floating point)

Updated Jun 27, 2026

FP4 (4-bit floating point) is a numerical format that stores a real number in just 4 bits, the smallest floating-point type in mainstream use for deep learning. Its dominant layout, E2M1, uses 1 sign bit, 2 exponent bits, and 1 mantissa bit, which yields only 16 code points spanning negative six to positive six.^[1]^[2] Because FP4 fits two values in a single byte, it cuts model storage by a factor of eight versus FP32 and a factor of two versus FP8, and on hardware that supports it natively this translates into roughly double the arithmetic throughput of FP8.^[3] FP4 is almost always used with a shared per-block scale, as in the microscaling formats MXFP4 (Open Compute Project) and NVFP4 (NVIDIA); it became practically important in 2024 when NVIDIA shipped native FP4 tensor cores in its Blackwell architecture, and it now powers low-precision large-language-model inference and, increasingly, training.^[4]^[5]^[6]

What is FP4?

FP4 is quantisation taken to an extreme: it represents the weights, activations, and (in some recipes) gradients of deep neural networks using only 4 bits per element. The dominant 4-bit floating-point layout, E2M1, allocates 1 sign bit, 2 exponent bits, and 1 mantissa bit, producing only 16 code points that span negative six to positive six.^[1]^[2] Because half of an 8-bit byte holds a full FP4 element, the format offers an eightfold reduction in storage versus FP32 and a twofold reduction versus FP8, which translates directly into higher arithmetic throughput on hardware that supports it.^[3] Two milestones made FP4 practically important: in 2023 the Open Compute Project (OCP) standardised the MXFP4 microscaling format alongside several other narrow data types, and in 2024 NVIDIA shipped native FP4 tensor-core support in its Blackwell architecture.^[4]^[5] Subsequent variants, especially NVIDIA's proprietary NVFP4, which shares one FP8 scale across each block of 16 elements, have enabled near-FP8 quality on large language models while consuming roughly half the memory bandwidth.^[6]^[7]

What is the E2M1 format?

A binary floating-point number is described by the layout ExMy where x is the number of exponent bits and y is the number of mantissa (significand) bits, with one additional sign bit and the constraint x + y + 1 equals the total width.^[2] For a 4-bit float the possibilities are E0M3, E1M2, E2M1, and E3M0, but the configuration that has been adopted across vendors for deep learning is E2M1: 1 sign, 2 exponent, 1 mantissa.^[2]^[4]

Following IEEE-style decoding, a non-subnormal E2M1 value equals (-1)^s * 2^(e - bias) * (1 + m/2), where the exponent bias is 1 for E2M1 and subnormal numbers (exponent bits all zero) take the form (-1)^s * 2^(1 - bias) * (m/2).^[2] Enumerating all 16 codes yields the set {+/-0, +/-0.5, +/-1, +/-1.5, +/-2, +/-3, +/-4, +/-6}; there is no infinity or NaN encoding because the four exponent codes are needed to cover useful magnitudes.^[1]^[2] As a consequence the largest finite value is +/-6 and the smallest positive subnormal is 0.5.^[1]
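The decoding rule above can be checked with a short Python sketch. It assumes the conventional bit order (sign in the most significant bit, then the two exponent bits, then the mantissa bit); the function name decode_e2m1 is purely illustrative and not taken from any library.

```python
def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 code (0..15), assumed layout: [sign | exp exp | mantissa]."""
    if not 0 <= code <= 0xF:
        raise ValueError("E2M1 codes are 4 bits wide")
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exponent = (code >> 1) & 0b11
    mantissa = code & 1
    bias = 1
    if exponent == 0:
        # Subnormal: no implicit leading one.
        magnitude = 2.0 ** (1 - bias) * (mantissa / 2)
    else:
        # Normal: implicit leading one plus a half-step mantissa.
        magnitude = 2.0 ** (exponent - bias) * (1 + mantissa / 2)
    return sign * magnitude


E2M1_VALUES = [decode_e2m1(c) for c in range(16)]
POSITIVE_GRID = E2M1_VALUES[:8]
print(POSITIVE_GRID)  # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```

Codes 8 to 15 decode to the same magnitudes with a negative sign, so the 16 codes cover 15 distinct real values once +0 and -0 are merged.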

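E2M1 is non-uniformly spaced: adjacent magnitudes are 0.5 apart up to 2, then 1 apart up to 4, then 2 apart up to 6 (gaps of 0.5, 0.5, 0.5, 0.5, 1, 1, 2), so the spacing doubles at 2 and again at 4. Because the code points cluster near zero, the grid maps reasonably well to the bell-shaped distributions of weight and activation tensors in trained transformers, which is one reason FP4 outperforms uniformly quantised INT4 on outlier-heavy layers.^[8]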

When was FP4 introduced and standardised?

Earlier 4-bit work

Sub-8-bit quantisation in deep learning has been studied since at least 2016, but the modern wave of 4-bit floating-point work began in 2023. The QLoRA paper from Tim Dettmers and collaborators (May 2023) introduced the NF4 "NormalFloat" 4-bit data type for fine-tuning frozen weights, and the bitsandbytes library also exposed a software-only FP4 mode using the E2M1 code points decoded with a lookup table; both were used to push 4-bit quantisation into mainstream usage, although the actual matrix multiplications still ran in higher precision.^[9] LLM-FP4 (Liu et al., 2023) was an early academic study showing that FP4 weights and activations could be calibrated post-training for transformer models.^[10]

The OCP Microscaling specification

In September 2023 the Open Compute Project published the Microscaling Formats (MX) Specification v1.0, the first cross-vendor standard for sub-8-bit number formats with block-shared scales.^[4]^[11] The standard was developed by an alliance that included AMD, Arm, Intel, Meta, Microsoft, NVIDIA, and Qualcomm Technologies. It defined four concrete formats (MXFP8, MXFP6, MXFP4, and MXINT8), each pairing its elements with a shared scale factor in the E8M0 format (an 8-bit unsigned exponent ranging from 2^-127 to 2^127) and a block size of 32 elements.^[4]^[5] The companion Microsoft Research paper Microscaling Data Formats for Deep Learning (Rouhani et al., October 2023, arXiv:2310.10537) provided empirical evidence that sub-8-bit MX formats could serve as drop-in replacements for FP32 across more than two dozen workloads with negligible quality loss; the authors report that "empirical results on over two dozen benchmarks demonstrate practicality of MX data formats as a drop-in replacement for baseline FP32 for AI inference and training with low user friction."^[5]

NVIDIA Blackwell and native FP4

On 18 March 2024 NVIDIA introduced the Blackwell GPU architecture at GTC, including the B200 discrete accelerator and the GB200 Grace-Blackwell superchip; the B200 was the first GPU to expose FP4 directly to the Tensor Cores, advertising 20 PFLOPS of FP4 throughput per GPU with sparsity (about half that for dense math) and a fifth-generation Transformer Engine that automatically routes layers across FP4, FP6, FP8, and higher precisions.^[3]^[12] On 6 January 2025 NVIDIA extended FP4 to consumer hardware with the GeForce RTX 50 series (Blackwell consumer architecture), making the RTX 50 the first consumer GPU family to accelerate FP4 inference natively.^[13]

NVFP4 release

NVIDIA introduced its own micro-scaled FP4 variant, NVFP4, alongside the same Blackwell launch but described it in detail only in a developer blog dated 24 June 2025, which laid out the dual-level scaling, the block size of 16 elements, and the use of an E4M3 FP8 scale per block.^[6] NVIDIA describes the result plainly: "NVFP4 is an innovative 4-bit floating point format introduced with the NVIDIA Blackwell GPU architecture."^[6] A research paper from the NVIDIA team, Pretraining Large Language Models with NVFP4 (Felix Abecassis, Paulius Micikevicius, Asit Mishra, and others; arXiv:2509.25149, 29 September 2025), then demonstrated end-to-end pretraining of a 12-billion-parameter model on 10 trillion tokens at NVFP4 precision, matching FP8 baselines on MMLU-Pro to within 0.04 percentage points.^[7]

How does FP4 work technically?

Why floating point, not integer

INT4 maps the 16 codes uniformly across a chosen range, so it spends as much resolution on the sparsely populated tails as on the dense region around zero. FP4 instead places representable values at uneven, log-like spacing, providing fine resolution near zero and coarse resolution near the maximum. For weight tensors in trained transformers, which approximately follow zero-centred Gaussian or Laplace distributions, this layout is closer to information-theoretically efficient and is also more robust to the outlier activations that commonly appear in attention layers.^[8]^[14]

The need for block scaling

The dynamic range of bare E2M1 (12 to 1, the ratio of the largest magnitude, 6, to the smallest non-zero magnitude, 0.5) is far too narrow for raw weight or activation tensors. Every practical FP4 pipeline therefore pairs E2M1 elements with a block-shared scale: each contiguous group of k elements is divided by a per-block scale before quantisation and multiplied back after dequantisation. The micro-scaling family fixes k and a scale format and standardises both in hardware. This block-floating-point idea is decades old but only recently became economically viable as an in-silicon datatype.^[11]
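A minimal Python sketch can illustrate why the block-shared scale matters. It assumes a plain absmax scale stored as an ordinary float (no particular standard's scale format) and a simplified nearest-value rounding; fake_quantise and the toy tensor are illustrative names and data, not part of any library.

```python
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def round_to_e2m1(x: float) -> float:
    """Snap x to the nearest E2M1 magnitude, saturating at +/-6 (simplified tie handling)."""
    mag = min(E2M1_GRID, key=lambda g: abs(abs(x) - g))
    return -mag if x < 0 else mag


def fake_quantise(values, block_size):
    """Quantise and immediately dequantise, using one absmax scale per block."""
    out = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block)
        scale = amax / 6.0 if amax > 0 else 1.0  # map the block maximum onto the largest code
        out.extend(round_to_e2m1(v / scale) * scale for v in block)
    return out


def max_abs_error(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


# Toy tensor: four small values followed by four large ones.
tensor = [0.01, -0.02, 0.03, -0.04] + [10.0, -20.0, 30.0, -40.0]

per_tensor = fake_quantise(tensor, block_size=len(tensor))  # one scale for everything
per_block = fake_quantise(tensor, block_size=4)             # one scale per group of four

print(per_tensor[:4])  # the small values all collapse to zero
print(per_block[:4])   # the small values survive with their own scale
```

With a single scale the large entries dictate the grid and the small entries vanish; giving each group its own scale keeps both regions representable.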

What is the difference between MXFP4 and NVFP4?

MXFP4 and NVFP4 share the same E2M1 4-bit element but differ in block size, scale format, and the number of scaling levels. MXFP4 (OCP) groups 32 elements under a single power-of-two E8M0 scale, while NVFP4 (NVIDIA) groups only 16 elements under a fractional FP8 (E4M3) scale and adds a second FP32 per-tensor scale. The smaller block and fractional scale make NVFP4 more accurate per bit, at the cost of about 0.25 extra bits per value and NVIDIA-only hardware.^[4]^[6]
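The storage figures for the two formats follow from simple arithmetic, sketched here in Python. The helper bits_per_value is an illustrative name, and the calculation ignores NVFP4's single FP32 per-tensor scalar, which is negligible for large tensors.

```python
def bits_per_value(element_bits: int, block_size: int, scale_bits: int) -> float:
    """Average storage per element when one scale is shared by block_size elements."""
    return element_bits + scale_bits / block_size


mxfp4 = bits_per_value(element_bits=4, block_size=32, scale_bits=8)  # E8M0 scale
nvfp4 = bits_per_value(element_bits=4, block_size=16, scale_bits=8)  # E4M3 scale

print(mxfp4, nvfp4, nvfp4 - mxfp4)                # 4.25 4.5 0.25
print(round(16 / nvfp4, 2), round(8 / nvfp4, 2))  # 3.56 1.78
```

The second line shows the compression ratio of NVFP4 relative to 16-bit and 8-bit storage under the same assumptions.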

MXFP4

The OCP MXFP4 variant uses:

Element format: E2M1 (4 bits)
Block size: 32 elements
Scale format: E8M0 (8 bits, unsigned, biased exponent)

Total storage per 32-element block is 32 * 4 + 8 = 136 bits, equal to 4.25 bits per element on average.^[11] The E8M0 scale is a pure power-of-two: the dequantised value is element * 2^(scale - 127). Because the scale can take any integer exponent between -127 and 127, MXFP4 can represent enormous dynamic range, but the absence of fractional scaling means every block is rounded up to the nearest power-of-two granularity, which limits precision compared to fractional scale formats.^[11]
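The following Python sketch shows one way an MXFP4 block could be quantised and dequantised. The scale-selection rule used here (shared exponent = floor(log2(block maximum)) minus 2, the exponent of the largest power of two in E2M1) is an assumption modelled on commonly described conversions, and the function names are hypothetical; real implementations may choose the scale and handle ties differently.

```python
import math

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_EMAX = 2     # exponent of the largest power of two in E2M1 (4 = 2**2)
E8M0_BIAS = 127


def round_to_e2m1(x: float) -> float:
    """Snap x to the nearest E2M1 magnitude, saturating at +/-6 (simplified tie handling)."""
    mag = min(E2M1_GRID, key=lambda g: abs(abs(x) - g))
    return -mag if x < 0 else mag


def mxfp4_quantise_block(block):
    """Return (biased E8M0 scale byte, list of E2M1 element values) for one block."""
    amax = max(abs(v) for v in block)
    if amax == 0:
        return E8M0_BIAS, [0.0] * len(block)
    # Assumed rule: align the block maximum with the top binade of E2M1.
    shared_exp = math.floor(math.log2(amax)) - E2M1_EMAX
    shared_exp = max(-127, min(127, shared_exp))
    scale = 2.0 ** shared_exp
    return shared_exp + E8M0_BIAS, [round_to_e2m1(v / scale) for v in block]


def mxfp4_dequantise_block(scale_byte, elements):
    """Apply element * 2^(scale - 127), as described in the text."""
    return [e * 2.0 ** (scale_byte - E8M0_BIAS) for e in elements]


block = [0.1, -0.3, 0.7, 1.0] * 8  # 32 elements
scale_byte, elements = mxfp4_quantise_block(block)
restored = mxfp4_dequantise_block(scale_byte, elements)

print(scale_byte)     # 125, i.e. a scale of 2**-2
print(elements[:4])   # [0.5, -1.0, 3.0, 4.0]
print(restored[:4])   # [0.125, -0.25, 0.75, 1.0]
```

In this example the block maximum of 1.0 lands on the code 4 rather than 6, because a power-of-two scale cannot stretch the grid to fit exactly; that unused headroom is the precision cost mentioned above.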

MXFP4 is natively supported by NVIDIA Blackwell tensor cores, the AMD Instinct MI400 series, AWS Trainium, and several recent neural processing units (NPUs).^[1]^[15] OpenAI's gpt-oss model release (2025) used MXFP4 for mixture-of-experts weight storage, allowing the 120-billion-parameter variant to fit within a single 80 GB GPU.^[15]

NVFP4

NVFP4 is NVIDIA's micro-scaled FP4 variant with three key differences from MXFP4:^[6]

Block size: 16 elements instead of 32. A smaller block tracks the tensor's dynamic range more locally, reducing the in-block variance that has to be absorbed by the 4-bit codes.
Block scale format: E4M3 FP8 instead of E8M0. E4M3 trades exponent range for mantissa, allowing fractional (non-power-of-two) per-block scaling, which lowers block-level rounding error.
Outer tensor scale: a single FP32 scalar applied at the whole-tensor level, used to bring the FP8 block scales themselves into their representable range. This is the "two-level" or "dual-level" scaling that NVIDIA documentation emphasises.

NVIDIA describes the design as a "two-level micro-block scaling strategy" that "applies a fine-grained E4M3 scaling factor to each 16-value micro-block, a compact subset of the larger tensor, while also leveraging a second-level FP32 scalar applied per tensor."^[6] Total NVFP4 storage per 16-element block is 16 * 4 + 8 = 72 bits, equal to 4.5 bits per element on average, plus a negligible FP32 tensor-wide overhead.^[6] The format gives roughly a 3.5x memory reduction versus FP16 and a 1.8x reduction versus FP8 while keeping accuracy close to FP8 on tasks NVIDIA has measured.^[6] By construction at least one value per block (the block-maximum) is stored near FP8 precision because the E4M3 scale is fitted to it, while the remaining values fall back to native FP4 precision.^[6]
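The two-level idea can be sketched in Python as a quantise-then-dequantise round trip. This is an illustration under stated assumptions, not NVIDIA's implementation: it assumes the E4M3 maximum is 448, picks the per-tensor scale so that the largest block scale lands on that maximum, and uses a simplified E4M3 rounding helper; all function names are hypothetical.

```python
import math

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_MAX = 6.0
E4M3_MAX = 448.0  # assumed largest finite value of the FP8 E4M3 encoding


def round_to_e2m1(x: float) -> float:
    """Snap x to the nearest E2M1 magnitude, saturating at +/-6 (simplified tie handling)."""
    mag = min(E2M1_GRID, key=lambda g: abs(abs(x) - g))
    return -mag if x < 0 else mag


def round_to_e4m3(x: float) -> float:
    """Round a non-negative float to a nearby E4M3 value (3 mantissa bits, saturating)."""
    if x <= 0:
        return 0.0
    exp = max(math.floor(math.log2(x)), -6)  # -6: assumed smallest normal exponent
    step = 2.0 ** (exp - 3)                  # spacing of representable values in this binade
    return min(round(x / step) * step, E4M3_MAX)


def nvfp4_fake_quantise(tensor, block_size=16):
    """Round trip with an FP32 per-tensor scale and an E4M3 scale per block."""
    tensor_amax = max(abs(v) for v in tensor)
    if tensor_amax == 0:
        return list(tensor)
    # Outer scale: keeps every block scale inside the E4M3 range.
    tensor_scale = tensor_amax / (E2M1_MAX * E4M3_MAX)
    out = []
    for start in range(0, len(tensor), block_size):
        block = tensor[start:start + block_size]
        block_amax = max(abs(v) for v in block)
        block_scale = round_to_e4m3(block_amax / E2M1_MAX / tensor_scale)
        scale = block_scale * tensor_scale
        if scale == 0:
            out.extend(0.0 for _ in block)
            continue
        out.extend(round_to_e2m1(v / scale) * scale for v in block)
    return out


# One block of small values and one block of large values.
tensor = [0.02 * i for i in range(16)] + [5.0 * i for i in range(16)]
restored = nvfp4_fake_quantise(tensor)

print(tensor[15], restored[15])  # block maximum of the small block, recovered to within a few percent
print(tensor[31], restored[31])  # block maximum of the large block
```

Because the block scale is fractional, each block maximum maps onto the top E2M1 code and is recovered up to the rounding error of the FP8 scale, while the remaining entries are limited by the 4-bit grid.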

Rounding and outlier handling

Because each FP4 code covers a wider value range than its FP8 counterpart, the rounding strategy matters more than at higher precision.

Round-to-nearest-even (RNE) is the standard deterministic option and is used for the forward pass in most published recipes.^[16]^[14]
Stochastic rounding (SR) rounds a value up with probability equal to its fractional position between the two neighbouring representable values (and down otherwise), so the rounded result equals the input in expectation. SR is essential for backward and update passes during FP4 training because round-to-nearest produces a biased gradient estimate that prevents convergence at such low precision.^[16]^[14]
Random Hadamard transforms (RHT), which mix tensor entries by a random orthogonal matrix before quantisation, are used by both the NVIDIA NVFP4 pretraining recipe and the "Training LLMs with MXFP4" paper to bound the variance introduced by stochastic rounding in the presence of block-level outliers.^[7]^[17]
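The difference between the two rounding modes can be seen in a small Python sketch on the bare E2M1 grid (no block scale). The function names are illustrative, and the tie rule assumes that even-indexed grid points are the ones whose mantissa bit is zero.

```python
import random

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def neighbours(mag: float):
    """Return the grid points (lo, hi) bracketing a magnitude in [0, 6]."""
    for lo, hi in zip(E2M1_GRID, E2M1_GRID[1:]):
        if lo <= mag <= hi:
            return lo, hi
    raise ValueError("magnitude outside the E2M1 range")


def round_nearest_even(x: float) -> float:
    sign = -1.0 if x < 0 else 1.0
    mag = abs(x)
    lo, hi = neighbours(mag)
    if mag - lo != hi - mag:
        return sign * (lo if mag - lo < hi - mag else hi)
    # Tie: pick the neighbour whose mantissa bit is 0 (even index in the grid).
    return sign * (lo if E2M1_GRID.index(lo) % 2 == 0 else hi)


def round_stochastic(x: float, rng: random.Random) -> float:
    sign = -1.0 if x < 0 else 1.0
    mag = abs(x)
    lo, hi = neighbours(mag)
    p_up = (mag - lo) / (hi - lo)  # fractional position between the neighbours
    return sign * (hi if rng.random() < p_up else lo)


rng = random.Random(0)
x = 2.25  # a quarter of the way from 2 to 3
samples = [round_stochastic(x, rng) for _ in range(20000)]
mean = sum(samples) / len(samples)

print(round_nearest_even(x))  # 2.0 every time, a systematic error of -0.25
print(mean)                   # close to 2.25 on average
```

Round-to-nearest always returns 2.0 for this input, whereas stochastic rounding returns 3.0 about a quarter of the time, which removes the bias at the cost of per-sample noise.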

How does Blackwell use FP4?

The Blackwell B200 advertises about 10 PFLOPS of dense FP4 throughput per GPU (about 20 PFLOPS with sparsity), twice the FP8 throughput and four times the FP16 throughput.^[3]^[18] On the GB200 NVL72 system, which packs 72 Blackwell GPUs into a single liquid-cooled rack, NVIDIA reports about 720 PFLOPS of dense FP4 and up to 1.44 EFLOPS of FP4 inference with sparsity at rack scale, which matches 72 times the per-GPU figures.^[3] Consumer Blackwell silicon (RTX 50 series) inherits the same FP4 tensor cores; FP4 doubles raw throughput and halves model footprint relative to FP8, with NVIDIA reporting a 2x speed-up for FLUX image generation on the RTX 5090 against FP16 on the RTX 4090.^[13]

Blackwell exposes FP4 through a fifth-generation Transformer Engine that, layer by layer, selects FP4, FP6, FP8, or a higher precision automatically, so that only the layers that tolerate 4-bit math actually run there.^[3]

How do FP4 variants compare?

The 4-bit and 8-bit landscape now includes several closely related but incompatible formats. The most important features are summarised below.

Format	Element type	Block size	Scale format	Avg bits/value	Hardware native
INT4	signed 4-bit integer	per-channel typical	FP16/FP32	~4.x	Hopper, Ada, Blackwell (W4A16)
NF4	4-bit non-uniform code	64 (typical)	FP32 + 2nd-level FP8	~4.1 (with double quant)	Software only (bitsandbytes)
MXFP4	E2M1 (4 bits)	32	E8M0 (8 bits, power of two)	4.25	Blackwell, MI400, Trainium
NVFP4	E2M1 (4 bits)	16	E4M3 FP8 + FP32 tensor scalar	4.5 + epsilon	Blackwell only
FP8 (E4M3)	E4M3 (8 bits)	per-tensor or block	FP32	8.x	Hopper, Ada, Blackwell
INT8	signed 8-bit integer	per-channel or per-token	FP16/FP32	~8.x	All recent GPUs

^[4]^[6]^[9]^[11]^[19]^[14]

Comparison with NF4

NF4 is a lookup-table quantisation format introduced by QLoRA: the 16 codes are chosen to be quantiles of a standard normal distribution rather than samples of a fixed floating-point grid. Because pretrained transformer weights are approximately normally distributed, NF4 achieves lower perplexity than FP4 and INT4 on most weight-only fine-tuning benchmarks.^[9] However, NF4 has no efficient hardware implementation: every NF4 element must be looked up to recover a 16-bit value before computation, which is acceptable for memory-bound finetuning but uncompetitive for the compute-bound forward pass at inference time. FP4 (and the MX/NVFP4 variants) inverts the trade-off: slightly worse for memory-bound workloads but far faster on hardware that supports it.^[9]^[14]

Comparison with INT4

INT4 weight-only quantisation (W4A16) is the dominant low-precision inference path on Hopper and Ada-Lovelace GPUs through algorithms such as GPTQ and AWQ. INT4 typically stays within 1 to 3 percentage points of FP16 accuracy on standard LLM benchmarks but cannot be used for activations without significant loss because of attention outliers.^[14] FP4 with block scaling can quantise both weights and activations (W4A4), unlocking the full doubling of tensor-core throughput; in 2026 NVIDIA-published evaluations on Llama 3 family models, NVFP4 W4A4 closes most of the accuracy gap relative to FP8.^[6]

Comparison with FP8 and INT8

FP8 became the workhorse low-precision format on the NVIDIA Hopper generation and is essentially lossless versus FP16 for almost all transformer workloads.^[14] FP4 doubles throughput relative to FP8 but adds non-trivial accuracy risk: published NVFP4 results show typical degradation of well under a percentage point on instruction-tuned models, while older or single-level FP4 recipes can lose several percentage points without careful calibration.^[6]^[7] INT8 with proper per-channel and per-token scaling remains competitive with FP8 on many tasks and is widely deployed on edge devices.^[14]

What software supports FP4?

NVIDIA TransformerEngine

NVIDIA's TransformerEngine (open-source, Apache-2.0) is the reference implementation of Blackwell low-precision training and inference. From the 2.x release line it exposes a NVFP4BlockScaling quantisation recipe with options for stochastic rounding, random Hadamard transforms, and per-axis 2D quantisation (separate quantisation grids along the row and column axis of each weight matrix, so that the matmul partner sees a transposed view that is itself well quantised).^[16] A helper function is_nvfp4_available() lets user code detect at runtime whether the underlying device has Blackwell-class tensor cores.^[16] The same library also implements MXFP8 and MXFP4 recipes following the OCP specification.^[16]

TensorRT-LLM and TensorRT Model Optimizer

TensorRT-LLM gained NVFP4 support in version 0.17, including W4A4 quantised attention and KV-cache compression in FP4.^[20] The recommended workflow is post-training calibration with NVIDIA Model Optimizer to compute per-block scales and per-tensor outer scales, followed by engine compilation in TensorRT-LLM. On Blackwell SM 100/103 devices the framework supports the broadest range of formats: NVFP4, MXFP4, FP8 per-tensor, block-scaling, and rowwise variants, plus W4A8 and W4A16 weight-only paths for AWQ and GPTQ.^[20]

llm-compressor and vLLM

The llm-compressor library (maintained as part of the vLLM project) provides a QuantizationModifier with an "NVFP4" scheme that converts a HuggingFace transformer checkpoint into a compressed-tensors NVFP4 weight file in a few lines of code.^[19] After calibration the resulting checkpoint can be served by vLLM on Blackwell hardware; on devices below SM 100 vLLM falls back to weight-only NVFP4 (W4A16) because activation quantisation requires native FP4 tensor cores.^[19]

HuggingFace Hub model availability

By early 2026 a substantial catalogue of pre-quantised NVFP4 and MXFP4 models was available on the HuggingFace Hub, including official NVIDIA releases of Llama 3.1 8B, Llama 3.1 405B, Llama 3.3 70B, Llama 4 Scout, DeepSeek-R1, and DeepSeek V3.2 along with Red Hat AI ports of Qwen3 8B/14B/32B and Gemma checkpoints; all were produced with llm-compressor and target vLLM or TensorRT-LLM as the inference backend.^[19]^[20]

Other ecosystems

OpenAI's gpt-oss release used MXFP4 weight storage in its mixture-of-experts layers, fitting a 120-billion-parameter model on a single 80 GB GPU.^[15] AMD and Intel have publicly committed to MXFP4 in subsequent generations of accelerators (Instinct MI400 and Gaudi-class hardware respectively), giving the OCP MX family the broader multi-vendor footprint of the two FP4 standards.^[1]^[11]

What is FP4 used for?

Inference

The primary commercial driver of FP4 is large-language-model inference. By halving weight and KV-cache footprint relative to FP8, FP4 lets a given accelerator fit larger models, longer contexts, and more concurrent sessions; at the same time tensor-core throughput doubles. In practice NVIDIA reports about 2.3 times higher throughput for 4-bit LLMs on Blackwell against FP8 baselines while preserving accuracy.^[6] Image-generation models such as FLUX show a 2x speed-up at half the memory footprint when run in FP4 on the RTX 5090 compared to FP16 on the RTX 4090.^[13]

Pretraining

Although FP8 has rapidly become the default precision for frontier pretraining, NVIDIA's Pretraining Large Language Models with NVFP4 showed that an entire 10-trillion-token pretrain at 12 billion parameters can be performed predominantly in NVFP4 with downstream accuracy on par with FP8 (62.58% versus 62.62% on MMLU-Pro).^[7] The earlier academic FP4 All the Way result (Chmiel et al., May 2025) reached the same conclusion on a 7-billion-parameter run on Intel Gaudi 2 accelerators using a software-emulated NVFP4-shaped format.^[14] Training LLMs with MXFP4 (Tseng et al., February 2025) and the Quartet family of papers (Castro et al., 2025) extended the recipe to MXFP4 and showed that stochastic rounding plus random Hadamard transforms is the dominant technique for stable FP4 training.^[17]^[21]

Finetuning

QLoRA-style finetuning historically used NF4 rather than FP4 because the matrix multiplication runs in 16-bit anyway; on Blackwell hardware, NVFP4-quantised base weights paired with LoRA adapters compute the forward pass at native FP4 tensor-core speed, restoring the throughput advantage of low precision to the finetuning workflow.^[16]^[19]

What are the limitations of FP4?

Quantisation noise dominates at small block sizes

FP4 has at most 16 representable codes, so the per-element quantisation error is large in absolute terms. The MX and NVFP4 strategies hide this by absorbing range into a per-block scale, but each new layer of scaling shifts a tiny budget away from the codeword: NVFP4 already spends 0.5 bits per value on its FP8 scale alone, and adding a higher-precision per-tensor scalar increases overhead further. There is an information-theoretic floor below which adding more scale layers cannot recover lost precision.^[6]^[11]

Outlier activations

Transformer activations contain a small fraction of outlier values whose magnitudes can be hundreds of times larger than the typical entry. Without intervention, these outliers force the block scale to enlarge, pushing the bulk of the distribution into the codes near zero and destroying useful precision. Random Hadamard transforms, learned per-channel scaling (as in SmoothQuant), and rotation-based schemes have all been used to redistribute outlier mass before FP4 quantisation.^[7]^[17]
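How a Hadamard rotation redistributes outlier mass can be shown with a small Python sketch. It uses a textbook fast Walsh-Hadamard transform with random sign flips; this is a generic illustration of the technique, not the exact transform used in any of the cited recipes, and the function names are illustrative.

```python
import math
import random


def hadamard_transform(vec):
    """Orthonormal fast Walsh-Hadamard transform; len(vec) must be a power of two."""
    n = len(vec)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(vec)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b  # butterfly step
        h *= 2
    norm = math.sqrt(n)
    return [v / norm for v in out]


def random_hadamard(vec, rng):
    """Flip signs at random, then apply the Hadamard transform."""
    signs = [rng.choice((-1.0, 1.0)) for _ in vec]
    return hadamard_transform([s * v for s, v in zip(signs, vec)])


rng = random.Random(0)
block = [0.1] * 15 + [32.0]  # one outlier dominates the block
rotated = random_hadamard(block, rng)

print(max(abs(v) for v in block))    # 32.0
print(max(abs(v) for v in rotated))  # at most 8.375: the outlier is spread over all 16 entries
```

Because the transform is orthogonal, it preserves the vector's norm and can be undone after the low-precision matrix multiplication, so the block scale no longer has to stretch to cover a single extreme entry.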

Gradient bias during training

Round-to-nearest in FP4 introduces a measurable bias into accumulated gradients because the codeword spacing is uneven and the rounding error is correlated with the underlying tensor. Stochastic rounding restores zero-mean error in expectation but increases per-step variance, and the Chmiel et al. analysis quantified a training-effectiveness threshold: when the gradient norm falls below roughly √3 (about 1.73) times the quantisation noise, further FP4 updates contribute almost nothing to convergence.^[14] In practice this means the last few layers of LLMs, where signals are smallest, are usually kept in higher precision (MXFP8 or FP16) even in otherwise FP4 training runs.^[7]

Double quantisation and per-block scaling overhead

The NF4-derived idea of "double quantisation" (storing the per-block scale itself in a quantised form) saves about 0.37 bits per parameter in QLoRA configurations and was a key innovation in fitting a 65-billion-parameter model on a single 48 GB GPU.^[9] MXFP4 keeps scale overhead low in a different way, by using a cheap 8-bit E8M0 power-of-two scale shared by 32 elements, while NVFP4 spends its 8 scale bits on a fractional FP8 value shared by only 16 elements and relies on the outer FP32 tensor scalar to recover the exponent range that E4M3 lacks. The choice between these strategies is fundamentally a question of how much variance lives inside each block versus across blocks of the same tensor, and the right answer depends on layer and model.^[6]^[11]
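The 0.37-bit figure can be reproduced with a short Python calculation. The block sizes used here (64 weights per first-level scale, 256 first-level scales per second-level FP32 scale, with first-level scales stored in 8 bits) are taken from the QLoRA paper's description rather than from this article, and the helper name is illustrative.

```python
def scale_overhead_bits(block_size, scale_bits, second_level_block=None, second_level_bits=0):
    """Per-parameter cost of storing block scales, optionally with a second-level scale."""
    overhead = scale_bits / block_size
    if second_level_block is not None:
        # One second-level scale covers block_size * second_level_block parameters.
        overhead += second_level_bits / (block_size * second_level_block)
    return overhead


plain = scale_overhead_bits(block_size=64, scale_bits=32)
double = scale_overhead_bits(block_size=64, scale_bits=8,
                             second_level_block=256, second_level_bits=32)

print(plain, round(double, 3), round(plain - double, 3))  # 0.5 0.127 0.373
```

For comparison, the same helper gives 8 / 32 = 0.25 bits per value for the MXFP4 scale and 8 / 16 = 0.5 bits for the NVFP4 block scale.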

Hardware fragmentation

FP4 is currently supported natively on only a small fraction of deployed AI accelerators (NVIDIA Blackwell datacentre and consumer Blackwell, AMD Instinct MI400, AWS Trainium, and a handful of NPUs). NVFP4 in particular is NVIDIA-only. Older Hopper and Ada-Lovelace GPUs can store FP4 weights and dequantise on the fly into FP16 for tensor-core math, but that yields no compute speed-up. Software stacks therefore have to ship both true-FP4 and emulated-FP4 paths.^[16]^[19]^[20]

How does FP4 compare with other low-precision research?

The most cited contemporary 4-bit floating-point papers fall into three groups.

Training-focused: FP4 All the Way (Chmiel, Fishman, Banner, Soudry, 2025), Training LLMs with MXFP4 (Tseng, Yu, Park, 2025), Quartet: Native FP4 Training Can Be Optimal (Castro et al., 2025), and NVIDIA's Pretraining Large Language Models with NVFP4 (2025).^[7]^[14]^[17]^[21]
Inference-focused: LLM-FP4 (Liu et al., 2023), Microscaling Data Formats for Deep Learning (Rouhani et al., 2023), and NVIDIA's NVFP4 developer-blog announcement (2025).^[5]^[6]^[10]
Format design: the OCP MX v1.0 specification (2023) and subsequent variants such as NVFP4's two-level scaling.^[4]^[6]^[11]

A clear consensus has emerged across these works: native 4-bit floating point is viable both for inference and for pretraining provided that (a) block sizes are small (16 to 32 elements), (b) at least one scale level uses a fractional (non-power-of-two) representation, (c) stochastic rounding and Hadamard or rotation transforms are applied during training, and (d) a handful of the most sensitive layers remain in higher precision.

ELI5: what is FP4 in plain terms?

A computer normally stores each number in a neural network using 16 or 32 bits, which is precise but bulky. FP4 squeezes each number into just 4 bits, so it can only pick from 16 possible values (like a paint set with 16 colours instead of millions). On its own that is far too coarse, so FP4 groups numbers into small blocks of 16 or 32 and gives each block its own "volume knob" (a shared scale) that stretches those 16 values to fit whatever range that block needs. The payoff: a model takes up a fraction of the memory and runs about twice as fast on chips like NVIDIA Blackwell, while staying almost as accurate as the bulkier formats.^[3]^[6]

References

[1] Emergent Mind, "FP4 Precision: Low-Bit Efficiency", emergentmind.com, 2025. https://www.emergentmind.com/topics/fp4-precision. Accessed 2026-05-21.
[2] John D. Cook, "4-bit floating point FP4", johndcook.com, 2026-04-17. https://www.johndcook.com/blog/2026/04/17/fp4/. Accessed 2026-05-21.
[3] NVIDIA, "NVIDIA Blackwell Platform Arrives to Power a New Era of Computing", NVIDIA Newsroom, 2024-03-18. https://nvidianews.nvidia.com/news/nvidia-blackwell-platform-arrives-to-power-a-new-era-of-computing. Accessed 2026-05-21.
[4] Open Compute Project, "OCP Microscaling Formats (MX) Specification v1.0", opencompute.org, 2023-09. https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf. Accessed 2026-05-21.
[5] Bita Darvish Rouhani et al., "Microscaling Data Formats for Deep Learning", arXiv:2310.10537, 2023-10-16. https://arxiv.org/abs/2310.10537. Accessed 2026-05-21.
[6] NVIDIA, "Introducing NVFP4 for Efficient and Accurate Low-Precision Inference", NVIDIA Technical Blog, 2025-06-24. https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/. Accessed 2026-05-21.
[7] Felix Abecassis, Paulius Micikevicius, Asit Mishra et al., "Pretraining Large Language Models with NVFP4", arXiv:2509.25149, 2025-09-29. https://arxiv.org/abs/2509.25149. Accessed 2026-05-21.
[8] Shih-Yang Liu et al., "LLM-FP4: 4-Bit Floating-Point Quantized Transformers", GitHub repository, 2023. https://github.com/nbasyl/LLM-FP4. Accessed 2026-05-21.
[9] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer, "QLoRA: Efficient Finetuning of Quantized LLMs", arXiv:2305.14314, 2023-05-23. https://arxiv.org/abs/2305.14314. Accessed 2026-05-21.
[10] Shih-Yang Liu et al., "LLM-FP4: 4-Bit Floating-Point Quantized Transformers", arXiv:2310.16836, 2023-10-25. https://arxiv.org/abs/2310.16836. Accessed 2026-05-21.
[11] FPRox, "OCP MX Scaling Formats", fprox.substack.com, 2024. https://fprox.substack.com/p/ocp-mx-scaling-formats. Accessed 2026-05-21.
[12] Wikipedia, "Blackwell (microarchitecture)", en.wikipedia.org, 2024-03-18 announcement context. https://en.wikipedia.org/wiki/Blackwell_(microarchitecture). Accessed 2026-05-21.
[13] NVIDIA, "NVIDIA TensorRT Unlocks FP4 Image Generation for NVIDIA Blackwell GeForce RTX 50 Series GPUs", NVIDIA Technical Blog, 2025-01-06. https://developer.nvidia.com/blog/nvidia-tensorrt-unlocks-fp4-image-generation-for-nvidia-blackwell-geforce-rtx-50-series-gpus/. Accessed 2026-05-21.
[14] Brian Chmiel, Maxim Fishman, Ron Banner, Daniel Soudry, "FP4 All the Way: Fully Quantized Training of LLMs", arXiv:2505.19115, 2025-05-25. https://arxiv.org/abs/2505.19115. Accessed 2026-05-21.
[15] Abdullah Grewal, "MXFP4, FP4, and FP8: How GPT-OSS Runs 120B Parameters on an 80GB GPU with MoE Weight Quantization", Medium, 2025. https://buzzgrewal.medium.com/mxfp4-fp4-and-fp8-how-gpt-oss-runs-120b-parameters-on-an-80gb-gpu-with-moe-weight-quantization-db26b57fd787. Accessed 2026-05-21.
[16] NVIDIA, "NVFP4", Transformer Engine 2.16 documentation, 2026. https://nvidia.github.io/TransformerEngine/features/low_precision_training/nvfp4/nvfp4.html. Accessed 2026-05-21.
[17] Albert Tseng, Tao Yu, Youngsuk Park, "Training LLMs with MXFP4", arXiv:2502.20586, 2025-02-28. https://arxiv.org/abs/2502.20586. Accessed 2026-05-21.
[18] Wikipedia (navigational), "Blackwell (microarchitecture): B200 and GB200 specifications", en.wikipedia.org, 2024. https://en.wikipedia.org/wiki/Blackwell_(microarchitecture). Accessed 2026-05-21.
[19] vLLM Project, "fp4 Quantization with NVFP4", LLM Compressor Docs, 2025. https://docs.vllm.ai/projects/llm-compressor/en/latest/examples/quantization_w4a4_fp4/. Accessed 2026-05-21.
[20] NVIDIA, "Quantization", TensorRT-LLM documentation, 2026. https://nvidia.github.io/TensorRT-LLM/latest/features/quantization.html. Accessed 2026-05-21.
[21] Roberto L. Castro et al., "Quartet: Native FP4 Training Can Be Optimal for Large Language Models", arXiv:2505.14669, 2025-05-20. https://arxiv.org/abs/2505.14669. Accessed 2026-05-21.
