NVFP4

12 min read

Updated Jul 23, 2026

NVFP4 (NVIDIA FP4) is a 4-bit floating-point number format introduced by Nvidia with the Blackwell GPU architecture. It stores each value in just 4 bits using an E2M1 layout (1 sign bit, 2 exponent bits, 1 mantissa bit) and recovers accuracy with a two-level micro-block scaling scheme: every group of 16 elements shares an 8-bit FP8 E4M3 scale factor, and the whole tensor carries one additional FP32 global scale. ^[1]^[2] NVFP4 targets both low-precision inference for large models and full 4-bit pretraining of billion-parameter language models, and it is NVIDIA's accuracy-focused counterpart to the open MXFP4 standard, which uses coarser 32-element blocks and a power-of-two scale. ^[1]^[3]

NVFP4 runs natively on Blackwell's fifth-generation FP4 Tensor Cores, where NVIDIA reports up to a 4x throughput gain over FP8 for matrix multiplications, a roughly 3.5x smaller memory footprint than FP16, and per-token energy efficiency gains of up to 25x on Blackwell and 50x on Blackwell Ultra against an H100 baseline. ^[1] According to NVIDIA, the format "addresses this concern with two architectural innovations that make it highly effective for AI inference: high-precision scale encoding" and "a two-level micro-block scaling strategy." ^[1]

Why does 4-bit floating point matter?

Modern AI models are limited as much by memory and data movement as by raw compute. A weight stored in 16-bit bfloat16 takes 2 bytes; the same weight in 4-bit takes half a byte. Cutting each value from 16 bits to 4 shrinks the memory footprint by about 4x, which lets a given GPU hold a larger model, or the same model with longer context and bigger batches. Smaller operands also move faster through caches and across interconnects, and the matrix engines can pack more multiply-accumulate operations into each clock. For large language model inference, where the cost of serving a token is dominated by reading weights and activations out of memory, 4-bit formats translate fairly directly into higher throughput and lower energy per token.

The catch is precision. A 4-bit float can represent only a handful of distinct magnitudes, so naive rounding to 4 bits wrecks model accuracy. The interesting engineering in NVFP4 is not the 4-bit element itself, which is a known idea, but the scaling machinery wrapped around it that recovers most of the lost dynamic range.

What is the E2M1 element format?

Every NVFP4 value uses the E2M1 encoding: 1 sign bit, 2 exponent bits, and 1 mantissa bit. That is the same per-element format used by the FP4 type in the open Microscaling standard, and it is a true floating-point layout rather than an integer one, so it keeps the exponent's ability to span several orders of magnitude in very few bits. ^[1]^[5]

Four bits give 16 codes. After accounting for the sign, NVFP4 can represent these magnitudes: 0, 0.5, 1, 1.5, 2, 3, 4, and 6, together with their negatives. The largest representable magnitude is 6, so the raw element range runs from about -6 to +6. ^[1] For comparison, single-precision FP32 can express on the order of four billion distinct positive values; an E2M1 element offers eight positive ones. That scarcity is exactly why scaling does the heavy lifting.

How does NVFP4's two-level micro-block scaling work?

NVFP4 pairs its 4-bit elements with two separate scale factors that together stretch the narrow element range to cover the actual numbers in a tensor. ^[1]^[2]

The first level is a micro-block scale. NVFP4 groups values into blocks of 16 consecutive elements, and each block carries its own scale factor encoded in 8-bit FP8 using the E4M3 layout (4 exponent bits, 3 mantissa bits). Because E4M3 is itself a floating-point type with a mantissa, the block scale can take non-power-of-two values such as 1.25 or 3.5, which lets the encoder fit each 16-element block tightly to its own local range. ^[1]^[2]

The second level is a single FP32 scale applied to the whole tensor. This per-tensor factor sets the overall magnitude and extends the representable range so that the FP8 block scales themselves do not overflow or underflow. The reconstructed value is the 4-bit element multiplied by its block's FP8 scale and then by the tensor's FP32 scale. ^[1]^[2]

Storing one FP8 scale per 16 values adds 8 bits across 16 elements, which works out to half a bit per value, so NVFP4 costs roughly 4.5 bits per element once the block scale is amortized; the single FP32 per-tensor scale is negligible spread over a large tensor. ^[1] The point of the design is that a small 16-element block sees a narrower spread of magnitudes than a large block would, and a floating-point block scale can match that spread more precisely than a coarse power-of-two scale. Both choices reduce rounding error where it matters.

How does NVFP4 differ from MXFP4?

NVFP4 is closely related to MXFP4, the 4-bit member of the Microscaling (MX) family. The MX formats were standardized through the Open Compute Project in 2023 by an industry group that included AMD, Arm, Intel, Meta, Microsoft, NVIDIA, and Qualcomm. ^[5]^[6] MXFP4 also uses E2M1 elements, so the per-value encoding is identical. The differences are entirely in the scaling.

MXFP4 uses larger blocks of 32 elements, and its shared block scale is an E8M0 value, an 8-bit exponent with no mantissa that can only represent powers of two. ^[5]^[6] NVFP4 halves the block to 16 elements, switches the block scale to FP8 E4M3 so it can take fractional, non-power-of-two values, and adds the extra per-tensor FP32 scale that MXFP4 does not have. ^[1]^[2] Smaller blocks adapt to local dynamic range more finely, and a mantissa-bearing scale lands closer to the ideal multiplier for each block, so both changes cut quantization error. The trade is a slightly higher metadata cost, since NVFP4 carries one scale per 16 values versus one per 32 for MXFP4, and an extra global scale.

Property	NVFP4	MXFP4	FP8 (E4M3)
Element format	E2M1 (4-bit)	E2M1 (4-bit)	E4M3 (8-bit)
Bits per value	4 (about 4.5 with scale)	4 (about 4.25 with scale)	8
Block size	16 elements	32 elements	n/a
Block scale format	FP8 E4M3 (fractional)	E8M0 (power of two)	n/a
Second-level scale	per-tensor FP32	none	per-tensor (typical)
Standardized by	NVIDIA	Open Compute Project (MX)	OCP / vendor
Native hardware	Blackwell Tensor Cores	Blackwell, others	Hopper, Blackwell, others

Why does NVFP4 improve accuracy over MXFP4?

The accuracy edge comes from the two scaling changes working together. A 16-element block spans a narrower range of magnitudes than a 32-element block, so a single shared scale fits the values inside it more tightly and leaves less rounding error. On top of that, the FP8 E4M3 block scale is a real floating-point number with a mantissa, so it can land on a fractional multiplier like 1.25 or 3.5 that sits close to each block's true range, whereas MXFP4's E8M0 scale can only snap to the nearest power of two. The per-tensor FP32 global scale then keeps those FP8 block scales from overflowing or underflowing across the full dynamic range of the tensor. ^[1]^[2]

NVIDIA reports that these choices keep 4-bit accuracy close to higher-precision baselines: the format holds within "less than 1% degradation on key language modeling tasks" for the models it tested. ^[1] In training, the finer scaling also pays off in data efficiency. NVIDIA's pretraining study found that matching NVFP4's loss with MXFP4 required on the order of 36% more training tokens, which the authors attribute to NVFP4's tighter block size and fractional scale. ^[7]

How is NVFP4 supported on Blackwell hardware?

NVFP4 is a hardware format, not just a storage convention. The fifth-generation Tensor Cores in Blackwell implement NVFP4 directly: they handle the grouping of elements into blocks, apply the dynamic scales, and run the 4-bit matrix multiplies natively. ^[1] Because the math engines understand the format, tensors can stay in their compact 4-bit form all the way through a matrix multiply without a separate dequantization step into a wider type, which is where a lot of the throughput and energy benefit comes from. ^[1] NVFP4 is supported across the Blackwell data center lineup, including the B200 and B300 accelerators and the GB200 and GB300 Grace Blackwell systems. ^[1]

The speedups are large relative to wider formats. NVIDIA reports that 4-bit matrix multiplies (GEMMs) run at roughly 4x the throughput of bfloat16 on GB200 and about 6x on GB300, which works out to roughly 2x and 3x faster than FP8 on the same hardware. ^[3]^[4] On the memory side, NVFP4 cuts the model footprint by about 3.5x versus FP16 and about 1.8x versus FP8, and NVIDIA cites per-token energy efficiency gains of up to 25x for Blackwell and 50x for Blackwell Ultra against an H100 baseline. ^[1]

How is NVFP4 used for inference?

The first and most direct use of NVFP4 is post-training quantization for inference. A model trained in a higher precision such as FP8 or BF16 is converted to NVFP4 weights and activations, then served on Blackwell hardware. The fine-grained two-level scaling is what makes this viable: it preserves enough numerical detail that accuracy stays close to the higher-precision original. ^[1]^[2]

NVIDIA's reported example is the DeepSeek-R1 0528 reasoning model. Quantized from FP8 to NVFP4, it held accuracy within about 1 percent of the FP8 baseline across most language and reasoning benchmarks, and on the AIME 2024 math set the NVFP4 version actually scored about 2 percent higher, well inside normal run-to-run variation for that benchmark. ^[1] Independent analysis of NVFP4 inference has reported throughput on the order of 2.3x higher than FP8 while holding accuracy, consistent with the format's 4-bit width and Blackwell's native support. ^[2]

Can NVFP4 be used to train models, not just run them?

The more ambitious use of NVFP4 is training models in 4-bit precision from scratch, not just running them. In a 2025 paper, NVIDIA researchers pretrained a 12-billion-parameter hybrid Mamba-Transformer model on 10 trillion tokens using NVFP4 for the dominant matrix multiplications, which they describe as "the longest publicly documented training run in 4-bit precision to date." ^[3]^[7]

Naively training in 4-bit is unstable, so the recipe layers several techniques on top of the format. As the paper puts it, the method "integrates Random Hadamard transforms (RHT) to bound block-level outliers, employs a two-dimensional quantization scheme for consistent representations across both the forward and backward passes, utilizes stochastic rounding for unbiased gradient estimation, and incorporates selective high-precision layers." ^[7] In plain terms: Random Hadamard transforms reshape the distribution of values feeding the gradient computation, spreading out large outliers so they fit the format better; two-dimensional block scaling keeps a weight matrix and its transpose consistently scaled in the forward and backward passes; stochastic rounding replaces round-to-nearest for gradients so the quantized gradients stay unbiased on average; and a small number of numerically sensitive layers, such as the final layers, are kept in higher precision like BF16 rather than 4-bit. ^[3]^[7]

With those pieces in place, the NVFP4 run tracked an FP8 baseline closely. Validation loss stayed within roughly 1 percent of FP8 through the stable phase of training, widening to a bit over 1.5 percent during the learning-rate decay near the end, and downstream benchmark accuracy across knowledge, reasoning, math, and coding tasks came out comparable to the FP8 model. ^[3]^[7] The same work found that MXFP4 was less data-efficient for this kind of training: matching NVFP4's loss with MXFP4 required on the order of 36 percent more training tokens, which the authors attribute to NVFP4's finer scaling. ^[7] NVIDIA frames the result this way: "NVFP4 4-bit pretraining redefines the boundaries of efficiency and scalability, setting a new standard for high-performance AI model development." ^[3]

How does NVFP4 relate to quantization?

NVFP4 is a concrete instance of quantization, the broad practice of representing a model's numbers in fewer bits than the precision it was developed in. What distinguishes NVFP4 from older 4-bit quantization schemes is the combination of a floating-point element with block-level and tensor-level scaling baked into the hardware path, rather than an integer code book applied only in software. Its closest relatives are the other Microscaling formats, and it slots into the same broader trend across deep learning toward ever-lower precision: FP32 to FP16 and BF16, then to 8-bit FP8, and now to 4-bit FP4 variants for the workloads that can tolerate them. NVIDIA's TensorRT stack and its Transformer Engine library expose NVFP4 so that the quantization and the scale bookkeeping can be applied without hand-writing the low-level kernels. ^[1]^[8]

What are the limitations of NVFP4?

NVFP4 is not a universal replacement for wider formats. Eight magnitudes per element is very little resolution, so the format depends entirely on its scaling and on careful calibration; tensors with heavy outliers or unusual distributions can still lose accuracy, which is why outlier-handling tricks like Hadamard transforms show up in the training recipe. ^[3]^[7] Some layers remain too sensitive to quantize and are kept in higher precision, so a 4-bit model is in practice a mixed-precision model. The block and tensor scales add metadata and a small amount of overhead beyond the bare 4 bits. The biggest practical limit is hardware: native NVFP4 acceleration is specific to Blackwell-generation Tensor Cores, so the throughput and energy advantages do not carry over to older GPUs such as Hopper, which support FP8 but not native FP4. ^[1] As with any low-precision format, the reported accuracy results hold for the models and benchmarks NVIDIA tested, and results can vary for other architectures, tasks, or training regimes.

References

^NVIDIA. "Introducing NVFP4 for Efficient and Accurate Low-Precision Inference." NVIDIA Technical Blog. developer.nvidia.com/...te-low-precision-inference
^Marie, Benjamin. "NVFP4: Same Accuracy with 2.3x Higher Throughput for 4-Bit LLMs." Data Science Collective, Medium. medium.com/...roughput-for-4-bit-llms-03518ecba108
^NVIDIA. "NVFP4 Trains with Precision of 16-Bit and Speed and Efficiency of 4-Bit." NVIDIA Technical Blog. developer.nvidia.com/...ed-and-efficiency-of-4-bit
^Tom's Hardware. "Nvidia details efficiency of the NVFP4 format for LLM training." tomshardware.com/...ers-benefits-over-fp8-and-bf16
^Open Compute Project. "OCP Microscaling Formats (MX) Specification v1.0." opencompute.org/...-formats-mx-v1-0-spec-final-pdf
^Rouhani, B. D., et al. "Microscaling Data Formats for Deep Learning." arXiv:2310.10537. arxiv.org/...2310.10537
^NVIDIA. "Pretraining Large Language Models with NVFP4." arXiv:2509.25149. arxiv.org/...2509.25149
^NVIDIA. "Using FP8 and FP4 with Transformer Engine." Transformer Engine documentation. docs.nvidia.com/...fp8_primer

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

3 revisions by 1 contributors · v4 · 2,410 words · full history

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

Suggest edit

What links here

DeepGEMM FP4 (4-bit floating point)GLM-5.2 Inkling (Thinking Machines Lab)MiniMax M2.7 NVIDIA Blackwell Ultra NVIDIA DGX B300 NVIDIA Digits NVIDIA Exemplar Cloud NVIDIA NemoClaw NVIDIA RTX Spark NVIDIA Rubin CPX NVIDIA Rubin Ultra NVIDIA Vera Rubin Nemotron Nemotron 3 Tensor Core

Why does 4-bit floating point matter?

What is the E2M1 element format?

How does NVFP4's two-level micro-block scaling work?

How does NVFP4 differ from MXFP4?

Why does NVFP4 improve accuracy over MXFP4?

How is NVFP4 supported on Blackwell hardware?

How is NVFP4 used for inference?

Can NVFP4 be used to train models, not just run them?

How does NVFP4 relate to quantization?

What are the limitations of NVFP4?

See also

References

Improve this article

Related Articles

CuDNN

Jetson Thor

NVIDIA Blackwell

NVIDIA DGX Spark

NVIDIA Picasso

Jensen Huang

What links here

Related Articles

CuDNN

Jetson Thor

NVIDIA Blackwell

NVIDIA DGX Spark

NVIDIA Picasso

Jensen Huang

What links here