NVFP4
Last reviewed
May 31, 2026
Sources
8 citations
Review status
Source-backed
Revision
v2 · 2,034 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 31, 2026
Sources
8 citations
Review status
Source-backed
Revision
v2 · 2,034 words
Add missing citations, update stale details, or suggest a clearer explanation.
NVFP4 is a 4-bit floating-point number format from Nvidia, introduced with the Blackwell GPU architecture. Each value is stored in just 4 bits using an E2M1 layout, and a two-level scaling scheme keeps those tiny numbers accurate enough for real workloads. NVFP4 targets two jobs that used to demand much wider arithmetic: low-precision inference for large models, and, more surprisingly, full pretraining of billion-parameter language models in 4-bit precision. The name is short for NVIDIA FP4, and the format is NVIDIA's answer to the open MXFP4 standard, with a finer scaling design meant to close most of the accuracy gap that 4-bit math normally opens up.
Modern AI models are limited as much by memory and data movement as by raw compute. A weight stored in 16-bit bfloat16 takes 2 bytes; the same weight in 4-bit takes half a byte. Cutting each value from 16 bits to 4 shrinks the memory footprint by about 4x, which lets a given GPU hold a larger model, or the same model with longer context and bigger batches. Smaller operands also move faster through caches and across interconnects, and the matrix engines can pack more multiply-accumulate operations into each clock. For large language model inference, where the cost of serving a token is dominated by reading weights and activations out of memory, 4-bit formats translate fairly directly into higher throughput and lower energy per token.
The catch is precision. A 4-bit float can represent only a handful of distinct magnitudes, so naive rounding to 4 bits wrecks model accuracy. The interesting engineering in NVFP4 is not the 4-bit element itself, which is a known idea, but the scaling machinery wrapped around it that recovers most of the lost dynamic range.
Every NVFP4 value uses the E2M1 encoding: 1 sign bit, 2 exponent bits, and 1 mantissa bit. That is the same per-element format used by the FP4 type in the open Microscaling standard, and it is a true floating-point layout rather than an integer one, so it keeps the exponent's ability to span several orders of magnitude in very few bits. [1][5]
Four bits give 16 codes. After accounting for the sign, NVFP4 can represent these magnitudes: 0, 0.5, 1, 1.5, 2, 3, 4, and 6, together with their negatives. The largest representable magnitude is 6, so the raw element range runs from about -6 to +6. [1] For comparison, single-precision FP32 can express on the order of four billion distinct positive values; an E2M1 element offers eight positive ones. That scarcity is exactly why scaling does the heavy lifting.
NVFP4 pairs its 4-bit elements with two separate scale factors that together stretch the narrow element range to cover the actual numbers in a tensor. [1][2]
The first level is a micro-block scale. NVFP4 groups values into blocks of 16 consecutive elements, and each block carries its own scale factor encoded in 8-bit FP8 using the E4M3 layout (4 exponent bits, 3 mantissa bits). Because E4M3 is itself a floating-point type with a mantissa, the block scale can take non-power-of-two values such as 1.25 or 3.5, which lets the encoder fit each 16-element block tightly to its own local range. [1][2]
The second level is a single FP32 scale applied to the whole tensor. This per-tensor factor sets the overall magnitude and extends the representable range so that the FP8 block scales themselves do not overflow or underflow. The reconstructed value is the 4-bit element multiplied by its block's FP8 scale and then by the tensor's FP32 scale. [1][2]
Storing one FP8 scale per 16 values adds 8 bits across 16 elements, which works out to half a bit per value, so NVFP4 costs roughly 4.5 bits per element once the block scale is amortized; the single FP32 per-tensor scale is negligible spread over a large tensor. [1] The point of the design is that a small 16-element block sees a narrower spread of magnitudes than a large block would, and a floating-point block scale can match that spread more precisely than a coarse power-of-two scale. Both choices reduce rounding error where it matters.
NVFP4 is closely related to MXFP4, the 4-bit member of the Microscaling (MX) family. The MX formats were standardized through the Open Compute Project in 2023 by an industry group that included AMD, Arm, Intel, Meta, Microsoft, NVIDIA, and Qualcomm. [5][6] MXFP4 also uses E2M1 elements, so the per-value encoding is identical. The differences are entirely in the scaling.
MXFP4 uses larger blocks of 32 elements, and its shared block scale is an E8M0 value, an 8-bit exponent with no mantissa that can only represent powers of two. [5][6] NVFP4 halves the block to 16 elements, switches the block scale to FP8 E4M3 so it can take fractional, non-power-of-two values, and adds the extra per-tensor FP32 scale that MXFP4 does not have. [1][2] Smaller blocks adapt to local dynamic range more finely, and a mantissa-bearing scale lands closer to the ideal multiplier for each block, so both changes cut quantization error. The trade is a slightly higher metadata cost, since NVFP4 carries one scale per 16 values versus one per 32 for MXFP4, and an extra global scale.
| Property | NVFP4 | MXFP4 | FP8 (E4M3) |
|---|---|---|---|
| Element format | E2M1 (4-bit) | E2M1 (4-bit) | E4M3 (8-bit) |
| Bits per value | 4 (about 4.5 with scale) | 4 (about 4.25 with scale) | 8 |
| Block size | 16 elements | 32 elements | n/a |
| Block scale format | FP8 E4M3 (fractional) | E8M0 (power of two) | n/a |
| Second-level scale | per-tensor FP32 | none | per-tensor (typical) |
| Standardized by | NVIDIA | Open Compute Project (MX) | OCP / vendor |
| Native hardware | Blackwell Tensor Cores | Blackwell, others | Hopper, Blackwell, others |
NVFP4 is a hardware format, not just a storage convention. The fifth-generation Tensor Cores in Blackwell implement NVFP4 directly: they handle the grouping of elements into blocks, apply the dynamic scales, and run the 4-bit matrix multiplies natively. [1] Because the math engines understand the format, tensors can stay in their compact 4-bit form all the way through a matrix multiply without a separate dequantization step into a wider type, which is where a lot of the throughput and energy benefit comes from. [1] NVFP4 is supported across the Blackwell data center lineup, including the B200 and B300 accelerators and the GB200 and GB300 Grace Blackwell systems. [1]
The speedups are large relative to wider formats. NVIDIA reports that 4-bit matrix multiplies (GEMMs) run at roughly 4x the throughput of bfloat16 on GB200 and about 6x on GB300, which works out to roughly 2x and 3x faster than FP8 on the same hardware. [3][4] On the memory side, NVFP4 cuts the model footprint by about 3.5x versus FP16 and about 1.8x versus FP8, and NVIDIA cites per-token energy efficiency gains of up to 25x for Blackwell and 50x for Blackwell Ultra against an H100 baseline. [1]
The first and most direct use of NVFP4 is post-training quantization for inference. A model trained in a higher precision such as FP8 or BF16 is converted to NVFP4 weights and activations, then served on Blackwell hardware. The fine-grained two-level scaling is what makes this viable: it preserves enough numerical detail that accuracy stays close to the higher-precision original. [1][2]
NVIDIA's reported example is the DeepSeek-R1 0528 reasoning model. Quantized from FP8 to NVFP4, it held accuracy within about 1 percent of the FP8 baseline across most language and reasoning benchmarks, and on the AIME 2024 math set the NVFP4 version actually scored about 2 percent higher, well inside normal run-to-run variation for that benchmark. [1] Independent analysis of NVFP4 inference has reported throughput on the order of 2.3x higher than FP8 while holding accuracy, consistent with the format's 4-bit width and Blackwell's native support. [2]
The more ambitious use of NVFP4 is training models in 4-bit precision from scratch, not just running them. In a 2025 paper, NVIDIA researchers pretrained a 12-billion-parameter hybrid Mamba-Transformer model on 10 trillion tokens using NVFP4 for the dominant matrix multiplications, which they describe as the first demonstration of stable, multi-trillion-token 4-bit pretraining at that scale. [3][7]
Naively training in 4-bit is unstable, so the recipe layers several techniques on top of the format. Random Hadamard transforms reshape the distribution of values feeding the gradient computation, spreading out large outliers so they fit the format better. Two-dimensional block scaling is applied to the weights so that a weight matrix and its transpose see consistent scaling in the forward and backward passes. Stochastic rounding replaces round-to-nearest for gradients, which keeps the quantized gradients unbiased on average. And a small number of numerically sensitive layers, such as the final layers, are kept in higher precision like BF16 rather than 4-bit. [3][7]
With those pieces in place, the NVFP4 run tracked an FP8 baseline closely. Validation loss stayed within roughly 1 percent of FP8 through the stable phase of training, widening to a bit over 1.5 percent during the learning-rate decay near the end, and downstream benchmark accuracy across knowledge, reasoning, math, and coding tasks came out comparable to the FP8 model. [3][7] The same work found that MXFP4 was less data-efficient for this kind of training: matching NVFP4's loss with MXFP4 required on the order of 36 percent more training tokens, which the authors attribute to NVFP4's finer scaling. [7]
NVFP4 is a concrete instance of quantization, the broad practice of representing a model's numbers in fewer bits than the precision it was developed in. What distinguishes NVFP4 from older 4-bit quantization schemes is the combination of a floating-point element with block-level and tensor-level scaling baked into the hardware path, rather than an integer code book applied only in software. Its closest relatives are the other Microscaling formats, and it slots into the same broader trend across deep learning toward ever-lower precision: FP32 to FP16 and BF16, then to 8-bit FP8, and now to 4-bit FP4 variants for the workloads that can tolerate them. NVIDIA's TensorRT stack and its Transformer Engine library expose NVFP4 so that the quantization and the scale bookkeeping can be applied without hand-writing the low-level kernels. [1][8]
NVFP4 is not a universal replacement for wider formats. Eight magnitudes per element is very little resolution, so the format depends entirely on its scaling and on careful calibration; tensors with heavy outliers or unusual distributions can still lose accuracy, which is why outlier-handling tricks like Hadamard transforms show up in the training recipe. [3][7] Some layers remain too sensitive to quantize and are kept in higher precision, so a 4-bit model is in practice a mixed-precision model. The block and tensor scales add metadata and a small amount of overhead beyond the bare 4 bits. The biggest practical limit is hardware: native NVFP4 acceleration is specific to Blackwell-generation Tensor Cores, so the throughput and energy advantages do not carry over to older GPUs such as Hopper, which support FP8 but not native FP4. [1] As with any low-precision format, the reported accuracy results hold for the models and benchmarks NVIDIA tested, and results can vary for other architectures, tasks, or training regimes.