FP8
Last reviewed
Sources
23 citations
Review status
Source-backed
Revision
v2 · 2,315 words
FP8 is a family of 8-bit floating-point number formats used to accelerate deep learning training and inference, built around two encodings, E4M3 (4 exponent bits, 3 mantissa bits) and E5M2 (5 exponent bits, 2 mantissa bits), that NVIDIA, Arm, and Intel proposed jointly in September 2022 [1][2]. An FP8 value occupies a single byte, half the size of FP16 or bfloat16, which halves memory and bandwidth requirements and lets GPU tensor cores deliver roughly twice the arithmetic throughput of 16-bit math. FP8 became practical at scale with NVIDIA's Hopper generation and its Transformer Engine, and is now supported across accelerators from AMD, Intel, and others. It is used both for training, most prominently in DeepSeek-V3's FP8 mixed-precision framework, and as a mainstream quantization format for serving large language models. The Open Compute Project's Microscaling (MX) formats extend FP8 with block-level scaling and connect it to its 4-bit successor, FP4. The 2022 specification paper reported that FP8 training was "effectively matching the result quality achieved by 16-bit training sessions" across CNNs, RNNs, and Transformer models, including language models up to 175 billion parameters [1].
Eight-bit floating-point arithmetic predates the standardization push. IBM researchers demonstrated FP8 training in 2018 using a 1-5-2 (sign-exponent-mantissa) format with chunk-based higher-precision accumulation [3], and in 2019 proposed a hybrid scheme (HFP8) that paired a 4-bit-exponent format for the forward pass with a 5-bit-exponent format for gradients, prefiguring the later E4M3/E5M2 split. Tesla's Dojo whitepaper described configurable 8-bit floating-point formats, and Graphcore published its own 8-bit format study in 2022.
In September 2022, NVIDIA, Arm, and Intel published "FP8 Formats for Deep Learning", an interchange specification built around the E4M3 and E5M2 encodings. The paper reported that FP8 training effectively matched 16-bit result quality across CNNs, RNNs, and Transformer models, including GPT-3-style language models up to 175 billion parameters [1][2]. A competing proposal backed by Graphcore, AMD, and Qualcomm differed in how it handled special values; its "FNUZ" encodings later appeared in AMD hardware [4]. The Open Compute Project subsequently codified the NVIDIA/Arm/Intel encodings as the OFP8 specification (revision 1.0, 2023) [5], and the IEEE P3109 working group has been developing a formal standard for 8-bit and narrower machine-learning floats.
Both encodings use one sign bit and split the remaining seven bits between exponent and mantissa:
| Property | E4M3 | E5M2 |
|---|---|---|
| Bit layout (sign/exponent/mantissa) | 1/4/3 | 1/5/2 |
| Exponent bias | 7 | 15 |
| Largest finite value | 448 | 57,344 |
| Smallest normal value | 2^-6 (approx. 0.016) | 2^-14 (approx. 6.1e-5) |
| Smallest subnormal value | 2^-9 (approx. 0.002) | 2^-16 (approx. 1.5e-5) |
| Infinities | None | Positive and negative |
| NaN bit patterns | 2 | 6 |
| Typical role | Weights, activations | Gradients |
The tradeoff is range versus precision: E4M3 spans about 18 binades (powers of two) with a 3-bit mantissa, while E5M2 spans about 32 binades with only a 2-bit mantissa [1]. The 2022 paper states that "the recommended use of FP8 encodings is E4M3 for weight and activation tensors, and E5M2 for gradient tensors" [1]: weight and activation distributions are comparatively narrow, while gradients vary over a much wider range. Because even E5M2's range is tiny next to FP32, FP8 tensors are almost always paired with higher-precision scaling factors chosen so that a tensor's values fill the representable range; how those scales are computed, and at what granularity, is the central design question in both FP8 training and inference.
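The table entries can be reproduced by decoding every possible byte. The following is a minimal sketch, assuming the OCP-style encodings described above; the helper names `decode_fp8` and `finite_values` are illustrative and not part of any library.

```python
import math

def decode_fp8(byte: int, fmt: str) -> float:
    """Decode one FP8 byte ("e4m3" or "e5m2") into a Python float."""
    exp_bits, man_bits = {"e4m3": (4, 3), "e5m2": (5, 2)}[fmt]
    bias = (1 << (exp_bits - 1)) - 1              # 7 for E4M3, 15 for E5M2
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    exp_max = (1 << exp_bits) - 1
    man_max = (1 << man_bits) - 1

    if fmt == "e5m2" and exp == exp_max:
        # E5M2 keeps IEEE-style specials: infinity if mantissa is 0, else NaN.
        return sign * math.inf if man == 0 else math.nan
    if fmt == "e4m3" and exp == exp_max and man == man_max:
        # E4M3 has no infinities; only S.1111.111 encodes NaN.
        return math.nan
    if exp == 0:
        # Zero and subnormals: no implicit leading one.
        return sign * man * 2.0 ** (1 - bias - man_bits)
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)

def finite_values(fmt: str) -> list:
    """All finite values representable in the given format."""
    decoded = (decode_fp8(b, fmt) for b in range(256))
    return [v for v in decoded if math.isfinite(v)]

for fmt in ("e4m3", "e5m2"):
    values = finite_values(fmt)
    nan_count = sum(math.isnan(decode_fp8(b, fmt)) for b in range(256))
    print(fmt, max(values), min(v for v in values if v > 0), nan_count)
```

Under these assumptions the loop reports a largest finite value of 448 for E4M3 and 57,344 for E5M2, with 2 and 6 NaN bit patterns respectively, matching the table.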
AMD's CDNA 3 accelerators implement variant "FNUZ" encodings (finite, NaN, unsigned zero): there are no infinities, a single NaN occupies the negative-zero bit pattern, and the exponent bias shifts by one, so E4M3FNUZ tops out at 240 instead of 448 [4]. The mismatch complicates moving FP8 checkpoints between vendors; AMD's CDNA 4 generation (the MI350 series, 2025) adopted the OCP-standard encodings.
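The practical effect of the bias shift is that the same byte decodes to different values under the two conventions. The sketch below assumes an E4M3FNUZ bias of 8 with the single NaN at bit pattern `0x80`, as described above; `decode_e4m3` is a hypothetical helper, not a vendor API.

```python
import math

def decode_e4m3(byte: int, fnuz: bool = False) -> float:
    """Decode an E4M3 byte under the OCP encoding or the FNUZ variant."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if fnuz:
        if byte == 0x80:                 # negative-zero pattern is the only NaN
            return math.nan
        bias = 8
    else:
        if exp == 0xF and man == 0x7:    # S.1111.111 is NaN in OCP E4M3
            return math.nan
        bias = 7
    if exp == 0:
        return sign * man * 2.0 ** (1 - bias - 3)   # zero and subnormals
    return sign * (1 + man / 8) * 2.0 ** (exp - bias)

for byte in (0x38, 0x7E, 0x7F, 0x80):
    print(hex(byte), decode_e4m3(byte), decode_e4m3(byte, fnuz=True))
```

In this sketch every byte that is finite in both encodings decodes to half its OCP value under FNUZ, so reinterpreting a checkpoint without converting it (or adjusting its scales) silently changes the weights.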
NVIDIA's Hopper architecture, announced in March 2022, introduced FP8 tensor cores: the H100 delivers 1,979 TFLOPS of dense FP8 compute (3,958 TFLOPS, close to 4 petaFLOPS, with structured sparsity), twice its FP16 tensor rate [6]. Hopper paired the format with the Transformer Engine, a combination of tensor core hardware and library software that chooses between FP8 and 16-bit precision per layer and manages scaling and recasting automatically [6][7]. In the open-source Transformer Engine library, per-tensor scales are derived either from a history of recent absolute-maximum (amax) values ("delayed scaling") or computed just in time ("current scaling"); delayed scaling is faster but can hurt convergence, and NVIDIA's later guidance favors current scaling and finer-grained per-block recipes [7][8]. The Ada Lovelace generation (L4, L40S) also includes FP8 tensor cores, and Blackwell (2024) adds a second-generation Transformer Engine with native block-scaled microscaling support [9].
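The difference between delayed and current scaling can be sketched in a few lines. This is an illustration under simplifying assumptions (plain Python lists, a scale that maps amax to the E4M3 maximum, a history reduced by taking its maximum); `DelayedScaler`, `current_scale`, and `scale_from_amax` are hypothetical names, not the Transformer Engine API.

```python
from collections import deque

E4M3_MAX = 448.0

def scale_from_amax(amax: float, fp8_max: float = E4M3_MAX) -> float:
    """Scale that maps the observed amax onto the top of the FP8 range."""
    return fp8_max / amax if amax > 0.0 else 1.0

class DelayedScaler:
    """Derive each step's scale from amax values recorded on earlier steps."""

    def __init__(self, history_len: int = 16):
        self.history = deque(maxlen=history_len)

    def scale_for_step(self, tensor) -> float:
        # The scale comes from stale statistics, so no extra pass over the
        # current tensor is needed before casting it.
        scale = scale_from_amax(max(self.history)) if self.history else 1.0
        # Record this step's amax for use on later steps.
        self.history.append(max(abs(v) for v in tensor))
        return scale

def current_scale(tensor) -> float:
    """Just-in-time scaling: read the tensor's own amax before casting."""
    return scale_from_amax(max(abs(v) for v in tensor))

scaler = DelayedScaler()
steps = [[1.0, -2.0], [0.5, 1.5], [8.0, -1.0]]    # the last step has a spike
for tensor in steps:
    delayed = scaler.scale_for_step(tensor)
    current = current_scale(tensor)
    amax = max(abs(v) for v in tensor)
    print(delayed, current, amax * delayed > E4M3_MAX)
```

On the third step the delayed scale still reflects the earlier, smaller amax, so the spike lands above 448 after scaling and would have to be saturated; the just-in-time scale places it exactly at the top of the range, at the cost of an extra reduction over the tensor.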
AMD's Instinct MI300X (launched December 2023) reaches a peak theoretical 2,614.9 teraFLOPS of dense FP8 [10], using the FNUZ encodings described above [4]. Intel's Gaudi 2 and Gaudi 3 accelerators support FP8 as well; enabling it roughly doubled Gaudi 2's speed on the MLPerf GPT-3 training benchmark in November 2023, cutting time to train to 153.6 minutes on 384 chips [11]. Newer accelerators such as AWS Trainium2 and Google's seventh-generation TPU (Ironwood) likewise quote peak throughput in FP8.
FP8 training is a form of mixed-precision training: matrix multiplications take FP8 inputs while master weights, optimizer states, and numerically sensitive operations (normalization, attention softmax, embeddings, output projections) remain in BF16 or FP32, and accumulation inside the matrix units happens at higher precision. The standard recipe stores weights and activations as E4M3 and gradients as E5M2, with one scale factor per tensor [1][7].
Early large-scale adopters included Inflection AI, which trained Inflection-2 (announced November 2023) on 5,000 H100 GPUs in FP8 mixed precision for roughly 10^25 floating-point operations [12]. Microsoft's FP8-LM framework (October 2023) pushed FP8 beyond the matrix multiplies into gradient communication and optimizer state, reporting 39 percent lower real memory usage and 75 percent faster training than BF16 Megatron-LM on a GPT-175B configuration [13].
The most influential demonstration came with DeepSeek-V3 (December 2024), a 671-billion-parameter mixture-of-experts model trained on 14.8 trillion tokens in 2.788 million H800 GPU-hours. DeepSeek's framework runs all three linear-layer matrix multiplications (forward, activation gradient, and weight gradient) in FP8 and, unusually, uses E4M3 everywhere rather than the E4M3/E5M2 hybrid [14]. It compensates for the narrow range with fine-grained scaling: activations are scaled per 1x128 tile (one token by 128 channels) and weights per 128x128 block. Because the team found that FP8 matrix-multiply accumulation on H800 tensor cores retains only around 14 bits of precision, partial sums are promoted to FP32 registers on the CUDA cores at intervals of 128 elements [14]. The technical report states that the relative loss error versus a BF16 baseline stayed consistently below 0.25 percent, and describes the run as "the first validation of FP8 training's effectiveness on an extremely large-scale model" [14]. The released checkpoint is itself stored natively in FP8 with block-wise scales.
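Why per-tile scaling helps can be seen with a small numerical experiment. The sketch below is not DeepSeek's implementation: it simulates E4M3 rounding in pure Python (`to_e4m3` is an illustrative helper using round-to-nearest with saturation) and compares one scale for a whole 256-channel row against one scale per 1x128 tile when a single outlier is present.

```python
import math

E4M3_MAX = 448.0
TILE = 128

def to_e4m3(x: float) -> float:
    """Round to the nearest E4M3-representable value, saturating at +-448."""
    if x == 0.0:
        return 0.0
    a = min(abs(x), E4M3_MAX)
    e = max(math.frexp(a)[1] - 1, -6)   # binade exponent, clamped for subnormals
    step = 2.0 ** (e - 3)               # 3 mantissa bits: 8 steps per binade
    return math.copysign(round(a / step) * step, x)

def amax_scale(values) -> float:
    amax = max(abs(v) for v in values)
    return E4M3_MAX / amax if amax > 0.0 else 1.0

def fake_quant(values, scale):
    """Quantize to E4M3 with the given scale, then dequantize."""
    return [to_e4m3(v * scale) / scale for v in values]

def mean_rel_error(ref, approx) -> float:
    pairs = [(r, a) for r, a in zip(ref, approx) if r != 0.0]
    return sum(abs(a - r) / abs(r) for r, a in pairs) / len(pairs)

# One token, 256 channels: small values everywhere, one outlier in tile 2.
row = [0.0007 * (i % 10 + 1) for i in range(2 * TILE)]
row[200] = 300.0

per_tensor = fake_quant(row, amax_scale(row))
per_tile = []
for start in range(0, len(row), TILE):
    tile = row[start:start + TILE]
    per_tile += fake_quant(tile, amax_scale(tile))

first = slice(0, TILE)                  # the tile without the outlier
err_tensor = mean_rel_error(row[first], per_tensor[first])
err_tile = mean_rel_error(row[first], per_tile[first])
print(err_tensor, err_tile)
```

With a single scale, the outlier pushes the small values into E4M3's subnormal range, where they lose most of their precision; with per-tile scales, the outlier only affects the tile that contains it.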
FP8 is also a widely used post-training quantization format. In the common "W8A8" scheme, weights and activations are cast to E4M3 (preferred over E5M2 for inference because of its extra mantissa bit), cutting weight memory roughly in half relative to 16-bit formats and engaging the hardware's FP8 matrix paths [15]. Weight scales are typically per-tensor or per-output-channel; activation scales are either dynamic, computed at runtime per tensor or per token, or static, calibrated offline by running sample data through the model and recording amax statistics [15]. Tools such as NVIDIA's TensorRT Model Optimizer, the vLLM-associated LLM Compressor, and AMD's Quark produce FP8 checkpoints, and engines including vLLM, TensorRT-LLM, and SGLang execute them; the KV cache can also be held in FP8 to shrink long-context memory [15]. For most large language models, FP8 W8A8 with per-channel weight and per-token activation scaling is reported to be near-lossless on standard benchmarks, with the remaining degradation concentrated in layers that exhibit severe activation outliers.
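Per-channel weight scales and per-token activation scales work because both factor out of the dot product: each output element is the FP8-operand sum divided by the product of one row scale and one column scale. The following is a minimal sketch under that assumption, with E4M3 rounding simulated in pure Python; `to_e4m3`, `quantize_rows`, and `w8a8_matmul` are illustrative names rather than functions from vLLM, TensorRT-LLM, or any other engine.

```python
import math

E4M3_MAX = 448.0

def to_e4m3(x: float) -> float:
    """Round to the nearest E4M3-representable value, saturating at +-448."""
    if x == 0.0:
        return 0.0
    a = min(abs(x), E4M3_MAX)
    e = max(math.frexp(a)[1] - 1, -6)   # binade exponent, clamped for subnormals
    step = 2.0 ** (e - 3)               # 3 mantissa bits: 8 steps per binade
    return math.copysign(round(a / step) * step, x)

def quantize_rows(matrix):
    """Quantize each row with its own amax-derived scale."""
    q_rows, scales = [], []
    for row in matrix:
        amax = max(abs(v) for v in row)
        scale = E4M3_MAX / amax if amax > 0.0 else 1.0
        q_rows.append([to_e4m3(v * scale) for v in row])
        scales.append(scale)
    return q_rows, scales

def w8a8_matmul(x, w):
    """Compute y[t][o] = sum_k x[t][k] * w[o][k] from FP8-rounded operands."""
    xq, sx = quantize_rows(x)   # dynamic scale per token (row of x)
    wq, sw = quantize_rows(w)   # scale per output channel (row of w)
    return [
        # Accumulate in higher precision, then undo both scales once per output.
        [sum(a * b for a, b in zip(x_row, w_row)) / (sx[t] * sw[o])
         for o, w_row in enumerate(wq)]
        for t, x_row in enumerate(xq)
    ]

x = [[0.5, -1.0, 2.0], [100.0, 50.0, -25.0]]      # two tokens, very different magnitudes
w = [[0.01, 0.02, -0.03], [1.0, 0.5, 0.25]]       # two output channels
print(w8a8_matmul(x, w))
```

In a static scheme the activation scales would instead be fixed constants calibrated offline, which removes the runtime amax reduction but risks saturation when an input exceeds the calibrated range.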
Native FP8 releases have become routine: Meta published an official FP8-quantized build of Llama 3.1 405B so the model fits on a single eight-GPU H100 node [16], and DeepSeek distributes V3 and R1 in FP8 form.
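The single-node claim follows from simple arithmetic on weight storage. The sketch below assumes 80 GB of memory per H100 and counts weights only; the KV cache, activations, and scale factors, which also need memory, are ignored.

```python
PARAMS = 405e9          # Llama 3.1 405B parameter count
GPU_MEMORY_GB = 80      # assumed per-GPU memory of an H100
GPUS_PER_NODE = 8

def weight_gb(params: float, bytes_per_param: float) -> float:
    """Weight storage in gigabytes (10^9 bytes)."""
    return params * bytes_per_param / 1e9

node_gb = GPU_MEMORY_GB * GPUS_PER_NODE   # 640 GB across the node
bf16_gb = weight_gb(PARAMS, 2)            # 810 GB: does not fit
fp8_gb = weight_gb(PARAMS, 1)             # 405 GB: fits, with room to spare

print(node_gb, bf16_gb, fp8_gb)
```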
A single scale per tensor must compromise between outliers and typical values. Microscaling (MX) formats address this by attaching a shared scale to every small block of elements. The MX Alliance, formed by AMD, Arm, Intel, Meta, Microsoft, NVIDIA, and Qualcomm, published version 1.0 of the OCP Microscaling Formats specification in September 2023 as an open, license-free standard [17][18]. Each MX block holds 32 elements plus one shared 8-bit power-of-two scale (an exponent-only E8M0 value):
| Format | Element encoding(s) | Element bits | Block size | Effective bits per element |
|---|---|---|---|---|
| MXFP8 | FP8 E4M3 or E5M2 | 8 | 32 | 8.25 |
| MXFP6 | FP6 E3M2 or E2M3 | 6 | 32 | 6.25 |
| MXFP4 | FP4 E2M1 | 4 | 32 | 4.25 |
| MXINT8 | INT8 | 8 | 32 | 8.25 |
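
The "effective bits" column and the shared scale can both be computed directly. The sketch below assumes one common way of choosing the scale, namely the largest power of two that keeps the block's amax within the element format's top binade (floor(log2(amax)) minus the element format's maximum exponent); the function names and the choice of exponent 0 for an all-zero block are illustrative assumptions.

```python
import math

E8M0_BIAS = 127
ELEM_EMAX = {"e4m3": 8, "e5m2": 15}   # exponent of each format's largest normal

def effective_bits(elem_bits: int, block_size: int = 32, scale_bits: int = 8) -> float:
    """Storage cost per element once the shared scale is amortized."""
    return elem_bits + scale_bits / block_size

def shared_scale_exponent(block, elem_format: str = "e4m3") -> int:
    """Power-of-two exponent shared by all elements in one block."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0                              # arbitrary choice in this sketch
    floor_log2 = math.frexp(amax)[1] - 1      # exact floor(log2(amax))
    exponent = floor_log2 - ELEM_EMAX[elem_format]
    return max(-E8M0_BIAS, min(E8M0_BIAS, exponent))

def encode_e8m0(exponent: int) -> int:
    """Store the exponent as a biased, exponent-only byte."""
    return exponent + E8M0_BIAS

block = [3.0, -0.5] + [0.0] * 30              # one 32-element block
exp = shared_scale_exponent(block)
scaled = [v / 2.0 ** exp for v in block]      # values handed to the E4M3 cast
print(exp, encode_e8m0(exp), max(abs(v) for v in scaled))
print(effective_bits(8), effective_bits(6), effective_bits(4))
```

Because the scale is restricted to powers of two, a block's scaled amax can land anywhere in the top binade and may slightly exceed the element format's maximum, in which case the element cast has to saturate.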
A companion paper demonstrated direct-cast inference and training with MX formats, including language-model training in MXFP8 and MXFP6 that closely tracked FP32 baselines [19]. NVIDIA's Blackwell tensor cores natively support MXFP8, MXFP6, and MXFP4 alongside NVIDIA's own NVFP4, which differs from MXFP4 by using 16-element blocks with an FP8 (E4M3) scale plus a per-tensor FP32 scale [9]; AMD's MI350 series supports the MX family as well. A 2025 NVIDIA study trained an 8-billion-parameter model on 15 trillion tokens in MXFP8 with accuracy matching BF16, recommending E4M3 elements for every tensor, gradients included [20]. On the deployment side, OpenAI's gpt-oss models (August 2025) ship with mixture-of-experts weights quantized to MXFP4, about 4.25 bits per parameter, which lets the 120-billion-parameter model run on a single 80 GB GPU [21].
Versus FP16 and BF16. FP8 halves storage and doubles peak tensor throughput relative to 16-bit formats. E5M2 has the same 5-bit exponent as FP16, but BF16's 8-bit exponent gives it a far wider range than either FP8 encoding, and both 16-bit formats carry more mantissa bits (10 in FP16, 7 in BF16). This is why higher-precision formats are retained for master weights, sensitive operations, and loss computation even in FP8 training runs.
Versus INT8. Both occupy one byte. INT8's uniformly spaced values resolve a well-bounded distribution more finely, whereas FP8's exponent allocates precision logarithmically and tolerates the long-tailed, outlier-heavy activation distributions typical of Transformers. Qualcomm researchers argued in 2023 that FP8 multiply-accumulate units are 50 to 180 percent less area- and energy-efficient than INT8 and that INT8 remains preferable for many convolutional and edge workloads, while finding that E4M3-style formats suit Transformer layers with large outliers [22]. On server GPUs, where FP8 and INT8 peak rates are typically identical, FP8's robustness to outliers without elaborate calibration has made it the more common choice for LLM serving.
Versus FP4. FP4 (E2M1) continues the same trajectory: Blackwell is NVIDIA's first generation with FP4 tensor cores, doubling throughput again over FP8. Block-scaled FP4 (MXFP4 and NVFP4) is today primarily an inference and weight-storage format, with FP4 pretraining an active research area; an NVIDIA-led 2025 paper reported training a 12-billion-parameter model on 10 trillion tokens in NVFP4 [23]. FP8 in turn serves as a building block of FP4 recipes, providing the per-block scale datatype in NVFP4 and the higher-precision fallback for sensitive layers [9].