FP8

Updated Jun 23, 2026

FP8 is a family of 8-bit floating-point number formats used to accelerate deep learning training and inference, built around two encodings, E4M3 (4 exponent bits, 3 mantissa bits) and E5M2 (5 exponent bits, 2 mantissa bits), that NVIDIA, Arm, and Intel proposed jointly in September 2022 ^[1]^[2]. An FP8 value occupies a single byte, half the size of FP16 or bfloat16, which halves memory and bandwidth requirements and lets GPU tensor cores deliver roughly twice the arithmetic throughput of 16-bit math. FP8 became practical at scale with NVIDIA's Hopper generation and its Transformer Engine, and is now supported across accelerators from AMD, Intel, and others. It is used both for training, most prominently in DeepSeek-V3's FP8 mixed-precision framework, and as a mainstream quantization format for serving large language models. The Open Compute Project's Microscaling (MX) formats extend FP8 with block-level scaling and connect it to its 4-bit successor, FP4. The 2022 specification paper reported that FP8 training was "effectively matching the result quality achieved by 16-bit training sessions" across CNNs, RNNs, and Transformer models, including language models up to 175 billion parameters ^[1].

What problem does FP8 solve?

Eight-bit floating-point arithmetic predates the standardization push. IBM researchers demonstrated FP8 training in 2018 using a 1-5-2 (sign-exponent-mantissa) format with chunk-based higher-precision accumulation ^[3], and in 2019 proposed a hybrid scheme (HFP8) that paired a 4-bit-exponent format for the forward pass with a 5-bit-exponent format for gradients, prefiguring the later E4M3/E5M2 split. Tesla's Dojo whitepaper described configurable 8-bit floating-point formats, and Graphcore published its own 8-bit format study in 2022.

In September 2022, NVIDIA, Arm, and Intel published "FP8 Formats for Deep Learning", an interchange specification built around the E4M3 and E5M2 encodings. The paper reported that FP8 training effectively matched 16-bit result quality across CNNs, RNNs, and Transformer models, including GPT-3-style language models up to 175 billion parameters ^[1]^[2]. A competing proposal backed by Graphcore, AMD, and Qualcomm differed in how it handled special values; its "FNUZ" encodings later appeared in AMD hardware ^[4]. The Open Compute Project subsequently codified the NVIDIA/Arm/Intel encodings as the OFP8 specification (revision 1.0, 2023) ^[5], and the IEEE P3109 working group has been developing a formal standard for 8-bit and narrower machine-learning floats.

What are the E4M3 and E5M2 formats?

Both encodings use one sign bit and split the remaining seven bits between exponent and mantissa:

E4M3 (4 exponent bits, 3 mantissa bits, bias 7) favors precision. To stretch its range it departs from IEEE 754 conventions: it has no infinities and reserves only two bit patterns for NaN, which frees the top exponent values and extends the largest finite magnitude to 448 ^[1].
E5M2 (5 exponent bits, 2 mantissa bits, bias 15) favors range and follows IEEE 754 conventions for infinities and NaNs. It is effectively IEEE FP16 with the mantissa truncated to two bits, making conversion between the two trivial ^[1].

Property	E4M3	E5M2
Bit layout (sign/exponent/mantissa)	1/4/3	1/5/2
Exponent bias	7	15
Largest finite value	448	57,344
Smallest normal value	2^-6 (approx. 0.016)	2^-14 (approx. 6.1e-5)
Smallest subnormal value	2^-9 (approx. 0.002)	2^-16 (approx. 1.5e-5)
Infinities	None	Positive and negative
NaN bit patterns	2	6
Typical role	Weights, activations	Gradients
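
The table entries can be checked by decoding bit patterns directly. The following is a minimal sketch of a decoder for the two encodings as described above; the function name `decode_fp8` and its `exp_bits` parameter are illustrative and not taken from any library.

```python
import math

def decode_fp8(byte: int, exp_bits: int) -> float:
    """Decode one FP8 byte. exp_bits=4 selects E4M3, exp_bits=5 selects E5M2."""
    man_bits = 7 - exp_bits
    bias = (1 << (exp_bits - 1)) - 1              # 7 for E4M3, 15 for E5M2
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    top = (1 << exp_bits) - 1                     # all-ones exponent field

    if exp_bits == 5 and exp == top:
        # E5M2 keeps IEEE 754 semantics: infinity if the mantissa is 0, else NaN.
        return sign * math.inf if man == 0 else math.nan
    if exp_bits == 4 and exp == top and man == 0b111:
        # E4M3 has no infinities; only S.1111.111 encodes NaN.
        return math.nan
    if exp == 0:
        # Subnormal (or zero): no implicit leading 1.
        return sign * man * 2.0 ** (1 - bias - man_bits)
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)

print(decode_fp8(0x7E, 4))   # largest finite E4M3 value: 448.0
print(decode_fp8(0x7B, 5))   # largest finite E5M2 value: 57344.0
```

Counting the bytes that decode to NaN gives 2 for E4M3 and 6 for E5M2, matching the table.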

The tradeoff is range versus precision: E4M3 spans about 18 binades (powers of two) with a 3-bit mantissa, while E5M2 spans about 32 binades with only a 2-bit mantissa ^[1]. The 2022 paper states that "the recommended use of FP8 encodings is E4M3 for weight and activation tensors, and E5M2 for gradient tensors" ^[1]: weight and activation distributions are comparatively narrow, while gradients vary over a much wider range. Because even E5M2's range is tiny next to FP32, FP8 tensors are almost always paired with higher-precision scaling factors chosen so that a tensor's values fill the representable range; how those scales are computed, and at what granularity, is the central design question in both FP8 training and inference.
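
As a rough illustration of why scaling matters, the sketch below computes a per-tensor scale from the absolute maximum and applies it before the cast. The helper names are hypothetical, the gradient values are made up for the example, and the final rounding onto the FP8 grid is deliberately omitted.

```python
E4M3_MAX = 448.0
E5M2_MAX = 57344.0
E5M2_MIN_SUBNORMAL = 2.0 ** -16

def per_tensor_scale(values, fp8_max):
    """Scale that maps the tensor's largest magnitude onto the largest finite FP8 value."""
    amax = max(abs(v) for v in values)
    return fp8_max / amax if amax > 0 else 1.0

def scale_and_clamp(values, scale, fp8_max):
    """Apply the scale and saturate; a real cast would then round to the FP8 grid."""
    return [max(-fp8_max, min(fp8_max, v * scale)) for v in values]

# Made-up gradient values, all far below E5M2's smallest subnormal.
grads = [1e-7, -3e-6, 2e-5 / 4]
scale = per_tensor_scale(grads, E5M2_MAX)
scaled = scale_and_clamp(grads, scale, E5M2_MAX)

# Unscaled, every value would flush to zero; scaled, all fall inside the format's range.
print(all(abs(g) < E5M2_MIN_SUBNORMAL for g in grads))
print(all(abs(s) >= E5M2_MIN_SUBNORMAL for s in scaled))
```

The scale is stored alongside the tensor in higher precision, and results are divided by it after the FP8 computation.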

AMD's CDNA 3 accelerators implement variant "FNUZ" encodings (finite, NaN, unsigned zero): there are no infinities, a single NaN occupies the negative-zero bit pattern, and the exponent bias shifts by one, so E4M3FNUZ tops out at 240 instead of 448 ^[4]. The mismatch complicates moving FP8 checkpoints between vendors; AMD's CDNA 4 generation (the MI350 series, 2025) adopted the OCP-standard encodings.
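
The practical effect of the mismatch is that the same byte means different numbers under the two conventions. A minimal sketch, assuming the E4M3FNUZ layout described above (bias 8, NaN at the negative-zero pattern); the function name `decode_e4m3` is illustrative:

```python
import math

def decode_e4m3(byte: int, fnuz: bool) -> float:
    """Decode an E4M3 byte under OCP rules (fnuz=False) or FNUZ rules (fnuz=True)."""
    exp, man = (byte >> 3) & 0xF, byte & 0x7
    sign = -1.0 if byte & 0x80 else 1.0
    if fnuz and byte == 0x80:
        return math.nan                      # the negative-zero pattern is the only NaN
    if not fnuz and exp == 0xF and man == 0x7:
        return math.nan                      # OCP E4M3: S.1111.111 is NaN
    bias = 8 if fnuz else 7                  # FNUZ shifts the bias by one
    if exp == 0:
        return sign * man * 2.0 ** (1 - bias - 3)
    return sign * (1 + man / 8) * 2.0 ** (exp - bias)

print(decode_e4m3(0x7E, fnuz=False))   # 448.0, the OCP maximum
print(decode_e4m3(0x7E, fnuz=True))    # 224.0, the same byte read as FNUZ
print(decode_e4m3(0x7F, fnuz=True))    # 240.0, the FNUZ maximum
```

Because every finite value decodes to half its OCP magnitude, a checkpoint moved between the two conventions needs its bytes re-encoded or its scales adjusted rather than copied verbatim.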

Which hardware supports FP8?

NVIDIA's Hopper architecture, announced in March 2022, introduced FP8 tensor cores: the H100 delivers 1,979 TFLOPS of dense FP8 compute (3,958 TFLOPS, close to 4 petaFLOPS, with structured sparsity), twice its FP16 tensor rate ^[6]. Hopper paired the format with the Transformer Engine, a combination of tensor core hardware and library software that chooses between FP8 and 16-bit precision per layer and manages scaling and recasting automatically ^[6]^[7]. In the open-source Transformer Engine library, per-tensor scales are derived either from a history of recent absolute-maximum (amax) values ("delayed scaling") or computed just in time from the current tensor ("current scaling"). Delayed scaling is faster because it avoids an extra pass over the data, but it can hurt convergence when a tensor's range changes abruptly; NVIDIA's later guidance favors current scaling and finer-grained per-block recipes ^[7]^[8]. The Ada Lovelace generation (L4, L40S) also includes FP8 tensor cores, and Blackwell (2024) adds a second-generation Transformer Engine with native block-scaled microscaling support ^[9].
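
The difference between the two scaling strategies can be sketched in a few lines. This is an illustration only: the class `DelayedScaler`, the function `current_scale`, and the history length are hypothetical and do not reflect the Transformer Engine API.

```python
from collections import deque

E4M3_MAX = 448.0

class DelayedScaler:
    """Hypothetical helper: derives the scale from amax values of previous steps."""

    def __init__(self, history_len: int = 4):
        self.history = deque(maxlen=history_len)

    def scale_for(self, tensor):
        # The scale uses only past amax values, so it is known before the tensor is read.
        past_amax = max(self.history, default=0.0)
        scale = E4M3_MAX / past_amax if past_amax > 0 else 1.0
        # Record this step's amax for use in later steps.
        self.history.append(max(abs(v) for v in tensor))
        return scale

def current_scale(tensor):
    """Just-in-time scale computed from the tensor being cast."""
    amax = max(abs(v) for v in tensor)
    return E4M3_MAX / amax if amax > 0 else 1.0

scaler = DelayedScaler()
scaler.scale_for([1.0, -0.5])          # first step: no history yet
scaler.scale_for([0.9, 1.0])           # history now says amax is about 1
spike = [8.0, 0.1]                     # the range suddenly grows
stale = scaler.scale_for(spike)        # still tuned for amax = 1
fresh = current_scale(spike)

print(8.0 * stale > E4M3_MAX)          # True: the stale scale overflows E4M3
print(8.0 * fresh <= E4M3_MAX)         # True: the fresh scale fits
```

The overflow on the spike step is the kind of event that makes delayed scaling riskier, at the benefit of not needing a reduction over the tensor before the cast.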

AMD's Instinct MI300X (launched December 2023) reaches a peak theoretical 2,614.9 teraFLOPS of dense FP8 ^[10], using the FNUZ encodings described above ^[4]. Intel's Gaudi 2 and Gaudi 3 accelerators support FP8 as well; enabling it roughly doubled Gaudi 2's speed on the MLPerf GPT-3 training benchmark in November 2023, cutting time to train to 153.6 minutes on 384 chips ^[11]. Newer accelerators such as AWS Trainium2 and Google's seventh-generation TPU (Ironwood) likewise quote peak throughput in FP8.

How does FP8 training work?

FP8 training is a form of mixed-precision training: matrix multiplications take FP8 inputs while master weights, optimizer states, and numerically sensitive operations (normalization, attention softmax, embeddings, output projections) remain in BF16 or FP32, and accumulation inside the matrix units happens at higher precision. The standard recipe stores weights and activations as E4M3 and gradients as E5M2, with one scale factor per tensor ^[1]^[7].

Early large-scale adopters included Inflection AI, which trained Inflection-2 (announced November 2023) on 5,000 H100 GPUs in FP8 mixed precision for roughly 10^25 floating-point operations ^[12]. Microsoft's FP8-LM framework (October 2023) pushed FP8 beyond the matrix multiplies into gradient communication and optimizer state, reporting 39 percent lower real memory usage and 75 percent faster training than BF16 Megatron-LM on a GPT-175B configuration ^[13].

The most influential demonstration came with DeepSeek-V3 (December 2024), a 671-billion-parameter mixture-of-experts model trained on 14.8 trillion tokens in 2.788 million H800 GPU-hours. DeepSeek's framework runs all three linear-layer matrix multiplications (forward, activation gradient, and weight gradient) in FP8 and, unusually, uses E4M3 everywhere rather than the E4M3/E5M2 hybrid ^[14]. It compensates for the narrow range with fine-grained scaling: activations are scaled per 1x128 tile (one token by 128 channels) and weights per 128x128 block. Because the team found that FP8 matrix-multiply accumulation on H800 tensor cores retains only around 14 bits of precision, partial sums are promoted to FP32 registers on the CUDA cores at intervals of 128 elements ^[14]. The technical report states that the relative loss error versus a BF16 baseline stayed consistently below 0.25 percent, and describes the run as "the first validation of FP8 training's effectiveness on an extremely large-scale model" ^[14]. The released checkpoint is itself stored natively in FP8 with block-wise scales.
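
The fine-grained scaling described above can be sketched as follows. This is a simplified illustration of tile-wise and block-wise scale computation only, not DeepSeek's implementation; the function names are hypothetical, and the FP8 cast and FP32 promotion steps are left out.

```python
E4M3_MAX = 448.0

def tile_scales(row, tile=128, fp8_max=E4M3_MAX):
    """One scale per 1 x tile slice of a single token's channels (activations)."""
    scales = []
    for start in range(0, len(row), tile):
        amax = max(abs(v) for v in row[start:start + tile])
        scales.append(fp8_max / amax if amax > 0 else 1.0)
    return scales

def block_scales(matrix, block=128, fp8_max=E4M3_MAX):
    """One scale per block x block sub-matrix (weights)."""
    rows, cols = len(matrix), len(matrix[0])
    scales = []
    for r in range(0, rows, block):
        row_of_scales = []
        for c in range(0, cols, block):
            amax = max(abs(matrix[i][j])
                       for i in range(r, min(r + block, rows))
                       for j in range(c, min(c + block, cols)))
            row_of_scales.append(fp8_max / amax if amax > 0 else 1.0)
        scales.append(row_of_scales)
    return scales

# A 256-channel token with one outlier in the second tile.
token = [0.5] * 255 + [200.0]
print(tile_scales(token))   # the outlier lowers only the second tile's scale
```

With a single per-tensor scale, the outlier would force every channel to share the small scale; with tiles, the first 128 channels keep a scale 400 times larger and so retain far more precision.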

How is FP8 used for inference and quantization?

FP8 is also a widely used post-training quantization format. In the common "W8A8" scheme, weights and activations are cast to E4M3 (preferred over E5M2 for inference because of its extra mantissa bit), cutting weight memory roughly in half relative to 16-bit formats and engaging the hardware's FP8 matrix paths ^[15]. Weight scales are typically per-tensor or per-output-channel; activation scales are either dynamic, computed at runtime per tensor or per token, or static, calibrated offline by running sample data through the model and recording amax statistics ^[15]. Tools such as NVIDIA's TensorRT Model Optimizer, the vLLM-associated LLM Compressor, and AMD's Quark produce FP8 checkpoints, and engines including vLLM, TensorRT-LLM, and SGLang execute them; the KV cache can also be held in FP8 to shrink long-context memory ^[15]. For most large language models, FP8 W8A8 with per-channel weight and per-token activation scaling is reported to be near-lossless on standard benchmarks, with the remaining degradation concentrated in layers that exhibit severe activation outliers.
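
Per-token and per-channel scales are practical because they factor out of each dot product and can be removed after accumulation. The sketch below shows that bookkeeping with dynamic per-token activation scales and per-output-channel weight scales. The function names are hypothetical, and rounding to the E4M3 grid is omitted so that the result can be compared exactly with an unquantized matmul.

```python
E4M3_MAX = 448.0

def row_scales(matrix):
    """One scale per row, mapping each row's amax onto the E4M3 maximum."""
    scales = []
    for row in matrix:
        amax = max(abs(v) for v in row)
        scales.append(E4M3_MAX / amax if amax > 0 else 1.0)
    return scales

def w8a8_matmul(acts, weights_t):
    """acts: tokens x k. weights_t: out_channels x k (weights stored transposed)."""
    s_a = row_scales(acts)         # dynamic, per token
    s_w = row_scales(weights_t)    # per output channel, normally computed offline
    q_a = [[v * s for v in row] for row, s in zip(acts, s_a)]
    q_w = [[v * s for v in row] for row, s in zip(weights_t, s_w)]
    out = []
    for i, a_row in enumerate(q_a):
        out_row = []
        for j, w_row in enumerate(q_w):
            acc = sum(x * y for x, y in zip(a_row, w_row))   # higher-precision accumulate
            out_row.append(acc / (s_a[i] * s_w[j]))          # undo both scales once
        out.append(out_row)
    return out

acts = [[1.0, -2.0], [0.01, 0.02]]          # second token is 100x smaller
weights_t = [[0.5, 0.25], [4.0, -8.0]]
print(w8a8_matmul(acts, weights_t))
```

A static scheme would replace the per-call `row_scales(acts)` with fixed values recorded during calibration, trading some accuracy on unusual inputs for less runtime work.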

Native FP8 releases have become routine: Meta published an official FP8-quantized build of Llama 3.1 405B so the model fits on a single eight-GPU H100 node ^[16], and DeepSeek distributes V3 and R1 in FP8 form.
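
The single-node claim follows from simple arithmetic on weight storage. A back-of-the-envelope sketch, assuming 80 GB of memory per H100 and counting weights only (KV cache, activations, and scale factors are ignored):

```python
PARAMS = 405e9                 # Llama 3.1 405B, approximate parameter count
GPU_MEMORY_BYTES = 80e9        # assumed 80 GB per H100
NODE_MEMORY_BYTES = 8 * GPU_MEMORY_BYTES

def weight_bytes(params: float, bytes_per_param: int) -> float:
    return params * bytes_per_param

bf16 = weight_bytes(PARAMS, 2)     # 810 GB
fp8 = weight_bytes(PARAMS, 1)      # 405 GB

print(bf16 <= NODE_MEMORY_BYTES)   # False: BF16 weights exceed the node's 640 GB
print(fp8 <= NODE_MEMORY_BYTES)    # True: FP8 weights leave room for the KV cache
```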

What are microscaling (MX) formats?

A single scale per tensor must compromise between outliers and typical values. Microscaling (MX) formats address this by attaching a shared scale to every small block of elements. The MX Alliance, formed by AMD, Arm, Intel, Meta, Microsoft, NVIDIA, and Qualcomm, published version 1.0 of the OCP Microscaling Formats specification in September 2023 as an open, license-free standard ^[17]^[18]. Each MX block holds 32 elements plus one shared 8-bit power-of-two scale (an exponent-only E8M0 value):

Format	Element encoding(s)	Element bits	Block size	Effective bits per element
MXFP8	FP8 E4M3 or E5M2	8	32	8.25
MXFP6	FP6 E3M2 or E2M3	6	32	6.25
MXFP4	FP4 E2M1	4	32	4.25
MXINT8	INT8	8	32	8.25
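
The last column is the element width plus the 8-bit scale amortized over 32 elements. The sketch below reproduces that arithmetic and adds one plausible way to choose the shared power-of-two scale: the largest power of two not exceeding the block's amax, divided by the element format's largest power of two. That rule and the helper names are assumptions made for illustration; clamping to the E8M0 range and the element cast are omitted.

```python
import math

BLOCK_SIZE = 32      # elements per MX block
SCALE_BITS = 8       # one shared E8M0 scale per block

def effective_bits(element_bits: int) -> float:
    """Storage cost per element including the amortized shared scale."""
    return element_bits + SCALE_BITS / BLOCK_SIZE

def shared_scale(block, emax_elem: int = 8) -> float:
    """Power-of-two scale for one block. emax_elem=8 suits E4M3 (448 = 1.75 * 2**8)."""
    amax = max(abs(v) for v in block)
    if amax == 0:
        return 1.0
    floor_log2 = math.frexp(amax)[1] - 1     # exact floor(log2(amax))
    return 2.0 ** (floor_log2 - emax_elem)

print(effective_bits(8), effective_bits(6), effective_bits(4))   # 8.25 6.25 4.25

block = [3.0, -200.0, 0.5]
scale = shared_scale(block)
print(scale, [v / scale for v in block])   # elements are divided by the scale before the cast
```

Because the scale is a pure power of two, applying it changes only exponents and introduces no rounding error of its own.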

A companion paper demonstrated direct-cast inference and training with MX formats, including language-model training in MXFP8 and MXFP6 that closely tracked FP32 baselines ^[19]. NVIDIA's Blackwell tensor cores natively support MXFP8, MXFP6, and MXFP4 alongside NVIDIA's own NVFP4, which differs from MXFP4 by using 16-element blocks with an FP8 (E4M3) scale plus a per-tensor FP32 scale ^[9]; AMD's MI350 series supports the MX family as well. A 2025 NVIDIA study trained an 8-billion-parameter model on 15 trillion tokens in MXFP8 with accuracy matching BF16, recommending E4M3 elements for every tensor, gradients included ^[20]. On the deployment side, OpenAI's gpt-oss models (August 2025) ship with mixture-of-experts weights quantized to MXFP4, about 4.25 bits per parameter, which lets the 120-billion-parameter model run on a single 80 GB GPU ^[21].

How does FP8 compare to FP16, INT8, and FP4?

Versus FP16 and BF16. FP8 halves storage and doubles peak tensor throughput relative to 16-bit formats. E5M2 shares FP16's exponent range but keeps only 2 of its 10 mantissa bits, and BF16's 8-bit exponent gives it a far wider range than either FP8 encoding. That gap in precision and range is why higher-precision formats are retained for master weights, sensitive operations, and loss computation even in FP8 training runs.

Versus INT8. Both occupy one byte. INT8's uniformly spaced values resolve a well-bounded distribution more finely, whereas FP8's exponent allocates precision logarithmically and tolerates the long-tailed, outlier-heavy activation distributions typical of Transformers. Qualcomm researchers argued in 2023 that FP8 multiply-accumulate units are 50 to 180 percent less area- and energy-efficient than INT8 and that INT8 remains preferable for many convolutional and edge workloads, while finding that E4M3-style formats suit Transformer layers with large outliers ^[22]. On server GPUs, where FP8 and INT8 peak rates are typically identical, FP8's robustness to outliers without elaborate calibration has made it the more common choice for LLM serving.

Versus FP4. FP4 (E2M1) continues the same trajectory: Blackwell is NVIDIA's first generation with FP4 tensor cores, doubling throughput again over FP8. Block-scaled FP4 (MXFP4 and NVFP4) is today primarily an inference and weight-storage format, with FP4 pretraining an active research area; an NVIDIA-led 2025 paper reported training a 12-billion-parameter model on 10 trillion tokens in NVFP4 ^[23]. FP8 in turn serves as a building block of FP4 recipes, providing the per-block scale datatype in NVFP4 and the higher-precision fallback for sensitive layers ^[9].

References

1. Micikevicius, P. et al. "FP8 Formats for Deep Learning". arXiv:2209.05433, September 2022. https://arxiv.org/abs/2209.05433
2. "NVIDIA, Arm, and Intel Publish FP8 Specification for Standardization as an Interchange Format for AI". NVIDIA Technical Blog, September 2022. https://developer.nvidia.com/blog/nvidia-arm-and-intel-publish-fp8-specification-for-standardization-as-an-interchange-format-for-ai/
3. Wang, N. et al. "Training Deep Neural Networks with 8-bit Floating Point Numbers". arXiv:1812.08011 (NeurIPS 2018). https://arxiv.org/abs/1812.08011
4. "FP8 Numbers". AMD ROCm HIP documentation. https://rocm.docs.amd.com/projects/HIP/en/docs-6.3.0/reference/fp8_numbers.html
5. "OCP 8-bit Floating Point Specification (OFP8), Revision 1.0". Open Compute Project, 2023. https://www.opencompute.org/documents/ocp-8-bit-floating-point-specification-ofp8-revision-1-0-2023-12-01-pdf-1
6. "NVIDIA Hopper Architecture In-Depth". NVIDIA Technical Blog, March 2022. https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/
7. "Using FP8 and FP4 with Transformer Engine". NVIDIA Transformer Engine documentation. https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/examples/fp8_primer.html
8. "Per-Tensor and Per-Block Scaling Strategies for Effective FP8 Training". NVIDIA Technical Blog. https://developer.nvidia.com/blog/per-tensor-and-per-block-scaling-strategies-for-effective-fp8-training/
9. "Introducing NVFP4 for Efficient and Accurate Low-Precision Inference". NVIDIA Technical Blog, 2025. https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/
10. "AMD Instinct MI300X Accelerator Data Sheet". AMD, 2023. https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/data-sheets/amd-instinct-mi300x-data-sheet.pdf
11. "Intel Gaudi AI Accelerator Gains 2x Performance Leap on GPT-3 with FP8 Software". Intel Newsroom, November 2023. https://newsroom.intel.com/artificial-intelligence/intel-gaudi-ai-accelerator-brings-greater-ai-choice
12. "Inflection AI debuts new flagship Inflection-2 LLM trained on 5,000 H100 chips". SiliconANGLE, November 22, 2023. https://siliconangle.com/2023/11/22/inflection-ai-debuts-new-flagship-inflection-2-llm-trained-5000-h100-chips/
13. Peng, H. et al. "FP8-LM: Training FP8 Large Language Models". arXiv:2310.18313, October 2023. https://arxiv.org/abs/2310.18313
14. DeepSeek-AI. "DeepSeek-V3 Technical Report". arXiv:2412.19437, December 2024. https://arxiv.org/abs/2412.19437
15. "FP8 W8A8". vLLM documentation. https://docs.vllm.ai/en/latest/features/quantization/fp8/
16. "Llama-3.1-405B-Instruct-FP8". Meta on Hugging Face, July 2024. https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct-FP8
17. "OCP Microscaling Formats (MX) Specification, Version 1.0". Open Compute Project, September 2023. https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf
18. "AMD, Arm, Intel, Meta, Microsoft, NVIDIA, and Qualcomm Standardize Next-Generation Narrow Precision Data Formats for AI". Open Compute Project blog, October 2023. https://www.opencompute.org/blog/amd-arm-intel-meta-microsoft-nvidia-and-qualcomm-standardize-next-generation-narrow-precision-data-formats-for-ai
19. Rouhani, B. D. et al. "Microscaling Data Formats for Deep Learning". arXiv:2310.10537, October 2023. https://arxiv.org/abs/2310.10537
20. "Recipes for Pre-training LLMs with MXFP8". arXiv:2506.08027, 2025. https://arxiv.org/abs/2506.08027
21. "Introducing gpt-oss". OpenAI, August 5, 2025. https://openai.com/index/introducing-gpt-oss/
22. van Baalen, M. et al. "FP8 versus INT8 for efficient deep learning inference". arXiv:2303.17951, March 2023. https://arxiv.org/abs/2303.17951
23. "Pretraining Large Language Models with NVFP4". arXiv:2509.25149, September 2025. https://arxiv.org/abs/2509.25149
