Mixed-precision training is a technique for training deep learning models using lower-precision floating-point formats for most computations while maintaining a higher-precision copy of the model weights for numerical stability. By performing forward and backward passes in half-precision (16-bit) or even quarter-precision (8-bit) formats while keeping a master copy of weights in single-precision (FP32), mixed-precision training achieves substantial reductions in memory usage and significant speedups on hardware with specialized low-precision compute units, all without degrading model accuracy. First formally described in 2017 by Micikevicius et al., a team of researchers from NVIDIA and Baidu, the technique has become standard practice for training virtually every modern large language model and deep neural network.
Traditional deep learning training uses 32-bit single-precision floating-point arithmetic (FP32) for all computations. FP32 provides a wide dynamic range (approximately 1.18 x 10^-38 to 3.4 x 10^38) and high precision (about 7 decimal digits), which ensures numerical stability during the many iterations of gradient-based optimization.
However, FP32 is expensive in terms of both memory and computation. Each parameter, gradient, and activation value occupies 4 bytes of memory. The arithmetic units that process FP32 operations are larger, consume more power, and achieve lower throughput than units designed for lower-precision formats. As models grew from millions to billions of parameters, the memory and compute costs of FP32 training became a significant bottleneck.
The key observation behind mixed-precision training is that not all computations in a neural network require the full precision and range of FP32. Many operations, particularly the large matrix multiplications in transformer layers, are tolerant of reduced precision. The gradients and activations that flow through the network carry information that can be adequately represented in fewer bits for the majority of the training process. Only a few critical operations, notably the accumulation of small gradient updates into model weights, genuinely require the precision of FP32 [1].
The following table summarizes the floating-point formats relevant to modern deep learning training.
| Format | Total Bits | Sign Bits | Exponent Bits | Mantissa Bits | Dynamic Range | Precision (decimal digits) | First Hardware Support |
|---|---|---|---|---|---|---|---|
| FP32 (IEEE 754) | 32 | 1 | 8 | 23 | ~1.18e-38 to ~3.4e38 | ~7.2 | All modern CPUs/GPUs |
| TF32 (TensorFloat-32) | 19 | 1 | 8 | 10 | Same as FP32 | ~3.4 | NVIDIA Ampere (A100, 2020) |
| BF16 (Brain Float 16) | 16 | 1 | 8 | 7 | Same as FP32 | ~2.4 | Google TPUv2 (2017), NVIDIA Ampere (2020) |
| FP16 (IEEE 754 half) | 16 | 1 | 5 | 10 | ~6.1e-5 to 65504 | ~3.4 | NVIDIA Pascal (P100, 2016) |
| FP8 E4M3 | 8 | 1 | 4 | 3 | ~1.95e-3 to 448 | ~1.2 | NVIDIA Hopper (H100, 2022) |
| FP8 E5M2 | 8 | 1 | 5 | 2 | ~1.53e-5 to 57344 | ~0.9 | NVIDIA Hopper (H100, 2022) |
| INT8 | 8 | 0 or 1 | N/A | N/A | -128 to 127 or 0 to 255 | N/A | Various GPUs and accelerators |
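The range and precision columns above follow directly from the bit layouts. As a sanity check, they can be derived in a few lines of plain Python (this helper, `fp_properties`, is illustrative and covers only IEEE-754-style formats; FP8 E4M3 deviates slightly from the IEEE convention by reclaiming most of the top exponent code, so it is omitted here):

```python
import math

def fp_properties(exp_bits, man_bits):
    """Derive max normal value, min normal value, and decimal digits of
    precision for an IEEE-754-style format with the given field widths."""
    bias = 2 ** (exp_bits - 1) - 1
    max_exp = 2 ** exp_bits - 2 - bias          # top exponent code is Inf/NaN
    max_normal = (2 - 2 ** -man_bits) * 2.0 ** max_exp
    min_normal = 2.0 ** (1 - bias)
    digits = (man_bits + 1) * math.log10(2)     # 1 implicit + stored mantissa bits
    return max_normal, min_normal, digits

for name, e, m in [("FP32", 8, 23), ("BF16", 8, 7), ("FP16", 5, 10)]:
    mx, mn, d = fp_properties(e, m)
    print(f"{name}: max={mx:.3g} min_normal={mn:.3g} digits={d:.1f}")
```

Running this reproduces the table's values, e.g. FP16 yields a maximum of 65,504 and a minimum normal value of about 6.1e-5.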
FP32 is the baseline format for deep learning, providing ample range and precision for all training operations. Its primary drawback is the memory and compute cost at scale.
Introduced by NVIDIA with the Ampere architecture (A100 GPU, 2020), TF32 is a 19-bit format that combines the 8-bit exponent of FP32 (preserving its full dynamic range) with the 10-bit mantissa of FP16 (providing adequate precision). TF32 is used internally by tensor cores for matrix multiplications: the GPU accepts FP32 inputs, performs the multiply in TF32 precision, and accumulates the result in FP32. This happens transparently and is enabled by default on Ampere and later GPUs, meaning existing FP32 training scripts get up to 10x speedups on tensor core operations without any code changes [2].
TF32 is not a storage format (data in memory remains in FP32); it is purely a computational mode within the tensor cores. NVIDIA has demonstrated that TF32 achieves accuracy equivalent to full FP32 training across a wide range of models [2].
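Although TF32 is on by default for convolutions on Ampere and later GPUs, PyTorch exposes explicit controls for matmuls. A brief sketch (these are real PyTorch APIs; the exact defaults vary by PyTorch version):

```python
import torch

# "high" allows tensor cores to execute FP32 matmuls in TF32;
# "highest" forces true FP32; "medium" additionally allows BF16.
torch.set_float32_matmul_precision("high")
print(torch.get_float32_matmul_precision())  # -> "high"

# Older flag-based controls, equivalent for matmuls and cuDNN convolutions.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
```

Because TF32 is a compute mode rather than a storage format, these switches change tensor core behavior without altering any tensor dtypes in the script.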
BF16 was originally developed by Google for use on TPUs and has since been adopted by NVIDIA (Ampere and later), AMD, and Intel. It uses 8 exponent bits (the same as FP32), giving it the same dynamic range as FP32, but only 7 mantissa bits, providing less precision. The key advantage of BF16 over FP16 is its wider dynamic range, which eliminates the need for loss scaling in most cases. Gradients that would underflow in FP16 can be represented in BF16 without special handling.
BF16 has become the default training format for most large language models. Models like GPT-4, LLaMA, and Gemini are typically trained in BF16 mixed precision. On NVIDIA Ampere and Hopper GPUs, BF16 tensor core operations achieve the same throughput as FP16 operations [3].
FP16 was the first reduced-precision format widely adopted for deep learning training: Pascal GPUs (P100, 2016) supported FP16 arithmetic on CUDA cores, and Volta (V100, 2017) added FP16 tensor cores. It uses 5 exponent bits and 10 mantissa bits, providing higher precision than BF16 but a much narrower dynamic range (maximum value of 65,504). Gradients with magnitudes below approximately 6 x 10^-5 fall into the subnormal range and lose precision, and below roughly 6 x 10^-8 they underflow to zero entirely, a common occurrence during training of deep networks.
To address this, FP16 mixed-precision training requires loss scaling (described below). FP16 remains useful on hardware that does not support BF16, and it provides slightly higher precision in the mantissa compared to BF16, which can matter for certain operations [1].
FP8 comes in two variants defined by the OFP8 standard: E4M3 (4 exponent bits, 3 mantissa bits) and E5M2 (5 exponent bits, 2 mantissa bits). E4M3 is typically used for forward pass computations (where precision matters more), while E5M2 is used for gradients in the backward pass (where dynamic range is more important).
FP8 was first supported in hardware by NVIDIA's Hopper architecture (H100 GPU, 2022). On the H100, FP8 tensor core operations achieve approximately double the throughput of FP16/BF16 operations, making FP8 extremely attractive for training. However, the very limited precision of FP8 requires careful quantization strategies, including per-tensor or per-block scaling factors, to maintain training accuracy [4].
INT8 (8-bit integer) is primarily used for inference quantization rather than training. Models trained in higher precision are quantized to INT8 for deployment, reducing memory and compute requirements. Some research has explored INT8 training, but it remains less common than floating-point mixed precision for training due to the lack of dynamic range in integer formats.
The standard mixed-precision training procedure, as described by Micikevicius et al. (2017), involves three key techniques used together: FP32 master weights, FP16/BF16 forward and backward passes, and loss scaling [1].
A primary copy of all model weights is maintained in FP32. This "master copy" is the authoritative version of the weights and is where gradient updates are applied. At the start of each training step, the FP32 master weights are cast to the lower-precision format (FP16 or BF16) for use in the forward and backward passes.
The reason FP32 master weights are necessary is that weight updates are often very small relative to the weight values themselves. A typical learning rate of 1e-4 multiplied by a gradient of 1e-3 produces an update of 1e-7. In FP16 this value is barely representable (the smallest subnormal is about 6e-8), and adding it to a weight of typical magnitude has no effect at all: the spacing between adjacent FP16 values near 1.0 is about 1e-3, so the update is rounded away. In BF16, the 7-bit mantissa makes this swamping even more severe. FP32's 23-bit mantissa can absorb these small updates accurately [1].
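This swamping effect is easy to reproduce with NumPy (a small numerical illustration, not the training mechanism itself):

```python
import numpy as np

update = 1e-7  # e.g. learning rate 1e-4 times gradient 1e-3

# In FP16 the update is swallowed: the spacing between adjacent FP16
# values near 1.0 is 2**-10 (about 1e-3), far larger than the update.
w16 = np.float16(1.0)
assert w16 + np.float16(update) == w16

# In FP32 the spacing near 1.0 is 2**-23 (about 1.2e-7), so the update
# survives, rounded to the nearest representable value.
w32 = np.float32(1.0)
assert w32 + np.float32(update) != w32
```

Keeping the master weights in FP32 ensures every such micro-update is accumulated rather than rounded away.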
The forward pass (computing predictions from inputs) and the backward pass (computing gradients via backpropagation) are performed using the lower-precision copies of the weights. All intermediate activations, attention scores, and layer outputs are computed and stored in FP16 or BF16. This is where the bulk of the memory savings come from: activations often dominate memory usage during training (especially for long sequences), and storing them in 16 bits instead of 32 bits cuts activation memory in half.
Matrix multiplications are performed on tensor cores, which are specialized hardware units that perform fused multiply-accumulate operations at high throughput in lower precision while accumulating results in FP32. This FP32 accumulation is critical for maintaining the numerical accuracy of large dot products and matrix multiplications [5].
Loss scaling is specifically needed for FP16 training (BF16's wider dynamic range generally makes it unnecessary). Before backpropagation, the loss is multiplied by a scale factor S; by the chain rule, every gradient in the backward pass is then scaled by S as well, shifting small gradient values up into FP16's representable range. The gradients are divided by S (unscaled) before being applied to the weights. There are two variants:
Static loss scaling uses a fixed scaling factor chosen before training. Common values range from 8 to 32,768. The appropriate value depends on the model and must sometimes be tuned manually.
Dynamic loss scaling adjusts the scaling factor automatically during training. It starts with a large value and monitors for NaN or Inf values in the gradients. If overflow is detected, the scaling factor is halved, and the training step is skipped. If training proceeds without overflow for a specified number of steps, the scaling factor is increased. This approach requires no manual tuning and is the default in most frameworks [1].
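The dynamic scaling logic can be sketched in a few lines. This is a simplified stand-in for what framework scalers do internally; the class name, defaults, and growth schedule here are illustrative, not any library's actual implementation:

```python
import math

class DynamicLossScaler:
    """Simplified dynamic loss scaling: halve the scale on overflow and
    skip the step; double it after a run of overflow-free steps."""
    def __init__(self, init_scale=2.0 ** 16, growth_interval=2000):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self.clean_steps = 0

    def update(self, grads):
        """Return True if the optimizer step should be applied."""
        if any(math.isnan(g) or math.isinf(g) for g in grads):
            self.scale /= 2           # overflow: back off and skip this step
            self.clean_steps = 0
            return False
        self.clean_steps += 1
        if self.clean_steps >= self.growth_interval:
            self.scale *= 2           # stable for a while: try a larger scale
            self.clean_steps = 0
        return True

scaler = DynamicLossScaler(init_scale=1024.0, growth_interval=3)
scaler.update([float("inf")])    # overflow -> scale drops to 512.0
for _ in range(3):
    scaler.update([0.1, -0.2])   # three clean steps -> scale grows back to 1024.0
print(scaler.scale)              # 1024.0
```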
PyTorch provides dynamic loss scaling through torch.amp.GradScaler, which integrates with the torch.autocast context manager for automatic mixed-precision training.
A typical mixed-precision training iteration proceeds as follows:

1. Cast the FP32 master weights to FP16/BF16.
2. Run the forward pass in reduced precision and compute the loss.
3. Multiply the loss by the scale factor (FP16 only).
4. Run the backward pass in reduced precision, producing scaled gradients.
5. Unscale the gradients; if any are Inf or NaN, skip the step and reduce the scale (dynamic scaling).
6. Apply the optimizer update to the FP32 master weights.
The foundational paper on mixed-precision training, titled "Mixed Precision Training," was authored by Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu, researchers affiliated with NVIDIA and Baidu at the time. The paper was submitted to arXiv in October 2017 and published at ICLR 2018 [1].
The paper demonstrated that mixed-precision training with FP16 arithmetic could match FP32 training accuracy across a wide range of tasks, including image classification (ResNet on ImageNet), object detection (Faster R-CNN), speech recognition (DeepSpeech 2), neural machine translation, language modeling, and generative adversarial networks (DCGAN). The authors showed that the three techniques of FP32 master weights, loss scaling, and FP32 accumulation in matrix multiplications were sufficient to train all tested models without accuracy loss [1].
This paper was influential because it provided both the practical recipe and the empirical evidence needed for the community to adopt mixed-precision training with confidence. Prior to this work, reduced-precision training was viewed as risky and required ad hoc adjustments for each model. Micikevicius et al. demonstrated a general, reliable approach.
Mixed-precision training performance depends critically on hardware support for low-precision arithmetic, specifically through tensor cores (on NVIDIA GPUs) or equivalent units on other accelerators.
| GPU Generation | Architecture | Year | Key Precision Features | Tensor Core Generation |
|---|---|---|---|---|
| Pascal (P100) | GP100 | 2016 | FP16 CUDA cores (no tensor cores) | N/A |
| Volta (V100) | GV100 | 2017 | FP16 tensor cores with FP32 accumulation | 1st generation |
| Turing (T4, RTX 20xx) | TU10x | 2018 | FP16 tensor cores, INT8/INT4 inference | 2nd generation |
| Ampere (A100, RTX 30xx) | GA100 | 2020 | TF32, BF16, FP16 tensor cores; sparse tensor cores | 3rd generation |
| Hopper (H100, H200) | GH100 | 2022 | FP8 (E4M3, E5M2), FP16, BF16, TF32 tensor cores; Transformer Engine | 4th generation |
| Blackwell (B100, B200) | GB100 | 2024-2025 | FP8, FP4 (NVFP4), MXFP8, BF16, FP16, TF32 | 5th generation |
Tensor cores are specialized matrix multiply-accumulate units on NVIDIA GPUs. Each tensor core performs a D = A x B + C operation on small matrix tiles (e.g., 4x4 or 16x16). The inputs A and B can be in a lower-precision format (FP16, BF16, FP8, etc.), while the accumulator C and output D are in FP32. This design is what makes mixed-precision training possible without accuracy loss: the individual multiplications happen in low precision for speed, but the accumulated result retains FP32 accuracy [5].
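The value of FP32 accumulation is easy to see in a toy dot product. The following NumPy snippet simulates the tensor-core pattern (it is a software illustration, not actual tensor-core code):

```python
import numpy as np

# Dot product of 10,000 ones: the exact answer is 10,000.
a = np.ones(10_000, dtype=np.float16)
b = np.ones(10_000, dtype=np.float16)
products = a * b  # elementwise FP16 multiplies, as on a tensor core

# Accumulate in FP16: once the sum reaches 2048, the spacing between
# adjacent FP16 values is 2.0, so adding 1.0 rounds away and the sum
# stagnates at 2048 forever.
acc16 = np.float16(0.0)
for p in products:
    acc16 = np.float16(acc16 + p)

# Accumulate in FP32 (the tensor-core approach): exact for this sum.
acc32 = np.float32(0.0)
for p in products:
    acc32 += np.float32(p)

print(acc16, acc32)  # 2048.0 10000.0
```

The FP16 accumulator is off by nearly 80%, while the FP32 accumulator is exact, which is precisely why tensor cores keep C and D in FP32.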
Tensor core throughput has increased dramatically across GPU generations. On the A100, FP16/BF16 tensor cores deliver 312 TFLOPS (dense), compared to 19.5 TFLOPS for FP32 non-tensor-core math. On the H100, FP16/BF16 tensor cores deliver approximately 990 dense TFLOPS, and FP8 tensor cores deliver roughly 1,979 dense TFLOPS, with structured sparsity doubling each of these figures. The B200 roughly doubles FP8 throughput again and adds support for the new FP4 format [6].
The Transformer Engine, introduced with the Hopper architecture, is a software-hardware co-design that automatically manages FP8 quantization within transformer layers. It dynamically chooses per-tensor scaling factors for weights, activations, and gradients, enabling FP8 training with minimal accuracy loss. The Transformer Engine is integrated into frameworks like PyTorch, TensorFlow, and JAX through NVIDIA's open-source library [7].
Google TPUs have supported BF16 since the TPUv2 (2017) through high-throughput BF16 matrix multiply units; as noted above, the format was developed at Google specifically for TPU training.
AMD MI300X GPUs support FP16, BF16, and FP8 formats on their CDNA 3 architecture matrix cores, providing competitive mixed-precision performance.
Intel Gaudi accelerators and Intel Xeon CPUs (with AMX instructions) also support BF16 mixed-precision training.
Storing activations in FP16/BF16 instead of FP32 halves their memory footprint. Since activations typically dominate memory usage during training (especially for long-sequence transformer models), this translates to roughly 1.5-2x increase in the maximum batch size or sequence length that fits in GPU memory. For the model parameters themselves, the FP16 copy used during computation is half the size of the FP32 master copy, though both must be maintained.
Tensor core operations in FP16/BF16 are 2-8x faster than FP32 operations on modern NVIDIA GPUs, depending on the specific operation and GPU generation. In practice, end-to-end training speedups of 1.5-3x are typical when switching from pure FP32 to mixed precision, with the exact speedup depending on how much of the computation is spent in tensor-core-eligible operations versus other operations (normalization, activation functions, etc.) that remain in FP32 [5].
In distributed training, gradient communication between devices benefits from mixed precision. Transmitting gradients in FP16/BF16 halves the communication volume compared to FP32, which directly improves the throughput of data-parallel training where gradient AllReduce is the bottleneck.
Lower-precision operations consume less energy per operation. This has both direct cost benefits (lower electricity bills for training runs) and environmental implications. As training runs for the largest models consume megawatt-hours of electricity, even modest reductions in per-operation energy are meaningful at scale.
FP8 training represents the current frontier of mixed-precision techniques. With only 8 bits per value, FP8 offers half the memory footprint and roughly double the compute throughput of FP16/BF16 on supported hardware.
The extremely limited precision of FP8 (3 mantissa bits for E4M3, 2 for E5M2) means that naive quantization of weights, activations, and gradients to FP8 typically degrades training accuracy. The primary challenges are the narrow representable range and the presence of outlier values in activations and gradients: a single large outlier forces the quantization scale to sacrifice resolution for the bulk of the tensor.
The standard approach to FP8 training uses per-tensor scaling: each tensor (weight matrix, activation tensor, or gradient tensor) has its own FP32 scaling factor that maps the tensor's values into the representable range of FP8. The scaling factor is typically computed from the maximum absolute value of the tensor in the previous training step (delayed scaling) or the current step (just-in-time scaling).
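Per-tensor scaling can be sketched as follows. This is a NumPy simulation of E4M3 rounding for illustration only: real implementations use hardware FP8 types, and this sketch ignores subnormals and the exact E4M3 encoding details:

```python
import numpy as np

E4M3_MAX = 448.0  # largest representable E4M3 magnitude

def quantize_e4m3(x, scale):
    """Scale into the FP8 range, then simulate 3-bit-mantissa rounding."""
    y = np.clip(x * scale, -E4M3_MAX, E4M3_MAX)
    m, e = np.frexp(y)                # y = m * 2**e with 0.5 <= |m| < 1
    m = np.round(m * 16.0) / 16.0     # keep ~3 stored mantissa bits
    return np.ldexp(m, e)

rng = np.random.default_rng(0)
x = rng.uniform(0.5, 4.0, size=1024).astype(np.float32)

scale = E4M3_MAX / np.max(np.abs(x))  # per-tensor scaling factor, kept in FP32
q = quantize_e4m3(x, scale)
dequant = q / scale                   # recover approximate original values

rel_err = np.max(np.abs(dequant - x) / np.abs(x))
print(rel_err)  # bounded by 2**-4, i.e. at most ~6.25% relative error
```

Because the scale maps the tensor's maximum onto E4M3's maximum, no value is clipped, and the error is pure rounding error from the 3-bit mantissa.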
NVIDIA's Blackwell architecture introduced MXFP8, a block-scaled FP8 format where a separate scaling factor is assigned to each block of 32 consecutive values. This finer-grained scaling better handles outliers and improves accuracy compared to per-tensor scaling, because local outliers only affect the scaling of their block rather than the entire tensor [6].
NVIDIA and its partners have demonstrated FP8 training on a range of models, including GPT-3 175B, with accuracy matching BF16 training. The Transformer Engine manages the complexity of FP8 scaling automatically, and frameworks like DeepSpeed and Megatron-LM have integrated FP8 support. On the H100, FP8 training achieves approximately 2x the throughput of BF16 training for large transformer models [4].
FlashAttention-3 includes FP8 attention kernels for the H100, using block quantization and incoherent processing to maintain accuracy. These kernels approach the theoretical peak FP8 throughput of the H100, achieving close to 1.2 PFLOPS for attention computation [8].
Mixed-precision training is supported by all major deep learning frameworks.
| Framework | Mixed-Precision API | Key Features |
|---|---|---|
| PyTorch | torch.autocast + torch.amp.GradScaler | Automatic mixed precision (AMP); selects precision per operation; dynamic loss scaling |
| TensorFlow / Keras | tf.keras.mixed_precision | Policy-based precision selection; automatic loss scaling |
| NVIDIA NeMo | Built-in mixed-precision support | Integrated with Transformer Engine for FP8; supports BF16 and FP16 |
| DeepSpeed | ZeRO + mixed-precision integration | FP16 and BF16 ZeRO training; FP8 support via Transformer Engine |
| JAX / XLA | jax.numpy.bfloat16 + custom policies | BF16 is the default for TPU training; FP8 support via Transformer Engine |
| Hugging Face Accelerate | mixed_precision config parameter | Wraps PyTorch AMP; supports FP16, BF16, and FP8 |
In PyTorch, enabling mixed-precision training requires only a few lines of code. The torch.autocast context manager automatically selects the appropriate precision for each operation (e.g., FP16 for linear layers, FP32 for layer normalization and softmax). The GradScaler handles dynamic loss scaling. This ease of use has been critical to the widespread adoption of mixed-precision training.
Several operations are known to be sensitive to reduced precision and should generally remain in FP32: softmax (and attention-score normalization more broadly), layer normalization and other reductions over many elements, loss functions, and exponential and logarithm operations whose outputs can overflow or underflow the narrow FP16 range.
Modern automatic mixed-precision implementations (PyTorch AMP, Transformer Engine) handle these distinctions automatically, maintaining an internal list of operations that should run in full precision versus reduced precision.
Another best practice is to monitor for gradient overflow/underflow during training. While dynamic loss scaling handles most cases, unusually deep networks or unusual architectures may require adjustment of the initial scaling factor or the scaling update schedule.
Research on reduced-precision training predates the Micikevicius et al. paper. Gupta et al. (2015) demonstrated training with 16-bit fixed-point arithmetic using stochastic rounding. Courbariaux et al. (2015) explored binary and ternary weight networks. However, these earlier approaches either required significant algorithmic modifications or accepted accuracy degradation.
The Micikevicius et al. contribution was to show that a straightforward recipe (FP32 master weights + FP16 computation + loss scaling) could achieve full accuracy with no changes to model architecture or hyperparameters, making mixed-precision training a drop-in improvement rather than a research project. This practicality drove rapid adoption across the industry.
The subsequent introduction of BF16 by Google (which eliminated the need for loss scaling due to its wider dynamic range) and TF32 by NVIDIA (which provided speedups with zero code changes) further lowered the barrier to adoption. By 2022, mixed-precision training was the default rather than the exception.
As of early 2026, mixed-precision training is universal in large-scale deep learning. Specific trends include:
BF16 is the dominant training format. Nearly all large language models (GPT-4, Claude, Gemini, LLaMA 3, Mistral, DeepSeek) are trained in BF16 mixed precision. FP16 with loss scaling remains in use on older hardware or in specific applications but is no longer the default choice.
FP8 training is moving from experimental to production. With the widespread deployment of H100 and H200 GPUs, and the arrival of Blackwell GPUs with enhanced FP8 support (including MXFP8 block scaling and FP4 for inference), FP8 training is becoming routine for organizations with access to recent hardware. The Transformer Engine and framework integrations have made FP8 training nearly as easy to deploy as BF16 training.
Research on even lower precision continues. FP4 training (4-bit floating point) is an active research area, though it remains experimental and typically requires more sophisticated quantization strategies than FP8. NVIDIA's Blackwell architecture supports NVFP4 for inference, and research groups are exploring FP4 for the forward pass during training while keeping the backward pass in higher precision.
The combination of mixed-precision training with distributed training optimizations (ZeRO, gradient checkpointing, FlashAttention) has created a comprehensive toolkit for efficient large-scale training. Together, these techniques enable training of models with hundreds of billions of parameters on clusters of thousands of GPUs, a capability that was impractical only a few years ago.