Mixed-precision training is a technique for training deep learning models using lower-precision floating-point formats for most computations while maintaining a higher-precision copy of the model weights for numerical stability. By performing forward and backward passes in half-precision (16-bit) or even quarter-precision (8-bit) formats while keeping a master copy of weights in single-precision (FP32), mixed-precision training achieves substantial reductions in memory usage and significant speedups on hardware with specialized low-precision compute units, all without degrading model accuracy. First formally described in 2017 by Micikevicius et al., a team of researchers from NVIDIA and Baidu, the technique has become standard practice for training virtually every modern large language model and deep neural network.
Traditional deep learning training uses 32-bit single-precision floating-point arithmetic (FP32) for all computations. FP32 provides a wide dynamic range (approximately 1.18 x 10^-38 to 3.4 x 10^38) and high precision (about 7 decimal digits), which ensures numerical stability during the many iterations of gradient-based optimization.
However, FP32 is expensive in terms of both memory and computation. Each parameter, gradient, and activation value occupies 4 bytes of memory. The arithmetic units that process FP32 operations are larger, consume more power, and achieve lower throughput than units designed for lower-precision formats. As models grew from millions to billions of parameters, the memory and compute costs of FP32 training became a significant bottleneck.
The key observation behind mixed-precision training is that not all computations in a neural network require the full precision and range of FP32. Many operations, particularly the large matrix multiplications in transformer layers, are tolerant of reduced precision. The gradients and activations that flow through the network carry information that can be adequately represented in fewer bits for the majority of the training process. Only a few critical operations, notably the accumulation of small gradient updates into model weights, genuinely require the precision of FP32 [1].
The following table summarizes the floating-point formats relevant to modern deep learning training.
| Format | Total Bits | Sign Bits | Exponent Bits | Mantissa Bits | Dynamic Range | Precision (decimal digits) | First Hardware Support |
|---|---|---|---|---|---|---|---|
| FP32 (IEEE 754) | 32 | 1 | 8 | 23 | ~1.18e-38 to ~3.4e38 | ~7.2 | All modern CPUs/GPUs |
| TF32 (TensorFloat-32) | 19 | 1 | 8 | 10 | Same as FP32 | ~3.4 | NVIDIA Ampere (A100, 2020) |
| BF16 (Brain Float 16) | 16 | 1 | 8 | 7 | Same as FP32 | ~2.4 | Google TPUv2 (2017), NVIDIA Ampere (2020) |
| FP16 (IEEE 754 half) | 16 | 1 | 5 | 10 | ~6.1e-5 to 65504 | ~3.4 | NVIDIA Pascal (P100, 2016) |
| FP8 E4M3 | 8 | 1 | 4 | 3 | ~1.95e-3 to 448 | ~1.2 | NVIDIA Hopper (H100, 2022) |
| FP8 E5M2 | 8 | 1 | 5 | 2 | ~1.53e-5 to 57344 | ~0.9 | NVIDIA Hopper (H100, 2022) |
| INT8 | 8 | 0 or 1 | N/A | N/A | -128 to 127 or 0 to 255 | N/A | Various GPUs and accelerators |
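The range and precision columns above follow directly from the bit layouts. As a sanity check, they can be derived in a few lines of plain Python (this helper, `fp_properties`, is illustrative and covers only IEEE-754-style formats; FP8 E4M3 deviates slightly from the IEEE convention by reclaiming most of the top exponent code, so it is omitted here):

```python
import math

def fp_properties(exp_bits, man_bits):
    """Derive max normal value, min normal value, and decimal digits of
    precision for an IEEE-754-style format with the given field widths."""
    bias = 2 ** (exp_bits - 1) - 1
    max_exp = 2 ** exp_bits - 2 - bias          # top exponent code is Inf/NaN
    max_normal = (2 - 2 ** -man_bits) * 2.0 ** max_exp
    min_normal = 2.0 ** (1 - bias)
    digits = (man_bits + 1) * math.log10(2)     # 1 implicit + stored mantissa bits
    return max_normal, min_normal, digits

for name, e, m in [("FP32", 8, 23), ("BF16", 8, 7), ("FP16", 5, 10)]:
    mx, mn, d = fp_properties(e, m)
    print(f"{name}: max={mx:.3g} min_normal={mn:.3g} digits={d:.1f}")
```

Running this reproduces the table's values, e.g. FP16 yields a maximum of 65,504 and a minimum normal value of about 6.1e-5.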
FP32 is the baseline format for deep learning, providing ample range and precision for all training operations. Its primary drawback is the memory and compute cost at scale.
Introduced by NVIDIA with the Ampere architecture (A100 GPU, 2020), TF32 is a 19-bit format that combines the 8-bit exponent of FP32 (preserving its full dynamic range) with the 10-bit mantissa of FP16 (providing adequate precision). TF32 is used internally by tensor cores for matrix multiplications: the GPU accepts FP32 inputs, performs the multiply in TF32 precision, and accumulates the result in FP32. This happens transparently and is enabled by default on Ampere and later GPUs, meaning existing FP32 training scripts get up to 10x speedups on tensor core operations without any code changes [2].
TF32 is not a storage format (data in memory remains in FP32); it is purely a computational mode within the tensor cores. NVIDIA has demonstrated that TF32 achieves accuracy equivalent to full FP32 training across a wide range of models [2].
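Although TF32 is on by default for convolutions on Ampere and later GPUs, PyTorch exposes explicit controls for matmuls. A brief sketch (these are real PyTorch APIs; the exact defaults vary by PyTorch version):

```python
import torch

# "high" allows tensor cores to execute FP32 matmuls in TF32;
# "highest" forces true FP32; "medium" additionally allows BF16.
torch.set_float32_matmul_precision("high")
print(torch.get_float32_matmul_precision())  # -> "high"

# Older flag-based controls, equivalent for matmuls and cuDNN convolutions.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
```

Because TF32 is a compute mode rather than a storage format, these switches change tensor core behavior without altering any tensor dtypes in the script.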
BF16 was originally developed by Google for use on TPUs and has since been adopted by NVIDIA (Ampere and later), AMD, and Intel. It uses 8 exponent bits (the same as FP32), giving it the same dynamic range as FP32, but only 7 mantissa bits, providing less precision. The key advantage of BF16 over FP16 is its wider dynamic range, which eliminates the need for loss scaling in most cases. Gradients that would underflow in FP16 can be represented in BF16 without special handling.
BF16 has become the default training format for most large language models. Models like GPT-4, LLaMA, and Gemini are typically trained in BF16 mixed precision. On NVIDIA Ampere and Hopper GPUs, BF16 tensor core operations achieve the same throughput as FP16 operations [3].
FP16 was the first reduced-precision format widely adopted for deep learning training: Pascal GPUs (P100, 2016) supported FP16 arithmetic on CUDA cores, and Volta (V100, 2017) added FP16 tensor cores. It uses 5 exponent bits and 10 mantissa bits, providing higher precision than BF16 but a much narrower dynamic range (maximum value of 65,504). Gradients with magnitudes below approximately 6 x 10^-5 fall into the subnormal range and lose precision, and below roughly 6 x 10^-8 they underflow to zero entirely, a common occurrence during training of deep networks.
To address this, FP16 mixed-precision training requires loss scaling (described below). FP16 remains useful on hardware that does not support BF16, and it provides slightly higher precision in the mantissa compared to BF16, which can matter for certain operations [1].
FP8 comes in two variants defined by the OFP8 standard: E4M3 (4 exponent bits, 3 mantissa bits) and E5M2 (5 exponent bits, 2 mantissa bits). E4M3 is typically used for forward pass computations (where precision matters more), while E5M2 is used for gradients in the backward pass (where dynamic range is more important).
FP8 was first supported in hardware by NVIDIA's Hopper architecture (H100 GPU, 2022). On the H100, FP8 tensor core operations achieve approximately double the throughput of FP16/BF16 operations, making FP8 extremely attractive for training. However, the very limited precision of FP8 requires careful quantization strategies, including per-tensor or per-block scaling factors, to maintain training accuracy [4].
INT8 (8-bit integer) is primarily used for inference quantization rather than training. Models trained in higher precision are quantized to INT8 for deployment, reducing memory and compute requirements. Some research has explored INT8 training, but it remains less common than floating-point mixed precision for training due to the lack of dynamic range in integer formats.
The standard mixed-precision training procedure, as described by Micikevicius et al. (2017), involves three key techniques used together: FP32 master weights, FP16/BF16 forward and backward passes, and loss scaling [1].
A primary copy of all model weights is maintained in FP32. This "master copy" is the authoritative version of the weights and is where gradient updates are applied. At the start of each training step, the FP32 master weights are cast to the lower-precision format (FP16 or BF16) for use in the forward and backward passes.
The reason FP32 master weights are necessary is that weight updates are often very small relative to the weight values themselves. A typical learning rate of 1e-4 multiplied by a gradient of 1e-3 produces an update of 1e-7. In FP16 this value is barely representable (the smallest subnormal is about 6e-8), and adding it to a weight of typical magnitude has no effect at all: the spacing between adjacent FP16 values near 1.0 is about 1e-3, so the update is rounded away. In BF16, the 7-bit mantissa makes this swamping even more severe. FP32's 23-bit mantissa can absorb these small updates accurately [1].
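This swamping effect is easy to reproduce with NumPy (a small numerical illustration, not the training mechanism itself):

```python
import numpy as np

update = 1e-7  # e.g. learning rate 1e-4 times gradient 1e-3

# In FP16 the update is swallowed: the spacing between adjacent FP16
# values near 1.0 is 2**-10 (about 1e-3), far larger than the update.
w16 = np.float16(1.0)
assert w16 + np.float16(update) == w16

# In FP32 the spacing near 1.0 is 2**-23 (about 1.2e-7), so the update
# survives, rounded to the nearest representable value.
w32 = np.float32(1.0)
assert w32 + np.float32(update) != w32
```

Keeping the master weights in FP32 ensures every such micro-update is accumulated rather than rounded away.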
The forward pass (computing predictions from inputs) and the backward pass (computing gradients via backpropagation) are performed using the lower-precision copies of the weights. All intermediate activations, attention scores, and layer outputs are computed and stored in FP16 or BF16. This is where the bulk of the memory savings come from: activations often dominate memory usage during training (especially for long sequences), and storing them in 16 bits instead of 32 bits cuts activation memory in half.
Matrix multiplications are performed on tensor cores, which are specialized hardware units that perform fused multiply-accumulate operations at high throughput in lower precision while accumulating results in FP32. This FP32 accumulation is critical for maintaining the numerical accuracy of large dot products and matrix multiplications [5].
Loss scaling is specifically needed for FP16 training (BF16's wider dynamic range generally makes it unnecessary). Before backpropagation, the loss is multiplied by a scale factor S; by the chain rule, every gradient in the backward pass is then scaled by S as well, shifting small gradient values up into FP16's representable range. The gradients are divided by S (unscaled) before being applied to the weights. There are two variants:
Static loss scaling uses a fixed scaling factor chosen before training. Common values range from 8 to 32,768. The appropriate value depends on the model and must sometimes be tuned manually.
Dynamic loss scaling adjusts the scaling factor automatically during training. It starts with a large value and monitors for NaN or Inf values in the gradients. If overflow is detected, the scaling factor is halved, and the training step is skipped. If training proceeds without overflow for a specified number of steps, the scaling factor is increased. This approach requires no manual tuning and is the default in most frameworks [1].
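The dynamic scaling logic can be sketched in a few lines. This is a simplified stand-in for what framework scalers do internally; the class name, defaults, and growth schedule here are illustrative, not any library's actual implementation:

```python
import math

class DynamicLossScaler:
    """Simplified dynamic loss scaling: halve the scale on overflow and
    skip the step; double it after a run of overflow-free steps."""
    def __init__(self, init_scale=2.0 ** 16, growth_interval=2000):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self.clean_steps = 0

    def update(self, grads):
        """Return True if the optimizer step should be applied."""
        if any(math.isnan(g) or math.isinf(g) for g in grads):
            self.scale /= 2           # overflow: back off and skip this step
            self.clean_steps = 0
            return False
        self.clean_steps += 1
        if self.clean_steps >= self.growth_interval:
            self.scale *= 2           # stable for a while: try a larger scale
            self.clean_steps = 0
        return True

scaler = DynamicLossScaler(init_scale=1024.0, growth_interval=3)
scaler.update([float("inf")])    # overflow -> scale drops to 512.0
for _ in range(3):
    scaler.update([0.1, -0.2])   # three clean steps -> scale grows back to 1024.0
print(scaler.scale)              # 1024.0
```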
PyTorch provides dynamic loss scaling through torch.amp.GradScaler, which integrates with the torch.autocast context manager for automatic mixed-precision training.
A typical mixed-precision training iteration proceeds as follows:

1. Cast the FP32 master weights to FP16/BF16.
2. Run the forward pass in reduced precision and compute the loss.
3. Multiply the loss by the scale factor (FP16 only).
4. Run the backward pass in reduced precision, producing scaled gradients.
5. Unscale the gradients; if any are Inf or NaN, skip the step and reduce the scale (dynamic scaling).
6. Apply the optimizer update to the FP32 master weights.
The foundational paper on mixed-precision training, titled "Mixed Precision Training," was authored by Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu, researchers affiliated with NVIDIA and Baidu at the time. The paper was submitted to arXiv in October 2017 and published at ICLR 2018 [1].
The paper demonstrated that mixed-precision training with FP16 arithmetic could match FP32 training accuracy across a wide range of tasks, including image classification (ResNet on ImageNet), object detection (Faster R-CNN), speech recognition (DeepSpeech 2), neural machine translation, language modeling, and generative adversarial networks (DCGAN). The authors showed that the three techniques of FP32 master weights, loss scaling, and FP32 accumulation in matrix multiplications were sufficient to train all tested models without accuracy loss [1].
This paper was influential because it provided both the practical recipe and the empirical evidence needed for the community to adopt mixed-precision training with confidence. Prior to this work, reduced-precision training was viewed as risky and required ad hoc adjustments for each model. Micikevicius et al. demonstrated a general, reliable approach.
Mixed-precision training performance depends critically on hardware support for low-precision arithmetic, specifically through tensor cores (on NVIDIA GPUs) or equivalent units on other accelerators.
| GPU Generation | Architecture | Year | Key Precision Features | Tensor Core Generation |
|---|---|---|---|---|
| Pascal (P100) | GP100 | 2016 | FP16 CUDA cores (no tensor cores) | N/A |
| Volta (V100) | GV100 | 2017 | FP16 tensor cores with FP32 accumulation | 1st generation |
| Turing (T4, RTX 20xx) | TU10x | 2018 | FP16 tensor cores, INT8/INT4 inference | 2nd generation |
| Ampere (A100, RTX 30xx) | GA100 | 2020 | TF32, BF16, FP16 tensor cores; sparse tensor cores | 3rd generation |
| Hopper (H100, H200) | GH100 | 2022 | FP8 (E4M3, E5M2), FP16, BF16, TF32 tensor cores; Transformer Engine | 4th generation |
| Blackwell (B100, B200) | GB100 | 2024-2025 | FP8, FP4 (NVFP4), MXFP8, BF16, FP16, TF32 | 5th generation |
Tensor cores are specialized matrix multiply-accumulate units on NVIDIA GPUs. Each tensor core performs a D = A x B + C operation on small matrix tiles (e.g., 4x4 or 16x16). The inputs A and B can be in a lower-precision format (FP16, BF16, FP8, etc.), while the accumulator C and output D are in FP32. This design is what makes mixed-precision training possible without accuracy loss: the individual multiplications happen in low precision for speed, but the accumulated result retains FP32 accuracy [5].
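The value of FP32 accumulation is easy to see in a toy dot product. The following NumPy snippet simulates the tensor-core pattern (it is a software illustration, not actual tensor-core code):

```python
import numpy as np

# Dot product of 10,000 ones: the exact answer is 10,000.
a = np.ones(10_000, dtype=np.float16)
b = np.ones(10_000, dtype=np.float16)
products = a * b  # elementwise FP16 multiplies, as on a tensor core

# Accumulate in FP16: once the sum reaches 2048, the spacing between
# adjacent FP16 values is 2.0, so adding 1.0 rounds away and the sum
# stagnates at 2048 forever.
acc16 = np.float16(0.0)
for p in products:
    acc16 = np.float16(acc16 + p)

# Accumulate in FP32 (the tensor-core approach): exact for this sum.
acc32 = np.float32(0.0)
for p in products:
    acc32 += np.float32(p)

print(acc16, acc32)  # 2048.0 10000.0
```

The FP16 accumulator is off by nearly 80%, while the FP32 accumulator is exact, which is precisely why tensor cores keep C and D in FP32.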
Tensor core throughput has increased dramatically across GPU generations. On the A100, FP16/BF16 tensor cores deliver 312 TFLOPS (dense), compared to 19.5 TFLOPS for FP32 non-tensor-core math. On the H100, FP16/BF16 tensor cores deliver approximately 990 dense TFLOPS, and FP8 tensor cores deliver roughly 1,979 dense TFLOPS, with structured sparsity doubling each of these figures. The B200 roughly doubles FP8 throughput again and adds support for the new FP4 format [6].
The Transformer Engine, introduced with the Hopper architecture, is a software-hardware co-design that automatically manages FP8 quantization within transformer layers. It dynamically chooses per-tensor scaling factors for weights, activations, and gradients, enabling FP8 training with minimal accuracy loss. The Transformer Engine is integrated into frameworks like PyTorch, TensorFlow, and JAX through NVIDIA's open-source library [7].
Google TPUs have supported BF16 since the TPUv2 (2017) through high-throughput BF16 matrix multiply units; as noted above, the format was developed at Google specifically for TPU training.
AMD MI300X GPUs support FP16, BF16, and FP8 formats on their CDNA 3 architecture matrix cores, providing competitive mixed-precision performance.
Intel Gaudi accelerators and Intel Xeon CPUs (with AMX instructions) also support BF16 mixed-precision training.
Storing activations in FP16/BF16 instead of FP32 halves their memory footprint. Since activations typically dominate memory usage during training (especially for long-sequence transformer models), this translates to roughly 1.5-2x increase in the maximum batch size or sequence length that fits in GPU memory. For the model parameters themselves, the FP16 copy used during computation is half the size of the FP32 master copy, though both must be maintained.
Tensor core operations in FP16/BF16 are 2-8x faster than FP32 operations on modern NVIDIA GPUs, depending on the specific operation and GPU generation. In practice, end-to-end training speedups of 1.5-3x are typical when switching from pure FP32 to mixed precision, with the exact speedup depending on how much of the computation is spent in tensor-core-eligible operations versus other operations (normalization, activation functions, etc.) that remain in FP32 [5].
In distributed training, gradient communication between devices benefits from mixed precision. Transmitting gradients in FP16/BF16 halves the communication volume compared to FP32, which directly improves the throughput of data-parallel training where gradient AllReduce is the bottleneck.
Lower-precision operations consume less energy per operation. This has both direct cost benefits (lower electricity bills for training runs) and environmental implications. As training runs for the largest models consume megawatt-hours of electricity, even modest reductions in per-operation energy are meaningful at scale.
FP8 training represents the current frontier of mixed-precision techniques. With only 8 bits per value, FP8 offers half the memory footprint and roughly double the compute throughput of FP16/BF16 on supported hardware.
The extremely limited precision of FP8 (3 mantissa bits for E4M3, 2 for E5M2) means that naive quantization of weights, activations, and gradients to FP8 typically degrades training accuracy. The primary challenges are the narrow representable range and the presence of outlier values in activations and gradients: a single large outlier forces the quantization scale to sacrifice resolution for the bulk of the tensor.
The standard approach to FP8 training uses per-tensor scaling: each tensor (weight matrix, activation tensor, or gradient tensor) has its own FP32 scaling factor that maps the tensor's values into the representable range of FP8. The scaling factor is typically computed from the maximum absolute value of the tensor in the previous training step (delayed scaling) or the current step (just-in-time scaling).
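Per-tensor scaling can be sketched as follows. This is a NumPy simulation of E4M3 rounding for illustration only: real implementations use hardware FP8 types, and this sketch ignores subnormals and the exact E4M3 encoding details:

```python
import numpy as np

E4M3_MAX = 448.0  # largest representable E4M3 magnitude

def quantize_e4m3(x, scale):
    """Scale into the FP8 range, then simulate 3-bit-mantissa rounding."""
    y = np.clip(x * scale, -E4M3_MAX, E4M3_MAX)
    m, e = np.frexp(y)                # y = m * 2**e with 0.5 <= |m| < 1
    m = np.round(m * 16.0) / 16.0     # keep ~3 stored mantissa bits
    return np.ldexp(m, e)

rng = np.random.default_rng(0)
x = rng.uniform(0.5, 4.0, size=1024).astype(np.float32)

scale = E4M3_MAX / np.max(np.abs(x))  # per-tensor scaling factor, kept in FP32
q = quantize_e4m3(x, scale)
dequant = q / scale                   # recover approximate original values

rel_err = np.max(np.abs(dequant - x) / np.abs(x))
print(rel_err)  # bounded by 2**-4, i.e. at most ~6.25% relative error
```

Because the scale maps the tensor's maximum onto E4M3's maximum, no value is clipped, and the error is pure rounding error from the 3-bit mantissa.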
NVIDIA's Blackwell architecture introduced MXFP8, a block-scaled FP8 format where a separate scaling factor is assigned to each block of 32 consecutive values. This finer-grained scaling better handles outliers and improves accuracy compared to per-tensor scaling, because local outliers only affect the scaling of their block rather than the entire tensor [6].
NVIDIA and its partners have demonstrated FP8 training on a range of models, including GPT-3 175B, with accuracy matching BF16 training. The Transformer Engine manages the complexity of FP8 scaling automatically, and frameworks like DeepSpeed and Megatron-LM have integrated FP8 support. On the H100, FP8 training achieves approximately 2x the throughput of BF16 training for large transformer models [4].
FlashAttention-3 includes FP8 attention kernels for the H100, using block quantization and incoherent processing to maintain accuracy. These kernels approach the theoretical peak FP8 throughput of the H100, achieving close to 1.2 PFLOPS for attention computation [8].
Mixed-precision training is supported by all major deep learning frameworks.
| Framework | Mixed-Precision API | Key Features |
|---|---|---|
| PyTorch | torch.autocast + torch.amp.GradScaler | Automatic mixed precision (AMP); selects precision per operation; dynamic loss scaling |
| TensorFlow / Keras | tf.keras.mixed_precision | Policy-based precision selection; automatic loss scaling |
| NVIDIA NeMo | Built-in mixed-precision support | Integrated with Transformer Engine for FP8; supports BF16 and FP16 |
| DeepSpeed | ZeRO + mixed-precision integration | FP16 and BF16 ZeRO training; FP8 support via Transformer Engine |
| JAX / XLA | jax.numpy.bfloat16 + custom policies | BF16 is the default for TPU training; FP8 support via Transformer Engine |
| Hugging Face Accelerate | mixed_precision config parameter | Wraps PyTorch AMP; supports FP16, BF16, and FP8 |
In PyTorch, enabling mixed-precision training requires only a few lines of code. The torch.autocast context manager automatically selects the appropriate precision for each operation (e.g., FP16 for linear layers, FP32 for layer normalization and softmax). The GradScaler handles dynamic loss scaling. This ease of use has been critical to the widespread adoption of mixed-precision training.
Several operations are known to be sensitive to reduced precision and should generally remain in FP32: softmax (and attention-score normalization more broadly), layer normalization and other reductions over many elements, loss functions, and exponential and logarithm operations whose outputs can overflow or underflow the narrow FP16 range.
Modern automatic mixed-precision implementations (PyTorch AMP, Transformer Engine) handle these distinctions automatically, maintaining an internal list of operations that should run in full precision versus reduced precision.
Another best practice is to monitor for gradient overflow/underflow during training. While dynamic loss scaling handles most cases, unusually deep networks or unusual architectures may require adjustment of the initial scaling factor or the scaling update schedule.
Research on reduced-precision training predates the Micikevicius et al. paper. Gupta et al. (2015) demonstrated training with 16-bit fixed-point arithmetic using stochastic rounding. Courbariaux et al. (2015) explored binary and ternary weight networks. However, these earlier approaches either required significant algorithmic modifications or accepted accuracy degradation.
The Micikevicius et al. contribution was to show that a straightforward recipe (FP32 master weights + FP16 computation + loss scaling) could achieve full accuracy with no changes to model architecture or hyperparameters, making mixed-precision training a drop-in improvement rather than a research project. This practicality drove rapid adoption across the industry.
The subsequent introduction of BF16 by Google (which eliminated the need for loss scaling due to its wider dynamic range) and TF32 by NVIDIA (which provided speedups with zero code changes) further lowered the barrier to adoption. By 2022, mixed-precision training was the default rather than the exception.
As of early 2026, mixed-precision training is universal in large-scale deep learning. Specific trends include:
BF16 is the dominant training format. Nearly all large language models (GPT-4, Claude, Gemini, LLaMA 3, Mistral, DeepSeek) are trained in BF16 mixed precision. FP16 with loss scaling remains in use on older hardware or in specific applications but is no longer the default choice.
FP8 training is moving from experimental to production. With the widespread deployment of H100 and H200 GPUs, and the arrival of Blackwell GPUs with enhanced FP8 support (including MXFP8 block scaling and FP4 for inference), FP8 training is becoming routine for organizations with access to recent hardware. The Transformer Engine and framework integrations have made FP8 training nearly as easy to deploy as BF16 training.
Research on even lower precision continues. FP4 training (4-bit floating point) is an active research area, though it remains experimental and typically requires more sophisticated quantization strategies than FP8. NVIDIA's Blackwell architecture supports NVFP4 for inference, and research groups are exploring FP4 for the forward pass during training while keeping the backward pass in higher precision.
The combination of mixed-precision training with distributed training optimizations (ZeRO, gradient checkpointing, FlashAttention) has created a comprehensive toolkit for efficient large-scale training. Together, these techniques enable training of models with hundreds of billions of parameters on clusters of thousands of GPUs, a capability that was impractical only a few years ago.