# bfloat16

> Source: https://aiwiki.ai/wiki/bfloat16
> Updated: 2026-06-21
> Categories: AI Hardware, AI Infrastructure
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**bfloat16** (short for **Brain Floating Point Format**, sometimes written **BF16**) is a 16-bit floating-point number format that uses one sign bit, eight exponent bits, and seven mantissa bits, giving it the same numerical range as 32-bit single precision (FP32) at half the memory cost.[1] It was developed by [Google Brain](/wiki/google_brain) and first deployed in [Tensor Processing Units (TPUs)](/wiki/tensor_processing_unit_tpu) to accelerate neural network workloads. By keeping FP32's full eight-bit exponent (a range of roughly 1.18e-38 to 3.4e+38) while truncating the mantissa from 23 bits to 7, bfloat16 sacrifices decimal precision (about 2 to 3 significant digits) but preserves the wide dynamic range that neural network training depends on.[1] For machine learning, where weights, activations, and gradients can span many orders of magnitude but rarely need more than two or three significant decimal digits, that trade is overwhelmingly favorable, and bfloat16 has become the default training and inference format for most large neural networks built since 2020.

The format is now natively supported across nearly every modern AI accelerator, including Google TPU v2 through v7, [NVIDIA](/wiki/nvidia) Ampere, Hopper, and Blackwell [GPUs](/wiki/gpu), AMD Instinct MI200/MI300 series, Intel Cooper Lake and Sapphire Rapids CPUs, ARM Neoverse cores, and Apple Silicon from the M2 generation onward.[3] It is also the standard "low precision" dtype in the major deep learning frameworks, including [PyTorch](/wiki/pytorch), TensorFlow, and [JAX](/wiki/jax), where it is exposed through automatic mixed precision APIs and is treated as a first-class NumPy-compatible dtype on TPU backends.[15][17]

## What is the bfloat16 bit layout?

A bfloat16 value occupies 16 bits of memory and uses the same encoding scheme as IEEE 754 binary floating-point. One bit holds the sign, eight bits encode a biased exponent with bias 127, and seven bits encode the explicit fraction (with an implicit leading 1 for normal numbers, giving 8 bits of effective significand precision). The structural choice is to take a standard IEEE 754 single-precision number and simply truncate the lower 16 bits of the mantissa.[5] That property makes conversion between FP32 and bfloat16 trivial: a hardware unit can drop or extend the trailing 16 bits with a single shift, and rounding-to-nearest-even can be implemented with a tiny adder.
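The truncation and round-to-nearest-even paths described above can be sketched with nothing but the Python standard library. This is an illustrative sketch, not any vendor's hardware logic: the helper names `float_to_bf16_bits` and `bf16_bits_to_float` are hypothetical, and the input is first rounded to FP32, as it would be on an accelerator.

```python
import struct

def float_to_bf16_bits(x: float, round_nearest_even: bool = True) -> int:
    """Return the 16-bit bfloat16 pattern for x, going through its FP32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    if not round_nearest_even:
        return bits >> 16                      # plain truncation: keep the top 16 bits
    is_nan = (bits & 0x7F800000) == 0x7F800000 and (bits & 0x007FFFFF) != 0
    if is_nan:
        return (bits >> 16) | 0x0040           # set a fraction bit so the result stays NaN
    # The "tiny adder": add 0x7FFF plus the lowest kept bit, then drop the low half.
    bits += 0x7FFF + ((bits >> 16) & 1)
    return (bits >> 16) & 0xFFFF

def bf16_bits_to_float(b: int) -> float:
    """Widen a bfloat16 pattern to FP32 by appending 16 zero bits."""
    return struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))[0]

print(hex(float_to_bf16_bits(1 / 3)))                            # 0x3eab (rounded)
print(hex(float_to_bf16_bits(1 / 3, round_nearest_even=False)))  # 0x3eaa (truncated)
print(bf16_bits_to_float(float_to_bf16_bits(3.14159265)))        # 3.140625
```

Widening is exact in every case, because each bfloat16 value is also a valid FP32 value; only the narrowing direction loses information.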

### Bit positions in a bfloat16 word

| Bit index (MSB to LSB) | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Field | S | E | E | E | E | E | E | E | E | M | M | M | M | M | M | M |
| Meaning | Sign | exp7 | exp6 | exp5 | exp4 | exp3 | exp2 | exp1 | exp0 | m6 | m5 | m4 | m3 | m2 | m1 | m0 |

The encoded value of a normal bfloat16 number is `(-1)^S * 1.M * 2^(E - 127)`, where `M` is the 7-bit fraction interpreted as a binary expansion after the implicit 1. Subnormal numbers, infinities, signaling NaNs, and quiet NaNs follow the same conventions as IEEE 754 single precision. The minimum positive normal value is `2^-126` (about 1.18e-38), identical to FP32, and the maximum finite magnitude is approximately 3.39e38, marginally below FP32's 3.40e38 because the largest bfloat16 significand has fewer bits set.[22]
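As a worked instance of the formula, the following sketch splits a 16-bit pattern into its three fields and evaluates it. The function name `decode_bf16` is hypothetical, and the subnormal and special-value branches follow the IEEE 754 conventions mentioned above.

```python
def decode_bf16(bits: int) -> float:
    """Evaluate a 16-bit bfloat16 pattern as (-1)^S * 1.M * 2^(E - 127)."""
    sign = (bits >> 15) & 0x1
    exponent = (bits >> 7) & 0xFF
    fraction = bits & 0x7F
    if exponent == 0xFF:                        # all-ones exponent: infinity or NaN
        if fraction:
            return float("nan")
        return float("-inf") if sign else float("inf")
    if exponent == 0:                           # subnormal: no implicit leading 1
        magnitude = (fraction / 128.0) * 2.0 ** -126
    else:                                       # normal number
        magnitude = (1.0 + fraction / 128.0) * 2.0 ** (exponent - 127)
    return -magnitude if sign else magnitude

print(decode_bf16(0x3F80))   # 1.0
print(decode_bf16(0xC049))   # -3.140625
print(decode_bf16(0x0080))   # smallest positive normal, 2^-126
print(decode_bf16(0x7F7F))   # largest finite value, (2 - 2^-7) * 2^127
```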

### How does bfloat16 differ from IEEE FP16?

FP16 (the IEEE 754 binary16 format, also called "half precision") uses one sign bit, five exponent bits, and ten mantissa bits. The two formats spend the same total of 16 bits very differently. FP16 buys precision (about 3.3 decimal digits) at the cost of a narrow dynamic range (roughly 6.1e-5 to 6.55e4). bfloat16 buys range (1.18e-38 to 3.39e38, essentially equal to FP32) at the cost of precision (about 2.3 decimal digits, or roughly 8 bits of significand).[23] The exponent difference is what matters most for machine learning: the cost of FP16's narrow exponent is constant overflow risk on activations and constant underflow risk on small gradients, both of which require explicit *loss scaling* to manage during training.[25] bfloat16 inherits FP32's range and avoids both problems.[1] The cost is that individual rounding errors are larger by a factor of 8 (2^3) than in FP16, which is acceptable for stochastic gradient updates that are noisy by nature.
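The opposite trade-offs are easy to observe from plain Python, because the standard `struct` module can pack IEEE binary16 values with the `e` format code. The sketch below assumes finite inputs; `round_fp16`, `round_bf16`, and `fp16_overflows` are hypothetical helper names.

```python
import struct

def round_fp16(x: float) -> float:
    """Round to the nearest IEEE binary16 value (struct format 'e')."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def round_bf16(x: float) -> float:
    """Round a finite value to the nearest bfloat16 (ties to even), via FP32 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = ((bits + 0x7FFF + ((bits >> 16) & 1)) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]

def fp16_overflows(x: float) -> bool:
    """True if x is too large in magnitude to be stored as a finite FP16 value."""
    try:
        struct.pack("<e", x)
        return False
    except (OverflowError, struct.error):
        return True

# Precision: FP16 resolves steps of 2^-10 near 1.0, bfloat16 only steps of 2^-7.
print(round_fp16(1.001))        # 1.0009765625
print(round_bf16(1.001))        # 1.0

# Range: 70000 is beyond FP16's largest finite value (65504) but trivial for bfloat16.
print(fp16_overflows(70000.0))  # True
print(round_bf16(70000.0))      # 70144.0
```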

## Who invented bfloat16 and when?

bfloat16 was conceived inside Google Brain during the design of the second-generation Tensor Processing Unit. The first TPU (TPU v1, 2015) was an inference-only chip that operated on 8-bit integers.[24] When Google's hardware team began designing TPU v2 in 2016 and 2017, the goal was to support training as well as inference, and that meant supporting floating-point arithmetic. Norman Jouppi and the TPU architecture team observed, based on years of internal experiments, that neural networks were far more sensitive to the *range* of representable values than to the *precision* of any individual value.[24] Underflows, overflows, and NaNs broke training runs; an extra bit or two of mantissa precision rarely made a measurable accuracy difference. They responded by stripping FP32 down to its top 16 bits, calling the result "Brain Floating Point."

The format first shipped in TPU v2 in 2017 and was made publicly available through Cloud TPU in 2018.[24] Shibo Wang and Pankaj Kanwar formally documented it in the 2019 Google Cloud whitepaper *BFloat16: The secret to high performance on Cloud TPUs*.[1] As that whitepaper put it, "The physical size of a hardware multiplier scales with the square of the mantissa width," so reducing the mantissa from FP32's 23 bits to 7 bits shrinks each multiplier by a factor of about 11 on paper, since (23/7)^2 is roughly 10.8.[1] In practice the bfloat16 multiplier is roughly half the area of an FP16 multiplier and one-eighth the area of an FP32 multiplier, which lets a TPU chip pack many more compute lanes into the same die budget.[1] The matrix multiplication unit (MXU) of a TPU v2 or v3 chip is a 128 by 128 systolic array of bfloat16 multipliers feeding FP32 accumulators, an arrangement that captures most of the speed of pure 16-bit arithmetic while preserving FP32 stability for the long sums of products that compose a matrix multiply.[1] Jouppi and colleagues later presented the full TPU v2/v3 design, including the bfloat16 arithmetic, the XLA compiler, and the interconnect, in a 2020 Communications of the ACM paper.[24]

## Numerical Trade-offs

Floating-point formats encode three things: a sign, a magnitude (exponent), and a fractional precision (mantissa). For neural networks, each plays a distinct role. The exponent governs whether the value is even representable; the mantissa governs how accurately a representable value can be expressed. Both matter for accuracy, but they do not matter equally.

Weights in a trained network are typically in the range of about 1e-3 to 1e+1 in absolute value. Activations, after batch normalization or layer normalization, are usually in a similar range. Gradients, however, can be much smaller, often near 1e-6 or 1e-7 deep inside large language models, and can occasionally spike to 1e+3 or larger during training instabilities. The total dynamic range required for stable training is therefore at least 10 orders of magnitude. FP32 supports about 76 orders of magnitude, FP16 supports about 9 orders, and bfloat16 supports about 76 orders.[22] FP16 is right at the edge of being usable for training; bfloat16 has comfortable headroom in both directions.
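The orders-of-magnitude figures quoted above follow directly from the exponent and mantissa widths. A small sketch, counting only normal (non-subnormal) values and using the hypothetical helpers `normal_range` and `orders_of_magnitude`:

```python
import math

def normal_range(exp_bits: int, man_bits: int) -> tuple[float, float]:
    """Smallest positive normal and largest finite value of an IEEE-style format."""
    bias = 2 ** (exp_bits - 1) - 1
    smallest = 2.0 ** (1 - bias)
    largest = (2.0 - 2.0 ** -man_bits) * 2.0 ** bias
    return smallest, largest

def orders_of_magnitude(exp_bits: int, man_bits: int) -> float:
    """Decimal orders of magnitude spanned by the normal range."""
    smallest, largest = normal_range(exp_bits, man_bits)
    return math.log10(largest / smallest)

for name, e, m in [("FP32", 8, 23), ("FP16", 5, 10), ("bfloat16", 8, 7)]:
    lo, hi = normal_range(e, m)
    print(f"{name:9s} {lo:.3g} .. {hi:.3g}  (~{orders_of_magnitude(e, m):.0f} orders)")
```

Subnormals extend the low end somewhat further (FP16 reaches down to about 6e-8), but with progressively fewer significant bits.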

The trade-off shows up most clearly in three behaviors. First, a model trained in pure FP16 typically requires loss scaling, in which the loss value is multiplied by a constant before backpropagation and the resulting gradients are divided by the same constant before being applied.[25] This pushes small gradient magnitudes up into the FP16 normal range. bfloat16 does not need loss scaling at all, because gradients near 1e-7 are still well inside its representable range.[1] Second, FP16 can encode small differences in weight values more precisely, which can matter for inference of an already-trained model where the weight distribution is very compact. bfloat16 will round more aggressively in those situations and can introduce visible accuracy loss on tasks like high-precision regression. Third, bfloat16 is generally a drop-in replacement for FP32 in training: weights initialized in FP32 truncate cleanly to bfloat16, and a model can be moved between the two formats without numerical translation problems.[1][14] FP16 requires more careful handling.
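The loss-scaling mechanism in the first point can be illustrated on a single gradient value. In this sketch the gradient of 1e-8 and the scale factor of 2^16 are arbitrary illustrative choices, and `round_fp16` and `round_bf16` are hypothetical helpers that emulate storing a value in each format.

```python
import struct

def round_fp16(x: float) -> float:
    """Round to the nearest IEEE binary16 value (struct format 'e')."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def round_bf16(x: float) -> float:
    """Round a finite value to the nearest bfloat16 (ties to even), via FP32 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = ((bits + 0x7FFF + ((bits >> 16) & 1)) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]

grad = 1e-8            # a small gradient, as seen deep inside a large model
loss_scale = 2.0 ** 16  # constant applied to the loss before backpropagation

# Without scaling, the gradient is below FP16's smallest subnormal and flushes to zero.
fp16_unscaled = round_fp16(grad)

# With scaling, the stored value lands in FP16's normal range; unscale in higher precision.
fp16_recovered = round_fp16(grad * loss_scale) / loss_scale

# bfloat16 stores the same gradient directly, with no scaling step.
bf16_direct = round_bf16(grad)

print(fp16_unscaled)    # 0.0
print(fp16_recovered)   # close to 1e-8
print(bf16_direct)      # close to 1e-8, at coarser precision
```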

## How does bfloat16 compare to other number formats?

Deep learning hardware has steadily added support for narrower number formats since 2017. Each format trades precision and range against memory, bandwidth, and arithmetic throughput. The table below summarizes the formats most commonly used in modern AI training and inference.

| Format | Total bits | Sign | Exponent | Mantissa | Approx. range | Approx. decimal precision | First major hardware |
|---|---|---|---|---|---|---|---|
| FP64 (double) | 64 | 1 | 11 | 52 | 2.2e-308 to 1.8e+308 | 15-17 digits | All general-purpose CPUs and GPUs |
| FP32 (single) | 32 | 1 | 8 | 23 | 1.18e-38 to 3.4e+38 | 7 digits | Universal |
| TF32 (TensorFloat-32) | 19 (stored in 32) | 1 | 8 | 10 | 1.18e-38 to 3.4e+38 | 3-4 digits | NVIDIA Ampere A100 (2020)[10] |
| FP16 (half) | 16 | 1 | 5 | 10 | 6.1e-5 to 6.55e+4 | 3-4 digits | NVIDIA Pascal P100 (2016) |
| bfloat16 (BF16) | 16 | 1 | 8 | 7 | 1.18e-38 to 3.4e+38 | 2-3 digits | Google TPU v2 (2017-2018)[1] |
| FP8 E4M3 | 8 | 1 | 4 | 3 | ~1.95e-3 to ~448 | ~1.5 digits | NVIDIA Hopper H100 (2022)[8] |
| FP8 E5M2 | 8 | 1 | 5 | 2 | ~1.5e-5 to ~5.7e+4 | ~1 digit | NVIDIA Hopper H100 (2022)[8] |
| MXFP6 E3M2 | 6 plus shared scale | 1 | 3 | 2 | block-scaled | ~1 digit | NVIDIA Blackwell, AMD MI355 (2024-2025)[11] |
| MXFP4 / NVFP4 E2M1 | 4 plus shared scale | 1 | 2 | 1 | block-scaled | <1 digit | NVIDIA Blackwell, AMD MI355 (2024-2025)[9] |

The most important comparisons are summarized below.

### bfloat16 versus FP32

FP32 is the historical baseline for both training and inference. It uses 32 bits per value, has 23 bits of mantissa, and represents about seven decimal digits with a range from roughly 1.18e-38 to 3.4e+38. bfloat16 keeps the same 8-bit exponent as FP32 and therefore the same range, but truncates the mantissa to seven bits. The result is half the memory footprint, roughly half the memory bandwidth requirement, and proportional improvements in cache efficiency.[2] For nearly every deep learning workload, the accuracy loss is invisible at the model output, so bfloat16 has displaced FP32 as the default training format.

### bfloat16 versus FP16

FP16 was the original "low precision" format used in deep learning, popularized by NVIDIA Pascal and Volta GPUs and the original mixed-precision training paper from Baidu and NVIDIA in 2017.[25] Both formats use 16 bits and have nominally the same memory and bandwidth costs, but they trade off range for precision in opposite directions. FP16 has 10 bits of mantissa (about 3.3 decimal digits) but only 5 bits of exponent (range about 6.1e-5 to 6.55e+4). bfloat16 has 7 bits of mantissa (about 2.3 decimal digits) but 8 bits of exponent (range about 1.18e-38 to 3.4e+38).[23] For training, the wider range of bfloat16 is usually decisive; FP16 retains an edge for inference of smaller models or tasks that need higher per-value precision.

### bfloat16 versus TF32

NVIDIA introduced TensorFloat-32 (TF32) with the Ampere A100 in 2020 as a compromise between FP32 and bfloat16.[10] TF32 uses 1 sign bit, 8 exponent bits (matching FP32 and bfloat16), and 10 mantissa bits (matching FP16). The total of 19 bits is stored inside a 32-bit register, so TF32 does not save memory; it saves only multiplier area inside the tensor cores.[4] TF32 is the default math mode for FP32 matmuls on Ampere and later NVIDIA GPUs, which is why upgrading from a Volta V100 to an A100 produces a measurable speedup on many "FP32" workloads without any code changes.[10] bfloat16 is more aggressive than TF32 in two respects: it reduces the mantissa further (7 bits instead of 10) and also saves memory (16 bits per value instead of 32), so it gives up more precision than TF32 in exchange for higher throughput and a smaller footprint.

### bfloat16 versus FP8

NVIDIA's Hopper H100 (2022) introduced FP8 in two flavors.[7] E4M3 has 4 exponent bits and 3 mantissa bits; it can encode values from about 1.95e-3 to 448 and is intended for forward activations and weights.[8] E5M2 has 5 exponent bits and 2 mantissa bits; it covers about 1.5e-5 to 5.7e+4 and is intended for backward gradients, where range matters more than precision.[8] FP8 halves the memory footprint of bfloat16 and roughly doubles the compute throughput on supported hardware, but it requires careful per-tensor scaling to avoid range failures. NVIDIA's Transformer Engine library and similar tools manage that scaling automatically.[7] As of 2024 to 2026, large language model pretraining typically still relies on bfloat16 for stability, with FP8 used selectively for forward-pass activations and gradient transmission, while inference is increasingly run end-to-end in FP8.
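The FP8 limits quoted above can be reproduced from the bit widths. This sketch assumes the common FP8 conventions: E5M2 reserves its top exponent code for infinities and NaNs as IEEE formats do, while E4M3 gives up infinities and reserves only the all-ones pattern for NaN. The lower bounds are the smallest subnormal values. The helper name `fp8_limits` is hypothetical.

```python
def fp8_limits(exp_bits: int, man_bits: int, ieee_like: bool) -> tuple[float, float]:
    """Return (smallest positive subnormal, largest finite value) for an FP8 variant."""
    bias = 2 ** (exp_bits - 1) - 1
    min_subnormal = 2.0 ** (1 - bias - man_bits)
    if ieee_like:
        # Top exponent code is reserved, so the largest usable code is one below it.
        max_exp = (2 ** exp_bits - 2) - bias
        max_significand = 2.0 - 2.0 ** -man_bits
    else:
        # Top exponent code is usable, except for the all-ones mantissa (NaN).
        max_exp = (2 ** exp_bits - 1) - bias
        max_significand = 2.0 - 2.0 ** -(man_bits - 1)
    return min_subnormal, max_significand * 2.0 ** max_exp

print(fp8_limits(4, 3, ieee_like=False))  # E4M3: (0.001953125, 448.0)
print(fp8_limits(5, 2, ieee_like=True))   # E5M2: (1.52587890625e-05, 57344.0)
```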

### bfloat16 versus MX formats

In September 2023, the Microscaling Formats (MX) Alliance, a group including AMD, ARM, Intel, Meta, Microsoft, NVIDIA, and Qualcomm, published version 1.0 of the OCP Microscaling Formats specification through the Open Compute Project.[12] The standard defines four narrow-precision formats (MXFP8, MXFP6, MXFP4, and MXINT8) that share a single 8-bit (E8M0) scaling factor across a block of 32 values.[11] The block scale acts as a supplementary exponent that effectively widens the dynamic range of any single value, so a 4-bit MXFP4 value can stay accurate even when the underlying values are small or large.[13] NVIDIA's Blackwell B200, AMD's MI355, and other 2025-era accelerators ship native MXFP4 and MXFP6 hardware. NVIDIA's NVFP4 variant uses a finer block size of 16 elements with an FP8 (E4M3) scale and a per-tensor FP32 second-level scale.[9] These formats live two notches below bfloat16 in the precision hierarchy and are aimed primarily at inference of trained models, where weights have settled into a tight distribution that responds well to block-wise quantization.[9] bfloat16 typically remains the format of record for the master weights, with MX formats produced as a downstream conversion.
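The shared-scale idea can be sketched for MXFP4. The code below is a simplified illustration rather than the normative OCP conversion procedure: it picks a power-of-two scale from the largest magnitude in a block, then snaps each scaled value to the nearest E2M1 magnitude, breaking ties toward the smaller value instead of to even. The names `quantize_block` and `dequantize_block` are hypothetical.

```python
import math

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # magnitudes a 4-bit E2M1 element can hold
E2M1_EMAX = 2                                          # exponent of the largest power of two (4.0)

def quantize_block(values: list[float]) -> tuple[int, list[float]]:
    """Return (shared exponent, E2M1 element values) for one block."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 0, [0.0] * len(values)
    floor_log2 = math.frexp(amax)[1] - 1               # exact floor(log2(amax))
    shared_exp = max(-127, min(127, floor_log2 - E2M1_EMAX))  # E8M0 scale range
    scale = 2.0 ** shared_exp
    elements = []
    for v in values:
        magnitude = min(abs(v) / scale, E2M1_GRID[-1])  # clamp to the largest element
        nearest = min(E2M1_GRID, key=lambda g: abs(g - magnitude))
        elements.append(math.copysign(nearest, v))
    return shared_exp, elements

def dequantize_block(shared_exp: int, elements: list[float]) -> list[float]:
    return [e * 2.0 ** shared_exp for e in elements]

# A block of 32 very small values that a bare 4-bit float could never represent.
block = [g * 2.0 ** -20 for g in E2M1_GRID] * 4
shared_exp, elements = quantize_block(block)
print(shared_exp)                                       # -20
print(dequantize_block(shared_exp, elements) == block)  # True

# Storage cost: 4 bits per element plus one 8-bit scale shared by 32 elements.
bits_per_value = 4 + 8 / 32
print(bits_per_value)                                   # 4.25
```

The example block was chosen so that every value lands exactly on the grid; real weight blocks incur rounding error that depends on how tightly the values cluster around the block maximum.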

## Which hardware supports bfloat16?

bfloat16 hardware support spread quickly from Google's first deployment to nearly the entire industry between 2018 and 2022. The table below summarizes the major adoption milestones.

| Year | Hardware | Vendor | Notes |
|---|---|---|---|
| 2017-2018 | TPU v2 | Google | First production hardware to use bfloat16 in the matrix multiplication unit; FP32 accumulation.[1] |
| 2018 | TPU v3 | Google | Doubled the number of MXUs per chip; bfloat16 throughput rose to 123 TFLOPS per chip.[28] |
| 2020 | Cooper Lake (Xeon Scalable 3rd gen) | Intel | First x86 CPU with AVX-512 BF16 instructions; introduced VDPBF16PS and VCVTNE2PS2BF16.[20] |
| 2020 | A100 | NVIDIA | Third-generation Tensor Cores added native bfloat16; 312 TFLOPS dense, 624 TFLOPS with structured sparsity.[6] |
| 2020 | [TPU v4](/wiki/google_tpu_v4) | Google | bfloat16 throughput grew to about 275 TFLOPS per chip with much larger MXU arrays.[29] |
| 2020 | MI100 | AMD | First AMD CDNA accelerator with native bfloat16 matrix engines.[30] |
| 2021 | Sapphire Rapids (preview) | Intel | Added AMX (Advanced Matrix Extensions) with bfloat16 tile multiplication. |
| 2021 | Neoverse V1, N2 | ARM | Added BFloat16 instructions to the ARMv8.6-A instruction set; began rolling out across server-class ARM cores.[21] |
| 2022 | H100 (Hopper) | NVIDIA | bfloat16 reached about 989 TFLOPS dense / 1979 with sparsity, in addition to FP8 support.[7] |
| 2022 | MI250 / MI250X | AMD | bfloat16 throughput of 383 TFLOPS per GPU; widely used in Frontier and other supercomputers. |
| 2022 | Apple M2 | Apple | First Apple Silicon to support ARM bfloat16 instructions; the M2 implements ARMv8.6-A.[31] |
| 2023 | MI300X / MI300A | AMD | bfloat16 throughput of 1307 TFLOPS per GPU; first APU with combined CPU and GPU bfloat16 support.[19] |
| 2024 | B200 (Blackwell) | NVIDIA | bfloat16 throughput about 2.25 PFLOPS per chip; native FP6/FP4 added alongside.[32] |
| 2025 | TPU v6e "Trillium" / TPU v7 "Ironwood" | Google | Continued scaling of bfloat16 throughput per chip; bfloat16 remains the default training datatype.[33] |

Among the most recent generations, Google's TPU v6e ([Trillium](/wiki/google_trillium)) lists a peak of 918 bfloat16 TFLOPS per chip,[33] while the seventh-generation TPU v7 ([Ironwood](/wiki/tpu_ironwood)), announced in April 2025, is specified at 4,614 TFLOPS per chip at FP8 precision with 192 GB of HBM3E and scales to pods of 9,216 chips.[34]

Over this 2017-to-2025 arc, bfloat16 has gone from a TPU-only curiosity to the lowest common denominator that every serious AI accelerator now supports. Software written against bfloat16 today generally runs with little or no modification on TPUs, NVIDIA GPUs, AMD GPUs, x86 CPUs, ARM CPUs, and Apple Silicon, which has made it the natural format for portable AI workloads.

## How is bfloat16 used in training?

The most common modern training recipe uses bfloat16 as the working precision for forward and backward passes, with FP32 used selectively for the parameters that benefit most from extra precision. This is called *mixed precision training*; see [mixed precision training](/wiki/mixed_precision_training).

The canonical pattern is:

1. Maintain a *master copy* of every learnable parameter in FP32. The optimizer (Adam, SGD, AdamW, etc.) updates these master weights, because the small additive updates near 1e-7 would not survive bfloat16 rounding.[25]
2. Cast the master weights to bfloat16 at the start of each forward pass. This is the version that gets multiplied against the activations.
3. Compute activations in bfloat16. Convolutions, matrix multiplies, and most pointwise operations run natively in 16 bits.
4. Compute gradients in bfloat16 during backpropagation.
5. Cast gradients back up to FP32 and apply them to the master weights.[14]
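The five steps above can be traced on a single scalar weight. The sketch below is a toy, not a framework recipe: the loss `0.5 * (w * x - y)^2`, the learning rate, and the helper names are all illustrative assumptions, and rounding is emulated in software. It also runs the same updates on a weight that has no FP32 master copy, to show why step 1 matters.

```python
import struct

def round_fp32(x: float) -> float:
    """Emulate storing a value in FP32."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def round_bf16(x: float) -> float:
    """Emulate storing a finite value in bfloat16 (round to nearest, ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = ((bits + 0x7FFF + ((bits >> 16) & 1)) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]

x, y, lr = 1.0, 2.0, 1e-3   # toy input, target, and learning rate
master_w = 1.0              # step 1: FP32 master copy
bf16_only_w = 1.0           # contrast case: weight stored only in bfloat16

for _ in range(100):
    w16 = round_bf16(master_w)                       # step 2: cast weights to bfloat16
    grad16 = round_bf16((w16 * x - y) * x)           # steps 3-4: forward and backward in bfloat16
    master_w = round_fp32(master_w - lr * grad16)    # step 5: apply the update in FP32

    g = round_bf16((bf16_only_w * x - y) * x)
    bf16_only_w = round_bf16(bf16_only_w - lr * g)   # update rounded straight back to bfloat16

print(master_w)      # has moved toward the target of 2.0
print(bf16_only_w)   # 1.0: every update was smaller than half a bfloat16 step and vanished
```

Near 1.0 the spacing between adjacent bfloat16 values is 2^-7 (about 0.0078), so an update of 0.001 rounds away entirely unless it is accumulated in a wider format.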

This pattern halves the memory required for activations (which dominate the memory budget during training) and roughly doubles the throughput of the heavy matmul kernels. Because bfloat16 has the same exponent range as FP32, the loss-scaling step that FP16 mixed precision requires is unnecessary.[1] That makes bfloat16 mixed precision substantially easier to deploy and debug than FP16 mixed precision, and it is the dominant training strategy for large language models in 2026. Meta, for example, reported that [Llama 3.1](/wiki/llama_3) 405B was pretrained in bfloat16 on up to 16,384 NVIDIA H100 GPUs at a model FLOPs utilization of 38 to 43 percent, with the released model later quantized to FP8 for production inference.[35]

### PyTorch

PyTorch exposes bfloat16 through `torch.bfloat16` as a first-class dtype and through the `torch.amp` (automatic mixed precision) package.[15] The standard recipe wraps the forward pass in `torch.autocast(device_type="cuda", dtype=torch.bfloat16)`.[16] Because gradients do not underflow, no `GradScaler` is needed, in contrast to the FP16 recipe.[15] PyTorch's documentation explicitly notes that the AMP framework is dtype-agnostic and that bfloat16 is preferred when the underlying hardware supports it.

### JAX

JAX treats `jax.numpy.bfloat16` as a native NumPy dtype (using a small Python extension to make this work outside the standard NumPy type system).[17][43] On TPU backends, the default matmul precision is bfloat16 with FP32 accumulation, which can be adjusted through the `precision` argument on `jnp.matmul`, `jnp.einsum`, and related operations. The DeepMind library JMP (JAX Mixed Precision) wraps the same idea with higher-level policy objects.[18]

### TensorFlow

TensorFlow exposes bfloat16 through `tf.bfloat16` and through the `tf.keras.mixed_precision` API. On Cloud TPU, switching on bfloat16 mixed precision is typically a one-line change to a model definition, and TPU-aware optimizers handle the FP32 master weights internally.[2]

## How is bfloat16 used in inference?

Inference uses precision a little differently from training. Because the weights and activations are no longer changing, and because numerical sensitivity tends to concentrate in only a few layers (for example, attention softmax outputs and layer norm statistics), inference can usually drop to a more aggressive precision than training. The precision choices most teams make at each stage, from training through deployment, are roughly:

| Step | Typical precision | Why |
|---|---|---|
| Master weight storage during training | FP32 | Optimizer updates need 7+ digits of precision. |
| Forward / backward pass during training | bfloat16 | Best balance of range, speed, and memory. |
| Gradient communication across nodes | bfloat16 or FP8 E5M2 | Compresses bandwidth without compromising convergence.[8] |
| Inference of a freshly trained checkpoint | bfloat16 | Drop-in conversion from training checkpoint. |
| Production inference for cost-sensitive deployment | FP8 (E4M3 weights, E4M3 or E5M2 activations) or INT8 | Halves memory and doubles throughput on H100 and later.[8] |
| Inference at the lowest cost / largest batch | MXFP4, NVFP4, or INT4 | Quarter the memory of bfloat16, requires per-block scaling.[9] |

For a model that has been pretrained in bfloat16, full bfloat16 inference is the safest deployment because it requires no conversion and produces bit-identical or near-identical outputs to the training environment. Quantization to lower precision (INT8, FP8, or MX formats) is a separate post-training step described in [quantization](/wiki/quantization). bfloat16 is the canonical "baseline" against which inference quantization is measured.
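The memory side of these choices is simple arithmetic. The sketch below estimates weight storage for a hypothetical 70-billion-parameter model; the parameter count is an arbitrary example, the MXFP4 figure assumes one 8-bit scale shared by every 32 values, and the estimate ignores activations, KV caches, and any tensors kept at higher precision.

```python
BITS_PER_VALUE = {
    "FP32": 32.0,
    "bfloat16": 16.0,
    "FP8": 8.0,
    "MXFP4": 4.0 + 8.0 / 32.0,   # 4-bit element plus an 8-bit scale per 32 values
}

def weight_gigabytes(num_params: float, fmt: str) -> float:
    """Approximate weight storage in gigabytes (10^9 bytes) for a given format."""
    return num_params * BITS_PER_VALUE[fmt] / 8.0 / 1e9

num_params = 70e9  # hypothetical model size
for fmt in BITS_PER_VALUE:
    print(f"{fmt:9s} {weight_gigabytes(num_params, fmt):8.1f} GB")
```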

## Relationship to ML Scaling Laws and the Precision Race

bfloat16's adoption in 2018 to 2020 coincided with a steep growth in model sizes and training compute. The scaling laws published by OpenAI in 2020[26] and refined by DeepMind's Chinchilla work in 2022[27] made it clear that the practical limit on frontier model size was set by the available compute and memory budget. In practice, compute spent on a narrower numerical type turned out to produce roughly the same loss reduction as compute spent on a wider one, so reducing precision from FP32 to bfloat16 effectively doubled the affordable model size and training duration for a given hardware budget. This pattern continued with each new precision. Moving from bfloat16 to FP8 in 2022 to 2023 doubled affordable size again,[8] and the move from FP8 to FP4 in 2024 to 2026 nominally doubled it once more.[9]

At each step the trade-off looks the same. The narrower format requires more careful per-tensor scaling, more careful selection of which operations stay at higher precision, and more risk of training instabilities, but on hardware that supports the format natively, the throughput gains are immediate. bfloat16 sits in a particularly comfortable place in this hierarchy: it is narrow enough to give a meaningful speedup over FP32 but wide enough to be used as a *drop-in* replacement without per-tensor scaling or other software intervention. That property has kept bfloat16 in service even as FP8 and FP4 have rolled out, because it remains the default working precision for the components that need a reliable wide range, including optimizer state and gradient reduction in many large-scale recipes.

### 2024-2026 developments

Recent research has begun to quantify exactly where bfloat16 sits on the precision-versus-compute frontier. The November 2024 preprint *Scaling Laws for Precision* by Kumar and colleagues fit [scaling laws](/wiki/scaling_laws) that treat numerical precision as a third axis alongside parameters and data, concluding that "the de-facto practice of training models in BF16 may be suboptimal" because their fits imply that compute-optimal pretraining precision is around 7 to 8 bits.[37]

Production practice has shifted around bfloat16 rather than away from it. [DeepSeek-V3](/wiki/deepseek_v3) (December 2024), the first widely publicized frontier-scale model pretrained largely in [FP8](/wiki/fp8), ran its compute-intensive matrix multiplications in FP8 with fine-grained scaling but stored the AdamW optimizer's first and second moments in bfloat16, while keeping master weights and accumulated gradients in FP32; this relaxed the older convention that optimizer state must remain entirely in 32-bit formats.[36] On the deployment side, OpenAI's open-weight [gpt-oss](/wiki/gpt_oss) models (August 2025) quantize the mixture-of-experts weights, which account for over 90 percent of parameters, to MXFP4 at about 4.25 bits per parameter while keeping all other tensors in bfloat16.[40]

bfloat16's dominance in post-training came under direct challenge in October 2025, when researchers at Sea AI Lab and the National University of Singapore reported that bfloat16's rounding error is a root cause of the training-inference mismatch that destabilizes reinforcement learning fine-tuning of language models, and showed that switching to FP16, whose extra mantissa bits keep the training and inference policies numerically consistent, eliminates the mismatch with only a few lines of code.[38] The authors framed the issue bluntly: "the widely adopted BF16 introduces large rounding errors that break the consistency between the training and inference policies," while FP16, with three more mantissa bits, restores it.[38] The result was quickly picked up by fine-tuning frameworks, whose documentation now discusses the FP16-versus-bfloat16 choice explicitly for reinforcement learning workloads.[39]

## Software Ecosystem

Most AI frameworks treat bfloat16 as a first-class type today. PyTorch, TensorFlow, and JAX all support `bfloat16` tensors on every backend that has hardware support. [Hugging Face](/wiki/hugging_face) Transformers exposes a `torch_dtype=torch.bfloat16` argument on every model loader, and most model weights distributed on the Hugging Face Hub for the past three years are saved in bfloat16. The ONNX Runtime, OpenAI Triton, NVIDIA cuDNN and TensorRT, Intel's oneDNN, Apple's Metal Performance Shaders, and AMD's ROCm libraries all accept bfloat16 inputs and produce bfloat16 outputs without conversion. NumPy itself does not include bfloat16 as a built-in dtype, but the JAX and ml_dtypes packages provide compatible Python-level extensions that allow it to be used in scientific code outside deep learning.[43]

On the model file format side, [Safetensors](/wiki/safetensors), the de facto serialization standard in 2026 for neural network weights, supports bfloat16 directly without re-encoding. Checkpoints saved in bfloat16 are typically about half the size they would be in FP32, because the weight tensors dominate the file size and each value takes 2 bytes instead of 4.

The format has also been standardized outside the deep learning toolchain. C++23 added `std::bfloat16_t` as an optional fixed-width floating-point type through proposal P1467R9, giving bfloat16 a portable name in mainstream systems programming.[41] RISC-V International has ratified version 1.0 of its BF16 instruction set extensions, comprising Zfbfmin (scalar conversions between BF16 and FP32), Zvfbfmin (vector conversions), and Zvfbfwma (a vector widening multiply-accumulate that feeds bfloat16 products into FP32 accumulators).[42]

## What are the limitations of bfloat16?

bfloat16 is not appropriate for every workload. Any computation that requires more than about 2 to 3 decimal digits of precision per value will encounter problems, including:

- Large reductions, especially summations of many small values, where the rounding error per addition accumulates and can swamp the result. This is why bfloat16 multiplications are typically accumulated into FP32 registers in hardware; the multiplier is bfloat16 but the accumulator is FP32.[1]
- Iterative refinement loops (such as the inner solve in some scientific simulations) that depend on small residuals shrinking below a tight tolerance.
- Optimizer states for Adam-style optimizers, which need to track a very small running mean of squared gradients. These have traditionally lived in FP32 even when the model weights are bfloat16, although some recent recipes, such as DeepSeek-V3, store the moments in bfloat16.[36]
- High-resolution rendering and most non-ML numerical workloads, where IEEE-compliant range and precision are both needed.
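The first limitation in the list, error accumulation in long sums, is easy to reproduce. The sketch below adds 1.0 a thousand times, once with an accumulator that is rounded to bfloat16 after every addition and once with a wider accumulator (a Python float standing in for the FP32 accumulators used in hardware). `round_bf16` is a hypothetical helper that emulates bfloat16 rounding in software.

```python
import struct

def round_bf16(x: float) -> float:
    """Emulate storing a finite value in bfloat16 (round to nearest, ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = ((bits + 0x7FFF + ((bits >> 16) & 1)) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]

acc_bf16 = 0.0   # accumulator rounded to bfloat16 after every addition
acc_wide = 0.0   # wide accumulator

for _ in range(1000):
    acc_bf16 = round_bf16(acc_bf16 + 1.0)
    acc_wide += 1.0

print(acc_bf16)  # 256.0
print(acc_wide)  # 1000.0
```

With 8 bits of significand, consecutive integers are representable only up to 256. Beyond that the spacing is 2, so 256 + 1 falls exactly halfway between 256 and 258 and rounds back to 256, and the sum stops growing.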

Some workloads also require bit-identical reproducibility across runs, which is harder to guarantee in bfloat16 than in FP32 because rounding errors interact with parallel reduction order. As the October 2025 FP16 research demonstrated, that same rounding behavior can quietly destabilize reinforcement learning fine-tuning, where training and inference must produce numerically consistent policies.[38] Frameworks generally provide higher-precision modes for such cases, at the cost of throughput.

## See Also

- [Tensor Processing Unit (TPU)](/wiki/tensor_processing_unit_tpu)
- [GPU](/wiki/gpu)
- [NVIDIA](/wiki/nvidia)
- [Mixed precision training](/wiki/mixed_precision_training)
- [Quantization](/wiki/quantization)
- [PyTorch](/wiki/pytorch)
- [JAX](/wiki/jax)

## References

1. Wang, S., and Kanwar, P. (2019). *BFloat16: The secret to high performance on Cloud TPUs.* Google Cloud Blog. https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus
2. Google Cloud Documentation. *Improve your model's performance with bfloat16.* https://cloud.google.com/tpu/docs/bfloat16
3. Wikipedia. *bfloat16 floating-point format.* https://en.wikipedia.org/wiki/Bfloat16_floating-point_format
4. Wikipedia. *TensorFloat-32.* https://en.wikipedia.org/wiki/TensorFloat-32
5. WikiChip. *Brain floating-point format (bfloat16).* https://en.wikichip.org/wiki/brain_floating-point_format
6. NVIDIA. (2020). *NVIDIA A100 Tensor Core GPU Architecture.* https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf
7. NVIDIA. (2022). *NVIDIA Hopper Architecture In-Depth.* NVIDIA Technical Blog. https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/
8. NVIDIA. *Floating-Point 8: An Introduction to Efficient, Lower-Precision AI Training.* https://developer.nvidia.com/blog/floating-point-8-an-introduction-to-efficient-lower-precision-ai-training/
9. NVIDIA. *Introducing NVFP4 for Efficient and Accurate Low-Precision Inference.* https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/
10. NVIDIA. *Accelerating AI Training with NVIDIA TF32 Tensor Cores.* https://developer.nvidia.com/blog/accelerating-ai-training-with-tf32-tensor-cores/
11. Open Compute Project. (2023). *OCP Microscaling Formats (MX) Specification v1.0.* https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf
12. Open Compute Project. *AMD, Arm, Intel, Meta, Microsoft, NVIDIA, and Qualcomm Standardize Next-Generation Narrow Precision Data Formats for AI.* https://www.opencompute.org/blog/amd-arm-intel-meta-microsoft-nvidia-and-qualcomm-standardize-next-generation-narrow-precision-data-formats-for-ai
13. Rouhani, B., et al. (2023). *Microscaling Data Formats for Deep Learning.* arXiv:2310.10537. https://arxiv.org/pdf/2310.10537
14. Kalamkar, D., et al. (2019). *A Study of BFLOAT16 for Deep Learning Training.* arXiv:1905.12322. https://arxiv.org/pdf/1905.12322
15. PyTorch Documentation. *Automatic Mixed Precision package - torch.amp.* https://docs.pytorch.org/docs/stable/amp.html
16. PyTorch Tutorials. *Automatic Mixed Precision recipe.* https://docs.pytorch.org/tutorials/recipes/recipes/amp_recipe.html
17. JAX Documentation. *Type promotion semantics.* https://docs.jax.dev/en/latest/type_promotion.html
18. DeepMind. *JMP: a Mixed Precision library for JAX.* https://github.com/google-deepmind/jmp
19. AMD. *AMD Instinct MI300 Series Accelerators.* https://www.amd.com/en/products/accelerators/instinct/mi300.html
20. WikiChip. *AVX-512 BFloat16 Instructions (BF16).* https://en.wikichip.org/wiki/x86/avx512_bf16
21. WikiChip Fuse. *Arm Updates Its Neoverse Roadmap: New BFloat16, SVE Support.* https://fuse.wikichip.org/news/4564/arm-updates-its-neoverse-roadmap-new-bfloat16-sve-support/
22. Higham, N. (2020). *What Is Bfloat16 Arithmetic?* https://nhigham.com/2020/06/02/what-is-bfloat16-arithmetic/
23. Cook, J. D. (2018). *bfloat16 (BF16) range and precision.* https://www.johndcook.com/blog/2018/11/15/bfloat16/
24. Jouppi, N. P., Yoon, D. H., Kurian, G., Li, S., Patil, N., Laudon, J., Young, C., and Patterson, D. (2020). *A Domain-Specific Supercomputer for Training Deep Neural Networks.* Communications of the ACM, 63(7), 67-78. https://dl.acm.org/doi/10.1145/3360307
25. Micikevicius, P., et al. (2017). *Mixed Precision Training.* arXiv:1710.03740. https://arxiv.org/abs/1710.03740
26. Kaplan, J., McCandlish, S., et al. (2020). *Scaling Laws for Neural Language Models.* arXiv:2001.08361. https://arxiv.org/abs/2001.08361
27. Hoffmann, J., Borgeaud, S., et al. (2022). *Training Compute-Optimal Large Language Models.* arXiv:2203.15556. https://arxiv.org/abs/2203.15556
28. Google Cloud Documentation. *TPU v3.* https://cloud.google.com/tpu/docs/v3
29. Jouppi, N. P., et al. (2023). *TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings.* arXiv:2304.01433. https://arxiv.org/abs/2304.01433
30. HPCwire. (November 16, 2020). *AMD Courts HPC with 11.5 Teraflops Instinct MI100 GPU.* https://www.hpcwire.com/2020/11/16/amd-courts-hpc-with-11-5-teraflops-instinct-gpu/
31. The Eclectic Light Company. (January 15, 2024). *Why the M2 is more advanced that it seemed.* https://eclecticlight.co/2024/01/15/why-the-m2-is-more-advanced-that-it-seemed/
32. NVIDIA. *NVIDIA HGX Platform.* https://www.nvidia.com/en-us/data-center/hgx/
33. Google Cloud Documentation. *TPU v6e.* https://cloud.google.com/tpu/docs/v6e
34. Google. (April 9, 2025). *Ironwood: The first Google TPU for the age of inference.* https://blog.google/products/google-cloud/ironwood-tpu-age-of-inference/
35. Meta Llama Team. (2024). *The Llama 3 Herd of Models.* arXiv:2407.21783. https://arxiv.org/abs/2407.21783
36. DeepSeek-AI. (2024). *DeepSeek-V3 Technical Report.* arXiv:2412.19437. https://arxiv.org/abs/2412.19437
37. Kumar, T., Ankner, Z., et al. (2024). *Scaling Laws for Precision.* arXiv:2411.04330. https://arxiv.org/abs/2411.04330
38. Qi, P., Liu, Z., Zhou, X., Pang, T., Du, C., Lee, W. S., and Lin, M. (2025). *Defeating the Training-Inference Mismatch via FP16.* arXiv:2510.26788. https://arxiv.org/abs/2510.26788
39. Unsloth Documentation. *FP16 vs BF16 for RL.* https://docs.unsloth.ai/get-started/reinforcement-learning-rl-guide/fp16-vs-bf16-for-rl
40. OpenAI. (2025). *gpt-oss-120b & gpt-oss-20b Model Card.* arXiv:2508.10925. https://arxiv.org/abs/2508.10925
41. ISO/IEC JTC1/SC22/WG21. (2022). *P1467R9: Extended floating-point types and standard names.* https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2022/p1467r9.html
42. RISC-V International. *BF16 Extensions for BFloat16-precision Floating-Point, Version 1.0.* RISC-V Ratified Specifications Library. https://docs.riscv.org/reference/isa/unpriv/bfloat16.html
43. ml_dtypes contributors. *ml_dtypes: NumPy dtype extensions for machine learning.* GitHub. https://github.com/jax-ml/ml_dtypes