bfloat16 (short for Brain Floating Point Format, sometimes written BF16) is a 16-bit floating-point number representation developed by Google Brain and first deployed in Tensor Processing Units (TPUs) for accelerating neural network workloads. The format is built from one sign bit, eight exponent bits, and seven mantissa bits. That layout sacrifices a great deal of decimal precision compared with the IEEE 754 half-precision format (FP16) and instead preserves the full numerical range of single precision (FP32). For machine learning, where weights, activations, and gradients can span many orders of magnitude but rarely need more than two or three significant decimal digits, that trade is overwhelmingly favorable, and bfloat16 has become the default training and inference format for most large neural networks built since 2020.
The format is now natively supported across nearly every modern AI accelerator, including Google TPUs from v2 onward, NVIDIA Ampere, Hopper, and Blackwell GPUs, AMD Instinct MI200/MI300 series, Intel Cooper Lake and Sapphire Rapids CPUs, ARM Neoverse cores, and Apple Silicon from the M2 generation onward. It is also the standard "low precision" dtype in the major deep learning frameworks, including PyTorch, TensorFlow, and JAX, where it is exposed through automatic mixed precision APIs and is treated as a first-class numpy-compatible dtype on TPU backends.
A bfloat16 value occupies 16 bits of memory and uses the same encoding scheme as IEEE 754 binary floating-point. One bit holds the sign, eight bits encode a biased exponent with bias 127, and seven bits encode the explicit fraction (with an implicit leading 1 for normal numbers, giving 8 bits of effective significand precision). The structural choice is to take a standard IEEE 754 single-precision number and simply truncate the lower 16 bits of the mantissa. That property makes conversion between FP32 and bfloat16 trivial: a hardware unit can drop or extend the trailing 16 bits with a single shift, and rounding-to-nearest-even can be implemented with a tiny adder.
| Bit index (MSB to LSB) | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Field | S | E | E | E | E | E | E | E | E | M | M | M | M | M | M | M |
| Meaning | Sign | exp7 | exp6 | exp5 | exp4 | exp3 | exp2 | exp1 | exp0 | m6 | m5 | m4 | m3 | m2 | m1 | m0 |
The encoded value of a normal bfloat16 number is (-1)^S * 1.M * 2^(E - 127), where M is the 7-bit fraction interpreted as a binary expansion after the implicit 1. Subnormal numbers, infinities, signaling NaNs, and quiet NaNs follow the same conventions as IEEE 754 single precision. The minimum positive normal value is 2^-126 (about 1.18e-38), and the maximum finite magnitude is approximately 3.39e38, marginally below FP32's 3.40e38 only because of the shorter mantissa.
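As an illustration of how cheap that conversion is, the following NumPy sketch performs the truncate-and-round step described above by viewing FP32 values as raw 32-bit integers. It is illustrative only: NaN special-casing, which real hardware and libraries such as ml_dtypes handle, is omitted for brevity.

```python
import numpy as np

def float32_to_bfloat16_bits(x):
    """Round FP32 values to bfloat16 bit patterns with round-to-nearest-even.

    Illustrative sketch: NaN payloads are not special-cased here.
    """
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    lsb = (bits >> 16) & 1            # lowest bit that survives truncation
    rounded = bits + 0x7FFF + lsb     # classic round-to-nearest-even bias
    return (rounded >> 16).astype(np.uint16)

def bfloat16_bits_to_float32(b):
    """Widen bfloat16 bit patterns back to FP32 by appending 16 zero bits."""
    return (np.asarray(b, dtype=np.uint32) << 16).view(np.float32)

x = np.array([3.14159265, 1.0e-7, 3.0e38], dtype=np.float32)
print(bfloat16_bits_to_float32(float32_to_bfloat16_bits(x)))
# e.g. pi rounds to 3.140625, the nearest value with an 8-bit significand
```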
FP16 (the IEEE 754 binary16 format, also called "half precision") uses one sign bit, five exponent bits, and ten mantissa bits. The two formats spend the same total of 16 bits very differently. FP16 buys precision (about 3.3 decimal digits) at the cost of a narrow dynamic range (roughly 6.1e-5 to 6.55e4). bfloat16 buys range (1.18e-38 to 3.39e38, essentially the same as FP32) at the cost of precision (about 2.3 decimal digits, or roughly 8 bits of significand). The exponent difference is what matters most for machine learning: the cost of FP16 is constant overflow risk on activations and constant underflow risk on small gradients, both of which require explicit loss scaling to manage during training. bfloat16 inherits FP32's range and avoids both problems. The cost is that individual rounding errors are larger by a factor of 8 (2^3) than in FP16, which is acceptable for stochastic gradient updates that are noisy by nature.
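The trade is easy to see by round-tripping a few values through each format. A quick PyTorch illustration (the specific constants are just examples; both dtypes work on CPU in any recent PyTorch build):

```python
import torch

# Overflow: 70000 exceeds FP16's largest finite value (65504) but is far
# below bfloat16's maximum (~3.39e38).
print(torch.tensor(70000.0, dtype=torch.float16))   # inf
print(torch.tensor(70000.0, dtype=torch.bfloat16))  # 70144 (rounded)

# Underflow: 1e-8 is below FP16's smallest subnormal (~6e-8) and flushes to
# zero, but sits comfortably inside bfloat16's normal range.
print(torch.tensor(1e-8, dtype=torch.float16))      # 0.
print(torch.tensor(1e-8, dtype=torch.bfloat16))     # ~1.0012e-08

# Precision: bfloat16 rounds more coarsely near 1.0 because it has three
# fewer mantissa bits (steps of 2^-7 instead of 2^-10).
print(torch.tensor(1.001, dtype=torch.float16))     # 1.0010
print(torch.tensor(1.001, dtype=torch.bfloat16))    # 1.0
```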
bfloat16 was conceived inside Google Brain during the design of the second-generation Tensor Processing Unit. The first TPU (TPU v1, 2015) was an inference-only chip that operated on 8-bit integers. When Google's hardware team began designing TPU v2 in 2016 and 2017, the goal was to support training as well as inference, and that meant supporting floating-point arithmetic. Norman Jouppi and the TPU architecture team observed, based on years of internal experiments, that neural networks were far more sensitive to the range of representable values than to the precision of any individual value. Underflows, overflows, and NaNs broke training runs; an extra bit or two of mantissa precision rarely made a measurable accuracy difference. They responded by stripping FP32 down to its top 16 bits, calling the result "Brain Floating Point."
The format first shipped in TPU v2 in 2017 and was made publicly available through Cloud TPU in 2018. Shibo Wang and Pankaj Kanwar formally documented it in the 2019 Google Cloud whitepaper BFloat16: The secret to high performance on Cloud TPUs. The paper argued that the silicon area of a floating-point multiplier scales roughly with the square of the mantissa width, so cutting the mantissa from FP32's 23 bits to bfloat16's 7 makes each multiplier dramatically smaller: in practice a bfloat16 multiplier is roughly half the area of an FP16 multiplier and about one-eighth the area of an FP32 multiplier, which lets a TPU chip pack many more compute lanes into the same die budget. The matrix multiplication unit (MXU) of a TPU v2 or v3 chip is a 128 by 128 systolic array of bfloat16 multipliers feeding FP32 accumulators, an arrangement that captures most of the speed of pure 16-bit arithmetic while preserving FP32 stability for the long sums of products that compose a matrix multiply.
Floating-point formats encode three things: a sign, a magnitude (exponent), and a fractional precision (mantissa). For neural networks, each plays a distinct role. The exponent governs whether the value is even representable; the mantissa governs how accurately a representable value can be expressed. Both matter for accuracy, but they do not matter equally.
Weights in a trained network are typically in the range of about 1e-3 to 1e+1 in absolute value. Activations, after batch normalization or layer normalization, are usually in a similar range. Gradients, however, can be much smaller, often near 1e-6 or 1e-7 deep inside large language models, and can occasionally spike to 1e+3 or larger during training instabilities. The total dynamic range required for stable training is therefore at least 10 orders of magnitude. FP32 supports about 76 orders of magnitude, FP16 supports about 9 orders, and bfloat16 supports about 76 orders. FP16 is right at the edge of being usable for training; bfloat16 has comfortable headroom in both directions.
The trade-off shows up most clearly in three behaviors. First, a model trained in pure FP16 typically requires loss scaling, in which the loss value is multiplied by a constant before backpropagation and the resulting gradients are divided by the same constant before being applied. This pushes small gradient magnitudes up into the FP16 normal range. bfloat16 does not need loss scaling at all, because gradients near 1e-7 are still well inside its representable range. Second, FP16 can encode small differences in weight values more precisely, which can matter for inference of an already-trained model where the weight distribution is very compact. bfloat16 will round more aggressively in those situations and can introduce visible accuracy loss on tasks like high-precision regression. Third, bfloat16 is generally a drop-in replacement for FP32 in training: weights initialized in FP32 truncate cleanly to bfloat16, and a model can be moved between the two formats without numerical translation problems. FP16 requires more careful handling.
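To make the loss-scaling mechanics concrete, here is a small self-contained PyTorch sketch of the static scaling arithmetic that FP16 training needs and bfloat16 does not. The model, data, and scale constant are illustrative only, and the example runs in FP32 on CPU for simplicity; production FP16 recipes use dynamic scaling (for example torch.cuda.amp.GradScaler).

```python
import torch

model = torch.nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
x, y = torch.randn(8, 4), torch.randn(8, 1)

# FP16-style recipe: scale the loss up so small gradients stay representable,
# then divide the gradients back down before the optimizer step.
SCALE = 2.0 ** 12  # hypothetical constant
loss = torch.nn.functional.mse_loss(model(x), y)
(loss * SCALE).backward()
for p in model.parameters():
    p.grad /= SCALE
opt.step()

# bfloat16-style recipe: no scaling step, because gradients near 1e-7 are
# still normal bfloat16 values.
opt.zero_grad()
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()
opt.step()
```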
Deep learning hardware has steadily added support for narrower number formats since 2017. Each format trades precision and range against memory, bandwidth, and arithmetic throughput. The table below summarizes the formats most commonly used in modern AI training and inference.
| Format | Total bits | Sign | Exponent | Mantissa | Approx. range | Approx. decimal precision | First major hardware |
|---|---|---|---|---|---|---|---|
| FP64 (double) | 64 | 1 | 11 | 52 | 2.2e-308 to 1.8e+308 | 15-17 digits | All general-purpose CPUs and GPUs |
| FP32 (single) | 32 | 1 | 8 | 23 | 1.18e-38 to 3.4e+38 | 7 digits | Universal |
| TF32 (TensorFloat-32) | 19 (stored in 32) | 1 | 8 | 10 | 1.18e-38 to 3.4e+38 | 3-4 digits | NVIDIA Ampere A100 (2020) |
| FP16 (half) | 16 | 1 | 5 | 10 | 6.1e-5 to 6.55e+4 | 3-4 digits | NVIDIA Pascal P100 (2016) |
| bfloat16 (BF16) | 16 | 1 | 8 | 7 | 1.18e-38 to 3.4e+38 | 2-3 digits | Google TPU v2 (2017-2018) |
| FP8 E4M3 | 8 | 1 | 4 | 3 | ~1.95e-3 to ~448 | ~1.5 digits | NVIDIA Hopper H100 (2022) |
| FP8 E5M2 | 8 | 1 | 5 | 2 | ~1.5e-5 to ~5.7e+4 | ~1 digit | NVIDIA Hopper H100 (2022) |
| MXFP6 E3M2 | 6 plus shared scale | 1 | 3 | 2 | block-scaled | ~1 digit | NVIDIA Blackwell, AMD MI355 (2024-2025) |
| MXFP4 / NVFP4 E2M1 | 4 plus shared scale | 1 | 2 | 1 | block-scaled | <1 digit | NVIDIA Blackwell, AMD MI355 (2024-2025) |
The most important comparisons are summarized below.
FP32 is the historical baseline for both training and inference. It uses 32 bits per value, has 23 bits of mantissa, and represents about seven decimal digits with a range from roughly 1.18e-38 to 3.4e+38. bfloat16 keeps the same 8-bit exponent as FP32 and therefore the same range, but truncates the mantissa to seven bits. The result is half the memory footprint, roughly half the memory bandwidth requirement, and proportional improvements in cache efficiency. For nearly every deep learning workload, the accuracy loss is invisible at the model output, so bfloat16 has displaced FP32 as the default training format.
FP16 was the original "low precision" format used in deep learning, popularized by NVIDIA Pascal and Volta GPUs and the original mixed-precision training paper from Baidu and NVIDIA in 2017. Both formats use 16 bits and have nominally the same memory and bandwidth costs, but they trade off range for precision in opposite directions. FP16 has 10 bits of mantissa (about 3.3 decimal digits) but only 5 bits of exponent (range about 6.1e-5 to 6.55e+4). bfloat16 has 7 bits of mantissa (about 2.3 decimal digits) but 8 bits of exponent (range about 1.18e-38 to 3.4e+38). For training, the wider range of bfloat16 is usually decisive; FP16 retains an edge for inference of smaller models or tasks that need higher per-value precision.
NVIDIA introduced TensorFloat-32 (TF32) with the Ampere A100 in 2020 as a compromise between FP32 and bfloat16. TF32 uses 1 sign bit, 8 exponent bits (matching FP32 and bfloat16), and 10 mantissa bits (matching FP16). The total of 19 bits is stored inside a 32-bit register, so TF32 does not save memory; it saves only multiplier area inside the tensor cores. TF32 is the default math mode for FP32 matmuls on Ampere and later NVIDIA GPUs, which is why upgrading from a Volta V100 to an A100 produces a measurable speedup on many "FP32" workloads without any code changes. bfloat16 is more aggressive than TF32 in both directions: it reduces the mantissa further (7 bits instead of 10) and also saves memory (16 bits per value instead of 32), so it sits below TF32 in the precision-versus-throughput trade-off.
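In PyTorch, the TF32 behavior described above is controlled by existing backend flags (plus a newer one-line equivalent), which is worth knowing when benchmarking "FP32" runs against bfloat16:

```python
import torch

# Allow "FP32" matmuls and convolutions to execute as TF32 in the tensor
# cores on Ampere-or-newer GPUs.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# Equivalent higher-level knob: "highest" keeps true FP32, "high" permits TF32.
torch.set_float32_matmul_precision("high")
```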
NVIDIA's Hopper H100 (2022) introduced FP8 in two flavors. E4M3 has 4 exponent bits and 3 mantissa bits; it can encode values from about 1.95e-3 to 448 and is intended for forward activations and weights. E5M2 has 5 exponent bits and 2 mantissa bits; it covers about 1.5e-5 to 5.7e+4 and is intended for backward gradients, where range matters more than precision. FP8 halves the memory footprint of bfloat16 and roughly doubles the compute throughput on supported hardware, but it requires careful per-tensor scaling to avoid range failures. NVIDIA's Transformer Engine library and similar tools manage that scaling automatically. As of 2024 to 2026, large language model pretraining typically still relies on bfloat16 for stability, with FP8 used selectively for forward-pass activations and gradient transmission, while inference is increasingly run end-to-end in FP8.
In September 2023, the Microscaling Formats (MX) Alliance, a group including AMD, ARM, Intel, Meta, Microsoft, NVIDIA, and Qualcomm, published version 1.0 of the OCP Microscaling Formats specification through the Open Compute Project. The standard defines four narrow-precision formats (MXFP8, MXFP6, MXFP4, and MXINT8) that share a single 8-bit (E8M0) scaling factor across a block of 32 values. The block scale acts as a supplementary exponent that effectively widens the dynamic range of any single value, so a 4-bit MXFP4 value can stay accurate even when the underlying values are small or large. NVIDIA's Blackwell B200, AMD's MI355, and other 2025-era accelerators ship native MXFP4 and MXFP6 hardware. NVIDIA's NVFP4 variant uses a finer block size of 16 elements with an FP8 (E4M3) scale and a per-tensor FP32 second-level scale. These formats live two notches below bfloat16 in the precision hierarchy and are aimed primarily at inference of trained models, where weights have settled into a tight distribution that responds well to block-wise quantization. bfloat16 typically remains the format of record for the master weights, with MX formats produced as a downstream conversion.
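A rough sense of how block scaling works can be had from the sketch below, which shares one power-of-two scale across each block of 32 values and snaps each element to the FP4 E2M1 magnitude grid. This is a conceptual illustration only, not the exact OCP MXFP4 encoding, which stores the scale as an E8M0 byte and defines its own rounding and clamping rules.

```python
import numpy as np

# Representable magnitudes of an FP4 E2M1 element: 0, 0.5, 1, 1.5, 2, 3, 4, 6.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mx_style_quantize(x, block=32):
    """Conceptual MX-style block quantization (not the exact OCP encoding)."""
    x = np.asarray(x, dtype=np.float32)
    out = np.empty_like(x)
    for start in range(0, x.size, block):
        blk = x[start:start + block]
        amax = float(np.max(np.abs(blk)))
        # Shared power-of-two scale chosen so the block's largest magnitude
        # lands near FP4's maximum element value (6).
        scale = 2.0 ** np.floor(np.log2(amax / 6.0)) if amax > 0 else 1.0
        scaled = np.abs(blk) / scale
        idx = np.argmin(np.abs(scaled[:, None] - FP4_GRID[None, :]), axis=1)
        out[start:start + block] = np.sign(blk) * FP4_GRID[idx] * scale
    return out

w = (np.random.randn(64) * 0.02).astype(np.float32)
print(np.max(np.abs(w - mx_style_quantize(w))))  # small, block-relative error
```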
bfloat16 hardware support spread quickly from Google's first deployment to nearly the entire industry between 2018 and 2022. The table below summarizes the major adoption milestones.
| Year | Hardware | Vendor | Notes |
|---|---|---|---|
| 2017-2018 | TPU v2 | Google | First production hardware to use bfloat16 in the matrix multiplication unit; FP32 accumulation. |
| 2018 | TPU v3 | Google | Doubled the number of MXUs per chip; bfloat16 throughput rose to 420 TFLOPS per four-chip Cloud TPU device. |
| 2020 | Cooper Lake (Xeon Scalable 3rd gen) | Intel | First x86 CPU with AVX-512 BF16 instructions; introduced VDPBF16PS and VCVTNE2PS2BF16. |
| 2020 | A100 | NVIDIA | Third-generation Tensor Cores added native bfloat16; 312 TFLOPS dense, 624 TFLOPS with structured sparsity. |
| 2020 | TPU v4 | Google | bfloat16 throughput grew to about 275 TFLOPS per chip with expanded MXU capacity. |
| 2020 | MI100 | AMD | First AMD CDNA accelerator with native bfloat16 matrix engines. |
| 2021 | Sapphire Rapids (preview) | Intel | Adds AMX (Advanced Matrix Extensions) with bfloat16 tile multiplication. |
| 2021 | Neoverse V1, N2 | ARM | First Neoverse cores to implement the BFloat16 instructions introduced with ARMv8.6-A; support then rolled out across server-class ARM designs. |
| 2022 | H100 (Hopper) | NVIDIA | bfloat16 reaches about 989 TFLOPS dense / 1979 with sparsity in addition to FP8 support. |
| 2022 | MI250 / MI250X | AMD | bfloat16 throughput of 383 TFLOPS per GPU; widely used in Frontier and other supercomputers. |
| 2022 | M2 | Apple | First Apple Silicon with the ARMv8.6-A BFloat16 instructions; carried forward in M3 and later generations. |
| 2023 | MI300X / MI300A | AMD | MI300X delivers about 1307 TFLOPS of bfloat16 per GPU; MI300A is the first data-center APU with combined CPU and GPU bfloat16 support. |
| 2024 | B200 (Blackwell) | NVIDIA | bfloat16 throughput about 2.25 PFLOPS per chip; native FP6/FP4 added alongside. |
| 2024-2025 | TPU v6e "Trillium" / TPU v7 "Ironwood" | Google | Continued scaling of bfloat16 throughput per chip; bfloat16 remains the default training datatype. |
Over this 2018-to-2025 arc, bfloat16 went from a TPU-only curiosity to the lowest common denominator that every serious AI accelerator now supports. Software written against bfloat16 today runs without modification on TPUs, NVIDIA GPUs, AMD GPUs, x86 CPUs, ARM CPUs, and Apple Silicon, which has made it the natural format for portable AI workloads.
The most common modern training recipe uses bfloat16 as the working precision for forward and backward passes, with FP32 used selectively for the parameters that benefit most from extra precision. This is called mixed precision training; see mixed_precision_training.
The canonical pattern is:

1. Keep a master copy of the parameters in FP32 inside the optimizer.
2. Cast the parameters to bfloat16 and run the forward pass in bfloat16.
3. Run the backward pass in bfloat16, producing bfloat16 gradients.
4. Reduce and accumulate the gradients, then apply the optimizer update to the FP32 master weights.
This pattern halves the memory required for activations (which dominate the memory budget during training) and roughly doubles the throughput of the heavy matmul kernels. Because bfloat16 has the same exponent range as FP32, the loss-scaling step that FP16 mixed precision requires is unnecessary. That makes bfloat16 mixed precision substantially easier to deploy and debug than FP16 mixed precision, and it is the dominant training strategy for large language models in 2026.
PyTorch exposes bfloat16 through torch.bfloat16 as a first-class dtype and through the torch.amp (automatic mixed precision) package. The standard recipe wraps the forward pass in torch.autocast(device_type="cuda", dtype=torch.bfloat16). Because gradients do not underflow, no GradScaler is needed, in contrast to the FP16 recipe. PyTorch's documentation explicitly notes that the AMP framework is dtype-agnostic and that bfloat16 is preferred when the underlying hardware supports it.
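A minimal end-to-end training step with this recipe might look like the sketch below. It assumes a CUDA-capable build (on CPU the same torch.autocast call takes device_type="cpu"), and the model, optimizer, and tensor shapes are placeholders.

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()          # parameters stay in FP32
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(32, 1024, device="cuda")
target = torch.randn(32, 1024, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    out = model(x)                                   # matmuls run in bfloat16
    loss = torch.nn.functional.mse_loss(out, target)

loss.backward()                                      # no GradScaler needed
opt.step()
opt.zero_grad()
```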
JAX treats jax.numpy.bfloat16 as a native numpy dtype (using a small Python extension to make this work outside the standard numpy type system). On TPU backends, the default matmul precision is bfloat16 with FP32 accumulation, which can be adjusted through the precision argument on jnp.matmul, jnp.einsum, and related operations. The DeepMind library JMP (JAX Mixed Precision) wraps the same idea with higher-level policy objects.
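A small JAX example of the same idea; the precision argument shown here is the standard jax.lax.Precision enum, and the arrays stay bfloat16 end to end:

```python
import jax
import jax.numpy as jnp

x = jnp.ones((128, 128), dtype=jnp.bfloat16)
w = jnp.ones((128, 128), dtype=jnp.bfloat16)

# Default matmul precision versus an explicitly more careful pass.
y_fast = jnp.matmul(x, w)
y_careful = jnp.matmul(x, w, precision=jax.lax.Precision.HIGHEST)
print(y_fast.dtype, y_careful.dtype)  # both bfloat16
```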
TensorFlow exposes bfloat16 through tf.bfloat16 and through the tf.keras.mixed_precision API. On Cloud TPU, switching on bfloat16 mixed precision is typically a one-line change to a model definition, and TPU-aware optimizers handle the FP32 master weights internally.
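For reference, the one-line switch mentioned above uses the standard Keras policy name:

```python
import tensorflow as tf

# Keras mixed precision policy: compute in bfloat16, keep variables in FP32.
tf.keras.mixed_precision.set_global_policy("mixed_bfloat16")
print(tf.keras.mixed_precision.global_policy())  # <Policy "mixed_bfloat16">
```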
Inference uses precision a little differently from training. Because the weights and activations are no longer changing, and because numerical sensitivity tends to concentrate in only a few layers (for example, attention softmax outputs and layer norm statistics), inference can usually drop to a more aggressive precision than training. The decision tree most teams use is roughly:
| Step | Typical precision | Why |
|---|---|---|
| Master weight storage during training | FP32 | Optimizer updates need 7+ digits of precision. |
| Forward / backward pass during training | bfloat16 | Best balance of range, speed, and memory. |
| Gradient communication across nodes | bfloat16 or FP8 E5M2 | Compresses bandwidth without compromising convergence. |
| Inference of a freshly trained checkpoint | bfloat16 | Drop-in conversion from training checkpoint. |
| Production inference for cost-sensitive deployment | FP8 (E4M3 weights, E4M3 or E5M2 activations) or INT8 | Halves memory and doubles throughput on H100 and later. |
| Inference at the lowest cost / largest batch | MXFP4, NVFP4, or INT4 | Quarter the memory of bfloat16, requires per-block scaling. |
For a model that has been pretrained in bfloat16, full bfloat16 inference is the safest deployment because it requires no conversion and produces bit-identical or near-identical outputs to the training environment. Quantization to lower precision (INT8, FP8, or MX formats) is a separate post-training step described in quantization. bfloat16 is the canonical "baseline" against which inference quantization is measured.
bfloat16's adoption in 2018 to 2020 coincided with a steep growth in model sizes and training compute. The scaling laws published by OpenAI in 2020 and refined by DeepMind's Chinchilla work in 2022 made it clear that the practical limit on frontier model size was set by the available compute and memory budget, and that a unit of compute spent in a narrower numerical format drove the loss down about as well as the same unit spent in a wider one. Reducing precision from FP32 to bfloat16 therefore effectively doubled the affordable model size and training duration for a given hardware budget. The same pattern has repeated with each new precision step: moving from bfloat16 to FP8 in 2022 to 2023 doubled the affordable scale again, and the move from FP8 to FP4 in 2024 to 2026 nominally doubled it once more.
At each step the trade-off looks the same. The narrower format requires more careful per-tensor scaling, more careful selection of which operations stay at higher precision, and more risk of training instabilities, but on hardware that supports the format natively, the throughput gains are immediate. bfloat16 sits in a particularly comfortable place in this hierarchy: it is narrow enough to give a meaningful speedup over FP32 but wide enough to be used as a drop-in replacement without per-tensor scaling or other software intervention. That property has kept bfloat16 in service even as FP8 and FP4 have rolled out, because it remains the default working precision for the components that need a reliable wide range, including optimizer state and gradient reduction in many large-scale recipes.
Most AI frameworks treat bfloat16 as a first-class type today. PyTorch, TensorFlow, and JAX all support bfloat16 tensors on every backend that has hardware support. Hugging Face Transformers exposes a torch_dtype=torch.bfloat16 argument on every model loader, and most model weights distributed on the Hugging Face Hub for the past three years are saved in bfloat16. The ONNX Runtime, OpenAI Triton, NVIDIA cuDNN and TensorRT, Intel's OneDNN, Apple's Metal Performance Shaders, and AMD's ROCm libraries all accept bfloat16 inputs and produce bfloat16 outputs without conversion. Numpy itself does not include bfloat16 as a built-in dtype, but the JAX and ml_dtypes packages provide compatible Python-level extensions that allow it to be used in scientific code outside deep learning.
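For example, the ml_dtypes package mentioned above registers bfloat16 with NumPy so that ordinary arrays can hold it (values shown in the comments are approximate):

```python
import numpy as np
import ml_dtypes

# bfloat16 as a numpy-compatible dtype via ml_dtypes.
x = np.array([3.14159, 1e-7, 2.0e38], dtype=ml_dtypes.bfloat16)
print(x)                       # values rounded to an 8-bit significand
print(x.astype(np.float32))    # widening back to FP32 is exact
```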
On the model file format side, Safetensors, the de facto serialization standard in 2026 for neural network weights, supports bfloat16 directly without re-encoding. Model checkpoints are typically about half the size they would be in FP32 because the bfloat16 weight tensors dominate the file size.
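A short sketch of what that looks like with the safetensors PyTorch helpers (the tensor contents and file name are illustrative):

```python
import torch
from safetensors.torch import save_file, load_file

# bfloat16 tensors serialize directly; no widening to FP32 on disk.
weights = {"w": torch.randn(1024, 1024, dtype=torch.bfloat16)}
save_file(weights, "ckpt.safetensors")
print(load_file("ckpt.safetensors")["w"].dtype)  # torch.bfloat16
```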
bfloat16 is not appropriate for every workload. Any computation that requires more than about 2.5 decimal digits of precision per value will run into trouble, including long accumulations carried out entirely in bfloat16 (where rounding error grows with the number of terms), quantities computed as small differences of large values such as variances, traditional scientific and financial computations written against FP32 or FP64 semantics, and the high-precision regression tasks mentioned earlier.
Some workloads also require bit-identical reproducibility across runs, which is harder to guarantee in bfloat16 than in FP32 because rounding errors interact with parallel reduction order. Frameworks generally provide higher-precision modes for such cases, at the cost of throughput.