bfloat16 (short for Brain Floating Point Format, sometimes written BF16) is a 16-bit floating-point number representation developed by Google Brain and first deployed in Tensor Processing Units (TPUs) for accelerating neural network workloads. The format is built from one sign bit, eight exponent bits, and seven mantissa bits. That layout sacrifices a great deal of decimal precision compared with the IEEE 754 half-precision format (FP16) and instead preserves the full numerical range of single precision (FP32). For machine learning, where weights, activations, and gradients can span many orders of magnitude but rarely need more than two or three significant decimal digits, that trade is overwhelmingly favorable, and bfloat16 has become the default training and inference format for most large neural networks built since 2020.
The format is now natively supported across nearly every modern AI accelerator, including Google TPUs from v2 onward, NVIDIA Ampere, Hopper, and Blackwell GPUs, AMD Instinct MI200/MI300 series, Intel Cooper Lake and Sapphire Rapids CPUs, ARM Neoverse cores, and Apple Silicon from the M2 generation onward. It is also the standard "low precision" dtype in the major deep learning frameworks, including PyTorch, TensorFlow, and JAX, where it is exposed through automatic mixed precision APIs and is treated as a first-class numpy-compatible dtype on TPU backends.
A bfloat16 value occupies 16 bits of memory and uses the same encoding scheme as IEEE 754 binary floating-point. One bit holds the sign, eight bits encode a biased exponent with bias 127, and seven bits encode the explicit fraction (with an implicit leading 1 for normal numbers, giving 8 bits of effective significand precision). The structural choice is to take a standard IEEE 754 single-precision number and simply truncate the lower 16 bits of the mantissa. That property makes conversion between FP32 and bfloat16 trivial: a hardware unit can drop or extend the trailing 16 bits with a single shift, and rounding-to-nearest-even can be implemented with a tiny adder.
| Bit index (MSB to LSB) | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Field | S | E | E | E | E | E | E | E | E | M | M | M | M | M | M | M |
| Meaning | Sign | exp7 | exp6 | exp5 | exp4 | exp3 | exp2 | exp1 | exp0 | m6 | m5 | m4 | m3 | m2 | m1 | m0 |
The encoded value of a normal bfloat16 number is (-1)^S * 1.M * 2^(E - 127), where M is the 7-bit fraction interpreted as a binary expansion after the implicit 1. Subnormal numbers, infinities, signaling NaNs, and quiet NaNs follow the same conventions as IEEE 754 single precision. The minimum positive normal value is 2^-126 (about 1.18e-38), and the maximum finite magnitude is approximately 3.39e38, marginally below FP32's 3.40e38 only because of the shorter mantissa.
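As an illustration of how cheap that conversion is, the following NumPy sketch performs the truncate-and-round step described above by viewing FP32 values as raw 32-bit integers. It is illustrative only: NaN special-casing, which real hardware and libraries such as ml_dtypes handle, is omitted for brevity.

```python
import numpy as np

def float32_to_bfloat16_bits(x):
    """Round FP32 values to bfloat16 bit patterns with round-to-nearest-even.

    Illustrative sketch: NaN payloads are not special-cased here.
    """
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    lsb = (bits >> 16) & 1            # lowest bit that survives truncation
    rounded = bits + 0x7FFF + lsb     # classic round-to-nearest-even bias
    return (rounded >> 16).astype(np.uint16)

def bfloat16_bits_to_float32(b):
    """Widen bfloat16 bit patterns back to FP32 by appending 16 zero bits."""
    return (np.asarray(b, dtype=np.uint32) << 16).view(np.float32)

x = np.array([3.14159265, 1.0e-7, 3.0e38], dtype=np.float32)
print(bfloat16_bits_to_float32(float32_to_bfloat16_bits(x)))
# e.g. pi rounds to 3.140625, the nearest value with an 8-bit significand
```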
FP16 (the IEEE 754 binary16 format, also called "half precision") uses one sign bit, five exponent bits, and ten mantissa bits. The two formats spend the same total of 16 bits very differently. FP16 buys precision (about 3.3 decimal digits) at the cost of a narrow dynamic range (roughly 6.1e-5 to 6.55e4). bfloat16 buys range (1.18e-38 to 3.39e38, essentially the same as FP32) at the cost of precision (about 2.3 decimal digits, or roughly 8 bits of significand). The exponent difference is what matters most for machine learning: the cost of FP16 is constant overflow risk on activations and constant underflow risk on small gradients, both of which require explicit loss scaling to manage during training. bfloat16 inherits FP32's range and avoids both problems. The cost is that individual rounding errors are larger by a factor of 8 (2^3) than in FP16, which is acceptable for stochastic gradient updates that are noisy by nature.
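The trade is easy to see by round-tripping a few values through each format. A quick PyTorch illustration (the specific constants are just examples; both dtypes work on CPU in any recent PyTorch build):

```python
import torch

# Overflow: 70000 exceeds FP16's largest finite value (65504) but is far
# below bfloat16's maximum (~3.39e38).
print(torch.tensor(70000.0, dtype=torch.float16))   # inf
print(torch.tensor(70000.0, dtype=torch.bfloat16))  # 70144 (rounded)

# Underflow: 1e-8 is below FP16's smallest subnormal (~6e-8) and flushes to
# zero, but sits comfortably inside bfloat16's normal range.
print(torch.tensor(1e-8, dtype=torch.float16))      # 0.
print(torch.tensor(1e-8, dtype=torch.bfloat16))     # ~1.0012e-08

# Precision: bfloat16 rounds more coarsely near 1.0 because it has three
# fewer mantissa bits (steps of 2^-7 instead of 2^-10).
print(torch.tensor(1.001, dtype=torch.float16))     # 1.0010
print(torch.tensor(1.001, dtype=torch.bfloat16))    # 1.0
```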
bfloat16 was conceived inside Google Brain during the design of the second-generation Tensor Processing Unit. The first TPU (TPU v1, 2015) was an inference-only chip that operated on 8-bit integers. When Google's hardware team began designing TPU v2 in 2016 and 2017, the goal was to support training as well as inference, and that meant supporting floating-point arithmetic. Norman Jouppi and the TPU architecture team observed, based on years of internal experiments, that neural networks were far more sensitive to the range of representable values than to the precision of any individual value. Underflows, overflows, and NaNs broke training runs; an extra bit or two of mantissa precision rarely made a measurable accuracy difference. They responded by stripping FP32 down to its top 16 bits, calling the result "Brain Floating Point."
The format first shipped in TPU v2 in 2017 and was made publicly available through Cloud TPU in 2018. Shibo Wang and Pankaj Kanwar formally documented it in the 2019 Google Cloud whitepaper BFloat16: The secret to high performance on Cloud TPUs. The paper argued that the silicon area of a floating-point multiplier scales roughly with the square of the mantissa width, so cutting the mantissa from FP32's 23 bits to bfloat16's 7 makes each multiplier dramatically smaller: in practice a bfloat16 multiplier is roughly half the area of an FP16 multiplier and about one-eighth the area of an FP32 multiplier, which lets a TPU chip pack many more compute lanes into the same die budget. The matrix multiplication unit (MXU) of a TPU v2 or v3 chip is a 128 by 128 systolic array of bfloat16 multipliers feeding FP32 accumulators, an arrangement that captures most of the speed of pure 16-bit arithmetic while preserving FP32 stability for the long sums of products that compose a matrix multiply.
Floating-point formats encode three things: a sign, a magnitude (exponent), and a fractional precision (mantissa). For neural networks, each plays a distinct role. The exponent governs whether the value is even representable; the mantissa governs how accurately a representable value can be expressed. Both matter for accuracy, but they do not matter equally.
Weights in a trained network are typically in the range of about 1e-3 to 1e+1 in absolute value. Activations, after batch normalization or layer normalization, are usually in a similar range. Gradients, however, can be much smaller, often near 1e-6 or 1e-7 deep inside large language models, and can occasionally spike to 1e+3 or larger during training instabilities. The total dynamic range required for stable training is therefore at least 10 orders of magnitude. FP32 supports about 76 orders of magnitude, FP16 supports about 9 orders, and bfloat16 supports about 76 orders. FP16 is right at the edge of being usable for training; bfloat16 has comfortable headroom in both directions.
The trade-off shows up most clearly in three behaviors. First, a model trained in pure FP16 typically requires loss scaling, in which the loss value is multiplied by a constant before backpropagation and the resulting gradients are divided by the same constant before being applied. This pushes small gradient magnitudes up into the FP16 normal range. bfloat16 does not need loss scaling at all, because gradients near 1e-7 are still well inside its representable range. Second, FP16 can encode small differences in weight values more precisely, which can matter for inference of an already-trained model where the weight distribution is very compact. bfloat16 will round more aggressively in those situations and can introduce visible accuracy loss on tasks like high-precision regression. Third, bfloat16 is generally a drop-in replacement for FP32 in training: weights initialized in FP32 truncate cleanly to bfloat16, and a model can be moved between the two formats without numerical translation problems. FP16 requires more careful handling.
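To make the loss-scaling mechanics concrete, here is a small self-contained PyTorch sketch of the static scaling arithmetic that FP16 training needs and bfloat16 does not. The model, data, and scale constant are illustrative only, and the example runs in FP32 on CPU for simplicity; production FP16 recipes use dynamic scaling (for example torch.cuda.amp.GradScaler).

```python
import torch

model = torch.nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
x, y = torch.randn(8, 4), torch.randn(8, 1)

# FP16-style recipe: scale the loss up so small gradients stay representable,
# then divide the gradients back down before the optimizer step.
SCALE = 2.0 ** 12  # hypothetical constant
loss = torch.nn.functional.mse_loss(model(x), y)
(loss * SCALE).backward()
for p in model.parameters():
    p.grad /= SCALE
opt.step()

# bfloat16-style recipe: no scaling step, because gradients near 1e-7 are
# still normal bfloat16 values.
opt.zero_grad()
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()
opt.step()
```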
Deep learning hardware has steadily added support for narrower number formats since 2017. Each format trades precision and range against memory, bandwidth, and arithmetic throughput. The table below summarizes the formats most commonly used in modern AI training and inference.
| Format | Total bits | Sign | Exponent | Mantissa | Approx. range | Approx. decimal precision | First major hardware |
|---|---|---|---|---|---|---|---|
| FP64 (double) | 64 | 1 | 11 | 52 | 2.2e-308 to 1.8e+308 | 15-17 digits | All general-purpose CPUs and GPUs |
| FP32 (single) | 32 | 1 | 8 | 23 | 1.18e-38 to 3.4e+38 | 7 digits | Universal |
| TF32 (TensorFloat-32) | 19 (stored in 32) | 1 | 8 | 10 | 1.18e-38 to 3.4e+38 | 3-4 digits | NVIDIA Ampere A100 (2020) |
| FP16 (half) | 16 | 1 | 5 | 10 | 6.1e-5 to 6.55e+4 | 3-4 digits | NVIDIA Pascal P100 (2016) |
| bfloat16 (BF16) | 16 | 1 | 8 | 7 | 1.18e-38 to 3.4e+38 | 2-3 digits | Google TPU v2 (2017-2018) |
| FP8 E4M3 | 8 | 1 | 4 | 3 | ~1.95e-3 to ~448 | ~1.5 digits | NVIDIA Hopper H100 (2022) |
| FP8 E5M2 | 8 | 1 | 5 | 2 | ~1.5e-5 to ~5.7e+4 | ~1 digit | NVIDIA Hopper H100 (2022) |
| MXFP6 E3M2 | 6 plus shared scale | 1 | 3 | 2 | block-scaled | ~1 digit | NVIDIA Blackwell, AMD MI355 (2024-2025) |
| MXFP4 / NVFP4 E2M1 | 4 plus shared scale | 1 | 2 | 1 | block-scaled | <1 digit | NVIDIA Blackwell, AMD MI355 (2024-2025) |
The most important comparisons are summarized below.
FP32 is the historical baseline for both training and inference. It uses 32 bits per value, has 23 bits of mantissa, and represents about seven decimal digits with a range from roughly 1.18e-38 to 3.4e+38. bfloat16 keeps the same 8-bit exponent as FP32 and therefore the same range, but truncates the mantissa to seven bits. The result is half the memory footprint, roughly half the memory bandwidth requirement, and proportional improvements in cache efficiency. For nearly every deep learning workload, the accuracy loss is invisible at the model output, so bfloat16 has displaced FP32 as the default training format.
FP16 was the original "low precision" format used in deep learning, popularized by NVIDIA Pascal and Volta GPUs and the original mixed-precision training paper from Baidu and NVIDIA in 2017. Both formats use 16 bits and have nominally the same memory and bandwidth costs, but they trade off range for precision in opposite directions. FP16 has 10 bits of mantissa (about 3.3 decimal digits) but only 5 bits of exponent (range about 6.1e-5 to 6.55e+4). bfloat16 has 7 bits of mantissa (about 2.3 decimal digits) but 8 bits of exponent (range about 1.18e-38 to 3.4e+38). For training, the wider range of bfloat16 is usually decisive; FP16 retains an edge for inference of smaller models or tasks that need higher per-value precision.
NVIDIA introduced TensorFloat-32 (TF32) with the Ampere A100 in 2020 as a compromise between FP32 and bfloat16. TF32 uses 1 sign bit, 8 exponent bits (matching FP32 and bfloat16), and 10 mantissa bits (matching FP16). The total of 19 bits is stored inside a 32-bit register, so TF32 does not save memory; it saves only multiplier area inside the tensor cores. TF32 is the default math mode for FP32 matmuls on Ampere and later NVIDIA GPUs, which is why upgrading from a Volta V100 to an A100 produces a measurable speedup on many "FP32" workloads without any code changes. bfloat16 is more aggressive than TF32 in both directions: it reduces the mantissa further (7 bits instead of 10) and also saves memory (16 bits per value instead of 32), so it sits below TF32 in the precision-versus-throughput trade-off.
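In PyTorch, the TF32 behavior described above is controlled by existing backend flags (plus a newer one-line equivalent), which is worth knowing when benchmarking "FP32" runs against bfloat16:

```python
import torch

# Allow "FP32" matmuls and convolutions to execute as TF32 in the tensor
# cores on Ampere-or-newer GPUs.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# Equivalent higher-level knob: "highest" keeps true FP32, "high" permits TF32.
torch.set_float32_matmul_precision("high")
```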
NVIDIA's Hopper H100 (2022) introduced FP8 in two flavors. E4M3 has 4 exponent bits and 3 mantissa bits; it can encode values from about 1.95e-3 to 448 and is intended for forward activations and weights. E5M2 has 5 exponent bits and 2 mantissa bits; it covers about 1.5e-5 to 5.7e+4 and is intended for backward gradients, where range matters more than precision. FP8 halves the memory footprint of bfloat16 and roughly doubles the compute throughput on supported hardware, but it requires careful per-tensor scaling to avoid range failures. NVIDIA's Transformer Engine library and similar tools manage that scaling automatically. As of 2024 to 2026, large language model pretraining typically still relies on bfloat16 for stability, with FP8 used selectively for forward-pass activations and gradient transmission, while inference is increasingly run end-to-end in FP8.
In September 2023, the Microscaling Formats (MX) Alliance, a group including AMD, ARM, Intel, Meta, Microsoft, NVIDIA, and Qualcomm, published version 1.0 of the OCP Microscaling Formats specification through the Open Compute Project. The standard defines four narrow-precision formats (MXFP8, MXFP6, MXFP4, and MXINT8) that share a single 8-bit (E8M0) scaling factor across a block of 32 values. The block scale acts as a supplementary exponent that effectively widens the dynamic range of any single value, so a 4-bit MXFP4 value can stay accurate even when the underlying values are small or large. NVIDIA's Blackwell B200, AMD's MI355, and other 2025-era accelerators ship native MXFP4 and MXFP6 hardware. NVIDIA's NVFP4 variant uses a finer block size of 16 elements with an FP8 (E4M3) scale and a per-tensor FP32 second-level scale. These formats live two notches below bfloat16 in the precision hierarchy and are aimed primarily at inference of trained models, where weights have settled into a tight distribution that responds well to block-wise quantization. bfloat16 typically remains the format of record for the master weights, with MX formats produced as a downstream conversion.
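A rough sense of how block scaling works can be had from the sketch below, which shares one power-of-two scale across each block of 32 values and snaps each element to the FP4 E2M1 magnitude grid. This is a conceptual illustration only, not the exact OCP MXFP4 encoding, which stores the scale as an E8M0 byte and defines its own rounding and clamping rules.

```python
import numpy as np

# Representable magnitudes of an FP4 E2M1 element: 0, 0.5, 1, 1.5, 2, 3, 4, 6.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mx_style_quantize(x, block=32):
    """Conceptual MX-style block quantization (not the exact OCP encoding)."""
    x = np.asarray(x, dtype=np.float32)
    out = np.empty_like(x)
    for start in range(0, x.size, block):
        blk = x[start:start + block]
        amax = float(np.max(np.abs(blk)))
        # Shared power-of-two scale chosen so the block's largest magnitude
        # lands near FP4's maximum element value (6).
        scale = 2.0 ** np.floor(np.log2(amax / 6.0)) if amax > 0 else 1.0
        scaled = np.abs(blk) / scale
        idx = np.argmin(np.abs(scaled[:, None] - FP4_GRID[None, :]), axis=1)
        out[start:start + block] = np.sign(blk) * FP4_GRID[idx] * scale
    return out

w = (np.random.randn(64) * 0.02).astype(np.float32)
print(np.max(np.abs(w - mx_style_quantize(w))))  # small, block-relative error
```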
bfloat16 hardware support spread quickly from Google's first deployment to nearly the entire industry between 2018 and 2022. The table below summarizes the major adoption milestones.
| Year | Hardware | Vendor | Notes |
|---|---|---|---|
| 2017-2018 | TPU v2 | Google | First production hardware to use bfloat16 in the matrix multiplication unit; FP32 accumulation. |
| 2018 | TPU v3 | Google | Doubled the number of MXUs per chip; bfloat16 throughput rose to 420 TFLOPS per four-chip Cloud TPU device. |
| 2020 | Cooper Lake (Xeon Scalable 3rd gen) | Intel | First x86 CPU with AVX-512 BF16 instructions; introduced VDPBF16PS and VCVTNE2PS2BF16. |
| 2020 | A100 | NVIDIA | Third-generation Tensor Cores added native bfloat16; 312 TFLOPS dense, 624 TFLOPS with structured sparsity. |
| 2020 | TPU v4 | Google | bfloat16 throughput grew to about 275 TFLOPS per chip with expanded MXU capacity. |
| 2020 | MI100 | AMD | First AMD CDNA accelerator with native bfloat16 matrix engines. |
| 2021 | Sapphire Rapids (preview) | Intel | Adds AMX (Advanced Matrix Extensions) with bfloat16 tile multiplication. |
| 2021 | Neoverse V1, N2 | ARM | First Neoverse cores to implement the BFloat16 instructions introduced with ARMv8.6-A; support then rolled out across server-class ARM designs. |
| 2022 | H100 (Hopper) | NVIDIA | bfloat16 reaches about 989 TFLOPS dense / 1979 with sparsity in addition to FP8 support. |
| 2022 | MI250 / MI250X | AMD | bfloat16 throughput of 383 TFLOPS per GPU; widely used in Frontier and other supercomputers. |
| 2022 | M2 | Apple | First Apple Silicon with the ARMv8.6-A BFloat16 instructions; carried forward in M3 and later generations. |
| 2023 | MI300X / MI300A | AMD | MI300X delivers about 1307 TFLOPS of bfloat16 per GPU; MI300A is the first data-center APU with combined CPU and GPU bfloat16 support. |
| 2024 | B200 (Blackwell) | NVIDIA | bfloat16 throughput about 2.25 PFLOPS per chip; native FP6/FP4 added alongside. |
| 2024-2025 | TPU v6e "Trillium" / TPU v7 "Ironwood" | Google | Continued scaling of bfloat16 throughput per chip; bfloat16 remains the default training datatype. |
Over this 2018-to-2025 arc, bfloat16 went from a TPU-only curiosity to the lowest common denominator that every serious AI accelerator now supports. Software written against bfloat16 today runs without modification on TPUs, NVIDIA GPUs, AMD GPUs, x86 CPUs, ARM CPUs, and Apple Silicon, which has made it the natural format for portable AI workloads.
The most common modern training recipe uses bfloat16 as the working precision for forward and backward passes, with FP32 used selectively for the parameters that benefit most from extra precision. This is called mixed precision training; see mixed_precision_training.
The canonical pattern is:

1. Keep a master copy of the parameters in FP32 inside the optimizer.
2. Cast the parameters to bfloat16 and run the forward pass in bfloat16.
3. Run the backward pass in bfloat16, producing bfloat16 gradients.
4. Reduce and accumulate the gradients, then apply the optimizer update to the FP32 master weights.
This pattern halves the memory required for activations (which dominate the memory budget during training) and roughly doubles the throughput of the heavy matmul kernels. Because bfloat16 has the same exponent range as FP32, the loss-scaling step that FP16 mixed precision requires is unnecessary. That makes bfloat16 mixed precision substantially easier to deploy and debug than FP16 mixed precision, and it is the dominant training strategy for large language models in 2026.
PyTorch exposes bfloat16 through torch.bfloat16 as a first-class dtype and through the torch.amp (automatic mixed precision) package. The standard recipe wraps the forward pass in torch.autocast(device_type="cuda", dtype=torch.bfloat16). Because gradients do not underflow, no GradScaler is needed, in contrast to the FP16 recipe. PyTorch's documentation explicitly notes that the AMP framework is dtype-agnostic and that bfloat16 is preferred when the underlying hardware supports it.
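A minimal end-to-end training step with this recipe might look like the sketch below. It assumes a CUDA-capable build (on CPU the same torch.autocast call takes device_type="cpu"), and the model, optimizer, and tensor shapes are placeholders.

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()          # parameters stay in FP32
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(32, 1024, device="cuda")
target = torch.randn(32, 1024, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    out = model(x)                                   # matmuls run in bfloat16
    loss = torch.nn.functional.mse_loss(out, target)

loss.backward()                                      # no GradScaler needed
opt.step()
opt.zero_grad()
```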
JAX treats jax.numpy.bfloat16 as a native numpy dtype (using a small Python extension to make this work outside the standard numpy type system). On TPU backends, the default matmul precision is bfloat16 with FP32 accumulation, which can be adjusted through the precision argument on jnp.matmul, jnp.einsum, and related operations. The DeepMind library JMP (JAX Mixed Precision) wraps the same idea with higher-level policy objects.
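A small JAX example of the same idea; the precision argument shown here is the standard jax.lax.Precision enum, and the arrays stay bfloat16 end to end:

```python
import jax
import jax.numpy as jnp

x = jnp.ones((128, 128), dtype=jnp.bfloat16)
w = jnp.ones((128, 128), dtype=jnp.bfloat16)

# Default matmul precision versus an explicitly more careful pass.
y_fast = jnp.matmul(x, w)
y_careful = jnp.matmul(x, w, precision=jax.lax.Precision.HIGHEST)
print(y_fast.dtype, y_careful.dtype)  # both bfloat16
```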
TensorFlow exposes bfloat16 through tf.bfloat16 and through the tf.keras.mixed_precision API. On Cloud TPU, switching on bfloat16 mixed precision is typically a one-line change to a model definition, and TPU-aware optimizers handle the FP32 master weights internally.
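For reference, the one-line switch mentioned above uses the standard Keras policy name:

```python
import tensorflow as tf

# Keras mixed precision policy: compute in bfloat16, keep variables in FP32.
tf.keras.mixed_precision.set_global_policy("mixed_bfloat16")
print(tf.keras.mixed_precision.global_policy())  # <Policy "mixed_bfloat16">
```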
Inference uses precision a little differently from training. Because the weights and activations are no longer changing, and because numerical sensitivity tends to concentrate in only a few layers (for example, attention softmax outputs and layer norm statistics), inference can usually drop to a more aggressive precision than training. The decision tree most teams use is roughly:
| Step | Typical precision | Why |
|---|---|---|
| Master weight storage during training | FP32 | Optimizer updates need 7+ digits of precision. |
| Forward / backward pass during training | bfloat16 | Best balance of range, speed, and memory. |
| Gradient communication across nodes | bfloat16 or FP8 E5M2 | Compresses bandwidth without compromising convergence. |
| Inference of a freshly trained checkpoint | bfloat16 | Drop-in conversion from training checkpoint. |
| Production inference for cost-sensitive deployment | FP8 (E4M3 weights, E4M3 or E5M2 activations) or INT8 | Halves memory and doubles throughput on H100 and later. |
| Inference at the lowest cost / largest batch | MXFP4, NVFP4, or INT4 | Quarter the memory of bfloat16, requires per-block scaling. |
For a model that has been pretrained in bfloat16, full bfloat16 inference is the safest deployment because it requires no conversion and produces bit-identical or near-identical outputs to the training environment. Quantization to lower precision (INT8, FP8, or MX formats) is a separate post-training step described in quantization. bfloat16 is the canonical "baseline" against which inference quantization is measured.
bfloat16's adoption in 2018 to 2020 coincided with a steep growth in model sizes and training compute. The scaling laws published by OpenAI in 2020 and refined by DeepMind's Chinchilla work in 2022 made it clear that the practical limit on frontier model size was set by the available compute and memory budget, and that a unit of compute spent in a narrower numerical format drove the loss down about as well as the same unit spent in a wider one. Reducing precision from FP32 to bfloat16 therefore effectively doubled the affordable model size and training duration for a given hardware budget. The same pattern has repeated with each new precision step: moving from bfloat16 to FP8 in 2022 to 2023 doubled the affordable scale again, and the move from FP8 to FP4 in 2024 to 2026 nominally doubled it once more.
At each step the trade-off looks the same. The narrower format requires more careful per-tensor scaling, more careful selection of which operations stay at higher precision, and more risk of training instabilities, but on hardware that supports the format natively, the throughput gains are immediate. bfloat16 sits in a particularly comfortable place in this hierarchy: it is narrow enough to give a meaningful speedup over FP32 but wide enough to be used as a drop-in replacement without per-tensor scaling or other software intervention. That property has kept bfloat16 in service even as FP8 and FP4 have rolled out, because it remains the default working precision for the components that need a reliable wide range, including optimizer state and gradient reduction in many large-scale recipes.
Most AI frameworks treat bfloat16 as a first-class type today. PyTorch, TensorFlow, and JAX all support bfloat16 tensors on every backend that has hardware support. Hugging Face Transformers exposes a torch_dtype=torch.bfloat16 argument on every model loader, and most model weights distributed on the Hugging Face Hub for the past three years are saved in bfloat16. The ONNX Runtime, OpenAI Triton, NVIDIA cuDNN and TensorRT, Intel's OneDNN, Apple's Metal Performance Shaders, and AMD's ROCm libraries all accept bfloat16 inputs and produce bfloat16 outputs without conversion. Numpy itself does not include bfloat16 as a built-in dtype, but the JAX and ml_dtypes packages provide compatible Python-level extensions that allow it to be used in scientific code outside deep learning.
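For example, the ml_dtypes package mentioned above registers bfloat16 with NumPy so that ordinary arrays can hold it (values shown in the comments are approximate):

```python
import numpy as np
import ml_dtypes

# bfloat16 as a numpy-compatible dtype via ml_dtypes.
x = np.array([3.14159, 1e-7, 2.0e38], dtype=ml_dtypes.bfloat16)
print(x)                       # values rounded to an 8-bit significand
print(x.astype(np.float32))    # widening back to FP32 is exact
```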
On the model file format side, Safetensors, the de facto serialization standard in 2026 for neural network weights, supports bfloat16 directly without re-encoding. Model checkpoints are typically about half the size they would be in FP32 because the bfloat16 weight tensors dominate the file size.
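A short sketch of what that looks like with the safetensors PyTorch helpers (the tensor contents and file name are illustrative):

```python
import torch
from safetensors.torch import save_file, load_file

# bfloat16 tensors serialize directly; no widening to FP32 on disk.
weights = {"w": torch.randn(1024, 1024, dtype=torch.bfloat16)}
save_file(weights, "ckpt.safetensors")
print(load_file("ckpt.safetensors")["w"].dtype)  # torch.bfloat16
```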
bfloat16 is not appropriate for every workload. Any computation that requires more than about 2.5 decimal digits of precision per value will run into trouble, including long accumulations carried out entirely in bfloat16 (where rounding error grows with the number of terms), quantities computed as small differences of large values such as variances, traditional scientific and financial computations written against FP32 or FP64 semantics, and the high-precision regression tasks mentioned earlier.
Some workloads also require bit-identical reproducibility across runs, which is harder to guarantee in bfloat16 than in FP32 because rounding errors interact with parallel reduction order. Frameworks generally provide higher-precision modes for such cases, at the cost of throughput.