Microscaling formats
Last reviewed: May 31, 2026
Sources: 9 citations
Review status: Source-backed
Revision: v2 · 2,065 words
Microscaling (MX) formats are a family of low-precision number formats for machine learning in which a small block of values (32 in the version 1.0 specification) shares one common scale factor while each value is stored in a narrow floating-point or integer type. They are a block floating point design, standardized in 2023 as the Open Compute Project (OCP) Microscaling Formats specification, and they provide the on-disk and on-wire encodings behind much of the FP8 and FP4 math in recent accelerators such as NVIDIA Blackwell and the AMD Instinct MI350 series. The defined types are MXFP8, MXFP6, MXFP4, and MXINT8.[1][2]
The goal is plain. Storing weights, activations, and gradients in fewer bits cuts memory and the bandwidth needed to move them, and it lets matrix engines do more multiplies per clock. The hard part is keeping enough numerical range and precision that model quality does not fall apart. MX formats try to hit that balance by attaching a shared exponent to each short block of numbers, which restores dynamic range cheaply without giving every element its own full-width scale.[2]
Plain 8-bit floating-point formats such as FP8, and especially 4-bit formats such as FP4, have very little dynamic range on their own. An E4M3 value tops out at 448 and an E2M1 value tops out at 6, so the raw format cannot directly hold the spread of magnitudes that shows up inside a transformer. The usual fix is quantization with a scale factor that maps the real values into the representable range. The question is how coarse or fine that scale should be.
The simplest choice is one scale for an entire tensor, or one per row or column. That is cheap to store but it has to cover every value in a large array with a single number. When a handful of outlier activations are far larger than the rest, a per-tensor scale gets stretched to fit them, and all the small values lose precision because they sit near the bottom of the range. The other extreme, a separate scale for every element, preserves precision but costs as much metadata as the data itself, which defeats the point.
Block scaling sits in between. By giving each short run of consecutive values its own scale, the format adapts to local changes in magnitude. An outlier only spoils the precision of its own block of 32 rather than the whole tensor, and the metadata cost is one scale spread across 32 numbers. This idea is old. Block floating point goes back to early digital signal processing, where a shared exponent across a block of samples was a known trick. MX applies it to deep learning at a fixed, hardware-friendly block size.[3]
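The effect of an outlier on a coarse scale can be shown with a small experiment. The sketch below is a simplified illustration, not the MX encoding itself: it uses a plain symmetric integer quantizer with a real-valued scale, and the helper names `quantize_dequantize` and `mean_abs_error`, the synthetic data, and the outlier value are all assumptions made for the example.

```python
import random

def quantize_dequantize(values, levels=127):
    """Quantize with one shared scale (amax / levels) and map straight back."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return list(values)
    scale = amax / levels
    return [round(v / scale) * scale for v in values]

def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)

random.seed(0)
data = [random.uniform(-1.0, 1.0) for _ in range(256)]
data[5] = 1000.0  # a single outlier, far larger than everything else

# One scale for the whole array: the outlier stretches it for all 256 values.
per_tensor = quantize_dequantize(data)

# One scale per run of 32 values: only the outlier's own block is affected.
per_block = []
for start in range(0, len(data), 32):
    per_block.extend(quantize_dequantize(data[start:start + 32]))

err_tensor = mean_abs_error(data, per_tensor)
err_block = mean_abs_error(data, per_block)
print(f"per-tensor error: {err_tensor:.4f}, per-block error: {err_block:.4f}")
```

With a single scale, every small value rounds to zero because the scale step is larger than the values themselves. With one scale per 32 values, that damage is confined to the block that contains the outlier.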
The basic unit in an MX format is a block of k elements plus one shared scale X. Each stored element P_i is a low-precision number, and the real value it represents is the product X times P_i. The specification fixes k at 32. So a single 8-bit scale is shared across 32 narrow elements, and decoding a value means reading the block scale and one element, then multiplying.[1][2]
The shared scale uses a type called E8M0. As the name suggests, it has 8 exponent bits, no mantissa bits, and no sign bit, so it has the same layout as the exponent field of a standard 32-bit float. It can only represent powers of two, from 2^-127 up to 2^127, plus one reserved NaN encoding. Restricting the scale to powers of two keeps the rescaling step simple in hardware, since multiplying by the block scale is an exponent add rather than a full multiply.[1][4]
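The decoding rule, value = X times P_i with X a power of two, can be sketched in a few lines. This is a minimal illustration that assumes the scale is stored as an exponent with a bias of 127 and that the code 0xFF is the reserved NaN; the function names `e8m0_decode` and `decode_block` are hypothetical, and the elements are passed in as already-decoded numbers rather than raw bits.

```python
import math

E8M0_BIAS = 127   # same bias as the exponent field of a 32-bit float
E8M0_NAN = 0xFF   # the single reserved encoding

def e8m0_decode(code):
    """Return the power of two that an 8-bit E8M0 scale code stands for."""
    if not 0 <= code <= 0xFF:
        raise ValueError("E8M0 code must fit in 8 bits")
    if code == E8M0_NAN:
        return math.nan
    return math.ldexp(1.0, code - E8M0_BIAS)

def decode_block(scale_code, elements):
    """Recover X * P_i for every element P_i of one block."""
    if scale_code == E8M0_NAN:
        return [math.nan] * len(elements)
    exponent = scale_code - E8M0_BIAS
    # ldexp adds `exponent` to each element's exponent; no multiply is needed.
    return [math.ldexp(p, exponent) for p in elements]

# A scale code of 130 means 2**3, so each element is scaled by 8.
print(decode_block(130, [1.5, -0.5, 6.0]))
```

Using `math.ldexp` rather than a multiplication mirrors the point made above: applying a power-of-two scale only shifts the exponent.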
In practice an encoder picks the block scale from the largest magnitude in the block. It finds the biggest element, chooses a power-of-two exponent that brings that element into the element type's range (a common choice is the exponent of the largest magnitude minus the largest exponent the element type can represent), and stores that exponent as the E8M0 scale. Every element in the block is then divided by the scale and rounded into its narrow type. This is the step that recovers dynamic range. The block can hold both large and small numbers because the shared exponent slides the whole group up or down, and the element bits only have to capture the relative differences inside the block.[2][4]
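The encoding steps described above can be sketched for MXFP4 as follows. This is an illustrative sketch under stated assumptions, not a reference implementation: the shared exponent is taken as floor(log2(largest magnitude)) minus 2 (the largest E2M1 exponent), rounding picks the nearest representable magnitude with ties going to the smaller one, and an all-zero block gets an exponent of 0. The names `quantize_block_mxfp4` and `dequantize_block` are hypothetical.

```python
import math

# Non-negative magnitudes representable by a 4-bit E2M1 element.
FP4_E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_EMAX = 2  # the largest value, 6, is 1.5 * 2**2

def quantize_block_mxfp4(values):
    """Return (shared_exponent, elements) for one block of real values."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        shared_exp = 0  # arbitrary choice for this sketch
    else:
        # frexp gives amax = m * 2**e with 0.5 <= m < 1, so floor(log2) = e - 1.
        _, e = math.frexp(amax)
        shared_exp = (e - 1) - E2M1_EMAX
        shared_exp = max(-127, min(127, shared_exp))  # E8M0 range
    elements = []
    for v in values:
        scaled = math.ldexp(abs(v), -shared_exp)  # divide by the block scale
        # Nearest grid point; values above 6 saturate to 6.
        nearest = min(FP4_E2M1_GRID, key=lambda g: abs(g - scaled))
        elements.append(math.copysign(nearest, v))
    return shared_exp, elements

def dequantize_block(shared_exp, elements):
    return [math.ldexp(p, shared_exp) for p in elements]

block = [0.1, -0.2, 4.0, 24.0] + [0.0] * 28
shared_exp, elements = quantize_block_mxfp4(block)
restored = dequantize_block(shared_exp, elements)
print(shared_exp, elements[:4], restored[:4])
```

In this example the largest value, 24, sets the scale to 2^2, so 24 and 4 survive exactly while 0.1 and -0.2 fall below the smallest nonzero step and become zero. Production encoders typically use round-to-nearest-even and handle NaN and infinity inputs, which this sketch omits.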
The v1.0 specification defines four concrete formats. They differ in the element type, while all four share the same E8M0 scale and the same block size of 32. MXFP8 and MXFP6 each come in two element variants that trade exponent range against mantissa precision.[1]
| Format | Element type | Element bits | Exponent bits | Mantissa bits | Max element value | Scale | Block size | Bits per block |
|---|---|---|---|---|---|---|---|---|
| MXFP8 | FP8 E4M3 | 8 | 4 | 3 | 448 | E8M0 (8-bit) | 32 | 264 |
| MXFP8 | FP8 E5M2 | 8 | 5 | 2 | 57344 | E8M0 (8-bit) | 32 | 264 |
| MXFP6 | FP6 E2M3 | 6 | 2 | 3 | 7.5 | E8M0 (8-bit) | 32 | 200 |
| MXFP6 | FP6 E3M2 | 6 | 3 | 2 | 28 | E8M0 (8-bit) | 32 | 200 |
| MXFP4 | FP4 E2M1 | 4 | 2 | 1 | 6 | E8M0 (8-bit) | 32 | 136 |
| MXINT8 | INT8 | 8 | 0 | 7 plus sign | n/a | E8M0 (8-bit) | 32 | 264 |
The last column counts the storage for one block, which is 32 element values plus the single 8-bit scale. MXFP4 needs 136 bits per 32 values, or 4.25 bits per value once the shared scale is amortized, against 4 bits for a raw FP4 value with no scaling. The shared exponent is nearly free per element, which is the whole appeal.[1][4]
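The storage figures in the table follow from simple arithmetic, which the short sketch below reproduces. The function names are illustrative, and the defaults assume the block size of 32 and the 8-bit scale described in the text.

```python
def bits_per_block(element_bits, block_size=32, scale_bits=8):
    """Total storage for one block: all elements plus the shared scale."""
    return element_bits * block_size + scale_bits

def bits_per_value(element_bits, block_size=32, scale_bits=8):
    """Storage per value once the shared scale is amortized over the block."""
    return bits_per_block(element_bits, block_size, scale_bits) / block_size

for name, element_bits in [("MXFP8", 8), ("MXFP6", 6), ("MXFP4", 4), ("MXINT8", 8)]:
    print(name, bits_per_block(element_bits), bits_per_value(element_bits))
```

The shared scale adds a constant 8 / 32 = 0.25 bits per value regardless of the element width.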
The naming follows the ExMy convention used across low-precision floating point. E4M3 means 1 sign bit, 4 exponent bits, and 3 mantissa bits. E5M2 moves one bit from the mantissa to the exponent, so it reaches a much larger maximum value of 57344 at the cost of precision, which suits gradients and other quantities with wide range. E2M1 is the tiny 4-bit element with a single mantissa bit. MXINT8 is the integer member of the family, an 8-bit integer element with the same shared power-of-two scale, which behaves like a fine-grained integer quantization scheme.[1][4]
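To make the ExMy convention concrete, the sketch below decodes a raw bit pattern into its value. It assumes the usual layout of sign, then exponent, then mantissa, with a bias of 2^(E-1) - 1 and subnormals when the exponent field is zero. The function name `decode_minifloat` is hypothetical, and the decoder deliberately ignores the Inf and NaN encodings of the FP8 types, so it should only be fed finite codes.

```python
def decode_minifloat(bits, exp_bits, man_bits):
    """Decode a sign/exponent/mantissa bit pattern, ignoring Inf and NaN codes."""
    bias = 2 ** (exp_bits - 1) - 1
    sign = -1.0 if (bits >> (exp_bits + man_bits)) & 1 else 1.0
    exponent = (bits >> man_bits) & ((1 << exp_bits) - 1)
    mantissa = bits & ((1 << man_bits) - 1)
    fraction = mantissa / 2 ** man_bits
    if exponent == 0:
        # Subnormal: no implicit leading 1, fixed exponent of 1 - bias.
        return sign * fraction * 2.0 ** (1 - bias)
    return sign * (1.0 + fraction) * 2.0 ** (exponent - bias)

# All eight non-negative E2M1 (FP4) values.
print([decode_minifloat(code, 2, 1) for code in range(8)])

# Largest finite value of each element type from the table.
print(decode_minifloat(0b011111, 2, 3))     # FP6 E2M3
print(decode_minifloat(0b011111, 3, 2))     # FP6 E3M2
print(decode_minifloat(0b01111110, 4, 3))   # FP8 E4M3 (all-ones pattern is NaN)
print(decode_minifloat(0b01111011, 5, 2))   # FP8 E5M2 (top exponent is Inf/NaN)
```

The FP4 and FP6 types have no Inf or NaN codes, so their maximum uses the all-ones pattern. The FP8 types reserve some top encodings, which is why their largest finite values come from slightly smaller patterns.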
The formats were defined by a group that called itself the Microscaling Formats Alliance, made up of AMD, Arm, Intel, Meta, Microsoft, NVIDIA, and Qualcomm. The alliance published version 1.0 of the OCP Microscaling Formats specification on 26 September 2023 through the Open Compute Project Foundation, in an open and license-free form meant to encourage broad adoption.[5]
Releasing one shared standard, rather than a separate proprietary block format per vendor, matters for interoperability. A model quantized to MXFP4 has a defined bit layout, so it can move between frameworks and run on hardware from different makers that all agree on the same encoding. The OCP route is the same path used to standardize other data center building blocks, which made it a natural home for a numeric format that several competitors wanted to agree on.[5]
Alongside the specification, Microsoft released an open PyTorch emulation library called microxcaling that reproduces the MX arithmetic in software, so researchers could study the formats before silicon with native support shipped.[6]
The research case for MX was laid out in the paper "Microscaling Data Formats for Deep Learning" by Bita Darvish Rouhani and colleagues from the same set of companies. They evaluated the formats across more than two dozen benchmarks for both inference and training of deep learning models.[2]
For inference the simplest mode is direct cast, where a model trained in higher precision is converted to an MX format with no retraining and run as is. The paper reports that MXINT8 direct-cast inference stays very close to the 32-bit baseline. On BERT-Large it reached an F1 score of 93.41 against a 93.47 baseline, and on a generative model such as LLaMA-7B the MXINT8 accuracy on the Lambada task sat within a few thousandths of the full-precision number. Lower-bit formats need a little care, often a brief calibration or fine-tuning step, but they recover most of the quality.[2]
The headline training result was the first demonstration of training a generative large language model at sub-8-bit precision for weights, activations, and gradients, with minimal loss of quality and no change to the training recipe. A GPT-style model trained with 6-bit elements landed within a small fraction of the FP32 loss, and mixed setups using 4-bit weights with 6-bit activations stayed close as well. That made MX a credible path to lower-precision training rather than only inference, which connects it to the broader practice of mixed-precision training.[2]
MX moved from specification to silicon quickly. NVIDIA Blackwell GPUs include fifth-generation Tensor Cores that natively support the OCP microscaling formats, doing the per-block scaling in hardware so the matrix engine reads MXFP8, MXFP6, or MXFP4 blocks directly. AMD added native support for the MX formats in its CDNA 4 architecture, which powers the Instinct MI350 series of accelerators announced on 12 June 2025, with hardware paths for FP4 and FP6 aimed at large language model training and inference.[7][8]
NVIDIA also introduced a closely related but distinct format called NVFP4 for its Blackwell hardware. NVFP4 keeps the 4-bit E2M1 element but changes the scaling. It uses a smaller microblock of 16 elements instead of 32, and the per-block scale is an FP8 E4M3 value rather than a power-of-two E8M0, layered under a single FP32 scale for the whole tensor. The smaller block and the higher-precision scale reduce the damage from outliers and tend to give better accuracy than MXFP4 at the same 4-bit element width, at the cost of a little more metadata and a format that is NVIDIA-specific rather than the open OCP one. NVFP4 is best read as a variant on the same block-scaling idea that MX standardized.[9]
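The "little more metadata" can be quantified with the same kind of arithmetic used for the MX table. The sketch below is a rough accounting that assumes an 8-bit per-block scale in both formats and, for NVFP4, one 32-bit scale per tensor; the function name and the 4096 by 4096 tensor size are illustrative choices, not figures from the sources.

```python
def amortized_bits_per_value(element_bits, block_size, block_scale_bits,
                             tensor_scale_bits=0, tensor_size=1):
    """Element bits plus the per-value share of block and tensor scales."""
    per_block_share = block_scale_bits / block_size
    per_tensor_share = tensor_scale_bits / tensor_size
    return element_bits + per_block_share + per_tensor_share

# MXFP4: 32-element blocks, one 8-bit E8M0 scale per block.
mxfp4 = amortized_bits_per_value(4, 32, 8)

# NVFP4: 16-element blocks, one 8-bit E4M3 scale per block,
# plus one FP32 scale for a hypothetical 4096 x 4096 tensor.
nvfp4 = amortized_bits_per_value(4, 16, 8, tensor_scale_bits=32,
                                 tensor_size=4096 * 4096)

print(f"MXFP4: {mxfp4:.4f} bits/value, NVFP4: {nvfp4:.4f} bits/value")
```

Halving the block size doubles the per-value cost of the block scale, from 0.25 to 0.5 bits, while the per-tensor scale is negligible for any tensor of realistic size.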
MX formats are a form of fine-grained quantization, and they slot into the same toolkit as precision reduction methods generally. Where classic post-training quantization often uses a per-tensor or per-channel scale and an integer element, MX fixes the block at 32, fixes the scale type at a power of two, and standardizes the element types. The practical effect is the same goal of fewer bits per value, but the design is tuned so that the dequantization step is cheap enough to bake into the matrix unit of a GPU. That hardware-native quality is what separates MX from a software-only quantization scheme.[2]
MX is not free of trade-offs. The shared scale only helps within a block, so a single block that mixes very large and very small values still loses precision in its smallest members, and at 4 bits there is simply not much mantissa to work with regardless of scaling. The power-of-two E8M0 scale is cheap but coarse, which is exactly the gap NVFP4 tries to close with a finer FP8 scale. The fixed block size of 32 is a compromise that suits some tensors better than others. And while inference at MXINT8 or MXFP8 is close to drop-in, getting good results at MXFP4 or MXFP6 usually still needs calibration, fine-tuning, or careful choice of which tensors stay in higher precision, so the formats lower the cost of low precision rather than removing the engineering work entirely.[2][9]