NVIDIA Blackwell Ultra
Last reviewed
Jun 3, 2026
Sources
10 citations
Review status
Source-backed
Revision
v1 · 1,677 words
NVIDIA Blackwell Ultra is a mid-cycle refresh of NVIDIA's Blackwell GPU architecture, announced at the company's GTC conference in March 2025. It is built around the B300 data-center GPU and the GB300 (Grace Blackwell Ultra) superchip, and it is delivered chiefly through rack-scale and server systems such as the GB300 NVL72 and the HGX B300. Relative to the base Blackwell generation (the B200 GPU and GB200 systems), Blackwell Ultra carries 50 percent more high-bandwidth memory per GPU and roughly 50 percent more dense low-precision compute. NVIDIA aimed these changes squarely at what it called the "age of AI reasoning": large-scale, token-heavy inference for models that produce long chains of thought at test time. [1][2][3]
Blackwell Ultra is not a new microarchitecture. It reuses the same dual-die Blackwell silicon, the same fifth-generation NVLink interconnect, and the same NVFP4 numerical format as the base generation, while raising memory capacity, attention-layer throughput, and power budget. In NVIDIA's cadence it occupies the slot between the original 2024 Blackwell launch and the next-generation Vera Rubin platform expected in 2026, mirroring the earlier H100-to-H200 mid-life update of the prior Hopper generation. [1][4]
NVIDIA unveiled the Blackwell Ultra AI factory platform on March 18, 2025, during chief executive Jensen Huang's keynote at GTC in San Jose. The company framed the launch around inference for reasoning models, which generate far more output tokens per query than conventional chatbots and therefore stress both compute and memory during the generation (decode) phase. NVIDIA's headline message was that Blackwell Ultra would let operators build "AI factories" that turn electrical power and chips into tokens at higher throughput and lower latency than the base Blackwell platform. [1][2]
The platform announced at GTC comprised two principal system designs: the rack-scale GB300 NVL72 and the eight-GPU HGX B300 NVL16 board for conventional servers. NVIDIA said Blackwell Ultra products would be available from partners starting in the second half of 2025, and that the GB300 NVL72 would also be offered through NVIDIA DGX Cloud. [1]
The defining changes from base Blackwell are larger memory and higher dense low-precision math, both aimed at inference. NVIDIA's technical materials emphasize three deltas over the B200: a step up from 192 GB to 288 GB of HBM3e per GPU, an increase in dense NVFP4 tensor throughput from about 10 to 15 petaFLOPS, and roughly doubled acceleration of the attention layer's softmax (exponential) operations, which dominate the cost of long-context generation. The added memory lets a single GPU hold larger models and far longer key-value caches, which is precisely the bottleneck for reasoning workloads that emit thousands of tokens. [3][5]
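The memory argument can be made concrete with a rough sizing sketch. The model shape and the weight footprint below are hypothetical values chosen only for illustration (they do not come from NVIDIA's materials), and the function names are likewise illustrative. The sketch estimates how many tokens of key-value cache fit in the HBM that remains after the weights are loaded.

```python
# Rough KV-cache sizing sketch. The model shape and weight footprint are
# hypothetical illustration values, not figures from NVIDIA or this article.

GB = 10**9  # decimal gigabytes, as HBM capacity is usually quoted


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int) -> int:
    """Bytes of key plus value state stored for one token across all layers."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem  # 2 = key and value


def max_cached_tokens(hbm_gb: int, weights_gb: int, per_token_bytes: int) -> int:
    """Tokens of KV cache that fit in the HBM left over after model weights."""
    free_bytes = (hbm_gb - weights_gb) * GB
    return free_bytes // per_token_bytes


# Hypothetical model: 80 layers, 8 KV heads (grouped-query attention),
# head dimension 128, 1-byte cache elements, 70 GB of weights.
per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128,
                               bytes_per_elem=1)

b200_tokens = max_cached_tokens(hbm_gb=192, weights_gb=70,
                                per_token_bytes=per_token)
b300_tokens = max_cached_tokens(hbm_gb=288, weights_gb=70,
                                per_token_bytes=per_token)

print(f"KV bytes per token: {per_token:,}")
print(f"B200 (192 GB): {b200_tokens:,} cacheable tokens")
print(f"B300 (288 GB): {b300_tokens:,} cacheable tokens")
print(f"Capacity ratio: {b300_tokens / b200_tokens:.2f}x")
```

Because the weights are a fixed cost, the cache capacity in this sketch grows by more than the 1.5x increase in raw HBM. Real deployments also reserve memory for activations and runtime buffers, which the sketch ignores.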
Importantly, several core attributes are unchanged from the base generation. Blackwell Ultra uses the same two reticle-sized GPU dies fused by NVIDIA's NV-HBI die-to-die link (about 10 TB/s) into a single logical GPU totaling 208 billion transistors, the same fifth-generation NVLink at 1.8 TB/s of bidirectional bandwidth per GPU, and the same NVFP4 4-bit format introduced with Blackwell. The principal trade-off for the higher performance is power: the SXM B300 module is rated up to about 1,400 W, above the roughly 1,000 to 1,200 W of base Blackwell modules. [3][5][6]
| Specification (per GPU) | Blackwell (B200) | Blackwell Ultra (B300) |
|---|---|---|
| HBM memory | 192 GB HBM3e | 288 GB HBM3e |
| Memory bandwidth | 8 TB/s | 8 TB/s |
| Dense NVFP4 compute | ~10 PFLOPS | ~15 PFLOPS |
| Attention (softmax) throughput | baseline | ~2x baseline |
| NVLink (5th gen) | 1.8 TB/s | 1.8 TB/s |
| Transistors (dual die) | 208 billion | 208 billion |
| Peak board power | up to ~1,200 W | up to ~1,400 W |
Sources: NVIDIA technical blog and product materials; Tom's Hardware. [3][5][6]
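The table's ratios, and the efficiency they imply, can be checked with a few lines of arithmetic. This is only an indicative sketch: it uses the approximate "up to" power ratings from the table as if they were sustained draw, which is an assumption, and the dictionary keys and function names are illustrative.

```python
# Per-GPU figures taken from the comparison table. Power values are the
# approximate upper ratings, so the efficiency result is only indicative.
B200 = {"hbm_gb": 192.0, "dense_nvfp4_pflops": 10.0, "peak_power_w": 1200.0}
B300 = {"hbm_gb": 288.0, "dense_nvfp4_pflops": 15.0, "peak_power_w": 1400.0}


def uplift(key: str) -> float:
    """Ratio of the B300 value to the B200 value for one specification."""
    return B300[key] / B200[key]


def pflops_per_kw(gpu: dict) -> float:
    """Peak dense NVFP4 PFLOPS per kilowatt of rated module power."""
    return gpu["dense_nvfp4_pflops"] / (gpu["peak_power_w"] / 1000.0)


memory_uplift = uplift("hbm_gb")
compute_uplift = uplift("dense_nvfp4_pflops")
power_uplift = uplift("peak_power_w")
efficiency_uplift = pflops_per_kw(B300) / pflops_per_kw(B200)

print(f"HBM capacity:      {memory_uplift:.2f}x")
print(f"Dense NVFP4:       {compute_uplift:.2f}x")
print(f"Rated power:       {power_uplift:.2f}x")
print(f"PFLOPS per kW:     {efficiency_uplift:.2f}x")
```

On these peak ratings, compute rises faster than power, so the rated dense NVFP4 throughput per kilowatt improves even though the module draws more.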
The B300 is the Blackwell Ultra GPU. Like the base B200, it is a single package containing two reticle-limit dies built on a TSMC 4-nanometer-class (4NP) process and joined by the NV-HBI interface so software treats them as one device. The GPU exposes 160 streaming multiprocessors organized into eight processing clusters, with 640 fifth-generation Tensor Cores, and it supports NVIDIA's NVFP4, FP6, and FP8 formats through a second-generation Transformer Engine. [3][5]
The headline figures for the B300 are 288 GB of HBM3e (eight 12-high stacks) delivering about 8 TB/s of bandwidth, and 15 petaFLOPS of dense NVFP4 tensor compute (about 20 petaFLOPS with sparsity). The doubling of the special-function-unit throughput used for exponentials gives roughly 2x faster attention-layer compute than base Blackwell, a targeted change because attention's softmax is the hot path during long-sequence decoding. Each GPU connects to its host Grace CPU over NVLink-C2C at 900 GB/s and to peers over fifth-generation NVLink at 1.8 TB/s. [3][5]
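One way to read these headline figures together is a simple roofline estimate. The roofline model is a generic simplification introduced here for illustration, not something taken from NVIDIA's materials, and it treats the quoted peaks as achievable. The sketch derives the per-stack capacity, the arithmetic intensity at which the GPU shifts from bandwidth-bound to compute-bound, and the time needed to read the whole HBM once.

```python
# Roofline-style sketch using the B300 headline figures quoted above.
# Peak numbers are treated as achievable, which real kernels rarely manage.
PEAK_DENSE_NVFP4_FLOPS = 15 * 10**15   # 15 petaFLOPS
HBM_BANDWIDTH_BYTES_PER_S = 8 * 10**12  # 8 TB/s
HBM_CAPACITY_BYTES = 288 * 10**9        # 288 GB
HBM_STACKS = 8


def ridge_point(peak_flops: float, bandwidth: float) -> float:
    """Arithmetic intensity (FLOP per byte) where compute and memory limits meet."""
    return peak_flops / bandwidth


def attainable_flops(intensity: float, peak_flops: float, bandwidth: float) -> float:
    """Roofline bound: the lower of the compute peak and bandwidth * intensity."""
    return min(float(peak_flops), intensity * bandwidth)


ridge = ridge_point(PEAK_DENSE_NVFP4_FLOPS, HBM_BANDWIDTH_BYTES_PER_S)
gb_per_stack = HBM_CAPACITY_BYTES / HBM_STACKS / 10**9
full_sweep_seconds = HBM_CAPACITY_BYTES / HBM_BANDWIDTH_BYTES_PER_S

# A kernel doing 2 FLOP per byte read is limited by bandwidth, not compute.
low_intensity_bound = attainable_flops(2.0, PEAK_DENSE_NVFP4_FLOPS,
                                       HBM_BANDWIDTH_BYTES_PER_S)

print(f"Capacity per HBM3e stack: {gb_per_stack:.0f} GB")
print(f"Ridge point: {ridge:.0f} FLOP/byte")
print(f"Time to read all HBM once: {full_sweep_seconds * 1000:.0f} ms")
print(f"Bound at 2 FLOP/byte: {low_intensity_bound / 1e12:.0f} TFLOPS")
```

Under this model, work that performs far fewer FLOPs per byte than the ridge point is limited by the 8 TB/s of memory bandwidth rather than by the 15 petaFLOPS peak. Small-batch decoding is commonly described as falling in that regime, so memory capacity and bandwidth matter alongside raw compute.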
The GB300 superchip pairs NVIDIA's Arm-based Grace CPU with Blackwell Ultra GPUs on a single coherent module, the Blackwell Ultra equivalent of the earlier GB200. In the configuration used inside the GB300 NVL72, each superchip board combines one Grace CPU with two B300 GPUs, joined to the CPU by the 900 GB/s NVLink-C2C coherent link so that CPU and GPU share a unified memory space. This tight CPU-GPU coupling, together with the large HBM3e pool, is what NVIDIA leans on for serving very large mixture-of-experts and reasoning models. [3][5]
The GB300 NVL72 is the flagship Blackwell Ultra system: a single liquid-cooled rack that links 72 B300 GPUs and 36 Grace CPUs through a fifth-generation NVLink switch fabric so the whole rack behaves as one large accelerator. NVIDIA states the rack delivers about 1.1 exaFLOPS of dense NVFP4 compute (1,440 petaFLOPS with sparsity) and 720 petaFLOPS of FP8 for training, with around 20 TB of HBM3e GPU memory, about 37 TB of total fast memory, and 130 TB/s of aggregate NVLink bandwidth. Networking uses NVIDIA ConnectX-8 SuperNICs providing 800 Gb/s per GPU. NVIDIA reports the GB300 NVL72 offers about 1.5x the dense FP4 performance of the prior GB200 NVL72. [3][7]
| Specification | GB200 NVL72 | GB300 NVL72 |
|---|---|---|
| Blackwell GPUs | 72 (B200) | 72 (B300) |
| Grace CPUs | 36 | 36 |
| NVFP4 compute (sparse) | 1,440 PFLOPS | 1,440 PFLOPS |
| NVFP4 compute (dense) | ~0.72 EFLOPS | ~1.1 EFLOPS |
| GPU (HBM3e) memory | ~13.4 TB | ~20 TB |
| NVLink bandwidth | 130 TB/s | 130 TB/s |
| Per-GPU SuperNIC | ConnectX-7/8 | ConnectX-8, 800 Gb/s |
Sources: NVIDIA GB300 NVL72 product page and technical blog; NVIDIA GB200 NVL72 materials. [3][7][8]
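The rack-level figures follow from multiplying the per-GPU numbers quoted earlier by the 72 GPUs in a rack. The short sketch below does that arithmetic; the variable names are illustrative, and the results are nominal products that ignore any difference between raw and usable capacity.

```python
# Derive GB300 NVL72 rack totals from the per-GPU B300 figures in this article.
GPUS_PER_RACK = 72

B300_DENSE_NVFP4_PFLOPS = 15
B300_SPARSE_NVFP4_PFLOPS = 20
B300_HBM_GB = 288
NVLINK_TB_PER_S_PER_GPU = 1.8

B200_DENSE_NVFP4_PFLOPS = 10  # base Blackwell, for the generational ratio

rack_dense_pflops = GPUS_PER_RACK * B300_DENSE_NVFP4_PFLOPS
rack_sparse_pflops = GPUS_PER_RACK * B300_SPARSE_NVFP4_PFLOPS
rack_hbm_tb = GPUS_PER_RACK * B300_HBM_GB / 1000
rack_nvlink_tb_per_s = GPUS_PER_RACK * NVLINK_TB_PER_S_PER_GPU

gb200_rack_dense_pflops = GPUS_PER_RACK * B200_DENSE_NVFP4_PFLOPS
dense_ratio = rack_dense_pflops / gb200_rack_dense_pflops

print(f"Dense NVFP4:  {rack_dense_pflops} PFLOPS ({rack_dense_pflops / 1000:.2f} EFLOPS)")
print(f"Sparse NVFP4: {rack_sparse_pflops} PFLOPS")
print(f"HBM3e:        {rack_hbm_tb:.1f} TB")
print(f"NVLink:       {rack_nvlink_tb_per_s:.1f} TB/s aggregate")
print(f"Dense uplift over GB200 NVL72: {dense_ratio:.1f}x")
```

The products land on the rounded values in the table: about 1.1 exaFLOPS dense, 1,440 petaFLOPS sparse, roughly 20 TB of HBM3e, and about 130 TB/s of NVLink bandwidth.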
For customers using standard eight-GPU servers rather than full racks, NVIDIA offers the HGX B300 NVL16 baseboard, which carries Blackwell Ultra GPUs in the familiar HGX form factor. NVIDIA characterized the HGX B300 as delivering up to 11x faster large-language-model inference, 7x more compute, and 4x larger memory than the Hopper generation. The same silicon underpins the DGX B300 system, which packs eight Blackwell Ultra GPUs with 2.1 TB of total GPU memory, 144 petaFLOPS of FP4 inference (sparse), and 72 petaFLOPS of FP8 training, paired with Intel Xeon host CPUs and ConnectX-8 networking. Eight GB300 NVL72 racks combine into a Blackwell Ultra DGX SuperPOD of 576 GPUs, 288 Grace CPUs, about 300 TB of total fast memory, and roughly 11.5 exaFLOPS of sparse FP4. [1][9][6]
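The SuperPOD totals can be cross-checked by scaling the per-rack GB300 NVL72 figures quoted earlier by eight. The sketch below is plain arithmetic with illustrative names, and it uses the rounded per-rack values from this article, so the results are approximate.

```python
# Scale the per-rack GB300 NVL72 figures from this article to an 8-rack SuperPOD.
RACKS = 8

RACK_GPUS = 72
RACK_GRACE_CPUS = 36
RACK_SPARSE_NVFP4_PFLOPS = 1440
RACK_HBM3E_TB = 20.7        # 72 GPUs x 288 GB, rounded
RACK_FAST_MEMORY_TB = 37    # total fast memory per rack, as quoted earlier

pod_gpus = RACKS * RACK_GPUS
pod_cpus = RACKS * RACK_GRACE_CPUS
pod_sparse_eflops = RACKS * RACK_SPARSE_NVFP4_PFLOPS / 1000
pod_hbm3e_tb = RACKS * RACK_HBM3E_TB
pod_fast_memory_tb = RACKS * RACK_FAST_MEMORY_TB

print(f"GPUs:              {pod_gpus}")
print(f"Grace CPUs:        {pod_cpus}")
print(f"Sparse NVFP4:      {pod_sparse_eflops:.2f} EFLOPS")
print(f"HBM3e only:        {pod_hbm3e_tb:.0f} TB")
print(f"Total fast memory: {pod_fast_memory_tb} TB")
```

The GPU and CPU counts and the roughly 11.5 exaFLOPS figure fall out directly, with the compute figure matching the sparse rather than the dense rack rating. The approximately 300 TB memory figure lines up with eight times the rack's total fast memory, not with HBM3e alone.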
NVIDIA guided that Blackwell Ultra systems would ship from partners in the second half of 2025, and the rollout broadly tracked that schedule. The company named a wide hardware ecosystem at launch, including server makers Cisco, Dell, HPE, Lenovo, and Supermicro alongside ASUS, Foxconn, GIGABYTE, Pegatron, QCT, Wistron, and Wiwynn, plus cloud providers AWS, Google Cloud, Microsoft Azure, Oracle Cloud Infrastructure, CoreWeave, Crusoe, Lambda, and Nebius. [1]
By the second half of 2025 the platform was appearing in benchmarks and live deployments. In the September 2025 round of MLPerf Inference, NVIDIA reported that the GB300 NVL72 set records across the tested workloads and delivered about a 45 percent increase in DeepSeek-R1 inference throughput over the GB200 NVL72. Cloud operators also began standing up very large GB300 clusters, with Microsoft Azure describing a production GB300 NVL72 deployment networking thousands of Blackwell Ultra GPUs into a single inference fabric. [10][7]
Blackwell Ultra marked NVIDIA's pivot to an annual product rhythm in which a mid-cycle "Ultra" refresh extends a microarchitecture before the next full generation arrives. Its specific bias toward memory capacity and attention throughput, rather than a wholesale compute redesign, reflected the rise of reasoning models and inference-time scaling, where serving cost is dominated by long output sequences and large key-value caches rather than by raw matrix-multiply peaks. By raising per-GPU HBM3e to 288 GB and dense FP4 to 15 petaFLOPS while keeping the rack-scale NVLink fabric of base Blackwell, NVIDIA aimed to lower the cost per token of these workloads and to bridge the gap to the Vera Rubin platform, the next-generation architecture slated to use HBM4 memory. [2][3][4]