The NVIDIA GB300 NVL72 is a rack-scale AI computing system built around the Blackwell Ultra GPU architecture. NVIDIA announced the platform on March 18, 2025 at its GTC conference, positioning the system as the company's primary answer to the explosive compute demands of reasoning AI, agentic workloads, and test-time scaling inference. The GB300 NVL72 packs 72 NVIDIA Blackwell Ultra GPUs and 36 NVIDIA Grace CPUs into a single liquid-cooled rack, forming one of the largest unified compute domains ever deployed in commercial AI infrastructure.
At full utilization, the system delivers over 1.1 exaFLOPS of dense FP4 compute, a figure that marks the first time a single commercial rack has crossed the exascale threshold. Against the Hopper-based H100 systems that dominated data centers from 2022 through 2024, NVIDIA claims 50x higher AI factory output and 65x more AI compute. Shipments began with Dell delivering the first production unit to CoreWeave in July 2025, with broad availability across major cloud providers through the second half of 2025.
NVIDIA's Blackwell architecture (the B100/B200 generation) launched in 2024 as the successor to Hopper. The original Blackwell GPUs introduced a dual-die design, FP4 tensor cores, fifth-generation NVLink, and a tight integration with the Grace CPU through a high-speed chip-to-chip interconnect. The GB200 NVL72, built around the standard B200 GPU, became the company's flagship rack-scale product in 2024 and drew commitments from every major hyperscaler.
Blackwell Ultra, designated the B300, is a refined version of the same core architecture on the same TSMC 4NP process. NVIDIA did not redesign the chip from scratch. Instead, the company made targeted changes to the components most relevant to inference and reasoning workloads: the attention-layer compute units, the memory stack configuration, and the FP4 tensor core throughput. The result is a GPU that costs more per unit and draws more power at peak (1,400 W, up from roughly 1,200 W for the B200 in GB200 NVL72 configurations), but delivers roughly 50% more useful throughput for the transformer-based inference workloads that now dominate data center demand.
Jensen Huang, NVIDIA's CEO, framed the timing at GTC 2025: "AI has made a giant leap. Reasoning and agentic AI demand orders of magnitude more computing performance." The GB300 NVL72 was designed specifically to serve that demand, with the Dynamo inference framework and NVIDIA NIM microservices providing the software counterpart.
Like its B200 predecessor, the Blackwell Ultra B300 is not a single monolithic die. Two reticle-limited GPU dies are connected through NVIDIA's NV-HBI (High-Bandwidth Interface), a custom die-to-die interconnect delivering 10 TB/s of internal bandwidth. The two dies operate as a single logical GPU, sharing a unified memory address space and appearing to software as one accelerator.
The combined die contains 208 billion transistors, identical to the standard Blackwell count, manufactured on TSMC's N4P process node. This is an optimized variant of TSMC's 5nm family, sometimes called 4NP, tuned for high-density logic and power efficiency in compute workloads. The 208 billion transistor count exceeds the 185 billion in AMD's competing Instinct MI355X by approximately 12%.
Each B300 GPU contains:

- 160 streaming multiprocessors with 20,480 CUDA cores
- 640 fifth-generation Tensor Cores
- 288 GB of HBM3E across eight 12-Hi stacks
- Fifth-generation NVLink and NVLink-C2C interfaces
The fifth-generation Tensor Cores in the B300 support FP8, FP6, and NVFP4 (four-bit floating point) precision. FP4 inference was the headline addition in the original Blackwell generation, but Blackwell Ultra doubled the attention-layer compute specifically to address a known bottleneck in transformer inference.
In a standard FP8 transformer forward pass, softmax computation in the attention layer consumes roughly the same number of cycles as the matrix multiplication (GEMM) operations. This creates a pipeline bottleneck that requires precise kernel scheduling to avoid performance loss. Blackwell Ultra adds 2x the MUFU (Multi-Function Unit) capacity dedicated to attention operations, providing a 2x speedup on attention compute and relaxing the kernel scheduling constraints that limited practical throughput on B200 systems.
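To see why the softmax exponentials can rival the GEMMs, consider a rough operation count for one attention head. The throughput figures in the sketch below are illustrative assumptions, not published B200/B300 numbers; only the structure of the comparison matters.

```python
# Rough, illustrative comparison of attention GEMM work vs. softmax
# exponentials for one head. Throughput ratios are assumed for
# illustration, not published NVIDIA figures.

seq_len = 8192        # context length (tokens)
head_dim = 128        # per-head dimension

# Matrix-multiply work per head: Q @ K^T and P @ V
gemm_flops = 2 * (2 * seq_len * seq_len * head_dim)

# Softmax work per head: one exponential per attention score
exp_ops = seq_len * seq_len

# Assumed relative throughputs (ops per cycle) for the tensor-core GEMM
# pipeline vs. the MUFU/SFU transcendental pipeline. The real ratio
# depends on precision and architecture.
GEMM_OPS_PER_CYCLE = 4096     # assumption
MUFU_EXP_PER_CYCLE = 16       # assumption

gemm_cycles = gemm_flops / GEMM_OPS_PER_CYCLE
exp_cycles = exp_ops / MUFU_EXP_PER_CYCLE

print(f"GEMM cycles:        {gemm_cycles:,.0f}")
print(f"softmax exp cycles: {exp_cycles:,.0f}")
print(f"exp/GEMM ratio:     {exp_cycles / gemm_cycles:.2f}")
# With these assumptions the exponentials cost a comparable number of
# cycles to the GEMMs, which is the bottleneck the doubled MUFU
# capacity in Blackwell Ultra targets.
```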
Dense NVFP4 throughput per GPU is 15 petaFLOPS, compared to 10 petaFLOPS on the B200 (a 50% increase). FP8 throughput is 7.5 petaFLOPS dense.
The most significant change in the B300 relative to the B200 is memory capacity. The B300 uses 12-Hi HBM3E stacks instead of the 8-Hi stacks in the B200. This allows 288 GB of HBM3E per GPU, compared to 192 GB on the B200, a 50% increase per chip.
The memory interface consists of sixteen 512-bit controllers (8,192-bit total bus width) with a peak bandwidth of 8 TB/s per GPU. This is the same bandwidth figure as the B200 because HBM3E at 12-Hi stacks does not inherently increase per-pin speed; the benefit is purely in capacity. For inference, the 288 GB capacity means very large models (300B+ parameter models) can fit entirely within a single GPU's memory without offloading, eliminating costly inter-GPU communication for certain serving configurations.
Across the 72-GPU GB300 NVL72 rack, total HBM3E memory is approximately 20 TB.
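A quick sanity check ties these per-GPU figures to the rack-level numbers quoted in the specification tables below, and illustrates the single-GPU model-fit claim. The 300-billion-parameter model is a hypothetical example, and the ~12% overhead for block scales and unquantized layers is an assumption, not a measured figure.

```python
# Aggregate per-GPU specs to GB300 NVL72 rack totals, and check the
# "300B+ parameter model fits in one GPU" claim.
GPUS_PER_RACK = 72
FP4_DENSE_PFLOPS = 15        # per B300 GPU, dense
FP8_DENSE_PFLOPS = 7.5
HBM_GB = 288
HBM_TBPS = 8

print(f"Rack FP4 dense: {GPUS_PER_RACK * FP4_DENSE_PFLOPS:,.0f} PFLOPS")  # 1,080
print(f"Rack FP8 dense: {GPUS_PER_RACK * FP8_DENSE_PFLOPS:,.0f} PFLOPS")  # 540
print(f"Rack HBM3E:     {GPUS_PER_RACK * HBM_GB / 1000:.1f} TB")          # ~20.7
print(f"Rack HBM BW:    {GPUS_PER_RACK * HBM_TBPS} TB/s")                 # 576

# Hypothetical 300B-parameter model quantized to NVFP4:
# ~0.5 bytes per weight plus an assumed ~12% overhead.
params = 300e9
weight_gb = params * 0.5 * 1.12 / 1e9
print(f"NVFP4 weights: ~{weight_gb:.0f} GB of {HBM_GB} GB, "
      f"leaving ~{HBM_GB - weight_gb:.0f} GB for KV cache and activations")
```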
Each B300 GPU connects to its paired Grace CPU through NVLink-C2C, a coherent chip-to-chip interconnect running at 900 GB/s. The CPU and GPU share a unified memory address space over this link, allowing CPU-side LPDDR5X memory to be addressed directly from GPU kernels without explicit data transfers.
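On Grace-paired systems, this coherence means a GPU kernel can dereference CPU-side allocations directly. A minimal sketch of the same single-address-space programming model, using generic CUDA managed memory through CuPy (an assumption: CuPy is installed; this is not Grace-specific):

```python
import cupy as cp

# Route CuPy allocations through CUDA managed memory so a single pointer
# is valid from both CPU and GPU code. On Grace-Blackwell systems the
# coherent NVLink-C2C link extends the same single-address-space model
# to ordinary CPU (LPDDR5X) allocations without explicit copies.
cp.cuda.set_allocator(cp.cuda.malloc_managed)

x = cp.arange(1 << 20, dtype=cp.float32)  # allocation in managed memory
x *= 2.0                                  # kernel runs on the GPU
print(float(x[:4].sum()))                 # host-side access to the result
```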
For host system connectivity in server configurations without a Grace CPU, the B300 provides a PCIe Gen 6 x16 interface delivering 256 GB/s bidirectional bandwidth, double that of PCIe Gen 5.
NVIDIA packages the B300 GPU with the Grace CPU in a component called the GB300 Grace Blackwell Ultra Superchip. Each GB300 superchip contains two B300 GPUs and one Grace CPU, connected through NVLink-C2C. This three-chip package is the building block of the GB300 NVL72 rack.
The Grace CPU in each superchip is based on Arm's Neoverse V2 architecture, the same core used in the GB200. Each Grace CPU contains 72 Arm Neoverse V2 cores running at a base frequency of 3.1 GHz. In the GB300 NVL72 rack, 36 superchips provide 36 Grace CPUs with a combined 2,592 Neoverse V2 cores.
Grace CPU memory in the NVL72 system totals approximately 17 TB of LPDDR5X ECC memory with up to 14 TB/s of bandwidth. The CPU memory is available to GPU kernels through the coherent NVLink-C2C interface, giving each pair of B300 GPUs access to a combined pool of CPU DRAM and GPU HBM3E.
The GB300 NVL72 is a pre-integrated 48U rack containing:

- 18 compute trays housing 36 GB300 Grace Blackwell Ultra Superchips (72 Blackwell Ultra GPUs and 36 Grace CPUs)
- 9 NVLink Switch trays forming the 130 TB/s NVLink fabric
- ConnectX-8 SuperNICs and BlueField-3 DPUs for scale-out networking
- Power shelves and liquid-cooling manifolds serving the entire rack
The rack is shipped as a complete, pre-validated unit by NVIDIA's manufacturing partners, including Dell, HPE, Lenovo, Supermicro, GIGABYTE, and others. Dell's PowerEdge XE9712 was the first system to ship, delivered to CoreWeave on July 3, 2025.
All 72 GPUs in the rack are connected through NVIDIA's fifth-generation NVLink Switch chips into a single non-blocking NVLink fabric. Each B300 GPU has fifth-generation NVLink with 1.8 TB/s total bandwidth (900 GB/s in each direction) across 18 links, each providing 100 GB/s of bidirectional bandwidth.
The NVSwitch chips inside the rack aggregate these links into a fabric that delivers 130 TB/s of total GPU-to-GPU bandwidth within the NVL72 domain (72 GPUs × 1.8 TB/s ≈ 130 TB/s). Every GPU can communicate directly with every other GPU at full NVLink speed without congestion. This topology makes the 72-GPU rack function as a single logical compute unit for large model inference and training jobs.
The NVLink Switch also supports NVIDIA SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) with FP8 precision, enabling in-network collective operations that reduce communication overhead for distributed training and inference.
For connecting multiple GB300 NVL72 racks or integrating with storage and front-end systems, each superchip node connects to a ConnectX-8 SuperNIC IO module. The ConnectX-8 provides 800 Gb/s of network bandwidth per GPU node, using either NVIDIA Quantum-X800 InfiniBand or NVIDIA Spectrum-X Ethernet fabric.
The ConnectX-8 SuperNIC integrates PCIe Gen 6 switching directly into the NIC, eliminating the need for a separate PCIe switch chip on the baseboard and reducing latency on the data path between GPU and network. Each ConnectX-8 IO module carries two ConnectX-8 devices, each supporting up to 800 Gb/s.
A BlueField-3 Data Processing Unit (DPU) handles multi-tenant networking isolation, security, and storage offload functions within the rack.
| B300 GPU specification | Value |
|---|---|
| Architecture | Blackwell Ultra |
| Process node | TSMC N4P |
| Transistors | 208 billion |
| Die configuration | Dual-die (NV-HBI, 10 TB/s) |
| Streaming Multiprocessors | 160 |
| CUDA cores | 20,480 |
| Tensor Cores | 640 (5th generation) |
| Precision support | FP4, FP6, FP8, FP16, BF16, TF32, FP64 |
| HBM3E capacity | 288 GB |
| HBM3E bandwidth | 8 TB/s |
| HBM3E stacks | 8 stacks of 12-Hi |
| Memory bus width | 8,192-bit |
| FP4 dense compute | 15 petaFLOPS |
| FP8 dense compute | 7.5 petaFLOPS |
| FP16/BF16 dense compute | 3.75 petaFLOPS |
| TDP | 1,400 W |
| NVLink bandwidth | 1.8 TB/s (bidirectional) |
| NVLink version | 5th generation |
| PCIe interface | Gen 6 x16 (256 GB/s bidirectional) |
| NVLink-C2C bandwidth | 900 GB/s |
| GB300 NVL72 system specification | Value |
|---|---|
| Total GPUs | 72 (Blackwell Ultra B300) |
| Total Grace CPUs | 36 |
| CPU cores | 2,592 (Arm Neoverse V2) |
| GPU memory (HBM3E) | ~20 TB |
| GPU memory bandwidth | ~576 TB/s |
| CPU memory (LPDDR5X) | ~17 TB |
| CPU memory bandwidth | ~14 TB/s |
| Total fast memory | ~37 TB |
| NVLink fabric bandwidth | 130 TB/s |
| Network bandwidth per GPU | 800 Gb/s (ConnectX-8) |
| FP4 peak compute | 1,080 PFLOPS dense (~1.1 exaFLOPS) |
| FP4 peak compute (with sparsity) | 1,440 PFLOPS |
| FP8 peak compute | 540 PFLOPS |
| FP16/BF16 peak compute | 270 PFLOPS |
| Rack power draw | ~120 kW (full load) |
| Cooling | 100% liquid-cooled |
| Form factor | 48U rack |
| Networking options | Quantum-X800 InfiniBand / Spectrum-X Ethernet |
| Metric | H100 SXM5 | H200 SXM5 | B200 SXM | B300 SXM |
|---|---|---|---|---|
| Architecture | Hopper | Hopper | Blackwell | Blackwell Ultra |
| Process node | TSMC 4N | TSMC 4N | TSMC N4P | TSMC N4P |
| Transistors | 80B | 80B | 208B | 208B |
| HBM capacity | 80 GB | 141 GB | 192 GB | 288 GB |
| HBM bandwidth | 3.35 TB/s | 4.8 TB/s | 8 TB/s | 8 TB/s |
| FP8 dense compute | 1.98 PFLOPS | 1.98 PFLOPS | 4.5 PFLOPS | 7.5 PFLOPS |
| FP4 dense compute | N/A | N/A | 9 PFLOPS | 15 PFLOPS |
| Attention-layer speedup | 1x (baseline) | 1x | 5x vs H100 | 2x vs B200 |
| TDP | 700 W | 700 W | 1,000 W | 1,400 W |
The GB300 NVL72 as a full rack system versus a comparable Hopper cluster:
| Workload metric | GB300 NVL72 vs H100 NVL8 equivalent |
|---|---|
| AI factory output (NVIDIA claim) | 50x higher |
| AI compute (FP4 vs FP8 equivalent basis) | 65x more |
| Throughput per megawatt | 5x higher |
| Video generation | 30x faster |
| Tokens per second per user | 10x improvement |
NVLink 5 is the interconnect generation shipping with both Blackwell and Blackwell Ultra GPUs. Each GPU has 18 NVLink links, each providing 100 GB/s of bidirectional bandwidth (50 GB/s per direction), for a total of 1.8 TB/s per GPU. Compared to NVLink 4 (used in Hopper), NVLink 5 doubles per-GPU bandwidth.
Within the NVL72 rack, the NVLink Switch chips aggregate all GPU links. The NVL72 domain runs at 130 TB/s total GPU-to-GPU bandwidth, non-blocking. NVLink Switch also includes SHARP support for in-network reductions, which reduces the amount of data that must traverse the fabric during collective operations like all-reduce, commonly used in distributed training.
For scale-out networking beyond a single rack, the ConnectX-8 SuperNIC provides 800 Gb/s per GPU node. This is double the networking bandwidth available on GB200 NVL72 systems, which used ConnectX-7 at 400 Gb/s. The ConnectX-8 supports both NVIDIA Quantum-X800 InfiniBand (XDR) and NVIDIA Spectrum-X Ethernet fabrics at up to 800 Gb/s.
At scale, NVIDIA's reference architecture for multi-rack GB300 NVL72 deployments uses a two-tier spine-leaf topology with dedicated InfiniBand or Ethernet fabrics connecting racks. Microsoft Azure's first at-scale deployment connected more than 4,600 Blackwell Ultra GPUs across interconnected GB300 NVL72 racks through next-generation InfiniBand for OpenAI workloads.
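As a rough sizing illustration, the sketch below maps a cluster-level GPU count onto NVL72 racks; the 4,608-GPU figure is an assumed round number consistent with "more than 4,600" GPUs, not a confirmed count.

```python
import math

# How many NVL72 racks a given GPU count implies, and the resulting
# aggregate HBM across the cluster.
gpus = 4608                    # assumption: round number above 4,600
gpus_per_rack = 72
racks = math.ceil(gpus / gpus_per_rack)
print(f"{gpus} GPUs -> {racks} GB300 NVL72 racks")            # 64 racks
print(f"HBM3E across the cluster: ~{gpus * 288 / 1e3:.0f} TB")
```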
The GB300 NVL72 requires direct liquid cooling. Air cooling is not sufficient for a rack drawing approximately 120 kW at full load, and NVIDIA designed the system from the ground up for liquid cooling. The primary components (GPUs, CPUs, NVSwitch chips) are all liquid-cooled through direct-contact cold plates. Peripheral components including OSFP transceiver modules, storage drives, and power distribution boards are air-cooled within the rack enclosure.
Each rack requires connection to a facility chilled water loop through a Coolant Distribution Unit (CDU). Target supply water temperatures are between 30°C and 40°C. A single GB300 NVL72 rack generates approximately 409,000 BTU/hr of heat at 120 kW load.
The cooling system bill of materials for one NVL72 rack, according to industry component pricing, totals approximately $49,860. This includes cooling hardware across all compute trays (roughly $40,680 worth) and NVSwitch trays (roughly $9,180 worth). This cooling hardware cost is separate from the rack system price itself.
Data centers deploying GB300 NVL72 racks at scale need to plan power distribution at the row level, not per-rack. A single row of 10 GB300 NVL72 racks draws over one megawatt. Most enterprise data centers built before 2022 require significant infrastructure upgrades before deploying this generation of hardware.
NVIDIA has cited the liquid cooling architecture as a significant efficiency improvement. The company claims GB200 and GB300 NVL72 liquid-cooled systems achieve over 300x greater water efficiency compared to traditional air-cooled data centers running H100s at equivalent output, primarily because liquid cooling allows much higher heat densities and requires less evaporative cooling in the overall facility.
The B300 GPU is fully CUDA-compatible. Existing CUDA code targeting Blackwell or Hopper runs without modification on Blackwell Ultra. NVIDIA assigned compute capability 10.0 (sm_100) to the original Blackwell generation and 10.3 (sm_103) to Blackwell Ultra; the same PTX instruction set supports both, with updated libraries (cuDNN, cuBLAS, NCCL) auto-tuned for the B300's doubled attention throughput and NVFP4 compute.
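A quick way to confirm which Blackwell variant a deployment is running on is to query the compute capability at runtime; the (10, 3) value for Blackwell Ultra follows the mapping discussed above and should be treated as an expectation rather than a guarantee.

```python
import torch

# Query the compute capability of each visible GPU. Original Blackwell
# reports (10, 0); Blackwell Ultra (B300) is expected to report (10, 3),
# per the compute-capability mapping discussed above.
for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    name = torch.cuda.get_device_name(i)
    print(f"GPU {i}: {name}, compute capability {major}.{minor}")
```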
NVIDIA Dynamo is an open-source inference framework released alongside the GB300 NVL72. It was designed specifically for distributed, disaggregated inference serving on large GPU clusters. Dynamo splits the two phases of LLM inference, prefill (context processing) and decode (token generation), across separate pools of GPUs. This disaggregated serving approach allows each phase to be independently scaled and optimized.
On the GB300 NVL72, Dynamo with disaggregated serving delivers approximately 1.5x higher throughput per GPU versus traditional in-flight batching approaches. For Mixture-of-Experts models like DeepSeek-R1, the combination of GB300 NVL72 hardware and Dynamo delivers up to 50x higher throughput than Hopper-based systems with earlier software.
Dynamo supports major LLM serving frameworks as backends including NVIDIA TensorRT-LLM, vLLM, and SGLang. NIM (NVIDIA Inference Microservices) integrates Dynamo capabilities to provide a containerized, optimized deployment option.
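The core idea of disaggregated serving can be sketched in a few lines. The classes and names below are purely illustrative and are not Dynamo's actual API; they only show how prefill and decode become separately scalable pools joined by a KV-cache handoff.

```python
from dataclasses import dataclass, field

# Conceptual sketch of disaggregated LLM serving: prefill (context
# processing) and decode (token generation) run on separate GPU pools,
# connected by a KV-cache handoff. Illustrative only -- not Dynamo's API.

@dataclass
class Request:
    prompt_tokens: list[int]
    max_new_tokens: int
    kv_cache_handle: str | None = None   # reference to KV blocks, not the data
    output_tokens: list[int] = field(default_factory=list)

class PrefillWorker:
    """Compute-bound phase: process the whole prompt once, emit KV cache."""
    def run(self, req: Request) -> Request:
        # ... run the prompt through the model on a prefill GPU ...
        req.kv_cache_handle = f"kv://prefill-pool/{id(req)}"
        return req

class DecodeWorker:
    """Memory-bandwidth-bound phase: generate tokens one at a time."""
    def run(self, req: Request) -> Request:
        assert req.kv_cache_handle is not None
        # ... attach to the transferred KV cache and decode ...
        req.output_tokens = [0] * req.max_new_tokens   # placeholder tokens
        return req

class Router:
    """Sends each request through the prefill pool, then the decode pool,
    so each pool can be sized and batched for its own bottleneck."""
    def __init__(self, prefill: PrefillWorker, decode: DecodeWorker):
        self.prefill, self.decode = prefill, decode

    def serve(self, req: Request) -> Request:
        return self.decode.run(self.prefill.run(req))

result = Router(PrefillWorker(), DecodeWorker()).serve(
    Request(prompt_tokens=[1, 2, 3], max_new_tokens=8))
print(len(result.output_tokens), result.kv_cache_handle)
```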
NVIDIA TensorRT-LLM is the primary compiler and runtime for optimized LLM inference on NVIDIA hardware. For the B300, TensorRT-LLM includes optimized kernels for NVFP4 quantization, FP8 key-value cache compression, and the new attention MUFU units. Models quantized to NVFP4 via TensorRT-LLM's quantization toolkit run at the full 15 petaFLOPS rated throughput of the B300.
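To make the NVFP4 format concrete, the sketch below simulates its block structure in NumPy: 4-bit E2M1 values in 16-element micro-blocks, each block carrying its own higher-precision scale. This is an illustrative simulation of the numerics, not the TensorRT-LLM implementation, and a float32 scale stands in for the FP8 block scale used by the real format.

```python
import numpy as np

# Representable magnitudes of a 4-bit E2M1 (FP4) value.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
BLOCK = 16  # NVFP4 uses 16-element micro-blocks, each with its own scale

def quantize_nvfp4_sim(x: np.ndarray) -> np.ndarray:
    """Simulated NVFP4 quantize->dequantize round trip."""
    x = x.reshape(-1, BLOCK)
    scale = np.abs(x).max(axis=1, keepdims=True) / FP4_GRID[-1]
    scale = np.where(scale == 0, 1.0, scale)
    scaled = x / scale
    # Round each scaled value to the nearest representable FP4 magnitude.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    deq = np.sign(scaled) * FP4_GRID[idx] * scale
    return deq.reshape(-1)

w = np.random.randn(1024).astype(np.float32)
w_q = quantize_nvfp4_sim(w)
err = np.abs(w - w_q).mean() / np.abs(w).mean()
print(f"mean relative quantization error: {err:.3f}")
```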
NVIDIA's NCCL (NVIDIA Collective Communications Library) handles multi-GPU and multi-node communication. On the GB300 NVL72, NCCL operations within the 72-GPU NVLink domain run over the 130 TB/s NVLink fabric. Cross-rack NCCL operations use the ConnectX-8 network at 800 Gb/s per node. NCCL is aware of the NVLink topology and routes intra-rack collectives through the NVSwitch fabric rather than the network interface.
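A minimal example of the kind of collective that runs over this fabric, using PyTorch's distributed API with the NCCL backend (launched with torchrun); NCCL itself then chooses NVLink/NVSwitch paths inside the rack and the NIC across racks.

```python
# Run with: torchrun --nproc_per_node=<gpus> allreduce_demo.py
import os
import torch
import torch.distributed as dist

def main():
    # NCCL backend: intra-rack traffic goes over NVLink/NVSwitch,
    # inter-rack traffic over the network (InfiniBand or Ethernet).
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    # Each rank contributes a tensor; all_reduce sums them in place.
    x = torch.full((1024,), float(rank + 1), device="cuda")
    dist.all_reduce(x, op=dist.ReduceOp.SUM)

    if rank == 0:
        world = dist.get_world_size()
        expected = world * (world + 1) / 2
        print(f"all_reduce result per element: {x[0].item()} "
              f"(expected {expected})")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```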
The primary design goal of the GB300 NVL72 is reasoning AI inference. Models like OpenAI's o1/o3, DeepSeek-R1, and similar "chain-of-thought" reasoning systems generate many more tokens per query than conventional LLMs. While a standard chat completion might generate 200-500 output tokens, a reasoning model solving a complex problem may generate tens of thousands of tokens across internal reasoning steps. NVIDIA estimates that reasoning variants demand approximately 20x more tokens per query than standard models, and up to 150x more compute than traditional one-shot inference.
This token explosion makes memory capacity and memory bandwidth the primary bottlenecks, not raw FLOPS. The B300's 288 GB of HBM3E addresses both: more capacity means larger KV caches for longer contexts, and the doubled attention throughput means the attention computation itself does not bottleneck the decode phase.
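The KV-cache pressure from long reasoning traces is easy to quantify. The model dimensions below are hypothetical (roughly a 70B-class model with grouped-query attention), chosen only to show how cache size scales with generated tokens.

```python
# KV cache size per request:
#   2 (K and V) x layers x kv_heads x head_dim x bytes_per_element x tokens
# Model dimensions are hypothetical, roughly 70B-class with GQA.
layers, kv_heads, head_dim = 80, 8, 128
bytes_per_elem = 1          # FP8 KV cache (use 2 for FP16)

def kv_cache_gb(tokens: int) -> float:
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens / 1e9

for tokens in (500, 32_000, 100_000):
    print(f"{tokens:>7} tokens -> {kv_cache_gb(tokens):6.2f} GB per request")
# A short chat answer needs a fraction of a GB; a long chain-of-thought
# trace needs several GB, so hundreds of concurrent reasoning requests
# quickly consume the 288 GB of HBM3E on a single B300.
```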
Test-time scaling, the practice of spending more compute during inference to improve answer quality, can demand up to 100x more compute than traditional inference. The GB300 NVL72's exaFLOP-scale compute capacity is intended to make test-time scaling economically viable at production scale.
The GB300 NVL72 is also used for training and fine-tuning large language models. The 130 TB/s NVLink fabric allows large model parallelism within a single rack. For trillion-parameter model training, multiple racks are connected via InfiniBand or Ethernet.
In MLPerf Training v5.1 benchmarks, Lambda's GB300 NVL72 cluster outperformed GB200 NVL72 systems by 27% on training throughput metrics. The increased per-GPU compute and memory capacity allows larger batch sizes and reduces the communication overhead relative to compute time.
Agentic AI systems use LLMs to plan and execute multi-step tasks, often calling external tools, running code, browsing web content, or invoking specialized models. These workloads require both high throughput (to handle many concurrent agent instances) and low latency (to minimize response time per step). The GB300 NVL72's combination of per-GPU memory capacity, FP4 throughput, and attention acceleration serves both requirements simultaneously.
NVIDIA also targets physical AI applications including robotics simulation and synthetic data generation. For video generation using diffusion models, NVIDIA claims 30x faster generation on GB300 NVL72 systems compared to Hopper. This enables synthetic dataset generation for autonomous vehicle training and robotic manipulation at scales that were previously impractical.
The GB300 NVL72 made its official benchmark debut in MLPerf Inference v5.1 in September 2025, with submissions covering DeepSeek-R1 671B in the offline scenario and Llama 3.1 405B in the server and interactive scenarios.
The submissions employed NVFP4 quantization for the majority of model weights (in the DeepSeek-R1 case), FP8 key-value cache compression, and disaggregated prefill serving. These techniques collectively contributed about a 1.5x throughput improvement on top of the hardware gains.
In MLPerf Training v5.1, NVIDIA swept all seven benchmarks with Blackwell and Blackwell Ultra systems; the GB200 NVL72 achieved a record 10-minute training time for Llama 3.1 405B.
System integrators announced GB300 NVL72 products at or shortly after GTC 2025, including Dell (PowerEdge XE9712), HPE, Lenovo, Supermicro, and GIGABYTE.
All major hyperscalers committed to deploying GB300 NVL72 systems:
Amazon Web Services: AWS announced Amazon EC2 P6e-GB300 UltraServers, which became generally available in December 2025 at AWS re:Invent. The P6e-GB300 instances provide 1.5x more GPU memory and 1.5x more FP4 compute compared to the prior P6e-GB200 instances. AWS also deployed HGX B300-based EC2 P6-B300 instances for single-node workloads in November 2025.
Microsoft Azure: Azure deployed the first large-scale GB300 NVL72 cluster in production, with more than 4,600 Blackwell Ultra GPUs in interconnected GB300 NVL72 racks using NVIDIA InfiniBand, running OpenAI workloads. Azure's GB300 NVL72 instances are available as ND GB300 v6 VMs.
Google Cloud: Announced as an early GB300 NVL72 customer at GTC 2025.
Oracle Cloud Infrastructure: Committed to GB300 NVL72 deployment as part of the GTC 2025 announcements.
CoreWeave: The first cloud provider to bring GB300 NVL72 instances into production. CoreWeave made GB300 NVL72-powered instances generally available on August 19, 2025, initially in the US-WEST-01A availability zone. Dell delivered CoreWeave's first production unit on July 3, 2025.
Additional GPU cloud providers committed at GTC 2025 include Lambda, Nebius, Nscale, Crusoe, Yotta, and YTL.
NVIDIA has not published official list prices for the GB300 NVL72. Industry estimates based on ODM quotes and supply chain analysis suggest the full rack system costs approximately $6 million to $6.5 million in AI inference-optimized configurations, though pricing varies by vendor and configuration.
For comparison, the GB200 NVL72 was estimated at approximately $3 million per rack when it launched, suggesting the GB300 NVL72 carries a roughly 2x price premium for 1.5x performance improvement per GPU. The price per token of inference, accounting for the system cost amortized against throughput, is nonetheless favorable because the GB300 NVL72 delivers significantly more tokens per rack per unit of power.
Initial shipments began in July 2025 (Dell to CoreWeave). Volume production ramp was expected in Q3-Q4 2025. AWS P6e-GB300 instances became generally available in December 2025. The broader enterprise market and smaller cloud providers were expected to gain access through 2026.
The HGX B300 NVL16 (an x86-host server board without Grace CPUs that links 16 Blackwell Ultra GPUs over NVLink) is available as a lower-cost entry point to Blackwell Ultra, suitable for workloads that do not require the full rack-scale NVLink domain.
| Specification | NVIDIA B300 SXM | AMD Instinct MI355X |
|---|---|---|
| Architecture | Blackwell Ultra | CDNA 4 |
| Process node | TSMC N4P | TSMC N3P (compute), N6 (IO) |
| Transistors | 208 billion | 185 billion |
| HBM capacity | 288 GB HBM3E | 288 GB HBM3E |
| Memory bandwidth | 8 TB/s | 8 TB/s |
| FP4 dense compute | 15 PFLOPS | 20 PFLOPS |
| FP8 dense compute | 7.5 PFLOPS | 10 PFLOPS |
| FP16/BF16 dense compute | 3.75 PFLOPS | 5 PFLOPS |
| TDP | 1,400 W | 1,400 W |
| NVLink/xGMI | NVLink 5 (1.8 TB/s) | xGMI (no equivalent scale-up fabric) |
| Rack-scale domain | 72 GPUs / 130 TB/s | No equivalent |
| Software ecosystem | CUDA, TensorRT, Dynamo | ROCm, MIOpen |
The AMD MI355X has higher peak FP4 FLOPS on paper (20 vs 15 PFLOPS dense). However, real-world inference benchmarks tell a different story. SemiAnalysis's InferenceX v2 analysis found that NVIDIA's B300 and GB300 NVL72 dominate across most inference scenarios when advanced techniques like disaggregated prefill, wide expert parallelism, and FP4 precision are used together. The MI355X performs well when these optimizations are applied individually but underperforms when combined, due to gaps in the ROCm software stack's kernel and collective optimizations.
In MLPerf Inference v5.1, the MI355X matched or slightly beat the B200 (not B300) on several single-node LLM benchmarks. AMD stated at ISSCC 2026 that the MI355X matches GB200 performance despite lower compute unit count through per-CU throughput improvements.
The GB300 NVL72's primary competitive advantage is not per-GPU FLOPS but rather the 130 TB/s NVLink fabric connecting all 72 GPUs. No AMD product offers an equivalent scale-up domain. For very large model inference (400B+ parameters across a whole rack) and for agentic workloads requiring rapid KV cache sharing, the unified NVLink domain provides practical throughput advantages that peak FLOPS comparisons do not capture.
On cost-per-token for standard FP8 inference at high concurrency, AMD claims competitive cost economics for the MI355X versus the GB300 NVL72 at throughput-optimized operating points.
NVIDIA announced the Vera Rubin platform as the successor to Blackwell Ultra at GTC 2025, with additional details at CES 2026. The Vera Rubin NVL72 uses the R100 GPU (also called the Rubin GPU) and the Vera CPU, and follows the same rack-scale architecture as the GB300 NVL72.
Key announced improvements in Vera Rubin include a move to HBM4 memory on the Rubin GPU, the custom Arm-based Vera CPU, a sixth-generation NVLink fabric, and next-generation ConnectX networking, with NVIDIA projecting a substantial generational increase in FP4 inference performance over the GB300 NVL72.
Jensen Huang confirmed at CES 2026 that the Vera Rubin NVL72 was in production, with delivery expected in the second half of 2026 to the same initial customers (AWS, Google Cloud, Microsoft Azure, OCI, CoreWeave, Lambda, Nebius, Nscale).
NVIDIA has also outlined a longer roadmap including an NVL144 rack (doubling the GPU count per rack) and a "Vera Rubin Ultra" generation, though detailed specifications for these future products had not been disclosed as of early 2026.
The GB300 NVL72 represents a significant step up in data center infrastructure requirements compared to prior GPU generations. Key infrastructure considerations:
Power density: Each rack draws approximately 120 kW at full load (each B300 GPU at 1,400 W, plus Grace CPUs and switching equipment). This is roughly double the power draw of a comparable H100 NVL8 cluster delivering equivalent model throughput.
Liquid cooling infrastructure: 100% of GPU and CPU heat must be removed by liquid. Data centers must provide a chilled water supply loop with sufficient flow rate and heat rejection capacity. Many enterprise data centers built before 2022 lack the piping infrastructure for direct liquid cooling and require capital investment before deploying GB300 NVL72 racks.
Power distribution: A row of 10 GB300 NVL72 racks exceeds 1.2 MW. Power feeds must be redundant and capacity-planned at the row level. Floor loading must accommodate rack weights that include heavy liquid cooling equipment.
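A rough power budget shows how the per-rack and per-row figures arise. The non-GPU allocation below is an assumption used to bridge the GPU total to the ~120 kW rack figure, not a published breakdown.

```python
# Rough GB300 NVL72 power budget. The GPU figure follows from the 1,400 W
# TDP; the remainder (Grace CPUs, NVSwitch trays, NICs, fans, conversion
# losses) is an assumed allocation to reach the ~120 kW rack figure.
gpu_kw = 72 * 1.4                 # 100.8 kW of GPU TDP
other_kw = 120 - gpu_kw           # ~19 kW assumed for everything else
rack_kw = gpu_kw + other_kw

racks_per_row = 10
print(f"GPUs: {gpu_kw:.1f} kW, other: {other_kw:.1f} kW, rack: {rack_kw:.0f} kW")
print(f"Row of {racks_per_row} racks: {racks_per_row * rack_kw / 1000:.1f} MW")
```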
Power smoothing: NVIDIA has incorporated energy storage and power management features in the GB300 NVL72 design to reduce peak demand transients when GPU workloads ramp up simultaneously, a persistent challenge with dense GPU installations.
Facility build time: Analysts at Eliovp estimated a four-month data center construction and infrastructure preparation period as the minimum lead time for a new facility capable of hosting GB300 NVL72 racks, longer than the supply lead time for the hardware itself.