NVIDIA GB200 NVL72
Last reviewed
May 17, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,524 words
The NVIDIA GB200 NVL72 is a rack-scale AI computing system that packages 72 Blackwell GPUs and 36 Grace CPUs into a single liquid-cooled rack, joined into one NVLink domain that the software stack treats as a single massive accelerator. NVIDIA announced the platform at GTC 2024 in March 2024 and positioned it as the company's flagship answer to the compute demand created by trillion-parameter large language models and reasoning workloads. The first production racks were delivered to hyperscale customers in late 2024 and the platform ramped to volume through 2025.
The central design move is the unified NVLink domain across all 72 GPUs. Earlier generations of NVIDIA HGX and DGX servers connected eight GPUs over NVSwitch at high bandwidth and then dropped to slower InfiniBand or Ethernet for everything beyond the box. The NVL72 keeps NVLink continuous across the whole rack at 1.8 TB/s per GPU, which lets a model with weights and activations far larger than any single GPU's HBM3e memory run as if it were on one device. NVIDIA cites a peak of 1.4 exaFLOPS of FP4 inference compute and 720 petaFLOPS of FP8 training compute per rack, with 13.5 TB of unified GPU memory and 130 TB/s of aggregate NVLink bandwidth.
The NVL72 became the most consequential single product in data center infrastructure during 2024 and 2025. Every major hyperscaler placed multi-billion-dollar orders. Microsoft, Meta, Oracle, AWS, Google, and CoreWeave all committed to large GB200 fleets, with the system serving as the substrate for OpenAI inference, Llama training, and the next wave of frontier model training. A typical configured rack ships for roughly USD 3.0 to 3.4 million.
| Field | Value |
|---|---|
| Type | Liquid-cooled rack-scale AI computing system |
| GPU architecture | Blackwell (B200) |
| GPUs per rack | 72 Blackwell GPUs |
| CPUs per rack | 36 Grace CPUs (Arm Neoverse V2) |
| Superchip building block | GB200 Grace Blackwell Superchip (1 Grace + 2 Blackwell) |
| Compute trays | 18 (1U each), 2 Grace + 4 Blackwell per tray |
| NVLink switch trays | 9 (2 NVSwitch chips per tray) |
| GPU memory | 13.5 TB HBM3e unified |
| CPU memory | 17.3 TB LPDDR5X |
| NVLink bandwidth per GPU | 1.8 TB/s bidirectional (fifth-generation) |
| Total NVLink fabric bandwidth | 130 TB/s |
| FP4 inference (dense) | 1.4 exaFLOPS |
| FP8 training (dense) | 720 petaFLOPS |
| Power draw | Up to 120 kW |
| Weight | ~1.36 metric tons (~3,000 lb) |
| Cooling | 100% direct liquid cooling, no fans in rack |
| External networking | NVIDIA Quantum-X800 InfiniBand or Spectrum-X Ethernet, ConnectX-7 or ConnectX-8 |
| Announced | March 18, 2024 (GTC 2024) |
| First customer shipments | Late 2024 |
| Volume ramp | Q2 to Q3 2025 |
| Estimated price per rack | ~USD 3.0 to 3.4 million (NVIDIA does not publish list prices) |
| Successor | NVIDIA GB300 NVL72 |
NVIDIA introduced the Blackwell architecture at GTC 2024 in March 2024 as the successor to Hopper. Each Blackwell GPU is a multi-die package: two reticle-limited dies on TSMC's custom 4NP process joined by NV-HBI, a 10 TB/s die-to-die interconnect, and presented as one logical GPU. The combined package contains 208 billion transistors and ships with 192 GB of HBM3e memory at 8 TB/s. Blackwell adds native FP4 tensor support to the Hopper-era FP8 Transformer Engine, roughly doubling raw throughput for inference workloads that tolerate four-bit precision.
The GB200 Grace Blackwell Superchip is the building block. It pairs a single Grace CPU with two Blackwell B200 GPUs using NVIDIA's NVLink-C2C cache-coherent chip-to-chip interconnect at 900 GB/s. The Grace CPU contributes 72 Arm Neoverse V2 cores and 480 GB of LPDDR5X memory, serving as the host processor for orchestration and data preprocessing. Each superchip carries 384 GB of HBM3e total across its two GPUs.
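The per-superchip figures above can be scaled to a full rack with simple arithmetic. The sketch below is only an illustration: the constants come from this article, and the function name `rack_totals` is made up for the example.

```python
# Per-superchip figures as quoted in the article.
GPUS_PER_SUPERCHIP = 2
HBM_PER_GPU_GB = 192
GRACE_CORES = 72
LPDDR_PER_GRACE_GB = 480
SUPERCHIPS_PER_RACK = 36  # one Grace CPU per superchip, 36 Grace CPUs per rack


def rack_totals(superchips: int = SUPERCHIPS_PER_RACK) -> dict:
    """Scale one superchip's resources up to a whole rack."""
    return {
        "gpus": superchips * GPUS_PER_SUPERCHIP,
        "cpu_cores": superchips * GRACE_CORES,
        "hbm_gb": superchips * GPUS_PER_SUPERCHIP * HBM_PER_GPU_GB,
        "lpddr_gb": superchips * LPDDR_PER_GRACE_GB,
    }


totals = rack_totals()
print(totals)
```

The LPDDR5X total (17,280 GB) matches the 17.3 TB quoted for the rack. The raw HBM3e product (13,824 GB) comes out slightly above the quoted 13.5 TB; the article does not explain the gap, so reading it as raw versus usable capacity would only be a guess.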
NVIDIA's earlier rack-scale efforts built up to this point in increments. The GH200 NVL32, introduced in 2023 on the Hopper generation, connected 32 Grace Hopper superchips into a single NVLink domain through external switch boxes. The GB200 NVL72 collapses the whole stack into one 48U cabinet by integrating the switch trays into the same rack as the compute, pushing the entire domain onto a passive copper backplane. Jensen Huang framed it at GTC 2024 with the line "This is one GPU," gesturing at the rack on stage. From the software's perspective, the NVL72 presents a single unified address space and a flat NVLink topology that hides the per-package boundaries.
The rack itself is a 48U cabinet that weighs roughly 1.36 metric tons fully populated. Power draw at peak load reaches 120 kW, which is roughly five to seven times the density of a conventional GPU rack and well beyond what most data center floors were designed to support in 2024. The mechanical design is the result of close collaboration with the Open Compute Project, and NVIDIA contributed the reference rack design to OCP in October 2024.
A fully populated rack contains:
| Component | Count | Notes |
|---|---|---|
| Compute trays | 18 | 1U, 2 Grace CPUs + 4 Blackwell GPUs each |
| NVLink switch trays | 9 | 2 fifth-generation NVSwitch chips each |
| Power shelves | 4 to 8 | Top and bottom mounted |
| NVLink copper backplane | 1 | ~5,000 passive copper cables |
| Cooling distribution unit | 1 | Liquid manifold, top of rack |
The compute tray houses two GB200 superchips in 1U. NVIDIA solders the LPDDR5X directly to the motherboard and routes cooling through cold plates on the GPU and CPU packages. There are no fans in compute trays. Data signaling exits through blind-mate connectors on the rear. The switch tray holds two fifth-generation NVSwitch chips, each delivering 7.2 TB/s across 144 NVLink ports at 50 GB/s.
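As a consistency check, the tray counts and per-chip figures can be multiplied out to recover the rack-level totals. This is a minimal sketch using only numbers stated in this article; the variable names are illustrative.

```python
# Tray-level figures as stated in the article.
COMPUTE_TRAYS = 18
SWITCH_TRAYS = 9
SUPERCHIPS_PER_TRAY = 2
GPUS_PER_SUPERCHIP = 2
CPUS_PER_SUPERCHIP = 1
NVSWITCH_CHIPS_PER_TRAY = 2
PORTS_PER_CHIP = 144
PORT_GB_PER_S = 50

gpus = COMPUTE_TRAYS * SUPERCHIPS_PER_TRAY * GPUS_PER_SUPERCHIP
cpus = COMPUTE_TRAYS * SUPERCHIPS_PER_TRAY * CPUS_PER_SUPERCHIP
switch_chips = SWITCH_TRAYS * NVSWITCH_CHIPS_PER_TRAY

# Bandwidth of one NVSwitch chip and of all switch chips combined, in GB/s.
chip_gb_per_s = PORTS_PER_CHIP * PORT_GB_PER_S
fabric_gb_per_s = switch_chips * chip_gb_per_s

total_trays = COMPUTE_TRAYS + SWITCH_TRAYS

print(f"{gpus} GPUs, {cpus} CPUs, {switch_chips} NVSwitch chips, {total_trays} trays")
print(f"per chip: {chip_gb_per_s / 1000} TB/s, fabric: {fabric_gb_per_s / 1000} TB/s")
```

The fabric total of 129.6 TB/s is the figure that the article rounds to 130 TB/s.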
The NVLink copper backplane is the headline mechanical element. NVIDIA chose passive copper over active optics: roughly 5,000 copper twinax cables routed through a custom cartridge backplane behind all 27 trays, carrying 130 TB/s of bandwidth with no retiming or optical conversion in the path. By NVIDIA's account, an optical alternative would have drawn roughly 20 kW more for lasers and SerDes alone. The downside is assembly complexity: each cable is hand-routed by ODM technicians.
The NVL72 is 100% direct liquid cooled with no air path inside the cabinet. Coolant enters at the top, runs through manifolds to cold plates on every GPU, CPU, NVSwitch, and voltage regulator, and exits at the bottom. Typical supply temperatures are 32 to 45 degrees Celsius, warm enough that many deployments use chiller-less or evaporative-only outdoor cooling rather than mechanical chilling, dramatically improving facility-level Power Usage Effectiveness.
Liquid cooling is not optional. Each Blackwell GPU draws up to 1,200 W and each Grace CPU another 300 W; the combined heat flux through a 1U tray with 4 GPUs and 2 CPUs would not move with air. Facilities need rear-door heat exchangers or direct facility water connections, redundant pumps, leak detection, and chilled water at the proper supply temperature. Most data centers built before 2023 must be retrofitted, one of the gating factors on deployment pace.
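The heat load per tray follows directly from the per-chip power figures. The sketch below uses only the wattages and counts given in this article; attributing the remaining budget to switches, NICs, and power conversion is an assumption, not something the article breaks down.

```python
# Peak power figures as quoted in the article.
GPU_WATTS = 1200
CPU_WATTS = 300
GPUS_PER_TRAY = 4
CPUS_PER_TRAY = 2
COMPUTE_TRAYS = 18
RACK_LIMIT_WATTS = 120_000

# Peak GPU + CPU power inside a single 1U compute tray.
tray_watts = GPUS_PER_TRAY * GPU_WATTS + CPUS_PER_TRAY * CPU_WATTS

# All 72 GPUs and 36 CPUs together.
rack_compute_watts = COMPUTE_TRAYS * tray_watts

# ASSUMPTION: whatever is left of the 120 kW covers NVSwitch trays,
# NICs, and power-conversion losses; the article gives no breakdown.
remaining_watts = RACK_LIMIT_WATTS - rack_compute_watts

print(f"per tray: {tray_watts / 1000} kW")
print(f"GPUs + CPUs: {rack_compute_watts / 1000} kW, remaining: {remaining_watts / 1000} kW")
```

At 5.4 kW in a single rack unit, one compute tray dissipates as much as several conventional air-cooled servers, which is the practical reason air cooling is ruled out.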
The defining feature of the NVL72 is the unified NVLink domain spanning 72 GPUs. Large mixture-of-experts models can have hundreds of gigabytes of weights, and serving them at low latency requires either fitting the whole model on one accelerator (impossible above ~192 GB) or sharding across many accelerators with constant cross-device traffic for expert routing and KV cache lookups. In a conventional eight-GPU server, that cross-device traffic is fast within the box but slow across boxes. In an NVL72, all 72 GPUs sit on the same NVLink fabric at 1.8 TB/s per-GPU bandwidth, so a sharded model behaves much more like a single-device deployment.
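To make the sharding argument concrete, the sketch below estimates the minimum number of 192 GB GPUs needed just to hold a model's weights. It is a lower bound under stated assumptions: it ignores KV cache, activations, and framework overhead, and the helper name `min_gpus_for_weights` is hypothetical. Llama 3.1 405B is used as the example because the article cites it as a benchmark.

```python
import math

HBM_PER_GPU_GB = 192  # per-GPU HBM3e capacity quoted in the article


def min_gpus_for_weights(params_billion: float, bytes_per_param: float,
                         hbm_gb: float = HBM_PER_GPU_GB) -> int:
    """Lower bound on GPU count: weights only, no KV cache or activations."""
    weights_gb = params_billion * bytes_per_param  # 1e9 params * bytes / 1e9
    return math.ceil(weights_gb / hbm_gb)


# Llama 3.1 405B at three weight precisions.
for name, width in [("BF16", 2.0), ("FP8", 1.0), ("FP4", 0.5)]:
    print(name, min_gpus_for_weights(405, width))
```

Even at FP4 the weights alone exceed one GPU, so some cross-device traffic is unavoidable. What the NVL72 changes is that every shard sits on the same NVLink fabric.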
The practical numbers:
| Metric | Value |
|---|---|
| NVLink generation | 5 |
| Per-GPU NVLink bandwidth | 1.8 TB/s bidirectional |
| Per-port bandwidth | 50 GB/s per direction (100 GB/s bidirectional) |
| Ports per GPU | 18 |
| NVSwitch chip bandwidth | 7.2 TB/s |
| NVSwitch ports per chip | 144 at 50 GB/s |
| NVSwitch chips per rack | 18 (across 9 trays) |
| Total NVLink fabric bandwidth | 130 TB/s |
| GPUs per NVLink domain | 72 |
| Aggregate GPU memory addressable from any GPU | 13.5 TB |
NVIDIA's description of "36 times faster than 400 Gbps Ethernet" captures the magnitude. The NVLink fabric is also far lower latency: about 1 microsecond hop-to-hop for adjacent GPUs versus ten to twenty microseconds for InfiniBand or Ethernet. For multi-rack scale-out, each compute tray has slots for ConnectX-7 or ConnectX-8 SuperNICs supporting either Quantum-X800 InfiniBand or Spectrum-X Ethernet at 800 Gbps.
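The "36 times" figure is a unit conversion: 1.8 TB/s expressed in bits per second, divided by a 400 Gbps link. A minimal sketch, with an illustrative function name:

```python
NVLINK_GB_PER_S = 1800  # 1.8 TB/s per GPU, from the article
ETHERNET_GBPS = 400     # the Ethernet link in NVIDIA's comparison
BITS_PER_BYTE = 8


def nvlink_vs_ethernet(nvlink_gb_per_s: int = NVLINK_GB_PER_S,
                       ethernet_gbps: int = ETHERNET_GBPS) -> float:
    """Ratio of per-GPU NVLink bandwidth to one Ethernet link, both in Gbps."""
    nvlink_gbps = nvlink_gb_per_s * BITS_PER_BYTE
    return nvlink_gbps / ethernet_gbps


ratio = nvlink_vs_ethernet()
print(f"NVLink per GPU: {NVLINK_GB_PER_S * BITS_PER_BYTE / 1000} Tbps, "
      f"{ratio:.0f}x a {ETHERNET_GBPS} Gbps link")
```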
NVIDIA quotes the following peak compute figures for a single GB200 NVL72 rack:
| Precision | Throughput (dense) |
|---|---|
| FP4 Tensor Core | 1,440 PFLOPS (1.44 exaFLOPS) |
| FP8 Tensor Core | 720 PFLOPS |
| FP16 / BF16 Tensor Core | 360 PFLOPS |
| TF32 Tensor Core | 180 PFLOPS |
| FP64 Tensor Core | 3.24 PFLOPS |
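Dividing the rack figures in the table above by 72 gives the implied per-GPU peak at each precision. This is a sketch over the quoted numbers only; it says nothing about sustained throughput.

```python
# Rack-level peak throughput in PFLOPS, copied from the table above.
RACK_PFLOPS = {
    "FP4": 1440,
    "FP8": 720,
    "FP16/BF16": 360,
    "TF32": 180,
    "FP64": 3.24,
}
GPUS_PER_RACK = 72

# Implied per-GPU peak for each precision.
per_gpu = {name: value / GPUS_PER_RACK for name, value in RACK_PFLOPS.items()}

for name, pflops in per_gpu.items():
    print(f"{name}: {pflops:g} PFLOPS per GPU")
```

From TF32 down to FP4, each halving of operand width doubles the quoted peak, which is why the FP4 path matters so much for inference workloads that tolerate four-bit precision.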
NVIDIA's measurements, replicated in third-party MLPerf submissions, place the NVL72 at roughly 30 times higher throughput than an equivalent count of H200 GPUs on Llama 3.1 405B inference in MLPerf Inference v5.0, with most of the gain coming from the unified NVLink domain rather than per-GPU compute. On H200 the model must shard across multiple eight-GPU nodes with cross-node InfiniBand for tensor parallelism; on NVL72 the same model fits in a single NVLink domain.
For training, CoreWeave, NVIDIA, and IBM submitted a 2,496-GPU cluster (about 35 NVL72 racks) to MLPerf Training v5.0 in mid-2025 and recorded a more than 2x speedup over Hopper at the same GPU count, with up to 3.2x speedup on the Llama 3.1 405B benchmark. For mixture-of-experts models like DeepSeek-V3, the NVL72 reaches roughly 200 tokens per second per GPU on long-context generation, an order of magnitude higher than H200, primarily because the expert routing layer can scatter and gather across all 72 GPUs at full NVLink bandwidth.
A single NVL72 rack is the unit of NVLink locality. Multi-rack deployments use either Quantum-X800 InfiniBand at 800 Gbps (default for training clusters, used by Microsoft, CoreWeave, and Oracle) or Spectrum-X Ethernet at 800 Gbps (used by Meta and several neoclouds with adaptive routing and BlueField-3 DPU congestion control). Per-rack network egress is typically 36 ConnectX-8 SuperNICs (one per superchip), totaling 28.8 Tbps. The egress bandwidth is intentionally a fraction of the in-rack NVLink fabric, since NVLink absorbs the heaviest tensor-parallel and KV cache traffic while InfiniBand or Ethernet handles the slower pipeline-parallel and data-parallel collectives. OCI Superclusters with NVIDIA Blackwell have been announced at the 100,000-GPU scale, which corresponds to roughly 1,400 NVL72 racks.
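The egress and rack-count figures in this paragraph can be reproduced as follows. The sketch uses only numbers from the article; the comparison with the NVLink fabric converts egress to bytes per second so the two are in the same unit.

```python
import math

NICS_PER_RACK = 36
NIC_GBPS = 800
NVLINK_FABRIC_TB_PER_S = 130
GPUS_PER_RACK = 72

# Total scale-out bandwidth leaving one rack.
egress_tbps = NICS_PER_RACK * NIC_GBPS / 1000   # terabits per second
egress_tb_per_s = egress_tbps / 8               # terabytes per second

# How much larger the in-rack NVLink fabric is than the rack's egress.
fabric_to_egress = NVLINK_FABRIC_TB_PER_S / egress_tb_per_s

# Racks needed for a 100,000-GPU cluster.
racks_for_100k = math.ceil(100_000 / GPUS_PER_RACK)

print(f"egress: {egress_tbps} Tbps = {egress_tb_per_s} TB/s")
print(f"NVLink fabric is about {fabric_to_egress:.0f}x the rack egress")
print(f"racks for 100,000 GPUs: {racks_for_100k}")
```

The exact rack count is 1,389, which the article rounds to roughly 1,400.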
NVIDIA does not assemble the racks itself; it licenses the reference design to partners.
| Partner | Role | Notes |
|---|---|---|
| Foxconn (Hon Hai) | ODM | Largest manufacturer, ~1,000 racks shipped April 2025 |
| Quanta Computer | ODM | 300 to 400 racks per month at peak ramp |
| Wistron | ODM | 150-plus racks per month |
| Wiwynn, Inventec, Pegatron, Hyve | ODM | Hyperscaler-focused configurations |
| Supermicro | OEM | SRS-GB200-NVL72 SuperCluster |
| Dell Technologies | OEM | PowerEdge XE9712 and XE9785 |
| Hewlett Packard Enterprise | OEM | HPE GB200 NVL72 racks |
| Lenovo, GIGABYTE, ASUS, QCT | OEM | Branded enterprise racks |
| Inspur, H3C | OEM (China) | Region-specific variants |
ODMs build to a hyperscaler's custom spec and ship under the customer's brand. OEMs ship branded racks through enterprise channels at higher margins. The ODM channel accounts for the majority of NVL72 volume in 2024 and 2025.
NVIDIA does not publish list prices. Supply chain reporting and customer disclosures place the rack-scale server at roughly USD 3.0 to 3.4 million for a hyperscaler configuration. The rack alone (compute, switches, backplane, cooling) is about USD 3.1 million; a fully integrated rack including networking, storage, and installation is about USD 3.9 million. Per-GPU economics work out to roughly USD 40,000 to 45,000 per Blackwell B200 in the NVL72.
NVIDIA's pricing strategy through the Blackwell launch has been to charge for the system rather than the chip, capturing the value of the NVLink fabric, liquid cooling integration, and unified-rack engineering. The NVL72 is priced at roughly 3x the cost of H100 servers with the same GPU count while delivering 5x to 30x the inference throughput, depending on workload.
Demand has exceeded supply. By early 2026, total Blackwell hardware was sold out through Q3 2026 with an estimated backlog of approximately 3.6 million GPU-equivalents. The supply constraint is primarily TSMC CoWoS-L advanced packaging capacity, with HBM3e supply from Micron, SK hynix, and Samsung as a secondary constraint.
The NVL72 had a difficult ramp. Initial production was slated for September 2024, but a sequence of issues pushed the volume ramp to Q2 and Q3 of 2025.
By April 2025, industry-wide NVL72 shipments reached approximately 1,500 racks per month, with Hon Hai (Foxconn) alone shipping 1,000 racks. Total 2025 NVL72 shipments settled in the 30,000 to 40,000 rack range.
The NVL72 runs the standard NVIDIA software stack with Blackwell-specific support: CUDA 12.8 or later (the first toolkit release to target Blackwell, compute capability sm_100), cuDNN 9.x with Transformer Engine FP8 and FP4 integration, TensorRT-LLM for production inference with NVL72-aware sharding, NVIDIA Triton Inference Server, NCCL with topology discovery, PyTorch 2.7 or later (the first stable release with Blackwell support) with FP8 and FP4 training paths, NVIDIA Dynamo (announced at GTC 2025) for disaggregated inference orchestration, NVIDIA NIM microservices, and NVIDIA Mission Control for cluster management.
A 72-GPU all-reduce on the NVL72 runs in roughly the time that the same all-reduce across nine eight-GPU H200 servers would take just to negotiate the cross-node InfiniBand transfers. NCCL, which backs PyTorch's distributed primitives, detects the topology automatically and routes through NVLink wherever possible.
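A rough way to see why the fabric matters for collectives is the textbook bandwidth term for a ring all-reduce, 2(N-1)/N × S / B. The sketch below is not a reproduction of the comparison above: it ignores latency and negotiation overhead entirely, and several inputs are assumptions not found in this article, namely the 1 GB buffer, treating 1.8 TB/s bidirectional as 0.9 TB/s per direction, and 400 Gbps (50 GB/s) of cross-node bandwidth per GPU for the multi-server case.

```python
def ring_allreduce_seconds(n_gpus: int, buffer_bytes: float,
                           link_bytes_per_s: float) -> float:
    """Bandwidth-only cost of a ring all-reduce: 2 * (N - 1) / N * S / B."""
    return 2 * (n_gpus - 1) / n_gpus * buffer_bytes / link_bytes_per_s


N_GPUS = 72
BUFFER_BYTES = 1e9     # ASSUMPTION: hypothetical 1 GB gradient buffer
NVLINK_BW = 900e9      # ASSUMPTION: half of 1.8 TB/s bidirectional, per direction
CROSS_NODE_BW = 50e9   # ASSUMPTION: 400 Gbps per GPU across nodes

# A flat ring is limited by its slowest link, so the multi-server case
# is modeled with the cross-node bandwidth on every hop.
t_nvlink = ring_allreduce_seconds(N_GPUS, BUFFER_BYTES, NVLINK_BW)
t_cross = ring_allreduce_seconds(N_GPUS, BUFFER_BYTES, CROSS_NODE_BW)
speedup = t_cross / t_nvlink

print(f"single NVLink domain: {t_nvlink * 1e3:.2f} ms")
print(f"cross-node ring:      {t_cross * 1e3:.2f} ms")
print(f"bandwidth-only ratio: {speedup:.0f}x")
```

Under these assumptions the ratio is just the ratio of link bandwidths. Real systems use hierarchical algorithms and see latency effects that this model leaves out, so the numbers are illustrative rather than a performance prediction.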
| Product | GPUs | CPUs | NVLink domain | FP4 peak | Year |
|---|---|---|---|---|---|
| DGX H100 | 8 H100 | 2 x86 | 8-GPU | n/a (FP8: 32 PFLOPS) | 2022 |
| HGX H200 | 8 H200 | host CPU | 8-GPU | n/a (FP8: 32 PFLOPS) | 2024 |
| GH200 NVL32 | 32 GH200 | 32 Grace | 32-GPU | n/a (FP8: 127 PFLOPS) | 2023 |
| GB200 NVL72 | 72 B200 | 36 Grace | 72-GPU | 1.44 EFLOPS | 2024 |
| GB300 NVL72 | 72 B300 | 36 Grace | 72-GPU | ~1.1 EFLOPS dense | 2025 |
| Vera Rubin NVL144 | 144 R100 | 36 Vera | 144-GPU | TBD | 2026 to 2027 |
The immediate successor is the GB300 NVL72, announced at GTC 2025, a refresh of the same rack form factor using the Blackwell Ultra B300 GPU. GB300 began customer shipments in mid-2025 and rapidly overtook the GB200 in newer cluster builds. NVIDIA's roadmap calls for the Vera Rubin NVL144 in 2026 to 2027, doubling per-rack GPU count to 144 with HBM4 memory and a refreshed NVLink generation. Pricing for Vera Rubin NVL144 racks has been reported as high as USD 8.8 million.
The NVL72 is most valuable where the unified NVLink domain enables a different deployment shape: