NVIDIA Blackwell B200
Last reviewed
May 17, 2026
Sources
15 citations
Review status
Source-backed
Revision
v1 · 3,475 words
The NVIDIA B200 is a data center GPU accelerator built on the Blackwell microarchitecture, introduced by NVIDIA on March 18, 2024 at the company's GTC keynote in San Jose. It is the flagship Blackwell datacenter part, succeeding the H100 and H200 of the Hopper generation, and it is the chip that ships inside the HGX B200 baseboard, the DGX B200 server, and (paired with a Grace CPU) the GB200 Grace Blackwell Superchip and the GB200 NVL72 rack system. The B200 is built from two reticle-limited dies on TSMC's custom 4NP process, packs 208 billion transistors, carries 192 GB of HBM3e memory at 8 TB/s, and is rated at 1,000 W in its SXM configuration. At launch NVIDIA quoted peak throughput of 20 PFLOPS of FP4 tensor compute with sparsity, roughly five times the FP8 throughput of the H100.
The B200 became the highest-volume frontier AI training and inference chip of the 2024 to 2025 cycle. By late 2024 Morgan Stanley reported that 2025 production had already been sold out, and the buyer list included every major hyperscaler and frontier model developer: Amazon Web Services, Google, Meta, Microsoft, OpenAI, Oracle, Tesla, xAI, and CoreWeave. The chip was joined in late 2025 by the higher-power Blackwell Ultra B300, before the Vera Rubin generation began shipping in 2026.
| Field | Value |
|---|---|
| Type | Data center GPU accelerator |
| Microarchitecture | Blackwell |
| Die | GB100 (dual-die, reticle-limited) |
| Process | TSMC custom 4NP |
| Transistors | 208 billion (two dies, 104 billion each) |
| Die-to-die interconnect | NV-HBI, 10 TB/s |
| Memory | 192 GB HBM3e (eight 8-Hi stacks) |
| Memory bandwidth | 8 TB/s |
| Streaming multiprocessors | 148 enabled (74 per die) |
| CUDA cores | approximately 18,432 |
| Tensor cores | 592 (fifth generation) |
| FP4 tensor (with sparsity) | 20 PFLOPS |
| FP8 tensor (with sparsity) | 10 PFLOPS |
| Interconnect | NVLink 5 (1.8 TB/s bidirectional), PCIe Gen6 x16 |
| Form factors | B200 SXM (in HGX B200, DGX B200), GB200 Superchip |
| TDP | 1,000 W (SXM) |
| Announced | March 18, 2024 (GTC, San Jose) |
| First customer shipments | Q4 2024 (sample units), Q1 to Q2 2025 (volume) |
| Estimated unit price | $40,000 to $50,000 |
| Compute capability | 10.0 |
| Required CUDA | 12.8 or later |
NVIDIA CEO Jensen Huang introduced Blackwell and the B200 at GTC on March 18, 2024 in San Jose, the company's first in-person GTC since the pandemic. The keynote framed the B200 as the foundation of an "AI factory" architecture aimed at training and serving trillion-parameter large language models. NVIDIA disclosed the dual-die layout, the 208 billion transistor count, FP4 support, fifth-generation NVLink, and the GB200 Grace Blackwell Superchip variant at the same event.
Production did not run cleanly. In August 2024, reports from The Information and several supply-chain analysts surfaced that the B200 mask set required a respin to correct a defect in the NV-HBI die-to-die bridge that limited yield on the full dual-die assembly. NVIDIA chief financial officer Colette Kress acknowledged on the Q2 fiscal 2025 earnings call that the company had executed a mask change to improve production yield, pushing the ramp from Q3 to Q4 calendar 2024. A second wave of issues followed at the system level: the GB200 NVL72 rack drew significantly more power than initial specifications suggested (some reports cited 140 kW versus the 120 kW datasheet figure), and OEM partners reported overheating in densely populated racks, intermittent NVLink switch faults, and leaks in the direct-to-chip liquid-cooling loops. By December 2024 NVIDIA and its ODM partners (Foxconn, Wistron, Quanta, Inventec) had stabilized the production lines.
Volume B200 deliveries to cloud providers and hyperscalers began in the first quarter of 2025 and ramped sharply through the year. OpenAI publicly received one of the earliest engineering DGX B200 systems in October 2024, with Huang and OpenAI president Greg Brockman posing with the unit at OpenAI's San Francisco office. By the second half of 2025 the chip was the default frontier-training part for new builds at AWS, Google Cloud, Microsoft Azure, Oracle Cloud Infrastructure, and CoreWeave.
The B200 is the first commercial product to use TSMC's CoWoS-L (Chip-on-Wafer-on-Substrate with Local Silicon Interconnect) packaging at volume. CoWoS-L uses small silicon bridge dies (rather than a single monolithic interposer) to route signals between the GPU dies and the HBM3e stacks, enabling packages that exceed a single interposer reticle. The B200 substrate measures roughly 102 mm by 75 mm, well beyond the limits of the older CoWoS-S used for H100 and H200.
The two GPU dies sit at the center of the package and are linked by the NV-HBI (NVIDIA High Bandwidth Interface), a proprietary die-to-die interconnect derived from the NVLink 5 PHY layer. NV-HBI provides 10 TB/s of bidirectional bandwidth between the dies, at latency low enough that the pair presents to software as a single logical GPU with unified cache coherency. That bandwidth is roughly an order of magnitude greater than typical chiplet links and is what allows the B200 to behave like a monolithic part to CUDA programmers.
Each die contains 104 billion transistors and exposes 74 streaming multiprocessors out of 80 physical SMs, with the difference reserved for yield. The two dies together deliver 148 active SMs, approximately 18,432 CUDA cores, and 592 fifth-generation tensor cores, with 228 KB of shared memory per SM. Eight HBM3e stacks ring the GPU dies, each an 8-Hi stack delivering 24 GB for a total of 192 GB per GPU. Peak memory bandwidth is 8 TB/s, a 2.4x improvement over H100 and 1.67x over H200. SK Hynix supplies the dominant share of HBM3e for the B200 ramp, with Micron qualified as a second source from mid-2024 and Samsung qualifying through 2025. TechInsights teardown analysis confirmed SK Hynix as the volume supplier on early production units.
The fifth-generation tensor core is the centerpiece of the B200's compute architecture. It adds native support for FP4 (4-bit floating point) and FP6 (6-bit floating point) in two flavors: NVIDIA-defined formats and the Open Compute Project's MX (microscaling) formats, MXFP4 and MXFP6. Microscaling attaches a small shared exponent to a block of low-precision values, recovering dynamic range that would otherwise be lost at four bits per element.
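The shared-exponent idea can be illustrated with a small sketch. It assumes the 4-bit E2M1 element layout (representable magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) and a power-of-two block scale, as in the OCP MX formats, which group 32 elements per block; the function names are hypothetical, the block length is left to the caller, and ties are broken toward the smaller magnitude rather than with the round-to-nearest-even behavior real hardware would use.

```python
import math

# Magnitudes representable by a 4-bit E2M1 element (sign is stored separately).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
E2M1_EMAX = 2  # exponent of the largest power of two in E2M1 (4.0 == 2**2)


def quantize_block(values):
    """Quantize one block to (shared_exponent, list of E2M1 element values)."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 0, [0.0] * len(values)
    # Shared power-of-two scale: places the largest magnitude near the top
    # of the E2M1 range so small elements keep as much resolution as possible.
    shared_exp = math.floor(math.log2(amax)) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    elements = []
    for v in values:
        scaled = min(abs(v) / scale, E2M1_MAGNITUDES[-1])  # saturate at 6.0
        nearest = min(E2M1_MAGNITUDES, key=lambda m: abs(m - scaled))
        elements.append(math.copysign(nearest, v))
    return shared_exp, elements


def dequantize_block(shared_exp, elements):
    """Reconstruct approximate values from the shared exponent and elements."""
    scale = 2.0 ** shared_exp
    return [e * scale for e in elements]


if __name__ == "__main__":
    exp, elems = quantize_block([0.001, -0.002, 0.003, 0.0005])
    print(exp, elems, dequantize_block(exp, elems))
```

Without the shared exponent, every value in the example block would fall below the smallest nonzero E2M1 magnitude and round to zero; with it, the block keeps four distinct levels.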
The peak tensor throughput figures NVIDIA quoted at launch (per B200 GPU, with structured sparsity) are:
| Precision | Throughput (with sparsity) | Throughput (dense) |
|---|---|---|
| FP4 | 20 PFLOPS | 10 PFLOPS |
| FP6 | 10 PFLOPS | 5 PFLOPS |
| FP8 | 10 PFLOPS | 5 PFLOPS |
| FP16 / BF16 | 5 PFLOPS | 2.5 PFLOPS |
| TF32 | 2.5 PFLOPS | 1.25 PFLOPS |
| FP64 (tensor) | 40 TFLOPS | not applicable |
FP4 is the headline number and the precision that gives the B200 its 5x inference advantage over the H100 in NVIDIA's marketing comparisons. FP4 inference is enabled by the second-generation Transformer Engine, a runtime layer in TensorRT-LLM and CUDA libraries that decides on a per-tensor or per-block basis which precision to use, automatically scaling values to fit the smaller representable range. The Transformer Engine first appeared on Hopper with FP8 support; the Blackwell version is the first to natively support 4-bit precision in hardware. For training, NVIDIA introduced NVFP4 as a recipe that uses FP4 for the forward pass and selected backward operations while retaining higher precision (FP8 or BF16) for gradients and master weights. NVIDIA's MLPerf Training v5.1 submissions in late 2025 used NVFP4 to claim approximately 1.9x training speedup over previously published FP8 Hopper results.
The following table compares the B200 against the most recent prior NVIDIA datacenter parts and the lower-power B100 Blackwell SKU. All figures are for the SXM form factor.
| Specification | H100 SXM | H200 SXM | B100 SXM | B200 SXM |
|---|---|---|---|---|
| Architecture | Hopper | Hopper | Blackwell | Blackwell |
| Die | GH100 (monolithic) | GH100 (monolithic) | GB100 dual-die | GB100 dual-die |
| Process | TSMC 4N | TSMC 4N | TSMC 4NP | TSMC 4NP |
| Transistors | 80 billion | 80 billion | 208 billion | 208 billion |
| Memory | 80 GB HBM3 | 141 GB HBM3e | 192 GB HBM3e | 192 GB HBM3e |
| Memory bandwidth | 3.35 TB/s | 4.8 TB/s | 8 TB/s | 8 TB/s |
| FP4 tensor (dense) | not supported | not supported | 7 PFLOPS | 10 PFLOPS |
| FP8 tensor (dense) | 2 PFLOPS | 2 PFLOPS | 3.5 PFLOPS | 5 PFLOPS |
| BF16 tensor (dense) | 1 PFLOPS | 1 PFLOPS | 1.75 PFLOPS | 2.5 PFLOPS |
| NVLink generation | 4 | 4 | 5 | 5 |
| NVLink bandwidth | 900 GB/s | 900 GB/s | 1.8 TB/s | 1.8 TB/s |
| TDP | 700 W | 700 W | 700 W | 1,000 W |
The B200 and B100 share the same silicon and memory configuration. The B100 is binned and clocked lower so that it stays inside the 700 W thermal envelope of the older HGX H100 baseboard, providing a drop-in upgrade path for operators that cannot rework their air-cooled facilities. The B200, by contrast, is the full-power SKU and assumes liquid cooling in most deployments.
The memory increase from 80 GB on the H100 to 192 GB on the B200 is one of the most visible practical differences. A 70-billion-parameter model in BF16 (about 140 GB of weights) fits on a single B200 with headroom for the KV cache, whereas the same model on an H100 requires aggressive quantization or memory swapping. A 405-billion-parameter model such as Llama 3.1 405B needs about 810 GB for its BF16 weights alone, which means at least five B200 GPUs (versus eleven H100s); in FP8 it fits comfortably on a single 8-GPU HGX B200 board.
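The capacity arithmetic can be checked with a short sketch. It counts model weights only (no KV cache, activations, or framework overhead), treats 1 GB as 10^9 bytes, and uses hypothetical helper names, so the results are lower bounds rather than deployment guidance.

```python
import math

# Bytes stored per parameter at each precision (weights only).
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0}


def weight_gb(params_billion, precision):
    """Weight footprint in GB, ignoring KV cache and activations."""
    return params_billion * BYTES_PER_PARAM[precision]


def min_gpus(params_billion, precision, hbm_gb):
    """Smallest GPU count whose combined HBM can hold the weights alone."""
    return math.ceil(weight_gb(params_billion, precision) / hbm_gb)


if __name__ == "__main__":
    for params in (70, 405):
        for precision in ("bf16", "fp8"):
            print(
                f"{params}B {precision}: {weight_gb(params, precision):.0f} GB, "
                f"B200 x{min_gpus(params, precision, 192)}, "
                f"H100 x{min_gpus(params, precision, 80)}"
            )
```

Under these assumptions a 70B model in BF16 needs one 192 GB GPU or two 80 GB GPUs, and a 405B model in BF16 needs at least five 192 GB GPUs or eleven 80 GB GPUs for weights alone.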
The HGX B200 is the reference baseboard NVIDIA sells to OEMs, populated with eight B200 GPUs interconnected by four NVLink 5 switch ASICs. OEMs that have shipped HGX B200 servers include Dell (PowerEdge XE9680), Hewlett Packard Enterprise (Cray XD685, ProLiant XD685), Lenovo (ThinkSystem SR685a V3), Supermicro (SYS-822GS-NB3), and ASUS (ESC NB8-E11). Each HGX B200 delivers 1,440 GB of aggregate HBM3e (180 GB enabled per GPU on the baseboard, below the 192 GB physical capacity), 14.4 TB/s of aggregate NVLink bandwidth, and roughly 72 PFLOPS of FP8 tensor compute as quoted by NVIDIA, with system power around 10 kW.
The DGX B200 is NVIDIA's own turnkey 8-GPU server, the direct successor to the DGX H100. It pairs the HGX B200 baseboard with two Intel Xeon Platinum 8570 processors, up to 4 TB of system DDR5 memory, eight ConnectX-7 400 Gb/s network adapters, two BlueField-3 DPUs, and dual hot-swappable power supplies. The chassis is roughly 14U tall and weighs approximately 130 kg. NVIDIA quotes 72 PFLOPS of FP8 training performance and 144 PFLOPS of FP4 inference performance per system, with starting OEM list prices around $515,000. NVIDIA positions the DGX B200 as delivering 3x the training throughput and 15x the inference throughput of the DGX H100. The system uses liquid-assisted air cooling rather than direct liquid cooling, simplifying deployment in air-cooled datacenters.
The GB200 Superchip pairs two B200 GPUs with one Grace CPU on a single board, connected over NVLink-C2C at 900 GB/s bidirectional. The Grace CPU is an Arm Neoverse V2 design with 72 cores and up to 480 GB of LPDDR5X memory, sharing a unified memory address space with the GPUs. Each GB200 provides 384 GB of HBM3e and 40 PFLOPS of FP4 tensor compute with sparsity. The GB200 is the building block of the rack-scale GB200 NVL72, which connects 18 compute trays (36 Grace CPUs and 72 B200 GPUs) inside a single liquid-cooled rack, with nine NVLink switch trays providing a flat 130 TB/s all-to-all NVLink domain across all 72 GPUs. NVIDIA describes the NVL72 as "one big GPU" because the NVLink-attached domain is large enough that a trillion-parameter model can be sharded entirely inside it without using a slower scale-out fabric.
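The rack-level figures follow from the per-GPU numbers, as the sketch below shows. The constants are taken from the text; two superchips per compute tray is inferred from 36 Grace CPUs across 18 trays, and the function name is hypothetical.

```python
NVLINK5_TBPS_PER_GPU = 1.8        # bidirectional NVLink 5 bandwidth per GPU
COMPUTE_TRAYS = 18
SUPERCHIPS_PER_TRAY = 2           # inferred: 36 Grace CPUs / 18 trays
GPUS_PER_SUPERCHIP = 2            # each GB200 pairs two B200s with one Grace
FP4_SPARSE_PFLOPS_PER_GPU = 20    # launch figure, with structured sparsity


def nvl72_totals():
    """Aggregate counts and peak figures for one GB200 NVL72 rack."""
    superchips = COMPUTE_TRAYS * SUPERCHIPS_PER_TRAY
    gpus = superchips * GPUS_PER_SUPERCHIP
    return {
        "grace_cpus": superchips,
        "gpus": gpus,
        "nvlink_tbps": round(gpus * NVLINK5_TBPS_PER_GPU, 1),
        "fp4_sparse_pflops": gpus * FP4_SPARSE_PFLOPS_PER_GPU,
    }


if __name__ == "__main__":
    print(nvl72_totals())
```

The NVLink total comes to 129.6 TB/s, which NVIDIA rounds to 130 TB/s; these are peak aggregates, not measured all-to-all throughput.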
B200 customer adoption skewed heavily toward frontier model developers and hyperscale clouds. The following deployments were publicly disclosed or reported during the 2024 to 2026 ramp.
| Customer | Deployment | Scale | Notes |
|---|---|---|---|
| OpenAI | Engineering DGX B200 | One system initially | Delivered October 2024; first non-NVIDIA recipient |
| xAI | Colossus expansion | 50,000 B200 GPUs reported | On top of 100,000 H100s in Memphis, Tennessee |
| Meta | Hyperion / Prometheus campuses | Millions of Blackwell and Rubin GPUs | Multi-year partnership announced February 2026 |
| Microsoft | Azure ND GB200 v6 | Tens of thousands of GPUs | GA mid-2025; powers OpenAI workloads |
| Amazon Web Services | EC2 P6, GB200 NVL72 capacity blocks | Tens of thousands of GPUs | First B200 capacity blocks GA Q2 2025 |
| Google Cloud | A4 and A4X virtual machines | Tens of thousands of GPUs | A4X uses GB200, A4 uses HGX B200 |
| Oracle Cloud Infrastructure | OCI Supercluster | Up to 131,072 B200 GPUs (cluster ceiling) | Early DGX B200 launch partner |
| CoreWeave | GB200 NVL72 racks | Tens of thousands of GPUs | Joint MLPerf v5.0 submission on 2,496 GPUs |
| Tesla | Cortex training cluster | Undisclosed | Mixed with Dojo fleet |
| Lambda | Cloud B200 instances | Low double-digit thousands | On-demand 8x B200 instances launched 2025 |
NVIDIA's own DGX SuperPOD reference designs were updated in 2024 to use the DGX B200 as a building block, with a SuperPOD of 32 DGX B200 systems (256 B200 GPUs) marketed as the standard unit for sub-hyperscaler deployments.
NVIDIA does not publish list pricing for datacenter GPUs. Public OEM quotes, channel data, and analyst estimates put the per-GPU street price of the B200 SXM module at roughly $40,000 to $50,000 through 2024 and 2025, broadly in line with H100 pricing at the equivalent point in its lifecycle. EpochAI estimated the manufacturing cost of a B200 at approximately $6,400 per unit, with HBM3e memory and CoWoS-L packaging accounting for roughly two-thirds of the bill of materials. DGX B200 systems carried list prices around $515,000, while complete GB200 NVL72 racks were quoted between $2 million and $3 million depending on memory and networking options. Cloud pricing for on-demand 8-GPU HGX B200 instances settled into a band of roughly $4 to $7 per GPU-hour during 2025, with spot pricing dropping as low as $2.25 per GPU-hour at neocloud providers.
The supply story dominated the chip's first year. By mid-2025, TSMC's CoWoS-L capacity was effectively reserved for NVIDIA through 2027, with NVIDIA holding more than 70% of available volume. Both HBM3e supply (allocated through long-term contracts with SK Hynix, Micron, and qualifying Samsung) and CoWoS-L packaging were the gating constraints on B200 shipments, rather than wafer output from TSMC's 4NP lines.
In the MLPerf Training v4.1 and v5.0 rounds, NVIDIA and its partners submitted B200 and GB200 NVL72 results. A 2,496-GPU GB200 NVL72 submission from CoreWeave, NVIDIA, and IBM completed Llama 3.1 405B pretraining in 27.33 minutes, approximately 2.1x faster than NVIDIA's 2,560-GPU H100 submission. On GPT-3 175B and Llama 2 70B fine-tuning, the B200 delivered roughly 2x and 2.2x the per-GPU throughput of the H100. NVFP4 training recipes added another 1.4x uplift between MLPerf v5.0 and v5.1, attributable to software rather than hardware changes.
MLPerf Inference v5.0 results for 8x B200 servers, compared with 8x H200, showed:
| Benchmark | 8x B200 | 8x H200 | Speedup |
|---|---|---|---|
| Llama 2 70B (server) | 98,443 tokens/s | approximately 32,800 tokens/s | 3.0x |
| Llama 2 70B (offline) | 98,858 tokens/s | approximately 35,300 tokens/s | 2.8x |
| Mixtral 8x7B (server) | 126,845 tokens/s | approximately 60,400 tokens/s | 2.1x |
| Mixtral 8x7B (offline) | 128,148 tokens/s | approximately 61,000 tokens/s | 2.1x |
| Stable Diffusion XL (server) | 28.44 samples/s | approximately 17.8 samples/s | 1.6x |
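The speedup column is simply the ratio of the two throughput columns, rounded to one decimal place. A minimal sketch that recomputes it from the table values (the H200 figures are the approximate ones listed above, and the names are hypothetical):

```python
# (benchmark, 8x B200 throughput, approximate 8x H200 throughput)
ROWS = [
    ("Llama 2 70B (server)", 98443, 32800),
    ("Llama 2 70B (offline)", 98858, 35300),
    ("Mixtral 8x7B (server)", 126845, 60400),
    ("Mixtral 8x7B (offline)", 128148, 61000),
    ("Stable Diffusion XL (server)", 28.44, 17.8),
]


def speedups(rows):
    """Return {benchmark: B200/H200 throughput ratio rounded to 0.1}."""
    return {name: round(b200 / h200, 1) for name, b200, h200 in rows}


if __name__ == "__main__":
    for name, ratio in speedups(ROWS).items():
        print(f"{name}: {ratio}x")
```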
The gap widens further on rack-scale GB200 NVL72 deployments. On the Llama 3.1 405B inference benchmark, a single NVL72 rack returned up to 30x the throughput of an H200 8-GPU server, partly because the 405B model fits within the NVL72's NVLink domain and avoids cross-node communication. Per-GPU, NVL72 still delivered 3.4x the throughput of an H200 system on the same workload.
The B200 SXM module is rated at 1,000 W TDP, a 43% increase over the H100's 700 W. NVIDIA expects most B200 deployments to use liquid cooling: direct-to-chip cold plates in rack-scale GB200 systems, or rear-door heat exchangers and liquid-assisted air for HGX B200 chassis. The GB200 NVL72 is fully liquid-cooled and requires data centers with appropriate facility water loops. Despite the higher TDP, actual power draw under typical inference workloads runs well below the 1,000 W rating, with cloud operators reporting time-averaged figures around 600 W per GPU on production LLM traffic; full TDP is only reached during aggressive training or synthetic FP8 benchmarks. NVIDIA's claim of 25x cost and energy efficiency over the H100 for trillion-parameter model inference rests on the combination of FP4 throughput, reduced cross-GPU communication inside an NVLink domain, and below-TDP operating points.
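A rough sketch of the power arithmetic follows. It counts GPU power only (CPUs, networking, and cooling overhead are excluded), assumes an 8-GPU board, and uses hypothetical helper names; the 600 W figure is the time-averaged inference draw reported above, not a specification.

```python
def percent_increase(new, old):
    """Relative increase of new over old, in percent."""
    return (new - old) / old * 100.0


def daily_energy_kwh(watts_per_gpu, gpus=8, hours=24.0):
    """GPU-only energy for one day, in kWh."""
    return watts_per_gpu * gpus * hours / 1000.0


if __name__ == "__main__":
    print(f"TDP increase over H100: {percent_increase(1000, 700):.0f}%")
    print(f"8 GPUs at 1,000 W: {daily_energy_kwh(1000):.1f} kWh/day")
    print(f"8 GPUs at 600 W:   {daily_energy_kwh(600):.1f} kWh/day")
```

At the reported 600 W average, an 8-GPU board draws about 115 kWh per day for the GPUs alone, against 192 kWh per day if every GPU ran continuously at its 1,000 W rating.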
The B200 has compute capability 10.0 and requires CUDA 12.8 or later for full hardware enablement. The earliest production deliveries in late 2024 used CUDA 12.4 with experimental FP4 support; full FP4 production support landed with CUDA 12.8 in early 2025. The primary software components are TensorRT-LLM (with explicit B200 kernels for FP4 and FP6 GEMMs and NVIDIA QAT for FP4 quantization), the second-generation Transformer Engine integrated with PyTorch and JAX, NVIDIA NIM inference microservices with pre-packaged FP4-tuned containers, and NCCL 2.21 or later for collective operations over the 1.8 TB/s NVLink 5 fabric and the 130 TB/s NVL72 domain.
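A deployment script might gate FP4 code paths on these two requirements. The sketch below is a hypothetical helper, not an NVIDIA API: it takes the CUDA version string and compute capability as plain inputs, which in a PyTorch environment could be read from `torch.version.cuda` and `torch.cuda.get_device_capability()`.

```python
MIN_CUDA = (12, 8)                 # minimum CUDA release stated above
B200_COMPUTE_CAPABILITY = (10, 0)  # compute capability 10.0


def parse_version(text):
    """Turn '12.8' or '12.8.1' into a (major, minor) tuple of ints."""
    return tuple(int(part) for part in text.split(".")[:2])


def supports_full_b200(cuda_version, compute_capability):
    """True if the device is CC 10.0 and the toolkit is CUDA 12.8 or later."""
    return (
        tuple(compute_capability) == B200_COMPUTE_CAPABILITY
        and parse_version(cuda_version) >= MIN_CUDA
    )


if __name__ == "__main__":
    print(supports_full_b200("12.8", (10, 0)))  # toolkit and device both match
    print(supports_full_b200("12.4", (10, 0)))  # toolkit too old
```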
The B200's launch was the largest single generational leap in NVIDIA's datacenter roadmap to date, and it cemented the company's commercial dominance through the 2024 to 2026 frontier-model build-out. Even with the production delays of late 2024, B200 demand massively outstripped supply throughout 2025: Morgan Stanley reported in November 2024 that the entire 2025 production allocation had been sold to a small number of large buyers.
NVIDIA continued the Blackwell line at GTC 2025 with the B300 ("Blackwell Ultra"), which retains the 208 billion transistor dual-die package but increases enabled SM count, memory capacity (288 GB HBM3e), and TDP (1,400 W) for roughly 1.5x performance in rack-scale form factors. The B200 continues to ship in parallel for customers that prefer its lower thermal envelope. The Blackwell line is succeeded by the Vera Rubin platform, which began initial shipments in 2026 and uses a 3 nm process, HBM4 memory, and a new Vera Arm CPU. Even with Rubin in the field, the B200 is expected to remain in active production through 2026 and 2027 because of secondary cloud demand, regional sovereign-AI deployments, and the long tail of customers upgrading from H100 and H200.