NVIDIA Blackwell B200

AI Hardware NVIDIA

20 min read

Updated Jun 28, 2026

Suggest edit History Talk

RawGraph

Last edited

Jun 28, 2026

Fact-checked

In review queue

Sources

15 citations

Revision

v2 · 3,940 words

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

The NVIDIA B200 is a data center GPU accelerator built on the Blackwell microarchitecture, introduced by NVIDIA on March 18, 2024 at the company's GTC keynote in San Jose. ^[1] It is the flagship Blackwell datacenter part, succeeding the H100 and H200 of the Hopper generation, and it is the chip that ships inside the HGX B200 baseboard, the DGX B200 server, and (paired with a Grace CPU) the GB200 Grace Blackwell Superchip and the GB200 NVL72 rack system. The B200 is built from two reticle-limited dies on TSMC's custom 4NP process, connected by a 10 TB/s NV-HBI link and presented to software as a single GPU, packs 208 billion transistors, carries 192 GB of HBM3e memory at 8 TB/s, and is rated at 1,000 W in its SXM configuration. ^[1]^[3] At launch NVIDIA quoted peak throughput of 20 PFLOPS of FP4 tensor compute with sparsity, roughly five times the FP8 throughput of the H100. ^[1]

NVIDIA positioned Blackwell and the B200 as the foundation of an "AI factory" capable of running "real-time generative AI on trillion-parameter large language models at up to 25x less cost and energy consumption than its predecessor." ^[1] Jensen Huang framed the launch by declaring, "Generative AI is the defining technology of our time. Blackwell is the engine to power this new industrial revolution." ^[1]

The B200 became the highest-volume frontier AI training and inference chip of the 2024 to 2025 cycle. By late 2024 Morgan Stanley reported that 2025 production had already been sold out, and the buyer list included every major hyperscaler and frontier model developer: Amazon Web Services, Google, Meta, Microsoft, OpenAI, Oracle, Tesla, xAI, and CoreWeave. The chip was joined in late 2025 by the higher-power Blackwell Ultra B300, before the Vera Rubin generation began shipping in 2026.

What is the NVIDIA B200?

The NVIDIA B200 is the top data center GPU of the Blackwell generation, designed to train and serve trillion-parameter large language models. ^[1] Unlike every prior flagship NVIDIA GPU, which used a single monolithic die, the B200 is a dual-die design: two reticle-limited dies are fused into one package and connected by the 10 TB/s NV-HBI interconnect, so the pair behaves as a single logical GPU to CUDA software. ^[1]^[3] The two dies together hold 208 billion transistors, the device carries 192 GB of HBM3e memory, and the second-generation Transformer Engine adds native FP4 and FP6 precision that NVIDIA credits for the chip's headline efficiency gains. ^[1] In short, the B200 is the building block of NVIDIA's Blackwell "AI factory" systems, from the 8-GPU HGX B200 baseboard up to the rack-scale GB200 NVL72.

Infobox

Field	Value
Type	Data center GPU accelerator
Microarchitecture	Blackwell
Die	GB100 (dual-die, reticle-limited)
Process	TSMC custom 4NP
Transistors	208 billion (two dies, 104 billion each)
Die-to-die interconnect	NV-HBI, 10 TB/s
Memory	192 GB HBM3e (eight 8-Hi stacks)
Memory bandwidth	8 TB/s
Streaming multiprocessors	148 enabled (74 per die)
CUDA cores	approximately 18,432
Tensor cores	592 (fifth generation)
FP4 tensor (with sparsity)	20 PFLOPS
FP8 tensor (with sparsity)	10 PFLOPS
Interconnect	NVLink 5 (1.8 TB/s bidirectional), PCIe Gen6 x16
Form factors	B200 SXM (in HGX B200, DGX B200), GB200 Superchip
TDP	1,000 W (SXM)
Announced	March 18, 2024 (GTC, San Jose)
First customer shipments	Q4 2024 (sample units), Q1 to Q2 2025 (volume)
Estimated unit price	$40,000 to $50,000
Compute capability	10.0
Required CUDA	12.8 or later

When was the B200 announced and when did it ship?

NVIDIA CEO Jensen Huang introduced Blackwell and the B200 at GTC on March 18, 2024 in San Jose, the company's first in-person GTC since the pandemic. ^[1] The keynote framed the B200 as the foundation of an "AI factory" architecture aimed at training and serving trillion-parameter large language models. NVIDIA disclosed the dual-die layout, the 208 billion transistor count, the 10 TB/s die-to-die link, FP4 support, fifth-generation NVLink, and the GB200 Grace Blackwell Superchip variant at the same event. ^[1] NVIDIA said Blackwell-based products would enter the market from partners worldwide in late 2024. ^[1]

Production did not run cleanly. In August 2024, reports from The Information and several supply-chain analysts surfaced that the B200 mask set required a respin to correct a defect in the NV-HBI die-to-die bridge that limited yield on the full dual-die assembly. ^[5] NVIDIA chief financial officer Colette Kress acknowledged on the Q2 fiscal 2025 earnings call that the company had executed a mask change to improve production yield, pushing the ramp from Q3 to Q4 calendar 2024. A second wave of issues followed at the system level: the GB200 NVL72 rack drew significantly more power than initial specifications suggested (some reports cited 140 kW versus the 120 kW datasheet figure), and OEM partners reported overheating in densely populated racks, intermittent NVLink switch faults, and leaks in the direct-to-chip liquid-cooling loops. ^[13] By December 2024 NVIDIA and its ODM partners (Foxconn, Wistron, Quanta, Inventec) had stabilized the production lines. ^[13]

Volume B200 deliveries to cloud providers and hyperscalers began in the first quarter of 2025 and ramped sharply through the year. OpenAI publicly received one of the earliest engineering DGX B200 systems in October 2024, with Huang and OpenAI president Greg Brockman posing with the unit at OpenAI's San Francisco office. ^[4] By the second half of 2025 the chip was the default frontier-training part for new builds at AWS, Google Cloud, Microsoft Azure, Oracle Cloud Infrastructure, and CoreWeave.

How is the B200 built? (silicon and packaging)

The B200 is the first commercial product to use TSMC's CoWoS-L (Chip-on-Wafer-on-Substrate with Local Silicon Interconnect) packaging at volume. CoWoS-L uses small silicon bridge dies (rather than a single monolithic interposer) to route signals between the GPU dies and the HBM3e stacks, enabling packages that exceed a single interposer reticle. The B200 substrate measures roughly 102 mm by 75 mm, well beyond the limits of the older CoWoS-S used for H100 and H200. ^[6]

The two GPU dies sit at the center of the package and are linked by the NV-HBI (NVIDIA High Bandwidth Interface), a proprietary die-to-die interconnect derived from the NVLink PHY layer. NV-HBI provides 10 TB/s of bidirectional bandwidth between the dies, low enough in latency that the pair presents to software as a single logical GPU with unified cache coherency. ^[1]^[3] The figure is roughly an order of magnitude greater than typical chiplet links and is what allows the B200 to behave like a monolithic part to CUDA programmers.

Each die contains 104 billion transistors and exposes 74 streaming multiprocessors out of 80 physical SMs, with the difference reserved for yield. ^[3] The two dies together deliver 148 active SMs, approximately 18,432 CUDA cores, and 592 fifth-generation tensor cores, with 228 KB of shared memory per SM. Eight HBM3e stacks ring the GPU dies, each an 8-Hi stack delivering 24 GB for a total of 192 GB per GPU. Peak memory bandwidth is 8 TB/s, a 2.4x improvement over H100 and 1.67x over H200. ^[3] SK Hynix supplies the dominant share of HBM3e for the B200 ramp, with Micron qualified as a second source from mid-2024 and Samsung qualifying through 2025. TechInsights teardown analysis confirmed SK Hynix as the volume supplier on early production units. ^[6]

What is the second-generation Transformer Engine, and what is FP4?

The fifth-generation tensor core is the centerpiece of the B200's compute architecture. It adds native support for FP4 (4-bit floating point) and FP6 (6-bit floating point) in two flavors: NVIDIA-defined formats and the Open Compute Project's MX (microscaling) formats, MXFP4 and MXFP6. Microscaling attaches a small shared exponent to a block of low-precision values, recovering dynamic range that would otherwise be lost at four bits per element. NVIDIA describes the second-generation Transformer Engine as enabling "double the compute and model sizes with new 4-bit floating point AI inference capabilities." ^[1]

The peak tensor throughput figures NVIDIA quoted at launch (per B200 GPU, with structured sparsity) are: ^[1]^[3]

Precision	Throughput (with sparsity)	Throughput (dense)
FP4	20 PFLOPS	10 PFLOPS
FP6	10 PFLOPS	5 PFLOPS
FP8	10 PFLOPS	5 PFLOPS
FP16 / BF16	5 PFLOPS	2.5 PFLOPS
TF32	2.5 PFLOPS	1.25 PFLOPS
FP64 (tensor)	40 TFLOPS	not applicable

FP4 is the headline number and the precision that gives the B200 its 5x inference advantage over the H100 in NVIDIA's marketing comparisons. ^[1] FP4 inference is enabled by the second-generation Transformer Engine, a runtime layer in TensorRT-LLM and CUDA libraries that decides on a per-tensor or per-block basis which precision to use, automatically scaling values to fit the smaller representable range. The Transformer Engine first appeared on Hopper with FP8 support; the Blackwell version is the first to natively support 4-bit precision in hardware. For training, NVIDIA introduced NVFP4 as a recipe that uses FP4 for the forward pass and selected backward operations while retaining higher precision (FP8 or BF16) for gradients and master weights. NVIDIA's MLPerf Training v5.1 submissions in late 2025 used NVFP4 to claim approximately 1.9x training speedup over previously published FP8 Hopper results.

How does the B200 compare to the H100, H200, and B100?

The following table compares the B200 against the most recent prior NVIDIA datacenter parts and the B100 lower-power Blackwell SKU. All figures are for the SXM form factor.

Specification	H100 SXM	H200 SXM	B100 SXM	B200 SXM
Architecture	Hopper	Hopper	Blackwell	Blackwell
Die	GH100 (monolithic)	GH100 (monolithic)	GB100 dual-die	GB100 dual-die
Process	TSMC 4N	TSMC 4N	TSMC 4NP	TSMC 4NP
Transistors	80 billion	80 billion	208 billion	208 billion
Memory	80 GB HBM3	141 GB HBM3e	192 GB HBM3e	192 GB HBM3e
Memory bandwidth	3.35 TB/s	4.8 TB/s	8 TB/s	8 TB/s
FP4 tensor (dense)	not supported	not supported	7 PFLOPS	10 PFLOPS
FP8 tensor (dense)	2 PFLOPS	2 PFLOPS	3.5 PFLOPS	5 PFLOPS
BF16 tensor (dense)	1 PFLOPS	1 PFLOPS	1.75 PFLOPS	2.5 PFLOPS
NVLink generation	4	4	5	5
NVLink bandwidth	900 GB/s	900 GB/s	1.8 TB/s	1.8 TB/s
TDP	700 W	700 W	700 W	1,000 W

The B200 and B100 share the same silicon and memory configuration. The B100 is binned and clocked lower so that it stays inside the 700 W thermal envelope of the older HGX H100 baseboard, providing a drop-in upgrade path for operators that cannot rework their air-cooled facilities. The B200, by contrast, is the full-power SKU and assumes liquid cooling in most deployments.

The memory increase from 80 GB on the H100 to 192 GB on the B200 is one of the most visible practical differences. A 70-billion-parameter model in BF16 fits on a single B200 with substantial headroom for the KV cache, where the same model on an H100 requires aggressive quantization or memory swapping. A 405-billion-parameter model like Llama 3.1 405B fits in BF16 across just three B200 GPUs (versus eight H100s), or comfortably on a single 8-GPU HGX B200 board in FP8.

What systems use the B200? (form factors)

HGX B200

The HGX B200 is the reference baseboard NVIDIA sells to OEMs, populated with eight B200 GPUs interconnected by four NVLink 5 switch ASICs. OEMs that have shipped HGX B200 servers include Dell (PowerEdge XE9680), Hewlett Packard Enterprise (Cray XD685, ProLiant XD685), Lenovo (ThinkSystem SR685a V3), Supermicro (SYS-822GS-NB3), and ASUS (ESC NB8-E11). Each HGX B200 delivers 1,440 GB of aggregate HBM3e, 14.4 TB/s of aggregate NVLink bandwidth, and roughly 72 PFLOPS of FP8 dense tensor compute, with system power around 10 kW.

DGX B200

The DGX B200 is NVIDIA's own turnkey 8-GPU server, the direct successor to the DGX H100. It pairs the HGX B200 baseboard with two Intel Xeon Platinum 8570 processors, up to 4 TB of system DDR5 memory, eight ConnectX-7 400 Gb/s network adapters, two BlueField-3 DPUs, and dual hot-swappable power supplies. The chassis is roughly 14U tall and weighs approximately 130 kg. NVIDIA quotes 72 PFLOPS of FP8 training performance and 144 PFLOPS of FP4 inference performance per system, with starting OEM list prices around $515,000. ^[2] NVIDIA positions the DGX B200 as delivering 3x the training throughput and 15x the inference throughput of the DGX H100. ^[2] The system uses liquid-assisted air cooling rather than direct liquid cooling, simplifying deployment in air-cooled datacenters.

GB200 Grace Blackwell Superchip and NVL72

The GB200 Superchip pairs two B200 GPUs with one Grace CPU on a single board, connected over NVLink-C2C at 900 GB/s bidirectional. ^[1] The Grace CPU is an Arm Neoverse V2 design with 72 cores and up to 480 GB of LPDDR5X memory, sharing a unified memory address space with the GPUs. Each GB200 provides 384 GB of HBM3e and 40 PFLOPS of FP4 tensor compute with sparsity. The GB200 is the building block of the rack-scale GB200 NVL72, which connects 18 compute trays (36 Grace CPUs and 72 B200 GPUs) inside a single liquid-cooled rack, with nine NVLink switch trays providing a flat 130 TB/s all-to-all NVLink domain across all 72 GPUs. ^[14] NVIDIA describes the NVL72 as "one big GPU" because the NVLink-attached domain is large enough that a trillion-parameter model can be sharded entirely inside it without using a slower scale-out fabric. ^[14]

Who uses the B200? (customers and deployments)

B200 customer adoption skewed heavily toward frontier model developers and hyperscale clouds. The following deployments were publicly disclosed or reported during the 2024 to 2026 ramp.

Customer	Deployment	Scale	Notes
OpenAI	Engineering DGX B200	One system initially	Delivered October 2024; first non-NVIDIA recipient
xAI	Colossus expansion	50,000 B200 GPUs reported	On top of 100,000 H100s in Memphis, Tennessee
Meta	Hyperion / Prometheus campuses	Millions of Blackwell and Rubin GPUs	Multi-year partnership announced February 2026
Microsoft	Azure ND GB200 v6	Tens of thousands of GPUs	GA mid-2025; powers OpenAI workloads
Amazon Web Services	EC2 P6, GB200 NVL72 capacity blocks	Tens of thousands of GPUs	First B200 capacity blocks GA Q2 2025
Google Cloud	A4 and A4X virtual machines	Tens of thousands of GPUs	A4X uses GB200, A4 uses HGX B200
Oracle Cloud Infrastructure	OCI Supercluster	Up to 131,072 B200 GPUs (cluster ceiling)	Early DGX B200 launch partner
CoreWeave	GB200 NVL72 racks	Tens of thousands of GPUs	Joint MLPerf v5.0 submission on 2,496 GPUs
Tesla	Cortex training cluster	Undisclosed	Mixed with Dojo fleet
Lambda	Cloud B200 instances	Low double-digit thousands	On-demand 8x B200 instances launched 2025

NVIDIA's own DGX SuperPOD reference designs were updated in 2024 to use the DGX B200 as a building block, with a SuperPOD of 32 DGX B200 systems (256 B200 GPUs) marketed as the standard unit for sub-hyperscaler deployments.

How much does a B200 cost?

NVIDIA does not publish list pricing for datacenter GPUs. Public OEM quotes, channel data, and analyst estimates put the per-GPU street price of the B200 SXM module at roughly $40,000 to $50,000 through 2024 and 2025, broadly in line with H100 pricing at the equivalent point in its lifecycle. ^[15] EpochAI estimated the manufacturing cost of a B200 at approximately $6,400 per unit, with HBM3e memory and CoWoS-L packaging accounting for roughly two-thirds of the bill of materials. ^[7] DGX B200 systems carried list prices around $515,000, while complete GB200 NVL72 racks were quoted between $2 million and $3 million depending on memory and networking options. Cloud pricing for on-demand 8-GPU HGX B200 instances settled into a band of roughly $4 to $7 per GPU-hour during 2025, with spot pricing dropping as low as $2.25 per GPU-hour at neocloud providers. ^[15]

The supply story dominated the chip's first year. By mid-2025, TSMC's CoWoS-L capacity was effectively reserved for NVIDIA through 2027, with NVIDIA holding more than 70% of available volume. ^[10]^[11] Both HBM3e supply (allocated through long-term contracts with SK Hynix, Micron, and qualifying Samsung) and CoWoS-L packaging were the gating constraints on B200 shipments, rather than wafer output from TSMC's 4NP lines.

How fast is the B200? (performance results)

Training

In the MLPerf Training v4.1 and v5.0 rounds, NVIDIA and its partners submitted B200 and GB200 NVL72 results. A 2,496-GPU GB200 NVL72 submission from CoreWeave, NVIDIA, and IBM completed Llama 3.1 405B pretraining in 27.33 minutes, approximately 2.1x faster than NVIDIA's 2,560-GPU H100 submission. ^[9] On GPT-3 175B and Llama 2 70B fine-tuning, the B200 delivered roughly 2x and 2.2x the per-GPU throughput of the H100. ^[9] NVFP4 training recipes added another 1.4x uplift between MLPerf v5.0 and v5.1, attributable to software rather than hardware changes.

Inference

MLPerf Inference v5.0 results for 8x B200 servers, compared with 8x H200, showed: ^[8]

Benchmark	8x B200	8x H200	Speedup
Llama 2 70B (server)	98,443 tokens/s	approximately 32,800 tokens/s	3.0x
Llama 2 70B (offline)	98,858 tokens/s	approximately 35,300 tokens/s	2.8x
Mixtral 8x7B (server)	126,845 tokens/s	approximately 60,400 tokens/s	2.1x
Mixtral 8x7B (offline)	128,148 tokens/s	approximately 61,000 tokens/s	2.1x
Stable Diffusion XL (server)	28.44 samples/s	approximately 17.8 samples/s	1.6x

The gap widens further on rack-scale GB200 NVL72 deployments. On the Llama 3.1 405B inference benchmark, a single NVL72 rack returned up to 30x the throughput of an H200 8-GPU server, partly because the 405B model fits within the NVL72's NVLink domain and avoids cross-node communication. ^[8] Per-GPU, NVL72 still delivered 3.4x the throughput of an H200 system on the same workload. ^[8]

How much power does the B200 use? (power, cooling, density)

The B200 SXM module is rated at 1,000 W TDP, a 43% increase over the H100's 700 W. NVIDIA expects most B200 deployments to use liquid cooling: direct-to-chip cold plates in rack-scale GB200 systems, or rear-door heat exchangers and liquid-assisted air for HGX B200 chassis. The GB200 NVL72 is fully liquid-cooled and requires data centers with appropriate facility water loops. Despite the higher TDP, actual power draw under typical inference workloads runs well below the 1,000 W rating, with cloud operators reporting time-averaged figures around 600 W per GPU on production LLM traffic; full TDP is only reached during aggressive training or synthetic FP8 benchmarks. NVIDIA's claim of 25x cost and energy efficiency over the H100 for trillion-parameter model inference rests on the combination of FP4 throughput, reduced cross-GPU communication inside an NVLink domain, and below-TDP operating points. ^[1]

What software does the B200 require?

The B200 has compute capability 10.0 and requires CUDA 12.8 or later for full hardware enablement. The earliest production deliveries in late 2024 used CUDA 12.4 with experimental FP4 support; full FP4 production support landed with CUDA 12.8 in early 2025. The primary software components are TensorRT-LLM (with explicit B200 kernels for FP4 and FP6 GEMMs and NVIDIA QAT for FP4 quantization), the second-generation Transformer Engine integrated with PyTorch and JAX, NVIDIA NIM inference microservices with pre-packaged FP4-tuned containers, and NCCL 2.21 or later for collective operations over the 1.8 TB/s NVLink 5 fabric and the 130 TB/s NVL72 domain.

Where does the B200 sit in NVIDIA's roadmap?

The B200's launch was the largest single generational leap in NVIDIA's datacenter roadmap to date, and it cemented the company's commercial dominance through the 2024 to 2026 frontier-model build-out. Even with the production delays of late 2024, B200 demand massively outstripped supply throughout 2025: Morgan Stanley reported in November 2024 that the entire 2025 production allocation had been sold to a small number of large buyers.

NVIDIA continued the Blackwell line at GTC 2025 with the B300 ("Blackwell Ultra"), which retains the 208 billion transistor dual-die package but increases enabled SM count, memory capacity (288 GB HBM3e), and TDP (1,400 W) for roughly 1.5x performance in rack-scale form factors. ^[12] The B200 continues to ship in parallel for customers that prefer its lower thermal envelope. The Blackwell line is scheduled to be succeeded by the Vera Rubin platform, which began initial shipments in 2026 and uses a 3 nm process, HBM4 memory, and a new Vera Arm CPU. Even with Rubin in the field, the B200 is expected to remain in active production through 2026 and 2027 because of secondary cloud demand, regional sovereign-AI deployments, and the long tail of customers upgrading from H100 and H200.

ELI5: what is the B200, simply?

Imagine the most powerful single chip a company could possibly print on a wafer, and then imagine that NVIDIA hit the physical size limit of what a chip can be. To go bigger, NVIDIA glued two of those maxed-out chips together with a connection so fast (10 trillion bytes per second) that software still sees them as one giant brain. That glued-together chip is the B200. It has 208 billion tiny switches (transistors), enough fast memory (192 GB) to hold a very large AI model on a single card, and a special trick called the Transformer Engine that lets it do math in a super-compact 4-bit format, which is how it runs AI chatbots so much faster and cheaper than the previous generation.

References

NVIDIA Blackwell Platform Arrives to Power a New Era of Computing. NVIDIA Newsroom. March 18, 2024. ↩
DGX B200: The Foundation for Your AI Factory. NVIDIA. ↩
NVIDIA Blackwell Architecture Technical Overview. NVIDIA. 2024. ↩
NVIDIA Delivers "One of The First" Blackwell DGX B200 AI System To OpenAI. Wccftech. October 2024. ↩
Nvidia Blackwell GPUs allegedly delayed due to design flaws. Tom's Hardware. August 2024. ↩
TechInsights' NVIDIA HGX B200 Analysis Reveals HBM3E Supplier and Advanced Packaging Innovation. TechInsights. ↩
NVIDIA's B200 costs around $6,400 to produce, with memory accounting for half. Epoch AI. ↩
NVIDIA Blackwell Delivers Massive Performance Leaps in MLPerf Inference v5.0. NVIDIA Developer Blog. ↩
NVIDIA Blackwell Enables 3x Faster Training and Nearly 2x Training Performance Per Dollar. NVIDIA Developer Blog. ↩
Inside the AI Bottleneck: CoWoS, HBM, and 2-3nm Capacity Constraints Through 2027. FusionWW. ↩
NVIDIA Alone Has TSMC's Advanced Packaging Lines Booked for Several Years Ahead. Wccftech. ↩
NVIDIA Blackwell Ultra AI Factory Platform Paves Way for Age of AI Reasoning. NVIDIA Newsroom. March 2025. ↩
Nvidia server makers solve Blackwell technical issues, ramp up shipments of GB200 racks. DataCenter Dynamics. ↩
GB200 NVL72. NVIDIA. ↩
How much does it cost to run NVIDIA B200 GPUs in 2025?. Modal Blog. ↩

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

1 revision by 1 contributors · full history

Suggest edit

What links here

SambaNova SN40L ThunderKittens Trillium (TPU v6e)Trusted Execution Environments for machine learning