The NVIDIA B200 is a data center GPU based on the NVIDIA Blackwell microarchitecture, announced by NVIDIA CEO Jensen Huang at GTC 2024 in March 2024. It is the flagship accelerator of the Blackwell generation and the direct successor to the NVIDIA H100 and NVIDIA H200. The B200 features a dual-die design with 208 billion transistors, 192 GB of HBM3E memory, and a peak FP4 throughput of 20 petaFLOPS per GPU. It ships inside the GB200 Grace Blackwell Superchip (one Grace CPU plus two B200 GPUs), and reaches its maximum scale in the GB200 NVL72, a rack-scale system combining 72 B200 GPUs and 36 Grace CPUs into a unified 1.44 exaflops FP4 computing platform.
NVIDIA began shipping B200 GPUs and GB200 NVL72 racks to hyperscalers in late 2024, with wider cloud availability expanding through 2025. The B200 introduced native FP4 hardware acceleration for the first time in any NVIDIA GPU, along with second-generation Transformer Engine, fifth-generation NVLink, and fourth-generation NVSwitch technology. It was succeeded by the B300 (Blackwell Ultra) in early 2026.
NVIDIA announced the Blackwell GPU architecture and the B200 on March 18, 2024, at its GTC developer conference in San Jose, California. Jensen Huang, wearing his trademark leather jacket, pulled a Blackwell chip out of his pocket on stage and held it next to a Hopper chip, which it visibly dwarfed. The announcement framed Blackwell as the company's answer to escalating demand from large language model (LLM) training runs and inference at scale.
The B200 is named for David Harold Blackwell, a statistician and mathematician who was the first Black scholar inducted into the National Academy of Sciences. NVIDIA has followed the tradition of naming its data center GPU architectures after scientists and researchers.
Huang described the design philosophy with the phrase "we need bigger GPUs," acknowledging that the scale of modern AI workloads had outpaced what a single monolithic die could deliver. This observation drove the B200's defining architectural choice: a dual-die chiplet design that effectively doubles the transistor count compared to a single-die approach on the same process node.
At GTC 2024, NVIDIA simultaneously announced the GB200 Grace Blackwell Superchip, the GB200 NVL72 rack system, and a broader product lineup that included the B100 for less demanding deployments. The Blackwell platform was positioned as being capable of 4x faster training and up to 30x faster inference compared to Hopper-generation DGX H100 systems.
Blackwell builds on Hopper's foundation while introducing several architectural changes intended specifically for large-scale transformer model workloads. The architecture adds a second-generation Transformer Engine with support for FP4 precision, a new decompression engine, confidential computing enhancements, and a redesigned NVLink fabric.
The most visible change from Hopper is the move from a monolithic die to a multi-chip module (MCM) package. Each B200 GPU contains two separate dies, each manufactured at the maximum reticle size on TSMC's 4NP process, bonded together in a single SXM-style package. The two dies are connected by a proprietary chip-to-chip NVLink bridge delivering 10 TB/s of bidirectional bandwidth, which is fast enough that software treats the two dies as a single coherent GPU with a unified address space.
The dual-die approach gives NVIDIA access to substantially more transistors (208 billion, compared to 80 billion on the H100) without requiring an unrealistically large monolithic die. It also improves yield economics: two smaller dies can be manufactured at higher combined yield than one die twice as large.
For context, the H200 uses the same 80-billion-transistor die as the H100, and NVIDIA's earlier A100 carried approximately 54 billion transistors.
Blackwell introduces fifth-generation Tensor Cores with a notable addition: native hardware support for FP4 (NVFP4) precision. The H100 and H200 support FP8 natively but handle anything below it only through software emulation or quantization into FP8 kernels; the B200 adds dedicated FP4 multiply-accumulate units in hardware. This is significant because FP4 allows twice as many operations per clock as FP8 while keeping memory footprints small enough to serve large batches.
The B200 also introduces support for MXFP8 (microscaling FP8) in addition to the standard FP8 format that appeared in Hopper. MXFP8 uses per-block scaling factors that improve accuracy compared to tensor-wide FP8 quantization, making it easier to maintain model quality during inference.
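The benefit of per-block scaling is easy to demonstrate with a toy quantizer. The sketch below rounds to an integer grid rather than the actual FP8 value grid, so it illustrates the shared-scale principle rather than the hardware's encoding; the block size of 32 follows the OCP microscaling convention.

```python
import numpy as np

def quantize_dequantize(x, block_size, levels=127):
    """Toy shared-scale quantizer: each block of values shares one scale
    factor, then values are rounded to an integer grid. This mimics the
    principle behind MX block scaling (real MXFP8 rounds to an FP8 grid
    with a shared power-of-two scale per 32-element block)."""
    x = x.reshape(-1, block_size)
    scale = np.abs(x).max(axis=1, keepdims=True) / levels
    return (np.round(x / scale) * scale).ravel()

rng = np.random.default_rng(0)
x = rng.normal(size=4096)
x[::512] *= 1000  # a few large outliers, typical of activation tensors

err_block  = np.abs(quantize_dequantize(x, 32) - x).mean()
err_tensor = np.abs(quantize_dequantize(x, x.size) - x).mean()
print(f"per-block scale error:  {err_block:.5f}")   # outliers only degrade their own block
print(f"tensor-wide scale error: {err_tensor:.5f}") # one outlier degrades every value's resolution
```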
The Transformer Engine, first introduced in the H100, is upgraded in Blackwell to support FP4 inference in addition to FP8. The engine dynamically adjusts the precision of computations within attention and feed-forward layers, choosing the highest precision compatible with accuracy constraints. In Blackwell the engine can, for example, run attention queries and keys in FP4 while keeping values in FP8, allowing aggressive compression of KV caches during LLM inference.
NVIDIA reports that the Transformer Engine's FP4 mode can effectively double throughput on inference workloads relative to FP8, at the cost of slight accuracy degradation that is typically acceptable in production serving scenarios.
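The memory saving from quantizing a KV cache is straightforward to estimate. The model dimensions below (80 layers, 8 grouped-query KV heads of dimension 128, roughly the shape of a 70B-class model) are illustrative assumptions, not a specific product's configuration:

```python
def kv_cache_gb(layers, kv_heads, head_dim, seq_len, batch, bits):
    """Bytes for keys and values (the leading factor of 2) across all
    layers, converted to GB."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bits / 8 / 1e9

args = dict(layers=80, kv_heads=8, head_dim=128, seq_len=32_768, batch=32)
print(f"FP8 KV cache: {kv_cache_gb(**args, bits=8):.0f} GB")  # ~172 GB
print(f"FP4 KV cache: {kv_cache_gb(**args, bits=4):.0f} GB")  # ~86 GB
```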
Blackwell adds a dedicated hardware decompression unit capable of processing LZ4, Deflate, and Snappy compressed data at up to 800 GB/s. This is aimed primarily at database and analytics workloads where compressed data must be decompressed before computation. NVIDIA claims this delivers 18x faster database query processing compared to CPU-based decompression.
Blackwell adds a second-generation confidential computing mode that allows GPU workloads to run in a hardware-isolated enclave, with data encrypted in memory and protected from the host system, hypervisor, and other tenants. This is aimed at regulated industries handling sensitive data in cloud environments.
| Specification | Value |
|---|---|
| Architecture | Blackwell (NVIDIA) |
| Die configuration | Dual-die MCM (two chiplets) |
| Process node | TSMC 4NP |
| Transistors | 208 billion |
| Streaming Multiprocessors | 160 SMs |
| CUDA cores | 20,480 |
| FP4 Tensor Core (sparse) | 20 PFLOPS |
| FP4 Tensor Core (dense) | 9 PFLOPS |
| FP8 Tensor Core (sparse) | 9 PFLOPS |
| FP8 Tensor Core (dense) | 4.5 PFLOPS |
| BF16/FP16 Tensor Core | 2.25 PFLOPS (dense) |
| TF32 Tensor Core | 1.125 PFLOPS (dense) |
| FP64 | 40 TFLOPS |
| Memory capacity | 192 GB HBM3E |
| Memory bandwidth | 8 TB/s |
| Memory bus width | 8192-bit |
| NVLink version | 5th generation |
| NVLink bandwidth (per GPU) | 1.8 TB/s bidirectional |
| PCIe interface | PCIe Gen 6 |
| Form factor | SXM6 |
| TDP (SXM form factor) | 1,000 W |
| Cooling | Direct liquid cooling required |
The B200 uses HBM3E (enhanced third-generation High Bandwidth Memory) arranged in eight stacks of 24 GB each; each stack presents a 1024-bit interface, for an aggregate 8192-bit bus. The total is 192 GB per GPU with 8 TB/s of peak bandwidth.
This represents a substantial improvement over the H100's memory subsystem. The H100 SXM5 variant carries 80 GB of HBM3 with 3.35 TB/s bandwidth; the H200 SXM variant increased this to 141 GB HBM3E at 4.8 TB/s. The B200 nearly doubles the H200's capacity and improves bandwidth by roughly 67%.
HBM stacks in the B200 use a 3D-stacked construction where DRAM dies are stacked vertically and connected to the logic die via through-silicon vias (TSVs). The 8192-bit bus is the widest memory interface deployed in any production GPU as of 2024, and it is what allows the 8 TB/s figure despite the relatively modest per-pin data rate of HBM3E.
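The relationship between bus width, per-pin rate, and aggregate bandwidth is simple arithmetic; working backward from the quoted figures:

```python
# Solve for the per-pin data rate implied by 8 TB/s over an 8192-bit bus.
bus_bits = 8 * 1024               # eight HBM3E stacks, 1024 bits each
peak_bytes_per_s = 8e12           # the quoted 8 TB/s
pin_rate_gbps = peak_bytes_per_s * 8 / bus_bits / 1e9
print(f"per-pin rate: {pin_rate_gbps:.2f} Gb/s")  # ~7.81 Gb/s, modest by DRAM standards
```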
The large memory capacity matters most for inference workloads that involve long context windows or serve many concurrent users. A 70-billion-parameter model in BF16 requires roughly 140 GB just for weights, which fits within a single B200's 192 GB but would require tensor-parallel splitting across two H100s (each at 80 GB). The B200's larger pool reduces the need for model parallelism in many practical deployment scenarios, which in turn reduces communication overhead and latency.
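The arithmetic behind these sizing claims fits in a small helper; the figures below count weights only, ignoring KV cache, activations, and framework overhead:

```python
def weight_memory_gb(params_billion, bits_per_param):
    """Memory needed just for model weights."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

# 70B model in BF16 (16 bits/param): 140 GB -> fits on one 192 GB B200,
# but needs tensor parallelism across two 80 GB H100s.
print(weight_memory_gb(70, 16))    # 140.0

# 1.8T-parameter model in FP4 (4 bits/param): 900 GB -> fits comfortably
# within an NVL72 rack's ~13.4 TB of aggregate HBM3E.
print(weight_memory_gb(1800, 4))   # 900.0
```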
FP4 (4-bit floating point) is the headline new precision format in Blackwell. The B200 delivers 20 PFLOPS of sparse FP4 throughput (where sparsity refers to the structured 2:4 pattern, which doubles effective throughput when two of every four weights are zero) and 9 PFLOPS dense. The H100 and H200 have no native FP4 hardware support.
In practice, FP4 is used primarily for inference rather than training. At this precision, model weights consume half as much memory and bandwidth as FP8, allowing larger batch sizes and lower per-token latency. NVIDIA's benchmarks for LLM serving show the B200 achieving roughly 15x higher token throughput per rack compared to an equivalent Hopper system when running at FP4 precision.
FP8 was introduced in the H100 and remains the primary training precision in Blackwell. The B200 delivers approximately 4,500 TFLOPS (4.5 PFLOPS) of dense FP8 throughput, compared to roughly 1,979 TFLOPS on the H100. This represents a 2.3x improvement at the same precision level on a per-chip basis.
For training large transformer models, FP8 is now the dominant format at the frontier. The higher FP8 throughput translates fairly directly into faster training runs for a given model size and batch configuration.
BF16 and FP16 are the default formats for many training and fine-tuning workloads due to their balance of range and precision. The B200 delivers approximately 2,250 TFLOPS dense BF16 throughput, compared to 989 TFLOPS on the H100 SXM5. This is again roughly a 2.3x improvement at the per-GPU level.
| Metric | H100 SXM5 | H200 SXM5 | B200 |
|---|---|---|---|
| FP8 Tensor (dense) | 1,979 TFLOPS | 1,979 TFLOPS | 4,500 TFLOPS |
| FP4 Tensor (dense) | N/A | N/A | 9,000 TFLOPS |
| BF16 Tensor (dense) | 989 TFLOPS | 989 TFLOPS | 2,250 TFLOPS |
| Memory capacity | 80 GB HBM3 | 141 GB HBM3E | 192 GB HBM3E |
| Memory bandwidth | 3.35 TB/s | 4.8 TB/s | 8 TB/s |
| NVLink bandwidth | 900 GB/s | 900 GB/s | 1.8 TB/s |
| TDP | 700 W | 700 W | 1,000 W |
| Transistors | 80 B | 80 B | 208 B |
The performance gap is larger for inference than training. NVIDIA's own comparison numbers show 4x faster training and 15-30x faster inference depending on model size and batch configuration. The large inference advantage comes from the combination of FP4 precision (unavailable on Hopper), larger memory (which allows bigger batches and longer contexts), and higher bandwidth.
The GB200 is not a standalone GPU product but a multi-chip module that combines one NVIDIA Grace CPU with two B200 GPUs on a single package. The "G" prefix denotes Grace, NVIDIA's Arm-based server CPU.
The Grace CPU in the GB200 is based on the Arm Neoverse V2 core, with 72 cores per CPU and 480 GB of LPDDR5X memory delivering roughly 500 GB/s of bandwidth. It connects to the two B200 GPUs via NVLink-C2C (Chip-to-Chip), a proprietary high-bandwidth interconnect that delivers 900 GB/s bidirectional bandwidth between the CPU and each GPU. This is far faster than any PCIe connection: a PCIe Gen 5 x16 link peaks at roughly 128 GB/s bidirectional, and even PCIe Gen 6 does not reach NVLink-C2C speeds.
The NVLink-C2C connection enables cache-coherent memory access between the Grace CPU and the B200 GPUs. This means GPU kernels can directly access CPU memory without explicit DMA transfers, and the CPU can read GPU memory without staging copies. In practice, this simplifies programming and reduces latency for workloads that alternate between CPU preprocessing and GPU inference.
Each GB200 Superchip (one Grace + two B200s) delivers 40 PFLOPS FP4 AI performance and carries 864 GB of total memory (384 GB of HBM3E across the two B200s plus 480 GB of LPDDR5X on the Grace, all coherently accessible).
The GB200 is the building block of all rack-scale Blackwell deployments. It is not sold as a standalone consumer or OEM product; instead it ships exclusively inside NVL36, NVL72, and other configured rack systems.
The GB200 NVL72 is a rack-scale system that combines 36 GB200 Superchips (18 dual-node compute trays) into a single NVLink domain. The result is 72 B200 GPUs and 36 Grace CPUs in one rack, all connected via fifth-generation NVLink through a set of NVSwitch chips.
The NVL72 rack houses:

- 18 compute trays, each carrying two GB200 Superchips (36 Superchips in total, for 72 B200 GPUs and 36 Grace CPUs)
- 9 NVSwitch trays containing 18 fourth-generation NVSwitch chips
- a passive copper NVLink cable cartridge spine connecting the compute and switch trays
- direct-to-chip liquid cooling distribution for every tray
The total rack power draw runs to approximately 120-130 kW for the full NVL72 configuration, requiring dedicated power delivery and direct-to-chip liquid cooling. This power density is substantially higher than Hopper-generation DGX H100 racks (which typically ran at 10-14 kW per 8-GPU node), and it requires purpose-built data center infrastructure or significant retrofit work.
Every B200 in the NVL72 has 18 NVLink ports, each running at 100 GB/s. The 9 NVSwitch trays (18 NVSwitch chips total) provide full any-to-any connectivity: any GPU can communicate with any other GPU in the rack with a single hop through one NVSwitch. The aggregate NVLink bandwidth across all 72 GPUs in the rack is 130 TB/s.
The single-hop topology is an advantage over multi-hop alternatives. When training a model with tensor parallelism across all 72 GPUs, all-reduce operations require only one NVLink hop regardless of which two GPUs are communicating. This keeps latency predictable and avoids the bandwidth bottlenecks that occur at higher-hop network switches.
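A first-order model shows what these figures mean for collective operations. The 100 GB gradient buffer below is an illustrative assumption, and real collectives overlap communication with compute, so this is a wire-time sketch rather than a benchmark:

```python
n_gpus = 72
per_gpu_GBs = 1800.0  # NVLink 5: 18 ports x 100 GB/s bidirectional

# Aggregate NVLink bandwidth across the rack
print(f"aggregate: {n_gpus * per_gpu_GBs / 1000:.0f} TB/s")  # ~130 TB/s

# Ring all-reduce moves 2*(N-1)/N * S bytes through each GPU.
S_GB = 100.0  # assumed gradient buffer size
wire_time_ms = 2 * (n_gpus - 1) / n_gpus * S_GB / per_gpu_GBs * 1000
print(f"~{wire_time_ms:.0f} ms per all-reduce")  # ~110 ms
```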
Beyond a single NVL72 rack, NVIDIA supports NVLink domains of up to 576 GPUs by connecting multiple racks. At that scale, the aggregate NVLink bandwidth reaches over 1 petabyte per second.
| Metric | GB200 NVL72 | DGX H100 (8 GPU) | Ratio |
|---|---|---|---|
| FP4 (NVFP4) | 1,440 PFLOPS | N/A | -- |
| FP8 | 720 PFLOPS | ~16 PFLOPS | ~45x |
| BF16 | 360 PFLOPS | ~8 PFLOPS | ~45x |
| GPU memory | 13.4 TB HBM3E | 640 GB | ~21x |
| NVLink bandwidth | 130 TB/s | 7.2 TB/s | ~18x |
| LLM training (GPT-MoE-1.8T) | -- | -- | 4x faster |
| LLM inference (1.8T params) | -- | -- | 30x faster |
The 30x inference advantage for trillion-parameter models reflects both the FP4 precision benefit (2x throughput vs FP8) and the memory capacity advantage, which allows the entire trillion-parameter model to fit in the rack's 13.4 TB aggregate GPU memory without offloading.
Alongside the NVL72 rack product, NVIDIA and its server partners offer systems built around standard SXM-format B200 GPUs (without the Grace CPU), following the same DGX/HGX product line structure used in previous generations.
The HGX B200 is a baseboard module containing eight B200 GPUs interconnected via fifth-generation NVLink and NVSwitch. It is the GPU subsystem equivalent of the HGX H100, designed to be integrated into server chassis by OEM partners including Dell, HPE, Lenovo, and Supermicro. The HGX B200 module connects to host CPUs (typically AMD EPYC or Intel Xeon) via PCIe Gen 6.
In an 8-GPU HGX B200 system, the GPUs are connected by two NVSwitch chips: each GPU runs nine of its 18 NVLink ports to each switch, preserving the full 1.8 TB/s of bidirectional NVLink bandwidth per GPU for intra-node communication. GPU-to-GPU communication between nodes uses InfiniBand or Ethernet.
Cloud providers that want standard x86-based AI servers with Blackwell GPUs use HGX B200. CoreWeave made HGX B200 instances generally available in May 2025. Lambda Labs, RunPod, and other AI cloud providers also offer HGX B200 instances.
The DGX B200 is NVIDIA's own fully integrated server product built around eight B200 GPUs. It includes the chassis, power supply, CPU (Intel Xeon), system memory, NVMe storage, and networking in a single 10U rackmount system. It is the turnkey option for organizations that want NVIDIA-validated hardware and don't want to source individual components from multiple vendors.
Key DGX B200 system specifications:
| Component | Specification |
|---|---|
| GPUs | 8x NVIDIA B200 (SXM6) |
| CPU | 2x Intel Xeon Platinum |
| System memory | 2 TB DDR5 |
| GPU memory | 1.44 TB (8 x 180 GB) |
| GPU-to-GPU | NVLink 5 + NVSwitch |
| Network | 8x 400Gb InfiniBand or Ethernet |
| System form factor | 10U |
| System TDP | ~14.3 kW |
| FP8 training performance | ~72 PFLOPS |
| FP4 inference performance | ~144 PFLOPS |
NVIDIA quotes the DGX B200 as delivering 3x faster training and 15x faster inference versus the DGX H100 at the 8-GPU node level. The gap is smaller at this scale than at NVL72 scale partly because these x86-based systems lack the Grace CPU and NVLink-C2C advantages of the Superchip design.
DGX B200 pricing was reported in the $280,000 to $320,000 range depending on configuration and vendor, compared to approximately $300,000 to $400,000 for a DGX H100.
NVLink 5 doubles the per-link bandwidth compared to NVLink 4 (used in Hopper). Each NVLink 5 link runs at 100 GB/s bidirectional. The B200 GPU has 18 NVLink ports, giving a maximum per-GPU NVLink bandwidth of 1.8 TB/s, compared to 900 GB/s on the H100 (18 NVLink 4 links at 50 GB/s each).
NVLink 5 also extends NVLink domains across racks using copper cable cartridges rather than optical modules. NVIDIA deliberately chose passive copper connections for within-rack connectivity in the NVL72 to reduce cost, latency, and power consumption versus active optical cables, though optical is required for longer distances between racks.
NVSwitch 4 is the switch chip that enables any-to-any GPU communication within the NVL72 rack. Each NVSwitch 4 chip has 72 NVLink ports, enough to connect all 72 GPUs in the rack to a single switch hop. The 18 NVSwitch chips in the NVL72 rack provide 130 TB/s of total switch bandwidth.
NVSwitch 4 supports NVIDIA's SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) in-network computing feature, which allows all-reduce operations to be executed inside the switch fabric rather than requiring all data to be routed back to individual GPUs. This reduces the effective bandwidth needed for gradient synchronization during training.
Blackwell GPUs require CUDA 12.8 or later for full feature support, including native FP4 Tensor Core instructions and the new decompression engine APIs. The CUDA programming model is unchanged from Hopper; existing GPU kernels run without modification, though new kernels targeting Blackwell-specific features require recompilation against the sm_100 target architecture.
NVIDIA publishes Blackwell-optimized cuBLAS, cuDNN, and cuSPARSE libraries as part of the standard CUDA toolkit. These libraries expose FP4 and MXFP8 operations through standard API calls, allowing frameworks like PyTorch and JAX to use Blackwell's new precision formats without custom kernel development.
TensorRT-LLM is NVIDIA's inference framework for large language models, and it received extensive updates for Blackwell, including support for FP4 (NVFP4) quantized inference, FP4 KV-cache compression, and MoE-optimized dispatch kernels.
TensorRT-LLM exposes an AutoDeploy interface that can take a standard HuggingFace model checkpoint and automatically select quantization format, batch size, and parallelism strategy for best throughput on a given target (B200, GB200 NVL72, etc.).
NVIDIA's open-source Transformer Engine library provides high-level APIs for mixed-precision transformer training and inference. For Blackwell, it adds support for the architecture's new FP4 (NVFP4) and MXFP8 precision formats through the same APIs used for FP8 on Hopper.
The library is integrated into PyTorch, JAX, and Megatron-LM, making Blackwell's precision features accessible through standard framework APIs.
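A minimal sketch of what this integration looks like from PyTorch, using Transformer Engine's FP8 autocast path (which also runs on Hopper; Blackwell-specific recipes follow the same pattern). It assumes the transformer_engine package and an FP8-capable GPU:

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# Drop-in replacement for torch.nn.Linear with FP8-capable GEMMs
layer = te.Linear(4096, 4096, bias=True).cuda()
recipe = DelayedScaling(fp8_format=Format.HYBRID)  # E4M3 forward, E5M2 backward

x = torch.randn(16, 4096, device="cuda", requires_grad=True)
with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
    y = layer(x)       # GEMM executes in FP8 on the Tensor Cores
y.sum().backward()     # gradients flow through the backward FP8 format
```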
NVIDIA's AI inference microservice platform (NIM) ships pre-built containers optimized for common frontier models (Llama, Mistral, GPT-4 class models) running on Blackwell hardware. NIM containers handle model loading, quantization, batching, and serving without requiring users to write inference code. They are available through NVIDIA's cloud and partner channels.
The standalone B200 GPU module (SXM6 format) is priced at approximately $30,000 to $40,000 per unit at list price, with hyperscaler volume discounts available. The GB200 Superchip (one Grace + two B200s) carries an estimated price of $60,000 to $70,000. The DGX B200 8-GPU server is priced in the $280,000 to $320,000 range.
For comparison, the H100 SXM5 sold for approximately $25,000 to $35,000 per chip at launch, and the DGX H100 was priced around $300,000 to $400,000 depending on configuration. The B200's per-chip price is modestly higher than the H100's, while delivering substantially more compute, which gives it a better price-performance ratio for most AI workloads.
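Using midpoints of the quoted ranges, the price-performance claim can be made concrete. List prices vary by volume and vendor, so these figures are illustrative:

```python
# Midpoints of quoted list-price ranges, dense FP8 throughput from above
h100_price, h100_pflops = 30_000, 1.979   # H100 SXM5
b200_price, b200_pflops = 35_000, 4.5     # B200

print(f"H100: ${h100_price / h100_pflops:,.0f} per dense FP8 PFLOPS")  # ~$15,200
print(f"B200: ${b200_price / b200_pflops:,.0f} per dense FP8 PFLOPS")  # ~$7,800
```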
As of mid-2025, cloud rental pricing for B200 instances varied significantly by provider and commitment level:
| Provider | On-demand price | Notes |
|---|---|---|
| Lambda Labs | $3.79/hour | Per GPU |
| RunPod | $4.99-5.99/hour | Per GPU |
| Modal | $6.25/hour | Per GPU, serverless |
| AWS | $14.24/hour | Per GPU, on-demand |
| GCP | $18.53/hour | Per GPU, on-demand |
| Baseten | $9.98/hour | Per GPU, serverless |
Lambda Labs offered the lowest annual reserved rate at approximately $2.99/hour per GPU on a 3-year commitment. AWS and GCP command premiums over direct AI cloud providers due to their broader service ecosystems and enterprise support.
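A quick illustration of what the rate difference means over a sustained deployment, assuming full utilization and ignoring storage and networking charges:

```python
hours_per_year = 24 * 365
on_demand = 3.79 * hours_per_year   # Lambda Labs on-demand rate
reserved  = 2.99 * hours_per_year   # 3-year reserved rate

print(f"on-demand: ${on_demand:,.0f}/yr, reserved: ${reserved:,.0f}/yr")
print(f"savings: {1 - reserved / on_demand:.0%}")   # ~21%
```

At these rates, roughly one year of on-demand rental approaches the quoted list price of the B200 module itself.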
Cloud prices for B200 dropped roughly 6% between early and mid-2025 as supply increased with production ramp.
NVIDIA shipped the first B200 GPUs and GB200 NVL72 racks to major hyperscalers in late 2024, following a delayed ramp (described in more detail in the Production timeline section below). CoreWeave was the first cloud provider to offer GB200 NVL72 instances, announcing availability in February 2025 with systems already deployed for early customers including IBM, Mistral AI, and Cohere. CoreWeave made HGX B200 instances generally available on May 29, 2025.
AWS, Google Cloud, Microsoft Azure, and Oracle Cloud Infrastructure all began offering Blackwell-based instances through 2025.
CoreWeave was among the earliest and largest deployers of GB200 NVL72 systems. The company disclosed that it had deployed "thousands" of Blackwell GPUs by early 2025 and that its platform supported scaling to more than 110,000 B200 GPUs in multi-rack NVLink domain configurations. Early workloads on CoreWeave's Blackwell infrastructure included IBM's Granite model training and Mistral AI's inference serving.
Microsoft announced early commitments to deploy Blackwell GPUs at scale as part of its AI infrastructure investment program. Azure added Blackwell-powered instances through 2025, offering both HGX B200 (with standard x86 CPU hosts) and GB200 NVL72 rack configurations.
Oracle was among the hyperscalers receiving GB200 NVL72 racks in the initial late-2024 shipment wave. Oracle's GPU-optimized cloud segments have been a major consumer of NVIDIA's top data center products.
Meta received early GB200 NVL72 systems for AI research and production inference workloads. Meta has been one of NVIDIA's largest customers for data center GPUs across multiple product generations.
AWS integrated B200-based instances into its EC2 accelerated computing catalog and offered them through SageMaker for managed training and inference. AWS's on-demand pricing for B200 is higher than direct AI cloud providers but comes with the full AWS service ecosystem.
The B200's primary design target for training is large transformer models, particularly mixture-of-experts (MoE) architectures with trillions of parameters. The combination of high FP8 throughput (4.5 PFLOPS dense per GPU), large memory (192 GB), and high NVLink bandwidth (1.8 TB/s per GPU) addresses the three main bottlenecks in distributed training: compute, memory capacity for model parallelism, and all-reduce communication for gradient synchronization.
For frontier model training runs that require thousands of GPUs, the NVL72 rack's unified 72-GPU NVLink domain simplifies the network topology: communication within a rack happens over NVLink (at 130 TB/s aggregate) rather than InfiniBand, with InfiniBand only needed for cross-rack gradients. This reduces the volume of traffic on the InfiniBand fabric and improves overall training throughput.
Inference is the use case where the B200 shows its largest performance advantage over Hopper. Several factors combine:

- FP4 precision, which Hopper lacks entirely, doubles Tensor Core throughput relative to FP8
- the 192 GB memory capacity permits larger batches and longer context windows
- the 8 TB/s memory bandwidth accelerates the bandwidth-bound token-generation phase
- at rack scale, the 72-GPU NVLink domain keeps multi-GPU serving traffic on NVLink rather than InfiniBand
NVIDIA's own benchmark for real-time inference on a 1.8-trillion-parameter MoE model shows 30x faster throughput on GB200 NVL72 compared to an equivalent DGX H100 system.
With the growth of chain-of-thought and reasoning-optimized models (such as DeepSeek-R1, OpenAI's o-series, and similar approaches), inference workloads have become more compute-intensive relative to memory-bandwidth-bound token generation. Reasoning models generate long internal token sequences before producing a final answer, which shifts operational intensity toward compute. The B200's Tensor Core improvements therefore benefit reasoning inference more than standard generation, making it well-suited to serving these newer model classes.
NVIDIA's TensorRT-LLM includes a specific DeepSeek-R1 optimization for Blackwell that takes advantage of the B200's MoE-optimized dispatch kernels.
Beyond AI, the B200 retains strong double-precision floating-point performance (40 TFLOPS FP64) for traditional HPC workloads including molecular dynamics, climate simulation, and computational fluid dynamics. The large memory capacity is beneficial for HPC workloads that deal with large data sets that previously required multi-GPU memory partitioning.
Following the March 2024 GTC announcement, reports emerged in August 2024 that Blackwell production was delayed due to a design flaw in the B200 GPU. NVIDIA and TSMC identified the flaw, which affected manufacturing yield. Jensen Huang later confirmed that the issue was "functional" and "caused the yield to be low." NVIDIA worked with TSMC to re-spin layers of the B200 processor to correct the problem.
The Register and other outlets reported in August 2024 that Blackwell GPU shipments would be delayed into 2025, though NVIDIA disputed the severity of the delays.
NVIDIA began ramping Blackwell production in Q4 2024. The company committed to shipping Blackwell GPUs "worth several billion dollars" in that quarter. Analyst estimates placed Blackwell production volume at 750,000 to 800,000 units by Q1 2025.
During this period, Hopper continued to ship in large volumes as a bridge product while Blackwell ramped. NVIDIA extended Hopper's production timeline specifically to fill the supply gap.
Beyond the chip yield problem, early GB200 NVL72 system integrations encountered hardware challenges related to liquid cooling. Suppliers and system integrators disclosed in late 2024 that some GB200 rack systems suffered from overheating and liquid cooling leaks during integration and testing. These issues delayed final system qualification and shipment to customers.
NVIDIA's Taiwanese manufacturing partners announced at Computex 2025 that GB200 rack shipments had resumed and commenced at the end of Q1 2025 after the cooling issues were resolved.
By mid-2025, Blackwell production had stabilized and cloud providers were offering B200 and GB200 instances at commercial scale. The production ramp followed a familiar pattern for advanced semiconductor products: initial yields are low, qualifying production builds take time, and system integration issues extend the timeline beyond chip availability alone.
The delays were significant enough that NVIDIA's Hopper-generation products (H100 and H200) remained the primary datacenter GPUs through most of 2024 rather than being rapidly displaced after the March announcement.
The H100 was NVIDIA's dominant data center GPU from its 2022 launch through late 2024. It is built on a monolithic die with 80 billion transistors on TSMC's 4N process. The B200 outperforms the H100 in every relevant metric: 208B vs 80B transistors, 192GB vs 80GB memory, 8 TB/s vs 3.35 TB/s bandwidth, 4.5 PFLOPS vs 1.98 PFLOPS FP8 throughput, and adds FP4 support that the H100 entirely lacks.
For organizations still running H100 clusters in 2025, the upgrade case is most compelling for inference workloads on large models. Training workloads show a significant but less dramatic improvement because large training runs are bottlenecked by network communication as much as by single-GPU throughput.
The H200 was a mid-cycle memory upgrade to the H100, replacing the 80 GB HBM3 with 141 GB HBM3E at 4.8 TB/s while leaving the compute die unchanged. The B200 improves on the H200 substantially: 192 GB vs 141 GB memory, 8 TB/s vs 4.8 TB/s bandwidth, 4.5 PFLOPS vs 1.98 PFLOPS FP8, and native FP4 support vs none.
The B300, marketed as "Blackwell Ultra," is NVIDIA's follow-on to the B200 within the Blackwell architecture generation. It shipped beginning in January 2026. The B300 increases memory capacity to 288 GB HBM3E per GPU, raises FP4 throughput to approximately 14 PFLOPS dense (compared to 9 PFLOPS for the B200), and increases TDP to 1,400 W.
The B300's larger memory (288 GB vs 192 GB) is its primary advantage for inference: it can hold a full 70B-parameter model in BF16 with substantial room remaining for KV cache, whereas the B200 leaves tighter margins. The B300's 1,400 W TDP requires mandatory direct liquid cooling with less flexibility than the B200, which can operate in some air-cooled configurations at reduced TDP.
For organizations that ordered B200 systems in 2024 and 2025, the B300 represents a future upgrade path rather than an immediate displacement. The B200 remains competitive for the workloads it was designed for, and the two products coexist in cloud catalogs as of early 2026.