The NVIDIA B200 is a data center GPU based on the NVIDIA Blackwell microarchitecture, announced by NVIDIA CEO Jensen Huang at GTC 2024 in March 2024. It is the flagship accelerator of the Blackwell generation and the direct successor to the NVIDIA H100 and NVIDIA H200. The B200 features a dual-die design with 208 billion transistors, 192 GB of HBM3E memory, and a peak FP4 throughput of 20 petaFLOPS per GPU. It ships inside the GB200 Grace Blackwell Superchip (one Grace CPU plus two B200 GPUs), and reaches its maximum scale in the GB200 NVL72, a rack-scale system combining 72 B200 GPUs and 36 Grace CPUs into a unified 1.44 exaflops FP4 computing platform.
NVIDIA began shipping B200 GPUs and GB200 NVL72 racks to hyperscalers in late 2024, with wider cloud availability expanding through 2025. The B200 introduced native FP4 hardware acceleration for the first time in any NVIDIA GPU, along with second-generation Transformer Engine, fifth-generation NVLink, and fourth-generation NVSwitch technology. It was succeeded by the B300 (Blackwell Ultra) in early 2026.
NVIDIA announced the Blackwell GPU architecture and the B200 on March 18, 2024, at its GTC developer conference in San Jose, California. Jensen Huang, wearing his trademark leather jacket, pulled a Blackwell chip out of his pocket on stage and held it next to a Hopper chip, which it visibly dwarfed. The announcement framed Blackwell as the company's answer to escalating demand from large language model (LLM) training runs and inference at scale.
The B200 is named for David Harold Blackwell, a statistician and mathematician who was the first Black scholar inducted into the National Academy of Sciences. NVIDIA has followed the tradition of naming its data center GPU architectures after scientists and researchers.
Huang described the design philosophy with the phrase "we need bigger GPUs," acknowledging that the scale of modern AI workloads had outpaced what a single monolithic die could deliver. This observation drove the B200's defining architectural choice: a dual-die chiplet design that effectively doubles the transistor count compared to a single-die approach on the same process node.
At GTC 2024, NVIDIA simultaneously announced the GB200 Grace Blackwell Superchip, the GB200 NVL72 rack system, and a broader product lineup that included the B100 for less demanding deployments. The Blackwell platform was positioned as being capable of 4x faster training and up to 30x faster inference compared to Hopper-generation DGX H100 systems.
Blackwell builds on Hopper's foundation while introducing several architectural changes intended specifically for large-scale transformer model workloads. The architecture adds a second-generation Transformer Engine with support for FP4 precision, a new decompression engine, confidential computing enhancements, and a redesigned NVLink fabric.
The most visible change from Hopper is the move from a monolithic die to a multi-chip module (MCM) package. Each B200 GPU contains two separate dies, each manufactured at the maximum reticle size on TSMC's 4NP process, bonded together in a single SXM-style package. The two dies are connected by a proprietary chip-to-chip NVLink bridge delivering 10 TB/s of bidirectional bandwidth, which is fast enough that software treats the two dies as a single coherent GPU with a unified address space.
The dual-die approach gives NVIDIA access to substantially more transistors (208 billion, compared to 80 billion on the H100) without requiring an unrealistically large monolithic die. It also improves yield economics: two smaller dies can be manufactured at higher combined yield than one die twice as large.
For context, the H200 uses the same 80-billion-transistor die as the H100, and NVIDIA's earlier A100 carried approximately 54 billion transistors.
Blackwell introduces fifth-generation Tensor Cores with a notable addition: native hardware support for FP4 (NVFP4) precision. The H100 and H200 support FP8 natively but handle anything below it only through software emulation or quantization into FP8 kernels; the B200 adds dedicated FP4 multiply-accumulate units in hardware. This is significant because FP4 allows twice as many operations per clock as FP8 while keeping memory footprints small enough to serve large batches.
The B200 also introduces support for MXFP8 (microscaling FP8) in addition to the standard FP8 format that appeared in Hopper. MXFP8 uses per-block scaling factors that improve accuracy compared to tensor-wide FP8 quantization, making it easier to maintain model quality during inference.
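The benefit of per-block scaling is easy to demonstrate with a toy quantizer. The sketch below rounds to an integer grid rather than the actual FP8 value grid, so it illustrates the shared-scale principle rather than the hardware's encoding; the block size of 32 follows the OCP microscaling convention.

```python
import numpy as np

def quantize_dequantize(x, block_size, levels=127):
    """Toy shared-scale quantizer: each block of values shares one scale
    factor, then values are rounded to an integer grid. This mimics the
    principle behind MX block scaling (real MXFP8 rounds to an FP8 grid
    with a shared power-of-two scale per 32-element block)."""
    x = x.reshape(-1, block_size)
    scale = np.abs(x).max(axis=1, keepdims=True) / levels
    return (np.round(x / scale) * scale).ravel()

rng = np.random.default_rng(0)
x = rng.normal(size=4096)
x[::512] *= 1000  # a few large outliers, typical of activation tensors

err_block  = np.abs(quantize_dequantize(x, 32) - x).mean()
err_tensor = np.abs(quantize_dequantize(x, x.size) - x).mean()
print(f"per-block scale error:  {err_block:.5f}")   # outliers only degrade their own block
print(f"tensor-wide scale error: {err_tensor:.5f}") # one outlier degrades every value's resolution
```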
The Transformer Engine, first introduced in the H100, is upgraded in Blackwell to support FP4 inference in addition to FP8. The engine dynamically adjusts the precision of computations within attention and feed-forward layers, choosing the highest precision compatible with accuracy constraints. In Blackwell the engine can, for example, run attention queries and keys in FP4 while keeping values in FP8, allowing aggressive compression of KV caches during LLM inference.
NVIDIA reports that the Transformer Engine's FP4 mode can effectively double throughput on inference workloads relative to FP8, at the cost of slight accuracy degradation that is typically acceptable in production serving scenarios.
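The memory saving from quantizing a KV cache is straightforward to estimate. The model dimensions below (80 layers, 8 grouped-query KV heads of dimension 128, roughly the shape of a 70B-class model) are illustrative assumptions, not a specific product's configuration:

```python
def kv_cache_gb(layers, kv_heads, head_dim, seq_len, batch, bits):
    """Bytes for keys and values (the leading factor of 2) across all
    layers, converted to GB."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bits / 8 / 1e9

args = dict(layers=80, kv_heads=8, head_dim=128, seq_len=32_768, batch=32)
print(f"FP8 KV cache: {kv_cache_gb(**args, bits=8):.0f} GB")  # ~172 GB
print(f"FP4 KV cache: {kv_cache_gb(**args, bits=4):.0f} GB")  # ~86 GB
```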
Blackwell adds a dedicated hardware decompression unit capable of processing LZ4, Deflate, and Snappy compressed data at up to 800 GB/s. This is aimed primarily at database and analytics workloads where compressed data must be decompressed before computation. NVIDIA claims this delivers 18x faster database query processing compared to CPU-based decompression.
Blackwell adds a second-generation confidential computing mode that allows GPU workloads to run in a hardware-isolated enclave, with data encrypted in memory and protected from the host system, hypervisor, and other tenants. This is aimed at regulated industries handling sensitive data in cloud environments.
| Specification | Value |
|---|---|
| Architecture | Blackwell (NVIDIA) |
| Die configuration | Dual-die MCM (two chiplets) |
| Process node | TSMC 4NP |
| Transistors | 208 billion |
| Streaming Multiprocessors | 160 SMs |
| CUDA cores | 20,480 |
| FP4 Tensor Core (sparse) | 20 PFLOPS |
| FP4 Tensor Core (dense) | 9 PFLOPS |
| FP8 Tensor Core (sparse) | 9 PFLOPS |
| FP8 Tensor Core (dense) | 4.5 PFLOPS |
| BF16/FP16 Tensor Core | 2.25 PFLOPS (dense) |
| TF32 Tensor Core | 1.125 PFLOPS (dense) |
| FP64 | 40 TFLOPS |
| Memory capacity | 192 GB HBM3E |
| Memory bandwidth | 8 TB/s |
| Memory bus width | 8192-bit |
| NVLink version | 5th generation |
| NVLink bandwidth (per GPU) | 1.8 TB/s bidirectional |
| PCIe interface | PCIe Gen 6 |
| Form factor | SXM6 |
| TDP (SXM form factor) | 1,000 W |
| Cooling | Direct liquid cooling required |
The B200 uses HBM3E (enhanced third-generation High Bandwidth Memory) arranged in eight stacks of 24 GB each; each stack presents a 1024-bit interface, for an aggregate 8192-bit bus. The total is 192 GB per GPU with 8 TB/s of peak bandwidth.
This represents a substantial improvement over the H100's memory subsystem. The H100 SXM5 variant carries 80 GB of HBM3 with 3.35 TB/s bandwidth; the H200 SXM variant increased this to 141 GB HBM3E at 4.8 TB/s. The B200 nearly doubles the H200's capacity and improves bandwidth by roughly 67%.
HBM stacks in the B200 use a 3D-stacked construction where DRAM dies are stacked vertically and connected to the logic die via through-silicon vias (TSVs). The 8192-bit bus is the widest memory interface deployed in any production GPU as of 2024, and it is what allows the 8 TB/s figure despite the relatively modest per-pin data rate of HBM3E.
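The relationship between bus width, per-pin rate, and aggregate bandwidth is simple arithmetic; working backward from the quoted figures:

```python
# Solve for the per-pin data rate implied by 8 TB/s over an 8192-bit bus.
bus_bits = 8 * 1024               # eight HBM3E stacks, 1024 bits each
peak_bytes_per_s = 8e12           # the quoted 8 TB/s
pin_rate_gbps = peak_bytes_per_s * 8 / bus_bits / 1e9
print(f"per-pin rate: {pin_rate_gbps:.2f} Gb/s")  # ~7.81 Gb/s, modest by DRAM standards
```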
The large memory capacity matters most for inference workloads that involve long context windows or serve many concurrent users. A 70-billion-parameter model in BF16 requires roughly 140 GB just for weights, which fits within a single B200's 192 GB but would require tensor-parallel splitting across two H100s (each at 80 GB). The B200's larger pool reduces the need for model parallelism in many practical deployment scenarios, which in turn reduces communication overhead and latency.
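The arithmetic behind these sizing claims fits in a small helper; the figures below count weights only, ignoring KV cache, activations, and framework overhead:

```python
def weight_memory_gb(params_billion, bits_per_param):
    """Memory needed just for model weights."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

# 70B model in BF16 (16 bits/param): 140 GB -> fits on one 192 GB B200,
# but needs tensor parallelism across two 80 GB H100s.
print(weight_memory_gb(70, 16))    # 140.0

# 1.8T-parameter model in FP4 (4 bits/param): 900 GB -> fits comfortably
# within an NVL72 rack's ~13.4 TB of aggregate HBM3E.
print(weight_memory_gb(1800, 4))   # 900.0
```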
FP4 (4-bit floating point) is the headline new precision format in Blackwell. The B200 delivers 20 PFLOPS of sparse FP4 throughput (where sparsity refers to the structured 2:4 pattern, which doubles effective throughput when two of every four weights are zero) and 9 PFLOPS dense. The H100 and H200 have no native FP4 hardware support.
In practice, FP4 is used primarily for inference rather than training. At this precision, model weights consume half as much memory and bandwidth as FP8, allowing larger batch sizes and lower per-token latency. NVIDIA's benchmarks for LLM serving show the B200 achieving roughly 15x higher token throughput per rack compared to an equivalent Hopper system when running at FP4 precision.
FP8 was introduced in the H100 and remains the primary training precision in Blackwell. The B200 delivers approximately 4,500 TFLOPS (4.5 PFLOPS) of dense FP8 throughput, compared to roughly 1,979 TFLOPS on the H100. This represents a 2.3x improvement at the same precision level on a per-chip basis.
For training large transformer models, FP8 is now the dominant format at the frontier. The higher FP8 throughput translates fairly directly into faster training runs for a given model size and batch configuration.
BF16 and FP16 are the default formats for many training and fine-tuning workloads due to their balance of range and precision. The B200 delivers approximately 2,250 TFLOPS dense BF16 throughput, compared to 989 TFLOPS on the H100 SXM5. This is again roughly a 2.3x improvement at the per-GPU level.
| Metric | H100 SXM5 | H200 SXM5 | B200 |
|---|---|---|---|
| FP8 Tensor (dense) | 1,979 TFLOPS | 1,979 TFLOPS | 4,500 TFLOPS |
| FP4 Tensor (dense) | N/A | N/A | 9,000 TFLOPS |
| BF16 Tensor (dense) | 989 TFLOPS | 989 TFLOPS | 2,250 TFLOPS |
| Memory capacity | 80 GB HBM3 | 141 GB HBM3E | 192 GB HBM3E |
| Memory bandwidth | 3.35 TB/s | 4.8 TB/s | 8 TB/s |
| NVLink bandwidth | 900 GB/s | 900 GB/s | 1.8 TB/s |
| TDP | 700 W | 700 W | 1,000 W |
| Transistors | 80 B | 80 B | 208 B |
The performance gap is larger for inference than training. NVIDIA's own comparison numbers show 4x faster training and 15-30x faster inference depending on model size and batch configuration. The large inference advantage comes from the combination of FP4 precision (unavailable on Hopper), larger memory (which allows bigger batches and longer contexts), and higher bandwidth.
The GB200 is not a standalone GPU product but a multi-chip module that combines one NVIDIA Grace CPU with two B200 GPUs on a single package. The "G" prefix denotes Grace, NVIDIA's Arm-based server CPU.
The Grace CPU in the GB200 is based on the Arm Neoverse V2 core, with 72 cores per CPU and 480 GB of LPDDR5X memory delivering roughly 500 GB/s of bandwidth. It connects to the two B200 GPUs via NVLink-C2C (Chip-to-Chip), a proprietary high-bandwidth interconnect that delivers 900 GB/s bidirectional bandwidth between the CPU and each GPU. This is far faster than any PCIe connection: a PCIe Gen 5 x16 link peaks at roughly 128 GB/s bidirectional, and even PCIe Gen 6 does not reach NVLink-C2C speeds.
The NVLink-C2C connection enables cache-coherent memory access between the Grace CPU and the B200 GPUs. This means GPU kernels can directly access CPU memory without explicit DMA transfers, and the CPU can read GPU memory without staging copies. In practice, this simplifies programming and reduces latency for workloads that alternate between CPU preprocessing and GPU inference.
Each GB200 Superchip (one Grace + two B200s) delivers 40 PFLOPS FP4 AI performance and carries 864 GB of total memory (384 GB of HBM3E across the two B200s plus 480 GB of LPDDR5X on the Grace, all coherently accessible).
The GB200 is the building block of all rack-scale Blackwell deployments. It is not sold as a standalone consumer or OEM product; instead it ships exclusively inside NVL36, NVL72, and other configured rack systems.
The GB200 NVL72 is a rack-scale system that combines 36 GB200 Superchips (18 dual-node compute trays) into a single NVLink domain. The result is 72 B200 GPUs and 36 Grace CPUs in one rack, all connected via fifth-generation NVLink through a set of NVSwitch chips.
The NVL72 rack houses:

- 18 compute trays, each carrying two GB200 Superchips (36 Superchips in total, for 72 B200 GPUs and 36 Grace CPUs)
- 9 NVSwitch trays containing 18 fourth-generation NVSwitch chips
- a passive copper NVLink cable cartridge spine connecting the compute and switch trays
- direct-to-chip liquid cooling distribution for every tray
The total rack power draw runs to approximately 120-130 kW for the full NVL72 configuration, requiring dedicated power delivery and direct-to-chip liquid cooling. This power density is substantially higher than Hopper-generation DGX H100 racks (which typically ran at 10-14 kW per 8-GPU node), and it requires purpose-built data center infrastructure or significant retrofit work.
Every B200 in the NVL72 has 18 NVLink ports, each running at 100 GB/s. The 9 NVSwitch trays (18 NVSwitch chips total) provide full any-to-any connectivity: any GPU can communicate with any other GPU in the rack with a single hop through one NVSwitch. The aggregate NVLink bandwidth across all 72 GPUs in the rack is 130 TB/s.
The single-hop topology is an advantage over multi-hop alternatives. When training a model with tensor parallelism across all 72 GPUs, all-reduce operations require only one NVLink hop regardless of which two GPUs are communicating. This keeps latency predictable and avoids the bandwidth bottlenecks that occur at higher-hop network switches.
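A first-order model shows what these figures mean for collective operations. The 100 GB gradient buffer below is an illustrative assumption, and real collectives overlap communication with compute, so this is a wire-time sketch rather than a benchmark:

```python
n_gpus = 72
per_gpu_GBs = 1800.0  # NVLink 5: 18 ports x 100 GB/s bidirectional

# Aggregate NVLink bandwidth across the rack
print(f"aggregate: {n_gpus * per_gpu_GBs / 1000:.0f} TB/s")  # ~130 TB/s

# Ring all-reduce moves 2*(N-1)/N * S bytes through each GPU.
S_GB = 100.0  # assumed gradient buffer size
wire_time_ms = 2 * (n_gpus - 1) / n_gpus * S_GB / per_gpu_GBs * 1000
print(f"~{wire_time_ms:.0f} ms per all-reduce")  # ~110 ms
```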
Beyond a single NVL72 rack, NVIDIA supports NVLink domains of up to 576 GPUs by connecting multiple racks. At that scale, the aggregate NVLink bandwidth reaches over 1 petabyte per second.
| Metric | GB200 NVL72 | DGX H100 (8 GPU) | Ratio |
|---|---|---|---|
| FP4 (NVFP4) | 1,440 PFLOPS | N/A | -- |
| FP8 | 720 PFLOPS | ~16 PFLOPS | ~45x |
| BF16 | 360 PFLOPS | ~8 PFLOPS | ~45x |
| GPU memory | 13.4 TB HBM3E | 640 GB | ~21x |
| NVLink bandwidth | 130 TB/s | 7.2 TB/s | ~18x |
| LLM training (GPT-MoE-1.8T) | -- | -- | 4x faster |
| LLM inference (1.8T params) | -- | -- | 30x faster |
The 30x inference advantage for trillion-parameter models reflects both the FP4 precision benefit (2x throughput vs FP8) and the memory capacity advantage, which allows the entire trillion-parameter model to fit in the rack's 13.4 TB aggregate GPU memory without offloading.
Alongside the NVL72 rack product, NVIDIA and its server partners offer systems built around standard SXM-format B200 GPUs (without the Grace CPU), following the same DGX/HGX product line structure used in previous generations.
The HGX B200 is a baseboard module containing eight B200 GPUs interconnected via fifth-generation NVLink and NVSwitch. It is the GPU subsystem equivalent of the HGX H100, designed to be integrated into server chassis by OEM partners including Dell, HPE, Lenovo, and Supermicro. The HGX B200 module connects to host CPUs (typically AMD EPYC or Intel Xeon) via PCIe Gen 6.
In an 8-GPU HGX B200 system, the GPUs are connected by two NVSwitch chips: each GPU runs nine of its 18 NVLink ports to each switch, preserving the full 1.8 TB/s of bidirectional NVLink bandwidth per GPU for intra-node communication. GPU-to-GPU communication between nodes uses InfiniBand or Ethernet.
Cloud providers that want standard x86-based AI servers with Blackwell GPUs use HGX B200. CoreWeave made HGX B200 instances generally available in May 2025. Lambda Labs, RunPod, and other AI cloud providers also offer HGX B200 instances.
The DGX B200 is NVIDIA's own fully integrated server product built around eight B200 GPUs. It includes the chassis, power supply, CPU (Intel Xeon), system memory, NVMe storage, and networking in a single 10U rackmount system. It is the turnkey option for organizations that want NVIDIA-validated hardware and don't want to source individual components from multiple vendors.
Key DGX B200 system specifications:
| Component | Specification |
|---|---|
| GPUs | 8x NVIDIA B200 (SXM6) |
| CPU | 2x Intel Xeon Platinum |
| System memory | 2 TB DDR5 |
| GPU memory | 1.44 TB (8 x 180 GB) |
| GPU-to-GPU | NVLink 5 + NVSwitch |
| Network | 8x 400Gb InfiniBand or Ethernet |
| System form factor | 10U |
| System TDP | ~14.3 kW |
| FP8 training performance | ~72 PFLOPS |
| FP4 inference performance | ~144 PFLOPS |
NVIDIA quotes the DGX B200 as delivering 3x faster training and 15x faster inference versus the DGX H100 at the 8-GPU node level. The gap is smaller at this scale than at NVL72 scale partly because these x86-based systems lack the Grace CPU and NVLink-C2C advantages of the Superchip design.
DGX B200 pricing was reported in the $280,000 to $320,000 range depending on configuration and vendor, compared to approximately $300,000 to $400,000 for a DGX H100.
NVLink 5 doubles the per-link bandwidth compared to NVLink 4 (used in Hopper). Each NVLink 5 link runs at 100 GB/s bidirectional. The B200 GPU has 18 NVLink ports, giving a maximum per-GPU NVLink bandwidth of 1.8 TB/s, compared to 900 GB/s on the H100 (18 NVLink 4 links at 50 GB/s each).
NVLink 5 also extends NVLink domains across racks using copper cable cartridges rather than optical modules. NVIDIA deliberately chose passive copper connections for within-rack connectivity in the NVL72 to reduce cost, latency, and power consumption versus active optical cables, though optical is required for longer distances between racks.
NVSwitch 4 is the switch chip that enables any-to-any GPU communication within the NVL72 rack. Each NVSwitch 4 chip has 72 NVLink ports, enough to connect all 72 GPUs in the rack to a single switch hop. The 18 NVSwitch chips in the NVL72 rack provide 130 TB/s of total switch bandwidth.
NVSwitch 4 supports NVIDIA's SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) in-network computing feature, which allows all-reduce operations to be executed inside the switch fabric rather than requiring all data to be routed back to individual GPUs. This reduces the effective bandwidth needed for gradient synchronization during training.
Blackwell GPUs require CUDA 12.8 or later for full feature support, including native FP4 Tensor Core instructions and the new decompression engine APIs. The CUDA programming model is unchanged from Hopper; existing GPU kernels run without modification, though new kernels targeting Blackwell-specific features require recompilation against the sm_100 target architecture.
NVIDIA publishes Blackwell-optimized cuBLAS, cuDNN, and cuSPARSE libraries as part of the standard CUDA toolkit. These libraries expose FP4 and MXFP8 operations through standard API calls, allowing frameworks like PyTorch and JAX to use Blackwell's new precision formats without custom kernel development.
TensorRT-LLM is NVIDIA's inference framework for large language models, and it received extensive updates for Blackwell, including support for FP4 (NVFP4) quantized inference, FP4 KV-cache compression, and MoE-optimized dispatch kernels.
TensorRT-LLM exposes an AutoDeploy interface that can take a standard HuggingFace model checkpoint and automatically select quantization format, batch size, and parallelism strategy for best throughput on a given target (B200, GB200 NVL72, etc.).
NVIDIA's open-source Transformer Engine library provides high-level APIs for mixed-precision transformer training and inference. For Blackwell, it adds support for the architecture's new FP4 (NVFP4) and MXFP8 precision formats through the same APIs used for FP8 on Hopper.
The library is integrated into PyTorch, JAX, and Megatron-LM, making Blackwell's precision features accessible through standard framework APIs.
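A minimal sketch of what this integration looks like from PyTorch, using Transformer Engine's FP8 autocast path (which also runs on Hopper; Blackwell-specific recipes follow the same pattern). It assumes the transformer_engine package and an FP8-capable GPU:

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# Drop-in replacement for torch.nn.Linear with FP8-capable GEMMs
layer = te.Linear(4096, 4096, bias=True).cuda()
recipe = DelayedScaling(fp8_format=Format.HYBRID)  # E4M3 forward, E5M2 backward

x = torch.randn(16, 4096, device="cuda", requires_grad=True)
with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
    y = layer(x)       # GEMM executes in FP8 on the Tensor Cores
y.sum().backward()     # gradients flow through the backward FP8 format
```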
NVIDIA's AI inference microservice platform (NIM) ships pre-built containers optimized for common frontier models (Llama, Mistral, GPT-4 class models) running on Blackwell hardware. NIM containers handle model loading, quantization, batching, and serving without requiring users to write inference code. They are available through NVIDIA's cloud and partner channels.
The standalone B200 GPU module (SXM6 format) is priced at approximately $30,000 to $40,000 per unit at list price, with hyperscaler volume discounts available. The GB200 Superchip (one Grace + two B200s) carries an estimated price of $60,000 to $70,000. The DGX B200 8-GPU server is priced in the $280,000 to $320,000 range.
For comparison, the H100 SXM5 sold for approximately $25,000 to $35,000 per chip at launch, and the DGX H100 was priced around $300,000 to $400,000 depending on configuration. The B200's per-chip price is modestly higher than the H100's, while delivering substantially more compute, which gives it a better price-performance ratio for most AI workloads.
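Using midpoints of the quoted ranges, the price-performance claim can be made concrete. List prices vary by volume and vendor, so these figures are illustrative:

```python
# Midpoints of quoted list-price ranges, dense FP8 throughput from above
h100_price, h100_pflops = 30_000, 1.979   # H100 SXM5
b200_price, b200_pflops = 35_000, 4.5     # B200

print(f"H100: ${h100_price / h100_pflops:,.0f} per dense FP8 PFLOPS")  # ~$15,200
print(f"B200: ${b200_price / b200_pflops:,.0f} per dense FP8 PFLOPS")  # ~$7,800
```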
As of mid-2025, cloud rental pricing for B200 instances varied significantly by provider and commitment level:
| Provider | On-demand price | Notes |
|---|---|---|
| Lambda Labs | $3.79/hour | Per GPU |
| RunPod | $4.99-5.99/hour | Per GPU |
| Modal | $6.25/hour | Per GPU, serverless |
| AWS | $14.24/hour | Per GPU, on-demand |
| GCP | $18.53/hour | Per GPU, on-demand |
| Baseten | $9.98/hour | Per GPU, serverless |
Lambda Labs offered the lowest annual reserved rate at approximately $2.99/hour per GPU on a 3-year commitment. AWS and GCP command premiums over direct AI cloud providers due to their broader service ecosystems and enterprise support.
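A quick illustration of what the rate difference means over a sustained deployment, assuming full utilization and ignoring storage and networking charges:

```python
hours_per_year = 24 * 365
on_demand = 3.79 * hours_per_year   # Lambda Labs on-demand rate
reserved  = 2.99 * hours_per_year   # 3-year reserved rate

print(f"on-demand: ${on_demand:,.0f}/yr, reserved: ${reserved:,.0f}/yr")
print(f"savings: {1 - reserved / on_demand:.0%}")   # ~21%
```

At these rates, roughly one year of on-demand rental approaches the quoted list price of the B200 module itself.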
Cloud prices for B200 dropped roughly 6% between early and mid-2025 as supply increased with production ramp.
NVIDIA shipped the first B200 GPUs and GB200 NVL72 racks to major hyperscalers in late 2024, following a delayed ramp (described in more detail in the Production timeline section below). CoreWeave was the first cloud provider to offer GB200 NVL72 instances, announcing availability in February 2025 with systems already deployed for early customers including IBM, Mistral AI, and Cohere. CoreWeave made HGX B200 instances generally available on May 29, 2025.
AWS, Google Cloud, Microsoft Azure, and Oracle Cloud Infrastructure all began offering Blackwell-based instances through 2025.
CoreWeave was among the earliest and largest deployers of GB200 NVL72 systems. The company disclosed that it had deployed "thousands" of Blackwell GPUs by early 2025 and that its platform supported scaling to more than 110,000 B200 GPUs in multi-rack NVLink domain configurations. Early workloads on CoreWeave's Blackwell infrastructure included IBM's Granite model training and Mistral AI's inference serving.
Microsoft announced early commitments to deploy Blackwell GPUs at scale as part of its AI infrastructure investment program. Azure added Blackwell-powered instances through 2025, offering both HGX B200 (with standard x86 CPU hosts) and GB200 NVL72 rack configurations.
Oracle was among the hyperscalers receiving GB200 NVL72 racks in the initial late-2024 shipment wave. Oracle's GPU-optimized cloud segments have been a major consumer of NVIDIA's top data center products.
Meta received early GB200 NVL72 systems for AI research and production inference workloads. Meta has been one of NVIDIA's largest customers for data center GPUs across multiple product generations.
AWS integrated B200-based instances into its EC2 accelerated computing catalog and offered them through SageMaker for managed training and inference. AWS's on-demand pricing for B200 is higher than direct AI cloud providers but comes with the full AWS service ecosystem.
The B200's primary design target for training is large transformer models, particularly mixture-of-experts (MoE) architectures with trillions of parameters. The combination of high FP8 throughput (4.5 PFLOPS dense per GPU), large memory (192 GB), and high NVLink bandwidth (1.8 TB/s per GPU) addresses the three main bottlenecks in distributed training: compute, memory capacity for model parallelism, and all-reduce communication for gradient synchronization.
For frontier model training runs that require thousands of GPUs, the NVL72 rack's unified 72-GPU NVLink domain simplifies the network topology: communication within a rack happens over NVLink (at 130 TB/s aggregate) rather than InfiniBand, with InfiniBand only needed for cross-rack gradients. This reduces the volume of traffic on the InfiniBand fabric and improves overall training throughput.
Inference is the use case where the B200 shows its largest performance advantage over Hopper. Several factors combine:

- FP4 precision, which Hopper lacks entirely, doubles Tensor Core throughput relative to FP8
- the 192 GB memory capacity permits larger batches and longer context windows
- the 8 TB/s memory bandwidth accelerates the bandwidth-bound token-generation phase
- at rack scale, the 72-GPU NVLink domain keeps multi-GPU serving traffic on NVLink rather than InfiniBand
NVIDIA's own benchmark for real-time inference on a 1.8-trillion-parameter MoE model shows 30x faster throughput on GB200 NVL72 compared to an equivalent DGX H100 system.
With the growth of chain-of-thought and reasoning-optimized models (such as DeepSeek-R1, OpenAI's o-series, and similar approaches), inference workloads have become more compute-intensive relative to memory-bandwidth-bound token generation. Reasoning models generate long internal token sequences before producing a final answer, which shifts operational intensity toward compute. The B200's Tensor Core improvements therefore benefit reasoning inference more than standard generation, making it well-suited to serving these newer model classes.
NVIDIA's TensorRT-LLM includes a specific DeepSeek-R1 optimization for Blackwell that takes advantage of the B200's MoE-optimized dispatch kernels.
Beyond AI, the B200 retains strong double-precision floating-point performance (40 TFLOPS FP64) for traditional HPC workloads including molecular dynamics, climate simulation, and computational fluid dynamics. The large memory capacity is beneficial for HPC workloads that deal with large data sets that previously required multi-GPU memory partitioning.
Following the March 2024 GTC announcement, reports emerged in August 2024 that Blackwell production was delayed due to a design flaw in the B200 GPU. NVIDIA and TSMC identified the flaw, which affected manufacturing yield. Jensen Huang later confirmed that the issue was "functional" and "caused the yield to be low." NVIDIA worked with TSMC to re-spin layers of the B200 processor to correct the problem.
The Register and other outlets reported in August 2024 that Blackwell GPU shipments would be delayed into 2025, though NVIDIA disputed the severity of the delays.
NVIDIA began ramping Blackwell production in Q4 2024. The company committed to shipping Blackwell GPUs "worth several billion dollars" in that quarter. Analyst estimates placed Blackwell production volume at 750,000 to 800,000 units by Q1 2025.
During this period, Hopper continued to ship in large volumes as a bridge product while Blackwell ramped. NVIDIA extended Hopper's production timeline specifically to fill the supply gap.
Beyond the chip yield problem, early GB200 NVL72 system integrations encountered hardware challenges related to liquid cooling. Suppliers and system integrators disclosed in late 2024 that some GB200 rack systems suffered from overheating and liquid cooling leaks during integration and testing. These issues delayed final system qualification and shipment to customers.
NVIDIA's Taiwanese manufacturing partners announced at Computex 2025 that GB200 rack shipments had resumed and commenced at the end of Q1 2025 after the cooling issues were resolved.
By mid-2025, Blackwell production had stabilized and cloud providers were offering B200 and GB200 instances at commercial scale. The production ramp followed a familiar pattern for advanced semiconductor products: initial yields are low, qualifying production builds take time, and system integration issues extend the timeline beyond chip availability alone.
The delays were significant enough that NVIDIA's Hopper-generation products (H100 and H200) remained the primary datacenter GPUs through most of 2024 rather than being rapidly displaced after the March announcement.
The H100 was NVIDIA's dominant data center GPU from its 2022 launch through late 2024. It is built on a monolithic die with 80 billion transistors on TSMC's 4N process. The B200 outperforms the H100 in every relevant metric: 208B vs 80B transistors, 192GB vs 80GB memory, 8 TB/s vs 3.35 TB/s bandwidth, 4.5 PFLOPS vs 1.98 PFLOPS FP8 throughput, and adds FP4 support that the H100 entirely lacks.
For organizations still running H100 clusters in 2025, the upgrade case is most compelling for inference workloads on large models. Training workloads show a significant but less dramatic improvement because large training runs are bottlenecked by network communication as much as by single-GPU throughput.
The H200 was a mid-cycle memory upgrade to the H100, replacing the 80 GB HBM3 with 141 GB HBM3E at 4.8 TB/s while leaving the compute die unchanged. The B200 improves on the H200 substantially: 192 GB vs 141 GB memory, 8 TB/s vs 4.8 TB/s bandwidth, 4.5 PFLOPS vs 1.98 PFLOPS FP8, and native FP4 support vs none.
The B300, marketed as "Blackwell Ultra," is NVIDIA's follow-on to the B200 within the Blackwell architecture generation. It shipped beginning in January 2026. The B300 increases memory capacity to 288 GB HBM3E per GPU, raises FP4 throughput to approximately 14 PFLOPS dense (compared to 9 PFLOPS for the B200), and increases TDP to 1,400 W.
The B300's larger memory (288 GB vs 192 GB) is its primary advantage for inference: it can hold a full 70B-parameter model in BF16 with substantial room remaining for KV cache, whereas the B200 leaves tighter margins. The B300's 1,400 W TDP requires mandatory direct liquid cooling with less flexibility than the B200, which can operate in some air-cooled configurations at reduced TDP.
For organizations that ordered B200 systems in 2024 and 2025, the B300 represents a future upgrade path rather than an immediate displacement. The B200 remains competitive for the workloads it was designed for, and the two products coexist in cloud catalogs as of early 2026.