NVIDIA H200
Last reviewed
May 8, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 6,135 words
The NVIDIA H200 is a Tensor Core GPU for AI and high-performance computing, built on the same Hopper microarchitecture and GH100 silicon as the NVIDIA H100. It is the first GPU to ship with HBM3e memory, packing 141 GB at 4.8 TB/s on the SXM and NVL boards. The compute pipeline is functionally identical to the H100. The whole point of the H200 is the memory: more capacity, more bandwidth, and a noticeably better fit for large language model inference where the KV cache and weight tensors are the bottleneck rather than the math units.
NVIDIA announced the H200 on November 13, 2023 at SC23 (the SC supercomputing conference) in Denver. First customer shipments started in Q2 2024 through the major server OEMs and cloud providers. The H200 sits between the H100 and the Blackwell B100 and B200 in NVIDIA's data-center roadmap. Many organizations are still buying it through 2025 and into 2026 because Blackwell supply is constrained and because 141 GB is enough to hold a 70-billion-parameter model on a single GPU: the weights take roughly 140 GB in 16-bit precision, or roughly 70 GB in FP8 with room left over for the KV cache.
| Field | Value |
|---|---|
| Type | Data center GPU accelerator |
| Microarchitecture | Hopper |
| Die | GH100 |
| Process | TSMC custom 4N |
| Transistors | ~80 billion |
| Die size | 814 mm² |
| Memory | 141 GB HBM3e |
| Memory bandwidth | 4.8 TB/s |
| Interconnect | NVLink 4 (900 GB/s); PCIe Gen5 x16 |
| Form factors | H200 SXM (SXM5), H200 NVL (PCIe), GH200 superchip variant |
| TDP | 700 W (SXM); 600 W (NVL); configurable up to 1,000 W |
| Announced | November 13, 2023 (SC23, Denver) |
| First customer shipments | Q2 2024 |
| Predecessor | NVIDIA H100 |
| Successor | NVIDIA B100, NVIDIA B200 |
The H200 is best understood as a memory refresh of the H100 rather than a new architecture. NVIDIA kept the GH100 die: the same 814 mm² of silicon fabricated on TSMC's custom 4N process, the same 80 billion transistors, the same 132 streaming multiprocessors, the same fourth-generation Tensor Cores, and the same Transformer Engine for FP8 and bfloat16 dynamic precision switching. What changed is the memory subsystem. The H100 SXM uses five active 16 GB HBM3 stacks for 80 GB at 3.35 TB/s. The H200 uses six active 24 GB HBM3e stacks for 141 GB at 4.8 TB/s. The GH100 package has six HBM stack sites; the H100 leaves one of them inactive, while the H200 enables all six.
That memory upgrade matters for two workloads in particular:
For compute-bound workloads, the H200 and H100 perform identically.
| Event | Date |
|---|---|
| Announcement at SC23, Denver | November 13, 2023 |
| First HBM3e shipments from Micron | Q1 2024 |
| First customer system shipments | Q2 2024 |
| First broad cloud availability (AWS, Azure, GCP, OCI, CoreWeave, Lambda) | Throughout 2024 |
| H200 NVL PCIe variant launched | Late 2024 |
| Successor B200 (Blackwell) ramp | 2024 to 2025 |
| Continuing fleet expansion at neoclouds | 2025 to 2026 |
NVIDIA listed twelve initial server-OEM partners at SC23: ASRock Rack, ASUS, Dell Technologies, Eviden, GIGABYTE, Hewlett Packard Enterprise, Ingrasys, Lenovo, QCT, Supermicro, Wistron, and Wiwynn. The first cloud service providers to publicly commit to H200 instances were AWS (P5e), Google Cloud (A3 Ultra), Microsoft Azure (ND H200 v5), Oracle Cloud Infrastructure, CoreWeave, Lambda, and Vultr.
Micron Technology was the initial HBM3e supplier. Micron put 24 GB 8-high HBM3e stacks into volume production in early 2024 specifically for the H200. SK hynix qualified shortly after and Samsung followed; by mid-2024 NVIDIA was sourcing HBM3e from multiple vendors. The HBM3e supply ramp was the gating factor for H200 production through most of 2024, and the same constraint later limited Blackwell ramp through 2025.
The announcement at SC23 was deliberately scheduled for the supercomputing community rather than a typical NVIDIA enterprise event. Ian Buck, NVIDIA's vice president of hyperscale and HPC, framed the H200 in the announcement keynote as a way to keep the Hopper platform competitive for another product cycle while Blackwell finished its tape-out and packaging qualification. That positioning held up. By the time Blackwell parts started reaching customers in volume in the second half of 2024, the H200 had already become the default LLM-inference SKU in most cloud catalogs.
The H200 inherits everything compute-related from the H100. The list below is mostly identical to the H100 entry, which is intentional.
The Transformer Engine is the more interesting piece. It tracks per-tensor statistics during training and inference and switches between FP8 and BF16 on a per-layer basis. FP8 doubles raw throughput compared to BF16 and halves the memory footprint of activations and gradients, but it loses dynamic range, so the engine falls back to BF16 for layers that show numerical instability. The H200 ships with the same Transformer Engine version as the H100; what improves on FP8 workloads is the rate at which weights and KV cache can be streamed in from HBM3e.
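The per-tensor scaling idea described above can be illustrated without any GPU library. The sketch below is a simplified model, not the Transformer Engine API: it picks a scale so the largest recently observed magnitude maps onto the FP8 E4M3 maximum (448), and it applies a hypothetical fallback rule that chooses BF16 when too many values would underflow after scaling. The function names and the `max_underflow_fraction` threshold are assumptions made for illustration.

```python
E4M3_MAX = 448.0             # largest finite magnitude in FP8 E4M3
E4M3_MIN_NORMAL = 2.0 ** -6  # smallest normal magnitude in FP8 E4M3


def fp8_scale(amax_history):
    """Scale factor that maps the largest recently seen |x| onto E4M3_MAX."""
    amax = max(amax_history)
    return E4M3_MAX / amax if amax > 0 else 1.0


def choose_precision(values, amax_history, max_underflow_fraction=0.01):
    """Hypothetical policy: use FP8 unless too many nonzero values would
    land below the smallest normal E4M3 magnitude after scaling."""
    scale = fp8_scale(amax_history)
    nonzero = [abs(v) for v in values if v != 0.0]
    if not nonzero:
        return "fp8", scale
    underflow = sum(1 for v in nonzero if v * scale < E4M3_MIN_NORMAL)
    if underflow / len(nonzero) > max_underflow_fraction:
        return "bf16", 1.0  # dynamic range too wide for FP8
    return "fp8", scale


# A well-behaved tensor stays in FP8; one with a huge outlier falls back.
narrow = choose_precision([1.0, 0.5, 0.25], amax_history=[1.0, 0.8])
wide = choose_precision([1e6, 1e-3, 1e-3, 1e-3], amax_history=[1e6])
```

The point of the sketch is the trade-off in the paragraph above: scaling recovers most of FP8's limited range, but a single outlier can push the rest of the tensor toward zero, which is when a 16-bit format is the safer choice.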
Hopper also adds the Tensor Memory Accelerator (TMA) and thread block clusters, both inherited from the H100 and unchanged on the H200. The TMA is a dedicated copy engine that moves multi-dimensional tensors between global memory and shared memory without spending CUDA core or Tensor Core cycles, and thread block clusters expose distributed shared memory across adjacent SMs. Software written to take advantage of these features (FlashAttention 2 and 3, the kernels in TensorRT-LLM, CUTLASS 3.x kernels) runs without modification on the H200 and simply benefits from the higher memory throughput.
A quick reference for what is shared between the two parts and what is different.
| Spec | H100 SXM | H200 SXM |
|---|---|---|
| Die | GH100 | GH100 |
| SMs (active) | 132 | 132 |
| FP32 CUDA cores | 16,896 | 16,896 |
| Tensor Cores (4th gen) | 528 | 528 |
| FP8 Tensor (dense) | 1,979 TFLOPS | 1,979 TFLOPS |
| BF16 Tensor (dense) | 989 TFLOPS | 989 TFLOPS |
| FP64 Tensor | 67 TFLOPS | 67 TFLOPS |
| L2 cache | 50 MB (often quoted) | 60 MB (NVIDIA H200 datasheet) |
| Memory type | HBM3 | HBM3e |
| Memory capacity | 80 GB | 141 GB |
| Memory bandwidth | 3.35 TB/s | 4.8 TB/s |
| Bus width | 5,120-bit | 6,144-bit |
| NVLink | NVLink 4 (900 GB/s) | NVLink 4 (900 GB/s) |
| PCIe | Gen5 x16 | Gen5 x16 |
| TDP (SXM) | 700 W | up to 700 W (configurable to 1,000 W) |
| Confidential Computing | Yes | Yes |
In practice, every CUDA kernel and every cuBLAS, cuDNN, and TensorRT routine that runs on H100 runs on H200. The compute capability remains sm_90, so binaries compile once and target both parts. The performance differences come entirely from the memory subsystem and from any per-tensor scaling chosen by the Transformer Engine when more memory is available for activation tracking buffers.
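As a small illustration of the shared compute capability, the sketch below turns a (major, minor) capability tuple into the architecture name that `nvcc` accepts through its `-arch` option. The helper name `arch_flag` is hypothetical. With PyTorch installed on a GPU host, `torch.cuda.get_device_capability()` returns a tuple of this shape, but the snippet itself needs no GPU.

```python
def arch_flag(capability):
    """Format a (major, minor) compute capability as an nvcc arch name."""
    major, minor = capability
    return f"sm_{major}{minor}"


HOPPER = (9, 0)  # compute capability shared by H100 and H200

flag = arch_flag(HOPPER)
print(f"nvcc -arch={flag} kernel.cu")  # one build targets both parts
```

Because both GPUs report the same capability, a deployment script that branches on this value cannot tell them apart; distinguishing H100 from H200 requires checking the memory size instead.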
| Spec | H100 SXM | H200 SXM |
|---|---|---|
| Memory type | HBM3 | HBM3e |
| Capacity | 80 GB | 141 GB |
| Bandwidth | 3.35 TB/s | 4.8 TB/s |
| Stack count (active) | 5 x 16 GB | 6 x 24 GB |
| Bus width | 5,120-bit | 6,144-bit |
The physical package has six HBM stack sites. On the H100, one site is left inactive for yield reasons, which gives five stacks and a 5,120-bit bus; the H200 enables all six for a 6,144-bit bus. The H200's six active HBM3e stacks at 24 GB each give 144 GB raw, but a small slice is reserved for ECC and for vendor-specific repair, so the user-visible capacity is 141 GB.
Micron's 8-high HBM3e parts are rated for roughly 9.2 Gb/s per pin. The H200 does not use the full rated speed: 4.8 TB/s over a 6,144-bit bus works out to about 6.25 Gb/s per pin, so the stacks run with headroom. SK hynix and Samsung HBM3e parts are rated at similar speeds. Through most of 2024, NVIDIA's qualified bill of materials specified Micron-only HBM3e for production H200 modules; SK hynix qualifications opened up by Q3 2024 and Samsung followed. By the time the H200 NVL launched in late 2024, all three vendors were shipping into the H200 supply chain.
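The relationship between bus width, per-pin data rate, and aggregate bandwidth is plain arithmetic, and it is worth checking the quoted figures against each other. The sketch below uses only numbers from the tables above; the function names are illustrative.

```python
def aggregate_bandwidth_tbps(bus_width_bits, pin_rate_gbps):
    """Aggregate bandwidth in TB/s: pins x Gb/s per pin, bits to bytes, GB to TB."""
    return bus_width_bits * pin_rate_gbps / 8 / 1000


def pin_rate_gbps(bus_width_bits, bandwidth_tbps):
    """Per-pin data rate in Gb/s implied by a bus width and aggregate bandwidth."""
    return bandwidth_tbps * 1000 * 8 / bus_width_bits


h100_pin = pin_rate_gbps(5120, 3.35)  # about 5.23 Gb/s per pin
h200_pin = pin_rate_gbps(6144, 4.8)   # 6.25 Gb/s per pin

# A 6,144-bit bus driven at a full 9.2 Gb/s per pin would give about 7.07 TB/s,
# so 4.8 TB/s corresponds to running the stacks below that rate.
at_rated_speed = aggregate_bandwidth_tbps(6144, 9.2)

raw_capacity_gb = 6 * 24  # 144 GB raw, of which 141 GB is user-visible
```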
The practical impact on inference is significant. A 70-billion-parameter model in BF16 occupies roughly 140 GB for weights alone, plus KV cache that scales with batch size and context length. On H100, serving Llama 2 70B at production batch sizes requires tensor-parallel sharding across two or more GPUs, with NVLink and NVSwitch handling the all-reduce traffic. On H200, the BF16 weights alone nearly fill the 141 GB and leave almost nothing for KV cache, but with FP8 weights (roughly 70 GB) the same model fits comfortably on one GPU, eliminating the cross-GPU communication entirely and freeing the second GPU for additional inference replicas. NVIDIA has reported in TensorRT-LLM benchmarks that this single-GPU configuration roughly doubles tokens-per-second per GPU on long-context Llama 2 70B inference.
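A rough sizing calculation makes the memory budget concrete. The sketch below is a back-of-the-envelope estimate, not a measurement: it assumes the published Llama 2 70B shape (80 layers, 8 key-value heads under grouped-query attention, head dimension 128), a 16-bit KV cache, and an example batch of 16 sequences at 4,096 tokens. It ignores activations and runtime overhead.

```python
GB = 1e9


def weight_gb(n_params, bytes_per_param):
    """Memory needed for model weights, in GB."""
    return n_params * bytes_per_param / GB


def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem):
    """KV cache size in GB; the factor 2 covers one K and one V tensor per layer."""
    elems = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch
    return elems * bytes_per_elem / GB


bf16_weights = weight_gb(70e9, 2)  # 140 GB
fp8_weights = weight_gb(70e9, 1)   # 70 GB

# Assumed Llama 2 70B shape and an example serving load.
kv = kv_cache_gb(n_layers=80, n_kv_heads=8, head_dim=128,
                 seq_len=4096, batch=16, bytes_per_elem=2)  # about 21.5 GB

H200_GB = 141
fits_bf16 = bf16_weights + kv <= H200_GB  # False: weights leave ~1 GB spare
fits_fp8 = fp8_weights + kv <= H200_GB    # True: roughly 50 GB of headroom
```

Under these assumptions the KV cache costs about 320 KiB per token, so the headroom left after FP8 weights is what determines how much batch size and context a single H200 can serve.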
The H200 ships in three product variants. The most common in cloud deployments is the SXM5 module sold as part of an HGX baseboard.
| Form factor | TDP | Memory | Memory BW | Interconnect | Typical system |
|---|---|---|---|---|---|
| H200 SXM (SXM5) | up to 700 W | 141 GB HBM3e | 4.8 TB/s | NVLink 4 (900 GB/s), PCIe Gen5 x16 | HGX H200, DGX H200 |
| H200 NVL (PCIe) | up to 600 W | 141 GB HBM3e | 4.8 TB/s | NVLink 4 bridge (2- or 4-way, 900 GB/s), PCIe Gen5 x16 | Air-cooled enterprise servers |
| GH200 Grace Hopper Superchip (H200 variant) | up to 1,000 W (CPU+GPU) | 144 GB HBM3e + 480 GB LPDDR5X | 4.9 TB/s GPU + ~512 GB/s CPU | NVLink-C2C (900 GB/s) between Grace and Hopper | NVL2 nodes, GH200 NVL32 |
H200 SXM is the high-end SKU. It plugs into an HGX H200 baseboard with four or eight GPUs interconnected by NVSwitch. Eight-way HGX H200 systems are the basis for DGX H200 and for the per-node configuration of cloud instances at AWS, Azure, GCP, and OCI.
H200 NVL launched a few months later as a dual-slot PCIe card for enterprise servers that cannot accept SXM modules. The NVL part runs at lower TDP (600 W), trades a small amount of throughput for a more conventional thermal envelope, and supports 2-way or 4-way NVLink bridges that link adjacent cards at 900 GB/s. NVIDIA bundles a five-year subscription to NVIDIA AI Enterprise with each H200 NVL.
GH200 Grace Hopper Superchip is a different kind of product. It pairs a Grace CPU (72 Arm Neoverse V2 cores, 480 GB LPDDR5X) with a Hopper GPU using NVLink-C2C, NVIDIA's chip-to-chip cache-coherent interconnect at 900 GB/s. The GH200 originally shipped with H100-class GPU silicon and 96 GB HBM3; the refreshed GH200 launched in late 2023 swaps that for an H200-class GPU with 144 GB of HBM3e at roughly 4.9 TB/s. GH200 superchips are the building block for the GH200 NVL2 and NVL32 rackscale systems.
The HGX H200 baseboard is the OEM-targeted version of the eight-GPU server module. It carries eight H200 SXM modules, four NVSwitch chips for all-to-all NVLink connectivity at 900 GB/s per GPU, the NVLink Switch System interface, and the power and cooling infrastructure for sustained 700 W per GPU. The board ships to OEMs (Supermicro, Dell, HPE, Lenovo, GIGABYTE, QCT, ASUS, ASRock Rack, Wiwynn, Wistron, Foxconn, Inventec, and others) for integration into their own chassis. The bulk of hyperscale H200 deployments use HGX boards rather than full DGX systems.
The DGX H200 is NVIDIA's reference appliance built around the same eight-GPU HGX H200 baseboard. Each DGX H200 contains eight H200 SXM modules totaling 1.128 TB of HBM3e, two Intel Xeon Sapphire Rapids or Emerald Rapids CPUs, 2 TB of system DRAM, eight ConnectX-7 400 Gb/s NICs for InfiniBand or Ethernet scale-out, two BlueField-3 DPUs, and 30 TB of NVMe storage in the reference configuration. The chassis is 8U and dissipates roughly 10.2 kW under sustained load. NVIDIA quotes 32 PFLOPS of FP8 per DGX H200 (8 x 3.958 PFLOPS, a figure that assumes 2:4 structured sparsity) and 1.1 TB of total memory.
The H200 NVL is intended for enterprise customers running standard 19-inch rack servers with PCIe slots. Each card occupies two PCIe slots, draws up to 600 W from a 12V-2x6 connector, and exposes 141 GB HBM3e at the same 4.8 TB/s as the SXM variant. The 600 W envelope is achieved by lowering boost clocks; the peak FP8 Tensor throughput (with sparsity) drops from 3,958 TFLOPS on SXM to roughly 3,341 TFLOPS on NVL.
Up to four H200 NVL cards in a server can be linked with NVLink bridges into a single coherent 564 GB pool at 900 GB/s aggregate inter-GPU bandwidth. Beyond four cards, the only path is PCIe Gen5, so multi-GPU NVL deployments are typically configured as two-way or four-way NVL groups with PCIe between groups. NVIDIA bundles a five-year NVIDIA AI Enterprise subscription with each H200 NVL purchase, which is a notable difference from the SXM SKU.
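The aggregate figures quoted for the DGX H200 and for a four-way NVL group follow directly from the per-GPU numbers. The short sketch below reproduces them; the helper name is illustrative, and the per-GPU FP8 value is the peak figure quoted in the text.

```python
def pooled(per_gpu, n_gpus):
    """Aggregate of a per-GPU quantity across an NVLink-connected group."""
    return per_gpu * n_gpus


HBM_PER_GPU_GB = 141
FP8_PEAK_PFLOPS = 3.958  # per-GPU peak FP8 figure quoted for the SXM part

dgx_hbm_gb = pooled(HBM_PER_GPU_GB, 8)       # 1,128 GB, quoted as 1.128 TB
dgx_fp8_pflops = pooled(FP8_PEAK_PFLOPS, 8)  # about 31.7, quoted as 32 PFLOPS
nvl4_hbm_gb = pooled(HBM_PER_GPU_GB, 4)      # 564 GB for a four-way NVL group
```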
The compute spec sheet for the H200 SXM is identical to the H100 SXM. The H200 NVL clocks slightly lower to fit the 600 W envelope, so its peak rates are roughly 84 to 90 percent of the SXM numbers, depending on the precision.
| Precision | H200 SXM | H200 NVL |
|---|---|---|
| FP64 | 34 TFLOPS | 30 TFLOPS |
| FP64 Tensor Core | 67 TFLOPS | 60 TFLOPS |
| FP32 | 67 TFLOPS | 60 TFLOPS |
| TF32 Tensor Core (dense / sparse) | ~495 / 989 TFLOPS | ~418 / 835 TFLOPS |
| BF16 Tensor Core (dense / sparse) | 989 / 1,979 TFLOPS | 835 / 1,671 TFLOPS |
| FP16 Tensor Core (dense / sparse) | 989 / 1,979 TFLOPS | 835 / 1,671 TFLOPS |
| FP8 Tensor Core (dense / sparse) | 1,979 / 3,958 TFLOPS | 1,671 / 3,341 TFLOPS |
| INT8 Tensor Core (dense / sparse) | 1,979 / 3,958 TOPS | 1,671 / 3,341 TOPS |
The "sparse" numbers assume NVIDIA's 2:4 structured sparsity is in use. Most production LLM inference runs do not exploit 2:4 sparsity, so the dense numbers are usually the relevant ones. The sparsity uplift is genuine but only applies if you have actually trained or pruned the model into the 2:4 pattern.
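To make the 2:4 pattern concrete: in every aligned group of four consecutive weights, at most two may be nonzero. The sketch below checks that property and applies naive magnitude pruning to produce it. This is an illustration of the pattern only; it is not how NVIDIA's tooling prunes models, and pruning by magnitude without retraining usually costs accuracy.

```python
def is_2_4_sparse(row):
    """True if every aligned group of 4 values has at most 2 nonzeros."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    return all(
        sum(1 for v in row[i:i + 4] if v != 0) <= 2
        for i in range(0, len(row), 4)
    )


def prune_2_4(row):
    """Naive magnitude pruning: keep the 2 largest-|v| entries per group of 4."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    out = []
    for i in range(0, len(row), 4):
        group = row[i:i + 4]
        keep = sorted(range(4), key=lambda j: abs(group[j]), reverse=True)[:2]
        out.extend(v if j in keep else 0.0 for j, v in enumerate(group))
    return out


weights = [0.1, -0.9, 0.3, 0.05, 1.0, 2.0, 3.0, 4.0]
pruned = prune_2_4(weights)  # [0.0, -0.9, 0.3, 0.0, 0.0, 0.0, 3.0, 4.0]
```

A dense checkpoint such as `weights` above fails `is_2_4_sparse`, which is why the sparse column does not apply to a model that was never pruned into this layout.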
NVIDIA's published Llama 2 70B inference benchmark, run with TensorRT-LLM, shows H200 delivering roughly 1.6 to 1.9 times the throughput of H100. The exact ratio depends on batch size and sequence length. The reason is straightforward: at long context, the KV cache dominates memory traffic, so the H200's 1.43x bandwidth advantage compounds with its ability to keep the entire cache in local HBM rather than splitting it across GPUs and paying for tensor-parallel communication.
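The bandwidth part of that argument can be estimated with a simple bound. During decoding, each step has to stream the weights and the live KV cache from HBM at least once, so steps per second cannot exceed bandwidth divided by bytes read. The sketch below is an idealized upper bound under that assumption; it ignores compute time, kernel overheads, and sharding, and the 70 GB and 20 GB inputs are example values.

```python
def decode_steps_per_second(bandwidth_tbps, weight_gb, kv_cache_gb=0.0):
    """Upper bound on decode steps per second when memory traffic dominates."""
    bytes_per_step = (weight_gb + kv_cache_gb) * 1e9
    return bandwidth_tbps * 1e12 / bytes_per_step


# Example: 70 GB of weights plus 20 GB of KV cache read per step.
h100_bound = decode_steps_per_second(3.35, weight_gb=70, kv_cache_gb=20)
h200_bound = decode_steps_per_second(4.8, weight_gb=70, kv_cache_gb=20)

ratio = h200_bound / h100_bound  # about 1.43, the raw bandwidth ratio
```

Bandwidth alone accounts for about 1.43x. The rest of the measured 1.6x to 1.9x has to come from effects this bound leaves out, such as larger batches fitting in memory and less cross-GPU communication.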
The most-cited public benchmark is MLPerf Inference v4.0, where eight-GPU H200 systems beat eight-GPU H100 systems on Llama 2 70B by approximately 45 percent. By v4.1, NVIDIA reported up to 27 percent additional H200 gains from software optimizations alone, and an H200 system with eight GPUs configured at 1,000 W reached roughly 33,000 tokens per second on Llama 2 70B server-mode submissions. (The 1,000 W mode is a special configurable TDP that pushes the part beyond its 700 W reference setting; standard datacenter deployments run at 700 W.)
For HPC workloads, NVIDIA's bandwidth-bound benchmarks (HPCG, MILC, certain quantum chemistry codes) show roughly 1.4x speedups over H100. Compute-bound HPC codes (those that already saturate the FP64 or FP64 Tensor Core pipelines) see negligible improvement; the math units are the same.
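A roofline estimate shows why the two classes of HPC code behave so differently. A kernel's attainable throughput is the smaller of the peak math rate and its arithmetic intensity (FLOP per byte moved) times memory bandwidth. The sketch below uses the FP64 Tensor Core peak and bandwidth figures from the tables above; the two intensity values are illustrative, not measurements of any specific code.

```python
def machine_balance(peak_tflops, bandwidth_tbps):
    """FLOP per byte a kernel needs before it becomes compute-bound."""
    return peak_tflops / bandwidth_tbps


def attainable_tflops(intensity, peak_tflops, bandwidth_tbps):
    """Roofline model: min(peak compute, intensity x bandwidth)."""
    return min(peak_tflops, intensity * bandwidth_tbps)


PEAK_FP64_TC = 67.0  # TFLOPS, identical on H100 and H200

h100_balance = machine_balance(PEAK_FP64_TC, 3.35)  # 20 FLOP/byte
h200_balance = machine_balance(PEAK_FP64_TC, 4.8)   # about 14 FLOP/byte

# Bandwidth-bound kernel (0.25 FLOP/byte): speedup equals the bandwidth ratio.
low = attainable_tflops(0.25, PEAK_FP64_TC, 4.8) / attainable_tflops(0.25, PEAK_FP64_TC, 3.35)

# Compute-bound kernel (50 FLOP/byte): both parts hit the same 67 TFLOPS ceiling.
high = attainable_tflops(50, PEAK_FP64_TC, 4.8) / attainable_tflops(50, PEAK_FP64_TC, 3.35)
```

Kernels whose intensity falls between the two balance points (roughly 14 to 20 FLOP per byte here) are memory-bound on H100 but compute-bound on H200, so they see a partial speedup.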
For LLM training, the picture is more nuanced. If your training job already fits in H100 memory at the desired batch size, the H200 buys you very little because compute is unchanged. If you were using gradient checkpointing or aggressive pipeline parallelism to squeeze a model into H100 memory, the H200 lets you raise the per-GPU batch size and reduce communication, which in practice moves training throughput up by 10 to 30 percent on 70B-class models.
The MLPerf Inference benchmark suite is the standard public reference for accelerator performance. The H200 has appeared in v4.0 (March 2024), v4.1 (August 2024), and v5.0 submissions, mostly through NVIDIA's own entries and through partners like CoreWeave, Supermicro, Dell, and Lenovo.
Independent third-party benchmarks (ServeTheHome, The Next Platform, MLCommons-published partner submissions) confirm the broad pattern. The H200's advantage is largest on memory-bound LLM inference, smallest on training compute already fitting in H100 memory, and somewhere in between on HPC workloads depending on the kernel mix. On Mixtral 8x7B and Mixtral 8x22B, where MoE routing means activations are sparse but weights remain large, H200 systems also show 1.5x to 1.8x throughput improvements over H100.
NVIDIA does not publish list prices for SXM modules. Resellers list H200 SXM modules at roughly 31,000 to 40,000 USD per GPU through 2024 and 2025, and DGX H200 systems (eight GPUs plus chassis) have been reported in the 400,000 to 500,000 USD range. H200 NVL PCIe cards are listed at retail by some distributors at roughly 31,000 USD. These figures vary widely by reseller, region, contract term, and supply conditions; treat them as order-of-magnitude rather than catalog prices.
Cloud rental is the more common access path. As of late 2025 and early 2026, on-demand rates for a single H200 SXM GPU at major hyperscalers and neoclouds are usually in the range of 3 to 7 USD per hour, with reserved or committed capacity well under that. Eight-GPU H200 instances rent for roughly 25 to 50 USD per hour on-demand. Specific rates vary by provider, region, and contract term.
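Hourly rates are easier to compare once converted into cost per token. The sketch below does that conversion for the on-demand range quoted above. The throughput input is the roughly 33,000 tokens per second from the MLPerf result mentioned earlier, which is a tuned benchmark figure at 1,000 W, so real deployments should expect higher costs than this illustration suggests.

```python
def usd_per_million_tokens(node_usd_per_hour, tokens_per_second):
    """Serving cost per million generated tokens at full utilization."""
    tokens_per_hour = tokens_per_second * 3600
    return node_usd_per_hour / tokens_per_hour * 1e6


BENCHMARK_TOKENS_PER_S = 33_000  # eight-GPU node, benchmark conditions

low_cost = usd_per_million_tokens(25, BENCHMARK_TOKENS_PER_S)   # about 0.21 USD
high_cost = usd_per_million_tokens(50, BENCHMARK_TOKENS_PER_S)  # about 0.42 USD
```

The calculation assumes the node is fully utilized for the whole hour; idle time scales the cost up proportionally.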
| Provider | Instance / SKU | Per-node config | NVLink topology | Networking |
|---|---|---|---|---|
| AWS | EC2 P5e and P5en | 8 x H200 SXM | NVSwitch all-to-all | 3,200 Gbps EFAv3 |
| Azure | ND H200 v5 | 8 x H200 SXM | NVSwitch all-to-all | 3,200 Gbps NDR InfiniBand |
| Google Cloud | A3 Ultra | 8 x H200 SXM | NVSwitch all-to-all | 3,200 Gbps RoCE |
| Oracle Cloud | BM.GPU.H200.8 (bare metal) | 8 x H200 SXM | NVSwitch all-to-all | 3,200 Gbps RoCEv2 |
| CoreWeave | HGX H200 nodes | 8 x H200 SXM | NVSwitch all-to-all | 3,200 Gbps NDR InfiniBand |
| Lambda | 1-Click Clusters and on-demand | 1 to 512 H200 SXM | NVSwitch + IB | NDR InfiniBand |
| Crusoe | Crusoe Cloud H200 | 8 x H200 SXM | NVSwitch all-to-all | 3,200 Gbps NDR InfiniBand |
| Vultr | H200 bare metal and VM | 1 to 8 x H200 SXM | NVSwitch | RoCE |
| RunPod | Secure Cloud H200 | 1 to 8 x H200 SXM | NVSwitch | 3,200 Gbps |
| Together AI | H200 endpoints | Shared and dedicated | NVSwitch | NDR |
| Nebius | H200 cluster shapes | 8 x H200 SXM | NVSwitch | 3,200 Gbps NDR |
Most providers list both single-GPU instances and full eight-GPU bare-metal nodes. The eight-GPU configurations are the most common deployment for production LLM inference and for fine-tuning runs that need an HGX-class NVLink fabric in a single node. Larger training jobs span multiple nodes connected by 400 Gbps NDR InfiniBand or 400/800 Gbps RoCE Ethernet.
Reserved and committed capacity at one-year and three-year terms is significantly cheaper than on-demand. CoreWeave, Lambda, and the hyperscalers all publish reserved-instance discounts in the 30 to 60 percent range depending on the commitment length and total volume.
The H200 has been adopted broadly across hyperscalers, neoclouds, AI labs, and HPC sites. Notable deployments include:
The table below summarizes how the H200 fits in NVIDIA's data-center GPU lineup alongside its predecessor, successors, and the most prominent competing accelerators. Specs are as published by each vendor for the flagship SXM-equivalent variant.
| Product | Architecture | Memory | Bandwidth | FP8 / FP4 Tensor (dense) | TDP | Year |
|---|---|---|---|---|---|---|
| NVIDIA A100 SXM4 80GB | Ampere | 80 GB HBM2e | 2.0 TB/s | n/a (FP16: 312 TFLOPS) | 400 W | 2020 |
| NVIDIA H100 SXM5 | Hopper | 80 GB HBM3 | 3.35 TB/s | 1,979 TFLOPS FP8 | 700 W | 2022 |
| H100 NVL (PCIe pair) | Hopper | 188 GB HBM3 (94 GB per GPU) | 7.8 TB/s (paired) | 3,341 TFLOPS FP8 (per pair) | 2 x 400 W | 2023 |
| H200 SXM | Hopper | 141 GB HBM3e | 4.8 TB/s | 1,979 TFLOPS FP8 | 700 W (up to 1,000 W configurable) | 2024 |
| H200 NVL | Hopper | 141 GB HBM3e | 4.8 TB/s | 1,671 TFLOPS FP8 | 600 W | 2024 |
| GH200 (H200) | Grace + Hopper | 144 GB HBM3e + 480 GB LPDDR5X | 4.9 TB/s GPU | 1,979 TFLOPS FP8 | up to 1,000 W | 2024 |
| NVIDIA B100 | Blackwell | 192 GB HBM3e | 8 TB/s | 3,500 TFLOPS FP8 / 7,000 TFLOPS FP4 | 700 W | 2024 |
| NVIDIA B200 | Blackwell | 192 GB HBM3e | 8 TB/s | 4,500 TFLOPS FP8 / 9,000 TFLOPS FP4 | 1,000 W | 2024 to 2025 |
| NVIDIA GB200 (per Blackwell GPU) | Grace + Blackwell | 192 GB HBM3e | 8 TB/s | 5,000 TFLOPS FP8 / 10,000 TFLOPS FP4 | up to 1,200 W | 2024 to 2025 |
| AMD MI300X | CDNA 3 | 192 GB HBM3 | 5.3 TB/s | 2,615 TFLOPS FP8 | 750 W | 2023 |
| AMD MI325X | CDNA 3 | 256 GB HBM3e | 6.0 TB/s | 2,615 TFLOPS FP8 | 1,000 W | 2024 |
| Google TPU v5p | TPU v5p | 95 GB HBM | 2.8 TB/s | n/a (BF16: 459 TFLOPS) | n/a (cloud only) | 2023 |
| Google TPU Trillium (v6e) | TPU v6e | 32 GB HBM | 1.6 TB/s | n/a (BF16: 918 TFLOPS) | n/a (cloud only) | 2024 |
A few points are worth highlighting from this table.
The H200's 141 GB sits between the H100's 80 GB and the B200's 192 GB. AMD's MI300X also reaches 192 GB but with HBM3 (not HBM3e), so its bandwidth at 5.3 TB/s is below B200 but above H200. The MI325X stretches the memory to 256 GB at 6.0 TB/s, which is the largest single-package memory pool of any of these accelerators in 2024. AMD has positioned the MI300X and MI325X explicitly as LLM-inference parts where the memory advantage matters most. NVIDIA's response across the H200 and Blackwell line has been a combination of HBM3e adoption, NVSwitch-scale memory pooling across eight GPUs, and software ecosystem investment in TensorRT-LLM and NIM microservices.
The H100 NVL deserves a footnote. NVIDIA briefly sold an H100 NVL part that pairs two PCIe cards over NVLink and exposes 188 GB of memory across the pair (94 GB per card, roughly 96 GB physical with a few GB held back). The H200 NVL is a different product. Single H200 NVL cards have 141 GB on their own.
Google's TPU offerings are not directly comparable on a per-chip basis because they are sold as cloud capacity with their own software stack (JAX, TensorFlow XLA, PyTorch via TPU plugin) and a different scale-out fabric. For workloads that Google has tuned heavily (Gemini training, internal recommendation systems, BERT-class inference), TPU v5p and Trillium are competitive with H200 and B200 on cost per training step. For the broader CUDA ecosystem of open-source models, optimized inference engines, and third-party tooling, H200 retains a substantial software advantage.
The H200 uses the same Hopper compute capability (sm_90) as the H100, so binaries compiled for H100 run unchanged. Software support comes from the standard NVIDIA stack:
NVIDIA AI Enterprise (a subscription product bundled free with H200 NVL for five years) packages enterprise-supported builds of these components plus the NIM microservice runtime.
TensorRT-LLM is the inference engine NVIDIA most aggressively tunes against new hardware generations. It is the path through which most of the H200's advertised inference gains are realized in production. Key features that are particularly useful on H200:
NIM (NVIDIA Inference Microservices) packages popular open-source models (Llama 3, Llama 3.1, Mixtral, Phi, Qwen, Mistral, and others) as Docker containers that can be deployed on supported NVIDIA data-center GPUs. NIM containers automatically detect the underlying hardware (H100, H200, B100, B200) and select a tuned engine. On H200, NIM typically defaults to a single-GPU engine for 70B-class models because the model fits without sharding.
The H200 is well supported by open-source LLM serving frameworks even outside NVIDIA's first-party stack.
The H200 is most useful where memory rather than compute is the constraint:
It is a poor choice when the workload is compute-bound and already fits comfortably in H100 memory; you would be paying H200 prices for H100 throughput. It is also a poor choice when the workload would otherwise fit on an L40S, an H100, or even an A100, since those parts are cheaper per GPU-hour on most clouds.
NVIDIA announced Blackwell at GTC 2024 in March 2024, four months after the H200 announcement. The B100 and B200 GPUs ramped through 2024 with 192 GB of HBM3e and roughly double the FP8 Tensor throughput of H200, plus native FP4 support that the Hopper Tensor Cores lack. The GB200 NVL72 rack ties 36 Grace CPUs to 72 Blackwell GPUs over a single NVLink domain, presenting itself as a unified accelerator for trillion-parameter models. NVIDIA previewed the Rubin generation later in 2024, at Computex in June, slated for 2025 to 2026, with HBM4 memory.
In practice, the H200 has had a longer-than-expected useful life. Blackwell supply through 2024 and 2025 was constrained by a combination of HBM3e availability, advanced packaging (CoWoS-L) capacity at TSMC, and software readiness. A lot of customers who ordered Blackwell in early 2024 ended up taking H200 capacity in the meantime, and the H200 remains the standard "big memory" Hopper SKU on most clouds well into 2026. Even after Blackwell volume ramps, the H200 retains a price-performance niche for inference workloads that fit comfortably in 141 GB and that do not need FP4.
For a longer view of the architectural lineage from Ampere through Rubin, see the NVIDIA Hopper and NVIDIA Blackwell articles. The NVIDIA H100 page covers the predecessor in depth, and the NVIDIA B200 and NVIDIA GB200 pages describe the Blackwell-generation successors.