# NVIDIA H200

> Source: https://aiwiki.ai/wiki/nvidia_h200
> Updated: 2026-06-21
> Categories: AI Hardware, Data Centers, NVIDIA
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

The **NVIDIA H200** is a data center Tensor Core [GPU](/wiki/gpu) for AI and high-performance computing that was the first GPU to ship with [HBM3e](/wiki/hbm3e) memory, packing 141 GB at 4.8 TB/s on its SXM and NVL boards.[^1][^3] It is built on the same Hopper microarchitecture and GH100 silicon as the [NVIDIA H100](/wiki/nvidia_h100), so its compute pipeline is functionally identical; the whole point of the H200 is the memory.[^1][^2] More capacity and more bandwidth make it a noticeably better fit for [large language model](/wiki/large_language_model) inference, where the KV cache and weight tensors are the bottleneck rather than the math units, and NVIDIA states the H200 delivers "nearly double the inference performance of the H100" on a 70-billion-parameter model.[^1][^2][^4]

[NVIDIA](/wiki/nvidia) announced the H200 on November 13, 2023 at SC23 (the SC supercomputing conference) in Denver.[^2][^5] First customer shipments started in Q2 2024 through the major server OEMs and cloud providers.[^2][^6] The H200 sits between the H100 and the [Blackwell](/wiki/nvidia_blackwell) parts ([B100](/wiki/nvidia_b100), [B200](/wiki/nvidia_b200), and the Blackwell Ultra [B300](/wiki/nvidia_b300)) in NVIDIA's data-center roadmap. Many organizations are still buying it through 2025 and into 2026 because Blackwell supply remains constrained and because 141 GB is enough to hold the weights of a 70-billion-parameter model on a single GPU: comfortably at 8-bit precision, and just barely at 16-bit precision (about 140 GB), where little room is left for KV cache.[^4][^7][^8]

## Infobox

| Field | Value |
|---|---|
| Type | Data center GPU accelerator |
| Microarchitecture | [Hopper](/wiki/nvidia_hopper) |
| Die | GH100 |
| Process | TSMC custom 4N |
| Transistors | ~80 billion |
| Die size | 814 mm² |
| Memory | 141 GB [HBM3e](/wiki/hbm3e) |
| Memory bandwidth | 4.8 TB/s |
| Interconnect | [NVLink](/wiki/nvlink) 4 (900 GB/s); [PCIe](/wiki/pcie) Gen5 x16 |
| Form factors | H200 SXM (SXM5), H200 NVL (PCIe), GH200 superchip variant |
| TDP | 700 W (SXM); 600 W (NVL); configurable up to 1,000 W |
| Announced | November 13, 2023 (SC23, Denver) |
| First customer shipments | Q2 2024 |
| Predecessor | [NVIDIA H100](/wiki/nvidia_h100) |
| Successor | [NVIDIA B100](/wiki/nvidia_b100), [NVIDIA B200](/wiki/nvidia_b200), [NVIDIA B300](/wiki/nvidia_b300) (Blackwell Ultra) |

## What is the NVIDIA H200?

The H200 is best understood as a memory refresh of the H100 rather than a new architecture.[^1][^9] NVIDIA kept the GH100 die: the same 814 mm² of silicon fabricated on TSMC's custom 4N process, same 80 billion transistors, same 132 streaming multiprocessors, same fourth-generation Tensor Cores, same Transformer Engine for [FP8](/wiki/fp8) and [bfloat16](/wiki/bf16) dynamic precision switching. What changed is the memory subsystem. The H100 SXM uses five active 16 GB HBM3 stacks for 80 GB at 3.35 TB/s. The H200 uses six active 24 GB HBM3e stacks for 141 GB at 4.8 TB/s, which is roughly 76 percent more capacity and about 43 percent more bandwidth than the H100 SXM.[^1][^10] The package has six HBM stack sites; the H100 leaves one of them inactive, while the H200 enables all six.

NVIDIA framed the upgrade in memory terms at launch. "To create intelligence with generative AI and HPC applications, vast amounts of data must be efficiently processed at high speed using large, fast GPU memory," said Ian Buck, NVIDIA's vice president of hyperscale and HPC. "With NVIDIA H200, the industry's leading end-to-end AI supercomputing platform just got faster to solve some of the world's most important challenges."[^2]

That memory upgrade matters for two workloads in particular:

- **Large LLM inference.** A [Llama 2](/wiki/llama_2) 70B model in [FP16](/wiki/fp16) needs roughly 140 GB just for weights. On an H100 you have to shard it across at least two GPUs and pay tensor-parallel communication overhead. On an H200, the FP16 weights just fit on a single device, and with FP8 weights (roughly 70 GB) there is also room for activations and a meaningful chunk of KV cache. NVIDIA's own benchmarks claim that H200 nearly doubles Llama 2 70B [inference](/wiki/inference) throughput versus H100.[^1][^11]
- **Bandwidth-bound HPC.** Codes that are memory-bound rather than compute-bound (parts of CFD, molecular dynamics, certain seismic and weather kernels) see the bandwidth uplift translate directly into runtime improvements of roughly 1.4x.[^1]

For compute-bound workloads, the H200 and H100 perform identically.
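The capacity argument above is simple arithmetic, and a short sketch makes it easy to rerun for other model sizes. The helper names below are illustrative, the capacities are the decimal-GB figures quoted in this article, and the calculation deliberately ignores KV cache, activations, and framework overhead, so treat the result as a lower bound on GPU count rather than a deployment plan.

```python
import math

# Bytes per parameter for the precisions discussed above.
BYTES_PER_PARAM = {"fp16": 2, "bf16": 2, "fp8": 1}

# Usable HBM per GPU in decimal GB, as quoted in this article.
GPU_MEMORY_GB = {"H100 SXM": 80, "H200 SXM": 141}


def weight_footprint_gb(n_params, precision):
    """Memory needed for the weights alone, in decimal GB."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9


def min_gpus_for_weights(n_params, precision, gpu):
    """Smallest GPU count whose combined HBM holds the weights (no KV cache)."""
    return math.ceil(weight_footprint_gb(n_params, precision) / GPU_MEMORY_GB[gpu])


if __name__ == "__main__":
    for gpu in GPU_MEMORY_GB:
        for precision in ("fp16", "fp8"):
            need = weight_footprint_gb(70e9, precision)
            count = min_gpus_for_weights(70e9, precision, gpu)
            print(f"{gpu} {precision}: {need:.0f} GB of weights, at least {count} GPU(s)")
```

For a 70-billion-parameter model this gives two H100s but a single H200 at 16-bit precision, with only about 1 GB to spare on the H200, which is why FP8 weights are the more practical single-GPU configuration.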

## When was the H200 announced and released?

| Event | Date |
| --- | --- |
| Announcement at SC23, Denver | November 13, 2023 |
| First HBM3e shipments from Micron | Q1 2024 |
| First customer system shipments | Q2 2024 |
| First broad cloud availability (AWS, Azure, GCP, OCI, CoreWeave, Lambda) | Throughout 2024 |
| H200 NVL PCIe variant launched | Late 2024 |
| Successor B200 (Blackwell) ramp | 2024 to 2025 |
| Successor B300 (Blackwell Ultra) announced at GTC 2025; systems ship 2H 2025 | March 18, 2025[^12] |
| US Commerce Department approves H200 exports to China (case-by-case, 25% levy) | December 2025[^13] |
| Continuing fleet expansion at neoclouds | 2025 to 2026 |

NVIDIA listed twelve initial server-OEM partners at SC23: ASRock Rack, ASUS, Dell Technologies, Eviden, GIGABYTE, Hewlett Packard Enterprise, Ingrasys, Lenovo, QCT, Supermicro, Wistron, and Wiwynn.[^2][^5] The first cloud service providers to publicly commit to H200 instances were [AWS](/wiki/aws) (P5e), Google Cloud (A3 Ultra), Microsoft [Azure](/wiki/azure) (ND H200 v5), [Oracle](/wiki/oracle) Cloud Infrastructure, [CoreWeave](/wiki/coreweave), [Lambda](/wiki/lambda_labs), and Vultr.[^14][^15]

[Micron Technology](/wiki/micron_technology) was the initial HBM3e supplier. Micron put 24 GB 8-high HBM3e stacks into volume production in early 2024 specifically for the H200.[^16] SK hynix qualified shortly after and Samsung followed; by mid-2024 NVIDIA was sourcing HBM3e from multiple vendors. The HBM3e supply ramp was the gating factor for H200 production through most of 2024, and the same constraint later limited Blackwell ramp through 2025.[^7]

The announcement at SC23 was deliberately aimed at the supercomputing community rather than a typical NVIDIA enterprise event. Buck framed the H200 in the announcement keynote as a way to keep the Hopper platform competitive for another product cycle while Blackwell finished its tape-out and packaging qualification.[^5] That positioning held up. By the time Blackwell parts started reaching customers in volume in the second half of 2024, the H200 had already become the default LLM-inference SKU in most cloud catalogs.

## Architecture

The H200 inherits everything compute-related from the H100. The list below is mostly identical to the H100 entry, which is intentional.

- **Die:** GH100, 814 mm², ~80 billion transistors[^1]
- **Process:** [TSMC](/wiki/tsmc) custom 4N (a tuned 5 nm class node, marketed as 4 nm)
- **Streaming multiprocessors:** 132 active SMs (out of 144 physical) on the SXM and NVL parts
- **CUDA cores:** 16,896 FP32 lanes
- **Tensor Cores:** 528 fourth-generation units
- **Transformer Engine:** dynamic FP8 / BF16 precision switching for transformer layers
- **Cache hierarchy:** 60 MB L2 cache, 256 KB register file per SM, 228 KB combined shared memory plus L1 per SM
- **Interconnect:** [NVLink](/wiki/nvlink) 4 at 900 GB/s aggregate per GPU, [PCIe](/wiki/pcie) Gen5 x16 at ~128 GB/s bidirectional
- **MIG:** Multi-Instance GPU partitions a single H200 into up to seven isolated instances (18 GB each on SXM, 16.5 GB each on NVL)
- **Confidential Computing:** TEE with hardware-encrypted memory and attested boot, inherited from Hopper

The Transformer Engine is the most interesting piece. It tracks per-tensor statistics during training and inference and switches between FP8 and BF16 on a per-layer basis. FP8 doubles raw throughput compared to BF16 and halves the memory footprint of activations and gradients, but it loses dynamic range, so the engine falls back to BF16 for layers that show numerical instability. The H200 ships with the same Transformer Engine version as the H100; what improves on FP8 workloads is the rate at which weights and KV cache can be streamed in from HBM3e.[^17]
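The trade-off between FP8 throughput and dynamic range can be illustrated with a toy per-tensor scaling check. This is a minimal sketch under stated assumptions, not the Transformer Engine's actual algorithm: the function names and the 1 percent underflow threshold are hypothetical, and the only hardware facts used are the FP8 E4M3 limits (largest finite value 448, smallest positive subnormal 2^-9).

```python
E4M3_MAX = 448.0                 # largest finite FP8 E4M3 value
E4M3_MIN_SUBNORMAL = 2.0 ** -9   # smallest positive FP8 E4M3 value


def fp8_scale(values):
    """Per-tensor scale that maps the largest magnitude onto E4M3_MAX."""
    amax = max((abs(v) for v in values), default=0.0)
    return E4M3_MAX / amax if amax > 0 else 1.0


def underflow_fraction(values, scale):
    """Fraction of nonzero entries that would round to zero after scaling."""
    nonzero = [abs(v) for v in values if v != 0]
    if not nonzero:
        return 0.0
    # Anything below half the smallest subnormal rounds to zero.
    flushed = sum(1 for v in nonzero if v * scale < E4M3_MIN_SUBNORMAL / 2)
    return flushed / len(nonzero)


def choose_precision(values, max_underflow=0.01):
    """Hypothetical policy: use FP8 unless too many entries would underflow."""
    scale = fp8_scale(values)
    return "fp8" if underflow_fraction(values, scale) <= max_underflow else "bf16"


if __name__ == "__main__":
    narrow = [0.5, -1.0, 0.25, 2.0]      # values span a small range
    wide = [1e4] + [1e-6] * 99           # one outlier dwarfs everything else
    print(choose_precision(narrow), choose_precision(wide))
```

A tensor with one large outlier forces a small scale factor, which pushes the remaining values below what FP8 can represent; that is the kind of situation in which a fallback to BF16 is the safer choice.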

Hopper also adds the **Tensor Memory Accelerator** (TMA) and **thread block clusters**, both inherited from the H100 and unchanged on the H200.[^17] The TMA is a dedicated copy engine for moving multi-dimensional tensors between global memory and shared memory without using CUDA core or Tensor Core cycles, and thread block clusters expose distributed shared memory across adjacent SMs. Software written to take advantage of these features (FlashAttention 2 and 3, the kernels in [TensorRT-LLM](/wiki/tensorrt_llm), CUTLASS 3.x kernels) runs without modification on the H200, simply benefiting from higher memory throughput.

## How does the H200 differ from the H100?

A quick reference for what is shared between the two parts and what is different.

| Spec | H100 SXM | H200 SXM |
| --- | --- | --- |
| Die | GH100 | GH100 |
| SMs (active) | 132 | 132 |
| FP32 CUDA cores | 16,896 | 16,896 |
| Tensor Cores (4th gen) | 528 | 528 |
| FP8 Tensor (dense) | 1,979 TFLOPS (3,958 with sparsity) | 1,979 TFLOPS (3,958 with sparsity) |
| BF16 Tensor (dense) | 989 TFLOPS (1,979 with sparsity) | 989 TFLOPS (1,979 with sparsity) |
| FP64 Tensor | 67 TFLOPS | 67 TFLOPS |
| L2 cache | 50 MB (often quoted) | 60 MB (NVIDIA H200 datasheet)[^3] |
| Memory type | HBM3 | HBM3e |
| Memory capacity | 80 GB | 141 GB |
| Memory bandwidth | 3.35 TB/s | 4.8 TB/s |
| Bus width | 5,120-bit | 6,144-bit |
| NVLink | NVLink 4 (900 GB/s) | NVLink 4 (900 GB/s) |
| PCIe | Gen5 x16 | Gen5 x16 |
| TDP (SXM) | 700 W | up to 700 W (configurable to 1,000 W) |
| Confidential Computing | Yes | Yes |

In short, the two parts are compute-identical and differ only in memory: the H200 carries 141 GB of HBM3e at 4.8 TB/s against the H100's 80 GB of HBM3 at 3.35 TB/s, about 76 percent more capacity and 43 percent more bandwidth.[^1][^3][^10] In practice, every CUDA kernel and every cuBLAS, cuDNN, and TensorRT routine that runs on H100 runs on H200. The compute capability remains 9.0 (sm_90), so a binary compiled once targets both parts. The performance differences come entirely from the memory subsystem and from any per-tensor scaling chosen by the Transformer Engine when more memory is available for activation tracking buffers.
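Because both parts report the same compute capability, software that needs to tell them apart has to look at memory size instead. The sketch below illustrates the idea; the function names and the 120 GB cutoff are assumptions chosen for illustration, not an NVIDIA-defined rule.

```python
def sm_arch(major, minor):
    """Format a CUDA compute capability as an architecture string."""
    return f"sm_{major}{minor}"


def hopper_variant(capability, total_memory_bytes, cutoff_bytes=120e9):
    """Guess the Hopper variant from compute capability and memory size.

    The cutoff is a hypothetical threshold sitting between the 80 to 94 GB
    H100 parts and the 141 GB H200 parts.
    """
    if sm_arch(*capability) != "sm_90":
        return "not Hopper"
    return "H200-class" if total_memory_bytes > cutoff_bytes else "H100-class"


if __name__ == "__main__":
    print(hopper_variant((9, 0), 80e9))    # H100-class
    print(hopper_variant((9, 0), 141e9))   # H200-class
```

On a machine with PyTorch installed, `torch.cuda.get_device_capability()` and `torch.cuda.get_device_properties(0).total_memory` can supply the two inputs.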

## Memory: the headline change

| Spec | H100 SXM | H200 SXM |
| --- | --- | --- |
| Memory type | [HBM3](/wiki/hbm3) | [HBM3e](/wiki/hbm3e) |
| Capacity | 80 GB | 141 GB |
| Bandwidth | 3.35 TB/s | 4.8 TB/s |
| Stack count (active) | 5 x 16 GB | 6 x 24 GB |
| Bus width | 5,120-bit | 6,144-bit |

The GH100 package has six HBM stack sites. The H100 leaves one of them inactive for yield reasons, while the H200 enables all six. Six HBM3e stacks at 24 GB each give 144 GB raw, but a small slice is reserved for ECC and for vendor-specific repair, so the user-visible capacity is 141 GB.[^1][^3]

Micron's 8-high HBM3e parts are rated at roughly 9.2 Gb/s per pin.[^16] The H200 does not need to run them that fast: 4.8 TB/s over a 6,144-bit bus works out to about 6.25 Gb/s per pin, so the stacks operate below their rated maximum. SK hynix and Samsung HBM3e parts hit similar speeds. Through most of 2024, NVIDIA's qualified bill of materials specified Micron-only HBM3e for production H200 modules; SK hynix qualifications opened up by Q3 2024 and Samsung followed. By the time the H200 NVL launched in late 2024, all three vendors were shipping into the H200 supply chain.
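The relationship between bus width, per-pin data rate, and aggregate bandwidth is worth checking by hand, since the three figures are often quoted together. The sketch below uses only the bus widths and bandwidths from the table above; the function names are illustrative.

```python
def per_pin_rate_gbps(bandwidth_tb_s, bus_width_bits):
    """Per-pin data rate (Gb/s) implied by an aggregate bandwidth and bus width."""
    bits_per_second = bandwidth_tb_s * 1e12 * 8
    return bits_per_second / bus_width_bits / 1e9


def aggregate_bandwidth_tb_s(pin_rate_gbps, bus_width_bits):
    """Aggregate bandwidth (TB/s) if every pin ran at pin_rate_gbps."""
    bytes_per_second = pin_rate_gbps * 1e9 * bus_width_bits / 8
    return bytes_per_second / 1e12


if __name__ == "__main__":
    print(f"H100: {per_pin_rate_gbps(3.35, 5120):.2f} Gb/s per pin")
    print(f"H200: {per_pin_rate_gbps(4.8, 6144):.2f} Gb/s per pin")
    print(f"6,144-bit bus at 9.2 Gb/s: {aggregate_bandwidth_tb_s(9.2, 6144):.2f} TB/s")
```

The H200's 4.8 TB/s corresponds to 6.25 Gb/s per pin, whereas a 6,144-bit bus running at a full 9.2 Gb/s per pin would deliver roughly 7.07 TB/s.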

The practical impact on inference is significant. A 70 billion-parameter model in BF16 occupies roughly 140 GB for weights alone, plus KV cache that scales with batch size and context length. On H100, serving Llama 2 70B at production batch sizes requires tensor-parallel sharding across two or more GPUs, with NVLink and NVSwitch handling the all-reduce traffic. On H200, the 16-bit weights just fit on one GPU, and an FP8-quantized copy (roughly 70 GB) fits with ample room for KV cache, which eliminates the cross-GPU communication entirely and frees the second GPU for additional inference replicas. NVIDIA has reported in TensorRT-LLM benchmarks that this single-GPU configuration roughly doubles tokens-per-second per GPU on long-context Llama 2 70B inference.[^11]
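How much KV cache actually fits next to the weights can be estimated with a short calculation. The sketch below assumes the published Llama 2 70B shape (80 layers, 8 key-value heads under grouped-query attention, head dimension 128) and a 16-bit KV cache; the function names are illustrative and the estimate ignores activations and runtime overhead.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value):
    """KV cache bytes per token: one key and one value vector per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


def kv_tokens_that_fit(gpu_gb, weights_gb, per_token_bytes, reserve_gb=0.0):
    """Tokens of KV cache that fit in the memory left after the weights."""
    free_bytes = (gpu_gb - weights_gb - reserve_gb) * 1e9
    return max(0, int(free_bytes // per_token_bytes))


if __name__ == "__main__":
    # Assumed Llama 2 70B shape; KV cache stored in 16-bit precision.
    per_token = kv_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128, bytes_per_value=2)
    print(f"{per_token / 1024:.0f} KiB of KV cache per token")
    print("16-bit weights on H200:", kv_tokens_that_fit(141, 140, per_token), "tokens")
    print("FP8 weights on H200:   ", kv_tokens_that_fit(141, 70, per_token), "tokens")
```

Under these assumptions the 16-bit configuration leaves room for only about three thousand cached tokens in total, while FP8 weights leave room for a few hundred thousand, shared across all concurrent requests.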

## What form factors does the H200 come in?

The H200 ships in three product variants. The most common in cloud deployments is the SXM5 module sold as part of an [HGX](/wiki/hgx) baseboard.

| Form factor | TDP | Memory | Memory BW | Interconnect | Typical system |
| --- | --- | --- | --- | --- | --- |
| H200 SXM (SXM5) | up to 700 W | 141 GB HBM3e | 4.8 TB/s | NVLink 4 (900 GB/s), PCIe Gen5 x16 | HGX H200, [DGX H200](/wiki/dgx_h200) |
| H200 NVL (PCIe) | up to 600 W | 141 GB HBM3e | 4.8 TB/s | NVLink 4 bridge (2- or 4-way, 900 GB/s), PCIe Gen5 x16 | Air-cooled enterprise servers |
| GH200 Grace Hopper Superchip (H200 variant) | up to 1,000 W (CPU+GPU) | 144 GB HBM3e + 480 GB LPDDR5X | 4.9 TB/s GPU + ~512 GB/s CPU | NVLink-C2C (900 GB/s) between Grace and Hopper | NVL2 nodes, GH200 NVL32 |

**H200 SXM** is the high-end SKU. It plugs into an HGX H200 baseboard with four or eight GPUs interconnected by NVSwitch. Eight-way HGX H200 systems are the basis for [DGX H200](/wiki/dgx_h200) and for the per-node configuration of cloud instances at AWS, Azure, GCP, and OCI.[^14][^15]

**H200 NVL** launched a few months later as a dual-slot PCIe card for enterprise servers that cannot accept SXM modules. The NVL part runs at lower TDP (600 W), trades a small amount of throughput for a more conventional thermal envelope, and supports 2-way or 4-way NVLink bridges that link adjacent cards at 900 GB/s.[^18][^19] NVIDIA bundles a five-year subscription to NVIDIA AI Enterprise with each H200 NVL.[^19]

**GH200 Grace Hopper Superchip** is a different kind of product. It pairs a Grace CPU (72 Arm Neoverse V2 cores, 480 GB LPDDR5X) with a Hopper GPU using NVLink-C2C, NVIDIA's chip-to-chip cache-coherent interconnect at 900 GB/s. The GH200 originally shipped with H100-class GPU silicon and 96 GB HBM3; the refreshed GH200 launched in late 2023 swaps that for an H200-class GPU with 144 GB of HBM3e at roughly 4.9 TB/s.[^20] GH200 superchips are the building block for the GH200 NVL2 and NVL32 rack-scale systems.

### HGX H200 and DGX H200

The HGX H200 baseboard is the OEM-targeted version of the eight-GPU server module. It carries eight H200 SXM modules, four NVSwitch chips for all-to-all NVLink connectivity at 900 GB/s per GPU, the NVLink Switch System interface, and the power and cooling infrastructure for sustained 700 W per GPU.[^14] The board ships to OEMs (Supermicro, Dell, HPE, Lenovo, GIGABYTE, QCT, ASUS, ASRock Rack, Wiwynn, Wistron, Foxconn, Inventec, and others) for integration into their own chassis. The bulk of hyperscale H200 deployments use HGX boards rather than full DGX systems.

The [DGX H200](/wiki/dgx_h200) is NVIDIA's reference appliance built around the same eight-GPU HGX H200 baseboard. Each DGX H200 contains eight H200 SXM modules totaling 1.128 TB of HBM3e, two Intel Xeon Sapphire Rapids or Emerald Rapids CPUs, 2 TB of system DRAM, eight ConnectX-7 400 Gb/s NICs for InfiniBand or Ethernet scale-out, two BlueField-3 DPUs, and 30 TB of NVMe storage in the reference configuration. The chassis is 8U and dissipates roughly 10.2 kW under sustained load. NVIDIA quotes 32 PFLOPS of FP8 per DGX H200 (8 x 3.958 PFLOPS, a with-sparsity figure) and 1.1 TB of total GPU memory.

### H200 NVL details

The H200 NVL is intended for enterprise customers running standard 19-inch rack servers with PCIe slots.[^18][^19] Each card occupies two PCIe slots, draws up to 600 W from a 12V-2x6 connector, and exposes 141 GB HBM3e at the same 4.8 TB/s as the SXM variant. The 600 W envelope is achieved by lowering boost clocks; the FP8 Tensor Core throughput (with sparsity) drops from 3,958 TFLOPS on SXM to roughly 3,341 TFLOPS on NVL.[^18]

Up to four H200 NVL cards in a server can be linked with NVLink bridges into a single coherent 564 GB pool at 900 GB/s aggregate inter-GPU bandwidth. Beyond four cards, the only path is PCIe Gen5, so multi-GPU NVL deployments are typically configured as two-way or four-way NVL groups with PCIe between groups. NVIDIA states that with "up to four GPUs connected by NVIDIA NVLink and a 1.5x memory increase, large language model (LLM) inference can be accelerated up to 1.7x" over the H100 NVL, with HPC applications gaining up to 1.3x.[^19] NVIDIA bundles a five-year NVIDIA AI Enterprise subscription with each H200 NVL purchase, which is a notable difference from the SXM SKU.[^19]
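The grouping rule described above (NVLink bridges in 2-way or 4-way sets, PCIe between sets) can be sketched as a small planning helper. The greedy policy and function names here are assumptions for illustration; real server layouts also depend on slot spacing and bridge availability.

```python
H200_NVL_MEMORY_GB = 141  # per-card capacity quoted above


def nvlink_groups(n_cards):
    """Greedily split cards into 4-way and 2-way NVLink groups.

    A leftover single card is returned as a group of 1, meaning it talks
    to the other cards over PCIe only.
    """
    groups = []
    remaining = n_cards
    for size in (4, 2, 1):
        count, remaining = divmod(remaining, size)
        groups.extend([size] * count)
    return groups


def pooled_memory_gb(group_size, per_card_gb=H200_NVL_MEMORY_GB):
    """Combined HBM reachable over NVLink within one group."""
    return group_size * per_card_gb


if __name__ == "__main__":
    for cards in (2, 4, 6, 8):
        groups = nvlink_groups(cards)
        pools = [pooled_memory_gb(g) for g in groups]
        print(f"{cards} cards -> groups {groups}, NVLink pools {pools} GB")
```

An eight-card server therefore ends up as two 564 GB NVLink islands joined by PCIe Gen5, which is the layout a tensor-parallel deployment has to plan around.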

## Compute throughput

The compute spec sheet for the H200 SXM is identical to the H100 SXM. The H200 NVL clocks slightly lower to fit the 600 W envelope, so its peak Tensor Core rates are roughly 84 percent of the SXM numbers, and its FP64 and FP32 rates are closer to 90 percent.

| Precision | H200 SXM | H200 NVL |
| --- | --- | --- |
| FP64 | 34 TFLOPS | 30 TFLOPS |
| FP64 Tensor Core | 67 TFLOPS | 60 TFLOPS |
| FP32 | 67 TFLOPS | 60 TFLOPS |
| [TF32](/wiki/tf32) Tensor Core (dense / sparse) | 494 / 989 TFLOPS | ~418 / 835 TFLOPS |
| BF16 Tensor Core (dense / sparse) | 989 / 1,979 TFLOPS | ~835 / 1,671 TFLOPS |
| FP16 Tensor Core (dense / sparse) | 989 / 1,979 TFLOPS | ~835 / 1,671 TFLOPS |
| FP8 Tensor Core (dense / sparse) | 1,979 / 3,958 TFLOPS | ~1,671 / 3,341 TFLOPS |
| INT8 Tensor Core (dense / sparse) | 1,979 / 3,958 TOPS | ~1,671 / 3,341 TOPS |

The "sparse" numbers assume NVIDIA's 2:4 structured sparsity is in use.[^3] Most production LLM inference runs do not exploit 2:4 sparsity, so the dense numbers are usually the relevant ones. The sparsity uplift is genuine but only applies if you have actually trained or pruned the model into the 2:4 pattern.
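To make the 2:4 pattern concrete: in every aligned group of four weights, at most two may be nonzero. The sketch below checks that property and applies a naive magnitude-based pruning to one row of weights. Magnitude pruning is an assumption chosen for simplicity; production workflows typically fine-tune after pruning to recover accuracy, and the function names are illustrative.

```python
def is_2_4_sparse(row):
    """True if every aligned group of four values has at most two nonzeros."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    return all(
        sum(1 for v in row[i:i + 4] if v != 0) <= 2
        for i in range(0, len(row), 4)
    )


def prune_2_4(row):
    """Zero the two smallest-magnitude entries in each group of four."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    out = list(row)
    for i in range(0, len(out), 4):
        indices = range(i, i + 4)
        # Keep the two largest magnitudes; ties resolve by position.
        keep = sorted(indices, key=lambda j: abs(out[j]), reverse=True)[:2]
        for j in indices:
            if j not in keep:
                out[j] = 0.0
    return out


if __name__ == "__main__":
    weights = [0.1, -0.9, 0.5, 0.2, 1.0, 0.0, -0.3, 0.05]
    pruned = prune_2_4(weights)
    print(is_2_4_sparse(weights), is_2_4_sparse(pruned))
    print(pruned)
```

A dense checkpoint fails the check, so the doubled "sparse" rates do not apply to it until the weights have been brought into this pattern.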

## How much faster is the H200 than the H100?

NVIDIA's published Llama 2 70B inference benchmark, run with TensorRT-LLM, shows H200 delivering roughly 1.6 to 1.9 times the throughput of H100.[^1][^11] The exact ratio depends on batch size and sequence length. The reason is straightforward: at long context, the KV cache dominates memory traffic, and the H200's 1.43x bandwidth advantage compounds with its ability to keep the whole cache in local HBM rather than splitting it across GPUs and paying tensor-parallel communication costs. NVIDIA reports the H200 provides "nearly 1.8x more GPU memory and 1.4x higher GPU memory bandwidth compared to the H100," which is the source of the inference uplift.[^11]

The most-cited public benchmark is [MLPerf](/wiki/mlperf) Inference v4.0 (March 2024), where eight-GPU H200 systems beat eight-GPU H100 systems on Llama 2 70B by approximately 45 percent.[^21][^22] NVIDIA's own breakdown attributes 28 percent of that gain to the H200 running at its standard 700 W TDP, with the remainder coming from a custom 1,000 W configurable thermal design.[^11] By v4.1 (August 2024), NVIDIA reported up to 27 percent additional H200 gains from software optimizations alone, and an H200 system with eight GPUs configured at 1,000 W reached roughly 33,000 tokens per second on Llama 2 70B server-mode submissions.[^21] (The 1,000 W mode is a special configurable TDP that pushes the part beyond its 700 W reference setting; standard datacenter deployments run at 700 W.)

For HPC workloads, NVIDIA's bandwidth-bound benchmarks (HPCG, MILC, certain quantum chemistry codes) show roughly 1.4x speedups over H100.[^1] Compute-bound HPC codes (those that already saturate the FP64 or FP64 Tensor Core pipelines) see negligible improvement; the math units are the same.
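The split between bandwidth-bound and compute-bound codes can be sketched with a simple roofline-style model, in which a kernel's runtime is the larger of its compute time and its memory-traffic time. The function names are illustrative, the default peak is the 67 TFLOPS FP64 Tensor Core figure quoted in this article, and real kernels rarely hit either peak, so this is an idealized upper bound on the speedup.

```python
H100_BW = 3.35e12   # bytes/s, H100 SXM memory bandwidth
H200_BW = 4.8e12    # bytes/s, H200 SXM memory bandwidth


def kernel_time_s(flops, bytes_moved, peak_flops, bandwidth):
    """Idealized roofline runtime: limited by compute or by memory traffic."""
    return max(flops / peak_flops, bytes_moved / bandwidth)


def h200_speedup(flops, bytes_moved, peak_flops=67e12):
    """Ratio of H100 to H200 runtime for the same kernel (peak FLOPS is shared)."""
    t_h100 = kernel_time_s(flops, bytes_moved, peak_flops, H100_BW)
    t_h200 = kernel_time_s(flops, bytes_moved, peak_flops, H200_BW)
    return t_h100 / t_h200


if __name__ == "__main__":
    # 1 FLOP per byte: firmly memory-bound on both parts.
    print(f"memory-bound:  {h200_speedup(1e12, 1e12):.2f}x")
    # 1,000 FLOPs per byte: firmly compute-bound on both parts.
    print(f"compute-bound: {h200_speedup(1e15, 1e12):.2f}x")
```

Kernels whose arithmetic intensity falls between roughly 14 and 20 FLOPs per byte (67 TFLOPS divided by each bandwidth) land in between: memory-bound on the H100 but compute-bound on the H200, so they see only part of the 1.43x gain.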

For LLM training, the picture is more nuanced. If your training job already fits in H100 memory at the desired batch size, the H200 buys you very little because compute is unchanged. If you were using gradient checkpointing or aggressive pipeline parallelism to squeeze a model into H100 memory, the H200 lets you raise the per-GPU batch size and reduce communication, which in practice moves training throughput up by 10 to 30 percent on 70B-class models.

### MLPerf and benchmark detail

The MLPerf Inference benchmark suite is the standard public reference for accelerator performance. The H200 has appeared in v4.0 (March 2024), v4.1 (August 2024), v5.0 (early 2025), and v5.1 (September 2025) submissions, mostly through NVIDIA's own entries and through partners like CoreWeave, Supermicro, Dell, and Lenovo.[^21][^23]

- **MLPerf Inference v4.0 (March 2024):** first H200 submissions. Eight-GPU H200 systems posted around 31,712 tokens per second on Llama 2 70B in server mode, roughly 45 percent faster than the same configuration with H100.[^22]
- **MLPerf Inference v4.1 (August 2024):** NVIDIA submitted H200 results with TensorRT-LLM optimizations including improved KV cache management, better GEMM scheduling, and FP8 quantization. Server-mode tokens per second on Llama 2 70B reached the low 33,000 range with an eight-GPU H200 at the 1,000 W configurable TDP.[^21]
- **MLPerf Inference v5.0 (early 2025):** H200 submissions remained competitive even as Blackwell parts entered the suite. The H200 results stayed roughly stable; NVIDIA's optimization work over the year extracted a few additional percent.
- **MLPerf Inference v5.1 (September 2025):** Llama 3.1 70B was added as a new benchmark. Eight-GPU HGX H200 systems hit roughly 31,391 tokens per second offline with SGLang v0.4.9 and 26,319 tokens per second with vLLM v0.9.2 on pre-quantized weights, and 30,893 tokens per second with vLLM on dynamic quantization. NVIDIA reported an 11 percent improvement on Llama 2 70B over H100 v5.0 results for the H200 generation, and open-source engines (vLLM, SGLang) had closed to roughly 90 percent of NVIDIA's own TensorRT-LLM submissions over the preceding six months.[^23]

Independent third-party benchmarks (ServeTheHome, The Next Platform, MLCommons-published partner submissions) confirm the broad pattern. The H200's advantage is largest on memory-bound LLM inference, smallest on training compute already fitting in H100 memory, and somewhere in between on HPC workloads depending on the kernel mix. On Mixtral 8x7B and Mixtral 8x22B, where MoE routing means activations are sparse but weights remain large, H200 systems also show 1.5x to 1.8x throughput improvements over H100.[^11]

## How much does the H200 cost?

NVIDIA does not publish list prices for SXM modules. Resellers list H200 SXM modules at roughly 31,000 to 40,000 USD per GPU through 2024 and 2025, and DGX H200 systems (eight GPUs plus chassis) have been reported in the 400,000 to 500,000 USD range.[^24] H200 NVL PCIe cards are listed at retail by some distributors at roughly 31,000 USD. These figures vary widely by reseller, region, contract term, and supply conditions; treat them as order-of-magnitude rather than catalog prices.

Cloud rental is the more common access path. As of mid-2026, on-demand rates for one H200 SXM GPU-hour at major hyperscalers and neoclouds typically range from roughly 3.72 USD at marketplace and spot providers up to about 10.60 USD at AWS and Azure on-demand, with reserved or committed capacity well below those figures.[^25][^26] At those per-GPU rates, an eight-GPU H200 instance works out to roughly 30 to 85 USD per hour on-demand. The H200 has consistently traded at a 15 to 20 percent premium over the H100 on a per-GPU-hour basis through 2025 and into 2026.[^26]

Relative pricing across the Hopper-Blackwell line in mid-2026 looks roughly like this:

| Part | Cloud on-demand (per GPU-hour, USD, mid-2026) | Notes |
| --- | --- | --- |
| H100 SXM | 1.03 to 12.29 (median around 2 to 4) | Spot pricing collapsed in 2025; hyperscaler list rates still elevated[^27][^26] |
| **H200 SXM** | **3.72 to 10.60 (median around 4 to 7)** | **Most common LLM inference SKU on hyperscalers[^25][^26]** |
| B200 SXM | 2.12 (spot) to 14.24 (on-demand); index around 4 to 6 | Volatile through 2026 as Blackwell supply ramps[^28][^29] |
| B300 / GB300 | not yet broadly priced on-demand | Limited 2H 2025 launches; mostly committed deals[^30] |

Reserved and committed capacity at one-year and three-year terms is significantly cheaper than on-demand. CoreWeave, Lambda, and the hyperscalers all publish reserved-instance discounts in the 30 to 60 percent range depending on the commitment length and total volume.
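The pricing figures above can be turned into a rough cost-per-token estimate. The sketch below uses the per-GPU rates and discount range from this section together with the 31,712 tokens-per-second MLPerf v4.0 result quoted earlier; the function names are illustrative, and because MLPerf throughput is a best-case benchmark number, real serving costs will be higher.

```python
def node_hourly_usd(per_gpu_hour_usd, gpus=8):
    """Hourly price of a multi-GPU node at a flat per-GPU rate."""
    return per_gpu_hour_usd * gpus


def discounted(rate_usd, discount):
    """Apply a fractional reserved-capacity discount (0.30 means 30 percent off)."""
    return rate_usd * (1.0 - discount)


def usd_per_million_tokens(node_hourly, tokens_per_second):
    """Cost of generating one million tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return node_hourly / tokens_per_hour * 1e6


if __name__ == "__main__":
    throughput = 31712  # tokens/s for an eight-GPU node (benchmark figure)
    for label, rate in (("low", 3.72), ("high", 10.60)):
        node = node_hourly_usd(rate)
        cost = usd_per_million_tokens(node, throughput)
        reserved = usd_per_million_tokens(discounted(node, 0.30), throughput)
        print(f"{label}: {node:.2f} USD/h, {cost:.2f} USD per 1M tokens "
              f"({reserved:.2f} with a 30 percent discount)")
```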

### Cloud availability

| Provider | Instance / SKU | Per-node config | NVLink topology | Networking |
| --- | --- | --- | --- | --- |
| [AWS](/wiki/aws) | EC2 P5e and P5en | 8 x H200 SXM | NVSwitch all-to-all | 3,200 Gbps EFAv3 |
| [Azure](/wiki/azure) | ND H200 v5 | 8 x H200 SXM | NVSwitch all-to-all | 3,200 Gbps NDR InfiniBand |
| Google Cloud | A3 Ultra | 8 x H200 SXM | NVSwitch all-to-all | 3,200 Gbps RoCE |
| Oracle Cloud | BM.GPU.H200.8 (bare metal) | 8 x H200 SXM | NVSwitch all-to-all | 3,200 Gbps RoCEv2 |
| [CoreWeave](/wiki/coreweave) | HGX H200 nodes | 8 x H200 SXM | NVSwitch all-to-all | 3,200 Gbps NDR InfiniBand |
| [Lambda](/wiki/lambda_labs) | 1-Click Clusters and on-demand | 1 to 512 H200 SXM | NVSwitch + IB | NDR InfiniBand |
| Crusoe | Crusoe Cloud H200 | 8 x H200 SXM | NVSwitch all-to-all | 3,200 Gbps NDR InfiniBand |
| Vultr | H200 bare metal and VM | 1 to 8 x H200 SXM | NVSwitch | RoCE |
| RunPod | Secure Cloud H200 | 1 to 8 x H200 SXM | NVSwitch | 3,200 Gbps |
| Together AI | H200 endpoints | Shared and dedicated | NVSwitch | NDR |
| Nebius | H200 cluster shapes | 8 x H200 SXM | NVSwitch | 3,200 Gbps NDR |

Most providers list both single-GPU instances and full eight-GPU bare-metal nodes. The eight-GPU configurations are the most common deployment for production LLM inference and for fine-tuning runs that need an HGX-class NVLink fabric in a single node. Larger training jobs span multiple nodes connected by 400 Gbps NDR InfiniBand or 400/800 Gbps RoCE Ethernet.
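The gap between in-node and between-node bandwidth explains why eight-GPU nodes are the natural unit. The sketch below compares the 900 GB/s NVLink figure with the per-GPU share of a 3,200 Gbps node fabric, both taken from the tables in this article. The names are illustrative, and the comparison is deliberately rough: the NVLink figure is an aggregate bidirectional number, while NIC line rates are quoted per direction.

```python
NVLINK_GB_S_PER_GPU = 900   # NVLink 4 aggregate per GPU
NODE_NETWORK_GBPS = 3200    # typical eight-GPU node scale-out fabric
GPUS_PER_NODE = 8


def gbps_to_gb_s(gbps):
    """Convert gigabits per second to gigabytes per second."""
    return gbps / 8


def scale_out_gb_s_per_gpu(node_gbps=NODE_NETWORK_GBPS, gpus=GPUS_PER_NODE):
    """Each GPU's share of the node's network bandwidth, in GB/s."""
    return gbps_to_gb_s(node_gbps) / gpus


def nvlink_to_network_ratio():
    """How much more bandwidth a GPU has inside the node than outside it."""
    return NVLINK_GB_S_PER_GPU / scale_out_gb_s_per_gpu()


if __name__ == "__main__":
    print(f"scale-out share: {scale_out_gb_s_per_gpu():.0f} GB/s per GPU")
    print(f"NVLink advantage: about {nvlink_to_network_ratio():.0f}x")
```

This is why tensor parallelism is usually kept inside a node, with the slower inter-node fabric reserved for data-parallel or pipeline-parallel traffic.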

## Notable deployments

The H200 has been adopted broadly across hyperscalers, neoclouds, AI labs, and HPC sites. Notable deployments include:

- **AWS P5e and P5en instances**, eight H200 SXM per node, NVLink-connected, with Elastic Fabric Adapter v3 networking between nodes. Available in select regions (us-east-1, us-west-2, eu-west-1, ap-northeast-1) starting in 2024.[^14]
- **Microsoft Azure ND H200 v5**, eight H200 SXM per VM, primarily targeted at customers running large model fine-tuning and inference.[^14] Microsoft has used H200 capacity to support OpenAI's inference workloads alongside Blackwell as the latter has ramped.
- **Google Cloud A3 Ultra**, eight H200 per VM, available in select regions. A3 Ultra is the higher-bandwidth networking successor to the H100-based A3 Mega shape.[^14]
- **Oracle Cloud Infrastructure** offers H200 bare-metal (BM.GPU.H200.8) and VM shapes, with RDMA cluster networking aimed at large-scale training.[^14]
- **CoreWeave**, **Lambda**, **RunPod**, **Crusoe**, **Vultr**, **Nebius**, **Together AI**, and other GPU-focused clouds operate H200 capacity at scale; CoreWeave has publicly disclosed H200 deployments in the tens of thousands of GPUs as of 2025.
- **Stargate Abilene** (Texas), the Crusoe-built data center campus that anchors the OpenAI / Oracle / SoftBank Stargate Project, is a mixed Hopper-Blackwell site. Construction on the first two buildings began in June 2024 and they were energized in September 2025; early workloads ran on a combination of H100 and H200 capacity before Oracle began delivering Blackwell GB200 racks in June 2025. The full eight-building campus targets roughly 1.2 GW and is scheduled for completion by mid-2026, eventually housing on the order of 450,000 Blackwell-class GPUs alongside its earlier Hopper fleet.[^31][^32]
- **xAI Colossus** (Memphis, Tennessee) used 100,000 H100s at first launch in 2024; subsequent expansions have included H200 capacity alongside the larger Blackwell buildouts that became the focus through 2025 and into 2026.
- **Meta** committed to a mixed H100 and H200 fleet exceeding several hundred thousand GPUs through 2024 and 2025. Mark Zuckerberg's 2024 statements referenced 350,000 H100-equivalent units; subsequent capacity has included substantial H200 procurement.
- **OpenAI** uses H200 capacity through Microsoft Azure for inference and fine-tuning, alongside increasing Blackwell allocations on Azure and on Stargate Abilene.[^31]
- **Jupiter exascale supercomputer** at Forschungszentrum Jülich uses GH200 modules with H200-class HBM3e memory; commissioned through 2024 and 2025 as Europe's first official exascale system.
- **Various national labs and academic consortia** operate H200 nodes for HPC plus AI workloads, including Argonne, Oak Ridge, Los Alamos, NCSA, and EuroHPC sites.
- **Grok 3 training** at xAI used a mix of H100 and H200 capacity inside Colossus during late 2024 and early 2025; subsequent Grok versions transitioned more workload to Blackwell as it became available.

## What are the China export controls on the H200?

The H200 has been at the center of the US-China export-control debate over advanced AI accelerators. The original October 2023 update to the Bureau of Industry and Security (BIS) controls placed the full-spec H100 and H200 above the performance thresholds for general export to China; NVIDIA shipped a downgraded HGX H20 part for the Chinese market instead.[^13][^33]

In early December 2025, the Trump administration directed Commerce to allow case-by-case H200 exports to China subject to inter-agency review and a 25 percent export levy.[^13][^33] Initial approvals reportedly cleared roughly ten Chinese customers (including Alibaba, Tencent, and others) for an aggregate ceiling on the order of 40,000 to 80,000 H200 GPUs (5,000 to 10,000 HGX H200 boards) drawn from existing inventory, with the total capped well below half of US customer volume.[^33] By May 2026, however, Beijing had largely declined to approve H200 procurement on the Chinese side, with PRC authorities steering domestic buyers toward homegrown accelerators; months after the White House decision, NVIDIA had still not booked a material volume of approved H200 sales into China.[^34][^35] The H200 remains roughly six times more capable than the H20 part originally cleared for the Chinese market under the 2023 rules.[^33]

## How does the H200 compare to other AI accelerators?

The table below summarizes how the H200 fits in NVIDIA's data-center GPU lineup alongside its predecessor, successors, and the most prominent competing accelerators. Specs are as published by each vendor for the flagship SXM-equivalent variant.

| Product | Architecture | Memory | Bandwidth | FP8 / FP4 Tensor (dense) | TDP | Year |
| --- | --- | --- | --- | --- | --- | --- |
| [NVIDIA A100](/wiki/nvidia_a100) SXM4 80GB | Ampere | 80 GB HBM2e | 2.0 TB/s | n/a (FP16: 312 TFLOPS) | 400 W | 2020 |
| [NVIDIA H100](/wiki/nvidia_h100) SXM5 | [Hopper](/wiki/nvidia_hopper) | 80 GB HBM3 | 3.35 TB/s | 1,979 TFLOPS FP8 | 700 W | 2022 |
| H100 NVL (PCIe pair) | Hopper | 188 GB HBM3 (94 GB per GPU) | 7.8 TB/s (paired) | 3,341 TFLOPS FP8 (per pair) | 2 x 400 W | 2023 |
| **H200 SXM** | **Hopper** | **141 GB HBM3e** | **4.8 TB/s** | **1,979 TFLOPS FP8** | **700 W (up to 1,000 W configurable)** | **2024** |
| H200 NVL | Hopper | 141 GB HBM3e | 4.8 TB/s | ~1,671 TFLOPS FP8 | 600 W | 2024 |
| GH200 (H200) | Grace + Hopper | 144 GB HBM3e + 480 GB LPDDR5X | 4.9 TB/s GPU | 1,979 TFLOPS FP8 | up to 1,000 W | 2024 |
| [NVIDIA B100](/wiki/nvidia_b100) | [Blackwell](/wiki/nvidia_blackwell) | 192 GB HBM3e | 8 TB/s | 3,500 TFLOPS FP8 / 7,000 TFLOPS FP4 | 700 W | 2024 |
| [NVIDIA B200](/wiki/nvidia_b200) | Blackwell | 192 GB HBM3e | 8 TB/s | 4,500 TFLOPS FP8 / 9,000 TFLOPS FP4 | 1,000 W | 2024 to 2025 |
| [NVIDIA GB200](/wiki/nvidia_gb200) (per Blackwell GPU) | Grace + Blackwell | 192 GB HBM3e | 8 TB/s | 5,000 TFLOPS FP8 / 10,000 TFLOPS FP4 | up to 1,200 W | 2024 to 2025 |
| [NVIDIA B300](/wiki/nvidia_b300) (Blackwell Ultra) | Blackwell Ultra | 288 GB HBM3e | 8 TB/s | ~5,000 TFLOPS FP8 / 15,000 TFLOPS FP4 | up to 1,400 W | 2H 2025[^12][^30] |
| [AMD MI300X](/wiki/amd_instinct_mi300x) | CDNA 3 | 192 GB HBM3 | 5.3 TB/s | 2,615 TFLOPS FP8 | 750 W | 2023 |
| [AMD MI325X](/wiki/amd_instinct_mi325x) | CDNA 3 | 256 GB HBM3e | 6.0 TB/s | 2,615 TFLOPS FP8 | 1,000 W | 2024 |
| Google TPU v5p | TPU v5p | 95 GB HBM | 2.8 TB/s | n/a (BF16: 459 TFLOPS) | n/a (cloud only) | 2023 |
| Google [TPU](/wiki/tpu) Trillium (v6e) | TPU v6e | 32 GB HBM | 1.6 TB/s | n/a (BF16: 918 TFLOPS) | n/a (cloud only) | 2024 |

A few points are worth highlighting from this table.

The H200's 141 GB sits between the H100's 80 GB and the B200's 192 GB. AMD's MI300X also reaches 192 GB but with HBM3 (not HBM3e), so its 5.3 TB/s bandwidth falls between the H200's 4.8 TB/s and the B200's 8 TB/s. The MI325X stretches the memory to 256 GB at 6.0 TB/s, the largest single-package memory pool among these accelerators in 2024. AMD has positioned the MI300X and MI325X explicitly as LLM-inference parts, where the memory advantage matters most. NVIDIA's response across the H200 and Blackwell line has been a combination of HBM3e adoption, NVSwitch-scale memory pooling across eight GPUs, and software ecosystem investment in TensorRT-LLM and NIM microservices.
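To make the capacity and bandwidth comparison concrete, the following rough sketch turns the table's memory figures into an upper bound on single-sequence decode speed. It assumes decoding is purely bandwidth-bound, meaning every generated token streams all model weights from HBM once, and it ignores KV cache reads, kernel overheads, and batching, so real throughput will be lower. The 70 GB weight size (a 70B model at one byte per parameter) and the function names are illustrative assumptions, not vendor figures.

```python
# Capacity (GB) and bandwidth (TB/s) copied from the comparison table.
ACCELERATORS = {
    "H100 SXM5": (80, 3.35),
    "H200 SXM": (141, 4.8),
    "MI300X": (192, 5.3),
    "MI325X": (256, 6.0),
    "B200": (192, 8.0),
}

WEIGHT_GB = 70.0  # assumption: 70B parameters stored in FP8 (1 byte each)


def decode_ceiling_tokens_per_s(bandwidth_tb_s: float, weight_gb: float) -> float:
    """Upper bound on tokens/s for one sequence if each token requires
    reading every weight from memory exactly once (simplifying assumption)."""
    return bandwidth_tb_s * 1000.0 / weight_gb


def summarize(weight_gb: float = WEIGHT_GB):
    """Return (name, fits_in_memory, ceiling_tokens_per_s) for each accelerator."""
    rows = []
    for name, (capacity_gb, bandwidth_tb_s) in ACCELERATORS.items():
        fits = weight_gb < capacity_gb
        ceiling = decode_ceiling_tokens_per_s(bandwidth_tb_s, weight_gb) if fits else None
        rows.append((name, fits, ceiling))
    return rows


if __name__ == "__main__":
    for name, fits, ceiling in summarize():
        text = f"{ceiling:.1f} tokens/s ceiling" if fits else "weights do not fit"
        print(f"{name:10s} {text}")
```

Because the ceiling scales linearly with bandwidth, the H200's bound is about 1.43 times the H100's (4.8 / 3.35) regardless of the model size chosen. Batching raises aggregate throughput well beyond this per-sequence figure, which is where the larger memory pool matters.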

The H100 NVL deserves a footnote. NVIDIA briefly sold an H100 NVL part that pairs two PCIe cards over NVLink and exposes 188 GB of memory across the pair (94 GB per card, roughly 96 GB physical with a few GB held back). The H200 NVL is a different product: each H200 NVL card carries 141 GB on its own.

Google's TPU offerings are not directly comparable on a per-chip basis because they are sold as cloud capacity with their own software stack ([JAX](/wiki/jax), TensorFlow XLA, PyTorch via TPU plugin) and a different scale-out fabric. For workloads that Google has tuned heavily (Gemini training, internal recommendation systems, BERT-class inference), TPU v5p and Trillium are competitive with H200 and B200 on cost per training step. For the broader CUDA ecosystem of open-source models, optimized inference engines, and third-party tooling, H200 retains a substantial software advantage.

## Software stack

The H200 uses the same Hopper compute capability (sm_90) as the H100, so binaries compiled for H100 run unchanged. Software support comes from the standard NVIDIA stack:

- [CUDA](/wiki/cuda) 12.x toolkit with Hopper code generation
- [cuDNN](/wiki/cudnn) for primitive deep-learning kernels
- [TensorRT-LLM](/wiki/tensorrt_llm) and the NVIDIA [Triton Inference Server](/wiki/triton_inference_server) for production inference
- [NCCL](/wiki/nccl) for multi-GPU and multi-node collectives
- [PyTorch](/wiki/pytorch) 2.x with native Hopper support, including FP8 training via the Transformer Engine package
- Megatron-LM, NeMo, and DeepSpeed for large-model training
- vLLM, SGLang, and TGI for open-source LLM serving

NVIDIA AI Enterprise (a subscription product bundled free with H200 NVL for five years) packages enterprise-supported builds of these components plus the NIM microservice runtime.[^19]
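Because the H100 and H200 report the same compute capability, software that needs to tell them apart has to look at memory size or the device name rather than the architecture. The sketch below shows one way to do that with PyTorch's `torch.cuda` queries. The helper names are hypothetical, and the function returns `None` when PyTorch or a CUDA device is unavailable.

```python
def is_hopper(capability) -> bool:
    """True for compute capability 9.0 (sm_90), which H100 and H200 share."""
    return tuple(capability) == (9, 0)


def describe_gpu(index: int = 0):
    """Return (name, capability, total_memory_gb) for a CUDA device,
    or None if PyTorch or a usable CUDA device is not present."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    props = torch.cuda.get_device_properties(index)
    capability = torch.cuda.get_device_capability(index)
    return props.name, capability, props.total_memory / 1e9


if __name__ == "__main__":
    info = describe_gpu()
    if info is None:
        print("No CUDA device visible")
    else:
        name, capability, memory_gb = info
        # Same sm_90 on H100 and H200; memory size is the distinguishing field.
        print(f"{name}: sm_{capability[0]}{capability[1]}, "
              f"{memory_gb:.0f} GB, hopper={is_hopper(capability)}")
```

The memory figure reported by the driver is usually slightly below the nominal capacity, so any threshold used to separate an H100 from an H200 should leave some margin.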

### TensorRT-LLM and NIM

TensorRT-LLM is the inference engine NVIDIA most aggressively tunes against new hardware generations. It is the path through which most of the H200's advertised inference gains are realized in production.[^11] Key features that are particularly useful on H200:

- **In-flight batching** (also called continuous batching) avoids the head-of-line blocking that would otherwise tie up the larger memory budget on long sequences.
- **Paged KV cache** allows the cache to be allocated in fixed-size blocks rather than as a contiguous tensor per request, which is essential when the KV cache is the dominant memory consumer.
- **FP8 quantization** through the Transformer Engine reduces both compute and memory footprint: weights take half the space of 16-bit formats, and peak Tensor Core throughput is up to twice that of FP16.
- **Speculative decoding** uses a smaller draft model to propose tokens that the H200 verifies in parallel.
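The paged KV cache idea from the list above can be illustrated with a toy block allocator. This is a minimal sketch of the general technique, not the actual TensorRT-LLM implementation: the class name, method names, and the block size of 16 tokens are all hypothetical, and it tracks only block ownership rather than real tensors.

```python
class PagedKVCache:
    """Toy allocator: each request owns a list of fixed-size cache blocks
    instead of one contiguous region sized for its maximum length."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # indices of unused blocks
        self.block_tables = {}                      # request id -> block indices
        self.lengths = {}                           # request id -> cached tokens

    def append_token(self, request_id: str) -> bool:
        """Reserve space for one more token. Returns False when a new block
        is needed but the pool is exhausted."""
        length = self.lengths.get(request_id, 0)
        if length % self.block_size == 0:  # no blocks yet, or all are full
            if not self.free_blocks:
                return False
            block = self.free_blocks.pop()
            self.block_tables.setdefault(request_id, []).append(block)
        self.lengths[request_id] = length + 1
        return True

    def release(self, request_id: str) -> None:
        """Return a finished request's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)


if __name__ == "__main__":
    cache = PagedKVCache(num_blocks=4)
    for _ in range(17):
        cache.append_token("request-a")
    print(cache.block_tables["request-a"])  # two blocks cover 17 tokens
```

A request wastes at most one partially filled block, so a larger HBM pool translates fairly directly into more blocks and therefore more concurrent requests.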

NIM (NVIDIA Inference Microservices) packages popular open-source models (Llama 3, Llama 3.1, Mixtral, Phi, Qwen, Mistral, and others) as Docker containers that can be deployed on supported NVIDIA data-center GPUs, Hopper included. NIM containers automatically detect the underlying hardware (H100, H200, B100, B200) and select a tuned engine. On H200, NIM typically defaults to a single-GPU engine for 70B-class models because the model fits without sharding.

### Open-source frameworks

The H200 is well supported by open-source LLM serving frameworks even outside NVIDIA's first-party stack.[^23]

- **vLLM** added explicit H200 detection and KV cache sizing in versions 0.4 and later. The paged-attention kernel that vLLM is known for is particularly effective on H200 because the larger HBM3e budget allows higher KV cache block counts and therefore higher request concurrency.
- **SGLang** ships with Hopper-tuned attention kernels and adopts a similar paged KV cache design. Its compiled inference engine is competitive with TensorRT-LLM on many open-source models; SGLang's MLPerf Inference v5.1 submissions on eight-GPU H200 reached roughly 90 percent of NVIDIA's reference results on Llama 3.1 70B.[^23]
- **Hugging Face TGI** runs unmodified on H200; the Hugging Face team maintains H200 benchmark numbers in its serving documentation.
- **DeepSpeed Inference** and **FasterTransformer** legacy paths still work but are largely superseded by TensorRT-LLM, vLLM, and SGLang for production deployments.

## What is the H200 used for?

The H200 is most useful where memory rather than compute is the constraint:

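- **Single-GPU 70B-class LLM inference.** [Llama 2](/wiki/llama_2) 70B, [Llama 3](/wiki/llama_3) 70B, and similar models fit on a single H200, where on H100 they require at least two cards in 16-bit precision. At 16 bits the weights alone occupy about 140 GB, which leaves little room for KV cache, so FP8 weights are the more comfortable single-GPU configuration. Mixtral 8x22B, at roughly 141 billion total parameters, does not fit on one H200 in 16-bit precision.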
- **Long-context inference.** Larger KV cache budgets let you serve longer prompts or higher batch sizes at the same context length. Models with 128K context windows benefit disproportionately.
- **Embedding and retrieval workloads** that need to keep a large index resident in GPU memory, including ColBERT-style late-interaction retrievers and on-GPU vector search.
- **Mid-scale fine-tuning** of 7B to 70B models, where the extra memory reduces the need for activation checkpointing or aggressive sharding.
- **Bandwidth-bound HPC simulations** (CFD, climate, drug discovery, seismic, computational chemistry, lattice QCD).
- **Generative media inference** for diffusion and video models with large batch sizes; particularly useful for video models where temporal context inflates activation memory.
- **Mixture-of-experts (MoE) serving** for models like Mixtral 8x22B, DeepSeek-V2, and Qwen2-MoE where total parameter counts exceed dense-model equivalents.
- **Agentic workloads** that maintain long conversation histories and tool-call state across many turns; the larger KV cache holds more of the working context without paging.

It is a poor choice when the workload is compute-bound and already fits comfortably in H100 memory; you would be paying H200 prices for H100 throughput. It is also a poor choice when the workload would otherwise fit on an L40S, an H100, or even an A100, since those parts are cheaper per GPU-hour on most clouds.
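Whether a workload is memory-constrained can be estimated with simple arithmetic before renting hardware. The sketch below computes weight size and the number of tokens of KV cache that fit in the remaining memory. It assumes the published Llama 2 70B shape (80 layers, 8 key-value heads under grouped-query attention, head dimension 128) and a 16-bit KV cache. The function names are hypothetical, and activations and framework overhead are ignored unless passed as `reserve_gb`.

```python
HBM_GB = {"H100": 80, "H200": 141}


def weight_gb(params_billion: float, bytes_per_param: float) -> float:
    """Weight footprint in GB (1e9 parameters times bytes, over 1e9 bytes per GB)."""
    return params_billion * bytes_per_param


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache per token: keys and values for every layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def max_cached_tokens(hbm_gb: float, weights_gb: float, per_token_bytes: int,
                      reserve_gb: float = 0.0) -> int:
    """Tokens of KV cache that fit after weights (0 if the weights do not fit)."""
    free_bytes = (hbm_gb - weights_gb - reserve_gb) * 1e9
    return max(0, int(free_bytes // per_token_bytes))


if __name__ == "__main__":
    per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)
    for gpu, capacity in HBM_GB.items():
        for label, nbytes in (("16-bit", 2), ("FP8", 1)):
            tokens = max_cached_tokens(capacity, weight_gb(70, nbytes), per_token)
            print(f"{gpu} {label}: {tokens} KV-cache tokens")
```

Under these assumptions, 16-bit weights leave an H200 room for only about 3,000 cached tokens, while FP8 weights leave room for more than 200,000. The token budget is shared across all concurrent requests, so it bounds the product of batch size and context length.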

## Limitations

- The compute pipeline is identical to the H100, so workloads that are math-bound and already memory-fit see no benefit.
- 141 GB is less than the B200's 192 GB and well below the MI325X's 256 GB and the B300's 288 GB. For frontier-scale models, Blackwell parts are the better choice when you can get them, and AMD MI325X is competitive on memory capacity for inference-only workloads.[^12][^30]
- 700 W per GPU on the SXM is a serious thermal load. Eight-GPU HGX H200 systems can pull 10 kW or more under sustained load and need rack-level cooling capacity that older datacenters often lack.
- Through most of 2024, supply was constrained because Micron was the only qualified HBM3e source for the first months. SK hynix and Samsung qualifications eased the constraint later in the year.[^7][^16]
- The H200 NVL caps NVLink connectivity at four GPUs per bridge, which is fewer than the eight-way SXM HGX configuration.
- The Transformer Engine still requires careful loss-scaling and per-tensor statistics tracking; FP8 is not free, and bad scaling choices can hurt accuracy.
- The configurable 1,000 W TDP mode that NVIDIA used in some MLPerf submissions is not the standard datacenter deployment mode. Most production HGX H200 systems run at 700 W per GPU, so customers should not assume the 1,000 W benchmark numbers in their own deployments.[^21]
- The H200 does not support FP4 natively. Workloads relying on FP4 (notably reasoning-model inference where Blackwell's NVFP4 path roughly doubles throughput) need B200 or B300 silicon to benefit.[^12]
- The Hopper architecture predates the FP4 Tensor Core and the second-generation Transformer Engine that ship with Blackwell, so for reasoning workloads with long chain-of-thought decoding, B200 and B300 platforms outperform H200 by larger margins than the 1.5x to 1.8x typically reported for short-form Llama 70B inference.[^30]
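The FP8 scaling caveat in the list above can be demonstrated with a simplified emulation of the E4M3 format (3 mantissa bits, largest finite value 448). This is an illustrative sketch, not the Transformer Engine's implementation: it uses a single scale computed from the current tensor's maximum, whereas production recipes typically track statistics over several steps. All function names are hypothetical.

```python
import math

E4M3_MAX = 448.0  # largest finite magnitude in FP8 E4M3


def quantize_e4m3(x: float) -> float:
    """Round x to the nearest value on a simplified E4M3 grid,
    saturating at +/-448 and flushing tiny values toward zero."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    magnitude = min(abs(x), E4M3_MAX)
    _, exp = math.frexp(magnitude)   # magnitude = m * 2**exp, 0.5 <= m < 1
    exponent = max(exp - 1, -6)      # clamp at the minimum normal exponent
    step = 2.0 ** (exponent - 3)     # grid spacing with 3 mantissa bits
    return sign * round(magnitude / step) * step


def per_tensor_scale(values) -> float:
    """Scale that maps the largest magnitude in the tensor onto E4M3_MAX."""
    amax = max(abs(v) for v in values)
    return E4M3_MAX / amax if amax > 0 else 1.0


def fp8_roundtrip(values, scale: float):
    """Scale, quantize to the E4M3 grid, then undo the scale."""
    return [quantize_e4m3(v * scale) / scale for v in values]


if __name__ == "__main__":
    small_gradients = [1e-4, 2e-4, 3e-4]
    print(fp8_roundtrip(small_gradients, scale=1.0))  # all flushed to zero
    print(fp8_roundtrip(small_gradients, per_tensor_scale(small_gradients)))
```

With no scaling, the small values underflow and the information is lost, while a scale that is too large would instead saturate at 448. A well-chosen scale keeps the relative error within the format's 3-bit mantissa precision.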

## Successor and roadmap context

NVIDIA announced [Blackwell](/wiki/nvidia_blackwell) at GTC 2024 in March 2024, four months after the H200 announcement.[^36] The B100 and B200 GPUs ramped through 2024 with 192 GB of HBM3e and roughly double the FP8 Tensor throughput of H200, plus native FP4 support that the Hopper Tensor Cores lack. The [GB200](/wiki/nvidia_gb200) NVL72 rack ties 36 Grace CPUs to 72 Blackwell GPUs over a single NVLink domain, presenting itself as a unified accelerator for trillion-parameter models.

At GTC 2025 (March 18, 2025), NVIDIA announced [Blackwell Ultra](/wiki/nvidia_blackwell) (the B300 and the GB300 NVL72), which lifts on-package memory to 288 GB of HBM3e at 8 TB/s, raises FP4 throughput to roughly 15 PFLOPS dense per GPU (1.5x GB200), and pushes total board power into the 1,400 W range.[^12][^30] B300 systems began shipping in the second half of 2025 and ramped through early 2026 in parallel with continued B200 production. NVIDIA also previewed the Vera Rubin generation at GTC 2025 for 2026 to 2027, with HBM4 memory.[^12]

In practice, the H200 has had a longer-than-expected useful life. Blackwell supply through 2024 and 2025 was constrained by a combination of HBM3e availability, advanced packaging (CoWoS-L) capacity at TSMC, and software readiness.[^7][^8] A lot of customers who ordered Blackwell in early 2024 ended up taking H200 capacity in the meantime, and the H200 remains the standard "big memory" Hopper SKU on most clouds well into 2026. Even after Blackwell and Blackwell Ultra volume ramps, the H200 retains a price-performance niche for inference workloads that fit comfortably in 141 GB and that do not need FP4.

For a longer view of the architectural lineage from Ampere through Rubin, see the [NVIDIA Hopper](/wiki/nvidia_hopper) and [NVIDIA Blackwell](/wiki/nvidia_blackwell) articles. The [NVIDIA H100](/wiki/nvidia_h100) page covers the predecessor in depth, and the [NVIDIA B200](/wiki/nvidia_b200), [NVIDIA B300](/wiki/nvidia_b300), and [NVIDIA GB200](/wiki/nvidia_gb200) pages describe the Blackwell-generation successors.

## See also

- [NVIDIA Hopper](/wiki/nvidia_hopper), [NVIDIA Blackwell](/wiki/nvidia_blackwell), [NVIDIA H100](/wiki/nvidia_h100), [NVIDIA A100](/wiki/nvidia_a100)
- [NVIDIA B100](/wiki/nvidia_b100), [NVIDIA B200](/wiki/nvidia_b200), [NVIDIA B300](/wiki/nvidia_b300), [NVIDIA GB200](/wiki/nvidia_gb200)
- [DGX H200](/wiki/dgx_h200), [HGX](/wiki/hgx)
- [HBM3](/wiki/hbm3), [HBM3e](/wiki/hbm3e), [NVLink](/wiki/nvlink), [PCIe](/wiki/pcie), [TSMC](/wiki/tsmc)
- [CUDA](/wiki/cuda), [cuDNN](/wiki/cudnn), [TensorRT-LLM](/wiki/tensorrt_llm), [Triton Inference Server](/wiki/triton_inference_server), [NCCL](/wiki/nccl), [PyTorch](/wiki/pytorch)
- [AMD Instinct MI300X](/wiki/amd_instinct_mi300x), [AMD Instinct MI325X](/wiki/amd_instinct_mi325x), [TPU](/wiki/tpu)
- [MLPerf](/wiki/mlperf), [Llama 2](/wiki/llama_2), [Llama 3](/wiki/llama_3), [FP8](/wiki/fp8), [BF16](/wiki/bf16), [TF32](/wiki/tf32), [FP16](/wiki/fp16)
- [CoreWeave](/wiki/coreweave), [Lambda](/wiki/lambda_labs), [AWS](/wiki/aws), [Azure](/wiki/azure)

## References

[^1]: NVIDIA. "NVIDIA H200 Tensor Core GPU" product page. https://www.nvidia.com/en-us/data-center/h200/

[^2]: NVIDIA Newsroom. "NVIDIA Supercharges Hopper, the World's Leading AI Computing Platform." November 13, 2023. https://nvidianews.nvidia.com/news/nvidia-supercharges-hopper-the-worlds-leading-ai-computing-platform

[^3]: NVIDIA. "NVIDIA H200 Tensor Core GPU Datasheet." https://resources.nvidia.com/en-us-data-center-overview-mc/en-us-data-center-overview/hpc-datasheet-sc23-h200

[^4]: Hyperbolic AI. "H200 vs H100 for LLM Inference: Single-GPU Serving for 70B Models." https://www.hyperbolic.ai/blog/h200-vs-h100

[^5]: Inside HPC. "NVIDIA Announces HGX H200 Systems and Cloud Instances." November 13, 2023. https://insidehpc.com/2023/11/nvidia-announces-hgx-h200-systems-and-cloud-instances/

[^6]: Tom's Hardware. "Nvidia Announces H200 GPU: 141GB of HBM3e and 4.8 TB/s Bandwidth." https://www.tomshardware.com/news/nvidia-h200-gpu-announced

[^7]: FinancialContent. "NVIDIA's Second Wind: H200 Supply Surge and Blackwell Backlog Fuel 2026 Momentum." December 31, 2025. https://markets.financialcontent.com/wral/article/marketminute-2025-12-31-nvidias-second-wind-h200-supply-surge-and-blackwell-backlog-fuel-2026-momentum

[^8]: FinancialContent. "Nvidia's Blackwell Dynasty: B200 and GB200 Sold Out Through Mid-2026 as Backlog Hits 3.6 Million Units." December 29, 2025. https://markets.financialcontent.com/wral/article/tokenring-2025-12-29-nvidias-blackwell-dynasty-b200-and-gb200-sold-out-through-mid-2026-as-backlog-hits-36-million-units

[^9]: NVIDIA Developer Blog. "NVIDIA Hopper Architecture In-Depth." https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/

[^10]: Glenn K. Lockwood. "NVIDIA H200" technical analysis. https://www.glennklockwood.com/garden/processors/H200

[^11]: NVIDIA Developer Blog. "NVIDIA H200 Tensor Core GPUs and NVIDIA TensorRT-LLM Set MLPerf LLM Inference Records." https://developer.nvidia.com/blog/nvidia-h200-tensor-core-gpus-and-nvidia-tensorrt-llm-set-mlperf-llm-inference-records/

[^12]: NVIDIA Newsroom. "NVIDIA Blackwell Ultra AI Factory Platform Paves Way for Age of AI Reasoning." GTC 2025, March 18, 2025. https://nvidianews.nvidia.com/news/nvidia-blackwell-ultra-ai-factory-platform-paves-way-for-age-of-ai-reasoning

[^13]: South China Morning Post. "US government approves Nvidia H200 chip exports to China." December 2025. https://www.scmp.com/news/china/diplomacy/article/3339797/us-government-approves-nvidia-h200-chip-exports-china

[^14]: AWS. "Amazon EC2 P5e Instances." https://aws.amazon.com/ec2/instance-types/p5/

[^15]: Microsoft Azure. "ND H200 v5-series virtual machines." https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/gpu-accelerated/ndh200v5-series

[^16]: Tom's Hardware. "Micron puts stackable 24GB HBM3E chips into volume production for Nvidia's next-gen H200 AI GPU." https://www.tomshardware.com/pc-components/ram/micron-puts-stackable-24gb-hbm3e-chips-into-volume-production-for-nvidias-next-gen-h200-ai-gpu

[^17]: NVIDIA Developer Blog. "NVIDIA Hopper Architecture In-Depth." https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/

[^18]: NVIDIA Developer Blog. "Deploying NVIDIA H200 NVL at Scale with New Enterprise Reference Architecture." https://developer.nvidia.com/blog/deploying-nvidia-h200-nvl-at-scale-with-new-enterprise-reference-architecture/

[^19]: PNY. "NVIDIA H200 NVL Datasheet." https://www.pny.com/file%20library/company/support/linecards/data-center-gpus/h200-nvl-datasheet.pdf

[^20]: NVIDIA Developer Blog. "NVIDIA GH200 Grace Hopper Superchip Delivers Outstanding Performance in MLPerf Inference v4.1." https://developer.nvidia.com/blog/nvidia-gh200-grace-hopper-superchip-delivers-outstanding-performance-in-mlperf-inference-v4-1/

[^21]: NVIDIA Developer Blog. "NVIDIA Blackwell Platform Sets New LLM Inference Records in MLPerf Inference v4.1." https://developer.nvidia.com/blog/nvidia-blackwell-platform-sets-new-llm-inference-records-in-mlperf-inference-v4-1/

[^22]: MLCommons. "Llama 2 70B: An MLPerf Inference Benchmark for Large Language Models." March 2024. https://mlcommons.org/2024/03/mlperf-llama2-70b/

[^23]: HPCwire. "MLPerf Inference v5.1 Results Land With New Benchmarks and Record Participation." September 10, 2025. https://www.hpcwire.com/2025/09/10/mlperf-inference-v5-1-results-land-with-new-benchmarks-and-record-participation/

[^24]: IntuitionLabs. "NVIDIA AI GPU Prices: H100 ($27K-$40K) & H200 ($315K/8-GPU) Cost Guide." https://intuitionlabs.ai/articles/nvidia-ai-gpu-pricing-guide

[^25]: JarvisLabs. "NVIDIA H200 Price Guide 2026: GPU Cost, Rental & Cloud Availability." https://jarvislabs.ai/blog/h200-price

[^26]: Spheron. "GPU Cloud Pricing 2026: H100 from $1.03/hr, B200 from $2.12/hr (15+ providers)." https://www.spheron.network/blog/gpu-cloud-pricing-comparison-2026/

[^27]: Silicon Data. "H100 Rental Price Over Time (2023 to 2025): A Complete Market Analysis." https://www.silicondata.com/blog/h100-rental-price-over-time

[^28]: Silicon Data. "B200 Index Price March 2026 Update: What the Charts, Price Bands Are Telling Us." https://www.silicondata.com/blog/b200-rental-price-march-2026-update

[^29]: GetDeploying. "B200 Cloud Pricing: Compare 22+ Providers (2026)." https://getdeploying.com/gpus/nvidia-b200

[^30]: NVIDIA Developer Blog. "Inside NVIDIA Blackwell Ultra: The Chip Powering the AI Factory Era." https://developer.nvidia.com/blog/inside-nvidia-blackwell-ultra-the-chip-powering-the-ai-factory-era/

[^31]: Data Center Dynamics. "OpenAI and Oracle to deploy 450,000 GB200 GPUs at Stargate data center in Abilene, Texas." https://www.datacenterdynamics.com/en/news/openai-and-oracle-to-deploy-450000-gb200-gpus-at-stargate-abilene-data-center/

[^32]: Crusoe. "Crusoe announces flagship Abilene data center is live." https://www.crusoe.ai/resources/newsroom/crusoe-announces-flagship-abilene-data-center-is-live

[^33]: Tom's Hardware. "Nvidia reportedly wins H200 exports to China; US Department of Commerce set to ease restrictions for full Hopper AI GPU." https://www.tomshardware.com/pc-components/gpus/nvidia-reportedly-wins-h200-exports-to-china-us-department-of-commerce-set-to-ease-restrictions-for-full-hopper-ai-gpu

[^34]: Tom's Hardware. "Trump says China is blocking Nvidia H200 purchases despite US approval." https://www.tomshardware.com/tech-industry/trump-says-china-is-blocking-h200-purchases

[^35]: Tom's Hardware. "Nvidia still hasn't sold a single H200 to China nearly three months after getting the green light from the White House." 2026. https://www.tomshardware.com/tech-industry/artificial-intelligence/nvidia-still-hasnt-sold-a-single-h200-to-china-nearly-three-months-after-getting-the-green-light-from-the-white-house-u-s-commerce-official-says-department-hasnt-approved-any-sales-during-a-house-hearing

[^36]: NVIDIA Newsroom. "NVIDIA Blackwell Platform Arrives to Power a New Era of Computing." March 18, 2024. https://nvidianews.nvidia.com/news/nvidia-blackwell-platform-arrives-to-power-a-new-era-of-computing

