NVIDIA H800
Last reviewed
Jun 3, 2026
Sources
12 citations
Review status
Source-backed
Revision
v1 · 2,188 words
The NVIDIA H800 is a data center graphics processing unit (GPU) that Nvidia designed specifically for the Chinese market as a regulatory-compliant variant of its flagship H100 accelerator. Built on the same Hopper architecture and the same GH100 silicon as the H100, the H800 was deliberately handicapped on certain capabilities, principally chip-to-chip interconnect bandwidth and double-precision (FP64) throughput, so that it would fall below the performance thresholds set by the United States export controls announced on October 7, 2022. The chip launched on March 21, 2023, and remained on the market for roughly seven months before a tightening of the same export regime on October 17, 2023 cut off its sale to China. The H800 is best known outside the semiconductor industry as the GPU that the Chinese startup DeepSeek used to train its DeepSeek-V3 model, a result that demonstrated how a frontier-class large language model could be produced under hardware constraints and at a fraction of the compute cost of comparable Western efforts. After the H800 was banned, Nvidia replaced it for the Chinese market with the further cut-down H20.
On October 7, 2022, the United States Department of Commerce, acting through its Bureau of Industry and Security (BIS), published sweeping new rules restricting the export to China of advanced computing chips, the equipment used to make them, and related supercomputing items. The rules were intended to slow China's ability to build the high-performance computing clusters used for artificial intelligence and military applications. Rather than naming individual products, the regulations defined controlled hardware by measurable technical parameters. Two parameters mattered most for AI accelerators: total compute performance and the chip-to-chip interconnect bandwidth, the rate at which one accelerator can exchange data with another. Nvidia's two leading data center GPUs at the time, the Ampere-generation A100 and the newly announced Hopper-generation H100, both exceeded the new limits and so could no longer be shipped to Chinese customers without a license that was unlikely to be granted.
Faced with the loss of a large and lucrative market, Nvidia engineered cut-down versions of both chips that were tuned to sit just under the regulatory ceiling. The Ampere-based A800 arrived first, in late 2022, followed by the Hopper-based H800 in March 2023. The interconnect threshold in the original 2022 rule was effectively set around the level of the A100, whose NVLink bandwidth was 600 GB/s. Nvidia therefore reduced the H800's NVLink bandwidth from the H100's 900 GB/s down to 400 GB/s, comfortably below that line, while keeping the low-precision tensor performance that matters most for training and running neural networks. The result was a chip that complied with the letter of the controls yet still delivered the bulk of the H100's value for AI workloads.
The H800 is not a different chip from the H100 in any architectural sense. Both are built from the same GH100 graphics processor on TSMC's custom 4N (5 nm class) process, carry the same fourth-generation Tensor Cores, and support the same low-precision data types, including the FP8 Transformer Engine that accelerates large language model training. The differences are confined to a small set of parameters that the export rules measured.
The most consequential change was to the NVLink interconnect, which Nvidia cut by more than half, from 900 GB/s to 400 GB/s. This does not affect the speed of a single GPU, but it slows the rate at which many GPUs in a server or cluster can exchange data and pool their work, a factor that becomes important at the scale of frontier model training. The second change was to double-precision (FP64) floating-point math: the H800's FP64 throughput was reduced to roughly 0.8 to 1 TFLOPS, a small fraction of the H100's figure (34 TFLOPS on the SXM module). This effectively removed the H800 from contention for traditional high-performance computing and scientific simulation, which depend on FP64, while leaving AI training and inference, which use FP16, BF16, and FP8, largely intact. The H800 came in both SXM5 and PCIe forms. The PCIe variant shipped with 80 GB of HBM2e memory at 2 TB/s, matching the H100 PCIe card, while the SXM5 variant carried 80 GB of HBM3 at roughly 3.35 TB/s, matching the H100 SXM module.
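To make the interconnect penalty concrete, the following minimal sketch estimates an idealized lower bound on the time to move a fixed payload between GPUs at the two NVLink bandwidths quoted above. The 10 GB payload and the function name `ideal_transfer_seconds` are illustrative assumptions, and the model ignores latency, protocol overhead, and topology, so real transfers would take longer on both chips.

```python
def ideal_transfer_seconds(payload_gb: float, bandwidth_gb_per_s: float) -> float:
    """Lower-bound transfer time: payload size divided by peak bandwidth."""
    if bandwidth_gb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return payload_gb / bandwidth_gb_per_s


# NVLink GPU-to-GPU bandwidths as quoted in the article (GB/s).
NVLINK_GB_PER_S = {"H100": 900.0, "H800": 400.0}

# Hypothetical payload, chosen only for illustration.
PAYLOAD_GB = 10.0

times = {
    gpu: ideal_transfer_seconds(PAYLOAD_GB, bandwidth)
    for gpu, bandwidth in NVLINK_GB_PER_S.items()
}

# Ratio of H800 to H100 transfer time; independent of the payload size.
slowdown = times["H800"] / times["H100"]

for gpu, seconds in times.items():
    print(f"{gpu}: {seconds * 1000:.1f} ms")
print(f"H800 / H100 transfer-time ratio: {slowdown:.2f}")
```

Under these assumptions the ratio depends only on the two bandwidths (900 / 400), so any communication-bound phase would take a little over twice as long on the H800, whatever the payload.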
The table below compares the H800 with the H100 it was derived from. Figures are for the SXM5 modules; the corresponding PCIe figures are given after the table.
| Specification | NVIDIA H100 (SXM5) | NVIDIA H800 (SXM5) |
|---|---|---|
| Architecture | Hopper (GH100) | Hopper (GH100) |
| Process | TSMC 4N (5 nm class) | TSMC 4N (5 nm class) |
| Launch | 2022 | March 21, 2023 |
| Target market | Global | China |
| GPU memory | 80 GB HBM3 | 80 GB HBM3 |
| Memory bandwidth | 3.35 TB/s | ~3.35 TB/s |
| NVLink (GPU-to-GPU) bandwidth | 900 GB/s | 400 GB/s |
| Peak FP64 | 34 TFLOPS | ~0.8 to 1 TFLOPS |
| Peak FP16 Tensor Core (with sparsity) | 1,979 TFLOPS | ~1,979 TFLOPS |
| Peak FP8 Tensor Core (with sparsity) | 3,958 TFLOPS | ~3,958 TFLOPS |
| Max power (TDP) | up to 700 W | up to 700 W |
For the PCIe variants, both the H100 PCIe and the H800 PCIe were specified at 80 GB HBM2e, 2 TB/s memory bandwidth, 1,513 TFLOPS FP16 Tensor Core and 3,026 TFLOPS FP8 Tensor Core (both with sparsity), and 350 W. The H800 PCIe again differed only in its reduced NVLink (400 GB/s) and its sharply lower FP64 (about 0.8 TFLOPS versus the H100 PCIe's 26 TFLOPS). In short, for the matrix-multiply-heavy, low-precision arithmetic that dominates deep learning, the H800 was essentially as fast as the H100; the penalties fell on inter-GPU communication and on double precision.
The H800's most influential application emerged in December 2024, when the Hangzhou-based AI company DeepSeek released its DeepSeek-V3 model along with a detailed technical report. According to that report, DeepSeek-V3, a Mixture-of-Experts language model with 671 billion total parameters (37 billion activated per token), was trained on a cluster of 2,048 NVIDIA H800 GPUs. Each node in the cluster contained 8 GPUs connected internally by NVLink and NVSwitch, while nodes were linked to one another over InfiniBand. The choice of hardware was a direct consequence of export controls: H800s were the most capable Nvidia GPUs that DeepSeek could legally obtain in China at the time the cluster was assembled.
DeepSeek reported that pre-training consumed 2.664 million H800 GPU hours, with a further 119,000 hours for context-length extension and 5,000 hours for post-training, for a total of about 2.788 million GPU hours. At an assumed rental price of 2 US dollars per GPU hour, the company put the headline training cost at roughly 5.576 million US dollars. DeepSeek was careful to note that this figure covered only the official training run and excluded prior research, ablations, and experiments, so it should not be read as the total cost of developing the model. Even with that caveat, the number was striking: it implied that a competitive frontier model could be trained for a tiny fraction of what comparable Western models were believed to cost.
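The arithmetic behind the headline figure can be checked in a few lines. The sketch below uses only the numbers quoted above; the wall-clock estimate at the end is a derived illustration rather than a figure from the report, and it assumes that all 2,048 GPUs ran continuously through every phase.

```python
# GPU-hour figures as quoted above from the DeepSeek-V3 technical report.
GPU_HOURS = {
    "pre_training": 2_664_000,
    "context_extension": 119_000,
    "post_training": 5_000,
}
RENTAL_USD_PER_GPU_HOUR = 2.0  # rental price assumed by DeepSeek
CLUSTER_GPUS = 2_048           # cluster size quoted above

total_gpu_hours = sum(GPU_HOURS.values())
headline_cost_usd = total_gpu_hours * RENTAL_USD_PER_GPU_HOUR

# Derived illustration (not from the report): wall-clock duration if every
# GPU in the cluster were busy for the whole run.
wall_clock_days = total_gpu_hours / CLUSTER_GPUS / 24

print(f"Total GPU hours: {total_gpu_hours:,}")        # 2,788,000
print(f"Headline cost:   ${headline_cost_usd:,.0f}")  # $5,576,000
print(f"Wall-clock time: {wall_clock_days:.1f} days") # 56.7 days
```

Because the cost is a simple product, it scales linearly with the assumed rental price: at a different hourly rate the headline figure changes in direct proportion.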
A central reason DeepSeek could work efficiently on H800s was that its engineers built their training stack around the chip's specific bottleneck. Because the H800's reduced NVLink bandwidth made heavy inter-GPU communication relatively expensive, DeepSeek used a custom framework (HAI-LLM) and a parallelization strategy combining 16-way pipeline parallelism, 64-way expert parallelism, and ZeRO-1 data parallelism, and it overlapped computation with communication so that the GPUs rarely sat idle waiting on the slower interconnect. The episode became a widely cited example of algorithmic and systems efficiency compensating for constrained hardware, and it intensified the policy debate over whether export controls were achieving their intended effect. When details of DeepSeek-V3 and the subsequent DeepSeek-R1 reasoning model spread in January 2025, they contributed to a sharp sell-off in Nvidia and other AI-related stocks, as investors reassessed assumptions about how much expensive hardware frontier AI actually requires.
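The benefit of overlapping computation with communication can be illustrated with a toy timing model. This is a simplified sketch, not DeepSeek's actual scheduler: the per-step times are hypothetical, communication time is assumed to scale inversely with NVLink bandwidth alone (ignoring the InfiniBand links between nodes), and the overlap is assumed to be perfect.

```python
def step_time_serial(compute_s: float, comm_s: float) -> float:
    """Communication starts only after computation has finished."""
    return compute_s + comm_s


def step_time_overlapped(compute_s: float, comm_s: float) -> float:
    """Idealized full overlap: the longer of the two phases sets the step time."""
    return max(compute_s, comm_s)


# Hypothetical per-step timings, chosen only for illustration.
COMPUTE_S = 1.0
COMM_S_AT_900 = 0.4                         # communication time at 900 GB/s
COMM_S_AT_400 = COMM_S_AT_900 * 900 / 400   # same traffic at 400 GB/s

serial = step_time_serial(COMPUTE_S, COMM_S_AT_400)
overlapped = step_time_overlapped(COMPUTE_S, COMM_S_AT_400)

print(f"Serial step time:     {serial:.2f} s")
print(f"Overlapped step time: {overlapped:.2f} s")
```

In this model the slower interconnect costs nothing as long as communication finishes within the compute window; once communication takes longer than computation, the step becomes communication-bound and the bandwidth cut shows up in full.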
The H800's commercial life was short. On October 17, 2023, BIS issued an interim final rule that revised and reinforced the October 2022 controls. The most important change for Nvidia was that the new rule removed interconnect bandwidth as a control parameter entirely, the very metric around which the H800 and A800 had been engineered. In its place, BIS adopted criteria based on total processing performance and a new measure of performance density (performance per unit of die area), together with a test of whether a chip was designed or marketed for use in a data center. Stripped of the interconnect loophole, the H800 once again exceeded the thresholds and became subject to licensing requirements that, in practice, blocked its export to China.
Nvidia confirmed the impact in a filing with the US Securities and Exchange Commission on October 17, 2023, stating that the new licensing requirements applied to a list of its products that included the A100, A800, H100, H800, L40, L40S, and the consumer-grade RTX 4090. The A800 and H800, the two chips Nvidia had created specifically to keep selling into China under the previous rules, were thus retired from that market less than a year after they appeared. The episode illustrated a recurring dynamic in which Nvidia designed a compliant chip, the government observed the design and revised the controls, and the chip was then caught by the updated rules.
With the H800 and A800 barred, Nvidia developed a new generation of China-specific accelerators that complied with the October 2023 criteria. The most important of these was the H20, a further cut-down Hopper part that Nvidia introduced for the Chinese market in early 2024 (announced in late 2023 alongside the Ada Lovelace-based L20 and L2). The H20 traded raw compute for memory: it carried 96 GB of HBM3 with about 4.0 TB/s of memory bandwidth, but its dense tensor performance was far below the H800's, at roughly 148 TFLOPS of FP16 and 296 TFLOPS of FP8, against about 990 and 1,979 TFLOPS for the H800 on the same dense (non-sparse) basis. Notably, because the 2023 rules no longer controlled interconnect bandwidth, Nvidia gave the H20 the full 900 GB/s of NVLink bandwidth that the H800 had been denied. The design reflected the new control regime: by slashing peak compute and performance density while preserving high memory capacity and bandwidth, the H20 was tuned for inference and for the new criteria rather than for the interconnect ceiling that had defined the H800.
The H20 itself later became a subject of further US policy action and on-again, off-again restrictions during 2025, and Nvidia's chief executive Jensen Huang indicated that any future China part would not be based on Hopper, since the architecture could not be modified further to satisfy tightening rules. The H800 thus sits as the middle entry in a three-step Hopper sequence: the H100 for the global market, the H800 as the 2023 export-compliant version for China, and the H20 as the 2024 replacement after the H800 was banned.
The H800 is significant on two levels. As a piece of hardware, it is a clear case study in how export controls based on measurable technical parameters can be answered by deliberate product engineering: Nvidia kept the AI-relevant performance of the H100 while trimming exactly the metrics the regulations measured. As a subject of AI history, it is inseparable from DeepSeek-V3, the model whose efficient training on 2,048 H800s became one of the most discussed results of the period and a recurring reference point in arguments about the effectiveness of chip export controls, the true cost of training frontier models, and the pace at which Chinese AI development could continue under restrictions. The rapid succession from H100 to H800 to H20 also documents the cat-and-mouse pattern that has come to characterize US-China technology policy, in which each compliant chip prompts a revision of the rules that created the need for it.