NVIDIA GH200 Grace Hopper Superchip
Last reviewed
Sources
18 citations
Review status
Source-backed
Revision
v2 · 2,280 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
Sources
18 citations
Review status
Source-backed
Revision
v2 · 2,280 words
Add missing citations, update stale details, or suggest a clearer explanation.
The NVIDIA GH200 Grace Hopper Superchip is a combined central processing unit and graphics processing unit module from NVIDIA that joins a 72-core Grace CPU with a Hopper-generation GPU on a single package, linked by a high-bandwidth coherent interconnect called NVLink-C2C. NVIDIA revealed the Grace Hopper design at its GTC conference in March 2022, announced full production at Computex in May 2023, and the chip reached broad system availability through 2024 [1][2][10]. The GH200 was built for large-scale artificial intelligence and high-performance computing (HPC) workloads where the GPU benefits from fast, coherent access to a large pool of CPU-attached memory. It is the predecessor of the Grace Blackwell GB200 generation.
The central idea of a "superchip" is to stop treating the CPU and GPU as separate devices connected over a comparatively slow PCI Express link, and instead wire them together tightly enough that the GPU can read and write the CPU's memory almost as if it were its own. That helps the largest models, whose parameters, activations, and key-value caches often exceed the GPU's onboard high-bandwidth memory. NVIDIA describes the GH200 as the first true heterogeneous accelerated platform for HPC and AI, meaning a single building block engineered so the CPU and GPU share one memory space rather than copying data back and forth [1][3].
The GH200 has two halves joined by NVLink-C2C. The Grace half is an NVIDIA-designed server CPU built from 72 Arm Neoverse V2 cores on the Armv9 instruction set, paired with up to 480 gigabytes (GB) of LPDDR5X memory with error-correction code (ECC) [1][7]. Each core carries 64 kilobytes (KB) of L1 instruction cache plus 64 KB of L1 data cache and 1 megabyte (MB) of L2 cache, backed by 114 MB of L3 cache shared across the chip. The cores run at a 3.1 gigahertz (GHz) base frequency, dropping to a 3.0 GHz all-core SIMD frequency under heavy vector load [1]. The LPDDR5X subsystem delivers up to 512 GB/s of memory bandwidth. NVIDIA positions Grace as the fastest Arm data-center CPU of its generation and claims roughly twice the performance per watt of conventional x86-64 platforms, with the LPDDR5X memory subsystem providing up to 53 percent more bandwidth than an eight-channel DDR5 design at one-eighth the power per gigabyte per second [1]. The CPU also exposes up to four PCIe x16 Gen5 links for networking and storage.
The Hopper half is an H100 Tensor Core GPU, NVIDIA's ninth-generation data-center GPU built on the Hopper architecture [1]. It carries fourth-generation Tensor Cores, a Transformer Engine that accelerates the matrix math behind transformer models, and Multi-Instance GPU (MIG) support for partitioning the GPU into isolated slices. The Hopper GPU in the GH200 delivers 34 teraFLOPS of FP64 and 67 teraFLOPS of FP64 via the Tensor Cores, 67 teraFLOPS of FP32, and, on the Tensor Cores, 989 teraFLOPS of TF32, 1,979 teraFLOPS of BFLOAT16 and FP16, and 3,958 teraFLOPS of FP8 (those last figures quoted with sparsity; dense throughput is half) [1]. INT8 reaches 3,958 TOPS with sparsity. These are the same compute capabilities as the standalone H100, since the GH200 uses an H100-class Hopper die; what changes is everything around it.
The whole module is a single superchip board with a programmable thermal design power (TDP) spanning 450 watts to 1,000 watts across the CPU, GPU, and memory combined [1]. That envelope lets system builders tune the chip for dense, power-constrained racks or for maximum performance, and the module supports both air cooling and liquid cooling.
| Component | Specification |
|---|---|
| CPU | NVIDIA Grace, 72 Arm Neoverse V2 (Armv9) cores |
| CPU caches | 64 KB L1 i + 64 KB L1 d per core, 1 MB L2 per core, 114 MB L3 |
| CPU clocks | 3.1 GHz base, 3.0 GHz all-core SIMD |
| CPU memory | Up to 480 GB LPDDR5X with ECC, up to 512 GB/s |
| GPU | Hopper H100 Tensor Core GPU |
| GPU memory | 96 GB HBM3 (up to 4 TB/s) or 144 GB HBM3e (up to 4.9 TB/s) |
| GPU compute | 34 TFLOPS FP64; 989 TFLOPS TF32, 3,958 TFLOPS FP8 (with sparsity) |
| CPU-to-GPU link | NVLink-C2C, 900 GB/s bidirectional (7x PCIe Gen5) |
| PCIe | Up to 4x PCIe x16 Gen5 |
| Fast-access memory | Up to 624 GB per superchip |
| TDP | Programmable 450 W to 1,000 W (CPU + GPU + memory) |
| Coherency | Unified, coherent address space across CPU and GPU memory |
| Successor | GB200 Grace Blackwell Superchip |
NVLink-C2C is the heart of the design. It is a memory-coherent, low-latency chip-to-chip interconnect that provides 900 GB/s of total bidirectional bandwidth, which NVIDIA puts at seven times the bandwidth of the PCIe Gen5 lanes used in conventional accelerated systems [1][8]. Built on fourth-generation NVLink, it lets the GPU reach peer memory with direct loads, stores, and atomic operations rather than the page-migration model that PCIe-attached accelerators rely on.
Coherence is the part that matters most for software. Because the CPU and GPU share one address space and NVLink-C2C keeps their caches consistent, CPU threads and GPU threads can concurrently and transparently read and write both CPU-resident and GPU-resident memory [1][3]. Developers transfer only the data they actually need instead of copying entire pages to and from the GPU, and they get lightweight synchronization through native atomics that work from either side. The practical effect is that an application can oversubscribe the GPU's HBM and spill into the much larger LPDDR5X pool at high bandwidth, so a working set that would not fit in GPU memory alone still runs without slow manual staging. NVIDIA pitches this as a way to let scientists and engineers focus on algorithms rather than explicit memory management.
The GH200 shipped in two GPU-memory variants. The original configuration pairs the Hopper GPU with 96 GB of HBM3 at up to 4 terabytes per second (TB/s) of bandwidth. NVIDIA then announced a second variant at SIGGRAPH on August 8, 2023, upgrading the GPU to 144 GB of faster HBM3e and raising GPU memory bandwidth to up to 4.9 TB/s [4][5]. NVIDIA described that part as the world's first HBM3e processor, with HBM3e running about 50 percent faster than the HBM3 it replaces [5]. Systems built on the HBM3e version were expected from leading manufacturers in the second quarter of calendar 2024, while the HBM3 version was already in full production [5][6].
The Grace side stays the same across both: up to 480 GB of LPDDR5X. Combined with the GPU's HBM, a single GH200 presents up to 624 GB of fast-access memory to an application through the coherent address space [1]. NVIDIA frames the advantage in terms of headroom: with up to 480 GB of LPDDR5X, the GPU has direct access to roughly seven times more fast memory than its HBM3 alone, or nearly eight times more than its HBM3e alone, depending on the configuration [1]. That is far beyond what a standalone H100 with 80 GB of HBM can offer, and it is the reason the GH200 is attractive for memory-bound workloads.
NVIDIA also offers a dual configuration. The dual-GH200 (sold in some reference designs as GH200 NVL2) fully connects two superchips over NVLink to present 144 Arm Neoverse cores, 288 GB of HBM3e, eight petaFLOPS of AI performance, and 1.2 TB of fast memory across the pair, with up to 10 TB/s of combined memory bandwidth [1][5].
Individual GH200 superchips can be combined into much larger machines using the NVIDIA NVLink Switch System. The GH200 NVL32 is a rack-scale platform that connects 32 superchips into a single NVLink domain so they behave as one large accelerator with 19.5 TB of unified memory [9][11]. It is built from 16 dual-GH200 server nodes, uses nine NVSwitch chips based on third-generation NVSwitch technology, fits the NVIDIA MGX modular chassis design, and is liquid cooled for density [11]. NVIDIA reported large gains for this configuration against H100 systems, including roughly 1.7 times faster GPT-3 training, about 2 times faster large-language-model inference on a GPT-530B class model, up to 7.9 times faster training for recommender systems with 10 TB embedding tables, and up to 5.8 times faster GraphSAGE graph-neural-network training [11].
The NVLink Switch System scales further than a single rack. A full fabric can connect up to 256 NVLink-connected GPUs, with all GPU threads able to access up to 144 TB of memory at high bandwidth [1]. NVIDIA packaged that 256-superchip topology as the DGX GH200 AI supercomputer, announced at Computex in May 2023 and aimed at memory-intensive AI such as large language models, recommender systems, and graph neural networks [2][10]. On the cloud side, AWS announced at re:Invent in November 2023 that it would be the first cloud provider to offer NVLink-connected GH200 superchips, delivered through NVIDIA DGX Cloud and Amazon EC2 instances [9][11].
The GH200 anchored several large public supercomputers. The most prominent is JUPITER at the Jülich Supercomputing Centre in Germany, Europe's first exascale system. Its Booster module uses roughly 24,000 GH200 superchips across about 6,000 compute nodes, with four GH200s per node, built on Eviden's BullSequana XH3000 platform with direct liquid cooling [12][13]. JUPITER exceeds one exaFLOP/s of FP64 performance on the High Performance Linpack benchmark, and in the May 2024 GREEN500 list its early partition ranked first for energy efficiency at more than 60 gigaFLOPS per watt [12][14].
In Switzerland, the Swiss National Supercomputing Centre (CSCS) built Alps on the HPE Cray EX254n platform, with 10,752 GH200 superchips arranged in quad-GH200 nodes; it was inaugurated in September 2024 and reached about 270 petaFLOPS in its initial configuration [15][16]. These machines illustrate the pattern NVIDIA designed the GH200 for: tight CPU-GPU coupling at the node level, scaled out across thousands of nodes with high-speed networking such as Slingshot or InfiniBand. System makers including Supermicro and GIGABYTE shipped GH200-based servers as part of NVIDIA's MGX family, broadening availability beyond the largest labs [6].
The GH200 marked NVIDIA's shift from selling GPUs that plug into third-party servers toward selling tightly integrated CPU-plus-GPU building blocks for AI infrastructure. The coherent memory model is especially useful for recommendation systems, graph analytics, and large language models with very large embeddings or key-value caches, where the working set does not fit in GPU memory alone and the alternative is slow paging over PCI Express. Because the GH200 runs the standard 64-bit Arm software ecosystem, the same containers, binaries, and operating systems that run on other Arm servers run on Grace Hopper without modification, alongside the full NVIDIA HPC, AI, and Omniverse software stacks [1][3].
The approach carried directly into the Grace Blackwell GB200, announced at GTC in March 2024. The GB200 replaces the single Hopper GPU with two Blackwell-architecture GPUs per Grace CPU, again linked by a 900 GB/s NVLink-C2C interconnect, and scales to the rack-level GB200 NVL72, which ties 36 Grace Blackwell Superchips (72 Blackwell GPUs and 36 Grace CPUs) into one NVLink domain over fifth-generation NVLink [17][18]. The lineage continues toward NVIDIA's later Vera Rubin platform. The GH200 also helped popularize the broader industry move toward Arm-based server CPUs co-designed with accelerators, a pattern visible in later host CPUs such as Google Axion and in the wider competition over rack-scale AI systems, including NVIDIA's own NVLink Fusion effort to open the interconnect to third-party silicon.
A limitation of the GH200 is that its coherent-memory advantage depends on software taking advantage of the unified address space, and the LPDDR5X pool, while large, is much slower than HBM, so performance depends on how well a workload's hottest data stays in HBM. The Hopper compute itself matches a standard H100, so the chip's appeal rests on memory capacity, coherence, and CPU-GPU bandwidth rather than raw GPU throughput.