GPU cluster
Last reviewed
Jun 9, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v3 · 4,087 words
A GPU cluster is a tightly coupled collection of servers, each populated with multiple graphics processing units, interconnected by high-bandwidth, low-latency networking and managed as a single computational pool for distributed training of large neural networks or high-throughput model inference. Modern AI clusters at frontier laboratories scale to tens or hundreds of thousands of NVIDIA H100 or H200 accelerators, organized into rail-optimized fabrics built on NVLink, InfiniBand, or RDMA-capable Ethernet. The shift from CPU-dominant high-performance computing to GPU-dominant AI computing accelerated after the publication of "Attention Is All You Need" in 2017, but the most consequential expansion began in 2022 with the release of the H100 and the emergence of generative AI as a commercial product category. By 2024 a single 100,000-GPU H100 training cluster drew roughly 150 megawatts of critical IT power[1], and by 2026 cluster planning has shifted to multi-gigawatt sites colocated with dedicated power generation. This article describes the hardware, networking, software, power, and reliability characteristics of contemporary AI GPU clusters, with reference to the largest production deployments at Meta, xAI, OpenAI, Microsoft, Anthropic, and Google.
GPU clustering in its modern form began with NVIDIA's DGX-1, announced on 6 April 2016, which packaged eight Tesla P100 Pascal GPUs interconnected through an NVLink mesh into a single 3U chassis selling for approximately $129,000[2]. NVIDIA CEO Jensen Huang personally delivered the first production DGX-1 to OpenAI in August 2016, and the system became a defining piece of equipment for the first generation of deep learning research labs[3]. A Volta-based DGX-1 with eight V100 GPUs followed in 2017, and the DGX-2 in 2018 introduced NVSwitch to connect sixteen V100 GPUs at full bandwidth within a single node.
The transition from single-node DGX systems to multi-node clusters was driven by the growth of language model parameter counts. The original Transformer in 2017 had 213 million parameters; GPT-3 in 2020 had 175 billion; GPT-4 in 2023 reportedly required tens of thousands of A100 GPUs. By the Hopper generation, cluster scale rather than per-chip performance had become the limiting factor for frontier model training. Microsoft disclosed in May 2020 that it had built a supercomputer for OpenAI with more than 285,000 CPU cores, 10,000 GPUs, and 400 Gbps of network connectivity to each GPU server[4]; in November 2023, the same partnership debuted "Eagle," a 14,400 H100 system running on Azure ND H100 v5 nodes that ranked #3 on the Top500 list with 561.2 petaflops of Linpack performance[5][6].
By 2024 the largest training clusters announced publicly used 16,384 or 24,576 H100 GPUs per training run, and Meta CEO Mark Zuckerberg disclosed in January 2024 that the company planned to operate compute equivalent to nearly 600,000 H100 GPUs by the end of that year[7]. The trajectory has continued upward: xAI's Colossus reached 100,000 H100 in July 2024, doubled to 200,000 in early 2025, and surpassed 555,000 mixed H100, H200, and GB200 GPUs at a 2 gigawatt site by late 2025[8].
A modern AI training cluster decomposes into several physical and logical units. The smallest unit is the node, typically an 8-GPU server such as an NVIDIA HGX H100 baseboard or Meta's Grand Teton platform. In NVIDIA's DGX SuperPOD reference architecture for H100, 32 nodes form a scalable unit (SU) of 256 GPUs. In newer GB200 NVL72 configurations the rack itself is the building block: a single liquid-cooled rack holds 18 compute trays containing 72 Blackwell GPUs and 36 Grace CPUs[9]. Hundreds of racks combine into a pod or island sharing a single high-bandwidth scale-out fabric, and multiple pods form a full cluster spanning one or more buildings on a single campus.
This hierarchy reflects bandwidth tapering. Inside a single node, eight GPUs communicate over fourth-generation NVLink at 900 GB/s of aggregate bidirectional bandwidth per GPU on the H100[10]. Across nodes inside the same scale-out island, a 400 Gbps InfiniBand NDR or 400 Gbps Spectrum-X Ethernet link per GPU provides 50 GB/s of unidirectional bandwidth per rail. Across islands, the bandwidth typically tapers by a factor of two to four to limit the optics budget. Across buildings, dedicated dark fiber may carry tens of terabits per second between two compute halls in the same campus, and synchronous training jobs are partitioned so that the least bandwidth-sensitive traffic crosses the slowest segment.
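The tapering is easiest to see once every figure is expressed in the same unit. The following minimal Python sketch converts the per-GPU link rates quoted in this section into gigabytes per second. The function and constant names are illustrative, and halving the NVLink figure assumes the 900 GB/s bidirectional number splits evenly between the two directions.

```python
# Illustrative arithmetic only; the constants are the figures quoted in the text.

NVLINK4_BIDIR_GB_S = 900.0   # per-GPU NVLink bandwidth on H100, both directions
SCALE_OUT_GBPS = 400.0       # per-GPU InfiniBand NDR or Spectrum-X link


def gbps_to_gb_per_s(gbps: float) -> float:
    """Convert gigabits per second to gigabytes per second (8 bits per byte)."""
    return gbps / 8.0


def tapered(bandwidth_gb_s: float, factor: float) -> float:
    """Bandwidth left per GPU after tapering by the given factor."""
    return bandwidth_gb_s / factor


rail_gb_s = gbps_to_gb_per_s(SCALE_OUT_GBPS)      # unidirectional, per rail
nvlink_one_way_gb_s = NVLINK4_BIDIR_GB_S / 2.0    # assumes a symmetric split

print(f"scale-out rail:      {rail_gb_s:.1f} GB/s")
print(f"NVLink, one way:     {nvlink_one_way_gb_s:.1f} GB/s")
print(f"NVLink / rail ratio: {nvlink_one_way_gb_s / rail_gb_s:.1f}x")
for factor in (2, 4):
    print(f"cross-island at {factor}:1 taper: {tapered(rail_gb_s, factor):.2f} GB/s")
```

Under these assumptions a GPU has roughly nine times more one-way bandwidth to its node-local peers than to the scale-out fabric, which is why the most communication-heavy traffic is usually kept inside the node.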
Rail-optimized topology is the dominant scale-out design for H100 and H200 clusters. Each of the eight GPUs in a server connects to a dedicated network interface card and from there to a dedicated "rail" of leaf switches, so that all GPU 0 ports across the cluster share one leaf-switch plane, all GPU 1 ports share another, and so on[11]. When two GPUs on different rails need to communicate, the sender first forwards the data over its server's intra-node NVSwitch to the local GPU on the destination's rail; the transfer then leaves through that rail's NIC and reaches the destination in one switch hop, rather than crossing multiple switches. The arrangement reduces hop counts for the all-reduce patterns common in synchronous data-parallel training while keeping the per-GPU radix manageable. An NVIDIA DGX SuperPOD with H100 uses Quantum-2 NDR InfiniBand switches at 400 Gb/s per port arranged as a three-tier fat tree, capable of scaling beyond 2,000 nodes (16,000 GPUs) while maintaining full bisection bandwidth at the intra-island level[12].
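The routing rule can be sketched as a small model. The Python below is a simplified illustration, not a description of any vendor's routing implementation: the `Gpu` class, the `route` function, and the hop labels are hypothetical, and the model assumes both servers attach to the same leaf switch on every rail.

```python
from dataclasses import dataclass
from typing import List

GPUS_PER_NODE = 8  # one rail per local GPU index, as described in the text


@dataclass(frozen=True)
class Gpu:
    node: int         # server index within the island
    local_index: int  # 0..7 position inside the server, which is also its rail


def route(src: Gpu, dst: Gpu) -> List[str]:
    """Return the hops a message takes under the simplified rail model."""
    if src == dst:
        return []
    if src.node == dst.node:
        # Same server: NVSwitch connects all local GPUs directly.
        return ["nvswitch"]
    hops = []
    if src.local_index != dst.local_index:
        # First move to the local GPU that sits on the destination's rail.
        hops.append("nvswitch")
    rail = dst.local_index
    hops.append(f"nic{rail}@node{src.node}")
    hops.append(f"leaf-rail{rail}")  # the single switch hop on that rail
    hops.append(f"nic{rail}@node{dst.node}")
    return hops


print(route(Gpu(node=0, local_index=3), Gpu(node=5, local_index=3)))
print(route(Gpu(node=0, local_index=1), Gpu(node=5, local_index=3)))
```

In both cases the path contains exactly one leaf switch; the cross-rail case only adds an intra-node NVSwitch hop at the source.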
The Clos or fat-tree network, first described by Charles Clos in 1953 and adapted for HPC by groups including Mellanox and Cisco in the 2000s, provides full or partial bisection bandwidth using commodity switches. Larger AI clusters often use a three-tier fat tree: leaf, spine, and super-spine. Dragonfly and dragonfly+ topologies, used in HPE Cray's Slingshot interconnect on the Frontier supercomputer at Oak Ridge National Laboratory, organize switches into local groups with global links to reduce optics cost at extreme scale[13]. For dedicated AI training, however, fat-tree variants remain dominant because most training collectives are bandwidth-bound rather than latency-bound.
NVLink is NVIDIA's proprietary point-to-point GPU interconnect. The fourth-generation NVLink in the H100 provides 18 links per GPU at 50 GB/s bidirectional per link, for 900 GB/s of aggregate bandwidth per GPU[10]. The NVSwitch chip routes traffic between GPUs in an all-to-all fashion within a node. In the GB200 NVL72 rack, fifth-generation NVLink doubles per-link bandwidth to 100 GB/s, giving 1.8 TB/s per GPU, and the rack-scale NVSwitch fabric connects all 72 Blackwell GPUs into a single NVLink domain[9].
InfiniBand is the dominant scale-out interconnect for AI training, marketed primarily by NVIDIA following its acquisition of Mellanox, announced in 2019 and completed in 2020. The Quantum-2 NDR generation, introduced in 2021, delivers 400 Gb/s per port with 64-port switches[12]. The Quantum-X800 XDR generation, announced at GTC 2024 alongside Blackwell, doubles per-port bandwidth to 800 Gb/s and supports denser switch radices appropriate for clusters beyond 100,000 GPUs. InfiniBand uses RDMA natively, allowing GPU memory to be read or written remotely without CPU involvement, and supports congestion control and adaptive routing tuned for collective operations.
RDMA over Converged Ethernet (RoCE) provides RDMA semantics on commodity Ethernet hardware. Meta chose RoCE for one of its two 24,576 H100 training clusters disclosed in March 2024, built with Arista 7800 switches and OCP-contributed Wedge400 and Minipack2 platforms, while building the other cluster on InfiniBand to compare the two approaches at full production scale[14]. Meta trained Llama 3 on the RoCE cluster without encountering network bottlenecks[14]. xAI's Colossus also uses Ethernet rather than InfiniBand: NVIDIA confirmed in October 2024 that the 100,000 H100 cluster runs on the NVIDIA Spectrum-X Ethernet networking platform, including SN5600 switches and BlueField-3 SuperNICs[15]. Spectrum-X SN5600 switches offer 128 ports of 400 Gb/s, twice the radix of Quantum-2 NDR InfiniBand, which can allow a fully connected 100,000 GPU cluster to be built in three tiers rather than four[1].
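The relationship between switch radix and tier count can be checked with the standard capacity formula for an idealized non-blocking fat tree, in which a network of t tiers built from k-port switches supports 2·(k/2)^t endpoints. The sketch below is a simplification under stated assumptions: one network port per GPU, no oversubscription, and no rail-specific wiring. The function names are illustrative.

```python
def max_endpoints(radix: int, tiers: int) -> int:
    """Endpoints supported by a non-blocking fat tree of `tiers` switch layers.

    Every switch below the top tier splits its ports evenly between
    downlinks and uplinks, which gives 2 * (radix / 2) ** tiers.
    """
    return 2 * (radix // 2) ** tiers


def min_tiers(radix: int, endpoints: int) -> int:
    """Smallest number of tiers that can connect `endpoints` ports."""
    tiers = 1
    while max_endpoints(radix, tiers) < endpoints:
        tiers += 1
    return tiers


for radix in (64, 128):
    print(
        f"radix {radix}: 2 tiers -> {max_endpoints(radix, 2):,} ports, "
        f"3 tiers -> {max_endpoints(radix, 3):,} ports, "
        f"tiers needed for 100,000 GPUs: {min_tiers(radix, 100_000)}"
    )
```

With 64-port switches three tiers top out at 65,536 endpoints, so 100,000 GPUs need a fourth tier; with 128-port switches three tiers reach 524,288 endpoints, which is consistent with the three-versus-four-tier comparison in the text.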
NVIDIA's Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) offloads collective operations such as all-reduce from CPUs and GPUs onto the InfiniBand network switches themselves. The switches perform aggregation directly, reducing the volume of data traversing the fabric and roughly doubling the effective all-reduce bandwidth versus a software-only implementation[16]. SHARPv2, introduced with HDR 200 Gb/s Quantum switches, added support for large-message reductions used in deep learning. SHARPv3 added multi-tenant support so multiple AI workloads can share the same in-network reduction infrastructure. The protocol is integrated with NVIDIA's NCCL collective communication library, where it can roughly halve the time spent on cross-node all-reduce for synchronous gradient updates.
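A rough traffic model shows where the factor of about two comes from. In a ring all-reduce each GPU sends 2(N-1)/N times the message size, whereas with in-switch aggregation each GPU sends its data toward the switch once and receives the reduced result once. The sketch below encodes only this simplified model; it ignores latency, protocol overhead, and switch capacity, and the function names are illustrative rather than part of SHARP or NCCL.

```python
def ring_bytes_sent_per_gpu(num_gpus: int, message_bytes: float) -> float:
    """Bytes each GPU transmits in a ring all-reduce (reduce-scatter + all-gather)."""
    return 2.0 * (num_gpus - 1) / num_gpus * message_bytes


def in_network_bytes_sent_per_gpu(message_bytes: float) -> float:
    """Bytes each GPU transmits when the switches perform the reduction."""
    return float(message_bytes)


def traffic_ratio(num_gpus: int) -> float:
    """How many times more data a GPU sends with the ring than with in-network reduction."""
    return ring_bytes_sent_per_gpu(num_gpus, 1.0) / in_network_bytes_sent_per_gpu(1.0)


for n in (8, 64, 1024):
    print(f"{n:5d} GPUs: ring sends {traffic_ratio(n):.3f}x the in-network volume")
```

The ratio approaches 2 as the number of GPUs grows, which matches the "roughly doubling" of effective all-reduce bandwidth described above.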
The NVIDIA Collective Communication Library (NCCL, pronounced "nickel") implements all-reduce, all-gather, reduce-scatter, broadcast, and point-to-point send and receive primitives optimized for NVIDIA GPUs over NVLink, NVSwitch, InfiniBand, and RoCE[17]. NCCL has been open source since 2015 and has become the de facto standard for distributed deep learning on NVIDIA hardware. The library auto-detects topology and selects ring, tree, or specialized algorithms (CollNet, NVLS) per operation; CollNet integrates with SHARP to offload reductions to network switches. Huawei's HCCL on Ascend NPUs and AMD's RCCL on Instinct GPUs are functional analogues. NCCL 2.27, released in 2025, added Symmetric Memory APIs and runtime connection establishment for fault recovery during large training runs[18].
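To make the ring algorithm concrete, the following pure-Python simulation runs the textbook ring all-reduce (a reduce-scatter phase followed by an all-gather phase) over in-memory lists. It is a teaching sketch, not NCCL's implementation: there is no real communication, the function name is illustrative, and it assumes the buffer length is divisible by the number of ranks.

```python
from typing import List


def ring_all_reduce(buffers: List[List[float]]) -> List[List[float]]:
    """Simulate a sum all-reduce over a ring; returns one result buffer per rank."""
    n = len(buffers)
    length = len(buffers[0])
    assert length % n == 0, "sketch assumes the buffer splits evenly into n chunks"
    chunk = length // n
    data = [list(b) for b in buffers]  # work on copies

    def span(c: int) -> range:
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1, reduce-scatter: after n-1 steps rank r holds the full sum of chunk (r+1) % n.
    for step in range(n - 1):
        # Snapshot what every rank sends so all sends in a step happen "simultaneously".
        sends = [((r - step) % n, None) for r in range(n)]
        sends = [(c, [data[r][i] for i in span(c)]) for r, (c, _) in enumerate(sends)]
        for r, (c, payload) in enumerate(sends):
            dst = (r + 1) % n
            for k, i in enumerate(span(c)):
                data[dst][i] += payload[k]

    # Phase 2, all-gather: circulate the finished chunks until every rank has all of them.
    for step in range(n - 1):
        sends = [((r + 1 - step) % n, None) for r in range(n)]
        sends = [(c, [data[r][i] for i in span(c)]) for r, (c, _) in enumerate(sends)]
        for r, (c, payload) in enumerate(sends):
            dst = (r + 1) % n
            for k, i in enumerate(span(c)):
                data[dst][i] = payload[k]

    return data


result = ring_all_reduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
print(result)
```

Each rank sends one chunk per step for 2(n-1) steps, so per-rank traffic stays close to twice the buffer size regardless of ring length, which is why the ring is bandwidth-efficient but latency grows with the number of ranks.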
Cluster orchestration determines how multi-thousand-GPU jobs are placed onto physical hardware. Slurm, originally developed at Lawrence Livermore National Laboratory, remains the most common scheduler for dedicated AI training clusters because of its mature support for gang scheduling, NUMA-aware placement, and integration with MPI. Kubernetes, with the NVIDIA GPU Operator and Volcano or Kueue extensions, is more common for inference clusters and mixed inference-and-training environments such as cloud providers. Meta uses an internal scheduler called MAST that incorporates network topology awareness, placing communicating training shards adjacent in the rail fabric to minimize cross-spine traffic[14]. Many frontier labs run Ray as a layer above Slurm or Kubernetes for reinforcement learning workloads that mix training and inference inside the same job.
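The two ideas in this paragraph, gang scheduling and topology-aware placement, can be illustrated with a toy placement routine. The sketch below is a hypothetical greedy heuristic and does not represent Slurm, Kubernetes, or Meta's MAST: it either places every node of a job or none of them, and it fills from the leaf switches with the most free nodes so the job spans as few switches as possible. All names in it are made up for the example.

```python
from typing import Dict, List, Optional


def place_job(
    free_nodes_by_leaf: Dict[str, List[str]], nodes_needed: int
) -> Optional[List[str]]:
    """Return the nodes chosen for a job, or None if it cannot start as a whole."""
    total_free = sum(len(nodes) for nodes in free_nodes_by_leaf.values())
    if total_free < nodes_needed:
        return None  # gang scheduling: never start with a partial allocation

    placement: List[str] = []
    # Largest pools first, so the job touches the fewest leaf switches;
    # the leaf name breaks ties to keep the result deterministic.
    ordered = sorted(free_nodes_by_leaf.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for _leaf, nodes in ordered:
        remaining = nodes_needed - len(placement)
        placement.extend(nodes[:remaining])
        if len(placement) == nodes_needed:
            break
    return placement


free = {
    "leaf-a": ["a1", "a2"],
    "leaf-b": ["b1", "b2", "b3", "b4"],
    "leaf-c": ["c1"],
}
print(place_job(free, 4))   # fits entirely under leaf-b
print(place_job(free, 5))   # spills from leaf-b into leaf-a
print(place_job(free, 8))   # not enough free nodes, so the job waits
```

A production scheduler would also weigh priorities, fragmentation, and rail alignment; the point here is only that placement is all-or-nothing and prefers network locality.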
The dominant training frameworks are PyTorch (used by Meta, OpenAI, xAI, and most academic labs) and JAX (used by Google and Anthropic). PyTorch integrates with NCCL through the torch.distributed package and offers FSDP (Fully Sharded Data Parallelism) and the older DDP (Distributed Data Parallel) as built-in primitives. Megatron-LM from NVIDIA, DeepSpeed from Microsoft, and Hugging Face Accelerate provide higher-level tensor parallelism, pipeline parallelism, and ZeRO sharding implementations.
In March 2024 Meta published "Building Meta's GenAI Infrastructure," disclosing two production clusters of 24,576 H100 GPUs each, configured for Llama 3 and subsequent generative AI training[14]. The two clusters share identical compute and storage hardware (Grand Teton open GPU servers contributed to the Open Compute Project, paired with a Tectonic-backed distributed storage layer and a Hammerspace parallel NFS interface) but differ in the scale-out fabric: one cluster uses RoCE with Arista 7800 switches, the other Quantum-2 InfiniBand. Both reach 400 Gbps per endpoint. Meta reported that out-of-the-box NCCL performance fell in the 10 to 90 percent range of theoretical peak and that achieving consistent 90 percent or better utilization required topology-aware job scheduling and NCCL tuning specific to the rail layout[14]. By the end of 2024, Meta aimed to operate 350,000 H100 GPUs with total compute equivalent to roughly 600,000 H100[14].
xAI's Colossus, located in a converted Electrolux factory in South Memphis, Tennessee, became operational on 22 July 2024 with 100,000 H100 GPUs after a construction sprint that xAI reported took 122 days from contract signing to first training job, against typical timelines of 18 to 24 months for greenfield data center construction[8]. The cluster was doubled to 200,000 GPUs (a mix of H100 and H200) within an additional three months and was used to train Grok 3[15]. NVIDIA confirmed in October 2024 that Colossus runs on Spectrum-X Ethernet rather than InfiniBand[15]. On 30 December 2025 xAI disclosed that it had purchased a third building near the Colossus 2 site, bringing total power capacity at the campus to roughly 2 gigawatts and deployed GPU count to about 555,000, with on-site natural gas generation supplementing grid power and a target of more than one million GPUs[8]. DDN supplied the storage for the system[19].
Microsoft disclosed in May 2020 that it had built a custom cluster for OpenAI with 10,000 GPUs and 400 Gbps networking to each GPU server, used to train GPT-3[4]. The November 2023 Top500 list placed "Eagle," a Microsoft Azure system with 14,400 H100 GPUs, Intel Xeon Sapphire Rapids CPUs, and Quantum-2 NDR InfiniBand, at #3 globally with 561.2 petaflops of Linpack performance[5][6]. Microsoft subsequently disclosed that it deployed the equivalent of five Eagle-class systems per month through 2024 to meet OpenAI training demand[20].
The Stargate Project was announced on 21 January 2025 at the White House by President Donald Trump alongside Sam Altman of OpenAI, Larry Ellison of Oracle, and Masayoshi Son of SoftBank[21]. The joint venture, with equity from SoftBank, OpenAI, Oracle, and MGX (each holding significant stakes, with SoftBank and OpenAI reported at 40 percent each), pledged to invest up to $500 billion in U.S. AI infrastructure over four years, with $100 billion to be deployed immediately[21]. Arm, Microsoft, NVIDIA, Oracle, and OpenAI are the initial technology partners[21]. The flagship site is the Abilene data center in Taylor County, Texas, where construction with Oracle Cloud Infrastructure was already under way before the announcement. In September 2025 OpenAI added five additional sites (Shackelford County, Texas; Doña Ana County, New Mexico; Lordstown, Ohio; Milam County, Texas; and an undisclosed Midwestern location), bringing planned capacity to nearly 7 gigawatts[22]. The first phase of the Abilene site came online in September 2025[22].
Anthropic does not operate its own data centers and instead trains and serves Claude on partner infrastructure. The company has multi-billion-dollar commitments with Amazon Web Services (using Trainium, Trainium2, and Trainium3 silicon in addition to NVIDIA GPUs) and with Google Cloud (using TPU Ironwood, Trillium, and earlier TPU generations). Amazon announced an investment of up to $4 billion in 2023 and expanded it to $8 billion in November 2024; the partnership was accompanied by the construction of dedicated Trainium2 capacity at AWS's Project Rainier in Indiana.
Google's TPU platform is the most prominent non-NVIDIA AI training architecture at scale. The TPU v5p pod, the production training generation through most of 2024, contained up to 8,960 chips interconnected by a 3D torus inter-chip interconnect at 4,800 Gbps per chip[23]. Trillium (TPU v6e) launched in late 2024 with 256-chip pod slices targeted at inference and small training jobs. TPU Ironwood (v7), announced on 9 April 2025, scales to 9,216 liquid-cooled chips per pod, with 192 GB of HBM3e and 9,600 Gbps of ICI bandwidth per chip, delivering a claimed 42.5 exaflops per pod[24]. Ironwood was Google's first TPU designed primarily for inference rather than training.
Power has become the binding constraint on cluster scale. A 100,000 H100 cluster draws approximately 150 megawatts of critical IT power, including the 700 watt per-GPU TDP plus an additional roughly 575 watts per GPU for CPUs, NICs, memory, storage, and conversion losses, for about 1,275 watts per GPU at the wall[1]. Such a cluster consumes roughly 1.59 terawatt hours per year, which at 2024 industrial electricity rates amounts to an annual electricity cost on the order of $124 million[1]. Multi-gigawatt sites announced through 2025 (xAI Colossus 2, Stargate Abilene, the planned Stargate sites) shift this further: a 2 gigawatt site consumes roughly 17.5 terawatt hours per year, comparable to the annual consumption of a small U.S. state.
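The energy figures follow from simple unit conversions, sketched below in Python. The code only reproduces arithmetic on numbers quoted in the text; it assumes continuous operation for all 8,760 hours of a non-leap year, and the function names are illustrative.

```python
HOURS_PER_YEAR = 8760  # 365 days, assuming the site runs continuously


def annual_energy_twh(power_mw: float) -> float:
    """Energy used in a year at constant power, in terawatt hours."""
    return power_mw * HOURS_PER_YEAR / 1e6  # MWh -> TWh


def implied_rate_usd_per_kwh(annual_cost_usd: float, energy_twh: float) -> float:
    """Electricity price implied by an annual cost and an annual energy figure."""
    return annual_cost_usd / (energy_twh * 1e9)  # TWh -> kWh


per_gpu_watts = 700 + 575  # GPU TDP plus the per-GPU share of everything else
print(f"per-GPU draw at the wall: {per_gpu_watts} W")
print(f"2 GW site: {annual_energy_twh(2000):.2f} TWh per year")
print(f"implied price: ${implied_rate_usd_per_kwh(124e6, 1.59):.3f} per kWh")
```

The implied price of just under eight cents per kilowatt hour is derived from the quoted cost and energy totals, not a figure stated by the source.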
To secure power at this scale, hyperscalers have signed bilateral power purchase agreements with nuclear operators. In September 2024 Microsoft and Constellation Energy signed a 20-year power purchase agreement to restart Unit 1 of the Three Mile Island Nuclear Generating Station in Pennsylvania (renamed the Crane Clean Energy Center), supplying 835 megawatts of carbon-free electricity to Microsoft's data centers beginning in 2028; Constellation will invest roughly $1.6 billion to refurbish the pressurized water reactor[25]. In March 2024 Amazon Web Services purchased the Cumulus Data campus from Talen Energy for $650 million and subsequently signed a 17-year, approximately $18 billion power purchase agreement for up to 1,920 megawatts from the 2.5 gigawatt Susquehanna nuclear plant, although the Federal Energy Regulatory Commission in November 2024 rejected the initial expanded interconnection service agreement, limiting near-term co-located load to 300 megawatts[26]. Behind-the-meter natural gas generation has become the fallback option for sites that cannot wait for nuclear refurbishment; xAI operates an on-site natural gas plant adjacent to the Colossus campus to supplement Memphis grid power[8].
The H100 at 700 watts can be cooled by direct expansion air cooling, although dense H100 racks at 50 to 80 kW per rack already push the limits of air-cooled hot-aisle containment. The Blackwell B200 in the GB200 NVL72 configuration cannot be air-cooled at production densities: each GPU dissipates up to 1,200 W, and a 72-GPU rack generates more than 120 kW of heat, requiring direct-to-chip liquid cooling with cold-plate inlet temperatures below 45 degrees Celsius and minimum coolant flow rates of 2 to 3 liters per minute per module[27]. The DGX B300 and GB300 NVL72 systems extend this to roughly 1,400 watts per GPU and require liquid cooling at the chassis or rack level. Hyperscalers are retrofitting facilities with rear-door heat exchangers, in-rack coolant distribution units, and chilled-water loops to accommodate the Blackwell and Vera Rubin generations.
H100-class training clusters experience component failures at rates that would be considered alarming in classical HPC. Meta's "The Llama 3 Herd of Models" paper, published in July 2024, reported that during 54 days of pre-training on a single 16,384-GPU H100 cluster, the system experienced 466 job interruptions: 47 from planned maintenance and 419 from unexpected hardware or software faults, averaging one unexpected interruption every three hours[28]. Of the unexpected interruptions, 58.7 percent were GPU related, with 148 caused by GPU failures (including NVLink failures) and 72 by HBM3 memory faults; network switch and cable issues accounted for 35 (8.4 percent)[28]. The SemiAnalysis estimate of mean time to first job failure for a brand new 100,000 H100 cluster without fault recovery is approximately 26.28 minutes, dominated by optical transceiver failures across the tens of thousands of cables in such a build[1].
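The interruption statistics can be reproduced with a few lines of arithmetic. The sketch below recomputes the mean time between unexpected interruptions and two of the category shares from the counts quoted above. The final function extrapolates to a larger cluster by assuming failures scale linearly with GPU count; that is a simplifying assumption for illustration, not a claim from either cited source.

```python
def mean_hours_between(days: float, interruptions: int) -> float:
    """Average wall-clock hours between interruptions over the run."""
    return days * 24 / interruptions


def share_percent(count: int, total: int) -> float:
    """Percentage of all unexpected interruptions in one category."""
    return 100.0 * count / total


def scaled_interval_hours(interval_hours: float, gpus_measured: int, gpus_target: int) -> float:
    """Interval expected at another cluster size if failures scale linearly with GPU count."""
    return interval_hours * gpus_measured / gpus_target


unexpected = 419
interval = mean_hours_between(54, unexpected)
print(f"mean time between unexpected interruptions: {interval:.2f} h")
print(f"network switch and cable share: {share_percent(35, unexpected):.1f}%")
print(f"HBM3 memory share: {share_percent(72, unexpected):.1f}%")

# Hypothetical extrapolation from 16,384 to 100,000 GPUs under the linear assumption.
minutes_at_100k = scaled_interval_hours(interval, 16_384, 100_000) * 60
print(f"extrapolated interval at 100,000 GPUs: {minutes_at_100k:.0f} min")
```

The extrapolated interval lands on the order of half an hour, the same order of magnitude as the SemiAnalysis estimate, although the two figures come from different methods and failure mixes.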
Synchronous training of a frontier model is fragile: a single GPU failure can stall the entire job, because all replicas must process the same step before any can proceed. The dominant mitigation is checkpointing, where the model state (weights, optimizer state, RNG seeds) is periodically serialized to fast storage. Checkpoint sizes for 100B-parameter or larger models reach hundreds of gigabytes to several terabytes, so checkpoint cadence is bounded by storage bandwidth. Asynchronous checkpointing, in-memory checkpointing across redundant GPU groups, and elastic training (where the job continues with a slightly smaller world size while the failed node is replaced) are active research areas. Meta reported that approximately 90 percent of unexpected Llama 3 interruptions were handled by automated recovery without operator intervention[28].
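The link between checkpoint size, storage bandwidth, and checkpoint cadence can be sketched numerically. Everything below is an assumption made for illustration: 2 bytes per parameter for bf16 weights alone, 14 bytes per parameter when fp32 master weights and two fp32 Adam moments are also saved, a hypothetical 100 GB/s of aggregate write bandwidth, and a three-hour mean time between failures. The interval uses Young's first-order approximation, sqrt(2 × write time × MTBF), which the source does not mention and which is introduced here only as a common rule of thumb.

```python
import math


def checkpoint_bytes(num_params: int, bytes_per_param: int) -> int:
    """Size of the serialized state for a given per-parameter footprint."""
    return num_params * bytes_per_param


def write_seconds(size_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Time to write a checkpoint at a sustained aggregate bandwidth."""
    return size_bytes / bandwidth_bytes_per_s


def young_interval_seconds(write_s: float, mtbf_s: float) -> float:
    """Young's approximation for the checkpoint interval that minimizes lost work."""
    return math.sqrt(2.0 * write_s * mtbf_s)


PARAMS = 100 * 10**9              # 100B-parameter model, as in the text
weights_only = checkpoint_bytes(PARAMS, 2)    # assumed bf16 weights
full_state = checkpoint_bytes(PARAMS, 14)     # assumed weights + fp32 master + Adam moments

BANDWIDTH = 100e9                 # hypothetical 100 GB/s aggregate write bandwidth
MTBF_S = 3 * 3600                 # assumed three hours between failures

t_write = write_seconds(full_state, BANDWIDTH)
print(f"weights only: {weights_only / 1e9:.0f} GB, full state: {full_state / 1e12:.1f} TB")
print(f"write time for full state: {t_write:.0f} s")
print(f"suggested interval: {young_interval_seconds(t_write, MTBF_S) / 60:.1f} min")
```

Because the suggested interval grows only with the square root of the write time, faster checkpoint storage or asynchronous checkpointing shortens the interval and reduces the work lost per failure.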
Operational practice at scale also includes pre-allocated "hot spare" nodes that can be swapped in within minutes, automated cable and transceiver health monitoring, and detection of silent data corruption (SDC) where a GPU returns numerically incorrect results without raising an error. SDC is rare per-GPU but non-negligible when integrated across hundreds of thousands of GPU-hours per training run, and frontier labs run periodic deterministic validation checks to detect it.
Training on a multi-trillion-token dataset requires sustained ingest bandwidth on the order of terabytes per second across the cluster. Modern AI clusters use a combination of object storage (S3 or equivalent) for cold storage, parallel file systems (DDN EXAScaler, IBM Spectrum Scale, VAST Data) for active datasets, and FUSE-based caching layers that present a POSIX-like interface backed by sharded object stores. Meta's clusters use a custom FUSE API atop a flash-optimized Tectonic backend combined with Hammerspace for parallel NFS[14]. xAI Colossus uses DDN as primary storage[19]. Checkpoint storage is typically separated from training data storage and provisioned with much higher write bandwidth.
Classical HPC clusters (climate modeling, computational fluid dynamics, lattice QCD) historically optimized for tightly coupled MPI workloads with point-to-point latency in the low microseconds and high node-count weak scaling. Frontier (Oak Ridge), the first U.S. exascale system commissioned in 2022, used 37,888 AMD Instinct MI250X GPUs on 9,408 HPE Cray EX nodes interconnected by Slingshot-11, delivering 1.102 Linpack exaflops in a 21.1 megawatt power envelope[13]. AI clusters share much of this hardware lineage (Slingshot, fat-tree variants, NVLink, MPI as a fallback) but optimize for different patterns: large bandwidth-bound collectives, dense linear algebra at low precision (FP8, FP4), and tolerance for asynchrony at the level of pipeline parallelism. The economic model also differs: HPC procurements traditionally pay for peak FLOP/s achieved on Linpack, whereas AI clusters optimize for cost per useful training token at production batch size and pipeline depth.