GPU cluster

Updated Jul 11, 2026

A GPU cluster is a tightly coupled collection of servers, each populated with multiple graphics processing units, interconnected by high-bandwidth low-latency networking and managed as a single computational pool for distributed training of large neural networks or high-throughput model inference. Modern AI clusters at frontier laboratories scale to tens or hundreds of thousands of NVIDIA H100 or H200 accelerators, organized into rail-optimized fabrics built on NVLink, InfiniBand, or RDMA-capable Ethernet. The shift from CPU-dominant high-performance computing to GPU-dominant AI computing accelerated after the publication of "Attention Is All You Need" in 2017, but the most consequential expansion began in 2022 with the release of the H100 and the emergence of generative AI as a commercial product category. By 2024 a single 100,000-GPU H100 training cluster drew roughly 150 megawatts of critical IT power^[1], and by 2026 cluster planning had shifted to multi-gigawatt sites colocated with dedicated power generation. As of 2026 the largest single site, xAI's Colossus in Memphis, operates roughly 555,000 GPUs at a campus drawing about 2 gigawatts, and OpenAI's Stargate build-out has committed to more than 9 gigawatts of planned United States capacity^[8]^[31]. This article describes the hardware, networking, software, power, and reliability characteristics of contemporary AI GPU clusters, with reference to the largest production deployments at Meta, xAI, OpenAI, Microsoft, Anthropic, and Google.

How did GPU clusters originate?

GPU clustering in its modern form began with NVIDIA's DGX-1, announced on 5 April 2016, which packaged eight Tesla P100 Pascal GPUs interconnected through an NVLink mesh into a single 3U chassis selling for approximately $129,000^[2]. NVIDIA CEO Jensen Huang personally delivered the first production DGX-1 to OpenAI in August 2016, and the system became a defining piece of equipment for the first generation of deep learning research labs^[3]. A Volta-based DGX-1 with eight V100 GPUs followed in 2017, and the DGX-2 in 2018 introduced NVSwitch to connect sixteen V100 GPUs at full bandwidth within a single node.

The transition from single-node DGX systems to multi-node clusters was driven by the growth of language model parameter counts. The original Transformer in 2017 had 213 million parameters; GPT-3 in 2020 had 175 billion; GPT-4 in 2023 reportedly required tens of thousands of A100 GPUs. By the Hopper generation, cluster scale rather than per-chip performance had become the limiting factor for frontier model training. Microsoft disclosed in May 2020 that it had built a supercomputer for OpenAI with more than 285,000 CPU cores, 10,000 GPUs, and 400 Gbps of network connectivity to each GPU server^[4]; in November 2023, the same partnership debuted "Eagle," a 14,400 H100 system running on Azure ND H100 v5 nodes that ranked #3 on the Top500 list with 561.2 petaflops of Linpack performance^[5]^[6].

By 2024 the largest training clusters announced publicly used 16,384 or 24,576 H100 GPUs per training run, and Meta CEO Mark Zuckerberg disclosed in January 2024 that the company planned to operate compute equivalent to nearly 600,000 H100 GPUs by the end of that year^[7]. The trajectory has continued upward: xAI's Colossus reached 100,000 H100 in July 2024, doubled to 200,000 in early 2025, and surpassed 555,000 mixed H100, H200, and GB200 GPUs at a 2 gigawatt site by late 2025^[8].

What is the physical layout of a GPU cluster?

A modern AI training cluster decomposes into several physical and logical units. The smallest unit is the node, typically an 8-GPU server such as an NVIDIA HGX H100 baseboard or Meta's Grand Teton platform. In NVIDIA's DGX SuperPOD reference architecture for the DGX H100, thirty-two nodes form a scalable unit (SU) of 256 GPUs, and a GB200 NVL72 configuration packs 18 liquid-cooled compute trays holding 72 Blackwell GPUs and 36 Grace CPUs into a single rack^[9]^[29]. Hundreds of racks combine into a pod or island sharing a single high-bandwidth scale-out fabric, and multiple pods form a full cluster spanning one or more buildings on a single campus.

This hierarchy reflects bandwidth tapering. Inside a single node, eight GPUs communicate over fourth-generation NVLink at 900 GB/s of aggregate bidirectional bandwidth per GPU on the H100^[10]. Across nodes inside the same scale-out island, a 400 Gbps InfiniBand NDR or 400 Gbps Spectrum-X Ethernet link per GPU provides 50 GB/s of unidirectional bandwidth per rail. Across islands, the bandwidth typically tapers by a factor of two to four to limit the optics budget. Across buildings, dedicated dark fiber may carry tens of terabits per second between two compute halls in the same campus, with synchronous training jobs partitioned across the slowest segment.
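To make the tapering concrete, the sketch below estimates how long a fixed payload takes to cross each tier, using the per-GPU figures quoted above. The 10 GB payload and the taper factor of 4 are illustrative assumptions (the text gives a range of two to four), and latency and protocol overhead are ignored.

```python
# Per-GPU unidirectional bandwidth at each tier, in GB/s.
# NVLink 4 is quoted as 900 GB/s bidirectional, so 450 GB/s in each direction.
NVLINK_GB_S = 900 / 2
RAIL_GB_S = 400 / 8                      # one 400 Gb/s link is 50 GB/s
TAPER = 4                                # assumed cross-island taper (text: 2 to 4)
CROSS_ISLAND_GB_S = RAIL_GB_S / TAPER


def transfer_seconds(payload_gb: float, bandwidth_gb_s: float) -> float:
    """Ideal transfer time, ignoring latency and protocol overhead."""
    return payload_gb / bandwidth_gb_s


PAYLOAD_GB = 10.0  # hypothetical payload size
for tier, bandwidth in [
    ("intra-node NVLink", NVLINK_GB_S),
    ("intra-island rail", RAIL_GB_S),
    ("cross-island", CROSS_ISLAND_GB_S),
]:
    millis = transfer_seconds(PAYLOAD_GB, bandwidth) * 1000
    print(f"{tier:18s} {bandwidth:6.1f} GB/s  {millis:7.1f} ms")
```

Under these assumptions the same payload takes roughly 36 times longer across islands than inside a node, which is why schedulers try to keep the most communication-heavy groups of GPUs inside the fastest tier.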

Rail-optimized fabrics

Rail-optimized topology is the dominant scale-out design for H100 and H200 clusters. Each of the eight GPUs in a server connects to a dedicated network interface card and from there to a dedicated "rail" of leaf switches, so that all GPU 0 ports across the cluster share one leaf-switch plane, all GPU 1 ports share another, and so on^[11]. Two GPUs on different rails can then communicate without crossing multiple switch tiers: the sender first forwards data over its server's intra-node NVSwitch to the local GPU attached to the destination's rail, and that GPU's NIC reaches the destination through the rail's leaf switch, typically in a single switch hop. The arrangement reduces hop counts for the all-reduce patterns common in synchronous data-parallel training while keeping the per-GPU radix manageable. An NVIDIA DGX SuperPOD with H100 uses Quantum-2 NDR InfiniBand switches at 400 Gb/s per port arranged as a three-tier fat tree, capable of scaling beyond 2,000 nodes (16,000 GPUs) while maintaining full bisection bandwidth at the intra-island level^[12].
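The routing idea can be sketched as a small path function. This is an illustrative model only: `rail_of` and `path` are hypothetical names, and the sketch assumes eight GPUs per node and that both nodes attach to the same leaf switch on every rail.

```python
GPUS_PER_NODE = 8  # matches the 8-GPU servers described above


def rail_of(gpu_index: int) -> int:
    """Each local GPU index maps to its own rail (leaf-switch plane)."""
    return gpu_index % GPUS_PER_NODE


def path(src: tuple[int, int], dst: tuple[int, int]) -> list[str]:
    """Hops from src=(node, gpu) to dst=(node, gpu).

    Assumes both nodes hang off the same leaf switch on each rail, so no
    spine traversal is needed.
    """
    (src_node, src_gpu), (dst_node, dst_gpu) = src, dst
    if src_node == dst_node:
        return ["NVSwitch"]  # traffic never leaves the node
    hops = []
    if rail_of(src_gpu) != rail_of(dst_gpu):
        # Forward inside the node to the GPU that sits on the destination rail.
        hops.append("NVSwitch")
    rail = rail_of(dst_gpu)
    hops += [f"NIC rail {rail}", f"leaf rail {rail}", f"NIC rail {rail}"]
    return hops


print(path((0, 3), (5, 3)))  # same rail: NIC, one leaf switch, NIC
print(path((0, 1), (5, 3)))  # different rails: NVSwitch hop first
```

In both cross-node cases the fabric portion of the path contains exactly one switch, which is the property the rail layout is designed to preserve.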

Fat-tree and dragonfly variants

The Clos or fat-tree network, first described by Charles Clos in 1953 and adapted for HPC by groups including Mellanox and Cisco in the 2000s, provides full or partial bisection bandwidth using commodity switches. Larger AI clusters often use a three-tier fat tree: leaf, spine, and super-spine. Dragonfly and dragonfly+ topologies, used in HPE Cray's Slingshot interconnect on the Frontier supercomputer at Oak Ridge National Laboratory, organize switches into local groups with global links to reduce optics cost at extreme scale^[13]. For dedicated AI training, however, fat-tree variants remain dominant because most training collectives are bandwidth-bound rather than latency-bound.

How are the GPUs interconnected?

NVLink and NVSwitch

NVLink is NVIDIA's proprietary point-to-point GPU interconnect. The fourth-generation NVLink in the H100 provides 18 links per GPU at 50 GB/s bidirectional per link, for 900 GB/s of aggregate bandwidth per GPU^[10]. The NVSwitch chip routes traffic between GPUs in an all-to-all fashion within a node. In the GB200 NVL72 rack, fifth-generation NVLink doubles per-link bandwidth to 100 GB/s, giving 1.8 TB/s per GPU, and the rack-scale NVSwitch fabric connects all 72 Blackwell GPUs into a single NVLink domain^[9].

InfiniBand

InfiniBand is the dominant scale-out interconnect for AI training, marketed primarily by NVIDIA following its acquisition of Mellanox, announced in 2019 and completed in 2020. The Quantum-2 NDR generation, introduced in 2021, delivers 400 Gb/s per port with 64-port switches^[12]. The Quantum-X800 XDR generation, announced at GTC 2024 alongside Blackwell, doubles per-port bandwidth to 800 Gb/s and supports denser switch radices appropriate for clusters beyond 100,000 GPUs. InfiniBand uses RDMA natively, allowing GPU memory to be read or written remotely without CPU involvement, and supports congestion control and adaptive routing tuned for collective operations.

RDMA over Converged Ethernet (RoCE)

RDMA over Converged Ethernet provides RDMA semantics on commodity Ethernet hardware. Meta chose RoCE for one of its two 24,576 H100 training clusters disclosed in March 2024, built with Arista 7800 switches and OCP-contributed Wedge400 and Minipack2 platforms, while building the other cluster on InfiniBand to compare the two approaches at full production scale^[14]. Meta trained Llama 3 on the RoCE cluster without encountering network bottlenecks^[14]. Meta's engineering team summarized the outcome this way: "Through careful co-design of the network, software, and model architectures, we have successfully used both RoCE and InfiniBand clusters for large, GenAI workloads without any network bottlenecks."^[14] xAI's Colossus also uses Ethernet rather than InfiniBand: NVIDIA confirmed in October 2024 that the 100,000 H100 cluster runs on the NVIDIA Spectrum-X Ethernet networking platform, including SN5600 switches and BlueField-3 SuperNICs^[15]. On Colossus, NVIDIA reported that Spectrum-X delivered 95 percent effective data throughput with zero application latency degradation^[15]. Spectrum-X SN5600 switches offer 128 ports of 400 Gb/s, twice the radix of Quantum-2 NDR InfiniBand, which can allow a fully connected 100,000 GPU cluster to be built in three tiers rather than four^[1].
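The radix argument can be checked with the standard capacity formula for a full-bisection fat tree, in which t tiers of k-port switches connect at most 2(k/2)^t endpoints. The sketch below is a simplified model, not a description of either vendor's reference design: the function names are illustrative, and real deployments may oversubscribe tiers or reserve ports.

```python
def max_endpoints(radix: int, tiers: int) -> int:
    """Endpoints supported by a full-bisection fat tree of `radix`-port switches."""
    return 2 * (radix // 2) ** tiers


def tiers_needed(radix: int, endpoints: int) -> int:
    """Smallest number of tiers whose capacity covers `endpoints`."""
    tiers = 1
    while max_endpoints(radix, tiers) < endpoints:
        tiers += 1
    return tiers


for radix in (64, 128):
    print(
        f"radix {radix}: 3-tier capacity {max_endpoints(radix, 3):,}, "
        f"tiers for 100,000 GPUs: {tiers_needed(radix, 100_000)}"
    )
```

With 64-port switches a three-tier tree tops out at 65,536 endpoints, so 100,000 GPUs need a fourth tier, while 128-port switches reach 524,288 endpoints in three tiers.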

SHARP in-network reduction

NVIDIA's Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) offloads collective operations such as all-reduce from CPUs and GPUs onto the InfiniBand network switches themselves. The switches perform aggregation directly, reducing the volume of data traversing the fabric and roughly doubling the effective all-reduce bandwidth versus a software-only implementation^[16]. SHARPv2, introduced with HDR 200 Gb/s Quantum switches, added support for large-message reductions used in deep learning. SHARPv3 added multi-tenant support so multiple AI workloads can share the same in-network reduction infrastructure. The protocol is integrated with NVIDIA's NCCL collective communication library, where it can roughly halve the time spent on cross-node all-reduce for synchronous gradient updates.
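A simplified traffic model shows where the rough factor of two comes from. In a ring all-reduce each GPU sends 2(N-1)/N times the buffer size, whereas with switch-side aggregation each GPU sends its buffer once and receives the reduced result once. The model below ignores latency, message-size effects, and switch resource limits, and its function names are illustrative.

```python
def ring_bytes_sent(buffer_bytes: float, num_gpus: int) -> float:
    """Bytes each GPU sends in a ring all-reduce (reduce-scatter + all-gather)."""
    return 2 * (num_gpus - 1) / num_gpus * buffer_bytes


def in_network_bytes_sent(buffer_bytes: float) -> float:
    """Bytes each GPU sends when the switches perform the reduction."""
    return buffer_bytes


BUFFER = 1e9  # hypothetical 1 GB gradient buffer
for n in (8, 64, 1024):
    ratio = ring_bytes_sent(BUFFER, n) / in_network_bytes_sent(BUFFER)
    print(f"{n:5d} GPUs: ring sends {ratio:.3f}x the in-network volume")
```

The ratio approaches 2 as the GPU count grows, which is consistent with the roughly doubled effective all-reduce bandwidth cited above.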

What software runs an AI GPU cluster?

NCCL and HCCL

The NVIDIA Collective Communication Library (NCCL, pronounced "nickel") implements all-reduce, all-gather, reduce-scatter, broadcast, and point-to-point send and receive primitives optimized for NVIDIA GPUs over NVLink, NVSwitch, InfiniBand, and RoCE^[17]. NCCL has been open source since 2015 and has become the de facto standard for distributed deep learning on NVIDIA hardware. The library auto-detects topology and selects ring, tree, or specialized algorithms (CollNet, NVLS) per operation; CollNet integrates with SHARP to offload reductions to network switches. Huawei's HCCL on Ascend NPUs and AMD's RCCL on Instinct GPUs are functional analogues. NCCL 2.27, released in 2025, added Symmetric Memory APIs and runtime connection establishment for fault recovery during large training runs^[18].
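The ring algorithm mentioned above can be illustrated with a single-process simulation. This is a teaching sketch in plain Python, not NCCL's implementation: `ring_all_reduce` is a hypothetical name, each rank's buffer is split into one element per rank for simplicity, and "sending" is just a list update.

```python
def ring_all_reduce(buffers: list[list[float]]) -> list[list[float]]:
    """Simulate a sum all-reduce over a ring; buffers[r] belongs to rank r."""
    n = len(buffers)
    assert all(len(b) == n for b in buffers), "one chunk per rank in this sketch"
    data = [list(b) for b in buffers]

    # Phase 1, reduce-scatter: after n-1 steps rank r holds the full sum
    # of chunk (r + 1) % n.
    for step in range(n - 1):
        # Collect all sends first so every rank uses pre-step values.
        sends = [(r, (r - step) % n, data[r][(r - step) % n]) for r in range(n)]
        for r, chunk, value in sends:
            data[(r + 1) % n][chunk] += value

    # Phase 2, all-gather: circulate the finished chunks around the ring.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n, data[r][(r + 1 - step) % n]) for r in range(n)]
        for r, chunk, value in sends:
            data[(r + 1) % n][chunk] = value

    return data


result = ring_all_reduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
print(result)  # every rank ends with [111, 222, 333]
```

Each rank sends one chunk per step for 2(n-1) steps, so per-rank traffic stays near twice the buffer size regardless of ring length, which is why the pattern suits bandwidth-bound gradient reductions.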

Orchestration: Slurm, Kubernetes, and custom schedulers

Cluster orchestration determines how multi-thousand-GPU jobs are placed onto physical hardware. Slurm, originally developed at Lawrence Livermore National Laboratory, remains the most common scheduler for dedicated AI training clusters because of its mature support for gang scheduling, NUMA-aware placement, and integration with MPI. Kubernetes, with the NVIDIA GPU Operator and Volcano or Kueue extensions, is more common for inference clusters and for mixed inference-and-training environments such as those operated by cloud providers. Meta uses an internal scheduler called MAST that incorporates network topology awareness, placing communicating training shards adjacent in the rail fabric to minimize cross-spine traffic^[14]. Many frontier labs run Ray as a layer above Slurm or Kubernetes for reinforcement learning workloads that mix training and inference inside the same job.

Frameworks: PyTorch, JAX, and parallelism libraries

The dominant training frameworks are PyTorch (used by Meta, OpenAI, xAI, and most academic labs) and JAX (used by Google and Anthropic). PyTorch integrates with NCCL through the torch.distributed package and offers FSDP (Fully Sharded Data Parallel) and the older DDP (DistributedDataParallel) as built-in primitives. Megatron-LM from NVIDIA and DeepSpeed from Microsoft provide tensor parallelism, pipeline parallelism, and ZeRO sharding implementations, and Hugging Face Accelerate offers a higher-level wrapper over these back ends.
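The practical difference between replicated and fully sharded data parallelism is how much model state each GPU must hold. The sketch below uses a commonly cited accounting for mixed-precision Adam of 16 bytes per parameter (2 for half-precision weights, 2 for gradients, 12 for full-precision weights and two optimizer moments). That figure, the 70-billion-parameter model, and the function name are assumptions for illustration, and activation memory is left out.

```python
BYTES_PER_PARAM = 2 + 2 + 12  # weights + gradients + optimizer state (assumed)


def state_bytes_per_gpu(params: int, num_gpus: int, sharded: bool) -> float:
    """Model-state bytes held by each GPU, excluding activations."""
    total = params * BYTES_PER_PARAM
    # DDP-style replication keeps a full copy on every GPU; full sharding
    # (FSDP or ZeRO stage 3) splits the state evenly across GPUs.
    return total / num_gpus if sharded else float(total)


PARAMS = 70_000_000_000  # hypothetical 70B-parameter model
for gpus in (8, 1024):
    replicated = state_bytes_per_gpu(PARAMS, gpus, sharded=False) / 1e9
    sharded = state_bytes_per_gpu(PARAMS, gpus, sharded=True) / 1e9
    print(f"{gpus:5d} GPUs: replicated {replicated:,.0f} GB, sharded {sharded:,.2f} GB per GPU")
```

Under these assumptions the replicated state (about 1,120 GB) cannot fit on any single accelerator, whereas sharding across 1,024 GPUs leaves roughly 1 GB of model state per device, at the cost of extra all-gather and reduce-scatter traffic.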

What are the largest GPU clusters in 2026?

The table below summarizes the largest publicly documented AI GPU clusters, followed by a detailed profile of each.

Cluster	GPUs	Interconnect	Scale and status
Meta GenAI (two clusters)	24,576 H100 each	RoCE (Arista) and Quantum-2 InfiniBand	400 Gbps per GPU; trained Llama 3 (2024) ^[14]
Microsoft Azure "Eagle"	14,400 H100	Quantum-2 NDR InfiniBand	561.2 PFLOP/s Linpack, #3 on the Nov 2023 Top500 ^[5]^[6]
xAI Colossus (Memphis)	~555,000 (H100/H200/GB200)	Spectrum-X Ethernet	~2 GW campus; roadmap beyond 1M GPUs (late 2025) ^[8]^[15]
Stargate Abilene (OpenAI/Oracle)	~200,000 Blackwell (4 of 8 buildings)	NVLink5 (GB200 NVL72)	~0.3 GW live, ~1.2 GW target by Q4 2026 ^[22]^[31]
Google Ironwood (TPU v7) pod	9,216 TPUs	3D torus ICI, 9,600 Gbps per chip	42.5 EFLOP/s per pod (2025) ^[24]

Meta's 24k H100 clusters

In March 2024 Meta published "Building Meta's GenAI Infrastructure," disclosing two production clusters of 24,576 H100 GPUs each, configured for Llama 3 and subsequent generative AI training^[14]. The two clusters share identical compute and storage hardware (Grand Teton open GPU servers contributed to the Open Compute Project, paired with a Tectonic-backed distributed storage layer and a Hammerspace parallel NFS interface) but differ in the scale-out fabric: one cluster uses RoCE with Arista 7800 switches, the other Quantum-2 InfiniBand. Both reach 400 Gbps per endpoint. Meta reported that out-of-the-box NCCL performance fell in the 10 to 90 percent range of theoretical peak and that achieving consistent 90 percent or better utilization required topology-aware job scheduling and NCCL tuning specific to the rail layout^[14]. By the end of 2024, Meta aimed to operate 350,000 H100 GPUs with total compute equivalent to roughly 600,000 H100^[14].

xAI Colossus

xAI's Colossus, located in a converted Electrolux factory in South Memphis, Tennessee, became operational on 22 July 2024 with 100,000 H100 GPUs after a construction sprint that xAI reported took 122 days from contract signing to first training job, against typical timelines of 18 to 24 months for greenfield data center construction^[8]. The cluster was doubled to 200,000 GPUs (a mix of H100 and H200) within an additional three months and was used to train Grok 3^[15]. NVIDIA confirmed in October 2024 that Colossus runs on Spectrum-X Ethernet rather than InfiniBand^[15]. On 30 December 2025 xAI disclosed that it had purchased a third building near the Colossus 2 site, bringing total power capacity at the campus to roughly 2 gigawatts and deployed GPU count to about 555,000, with on-site natural gas generation supplementing grid power and a target of more than one million GPUs^[8]. DDN supplied the storage for the system^[19].

Microsoft Azure Eagle and the OpenAI partnership

Microsoft disclosed in May 2020 that it had built a custom cluster for OpenAI with 10,000 GPUs and 400 Gbps networking to each GPU server, used to train GPT-3^[4]. The November 2023 Top500 list placed "Eagle," a Microsoft Azure system with 14,400 H100 GPUs, Intel Xeon Sapphire Rapids CPUs, and Quantum-2 NDR InfiniBand, at #3 globally with 561.2 petaflops of Linpack performance^[5]^[6]. Microsoft subsequently disclosed that it deployed the equivalent of five Eagle-class systems per month through 2024 to meet OpenAI training demand^[20].

Stargate

The Stargate Project was announced on 21 January 2025 at the White House by President Donald Trump alongside Sam Altman of OpenAI, Larry Ellison of Oracle, and Masayoshi Son of SoftBank^[21]. The joint venture, with equity from SoftBank, OpenAI, Oracle, and MGX (each holding significant stakes, with SoftBank and OpenAI reported at 40 percent each), pledged to invest up to $500 billion in U.S. AI infrastructure over four years, with $100 billion to be deployed immediately^[21]. Arm, Microsoft, NVIDIA, Oracle, and OpenAI are the initial technology partners^[21]. The flagship site is the Abilene data center in Taylor County, Texas, where construction with Oracle Cloud Infrastructure was already under way before the announcement. In September 2025 OpenAI added five additional sites (Shackelford County, Texas; Doña Ana County, New Mexico; Lordstown, Ohio; Milam County, Texas; and an undisclosed Midwestern location), bringing planned capacity to nearly 7 gigawatts^[22]. The first phase of the Abilene site came online in September 2025^[22].

As of mid-2026, Epoch AI estimated that the Abilene site was operating at roughly 0.3 gigawatts, about 250,000 H100-equivalents, with four of eight buildings live and each completed building housing around 50,000 Blackwell GPUs; the site targets roughly 1.2 gigawatts, on the order of one million H100-equivalents, by the fourth quarter of 2026^[31]. OpenAI reversed an earlier plan to expand Abilene to 2.1 gigawatts, redirecting that capacity to other sites, while the overall Stargate program targets more than 9 gigawatts of planned capacity across the United States by 2029^[31].

Anthropic's compute partnerships

Anthropic does not operate its own data centers and instead trains and serves Claude on partner infrastructure. The company has multi-billion-dollar commitments with Amazon Web Services (using Trainium, Trainium2, and Trainium3 silicon in addition to NVIDIA GPUs) and with Google Cloud (using TPU Ironwood, Trillium, and earlier TPU generations). Amazon announced an investment of up to $4 billion in 2023 and expanded it to $8 billion in November 2024, alongside the construction of dedicated Trainium2 capacity at the AWS Rainier project in Indiana.

Google's TPU pods

Google's TPU platform is the most prominent non-NVIDIA AI training architecture at scale. The TPU v5p pod, the production training generation through most of 2024, contained up to 8,960 chips interconnected by a 3D torus inter-chip interconnect at 4,800 Gbps per chip^[23]. Trillium (TPU v6e) launched in late 2024 with 256-chip pod slices targeted at inference and small training jobs. TPU Ironwood (v7), announced on 9 April 2025, scales to 9,216 liquid-cooled chips per pod, with 192 GB of HBM3e and 9,600 Gbps of ICI bandwidth per chip, delivering a claimed 42.5 exaflops per pod^[24]. Ironwood was Google's first TPU designed primarily for inference rather than training.

How much power and cooling does a GPU cluster need?

Power scale and grid impact

Power has become the binding constraint on cluster scale. A 100,000 H100 cluster draws approximately 150 megawatts of critical IT power^[1]. Each GPU has a 700 watt TDP, and CPUs, NICs, memory, storage, and conversion losses add roughly 575 watts per GPU, for about 1,275 watts per GPU at the server level^[1]. That is 127.5 megawatts across 100,000 GPUs; the rest of the roughly 150 megawatt figure is attributable to equipment outside the GPU servers, such as network switches and shared storage. Over a year such a cluster consumes roughly 1.59 terawatt hours, an electricity cost on the order of $124 million at 2024 industrial rates, which implies a price of about $0.078 per kilowatt-hour^[1]. Multi-gigawatt sites announced through 2025 (xAI Colossus 2, Stargate Abilene, the planned Stargate sites) shift this further: a site running continuously at 2 gigawatts consumes roughly 17.5 terawatt hours per year, comparable to a small U.S. state.
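The arithmetic behind these figures can be reproduced in a few lines. The sketch below uses only the numbers quoted above and assumes constant load for a full 8,760-hour year; the variable and function names are illustrative.

```python
HOURS_PER_YEAR = 8760


def annual_twh(load_mw: float) -> float:
    """Energy in TWh for a constant load held for one year."""
    return load_mw * HOURS_PER_YEAR / 1e6


gpus = 100_000
watts_per_gpu = 700 + 575  # GPU TDP plus per-GPU share of the server
server_load_mw = gpus * watts_per_gpu / 1e6

# Electricity price implied by the cited annual cost and consumption.
implied_rate_usd_per_kwh = 124e6 / 1.59e9

print(f"GPU servers alone:     {server_load_mw:.1f} MW")
print(f"150 MW for one year:   {annual_twh(150):.2f} TWh")
print(f"2 GW for one year:     {annual_twh(2000):.2f} TWh")
print(f"Implied price:         ${implied_rate_usd_per_kwh:.3f}/kWh")
```

A constant 150 MW works out to about 1.31 TWh per year, so the cited 1.59 TWh is higher than the IT load alone would give. One plausible reading, offered here as an assumption rather than something the source text states, is that the larger figure includes facility overhead such as cooling.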

Nuclear and dedicated generation deals

To secure power at this scale, hyperscalers have signed bilateral power purchase agreements with nuclear operators. In September 2024 Microsoft and Constellation Energy signed a 20-year power purchase agreement to restart Unit 1 of the Three Mile Island Nuclear Generating Station in Pennsylvania (renamed the Crane Clean Energy Center), supplying 835 megawatts of carbon-free electricity to Microsoft's data centers beginning in 2028; Constellation will invest roughly $1.6 billion to refurbish the pressurized water reactor^[25]. In March 2024 Amazon Web Services purchased the Cumulus Data campus from Talen Energy for $650 million and subsequently signed a 17-year, approximately $18 billion power purchase agreement for up to 1,920 megawatts from the 2.5 gigawatt Susquehanna nuclear plant, although the Federal Energy Regulatory Commission in November 2024 rejected the initial expanded interconnection service agreement, limiting near-term co-located load to 300 megawatts^[26]. Behind-the-meter natural gas generation has become the fastest option for sites that cannot wait for nuclear refurbishment; xAI operates an on-site natural gas plant adjacent to the Colossus campus to supplement Memphis grid power^[8].

Air versus liquid cooling

The H100 at 700 watts can be cooled by direct expansion air cooling, although dense H100 racks at 50 to 80 kW per rack already push the limits of air-cooled hot-aisle containment. The Blackwell B200 in the GB200 NVL72 configuration cannot be air-cooled at production densities: each B200 GPU draws up to 1,200 W, and a GB200 superchip, which pairs two B200 GPUs with one Grace CPU, draws up to 2,700 W^[27]^[29]. A fully populated GB200 NVL72 rack holds 72 B200 GPUs and 36 Grace CPUs across 18 compute trays and draws about 120 kW, far beyond what air cooling can remove, so the design requires direct-to-chip liquid cooling. In one reference configuration coolant enters the rack at roughly 25 degrees Celsius at about two liters per second and leaves around 20 degrees warmer^[29]. The DGX B300 and GB300 NVL72 systems built on the B300 Blackwell Ultra GPU push per-GPU power to roughly 1,400 watts, carry 288 GB of HBM3e per GPU, and draw about 120 kW per rack, all of which require liquid cooling at the chassis or rack level^[32]. Hyperscalers are retrofitting facilities with rear-door heat exchangers, in-rack coolant distribution units, and chilled-water loops to accommodate the Blackwell and Vera Rubin generations.
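The coolant figures can be sanity-checked with the steady-state heat balance Q = m_dot * c_p * delta_T. The sketch below assumes the coolant behaves like plain water (1 kg per liter, 4,186 J per kg per kelvin), which is an assumption for illustration: production loops often use glycol mixtures with lower heat capacity. Function names are hypothetical.

```python
WATER_DENSITY_KG_PER_L = 1.0       # assumed: plain water
WATER_CP_J_PER_KG_K = 4186.0       # assumed: specific heat of water


def heat_removed_kw(flow_l_per_s: float, delta_t_k: float) -> float:
    """Heat carried away by the coolant at a given flow and temperature rise."""
    mass_flow = flow_l_per_s * WATER_DENSITY_KG_PER_L
    return mass_flow * WATER_CP_J_PER_KG_K * delta_t_k / 1000.0


def temperature_rise_k(heat_kw: float, flow_l_per_s: float) -> float:
    """Coolant temperature rise needed to carry `heat_kw` at a given flow."""
    mass_flow = flow_l_per_s * WATER_DENSITY_KG_PER_L
    return heat_kw * 1000.0 / (mass_flow * WATER_CP_J_PER_KG_K)


print(f"2 L/s with a 20 K rise removes {heat_removed_kw(2.0, 20.0):.1f} kW")
print(f"120 kW at 2 L/s needs a rise of {temperature_rise_k(120.0, 2.0):.1f} K")
```

With water-like properties, two liters per second and a 20 degree rise can carry about 167 kW, so the quoted flow and temperature figures are of the right order for a rack drawing about 120 kW.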

The table below summarizes per-accelerator power and cooling across recent NVIDIA generations.

Accelerator	Peak GPU power	Rack-scale system	Approx. rack power	Cooling
H100 (Hopper, 2022)	700 W	HGX/DGX H100 (8 GPU)	50-80 kW	Air or liquid ^[27]
B200 (Blackwell, 2024)	up to 1,200 W	GB200 NVL72 (72 GPU)	~120 kW	Direct-to-chip liquid ^[27]^[29]
GB200 superchip (Grace + 2x B200)	up to 2,700 W	36 per GB200 NVL72 rack	~120 kW	Direct-to-chip liquid ^[29]
B300 (Blackwell Ultra, 2025)	up to 1,400 W	GB300 NVL72 (72 GPU)	~120 kW	Liquid required ^[32]
Rubin (2026)	not yet public	Vera Rubin NVL144	not yet public	Liquid ^[30]

What comes after Blackwell?

The per-GPU power climb continues into NVIDIA's next platform, Vera Rubin. NVIDIA announced Rubin CPX, a GPU it built for what it calls massive-context inference, on 9 September 2025, and positioned the rack-scale Vera Rubin NVL144 CPX system at 8 exaflops of AI compute, 100 TB of fast memory, and 1.7 petabytes per second of aggregate memory bandwidth in a single rack, with availability targeted for the end of 2026^[30]. "Rubin CPX is the first CUDA GPU purpose-built for massive-context AI, where models reason across millions of tokens of knowledge at once," said NVIDIA CEO Jensen Huang^[30]. Rack-scale Rubin systems are liquid-cooled from the outset, extending the direct-to-chip and warm-water cooling practices established for Blackwell into the next generation.

How often do large GPU clusters fail?

Failure rates at scale

H100-class training clusters experience component failures at rates that would be considered alarming in classical HPC. Meta's "The Llama 3 Herd of Models" paper, published in July 2024, reported that during 54 days of pre-training on a single 16,384-GPU H100 cluster, the system experienced 466 job interruptions: 47 from planned maintenance and 419 from unexpected hardware or software faults, averaging one unexpected interruption about every three hours^[28]. Of the unexpected interruptions, 58.7 percent were GPU related, with 148 caused by GPU failures (including NVLink failures) and 72 by HBM3 memory faults; network switch and cable issues accounted for 35 (8.4 percent)^[28]. SemiAnalysis estimates the mean time to first job failure for a brand new 100,000 H100 cluster without fault recovery at 26.28 minutes, dominated by optical transceiver failures across the tens of thousands of cables in such a build^[1].
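The reported counts translate directly into rates. The sketch below recomputes the interval between unexpected interruptions and the category shares from the figures above, then extrapolates to 100,000 GPUs under the assumption that failures scale linearly with GPU count. That linear scaling is an illustrative assumption, not the method used by either cited source.

```python
RUN_DAYS = 54
UNEXPECTED = 419
CLUSTER_GPUS = 16_384

hours_between = RUN_DAYS * 24 / UNEXPECTED


def share_percent(count: int) -> float:
    """Share of unexpected interruptions attributed to one category."""
    return count / UNEXPECTED * 100


# Assumption: interruption rate grows in proportion to the number of GPUs.
scaled_minutes = hours_between * CLUSTER_GPUS / 100_000 * 60

print(f"Hours between unexpected interruptions: {hours_between:.2f}")
print(f"GPU failures: {share_percent(148):.1f}%  HBM3: {share_percent(72):.1f}%  "
      f"network: {share_percent(35):.1f}%")
print(f"Linear extrapolation to 100,000 GPUs: {scaled_minutes:.1f} minutes")
```

The extrapolation lands at roughly half an hour, the same order of magnitude as the independent 26.28-minute estimate quoted above.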

Checkpointing and recovery

Synchronous training of a frontier model is fragile: a single GPU failure can stall the entire job, because all replicas must complete the same step before any can proceed. The dominant mitigation is checkpointing, where the model state (weights, optimizer state, RNG seeds) is periodically serialized to fast storage. Checkpoint sizes for 100B-parameter or larger models reach hundreds of gigabytes to several terabytes, so checkpoint cadence is bounded by storage bandwidth. Asynchronous checkpointing, in-memory checkpointing across redundant GPU groups, and elastic training (where the job continues with a slightly smaller world size while the failed node is replaced) are active research areas. Meta reported that the Llama 3 run sustained more than 90 percent effective training time despite these interruptions, with significant manual intervention required only three times and the remaining issues handled by automation^[28].
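The trade-off between checkpoint cost and failure rate can be estimated with Young's classic approximation, in which the interval that minimizes lost work is about sqrt(2 * C * M) for checkpoint write time C and mean time between failures M. This formula is brought in here for illustration and is not taken from the cited sources. The 12 bytes per parameter (full-precision weights plus two Adam moments) and the 100 GB/s aggregate write bandwidth are likewise assumptions.

```python
import math

BYTES_PER_PARAM = 12  # assumed: fp32 weights + two fp32 Adam moments


def checkpoint_bytes(params: int) -> int:
    """Approximate size of one full checkpoint."""
    return params * BYTES_PER_PARAM


def write_seconds(size_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Time to write a checkpoint at a given aggregate storage bandwidth."""
    return size_bytes / bandwidth_bytes_per_s


def optimal_interval_s(write_s: float, mtbf_s: float) -> float:
    """Young's approximation for the checkpoint interval."""
    return math.sqrt(2 * write_s * mtbf_s)


size = checkpoint_bytes(100_000_000_000)       # 100B parameters
cost = write_seconds(size, 100e9)              # hypothetical 100 GB/s storage
interval = optimal_interval_s(cost, 3.09 * 3600)  # about 3 hours between failures

print(f"Checkpoint size: {size / 1e12:.1f} TB, write time: {cost:.0f} s")
print(f"Suggested interval: {interval / 60:.1f} minutes")
```

Under these assumptions a 1.2 TB checkpoint takes about 12 seconds to write and the suggested cadence is a little under nine minutes; slower storage or a shorter time between failures moves the interval accordingly.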

Hot spares and silent data corruption

Operational practice at scale also includes pre-allocated "hot spare" nodes that can be swapped in within minutes, automated cable and transceiver health monitoring, and detection of silent data corruption (SDC) where a GPU returns numerically incorrect results without raising an error. SDC is rare per-GPU but non-negligible when integrated across hundreds of thousands of GPU-hours per training run, and frontier labs run periodic deterministic validation checks to detect it.
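One way to picture a deterministic validation check is to rerun a fixed, seeded workload and compare a digest of the output with a known-good value. The sketch below does this on the CPU in plain Python. All names are hypothetical, the small matrix multiply stands in for a real GPU kernel, and bitwise comparison only works when the workload is fully deterministic on the hardware being tested.

```python
import hashlib
import random
import struct


def reference_workload(seed: int, n: int = 32) -> list[float]:
    """Seeded n x n matrix multiply; identical inputs must give identical bits."""
    rng = random.Random(seed)
    a = [[rng.random() for _ in range(n)] for _ in range(n)]
    b = [[rng.random() for _ in range(n)] for _ in range(n)]
    return [sum(a[i][k] * b[k][j] for k in range(n)) for i in range(n) for j in range(n)]


def digest(values: list[float]) -> str:
    """SHA-256 over the raw 64-bit representation of every output value."""
    return hashlib.sha256(struct.pack(f"{len(values)}d", *values)).hexdigest()


def device_is_healthy(run_on_device, seed: int, golden: str) -> bool:
    """Compare the device's result with the known-good digest."""
    return digest(run_on_device(seed)) == golden


golden = digest(reference_workload(seed=7))


def faulty_device(seed: int) -> list[float]:
    """Simulated silent corruption: one output value is slightly wrong."""
    out = reference_workload(seed)
    out[5] += 1e-9
    return out


print(device_is_healthy(reference_workload, 7, golden))  # healthy path
print(device_is_healthy(faulty_device, 7, golden))       # corrupted path
```

Because the comparison is exact, even a tiny numerical deviation that raises no error is flagged, which is the failure mode silent data corruption checks are meant to catch.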

How do GPU clusters store and feed training data?

Training on a multi-trillion-token dataset requires sustained ingest bandwidth on the order of terabytes per second across the cluster. Modern AI clusters use a combination of object storage (S3 or equivalent) for cold storage, parallel file systems (DDN EXAScaler, IBM Spectrum Scale, VAST Data) for active datasets, and FUSE-based caching layers that present a POSIX-like interface backed by sharded object stores. Meta's clusters use a custom FUSE API atop a flash-optimized Tectonic backend combined with Hammerspace for parallel NFS^[14]. xAI Colossus uses DDN as primary storage^[19]. Checkpoint storage is typically separated from training data storage and provisioned with much higher write bandwidth.

How does an AI cluster differ from traditional HPC?

Classical HPC clusters (climate modeling, computational fluid dynamics, lattice QCD) historically optimized for tightly coupled MPI workloads with point-to-point latency in the low microseconds and high node-count weak scaling. Frontier (Oak Ridge), the first U.S. exascale system commissioned in 2022, used 37,888 AMD Instinct MI250X GPUs on 9,408 HPE Cray EX nodes interconnected by Slingshot-11, delivering 1.102 Linpack exaflops in a 21.1 megawatt power envelope^[13]. AI clusters share much of this hardware lineage (Slingshot, fat-tree variants, NVLink, MPI as a fallback) but optimize for different patterns: large bandwidth-bound collectives, dense linear algebra at low precision (FP8, FP4), and tolerance for asynchrony at the level of pipeline parallelism. The economic model also differs: HPC procurements traditionally pay for peak FLOP/s achieved on Linpack, whereas AI clusters optimize for cost per useful training token at production batch size and pipeline depth.

References

1. Dylan Patel et al., "100,000 H100 Clusters: Power, Network Topology, Ethernet vs InfiniBand, Reliability, Failures, Checkpointing", SemiAnalysis, 2024-06-17. https://newsletter.semianalysis.com/p/100000-h100-clusters-power-network. Accessed 2026-05-25.
2. NVIDIA, "NVIDIA Launches World's First Deep Learning Supercomputer", NVIDIA Newsroom, 2016-04-05. https://nvidianews.nvidia.com/news/nvidia-launches-world-s-first-deep-learning-supercomputer. Accessed 2026-05-25.
3. Mark Tyson, "Elon Musk reminisces about the time Jensen Huang donated a DGX-1 to OpenAI", Tom's Hardware, 2025-07-23. https://www.tomshardware.com/tech-industry/artificial-intelligence/elon-musk-reminisces-about-the-time-jensen-huang-donated-a-dgx-1-to-openai-shares-photo-gallery. Accessed 2026-05-25.
4. Jennifer Langston, "Microsoft announces new supercomputer, lays out vision for future AI work", Microsoft Source, 2020-05-19. https://news.microsoft.com/source/features/ai/openai-azure-supercomputer/. Accessed 2026-05-25.
5. TOP500, "Eagle - Microsoft NDv5, Xeon Platinum 8480C 48C 2GHz, NVIDIA H100, NVIDIA Infiniband NDR", TOP500, 2023-11-13. https://top500.org/system/180236/. Accessed 2026-05-25.
6. Patrick Kennedy, "Microsoft Azure Eagle is a Paradigm Shifting Cloud Supercomputer", ServeTheHome, 2023-11-13. https://www.servethehome.com/microsoft-azure-eagle-is-a-paradigm-shifting-cloud-supercomputer-nvidia-intel/. Accessed 2026-05-25.
7. Sebastian Moss, "Meta to operate '600,000 H100 GPU equivalents of compute' by year-end", Data Center Dynamics, 2024-01-18. https://www.datacenterdynamics.com/en/news/meta-to-operate-600000-gpus-by-year-end/. Accessed 2026-07-12.
8. Introl Editorial, "xAI Colossus Hits 2 GW: 555,000 GPUs, $18B, Largest AI Site", Introl Blog, 2026-01-04. https://introl.com/blog/xai-colossus-2-gigawatt-expansion-555k-gpus-january-2026. Accessed 2026-05-25.
9. Supermicro, "Supermicro NVIDIA GB200 NVL72 Datasheet", Supermicro, 2024-03-18. https://www.supermicro.com/datasheet/datasheet_SuperCluster_GB200_NVL72.pdf. Accessed 2026-05-25.
10. NVIDIA, "NVLink & NVLink Switch: Fastest HPC Data Center Platform", NVIDIA, 2024-03-18. https://www.nvidia.com/en-us/data-center/nvlink/. Accessed 2026-05-25.
11. Introl Editorial, "GPU Cluster Network Topology Design", Introl Blog, 2025-02-20. https://introl.com/blog/gpu-cluster-network-topology-fat-tree-dragonfly-rail-optimized-2025. Accessed 2026-05-25.
12. NVIDIA, "NVIDIA DGX SuperPOD: Next Generation Scalable Infrastructure for AI Leadership Reference Architecture Featuring NVIDIA DGX H100", NVIDIA Documentation, 2023-04-10. https://docs.nvidia.com/dgx-superpod/reference-architecture-scalable-infrastructure-h100/latest/dgx-superpod-architecture.html. Accessed 2026-05-25.
13. Oak Ridge Leadership Computing Facility, "Frontier", ORNL, 2022-05-30. https://www.olcf.ornl.gov/frontier/. Accessed 2026-05-25.
14. Kevin Lee, Adi Gangidi, Mathew Oldham, "Building Meta's GenAI Infrastructure", Engineering at Meta, 2024-03-12. https://engineering.fb.com/2024/03/12/data-center-engineering/building-metas-genai-infrastructure/. Accessed 2026-05-25.
15. NVIDIA, "NVIDIA Ethernet Networking Accelerates World's Largest AI Supercomputer, Built by xAI", NVIDIA Newsroom, 2024-10-28. https://nvidianews.nvidia.com/news/spectrum-x-ethernet-networking-xai-colossus. Accessed 2026-05-25.
16. NVIDIA, "Advancing Performance with NVIDIA SHARP In-Network Computing", NVIDIA Technical Blog, 2024-02-15. https://developer.nvidia.com/blog/advancing-performance-with-nvidia-sharp-in-network-computing/. Accessed 2026-05-25.
17. NVIDIA, "NVIDIA Collective Communications Library (NCCL)", NVIDIA Developer, 2024-09-12. https://developer.nvidia.com/nccl. Accessed 2026-05-25.
18. NVIDIA, "Enabling Fast Inference and Resilient Training with NCCL 2.27", NVIDIA Technical Blog, 2025-06-25. https://developer.nvidia.com/blog/enabling-fast-inference-and-resilient-training-with-nccl-2-27/. Accessed 2026-05-25.
19. Chris Mellor, "DDN supplying storage for xAI's Colossus supercomputer", Blocks & Files, 2024-11-19. https://www.blocksandfiles.com/ai-ml/2024/11/19/ddn-supplying-storage-for-xais-colossus-supercomputer/1604044. Accessed 2026-05-25.
20. Sebastian Moss, "Microsoft is deploying the equivalent of five 561 petaflops supercomputers every month", Data Center Dynamics, 2024-05-21. https://www.datacenterdynamics.com/en/news/microsoft-deploying-equivalent-of-five-561-petaflops-supercomputers-every-month/. Accessed 2026-05-25.
21. OpenAI, "Announcing The Stargate Project", OpenAI, 2025-01-21. https://openai.com/index/announcing-the-stargate-project/. Accessed 2026-05-25.
22. Jordan Novet, "OpenAI's first data center in $500 billion Stargate project is open in Texas", CNBC, 2025-09-23. https://www.cnbc.com/2025/09/23/openai-first-data-center-in-500-billion-stargate-project-up-in-texas.html. Accessed 2026-05-25.
23. Google Cloud, "TPU v5p", Google Cloud Documentation, 2024-04-09. https://cloud.google.com/tpu/docs/v5p. Accessed 2026-05-25.
24. Amin Vahdat and Mark Lohmeyer, "Ironwood: The first Google TPU for the age of inference", Google Blog, 2025-04-09. https://blog.google/innovation-and-ai/infrastructure-and-cloud/google-cloud/ironwood-tpu-age-of-inference/. Accessed 2026-05-25.
25. Spencer Kimball, "Constellation Energy to restart Three Mile Island nuclear plant, sell the power to Microsoft for AI", CNBC, 2024-09-20. https://www.cnbc.com/2024/09/20/constellation-energy-to-restart-three-mile-island-and-sell-the-power-to-microsoft.html. Accessed 2026-05-25.
26. Sonal Patel, "Talen, Amazon Launch $18B Nuclear PPA: A Grid-Connected IPP Model for the Data Center Era", POWER Magazine, 2024-12-09. https://www.powermag.com/talen-amazon-launch-18b-nuclear-ppa-a-grid-connected-ipp-model-for-the-data-center-era/. Accessed 2026-05-25.
27. Anton Shilov, "NVIDIA's full-spec Blackwell B200 AI GPU uses 1200W of power", TweakTown, 2024-03-19. https://www.tweaktown.com/news/97059/nvidias-full-spec-blackwell-b200-ai-gpu-uses-1200w-of-power-up-from-700w-on-hopper-h100/index.html. Accessed 2026-05-25.
28. Aaron Grattafiori et al. (Meta), "The Llama 3 Herd of Models", arXiv:2407.21783, 2024-07-31. https://arxiv.org/abs/2407.21783. Accessed 2026-05-25.
Tobias Mann, "A closer look at Nvidia's 120kW DGX GB200 NVL72 rack system", The Register, 2024-03-21. https://www.theregister.com/2024/03/21/nvidia_dgx_gb200_nvk72/. Accessed 2026-07-12. ↩
NVIDIA, "NVIDIA Unveils Rubin CPX: A New Class of GPU Designed for Massive-Context Inference", NVIDIA Newsroom, 2025-09-09. https://nvidianews.nvidia.com/news/nvidia-unveils-rubin-cpx-a-new-class-of-gpu-designed-for-massive-context-inference. Accessed 2026-07-12. ↩
Epoch AI, "OpenAI Stargate: where the US sites stand", Epoch AI, 2026. https://epoch.ai/publications/openai-stargate-where-the-us-sites-stand. Accessed 2026-07-12. ↩
Introl Editorial, "NVIDIA GB300 NVL72: Blackwell Ultra Deployment", Introl Blog, 2025. https://introl.com/blog/why-nvidia-gb300-nvl72-blackwell-ultra-matters. Accessed 2026-07-12. ↩