Distributed training is the practice of training a single machine learning model using many compute devices in parallel, splitting the data, the model, or both across GPUs, TPUs, or other accelerators that coordinate over a network. As deep learning models have grown from millions to hundreds of billions and even trillions of parameters, single-device training has become infeasible in terms of both memory and wall-clock time. Modern frontier LLM runs use clusters of thousands to tens of thousands of accelerators training continuously for weeks or months. The field has evolved from simple data-parallel setups into multi-dimensional parallelism strategies that combine data, tensor, pipeline, expert, and context parallelism in a single training job. Distributed training is the systems counterpart of partitioning strategy: partitioning describes how a model is split, while distributed training is the broader practice of executing that split across real hardware.
The push to distributed training comes from a few converging pressures. Models have grown faster than per-device memory. GPT-2 (2019) had 1.5 billion parameters and could be trained on a small cluster. GPT-3 (2020) scaled to 175 billion parameters and required thousands of GPUs. By the mid-2020s, models such as GPT-4, Gemini Ultra, Llama 3.1 405B, DeepSeek-V3, and Grok 4 were trained on clusters of 16,000 or more accelerators [1][6][12].
A single NVIDIA H100 has 80 GB of high-bandwidth memory. A 70-billion-parameter model in FP32 requires roughly 280 GB just for weights, and Adam optimizer state pushes the resident footprint to about 1.1 TB before activations are even allocated. The model cannot live on one device, no matter how careful the implementation. Even when the model fits, training time matters: a 405-billion-parameter run on 16,000 H100s for 54 days at 400 TFLOP/s per GPU represents about 3.8e25 floating-point operations [6]. Performing that computation on a single device would take centuries. Distributed training is not a convenience; it is the only way to reach the compute scale that frontier models require.
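The arithmetic is easy to reproduce. The sketch below follows the FP32 accounting above: weights, gradients, and the two Adam moments at 4 bytes each, for 16 bytes of persistent state per parameter, ignoring activations and framework buffers.

```python
# Back-of-the-envelope memory for FP32 Adam training of a 70B-parameter model
# (weights + gradients + two Adam moments, ignoring activations and buffers).
PARAMS = 70e9
BYTES_PER_VALUE = 4                       # FP32
STATE_TENSORS = 4                         # weights, gradients, Adam m, Adam v

total_bytes = PARAMS * BYTES_PER_VALUE * STATE_TENSORS
print(f"{total_bytes / 1e12:.2f} TB of persistent state")           # ~1.12 TB
print(f"~{total_bytes / 80e9:.0f} H100s of HBM just to hold it")    # ~14 GPUs
```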
Large-batch training is another motivator. Many optimization strategies favor very large effective batches because they smooth gradient noise and increase throughput. A per-device batch of 4 sequences across 4,096 devices yields a global batch of 16,384, which would not fit on any single accelerator. Data parallelism makes that batch size routine.
Distributed training pays for these benefits in complexity. Inter-device communication adds latency, bookkeeping for fault tolerance is nontrivial, and a single misconfigured shard can leave thousands of expensive accelerators idle. The research and engineering effort spent on distributed training reflects how much these costs matter at scale.
Distributed training strategies are usually classified by what gets split. The same model can be partitioned along several axes at once. The table below summarizes the main families. Each is discussed in more detail in the sections that follow, and the partitioning strategy article goes deeper on the underlying decomposition.
| Strategy | What is split | Communication pattern | Typical scope | Originating work |
|---|---|---|---|---|
| Data parallelism | Input batch across replicas | All-reduce on gradients | All scales, baseline approach | DistBelief (Dean 2012); Horovod (Sergeev 2018) |
| Tensor parallelism | Weights inside a layer | All-reduce per layer | Within a node, NVLink fabric | Megatron-LM (Shoeybi 2019) [3] |
| Pipeline parallelism | Layers across stages | Point-to-point send and receive | Across nodes, micro-batches | GPipe (Huang 2018) [4]; PipeDream (Harlap 2018) [16] |
| Sequence parallelism | Sequence dimension within a layer | Reduce-scatter, all-gather | Combined with tensor parallelism | Korthikanti 2022 [13] |
| Context parallelism (Ring Attention) | Sequence dimension across attention | Ring of key-value blocks | Long-context training | Liu 2023 [14] |
| Expert parallelism | MoE experts across devices | All-to-all token routing | MoE models | GShard (Lepikhin 2020); Switch Transformer (Fedus 2022) [5] |
| ZeRO (sharded data parallelism) | Optimizer state, gradients, weights | All-gather, reduce-scatter | Memory-bound runs | ZeRO (Rajbhandari 2020) [7] |
| Fully Sharded Data Parallel (FSDP) | Weights and optimizer state | All-gather per layer | PyTorch ecosystem | Zhao 2023 [2] |
Data parallelism is the simplest and most widely used form of distributed training. Each device holds a complete copy of the model. The training data batch is split across devices, with each device processing a different subset of examples. After computing gradients locally, devices synchronize them, typically by averaging through an all-reduce, and each replica updates its weights identically. Determinism falls out naturally: every replica sees the same averaged gradient, so every replica computes the same new weights.
The canonical PyTorch implementation is DistributedDataParallel (DDP), which uses NVIDIA's NCCL library to perform gradient all-reduce. DDP overlaps gradient communication with backward computation: as soon as gradients for a layer are produced, they are sent to other ranks while gradients for earlier layers are still being calculated. This overlap is what makes DDP scale efficiently up to several thousand GPUs for models that fit per-device.
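A minimal DDP script looks roughly like the sketch below; `MyModel` and `loader` are hypothetical placeholders, and the script assumes it is launched with `torchrun` so that rank information arrives through environment variables.

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Minimal data-parallel training sketch. Assumes launch via `torchrun`, which
# sets RANK, LOCAL_RANK, and WORLD_SIZE; MyModel and loader are placeholders.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = DDP(MyModel().cuda(), device_ids=[local_rank])
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

for inputs, targets in loader:           # loader should use a DistributedSampler
    inputs = inputs.cuda(non_blocking=True)
    targets = targets.cuda(non_blocking=True)
    loss = torch.nn.functional.cross_entropy(model(inputs), targets)
    loss.backward()                      # NCCL all-reduce overlaps with backward
    optimizer.step()
    optimizer.zero_grad()
```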
Data parallelism scales well when the full model fits in one device's memory, which today means roughly 1 to 7 billion parameters in mixed precision on an H100. Beyond that, weights and optimizer state spill, and DDP must give way to sharded variants like FSDP or ZeRO-3.
The effective batch size in data parallelism equals the per-device batch multiplied by the number of replicas. Very large batches can hurt convergence. The linear scaling rule and warmup schedule from Goyal and colleagues (2017) are the standard recipe for large-batch SGD: when the batch grows by a factor of k, scale the learning rate by k and ramp it up gradually over the first few epochs [1]. Subsequent optimizers such as LARS and LAMB extended these ideas with per-layer learning rate adaptation, and frontier runs such as Llama 3 now use effective batches of millions of tokens.
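As a concrete illustration of the recipe, the helper below scales a baseline learning rate by the batch growth factor and ramps it up linearly over a warmup window; the numbers are arbitrary examples, not tuned values.

```python
def scaled_lr(step, base_lr=1e-4, base_batch=256, global_batch=16384,
              warmup_steps=2000):
    """Linear scaling rule with linear warmup (in the style of Goyal et al. 2017)."""
    k = global_batch / base_batch          # batch growth factor
    target_lr = base_lr * k                # scale the learning rate by the same factor
    if step < warmup_steps:                # ramp up gradually to avoid early divergence
        return base_lr + (target_lr - base_lr) * step / warmup_steps
    return target_lr
```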
Tensor parallelism, also called intra-layer model parallelism, splits the weight matrices inside a single layer across devices. The Megatron-LM paper from NVIDIA showed that the linear layers in a transformer can be partitioned with very modest communication: split the first feed-forward matrix column-wise so each device holds a slice of the output features, then split the second matrix row-wise so the partial sums can be combined with one all-reduce at the end of the block [3]. Self-attention is split along the head dimension, which is naturally independent: each device computes a subset of heads and the results are concatenated.
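The arithmetic behind the split can be checked on a single device. The sketch below simulates two tensor-parallel ranks, giving each a column slice of the first matrix and the matching row slice of the second, and confirms that summing the partial outputs, which is exactly what the all-reduce does, reproduces the unsplit result (toy sizes, illustrative only).

```python
import torch

torch.manual_seed(0)
d_model, d_ff, tp = 8, 32, 2             # toy sizes, 2-way tensor parallelism
x = torch.randn(4, d_model)              # a small batch of activations
W1 = torch.randn(d_model, d_ff)          # first FFN matrix, split column-wise
W2 = torch.randn(d_ff, d_model)          # second FFN matrix, split row-wise

reference = torch.relu(x @ W1) @ W2      # unsplit computation

partials = []
for rank in range(tp):
    cols = slice(rank * d_ff // tp, (rank + 1) * d_ff // tp)
    h = torch.relu(x @ W1[:, cols])      # each rank holds a column slice of W1
    partials.append(h @ W2[cols, :])     # and the matching row slice of W2

recombined = sum(partials)               # the value the all-reduce would produce
print(torch.allclose(reference, recombined, atol=1e-5))   # True
```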
The original Megatron paper trained an 8.3-billion-parameter transformer on 512 V100 GPUs at 76% scaling efficiency relative to a single-GPU baseline, a result that put intra-layer parallelism on the map [3]. The technique requires very low-latency communication because each transformer block performs at least one all-reduce across the tensor-parallel group. In practice, tensor parallelism is confined to a single server node, where eight or sixteen GPUs share an NVLink and NVSwitch fabric. Stretching it across the slower InfiniBand fabric between nodes makes the all-reduce dominate runtime.
Pipeline parallelism divides the model along its depth. Consecutive groups of layers, called stages, are placed on different devices. During the forward pass, each stage processes its layers and forwards activations to the next stage. The backward pass flows in the opposite direction.
The naive version suffers from a pipeline bubble: a stage sits idle while waiting for its predecessor or successor. GPipe, introduced by Huang and colleagues at Google in 2018, mitigates the bubble by splitting each batch into micro-batches and pumping them through the pipeline in sequence [4]. The bubble shrinks as the number of micro-batches grows, and gradient updates remain synchronous because each micro-batch contributes to the same accumulated gradient. GPipe was used to train a six-billion-parameter, 128-layer transformer for multilingual translation, an early demonstration of just how much pipeline parallelism could unlock.
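The bubble has a simple closed form: with p stages and m micro-batches, the idle fraction is (p - 1) / (m + p - 1). A small helper makes the effect of micro-batching concrete.

```python
def bubble_fraction(stages: int, micro_batches: int) -> float:
    """Idle fraction of a synchronous GPipe-style pipeline."""
    return (stages - 1) / (micro_batches + stages - 1)

print(bubble_fraction(stages=16, micro_batches=1))    # 0.94 -- naive, mostly idle
print(bubble_fraction(stages=16, micro_batches=64))   # ~0.19
print(bubble_fraction(stages=16, micro_batches=256))  # ~0.055
```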
PipeDream, from Harlap and colleagues at Microsoft in 2018, took a different angle. It introduced the 1F1B (one-forward, one-backward) schedule, in which each worker alternates strictly between a forward pass on one micro-batch and a backward pass on another [16]. 1F1B keeps the in-flight working set bounded and removes most of the bubble in steady state. The original PipeDream did this asynchronously and used weight stashing to avoid stale-gradient problems, which complicates correctness. Modern implementations in DeepSpeed and Megatron-LM use synchronous 1F1B with carefully placed flushes at the end of each global batch, retaining most of the throughput benefit without the asynchrony.
More recent work pushes the bubble closer to zero. Interleaved 1F1B in Megatron-LM splits each stage into multiple chunks; zero-bubble pipelines (Qi 2024) split the backward pass into a weight-gradient and input-gradient phase that can be reordered to fully eliminate the bubble for sufficiently many micro-batches.
Pipeline parallelism is well-suited to the slower inter-node InfiniBand fabric because adjacent stages exchange only activations (and activation gradients), which are far smaller than the full-model gradient traffic of data parallelism. It also tolerates network jitter because stages can absorb small delays through micro-batch buffering.
Sequence parallelism, introduced by Korthikanti and colleagues at NVIDIA in 2022, is a refinement of tensor parallelism that handles the parts of the transformer that tensor parallelism leaves unsharded, namely layer normalization, dropout, and the residual stream [13]. The idea is to split these operations along the sequence dimension so that activation memory shrinks proportionally to the tensor-parallel group size. Combined with selective activation recomputation, sequence parallelism cut activation memory by 5x and reduced recomputation overhead by more than 90% on a 530-billion-parameter run on 2,240 A100 GPUs, raising model FLOPs utilization from 42.1% to 54.2% [13].
Context parallelism is a more aggressive variant for very long contexts. Ring Attention, from Liu and colleagues in 2023, distributes the attention computation itself across devices arranged in a ring [14]. Each rank holds a slice of the queries, keys, and values for a long sequence; the keys and values rotate around the ring, with each device combining incoming blocks against its local queries. Communication of key-value blocks is overlapped with the blockwise attention computation, so context length scales linearly with device count without changing the asymptotic memory footprint per device. Ring Attention is what makes million-token context windows tractable. Striped Attention (Brandon 2023) extended the idea to handle the load imbalance caused by causal masking. Meta used context parallelism as one of four parallel dimensions when training Llama 3.1 with 128k-token contexts [6].
Mixture-of-experts (MoE) models contain many expert sub-networks inside specific layers. A gating mechanism routes each token to a small subset of experts, often two out of dozens or hundreds. Expert parallelism places different experts on different devices. The compute pattern is unique: tokens must be sent across the network to whichever device hosts the expert they were routed to, processed there, and then sent back. This is an all-to-all communication, which is sensitive to bandwidth and to load balance because hot experts can pile up.
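A toy top-2 router shows where the communication pattern comes from: the per-expert token counts produced by the gate are precisely what each device must send and receive in the all-to-all, and any imbalance in those counts is the hot-expert problem (sizes below are arbitrary).

```python
import torch

torch.manual_seed(0)
tokens, d_model, n_experts, top_k = 16, 8, 4, 2
x = torch.randn(tokens, d_model)
gate = torch.nn.Linear(d_model, n_experts, bias=False)

scores = torch.softmax(gate(x), dim=-1)              # routing probabilities
weights, expert_ids = scores.topk(top_k, dim=-1)     # top-2 experts per token
# `weights` would scale each expert's output when the results are combined.

# Per-expert token counts: with expert parallelism these determine how many
# tokens each device receives in the all-to-all exchange.
counts = torch.bincount(expert_ids.flatten(), minlength=n_experts)
print(counts)    # uneven counts are exactly the "hot expert" load-balance problem
```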
GShard from Google (Lepikhin 2020) and the Switch Transformer (Fedus 2022) established expert parallelism as the standard for trillion-parameter sparse models [5]. DeepSeek-V3, with 671 billion total parameters and 37 billion active per token, used 64-way expert parallelism spanning eight nodes during training [12].
The Zero Redundancy Optimizer (ZeRO), from Rajbhandari and colleagues at Microsoft, identifies a wasteful aspect of vanilla data parallelism: every replica stores the same optimizer state, the same gradients, and the same parameters, even though only its slice of the batch is being processed at any moment [7]. ZeRO partitions these tensors across the data-parallel group, materializing them only when needed.
| ZeRO stage | What is partitioned | Memory reduction vs DDP | Communication overhead |
|---|---|---|---|
| Stage 0 | Nothing (vanilla data parallelism) | 1x | Baseline all-reduce |
| Stage 1 | Optimizer state | ~4x | Minimal increase |
| Stage 2 | Optimizer state and gradients | ~8x | Moderate increase |
| Stage 3 | Optimizer state, gradients, and weights | Linear with N devices, up to ~64x | Significant: extra all-gather per layer |
| ZeRO-Infinity | Stage 3 plus offload to CPU and NVMe | Limited only by host storage | Highest |
ZeRO Stage 1 partitions optimizer state, which for Adam in mixed precision is 12 bytes per parameter. Stage 2 also partitions gradients. Stage 3 also partitions weights, which means the device must all-gather the layer's parameters just before the forward pass, and again before the backward pass, then drop them. ZeRO-Infinity layers in CPU and NVMe offload, allowing trillion-parameter models on modest GPU counts at the cost of much higher communication and slower step time. Microsoft's original analysis showed that ZeRO Stage 3 could fit a one-trillion-parameter model on 1,024 NVIDIA GPUs [7].
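The per-rank savings follow directly from this accounting. The sketch below reproduces the approximate arithmetic for a hypothetical 70-billion-parameter model with mixed-precision Adam across 64 data-parallel ranks.

```python
def zero_bytes_per_rank(params, dp, stage):
    """Approximate per-rank memory (bytes) for mixed-precision Adam under ZeRO."""
    weights, grads, optim = 2 * params, 2 * params, 12 * params
    if stage == 0:                                    # vanilla data parallelism
        return weights + grads + optim
    if stage == 1:                                    # shard optimizer state
        return weights + grads + optim / dp
    if stage == 2:                                    # shard gradients as well
        return weights + grads / dp + optim / dp
    return (weights + grads + optim) / dp             # stage 3: shard everything

for stage in range(4):
    gb = zero_bytes_per_rank(70e9, dp=64, stage=stage) / 1e9
    print(f"ZeRO stage {stage}: ~{gb:,.0f} GB per GPU")
# ~1,120 GB, ~293 GB, ~155 GB, and ~18 GB for stages 0 through 3
```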
PyTorch's Fully Sharded Data Parallel (FSDP) is conceptually similar to ZeRO Stage 3 but is implemented as a first-class member of the PyTorch ecosystem. The Zhao paper describes FSDP as an industry-grade solution for large-model training and reports near-linear TFLOPS scaling [2]. FSDP wraps a model in shards, all-gathers each shard just in time for forward and backward, and reduce-scatters gradients so each rank ends up holding only its own shard.
FSDP2, introduced during the PyTorch 2.x series as the successor to the original implementation, redesigned it around the per-parameter DTensor abstraction, fixed several composability issues with torch.compile, and added cleaner support for mixed precision and CPU offload. Empirical studies show FSDP can reduce GPU memory usage by 60% or more compared to DDP, although it can slow per-step time by 2x to 6x depending on cluster bandwidth and model shape [2]. For the largest models, the trade is usually worth it because the model would not fit at all under plain DDP.
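Wrapping a model in FSDP looks roughly like the following sketch; `MyModel` and `MyBlock` are hypothetical placeholders, the process group is assumed to be initialized as in the DDP sketch above, and details such as the wrap policy and mixed-precision settings vary by PyTorch version.

```python
import functools

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

# Sketch of sharded data parallelism with the classic FSDP API; MyBlock and
# MyModel are hypothetical stand-ins for a transformer block and model.
wrap_policy = functools.partial(transformer_auto_wrap_policy,
                                transformer_layer_cls={MyBlock})
model = FSDP(
    MyModel().cuda(),
    auto_wrap_policy=wrap_policy,        # shard at transformer-block granularity
    mixed_precision=MixedPrecision(param_dtype=torch.bfloat16,
                                   reduce_dtype=torch.bfloat16),
)
# The training loop looks the same as with DDP; FSDP all-gathers each block's
# parameters just in time and reduce-scatters its gradients afterward.
```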
Real training runs at frontier scale combine several of these strategies. The convention is to count the number of orthogonal axes: 3D parallelism combines data, tensor, and pipeline parallelism, while 4D runs such as Llama 3.1 405B add a context-parallel dimension, and MoE models add an expert dimension on top [6].
The number of devices assigned to each dimension must multiply to the total cluster size, and the assignment is far from automatic. Auto-parallelism systems such as Alpa, FlexFlow, and Galvatron search this space, but most production runs are still tuned by hand using profiled microbenchmarks. The partitioning strategy article covers the optimization considerations in detail.
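One useful sanity check on any proposed layout is that the parallel degrees exactly tile the cluster; the sketch below checks an illustrative 4D layout for a 16,384-GPU cluster (the specific degrees are examples, not recommendations).

```python
import math

def check_layout(cluster_size, **degrees):
    """Verify that the parallelism degrees exactly tile the cluster."""
    product = math.prod(degrees.values())
    assert product == cluster_size, (
        f"layout {degrees} covers {product} devices, cluster has {cluster_size}")
    return degrees

# Hypothetical 4D layout: tensor parallel within a node, pipeline and context
# parallelism across nodes, with the remainder spent on sharded data parallelism.
layout = check_layout(16_384, tp=8, pp=16, cp=2, dp=64)
print(layout)    # {'tp': 8, 'pp': 16, 'cp': 2, 'dp': 64}
```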
Distributed training is built on a small set of collective operations, all of them inherited from the high-performance computing tradition.
| Primitive | What it does | Use in training |
|---|---|---|
| All-reduce | Combines tensors from every rank and returns the result to every rank | Gradient averaging in data parallelism |
| All-gather | Collects each rank's slice into a full tensor on every rank | Materializing weights in ZeRO-3 and FSDP |
| Reduce-scatter | Reduces and partitions the result so each rank gets a slice | Sharded gradient updates in FSDP |
| Broadcast | Sends one rank's tensor to every rank | Initialization, parameter syncs |
| All-to-all | Each rank sends a different slice to every other rank | Expert parallelism token routing |
| Send and recv | Point-to-point transfer | Activation hand-off in pipeline parallelism |
NVIDIA's NCCL is the dominant implementation on GPU clusters, with Gloo and MPI as alternatives. NCCL automatically detects the underlying topology, including PCIe, NVLink, NVSwitch, InfiniBand, and RoCE links, and selects an algorithm such as ring, tree, or PAT (Parallel Aggregated Trees) for each operation [11]. Recent NCCL releases added cross-data-center awareness, in-network reduction via NVIDIA SHARP, and improved fault diagnosis for very large jobs.
The ring all-reduce algorithm, popularized for deep learning by Baidu's research team in 2017, is the workhorse for gradient synchronization. Devices form a logical ring; in the scatter-reduce phase each device sends a chunk to its neighbor, which adds it to its own; in the all-gather phase, each device circulates its completed chunk back around. The per-device data transferred is 2(N-1)/N times the gradient size, which approaches 2x for large N; crucially, it does not grow with the number of devices. This bandwidth invariance is what made Horovod's ring-based approach popular before NCCL incorporated similar algorithms natively [10].
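The two phases can be simulated in a few lines of Python, with each "device" represented by a list of chunks; this is a pedagogical sketch of the algorithm, not a performance implementation.

```python
import numpy as np

def ring_all_reduce(tensors):
    """Simulate ring all-reduce; each entry in `tensors` is one device's data."""
    n = len(tensors)
    chunks = [np.array_split(t.astype(float), n) for t in tensors]

    # Phase 1: scatter-reduce. After n-1 steps, device i holds the fully
    # reduced copy of chunk (i + 1) % n.
    for step in range(n - 1):
        for dev in range(n):
            src = (dev - step) % n                    # chunk `dev` sends this step
            chunks[(dev + 1) % n][src] += chunks[dev][src]

    # Phase 2: all-gather. Completed chunks circulate once more around the ring.
    for step in range(n - 1):
        for dev in range(n):
            src = (dev + 1 - step) % n                # completed chunk `dev` passes on
            chunks[(dev + 1) % n][src] = chunks[dev][src].copy()

    return [np.concatenate(c) for c in chunks]

data = [np.arange(8) * (rank + 1) for rank in range(4)]   # 4 devices, 8 values each
print(ring_all_reduce(data)[0])    # every device ends with the elementwise sum
```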
Distributed training depends as much on the network as on the accelerators. Network topology and link bandwidth determine which parallelism strategies are even viable. The table below summarizes the main interconnect tiers in modern AI clusters.
| Interconnect | Bandwidth | Latency | Scope | Used by |
|---|---|---|---|---|
| NVLink 3.0 (A100) | 600 GB/s bidirectional | sub-microsecond | Intra-node GPU-to-GPU | NVIDIA DGX A100 |
| NVLink 4.0 (H100) | 900 GB/s bidirectional | sub-microsecond | Intra-node GPU-to-GPU | NVIDIA DGX H100 |
| NVLink 5.0 (B200) | 1,800 GB/s bidirectional | sub-microsecond | Intra-node GPU-to-GPU | NVIDIA DGX B200 |
| NVSwitch | Full bisection bandwidth | very low | Intra-node all-to-all | DGX H100, DGX B200, GB200 NVL72 |
| InfiniBand HDR | 200 Gb/s per port | ~1 microsecond | Inter-node | Earlier HPC and AI clusters |
| InfiniBand NDR | 400 Gb/s per port | ~1 microsecond | Inter-node | Llama 3 cluster, modern AI fabrics |
| RoCE (RDMA over Converged Ethernet) | 100 to 400 Gb/s | 2 to 5 microseconds | Inter-node | Hyperscaler-built clusters |
| TPU ICI (v4) | 4,800 Gbps per chip | very low | Intra-pod 3D torus | TPU v4 pods, 4,096 chips |
| TPU ICI (v5p) | 4,800 Gbps per chip | very low | 16x20x28 3D torus | TPU v5p pods, 8,960 chips |
NVLink and NVSwitch provide the bandwidth needed for tensor parallelism inside a node. The H100 SXM module pushes 900 GB/s of bidirectional GPU-to-GPU NVLink, and the GB200 NVL72 platform extends this fabric to 72 GPUs in a single rack. InfiniBand or RoCE provides the slower but still high-bandwidth inter-node fabric used for data parallelism, pipeline parallelism, and expert parallelism. Meta's Llama 3 cluster used 400 Gb/s InfiniBand NDR between nodes and reported that the network topology was a key factor in achieving 400 TFLOP/s per GPU [6].
Google's TPU pods take a different architectural approach. TPU v4 chips are arranged in a 3D torus that scales to 4,096 chips per pod; TPU v5p extends the torus to 8,960 chips in a 16x20x28 mesh, with optical circuit switches reconfiguring the topology at job-launch time [15]. The torus gives each chip high-bandwidth direct links to six neighbors, which suits the SPMD parallelism model that JAX uses. PaLM 540B was trained on 6,144 chips across two TPU v4 pods connected over the datacenter network, using a combination of model and data parallelism and achieving 46.2% model FLOPs utilization; it was the largest TPU configuration used for training at the time [17].
Network topology at scale is usually a fat tree (oversubscribed or full bisection) for InfiniBand-based AI clusters, with dragonfly and dragonfly+ topologies seen in some hyperscaler designs to reduce switch count at very large port counts. Cluster operators care obsessively about congestion control, link error rates, and the placement of training jobs relative to topology, because a single slow link can drag the whole job's throughput down to that link's bandwidth.
The table below covers a representative set of large training runs, with hardware and parallelism details where they have been published. Compute estimates are based on disclosed FLOPs counts where available.
| Model | Year | Parameters | Hardware | Cluster size | Parallelism | Notes |
|---|---|---|---|---|---|---|
| GPT-3 | 2020 | 175B dense | NVIDIA V100 | ~10,000 GPUs (Microsoft Azure) | Tensor + pipeline + data | OpenAI used Microsoft's purpose-built supercomputer |
| Megatron-Turing NLG | 2021 | 530B dense | NVIDIA A100 | 2,240 GPUs | Tensor + pipeline + data | Microsoft and NVIDIA partnership |
| PaLM | 2022 | 540B dense | TPU v4 | 6,144 chips across 2 pods | Pathways system, model + data | 46.2% MFU [17] |
| GPT-4 | 2023 | not disclosed | NVIDIA A100 | rumored ~25,000 GPUs over months | undisclosed | OpenAI did not publish training details |
| Llama 2 70B | 2023 | 70B dense | NVIDIA A100 | 2,000 GPUs | FSDP + tensor parallel | Released openly by Meta |
| Llama 3 405B | 2024 | 405B dense | NVIDIA H100 | 16,000 GPUs | 4D: TP=8, PP=16, CP, FSDP | 3.8e25 FLOPs, ~54 days [6] |
| DeepSeek-V3 | 2024 | 671B (37B active) MoE | NVIDIA H800 | 2,048 GPUs | PP=16, EP=64, ZeRO-1 DP | 2.788M GPU-hours, ~$5.6M [12] |
| Grok 4 | 2025 | not disclosed | NVIDIA H100/H200 | xAI Memphis Colossus, ~100,000 GPUs initially, scaled to ~200,000 | undisclosed | Single-site cluster built with Dell and Supermicro hardware |
| Gemini Ultra | 2023 | not disclosed | TPU v4/v5 | many TPU pods | JAX/Pathways | Google did not disclose chip count |
| Claude Opus / Sonnet | 2024 to 2026 | not disclosed | TPU and AWS Trainium | undisclosed | undisclosed | Anthropic split capacity across providers |
DeepSeek-V3 is a particularly informative case because the team published unusually detailed numbers. Pre-training consumed 2,664,000 H800 GPU-hours, or roughly 3.7 days per trillion training tokens on the 2,048-GPU cluster; with context extension and post-training included, the full run came to 2.788 million GPU-hours, estimated at $5.576M assuming a $2 per GPU-hour rental price [12]. The choice to skip tensor parallelism entirely, in favor of expert parallelism, pipeline parallelism, and ZeRO-1 sharded data parallelism, was driven by the H800's reduced NVLink bandwidth compared to the H100, which makes the per-layer all-reduces of tensor parallelism expensive. The publication touched off an industry-wide reassessment of how much compute is actually required for frontier-quality models.
Several frameworks dominate distributed training as of 2026. The table below summarizes the major ones; longer descriptions follow.
| Framework | Developer | Key features | Parallelism types | Language stack |
|---|---|---|---|---|
| PyTorch DDP and FSDP | Meta | Native PyTorch integration, FSDP2 with DTensor, tight torch.compile support | Data, fully sharded data parallelism | Python, PyTorch |
| TorchTitan | PyTorch team | Reference implementation of FSDP2 + tensor and pipeline parallelism for LLMs | Data, tensor, pipeline, sequence | Python, PyTorch |
| DeepSpeed | Microsoft | ZeRO 0-3 and ZeRO++; pipeline parallelism; offload to CPU and NVMe; inference engine | Data, pipeline, tensor, expert | Python, PyTorch |
| Megatron-LM | NVIDIA | Efficient tensor parallelism, interleaved 1F1B pipelines, sequence parallelism, expert parallelism | Tensor, pipeline, data, expert, context | Python, PyTorch |
| Megatron-DeepSpeed | Microsoft + NVIDIA | Fusion of Megatron tensor and pipeline parallelism with DeepSpeed ZeRO | All of the above | Python, PyTorch |
| JAX and Flax with pjit and shard_map | Google | Compiler-based SPMD via XLA; tightly coupled to TPU pods | Data, tensor, pipeline | Python, JAX/XLA |
| TensorFlow tf.distribute and DTensor | Google | Strategy-based API and SPMD partitioning | Data, tensor | Python, TensorFlow |
| Hugging Face Accelerate | Hugging Face | Wrapper around DDP, FSDP, DeepSpeed for portability | Data, fully sharded | Python, PyTorch |
| Colossal-AI | HPC-AI Tech | Auto-parallelism search, heterogeneous memory management | Data, tensor, pipeline, sequence, expert | Python, PyTorch |
| Horovod | LF AI (originally Uber) | Ring-AllReduce wrapper; framework-agnostic | Data | Python, multi-framework |
| Ray Train | Anyscale | Orchestration and elastic training; backend-agnostic | Wraps DDP, FSDP, DeepSpeed | Python, Ray |
| OpenDiLoCo | Prime Intellect | Open implementation of DiLoCo for low-bandwidth distributed training | Decentralized data parallel | Python, PyTorch |
PyTorch DistributedDataParallel is the default starting point for almost any distributed PyTorch training script. It is implemented as a thin wrapper around the model that hooks gradient computation, batches up the gradients into communication buckets, and triggers NCCL all-reduce operations as buckets become ready. The bucketing and overlap make it scale efficiently to many thousands of GPUs.
FSDP and its successor FSDP2 are the sharded variants. They split parameters, gradients, and optimizer state across the data-parallel group, all-gathering parameters just before each layer's forward and backward, and reduce-scattering gradients afterward. FSDP2 introduced a per-parameter DTensor representation that simplifies composition with tensor parallelism and torch.compile. The TorchTitan project provides a reference implementation that combines FSDP2 with tensor and pipeline parallelism for end-to-end LLM pretraining.
DeepSpeed, from Microsoft Research, is one of the most influential libraries in the field. Its core contribution is ZeRO, which is described in detail above. DeepSpeed also bundles pipeline parallelism with 1F1B scheduling, a deep set of mixed-precision recipes including FP8 on Hopper hardware, gradient checkpointing helpers, an inference engine, and integration with Megatron-LM via the Megatron-DeepSpeed combined repository. The ZeRO++ extensions added quantized communication and hierarchical partitioning to bring down communication volume in cross-node training.
Megatron-LM, from NVIDIA, focuses on efficient tensor and pipeline parallelism for transformer architectures. The original 2019 paper [3] established intra-layer parallelism as a viable technique; subsequent work added pipeline parallelism with interleaved schedules, sequence parallelism, expert parallelism for MoE, and very tight integration with NCCL, NVLink, NVSwitch, and InfiniBand. Megatron-LM is the foundation of NVIDIA's NeMo framework and is the de facto standard for training models on NVIDIA hardware at scale.
JAX expresses parallelism through the SPMD (single program, multiple data) model, with pjit and shard_map letting the developer annotate how arrays should be sharded across a logical mesh of devices. The XLA compiler then generates communication and computation kernels for the entire program. JAX is tightly coupled to Google's TPU stack and is the framework behind Gemini, PaLM, and many of Google's internal models, although it also runs on GPU clusters. The Pathways system, introduced for PaLM training, extends JAX with a global view of the cluster that lets a single program span multiple TPU pods [17].
A newer cluster of frameworks targets training that does not require a homogeneous, low-latency datacenter. Hivemind allows volunteer-style training over the open internet. DiLoCo, from Douillard and colleagues at Google DeepMind in 2023, performs many local SGD steps on each worker before a periodic synchronization through an outer Nesterov optimizer; it matches fully synchronous training on the C4 dataset with up to 500x less communication [18]. OpenDiLoCo, an open-source implementation by Prime Intellect, has been used for actual cross-continent runs. DeMo (Decoupled Momentum Optimization) from Peng and colleagues in 2024 takes a complementary approach, decomposing momentum into fast and slow components via a discrete cosine transform and synchronizing only the fast components, which cuts inter-accelerator communication by an order of magnitude or more [19].
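The shape of a DiLoCo-style round is easy to sketch even though the engineering around it is not. The schematic below uses plain AdamW inner steps and treats the averaged drift from the global weights as a pseudo-gradient for an outer optimizer; `workers`, `make_loss`, and `outer_opt` are hypothetical placeholders, and the published recipe differs in its details.

```python
import copy

import torch

def diloco_round(global_model, workers, inner_steps, outer_opt, make_loss):
    """One DiLoCo-style outer round: long local training, then one synchronization."""
    local_models = [copy.deepcopy(global_model) for _ in workers]

    # Inner phase: each worker takes many local optimizer steps with no
    # cross-worker communication at all.
    for model, data in zip(local_models, workers):
        inner_opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
        for _ in range(inner_steps):
            loss = make_loss(model, data)
            loss.backward()
            inner_opt.step()
            inner_opt.zero_grad()

    # Outer phase: the averaged drift from the global weights acts as a
    # pseudo-gradient for the outer optimizer (Nesterov SGD in the paper).
    for p_global, *p_locals in zip(global_model.parameters(),
                                   *[m.parameters() for m in local_models]):
        drift = torch.stack([p_global.data - p.data for p in p_locals]).mean(0)
        p_global.grad = drift
    outer_opt.step()
    outer_opt.zero_grad()
```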
Distributed training rarely uses parallelism alone. A handful of orthogonal techniques shave memory or trade compute for memory, and most large runs combine several at once.
Mixed precision training stores activations and many gradients in BF16, FP16, or FP8 instead of FP32. This roughly halves the memory used for those tensors and doubles arithmetic throughput on tensor cores. BF16 has become the default for transformer training because its FP32-equivalent dynamic range avoids most of the loss-scaling fragility that plagued FP16. FP8 training, supported on Hopper and Blackwell GPUs, pushes throughput further, though it requires careful per-tensor scaling. The mixed precision training article covers the numerics in depth.
Gradient checkpointing, also called activation recomputation, drops most intermediate activations during the forward pass and recomputes them during the backward pass. This trades roughly 33% extra compute for an order-of-magnitude reduction in activation memory. Selective recomputation, introduced in the Korthikanti paper alongside sequence parallelism, recomputes only the cheap-to-redo operations and stores the expensive ones, dropping the overhead from 33% to under 5% on a 530-billion-parameter run [13].
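In PyTorch, recomputation is typically applied per block with torch.utils.checkpoint; a minimal sketch over a stack of hypothetical transformer blocks:

```python
import torch
from torch.utils.checkpoint import checkpoint

class CheckpointedStack(torch.nn.Module):
    """Run each block under activation recomputation instead of storing its
    intermediate activations for the backward pass."""
    def __init__(self, blocks):
        super().__init__()
        self.blocks = torch.nn.ModuleList(blocks)

    def forward(self, x):
        for block in self.blocks:
            # Activations inside `block` are discarded after the forward pass
            # and recomputed during backward; only block boundaries are stored.
            x = checkpoint(block, x, use_reentrant=False)
        return x
```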
Gradient accumulation simulates a large batch using small micro-batches. The per-device batch is divided into micro-batches; gradients accumulate across them before the optimizer step. This makes very large effective batches feasible without needing more devices, at the cost of slower wall-clock per global step.
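A bare-bones accumulation loop, reusing hypothetical `model`, `loader`, and `optimizer` objects like those in the DDP sketch above:

```python
accum_steps = 8          # micro-batches per optimizer step
optimizer.zero_grad()
for step, (inputs, targets) in enumerate(loader):
    loss = torch.nn.functional.cross_entropy(model(inputs), targets)
    (loss / accum_steps).backward()      # scale so accumulated gradients average
    if (step + 1) % accum_steps == 0:
        optimizer.step()                 # one weight update per 8 micro-batches
        optimizer.zero_grad()
```

Under DDP, the non-final micro-batches are usually wrapped in the model.no_sync() context so the gradient all-reduce fires only once per accumulation window.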
Optimizer state offload to CPU or NVMe, as in DeepSpeed's ZeRO-Infinity, lets training reach trillions of parameters on modest GPU counts by keeping cold tensors out of VRAM. The cost is heavier I/O and slower step time, which is acceptable when the alternative is a model that does not fit at all.
There is a long-standing debate between synchronous and asynchronous distributed training. Synchronous training waits for every worker to finish its step before the next one starts, producing deterministic gradients and predictable convergence. Asynchronous training lets workers update a shared parameter store as they go, which avoids stragglers but introduces stale gradients.
The Hogwild! algorithm by Niu and colleagues in 2011 was an early demonstration that lock-free asynchronous SGD on shared memory can converge for sparse problems, with nearly linear speedup over locked variants [20]. Google's DistBelief framework used asynchronous parameter servers for early convolutional and recurrent network training. The original PipeDream design used asynchronous pipeline parallelism, with the weight stashing trick to control staleness [16].
In practice, frontier LLM training uses synchronous parallelism almost exclusively. Stale gradients perturb the optimizer in ways that matter at very large batch sizes and very deep models, and reproducibility is too important when a single run can cost tens of millions of dollars. Asynchronous methods survive in narrower contexts: federated learning across mobile devices, decentralized cross-datacenter runs (DiLoCo, DeMo), and certain reinforcement learning training loops where actor and learner workloads naturally decouple.
Three bottlenecks dominate distributed training at scale: communication, memory bandwidth, and load imbalance.
Communication. Gradient all-reduce in data parallelism, weight all-gather in FSDP, and all-to-all routing in expert parallelism each consume bandwidth proportional to model size or activation size. The MegaScale study (Jiang 2024) reported configurations that spent more than 90% of step time on communication, with computation almost entirely hidden behind it [21]. Mitigations include communication-computation overlap, gradient compression (PowerSGD, 1-bit Adam, ZeRO++ quantization), hierarchical all-reduce that exploits faster intra-node fabrics, and topology-aware placement.
Memory bandwidth. Tensor cores can consume operands faster than the HBM feeding them can deliver, so many kernels are bandwidth-bound rather than compute-bound. Kernel fusion (FlashAttention, Triton kernels), in-place operations, and careful operator scheduling reduce the bandwidth pressure.
Load imbalance. Pipeline parallelism produces bubbles. Expert parallelism produces hot experts that overload some devices. Even data parallelism can suffer if input lengths vary across the batch. Bin-packing, capacity factors, padding to a common length, and load-balancing losses for MoE all chip away at this overhead.
Fault tolerance. In a cluster of 16,000 GPUs, hardware failures are statistical certainties rather than exceptional events. The MegaScale paper from ByteDance documented the operational reality of running large training jobs: lost GPUs, overheating racks, network partitions, ECC errors, and kernel hangs all occur on a daily basis at this scale [21]. The standard mitigation is periodic checkpointing of the full training state including weights, optimizer state, gradient accumulator, and dataloader position. Recovering from a failure means restarting the surviving workers from the latest checkpoint.
At 405 billion parameters, a checkpoint can run to several terabytes, and naive synchronous checkpointing is impractical because it freezes training. Modern systems use asynchronous checkpointing that stages tensors to host memory and writes to storage in the background, incremental checkpointing that captures only the changed shards, and in-memory replicas distributed across racks for fast recovery without going to disk. Meta also reported elastic scheduling, where lost ranks are replaced from a hot spare pool without restarting the entire job. Schedulers such as Slurm, Kubernetes operators (NVIDIA's NIM Operator, Volcano, KubeFlow training operator), and Ray's autoscaler manage these workflows.
A frontier training run is one of the largest single capital expenditures in modern computing. Meta did not disclose Llama 3.1's electricity bill, but the publicly stated 39.3 million H100-80GB GPU-hours for the Llama 3.1 family, the large majority of it spent on the 405B model, at an industry rental price of around $2 to $4 per H100-hour, implies a compute cost in the $80 million to $160 million range, before counting datacenter, network, and engineering overhead [6][22]. DeepSeek-V3's published $5.6 million training cost is striking by comparison and reflects both clever engineering and a much smaller cluster, though the full R&D investment that produced V3's recipe is much higher than that single run [12].
Networking is a dominant share of cluster cost. InfiniBand NDR switches, NICs, optical transceivers, and cabling can rival the cost of the GPUs themselves at 16,000-GPU scale. Energy is another major line item: a 16,000-GPU H100 cluster running flat out draws around 12 megawatts continuously, before accounting for cooling and CPU overhead.
Diminishing returns at scale are an active research concern. Strong scaling, the ability to make an existing job finish faster by adding more devices, runs into Amdahl's law as the serial fraction (typically activation passes through the deepest pipeline stages and synchronization barriers) becomes the bottleneck. Weak scaling, the ability to handle a larger model or larger batch on more devices, is more forgiving but eventually hits the critical batch size beyond which more data parallelism does not improve loss per token. Practitioners now think carefully about where their cluster sits on this curve before committing to a particular shape.
While this article focuses on training, the same parallelism vocabulary applies to inference at scale, with different priorities. Tensor parallelism and pipeline parallelism are commonly used in inference engines such as vLLM, TGI, and TensorRT-LLM to fit large models like Llama 3 70B and DeepSeek-V3 across multiple GPUs. Expert parallelism is essential for serving large MoE models. Sequence parallelism and Ring Attention enable serving million-token contexts. The communication patterns are typically lighter than in training because there are no gradients, but latency requirements are tighter and batch shapes are more varied. The inference article covers serving systems in detail.
A few threads in 2024 to 2026 are reshaping distributed training.
Decentralized and low-bandwidth training. DiLoCo, OpenDiLoCo, and DeMo demonstrate that frontier-quality models can be trained across geographically distributed clusters connected by ordinary internet links rather than dedicated InfiniBand fabrics [18][19]. Prime Intellect's INTELLECT-1 (10B) and INTELLECT-2 runs in 2024 to 2025 trained competitive models on heterogeneous, geographically distributed compute, hinting at a future where massive single-site clusters are not the only option. Decoupled DiLoCo from Google DeepMind in 2025 added resilience and scaling improvements that pushed the technique to many-tens-of-billions-parameter models.
Zero-bubble pipelines. Qi 2024 and follow-up work eliminated the pipeline bubble entirely by splitting backward into weight-gradient and input-gradient phases that can be reordered within a 1F1B schedule. This brings pipeline parallelism efficiency to within a few percent of pure tensor parallelism for matched configurations.
FP8 training as default. Hopper and Blackwell GPUs support FP8 matrix multiplication. DeepSeek-V3 and several frontier runs in 2024 to 2025 trained primarily in FP8 with carefully designed scaling, doubling effective compute throughput compared to BF16 [12].
Larger and longer-lived clusters. xAI's Memphis Colossus, built in 2024 and expanded through 2025, scaled past 200,000 H100 and H200 GPUs in a single facility, the largest known single-site AI training cluster as of 2026. Microsoft, Google, Meta, and Amazon each operate clusters of comparable order of magnitude in aggregate. The engineering investment in network design, cooling, and power delivery for these facilities is unprecedented for any prior class of computing.
Mixture of training data and curriculum. As clusters have grown, the bottleneck for many teams has shifted from compute to data quality and curriculum. The systems engineering of moving petabytes of training tokens through a pipeline of filters, dedupers, and tokenizers has become a distributed-training subspecialty in its own right.