Distributed training is the practice of training a single machine learning model using many compute devices in parallel, splitting the data, the model, or both across GPUs, TPUs, or other accelerators that coordinate over a network. As deep learning models have grown from millions to hundreds of billions and even trillions of parameters, single-device training has become infeasible in terms of both memory and wall-clock time. Modern frontier LLM runs use clusters of thousands to tens of thousands of accelerators training continuously for weeks or months. The field has evolved from simple data-parallel setups into multi-dimensional parallelism strategies that combine data, tensor, pipeline, expert, and context parallelism in a single training job. Distributed training is the systems counterpart of partitioning strategy: partitioning describes how a model is split, while distributed training is the broader practice of executing that split across real hardware.
The push to distributed training comes from a few converging pressures. Models have grown faster than per-device memory. GPT-2 (2019) had 1.5 billion parameters and could be trained on a small cluster. GPT-3 (2020) scaled to 175 billion parameters and required thousands of GPUs. By the mid-2020s, models such as GPT-4, Gemini Ultra, Llama 3.1 405B, DeepSeek-V3, and Grok 4 were trained on clusters of 16,000 or more accelerators [1][6][12].
A single NVIDIA H100 has 80 GB of high-bandwidth memory. A 70-billion-parameter model in FP32 requires roughly 280 GB just for weights, and Adam optimizer state pushes the resident footprint to about 1.1 TB before activations are even allocated. The model cannot live on one device, no matter how careful the implementation. Even when the model fits, training time matters: a 405-billion-parameter run on 16,000 H100s for 54 days at 400 TFLOP/s per GPU represents about 3.8e25 floating-point operations [6]. Performing that computation on a single device would take centuries. Distributed training is not a convenience; it is the only way to reach the compute scale that frontier models require.
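The arithmetic is easy to reproduce. The sketch below follows the FP32 accounting above: weights, gradients, and the two Adam moments at 4 bytes each, for 16 bytes of persistent state per parameter, ignoring activations and framework buffers.

```python
# Back-of-the-envelope memory for FP32 Adam training of a 70B-parameter model
# (weights + gradients + two Adam moments, ignoring activations and buffers).
PARAMS = 70e9
BYTES_PER_VALUE = 4                       # FP32
STATE_TENSORS = 4                         # weights, gradients, Adam m, Adam v

total_bytes = PARAMS * BYTES_PER_VALUE * STATE_TENSORS
print(f"{total_bytes / 1e12:.2f} TB of persistent state")           # ~1.12 TB
print(f"~{total_bytes / 80e9:.0f} H100s of HBM just to hold it")    # ~14 GPUs
```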
Large-batch training is another motivator. Many optimization strategies favor very large effective batches because they smooth gradient noise and increase throughput. A per-device batch of 4 sequences across 4,096 devices yields a global batch of 16,384, which would not fit on any single accelerator. Data parallelism makes that batch size routine.
Distributed training pays for these benefits in complexity. Inter-device communication adds latency, bookkeeping for fault tolerance is nontrivial, and a single misconfigured shard can leave thousands of expensive accelerators idle. The research and engineering effort spent on distributed training reflects how much these costs matter at scale.
Distributed training strategies are usually classified by what gets split. The same model can be partitioned along several axes at once. The table below summarizes the main families. Each is discussed in more detail in the sections that follow, and the partitioning strategy article goes deeper on the underlying decomposition.
| Strategy | What is split | Communication pattern | Typical scope | Originating work |
|---|---|---|---|---|
| Data parallelism | Input batch across replicas | All-reduce on gradients | All scales, baseline approach | DistBelief (Dean 2012); Horovod (Sergeev 2018) |
| Tensor parallelism | Weights inside a layer | All-reduce per layer | Within a node, NVLink fabric | Megatron-LM (Shoeybi 2019) [3] |
| Pipeline parallelism | Layers across stages | Point-to-point send and receive | Across nodes, micro-batches | GPipe (Huang 2018) [4]; PipeDream (Harlap 2018) [16] |
| Sequence parallelism | Sequence dimension within a layer | Reduce-scatter, all-gather | Combined with tensor parallelism | Korthikanti 2022 [13] |
| Context parallelism (Ring Attention) | Sequence dimension across attention | Ring of key-value blocks | Long-context training | Liu 2023 [14] |
| Expert parallelism | MoE experts across devices | All-to-all token routing | MoE models | GShard (Lepikhin 2020); Switch Transformer (Fedus 2022) [5] |
| ZeRO (sharded data parallelism) | Optimizer state, gradients, weights | All-gather, reduce-scatter | Memory-bound runs | ZeRO (Rajbhandari 2020) [7] |
| Fully Sharded Data Parallel (FSDP) | Weights and optimizer state | All-gather per layer | PyTorch ecosystem | Zhao 2023 [2] |
Data parallelism is the simplest and most widely used form of distributed training. Each device holds a complete copy of the model. The training data batch is split across devices, with each device processing a different subset of examples. After computing gradients locally, devices synchronize them, typically by averaging through an all-reduce, and each replica updates its weights identically. Determinism falls out naturally: every replica sees the same averaged gradient, so every replica computes the same new weights.
The canonical PyTorch implementation is DistributedDataParallel (DDP), which uses NVIDIA's NCCL library to perform gradient all-reduce. DDP overlaps gradient communication with backward computation: as soon as gradients for a layer are produced, they are sent to other ranks while gradients for earlier layers are still being calculated. This overlap is what makes DDP scale efficiently up to several thousand GPUs for models that fit per-device.
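A minimal DDP script looks roughly like the sketch below; `MyModel` and `loader` are hypothetical placeholders, and the script assumes it is launched with `torchrun` so that rank information arrives through environment variables.

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Minimal data-parallel training sketch. Assumes launch via `torchrun`, which
# sets RANK, LOCAL_RANK, and WORLD_SIZE; MyModel and loader are placeholders.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = DDP(MyModel().cuda(), device_ids=[local_rank])
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

for inputs, targets in loader:           # loader should use a DistributedSampler
    inputs = inputs.cuda(non_blocking=True)
    targets = targets.cuda(non_blocking=True)
    loss = torch.nn.functional.cross_entropy(model(inputs), targets)
    loss.backward()                      # NCCL all-reduce overlaps with backward
    optimizer.step()
    optimizer.zero_grad()
```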
Data parallelism scales well when the full model fits in one device's memory, which today means roughly 1 to 7 billion parameters in mixed precision on an H100. Beyond that, weights and optimizer state spill, and DDP must give way to sharded variants like FSDP or ZeRO-3.
The effective batch size in data parallelism equals the per-device batch multiplied by the number of replicas. Very large batches can hurt convergence. The linear scaling rule and warmup schedule from Goyal and colleagues (2017) are the standard recipe for large-batch SGD: when the batch grows by a factor of k, scale the learning rate by k and ramp it up gradually over the first few epochs [1]. Subsequent optimizers such as LARS and LAMB extended these ideas with per-layer learning rate adaptation, and frontier runs such as Llama 3 now use effective batches of millions of tokens.
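As a concrete illustration of the recipe, the helper below scales a baseline learning rate by the batch growth factor and ramps it up linearly over a warmup window; the numbers are arbitrary examples, not tuned values.

```python
def scaled_lr(step, base_lr=1e-4, base_batch=256, global_batch=16384,
              warmup_steps=2000):
    """Linear scaling rule with linear warmup (in the style of Goyal et al. 2017)."""
    k = global_batch / base_batch          # batch growth factor
    target_lr = base_lr * k                # scale the learning rate by the same factor
    if step < warmup_steps:                # ramp up gradually to avoid early divergence
        return base_lr + (target_lr - base_lr) * step / warmup_steps
    return target_lr
```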
Tensor parallelism, also called intra-layer model parallelism, splits the weight matrices inside a single layer across devices. The Megatron-LM paper from NVIDIA showed that the linear layers in a transformer can be partitioned with very modest communication: split the first feed-forward matrix column-wise so each device holds a slice of the output features, then split the second matrix row-wise so the partial sums can be combined with one all-reduce at the end of the block [3]. Self-attention is split along the head dimension, which is naturally independent: each device computes a subset of heads and the results are concatenated.
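The arithmetic behind the split can be checked on a single device. The sketch below simulates two tensor-parallel ranks, giving each a column slice of the first matrix and the matching row slice of the second, and confirms that summing the partial outputs, which is exactly what the all-reduce does, reproduces the unsplit result (toy sizes, illustrative only).

```python
import torch

torch.manual_seed(0)
d_model, d_ff, tp = 8, 32, 2             # toy sizes, 2-way tensor parallelism
x = torch.randn(4, d_model)              # a small batch of activations
W1 = torch.randn(d_model, d_ff)          # first FFN matrix, split column-wise
W2 = torch.randn(d_ff, d_model)          # second FFN matrix, split row-wise

reference = torch.relu(x @ W1) @ W2      # unsplit computation

partials = []
for rank in range(tp):
    cols = slice(rank * d_ff // tp, (rank + 1) * d_ff // tp)
    h = torch.relu(x @ W1[:, cols])      # each rank holds a column slice of W1
    partials.append(h @ W2[cols, :])     # and the matching row slice of W2

recombined = sum(partials)               # the value the all-reduce would produce
print(torch.allclose(reference, recombined, atol=1e-5))   # True
```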
The original Megatron paper trained an 8.3-billion-parameter transformer on 512 V100 GPUs at 76% scaling efficiency relative to a single-GPU baseline, a result that put intra-layer parallelism on the map [3]. The technique requires very low-latency communication because each transformer block performs at least one all-reduce across the tensor-parallel group. In practice, tensor parallelism is confined to a single server node, where eight or sixteen GPUs share an NVLink and NVSwitch fabric. Stretching it across the slower InfiniBand fabric between nodes makes the all-reduce dominate runtime.
Pipeline parallelism divides the model along its depth. Consecutive groups of layers, called stages, are placed on different devices. During the forward pass, each stage processes its layers and forwards activations to the next stage. The backward pass flows in the opposite direction.
The naive version suffers from a pipeline bubble: a stage sits idle while waiting for its predecessor or successor. GPipe, introduced by Huang and colleagues at Google in 2018, mitigates the bubble by splitting each batch into micro-batches and pumping them through the pipeline in sequence [4]. The bubble shrinks as the number of micro-batches grows, and gradient updates remain synchronous because each micro-batch contributes to the same accumulated gradient. GPipe was used to train a six-billion-parameter, 128-layer transformer for multilingual translation, an early demonstration of just how much pipeline parallelism could unlock.
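The bubble has a simple closed form: with p stages and m micro-batches, the idle fraction is (p - 1) / (m + p - 1). A small helper makes the effect of micro-batching concrete.

```python
def bubble_fraction(stages: int, micro_batches: int) -> float:
    """Idle fraction of a synchronous GPipe-style pipeline."""
    return (stages - 1) / (micro_batches + stages - 1)

print(bubble_fraction(stages=16, micro_batches=1))    # 0.94 -- naive, mostly idle
print(bubble_fraction(stages=16, micro_batches=64))   # ~0.19
print(bubble_fraction(stages=16, micro_batches=256))  # ~0.055
```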
PipeDream, from Harlap and colleagues at Microsoft in 2018, took a different angle. It introduced the 1F1B (one-forward, one-backward) schedule, in which each worker alternates strictly between a forward pass on one micro-batch and a backward pass on another [16]. 1F1B keeps the in-flight working set bounded and removes most of the bubble in steady state. The original PipeDream did this asynchronously and used weight stashing to avoid stale-gradient problems, which complicates correctness. Modern implementations in DeepSpeed and Megatron-LM use synchronous 1F1B with carefully placed flushes at the end of each global batch, retaining most of the throughput benefit without the asynchrony.
More recent work pushes the bubble closer to zero. Interleaved 1F1B in Megatron-LM splits each stage into multiple chunks; zero-bubble pipelines (Qi 2024) split the backward pass into a weight-gradient and input-gradient phase that can be reordered to fully eliminate the bubble for sufficiently many micro-batches.
Pipeline parallelism is well-suited to the slower inter-node InfiniBand fabric because adjacent stages exchange only activations (and activation gradients), which are far smaller than the full-model gradient traffic of data parallelism. It also tolerates network jitter because stages can absorb small delays through micro-batch buffering.
Sequence parallelism, introduced by Korthikanti and colleagues at NVIDIA in 2022, is a refinement of tensor parallelism that handles the parts of the transformer that tensor parallelism leaves unsharded, namely layer normalization, dropout, and the residual stream [13]. The idea is to split these operations along the sequence dimension so that activation memory shrinks proportionally to the tensor-parallel group size. Combined with selective activation recomputation, sequence parallelism cut activation memory by 5x and reduced recomputation overhead by more than 90% on a 530-billion-parameter run on 2,240 A100 GPUs, raising model FLOPs utilization from 42.1% to 54.2% [13].
Context parallelism is a more aggressive variant for very long contexts. Ring Attention, from Liu and colleagues in 2023, distributes the attention computation itself across devices arranged in a ring [14]. Each rank holds a slice of the queries, keys, and values for a long sequence; the keys and values rotate around the ring, with each device combining incoming blocks against its local queries. Communication of key-value blocks is overlapped with the blockwise attention computation, so context length scales linearly with device count without changing the asymptotic memory footprint per device. Ring Attention is what makes million-token context windows tractable. Striped Attention (Brandon 2023) extended the idea to handle the load imbalance caused by causal masking. Meta used context parallelism as one of four parallel dimensions when training Llama 3.1 with 128k-token contexts [6].
Mixture-of-experts (MoE) models contain many expert sub-networks inside specific layers. A gating mechanism routes each token to a small subset of experts, often two out of dozens or hundreds. Expert parallelism places different experts on different devices. The compute pattern is unique: tokens must be sent across the network to whichever device hosts the expert they were routed to, processed there, and then sent back. This is an all-to-all communication, which is sensitive to bandwidth and to load balance because hot experts can pile up.
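A toy top-2 router shows where the communication pattern comes from: the per-expert token counts produced by the gate are precisely what each device must send and receive in the all-to-all, and any imbalance in those counts is the hot-expert problem (sizes below are arbitrary).

```python
import torch

torch.manual_seed(0)
tokens, d_model, n_experts, top_k = 16, 8, 4, 2
x = torch.randn(tokens, d_model)
gate = torch.nn.Linear(d_model, n_experts, bias=False)

scores = torch.softmax(gate(x), dim=-1)              # routing probabilities
weights, expert_ids = scores.topk(top_k, dim=-1)     # top-2 experts per token
# `weights` would scale each expert's output when the results are combined.

# Per-expert token counts: with expert parallelism these determine how many
# tokens each device receives in the all-to-all exchange.
counts = torch.bincount(expert_ids.flatten(), minlength=n_experts)
print(counts)    # uneven counts are exactly the "hot expert" load-balance problem
```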
GShard from Google (Lepikhin 2020) and the Switch Transformer (Fedus 2022) established expert parallelism as the standard for trillion-parameter sparse models [5]. DeepSeek-V3, with 671 billion total parameters and 37 billion active per token, used 64-way expert parallelism spanning eight nodes during training [12].
The Zero Redundancy Optimizer (ZeRO), from Rajbhandari and colleagues at Microsoft, identifies a wasteful aspect of vanilla data parallelism: every replica stores the same optimizer state, the same gradients, and the same parameters, even though only its slice of the batch is being processed at any moment [7]. ZeRO partitions these tensors across the data-parallel group, materializing them only when needed.
| ZeRO stage | What is partitioned | Memory reduction vs DDP | Communication overhead |
|---|---|---|---|
| Stage 0 | Nothing (vanilla data parallelism) | 1x | Baseline all-reduce |
| Stage 1 | Optimizer state | ~4x | Minimal increase |
| Stage 2 | Optimizer state and gradients | ~8x | Moderate increase |
| Stage 3 | Optimizer state, gradients, and weights | Linear with N devices, up to ~64x | Significant: extra all-gather per layer |
| ZeRO-Infinity | Stage 3 plus offload to CPU and NVMe | Limited only by host storage | Highest |
ZeRO Stage 1 partitions optimizer state, which for Adam in mixed precision is 12 bytes per parameter. Stage 2 also partitions gradients. Stage 3 also partitions weights, which means the device must all-gather the layer's parameters just before the forward pass, and again before the backward pass, then drop them. ZeRO-Infinity layers in CPU and NVMe offload, allowing trillion-parameter models on modest GPU counts at the cost of much higher communication and slower step time. Microsoft's original analysis showed that ZeRO Stage 3 could fit a one-trillion-parameter model on 1,024 NVIDIA GPUs [7].
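The per-rank savings follow directly from this accounting. The sketch below reproduces the approximate arithmetic for a hypothetical 70-billion-parameter model with mixed-precision Adam across 64 data-parallel ranks.

```python
def zero_bytes_per_rank(params, dp, stage):
    """Approximate per-rank memory (bytes) for mixed-precision Adam under ZeRO."""
    weights, grads, optim = 2 * params, 2 * params, 12 * params
    if stage == 0:                                    # vanilla data parallelism
        return weights + grads + optim
    if stage == 1:                                    # shard optimizer state
        return weights + grads + optim / dp
    if stage == 2:                                    # shard gradients as well
        return weights + grads / dp + optim / dp
    return (weights + grads + optim) / dp             # stage 3: shard everything

for stage in range(4):
    gb = zero_bytes_per_rank(70e9, dp=64, stage=stage) / 1e9
    print(f"ZeRO stage {stage}: ~{gb:,.0f} GB per GPU")
# ~1,120 GB, ~293 GB, ~155 GB, and ~18 GB for stages 0 through 3
```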
PyTorch's Fully Sharded Data Parallel (FSDP) is conceptually similar to ZeRO Stage 3 but is implemented as a first-class member of the PyTorch ecosystem. The Zhao paper describes FSDP as an industry-grade solution for large-model training and reports near-linear TFLOPS scaling [2]. FSDP wraps a model in shards, all-gathers each shard just in time for forward and backward, and reduce-scatters gradients so each rank ends up holding only its own shard.
FSDP2, introduced during the PyTorch 2.x series as the successor to the original implementation, redesigned it around the per-parameter DTensor abstraction, fixed several composability issues with torch.compile, and added cleaner support for mixed precision and CPU offload. Empirical studies show FSDP can reduce GPU memory usage by 60% or more compared to DDP, although it can slow per-step time by 2x to 6x depending on cluster bandwidth and model shape [2]. For the largest models, the trade is usually worth it because the model would not fit at all under plain DDP.
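Wrapping a model in FSDP looks roughly like the following sketch; `MyModel` and `MyBlock` are hypothetical placeholders, the process group is assumed to be initialized as in the DDP sketch above, and details such as the wrap policy and mixed-precision settings vary by PyTorch version.

```python
import functools

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

# Sketch of sharded data parallelism with the classic FSDP API; MyBlock and
# MyModel are hypothetical stand-ins for a transformer block and model.
wrap_policy = functools.partial(transformer_auto_wrap_policy,
                                transformer_layer_cls={MyBlock})
model = FSDP(
    MyModel().cuda(),
    auto_wrap_policy=wrap_policy,        # shard at transformer-block granularity
    mixed_precision=MixedPrecision(param_dtype=torch.bfloat16,
                                   reduce_dtype=torch.bfloat16),
)
# The training loop looks the same as with DDP; FSDP all-gathers each block's
# parameters just in time and reduce-scatters its gradients afterward.
```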
Real training runs at frontier scale combine several of these strategies. The convention is to count the number of orthogonal axes: 3D parallelism combines data, tensor, and pipeline parallelism, while 4D runs such as Llama 3.1 405B add a context-parallel dimension, and MoE models add an expert dimension on top [6].
The number of devices assigned to each dimension must multiply to the total cluster size, and the assignment is far from automatic. Auto-parallelism systems such as Alpa, FlexFlow, and Galvatron search this space, but most production runs are still tuned by hand using profiled microbenchmarks. The partitioning strategy article covers the optimization considerations in detail.
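One useful sanity check on any proposed layout is that the parallel degrees exactly tile the cluster; the sketch below checks an illustrative 4D layout for a 16,384-GPU cluster (the specific degrees are examples, not recommendations).

```python
import math

def check_layout(cluster_size, **degrees):
    """Verify that the parallelism degrees exactly tile the cluster."""
    product = math.prod(degrees.values())
    assert product == cluster_size, (
        f"layout {degrees} covers {product} devices, cluster has {cluster_size}")
    return degrees

# Hypothetical 4D layout: tensor parallel within a node, pipeline and context
# parallelism across nodes, with the remainder spent on sharded data parallelism.
layout = check_layout(16_384, tp=8, pp=16, cp=2, dp=64)
print(layout)    # {'tp': 8, 'pp': 16, 'cp': 2, 'dp': 64}
```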
Distributed training is built on a small set of collective operations, all of them inherited from the high-performance computing tradition.
| Primitive | What it does | Use in training |
|---|---|---|
| All-reduce | Combines tensors from every rank and returns the result to every rank | Gradient averaging in data parallelism |
| All-gather | Collects each rank's slice into a full tensor on every rank | Materializing weights in ZeRO-3 and FSDP |
| Reduce-scatter | Reduces and partitions the result so each rank gets a slice | Sharded gradient updates in FSDP |
| Broadcast | Sends one rank's tensor to every rank | Initialization, parameter syncs |
| All-to-all | Each rank sends a different slice to every other rank | Expert parallelism token routing |
| Send and recv | Point-to-point transfer | Activation hand-off in pipeline parallelism |
NVIDIA's NCCL is the dominant implementation on GPU clusters, with Gloo and MPI as alternatives. NCCL automatically detects the underlying topology, including PCIe, NVLink, NVSwitch, InfiniBand, and RoCE links, and selects an algorithm such as ring, tree, or PAT (Parallel Aggregated Trees) for each operation [11]. Recent NCCL releases added cross-data-center awareness, in-network reduction via NVIDIA SHARP, and improved fault diagnosis for very large jobs.
The ring all-reduce algorithm, popularized for deep learning by Baidu's research team in 2017, is the workhorse for gradient synchronization. Devices form a logical ring; in the scatter-reduce phase each device sends a chunk to its neighbor, which adds it to its own; in the all-gather phase, each device circulates its completed chunk back around. The per-device data transferred is 2(N-1)/N times the gradient size, which approaches 2x for large N; crucially, it does not grow with the number of devices. This bandwidth invariance is what made Horovod's ring-based approach popular before NCCL incorporated similar algorithms natively [10].
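The two phases can be simulated in a few lines of Python, with each "device" represented by a list of chunks; this is a pedagogical sketch of the algorithm, not a performance implementation.

```python
import numpy as np

def ring_all_reduce(tensors):
    """Simulate ring all-reduce; each entry in `tensors` is one device's data."""
    n = len(tensors)
    chunks = [np.array_split(t.astype(float), n) for t in tensors]

    # Phase 1: scatter-reduce. After n-1 steps, device i holds the fully
    # reduced copy of chunk (i + 1) % n.
    for step in range(n - 1):
        for dev in range(n):
            src = (dev - step) % n                    # chunk `dev` sends this step
            chunks[(dev + 1) % n][src] += chunks[dev][src]

    # Phase 2: all-gather. Completed chunks circulate once more around the ring.
    for step in range(n - 1):
        for dev in range(n):
            src = (dev + 1 - step) % n                # completed chunk `dev` passes on
            chunks[(dev + 1) % n][src] = chunks[dev][src].copy()

    return [np.concatenate(c) for c in chunks]

data = [np.arange(8) * (rank + 1) for rank in range(4)]   # 4 devices, 8 values each
print(ring_all_reduce(data)[0])    # every device ends with the elementwise sum
```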
Distributed training depends as much on the network as on the accelerators. Network topology and link bandwidth determine which parallelism strategies are even viable. The table below summarizes the main interconnect tiers in modern AI clusters.
| Interconnect | Bandwidth | Latency | Scope | Used by |
|---|---|---|---|---|
| NVLink 3.0 (A100) | 600 GB/s bidirectional | sub-microsecond | Intra-node GPU-to-GPU | NVIDIA DGX A100 |
| NVLink 4.0 (H100) | 900 GB/s bidirectional | sub-microsecond | Intra-node GPU-to-GPU | NVIDIA DGX H100 |
| NVLink 5.0 (B200) | 1,800 GB/s bidirectional | sub-microsecond | Intra-node GPU-to-GPU | NVIDIA DGX B200 |
| NVSwitch | Full bisection bandwidth | very low | Intra-node all-to-all | DGX H100, DGX B200, GB200 NVL72 |
| InfiniBand HDR | 200 Gb/s per port | ~1 microsecond | Inter-node | Earlier HPC and AI clusters |
| InfiniBand NDR | 400 Gb/s per port | ~1 microsecond | Inter-node | Llama 3 cluster, modern AI fabrics |
| RoCE (RDMA over Converged Ethernet) | 100 to 400 Gb/s | 2 to 5 microseconds | Inter-node | Hyperscaler-built clusters |
| TPU ICI (v4) | 4,800 Gbps per chip | very low | Intra-pod 3D torus | TPU v4 pods, 4,096 chips |
| TPU ICI (v5p) | 4,800 Gbps per chip | very low | 16x20x28 3D torus | TPU v5p pods, 8,960 chips |
NVLink and NVSwitch provide the bandwidth needed for tensor parallelism inside a node. The H100 SXM module pushes 900 GB/s of bidirectional GPU-to-GPU NVLink, and the GB200 NVL72 platform extends this fabric to 72 GPUs in a single rack. InfiniBand or RoCE provides the slower but still high-bandwidth inter-node fabric used for data parallelism, pipeline parallelism, and expert parallelism. Meta's Llama 3 cluster used 400 Gb/s InfiniBand NDR between nodes and reported that the network topology was a key factor in achieving 400 TFLOP/s per GPU [6].
Google's TPU pods take a different architectural approach. TPU v4 chips are arranged in a 3D torus that scales to 4,096 chips per pod; TPU v5p extends the torus to 8,960 chips in a 16x20x28 mesh, with optical circuit switches reconfiguring the topology at job-launch time [15]. The torus gives each chip high-bandwidth direct links to six neighbors, which suits the SPMD parallelism model that JAX uses. PaLM 540B was trained on 6,144 chips across two TPU v4 pods connected over the datacenter network, using a combination of model and data parallelism and achieving 46.2% model FLOPs utilization; it was the largest TPU configuration used for training at the time [17].
Network topology at scale is usually a fat tree (oversubscribed or full bisection) for InfiniBand-based AI clusters, with dragonfly and dragonfly+ topologies seen in some hyperscaler designs to reduce switch count at very large port counts. Cluster operators care obsessively about congestion control, link error rates, and the placement of training jobs relative to topology, because a single slow link can drag the whole job's throughput down to that link's bandwidth.
The table below covers a representative set of large training runs, with hardware and parallelism details where they have been published. Compute estimates are based on disclosed FLOPs counts where available.
| Model | Year | Parameters | Hardware | Cluster size | Parallelism | Notes |
|---|---|---|---|---|---|---|
| GPT-3 | 2020 | 175B dense | NVIDIA V100 | ~10,000 GPUs (Microsoft Azure) | Tensor + pipeline + data | OpenAI used Microsoft's purpose-built supercomputer |
| Megatron-Turing NLG | 2021 | 530B dense | NVIDIA A100 | 2,240 GPUs | Tensor + pipeline + data | Microsoft and NVIDIA partnership |
| PaLM | 2022 | 540B dense | TPU v4 | 6,144 chips across 2 pods | Pathways system, model + data | 46.2% MFU [17] |
| GPT-4 | 2023 | not disclosed | NVIDIA A100 | rumored ~25,000 GPUs over months | undisclosed | OpenAI did not publish training details |
| Llama 2 70B | 2023 | 70B dense | NVIDIA A100 | 2,000 GPUs | FSDP + tensor parallel | Released openly by Meta |
| Llama 3 405B | 2024 | 405B dense | NVIDIA H100 | 16,000 GPUs | 4D: TP=8, PP=16, CP, FSDP | 3.8e25 FLOPs, ~54 days [6] |
| DeepSeek-V3 | 2024 | 671B (37B active) MoE | NVIDIA H800 | 2,048 GPUs | PP=16, EP=64, ZeRO-1 DP | 2.788M GPU-hours, ~$5.6M [12] |
| Grok 4 | 2025 | not disclosed | NVIDIA H100/H200 | xAI Memphis Colossus, ~100,000 GPUs initially, scaled to ~200,000 | undisclosed | Single-site cluster built with Dell and Supermicro hardware |
| Gemini Ultra | 2023 | not disclosed | TPU v4/v5 | many TPU pods | JAX/Pathways | Google did not disclose chip count |
| Claude Opus / Sonnet | 2024 to 2026 | not disclosed | TPU and AWS Trainium | undisclosed | undisclosed | Anthropic split capacity across providers |
DeepSeek-V3 is a particularly informative case because the team published unusually detailed numbers. Pre-training consumed 2,664,000 H800 GPU-hours, or roughly 3.7 days per trillion training tokens on the 2,048-GPU cluster; with context extension and post-training included, the full run came to 2.788 million GPU-hours, estimated at $5.576M assuming a $2 per GPU-hour rental price [12]. The choice to skip tensor parallelism entirely, in favor of expert parallelism, pipeline parallelism, and ZeRO-1 sharded data parallelism, was driven by the H800's reduced NVLink bandwidth compared to the H100, which makes the per-layer all-reduces of tensor parallelism expensive. The publication touched off an industry-wide reassessment of how much compute is actually required for frontier-quality models.
Several frameworks dominate distributed training as of 2026. The table below summarizes the major ones; longer descriptions follow.
| Framework | Developer | Key features | Parallelism types | Language stack |
|---|---|---|---|---|
| PyTorch DDP and FSDP | Meta | Native PyTorch integration, FSDP2 with DTensor, tight torch.compile support | Data, fully sharded data parallelism | Python, PyTorch |
| TorchTitan | PyTorch team | Reference implementation of FSDP2 + tensor and pipeline parallelism for LLMs | Data, tensor, pipeline, sequence | Python, PyTorch |
| DeepSpeed | Microsoft | ZeRO 0-3 and ZeRO++; pipeline parallelism; offload to CPU and NVMe; inference engine | Data, pipeline, tensor, expert | Python, PyTorch |
| Megatron-LM | NVIDIA | Efficient tensor parallelism, interleaved 1F1B pipelines, sequence parallelism, expert parallelism | Tensor, pipeline, data, expert, context | Python, PyTorch |
| Megatron-DeepSpeed | Microsoft + NVIDIA | Fusion of Megatron tensor and pipeline parallelism with DeepSpeed ZeRO | All of the above | Python, PyTorch |
| JAX and Flax with pjit and shard_map | Google | Compiler-based SPMD via XLA; tightly coupled to TPU pods | Data, tensor, pipeline | Python, JAX/XLA |
| TensorFlow tf.distribute and DTensor | Google | Strategy-based API and SPMD partitioning | Data, tensor | Python, TensorFlow |
| Hugging Face Accelerate | Hugging Face | Wrapper around DDP, FSDP, DeepSpeed for portability | Data, fully sharded | Python, PyTorch |
| Colossal-AI | HPC-AI Tech | Auto-parallelism search, heterogeneous memory management | Data, tensor, pipeline, sequence, expert | Python, PyTorch |
| Horovod | LF AI (originally Uber) | Ring-AllReduce wrapper; framework-agnostic | Data | Python, multi-framework |
| Ray Train | Anyscale | Orchestration and elastic training; backend-agnostic | Wraps DDP, FSDP, DeepSpeed | Python, Ray |
| OpenDiLoCo | Prime Intellect | Open implementation of DiLoCo for low-bandwidth distributed training | Decentralized data parallel | Python, PyTorch |
PyTorch DistributedDataParallel is the default starting point for almost any distributed PyTorch training script. It is implemented as a thin wrapper around the model that hooks gradient computation, batches up the gradients into communication buckets, and triggers NCCL all-reduce operations as buckets become ready. The bucketing and overlap make it scale efficiently to many thousands of GPUs.
FSDP and its successor FSDP2 are the sharded variants. They split parameters, gradients, and optimizer state across the data-parallel group, all-gathering parameters just before each layer's forward and backward, and reduce-scattering gradients afterward. FSDP2 introduced a per-parameter DTensor representation that simplifies composition with tensor parallelism and torch.compile. The TorchTitan project provides a reference implementation that combines FSDP2 with tensor and pipeline parallelism for end-to-end LLM pretraining.
DeepSpeed, from Microsoft Research, is one of the most influential libraries in the field. Its core contribution is ZeRO, which is described in detail above. DeepSpeed also bundles pipeline parallelism with 1F1B scheduling, a deep set of mixed-precision recipes including FP8 on Hopper hardware, gradient checkpointing helpers, an inference engine, and integration with Megatron-LM via the Megatron-DeepSpeed combined repository. The ZeRO++ extensions added quantized communication and hierarchical partitioning to bring down communication volume in cross-node training.
Megatron-LM, from NVIDIA, focuses on efficient tensor and pipeline parallelism for transformer architectures. The original 2019 paper [3] established intra-layer parallelism as a viable technique; subsequent work added pipeline parallelism with interleaved schedules, sequence parallelism, expert parallelism for MoE, and very tight integration with NCCL, NVLink, NVSwitch, and InfiniBand. Megatron-LM is the foundation of NVIDIA's NeMo framework and is the de facto standard for training models on NVIDIA hardware at scale.
JAX expresses parallelism through the SPMD (single program, multiple data) model, with pjit and shard_map letting the developer annotate how arrays should be sharded across a logical mesh of devices. The XLA compiler then generates communication and computation kernels for the entire program. JAX is tightly coupled to Google's TPU stack and is the framework behind Gemini, PaLM, and many of Google's internal models, although it also runs on GPU clusters. The Pathways system, introduced for PaLM training, extends JAX with a global view of the cluster that lets a single program span multiple TPU pods [17].
A newer cluster of frameworks targets training that does not require a homogeneous, low-latency datacenter. Hivemind allows volunteer-style training over the open internet. DiLoCo, from Douillard and colleagues at Google DeepMind in 2023, performs many local SGD steps on each worker before a periodic synchronization through an outer Nesterov optimizer; it matches fully synchronous training on the C4 dataset with up to 500x less communication [18]. OpenDiLoCo, an open-source implementation by Prime Intellect, has been used for actual cross-continent runs. DeMo (Decoupled Momentum Optimization) from Peng and colleagues in 2024 takes a complementary approach, decomposing momentum into fast and slow components via a discrete cosine transform and synchronizing only the fast components, which cuts inter-accelerator communication by an order of magnitude or more [19].
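The shape of a DiLoCo-style round is easy to sketch even though the engineering around it is not. The schematic below uses plain AdamW inner steps and treats the averaged drift from the global weights as a pseudo-gradient for an outer optimizer; `workers`, `make_loss`, and `outer_opt` are hypothetical placeholders, and the published recipe differs in its details.

```python
import copy

import torch

def diloco_round(global_model, workers, inner_steps, outer_opt, make_loss):
    """One DiLoCo-style outer round: long local training, then one synchronization."""
    local_models = [copy.deepcopy(global_model) for _ in workers]

    # Inner phase: each worker takes many local optimizer steps with no
    # cross-worker communication at all.
    for model, data in zip(local_models, workers):
        inner_opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
        for _ in range(inner_steps):
            loss = make_loss(model, data)
            loss.backward()
            inner_opt.step()
            inner_opt.zero_grad()

    # Outer phase: the averaged drift from the global weights acts as a
    # pseudo-gradient for the outer optimizer (Nesterov SGD in the paper).
    for p_global, *p_locals in zip(global_model.parameters(),
                                   *[m.parameters() for m in local_models]):
        drift = torch.stack([p_global.data - p.data for p in p_locals]).mean(0)
        p_global.grad = drift
    outer_opt.step()
    outer_opt.zero_grad()
```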
Distributed training rarely uses parallelism alone. A handful of orthogonal techniques shave memory or trade compute for memory, and most large runs combine several at once.
Mixed precision training stores activations and many gradients in BF16, FP16, or FP8 instead of FP32. This roughly halves the memory used for those tensors and doubles arithmetic throughput on tensor cores. BF16 has become the default for transformer training because its FP32-equivalent dynamic range avoids most of the loss-scaling fragility that plagued FP16. FP8 training, supported on Hopper and Blackwell GPUs, pushes throughput further, though it requires careful per-tensor scaling. The mixed precision training article covers the numerics in depth.
Gradient checkpointing, also called activation recomputation, drops most intermediate activations during the forward pass and recomputes them during the backward pass. This trades roughly 33% extra compute for an order-of-magnitude reduction in activation memory. Selective recomputation, introduced in the Korthikanti paper alongside sequence parallelism, recomputes only the cheap-to-redo operations and stores the expensive ones, dropping the overhead from 33% to under 5% on a 530-billion-parameter run [13].
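In PyTorch, recomputation is typically applied per block with torch.utils.checkpoint; a minimal sketch over a stack of hypothetical transformer blocks:

```python
import torch
from torch.utils.checkpoint import checkpoint

class CheckpointedStack(torch.nn.Module):
    """Run each block under activation recomputation instead of storing its
    intermediate activations for the backward pass."""
    def __init__(self, blocks):
        super().__init__()
        self.blocks = torch.nn.ModuleList(blocks)

    def forward(self, x):
        for block in self.blocks:
            # Activations inside `block` are discarded after the forward pass
            # and recomputed during backward; only block boundaries are stored.
            x = checkpoint(block, x, use_reentrant=False)
        return x
```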
Gradient accumulation simulates a large batch using small micro-batches. The per-device batch is divided into micro-batches; gradients accumulate across them before the optimizer step. This makes very large effective batches feasible without needing more devices, at the cost of slower wall-clock per global step.
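A bare-bones accumulation loop, reusing hypothetical `model`, `loader`, and `optimizer` objects like those in the DDP sketch above:

```python
accum_steps = 8          # micro-batches per optimizer step
optimizer.zero_grad()
for step, (inputs, targets) in enumerate(loader):
    loss = torch.nn.functional.cross_entropy(model(inputs), targets)
    (loss / accum_steps).backward()      # scale so accumulated gradients average
    if (step + 1) % accum_steps == 0:
        optimizer.step()                 # one weight update per 8 micro-batches
        optimizer.zero_grad()
```

Under DDP, the non-final micro-batches are usually wrapped in the model.no_sync() context so the gradient all-reduce fires only once per accumulation window.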
Optimizer state offload to CPU or NVMe, as in DeepSpeed's ZeRO-Infinity, lets training reach trillions of parameters on modest GPU counts by keeping cold tensors out of VRAM. The cost is heavier I/O and slower step time, which is acceptable when the alternative is a model that does not fit at all.
There is a long-standing debate between synchronous and asynchronous distributed training. Synchronous training waits for every worker to finish its step before the next one starts, producing deterministic gradients and predictable convergence. Asynchronous training lets workers update a shared parameter store as they go, which avoids stragglers but introduces stale gradients.
The Hogwild! algorithm by Niu and colleagues in 2011 was an early demonstration that lock-free asynchronous SGD on shared memory can converge for sparse problems, with nearly linear speedup over locked variants [20]. Google's DistBelief framework used asynchronous parameter servers for early convolutional and recurrent network training. The original PipeDream design used asynchronous pipeline parallelism, with the weight stashing trick to control staleness [16].
In practice, frontier LLM training uses synchronous parallelism almost exclusively. Stale gradients perturb the optimizer in ways that matter at very large batch sizes and very deep models, and reproducibility is too important when a single run can cost tens of millions of dollars. Asynchronous methods survive in narrower contexts: federated learning across mobile devices, decentralized cross-datacenter runs (DiLoCo, DeMo), and certain reinforcement learning training loops where actor and learner workloads naturally decouple.
Three bottlenecks dominate distributed training at scale: communication, memory bandwidth, and load imbalance.
Communication. Gradient all-reduce in data parallelism, weight all-gather in FSDP, and all-to-all routing in expert parallelism each consume bandwidth proportional to model size or activation size. The MegaScale study (Jiang 2024) reported configurations that spent more than 90% of step time on communication, with computation almost entirely hidden behind it [21]. Mitigations include communication-computation overlap, gradient compression (PowerSGD, 1-bit Adam, ZeRO++ quantization), hierarchical all-reduce that exploits faster intra-node fabrics, and topology-aware placement.
Memory bandwidth. Tensor cores can consume operands faster than the HBM feeding them can deliver, so many kernels are bandwidth-bound rather than compute-bound. Kernel fusion (FlashAttention, Triton kernels), in-place operations, and careful operator scheduling reduce the bandwidth pressure.
Load imbalance. Pipeline parallelism produces bubbles. Expert parallelism produces hot experts that overload some devices. Even data parallelism can suffer if input lengths vary across the batch. Bin-packing, capacity factors, padding to a common length, and load-balancing losses for MoE all chip away at this overhead.
Fault tolerance. In a cluster of 16,000 GPUs, hardware failures are statistical certainties rather than exceptional events. The MegaScale paper from ByteDance documented the operational reality of running large training jobs: lost GPUs, overheating racks, network partitions, ECC errors, and kernel hangs all occur on a daily basis at this scale [21]. The standard mitigation is periodic checkpointing of the full training state including weights, optimizer state, gradient accumulator, and dataloader position. Recovering from a failure means restarting the surviving workers from the latest checkpoint.
At 405 billion parameters, a checkpoint can run to several terabytes, and naive synchronous checkpointing is impractical because it freezes training. Modern systems use asynchronous checkpointing that stages tensors to host memory and writes to storage in the background, incremental checkpointing that captures only the changed shards, and in-memory replicas distributed across racks for fast recovery without going to disk. Meta also reported elastic scheduling, where lost ranks are replaced from a hot spare pool without restarting the entire job. Schedulers such as Slurm, Kubernetes operators (NVIDIA's NIM Operator, Volcano, KubeFlow training operator), and Ray's autoscaler manage these workflows.
A frontier training run is one of the largest single capital expenditures in modern computing. Meta did not disclose Llama 3.1's electricity bill, but the publicly stated 39.3 million H100-80GB GPU-hours for the Llama 3.1 family, the large majority of it spent on the 405B model, at an industry rental price of around $2 to $4 per H100-hour, implies a compute cost in the $80 million to $160 million range, before counting datacenter, network, and engineering overhead [6][22]. DeepSeek-V3's published $5.6 million training cost is striking by comparison and reflects both clever engineering and a much smaller cluster, though the full R&D investment that produced V3's recipe is much higher than that single run [12].
Networking is a dominant share of cluster cost. InfiniBand NDR switches, NICs, optical transceivers, and cabling can rival the cost of the GPUs themselves at 16,000-GPU scale. Energy is another major line item: a 16,000-GPU H100 cluster running flat out draws around 12 megawatts continuously, before accounting for cooling and CPU overhead.
Diminishing returns at scale are an active research concern. Strong scaling, the ability to make an existing job finish faster by adding more devices, runs into Amdahl's law as the serial fraction (typically activation passes through the deepest pipeline stages and synchronization barriers) becomes the bottleneck. Weak scaling, the ability to handle a larger model or larger batch on more devices, is more forgiving but eventually hits the critical batch size beyond which more data parallelism does not improve loss per token. Practitioners now think carefully about where their cluster sits on this curve before committing to a particular shape.
While this article focuses on training, the same parallelism vocabulary applies to inference at scale, with different priorities. Tensor parallelism and pipeline parallelism are commonly used in inference engines such as vLLM, TGI, and TensorRT-LLM to fit large models like Llama 3 70B and DeepSeek-V3 across multiple GPUs. Expert parallelism is essential for serving large MoE models. Sequence parallelism and Ring Attention enable serving million-token contexts. The communication patterns are typically lighter than in training because there are no gradients, but latency requirements are tighter and batch shapes are more varied. The inference article covers serving systems in detail.
A few threads in 2024 to 2026 are reshaping distributed training.
Decentralized and low-bandwidth training. DiLoCo, OpenDiLoCo, and DeMo demonstrate that frontier-quality models can be trained across geographically distributed clusters connected by ordinary internet links rather than dedicated InfiniBand fabrics [18][19]. Prime Intellect's INTELLECT-1 (10B) and INTELLECT-2 runs in 2024 to 2025 trained competitive models on heterogeneous, geographically distributed compute, hinting at a future where massive single-site clusters are not the only option. Decoupled DiLoCo from Google DeepMind in 2025 added resilience and scaling improvements that pushed the technique to many-tens-of-billions-parameter models.
Zero-bubble pipelines. Qi 2024 and follow-up work eliminated the pipeline bubble entirely by splitting backward into weight-gradient and input-gradient phases that can be reordered within a 1F1B schedule. This brings pipeline parallelism efficiency to within a few percent of pure tensor parallelism for matched configurations.
FP8 training as default. Hopper and Blackwell GPUs support FP8 matrix multiplication. DeepSeek-V3 and several frontier runs in 2024 to 2025 trained primarily in FP8 with carefully designed scaling, doubling effective compute throughput compared to BF16 [12].
Larger and longer-lived clusters. xAI's Memphis Colossus, built in 2024 and expanded through 2025, scaled past 200,000 H100 and H200 GPUs in a single facility, the largest known single-site AI training cluster as of 2026. Microsoft, Google, Meta, and Amazon each operate clusters of comparable order of magnitude in aggregate. The engineering investment in network design, cooling, and power delivery for these facilities is unprecedented for any prior class of computing.
Mixture of training data and curriculum. As clusters have grown, the bottleneck for many teams has shifted from compute to data quality and curriculum. The systems engineering of moving petabytes of training tokens through a pipeline of filters, dedupers, and tokenizers has become a distributed-training subspecialty in its own right.