Distributed training refers to the practice of training machine learning models across multiple GPUs, machines, or accelerators working in coordination. As deep learning models have grown from millions to hundreds of billions (and even trillions) of parameters, single-device training has become infeasible in terms of both memory and time. Distributed training techniques partition the computation, data, or model across a cluster of devices, enabling the training of models that would otherwise be impossible to fit on a single accelerator. The field has evolved from simple data-parallel setups into sophisticated multi-dimensional parallelism strategies that combine data, tensor, pipeline, and expert parallelism in a single training run.
The growth of model sizes in deep learning has been exponential. GPT-2 (2019) had 1.5 billion parameters. GPT-3 (2020) scaled to 175 billion. By 2023-2024, models like GPT-4, Gemini, and LLaMA 3 reached parameter counts that required thousands of GPUs training in parallel for weeks or months. A single NVIDIA H100 GPU has 80 GB of high-bandwidth memory, but a 70-billion-parameter model in FP32 alone requires approximately 280 GB just for the weights, not counting optimizer states, gradients, or activations.
Distributed training solves this problem by spreading the workload across many devices. The fundamental challenge is doing so efficiently: communication between devices introduces overhead, and poor partitioning can leave expensive hardware idle. The field of distributed training is therefore as much about systems engineering and communication optimization as it is about machine learning algorithms.
Data parallelism is the simplest and most widely used form of distributed training. Each device (GPU or accelerator) holds a complete copy of the model. The training data batch is split across devices, with each device processing a different subset of examples. After computing gradients locally, the devices synchronize their gradients (typically by averaging), and each device updates its local model copy identically.
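The mechanics above can be illustrated with a toy, single-process simulation (no real GPUs or communication library involved): each "device" computes the gradient of a one-parameter least-squares loss on its shard of the batch, and averaging the local gradients reproduces the full-batch gradient exactly, so every replica applies the identical update.

```python
# Toy simulation of data parallelism: shard the batch, compute local
# gradients, average them (the AllReduce step), apply the same update
# everywhere. Assumes the batch divides evenly across devices.

def local_gradient(w, shard):
    # Gradient of mean squared error 0.5*(w*x - y)^2 over one data shard.
    return sum((w * x - y) * x for x, y in shard) / len(shard)

def data_parallel_step(w, batch, num_devices, lr=0.1):
    shard_size = len(batch) // num_devices
    shards = [batch[i * shard_size:(i + 1) * shard_size]
              for i in range(num_devices)]
    grads = [local_gradient(w, s) for s in shards]  # computed in parallel
    avg_grad = sum(grads) / num_devices             # gradient AllReduce
    return w - lr * avg_grad                        # identical update on each device

batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w_parallel = data_parallel_step(1.0, batch, num_devices=2)
w_single = 1.0 - 0.1 * local_gradient(1.0, batch)   # single-device reference
assert abs(w_parallel - w_single) < 1e-12
```

The equality holds because the shards are equal-sized; with uneven shards, the average of shard means would no longer equal the full-batch mean.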
The standard implementation in PyTorch is DistributedDataParallel (DDP), which uses the NCCL communication library to perform gradient AllReduce operations. DDP overlaps gradient communication with backward computation: as soon as gradients for a layer are computed, they can be communicated while gradients for earlier layers are still being calculated.
Data parallelism scales well when the model fits in a single device's memory, which is the case for models up to roughly 1-2 billion parameters on modern GPUs. Beyond that, memory becomes the bottleneck, and techniques like gradient accumulation (processing micro-batches sequentially) or more advanced sharding strategies become necessary.
The effective batch size in data parallelism equals the per-device batch size multiplied by the number of devices. This has implications for training dynamics: very large batch sizes can degrade model convergence. Techniques like learning rate warmup and linear scaling rules (Goyal et al., 2017) help maintain training stability at scale [1].
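The linear scaling rule and warmup can be sketched in a few lines. The base learning rate of 0.1 at batch size 256 follows the ResNet setup in Goyal et al.; the warmup schedule here is a generic linear ramp, one of several common variants.

```python
# Sketch of the linear scaling rule with warmup (Goyal et al., 2017):
# when the batch size grows by a factor of k, scale the target learning
# rate by k, and ramp up to it linearly over the first warmup steps.

def scaled_lr(base_lr, base_batch, batch):
    return base_lr * batch / base_batch            # linear scaling rule

def lr_at_step(step, target_lr, warmup_steps):
    if step < warmup_steps:
        return target_lr * (step + 1) / warmup_steps  # linear warmup
    return target_lr

target = scaled_lr(base_lr=0.1, base_batch=256, batch=8192)  # k = 32
assert target == 3.2
assert lr_at_step(0, target, warmup_steps=100) == 3.2 / 100
assert lr_at_step(100, target, warmup_steps=100) == 3.2
```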
Fully Sharded Data Parallelism extends data parallelism by sharding not just the data but also the model parameters, gradients, and optimizer states across devices. Each device stores only a fraction of the full model state. When a layer's parameters are needed for a forward or backward computation, they are gathered (all-gathered) from other devices on demand, used for computation, and then released.
FSDP is conceptually similar to Microsoft's ZeRO Stage 3 (see below). PyTorch introduced FSDP as a native feature, and it has become the default choice for training models that exceed single-GPU memory capacity within the PyTorch ecosystem. FSDP2, released with PyTorch 2.x, improved the composability and performance of the original implementation. Studies show that FSDP can reduce GPU memory usage by over 60% compared to standard DDP, though it increases communication volume and can slow training by a factor of 2-6x depending on the model size and cluster configuration [2].
Tensor parallelism (often called model parallelism or intra-layer parallelism) splits individual layers across multiple devices. For example, a large matrix multiplication in a transformer layer can be partitioned so that each device computes a portion of the output. The partial results are then combined through communication operations.
Megatron-LM, developed by NVIDIA, pioneered efficient tensor parallelism for transformers. The key insight is that the self-attention and feed-forward layers in a transformer can be split along specific dimensions with minimal communication. For a feed-forward layer with weight matrix W, tensor parallelism splits W column-wise across devices for the first linear layer and row-wise for the second, requiring only a single AllReduce operation per layer. For self-attention, different attention heads are distributed across devices, which is a natural partition since heads are independent [3].
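The column-then-row split can be verified with a pure-Python sketch on tiny matrices (no GPUs, and plain list-of-lists arithmetic standing in for real tensor kernels). Because the first weight matrix is split column-wise, each device holds complete elements of the intermediate activation and can apply the nonlinearity locally; the row-wise split of the second matrix produces partial outputs that a single summing AllReduce combines.

```python
# Pure-Python sketch of Megatron-style tensor parallelism for a 2-layer MLP:
# split W1 column-wise and W2 row-wise so the only cross-device
# communication is one summing AllReduce at the end.

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def relu(M):
    return [[max(0.0, v) for v in row] for row in M]

def split_cols(M, parts):
    k = len(M[0]) // parts
    return [[row[i * k:(i + 1) * k] for row in M] for i in range(parts)]

def split_rows(M, parts):
    k = len(M) // parts
    return [M[i * k:(i + 1) * k] for i in range(parts)]

def add(M, N):
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(M, N)]

X = [[1.0, -2.0]]                                         # one token, hidden size 2
W1 = [[1.0, 0.0, 2.0, -1.0], [0.0, 1.0, 1.0, 3.0]]        # 2 x 4
W2 = [[1.0, 2.0], [3.0, -1.0], [0.5, 0.5], [-2.0, 1.0]]   # 4 x 2

# Reference: unsharded forward pass.
ref = matmul(relu(matmul(X, W1)), W2)

# Tensor-parallel version on 2 "devices".
W1_shards = split_cols(W1, 2)            # column-parallel first layer
W2_shards = split_rows(W2, 2)            # row-parallel second layer
partials = [matmul(relu(matmul(X, w1)), w2)
            for w1, w2 in zip(W1_shards, W2_shards)]  # fully local so far
out = add(partials[0], partials[1])      # the single AllReduce
assert out == ref
```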
Tensor parallelism is most effective within a single node (server) where devices are connected by high-bandwidth interconnects like NVLink. Cross-node tensor parallelism is generally avoided because the frequent, fine-grained communication it requires does not tolerate the higher latency and lower bandwidth of inter-node networks.
Pipeline parallelism divides the model by layers, assigning consecutive groups of layers (called stages) to different devices. During the forward pass, each device processes its layers and passes the intermediate activations to the next device. During the backward pass, gradients flow in the reverse direction.
Naive pipeline parallelism suffers from severe "pipeline bubbles," where devices sit idle waiting for activations or gradients from other stages. The GPipe approach (Huang et al., 2019) mitigates this by splitting each batch into multiple micro-batches that flow through the pipeline in sequence, keeping more stages active simultaneously [4].
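Under the idealized assumption of uniform stage times, the fraction of device time lost to the GPipe bubble is (p - 1) / (m + p - 1) for p stages and m micro-batches, which makes the benefit of micro-batching easy to quantify:

```python
# Idealized GPipe bubble model: p pipeline stages, m micro-batches,
# uniform per-stage compute time. More micro-batches shrink the bubble.

def bubble_fraction(stages, micro_batches):
    return (stages - 1) / (micro_batches + stages - 1)

assert bubble_fraction(4, 1) == 0.75       # naive pipeline: devices mostly idle
assert bubble_fraction(4, 13) == 0.1875    # 13 micro-batches: much better
assert bubble_fraction(4, 397) < 0.01      # bubble vanishes as m grows
```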
DeepSpeed introduced further improvements with its PipeDream-style 1F1B (one-forward-one-backward) schedule, which interleaves forward and backward micro-batches to reduce the size of the pipeline bubble. More recent work on interleaved pipeline schedules and zero-bubble pipeline parallelism has reduced the bubble overhead to near zero for sufficiently large numbers of micro-batches.
Pipeline parallelism is typically used across nodes, since it requires only point-to-point communication of activation tensors between adjacent stages, which is less bandwidth-intensive than the all-to-all communication patterns of tensor parallelism.
Mixture-of-experts (MoE) models contain multiple "expert" sub-networks within certain layers, with a gating mechanism that routes each token to a subset of experts. Expert parallelism distributes different experts across different devices. Unlike other forms of model parallelism, expert parallelism applies only to the expert layers; the rest of the model can use data parallelism or tensor parallelism as usual.
Expert parallelism introduces all-to-all communication patterns: tokens must be routed to the device hosting the appropriate expert, processed, and then routed back. This creates distinct communication challenges compared to AllReduce-based approaches. Models like Mixtral, DeepSeek-V3, and Google's Switch Transformer use expert parallelism to scale to very large effective parameter counts while keeping the per-token computation manageable [5].
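The dispatch-process-return pattern can be simulated in a toy, single-process form. Here the experts are simple scalar functions and the gate is a hard-coded rule standing in for a learned router; the two dictionary-regrouping steps play the role of the two all-to-all exchanges.

```python
# Toy simulation of expert-parallel routing: a gate assigns each token to
# one expert, tokens are "sent" to the device hosting that expert (first
# all-to-all), processed, and returned to their original positions
# (second all-to-all).

experts = {0: lambda x: x + 100, 1: lambda x: x * 10}  # one expert per "device"

def gate(token):
    return 0 if token < 5 else 1   # stand-in for a learned router

def moe_layer(tokens):
    # Dispatch: group (position, token) pairs by destination expert.
    buckets = {e: [] for e in experts}
    for pos, tok in enumerate(tokens):
        buckets[gate(tok)].append((pos, tok))
    # Each device processes its local expert's tokens, then results are
    # scattered back to their original sequence positions.
    out = [None] * len(tokens)
    for e, fn in experts.items():
        for pos, tok in buckets[e]:
            out[pos] = fn(tok)
    return out

assert moe_layer([1, 7, 3, 9]) == [101, 70, 103, 90]
```

A real MoE layer routes each token to its top-k experts with learned gate weights and must also handle capacity limits and load balancing, which this sketch omits.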
Context parallelism (also called sequence parallelism in some implementations) distributes the sequence dimension across devices. This is particularly relevant for training with very long context lengths, where the attention computation's memory requirements (quadratic in sequence length before FlashAttention) can exceed single-device memory even when the model parameters fit. Meta's Llama 3 training used context parallelism as a fourth dimension alongside data, tensor, and pipeline parallelism [6].
The table below summarizes the major distributed training frameworks as of early 2026.
| Framework | Developer | Key Features | Parallelism Types Supported | Language/Integration |
|---|---|---|---|---|
| PyTorch DDP / FSDP | Meta | Native PyTorch integration; FSDP2 for memory-efficient training; automatic gradient synchronization | Data parallelism, fully sharded data parallelism | Python / PyTorch |
| DeepSpeed | Microsoft | ZeRO optimizer (stages 0-3); pipeline parallelism; mixed precision; inference optimization | Data, pipeline, tensor, expert parallelism | Python / PyTorch |
| Megatron-LM | NVIDIA | Efficient tensor parallelism for transformers; pipeline parallelism; sequence parallelism; integrated with NeMo | Tensor, pipeline, data, expert, context parallelism | Python / PyTorch |
| Ray Train | Anyscale | Framework-agnostic distributed training orchestration; elastic training; integration with Ray ecosystem | Data parallelism (wraps other frameworks) | Python / multi-framework |
| Horovod | Originally Uber | Ring-AllReduce based gradient synchronization; framework-agnostic (PyTorch, TensorFlow, MXNet) | Data parallelism | Python / multi-framework |
| Colossal-AI | HPC-AI Tech | Comprehensive parallelism support; auto-parallelism; memory optimization | Data, tensor, pipeline, expert, sequence parallelism | Python / PyTorch |
| JAX/XLA (pjit) | Google | Compiler-based parallelism via SPMD; TPU-optimized; automatic sharding | Data, tensor, pipeline parallelism | Python / JAX |
DeepSpeed, developed by Microsoft Research, is one of the most influential distributed training libraries. Its core contribution is the Zero Redundancy Optimizer (ZeRO), which progressively eliminates memory redundancy across data-parallel processes.
| ZeRO Stage | What Is Sharded | Memory Savings | Communication Overhead |
|---|---|---|---|
| Stage 0 | Nothing (vanilla data parallelism) | None | Baseline |
| Stage 1 | Optimizer states | ~4x reduction | Minimal increase |
| Stage 2 | Optimizer states + gradients | ~8x reduction | Moderate increase |
| Stage 3 | Optimizer states + gradients + parameters | Up to ~64x reduction | Significant increase |
| ZeRO-Infinity | Offloads to CPU/NVMe storage | Limited only by CPU/disk capacity | Highest overhead |
ZeRO Stage 1 partitions optimizer states across data-parallel ranks; for Adam in mixed precision, these comprise an FP32 master copy of the parameters plus the first and second moment estimates, about 12 bytes per parameter. Stage 2 additionally partitions gradients. Stage 3 partitions the parameters themselves, requiring all-gather operations before each forward and backward computation. ZeRO-Infinity extends Stage 3 with the ability to offload parameters and optimizer states to CPU RAM or even NVMe SSDs, enabling training of models with trillions of parameters on limited GPU clusters [7].
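The per-GPU memory for each stage can be computed directly from this accounting. The sketch below follows the mixed-precision byte counts used in the ZeRO paper (2 bytes/param for FP16 weights, 2 for FP16 gradients, 12 for Adam optimizer states) and its worked example of a 7.5B-parameter model on 64 GPUs.

```python
# Per-GPU memory under the ZeRO stages, using the ZeRO paper's
# mixed-precision accounting. N is the number of data-parallel GPUs.

def zero_memory_gb(params, n_gpus, stage):
    p, n = params, n_gpus
    if stage == 0:
        b = (2 + 2 + 12) * p            # everything replicated (vanilla DP)
    elif stage == 1:
        b = (2 + 2) * p + 12 * p / n    # optimizer states sharded
    elif stage == 2:
        b = 2 * p + (2 + 12) * p / n    # + gradients sharded
    else:
        b = (2 + 2 + 12) * p / n        # + parameters sharded (Stage 3)
    return b / 1e9

P, N = 7.5e9, 64                         # 7.5B parameters on 64 GPUs
assert zero_memory_gb(P, N, 0) == 120.0          # baseline: 16 bytes/param
assert round(zero_memory_gb(P, N, 1), 2) == 31.41
assert round(zero_memory_gb(P, N, 2), 2) == 16.64
assert round(zero_memory_gb(P, N, 3), 3) == 1.875  # full 64x sharding
```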
DeepSpeed also provides pipeline parallelism (with 1F1B scheduling), mixed-precision training support, gradient checkpointing, and inference optimization features. The Megatron-DeepSpeed integration combines Megatron-LM's tensor parallelism with DeepSpeed's ZeRO and pipeline parallelism for maximum flexibility.
Megatron-LM, developed by NVIDIA, focuses on efficient tensor parallelism for transformer models. The original Megatron paper (Shoeybi et al., 2019) demonstrated that careful partitioning of attention heads and feed-forward layers could scale transformer training to billions of parameters with near-linear throughput scaling across GPUs.
Megatron-LM has evolved to support pipeline parallelism (with interleaved schedules), sequence parallelism (distributing layer normalization and dropout across the sequence dimension), and expert parallelism for MoE models. It is tightly integrated with NVIDIA's NeMo framework for training large language models and is optimized for NVIDIA hardware, including NVLink and InfiniBand interconnects [3].
Ray Train, part of the Ray ecosystem developed by Anyscale, takes a different approach from DeepSpeed and Megatron-LM. Rather than implementing low-level parallelism strategies, Ray Train provides distributed training orchestration: it manages the lifecycle of training workers, handles fault tolerance, and integrates with various backend frameworks including PyTorch DDP, DeepSpeed, Horovod, TensorFlow, XGBoost, and others.
Ray Train's strengths include elastic training (scaling the number of workers up or down during training), integration with Ray Tune for hyperparameter optimization, and a unified API across different training backends. It is particularly popular in production environments where training jobs need to coexist with other workloads on a shared cluster [8].
Horovod, originally developed at Uber in 2017, was one of the first frameworks to make distributed data-parallel training accessible. Its core design is based on ring-AllReduce for gradient synchronization, inspired by the MPI (Message Passing Interface) communication standard. Horovod requires minimal code changes to parallelize existing training scripts: users wrap their optimizer with Horovod's distributed optimizer, and the library handles gradient synchronization transparently.
Horovod supports PyTorch, TensorFlow, Keras, and Apache MXNet. While it remains in use, it has been largely superseded by framework-native solutions (PyTorch DDP/FSDP) and more comprehensive libraries (DeepSpeed) for large-scale training. Horovod can also run on top of Ray, combining Horovod's efficient AllReduce with Ray's cluster management [9].
AllReduce is the fundamental collective communication operation in data-parallel training. It takes input tensors from all participating devices, reduces them (typically by summing), and distributes the result to all devices. After an AllReduce on gradients, every device has the same averaged gradient and can perform identical weight updates.
The ring-AllReduce algorithm arranges devices in a logical ring topology. The operation proceeds in two phases. In the scatter-reduce phase, each device sends a chunk of its data to its neighbor, which adds the received chunk to its own. After N-1 steps (where N is the number of devices), each device holds the fully reduced result for one chunk. In the all-gather phase, each device sends its completed chunk around the ring so that all devices end up with the full result.
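The two phases above can be simulated in pure Python. Each of N "devices" holds a vector pre-split into N chunks; after N-1 scatter-reduce steps device d owns the fully summed chunk (d+1) mod N, and N-1 all-gather steps circulate the finished chunks. The per-step snapshot models all devices sending simultaneously.

```python
# Pure-Python simulation of ring-AllReduce. data[dev][chunk] holds each
# device's local value for each chunk; the result gives every device the
# elementwise sum across devices.

def ring_allreduce(data):
    n = len(data)                        # n devices, each with n chunks
    buf = [row[:] for row in data]       # working copies
    # Scatter-reduce: device dev sends chunk (dev - step) % n to its
    # ring neighbor, which accumulates it.
    for step in range(n - 1):
        snapshot = [row[:] for row in buf]
        for dev in range(n):
            chunk = (dev - step) % n
            buf[(dev + 1) % n][chunk] += snapshot[dev][chunk]
    # All-gather: each device forwards its finished chunk around the ring.
    for step in range(n - 1):
        snapshot = [row[:] for row in buf]
        for dev in range(n):
            chunk = (dev + 1 - step) % n
            buf[(dev + 1) % n][chunk] = snapshot[dev][chunk]
    return buf

data = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]  # 3 devices, 3 chunks each
result = ring_allreduce(data)
assert all(row == [111, 222, 333] for row in result)
```

Counting transfers: each device sends one chunk (1/N of the data) per step for 2(N-1) steps, giving the 2(N-1)/N bandwidth cost discussed below.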
The bandwidth cost of ring-AllReduce is 2(N-1)/N times the data size per device, which approaches 2x the data size for large N. Crucially, this per-device cost is nearly independent of the number of devices, making ring-AllReduce highly scalable. The algorithm was popularized for deep learning by Baidu's research team in 2017 and became the basis for Horovod [10].
For clusters with non-uniform network topologies (e.g., fast NVLink within nodes, slower InfiniBand between nodes), hierarchical AllReduce performs a local reduction within each node first, then a global reduction across nodes, and finally broadcasts the result back. This approach takes advantage of the faster intra-node bandwidth and reduces the amount of data that must traverse the slower inter-node network.
NVIDIA's NCCL (NVIDIA Collective Communications Library) automatically selects the best algorithm for the given topology, including ring, tree, and hierarchical variants. NCCL 2.27 and later versions include optimizations for multi-rail InfiniBand, NVSwitch, and NVIDIA's SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) in-network computing technology [11].
| Interconnect | Bandwidth | Latency | Scope | Used By |
|---|---|---|---|---|
| NVLink 3.0 (A100) | 600 GB/s bidirectional | Sub-microsecond | Intra-node (GPU-to-GPU) | NVIDIA DGX A100 |
| NVLink 4.0 (H100) | 900 GB/s bidirectional | Sub-microsecond | Intra-node (GPU-to-GPU) | NVIDIA DGX H100 |
| NVLink 5.0 (B200) | 1,800 GB/s bidirectional | Sub-microsecond | Intra-node (GPU-to-GPU) | NVIDIA DGX B200 |
| NVSwitch | Full bisection bandwidth | Very low | Intra-node (all-to-all GPU) | DGX systems |
| InfiniBand HDR | 200 Gb/s per port | ~1 microsecond | Inter-node | HPC clusters |
| InfiniBand NDR | 400 Gb/s per port | ~1 microsecond | Inter-node | Modern AI clusters |
| RoCE (RDMA over Converged Ethernet) | 100-400 Gb/s | ~2-5 microseconds | Inter-node | Cloud-based clusters |
NVLink provides the high-bandwidth, low-latency connection needed for tensor parallelism within a node. InfiniBand (or high-speed Ethernet with RoCE) provides the inter-node fabric needed for data parallelism and pipeline parallelism across nodes. The choice of interconnect has a direct impact on training throughput: Meta reported that its Llama 3 training cluster used 400 Gb/s InfiniBand NDR between nodes, and that network performance was a key factor in overall training efficiency [6].
For the largest models, a single form of parallelism is insufficient. Practitioners combine multiple parallelism types, creating multi-dimensional parallel configurations.
3D parallelism combines data parallelism, tensor parallelism, and pipeline parallelism. This was the standard configuration for training models with tens to hundreds of billions of parameters from 2020 to 2023. Tensor parallelism is applied within each node (across GPUs connected by NVLink), pipeline parallelism is applied across nodes, and data parallelism is applied across groups of nodes.
4D parallelism adds a fourth dimension, typically context (sequence) parallelism. Meta's Llama 3.1 405B model was trained using 4D parallelism: tensor parallelism across 8 GPUs within a node, pipeline parallelism with 16 stages across nodes, context parallelism for long-sequence training, and FSDP across the remaining dimension [6].
5D parallelism adds expert parallelism as a fifth dimension, applicable to MoE models. DeepSeek-V3 and similar large MoE models use configurations that include data, tensor, pipeline, context, and expert parallelism simultaneously [5].
The configuration of these parallelism dimensions is a complex optimization problem. The number of devices assigned to each dimension must multiply to the total number of devices in the cluster, and the assignment affects memory usage, communication volume, and compute efficiency. Automated parallelism search tools (such as those in Alpa and Galvatron) attempt to find optimal configurations, but manual tuning by experienced systems engineers remains common for the largest training runs.
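A toy version of this search makes the structure of the problem concrete. The constraints and cost function below are invented stand-ins: the model is assumed to need at least 16-way sharding to fit in memory, tensor parallelism beyond 8 GPUs (one hypothetical node) is heavily penalized, and shallower pipelines are preferred. Real planners such as Alpa and Galvatron use detailed memory and communication cost models instead.

```python
# Enumerate all (dp, tp, pp) factorizations of a 64-GPU cluster and pick
# the cheapest under a crude, made-up cost model. Illustrates the
# dp * tp * pp == n_gpus constraint, not a real planner.

def factorizations(n_gpus):
    for tp in (t for t in range(1, n_gpus + 1) if n_gpus % t == 0):
        rest = n_gpus // tp
        for pp in (p for p in range(1, rest + 1) if rest % p == 0):
            yield (rest // pp, tp, pp)          # (dp, tp, pp)

def cost(cfg, gpus_per_node=8, min_model_shards=16):
    dp, tp, pp = cfg
    if tp * pp < min_model_shards:
        return float("inf")                     # model does not fit in memory
    cross_node_tp = 1000 if tp > gpus_per_node else 0
    return cross_node_tp + pp                   # toy preference: shallow pipeline

configs = list(factorizations(64))
assert all(dp * tp * pp == 64 for dp, tp, pp in configs)
best = min(configs, key=cost)
assert best == (4, 8, 2)    # TP=8 inside a node, PP=2 across nodes, DP=4
```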
Communication between devices is the primary bottleneck in distributed training. Gradient synchronization in data parallelism requires transmitting data proportional to the model size. Tensor parallelism requires multiple communication rounds per layer. Pipeline parallelism introduces latency from sequential stage execution.
At large scales, communication can consume a substantial fraction of total training time. Research has estimated that in some configurations, over 90% of training time is spent on communication rather than computation. Techniques to mitigate this include communication-computation overlap (starting communication for one layer while computing another), gradient compression, and using higher-bandwidth interconnects [12].
In a cluster of thousands of GPUs, hardware failures are not exceptional events but statistical certainties. A single GPU failure, network partition, or software crash can halt training for the entire cluster. The standard mitigation is periodic checkpointing: saving the full training state (model parameters, optimizer states, data loader position) to persistent storage at regular intervals. When a failure occurs, training resumes from the last checkpoint.
Checkpointing at scale is itself a challenge. For a model with hundreds of billions of parameters and corresponding optimizer states, a checkpoint can be hundreds of gigabytes to terabytes in size. Writing this to storage every few minutes creates significant I/O pressure. Asynchronous checkpointing (writing checkpoints in the background while training continues), incremental checkpointing (saving only what changed), and in-memory checkpointing (keeping backup copies distributed across nodes) are all active areas of improvement.
ByteDance's MegaScale paper (Jiang et al., 2024) described the infrastructure used to train a 175-billion-parameter model across 12,288 GPUs and documented a comprehensive fault-tolerance system with automatic failure detection, diagnosis, and recovery [12].
Data parallelism increases the effective batch size proportionally to the number of devices. Very large batch sizes can impair model convergence, producing models that generalize poorly. This problem was first documented in detail by Goyal et al. (2017), who proposed the linear scaling rule: when the batch size is multiplied by k, the learning rate should also be multiplied by k, combined with a warmup period at the start of training [1].
Subsequent work, including LARS (Layer-wise Adaptive Rate Scaling) and LAMB (Layer-wise Adaptive Moments optimizer for Batch training), introduced per-layer learning rate adaptation to further improve large-batch training stability. Despite these advances, there remain diminishing returns to scaling batch size: at some point, adding more data-parallel workers does not speed up training because the batch size is already at or beyond the critical batch size for the problem.
Training a large model requires memory for parameters, gradients, optimizer states, and activations. For a model with P parameters trained with the Adam optimizer in FP32, the total memory requirement is approximately 16P bytes (4 bytes each for parameters, gradients, first moment, and second moment). A 70-billion-parameter model thus needs over 1 TB of memory for these components alone.
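The 16P figure follows directly from the per-component byte counts, as this quick check shows:

```python
# The 16-bytes-per-parameter accounting from the text: FP32 parameters,
# gradients, and two Adam moment estimates at 4 bytes each.

BYTES = {"params": 4, "grads": 4, "adam_m": 4, "adam_v": 4}

def training_state_tb(n_params):
    return n_params * sum(BYTES.values()) / 1e12   # terabytes

assert sum(BYTES.values()) == 16
assert abs(training_state_tb(70e9) - 1.12) < 1e-9  # 70B params: over 1 TB
```

Activations come on top of this and depend on batch size, sequence length, and architecture, which is why they are handled by separate techniques such as checkpointing.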
Gradient checkpointing (also called activation checkpointing) trades computation for memory by not storing intermediate activations during the forward pass and instead recomputing them during the backward pass. This can reduce activation memory by a factor proportional to the square root of the number of layers.
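The square-root scaling can be seen in a simple counting model: if a checkpoint is stored every s layers, peak activation storage is roughly ceil(L/s) checkpoints plus up to s live activations while one segment is recomputed, and this sum is minimized near s = sqrt(L). (This idealized count ignores per-layer size differences and recompute overlap.)

```python
import math

# Toy count of peak stored activations with checkpointing every
# `segment` layers: ceil(L / segment) checkpoints, plus up to `segment`
# activations held live while a segment is recomputed in the backward pass.

def peak_activations(n_layers, segment):
    return math.ceil(n_layers / segment) + segment

L = 100
no_ckpt = L                                    # naive: store every activation
best = min(peak_activations(L, s) for s in range(1, L + 1))
assert no_ckpt == 100
assert best == peak_activations(L, 10) == 20   # s = sqrt(100) is optimal
```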
Mixed-precision training reduces memory by storing activations and some model state in FP16 or BF16, cutting memory usage roughly in half for those components.
As of early 2026, distributed training is an essential capability for any organization training large models. The landscape is characterized by several trends.
First, the convergence of frameworks continues. PyTorch FSDP2 has matured significantly, and many teams that previously used DeepSpeed or custom solutions have migrated to native PyTorch distributed primitives. At the same time, DeepSpeed continues to innovate, particularly with its ZeRO++ optimizations (which add quantized communication and hierarchical partitioning) and its support for MoE training.
Second, NVIDIA's hardware and software stack dominates. The combination of H100/B200 GPUs, NVLink/NVSwitch for intra-node communication, InfiniBand NDR for inter-node communication, NCCL for collective operations, and Megatron-LM/NeMo for parallelism implementation represents the most widely deployed stack for large-scale training. Google's TPU-based training with JAX/XLA is the primary alternative, used for Gemini and other Google models.
Third, automation of parallelism configuration is improving. Tools that automatically determine the optimal combination of data, tensor, pipeline, and expert parallelism for a given model and cluster are reducing the need for manual systems engineering, though they have not yet replaced it for the largest training runs.
Fourth, training at unprecedented scale continues to push the boundaries. The largest known training runs as of early 2026 use clusters of 16,000 to over 100,000 GPUs, requiring solutions for fault tolerance, load balancing, and network optimization that go well beyond what was needed even two years earlier.
Fifth, the energy and cost implications of distributed training at scale have become a significant concern, driving interest in more efficient parallelism strategies, better hardware utilization, and training techniques that reduce the total number of FLOPs needed to reach a given model quality.