See also: Machine learning terms, Data parallelism, GPU computing, Deep learning
Model parallelism is an approach in machine learning that addresses the computational challenges posed by the increasing size and complexity of modern neural network models. It works by executing different parts of a single model concurrently on multiple processing units. As large language models have grown from millions to hundreds of billions (and even trillions) of parameters, model parallelism has become essential for training and serving these systems. A single GPU with 80 GB of high-bandwidth memory cannot hold the full parameters, gradients, and optimizer states for a 175-billion-parameter model, which can require over 2 TB of memory in mixed-precision training with Adam. Model parallelism solves this by distributing different parts of the model across many accelerators so that no single device needs to hold everything at once.
As machine learning models grow larger and more complex, they require increasing amounts of memory and computation resources. This growth in demand is driven by the pursuit of improved performance, as larger models tend to have higher representational capacity and can better capture complex patterns in the data. However, this increased size and complexity can lead to several challenges, including:
Memory limitations: The memory capacity of a single processing unit, such as a Graphics Processing Unit (GPU) or Tensor Processing Unit (TPU), may not be sufficient to store the entire model and its associated data. For a model with P parameters trained using mixed-precision Adam, the total memory footprint is approximately 16P bytes (2 bytes for fp16 parameters, 2 bytes for fp16 gradients, 4 bytes for fp32 master weights, 4 bytes for first moment estimates, and 4 bytes for second moment estimates). A 70-billion-parameter model therefore needs roughly 1.12 TB just for model states, far exceeding the capacity of any single GPU (a short calculation of this estimate appears below).
Computational bottlenecks: The sheer number of parameters and operations in large models can cause significant delays in training and inference, which may be unacceptable for certain applications or time-sensitive tasks.
Scaling laws: Research by Kaplan et al. (2020) and Hoffmann et al. (2022) has demonstrated that model performance improves predictably with scale, providing strong incentives to train ever-larger models. These scaling laws drive the need for parallelism strategies that can efficiently distribute computation across thousands of accelerators.
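As a rough illustration of the 16-bytes-per-parameter estimate above, the following Python sketch (the helper name and example sizes are illustrative) computes the model-state footprint for a few model sizes:

```python
def model_state_bytes(num_params: int, bytes_per_param: int = 16) -> int:
    """Approximate model-state memory under mixed-precision Adam:
    2 (fp16 params) + 2 (fp16 grads) + 4 (fp32 master weights)
    + 4 (first moments) + 4 (second moments) = 16 bytes per parameter."""
    return num_params * bytes_per_param

for billions in (7, 70, 175):
    tb = model_state_bytes(billions * 10**9) / 10**12
    print(f"{billions}B parameters -> ~{tb:.2f} TB of model states")
# 70B -> ~1.12 TB and 175B -> ~2.80 TB, matching the figures quoted above.
```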
Model parallelism offers a solution to these challenges by dividing the model's computation across multiple processing units, thereby allowing for the efficient execution of larger and more complex models.
The roots of model parallelism in deep learning trace back to the earliest days of GPU-accelerated neural network training. Before GPUs became standard, training deep networks on CPUs was prohibitively slow. In the mid-2000s, researchers began adapting GPUs for general-purpose computing, reporting 4x speedups over CPU implementations by 2006 and up to 70x speedups by 2009.
The landmark moment came in 2012 with AlexNet, developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton at the University of Toronto. AlexNet contained 60 million parameters and was one of the first deep networks to employ model parallelism in practice. Because the NVIDIA GTX 580 GPUs used for training had only 3 GB of memory each, Krizhevsky split the network across two GPUs, with each GPU processing one half of the feature maps. The two halves communicated only at specific layers. This design was driven by hardware necessity rather than theoretical considerations, but it demonstrated that partitioning a model across devices was both feasible and effective. AlexNet won the 2012 ImageNet Large Scale Visual Recognition Challenge by a wide margin, catalyzing the deep learning revolution.
From 2013 to 2017, data parallelism dominated multi-GPU training as GPU memory grew and models remained small enough to fit on a single device. Frameworks like DistBelief (Dean et al., 2012) at Google and later TensorFlow and PyTorch made data-parallel training accessible. However, as Transformer architectures emerged in 2017 and models began scaling into the billions of parameters, the limitations of single-device memory became acute once again.
The period from 2019 onward saw a rapid evolution of model parallelism techniques. Megatron-LM (Shoeybi et al., 2019) introduced systematic tensor parallelism for Transformers. GPipe (Huang et al., 2019) and PipeDream (Narayanan et al., 2019) formalized pipeline parallelism. The ZeRO optimizer (Rajbhandari et al., 2020) bridged the gap between data parallelism and model parallelism by sharding model states. By 2024, training the largest frontier models routinely involved combining four or five forms of parallelism across tens of thousands of GPUs.
There are several distinct types of model parallelism, each addressing different axes of partitioning and suited to different model architectures and hardware configurations.
Tensor parallelism (sometimes called intra-layer parallelism or intra-operator parallelism) splits individual layers of a neural network across multiple devices. Rather than placing entire layers on different accelerators, tensor parallelism partitions the weight matrices within a single layer so that each device computes a portion of the layer's output. The partial results are then combined through communication operations such as all-reduce or all-gather.
The foundational work on tensor parallelism for Transformers was introduced by Shoeybi et al. (2019) in the Megatron-LM paper. The key insight is that the matrix multiplications inside Transformer layers can be split along specific dimensions with minimal communication overhead. For a multi-head attention layer, the query, key, and value projections can be split column-wise across devices, with each device computing a subset of the attention heads. For the feed-forward network (FFN), the first linear layer is split column-wise and the second is split row-wise, requiring only a single all-reduce operation per layer during the forward pass and another during the backward pass.
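The column-wise/row-wise split can be illustrated with plain tensors on a single process. The following is a minimal sketch of the idea (the shapes, the two-way split, and the use of ReLU in place of GeLU are illustrative); the final sum of partial outputs stands in for the all-reduce that real tensor parallelism performs across devices:

```python
import torch

torch.manual_seed(0)
batch, hidden, ffn = 4, 8, 32
x = torch.randn(batch, hidden)
w1 = torch.randn(hidden, ffn)   # first FFN weight (split column-wise)
w2 = torch.randn(ffn, hidden)   # second FFN weight (split row-wise)

# Reference (single-device) computation.
ref = torch.relu(x @ w1) @ w2

# "Device 0" and "device 1" each hold half of w1's columns and half of w2's rows.
w1_a, w1_b = w1.chunk(2, dim=1)
w2_a, w2_b = w2.chunk(2, dim=0)

# Column-parallel first layer: each device computes half of the activation;
# the elementwise nonlinearity needs no communication.
h_a = torch.relu(x @ w1_a)
h_b = torch.relu(x @ w1_b)

# Row-parallel second layer: each device produces a partial output; summing the
# partials plays the role of the all-reduce in real tensor parallelism.
out = h_a @ w2_a + h_b @ w2_b

assert torch.allclose(out, ref, atol=1e-5)
```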
Megatron-LM demonstrated training of models up to 8.3 billion parameters across 512 NVIDIA V100 GPUs, achieving 15.1 PetaFLOPs of sustained performance with 76% scaling efficiency compared to a strong single-GPU baseline of 39 TeraFLOPs.
Advantages of tensor parallelism include the absence of a pipeline bubble, the fact that both the weight matrices and the corresponding activations of each layer are split across devices, and, for Transformers, the need for only a single all-reduce per layer in each of the forward and backward passes.
Its main limitations are the very high interconnect bandwidth it requires, which in practice confines it to NVLink-connected GPUs within a single node, and the communication it incurs several times per layer, which makes it sensitive to slow interconnects.
Pipeline parallelism (also called inter-layer parallelism) partitions a model vertically by assigning consecutive groups of layers (called stages) to different devices. Input data flows through the pipeline from the first stage to the last during the forward pass, and gradients flow in reverse during the backward pass. The two foundational systems for pipeline parallelism are GPipe and PipeDream.
GPipe (Huang et al., 2019), developed at Google Brain, introduced micro-batch pipelining. GPipe divides each mini-batch into M smaller micro-batches, which are then fed through the pipeline in sequence. During the forward pass, each micro-batch moves through all stages, and activations at stage boundaries are stored. During the backward pass, gradients are computed for each micro-batch and accumulated across all M micro-batches before a single synchronous weight update is applied. GPipe uses synchronous gradient accumulation, meaning it produces the same training dynamics as standard synchronous data-parallel training. Google demonstrated GPipe by training a 557-million-parameter AmoebaNet-B model, achieving 84.3% top-1 accuracy on ImageNet. GPipe also uses activation re-materialization (recomputation) to reduce memory: only boundary activations between stages are stored, and intermediate activations within each stage are recomputed during the backward pass.
The main limitation of GPipe is the "pipeline bubble," the idle time that occurs at the start and end of each mini-batch when some pipeline stages have no micro-batch to process. The fraction of idle time is approximately (S - 1) / M, where S is the number of pipeline stages and M is the number of micro-batches. Increasing M relative to S reduces the bubble but increases memory requirements for storing activations.
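A quick sanity check of this formula (the function name and example values are illustrative):

```python
def bubble_overhead(num_stages: int, num_microbatches: int) -> float:
    """Pipeline bubble as a fraction of the ideal (bubble-free) compute time,
    using the (S - 1) / M approximation described above."""
    return (num_stages - 1) / num_microbatches

print(bubble_overhead(4, 8))    # 0.375 -> substantial idle time with few micro-batches
print(bubble_overhead(4, 64))   # ~0.047 -> many micro-batches shrink the bubble
```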
PipeDream (Narayanan et al., 2019), developed jointly at Microsoft Research and Carnegie Mellon University, took a different approach. PipeDream uses an asynchronous one-forward-one-backward (1F1B) schedule in which each worker alternates between processing a forward pass and a backward pass from different micro-batches. This significantly reduces pipeline idle time compared to GPipe. However, the asynchronous nature means that forward and backward passes for the same micro-batch may use different versions of the model weights, introducing weight staleness. PipeDream addresses this through a technique called weight stashing, which stores multiple versions of model parameters so that each backward pass uses the same weights that were used in the corresponding forward pass. PipeDream demonstrated up to 5.3x speedup over standard data-parallel training. Later variants reduce the memory overhead of weight stashing: PipeDream-2BW maintains only two weight versions (double-buffered weights), while PipeDream-Flush keeps a single weight version and periodically flushes the pipeline.
Interleaved 1F1B, introduced in Megatron-LM v2 (Narayanan et al., 2021), assigns multiple non-consecutive stages to each device to further reduce the pipeline bubble. By interleaving stages, each device processes micro-batches from different positions in the pipeline, keeping the hardware busier during the ramp-up and ramp-down phases.
Zero-bubble pipeline parallelism (Qi et al., 2024), accepted at ICLR 2024, achieves near-zero pipeline idle time under synchronous training semantics. The key insight is to split the backward computation into two separate passes: one that computes the gradient with respect to the input (used for the upstream stage) and one that computes the gradient with respect to the parameters (used for weight updates). By scheduling these two backward passes independently, the algorithm can fill in the idle slots that would otherwise form bubbles. The authors also developed an automatic scheduling algorithm that finds optimal pipeline schedules for a given model configuration and memory constraint. Experiments showed up to 23% throughput improvement over standard 1F1B under similar memory limits, increasing to 31% when memory constraints were relaxed.
DualPipe, introduced by DeepSeek for training DeepSeek-V3 and DeepSeek-R1 (DeepSeek-AI, 2024), is a bidirectional pipeline parallelism algorithm that achieves full overlap of forward and backward computation with communication. DualPipe divides each micro-batch chunk into four components: attention, all-to-all dispatch, MLP, and all-to-all combine. By scheduling forward and backward passes to flow in opposite directions through the pipeline simultaneously, DualPipe overlaps the communication-heavy operations of one chunk with the computation-heavy operations of another. This is particularly effective for MoE models where the computation-to-communication ratio approaches 1:1 due to heavy cross-node expert routing traffic.
Sequence parallelism addresses a different bottleneck: the memory consumed by activations that grow with sequence length. In standard tensor parallelism, operations like LayerNorm, dropout, and residual connections are replicated on every device because they operate on the full hidden dimension. Sequence parallelism, introduced as an extension to Megatron-LM by Korthikanti et al. (2023), partitions these operations along the sequence dimension instead, so each device handles a portion of the sequence for these non-tensor-parallel regions. This reduces activation memory by a factor equal to the tensor parallelism degree without additional communication overhead, since the required all-gather and reduce-scatter operations replace the existing all-reduce operations used in tensor parallelism.
For long-context training, more aggressive forms of sequence parallelism have been developed. Ring Attention passes blocks of keys and values around a ring of devices so that attention over very long sequences never requires any single device to hold the full sequence, while DeepSpeed-Ulysses shards the sequence dimension and uses all-to-all communication to redistribute data for the attention computation. Context parallelism, used in TorchTitan and in Llama 3's 128K-token-context training, applies the same idea as an additional parallelism dimension.
Expert parallelism is a specialized form of model parallelism designed for Mixture-of-Experts (MoE) architectures. In an MoE model, each Transformer layer contains a routing mechanism (a gating network) and multiple expert sub-networks (typically feed-forward networks). For any given input token, only a subset of experts (the top-k, often top-1 or top-2) is activated, making the model sparse.
In expert parallelism, the expert sub-networks are distributed across devices so that each device holds a subset of the experts. A routing step determines which tokens go to which experts, and all-to-all communication moves tokens to the devices that hold their selected experts. After expert computation, another all-to-all operation returns the results to the originating devices.
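The routing step can be sketched on a single process as follows. In a real expert-parallel setup the dispatch and combine steps would be all-to-all operations across devices rather than local indexing, and all sizes here are illustrative:

```python
import torch

torch.manual_seed(0)
num_tokens, hidden, num_experts, top_k = 16, 32, 4, 2
tokens = torch.randn(num_tokens, hidden)
gate = torch.nn.Linear(hidden, num_experts)
experts = torch.nn.ModuleList(
    [torch.nn.Linear(hidden, hidden) for _ in range(num_experts)]
)

# Gating: pick the top-k experts per token and normalize their weights.
scores = gate(tokens).softmax(dim=-1)
weights, chosen = scores.topk(top_k, dim=-1)          # (tokens, k)
weights = weights / weights.sum(dim=-1, keepdim=True)

# Dispatch/combine: each expert processes the tokens routed to it, and the
# weighted results are summed back into the original token order. In expert
# parallelism, the dispatch and combine steps are all-to-all communications.
output = torch.zeros_like(tokens)
for e, expert in enumerate(experts):
    token_idx, slot = (chosen == e).nonzero(as_tuple=True)
    if token_idx.numel():
        w = weights[token_idx, slot].unsqueeze(-1)
        output[token_idx] += w * expert(tokens[token_idx])
```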
This approach enables training of models with very large total parameter counts while keeping the per-token computation manageable. The Switch Transformer (Fedus et al., 2022) reported approximately 4x pre-training speedup over a dense model of comparable quality by using expert parallelism. DeepSeek-V2 and Mixtral also rely on expert parallelism for efficient training and inference.
The primary challenge of expert parallelism is communication overhead. The all-to-all operations required for token routing can consume over 40% of total runtime in large-scale training. Techniques like hybrid expert parallelism (combining expert parallelism with data and tensor parallelism) and optimized communication kernels have been developed to mitigate this. DeepSpeed-TED combines data, tensor, and expert parallelism to train MoE models with 4 to 8x larger base models than previous approaches.
Training the largest models typically requires combining multiple parallelism strategies simultaneously. 3D parallelism refers to the combination of data parallelism, tensor parallelism, and pipeline parallelism. Each strategy is mapped to a different level of the hardware hierarchy based on communication bandwidth considerations:
| Parallelism type | Typical scope | Communication pattern | Bandwidth requirement |
|---|---|---|---|
| Tensor parallelism | Within a single node (e.g., 8 GPUs connected via NVLink) | All-reduce per layer | Very high (requires NVLink or NVSwitch) |
| Pipeline parallelism | Across nodes within a rack or pod | Point-to-point activation transfers at stage boundaries | Moderate (InfiniBand is sufficient) |
| Data parallelism | Across all remaining devices | All-reduce of gradients once per training step | Low to moderate (can overlap with computation) |
For example, in a cluster of 64 GPUs arranged as 8 nodes of 8 GPUs each, one might use tensor parallelism degree 8 (within each node), pipeline parallelism degree 4 (across 4 groups of nodes), and data parallelism degree 2 (the remaining factor). This yields 8 x 4 x 2 = 64 total devices.
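In PyTorch, such a layout can be described with a device mesh. The following is a hedged sketch (it assumes a recent PyTorch release, must be launched with torchrun across 64 processes, and the dimension names are arbitrary):

```python
from torch.distributed.device_mesh import init_device_mesh

# 64 GPUs arranged as (data=2, pipeline=4, tensor=8); the innermost "tp"
# dimension maps consecutive ranks, i.e. the GPUs within one NVLink node.
mesh = init_device_mesh("cuda", (2, 4, 8), mesh_dim_names=("dp", "pp", "tp"))

tp_group = mesh["tp"].get_group()  # all-reduce per layer over NVLink
pp_group = mesh["pp"].get_group()  # point-to-point activation transfers across nodes
dp_group = mesh["dp"].get_group()  # gradient all-reduce across replicas
```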
The Megatron-LM v2 paper (Narayanan et al., 2021) presented detailed analysis of 3D parallelism for training models up to one trillion parameters, demonstrating 502 petaFLOPS on 3072 A100 GPUs with 52% of peak GPU throughput.
Meta's training of Llama 3 405B on 16,384 NVIDIA H100 GPUs used 4D parallelism, adding context parallelism as a fourth dimension. The specific configuration was:
| Dimension | Degree | Description |
|---|---|---|
| Tensor Parallelism (TP) | 8 | Splits weight matrices within each node |
| Pipeline Parallelism (PP) | 16 | Distributes layers across 16 pipeline stages |
| Context Parallelism (CP) | 16 (for 128K context) | Partitions long sequences across devices |
| Data Parallelism (FSDP) | Remaining GPUs | Shards parameters, gradients, and optimizer states |
This configuration achieved approximately 400 TeraFLOPs per GPU with 8K sequence length and 380 TeraFLOPs per GPU with 128K sequence length.
For MoE models, expert parallelism adds a fifth dimension. Training setups for models like Mixtral or DeepSeek-V3 combine data, tensor, pipeline, context, and expert parallelism, often referred to as 5D parallelism. DeepSeek-V3, a 671-billion-parameter MoE model with 256 experts, used a specific combination of 16-way pipeline parallelism with DualPipe, 64-way expert parallelism spanning 8 nodes, and ZeRO-1 data parallelism. The DualPipe schedule was critical for overlapping the heavy all-to-all communication required by expert routing with forward and backward computation, achieving effective hardware utilization despite the approximately 1:1 computation-to-communication ratio.
ZeRO (Rajbhandari et al., 2020) is a family of memory optimization techniques developed by Microsoft as part of the DeepSpeed library. ZeRO eliminates memory redundancy in data-parallel training by partitioning model states across data-parallel processes instead of replicating them. It is implemented as three progressive stages:
| Stage | What is partitioned | Memory reduction | Communication volume |
|---|---|---|---|
| ZeRO Stage 1 (ZeRO-1) | Optimizer states only (e.g., Adam's first and second moment estimates, fp32 master weights) | Up to 4x | Same as standard data parallelism |
| ZeRO Stage 2 (ZeRO-2) | Optimizer states + gradients | Up to 8x | Same as standard data parallelism |
| ZeRO Stage 3 (ZeRO-3) | Optimizer states + gradients + model parameters | Linear reduction with number of GPUs (Nx for N GPUs) | 1.5x of standard data parallelism |
In ZeRO Stage 1, only the optimizer states are partitioned. Each GPU stores 1/N of the optimizer states (where N is the number of data-parallel GPUs) and updates only its assigned partition of model parameters. After the update, an all-gather operation broadcasts the updated parameters to all GPUs.
ZeRO Stage 2 additionally partitions the gradients. Each GPU retains only the gradients corresponding to its portion of the optimizer states and discards the rest after the reduce-scatter operation.
ZeRO Stage 3 goes further by also partitioning the model parameters themselves. Each GPU stores only 1/N of the parameters, and all-gather operations are inserted before each forward and backward computation to temporarily reconstruct the full parameters needed for that layer. This enables training models whose total memory footprint far exceeds any single GPU's capacity, at the cost of increased communication.
ZeRO-Offload and ZeRO-Infinity extend these concepts further by offloading partitioned states to CPU memory or NVMe SSDs, enabling training of models with trillions of parameters on limited GPU hardware.
Fully Sharded Data Parallel (FSDP) is PyTorch's native implementation of ZeRO Stage 3 concepts. Originally inspired by FairScale's implementation, FSDP was integrated into PyTorch core and has undergone significant evolution, most recently with FSDP2, which reworks sharding on a per-parameter basis and composes more cleanly with tensor and pipeline parallelism.
FSDP wraps model layers or submodules in FSDP units. Before each forward pass, an all-gather operation collects the full parameters for that unit from all devices. After the computation, the full parameters are discarded and only the local shard is retained. During the backward pass, the same all-gather and discard pattern is repeated. Gradients are reduced via reduce-scatter so that each device accumulates only its shard of the gradients.
FSDP sharding strategies:
| Strategy | Behavior | Equivalent ZeRO stage |
|---|---|---|
| FULL_SHARD | Shards parameters, gradients, and optimizer states | ZeRO Stage 3 |
| SHARD_GRAD_OP | Shards gradients and optimizer states; keeps full parameters after forward | ZeRO Stage 2 |
| NO_SHARD | Standard data parallelism with no sharding | Baseline DDP |
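A minimal FSDP usage sketch (assuming a torchrun launch with one process per GPU; the toy model and hyperparameters are placeholders):

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 4096)
).cuda()

# FULL_SHARD corresponds to ZeRO Stage 3: parameters, gradients, and optimizer
# states are sharded, and parameters are all-gathered just in time for each unit.
model = FSDP(model, sharding_strategy=ShardingStrategy.FULL_SHARD)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

loss = model(torch.randn(8, 4096, device="cuda")).pow(2).mean()
loss.backward()   # gradients are reduce-scattered to their owning shards
optimizer.step()  # each rank updates only its shard of the optimizer states
```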
The following table compares PyTorch FSDP with DeepSpeed ZeRO:
| Aspect | PyTorch FSDP | DeepSpeed ZeRO |
|---|---|---|
| Framework integration | Native PyTorch | Requires DeepSpeed library |
| Configuration | Python API, minimal config | JSON config file + launcher |
| Composability with TP/PP | Strong (especially FSDP2) | Supported but requires more setup |
| CPU/NVMe offloading | Supported | Supported (ZeRO-Offload, ZeRO-Infinity) |
| Best performance range | 100M to 10B parameters | 10B+ parameters |
| Ecosystem | PyTorch-native tooling | Rich feature set (MoE, inference, compression) |
Benchmarks from 2024 and 2025 show nuanced performance differences. FSDP with FULL_SHARD has been observed running up to 5x faster per iteration than DeepSpeed ZeRO-3 on mid-sized models. However, DeepSpeed tends to show advantages at 10B+ parameter scales, where its memory-saving optimizations and offloading capabilities become more significant.
Data parallelism and model parallelism address different scaling bottlenecks and are often used together rather than as alternatives.
| Feature | Data parallelism | Model parallelism |
|---|---|---|
| What is replicated/split | Data is split; model is replicated on each device | Model is split across devices; data may be replicated or also split |
| Memory per device | Full model + optimizer states on each GPU | Only a fraction of model states per GPU |
| Communication pattern | All-reduce of gradients after each step | Varies: all-reduce (tensor), point-to-point (pipeline), all-to-all (expert) |
| Communication frequency | Once per training step | Multiple times per layer (tensor) or per micro-batch (pipeline) |
| Scaling limit | Model must fit on a single device | Can scale beyond single-device memory |
| Implementation complexity | Relatively simple (e.g., PyTorch DDP) | More complex; requires careful partitioning |
| Hardware requirement | Any multi-GPU setup | High-bandwidth interconnects preferred (especially for tensor parallelism) |
| Batch size impact | Effective batch size scales with number of devices | Batch size is independent of parallelism degree |
| Gradient synchronization | Synchronous (standard) or asynchronous | Depends on specific strategy |
| Primary use case | Models that fit on one GPU; scaling throughput | Models too large for one GPU; memory-constrained scenarios |
In practice, most large-scale training systems combine both. Data parallelism increases throughput by processing more data in parallel, while model parallelism enables training models that would not fit on a single device.
Efficient communication is critical for model parallelism performance. The choice of collective operation and hardware interconnect directly impacts training throughput.
Distributed training relies on a set of collective communication primitives, most commonly provided by NVIDIA's NCCL (NVIDIA Collective Communications Library) for GPU clusters:
| Operation | Description | Used by |
|---|---|---|
| All-Reduce | Every device sends its local data, and all devices receive the sum (or other reduction) of all data. | Data parallelism (gradient synchronization), tensor parallelism |
| All-Gather | Every device sends its local shard, and all devices receive the concatenation of all shards. | FSDP/ZeRO-3 (parameter reconstruction before forward/backward) |
| Reduce-Scatter | Each device sends data, and each receives a reduced (summed) portion. Equivalent to a reduce followed by a scatter. | FSDP/ZeRO (gradient sharding), tensor parallelism |
| All-to-All | Each device sends a different piece of data to every other device. | Expert parallelism (token routing), sequence parallelism (DeepSpeed-Ulysses) |
| Broadcast | One device sends data to all other devices. | Parameter initialization, inference |
| Point-to-Point (Send/Recv) | Direct transfer between two specific devices. | Pipeline parallelism (activation/gradient transfer between stages) |
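The following sketch exercises three of these primitives through torch.distributed (assuming a torchrun launch with one process per GPU; tensor sizes are illustrative):

```python
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
rank, world = dist.get_rank(), dist.get_world_size()
device = torch.device("cuda", rank % torch.cuda.device_count())
torch.cuda.set_device(device)

# All-reduce: every rank ends up with the sum of all local gradients.
grad = torch.full((4,), float(rank), device=device)
dist.all_reduce(grad, op=dist.ReduceOp.SUM)

# All-gather: reconstruct a full tensor from per-rank shards (FSDP/ZeRO-3 style).
shard = torch.full((4,), float(rank), device=device)
full = torch.empty(4 * world, device=device)
dist.all_gather_into_tensor(full, shard)

# Reduce-scatter: each rank receives one reduced slice of the full tensor.
local = torch.empty(4, device=device)
dist.reduce_scatter_tensor(local, full)
```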
NCCL automatically detects hardware topology (PCIe, NVLink, NVSwitch, InfiniBand, RoCE) and selects optimal algorithms and parameters. NCCL 2.27 (2025) added features for fast inference and resilient training, including fault tolerance for large-scale jobs.
The bandwidth and latency of interconnects determine how effectively parallelism strategies can scale:
| Interconnect | Scope | Bandwidth | Notes |
|---|---|---|---|
| PCIe Gen 5 | Intra-node (GPU to CPU, GPU to GPU) | 64 GB/s per direction | Baseline; insufficient for tensor parallelism at scale |
| NVLink 4.0 (Hopper) | Intra-node (GPU to GPU) | 900 GB/s bidirectional (18 links at 50 GB/s bidirectional each) | Used in H100 systems; standard for tensor parallelism |
| NVLink 5.0 (Blackwell) | Intra-node and intra-rack | 1.8 TB/s bidirectional (18 links at 100 GB/s bidirectional each) | Used in B200 systems; supports up to 576 GPUs in a single NVLink domain |
| NVSwitch | Multi-GPU fabric within a node or rack | Up to 130 TB/s across a 72-GPU NVLink domain (NVL72) | Enables all-to-all NVLink connectivity |
| InfiniBand NDR | Inter-node | 400 Gb/s (50 GB/s) per port | Standard for cross-node pipeline and data parallelism |
| InfiniBand XDR | Inter-node | 800 Gb/s (100 GB/s) per port | Next generation; used in newest clusters |
| RoCE (RDMA over Converged Ethernet) | Inter-node | Up to 400 Gb/s | Lower cost alternative to InfiniBand |
The general principle is to map the most communication-intensive parallelism dimension to the highest-bandwidth interconnect. Tensor parallelism, which requires an all-reduce per layer, is placed within NVLink-connected devices. Pipeline parallelism, which only transfers activations at stage boundaries, can tolerate the lower bandwidth of InfiniBand. Data parallelism overlaps gradient synchronization with backward computation, further reducing its sensitivity to interconnect bandwidth.
NVIDIA's SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) technology offloads collective operations to InfiniBand network switches, reducing the data traversing the network by performing reductions in-network rather than at the endpoints.
Model parallelism is often combined with memory optimization techniques to maximize the effective model size that can be trained:
Activation checkpointing trades computation for memory by discarding intermediate activations during the forward pass and recomputing them during the backward pass. Instead of storing all activations (O(n) memory for n layers), selective checkpointing stores only the activations at certain layer boundaries and recomputes the rest, reducing activation memory from O(n) to O(sqrt(n)). The cost is approximately one additional forward pass worth of computation (roughly 33% overhead). GPipe's re-materialization and PyTorch's torch.utils.checkpoint both implement this technique.
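A minimal example using torch.utils.checkpoint (the block and sizes are placeholders):

```python
import torch
from torch.utils.checkpoint import checkpoint

# The block's intermediate activations are not stored during the forward pass;
# they are recomputed from the checkpointed input during the backward pass.
block = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
)
x = torch.randn(8, 1024, requires_grad=True)
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```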
Mixed-precision training uses lower-precision numerical formats (typically fp16 or bf16) for forward and backward computation while maintaining fp32 master weights for parameter updates. This halves the memory required for activations and enables the use of Tensor Cores on NVIDIA GPUs, which provide significantly higher throughput for reduced-precision matrix operations. The H100 GPU provides 989 TFLOPS for bf16 Tensor Core operations compared to 67 TFLOPS for fp32.
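A minimal mixed-precision sketch in PyTorch (assuming a CUDA device; with bf16 no gradient scaler is needed, whereas fp16 would typically use one):

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()   # parameters remain fp32
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(8, 1024, device="cuda")

# Matrix multiplications inside the autocast region run in bf16 on Tensor Cores,
# while the optimizer update is applied to the fp32 parameters.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(x).float().pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```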
Float8 (FP8) training, supported by Hopper and Blackwell architectures, pushes precision even further down, reducing memory bandwidth requirements and increasing throughput. TorchTitan supports FP8 training with dynamic, delayed, or static per-tensor scaling.
Gradient accumulation allows training with effectively large batch sizes without proportionally increasing memory. Instead of processing one large batch, the system processes multiple smaller micro-batches sequentially, accumulating gradients before performing a single parameter update. This is orthogonal to model parallelism and is commonly used alongside it.
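A minimal gradient-accumulation loop (the model, data, and step counts are placeholders):

```python
import torch

model = torch.nn.Linear(512, 512)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
accum_steps = 4  # 4 micro-batches per optimizer step -> 4x effective batch size

for step, micro_batch in enumerate(torch.randn(8, 16, 512)):
    loss = model(micro_batch).pow(2).mean() / accum_steps  # scale so the sum averages
    loss.backward()                                        # gradients accumulate in .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```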
DeepSpeed's ZeRO-Offload and ZeRO-Infinity extend sharding to heterogeneous memory by offloading optimizer states and parameters to CPU DRAM or NVMe SSDs. ZeRO-Infinity can train models with over 30 trillion parameters on a cluster of 32 NVIDIA DGX-2 nodes by leveraging the aggregate NVMe bandwidth. While this introduces data transfer latency, careful prefetching and overlapping with computation can minimize the impact on throughput.
| Technique | Memory saved | Compute overhead | Applicable to |
|---|---|---|---|
| Activation checkpointing | Reduces activation memory from O(n) to O(sqrt(n)) | ~33% additional forward computation | All model types |
| Mixed-precision (fp16/bf16) | ~50% reduction in activation and weight memory | Minimal (often faster due to Tensor Cores) | All model types |
| FP8 training | ~75% reduction vs. fp32 | Requires careful scaling; supported on Hopper+ | Transformer models |
| ZeRO Stage 1 | Up to 4x reduction in optimizer memory | Minimal | Data-parallel training |
| ZeRO Stage 2 | Up to 8x reduction in optimizer + gradient memory | Minimal | Data-parallel training |
| ZeRO Stage 3 / FSDP | Linear reduction with GPU count | Additional all-gather per layer | Data-parallel training |
| CPU/NVMe offloading | Near-unlimited (limited by CPU/SSD capacity) | Data transfer latency | Any ZeRO stage |
| Gradient accumulation | Decouples batch size from memory | None (just more steps) | All training setups |
Several major frameworks provide implementations of model parallelism strategies:
DeepSpeed, developed by Microsoft Research, is a comprehensive distributed training library built on PyTorch. Its key features include the ZeRO family of memory optimizations (Stages 1 through 3, plus ZeRO-Offload and ZeRO-Infinity), pipeline parallelism, Mixture-of-Experts training (DeepSpeed-MoE and DeepSpeed-TED), and DeepSpeed-Ulysses sequence parallelism.
DeepSpeed is configured through a JSON file and integrates with the Hugging Face Transformers library through the Accelerate package.
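A sketch of such a configuration, written here as the equivalent Python dictionary and enabling ZeRO Stage 3 with optimizer offloading (the batch-size values are placeholders):

```python
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                              # partition optimizer states, gradients, and parameters
        "offload_optimizer": {"device": "cpu"},  # ZeRO-Offload of optimizer states to CPU memory
    },
}
# import deepspeed
# engine, optimizer, _, _ = deepspeed.initialize(model=model, config=ds_config)
```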
Megatron-LM, developed by NVIDIA, is the reference implementation for tensor parallelism in Transformer models. It provides tensor parallelism, pipeline parallelism with interleaved 1F1B scheduling, and sequence parallelism for reducing activation memory.
Megatron-LM has been used to train models up to one trillion parameters. The Megatron-LM v2 paper (Narayanan et al., 2021) demonstrated efficient 3D parallelism on clusters of thousands of A100 GPUs.
PyTorch provides native support for distributed training through several built-in modules, including DistributedDataParallel (DDP) for data parallelism, FullyShardedDataParallel (FSDP, and its successor FSDP2) for sharded data parallelism, and APIs for tensor and pipeline parallelism that can be composed into multi-dimensional configurations.
TorchTitan is a PyTorch-native platform for production-ready LLM pre-training that unifies these components. Released by Meta in 2024 and accepted at ICLR 2025, TorchTitan provides stackable implementations of FSDP2, tensor parallelism, and pipeline parallelism, each introduced as orthogonal layers that can be composed freely. TorchTitan demonstrated accelerations of 65% on Llama 3.1 8B at 128-GPU scale (1D parallelism), 12.6% on Llama 3.1 70B at 256-GPU scale (2D), and 30% on Llama 3.1 405B at 512-GPU scale (3D) over optimized baselines on NVIDIA H100 GPUs.
Alpa (Zheng et al., 2022), published at OSDI 2022, takes a compiler-based approach to automating parallelism. Rather than requiring users to manually specify parallelism strategies, Alpa formulates parallelism planning as an optimization problem with two hierarchical levels: inter-operator parallelism, which partitions the computation graph into stages assigned to groups of devices (generalizing pipeline parallelism), and intra-operator parallelism, which shards individual operators across the devices within each group (generalizing tensor and data parallelism).
Alpa generates parallelization plans that match or outperform hand-tuned model-parallel training systems, even on models those systems were specifically designed for, while also generalizing to novel architectures without manually designed plans.
Colossal-AI provides a unified interface for various parallelism paradigms including data, tensor, pipeline, and sequence parallelism. It focuses on making large-scale distributed training accessible with simplified APIs and automatic parallelism features.
| Framework | Developer | Key parallelism features | Language/Backend |
|---|---|---|---|
| DeepSpeed | Microsoft Research | ZeRO (1/2/3), pipeline, MoE, offloading | PyTorch |
| Megatron-LM | NVIDIA | Tensor, pipeline, sequence parallelism | PyTorch |
| FSDP | Meta / PyTorch | Sharded data parallelism (ZeRO-3 equivalent) | PyTorch (native) |
| TorchTitan | Meta / PyTorch | Composable FSDP2 + TP + PP + CP | PyTorch (native) |
| Alpa | UC Berkeley / Google | Automated inter- and intra-op parallelism | JAX |
| Colossal-AI | HPC-AI Tech | Unified parallelism with simplified APIs | PyTorch |
| Megatron-DeepSpeed | NVIDIA + Microsoft | Combined Megatron tensor/pipeline + ZeRO | PyTorch |
While most literature on model parallelism focuses on training, parallelism is equally important for serving large models at inference time. A model that was trained across many GPUs often still exceeds the memory of a single device when deployed for inference, and latency requirements for real-time serving impose additional constraints.
Tensor parallelism is the most commonly used strategy for inference because it reduces per-token latency. By splitting weight matrices across GPUs, each GPU computes a portion of each layer simultaneously. For autoregressive language models, this means each token generation step completes faster, directly reducing time-to-first-token (TTFT) and inter-token latency. Serving frameworks like vLLM, TensorRT-LLM, and SGLang all support tensor parallelism as a primary scaling mechanism. For example, serving a 70-billion-parameter model typically requires tensor parallelism across 4 or 8 GPUs.
Pipeline parallelism can also be used for inference, though it introduces latency for single requests because the input must traverse all stages sequentially. Its primary benefit for inference is enabling models that exceed the combined memory of a single node to be served across multiple nodes. In practice, pipeline parallelism is used for inference when the model is too large for a single node but latency requirements are not extreme.
MoE models benefit from expert parallelism during inference, distributing experts across GPUs so that the active subset for each token can be computed without requiring any single device to hold all experts. This is critical for models like Mixtral (46.7B total, 12.9B active) and DeepSeek-V3 (671B total, 37B active), where total parameter count far exceeds active parameters.
Modern inference frameworks integrate parallelism with other optimizations:
| Framework | Developer | Parallelism support | Key features |
|---|---|---|---|
| vLLM | UC Berkeley / community | TP, PP, EP, DP | PagedAttention, continuous batching, speculative decoding |
| TensorRT-LLM | NVIDIA | TP, PP, EP | Optimized CUDA kernels, in-flight batching, FP8 support |
| SGLang | UC Berkeley | TP, DP, EP | RadixAttention, constrained decoding |
| text-generation-inference | Hugging Face | TP | Flash decoding, watermarking |
vLLM, one of the most widely adopted open-source inference engines, supports setting tensor_parallel_size and pipeline_parallel_size as simple configuration parameters. For single-node multi-GPU deployments, tensor parallelism is generally preferred for its latency benefits. For multi-node deployments, a combination of tensor parallelism within each node and pipeline parallelism across nodes is the standard approach.
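A minimal sketch of such a single-node deployment with vLLM (the model name is a placeholder and 8 GPUs are assumed):

```python
from vllm import LLM, SamplingParams

# Shard the model's weight matrices across 8 GPUs via tensor parallelism.
llm = LLM(model="meta-llama/Llama-3.1-70B-Instruct", tensor_parallel_size=8)

outputs = llm.generate(
    ["Explain model parallelism in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```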
The parallelism strategies used by leading AI laboratories illustrate how these techniques are combined in practice.
Llama 3 405B was trained on up to 16,384 H100 GPUs (each with 80 GB HBM3) using Meta's Grand Teton AI server platform. The training used the 4D parallelism configuration summarized in the table above: tensor parallelism within each node, pipeline parallelism across groups of nodes, context parallelism for long sequences, and fully sharded data parallelism across the remaining dimension.
The training consumed approximately 3.8 x 10^25 FLOPs over several months and achieved 400 TeraFLOPs per GPU (approximately 40% of peak bf16 throughput on an H100). The training used 15 trillion tokens of data. Debugging and optimizing the 4D parallelism configuration across 16K GPUs was one of the most challenging aspects, as performance issues could propagate across the entire distributed system.
For Llama 4, released in 2025, Meta scaled reinforcement learning training for a two-trillion-parameter MoE model. The training infrastructure was revamped to support flexible allocation of different models to separate GPUs, balancing resources across multiple models based on computational speed. This resulted in approximately 10x improvement in training efficiency over the Llama 3 generation. Llama 4 Maverick uses 128 routed experts with a shared expert, where each token is sent to the shared expert and one of the 128 routed experts.
While OpenAI has not published full details of GPT-4's training infrastructure, it is known that GPT-4 was trained on a Microsoft Azure supercomputer with tens of thousands of NVIDIA A100 GPUs connected via InfiniBand. Based on available information and the scale of the model, GPT-4's training likely used a combination of tensor parallelism (within nodes), pipeline parallelism (across node groups), and data parallelism (across the remaining devices), consistent with 3D parallelism practices established by Megatron-LM and DeepSpeed at the time of training.
Google DeepMind trained the Gemini family of models on TPU v4 and TPU v5 pods. TPU architectures support model parallelism through the XLA compiler and the GSPMD (General and Scalable Parallelization for ML Computation Graphs) system, which automatically partitions computation graphs across TPU meshes. Google's Pathways system enables training across multiple TPU pods connected via data center networks.
DeepSeek trained DeepSeek-V3, a 671-billion-parameter MoE model, using a combination of expert parallelism, tensor parallelism, pipeline parallelism, and data parallelism. The architecture uses 256 experts with top-8 routing and a multi-head latent attention mechanism designed to reduce per-token communication costs. DeepSeek-V3 introduced DualPipe for pipeline parallelism, which overlaps forward and backward computation with the heavy all-to-all communication required by cross-node expert routing. The training was completed on 2,048 NVIDIA H800 GPUs, reportedly costing only $5.6 million in compute, a fraction of what comparable models required. This cost efficiency was attributed to the careful co-design of the model architecture and parallelism strategy.
The following table summarizes the key characteristics of each parallelism strategy to help practitioners choose the right combination for their workload:
| Strategy | Axis of partitioning | Communication overhead | Memory savings | Pipeline bubble | Best suited for | Key systems |
|---|---|---|---|---|---|---|
| Data parallelism | Training data | All-reduce of gradients (once per step) | None (model replicated) | None | Models that fit on one GPU | PyTorch DDP |
| ZeRO / FSDP | Model states (optimizer, gradients, params) | All-gather + reduce-scatter per layer | Up to Nx (N = GPU count) | None | Large models with many data-parallel workers | DeepSpeed, PyTorch FSDP |
| Tensor parallelism | Weight matrices within a layer | All-reduce per layer | Splits weights and activations | None | Layers with large matrices; high-bandwidth intra-node links | Megatron-LM |
| Pipeline parallelism | Consecutive groups of layers | Point-to-point at stage boundaries | Splits model by layers | Yes (bubble fraction ~ (S-1)/M) | Very deep models; cross-node training | GPipe, PipeDream, Megatron-LM |
| Sequence parallelism | Sequence dimension (for non-TP operations) | Replaces all-reduce with all-gather + reduce-scatter | Reduces activation memory | None | Long-sequence training with tensor parallelism | Megatron-LM |
| Context parallelism | Sequence dimension (for attention) | Ring or all-to-all for KV blocks | Scales context length linearly | None | Very long context (64K+ tokens) | Ring Attention, DeepSpeed-Ulysses, TorchTitan |
| Expert parallelism | Expert sub-networks | All-to-all for token routing | Distributes experts across devices | None | MoE models with many experts | DeepSpeed-MoE, Megatron-LM |
Model parallelism offers several benefits, including:
Scalability: Model parallelism allows for the efficient execution of larger models that would otherwise be infeasible due to memory or computation constraints. It is the only way to train models whose parameters, gradients, and optimizer states exceed the memory of a single device.
Faster training and inference: By distributing the model's computation across multiple processing units, model parallelism can reduce total training time. Frontier models that would take years on a single GPU can be trained in weeks or months on large clusters.
Flexibility: Different parallelism strategies can be composed (3D, 4D, 5D parallelism) to match diverse hardware topologies and model architectures.
However, model parallelism also has some drawbacks, such as:
Communication overhead: Model parallelism often requires frequent communication between processing units to exchange intermediate results, gradients, and updates. This communication can introduce latency and consume valuable bandwidth, especially for tensor parallelism over slow interconnects.
Implementation complexity: Implementing model parallelism can be more complex than data parallelism and may require careful consideration of hardware topology, load balancing between pipeline stages, and numerical correctness across partitions.
Pipeline bubbles: Pipeline parallelism introduces idle time (bubbles) at the beginning and end of each mini-batch, reducing hardware utilization unless large numbers of micro-batches are used or advanced scheduling algorithms (such as zero-bubble or DualPipe) are employed.
Debugging difficulty: Bugs in distributed training can be difficult to reproduce and diagnose, as they may manifest only at specific parallelism configurations or scales. Meta's Llama 3 paper highlighted that debugging 4D parallelism across 16K GPUs posed significant practical challenges.
Imagine you and your friends are building a really, really long train out of toy blocks. The train is so big that no one person has enough blocks to build the whole thing. So you split up the work: one friend builds the engine, another builds the middle cars, and another builds the caboose. Each person works on their piece at the same time, and then you connect them all together. That is like pipeline parallelism.
Now imagine one single train car is so complicated that one person cannot build it alone. So two friends work on the same car at the same time, each building half of it side by side. That is like tensor parallelism.
And if you have lots of identical trains to build, you could have different groups each building a complete train. That is like data parallelism.
When you combine all three ideas together, with lots of friends building lots of trains where each train is split up and each complicated part is shared, that is 3D parallelism. It is how the biggest AI models in the world are trained.