# Model Parallelism

> Source: https://aiwiki.ai/wiki/model_parallelism
> Updated: 2026-07-11
> Categories: AI Infrastructure, Deep Learning, Machine Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

*See also: [Machine learning terms](/wiki/machine_learning_terms), [Data parallelism](/wiki/data_parallelism), [GPU computing](/wiki/gpu_computing), [Deep learning](/wiki/deep_learning)*

## What is model parallelism?

Model parallelism is a distributed training and inference technique that splits a single [neural network](/wiki/neural_network) across multiple processing units so that no individual accelerator has to hold the entire model. Different parts of one model run concurrently on different GPUs or TPUs, with communication operations stitching their partial results back together. It is the primary reason today's largest [large language models](/wiki/large_language_model), which range from hundreds of billions to trillions of parameters, can be trained at all: a single GPU with 80 GB of high-bandwidth memory cannot hold the full parameters, gradients, and optimizer states for a 175-billion-parameter model, which can require over 2 TB of memory in mixed-precision training with [Adam](/wiki/adam_optimizer).[1] Model parallelism solves this by distributing the model across many accelerators so that no single device needs to hold everything at once.

The foundational systems paper for tensor model parallelism, Megatron-LM, frames the approach plainly: the authors "implement a simple, efficient intra-layer model parallel approach that enables training transformer models with billions of parameters," illustrated by "converging transformer based models up to 8.3 billion parameters using 512 GPUs."[1] Model parallelism is usually combined with [data parallelism](/wiki/data_parallelism), rather than used as an alternative to it, and the two together underpin every frontier-scale training run.

### Why is model parallelism needed?

As machine learning models grow larger and more complex, they require increasing amounts of memory and computation resources. This growth in demand is driven by the pursuit of improved performance, as larger models tend to have higher representational capacity and can better capture complex patterns in the data. However, this increased size and complexity can lead to several challenges, including:

- **Memory limitations:** The memory capacity of a single processing unit, such as a Graphics Processing Unit (GPU) or [Tensor Processing Unit (TPU)](/wiki/tensor_processing_unit_tpu), may not be sufficient to store the entire model and its associated data. For a model with P parameters trained using mixed-precision Adam, the total memory footprint is approximately $$16P$$ bytes (2 bytes for fp16 parameters, 2 bytes for fp16 gradients, 4 bytes for fp32 master weights, 4 bytes for first moment estimates, and 4 bytes for second moment estimates). A 70-billion-parameter model therefore needs roughly 1.12 TB just for model states, far exceeding the capacity of any single GPU.

- **Computational bottlenecks:** The sheer number of parameters and operations in large models can cause significant delays in training and inference, which may be unacceptable for certain applications or time-sensitive tasks.

- **Scaling laws:** Research by Kaplan et al. (2020) and Hoffmann et al. (2022) has demonstrated that model performance improves predictably with scale, providing strong incentives to train ever-larger models.[13] These [scaling laws](/wiki/scaling_laws) drive the need for parallelism strategies that can efficiently distribute computation across thousands of accelerators.

Model parallelism offers a solution to these challenges by dividing the model's computation across multiple processing units, thereby allowing for the efficient execution of larger and more complex models.

### When did model parallelism emerge?

The roots of model parallelism in [deep learning](/wiki/deep_model) trace back to the earliest days of GPU-accelerated neural network training. Before GPUs became standard, training deep networks on CPUs was prohibitively slow. In the mid-2000s, researchers began adapting GPUs for general-purpose computing, reporting 4x speedups over CPU implementations by 2006 and up to 70x speedups by 2009.

The landmark moment came in 2012 with [AlexNet](/wiki/alexnet), developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton at the University of Toronto. AlexNet contained 60 million parameters and was one of the first deep networks to employ model parallelism in practice. Because the NVIDIA GTX 580 GPUs used for training had only 3 GB of memory each, Krizhevsky split the network across two GPUs, with each GPU processing one half of the feature maps. The two halves communicated only at specific layers.[17] This design was driven by hardware necessity rather than theoretical considerations, but it demonstrated that partitioning a model across devices was both feasible and effective. AlexNet won the 2012 ImageNet Large Scale Visual Recognition Challenge by a wide margin, catalyzing the deep learning revolution.

Through 2013 to 2017, data parallelism dominated multi-GPU training as GPU memory grew and models remained small enough to fit on a single device. Frameworks like DistBelief (Dean et al., 2012) at Google and later [TensorFlow](/wiki/tensorflow) and [PyTorch](/wiki/pytorch) made data-parallel training accessible. However, as [Transformer](/wiki/transformers) architectures emerged in 2017 and models began scaling into the billions of parameters, the limitations of single-device memory became acute once again.

The period from 2019 onward saw a rapid evolution of model parallelism techniques. Megatron-LM (Shoeybi et al., 2019) introduced systematic [tensor](/wiki/tensor) parallelism for Transformers.[1] GPipe (Huang et al., 2019) and PipeDream (Narayanan et al., 2019) formalized pipeline parallelism.[2][3] The ZeRO optimizer (Rajbhandari et al., 2020) bridged the gap between data parallelism and model parallelism by sharding model states.[4] By 2024, training the largest frontier models routinely involved combining four or five forms of parallelism across tens of thousands of GPUs.

## What are the types of model parallelism?

There are several distinct types of model parallelism, each addressing different axes of partitioning and suited to different model architectures and hardware configurations.

### Tensor parallelism

Tensor parallelism (sometimes called intra-layer parallelism or intra-operator parallelism) splits individual layers of a neural network across multiple devices. Rather than placing entire layers on different accelerators, tensor parallelism partitions the weight matrices within a single layer so that each device computes a portion of the layer's output. The partial results are then combined through communication operations such as all-reduce or all-gather.

The foundational work on tensor parallelism for [Transformers](/wiki/transformers) was introduced by Shoeybi et al. (2019) in the Megatron-LM paper.[1] The key insight is that the matrix multiplications inside Transformer layers can be split along specific dimensions with minimal communication overhead. For a [multi-head attention](/wiki/attention) layer, the query, key, and value projections can be split column-wise across devices, with each device computing a subset of the attention heads. For the feed-forward network (FFN), the first linear layer is split column-wise and the second is split row-wise, requiring only a single all-reduce operation per layer during the forward pass and another during the backward pass.[1]

Megatron-LM demonstrated training of models up to 8.3 billion parameters across 512 NVIDIA V100 GPUs, achieving 15.1 PetaFLOPs of sustained performance with 76% scaling efficiency compared to a strong single-GPU baseline of 39 TeraFLOPs (about 30% of peak FLOPs).[1]

**Advantages of tensor parallelism:**

- Low latency per layer, because the split computation can proceed mostly in parallel with only a small synchronization step
- Well-suited for high-bandwidth intra-node interconnects like [NVLink](/wiki/nvlink)
- Does not introduce pipeline bubbles

**Limitations:**

- Requires high-bandwidth interconnects between participating devices; performance degrades sharply over slower links
- Typically limited to the number of GPUs within a single node (e.g., 8 GPUs connected via NVLink)
- For attention layers, the degree of tensor parallelism is bounded by the number of attention heads

### Pipeline parallelism

Pipeline parallelism (also called inter-layer parallelism) partitions a model vertically by assigning consecutive groups of layers (called stages) to different devices. Input data flows through the pipeline from the first stage to the last during the forward pass, and gradients flow in reverse during the backward pass. The two foundational systems for pipeline parallelism are GPipe and PipeDream.

**GPipe** (Huang et al., 2019), developed at Google Brain, introduced micro-batch pipelining.[2] In the authors' words, GPipe "first split a mini-batch of training examples into smaller micro-batches, then pipeline the execution of each set of micro-batches over cells."[2] During the forward pass, each micro-batch moves through all stages, and activations at stage boundaries are stored. During the backward pass, gradients are computed for each micro-batch and accumulated across all M micro-batches before a single synchronous weight update is applied. GPipe uses synchronous gradient accumulation, meaning it produces the same training dynamics as standard synchronous data-parallel training. Google demonstrated GPipe by training a 557-million-parameter AmoebaNet model, achieving a top-1 accuracy of 84.4% on [ImageNet](/wiki/imagenet)-2012.[2] GPipe also uses activation re-materialization (recomputation) to reduce memory: only boundary activations between stages are stored, and intermediate activations within each stage are recomputed during the backward pass.

The main limitation of GPipe is the "pipeline bubble," the idle time that occurs at the start and end of each mini-batch when some pipeline stages have no micro-batch to process. The fraction of idle time is approximately $$(S - 1) / M$$, where S is the number of pipeline stages and M is the number of micro-batches.[2] Increasing M relative to S reduces the bubble but increases memory requirements for storing activations.

**PipeDream** (Narayanan et al., 2019), developed jointly at Microsoft Research and Carnegie Mellon University, took a different approach.[3] PipeDream uses an asynchronous one-forward-one-backward (1F1B) schedule in which each worker alternates between processing a forward pass and a backward pass from different micro-batches. This significantly reduces pipeline idle time compared to GPipe. However, the asynchronous nature means that forward and backward passes for the same micro-batch may use different versions of the model weights, introducing weight staleness. PipeDream addresses this through a technique called weight stashing, which stores multiple versions of model parameters so that each backward pass uses the same weights that were used in the corresponding forward pass. PipeDream demonstrated up to 5.3x speedup over standard data-parallel training.[3] A later variant, PipeDream-2BW (also known as PipeDream-Flush), reduces the memory overhead of weight stashing by maintaining only two weight versions and periodically flushing the pipeline.

**Interleaved 1F1B**, introduced in Megatron-LM v2 (Narayanan et al., 2021), assigns multiple non-consecutive stages to each device to further reduce the pipeline bubble.[5] By interleaving stages, each device processes micro-batches from different positions in the pipeline, keeping the hardware busier during the ramp-up and ramp-down phases.

**Zero-bubble pipeline parallelism** (Qi et al., 2024), accepted at ICLR 2024, achieves near-zero pipeline idle time under synchronous training semantics.[15] The key insight is to split the backward computation into two separate passes: one that computes the gradient with respect to the input (used for the upstream stage) and one that computes the gradient with respect to the parameters (used for weight updates). By scheduling these two backward passes independently, the algorithm can fill in the idle slots that would otherwise form bubbles. The authors also developed an automatic scheduling algorithm that finds optimal pipeline schedules for a given model configuration and memory constraint. Experiments showed up to 23% throughput improvement over standard 1F1B under similar memory limits, increasing to 31% when memory constraints were relaxed.[15]

**DualPipe**, introduced by DeepSeek for training DeepSeek-V3 and DeepSeek-R1 (DeepSeek-AI, 2024), is a bidirectional pipeline parallelism algorithm that achieves full overlap of forward and backward computation with communication.[16] DualPipe divides each micro-batch chunk into four components: attention, all-to-all dispatch, MLP, and all-to-all combine. By scheduling forward and backward passes to flow in opposite directions through the pipeline simultaneously, DualPipe overlaps the communication-heavy operations of one chunk with the computation-heavy operations of another.[16] This is particularly effective for MoE models where the computation-to-communication ratio approaches 1:1 due to heavy cross-node expert routing traffic.

### Sequence parallelism

Sequence parallelism addresses a different bottleneck: the memory consumed by activations that grow with sequence length. In standard tensor parallelism, operations like [LayerNorm](/wiki/layer_normalization), dropout, and residual connections are replicated on every device because they operate on the full hidden dimension. Sequence parallelism, introduced as an extension to Megatron-LM by Korthikanti et al. (2023), partitions these operations along the sequence dimension instead, so each device handles a portion of the sequence for these non-tensor-parallel regions.[6] This reduces activation memory by a factor equal to the tensor parallelism degree without additional communication overhead, since the required all-gather and reduce-scatter operations replace the existing all-reduce operations used in tensor parallelism.[6]

For long-context training, more aggressive forms of sequence parallelism have been developed:

- **DeepSpeed-Ulysses** partitions input tensors along the sequence dimension and uses all-to-all collective communication for distributed attention computation. It has demonstrated support for sequences up to 1 million tokens on 64 A100 GPUs.[12]
- **Ring Attention** (Liu et al., 2023) distributes the query, key, and value blocks across devices arranged in a ring topology. Each device computes attention on its local block while asynchronously sending key-value blocks to the next device in the ring, overlapping communication with computation. Ring Attention enables near-linear scaling of context length with the number of devices.[11]
- **Context Parallelism**, as implemented in Meta's Llama 3 training and in PyTorch's TorchTitan, partitions long sequences across devices for the attention computation. When training [Llama 3](/wiki/llama) 405B with 128K context length, context parallelism with degree 16 was used so that each GPU processed 8K tokens.[9][10]

### Expert parallelism

Expert parallelism is a specialized form of model parallelism designed for [Mixture-of-Experts (MoE)](/wiki/mixture_of_experts) architectures. In an MoE model, each Transformer layer contains a routing mechanism (a gating network) and multiple expert sub-networks (typically feed-forward networks). For any given input token, only a subset of experts (the top-k, often top-1 or top-2) is activated, making the model sparse.

In expert parallelism, the expert sub-networks are distributed across devices so that each device holds a subset of the experts. A routing step determines which tokens go to which experts, and all-to-all communication moves tokens to the devices that hold their selected experts. After expert computation, another all-to-all operation returns the results to the originating devices.

This approach enables training of models with very large total parameter counts while keeping the per-token computation manageable. The [Switch Transformer](/wiki/switch_transformer) (Fedus et al., 2022) reported approximately 4x pre-training speedup over a dense model of comparable quality by using expert parallelism.[8] [DeepSeek-V2](/wiki/deepseek) and [Mixtral](/wiki/mixtral) also rely on expert parallelism for efficient training and inference.

The primary challenge of expert parallelism is communication overhead. The all-to-all operations required for token routing can consume over 40% of total runtime in large-scale training. Techniques like hybrid expert parallelism (combining expert parallelism with data and tensor parallelism) and optimized communication kernels have been developed to mitigate this. DeepSpeed-TED combines data, tensor, and expert parallelism to train MoE models with 4 to 8x larger base models than previous approaches.

## What is 3D parallelism?

### 3D parallelism

Training the largest models typically requires combining multiple parallelism strategies simultaneously. 3D parallelism refers to the combination of [data parallelism](/wiki/data_parallelism), tensor parallelism, and pipeline parallelism. Each strategy is mapped to a different level of the hardware hierarchy based on communication bandwidth considerations:

| Parallelism type | Typical scope | Communication pattern | Bandwidth requirement |
|---|---|---|---|
| Tensor parallelism | Within a single node (e.g., 8 GPUs connected via NVLink) | All-reduce per layer | Very high (requires NVLink or NVSwitch) |
| Pipeline parallelism | Across nodes within a rack or pod | Point-to-point activation transfers at stage boundaries | Moderate (InfiniBand is sufficient) |
| [Data parallelism](/wiki/data_parallelism) | Across all remaining devices | All-reduce of gradients once per training step | Low to moderate (can overlap with computation) |

For example, in a cluster of 64 GPUs arranged as 8 nodes of 8 GPUs each, one might use tensor parallelism degree 8 (within each node), pipeline parallelism degree 4 (across 4 groups of nodes), and data parallelism degree 2 (the remaining factor). This yields $$8 \times 4 \times 2 = 64$$ total devices.

The Megatron-LM v2 paper (Narayanan et al., 2021) presented detailed analysis of 3D parallelism for training models up to one trillion parameters, demonstrating 502 petaFLOPS on 3072 A100 GPUs with 52% of peak GPU throughput.[5]

### 4D parallelism

Meta's training of Llama 3 405B on 16,384 NVIDIA [H100](/wiki/gpu_computing) GPUs used 4D parallelism, adding context parallelism as a fourth dimension.[9] The specific configuration was:

| Dimension | Degree | Description |
|---|---|---|
| Tensor Parallelism (TP) | 8 | Splits weight matrices within each node |
| Pipeline Parallelism (PP) | 16 | Distributes layers across 16 pipeline stages |
| Context Parallelism (CP) | 16 (for 128K context) | Partitions long sequences across devices |
| Data Parallelism (FSDP) | Remaining GPUs | Shards parameters, gradients, and optimizer states |

This configuration achieved approximately 400 TeraFLOPs per GPU with 8K sequence length and 380 TeraFLOPs per GPU with 128K sequence length.[9]

### 5D parallelism

For MoE models, expert parallelism adds a fifth dimension. Training setups for models like Mixtral or [DeepSeek-V3](/wiki/deepseek) combine data, tensor, pipeline, context, and expert parallelism, often referred to as 5D parallelism. DeepSeek-V3, a 671-billion-parameter MoE model with 256 experts, used a specific combination of 16-way pipeline parallelism with DualPipe, 64-way expert parallelism spanning 8 nodes, and ZeRO-1 data parallelism.[16] The DualPipe schedule was critical for overlapping the heavy all-to-all communication required by expert routing with forward and backward computation; the DeepSeek-V3 report notes that with this design "both all-to-all and PP communication can be fully hidden," achieving effective hardware utilization despite the approximately 1:1 computation-to-communication ratio.[16]

## How do FSDP and ZeRO work?

### ZeRO (Zero Redundancy Optimizer)

[ZeRO](/wiki/zero_redundancy_optimizer) (Rajbhandari et al., 2020) is a family of memory optimization techniques developed by Microsoft as part of the [DeepSpeed](/wiki/deepspeed) library.[4] ZeRO eliminates memory redundancy in data-parallel training by partitioning model states across data-parallel processes instead of replicating them. It is implemented as three progressive stages:[4]

| Stage | What is partitioned | Memory reduction | Communication volume |
|---|---|---|---|
| ZeRO Stage 1 (ZeRO-1) | Optimizer states only (e.g., Adam's first and second moment estimates, fp32 master weights) | Up to 4x | Same as standard data parallelism |
| ZeRO Stage 2 (ZeRO-2) | Optimizer states + gradients | Up to 8x | Same as standard data parallelism |
| ZeRO Stage 3 (ZeRO-3) | Optimizer states + gradients + model parameters | Linear reduction with number of GPUs (Nx for N GPUs) | 1.5x of standard data parallelism |

In ZeRO Stage 1, only the optimizer states are partitioned. Each GPU stores $$1/N$$ of the optimizer states (where N is the number of data-parallel GPUs) and updates only its assigned partition of model parameters. After the update, an all-gather operation broadcasts the updated parameters to all GPUs.

ZeRO Stage 2 additionally partitions the gradients. Each GPU retains only the gradients corresponding to its portion of the optimizer states and discards the rest after the reduce-scatter operation.

ZeRO Stage 3 goes further by also partitioning the model parameters themselves. Each GPU stores only $$1/N$$ of the parameters, and all-gather operations are inserted before each forward and backward computation to temporarily reconstruct the full parameters needed for that layer. This enables training models whose total memory footprint far exceeds any single GPU's capacity, at the cost of increased communication.

ZeRO-Offload and ZeRO-Infinity extend these concepts further by offloading partitioned states to CPU memory or NVMe SSDs, enabling training of models with trillions of parameters on limited GPU hardware.[14]

### FSDP (Fully Sharded Data Parallel)

[Fully Sharded Data Parallel (FSDP)](/wiki/fsdp) is PyTorch's native implementation of ZeRO Stage 3 concepts. Originally inspired by FairScale's implementation, FSDP was integrated into PyTorch core and has undergone significant evolution:

- **FSDP1** (PyTorch 1.11+): The original implementation that shards parameters, gradients, and optimizer states across data-parallel workers, with flat-parameter-based sharding.
- **FSDP2** (PyTorch 2.x): A redesigned version with per-parameter sharding, improved composability with other parallelism strategies (tensor parallelism, pipeline parallelism), and better memory management. FSDP2 is the version used in TorchTitan.[10]

FSDP wraps model layers or submodules in FSDP units. Before each forward pass, an all-gather operation collects the full parameters for that unit from all devices. After the computation, the full parameters are discarded and only the local shard is retained. During the backward pass, the same all-gather and discard pattern is repeated. Gradients are reduced via reduce-scatter so that each device accumulates only its shard of the gradients.

**FSDP sharding strategies:**

| Strategy | Behavior | Equivalent ZeRO stage |
|---|---|---|
| FULL_SHARD | Shards parameters, gradients, and optimizer states | ZeRO Stage 3 |
| SHARD_GRAD_OP | Shards gradients and optimizer states; keeps full parameters after forward | ZeRO Stage 2 |
| NO_SHARD | Standard data parallelism with no sharding | Baseline DDP |

### How does FSDP compare to DeepSpeed?

| Aspect | PyTorch FSDP | DeepSpeed ZeRO |
|---|---|---|
| Framework integration | Native PyTorch | Requires DeepSpeed library |
| Configuration | Python API, minimal config | JSON config file + launcher |
| Composability with TP/PP | Strong (especially FSDP2) | Supported but requires more setup |
| CPU/NVMe offloading | Supported | Supported (ZeRO-Offload, ZeRO-Infinity) |
| Best performance range | 100M to 10B parameters | 10B+ parameters |
| Ecosystem | PyTorch-native tooling | Rich feature set (MoE, inference, compression) |

Benchmarks from 2024 and 2025 show nuanced performance differences. FSDP with FULL_SHARD has been observed running up to 5x faster per iteration than DeepSpeed ZeRO-3 on mid-sized models. However, DeepSpeed tends to show advantages at 10B+ parameter scales, where its memory-saving optimizations and offloading capabilities become more significant.

## How does model parallelism differ from data parallelism?

Data parallelism and model parallelism address different scaling bottlenecks and are often used together rather than as alternatives.

| Feature | Data parallelism | Model parallelism |
|---|---|---|
| What is replicated/split | Data is split; model is replicated on each device | Model is split across devices; data may be replicated or also split |
| Memory per device | Full model + optimizer states on each GPU | Only a fraction of model states per GPU |
| Communication pattern | All-reduce of gradients after each step | Varies: all-reduce (tensor), point-to-point (pipeline), all-to-all (expert) |
| Communication frequency | Once per training step | Multiple times per layer (tensor) or per micro-batch (pipeline) |
| Scaling limit | Model must fit on a single device | Can scale beyond single-device memory |
| Implementation complexity | Relatively simple (e.g., PyTorch DDP) | More complex; requires careful partitioning |
| Hardware requirement | Any multi-GPU setup | High-bandwidth interconnects preferred (especially for tensor parallelism) |
| Batch size impact | Effective batch size scales with number of devices | Batch size is independent of parallelism degree |
| [Gradient](/wiki/gradient_descent) synchronization | Synchronous (standard) or asynchronous | Depends on specific strategy |
| Primary use case | Models that fit on one GPU; scaling throughput | Models too large for one GPU; memory-constrained scenarios |

In practice, most large-scale training systems combine both. Data parallelism increases throughput by processing more data in parallel, while model parallelism enables training models that would not fit on a single device.

## Communication primitives and interconnects

Efficient communication is critical for model parallelism performance. The choice of collective operation and hardware interconnect directly impacts training throughput.

### Collective communication operations

Distributed training relies on a set of collective communication primitives, most commonly provided by NVIDIA's NCCL (NVIDIA Collective Communications Library) for GPU clusters:

| Operation | Description | Used by |
|---|---|---|
| **All-Reduce** | Every device sends its local data, and all devices receive the sum (or other reduction) of all data. | Data parallelism (gradient synchronization), tensor parallelism |
| **All-Gather** | Every device sends its local shard, and all devices receive the concatenation of all shards. | FSDP/ZeRO-3 (parameter reconstruction before forward/backward) |
| **Reduce-Scatter** | Each device sends data, and each receives a reduced (summed) portion. Equivalent to a reduce followed by a scatter. | FSDP/ZeRO (gradient sharding), tensor parallelism |
| **All-to-All** | Each device sends a different piece of data to every other device. | Expert parallelism (token routing), sequence parallelism (DeepSpeed-Ulysses) |
| **Broadcast** | One device sends data to all other devices. | Parameter initialization, inference |
| **Point-to-Point (Send/Recv)** | Direct transfer between two specific devices. | Pipeline parallelism (activation/gradient transfer between stages) |

NCCL automatically detects hardware topology (PCIe, NVLink, NVSwitch, InfiniBand, RoCE) and selects optimal algorithms and parameters. NCCL 2.27 (2025) added features for fast inference and resilient training, including fault tolerance for large-scale jobs.

### Hardware interconnects

The bandwidth and latency of interconnects determine how effectively parallelism strategies can scale:

| Interconnect | Scope | Bandwidth | Notes |
|---|---|---|---|
| **PCIe Gen 5** | Intra-node (GPU to CPU, GPU to GPU) | 64 GB/s per direction | Baseline; insufficient for tensor parallelism at scale |
| **NVLink 4.0** (Hopper) | Intra-node (GPU to GPU) | 900 GB/s bidirectional (18 links at 25 GB/s each) | Used in [H100](/wiki/gpu_computing) systems; standard for tensor parallelism |
| **NVLink 5.0** (Blackwell) | Intra-node and intra-rack | 1.8 TB/s bidirectional (18 links at 50 GB/s each) | Used in B200 systems; supports up to 576 GPUs in a single NVLink domain |
| **NVSwitch** | Multi-GPU fabric within a node or rack | Up to 130 TB/s across a 72-GPU NVLink domain (NVL72) | Enables all-to-all NVLink connectivity |
| **InfiniBand NDR** | Inter-node | 400 Gb/s (50 GB/s) per port | Standard for cross-node pipeline and data parallelism |
| **InfiniBand XDR** | Inter-node | 800 Gb/s (100 GB/s) per port | Next generation; used in newest clusters |
| **RoCE (RDMA over Converged Ethernet)** | Inter-node | Up to 400 Gb/s | Lower cost alternative to InfiniBand |

The general principle is to map the most communication-intensive parallelism dimension to the highest-bandwidth interconnect. Tensor parallelism, which requires an all-reduce per layer, is placed within NVLink-connected devices. Pipeline parallelism, which only transfers activations at stage boundaries, can tolerate the lower bandwidth of InfiniBand. Data parallelism overlaps gradient synchronization with backward computation, further reducing its sensitivity to interconnect bandwidth.

NVIDIA's SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) technology offloads collective operations to InfiniBand network switches, reducing the data traversing the network by performing reductions in-network rather than at the endpoints.

## Memory optimization techniques

Model parallelism is often combined with memory optimization techniques to maximize the effective model size that can be trained:

### Activation checkpointing (gradient checkpointing)

[Activation checkpointing](/wiki/gradient_checkpointing) trades computation for memory by discarding intermediate activations during the forward pass and recomputing them during the backward pass. Instead of storing all activations ($$O(n)$$ memory for n layers), selective checkpointing stores only the activations at certain layer boundaries and recomputes the rest, reducing activation memory from $$O(n)$$ to $$O(\sqrt{n})$$. The cost is approximately one additional forward pass worth of computation (roughly 33% overhead). GPipe's re-materialization and PyTorch's `torch.utils.checkpoint` both implement this technique.

### Mixed-precision training

[Mixed-precision training](/wiki/mixed_precision_training) uses lower-precision numerical formats (typically [fp16](/wiki/floating_point) or [bf16](/wiki/bfloat16)) for forward and backward computation while maintaining fp32 master weights for parameter updates. This halves the memory required for activations and enables the use of Tensor Cores on NVIDIA GPUs, which provide significantly higher throughput for reduced-precision matrix operations. The H100 GPU provides 989 TFLOPS for bf16 Tensor Core operations compared to 67 TFLOPS for fp32.

Float8 (FP8) training, supported by Hopper and Blackwell architectures, pushes precision even further down, reducing memory bandwidth requirements and increasing throughput. TorchTitan supports FP8 training with dynamic, delayed, or static per-tensor scaling.

### Gradient accumulation

[Gradient accumulation](/wiki/gradient_accumulation) allows training with effectively large batch sizes without proportionally increasing memory. Instead of processing one large batch, the system processes multiple smaller micro-batches sequentially, accumulating gradients before performing a single parameter update. This is orthogonal to model parallelism and is commonly used alongside it.

### CPU and NVMe offloading

DeepSpeed's ZeRO-Offload and ZeRO-Infinity extend sharding to heterogeneous memory by offloading optimizer states and parameters to CPU DRAM or NVMe SSDs. ZeRO-Infinity can train models with over 30 trillion parameters on a cluster of 32 NVIDIA DGX-2 nodes by leveraging the aggregate NVMe bandwidth.[14] While this introduces data transfer latency, careful prefetching and overlapping with computation can minimize the impact on throughput.

### Summary of memory optimization techniques

| Technique | Memory saved | Compute overhead | Applicable to |
|---|---|---|---|
| Activation checkpointing | Reduces activation memory from $$O(n)$$ to $$O(\sqrt{n})$$ | ~33% additional forward computation | All model types |
| Mixed-precision (fp16/bf16) | ~50% reduction in activation and weight memory | Minimal (often faster due to Tensor Cores) | All model types |
| FP8 training | ~75% reduction vs. fp32 | Requires careful scaling; supported on Hopper+ | Transformer models |
| ZeRO Stage 1 | Up to 4x reduction in optimizer memory | Minimal | Data-parallel training |
| ZeRO Stage 2 | Up to 8x reduction in optimizer + gradient memory | Minimal | Data-parallel training |
| ZeRO Stage 3 / FSDP | Linear reduction with GPU count | Additional all-gather per layer | Data-parallel training |
| CPU/NVMe offloading | Near-unlimited (limited by CPU/SSD capacity) | Data transfer latency | Any ZeRO stage |
| Gradient accumulation | Decouples batch size from memory | None (just more steps) | All training setups |

## What frameworks and tools support model parallelism?

Several major frameworks provide implementations of model parallelism strategies:

### DeepSpeed

[DeepSpeed](/wiki/deepspeed), developed by Microsoft Research, is a comprehensive distributed training library built on PyTorch. Its key features include:

- ZeRO Stages 1, 2, and 3 for memory-efficient data parallelism
- ZeRO-Offload and ZeRO-Infinity for CPU/NVMe offloading
- Pipeline parallelism with the PipeDream-Flush schedule
- MoE support with expert parallelism
- DeepSpeed-Ulysses for sequence parallelism
- DeepSpeed-TED for hybrid tensor-expert-data parallelism
- [Inference](/wiki/inference) optimizations including model compression and quantization

DeepSpeed is configured through a JSON file and integrates with the [Hugging Face](/wiki/hugging_face) Transformers library through the Accelerate package.

### Megatron-LM

Megatron-LM, developed by [NVIDIA](/wiki/nvidia), is the reference implementation for tensor parallelism in Transformer models. It provides:

- Tensor parallelism for attention and feed-forward layers
- Interleaved pipeline parallelism with reduced bubble overhead
- Sequence parallelism for LayerNorm and dropout regions
- Integration with DeepSpeed (Megatron-DeepSpeed) for combined tensor + pipeline + ZeRO parallelism

Megatron-LM has been used to train models up to one trillion parameters. The Megatron-LM v2 paper (Narayanan et al., 2021) demonstrated efficient 3D parallelism on clusters of thousands of A100 GPUs.[5]

### PyTorch FSDP and TorchTitan

PyTorch provides native support for distributed training through several built-in modules:

- **DistributedDataParallel (DDP):** Standard data parallelism with gradient all-reduce
- **FSDP (Fully Sharded Data Parallel):** ZeRO-3-style parameter sharding
- **DeviceMesh and DTensor:** Abstractions for multi-dimensional parallelism

**TorchTitan** is a PyTorch-native platform for production-ready LLM pre-training that unifies these components. Released by Meta in 2024 and accepted at ICLR 2025, TorchTitan provides stackable implementations of FSDP2, tensor parallelism, and pipeline parallelism, each introduced as orthogonal layers that can be composed freely.[10] TorchTitan demonstrated accelerations of 65% on Llama 3.1 8B at 128-GPU scale (1D parallelism), 12.6% on Llama 3.1 70B at 256-GPU scale (2D), and 30% on Llama 3.1 405B at 512-GPU scale (3D) over optimized baselines on NVIDIA H100 GPUs.[10]

### Alpa

Alpa (Zheng et al., 2022), published at OSDI 2022, takes a compiler-based approach to automating parallelism.[7] Rather than requiring users to manually specify parallelism strategies, Alpa formulates parallelism planning as an optimization problem with two hierarchical levels:

1. **Intra-operator parallelism** (within a single operator): Solves for the best way to partition tensors across devices using an Integer Linear Programming (ILP) formulation
2. **Inter-operator parallelism** (across operators): Uses dynamic programming to assign groups of operators (pipeline stages) to device submeshes

Alpa generates parallelization plans that match or outperform hand-tuned model-parallel training systems, even on models those systems were specifically designed for, while also generalizing to novel architectures without manually designed plans.[7]

### Colossal-AI

[Colossal-AI](/wiki/colossal_ai) provides a unified interface for various parallelism paradigms including data, tensor, pipeline, and sequence parallelism. It focuses on making large-scale distributed training accessible with simplified APIs and automatic parallelism features.

### Summary of frameworks

| Framework | Developer | Key parallelism features | Language/Backend |
|---|---|---|---|
| [DeepSpeed](/wiki/deepspeed) | Microsoft Research | ZeRO (1/2/3), pipeline, MoE, offloading | PyTorch |
| Megatron-LM | [NVIDIA](/wiki/nvidia) | Tensor, pipeline, sequence parallelism | PyTorch |
| [FSDP](/wiki/fsdp) | Meta / PyTorch | Sharded data parallelism (ZeRO-3 equivalent) | PyTorch (native) |
| TorchTitan | Meta / PyTorch | Composable FSDP2 + TP + PP + CP | PyTorch (native) |
| Alpa | UC Berkeley / Google | Automated inter- and intra-op parallelism | [JAX](/wiki/jax) |
| [Colossal-AI](/wiki/colossal_ai) | HPC-AI Tech | Unified parallelism with simplified APIs | PyTorch |
| Megatron-DeepSpeed | NVIDIA + Microsoft | Combined Megatron tensor/pipeline + ZeRO | PyTorch |

## How is model parallelism used for inference?

While most literature on model parallelism focuses on training, parallelism is equally important for serving large models at inference time. A model that was trained across many GPUs often still exceeds the memory of a single device when deployed for inference, and latency requirements for real-time serving impose additional constraints.

### Tensor parallelism for inference

Tensor parallelism is the most commonly used strategy for inference because it reduces per-token latency. By splitting weight matrices across GPUs, each GPU computes a portion of each layer simultaneously. For autoregressive [language models](/wiki/language_model), this means each token generation step completes faster, directly reducing time-to-first-token (TTFT) and inter-token latency. Serving frameworks like vLLM, TensorRT-LLM, and SGLang all support tensor parallelism as a primary scaling mechanism. For example, serving a 70-billion-parameter model typically requires tensor parallelism across 4 or 8 GPUs.

### Pipeline parallelism for inference

Pipeline parallelism can also be used for inference, though it introduces latency for single requests because the input must traverse all stages sequentially. Its primary benefit for inference is enabling models that exceed the combined memory of a single node to be served across multiple nodes. In practice, pipeline parallelism is used for inference when the model is too large for a single node but latency requirements are not extreme.

### Expert parallelism for inference

MoE models benefit from expert parallelism during inference, distributing experts across GPUs so that the active subset for each token can be computed without requiring any single device to hold all experts. This is critical for models like [Mixtral](/wiki/mixtral) (46.7B total, 12.9B active) and DeepSeek-V3 (671B total, 37B active), where total parameter count far exceeds active parameters.

### Inference serving frameworks

Modern inference frameworks integrate parallelism with other optimizations:

| Framework | Developer | Parallelism support | Key features |
|---|---|---|---|
| vLLM | UC Berkeley / community | TP, PP, EP, DP | PagedAttention, continuous batching, speculative decoding |
| TensorRT-LLM | NVIDIA | TP, PP, EP | Optimized CUDA kernels, in-flight batching, FP8 support |
| SGLang | UC Berkeley | TP, DP, EP | RadixAttention, constrained decoding |
| text-generation-inference | Hugging Face | TP | Flash decoding, watermarking |

vLLM, one of the most widely adopted open-source inference engines, supports setting `tensor_parallel_size` and `pipeline_parallel_size` as simple configuration parameters. For single-node multi-GPU deployments, tensor parallelism is generally preferred for its latency benefits. For multi-node deployments, a combination of tensor parallelism within each node and pipeline parallelism across nodes is the standard approach.

## How do frontier labs train large models?

The parallelism strategies used by leading AI laboratories illustrate how these techniques are combined in practice.

### Meta: Llama 3 405B and Llama 4

Llama 3 405B was trained on up to 16,384 H100 GPUs (each with 80 GB HBM3) using Meta's Grand Teton AI server platform. The training used 4D parallelism:

- Tensor parallelism (TP=8) within each 8-GPU node via NVLink
- Pipeline parallelism (PP=16) across nodes
- Fully Sharded Data Parallelism for parameter/gradient/optimizer sharding
- Context parallelism (CP=16) for 128K context-length training

The training consumed approximately $$3.8 \times 10^{25}$$ FLOPs over several months and achieved 400 TeraFLOPs per GPU (approximately 57% of peak bf16 throughput on H100). The training used 15 trillion tokens of data. Debugging and optimizing the 4D parallelism configuration across 16K GPUs was one of the most challenging aspects, as performance issues could propagate across the entire distributed system.[9]

For [Llama 4](/wiki/llama), released in 2025, Meta scaled reinforcement learning training for a two-trillion-parameter MoE model. The training infrastructure was revamped to support flexible allocation of different models to separate GPUs, balancing resources across multiple models based on computational speed. This resulted in approximately 10x improvement in training efficiency over the Llama 3 generation. Llama 4 Maverick uses 128 routed experts with a shared expert, where each token is sent to the shared expert and one of the 128 routed experts.

### OpenAI: GPT-4

While [OpenAI](/wiki/openai) has not published full details of [GPT-4](/wiki/gpt-4)'s training infrastructure, it is known that GPT-4 was trained on a Microsoft Azure supercomputer with tens of thousands of NVIDIA A100 GPUs connected via InfiniBand. Based on available information and the scale of the model, GPT-4's training likely used a combination of tensor parallelism (within nodes), pipeline parallelism (across node groups), and data parallelism (across the remaining devices), consistent with 3D parallelism practices established by Megatron-LM and DeepSpeed at the time of training.

### Google DeepMind: Gemini

[Google DeepMind](/wiki/google_deepmind) trained the [Gemini](/wiki/gemini) family of models on [TPU v4](/wiki/tensor_processing_unit_tpu) and TPU v5 pods. TPU architectures support model parallelism through the XLA compiler and the GSPMD (General and Scalable Parallelization for ML Computation Graphs) system, which automatically partitions computation graphs across TPU meshes. Google's Pathways system enables training across multiple TPU pods connected via data center networks.

### DeepSeek: DeepSeek-V3

[DeepSeek](/wiki/deepseek) trained DeepSeek-V3, a 671-billion-parameter MoE model, using a combination of expert parallelism, tensor parallelism, pipeline parallelism, and data parallelism. The architecture uses 256 experts with top-8 routing and a multi-head latent attention mechanism designed to reduce per-token communication costs. DeepSeek-V3 introduced DualPipe for pipeline parallelism, which overlaps forward and backward computation with the heavy all-to-all communication required by cross-node expert routing. The training was completed on 2,048 NVIDIA H800 GPUs, reportedly costing only $5.6 million in compute, a fraction of what comparable models required.[16] This cost efficiency was attributed to the careful co-design of the model architecture and parallelism strategy.

## Parallelism strategy comparison

The following table summarizes the key characteristics of each parallelism strategy to help practitioners choose the right combination for their workload:

| Strategy | Axis of partitioning | Communication overhead | Memory savings | Pipeline bubble | Best suited for | Key systems |
|---|---|---|---|---|---|---|
| [Data parallelism](/wiki/data_parallelism) | Training data | All-reduce of gradients (once per step) | None (model replicated) | None | Models that fit on one GPU | PyTorch DDP |
| ZeRO / [FSDP](/wiki/fsdp) | Model states (optimizer, gradients, params) | All-gather + reduce-scatter per layer | Up to Nx (N = GPU count) | None | Large models with many data-parallel workers | [DeepSpeed](/wiki/deepspeed), PyTorch FSDP |
| Tensor parallelism | Weight matrices within a layer | All-reduce per layer | Splits weights and activations | None | Layers with large matrices; high-bandwidth intra-node links | Megatron-LM |
| Pipeline parallelism | Consecutive groups of layers | Point-to-point at stage boundaries | Splits model by layers | Yes (bubble fraction $$\approx (S-1)/M$$) | Very deep models; cross-node training | GPipe, PipeDream, Megatron-LM |
| Sequence parallelism | Sequence dimension (for non-TP operations) | Replaces all-reduce with all-gather + reduce-scatter | Reduces activation memory | None | Long-sequence training with tensor parallelism | Megatron-LM |
| Context parallelism | Sequence dimension (for attention) | Ring or all-to-all for KV blocks | Scales context length linearly | None | Very long context (64K+ tokens) | Ring Attention, DeepSpeed-Ulysses, TorchTitan |
| Expert parallelism | Expert sub-networks | All-to-all for token routing | Distributes experts across devices | None | [MoE](/wiki/mixture_of_experts) models with many experts | DeepSpeed-MoE, Megatron-LM |

## Advantages and disadvantages

Model parallelism offers several benefits, including:

- **Scalability**: Model parallelism allows for the efficient execution of larger models that would otherwise be infeasible due to memory or computation constraints. It is the only way to train models whose parameters, gradients, and optimizer states exceed the memory of a single device.

- **Faster training and inference**: By distributing the model's computation across multiple processing units, model parallelism can reduce total training time. Frontier models that would take years on a single GPU can be trained in weeks or months on large clusters.

- **Flexibility**: Different parallelism strategies can be composed (3D, 4D, 5D parallelism) to match diverse hardware topologies and model architectures.

However, model parallelism also has some drawbacks, such as:

- **Communication overhead**: Model parallelism often requires frequent communication between processing units to exchange intermediate results, gradients, and updates. This communication can introduce latency and consume valuable bandwidth, especially for tensor parallelism over slow interconnects.

- **Implementation complexity**: Implementing model parallelism can be more complex than data parallelism and may require careful consideration of hardware topology, load balancing between pipeline stages, and numerical correctness across partitions.

- **Pipeline bubbles**: Pipeline parallelism introduces idle time (bubbles) at the beginning and end of each mini-batch, reducing hardware utilization unless large numbers of micro-batches are used or advanced scheduling algorithms (such as zero-bubble or DualPipe) are employed.

- **Debugging difficulty**: Bugs in distributed training can be difficult to reproduce and diagnose, as they may manifest only at specific parallelism configurations or scales. Meta's Llama 3 paper highlighted that debugging 4D parallelism across 16K GPUs posed significant practical challenges.[9]

## Explain like I'm 5 (ELI5)

Imagine you and your friends are building a really, really long train out of toy blocks. The train is so big that no one person has enough blocks to build the whole thing. So you split up the work: one friend builds the engine, another builds the middle cars, and another builds the caboose. Each person works on their piece at the same time, and then you connect them all together. That is like pipeline parallelism.

Now imagine one single train car is so complicated that one person cannot build it alone. So two friends work on the same car at the same time, each building half of it side by side. That is like tensor parallelism.

And if you have lots of identical trains to build, you could have different groups each building a complete train. That is like data parallelism.

When you combine all three ideas together, with lots of friends building lots of trains where each train is split up and each complicated part is shared, that is 3D parallelism. It is how the biggest AI models in the world are trained.

## References

1. Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., & Catanzaro, B. (2019). "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism." *arXiv preprint arXiv:1909.08053*.

2. Huang, Y., Cheng, Y., Bapna, A., Firat, O., Chen, M. X., Chen, D., Lee, H., Ngiam, J., Le, Q. V., Wu, Y., & Chen, Z. (2019). "GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism." *Advances in Neural Information Processing Systems ([NeurIPS](/wiki/neurips)) 2019*. arXiv:1811.06965.

3. Narayanan, D., Harlap, A., Phanishayee, A., Seshadri, V., Devanur, N. R., Ganger, G. R., Gibbons, P. B., & Zaharia, M. (2019). "PipeDream: Generalized Pipeline Parallelism for DNN Training." *Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP 2019)*.

4. Rajbhandari, S., Rasley, J., Ruwase, O., & He, Y. (2020). "ZeRO: Memory Optimizations Toward Training Trillion Parameter Models." *Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '20)*.

5. Narayanan, D., Shoeybi, M., Casper, J., LeGresley, P., Patwary, M., Korthikanti, V., Vainbrand, D., Kasber, P., Zaharia, M., & Catanzaro, B. (2021). "Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM." *Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '21)*.

6. Korthikanti, V., Casper, J., Lym, S., McAfee, L., Andersch, M., Shoeybi, M., & Catanzaro, B. (2023). "Reducing Activation Recomputation in Large Transformer Models." *Proceedings of Machine Learning and Systems (MLSys 2023)*.

7. Zheng, L., Li, Z., Zhang, H., Zhuang, Y., Chen, Z., Huang, Y., Wang, J., Xu, Y., Zhuo, D., Xing, E. P., Gonzalez, J. E., & Stoica, I. (2022). "Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning." *Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI '22)*.

8. Fedus, W., Zoph, B., & Shazeer, N. (2022). "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity." *Journal of Machine Learning Research, 23*(120), 1-39.

9. Dubey, A., et al. (2024). "The Llama 3 Herd of Models." *arXiv preprint arXiv:2407.21783*.

10. Liang, W., et al. (2024). "TorchTitan: One-stop PyTorch Native Solution for Production Ready LLM [Pre-training](/wiki/pre-training)." *arXiv preprint arXiv:2410.06511*. Accepted at ICLR 2025.

11. Liu, H., Zaharia, M., & Abbeel, P. (2023). "Ring Attention with Blockwise Transformers for Near-Infinite Context." *arXiv preprint arXiv:2310.01889*.

12. Jacobs, S. A., et al. (2023). "DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models." *arXiv preprint arXiv:2309.14509*.

13. Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., & Amodei, D. (2020). "[Scaling Laws for Neural Language Models](/wiki/scaling_laws_paper)." *arXiv preprint arXiv:2001.08361*.

14. Rajbhandari, S., Ruwase, O., Rasley, J., Smith, S., & He, Y. (2021). "ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning." *Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '21)*.

15. Qi, P., Wan, X., Huang, G., & Lin, M. (2024). "Zero Bubble Pipeline Parallelism." *Proceedings of the International Conference on Learning Representations (ICLR 2024)*. arXiv:2401.10241.

16. DeepSeek-AI. (2024). "DeepSeek-V3 Technical Report." *arXiv preprint arXiv:2412.19437*.

17. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). "ImageNet Classification with Deep Convolutional Neural Networks." *Advances in Neural Information Processing Systems 25 (NeurIPS 2012)*.