See also: Machine learning terms, Data parallelism, GPU computing, Deep learning
Model parallelism is an approach in machine learning that addresses the computational challenges posed by the increasing size and complexity of modern neural network models. It works by executing different parts of a single model concurrently on multiple processing units. As large language models have grown from millions to hundreds of billions (and even trillions) of parameters, model parallelism has become essential for training and serving these systems. A single GPU with 80 GB of high-bandwidth memory cannot hold the full parameters, gradients, and optimizer states for a 175-billion-parameter model, which can require over 2 TB of memory in mixed-precision training with Adam. Model parallelism solves this by distributing different parts of the model across many accelerators so that no single device needs to hold everything at once.
As machine learning models grow larger and more complex, they require increasing amounts of memory and computation resources. This growth in demand is driven by the pursuit of improved performance, as larger models tend to have higher representational capacity and can better capture complex patterns in the data. However, this increased size and complexity can lead to several challenges, including:
Memory limitations: The memory capacity of a single processing unit, such as a Graphics Processing Unit (GPU) or Tensor Processing Unit (TPU), may not be sufficient to store the entire model and its associated data. For a model with P parameters trained using mixed-precision Adam, the total memory footprint is approximately 16P bytes (2 bytes for fp16 parameters, 2 bytes for fp16 gradients, 4 bytes for fp32 master weights, 4 bytes for first moment estimates, and 4 bytes for second moment estimates). A 70-billion-parameter model therefore needs roughly 1.12 TB just for model states, far exceeding the capacity of any single GPU (a short calculation of this estimate appears below).
Computational bottlenecks: The sheer number of parameters and operations in large models can cause significant delays in training and inference, which may be unacceptable for certain applications or time-sensitive tasks.
Scaling laws: Research by Kaplan et al. (2020) and Hoffmann et al. (2022) has demonstrated that model performance improves predictably with scale, providing strong incentives to train ever-larger models. These scaling laws drive the need for parallelism strategies that can efficiently distribute computation across thousands of accelerators.
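As a rough illustration of the 16-bytes-per-parameter estimate above, the following Python sketch (the helper name and example sizes are illustrative) computes the model-state footprint for a few model sizes:

```python
def model_state_bytes(num_params: int, bytes_per_param: int = 16) -> int:
    """Approximate model-state memory under mixed-precision Adam:
    2 (fp16 params) + 2 (fp16 grads) + 4 (fp32 master weights)
    + 4 (first moments) + 4 (second moments) = 16 bytes per parameter."""
    return num_params * bytes_per_param

for billions in (7, 70, 175):
    tb = model_state_bytes(billions * 10**9) / 10**12
    print(f"{billions}B parameters -> ~{tb:.2f} TB of model states")
# 70B -> ~1.12 TB and 175B -> ~2.80 TB, matching the figures quoted above.
```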
Model parallelism offers a solution to these challenges by dividing the model's computation across multiple processing units, thereby allowing for the efficient execution of larger and more complex models.
The roots of model parallelism in deep learning trace back to the earliest days of GPU-accelerated neural network training. Before GPUs became standard, training deep networks on CPUs was prohibitively slow. In the mid-2000s, researchers began adapting GPUs for general-purpose computing, reporting 4x speedups over CPU implementations by 2006 and up to 70x speedups by 2009.
The landmark moment came in 2012 with AlexNet, developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton at the University of Toronto. AlexNet contained 60 million parameters and was one of the first deep networks to employ model parallelism in practice. Because the NVIDIA GTX 580 GPUs used for training had only 3 GB of memory each, Krizhevsky split the network across two GPUs, with each GPU processing one half of the feature maps. The two halves communicated only at specific layers. This design was driven by hardware necessity rather than theoretical considerations, but it demonstrated that partitioning a model across devices was both feasible and effective. AlexNet won the 2012 ImageNet Large Scale Visual Recognition Challenge by a wide margin, catalyzing the deep learning revolution.
From 2013 to 2017, data parallelism dominated multi-GPU training as GPU memory grew and models remained small enough to fit on a single device. Frameworks like DistBelief (Dean et al., 2012) at Google and later TensorFlow and PyTorch made data-parallel training accessible. However, as Transformer architectures emerged in 2017 and models began scaling into the billions of parameters, the limitations of single-device memory became acute once again.
The period from 2019 onward saw a rapid evolution of model parallelism techniques. Megatron-LM (Shoeybi et al., 2019) introduced systematic tensor parallelism for Transformers. GPipe (Huang et al., 2019) and PipeDream (Narayanan et al., 2019) formalized pipeline parallelism. The ZeRO optimizer (Rajbhandari et al., 2020) bridged the gap between data parallelism and model parallelism by sharding model states. By 2024, training the largest frontier models routinely involved combining four or five forms of parallelism across tens of thousands of GPUs.
There are several distinct types of model parallelism, each addressing different axes of partitioning and suited to different model architectures and hardware configurations.
Tensor parallelism (sometimes called intra-layer parallelism or intra-operator parallelism) splits individual layers of a neural network across multiple devices. Rather than placing entire layers on different accelerators, tensor parallelism partitions the weight matrices within a single layer so that each device computes a portion of the layer's output. The partial results are then combined through communication operations such as all-reduce or all-gather.
The foundational work on tensor parallelism for Transformers was introduced by Shoeybi et al. (2019) in the Megatron-LM paper. The key insight is that the matrix multiplications inside Transformer layers can be split along specific dimensions with minimal communication overhead. For a multi-head attention layer, the query, key, and value projections can be split column-wise across devices, with each device computing a subset of the attention heads. For the feed-forward network (FFN), the first linear layer is split column-wise and the second is split row-wise, requiring only a single all-reduce operation per layer during the forward pass and another during the backward pass.
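The column-wise/row-wise split can be illustrated with plain tensors on a single process. The following is a minimal sketch of the idea (the shapes, the two-way split, and the use of ReLU in place of GeLU are illustrative); the final sum of partial outputs stands in for the all-reduce that real tensor parallelism performs across devices:

```python
import torch

torch.manual_seed(0)
batch, hidden, ffn = 4, 8, 32
x = torch.randn(batch, hidden)
w1 = torch.randn(hidden, ffn)   # first FFN weight (split column-wise)
w2 = torch.randn(ffn, hidden)   # second FFN weight (split row-wise)

# Reference (single-device) computation.
ref = torch.relu(x @ w1) @ w2

# "Device 0" and "device 1" each hold half of w1's columns and half of w2's rows.
w1_a, w1_b = w1.chunk(2, dim=1)
w2_a, w2_b = w2.chunk(2, dim=0)

# Column-parallel first layer: each device computes half of the activation;
# the elementwise nonlinearity needs no communication.
h_a = torch.relu(x @ w1_a)
h_b = torch.relu(x @ w1_b)

# Row-parallel second layer: each device produces a partial output; summing the
# partials plays the role of the all-reduce in real tensor parallelism.
out = h_a @ w2_a + h_b @ w2_b

assert torch.allclose(out, ref, atol=1e-5)
```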
Megatron-LM demonstrated training of models up to 8.3 billion parameters across 512 NVIDIA V100 GPUs, achieving 15.1 PetaFLOPs of sustained performance with 76% scaling efficiency compared to a strong single-GPU baseline of 39 TeraFLOPs.
Advantages of tensor parallelism include the absence of a pipeline bubble, the fact that both the weight matrices and the corresponding activations of each layer are split across devices, and, for Transformers, the need for only a single all-reduce per layer in each of the forward and backward passes.
Its main limitations are the very high interconnect bandwidth it requires, which in practice confines it to NVLink-connected GPUs within a single node, and the communication it incurs several times per layer, which makes it sensitive to slow interconnects.
Pipeline parallelism (also called inter-layer parallelism) partitions a model vertically by assigning consecutive groups of layers (called stages) to different devices. Input data flows through the pipeline from the first stage to the last during the forward pass, and gradients flow in reverse during the backward pass. The two foundational systems for pipeline parallelism are GPipe and PipeDream.
GPipe (Huang et al., 2019), developed at Google Brain, introduced micro-batch pipelining. GPipe divides each mini-batch into M smaller micro-batches, which are then fed through the pipeline in sequence. During the forward pass, each micro-batch moves through all stages, and activations at stage boundaries are stored. During the backward pass, gradients are computed for each micro-batch and accumulated across all M micro-batches before a single synchronous weight update is applied. GPipe uses synchronous gradient accumulation, meaning it produces the same training dynamics as standard synchronous data-parallel training. Google demonstrated GPipe by training a 557-million-parameter AmoebaNet-B model, achieving 84.3% top-1 accuracy on ImageNet. GPipe also uses activation re-materialization (recomputation) to reduce memory: only boundary activations between stages are stored, and intermediate activations within each stage are recomputed during the backward pass.
The main limitation of GPipe is the "pipeline bubble," the idle time that occurs at the start and end of each mini-batch when some pipeline stages have no micro-batch to process. The fraction of idle time is approximately (S - 1) / M, where S is the number of pipeline stages and M is the number of micro-batches. Increasing M relative to S reduces the bubble but increases memory requirements for storing activations.
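A quick sanity check of this formula (the function name and example values are illustrative):

```python
def bubble_overhead(num_stages: int, num_microbatches: int) -> float:
    """Pipeline bubble as a fraction of the ideal (bubble-free) compute time,
    using the (S - 1) / M approximation described above."""
    return (num_stages - 1) / num_microbatches

print(bubble_overhead(4, 8))    # 0.375 -> substantial idle time with few micro-batches
print(bubble_overhead(4, 64))   # ~0.047 -> many micro-batches shrink the bubble
```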
PipeDream (Narayanan et al., 2019), developed jointly at Microsoft Research and Carnegie Mellon University, took a different approach. PipeDream uses an asynchronous one-forward-one-backward (1F1B) schedule in which each worker alternates between processing a forward pass and a backward pass from different micro-batches. This significantly reduces pipeline idle time compared to GPipe. However, the asynchronous nature means that forward and backward passes for the same micro-batch may use different versions of the model weights, introducing weight staleness. PipeDream addresses this through a technique called weight stashing, which stores multiple versions of model parameters so that each backward pass uses the same weights that were used in the corresponding forward pass. PipeDream demonstrated up to 5.3x speedup over standard data-parallel training. Later variants reduce the memory overhead of weight stashing: PipeDream-2BW maintains only two weight versions (double-buffered weights), while PipeDream-Flush keeps a single weight version and periodically flushes the pipeline.
Interleaved 1F1B, introduced in Megatron-LM v2 (Narayanan et al., 2021), assigns multiple non-consecutive stages to each device to further reduce the pipeline bubble. By interleaving stages, each device processes micro-batches from different positions in the pipeline, keeping the hardware busier during the ramp-up and ramp-down phases.
Zero-bubble pipeline parallelism (Qi et al., 2024), accepted at ICLR 2024, achieves near-zero pipeline idle time under synchronous training semantics. The key insight is to split the backward computation into two separate passes: one that computes the gradient with respect to the input (used for the upstream stage) and one that computes the gradient with respect to the parameters (used for weight updates). By scheduling these two backward passes independently, the algorithm can fill in the idle slots that would otherwise form bubbles. The authors also developed an automatic scheduling algorithm that finds optimal pipeline schedules for a given model configuration and memory constraint. Experiments showed up to 23% throughput improvement over standard 1F1B under similar memory limits, increasing to 31% when memory constraints were relaxed.
DualPipe, introduced by DeepSeek for training DeepSeek-V3 and DeepSeek-R1 (DeepSeek-AI, 2024), is a bidirectional pipeline parallelism algorithm that achieves full overlap of forward and backward computation with communication. DualPipe divides each micro-batch chunk into four components: attention, all-to-all dispatch, MLP, and all-to-all combine. By scheduling forward and backward passes to flow in opposite directions through the pipeline simultaneously, DualPipe overlaps the communication-heavy operations of one chunk with the computation-heavy operations of another. This is particularly effective for MoE models where the computation-to-communication ratio approaches 1:1 due to heavy cross-node expert routing traffic.
Sequence parallelism addresses a different bottleneck: the memory consumed by activations that grow with sequence length. In standard tensor parallelism, operations like LayerNorm, dropout, and residual connections are replicated on every device because they operate on the full hidden dimension. Sequence parallelism, introduced as an extension to Megatron-LM by Korthikanti et al. (2023), partitions these operations along the sequence dimension instead, so each device handles a portion of the sequence for these non-tensor-parallel regions. This reduces activation memory by a factor equal to the tensor parallelism degree without additional communication overhead, since the required all-gather and reduce-scatter operations replace the existing all-reduce operations used in tensor parallelism.
For long-context training, more aggressive forms of sequence parallelism have been developed. Ring Attention passes blocks of keys and values around a ring of devices so that attention over very long sequences never requires any single device to hold the full sequence, while DeepSpeed-Ulysses shards the sequence dimension and uses all-to-all communication to redistribute data for the attention computation. Context parallelism, used in TorchTitan and in Llama 3's 128K-token-context training, applies the same idea as an additional parallelism dimension.
Expert parallelism is a specialized form of model parallelism designed for Mixture-of-Experts (MoE) architectures. In an MoE model, each Transformer layer contains a routing mechanism (a gating network) and multiple expert sub-networks (typically feed-forward networks). For any given input token, only a subset of experts (the top-k, often top-1 or top-2) is activated, making the model sparse.
In expert parallelism, the expert sub-networks are distributed across devices so that each device holds a subset of the experts. A routing step determines which tokens go to which experts, and all-to-all communication moves tokens to the devices that hold their selected experts. After expert computation, another all-to-all operation returns the results to the originating devices.
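The routing step can be sketched on a single process as follows. In a real expert-parallel setup the dispatch and combine steps would be all-to-all operations across devices rather than local indexing, and all sizes here are illustrative:

```python
import torch

torch.manual_seed(0)
num_tokens, hidden, num_experts, top_k = 16, 32, 4, 2
tokens = torch.randn(num_tokens, hidden)
gate = torch.nn.Linear(hidden, num_experts)
experts = torch.nn.ModuleList(
    [torch.nn.Linear(hidden, hidden) for _ in range(num_experts)]
)

# Gating: pick the top-k experts per token and normalize their weights.
scores = gate(tokens).softmax(dim=-1)
weights, chosen = scores.topk(top_k, dim=-1)          # (tokens, k)
weights = weights / weights.sum(dim=-1, keepdim=True)

# Dispatch/combine: each expert processes the tokens routed to it, and the
# weighted results are summed back into the original token order. In expert
# parallelism, the dispatch and combine steps are all-to-all communications.
output = torch.zeros_like(tokens)
for e, expert in enumerate(experts):
    token_idx, slot = (chosen == e).nonzero(as_tuple=True)
    if token_idx.numel():
        w = weights[token_idx, slot].unsqueeze(-1)
        output[token_idx] += w * expert(tokens[token_idx])
```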
This approach enables training of models with very large total parameter counts while keeping the per-token computation manageable. The Switch Transformer (Fedus et al., 2022) reported approximately 4x pre-training speedup over a dense model of comparable quality by using expert parallelism. DeepSeek-V2 and Mixtral also rely on expert parallelism for efficient training and inference.
The primary challenge of expert parallelism is communication overhead. The all-to-all operations required for token routing can consume over 40% of total runtime in large-scale training. Techniques like hybrid expert parallelism (combining expert parallelism with data and tensor parallelism) and optimized communication kernels have been developed to mitigate this. DeepSpeed-TED combines data, tensor, and expert parallelism to train MoE models with 4 to 8x larger base models than previous approaches.
Training the largest models typically requires combining multiple parallelism strategies simultaneously. 3D parallelism refers to the combination of data parallelism, tensor parallelism, and pipeline parallelism. Each strategy is mapped to a different level of the hardware hierarchy based on communication bandwidth considerations:
| Parallelism type | Typical scope | Communication pattern | Bandwidth requirement |
|---|---|---|---|
| Tensor parallelism | Within a single node (e.g., 8 GPUs connected via NVLink) | All-reduce per layer | Very high (requires NVLink or NVSwitch) |
| Pipeline parallelism | Across nodes within a rack or pod | Point-to-point activation transfers at stage boundaries | Moderate (InfiniBand is sufficient) |
| Data parallelism | Across all remaining devices | All-reduce of gradients once per training step | Low to moderate (can overlap with computation) |
For example, in a cluster of 64 GPUs arranged as 8 nodes of 8 GPUs each, one might use tensor parallelism degree 8 (within each node), pipeline parallelism degree 4 (across 4 groups of nodes), and data parallelism degree 2 (the remaining factor). This yields 8 x 4 x 2 = 64 total devices.
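In PyTorch, such a layout can be described with a device mesh. The following is a hedged sketch (it assumes a recent PyTorch release, must be launched with torchrun across 64 processes, and the dimension names are arbitrary):

```python
from torch.distributed.device_mesh import init_device_mesh

# 64 GPUs arranged as (data=2, pipeline=4, tensor=8); the innermost "tp"
# dimension maps consecutive ranks, i.e. the GPUs within one NVLink node.
mesh = init_device_mesh("cuda", (2, 4, 8), mesh_dim_names=("dp", "pp", "tp"))

tp_group = mesh["tp"].get_group()  # all-reduce per layer over NVLink
pp_group = mesh["pp"].get_group()  # point-to-point activation transfers across nodes
dp_group = mesh["dp"].get_group()  # gradient all-reduce across replicas
```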
The Megatron-LM v2 paper (Narayanan et al., 2021) presented detailed analysis of 3D parallelism for training models up to one trillion parameters, demonstrating 502 petaFLOPS on 3072 A100 GPUs with 52% of peak GPU throughput.
Meta's training of Llama 3 405B on 16,384 NVIDIA H100 GPUs used 4D parallelism, adding context parallelism as a fourth dimension. The specific configuration was:
| Dimension | Degree | Description |
|---|---|---|
| Tensor Parallelism (TP) | 8 | Splits weight matrices within each node |
| Pipeline Parallelism (PP) | 16 | Distributes layers across 16 pipeline stages |
| Context Parallelism (CP) | 16 (for 128K context) | Partitions long sequences across devices |
| Data Parallelism (FSDP) | Remaining GPUs | Shards parameters, gradients, and optimizer states |
This configuration achieved approximately 400 TeraFLOPs per GPU with 8K sequence length and 380 TeraFLOPs per GPU with 128K sequence length.
For MoE models, expert parallelism adds a fifth dimension. Training setups for models like Mixtral or DeepSeek-V3 combine data, tensor, pipeline, context, and expert parallelism, often referred to as 5D parallelism. DeepSeek-V3, a 671-billion-parameter MoE model with 256 experts, used a specific combination of 16-way pipeline parallelism with DualPipe, 64-way expert parallelism spanning 8 nodes, and ZeRO-1 data parallelism. The DualPipe schedule was critical for overlapping the heavy all-to-all communication required by expert routing with forward and backward computation, achieving effective hardware utilization despite the approximately 1:1 computation-to-communication ratio.
ZeRO (Rajbhandari et al., 2020) is a family of memory optimization techniques developed by Microsoft as part of the DeepSpeed library. ZeRO eliminates memory redundancy in data-parallel training by partitioning model states across data-parallel processes instead of replicating them. It is implemented as three progressive stages:
| Stage | What is partitioned | Memory reduction | Communication volume |
|---|---|---|---|
| ZeRO Stage 1 (ZeRO-1) | Optimizer states only (e.g., Adam's first and second moment estimates, fp32 master weights) | Up to 4x | Same as standard data parallelism |
| ZeRO Stage 2 (ZeRO-2) | Optimizer states + gradients | Up to 8x | Same as standard data parallelism |
| ZeRO Stage 3 (ZeRO-3) | Optimizer states + gradients + model parameters | Linear reduction with number of GPUs (Nx for N GPUs) | 1.5x of standard data parallelism |
In ZeRO Stage 1, only the optimizer states are partitioned. Each GPU stores 1/N of the optimizer states (where N is the number of data-parallel GPUs) and updates only its assigned partition of model parameters. After the update, an all-gather operation broadcasts the updated parameters to all GPUs.
ZeRO Stage 2 additionally partitions the gradients. Each GPU retains only the gradients corresponding to its portion of the optimizer states and discards the rest after the reduce-scatter operation.
ZeRO Stage 3 goes further by also partitioning the model parameters themselves. Each GPU stores only 1/N of the parameters, and all-gather operations are inserted before each forward and backward computation to temporarily reconstruct the full parameters needed for that layer. This enables training models whose total memory footprint far exceeds any single GPU's capacity, at the cost of increased communication.
ZeRO-Offload and ZeRO-Infinity extend these concepts further by offloading partitioned states to CPU memory or NVMe SSDs, enabling training of models with trillions of parameters on limited GPU hardware.
Fully Sharded Data Parallel (FSDP) is PyTorch's native implementation of ZeRO Stage 3 concepts. Originally inspired by FairScale's implementation, FSDP was integrated into PyTorch core and has undergone significant evolution, most recently with FSDP2, which reworks sharding on a per-parameter basis and composes more cleanly with tensor and pipeline parallelism.
FSDP wraps model layers or submodules in FSDP units. Before each forward pass, an all-gather operation collects the full parameters for that unit from all devices. After the computation, the full parameters are discarded and only the local shard is retained. During the backward pass, the same all-gather and discard pattern is repeated. Gradients are reduced via reduce-scatter so that each device accumulates only its shard of the gradients.
FSDP sharding strategies:
| Strategy | Behavior | Equivalent ZeRO stage |
|---|---|---|
| FULL_SHARD | Shards parameters, gradients, and optimizer states | ZeRO Stage 3 |
| SHARD_GRAD_OP | Shards gradients and optimizer states; keeps full parameters after forward | ZeRO Stage 2 |
| NO_SHARD | Standard data parallelism with no sharding | Baseline DDP |
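A minimal FSDP usage sketch (assuming a torchrun launch with one process per GPU; the toy model and hyperparameters are placeholders):

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 4096)
).cuda()

# FULL_SHARD corresponds to ZeRO Stage 3: parameters, gradients, and optimizer
# states are sharded, and parameters are all-gathered just in time for each unit.
model = FSDP(model, sharding_strategy=ShardingStrategy.FULL_SHARD)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

loss = model(torch.randn(8, 4096, device="cuda")).pow(2).mean()
loss.backward()   # gradients are reduce-scattered to their owning shards
optimizer.step()  # each rank updates only its shard of the optimizer states
```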
The following table compares PyTorch FSDP with DeepSpeed ZeRO:
| Aspect | PyTorch FSDP | DeepSpeed ZeRO |
|---|---|---|
| Framework integration | Native PyTorch | Requires DeepSpeed library |
| Configuration | Python API, minimal config | JSON config file + launcher |
| Composability with TP/PP | Strong (especially FSDP2) | Supported but requires more setup |
| CPU/NVMe offloading | Supported | Supported (ZeRO-Offload, ZeRO-Infinity) |
| Best performance range | 100M to 10B parameters | 10B+ parameters |
| Ecosystem | PyTorch-native tooling | Rich feature set (MoE, inference, compression) |
Benchmarks from 2024 and 2025 show nuanced performance differences. FSDP with FULL_SHARD has been observed running up to 5x faster per iteration than DeepSpeed ZeRO-3 on mid-sized models. However, DeepSpeed tends to show advantages at 10B+ parameter scales, where its memory-saving optimizations and offloading capabilities become more significant.
Data parallelism and model parallelism address different scaling bottlenecks and are often used together rather than as alternatives.
| Feature | Data parallelism | Model parallelism |
|---|---|---|
| What is replicated/split | Data is split; model is replicated on each device | Model is split across devices; data may be replicated or also split |
| Memory per device | Full model + optimizer states on each GPU | Only a fraction of model states per GPU |
| Communication pattern | All-reduce of gradients after each step | Varies: all-reduce (tensor), point-to-point (pipeline), all-to-all (expert) |
| Communication frequency | Once per training step | Multiple times per layer (tensor) or per micro-batch (pipeline) |
| Scaling limit | Model must fit on a single device | Can scale beyond single-device memory |
| Implementation complexity | Relatively simple (e.g., PyTorch DDP) | More complex; requires careful partitioning |
| Hardware requirement | Any multi-GPU setup | High-bandwidth interconnects preferred (especially for tensor parallelism) |
| Batch size impact | Effective batch size scales with number of devices | Batch size is independent of parallelism degree |
| Gradient synchronization | Synchronous (standard) or asynchronous | Depends on specific strategy |
| Primary use case | Models that fit on one GPU; scaling throughput | Models too large for one GPU; memory-constrained scenarios |
In practice, most large-scale training systems combine both. Data parallelism increases throughput by processing more data in parallel, while model parallelism enables training models that would not fit on a single device.
Efficient communication is critical for model parallelism performance. The choice of collective operation and hardware interconnect directly impacts training throughput.
Distributed training relies on a set of collective communication primitives, most commonly provided by NVIDIA's NCCL (NVIDIA Collective Communications Library) for GPU clusters:
| Operation | Description | Used by |
|---|---|---|
| All-Reduce | Every device sends its local data, and all devices receive the sum (or other reduction) of all data. | Data parallelism (gradient synchronization), tensor parallelism |
| All-Gather | Every device sends its local shard, and all devices receive the concatenation of all shards. | FSDP/ZeRO-3 (parameter reconstruction before forward/backward) |
| Reduce-Scatter | Each device sends data, and each receives a reduced (summed) portion. Equivalent to a reduce followed by a scatter. | FSDP/ZeRO (gradient sharding), tensor parallelism |
| All-to-All | Each device sends a different piece of data to every other device. | Expert parallelism (token routing), sequence parallelism (DeepSpeed-Ulysses) |
| Broadcast | One device sends data to all other devices. | Parameter initialization, inference |
| Point-to-Point (Send/Recv) | Direct transfer between two specific devices. | Pipeline parallelism (activation/gradient transfer between stages) |
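The following sketch exercises three of these primitives through torch.distributed (assuming a torchrun launch with one process per GPU; tensor sizes are illustrative):

```python
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
rank, world = dist.get_rank(), dist.get_world_size()
device = torch.device("cuda", rank % torch.cuda.device_count())
torch.cuda.set_device(device)

# All-reduce: every rank ends up with the sum of all local gradients.
grad = torch.full((4,), float(rank), device=device)
dist.all_reduce(grad, op=dist.ReduceOp.SUM)

# All-gather: reconstruct a full tensor from per-rank shards (FSDP/ZeRO-3 style).
shard = torch.full((4,), float(rank), device=device)
full = torch.empty(4 * world, device=device)
dist.all_gather_into_tensor(full, shard)

# Reduce-scatter: each rank receives one reduced slice of the full tensor.
local = torch.empty(4, device=device)
dist.reduce_scatter_tensor(local, full)
```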
NCCL automatically detects hardware topology (PCIe, NVLink, NVSwitch, InfiniBand, RoCE) and selects optimal algorithms and parameters. NCCL 2.27 (2025) added features for fast inference and resilient training, including fault tolerance for large-scale jobs.
The bandwidth and latency of interconnects determine how effectively parallelism strategies can scale:
| Interconnect | Scope | Bandwidth | Notes |
|---|---|---|---|
| PCIe Gen 5 | Intra-node (GPU to CPU, GPU to GPU) | 64 GB/s per direction | Baseline; insufficient for tensor parallelism at scale |
| NVLink 4.0 (Hopper) | Intra-node (GPU to GPU) | 900 GB/s bidirectional (18 links at 50 GB/s bidirectional each) | Used in H100 systems; standard for tensor parallelism |
| NVLink 5.0 (Blackwell) | Intra-node and intra-rack | 1.8 TB/s bidirectional (18 links at 100 GB/s bidirectional each) | Used in B200 systems; supports up to 576 GPUs in a single NVLink domain |
| NVSwitch | Multi-GPU fabric within a node or rack | Up to 130 TB/s across a 72-GPU NVLink domain (NVL72) | Enables all-to-all NVLink connectivity |
| InfiniBand NDR | Inter-node | 400 Gb/s (50 GB/s) per port | Standard for cross-node pipeline and data parallelism |
| InfiniBand XDR | Inter-node | 800 Gb/s (100 GB/s) per port | Next generation; used in newest clusters |
| RoCE (RDMA over Converged Ethernet) | Inter-node | Up to 400 Gb/s | Lower cost alternative to InfiniBand |
The general principle is to map the most communication-intensive parallelism dimension to the highest-bandwidth interconnect. Tensor parallelism, which requires an all-reduce per layer, is placed within NVLink-connected devices. Pipeline parallelism, which only transfers activations at stage boundaries, can tolerate the lower bandwidth of InfiniBand. Data parallelism overlaps gradient synchronization with backward computation, further reducing its sensitivity to interconnect bandwidth.
NVIDIA's SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) technology offloads collective operations to InfiniBand network switches, reducing the data traversing the network by performing reductions in-network rather than at the endpoints.
Model parallelism is often combined with memory optimization techniques to maximize the effective model size that can be trained:
Activation checkpointing trades computation for memory by discarding intermediate activations during the forward pass and recomputing them during the backward pass. Instead of storing all activations (O(n) memory for n layers), selective checkpointing stores only the activations at certain layer boundaries and recomputes the rest, reducing activation memory from O(n) to O(sqrt(n)). The cost is approximately one additional forward pass worth of computation (roughly 33% overhead). GPipe's re-materialization and PyTorch's torch.utils.checkpoint both implement this technique.
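A minimal example using torch.utils.checkpoint (the block and sizes are placeholders):

```python
import torch
from torch.utils.checkpoint import checkpoint

# The block's intermediate activations are not stored during the forward pass;
# they are recomputed from the checkpointed input during the backward pass.
block = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
)
x = torch.randn(8, 1024, requires_grad=True)
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```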
Mixed-precision training uses lower-precision numerical formats (typically fp16 or bf16) for forward and backward computation while maintaining fp32 master weights for parameter updates. This halves the memory required for activations and enables the use of Tensor Cores on NVIDIA GPUs, which provide significantly higher throughput for reduced-precision matrix operations. The H100 GPU provides 989 TFLOPS for bf16 Tensor Core operations compared to 67 TFLOPS for fp32.
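A minimal mixed-precision sketch in PyTorch (assuming a CUDA device; with bf16 no gradient scaler is needed, whereas fp16 would typically use one):

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()   # parameters remain fp32
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(8, 1024, device="cuda")

# Matrix multiplications inside the autocast region run in bf16 on Tensor Cores,
# while the optimizer update is applied to the fp32 parameters.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(x).float().pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```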
Float8 (FP8) training, supported by Hopper and Blackwell architectures, pushes precision even further down, reducing memory bandwidth requirements and increasing throughput. TorchTitan supports FP8 training with dynamic, delayed, or static per-tensor scaling.
Gradient accumulation allows training with effectively large batch sizes without proportionally increasing memory. Instead of processing one large batch, the system processes multiple smaller micro-batches sequentially, accumulating gradients before performing a single parameter update. This is orthogonal to model parallelism and is commonly used alongside it.
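A minimal gradient-accumulation loop (the model, data, and step counts are placeholders):

```python
import torch

model = torch.nn.Linear(512, 512)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
accum_steps = 4  # 4 micro-batches per optimizer step -> 4x effective batch size

for step, micro_batch in enumerate(torch.randn(8, 16, 512)):
    loss = model(micro_batch).pow(2).mean() / accum_steps  # scale so the sum averages
    loss.backward()                                        # gradients accumulate in .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```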
DeepSpeed's ZeRO-Offload and ZeRO-Infinity extend sharding to heterogeneous memory by offloading optimizer states and parameters to CPU DRAM or NVMe SSDs. ZeRO-Infinity can train models with over 30 trillion parameters on a cluster of 32 NVIDIA DGX-2 nodes by leveraging the aggregate NVMe bandwidth. While this introduces data transfer latency, careful prefetching and overlapping with computation can minimize the impact on throughput.
| Technique | Memory saved | Compute overhead | Applicable to |
|---|---|---|---|
| Activation checkpointing | Reduces activation memory from O(n) to O(sqrt(n)) | ~33% additional forward computation | All model types |
| Mixed-precision (fp16/bf16) | ~50% reduction in activation and weight memory | Minimal (often faster due to Tensor Cores) | All model types |
| FP8 training | ~75% reduction vs. fp32 | Requires careful scaling; supported on Hopper+ | Transformer models |
| ZeRO Stage 1 | Up to 4x reduction in optimizer memory | Minimal | Data-parallel training |
| ZeRO Stage 2 | Up to 8x reduction in optimizer + gradient memory | Minimal | Data-parallel training |
| ZeRO Stage 3 / FSDP | Linear reduction with GPU count | Additional all-gather per layer | Data-parallel training |
| CPU/NVMe offloading | Near-unlimited (limited by CPU/SSD capacity) | Data transfer latency | Any ZeRO stage |
| Gradient accumulation | Decouples batch size from memory | None (just more steps) | All training setups |
Several major frameworks provide implementations of model parallelism strategies:
DeepSpeed, developed by Microsoft Research, is a comprehensive distributed training library built on PyTorch. Its key features include the ZeRO family of memory optimizations (Stages 1 through 3, plus ZeRO-Offload and ZeRO-Infinity), pipeline parallelism, Mixture-of-Experts training (DeepSpeed-MoE and DeepSpeed-TED), and DeepSpeed-Ulysses sequence parallelism.
DeepSpeed is configured through a JSON file and integrates with the Hugging Face Transformers library through the Accelerate package.
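A sketch of such a configuration, written here as the equivalent Python dictionary and enabling ZeRO Stage 3 with optimizer offloading (the batch-size values are placeholders):

```python
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                              # partition optimizer states, gradients, and parameters
        "offload_optimizer": {"device": "cpu"},  # ZeRO-Offload of optimizer states to CPU memory
    },
}
# import deepspeed
# engine, optimizer, _, _ = deepspeed.initialize(model=model, config=ds_config)
```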
Megatron-LM, developed by NVIDIA, is the reference implementation for tensor parallelism in Transformer models. It provides tensor parallelism, pipeline parallelism with interleaved 1F1B scheduling, and sequence parallelism for reducing activation memory.
Megatron-LM has been used to train models up to one trillion parameters. The Megatron-LM v2 paper (Narayanan et al., 2021) demonstrated efficient 3D parallelism on clusters of thousands of A100 GPUs.
PyTorch provides native support for distributed training through several built-in modules, including DistributedDataParallel (DDP) for data parallelism, FullyShardedDataParallel (FSDP, and its successor FSDP2) for sharded data parallelism, and APIs for tensor and pipeline parallelism that can be composed into multi-dimensional configurations.
TorchTitan is a PyTorch-native platform for production-ready LLM pre-training that unifies these components. Released by Meta in 2024 and accepted at ICLR 2025, TorchTitan provides stackable implementations of FSDP2, tensor parallelism, and pipeline parallelism, each introduced as orthogonal layers that can be composed freely. TorchTitan demonstrated accelerations of 65% on Llama 3.1 8B at 128-GPU scale (1D parallelism), 12.6% on Llama 3.1 70B at 256-GPU scale (2D), and 30% on Llama 3.1 405B at 512-GPU scale (3D) over optimized baselines on NVIDIA H100 GPUs.
Alpa (Zheng et al., 2022), published at OSDI 2022, takes a compiler-based approach to automating parallelism. Rather than requiring users to manually specify parallelism strategies, Alpa formulates parallelism planning as an optimization problem with two hierarchical levels: inter-operator parallelism, which partitions the computation graph into stages assigned to groups of devices (generalizing pipeline parallelism), and intra-operator parallelism, which shards individual operators across the devices within each group (generalizing tensor and data parallelism).
Alpa generates parallelization plans that match or outperform hand-tuned model-parallel training systems, even on models those systems were specifically designed for, while also generalizing to novel architectures without manually designed plans.
Colossal-AI provides a unified interface for various parallelism paradigms including data, tensor, pipeline, and sequence parallelism. It focuses on making large-scale distributed training accessible with simplified APIs and automatic parallelism features.
| Framework | Developer | Key parallelism features | Language/Backend |
|---|---|---|---|
| DeepSpeed | Microsoft Research | ZeRO (1/2/3), pipeline, MoE, offloading | PyTorch |
| Megatron-LM | NVIDIA | Tensor, pipeline, sequence parallelism | PyTorch |
| FSDP | Meta / PyTorch | Sharded data parallelism (ZeRO-3 equivalent) | PyTorch (native) |
| TorchTitan | Meta / PyTorch | Composable FSDP2 + TP + PP + CP | PyTorch (native) |
| Alpa | UC Berkeley / Google | Automated inter- and intra-op parallelism | JAX |
| Colossal-AI | HPC-AI Tech | Unified parallelism with simplified APIs | PyTorch |
| Megatron-DeepSpeed | NVIDIA + Microsoft | Combined Megatron tensor/pipeline + ZeRO | PyTorch |
While most literature on model parallelism focuses on training, parallelism is equally important for serving large models at inference time. A model that was trained across many GPUs often still exceeds the memory of a single device when deployed for inference, and latency requirements for real-time serving impose additional constraints.
Tensor parallelism is the most commonly used strategy for inference because it reduces per-token latency. By splitting weight matrices across GPUs, each GPU computes a portion of each layer simultaneously. For autoregressive language models, this means each token generation step completes faster, directly reducing time-to-first-token (TTFT) and inter-token latency. Serving frameworks like vLLM, TensorRT-LLM, and SGLang all support tensor parallelism as a primary scaling mechanism. For example, serving a 70-billion-parameter model typically requires tensor parallelism across 4 or 8 GPUs.
Pipeline parallelism can also be used for inference, though it introduces latency for single requests because the input must traverse all stages sequentially. Its primary benefit for inference is enabling models that exceed the combined memory of a single node to be served across multiple nodes. In practice, pipeline parallelism is used for inference when the model is too large for a single node but latency requirements are not extreme.
MoE models benefit from expert parallelism during inference, distributing experts across GPUs so that the active subset for each token can be computed without requiring any single device to hold all experts. This is critical for models like Mixtral (46.7B total, 12.9B active) and DeepSeek-V3 (671B total, 37B active), where total parameter count far exceeds active parameters.
Modern inference frameworks integrate parallelism with other optimizations:
| Framework | Developer | Parallelism support | Key features |
|---|---|---|---|
| vLLM | UC Berkeley / community | TP, PP, EP, DP | PagedAttention, continuous batching, speculative decoding |
| TensorRT-LLM | NVIDIA | TP, PP, EP | Optimized CUDA kernels, in-flight batching, FP8 support |
| SGLang | UC Berkeley | TP, DP, EP | RadixAttention, constrained decoding |
| text-generation-inference | Hugging Face | TP | Flash decoding, watermarking |
vLLM, one of the most widely adopted open-source inference engines, supports setting tensor_parallel_size and pipeline_parallel_size as simple configuration parameters. For single-node multi-GPU deployments, tensor parallelism is generally preferred for its latency benefits. For multi-node deployments, a combination of tensor parallelism within each node and pipeline parallelism across nodes is the standard approach.
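A minimal sketch of such a single-node deployment with vLLM (the model name is a placeholder and 8 GPUs are assumed):

```python
from vllm import LLM, SamplingParams

# Shard the model's weight matrices across 8 GPUs via tensor parallelism.
llm = LLM(model="meta-llama/Llama-3.1-70B-Instruct", tensor_parallel_size=8)

outputs = llm.generate(
    ["Explain model parallelism in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```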
The parallelism strategies used by leading AI laboratories illustrate how these techniques are combined in practice.
Llama 3 405B was trained on up to 16,384 H100 GPUs (each with 80 GB HBM3) using Meta's Grand Teton AI server platform. The training used the 4D parallelism configuration summarized in the table above: tensor parallelism within each node, pipeline parallelism across groups of nodes, context parallelism for long sequences, and fully sharded data parallelism across the remaining dimension.
The training consumed approximately 3.8 x 10^25 FLOPs over several months and achieved 400 TeraFLOPs per GPU (approximately 40% of peak bf16 throughput on an H100). The training used 15 trillion tokens of data. Debugging and optimizing the 4D parallelism configuration across 16K GPUs was one of the most challenging aspects, as performance issues could propagate across the entire distributed system.
For Llama 4, released in 2025, Meta scaled reinforcement learning training for a two-trillion-parameter MoE model. The training infrastructure was revamped to support flexible allocation of different models to separate GPUs, balancing resources across multiple models based on computational speed. This resulted in approximately 10x improvement in training efficiency over the Llama 3 generation. Llama 4 Maverick uses 128 routed experts with a shared expert, where each token is sent to the shared expert and one of the 128 routed experts.
While OpenAI has not published full details of GPT-4's training infrastructure, it is known that GPT-4 was trained on a Microsoft Azure supercomputer with tens of thousands of NVIDIA A100 GPUs connected via InfiniBand. Based on available information and the scale of the model, GPT-4's training likely used a combination of tensor parallelism (within nodes), pipeline parallelism (across node groups), and data parallelism (across the remaining devices), consistent with 3D parallelism practices established by Megatron-LM and DeepSpeed at the time of training.
Google DeepMind trained the Gemini family of models on TPU v4 and TPU v5 pods. TPU architectures support model parallelism through the XLA compiler and the GSPMD (General and Scalable Parallelization for ML Computation Graphs) system, which automatically partitions computation graphs across TPU meshes. Google's Pathways system enables training across multiple TPU pods connected via data center networks.
DeepSeek trained DeepSeek-V3, a 671-billion-parameter MoE model, using a combination of expert parallelism, tensor parallelism, pipeline parallelism, and data parallelism. The architecture uses 256 experts with top-8 routing and a multi-head latent attention mechanism designed to reduce per-token communication costs. DeepSeek-V3 introduced DualPipe for pipeline parallelism, which overlaps forward and backward computation with the heavy all-to-all communication required by cross-node expert routing. The training was completed on 2,048 NVIDIA H800 GPUs, reportedly costing only $5.6 million in compute, a fraction of what comparable models required. This cost efficiency was attributed to the careful co-design of the model architecture and parallelism strategy.
The following table summarizes the key characteristics of each parallelism strategy to help practitioners choose the right combination for their workload:
| Strategy | Axis of partitioning | Communication overhead | Memory savings | Pipeline bubble | Best suited for | Key systems |
|---|---|---|---|---|---|---|
| Data parallelism | Training data | All-reduce of gradients (once per step) | None (model replicated) | None | Models that fit on one GPU | PyTorch DDP |
| ZeRO / FSDP | Model states (optimizer, gradients, params) | All-gather + reduce-scatter per layer | Up to Nx (N = GPU count) | None | Large models with many data-parallel workers | DeepSpeed, PyTorch FSDP |
| Tensor parallelism | Weight matrices within a layer | All-reduce per layer | Splits weights and activations | None | Layers with large matrices; high-bandwidth intra-node links | Megatron-LM |
| Pipeline parallelism | Consecutive groups of layers | Point-to-point at stage boundaries | Splits model by layers | Yes (bubble fraction ~ (S-1)/M) | Very deep models; cross-node training | GPipe, PipeDream, Megatron-LM |
| Sequence parallelism | Sequence dimension (for non-TP operations) | Replaces all-reduce with all-gather + reduce-scatter | Reduces activation memory | None | Long-sequence training with tensor parallelism | Megatron-LM |
| Context parallelism | Sequence dimension (for attention) | Ring or all-to-all for KV blocks | Scales context length linearly | None | Very long context (64K+ tokens) | Ring Attention, DeepSpeed-Ulysses, TorchTitan |
| Expert parallelism | Expert sub-networks | All-to-all for token routing | Distributes experts across devices | None | MoE models with many experts | DeepSpeed-MoE, Megatron-LM |
Model parallelism offers several benefits, including:
Scalability: Model parallelism allows for the efficient execution of larger models that would otherwise be infeasible due to memory or computation constraints. It is the only way to train models whose parameters, gradients, and optimizer states exceed the memory of a single device.
Faster training and inference: By distributing the model's computation across multiple processing units, model parallelism can reduce total training time. Frontier models that would take years on a single GPU can be trained in weeks or months on large clusters.
Flexibility: Different parallelism strategies can be composed (3D, 4D, 5D parallelism) to match diverse hardware topologies and model architectures.
However, model parallelism also has some drawbacks, such as:
Communication overhead: Model parallelism often requires frequent communication between processing units to exchange intermediate results, gradients, and updates. This communication can introduce latency and consume valuable bandwidth, especially for tensor parallelism over slow interconnects.
Implementation complexity: Implementing model parallelism can be more complex than data parallelism and may require careful consideration of hardware topology, load balancing between pipeline stages, and numerical correctness across partitions.
Pipeline bubbles: Pipeline parallelism introduces idle time (bubbles) at the beginning and end of each mini-batch, reducing hardware utilization unless large numbers of micro-batches are used or advanced scheduling algorithms (such as zero-bubble or DualPipe) are employed.
Debugging difficulty: Bugs in distributed training can be difficult to reproduce and diagnose, as they may manifest only at specific parallelism configurations or scales. Meta's Llama 3 paper highlighted that debugging 4D parallelism across 16K GPUs posed significant practical challenges.
Imagine you and your friends are building a really, really long train out of toy blocks. The train is so big that no one person has enough blocks to build the whole thing. So you split up the work: one friend builds the engine, another builds the middle cars, and another builds the caboose. Each person works on their piece at the same time, and then you connect them all together. That is like pipeline parallelism.
Now imagine one single train car is so complicated that one person cannot build it alone. So two friends work on the same car at the same time, each building half of it side by side. That is like tensor parallelism.
And if you have lots of identical trains to build, you could have different groups each building a complete train. That is like data parallelism.
When you combine all three ideas together, with lots of friends building lots of trains where each train is split up and each complicated part is shared, that is 3D parallelism. It is how the biggest AI models in the world are trained.