Expert Parallelism
Last reviewed
Sources
No citations yet
Review status
Needs citations
Revision
v3 ยท 4,606 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
Sources
No citations yet
Review status
Needs citations
Revision
v3 ยท 4,606 words
Add missing citations, update stale details, or suggest a clearer explanation.
Expert Parallelism (EP) is a model-parallelism strategy specific to Mixture of Experts (MoE) neural networks in which the individual expert sub-networks (typically feed-forward blocks) are sharded across different accelerator devices rather than being replicated. Each device hosts a disjoint subset of the experts, and a learned router assigns every input token to one or a few experts; tokens are then exchanged between devices via an all-to-all collective so that each expert receives the tokens routed to it, after which a second all-to-all returns the expert outputs to the originating devices.[1][2][3] The technique was introduced (under the name "expert parallelism" or "experts in parallel") in Google's GShard system in 2020 and has become the dominant scaling strategy for sparse MoE language models, including Switch Transformer, Mixtral 8x7B, DeepSeek V3, Llama 4, and Qwen3.[1][2][4][5][24][25] EP composes with tensor parallelism (TP), pipeline parallelism (PP), and data parallelism (DP) to form what practitioners call 3D or 4D parallelism for trillion-parameter MoE training.[6][7][8]
Expert parallelism is the practice of placing different experts of an MoE layer on different devices so that the model's total parameter count can grow far beyond the memory of a single accelerator while the per-token compute stays roughly constant. Whereas data parallelism replicates the entire model and splits the batch, and tensor parallelism splits individual weight matrices, expert parallelism splits along the expert dimension: each device owns whole experts, and the router decides at runtime which device each token must be sent to.[1][2][6] Because only a small subset of experts is activated per token (for example, 2 of 8 in Mixtral 8x7B, or 8 of 256 in DeepSeek V3), most of the model's parameters sit idle for any given token, which is what makes the sparse activation and the cross-device routing worthwhile.[4][18] The defining cost of the strategy is the pair of all-to-all collectives that route tokens to their experts and bring the results back, which is why most of the systems work on EP (DeepSpeed-MoE, Tutel, MegaBlocks, Megatron-Core, DeepEP) exists to make that all-to-all faster.[3][5][12][13][14]
Conditional computation through gated mixtures has a long history in machine learning, but the modern incarnation that motivates expert parallelism originates in the 2017 "Outrageously Large Neural Networks" paper by Noam Shazeer and collaborators at Google Brain, which proposed a Sparsely-Gated MoE layer placed between stacked LSTM layers and demonstrated training of models with up to 137 billion parameters.[9] That work already required distributing experts across devices to fit in memory, foreshadowing the systems abstraction later formalised as expert parallelism.
The term and the formal mechanism of expert parallelism were introduced in GShard by Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen, published on arXiv on 30 June 2020 (arXiv:2006.16668) and presented at ICLR 2021.[1] The authors describe GShard as "a module composed of a set of lightweight annotation APIs and an extension to the XLA compiler" that "provides an elegant way to express a wide range of parallel computation patterns with minimal changes to the existing model code."[1] They used GShard to "scale up multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters," reporting that "such a giant model can efficiently be trained on 2048 TPU v3 accelerators in 4 days to achieve far superior quality for translation from 100 languages to English."[1] GShard partitions all tokens in a batch into local groups, assigns each expert a fractional capacity of C = 2N/(G*E) for top-2 gating with N tokens, G groups, and E experts, and uses an auxiliary loss based on the mean square of the fraction of tokens dispatched to each expert as a differentiable surrogate for load imbalance.[1][10]
In January 2021, William Fedus, Barret Zoph, and Noam Shazeer of Google released the Switch Transformer (arXiv:2101.03961), which simplified the GShard approach by routing each token to only the top-1 expert and demonstrated stable training up to 1.6 trillion parameters.[2] The paper states that the authors "simplify the MoE routing algorithm" and "measure up to 7x increases in pre-training speed with the same computational resources" relative to dense T5 baselines.[2] Switch Transformer formalised the expert capacity formula capacity = (tokens_per_batch / num_experts) * capacity_factor with capacity factors of approximately 1.0 to 1.25 as a practical sweet spot, and combined data, model, and expert parallelism for its largest configurations.[2][11] The model used a load-balancing auxiliary loss proportional to the dot product of the fraction of tokens routed to each expert and the mean router probability per expert.[2] Switch Transformer is widely credited with making EP a mainstream training pattern.[2][11]
In January 2022, Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, and Yuxiong He of Microsoft published DeepSpeed-MoE (arXiv:2201.05596), which appeared at ICML 2022.[3][12] DeepSpeed-MoE introduced an end-to-end training and inference stack including the Pyramid-Residual MoE (PR-MoE) architecture, the Mixture-of-Students (MoS) distillation technique, and a highly optimised inference engine.[3][12] Crucially for EP, DeepSpeed-MoE introduced a hierarchical all-to-all collective that reduces communication latency from O(p) to O(G + p/G) by splitting the global exchange into intra-node and inter-node phases with a data-layout transformation between them, where G is the intra-node group size and p is the total number of ranks.[12] The system also pioneered hybrid tensor-expert-data parallelism, scaling training from a 107B-parameter model up to 2 trillion parameters on 256 A100 GPUs and delivering up to 7.3x lower inference latency than prior MoE inference solutions.[3][12]
Two MLSys 2023 papers extended EP system design substantially. Tutel: Adaptive Mixture-of-Experts at Scale (arXiv:2206.03382), led by Changho Hwang and co-authors from Microsoft Research, introduced adaptive parallelism switching at runtime, flexible all-to-all, a two-dimensional hierarchical (2DH) all-to-all that mirrors the hierarchy of intra-node NVLink and inter-node InfiniBand fabrics, and fused encode/decode kernels.[13] Tutel reported 4.96x speedup on a single MoE layer on 16 A100 GPUs and 5.75x on 2048 A100 GPUs versus prior state of the art, and 1.55x to 2.11x end-to-end on SwinV2-MoE.[13] MegaBlocks (arXiv:2211.15841) by Trevor Gale, Deepak Narayanan, Cliff Young, and Matei Zaharia reformulated MoE computation as block-sparse matrix operations and contributed custom GPU kernels that eliminate token dropping entirely; the system reported up to 40% end-to-end speedups over Tutel and 2.4x over dense models trained with Megatron-LM, establishing dropless MoE training as practical.[14]
DeepSeek V3, released by DeepSeek in late December 2024, scaled EP to 671 billion total parameters with 37 billion active per token using 256 routed experts and 1 shared expert per MoE layer, with each token selecting 8 routed experts.[4][15] DeepSeek-V3 "pioneers an auxiliary-loss-free strategy for load balancing," adding a per-expert bias term to the routing affinity scores and adjusting it during training rather than relying on an auxiliary loss.[4] Training used 16-way pipeline parallelism, 64-way expert parallelism across 8 nodes, and ZeRO-1 data parallelism, while inference scaled to a 320-GPU large EP configuration during decoding.[4][15] On 25 February 2025, during the second day of DeepSeek's "Open-Source Week," the company released DeepEP (deepseek-ai/DeepEP on GitHub) under the MIT license, introducing it as "the first open-source EP communication library for MoE model training and inference."[5][16][17] DeepEP provides high-throughput normal kernels (achieving roughly 153 GB/s intranode over NVLink and 43 to 47 GB/s internode over InfiniBand on H800 systems) and dedicated low-latency RDMA-only kernels for inference decoding that reportedly achieve dispatch latencies as low as 163 microseconds.[5][16] DeepEP supports FP8 dispatch and BF16 combine, and was tightly co-designed with DeepSeek-V3's group-limited gating algorithm.[5][16]
In a sparsely-gated MoE Transformer block, the FFN sub-layer is replaced by E parallel expert FFNs and a small router (typically a single linear projection followed by softmax). For each token x, the router emits gating logits over experts; under top-k gating, the k experts with the highest scores are selected and the token is sent to those experts.[9][1][2] Under expert parallelism, experts are partitioned across EP ranks (devices). Computing a single MoE layer therefore proceeds through four phases:[6][8][12][18]
The backward pass mirrors this with two more all-to-alls. Because both transfer sizes and destinations are determined dynamically per step by the router, standard fixed-shape collectives (designed for known-ahead-of-time transfer sizes) are suboptimal, motivating the specialised kernels in DeepSpeed-MoE, Tutel, and DeepEP.[5][12][13]
Concretely, consider an MoE layer with 8 experts spread one-per-GPU across an EP=8 group, top-2 routing, a per-rank micro-batch of 1024 tokens with hidden dimension 4096 in BF16. The all-to-all dispatch moves about 1024 * 2 * 4096 * 2 bytes = 16 MiB of token activations out of each rank, plus a small amount of routing metadata (token indices and gating weights). After expert FFN compute (which, for a SwiGLU expert, is roughly three large matmuls of shape tokens x 4096 x intermediate), the combine all-to-all returns a similar volume. At a typical 8 GPU H100 node with 900 GB/s NVLink each direction, the dispatch alone is bandwidth-limited rather than latency-limited, which justifies overlap-focused optimisations like Tutel's pipelining and DualPipe's bidirectional schedule.[13][15][23]
Because the number of tokens routed to each expert is data-dependent and frequently uneven, MoE systems pre-allocate a fixed per-expert buffer called expert capacity:[1][2][11]
capacity = ceil((tokens_per_batch / num_experts) * capacity_factor)
Tokens beyond the capacity of their chosen expert are typically dropped (or, in dropless systems, handled via block-sparse computation).[2][14] A larger capacity factor reduces dropping but inflates compute, memory, and communication volume; Switch Transformer found a sweet spot near 1.0 to 1.25.[2][11] To keep dropping low, MoE models add a load-balancing auxiliary loss that penalises imbalance between the fraction of tokens routed to each expert and the mean router probability assigned to that expert.[1][2][10] GShard and Switch Transformer both rely on auxiliary losses; later work, including DeepSeek V3, has explored auxiliary-loss-free strategies that instead bias router scores with a learnable correction term to keep load balanced without an explicit regulariser.[4][15]
The dispatch/combine pair is the defining collective communication pattern of expert parallelism. In a vanilla implementation, each of the EP ranks sends a different number of tokens to each of the other ranks, which is the textbook unbalanced all-to-all primitive supported by NCCL and other collective libraries.[5][12] The total wire volume per layer is on the order of 2 * batch_tokens * hidden_size * k for top-k routing across the dispatch and combine pair, before any reduction. Because the destination of each token is decided by the gate at runtime, the per-rank send counts are not known until the routing step has completed, which is incompatible with the static buffer shapes that high-performance collective libraries traditionally exploit.[5][12] Three families of optimisation appear repeatedly:
A fourth, complementary technique is adaptive routing at the network fabric level. Both DeepEP and the SGLang large-EP deployment guide recommend enabling adaptive routing on InfiniBand fabrics to spread the dispatch traffic across multiple paths, reducing tail-latency spikes when many ranks send to the same expert simultaneously.[5][16]
Expert parallelism does not replace data, tensor, or pipeline parallelism; it composes with them. The four axes shard different things and use different collectives. A common pattern, sometimes called 3D or 4D parallelism for MoE, is:[6][7][8]
| Dimension | What it shards | Where it applies in an MoE Transformer | Key collective |
|---|---|---|---|
| Data Parallelism (DP) | Mini-batch across replicas | All layers | All-reduce on gradients |
| Tensor Parallelism (TP) | Weight matrices along an axis | Attention and dense FFN; sometimes per-expert | All-reduce or all-gather |
| Pipeline Parallelism (PP) | Layers across stages | Whole model | Point-to-point activations |
| Expert Parallelism (EP) | Experts across ranks | MoE FFN only | All-to-all dispatch and combine |
The key conceptual difference is that DP, TP, and PP all partition a fixed, statically-known computation graph, whereas EP partitions along a dimension whose data routing is decided at runtime by the gate. That is why EP relies on the dynamic, unbalanced all-to-all rather than the static all-reduce or point-to-point exchanges of the other axes.[5][6] Megatron-Core's MoE module recommends keeping the product EP * TP within a single node (typically 8 GPUs) so that the high-volume expert all-to-all stays on NVLink, reserving pipeline parallelism for cross-node scaling.[7] The 2025 "MoE Parallel Folding" paper from NVIDIA further decouples attention parallelism from MoE parallelism, allowing ETP * EP * EDP * PP over the experts to be configured independently from TP * CP * DP * PP over attention.[8]
DeepSpeed-MoE is the MoE training and inference subsystem of Microsoft's DeepSpeed library, accompanying the 2022 paper.[3][12] It provides hybrid tensor-expert-data parallelism, the hierarchical all-to-all, FP16 and FP32 expert kernels, and the PR-MoE architecture in which the bottom layers of the network use fewer experts than the top layers, reducing parameter count without quality loss.[3][12]
Tutel is a Microsoft Research open-source MoE library (also integrated into DeepSpeed) that ships flexible all-to-all primitives, 2DH all-to-all, fused token permutation/unpermutation, and runtime-adaptive parallelism switching, enabling expert and tensor parallelism degrees to change between iterations to match dynamic routing patterns.[13] Tutel powered SwinV2-MoE training and remains a reference EP implementation cited by subsequent systems.[13]
MegaBlocks (stanford-futuredata/megablocks and the later Databricks fork) reframes MoE FFN computation as a block-sparse matrix multiply over a single packed buffer, eliminating padding and token dropping. Its custom CUDA kernels deliver up to 40% throughput gains over Tutel and roughly 2.4x over Megatron-LM dense baselines on equivalent compute budgets.[14] MegaBlocks underpins the open-source training stacks used for many community MoE models.[14]
NVIDIA's Megatron-LM (Megatron-Core) provides a production EP implementation supporting EP combined with DP, TP, PP, and sequence parallelism, with token dispatcher options including a standard AllToAll dispatcher backed by NCCL, a FlexDispatcher that wraps DeepEP for cross-node EP, and a FlexDispatcher variant tuned for the H100 follow-on GB200 NVL72 multi-node NVLink fabric.[7] Megatron-Core also implements GroupedGEMM for batched expert matmuls, FP8 dispatch, router fusion, and overlap of the all-to-all with computation by merging forward and backward passes.[7][8]
DeepEP is the EP communication library open-sourced by DeepSeek on 25 February 2025 during Open-Source Week, described in its repository as "a high-performance communication library" providing "high-throughput and low-latency all-to-all GPU kernels (MoE dispatch and combine) with low-precision support including FP8."[5][16][17] It offers two kernel families: normal kernels optimised for high throughput during prefill and training, and low-latency kernels built on RDMA-only paths that bypass NCCL for decoding latency.[5][16] DeepEP supports InfiniBand and RoCE, FP8 dispatch with BF16 combine, communication-computation overlap via CUDA hooks that avoid consuming streaming multiprocessor resources, and adaptive routing for congested fabrics.[5][16] DeepEP requires Hopper-class GPUs and integrates into PyTorch.[5][16]
DeepSeek's DualPipe, open-sourced on day 4 of Open-Source Week as deepseek-ai/DualPipe, is a bidirectional pipeline parallelism algorithm specifically engineered to hide the EP all-to-alls inside compute.[15][23] DualPipe divides each micro-batch chunk into four sub-chunks (attention, all-to-all dispatch, MLP, all-to-all combine) and orchestrates forward and backward passes to occur in overlapping bidirectional streams so that while one set of micro-batches dispatches tokens, another performs attention or expert MLP computation, masking most of the communication latency.[15][23] DualPipe is therefore complementary to DeepEP: DeepEP optimises the all-to-all itself, while DualPipe overlaps the all-to-all with other computation.[15][23] The combination is reported to reduce pipeline bubbles substantially in DeepSeek V3 and R1 training.[15][23]
| System | Year | Total / active params | Experts per MoE layer | Routing | Notable EP feature |
|---|---|---|---|---|---|
| GShard MoE Transformer | 2020 | 600B / sparse | up to 2048 | top-2 | Introduced EP, capacity factor, aux loss[1] |
| Switch Transformer (Switch-C) | 2021 | 1.6T / sparse | 2048 | top-1 | EP + DP at trillion scale[2] |
| DeepSpeed-MoE | 2022 | up to 2T | configurable | top-1 | Hierarchical all-to-all, PR-MoE, hybrid TP-EP-DP[3][12] |
| Tutel | 2022 to 2023 | SwinV2-MoE | configurable | top-k | 2DH all-to-all, adaptive parallelism[13] |
| MegaBlocks | 2022 to 2023 | configurable | configurable | top-k | Block-sparse dropless MoE[14] |
| Mixtral 8x7B | 2023 to 2024 | 47B / 13B | 8 | top-2 | Apache 2.0 open MoE, mainstream EP[18] |
| Mixtral 8x22B | 2024 | 141B / 39B | 8 | top-2 | Scaled Mixtral architecture[19] |
| DeepSeek V3 | 2024 to 2025 | 671B / 37B | 256 + 1 shared | top-8 | 64-way EP + DualPipe; basis for DeepEP[4][15] |
| Llama 4 Maverick | 2025 | 400B / 17B | 128 + 1 shared | routed + shared | Native multimodal MoE, alternating dense/MoE layers[24] |
| Llama 4 Scout | 2025 | 109B / 17B | 16 | top-k | 10M-token context, runs on a single H100 host[24] |
| Qwen3-235B-A22B | 2025 | 235B / 22B | 128 | top-8 | No shared expert; large-EP inference[25] |
| Megatron-Core MoE | ongoing | configurable | configurable | top-k | Production EP with TP/PP/DP/CP integration[7][8] |
Mixtral 8x7B (mistralai/Mixtral-8x7B), released by Mistral AI in December 2023 and described in arXiv:2401.04088, was a watershed moment for open MoE models: it consists of 8 experts per layer with top-2 routing, totalling about 47 billion parameters with roughly 13 billion active per token, and runs straightforwardly under expert-parallel training and inference stacks such as vLLM, SGLang, and Megatron-Core.[18] Mixtral 8x22B followed in April 2024 with the same architectural template at larger scale.[19] These releases prompted Amazon SageMaker, NVIDIA, and other cloud providers to publish reference architectures showing how to pre-train Mixtral 8x7B with expert parallelism on managed clusters.[20]
The 2025 frontier of open-weight MoE entrenched EP further. Meta's Llama 4 family, released on 5 April 2025, ships two MoE models: Llama 4 Maverick with about 400 billion total parameters and 17 billion active per token across 128 routed experts plus a shared expert in alternating dense and MoE layers, and Llama 4 Scout with about 109 billion total parameters, 17 billion active, 16 experts, and a 10 million token context window.[24] Alibaba's Qwen3 flagship, Qwen3-235B-A22B, released in May 2025, uses 235 billion total parameters with 22 billion active per token, 128 experts and top-8 routing, and (unlike DeepSeek-V3 and earlier Qwen2.5-MoE) drops the shared expert.[25] All of these models are served in production with large-EP inference stacks (SGLang, vLLM, TensorRT-LLM) that shard experts across many GPUs.[5][25]
The phrase expert parallelism denotes the model-parallelism strategy itself: experts sharded across devices with token routing via all-to-all. The phrase DeepEP denotes the specific open-source communication library released by DeepSeek in February 2025 that implements an optimised version of that strategy.[5][16][17] In the months following the DeepEP release, the abbreviation "EP" began appearing as shorthand for the DeepEP library specifically in cloud deployment guides such as Microsoft Azure's HPC blog and the SGLang large-EP deployment series, while the broader strategy retains the name "expert parallelism."[5] Megatron-Core integrates DeepEP via its FlexDispatcher abstraction, allowing users to choose between a stock NCCL-based AllToAll dispatcher and the DeepEP backend depending on cluster fabric and scale.[7][8]
The primary motivation for EP is enabling sparse activation of very large parameter counts at fixed compute budgets. Concrete benefits documented in the cited literature include: