Expert Parallelism
Last reviewed
May 21, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,012 words
Expert Parallelism (EP) is a model-parallelism strategy specific to Mixture of Experts (MoE) neural networks in which the individual expert sub-networks (typically feed-forward blocks) are sharded across different accelerator devices rather than being replicated. Each device hosts a disjoint subset of the experts, and a learned router assigns every input token to one or a few experts; tokens are then exchanged between devices via an all-to-all collective so that each expert receives the tokens routed to it, after which a second all-to-all returns the expert outputs to the originating devices.[1][2][3] The technique was introduced (under the name "expert parallelism" or "experts in parallel") in Google's GShard system in 2020 and has become the dominant scaling strategy for sparse MoE language models, including Switch Transformer, Mixtral 8x7B, and DeepSeek V3.[1][2][4][5] EP composes with tensor parallelism (TP), pipeline parallelism (PP), and data parallelism (DP) to form what practitioners call 3D or 4D parallelism for trillion-parameter MoE training.[6][7][8]
Conditional computation through gated mixtures has a long history in machine learning, but the modern incarnation that motivates expert parallelism originates in the 2017 "Outrageously Large Neural Networks" paper by Noam Shazeer and collaborators at Google Brain, which proposed a Sparsely-Gated MoE layer placed between stacked LSTM layers and demonstrated training of models with up to 137 billion parameters.[9] That work already required distributing experts across devices to fit in memory, foreshadowing the systems abstraction later formalised as expert parallelism.
The term and the formal mechanism of expert parallelism were introduced in GShard by Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen, published on arXiv on 30 June 2020 (arXiv:2006.16668) and presented at ICLR 2021.[1] GShard provides lightweight annotation APIs and an extension to the XLA compiler that allow programmers to express SPMD parallel computations, including sharded MoE layers, with minimal code changes.[1] The authors used GShard to scale a multilingual neural machine translation Transformer with a Sparsely-Gated MoE beyond 600 billion parameters, trained on 2048 TPU v3 accelerators in roughly four days.[1] GShard partitions all tokens in a batch into local groups and assigns each expert a fractional capacity of C = 2N/(G*E) for top-2 gating with N tokens, G groups, and E experts. Its auxiliary loss aims to minimise the mean square of the fraction of tokens dispatched to each expert; because that fraction is not differentiable, one of the two factors is replaced by the mean gate value per expert, which serves as a differentiable surrogate for load imbalance.[1][10]
In January 2021, William Fedus, Barret Zoph, and Noam Shazeer of Google released the Switch Transformer (arXiv:2101.03961), which simplified the GShard approach by routing each token to only the top-1 expert and demonstrated stable training up to 1.6 trillion parameters, claiming up to 7x pre-training speedups over T5 dense baselines at comparable per-token FLOPs.[2] Switch Transformer formalised the expert capacity formula capacity = (tokens_per_batch / num_experts) * capacity_factor with capacity factors of approximately 1.0 to 1.25 as a practical sweet spot, and combined data, model, and expert parallelism for its largest configurations.[2][11] The model used a load-balancing auxiliary loss proportional to the dot product of the fraction of tokens routed to each expert and the mean router probability per expert.[2] Switch Transformer is widely credited with making EP a mainstream training pattern.[2][11]
In January 2022, Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, and Yuxiong He of Microsoft published DeepSpeed-MoE (arXiv:2201.05596), which appeared at ICML 2022.[3][12] DeepSpeed-MoE introduced an end-to-end training and inference stack including the Pyramid-Residual MoE (PR-MoE) architecture, the Mixture-of-Students (MoS) distillation technique, and a highly optimised inference engine.[3][12] Crucially for EP, DeepSpeed-MoE introduced a hierarchical all-to-all collective that reduces communication latency from O(p) to O(G + p/G) by splitting the global exchange into intra-node and inter-node phases with a data-layout transformation between them, where G is the intra-node group size and p is the total number of ranks.[12] The system also pioneered hybrid tensor-expert-data parallelism, scaling training from a 107B-parameter model up to 2 trillion parameters on 256 A100 GPUs and delivering up to 7.3x lower inference latency than prior MoE inference solutions.[3][12]
Two MLSys 2023 papers extended EP system design substantially. Tutel: Adaptive Mixture-of-Experts at Scale (arXiv:2206.03382), led by Changho Hwang and co-authors from Microsoft Research, introduced adaptive parallelism switching at runtime, flexible all-to-all, a two-dimensional hierarchical (2DH) all-to-all that mirrors the hierarchy of intra-node NVLink and inter-node InfiniBand fabrics, and fused encode/decode kernels.[13] Tutel reported 4.96x speedup on a single MoE layer on 16 A100 GPUs and 5.75x on 2048 A100 GPUs versus prior state of the art, and 1.55x to 2.11x end-to-end on SwinV2-MoE.[13] MegaBlocks (arXiv:2211.15841) by Trevor Gale, Deepak Narayanan, Cliff Young, and Matei Zaharia reformulated MoE computation as block-sparse matrix operations and contributed custom GPU kernels that eliminate token dropping entirely; the system reported up to 40% end-to-end speedups over Tutel and 2.4x over dense models trained with Megatron-LM, establishing dropless MoE training as practical.[14]
DeepSeek V3, released by DeepSeek in late December 2024, scaled EP to 671 billion total parameters with 37 billion active per token using 256 routed experts and 1 shared expert per MoE layer, with each token selecting 8 routed experts.[4][15] Training used 16-way pipeline parallelism, 64-way expert parallelism across 8 nodes, and ZeRO-1 data parallelism, while inference scaled to a 320-GPU large EP configuration during decoding.[4][15] On 25 February 2025, during the second day of DeepSeek's "Open-Source Week," the company released DeepEP (deepseek-ai/DeepEP on GitHub) under the MIT license, described as "the first open-source EP communication library for MoE model training and inference."[5][16][17] DeepEP provides high-throughput normal kernels (achieving roughly 153 GB/s intranode over NVLink and 43 to 47 GB/s internode over InfiniBand on H800 systems) and dedicated low-latency RDMA-only kernels for inference decoding that reportedly achieve dispatch latencies as low as 163 microseconds.[5][16] DeepEP supports FP8 dispatch and BF16 combine, and was tightly co-designed with DeepSeek-V3's group-limited gating algorithm.[5][16]
In a sparsely-gated MoE Transformer block, the FFN sub-layer is replaced by E parallel expert FFNs and a small router (typically a single linear projection followed by softmax). For each token x, the router emits gating logits over experts; under top-k gating, the k experts with the highest scores are selected and the token is sent to those experts.[9][1][2] Under expert parallelism, experts are partitioned across EP ranks (devices). Computing a single MoE layer therefore proceeds through four phases:[6][8][12][18]
The backward pass mirrors this with two more all-to-alls. Because both transfer sizes and destinations are determined dynamically per step by the router, standard fixed-shape collectives (designed for known-ahead-of-time transfer sizes) are suboptimal, motivating the specialised kernels in DeepSpeed-MoE, Tutel, and DeepEP.[5][12][13]
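The forward flow described above (route, dispatch all-to-all, expert compute, combine all-to-all) can be illustrated with a single-process toy simulation. This is only a sketch under stated assumptions: one expert per rank, top-1 routing, Python lists standing in for device buffers, and a trivial scaling function in place of a real expert FFN. The names `moe_layer`, `route_top1`, and `expert_ffn` are hypothetical and do not come from any of the cited systems.

```python
import math
import random

NUM_RANKS = 4        # assumed EP group size, one expert per rank
TOKENS_PER_RANK = 6  # toy micro-batch size per rank
HIDDEN = 3           # toy hidden dimension


def expert_ffn(expert_id, vec):
    # Stand-in for an expert FFN: scale the token by (expert_id + 1).
    return [(expert_id + 1) * v for v in vec]


def route_top1(vec, router_weights):
    # Router: linear projection followed by softmax, then top-1 selection.
    logits = [sum(w * v for w, v in zip(row, vec)) for row in router_weights]
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    expert = max(range(len(probs)), key=probs.__getitem__)
    return expert, probs[expert]


def moe_layer(tokens_by_rank, router_weights):
    num_ranks = len(tokens_by_rank)
    # Phase 1 (routing) and phase 2 (dispatch): every rank routes its own
    # tokens, then "sends" each one to the rank that owns the chosen expert.
    inbox = [[] for _ in range(num_ranks)]
    for src, tokens in enumerate(tokens_by_rank):
        for slot, vec in enumerate(tokens):
            expert, gate = route_top1(vec, router_weights)
            inbox[expert].append((src, slot, gate, vec))  # expert e is on rank e
    # Phase 3 (expert compute) and phase 4 (combine): each rank runs its
    # expert on what it received and returns the gated output to (src, slot).
    outputs = [[None] * len(tokens) for tokens in tokens_by_rank]
    for rank, received in enumerate(inbox):
        for src, slot, gate, vec in received:
            y = expert_ffn(rank, vec)
            outputs[src][slot] = [gate * v for v in y]
    return outputs, [len(received) for received in inbox]


random.seed(0)
router = [[random.gauss(0, 1) for _ in range(HIDDEN)] for _ in range(NUM_RANKS)]
tokens = [[[random.gauss(0, 1) for _ in range(HIDDEN)]
           for _ in range(TOKENS_PER_RANK)] for _ in range(NUM_RANKS)]
outputs, expert_load = moe_layer(tokens, router)
print("tokens received per expert rank:", expert_load)
```

The per-expert load printed at the end is generally uneven, which is the data-dependent behaviour that makes the transfer sizes unknown until routing has finished.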
Concretely, consider an MoE layer with 8 experts spread one per GPU across an EP=8 group, with top-2 routing and a per-rank micro-batch of 1024 tokens with hidden dimension 4096 in BF16. The all-to-all dispatch moves about 1024 * 2 * 4096 * 2 bytes = 16 MiB of token activations per rank, plus a small amount of routing metadata (token indices and gating weights). After expert FFN compute (which, for a SwiGLU expert, is roughly three large matmuls of shape tokens x 4096 x intermediate), the combine all-to-all returns a similar volume. On a typical 8-GPU H100 node, where each GPU has 900 GB/s of aggregate NVLink bandwidth (450 GB/s per direction), the dispatch alone is bandwidth-limited rather than latency-limited, which justifies overlap-focused optimisations like Tutel's pipelining and DualPipe's bidirectional schedule.[13][15][23]
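The arithmetic in this example can be reproduced with a few lines of Python. The sketch below is illustrative only: the bandwidth values are assumed inputs rather than measurements, the uniform-routing share is an idealisation, and the helper names are hypothetical.

```python
def dispatch_bytes(tokens_per_rank, top_k, hidden_size, bytes_per_element):
    """Bytes of token activations one rank hands to the dispatch all-to-all."""
    return tokens_per_rank * top_k * hidden_size * bytes_per_element


def remote_fraction(ep_size):
    """Share of routed token copies that leave the rank, ASSUMING uniform routing."""
    return (ep_size - 1) / ep_size


def transfer_time_us(num_bytes, bandwidth_gb_per_s):
    """Ideal transfer time in microseconds at a given bandwidth (no latency term)."""
    return num_bytes / (bandwidth_gb_per_s * 1e9) * 1e6


# 1024 tokens, top-2 routing, hidden size 4096, BF16 (2 bytes per element).
volume = dispatch_bytes(1024, 2, 4096, 2)
print(f"dispatch volume per rank: {volume} bytes = {volume / 2**20:.0f} MiB")
# -> dispatch volume per rank: 16777216 bytes = 16 MiB

remote = volume * remote_fraction(8)  # EP=8, one expert per rank
for bandwidth in (50.0, 450.0, 900.0):  # assumed GB/s figures, for comparison only
    micros = transfer_time_us(remote, bandwidth)
    print(f"{bandwidth:6.0f} GB/s -> {micros:7.1f} us for the remote share")
```

Because the estimate ignores per-message latency and contention, it is best read as a lower bound on the dispatch time for a given link speed.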
Because the number of tokens routed to each expert is data-dependent and frequently uneven, MoE systems pre-allocate a fixed per-expert buffer called expert capacity:[1][2][11]
capacity = ceil((tokens_per_batch / num_experts) * capacity_factor)
Tokens beyond the capacity of their chosen expert are typically dropped (or, in dropless systems, handled via block-sparse computation).[2][14] A larger capacity factor reduces dropping but inflates compute, memory, and communication volume; Switch Transformer found a sweet spot near 1.0 to 1.25.[2][11] To keep dropping low, MoE models add a load-balancing auxiliary loss computed from the fraction of tokens routed to each expert and the mean router probability assigned to that expert; the loss is smallest when both are uniform across experts.[1][2][10] GShard and Switch Transformer both rely on auxiliary losses; later work, including DeepSeek V3, has explored auxiliary-loss-free strategies that instead add a per-expert bias to the router scores used for expert selection and adjust that bias during training according to each expert's observed load, keeping load balanced without an explicit regulariser.[4][15]
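A minimal sketch of the capacity rule and of a Switch-style auxiliary loss follows, assuming top-1 routing and plain Python lists. The loss is written as alpha * num_experts * sum(f_i * P_i), where f_i is the fraction of tokens routed to expert i and P_i is the mean router probability for expert i; the default `alpha=0.01` and all function names are assumptions made for illustration.

```python
import math


def expert_capacity(tokens_per_batch, num_experts, capacity_factor):
    # capacity = ceil((tokens_per_batch / num_experts) * capacity_factor)
    return math.ceil(tokens_per_batch / num_experts * capacity_factor)


def apply_capacity(assignments, num_experts, capacity):
    # assignments[t] is the expert chosen for token t (top-1).
    # Tokens arriving after an expert's buffer is full are marked as dropped.
    counts = [0] * num_experts
    kept = []
    for expert in assignments:
        if counts[expert] < capacity:
            counts[expert] += 1
            kept.append(True)
        else:
            kept.append(False)
    return kept, counts


def switch_aux_loss(assignments, router_probs, num_experts, alpha=0.01):
    # f[i]: fraction of tokens routed to expert i.
    # p[i]: mean router probability assigned to expert i.
    n = len(assignments)
    f = [assignments.count(i) / n for i in range(num_experts)]
    p = [sum(row[i] for row in router_probs) / n for i in range(num_experts)]
    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, p))


print(expert_capacity(1024, 8, 1.25))  # -> 160
kept, counts = apply_capacity([0, 0, 0, 1], num_experts=2, capacity=2)
print(kept, counts)  # -> [True, True, False, True] [2, 1]

balanced = switch_aux_loss([0, 1, 2, 3], [[0.25] * 4] * 4, 4)
collapsed = switch_aux_loss([0, 0, 0, 0], [[1.0, 0.0, 0.0, 0.0]] * 4, 4)
print(balanced < collapsed)  # -> True
```

With perfectly uniform routing the loss equals alpha, and it grows as routing collapses onto fewer experts, which is why it works as a pressure towards balance.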
The dispatch/combine pair is the defining collective communication pattern of expert parallelism. In a vanilla implementation, each of the EP ranks sends a different number of tokens to each of the other ranks, which is the textbook unbalanced (variable-size) all-to-all primitive supported by NCCL and other collective libraries.[5][12] The total wire volume per layer is on the order of 2 * batch_tokens * hidden_size * k elements for top-k routing, counting both dispatch and combine and before any reduction; multiplying by the bytes per element gives the volume in bytes. Because the destination of each token is decided by the gate at runtime, the per-rank send counts are not known until the routing step has completed, which is incompatible with the static buffer shapes that high-performance collective libraries traditionally exploit.[5][12] Three families of optimisation appear repeatedly:
A fourth, complementary technique is adaptive routing at the network fabric level. Both DeepEP and the SGLang large-EP deployment guide recommend enabling adaptive routing on InfiniBand fabrics to spread the dispatch traffic across multiple paths, reducing tail-latency spikes when many ranks send to the same expert simultaneously.[5][16]
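The data-dependent send counts behind the unbalanced all-to-all can be made concrete with a small sketch. It assumes experts are laid out contiguously across ranks (expert e lives on rank e // experts_per_rank) and that each rank already holds the list of expert ids its tokens selected; the function names are hypothetical.

```python
def send_counts(assignments_by_rank, experts_per_rank, num_ranks):
    # counts[src][dst] = number of token copies rank src must send to rank dst.
    counts = [[0] * num_ranks for _ in range(num_ranks)]
    for src, experts in enumerate(assignments_by_rank):
        for expert in experts:
            counts[src][expert // experts_per_rank] += 1
    return counts


def recv_counts(counts):
    # What each rank receives is the transpose of what every rank sends.
    return [list(column) for column in zip(*counts)]


# Two ranks, two experts per rank (experts 0-1 on rank 0, experts 2-3 on rank 1).
routing = [
    [0, 1, 1, 3],  # expert ids selected by rank 0's tokens
    [2, 2, 2, 2],  # expert ids selected by rank 1's tokens
]
sends = send_counts(routing, experts_per_rank=2, num_ranks=2)
print(sends)               # -> [[3, 1], [0, 4]]
print(recv_counts(sends))  # -> [[3, 0], [1, 4]]
```

In a real system these per-destination counts would have to be exchanged before the payload transfer so that receivers can size their buffers; PyTorch's `torch.distributed.all_to_all_single`, for example, accepts `input_split_sizes` and `output_split_sizes` for this purpose.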
Expert parallelism does not replace data, tensor, or pipeline parallelism; it composes with them. A common pattern, sometimes called 3D or 4D parallelism for MoE, is:[6][7][8]
| Dimension | What it shards | Where it applies in an MoE Transformer | Key collective |
|---|---|---|---|
| Data Parallelism (DP) | Mini-batch across replicas | All layers | All-reduce on gradients |
| Tensor Parallelism (TP) | Weight matrices along an axis | Attention and dense FFN; sometimes per-expert | All-reduce or all-gather |
| Pipeline Parallelism (PP) | Layers across stages | Whole model | Point-to-point activations |
| Expert Parallelism (EP) | Experts across ranks | MoE FFN only | All-to-all dispatch and combine |
Megatron-Core's MoE module recommends keeping the product EP * TP within a single node (typically 8 GPUs) so that the high-volume expert all-to-all stays on NVLink, reserving pipeline parallelism for cross-node scaling.[7] The 2025 "MoE Parallel Folding" paper from NVIDIA further decouples attention parallelism from MoE parallelism, allowing ETP * EP * EDP * PP over the experts to be configured independently from TP * CP * DP * PP over attention.[8]
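The constraints in this paragraph can be expressed as a small consistency check. The sketch below assumes that the attention-side product TP * CP * DP * PP and the expert-side product ETP * EP * EDP * PP must both equal the total number of GPUs, and it treats the single-node guideline as a soft warning. The class and function names are hypothetical and are not Megatron-Core APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoEParallelConfig:
    tp: int   # tensor parallelism for attention
    cp: int   # context parallelism for attention
    dp: int   # data parallelism for attention
    pp: int   # pipeline parallelism (shared by both sides)
    etp: int  # tensor parallelism inside each expert
    ep: int   # expert parallelism
    edp: int  # data parallelism over expert replicas

    def attention_ranks(self):
        return self.tp * self.cp * self.dp * self.pp

    def expert_ranks(self):
        return self.etp * self.ep * self.edp * self.pp


def check_config(cfg, world_size, gpus_per_node=8):
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    if cfg.attention_ranks() != world_size:
        problems.append(
            f"TP*CP*DP*PP = {cfg.attention_ranks()} does not match {world_size} GPUs")
    if cfg.expert_ranks() != world_size:
        problems.append(
            f"ETP*EP*EDP*PP = {cfg.expert_ranks()} does not match {world_size} GPUs")
    if cfg.ep * cfg.etp > gpus_per_node:
        problems.append(
            f"EP*ETP = {cfg.ep * cfg.etp} spans more than one {gpus_per_node}-GPU node")
    return problems


good = MoEParallelConfig(tp=2, cp=1, dp=4, pp=8, etp=1, ep=8, edp=1)
bad = MoEParallelConfig(tp=2, cp=1, dp=4, pp=8, etp=1, ep=16, edp=1)
print(check_config(good, world_size=64))  # -> []
for problem in check_config(bad, world_size=64):
    print("warning:", problem)
```

Separating the two products in this way mirrors the idea that attention and expert layers can use different layouts over the same GPUs, as long as each layout tiles the whole cluster.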
DeepSpeed-MoE is the MoE training and inference subsystem of Microsoft's DeepSpeed library, accompanying the 2022 paper.[3][12] It provides hybrid tensor-expert-data parallelism, the hierarchical all-to-all, FP16 and FP32 expert kernels, and the PR-MoE architecture in which the bottom layers of the network use fewer experts than the top layers, reducing parameter count without quality loss.[3][12]
Tutel is a Microsoft Research open-source MoE library (also integrated into DeepSpeed) that ships flexible all-to-all primitives, 2DH all-to-all, fused token permutation/unpermutation, and runtime-adaptive parallelism switching, enabling expert and tensor parallelism degrees to change between iterations to match dynamic routing patterns.[13] Tutel powered SwinV2-MoE training and remains a reference EP implementation cited by subsequent systems.[13]
MegaBlocks (stanford-futuredata/megablocks and the later Databricks fork) reframes MoE FFN computation as a block-sparse matrix multiply over a single packed buffer, eliminating padding and token dropping. Its custom CUDA kernels deliver up to 40% throughput gains over Tutel and roughly 2.4x over Megatron-LM dense baselines on equivalent compute budgets.[14] MegaBlocks underpins the open-source training stacks used for many community MoE models.[14]
NVIDIA's Megatron-LM (Megatron-Core) provides a production EP implementation supporting EP combined with DP, TP, PP, and sequence parallelism. Its token dispatcher options include a standard AllToAll dispatcher backed by NCCL, a FlexDispatcher that wraps DeepEP for cross-node EP, and a FlexDispatcher variant tuned for the multi-node NVLink fabric of GB200 NVL72 systems (the generation after H100).[7] Megatron-Core also implements GroupedGEMM for batched expert matmuls, FP8 dispatch, router fusion, and overlap of the all-to-all with computation by merging forward and backward passes.[7][8]
DeepEP is the EP communication library open-sourced by DeepSeek on 25 February 2025 during Open-Source Week, distinguished by being purpose-built around DeepSeek V3's group-limited gating and FP8 dispatch path.[5][16][17] It offers two kernel families: normal kernels optimised for high throughput during prefill and training, and low-latency kernels built on RDMA-only paths that bypass NCCL for decoding latency.[5][16] DeepEP supports InfiniBand and RoCE, FP8 dispatch with BF16 combine, communication-computation overlap via CUDA hooks that avoid consuming streaming multiprocessor resources, and adaptive routing for congested fabrics.[5][16] DeepEP requires Hopper-class GPUs and integrates into PyTorch.[5][16]
DeepSeek's DualPipe, open-sourced on day 4 of Open-Source Week as deepseek-ai/DualPipe, is a bidirectional pipeline parallelism algorithm specifically engineered to hide the EP all-to-alls inside compute.[15][23] DualPipe divides each micro-batch chunk into four components (attention, all-to-all dispatch, MLP, all-to-all combine) and schedules forward and backward passes in overlapping bidirectional streams, so that while one set of micro-batches dispatches tokens, another performs attention or expert MLP computation, masking most of the communication latency.[15][23] DualPipe is therefore complementary to DeepEP: DeepEP optimises the all-to-all itself, while DualPipe overlaps the all-to-all with other computation.[15][23] The combination is reported to reduce pipeline bubbles substantially in DeepSeek V3 and R1 training.[15][23]
| System | Year | Total / active params | Experts per MoE layer | Routing | Notable EP feature |
|---|---|---|---|---|---|
| GShard MoE Transformer | 2020 | 600B / sparse | up to 2048 | top-2 | Introduced EP, capacity factor, aux loss[1] |
| Switch Transformer (Switch-C) | 2021 | 1.6T / sparse | 2048 | top-1 | EP + DP at trillion scale[2] |
| DeepSpeed-MoE | 2022 | up to 2T | configurable | top-1 | Hierarchical all-to-all, PR-MoE, hybrid TP-EP-DP[3][12] |
| Tutel | 2022 to 2023 | SwinV2-MoE | configurable | top-k | 2DH all-to-all, adaptive parallelism[13] |
| MegaBlocks | 2022 to 2023 | configurable | configurable | top-k | Block-sparse dropless MoE[14] |
| Mixtral 8x7B | 2023 to 2024 | 47B / 13B | 8 | top-2 | Apache 2.0 open MoE, mainstream EP[18] |
| Mixtral 8x22B | 2024 | 141B / 39B | 8 | top-2 | Scaled Mixtral architecture[19] |
| DeepSeek V3 | 2024 to 2025 | 671B / 37B | 256 + 1 shared | top-8 | 64-way EP + DualPipe; basis for DeepEP[4][15] |
| Megatron-Core MoE | ongoing | configurable | configurable | top-k | Production EP with TP/PP/DP/CP integration[7][8] |
Mixtral 8x7B (mistralai/Mixtral-8x7B), released by Mistral AI in December 2023 and described in arXiv:2401.04088, was a watershed moment for open MoE models: it consists of 8 experts per layer with top-2 routing, totalling about 47 billion parameters with roughly 13 billion active per token, and runs straightforwardly under expert-parallel training and inference stacks such as vLLM, SGLang, and Megatron-Core.[18] Mixtral 8x22B followed in April 2024 with the same architectural template at larger scale.[19] These releases prompted Amazon SageMaker, NVIDIA, and other platform vendors to publish reference architectures showing how to pre-train Mixtral 8x7B with expert parallelism on managed clusters.[20]
The phrase expert parallelism denotes the model-parallelism strategy itself: experts sharded across devices with token routing via all-to-all. The phrase DeepEP denotes the specific open-source communication library released by DeepSeek in February 2025 that implements an optimised version of that strategy.[5][16][17] In the months following the DeepEP release, the abbreviation "EP" began appearing as shorthand for the DeepEP library specifically in cloud deployment guides such as Microsoft Azure's HPC blog and the SGLang large-EP deployment series, while the broader strategy retains the name "expert parallelism."[5] Megatron-Core integrates DeepEP via its FlexDispatcher abstraction, allowing users to choose between a stock NCCL-based AllToAll dispatcher and the DeepEP backend depending on cluster fabric and scale.[7][8]
The primary motivation for EP is enabling sparse activation of very large parameter counts at fixed compute budgets. Concrete benefits documented in the cited literature include: