Expert Parallelism

Updated Jun 27, 2026

Expert Parallelism (EP) is a model-parallelism strategy specific to Mixture of Experts (MoE) neural networks in which the individual expert sub-networks (typically feed-forward blocks) are sharded across different accelerator devices rather than being replicated. Each device hosts a disjoint subset of the experts, and a learned router assigns every input token to one or a few experts; tokens are then exchanged between devices via an all-to-all collective so that each expert receives the tokens routed to it, after which a second all-to-all returns the expert outputs to the originating devices.^[1]^[2]^[3] The technique was introduced (under the name "expert parallelism" or "experts in parallel") in Google's GShard system in 2020 and has become the dominant scaling strategy for sparse MoE language models, including Switch Transformer, Mixtral 8x7B, DeepSeek V3, Llama 4, and Qwen3.^[1]^[2]^[4]^[5]^[24]^[25] EP composes with tensor parallelism (TP), pipeline parallelism (PP), and data parallelism (DP) to form what practitioners call 3D or 4D parallelism for trillion-parameter MoE training.^[6]^[7]^[8]

What is expert parallelism?

Expert parallelism is the practice of placing different experts of an MoE layer on different devices so that the model's total parameter count can grow far beyond the memory of a single accelerator while the per-token compute stays roughly constant. Whereas data parallelism replicates the entire model and splits the batch, and tensor parallelism splits individual weight matrices, expert parallelism splits along the expert dimension: each device owns whole experts, and the router decides at runtime which device each token must be sent to.^[1]^[2]^[6] Because only a small subset of experts is activated per token (for example, 2 of 8 in Mixtral 8x7B, or 8 of 256 in DeepSeek V3), most of the model's parameters sit idle for any given token, which is what makes the sparse activation and the cross-device routing worthwhile.^[4]^[18] The defining cost of the strategy is the pair of all-to-all collectives that route tokens to their experts and bring the results back, which is why most of the systems work on EP (DeepSpeed-MoE, Tutel, MegaBlocks, Megatron-Core, DeepEP) exists to make that all-to-all faster.^[3]^[5]^[12]^[13]^[14]

History

Sparsely-gated MoE precursors

Conditional computation through gated mixtures has a long history in machine learning, but the modern incarnation that motivates expert parallelism originates in the 2017 "Outrageously Large Neural Networks" paper by Noam Shazeer and collaborators at Google Brain, which proposed a Sparsely-Gated MoE layer placed between stacked LSTM layers and demonstrated training of models with up to 137 billion parameters.^[9] That work already required distributing experts across devices to fit in memory, foreshadowing the systems abstraction later formalised as expert parallelism.

GShard introduces expert parallelism (2020)

The term and the formal mechanism of expert parallelism were introduced in GShard by Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen, published on arXiv on 30 June 2020 (arXiv:2006.16668) and presented at ICLR 2021.^[1] The authors describe GShard as "a module composed of a set of lightweight annotation APIs and an extension to the XLA compiler" that "provides an elegant way to express a wide range of parallel computation patterns with minimal changes to the existing model code."^[1] They used GShard to "scale up multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters," reporting that "such a giant model can efficiently be trained on 2048 TPU v3 accelerators in 4 days to achieve far superior quality for translation from 100 languages to English."^[1] GShard partitions all tokens in a batch into local groups, assigns each expert a fractional capacity of C = 2N/(G*E) for top-2 gating with N tokens, G groups, and E experts, and uses an auxiliary loss based on the mean square of the fraction of tokens dispatched to each expert as a differentiable surrogate for load imbalance.^[1]^[10]

Switch Transformer scales EP to a trillion parameters (2021)

In January 2021, William Fedus, Barret Zoph, and Noam Shazeer of Google released the Switch Transformer (arXiv:2101.03961), which simplified the GShard approach by routing each token to only the top-1 expert and demonstrated stable training up to 1.6 trillion parameters.^[2] The paper states that the authors "simplify the MoE routing algorithm" and "measure up to 7x increases in pre-training speed with the same computational resources" relative to dense T5 baselines.^[2] Switch Transformer formalised the expert capacity formula capacity = (tokens_per_batch / num_experts) * capacity_factor with capacity factors of approximately 1.0 to 1.25 as a practical sweet spot, and combined data, model, and expert parallelism for its largest configurations.^[2]^[11] The model used a load-balancing auxiliary loss proportional to the dot product of the fraction of tokens routed to each expert and the mean router probability per expert.^[2] Switch Transformer is widely credited with making EP a mainstream training pattern.^[2]^[11]
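The load-balancing loss described above can be sketched in a few lines. This is a minimal illustration, assuming top-1 routing, a softmax router, and the scaling by the number of experts and a coefficient alpha used in the Switch Transformer paper; the function names and the default alpha value are illustrative choices, not taken from any library.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def switch_aux_loss(router_logits, alpha=0.01):
    """router_logits: one list of per-expert logits for every token.

    Returns alpha * num_experts * sum_i f_i * P_i, where f_i is the fraction
    of tokens whose top-1 expert is i and P_i is the mean router probability
    assigned to expert i. The alpha default is an illustrative assumption.
    """
    num_tokens = len(router_logits)
    num_experts = len(router_logits[0])
    probs = [softmax(row) for row in router_logits]
    f = [0.0] * num_experts
    for p in probs:
        f[p.index(max(p))] += 1.0 / num_tokens
    P = [sum(p[i] for p in probs) / num_tokens for i in range(num_experts)]
    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, P))

# Four tokens, four experts: one token per expert versus total collapse.
balanced = [[2, 0, 0, 0], [0, 2, 0, 0], [0, 0, 2, 0], [0, 0, 0, 2]]
collapsed = [[2, 0, 0, 0]] * 4
print(switch_aux_loss(balanced) < switch_aux_loss(collapsed))  # True
```

With perfectly even routing the sum of f_i * P_i equals 1 / num_experts, so the loss reduces to alpha; any concentration of tokens on a few experts pushes it higher.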

DeepSpeed-MoE optimises all-to-all (2022)

In January 2022, Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, and Yuxiong He of Microsoft published DeepSpeed-MoE (arXiv:2201.05596), which appeared at ICML 2022.^[3]^[12] DeepSpeed-MoE introduced an end-to-end training and inference stack including the Pyramid-Residual MoE (PR-MoE) architecture, the Mixture-of-Students (MoS) distillation technique, and a highly optimised inference engine.^[3]^[12] Crucially for EP, DeepSpeed-MoE introduced a hierarchical all-to-all collective that reduces communication latency from O(p) to O(G + p/G) by splitting the global exchange into intra-node and inter-node phases with a data-layout transformation between them, where G is the intra-node group size and p is the total number of ranks.^[12] The system also pioneered hybrid tensor-expert-data parallelism; its inference evaluation spans models from 107B up to 2 trillion parameters on as many as 256 A100 GPUs and reports up to 7.3x lower inference latency than prior MoE inference solutions.^[3]^[12]

Tutel and MegaBlocks (MLSys 2023)

Two MLSys 2023 papers extended EP system design substantially. Tutel: Adaptive Mixture-of-Experts at Scale (arXiv:2206.03382), led by Changho Hwang and co-authors from Microsoft Research, introduced adaptive parallelism switching at runtime, flexible all-to-all, a two-dimensional hierarchical (2DH) all-to-all that mirrors the hierarchy of intra-node NVLink and inter-node InfiniBand fabrics, and fused encode/decode kernels.^[13] Tutel reported 4.96x speedup on a single MoE layer on 16 A100 GPUs and 5.75x on 2048 A100 GPUs versus prior state of the art, and 1.55x to 2.11x end-to-end on SwinV2-MoE.^[13] MegaBlocks (arXiv:2211.15841) by Trevor Gale, Deepak Narayanan, Cliff Young, and Matei Zaharia reformulated MoE computation as block-sparse matrix operations and contributed custom GPU kernels that eliminate token dropping entirely; the system reported up to 40% end-to-end speedups over Tutel and 2.4x over dense models trained with Megatron-LM, establishing dropless MoE training as practical.^[14]

DeepSeek-V3 and DeepEP (2024 to 2025)

DeepSeek V3, released by DeepSeek in late December 2024, scaled EP to 671 billion total parameters with 37 billion active per token using 256 routed experts and 1 shared expert per MoE layer, with each token selecting 8 routed experts.^[4]^[15] DeepSeek-V3 "pioneers an auxiliary-loss-free strategy for load balancing," adding a per-expert bias term to the routing affinity scores and adjusting it during training rather than relying on an auxiliary loss.^[4] Training used 16-way pipeline parallelism, 64-way expert parallelism across 8 nodes, and ZeRO-1 data parallelism, while inference scaled to a 320-GPU large EP configuration during decoding.^[4]^[15] On 25 February 2025, during the second day of DeepSeek's "Open-Source Week," the company released DeepEP (deepseek-ai/DeepEP on GitHub) under the MIT license, introducing it as "the first open-source EP communication library for MoE model training and inference."^[5]^[16]^[17] DeepEP provides high-throughput normal kernels (achieving roughly 153 GB/s intranode over NVLink and 43 to 47 GB/s internode over InfiniBand on H800 systems) and dedicated low-latency RDMA-only kernels for inference decoding that reportedly achieve dispatch latencies as low as 163 microseconds.^[5]^[16] DeepEP supports FP8 dispatch and BF16 combine, and was tightly co-designed with DeepSeek-V3's group-limited gating algorithm.^[5]^[16]

How does expert parallelism work?

Routing and the two all-to-alls

In a sparsely-gated MoE Transformer block, the FFN sub-layer is replaced by E parallel expert FFNs and a small router (typically a single linear projection followed by softmax). For each token x, the router emits gating logits over experts; under top-k gating, the k experts with the highest scores are selected and the token is sent to those experts.^[9]^[1]^[2] Under expert parallelism, experts are partitioned across EP ranks (devices). Computing a single MoE layer therefore proceeds through four phases:^[6]^[8]^[12]^[18]

Routing: each rank computes router logits for its local tokens and decides which expert each token must visit.
All-to-all dispatch: tokens are exchanged across the EP group so that every token arrives at the rank that owns its target expert(s).
Expert FFN forward: each rank runs its local experts over the tokens it received; each expert is typically a gated-linear-unit FFN such as SwiGLU.
All-to-all combine: expert outputs are exchanged back to the originating ranks and combined via the router's gating weights to produce the layer output.

The backward pass mirrors this with two more all-to-alls. Because both transfer sizes and destinations are determined dynamically per step by the router, standard fixed-shape collectives (designed for known-ahead-of-time transfer sizes) are suboptimal, motivating the specialised kernels in DeepSpeed-MoE, Tutel, and DeepEP.^[5]^[12]^[13]
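The four forward phases can be simulated on a single process to make the data movement concrete. The sketch below is illustrative only: ranks are plain Python lists, tokens are scalars, and `moe_layer_ep`, `route`, and the toy experts are hypothetical names introduced here, not part of any library.

```python
def moe_layer_ep(tokens_per_rank, route, experts, experts_per_rank):
    """Simulate one expert-parallel MoE layer.

    tokens_per_rank: one list of float tokens per rank.
    route(token) -> list of (expert_id, gate_weight) pairs (phase 1).
    experts: list of callables; expert e lives on rank e // experts_per_rank.
    """
    num_ranks = len(tokens_per_rank)
    # Phases 1 and 2: route, then "all-to-all dispatch" into per-rank inboxes.
    inbox = [[] for _ in range(num_ranks)]
    for src, tokens in enumerate(tokens_per_rank):
        for pos, tok in enumerate(tokens):
            for expert_id, gate in route(tok):
                dst = expert_id // experts_per_rank
                inbox[dst].append((src, pos, expert_id, gate, tok))
    # Phase 3: every rank runs its local experts on what it received.
    outbox = [[] for _ in range(num_ranks)]
    for dst in range(num_ranks):
        for src, pos, expert_id, gate, tok in inbox[dst]:
            outbox[dst].append((src, pos, gate * experts[expert_id](tok)))
    # Phase 4: "all-to-all combine" returns outputs and sums the gated results.
    out = [[0.0] * len(tokens) for tokens in tokens_per_rank]
    for dst in range(num_ranks):
        for src, pos, weighted in outbox[dst]:
            out[src][pos] += weighted
    return out

# Toy setup: 2 ranks, 4 experts (2 per rank); expert e multiplies by e + 1.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
route = lambda tok: [(0, 0.5), (3, 0.5)] if tok >= 0 else [(1, 1.0)]
print(moe_layer_ep([[1.0, -2.0], [4.0]], route, experts, experts_per_rank=2))
# [[2.5, -4.0], [10.0]]
```

In a real system the two loops over `inbox` and `outbox` are the variable-size all-to-all collectives, and the routing metadata (source rank, position, gate weight) travels alongside the activations.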

Concretely, consider an MoE layer with 8 experts spread one-per-GPU across an EP=8 group, top-2 routing, and a per-rank micro-batch of 1024 tokens with hidden dimension 4096 in BF16. The all-to-all dispatch handles about 1024 * 2 * 4096 * 2 bytes = 16 MiB of token activations per rank (slightly less actually leaves the rank, since some tokens select the local expert), plus a small amount of routing metadata (token indices and gating weights). After expert FFN compute (which, for a SwiGLU expert, is roughly three large matmuls of shape tokens x 4096 x intermediate), the combine all-to-all returns a similar volume. On a typical 8-GPU H100 node, where each GPU has 900 GB/s of aggregate NVLink bandwidth (450 GB/s per direction), the dispatch alone is bandwidth-limited rather than latency-limited, which justifies overlap-focused optimisations like Tutel's pipelining and DualPipe's bidirectional schedule.^[13]^[15]^[23]
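The arithmetic in this example is easy to reproduce. The helper names below are hypothetical, the uniform-routing fraction is a simplifying assumption, and the bandwidth passed to `transfer_seconds` is an assumed per-direction figure rather than a measured one.

```python
def dispatch_bytes(tokens, top_k, hidden, bytes_per_elem):
    """Bytes of token activations one rank hands to the dispatch all-to-all."""
    return tokens * top_k * hidden * bytes_per_elem

def remote_bytes(total_bytes, ep_size):
    """Assumes uniform routing: a fraction (ep_size - 1) / ep_size leaves the rank."""
    return total_bytes * (ep_size - 1) // ep_size

def transfer_seconds(num_bytes, bandwidth_bytes_per_s):
    """Pure bandwidth term; ignores latency and protocol overhead."""
    return num_bytes / bandwidth_bytes_per_s

total = dispatch_bytes(tokens=1024, top_k=2, hidden=4096, bytes_per_elem=2)
print(total / 2**20)                   # 16.0 (MiB)
print(remote_bytes(total, 8) / 2**20)  # 14.0 (MiB)
print(transfer_seconds(total, 450e9))  # assumed 450 GB/s per direction
```

Switching `bytes_per_elem` from 2 to 1 shows the effect of an FP8 dispatch on the same configuration.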

Capacity factor and load balancing

Because the number of tokens routed to each expert is data-dependent and frequently uneven, MoE systems pre-allocate a fixed per-expert buffer called expert capacity:^[1]^[2]^[11]

capacity = ceil((tokens_per_batch / num_experts) * capacity_factor)

Tokens beyond the capacity of their chosen expert are typically dropped (or, in dropless systems, handled via block-sparse computation).^[2]^[14] A larger capacity factor reduces dropping but inflates compute, memory, and communication volume; Switch Transformer found a sweet spot near 1.0 to 1.25.^[2]^[11] To keep dropping low, MoE models add a load-balancing auxiliary loss that penalises imbalance between the fraction of tokens routed to each expert and the mean router probability assigned to that expert.^[1]^[2]^[10] GShard and Switch Transformer both rely on auxiliary losses; later work, including DeepSeek V3, has explored auxiliary-loss-free strategies that instead add a per-expert bias to the router scores, adjusted during training, to keep load balanced without an explicit regulariser.^[4]^[15]
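A short sketch of the capacity formula and the resulting token dropping, assuming top-1 routing and first-come-first-served admission (one common policy; real systems may prioritise tokens differently). The function names are illustrative.

```python
import math

def expert_capacity(tokens_per_batch, num_experts, capacity_factor):
    """capacity = ceil((tokens_per_batch / num_experts) * capacity_factor)"""
    return math.ceil(tokens_per_batch / num_experts * capacity_factor)

def apply_capacity(assignments, num_experts, capacity):
    """assignments[i] is the expert chosen by token i (top-1).

    Returns (kept, dropped) lists of token indices; a token is dropped
    when its expert's buffer is already full.
    """
    load = [0] * num_experts
    kept, dropped = [], []
    for idx, expert in enumerate(assignments):
        if load[expert] < capacity:
            load[expert] += 1
            kept.append(idx)
        else:
            dropped.append(idx)
    return kept, dropped

print(expert_capacity(1024, 8, 1.25))  # 160
cap = expert_capacity(8, 4, 1.0)       # 2 slots per expert
print(apply_capacity([0, 0, 0, 1, 2, 3, 0, 1], 4, cap))
# ([0, 1, 3, 4, 5, 7], [2, 6])
```

In the second call expert 0 is chosen four times but has only two slots, so tokens 2 and 6 overflow; raising the capacity factor to 2.0 would keep them at the price of larger buffers on every expert.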

All-to-all dispatch/combine pattern

The dispatch/combine pair is the defining collective communication pattern of expert parallelism. In a vanilla implementation, each of the EP ranks sends a different number of tokens to each of the other ranks, which is the textbook unbalanced all-to-all primitive supported by NCCL and other collective libraries.^[5]^[12] The total wire volume per layer is on the order of 2 * batch_tokens * hidden_size * k activation elements for top-k routing across the dispatch and combine pair (multiply by the bytes per element to obtain bytes), before any reduction. Because the destination of each token is decided by the gate at runtime, the per-rank send counts are not known until the routing step has completed, which is incompatible with the static buffer shapes that high-performance collective libraries traditionally exploit.^[5]^[12] Three families of optimisation appear repeatedly:
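The dependence of send counts on routing can be seen in a small sketch. It assumes experts are laid out contiguously (expert e on rank e // experts_per_rank) and counts one copy per selected expert, without the per-rank deduplication that an optimised library might apply; `send_counts` and `recv_counts` are hypothetical helper names.

```python
def send_counts(topk_per_rank, num_ranks, experts_per_rank):
    """topk_per_rank[src] holds, for each local token, its selected expert ids.

    Returns counts where counts[src][dst] is the number of token copies
    that rank src must send to rank dst in the dispatch all-to-all.
    """
    counts = [[0] * num_ranks for _ in range(num_ranks)]
    for src, token_choices in enumerate(topk_per_rank):
        for chosen_experts in token_choices:
            for expert_id in chosen_experts:
                counts[src][expert_id // experts_per_rank] += 1
    return counts

def recv_counts(counts):
    """What each rank receives is the transpose of what the others send."""
    return [list(col) for col in zip(*counts)]

routing = [
    [(0, 3), (2, 3)],  # rank 0: two tokens, top-2 each
    [(0, 1)],          # rank 1: one token
]
sends = send_counts(routing, num_ranks=2, experts_per_rank=2)
print(sends)               # [[1, 3], [2, 0]]
print(recv_counts(sends))  # [[1, 2], [3, 0]]
```

Each row differs and is only known after routing, which is why implementations generally need to learn the receive sizes before or while the payload moves.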

Hierarchical or 2DH all-to-all: nest an intra-node exchange (NVLink) inside an inter-node exchange (InfiniBand or RoCE), so that many small per-rank messages are aggregated into fewer, larger transfers on the slower fabric, cutting the number of inter-node messages by a factor proportional to the per-node GPU count.^[12]^[13]
Communication and computation overlap: split the batch into micro-chunks and pipeline dispatch, expert compute, and combine to hide network latency, used by Tutel, Megatron-Core, and DeepSeek's DualPipe.^[13]^[7]^[15]
Low-precision dispatch: cast token activations to FP8 for the dispatch step, halving wire volume relative to BF16 while combining in BF16 to preserve accuracy. DeepEP makes this the default.^[5]^[16]
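To see why the hierarchical scheme in the first item helps, the sketch below counts how many peers each rank exchanges messages with. It is a deliberately simplified model of the O(p) versus O(G + p/G) behaviour: it ignores message sizes and the extra data movement between the two stages, and the function names are illustrative.

```python
def flat_peers(p):
    """Peers each rank messages in a single-level all-to-all over p ranks."""
    return p - 1

def hierarchical_peers(p, g):
    """Peers per rank when the exchange is split into an intra-node phase
    (g - 1 peers) followed by an inter-node phase (p // g - 1 peers)."""
    assert p % g == 0, "total ranks must be a multiple of the group size"
    return (g - 1) + (p // g - 1)

for p in (16, 64, 256):
    print(p, flat_peers(p), hierarchical_peers(p, g=8))
# 16 15 8
# 64 63 14
# 256 255 38
```

The gap widens with scale, which matches the intuition that small, latency-bound messages benefit most from aggregation.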

A fourth, complementary technique is adaptive routing at the network fabric level. Both DeepEP and the SGLang large-EP deployment guide recommend enabling adaptive routing on InfiniBand fabrics to spread the dispatch traffic across multiple paths, reducing tail-latency spikes when many ranks send to the same expert simultaneously.^[5]^[16]

How does expert parallelism differ from data, tensor, and pipeline parallelism?

Expert parallelism does not replace data, tensor, or pipeline parallelism; it composes with them. The four axes shard different things and use different collectives. A common pattern, sometimes called 3D or 4D parallelism for MoE, is:^[6]^[7]^[8]

Dimension	What it shards	Where it applies in an MoE Transformer	Key collective
Data Parallelism (DP)	Mini-batch across replicas	All layers	All-reduce on gradients
Tensor Parallelism (TP)	Weight matrices along an axis	Attention and dense FFN; sometimes per-expert	All-reduce or all-gather
Pipeline Parallelism (PP)	Layers across stages	Whole model	Point-to-point activations
Expert Parallelism (EP)	Experts across ranks	MoE FFN only	All-to-all dispatch and combine

The key conceptual difference is that DP, TP, and PP all partition a fixed, statically-known computation graph, whereas EP partitions along a dimension whose data routing is decided at runtime by the gate. That is why EP relies on the dynamic, unbalanced all-to-all rather than the static all-reduce or point-to-point exchanges of the other axes.^[5]^[6] Megatron-Core's MoE module recommends keeping the product EP * TP within a single node (typically 8 GPUs) so that the high-volume expert all-to-all stays on NVLink, reserving pipeline parallelism for cross-node scaling.^[7] The 2025 "MoE Parallel Folding" paper from NVIDIA further decouples attention parallelism from MoE parallelism, allowing ETP * EP * EDP * PP over the experts to be configured independently from TP * CP * DP * PP over attention.^[8]
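A parallel mapping of this kind can be illustrated by decomposing a global rank into per-axis coordinates. The ordering used here (TP fastest, then EP, DP, PP) is a hypothetical layout chosen for illustration and is not claimed to match Megatron-Core's actual rank assignment; the node size of 8 GPUs follows the recommendation above.

```python
def rank_coords(rank, tp, ep, dp, pp):
    """Hypothetical layout: TP varies fastest, then EP, then DP, then PP."""
    assert 0 <= rank < tp * ep * dp * pp, "rank outside the world size"
    return {
        "tp": rank % tp,
        "ep": (rank // tp) % ep,
        "dp": (rank // (tp * ep)) % dp,
        "pp": rank // (tp * ep * dp),
    }

def ep_tp_fits_in_node(tp, ep, gpus_per_node=8):
    """True when one EP x TP group can stay on a single node's NVLink domain."""
    return tp * ep <= gpus_per_node

print(rank_coords(13, tp=2, ep=4, dp=2, pp=2))
# {'tp': 1, 'ep': 2, 'dp': 1, 'pp': 0}
print(ep_tp_fits_in_node(tp=2, ep=4))  # True
print(ep_tp_fits_in_node(tp=4, ep=4))  # False
```

Under this layout, ranks 0 to 7 share the same DP and PP coordinates, so an EP * TP group of size 8 occupies exactly one node and its all-to-all never crosses the inter-node fabric.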

Implementations

DeepSpeed-MoE

DeepSpeed-MoE is the MoE training and inference subsystem of Microsoft's DeepSpeed library, accompanying the 2022 paper.^[3]^[12] It provides hybrid tensor-expert-data parallelism, the hierarchical all-to-all, FP16 and FP32 expert kernels, and the PR-MoE architecture in which the bottom layers of the network use fewer experts than the top layers, reducing parameter count without quality loss.^[3]^[12]

Tutel

Tutel is a Microsoft Research open-source MoE library (also integrated into DeepSpeed) that ships flexible all-to-all primitives, 2DH all-to-all, fused token permutation/unpermutation, and runtime-adaptive parallelism switching, enabling expert and tensor parallelism degrees to change between iterations to match dynamic routing patterns.^[13] Tutel powered SwinV2-MoE training and remains a reference EP implementation cited by subsequent systems.^[13]

MegaBlocks

MegaBlocks (stanford-futuredata/megablocks and the later Databricks fork) reframes MoE FFN computation as a block-sparse matrix multiply over a single packed buffer, eliminating padding and token dropping. Its custom CUDA kernels deliver up to 40% throughput gains over Tutel and roughly 2.4x over Megatron-LM dense baselines on equivalent compute budgets.^[14] MegaBlocks underpins the open-source training stacks used for many community MoE models.^[14]
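The dropless idea can be sketched without any GPU kernels: sort tokens by expert into one packed buffer and let each expert process a variable-length slice. This is only the grouping logic, not MegaBlocks' block-sparse kernels, and the names `group_by_expert` and `dropless_forward` are hypothetical.

```python
def group_by_expert(assignments, num_experts):
    """Return (order, offsets): token indices sorted by expert, and the
    slice bounds so that expert e owns order[offsets[e]:offsets[e + 1]]."""
    order = sorted(range(len(assignments)), key=lambda i: assignments[i])
    offsets = [0] * (num_experts + 1)
    for expert in assignments:
        offsets[expert + 1] += 1
    for e in range(num_experts):
        offsets[e + 1] += offsets[e]
    return order, offsets

def dropless_forward(tokens, assignments, experts):
    """Run every token through its expert: no padding, no dropped tokens."""
    order, offsets = group_by_expert(assignments, len(experts))
    out = [None] * len(tokens)
    for e, expert in enumerate(experts):
        for i in order[offsets[e]:offsets[e + 1]]:
            out[i] = expert(tokens[i])
    return out

experts = [lambda x: x * 1.0, lambda x: x * 10.0, lambda x: x * 100.0]
print(group_by_expert([2, 0, 2, 2, 1], 3))
# ([1, 4, 0, 2, 3], [0, 1, 2, 5])
print(dropless_forward([1.0, 2.0, 3.0, 4.0, 5.0], [2, 0, 2, 2, 1], experts))
# [100.0, 2.0, 300.0, 400.0, 50.0]
```

Expert 2 receives three tokens and expert 0 only one, yet neither slice is padded to a fixed capacity, which is the property a capacity-factor design cannot offer.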

Megatron-Core MoE

NVIDIA's Megatron-LM (Megatron-Core) provides a production EP implementation supporting EP combined with DP, TP, PP, and sequence parallelism, with token dispatcher options including a standard AllToAll dispatcher backed by NCCL, a FlexDispatcher that wraps DeepEP for cross-node EP, and a FlexDispatcher variant tuned for the multi-node NVLink fabric of the GB200 NVL72, the follow-on to H100-based systems.^[7] Megatron-Core also implements GroupedGEMM for batched expert matmuls, FP8 dispatch, router fusion, and overlap of the all-to-all with computation by merging forward and backward passes.^[7]^[8]

DeepEP

DeepEP is the EP communication library open-sourced by DeepSeek on 25 February 2025 during Open-Source Week, described in its repository as "a high-performance communication library" providing "high-throughput and low-latency all-to-all GPU kernels (MoE dispatch and combine) with low-precision support including FP8."^[5]^[16]^[17] It offers two kernel families: normal kernels optimised for high throughput during prefill and training, and low-latency kernels built on RDMA-only paths that bypass NCCL for decoding latency.^[5]^[16] DeepEP supports InfiniBand and RoCE, FP8 dispatch with BF16 combine, communication-computation overlap via CUDA hooks that avoid consuming streaming multiprocessor resources, and adaptive routing for congested fabrics.^[5]^[16] DeepEP requires Hopper-class GPUs and integrates into PyTorch.^[5]^[16]

DualPipe and computation/communication overlap

DeepSeek's DualPipe, open-sourced on day 4 of Open-Source Week as deepseek-ai/DualPipe, is a bidirectional pipeline parallelism algorithm specifically engineered to hide the EP all-to-alls inside compute.^[15]^[23] DualPipe divides each micro-batch chunk into four components (attention, all-to-all dispatch, MLP, all-to-all combine) and schedules forward and backward passes in overlapping bidirectional streams, so that while one set of micro-batches dispatches tokens, another performs attention or expert MLP computation, masking most of the communication latency.^[15]^[23] DualPipe is therefore complementary to DeepEP: DeepEP optimises the all-to-all itself, while DualPipe overlaps the all-to-all with other computation.^[15]^[23] The combination is reported to reduce pipeline bubbles substantially in DeepSeek V3 and R1 training.^[15]^[23]

Which models use expert parallelism?

System	Year	Total / active params	Experts per MoE layer	Routing	Notable EP feature
GShard MoE Transformer	2020	600B / sparse	up to 2048	top-2	Introduced EP, capacity factor, aux loss^[1]
Switch Transformer (Switch-C)	2021	1.6T / sparse	2048	top-1	EP + DP at trillion scale^[2]
DeepSpeed-MoE	2022	up to 2T	configurable	top-1	Hierarchical all-to-all, PR-MoE, hybrid TP-EP-DP^[3]^[12]
Tutel	2022 to 2023	SwinV2-MoE	configurable	top-k	2DH all-to-all, adaptive parallelism^[13]
MegaBlocks	2022 to 2023	configurable	configurable	top-k	Block-sparse dropless MoE^[14]
Mixtral 8x7B	2023 to 2024	47B / 13B	8	top-2	Apache 2.0 open MoE, mainstream EP^[18]
Mixtral 8x22B	2024	141B / 39B	8	top-2	Scaled Mixtral architecture^[19]
DeepSeek V3	2024 to 2025	671B / 37B	256 + 1 shared	top-8	64-way EP + DualPipe; basis for DeepEP^[4]^[15]
Llama 4 Maverick	2025	400B / 17B	128 + 1 shared	routed + shared	Native multimodal MoE, alternating dense/MoE layers^[24]
Llama 4 Scout	2025	109B / 17B	16	top-k	10M-token context, runs on a single H100 host^[24]
Qwen3-235B-A22B	2025	235B / 22B	128	top-8	No shared expert; large-EP inference^[25]
Megatron-Core MoE	ongoing	configurable	configurable	top-k	Production EP with TP/PP/DP/CP integration^[7]^[8]

Mixtral 8x7B (mistralai/Mixtral-8x7B), released by Mistral AI in December 2023 and described in arXiv:2401.04088, was a watershed moment for open MoE models: it consists of 8 experts per layer with top-2 routing, totalling about 47 billion parameters with roughly 13 billion active per token, and runs straightforwardly under expert-parallel training and inference stacks such as vLLM, SGLang, and Megatron-Core.^[18] Mixtral 8x22B followed in April 2024 with the same architectural template at larger scale.^[19] These releases prompted Amazon SageMaker, NVIDIA, and other cloud providers to publish reference architectures showing how to pre-train Mixtral 8x7B with expert parallelism on managed clusters.^[20]

The 2025 frontier of open-weight MoE entrenched EP further. Meta's Llama 4 family, released on 5 April 2025, ships two MoE models: Llama 4 Maverick with about 400 billion total parameters and 17 billion active per token across 128 routed experts plus a shared expert in alternating dense and MoE layers, and Llama 4 Scout with about 109 billion total parameters, 17 billion active, 16 experts, and a 10 million token context window.^[24] Alibaba's Qwen3 flagship, Qwen3-235B-A22B, released in late April 2025 with a technical report following in May, uses 235 billion total parameters with 22 billion active per token, 128 experts and top-8 routing, and (unlike DeepSeek-V3 and earlier Qwen2.5-MoE) drops the shared expert.^[25] All of these models are served in production with large-EP inference stacks (SGLang, vLLM, TensorRT-LLM) that shard experts across many GPUs.^[5]^[25]

Terminology: Expert Parallelism vs DeepEP

The phrase expert parallelism denotes the model-parallelism strategy itself: experts sharded across devices with token routing via all-to-all. The phrase DeepEP denotes the specific open-source communication library released by DeepSeek in February 2025 that implements an optimised version of that strategy.^[5]^[16]^[17] In the months following the DeepEP release, the abbreviation "EP" began appearing as shorthand for the DeepEP library specifically in cloud deployment guides such as Microsoft Azure's HPC blog and the SGLang large-EP deployment series, while the broader strategy retains the name "expert parallelism."^[5] Megatron-Core integrates DeepEP via its FlexDispatcher abstraction, allowing users to choose between a stock NCCL-based AllToAll dispatcher and the DeepEP backend depending on cluster fabric and scale.^[7]^[8]

Applications

The primary motivation for EP is enabling sparse activation of very large parameter counts at fixed compute budgets. Concrete benefits documented in the cited literature include:

Pre-training efficiency: Switch Transformer reported up to 7x speedups versus T5 dense baselines at matched FLOPs, and DeepSpeed-MoE reported 5x training-cost reductions over quality-equivalent dense autoregressive models.^[2]^[3]
Inference efficiency: DeepSpeed-MoE reported 4.5x faster, 9x cheaper inference versus quality-matched dense models, and DeepEP enables sub-200-microsecond decode-step dispatch on H800 systems.^[3]^[5]
Scaling beyond a single host: EP is what allows MoE total parameter counts to grow from tens of billions on a single node into the hundreds of billions or trillions distributed across many nodes, while keeping per-token compute roughly constant.^[1]^[2]^[4]
Multilingual and domain specialisation: GShard's original use case was massively multilingual translation, and subsequent work has shown that EP-trained MoE layers learn approximate per-language and per-domain expert specialisation.^[1]
Open-weight MoE ecosystem: Mixtral 8x7B was the first MoE model with broadly comparable quality to dense 70B-class models to ship under a permissive license, and EP made it economically feasible for hobbyists and small labs to fine-tune the model on a handful of GPUs, accelerating the open-weights MoE ecosystem (Qwen3 MoE variants, Llama 4, Grok-1, DBRX, and others all rely on EP).^[18]^[19]^[24]^[25]

What are the challenges and limitations of expert parallelism?

All-to-all is the bottleneck. The dispatch and combine collectives are typically the dominant cost of an MoE layer at scale; DeepSpeed-MoE, Tutel, DeepEP, and Megatron-Core all exist primarily to address this bottleneck.^[12]^[13]^[7]^[5] Standard collective libraries are tuned for symmetric, fixed-shape exchanges, whereas EP all-to-alls are dynamic and unbalanced.^[5]
Token dropping versus padding. Static capacity factors force a tradeoff between dropping tokens (hurting quality) and padding (wasting compute and bandwidth); MegaBlocks' block-sparse formulation resolves this for training but is not yet universally adopted.^[14]
Load balancing is fragile. Auxiliary losses can interfere with the primary objective if poorly tuned, and routing collapse (most tokens going to a few experts) was a persistent issue in early Switch-style models. Strategies range from carefully tuned alpha hyperparameters to expert-choice routing and DeepSeek V3's auxiliary-loss-free bias-update rule.^[2]^[4]^[15]
EP couples mostly to a single sub-layer. Because EP only applies to the MoE FFN, the rest of the model (attention, dense layers, embeddings) must still be parallelised with TP, PP, and DP, complicating the parallel mapping. "MoE Parallel Folding" exists explicitly to decouple these dimensions.^[8]
Fabric dependence. EP performance is highly sensitive to network topology; H800 export constraints and inter-node bandwidth shaped many of DeepEP's design choices, and small-cluster deployments may not see the benefits demonstrated on multi-thousand-GPU clusters.^[5]^[15]
Inference complexity. Serving an EP-sharded MoE model in production requires distributed runtimes that can route tokens at request granularity. Serving stacks including SGLang, vLLM, and TensorRT-LLM have invested significant engineering in EP inference paths; deploying large-EP inference (for example, the 96-H100 deployment of DeepSeek V3/R1 served by LMSYS) still requires careful tuning of dispatch buffer sizes, expert replication strategies, and prefill/decode disaggregation.^[5]^[16]
Auxiliary-loss tuning vs auxiliary-loss-free. DeepSeek V3 introduced an auxiliary-loss-free balancing strategy that adds a per-expert bias to the routing affinity scores and adjusts it during training (lowering it for overloaded experts and raising it for underloaded ones), which avoids the gradient interference of an auxiliary loss. Whether this approach generalises is still an open question in the literature.^[4]^[15]
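The bias-based balancing in the last item can be sketched as a selection step plus an update step. This is a minimal illustration assuming a fixed step size and a comparison against the mean load; the step value, function names, and the choice to apply the bias to selection only are assumptions made for the sketch rather than a reproduction of DeepSeek's implementation.

```python
def select_top_k(scores, bias, k):
    """Pick k experts by biased score. The bias steers selection only; a real
    router would still compute gating weights from the unbiased scores."""
    ranked = sorted(range(len(scores)),
                    key=lambda e: scores[e] + bias[e], reverse=True)
    return ranked[:k]

def update_bias(bias, expert_load, step=0.001):
    """Lower the bias of overloaded experts and raise it for underloaded ones.
    No gradient is involved; `step` is an illustrative assumption."""
    mean_load = sum(expert_load) / len(expert_load)
    new_bias = []
    for b, load in zip(bias, expert_load):
        if load > mean_load:
            new_bias.append(b - step)
        elif load < mean_load:
            new_bias.append(b + step)
        else:
            new_bias.append(b)
    return new_bias

bias = [0.0, 0.0, 0.0]
print(select_top_k([0.9, 0.8, 0.1], bias, k=1))  # [0]
bias = update_bias(bias, expert_load=[10, 2, 6], step=0.5)
print(bias)                                      # [-0.5, 0.5, 0.0]
print(select_top_k([0.9, 0.8, 0.1], bias, k=1))  # [1]
```

After one update the overloaded expert 0 loses its edge and expert 1 is selected instead, without any term having been added to the training loss.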

Related concepts

Tensor parallelism shards weight matrices within a layer. EP can be viewed as a "data-dependent" tensor parallelism over the expert dimension, where the routing of activations to weight shards is decided at runtime by the gate.^[6]^[8]
Pipeline parallelism shards across layers; DeepSeek V3's DualPipe is a bidirectional PP algorithm specifically engineered to overlap PP bubbles with EP all-to-alls.^[15]
Data parallelism shards the mini-batch across replicas; EP and DP form orthogonal axes in MoE training.^[6]^[7]
Model parallelism is the umbrella term that subsumes TP, PP, and EP.^[21]
Expert-choice routing is an alternative to token-choice top-k routing in which each expert picks the top tokens it wants, yielding perfect load balancing without an auxiliary loss but altering the gradient path; it can be implemented on the same EP substrate as top-k routing.^[22]
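Expert-choice routing, the last item above, inverts the selection direction, which a few lines make concrete. The sketch assumes a dense score matrix and a fixed per-expert capacity; `expert_choice` is a hypothetical helper, and a practical implementation would use a batched top-k over tensors.

```python
def expert_choice(scores, capacity):
    """scores[t][e] is the router affinity of token t for expert e.

    Each expert selects its `capacity` highest-scoring tokens, so every
    expert processes exactly the same number of tokens by construction.
    """
    num_tokens = len(scores)
    num_experts = len(scores[0])
    chosen = []
    for e in range(num_experts):
        ranked = sorted(range(num_tokens),
                        key=lambda t: scores[t][e], reverse=True)
        chosen.append(ranked[:capacity])
    return chosen

scores = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.6, 0.4],
    [0.3, 0.7],
]
print(expert_choice(scores, capacity=2))  # [[0, 2], [1, 3]]
```

Load is balanced by construction, but a token may be picked by several experts or by none, which is the behavioural difference from token-choice top-k routing.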

References

[1] Lepikhin, D. et al., "GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding", arXiv, 2020-06-30. https://arxiv.org/abs/2006.16668. Accessed 2026-05-21.
[2] Fedus, W., Zoph, B., and Shazeer, N., "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity", arXiv, 2021-01-11. https://arxiv.org/abs/2101.03961. Accessed 2026-05-21.
[3] Rajbhandari, S. et al., "DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale", arXiv, 2022-01-14. https://arxiv.org/abs/2201.05596. Accessed 2026-05-21.
[4] DeepSeek-AI, "DeepSeek-V3 Technical Report", arXiv, 2024-12-27. https://arxiv.org/abs/2412.19437. Accessed 2026-05-21.
[5] DeepSeek-AI, "DeepEP: an efficient expert-parallel communication library", GitHub repository, 2025-02-25. https://github.com/deepseek-ai/DeepEP. Accessed 2026-05-21.
[6] Singh, S. et al., "A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training", arXiv, 2023-03-11. https://arxiv.org/abs/2303.06318. Accessed 2026-05-21.
[7] NVIDIA, "Mixture of Experts package", Megatron-Core developer documentation. https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/features/moe.html. Accessed 2026-05-21.
[8] Liu, X. et al., "MoE Parallel Folding: Heterogeneous Parallelism Mappings for Efficient Large-Scale MoE Model Training with Megatron Core", arXiv, 2025-04-21. https://arxiv.org/abs/2504.14960. Accessed 2026-05-21.
[9] Shazeer, N. et al., "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer", arXiv, 2017-01-23. https://arxiv.org/abs/1701.06538. Accessed 2026-05-21.
[10] Zhang, Y., "A Review on the Evolvement of Load Balancing Strategy in MoE LLMs: Pitfalls and Lessons", Hugging Face Blog, 2025-01-15. https://huggingface.co/blog/NormalUhr/moe-balance. Accessed 2026-05-21.
[11] Sanseviero, O. et al., "Mixture of Experts Explained", Hugging Face Blog, 2023-12-11. https://huggingface.co/blog/moe. Accessed 2026-05-21.
[12] Microsoft Research Blog, "DeepSpeed: Advancing MoE inference and training to power next-generation AI scale", Microsoft, 2022-01-19. https://www.microsoft.com/en-us/research/blog/deepspeed-advancing-moe-inference-and-training-to-power-next-generation-ai-scale/. Accessed 2026-05-21.
[13] Hwang, C. et al., "Tutel: Adaptive Mixture-of-Experts at Scale", arXiv, 2022-06-07; published at MLSys 2023. https://arxiv.org/abs/2206.03382. Accessed 2026-05-21.
[14] Gale, T., Narayanan, D., Young, C., and Zaharia, M., "MegaBlocks: Efficient Sparse Training with Mixture-of-Experts", arXiv, 2022-11-29; published at MLSys 2023. https://arxiv.org/abs/2211.15841. Accessed 2026-05-21.
[15] DeepSeek-AI, "Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures", arXiv, 2025-05-14. https://arxiv.org/abs/2505.09343. Accessed 2026-05-21.
[16] MarkTechPost, "DeepSeek AI Releases DeepEP: An Open-Source EP Communication Library for MoE Model Training and Inference", 2025-02-24. https://www.marktechpost.com/2025/02/24/deepseek-ai-releases-deepep-an-open-source-ep-communication-library-for-moe-model-training-and-inference/. Accessed 2026-05-21.
[17] Analytics Vidhya, "DeepEP Released on Day 2 of Open Source Week at DeepSeek", 2025-02-25. https://www.analyticsvidhya.com/blog/2025/02/deepseek-deepep/. Accessed 2026-05-21.
[18] Jiang, A. Q. et al., "Mixtral of Experts", arXiv, 2024-01-08. https://arxiv.org/abs/2401.04088. Accessed 2026-05-21.
[19] Mistral AI, "Cheaper, Better, Faster, Stronger: Continuing to push the frontier of AI and making it accessible to all (Mixtral 8x22B)", Mistral AI blog, 2024-04-17. https://mistral.ai/news/mixtral-8x22b/. Accessed 2026-05-21.
[20] AWS Machine Learning Blog, "Accelerate Mixtral 8x7B pre-training with expert parallelism on Amazon SageMaker", Amazon Web Services, 2024-05-23. https://aws.amazon.com/blogs/machine-learning/accelerate-mixtral-8x7b-pre-training-with-expert-parallelism-on-amazon-sagemaker/. Accessed 2026-05-21.
[21] NVIDIA, "Parallelisms", NeMo Framework User Guide. https://docs.nvidia.com/nemo-framework/user-guide/24.07/nemotoolkit/features/parallelisms.html. Accessed 2026-05-21.
[22] Zhou, Y. et al., "Mixture-of-Experts with Expert Choice Routing", arXiv, 2022-02-18. https://arxiv.org/abs/2202.09368. Accessed 2026-05-21.
[23] DeepSeek-AI, "DualPipe: A bidirectional pipeline parallelism algorithm for computation-communication overlap in DeepSeek V3/R1 training", GitHub repository, 2025-02-27. https://github.com/deepseek-ai/DualPipe. Accessed 2026-05-21.
[24] Meta AI, "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation", Meta, 2025-04-05. https://ai.meta.com/blog/llama-4-multimodal-intelligence/. Accessed 2026-06-27.
[25] Qwen Team, "Qwen3 Technical Report", arXiv, 2025-05-14. https://arxiv.org/abs/2505.09388. Accessed 2026-06-27.
