Expert Choice routing

Updated Jun 8, 2026

Overview

Expert Choice routing (often abbreviated EC) is a routing method for mixture of experts (MoE) layers in neural networks, introduced by researchers at Google in 2022. In a sparsely activated MoE Transformer, a small router decides which of several parallel expert sub-networks processes each token. Conventional schemes are token choice: every token picks its own top-k experts. Expert Choice inverts the direction of selection. Every expert instead picks its own top-k tokens. Because each expert is assigned exactly the same fixed number of tokens, the layer is perfectly load balanced by construction, with no auxiliary balancing loss and no risk of an expert overflowing its buffer and dropping tokens. A consequence of the inversion is that a token may be chosen by zero, one, or several experts, so tokens receive different amounts of computation at a fixed average budget. ^[1]^[2]

The method was presented in "Mixture-of-Experts with Expert Choice Routing" by Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew M. Dai, Zhifeng Chen, Quoc V. Le, and James Laudon, posted to arXiv on February 18, 2022 and published at NeurIPS 2022. ^[1]^[3] On an 8-billion-activated-parameter model with 64 experts, the authors reported more than 2x faster training convergence than the GShard top-2 baseline, together with consistent gains on GLUE and SuperGLUE fine-tuning. ^[1]^[2]

Background: token-choice routing and its problems

A sparse MoE layer replaces a single feed-forward block with many parallel expert blocks plus a router. Activating only a few experts per token lets a model hold far more parameters than it spends compute on for any single token. The dominant routing strategy before Expert Choice was token choice, used by GShard (Lepikhin et al., 2020, which sends each token to its top 2 experts) and the Switch Transformer (Fedus et al., 2021, which sends each token to its single top expert). In both, every token computes a distribution over the experts and is dispatched to its highest-scoring one or two. ^[4]^[5]

Token choice has three well-known failure modes:

Load imbalance. Nothing forces tokens to spread evenly across experts. Popular experts become overloaded while others sit idle and stay under-trained, a self-reinforcing collapse in which a few experts dominate. ^[1]^[4]
Auxiliary losses. To counteract imbalance, token-choice models add a load-balancing loss that penalizes uneven assignment. This extra term must be tuned and trades off against the main objective. ^[4]^[5]
Dropped tokens. Accelerators need fixed tensor shapes, so each expert is given a capacity, a maximum number of tokens it can hold. Tokens routed to an already-full expert are dropped and skip the layer through the residual connection. To keep the drop rate low, implementations over-provision capacity by roughly 2x to 8x, which wastes compute on padded, empty buffer slots. ^[1]
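The overflow behaviour described above can be illustrated with a minimal sketch of Switch-style top-1 token choice under a fixed capacity. The function name `token_choice_top1`, the toy scores, and the capacity of 2 are illustrative assumptions, not taken from any of the cited implementations.

```python
def token_choice_top1(scores, capacity):
    """Top-1 token choice with a fixed per-expert capacity (illustrative sketch).

    scores[i][j] is the router score of token i for expert j.
    Returns (assignment, dropped): assignment[j] lists the tokens held by
    expert j; dropped lists the tokens whose preferred expert was full.
    """
    num_experts = len(scores[0])
    assignment = [[] for _ in range(num_experts)]
    dropped = []
    for token, row in enumerate(scores):
        # Each token picks its own highest-scoring expert.
        best = max(range(num_experts), key=lambda j: row[j])
        if len(assignment[best]) < capacity:
            assignment[best].append(token)
        else:
            # Buffer full: the token skips the layer (in a real model it
            # would pass through the residual connection unchanged).
            dropped.append(token)
    return assignment, dropped


# Toy scores (assumed): 6 tokens, 3 experts, most tokens prefer expert 0.
scores = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.8, 0.1, 0.1],
    [0.5, 0.4, 0.1],
    [0.2, 0.7, 0.1],
    [0.6, 0.1, 0.3],
]
assignment, dropped = token_choice_top1(scores, capacity=2)
print(assignment)  # [[0, 1], [4], []]
print(dropped)     # [2, 3, 5]
```

In this toy case expert 0 overflows and drops three tokens while expert 2 receives none, so its two buffer slots are pure padding.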

How Expert Choice routing works

The routing computation

Given the token representations X for a batch with n tokens in total (batch size times sequence length) and e experts, Expert Choice first computes a token-to-expert affinity matrix:

S = softmax(X W_g),

where W_g is a learned gating projection and S has shape n by e. The entry S[i, j] is the affinity score between token i and expert j. Token choice would now take a top-k along the expert dimension, that is, across each row of S. Expert Choice instead takes the top-k along the token dimension, down each column: for every expert j it keeps the k tokens with the highest scores for that expert. This produces an index matrix I, recording which tokens each expert selected, and a gating-weight matrix G holding the corresponding scores, both of shape e by k. A one-hot tensor P = onehot(I), of shape e by k by n, gathers each expert's chosen tokens; the experts run in parallel, and P and G then scatter and weight the expert outputs back to their original token positions. ^[1]
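The selection step can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: it takes the router logits (the product X W_g) as given, covers only the computation of I and G, and omits the one-hot gather and scatter. The function names and the toy logits are assumptions.

```python
import math


def softmax(row):
    """Numerically stable softmax over one token's expert logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [v / total for v in exps]


def expert_choice(logits, k):
    """Expert Choice selection (sketch).

    logits[i][j] is the router logit of token i for expert j.
    Returns (I, G): I[j] holds the k token indices picked by expert j,
    G[j] the matching affinity scores taken from S.
    """
    S = [softmax(row) for row in logits]  # n x e, normalized over experts
    n, e = len(S), len(S[0])
    I, G = [], []
    for j in range(e):
        # Rank all tokens by their affinity for expert j (column j of S).
        ranked = sorted(range(n), key=lambda i: S[i][j], reverse=True)
        chosen = ranked[:k]
        I.append(chosen)
        G.append([S[i][j] for i in chosen])
    return I, G


# Toy logits (assumed): 4 tokens, 2 experts, quota k = 2.
logits = [[2.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 3.0]]
I, G = expert_choice(logits, k=2)
print(I)  # [[0, 1], [3, 2]]
```

Every row of I has length k regardless of the logits, which is the sense in which each expert's buffer is always exactly full.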

The per-expert quota k is set directly from a capacity factor c:

k = n c / e.

Because the e experts together make e k = n c assignments over n tokens, c is exactly the average number of experts each token is routed to. The authors used c = 2 so that the average compute matches GShard top-2, and they also ablated c = 1 and c = 0.5. ^[1] The crucial property is that every expert receives exactly k tokens, so all experts are always full: utilization is 100% with no padding or overflow. That structural guarantee gives perfect load balance and removes any need for a load-balancing loss. ^[1]^[2]
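As a worked example of the quota formula, the sketch below evaluates k = n c / e for the three capacity factors mentioned above. The token count n = 1024 is an arbitrary illustrative choice, and the requirement that k be a whole number is an assumption of this sketch rather than something the source specifies.

```python
def expert_quota(n, e, c):
    """Per-expert token quota k = n * c / e.

    Assumption of this sketch: the result must be a whole number.
    """
    k = n * c / e
    if k != int(k):
        raise ValueError("n * c / e must be an integer in this sketch")
    return int(k)


n, e = 1024, 64  # n is an illustrative value; 64 experts as in the text
for c in (2, 1, 0.5):
    k = expert_quota(n, e, c)
    # e experts times k tokens each gives n * c assignments in total,
    # so the average number of experts per token is e * k / n = c.
    print(c, k, e * k / n)
# 2 32 2.0
# 1 16 1.0
# 0.5 8 0.5
```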

Heterogeneous compute and an optional cap

Filling every expert to k tokens says nothing about how many experts any single token lands in, and the resulting distribution is deliberately uneven. In a 100-million-parameter, 64-expert model the authors measured that most tokens were routed to one or two experts, about 23% to three or four experts, and only about 3% to more than four. ^[1] Important tokens can attract many experts while filler tokens attract few, giving variable computation per token at a fixed budget, a form of adaptive computation that emerges from the routing scheme itself.
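The per-token distribution can be read directly off the index matrix I. The following sketch counts how many experts selected each token. The selection shown is hypothetical and chosen only to make the unevenness visible; it is not data from the paper.

```python
from collections import Counter


def experts_per_token(I, n):
    """I[j] lists the token indices chosen by expert j; n is the token count.

    Returns a list whose i-th entry is the number of experts that picked token i.
    """
    counts = [0] * n
    for chosen in I:
        for i in chosen:
            counts[i] += 1
    return counts


# Hypothetical selection: 3 experts, quota k = 2, 5 tokens (so c = 1.2).
I = [[0, 1], [0, 3], [0, 2]]
counts = experts_per_token(I, n=5)
print(counts)                           # [3, 1, 1, 1, 0]
print(sorted(Counter(counts).items()))  # [(0, 1), (1, 3), (3, 1)]
print(sum(counts) / len(counts))        # 1.2
```

Token 0 is processed by all three experts and token 4 by none, yet the average stays at the capacity factor.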

For deployments that need a firm ceiling on per-token cost, the paper adds an optional cap on the maximum number of experts any token may use. It is formulated as an entropy-regularized linear program and solved with Dykstra's algorithm. The capped variants, called EC-CAP2 and EC-CAP3, limit each token to at most two or three experts respectively, with only a small quality cost relative to uncapped Expert Choice. ^[1]

Benefits

Perfect load balance with no auxiliary loss. Balance is structural rather than encouraged by a penalty, so the load-balancing loss and its tuning disappear. ^[1]^[2]
No capacity overflow. Every expert is filled to exactly k tokens, so there is no need to over-provision capacity, which removes the padding waste of token-choice systems. ^[1]
Faster convergence. The 8B/64E model reached GShard top-2 quality in less than half the training steps, a more than 2x speedup, and ran about 20% faster per step than the comparable GLaM model. ^[1]^[2]
Stronger downstream quality. At the 8B/64E scale, Expert Choice improved average GLUE and SuperGLUE accuracy by about 2 points over Switch and GShard, and by 3.4 points over a dense T5 baseline. ^[1]

Limitations: causality

The central limitation is that Expert Choice, as originally formulated, is not causal. Selecting an expert's top-k tokens ranks tokens against one another, so a token's routing depends on the other tokens in the batch, including ones that appear later in the same sequence. The authors state plainly that the method "might not immediately apply to auto-regressive text generation as our current implementation takes in the past and future tokens to perform the top-k selection," and they suggest restricting the top-k to tokens drawn from different sequences as a partial workaround. ^[1]
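A toy example makes the dependence on later tokens concrete. The sketch below looks at a single expert's scores over one four-token sequence; the scores and the helper name `expert_picks` are illustrative assumptions.

```python
def expert_picks(column, k):
    """Indices of the k highest scores in one expert's column, as a set."""
    ranked = sorted(range(len(column)), key=lambda i: column[i], reverse=True)
    return set(ranked[:k])


# Scores of one expert for positions 0..3 of a sequence, quota k = 2.
before = [0.9, 0.6, 0.3, 0.2]
after = [0.9, 0.6, 0.3, 0.8]  # only the LAST token's score changed

print(1 in expert_picks(before, k=2))  # True
print(1 in expert_picks(after, k=2))   # False
```

Position 1 loses its slot purely because of a change at position 3, which comes later in the sequence. Under token choice the decision for position 1 would depend only on its own row of scores and would not move.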

This is more than a technicality. During autoregressive training, information about future tokens can leak backward into earlier positions through the routing decision, letting the model effectively cheat on the next-token prediction objective. DeepSeek researchers later quantified the effect: the token assignment of an MoE layer with sparse ratio R (the fraction of experts active per token) can leak more than K log2((1 - R) / R) bits per token, where the factor K grows with the number of MoE layers and the experts used per token. For a 9-layer model with 16 experts and 2 experts per token on average, that bound works out to about 50 bits per token, more than enough for each token to all but determine the identity of its successor. They describe such leakage as "fatal" because it destroys generalization and makes evaluation unreliable. ^[6] For this reason, plain Expert Choice is best suited to encoders and other non-autoregressive settings, while decoder-only language models have continued to rely on token choice paired with an auxiliary-loss or auxiliary-loss-free balancer. ^[1]^[6]
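The 50-bit figure can be reproduced with a short calculation. The text only says that K grows with the number of layers and the experts per token; taking K as their product (9 x 2 = 18) is an assumption made here because it recovers the quoted number.

```python
import math


def leakage_bits(K, R):
    """Lower bound K * log2((1 - R) / R) on leaked bits per token."""
    return K * math.log2((1 - R) / R)


layers, experts, per_token = 9, 16, 2
R = per_token / experts  # sparse ratio: 2 / 16 = 0.125
K = layers * per_token   # ASSUMPTION: K = layers * experts per token = 18
print(round(leakage_bits(K, R), 1))  # 50.5
```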

Relationship to other routing methods

Expert Choice sits between sparse token choice and fully soft routing.

Method	Who selects	Load balance	Tokens dropped on overflow?	Causal for autoregressive decoding?
Token choice (GShard top-2, Switch top-1)	each token picks its top-k experts	needs an auxiliary load-balancing loss; can collapse	yes, when an expert exceeds its capacity	yes, the decision is per token
Expert Choice	each expert picks its top-k tokens	perfect by construction; no auxiliary loss	no, experts are exactly filled	no, selection ranks across tokens
Soft MoE	each expert processes weighted mixtures (slots) of all tokens	perfect by construction; fully differentiable	no, assignment is soft	no, slots mix all positions

Soft MoE (Puigcerver et al., 2023) pushes the same intuition further. Instead of a hard top-k, each expert processes a small fixed number of slots, where every slot is a learned weighted average of all tokens. Like Expert Choice it is balanced by construction and drops no tokens, and its slots likewise mix information across positions, so it too targets vision and other non-autoregressive tasks. ^[7]
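The idea of a slot as a weighted average of all tokens can be sketched as follows. This is a simplified illustration of a single slot, assuming its weights come from a softmax over per-token logits; the function name and inputs are hypothetical, and the step that combines expert outputs back into tokens is omitted.

```python
import math


def soft_slot(tokens, slot_logits):
    """One slot as a softmax-weighted average of all token vectors (sketch).

    tokens[i] is the vector for token i; slot_logits[i] is this slot's
    logit for token i. The softmax runs over tokens, so every token
    contributes to the slot with a nonzero weight.
    """
    m = max(slot_logits)
    weights = [math.exp(v - m) for v in slot_logits]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(tokens[0])
    return [
        sum(weights[i] * tokens[i][t] for i in range(len(tokens)))
        for t in range(dim)
    ]


# Two tokens with equal logits contribute equally to the slot.
print(soft_slot([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]))  # [0.5, 0.5]
```

Because the average runs over every position in the input, a slot built this way sees later tokens as well as earlier ones, which is why the scheme is not causal.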

Expert Choice has seen real use. The Brainformer architecture (Zhou et al., ICML 2023), from an overlapping Google team, adopts Expert Choice gating with a capacity factor of one, giving a very sparse network in which each token sees one expert on average. ^[8] Follow-up work restores causality so the load-balancing benefits can reach decoder language models. Lory (Zhong et al., 2024) introduces causal segment routing for a fully differentiable autoregressive MoE, and other work shifts the top-k selection from the token level up to the sequence or segment level so that no routing decision depends on future tokens within a sequence. ^[9]^[10] Conversely, because diffusion language models read context bidirectionally rather than autoregressively, the future-token leakage problem does not arise, and 2026 work applies Expert Choice to them directly for its deterministic load balancing, adding timestep-dependent expert capacity. ^[11]

References

1. Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew M. Dai, Zhifeng Chen, Quoc V. Le, James Laudon. "Mixture-of-Experts with Expert Choice Routing." arXiv:2202.09368, February 2022 (NeurIPS 2022). https://arxiv.org/abs/2202.09368
2. "Mixture-of-Experts with Expert Choice Routing." Google Research blog, 2022. https://research.google/blog/mixture-of-experts-with-expert-choice-routing/
3. "Mixture-of-Experts with Expert Choice Routing." NeurIPS 2022 proceedings. https://proceedings.neurips.cc/paper_files/paper/2022/hash/2f00ecd787b432c1d36f3de9800728eb-Abstract-Conference.html
4. Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, Zhifeng Chen. "GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding." arXiv:2006.16668, 2020 (ICLR 2021). https://arxiv.org/abs/2006.16668
5. William Fedus, Barret Zoph, Noam Shazeer. "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity." arXiv:2101.03961, 2021 (JMLR 2022). https://arxiv.org/abs/2101.03961
6. Lean Wang, Huazuo Gao, Chenggang Zhao, Xu Sun, Damai Dai. "Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts." arXiv:2408.15664, 2024. https://arxiv.org/abs/2408.15664
7. Joan Puigcerver, Carlos Riquelme, Basil Mustafa, Neil Houlsby. "From Sparse to Soft Mixtures of Experts." arXiv:2308.00951, 2023 (ICLR 2024). https://arxiv.org/abs/2308.00951
8. Yanqi Zhou, Nan Du, Yanping Huang, Daiyi Peng, Chang Lan, Da Huang, Siamak Shakeri, David So, Andrew Dai, Yifeng Lu, Zhifeng Chen, Quoc Le, Claire Cui, James Laudon, Jeff Dean. "Brainformers: Trading Simplicity for Efficiency." arXiv:2306.00008, 2023 (ICML 2023). https://arxiv.org/abs/2306.00008
9. Zexuan Zhong, Mengzhou Xia, Danqi Chen, Mike Lewis. "Lory: Fully Differentiable Mixture-of-Experts for Autoregressive Language Model Pre-training." arXiv:2405.03133, May 2024. https://arxiv.org/abs/2405.03133
10. "Route Experts by Sequence, Not by Token." arXiv:2511.06494, 2025. https://arxiv.org/abs/2511.06494
11. Shuibai Zhang, Caspian Zhuang, Chihan Cui, et al. "Expert-Choice Routing Enables Adaptive Computation in Diffusion Language Models." arXiv:2604.01622, April 2026. https://arxiv.org/abs/2604.01622
