Expert Choice routing
Last reviewed
Jun 8, 2026
Sources
11 citations
Review status
Source-backed
Revision
v1 · 1,900 words
Expert Choice routing (often abbreviated EC) is a routing method for mixture of experts (MoE) layers in neural networks, introduced by researchers at Google in 2022. In a sparsely activated MoE Transformer, a small router decides which of several parallel expert sub-networks processes each token. Conventional schemes are token choice: every token picks its own top-k experts. Expert Choice inverts the direction of selection. Every expert instead picks its own top-k tokens. Because each expert is assigned exactly the same fixed number of tokens, the layer is perfectly load balanced by construction, with no auxiliary balancing loss and no risk of an expert overflowing its buffer and dropping tokens. A consequence of the inversion is that a token may be chosen by zero, one, or several experts, so tokens receive different amounts of computation at a fixed average budget. [1][2]
The method was presented in "Mixture-of-Experts with Expert Choice Routing" by Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew M. Dai, Zhifeng Chen, Quoc V. Le, and James Laudon, posted to arXiv on February 18, 2022 and published at NeurIPS 2022. [1][3] On an 8-billion-activated-parameter model with 64 experts, the authors reported more than 2x faster training convergence than the GShard top-2 baseline, together with consistent gains on GLUE and SuperGLUE fine-tuning. [1][2]
A sparse MoE layer replaces a single feed-forward block with many parallel expert blocks plus a router. Activating only a few experts per token lets a model hold far more parameters than it spends compute on for any single token. The dominant routing strategy before Expert Choice was token choice, used by GShard (Lepikhin et al., 2020, which sends each token to its top 2 experts) and the Switch Transformer (Fedus et al., 2021, which sends each token to its single top expert). In both, every token computes a distribution over the experts and is dispatched to its highest-scoring one or two. [4][5]
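To make the token-choice baseline concrete, the following is a minimal pure-Python sketch of top-1 dispatch into fixed-size expert buffers. The function name `token_choice_top1`, the `capacity` parameter, and the toy scores are illustrative assumptions, not code from GShard or the Switch Transformer.

```python
def token_choice_top1(scores, capacity):
    """Assign each token to its highest-scoring expert (top-1 token choice).

    scores:   list of rows, scores[i][j] = affinity of token i for expert j.
    capacity: hypothetical fixed buffer size per expert.
    Returns (assignments, dropped): the tokens kept by each expert, and the
    tokens that arrived at an already full expert and were dropped.
    """
    num_experts = len(scores[0])
    assignments = [[] for _ in range(num_experts)]
    dropped = []
    for token, row in enumerate(scores):
        # The token, not the expert, makes the choice.
        expert = max(range(num_experts), key=lambda j: row[j])
        if len(assignments[expert]) < capacity:
            assignments[expert].append(token)
        else:
            dropped.append(token)  # expert buffer is full
    return assignments, dropped


# Toy affinities: three of the four tokens prefer expert 0.
toy_scores = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.7, 0.3],
    [0.4, 0.6],
]
assignments, dropped = token_choice_top1(toy_scores, capacity=2)
print(assignments)  # [[0, 1], [3]]
print(dropped)      # [2]
```

In this toy run expert 0 overflows and loses token 2 while expert 1 is left half empty, which is the kind of imbalance that token-choice systems counter with an auxiliary loss.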
Token choice has three well-known failure modes:
Given the token representations X for a batch with n tokens in total (batch size times sequence length) and e experts, Expert Choice first computes a token-to-expert affinity matrix:
S = softmax(X W_g),
where W_g is a learned gating projection and S has shape n by e. The entry S[i, j] is the affinity score between token i and expert j. Token choice would now take a top-k along the expert dimension, that is, down each row of S. Expert Choice instead takes the top-k along the token dimension, down each column: for every expert j it keeps the k tokens with the highest scores for that expert. This produces an index matrix I, recording which tokens each expert selected, and a gating-weight matrix G holding the corresponding scores. A one-hot tensor P = onehot(I) gathers each expert's chosen tokens; the experts run in parallel, and P and G then scatter and weight the expert outputs back to their original token positions. [1]
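The selection step described above can be sketched in plain Python. This is an illustration under stated assumptions rather than the authors' implementation: the names `expert_choice_route` and `combine` are hypothetical, the logits stand in for X W_g, expert outputs are scalars for brevity, and explicit index loops replace the one-hot tensor P.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of logits."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [v / total for v in exps]


def expert_choice_route(logits, k):
    """Let each expert pick its top-k tokens.

    logits: n rows (tokens) by e columns (experts), standing in for X W_g.
    Returns (S, I, G): the affinity matrix, the token indices chosen by
    each expert, and the matching gating weights.
    """
    S = [softmax(row) for row in logits]  # n by e, softmax over experts
    n, e = len(S), len(S[0])
    I, G = [], []
    for j in range(e):
        # Rank tokens by their affinity for expert j (one column of S).
        ranked = sorted(range(n), key=lambda i: S[i][j], reverse=True)
        chosen = ranked[:k]
        I.append(chosen)
        G.append([S[i][j] for i in chosen])
    return S, I, G


def combine(expert_outputs, I, G, n):
    """Scatter gate-weighted expert outputs back to token positions.

    expert_outputs[j][m] is the output of expert j for its m-th chosen
    token. Tokens chosen by no expert keep the value 0.0.
    """
    out = [0.0] * n
    for j, chosen in enumerate(I):
        for m, i in enumerate(chosen):
            out[i] += G[j][m] * expert_outputs[j][m]
    return out


# Hypothetical logits for n = 4 tokens and e = 2 experts.
toy_logits = [
    [2.0, 0.0],
    [1.0, 0.0],
    [0.0, 1.0],
    [0.0, 3.0],
]
S, I, G = expert_choice_route(toy_logits, k=2)
print(I)  # [[0, 1], [3, 2]]
```

Every list in `I` has length k regardless of the scores, which is the structural source of the load balance. A production version would use batched top-k and gather or einsum operations instead of Python loops.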
The per-expert quota k is set directly from a capacity factor c:
k = n c / e.
Each of the e experts selects exactly k tokens, so the layer makes e k = n c token-to-expert assignments in total, and c is exactly the average number of experts each token is routed to. The authors used c = 2 so that the average compute matches GShard top-2, and they also ablated c = 1 and c = 0.5. [1] The crucial property is that every expert receives exactly k tokens, so all experts are always full: utilization is 100% with no padding or overflow. That structural guarantee gives the perfect load balance and removes any need for a load-balancing loss. [1][2]
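As a worked check of the quota formula, the short sketch below computes k for the three capacity factors mentioned above. The batch size of 1,024 tokens and the helper name `expert_quota` are assumptions chosen for illustration, and rejecting non-integer quotas is a simplification of this sketch, not a rule from the paper.

```python
def expert_quota(n, e, c):
    """Tokens per expert, k = n * c / e, for n tokens, e experts, capacity factor c."""
    k = n * c / e
    if k != int(k):
        # Simplifying assumption of this sketch: only whole quotas are accepted.
        raise ValueError("n * c / e must be a whole number in this sketch")
    return int(k)


n, e = 1024, 64  # hypothetical batch of 1,024 tokens and 64 experts
for c in (2, 1, 0.5):
    k = expert_quota(n, e, c)
    total_assignments = e * k            # every expert is exactly full
    print(c, k, total_assignments / n)   # the last value equals c
```

With these numbers the quotas are 32, 16, and 8 tokens per expert, and dividing the total number of assignments by n recovers c in each case.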
Filling every expert to k tokens says nothing about how many experts any single token lands in, and the resulting distribution is deliberately uneven. In a 100-million-parameter, 64-expert model the authors measured that most tokens were routed to one or two experts, about 23% to three or four experts, and only about 3% to more than four. [1] Important tokens can attract many experts while filler tokens attract few, giving variable computation per token at a fixed budget, a form of adaptive computation that emerges from the routing scheme itself.
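The uneven per-token distribution can be read directly off the selection indices. The sketch below counts how many experts picked each token; the helper `experts_per_token` and the toy selections are hypothetical and are not the measurements reported in the paper.

```python
from collections import Counter


def experts_per_token(I, n):
    """Count how many experts selected each of the n tokens.

    I[j] is the list of token indices chosen by expert j.
    """
    counts = [0] * n
    for chosen in I:
        for i in chosen:
            counts[i] += 1
    return counts


# Hypothetical selections: e = 3 experts, quota k = 2, over n = 4 tokens.
toy_I = [[0, 1], [0, 2], [0, 1]]
counts = experts_per_token(toy_I, n=4)
histogram = Counter(counts)           # experts-per-token -> number of tokens
print(counts)                         # [3, 2, 1, 0]
print(sum(counts) / len(counts))      # 1.5, which is e * k / n
```

Token 0 is processed by all three experts and token 3 by none, yet the average stays fixed at e k / n = 1.5.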
For deployments that need a firm ceiling on per-token cost, the paper adds an optional cap on the maximum number of experts any token may use. It is formulated as an entropy-regularized linear program and solved with Dykstra's algorithm. The capped variants, called EC-CAP2 and EC-CAP3, limit each token to at most two or three experts respectively, with only a small quality cost relative to uncapped Expert Choice. [1]
The central limitation is that Expert Choice, as originally formulated, is not causal. Selecting an expert's top-k tokens ranks tokens against one another, so a token's routing depends on the other tokens in the batch, including ones that appear later in the same sequence. The authors state plainly that the method "might not immediately apply to auto-regressive text generation as our current implementation takes in the past and future tokens to perform the top-k selection," and they suggest restricting the top-k to tokens drawn from different sequences as a partial workaround. [1]
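A toy example makes the dependence on later tokens visible. The scores below are made-up affinities of a single expert for a three-token sequence, and `top_k_tokens` is a hypothetical helper; only the last token's score differs between the two runs.

```python
def top_k_tokens(column, k):
    """Indices of the k highest scores in one expert's column of affinities."""
    ranked = sorted(range(len(column)), key=lambda i: column[i], reverse=True)
    return set(ranked[:k])


# Hypothetical affinities of one expert for a 3-token sequence, with k = 1.
scores_a = [0.6, 0.2, 0.1]  # token 0 has the highest score
scores_b = [0.6, 0.2, 0.9]  # only the LAST token's score differs

picked_a = top_k_tokens(scores_a, k=1)
picked_b = top_k_tokens(scores_b, k=1)

print(0 in picked_a)  # True: token 0 is processed by the expert
print(0 in picked_b)  # False: a later token displaced it
```

Token 0's own score is identical in both runs, but whether the expert processes it depends on a token that comes after it, which is exactly what an autoregressive model cannot know at generation time.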
This is more than a technicality. During autoregressive training, information about future tokens can leak backward into earlier positions through the routing decision, letting the model effectively cheat on the next-token prediction objective. DeepSeek researchers later quantified the effect: the token assignment of an MoE layer with sparse ratio R = K / N, where K is the number of experts active per token and N is the number of experts, can leak more than K log2((1 - R) / R) bits per token, and the leakage accumulates across MoE layers. For a 9-layer model with 16 experts and 2 experts per token on average, R = 1/8, so each layer can leak about 2 log2(7), roughly 5.6 bits, and the nine layers together about 50 bits per token, more than enough for each token to all but determine the identity of its successor. They describe such leakage as "fatal" because it destroys generalization and makes evaluation unreliable. [6] For this reason, plain Expert Choice is best suited to encoders and other non-autoregressive settings, while decoder-only language models have continued to rely on token choice paired with an auxiliary or auxiliary-free balancer. [1][6]
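The 50-bit figure can be reproduced with a few lines of arithmetic. The sketch below assumes that K in the bound is the number of experts active per token, that R = K / N for N experts, and that the per-layer bound adds across layers; that reading is an interpretation consistent with the quoted number, and the function name `leakage_bits` is hypothetical.

```python
import math


def leakage_bits(num_experts, experts_per_token, num_layers):
    """Bound K * log2((1 - R) / R) per MoE layer, summed over layers.

    Assumes K is the number of experts active per token and R = K / N.
    """
    K = experts_per_token
    R = K / num_experts
    per_layer = K * math.log2((1 - R) / R)
    return per_layer * num_layers


print(round(leakage_bits(16, 2, 1), 2))  # 5.61 bits for a single layer
print(round(leakage_bits(16, 2, 9), 1))  # 50.5 bits for nine layers
```

With 16 experts and 2 active per token, R = 1/8 and (1 - R) / R = 7, so each layer contributes 2 log2(7) bits under these assumptions.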
Expert Choice sits between sparse token choice and fully soft routing.
| Method | Who selects | Load balance | Tokens dropped on overflow? | Causal for autoregressive decoding? |
|---|---|---|---|---|
| Token choice (GShard top-2, Switch top-1) | each token picks its top-k experts | needs an auxiliary load-balancing loss; can collapse | yes, when an expert exceeds its capacity | yes, the decision is per token |
| Expert Choice | each expert picks its top-k tokens | perfect by construction; no auxiliary loss | no, experts are exactly filled | no, selection ranks across tokens |
| Soft MoE | each expert processes weighted mixtures (slots) of all tokens | perfect by construction; fully differentiable | no, assignment is soft | no, slots mix all positions |
Soft MoE (Puigcerver et al., 2023) pushes the same intuition further. Instead of a hard top-k, each expert processes a small fixed number of slots, where every slot is a learned weighted average of all tokens. Like Expert Choice it is balanced by construction and drops no tokens, and its slots likewise mix information across positions, so it too targets vision and other non-autoregressive tasks. [7]
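The slot construction can be illustrated with a small pure-Python sketch. It covers only the dispatch half of the idea (building slots as softmax-weighted averages of tokens) and is not the published Soft MoE implementation; the function name `soft_slots` and the fixed logits, which would be learned in a real model, are assumptions.

```python
import math


def soft_slots(tokens, slot_logits):
    """Build each slot as a softmax-weighted average of all tokens.

    tokens:      n token vectors of dimension d.
    slot_logits: n rows by s columns of token-to-slot logits (hypothetical
                 fixed numbers here, learned in a real model).
    Returns (slots, weights), where weights[s] sums to 1 over the n tokens.
    """
    n, d = len(tokens), len(tokens[0])
    num_slots = len(slot_logits[0])
    slots, weights = [], []
    for s in range(num_slots):
        column = [slot_logits[i][s] for i in range(n)]
        peak = max(column)
        exps = [math.exp(v - peak) for v in column]
        total = sum(exps)
        w = [v / total for v in exps]  # softmax over tokens, not over slots
        slot = [sum(w[i] * tokens[i][dim] for i in range(n)) for dim in range(d)]
        slots.append(slot)
        weights.append(w)
    return slots, weights


toy_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_slot_logits = [[0.0, 5.0], [0.0, 0.0], [0.0, 0.0]]
slots, weights = soft_slots(toy_tokens, toy_slot_logits)
print(slots[0])  # equal logits, so slot 0 is the plain average of the tokens
```

Every weight is strictly positive, so each slot depends on every token in the input, including later positions; this is why the approach shares Expert Choice's difficulty with autoregressive decoding.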
Expert Choice has seen real use. The Brainformer architecture (Zhou et al., ICML 2023), from an overlapping Google team, adopts Expert Choice gating with a capacity factor of one, giving a very sparse network in which each token sees one expert on average. [8] Follow-up work restores causality so the load-balancing benefits can reach decoder language models. Lory (Zhong et al., 2024) introduces causal segment routing for a fully differentiable autoregressive MoE, and other work shifts the top-k selection from the token level up to the sequence or segment level so that no routing decision depends on future tokens within a sequence. [9][10] Conversely, because diffusion language models read context bidirectionally rather than autoregressively, the future-token leakage problem does not arise, and 2026 work applies Expert Choice to them directly for its deterministic load balancing, adding timestep-dependent expert capacity. [11]