# Expert Choice routing

> Source: https://aiwiki.ai/wiki/expert_choice_routing
> Updated: 2026-06-08
> Categories: Deep Learning, Neural Networks
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

## Overview

Expert Choice routing (often abbreviated EC) is a routing method for [mixture of experts](/wiki/mixture_of_experts) (MoE) layers in neural networks, introduced by researchers at [Google](/wiki/google) in 2022. In a sparsely activated MoE [Transformer](/wiki/transformer), a small router decides which of several parallel expert sub-networks processes each token. Conventional schemes are token choice: every token picks its own top-k experts. Expert Choice inverts the direction of selection. Every expert instead picks its own top-k tokens. Because each expert is assigned exactly the same fixed number of tokens, the layer is perfectly load balanced by construction, with no auxiliary balancing loss and no risk of an expert overflowing its buffer and dropping tokens. A consequence of the inversion is that a token may be chosen by zero, one, or several experts, so tokens receive different amounts of computation at a fixed average budget. [1][2]

The method was presented in "Mixture-of-Experts with Expert Choice Routing" by Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew M. Dai, Zhifeng Chen, Quoc V. Le, and James Laudon, posted to arXiv on February 18, 2022 and published at NeurIPS 2022. [1][3] On an 8-billion-activated-parameter model with 64 experts, the authors reported more than 2x faster training convergence than the [GShard](/wiki/gshard) top-2 baseline, together with consistent gains on GLUE and SuperGLUE fine-tuning. [1][2]

## Background: token-choice routing and its problems

A sparse MoE layer replaces a single feed-forward block with many parallel expert blocks plus a router. Activating only a few experts per token lets a model hold far more parameters than it spends compute on for any single token. The dominant routing strategy before Expert Choice was token choice, used by [GShard](/wiki/gshard) (Lepikhin et al., 2020, which sends each token to its top 2 experts) and the [Switch Transformer](/wiki/switch_transformer) (Fedus et al., 2021, which sends each token to its single top expert). In both, every token computes a distribution over the experts and is dispatched to its highest-scoring one or two. [4][5]

Token choice has three well-known failure modes:

- Load imbalance. Nothing forces tokens to spread evenly across experts. Popular experts become overloaded while others sit idle and stay under-trained, a self-reinforcing collapse in which a few experts dominate. [1][4]
- Auxiliary losses. To counteract imbalance, token-choice models add a [load-balancing loss](/wiki/load_balancing_loss) that penalizes uneven assignment. This extra term must be tuned and trades off against the main objective. [4][5]
- Dropped tokens. Accelerators need fixed tensor shapes, so each expert is given a capacity, a maximum number of tokens it can hold. Tokens routed to an already-full expert are dropped and skip the layer through the residual connection. To keep the drop rate low, implementations over-provision capacity by roughly 2x to 8x, which wastes compute on padded, empty buffer slots. [1]
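
The overflow behaviour described in the list above can be made concrete with a small sketch. It assumes top-1 token choice and tokens processed in order; the function names `top1_route_with_capacity` and `expert_capacity` and the toy scores are hypothetical, chosen only for illustration, and real systems implement this with fixed-shape tensor operations rather than Python loops.

```python
import math


def expert_capacity(num_tokens, num_experts, capacity_factor):
    """Buffer slots per expert: the even share times an over-provisioning factor."""
    return math.ceil(capacity_factor * num_tokens / num_experts)


def top1_route_with_capacity(scores, capacity):
    """Top-1 token-choice routing with a fixed per-expert capacity.

    scores[i][j] is the router score of token i for expert j.
    Returns (assignments, dropped): assignments[j] lists the tokens kept by
    expert j, and dropped lists tokens whose chosen expert was already full.
    """
    num_experts = len(scores[0])
    assignments = [[] for _ in range(num_experts)]
    dropped = []
    for token, row in enumerate(scores):
        expert = max(range(num_experts), key=lambda j: row[j])  # the token picks
        if len(assignments[expert]) < capacity:
            assignments[expert].append(token)
        else:
            dropped.append(token)  # would skip the layer via the residual path
    return assignments, dropped


# Six tokens, three experts; five of the six tokens prefer expert 0.
scores = [
    [0.80, 0.10, 0.10],
    [0.70, 0.20, 0.10],
    [0.60, 0.30, 0.10],
    [0.50, 0.30, 0.20],
    [0.10, 0.20, 0.70],
    [0.90, 0.05, 0.05],
]

cap = expert_capacity(len(scores), 3, 1.0)            # 2 slots per expert
assignments, dropped = top1_route_with_capacity(scores, cap)
print(assignments, dropped)                            # [[0, 1], [], [4]] [2, 3, 5]

roomy_cap = expert_capacity(len(scores), 3, 2.0)      # 4 slots per expert
roomy_assignments, roomy_dropped = top1_route_with_capacity(scores, roomy_cap)
empty_slots = 3 * roomy_cap - sum(len(a) for a in roomy_assignments)
print(roomy_dropped, empty_slots)                      # [5] 7
```

In this toy case, doubling the capacity factor cuts the drops from three tokens to one but leaves seven of twelve buffer slots empty, which is the padding cost the list refers to.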

## How Expert Choice routing works

### The routing computation

Given the token representations X for a batch with n tokens in total (batch size times sequence length) and e experts, Expert Choice first computes a token-to-expert affinity matrix:

S = softmax(X W_g),

where W_g is a learned gating projection and S has shape n by e. The entry S[i, j] is the affinity score between token i and expert j. Token choice would now take a top-k along the expert dimension, that is, across each row of S. Expert Choice instead takes the top-k along the token dimension, down each column: for every expert j it keeps the k tokens with the highest scores for that expert. This produces an index matrix I, recording which tokens each expert selected, and a gating-weight matrix G holding the corresponding scores. A one-hot tensor P = onehot(I) gathers each expert's chosen tokens; the experts run in parallel, and P and G then scatter and weight the expert outputs back to their original token positions. [1]
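
The computation above can be sketched in a few lines of dependency-free Python. This is a minimal illustration rather than the authors' implementation: it takes the router logits (the rows of X W_g) as given, uses explicit loops in place of the one-hot gather and scatter, and the names `expert_choice_route` and `combine` are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [v / total for v in exps]


def expert_choice_route(logits, k):
    """logits[i][j] is the router logit of token i for expert j.

    Returns (I, G): I[j] lists the k token indices chosen by expert j,
    and G[j] holds the matching affinity scores S[i][j].
    """
    S = [softmax(row) for row in logits]      # softmax over the expert axis
    n, e = len(S), len(S[0])
    I, G = [], []
    for j in range(e):
        # Top-k down column j: rank all tokens by their affinity for expert j.
        ranked = sorted(range(n), key=lambda i: S[i][j], reverse=True)
        chosen = ranked[:k]
        I.append(chosen)
        G.append([S[i][j] for i in chosen])
    return I, G


def combine(I, G, expert_fns, X):
    """Run each expert on its chosen tokens and scatter the gated outputs back."""
    n, d = len(X), len(X[0])
    out = [[0.0] * d for _ in range(n)]
    for j, expert in enumerate(expert_fns):
        for i, gate in zip(I[j], G[j]):
            y = expert(X[i])
            for t in range(d):
                out[i][t] += gate * y[t]
    return out


# Four tokens, three experts, k = 2 tokens per expert.
logits = [
    [2.2, 2.0, 0.0],
    [0.0, 2.2, 2.0],
    [0.0, 0.0, 0.0],
    [2.0, 0.0, 2.2],
]
I, G = expert_choice_route(logits, k=2)
print(I)  # [[0, 3], [1, 0], [3, 1]]

X = [[1.0, 2.0] for _ in range(4)]
identity_experts = [lambda x: x] * 3          # stand-ins for the expert FFNs
out = combine(I, G, identity_experts, X)
```

Every expert ends up with exactly two tokens. Token 2, whose scores are uniform, is chosen by no expert and receives a zero update from the layer, while tokens 0, 1, and 3 are each chosen twice.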

The per-expert quota k is set directly from a capacity factor c:

k = n c / e.

Summed over all e experts, the layer makes e k = n c token-to-expert assignments, so c is exactly the average number of experts each token is routed to. The authors used c = 2 so that the average compute matches GShard top-2, and they also ablated c = 1 and c = 0.5. [1] The crucial property is that every expert receives exactly k tokens, so all experts are always full: utilization is 100% with no padding or overflow. That structural guarantee yields perfect load balance and removes any need for a load-balancing loss. [1][2]
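
As a quick arithmetic check of the quota formula, the sketch below evaluates k for the three capacity factors mentioned above. The batch of 4096 tokens is an assumed figure chosen so that k comes out as a whole number, and `expert_quota` is a hypothetical helper name.

```python
def expert_quota(n, c, e):
    """Tokens per expert, k = n * c / e; requires the result to be a whole number."""
    k = n * c / e
    if k != int(k):
        raise ValueError("n * c / e must be an integer for a fixed-size buffer")
    return int(k)


n, e = 4096, 64   # assumed: 4096 tokens in the batch, 64 experts
for c in (2, 1, 0.5):
    k = expert_quota(n, c, e)
    total_assignments = e * k             # every slot is filled
    avg_experts_per_token = total_assignments / n
    print(c, k, avg_experts_per_token)
# 2 128 2.0
# 1 64 1.0
# 0.5 32 0.5
```

The last column always equals c, because the e k filled slots are spread over n tokens.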

### Heterogeneous compute and an optional cap

Filling every expert to k tokens says nothing about how many experts any single token lands in, and the resulting distribution is deliberately uneven. In a 100-million-parameter, 64-expert model the authors measured that most tokens were routed to one or two experts, about 23% to three or four experts, and only about 3% to more than four. [1] Important tokens can attract many experts while filler tokens attract few, giving variable computation per token at a fixed budget, a form of adaptive computation that emerges from the routing scheme itself.
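
A distribution like the one reported above can be read off the index matrix by counting how often each token appears. The sketch below does this for a made-up index matrix; the data and the helper names `experts_per_token` and `histogram` are hypothetical and are not taken from the paper.

```python
from collections import Counter


def experts_per_token(I, n):
    """Count how many experts selected each of the n tokens.

    I[j] is the list of token indices chosen by expert j.
    """
    counts = [0] * n
    for chosen in I:
        for i in chosen:
            counts[i] += 1
    return counts


def histogram(counts):
    """Map 'number of experts' to 'number of tokens with that many experts'."""
    return dict(sorted(Counter(counts).items()))


# Made-up example: 4 experts, each choosing k = 3 of n = 8 tokens.
I = [
    [0, 1, 2],
    [0, 1, 4],
    [0, 3, 5],
    [0, 2, 1],
]
counts = experts_per_token(I, n=8)
print(counts)             # [4, 3, 2, 1, 1, 1, 0, 0]
print(histogram(counts))  # {0: 2, 1: 3, 2: 1, 3: 1, 4: 1}
```

The counts always sum to e k (here 12), so the average stays fixed at c = 1.5 even though individual tokens receive anywhere from zero to four experts.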

For deployments that need a firm ceiling on per-token cost, the paper adds an optional cap on the maximum number of experts any token may use. It is formulated as an entropy-regularized linear program and solved with Dykstra's algorithm. The capped variants, called EC-CAP2 and EC-CAP3, limit each token to at most two or three experts respectively, with only a small quality cost relative to uncapped Expert Choice. [1]
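
One way to write such a capped program, using the article's S and k, is sketched below. The symbols are assumptions introduced here for illustration: A is an e by n assignment matrix whose entry A[j, i] indicates how strongly expert j selects token i, b is the per-token cap (2 for EC-CAP2, 3 for EC-CAP3), and λ weights the entropy term H. The exact formulation in the paper may differ in notation and detail.

```latex
\begin{aligned}
\max_{A}\quad & \langle S^{\top}, A \rangle + \lambda H(A),
  \qquad H(A) = -\sum_{j,i} A[j,i] \log A[j,i] \\
\text{s.t.}\quad & \sum_{i} A[j,i] = k \quad \text{for every expert } j, \\
& \sum_{j} A[j,i] \le b \quad \text{for every token } i, \\
& 0 \le A[j,i] \le 1 .
\end{aligned}
```

The first constraint keeps every expert filled to exactly k tokens, and the second is the only addition relative to uncapped routing: it bounds the number of experts any single token can occupy.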

## Benefits

- Perfect load balance with no auxiliary loss. Balance is structural rather than encouraged by a penalty, so the load-balancing loss and its tuning disappear. [1][2]
- No capacity overflow. Every expert is filled to exactly k tokens, so there is no need to over-provision capacity, which removes the padding waste of token-choice systems. [1]
- Faster convergence. The 8B/64E model reached GShard top-2 quality in less than half the training steps, a more than 2x speedup, and ran about 20% faster per step than the comparable [GLaM](/wiki/glam) model. [1][2]
- Stronger downstream quality. At the 8B/64E scale, Expert Choice improved average GLUE and SuperGLUE accuracy by about 2 points over Switch and GShard, and by 3.4 points over a dense T5 baseline. [1]

## Limitations: causality

The central limitation is that Expert Choice, as originally formulated, is not causal. Selecting an expert's top-k tokens ranks tokens against one another, so a token's routing depends on the other tokens in the batch, including ones that appear later in the same sequence. The authors state plainly that the method "might not immediately apply to auto-regressive text generation as our current implementation takes in the past and future tokens to perform the top-k selection," and they suggest restricting the top-k to tokens drawn from different sequences as a partial workaround. [1]

This is more than a technicality. During autoregressive training, information about future tokens can leak backward into earlier positions through the routing decision, letting the model effectively cheat on the next-token prediction objective. DeepSeek researchers later quantified the effect: the token assignment of an MoE layer with sparse ratio R (the fraction of experts active per token) can leak more than K log2((1 - R) / R) bits per token, where K is the average number of experts used per token, and the leakage accumulates across MoE layers. For a 9-layer model with 16 experts and 2 experts per token on average, R = 2/16, so each layer can leak about 2 log2(7), or roughly 5.6 bits, and the nine layers together reach about 50 bits per token, more than enough for each token to all but determine the identity of its successor. They describe such leakage as "fatal" because it destroys generalization and makes evaluation unreliable. [6] For this reason, plain Expert Choice is best suited to encoders and other non-autoregressive settings, while decoder-only language models have continued to rely on token choice paired with either an auxiliary balancing loss or an auxiliary-loss-free balancing strategy. [1][6]
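
The quoted figure can be reproduced with a short calculation. The sketch assumes that K is the average number of experts per token, that R = K / N for N experts, and that the per-layer bound simply adds up across MoE layers; these readings are consistent with the 50-bit example but are stated here as assumptions, and `leakage_bits_per_token` is a hypothetical helper name.

```python
import math


def leakage_bits_per_token(experts_per_token, num_experts, num_layers=1):
    """Lower-bound estimate K * log2((1 - R) / R) per layer, summed over layers.

    Assumes R = experts_per_token / num_experts and that layers contribute
    independently, so the per-layer bound is multiplied by num_layers.
    """
    R = experts_per_token / num_experts
    per_layer = experts_per_token * math.log2((1 - R) / R)
    return num_layers * per_layer


per_layer = leakage_bits_per_token(2, 16)                 # about 5.6 bits
total = leakage_bits_per_token(2, 16, num_layers=9)       # about 50.5 bits
print(round(per_layer, 2), round(total, 2))               # 5.61 50.53
```

For comparison, picking one token out of a vocabulary of 100,000 entries takes about 17 bits, which is why a leak of this size is treated as sufficient to reveal the next token.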

## Relationship to other routing methods

Expert Choice sits between sparse token choice and fully soft routing.

| Method | Who selects | Load balance | Tokens dropped on overflow? | Causal for autoregressive decoding? |
|---|---|---|---|---|
| Token choice (GShard top-2, Switch top-1) | each token picks its top-k experts | needs an auxiliary load-balancing loss; can collapse | yes, when an expert exceeds its capacity | yes, the decision is per token |
| Expert Choice | each expert picks its top-k tokens | perfect by construction; no auxiliary loss | no, experts are exactly filled | no, selection ranks across tokens |
| Soft MoE | each expert processes weighted mixtures (slots) of all tokens | perfect by construction; fully differentiable | no, assignment is soft | no, slots mix all positions |

[Soft MoE](/wiki/soft_moe) (Puigcerver et al., 2023) pushes the same intuition further. Instead of a hard top-k, each expert processes a small fixed number of slots, where every slot is a learned weighted average of all tokens. Like Expert Choice it is balanced by construction and drops no tokens, and its slots likewise mix information across positions, so it too targets vision and other non-autoregressive tasks. [7]
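
The slot mechanism can be illustrated with a small dependency-free sketch. It assumes the token-to-slot logits are already computed (in Soft MoE they come from learned per-slot parameters, omitted here), leaves out the expert networks themselves, and uses the hypothetical names `soft_moe_weights` and `slot_inputs`.

```python
import math


def softmax(values):
    """Numerically stable softmax over one list of logits."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [v / total for v in exps]


def soft_moe_weights(logits):
    """logits[i][s] is the score of token i for slot s.

    Returns (dispatch, combine), both of shape n by num_slots.
    dispatch is normalized over tokens, so each slot is a convex mixture of
    all tokens; combine is normalized over slots, so each token's output is
    a convex mixture of all slot outputs.
    """
    n, num_slots = len(logits), len(logits[0])
    columns = [softmax([logits[i][s] for i in range(n)]) for s in range(num_slots)]
    dispatch = [[columns[s][i] for s in range(num_slots)] for i in range(n)]
    combine = [softmax(row) for row in logits]
    return dispatch, combine


def slot_inputs(dispatch, X):
    """Each slot input is the dispatch-weighted average of every token vector."""
    n, num_slots, d = len(X), len(dispatch[0]), len(X[0])
    return [
        [sum(dispatch[i][s] * X[i][t] for i in range(n)) for t in range(d)]
        for s in range(num_slots)
    ]


# Three tokens, two slots.
logits = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
dispatch, combine_weights = soft_moe_weights(logits)
slots = slot_inputs(dispatch, X)

# Changing only the last token changes every slot: positions are mixed.
X_changed = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
slots_changed = slot_inputs(dispatch, X_changed)
```

Because every slot draws on every token, including later ones, the same causality concern that applies to Expert Choice applies here.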

Expert Choice has seen real use. The [Brainformer](/wiki/brainformer) architecture (Zhou et al., ICML 2023), from an overlapping Google team, adopts Expert Choice gating with a capacity factor of one, giving a very sparse network in which each token sees one expert on average. [8] Follow-up work restores causality so the load-balancing benefits can reach decoder language models. Lory (Zhong et al., 2024) introduces causal segment routing for a fully differentiable autoregressive MoE, and other work shifts the top-k selection from the token level up to the sequence or segment level so that no routing decision depends on future tokens within a sequence. [9][10] Conversely, because diffusion language models read context bidirectionally rather than autoregressively, the future-token leakage problem does not arise, and 2026 work applies Expert Choice to them directly for its deterministic load balancing, adding timestep-dependent expert capacity. [11]

## References

1. Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew M. Dai, Zhifeng Chen, Quoc V. Le, James Laudon. "Mixture-of-Experts with Expert Choice Routing." arXiv:2202.09368, February 2022 (NeurIPS 2022). https://arxiv.org/abs/2202.09368
2. "Mixture-of-Experts with Expert Choice Routing." Google Research blog, 2022. https://research.google/blog/mixture-of-experts-with-expert-choice-routing/
3. "Mixture-of-Experts with Expert Choice Routing." NeurIPS 2022 proceedings. https://proceedings.neurips.cc/paper_files/paper/2022/hash/2f00ecd787b432c1d36f3de9800728eb-Abstract-Conference.html
4. Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, Zhifeng Chen. "GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding." arXiv:2006.16668, 2020 (ICLR 2021). https://arxiv.org/abs/2006.16668
5. William Fedus, Barret Zoph, Noam Shazeer. "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity." arXiv:2101.03961, 2021 (JMLR 2022). https://arxiv.org/abs/2101.03961
6. Lean Wang, Huazuo Gao, Chenggang Zhao, Xu Sun, Damai Dai. "Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts." arXiv:2408.15664, 2024. https://arxiv.org/abs/2408.15664
7. Joan Puigcerver, Carlos Riquelme, Basil Mustafa, Neil Houlsby. "From Sparse to Soft Mixtures of Experts." arXiv:2308.00951, 2023 (ICLR 2024). https://arxiv.org/abs/2308.00951
8. Yanqi Zhou, Nan Du, Yanping Huang, Daiyi Peng, Chang Lan, Da Huang, Siamak Shakeri, David So, Andrew Dai, Yifeng Lu, Zhifeng Chen, Quoc Le, Claire Cui, James Laudon, Jeff Dean. "Brainformers: Trading Simplicity for Efficiency." arXiv:2306.00008, 2023 (ICML 2023). https://arxiv.org/abs/2306.00008
9. Zexuan Zhong, Mengzhou Xia, Danqi Chen, Mike Lewis. "Lory: Fully Differentiable Mixture-of-Experts for Autoregressive Language Model Pre-training." arXiv:2405.03133, May 2024. https://arxiv.org/abs/2405.03133
10. "Route Experts by Sequence, Not by Token." arXiv:2511.06494, 2025. https://arxiv.org/abs/2511.06494
11. Shuibai Zhang, Caspian Zhuang, Chihan Cui, et al. "Expert-Choice Routing Enables Adaptive Computation in Diffusion Language Models." arXiv:2604.01622, April 2026. https://arxiv.org/abs/2604.01622

