MoE load-balancing loss

Updated Jul 23, 2026

Overview

The MoE load balancing loss is an auxiliary training objective used in sparse mixture of experts (MoE) neural networks to keep work spread evenly across the expert sub-networks. In a sparse MoE layer, a small learned router sends each token to only a few of many parallel experts, so that the model can hold a large number of parameters while spending compute on only a fraction of them per token. Left to itself, the router tends to converge on a small set of favored experts, a degenerate state often called expert collapse or router collapse. The load balancing loss adds a penalty that is smallest when tokens are routed uniformly across experts, nudging the router toward balanced assignment while the main objective continues to train. ^[1]^[2]^[3]

The technique is well established. A pair of balancing losses was introduced with the sparsely-gated MoE layer of Shazeer et al. in 2017, simplified into the now-standard single auxiliary term by GShard (2020) and the Switch Transformer (2021), and stabilized further by a separate router z-loss in ST-MoE (2022). A later line of work removes the auxiliary gradient entirely: DeepSeek's auxiliary-loss-free load balancing (2024) keeps experts balanced by adjusting a per-expert routing bias instead of by adding a loss term, and was used to train DeepSeek-V3. ^[1]^[2]^[3]^[4]^[5]

The expert-collapse problem

A sparse MoE layer replaces a single feed-forward block with N expert blocks plus a router. For each token the router produces affinity scores over the experts (typically via a learned projection followed by a softmax), selects the top-k experts (k is often 1 or 2), and combines their outputs weighted by the gate values. Because experts run in parallel on accelerators, the layer can scale parameter count almost independently of per-token compute. ^[1]^[3]
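The routing step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular library's implementation: the function names, the toy scalar "experts", and the renormalization of the top-k gate values are assumptions made for the example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(logits, k):
    """Return (expert_index, gate_weight) pairs for the k highest-scoring experts.

    Gate weights are the softmax probabilities of the chosen experts,
    renormalized to sum to one (an assumption; some models skip this step).
    """
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_forward(x, logits, experts, k):
    """Combine the outputs of the selected experts, weighted by their gates."""
    return sum(g * experts[i](x) for i, g in route_token(logits, k))

# Four toy "experts": scalar functions standing in for feed-forward blocks.
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x, lambda x: x * x]
logits = [0.1, 2.0, -1.0, 1.5]   # hypothetical router scores for one token
print(route_token(logits, k=2))  # experts 1 and 3 are selected
print(moe_forward(3.0, logits, experts, k=2))
```

Only the two selected experts are evaluated for this token, which is where the compute saving comes from; the other experts' parameters are untouched.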

Nothing in this setup forces tokens to spread evenly. Routing creates a self-reinforcing feedback loop: an expert that receives slightly more tokens early in training gets more gradient signal, improves faster, and is therefore selected even more often, while neglected experts stay under-trained and are chosen less and less. The end state is expert collapse, in which a handful of experts absorb most of the traffic. This is harmful for two reasons. First, it wastes capacity, because idle experts contribute little despite occupying memory and parameters. Second, it unbalances compute: real implementations give every expert a fixed buffer (its capacity) so that tensor shapes stay static, and an overloaded expert overflows its buffer, forcing its surplus tokens to be dropped. ^[1]^[2]^[3]

Token dropping is governed by the expert capacity, defined in the Switch Transformer as

$$\text{expert capacity} = \frac{\text{tokens per batch}}{\text{number of experts}} \cdot \text{capacity factor}$$

A capacity factor of 1.0 provisions exactly the balanced share; values above 1.0 add slack to tolerate some imbalance at the cost of padding empty buffer slots. Any token routed to an expert that is already full is dropped, meaning it skips the layer and passes through the residual connection unchanged. Both the wasted-capacity and dropped-token failure modes get worse as imbalance grows, which is why balancing is treated as a first-class part of MoE training. ^[2]^[3]
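As a rough illustration of how the capacity factor trades dropped tokens against padding, the sketch below computes the capacity from the formula above and fills expert buffers in token order. Rounding the capacity up and the first-come-first-served fill order are assumptions of this example, and the function names are hypothetical.

```python
import math

def expert_capacity(tokens_per_batch, num_experts, capacity_factor):
    """Buffer slots per expert; rounding up is an assumption of this sketch."""
    return math.ceil(tokens_per_batch / num_experts * capacity_factor)

def dispatch(assignments, num_experts, capacity):
    """Fill expert buffers in token order; tokens that hit a full buffer are dropped."""
    buffers = [[] for _ in range(num_experts)]
    dropped = []
    for token, expert in enumerate(assignments):
        if len(buffers[expert]) < capacity:
            buffers[expert].append(token)
        else:
            dropped.append(token)  # would pass through the residual connection only
    return buffers, dropped

# 8 tokens, 4 experts, top-1 routing skewed toward expert 0.
assignments = [0, 0, 0, 0, 0, 1, 2, 3]
for cf in (1.0, 1.5, 2.5):
    cap = expert_capacity(len(assignments), 4, cf)
    _, dropped = dispatch(assignments, 4, cap)
    print(f"capacity_factor={cf}: capacity={cap}, dropped tokens={dropped}")
```

With this skewed assignment, a capacity factor of 1.0 drops three tokens, 1.5 drops two, and only 2.5 drops none, at which point three of the four buffers are mostly padding.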

The auxiliary load-balancing loss

Shazeer et al. (2017), in "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer," addressed collapse with two soft penalties, each based on the squared coefficient of variation $\mathrm{CV}^2$ (variance divided by squared mean) of a per-expert quantity. The importance loss penalizes unequal use of the gate values,

$$L_{\text{importance}} = w_{\text{importance}} \cdot \mathrm{CV}(\mathrm{Importance}(X))^2$$

where Importance(X) is, for each expert, the sum of its gate values over the batch X. Balancing importance alone is not enough, because an expert could accumulate high importance from a few tokens with large gates while still receiving an uneven number of tokens, so the authors add a load loss, $L_{\text{load}} = w_{\text{load}} \cdot \mathrm{CV}(\mathrm{Load}(X))^2$, built on a smooth differentiable estimate of how many tokens each expert receives. Their router is a noisy top-k gate, $G(x) = \mathrm{Softmax}(\mathrm{KeepTopK}(H(x), k))$ with $H(x)_i = (x W_g)_i + \mathrm{StandardNormal}() \cdot \mathrm{Softplus}((x W_{\text{noise}})_i)$; the tunable Gaussian noise both encourages exploration and makes the load estimate differentiable. ^[1]
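The importance term can be illustrated with a short sketch. It covers only the importance loss (the smooth load estimator is omitted), uses the population variance in $\mathrm{CV}^2$, and picks a weight of 0.1 purely for illustration; all three choices are assumptions of this example rather than details taken from the paper.

```python
def cv_squared(values):
    """Squared coefficient of variation: population variance / mean**2."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return var / (mean ** 2)

def importance_loss(gates, w_importance):
    """gates[t][i] is the gate value of expert i for token t (zero if not selected)."""
    num_experts = len(gates[0])
    # Importance of an expert: its gate values summed over the batch.
    importance = [sum(row[i] for row in gates) for i in range(num_experts)]
    return w_importance * cv_squared(importance)

# Two tokens, four experts, top-2 gating.
balanced = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
skewed = [[0.9, 0.1, 0.0, 0.0], [0.8, 0.2, 0.0, 0.0]]
print(importance_loss(balanced, w_importance=0.1))  # zero: every expert has equal importance
print(importance_loss(skewed, w_importance=0.1))    # positive: experts 2 and 3 are unused
```

The penalty is zero only when every expert accumulates the same total gate value, and it grows as importance concentrates on fewer experts.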

GShard (Lepikhin et al., 2020) and the Switch Transformer (Fedus et al., 2021) condensed this into a single term that has become the default. For N experts the auxiliary loss is

$$\text{loss} = \alpha N \sum_i f_i P_i$$

where $f_i$ is the fraction of tokens dispatched to expert i (a hard count) and $P_i$ is the average router probability assigned to expert i over the batch (a soft quantity). The sum $\sum_i f_i P_i$ is minimized, subject to the probabilities summing to one, when both vectors are uniform: each $f_i = P_i = 1/N$ gives a sum of $1/N$, so the loss reaches its minimum value of $\alpha$ under perfectly balanced routing. The design is deliberate. The token-count fraction $f_i$ is the thing we want to equalize but is non-differentiable (it comes from an argmax over experts), whereas the probability $P_i$ is differentiable; multiplying them lets the gradient flow through $P_i$ and push probability mass away from over-used experts and toward under-used ones, which is why the Switch Transformer paper titles this section "A Differentiable Load Balancing Loss." The authors set the coefficient $\alpha$ to $10^{-2}$, described as large enough to enforce balance yet small enough not to overwhelm the primary cross-entropy objective. ^[2]^[3]
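A small numerical check makes the minimum concrete. The sketch below evaluates the single-term auxiliary loss for top-1 routing in plain Python; in a real framework $P_i$ would be a differentiable tensor, and the function name and toy probabilities here are assumptions of the example.

```python
def load_balancing_loss(router_probs, alpha):
    """Auxiliary loss alpha * N * sum_i f_i * P_i for top-1 routing.

    router_probs[t][i] is the router probability of expert i for token t.
    """
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    # f_i: fraction of tokens whose argmax is expert i (hard, non-differentiable).
    counts = [0] * num_experts
    for row in router_probs:
        counts[max(range(num_experts), key=lambda i: row[i])] += 1
    f = [c / num_tokens for c in counts]
    # P_i: mean router probability of expert i (soft; would carry the gradient).
    P = [sum(row[i] for row in router_probs) / num_tokens for i in range(num_experts)]
    return alpha * num_experts * sum(fi * Pi for fi, Pi in zip(f, P))

# Balanced: each of four tokens prefers a different expert.
balanced = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
# Collapsed: every token prefers expert 0.
collapsed = [[0.7, 0.1, 0.1, 0.1]] * 4

print(load_balancing_loss(balanced, alpha=0.01))   # approximately alpha = 0.01
print(load_balancing_loss(collapsed, alpha=0.01))  # approximately 2.8 * alpha = 0.028
```

In the collapsed case $f = (1, 0, 0, 0)$ and $P = (0.7, 0.1, 0.1, 0.1)$, so the loss is $\alpha \cdot 4 \cdot 0.7 = 2.8\alpha$, and its gradient would act only through $P_0$, lowering the probability of the overloaded expert.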

The total training objective is the language-modeling loss plus this balancing term; the coefficient $\alpha$, already included in the expression above, sets its relative weight. This exposes the central tradeoff of the approach: the auxiliary loss and the main objective can conflict. Pushing the router toward uniform assignment costs some freedom to route each token to its genuinely best expert, and the balancing gradient is, from the language model's point of view, interference. The coefficient $\alpha$ must be tuned to trade these off, and the capacity factor interacts with it, since a larger capacity factor tolerates more residual imbalance and reduces dropping but wastes compute on padding. ^[2]^[3]^[4]

Variants and the z-loss

Beyond the basic balancing term, MoE training commonly adds a second, separate auxiliary loss aimed at numerical stability rather than balance. ST-MoE (Zoph et al., 2022) traced training instabilities in large sparse models to unbounded growth of the router logits, which makes the exponentials in the router softmax sensitive to round-off error. Their router z-loss penalizes large logits directly,

$$L_z = \frac{1}{B} \sum_{\text{batch}} \left( \log \sum_{\text{experts}} \exp(\text{logit}) \right)^2$$

where B is the number of tokens. This is the squared log-sum-exp of the router logits, a smooth penalty that keeps the values entering the softmax small. ST-MoE reports that adding the z-loss with a small weight (about $10^{-3}$) removes the instabilities without hurting quality, and it is now used alongside the balancing loss in many MoE recipes. ^[4]
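The z-loss formula translates directly into code. The following is a minimal standard-library sketch with hypothetical function names and toy logits; a real implementation would operate on tensors and multiply the result by the small weight mentioned above.

```python
import math

def logsumexp(logits):
    """Stable log(sum(exp(logits))) for one token's router logits."""
    m = max(logits)
    return m + math.log(sum(math.exp(z - m) for z in logits))

def router_z_loss(batch_logits):
    """Mean over tokens of the squared log-sum-exp of the router logits."""
    return sum(logsumexp(row) ** 2 for row in batch_logits) / len(batch_logits)

small = [[0.1, -0.2, 0.0, 0.3], [0.0, 0.1, -0.1, 0.2]]
large = [[10.0 * z for z in row] for row in small]  # same ranking, larger magnitude

print(router_z_loss(small))
print(router_z_loss(large))  # larger, because the logits have grown
```

Scaling the logits leaves the ranking of experts unchanged but raises the penalty, which is the intended effect: the term discourages logit growth without favoring any particular expert.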

The table below summarizes the main auxiliary terms.

Loss	Source	Purpose	Quantity penalized
Importance loss	Shazeer et al. 2017 ^[1]	balance	$\mathrm{CV}^2$ of summed gate values per expert
Load loss	Shazeer et al. 2017 ^[1]	balance	$\mathrm{CV}^2$ of smooth per-expert token count
Balancing loss ($f_i P_i$)	GShard 2020 ^[3], Switch 2021 ^[2]	balance	product of token fraction and mean router probability
Router z-loss	ST-MoE 2022 ^[4]	stability	squared log-sum-exp of router logits

Auxiliary-loss-free load balancing (DeepSeek)

The interference between the balancing loss and the main objective motivated an alternative that keeps experts balanced without adding any gradient term. In "Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts" (arXiv, posted August 2024), Lean Wang, Huazuo Gao, Chenggang Zhao, Xu Sun, and Damai Dai of DeepSeek and Peking University introduced a method they call Loss-Free Balancing. The starting observation is that the conventional balancing loss injects gradients that conflict with language modeling and thereby cost some quality. ^[5]

Loss-Free Balancing instead maintains a per-expert bias $b_i$ and adds it to the routing scores only for the top-k selection. A token is routed to the experts with the highest values of $(\text{score}_i + b_i)$, but the gate weight used to combine that expert's output is computed from the original, unbiased score; the bias steers which experts are chosen without distorting the magnitudes that weight their outputs. After each step the biases are nudged by a simple feedback rule, $b_i = b_i + u \cdot \mathrm{sign}(e_i)$, where $e_i$ is the load violation (the gap between the mean load and expert i's observed load) and u is a small update rate. The bias of an under-loaded expert is increased so it becomes more attractive on the next step, and the bias of an over-loaded expert is decreased; the update has no gradient and does not flow into the model weights. The authors track balance with a maximum-violation metric, $\mathrm{MaxVio} = \frac{\max_i \mathrm{Load}_i - \text{mean Load}}{\text{mean Load}}$, and report that Loss-Free Balancing achieves both better load balance and better validation loss than auxiliary-loss training on models up to 3 billion parameters. ^[5]
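One step of this feedback rule can be traced on a toy batch. The sketch below is an illustration under stated assumptions, not the authors' code: the function names and scores are hypothetical, routing is top-1, and the update rate of 0.1 is deliberately exaggerated so that a single update visibly changes the routing.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)

def select_top_k(scores, bias, k):
    """Pick experts by biased score; gate weights would still use the raw scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i] + bias[i], reverse=True)
    return order[:k]

def loads_for(batch_scores, bias, k):
    """Count how many tokens each expert receives under the current biases."""
    loads = [0] * len(bias)
    for scores in batch_scores:
        for i in select_top_k(scores, bias, k):
            loads[i] += 1
    return loads

def update_bias(bias, loads, u):
    """b_i <- b_i + u * sign(mean_load - load_i); no gradient is involved."""
    mean_load = sum(loads) / len(loads)
    return [b + u * sign(mean_load - c) for b, c in zip(bias, loads)]

def max_violation(loads):
    """MaxVio = (max load - mean load) / mean load."""
    mean_load = sum(loads) / len(loads)
    return (max(loads) - mean_load) / mean_load

# Four tokens, four experts; three tokens initially prefer expert 0.
batch = [
    [0.40, 0.32, 0.18, 0.10],
    [0.42, 0.25, 0.23, 0.10],
    [0.45, 0.25, 0.20, 0.10],
    [0.30, 0.40, 0.20, 0.10],
]
bias = [0.0, 0.0, 0.0, 0.0]
before = loads_for(batch, bias, k=1)
bias = update_bias(bias, before, u=0.1)
after = loads_for(batch, bias, k=1)
print(before, max_violation(before))  # [3, 1, 0, 0] 2.0
print(after, max_violation(after))    # [1, 2, 1, 0] 1.0
```

A single update lowers the bias of the overloaded expert 0 and raises the biases of the idle experts 2 and 3, which halves MaxVio on this batch. In practice the update rate is far smaller and balance emerges gradually over many steps.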

This strategy was adopted in DeepSeek-V3 (December 2024). Its technical report sets the bias update speed to 0.001 for the first 14.3 trillion tokens of pretraining and to 0.0 for the final 500 billion. DeepSeek-V3 also keeps a complementary sequence-wise balance loss with a very small weight (0.0001), used only to prevent extreme imbalance within any single sequence rather than as the primary balancing mechanism, and combines the scheme with node-limited routing (each token reaches at most four nodes) to bound communication cost. The report credits the auxiliary-loss-free approach with better performance than balancing through pure auxiliary losses. ^[6]

Relationship to other routing methods

The balancing loss is specific to token-choice routing, in which each token picks its own experts and balance is therefore not guaranteed. Several alternative routing schemes balance load structurally and so need little or no balancing loss.

Expert choice routing (Zhou et al., 2022) inverts the direction of selection: rather than each token choosing top-k experts, each expert chooses the top-k tokens it scores highest. Every expert then receives exactly the same fixed number of tokens, so the layer is balanced by construction with no auxiliary loss and no expert-level overflow; the cost is that a token may be selected by zero, one, or several experts, so per-token compute varies. ^[7]
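The inversion is easy to see in a toy example. In the sketch below each expert keeps a fixed number of its highest-scoring tokens; the function name, the `capacity` parameter, and the scores are assumptions chosen for illustration.

```python
def expert_choice(scores, capacity):
    """scores[t][e] is the token-to-expert affinity.

    Each expert keeps its `capacity` highest-scoring tokens, so every
    expert processes exactly the same number of tokens.
    """
    num_tokens = len(scores)
    num_experts = len(scores[0])
    chosen = []
    for e in range(num_experts):
        ranked = sorted(range(num_tokens), key=lambda t: scores[t][e], reverse=True)
        chosen.append(sorted(ranked[:capacity]))
    return chosen

# Four tokens, two experts, each expert takes two tokens.
scores = [[0.9, 0.8], [0.7, 0.1], [0.2, 0.3], [0.1, 0.2]]
chosen = expert_choice(scores, capacity=2)
experts_per_token = [sum(t in tokens for tokens in chosen) for t in range(len(scores))]
print(chosen)             # [[0, 1], [0, 2]]
print(experts_per_token)  # [2, 1, 1, 0]
```

Both experts receive exactly two tokens, but token 0 is processed by two experts while token 3 is processed by none, which is the variable per-token compute noted above.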

Soft MoE and other fully differentiable variants go further, replacing the discrete top-k assignment with a continuous, softmax-weighted mixture so that every expert is fed a balanced set of weighted slot inputs, again removing the need for a balancing loss but giving up exact sparsity. Within token-choice routing, DeepSeek's auxiliary-loss-free bias method is best seen as a third option, alongside auxiliary-loss balancing and structurally balanced routing: it retains discrete top-k routing and its hard sparsity but replaces the balancing loss with a non-gradient feedback controller. In practice, production sparse-MoE language models now span all three families. ^[3]^[5]^[7]

References

1. Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, Jeff Dean. "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer." arXiv:1701.06538, 2017. https://arxiv.org/abs/1701.06538
2. William Fedus, Barret Zoph, Noam Shazeer. "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity." arXiv:2101.03961, 2021. https://arxiv.org/abs/2101.03961
3. Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, Zhifeng Chen. "GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding." arXiv:2006.16668, 2020. https://arxiv.org/abs/2006.16668
4. Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, William Fedus. "ST-MoE: Designing Stable and Transferable Sparse Expert Models." arXiv:2202.08906, 2022. https://arxiv.org/abs/2202.08906
5. Lean Wang, Huazuo Gao, Chenggang Zhao, Xu Sun, Damai Dai. "Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts." arXiv:2408.15664, 2024. https://arxiv.org/abs/2408.15664
6. DeepSeek-AI. "DeepSeek-V3 Technical Report." arXiv:2412.19437, 2024. https://arxiv.org/abs/2412.19437
7. Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew M. Dai, Zhifeng Chen, Quoc V. Le, James Laudon. "Mixture-of-Experts with Expert Choice Routing." arXiv:2202.09368, 2022. https://arxiv.org/abs/2202.09368
