MoE load-balancing loss
Last reviewed
Jun 8, 2026
Sources
7 citations
Review status
Source-backed
Revision
v1 · 1,977 words
The MoE load balancing loss is an auxiliary training objective used in sparse mixture of experts (MoE) neural networks to keep work spread evenly across the expert sub-networks. In a sparse MoE layer, a small learned router sends each token to only a few of many parallel experts, so that the model can hold a large number of parameters while spending compute on only a fraction of them per token. Left to itself, the router tends to converge on a small set of favored experts, a degenerate state often called expert collapse or router collapse. The load balancing loss adds a penalty that is smallest when tokens are routed uniformly across experts, nudging the router toward balanced assignment while the main objective continues to train. [1][2][3]
The technique is well established. A pair of balancing losses was introduced with the sparsely-gated MoE layer of Shazeer et al. in 2017, simplified into the now-standard single auxiliary term by GShard (2020) and the Switch Transformer (2021), and stabilized further by a separate router z-loss in ST-MoE (2022). A later line of work removes the auxiliary gradient entirely: DeepSeek's auxiliary-loss-free load balancing (2024) keeps experts balanced by adjusting a per-expert routing bias instead of by adding a loss term, and was used to train DeepSeek-V3. [1][2][3][4][5]
A sparse MoE layer replaces a single feed-forward block with N expert blocks plus a router. For each token the router produces affinity scores over the experts (typically via a learned projection followed by a softmax), selects the top-k experts (k is often 1 or 2), and combines their outputs weighted by the gate values. Because experts run in parallel on accelerators, the layer can scale parameter count almost independently of per-token compute. [1][3]
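The routing step described above can be sketched in a few lines of dependency-free Python. This is an illustrative sketch, not any particular library's implementation: the function names (`softmax`, `route_top_k`, `moe_forward`) and the toy experts are assumptions, the router logits are passed in directly rather than produced by a learned projection, and the gate values are not renormalized over the selected experts (implementations differ on that point).

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(logits, k):
    """Return [(expert_index, gate_value), ...] for the k highest-probability experts."""
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [(i, probs[i]) for i in top]

def moe_forward(x, router_logits, experts, k=2):
    """Sum the outputs of the selected experts, each weighted by its gate value."""
    out = [0.0] * len(x)
    for i, gate in route_top_k(router_logits, k):
        y = experts[i](x)  # only the selected experts are evaluated
        out = [o + gate * yj for o, yj in zip(out, y)]
    return out

# Hypothetical toy experts: expert i scales its input by (i + 1).
experts = [lambda v, s=i + 1: [s * vj for vj in v] for i in range(4)]
logits = [2.0, 0.5, 1.0, -1.0]          # assumed router output for one token
selected = route_top_k(logits, k=2)      # experts 0 and 2 have the largest logits
output = moe_forward([1.0, -1.0], logits, experts, k=2)
```

Only two of the four experts are evaluated for this token, which is the source of the compute saving: the parameter count grows with the number of experts while per-token work grows with k.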
Nothing in this setup forces tokens to spread evenly. Routing creates a self-reinforcing feedback loop: an expert that receives slightly more tokens early in training gets more gradient signal, improves faster, and is therefore selected even more often, while neglected experts stay under-trained and are chosen less and less. The end state is expert collapse, in which a handful of experts absorb most of the traffic. This is harmful for two reasons. First, it wastes capacity, because idle experts contribute little despite occupying memory and parameters. Second, it unbalances compute: real implementations give every expert a fixed buffer (its capacity) so that tensor shapes stay static, and an overloaded expert overflows its buffer, forcing its surplus tokens to be dropped. [1][2][3]
Token dropping is governed by the expert capacity, defined in the Switch Transformer as
expert capacity = (tokens per batch / number of experts) * capacity factor.
A capacity factor of 1.0 provisions exactly the balanced share; values above 1.0 add slack to tolerate some imbalance at the cost of padding empty buffer slots. Any token routed to an expert that is already full is dropped, meaning it skips the layer and passes through the residual connection unchanged. Both the wasted-capacity and dropped-token failure modes get worse as imbalance grows, which is why balancing is treated as a first-class part of MoE training. [2][3]
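The capacity formula and the resulting token dropping can be illustrated with a small sketch. The helper names are hypothetical, truncating the capacity to an integer is an assumption (implementations may round differently), and tokens are filled in arrival order, which is only one possible priority rule.

```python
def expert_capacity(tokens_per_batch, num_experts, capacity_factor):
    """(tokens per batch / number of experts) * capacity factor, truncated to an int."""
    return int(tokens_per_batch / num_experts * capacity_factor)

def dispatch_with_capacity(assignments, num_experts, capacity):
    """Fill expert buffers in token order; tokens that hit a full buffer are dropped."""
    load = [0] * num_experts
    kept, dropped = [], []
    for token, expert in enumerate(assignments):
        if load[expert] < capacity:
            load[expert] += 1
            kept.append(token)
        else:
            dropped.append(token)  # this token would pass through the residual path only
    return load, kept, dropped

# Top-1 expert per token, skewed toward expert 0 (illustrative data).
assignments = [0, 0, 0, 0, 0, 1, 2, 3]

cap = expert_capacity(len(assignments), num_experts=4, capacity_factor=1.0)   # 2
load, kept, dropped = dispatch_with_capacity(assignments, 4, cap)             # dropped == [2, 3, 4]

cap_slack = expert_capacity(len(assignments), num_experts=4, capacity_factor=1.5)  # 3
_, _, dropped_slack = dispatch_with_capacity(assignments, 4, cap_slack)            # dropped == [3, 4]
```

Raising the capacity factor from 1.0 to 1.5 drops one fewer token here, but the three lightly loaded experts now each carry two empty buffer slots instead of one.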
Shazeer et al. (2017), in "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer," addressed collapse with two soft penalties, each based on the squared coefficient of variation CV^2 (variance divided by squared mean) of a per-expert quantity. The importance loss penalizes unequal use of the gate values,
L_importance = w_importance * CV(Importance(X))^2,
where Importance(X) is, for each expert, the sum of its gate values over the batch X. Balancing importance alone is not enough, because an expert could accumulate high importance from a few tokens with large gates while still receiving an uneven number of tokens, so the authors add a load loss, L_load = w_load * CV(Load(X))^2, built on a smooth differentiable estimate of how many tokens each expert receives. Their router is a noisy top-k gate, G(x) = Softmax(KeepTopK(H(x), k)) with H(x)_i = (x W_g)_i + StandardNormal() * Softplus((x W_noise)_i); the tunable Gaussian noise both encourages exploration and makes the load estimate differentiable. [1]
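A minimal sketch of the importance loss follows. It assumes a population variance (dividing by the number of experts) and a non-zero mean; the original implementation may differ in such details, and the weight 0.1 and the gate matrices are illustrative values only. The load loss is omitted because its smooth estimator depends on the noise model.

```python
def cv_squared(values):
    """Squared coefficient of variation: variance / mean^2 (population variance assumed)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return var / (mean ** 2)

def importance_loss(gates, w_importance):
    """gates[t][i] is the gate value of expert i for token t."""
    num_experts = len(gates[0])
    # Importance(X): per-expert sum of gate values over the batch.
    importance = [sum(row[i] for row in gates) for i in range(num_experts)]
    return w_importance * cv_squared(importance)

balanced = [[0.5, 0.5], [0.5, 0.5]]
skewed = [[0.9, 0.1], [0.8, 0.2]]

loss_balanced = importance_loss(balanced, w_importance=0.1)  # zero: both experts equally important
loss_skewed = importance_loss(skewed, w_importance=0.1)      # positive: expert 0 dominates
```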
GShard (Lepikhin et al., 2020) and the Switch Transformer (Fedus et al., 2021) condensed this into a single term that has become the default. For N experts the auxiliary loss is
loss = alpha * N * sum over i of (f_i * P_i),
where f_i is the fraction of tokens dispatched to expert i (a hard count) and P_i is the average router probability assigned to expert i over the batch (a soft quantity). The loss is designed to be smallest under uniform routing: when each f_i = P_i = 1/N, every product f_i * P_i equals 1/N^2, the sum over the N experts equals 1/N, and the factor N brings the loss to alpha. The design is deliberate. The token-count fraction f_i is the quantity to be equalized but is non-differentiable (it comes from an argmax over experts), whereas the probability P_i is differentiable; multiplying them lets the gradient flow through P_i and push probability mass away from over-used experts and toward under-used ones, which is why the Switch Transformer paper titles this section "A Differentiable Load Balancing Loss." The authors set the coefficient alpha to 10^-2, described as large enough to enforce balance yet small enough not to overwhelm the primary cross-entropy objective. [2][3]
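The auxiliary loss can be computed directly from a batch of router logits. The sketch below assumes top-1 routing and uses plain Python, so it only evaluates the loss value; in an automatic-differentiation framework f_i would be treated as a constant and the gradient would reach the router through P_i. The function names and the example logits are illustrative.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def load_balancing_loss(router_logits, alpha=1e-2):
    """alpha * N * sum_i f_i * P_i, with top-1 routing assumed."""
    num_tokens = len(router_logits)
    num_experts = len(router_logits[0])
    probs = [softmax(row) for row in router_logits]

    # f_i: fraction of tokens whose argmax expert is i (hard count).
    f = [0.0] * num_experts
    for p in probs:
        f[max(range(num_experts), key=lambda i: p[i])] += 1.0 / num_tokens

    # P_i: mean router probability for expert i over the batch (soft quantity).
    P = [sum(p[i] for p in probs) / num_tokens for i in range(num_experts)]

    return alpha * num_experts * sum(fi * Pi for fi, Pi in zip(f, P))

# Balanced batch: each of four tokens prefers a different expert.
balanced = [[1.0 if i == t else 0.0 for i in range(4)] for t in range(4)]
# Collapsed batch: every token prefers expert 0.
collapsed = [[1.0, 0.0, 0.0, 0.0] for _ in range(4)]

loss_balanced = load_balancing_loss(balanced)    # equals alpha up to rounding
loss_collapsed = load_balancing_loss(collapsed)  # larger than alpha
```

In the balanced batch both f and P are uniform, so the loss sits at alpha; in the collapsed batch all of f and most of P fall on one expert, and the loss rises above alpha.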
The total training objective is the language-modeling loss plus this balancing term, whose weight is set by alpha. This exposes the central tradeoff of the approach: the auxiliary loss and the main objective can conflict. Pushing the router toward uniform assignment costs some freedom to route each token to its genuinely best expert, and the balancing gradient is, from the language model's point of view, interference. The coefficient alpha must be tuned to trade these off, and the capacity factor interacts with it, since a larger capacity factor tolerates more residual imbalance and reduces dropping but wastes compute on padding. [2][3][4]
Beyond the basic balancing term, MoE training commonly adds a second, separate auxiliary loss aimed at numerical stability rather than balance. ST-MoE (Zoph et al., 2022) traced training instabilities in large sparse models to unbounded growth of the router logits, which can cause round-off blow-ups in the large matrix multiplications around the gate. Their router z-loss penalizes large logits directly,
L_z = (1/B) * sum over the batch of (log sum over experts of exp(logit))^2,
where B is the number of tokens. This is the squared log-sum-exp of the router logits, a smooth penalty that keeps the values entering the softmax small. ST-MoE reports that adding the z-loss with a small weight (about 10^-3) removes the instabilities without hurting quality, and it is now used alongside the balancing loss in many MoE recipes. [4]
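As a sketch, the z-loss formula above translates almost line for line into code. The function names and example logits are illustrative, and the weighting by a small coefficient is left to the caller.

```python
import math

def logsumexp(logits):
    """log(sum_i exp(logit_i)), computed stably by subtracting the max."""
    m = max(logits)
    return m + math.log(sum(math.exp(z - m) for z in logits))

def router_z_loss(router_logits):
    """(1/B) * sum over tokens of (log sum over experts of exp(logit))^2."""
    return sum(logsumexp(row) ** 2 for row in router_logits) / len(router_logits)

small = router_z_loss([[0.0, 0.0]])     # (log 2)^2
large = router_z_loss([[10.0, 10.0]])   # (10 + log 2)^2, much larger
```

Both example rows produce the same softmax output (a uniform distribution), yet the penalty differs sharply, which shows that the z-loss constrains the scale of the logits rather than the routing decision.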
The table below summarizes the main auxiliary terms.
| Loss | Source | Purpose | Quantity penalized |
|---|---|---|---|
| Importance loss | Shazeer et al. 2017 [1] | balance | CV^2 of summed gate values per expert |
| Load loss | Shazeer et al. 2017 [1] | balance | CV^2 of smooth per-expert token count |
| Balancing loss (f_i * P_i) | GShard 2020 [3], Switch 2021 [2] | balance | product of token fraction and mean router probability |
| Router z-loss | ST-MoE 2022 [4] | stability | squared log-sum-exp of router logits |
The interference between the balancing loss and the main objective motivated an alternative that keeps experts balanced without adding any gradient term. In "Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts" (arXiv, posted August 2024), Lean Wang, Huazuo Gao, Chenggang Zhao, Xu Sun, and Damai Dai of DeepSeek and Peking University introduced a method they call Loss-Free Balancing. The starting observation is that the conventional balancing loss injects gradients that conflict with language modeling and thereby cost some quality. [5]
Loss-Free Balancing instead maintains a per-expert bias b_i and adds it to the routing scores only for the top-k selection. A token is routed to the experts with the highest values of (score_i + b_i), but the gate weight used to combine that expert's output is computed from the original, unbiased score; the bias steers which experts are chosen without distorting the magnitudes that weight their outputs. After each step the biases are nudged by a simple feedback rule, b_i = b_i + u * sign(e_i), where e_i is the load violation (the gap between the mean load and expert i's observed load) and u is a small update rate. The bias of an under-loaded expert is increased so it becomes more attractive on the next step, and the bias of an over-loaded expert is decreased; the update has no gradient and does not flow into the model weights. The authors track balance with a maximum-violation metric, MaxVio = (max_i Load_i - mean Load) / mean Load, and report that Loss-Free Balancing achieves both better load balance and better validation loss than auxiliary-loss training on models up to 3 billion parameters. [5]
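The selection rule, the bias update, and the MaxVio metric described above can be sketched as follows. This is a simplified illustration rather than the authors' code: the function names are hypothetical, the loads and scores are made-up numbers, and any normalization of the gate values over the selected experts is omitted.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)

def select_top_k(scores, biases, k):
    """Choose experts by biased score; return (expert, unbiased score) pairs."""
    order = sorted(range(len(scores)), key=lambda i: scores[i] + biases[i], reverse=True)
    return [(i, scores[i]) for i in order[:k]]  # gate uses the original score

def update_biases(biases, loads, u):
    """b_i <- b_i + u * sign(e_i), where e_i = mean load - load of expert i."""
    mean_load = sum(loads) / len(loads)
    return [b + u * sign(mean_load - c) for b, c in zip(biases, loads)]

def max_violation(loads):
    """MaxVio = (max_i Load_i - mean Load) / mean Load."""
    mean_load = sum(loads) / len(loads)
    return (max(loads) - mean_load) / mean_load

scores = [0.50, 0.45, 0.05]
unbiased_pick = select_top_k(scores, [0.0, 0.0, 0.0], k=1)   # expert 0 wins
biased_pick = select_top_k(scores, [-0.1, 0.0, 0.0], k=1)    # expert 1 wins, gate stays 0.45

loads = [6, 1, 1]                                   # expert 0 is overloaded
new_biases = update_biases([0.0, 0.0, 0.0], loads, u=0.001)
violation = max_violation(loads)                    # 1.25
```

The bias changes which expert is selected but not the gate value attached to it, and the update is a fixed-size step in the direction of the load gap rather than a gradient step.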
This strategy was adopted in DeepSeek-V3 (December 2024). Its technical report sets the bias update speed to 0.001 for the first 14.3 trillion tokens of pretraining and to 0.0 for the final 500 billion. DeepSeek-V3 also keeps a complementary sequence-wise balance loss with a very small weight (0.0001), used only to prevent extreme imbalance within any single sequence rather than as the primary balancing mechanism, and combines the scheme with node-limited routing (each token reaches at most four nodes) to bound communication cost. The report credits the auxiliary-loss-free approach with better performance than balancing through pure auxiliary losses. [6]
The balancing loss is specific to token-choice routing, in which each token picks its own experts and balance is therefore not guaranteed. Several alternative routing schemes balance load structurally and so need little or no balancing loss.
Expert choice routing (Zhou et al., 2022) inverts the direction of selection: rather than each token choosing top-k experts, each expert chooses the top-k tokens it scores highest. Every expert then receives exactly the same fixed number of tokens, so the layer is balanced by construction with no auxiliary loss and no expert-level overflow; the cost is that a token may be selected by zero, one, or several experts, so per-token compute varies. [7]
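A small sketch makes the inversion concrete. The function name and the affinity values are illustrative assumptions, and the per-expert token budget is passed in directly rather than derived from a capacity setting.

```python
def expert_choice(scores, tokens_per_expert):
    """scores[t][i] is the affinity of token t for expert i.

    Each expert selects its highest-scoring tokens, so every expert
    receives exactly tokens_per_expert tokens.
    """
    num_tokens, num_experts = len(scores), len(scores[0])
    chosen = []
    for i in range(num_experts):
        ranked = sorted(range(num_tokens), key=lambda t: scores[t][i], reverse=True)
        chosen.append(sorted(ranked[:tokens_per_expert]))
    return chosen

# Four tokens, two experts (made-up affinities).
scores = [
    [0.90, 0.80],  # token 0: attractive to both experts
    [0.70, 0.10],  # token 1
    [0.20, 0.60],  # token 2
    [0.10, 0.05],  # token 3: attractive to neither
]
chosen = expert_choice(scores, tokens_per_expert=2)   # [[0, 1], [0, 2]]

# Number of experts that process each token.
experts_per_token = [sum(t in picks for picks in chosen) for t in range(len(scores))]
```

Both experts receive exactly two tokens, but token 0 is processed twice and token 3 not at all, which is the variable per-token compute noted above.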
Soft MoE and other fully differentiable variants go further, replacing the discrete top-k assignment with a continuous, softmax-weighted mixture so that every expert is fed a balanced set of weighted slot inputs, again removing the need for a balancing loss but giving up exact sparsity. Within token-choice routing, DeepSeek's auxiliary-loss-free bias method is best seen as a third option that retains discrete top-k routing and its hard sparsity but replaces the balancing loss with a non-gradient feedback controller. In practice, production sparse-MoE language models now span all three families. [3][5][7]