Soft MoE

Updated Jul 23, 2026

Soft MoE (Soft Mixture of Experts) is a fully differentiable variant of the sparse mixture of experts (MoE) layer. Instead of routing each token to a small set of discrete experts, Soft MoE gives every expert a fixed number of input "slots," and each slot is filled with a learned, softmax-weighted average of all of the input tokens. Because the assignment of tokens to experts is a continuous, weighted mixture rather than a hard, discrete choice, the whole layer is differentiable end to end and avoids the token dropping, load imbalance, and training instability that complicate ordinary sparse MoE. The method was introduced in the paper "From Sparse to Soft Mixtures of Experts" by Joan Puigcerver, Carlos Riquelme, Basil Mustafa, and Neil Houlsby of Google DeepMind, first posted to arXiv on August 2, 2023, and presented at the International Conference on Learning Representations (ICLR) in 2024. ^[1]^[2]

Overview

A sparse MoE replaces the single feed-forward sublayer of a Transformer block with many parallel feed-forward "experts" and a small router network that sends each token to only one or a few of them. This lets the model hold a very large number of parameters while activating only a small fraction of them per token, so capacity grows much faster than compute. The weakness of this design is the router: it makes a discrete, top-k assignment of tokens to experts, an operation that is not differentiable and that tends to send uneven numbers of tokens to different experts. ^[1]^[3]

Soft MoE keeps the central promise of sparse MoE, which is large parameter count at modest per-token cost, but removes the discrete router. Every expert is instead assigned a fixed set of slots, and softmax-based dispatch and combine steps move information between tokens and slots, so there is no argmax, no hard capacity cutoff, and no need for the auxiliary balancing losses that sparse routers require. The authors demonstrated the method primarily on image classification with Vision Transformer (ViT) backbones, where Soft MoE reaches a better accuracy-versus-cost frontier than both dense ViTs and the two leading sparse routers. ^[1]

Background: problems with sparse mixture of experts

The original sparse mixture of experts layer for deep networks routes each token to its top-scoring experts and multiplies each expert's output by the router's gate value. ^[3] Two dominant routing schemes followed. In token choice routing, used by Switch Transformer and GShard, each token selects its top-k experts. ^[4] In expert choice routing, each expert instead selects its top-c tokens. ^[5] Both schemes share a set of recurring difficulties:

Non-differentiable assignment. The top-k selection is a discrete operation. Gradients do not flow through the choice of which token goes to which expert, only through the gate weight applied afterward, which makes the router harder to train.
Load imbalance and token dropping. With token choice, nothing forces experts to receive equal numbers of tokens. To keep tensor shapes fixed, each expert is given a finite capacity, and any tokens routed beyond that capacity are dropped, meaning they skip the layer entirely. Expert choice guarantees a balanced load per expert, but it can leave some tokens selected by no expert, which is a different form of dropping.
Auxiliary losses and tuning. Token choice routers add load-balancing or z-loss terms whose coefficients must be tuned, adding training complexity and a source of instability.
Difficulty scaling the expert count and fine-tuning. As the number of experts grows, imbalance and instability tend to worsen, and sparse MoEs are often sensitive during fine-tuning.
Per-sequence non-determinism. Because tokens are usually routed in batched groups, the experts a given token reaches can depend on which other sequences happen to share its batch, so the model's output for one sequence is not fully determined by that sequence alone. ^[1]
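The first two difficulties can be seen in a small sketch of token choice routing with k = 1. The function name `token_choice_top1`, the first-come, first-served overflow rule, and the toy logits are assumptions chosen for illustration, not taken from any particular implementation.

```python
import numpy as np

def token_choice_top1(logits, capacity):
    """Send each token to its top-1 expert; drop tokens beyond capacity.

    logits: (m, n) router scores for m tokens and n experts.
    Returns (assignment, load); assignment[t] is the expert index for
    token t, or -1 if the token was dropped.
    """
    m, n = logits.shape
    choice = logits.argmax(axis=1)        # discrete pick: no gradient through it
    load = np.zeros(n, dtype=int)
    assignment = np.full(m, -1)
    for t in range(m):                    # assumed first-come, first-served
        e = choice[t]
        if load[e] < capacity:
            assignment[t] = e
            load[e] += 1
    return assignment, load

logits = np.array([[2.0, 0.1],
                   [1.5, 0.3],
                   [1.2, 0.9],
                   [0.2, 1.1]])
assignment, load = token_choice_top1(logits, capacity=2)
print(assignment.tolist())  # [0, 0, -1, 1]: token 2 overflows expert 0 and is dropped
print(load.tolist())        # [2, 1]: the experts receive unequal loads
```

Three of the four tokens prefer expert 0, so with a capacity of two the third token skips the layer, while expert 1 runs below capacity.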

Soft MoE was designed to keep the capacity benefits of these layers while eliminating the discrete assignment that causes most of the trouble.

How Soft MoE works

A Soft MoE layer has n experts, and each expert is given p input slots, for a total of n times p slots. Each expert is an ordinary feed-forward network (a multilayer perceptron). The layer transforms a sequence of m input tokens through three steps: a dispatch step that fills the slots, the experts themselves, and a combine step that reconstructs the output tokens. ^[1]

Dispatch (the soft assignment)

Arrange the m tokens of a single input sequence as the rows of a matrix X (m rows, each of dimension d). A learned parameter matrix Phi (d rows, n times p columns, one column per slot) produces a logit matrix Lambda = X times Phi, whose entry Lambda_ij scores how strongly token i should contribute to slot j. The dispatch weights D are obtained by applying a softmax to each column of Lambda, that is, over the token index i for a fixed slot j, so the weights feeding any one slot form a probability distribution over all m input tokens. Each slot's input is the corresponding convex combination of the token vectors, written compactly as X-tilde = D transpose times X, a matrix with one row per slot. Every slot therefore sees a different weighted average of the whole sequence, and no token is ever discarded. This weighted averaging is the "soft" assignment that replaces the hard router. ^[1]
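A minimal NumPy sketch of the dispatch step follows. The helper names `softmax` and `dispatch`, the random inputs, and the small sizes are illustrative assumptions. The sketch also omits any normalization or scaling of the inputs that a practical implementation might apply.

```python
import numpy as np

def softmax(z, axis):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def dispatch(X, Phi):
    """Fill the slots with softmax-weighted averages of the tokens.

    X:   (m, d) token matrix for one sequence.
    Phi: (d, s) slot parameters, with s = n * p slots in total.
    """
    logits = X @ Phi                # (m, s): score of token i for slot j
    D = softmax(logits, axis=0)     # normalize over tokens, separately per slot
    slots = D.T @ X                 # (s, d): one convex combination per slot
    return logits, D, slots

rng = np.random.default_rng(0)
m, d, n, p = 6, 4, 3, 1             # hypothetical sizes
X = rng.normal(size=(m, d))
Phi = rng.normal(size=(d, n * p))
logits, D, slots = dispatch(X, Phi)
print(D.sum(axis=0))                # every column of D sums to 1
print(slots.shape)                  # (3, 4)
```

Because every entry of D is strictly positive, each token contributes at least a little to every slot, which is why nothing is dropped.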

Experts

Each expert applies its feed-forward transformation to its own p slots, producing one output vector per slot. Because the slot count per expert is fixed in advance, every expert always does exactly the same amount of work, regardless of the input. This is what removes load imbalance: there is no capacity to overflow and nothing to drop. ^[1]
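As a sketch of this step, the code below applies a separate two-layer MLP to each expert's group of slots. The ReLU activation, the absence of biases, the expert-major ordering of slot rows, and the name `apply_experts` are all assumptions made for brevity.

```python
import numpy as np

def apply_experts(slots, W1, W2, p):
    """Run each expert's MLP on its own p slots.

    slots: (n * p, d) slot inputs, assumed grouped so that rows
           e*p ... (e+1)*p - 1 belong to expert e.
    W1:    (n, d, h) first-layer weights, one matrix per expert.
    W2:    (n, h, d) second-layer weights, one matrix per expert.
    """
    n, d, h = W1.shape
    grouped = slots.reshape(n, p, d)                       # (n, p, d)
    hidden = np.einsum("epd,edh->eph", grouped, W1)        # per-expert matmul
    hidden = np.maximum(hidden, 0.0)                       # ReLU (assumed)
    out = np.einsum("eph,ehd->epd", hidden, W2)            # (n, p, d)
    return out.reshape(n * p, d)

rng = np.random.default_rng(0)
n, p, d, h = 3, 2, 4, 8                                    # hypothetical sizes
slots = rng.normal(size=(n * p, d))
W1 = rng.normal(size=(n, d, h))
W2 = rng.normal(size=(n, h, d))
slot_outputs = apply_experts(slots, W1, W2, p)
print(slot_outputs.shape)                                  # (6, 4)
```

The tensor shapes never depend on the input values, so each expert performs the same two matrix multiplications on every call.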

Combine

The slot outputs are mapped back to m output tokens by a combine step. The combine weights C are obtained from the same logits Lambda by applying a softmax to each row, that is, over the slot index for a fixed token, so each output token carries a probability distribution over all n times p slots. The final output sequence is Y = C times the matrix of slot outputs. Each output token is thus a weighted blend of the results from all slots, and gradients flow through both the dispatch and the combine softmaxes, making the entire layer differentiable. ^[1]
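The combine step can be sketched in a few lines. Here the logits and slot outputs are random stand-ins, and the names `softmax` and `combine` are illustrative assumptions.

```python
import numpy as np

def softmax(z, axis):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def combine(logits, slot_outputs):
    """Blend slot outputs back into one vector per token.

    logits:       (m, s) the same token-by-slot logits used for dispatch.
    slot_outputs: (s, d) expert outputs, one row per slot.
    """
    C = softmax(logits, axis=1)     # normalize over slots, separately per token
    return C, C @ slot_outputs      # (m, s) weights and (m, d) output tokens

rng = np.random.default_rng(0)
m, s, d = 5, 3, 4                   # hypothetical sizes
logits = rng.normal(size=(m, s))
slot_outputs = rng.normal(size=(s, d))
C, Y = combine(logits, slot_outputs)
print(C.sum(axis=1))                # every row of C sums to 1
print(Y.shape)                      # (5, 4)
```

Dispatch and combine share one logit matrix and differ only in the softmax axis: columns (tokens) for dispatch, rows (slots) for combine.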

The authors found that the strongest configuration uses a single slot per expert, so the number of slots equals the number of experts. They also typically set the total number of slots close to the input sequence length, which keeps the cost of the layer close to that of a single dense feed-forward layer applied to all tokens; the only overhead beyond a dense layer is the relatively cheap dispatch and combine projections and softmaxes. A useful consequence is that cost is governed by the total number of slots rather than by the number of experts: with the slot count held fixed, adding experts grows the parameter count while throughput stays nearly constant, in contrast to sparse routers whose speed degrades as the expert count grows. Soft MoE is also per-sequence deterministic: the dispatch and combine weights are computed only from the tokens of a single sequence, so the output for that sequence does not depend on what else is in the batch. Within a sequence, every token fractionally activates all of the layer's parameters, and every output token depends on all slots. ^[1]
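The cost argument can be checked with rough arithmetic. The sketch below counts multiply-accumulate operations for the matrix products only, ignoring biases, softmaxes, and any normalization. The sizes (256 tokens, model width 1024, hidden width 4096) and the function names are hypothetical and are not taken from the paper.

```python
def dense_mlp_macs(m, d, h):
    """Two matmuls (d -> h -> d) applied to each of m tokens."""
    return 2 * m * d * h

def soft_moe_macs(m, d, h, n, p):
    """Return (expert MACs, routing MACs) for n experts with p slots each."""
    s = n * p                          # total number of slots
    expert = 2 * s * d * h             # each slot passes through one expert MLP
    routing = 3 * m * s * d            # X @ Phi, D.T @ X, and C @ slot outputs
    return expert, routing

def expert_params(d, h, n):
    """Weights of n two-layer experts, biases ignored."""
    return n * 2 * d * h

m, d, h = 256, 1024, 4096              # hypothetical sizes
for n, p in [(64, 4), (128, 2), (256, 1)]:   # always 256 slots in total
    expert, routing = soft_moe_macs(m, d, h, n, p)
    print(n, p, expert, routing, expert_params(d, h, n))

print(dense_mlp_macs(m, d, h))         # 2147483648, equal to the expert MACs above
```

With the slot count fixed at the sequence length, the expert compute matches the dense layer, the routing overhead in this toy setting is 3m / (2h), or about 9 percent, and only the parameter count changes as experts are added.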

Property	Dense Transformer	Token choice MoE	Expert choice MoE	Soft MoE
Assignment unit	none (all tokens, one MLP)	token picks top-k experts	expert picks top-c tokens	slot reads weighted mix of all tokens
Differentiable assignment	not applicable	no (discrete top-k)	no (discrete top-c)	yes (softmax dispatch and combine)
Token dropping	none	possible (capacity overflow)	possible (unselected tokens)	none
Load-balancing loss	not needed	usually required	not needed	not needed
Suited to causal decoding	yes	yes	no (peeks across tokens)	no (mixes all tokens)

Results

The paper evaluated Soft MoE mainly on large-scale image classification, pretraining on the JFT-4B dataset and measuring upstream precision, ImageNet 10-shot transfer, and full ImageNet fine-tuning, with dense ViTs and the two sparse routers (referred to as Tokens Choice and Experts Choice) as baselines. Across model scales, Soft MoE established a better quality-versus-cost trade-off than all of them. ^[1]

The headline scaling result is that Soft MoE Huge/14, configured with 128 experts across 16 MoE layers, holds more than 40 times as many parameters as a dense ViT-Huge/14 while increasing inference time by only about 2 percent, and it delivers substantially higher quality. ^[1] At smaller scale, Soft MoE Base/16, with roughly 3.7 billion parameters (about 5.5 times the parameter count of ViT-Huge/14), ran approximately 5.7 times faster at inference than ViT-Huge/14 while matching or exceeding its accuracy. The authors also reported that Soft MoE Large/16 surpassed ViT-Huge/14 on upstream, few-shot, and fine-tuning metrics while using close to half the training time. These results position Soft MoE as a Pareto improvement over both dense Transformers and existing sparse MoEs at fixed compute. ^[1]

Limitations: causality

Soft MoE's defining mechanism is also its main constraint. Because each slot is a weighted average over the entire input sequence, and each output token is a weighted average over all slots, every output position depends on every input position, including positions that come later in the sequence. This violates the causal masking that an autoregressive decoder requires, where the representation at a position may depend only on earlier positions. As a result, Soft MoE in its basic form is well suited to non-causal settings such as vision and Transformer encoders, but it is difficult to use directly in the decoder of an autoregressive large language model. The layer also operates on a whole set of tokens at once, which fits batched encoder processing better than incremental, one-token-at-a-time generation. ^[1]
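The dependence on later positions can be demonstrated with a compact end-to-end sketch. It assumes one slot per expert, bias-free ReLU MLP experts, and random weights; the function `soft_moe` is an illustrative reconstruction of the steps described in this article, not the authors' code.

```python
import numpy as np

def softmax(z, axis):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def soft_moe(X, Phi, W1, W2):
    """Soft MoE layer with one slot per expert (assumed configuration)."""
    logits = X @ Phi                                   # (m, n)
    D = softmax(logits, axis=0)                        # dispatch: over tokens
    C = softmax(logits, axis=1)                        # combine: over slots
    slots = D.T @ X                                    # (n, d)
    hidden = np.maximum(np.einsum("nd,ndh->nh", slots, W1), 0.0)
    slot_out = np.einsum("nh,nhd->nd", hidden, W2)     # (n, d)
    return C @ slot_out                                # (m, d)

rng = np.random.default_rng(0)
m, d, h, n = 5, 4, 8, 3                                # hypothetical sizes
X = rng.normal(size=(m, d))
Phi = rng.normal(size=(d, n))
W1 = rng.normal(size=(n, d, h))
W2 = rng.normal(size=(n, h, d))

Y = soft_moe(X, Phi, W1, W2)
X_future = X.copy()
X_future[-1] += 1.0                                    # perturb only the last token
Y_future = soft_moe(X_future, Phi, W1, W2)

# A causally masked layer would leave the first output row unchanged.
print(np.allclose(Y[0], Y_future[0]))
```

Changing the last token alters every slot through the dispatch weights, so the output at the first position changes as well, which is exactly what a causal decoder must rule out.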

Several follow-up methods tackle this limitation by merging in a space other than the token sequence. SMEAR (Muqeeth, Liu, and Raffel, 2023) keeps a single effective expert by softly averaging the experts' parameters according to the router weights, avoiding discrete routing without mixing tokens, though it was demonstrated mainly on classification fine-tuning. ^[7] Lory (Zhong et al., 2024) extends this parameter-merging idea to autoregressive language-model pretraining using a causal segment routing scheme that preserves the left-to-right dependency, although it was reported to trail standard top-k MoE in quality. ^[8] ReMoE (Wang et al., 2024) takes a different route, replacing top-k routing with ReLU routing so that the router becomes continuous and fully differentiable while remaining usable in decoders. ^[9] These efforts show that differentiable MoE remains an active research direction, with Soft MoE as a foundational reference point. ^[1]

Relationship to other MoE routing

Soft MoE is best understood as the soft, continuous endpoint of a spectrum of MoE routing strategies. Token choice routing, the approach of Switch Transformer and GShard, lets each token pick its experts; it is simple but prone to imbalance and token dropping. ^[4] Expert choice routing inverts the decision so that each expert picks its tokens, which fixes per-expert load but can leave some tokens unselected and requires looking across many tokens at once, a property that, as with Soft MoE, makes it awkward for causal decoding. ^[5] Both are discrete. ^[1]

Soft MoE replaces the hard pick on either side with weighted combinations: instead of choosing which tokens an expert processes, it constructs slot inputs as soft mixtures of all tokens, and instead of choosing which expert a token's output comes from, it blends all slot outputs. In this sense it generalizes expert choice, where each "slot" would correspond to one selected token, into a fully soft assignment. It sits alongside parameter-merging methods such as SMEAR in the broader family of fully differentiable MoE layers, with the key distinction that Soft MoE merges in the input-token space while SMEAR and Lory merge in the expert-parameter space. The same Google research group had earlier built the sparse Vision MoE (V-MoE), and Soft MoE can be read as their answer to the routing problems that V-MoE and similar sparse models exposed. ^[1]^[6]
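To make the contrast concrete, the sketch below shows merging in expert-parameter space, the idea attributed above to SMEAR and Lory. It is a simplified illustration, not either method's actual code: each expert is assumed to be a single linear map, and the router probabilities come from random logits.

```python
import numpy as np

def merge_expert_parameters(router_probs, expert_weights):
    """Weighted average of expert weight matrices.

    router_probs:   (n,) non-negative weights that sum to 1.
    expert_weights: (n, d_in, d_out) one weight matrix per expert.
    """
    return np.einsum("e,eio->io", router_probs, expert_weights)

rng = np.random.default_rng(0)
n, d_in, d_out = 4, 3, 2                       # hypothetical sizes
expert_weights = rng.normal(size=(n, d_in, d_out))
router_logits = rng.normal(size=n)
router_probs = np.exp(router_logits) / np.exp(router_logits).sum()

x = rng.normal(size=d_in)
merged = merge_expert_parameters(router_probs, expert_weights)
y_merged = x @ merged                          # one pass through one merged expert

# For purely linear experts, merging parameters equals averaging expert outputs.
y_ensemble = sum(g * (x @ W) for g, W in zip(router_probs, expert_weights))
print(np.allclose(y_merged, y_ensemble))       # True
```

The input x is never mixed with other tokens here; only the weights are blended. Soft MoE does the opposite, keeping each expert's weights separate and blending the tokens that feed them.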

References

1. Puigcerver, Joan; Riquelme, Carlos; Mustafa, Basil; Houlsby, Neil. "From Sparse to Soft Mixtures of Experts." arXiv:2308.00951, August 2, 2023 (last revised May 27, 2024). https://arxiv.org/abs/2308.00951
2. Puigcerver, Joan; Riquelme, Carlos; Mustafa, Basil; Houlsby, Neil. "From Sparse to Soft Mixtures of Experts." Proceedings of the International Conference on Learning Representations (ICLR), 2024. https://proceedings.iclr.cc/paper_files/paper/2024/hash/79fea214543ba263952ac3f4e5452b14-Abstract-Conference.html
3. Shazeer, Noam; Mirhoseini, Azalia; Maziarz, Krzysztof; Davis, Andy; Le, Quoc; Hinton, Geoffrey; Dean, Jeff. "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer." arXiv:1701.06538, 2017. https://arxiv.org/abs/1701.06538
4. Fedus, William; Zoph, Barret; Shazeer, Noam. "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity." arXiv:2101.03961, 2021. https://arxiv.org/abs/2101.03961
5. Zhou, Yanqi; Lei, Tao; Liu, Hanxiao; Du, Nan; Huang, Yanping; Zhao, Vincent; Dai, Andrew; Chen, Zhifeng; Le, Quoc; Laudon, James. "Mixture-of-Experts with Expert Choice Routing." arXiv:2202.09368, 2022. https://arxiv.org/abs/2202.09368
6. Riquelme, Carlos; Puigcerver, Joan; Mustafa, Basil; Neumann, Maxim; Jenatton, Rodolphe; Susano Pinto, Andre; Keysers, Daniel; Houlsby, Neil. "Scaling Vision with Sparse Mixture of Experts." arXiv:2106.05974, 2021. https://arxiv.org/abs/2106.05974
7. Muqeeth, Mohammed; Liu, Haokun; Raffel, Colin. "Soft Merging of Experts with Adaptive Routing." arXiv:2306.03745, 2023. https://arxiv.org/abs/2306.03745
8. Zhong, Zexuan; Xia, Mengzhou; Chen, Danqi; Lewis, Mike. "Lory: Fully Differentiable Mixture-of-Experts for Autoregressive Language Model Pre-training." arXiv:2405.03133, 2024. https://arxiv.org/abs/2405.03133
9. Wang, Ziteng; Chen, Jianfei; Zhu, Jun. "ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing." arXiv:2412.14711, 2024 (ICLR 2025). https://arxiv.org/abs/2412.14711
10. Dosovitskiy, Alexey; Beyer, Lucas; Kolesnikov, Alexander; Weissenborn, Dirk; Zhai, Xiaohua; et al. "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale." arXiv:2010.11929, 2020. https://arxiv.org/abs/2010.11929
