Soft MoE
Last reviewed
Jun 8, 2026
Sources
10 citations
Review status
Source-backed
Revision
v1 · 2,227 words
Soft MoE (Soft Mixture of Experts) is a fully differentiable variant of the sparse mixture of experts (MoE) layer. Instead of routing each token to a small set of discrete experts, Soft MoE gives every expert a fixed number of input "slots," and each slot is filled with a learned, softmax-weighted average of all of the input tokens. Because the assignment of tokens to experts is a continuous, weighted mixture rather than a hard, discrete choice, the whole layer is differentiable end to end and avoids the token dropping, load imbalance, and training instability that complicate ordinary sparse MoE. The method was introduced in the paper "From Sparse to Soft Mixtures of Experts" by Joan Puigcerver, Carlos Riquelme, Basil Mustafa, and Neil Houlsby of Google DeepMind, first posted to arXiv on August 2, 2023, and presented at the International Conference on Learning Representations (ICLR) in 2024. [1][2]
A sparse MoE replaces the single feed-forward sublayer of a Transformer block with many parallel feed-forward "experts" and a small router network that sends each token to only one or a few of them. This lets the model hold a very large number of parameters while activating only a small fraction of them per token, so capacity grows much faster than compute. The weakness of this design is the router: it makes a discrete, top-k assignment of tokens to experts, an operation that is not differentiable and that tends to send uneven numbers of tokens to different experts. [1][3]
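The discrete router described above can be sketched in a few lines of NumPy. This is a minimal illustration under assumptions, not the code of any particular system: the sizes, the random `router_weights`, and the choice of k = 1 are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, n, k = 8, 4, 4, 1      # tokens, model width, experts, experts per token

tokens = rng.normal(size=(m, d))
router_weights = rng.normal(size=(d, n))   # hypothetical router parameters

# Router scores, turned into gate values with a softmax over experts.
logits = tokens @ router_weights                        # shape (m, n)
gates = np.exp(logits - logits.max(axis=1, keepdims=True))
gates /= gates.sum(axis=1, keepdims=True)

# Hard top-k choice: the selected indices are piecewise constant in the
# router parameters, so no gradient flows through the selection itself.
chosen = np.argsort(-gates, axis=1)[:, :k]              # shape (m, k)

# Number of tokens each expert received; nothing forces this to be even.
load = np.bincount(chosen.ravel(), minlength=n)
print("tokens per expert:", load)
```

Printing `load` for different seeds shows how easily the per-expert counts diverge when nothing but the router scores decides the assignment.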
Soft MoE keeps the central promise of sparse MoE, which is large parameter count at modest per-token cost, but removes the discrete router. Every expert is instead assigned a fixed set of slots, and softmax-based dispatch and combine steps move information between tokens and slots, so there is no argmax, no hard capacity cutoff, and no need for the auxiliary balancing losses that sparse routers require. The authors demonstrated the method primarily on image classification with Vision Transformer (ViT) backbones, where Soft MoE forms a strictly better accuracy-versus-cost frontier than dense ViTs and than the two leading sparse routers. [1]
The original sparse mixture of experts layer for deep networks routes each token to its top-scoring experts and multiplies each expert's output by the router's gate value. [3] Two dominant routing schemes followed. In token choice routing, used by Switch Transformer and GShard, each token selects its top-k experts. [4] In expert choice routing, each expert instead selects its top-c tokens. [5] Both schemes share a set of recurring difficulties that follow from the discrete assignment: the routing decision is not differentiable, experts can receive very uneven numbers of tokens, tokens can be dropped or left unselected, and training can become unstable.
Soft MoE was designed to keep the capacity benefits of these layers while eliminating the discrete assignment that causes most of the trouble.
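A toy example makes the failure modes of the two discrete schemes concrete. The score matrix and the capacity of two tokens per expert below are invented for illustration; real routers learn their scores and usually set capacity from a capacity factor.

```python
import numpy as np

# Hypothetical router scores: 6 tokens (rows) by 2 experts (columns).
scores = np.array([
    [0.90, 0.10],
    [0.80, 0.20],
    [0.70, 0.30],
    [0.60, 0.40],
    [0.20, 0.80],
    [0.55, 0.45],
])
m, n = scores.shape
capacity = 2   # assumed number of tokens each expert can process

# Token choice (top-1): each token goes to its best expert, and tokens
# that arrive after the expert's buffer is full are dropped.
assigned = scores.argmax(axis=1)
used = [0] * n
dropped = []
for token, expert in enumerate(assigned.tolist()):
    if used[expert] < capacity:
        used[expert] += 1
    else:
        dropped.append(token)

# Expert choice: each expert takes its `capacity` highest-scoring tokens,
# so load is balanced but some tokens may be picked by no expert.
picks = np.argsort(-scores, axis=0)[:capacity]          # shape (capacity, n)
unselected = sorted(set(range(m)) - set(picks.ravel().tolist()))

print("token choice drops:", dropped)
print("expert choice leaves unselected:", unselected)
```

In this example five of the six tokens prefer the first expert, so token choice overflows its buffer, while expert choice keeps both experts equally busy at the price of skipping two tokens.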
A Soft MoE layer has n experts, and each expert is given p input slots, for a total of n times p slots. Each expert is an ordinary feed-forward network (a multilayer perceptron). The layer transforms a sequence of m input tokens through three steps: a dispatch step that fills the slots, the experts themselves, and a combine step that reconstructs the output tokens. [1]
Arrange the m tokens of a single input as the rows of a matrix X (m rows, each of dimension d). A learned matrix of slot parameters Phi, with d rows and one column per slot, produces a logit matrix Lambda = X times Phi, whose entry for token i and slot j scores how strongly token i should contribute to slot j. The dispatch weights D are obtained by applying a softmax over the token index for each slot, so the weights feeding any one slot form a probability distribution over all m input tokens. Each slot's input is the corresponding convex combination of the token vectors, so the matrix of slot inputs equals D transpose times X. Every slot therefore sees a different weighted average of the whole sequence, and no token is ever discarded. This weighted averaging is the "soft" assignment that replaces the hard router. [1]
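The dispatch step can be written out directly from this description. The following is a minimal NumPy sketch with random stand-in values; the sizes and the helper name `softmax` are illustrative choices, not taken from the paper's code.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
m, d, n, p = 6, 4, 3, 2          # tokens, width, experts, slots per expert

X = rng.normal(size=(m, d))      # one input sequence, tokens as rows
Phi = rng.normal(size=(d, n * p))  # slot parameters, one column per slot

logits = X @ Phi                 # Lambda, shape (m, n*p)
D = softmax(logits, axis=0)      # softmax over tokens: each column sums to 1
slot_inputs = D.T @ X            # shape (n*p, d), one row per slot
```

Each row of `slot_inputs` is a convex combination of the rows of `X`, so every coordinate of a slot lies between the smallest and largest value that coordinate takes across the tokens.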
Each expert applies its feed-forward transformation to its own p slots, producing one output vector per slot. Because the slot count per expert is fixed in advance, every expert always does exactly the same amount of work, regardless of the input. This is what removes load imbalance: there is no capacity to overflow and nothing to drop. [1]
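As a sketch of the expert step, the slots can be grouped per expert and pushed through a separate two-layer MLP for each group. The expert-major slot layout, the bias-free layers, the tanh-approximated GELU, and the random `slot_inputs` standing in for the dispatch output are all assumptions made for brevity.

```python
import numpy as np

def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(0)
n, p, d, h = 3, 2, 4, 8          # experts, slots per expert, width, hidden width

# Stand-in for the dispatch output, grouped as (expert, slot, feature).
slot_inputs = rng.normal(size=(n, p, d))

# One independent set of MLP weights per expert.
W1 = rng.normal(size=(n, d, h))
W2 = rng.normal(size=(n, h, d))

# Expert e only ever sees slot_inputs[e]; the shapes are fixed, so the
# work per expert is the same for every input.
hidden = gelu(np.einsum("npd,ndh->nph", slot_inputs, W1))
slot_outputs = np.einsum("nph,nhd->npd", hidden, W2)     # shape (n, p, d)
```

Because the batch of slots per expert has a static shape, the computation maps onto a single batched matrix multiplication with no padding or overflow handling.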
The slot outputs are mapped back to m output tokens by a combine step. The combine weights C are obtained from the same logits Lambda by applying a softmax over the slot index for each token, so each output token reads a probability distribution over all n times p slots. The final output sequence is C times the matrix of slot outputs. Each output token is thus a weighted blend of the results from all slots, and gradients flow through both the dispatch and the combine softmaxes, making the entire layer differentiable. [1]
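The combine step mirrors the dispatch step with the softmax taken along the other axis. In this minimal NumPy sketch, `logits` and `slot_outputs` are random stand-ins for X times Phi and for the experts' results.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
m, d, n, p = 6, 4, 3, 2          # tokens, width, experts, slots per expert

logits = rng.normal(size=(m, n * p))        # stand-in for Lambda = X @ Phi
slot_outputs = rng.normal(size=(n * p, d))  # stand-in for the expert outputs

C = softmax(logits, axis=1)      # softmax over slots: each row sums to 1
Y = C @ slot_outputs             # shape (m, d), one output row per token
```

Both `D` in the dispatch step and `C` here are smooth functions of the same logits, which is why gradients reach the slot parameters through both paths.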
The authors found that the strongest configuration uses a single slot per expert, so the number of slots equals the number of experts. They also typically set the total number of slots close to the input sequence length, which keeps the cost of the layer close to that of a single dense feed-forward layer applied to all tokens; the only overhead beyond a dense layer is the relatively cheap dispatch and combine projections and softmaxes. A useful consequence is that the cost is governed by the total number of slots rather than by the number of experts, so the expert count can grow at a fixed slot budget while throughput stays nearly constant, in contrast to sparse routers whose speed degrades as the expert count grows. Soft MoE is also per-sequence deterministic: the slots are filled from the tokens of one input at a time, every token fractionally activates all of the layer's parameters, and every output token depends on all slots. [1]
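A back-of-the-envelope FLOP count shows why the layer stays close to a dense MLP. The formulas below are a rough sketch that counts one multiply-add as two FLOPs and ignores softmaxes, biases, and activations; the sizes are hypothetical and chosen only to resemble a large ViT.

```python
def dense_mlp_flops(m, d, h):
    """Two linear maps (d -> h -> d) applied to m tokens."""
    return 4 * m * d * h

def soft_moe_flops(m, d, h, n, p):
    """Expert MLPs on n*p slots plus the three routing matmuls."""
    slots = n * p
    experts = 4 * slots * d * h     # same MLP shape, applied per slot
    routing = 6 * m * slots * d     # logits, dispatch mix, combine mix
    return experts + routing

m, d, h = 256, 1280, 5120           # hypothetical tokens, width, hidden width

dense = dense_mlp_flops(m, d, h)
soft = soft_moe_flops(m, d, h, n=m, p=1)     # one slot per expert, slots = tokens
overhead = (soft - dense) / dense

# Halving the experts while doubling slots per expert leaves the cost unchanged.
same_budget = soft_moe_flops(m, d, h, n=m // 2, p=2)
print(f"relative overhead over dense MLP: {overhead:.3f}")
```

In this estimate only the product n times p enters the cost, so under these assumptions the expert count can be traded against slots per expert without changing the FLOP budget.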
| Property | Dense Transformer | Token choice MoE | Expert choice MoE | Soft MoE |
|---|---|---|---|---|
| Assignment unit | none (all tokens, one MLP) | token picks top-k experts | expert picks top-c tokens | slot reads weighted mix of all tokens |
| Differentiable assignment | not applicable | no (discrete top-k) | no (discrete top-c) | yes (softmax dispatch and combine) |
| Token dropping | none | possible (capacity overflow) | possible (unselected tokens) | none |
| Load-balancing loss | not needed | usually required | not needed | not needed |
| Suited to causal decoding | yes | yes | no (peeks across tokens) | no (mixes all tokens) |
The paper evaluated Soft MoE mainly on large-scale image classification, pretraining on the JFT-4B dataset and measuring upstream precision, ImageNet 10-shot transfer, and full ImageNet fine-tuning, with dense ViTs and the two sparse routers (referred to as Tokens Choice and Experts Choice) as baselines. Across model scales, Soft MoE established a better quality-versus-cost trade-off than all of them. [1]
The headline scaling result is that Soft MoE Huge/14, configured with 128 experts across 16 MoE layers, holds more than 40 times as many parameters as a dense ViT-Huge/14 while increasing inference time by only about 2 percent, and it delivers substantially higher quality. [1] At smaller scale, Soft MoE Base/16, with roughly 3.7 billion parameters (about 5.5 times the parameter count of ViT-Huge/14), ran approximately 5.7 times faster at inference than ViT-Huge/14 while matching or exceeding its accuracy. The authors also reported that Soft MoE Large/16 surpassed ViT-Huge/14 on upstream, few-shot, and fine-tuning metrics while using close to half the training time. These results position Soft MoE as a Pareto improvement over both dense Transformers and existing sparse MoEs at fixed compute. [1]
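The "more than 40 times" figure can be sanity-checked with a rough parameter count. The width, hidden width, and depth used below are the commonly cited ViT-Huge dimensions and are an assumption here, as is one slot per expert; embeddings and the classification head are ignored, so the result is an estimate rather than the paper's reported number.

```python
# Assumed ViT-Huge dimensions (not stated in this article).
d, h, depth = 1280, 5120, 32
# From the text: 128 experts in 16 of the layers.
n_experts, moe_layers = 128, 16

mlp = 2 * d * h + h + d            # two linear maps with biases
attention = 4 * d * d + 4 * d      # query, key, value, output projections
norms = 4 * d                      # two layer norms, scale and bias each
dense_total = depth * (attention + mlp + norms)

# Each MoE layer swaps one MLP for n_experts MLPs and adds slot parameters.
slot_params = d * n_experts        # assumes one slot per expert
extra_per_layer = (n_experts - 1) * mlp + slot_params
soft_total = dense_total + moe_layers * extra_per_layer

ratio = soft_total / dense_total
print(f"dense: {dense_total / 1e9:.2f}B, soft: {soft_total / 1e9:.2f}B, ratio: {ratio:.1f}")
```

Under these assumptions the ratio lands in the low 40s, which is consistent with the claim in the text; almost all of the added parameters sit in the expert MLPs.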
Soft MoE's defining mechanism is also its main constraint. Because each slot is a weighted average over the entire input sequence, and each output token is a weighted average over all slots, every output position depends on every input position, including positions that come later in the sequence. This violates the causal masking that an autoregressive decoder requires, where the representation at a position may depend only on earlier positions. As a result, Soft MoE in its basic form is well suited to non-causal settings such as vision and Transformer encoders, but it is difficult to use directly in the decoder of an autoregressive large language model. The layer also operates on a whole set of tokens at once, which fits batched encoder processing better than incremental, one-token-at-a-time generation. [1]
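The causality problem is easy to reproduce numerically. The sketch below builds a compact Soft MoE layer (random weights, bias-free experts, tanh standing in for the activation, all of which are simplifications) and checks whether the output at the first position reacts to a change in the last token.

```python
import numpy as np

def softmax(x, axis):
    shifted = x - x.max(axis=axis, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=axis, keepdims=True)

def soft_moe(X, Phi, W1, W2):
    """Dispatch, per-expert MLP, combine; slots are laid out expert-major."""
    n, d, _ = W1.shape
    logits = X @ Phi
    D = softmax(logits, axis=0)                 # over tokens, per slot
    C = softmax(logits, axis=1)                 # over slots, per token
    slots = (D.T @ X).reshape(n, -1, d)
    hidden = np.tanh(np.einsum("npd,ndh->nph", slots, W1))
    outputs = np.einsum("nph,nhd->npd", hidden, W2).reshape(-1, d)
    return C @ outputs

rng = np.random.default_rng(0)
m, d, h, n, p = 5, 4, 8, 3, 1
X = rng.normal(size=(m, d))
Phi = rng.normal(size=(d, n * p))
W1 = rng.normal(size=(n, d, h))
W2 = rng.normal(size=(n, h, d))

Y = soft_moe(X, Phi, W1, W2)

# Perturb only the LAST token and recompute.
X_future = X.copy()
X_future[-1] += 1.0
Y_future = soft_moe(X_future, Phi, W1, W2)

# A causal layer would leave position 0 untouched; this one does not.
first_token_shift = np.abs(Y_future[0] - Y[0]).max()
print("change at position 0:", first_token_shift)
```

The combine weights for position 0 are unchanged by the perturbation, since they depend only on the first token's logits; the leak comes from the slot contents, which average over every position.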
Several follow-up methods tackle this limitation by merging in a space other than the token sequence. SMEAR (Muqeeth, Liu, and Raffel, 2023) keeps a single effective expert by softly averaging the experts' parameters according to the router weights, avoiding discrete routing without mixing tokens, though it was demonstrated mainly on classification fine-tuning. [7] Lory (Zhong et al., 2024) extends this parameter-merging idea to autoregressive language-model pretraining using a causal segment routing scheme that preserves the left-to-right dependency, although it was reported to trail standard top-k MoE in quality. [8] ReMoE (Wang et al., 2024) takes a different route, replacing top-k routing with ReLU routing so that the router becomes continuous and fully differentiable while remaining usable in decoders. [9] These efforts illustrate that the broad goal of a differentiable MoE remains active, with Soft MoE as a foundational reference point. [1]
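The parameter-merging idea can be illustrated with a deliberately simplified sketch. This is not the SMEAR or Lory implementation: the experts are single linear maps, the router acts on a mean-pooled input (an assumption that suits classification, not causal decoding), and all names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, n = 5, 4, 3                      # tokens, width, experts

X = rng.normal(size=(m, d))
router = rng.normal(size=(d, n))       # hypothetical router parameters
expert_W = rng.normal(size=(n, d, d))  # one linear expert per index

# One soft routing decision per input, from the pooled representation.
pooled = X.mean(axis=0)
scores = pooled @ router
probs = np.exp(scores - scores.max())
probs /= probs.sum()

# Merge in parameter space: a single weight matrix that is the
# probability-weighted average of the experts' weights.
merged_W = np.einsum("n,nij->ij", probs, expert_W)
Y = X @ merged_W                       # every token uses the merged expert
```

For linear experts the merged output equals the probability-weighted sum of the individual expert outputs; with nonlinear experts the two differ, and only the merged expert has to be evaluated. Lory's causal segment routing, which makes this usable for left-to-right generation, is not shown here.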
Soft MoE is best understood as the soft, continuous endpoint of a spectrum of MoE routing strategies. Token choice routing, the approach of Switch Transformer and GShard, lets each token pick its experts; it is simple but prone to imbalance and token dropping. [4] Expert choice routing inverts the decision so that each expert picks its tokens, which fixes per-expert load but can leave some tokens unselected and requires looking across many tokens at once, a property it shares with Soft MoE and one that makes both awkward for causal decoding. [5] Both token choice and expert choice are discrete. [1]
Soft MoE replaces the hard pick on either side with weighted combinations: instead of choosing which tokens an expert processes, it constructs slot inputs as soft mixtures of all tokens, and instead of choosing which expert a token's output comes from, it blends all slot outputs. In this sense it generalizes expert choice, where each "slot" would correspond to one selected token, into a fully soft assignment. It sits alongside parameter-merging methods such as SMEAR in the broader family of fully differentiable MoE layers, with the key distinction that Soft MoE merges in the input-token space while SMEAR and Lory merge in the expert-parameter space. The same Google research group had earlier built the sparse Vision MoE (V-MoE), and Soft MoE can be read as their answer to the routing problems that V-MoE and similar sparse models exposed. [1][6]
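The link to expert choice can be seen by sharpening the dispatch softmax. In the sketch below the logits are hand-picked and the `sharpness` multiplier is a hypothetical knob introduced only for this illustration; it is not a parameter of Soft MoE as described in the text.

```python
import numpy as np

def softmax(x, axis):
    shifted = x - x.max(axis=axis, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))            # 4 tokens of width 3

# Hand-picked dispatch logits: 4 tokens (rows) by 2 slots (columns).
logits = np.array([
    [2.0, 0.0],
    [0.5, 1.0],
    [0.0, 3.0],
    [1.0, 0.5],
])

soft_slots = softmax(logits, axis=0).T @ X          # ordinary soft mixtures

sharpness = 20.0                                    # hypothetical multiplier
sharp_D = softmax(sharpness * logits, axis=0)
sharp_slots = sharp_D.T @ X                         # nearly one token per slot

picked = sharp_D.argmax(axis=0)                     # dominant token per slot
print("dominant token per slot:", picked)
```

As the dispatch weights approach one-hot columns, each slot effectively holds a single selected token, which is the hard selection that expert choice performs; the ordinary softmax keeps every token in the mix instead.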