Mixture of Experts (MoE)
Last reviewed
May 18, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v8 · 12,376 words
A Mixture of Experts (MoE) is a machine learning architecture that divides a problem into subtasks, each handled by a specialized sub-network called an "expert." A learned gating network (also called a router) determines which expert or experts should process each input. In modern deep learning, MoE most commonly appears as a sparse variant inside transformer models, where only a subset of experts is activated for any given input token. This allows models to scale to very large parameter counts while keeping per-token computation manageable.
MoE architectures have become central to the design of many state-of-the-art large language models, including Mixtral, DBRX, Grok-1, DeepSeek-V3, DeepSeek V3.1, DeepSeek V3.2, Llama 4, Qwen 3, Kimi K2, GLM-4.5, GLM-4.6, MiniMax-Text-01, Hunyuan-Large, gpt-oss, Gemini 1.5, and (reportedly) GPT-4.[1][2][3][4][5][6][7] They offer a practical path to scaling model capacity without a proportional increase in training or inference cost. By 2026, the leading frontier models in nearly every category were sparse mixtures, marking one of the largest architectural shifts since the original transformer paper.
Imagine you have a really hard homework assignment that covers math, reading, science, and art. Instead of asking one friend who is okay at everything, you ask four different friends, each one the best at one subject. A "traffic director" looks at each question and sends it to whichever friend knows the answer best. That traffic director is the gating network, and each friend is an expert. The smart part is that you only bother one or two friends per question, so you get great answers without making everyone work on everything.
Now imagine the homework book is huge and there are 256 friends instead of four. You still only ask two of them per question, so the answers come fast. But you still need a giant table for all 256 friends to sit at, which is why these models need a lot of memory even though they are quick to run.
The MoE concept was introduced by Robert A. Jacobs, Michael Jordan, Steven J. Nowlan, and Geoffrey Hinton in their 1991 paper "Adaptive Mixtures of Local Experts," published in Neural Computation (volume 3, issue 1, pages 79 to 87). Jacobs and Jordan were affiliated with MIT's Department of Brain and Cognitive Sciences; Nowlan and Hinton were at the University of Toronto's Department of Computer Science. The paper proposed a supervised learning procedure for systems composed of many separate sub-networks, each learning to handle a subset of the training cases. The authors framed the approach two ways: as a modular version of a multilayer supervised network, and as an associative version of competitive learning.
The original system consisted of several specialist networks (experts) and a gating network that learned to assign inputs to the appropriate expert. The authors demonstrated the approach on a vowel discrimination task, training up to eight experts to recognize phonemes from six Japanese speakers. In the final trained model, only three of the eight experts were meaningfully active, showing that the system naturally learned to specialize and effectively pruned unused capacity. The 1991 formulation was a dense MoE: every expert ran on every input, and the gating network produced a soft weighting over their outputs.
Michael Jordan and Robert Jacobs extended the framework in 1994 with "Hierarchical Mixtures of Experts and the EM Algorithm," published in Neural Computation (volume 6, issue 2, pages 181 to 214). This version arranged experts in a tree structure with multiple levels of gating, allowing for hierarchical decomposition of the input space. The paper also applied the Expectation-Maximization (EM) algorithm as an alternative to gradient descent for training MoE models, framing learning as a maximum likelihood estimation problem with hidden mixture component variables.
For roughly two decades after the original paper, MoE remained mostly an academic concept. Interest revived around 2013 when Yoshua Bengio and collaborators began exploring conditional computation, the idea that different parts of a neural network could be activated dynamically depending on the input. Bengio, Léonard, and Courville published "Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation" in 2013, providing tools for learning discrete routing decisions through gradient estimators.
That same year, David Eigen, Marc'Aurelio Ranzato, and Ilya Sutskever published "Learning Factored Representations in a Deep Mixture of Experts" (arXiv:1312.4314), which stacked multiple MoE layers and demonstrated on a jittered MNIST dataset that the network learned to factor different aspects of the data (location and class) at different layers. Davis and Arel, also in 2013, contributed parallel work on conditional computation. Bengio, Bacon, Pineau, and Precup followed in 2015 with "Conditional Computation in Neural Networks for Faster Models" (arXiv:1511.06297), formalizing the goal of decoupling parameter count from inference cost.
These papers laid conceptual groundwork for the integration of MoE into modern architectures but did not produce production-scale systems.
The turning point came with Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean at Google in their 2017 paper "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer" (ICLR 2017, arXiv:1701.06538). They introduced an MoE layer with up to thousands of feed-forward experts and a trainable gating network that selected a sparse combination of experts per input. The approach was applied between stacked LSTM layers, producing a model with 137 billion parameters that achieved state-of-the-art results on language modeling and machine translation benchmarks at a fraction of the computational cost of dense alternatives. Crucially, the paper introduced noisy top-k gating, an auxiliary load-balancing loss, and a system-level treatment of how to train sparse experts at scale on multiple devices. This paper established the template for modern sparse MoE.
From 2020 onward, MoE was integrated into transformer architectures at increasing scale. The progression from 2020 through 2025 is summarized below.
| Year | Model or paper | Organization | Total params | Active params | Experts | Top-k | Key contribution |
|---|---|---|---|---|---|---|---|
| 2020 | GShard | Google | 600B+ | n/a | 2,048 | 2 | First MoE Transformer beyond 600B; multilingual MT; 2,048 TPU v3 |
| 2021 | Switch Transformer | Google | 1.6T | ~26B | 2,048 | 1 | Top-1 routing; bfloat16 training of trillion-parameter sparse models |
| 2021 | V-MoE | Google Brain | 15B (vision) | n/a | up to 32 per layer | 2 | First sparse MoE vision transformer; 90.35% ImageNet |
| 2022 | GLaM | Google | 1.2T | ~97B | 64 per layer | 2 | One-third of GPT-3 training energy; 29-task NLP gains |
| 2022 | ST-MoE | Google | 269B | 32B | 32-128 | 2 | Router z-loss; first sparse model SOTA on transfer tasks |
| 2022 | Expert Choice | Google | n/a | 8B | 64 | variable | Reversed routing; experts pick tokens; perfect load balance |
| 2022 | DeepSpeed-MoE | Microsoft | n/a | n/a | n/a | n/a | 4.5x faster, 9x cheaper inference vs. quality-equivalent dense |
| 2022 | MegaBlocks | Stanford / Databricks | n/a | n/a | n/a | n/a | Block-sparse kernels; "dropless" MoE |
| 2023 | Mixtral 8x7B | Mistral AI | 46.7B | 12.9B | 8 | 2 | First widely-used open-weights MoE; matched Llama 2 70B |
| 2024 | DBRX | Databricks | 132B | 36B | 16 | 4 | Fine-grained MoE; 65x more expert combinations than 8-choose-2 |
| 2024 | Grok-1 | xAI | 314B | ~78B | 8 | 2 | Largest open-weights model at release; Apache 2.0 |
| 2024 | Mixtral 8x22B | Mistral AI | 141B | 39B | 8 | 2 | 64K context; native multilingual; Apache 2.0 |
| 2024 | Jamba | AI21 Labs | 52B | 12B | 16 | 2 | Hybrid Transformer-Mamba-MoE; 256K context |
| 2024 | DeepSeekMoE | DeepSeek | 16B | 2.8B | 64 (fine) + 2 shared | 6 | Fine-grained segmentation plus shared experts |
| 2024 | DeepSeek-V2 | DeepSeek | 236B | 21B | 160 + 2 shared | 6 | MLA + DeepSeekMoE; 128K context; 5.76x throughput vs. V1 |
| 2024 | Gemini 1.5 Pro | Google DeepMind | undisclosed | undisclosed | undisclosed | undisclosed | First production multimodal MoE; 1M+ token context |
| 2024 | Snowflake Arctic | Snowflake | 480B | 17B | 128 + 1 dense | 2 | Hybrid dense + residual MoE; enterprise focus |
| 2024 | Qwen1.5-MoE | Alibaba / Qwen | 14.3B | 2.7B | 60 + 4 shared | 4 | Upcycled from dense; 75% of training cost |
| 2024 | Yuan 2.0-M32 | Inspur / IEIT-Yuan | 40B | 3.7B | 32 | 2 | Attention Router replacing classical gate; +3.8% accuracy [8] |
| 2024 | Skywork-MoE | Skywork AI | 146B | 22B | 16 | 2 | Gating logit normalization; adaptive aux loss; upcycled from Skywork-13B [9] |
| 2024 | Phi-3.5-MoE | Microsoft | ~42B (16x3.8B) | 6.6B | 16 | 2 | GRIN (gradient-informed) MoE training; 128K context [10] |
| 2024 | Aria | Rhymes AI | 25.3B | 3.5-3.9B | undisclosed | undisclosed | First open multimodal-native MoE; image, video, code, text [11] |
| 2024 | Hunyuan-Large | Tencent | 389B | 52B | 16 + 1 shared | 1 | 256K context; expert-specific LR; KV cache compression [12] |
| 2025 | DeepSeek-V3 | DeepSeek | 671B | 37B | 256 + 1 shared | 8 | Auxiliary-loss-free balancing; FP8 training; 2.788M H800 hours |
| 2025 | MiniMax-Text-01 | MiniMax | 456B | 45.9B | 32 | 2 | Lightning + softmax hybrid attention; 1M training context, 4M inference [13] |
| 2025 | Llama 4 Scout | Meta | 109B | 17B | 16 | 1 | Native multimodality; 10M token context |
| 2025 | Llama 4 Maverick | Meta | 400B | 17B | 128 | 1 | 128 experts; alternating dense and MoE layers |
| 2025 | Llama 4 Behemoth | Meta | ~2T | 288B | 16 | 1 | Frontier teacher model (training as of 2025) |
| 2025 | Qwen3-30B-A3B | Alibaba / Qwen 3 | 30.5B | 3.3B | 128 | 8 | Compact MoE; no shared experts; 32K native context [14] |
| 2025 | Qwen3-235B-A22B | Alibaba / Qwen 3 | 235B | 22B | 128 | 8 | Global-batch load balancing; no shared experts |
| 2025 | Qwen3-Next 80B-A3B | Alibaba / Qwen 3 | 80B | 3B | 512 | 10 | Hybrid Gated DeltaNet + attention; ultra-sparse MoE [15] |
| 2025 | Ling-Plus | Ant Group | 290B | 28.8B | undisclosed | undisclosed | Trained on domestic Chinese GPUs; 64K context [16] |
| 2025 | GLM-4.5 | Zhipu AI | 355B | 32B | undisclosed | undisclosed | Hybrid thinking/non-thinking modes; agentic [17] |
| 2025 | GLM-4.5-Air | Zhipu AI | 106B | 12B | undisclosed | undisclosed | Lightweight sibling of GLM-4.5 [17] |
| 2025 | GLM-4.6 | Zhipu AI | 355B | 32B | undisclosed | undisclosed | ~15% fewer tokens to complete tasks vs. GLM-4.5 [18] |
| 2025 | gpt-oss-120b | OpenAI | 117B | ~5.1B | 128 | 4 | First OpenAI open-weights since GPT-2; MXFP4; 36 layers [19] |
| 2025 | gpt-oss-20b | OpenAI | 21B | ~3.6B | 32 | 4 | Runs in 16 GB VRAM via 4-bit MXFP4 [19] |
| 2025 | DeepSeek V3.1 | DeepSeek | 671B | 37B | 256 + 1 shared | 8 | Hybrid thinking model (same architecture as V3) [20] |
| 2025 | DeepSeek V3.2-Exp | DeepSeek | 671B | 37B | 256 + 1 shared | 8 | DeepSeek Sparse Attention (DSA); near-linear O(kL) long context [21] |
| 2025 | Kimi K2 | Moonshot AI | 1T | 32B | 384 | 8 | Trained with Muon optimizer; agentic focus; 128K context |
| 2025 | Kimi K2 Thinking | Moonshot AI | 1T | 32B | 384 | 8 | INT4-native; 200-300 sequential tool calls; 256K context [22] |
| 2025 | Mistral Large 3 | Mistral AI | 675B | 41B | undisclosed | undisclosed | Mistral's first frontier-class MoE |
A standard MoE layer has two main parts.
Expert networks. A set of N independent sub-networks, typically feed-forward networks (FFNs) with a SwiGLU or GeLU non-linearity. Each expert has the same architecture but learns different parameters, allowing it to specialize on different types of inputs. Each expert in a transformer FFN typically has the form Expert(x) = W_2 * activation(W_1 * x), where W_1 projects up to a wider hidden dimension and W_2 projects back.
Gating network (router). A small network that takes the input and produces a probability distribution over the experts. Formally, for an input x, the gating network computes:
G(x) = Softmax(x * W_g)
where W_g is a learned weight matrix of shape (hidden_dim, N). The output of the MoE layer is the weighted sum of expert outputs:
y = sum_i G(x)_i * E_i(x)
where E_i(x) is the output of expert i. In sparse MoE, most components of G(x) are zero by construction.
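The two formulas above can be sketched in a few lines of pure Python. This is a minimal illustration of the dense form only, assuming toy experts written as plain functions; the helper names `softmax`, `matvec`, and `moe_forward` and all numeric values are hypothetical and not taken from any library.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(x, w):
    """Compute x * W for a vector x (length d) and a matrix W of shape (d, n)."""
    n = len(w[0])
    return [sum(x[j] * w[j][i] for j in range(len(x))) for i in range(n)]

def moe_forward(x, w_g, experts):
    """Dense MoE: y = sum_i G(x)_i * E_i(x), with G(x) = Softmax(x * W_g)."""
    gates = softmax(matvec(x, w_g))
    outputs = [expert(x) for expert in experts]
    dim = len(outputs[0])
    y = [sum(g * out[k] for g, out in zip(gates, outputs)) for k in range(dim)]
    return y, gates

# Toy setup (illustrative values only): 2-dim input, 3 experts.
experts = [
    lambda x: [2.0 * v for v in x],   # expert 0 doubles the input
    lambda x: [-v for v in x],        # expert 1 negates it
    lambda x: [v + 1.0 for v in x],   # expert 2 shifts it by one
]
w_g = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # all-zero router gives uniform gates
y, gates = moe_forward([1.0, 2.0], w_g, experts)
```

With an all-zero router every expert gets weight 1/3, so the output is the plain average of the three expert outputs. A sparse layer would zero most entries of `gates` and skip the corresponding expert calls.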
In transformer-based models, MoE layers typically replace the feed-forward network (FFN) that follows each multi-head attention layer. Since the FFN accounts for a large share of a transformer's parameters (roughly 90% in models like PaLM-540B, and a similar fraction in Llama-style architectures), replacing even a subset of FFN layers with MoE layers can dramatically increase total parameter count without proportionally increasing computation.
Common placement strategies include:
The first and last few layers are often kept dense even in MoE models, on the theory that early layers process generic features and final layers form predictions where stable pathways are useful.
Two strategies exist for producing an MoE model: training from scratch with sparse routing from step zero, or upcycling an existing dense checkpoint into an MoE by replicating its FFN weights into multiple experts and continuing training. Upcycling, popularized by Qwen1.5-MoE and several Mixtral variants in the community, can reach competitive quality at roughly 25 to 50% of the from-scratch training compute, though it tends to produce experts that initially behave very similarly until specialization develops over many tokens of continued training.
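The upcycling step can be pictured as a weight copy followed by a fresh router. The sketch below is an assumption-laden illustration, not any project's actual procedure: the dictionary layout of `dense_ffn`, the function name `upcycle_ffn`, and the 0.02 router initialization scale are all hypothetical.

```python
import copy
import random

def upcycle_ffn(dense_ffn, num_experts, hidden_dim, seed=0):
    """Replicate one dense FFN into `num_experts` identical experts.

    The router has no dense counterpart, so it is initialized from scratch
    (small Gaussian values here, an illustrative choice).
    """
    rng = random.Random(seed)
    experts = [copy.deepcopy(dense_ffn) for _ in range(num_experts)]
    router = [[rng.gauss(0.0, 0.02) for _ in range(num_experts)]
              for _ in range(hidden_dim)]
    return experts, router

# Toy dense FFN with two weight matrices (values are placeholders).
dense_ffn = {"w1": [[0.1, 0.2], [0.3, 0.4]], "w2": [[0.5, 0.6], [0.7, 0.8]]}
experts, router = upcycle_ffn(dense_ffn, num_experts=4, hidden_dim=2)

# Continued training updates each copy independently; simulate one update.
experts[0]["w1"][0][0] = 9.9
```

Because every expert starts from the same weights, the experts only diverge through the different tokens the router sends them, which is why specialization takes many tokens to develop.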
The gating mechanism is the most studied component of MoE design, and it is where most of the qualitative differences between MoE systems live. Several approaches have been developed.
The simplest form computes a softmax over a linear projection of the input:
G(x) = Softmax(x * W_g)
This is a dense gating approach where all experts receive some weight. It works for small numbers of experts and is mathematically equivalent to the original 1991 formulation, but does not scale efficiently to hundreds or thousands of experts because every expert has to run.
Introduced by Shazeer et al. (2017), this is the foundation for most modern MoE routers. The process has three steps.
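First, tunable Gaussian noise is added to each router logit:
H(x)_i = (x * W_g)_i + StandardNormal() * Softplus((x * W_noise)_i)
Second, only the k largest components of H(x) are kept and the rest are set to negative infinity. Third, a softmax over the result yields the gate values, so every unselected expert receives a weight of exactly zero.
The noise helps prevent the router from always selecting the same experts and encourages different experts to be tried during training. After training stabilizes, many production systems disable noise at inference for determinism.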
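Noisy top-k gating as formulated by Shazeer et al. adds noise to the router logits, keeps the k largest, and applies a softmax over the survivors. The sketch below follows that recipe for a single token, assuming the two projections `(x * W_g)` and `(x * W_noise)` are already computed; the function names and the `train` flag are hypothetical conveniences.

```python
import math
import random

def softplus(v):
    """Softplus(v) = log(1 + exp(v)), written to avoid overflow for large v."""
    return max(v, 0.0) + math.log1p(math.exp(-abs(v)))

def noisy_top_k_gating(clean_logits, noise_logits, k, rng=None, train=True):
    """Return sparse gate weights for one token.

    clean_logits[i] stands for (x * W_g)_i and noise_logits[i] for
    (x * W_noise)_i. With train=False the noise term is skipped.
    """
    rng = rng or random.Random()
    h = []
    for clean, raw_noise in zip(clean_logits, noise_logits):
        noise = rng.gauss(0.0, 1.0) * softplus(raw_noise) if train else 0.0
        h.append(clean + noise)
    # Keep the k largest noisy logits; every other expert gets weight zero.
    top = sorted(range(len(h)), key=lambda i: h[i], reverse=True)[:k]
    m = max(h[i] for i in top)
    exps = {i: math.exp(h[i] - m) for i in top}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(h))]

# Deterministic call (noise disabled): experts 0 and 2 have the largest logits.
gates = noisy_top_k_gating([2.0, 0.5, 1.0, -1.0], [0.0] * 4, k=2, train=False)

# Training-mode call with a seeded generator: still exactly k non-zero gates.
noisy = noisy_top_k_gating([2.0, 0.5, 1.0, -1.0], [0.0] * 4, k=2,
                           rng=random.Random(0))
```

Restricting the softmax to the selected entries is equivalent to setting the unselected logits to negative infinity before a full softmax.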
The Switch Transformer (Fedus, Zoph, and Shazeer, 2022) simplified routing by setting k = 1, sending each token to a single expert. The authors showed that this preserves model quality while offering three advantages.
Llama 4 returned to top-1 routing in 2025 with both Scout and Maverick, citing the same efficiency arguments. In top-1 routing the gating weight for the chosen expert is sometimes still applied as a multiplicative scalar on the expert output, which keeps the gating network differentiable.
Mixtral, DBRX (k = 4), Snowflake Arctic (k = 2), and DeepSeek-V3 (k = 8 over routed experts plus a shared expert) use top-k for k > 1. Higher k means each token sees more experts and is generally easier to balance, but communication and compute costs grow roughly linearly with k.
Zhou et al. (2022) at Google proposed reversing the routing direction. Instead of tokens selecting their top-k experts, each expert selects its top-k tokens from the batch (NeurIPS 2022, arXiv:2202.09368). This guarantees perfect load balancing by construction, since every expert processes exactly the same number of tokens. The approach achieved over 2x training speedup compared to top-1 and top-2 gating in an 8-billion-active-parameter model with 64 experts.
A trade-off of expert choice routing is that some tokens may be processed by many experts (receiving more computation) while others may be processed by none, requiring careful handling through residual connections. Because the assignment is computed across the whole batch, expert choice is best suited to training and high-throughput batch inference; for streaming, single-token-at-a-time decoding it is harder to apply.
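The reversed selection and its uneven per-token coverage can be seen in a small sketch. It assumes a precomputed token-to-expert score matrix and a fixed per-expert capacity; the function name `expert_choice_routing` and the scores are hypothetical.

```python
def expert_choice_routing(scores, capacity):
    """Each expert picks its `capacity` highest-scoring tokens.

    scores[t][e] is the affinity between token t and expert e.
    Returns a dict mapping expert index -> sorted list of chosen token indices.
    """
    num_tokens = len(scores)
    num_experts = len(scores[0])
    assignment = {}
    for e in range(num_experts):
        ranked = sorted(range(num_tokens), key=lambda t: scores[t][e], reverse=True)
        assignment[e] = sorted(ranked[:capacity])
    return assignment

# Four tokens, two experts (illustrative scores).
scores = [
    [0.9, 0.1],
    [0.8, 0.7],
    [0.2, 0.6],
    [0.1, 0.3],
]
assignment = expert_choice_routing(scores, capacity=2)

# How many experts process each token: coverage is uneven by design.
experts_per_token = [
    sum(t in tokens for tokens in assignment.values())
    for t in range(len(scores))
]
```

Both experts end up with exactly two tokens, but token 1 is processed twice and token 3 not at all, so token 3 would reach the next layer only through the residual connection.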
Several alternative routing methods have been explored.
| Strategy | Description | Advantage |
|---|---|---|
| Hash routing | Deterministic assignment based on token hash | No learned parameters; zero routing overhead |
| Random routing | Tokens assigned to random experts | Baseline comparison; surprisingly competitive in some settings |
| Linear assignment | Global optimization of token-expert matching | Optimal assignment but computationally expensive |
| Reinforcement learning | Router trained with RL signals | Can optimize for downstream objectives |
| BASE layers | Balanced assignment via linear programming | Guaranteed balance with top-1 selection |
| Soft MoE | Each input is a weighted combination of all expert slots | Differentiable; useful in vision (Soft MoE, Puigcerver et al., 2023) |
| Threshold routing | Tokens routed only when a confidence threshold is met | Variable compute per token; saves FLOPs on easy tokens |
| Auxiliary-loss-free | Bias terms updated in place to balance load | No interference gradients; used in DeepSeek-V3 |
The distinction between sparse and dense MoE is fundamental to understanding modern implementations.
In a dense MoE, every expert processes every input, and their outputs are combined using the full gating weights. This is mathematically equivalent to the original 1991 formulation. Dense MoE does not save computation, since all experts run on every input, but it can still benefit from specialization through the gating weights. Soft MoE is a recent variant where every input slot interacts with every expert through learned mixing weights, used primarily in vision.
In a sparse MoE, only a small subset of experts (typically 1, 2, 4, or 8 out of 8 to 384+) is activated per input token. This is the dominant form in modern LLMs because it decouples model capacity (total parameters) from computational cost (active parameters per token). A model with 671 billion total parameters such as DeepSeek-V3 might activate only 37 billion per token; Kimi K2 activates 32 billion out of 1 trillion.
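The gap between capacity and per-token cost is easy to quantify from the figures above. The sketch below is simple arithmetic; the 2-bytes-per-parameter figure is an assumption (16-bit weights) used only to illustrate why memory tracks total rather than active parameters.

```python
def active_fraction(active_params, total_params):
    """Share of the model's parameters used for a single token."""
    return active_params / total_params

def weight_memory_gb(total_params, bytes_per_param=2):
    """Approximate weight storage in GB (decimal), assuming 16-bit weights."""
    return total_params * bytes_per_param / 1e9

deepseek_v3 = active_fraction(37e9, 671e9)      # about 5.5% of parameters per token
kimi_k2 = active_fraction(32e9, 1e12)           # 3.2% of parameters per token
deepseek_v3_weights_gb = weight_memory_gb(671e9)  # all experts must still be stored
```

Per-token compute scales with the small fraction, while storage scales with the full parameter count.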
Key trade-offs between the two approaches:
| Property | Dense MoE | Sparse MoE |
|---|---|---|
| Computation per token | Proportional to total parameters | Proportional to active parameters only |
| Memory requirement | Scales with total parameters, as does compute | Must load all parameters despite sparse activation |
| Expert specialization | Soft (weighted combination) | Hard (only selected experts participate) |
| Load balancing | Not an issue | Requires explicit balancing mechanisms |
| Backward pass | Smooth gradients | Non-differentiable top-k requires straight-through estimators or surrogate losses |
| Scaling potential | Limited by compute | Can scale to trillions of parameters |
| Suitability for vision | Common (Soft MoE) | Common (V-MoE) |
| Suitability for LLMs | Rare in production | Dominant in 2024 to 2026 |
Load balancing is one of the most significant practical challenges in training sparse MoE models. Without intervention, routers tend to converge toward sending most tokens to a few "popular" experts while ignoring others, a failure mode called routing collapse or expert collapse.
Routing collapse creates a self-reinforcing cycle: popular experts receive more training signal, which makes them better, which causes the router to favor them even more. Meanwhile, ignored experts receive little to no gradient updates and remain undertrained. This defeats the purpose of having multiple experts. Empirically, models that suffer routing collapse converge to behave like dense models with a fraction of their advertised capacity.
The most common solution is an auxiliary (or load-balancing) loss added to the training objective. The Switch Transformer formulation uses:
L_aux = alpha * N * sum_i(f_i * P_i)
where f_i is the fraction of tokens dispatched to expert i, P_i is the fraction of the router's probability allocated to expert i, and alpha is a hyperparameter controlling the strength of the balancing signal. This loss is minimized when all experts receive equal token allocations.
The hyperparameter alpha requires careful tuning. If set too high, the auxiliary loss dominates the training signal and forces artificial uniformity, degrading model quality. If set too low, it fails to prevent collapse. In practice, values between 0.001 and 0.01 are typical for production training.
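A small numeric sketch of the Switch-style auxiliary loss makes the balancing pressure visible. It assumes top-1 dispatch and takes the router's softmax probabilities as input; the function name `switch_aux_loss` and the example batches are hypothetical.

```python
def switch_aux_loss(router_probs, alpha=0.01):
    """L_aux = alpha * N * sum_i(f_i * P_i) for top-1 routing.

    router_probs[t][i] is the router probability of expert i for token t.
    f_i is the fraction of tokens dispatched to expert i (argmax),
    P_i is the mean router probability assigned to expert i.
    """
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    counts = [0] * num_experts
    for probs in router_probs:
        counts[max(range(num_experts), key=lambda i: probs[i])] += 1
    f = [c / num_tokens for c in counts]
    p = [sum(probs[i] for probs in router_probs) / num_tokens
         for i in range(num_experts)]
    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, p))

# Two experts, four tokens: an evenly split batch versus a collapsed one.
balanced = switch_aux_loss([[0.9, 0.1], [0.1, 0.9], [0.9, 0.1], [0.1, 0.9]])
collapsed = switch_aux_loss([[0.9, 0.1]] * 4)
```

The evenly split batch yields a loss equal to alpha, while the collapsed batch, where every token goes to expert 0, is penalized more heavily.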
Introduced in the ST-MoE paper (Zoph et al., 2022, arXiv:2202.08906), the router z-loss penalizes large logits entering the gating network:
L_z = (1/B) * sum_b (log sum_i exp((x_b * W_g)_i))^2
Large logits create sharp probability distributions that are numerically unstable (especially in lower-precision training such as bfloat16 and FP8) and tend to cause routing collapse. By keeping logits small, the z-loss stabilizes training without hurting model quality. The ST-MoE paper identified router logit growth as the primary cause of training instabilities in large-scale MoE models, and z-loss has since been adopted in essentially every production MoE training framework.
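The z-loss is the mean squared log-sum-exp of the router logits over the batch. The sketch below computes it for precomputed logits; the function name `router_z_loss` and the sample values are hypothetical, and any weighting coefficient applied to the loss in practice is omitted.

```python
import math

def router_z_loss(logits):
    """Mean over tokens of (log sum_i exp(logit_i))^2.

    logits[b][i] stands for (x_b * W_g)_i. The log-sum-exp is computed
    with the usual max-subtraction trick for numerical stability.
    """
    total = 0.0
    for row in logits:
        m = max(row)
        lse = m + math.log(sum(math.exp(v - m) for v in row))
        total += lse ** 2
    return total / len(logits)

small = router_z_loss([[0.0, 0.0], [0.0, 0.0]])     # logits near zero
large = router_z_loss([[10.0, 0.0], [0.0, 10.0]])   # one dominant logit per token
```

The penalty grows roughly quadratically with the largest logit, which is what pushes the router toward small, well-conditioned values.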
DeepSeek-V3 adopted an alternative approach that balances load without relying on an auxiliary loss (DeepSeek-AI, "Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts," arXiv:2408.15664). Instead, a bias term b_i is added to each expert's gating logit before the top-k selection:
score_i = (x * W_g)_i + b_i
This bias is adjusted dynamically during training: when an expert is underutilized, its bias is increased, making it more likely to be selected; when overutilized, the bias is decreased. Critically, the bias is not part of the gating weight that gets multiplied into the expert output; it only affects the discrete top-k selection. This approach avoids the interference gradients that auxiliary losses introduce and has been credited with raising the upper bound of MoE model quality. DeepSeek-V3 reports keeping a balanced load throughout its full pre-training without dropping any tokens.
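The mechanism described here has two separable parts: the bias steers which experts are selected, and the unbiased affinities determine the output weights. The sketch below illustrates both under stated assumptions: the fixed step size, the comparison against mean load, and the normalization in `gate_weights` (which assumes positive affinities) are illustrative choices, and all function names are hypothetical.

```python
def select_top_k(affinities, biases, k):
    """Pick experts by affinity + bias; the bias influences selection only."""
    scores = [a + b for a, b in zip(affinities, biases)]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def gate_weights(affinities, selected):
    """Weights for the selected experts, computed from unbiased affinities."""
    total = sum(affinities[i] for i in selected)
    return {i: affinities[i] / total for i in selected}

def update_biases(biases, load, step=0.001):
    """Raise the bias of under-loaded experts, lower it for over-loaded ones."""
    mean_load = sum(load) / len(load)
    updated = []
    for b, tokens in zip(biases, load):
        if tokens < mean_load:
            updated.append(b + step)
        elif tokens > mean_load:
            updated.append(b - step)
        else:
            updated.append(b)
    return updated

affinities = [0.5, 0.45, 0.1]
biases = [0.0, 0.0, 0.0]
before = select_top_k(affinities, biases, k=1)            # expert 0 wins
biases = update_biases(biases, load=[6, 0, 0], step=0.1)  # expert 0 was overloaded
after = select_top_k(affinities, biases, k=1)             # expert 1 now wins
weights = gate_weights(affinities, select_top_k(affinities, biases, k=2))
```

The update runs outside of backpropagation, so no balancing gradient flows into the model weights.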
Qwen 3 (Alibaba, 2025) introduced global-batch load balancing, which computes the load-balancing signal over the entire global batch rather than each micro-batch. This produces a smoother target and, the Qwen team reports, encourages stronger expert specialization. Combined with the absence of shared experts in Qwen3, this approach was credited with the model's strong scaling behavior up to 235 billion total parameters.
Expert capacity sets a hard limit on how many tokens a single expert can process in a given batch. The capacity is typically computed as:
Expert Capacity = (tokens_per_batch / number_of_experts) * capacity_factor
The capacity factor is a hyperparameter, usually set between 1.0 and 2.0. A factor of 1.0 means each expert can handle exactly its "fair share" of tokens, with no buffer for imbalance. Switch Transformers found that a capacity factor of 1.0 to 1.25 worked well in practice. Higher factors waste compute on padding; lower factors increase the number of dropped tokens.
When an expert reaches capacity, additional tokens routed to it are dropped. These dropped tokens skip the expert computation and instead pass through a residual connection unchanged. Research has shown that up to about 11% of tokens can be dropped this way without significant degradation in model quality, but more aggressive dropping causes noticeable harm.
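The capacity formula and the resulting token drops can be traced on a toy batch. This sketch assumes top-1 routing and first-come-first-served filling in token order; rounding the capacity up with `math.ceil` is an assumption, since the formula above does not specify rounding. The function names are hypothetical.

```python
import math

def expert_capacity(tokens_per_batch, num_experts, capacity_factor):
    """Capacity = (tokens_per_batch / num_experts) * capacity_factor, rounded up."""
    return math.ceil(tokens_per_batch / num_experts * capacity_factor)

def dispatch_with_capacity(assignments, num_experts, capacity):
    """Split token indices into kept and dropped, given one expert per token.

    assignments[t] is the expert chosen for token t. Tokens are admitted in
    order until the chosen expert is full; the rest are dropped and would
    pass through the residual connection unchanged.
    """
    load = [0] * num_experts
    kept, dropped = [], []
    for t, e in enumerate(assignments):
        if load[e] < capacity:
            load[e] += 1
            kept.append(t)
        else:
            dropped.append(t)
    return kept, dropped

# 8 tokens, 4 experts, capacity factor 1.25 -> capacity of 3 tokens per expert.
capacity = expert_capacity(tokens_per_batch=8, num_experts=4, capacity_factor=1.25)
# Five of the eight tokens prefer expert 0, so two of them overflow.
kept, dropped = dispatch_with_capacity([0, 0, 0, 0, 1, 2, 3, 0], 4, capacity)
```

Here experts 1 to 3 each run with unused slots, which corresponds to the padding cost mentioned above, while expert 0 drops its fourth and fifth tokens.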
The MegaBlocks library (Gale et al., 2022, arXiv:2211.15841) introduced dropless MoE, which avoids token dropping entirely by reformulating MoE computation as block-sparse matrix multiplication. Custom GPU kernels handle variable numbers of tokens per expert, eliminating both wasted compute on padding and quality loss from dropped tokens. DBRX, Mixtral, and most subsequent open MoE models adopt the dropless approach.
GShard, by Dmitry Lepikhin, HyoukJoong Lee, Noam Shazeer, and colleagues at Google (ICLR 2021, arXiv:2006.16668), was the first system to scale MoE transformers beyond 600 billion parameters. It focused on multilingual neural machine translation, training a model on 2,048 TPU v3 accelerators in four days at a total cost of 22 TPU v3 core-years. By comparison, training 100 separate bilingual baselines would have cost 235.5 TPU v3 core-years and produced lower quality (36.9 vs. 44.3 average BLEU). GShard used top-2 expert routing and introduced position-based random routing for the second expert to improve load balancing. The paper also contributed a set of sharding annotation APIs and XLA compiler extensions for distributing MoE models across devices, becoming a foundational systems contribution.
William Fedus, Barret Zoph, and Noam Shazeer at Google proposed the Switch Transformer (JMLR 23, 2022, arXiv:2101.03961), which simplified MoE routing by using top-1 expert selection instead of top-2. The largest Switch Transformer had 1.6 trillion parameters distributed across 2,048 experts. Despite this extreme sparsity, it achieved up to 7x speedup in pre-training over dense T5 models using the same computational budget. The paper also validated, for the first time, that large sparse MoE models could be trained in lower-precision bfloat16 format. The authors used selective precision (router in float32, experts in bfloat16), a technique still standard in 2026.
Carlos Riquelme, Joan Puigcerver, Basil Mustafa, and others at Google Brain published "Scaling Vision with Sparse Mixture of Experts" (NeurIPS 2021, arXiv:2106.05974). V-MoE replaced a subset of dense feedforward layers in Vision Transformers (ViT) with sparse MoE layers, with each image patch routed to a subset of experts. A 15-billion-parameter V-MoE with 24 MoE layers (out of 48 blocks) reached 90.35% top-1 ImageNet accuracy after fine-tuning. The paper also introduced batch prioritized routing, which prioritized subsets of inputs across the entire batch to enable adaptive per-image compute.
Google's Generalist Language Model (GLaM) by Du, Huang, Dai, et al. (ICML 2022, arXiv:2112.06905) scaled to 1.2 trillion total parameters with 64 experts per MoE layer, activating about 97 billion parameters per token (roughly 8% of total). GLaM used 1/3 the energy of GPT-3 for training (456 MWh vs. 1,287 MWh) and half the inference FLOPs, while achieving better zero-shot and one-shot performance across 29 NLP benchmarks. GLaM placed MoE layers on every other transformer block rather than every block.
ST-MoE by Zoph, Bello, Kumar, Du, Huang, Dean, Shazeer, and Fedus (arXiv:2202.08906) addressed training instability and fine-tuning quality issues that had limited sparse models on transfer learning. The 269-billion-parameter ST-MoE-32B model (matching the FLOPs of a 32-billion-parameter dense encoder-decoder) was the first sparse model to achieve state-of-the-art performance on a diverse set of transfer tasks including reasoning, summarization, closed-book QA, and adversarial benchmarks. The router z-loss introduced in this paper became a near-universal component of subsequent MoE training pipelines.
Two systems papers in 2022 made large-scale MoE training and inference practical. DeepSpeed-MoE (Rajbhandari et al., ICML 2022, arXiv:2201.05596) at Microsoft provided an end-to-end training and inference solution with novel architecture designs and compression techniques that reduced MoE model size by up to 3.7x and offered 4.5x faster, 9x cheaper inference compared to quality-equivalent dense models. Tutel (also Microsoft) optimized the all-to-all communication primitive specifically for MoE routing, with adaptive pipelining and a 2-dimensional hierarchical (2DH) all-to-all algorithm, accelerating Meta's 1.1 trillion–parameter MoE model by more than 40% on 64 NDm A100 v4 nodes.
Mistral AI released Mixtral 8x7B in December 2023 and Mixtral 8x22B in April 2024, both open-source under the Apache 2.0 license. The technical report ("Mixtral of Experts," arXiv:2401.04088) was published in January 2024.
Mixtral 8x7B shares the same backbone as Mistral 7B but replaces each FFN layer with 8 expert FFNs. A router selects 2 experts per token per layer, applying softmax only over the top-2 chosen experts (rather than over all 8 before top-k). The model has 46.7 billion total parameters with 12.9 billion active per token. It outperformed or matched Llama 2 70B and GPT-3.5 across evaluated benchmarks despite using significantly fewer active parameters, and it was faster than any dense 70B model.
Mixtral 8x22B scaled this design up to 141 billion total parameters with 39 billion active, extended the context window to 65,536 tokens, and added native support for function calling. It strongly outperformed Llama 2 70B on French, German, Spanish, and Italian benchmarks (HellaSwag, Arc Challenge, MMLU).
Databricks released DBRX in March 2024 with a "fine-grained" MoE approach. Instead of the conventional 8-expert, choose-2 design, DBRX uses 16 experts and activates 4 per token, giving 65 times more possible expert combinations compared to 8-choose-2, which the authors found improved model quality. DBRX has 132 billion total parameters with 36 billion active, was pre-trained on 12 trillion tokens with a 32K context length, and uses rotary position encodings, gated linear units, and grouped query attention. It employs dropless MoE routing via the MegaBlocks library and was trained on 3,072 NVIDIA H100 GPUs connected via 3.2 Tbps InfiniBand.
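The 65x figure follows directly from counting unordered expert subsets, as this short check shows. It uses only `math.comb` from the Python standard library; the variable names are illustrative.

```python
from math import comb

combos_16_choose_4 = comb(16, 4)   # ways to pick 4 of 16 experts
combos_8_choose_2 = comb(8, 2)     # ways to pick 2 of 8 experts
ratio = combos_16_choose_4 // combos_8_choose_2
```

There are 1,820 possible 4-of-16 subsets against 28 possible 2-of-8 subsets, which gives the ratio of 65.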
xAI open-sourced Grok-1 on March 17, 2024, under the Apache 2.0 license. Pre-training had concluded in October 2023. Grok-1 has 314 billion total parameters with 8 experts and top-2 selection, activating roughly 25% of weights per token. The architecture uses 64 layers, 48 attention heads for queries and 8 for keys and values, an embedding size of 6,144, and supports 8-bit quantization. One notable difference from Mixtral is in the routing: Grok-1 applies top-2 selection after a softmax over all 8 experts, whereas Mixtral applies softmax only over the top-2 selected experts. At release, Grok-1 was the largest open-weights model.
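The two orderings of softmax and top-k described here select the same experts but weight them differently. The sketch below contrasts them for one token; the function names are hypothetical, and leaving the second variant's weights unnormalized is an assumption, since the text does not say whether they are rescaled afterward.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_then_softmax(logits, k):
    """Select the top-k logits first, then softmax over only those."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    return dict(zip(top, softmax([logits[i] for i in top])))

def softmax_then_top_k(logits, k):
    """Softmax over all experts first, then keep the k largest probabilities."""
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return {i: probs[i] for i in top}

logits = [2.0, 1.0, 0.0, 0.0]
mixtral_style = top_k_then_softmax(logits, k=2)   # weights sum to 1
grok_style = softmax_then_top_k(logits, k=2)      # weights sum to less than 1
```

The relative weighting of the two chosen experts is identical in both variants; only the overall scale of the combined expert output differs.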
Snowflake released Arctic on April 24, 2024 (Apache 2.0). Arctic combines a 10-billion-parameter dense transformer with a residual 128-by-3.66-billion MoE MLP, totaling 480 billion parameters with 17 billion active, chosen via top-2 gating. The 128-expert design produces a fine-grained MoE optimized for enterprise tasks (SQL, code generation). Snowflake reported up to 4x fewer memory reads than Code-Llama 70B and 2.5x fewer than Mixtral 8x22B, leading to faster inference.
DeepSeek-V2 ("A Strong, Economical, and Efficient Mixture-of-Experts Language Model," arXiv:2405.04434, May 2024) has 236 billion total parameters with 21 billion activated per token and a 128K context length. It introduced two architectural innovations that became influential: Multi-head Latent Attention (MLA), which compresses the KV cache into a low-rank latent vector and reduces KV cache size by 93.3%, and the production-scale DeepSeekMoE design with 2 shared experts and 160 routed experts (6 activated per token), each with a hidden dimension of 1,536. Compared to DeepSeek 67B, V2 achieved better quality with 42.5% lower training cost and 5.76x higher inference throughput.
The DeepSeekMoE paper (Dai et al., arXiv:2401.06066, January 2024) formalized two principal strategies that have shaped MoE design ever since: fine-grained expert segmentation (the hidden dimension of each expert is reduced while the number of experts is multiplied, enabling more flexible combinations) and shared expert isolation (a small set of experts is always active for every token, capturing common knowledge and reducing redundancy in routed experts). DeepSeekMoE 2B matched GShard 2.9B in quality with 1.5x fewer expert parameters and FLOPs.
DeepSeek-V3 (DeepSeek-AI, "DeepSeek-V3 Technical Report," arXiv:2412.19437) has 671 billion total parameters with 37 billion active per token. It uses 256 routed experts plus 1 shared expert, with the top 8 routed experts activated per token. Key contributions include:
DeepSeek-V3 reports zero token drops throughout training and inference, made possible by the combination of fine-grained experts, shared experts, and bias-based balancing.
Google DeepMind announced Gemini 1.5 Pro in February 2024 as a sparse mixture-of-experts transformer with multimodal inputs and a 1-million-token context window (extended in research previews to 10 million). The exact expert and active parameter counts have not been disclosed, but Jeff Dean publicly traced its lineage to "a long line of Google research efforts on sparse models" starting with Shazeer et al. 2017. Gemini 1.5 was the first widely available production frontier model confirmed to use MoE.
Meta released the Llama 4 herd on April 5, 2025, marking the first Llama generation to use mixture-of-experts. The herd consists of three models.
All Llama 4 models use top-1 routing, native multimodality with early fusion of text and image, and were pre-trained on more than 30 trillion tokens.
Alibaba's Qwen team has released several MoE generations.
Moonshot AI released Kimi K2 in July 2025 as a 1-trillion-parameter MoE model with 32 billion active parameters. It uses 384 experts with 8 active per token and a 128K context window. Kimi K2 was pre-trained on 15.5 trillion tokens using the Muon optimizer at unprecedented scale, with the team reporting zero training instability after a custom set of optimizer modifications. The model is positioned around agentic intelligence, including extended reasoning and tool use.
Mistral AI's Mistral Large 3 (released 2025) was the company's first frontier-class MoE, with 41 billion active parameters out of 675 billion total. The shift from the dense Mistral Large 2 (123B dense) signaled that even labs that had stuck with dense designs were converging on sparse architectures for frontier work.
While OpenAI has not officially confirmed the architecture of GPT-4, multiple sources have reported that it uses an MoE design. A widely cited 2023 analysis by Dylan Patel and Gerald Wong at SemiAnalysis described GPT-4 as approximately 1.8 trillion total parameters, with 16 experts of approximately 111 billion MLP parameters each and 2 experts routed per forward pass. An earlier informal claim by George Hotz described 8 experts of 220 billion parameters each (1.76 trillion in total). These reports were partly corroborated by Soumith Chintala, co-creator of PyTorch, but remain unconfirmed by OpenAI.
AI21 Labs' Jamba is a hybrid architecture that combines transformer layers, Mamba (structured state space model) layers, and MoE layers (arXiv:2403.19887). It has 52 billion total parameters with 12 billion active, and offers a 256K context window. Roughly one in every eight layers uses a transformer attention mechanism; the rest use Mamba, with MoE layers interleaved. This hybrid approach reduces the memory footprint compared to a pure transformer of similar capacity.
Yuan 2.0-M32 (Inspur / IEIT-Yuan, arXiv:2405.17976) introduced an Attention Router that replaces the conventional linear gate with an attention-based selection mechanism. The router treats each expert as a key/value pair and uses the input token as the query; this captures correlations among experts during the routing decision. The 40B total / 3.7B active model with 32 experts and top-2 routing was trained from scratch on 2 trillion tokens at only 9.25% of the compute of a dense model of similar parameter count, and the authors report a 3.8% accuracy gain attributable to the Attention Router alone.[8]
Skywork-MoE (Skywork AI, arXiv:2406.06563) is a 146-billion-parameter, 16-expert MoE with 22 billion active parameters, upcycled from the dense Skywork-13B checkpoint. The paper introduced two training techniques. Gating logit normalization normalizes router logits before softmax, sharpening expert assignments and improving diversification. Adaptive auxiliary loss coefficients adjust the load-balancing coefficient per layer based on observed token drop rates, rather than holding it constant across the network. The authors also presented empirical guidance on the upcycling vs. from-scratch trade-off, finding that upcycling pays off when the dense checkpoint is already strong and the additional MoE training budget is small.[9]
Microsoft released Phi-3.5-MoE in August 2024 alongside the Phi-3.5-mini and Phi-3.5-Vision models. It is a 16 x 3.8B mixture-of-experts decoder-only transformer with 6.6 billion parameters active per token (top-2 routing). It supports a 128K context length and was trained with a new method called GRIN (GRadient-INformed) MoE, which uses gradient information to inform routing decisions and expert specialization. Microsoft reported that Phi-3.5-MoE matches or exceeds Llama 3.1 8B, Mixtral 8x7B, and Gemini-1.5-Flash on language, reasoning, math, and code benchmarks at significantly lower active parameter count.[10]
Rhymes AI released Aria in October 2024 as the first open-source, multimodal-native MoE. Aria has 25.3 billion total parameters with 3.5 billion active per text token and 3.9 billion active per visual token, supports a 64K multimodal context window, and was pre-trained from scratch through a four-stage pipeline (language pre-training, multimodal pre-training, long-context pre-training, and instruction tuning). Aria outperformed Pixtral 12B and Llama 3.2 11B-Vision on a range of multimodal benchmarks while fitting in a single A100 80GB GPU in bfloat16 precision.[11]
Tencent's Hunyuan-Large (arXiv:2411.02265) is a 389-billion-parameter MoE with 52 billion active parameters. It uses 16 specialized routed experts plus 1 shared expert, with top-1 routing over the routed experts. Hunyuan-Large supports a 256K context window and was pre-trained on 7 trillion tokens, including 1.5 trillion tokens of synthesized data. Notable contributions include expert-specific learning rates (different layers and experts use different LR schedules), KV cache compression for long-context efficiency, and a mixed routing strategy. At release, Hunyuan-Large was the largest open-source Transformer-based MoE.[12]
MiniMax's MiniMax-Text-01 (arXiv:2501.08313) combines Lightning Attention (a linear-attention variant) with traditional softmax attention and MoE feed-forwards. Within every 8 transformer blocks, 7 use Lightning Attention and 1 uses softmax attention. Each transformer layer has 32 MoE experts with top-2 routing, giving 456 billion total parameters and 45.9 billion active per token. The training context is 1 million tokens, extendable to 4 million tokens during inference, making it the first commercial-scale model to scale linear attention to the multi-million-token regime.[13]
Beyond the flagship Qwen3-235B-A22B, the Qwen team released a compact MoE called Qwen3-30B-A3B: 30.5 billion total parameters with about 3.3 billion active per token, 128 experts and top-8 routing, no shared experts, 48 transformer layers, 32 query heads and 4 key/value heads (grouped-query attention), and a 32,768-token native context (extensible with YaRN).[14]
Qwen3-Next 80B-A3B (preview, September 2025) is a hybrid model that combines Gated DeltaNet linear-attention layers with Gated Attention softmax layers and an ultra-sparse MoE: 512 experts of which only 10 are active per token (~3.7% of total weights), giving 80 billion total parameters and approximately 3 billion active. Alibaba reports roughly 10x faster inference than Qwen3-32B for long contexts and approximately 90% lower training cost relative to Qwen3-32B at comparable downstream quality.[15]
Ant Group's Ling-Plus (arXiv:2503.05139) is a 290-billion-parameter MoE with 28.8 billion active parameters and a 64K context window. Its accompanying paper, "Every FLOP Counts: Scaling a 300B MoE LING LLM without Premium GPUs," documents techniques for training large MoE models on lower-end (non-NVIDIA H-series) accelerators, including domestic Chinese GPUs. Ant reported that training one trillion tokens on high-end hardware cost approximately 6.35 million RMB compared to roughly 5.08 million RMB on their optimized lower-spec pipeline, while reaching parity on downstream benchmarks.[16]
Zhipu AI released GLM-4.5 in July 2025 (355 billion total, 32 billion active) and GLM-4.5-Air (106 billion total, 12 billion active) under the MIT license. Both are MoE LLMs that integrate reasoning, coding, and agentic capabilities and expose a hybrid "thinking" vs. "non-thinking" mode through chat-template selection. GLM-4.6 (September 2025) keeps the 355B/32B configuration but improves token efficiency, completing comparable tasks with approximately 15% fewer tokens than GLM-4.5.[17][18]
OpenAI released gpt-oss-120b and gpt-oss-20b on August 5, 2025, under the Apache 2.0 license; they were OpenAI's first open-weight language models since GPT-2 in 2019. The 120B model has 117 billion total parameters across 36 transformer layers, with 128 experts per layer of which 4 are active per token, yielding approximately 5.1 billion active parameters. The 20B model has 21 billion total parameters with 32 experts and top-4 routing, yielding around 3.6 billion active. Both use a native MXFP4 (4-bit microscaling FP) quantization for the expert weights, enabling gpt-oss-120b to run on a single 80 GB GPU and gpt-oss-20b to run on edge devices with 16 GB of memory. OpenAI reports that gpt-oss-120b matches or exceeds o4-mini on competition coding (Codeforces), problem solving (MMLU and HLE), and tool-use (TauBench).[19]
DeepSeek released DeepSeek V3.1 in August 2025 as a hybrid model that shares the same MoE architecture as DeepSeek-V3 (671B total, 37B active, 256 routed experts + 1 shared, top-8 routing) but adds a hybrid thinking mode controlled by the chat template: the same weights can either emit chain-of-thought reasoning (like DeepSeek-R1) or direct answers (like DeepSeek-V3). The base checkpoint was extended to a 128K context window via a two-phase procedure (630B tokens to 32K, then a further 209B tokens to 128K).[20]
DeepSeek V3.2-Exp (September 2025, arXiv:2512.02556) keeps the V3 MoE architecture and introduces DeepSeek Sparse Attention (DSA). DSA has two parts: a lightning indexer that estimates which past tokens matter for each query, and a token selector that retains only the top-k of them. This reduces long-context attention complexity from O(L^2) to approximately O(kL) and preserves quality at very long context lengths.[21]
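The index-then-select pattern can be sketched for a single query as follows. This is not DeepSeek's implementation: the indexer scores are simply passed in as a list (how the lightning indexer computes them is not modeled), and the function name and toy vectors are hypothetical. It only shows why each query attends to k tokens rather than all L.

```python
import math


def sparse_attention_one_query(q, keys, values, index_scores, k):
    """Attend only to the k past tokens with the highest indexer scores."""
    # Token selector: keep the indices of the k best-scoring past tokens.
    keep = sorted(range(len(keys)), key=lambda t: index_scores[t], reverse=True)[:k]
    # Scaled dot-product attention restricted to the kept tokens: O(k) per query.
    logits = [sum(a * b for a, b in zip(q, keys[t])) / math.sqrt(len(q)) for t in keep]
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    dim = len(values[0])
    out = [sum(weights[j] / z * values[keep[j]][c] for j in range(len(keep)))
           for c in range(dim)]
    return keep, out


q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
index_scores = [0.1, 0.9, 0.8, 0.2]  # assumed output of some cheap indexer

kept, output = sparse_attention_one_query(q, keys, values, index_scores, k=2)
print(kept)  # [1, 2]
```

With L queries each touching k keys, the attention cost grows as k times L instead of L squared; the indexer itself still has to score every past token, which is why it needs to be cheap.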
Moonshot AI's Kimi K2 Thinking (released November 2025) extends Kimi K2 with a long chain-of-thought reasoning capability and native INT4 quantization. The model preserves the K2 backbone (1T total, 32B active, 384 experts, top-8 routing) but is trained with quantization-aware-training so that 4-bit weights are the default inference format. Moonshot reports that the model can execute 200 to 300 sequential tool calls without human intervention and exposes a 256K-token context window.[22]
MoE models are more prone to training instability than dense models, particularly at large scale. Sources of instability include:
Practical stabilization techniques include using full precision (float32) for the router even when experts run in bfloat16 or FP8, adding router z-loss, carefully tuning the auxiliary loss coefficient (or moving to bias-based balancing), gradient clipping, and warming up the auxiliary loss over the first few thousand steps.
Sparse MoE models are more susceptible to overfitting during fine-tuning than dense models of comparable active parameter count. This happens because MoE models have far more total parameters, but each parameter sees fewer training examples (since each expert only processes a fraction of tokens). Strategies to mitigate this include:
In expert parallelism, every MoE layer requires two all-to-all communications: one to dispatch tokens to the GPUs holding their assigned experts, and one to combine the results back. Research has shown that all-to-all communication can consume more than 40% of total runtime in large-scale MoE training, and up to 59.2% of forward-pass latency in the MoE layers on an 8-GPU server running DeepSeek-V2-Lite. For inference, all-to-all can contribute 10 to 30% of end-to-end latency, especially during decoding, where each token's hidden state must hop between GPUs. Optimizing this communication is a major focus of systems research; representative techniques include 2DH all-to-all, fused communication-computation kernels, and sub-chunk pipelining.
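The dispatch and combine steps can be simulated in a single process to make the data movement concrete. The sketch below is a toy under stated assumptions: "devices" are just dictionary keys, experts are scalar functions, the router assignment is hard-coded, and all function names are hypothetical. A real system would perform the two regrouping steps with collective communication.

```python
from collections import defaultdict


def dispatch(tokens, expert_ids, experts_per_device):
    """First all-to-all: group tokens by the device that owns their expert."""
    buckets = defaultdict(list)
    for pos, (tok, eid) in enumerate(zip(tokens, expert_ids)):
        buckets[eid // experts_per_device].append((pos, tok, eid))
    return buckets


def run_experts(buckets, experts):
    """Local compute: each device applies its own experts to received tokens."""
    return {dev: [(pos, experts[eid](tok)) for pos, tok, eid in items]
            for dev, items in buckets.items()}


def combine(results, num_tokens):
    """Second all-to-all: return outputs and restore the original token order."""
    out = [None] * num_tokens
    for items in results.values():
        for pos, y in items:
            out[pos] = y
    return out


# Toy setup: 4 experts on 2 devices; expert e multiplies its input by (e + 1).
experts = [lambda x, s=e + 1: x * s for e in range(4)]
tokens = [1.0, 2.0, 3.0, 4.0, 5.0]
expert_ids = [3, 0, 2, 0, 1]  # top-1 assignment from a hypothetical router

buckets = dispatch(tokens, expert_ids, experts_per_device=2)
outputs = combine(run_experts(buckets, experts), len(tokens))
print(outputs)  # [4.0, 2.0, 9.0, 4.0, 10.0]
```

Device 0 receives three tokens and device 1 receives two, so even this tiny example shows the uneven message sizes that make all-to-all sensitive to load imbalance.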
Research has revealed that experts in encoder models tend to develop token-level specialization. Certain experts may specialize in punctuation, proper nouns, or specific syntactic patterns. In decoder models, specialization is less interpretable; some experts appear to handle particular topical domains, others activate on rare tokens, and many appear functionally redundant in early training. Specialization typically sharpens over training, especially after the auxiliary loss is reduced.
Expert specialization collapse occurs when experts become functionally redundant, all learning similar representations instead of specializing. This negates the benefit of having multiple experts and is distinct from routing collapse (where experts are ignored entirely). Fine-grained segmentation, shared experts, and stronger regularization on the router are the most commonly cited remedies.
A key challenge for MoE inference is that, despite only activating a subset of experts per token, all expert parameters must be loaded into memory for fast access. This means MoE models have the same memory footprint as a dense model of equal total parameter count, even though they use far fewer FLOPs per token. For example, Mixtral 8x7B requires loading all 46.7 billion parameters into VRAM even though only 12.9 billion are active per token; DeepSeek-V3 requires loading 671 billion parameters even though only 37 billion are active.
Production deployments of large MoE models routinely require 8 or more GPUs with 80 GB each simply to load the model before serving any traffic. Llama 4 Maverick at 400 billion total parameters requires roughly 800 GB in 16-bit precision; DeepSeek-V3 at 671 billion fits in roughly 720 GB after FP8 packing.
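The footprint figures above follow from a one-line calculation. The sketch below counts weights only (no KV cache, activations, or runtime overhead) and uses decimal gigabytes; `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(total_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return total_params * bits_per_param / 8 / 1e9


models = {
    "Mixtral 8x7B (16-bit)": (46.7e9, 16),
    "Llama 4 Maverick (16-bit)": (400e9, 16),
    "DeepSeek-V3 (8-bit)": (671e9, 8),
}
for name, (params, bits) in models.items():
    print(f"{name}: {weight_memory_gb(params, bits):.1f} GB")
```

The loop prints 93.4 GB, 800.0 GB, and 671.0 GB. The last value is a weights-only lower bound, so it comes out below the roughly 720 GB quoted for the packed DeepSeek-V3 checkpoint; the sketch does not model whatever accounts for the difference. Note that the active parameter count never enters the calculation, which is the point of the paragraph above.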
Expert parallelism (EP) is a distribution strategy designed specifically for MoE models. Different experts are placed on different GPUs, and tokens are routed to the GPU holding their assigned expert via all-to-all communication. Non-MoE layers (such as attention) are handled via standard data or tensor parallelism.
This can be combined with other parallelism strategies:
| Parallelism type | What is distributed | Applicability |
|---|---|---|
| Data parallelism | Different batches across devices | All model types |
| Tensor parallelism | Individual layer weights split across devices | Large layers |
| Pipeline parallelism | Different layers on different devices | Deep models |
| Expert parallelism | Different experts on different devices | MoE models specifically |
| Context parallelism | Different parts of long sequences across devices | Long-context models |
NVIDIA's work on wide expert parallelism with GB200 NVL72 systems showed up to 1.8x higher per-GPU throughput compared to smaller expert-parallel configurations, by leveraging fewer experts per GPU and higher arithmetic intensity inside the high-bandwidth NVLink domain (130 TB/s coherent NVLink). Engineering teams at Meta have published case studies on combining tensor, context, and expert parallelism for serving large MoE models efficiently.
Quantization is particularly effective for MoE models because the memory savings are amplified by the large total parameter count. QMoE (Frantar and Alistarh, MLSys 2024, arXiv:2310.16795) demonstrated compression of a 1.6-trillion-parameter Switch Transformer from 3.2 TB to less than 160 GB at less than 1 bit per parameter, with only minor accuracy loss, in less than a day on a single GPU. With QMoE, the 1.6-trillion-parameter Switch Transformer could run on a single server with 4x NVIDIA A6000 GPUs at less than 5% runtime overhead relative to ideal uncompressed inference. FP8 weight quantization (used natively by DeepSeek-V3) and 4-bit AWQ or GPTQ quantization (used by community Mixtral builds) are also widely deployed.
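The QMoE figures are mutually consistent, as a quick check shows. The sketch assumes 16-bit uncompressed weights and treats 160 GB as an upper bound on the compressed size, so the bits-per-parameter result is also an upper bound.

```python
params = 1.6e12                        # Switch Transformer parameter count
uncompressed_bytes = params * 16 / 8   # assuming 16-bit weights
compressed_bytes = 160e9               # upper bound quoted above

print(uncompressed_bytes / 1e12)              # 3.2  (TB)
print(compressed_bytes * 8 / params)          # 0.8  (bits per parameter)
print(uncompressed_bytes / compressed_bytes)  # 20.0 (compression ratio)
```

A 20x reduction from 16-bit weights is what brings the average below 1 bit per parameter.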
For deployment on devices with limited GPU memory, expert offloading stores inactive expert weights in CPU memory and loads them to the GPU on demand. Pre-gated MoE takes this further by predicting which experts will be needed ahead of time and prefetching their weights, enabling single-GPU deployment of large MoE models at the cost of additional latency from CPU-GPU transfer. Open-source tools such as llama.cpp implement aggressive expert offloading to enable Mixtral 8x7B and DBRX inference on consumer GPUs with as little as 24 GB of VRAM.
MoE models can be distilled into smaller dense models that retain 30 to 40% of the MoE's quality advantage over a comparably sized dense baseline. Research has also shown that sentence-level or task-level routing can be used to extract specialized sub-networks from a trained MoE for targeted deployment. The Llama 4 Behemoth model is reported to be used primarily as a teacher for distilling Scout and Maverick.
The following table summarizes the practical trade-offs between MoE and dense model architectures.
| Dimension | MoE models | Dense models |
|---|---|---|
| Pre-training speed | Faster (4 to 7x for equivalent quality) | Slower |
| Total parameters | Very large (100B to 2T+) | Moderate (7B to 540B typically) |
| Active parameters per token | Small fraction of total | All parameters |
| Inference FLOPs per token | Lower for given quality level | Higher |
| VRAM requirement | High (must load all experts) | Proportional to parameter count |
| Training stability | Requires careful tuning (auxiliary loss, z-loss) | Generally more stable |
| Fine-tuning | Prone to overfitting; benefits from instruction tuning | More straightforward |
| Knowledge-intensive tasks | Generally stronger | Depends on size |
| Reasoning tasks | Mixed results historically; recent MoEs (DeepSeek-V3, Kimi K2) close the gap | Often stronger at similar active parameter count |
| Deployment complexity | Higher (expert parallelism, large memory) | Lower |
| Energy efficiency | Better (less compute for similar quality) | Worse |
| Edge / on-device | Difficult (memory) | Better suited |
The general MoE output for an input x is:
y = sum_{i=1}^{N} g(x)_i * E_i(x)
where N is the number of experts, E_i is the i-th expert network, and g(x) is the gating function.
For sparse top-k routing, the gating function becomes:
g(x) = Softmax(TopK(H(x), k))
where:
H(x)_i = (x * W_g)_i + epsilon_i * Softplus((x * W_noise)_i)
and epsilon_i is sampled from a standard normal distribution. The TopK function retains only the k largest values and sets the rest to negative infinity before applying softmax. In Mixtral-style routing, the softmax is applied only over the top-k retained values; in Grok-1-style routing, it is applied over all N values before retaining the top-k.
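The two softmax orderings, and the weighted sum y = sum_i g(x)_i * E_i(x), can be sketched in plain Python. This is a minimal illustration under assumptions: the noise term in H(x) is omitted, experts are scalar functions, the softmax-then-top-k variant does not renormalize the kept probabilities, and all function names are hypothetical rather than taken from any model's code.

```python
import math


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_indices(values, k):
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]


def gate_topk_then_softmax(logits, k):
    """Keep the top-k logits, then softmax over those k only."""
    keep = top_k_indices(logits, k)
    return dict(zip(keep, softmax([logits[i] for i in keep])))


def gate_softmax_then_topk(logits, k):
    """Softmax over all N logits, then keep the top-k probabilities."""
    probs = softmax(logits)
    return {i: probs[i] for i in top_k_indices(probs, k)}


def moe_output(x, experts, gates):
    """y = sum_i g(x)_i * E_i(x), evaluating only experts with a nonzero gate."""
    return sum(g * experts[i](x) for i, g in gates.items())


logits = [2.0, 1.0, 0.5, -1.0]  # H(x) for one token, noise omitted
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]

a = gate_topk_then_softmax(logits, k=2)
b = gate_softmax_then_topk(logits, k=2)
print(sorted(a), sum(a.values()))  # same experts, weights sum to 1
print(sorted(b), sum(b.values()))  # same experts, weights sum to less than 1
print(moe_output(1.0, experts, a))
```

Both orderings pick the same experts and preserve the ratio between their weights; they differ only in whether the retained weights sum to 1.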
The load-balancing auxiliary loss for N experts across a batch of T tokens is:
L_balance = N * sum_{i=1}^{N} f_i * P_i
where f_i = (number of tokens assigned to expert i) / T and P_i = (sum of router probabilities for expert i) / T.
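A small numeric check shows how the sum behaves at the two extremes. The sketch returns the unscaled quantity N * sum_i f_i * P_i, leaving the coefficient alpha to be applied when the losses are combined; it assumes top-1 assignments, and the function name and toy inputs are hypothetical.

```python
def load_balance_loss(assignments, router_probs, num_experts):
    """Unscaled N * sum_i f_i * P_i for top-1 assignments.

    assignments:  expert index chosen for each token
    router_probs: per-token list of N router probabilities
    """
    T = len(assignments)
    f = [assignments.count(i) / T for i in range(num_experts)]
    P = [sum(p[i] for p in router_probs) / T for i in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, P))


# Perfectly balanced: each of 4 experts gets one token, uniform probabilities.
balanced = load_balance_loss([0, 1, 2, 3], [[0.25] * 4] * 4, 4)

# Fully collapsed: every token goes to expert 0 with probability 1.
collapsed = load_balance_loss([0, 0, 0, 0], [[1.0, 0.0, 0.0, 0.0]] * 4, 4)

print(balanced)   # 1.0
print(collapsed)  # 4.0
```

The value is 1 under uniform routing and rises to N when a single expert takes everything, which is why minimizing it pushes the router toward balance.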
The router z-loss for batch size B is:
L_z = (1 / B) * sum_{b=1}^{B} (log sum_{i=1}^{N} exp(H(x_b)_i))^2
The total training loss is the weighted sum:
L_total = L_task + alpha * L_balance + beta * L_z
with typical settings alpha = 0.001 to 0.01 and beta = 0.001.
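The z-loss and the combined objective can be sketched the same way. The example assumes the balance term passed in is the unscaled sum, uses alpha = 0.01 and beta = 0.001 from the range quoted above as defaults, and the function names are hypothetical.

```python
import math


def router_z_loss(batch_logits):
    """(1 / B) * sum_b (log sum_i exp(H(x_b)_i))^2"""
    total = 0.0
    for logits in batch_logits:
        m = max(logits)  # subtract the max for numerical stability
        lse = m + math.log(sum(math.exp(v - m) for v in logits))
        total += lse ** 2
    return total / len(batch_logits)


def total_loss(task_loss, balance_loss, z_loss, alpha=0.01, beta=0.001):
    """L_total = L_task + alpha * L_balance + beta * L_z"""
    return task_loss + alpha * balance_loss + beta * z_loss


small = router_z_loss([[0.0, 0.0, 0.0, 0.0]])      # (log 4)^2
large = router_z_loss([[10.0, 10.0, 10.0, 10.0]])  # (10 + log 4)^2
print(small, large)
print(total_loss(task_loss=2.0, balance_loss=1.0, z_loss=small))
```

Both logit vectors produce the same uniform routing distribution, but the second is penalized far more heavily. The z-loss acts on logit magnitude, not on which expert is chosen.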
For DeepSeek-V3-style auxiliary-loss-free balancing, the gating logits are augmented with a per-expert bias before top-k selection:
score_i = (x * W_g)_i + b_i
The bias b_i is updated outside the gradient computation: at each step, b_i is decreased for over-utilized experts and increased for under-utilized ones, by a small fixed step size.
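A minimal sketch of this update rule follows. It makes assumptions the text does not fix: load is compared against the mean token count per expert, the step size of 0.001 is an arbitrary illustrative value, and the function names are hypothetical.

```python
def select_top_k(scores, biases, k):
    """Pick experts by biased score; the bias only affects selection."""
    biased = [s + b for s, b in zip(scores, biases)]
    return sorted(range(len(scores)), key=lambda i: biased[i], reverse=True)[:k]


def update_biases(biases, token_counts, step=0.001):
    """Lower the bias of overloaded experts, raise it for underloaded ones.

    Runs outside the gradient computation: no loss term, no backpropagation.
    """
    mean_load = sum(token_counts) / len(token_counts)
    new_biases = []
    for b, count in zip(biases, token_counts):
        if count > mean_load:
            new_biases.append(b - step)
        elif count < mean_load:
            new_biases.append(b + step)
        else:
            new_biases.append(b)
    return new_biases


biases = update_biases([0.0, 0.0, 0.0, 0.0], token_counts=[10, 2, 4, 0])
print(biases)  # [-0.001, 0.001, 0.0, 0.001]

# A sufficiently negative bias flips a close routing decision.
print(select_top_k([0.50, 0.49, 0.10, 0.10], [-0.02, 0.0, 0.0, 0.0], k=1))  # [1]
```

Because the correction is a fixed-size nudge and not a gradient, it steers token counts without adding a competing term to the training objective.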
Although the modern MoE wave is rooted in Transformer FFN replacement, several lines of work apply sparse expert routing to non-Transformer or hybrid backbones.
MoE-Mamba (Pióro et al., arXiv:2401.04081) interleaves Mamba state-space-model (SSM) blocks with MoE feed-forward layers. The architecture inherits Mamba's linear-time sequence processing and adds MoE capacity. The authors report that MoE-Mamba reaches Mamba-equivalent perplexity in 2.35x fewer training steps while preserving Mamba's inference throughput advantage over Transformers.[23]
BlackMamba (Anthony et al., arXiv:2402.01771) further integrates SSMs and MoE by replacing both the attention layers (with Mamba) and the FFN layers (with MoE experts) of a Transformer block. Zyphra trained and open-sourced 340M/1.5B and 630M/2.8B BlackMamba models on 300B tokens. The architecture's combination of linear-complexity sequence mixing and sparse FFN computation produces favorable inference and training FLOP characteristics versus comparable Mamba and Transformer baselines.[24]
Jamba (AI21 Labs, 2024) and later Jamba 1.5 interleave Transformer attention, Mamba, and MoE blocks. MiniMax-Text-01 (2025) combines Lightning Attention, softmax attention, and MoE. These hybrid stacks are evidence that MoE is an architectural primitive that composes with backbones other than the standard Transformer.
Soft MoE (Puigcerver, Riquelme, Mustafa, Houlsby, "From Sparse to Soft Mixtures of Experts," arXiv:2308.00951) introduces slots: each expert holds a small fixed number of input slots, and every slot is filled with a learned weighted combination of all input tokens. Likewise, each output token is a learned weighted combination of all expert slot outputs. There is no top-k selection and no token dropping; the assignment is fully differentiable. In vision transformer workloads, Soft MoE Huge/14 with 128 experts in 16 MoE layers has over 40x more parameters than ViT-Huge/14 at only 2% additional inference time, outperforming both dense ViT and conventional Token-Choice and Expert-Choice MoEs. Soft MoE is the dominant continuous MoE recipe for vision but is harder to use in autoregressive decoding, since every slot mixes all tokens of the input sequence, including future positions, which breaks causality.[25]
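The slot mechanism can be written out in a few lines of plain Python. This is a sketch under assumptions, not the authors' code: `Phi` stands for the learned token-to-slot parameters, experts are ordinary functions on vectors, and the sizes are toy values.

```python
import math


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def soft_moe(X, Phi, experts, slots_per_expert):
    """X: n tokens x d features. Phi: d x s slot parameters (s = total slots)."""
    n, d = len(X), len(X[0])
    s = len(experts) * slots_per_expert
    # logits[i][j]: affinity between token i and slot j.
    logits = [[sum(X[i][c] * Phi[c][j] for c in range(d)) for j in range(s)]
              for i in range(n)]
    # Dispatch weights: softmax over tokens, computed separately per slot.
    dispatch = [softmax([logits[i][j] for i in range(n)]) for j in range(s)]
    # Combine weights: softmax over slots, computed separately per token.
    combine = [softmax(row) for row in logits]
    # Every slot input is a weighted mix of all tokens.
    slot_in = [[sum(dispatch[j][i] * X[i][c] for i in range(n)) for c in range(d)]
               for j in range(s)]
    # Slot j is processed by the expert that owns it.
    slot_out = [experts[j // slots_per_expert](slot_in[j]) for j in range(s)]
    # Every output token is a weighted mix of all slot outputs.
    return [[sum(combine[i][j] * slot_out[j][c] for j in range(s)) for c in range(d)]
            for i in range(n)]


X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
Phi = [[0.1, -0.2], [0.0, 0.3]]
experts = [lambda v: list(v), lambda v: [2.0 * x for x in v]]
Y = soft_moe(X, Phi, experts, slots_per_expert=1)
print(len(Y), len(Y[0]))  # 3 2
```

Each expert always processes exactly `slots_per_expert` inputs regardless of the token distribution, which is why load balancing and token dropping do not arise. The dispatch softmax runs over every token in the sequence, which is the source of the causality problem for decoders.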
Mixture of Tokens (MoT) (Antoniak, Krutul, Jaszczur, et al., arXiv:2310.15961) presents a continuous, fully differentiable alternative to sparse MoE that is compatible with autoregressive decoding. Each expert receives a unique mixture of tokens drawn from the same example (or across grouped examples), with mixing weights produced by a small controller. Because every expert receives the same number of (mixed) tokens, load imbalance is sidestepped by construction. The authors report up to a 3x training-speed improvement over dense Transformers and parity with state-of-the-art sparse MoE on language pre-training quality.[26]
Mixture of Recursions (MoR) (Bae et al., arXiv:2507.10524) reuses a single shared stack of layers across multiple recursion steps and uses lightweight per-token routers to decide how many recursion steps each token takes. This combines parameter sharing with adaptive depth: easy tokens exit after fewer passes, hard tokens recurse deeper. The framework supports both expert-choice routing (top-k tokens continue at each step) and token-choice routing (each token gets a fixed depth at the outset) and includes recursive KV-caching strategies. Across model scales from 135M to 1.7B parameters, MoR achieves higher throughput and lower validation perplexity at matched training FLOPs.[27]
MoE upcycling, formalized by Komatsuzaki, Puigcerver, Mustafa, et al. (arXiv:2212.05055), converts a pre-trained dense model into an MoE by duplicating the FFN weights of each layer to seed multiple experts and initializing a fresh router. Continued training then specializes the experts. The paper showed that for T5 Base, Large, and XL, upcycling produces models that outperform their dense counterparts on SuperGLUE with about 50% additional compute over the dense pre-training run. Qwen1.5-MoE and Skywork-MoE both used upcycling at production scale. Later work ("Scaling Laws for Upcycling Mixture-of-Experts Language Models," arXiv:2502.03009) extended this with explicit scaling laws and found that the upcycled advantage persists up to roughly 120% of the sunk dense pre-training compute before from-scratch MoE training becomes competitive again.[28][29]
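The initialization step of upcycling is simple enough to sketch directly. The example below is illustrative only: weights are nested lists in place of tensors, the dictionary layout and `upcycle_ffn` name are hypothetical, and the router standard deviation of 0.02 is an assumed value.

```python
import copy
import random


def upcycle_ffn(dense_ffn, num_experts, d_model, seed=0):
    """Seed every expert with a copy of the dense FFN and add a fresh router."""
    rng = random.Random(seed)
    # Independent copies, so continued training can specialize each expert.
    experts = [copy.deepcopy(dense_ffn) for _ in range(num_experts)]
    # The router has no dense counterpart, so it starts from small random values.
    router = [[rng.gauss(0.0, 0.02) for _ in range(num_experts)]
              for _ in range(d_model)]
    return {"experts": experts, "router": router}


dense_ffn = {"w_in": [[0.1, 0.2], [0.3, 0.4]], "w_out": [[0.5, 0.6], [0.7, 0.8]]}
moe_layer = upcycle_ffn(dense_ffn, num_experts=4, d_model=2)

print(len(moe_layer["experts"]))           # 4
print(moe_layer["experts"][0] == dense_ffn)  # True
```

At initialization all experts are identical, so the upcycled layer computes the same function as the dense FFN whenever the gate weights sum to 1. Divergence comes only from continued training.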
Krajewski et al. ("Scaling Laws for Fine-Grained Mixture of Experts," ICML 2024, arXiv:2402.07871) introduced granularity as an explicit hyperparameter: the ratio of a baseline dense FFN's hidden size to an expert's hidden size. A model with granularity 8 has 8x more, 8x narrower experts than the baseline. They derived a joint scaling law over total parameters, training tokens, and granularity, and showed that for any fixed compute budget there is an optimal granularity well above 1, justifying the fine-grained DeepSeekMoE and Qwen3 designs. They also argued that earlier work (Clark et al., "Unified scaling laws for routed language models," ICML 2022) had under-estimated MoE's advantage because it held expert size and training duration fixed.[30]
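The effect of granularity on a layer's shape can be worked through with small numbers. The sketch assumes that the number of active experts is scaled by the same factor as the expert count, so that active width per token stays constant; that scaling, the helper names, and the sizes are assumptions for illustration.

```python
def fine_grained_config(d_ff, base_experts, base_top_k, granularity):
    """Split each baseline expert into `granularity` narrower experts."""
    if d_ff % granularity:
        raise ValueError("d_ff must be divisible by granularity")
    return {
        "expert_hidden": d_ff // granularity,       # each expert is G times narrower
        "num_experts": base_experts * granularity,  # and there are G times as many
        "top_k": base_top_k * granularity,          # assumed: activate G times as many
    }


def active_hidden_units(cfg):
    """Total expert hidden width touched by one token."""
    return cfg["expert_hidden"] * cfg["top_k"]


baseline = fine_grained_config(d_ff=8192, base_experts=8, base_top_k=1, granularity=1)
fine = fine_grained_config(d_ff=8192, base_experts=8, base_top_k=1, granularity=8)

print(fine)  # {'expert_hidden': 1024, 'num_experts': 64, 'top_k': 8}
print(active_hidden_units(baseline), active_hidden_units(fine))  # 8192 8192
```

Total and active expert width are unchanged; only the number of ways to combine experts grows, at the cost of a larger router and more, smaller matrix multiplications.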
Wang et al. ("Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts," arXiv:2408.15664) introduced the bias-based load-balancing strategy used by DeepSeek-V3. A bias term b_i is added to each routed expert's gating logit only for top-k selection (not for the gating weight applied to the expert output). The biases are updated each step by a small fixed amount: increased for under-utilized experts, decreased for over-utilized ones. This avoids the interference gradients that the standard auxiliary loss introduces and has become a standard component of frontier MoE designs (DeepSeek-V3, DeepSeek V3.1, DeepSeek V3.2, Kimi K2).[31]
Several lines of work have attempted to extend scaling laws from dense transformer models to MoE. Three influential results.
In practice, frontier labs converged on a recipe of approximately 256 routed experts with top-8 selection plus 1 shared expert, fine-grained (small) experts, and bias-based balancing for compute budgets in the 10^24 to 10^25 FLOP range.
By 2025, both vLLM and SGLang ship dedicated MoE execution paths. vLLM exposes an --enable-expert-parallel flag that switches MoE layers from tensor-parallel to expert-parallel execution while attention layers run with data-parallel KV-cache partitioning, a layout co-designed with DeepSeek-V3 and Llama 4 Maverick. For DeepSeek-V3-class models, an 8-way data-parallel attention + 8-way expert-parallel MoE configuration is standard, giving each GPU 1/8 of the KV cache and 1/8 of the routed experts.[32]
NVIDIA's "wide expert parallelism" work on GB200 NVL72 systems reports up to 1.8x higher per-GPU throughput than narrower EP configurations by placing fewer experts on each GPU and exploiting the NVL72's 130 TB/s coherent NVLink fabric. Operating in this regime turns the all-to-all traffic into intra-NVL72 traffic, dramatically reducing latency.[33]
For VRAM-constrained deployments, expert offloading stores inactive experts in CPU memory and pages them to GPU on demand. Pre-gated MoE predicts the experts needed at the next layer one step ahead of time and prefetches them during the current layer's computation, hiding most of the CPU-GPU transfer latency. The llama.cpp and SGLang stacks expose offloading knobs that let Mixtral 8x7B, DBRX, and even DeepSeek-V3-class models run on consumer GPUs with 24 GB of VRAM (with throughput penalties).
DeepSeek-V3 was the first model to validate FP8 mixed-precision training of an MoE at the 671B / 14.8T-token scale. gpt-oss went further by using MXFP4 (4-bit microscaling FP) as the default inference format for expert weights, packing the 117B-parameter gpt-oss-120b model into a single 80 GB GPU. As of 2026, MXFP4 or per-channel 4-bit weight quantization with a higher-precision (BF16 or FP8) router is becoming standard for open MoE releases.[19]
While MoE is most widely associated with large language models, the architecture has been applied to other domains.
Several libraries and frameworks support MoE training and inference.
| Library | Organization | Features |
|---|---|---|
| MegaBlocks | Databricks (originally Stanford) | Block-sparse GPU kernels; dropless MoE; backbone of DBRX |
| DeepSpeed-MoE | Microsoft | Hybrid parallel training (data + tensor + expert); residual MoE; 4.5x faster inference vs. dense equivalents |
| Tutel | Microsoft | Optimized all-to-all; FP8/NVFP4/MXFP4 support; targets DeepSeek, Kimi K2, Qwen3 |
| FairScale and Fairseq | Meta | Sequence modeling framework with MoE support; used in NLLB-200 |
| Hugging Face Transformers | Hugging Face | Native MoE support since v4.36.0 (Mixtral); now covers DBRX, Mixtral, Qwen MoE, DeepSeek, Llama 4 |
| Megatron-LM | NVIDIA | Production-scale MoE with expert parallelism and tensor parallelism |
| vLLM and SGLang | UC Berkeley / community | High-throughput inference with MoE-specific optimizations |
| MergeKit | Charles Goddard / community | "FrankenMoE" upcycling from existing dense checkpoints |
| OpenMoE | Community | Community-built Llama-based MoE models |
Across the leading MoE LLMs of 2024 to 2026, several recurring design choices have stabilized.
| Choice | Most common in 2024 to 2026 | Notable exceptions |
|---|---|---|
| Router type | Top-k softmax over routed experts | Expert choice (research); top-1 (Switch, Llama 4) |
| Number of experts | 16 to 256 routed; 1 shared | DBRX: 16; Llama 4 Maverick: 128; Kimi K2: 384 |
| Active experts per token | 2, 4, or 8 | Llama 4 (1) |
| Shared experts | Common in DeepSeek-style designs | Qwen3 dropped them |
| Load balancing | Aux-loss-free (DeepSeek), aux loss + z-loss (others) | Global-batch (Qwen3) |
| Dropless | Standard | Earlier Switch and Mixtral allowed drops |
| Precision | bfloat16 or FP8 weights, float32 router | |
| Capacity factor | 1.0 to 1.25 (when capacity is enforced) | Dropless models avoid the issue |
The architectural convergence is striking. By 2026, "fine-grained MoE with one or two shared experts (or none, in the Qwen3 style) and bias-based load balancing" had become the de facto recipe for sparse frontier models, with DeepSeek V3 / V3.1 / V3.2, Qwen 3, Kimi K2, GLM-4.5, GLM-4.6, and Hunyuan-Large all variants on this template.[5][3][4][7][12][17]
Several common misconceptions about MoE are worth addressing.
"MoE models are smaller than dense models." False. MoE models have far more total parameters than the dense models they compete with; they only have fewer active parameters per token. A MoE that activates 37 billion parameters per token from a 671-billion-parameter pool requires the full 671 billion to be loaded for fast inference.
"MoE models are 8 separate models." False. Each "expert" is a single FFN layer, not a complete model. Routing is decided independently at each layer, so a single token typically passes through different experts at different layers. With 32 layers and 8 experts per layer, each token traces one of 8^32 possible expert paths under top-1 routing, or one of 28^32 under top-2 routing, since each layer then offers C(8, 2) = 28 expert pairs.
"Each expert specializes in a topic (math, code, etc.)." Mostly false. Empirical analyses of Mixtral, DBRX, and DeepSeek routes find that experts often specialize on token classes (punctuation, proper nouns, function words) rather than topics. Topical specialization sometimes emerges but is not the design goal.
"MoE saves memory at inference." Largely false. MoE saves compute and energy, not VRAM or RAM, since all expert weights must be loaded. The exception is expert offloading, which saves VRAM at the cost of CPU-GPU transfer latency.
"MoE replaces ensembling." False. Ensembling combines independently trained models; MoE jointly trains a single model with sparse activation. The ensembling analogy in the 1991 paper has limited bearing on modern sparse implementations.
1. Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. (1991). "Adaptive Mixtures of Local Experts." Neural Computation, 3(1), 79-87. https://www.cs.toronto.edu/~hinton/absps/jjnh91.pdf
2. Jiang, A. Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., et al. (2024). "Mixtral of Experts." https://arxiv.org/abs/2401.04088
3. Dai, D., Deng, C., Zhao, C., Xu, R. X., Gao, H., Chen, D., et al. (2024). "DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models." ACL 2024. https://arxiv.org/abs/2401.06066
4. DeepSeek-AI. (2024). "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model." https://arxiv.org/abs/2405.04434
5. DeepSeek-AI. (2024). "DeepSeek-V3 Technical Report." https://arxiv.org/abs/2412.19437
6. Meta AI. (2025). "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/
7. Moonshot AI. (2025). "Kimi K2: Open Agentic Intelligence." https://moonshotai.github.io/Kimi-K2/
8. Wu, S., et al. (2024). "Yuan 2.0-M32: Mixture of Experts with Attention Router." https://arxiv.org/abs/2405.17976
9. Wei, T., Zhu, B., et al. / Skywork AI. (2024). "Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models." https://arxiv.org/abs/2406.06563
10. Microsoft. (2024). "Phi-3.5-MoE-instruct Model Card." Hugging Face. https://huggingface.co/microsoft/Phi-3.5-MoE-instruct
11. Li, D., et al. / Rhymes AI. (2024). "Aria: An Open Multimodal Native Mixture-of-Experts Model." https://arxiv.org/abs/2410.05993
12. Tencent. (2024). "Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters." https://arxiv.org/abs/2411.02265
13. MiniMax. (2025). "MiniMax-01: Scaling Foundation Models with Lightning Attention." https://arxiv.org/abs/2501.08313
14. Qwen Team. (2025). "Qwen3-30B-A3B Model Card." Hugging Face. https://huggingface.co/Qwen/Qwen3-30B-A3B
15. NVIDIA Technical Blog. (2025). "New Open Source Qwen3-Next Models Preview Hybrid MoE Architecture." https://developer.nvidia.com/blog/new-open-source-qwen3-next-models-preview-hybrid-moe-architecture-delivering-improved-accuracy-and-accelerated-parallel-processing-across-nvidia-platform/
InclusionAI / Ant Group. (2025). "Every FLOP Counts: Scaling a 300B Mixture-of-Experts LING LLM without Premium GPUs." https://arxiv.org/abs/2503.05139 ↩ ↩2
Zhipu AI. (2025). "Introducing GLM-4.5." https://huggingface.co/zai-org/GLM-4.5 ↩ ↩2 ↩3 ↩4
OpenAI. (2025). "Introducing gpt-oss." https://openai.com/index/introducing-gpt-oss/ and Model Card https://cdn.openai.com/pdf/419b6906-9da6-406c-a19d-1bb078ac7637/oai_gpt-oss_model_card.pdf ↩ ↩2 ↩3 ↩4
DeepSeek-AI. (2025). "DeepSeek-V3.1 Release." https://api-docs.deepseek.com/news/news250821 and Hugging Face https://huggingface.co/deepseek-ai/DeepSeek-V3.1 ↩ ↩2
DeepSeek-AI. (2025). "DeepSeek-V3.2-Exp: Boosting Long-Context Efficiency with DeepSeek Sparse Attention." https://arxiv.org/abs/2512.02556 and https://api-docs.deepseek.com/news/news250929 ↩ ↩2
Moonshot AI. (2025). "Kimi K2 Thinking Release." https://huggingface.co/moonshotai/Kimi-K2-Thinking ↩ ↩2
Pióro, M., Ciebiera, K., Król, K., Ludziejewski, J., & Jaszczur, S. (2024). "MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts." https://arxiv.org/abs/2401.04081 ↩
Anthony, Q., Tokpanov, Y., Glorioso, P., & Millidge, B. (2024). "BlackMamba: Mixture of Experts for State-Space Models." https://arxiv.org/abs/2402.01771 ↩
Puigcerver, J., Riquelme, C., Mustafa, B., & Houlsby, N. (2023). "From Sparse to Soft Mixtures of Experts." https://arxiv.org/abs/2308.00951 ↩
Antoniak, S., Krutul, M., Pióro, M., et al. (2024). "Mixture of Tokens: Continuous MoE through Cross-Example Aggregation." NeurIPS 2024. https://arxiv.org/abs/2310.15961 ↩
Bae, S., et al. (2025). "Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation." NeurIPS 2025. https://arxiv.org/abs/2507.10524 ↩
Komatsuzaki, A., Puigcerver, J., Lee-Thorp, J., Ruiz, C. R., Mustafa, B., Ainslie, J., Tay, Y., Dehghani, M., & Houlsby, N. (2022). "Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints." https://arxiv.org/abs/2212.05055 ↩ ↩2
Nakamura, T., et al. (2025). "Scaling Laws for Upcycling Mixture-of-Experts Language Models." https://arxiv.org/abs/2502.03009 ↩ ↩2
Krajewski, J., Ludziejewski, J., et al. (2024). "Scaling Laws for Fine-Grained Mixture of Experts." ICML 2024. https://arxiv.org/abs/2402.07871 ↩ ↩2
Wang, L., et al. / DeepSeek-AI. (2024). "Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts." https://arxiv.org/abs/2408.15664 ↩
vLLM Project. (2025). "Expert Parallel Deployment." https://docs.vllm.ai/en/latest/serving/expert_parallel_deployment/ and "Scaling DeepSeek-style MoEs with vLLM and llm-d using Wide EP." Red Hat Developer, September 8, 2025. https://developers.redhat.com/articles/2025/09/08/scaling-deepseek-style-moes-vllm-and-llm-d-using-wide-ep ↩
NVIDIA. (2025). "Scaling Large MoE Models with Wide Expert Parallelism on NVL72 Rack Scale Systems." NVIDIA Technical Blog. https://developer.nvidia.com/blog/scaling-large-moe-models-with-wide-expert-parallelism-on-nvl72-rack-scale-systems/ ↩
Qwen Team. (2025). "Qwen3 Technical Report." https://arxiv.org/abs/2505.09388 ↩