Mixture of Experts (MoE)
Last reviewed
May 9, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v7 · 8,966 words
A Mixture of Experts (MoE) is a machine learning architecture that divides a problem into subtasks, each handled by a specialized sub-network called an "expert." A learned gating network (also called a router) determines which expert or experts should process each input. In modern deep learning, MoE most commonly appears as a sparse variant inside transformer models, where only a subset of experts is activated for any given input token. This allows models to scale to very large parameter counts while keeping per-token computation manageable.
MoE architectures have become central to the design of many state-of-the-art large language models, including Mixtral, DBRX, Grok-1, DeepSeek-V3, Llama 4, Qwen 3, Kimi K2, Gemini 1.5, and (reportedly) GPT-4. They offer a practical path to scaling model capacity without a proportional increase in training or inference cost. By 2025, the leading frontier models in nearly every category were sparse mixtures, marking one of the largest architectural shifts since the original transformer paper.
Imagine you have a really hard homework assignment that covers math, reading, science, and art. Instead of asking one friend who is okay at everything, you ask four different friends, each one the best at one subject. A "traffic director" looks at each question and sends it to whichever friend knows the answer best. That traffic director is the gating network, and each friend is an expert. The smart part is that you only bother one or two friends per question, so you get great answers without making everyone work on everything.
Now imagine the homework book is huge and there are 256 friends instead of four. You still only ask two of them per question, so the answers come fast. But you still need a giant table for all 256 friends to sit at, which is why these models need a lot of memory even though they are quick to run.
The MoE concept was introduced by Robert A. Jacobs, Michael Jordan, Steven J. Nowlan, and Geoffrey Hinton in their 1991 paper "Adaptive Mixtures of Local Experts," published in Neural Computation (volume 3, issue 1, pages 79 to 87). Jacobs and Jordan were affiliated with MIT's Department of Brain and Cognitive Sciences; Nowlan and Hinton were at the University of Toronto's Department of Computer Science. The paper proposed a supervised learning procedure for systems composed of many separate sub-networks, each learning to handle a subset of the training cases. The authors framed the approach two ways: as a modular version of a multilayer supervised network, and as an associative version of competitive learning.
The original system consisted of several specialist networks (experts) and a gating network that learned to assign inputs to the appropriate expert. The authors demonstrated the approach on a vowel discrimination task, training up to eight experts to recognize phonemes from six Japanese speakers. In the final trained model, only three of the eight experts were meaningfully active, showing that the system naturally learned to specialize and effectively pruned unused capacity. The 1991 formulation was a dense MoE: every expert ran on every input, and the gating network produced a soft weighting over their outputs.
Michael Jordan and Robert Jacobs extended the framework in 1994 with "Hierarchical Mixtures of Experts and the EM Algorithm," published in Neural Computation (volume 6, issue 2, pages 181 to 214). This version arranged experts in a tree structure with multiple levels of gating, allowing for hierarchical decomposition of the input space. The paper also applied the Expectation-Maximization (EM) algorithm to MoE training as an alternative to gradient descent, framing learning as a maximum likelihood estimation problem with hidden mixture component variables.
For roughly two decades after the original paper, MoE remained mostly an academic concept. Interest revived around 2013 when Yoshua Bengio and collaborators began exploring conditional computation, the idea that different parts of a neural network could be activated dynamically depending on the input. Bengio, Léonard, and Courville published "Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation" in 2013, providing tools for learning discrete routing decisions through gradient estimators.
That same year, David Eigen, Marc'Aurelio Ranzato, and Ilya Sutskever published "Learning Factored Representations in a Deep Mixture of Experts" (arXiv:1312.4314), which stacked multiple MoE layers and demonstrated on a jittered MNIST dataset that the network learned to factor different aspects of the data (location and class) at different layers. Davis and Arel, also in 2013, contributed parallel work on conditional computation. Bengio, Bacon, Pineau, and Precup followed in 2015 with "Conditional Computation in Neural Networks for Faster Models" (arXiv:1511.06297), formalizing the goal of decoupling parameter count from inference cost.
These papers laid conceptual groundwork for the integration of MoE into modern architectures but did not produce production-scale systems.
The turning point came with Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean at Google in their 2017 paper "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer" (ICLR 2017, arXiv:1701.06538). They introduced a MoE layer with up to thousands of feed-forward experts and a trainable gating network that selected a sparse combination of experts per input. The approach was applied between stacked LSTM layers, producing a model with 137 billion parameters that achieved state-of-the-art results on language modeling and machine translation benchmarks at a fraction of the computational cost of dense alternatives. Crucially, the paper introduced noisy top-k gating, an auxiliary load-balancing loss, and a system-level treatment of how to actually train sparse experts at scale on multiple devices. This paper established the template for modern sparse MoE.
From 2020 onward, MoE was integrated into transformer architectures at increasing scale. The progression through 2025 is summarized below.
| Year | Model or paper | Organization | Total params | Active params | Experts | Top-k | Key contribution |
|---|---|---|---|---|---|---|---|
| 2020 | GShard | Google | 600B+ | n/a | 2,048 | 2 | First MoE Transformer beyond 600B; multilingual MT; 2,048 TPU v3 |
| 2021 | Switch Transformer | Google | 1.6T | ~26B | 2,048 | 1 | Top-1 routing; bfloat16 training of trillion-parameter sparse models |
| 2021 | V-MoE | Google Brain | 15B (vision) | n/a | up to 32 per layer | 2 | First sparse MoE vision transformer; 90.35% ImageNet |
| 2022 | GLaM | Google | 1.2T | ~97B | 64 per layer | 2 | One-third of GPT-3 training energy; 29-task NLP gains |
| 2022 | ST-MoE | Google | 269B | 32B | 32-128 | 2 | Router z-loss; first sparse model SOTA on transfer tasks |
| 2022 | Expert Choice | Google | 8B active | 8B | 64 | variable | Reversed routing; experts pick tokens; perfect load balance |
| 2022 | DeepSpeed-MoE | Microsoft | n/a | n/a | n/a | n/a | 4.5x faster, 9x cheaper inference vs. quality-equivalent dense |
| 2022 | MegaBlocks | Stanford / Databricks | n/a | n/a | n/a | n/a | Block-sparse kernels; "dropless" MoE |
| 2023 | Mixtral 8x7B | Mistral AI | 46.7B | 12.9B | 8 | 2 | First widely-used open-weights MoE; matched Llama 2 70B |
| 2024 | DBRX | Databricks | 132B | 36B | 16 | 4 | Fine-grained MoE; 65x more expert combinations than 8-choose-2 |
| 2024 | Grok-1 | xAI | 314B | ~78B | 8 | 2 | Largest open-weights model at release; Apache 2.0 |
| 2024 | Mixtral 8x22B | Mistral AI | 141B | 39B | 8 | 2 | 64K context; native multilingual; Apache 2.0 |
| 2024 | Jamba | AI21 Labs | 52B | 12B | 16 | 2 | Hybrid Transformer-Mamba-MoE; 256K context |
| 2024 | DeepSeekMoE | DeepSeek | 16B | 2.8B | 64 (fine) + 2 shared | 6 | Fine-grained segmentation plus shared experts |
| 2024 | DeepSeek-V2 | DeepSeek | 236B | 21B | 160 + 2 shared | 6 | MLA + DeepSeekMoE; 128K context; 5.76x throughput vs. DeepSeek 67B |
| 2024 | Gemini 1.5 Pro | Google DeepMind | undisclosed | undisclosed | undisclosed | undisclosed | First production multimodal MoE; 1M+ token context |
| 2024 | Snowflake Arctic | Snowflake | 480B | 17B | 128 + 1 dense | 2 | Hybrid dense + residual MoE; enterprise focus |
| 2024 | Qwen1.5-MoE | Alibaba / Qwen | 14.3B | 2.7B | 60 + 4 shared | 4 | Upcycled from dense; reported 75% reduction in training cost |
| 2024 | DeepSeek-V3 | DeepSeek | 671B | 37B | 256 + 1 shared | 8 | Auxiliary-loss-free balancing; FP8 training; 2.788M H800 hours |
| 2025 | Llama 4 Scout | Meta | 109B | 17B | 16 | 1 | Native multimodality; 10M token context |
| 2025 | Llama 4 Maverick | Meta | 400B | 17B | 128 | 1 | 128 experts; alternating dense and MoE layers |
| 2025 | Llama 4 Behemoth | Meta | ~2T | 288B | 16 | 1 | Frontier teacher model (training as of 2025) |
| 2025 | Qwen3-235B-A22B | Alibaba / Qwen | 235B | 22B | 128 | 8 | Global-batch load balancing; no shared experts |
| 2025 | Kimi K2 | Moonshot AI | 1T | 32B | 384 | 8 | Trained with Muon optimizer; agentic focus; 128K context |
| 2025 | Mistral Large 3 | Mistral AI | 675B | 41B | undisclosed | undisclosed | Mistral's first frontier-class MoE |
A standard MoE layer has two main parts.
Expert networks. A set of N independent sub-networks, typically feed-forward networks (FFNs) with a SwiGLU or GeLU non-linearity. Each expert has the same architecture but learns different parameters, allowing it to specialize on different types of inputs. Each expert in a transformer FFN typically has the form Expert(x) = W_2 * activation(W_1 * x), where W_1 projects up to a wider hidden dimension and W_2 projects back.
Gating network (router). A small network that takes the input and produces a probability distribution over the experts. Formally, for an input x, the gating network computes:
G(x) = Softmax(x * W_g)
where W_g is a learned weight matrix of shape (hidden_dim, N). The output of the MoE layer is the weighted sum of expert outputs:
y = sum_i G(x)_i * E_i(x)
where E_i(x) is the output of expert i. In sparse MoE, most components of G(x) are zero by construction.
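The two formulas above can be sketched in a few lines of plain Python. This is a toy illustration rather than a production layer: the function names, the scaling "experts," and the choice to renormalize the surviving gates after top-k selection are all assumptions made for the example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gate_logits(x, w_g):
    """Compute x * W_g, with W_g stored as hidden_dim rows of N columns."""
    return [sum(x[r] * w_g[r][c] for r in range(len(x))) for c in range(len(w_g[0]))]

def moe_forward(x, w_g, experts, k=None):
    """Return (y, gates) with y = sum_i G(x)_i * E_i(x).

    k=None keeps every gate (dense MoE). Otherwise only the k largest
    gates survive and are renormalized (an assumed convention).
    """
    gates = softmax(gate_logits(x, w_g))
    if k is not None:
        keep = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)[:k]
        norm = sum(gates[i] for i in keep)
        gates = [gates[i] / norm if i in keep else 0.0 for i in range(len(gates))]
    y = [0.0] * len(x)
    for gate, expert in zip(gates, experts):
        if gate == 0.0:
            continue  # sparse case: an unselected expert is never evaluated
        y = [acc + gate * out for acc, out in zip(y, expert(x))]
    return y, gates

# Toy setup (hypothetical): four "experts" that scale their input by 1, 2, 3, 4.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
w_g = [[0.0, 1.0, 2.0, 3.0],   # hidden_dim = 2 rows, N = 4 columns
       [0.0, 0.0, 0.0, 0.0]]
x = [1.0, 0.0]

y_dense, g_dense = moe_forward(x, w_g, experts)        # all four experts run
y_sparse, g_sparse = moe_forward(x, w_g, experts, k=2)  # only two experts run
print(g_dense)
print(g_sparse, y_sparse)
```

In the sparse call, two of the four gates are exactly zero, so the corresponding experts are skipped entirely; that skipped work is where the compute savings come from.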
In transformer-based models, MoE layers typically replace the feed-forward network (FFN) that follows each multi-head attention layer. Since the FFN accounts for a large share of a transformer's parameters (roughly 90% in models like PaLM-540B, and a similar fraction in Llama-style architectures), replacing even a subset of FFN layers with MoE layers can dramatically increase total parameter count without proportionally increasing computation.
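Common placement strategies include replacing every FFN layer with an MoE layer (as in Mixtral 8x7B) and replacing every other FFN layer (as in GLaM and Llama 4 Maverick, which alternate dense and MoE blocks).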
The first and last few layers are often kept dense even in MoE models, on the theory that early layers process generic features and final layers form predictions where stable pathways are useful.
Two strategies exist for producing an MoE model: training from scratch with sparse routing from step zero, or upcycling an existing dense checkpoint into an MoE by replicating its FFN weights into multiple experts and continuing training. Upcycling, popularized by Qwen1.5-MoE and several Mixtral variants in the community, can reach competitive quality at roughly 25 to 50% of the from-scratch training compute, though it tends to produce experts that initially behave very similarly until specialization develops over many tokens of continued training.
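A minimal sketch of the upcycling idea, assuming a tiny ReLU feed-forward block stored as nested lists. The helper names `ffn`, `upcycle`, and `moe_ffn` are hypothetical. It illustrates why upcycled experts start out behaving identically: with copied weights and gate values that sum to 1, the MoE layer reproduces the dense layer's output exactly.

```python
import copy

def ffn(x, weights):
    """Tiny feed-forward block: W_2 * relu(W_1 * x), weights stored as lists of rows."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights["w1"]]
    return [sum(w * h for w, h in zip(row, hidden)) for row in weights["w2"]]

def upcycle(dense_weights, num_experts):
    """Replicate one dense FFN into num_experts independent copies."""
    return [copy.deepcopy(dense_weights) for _ in range(num_experts)]

def moe_ffn(x, expert_weights, gates):
    """Weighted sum of expert outputs; experts with a zero gate are skipped."""
    y = [0.0] * len(expert_weights[0]["w2"])
    for gate, weights in zip(gates, expert_weights):
        if gate > 0.0:
            y = [acc + gate * out for acc, out in zip(y, ffn(x, weights))]
    return y

# Hypothetical dense checkpoint with hidden size 3 and model size 2.
dense = {"w1": [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]],
         "w2": [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]}
experts = upcycle(dense, num_experts=4)

x = [2.0, 1.0]
dense_out = ffn(x, dense)
moe_out = moe_ffn(x, experts, gates=[0.7, 0.3, 0.0, 0.0])  # top-2 gates summing to 1
print(dense_out, moe_out)
```

Because the copies are independent, continued training can then move each expert away from the shared starting point, which is the specialization the paragraph above describes.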
The gating mechanism is the most studied component of MoE design, and it is where most of the qualitative differences between MoE systems live. Several approaches have been developed.
The simplest form computes a softmax over a linear projection of the input:
G(x) = Softmax(x * W_g)
This is a dense gating approach where all experts receive some weight. It works for small numbers of experts and is mathematically equivalent to the original 1991 formulation, but does not scale efficiently to hundreds or thousands of experts because every expert has to run.
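Introduced by Shazeer et al. (2017), this is the foundation for most modern MoE routers. The process has three steps. First, tunable Gaussian noise is added to each router logit:
H(x)_i = (x * W_g)_i + StandardNormal() * Softplus((x * W_noise)_i)
Second, only the k largest values of H(x) are kept and the rest are set to negative infinity. Third, a softmax over the result produces the gate values, so unselected experts receive a weight of exactly zero:
G(x) = Softmax(KeepTopK(H(x), k))
The noise helps prevent the router from always selecting the same experts and encourages different experts to be tried during training. Many production systems disable the noise at inference for determinism.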
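Noisy top-k gating can be sketched as follows, assuming the router logits have already been computed. Here `clean_logits` stands in for x * W_g and `noise_logits` for x * W_noise. The function name and the `rng=None` switch for turning noise off are choices made for this example, not part of the original paper.

```python
import math
import random

def softplus(v):
    """Softplus(v) = log(1 + exp(v)); keeps the noise scale positive."""
    return math.log1p(math.exp(v))

def noisy_top_k_gates(clean_logits, noise_logits, k, rng=None):
    """Return gate weights with exactly k nonzero entries.

    Pass rng=None to disable the noise (for example at inference).
    """
    noisy = []
    for clean, raw_scale in zip(clean_logits, noise_logits):
        noise = rng.gauss(0.0, 1.0) * softplus(raw_scale) if rng is not None else 0.0
        noisy.append(clean + noise)
    # Keep the k largest noisy logits and mask the rest with -infinity.
    top = sorted(range(len(noisy)), key=lambda i: noisy[i], reverse=True)[:k]
    kept = [noisy[i] if i in top else -math.inf for i in range(len(noisy))]
    # Softmax over the masked logits; exp(-inf) evaluates to 0.0.
    peak = max(kept)
    exps = [math.exp(v - peak) for v in kept]
    total = sum(exps)
    return [e / total for e in exps]

clean = [0.1, 2.0, -1.0, 1.5]
deterministic = noisy_top_k_gates(clean, [0.0] * 4, k=2)
stochastic = noisy_top_k_gates(clean, [0.0] * 4, k=2, rng=random.Random(0))
print(deterministic)
print(stochastic)
```

With the noise disabled, the same two experts are always chosen for this input. With noise enabled, a lower-ranked expert can occasionally win a slot, which is the exploration effect described above.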
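The Switch Transformer (Fedus, Zoph, and Shazeer, 2022) simplified routing by setting k = 1, sending each token to a single expert. The authors showed that this preserves model quality while offering three advantages: less router computation, because each token goes to only one expert; a smaller per-expert batch (expert capacity), which can be at least halved; and a simpler routing implementation with lower communication cost.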
Llama 4 returned to top-1 routing in 2025 with both Scout and Maverick, citing the same efficiency arguments. In top-1 routing the gating weight for the chosen expert is sometimes still applied as a multiplicative scalar on the expert output, so that the router continues to receive gradients even though the selection itself is discrete.
Mixtral (k = 2), DBRX (k = 4), Snowflake Arctic (k = 2), and DeepSeek-V3 (k = 8 over routed experts plus a shared expert) use top-k routing with k > 1. Higher k means each token sees more experts and the load is generally easier to balance, but communication and compute costs grow roughly linearly with k.
Zhou et al. (2022) at Google proposed reversing the routing direction. Instead of tokens selecting their top-k experts, each expert selects its top-k tokens from the batch (NeurIPS 2022, arXiv:2202.09368). This guarantees perfect load balancing by construction, since every expert processes exactly the same number of tokens. The approach achieved over 2x training speedup compared to top-1 and top-2 gating in an 8-billion-active-parameter model with 64 experts.
A trade-off of expert choice routing is that some tokens may be processed by many experts (receiving more computation) while others may be processed by none, requiring careful handling through residual connections. Because the assignment is computed across the whole batch, expert choice is best suited to training and high-throughput batch inference; for streaming, single-token-at-a-time decoding it is harder to apply.
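The reversed direction is easy to see in a small sketch. The function name, the score matrix, and the capacity value below are made up for illustration. Each expert sorts the batch by its own column of router scores and keeps its top tokens.

```python
def expert_choice(scores, capacity):
    """scores[t][e] is the router affinity of token t for expert e.

    Each expert keeps its `capacity` highest-scoring tokens, so every
    expert ends up with exactly the same load.
    """
    n_tokens, n_experts = len(scores), len(scores[0])
    assignment = {}
    for e in range(n_experts):
        ranked = sorted(range(n_tokens), key=lambda t: scores[t][e], reverse=True)
        assignment[e] = ranked[:capacity]
    return assignment

# Hypothetical batch of 4 tokens and 2 experts.
scores = [[0.9, 0.8],
          [0.7, 0.6],
          [0.2, 0.1],
          [0.1, 0.3]]
assignment = expert_choice(scores, capacity=2)

# Count how many experts process each token.
experts_per_token = [sum(t in chosen for chosen in assignment.values())
                     for t in range(len(scores))]
print(assignment)
print(experts_per_token)
```

In this batch both experts pick tokens 0 and 1, so those tokens receive two experts each while tokens 2 and 3 receive none and would pass through the residual connection only. The load per expert is balanced by construction, but the compute per token is not.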
Several alternative routing methods have been explored.
| Strategy | Description | Advantage |
|---|---|---|
| Hash routing | Deterministic assignment based on token hash | No learned parameters; zero routing overhead |
| Random routing | Tokens assigned to random experts | Baseline comparison; surprisingly competitive in some settings |
| Linear assignment | Global optimization of token-expert matching | Optimal assignment but computationally expensive |
| Reinforcement learning | Router trained with RL signals | Can optimize for downstream objectives |
| BASE layers | Balanced assignment via linear programming | Guaranteed balance with top-1 selection |
| Soft MoE | Each input is a weighted combination of all expert slots | Differentiable; useful in vision (Soft MoE, Puigcerver et al., 2023) |
| Threshold routing | Tokens routed only when a confidence threshold is met | Variable compute per token; saves FLOPs on easy tokens |
| Auxiliary-loss-free | Bias terms updated in place to balance load | No interference gradients; used in DeepSeek-V3 |
The distinction between sparse and dense MoE is fundamental to understanding modern implementations.
In a dense MoE, every expert processes every input, and their outputs are combined using the full gating weights. This is mathematically equivalent to the original 1991 formulation. Dense MoE does not save computation, since all experts run on every input, but it can still benefit from specialization through the gating weights. Soft MoE is a recent variant where every input slot interacts with every expert through learned mixing weights, used primarily in vision.
In a sparse MoE, only a small subset of experts (typically 1, 2, 4, or 8 out of 8 to 384+) is activated per input token. This is the dominant form in modern LLMs because it decouples model capacity (total parameters) from computational cost (active parameters per token). A model with 671 billion total parameters such as DeepSeek-V3 might activate only 37 billion per token; Kimi K2 activates 32 billion out of 1 trillion.
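The gap between capacity and per-token compute can be checked with simple arithmetic, using the parameter counts quoted above. The assumption of 2 bytes per parameter (16-bit weights) is made only for this example, and the memory figure covers weights alone.

```python
def active_fraction(total_params, active_params):
    """Share of the model's parameters used for a single token."""
    return active_params / total_params

def weight_memory_tb(total_params, bytes_per_param=2):
    """Approximate weight storage in terabytes (assumes 16-bit weights by default)."""
    return total_params * bytes_per_param / 1e12

models = {
    "DeepSeek-V3": (671e9, 37e9),
    "Kimi K2": (1e12, 32e9),
}
for name, (total, active) in models.items():
    print(f"{name}: {active_fraction(total, active):.1%} of parameters active per token, "
          f"about {weight_memory_tb(total):.2f} TB of weights to hold in memory")
```

Compute scales with the small active fraction, while memory scales with the full parameter count. This is why sparse models are fast per token but still demanding to host.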
Key trade-offs between the two approaches:
| Property | Dense MoE | Sparse MoE |
|---|---|---|
| Computation per token | Proportional to total parameters | Proportional to active parameters only |
| Memory requirement | All parameters loaded and used for every input | Must load all parameters despite sparse activation |
| Expert specialization | Soft (weighted combination) | Hard (only selected experts participate) |
| Load balancing | Not an issue | Requires explicit balancing mechanisms |
| Backward pass | Smooth gradients | Non-differentiable top-k requires straight-through estimators or surrogate losses |
| Scaling potential | Limited by compute | Can scale to trillions of parameters |
| Suitability for vision | Common (Soft MoE) | Common (V-MoE) |
| Suitability for LLMs | Rare in production | Dominant in 2024 to 2026 |
Load balancing is one of the most significant practical challenges in training sparse MoE models. Without intervention, routers tend to converge toward sending most tokens to a few "popular" experts while ignoring others, a failure mode called routing collapse or expert collapse.
Routing collapse creates a self-reinforcing cycle: popular experts receive more training signal, which makes them better, which causes the router to favor them even more. Meanwhile, ignored experts receive little to no gradient updates and remain undertrained. This defeats the purpose of having multiple experts. Empirically, models that suffer routing collapse converge to behave like dense models with a fraction of their advertised capacity.
The most common solution is an auxiliary (or load-balancing) loss added to the training objective. The Switch Transformer formulation uses:
L_aux = alpha * N * sum_i(f_i * P_i)
where f_i is the fraction of tokens dispatched to expert i, P_i is the fraction of the router's probability allocated to expert i, and alpha is a hyperparameter controlling the strength of the balancing signal. This loss is minimized when all experts receive equal token allocations.
The hyperparameter alpha requires careful tuning. If set too high, the auxiliary loss dominates the training signal and forces artificial uniformity, degrading model quality. If set too low, it fails to prevent collapse. In practice, values between 0.001 and 0.01 are typical for production training.
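The auxiliary loss above can be computed directly from the router's per-token probabilities. The sketch below assumes top-1 dispatch, as in the Switch Transformer. The function name and the two toy batches are illustrative only.

```python
def switch_aux_loss(router_probs, alpha=0.01):
    """L_aux = alpha * N * sum_i(f_i * P_i) for top-1 routing.

    router_probs[t][i] is the softmax probability of expert i for token t.
    """
    n_tokens, n_experts = len(router_probs), len(router_probs[0])
    counts = [0] * n_experts
    for probs in router_probs:
        counts[max(range(n_experts), key=lambda i: probs[i])] += 1
    f = [c / n_tokens for c in counts]                      # fraction of tokens dispatched
    p = [sum(probs[i] for probs in router_probs) / n_tokens  # mean router probability
         for i in range(n_experts)]
    return alpha * n_experts * sum(fi * pi for fi, pi in zip(f, p))

balanced = [[0.6, 0.4], [0.4, 0.6], [0.6, 0.4], [0.4, 0.6]]
collapsed = [[0.9, 0.1]] * 4

print(switch_aux_loss(balanced))   # loss at the balanced optimum (alpha * 1)
print(switch_aux_loss(collapsed))  # larger: every token goes to expert 0
```

With perfectly uniform routing the sum equals 1/N, so the loss bottoms out at alpha. Any concentration of both tokens and probability on the same expert pushes it higher.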
Introduced in the ST-MoE paper (Zoph et al., 2022, arXiv:2202.08906), the router z-loss penalizes large logits entering the gating network:
L_z = (1/B) * sum_b (log sum_i exp((x_b * W_g)_i))^2
Large logits create sharp probability distributions that are numerically unstable (especially in lower-precision training such as bfloat16 and FP8) and tend to cause routing collapse. By keeping logits small, the z-loss stabilizes training without hurting model quality. The ST-MoE paper identified router logit growth as the primary cause of training instabilities in large-scale MoE models, and z-loss has since been adopted in essentially every production MoE training framework.
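A small sketch of the z-loss, assuming the router logits for a batch are available as a list of lists. The function name is hypothetical. The log-sum-exp is computed in the usual stabilized form.

```python
import math

def router_z_loss(logits_batch):
    """Mean over tokens of (log sum_i exp(logit_i))^2."""
    total = 0.0
    for logits in logits_batch:
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in logits))
        total += log_sum_exp ** 2
    return total / len(logits_batch)

small = router_z_loss([[0.0, 0.0]])
large = router_z_loss([[10.0, 10.0]])
print(small, large)
```

Both inputs give the same routing probabilities (a 50/50 split), yet the second is penalized far more. The loss targets the magnitude of the logits rather than the routing decision itself.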
DeepSeek introduced an alternative approach, adopted in DeepSeek-V3, that balances load without relying on an auxiliary loss (DeepSeek-AI, "Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts," arXiv:2408.15664). Instead, a bias term b_i is added to each expert's gating logit before the top-k selection:
score_i = (x * W_g)_i + b_i
This bias is adjusted dynamically during training: when an expert is underutilized, its bias is increased, making it more likely to be selected; when overutilized, the bias is decreased. Critically, the bias is not part of the gating weight that gets multiplied into the expert output; it only affects the discrete top-k selection. This approach avoids the interference gradients that auxiliary losses introduce and has been credited with raising the upper bound of MoE model quality. DeepSeek-V3 reports keeping a balanced load throughout its full pre-training without dropping any tokens.
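The feedback loop can be illustrated with a toy simulation. Everything here is an assumption made for the sketch: the random logits, the fixed skew toward expert 0, the step size, and the sign-based update rule (raise the bias of experts below the mean load, lower it for experts above). Only the selection uses the bias; gate weights, omitted here, would come from the unbiased scores.

```python
import random

def select_experts(logits, biases, k):
    """Pick the top-k experts by logit + bias (the bias affects selection only)."""
    biased = [l + b for l, b in zip(logits, biases)]
    return sorted(range(len(logits)), key=lambda i: biased[i], reverse=True)[:k]

def update_biases(biases, loads, step):
    """Nudge each bias toward balance based on the last batch's load."""
    mean_load = sum(loads) / len(loads)
    updated = []
    for bias, load in zip(biases, loads):
        if load < mean_load:
            updated.append(bias + step)   # underused: make selection more likely
        elif load > mean_load:
            updated.append(bias - step)   # overused: make selection less likely
        else:
            updated.append(bias)
    return updated

def simulate(steps=100, n_tokens=512, n_experts=4, k=1, step=0.05, seed=0):
    rng = random.Random(seed)
    biases = [0.0] * n_experts
    history = []
    for _ in range(steps):
        loads = [0] * n_experts
        for _ in range(n_tokens):
            logits = [rng.gauss(0.0, 1.0) for _ in range(n_experts)]
            logits[0] += 1.0  # artificial skew: the router favors expert 0
            for i in select_experts(logits, biases, k):
                loads[i] += 1
        history.append(loads)
        biases = update_biases(biases, loads, step)
    return history, biases

history, biases = simulate()
print("first batch loads:", history[0])
print("last batch loads: ", history[-1])
print("final biases:     ", biases)
```

Expert 0 starts out heavily overloaded. Its bias is pushed down until the loads even out, without adding any gradient term to the training loss.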
Qwen 3 (Alibaba, 2025) introduced global-batch load balancing, which computes the load-balancing signal over the entire global batch rather than each micro-batch. This produces a smoother target and, the Qwen team reports, encourages stronger expert specialization. Combined with the absence of shared experts in Qwen3, this approach was credited with the model's strong scaling behavior up to 235 billion total parameters.
Expert capacity sets a hard limit on how many tokens a single expert can process in a given batch. The capacity is typically computed as:
Expert Capacity = (tokens_per_batch / number_of_experts) * capacity_factor
The capacity factor is a hyperparameter, usually set between 1.0 and 2.0. A factor of 1.0 means each expert can handle exactly its "fair share" of tokens, with no buffer for imbalance. Switch Transformers found that a capacity factor of 1.0 to 1.25 worked well in practice. Higher factors waste compute on padding; lower factors increase the number of dropped tokens.
When an expert reaches capacity, additional tokens routed to it are dropped. These dropped tokens skip the expert computation and instead pass through a residual connection unchanged. Research has shown that up to about 11% of tokens can be dropped this way without significant degradation in model quality, but more aggressive dropping causes noticeable harm.
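The capacity formula and the dropping behavior can be sketched together. Rounding the capacity up and serving tokens in first-come-first-served order are assumptions of this example, as are the function names and the toy routing decisions.

```python
import math

def expert_capacity(tokens_per_batch, num_experts, capacity_factor):
    """(tokens_per_batch / num_experts) * capacity_factor, rounded up (assumed)."""
    return math.ceil(tokens_per_batch / num_experts * capacity_factor)

def dispatch(assignments, num_experts, capacity):
    """Split token indices into kept and dropped under a per-expert capacity.

    assignments[t] is the expert chosen for token t (top-1 routing).
    Dropped tokens would skip the expert and use only the residual path.
    """
    load = [0] * num_experts
    kept, dropped = [], []
    for token, expert in enumerate(assignments):
        if load[expert] < capacity:
            load[expert] += 1
            kept.append(token)
        else:
            dropped.append(token)
    return kept, dropped

# Hypothetical routing of 8 tokens over 4 experts; expert 0 is oversubscribed.
assignments = [0, 0, 0, 1, 1, 2, 3, 0]
for factor in (1.0, 1.5, 2.0):
    cap = expert_capacity(len(assignments), 4, factor)
    kept, dropped = dispatch(assignments, 4, cap)
    print(f"capacity_factor={factor}: capacity={cap}, dropped tokens={dropped}")
```

Raising the capacity factor reduces drops, but it also reserves slots on the other experts that stay empty. This is the padding cost mentioned above.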
The MegaBlocks library (Gale et al., 2022, arXiv:2211.15841) introduced dropless MoE, which avoids token dropping entirely by reformulating MoE computation as block-sparse matrix multiplication. Custom GPU kernels handle variable numbers of tokens per expert, eliminating both wasted compute on padding and quality loss from dropped tokens. DBRX, Mixtral, and most subsequent open MoE models adopt the dropless approach.
GShard, by Dmitry Lepikhin, HyoukJoong Lee, Noam Shazeer, and colleagues at Google (ICLR 2021, arXiv:2006.16668), was the first system to scale MoE transformers beyond 600 billion parameters. It focused on multilingual neural machine translation, training a model on 2,048 TPU v3 accelerators in four days at a total cost of 22 TPU v3 core-years. By comparison, training 100 separate bilingual baselines would have cost 235.5 TPU v3 core-years and produced lower quality (36.9 vs. 44.3 average BLEU). GShard used top-2 expert routing and introduced position-based random routing for the second expert to improve load balancing. The paper also contributed a set of sharding annotation APIs and XLA compiler extensions for distributing MoE models across devices, becoming a foundational systems contribution.
William Fedus, Barret Zoph, and Noam Shazeer at Google proposed the Switch Transformer (JMLR 23, 2022, arXiv:2101.03961), which simplified MoE routing by using top-1 expert selection instead of top-2. The largest Switch Transformer had 1.6 trillion parameters distributed across 2,048 experts. Despite this extreme sparsity, it achieved up to 7x speedup in pre-training over dense T5 models using the same computational budget. The paper also validated, for the first time, that large sparse MoE models could be trained in lower-precision bfloat16 format. The authors used selective precision (router in float32, experts in bfloat16), a technique still standard in 2026.
Carlos Riquelme, Joan Puigcerver, Basil Mustafa, and others at Google Brain published "Scaling Vision with Sparse Mixture of Experts" (NeurIPS 2021, arXiv:2106.05974). V-MoE replaced a subset of dense feedforward layers in Vision Transformers (ViT) with sparse MoE layers, with each image patch routed to a subset of experts. A 15-billion-parameter V-MoE with 24 MoE layers (out of 48 blocks) reached 90.35% top-1 ImageNet accuracy after fine-tuning. The paper also introduced batch prioritized routing, which prioritized subsets of inputs across the entire batch to enable adaptive per-image compute.
Google's Generalist Language Model (GLaM) by Du, Huang, Dai, et al. (ICML 2022, arXiv:2112.06905) scaled to 1.2 trillion total parameters with 64 experts per MoE layer, activating about 97 billion parameters per token (roughly 8% of total). GLaM used 1/3 the energy of GPT-3 for training (456 MWh vs. 1,287 MWh) and half the inference FLOPs, while achieving better zero-shot and one-shot performance across 29 NLP benchmarks. GLaM placed MoE layers on every other transformer block rather than every block.
ST-MoE by Zoph, Bello, Kumar, Du, Huang, Dean, Shazeer, and Fedus (arXiv:2202.08906) addressed training instability and fine-tuning quality issues that had limited sparse models on transfer learning. The 269-billion-parameter ST-MoE-32B model (matching the FLOPs of a 32-billion-parameter dense encoder-decoder) was the first sparse model to achieve state-of-the-art performance on a diverse set of transfer tasks including reasoning, summarization, closed-book QA, and adversarial benchmarks. The router z-loss introduced in this paper became a near-universal component of subsequent MoE training pipelines.
Two systems papers in 2022 made large-scale MoE training and inference practical. DeepSpeed-MoE (Rajbhandari et al., ICML 2022, arXiv:2201.05596) at Microsoft provided an end-to-end training and inference solution with novel architecture designs and compression techniques that reduced MoE model size by up to 3.7x and offered 4.5x faster, 9x cheaper inference compared to quality-equivalent dense models. Tutel (also Microsoft) optimized the all-to-all communication primitive specifically for MoE routing, with adaptive pipelining and a 2-dimensional hierarchical (2DH) all-to-all algorithm, accelerating Meta's 1.1 trillion–parameter MoE model by more than 40% on 64 NDm A100 v4 nodes.
Mistral AI released Mixtral 8x7B in December 2023 and Mixtral 8x22B in April 2024, both as open-weights models under the Apache 2.0 license. The technical report ("Mixtral of Experts," arXiv:2401.04088) was published in January 2024.
Mixtral 8x7B shares the same backbone as Mistral 7B but replaces each FFN layer with 8 expert FFNs. A router selects 2 experts per token per layer, applying softmax only over the top-2 chosen experts (rather than over all 8 before top-k). The model has 46.7 billion total parameters with 12.9 billion active per token. It outperformed or matched Llama 2 70B and GPT-3.5 across evaluated benchmarks despite using significantly fewer active parameters, and it was faster than any dense 70B model.
Mixtral 8x22B scaled this design up to 141 billion total parameters with 39 billion active, extended the context window to 65,536 tokens, and added native support for function calling. It strongly outperformed Llama 2 70B on French, German, Spanish, and Italian benchmarks (HellaSwag, ARC Challenge, MMLU).
Databricks released DBRX in March 2024 with a "fine-grained" MoE approach. Instead of the conventional 8-expert, choose-2 design, DBRX uses 16 experts and activates 4 per token, giving 65 times more possible expert combinations compared to 8-choose-2, which the authors found improved model quality. DBRX has 132 billion total parameters with 36 billion active, was pre-trained on 12 trillion tokens with a 32K context length, and uses rotary position encodings, gated linear units, and grouped query attention. It employs dropless MoE routing via the MegaBlocks library and was trained on 3,072 NVIDIA H100 GPUs connected via 3.2 Tbps InfiniBand.
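The "65 times" figure follows from counting unordered expert subsets, which can be verified with the standard library:

```python
from math import comb

fine_grained = comb(16, 4)   # ways to choose 4 of 16 experts
conventional = comb(8, 2)    # ways to choose 2 of 8 experts
ratio = fine_grained // conventional
print(fine_grained, conventional, ratio)
```

The counts are 1,820 and 28 combinations respectively, and their ratio is exactly 65.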
xAI open-sourced Grok-1 on March 17, 2024, under the Apache 2.0 license. Pre-training had concluded in October 2023. Grok-1 has 314 billion total parameters with 8 experts and top-2 selection, activating roughly 25% of weights per token. The architecture uses 64 layers, 48 attention heads for queries and 8 for keys and values, an embedding size of 6,144, and supports 8-bit quantization. One notable difference from Mixtral is in the routing: Grok-1 applies top-2 selection after a softmax over all 8 experts, whereas Mixtral applies softmax only over the top-2 selected experts. At release, Grok-1 was the largest open-weights model.
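The two routing orders described above can be compared side by side. This sketch assumes that the softmax-then-top-k variant uses the selected probabilities as they are, without renormalizing. That detail is not stated in the text and is an assumption of the example, as are the function names and toy logits.

```python
import math

def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_indices(values, k):
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]

def top_k_then_softmax(logits, k):
    """Softmax over the selected logits only (the order described for Mixtral)."""
    chosen = top_k_indices(logits, k)
    weights = softmax([logits[i] for i in chosen])
    return dict(zip(chosen, weights))

def softmax_then_top_k(logits, k):
    """Softmax over all experts, then keep the k largest (the order described for Grok-1)."""
    probs = softmax(logits)
    return {i: probs[i] for i in top_k_indices(probs, k)}

logits = [2.0, 1.0, 0.0, 0.0]
a = top_k_then_softmax(logits, 2)
b = softmax_then_top_k(logits, 2)
print(a, sum(a.values()))
print(b, sum(b.values()))
```

Both orders select the same experts and keep the same ratio between their weights. Under the stated assumption they differ in scale: the first always sums to 1, while the second sums to less than 1 whenever the unselected experts hold some probability mass.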
Snowflake released Arctic on April 24, 2024 (Apache 2.0). Arctic combines a 10-billion-parameter dense transformer with a residual 128-by-3.66-billion MoE MLP, totaling 480 billion parameters with 17 billion active, chosen via top-2 gating. The 128-expert design produces a fine-grained MoE optimized for enterprise tasks (SQL, code generation). Snowflake reported up to 4x fewer memory reads than Code-Llama 70B and 2.5x fewer than Mixtral 8x22B, leading to faster inference.
DeepSeek-V2 ("A Strong, Economical, and Efficient Mixture-of-Experts Language Model," arXiv:2405.04434, May 2024) has 236 billion total parameters with 21 billion activated per token and a 128K context length. It introduced two architectural innovations that became influential: Multi-head Latent Attention (MLA), which compresses the KV cache into a low-rank latent vector and reduces KV cache size by 93.3%, and the production-scale DeepSeekMoE design with 2 shared experts and 160 routed experts (6 activated per token), each with a hidden dimension of 1,536. Compared to DeepSeek 67B, V2 achieved better quality with 42.5% lower training cost and 5.76x higher inference throughput.
The DeepSeekMoE paper (Dai et al., arXiv:2401.06066, January 2024) formalized two principal strategies that have shaped MoE design ever since: fine-grained expert segmentation (the hidden dimension of each expert is reduced while the number of experts is multiplied, enabling more flexible combinations) and shared expert isolation (a small set of experts is always active for every token, capturing common knowledge and reducing redundancy in routed experts). DeepSeekMoE 2B achieved quality comparable to GShard 2.9B, which has 1.5x the expert parameters and computation.
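DeepSeek-V3 (DeepSeek-AI, "DeepSeek-V3 Technical Report," arXiv:2412.19437) has 671 billion total parameters with 37 billion active per token. It uses 256 routed experts plus 1 shared expert, with the top 8 routed experts activated per token. Key contributions include auxiliary-loss-free load balancing, FP8 mixed-precision training, and a complete training run that the report puts at 2.788 million H800 GPU hours.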
DeepSeek-V3 reports zero token drops throughout training and inference, made possible by the combination of fine-grained experts, shared experts, and bias-based balancing.
Google DeepMind announced Gemini 1.5 Pro in February 2024 as a sparse mixture-of-experts transformer with multimodal inputs and a 1-million-token context window (extended in research previews to 10 million). The exact expert and active parameter counts have not been disclosed, but Jeff Dean publicly traced its lineage to "a long line of Google research efforts on sparse models" starting with Shazeer et al. 2017. Gemini 1.5 was the first widely available production frontier model confirmed to use MoE.
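Meta released the Llama 4 herd on April 5, 2025, marking the first Llama generation to use mixture-of-experts. The herd consists of three models: Llama 4 Scout (109 billion total parameters, 17 billion active, 16 experts, 10-million-token context), Llama 4 Maverick (400 billion total, 17 billion active, 128 experts, alternating dense and MoE layers), and Llama 4 Behemoth (roughly 2 trillion total, 288 billion active, 16 experts), a teacher model that was still in training at the time of the announcement.
All Llama 4 models use top-1 routing and native multimodality with early fusion of text and images, and were pre-trained on more than 30 trillion tokens.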
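Alibaba's Qwen team has released several MoE generations. Qwen1.5-MoE (2024) was upcycled from a dense checkpoint and has 14.3 billion total parameters with 2.7 billion active, using 60 routed experts (4 active per token) plus 4 shared experts. Qwen3-235B-A22B (2025) has 235 billion total parameters with 22 billion active, routing each token to 8 of 128 experts with no shared experts.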
Moonshot AI released Kimi K2 in mid-2025 as a 1-trillion-parameter MoE model with 32 billion active parameters. It uses 384 experts with 8 active per token and a 128K context window. Kimi K2 was pre-trained on 15.5 trillion tokens using the Muon optimizer at unprecedented scale, with the team reporting zero training instability after a custom set of optimizer modifications. The model is positioned around agentic intelligence, including extended reasoning and tool use.
Mistral AI's Mistral Large 3 (released 2025) was the company's first frontier-class MoE, with 41 billion active parameters out of 675 billion total. The shift from the dense Mistral Large 2 (123B dense) signaled that even labs that had stuck with dense designs were converging on sparse architectures for frontier work.
While OpenAI has not officially confirmed the architecture of GPT-4, multiple sources have reported that it uses an MoE design. A widely cited 2023 analysis by Dylan Patel and Gerald Wong at SemiAnalysis described GPT-4 as approximately 1.76 trillion total parameters across 16 experts of approximately 111 billion MLP parameters each, with 2 experts routed per forward pass. An earlier informal claim by George Hotz described 8 experts of 220 billion parameters each. These reports were partly corroborated by Soumith Chintala, co-creator of PyTorch, but remain unconfirmed by OpenAI.
AI21 Labs' Jamba is a hybrid architecture that combines transformer layers, Mamba (structured state space model) layers, and MoE layers (arXiv:2403.19887). It has 52 billion total parameters with 12 billion active, and offers a 256K context window. Roughly one in every eight layers uses a transformer attention mechanism; the rest use Mamba, with MoE layers interleaved. This hybrid approach reduces the memory footprint compared to a pure transformer of similar capacity.
MoE models are more prone to training instability than dense models, particularly at large scale. Sources of instability include:
Practical stabilization techniques include using full precision (float32) for the router even when experts run in bfloat16 or FP8, adding router z-loss, carefully tuning the auxiliary loss coefficient (or moving to bias-based balancing), gradient clipping, and warming up the auxiliary loss over the first few thousand steps.
Sparse MoE models are more susceptible to overfitting during fine-tuning than dense models of comparable active parameter count. This happens because MoE models have far more total parameters, but each parameter sees fewer training examples (since each expert only processes a fraction of tokens). Strategies to mitigate this include:
In expert parallelism, every MoE layer requires two all-to-all communications: one to dispatch tokens to the GPUs holding their assigned experts, and one to combine the results back. Research has shown that all-to-all communication can consume more than 40% of total runtime in large-scale MoE training, and up to 59.2% of forward-pass latency in the MoE layers on an 8-GPU server running DeepSeek-V2-Lite. For inference, all-to-all can contribute 10 to 30% of end-to-end latency, especially during decoding, where each token's hidden state must hop between GPUs at every MoE layer. Optimizing this communication is a major focus of systems research; representative techniques include 2DH all-to-all, fused communication-computation kernels, and sub-chunk pipelining.
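The dispatch and combine steps can be illustrated on a single machine, with no real network traffic. The following is a minimal sketch in plain Python, assuming top-1 routing and a hypothetical placement rule in which expert `e` lives on device `e % num_devices`. The function names are illustrative and are not taken from any library.

```python
from collections import defaultdict

def dispatch(tokens, expert_ids, num_devices):
    """Group (position, token, expert) triples by the device holding the expert."""
    buckets = defaultdict(list)
    for pos, (tok, e) in enumerate(zip(tokens, expert_ids)):
        buckets[e % num_devices].append((pos, tok, e))  # assumed placement rule
    return buckets

def run_experts(buckets, experts):
    """Each device applies its local experts to the tokens it received."""
    return {dev: [(pos, experts[e](tok)) for pos, tok, e in items]
            for dev, items in buckets.items()}

def combine(results, num_tokens):
    """Send outputs back and restore the original token order."""
    out = [None] * num_tokens
    for items in results.values():
        for pos, y in items:
            out[pos] = y
    return out

# Toy experts: expert e multiplies its input by (e + 1).
experts = [lambda x, s=e + 1: x * s for e in range(4)]
tokens = [1.0, 2.0, 3.0, 4.0, 5.0]
expert_ids = [3, 0, 1, 3, 2]  # router decisions (top-1), chosen by hand

buckets = dispatch(tokens, expert_ids, num_devices=2)
outputs = combine(run_experts(buckets, experts), len(tokens))
print(outputs)
```

In a real system the two grouping steps are the all-to-all exchanges; here they are ordinary dictionary operations, which is why the sketch says nothing about latency.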
Research has revealed that experts in encoder models tend to develop token-level specialization. Certain experts may specialize in punctuation, proper nouns, or specific syntactic patterns. In decoder models, specialization is less interpretable; some experts appear to handle particular topical domains, others activate on rare tokens, and many appear functionally redundant in early training. Specialization typically sharpens over training, especially after the auxiliary loss is reduced.
Expert specialization collapse occurs when experts become functionally redundant, all learning similar representations instead of specializing. This negates the benefit of having multiple experts and is distinct from routing collapse (where experts are ignored entirely). Fine-grained segmentation, shared experts, and stronger regularization on the router are the most commonly cited remedies.
A key challenge for MoE inference is that, despite only activating a subset of experts per token, all expert parameters must be loaded into memory for fast access. This means MoE models have the same memory footprint as a dense model of equal total parameter count, even though they use far fewer FLOPs per token. For example, Mixtral 8x7B requires loading all 46.7 billion parameters into VRAM even though only 12.9 billion are active per token; DeepSeek-V3 requires loading 671 billion parameters even though only 37 billion are active.
Production deployments of large MoE models routinely require 8 or more GPUs with 80 GB each simply to load the model before serving any traffic. Llama 4 Maverick at 400 billion total parameters requires roughly 800 GB in 16-bit precision; DeepSeek-V3 at 671 billion fits in roughly 720 GB after FP8 packing.
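The gap between loaded and active parameters is simple arithmetic. The sketch below uses the parameter counts quoted above and counts raw weight bytes only (1 GB taken as 10^9 bytes), so it will not reproduce figures that include other overheads such as the KV cache or activations. The helper names are illustrative.

```python
def weight_memory_gb(total_params, bytes_per_param):
    """Raw weight storage in GB (1 GB = 1e9 bytes); ignores KV cache and activations."""
    return total_params * bytes_per_param / 1e9

def active_fraction(active_params, total_params):
    """Share of the weights that participate in computing one token."""
    return active_params / total_params

models = {
    # name: (total parameters, active parameters), as quoted in the text
    "Mixtral 8x7B": (46.7e9, 12.9e9),
    "DeepSeek-V3": (671e9, 37e9),
}
for name, (total, active) in models.items():
    print(f"{name}: {weight_memory_gb(total, 2):.0f} GB at 16-bit, "
          f"{weight_memory_gb(total, 1):.0f} GB at 8-bit, "
          f"{active_fraction(active, total):.1%} of weights active per token")

maverick_gb = weight_memory_gb(400e9, 2)  # Llama 4 Maverick at 16-bit precision
print(f"Llama 4 Maverick: {maverick_gb:.0f} GB at 16-bit")
```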
Expert parallelism (EP) is a distribution strategy designed specifically for MoE models. Different experts are placed on different GPUs, and tokens are routed to the GPU holding their assigned expert via all-to-all communication. Non-MoE layers (such as attention) are handled via standard data or tensor parallelism.
This can be combined with other parallelism strategies:
| Parallelism type | What is distributed | Applicability |
|---|---|---|
| Data parallelism | Different batches across devices | All model types |
| Tensor parallelism | Individual layer weights split across devices | Large layers |
| Pipeline parallelism | Different layers on different devices | Deep models |
| Expert parallelism | Different experts on different devices | MoE models specifically |
| Context parallelism | Different parts of long sequences across devices | Long-context models |
NVIDIA's work on wide expert parallelism with GB200 NVL72 systems showed up to 1.8x higher per-GPU throughput compared to smaller expert-parallel configurations, by leveraging fewer experts per GPU and higher arithmetic intensity inside the high-bandwidth NVLink domain (130 TB/s coherent NVLink). Engineering teams at Meta have published case studies on combining tensor, context, and expert parallelism for serving large MoE models efficiently.
Quantization is particularly effective for MoE models because the memory savings are amplified by the large total parameter count. QMoE (Frantar and Alistarh, MLSys 2024, arXiv:2310.16795) demonstrated compression of a 1.6-trillion-parameter Switch Transformer from 3.2 TB to less than 160 GB at less than 1 bit per parameter, with only minor accuracy loss, in less than a day on a single GPU. With QMoE, the 1.6-trillion-parameter Switch Transformer could run on a single server with 4x NVIDIA A6000 GPUs at less than 5% runtime overhead relative to ideal uncompressed inference. FP8 weight quantization (used natively by DeepSeek-V3) and 4-bit AWQ or GPTQ quantization (used by community Mixtral builds) are also widely deployed.
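The QMoE figures can be checked with a short calculation. This sketch assumes 16 bits per parameter for the uncompressed model and 1 GB = 10^9 bytes; it derives the bits per parameter implied by a 160 GB footprint and does not model the compression scheme itself.

```python
def size_gb(num_params, bits_per_param):
    """Storage in GB for a given parameter count and bit width."""
    return num_params * bits_per_param / 8 / 1e9

params = 1.6e12                      # Switch Transformer parameter count from the text
bf16_gb = size_gb(params, 16)        # uncompressed size at 16 bits per parameter
implied_bits = 160e9 * 8 / params    # bits per parameter if the model occupies 160 GB
ratio = bf16_gb / 160                # compression ratio relative to 16-bit storage

print(f"{bf16_gb / 1000:.1f} TB uncompressed, {implied_bits:.2f} bits/param, {ratio:.0f}x smaller")
```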
For deployment on devices with limited GPU memory, expert offloading stores inactive expert weights in CPU memory and loads them to the GPU on demand. Pre-gated MoE takes this further by predicting which experts will be needed ahead of time and prefetching their weights, enabling single-GPU deployment of large MoE models at the cost of additional latency from CPU-GPU transfer. Open-source tools such as llama.cpp implement aggressive expert offloading to enable Mixtral 8x7B and DBRX inference on consumer GPUs with as little as 24 GB of VRAM.
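The basic offloading idea can be sketched as a small cache. The example below is an illustration only: it assumes a least-recently-used eviction policy and represents weights as plain Python objects, and it is not a description of how llama.cpp or Pre-gated MoE implement offloading. The class name `ExpertCache` is hypothetical.

```python
from collections import OrderedDict

class ExpertCache:
    """Keeps at most `capacity` experts "on the GPU"; the rest stay in CPU memory."""

    def __init__(self, cpu_experts, capacity):
        self.cpu_experts = cpu_experts  # expert_id -> weights held in host memory
        self.capacity = capacity
        self.gpu = OrderedDict()        # expert_id -> weights held in device memory
        self.loads = 0                  # number of simulated CPU-to-GPU transfers

    def get(self, expert_id):
        if expert_id in self.gpu:
            self.gpu.move_to_end(expert_id)      # mark as most recently used
        else:
            if len(self.gpu) >= self.capacity:
                self.gpu.popitem(last=False)     # evict the least recently used expert
            self.gpu[expert_id] = self.cpu_experts[expert_id]  # simulated transfer
            self.loads += 1
        return self.gpu[expert_id]

cache = ExpertCache({e: f"weights-{e}" for e in range(8)}, capacity=2)
for e in [0, 1, 0, 2, 0, 1]:   # experts requested by the router, in order
    cache.get(e)
print(cache.loads, list(cache.gpu))
```

Each miss stands in for a CPU-GPU transfer, which is where the extra latency comes from; prefetching schemes try to issue those transfers before the expert is needed.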
MoE models can be distilled into smaller dense models that retain 30 to 40% of the MoE's quality advantage over a comparably sized dense baseline. Research has also shown that sentence-level or task-level routing can be used to extract specialized sub-networks from a trained MoE for targeted deployment. The Llama 4 Behemoth model is reported to be used primarily as a teacher for distilling Scout and Maverick.
The following table summarizes the practical trade-offs between MoE and dense model architectures.
| Dimension | MoE models | Dense models |
|---|---|---|
| Pre-training speed | Faster (4 to 7x for equivalent quality) | Slower |
| Total parameters | Very large (100B to 2T+) | Moderate (7B to 540B typically) |
| Active parameters per token | Small fraction of total | All parameters |
| Inference FLOPs per token | Lower for given quality level | Higher |
| VRAM requirement | High (must load all experts) | Proportional to parameter count |
| Training stability | Requires careful tuning (auxiliary loss, z-loss) | Generally more stable |
| Fine-tuning | Prone to overfitting; benefits from instruction tuning | More straightforward |
| Knowledge-intensive tasks | Generally stronger | Depends on size |
| Reasoning tasks | Mixed results historically; recent MoEs (DeepSeek-V3, Kimi K2) close the gap | Often stronger at similar active parameter count |
| Deployment complexity | Higher (expert parallelism, large memory) | Lower |
| Energy efficiency | Better (less compute for similar quality) | Worse |
| Edge / on-device | Difficult (memory) | Better suited |
The general MoE output for an input x is:
y = sum_{i=1}^{N} g(x)_i * E_i(x)
where N is the number of experts, E_i is the i-th expert network, and g(x) is the gating function.
For sparse top-k routing, the gating function becomes:
g(x) = Softmax(TopK(H(x), k))
where:
H(x)_i = (x * W_g)_i + epsilon_i * Softplus((x * W_noise)_i)
and epsilon_i is sampled from a standard normal distribution. The TopK function retains only the k largest values and sets the rest to negative infinity before applying softmax. In Mixtral-style routing, the softmax is applied only over the top-k retained values; in Grok-1-style routing, it is applied over all N values before retaining the top-k.
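The gating equations can be made concrete with a small plain-Python sketch. It assumes scalar toy experts and hand-picked logits rather than a trained router, and the function names are illustrative. It implements the noisy logits H(x), the two softmax orderings described above, and the weighted sum y.

```python
import math
import random

def softmax(values):
    """Numerically stable softmax; entries equal to -inf get probability 0."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_indices(values, k):
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]

def noisy_logits(clean, noise_raw, rng):
    """H(x)_i = clean_i + eps_i * softplus(noise_raw_i), with eps_i ~ N(0, 1)."""
    return [c + rng.gauss(0.0, 1.0) * math.log1p(math.exp(n))
            for c, n in zip(clean, noise_raw)]

def gate_topk_then_softmax(logits, k):
    """Mask all but the top-k logits to -inf, then apply softmax."""
    keep = set(top_k_indices(logits, k))
    return softmax([v if i in keep else -math.inf for i, v in enumerate(logits)])

def gate_softmax_then_topk(logits, k):
    """Apply softmax over all N logits, then keep the top-k probabilities."""
    probs = softmax(logits)
    keep = set(top_k_indices(probs, k))
    return [p if i in keep else 0.0 for i, p in enumerate(probs)]

def moe_output(x, gates, experts):
    """y = sum_i g(x)_i * E_i(x); experts with a zero gate are skipped."""
    return sum(g * expert(x) for g, expert in zip(gates, experts) if g > 0.0)

logits = [2.0, 1.0, 0.0, -1.0]                                   # (x * W_g), hand-picked
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]   # toy scalar experts

g_a = gate_topk_then_softmax(logits, k=2)   # weights over the top-2 sum to 1
g_b = gate_softmax_then_topk(logits, k=2)   # weights over the top-2 sum to less than 1
y = moe_output(10.0, g_a, experts)
h = noisy_logits(logits, [-50.0] * 4, random.Random(0))  # near-zero noise scale
print(g_a, g_b, y)
```

The two orderings select the same experts here but assign them different weights, since only the first renormalizes over the retained experts.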
The load-balancing auxiliary loss for N experts across a batch of T tokens is:
L_balance = alpha * N * sum_{i=1}^{N} f_i * P_i
where f_i = (number of tokens assigned to expert i) / T and P_i = (sum of router probabilities for expert i) / T.
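As a worked example of this loss, the sketch below evaluates it for a perfectly balanced router and for a fully collapsed one. It assumes top-1 assignments and a coefficient alpha = 0.01; the function name and the toy inputs are illustrative.

```python
def load_balance_loss(assignments, router_probs, num_experts, alpha=0.01):
    """alpha * N * sum_i f_i * P_i for top-1 assignments.

    assignments:  expert index chosen for each of the T tokens
    router_probs: one list of N router probabilities per token
    """
    T = len(assignments)
    f = [assignments.count(i) / T for i in range(num_experts)]
    P = [sum(p[i] for p in router_probs) / T for i in range(num_experts)]
    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, P))

# Balanced: each of 4 experts gets one token and uniform probability.
balanced = load_balance_loss([0, 1, 2, 3], [[0.25] * 4] * 4, num_experts=4)

# Collapsed: every token goes to expert 0 with probability 1.
collapsed = load_balance_loss([0, 0, 0, 0], [[1.0, 0.0, 0.0, 0.0]] * 4, num_experts=4)

print(balanced, collapsed)
```

With N experts the balanced case evaluates to alpha and the collapsed case to alpha * N, so the factor N keeps the balanced value independent of the number of experts.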
The router z-loss over a batch of B tokens (B plays the same role as T above) is:
L_z = (1 / B) * sum_{b=1}^{B} (log sum_{i=1}^{N} exp(H(x_b)_i))^2
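A short sketch shows how this penalty grows with the magnitude of the router logits. It assumes the logits are given as one list per token and uses hand-picked values; the function names are illustrative.

```python
import math

def logsumexp(logits):
    """Stable log(sum(exp(logits)))."""
    m = max(logits)
    return m + math.log(sum(math.exp(v - m) for v in logits))

def router_z_loss(batch_logits):
    """Mean over the batch of the squared log-sum-exp of each token's router logits."""
    return sum(logsumexp(row) ** 2 for row in batch_logits) / len(batch_logits)

small = router_z_loss([[0.0, 0.0, 0.0, 0.0]])      # (log 4)^2
large = router_z_loss([[10.0, 10.0, 10.0, 10.0]])  # (10 + log 4)^2
print(small, large)
```

Both inputs give the same routing probabilities after softmax, yet the second is penalized far more, which is how the term discourages large logits without changing the preferred routing.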
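The total training loss adds both auxiliary terms to the task loss. Because L_balance as defined above already includes the coefficient alpha, only the z-loss needs an explicit weight:
L_total = L_task + L_balance + beta * L_z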
with typical settings alpha = 0.001 to 0.01 and beta = 0.001.
For DeepSeek-V3-style auxiliary-loss-free balancing, the gating logits are augmented with a per-expert bias before top-k selection:
score_i = (x * W_g)_i + b_i
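The bias b_i is used only to decide which experts are selected; the gating weights that scale the selected experts' outputs are still computed from the unbiased scores. The bias is updated outside the gradient computation: at each step, b_i is decreased for over-utilized experts and increased for under-utilized ones, by a small fixed step size.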
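The update rule can be sketched in a few lines. This example assumes that "over-utilized" means receiving more tokens than the mean per-expert load in the current batch, and it uses a hypothetical step size of 0.001; neither detail is taken from the DeepSeek-V3 report.

```python
def update_biases(biases, tokens_per_expert, step=0.001):
    """Lower b_i for experts above the mean load, raise it for experts below."""
    mean_load = sum(tokens_per_expert) / len(tokens_per_expert)
    updated = []
    for b, load in zip(biases, tokens_per_expert):
        if load > mean_load:
            b -= step
        elif load < mean_load:
            b += step
        updated.append(b)
    return updated

def select_top_k(scores, biases, k):
    """Pick the k experts with the highest biased score (x * W_g)_i + b_i."""
    biased = [s + b for s, b in zip(scores, biases)]
    return sorted(range(len(biased)), key=lambda i: biased[i], reverse=True)[:k]

scores = [0.5, 0.4995, 0.1, 0.1]   # unbiased router scores for one token
biases = [0.0, 0.0, 0.0, 0.0]
before = select_top_k(scores, biases, k=1)

# Expert 0 received 10 tokens in the last batch against a mean load of 5.
biases = update_biases(biases, tokens_per_expert=[10, 2, 4, 4])
after = select_top_k(scores, biases, k=1)
print(before, after, biases)
```

No gradient flows through the bias in this scheme; it is a plain counter-driven adjustment, so it does not add a term to the training loss.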
While MoE is most widely associated with large language models, the architecture has been applied to other domains.
Several libraries and frameworks support MoE training and inference.
| Library | Organization | Features |
|---|---|---|
| MegaBlocks | Databricks (originally Stanford) | Block-sparse GPU kernels; dropless MoE; backbone of DBRX |
| DeepSpeed-MoE | Microsoft | Hybrid parallel training (data + tensor + expert); residual MoE; 4.5x faster inference vs. dense equivalents |
| Tutel | Microsoft | Optimized all-to-all; FP8/NVFP4/MXFP4 support; targets DeepSeek, Kimi K2, Qwen3 |
| FairScale and Fairseq | Meta | Sequence modeling framework with MoE support; used in NLLB-200 |
| Hugging Face Transformers | Hugging Face | Native MoE support since v4.36.0 (Mixtral); now covers DBRX, Mixtral, Qwen MoE, DeepSeek, Llama 4 |
| Megatron-LM | NVIDIA | Production-scale MoE with expert parallelism and tensor parallelism |
| vLLM and SGLang | UC Berkeley / community | High-throughput inference with MoE-specific optimizations |
| MergeKit | Charles Goddard / community | "FrankenMoE" upcycling from existing dense checkpoints |
| OpenMoE | Community | Community-built Llama-based MoE models |
Across the leading MoE LLMs of 2024 to 2026, several recurring design choices have stabilized.
| Choice | Most common in 2024 to 2026 | Notable exceptions |
|---|---|---|
| Router type | Top-k softmax over routed experts | Expert choice (research); top-1 (Switch, Llama 4) |
| Number of experts | 16 to 256 routed; 1 shared | DBRX: 16; Llama 4 Maverick: 128; Kimi K2: 384 |
| Active experts per token | 2, 4, or 8 | Llama 4 (1) |
| Shared experts | Common in DeepSeek-style designs | Qwen3 dropped them |
| Load balancing | Aux-loss-free (DeepSeek), aux loss + z-loss (others) | Global-batch (Qwen3) |
| Dropless | Standard | Earlier Switch and Mixtral allowed drops |
| Precision | bfloat16 or FP8 weights, float32 router | |
| Capacity factor | 1.0 to 1.25 (when capacity is enforced) | Dropless models avoid the issue |
The architectural convergence is striking. By 2026, "fine-grained MoE with shared experts and bias-based load balancing" had become the de facto recipe for sparse frontier models. DeepSeek and Kimi K2 follow this template closely, while Qwen3 departs from it in places (no shared experts, global-batch load balancing).
Several common misconceptions about MoE are worth addressing.
"MoE models are smaller than dense models." False. MoE models have far more total parameters than the dense models they compete with; they only have fewer active parameters per token. A MoE that activates 37 billion parameters per token from a 671-billion-parameter pool requires the full 671 billion to be loaded for fast inference.
"MoE models are 8 separate models." False. Each "expert" is a single FFN block, not a complete model. Routing is decided independently at each layer, so a single token typically passes through different experts at different layers. With 32 layers, 8 experts per layer, and one expert chosen per layer, each token traces one of 8^32 possible expert combinations; with top-2 routing, as in Mixtral, each layer offers C(8,2) = 28 possible pairs, giving 28^32 combinations.
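The combination count is easy to verify. The sketch below counts one choice per layer, where a choice is an unordered set of k experts out of 8; the 8^32 figure corresponds to k = 1, and the function name is illustrative.

```python
from math import comb

def expert_paths(num_layers, num_experts, k):
    """Number of distinct per-layer expert subsets a token could follow."""
    return comb(num_experts, k) ** num_layers

top1 = expert_paths(32, 8, 1)   # 8 ** 32
top2 = expert_paths(32, 8, 2)   # C(8, 2) ** 32 = 28 ** 32
print(f"top-1: {top1:.3e}  top-2: {top2:.3e}")
```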
"Each expert specializes in a topic (math, code, etc.)." Mostly false. Empirical analyses of Mixtral, DBRX, and DeepSeek routes find that experts often specialize on token classes (punctuation, proper nouns, function words) rather than topics. Topical specialization sometimes emerges but is not the design goal.
"MoE saves memory at inference." Largely false. MoE saves compute and energy, not VRAM or RAM, since all expert weights must be loaded. The exception is expert offloading, which saves VRAM at the cost of CPU-GPU transfer latency.
"MoE replaces ensembling." False. Ensembling combines independently trained models; MoE jointly trains a single model with sparse activation. The ensembling analogy in the 1991 paper has limited bearing on modern sparse implementations.