A Mixture of Experts (MoE) is a machine learning architecture that divides a problem into subtasks, each handled by a specialized sub-network called an "expert." A learned gating network (also called a router) determines which expert or experts should process each input. In modern deep learning, MoE most commonly appears as a sparse variant inside transformer models, where only a subset of experts is activated for any given input token. This allows models to scale to very large parameter counts while keeping per-token computation manageable.
MoE architectures have become central to the design of many state-of-the-art large language models, including Mixtral, DBRX, Grok-1, DeepSeek-V3, and (reportedly) GPT-4. They offer a practical path to scaling model capacity without a proportional increase in training or inference cost.
Imagine you have a really hard homework assignment that covers math, reading, science, and art. Instead of asking one friend who is okay at everything, you ask four different friends, each one the best at one subject. A "traffic director" looks at each question and sends it to whichever friend knows the answer best. That traffic director is the gating network, and each friend is an expert. The smart part is that you only bother one or two friends per question, so you get great answers without making everyone work on everything.
The MoE concept was introduced by Robert Jacobs, Michael Jordan, Steven Nowlan, and Geoffrey Hinton in their 1991 paper "Adaptive Mixtures of Local Experts," published in Neural Computation. The original system consisted of several specialist networks (experts) and a gating network that learned to assign inputs to the appropriate expert. They demonstrated the approach on a vowel discrimination task, training up to eight experts to recognize phonemes from six Japanese speakers. In the final trained model, only three of the eight experts were meaningfully active, showing that the system naturally learned to specialize.
Michael Jordan and Robert Jacobs extended the framework in 1994 with "Hierarchical Mixtures of Experts and the EM Algorithm." This version arranged experts in a tree structure with multiple levels of gating. The paper also introduced the Expectation-Maximization (EM) algorithm as an alternative to gradient descent for training MoE models, framing learning as a maximum likelihood estimation problem.
For roughly two decades after the original paper, MoE remained mostly an academic concept. Interest revived around 2013 when researchers began exploring conditional computation, the idea that different parts of a neural network could be activated dynamically depending on the input. Bengio and collaborators published work on learning factored representations in deep networks, laying conceptual groundwork for the integration of MoE into modern architectures.
The turning point came with Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean at Google in their 2017 paper "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer." They introduced a MoE layer with up to thousands of feed-forward experts and a trainable gating network that selected a sparse combination of experts per input. The approach was applied between stacked LSTM layers, producing a model with 137 billion parameters that achieved state-of-the-art results on language modeling and machine translation benchmarks at a fraction of the computational cost of dense alternatives. This paper established the template for modern sparse MoE.
From 2020 onward, MoE was integrated into transformer architectures at increasing scale:
| Year | Model/paper | Organization | Key contribution |
|---|---|---|---|
| 2020 | GShard | Google | 600B+ parameter MoE transformer for multilingual translation; top-2 expert routing; trained on 2,048 TPU v3 accelerators |
| 2021 | Switch Transformer | Google | Simplified routing to top-1 expert selection; scaled to 1.6 trillion parameters with 2,048 experts |
| 2022 | GLaM | Google | 1.2 trillion parameters across 64 experts; used 1/3 the energy of GPT-3 training |
| 2022 | ST-MoE | Google | Introduced router z-loss for training stability |
| 2022 | Expert Choice | Google | Reversed routing: experts select tokens instead of tokens selecting experts |
| 2023 | Mixtral 8x7B | Mistral AI | High-quality open-source MoE; 46.7B total parameters, 12.9B active |
| 2024 | DBRX | Databricks | Fine-grained MoE with 16 experts, 4 active; 132B total parameters |
| 2024 | Grok-1 | xAI | 314B parameter open-source MoE; 8 experts, 2 active |
| 2024 | Mixtral 8x22B | Mistral AI | 141B total parameters, 39B active; 65K context window |
| 2024 | Jamba | AI21 Labs | Hybrid Transformer-Mamba-MoE; 52B total, 12B active |
| 2024 | DeepSeekMoE | DeepSeek | Fine-grained expert segmentation with shared experts |
| 2025 | DeepSeek-V3 | DeepSeek | 671B total, 37B active; auxiliary-loss-free load balancing; FP8 training |
A standard MoE layer has two main parts:
Expert networks: A set of N independent sub-networks, typically feed-forward networks (FFNs). Each expert has the same architecture but learns different parameters, allowing it to specialize on different types of inputs.
Gating network (router): A small network that takes the input and produces a probability distribution over the experts. Formally, for an input x, the gating network computes:
G(x) = Softmax(x * W_g)
where W_g is a learned weight matrix. The output of the MoE layer is the weighted sum of expert outputs:
y = sum_i G(x)_i * E_i(x)
where E_i(x) is the output of expert i.
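The computation above can be sketched in a few lines of PyTorch. This is a minimal illustrative sketch, not the implementation of any particular model: the two-layer ReLU experts, the dimension names, and the class name are assumptions made for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseMoE(nn.Module):
    """Minimal dense MoE layer: every expert processes every input and the
    outputs are blended with the softmax gate weights (illustrative sketch)."""

    def __init__(self, d_model: int, d_hidden: int, n_experts: int):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])
        self.gate = nn.Linear(d_model, n_experts, bias=False)   # W_g

    def forward(self, x):                                        # x: [batch, d_model]
        g = F.softmax(self.gate(x), dim=-1)                      # G(x) = Softmax(x * W_g)
        expert_outs = torch.stack([e(x) for e in self.experts], dim=1)  # [batch, n_experts, d_model]
        return (g.unsqueeze(-1) * expert_outs).sum(dim=1)        # y = sum_i G(x)_i * E_i(x)

# Usage:
# layer = DenseMoE(d_model=512, d_hidden=2048, n_experts=8)
# y = layer(torch.randn(4, 512))
```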
In transformer-based models, MoE layers typically replace the feed-forward network (FFN) that follows each multi-head attention layer. Since the FFN accounts for a large share of a transformer's parameters (roughly 90% in models like PaLM-540B), replacing even a subset of FFN layers with MoE layers can dramatically increase total parameter count without proportionally increasing computation.
Common placement strategies include replacing every FFN layer with an MoE layer, or interleaving so that only every other (or every n-th) FFN layer is replaced while the remaining layers stay dense.
The gating mechanism is the most studied component of MoE design. Several approaches have been developed.
The simplest form computes a softmax over a linear projection of the input:
G(x) = Softmax(x * W_g)
This is a dense gating approach where all experts receive some weight. It works for small numbers of experts but does not scale efficiently to hundreds or thousands of experts.
Introduced by Shazeer et al. (2017), this is the foundation for most modern MoE routers. The process has three steps:
Add noise: Tunable Gaussian noise is added to the gating logits to encourage exploration.
H(x)_i = (x * W_g)_i + StandardNormal() * Softplus((x * W_noise)_i)
Keep top-k: Only the top-k values are retained; all others are set to negative infinity.
Apply softmax: The softmax is computed over the remaining values, producing a sparse distribution.
The noise helps prevent the router from always selecting the same experts and encourages different experts to be tried during training.
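The three steps can be sketched as follows. This is an illustrative sketch, assuming a single weight matrix each for the gate and the noise; the function name, shapes, and the training flag are assumptions rather than the original implementation.

```python
import torch
import torch.nn.functional as F

def noisy_top_k_gating(x, w_gate, w_noise, k=2, training=True):
    """Noisy top-k gating in the style of Shazeer et al. (2017), sketch only.
    x: [tokens, d_model]; w_gate, w_noise: [d_model, n_experts]."""
    clean_logits = x @ w_gate
    if training:
        # Step 1: add tunable Gaussian noise to encourage exploration
        noise_std = F.softplus(x @ w_noise)
        logits = clean_logits + torch.randn_like(clean_logits) * noise_std
    else:
        logits = clean_logits
    # Step 2: keep only the top-k logits, mask the rest to -inf
    topk_vals, topk_idx = logits.topk(k, dim=-1)
    masked = torch.full_like(logits, float('-inf'))
    masked = masked.scatter(-1, topk_idx, topk_vals)
    # Step 3: softmax over the remaining values -> sparse gate weights
    gates = F.softmax(masked, dim=-1)   # exactly k nonzero entries per token
    return gates, topk_idx
```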
The Switch Transformer (Fedus et al., 2022) simplified routing by setting k = 1, sending each token to a single expert. The authors showed that this preserves model quality while offering three advantages: the router computation is reduced, each expert's capacity (its per-batch token budget) can be at least halved, and the routing implementation and its communication costs are simplified.
Zhou et al. (2022) at Google proposed reversing the routing direction. Instead of tokens selecting their top-k experts, each expert selects its top-k tokens from the batch. This guarantees perfect load balancing by construction, since every expert processes exactly the same number of tokens. The approach achieved over 2x training speedup compared to top-1 and top-2 gating in an 8-billion-active-parameter model with 64 experts.
A trade-off of expert choice routing is that some tokens may be processed by many experts (receiving more computation) while others may be processed by none, requiring careful handling through residual connections.
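The core of expert choice routing reduces to transposing the affinity matrix and letting each expert take its top tokens. A minimal sketch, with an assumed function name and a fixed per-expert capacity:

```python
import torch

def expert_choice_routing(scores, capacity):
    """Expert-choice routing sketch (in the spirit of Zhou et al., 2022):
    each expert picks its top-`capacity` tokens, so every expert processes
    exactly the same number of tokens per batch.
    scores: [n_tokens, n_experts] router affinities."""
    # Transpose so each row is one expert's affinity over all tokens
    token_ids_per_expert = scores.t().topk(capacity, dim=-1).indices  # [n_experts, capacity]
    return token_ids_per_expert

# Tokens chosen by no expert pass through the residual connection unchanged,
# while popular tokens may be selected by several experts at once.
```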
Several alternative routing methods have been explored:
| Strategy | Description | Advantage |
|---|---|---|
| Hash routing | Deterministic assignment based on token hash | No learned parameters; zero routing overhead |
| Random routing | Tokens assigned to random experts | Baseline comparison; surprisingly competitive in some settings |
| Linear assignment | Global optimization of token-expert matching | Optimal assignment but computationally expensive |
| Reinforcement learning | Router trained with RL signals | Can optimize for downstream objectives |
| BASE layers | Balanced assignment via linear programming | Guaranteed balance with top-1 selection |
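Hash routing, the simplest entry in the table above, needs no learned parameters at all. A one-line sketch (the modulo hash below is an illustrative choice, not a specific published scheme):

```python
def hash_route(token_id: int, n_experts: int) -> int:
    # Deterministic, parameter-free assignment: the token's vocabulary id
    # alone decides its expert; no router network is consulted.
    return hash(token_id) % n_experts
```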
The distinction between sparse and dense MoE is fundamental to understanding modern implementations.
In a dense MoE, every expert processes every input, and their outputs are combined using the full gating weights. This is mathematically equivalent to the original 1991 formulation. Dense MoE does not save computation, since all experts run on every input, but it can still benefit from specialization through the gating weights.
In a sparse MoE, only a small subset of experts (typically 1 or 2 out of 8 to 64+) is activated per input token. This is the dominant form in modern LLMs because it decouples model capacity (total parameters) from computational cost (active parameters per token). A model with 600 billion total parameters might activate only 10-40 billion per token.
Key trade-offs between the two approaches:
| Property | Dense MoE | Sparse MoE |
|---|---|---|
| Computation per token | Proportional to total parameters | Proportional to active parameters only |
| Memory requirement | Proportional to total parameters (matches compute) | Proportional to total parameters (all experts must be loaded despite sparse activation) |
| Expert specialization | Soft (weighted combination) | Hard (only selected experts participate) |
| Load balancing | Not an issue | Requires explicit balancing mechanisms |
| Scaling potential | Limited by compute | Can scale to trillions of parameters |
Load balancing is one of the most significant practical challenges in training sparse MoE models. Without intervention, routers tend to converge toward sending most tokens to a few "popular" experts while ignoring others, a failure mode called routing collapse or expert collapse.
Routing collapse creates a self-reinforcing cycle: popular experts receive more training signal, which makes them better, which causes the router to favor them even more. Meanwhile, ignored experts receive little to no gradient updates and remain undertrained. This defeats the purpose of having multiple experts.
The most common solution is an auxiliary (or load-balancing) loss added to the training objective. The Switch Transformer formulation uses:
L_aux = alpha * N * sum_i(f_i * P_i)
where f_i is the fraction of tokens dispatched to expert i, P_i is the fraction of the router's probability allocated to expert i, and alpha is a hyperparameter controlling the strength of the balancing signal. This loss is minimized when all experts receive equal token allocations.
The hyperparameter alpha requires careful tuning. If set too high, the auxiliary loss dominates the training signal and forces artificial uniformity, degrading model quality. If set too low, it fails to prevent collapse.
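A minimal sketch of this loss, assuming top-1 dispatch so each token maps to exactly one expert; the function name and alpha = 0.01 are illustrative choices:

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_probs, expert_index, n_experts, alpha=0.01):
    """Switch-Transformer-style auxiliary loss (sketch).
    router_probs: [tokens, n_experts] softmax outputs of the router.
    expert_index: [tokens] expert each token was dispatched to."""
    # f_i: fraction of tokens dispatched to expert i
    f = F.one_hot(expert_index, n_experts).float().mean(dim=0)
    # P_i: fraction of router probability mass allocated to expert i
    P = router_probs.mean(dim=0)
    return alpha * n_experts * torch.sum(f * P)
```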
Introduced in the ST-MoE paper (Zoph et al., 2022), the router z-loss penalizes large logits entering the gating network. Large logits create sharp probability distributions that are numerically unstable (especially in lower-precision training) and tend to cause routing collapse. By keeping logits small, the z-loss stabilizes training without hurting model quality.
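The z-loss squares the log-sum-exp of each token's router logits, which grows with logit magnitude and so discourages sharp, numerically fragile distributions. A sketch with an illustrative coefficient:

```python
import torch

def router_z_loss(router_logits, coeff=1e-3):
    """Router z-loss sketch: penalizes large gating logits to keep the
    router softmax numerically well-behaved.
    router_logits: [tokens, n_experts]."""
    z = torch.logsumexp(router_logits, dim=-1)
    return coeff * (z ** 2).mean()
```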
DeepSeek-V3 introduced an alternative approach that eliminates the auxiliary loss entirely. Instead, a bias term b_i is added to each expert's gating value. This bias is adjusted dynamically during training: when an expert is underutilized, its bias increases, making it more likely to be selected; when overutilized, the bias decreases. This approach avoids the interference gradients that auxiliary losses introduce.
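A sketch of the bias-update idea, under the assumptions that the bias affects only expert selection (not output weighting) and that a fixed step size is used; the function name and gamma value are illustrative:

```python
import torch

def update_expert_bias(bias, tokens_per_expert, gamma=0.001):
    """Auxiliary-loss-free balancing sketch in the spirit of DeepSeek-V3:
    nudge the selection bias up for underloaded experts and down for
    overloaded ones after each training step."""
    mean_load = tokens_per_expert.float().mean()
    overloaded = tokens_per_expert.float() > mean_load
    adjustment = torch.full_like(bias, gamma)      # raise underloaded experts
    adjustment[overloaded] = -gamma                # lower overloaded experts
    return bias + adjustment
```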
Expert capacity sets a hard limit on how many tokens a single expert can process in a given batch. The capacity is typically computed as:
Expert Capacity = (tokens_per_batch / number_of_experts) * capacity_factor
The capacity factor is a hyperparameter, usually set between 1.0 and 2.0. A factor of 1.0 means each expert can handle exactly its "fair share" of tokens, with no buffer for imbalance. Switch Transformers found that a capacity factor of 1.0 to 1.25 worked well in practice.
When an expert reaches capacity, additional tokens routed to it are "dropped." These dropped tokens skip the expert computation and instead pass through a residual connection unchanged. Research has shown that up to about 11% of tokens can be dropped this way without significant degradation in model quality.
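A small sketch of the capacity formula and its effect; the batch size, expert count, and 1.25 factor below are example values:

```python
def expert_capacity(tokens_per_batch: int, n_experts: int, capacity_factor: float = 1.25) -> int:
    """Per-expert token budget as described above (sketch). Tokens routed to an
    expert beyond this limit are dropped and carried only by the residual path."""
    return int((tokens_per_batch / n_experts) * capacity_factor)

# Example: a batch of 4096 tokens spread over 64 experts with factor 1.25
# gives each expert a budget of 80 tokens per batch.
assert expert_capacity(4096, 64, 1.25) == 80
```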
The MegaBlocks library introduced "dropless" MoE, which avoids token dropping entirely by using block-sparse GPU kernels that can handle variable numbers of tokens per expert. DBRX adopted this approach.
GShard, by Dmitry Lepikhin, Noam Shazeer, and colleagues at Google, was the first system to scale MoE transformers beyond 600 billion parameters. It focused on multilingual neural machine translation, training a model on 2,048 TPU v3 accelerators in four days. GShard used top-2 expert routing and introduced position-based random routing for the second expert to improve load balancing. The paper also contributed a set of sharding annotation APIs and XLA compiler extensions for distributing MoE models across devices.
William Fedus, Barret Zoph, and Noam Shazeer at Google proposed the Switch Transformer, which simplified MoE routing by using top-1 expert selection instead of top-2. The largest Switch Transformer had 1.6 trillion parameters distributed across 2,048 experts. Despite this extreme sparsity, it achieved up to 7x speedup in pre-training over dense T5 models using the same computational budget. The paper also validated, for the first time, that large sparse MoE models could be trained in lower-precision bfloat16 format.
Google's Generalist Language Model (GLaM) scaled to 1.2 trillion total parameters with 64 experts per MoE layer, activating only 97 billion parameters (about 8%) per token. GLaM used 1/3 the energy of GPT-3 for training (456 MWh vs. 1,287 MWh) and half the inference FLOPs, while achieving better zero-shot and one-shot performance across 29 NLP benchmarks.
Mistral AI released Mixtral 8x7B in December 2023 and Mixtral 8x22B in April 2024, both open-source under the Apache 2.0 license.
Mixtral 8x7B shares the same architecture as Mistral 7B but replaces each FFN layer with 8 expert FFNs. A router selects 2 experts per token per layer. The model has 46.7 billion total parameters with 12.9 billion active per token. It outperformed or matched Llama 2 70B and GPT-3.5 across evaluated benchmarks despite using significantly fewer active parameters.
Mixtral 8x22B scaled this design up to 141 billion total parameters with 39 billion active, and extended the context window to 65,536 tokens.
Databricks released DBRX in March 2024 with a "fine-grained" MoE approach. Instead of the conventional 8-expert, choose-2 design, DBRX uses 16 experts and activates 4 per token. This gives 65 times more possible expert combinations compared to 8-choose-2, which the authors found improved model quality. DBRX has 132 billion total parameters with 36 billion active, uses rotary position encodings, gated linear units, and grouped query attention, and employs dropless MoE routing via the MegaBlocks library.
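The 65x figure follows directly from counting the possible expert subsets:

```python
from math import comb

print(comb(8, 2))                  # 28 combinations for the conventional 8-choose-2 design
print(comb(16, 4))                 # 1820 combinations for DBRX's 16-choose-4 design
print(comb(16, 4) // comb(8, 2))   # 65x more possible expert combinations
```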
xAI open-sourced Grok-1 under the Apache 2.0 license. It has 314 billion total parameters with 8 experts and top-2 selection, activating roughly 25% of weights per token. The architecture uses 64 layers, 48 attention heads for queries and 8 for keys/values, and supports 8-bit quantization. One notable difference from Mixtral is in the routing: Grok-1 applies top-2 selection after softmax over all 8 experts, whereas Mixtral applies softmax only over the top-2 selected experts.
DeepSeek-V3 has 671 billion total parameters with 37 billion active per token. It uses the DeepSeekMoE architecture, which introduces two strategies: (1) experts are segmented into finer-grained sub-experts (the hidden dimension of each expert is reduced while the number of experts is multiplied), enabling more flexible combinations; and (2) a subset of experts is designated as "shared experts" that are always activated for every token, capturing common knowledge and reducing redundancy in the routed experts. DeepSeek-V3 also pioneered auxiliary-loss-free load balancing and was the first model to validate FP8 mixed-precision training at this scale. It required only 2.788 million H800 GPU hours for full training.
AI21 Labs' Jamba is a hybrid architecture that combines transformer layers, Mamba (structured state space model) layers, and MoE layers. It has 52 billion total parameters with 12 billion active, and offers a 256K context window. Roughly one in every eight layers uses a transformer attention mechanism; the rest use Mamba. This hybrid approach reduces the memory footprint compared to a pure transformer of similar capacity.
While OpenAI has not officially confirmed the architecture of GPT-4, multiple sources have reported that it uses a MoE design. One widely cited leak described it as an 8-expert model with approximately 220 billion parameters per expert, totaling around 1.76 trillion parameters. Another report described 16 experts with approximately 111 billion MLP parameters each, with 2 experts routed per forward pass. These reports were informally corroborated by Soumith Chintala, co-creator of PyTorch, but remain unconfirmed.
MoE models are more prone to training instability than dense models, particularly at large scale. Sources of instability include the router softmax, which exponentiates logits and so amplifies small numerical errors (especially in bfloat16), the discrete and abruptly changing routing decisions, and the interaction between auxiliary balancing losses and the main training objective.
Practical stabilization techniques include using full precision (float32) for the router even when experts run in bfloat16, adding router z-loss, and carefully tuning the auxiliary loss coefficient.
Sparse MoE models are more susceptible to overfitting during fine-tuning than dense models of comparable active parameter count. This happens because MoE models have far more total parameters, but each parameter sees fewer training examples (since each expert only processes a fraction of tokens). Strategies to mitigate this include increasing dropout within the expert layers, fine-tuning only a subset of the parameters (for example, freezing the MoE layers and updating only the dense ones), and applying instruction tuning, from which MoE models benefit disproportionately.
Research has revealed that experts in encoder models tend to develop token-level specialization. For example, certain experts may specialize in punctuation, proper nouns, or specific syntactic patterns. In decoder models, specialization patterns are less pronounced and harder to interpret.
Expert specialization collapse occurs when experts become functionally redundant, all learning similar representations instead of specializing. This negates the benefit of having multiple experts and is distinct from routing collapse (where experts are ignored entirely).
A key challenge for MoE inference is that, despite only activating a subset of experts per token, all expert parameters must be loaded into memory. This means MoE models have the same memory footprint as a dense model of equal total parameter count, even though they use far fewer FLOPs per token. For example, Mixtral 8x7B requires loading all 46.7B parameters into VRAM even though only 12.9B are active per token.
Production deployments of large MoE models routinely require 8 or more GPUs with 80 GB each simply to load the model before serving any traffic.
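A rough weight-only sizing sketch (activations and KV cache add more on top) shows why. The helper name and the byte-per-parameter assumptions below are illustrative:

```python
def weight_memory_gb(total_params: float, bytes_per_param: float) -> float:
    # Weight memory only; all experts must be resident even though
    # only a few are active per token.
    return total_params * bytes_per_param / 1e9

print(weight_memory_gb(46.7e9, 2))   # Mixtral 8x7B in bf16 -> ~93 GB, beyond a single 80 GB GPU
print(weight_memory_gb(671e9, 1))    # DeepSeek-V3 in FP8   -> ~671 GB, more than 8 x 80 GB
```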
Expert parallelism is a distribution strategy designed specifically for MoE models. Different experts are placed on different GPUs, and tokens are routed to the GPU holding their assigned expert via all-to-all communication. Non-MoE layers (such as attention) are handled via standard data or tensor parallelism.
This can be combined with other parallelism strategies:
| Parallelism type | What is distributed | Applicability |
|---|---|---|
| Data parallelism | Different batches across devices | All model types |
| Tensor parallelism | Individual layer weights split across devices | Large layers |
| Pipeline parallelism | Different layers on different devices | Deep models |
| Expert parallelism | Different experts on different devices | MoE models specifically |
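To make the expert-parallel dispatch step concrete, here is a minimal single-process sketch of the grouping that precedes the all-to-all exchange; in a real system each group would be shipped to the GPU hosting the corresponding expert. The function name and example values are illustrative:

```python
import torch

def build_dispatch_plan(expert_index, n_experts):
    """Group token positions by their assigned expert (sketch).
    In expert parallelism, each group is sent via all-to-all to the device
    holding that expert, processed there, and returned to its original slot.
    expert_index: [tokens] expert id chosen by the router for each token."""
    return [torch.nonzero(expert_index == e, as_tuple=True)[0] for e in range(n_experts)]

# Example: 8 tokens routed across 4 experts
plan = build_dispatch_plan(torch.tensor([2, 0, 3, 2, 1, 0, 0, 3]), n_experts=4)
# plan[0] -> positions of the tokens assigned to expert 0, and so on.
```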
NVIDIA's work on wide expert parallelism with GB200 NVL72 systems showed up to 1.8x higher per-GPU throughput compared to smaller expert-parallel configurations, by leveraging fewer experts per GPU and higher arithmetic intensity.
Quantization is particularly effective for MoE models because the memory savings are amplified by the large total parameter count. QMoE demonstrated compression of a 1.6-trillion-parameter Switch Transformer from 3.2 TB to 160 GB at less than 1 bit per parameter, making deployment on commodity hardware feasible.
For deployment on devices with limited GPU memory, expert offloading stores inactive expert weights in CPU memory and loads them to the GPU on demand. Pre-gated MoE takes this further by predicting which experts will be needed ahead of time and prefetching their weights, enabling single-GPU deployment of large MoE models at the cost of additional latency from CPU-GPU transfer.
MoE models can be distilled into smaller dense models that retain 30-40% of the MoE's quality advantage over a comparably sized dense baseline. Research has also shown that sentence-level or task-level routing can be used to extract specialized sub-networks from a trained MoE for targeted deployment.
The following table summarizes the practical trade-offs between MoE and dense model architectures:
| Dimension | MoE models | Dense models |
|---|---|---|
| Pre-training speed | Faster (4-7x for equivalent quality) | Slower |
| Total parameters | Very large (100B-1T+) | Moderate (7B-540B typically) |
| Active parameters per token | Small fraction of total | All parameters |
| Inference FLOPs per token | Lower for given quality level | Higher |
| VRAM requirement | High (must load all experts) | Proportional to parameter count |
| Training stability | Requires careful tuning (auxiliary loss, z-loss) | Generally more stable |
| Fine-tuning | Prone to overfitting; benefits from instruction tuning | More straightforward |
| Knowledge-intensive tasks | Generally stronger | Depends on size |
| Reasoning tasks | Mixed results; sometimes weaker | Often stronger at similar active parameter count |
| Deployment complexity | Higher (expert parallelism, large memory) | Lower |
| Energy efficiency | Better (less compute for similar quality) | Worse |
The general MoE output for an input x is:
y = sum_{i=1}^{N} g(x)_i * E_i(x)
where N is the number of experts, E_i is the i-th expert network, and g(x) is the gating function.
For sparse top-k routing, the gating function becomes:
g(x) = Softmax(TopK(H(x), k))
where:
H(x)_i = (x * W_g)_i + epsilon_i * Softplus((x * W_noise)_i)
and epsilon_i is sampled from a standard normal distribution. The TopK function retains only the k largest values and sets the rest to negative infinity before applying softmax.
The load-balancing auxiliary loss for N experts across a batch of T tokens is:
L_balance = alpha * N * sum_{i=1}^{N} f_i * P_i
where f_i = (number of tokens assigned to expert i) / T and P_i = (sum of router probabilities for expert i) / T.
While MoE is most widely associated with large language models, the architecture has been applied to other domains, including computer vision (e.g., the V-MoE vision transformer), speech recognition, multimodal models, and recommender systems (e.g., multi-gate MoE for multi-task ranking).
Several libraries and frameworks support MoE training and inference:
| Library | Organization | Features |
|---|---|---|
| MegaBlocks | Databricks | Block-sparse GPU kernels; dropless MoE |
| DeepSpeed-MoE | Microsoft | Hybrid parallel training (data + tensor + expert parallelism) |
| Fairseq | Meta | Sequence modeling framework with MoE support |
| Hugging Face Transformers | Hugging Face | Native MoE support since v4.36.0 (Mixtral, DBRX, etc.) |
| Tutel | Microsoft | Optimized all-to-all communication for MoE |
| OpenMoE | Community | Community-built Llama-based MoE models |