Mixture of experts (MoE) is a machine learning architecture in which only a subset of specialized sub-networks, called "experts," are activated for each input. Rather than routing every token through every parameter in a model, a learned gating network (also called a router) selects a small number of experts to process each input. This conditional computation paradigm allows models to maintain an enormous total parameter count while keeping per-token inference cost comparable to a much smaller dense model. MoE has become one of the most important architectural patterns in modern large language models, powering systems such as Mixtral, DeepSeek V2 and V3, and likely GPT-4.
The mixture of experts concept was introduced in the 1991 paper "Adaptive Mixtures of Local Experts" by Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, and Geoffrey Hinton [1]. The core idea was straightforward: instead of training a single monolithic network to handle all patterns in a dataset, train multiple smaller networks that each specialize in a different region of the input space. A gating network would learn to assign inputs to the appropriate expert.
This early formulation drew on the principle of divide-and-conquer. Each expert network could focus on a subset of the data where it performed best, while the gating network learned a soft partition of the input space. The result was a system that could model complex, multimodal distributions more effectively than a single network of equivalent size.
In 1994, Michael Jordan and Robert Jacobs extended the framework with hierarchical mixtures of experts (HME) [2]. This approach organized experts into a tree structure, with gating networks at each level of the hierarchy. The hierarchical design allowed for more fine-grained specialization: top-level gates would make coarse routing decisions, while lower-level gates would refine the selection. HME models found early success in tasks like speech recognition and regression problems.
After these initial contributions, MoE research remained relatively quiet for over a decade. The architecture saw limited adoption, partly because the computational infrastructure of the era could not fully exploit its advantages. Dense neural networks, especially convolutional neural networks and later recurrent neural networks, dominated the field through the 2000s and early 2010s.
The modern resurgence of MoE began with the landmark 2017 paper "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer" by Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean [3]. Published at ICLR 2017, this work introduced a sparsely-gated MoE layer that could be inserted between stacked LSTM layers. The key innovation was scaling MoE to thousands of experts (up to 131,072 in some experiments) and demonstrating over 1,000x improvements in model capacity with only minor losses in computational efficiency.
Shazeer et al. applied their approach to language modeling and machine translation, building models with up to 137 billion parameters. This paper laid the conceptual foundation for every large-scale MoE system that followed. It also identified many of the training challenges, such as load balancing and expert collapse, that remain active research topics today.
A mixture of experts model consists of two core components:
Expert networks: A set of N sub-networks (experts), typically feed-forward networks (FFNs), each with identical architecture but independent parameters. In modern transformer-based MoE models, the MoE layer replaces the standard dense FFN block within each transformer layer.
Gating network (router): A small neural network, usually a single linear layer followed by a softmax function, that takes the input token representation and produces a probability distribution over the N experts. The router then selects the top-k experts with the highest probabilities.
For each input token, the router computes a score for every expert. The mathematical formulation is:
G(x) = softmax(W_g * x)
where x is the token representation and W_g is the learnable weight matrix of the gating network. The router then selects the top-k experts based on these scores. Only the selected experts perform computation on that token, and their outputs are combined as a weighted sum using the gating scores as weights:
y = sum_{i in top-k} G(x)_i * E_i(x)
where E_i(x) is the output of expert i applied to token x.
In practice, k is typically small (often 1 or 2), meaning that even if a model has hundreds of experts, only a tiny fraction of total parameters are active for any single token. This is the fundamental source of MoE's computational efficiency.
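The routing and combination steps above can be sketched in a few lines. This is a minimal, single-token NumPy illustration (not any particular model's implementation); the expert functions and dimensions are placeholders, and some real systems additionally renormalize the gate scores over only the selected experts.

```python
import numpy as np

def moe_forward(x, W_g, experts, k=2):
    """Sparse MoE forward pass for one token (illustrative sketch).

    x       : token representation, shape (d,)
    W_g     : learnable router weights, shape (N, d)
    experts : list of N callables, each mapping (d,) -> (d,)
    """
    logits = W_g @ x                        # one routing score per expert
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                    # softmax over the N experts
    top_k = np.argsort(probs)[-k:]          # indices of the k highest gates
    # Only the k selected experts run; their outputs are gate-weighted.
    return sum(probs[i] * experts[i](x) for i in top_k)

rng = np.random.default_rng(0)
d, N = 8, 4
experts = [(lambda W: (lambda x: W @ x))(rng.standard_normal((d, d)))
           for _ in range(N)]
y = moe_forward(rng.standard_normal(d), rng.standard_normal((N, d)), experts)
```

With k=2 and N=4, only half the expert parameters touch this token; the other experts perform no computation at all, which is the sparsity the surrounding text describes.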
The term "sparse" in sparse MoE refers to the fact that only a few experts are activated per input. This stands in contrast to a "dense" MoE (or soft MoE), where all experts contribute to every input with different weights. Nearly all modern large-scale MoE implementations use sparse routing because it provides the computational savings that make MoE attractive.
GShard, developed by Lepikhin et al. at Google in 2020, was one of the first successful demonstrations of MoE at massive scale within the transformer architecture [4]. GShard scaled a transformer model to 600 billion parameters using 2,048 experts across its MoE layers. The model was applied to multilingual machine translation and demonstrated that it could achieve superior translation quality compared to much smaller dense models while using roughly the same training compute.
GShard used top-2 routing (each token is processed by two experts) and introduced several practical innovations for distributed training, including automatic sharding of experts across devices. It also introduced an auxiliary load-balancing loss to prevent the router from sending all tokens to a small number of popular experts.
The Switch Transformer, introduced by Fedus, Zoph, and Shazeer at Google in 2021 (published in JMLR 2022), simplified the MoE routing mechanism by using top-1 routing: each token is sent to exactly one expert [5]. This "switch" routing reduced communication overhead and made the system simpler to implement while achieving up to 7x pre-training speedups over equivalent dense T5 models with the same computational budget.
The Switch Transformer scaled to over one trillion parameters and demonstrated that sparse models could be effectively distilled back into smaller dense models, retaining much of the quality gains. It also formalized the concept of a capacity factor, which determines how many tokens each expert can handle per batch.
ST-MoE (Stable and Transferable Mixture-of-Experts), also from Google, focused on improving both training stability and the quality of transfer learning in MoE models [6]. It introduced a router z-loss that penalizes large logits in the gating network, helping to prevent the instability that MoE models are prone to during training. ST-MoE recommended specific hyperparameter settings for auxiliary loss coefficients that became widely referenced in later work.
The following table summarizes notable models built on the MoE architecture:
| Model | Organization | Year | Total Parameters | Active Parameters | Experts per Layer | Top-k | Notes |
|---|---|---|---|---|---|---|---|
| Mixtral 8x7B | Mistral AI | 2023 | 46.7B | 12.9B | 8 | 2 | First widely adopted open-source MoE LLM [7] |
| Mixtral 8x22B | Mistral AI | 2024 | 141B | 39B | 8 | 2 | Scaled-up successor to 8x7B [8] |
| GPT-4 (rumored) | OpenAI | 2023 | ~1.76T (unconfirmed) | ~220B (unconfirmed) | 16 (unconfirmed) | 2 (unconfirmed) | Architecture never officially confirmed by OpenAI [9] |
| DeepSeek-V2 | DeepSeek AI | 2024 | 236B | 21B | 2 shared + 160 routed | 6 routed | Fine-grained expert segmentation with Multi-head Latent Attention [10] |
| DeepSeek-V3 | DeepSeek AI | 2024 | 671B | 37B | 1 shared + 256 routed | 8 routed | Auxiliary-loss-free load balancing; multi-token prediction [11] |
| Grok-1 | xAI | 2024 | 314B | ~78B | 8 | 2 | Open-weight release under Apache 2.0 [12] |
| DBRX | Databricks | 2024 | 132B | 36B | 16 | 4 | Fine-grained MoE with 65x more expert combinations [13] |
| Qwen1.5-MoE-A2.7B | Alibaba | 2024 | ~14.3B | 2.7B | 60 routed + 4 shared | 4 routed | Matches 7B dense model performance with one-third the active parameters [14] |
| Mistral Large 3 | Mistral AI | 2025 | 675B | 41B | MoE (granular sparse) | Undisclosed | Apache 2.0; 256K context window; multimodal with 2.5B vision encoder [15] |
Training and serving MoE models requires distributing experts across multiple devices. Expert parallelism is a strategy where different experts are placed on different GPUs or accelerators. When a token is routed to a particular expert, the token's representation must be sent to the device hosting that expert, and the result must be returned. This creates an all-to-all communication pattern that can become a bottleneck if not managed carefully.
In practice, expert parallelism is combined with other parallelism strategies. Data parallelism replicates the full model across devices and splits the batch; tensor parallelism splits individual layers across devices; pipeline parallelism assigns different layers to different devices. A large MoE training run might use all four forms simultaneously. For example, DeepSeek-V3 was trained on 2,048 NVIDIA H800 GPUs using a combination of expert, data, and pipeline parallelism.
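The bookkeeping behind the all-to-all dispatch can be illustrated with a toy example. The sketch below (a simplification; real systems use fused all-to-all kernels and fixed-size padded buffers rather than Python lists) groups token indices by the device hosting their assigned expert:

```python
from collections import defaultdict

def plan_dispatch(token_experts, expert_device, num_devices):
    """Group token indices by destination device (all-to-all bookkeeping).

    token_experts : expert index chosen for each token
    expert_device : device index hosting each expert
    """
    buckets = defaultdict(list)
    for tok, e in enumerate(token_experts):
        buckets[expert_device[e]].append(tok)
    # One send buffer per device; the reverse all-to-all returns results.
    return [buckets[d] for d in range(num_devices)]

# 4 experts sharded over 2 devices: experts 0-1 on device 0, 2-3 on device 1.
sends = plan_dispatch([0, 3, 1, 2, 0], [0, 0, 1, 1], 2)
# sends == [[0, 2, 4], [1, 3]]
```

Because the buckets are data-dependent, the per-device send sizes vary from batch to batch, which is exactly why load balancing (discussed below) matters for communication efficiency as well as for training quality.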
One of the most persistent challenges in MoE is ensuring that tokens are distributed roughly evenly across experts. Without intervention, the router tends to converge on sending most tokens to a small number of "popular" experts. This creates two problems: the popular experts become overloaded (and must drop tokens if they exceed their capacity), while the underutilized experts receive too few training examples and fail to specialize.
The standard approach to load balancing is an auxiliary loss added to the training objective. This loss penalizes imbalanced routing distributions. Specifically, it encourages the fraction of tokens routed to each expert to be roughly uniform. The auxiliary loss takes the form:
L_aux = alpha * N * sum_{i=1}^{N} f_i * P_i
where f_i is the fraction of tokens assigned to expert i, P_i is the average router probability for expert i, N is the number of experts, and alpha is a balancing coefficient. Setting alpha correctly is tricky: too large and it interferes with the primary training objective; too small and load balancing fails. ST-MoE recommended alpha = 0.01 as a reasonable default.
The capacity factor (CF) determines the maximum number of tokens each expert can process in a given batch. It is defined as:
Capacity = CF * (total_tokens / num_experts)
A capacity factor of 1.0 means each expert can handle exactly its fair share of tokens. In practice, values slightly above 1.0 (such as 1.25) are used to tolerate some imbalance without dropping tokens. Tokens that arrive at an already-full expert are dropped: the MoE layer contributes nothing for them, so their representations pass through via the residual connection unchanged (or are zeroed out, depending on the implementation). Token dropping introduces a trade-off: a higher capacity factor wastes compute on padding, while a lower one risks losing information.
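The capacity computation is simple arithmetic; a small sketch with illustrative numbers (the batch size and expert count here are arbitrary):

```python
import math

def expert_capacity(total_tokens, num_experts, capacity_factor=1.25):
    """Maximum number of tokens one expert may process per batch."""
    return math.ceil(capacity_factor * total_tokens / num_experts)

# With 4096 tokens and 8 experts, the even split is 512 tokens per expert;
# CF = 1.25 leaves 25% headroom before tokens start being dropped.
cap = expert_capacity(4096, 8)  # 640
```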
The traditional auxiliary loss approach has a fundamental tension: it introduces gradient interference that can degrade model quality. DeepSeek-V3 pioneered an auxiliary-loss-free strategy for load balancing [11]. Instead of adding a loss term, this approach dynamically adjusts a per-expert bias term in the gating function during training. Experts that are receiving too many tokens have their bias decreased, while underutilized experts have their bias increased. This achieves balanced routing without injecting any extraneous gradients into the main training loss. The approach proved especially effective at large scale, where the interference from auxiliary losses becomes more problematic.
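The bias-update rule can be sketched as follows. This is a simplified illustration of the idea described above, not DeepSeek-V3's actual code: the bias is added to routing scores only when choosing the top-k experts (the gating weights used to combine outputs remain unbiased), and the update step `gamma` is an assumed hyperparameter.

```python
import numpy as np

def update_router_bias(bias, token_counts, gamma=0.001):
    """One bias-update step for auxiliary-loss-free balancing (sketch).

    bias         : (N,) per-expert bias added to routing scores for
                   top-k selection only
    token_counts : (N,) tokens routed to each expert in the last batch
    gamma        : bias update speed (hyperparameter)
    """
    # Overloaded experts are nudged down, underloaded experts up.
    return bias - gamma * np.sign(token_counts - token_counts.mean())

bias = update_router_bias(np.zeros(4), np.array([10, 2, 2, 2]))
```

Because the adjustment happens outside the loss function, no balancing gradient mixes with the language-modeling gradient, which is the point of the technique.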
Traditional MoE models use a relatively small number of large experts (typically 8 to 16). DeepSeek introduced a different philosophy: use a much larger number of smaller experts [10]. This approach, called fine-grained expert segmentation, works by taking what would be a single expert in a conventional MoE and splitting it into m smaller pieces. Correspondingly, the number of activated experts is also multiplied by m, keeping total computation roughly constant.
For example, in DeepSeek-V3, each MoE layer has 256 routed experts plus 1 shared expert, but only 8 routed experts are activated per token. Each individual expert is much smaller than it would be in a model like Mixtral (which uses 8 large experts and activates 2). The advantage of this approach is combinatorial: with 256 experts choosing 8, there are vastly more possible expert combinations than with 8 experts choosing 2. This means the model can learn more nuanced routing strategies and achieve finer-grained specialization.
DBRX adopted a similar philosophy with 16 experts and top-4 routing, giving 65 times more expert combinations than a standard 8-expert, top-2 setup.
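The combinatorial claims above are easy to verify directly with binomial coefficients:

```python
from math import comb

# Distinct expert subsets available to the router per token.
mixtral     = comb(8, 2)     # 8 experts, top-2   -> 28
dbrx        = comb(16, 4)    # 16 experts, top-4  -> 1820
deepseek_v3 = comb(256, 8)   # 256 routed experts, top-8 -> ~4.1e14

ratio = dbrx // mixtral      # 65, matching DBRX's "65x" figure
```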
Another innovation from DeepSeek is the concept of shared experts: one or more expert networks that are always active for every token, regardless of the router's decisions. Shared experts learn to capture common, broadly useful patterns (such as basic syntax or frequent word associations), while routed experts specialize in more niche knowledge. This division of labor reduces redundancy among routed experts and improves overall model quality. DeepSeek-V2 uses 2 shared experts alongside 160 routed experts, while DeepSeek-V3 uses 1 shared expert alongside 256 routed experts.
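The division of labor between shared and routed experts amounts to a small change in the forward pass. A minimal sketch (the expert callables and gate probabilities here are placeholders, and routed-gate normalization details vary between models):

```python
import numpy as np

def moe_with_shared(x, shared, routed, gate_probs, k):
    """Combine always-on shared experts with top-k routed experts (sketch)."""
    y = sum(E(x) for E in shared)                 # shared: every token
    top_k = np.argsort(gate_probs)[-k:]           # routed: top-k only
    return y + sum(gate_probs[i] * routed[i](x) for i in top_k)

identity = lambda x: x
out = moe_with_shared(np.ones(4), [identity], [identity] * 8,
                      np.full(8, 0.125), k=2)
```

Since the shared experts bypass the router entirely, they receive gradient signal from every token, which is what lets them absorb broadly useful patterns.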
The most significant advantage of MoE is the ability to scale model capacity without proportionally scaling compute. A model with 671 billion total parameters but only 37 billion active parameters (like DeepSeek-V3) stores far more knowledge than a 37-billion-parameter dense model, yet its forward pass costs roughly the same. This decoupling of capacity and compute has been the primary driver of MoE adoption in frontier AI systems.
MoE models can achieve the same quality as dense models with significantly less training compute. The Switch Transformer demonstrated up to 7x pre-training speedups over equivalent dense models. DeepSeek-V2 required only 172,800 GPU hours per trillion tokens of training, a 42.5% reduction compared to a dense model of comparable quality [10]. DeepSeek-V3's full training run required only 2.788 million H800 GPU hours, remarkably efficient for a model of its capability level [11].
Because only a fraction of parameters are active per token, MoE models can generate tokens faster than dense models of the same total size. A model like Mixtral 8x7B, despite having 46.7 billion total parameters, runs at speeds comparable to a 13-billion-parameter dense model because only 12.9 billion parameters are activated per token. This makes MoE particularly appealing for deployment scenarios where latency matters.
Different experts can learn to handle different types of inputs, languages, or domains. Research has shown that in multilingual MoE models, certain experts tend to specialize in particular languages, while others handle cross-lingual patterns. This emergent specialization is one of the reasons MoE models often outperform dense models of equivalent compute.
As discussed above, achieving even token distribution across experts is a persistent problem. If the auxiliary loss coefficient is too large, it degrades model quality. If it is too small, the router collapses into always choosing the same few experts. Finding the right balance requires careful tuning, and the optimal settings can change as training progresses. Auxiliary-loss-free methods like those used in DeepSeek-V3 represent a promising direction, but the problem is not fully solved.
Although only a subset of experts are active per token, all expert parameters must be stored in memory (or at least be quickly accessible). A 671-billion-parameter MoE model requires enough memory to hold all 671 billion parameters, even though only 37 billion are used at any moment. This creates significant memory pressure, especially during training when optimizer states and gradients must also be maintained. Expert offloading (moving inactive experts to CPU memory or disk) can help at inference time, but it introduces latency.
MoE models are more prone to training instability than dense models. The router's discrete selection process creates discontinuities in the loss landscape, and the interaction between the auxiliary loss and the primary training loss can cause oscillations. Large spikes in the loss during training have been widely reported. ST-MoE's router z-loss helps mitigate this by preventing the gating logits from growing too large, but instability remains a concern, particularly at very large scale [6].
Expert collapse occurs when some experts stop receiving tokens and effectively cease learning. Once an expert falls behind during training (perhaps due to random initialization), the router learns to avoid it, creating a self-reinforcing cycle. The collapsed expert wastes parameters that could otherwise contribute to model quality. Auxiliary losses help prevent collapse, but do not eliminate it entirely. Some approaches, such as random token assignment with a small probability, have been explored to keep all experts active.
The all-to-all communication pattern required by expert parallelism can become a training bottleneck. When experts are distributed across many devices, every token must be sent to the device hosting its assigned expert and the result must be returned. This communication overhead grows with the number of experts and the number of devices. Careful co-design of the routing strategy and the communication topology is essential for efficient distributed training.
Fine-tuning MoE models presents unique challenges. The routing patterns learned during pre-training may not transfer well to downstream tasks, and the load-balancing dynamics can shift when the data distribution changes. Some practitioners have found that fine-tuning only a subset of experts, or freezing the router and fine-tuning expert weights, produces better results than naive full fine-tuning.
| Aspect | Dense Model | MoE Model |
|---|---|---|
| Active parameters per token | All parameters | Subset (top-k experts) |
| Total parameter count | Same as active | Much larger than active |
| Inference speed | Proportional to total params | Proportional to active params |
| Memory footprint | Proportional to total params | Proportional to total params (all experts stored) |
| Training compute per quality level | Higher | Lower (more parameter-efficient) |
| Training stability | Generally stable | Prone to instability and expert collapse |
| Implementation complexity | Straightforward | Requires routing, load balancing, expert parallelism |
| Fine-tuning | Well-understood | More complex; routing dynamics may shift |
As of early 2026, MoE has become the dominant architecture for frontier language models. The trend is clear: nearly every major lab has adopted or is exploring MoE.
Mistral AI's trajectory illustrates the pattern. The company began with dense models (Mistral 7B) before moving to MoE with Mixtral 8x7B in late 2023, then Mixtral 8x22B in 2024, and finally Mistral Large 3 at 675 billion parameters in December 2025 [15]. Each generation has increased the total parameter count while keeping active parameters manageable.
DeepSeek has pushed the fine-grained MoE approach further than any other organization. DeepSeek-V3, released in late 2024, demonstrated that a 671-billion-parameter MoE model could be trained for under $6 million in compute costs while matching or exceeding the performance of models that cost orders of magnitude more to train [11]. The auxiliary-loss-free load balancing and multi-token prediction training objective introduced in V3 have influenced subsequent work across the field.
The leaked details about GPT-4 suggest that OpenAI also adopted MoE for its flagship model, though this has never been officially confirmed [9]. If accurate, it would mean that MoE underpins the most widely used commercial AI system in the world.
Several technical trends are shaping the future of MoE: