Mixture of Experts (MoE)

A Mixture of Experts (MoE) is a machine learning architecture that divides a problem into subtasks, each handled by a specialized sub-network called an "expert." A learned gating network (also called a router) determines which expert or experts should process each input. In modern deep learning, MoE most commonly appears as a sparse variant inside transformer models, where only a subset of experts is activated for any given input token. This allows models to scale to very large parameter counts while keeping per-token computation manageable.

MoE architectures have become central to the design of many state-of-the-art large language models, including Mixtral, DBRX, Grok-1, DeepSeek-V3, Llama 4, Qwen 3, Kimi K2, Gemini 1.5, and (reportedly) GPT-4. They offer a practical path to scaling model capacity without a proportional increase in training or inference cost. By 2025, the leading frontier models in nearly every category were sparse mixtures, marking one of the largest architectural shifts since the original transformer paper.

ELI5 (Explain like I'm 5)

Imagine you have a really hard homework assignment that covers math, reading, science, and art. Instead of asking one friend who is okay at everything, you ask four different friends, each one the best at one subject. A "traffic director" looks at each question and sends it to whichever friend knows the answer best. That traffic director is the gating network, and each friend is an expert. The smart part is that you only bother one or two friends per question, so you get great answers without making everyone work on everything.

Now imagine the homework book is huge and there are 256 friends instead of four. You still only ask two of them per question, so the answers come fast. But you still need a giant table for all 256 friends to sit at, which is why these models need a lot of memory even though they are quick to run.

history

origins (1991)

The MoE concept was introduced by Robert A. Jacobs, Michael Jordan, Steven J. Nowlan, and Geoffrey Hinton in their 1991 paper "Adaptive Mixtures of Local Experts," published in Neural Computation (volume 3, issue 1, pages 79 to 87). Jacobs and Jordan were affiliated with MIT's Department of Brain and Cognitive Sciences; Nowlan and Hinton were at the University of Toronto's Department of Computer Science. The paper proposed a supervised learning procedure for systems composed of many separate sub-networks, each learning to handle a subset of the training cases. The authors framed the approach two ways: as a modular version of a multilayer supervised network, and as an associative version of competitive learning.

The original system consisted of several specialist networks (experts) and a gating network that learned to assign inputs to the appropriate expert. The authors demonstrated the approach on a vowel discrimination task, training up to eight experts to recognize phonemes from six Japanese speakers. In the final trained model, only three of the eight experts were meaningfully active, showing that the system naturally learned to specialize and effectively pruned unused capacity. The 1991 formulation was a dense MoE: every expert ran on every input, and the gating network produced a soft weighting over their outputs.

hierarchical MoE (1994)

Michael Jordan and Robert Jacobs extended the framework in 1994 with "Hierarchical Mixtures of Experts and the EM Algorithm," published in Neural Computation (volume 6, issue 2, pages 181 to 214). This version arranged experts in a tree structure with multiple levels of gating, allowing for hierarchical decomposition of the input space. The paper also applied the Expectation-Maximization (EM) algorithm to MoE training as an alternative to gradient descent, framing learning as a maximum likelihood estimation problem with hidden mixture component variables.

conditional computation era (2013 to 2016)

For roughly two decades after the original paper, MoE remained mostly an academic concept. Interest revived around 2013 when Yoshua Bengio and collaborators began exploring conditional computation, the idea that different parts of a neural network could be activated dynamically depending on the input. Bengio, Léonard, and Courville published "Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation" in 2013, providing tools for learning discrete routing decisions through gradient estimators.

That same year, David Eigen, Marc'Aurelio Ranzato, and Ilya Sutskever published "Learning Factored Representations in a Deep Mixture of Experts" (arXiv:1312.4314), which stacked multiple MoE layers and demonstrated on a jittered MNIST dataset that the network learned to factor different aspects of the data (location and class) at different layers. Davis and Arel, also in 2013, contributed parallel work on conditional computation. Bengio, Bacon, Pineau, and Precup followed in 2015 with "Conditional Computation in Neural Networks for Faster Models" (arXiv:1511.06297), formalizing the goal of decoupling parameter count from inference cost.

These papers laid conceptual groundwork for the integration of MoE into modern architectures but did not produce production-scale systems.

sparsely-gated MoE (2017)

The turning point came with Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean at Google in their 2017 paper "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer" (ICLR 2017, arXiv:1701.06538). They introduced a MoE layer with up to thousands of feed-forward experts and a trainable gating network that selected a sparse combination of experts per input. The approach was applied between stacked LSTM layers, producing a model with 137 billion parameters that achieved state-of-the-art results on language modeling and machine translation benchmarks at a fraction of the computational cost of dense alternatives. Crucially, the paper introduced noisy top-k gating, an auxiliary load-balancing loss, and a system-level treatment of how to actually train sparse experts at scale on multiple devices. This paper established the template for modern sparse MoE.

scaling with transformers (2020 to present)

From 2020 onward, MoE was integrated into transformer architectures at increasing scale. The progression from 2020 to 2025 is summarized below.

Year	Model or paper	Organization	Total params	Active params	Experts	Top-k	Key contribution
2020	GShard	Google	600B+	n/a	2,048	2	First MoE Transformer beyond 600B; multilingual MT; 2,048 TPU v3
2021	Switch Transformer	Google	1.6T	~26B	2,048	1	Top-1 routing; bfloat16 training of trillion-parameter sparse models
2021	V-MoE	Google	15B (vision)	n/a	up to 32 per layer	2	First sparse MoE vision transformer; 90.35% ImageNet
2022	GLaM	Google	1.2T	~97B	64 per layer	2	One-third of GPT-3 training energy; 29-task NLP gains
2022	ST-MoE	Google	269B	32B	32-128	2	Router z-loss; first sparse model SOTA on transfer tasks
2022	Expert Choice	Google	8B active	8B	64	variable	Reversed routing; experts pick tokens; perfect load balance
2022	DeepSpeed-MoE	Microsoft	n/a	n/a	n/a	n/a	4.5x faster, 9x cheaper inference vs. quality-equivalent dense
2022	MegaBlocks	Stanford / Databricks	n/a	n/a	n/a	n/a	Block-sparse kernels; "dropless" MoE
2023	Mixtral 8x7B	Mistral AI	46.7B	12.9B	8	2	First widely-used open-weights MoE; matched Llama 2 70B
2024	DBRX	Databricks	132B	36B	16	4	Fine-grained MoE; 65x more expert combinations than 8-choose-2
2024	Grok-1	xAI	314B	~78B	8	2	Largest open-weights model at release; Apache 2.0
2024	Mixtral 8x22B	Mistral AI	141B	39B	8	2	64K context; native multilingual; Apache 2.0
2024	Jamba	AI21 Labs	52B	12B	16	2	Hybrid Transformer-Mamba-MoE; 256K context
2024	DeepSeekMoE	DeepSeek	16B	2.8B	64 (fine) + 2 shared	6	Fine-grained segmentation plus shared experts
2024	DeepSeek-V2	DeepSeek	236B	21B	160 + 2 shared	6	MLA + DeepSeekMoE; 128K context; 5.76x throughput vs. DeepSeek 67B
2024	Gemini 1.5 Pro	Google DeepMind	undisclosed	undisclosed	undisclosed	undisclosed	First production multimodal MoE; 1M+ token context
2024	Snowflake Arctic	Snowflake	480B	17B	128 + 1 dense	2	Hybrid dense + residual MoE; enterprise focus
2024	Qwen1.5-MoE	Alibaba / Qwen	14.3B	2.7B	60 + 4 shared	4	Upcycled from dense; 75% lower training cost than Qwen1.5-7B
2024	DeepSeek-V3	DeepSeek	671B	37B	256 + 1 shared	8	Auxiliary-loss-free balancing; FP8 training; 2.788M H800 hours
2025	Llama 4 Scout	Meta	109B	17B	16	1	Native multimodality; 10M token context
2025	Llama 4 Maverick	Meta	400B	17B	128	1	128 experts; alternating dense and MoE layers
2025	Llama 4 Behemoth	Meta	~2T	288B	16	1	Frontier teacher model (training as of 2025)
2025	Qwen3-235B-A22B	Alibaba / Qwen	235B	22B	128	8	Global-batch load balancing; no shared experts
2025	Kimi K2	Moonshot AI	1T	32B	384	8	Trained with Muon optimizer; agentic focus; 128K context
2025	Mistral Large 3	Mistral AI	675B	41B	undisclosed	undisclosed	Mistral's first frontier-class MoE

architecture

core components

A standard MoE layer has two main parts.

Expert networks. A set of N independent sub-networks, typically feed-forward networks (FFNs) with a SwiGLU or GeLU non-linearity. Each expert has the same architecture but learns different parameters, allowing it to specialize on different types of inputs. In its simplest form, each expert in a transformer FFN computes Expert(x) = W_2 * activation(W_1 * x), where W_1 projects up to a wider hidden dimension and W_2 projects back; gated variants such as SwiGLU add a third projection matrix for the gate.

Gating network (router). A small network that takes the input and produces a probability distribution over the experts. Formally, for an input x, the gating network computes:

G(x) = Softmax(x * W_g)

where W_g is a learned weight matrix of shape (hidden_dim, N). The output of the MoE layer is the weighted sum of expert outputs:

y = sum_i G(x)_i * E_i(x)

where E_i(x) is the output of expert i. In sparse MoE, most components of G(x) are zero by construction.
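The two formulas above can be traced with a small dense example. This is a minimal sketch in plain Python, not an implementation from any of the cited papers: the three toy experts, the matrix W_g, and the helper names (softmax, matvec, moe_forward) are illustrative assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(x, W):
    """Row vector x (length d) times matrix W (d rows, n columns)."""
    return [sum(x[r] * W[r][c] for r in range(len(x))) for c in range(len(W[0]))]

def moe_forward(x, W_g, experts):
    """Dense MoE: y = sum_i G(x)_i * E_i(x), with G(x) = Softmax(x * W_g)."""
    gates = softmax(matvec(x, W_g))               # G(x), one weight per expert
    outputs = [expert(x) for expert in experts]   # E_i(x) for every expert
    y = [sum(g * out[j] for g, out in zip(gates, outputs))
         for j in range(len(outputs[0]))]
    return y, gates

# Hypothetical toy experts standing in for feed-forward networks.
experts = [
    lambda x: [2.0 * v for v in x],
    lambda x: [-v for v in x],
    lambda x: [v + 1.0 for v in x],
]
W_g = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.5]]  # shape (hidden_dim=2, N=3)
y, gates = moe_forward([0.5, -1.0], W_g, experts)
```

A sparse layer would compute the same sum but skip every expert whose gate is zero, which is where the compute saving comes from.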

placement in transformers

In transformer-based models, MoE layers typically replace the feed-forward network (FFN) that follows each multi-head attention layer. Since the FFN accounts for a large share of a transformer's parameters (roughly 90% in models like PaLM-540B, and well over half in Llama-style architectures), replacing even a subset of FFN layers with MoE layers can dramatically increase total parameter count without proportionally increasing computation.

Common placement strategies include:

Every layer: Each transformer block uses an MoE FFN. This is the default in Mixtral, DBRX, and DeepSeek-V2 and V3.
Every other layer: Alternating between dense FFN and MoE FFN layers. Used in GLaM, GShard, and Llama 4 Maverick. Halves the routing overhead and number of all-to-all communications.
Every fourth layer: Less frequent MoE placement, used in some research configurations and in V-MoE for vision.
Hybrid stacks: Jamba intermixes Mamba, attention, and MoE blocks. Llama 4 alternates dense layers with MoE layers in Maverick (every other) but uses MoE on every layer in Scout.

The first and last few layers are often kept dense even in MoE models, on the theory that early layers process generic features and final layers form predictions where stable pathways are useful.

upcycling vs. training from scratch

Two strategies exist for producing an MoE model: training from scratch with sparse routing from step zero, or upcycling an existing dense checkpoint into an MoE by replicating its FFN weights into multiple experts and continuing training. Upcycling, popularized by Qwen1.5-MoE and several Mixtral variants in the community, can reach competitive quality at roughly 25 to 50% of the from-scratch training compute, though it tends to produce experts that initially behave very similarly until specialization develops over many tokens of continued training.
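The weight-replication step of upcycling can be illustrated with a toy example. This is a sketch under stated assumptions: the tiny ReLU FFN, the weight values, and the function names (ffn, upcycle, moe) are hypothetical and do not come from any specific model's recipe.

```python
import copy

def ffn(x, weights):
    """Tiny ReLU FFN: W_2 * relu(W_1 * x), matrices stored as lists of rows."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights["W_1"]]
    return [sum(w * h for w, h in zip(row, hidden)) for row in weights["W_2"]]

def upcycle(dense_weights, n_experts):
    """Create n_experts independent copies of the dense FFN weights."""
    return [copy.deepcopy(dense_weights) for _ in range(n_experts)]

def moe(x, experts, gates):
    """Weighted sum of expert outputs; experts with a zero gate are skipped."""
    active = [(g, ffn(x, w)) for g, w in zip(gates, experts) if g != 0.0]
    return [sum(g * out[j] for g, out in active) for j in range(len(active[0][1]))]

dense = {
    "W_1": [[1.0, -1.0], [0.5, 2.0], [-1.0, 0.0]],  # 2 -> 3 up-projection
    "W_2": [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]],     # 3 -> 2 down-projection
}
experts = upcycle(dense, n_experts=4)
x = [1.0, 2.0]
dense_out = ffn(x, dense)
moe_out = moe(x, experts, gates=[0.7, 0.3, 0.0, 0.0])  # any gates summing to 1
```

Because every expert starts as an identical copy, the upcycled layer reproduces the dense output whenever the gate weights sum to one, which is why the experts behave alike until continued training pulls them apart.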

gating mechanisms

The gating mechanism is the most studied component of MoE design, and it is where most of the qualitative differences between MoE systems live. Several approaches have been developed.

softmax gating

The simplest form computes a softmax over a linear projection of the input:

G(x) = Softmax(x * W_g)

This is a dense gating approach where all experts receive some weight. It works for small numbers of experts and is mathematically equivalent to the original 1991 formulation, but does not scale efficiently to hundreds or thousands of experts because every expert has to run.

noisy top-k gating

Introduced by Shazeer et al. (2017), this is the foundation for most modern MoE routers. The process has three steps.

Add noise. Tunable Gaussian noise is added to the gating logits to encourage exploration: H(x)_i = (x * W_g)_i + StandardNormal() * Softplus((x * W_noise)_i), where W_noise is a second learned weight matrix that sets the per-expert noise scale.
Keep top-k. Only the top-k values are retained; all others are set to negative infinity.
Apply softmax. The softmax is computed over the remaining values, producing a sparse distribution.

The noise helps prevent the router from always selecting the same experts and encourages different experts to be tried during training. After training stabilizes, many production systems disable noise at inference for determinism.
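The three steps can be sketched as follows. This is an illustrative sketch rather than the paper's code: the function takes precomputed clean logits (x * W_g) and raw noise logits (x * W_noise) as inputs, and the function name and the training flag are assumptions made for the example.

```python
import math
import random

def softplus(v):
    """Softplus(v) = log(1 + exp(v)), keeps the noise scale positive."""
    return math.log1p(math.exp(v))

def noisy_top_k_gates(clean_logits, noise_logits, k, rng=None, training=True):
    """Return a sparse gate vector with exactly k non-zero entries."""
    rng = rng or random.Random()
    # Step 1: add Gaussian noise scaled by Softplus of the noise logits.
    noisy = []
    for clean, raw_noise in zip(clean_logits, noise_logits):
        noise = rng.gauss(0.0, 1.0) * softplus(raw_noise) if training else 0.0
        noisy.append(clean + noise)
    # Step 2: keep the top-k logits; the rest are treated as negative infinity.
    top = sorted(range(len(noisy)), key=lambda i: noisy[i], reverse=True)[:k]
    # Step 3: softmax over the survivors (exp(-inf) = 0 for everything else).
    m = max(noisy[i] for i in top)
    exps = {i: math.exp(noisy[i] - m) for i in top}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(noisy))]

clean = [2.0, 0.5, 1.0, -1.0]
raw_noise = [0.0, 0.0, 0.0, 0.0]
eval_gates = noisy_top_k_gates(clean, raw_noise, k=2, training=False)
train_gates = noisy_top_k_gates(clean, raw_noise, k=2, rng=random.Random(0))
```

With training=False the noise term is dropped, so the same input always selects the same experts, matching the deterministic inference behaviour described above.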

switch routing (top-1)

The Switch Transformer (Fedus, Zoph, and Shazeer, 2022) simplified routing by setting k = 1, sending each token to a single expert. The authors showed that this preserves model quality while offering three advantages.

Router computation is reduced because only one expert needs to be evaluated per token.
Expert capacity requirements are halved because each token goes to only one expert.
Communication costs between devices decrease, since each token's hidden state crosses the network only once.

Llama 4 returned to top-1 routing in 2025 with both Scout and Maverick, citing the same efficiency arguments. In top-1 routing the gating weight for the chosen expert is sometimes still applied as a multiplicative scalar on the expert output, which keeps the gating network differentiable.

top-k routing (k = 2 or higher)

Mixtral, DBRX (k = 4), Snowflake Arctic (k = 2), and DeepSeek-V3 (k = 8 over routed experts plus a shared expert) use top-k for k > 1. Higher k means each token sees more experts and is generally easier to balance, but communication and compute costs grow roughly linearly with k.

expert choice routing

Zhou et al. (2022) at Google proposed reversing the routing direction. Instead of tokens selecting their top-k experts, each expert selects its top-k tokens from the batch (NeurIPS 2022, arXiv:2202.09368). This guarantees perfect load balancing by construction, since every expert processes exactly the same number of tokens. The approach achieved over 2x training speedup compared to top-1 and top-2 gating in an 8-billion-active-parameter model with 64 experts.

A trade-off of expert choice routing is that some tokens may be processed by many experts (receiving more computation) while others may be processed by none, requiring careful handling through residual connections. Because the assignment is computed across the whole batch, expert choice is best suited to training and high-throughput batch inference; for streaming, single-token-at-a-time decoding it is harder to apply.
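Both the balance guarantee and the uneven per-token coverage are visible in a small sketch. The score matrix and the function name expert_choice are illustrative assumptions, not values or code from Zhou et al.

```python
def expert_choice(scores, capacity):
    """scores[t][e] is the affinity of token t for expert e.

    Each expert independently keeps its `capacity` highest-scoring tokens,
    so every expert processes exactly the same number of tokens.
    """
    n_tokens, n_experts = len(scores), len(scores[0])
    assignment = {}
    for e in range(n_experts):
        ranked = sorted(range(n_tokens), key=lambda t: scores[t][e], reverse=True)
        assignment[e] = ranked[:capacity]
    # Count how many experts picked each token (can be zero).
    experts_per_token = [0] * n_tokens
    for tokens in assignment.values():
        for t in tokens:
            experts_per_token[t] += 1
    return assignment, experts_per_token

scores = [
    [0.9, 0.8],  # token 0: attractive to both experts
    [0.7, 0.6],  # token 1: attractive to both experts
    [0.1, 0.2],  # token 2
    [0.3, 0.1],  # token 3
]
assignment, experts_per_token = expert_choice(scores, capacity=2)
# assignment        -> {0: [0, 1], 1: [0, 1]}
# experts_per_token -> [2, 2, 0, 0]
```

Each expert handles exactly two tokens, but tokens 2 and 3 are chosen by no expert and would reach the next layer only through the residual connection.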

other routing strategies

Several alternative routing methods have been explored.

Strategy	Description	Advantage
Hash routing	Deterministic assignment based on token hash	No learned parameters; zero routing overhead
Random routing	Tokens assigned to random experts	Baseline comparison; surprisingly competitive in some settings
Linear assignment	Global optimization of token-expert matching	Optimal assignment but computationally expensive
Reinforcement learning	Router trained with RL signals	Can optimize for downstream objectives
BASE layers	Balanced assignment via linear programming	Guaranteed balance with top-1 selection
Soft MoE	Each input is a weighted combination of all expert slots	Differentiable; useful in vision (Soft MoE, Puigcerver et al., 2023)
Threshold routing	Tokens routed only when a confidence threshold is met	Variable compute per token; saves FLOPs on easy tokens
Auxiliary-loss-free	Bias terms updated in place to balance load	No interference gradients; used in DeepSeek-V3

sparse vs. dense MoE

The distinction between sparse and dense MoE is fundamental to understanding modern implementations.

dense MoE

In a dense MoE, every expert processes every input, and their outputs are combined using the full gating weights. This is mathematically equivalent to the original 1991 formulation. Dense MoE does not save computation, since all experts run on every input, but it can still benefit from specialization through the gating weights. Soft MoE is a recent variant where every input slot interacts with every expert through learned mixing weights, used primarily in vision.

sparse MoE

In a sparse MoE, only a small subset of experts (typically 1, 2, 4, or 8 out of 8 to 384+) is activated per input token. This is the dominant form in modern LLMs because it decouples model capacity (total parameters) from computational cost (active parameters per token). A model with 671 billion total parameters such as DeepSeek-V3 might activate only 37 billion per token; Kimi K2 activates 32 billion out of 1 trillion.

Key trade-offs between the two approaches:

Property	Dense MoE	Sparse MoE
Computation per token	Proportional to total parameters	Proportional to active parameters only
Memory requirement	Proportional to total parameters	Also proportional to total parameters; all experts must be loaded despite sparse activation
Expert specialization	Soft (weighted combination)	Hard (only selected experts participate)
Load balancing	Not an issue	Requires explicit balancing mechanisms
Backward pass	Smooth gradients	Non-differentiable top-k requires straight-through estimators or surrogate losses
Scaling potential	Limited by compute	Can scale to trillions of parameters
Suitability for vision	Common (Soft MoE)	Common (V-MoE)
Suitability for LLMs	Rare in production	Dominant in 2024 to 2026

load balancing

Load balancing is one of the most significant practical challenges in training sparse MoE models. Without intervention, routers tend to converge toward sending most tokens to a few "popular" experts while ignoring others, a failure mode called routing collapse or expert collapse.

the problem

Routing collapse creates a self-reinforcing cycle: popular experts receive more training signal, which makes them better, which causes the router to favor them even more. Meanwhile, ignored experts receive little to no gradient updates and remain undertrained. This defeats the purpose of having multiple experts. Empirically, models that suffer routing collapse converge to behave like dense models with a fraction of their advertised capacity.

auxiliary loss

The most common solution is an auxiliary (or load-balancing) loss added to the training objective. The Switch Transformer formulation uses:

L_aux = alpha * N * sum_i(f_i * P_i)

where f_i is the fraction of tokens dispatched to expert i, P_i is the fraction of the router's probability allocated to expert i, and alpha is a hyperparameter controlling the strength of the balancing signal. This loss is minimized when all experts receive equal token allocations.

The hyperparameter alpha requires careful tuning. If set too high, the auxiliary loss dominates the training signal and forces artificial uniformity, degrading model quality. If set too low, it fails to prevent collapse. In practice, values between 0.001 and 0.01 are typical for production training.
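A short calculation shows how the loss separates balanced from collapsed routing. This is a minimal sketch assuming top-1 dispatch by argmax; the router probabilities and the function name switch_aux_loss are made up for illustration.

```python
def switch_aux_loss(router_probs, alpha=0.01):
    """L_aux = alpha * N * sum_i(f_i * P_i) for top-1 routing.

    router_probs[t][i] is the router's softmax probability of expert i
    for token t.
    """
    n_tokens, n_experts = len(router_probs), len(router_probs[0])
    counts = [0] * n_experts
    for probs in router_probs:
        counts[max(range(n_experts), key=lambda i: probs[i])] += 1
    f = [c / n_tokens for c in counts]                       # dispatch fraction
    P = [sum(p[i] for p in router_probs) / n_tokens          # mean probability
         for i in range(n_experts)]
    return alpha * n_experts * sum(fi * Pi for fi, Pi in zip(f, P))

balanced = switch_aux_loss([[0.6, 0.4], [0.4, 0.6], [0.6, 0.4], [0.4, 0.6]])
collapsed = switch_aux_loss([[0.9, 0.1]] * 4)
# balanced  = 0.01 * 2 * (0.5*0.5 + 0.5*0.5) = 0.010
# collapsed = 0.01 * 2 * (1.0*0.9 + 0.0*0.1) = 0.018
```

With uniform routing the sum equals 1/N, so the loss bottoms out at alpha regardless of the number of experts; any concentration of tokens on a few experts pushes it higher.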

router z-loss

Introduced in the ST-MoE paper (Zoph et al., 2022, arXiv:2202.08906), the router z-loss penalizes large logits entering the gating network:

L_z = (1/B) * sum_b (log sum_i exp((x_b * W_g)_i))^2

Large logits create sharp probability distributions that are numerically unstable (especially in lower-precision training such as bfloat16 and FP8) and tend to cause routing collapse. By keeping logits small, the z-loss stabilizes training without hurting model quality. The ST-MoE paper identified router logit growth as the primary cause of training instabilities in large-scale MoE models, and z-loss has since been widely adopted in MoE training frameworks.
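The z-loss is the mean squared log-sum-exp of each token's router logits, which the following sketch computes directly. The function name and the sample logits are illustrative assumptions.

```python
import math

def router_z_loss(logits_batch):
    """Mean over tokens of (log sum_i exp(logit_i))^2.

    logits_batch[b] holds the router logits (x_b * W_g) for token b.
    """
    total = 0.0
    for logits in logits_batch:
        m = max(logits)  # subtract the max for a stable log-sum-exp
        lse = m + math.log(sum(math.exp(v - m) for v in logits))
        total += lse ** 2
    return total / len(logits_batch)

small = router_z_loss([[0.0, 0.0]])    # (log 2)^2, about 0.48
large = router_z_loss([[10.0, 10.0]])  # (10 + log 2)^2, about 114.3
```

Both inputs produce the same uniform routing probabilities, yet the second is penalized far more heavily, which shows that the loss targets logit magnitude rather than the routing decision itself.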

auxiliary-loss-free balancing

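DeepSeek researchers proposed an alternative approach that removes the load-balancing auxiliary loss (Wang et al., "Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts," arXiv:2408.15664). DeepSeek-V3 adopted it as its main balancing mechanism, retaining only a very small sequence-wise balance loss as a safeguard, whereas DeepSeek-V2 still relied on auxiliary balance losses. Instead of a loss term, a bias term b_i is added to each expert's gating logit before the top-k selection: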

score_i = (x * W_g)_i + b_i

This bias is adjusted dynamically during training: when an expert is underutilized, its bias is increased, making it more likely to be selected; when overutilized, the bias is decreased. Critically, the bias is not part of the gating weight that gets multiplied into the expert output; it only affects the discrete top-k selection. This approach avoids the interference gradients that auxiliary losses introduce and has been credited with raising the upper bound of MoE model quality. DeepSeek-V3 reports keeping a balanced load throughout its full pre-training without dropping any tokens.
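The feedback loop can be simulated on synthetic scores. This is a sketch under loud assumptions: the fixed-step sign update, the step size of 0.01, the synthetic batch in which expert 0 starts out dominant, and all function names are illustrative choices, not DeepSeek's training code.

```python
import random

def top_k(scores, bias, k):
    """Select experts by biased score; the bias influences selection only,
    so gate weights would still be computed from the unbiased scores."""
    biased = [s + b for s, b in zip(scores, bias)]
    return sorted(range(len(scores)), key=lambda i: biased[i], reverse=True)[:k]

def route_batch(batch, bias, k):
    """Count how many tokens each expert receives."""
    counts = [0] * len(bias)
    for scores in batch:
        for i in top_k(scores, bias, k):
            counts[i] += 1
    return counts

def update_bias(bias, counts, step):
    """Raise the bias of underloaded experts, lower it for overloaded ones."""
    mean_load = sum(counts) / len(counts)
    new_bias = []
    for b, c in zip(bias, counts):
        if c < mean_load:
            new_bias.append(b + step)
        elif c > mean_load:
            new_bias.append(b - step)
        else:
            new_bias.append(b)
    return new_bias

rng = random.Random(0)
n_experts, k = 4, 1
# Synthetic scores: expert 0 always scores at least 1.0, the others at most 0.5.
batch = [[(1.0 if i == 0 else 0.0) + rng.uniform(0.0, 0.5) for i in range(n_experts)]
         for _ in range(256)]
bias = [0.0] * n_experts
initial_counts = route_batch(batch, bias, k)   # every token goes to expert 0
for _ in range(200):
    bias = update_bias(bias, route_batch(batch, bias, k), step=0.01)
final_counts = route_batch(batch, bias, k)     # load spread across experts
```

No gradient flows through the bias, so the language-modeling objective is left untouched; the balancing pressure comes entirely from the out-of-band update.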

global-batch load balancing

Qwen3 (Alibaba, 2025) adopted global-batch load balancing, which computes the load-balancing signal over the entire global batch rather than each micro-batch. This produces a smoother target and, the Qwen team reports, encourages stronger expert specialization. Combined with the absence of shared experts in Qwen3, this approach was credited with the model's strong scaling behavior up to 235 billion total parameters.

expert capacity

Expert capacity sets a hard limit on how many tokens a single expert can process in a given batch. The capacity is typically computed as:

Expert Capacity = (tokens_per_batch / number_of_experts) * capacity_factor

The capacity factor is a hyperparameter, usually set between 1.0 and 2.0. A factor of 1.0 means each expert can handle exactly its "fair share" of tokens, with no buffer for imbalance. Switch Transformers found that a capacity factor of 1.0 to 1.25 worked well in practice. Higher factors waste compute on padding; lower factors increase the number of dropped tokens.

When an expert reaches capacity, additional tokens routed to it are dropped. These dropped tokens skip the expert computation and instead pass through a residual connection unchanged. Research has shown that up to about 11% of tokens can be dropped this way without significant degradation in model quality, but more aggressive dropping causes noticeable harm.
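The capacity formula and the dropping rule can be worked through on a small batch. This is a sketch with assumptions: rounding the capacity up with ceil is one possible convention (implementations differ), tokens are served in first-come order, and the routing decisions in expert_ids are invented for the example.

```python
import math

def expert_capacity(tokens_per_batch, n_experts, capacity_factor):
    """(tokens_per_batch / number_of_experts) * capacity_factor, rounded up."""
    return math.ceil(tokens_per_batch / n_experts * capacity_factor)

def dispatch_with_capacity(expert_ids, n_experts, capacity):
    """Split token indices into kept and dropped, given top-1 expert choices."""
    load = [0] * n_experts
    kept, dropped = [], []
    for token, expert in enumerate(expert_ids):
        if load[expert] < capacity:
            load[expert] += 1
            kept.append(token)
        else:
            dropped.append(token)  # passes through the residual path only
    return kept, dropped

expert_ids = [0, 0, 0, 1, 0, 2, 3, 0]      # expert 0 is oversubscribed
capacity = expert_capacity(8, 4, 1.25)     # 8 / 4 * 1.25 = 2.5, rounded up to 3
kept, dropped = dispatch_with_capacity(expert_ids, 4, capacity)
# kept    -> [0, 1, 2, 3, 5, 6]
# dropped -> [4, 7]
```

Five tokens ask for expert 0 but only three fit, so two of eight tokens (25%) are dropped here, while experts 1 to 3 each leave two of their three slots unused as padding.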

The MegaBlocks library (Gale et al., 2022, arXiv:2211.15841) introduced dropless MoE, which avoids token dropping entirely by reformulating MoE computation as block-sparse matrix multiplication. Custom GPU kernels handle variable numbers of tokens per expert, eliminating both wasted compute on padding and quality loss from dropped tokens. DBRX, Mixtral, and most subsequent open MoE models adopt the dropless approach.

notable MoE models

GShard (2020)

GShard, by Dmitry Lepikhin, HyoukJoong Lee, Noam Shazeer, and colleagues at Google (ICLR 2021, arXiv:2006.16668), was the first system to scale MoE transformers beyond 600 billion parameters. It focused on multilingual neural machine translation, training a model on 2,048 TPU v3 accelerators in four days at a total cost of 22 TPU v3 core-years. By comparison, the paper's strongest dense baseline, a 96-layer, 2.3-billion-parameter Transformer, took 235.5 TPU v3 core-years to train and reached lower quality (36.9 vs. 44.3 average BLEU). GShard used top-2 expert routing and introduced position-based random routing for the second expert to improve load balancing. The paper also contributed a set of sharding annotation APIs and XLA compiler extensions for distributing MoE models across devices, becoming a foundational systems contribution.

Switch Transformer (2021 to 2022)

William Fedus, Barret Zoph, and Noam Shazeer at Google proposed the Switch Transformer (JMLR 23, 2022, arXiv:2101.03961), which simplified MoE routing by using top-1 expert selection instead of top-2. The largest Switch Transformer had 1.6 trillion parameters distributed across 2,048 experts. Despite this extreme sparsity, it achieved up to 7x speedup in pre-training over dense T5 models using the same computational budget. The paper also validated, for the first time, that large sparse MoE models could be trained in lower-precision bfloat16 format. The authors used selective precision (router in float32, experts in bfloat16), a technique still standard in 2026.

V-MoE (2021)

Carlos Riquelme, Joan Puigcerver, Basil Mustafa, and others at Google Brain published "Scaling Vision with Sparse Mixture of Experts" (NeurIPS 2021, arXiv:2106.05974). V-MoE replaced a subset of dense feedforward layers in Vision Transformers (ViT) with sparse MoE layers, with each image patch routed to a subset of experts. A 15-billion-parameter V-MoE with 24 MoE layers (out of 48 blocks) reached 90.35% top-1 ImageNet accuracy after fine-tuning. The paper also introduced batch prioritized routing, which prioritized subsets of inputs across the entire batch to enable adaptive per-image compute.

GLaM (2022)

Google's Generalist Language Model (GLaM) by Du, Huang, Dai, et al. (ICML 2022, arXiv:2112.06905) scaled to 1.2 trillion total parameters with 64 experts per MoE layer, activating about 97 billion parameters per token (roughly 8% of total). GLaM used 1/3 the energy of GPT-3 for training (456 MWh vs. 1,287 MWh) and half the inference FLOPs, while achieving better zero-shot and one-shot performance across 29 NLP benchmarks. GLaM placed MoE layers on every other transformer block rather than every block.

ST-MoE (2022)

ST-MoE by Zoph, Bello, Kumar, Du, Huang, Dean, Shazeer, and Fedus (arXiv:2202.08906) addressed training instability and fine-tuning quality issues that had limited sparse models on transfer learning. The 269-billion-parameter ST-MoE-32B model (matching the FLOPs of a 32-billion-parameter dense encoder-decoder) was the first sparse model to achieve state-of-the-art performance on a diverse set of transfer tasks including reasoning, summarization, closed-book QA, and adversarial benchmarks. The router z-loss introduced in this paper became a near-universal component of subsequent MoE training pipelines.

DeepSpeed-MoE and Tutel (2022)

Two systems papers in 2022 made large-scale MoE training and inference practical. DeepSpeed-MoE (Rajbhandari et al., ICML 2022, arXiv:2201.05596) at Microsoft provided an end-to-end training and inference solution with novel architecture designs and compression techniques that reduced MoE model size by up to 3.7x and offered 4.5x faster, 9x cheaper inference compared to quality-equivalent dense models. Tutel (also Microsoft) optimized the all-to-all communication primitive specifically for MoE routing, with adaptive pipelining and a 2-dimensional hierarchical (2DH) all-to-all algorithm, accelerating Meta's 1.1 trillion–parameter MoE model by more than 40% on 64 NDm A100 v4 nodes.

Mixtral 8x7B and 8x22B (2023 to 2024)

Mistral AI released Mixtral 8x7B in December 2023 and Mixtral 8x22B in April 2024, both open-source under the Apache 2.0 license. The technical report ("Mixtral of Experts," arXiv:2401.04088) was published in January 2024.

Mixtral 8x7B shares the same backbone as Mistral 7B but replaces each FFN layer with 8 expert FFNs. A router selects 2 experts per token per layer, applying softmax only over the top-2 chosen experts (rather than over all 8 before top-k). The model has 46.7 billion total parameters with 12.9 billion active per token. It outperformed or matched Llama 2 70B and GPT-3.5 across evaluated benchmarks despite using significantly fewer active parameters, and its inference was substantially faster than that of Llama 2 70B.
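The gap between total and active parameters follows from simple bookkeeping. The sketch below reproduces both figures; the configuration values (hidden size 4,096, expert hidden size 14,336, 32 layers, 32 query heads and 8 key-value heads of dimension 128, vocabulary 32,000, untied embeddings, three matrices per SwiGLU expert) are assumptions taken from the publicly released model configuration rather than from this article.

```python
# Assumed Mixtral 8x7B configuration (not stated in the article).
d_model, d_ff, n_layers = 4096, 14336, 32
n_heads, n_kv_heads, head_dim = 32, 8, 128
vocab, n_experts, top_k = 32000, 8, 2

expert = 3 * d_model * d_ff                        # gate, up and down projections
attention = (2 * d_model * n_heads * head_dim      # query and output projections
             + 2 * d_model * n_kv_heads * head_dim)  # key and value projections
router = d_model * n_experts                       # W_g for one layer
norms = 2 * d_model                                # two RMSNorm vectors per layer

shared = (n_layers * (attention + router + norms)
          + 2 * vocab * d_model                    # input and output embeddings
          + d_model)                               # final norm
total = shared + n_layers * n_experts * expert
active = shared + n_layers * top_k * expert

print(f"total:  {total / 1e9:.1f}B")   # total:  46.7B
print(f"active: {active / 1e9:.1f}B")  # active: 12.9B
```

The name "8x7B" is therefore not a literal multiplication: attention, embedding, and router weights are shared by all experts, and only the FFN weights are replicated eight times.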

Mixtral 8x22B scaled this design up to 141 billion total parameters with 39 billion active, extended the context window to 65,536 tokens, and added native support for function calling. It strongly outperformed Llama 2 70B on French, German, Spanish, and Italian benchmarks (HellaSwag, Arc Challenge, MMLU).

DBRX (2024)

Databricks released DBRX in March 2024 with a "fine-grained" MoE approach. Instead of the conventional 8-expert, choose-2 design, DBRX uses 16 experts and activates 4 per token, giving 65 times more possible expert combinations compared to 8-choose-2, which the authors found improved model quality. DBRX has 132 billion total parameters with 36 billion active, was pre-trained on 12 trillion tokens with a 32K context length, and uses rotary position encodings, gated linear units, and grouped query attention. It employs dropless MoE routing via the MegaBlocks library and was trained on 3,072 NVIDIA H100 GPUs connected via 3.2 Tbps InfiniBand.
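The factor of 65 is a ratio of binomial coefficients and can be checked directly; the variable names below are illustrative.

```python
from math import comb

dbrx_combinations = comb(16, 4)          # ways to choose 4 of 16 experts: 1820
conventional_combinations = comb(8, 2)   # ways to choose 2 of 8 experts: 28
ratio = dbrx_combinations / conventional_combinations
print(ratio)  # 65.0
```

Both designs activate a quarter of their experts per token, so the extra combinations come from making the experts smaller and more numerous rather than from activating a larger share of them.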

Grok-1 (2024)

xAI open-sourced Grok-1 on March 17, 2024, under the Apache 2.0 license. Pre-training had concluded in October 2023. Grok-1 has 314 billion total parameters with 8 experts and top-2 selection, activating roughly 25% of weights per token. The architecture uses 64 layers, 48 attention heads for queries and 8 for keys and values, an embedding size of 6,144, and supports 8-bit quantization. One notable difference from Mixtral is in the routing: Grok-1 applies top-2 selection after a softmax over all 8 experts, whereas Mixtral applies softmax only over the top-2 selected experts. At release, Grok-1 was the largest open-weights model.
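The two orderings select the same experts but yield different gate weights, as a small comparison shows. This is a sketch based only on the description above: the logits are invented, the function names are illustrative, and whether Grok-1 renormalizes the selected weights afterwards is not stated here, so the sketch leaves them as they are.

```python
import math

def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def top_indices(values, k):
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]

def topk_then_softmax(logits, k):
    """Softmax over the selected logits only (the ordering attributed to Mixtral)."""
    chosen = top_indices(logits, k)
    weights = softmax([logits[i] for i in chosen])
    return dict(zip(chosen, weights))

def softmax_then_topk(logits, k):
    """Softmax over all experts, then keep the top-k (the ordering attributed to Grok-1)."""
    probs = softmax(logits)
    return {i: probs[i] for i in top_indices(probs, k)}

logits = [2.0, 1.0, 0.0, -1.0]
selected_first = topk_then_softmax(logits, k=2)   # weights sum to 1
softmax_first = softmax_then_topk(logits, k=2)    # weights sum to less than 1
```

In the first ordering the two weights always sum to one; in the second, probability mass assigned to the unselected experts is simply discarded, so the expert outputs are scaled down when the router is uncertain.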

Snowflake Arctic (2024)

Snowflake released Arctic on April 24, 2024 (Apache 2.0). Arctic combines a 10-billion-parameter dense transformer with a residual 128-by-3.66-billion MoE MLP, totaling 480 billion parameters with 17 billion active, chosen via top-2 gating. The 128-expert design produces a fine-grained MoE optimized for enterprise tasks (SQL, code generation). Snowflake reported up to 4x fewer memory reads than Code-Llama 70B and 2.5x fewer than Mixtral 8x22B, leading to faster inference.

DeepSeek-V2 (2024)

DeepSeek-V2 ("A Strong, Economical, and Efficient Mixture-of-Experts Language Model," arXiv:2405.04434, May 2024) has 236 billion total parameters with 21 billion activated per token and a 128K context length. It introduced two architectural innovations that became influential: Multi-head Latent Attention (MLA), which compresses the KV cache into a low-rank latent vector and reduces KV cache size by 93.3%, and the production-scale DeepSeekMoE design with 2 shared experts and 160 routed experts (6 activated per token), each with a hidden dimension of 1,536. Compared to DeepSeek 67B, V2 achieved better quality with 42.5% lower training cost and 5.76x higher inference throughput.

DeepSeekMoE paper (2024)

The DeepSeekMoE paper (Dai et al., arXiv:2401.06066, January 2024) formalized two principal strategies that have shaped MoE design ever since: fine-grained expert segmentation (the hidden dimension of each expert is reduced while the number of experts is multiplied, enabling more flexible combinations) and shared expert isolation (a small set of experts is always active for every token, capturing common knowledge and reducing redundancy in routed experts). DeepSeekMoE 2B matched the quality of GShard 2.9B, which has 1.5 times as many expert parameters and FLOPs.

DeepSeek-V3 (2024)

DeepSeek-V3 (DeepSeek-AI, "DeepSeek-V3 Technical Report," arXiv:2412.19437) has 671 billion total parameters with 37 billion active per token. It uses 256 routed experts plus 1 shared expert, with the top 8 routed experts activated per token. Key contributions include:

Auxiliary-loss-free load balancing via per-expert bias terms updated based on usage history.
Multi-Token Prediction (MTP) training objective that extends prediction to multiple future tokens, used during pre-training and discarded for inference.
FP8 mixed-precision training at trillion-token scale, the first model to validate FP8 training at this size.
A reported full pre-training cost of 2.788 million H800 GPU hours.

DeepSeek-V3 reports zero token drops throughout training and inference, made possible by the combination of fine-grained experts, shared experts, and bias-based balancing.

Gemini 1.5 (2024)

Google DeepMind announced Gemini 1.5 Pro in February 2024 as a sparse mixture-of-experts transformer with multimodal inputs and a 1-million-token context window (extended in research previews to 10 million). The exact expert and active parameter counts have not been disclosed, but Jeff Dean publicly traced its lineage to "a long line of Google research efforts on sparse models" starting with Shazeer et al. 2017. Gemini 1.5 was the first widely available production frontier model confirmed to use MoE.

Llama 4 family (2025)

Meta announced the Llama 4 herd on April 5, 2025, the first Llama generation to use mixture-of-experts. The herd consists of three models: Scout and Maverick were released with open weights, while Behemoth was previewed while still in training.

Llama 4 Scout has 17 billion active parameters across 16 experts and 109 billion total parameters, with MoE on every layer. Scout was fine-tuned to support a 10-million-token context window.
Llama 4 Maverick has 17 billion active parameters across 128 experts and 400 billion total parameters, with MoE and dense layers alternating (so experts are applied in half of the layers). Maverick supports a 1-million-token context window.
Llama 4 Behemoth is a 288-billion-active, 16-expert model with approximately 2 trillion total parameters, in training as of 2025 to serve as a teacher for distillation.

All Llama 4 models use top-1 routing, native multimodality with early fusion of text and image, and were pre-trained on more than 30 trillion tokens.

Qwen MoE family (2024 to 2025)

Alibaba's Qwen team has released several MoE generations.

Qwen1.5-MoE-A2.7B (2024) was upcycled from a dense Qwen-1.8B model, using 60 routed experts plus 4 shared experts and activating 4 routed plus 4 shared per token. It matched 7B-class dense models while activating only 2.7 billion parameters per token, with a reported 75% reduction in training cost relative to the dense Qwen1.5-7B.
Qwen3 (2025, arXiv:2505.09388) introduced both dense and MoE models, including the flagship Qwen3-235B-A22B with 235 billion total parameters and 22 billion activated. Qwen3 dropped shared experts (used in Qwen2.5-MoE), used 128 routed experts with top-8 routing, and introduced global-batch load balancing.
Qwen3-Next (preview, late 2025) is a hybrid 80B-A3B model that routes among 512 experts.

Kimi K2 (2025)

Moonshot AI released Kimi K2 in mid-2025 as a 1-trillion-parameter MoE model with 32 billion active parameters. It uses 384 experts with 8 active per token and a 128K context window. Kimi K2 was pre-trained on 15.5 trillion tokens using the Muon optimizer at unprecedented scale, with the team reporting zero training instability after a custom set of optimizer modifications. The model is positioned around agentic intelligence, including extended reasoning and tool use.

Mistral Large 3 (2025)

Mistral AI's Mistral Large 3 (released 2025) was the company's first frontier-class MoE, with 41 billion active parameters out of 675 billion total. The shift from the dense Mistral Large 2 (123B dense) signaled that even labs that had stuck with dense designs were converging on sparse architectures for frontier work.

GPT-4 (rumored)

While OpenAI has not officially confirmed the architecture of GPT-4, multiple sources have reported that it uses an MoE design. A widely cited 2023 analysis by Dylan Patel and Gerald Wong at SemiAnalysis described GPT-4 as approximately 1.76 trillion total parameters across 16 experts of approximately 111 billion MLP parameters each, with 2 experts routed per forward pass. An earlier informal claim by George Hotz described 8 experts of 220 billion parameters each. These reports were partly corroborated by Soumith Chintala, co-creator of PyTorch, but remain unconfirmed by OpenAI.

Jamba (2024)

AI21 Labs' Jamba is a hybrid architecture that combines transformer layers, Mamba (structured state space model) layers, and MoE layers (arXiv:2403.19887). It has 52 billion total parameters with 12 billion active, and offers a 256K context window. Roughly one in every eight layers uses a transformer attention mechanism; the rest use Mamba, with MoE layers interleaved. This hybrid approach reduces the memory footprint compared to a pure transformer of similar capacity.

training challenges

instability

MoE models are more prone to training instability than dense models, particularly at large scale. Sources of instability include:

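Large router logits. Large logits are exponentiated in the softmax, where they produce sharp probability distributions and can cause overflow or round-off errors, especially in lower-precision formats like bfloat16 and FP8. The router z-loss addresses this by penalizing large logits.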
Expert imbalance feedback loops. Uneven routing causes uneven gradient updates, reinforcing the imbalance.
Dropped tokens. When too many tokens exceed expert capacity and are dropped, the effective batch size shrinks unpredictably.
Discrete routing decisions. The argmax in top-k is non-differentiable; small perturbations can flip the routing assignment of a token, causing loss curve discontinuities.

Practical stabilization techniques include using full precision (float32) for the router even when experts run in bfloat16 or FP8, adding router z-loss, carefully tuning the auxiliary loss coefficient (or moving to bias-based balancing), gradient clipping, and warming up the auxiliary loss over the first few thousand steps.

overfitting during fine-tuning

Sparse MoE models are more susceptible to overfitting during fine-tuning than dense models of comparable active parameter count. This happens because MoE models have far more total parameters, but each parameter sees fewer training examples (since each expert only processes a fraction of tokens). Strategies to mitigate this include:

Using higher dropout rates within expert layers.
Using smaller batch sizes with higher learning rates.
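Freezing part of the model. Updating only the non-MoE weights (about 20% of total parameters in a typical sparse model) while freezing the MoE layers has been reported to work nearly as well as full fine-tuning at lower cost; the reverse, updating only the MoE layers, has been reported to hurt quality.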
Applying instruction tuning, which empirically benefits MoE models more than dense models.
Using parameter-efficient fine-tuning (LoRA, QLoRA, or expert-only LoRA) to limit the effective number of trainable parameters.

communication overhead

In expert parallelism, every MoE layer requires two all-to-all communications: one to dispatch tokens to the GPUs holding their assigned experts, and one to combine the results back. Research has shown that all-to-all communication can consume more than 40% of total runtime in large-scale MoE training, and up to 59.2% of forward-pass latency in the MoE layers on an 8-GPU server running DeepSeek-V2-Lite. For inference, all-to-all can contribute 10 to 30% of end-to-end latency, especially during decoding, where each token's hidden state must hop between GPUs at every MoE layer. Optimizing this communication is a major focus of systems research; representative techniques include 2DH all-to-all, fused communication-computation kernels, and sub-chunk pipelining.
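
The dispatch and combine steps can be sketched in a single process, where the two all-to-all exchanges reduce to regrouping a list. This is a toy illustration only: the function names are hypothetical, routing is top-1, and each "expert" is a stand-in that multiplies its input by a constant. In a real system each bucket would be sent to the GPU that hosts the expert.

```python
from collections import defaultdict

def dispatch(assignments):
    """Group token indices by the expert they were routed to (the dispatch step)."""
    buckets = defaultdict(list)
    for token_idx, expert_id in enumerate(assignments):
        buckets[expert_id].append(token_idx)
    return dict(buckets)

def combine(buckets, expert_outputs, num_tokens):
    """Scatter per-expert outputs back into the original token order (the combine step)."""
    out = [None] * num_tokens
    for expert_id, token_indices in buckets.items():
        for pos, token_idx in enumerate(token_indices):
            out[token_idx] = expert_outputs[expert_id][pos]
    return out

# Toy run: 6 tokens, top-1 routing to 3 experts; expert e multiplies its input by (e + 1).
tokens = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
assignments = [2, 0, 1, 0, 2, 2]

buckets = dispatch(assignments)
expert_outputs = {e: [tokens[i] * (e + 1) for i in idx] for e, idx in buckets.items()}
result = combine(buckets, expert_outputs, len(tokens))

print(buckets)  # {2: [0, 4, 5], 0: [1, 3], 1: [2]}
print(result)   # [3.0, 2.0, 6.0, 4.0, 15.0, 18.0]
```

Expert 2 receives three of the six tokens here, so its host would receive the most traffic and do the most work, which is the kind of imbalance that makes all-to-all cost uneven.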

expert specialization patterns

Research has revealed that experts in encoder models tend to develop token-level specialization. Certain experts may specialize in punctuation, proper nouns, or specific syntactic patterns. In decoder models, specialization is less interpretable; some experts appear to handle particular topical domains, others activate on rare tokens, and many appear functionally redundant in early training. Specialization typically sharpens over training, especially after the auxiliary loss is reduced.

Expert specialization collapse occurs when experts become functionally redundant, all learning similar representations instead of specializing. This negates the benefit of having multiple experts and is distinct from routing collapse (where experts are ignored entirely). Fine-grained segmentation, shared experts, and stronger regularization on the router are the most commonly cited remedies.

inference optimization

memory requirements

A key challenge for MoE inference is that, despite only activating a subset of experts per token, all expert parameters must be loaded into memory for fast access. This means MoE models have the same memory footprint as a dense model of equal total parameter count, even though they use far fewer FLOPs per token. For example, Mixtral 8x7B requires loading all 46.7 billion parameters into VRAM even though only 12.9 billion are active per token; DeepSeek-V3 requires loading 671 billion parameters even though only 37 billion are active.

Production deployments of large MoE models routinely require 8 or more GPUs with 80 GB each simply to load the model before serving any traffic. Llama 4 Maverick at 400 billion total parameters requires roughly 800 GB in 16-bit precision; DeepSeek-V3 at 671 billion fits in roughly 720 GB after FP8 packing.
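
The weight-only part of these figures follows from simple arithmetic: total parameters times bytes per parameter. The sketch below assumes 1 GB = 10^9 bytes and ignores the KV cache, activations, and any tensors kept at higher precision, so it gives a lower bound rather than a deployment estimate. The function name is hypothetical.

```python
def weight_memory_gb(total_params_billion: float, bits_per_param: float) -> float:
    """Weight-only memory in GB (1 GB = 1e9 bytes); ignores KV cache and activations."""
    return total_params_billion * bits_per_param / 8

# Memory is set by TOTAL parameters, not by the active count per token.
maverick_16bit = weight_memory_gb(400, 16)    # Llama 4 Maverick, 16-bit weights
mixtral_16bit = weight_memory_gb(46.7, 16)    # Mixtral 8x7B, 16-bit weights
deepseek_v3_fp8 = weight_memory_gb(671, 8)    # DeepSeek-V3, 8-bit weights

print(maverick_16bit)             # 800.0
print(round(mixtral_16bit, 1))    # 93.4
print(deepseek_v3_fp8)            # 671.0
```

The 671 GB result for DeepSeek-V3 is below the roughly 720 GB quoted above; this sketch counts only uniformly 8-bit weights and does not model whatever accounts for the difference.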

expert parallelism

Expert parallelism (EP) is a distribution strategy designed specifically for MoE models. Different experts are placed on different GPUs, and tokens are routed to the GPU holding their assigned expert via all-to-all communication. Non-MoE layers (such as attention) are handled via standard data or tensor parallelism.

This can be combined with other parallelism strategies:

Parallelism type	What is distributed	Applicability
Data parallelism	Different batches across devices	All model types
Tensor parallelism	Individual layer weights split across devices	Large layers
Pipeline parallelism	Different layers on different devices	Deep models
Expert parallelism	Different experts on different devices	MoE models specifically
Context parallelism	Different parts of long sequences across devices	Long-context models

NVIDIA's work on wide expert parallelism with GB200 NVL72 systems showed up to 1.8x higher per-GPU throughput compared to smaller expert-parallel configurations, by leveraging fewer experts per GPU and higher arithmetic intensity inside the high-bandwidth NVLink domain (130 TB/s coherent NVLink). Engineering teams at Meta have published case studies on combining tensor, context, and expert parallelism for serving large MoE models efficiently.

quantization

Quantization is particularly effective for MoE models because the memory savings are amplified by the large total parameter count. QMoE (Frantar and Alistarh, MLSys 2024, arXiv:2310.16795) demonstrated compression of a 1.6-trillion-parameter Switch Transformer from 3.2 TB to less than 160 GB at less than 1 bit per parameter, with only minor accuracy loss, in less than a day on a single GPU. With QMoE, the 1.6-trillion-parameter Switch Transformer could run on a single server with 4x NVIDIA A6000 GPUs at less than 5% runtime overhead relative to ideal uncompressed inference. FP8 weight quantization (used natively by DeepSeek-V3) and 4-bit AWQ or GPTQ quantization (used by community Mixtral builds) are also widely deployed.
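
The QMoE figures can be checked with a short calculation. The sketch assumes 16-bit uncompressed weights and 1 TB = 10^12 bytes; the function name is hypothetical.

```python
def bits_per_param(compressed_bytes: float, num_params: float) -> float:
    """Average storage cost per parameter, in bits."""
    return compressed_bytes * 8 / num_params

num_params = 1.6e12                            # Switch Transformer, 1.6 trillion parameters
uncompressed_tb = num_params * 16 / 8 / 1e12   # 16-bit weights, in TB
compressed_bytes = 160e9                       # the "less than 160 GB" upper bound

bpp = bits_per_param(compressed_bytes, num_params)
ratio = uncompressed_tb * 1e12 / compressed_bytes

print(round(uncompressed_tb, 2))  # 3.2
print(round(bpp, 2))              # 0.8
print(round(ratio, 1))            # 20.0
```

At the 160 GB bound the average cost is 0.8 bits per parameter, which is consistent with the "less than 1 bit per parameter" claim and corresponds to roughly 20x compression.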

expert offloading

For deployment on devices with limited GPU memory, expert offloading stores inactive expert weights in CPU memory and loads them to the GPU on demand. Pre-gated MoE takes this further by predicting which experts will be needed ahead of time and prefetching their weights, enabling single-GPU deployment of large MoE models at the cost of additional latency from CPU-GPU transfer. Open-source tools such as llama.cpp implement aggressive expert offloading to enable Mixtral 8x7B and DBRX inference on consumer GPUs with as little as 24 GB of VRAM.
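
The basic idea of expert offloading can be sketched as a small cache of resident experts with least-recently-used eviction. This is a toy illustration under assumed names (`ExpertCache`, `load_fn`), not the mechanism used by llama.cpp or Pre-gated MoE; the "weights" are placeholder strings and each miss stands in for one CPU-to-GPU copy.

```python
from collections import OrderedDict

class ExpertCache:
    """Toy LRU cache standing in for GPU memory that holds a few experts at a time."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn          # hypothetical: fetches weights from CPU memory
        self.resident = OrderedDict()   # expert_id -> weights, least recently used first
        self.transfers = 0              # number of simulated CPU-to-GPU copies

    def get(self, expert_id):
        if expert_id in self.resident:
            self.resident.move_to_end(expert_id)   # mark as most recently used
            return self.resident[expert_id]
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)      # evict the least recently used expert
        self.resident[expert_id] = self.load_fn(expert_id)
        self.transfers += 1
        return self.resident[expert_id]

cache = ExpertCache(capacity=2, load_fn=lambda e: f"weights-{e}")
for expert_id in [0, 1, 0, 2, 1]:
    cache.get(expert_id)

print(cache.transfers)       # 4
print(list(cache.resident))  # [2, 1]
```

Four of the five lookups miss in this run, which shows why offloading trades VRAM for transfer latency; prefetching schemes aim to issue those loads before the expert is actually needed.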

distillation

MoE models can be distilled into smaller dense models that retain 30 to 40% of the MoE's quality advantage over a comparably sized dense baseline. Research has also shown that sentence-level or task-level routing can be used to extract specialized sub-networks from a trained MoE for targeted deployment. The Llama 4 Behemoth model is reported to be used primarily as a teacher for distilling Scout and Maverick.

comparison with dense models

The following table summarizes the practical trade-offs between MoE and dense model architectures.

Dimension	MoE models	Dense models
Pre-training speed	Faster (4 to 7x for equivalent quality)	Slower
Total parameters	Very large (100B to 2T+)	Moderate (7B to 540B typically)
Active parameters per token	Small fraction of total	All parameters
Inference FLOPs per token	Lower for given quality level	Higher
VRAM requirement	High (must load all experts)	Proportional to parameter count
Training stability	Requires careful tuning (auxiliary loss, z-loss)	Generally more stable
Fine-tuning	Prone to overfitting; benefits from instruction tuning	More straightforward
Knowledge-intensive tasks	Generally stronger	Depends on size
Reasoning tasks	Mixed results historically; recent MoEs (DeepSeek-V3, Kimi K2) close the gap	Often stronger at similar active parameter count
Deployment complexity	Higher (expert parallelism, large memory)	Lower
Energy efficiency	Better (less compute for similar quality)	Worse
Edge / on-device	Difficult (memory)	Better suited

mathematical formulation

The general MoE output for an input x is:

y = sum_{i=1}^{N} g(x)_i * E_i(x)

where N is the number of experts, E_i is the i-th expert network, and g(x) is the gating function.

For sparse top-k routing, the gating function becomes:

g(x) = Softmax(TopK(H(x), k))

where:

H(x)_i = (x * W_g)_i + epsilon_i * Softplus((x * W_noise)_i)

and epsilon_i is sampled from a standard normal distribution. The TopK function retains only the k largest values and sets the rest to negative infinity before applying softmax. In Mixtral-style routing, the softmax is applied only over the top-k retained values; in Grok-1-style routing, it is applied over all N values before retaining the top-k.
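
The noisy top-k gate can be sketched in plain Python. The toy dimensions, weight values, and helper names below are assumptions for illustration; masking with negative infinity before the softmax is implemented equivalently by taking the softmax over the k retained logits and assigning a gate of exactly 0 to the rest (the Mixtral-style ordering described above).

```python
import math
import random

def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def matvec(x, W):
    """Row vector x (length d) times matrix W (d rows, N columns)."""
    return [sum(x[r] * W[r][c] for r in range(len(x))) for c in range(len(W[0]))]

def noisy_top_k_gate(x, W_g, W_noise, k, rng):
    # H(x)_i = (x * W_g)_i + epsilon_i * Softplus((x * W_noise)_i)
    clean = matvec(x, W_g)
    scale = [softplus(z) for z in matvec(x, W_noise)]
    h = [c + rng.gauss(0.0, 1.0) * s for c, s in zip(clean, scale)]
    # TopK: keep the k largest logits; the rest behave as -inf (gate of exactly 0).
    top = sorted(range(len(h)), key=lambda i: h[i], reverse=True)[:k]
    m = max(h[i] for i in top)                      # subtract the max for stability
    exps = {i: math.exp(h[i] - m) for i in top}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(h))]

# Toy sizes (assumed): d = 2 input features, N = 4 experts, k = 2.
W_g = [[0.5, -0.2, 0.1, 0.9], [0.3, 0.8, -0.5, 0.0]]
W_noise = [[0.0] * 4, [0.0] * 4]
gates = noisy_top_k_gate([1.0, -0.5], W_g, W_noise, k=2, rng=random.Random(0))
print(gates)  # two non-zero gates that sum to 1, zeros elsewhere
```

The final output is then the gate-weighted sum of the selected experts' outputs, so the experts with a zero gate never need to be evaluated.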

The load-balancing auxiliary loss for N experts across a batch of T tokens is:

L_balance = N * sum_{i=1}^{N} f_i * P_i

where f_i = (number of tokens assigned to expert i) / T and P_i = (sum of router probabilities for expert i) / T.

The router z-loss over a batch of B tokens (B plays the same role as T above) is:

L_z = (1 / B) * sum_{b=1}^{B} (log sum_{i=1}^{N} exp(H(x_b)_i))^2

The total training loss is the weighted sum:

L_total = L_task + alpha * L_balance + beta * L_z

with typical settings alpha = 0.001 to 0.01 and beta = 0.001.
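
The two auxiliary terms and their combination can be sketched as follows. The sketch assumes top-1 assignments and toy inputs, computes the balance term without its coefficient, and applies alpha and beta once each when the terms are summed. The function names and the default coefficient values are illustrative choices within the ranges quoted above.

```python
import math

def load_balance_loss(router_probs, assignments, num_experts):
    """N * sum_i f_i * P_i; router_probs has T rows of N probabilities each."""
    T = len(router_probs)
    f = [sum(1 for a in assignments if a == i) / T for i in range(num_experts)]
    P = [sum(row[i] for row in router_probs) / T for i in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, P))

def router_z_loss(logits):
    """Mean over tokens of the squared log-sum-exp of the router logits."""
    total = 0.0
    for row in logits:
        m = max(row)
        lse = m + math.log(sum(math.exp(v - m) for v in row))
        total += lse ** 2
    return total / len(logits)

def total_loss(task_loss, balance, z, alpha=0.01, beta=0.001):
    return task_loss + alpha * balance + beta * z

# Perfectly balanced routing over 2 experts versus fully collapsed routing.
balanced = load_balance_loss([[0.5, 0.5]] * 4, [0, 1, 0, 1], 2)
collapsed = load_balance_loss([[1.0, 0.0]] * 4, [0, 0, 0, 0], 2)
print(balanced, collapsed)  # 1.0 2.0
```

The balance term reaches its minimum of 1 under uniform routing and grows toward N as routing collapses onto one expert, which is why the coefficient only needs to be small.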

For DeepSeek-V3-style auxiliary-loss-free balancing, the gating logits are augmented with a per-expert bias before top-k selection:

score_i = (x * W_g)_i + b_i

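The bias b_i is used only to select the top-k experts; the gating weights that scale the expert outputs are still computed from the unbiased scores. The bias is updated outside the gradient computation: at each step, b_i is decreased for over-utilized experts and increased for under-utilized ones, by a small fixed step size.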
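
A minimal sketch of this update rule follows. It assumes that "over-utilized" means an expert received more tokens than the mean load in the current step, and the step sizes shown are arbitrary; both are illustrative assumptions rather than the settings reported for DeepSeek-V3. The function names are hypothetical.

```python
def select_top_k(scores, bias, k):
    """Pick experts by biased score; the bias influences selection only."""
    biased = [s + b for s, b in zip(scores, bias)]
    return sorted(range(len(scores)), key=lambda i: biased[i], reverse=True)[:k]

def update_bias(bias, counts, step=0.001):
    """Lower the bias of over-used experts and raise it for under-used ones."""
    mean_load = sum(counts) / len(counts)   # assumed threshold: mean tokens per expert
    new_bias = []
    for b, c in zip(bias, counts):
        if c > mean_load:
            new_bias.append(b - step)
        elif c < mean_load:
            new_bias.append(b + step)
        else:
            new_bias.append(b)
    return new_bias

scores = [0.9, 0.5, 0.1]
print(select_top_k(scores, [0.0, 0.0, 0.0], k=1))          # [0]
print(select_top_k(scores, [-0.5, 0.0, 0.0], k=1))         # [1]
print(update_bias([0.0, 0.0, 0.0], [10, 2, 0], step=0.1))  # [-0.1, 0.1, 0.1]
```

Because the update is a fixed step applied outside backpropagation, it adds no gradient term that competes with the task loss.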

applications beyond language models

While MoE is most widely associated with large language models, the architecture has been applied to other domains.

Computer vision. V-MoE applies sparse expert routing to Vision Transformers for image classification. Soft MoE (Puigcerver et al., 2023) and Mobile V-MoE (2023) extend the approach. Recent sparse vision models include MoE-LLaVA for multimodal understanding.
Multimodal models. MoE has been used in models that combine text and image understanding, including Gemini 1.5, Llama 4, and MoE-LLaVA.
Machine translation. GShard and NLLB-200 (Meta) used MoE for massively multilingual translation across 200 languages.
Recommender systems. MoE architectures have been applied to recommendation tasks where different user segments benefit from specialized expert networks; YouTube and Pinterest have published descriptions of MoE in production recommenders.
Speech recognition. The original 1991 work evaluated the method on a vowel discrimination task, and MoE continues to be used in modern speech recognition systems, including community-built multilingual Whisper variants.
Diffusion models. SegMoE (Segmind, 2024) and recent work on Diffusion Mixture of Experts apply expert routing to diffusion language models and image generators.
Reinforcement learning. MoE has been applied to multi-task RL where different experts handle different task families.

open-source implementations

Several libraries and frameworks support MoE training and inference.

Library	Organization	Features
MegaBlocks	Databricks (originally Stanford)	Block-sparse GPU kernels; dropless MoE; backbone of DBRX
DeepSpeed-MoE	Microsoft	Hybrid parallel training (data + tensor + expert); residual MoE; 4.5x faster inference vs. dense equivalents
Tutel	Microsoft	Optimized all-to-all; FP8/NVFP4/MXFP4 support; targets DeepSeek, Kimi K2, Qwen3
FairScale and Fairseq	Meta	Sequence modeling framework with MoE support; used in NLLB-200
Hugging Face Transformers	Hugging Face	Native MoE support since v4.36.0 (Mixtral); now covers DBRX, Mixtral, Qwen MoE, DeepSeek, Llama 4
Megatron-LM	NVIDIA	Production-scale MoE with expert parallelism and tensor parallelism
vLLM and SGLang	UC Berkeley / community	High-throughput inference with MoE-specific optimizations
MergeKit	Charles Goddard / community	"FrankenMoE" upcycling from existing dense checkpoints
OpenMoE	Community	Community-built Llama-based MoE models

design choices in modern MoE LLMs

Across the leading MoE LLMs of 2024 to 2026, several recurring design choices have stabilized.

Choice	Most common in 2024 to 2026	Notable exceptions
Router type	Top-k softmax over routed experts	Expert choice (research); top-1 (Switch, Llama 4)
Number of experts	16 to 256 routed; 1 shared	DBRX: 16; Llama 4 Maverick: 128; Kimi K2: 384
Active experts per token	2, 4, or 8	Llama 4 (1)
Shared experts	Common in DeepSeek-style designs	Qwen3 dropped them
Load balancing	Aux-loss-free (DeepSeek), aux loss + z-loss (others)	Global-batch (Qwen3)
Dropless	Standard	Earlier Switch and Mixtral allowed drops
Precision	bfloat16 or FP8 weights, float32 router
Capacity factor	1.0 to 1.25 (when capacity is enforced)	Dropless models avoid the issue

The architectural convergence is striking. By 2026, "fine-grained MoE with shared experts and bias-based load balancing" had become the de facto recipe for sparse frontier models: DeepSeek and Kimi K2 are variants on this template, while Qwen3 keeps the fine-grained experts but drops shared experts and uses global-batch load balancing.

misconceptions and clarifications

Several common misconceptions about MoE are worth addressing.

"MoE models are smaller than dense models." False. MoE models have far more total parameters than the dense models they compete with; they only have fewer active parameters per token. A MoE that activates 37 billion parameters per token from a 671-billion-parameter pool requires the full 671 billion to be loaded for fast inference.

"MoE models are 8 separate models." False. Each "expert" is a feed-forward (FFN) block inside a layer, not a complete model. Routing is decided independently at each layer, so a single token typically passes through different experts at different layers. With 32 layers and 8 experts per layer, a token traces one of 8^32 possible paths under top-1 routing, or one of 28^32 under top-2 routing (28 being the number of ways to choose 2 experts out of 8).

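The number of per-token expert paths depends on how many experts are active at each layer. The sketch below counts them for the 32-layer, 8-expert example, assuming routing at each layer is independent; the function name is hypothetical.

```python
from math import comb

def expert_paths(num_layers: int, num_experts: int, top_k: int) -> int:
    """Distinct sequences of per-layer expert subsets a single token could traverse."""
    return comb(num_experts, top_k) ** num_layers

top1_paths = expert_paths(32, 8, 1)   # 8 choices per layer
top2_paths = expert_paths(32, 8, 2)   # comb(8, 2) = 28 choices per layer

print(top1_paths == 8 ** 32)    # True
print(top2_paths == 28 ** 32)   # True
```
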
"Each expert specializes in a topic (math, code, etc.)." Mostly false. Empirical analyses of Mixtral, DBRX, and DeepSeek routes find that experts often specialize on token classes (punctuation, proper nouns, function words) rather than topics. Topical specialization sometimes emerges but is not the design goal.

"MoE saves memory at inference." Largely false. MoE saves compute and energy, not VRAM or RAM, since all expert weights must be loaded. The exception is expert offloading, which saves VRAM at the cost of CPU-GPU transfer latency.

"MoE replaces ensembling." False. Ensembling combines independently trained models; MoE jointly trains a single model with sparse activation. The ensembling analogy in the 1991 paper has limited bearing on modern sparse implementations.

references

Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. (1991). "Adaptive Mixtures of Local Experts." Neural Computation, 3(1), 79 to 87.
Jordan, M. I., & Jacobs, R. A. (1994). "Hierarchical Mixtures of Experts and the EM Algorithm." Neural Computation, 6(2), 181 to 214.
Bengio, Y., Léonard, N., & Courville, A. (2013). "Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation." arXiv:1308.3432.
Eigen, D., Ranzato, M., & Sutskever, I. (2014). "Learning Factored Representations in a Deep Mixture of Experts." arXiv:1312.4314.
Bengio, E., Bacon, P. L., Pineau, J., & Precup, D. (2015). "Conditional Computation in Neural Networks for Faster Models." arXiv:1511.06297.
Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., & Dean, J. (2017). "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer." ICLR 2017. arXiv:1701.06538.
Lepikhin, D., Lee, H., Xu, Y., Chen, D., Firat, O., Huang, Y., Krikun, M., Shazeer, N., & Chen, Z. (2021). "GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding." ICLR 2021. arXiv:2006.16668.
Fedus, W., Zoph, B., & Shazeer, N. (2022). "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity." Journal of Machine Learning Research, 23(120), 1 to 40. arXiv:2101.03961.
Riquelme, C., Puigcerver, J., Mustafa, B., Neumann, M., Jenatton, R., Susano Pinto, A., Keysers, D., & Houlsby, N. (2021). "Scaling Vision with Sparse Mixture of Experts." NeurIPS 2021. arXiv:2106.05974.
Du, N., Huang, Y., Dai, A. M., Tong, S., Lepikhin, D., Xu, Y., et al. (2022). "GLaM: Efficient Scaling of Language Models with Mixture-of-Experts." ICML 2022. arXiv:2112.06905.
Zhou, Y., Lei, T., Liu, H., Du, N., Huang, Y., Zhao, V., Dai, A., Chen, Z., Le, Q., & Laudon, J. (2022). "Mixture-of-Experts with Expert Choice Routing." NeurIPS 2022. arXiv:2202.09368.
Zoph, B., Bello, I., Kumar, S., Du, N., Huang, Y., Dean, J., Shazeer, N., & Fedus, W. (2022). "ST-MoE: Designing Stable and Transferable Sparse Expert Models." arXiv:2202.08906.
Rajbhandari, S., Li, C., Yao, Z., Zhang, M., Aminabadi, R. Y., Awan, A. A., Rasley, J., & He, Y. (2022). "DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale." ICML 2022. arXiv:2201.05596.
Hwang, C. C., et al. (2023). "Tutel: Adaptive Mixture-of-Experts at Scale." MLSys 2023.
Gale, T., Narayanan, D., Young, C., & Zaharia, M. (2022). "MegaBlocks: Efficient Sparse Training with Mixture-of-Experts." arXiv:2211.15841.
Frantar, E., & Alistarh, D. (2023). "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models." MLSys 2024. arXiv:2310.16795.
Jiang, A. Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., et al. (2024). "Mixtral of Experts." arXiv:2401.04088.
Dai, D., Deng, C., Zhao, C., Xu, R. X., Gao, H., Chen, D., et al. (2024). "DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models." ACL 2024. arXiv:2401.06066.
DeepSeek-AI. (2024). "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model." arXiv:2405.04434.
Wang, L., Gao, H., Zhao, C., Sun, X., & Dai, D. (2024). "Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts." arXiv:2408.15664.
DeepSeek-AI. (2024). "DeepSeek-V3 Technical Report." arXiv:2412.19437.
Databricks. (2024). "Introducing DBRX: A New State-of-the-Art Open LLM." Databricks Blog, March 27, 2024.
xAI. (2024). "Open Release of Grok-1." x.ai/news/grok-os and GitHub repository xai-org/grok-1.
Lieber, O., Lenz, B., et al. (2024). "Jamba: A Hybrid Transformer-Mamba Language Model." arXiv:2403.19887.
Snowflake AI Research. (2024). "Snowflake Arctic: The Best LLM for Enterprise AI." Snowflake Engineering Blog, April 24, 2024.
Qwen Team. (2024). "Qwen1.5-MoE: Matching 7B Model Performance with 1/3 Activated Parameters." Qwen blog.
Qwen Team. (2025). "Qwen3 Technical Report." arXiv:2505.09388.
Meta AI. (2025). "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI blog, April 5, 2025.
Moonshot AI. (2025). "Kimi K2: Open Agentic Intelligence." moonshotai.github.io/Kimi-K2.
Google. (2024). "Introducing Gemini 1.5, Google's next-generation AI model." Google Blog, February 15, 2024.
Gemini Team, Google. (2024). "Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context." arXiv:2403.05530.
Patel, D., & Wong, G. (2023). "GPT-4 Architecture, Infrastructure, Training Dataset, Costs, Vision, MoE." SemiAnalysis newsletter.
Sanseviero, O., Tunstall, L., Schmid, P., et al. (2023). "Mixture of Experts Explained." Hugging Face Blog. huggingface.co/blog/moe.
NVIDIA. (2025). "Scaling Large MoE Models with Wide Expert Parallelism on NVL72 Rack Scale Systems." NVIDIA Technical Blog.
