A Mixture of Experts (MoE) is a machine learning architecture that divides a problem into subtasks, each handled by a specialized sub-network called an "expert." A learned gating network (also called a router) determines which expert or experts should process each input. In modern deep learning, MoE most commonly appears as a sparse variant inside transformer models, where only a subset of experts is activated for any given input token. This allows models to scale to very large parameter counts while keeping per-token computation manageable.
MoE architectures have become central to the design of many state-of-the-art large language models, including Mixtral, DBRX, Grok-1, DeepSeek-V3, and (reportedly) GPT-4. They offer a practical path to scaling model capacity without a proportional increase in training or inference cost.
Imagine you have a really hard homework assignment that covers math, reading, science, and art. Instead of asking one friend who is okay at everything, you ask four different friends, each one the best at one subject. A "traffic director" looks at each question and sends it to whichever friend knows the answer best. That traffic director is the gating network, and each friend is an expert. The smart part is that you only bother one or two friends per question, so you get great answers without making everyone work on everything.
The MoE concept was introduced by Robert Jacobs, Michael Jordan, Steven Nowlan, and Geoffrey Hinton in their 1991 paper "Adaptive Mixtures of Local Experts," published in Neural Computation. The original system consisted of several specialist networks (experts) and a gating network that learned to assign inputs to the appropriate expert. They demonstrated the approach on a vowel discrimination task, training up to eight experts to recognize phonemes from six Japanese speakers. In the final trained model, only three of the eight experts were meaningfully active, showing that the system naturally learned to specialize.
Michael Jordan and Robert Jacobs extended the framework in 1994 with "Hierarchical Mixtures of Experts and the EM Algorithm." This version arranged experts in a tree structure with multiple levels of gating. The paper also introduced the Expectation-Maximization (EM) algorithm as an alternative to gradient descent for training MoE models, framing learning as a maximum likelihood estimation problem.
For roughly two decades after the original paper, MoE remained mostly an academic concept. Interest revived around 2013 when researchers began exploring conditional computation, the idea that different parts of a neural network could be activated dynamically depending on the input. Bengio and collaborators published work on learning factored representations in deep networks, laying conceptual groundwork for the integration of MoE into modern architectures.
The turning point came with Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean at Google in their 2017 paper "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer." They introduced a MoE layer with up to thousands of feed-forward experts and a trainable gating network that selected a sparse combination of experts per input. The approach was applied between stacked LSTM layers, producing a model with 137 billion parameters that achieved state-of-the-art results on language modeling and machine translation benchmarks at a fraction of the computational cost of dense alternatives. This paper established the template for modern sparse MoE.
From 2020 onward, MoE was integrated into transformer architectures at increasing scale:
| Year | Model/paper | Organization | Key contribution |
|---|---|---|---|
| 2020 | GShard | Google | 600B+ parameter MoE transformer for multilingual translation; top-2 expert routing; trained on 2,048 TPU v3 accelerators |
| 2021 | Switch Transformer | Google | Simplified routing to top-1 expert selection; scaled to 1.6 trillion parameters with 2,048 experts |
| 2022 | GLaM | Google | 1.2 trillion parameters across 64 experts; used 1/3 the energy of GPT-3 training |
| 2022 | ST-MoE | Google | Introduced router z-loss for training stability |
| 2022 | Expert Choice | Google | Reversed routing: experts select tokens instead of tokens selecting experts |
| 2023 | Mixtral 8x7B | Mistral AI | High-quality open-source MoE; 46.7B total parameters, 12.9B active |
| 2024 | DBRX | Databricks | Fine-grained MoE with 16 experts, 4 active; 132B total parameters |
| 2024 | Grok-1 | xAI | 314B parameter open-source MoE; 8 experts, 2 active |
| 2024 | Mixtral 8x22B | Mistral AI | 141B total parameters, 39B active; 65K context window |
| 2024 | Jamba | AI21 Labs | Hybrid Transformer-Mamba-MoE; 52B total, 12B active |
| 2024 | DeepSeekMoE | DeepSeek | Fine-grained expert segmentation with shared experts |
| 2025 | DeepSeek-V3 | DeepSeek | 671B total, 37B active; auxiliary-loss-free load balancing; FP8 training |
A standard MoE layer has two main parts:
Expert networks: A set of N independent sub-networks, typically feed-forward networks (FFNs). Each expert has the same architecture but learns different parameters, allowing it to specialize on different types of inputs.
Gating network (router): A small network that takes the input and produces a probability distribution over the experts. Formally, for an input x, the gating network computes:
G(x) = Softmax(x * W_g)
where W_g is a learned weight matrix. The output of the MoE layer is the weighted sum of expert outputs:
y = sum_i G(x)_i * E_i(x)
where E_i(x) is the output of expert i.
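The computation above can be sketched in a few lines of PyTorch. This is a minimal illustrative sketch, not the implementation of any particular model: the two-layer ReLU experts, the dimension names, and the class name are assumptions made for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseMoE(nn.Module):
    """Minimal dense MoE layer: every expert processes every input and the
    outputs are blended with the softmax gate weights (illustrative sketch)."""

    def __init__(self, d_model: int, d_hidden: int, n_experts: int):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])
        self.gate = nn.Linear(d_model, n_experts, bias=False)   # W_g

    def forward(self, x):                                        # x: [batch, d_model]
        g = F.softmax(self.gate(x), dim=-1)                      # G(x) = Softmax(x * W_g)
        expert_outs = torch.stack([e(x) for e in self.experts], dim=1)  # [batch, n_experts, d_model]
        return (g.unsqueeze(-1) * expert_outs).sum(dim=1)        # y = sum_i G(x)_i * E_i(x)

# Usage:
# layer = DenseMoE(d_model=512, d_hidden=2048, n_experts=8)
# y = layer(torch.randn(4, 512))
```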
In transformer-based models, MoE layers typically replace the feed-forward network (FFN) that follows each multi-head attention layer. Since the FFN accounts for a large share of a transformer's parameters (roughly 90% in models like PaLM-540B), replacing even a subset of FFN layers with MoE layers can dramatically increase total parameter count without proportionally increasing computation.
Common placement strategies include replacing every FFN layer with an MoE layer, or interleaving so that only every other (or every n-th) FFN layer is replaced while the remaining layers stay dense.
The gating mechanism is the most studied component of MoE design. Several approaches have been developed.
The simplest form computes a softmax over a linear projection of the input:
G(x) = Softmax(x * W_g)
This is a dense gating approach where all experts receive some weight. It works for small numbers of experts but does not scale efficiently to hundreds or thousands of experts.
Introduced by Shazeer et al. (2017), this is the foundation for most modern MoE routers. The process has three steps:
Add noise: Tunable Gaussian noise is added to the gating logits to encourage exploration.
H(x)_i = (x * W_g)_i + StandardNormal() * Softplus((x * W_noise)_i)
Keep top-k: Only the top-k values are retained; all others are set to negative infinity.
Apply softmax: The softmax is computed over the remaining values, producing a sparse distribution.
The noise helps prevent the router from always selecting the same experts and encourages different experts to be tried during training.
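The three steps can be sketched as follows. This is an illustrative sketch, assuming a single weight matrix each for the gate and the noise; the function name, shapes, and the training flag are assumptions rather than the original implementation.

```python
import torch
import torch.nn.functional as F

def noisy_top_k_gating(x, w_gate, w_noise, k=2, training=True):
    """Noisy top-k gating in the style of Shazeer et al. (2017), sketch only.
    x: [tokens, d_model]; w_gate, w_noise: [d_model, n_experts]."""
    clean_logits = x @ w_gate
    if training:
        # Step 1: add tunable Gaussian noise to encourage exploration
        noise_std = F.softplus(x @ w_noise)
        logits = clean_logits + torch.randn_like(clean_logits) * noise_std
    else:
        logits = clean_logits
    # Step 2: keep only the top-k logits, mask the rest to -inf
    topk_vals, topk_idx = logits.topk(k, dim=-1)
    masked = torch.full_like(logits, float('-inf'))
    masked = masked.scatter(-1, topk_idx, topk_vals)
    # Step 3: softmax over the remaining values -> sparse gate weights
    gates = F.softmax(masked, dim=-1)   # exactly k nonzero entries per token
    return gates, topk_idx
```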
The Switch Transformer (Fedus et al., 2022) simplified routing by setting k = 1, sending each token to a single expert. The authors showed that this preserves model quality while offering three advantages: the router computation is reduced, each expert's capacity (its per-batch token budget) can be at least halved, and the routing implementation and its communication costs are simplified.
Zhou et al. (2022) at Google proposed reversing the routing direction. Instead of tokens selecting their top-k experts, each expert selects its top-k tokens from the batch. This guarantees perfect load balancing by construction, since every expert processes exactly the same number of tokens. The approach achieved over 2x training speedup compared to top-1 and top-2 gating in an 8-billion-active-parameter model with 64 experts.
A trade-off of expert choice routing is that some tokens may be processed by many experts (receiving more computation) while others may be processed by none, requiring careful handling through residual connections.
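The core of expert choice routing reduces to transposing the affinity matrix and letting each expert take its top tokens. A minimal sketch, with an assumed function name and a fixed per-expert capacity:

```python
import torch

def expert_choice_routing(scores, capacity):
    """Expert-choice routing sketch (in the spirit of Zhou et al., 2022):
    each expert picks its top-`capacity` tokens, so every expert processes
    exactly the same number of tokens per batch.
    scores: [n_tokens, n_experts] router affinities."""
    # Transpose so each row is one expert's affinity over all tokens
    token_ids_per_expert = scores.t().topk(capacity, dim=-1).indices  # [n_experts, capacity]
    return token_ids_per_expert

# Tokens chosen by no expert pass through the residual connection unchanged,
# while popular tokens may be selected by several experts at once.
```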
Several alternative routing methods have been explored:
| Strategy | Description | Advantage |
|---|---|---|
| Hash routing | Deterministic assignment based on token hash | No learned parameters; zero routing overhead |
| Random routing | Tokens assigned to random experts | Baseline comparison; surprisingly competitive in some settings |
| Linear assignment | Global optimization of token-expert matching | Optimal assignment but computationally expensive |
| Reinforcement learning | Router trained with RL signals | Can optimize for downstream objectives |
| BASE layers | Balanced assignment via linear programming | Guaranteed balance with top-1 selection |
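Hash routing, the simplest entry in the table above, needs no learned parameters at all. A one-line sketch (the modulo hash below is an illustrative choice, not a specific published scheme):

```python
def hash_route(token_id: int, n_experts: int) -> int:
    # Deterministic, parameter-free assignment: the token's vocabulary id
    # alone decides its expert; no router network is consulted.
    return hash(token_id) % n_experts
```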
The distinction between sparse and dense MoE is fundamental to understanding modern implementations.
In a dense MoE, every expert processes every input, and their outputs are combined using the full gating weights. This is mathematically equivalent to the original 1991 formulation. Dense MoE does not save computation, since all experts run on every input, but it can still benefit from specialization through the gating weights.
In a sparse MoE, only a small subset of experts (typically 1 or 2 out of 8 to 64+) is activated per input token. This is the dominant form in modern LLMs because it decouples model capacity (total parameters) from computational cost (active parameters per token). A model with 600 billion total parameters might activate only 10-40 billion per token.
Key trade-offs between the two approaches:
| Property | Dense MoE | Sparse MoE |
|---|---|---|
| Computation per token | Proportional to total parameters | Proportional to active parameters only |
| Memory requirement | Proportional to total parameters (matches compute) | Proportional to total parameters (all experts must be loaded despite sparse activation) |
| Expert specialization | Soft (weighted combination) | Hard (only selected experts participate) |
| Load balancing | Not an issue | Requires explicit balancing mechanisms |
| Scaling potential | Limited by compute | Can scale to trillions of parameters |
Load balancing is one of the most significant practical challenges in training sparse MoE models. Without intervention, routers tend to converge toward sending most tokens to a few "popular" experts while ignoring others, a failure mode called routing collapse or expert collapse.
Routing collapse creates a self-reinforcing cycle: popular experts receive more training signal, which makes them better, which causes the router to favor them even more. Meanwhile, ignored experts receive little to no gradient updates and remain undertrained. This defeats the purpose of having multiple experts.
The most common solution is an auxiliary (or load-balancing) loss added to the training objective. The Switch Transformer formulation uses:
L_aux = alpha * N * sum_i(f_i * P_i)
where f_i is the fraction of tokens dispatched to expert i, P_i is the fraction of the router's probability allocated to expert i, and alpha is a hyperparameter controlling the strength of the balancing signal. This loss is minimized when all experts receive equal token allocations.
The hyperparameter alpha requires careful tuning. If set too high, the auxiliary loss dominates the training signal and forces artificial uniformity, degrading model quality. If set too low, it fails to prevent collapse.
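A minimal sketch of this loss, assuming top-1 dispatch so each token maps to exactly one expert; the function name and alpha = 0.01 are illustrative choices:

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_probs, expert_index, n_experts, alpha=0.01):
    """Switch-Transformer-style auxiliary loss (sketch).
    router_probs: [tokens, n_experts] softmax outputs of the router.
    expert_index: [tokens] expert each token was dispatched to."""
    # f_i: fraction of tokens dispatched to expert i
    f = F.one_hot(expert_index, n_experts).float().mean(dim=0)
    # P_i: fraction of router probability mass allocated to expert i
    P = router_probs.mean(dim=0)
    return alpha * n_experts * torch.sum(f * P)
```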
Introduced in the ST-MoE paper (Zoph et al., 2022), the router z-loss penalizes large logits entering the gating network. Large logits create sharp probability distributions that are numerically unstable (especially in lower-precision training) and tend to cause routing collapse. By keeping logits small, the z-loss stabilizes training without hurting model quality.
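The z-loss squares the log-sum-exp of each token's router logits, which grows with logit magnitude and so discourages sharp, numerically fragile distributions. A sketch with an illustrative coefficient:

```python
import torch

def router_z_loss(router_logits, coeff=1e-3):
    """Router z-loss sketch: penalizes large gating logits to keep the
    router softmax numerically well-behaved.
    router_logits: [tokens, n_experts]."""
    z = torch.logsumexp(router_logits, dim=-1)
    return coeff * (z ** 2).mean()
```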
DeepSeek-V3 introduced an alternative approach that eliminates the auxiliary loss entirely. Instead, a bias term b_i is added to each expert's gating value. This bias is adjusted dynamically during training: when an expert is underutilized, its bias increases, making it more likely to be selected; when overutilized, the bias decreases. This approach avoids the interference gradients that auxiliary losses introduce.
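A sketch of the bias-update idea, under the assumptions that the bias affects only expert selection (not output weighting) and that a fixed step size is used; the function name and gamma value are illustrative:

```python
import torch

def update_expert_bias(bias, tokens_per_expert, gamma=0.001):
    """Auxiliary-loss-free balancing sketch in the spirit of DeepSeek-V3:
    nudge the selection bias up for underloaded experts and down for
    overloaded ones after each training step."""
    mean_load = tokens_per_expert.float().mean()
    overloaded = tokens_per_expert.float() > mean_load
    adjustment = torch.full_like(bias, gamma)      # raise underloaded experts
    adjustment[overloaded] = -gamma                # lower overloaded experts
    return bias + adjustment
```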
Expert capacity sets a hard limit on how many tokens a single expert can process in a given batch. The capacity is typically computed as:
Expert Capacity = (tokens_per_batch / number_of_experts) * capacity_factor
The capacity factor is a hyperparameter, usually set between 1.0 and 2.0. A factor of 1.0 means each expert can handle exactly its "fair share" of tokens, with no buffer for imbalance. Switch Transformers found that a capacity factor of 1.0 to 1.25 worked well in practice.
When an expert reaches capacity, additional tokens routed to it are "dropped." These dropped tokens skip the expert computation and instead pass through a residual connection unchanged. Research has shown that up to about 11% of tokens can be dropped this way without significant degradation in model quality.
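A small sketch of the capacity formula and its effect; the batch size, expert count, and 1.25 factor below are example values:

```python
def expert_capacity(tokens_per_batch: int, n_experts: int, capacity_factor: float = 1.25) -> int:
    """Per-expert token budget as described above (sketch). Tokens routed to an
    expert beyond this limit are dropped and carried only by the residual path."""
    return int((tokens_per_batch / n_experts) * capacity_factor)

# Example: a batch of 4096 tokens spread over 64 experts with factor 1.25
# gives each expert a budget of 80 tokens per batch.
assert expert_capacity(4096, 64, 1.25) == 80
```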
The MegaBlocks library introduced "dropless" MoE, which avoids token dropping entirely by using block-sparse GPU kernels that can handle variable numbers of tokens per expert. DBRX adopted this approach.
GShard, by Dmitry Lepikhin, Noam Shazeer, and colleagues at Google, was the first system to scale MoE transformers beyond 600 billion parameters. It focused on multilingual neural machine translation, training a model on 2,048 TPU v3 accelerators in four days. GShard used top-2 expert routing and introduced position-based random routing for the second expert to improve load balancing. The paper also contributed a set of sharding annotation APIs and XLA compiler extensions for distributing MoE models across devices.
William Fedus, Barret Zoph, and Noam Shazeer at Google proposed the Switch Transformer, which simplified MoE routing by using top-1 expert selection instead of top-2. The largest Switch Transformer had 1.6 trillion parameters distributed across 2,048 experts. Despite this extreme sparsity, it achieved up to 7x speedup in pre-training over dense T5 models using the same computational budget. The paper also validated, for the first time, that large sparse MoE models could be trained in lower-precision bfloat16 format.
Google's Generalist Language Model (GLaM) scaled to 1.2 trillion total parameters with 64 experts per MoE layer, activating only 97 billion parameters (about 8%) per token. GLaM used 1/3 the energy of GPT-3 for training (456 MWh vs. 1,287 MWh) and half the inference FLOPs, while achieving better zero-shot and one-shot performance across 29 NLP benchmarks.
Mistral AI released Mixtral 8x7B in December 2023 and Mixtral 8x22B in April 2024, both open-source under the Apache 2.0 license.
Mixtral 8x7B shares the same architecture as Mistral 7B but replaces each FFN layer with 8 expert FFNs. A router selects 2 experts per token per layer. The model has 46.7 billion total parameters with 12.9 billion active per token. It outperformed or matched Llama 2 70B and GPT-3.5 across evaluated benchmarks despite using significantly fewer active parameters.
Mixtral 8x22B scaled this design up to 141 billion total parameters with 39 billion active, and extended the context window to 65,536 tokens.
Databricks released DBRX in March 2024 with a "fine-grained" MoE approach. Instead of the conventional 8-expert, choose-2 design, DBRX uses 16 experts and activates 4 per token. This gives 65 times more possible expert combinations compared to 8-choose-2, which the authors found improved model quality. DBRX has 132 billion total parameters with 36 billion active, uses rotary position encodings, gated linear units, and grouped query attention, and employs dropless MoE routing via the MegaBlocks library.
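The 65x figure follows directly from counting the possible expert subsets:

```python
from math import comb

print(comb(8, 2))                  # 28 combinations for the conventional 8-choose-2 design
print(comb(16, 4))                 # 1820 combinations for DBRX's 16-choose-4 design
print(comb(16, 4) // comb(8, 2))   # 65x more possible expert combinations
```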
xAI open-sourced Grok-1 under the Apache 2.0 license. It has 314 billion total parameters with 8 experts and top-2 selection, activating roughly 25% of weights per token. The architecture uses 64 layers, 48 attention heads for queries and 8 for keys/values, and supports 8-bit quantization. One notable difference from Mixtral is in the routing: Grok-1 applies top-2 selection after softmax over all 8 experts, whereas Mixtral applies softmax only over the top-2 selected experts.
DeepSeek-V3 has 671 billion total parameters with 37 billion active per token. It uses the DeepSeekMoE architecture, which introduces two strategies: (1) experts are segmented into finer-grained sub-experts (the hidden dimension of each expert is reduced while the number of experts is multiplied), enabling more flexible combinations; and (2) a subset of experts is designated as "shared experts" that are always activated for every token, capturing common knowledge and reducing redundancy in the routed experts. DeepSeek-V3 also pioneered auxiliary-loss-free load balancing and was the first model to validate FP8 mixed-precision training at this scale. It required only 2.788 million H800 GPU hours for full training.
AI21 Labs' Jamba is a hybrid architecture that combines transformer layers, Mamba (structured state space model) layers, and MoE layers. It has 52 billion total parameters with 12 billion active, and offers a 256K context window. Roughly one in every eight layers uses a transformer attention mechanism; the rest use Mamba. This hybrid approach reduces the memory footprint compared to a pure transformer of similar capacity.
While OpenAI has not officially confirmed the architecture of GPT-4, multiple sources have reported that it uses a MoE design. One widely cited leak described it as an 8-expert model with approximately 220 billion parameters per expert, totaling around 1.76 trillion parameters. Another report described 16 experts with approximately 111 billion MLP parameters each, with 2 experts routed per forward pass. These reports were informally corroborated by Soumith Chintala, co-creator of PyTorch, but remain unconfirmed.
MoE models are more prone to training instability than dense models, particularly at large scale. Sources of instability include the router softmax, which exponentiates logits and so amplifies small numerical errors (especially in bfloat16), the discrete and abruptly changing routing decisions, and the interaction between auxiliary balancing losses and the main training objective.
Practical stabilization techniques include using full precision (float32) for the router even when experts run in bfloat16, adding router z-loss, and carefully tuning the auxiliary loss coefficient.
Sparse MoE models are more susceptible to overfitting during fine-tuning than dense models of comparable active parameter count. This happens because MoE models have far more total parameters, but each parameter sees fewer training examples (since each expert only processes a fraction of tokens). Strategies to mitigate this include increasing dropout within the expert layers, fine-tuning only a subset of the parameters (for example, freezing the MoE layers and updating only the dense ones), and applying instruction tuning, from which MoE models benefit disproportionately.
Research has revealed that experts in encoder models tend to develop token-level specialization. For example, certain experts may specialize in punctuation, proper nouns, or specific syntactic patterns. In decoder models, specialization patterns are less pronounced and harder to interpret.
Expert specialization collapse occurs when experts become functionally redundant, all learning similar representations instead of specializing. This negates the benefit of having multiple experts and is distinct from routing collapse (where experts are ignored entirely).
A key challenge for MoE inference is that, despite only activating a subset of experts per token, all expert parameters must be loaded into memory. This means MoE models have the same memory footprint as a dense model of equal total parameter count, even though they use far fewer FLOPs per token. For example, Mixtral 8x7B requires loading all 46.7B parameters into VRAM even though only 12.9B are active per token.
Production deployments of large MoE models routinely require 8 or more GPUs with 80 GB each simply to load the model before serving any traffic.
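A rough weight-only sizing sketch (activations and KV cache add more on top) shows why. The helper name and the byte-per-parameter assumptions below are illustrative:

```python
def weight_memory_gb(total_params: float, bytes_per_param: float) -> float:
    # Weight memory only; all experts must be resident even though
    # only a few are active per token.
    return total_params * bytes_per_param / 1e9

print(weight_memory_gb(46.7e9, 2))   # Mixtral 8x7B in bf16 -> ~93 GB, beyond a single 80 GB GPU
print(weight_memory_gb(671e9, 1))    # DeepSeek-V3 in FP8   -> ~671 GB, more than 8 x 80 GB
```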
Expert parallelism is a distribution strategy designed specifically for MoE models. Different experts are placed on different GPUs, and tokens are routed to the GPU holding their assigned expert via all-to-all communication. Non-MoE layers (such as attention) are handled via standard data or tensor parallelism.
This can be combined with other parallelism strategies:
| Parallelism type | What is distributed | Applicability |
|---|---|---|
| Data parallelism | Different batches across devices | All model types |
| Tensor parallelism | Individual layer weights split across devices | Large layers |
| Pipeline parallelism | Different layers on different devices | Deep models |
| Expert parallelism | Different experts on different devices | MoE models specifically |
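To make the expert-parallel dispatch step concrete, here is a minimal single-process sketch of the grouping that precedes the all-to-all exchange; in a real system each group would be shipped to the GPU hosting the corresponding expert. The function name and example values are illustrative:

```python
import torch

def build_dispatch_plan(expert_index, n_experts):
    """Group token positions by their assigned expert (sketch).
    In expert parallelism, each group is sent via all-to-all to the device
    holding that expert, processed there, and returned to its original slot.
    expert_index: [tokens] expert id chosen by the router for each token."""
    return [torch.nonzero(expert_index == e, as_tuple=True)[0] for e in range(n_experts)]

# Example: 8 tokens routed across 4 experts
plan = build_dispatch_plan(torch.tensor([2, 0, 3, 2, 1, 0, 0, 3]), n_experts=4)
# plan[0] -> positions of the tokens assigned to expert 0, and so on.
```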
NVIDIA's work on wide expert parallelism with GB200 NVL72 systems showed up to 1.8x higher per-GPU throughput compared to smaller expert-parallel configurations, by leveraging fewer experts per GPU and higher arithmetic intensity.
Quantization is particularly effective for MoE models because the memory savings are amplified by the large total parameter count. QMoE demonstrated compression of a 1.6-trillion-parameter Switch Transformer from 3.2 TB to 160 GB at less than 1 bit per parameter, making deployment on commodity hardware feasible.
For deployment on devices with limited GPU memory, expert offloading stores inactive expert weights in CPU memory and loads them to the GPU on demand. Pre-gated MoE takes this further by predicting which experts will be needed ahead of time and prefetching their weights, enabling single-GPU deployment of large MoE models at the cost of additional latency from CPU-GPU transfer.
MoE models can be distilled into smaller dense models that retain 30-40% of the MoE's quality advantage over a comparably sized dense baseline. Research has also shown that sentence-level or task-level routing can be used to extract specialized sub-networks from a trained MoE for targeted deployment.
The following table summarizes the practical trade-offs between MoE and dense model architectures:
| Dimension | MoE models | Dense models |
|---|---|---|
| Pre-training speed | Faster (4-7x for equivalent quality) | Slower |
| Total parameters | Very large (100B-1T+) | Moderate (7B-540B typically) |
| Active parameters per token | Small fraction of total | All parameters |
| Inference FLOPs per token | Lower for given quality level | Higher |
| VRAM requirement | High (must load all experts) | Proportional to parameter count |
| Training stability | Requires careful tuning (auxiliary loss, z-loss) | Generally more stable |
| Fine-tuning | Prone to overfitting; benefits from instruction tuning | More straightforward |
| Knowledge-intensive tasks | Generally stronger | Depends on size |
| Reasoning tasks | Mixed results; sometimes weaker | Often stronger at similar active parameter count |
| Deployment complexity | Higher (expert parallelism, large memory) | Lower |
| Energy efficiency | Better (less compute for similar quality) | Worse |
The general MoE output for an input x is:
y = sum_{i=1}^{N} g(x)_i * E_i(x)
where N is the number of experts, E_i is the i-th expert network, and g(x) is the gating function.
For sparse top-k routing, the gating function becomes:
g(x) = Softmax(TopK(H(x), k))
where:
H(x)_i = (x * W_g)_i + epsilon_i * Softplus((x * W_noise)_i)
and epsilon_i is sampled from a standard normal distribution. The TopK function retains only the k largest values and sets the rest to negative infinity before applying softmax.
The load-balancing auxiliary loss for N experts across a batch of T tokens is:
L_balance = alpha * N * sum_{i=1}^{N} f_i * P_i
where f_i = (number of tokens assigned to expert i) / T and P_i = (sum of router probabilities for expert i) / T.
While MoE is most widely associated with large language models, the architecture has been applied to other domains, including computer vision (e.g., the V-MoE vision transformer), speech recognition, multimodal models, and recommender systems (e.g., multi-gate MoE for multi-task ranking).
Several libraries and frameworks support MoE training and inference:
| Library | Organization | Features |
|---|---|---|
| MegaBlocks | Databricks | Block-sparse GPU kernels; dropless MoE |
| DeepSpeed-MoE | Microsoft | Hybrid parallel training (data + tensor + expert parallelism) |
| Fairseq | Meta | Sequence modeling framework with MoE support |
| Hugging Face Transformers | Hugging Face | Native MoE support since v4.36.0 (Mixtral, DBRX, etc.) |
| Tutel | Microsoft | Optimized all-to-all communication for MoE |
| OpenMoE | Community | Community-built Llama-based MoE models |