Switch Transformer

Updated Jul 11, 2026

The Switch Transformer is a sparsely activated Mixture of Experts (MoE) Transformer architecture introduced by William Fedus, Barret Zoph, and Noam Shazeer at Google in January 2021.^[1] It scales model capacity by routing each input token to exactly one expert (a single feed-forward network) selected from a pool of N experts, using a learned gating function called the switch.^[1] Compared with prior MoE work that combined the outputs of several experts per token, Switch Transformer simplified the routing decision to top-1 and showed that the resulting sparse models could be trained reliably at the trillion-parameter scale.^[1] The authors describe their contribution plainly: "We simplify the MoE routing algorithm and design intuitive improved models with reduced communication and computational costs."^[1]

The paper, titled Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity, first appeared on arXiv on January 11, 2021 and was published in the Journal of Machine Learning Research in 2022 (volume 23, paper 21-0998).^[1] It demonstrated up to 7x faster pre-training than the equivalent dense T5 baseline at the same compute budget, and trained the first publicly disclosed 1.6 trillion parameter neural network, Switch-C, with 2,048 experts.^[1] Switch Transformer is widely considered the work that turned MoE from a research curiosity into a practical recipe for production language models, and it directly influenced Mixtral, DeepSeek V2 and V3, GLaM, ST-MoE, Grok-1, and other large MoE systems.

What problem did the Switch Transformer solve?

The core idea behind a Mixture of Experts is conditional computation: only a fraction of a neural network's parameters need to be activated for each input. The classical formulation, due to Jacobs, Jordan, Nowlan, and Hinton in the early 1990s, used a small number of experts and a soft gating function. The idea was rediscovered for deep learning by Shazeer et al. in the 2017 ICLR paper Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer.^[2] Shazeer's MoE layer was inserted between LSTM stacks for language modelling and machine translation; it scaled to 137 billion parameters using top-k routing (typically k = 2 or k = 4) with a noisy top-k gating function and an auxiliary load balancing loss.^[2]

By 2020, two trends made MoE attractive again. First, dense Transformers like GPT-3 (175 billion parameters) and T5-XXL (11 billion) were hitting the limits of synchronous data-parallel training. Second, Google's GShard project (Lepikhin et al., 2020) extended sparse MoE to a 600 billion parameter multilingual translation Transformer, again with top-2 routing, and added the XLA-based sharding annotations needed to actually run such models on TPU pods.^[3] GShard ran on 2,048 TPU v3 chips for four days and replaced every other Transformer feed-forward layer with an MoE layer.^[3]

The Switch Transformer paper diagnosed why MoE had not been widely adopted despite these results: "its widespread adoption has been hindered by complexity, communication costs, and training instability."^[1] Its contribution was both a simplification and a scaling study. Top-2 routing made each MoE layer roughly twice as expensive in expert compute and communication as routing to a single expert; the second expert was rarely much better than the first; and the implementation was complicated by the need to combine and renormalise expert outputs. Fedus, Zoph, and Shazeer asked whether top-1 routing could match the quality of top-2 with simpler engineering. The answer was yes, and it unlocked the trillion-parameter regime.^[1]

Who wrote the Switch Transformer paper and when was it published?

The paper was written by William Fedus, Barret Zoph, and Noam Shazeer, all members of Google Brain at the time.^[1] Shazeer was already the dominant figure in MoE work, having co-authored both Outrageously Large Neural Networks (2017) and the Attention Is All You Need (2017) Transformer paper.^[2] Zoph had previously led work on neural architecture search; Fedus was a Google Brain resident.

The arXiv preprint went live on January 11, 2021.^[1] After several revisions, the final version appeared in the Journal of Machine Learning Research in 2022 (JMLR 23, paper 21-0998, pages 1 to 40; cited as 23(120):1-40).^[1] As of 2026 the paper has more than 4,000 citations and is one of the most heavily cited machine learning papers of the early 2020s.

How does the Switch Layer work?

A standard Transformer block consists of a multi-head self-attention sublayer followed by a position-wise feed-forward network (FFN). In Switch Transformer, the FFN sublayer is replaced by a Switch Layer; the attention sublayer, layer normalisation, and residual connections are unchanged.^[1] A Switch Layer is built from N independent expert FFNs (each with its own weights) plus a small router.^[1]

For an input token with hidden state $x$ of dimension $d_{\text{model}}$, the router computes logits $h(x) = W_r \, x$ using a learned matrix $W_r$ of shape $(N, d_{\text{model}})$, one row per expert. A softmax over the N experts produces routing probabilities $p_i(x) = \exp(h_i(x)) / \sum_j \exp(h_j(x))$. The Switch Layer then picks the index $i^*$ with the highest probability (top-1) and computes the output as $y = p_{i^*}(x) \cdot E_{i^*}(x)$, where $E_{i^*}$ is the chosen expert's FFN.^[1] Multiplying by the gate value $p_{i^*}(x)$ lets the gradient flow back to the router, which would otherwise receive no training signal from the discrete argmax. Crucially, the input batch is routed token by token, so different tokens in the same sentence can visit different experts in the same Switch Layer.
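The routing rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's Mesh TensorFlow code: the function names, the toy router weights, and the lambda "experts" (stand-ins for real feed-forward networks) are all assumptions made for the example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def switch_route(x, router_weights):
    """Return (expert index, gate value) for one token.

    router_weights holds one row per expert, each of length d_model,
    so the logits are the matrix-vector product W_r x.
    """
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)  # top-1 choice
    return best, probs[best]

def switch_layer(x, router_weights, experts):
    """Apply only the chosen expert and scale its output by the gate value."""
    best, gate = switch_route(x, router_weights)
    expert_out = experts[best](x)
    return [gate * v for v in expert_out], best

# Toy setup (hypothetical values): d_model = 2, N = 3 experts.
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
experts = [
    lambda x: [2.0 * v for v in x],   # stand-in for expert FFN 0
    lambda x: [v + 1.0 for v in x],   # stand-in for expert FFN 1
    lambda x: [-v for v in x],        # stand-in for expert FFN 2
]
token = [0.5, 2.0]
y, chosen = switch_layer(token, router_weights, experts)
```

Here the second router row gives the largest logit, so only expert 1 runs; the other two experts cost nothing for this token.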

In practice, Switch Layers replace the FFN in roughly every other Transformer block in both the encoder and the decoder, following the GShard convention.^[3] The Hugging Face implementation makes this configurable through num_sparse_encoder_layers and num_sparse_decoder_layers, which default to three sparse layers in a twelve-layer base model.^[11]

How does top-1 routing differ from prior MoE?

The shift from top-2 to top-1 routing is the single defining choice of Switch Transformer.^[1] The motivation is mostly engineering. With top-2 routing, each token sends activations to two experts, doubling the all-to-all communication volume in distributed training and roughly doubling the compute per Switch Layer. Top-1 routing halves both costs while letting the model spend the saved budget on more experts or larger experts.^[1]

The risk was that one expert per token might not provide enough signal to learn good representations. The paper showed empirically that top-1 routing performs as well as or better than top-2 across a wide range of model sizes once a few stability tricks are added, and that the simpler routing makes it easier to push N (the number of experts) into the thousands.^[1] Switch Transformer's largest model, Switch-C, uses 2,048 experts per Switch Layer.^[1]

Training stability tricks

Sparse models are notoriously hard to train. Routing decisions are discrete, gradients flow through a softmax that is bottlenecked by the chosen expert, and the load balancing loss can interact badly with the main training objective. Fedus, Zoph, and Shazeer introduced a small bag of tricks that have since become standard MoE practice, and reported that these techniques "help wrangle the instabilities" and let large sparse models train, for the first time, in lower-precision (bfloat16) formats.^[1]

Selective precision. Mixed-precision training in bfloat16 caused divergences in early experiments. The fix was to keep the router computation (the small $W_r$ matrix multiply, the softmax, and the loss bookkeeping) in float32 while letting the experts and dispatch tensors stay in bfloat16.^[1] Because the router is a tiny fraction of total compute, the float32 cost is negligible, but the numerical headroom prevents the routing softmax from collapsing.

Smaller initialisation. The default Transformer initialisation scale ( $s = 1.0$ truncated normal) was too large for sparse models. The paper recommends shrinking the initialisation factor by an order of magnitude, to $s = 0.1$ , which sharply reduces variance in early training steps.^[1]
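The effect of the smaller scale can be seen in a short sketch. It assumes the fan-in scaled form $\sigma = \sqrt{s / n}$ (with $n$ the number of input units) and truncation at two standard deviations; both are stated here as assumptions, and the helper names and matrix sizes are hypothetical.

```python
import math
import random

def truncated_normal_init(fan_in, fan_out, scale, rng):
    """Draw a fan_in x fan_out matrix from a truncated normal.

    Assumes std = sqrt(scale / fan_in) and rejects samples beyond
    two standard deviations (a common truncation convention).
    """
    std = math.sqrt(scale / fan_in)

    def draw():
        while True:
            v = rng.gauss(0.0, std)
            if abs(v) <= 2.0 * std:
                return v

    return [[draw() for _ in range(fan_out)] for _ in range(fan_in)]

def rms(matrix):
    """Root-mean-square of all entries, a proxy for weight magnitude."""
    values = [v for row in matrix for v in row]
    return math.sqrt(sum(v * v for v in values) / len(values))

rng = random.Random(0)
w_default = truncated_normal_init(512, 64, scale=1.0, rng=rng)  # s = 1.0
w_small = truncated_normal_init(512, 64, scale=0.1, rng=rng)    # s = 0.1
ratio = rms(w_small) / rms(w_default)  # close to sqrt(0.1)
```

Because the standard deviation scales with the square root of $s$, a tenfold reduction in $s$ shrinks typical weight magnitudes by a factor of about 3.2 under this assumption.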

Expert dropout. During fine-tuning the paper applies a much higher dropout rate inside expert FFN layers (about 0.4) than in the rest of the network (about 0.1).^[1] This asymmetric regularisation limits over-fitting when small downstream datasets expose the very large expert FFNs to too few examples.

Capacity factor and expert capacity. Because hardware needs static tensor shapes, each expert is given a fixed capacity per batch. Expert capacity is computed as (tokens_per_batch / num_experts) * capacity_factor. A capacity factor of 1.0 reserves exactly the average load per expert; values around 1.0 to 1.25 worked best in the paper.^[1] Tokens that overflow an expert's capacity are dropped: their representation passes through the residual stream unchanged. The paper reports drop rates below 1 percent at typical capacity factors.^[1]

Auxiliary load balancing loss. Without an explicit incentive, the router quickly collapses to a few favoured experts. Switch Transformer adds an auxiliary loss $L_{\text{aux}} = \alpha \cdot N \cdot \sum_e (f_e \cdot P_e)$, where $f_e$ is the fraction of tokens routed to expert e, $P_e$ is the fraction of total router probability mass assigned to expert e, and $\alpha$ is a small coefficient ($10^{-2}$ in the paper).^[1] Under perfectly uniform routing $f_e = P_e = 1/N$, so $\sum_e f_e P_e = 1/N$ and the loss equals $\alpha$; it grows whenever the router concentrates tokens and probability mass on the same experts.

Capacity factor and token dropping

The capacity factor is one of the most important MoE knobs and was popularised by Switch Transformer. Setting it too low causes excessive token dropping; setting it too high wastes memory and communication bandwidth. The paper sweeps the capacity factor and finds that 1.0 works well at training time and that 2.0 is a reasonable choice during evaluation when memory is less constrained.^[1] The token-dropping mechanism, in which over-capacity tokens are simply skipped at the Switch Layer and pass through the residual connection, has been criticised but is the simplest way to reconcile dynamic routing with static shapes.
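A small sketch makes the capacity arithmetic and the dropping rule concrete. It uses the formula given earlier, (tokens_per_batch / num_experts) * capacity_factor; rounding up and filling each expert in token order are assumptions of this example, and the function names are hypothetical.

```python
import math

def expert_capacity(tokens_per_batch, num_experts, capacity_factor):
    """Slots reserved per expert; rounding up is an assumption of this sketch."""
    return math.ceil(tokens_per_batch / num_experts * capacity_factor)

def dispatch_with_capacity(assignments, num_experts, capacity_factor):
    """Split token indices into kept and dropped under a fixed capacity.

    assignments[i] is the expert chosen for token i. Tokens fill each
    expert in position order (an assumption); overflow tokens are dropped
    and would pass through the residual connection unchanged.
    """
    capacity = expert_capacity(len(assignments), num_experts, capacity_factor)
    load = [0] * num_experts
    kept, dropped = [], []
    for token_idx, expert in enumerate(assignments):
        if load[expert] < capacity:
            load[expert] += 1
            kept.append(token_idx)
        else:
            dropped.append(token_idx)
    return capacity, kept, dropped

# Eight tokens, four experts, with expert 0 over-subscribed.
assignments = [0, 0, 0, 1, 2, 0, 3, 1]
cap_1, kept_1, dropped_1 = dispatch_with_capacity(assignments, 4, 1.0)
cap_2, kept_2, dropped_2 = dispatch_with_capacity(assignments, 4, 2.0)
```

With a capacity factor of 1.0 each expert has two slots, so two of the four tokens sent to expert 0 overflow. Raising the factor to 2.0 gives four slots and nothing is dropped, at the cost of reserving twice the buffer space for every expert.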

Auxiliary loss formula in detail

Given a batch of T tokens routed across N experts, define for expert e:

$f_e = \frac{1}{T} \cdot (\text{number of tokens for which } e \text{ was the top-1 expert})$
$P_e = \frac{1}{T} \sum_t p_e(x_t)$

Then $L_{\text{aux}} = \alpha \cdot N \cdot \sum_{e=1}^{N} f_e \cdot P_e$.^[1] Both $f_e$ and $P_e$ are non-negative and sum to 1 over experts. The loss encourages uniform routing: when both vectors are uniform, $f_e = P_e = 1/N$ and the sum equals $N \cdot (1/N)^2 = 1/N$, giving $L_{\text{aux}} = \alpha$, while concentrating tokens and probability mass on the same experts drives the sum towards 1 and the loss towards $\alpha \cdot N$. The product form is differentiable through $P_e$ (an average of softmax outputs) but not through $f_e$ (a hard top-1 count); only the soft term contributes gradients to the router. ST-MoE later added a router z-loss that penalises large pre-softmax logits and further improves stability.^[6]
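The definitions of $f_e$, $P_e$, and $L_{\text{aux}}$ can be checked numerically with a short sketch. The function name and the two toy batches are illustrative assumptions; the code computes only the loss value, whereas a real implementation would use an autodiff framework so that gradients flow through $P_e$.

```python
def load_balancing_loss(router_probs, alpha=0.01):
    """Compute alpha * N * sum_e f_e * P_e for a batch of router outputs.

    router_probs is a list of T rows, each a length-N probability vector.
    """
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    f = [0.0] * num_experts  # fraction of tokens whose top-1 expert is e
    P = [0.0] * num_experts  # mean router probability assigned to e
    for probs in router_probs:
        top1 = max(range(num_experts), key=probs.__getitem__)
        f[top1] += 1.0 / num_tokens
        for e in range(num_experts):
            P[e] += probs[e] / num_tokens
    return alpha * num_experts * sum(fe * pe for fe, pe in zip(f, P))

# Four tokens, four experts: each token prefers a different expert.
balanced = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
# Every token prefers expert 0.
collapsed = [[0.7, 0.1, 0.1, 0.1] for _ in range(4)]

loss_balanced = load_balancing_loss(balanced)    # equals alpha
loss_collapsed = load_balancing_loss(collapsed)  # larger than alpha
```

In the balanced batch both vectors are uniform and the loss is $\alpha = 0.01$; in the collapsed batch $f = (1, 0, 0, 0)$ and $P_0 = 0.7$, so the loss rises to $0.01 \cdot 4 \cdot 0.7 = 0.028$.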

Which models did the paper train?

The paper reports four primary model variants, all built on the T5 encoder-decoder architecture and pre-trained on the Colossal Clean Crawled Corpus (C4).^[1] The dense T5 baselines are matched on FLOPs per token to the corresponding Switch model, so a fair compute comparison is possible.^[1]

Model	Total parameters	Experts per Switch Layer	Matched dense baseline	Pre-training speed-up vs baseline
Switch-Base	7.4 B	up to 128	T5-Base (223 M)	up to 7.5x
Switch-Large	26.3 B	up to 128	T5-Large (739 M)	about 4.4x
Switch-XXL	395 B	64	T5-XXL (11 B)	4x to 9x on speed-quality trade-off
Switch-C	1.571 T	2,048	T5-XXL (11 B)	reaches T5-XXL quality 4x faster

Switch-Base reaches the language modelling perplexity of a fully converged T5-Base in roughly one-seventh of the wall-clock time on the same TPU pod.^[1] Switch-XXL provides a more compute-intensive comparison: at the same compute budget it beats T5-XXL by a wide margin, and at the same target perplexity it finishes 4 to 9 times faster depending on how the comparison is drawn.^[1] Switch-C, with 1.571 trillion parameters and 2,048 experts, was the first publicly disclosed model to cross the trillion-parameter threshold, outperforming the 11 billion parameter T5-XXL in pre-training perplexity while finishing in about one-quarter of the time.^[1] Notably, Switch-C exhibited "no training instability at all," in contrast with the smaller but deeper Switch-XXL, which required the stability tricks above to train at all.^[1]

The Hugging Face release later expanded the public collection to include google/switch-base-8, google/switch-base-16, google/switch-base-32, google/switch-base-64, google/switch-base-128, google/switch-base-256, google/switch-large-128, and google/switch-c-2048.^[11] These checkpoints make Switch Transformer one of the most accessible large MoE families for research.

Pre-training and fine-tuning results

On the C4 pre-training task, Switch models reach better (higher) negative log perplexity per FLOP than their dense T5 counterparts at every model scale tested.^[1] On downstream fine-tuning, the picture is more mixed. SuperGLUE scores improved by about 4.4 points for Switch-Base over T5-Base and about 2.0 points for Switch-Large over T5-Large.^[1] On Winogrande, Switch-Base reached 73.3 versus T5-Base at 66.6, and Switch-Large reached 83.0 versus T5-Large at 79.1.^[1] On TriviaQA closed-book question answering, Switch-Large scored 36.9 versus T5-Large at 29.5.^[1] On the smaller ARC reasoning datasets, however, dense models occasionally outperformed sparse variants, foreshadowing the well-known difficulty of fine-tuning very wide MoE models on small downstream tasks.^[1]

The most striking transfer result was multilingual. The paper trained Switch versions of mT5 on the mC4 corpus covering 101 languages and showed improvements on every single language compared with mT5-Base, with a mean speed-up of about 5x and 91 percent of languages achieving a 4x or greater speed-up to a target quality.^[1]

Distillation

Sparse models have a serious deployment problem: even though only a small fraction of the parameters is active per token, all expert weights must be loaded into memory. To address this, the paper studied distilling sparse Switch teachers into smaller dense students.^[1] Distilling Switch-Base (7.4 B parameters) into a 223 M parameter student preserved roughly 30 percent of the teacher's quality gain over the same dense student trained from scratch, while shrinking the model by about 97 percent (223 M is roughly 3 percent of 7.4 B).^[1] A larger 14.7 B sparse teacher distilled into the same 223 M student preserved about 28 percent of the gains at 99 percent compression.^[1] SuperGLUE fine-tuning showed similar 30 percent quality preservation.^[1] These distillation experiments were one of the first practical demonstrations that sparse pre-training could improve dense models, and they directly inspired the dense distillation pipelines used by later MoE systems.

Training infrastructure

Switch Transformer was trained on Google TPU v3 pods using Mesh TensorFlow, the precursor to the JAX-based pjit programming model.^[1] Mesh TensorFlow expresses parallelism by naming tensor dimensions and laying them out across a logical processor mesh. The Switch Transformer code mixes three forms of parallelism:

Data parallelism for the dense parts of the model (attention, embeddings, layer norms).
Expert parallelism for the Switch Layers, where each expert lives on a different TPU core.
Model (tensor) parallelism for the largest configurations, sharding the dense weights across cores.

During a forward pass, an all-to-all communication shuffles tokens to the TPU cores hosting their assigned experts, the experts run locally, and a second all-to-all returns the results to the cores the tokens came from.^[1] The all-to-all is the dominant communication cost and the main reason MoE models are bandwidth-hungry. The original Mesh TensorFlow implementation eventually moved into the open-source t5x and flaxformer codebases under JAX.^[11]
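The dispatch and combine steps can be imitated in a single process to show the data movement. This is only a conceptual sketch: the buckets stand in for per-core buffers, no real communication takes place, and the function names and toy experts are hypothetical.

```python
def dispatch(tokens, assignments, num_experts):
    """Group tokens by assigned expert, remembering original positions.

    Each bucket plays the role of the buffer received by the core that
    hosts one expert after the first all-to-all.
    """
    buckets = [[] for _ in range(num_experts)]
    for position, (token, expert) in enumerate(zip(tokens, assignments)):
        buckets[expert].append((position, token))
    return buckets

def combine(buckets, experts, num_tokens):
    """Run each expert on its bucket and scatter results back in order.

    This mirrors the second all-to-all, which returns expert outputs to
    the positions the tokens originally occupied.
    """
    outputs = [None] * num_tokens
    for expert_fn, bucket in zip(experts, buckets):
        for position, token in bucket:
            outputs[position] = expert_fn(token)
    return outputs

# Scalars stand in for token vectors; lambdas stand in for expert FFNs.
tokens = [1.0, 2.0, 3.0, 4.0]
assignments = [1, 0, 1, 0]
experts = [lambda v: v * 10.0, lambda v: v + 0.5]

buckets = dispatch(tokens, assignments, num_experts=2)
outputs = combine(buckets, experts, num_tokens=len(tokens))
```

Every token crosses the interconnect twice per Switch Layer in this pattern, which is why the all-to-all volume, rather than expert compute, often sets the speed limit.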

Strengths

Decouples capacity from per-token compute. A Switch model with 1.6 trillion parameters has the inference FLOPs of a much smaller dense model. This is the central appeal of MoE.
Faster pre-training to a target quality. All Switch variants reach the perplexity of their dense T5 baseline in a fraction of the wall-clock time at the same compute.
Simpler than top-2 routing. Removing the second expert eliminated half the routing communication and a lot of code.
Trillion-parameter feasibility. Switch-C was a proof point that sparse models could scale to that range without exotic infrastructure.
Distillability. Sparse teachers transfer non-trivial gains to dense students, providing a deployment escape hatch.
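The first point, capacity decoupled from per-token compute, follows from simple parameter counting. The sketch below considers a single Switch Layer with a two-matrix expert FFN and no biases; the dimensions are hypothetical round numbers, and attention, embeddings, and all other layers are ignored.

```python
def switch_ffn_params(d_model, d_ff, num_experts):
    """Return (total, active-per-token) parameter counts for one Switch Layer.

    Each expert is assumed to be a two-matrix FFN (d_model x d_ff and
    d_ff x d_model) without biases; the router adds d_model * num_experts.
    """
    per_expert = 2 * d_model * d_ff
    router = d_model * num_experts
    total = num_experts * per_expert + router
    active = per_expert + router  # top-1: one expert plus the router
    return total, active

# Hypothetical dimensions chosen only for illustration.
dense_total, dense_active = switch_ffn_params(768, 3072, num_experts=1)
sparse_total, sparse_active = switch_ffn_params(768, 3072, num_experts=128)

capacity_growth = sparse_total / dense_total   # roughly 128x more weights
compute_growth = sparse_active / dense_active  # only slightly above 1x
```

Going from 1 to 128 experts multiplies the stored weights by about 128 while the parameters touched per token grow only by the size of the router, which is the trade the Switch Layer is designed to make.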

Weaknesses and limitations

Memory footprint. All experts must be resident in device memory or fast storage, even when inactive. Switch-C requires hundreds of TPU cores' worth of HBM just to hold weights.
Communication overhead. The all-to-all token shuffle dominates training and inference cost on slow interconnects, which is one reason MoE models prefer high-bandwidth fabrics like NVLink and TPU optical interconnects.
Routing instability. Without the stability tricks (selective precision, small init, z-loss in successors), routers can collapse or diverge.
Fine-tuning on small tasks. Wide sparse models are over-parameterised for many downstream datasets, and they can underperform dense baselines on small benchmarks.
Inference complexity. Dynamic routing complicates batching, latency profiling, and quantisation.
Token dropping. The capacity-factor mechanism explicitly drops over-capacity tokens, which is benign at training time but awkward to reason about.

How did the Switch Transformer influence later MoE models?

Switch Transformer set the modern MoE template. Almost every subsequent large sparse model uses the same structural pattern (replace some FFN layers with N experts plus a router), the same auxiliary load balancing loss, capacity factor, and selective precision, and either top-1 or top-2 routing.^[10] Major successors include:

Model	Year	Total params	Active params	Experts per layer	Routing	Notes
Sparsely-Gated MoE (Shazeer)	2017	up to 137 B	small	1024 to 65536	top-2/top-4	First deep-learning MoE.
GShard (Lepikhin)	2020	600 B	small	up to 2048	top-2	XLA sharding for translation.
Switch Transformer (Fedus)	2021	up to 1.571 T	constant	up to 2048	top-1	Trillion-parameter milestone.
GLaM (Du)	2021/2022	1.2 T	97 B	64	top-2	First MoE LLM at GPT-3 scale.
ST-MoE (Zoph)	2022	269 B	32 B equiv	64	top-2	Router z-loss, transferable MoE.
Mixtral 8x7B (Mistral)	Dec 2023	about 47 B	about 13 B	8	top-2	First widely deployed open MoE LLM.
Mixtral 8x22B (Mistral)	Apr 2024	about 141 B	about 39 B	8	top-2	Larger Mixtral.
Grok-1 (xAI)	Mar 2024	314 B	about 86 B	8	top-2	Open-weights under Apache 2.0.
DeepSeek-V2	May 2024	236 B	21 B	160 routed + 2 shared	top-6	Shared experts and MLA attention.
DeepSeek-V3	Dec 2024	671 B	37 B	256 routed + 1 shared	top-8	Auxiliary-loss-free balancing.
Qwen2-MoE / Qwen3-MoE	2024 to 2025	various	various	64+	top-2/top-8	Open MoE LLMs from Alibaba.
Llama 4 family	2025	various	various	up to 128	top-1/top-2	Meta's first MoE LLM line.

Google's GLaM (Du et al., 2022) brought the MoE recipe to a 1.2 trillion parameter decoder-only LLM with 64 experts and 97 B active parameters per token, matching GPT-3 quality at one-third the training energy.^[5] ST-MoE (Zoph et al., 2022) added the router z-loss and made sparse models reliably transferable, achieving state-of-the-art SuperGLUE with only 32 B active parameters.^[6] Mixture of Depths (Raposo et al., 2024) generalised the routing idea from "which expert" to "which depth" and is sometimes combined with Switch-style routing.

In the open-weights world, Mixtral 8x7B (Mistral AI, December 2023) brought a Switch-style MoE LLM into wide use, with 8 experts per layer and top-2 routing.^[7] Mixtral 8x22B followed in April 2024. xAI's Grok-1 (March 2024) released a 314 B MoE model under Apache 2.0.^[9] The DeepSeek series pushed the architecture further by combining shared and routed experts: DeepSeek-V2 has 236 B total and 21 B active parameters, and DeepSeek-V3 has 671 B total and 37 B active. DeepSeek-V3 also introduced an auxiliary-loss-free load balancing scheme that side-steps the $L_{\text{aux}}$ term Switch Transformer popularised.^[8]

Proprietary frontier models also use the recipe. Google's Gemini 1.5 and Gemini 2.0 Flash families are widely reported to be sparse MoE Transformers, and OpenAI's GPT-4 has been described in third-party leaks as a 1.8 trillion parameter MoE with sixteen 110 B experts, although the company has not confirmed details. The Switch Transformer paper is cited as the architectural reference in essentially every major MoE follow-up.^[10]

Implementations

The original Switch Transformer code lives in the google-research/google-research/tree/master/switch_transformer repository on GitHub and uses Mesh TensorFlow.^[12] Subsequent ports include:

Implementation	Framework	Notes
t5x / flaxformer	JAX	Google's official JAX rewrite of T5 and Switch.
Hugging Face transformers	PyTorch	`SwitchTransformersForConditionalGeneration` plus `google/switch-base-*`, `switch-large-128`, and `switch-c-2048` checkpoints.
Megatron-LM	PyTorch	NVIDIA's reference for large-scale MoE pre-training.
Megablocks	PyTorch / CUDA	Block-sparse GPU kernels that reformulate MoE as block-sparse matmul, with no token dropping.
DeepSpeed-MoE	PyTorch	Microsoft's MoE extension; combines ZeRO with expert parallelism.
Tutel	PyTorch / CUDA	Microsoft research library focused on adaptive parallelism for MoE.
ScatterMoE / Scattered MoE	PyTorch	Newer scatter-gather kernels for efficient routing on GPUs.

The Hugging Face configuration class SwitchTransformersConfig exposes the main MoE knobs: num_experts, expert_capacity, router_dtype (defaulting to float32 per the selective-precision recipe), router_jitter_noise, router_aux_loss_coef, and router_z_loss_coef (added later from ST-MoE).^[11]

Modern MoE practices traceable to Switch Transformer

Most engineering choices that production MoE systems rely on were either introduced by Switch Transformer or established as the default by it:^[10]

Top-K routing (with K = 1 in Switch and K = 2, 6, or 8 in successors).
An auxiliary load balancing loss with the $f \cdot P$ formulation, or its newer auxiliary-loss-free replacements.
A capacity factor with token dropping for static tensor shapes.
Selective precision (router in float32, experts in lower precision).
Reduced initialisation scale and expert-specific dropout for stability.
Distillation of sparse teachers into dense students for deployment.
Expert parallelism plus all-to-all token shuffles as the standard distributed training pattern.
Sparse activation reasoning: "this 1.6 T model has the inference cost of a 10 B dense model" became a standard MoE talking point because of Switch-C.

Significance

Switch Transformer is one of the small set of papers that defined the second half of the language-model scaling era. It moved MoE from a 2017 research idea into the production toolkit of essentially every frontier lab by 2024. The trillion-parameter Switch-C was a public proof that sparse scaling could keep going beyond what dense models could afford.^[1] The simplification of the routing decision to top-1 made the implementation tractable enough for open-source ports, which in turn enabled Mixtral and the wave of open MoE LLMs that followed. With more than 4,000 citations and a direct architectural lineage running through GLaM, ST-MoE, Mixtral, Grok-1, and the DeepSeek series, Switch Transformer is now the canonical reference for sparse Transformer design.

References

1. Fedus, W., Zoph, B., and Shazeer, N. (2022). Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. *Journal of Machine Learning Research*, 23(120): 1 to 40.
2. Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., and Dean, J. (2017). Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. ICLR 2017.
3. Lepikhin, D., Lee, H., Xu, Y., Chen, D., Firat, O., Huang, Y., Krikun, M., Shazeer, N., and Chen, Z. (2021). GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. ICLR 2021.
4. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. (2020). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (T5). *Journal of Machine Learning Research*, 21(140).
5. Du, N., Huang, Y., Dai, A. M., et al. (2022). GLaM: Efficient Scaling of Language Models with Mixture-of-Experts. ICML 2022.
6. Zoph, B., Bello, I., Kumar, S., Du, N., Huang, Y., Dean, J., Shazeer, N., and Fedus, W. (2022). ST-MoE: Designing Stable and Transferable Sparse Expert Models.
7. Jiang, A. Q., Sablayrolles, A., Roux, A., et al. (2024). Mixtral of Experts. Mistral AI.
8. DeepSeek-AI (2024). DeepSeek-V3 Technical Report.
9. xAI (2024). Open Release of Grok-1.
10. Cai, W., et al. (2024). A Survey on Mixture of Experts in Large Language Models.
11. Hugging Face. Switch Transformers documentation and Switch Transformers release collection.
12. google-research/google-research GitHub: `switch_transformer/`.
