Switch Transformer
Last reviewed
May 1, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,878 words
The Switch Transformer is a sparsely activated [[mixture_of_experts|Mixture of Experts (MoE)]] Transformer architecture introduced by William Fedus, Barret Zoph, and Noam Shazeer at Google in January 2021. It scales model capacity by routing each input token to exactly one expert (a single feed-forward network) selected from a pool of N experts, using a learned gating function called the switch. Compared with prior MoE work that combined the outputs of several experts per token, Switch Transformer simplified the routing decision to top-1 and showed that the resulting sparse models could be trained reliably at the trillion-parameter scale.
The paper, titled Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity, first appeared on arXiv on January 11, 2021 and was published in the Journal of Machine Learning Research in 2022 (volume 23). It demonstrated up to seven times faster pre-training than the equivalent dense [[t5|T5]] baseline at the same compute budget, and trained the first publicly disclosed 1.6 trillion parameter neural network, Switch-C, with 2,048 experts. Switch Transformer is widely considered the work that turned [[moe|MoE]] from a research curiosity into a practical recipe for production language models, and it directly influenced [[mixtral|Mixtral]], [[deepseek|DeepSeek]] V2 and V3, GLaM, ST-MoE, Grok-1, and other large MoE systems.
The core idea behind a [[mixture_of_experts|Mixture of Experts]] is conditional computation: only a fraction of a neural network's parameters need to be activated for each input. The classical formulation, due to Jacobs, Jordan, Nowlan, and Hinton in the early 1990s, used a small number of experts and a soft gating function. The idea was rediscovered for deep learning by Shazeer et al. in the 2017 ICLR paper Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. Shazeer's MoE layer was inserted between LSTM stacks for language modelling and machine translation; it scaled to 137 billion parameters using top-k routing (typically k = 2 or k = 4) with a noisy top-k gating function and an auxiliary load balancing loss.
By 2020, two trends made MoE attractive again. First, dense Transformers like [[gpt_3|GPT-3]] (175 billion parameters) and [[t5|T5]]-XXL (11 billion) were hitting the limits of synchronous data-parallel training. Second, Google's GShard project (Lepikhin et al., 2020) extended sparse MoE to a 600 billion parameter multilingual translation Transformer, again with top-2 routing, and added the XLA-based sharding annotations needed to actually run such models on TPU pods. GShard ran on 2,048 TPU v3 chips for four days and replaced every other Transformer feed-forward layer with an MoE layer.
Switch Transformer's contribution was both a simplification and a scaling study. Top-2 routing made each MoE layer roughly twice as expensive in expert compute and communication as routing to a single expert; the second expert was rarely much better than the first; and the implementation was complicated by the need to combine and renormalise expert outputs. Fedus, Zoph, and Shazeer asked whether top-1 routing could match the quality of top-2 with simpler engineering. The answer was yes, and it unlocked the trillion-parameter regime.
The paper was written by William Fedus, Barret Zoph, and Noam Shazeer, all members of Google Brain at the time. Shazeer was already the dominant figure in MoE work, having co-authored both Outrageously Large Neural Networks (2017) and the Attention Is All You Need (2017) Transformer paper. Zoph had previously led work on neural architecture search; Fedus was a Google Brain resident.
The arXiv preprint went live on January 11, 2021. After several revisions, the final version appeared in the Journal of Machine Learning Research in 2022 (JMLR 23: 5232 to 5270, paper 21-0998). As of 2026 the paper has more than 3,500 citations and is one of the most heavily cited machine learning papers of the early 2020s.
A standard Transformer block consists of a multi-head self-attention sublayer followed by a position-wise feed-forward network (FFN). In Switch Transformer, the FFN sublayer is replaced by a Switch Layer; the attention sublayer, layer normalisation, and residual connections are unchanged. A Switch Layer is built from N independent expert FFNs (each with its own weights) plus a small router.
For an input token with hidden state x, the router computes logits h(x) = W_r * x using a learned matrix W_r of shape (d_model, N). A softmax over the N experts produces routing probabilities p_i(x) = exp(h_i(x)) / sum_j exp(h_j(x)). The Switch Layer then picks the index i* with the highest probability (top-1) and computes the output as y = p_{i*}(x) * E_{i*}(x), where E_{i*} is the chosen expert's FFN. Multiplying by the gate value p_{i*}(x) lets the gradient flow back to the router. Crucially, the entire input batch is routed token by token, so different tokens in the same sentence can visit different experts in the same Switch Layer.
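The routing rule above can be sketched for a single token in plain Python. This is a minimal illustration, not the paper's implementation: the helper names (`softmax`, `switch_layer`) and the toy experts are hypothetical, and the router matrix is stored as d_model rows by N columns to match the shape given above.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def switch_layer(x, router_weights, experts):
    """Route one token x to its top-1 expert and scale the output by the gate.

    router_weights: d_model rows by N columns (hypothetical layout).
    experts: list of N callables, each mapping a vector to a vector.
    """
    num_experts = len(experts)
    # Router logits: h_i(x) = sum_d x[d] * W_r[d][i]
    logits = [sum(x[d] * router_weights[d][i] for d in range(len(x)))
              for i in range(num_experts)]
    probs = softmax(logits)
    # Top-1 selection: index of the highest routing probability.
    best = max(range(num_experts), key=lambda i: probs[i])
    gate = probs[best]
    # y = p_{i*}(x) * E_{i*}(x)
    y = [gate * v for v in experts[best](x)]
    return y, best, probs

# Toy setup: two stand-in "experts" and an identity router matrix.
experts = [lambda v: [2.0 * a for a in v], lambda v: [-1.0 * a for a in v]]
W_r = [[1.0, 0.0], [0.0, 1.0]]
y, chosen, probs = switch_layer([3.0, 1.0], W_r, experts)
print(chosen, probs, y)
```

In a real model the gate multiplication is what connects the router parameters to the loss; this sketch only computes the forward value and has no gradient machinery.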
In practice, Switch Layers replace the FFN in roughly every other Transformer block in both the encoder and the decoder, following the GShard convention. The Hugging Face implementation makes this configurable through num_sparse_encoder_layers and num_sparse_decoder_layers, which default to three sparse layers in a twelve-layer base model.
The shift from top-2 to top-1 routing is the single defining choice of Switch Transformer. The motivation is mostly engineering. With top-2 routing, each token sends activations to two experts, doubling the all-to-all communication volume in distributed training and roughly doubling the compute per Switch Layer. Top-1 routing halves both costs while letting the model spend the saved budget on more experts or larger experts.
The risk was that one expert per token might not provide enough signal to learn good representations. The paper showed empirically that top-1 routing performs as well as or better than top-2 across a wide range of model sizes once a few stability tricks are added, and that the simpler routing makes it easier to push N (the number of experts) into the thousands. Switch Transformer's largest model, Switch-C, uses 2,048 experts per Switch Layer.
Sparse models are notoriously hard to train. Routing decisions are discrete, gradients flow through a softmax that is bottlenecked by the chosen expert, and the load balancing loss can interact badly with the main training objective. Fedus, Zoph, and Shazeer introduced a small bag of tricks that have since become standard MoE practice.
Selective precision. Mixed-precision training in bfloat16 caused divergences in early experiments. The fix was to keep the router computation (the small W_r matrix multiply, the softmax, and the loss bookkeeping) in float32 while letting the experts and dispatch tensors stay in bfloat16. Because the router is a tiny fraction of total compute, the float32 cost is negligible, but the numerical headroom prevents the routing softmax from collapsing.
Smaller initialisation. The default Transformer initialisation scale (s = 1.0 truncated normal) was too large for sparse models. The paper recommends shrinking the initialisation factor by an order of magnitude, to s = 0.1, which sharply reduces variance in early training steps.
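The effect of the smaller scale can be illustrated with a short sketch. It assumes, beyond what is stated above, that weights are drawn with standard deviation sqrt(s / fan_in) and that samples beyond two standard deviations are redrawn; the function name `truncated_normal_init` is hypothetical.

```python
import math
import random

def truncated_normal_init(fan_in, fan_out, scale, rng, num_std=2.0):
    """Draw a fan_in x fan_out matrix from a truncated normal.

    Assumption: std = sqrt(scale / fan_in), resampling values that fall
    outside num_std standard deviations.
    """
    std = math.sqrt(scale / fan_in)
    limit = num_std * std
    weights = []
    for _ in range(fan_in):
        row = []
        while len(row) < fan_out:
            v = rng.gauss(0.0, std)
            if abs(v) <= limit:  # keep only samples inside the truncation bound
                row.append(v)
        weights.append(row)
    return weights

def max_abs(matrix):
    return max(abs(v) for row in matrix for v in row)

rng = random.Random(0)
w_default = truncated_normal_init(64, 32, scale=1.0, rng=rng)
w_small = truncated_normal_init(64, 32, scale=0.1, rng=rng)
print(max_abs(w_default), max_abs(w_small))
```

Under this assumption, reducing s from 1.0 to 0.1 shrinks the standard deviation by a factor of sqrt(10), roughly 3.2.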
Expert dropout. During fine-tuning the paper applies a much higher dropout rate inside expert FFN layers (about 0.4) than in the rest of the network (about 0.1). This asymmetric regularisation reduces over-fitting on small downstream datasets, where the very large expert FFNs would otherwise see too few examples.
Capacity factor and expert capacity. Because hardware needs static tensor shapes, each expert is given a fixed capacity per batch. Expert capacity is computed as (tokens_per_batch / num_experts) * capacity_factor. A capacity factor of 1.0 reserves exactly the average load per expert; values around 1.0 to 1.25 worked best in the paper. Tokens that overflow an expert's capacity are dropped: their representation passes through the residual stream unchanged. The paper reports drop rates below 1 percent at typical capacity factors.
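The capacity formula and the dropping rule can be sketched as follows. The rounding (truncation to an integer) and the first-come-first-served order in which tokens claim slots are assumptions made for illustration; the function names are hypothetical.

```python
def expert_capacity(tokens_per_batch, num_experts, capacity_factor):
    # (tokens_per_batch / num_experts) * capacity_factor, truncated to an int.
    return int((tokens_per_batch / num_experts) * capacity_factor)

def apply_capacity(assignments, num_experts, capacity):
    """Split token indices into kept and dropped, in batch order.

    assignments[i] is the expert chosen by the router for token i.
    Dropped tokens skip the expert and rely on the residual connection.
    """
    load = [0] * num_experts
    kept, dropped = [], []
    for token_idx, expert in enumerate(assignments):
        if load[expert] < capacity:
            load[expert] += 1
            kept.append(token_idx)
        else:
            dropped.append(token_idx)
    return kept, dropped

# 8 tokens, 4 experts, expert 0 is over-subscribed.
assignments = [0, 0, 0, 1, 2, 0, 3, 1]
capacity = expert_capacity(len(assignments), 4, 1.0)
kept, dropped = apply_capacity(assignments, 4, capacity)
print(capacity, kept, dropped)  # 2 [0, 1, 3, 4, 6, 7] [2, 5]
```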
Auxiliary load balancing loss. Without an explicit incentive, the router quickly collapses to a few favoured experts. Switch Transformer adds an auxiliary loss L_aux = alpha * N * sum_e (f_e * P_e), where f_e is the fraction of tokens routed to expert e, P_e is the fraction of total router probability mass assigned to expert e, and alpha is a small coefficient (10^-2 in the paper). Under uniform load, f_e = P_e = 1/N and the sum of f_e * P_e equals 1/N, so the loss equals alpha when the load is perfectly balanced and grows whenever the router concentrates mass.
The capacity factor is one of the most important MoE knobs and was popularised by Switch Transformer. Setting it too low causes excessive token dropping; setting it too high wastes memory and communication bandwidth. The paper sweeps the capacity factor and finds that 1.0 works well at training time and that 2.0 is a reasonable choice during evaluation when memory is less constrained. The token-dropping mechanism, in which over-capacity tokens are simply skipped at the Switch Layer and pass through the residual connection, has been criticised but is the simplest way to reconcile dynamic routing with static shapes.
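The trade-off can be made concrete with a toy sweep that counts dropped tokens and unused (padded) expert slots for one fixed, deliberately skewed routing pattern. The function name `capacity_tradeoff` and the integer truncation of the capacity are assumptions for illustration.

```python
from collections import Counter

def capacity_tradeoff(assignments, num_experts, capacity_factor):
    """Return (capacity, dropped tokens, unused slots) for one batch."""
    capacity = int(len(assignments) / num_experts * capacity_factor)
    load = Counter(assignments)
    # Tokens beyond an expert's capacity are dropped.
    dropped = sum(max(0, load[e] - capacity) for e in range(num_experts))
    # Slots that no token fills are wasted memory and bandwidth.
    unused = sum(max(0, capacity - load[e]) for e in range(num_experts))
    return capacity, dropped, unused

# 16 tokens over 4 experts with loads 7, 4, 3, 2.
assignments = [0] * 7 + [1] * 4 + [2] * 3 + [3] * 2
for cf in (1.0, 1.25, 2.0):
    print(cf, capacity_tradeoff(assignments, 4, cf))
# 1.0  -> (4, 3, 3)
# 1.25 -> (5, 2, 6)
# 2.0  -> (8, 0, 16)
```

On this toy batch, raising the factor from 1.0 to 2.0 removes all drops but leaves half of the reserved slots empty.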
Given a batch of T tokens routed across N experts, define for expert e the dispatch fraction f_e = (1/T) * sum over tokens x of 1[argmax_i p_i(x) = e], and the mean router probability P_e = (1/T) * sum over tokens x of p_e(x).
Then L_aux = alpha * N * sum_{e=1}^{N} f_e * P_e. Both f_e and P_e are non-negative and sum to 1 over experts. Because f_e is derived from the same router probabilities that make up P_e, the two vectors tend to rise and fall together, and the paper notes that the loss is minimised under uniform routing, in which case f_e = P_e = 1/N and the sum equals N * (1/N)^2 = 1/N, giving L_aux = alpha. The product form is differentiable through P_e (since P_e is a softmax output) but not through f_e (which is a hard top-1 count); only the soft term contributes gradients to the router. ST-MoE later refined this with a router z-loss that penalises large pre-softmax logits and further improves stability.
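As a numerical check on the formula, the following sketch computes the value of L_aux from a list of per-token router probability vectors. The function name `load_balancing_loss` is hypothetical, and the code evaluates only the forward value (plain Python has no automatic differentiation).

```python
def load_balancing_loss(router_probs, alpha=0.01):
    """router_probs: list of T rows, each a length-N probability vector."""
    T = len(router_probs)
    N = len(router_probs[0])
    f = [0.0] * N  # fraction of tokens whose top-1 expert is e
    P = [0.0] * N  # mean router probability assigned to e
    for probs in router_probs:
        top1 = max(range(N), key=lambda i: probs[i])
        f[top1] += 1.0 / T
        for i in range(N):
            P[i] += probs[i] / T
    return alpha * N * sum(fe * pe for fe, pe in zip(f, P))

# Two tokens, two experts.
balanced = load_balancing_loss([[0.9, 0.1], [0.1, 0.9]])   # f = P = [0.5, 0.5]
collapsed = load_balancing_loss([[0.9, 0.1], [0.9, 0.1]])  # f = [1, 0], P = [0.9, 0.1]
print(balanced, collapsed)  # approximately 0.01 and 0.018
```

The balanced case returns alpha, matching the derivation, while routing every token to the same expert raises the loss.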
The paper reports four primary model variants, all built on the [[t5|T5]] encoder-decoder architecture and pre-trained on the Colossal Clean Crawled Corpus (C4). The dense T5 baselines are matched on FLOPs per token to the corresponding Switch model, so a fair compute comparison is possible.
| Model | Total parameters | Experts per Switch Layer | Matched dense baseline | Pre-training speed-up vs baseline |
|---|---|---|---|---|
| Switch-Base | 7.4 B | up to 128 | T5-Base (223 M) | up to 7.5x |
| Switch-Large | 26.3 B | up to 128 | T5-Large (739 M) | about 4.4x |
| Switch-XXL | 395 B | 64 | T5-XXL (11 B) | 4x to 9x on speed-quality trade-off |
| Switch-C | 1.571 T | 2,048 | T5-XXL (11 B) | reaches T5-XXL quality 4x faster |
Switch-Base reaches the language modelling perplexity of a fully converged T5-Base in roughly one-seventh of the wall-clock time on the same TPU pod. Switch-XXL provides a more compute-intensive comparison: at the same compute budget it beats T5-XXL by a wide margin, and at the same target perplexity it finishes 4 to 9 times faster depending on how the comparison is drawn. Switch-C, with 1.571 trillion parameters and 2,048 experts, was the first publicly disclosed model to cross the trillion-parameter threshold. Notably, Switch-C exhibited no training instability, in contrast with the smaller but deeper Switch-XXL, which required the stability tricks above to train at all.
The Hugging Face release later expanded the public collection to include google/switch-base-8, google/switch-base-16, google/switch-base-32, google/switch-base-64, google/switch-base-128, google/switch-base-256, google/switch-large-128, and google/switch-c-2048. These checkpoints have made Switch Transformer the most accessible large MoE family for research.
On the C4 pre-training task, Switch models reach better (higher) negative log perplexity per FLOP than their dense T5 counterparts at every model scale tested. On downstream fine-tuning, the picture is more mixed. SuperGLUE scores improved by about 4.4 points for Switch-Base over T5-Base and about 2.0 points for Switch-Large over T5-Large. On Winogrande, Switch-Base reached 73.3 versus T5-Base at 66.6, and Switch-Large reached 83.0 versus T5-Large at 79.1. On TriviaQA closed-book question answering, Switch-Large scored 36.9 versus T5-Large at 29.5. On the smaller ARC reasoning datasets, however, dense models occasionally outperformed sparse variants, foreshadowing the well-known difficulty of fine-tuning very wide MoE models on small downstream tasks.
The most striking transfer result was multilingual. The paper trained Switch versions of mT5 on the mC4 corpus covering 101 languages and showed improvements on every single language compared with mT5-Base, with a mean speed-up of about 5x and 91 percent of languages achieving a 4x or greater speed-up to a target quality.
Sparse models have a serious deployment problem: even though only a few parameters are active per token, all expert weights must be loaded into memory. To address this, the paper studied distilling sparse Switch teachers into smaller dense students. Distilling Switch-Base (7.4 B parameters) into a 223 M parameter student preserved roughly 30 percent of the quality gains over training the same dense student from scratch, while shrinking the model by about 97 percent (223 M is about 3 percent of 7.4 B). A larger 14.7 B sparse teacher distilled into the same 223 M student preserved about 28 percent of the gains at 99 percent compression. SuperGLUE fine-tuning showed similar 30 percent quality preservation. These distillation experiments were one of the first practical demonstrations that sparse pre-training could improve dense models, and they directly inspired the dense distillation pipelines used by later MoE systems.
Switch Transformer was trained on Google TPU v3 pods using Mesh TensorFlow, the precursor to the JAX-based pjit programming model. Mesh TensorFlow expresses parallelism by naming tensor dimensions and laying them out across a logical processor mesh. The Switch Transformer code mixes three forms of parallelism: data parallelism, model parallelism, and expert parallelism.
During a forward pass, an all-to-all communication shuffles tokens to the TPU cores hosting their assigned experts, the experts run locally, and a second all-to-all returns the results to the originating cores. The all-to-all is the dominant communication cost and the main reason MoE models are bandwidth-hungry. The original Mesh TensorFlow implementation eventually moved into the open-source t5x and flaxformer codebases under JAX.
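The dispatch, local compute, and combine steps can be mimicked in a single process. This sketch is only a stand-in for the two all-to-all exchanges: the per-expert buckets play the role of devices, and the function name `dispatch_combine` and the toy experts are hypothetical.

```python
def dispatch_combine(tokens, assignments, gates, experts):
    """Group tokens by expert, run each expert on its group, restore order."""
    # Dispatch: bucket token indices by assigned expert (first "all-to-all").
    buckets = {e: [] for e in range(len(experts))}
    for idx, expert in enumerate(assignments):
        buckets[expert].append(idx)

    outputs = [None] * len(tokens)
    for expert, indices in buckets.items():
        # Local expert compute on the tokens that landed in this bucket.
        results = [experts[expert](tokens[i]) for i in indices]
        # Combine: write gate-scaled results back to the original positions
        # (second "all-to-all").
        for i, r in zip(indices, results):
            outputs[i] = [gates[i] * v for v in r]
    return outputs

experts = [lambda x: [v + 10.0 for v in x], lambda x: [v * 2.0 for v in x]]
tokens = [[1.0], [2.0], [3.0]]
outputs = dispatch_combine(tokens, assignments=[1, 0, 1],
                           gates=[0.5, 1.0, 0.5], experts=experts)
print(outputs)  # [[1.0], [12.0], [3.0]]
```

In a distributed run each bucket lives on a different device, so both the bucketing and the write-back involve moving activations across the interconnect, which is where the bandwidth cost arises.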
Switch Transformer set the modern MoE template. Almost every subsequent large sparse model uses the same structural pattern (replace some FFN layers with N experts plus a router), the same auxiliary load balancing loss, capacity factor, and selective precision, and either top-1 or top-2 routing. Major successors include:
| Model | Year | Total params | Active params | Experts per layer | Routing | Notes |
|---|---|---|---|---|---|---|
| Sparsely-Gated MoE (Shazeer) | 2017 | up to 137 B | small | 1024 to 65536 | top-2/top-4 | First deep-learning MoE. |
| GShard (Lepikhin) | 2020 | 600 B | small | up to 2048 | top-2 | XLA sharding for translation. |
| Switch Transformer (Fedus) | 2021 | up to 1.571 T | constant | up to 2048 | top-1 | Trillion-parameter milestone. |
| GLaM (Du) | 2021/2022 | 1.2 T | 97 B | 64 | top-2 | First MoE LLM at GPT-3 scale. |
| ST-MoE (Zoph) | 2022 | 269 B | 32 B equiv | 64 | top-2 | Router z-loss, transferable MoE. |
| Mixtral 8x7B (Mistral) | Dec 2023 | about 47 B | about 13 B | 8 | top-2 | Widely used open-weights MoE LLM. |
| Mixtral 8x22B (Mistral) | Apr 2024 | about 141 B | about 39 B | 8 | top-2 | Larger Mixtral. |
| Grok-1 (xAI) | Mar 2024 | 314 B | about 86 B | 8 | top-2 | Open-weights under Apache 2.0. |
| DeepSeek-V2 | May 2024 | 236 B | 21 B | 160 routed + 2 shared | top-6 | Shared experts and MLA attention. |
| DeepSeek-V3 | Dec 2024 | 671 B | 37 B | 256 routed + 1 shared | top-8 | Auxiliary-loss-free load balancing. |
| Qwen2-MoE / Qwen3-MoE | 2024 to 2025 | various | various | 64+ | top-2/top-8 | Open MoE LLMs from Alibaba. |
| Llama 4 family | 2025 | various | various | up to 128 | top-1/top-2 | Meta's first MoE LLM line. |
Google's GLaM (Du et al., 2022) brought the MoE recipe to a 1.2 trillion parameter decoder-only LLM with 64 experts and 97 B active parameters per token, matching GPT-3 quality at one-third the training energy. ST-MoE (Zoph et al., 2022) added the router z-loss and made sparse models reliably transferable, achieving state-of-the-art SuperGLUE with only 32 B active parameters. [[mixture_of_depths|Mixture of Depths]] (Raposo et al., 2024) generalised the routing idea from "which expert" to "which depth" and is sometimes combined with Switch-style routing.
In the open-weights world, Mixtral 8x7B (Mistral AI, December 2023) brought a Switch-style MoE LLM into wide use, with 8 experts per layer and top-2 routing. Mixtral 8x22B followed in April 2024. xAI's Grok-1 (March 2024) released a 314 B MoE model under Apache 2.0. The DeepSeek series pushed the architecture further with shared and routed experts, with DeepSeek-V2 at 236 B / 21 B active and DeepSeek-V3 at 671 B / 37 B active introducing an auxiliary-loss-free load balancing scheme that side-steps the L_aux term Switch Transformer originated.
Proprietary frontier models also use the recipe. Google's Gemini 1.5 and Gemini 2.0 Flash families are widely reported to be sparse MoE Transformers, and OpenAI's GPT-4 has been described in third-party leaks as a 1.8 trillion parameter MoE with sixteen 110 B experts, although the company has not confirmed details. The Switch Transformer paper is cited as the architectural reference in essentially every major MoE follow-up.
The original Switch Transformer code lives in the google-research/google-research/tree/master/switch_transformer repository on GitHub and uses Mesh TensorFlow. Subsequent ports include:
| Implementation | Framework | Notes |
|---|---|---|
| t5x / flaxformer | JAX | Google's official JAX rewrite of T5 and Switch. |
| Hugging Face transformers | PyTorch | SwitchTransformersForConditionalGeneration plus google/switch-base-*, switch-large-128, and switch-c-2048 checkpoints. |
| Megatron-LM | PyTorch | NVIDIA's reference for large-scale MoE pre-training. |
| Megablocks | PyTorch / CUDA | Block-sparse GPU kernels that reformulate MoE as block-sparse matmul, with no token dropping. |
| DeepSpeed-MoE | PyTorch | Microsoft's MoE extension; combines ZeRO with expert parallelism. |
| Tutel | PyTorch / CUDA | Microsoft research library focused on adaptive parallelism for MoE. |
| ScatterMoE / Scattered MoE | PyTorch | Newer scatter-gather kernels for efficient routing on GPUs. |
The Hugging Face configuration class SwitchTransformersConfig exposes the main MoE knobs: num_experts, expert_capacity, router_dtype (defaulting to float32 per the selective-precision recipe), router_jitter_noise, router_aux_loss_coef, and router_z_loss_coef (added later from ST-MoE).
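Most engineering choices that production MoE systems rely on were either introduced by Switch Transformer or established as the default by it: top-1 routing, a fixed expert capacity with token dropping, the auxiliary load balancing loss, float32 router computation under otherwise reduced precision, a smaller initialisation scale, and higher dropout inside experts during fine-tuning.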
Switch Transformer is one of the small set of papers that defined the second half of the language-model scaling era. It moved [[mixture_of_experts|MoE]] from a 2017 research idea into the production toolkit of essentially every frontier lab by 2024. The trillion-parameter Switch-C was a public proof that sparse [[scaling|scaling]] could keep going beyond what dense models could afford. The simplification of the routing decision to top-1 made the implementation tractable enough for open-source ports, which in turn enabled Mixtral and the wave of open MoE LLMs that followed. With more than 3,500 citations and a direct architectural lineage running through GLaM, ST-MoE, Mixtral, Grok-1, and the DeepSeek series, Switch Transformer is now the canonical reference for sparse Transformer design.