Sparse upcycling
Last reviewed
Jun 8, 2026
Sources
7 citations
Review status
Source-backed
Revision
v1 · 1,718 words
Sparse upcycling is a technique for building a sparsely activated mixture of experts (MoE) model by initializing it from an already trained dense Transformer checkpoint, rather than training the MoE from random initialization. The method reuses the compute already spent pretraining the dense model: each new expert begins as an exact copy of the dense network's feed-forward (MLP) layer, every other weight is carried over unchanged, and only the new routing components are added fresh. Continued training then lets the duplicated experts gradually diverge and specialize while the router learns to dispatch tokens among them. [1]
The approach was introduced in "Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints," first posted to arXiv in December 2022 and published at the International Conference on Learning Representations (ICLR) in 2023. [1][2] The work was carried out at Google. Lead author Aran Komatsuzaki, then a PhD student at the Georgia Institute of Technology and a researcher at EleutherAI, completed it during an internship at Google Research, together with co-authors from Google Brain and Google Research (Joan Puigcerver, James Lee-Thorp, Carlos Riquelme Ruiz, Basil Mustafa, Joshua Ainslie, Yi Tay, Mostafa Dehghani, and Neil Houlsby). [1] The central appeal is economic: the method converts a strong dense checkpoint into a higher-capacity sparse model for a small fraction of the cost of training a comparable MoE from scratch, while exploiting the large and growing stock of openly available dense models.
Mixture-of-experts layers increase a model's parameter count without a proportional increase in the compute spent per token. An MoE layer holds many parallel expert sub-networks but routes each token to only one or a few of them, so the number of floating-point operations per token stays roughly fixed even as total parameters grow. Architectures such as the Switch Transformer and GShard used this property to scale to very large parameter counts at near-constant per-token cost, often yielding better quality than a dense model trained with the same compute. [1]
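The gap between stored and active parameters can be made concrete with a short calculation. The sketch below is illustrative only: the function name `moe_mlp_params`, the layer sizes, and the choice to ignore biases are assumptions made for the example, not figures from the cited papers.

```python
def moe_mlp_params(d_model: int, d_ff: int, num_experts: int, top_k: int) -> dict:
    """Count the weights of one MoE layer built from two-matrix MLP experts (biases ignored)."""
    per_expert = 2 * d_model * d_ff        # up-projection plus down-projection
    router = d_model * num_experts         # one linear map from a token to expert logits
    total = num_experts * per_expert + router
    active = top_k * per_expert + router   # weights actually used for a single token
    return {"per_expert": per_expert, "total": total, "active": active}


# Hypothetical sizes chosen only for illustration.
counts = moe_mlp_params(d_model=768, d_ff=3072, num_experts=32, top_k=2)
dense = 2 * 768 * 3072                     # the single dense MLP the experts would replace
print(counts["total"] / dense)             # stored parameters relative to the dense MLP
print(counts["active"] / dense)            # parameters used per token relative to the dense MLP
```

With these assumed sizes the layer stores roughly 32 times the weights of the dense MLP but uses only about twice as many per token. Setting `top_k=1` brings the active count down to one expert plus the small router.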
The drawback is that training such large MoE models from random initialization is itself expensive, and the field had already invested enormous compute in dense checkpoints such as T5, GPT-style decoders, and Vision Transformers. Discarding those dense weights to start an MoE over from scratch wastes that sunk cost. Sparse upcycling targets exactly this gap: it asks whether a dense checkpoint can be cheaply "upgraded" into a sparse MoE that surpasses the original dense model and approaches the quality of a from-scratch MoE, using only a modest amount of additional training. [1]
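Sparse upcycling starts from a fully trained dense Transformer and applies a structural transformation followed by continued training. [1] The key steps are: replace MLP layers of the dense model with MoE layers; initialize every expert in each new layer as an exact copy of the MLP it replaces; carry over all other weights unchanged; add a router with fresh initialization; and continue training so that the experts diverge and the router learns to dispatch tokens among them.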
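The cloning step can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `upcycle_mlp`, the dictionary layout, the bias-free two-matrix MLP, and the router standard deviation of 0.02 are all assumptions made for the example.

```python
import numpy as np


def upcycle_mlp(w_in, w_out, num_experts, rng, router_std=0.02):
    """Build MoE layer parameters from one dense MLP.

    w_in:  [d_model, d_ff] up-projection of the dense MLP.
    w_out: [d_ff, d_model] down-projection of the dense MLP.
    """
    d_model = w_in.shape[0]
    return {
        # Every expert starts as an independent copy of the dense MLP weights.
        "experts_in": np.stack([w_in.copy() for _ in range(num_experts)]),
        "experts_out": np.stack([w_out.copy() for _ in range(num_experts)]),
        # The router is the only freshly initialized component (std is an assumed value).
        "router": rng.normal(0.0, router_std, size=(d_model, num_experts)),
    }


rng = np.random.default_rng(0)
dense_w_in = rng.normal(size=(8, 32))    # stand-in for trained dense weights
dense_w_out = rng.normal(size=(32, 8))
moe = upcycle_mlp(dense_w_in, dense_w_out, num_experts=4, rng=rng)
print(moe["experts_in"].shape, moe["router"].shape)  # (4, 8, 32) (8, 4)
```

Attention, embedding, and normalization weights are not shown because they would simply be copied across unchanged.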
A defining challenge the paper highlights is that "a central challenge in model upcycling is overcoming the initial performance decrease entailed by changing a trained network's structure." [1] Immediately after the structural change, and especially with a randomly initialized router that may drop or mis-route tokens, the upcycled model can temporarily perform worse than the dense checkpoint it came from. With continued training it recovers and then overtakes the dense baseline. The authors also note that design choices interact with modality: weight normalization on the router helped the vision models but hurt the language models, which they attribute to differences between the encoder and decoder routing setups. [1]
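One way to see where the initial drop comes from is to compare the upcycled layer's output with the dense MLP at step zero. The sketch below assumes a ReLU MLP without biases, softmax top-k routing, and no token dropping. It reads "normalization" as rescaling each token's combine weights to sum to 1, which is an interpretation made for illustration. All function names are hypothetical.

```python
import numpy as np


def dense_mlp(x, w_in, w_out):
    return np.maximum(x @ w_in, 0.0) @ w_out            # ReLU MLP, biases omitted


def moe_forward(x, experts_in, experts_out, router, top_k, normalize):
    logits = x @ router                                  # [tokens, experts]
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)           # softmax over experts
    chosen = np.argsort(-probs, axis=-1)[:, :top_k]      # top-k expert indices per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        weights = probs[t, chosen[t]]
        if normalize:
            weights = weights / weights.sum()            # combine weights now sum to 1
        for w, e in zip(weights, chosen[t]):
            out[t] += w * dense_mlp(x[t], experts_in[e], experts_out[e])
    return out


rng = np.random.default_rng(0)
d_model, d_ff, num_experts = 8, 32, 4
w_in = rng.normal(size=(d_model, d_ff))
w_out = rng.normal(size=(d_ff, d_model))
experts_in = np.stack([w_in] * num_experts)              # identical experts, as right after upcycling
experts_out = np.stack([w_out] * num_experts)
router = rng.normal(0.0, 0.02, size=(d_model, num_experts))
x = rng.normal(size=(5, d_model))

reference = dense_mlp(x, w_in, w_out)
with_norm = moe_forward(x, experts_in, experts_out, router, top_k=2, normalize=True)
without_norm = moe_forward(x, experts_in, experts_out, router, top_k=2, normalize=False)
print(np.allclose(with_norm, reference))     # True
print(np.allclose(without_norm, reference))  # False
```

Under these assumptions, identical experts with normalized weights reproduce the dense output exactly, while unnormalized weights shrink it. The sketch covers only the first step and does not model token dropping, so it does not explain why normalization hurt the language models in continued training.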
The paper validates the method on both language and vision Transformers. On the language side it upcycles T5 Base, Large, and XL and evaluates on SuperGLUE; on the vision side it upcycles Vision Transformer (ViT) Base and Large and evaluates on ImageNet. [1] In both domains the upcycled MoE models significantly outperformed their dense counterparts while adding extra compute equal to only about 50 percent of the cost of the original dense pretraining run. [1]
Sparse upcycling also compared favorably against training an equivalent MoE from random initialization. The paper reports that a language MoE trained from scratch needed roughly 120 percent of the original dense checkpoint's compute budget merely to catch up to the upcycled model, meaning upcycling reached a given quality far sooner. [1] The authors characterize the method as most effective when the additional training budget is modest, on the order of a 10 percent to 60 percent increase over the dense pretraining cost, which is the regime where reusing the dense weights gives the largest head start.
Sparse upcycling became a standard tool for producing open and industrial MoE models cheaply by recycling existing dense checkpoints. Several widely used models adopted it, typically following the same recipe of cloning the dense MLP into experts and adding a fresh router:
| Model | Developer | Year | Dense source | Sparse configuration |
|---|---|---|---|---|
| MiniCPM-MoE | OpenBMB / Tsinghua | 2024 | MiniCPM 2B | 13.6B total parameters, two of eight experts active per token [3] |
| Qwen2-57B-A14B | Alibaba | 2024 | Qwen2-7B | 57B total parameters, 14B active per token [4] |
| Skywork-MoE | Kunlun / Skywork | 2024 | Skywork-13B | 146B total parameters, 16 experts, two active (about 22B active) [5] |
| Nemotron upcycling study | NVIDIA | 2024 | Nemotron dense models | systematic study of router and granularity variants [6] |
MiniCPM-MoE explicitly states that it is initialized via sparse upcycling, replacing each MLP layer with an MoE layer of cloned experts and randomly initializing the router. [3] Qwen2-57B-A14B was upcycled from the dense Qwen2-7B and is reported to match the quality of a roughly 30B dense model while activating only 14B parameters per token. [4] Skywork-MoE was upcycled from Skywork-13B, and its authors used the project to compare upcycling directly against from-scratch training. [5] NVIDIA's "Upcycling Large Language Models into Mixture of Experts" extended the original recipe with variants such as softmax-then-top-k routing and virtual expert groups, applied to Nemotron models. [6] Mistral's Mixtral has been speculated to rely on a similar dense-to-MoE construction, but its developers have not confirmed this, so it is not a documented case.
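The share of parameters that each model activates per token follows directly from the figures in the table. The short calculation below uses only the totals and active counts listed there. The helper name `active_fraction` is introduced for illustration.

```python
# (total parameters, active parameters per token) in billions, as listed in the table above.
models = {
    "Qwen2-57B-A14B": (57.0, 14.0),
    "Skywork-MoE": (146.0, 22.0),
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Fraction of stored parameters used for a single token."""
    return active_b / total_b


for name, (total_b, active_b) in models.items():
    share = active_fraction(total_b, active_b)
    print(f"{name}: {share:.1%} of parameters active per token")
```

Both models use well under a third of their stored weights for any given token.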
The broader significance is that sparse upcycling reframed MoE training as an incremental upgrade of existing dense models rather than an all-or-nothing investment, lowering the barrier to building sparse models and influencing a line of follow-up work on warm-starting, expert initialization, and breaking expert symmetry. [6]
Sparse upcycling is most advantageous at relatively small additional budgets, and its edge erodes as training continues. The original paper notes that for very large compute budgets, an MoE trained from scratch will eventually catch and surpass the upcycled model, because the upcycled experts begin highly correlated and take time to diversify. [1] It also observes that using a very large number of experts produces a larger initial quality drop relative to the dense baseline. [1]
Later work sharpened these caveats. In the OLMoE project, the authors compared upcycling OLMo-1B against training a comparable decoder-only MoE from scratch and found the crossover much earlier than in the original study. The from-scratch model caught up after roughly 600B tokens, around 25 percent of the dense model's compute rather than the 120 percent reported by Komatsuzaki et al. They attributed the difference to using token-choice routing and a decoder-only model started from a heavily trained dense baseline. [7] Because they planned to train for about 250 percent of the dense budget, the OLMoE team chose not to upcycle. They also flagged a subtler cost: upcycling locks the new model into the dense checkpoint's original architecture and hyperparameters, preventing adoption of later improvements such as QK-normalization or better initialization. [7]
Other practical limitations are inherited from MoE models in general. Because all experts start as copies of one MLP, they are initially redundant and prone to expert collapse or load imbalance, in which some experts are underused, so auxiliary load-balancing losses and careful routing are usually required. Although sparse upcycling keeps per-token compute close to that of the dense model, the full set of experts must still be stored and served, which raises memory and deployment costs relative to the original dense checkpoint. [1][7]
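As an illustration of what an auxiliary load-balancing loss can look like, the sketch below implements the formulation popularized by the Switch Transformer for top-1 routing: the number of experts times the sum, over experts, of the fraction of tokens dispatched to each expert multiplied by its mean router probability. The sources cited here do not specify which loss a given upcycled model uses, so this choice is an assumption. The usual scaling coefficient is omitted.

```python
import numpy as np


def load_balancing_loss(router_probs, expert_index):
    """Switch-Transformer-style auxiliary loss for top-1 routing (scaling coefficient omitted).

    router_probs: [tokens, experts] softmax outputs of the router.
    expert_index: [tokens] index of the expert each token was sent to.
    """
    num_experts = router_probs.shape[1]
    # f[i]: fraction of tokens dispatched to expert i.
    f = np.bincount(expert_index, minlength=num_experts) / expert_index.shape[0]
    # p[i]: mean router probability assigned to expert i.
    p = router_probs.mean(axis=0)
    return num_experts * float(np.sum(f * p))


# Perfectly balanced routing: every expert gets one token and equal probability.
balanced = np.full((4, 4), 0.25)
print(load_balancing_loss(balanced, np.array([0, 1, 2, 3])))   # 1.0

# Collapsed routing: every token goes to expert 0 with high probability.
collapsed = np.tile(np.array([0.97, 0.01, 0.01, 0.01]), (4, 1))
print(load_balancing_loss(collapsed, np.array([0, 0, 0, 0])))  # approximately 3.88
```

The loss reaches its minimum of 1 under uniform routing and grows as tokens concentrate on fewer experts, which is the behavior that discourages collapse.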