Sparse upcycling

Updated Jun 8, 2026

Overview

Sparse upcycling is a technique for building a sparsely activated mixture of experts (MoE) model by initializing it from an already trained dense Transformer checkpoint, rather than training the MoE from random initialization. The method reuses the compute already spent pretraining the dense model: each new expert begins as an exact copy of the dense network's feed-forward (MLP) layer, every other weight is carried over unchanged, and only the new routing components are added fresh. Continued training then lets the duplicated experts gradually diverge and specialize while the router learns to dispatch tokens among them. ^[1]

The approach was introduced in "Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints," first posted to arXiv in December 2022 and published at the International Conference on Learning Representations (ICLR) in 2023. ^[1]^[2] The work was carried out at Google. Lead author Aran Komatsuzaki, then a PhD student at the Georgia Institute of Technology and a researcher at EleutherAI, completed it during an internship at Google Research, with co-authors from Google Brain and Google Research: Joan Puigcerver, James Lee-Thorp, Carlos Riquelme Ruiz, Basil Mustafa, Joshua Ainslie, Yi Tay, Mostafa Dehghani, and Neil Houlsby. ^[1] The central appeal is economic: the method converts a strong dense checkpoint into a higher-capacity sparse model for a fraction of the cost of training a comparable MoE from scratch, and it can draw on the large and growing stock of openly available dense models.

Motivation

Mixture-of-experts layers increase a model's parameter count without a proportional increase in the compute spent per token. An MoE layer holds many parallel expert sub-networks but routes each token to only one or a few of them, so the number of floating-point operations per token stays roughly fixed even as total parameters grow. Architectures such as the Switch Transformer and GShard used this property to scale to very large parameter counts at near-constant per-token cost, often yielding better quality than a dense model trained with the same compute. ^[1]
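The gap between stored parameters and per-token work can be made concrete with a little arithmetic. The sketch below is illustrative only: the layer sizes, the function names, and the choice to count just the two MLP weight matrices and a linear router are assumptions, not figures from any model discussed here.

```python
def mlp_params(d_model: int, d_ff: int) -> int:
    """Weights in one two-matrix feed-forward block (biases ignored)."""
    return 2 * d_model * d_ff


def moe_layer_stats(d_model: int, d_ff: int, num_experts: int, experts_per_token: int):
    """Return (stored_params, active_params_per_token) for one MoE layer."""
    expert = mlp_params(d_model, d_ff)
    router = d_model * num_experts  # one linear map from a token to expert logits
    stored = num_experts * expert + router
    active = experts_per_token * expert + router
    return stored, active


# Illustrative sizes only (assumed, not taken from the article's sources).
dense = mlp_params(d_model=768, d_ff=3072)
stored, active = moe_layer_stats(768, 3072, num_experts=32, experts_per_token=2)
print(f"dense MLP params:        {dense:,}")
print(f"MoE stored params:       {stored:,} ({stored / dense:.1f}x)")
print(f"MoE active params/token: {active:,} ({active / dense:.1f}x)")
```

Under these assumptions the layer stores roughly 32 times the dense MLP's weights while each token touches only about twice as many, which is the property that lets total parameters grow without a matching growth in per-token compute.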

The drawback is that training such large MoE models from random initialization is itself expensive, and the field had already invested enormous compute in dense checkpoints such as T5, GPT-style decoders, and Vision Transformers. Discarding those dense weights to start an MoE over from scratch wastes that sunk cost. Sparse upcycling targets exactly this gap: it asks whether a dense checkpoint can be cheaply "upgraded" into a sparse MoE that surpasses the original dense model and approaches the quality of a from-scratch MoE, using only a modest amount of additional training. ^[1]

How sparse upcycling works

Sparse upcycling starts from a fully trained dense Transformer and applies a structural transformation followed by continued training. ^[1] The key steps are:

Select layers to convert. A subset of the dense MLP layers is replaced with MoE layers. By default the authors convert half of the MLP layers (for example, every other block), leaving the remaining MLP layers dense.
Clone the MLP into experts. Inside each new MoE layer, every expert is initialized as an identical copy of that layer's original dense MLP. The authors use 32 experts as their default configuration, which they found a reasonable compromise for their compute budgets.
Copy everything else. The remaining MLP layers, all attention layers, the layer-normalization parameters, and the embedding and output projections are copied directly from the dense checkpoint, so the upcycled model inherits almost all of the original network's learned representations.
Add and initialize a router. A new routing (gating) network is placed in front of each MoE layer. The router is the only component initialized from scratch; its weights are drawn from a zero-mean normal distribution with standard deviation 0.02. For the vision models and the T5 encoder the authors use Expert Choice routing with a capacity factor of 2; for the T5 decoder they use top-k token-choice routing with k = 2. ^[1]
Continue training. The upcycled model is trained further on the original objective. Because all experts start identical and the router starts random, the experts only become meaningfully different through this continued training, which breaks the initial symmetry and drives specialization.
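The structural part of the steps above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the dict-based checkpoint layout, the function names `upcycle_block` and `upcycle_model`, the toy `d_model`, and the choice to convert the odd-numbered blocks are all assumptions made for the example. Only the cloning rule, the copy-everything-else rule, and the router standard deviation of 0.02 come from the description above.

```python
import copy
import random


def upcycle_block(dense_block: dict, num_experts: int, d_model: int,
                  router_std: float = 0.02, seed: int = 0) -> dict:
    """Turn one dense Transformer block into an MoE block.

    `dense_block` is a hypothetical dict with "attention", "layer_norm" and
    "mlp" entries; real checkpoints use framework-specific parameter trees.
    """
    rng = random.Random(seed)
    return {
        # Everything except the MLP is carried over unchanged.
        "attention": copy.deepcopy(dense_block["attention"]),
        "layer_norm": copy.deepcopy(dense_block["layer_norm"]),
        # Every expert starts as an identical copy of the dense MLP.
        "experts": [copy.deepcopy(dense_block["mlp"]) for _ in range(num_experts)],
        # The router is freshly initialized from N(0, router_std^2).
        "router": [[rng.gauss(0.0, router_std) for _ in range(num_experts)]
                   for _ in range(d_model)],
    }


def upcycle_model(dense_blocks: list, num_experts: int = 32, d_model: int = 4) -> list:
    """Convert every other block (half of the MLP layers); copy the rest."""
    return [
        upcycle_block(block, num_experts, d_model, seed=i) if i % 2 == 1
        else copy.deepcopy(block)
        for i, block in enumerate(dense_blocks)
    ]


# Toy checkpoint with four blocks; the weight values are placeholders.
dense = [{"attention": [0.1 * i], "layer_norm": [1.0], "mlp": [[0.5 * i, -0.5 * i]]}
         for i in range(4)]
sparse = upcycle_model(dense)
print(["moe" if "experts" in block else "dense" for block in sparse])
```

The experts are deep copies rather than shared references, so continued training can update each one independently; at this point they differ only in how the router will weight them.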

The paper highlights a defining difficulty: "a central challenge in model upcycling is overcoming the initial performance decrease entailed by changing a trained network's structure." ^[1] Immediately after the structural change, and especially with a randomly initialized router that may drop or mis-route tokens, the upcycled model can temporarily perform worse than the dense checkpoint it came from. With continued training it recovers and then overtakes the dense baseline. The authors also note that design choices interact with modality: normalizing each token's router combine weights so that they sum to 1 helped the vision models but hurt the language models, a difference they hypothesize stems from the vision models using Expert Choice routing throughout while the language models use top-k routing in the decoder. ^[1]
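One way to see where the initial drop comes from, and why normalizing the routing weights can shrink it, is to run a single token through a freshly upcycled layer. The sketch below assumes top-2 token-choice routing with no capacity limit (so no token is dropped), a toy one-layer ReLU "MLP" with identity weights, and hypothetical helper names; it is not the paper's code. Because every expert is identical at initialization, the MoE output is the dense output scaled by the sum of the selected gate values.

```python
import math
import random


def dense_mlp(x, w):
    """Stand-in for the dense MLP: one linear map followed by ReLU.

    w is a list of output columns, each holding one weight per input feature.
    """
    return [max(0.0, sum(xi * wi for xi, wi in zip(x, col))) for col in w]


def moe_forward(x, experts, router, k=2, normalize=False):
    """Top-k token-choice MoE layer applied to a single token."""
    logits = [sum(xi * ri for xi, ri in zip(x, col)) for col in router]
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    probs = [e / sum(exps) for e in exps]
    top = sorted(range(len(experts)), key=lambda e: probs[e], reverse=True)[:k]
    weights = [probs[e] for e in top]
    if normalize:
        # Rescale the selected gate values so they sum to 1.
        total = sum(weights)
        weights = [wt / total for wt in weights]
    outputs = [dense_mlp(x, experts[e]) for e in top]
    return [sum(wt * out[j] for wt, out in zip(weights, outputs))
            for j in range(len(outputs[0]))]


rng = random.Random(0)
x = [1.0, 0.5, -0.5, 2.0]
w = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]  # identity weights, for readability
experts = [w for _ in range(8)]  # identical experts, as right after upcycling
router = [[rng.gauss(0.0, 0.02) for _ in x] for _ in experts]

reference = dense_mlp(x, w)
unnormalized = moe_forward(x, experts, router)
normalized = moe_forward(x, experts, router, normalize=True)
print("dense MLP:      ", reference)
print("MoE, raw gates: ", [round(v, 3) for v in unnormalized])
print("MoE, normalized:", [round(v, 3) for v in normalized])
```

With eight experts and a near-uniform random router, the two selected gates sum to roughly a quarter, so the unnormalized layer emits a heavily shrunken copy of the dense output, while the normalized layer reproduces it. This toy setting says nothing about which choice trains better afterwards, which is the question the paper's modality-dependent finding addresses.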

Results

The paper validates the method on both language and vision Transformers. On the language side it upcycles T5 Base, Large, and XL and evaluates on SuperGLUE; on the vision side it upcycles Vision Transformer (ViT) Base and Large and evaluates on ImageNet. ^[1] In both domains the upcycled MoE models significantly outperformed their dense counterparts while adding extra compute equal to only about 50 percent of the cost of the original dense pretraining run. ^[1]

Sparse upcycling also compared favorably against training an equivalent MoE from random initialization. The paper reports that a language MoE trained from scratch needed roughly 120 percent of the original dense checkpoint's compute budget merely to catch up to the upcycled model, meaning upcycling reached a given quality far sooner. ^[1] The authors characterize the method as most effective when the additional training budget is modest, on the order of a 10 percent to 60 percent increase over the dense pretraining cost, which is the regime where reusing the dense weights gives the largest head start.

Adoption and significance

Sparse upcycling became a standard tool for producing open and industrial MoE models cheaply by recycling existing dense checkpoints. Several widely used models adopted it, typically following the same recipe of cloning the dense MLP into experts and adding a fresh router:

Model	Developer	Year	Dense source	Sparse configuration
MiniCPM-MoE	OpenBMB / Tsinghua	2024	MiniCPM 2B	13.6B total parameters, two of eight experts active per token ^[3]
Qwen2-57B-A14B	Alibaba	2024	Qwen2-7B	57B total parameters, 14B active per token ^[4]
Skywork-MoE	Kunlun / Skywork	2024	Skywork-13B	146B total parameters, 16 experts, two active (about 22B active) ^[5]
Nemotron upcycling study	NVIDIA	2024	Nemotron dense models	systematic study of router and granularity variants ^[6]

The MiniCPM report states explicitly that MiniCPM-MoE is initialized via sparse upcycling, replacing each MLP layer with an MoE layer of cloned experts and randomly initializing the router. ^[3] Qwen2-57B-A14B was upcycled from the dense Qwen2-7B and is reported to match the quality of a roughly 30B dense model while activating only 14B parameters per token. ^[4] Skywork-MoE was upcycled from Skywork-13B, and its authors used the project to compare upcycling directly against from-scratch training. ^[5] NVIDIA's "Upcycling Large Language Models into Mixture of Experts" extended the original recipe with variants such as softmax-then-top-k routing and virtual expert groups, applied to Nemotron models. ^[6] Mistral's Mixtral has been speculated to rely on a similar dense-to-MoE construction, but its developers have not confirmed this, so it is not a documented case.

The broader significance is that sparse upcycling reframed MoE training as an incremental upgrade of existing dense models rather than an all-or-nothing investment, lowering the barrier to building sparse models and influencing a line of follow-up work on warm-starting, expert initialization, and breaking expert symmetry. ^[6]

Limitations

Sparse upcycling is most advantageous at relatively small additional budgets, and its edge erodes as training continues. The original paper notes that for very large compute budgets, an MoE trained from scratch will eventually catch and surpass the upcycled model, because the upcycled experts begin highly correlated and take time to diversify. ^[1] It also observes that using a very large number of experts produces a larger initial quality drop relative to the dense baseline. ^[1]

Later work sharpened these caveats. In the OLMoE project, the authors compared upcycling OLMo-1B against training a comparable decoder-only MoE from scratch and found the crossover much earlier than in the original study: the from-scratch model caught up after roughly 500B tokens, around 25 percent of the dense model's compute rather than the 120 percent reported by Komatsuzaki et al., and was clearly ahead by about 600B tokens. The OLMoE authors attributed the earlier crossover to using token-choice routing and a decoder-only model started from a heavily trained dense baseline. ^[7] Because they planned to train for about 250 percent of the dense budget, the OLMoE team chose not to upcycle. They also flagged a subtler cost: upcycling locks the new model into the dense checkpoint's original architecture and hyperparameters, preventing adoption of later improvements such as QK-normalization or better initialization. ^[7]

Other practical limitations are inherited from MoE models in general. Because all experts start as copies of one MLP, they are initially redundant and prone to expert collapse or load imbalance, where some experts are underused, so auxiliary load-balancing losses and careful routing are usually required. And while sparse upcycling keeps the per-token compute close to the dense model, the full set of experts must still be stored and served, raising memory and deployment costs relative to the original dense checkpoint. ^[1]^[7]
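As an illustration of what an auxiliary load-balancing loss looks like, the sketch below implements the form popularized by the Switch Transformer for top-1 routing: the number of experts times the sum, over experts, of the fraction of tokens dispatched to each expert multiplied by its mean router probability, scaled by a small coefficient. It is offered as one common example, not as the loss used in the sparse upcycling paper or in any of the models above; the function name, the coefficient `alpha=0.01`, and the toy probabilities are assumptions.

```python
def load_balancing_loss(router_probs, alpha=0.01):
    """Switch-Transformer-style auxiliary loss for top-1 routing.

    router_probs: one list of per-expert probabilities for each token.
    """
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    # f[i]: fraction of tokens whose highest-probability expert is i.
    f = [0.0] * num_experts
    for probs in router_probs:
        f[max(range(num_experts), key=lambda e: probs[e])] += 1.0 / num_tokens
    # p[i]: mean router probability assigned to expert i across tokens.
    p = [sum(probs[i] for probs in router_probs) / num_tokens
         for i in range(num_experts)]
    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, p))


# Four tokens, four experts: one batch spreads tokens evenly, one collapses.
balanced = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1],
            [0.1, 0.1, 0.7, 0.1], [0.1, 0.1, 0.1, 0.7]]
collapsed = [[0.7, 0.1, 0.1, 0.1]] * 4

print("balanced routing loss: ", load_balancing_loss(balanced))
print("collapsed routing loss:", load_balancing_loss(collapsed))
```

The loss is smallest when tokens and probability mass are spread evenly and grows when one expert absorbs the batch, so adding it to the training objective pushes the router away from the collapsed pattern that identical, freshly cloned experts make easy to fall into.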

References

[1] Komatsuzaki, A., Puigcerver, J., Lee-Thorp, J., Riquelme Ruiz, C., Mustafa, B., Ainslie, J., Tay, Y., Dehghani, M., Houlsby, N. "Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints." ICLR 2023. arXiv:2212.05055. https://arxiv.org/abs/2212.05055
[2] OpenReview, ICLR 2023 conference page for "Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints." https://openreview.net/forum?id=T5nUQDrM4u
[3] Hu, S., et al. "MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies." 2024. arXiv:2404.06395. https://arxiv.org/abs/2404.06395
[4] Qwen Team, Alibaba. "Qwen2 Technical Report." 2024. arXiv:2407.10671. https://arxiv.org/abs/2407.10671
[5] Wei, T., et al. "Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models." 2024. arXiv:2406.06563. https://arxiv.org/abs/2406.06563
[6] He, E., et al. (NVIDIA). "Upcycling Large Language Models into Mixture of Experts." 2024. arXiv:2410.07524. https://arxiv.org/abs/2410.07524
[7] Muennighoff, N., et al. "OLMoE: Open Mixture-of-Experts Language Models." 2024. arXiv:2409.02060. https://arxiv.org/abs/2409.02060
