Depth up-scaling (DUS)

Updated Jun 9, 2026

Overview

Depth up-scaling (DUS) is a model-scaling method that builds a deeper large language model by duplicating and stacking the layers of an existing pretrained Transformer checkpoint. The enlarged network is then pretrained further so that it recovers, and ultimately surpasses, the performance of the original. The technique was introduced by the South Korean company Upstage in the paper "SOLAR 10.7B: Scaling Large Language Models with Simple yet Effective Depth Up-Scaling," first posted to arXiv on December 23, 2023 and later accepted to the Industry Track of NAACL 2024 ^[1]. The accompanying SOLAR 10.7B model was released as an openly licensed checkpoint in December 2023.

The defining property of DUS is that it produces an ordinary dense Transformer. Unlike scaling approaches based on a mixture of experts, it adds no gating networks, routers, or dynamic expert selection, so the up-scaled model works with existing training and inference frameworks without special handling ^[1]. In SOLAR's case, DUS converts a strong, fully trained 7-billion-parameter checkpoint into a roughly 11-billion-parameter model at a fraction of the cost of pretraining a model of that size from scratch.

Motivation

Empirical scaling laws indicate that, holding data and compute roughly in balance, larger models tend to perform better. Training a fresh model of the desired size from random initialization is expensive, however, and it discards the substantial compute already invested in strong open checkpoints. By late 2023 the most capable openly released base model near the 7-billion-parameter scale was Mistral 7B, and practitioners wanted a way to grow such a checkpoint into a larger, stronger model without starting over ^[1].

The leading alternative for cheaply enlarging a dense checkpoint was sparse upcycling, which converts a dense model into a mixture-of-experts model by cloning the feed-forward blocks into multiple experts and adding a learned router ^[5]. That approach increases parameter count and capacity while keeping per-token compute modest, but it changes the architecture: the resulting model needs routing logic, load balancing during training, and expert-parallel serving infrastructure, and it can be awkward to deploy or to fine-tune with standard tooling ^[1]^[5]. Upstage designed DUS to avoid this complexity entirely. The goals were to (1) reuse a top-performing existing checkpoint rather than discard it, (2) grow the model along depth, which prior work had found to be an effective axis of scaling, and (3) keep the output a plain dense Transformer so that no part of the standard pretraining, fine-tuning, or inference stack needs modification ^[1].

How depth up-scaling works

DUS proceeds in two stages: a one-time structural operation called depthwise scaling, followed by continued pretraining of the enlarged model ^[1].

Depthwise scaling

Depthwise scaling takes a base model with n Transformer layers and produces a model with s layers using a single integer hyperparameter m, the number of layers trimmed from each copy. The procedure is:

1. Start from a base model with n layers. For SOLAR, this is a 32-layer Llama 2-style architecture initialized with the pretrained weights of Mistral 7B, which itself has 32 layers ^[1].
2. Make two identical copies of the base model.
3. Remove the final m layers from the first copy and the initial m layers from the second copy.
4. Concatenate the two trimmed copies, stacking the second on top of the first.
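
The four steps above amount to a slice-and-concatenate over the list of layers. The following is a minimal sketch that treats each layer as an opaque list item (here, its 1-based index); the helper name `depthwise_scale` is hypothetical and is not taken from the SOLAR release.

```python
from typing import List, TypeVar

T = TypeVar("T")


def depthwise_scale(layers: List[T], m: int) -> List[T]:
    """Stack two trimmed copies of `layers` as in depthwise scaling.

    The first copy loses its final m layers, the second copy loses its
    initial m layers, and the second is placed on top of the first.
    """
    n = len(layers)
    if not 0 <= m < n:
        raise ValueError("m must satisfy 0 <= m < n")
    first_copy = layers[: n - m]   # layers 1 .. n-m of the base model
    second_copy = layers[m:]       # layers m+1 .. n of the base model
    return first_copy + second_copy


# Illustration only: integers stand in for the 32 layers of the base model.
base_layers = list(range(1, 33))
scaled = depthwise_scale(base_layers, m=8)

print(len(scaled))             # 48
print(scaled[23], scaled[24])  # 24 9  (the two layers that meet at the seam)
```

With real weights, the second copy would need to be an independent copy of the parameters rather than a shared reference, so that continued pretraining can update the duplicated layers separately; the sketch does not model that.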

The resulting network has

s = 2 * (n - m)

layers. With n = 32 and m = 8 this yields s = 48 layers, and at the hidden dimensions inherited from Mistral 7B the model totals roughly 10.7 billion parameters ^[1].

Symbol	Meaning	SOLAR 10.7B value
n	Layers in the base model	32
m	Layers trimmed from each copy	8
s = 2(n - m)	Layers in the up-scaled model	48
Parameters	Total weights after scaling	~10.7 billion
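
The parameter total can be checked with a short back-of-the-envelope count. The sketch below is an approximation under stated assumptions: the dimensions (hidden size 4096, feed-forward size 14336, 32 query heads, 8 key/value heads, vocabulary 32000) are taken from Mistral 7B's publicly released configuration rather than from this article, and the count assumes no bias terms, untied input and output embeddings, and two normalization weight vectors per layer. The function name `decoder_params` is hypothetical.

```python
def decoder_params(
    num_layers: int,
    hidden: int = 4096,        # assumed: Mistral 7B hidden size
    intermediate: int = 14336, # assumed: Mistral 7B feed-forward size
    num_heads: int = 32,       # assumed: query heads
    num_kv_heads: int = 8,     # assumed: grouped-query key/value heads
    vocab: int = 32000,        # assumed: vocabulary size
) -> int:
    """Approximate weight count of a Llama-style dense decoder."""
    head_dim = hidden // num_heads
    kv_dim = num_kv_heads * head_dim
    # Query and output projections are hidden x hidden; key and value
    # projections are hidden x kv_dim under grouped-query attention.
    attention = 2 * hidden * hidden + 2 * hidden * kv_dim
    # Gated feed-forward block: gate, up, and down projections.
    mlp = 3 * hidden * intermediate
    norms = 2 * hidden  # two normalization weight vectors per layer
    per_layer = attention + mlp + norms
    embeddings = 2 * vocab * hidden  # input embedding plus untied output head
    final_norm = hidden
    return num_layers * per_layer + embeddings + final_norm


print(decoder_params(32))  # 7241732096   (about 7.2 billion)
print(decoder_params(48))  # 10731524096  (about 10.7 billion)
```

Because only the layer count changes, the embedding and output-head weights are counted once in both cases, which is why adding 50% more layers yields slightly less than 50% more parameters.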

The reason for trimming layers, rather than simply stacking two full 32-layer copies into a 64-layer model, is to soften the discontinuity at the join. Where the two copies meet, the last retained layer of the first copy feeds the first retained layer of the second. If no layers were removed, the output of the original final layer (layer 32) would feed a layer that expects input from the very start of the network (layer 1), the two points farthest apart in depth. Discarding the 2m layers adjacent to the join (the last m of the first copy and the first m of the second) means that in SOLAR's case the seam instead connects layer 24 to layer 9, which sit closer together in the network's processing pipeline. The authors describe this as reducing "the discrepancy at the seam," which makes it easier for the subsequent training to recover performance quickly ^[1]. The choice of m = 8, and thus s = 48, was driven largely by available hardware and the desire to land in the 7-billion to 13-billion-parameter band rather than by a proof of optimality ^[1].
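
The effect of m on the seam can be tabulated directly. The sketch below is illustrative only: the helper `seam` is hypothetical, and the gap between layer indices is used as a rough stand-in for how dissimilar the two sides of the join are, which is an assumption rather than a measurement from the paper.

```python
from typing import Tuple


def seam(n: int, m: int) -> Tuple[int, int]:
    """Return the 1-based base-model indices of the two layers that meet.

    The first value is the last layer kept from the first copy; the second
    is the first layer kept from the second copy.
    """
    return n - m, m + 1


n = 32
for m in (0, 4, 8, 12, 16):
    last_kept, first_kept = seam(n, m)
    depth = 2 * (n - m)
    print(f"m={m:2d}: layer {last_kept} feeds layer {first_kept}, depth {depth}")
```

At m = 0 the seam joins layer 32 to layer 1, and at m = 8 it joins layer 24 to layer 9. At m = 16 layer 16 feeds layer 17, which is simply the original 32-layer model, so a larger m narrows the seam only by giving up added depth.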

Continued pretraining

Immediately after depthwise scaling the model performs worse than the base model, because the stacked weights no longer compose into a coherent forward pass at the seam. The second stage heals this by continuing to pretrain the full 48-layer model on a language-modeling objective. Upstage reports that performance drops sharply right after scaling and then recovers rapidly during continued pretraining, eventually exceeding the base model. The paper attributes the speed of this recovery to the seam-smoothing effect of trimming the middle layers, which reduces the heterogeneity the optimizer must reconcile ^[1]. The published work specifies the instruction-tuning and alignment data in detail but does not fully document the corpus used for this continued-pretraining stage, which is one of the reproducibility gaps noted by later readers ^[1].

SOLAR 10.7B

SOLAR 10.7B is the reference model produced by DUS and the vehicle through which the method was introduced. The base model (SOLAR-10.7B-v1.0) was released under the permissive Apache 2.0 license in December 2023 ^[1]^[7]. Upstage then ran two further stages to produce the aligned variant SOLAR-10.7B-Instruct-v1.0:

1. Instruction tuning (supervised fine-tuning) on a mixture that included Alpaca-GPT4, OpenOrca, and a synthesized math-instruction dataset ^[1].
2. Alignment tuning with direct preference optimization (DPO), using cleaned UltraFeedback data and a synthesized math-alignment dataset in which a rephrased answer served as the preferred response and the original as the rejected one ^[1].
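
The construction of the synthesized math-alignment pairs can be sketched as a small data structure. This is a hypothetical illustration, not Upstage's pipeline: the field names `prompt`, `chosen`, and `rejected` follow a common convention in preference-tuning tooling, the helper `make_math_alignment_pair` is invented for this example, and the sample question and answers are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One DPO training example: a prompt with a preferred and a rejected reply."""
    prompt: str
    chosen: str
    rejected: str


def make_math_alignment_pair(
    question: str, original_answer: str, rephrased_answer: str
) -> PreferencePair:
    # As described above, the rephrased answer is treated as the preferred
    # response and the original answer as the rejected one.
    return PreferencePair(
        prompt=question,
        chosen=rephrased_answer,
        rejected=original_answer,
    )


# Invented sample data, for illustration only.
pair = make_math_alignment_pair(
    question="What is 12 * 7?",
    original_answer="84",
    rephrased_answer="12 * 7 = 84, so the answer is 84.",
)
print(pair.chosen)
```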

On the six-task average (often called H6) of the Hugging Face Open LLM Leaderboard, the aligned model was competitive with or better than substantially larger models. The figures below are reported by Upstage ^[1].

Model	Approx. params	H6 average	ARC	HellaSwag	MMLU	TruthfulQA	Winogrande	GSM8K
SOLAR 10.7B-Instruct	10.7B (dense)	74.20	71.08	88.16	66.21	71.43	83.58	64.75
Mixtral 8x7B-Instruct-v0.1	46.7B (MoE)	72.62	-	-	-	-	-	-
Qwen 14B	14B (dense)	65.86	-	-	-	-	-	-
Mistral 7B-Instruct-v0.2	7.2B (dense)	65.71	-	-	-	-	-	-
SOLAR 10.7B (base)	10.7B (dense)	66.04	61.95	84.60	65.48	45.04	83.66	55.50
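
The H6 column is the unweighted mean of the six task scores, which can be confirmed from the two rows that list per-task figures. The sketch below uses only numbers from the table above; the variable names are arbitrary.

```python
from statistics import mean

TASKS = ("ARC", "HellaSwag", "MMLU", "TruthfulQA", "Winogrande", "GSM8K")

# Per-task scores copied from the table above.
scores = {
    "SOLAR 10.7B-Instruct": (71.08, 88.16, 66.21, 71.43, 83.58, 64.75),
    "SOLAR 10.7B (base)": (61.95, 84.60, 65.48, 45.04, 83.66, 55.50),
}

# H6 is the plain average of the six tasks, rounded to two decimals.
h6 = {model: round(mean(values), 2) for model, values in scores.items()}
for model, average in h6.items():
    print(f"{model}: {average:.2f}")

# Per-task change from the base model to the aligned model.
deltas = {
    task: round(tuned - base, 2)
    for task, tuned, base in zip(
        TASKS, scores["SOLAR 10.7B-Instruct"], scores["SOLAR 10.7B (base)"]
    )
}
largest = max(deltas, key=deltas.get)
print(largest, deltas[largest])
```

The averages reproduce the 74.20 and 66.04 reported in the table. The largest single-task change between the base and aligned rows is on TruthfulQA, while MMLU and Winogrande barely move, so the gap between the two rows reflects the tuning stages rather than depthwise scaling.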

The headline result is that SOLAR 10.7B-Instruct, a dense 10.7-billion-parameter model, scored higher on the H6 average than the much larger Mixtral 8x7B-Instruct mixture-of-experts model ^[1]. Because the output is a conventional dense checkpoint, SOLAR also became a popular base for the open-weights community to fine-tune and merge, and DUS was subsequently adopted and adapted by other groups building mid-sized models from 7-billion-parameter checkpoints.

Relationship to other model-growth methods

DUS belongs to a broader family of model-growth or model-reuse techniques that initialize a larger network from a smaller trained one rather than from scratch. It is distinguished within that family chiefly by being deliberately non-function-preserving and by being applied once to a target size.

Net2Net (Chen, Goodfellow, and Shlens, 2015) introduced function-preserving transformations that grow a network while exactly reproducing its outputs; its Net2DeeperNet operator inserts new layers initialized to compute the identity so the deeper model starts as an exact functional copy of the shallower one ^[2].
Progressive or gradual stacking (Gong et al., 2019) trains a shallow Transformer, then repeatedly doubles its depth by copying existing layers, accelerating pretraining by spending early compute on a smaller model ^[3].
bert2BERT (Chen et al., 2022) adapted the function-preserving idea to Transformers, transferring the weights of a smaller pretrained model into a larger one as a warm start to cut pretraining cost ^[4].
Sparse upcycling (Komatsuzaki et al., 2022) grows a dense checkpoint along a different axis, converting it into a mixture-of-experts model, which is the approach DUS was explicitly designed to avoid ^[1]^[5].

The SOLAR paper frames DUS as inspired by the depthwise component of compound model scaling from EfficientNet (Tan and Le, 2019) and positions it against mixture-of-experts scaling ^[1]^[6]. Two differences set DUS apart from the function-preserving methods above. First, DUS does not preserve the base model's function: the stacking step intentionally creates a discontinuity at the seam and relies on continued pretraining to repair it, trading an exact warm start for a much simpler construction with no identity-initialized layers or weight-mapping rules. Second, DUS grows the model a single time to a fixed target depth using a strong, fully pretrained checkpoint, rather than growing progressively during training from a small or randomly initialized model. The shared thread across all of these methods is the reuse of existing trained weights to reach a larger model more cheaply than full pretraining from scratch.
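
The first difference can be seen in a toy setting. The sketch below is a deliberately simplified illustration, not either paper's implementation: "layers" are scalar functions, `insert_identity` mimics the idea behind Net2DeeperNet by adding a layer that computes the identity, and `depth_up_scale` applies the DUS slice-and-stack rule. All names and the toy functions are hypothetical.

```python
from typing import Callable, List

Layer = Callable[[float], float]


def forward(layers: List[Layer], x: float) -> float:
    """Apply the layers in order, as a stand-in for a forward pass."""
    for layer in layers:
        x = layer(x)
    return x


def identity(x: float) -> float:
    return x


def insert_identity(layers: List[Layer], position: int) -> List[Layer]:
    """Function-preserving growth: the new layer initially changes nothing."""
    return layers[:position] + [identity] + layers[position:]


def depth_up_scale(layers: List[Layer], m: int) -> List[Layer]:
    """DUS-style growth: stack two trimmed copies of the existing layers."""
    n = len(layers)
    return layers[: n - m] + layers[m:]


# Toy four-layer "model"; each layer is an arbitrary scalar function.
base: List[Layer] = [
    lambda x: 2 * x + 1,
    lambda x: x - 3,
    lambda x: 0.5 * x,
    lambda x: x + 4,
]

deeper = insert_identity(base, position=2)  # 5 layers, same function
stacked = depth_up_scale(base, m=1)         # 6 layers, different function

print(forward(base, 1.0))     # 4.0
print(forward(deeper, 1.0))   # 4.0
print(forward(stacked, 1.0))  # 2.5
```

The identity-initialized model reproduces the base output exactly, whereas the stacked model does not, which is the gap that continued pretraining is relied on to close in DUS.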

Limitations

Several caveats temper the method. The hyperparameters m and s were selected to fit available hardware and a target parameter band, and the authors explicitly state they did not establish that these values are optimal, leaving the best trim depth an open question ^[1]. The continued-pretraining corpus is not fully documented, which limits exact reproducibility of the base model ^[1]. DUS is also not free: healing the seam requires a nontrivial amount of continued pretraining, so the saving is relative to training a same-size model from scratch rather than an avoidance of large-scale training altogether. Some of SOLAR 10.7B's measured gains stem from its instruction tuning and DPO alignment data rather than from depthwise scaling alone, so leaderboard comparisons should not be read as isolating the contribution of DUS. The method as described scales only depth, leaving width and other dimensions fixed, and the published recipe demonstrates a single doubling-and-trimming step; how well DUS composes across repeated applications or much larger scale factors was not characterized in the original work ^[1]. Finally, because the seam is repaired statistically by training rather than constructed to preserve function, there is no guarantee that the trimmed middle layers were the least useful ones to remove, only the empirical evidence that the resulting model trains back to strong performance.

References

[1] Kim, D., Park, C., Kim, S., et al. (Upstage AI). "SOLAR 10.7B: Scaling Large Language Models with Simple yet Effective Depth Up-Scaling." NAACL 2024 (Industry Track). arXiv:2312.15166. https://arxiv.org/abs/2312.15166
[2] Chen, T., Goodfellow, I., and Shlens, J. "Net2Net: Accelerating Learning via Knowledge Transfer." International Conference on Learning Representations (ICLR 2016). arXiv:1511.05641. https://arxiv.org/abs/1511.05641
[3] Gong, L., He, D., Li, Z., Qin, T., Wang, L., and Liu, T. "Efficient Training of BERT by Progressively Stacking." International Conference on Machine Learning (ICML 2019). https://proceedings.mlr.press/v97/gong19a.html
[4] Chen, C., Yin, Y., Shang, L., Jiang, X., Qin, Y., Wang, F., Wang, Z., Chen, X., Liu, Z., and Liu, Q. "bert2BERT: Towards Reusable Pretrained Language Models." Annual Meeting of the Association for Computational Linguistics (ACL 2022). arXiv:2110.07143. https://arxiv.org/abs/2110.07143
[5] Komatsuzaki, A., Puigcerver, J., Lee-Thorp, J., Ruiz, C. R., Mustafa, B., Ainslie, J., Tay, Y., Dehghani, M., and Houlsby, N. "Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints." International Conference on Learning Representations (ICLR 2023). arXiv:2212.05055. https://arxiv.org/abs/2212.05055
[6] Tan, M., and Le, Q. "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks." International Conference on Machine Learning (ICML 2019). arXiv:1905.11946. https://arxiv.org/abs/1905.11946
[7] Upstage. "SOLAR-10.7B-v1.0" model card. Hugging Face. https://huggingface.co/upstage/SOLAR-10.7B-v1.0
