Depth up-scaling (DUS)
Last reviewed
Jun 8, 2026
Sources
7 citations
Review status
Source-backed
Revision
v1 · 2,047 words
Depth up-scaling (DUS) is a model-scaling method that builds a deeper large language model by duplicating and stacking the layers of an existing pretrained Transformer checkpoint and then continuing to pretrain the enlarged network so that it recovers and surpasses the performance of the original. The technique was introduced by the South Korean company Upstage in the paper "SOLAR 10.7B: Scaling Large Language Models with Simple yet Effective Depth Up-Scaling," first posted to arXiv on December 23, 2023 and later accepted to the Industry Track of NAACL 2024 [1]. The accompanying SOLAR 10.7B model was released as an openly licensed checkpoint in December 2023.
The defining property of DUS is that it produces an ordinary dense Transformer. Unlike scaling approaches based on a mixture of experts, it adds no gating networks, routers, or dynamic expert selection, so the up-scaled model slots directly into existing training and inference frameworks with no special handling [1]. DUS is best understood as a way to convert a strong, fully trained 7-billion-parameter checkpoint into a roughly 11-billion-parameter model at a fraction of the cost of pretraining a model of that size from scratch.
Empirical scaling laws indicate that, holding data and compute roughly in balance, larger models tend to perform better. Training a fresh model of the desired size from random initialization is expensive, however, and it discards the substantial compute already invested in strong open checkpoints. By late 2023 the most capable openly released base model near the 7-billion-parameter scale was Mistral 7B, and practitioners wanted a way to grow such a checkpoint into a larger, stronger model without starting over [1].
The leading alternative for cheaply enlarging a dense checkpoint was sparse upcycling, which converts a dense model into a mixture-of-experts model by cloning the feed-forward blocks into multiple experts and adding a learned router [5]. That approach increases parameter count and capacity while keeping per-token compute modest, but it changes the architecture: the resulting model needs routing logic, load-balancing during training, and expert-parallel serving infrastructure, and it can be awkward to deploy or to fine-tune with standard tooling [1][5]. Upstage designed DUS to avoid this complexity entirely. The goals were to (1) reuse a top-performing existing checkpoint rather than discard it, (2) grow the model along depth, which prior work had found to be an effective axis of scaling, and (3) keep the output a plain dense Transformer so that no part of the standard pretraining, fine-tuning, or inference stack needs modification [1].
DUS proceeds in two stages: a one-time structural operation called depthwise scaling, followed by continued pretraining of the enlarged model [1].
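Depthwise scaling takes a base model with n Transformer layers and produces a model with s layers using a single integer hyperparameter m, the number of layers trimmed from each copy. The procedure is:
1. Duplicate the n-layer base model, including its pretrained weights.
2. Remove the final m layers from the original and the initial m layers from the duplicate.
3. Concatenate the two trimmed stacks, original first, into a single network.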
The resulting network has s = 2(n - m) layers. With n = 32 and m = 8 this yields s = 2(32 - 8) = 48 layers, and at the hidden dimensions inherited from Mistral 7B the model totals roughly 10.7 billion parameters [1].
| Symbol | Meaning | SOLAR 10.7B value |
|---|---|---|
| n | Layers in the base model | 32 |
| m | Layers trimmed from each copy | 8 |
| s = 2(n - m) | Layers in the up-scaled model | 48 |
| Parameters | Total weights after scaling | ~10.7 billion |
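The parameter total in the table can be roughly reproduced by counting weights per layer. The sketch below is only a back-of-the-envelope check: the dimensions are assumptions taken from Mistral 7B's published configuration (hidden size 4096, feed-forward size 14336, 32 query heads, 8 key/value heads, vocabulary 32000), and it further assumes bias-free projections and untied input and output embeddings. The names `DecoderConfig`, `params_per_layer`, and `total_params` are hypothetical helpers, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecoderConfig:
    # Assumed values, taken from Mistral 7B's published configuration.
    hidden: int = 4096
    intermediate: int = 14336
    n_heads: int = 32
    n_kv_heads: int = 8
    vocab: int = 32000


def params_per_layer(cfg: DecoderConfig) -> int:
    head_dim = cfg.hidden // cfg.n_heads
    kv_dim = cfg.n_kv_heads * head_dim
    # Query and output projections are hidden x hidden;
    # key and value projections are hidden x kv_dim (grouped-query attention).
    attn = 2 * cfg.hidden * cfg.hidden + 2 * cfg.hidden * kv_dim
    # Gated feed-forward block: gate, up, and down projections.
    mlp = 3 * cfg.hidden * cfg.intermediate
    # Two normalization weight vectors per layer.
    norms = 2 * cfg.hidden
    return attn + mlp + norms


def total_params(cfg: DecoderConfig, n_layers: int) -> int:
    # Assumes untied input embedding and output head, plus one final norm.
    embeddings = 2 * cfg.vocab * cfg.hidden
    return n_layers * params_per_layer(cfg) + embeddings + cfg.hidden


cfg = DecoderConfig()
base = total_params(cfg, 32)
scaled = total_params(cfg, 2 * (32 - 8))
print(f"{base / 1e9:.2f}B -> {scaled / 1e9:.2f}B")  # prints: 7.24B -> 10.73B
```

Under these assumptions almost all of the growth comes from the 16 additional layers, since the embedding and output matrices are not duplicated.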
The reason for trimming layers, rather than simply stacking two full 32-layer copies into a 64-layer model, is to soften the discontinuity at the join. Where the two copies meet, the last retained layer of the first copy feeds the first retained layer of the second. If no layers were removed, the output of the original final layer (layer 32) would feed a layer expecting representations from the very start of the network (layer 1), the two most dissimilar points in the depth. By discarding the 2m middle layers, the seam instead connects layer 24 to layer 9 in SOLAR's case, which are closer in the network's processing pipeline. The authors describe this as reducing "the discrepancy at the seam," which makes it easier for the subsequent training to recover performance quickly [1]. The choice of m = 8, and thus s = 48, was driven largely by available hardware and the desire to land in the 7-billion to 13-billion-parameter band rather than by a proof of optimality [1].
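The stacking step and the seam it creates can be sketched on a plain list of layers. This is a minimal illustration rather than Upstage's implementation: `depthwise_scale` is a hypothetical helper, and the "layers" here are just 1-based integer labels standing in for Transformer blocks.

```python
from copy import deepcopy


def depthwise_scale(layers, m):
    """Return a 2 * (n - m)-layer stack built from two copies of `layers`."""
    n = len(layers)
    if not 0 <= m < n:
        raise ValueError("m must satisfy 0 <= m < n")
    first = layers[: n - m]        # original with its final m layers removed
    # Copy the second half so its entries are independent objects; with real
    # modules this lets the duplicated weights diverge during later training.
    second = deepcopy(layers[m:])  # duplicate with its initial m layers removed
    return first + second


base = list(range(1, 33))             # labels 1..32 stand in for the base layers
scaled = depthwise_scale(base, m=8)
seam = (scaled[23], scaled[24])       # last layer of copy one, first of copy two

print(len(scaled), seam)              # prints: 48 (24, 9)
print(depthwise_scale(base, m=0)[31:33])  # prints: [32, 1]
```

With m = 0 the seam joins layer 32 to layer 1, the untrimmed case the paragraph above argues against; with m = 8 it joins layer 24 to layer 9.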
Immediately after depthwise scaling the model performs worse than the base model, because the stacked weights no longer compose into a coherent forward pass at the seam. The second stage heals this by continuing to pretrain the full 48-layer model on a language-modeling objective. Upstage reports that performance drops sharply right after scaling and then recovers rapidly during continued pretraining, eventually exceeding the base model. The paper attributes the speed of this recovery to the seam-smoothing effect of trimming the middle layers, which reduces the heterogeneity the optimizer must reconcile [1]. The published work specifies the instruction-tuning and alignment data in detail but does not fully document the corpus used for this continued-pretraining stage, which is one of the reproducibility gaps noted by later readers [1].
SOLAR 10.7B is the reference model produced by DUS and the vehicle through which the method was introduced. The base model (SOLAR-10.7B-v1.0) was released under the permissive Apache 2.0 license in December 2023 [1][7]. Upstage then ran two further stages, instruction tuning followed by alignment tuning with direct preference optimization (DPO), to produce the aligned variant SOLAR-10.7B-Instruct-v1.0.
On the six-task average (often called H6) of the Hugging Face Open LLM Leaderboard, the aligned model was competitive with or better than substantially larger models. The figures below are reported by Upstage [1].
| Model | Approx. params | H6 average | ARC | HellaSwag | MMLU | TruthfulQA | Winogrande | GSM8K |
|---|---|---|---|---|---|---|---|---|
| SOLAR 10.7B-Instruct | 10.7B (dense) | 74.20 | 71.08 | 88.16 | 66.21 | 71.43 | 83.58 | 64.75 |
| Mixtral 8x7B-Instruct-v0.1 | 46.7B (MoE) | 72.62 | - | - | - | - | - | - |
| Qwen 14B | 14B (dense) | 65.86 | - | - | - | - | - | - |
| Mistral 7B-Instruct-v0.2 | 7.2B (dense) | 65.71 | - | - | - | - | - | - |
| SOLAR 10.7B (base) | 10.7B (dense) | 66.04 | 61.95 | 84.60 | 65.48 | 45.04 | 83.66 | 55.50 |
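As a consistency check on the table, the H6 column for the two SOLAR rows can be recomputed from the six per-task scores. The sketch assumes H6 is the unweighted mean of the six tasks, rounded to two decimals; only the rows with full per-task scores are included.

```python
from statistics import mean

# Per-task scores copied from the table: ARC, HellaSwag, MMLU,
# TruthfulQA, Winogrande, GSM8K.
scores = {
    "SOLAR 10.7B-Instruct": [71.08, 88.16, 66.21, 71.43, 83.58, 64.75],
    "SOLAR 10.7B (base)": [61.95, 84.60, 65.48, 45.04, 83.66, 55.50],
}

# Assumption: H6 is the simple average of the six task scores.
h6 = {name: round(mean(vals), 2) for name, vals in scores.items()}

for name, avg in h6.items():
    print(f"{name}: {avg:.2f}")
```

Both recomputed averages match the reported H6 values (74.20 and 66.04).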
The headline result is that SOLAR 10.7B-Instruct, a dense 10.7-billion-parameter model, scored higher on the H6 average than the much larger Mixtral 8x7B-Instruct mixture-of-experts model [1]. Because the output is a conventional dense checkpoint, SOLAR also became a popular base for the open-weights community to fine-tune and merge, and DUS was subsequently adopted and adapted by other groups building mid-sized models from 7-billion-parameter checkpoints.
DUS belongs to a broader family of model-growth or model-reuse techniques that initialize a larger network from a smaller trained one rather than from scratch. It is distinguished within that family chiefly by being deliberately non-function-preserving and by being applied once to a target size.
The SOLAR paper frames DUS as inspired by the depthwise component of compound model scaling from EfficientNet (Tan and Le, 2019) and positions it against mixture-of-experts scaling [1][6]. Two differences set DUS apart from function-preserving growth methods, which initialize the larger network so that it computes exactly what the smaller one did. First, DUS does not preserve the base model's function: the stacking step intentionally creates a discontinuity at the seam and relies on continued pretraining to repair it, trading an exact warm start for a much simpler construction with no identity-initialized layers or weight-mapping rules. Second, DUS grows the model a single time to a fixed target depth using a strong, fully pretrained checkpoint, rather than growing progressively during training from a small or randomly initialized model. The shared thread across this family of methods is the reuse of existing trained weights to reach a larger model more cheaply than full pretraining from scratch.
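The contrast between function-preserving growth and DUS-style stacking can be seen in a toy setting. The sketch below is purely illustrative and makes strong assumptions: each "layer" is a scalar residual map x -> x + w * tanh(x), the weights are made up, and `make_layer` and `run` are hypothetical helpers. It is not a model of a Transformer or of any published growth method.

```python
import math


def make_layer(w):
    # Toy residual layer on a scalar: x -> x + w * tanh(x).
    return lambda x: x + w * math.tanh(x)


def run(layers, x):
    for layer in layers:
        x = layer(x)
    return x


# Hypothetical "trained" weights for a 4-layer base model.
weights = [0.1, 0.2, 0.3, 0.4]
base = [make_layer(w) for w in weights]

# Function-preserving growth: append residual layers whose branch weight is
# zero, so each new layer starts out as the identity map.
identity_grown = base + [make_layer(0.0) for _ in range(2)]

# DUS-style growth with m = 1: the first copy drops its last layer and the
# second copy drops its first layer, giving 2 * (4 - 1) = 6 layers.
dus_grown = base[:3] + base[1:]

x = 0.5
y_base = run(base, x)
y_identity = run(identity_grown, x)
y_dus = run(dus_grown, x)

print(y_identity == y_base)  # prints: True
print(y_dus == y_base)       # prints: False
```

The identity-grown stack reproduces the base output exactly before any further training, while the stacked version does not, which is the gap that continued pretraining is relied on to close.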
Several caveats temper the method. The hyperparameters m and s were selected to fit available hardware and a target parameter band, and the authors explicitly state they did not establish that these values are optimal, leaving the best trim depth an open question [1]. The continued-pretraining corpus is not fully documented, which limits exact reproducibility of the base model [1]. DUS is also not free: healing the seam requires a nontrivial amount of continued pretraining, so the saving is relative to training a same-size model from scratch rather than an avoidance of large-scale training altogether. Some of SOLAR 10.7B's measured gains stem from its instruction tuning and DPO alignment data rather than from depthwise scaling alone, so leaderboard comparisons should not be read as isolating the contribution of DUS. The method as described scales only depth, leaving width and other dimensions fixed, and the published recipe demonstrates a single doubling-and-trimming step; how well DUS composes across repeated applications or much larger scale factors was not characterized in the original work [1]. Finally, because the seam is repaired statistically by training rather than constructed to preserve function, there is no guarantee that the trimmed middle layers were the least useful ones to remove, only the empirical evidence that the resulting model trains back to strong performance.