DeepNorm / DeepNet


Updated Jul 11, 2026


Overview

DeepNorm is a normalization and weight initialization scheme for Transformer networks that makes the training of very deep models stable. It was introduced by researchers at Microsoft Research in the paper "DeepNet: Scaling Transformers to 1,000 Layers," first posted to arXiv on March 1, 2022 by Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Dongdong Zhang, and Furu Wei, and later published in IEEE Transactions on Pattern Analysis and Machine Intelligence in 2024 ^[1]^[2]. The method makes two coordinated changes to the standard residual block: a constant $\alpha$ that up-scales the residual (skip) connection before layer normalization, and a constant $\beta$ that down-scales the initial values of selected weights. Using DeepNorm, the authors trained DeepNet, a 1,000-layer Transformer (2,500 attention and feed-forward sublayers), roughly one order of magnitude deeper than any stably trained Transformer that preceded it ^[1].

The central design goal is summarized by the paper's claim that DeepNorm combines "the best of two worlds, i.e., good performance of Post-LayerNorm and stable training of Pre-LayerNorm" ^[1]. Both $\alpha$ and $\beta$ are fixed constants that depend only on the depth of the network and on whether the model is an encoder, a decoder, or an encoder-decoder. DeepNorm adds no learnable parameters and requires no profiling pass over the data, and models trained with it remain stable even without the learning-rate warmup that vanilla deep Transformers usually need ^[1].

Background: Post-LN versus Pre-LN

A Transformer is a stack of residual blocks, each wrapping a sublayer $G$ (either self-attention or a feed-forward network) with a residual connection and layer normalization. Where the normalization is placed relative to the residual addition defines two well-known variants, and each makes a different tradeoff between final quality and training stability.

Post-LayerNorm (Post-LN) is the placement used in the original Transformer of Vaswani et al. (2017): the sublayer output is added to the input and the sum is then normalized, $x_{l+1} = \mathrm{LN}(x_l + G_l(x_l))$ ^[3]. Post-LN tends to reach the best final accuracy, but it becomes unstable as depth grows. The gradients passing back through the post-residual LayerNorm can grow or shrink sharply, so deep Post-LN models typically need a careful learning-rate warmup and often diverge entirely past a few dozen layers. This is a practical instance of the vanishing gradient problem and its exploding counterpart.

Pre-LayerNorm (Pre-LN), popularized by GPT-2 and analyzed by Xiong et al. (2020), instead normalizes the input to each sublayer and leaves the residual path untouched, $x_{l+1} = x_l + G_l(\mathrm{LN}(x_l))$ ^[4]. Because there is always a clean, unnormalized identity path from input to output, gradients are far better behaved, Pre-LN trains stably to great depth, and it usually needs no warmup. The cost is quality: at equal depth Pre-LN tends to underperform Post-LN, in part because the residual stream grows layer by layer while each sublayer's relative contribution shrinks, so the deepest layers add little. DeepNorm was designed to keep Post-LN's accuracy while borrowing Pre-LN's stability.

Variant	Update rule	Stability at depth	Final quality
Post-LN	$x_{l+1} = \mathrm{LN}(x_l + G_l(x_l))$	Poor, needs warmup	High
Pre-LN	$x_{l+1} = x_l + G_l(\mathrm{LN}(x_l))$	Good	Lower
DeepNorm	$x_{l+1} = \mathrm{LN}(\alpha x_l + G_l(x_l))$	Good	High
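
The three update rules in the table can be compared side by side in a few lines of plain Python. This is only a minimal sketch: `layer_norm` omits the learnable gain and bias, and `toy_sublayer` is a hypothetical stand-in for $G_l$, not real attention or a feed-forward network.

```python
import math

def layer_norm(x, eps=1e-5):
    """LayerNorm without learnable gain/bias: zero mean, unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def toy_sublayer(x):
    """Hypothetical stand-in for G_l (attention or feed-forward)."""
    return [0.5 * math.tanh(v) for v in x]

def post_ln(x, g):
    # x_{l+1} = LN(x_l + G_l(x_l))
    return layer_norm([a + b for a, b in zip(x, g(x))])

def pre_ln(x, g):
    # x_{l+1} = x_l + G_l(LN(x_l))
    return [a + b for a, b in zip(x, g(layer_norm(x)))]

def deepnorm(x, g, alpha):
    # x_{l+1} = LN(alpha * x_l + G_l(x_l))
    return layer_norm([alpha * a + b for a, b in zip(x, g(x))])

x = [1.0, -2.0, 0.5, 3.0]
print("Post-LN :", post_ln(x, toy_sublayer))
print("Pre-LN  :", pre_ln(x, toy_sublayer))
print("DeepNorm:", deepnorm(x, toy_sublayer, alpha=2.0))
```

With `alpha=1.0` the DeepNorm rule reduces to Post-LN, which is a quick sanity check for any implementation.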

How DeepNorm works

DeepNorm keeps the Post-LN placement, with normalization applied after the residual addition, but scales the residual term by a constant $\alpha$ that is greater than or equal to 1:

$$x_{l+1} = \mathrm{LN}(\alpha x_l + G_l(x_l, \theta_l))$$

Here $G_l$ is the $l$-th sublayer with parameters $\theta_l$, and $\alpha$ multiplies the residual input $x_l$ rather than the sublayer output ^[1]. Up-scaling the skip connection makes the identity path dominate at the start of training, which limits how much any single sublayer can perturb the running representation and therefore how large the early updates can be.
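
The effect of $\alpha$ on a single block can be illustrated numerically. The sketch below measures how far the block output moves away from the normalized input as $\alpha$ grows. The $\alpha$ values are arbitrary, and `toy_sublayer` and `perturbation` are hypothetical helpers introduced only for this illustration.

```python
import math

def layer_norm(x, eps=1e-5):
    """LayerNorm without learnable gain/bias."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def toy_sublayer(x):
    """Hypothetical stand-in for G_l."""
    return [0.5 * math.tanh(v) for v in x]

def perturbation(x, g, alpha):
    """Euclidean distance between LN(alpha * x + g(x)) and LN(x)."""
    with_branch = layer_norm([alpha * a + b for a, b in zip(x, g(x))])
    identity_only = layer_norm(x)
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(with_branch, identity_only)))

x = [1.0, -2.0, 0.5, 3.0]
for alpha in (1.0, 2.0, 4.0, 8.0):  # arbitrary illustrative values
    print(alpha, perturbation(x, toy_sublayer, alpha))
```

Because layer normalization is (up to its epsilon term) invariant to rescaling its input, $\mathrm{LN}(\alpha x + G(x))$ behaves like $\mathrm{LN}(x + G(x)/\alpha)$, so a larger $\alpha$ shrinks the sublayer's share of the output.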

The second ingredient acts only at initialization. After the usual Xavier (Glorot) weight initialization, DeepNorm multiplies a specific subset of weights by $\beta$, a constant smaller than 1. The down-scaled weights are the two linear projections of each feed-forward block and the value and output projections of each attention sublayer; the query and key projections, the embeddings, and the LayerNorm parameters are left unchanged ^[1]^[5]. Shrinking exactly these weights reduces the magnitude of the sublayer branch at initialization, complementing the up-scaled residual.
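
The selective down-scaling can be sketched as follows. This is a minimal stdlib-only illustration, not the reference implementation: the parameter names (`q_proj`, `v_proj`, `ffn_in`, and so on) and the helper `init_block` are assumptions chosen for readability, and $\beta$ is passed as the gain of a Xavier uniform initializer, which for a uniform distribution is equivalent to sampling first and multiplying by $\beta$ afterwards.

```python
import math
import random

# Hypothetical parameter names; real codebases may name these differently.
BETA_SCALED = {"v_proj", "out_proj", "ffn_in", "ffn_out"}

def xavier_uniform(fan_out, fan_in, gain, rng):
    """Xavier/Glorot uniform init; gain multiplies the sampling bound."""
    bound = gain * math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-bound, bound) for _ in range(fan_in)]
            for _ in range(fan_out)]

def init_block(d_model, d_ffn, beta, seed=0):
    """Initialize one attention + feed-forward block, scaling selected weights by beta."""
    rng = random.Random(seed)
    shapes = {
        "q_proj": (d_model, d_model),    # left at gain 1
        "k_proj": (d_model, d_model),    # left at gain 1
        "v_proj": (d_model, d_model),    # scaled by beta
        "out_proj": (d_model, d_model),  # scaled by beta
        "ffn_in": (d_ffn, d_model),      # scaled by beta
        "ffn_out": (d_model, d_ffn),     # scaled by beta
    }
    return {
        name: xavier_uniform(fan_out, fan_in,
                             beta if name in BETA_SCALED else 1.0, rng)
        for name, (fan_out, fan_in) in shapes.items()
    }

def max_abs(matrix):
    return max(abs(v) for row in matrix for v in row)

weights = init_block(d_model=8, d_ffn=32, beta=0.5)
for name, w in weights.items():
    print(f"{name:8s} max |w| = {max_abs(w):.4f}")
```

The query and key projections keep the ordinary Xavier range, while the four scaled matrices are sampled from a range that is narrower by the factor $\beta$.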

Both constants depend only on the architecture and on the layer counts, with $N$ the number of encoder layers and $M$ the number of decoder layers. The values derived in the paper are:

Architecture	$\alpha$	$\beta$
Encoder-only (e.g., BERT)	$(2N)^{1/4}$	$(8N)^{-1/4}$
Decoder-only (e.g., GPT)	$(2M)^{1/4}$	$(8M)^{-1/4}$
Encoder-decoder, encoder side	$0.81 (N^4 M)^{1/16}$	$0.87 (N^4 M)^{-1/16}$
Encoder-decoder, decoder side	$(3M)^{1/4}$	$(12M)^{-1/4}$

These same closed-form constants appear in Microsoft's reference implementation in the torchscale library ^[5]. Because they are fixed numbers computed once from the depth, applying DeepNorm to an existing Post-LN model is a small change: scale the residual by $\alpha$ in every block and rescale the selected weight matrices by $\beta$ at initialization. No additional parameters are introduced and nothing has to be learned or tuned per dataset.
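
Since the constants are closed-form, they can be computed with a small helper. The function below is a sketch that simply transcribes the table; its name, argument names, and return layout are assumptions made for this example and do not mirror any particular library's API.

```python
def deepnorm_constants(arch, n_enc=0, n_dec=0):
    """Return {side: (alpha, beta)} from the closed-form values in the table.

    arch is "encoder", "decoder", or "encoder-decoder"; n_enc is N and
    n_dec is M (numbers of encoder and decoder layers).
    """
    if arch == "encoder":
        if n_enc < 1:
            raise ValueError("n_enc must be a positive layer count")
        return {"encoder": ((2 * n_enc) ** 0.25, (8 * n_enc) ** -0.25)}
    if arch == "decoder":
        if n_dec < 1:
            raise ValueError("n_dec must be a positive layer count")
        return {"decoder": ((2 * n_dec) ** 0.25, (8 * n_dec) ** -0.25)}
    if arch == "encoder-decoder":
        if n_enc < 1 or n_dec < 1:
            raise ValueError("n_enc and n_dec must be positive layer counts")
        mix = (n_enc ** 4 * n_dec) ** (1 / 16)  # (N^4 M)^(1/16)
        return {
            "encoder": (0.81 * mix, 0.87 / mix),
            "decoder": ((3 * n_dec) ** 0.25, (12 * n_dec) ** -0.25),
        }
    raise ValueError(f"unknown architecture: {arch!r}")

for side, (alpha, beta) in deepnorm_constants(
        "encoder-decoder", n_enc=100, n_dec=100).items():
    print(f"{side}: alpha={alpha:.3f}, beta={beta:.3f}")
```

Both constants change slowly with depth because of the fourth-root (or sixteenth-root) exponents, so even a 1,000-layer model uses single-digit values of $\alpha$.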

Theoretical motivation

The paper grounds these particular values in an analysis of the model update, defined as the change $\lVert \Delta F \rVert$ in the network's output after a single optimization step at the very start of training. The authors argue that the instability of deep Post-LN comes from this quantity growing with depth: stacking more layers makes the first few updates increasingly large, until one of them is big enough to push the model into a region from which it cannot recover ^[1]. They derive a bound of the form

$$\lVert \Delta F \rVert \le \sum_i \frac{\sqrt{v_i^2 + w_i^2}}{\alpha} \lVert \theta_i^{\text{updated}} - \theta_i \rVert$$

which shows how the residual scaling $\alpha$ in the denominator and the down-scaled weights controlled by $\beta$ act together to suppress the update ^[1]. The constants in the table above are chosen precisely so that the expected magnitude of the model update is bounded by a value that does not depend on the number of layers. In other words, DeepNorm aims to make a deep network's early-training behavior look, to first order, like that of a shallow one, which is what allows training to proceed without warmup and to remain stable as depth is pushed into the hundreds or thousands. The analysis covers encoder-only, decoder-only, and encoder-decoder models, which is why the recommended $\alpha$ and $\beta$ differ across the three cases.
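
One way to see why these exponents cancel the depth dependence is to plug them into a simplified proxy for the bound. The sketch below makes assumptions that go beyond the text above: every sublayer is taken to have $v_i = w_i = \beta$ at initialization, and each of the sublayers is taken to contribute in proportion to $(v_i^2 + w_i^2)/\alpha^2$ under a plain gradient step. The function `update_scale` and this proxy are illustrative only and are not the paper's exact derivation.

```python
def update_scale(num_sublayers, alpha, beta):
    """Proxy for the summed bound, assuming v_i = w_i = beta in every sublayer."""
    return num_sublayers * (2 * beta ** 2) / alpha ** 2

for n in (6, 100, 1000):
    # Encoder-only: 2N sublayers, alpha = (2N)^(1/4), beta = (8N)^(-1/4).
    alpha, beta = (2 * n) ** 0.25, (8 * n) ** -0.25
    scaled = update_scale(2 * n, alpha, beta)
    unscaled = update_scale(2 * n, 1.0, 1.0)  # alpha = beta = 1, no DeepNorm scaling
    print(f"N={n}: DeepNorm proxy={scaled:.6f}, unscaled proxy={unscaled:.1f}")
```

Under these assumptions the DeepNorm proxy stays at the same value for every depth, while the unscaled proxy grows linearly with the number of layers. The same cancellation holds for the decoder side of an encoder-decoder model, with $3M$ sublayers, $\alpha = (3M)^{1/4}$, and $\beta = (12M)^{-1/4}$.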

Results: DeepNet

To demonstrate the method, the authors built DeepNet by replacing every Post-LN in a Transformer with DeepNorm and trained models of {12, 20, 100, 200, 1000} layers. Whereas vanilla Post-LN baselines diverged once the depth grew past roughly 50 layers on each side, DeepNet trained stably all the way to 1,000 layers (2,500 attention and feed-forward sublayers) without a warmup schedule ^[1]. On the bilingual WMT-17 English-to-German benchmark, a DeepNet with a 100-layer encoder and a 100-layer decoder reached 28.9 BLEU, while comparable Post-LN models failed to converge at that depth ^[1].

The headline result came on massively multilingual neural machine translation. A deep and narrow DeepNet of about 200 layers with only 3.2 billion parameters outperformed M2M-100, a strong 48-layer baseline (24-layer encoder, 24-layer decoder, 4,096 hidden size) with up to 12 billion parameters, by an average of 5 BLEU points ^[1]^[6]. The evaluation spanned 7,482 translation directions across 87 languages, and the gap shows that under DeepNorm, investing parameters in depth rather than width paid off for this task. Microsoft released the training code in the open-source torchscale library, where DeepNorm can be enabled for encoder, decoder, and encoder-decoder configurations ^[5].

Relationship to other normalization methods

DeepNorm sits in a line of work on initializing and normalizing Transformers so that depth does not destabilize training. Its most direct comparison is to the two baselines it set out to beat, Post-LN and Pre-LN, described above. Against Pre-LN it keeps the post-residual normalization that yields higher accuracy; against Post-LN it adds the $\alpha$ and $\beta$ constants that remove the depth-dependent blow-up in the early updates.

Several earlier methods attacked the same instability through initialization alone. Admin (Liu et al., 2020) controls the dependency on residual branches with per-layer scaling factors, but it estimates them from a profiling pass over data before training; DeepNorm instead uses fixed closed-form constants and needs no profiling ^[7]. ReZero initializes each residual branch with a learnable scalar set to zero, and T-Fixup removes both warmup and LayerNorm through a tailored initialization; the deep-machine-translation literature, including DLCL (Wang et al., 2019), had likewise found that careful initialization and connection design were needed to train deep encoders. DeepNorm differs from these in pairing an explicit residual up-scaling with a matching weight down-scaling, both derived from a single bound on the model update.

The same Microsoft group later generalized the idea in Foundation Transformers (the Magneto architecture, Wang et al., 2022), which introduced a Sub-LayerNorm and a DeepNorm-style initialization intended to work as a single recipe across language, vision, speech, and multimodal models ^[8]. More broadly, most current large language model decoders have settled on Pre-LN combined with RMSNorm, so DeepNorm has been most influential in the Post-LN and encoder-decoder lineage and as a theoretical statement about why depth makes Transformers unstable and how a pair of constants can fix it.

References

[1] Wang, H., Ma, S., Dong, L., Huang, S., Zhang, D., and Wei, F. "DeepNet: Scaling Transformers to 1,000 Layers." arXiv:2203.00555, March 1, 2022. https://arxiv.org/abs/2203.00555
[2] Wang, H., Ma, S., Dong, L., Huang, S., Zhang, D., and Wei, F. "DeepNet: Scaling Transformers to 1,000 Layers." IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024. DOI: 10.1109/TPAMI.2024.3386927. https://doi.org/10.1109/TPAMI.2024.3386927
[3] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. "Attention Is All You Need." NeurIPS 2017. arXiv:1706.03762. https://arxiv.org/abs/1706.03762
[4] Xiong, R., Yang, Y., He, D., Zheng, K., Zheng, S., Xing, C., Zhang, H., Lan, Y., Wang, L., and Liu, T. "On Layer Normalization in the Transformer Architecture." ICML 2020. arXiv:2002.04745. https://arxiv.org/abs/2002.04745
[5] Microsoft. "torchscale: Foundation Architecture for (M)LLMs." GitHub repository. https://github.com/microsoft/torchscale
[6] Fan, A., Bhosale, S., Schwenk, H., Ma, Z., El-Kishky, A., Goyal, S., et al. "Beyond English-Centric Multilingual Machine Translation" (M2M-100). Journal of Machine Learning Research, 2021. arXiv:2010.11125. https://arxiv.org/abs/2010.11125
[7] Liu, L., Liu, X., Gao, J., Chen, W., and Han, J. "Understanding the Difficulty of Training Transformers" (Admin). EMNLP 2020. arXiv:2004.08249. https://arxiv.org/abs/2004.08249
[8] Wang, H., Ma, S., Huang, S., Dong, L., Wang, W., Peng, Z., Wu, Y., Bajaj, P., Singhal, S., Benhaim, A., Patra, B., Liu, Z., Chaudhary, V., Song, X., and Wei, F. "Foundation Transformers" (Magneto). arXiv:2210.06423, 2022. https://arxiv.org/abs/2210.06423
