# RMSNorm

> Source: https://aiwiki.ai/wiki/rmsnorm
> Updated: 2026-06-21
> Categories: Artificial Intelligence, Model Architecture, Transformer Models
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**RMSNorm** (Root Mean Square Layer Normalization) is a feature normalization technique introduced by Biao Zhang and Rico Sennrich in 2019 that scales each activation vector by its root mean square only, dropping the mean-subtraction and bias parameter that [Layer Normalization](/wiki/layer_normalization) uses.[1] It is the default normalization layer in nearly every modern large language model, including [LLaMA](/wiki/llama), [Mistral 7B](/wiki/mistral_7b), Gemma, Qwen, and DeepSeek.[5][6] The 2019 paper hypothesized that "re-centering invariance in LayerNorm is dispensable" and showed that keeping only the re-scaling step "achieves comparable performance against LayerNorm but reduces the running time by 7%~64% on different models."[1]

RMSNorm therefore keeps the re-scaling invariance that stabilizes training and gives up re-centering invariance, a trade-off the original experiments found to cost essentially nothing in accuracy.[1] The work was published at NeurIPS 2019 in Vancouver and posted to arXiv in October 2019 (arXiv:1910.07467).[1]

For several years RMSNorm was a niche curiosity. Then it became the default normalization layer in nearly every large open-weight [Transformer](/wiki/transformer) language model. [LLaMA](/wiki/llama), LLaMA 2, and LLaMA 3 use it, as do [Mistral 7B](/wiki/mistral_7b), [Mixtral](/wiki/mixtral), Gemma, DeepSeek, Qwen, and the [PaLM](/wiki/palm) family.[5][6][7] The contemporaneous [T5](/wiki/t5) model from Google also adopted an RMSNorm-equivalent simplified LayerNorm in October 2019, the same month Zhang and Sennrich's paper appeared on arXiv.[4] By the time inference engines like vLLM, llama.cpp, and TensorRT-LLM matured, RMSNorm was treated as a fixed assumption of the architecture, with hand-written fused CUDA kernels squeezing the last few microseconds out of it.

## Quick facts

| Field | Value |
|-------|-------|
| Introduced | October 2019 (arXiv preprint) |
| Conference | NeurIPS 2019 (Vancouver) |
| Authors | Biao Zhang, Rico Sennrich |
| Affiliation | [University of Edinburgh](/wiki/university_of_edinburgh) |
| arXiv ID | 1910.07467 |
| Original tasks | WMT14 English-German MT, image-text retrieval, enwik8 LM, reading comprehension |
| Reported speedup vs LayerNorm | 7% to 64% per step |
| Built into PyTorch | `torch.nn.RMSNorm`, version 2.4 (July 2024) |

## What is RMSNorm?

The Transformer architecture uses normalization layers to keep activations from drifting in scale as signals flow through deep stacks of attention and feedforward blocks. Without normalization, training tends to diverge once depth exceeds a few dozen layers, because gradients either explode or vanish.

Three main normalizers were in widespread use before RMSNorm:

* **Batch Normalization** (Ioffe and Szegedy 2015), the original. It computes mean and variance across the batch dimension for each feature channel, which works well for vision [convolutional neural networks](/wiki/convolutional_neural_network) but fails for variable-length sequences and small batch sizes.[3] See [Batch normalization](/wiki/batch_normalization).
* **Layer Normalization** (Ba, Kiros, Hinton 2016), introduced in part to fix BatchNorm's shortcomings for recurrent networks and later adopted by the original Transformer. LayerNorm computes statistics across the feature dimension within each individual sample, so it does not depend on batch size or sequence position.[2]
* **Instance Normalization** and **Group Normalization**, used mainly in image generation and small-batch vision models.

LayerNorm became the standard in Transformers from 2017 onward.[2] It is defined as

$$\text{LayerNorm}(\mathbf{a}) = \frac{\mathbf{a} - \mu}{\sigma} \odot \boldsymbol{\gamma} + \boldsymbol{\beta},$$

where $\mu = \frac{1}{n}\sum_i a_i$, $\sigma = \sqrt{\frac{1}{n}\sum_i (a_i - \mu)^2}$, and $\boldsymbol{\gamma}, \boldsymbol{\beta} \in \mathbb{R}^n$ are learned per-feature gain and bias parameters. The two reduction passes (one for the mean, one for the variance) are the expensive part. On modern accelerators a normalization layer is bandwidth-bound, and halving the reduction work translates almost linearly into wall-clock savings.

Zhang and Sennrich's 2019 paper started from a single hypothesis: maybe the re-centering step in LayerNorm (subtracting the mean) is not actually doing useful work, and only the re-scaling step matters for stable training.[1] The paper states the bet directly: "we hypothesize that the re-centering invariance in LayerNorm is dispensable and propose root mean square layer normalization, or RMSNorm."[1] If true, you can drop the mean computation, drop the bias parameter, and get a faster norm with essentially the same behavior.

## Definition and formula

Given a vector $\mathbf{a} \in \mathbb{R}^n$ representing a single position's hidden state in a Transformer, RMSNorm is defined as

$$\bar{a}_i = \frac{a_i}{\text{RMS}(\mathbf{a})} \cdot g_i, \qquad \text{RMS}(\mathbf{a}) = \sqrt{\frac{1}{n}\sum_{j=1}^{n} a_j^2}.$$

In practice a small $\epsilon$ is added inside the square root for numerical stability,

$$\text{RMS}(\mathbf{a}) = \sqrt{\frac{1}{n}\sum_{j=1}^{n} a_j^2 + \epsilon},$$

so the denominator never collapses to zero in fp16 or bf16. The vector $\mathbf{g} \in \mathbb{R}^n$ is a learned per-feature gain, initialized to all ones; some early implementations made it optional, but modern implementations include it by default.[1]
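
As a minimal sketch of the definition above in plain Python (no framework assumed): the function name `rms_norm` and the default `eps=1e-6` are illustrative choices rather than anything fixed by the paper.

```python
import math

def rms_norm(a, g=None, eps=1e-6):
    """RMSNorm of one vector `a` (a list of floats) with optional gain `g`."""
    n = len(a)
    if g is None:
        g = [1.0] * n  # gain initialized to all ones
    # eps sits inside the square root, so the denominator is at least sqrt(eps)
    rms = math.sqrt(sum(x * x for x in a) / n + eps)
    return [x / rms * gi for x, gi in zip(a, g)]

# An all-zero vector is handled without a division by zero.
print(rms_norm([0.0, 0.0, 0.0, 0.0]))

# A nonzero vector comes out with a root mean square very close to 1.
out = rms_norm([3.0, -1.0, 4.0, -2.0])
print(math.sqrt(sum(y * y for y in out) / len(out)))
```

With $\epsilon > 0$ the output RMS is slightly below 1 rather than exactly 1; the gap shrinks as the input grows relative to $\sqrt{\epsilon}$.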

## How does RMSNorm differ from LayerNorm?

The key differences from LayerNorm are summarized below.

| Aspect | LayerNorm | RMSNorm |
|--------|-----------|---------|
| Mean subtraction | Yes ($\mu$) | No |
| Denominator | Standard deviation $\sigma$ | Root mean square $\text{RMS}(\mathbf{a})$ |
| Learned parameters | Gain $\boldsymbol{\gamma}$ + bias $\boldsymbol{\beta}$ ($2n$ values) | Gain $\mathbf{g}$ only ($n$ values) |
| Reduction passes over features | Two | One |
| Re-centering invariance | Yes | No |
| Re-scaling invariance | Yes | Yes |

### Worked example

Take $\mathbf{a} = [3.0, -1.0, 4.0, -2.0]$, with $n = 4$ and $\epsilon$ ignored.

* $\sum_j a_j^2 = 9 + 1 + 16 + 4 = 30$
* $\text{RMS}(\mathbf{a}) = \sqrt{30/4} = \sqrt{7.5} \approx 2.7386$
* $\mathbf{a} / \text{RMS}(\mathbf{a}) \approx [1.0954, -0.3651, 1.4606, -0.7303]$

With $\mathbf{g}$ initialized to all ones, that is the output. LayerNorm on the same input would first subtract the mean ($\mu = 1.0$) to get $[2, -2, 3, -3]$ and then divide by the standard deviation $\sqrt{6.5} \approx 2.5495$, giving a different output. The two layers behave the same only when the input already has zero mean.
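
The worked example can be checked with a few lines of plain Python. The helper names `rms_norm` and `layer_norm` are illustrative; the gain is taken as all ones, the LayerNorm bias as zero, and $\epsilon$ is ignored, as in the text.

```python
import math

def rms_norm(a):
    rms = math.sqrt(sum(x * x for x in a) / len(a))
    return [x / rms for x in a]

def layer_norm(a):
    mu = sum(a) / len(a)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in a) / len(a))
    return [(x - mu) / sigma for x in a]

a = [3.0, -1.0, 4.0, -2.0]
print([round(v, 4) for v in rms_norm(a)])    # matches the RMSNorm values listed above
print([round(v, 4) for v in layer_norm(a)])  # differs, because the mean is 1.0

# On an input that already has zero mean, the two layers agree.
centered = [2.0, -2.0, 3.0, -3.0]
print(all(abs(r - l) < 1e-12
          for r, l in zip(rms_norm(centered), layer_norm(centered))))
```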

## Theoretical properties

The RMSNorm paper analyzes the invariances each normalizer preserves.[1] Invariance here means: if you transform the input in a certain way, does the output of the layer change?

* **Re-scaling invariance** (input scaling). If $\mathbf{a}$ is multiplied by a positive scalar $c$, both the numerator and the denominator scale by $c$, so the output is unchanged. Both LayerNorm and RMSNorm have this property. It is the property that stabilizes activations across layers and lets gradients flow.
* **Re-centering invariance** (input shifting). If a constant $b$ is added to every element of $\mathbf{a}$, LayerNorm cancels it out by subtracting $\mu$. RMSNorm does not subtract anything, so a shift in the input changes the RMS denominator and therefore changes the output. RMSNorm is **not** re-centering invariant.
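
Both invariances can be probed numerically. The sketch below assumes unit gain, zero bias, and no $\epsilon$; the helper names and the test values (a scale of 10, a shift of 5) are arbitrary.

```python
import math

def rms_norm(a):
    rms = math.sqrt(sum(x * x for x in a) / len(a))
    return [x / rms for x in a]

def layer_norm(a):
    mu = sum(a) / len(a)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in a) / len(a))
    return [(x - mu) / sigma for x in a]

def close(u, v, tol=1e-9):
    return all(abs(x - y) < tol for x, y in zip(u, v))

a = [3.0, -1.0, 4.0, -2.0]
scaled = [10.0 * x for x in a]   # multiply by a positive scalar
shifted = [x + 5.0 for x in a]   # add a constant to every element

print("RMSNorm   unchanged by scaling:", close(rms_norm(scaled), rms_norm(a)))
print("LayerNorm unchanged by scaling:", close(layer_norm(scaled), layer_norm(a)))
print("LayerNorm unchanged by shift:  ", close(layer_norm(shifted), layer_norm(a)))
print("RMSNorm   unchanged by shift:  ", close(rms_norm(shifted), rms_norm(a)))
```

Only the last check is expected to report `False`, which is the missing re-centering invariance.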

Zhang and Sennrich's empirical claim is that re-centering invariance does not matter for training stability or final accuracy in deep Transformers.[1] Six years of large-model training have mostly confirmed this, though a 2024 follow-up paper, "Re-Introducing LayerNorm: Geometric Meaning, Irreversibility and a Comparative Study with RMSNorm" (arXiv:2409.12951), argued that the geometric distinction between the two layers is non-trivial in some regimes, particularly when activations have strong outlier features.[10]

Another often-overlooked property is **implicit learning-rate adaptation**. The paper notes that RMSNorm gives the model "re-scaling invariance property and implicit learning rate adaptation ability."[1] Both LayerNorm and RMSNorm rescale the gradient flowing back into the input by a factor inversely proportional to the input's norm, which acts like a per-vector learning rate that automatically dampens unusually large activations.[1] This is part of why pre-norm Transformers train so much more stably than unnormalized ones.
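
One way to see this inverse scaling is a finite-difference probe of the RMSNorm Jacobian. This is only an illustrative sketch under unit gain and no $\epsilon$; the helper `d_out0_d_in1` is a hypothetical name, and it inspects a single Jacobian entry rather than a full backward pass.

```python
import math

def rms_norm(a):
    rms = math.sqrt(sum(x * x for x in a) / len(a))
    return [x / rms for x in a]

def d_out0_d_in1(a, h=1e-6):
    """Central finite difference of output[0] with respect to input[1]."""
    up = list(a)
    up[1] += h
    dn = list(a)
    dn[1] -= h
    return (rms_norm(up)[0] - rms_norm(dn)[0]) / (2 * h)

a = [3.0, -1.0, 4.0, -2.0]
big = [10.0 * x for x in a]  # same direction, ten times the norm

g_small = d_out0_d_in1(a)
g_big = d_out0_d_in1(big)
# Expected to be close to 10: the sensitivity shrinks as the input norm grows.
print(g_small / g_big)
```

Because the output is unchanged when the input is scaled by $c$, the Jacobian at $c\,\mathbf{a}$ is $1/c$ times the Jacobian at $\mathbf{a}$, so gradients reaching large-norm inputs are damped by the same factor.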

RMSNorm has fewer parameters than LayerNorm. For a hidden size of $d$, LayerNorm has $2d$ parameters; RMSNorm has $d$. At the scale of a 70-billion-parameter model, dropping the bias saves on the order of a million parameters in total (for example, two norms in each of 80 layers at a hidden size of 8192 is about 1.3 million), which is negligible for the parameter count and shrinks the optimizer state only marginally.

## How much faster is RMSNorm?

The speedup comes from doing less work. LayerNorm requires two reduction passes over the feature dimension: one to compute $\mu$, then a second to compute $\sigma^2$ from $(a_i - \mu)^2$. RMSNorm needs only one reduction pass, summing $a_i^2$.[1] On a memory-bound operation, which a normalization layer almost always is on modern accelerators, halving the reduction work translates directly to wall-clock savings.

The original paper reported per-step time reductions across a range of models and hardware:[1]

| Model | Task | Reported speedup |
|-------|------|------------------|
| Bidirectional GRU | Reading comprehension (CNN/Daily Mail) | up to 64% |
| Transformer (small/big) | WMT14 English-German MT | 7-15% |
| Word-level RNN | enwik8 language modeling | around 25% |
| ConvS2S | Machine translation | around 10% |

The larger speedups tended to appear on smaller models and recurrent architectures where the norm is a bigger fraction of total work. In modern fused implementations, the absolute gap between RMSNorm and LayerNorm is smaller because the norm is a tiny slice of a Transformer's compute. It still matters at scale: a saved microsecond per layer per token, multiplied across 80 layers and trillions of training tokens, is a real cost.

A secondary practical reason RMSNorm wins on hardware is numerical: computing variance in low precision via $E[a^2] - E[a]^2$ can lose significant digits when the two terms are close, while $\frac{1}{n}\sum a_i^2$ avoids that subtraction entirely.
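
The cancellation is easy to reproduce. The sketch below uses Python's float64 with a large shared offset as a stand-in for what happens at much smaller magnitudes in fp16 or bf16; the specific values are chosen only to make the effect visible.

```python
# Three values with a large shared offset and a tiny spread.
a = [1e9 + 1.0, 1e9 + 2.0, 1e9 + 3.0]
n = len(a)
mean = sum(a) / n

# One-pass variance: E[a^2] - E[a]^2 subtracts two nearly equal numbers near 1e18.
naive_var = sum(x * x for x in a) / n - mean * mean

# Two-pass variance: subtract the mean first, then square.
two_pass_var = sum((x - mean) ** 2 for x in a) / n

print("exact variance:   ", 2.0 / 3.0)
print("E[a^2] - E[a]^2:  ", naive_var)
print("two-pass variance:", two_pass_var)
```

The mean of squares that RMSNorm needs is a sum of non-negative terms, so this particular failure mode does not arise.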

## Which models use RMSNorm?

The T5 paper from Google (Raffel et al., October 2019) was contemporaneous with Zhang and Sennrich's work, and its public Mesh-TensorFlow code used a simplified LayerNorm that, in the relevant terms, is equivalent to RMSNorm.[4] The T5 paper itself did not advertise the change, but the implementation set the pattern that later Google models followed.

| Model | Year | Normalization | Notes |
|-------|------|---------------|-------|
| [T5](/wiki/t5) | 2019 | RMSNorm-equivalent | Simplified LayerNorm without bias, in Mesh-TensorFlow code |
| [PaLM](/wiki/palm) | 2022 | RMSNorm | Google, descended from T5 conventions |
| PaLM 2 | 2023 | RMSNorm | Google |
| [LLaMA](/wiki/llama) 1 | 2023 | RMSNorm (pre-norm) | Touvron et al. |
| LLaMA 2 | 2023 | RMSNorm (pre-norm) | Same architecture as LLaMA 1 |
| LLaMA 3 | 2024 | RMSNorm (pre-norm) | Same scheme, larger model |
| [Mistral 7B](/wiki/mistral_7b) | 2023 | RMSNorm (pre-norm) | Mistral AI |
| [Mixtral](/wiki/mixtral) 8x7B | 2023 | RMSNorm (pre-norm) | Mixture of experts on Mistral base |
| Gemma 1, 2, 3 | 2024-2025 | RMSNorm (pre-norm + post-norm) | Google DeepMind |
| DeepSeek V2, V3 | 2024-2025 | RMSNorm | Pre-norm in attention and FFN blocks |
| Qwen 2, 2.5, 3 | 2024-2025 | RMSNorm | Alibaba |
| Phi-3 | 2024 | RMSNorm | Microsoft, Phi-3-medium and later |
| Mamba | 2023 | RMSNorm | Inside the Mamba block, post-SSM |
| Mamba-2 | 2024 | RMSNorm + post-gate RMSNorm | Stability addition |
| RWKV v5/v6 | 2024 | RMSNorm variants | Time-mix and channel-mix blocks |

The placement is almost always **pre-norm**, meaning the norm is applied before each attention or feedforward sub-block, with a residual added afterwards.[5] LLaMA explicitly normalizes the input of each Transformer sub-layer rather than the output, an idea its authors credit to GPT-3.[5] Pre-norm Transformers train more stably at depth than post-norm Transformers, and combining pre-norm with RMSNorm has become the default recipe for open-weight LLMs.

Not every modern model uses RMSNorm. The original GPT-2 used LayerNorm, and several follow-ups in the EleutherAI lineage ([GPT-J](/wiki/gpt_j) 6B, GPT-NeoX-20B, Pythia) stayed with LayerNorm rather than switching. GPT-3 also used LayerNorm. The specific choice in GPT-4 and later proprietary OpenAI models has not been published.

## Why do modern LLMs use RMSNorm?

Modern LLMs prefer RMSNorm for a combination of training stability and speed. In the head-to-head comparisons that have been reported, pre-norm RMSNorm trains as well as pre-norm LayerNorm while costing less per step. Because training compute is the binding constraint on frontier models, even single-digit percentage savings are worth taking by default.

Meta's decision to put RMSNorm in LLaMA was especially influential. Because LLaMA's architecture was openly published and widely cloned, its pre-norm RMSNorm recipe propagated into Mistral, Gemma, Qwen, DeepSeek, and most other open-weight families, making RMSNorm the de facto standard for the open LLM ecosystem.[5][6]

## Comparison with related normalization methods

| Method | Statistics computed | Re-centering invariant | Re-scaling invariant | Typical use |
|--------|--------------------|-----------------------|---------------------|-------------|
| [Batch normalization](/wiki/batch_normalization) | Mean and variance over batch dimension | Yes (per channel) | Yes | CNNs, vision |
| Layer normalization | Mean and variance over feature dimension | Yes | Yes | Original Transformers, RNNs |
| Group Normalization | Mean and variance per group of channels | Yes (per group) | Yes | CNNs at small batch sizes |
| Instance Normalization | Mean and variance per channel per sample | Yes (per channel) | Yes | Style transfer, GANs |
| RMSNorm | RMS over feature dimension | No | Yes | Modern LLMs |
| ScaleNorm | Single scalar L2 norm | No | Yes | Niche, some translation models |
| DeepNorm | LayerNorm + residual scaling factor | Yes | Yes | Very deep Transformers (1000+ layers) |

RMSNorm is the only widely used method in this list that gives up re-centering invariance, and it is also the cheapest. The success of LLMs trained with it is the main empirical evidence that, at least for autoregressive language modeling, that property is not load-bearing.

## Variants

### Partial RMSNorm (pRMSNorm)

The original paper also introduced a partial variant that estimates the RMS from the first $p\%$ of the feature dimensions instead of all of them:[1]

$$\overline{\text{RMS}}(\mathbf{a}) = \sqrt{\frac{1}{k}\sum_{j=1}^{k} a_j^2}, \qquad k = \lceil n \cdot p / 100 \rceil.$$

This trades a small amount of accuracy for additional compute savings. The paper showed pRMSNorm with $p = 6.25\%$ gives near-identical results to full RMSNorm on machine translation.[1] In practice pRMSNorm did not catch on, partly because fused kernels make full RMSNorm fast enough that the extra approximation is not worth the complexity.
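
A minimal sketch of the partial estimate, assuming unit gain and an $\epsilon$ added for safety (the paper's formula above has none). The names `partial_k` and `partial_rms_norm` are illustrative and are not taken from the authors' code.

```python
import math

def partial_k(n, p):
    """Number of leading features used for the estimate: ceil(n * p / 100)."""
    return math.ceil(n * p / 100)

def partial_rms_norm(a, p=6.25, eps=1e-6):
    """Normalize all of `a` by an RMS estimated from its first p% of entries."""
    k = partial_k(len(a), p)
    partial_rms = math.sqrt(sum(x * x for x in a[:k]) / k + eps)
    return [x / partial_rms for x in a]  # gain omitted (treated as all ones)

a = [float(i + 1) for i in range(32)]
print(partial_k(len(a), 6.25))   # 2 of the 32 features feed the estimate
print(partial_rms_norm(a)[:3])
```

Every feature is still rescaled; only the statistic is computed from a prefix, so the estimate is only as good as the assumption that the leading features are representative of the rest.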

### Pre-norm vs post-norm placement

A second axis of variation is where the norm sits inside a residual block. The original Transformer paper put LayerNorm after the residual addition (post-norm). Pre-norm, where the norm is applied to the input of each sub-block before attention or FFN, became dominant after roughly 2019 because it trains more stably at depth.

RMSNorm inherits this distinction. The default in essentially every modern LLM is pre-norm RMSNorm. Some Gemma generations add an extra post-norm RMSNorm on top of pre-norm; that acts more like an output scaling than a true post-norm.[9] DeepNorm (Microsoft 2022) is a separate approach that uses LayerNorm with a scaling factor on the residual connection to enable training of 1000-layer Transformers.[15]
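
The two placements differ only in where the norm sits relative to the residual addition. The sketch below uses plain Python lists; `toy_sublayer` is a hypothetical stand-in for attention or a feedforward block, and unit gain is assumed.

```python
import math

def rms_norm(x, eps=1e-6):
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def pre_norm_block(x, sublayer):
    # The norm feeds the sub-block; the raw input travels on the residual path.
    return [xi + yi for xi, yi in zip(x, sublayer(rms_norm(x)))]

def post_norm_block(x, sublayer):
    # The norm is applied after the residual addition.
    return rms_norm([xi + yi for xi, yi in zip(x, sublayer(x))])

def toy_sublayer(h):
    # Hypothetical stand-in for attention or an FFN.
    return [0.5 * v for v in h]

x = [30.0, -10.0, 40.0, -20.0]
print(pre_norm_block(x, toy_sublayer))   # residual stream keeps the input's scale
print(post_norm_block(x, toy_sublayer))  # output is renormalized to RMS close to 1
```

In the pre-norm form the residual stream is never renormalized, which leaves an unobstructed identity path from the first layer to the last.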

### Other related variants

ScaleNorm (Nguyen and Salazar 2019) divides the activation vector by its L2 norm and multiplies the result by a single learned scalar, in place of a per-feature gain vector. ScaleNorm is even cheaper than RMSNorm but gives up the per-feature gain.[14] It was used in some smaller machine-translation models but never became mainstream.
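
A minimal sketch of the relationship between the two layers, assuming a single scalar gain `g` for ScaleNorm and unit gain for RMSNorm. The function names and the clamp on the norm are illustrative choices. Since $\text{RMS}(\mathbf{a}) = \lVert\mathbf{a}\rVert_2 / \sqrt{n}$, setting $g = \sqrt{n}$ makes the two outputs coincide.

```python
import math

def scale_norm(x, g, eps=1e-6):
    """Divide by the L2 norm, then multiply by one scalar gain g."""
    l2 = math.sqrt(sum(v * v for v in x))
    return [g * v / max(l2, eps) for v in x]

def rms_norm(x):
    rms = math.sqrt(sum(v * v for v in x) / len(x))
    return [v / rms for v in x]

a = [3.0, -1.0, 4.0, -2.0]
s = scale_norm(a, g=math.sqrt(len(a)))
r = rms_norm(a)
print(all(abs(u - v) < 1e-12 for u, v in zip(s, r)))
```

What RMSNorm adds on top of this is the learned per-feature gain, which lets individual channels be rescaled independently.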

Nvidia's nGPT (2024) takes the idea further by normalizing every weight matrix and embedding to lie on the unit hypersphere, removing normalization layers entirely.

## Implementation

A minimal [PyTorch](/wiki/pytorch) reference implementation looks like this:

```python
import torch
import torch.nn as nn

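class RMSNorm(nn.Module):
    """Root mean square layer normalization over the last dimension."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        # Learned per-feature gain, initialized to ones
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., dim)
        # RMS over the feature axis, with eps inside the square root
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).sqrt()
        return x / rms * self.weight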
```

A few details that matter in practice:

* The norm is computed across the last dimension only, which is the feature axis. Each position along the batch and sequence dimensions is normalized independently of every other position.
* In mixed precision, the standard pattern is to upcast $x$ to fp32 for the RMS computation, then cast back to bf16 or fp16 before multiplying by the weight. This avoids underflow when $x_i^2$ is small and overflow when it is large.
* `self.weight` is initialized to ones, not zeros. The norm is the identity at init.
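
The fp32 upcast described in the second bullet can be sketched as a variant of the reference module. The class name `RMSNormFP32` is hypothetical, and this is one plausible arrangement of the casts under the assumptions stated in the comments, not a copy of any particular library's code.

```python
import torch
import torch.nn as nn

class RMSNormFP32(nn.Module):
    """RMSNorm that computes its statistics in fp32 regardless of input dtype."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        in_dtype = x.dtype
        x32 = x.float()  # upcast so that squaring cannot overflow fp16
        inv_rms = torch.rsqrt(x32.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        normed = (x32 * inv_rms).to(in_dtype)  # cast back before applying the gain
        # Assumption: the gain is applied in the input dtype.
        return normed * self.weight.to(in_dtype)

norm = RMSNormFP32(8)
x = torch.full((2, 8), 300.0, dtype=torch.float16)
y = norm(x)
print(y.dtype, torch.isfinite(y).all().item())
```

The input value 300 is chosen because its square, 90000, exceeds the largest finite fp16 value (65504), so squaring in fp16 would overflow while the fp32 path stays finite.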

### Is RMSNorm built into PyTorch?

Yes. PyTorch added a built-in `torch.nn.RMSNorm` module and a functional `torch.nn.functional.rms_norm` in version 2.4 (released July 2024, PR #121364).[11][12] The API mirrors `torch.nn.LayerNorm`:

```python
import torch.nn as nn
norm = nn.RMSNorm(normalized_shape=4096, eps=1e-6)
```

The `normalized_shape` argument can be a single integer or a tuple, matching LayerNorm's behavior. If `eps` is not specified, PyTorch uses the machine epsilon of the computation type.[11]

### Hugging Face Transformers

The [Hugging Face](/wiki/hugging_face) Transformers library ships an `LlamaRMSNorm` class in `transformers.models.llama.modeling_llama` that follows the LLaMA reference implementation, with explicit fp32 upcast for the RMS computation. Most other RMSNorm-using models in the library either reuse `LlamaRMSNorm` directly or define near-identical classes (`MistralRMSNorm`, `GemmaRMSNorm`, `Qwen2RMSNorm`, and so on) that differ mainly in dtype handling and in how the gain is parameterized (`GemmaRMSNorm`, for example, multiplies by one plus the stored weight).

### Fused kernels

For inference and training at scale, almost everyone uses fused kernels rather than the eager-mode reference. Apex's `FusedRMSNorm`, the [Triton](/wiki/triton)-authored kernels in vLLM, the hand-written CUDA kernels in TensorRT-LLM, and the GGML implementation in llama.cpp all do the same thing: load $x$ once, compute the sum of squares, the inverse RMS, and the output multiply in a single pass with no intermediate materialization. On Apple silicon, MLX provides `mx.fast.rms_norm` for the same reason. The unfused PyTorch reference can be 2 to 3 times slower than fused kernels in inference workloads where memory bandwidth is the bottleneck.

## Pitfalls and caveats

For most users, RMSNorm is a drop-in replacement for LayerNorm and there are no surprises. A few practical issues do come up at scale.

* **Numerical fragility in fp16.** Squaring activations in fp16 can overflow when the hidden state has large magnitudes, particularly in long-context generation or shortly after initialization. The standard fix is to compute the RMS in fp32 even when the rest of the model runs in fp16 or bf16. Most production frameworks do this automatically, but custom kernels that skip the upcast have caused divergence in real training runs.
* **Outlier features and quantization.** When quantizing an LLM to int8 or int4, the activation distribution after RMSNorm matters. The lack of mean subtraction means some channels can carry persistent offsets, which complicates per-tensor quantization schemes. Methods like SmoothQuant and AWQ exist in part to deal with the resulting dynamic-range issues.
* **No clean theoretical justification.** The original paper made an empirical case, and follow-up work has not produced a clean theoretical explanation for why dropping mean centering is fine in practice. The 2024 "Re-Introducing LayerNorm" paper argued that the geometric distinction between RMSNorm and LayerNorm is real and shows up in some regimes, leaving open questions for very deep models or non-language modalities.[10]
* **Post-norm RMSNorm is unstable for deep models.** Like LayerNorm, RMSNorm placed after the residual add does not train well past about 12 layers without other modifications. Pre-norm placement, as in LLaMA, is the safe default.[5] Some Gemma generations add a second RMSNorm on the output of each block on top of pre-norm; that is closer to an output scaling than a true post-norm.[9]
* **Gradient through the RMS denominator.** The backward pass through $1/\text{RMS}(\mathbf{a})$ involves a division by a square root, which is numerically delicate when the RMS is close to zero. The $\epsilon$ inside the square root prevents catastrophic blow-up, but choosing $\epsilon$ too small in mixed precision can still cause silent NaNs. Common values are $10^{-5}$ for fp16 and $10^{-6}$ for bf16/fp32.

In day-to-day work the entire interface is `nn.RMSNorm(hidden_size)`, and you do not need to think about any of this.

## References

1. Biao Zhang and Rico Sennrich. "Root Mean Square Layer Normalization." Advances in Neural Information Processing Systems 32 (NeurIPS 2019). arXiv:1910.07467, October 2019. [https://arxiv.org/abs/1910.07467](https://arxiv.org/abs/1910.07467)
2. Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton. "Layer Normalization." arXiv:1607.06450, 2016. [https://arxiv.org/abs/1607.06450](https://arxiv.org/abs/1607.06450)
3. Sergey Ioffe and Christian Szegedy. "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift." ICML 2015. arXiv:1502.03167. [https://arxiv.org/abs/1502.03167](https://arxiv.org/abs/1502.03167)
4. Colin Raffel, Noam Shazeer, Adam Roberts, et al. "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer" (T5). arXiv:1910.10683, October 2019. [https://arxiv.org/abs/1910.10683](https://arxiv.org/abs/1910.10683)
5. Hugo Touvron, Thibaut Lavril, Gautier Izacard, et al. "LLaMA: Open and Efficient Foundation Language Models." arXiv:2302.13971, February 2023. [https://arxiv.org/abs/2302.13971](https://arxiv.org/abs/2302.13971)
6. Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, et al. "Mistral 7B." arXiv:2310.06825, October 2023. [https://arxiv.org/abs/2310.06825](https://arxiv.org/abs/2310.06825)
7. Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, et al. "PaLM: Scaling Language Modeling with Pathways." arXiv:2204.02311, April 2022. [https://arxiv.org/abs/2204.02311](https://arxiv.org/abs/2204.02311)
8. Albert Gu, Tri Dao. "Mamba: Linear-Time Sequence Modeling with Selective State Spaces." arXiv:2312.00752, December 2023. [https://arxiv.org/abs/2312.00752](https://arxiv.org/abs/2312.00752)
9. Gemma Team, Google DeepMind. "Gemma: Open Models Based on Gemini Research and Technology." arXiv:2403.08295, March 2024. [https://arxiv.org/abs/2403.08295](https://arxiv.org/abs/2403.08295)
10. Akshat Gupta, Atahan Ozdemir, Gopala Anumanchipalli. "Re-Introducing LayerNorm: Geometric Meaning, Irreversibility and a Comparative Study with RMSNorm." arXiv:2409.12951, 2024. [https://arxiv.org/abs/2409.12951](https://arxiv.org/abs/2409.12951)
11. PyTorch documentation, `torch.nn.RMSNorm`. [https://docs.pytorch.org/docs/stable/generated/torch.nn.RMSNorm.html](https://docs.pytorch.org/docs/stable/generated/torch.nn.RMSNorm.html)
12. PyTorch GitHub PR #121364, "Add RMSNorm module," merged for the 2.4 release in March 2024. [https://github.com/pytorch/pytorch/pull/121364](https://github.com/pytorch/pytorch/pull/121364)
13. Reference RMSNorm implementation by Zhang and Sennrich. [https://github.com/bzhangGo/rmsnorm](https://github.com/bzhangGo/rmsnorm)
14. Toan Q. Nguyen, Julian Salazar. "Transformers without Tears: Improving the Normalization of Self-Attention" (ScaleNorm). arXiv:1910.05895, 2019. [https://arxiv.org/abs/1910.05895](https://arxiv.org/abs/1910.05895)
15. Hongyu Wang et al. "DeepNet: Scaling Transformers to 1,000 Layers" (DeepNorm). arXiv:2203.00555, 2022. [https://arxiv.org/abs/2203.00555](https://arxiv.org/abs/2203.00555)

