RMSNorm

RMSNorm (Root Mean Square Layer Normalization) is a feature normalization technique introduced by Biao Zhang and Rico Sennrich in 2019 as a simplified, faster alternative to Layer Normalization. Instead of subtracting the mean and dividing by the standard deviation, it scales each activation vector by its root mean square only, dropping both the mean computation and the bias parameter that LayerNorm relies on. The method preserves the re-scaling invariance that stabilizes training while giving up the re-centering invariance, and the original paper showed this trade-off costs essentially nothing on accuracy while running between 7% and 64% faster per training step depending on model and hardware.

For several years RMSNorm was a niche curiosity. Then it became the default normalization layer in nearly every large open-weight Transformer language model. LLaMA, LLaMA 2, and LLaMA 3 use it, as do Mistral 7B, Mixtral, Gemma, DeepSeek, Qwen, and the PaLM family. The contemporaneous T5 model from Google also adopted an RMSNorm-equivalent simplified LayerNorm in October 2019, the same month Zhang and Sennrich's paper appeared on arXiv. By the time inference engines like vLLM, llama.cpp, and TensorRT-LLM matured, RMSNorm was treated as a fixed assumption of the architecture, with hand-written fused CUDA kernels squeezing the last few microseconds out of it.

Quick facts

Field	Value
Introduced	October 2019 (arXiv preprint)
Conference	NeurIPS 2019
Authors	Biao Zhang, Rico Sennrich
Affiliation	University of Edinburgh
arXiv ID	1910.07467
Original tasks	WMT14 English-German MT, image-text retrieval, enwik8 LM, reading comprehension
Reported speedup vs LayerNorm	7% to 64% per step
Built into PyTorch	`torch.nn.RMSNorm`, version 2.4 (July 2024)

Background

The Transformer architecture uses normalization layers to keep activations from drifting in scale as signals flow through deep stacks of attention and feedforward blocks. Without normalization, training tends to diverge once depth exceeds a few dozen layers, because gradients either explode or vanish.

Three main normalizers were in widespread use before RMSNorm:

Batch Normalization (Ioffe and Szegedy 2015), the original. It computes mean and variance across the batch dimension for each feature channel, which works well for vision convolutional neural networks but fails for variable-length sequences and small batch sizes. See Batch normalization.
Layer Normalization (Ba, Kiros, Hinton 2016), introduced in part to fix BatchNorm's shortcomings for recurrent networks and later adopted by the original Transformer. LayerNorm computes statistics across the feature dimension within each individual sample, so it does not depend on batch size or sequence position.
Instance Normalization and Group Normalization, used mainly in image generation and small-batch vision models.

LayerNorm became the standard in Transformers from 2017 onward. It is defined as

$$\text{LayerNorm}(\mathbf{a}) = \frac{\mathbf{a} - \mu}{\sigma} \odot \boldsymbol{\gamma} + \boldsymbol{\beta},$$

where $\mu = \frac{1}{n}\sum_i a_i$, $\sigma = \sqrt{\frac{1}{n}\sum_i (a_i - \mu)^2}$, and $\boldsymbol{\gamma}, \boldsymbol{\beta} \in \mathbb{R}^n$ are learned per-feature gain and bias parameters. The two reduction passes (one for the mean, one for the variance) are the expensive part. On modern accelerators a normalization layer is bandwidth-bound, and halving the reduction work translates almost linearly into wall-clock savings.

Zhang and Sennrich's 2019 paper started from a single hypothesis: maybe the re-centering step in LayerNorm (subtracting the mean) is not actually doing useful work, and only the re-scaling step matters for stable training. If true, you can drop the mean computation, drop the bias parameter, and get a faster norm with essentially the same behavior.

Definition and formula

Given a vector $\mathbf{a} \in \mathbb{R}^n$ representing a single position's hidden state in a Transformer, RMSNorm is defined as

$$\bar{a}_i = \frac{a_i}{\text{RMS}(\mathbf{a})} \cdot g_i, \qquad \text{RMS}(\mathbf{a}) = \sqrt{\frac{1}{n}\sum_{j=1}^{n} a_j^2}.$$

In practice a small $\epsilon$ is added inside the square root for numerical stability,

$$\text{RMS}(\mathbf{a}) = \sqrt{\frac{1}{n}\sum_{j=1}^{n} a_j^2 + \epsilon},$$

so the denominator never collapses to zero in fp16 or bf16. The vector $\mathbf{g} \in \mathbb{R}^n$ is a learned per-feature gain, initialized to all ones; some early implementations made it optional, but modern implementations include it by default.

The key differences from LayerNorm are summarized below.

Aspect	LayerNorm	RMSNorm
Mean subtraction	Yes ($\mu$)	No
Denominator	Standard deviation $\sigma$	Root mean square $\text{RMS}(\mathbf{a})$
Learned parameters	Gain $\boldsymbol{\gamma}$ + bias $\boldsymbol{\beta}$ ($2n$ values)	Gain $\mathbf{g}$ only ($n$ values)
Reduction passes over features	Two	One
Re-centering invariance	Yes	No
Re-scaling invariance	Yes	Yes

Worked example

Take $\mathbf{a} = [3.0, -1.0, 4.0, -2.0]$, with $n = 4$ and $\epsilon$ ignored.

$\sum_j a_j^2 = 9 + 1 + 16 + 4 = 30$
$\text{RMS}(\mathbf{a}) = \sqrt{30/4} = \sqrt{7.5} \approx 2.7386$
$\mathbf{a} / \text{RMS}(\mathbf{a}) \approx [1.0954, -0.3651, 1.4606, -0.7303]$

With $\mathbf{g}$ initialized to all ones, that is the output. LayerNorm on the same input would first subtract the mean ($\mu = 1.0$) to get $[2, -2, 3, -3]$ and then divide by the standard deviation $\sqrt{6.5} \approx 2.5495$, giving $[0.7845, -0.7845, 1.1767, -1.1767]$. The two layers behave the same only when the input already has zero mean.
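The arithmetic in this example can be checked with a few lines of plain Python. This is a minimal sketch rather than a reference implementation: the helper names `rms_norm` and `layer_norm` are illustrative, the gains are fixed at one, the LayerNorm bias is fixed at zero, and $\epsilon$ defaults to zero to match the example.

```python
import math

def rms_norm(a, eps=0.0):
    """Divide each element by the root mean square of the vector (unit gain)."""
    rms = math.sqrt(sum(x * x for x in a) / len(a) + eps)
    return [x / rms for x in a]

def layer_norm(a, eps=0.0):
    """Subtract the mean, then divide by the standard deviation (unit gain, zero bias)."""
    mu = sum(a) / len(a)
    var = sum((x - mu) ** 2 for x in a) / len(a)
    return [(x - mu) / math.sqrt(var + eps) for x in a]

a = [3.0, -1.0, 4.0, -2.0]
print([round(y, 4) for y in rms_norm(a)])    # [1.0954, -0.3651, 1.4606, -0.7303]
print([round(y, 4) for y in layer_norm(a)])  # [0.7845, -0.7845, 1.1767, -1.1767]
```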

Theoretical properties

The RMSNorm paper analyzes the invariances each normalizer preserves. Invariance here means: if you transform the input in a certain way, does the output of the layer change?

Re-scaling invariance (input scaling). If $\mathbf{a}$ is multiplied by a positive scalar $c$, both the numerator and the denominator scale by $c$, so the output is unchanged. Both LayerNorm and RMSNorm have this property. It is the property that stabilizes activations across layers and lets gradients flow.
Re-centering invariance (input shifting). If a constant $b$ is added to every element of $\mathbf{a}$, LayerNorm cancels it out by subtracting $\mu$. RMSNorm does not subtract anything, so a shift in the input changes the RMS denominator and therefore changes the output. RMSNorm is not re-centering invariant.
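Both invariances are easy to probe numerically. The sketch below uses illustrative pure-Python helpers (unit gain, no $\epsilon$) and arbitrary choices of $c = 5$ and $b = 10$; it is a sanity check of the two properties, not a proof.

```python
import math

def rms_norm(a):
    rms = math.sqrt(sum(x * x for x in a) / len(a))
    return [x / rms for x in a]

def layer_norm(a):
    mu = sum(a) / len(a)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in a) / len(a))
    return [(x - mu) / sigma for x in a]

def close(u, v, tol=1e-9):
    """Element-wise comparison with a small absolute tolerance."""
    return all(abs(x - y) <= tol for x, y in zip(u, v))

a = [3.0, -1.0, 4.0, -2.0]
scaled = [5.0 * x for x in a]     # multiply by c = 5
shifted = [x + 10.0 for x in a]   # add b = 10 to every element

print(close(rms_norm(scaled), rms_norm(a)))       # True: re-scaling invariant
print(close(layer_norm(scaled), layer_norm(a)))   # True: re-scaling invariant
print(close(layer_norm(shifted), layer_norm(a)))  # True: re-centering invariant
print(close(rms_norm(shifted), rms_norm(a)))      # False: the shift changes the RMS
```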

Zhang and Sennrich's empirical claim is that re-centering invariance does not matter for training stability or final accuracy in deep Transformers. Six years of large-model training have mostly confirmed this, though a 2024 follow-up paper, "Re-Introducing LayerNorm: Geometric Meaning, Irreversibility and a Comparative Study with RMSNorm" (arXiv:2409.12951), argued that the geometric distinction between the two layers is non-trivial in some regimes, particularly when activations have strong outlier features.

Another often-overlooked property is implicit learning-rate adaptation. Both LayerNorm and RMSNorm rescale the gradient flowing back into the input by a factor inversely proportional to the input's norm, which acts like a per-vector learning rate that automatically dampens unusually large activations. This is part of why pre-norm Transformers train so much more stably than unnormalized ones.
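The inverse dependence on input norm can be seen with a finite-difference check. The sketch below assumes a hypothetical scalar loss (a fixed linear readout `w` of the normalized vector) and unit gain; scaling the input by 10 should shrink the gradient with respect to the input by the same factor.

```python
import math

def rms_norm(a):
    rms = math.sqrt(sum(x * x for x in a) / len(a))
    return [x / rms for x in a]

def loss(a, w):
    # Hypothetical scalar loss: a fixed linear readout of the normalized vector.
    return sum(wi * yi for wi, yi in zip(w, rms_norm(a)))

def grad_norm(a, w, h=1e-6):
    """L2 norm of d(loss)/d(a), estimated with central finite differences."""
    grads = []
    for i in range(len(a)):
        up = a[:i] + [a[i] + h] + a[i + 1:]
        dn = a[:i] + [a[i] - h] + a[i + 1:]
        grads.append((loss(up, w) - loss(dn, w)) / (2 * h))
    return math.sqrt(sum(g * g for g in grads))

a = [3.0, -1.0, 4.0, -2.0]
w = [0.5, -1.0, 2.0, 0.25]   # arbitrary readout weights

g1 = grad_norm(a, w)
g10 = grad_norm([10.0 * x for x in a], w)
print(round(g1 / g10, 3))  # 10.0: a 10x larger input receives a 10x smaller gradient
```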

RMSNorm has fewer parameters than LayerNorm. For a hidden size of $d$, each LayerNorm has $2d$ parameters; each RMSNorm has $d$. At the scale of a 70-billion-parameter model, dropping the bias saves on the order of a million parameters in total (assuming, for illustration, about 160 norm layers at a hidden size of 8192, roughly 1.3 million), which is irrelevant for parameter count but does slightly shrink the optimizer state.

Computational cost

The speedup comes from doing less work. LayerNorm requires two reduction passes over the feature dimension: one to compute $\mu$, then a second to compute $\sigma^2$ from $(a_i - \mu)^2$. RMSNorm needs only one reduction pass, summing $a_i^2$. On a memory-bound operation, which a normalization layer almost always is on modern accelerators, halving the reduction work translates directly to wall-clock savings.

The original paper reported per-step time reductions across a range of models and hardware:

Model	Task	Reported speedup
Bidirectional GRU	Reading comprehension (CNN/Daily Mail)	up to 64%
Transformer (small/big)	WMT14 English-German MT	7-15%
Word-level RNN	enwik8 language modeling	around 25%
ConvS2S	Machine translation	around 10%

The larger speedups tended to appear on smaller models and recurrent architectures where the norm is a bigger fraction of total work. In modern fused implementations, the absolute gap between RMSNorm and LayerNorm is smaller because the norm is a tiny slice of a Transformer's compute. It still matters at scale: a saved microsecond per layer per token, multiplied across 80 layers and trillions of training tokens, is a real cost.

A secondary practical reason RMSNorm wins on hardware is numerical: computing variance in low precision via $E[a^2] - E[a]^2$ can lose significant digits when the two terms are close, while $\frac{1}{n}\sum a_i^2$ avoids that subtraction entirely.
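The cancellation problem can be reproduced even in double precision by exaggerating the offset. The sketch below is an illustration under that assumption (Python floats are fp64, and the inputs are chosen so that the mean dwarfs the spread); in fp16 or bf16 the same effect appears at far smaller magnitudes. The function names are illustrative.

```python
def variance_two_pass(a):
    """Variance from centered values: one pass for the mean, one for the squares."""
    mu = sum(a) / len(a)
    return sum((x - mu) ** 2 for x in a) / len(a)

def variance_one_pass(a):
    """Variance as E[a^2] - E[a]^2, which subtracts two nearly equal numbers."""
    n = len(a)
    mean = sum(a) / n
    mean_sq = sum(x * x for x in a) / n
    return mean_sq - mean * mean

def mean_square(a):
    """The quantity RMSNorm needs: a sum of non-negative terms, no subtraction."""
    return sum(x * x for x in a) / len(a)

a = [1e9 + 1.0, 1e9 + 2.0, 1e9 + 3.0]  # true variance is 2/3

print(variance_two_pass(a))  # close to 0.6667
print(variance_one_pass(a))  # dominated by rounding error, nowhere near 0.6667
print(mean_square(a))        # about 1e18, accurate to full relative precision
```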

Adoption in modern language models

The T5 paper from Google (Raffel et al., October 2019) was contemporaneous with Zhang and Sennrich's work, and its public Mesh-TensorFlow code used a simplified LayerNorm that, in the relevant terms, is equivalent to RMSNorm. The T5 paper itself did not advertise the change, but the implementation set the pattern that later Google models followed.

Model	Year	Normalization	Notes
T5	2019	RMSNorm-equivalent	Simplified LayerNorm without bias, in Mesh-TensorFlow code
PaLM	2022	RMSNorm	Google, descended from T5 conventions
PaLM 2	2023	RMSNorm	Google
LLaMA 1	2023	RMSNorm (pre-norm)	Touvron et al.
LLaMA 2	2023	RMSNorm (pre-norm)	Same architecture as LLaMA 1
LLaMA 3	2024	RMSNorm (pre-norm)	Same scheme, larger model
Mistral 7B	2023	RMSNorm (pre-norm)	Mistral AI
Mixtral 8x7B	2023	RMSNorm (pre-norm)	Mixture of experts on Mistral base
Gemma 1, 2, 3	2024-2025	RMSNorm (pre-norm + post-norm)	Google DeepMind
DeepSeek V2, V3	2024-2025	RMSNorm	Pre-norm in attention and FFN blocks
Qwen 2, 2.5, 3	2024-2025	RMSNorm	Alibaba
Phi-3	2024	RMSNorm	Microsoft, Phi-3-medium and later
Mamba	2023	RMSNorm	Inside the Mamba block, post-SSM
Mamba-2	2024	RMSNorm + post-gate RMSNorm	Stability addition
RWKV v5/v6	2024	RMSNorm variants	Time-mix and channel-mix blocks

The placement is almost always pre-norm, meaning the norm is applied before each attention or feedforward sub-block, with a residual added afterwards. Pre-norm Transformers train more stably at depth than post-norm Transformers, and combining pre-norm with RMSNorm has become the default recipe for open-weight LLMs.

Not every modern model uses RMSNorm. The original GPT-2 used LayerNorm, and several follow-ups in the EleutherAI lineage (GPT-J 6B, GPT-NeoX-20B, Pythia) stayed with LayerNorm rather than switching. GPT-3 also used LayerNorm. The specific choice in GPT-4 and later proprietary OpenAI models has not been published.

Modern LLMs prefer RMSNorm for a combination of training stability and speed. In the head-to-head comparisons that have been published, pre-norm RMSNorm trains as well as pre-norm LayerNorm, and it costs less per step. With training compute being the binding constraint on frontier models, even single-digit percentage savings are worth taking by default.

Comparison with related normalization methods

Method	Statistics computed	Re-centering invariant	Re-scaling invariant	Typical use
Batch normalization	Mean and variance over batch dimension	Yes (per channel)	Yes	CNNs, vision
Layer normalization	Mean and variance over feature dimension	Yes	Yes	Original Transformers, RNNs
Group Normalization	Mean and variance per group of channels	Yes (per group)	Yes	CNNs at small batch sizes
Instance Normalization	Mean and variance per channel per sample	Yes (per channel)	Yes	Style transfer, GANs
RMSNorm	RMS over feature dimension	No	Yes	Modern LLMs
ScaleNorm	Single scalar L2 norm	No	Yes	Niche, some translation models
DeepNorm	LayerNorm + residual scaling factor	Yes	Yes	Very deep Transformers (1000+ layers)

RMSNorm is the only widely used method in this list that gives up re-centering invariance, and it is also the cheapest. The success of LLMs trained with it is the main empirical evidence that, at least for autoregressive language modeling, that property is not load-bearing.

Variants

Partial RMSNorm (pRMSNorm)

The original paper also introduced a partial variant that estimates the RMS from the first $p\%$ of the feature dimensions instead of all of them:

$$\overline{\text{RMS}}(\mathbf{a}) = \sqrt{\frac{1}{k}\sum_{j=1}^{k} a_j^2}, \qquad k = \lceil n \cdot p / 100 \rceil.$$

This trades a small amount of accuracy for additional compute savings. The paper showed pRMSNorm with $p = 6.25\%$ gives near-identical results to full RMSNorm on machine translation. In practice pRMSNorm did not catch on, partly because fused kernels make full RMSNorm fast enough that the extra approximation is not worth the complexity.
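The partial estimate is a one-line change to the full computation. The sketch below is an illustrative pure-Python version with unit gain and an optional $\epsilon$; the function name `partial_rms_norm` is an assumption for this example, not the paper's code.

```python
import math

def partial_rms_norm(a, p, eps=0.0):
    """Normalize all of `a` by the RMS estimated from its first p percent of features."""
    n = len(a)
    k = math.ceil(n * p / 100)  # number of leading features used for the estimate
    rms = math.sqrt(sum(x * x for x in a[:k]) / k + eps)
    return [x / rms for x in a]

a = [3.0, -1.0, 4.0, -2.0]
print([round(y, 4) for y in partial_rms_norm(a, 50)])   # [1.3416, -0.4472, 1.7889, -0.8944]
print([round(y, 4) for y in partial_rms_norm(a, 100)])  # [1.0954, -0.3651, 1.4606, -0.7303]
```

With $p = 50$ only the first two features enter the estimate, so the denominator is $\sqrt{5}$ instead of $\sqrt{7.5}$; with $p = 100$ the result matches full RMSNorm.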

Pre-norm vs post-norm placement

A second axis of variation is where the norm sits inside a residual block. The original Transformer paper put LayerNorm after the residual addition (post-norm). Pre-norm, where the norm is applied to the input of each sub-block before attention or FFN, became dominant after roughly 2019 because it trains more stably at depth.
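The two placements differ only in where the norm sits relative to the residual addition. The sketch below is a toy illustration, not a Transformer: `toy_sublayer` is a hypothetical stand-in for attention or an FFN, the norm is unit-gain RMSNorm in both placements, and eight blocks are stacked to show how the residual stream behaves.

```python
import math

def rms(x):
    return math.sqrt(sum(v * v for v in x) / len(x))

def rms_norm(x):
    r = rms(x)
    return [v / r for v in x]

def pre_norm_block(x, sublayer):
    # Normalize the sub-block input; the residual path carries x through untouched.
    return [xi + fi for xi, fi in zip(x, sublayer(rms_norm(x)))]

def post_norm_block(x, sublayer):
    # Normalize after the residual addition, as in the original Transformer layout.
    return rms_norm([xi + fi for xi, fi in zip(x, sublayer(x))])

def toy_sublayer(x):
    # Hypothetical stand-in for attention or a feedforward network.
    return [0.5 * v for v in x]

pre = post = [3.0, -1.0, 4.0, -2.0]
for _ in range(8):
    pre = pre_norm_block(pre, toy_sublayer)
    post = post_norm_block(post, toy_sublayer)

print(round(rms(pre), 4))   # 6.7386: the un-normalized residual stream grows with depth
print(round(rms(post), 4))  # 1.0: every block output is re-normalized
```

In the pre-norm layout the identity path from input to output is never rescaled, which is the usual intuition for why gradients pass through deep stacks more easily.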

RMSNorm inherits this distinction. The default in essentially every modern LLM is pre-norm RMSNorm. Some Gemma generations add an extra post-norm RMSNorm on top of pre-norm; that acts more like an output scaling than a true post-norm. DeepNorm (Microsoft 2022) is a separate approach that uses LayerNorm with a scaling factor on the residual connection to enable training of 1000-layer Transformers.

Other related variants

ScaleNorm (Nguyen and Salazar 2019) divides the activation vector by its L2 norm and multiplies the result by a single learned scalar. It is even cheaper than RMSNorm but gives up the per-feature gain. It was used in some smaller machine-translation models but never became mainstream.
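Because $\lVert \mathbf{a} \rVert_2 = \sqrt{n} \cdot \text{RMS}(\mathbf{a})$, ScaleNorm and unit-gain RMSNorm differ only by a constant factor. The sketch below illustrates that relationship with pure-Python helpers whose names are assumptions for this example; setting the scalar to $\sqrt{n}$ reproduces RMSNorm with $\mathbf{g}$ equal to all ones.

```python
import math

def scale_norm(a, g):
    """ScaleNorm: divide by the L2 norm, then multiply by a single scalar g."""
    l2 = math.sqrt(sum(x * x for x in a))
    return [g * x / l2 for x in a]

def rms_norm(a):
    """Unit-gain RMSNorm for comparison."""
    rms = math.sqrt(sum(x * x for x in a) / len(a))
    return [x / rms for x in a]

a = [3.0, -1.0, 4.0, -2.0]
via_scale_norm = scale_norm(a, g=math.sqrt(len(a)))
via_rms_norm = rms_norm(a)

print([round(y, 4) for y in via_scale_norm])  # [1.0954, -0.3651, 1.4606, -0.7303]
print([round(y, 4) for y in via_rms_norm])    # [1.0954, -0.3651, 1.4606, -0.7303]
```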

Nvidia's nGPT (2024) takes the idea further by normalizing every weight matrix and embedding to lie on the unit hypersphere, removing normalization layers entirely.

Implementation

A minimal PyTorch reference implementation looks like this:

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., dim)
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).sqrt()
        return x / rms * self.weight

A few details that matter in practice:

The norm is computed across the last dimension only, which is the feature axis. Every position along the batch and sequence dimensions is normalized independently of the others.
In mixed precision, the standard pattern is to upcast $x$ to fp32 for the RMS computation, then cast back to bf16 or fp16 before multiplying by the weight. This avoids underflow when $x_i^2$ is small and overflow when it is large.
self.weight is initialized to ones, not zeros, so the gain is the identity at initialization and the layer performs plain RMS normalization.
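The mixed-precision pattern described above can be sketched as follows. This is one plausible arrangement rather than any particular library's code: the class name `RMSNormFP32` is hypothetical, and the cast of the weight to the input dtype is an assumption made here so that the output keeps the input's dtype.

```python
import torch
import torch.nn as nn

class RMSNormFP32(nn.Module):
    """RMSNorm that computes its statistics in fp32 regardless of the input dtype."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        in_dtype = x.dtype
        x32 = x.float()  # upcast so that squaring cannot overflow fp16
        inv_rms = torch.rsqrt(x32.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        y = (x32 * inv_rms).to(in_dtype)  # cast back before applying the gain
        return y * self.weight.to(in_dtype)

norm = RMSNormFP32(4)
x = torch.full((2, 4), 300.0, dtype=torch.float16)  # 300**2 exceeds the fp16 range
out = norm(x)
print(out.dtype, torch.isfinite(out).all().item())  # torch.float16 True
```

Squaring the same input directly in fp16 would overflow, because 90000 is larger than the biggest finite half-precision value.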

Built-in PyTorch support

PyTorch added a built-in torch.nn.RMSNorm module and a functional torch.nn.functional.rms_norm in version 2.4 (released July 2024, PR #121364). The API mirrors torch.nn.LayerNorm:

import torch.nn as nn
norm = nn.RMSNorm(normalized_shape=4096, eps=1e-6)

The normalized_shape argument can be a single integer or a tuple, matching LayerNorm's behavior. If eps is not specified, PyTorch uses the machine epsilon of the computation type.

Hugging Face Transformers

The Hugging Face Transformers library ships an LlamaRMSNorm class in transformers.models.llama.modeling_llama that follows the LLaMA reference implementation, with explicit fp32 upcast for the RMS computation. Most other RMSNorm-using models in the library either reuse LlamaRMSNorm directly or define near-identical classes (MistralRMSNorm, GemmaRMSNorm, Qwen2RMSNorm, and so on) that differ mainly in dtype handling and in how the gain is parameterized (GemmaRMSNorm, for example, multiplies by one plus its weight).

Fused kernels

For inference and training at scale, almost everyone uses fused kernels rather than the eager-mode reference. Apex's FusedRMSNorm, the custom kernels in vLLM, the hand-written CUDA kernels in TensorRT-LLM, and the GGML implementation in llama.cpp all do the same thing: load $x$ once, then compute the sum of squares, the inverse RMS, and the output multiply in a single pass with no intermediate materialization. On Apple silicon, MLX provides mx.fast.rms_norm for the same reason. The unfused PyTorch reference can be 2 to 3 times slower than fused kernels in inference workloads where memory bandwidth is the bottleneck.

Pitfalls and caveats

For most users, RMSNorm is a drop-in replacement for LayerNorm and there are no surprises. A few practical issues do come up at scale.

Numerical fragility in fp16. Squaring activations in fp16 can overflow when the hidden state has large magnitudes, particularly in long-context generation or shortly after initialization. The standard fix is to compute the RMS in fp32 even when the rest of the model runs in fp16 or bf16. Most production frameworks do this automatically, but custom kernels that skip the upcast have caused divergence in real training runs.
Outlier features and quantization. When quantizing an LLM to int8 or int4, the activation distribution after RMSNorm matters. The lack of mean subtraction means some channels can carry persistent offsets, which complicates per-tensor quantization schemes. Methods like SmoothQuant and AWQ exist in part to deal with the resulting dynamic-range issues.
No clean theoretical justification. The original paper made an empirical case, and follow-up work has not produced a clean theoretical explanation for why dropping mean centering is fine in practice. The 2024 "Re-Introducing LayerNorm" paper argued that the geometric distinction between RMSNorm and LayerNorm is real and shows up in some regimes, leaving open questions for very deep models or non-language modalities.
Post-norm RMSNorm is unstable for deep models. Like LayerNorm, RMSNorm placed after the residual add does not train well past about 12 layers without other modifications. Pre-norm placement, as in LLaMA, is the safe default. Some Gemma generations add a second RMSNorm on the output of each block on top of pre-norm; that is closer to an output scaling than a true post-norm.
Gradient through the RMS denominator. The backward pass through $1/\text{RMS}(\mathbf{a})$ involves a division by a square root, which is numerically delicate when the RMS is close to zero. The $\epsilon$ inside the square root prevents catastrophic blow-up, but choosing $\epsilon$ too small in mixed precision can still cause silent NaNs. Common values are $10^{-5}$ for fp16 and $10^{-6}$ for bf16/fp32.
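The fp16 overflow threshold mentioned in the first item is easy to make concrete. The sketch below relies only on the Python standard library: the `struct` format code `"e"` packs a value as IEEE 754 half precision and raises `OverflowError` when the value is too large to represent. The helper name `fits_in_fp16` is illustrative.

```python
import math
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value

def fits_in_fp16(value: float) -> bool:
    """Return True if `value` can be stored as a finite fp16 number."""
    try:
        struct.pack("<e", value)  # "e" is the half-precision format code
        return True
    except OverflowError:
        return False

# Any activation with magnitude above sqrt(FP16_MAX) overflows when squared in fp16.
print(round(math.sqrt(FP16_MAX), 2))  # 255.94
print(fits_in_fp16(255.0 ** 2))       # True
print(fits_in_fp16(300.0 ** 2))       # False
```

A single activation above roughly 256 is therefore enough to turn an fp16 sum of squares into infinity, which is why the reduction is normally carried out in fp32.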

In day-to-day work the entire interface is nn.RMSNorm(hidden_size), and you do not need to think about any of this.

References

Biao Zhang and Rico Sennrich. "Root Mean Square Layer Normalization." Advances in Neural Information Processing Systems 32 (NeurIPS 2019). arXiv:1910.07467, October 2019. https://arxiv.org/abs/1910.07467
Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton. "Layer Normalization." arXiv:1607.06450, 2016. https://arxiv.org/abs/1607.06450
Sergey Ioffe and Christian Szegedy. "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift." ICML 2015. arXiv:1502.03167. https://arxiv.org/abs/1502.03167
Colin Raffel, Noam Shazeer, Adam Roberts, et al. "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer" (T5). arXiv:1910.10683, October 2019. https://arxiv.org/abs/1910.10683
Hugo Touvron, Thibaut Lavril, Gautier Izacard, et al. "LLaMA: Open and Efficient Foundation Language Models." arXiv:2302.13971, February 2023. https://arxiv.org/abs/2302.13971
Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, et al. "Mistral 7B." arXiv:2310.06825, October 2023. https://arxiv.org/abs/2310.06825
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, et al. "PaLM: Scaling Language Modeling with Pathways." arXiv:2204.02311, April 2022. https://arxiv.org/abs/2204.02311
Albert Gu, Tri Dao. "Mamba: Linear-Time Sequence Modeling with Selective State Spaces." arXiv:2312.00752, December 2023. https://arxiv.org/abs/2312.00752
Gemma Team, Google DeepMind. "Gemma: Open Models Based on Gemini Research and Technology." arXiv:2403.08295, March 2024. https://arxiv.org/abs/2403.08295
Akshat Gupta, Atahan Ozdemir, Gopala Anumanchipalli. "Re-Introducing LayerNorm: Geometric Meaning, Irreversibility and a Comparative Study with RMSNorm." arXiv:2409.12951, 2024. https://arxiv.org/abs/2409.12951
PyTorch documentation, `torch.nn.RMSNorm`. https://docs.pytorch.org/docs/stable/generated/torch.nn.RMSNorm.html
PyTorch GitHub PR #121364, "Add RMSNorm module," merged for the 2.4 release in March 2024. https://github.com/pytorch/pytorch/pull/121364
Reference RMSNorm implementation by Zhang and Sennrich. https://github.com/bzhangGo/rmsnorm
Toan Q. Nguyen, Julian Salazar. "Transformers without Tears: Improving the Normalization of Self-Attention" (ScaleNorm). arXiv:1910.05895, 2019. https://arxiv.org/abs/1910.05895
Hongyu Wang et al. "DeepNet: Scaling Transformers to 1,000 Layers" (DeepNorm). arXiv:2203.00555, 2022. https://arxiv.org/abs/2203.00555
