RMSNorm
Last reviewed
May 2, 2026
Sources
15 citations
Review status
Source-backed
Revision
v2 · 3,289 words
RMSNorm (Root Mean Square Layer Normalization) is a feature normalization technique introduced by Biao Zhang and Rico Sennrich in 2019 as a simplified, faster alternative to Layer Normalization. Instead of subtracting the mean and dividing by the standard deviation, it scales each activation vector by its root mean square only, dropping both the mean computation and the bias parameter that LayerNorm relies on. The method preserves the re-scaling invariance that stabilizes training while giving up the re-centering invariance, and the original paper showed this trade-off costs essentially nothing on accuracy while running between 7% and 64% faster per training step depending on model and hardware.
For several years RMSNorm was a niche curiosity. Then it became the default normalization layer in nearly every large open-weight Transformer language model. LLaMA, LLaMA 2, and LLaMA 3 use it, as do Mistral 7B, Mixtral, Gemma, DeepSeek, Qwen, and the PaLM family. The contemporaneous T5 model from Google also adopted an RMSNorm-equivalent simplified LayerNorm in October 2019, the same month Zhang and Sennrich's paper appeared on arXiv. By the time inference engines like vLLM, llama.cpp, and TensorRT-LLM matured, RMSNorm was treated as a fixed assumption of the architecture, with hand-written fused CUDA kernels squeezing the last few microseconds out of it.
| Field | Value |
|---|---|
| Introduced | October 2019 (arXiv preprint) |
| Conference | NeurIPS 2019 |
| Authors | Biao Zhang, Rico Sennrich |
| Affiliation | University of Edinburgh |
| arXiv ID | 1910.07467 |
| Original tasks | WMT14 English-German MT, image-text retrieval, enwik8 LM, reading comprehension |
| Reported speedup vs LayerNorm | 7% to 64% per step |
| Built into PyTorch | torch.nn.RMSNorm, version 2.4 (July 2024) |
The Transformer architecture uses normalization layers to keep activations from drifting in scale as signals flow through deep stacks of attention and feedforward blocks. Without normalization, training tends to diverge once depth exceeds a few dozen layers, because gradients either explode or vanish.
Three main normalizers were in widespread use before RMSNorm:
LayerNorm became the standard in Transformers from 2017 onward. Its update rule is
$$\text{LayerNorm}(\mathbf{a}) = \frac{\mathbf{a} - \mu}{\sigma} \odot \boldsymbol{\gamma} + \boldsymbol{\beta},$$
where $\mu = \frac{1}{n}\sum_i a_i$, $\sigma = \sqrt{\frac{1}{n}\sum_i (a_i - \mu)^2}$, and $\boldsymbol{\gamma}, \boldsymbol{\beta} \in \mathbb{R}^n$ are learned per-feature gain and bias parameters. The two reduction passes (one for the mean, one for the variance) are the expensive part. On modern accelerators a normalization layer is bandwidth-bound, and halving the reduction work translates almost linearly into wall-clock savings.
Zhang and Sennrich's 2019 paper started from a single hypothesis: maybe the re-centering step in LayerNorm (subtracting the mean) is not actually doing useful work, and only the re-scaling step matters for stable training. If true, you can drop the mean computation, drop the bias parameter, and get a faster norm with essentially the same behavior.
Given a vector $\mathbf{a} \in \mathbb{R}^n$ representing a single position's hidden state in a Transformer, RMSNorm is defined as
$$\bar{a}_i = \frac{a_i}{\text{RMS}(\mathbf{a})} \cdot g_i, \qquad \text{RMS}(\mathbf{a}) = \sqrt{\frac{1}{n}\sum_{j=1}^{n} a_j^2}.$$
In practice a small $\epsilon$ is added inside the square root for numerical stability,
$$\text{RMS}(\mathbf{a}) = \sqrt{\frac{1}{n}\sum_{j=1}^{n} a_j^2 + \epsilon},$$
so the denominator never collapses to zero in fp16 or bf16. The vector $\mathbf{g} \in \mathbb{R}^n$ is a learned per-feature gain, initialized to all ones; some early implementations made it optional, but modern implementations always include it.
The key differences from LayerNorm are summarized below.
| Aspect | LayerNorm | RMSNorm |
|---|---|---|
| Mean subtraction | Yes ($\mu$) | No |
| Denominator | Standard deviation $\sigma$ | Root mean square $\text{RMS}(\mathbf{a})$ |
| Learned parameters | Gain $\boldsymbol{\gamma}$ + bias $\boldsymbol{\beta}$ ($2n$ values) | Gain $\mathbf{g}$ only ($n$ values) |
| Reduction passes over features | Two | One |
| Re-centering invariance | Yes | No |
| Re-scaling invariance | Yes | Yes |
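Take $\mathbf{a} = [3.0, -1.0, 4.0, -2.0]$, with $n = 4$ and $\epsilon$ ignored. The sum of squares is $9 + 1 + 16 + 4 = 30$, so $\text{RMS}(\mathbf{a}) = \sqrt{30/4} = \sqrt{7.5} \approx 2.7386$, and dividing each element by it gives approximately $[1.0954, -0.3651, 1.4606, -0.7303]$.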
With $\mathbf{g}$ initialized to all ones, that is the output. LayerNorm on the same input would first subtract the mean ($\mu = 1.0$) to get $[2, -2, 3, -3]$ and then divide by the standard deviation $\sqrt{6.5} \approx 2.5495$, giving a different output. The two layers behave the same only when the input already has zero mean.
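The arithmetic for this example can be checked with a few lines of plain Python. This is a minimal sketch rather than a production implementation: the helper names `rms_norm` and `layer_norm` are illustrative, the LayerNorm helper assumes unit gain and zero bias, and $\epsilon$ defaults to zero to match the worked example.

```python
import math

def rms_norm(a, g=None, eps=0.0):
    """RMSNorm over a flat list of floats; g defaults to all ones."""
    n = len(a)
    g = g if g is not None else [1.0] * n
    rms = math.sqrt(sum(x * x for x in a) / n + eps)
    return [x / rms * gi for x, gi in zip(a, g)]

def layer_norm(a, eps=0.0):
    """LayerNorm with unit gain and zero bias, for comparison only."""
    n = len(a)
    mu = sum(a) / n
    var = sum((x - mu) ** 2 for x in a) / n
    sigma = math.sqrt(var + eps)
    return [(x - mu) / sigma for x in a]

a = [3.0, -1.0, 4.0, -2.0]
rms_out = rms_norm(a)
ln_out = layer_norm(a)
print([round(v, 4) for v in rms_out])  # [1.0954, -0.3651, 1.4606, -0.7303]
print([round(v, 4) for v in ln_out])   # [0.7845, -0.7845, 1.1767, -1.1767]
```

The RMSNorm output has a mean square of one but a non-zero mean, while the LayerNorm output is centered at zero.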
The RMSNorm paper analyzes the invariances each normalizer preserves. Invariance here means: if you transform the input in a certain way, does the output of the layer change?
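A small numerical check makes the two invariances concrete for transformations of the input vector. The sketch below assumes unit gain, no bias, and $\epsilon = 0$; the helper names and the constants 10 and 5 are arbitrary illustrative choices, not taken from the paper.

```python
import math

def rms_norm(a, eps=0.0):
    rms = math.sqrt(sum(x * x for x in a) / len(a) + eps)
    return [x / rms for x in a]

def layer_norm(a, eps=0.0):
    mu = sum(a) / len(a)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in a) / len(a) + eps)
    return [(x - mu) / sigma for x in a]

def close(u, v, tol=1e-9):
    return all(abs(x - y) <= tol for x, y in zip(u, v))

a = [3.0, -1.0, 4.0, -2.0]
scaled = [10.0 * x for x in a]   # re-scaling: multiply by a positive constant
shifted = [x + 5.0 for x in a]   # re-centering: add a constant to every feature

rms_scale_inv = close(rms_norm(a), rms_norm(scaled))
rms_shift_inv = close(rms_norm(a), rms_norm(shifted))
ln_scale_inv = close(layer_norm(a), layer_norm(scaled))
ln_shift_inv = close(layer_norm(a), layer_norm(shifted))

print(rms_scale_inv, rms_shift_inv)  # True False
print(ln_scale_inv, ln_shift_inv)    # True True
```

Both layers ignore a rescaling of the input, but only LayerNorm ignores a constant shift.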
Zhang and Sennrich's empirical claim is that re-centering invariance does not matter for training stability or final accuracy in deep Transformers. Six years of large-model training have mostly confirmed this, though a 2024 follow-up paper, "Re-Introducing LayerNorm: Geometric Meaning, Irreversibility and a Comparative Study with RMSNorm" (arXiv:2409.12951), argued that the geometric distinction between the two layers is non-trivial in some regimes, particularly when activations have strong outlier features.
Another often-overlooked property is implicit learning-rate adaptation. Both LayerNorm and RMSNorm rescale the gradient flowing back into the input by a factor inversely proportional to the input's norm, which acts like a per-vector learning rate that automatically dampens unusually large activations. This is part of why pre-norm Transformers train so much more stably than unnormalized ones.
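The inverse scaling of the gradient can be checked numerically. The sketch below is an illustration under stated assumptions: it uses RMSNorm with unit gain and $\epsilon = 0$, an arbitrary linear readout vector `c` as a stand-in loss, and central finite differences instead of an autograd library.

```python
import math

def rms_norm(a):
    rms = math.sqrt(sum(x * x for x in a) / len(a))
    return [x / rms for x in a]

def loss(a, c):
    # Arbitrary linear readout of the normalized vector (illustrative only).
    return sum(ci * yi for ci, yi in zip(c, rms_norm(a)))

def grad(a, c, h=1e-6):
    # Central finite differences with respect to each input coordinate.
    g = []
    for i in range(len(a)):
        up = a[:i] + [a[i] + h] + a[i + 1:]
        dn = a[:i] + [a[i] - h] + a[i + 1:]
        g.append((loss(up, c) - loss(dn, c)) / (2 * h))
    return g

a = [3.0, -1.0, 4.0, -2.0]
c = [0.5, -1.0, 2.0, 0.25]

g1 = grad(a, c)                        # gradient at the original input
g10 = grad([10.0 * x for x in a], c)   # gradient at a 10x larger input
ratios = [x / y for x, y in zip(g1, g10)]
print([round(r, 3) for r in ratios])   # each ratio is close to 10
```

Scaling the input by 10 leaves the output unchanged but shrinks the input gradient by about the same factor, which is the damping effect described above.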
RMSNorm has fewer parameters than LayerNorm. For a hidden size of $d$, LayerNorm has $2d$ parameters; RMSNorm has $d$. At the scale of a 70-billion-parameter model, dropping the bias saves on the order of a million parameters in total (for example, 80 layers with two norms each at hidden size 8192 gives $160 \times 8192 \approx 1.3$ million), which is irrelevant for parameter count and shrinks the optimizer state only slightly.
The speedup comes from doing less work. LayerNorm requires two reduction passes over the feature dimension: one to compute $\mu$, then a second to compute $\sigma^2$ from $(a_i - \mu)^2$. RMSNorm needs only one reduction pass, summing $a_i^2$. On a memory-bound operation, which a normalization layer almost always is on modern accelerators, halving the reduction work translates directly to wall-clock savings.
The original paper reported per-step time reductions across a range of models and hardware:
| Model | Task | Reported speedup |
|---|---|---|
| Bidirectional GRU | Reading comprehension (CNN/Daily Mail) | up to 64% |
| Transformer (small/big) | WMT14 English-German MT | 7-15% |
| Word-level RNN | enwik8 language modeling | around 25% |
| ConvS2S | Machine translation | around 10% |
The larger speedups tended to appear on smaller models and recurrent architectures where the norm is a bigger fraction of total work. In modern fused implementations, the absolute gap between RMSNorm and LayerNorm is smaller because the norm is a tiny slice of a Transformer's compute. It still matters at scale: a saved microsecond per layer per token, multiplied across 80 layers and trillions of training tokens, is a real cost.
A secondary practical reason RMSNorm wins on hardware is numerical: computing variance in low precision via $E[a^2] - E[a]^2$ can lose significant digits when the two terms are close, while $\frac{1}{n}\sum a_i^2$ avoids that subtraction entirely.
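The cancellation problem can be reproduced without special hardware. The sketch below simulates float32 arithmetic in plain Python by rounding every intermediate through `struct`; the helpers `f32` and `mean_f32` and the input values are illustrative assumptions, and real kernels may accumulate differently (for example in higher precision).

```python
import struct

def f32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def mean_f32(values):
    """Mean with every intermediate rounded to float32."""
    total = 0.0
    for v in values:
        total = f32(total + v)
    return f32(total / len(values))

a = [9999.0, 10001.0, 9999.0, 10001.0]   # mean 10000, true variance 1.0

# Single-pass variance: E[a^2] - E[a]^2, all in simulated float32.
mean_sq = mean_f32([f32(x * x) for x in a])
mean = mean_f32(a)
naive_var = f32(mean_sq - f32(mean * mean))

# Two-pass variance: subtract the mean first, then square.
two_pass_var = mean_f32([f32(f32(x - mean) ** 2) for x in a])

# RMSNorm's statistic is mean_sq itself, with no subtraction involved.
mean_sq_exact = sum(x * x for x in a) / len(a)
rel_err = abs(mean_sq - mean_sq_exact) / mean_sq_exact

print(naive_var)     # 0.0: the variance is lost entirely
print(two_pass_var)  # 1.0
print(rel_err < 1e-7)  # True: the mean of squares keeps full relative precision
```

The single-pass variance collapses to zero because each squared term has already been rounded by about as much as the quantity being measured, whereas the mean of squares that RMSNorm needs is accurate to float32 precision.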
The T5 paper from Google (Raffel et al., October 2019) was contemporaneous with Zhang and Sennrich's work, and its public Mesh-TensorFlow code used a simplified LayerNorm that, in the relevant terms, is equivalent to RMSNorm. The T5 paper itself did not advertise the change, but the implementation set the pattern that later Google models followed.
| Model | Year | Normalization | Notes |
|---|---|---|---|
| T5 | 2019 | RMSNorm-equivalent | Simplified LayerNorm without bias, in Mesh-TensorFlow code |
| PaLM | 2022 | RMSNorm | Google, descended from T5 conventions |
| PaLM 2 | 2023 | RMSNorm | |
| LLaMA 1 | 2023 | RMSNorm (pre-norm) | Touvron et al. |
| LLaMA 2 | 2023 | RMSNorm (pre-norm) | Same architecture as LLaMA 1 |
| LLaMA 3 | 2024 | RMSNorm (pre-norm) | Same scheme, larger model |
| Mistral 7B | 2023 | RMSNorm (pre-norm) | Mistral AI |
| Mixtral 8x7B | 2023 | RMSNorm (pre-norm) | Mixture of experts on Mistral base |
| Gemma 1, 2, 3 | 2024-2025 | RMSNorm (pre-norm + post-norm) | Google DeepMind |
| DeepSeek V2, V3 | 2024-2025 | RMSNorm | Pre-norm in attention and FFN blocks |
| Qwen 2, 2.5, 3 | 2024-2025 | RMSNorm | Alibaba |
| Phi-3 | 2024 | RMSNorm | Microsoft, Phi-3-medium and later |
| Mamba | 2023 | RMSNorm | Inside the Mamba block, post-SSM |
| Mamba-2 | 2024 | RMSNorm + post-gate RMSNorm | Stability addition |
| RWKV v5/v6 | 2024 | RMSNorm variants | Time-mix and channel-mix blocks |
The placement is almost always pre-norm, meaning the norm is applied before each attention or feedforward sub-block, with a residual added afterwards. Pre-norm Transformers train more stably at depth than post-norm Transformers, and combining pre-norm with RMSNorm has become the default recipe for open-weight LLMs.
Not every modern model uses RMSNorm. The original GPT-2 used LayerNorm, and several follow-ups in the EleutherAI lineage (GPT-J 6B, GPT-NeoX-20B, Pythia) stayed with LayerNorm rather than switching. GPT-3 also used LayerNorm. The specific choice in GPT-4 and later proprietary OpenAI models has not been published.
Why modern LLMs prefer RMSNorm comes down to a combination of training stability and speed. Pre-norm RMSNorm trains as well as pre-norm LayerNorm in head-to-head comparisons that researchers have actually reported, and it costs less per step. With training compute being the binding constraint on frontier models, even single-digit percentage savings are worth taking by default.
| Method | Statistics computed | Re-centering invariant | Re-scaling invariant | Typical use |
|---|---|---|---|---|
| Batch normalization | Mean and variance over batch dimension | Yes (per channel) | Yes | CNNs, vision |
| Layer normalization | Mean and variance over feature dimension | Yes | Yes | Original Transformers, RNNs |
| Group Normalization | Mean and variance per group of channels | Yes (per group) | Yes | CNNs at small batch sizes |
| Instance Normalization | Mean and variance per channel per sample | Yes (per channel) | Yes | Style transfer, GANs |
| RMSNorm | RMS over feature dimension | No | Yes | Modern LLMs |
| ScaleNorm | Single scalar L2 norm | No | Yes | Niche, some translation models |
| DeepNorm | LayerNorm + residual scaling factor | Yes | Yes | Very deep Transformers (1000+ layers) |
RMSNorm is the only widely used method in this list that gives up re-centering invariance, and it is also the cheapest. The success of LLMs trained with it is the main empirical evidence that, at least for autoregressive language modeling, that property is not load-bearing.
The original paper also introduced a partial variant that estimates the RMS from the first $p\%$ of the feature dimensions instead of all of them:
$$\overline{\text{RMS}}(\mathbf{a}) = \sqrt{\frac{1}{k}\sum_{j=1}^{k} a_j^2}, \qquad k = \lceil n \cdot p / 100 \rceil.$$
This trades a small amount of accuracy for additional compute savings. The paper showed pRMSNorm with $p = 6.25\%$ gives near-identical results to full RMSNorm on machine translation. In practice pRMSNorm did not catch on, partly because fused kernels make full RMSNorm fast enough that the extra approximation is not worth the complexity.
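The partial estimate can be sketched in a few lines of plain Python. The function name `p_rms_norm` and the `max(1, ...)` guard against an empty prefix are assumptions of this sketch, not details taken from the paper.

```python
import math

def p_rms_norm(a, p, g=None, eps=0.0):
    """Partial RMSNorm: estimate the RMS from the first k = ceil(n * p / 100) features."""
    n = len(a)
    k = max(1, math.ceil(n * p / 100))   # guard: always use at least one feature
    g = g if g is not None else [1.0] * n
    rms_est = math.sqrt(sum(x * x for x in a[:k]) / k + eps)
    # All n features are still rescaled; only the statistic uses a prefix.
    return [x / rms_est * gi for x, gi in zip(a, g)]

a = [3.0, -1.0, 4.0, -2.0]
full = p_rms_norm(a, p=100)   # k = 4, identical to full RMSNorm
half = p_rms_norm(a, p=50)    # k = 2, RMS estimated from [3.0, -1.0] only

print([round(v, 4) for v in full])  # [1.0954, -0.3651, 1.4606, -0.7303]
print([round(v, 4) for v in half])  # [1.3416, -0.4472, 1.7889, -0.8944]
```

On a four-element toy vector the two estimates differ noticeably; the paper's claim concerns much wider hidden states, where a prefix is a better proxy for the whole vector.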
A second axis of variation is where the norm sits inside a residual block. The original Transformer paper put LayerNorm after the residual addition (post-norm). Pre-norm, where the norm is applied to the input of each sub-block before attention or FFN, became dominant after roughly 2019 because it trains more stably at depth.
RMSNorm inherits this distinction. The default in essentially every modern LLM is pre-norm RMSNorm. Some Gemma generations add an extra post-norm RMSNorm on top of pre-norm; that acts more like an output scaling than a true post-norm. DeepNorm (Microsoft 2022) is a separate approach that uses LayerNorm with a scaling factor on the residual connection to enable training of 1000-layer Transformers.
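The difference between the two placements is only the order of operations around the residual connection. The sketch below is schematic: `toy_sublayer` is a hypothetical stand-in for attention or an FFN, and the norm uses unit gain.

```python
import math

def rms_norm(x, eps=1e-6):
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def add(x, y):
    return [a + b for a, b in zip(x, y)]

def pre_norm_block(x, sublayer):
    # Normalize the sub-block input; the residual path stays unnormalized.
    return add(x, sublayer(rms_norm(x)))

def post_norm_block(x, sublayer):
    # Original Transformer ordering: normalize after the residual addition.
    return rms_norm(add(x, sublayer(x)))

def toy_sublayer(h):
    # Hypothetical stand-in for attention or a feedforward network.
    return [0.5 * v for v in h]

x = [3.0, -1.0, 4.0, -2.0]
pre = pre_norm_block(x, toy_sublayer)
post = post_norm_block(x, toy_sublayer)
```

In the pre-norm block the residual stream keeps its original scale, so the identity path from input to output is never rescaled; in the post-norm block every output is renormalized to unit RMS.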
ScaleNorm (Nguyen and Salazar 2019) replaces the per-feature gain with a single learned scalar and divides the activation vector by its L2 norm. It is even cheaper than RMSNorm but gives up the per-feature gain. It was used in some smaller machine-translation models but never became mainstream.
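Because $\text{RMS}(\mathbf{a}) = \lVert\mathbf{a}\rVert_2 / \sqrt{n}$, the two layers are closely related: ScaleNorm with scalar $\sqrt{n}$ matches RMSNorm with an all-ones gain. The sketch below checks this identity; the helper names are illustrative and $\epsilon$ is set to zero.

```python
import math

def scale_norm(a, g):
    """ScaleNorm: a single scalar g times the L2-normalized vector."""
    l2 = math.sqrt(sum(x * x for x in a))
    return [g * x / l2 for x in a]

def rms_norm(a):
    """RMSNorm with an all-ones gain."""
    rms = math.sqrt(sum(x * x for x in a) / len(a))
    return [x / rms for x in a]

a = [3.0, -1.0, 4.0, -2.0]
sn = scale_norm(a, g=math.sqrt(len(a)))
rn = rms_norm(a)
max_diff = max(abs(u - v) for u, v in zip(sn, rn))
print(max_diff < 1e-12)  # True
```

The practical difference is therefore the gain: one shared scalar in ScaleNorm versus $n$ independent values in RMSNorm.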
Nvidia's nGPT (2024) takes the idea further by normalizing every weight matrix and embedding to lie on the unit hypersphere, removing normalization layers entirely.
A minimal PyTorch reference implementation looks like this:
```python
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., dim)
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).sqrt()
        return x / rms * self.weight
```
A few details that matter in practice:
- self.weight is initialized to ones, not zeros, so the learned gain is the identity at init.

PyTorch added a built-in torch.nn.RMSNorm module and a functional torch.nn.functional.rms_norm in version 2.4 (released July 2024, PR #121364). The API mirrors torch.nn.LayerNorm:

```python
import torch.nn as nn

norm = nn.RMSNorm(normalized_shape=4096, eps=1e-6)
```
The normalized_shape argument can be a single integer or a tuple, matching LayerNorm's behavior. If eps is not specified, PyTorch uses the machine epsilon of the computation type.
The Hugging Face Transformers library ships a LlamaRMSNorm class in transformers.models.llama.modeling_llama that follows the LLaMA reference implementation, with explicit fp32 upcast for the RMS computation. Most other RMSNorm-using models in the library either reuse LlamaRMSNorm directly or define near-identical classes (MistralRMSNorm, GemmaRMSNorm, Qwen2RMSNorm, and so on) that differ mainly in dtype handling and in how the gain is parameterized (GemmaRMSNorm, for example, stores the gain as an offset from one).
For inference and training at scale, almost everyone uses fused kernels rather than the eager-mode reference. Apex's FusedRMSNorm, the fused kernels in vLLM, the hand-written CUDA kernels in TensorRT-LLM, and the GGML implementation in llama.cpp all do the same thing: load $x$ once, then compute the sum of squares, the inverse RMS, and the output multiply in a single pass with no intermediate materialization. On Apple silicon, MLX provides mx.fast.rms_norm for the same reason. The unfused PyTorch reference can be 2 to 3 times slower than fused kernels in inference workloads where memory bandwidth is the bottleneck.
For most users, RMSNorm is a drop-in replacement for LayerNorm and there are no surprises. A few practical issues do come up at scale.
In day-to-day work the entire interface is nn.RMSNorm(hidden_size), and you do not need to think about any of this.