Layer normalization is a technique for normalizing the activations of a neural network that operates across the feature dimension of each individual sample, rather than across a batch of samples. Proposed by Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton in 2016, layer normalization was originally developed to address limitations of batch normalization in recurrent neural networks [1]. It has since become the standard normalization method in transformer architectures and large language models, where its independence from batch statistics and compatibility with variable-length sequences make it far more practical than batch-based alternatives. Modern variants such as RMSNorm have refined the original formulation by stripping out operations that proved unnecessary at scale, and the question of where to place normalization within a transformer block (pre-norm versus post-norm) has become one of the most consequential architectural decisions in modern deep learning.
Normalization techniques are essential for training deep neural networks. Without normalization, the distribution of activations at each layer shifts during training as the parameters of preceding layers change, a phenomenon known as internal covariate shift. This instability slows training and can prevent convergence altogether in very deep networks. Activations can also drift toward extreme magnitudes, triggering the vanishing gradient problem or exploding gradients depending on the direction of the drift.
Batch normalization, introduced by Sergey Ioffe and Christian Szegedy in 2015, addressed this problem by normalizing activations across the batch dimension. For each feature, it computes the mean and variance over all samples in the current mini-batch and uses these statistics to standardize the activations [2]. Batch normalization proved highly effective for convolutional neural networks and quickly became a standard component of architectures like ResNet, GoogLeNet, and DenseNet.
However, batch normalization has several practical limitations:

- Its statistics depend on the batch size, and quality degrades sharply when batches are small.
- It behaves differently at training and inference time, relying on running averages that can become stale.
- It is awkward for variable-length sequences, where padding positions contaminate the per-feature statistics.
- It is ill-defined at batch size 1, as arises in autoregressive generation.
- In distributed training, computing exact batch statistics requires synchronization across devices.
These limitations motivated Ba, Kiros, and Hinton to develop layer normalization as an alternative that normalizes within each sample independently. Their original 2016 paper demonstrated that layer normalization stabilized training of long short-term memory networks on tasks where batch normalization had been impractical, including handwriting sequence generation, image-sentence ranking, and question answering on the bAbI dataset [1].
Given an input vector x of dimension H (representing the activations of a single sample at a single layer), layer normalization computes:
mu = (1/H) * sum(x_i) for i = 1 to H
sigma^2 = (1/H) * sum((x_i - mu)^2) for i = 1 to H
y_i = gamma_i * (x_i - mu) / sqrt(sigma^2 + epsilon) + beta_i
where:

- mu is the mean of the activations across the H features,
- sigma^2 is their variance,
- gamma and beta are learned per-feature scale and shift parameters, each of dimension H,
- epsilon is a small constant added for numerical stability, and
- y_i is the normalized output for feature i.
The critical difference from batch normalization is the axis of normalization. Layer normalization computes statistics across features (the hidden dimension) for each sample independently. Batch normalization computes statistics across samples (the batch dimension) for each feature independently. The two operations look almost identical on paper but differ on a single axis, and that single difference rewires nearly everything about how the operation behaves in practice.
For a transformer with hidden size d_model = 768 processing a batch of B sequences each of length n, the activation tensor has shape (B, n, 768). Layer normalization computes a separate mean and variance for each of the B times n token positions, normalizing across the 768-dimensional feature vector at each position. Batch normalization would instead compute a mean and variance for each of the 768 features, averaging across all B times n token positions in the batch.
In PyTorch, calling nn.LayerNorm(768) on this tensor produces 2 statistics (mu and sigma squared) per token, for a total of 2 times B times n statistics. The learned gamma and beta parameters are shared across positions and samples, with shape (768,). The number of learned parameters is therefore tiny relative to the rest of the model, typically a few thousand parameters per layer normalization module.
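As a concrete illustration (shapes chosen arbitrarily), the following sketch applies nn.LayerNorm to a transformer-shaped activation tensor and checks the parameter shapes and per-position statistics:

```python
import torch
import torch.nn as nn

B, n, d_model = 4, 128, 768                  # arbitrary batch size, sequence length, hidden size
x = torch.randn(B, n, d_model)

ln = nn.LayerNorm(d_model)                   # normalizes over the last (feature) dimension
y = ln(x)

print(y.shape)                               # torch.Size([4, 128, 768])
print(ln.weight.shape, ln.bias.shape)        # gamma and beta: torch.Size([768]) each

# Each of the B*n token positions is standardized independently:
# mean ~ 0 and variance ~ 1 across the 768 features at every position.
print(y.mean(dim=-1).abs().max())            # close to 0
print(y.var(dim=-1, unbiased=False).mean())  # close to 1
```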
Layer normalization is invariant to per-sample shifts and scalings of the input. If we replace x with a constant c plus x, the mean shifts by c and the centered values are unchanged, so the output is the same up to the learned gamma and beta. If we replace x with a positive scalar a times x, the variance scales by a squared and the standard deviation by a, so the normalized values are again unchanged. This re-centering and re-scaling invariance is what gives layer normalization its regularizing effect: the network does not need to learn how to compensate for the absolute magnitude of incoming activations, only their relative pattern across the feature axis.
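A quick way to see this invariance empirically (a toy check, not taken from the original paper) is to shift and scale an input and compare the outputs:

```python
import torch
import torch.nn as nn

ln = nn.LayerNorm(16)
x = torch.randn(2, 16)

shifted = x + 3.0          # per-sample shift by a constant c
scaled = 5.0 * x           # per-sample scaling by a positive scalar a

# Both comparisons hold up to the small epsilon term in the denominator.
print(torch.allclose(ln(x), ln(shifted), atol=1e-4))  # True: the shift is absorbed by the mean
print(torch.allclose(ln(x), ln(scaled), atol=1e-4))   # True: the scale is absorbed by the std
```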
The following table summarizes the key differences between layer normalization and batch normalization.
| property | layer normalization | batch normalization |
|---|---|---|
| normalization axis | feature dimension (per sample) | batch dimension (per feature) |
| depends on batch size | no | yes |
| behavior at inference | identical to training | uses running averages |
| works with variable-length sequences | yes | problematic |
| works with batch size 1 | yes | no (undefined statistics) |
| learned parameters | gamma, beta per feature | gamma, beta per feature |
| original domain | RNNs | CNNs |
| used in transformers | yes (standard) | rarely |
| train/test discrepancy | none | yes |
| helps regularization | weakly | yes (noise from batch stats) |
| sequential dependency along batch | none | yes |
In convolutional networks for image classification, batch normalization typically gives a small but measurable accuracy edge over layer normalization, because the noise injected by batch statistics acts as a regularization signal. In sequence models and transformers, the practical disadvantages of batch normalization (variable sequence lengths, autoregressive inference, distributed training overhead) outweigh that small advantage and layer normalization wins decisively.
Virtually all transformer architectures use layer normalization rather than batch normalization. Several properties of layer normalization align well with the requirements of transformer training.
In natural language processing, sequences within a batch typically have different lengths and are padded to a common length. Batch normalization would compute statistics that mix meaningful token positions with padding tokens, distorting the normalization. Layer normalization sidesteps this entirely by normalizing each token position independently. Padding tokens still receive a normalization, but they no longer pollute the statistics of real tokens.
Layer normalization produces identical outputs for a given input regardless of what other samples are in the batch. This property is critical for autoregressive generation, where the model processes one token at a time (effectively batch size 1 for the new token). It also simplifies distributed training, since there is no need to synchronize batch statistics across devices, and it removes a class of training-versus-inference bugs caused by stale running averages.
Transformers process sequences where each position has a d_model-dimensional representation. Layer normalization treats each position as an independent sample to normalize, which fits the position-wise structure of transformer computation. The same module can be applied to a tensor of shape (B, n, d_model), a tensor of shape (B, 1, d_model) during decoding, or a tensor of shape (B, d_model) for a classification head, without any reshaping or changes in behavior.
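For example, a single module handles all three shapes without modification (a small illustrative sketch):

```python
import torch
import torch.nn as nn

d_model = 512
ln = nn.LayerNorm(d_model)

full_seq = torch.randn(8, 100, d_model)   # (B, n, d_model) during training
one_step = torch.randn(8, 1, d_model)     # (B, 1, d_model) during autoregressive decoding
pooled = torch.randn(8, d_model)          # (B, d_model) for a classification head

# The same parameters and the same per-position normalization apply in every case.
for t in (full_seq, one_step, pooled):
    print(ln(t).shape)
```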
Empirically, layer normalization contributes to stable training of deep transformer models, particularly in conjunction with residual connections. The combination of residual connections and layer normalization allows gradients to flow effectively through networks with dozens or hundreds of layers. Models such as GPT-3, with 96 transformer blocks, would be very difficult to train without normalization keeping the residual stream bounded in magnitude.
The original transformer architecture by Vaswani et al. (2017) placed layer normalization after each residual sublayer, a configuration now called Post-Norm or Post-LN. The computation for each sublayer follows this pattern [3]:
Post-Norm: y = LayerNorm(x + Sublayer(x))
In this arrangement, the output of the sublayer (attention or feed-forward network) is added to the residual, and then layer normalization is applied to the sum.
An alternative placement, called Pre-Norm or Pre-LN, applies layer normalization before the sublayer:
Pre-Norm: y = x + Sublayer(LayerNorm(x))
Here, the input is first normalized, then passed through the sublayer, and the sublayer output is added to the original (unnormalized) input via the residual connection. The residual stream therefore carries unnormalized values from layer to layer, with each block reading a normalized snapshot of that stream and writing back an additive update.
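The two placements can be written side by side as a minimal sketch (the sublayer here is a stand-in for attention or the feed-forward network; class and variable names are illustrative):

```python
import torch
import torch.nn as nn

class PostNormBlock(nn.Module):
    def __init__(self, dim, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        # Post-Norm: normalize the sum of the residual and the sublayer output
        return self.norm(x + self.sublayer(x))

class PreNormBlock(nn.Module):
    def __init__(self, dim, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        # Pre-Norm: the residual stream itself stays unnormalized; the sublayer
        # reads a normalized snapshot and writes back an additive update
        return x + self.sublayer(self.norm(x))

# Example with a feed-forward sublayer
ffn = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
block = PreNormBlock(512, ffn)
y = block(torch.randn(2, 10, 512))
```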
Xiong et al. (2020) provided a theoretical and empirical analysis showing that Pre-Norm transformers have significantly better-behaved gradients at initialization compared to Post-Norm transformers [4]. In the Post-Norm configuration, the expected gradients of parameters near the output layer can be very large at initialization, making training unstable without a careful learning rate warmup schedule. Pre-Norm transformers do not suffer from this issue and can often be trained without any warmup.
The paper demonstrated that the Post-Norm transformer's need for learning rate warmup is not just a practical trick but a mathematical necessity given the gradient magnitudes at initialization. Pre-Norm placement resolves this by ensuring that the input to each sublayer is well-conditioned [4]. In experiments, Pre-Norm transformers without warmup matched the final quality of Post-Norm transformers with carefully tuned warmup, while training significantly faster and being far more forgiving of hyperparameter choices.
Pre-norm is not free of drawbacks. Because each block writes additively into an unnormalized residual stream, the magnitudes of activations in deep pre-norm networks tend to grow with depth. Liu et al. (2020) showed that this growth can hurt the effective expressive depth of pre-norm transformers, since later layers see inputs dominated by the cumulative residual rather than by the latest representation [12]. Some researchers report that post-norm models, when they can be trained successfully, achieve marginally better final loss than pre-norm models of the same size. The dominant choice in modern large-scale training is still pre-norm because the optimization stability gains far outweigh that small quality difference.
Several hybrid placements have been explored. Sandwich-LN applies layer normalization both before and after the sublayer, capturing some of the regularization of post-norm while keeping the optimization friendliness of pre-norm. Peri-LN, proposed in 2025 by Park et al., is a related variant that has shown improvements on certain tasks [4a]. These hybrids see occasional use, but the field has largely converged on plain pre-norm with RMSNorm for the largest models.
GPT-2 (Radford et al., 2019) was one of the first prominent models to adopt Pre-Norm, and this placement has since become the default for nearly all large language models. GPT-3, LLaMA 1, 2, and 3, Mistral, Mixtral, PaLM, Gemma, Qwen, and DeepSeek all use Pre-Norm. The original transformer and the BERT family (including RoBERTa, which reused BERT's architecture) used Post-Norm, though many later encoder models have moved to Pre-Norm or hybrid placements.
The following table compares the two configurations.
| property | pre-norm | post-norm |
|---|---|---|
| layer norm placement | before sublayer | after sublayer + residual |
| training stability | high | requires warmup |
| learning rate warmup | often unnecessary | critical |
| final model quality (small models) | slightly lower | slightly higher |
| gradient behavior at init | well-behaved | can be very large near output |
| residual magnitude growth | tends to grow with depth | bounded by normalization |
| used in | GPT-2/3, LLaMA, Mistral, PaLM, Gemma | original transformer, BERT |
Root Mean Square Layer Normalization, called RMSNorm, was proposed by Biao Zhang and Rico Sennrich in 2019. It is a simplified variant of layer normalization that has become the dominant normalization method in modern large language models [5].
Standard layer normalization performs two operations: re-centering (subtracting the mean) and re-scaling (dividing by the standard deviation). Zhang and Sennrich hypothesized that the re-centering step is not essential and that the re-scaling alone provides the key regularization benefit. Removing mean subtraction reduces computational overhead while maintaining comparable model quality. The intuition is that re-scaling invariance is what tames the optimization landscape, and that re-centering invariance, while elegant, mostly costs cycles without paying for itself in model quality [5].
Given an input vector x of dimension H, RMSNorm computes:
RMS(x) = sqrt((1/H) * sum(x_i^2) for i = 1 to H)
y_i = gamma_i * x_i / RMS(x)
Note that RMSNorm does not subtract the mean and does not include a learned bias term beta. It only has the learned scale parameter gamma. The computation is simpler than standard layer normalization in four ways:

- no mean needs to be computed,
- no subtraction of the mean from each activation,
- no learned bias term beta to store or apply, and
- only a single reduction over the feature dimension (the sum of squares) is required.
A partial variant called pRMSNorm estimates the RMS from only p% of the inputs, trading a small amount of accuracy for additional speed. This variant is rarely used in practice because the full RMSNorm is already cheap.
Zhang and Sennrich reported that RMSNorm reduces the running time of the normalization step by 7% to 64% compared to standard layer normalization across different model architectures, while achieving comparable performance on machine translation, text summarization, and image classification tasks [5]. The exact speedup depends on the hardware and the relative cost of normalization compared to attention and feed-forward computation. On modern GPUs, where memory bandwidth often dominates, the benefit of skipping the mean computation is most visible during inference and during the backward pass.
RMSNorm has been adopted by most leading open-weight large language models. The following table tracks normalization choices across major architectures.
| model | normalization | year | reference |
|---|---|---|---|
| original transformer | post-LN, standard LayerNorm | 2017 | Vaswani et al. [3] |
| GPT-2 | pre-LN, standard LayerNorm | 2019 | Radford et al. |
| GPT-3 | pre-LN, standard LayerNorm | 2020 | Brown et al. |
| BERT | post-LN, standard LayerNorm | 2018 | Devlin et al. |
| T5 | pre-LN, RMSNorm (no bias) | 2019 | Raffel et al. |
| LLaMA 1 / 2 / 3 | pre-RMSNorm | 2023, 2024 | Touvron et al. [6] |
| Mistral / Mixtral | pre-RMSNorm | 2023 | Jiang et al. |
| PaLM / PaLM 2 | pre-RMSNorm | 2022 | Chowdhery et al. |
| Qwen / Qwen 2 / Qwen 3 | pre-RMSNorm | 2023, 2024 | Bai et al. |
| Gemma / Gemma 2 | pre-RMSNorm | 2024 | Google DeepMind |
| DeepSeek-V2 / V3 | pre-RMSNorm | 2024 | DeepSeek-AI |
| Falcon | pre-LN, standard LayerNorm | 2023 | TII |
| Phi-2 / Phi-3 | pre-LN, standard LayerNorm | 2023, 2024 | Microsoft |
The convergence toward RMSNorm in pre-norm position reflects an empirical finding: this combination provides the best tradeoff between training stability, computational efficiency, and model quality for large-scale language model training.
Wang et al. (2022) introduced DeepNorm, a normalization scheme designed for transformers with hundreds or thousands of layers [10]. The DeepNet paper successfully trained a transformer with 1,000 layers (2,500 attention and feed-forward sublayers), an order of magnitude deeper than previous deep transformers.
DeepNorm modifies post-norm by scaling the residual branch with a depth-dependent constant alpha and the attention/FFN parameters at initialization with a constant beta:
DeepNorm: y = LayerNorm(alpha * x + Sublayer(x))
The constants alpha and beta are derived theoretically based on the model depth and architecture (encoder-only, decoder-only, or encoder-decoder). For an encoder-decoder transformer with N encoder layers and M decoder layers, alpha is roughly proportional to the fourth root of the layer count.
The core insight is that the model update at each step should remain bounded as depth grows. Plain post-norm fails this criterion, leading to gradient explosions. Plain pre-norm satisfies it but suffers from the residual magnitude growth problem mentioned earlier. DeepNorm threads the needle by combining the better gradient behavior of post-norm with a careful initialization that bounds updates.
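A minimal sketch of the DeepNorm residual follows; alpha is left as a constructor argument because the depth-dependent formulas for alpha and beta in the paper vary by architecture and are not reproduced here:

```python
import torch
import torch.nn as nn

class DeepNormBlock(nn.Module):
    """Post-norm residual with a depth-dependent residual scale alpha [10].

    alpha should be set from the model depth following the DeepNet paper;
    the sublayer's weights are additionally scaled by a constant beta at
    initialization (not shown here).
    """
    def __init__(self, dim, sublayer, alpha):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(dim)
        self.alpha = alpha

    def forward(self, x):
        # DeepNorm: y = LayerNorm(alpha * x + Sublayer(x))
        return self.norm(self.alpha * x + self.sublayer(x))
```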
On a multilingual machine translation benchmark covering 7,482 translation directions, the 200-layer 3.2-billion-parameter DeepNet model significantly outperformed shallower baselines, demonstrating that depth scaling can pay off when training stability is solved [10].
DeepNorm has not displaced pre-norm with RMSNorm in mainstream language models, partly because most large models are wide and moderately deep (around 32 to 96 layers) rather than extremely deep, and partly because the alpha and beta constants tie the architecture to a specific depth. For depth experiments and certain specialized use cases, however, DeepNorm remains an important technique.
Layer normalization is one of the most numerically sensitive operations in a transformer because it computes per-sample statistics that can have a wide dynamic range.
When training in mixed precision using float16 or bfloat16, the variance computation can lose significant precision. The squared deviations can underflow to zero when the activations are small, or overflow when the activations are large. Most production LLM training stacks compute layer normalization statistics in float32 even when the rest of the network runs in lower precision. PyTorch, JAX, and TensorFlow all upcast the normalization step automatically in their built-in modules, but custom kernels need to handle this explicitly.
bfloat16 is more forgiving than float16 because of its larger exponent range, but it has fewer mantissa bits. The result is that bfloat16 can represent the magnitudes that occur in practice but loses precision in the actual normalization division, sometimes producing visible artifacts in long training runs. The standard practice in the LLaMA, Gemma, and DeepSeek code bases is to keep the gamma parameter in bfloat16 but compute the RMS in float32.
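A common pattern, shown here as a sketch of the idea rather than any particular code base's implementation, is to upcast to float32 for the statistics and cast back to the working precision afterward:

```python
import torch
import torch.nn as nn

class MixedPrecisionRMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(dim))    # kept in the model's working dtype
        self.eps = eps

    def forward(self, x):
        in_dtype = x.dtype
        xf = x.float()                                # compute the RMS in float32
        rms = torch.sqrt(xf.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.gamma * (xf / rms).to(in_dtype)   # cast back before applying the scale
```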
The epsilon parameter in the denominator prevents division by zero when all activations are identical (giving zero variance). Different implementations use different default values. PyTorch's nn.LayerNorm defaults to 1e-5, while many LLM implementations use 1e-6 or 1e-8 for additional precision. T5 famously uses RMSNorm with eps = 1e-6 and no bias, a choice copied by many later models.
The epsilon also affects gradient stability. Too large a value introduces a noticeable bias into the normalization at small magnitudes; too small a value can produce very large gradients when activations happen to be small. 1e-5 is a reasonable default; 1e-6 is appropriate when training in bfloat16.
In practice, the separate operations of layer normalization (mean computation, variance computation, normalization, scale, shift) are fused into a single GPU kernel to avoid multiple passes over memory. NVIDIA's Apex library provides apex.normalization.FusedLayerNorm and FusedRMSNorm implementations that are several times faster than naive PyTorch implementations. The Triton language has been used to write efficient custom normalization kernels, and the FlashAttention repository ships fused layer-normalization kernels that combine the normalization with adjacent operations such as the residual addition and dropout [11].
For RMSNorm, the simpler computation makes fused kernels even more attractive. Many modern training frameworks ship with a hand-tuned fused RMSNorm that achieves close to the memory-bandwidth limit on the underlying GPU.
Layer normalization and its variants are available in every major deep learning framework.
| framework | layer norm API | RMSNorm API |
|---|---|---|
| PyTorch | torch.nn.LayerNorm(normalized_shape) | torch.nn.RMSNorm(normalized_shape) (PyTorch 2.4+) |
| TensorFlow / Keras | tf.keras.layers.LayerNormalization(axis=-1) | community modules; not built in as of TF 2.16 |
| JAX / Flax | flax.linen.LayerNorm() | flax.linen.RMSNorm() |
| JAX / Haiku | hk.LayerNorm(axis=-1) | community modules |
| MXNet | mxnet.gluon.nn.LayerNorm | community modules |
| MATLAB | layerNormalizationLayer | not built in |
Prior to the addition of nn.RMSNorm in PyTorch 2.4, RMSNorm was typically implemented in user code as a roughly four-line module. The official PyTorch implementation matches the original Zhang and Sennrich formulation and is accelerated by the same fused kernels as nn.LayerNorm.
A standard layer normalization module looks like this in PyTorch:
```python
import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(dim))   # learned per-feature scale
        self.beta = nn.Parameter(torch.zeros(dim))   # learned per-feature shift
        self.eps = eps

    def forward(self, x):
        mu = x.mean(dim=-1, keepdim=True)                  # per-sample mean over the feature axis
        var = x.var(dim=-1, unbiased=False, keepdim=True)  # biased variance, matching the formula
        x_hat = (x - mu) / torch.sqrt(var + self.eps)      # standardize
        return self.gamma * x_hat + self.beta              # scale and shift
```
And RMSNorm:
```python
class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(dim))   # learned per-feature scale; no bias term
        self.eps = eps

    def forward(self, x):
        # Root mean square over the feature axis; eps keeps the sqrt away from zero.
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.gamma * x / rms
```
Production code uses the framework-provided modules so that fused kernels and gradient checkpointing work transparently.
Several other normalization techniques have been explored as alternatives or supplements to layer normalization. Each draws its statistics from a different slice of the activation tensor.
Batch normalization, the predecessor of layer normalization, computes statistics across the batch dimension for each feature. It remains the dominant choice for image classification CNNs trained with reasonably large batches (32 or more).
Instance normalization, proposed by Ulyanov, Vedaldi, and Lempitsky in 2016, normalizes across the spatial dimensions (height and width) of each feature map for each sample individually [8]. It was developed for fast neural style transfer, where it dramatically improved the quality of generated images by removing the influence of the original image's contrast. Instance normalization is rarely used in transformers but remains popular in image generation networks for style transfer and image-to-image translation.
Group normalization, proposed by Yuxin Wu and Kaiming He in 2018, divides channels into groups and normalizes within each group [7]. It can be viewed as a generalization that interpolates between layer normalization (one group containing all channels) and instance normalization (each channel in its own group). On ResNet-50 with batch size 2 on ImageNet, group normalization produces 10.6% lower error than batch normalization, making it the standard choice for vision tasks like object detection and instance segmentation that must use small batches because of memory constraints.
Switchable normalization, proposed by Luo et al. in 2018, learns a weighted combination of batch, layer, and instance normalization at each layer. It rarely outperforms a well-chosen single normalization in practice, but it has been useful for understanding which normalization works best for which task by examining the learned weights.
Weight normalization (Salimans and Kingma, 2016) is not a normalization layer in the same sense: it normalizes parameters rather than activations, decomposing each weight vector into a unit direction and a scalar magnitude. Scale normalization, or ScaleNorm (Nguyen and Salazar, 2019), is closer in spirit to RMSNorm: it rescales each activation vector to a single learned length by dividing by its L2 norm. Both techniques have niche uses but did not displace layer normalization in transformers.
For a 4D activation tensor of shape (N, C, H, W), the following table summarizes the axes over which each method computes statistics, where N is the batch dimension, C the channel dimension, and H times W the spatial dimensions.
| method | normalizes across | batch dependent | learned parameters | primary domain | reference |
|---|---|---|---|---|---|
| batch normalization | N, H, W (per channel) | yes | gamma, beta | CNNs | Ioffe and Szegedy 2015 [2] |
| layer normalization | C, H, W (per sample) | no | gamma, beta | transformers, RNNs | Ba, Kiros, Hinton 2016 [1] |
| instance normalization | H, W (per sample, per channel) | no | gamma, beta | style transfer | Ulyanov et al. 2016 [8] |
| group normalization | C/G group, H, W | no | gamma, beta | vision (small batch) | Wu and He 2018 [7] |
| RMSNorm | C, H, W (per sample, no mean) | no | gamma only | modern LLMs | Zhang and Sennrich 2019 [5] |
| switchable norm | weighted mix of BN, LN, IN | partial | gamma, beta + weights | exploratory | Luo et al. 2018 [9] |
| weight normalization | per weight vector | no | scale per neuron | various | Salimans and Kingma 2016 |
For a 3D transformer activation tensor of shape (B, n, d_model), batch normalization would normalize across (B, n), layer normalization and RMSNorm across (d_model,), and instance/group normalization are not commonly applied in this setting.
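The axis differences can be made concrete by computing the statistics manually on a 4D tensor (an illustrative sketch; G is an arbitrary group count):

```python
import torch

N, C, H, W = 8, 64, 32, 32
G = 8                                    # number of groups for group normalization
x = torch.randn(N, C, H, W)

bn_mean = x.mean(dim=(0, 2, 3))          # batch norm: one value per channel -> (C,)
ln_mean = x.mean(dim=(1, 2, 3))          # layer norm: one value per sample -> (N,)
in_mean = x.mean(dim=(2, 3))             # instance norm: per sample, per channel -> (N, C)
gn_mean = x.view(N, G, C // G, H, W).mean(dim=(2, 3, 4))  # group norm: per sample, per group -> (N, G)

print(bn_mean.shape, ln_mean.shape, in_mean.shape, gn_mean.shape)
```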
The choice of normalization depends on the architecture, the data, and the training regime. The table below summarizes the practical defaults that have emerged over the past decade.
| use case | recommended normalization | reason |
|---|---|---|
| CNN image classification, batch >= 32 | batch normalization | best accuracy, mature tooling |
| CNN with small batch (< 8) | group normalization | accuracy independent of batch |
| object detection, segmentation | group normalization | small effective batch per device |
| style transfer, image-to-image | instance normalization | removes per-image contrast |
| RNN, LSTM, GRU | layer normalization | per-time-step stability |
| transformer encoder/decoder (small) | layer normalization, pre-norm | mature default |
| transformer LLM (large, modern) | RMSNorm, pre-norm | speed and quality at scale |
| extremely deep transformer (>200 layers) | DeepNorm | bounded updates at depth |
| autoregressive generation | layer normalization or RMSNorm | batch independence essential |
The default for a new transformer-based language model is unambiguously pre-norm with RMSNorm. The default for a new convolutional vision model is still batch normalization unless memory limits force a small batch, in which case group normalization is the standard fallback.
Layer normalization interacts with several other components of the transformer architecture to enable stable training of very deep models.
Residual connections provide a direct path for gradients to flow backward through the network. Layer normalization ensures that the activations at each layer remain in a well-conditioned range, preventing the accumulation of scale changes that could otherwise cause gradients to explode or vanish. Together, these two mechanisms allow transformers to be trained with dozens of layers (12 for BERT-base, 96 for GPT-3) without the severe optimization difficulties that plague unnormalized deep networks.
Weight initialization strategies must account for the presence of layer normalization. The common practice of using Xavier or Kaiming initialization for linear layers in transformers is predicated on the assumption that layer normalization will keep activations roughly standardized. Some architectures, such as GPT-2, include an additional scaling factor at initialization that divides residual layer weights by sqrt(2 * num_layers), further ensuring that the residual stream does not grow in magnitude through the network.
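A sketch of that residual scaling follows, using the commonly cited base standard deviation of 0.02 and counting num_layers as the number of transformer blocks (both assumptions, stated here for illustration):

```python
import math
import torch.nn as nn

def init_residual_projection(linear: nn.Linear, num_layers: int, std: float = 0.02):
    # Scale the output projections of attention and MLP sublayers so that the
    # variance of the residual stream does not grow with depth.
    nn.init.normal_(linear.weight, mean=0.0, std=std / math.sqrt(2 * num_layers))
    if linear.bias is not None:
        nn.init.zeros_(linear.bias)
```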
The Adam optimizer and its variants such as AdamW interact favorably with layer normalization because the per-parameter adaptive learning rates can adjust to the normalized scale of activations. This combination of layer normalization and adaptive optimization has proven robust across a wide range of model sizes and training configurations. With plain SGD, layer normalization still works but typically requires more careful tuning of the learning rate.
There is also a subtle interaction between layer normalization and gradient clipping. Because RMSNorm bounds the per-sample activation magnitude but not the per-feature one, models using RMSNorm sometimes show occasional spikes in particular features that gradient clipping needs to absorb. Modern training stacks set the gradient clipping norm to around 1.0 by default, a value that handles these spikes while not interfering with normal gradient updates.
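In PyTorch, clipping the global gradient norm is a single call placed between the backward pass and the optimizer step, as in this toy example:

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 16)                          # stand-in for a transformer
loss = model(torch.randn(4, 16)).pow(2).mean()
loss.backward()

# Clip the global gradient norm to 1.0 before the optimizer step.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```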
Layer normalization is not a free operation. Several costs and shortcomings are worth noting.
It adds learned parameters: gamma and beta (or just gamma for RMSNorm). The total parameter count is small relative to the rest of the model, but in resource-constrained settings every parameter counts. RMSNorm's removal of beta saves d_model parameters per layer, which adds up across hundreds of layers.
It computes statistics per sample, which introduces a sequential dependency along the feature axis. The mean and variance must be computed before the normalized values can be produced. On highly parallel hardware, this serial reduction is usually fast, but on extremely wide layers (d_model in the tens of thousands) it can become noticeable.
It does not regularize as strongly as batch normalization in CNNs. Batch normalization injects noise from batch statistics during training, which acts as an implicit form of regularization. Layer normalization is deterministic given a single sample, so it does not provide this regularization signal. CNN architectures using layer normalization typically need additional regularization techniques such as dropout or stronger data augmentation to match batch-normalized performance.
RMSNorm is slightly faster than full layer normalization but loses the subtractive normalization. In rare cases, this matters. Models with strong feature-mean drift can benefit from full layer normalization; in mainstream language modeling this drift is small and RMSNorm is fine.
Finally, normalization can interact badly with quantization. Per-sample statistics computed in float32 and then applied to int8 activations require careful handling to preserve accuracy. Most LLM quantization pipelines either keep normalization in higher precision or fold the gamma parameter into the surrounding linear layers to avoid the issue.
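The folding trick mentioned above relies on the fact that a linear layer applied to gamma * x_hat can absorb gamma into its weight columns. A small sketch (assuming RMSNorm followed by a bias-free linear layer, with illustrative shapes) shows the equivalence:

```python
import torch
import torch.nn as nn

d_in, d_out = 768, 3072
gamma = torch.randn(d_in).abs() + 0.5              # stand-in for a trained RMSNorm scale
linear = nn.Linear(d_in, d_out, bias=False)

# Fold gamma into the weight: W_folded[o, i] = W[o, i] * gamma[i]
folded = nn.Linear(d_in, d_out, bias=False)
with torch.no_grad():
    folded.weight.copy_(linear.weight * gamma)

x_hat = torch.randn(4, d_in)                       # pretend this is the normalized activation
out_original = linear(gamma * x_hat)
out_folded = folded(x_hat)
print(torch.allclose(out_original, out_folded, atol=1e-5))   # True
```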
The Ba, Kiros, and Hinton (2016) paper on layer normalization has become one of the most cited papers in deep learning, with tens of thousands of citations. While it was originally motivated by the challenges of normalizing recurrent neural networks, its greatest impact has been in the transformer era. Every major transformer model, from the original transformer to GPT-4 and beyond, uses some form of layer normalization.
The progression from standard layer normalization to RMSNorm reflects a broader trend in deep learning research. As models scale to hundreds of billions of parameters, even modest computational savings in frequently executed operations accumulate to meaningful reductions in training cost and time. RMSNorm's removal of the mean centering step, while seemingly minor, translates to tangible efficiency gains at the scale of modern LLM training.
The pre-norm versus post-norm question, settled in practice in favor of pre-norm by the GPT-2 architecture and the Xiong et al. analysis, is one of the cleanest examples of a small architectural choice having outsized practical consequences. Models that fail to adopt pre-norm at scale are notoriously hard to train and often fail silently, producing worse final loss without any obvious signal that the normalization placement is to blame. The convergence of nearly all modern LLMs on pre-norm with RMSNorm represents a hard-won consensus from the past five years of large-scale training experience.