Layer normalization is a technique for normalizing the activations of a neural network that operates across the feature dimension of each individual sample, rather than across a batch of samples. Proposed by Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton in 2016, layer normalization was originally developed to address limitations of batch normalization in recurrent neural networks [1]. It has since become the standard normalization method in transformer architectures and large language models, where its independence from batch statistics and compatibility with variable-length sequences make it far more practical than batch-based alternatives. Modern variants such as RMSNorm have refined the original formulation by stripping out operations that proved unnecessary at scale, and the question of where to place normalization within a transformer block (pre-norm versus post-norm) has become one of the most consequential architectural decisions in modern deep learning.
Normalization techniques are essential for training deep neural networks. Without normalization, the distribution of activations at each layer shifts during training as the parameters of preceding layers change, a phenomenon known as internal covariate shift. This instability slows training and can prevent convergence altogether in very deep networks. Activations can also drift toward extreme magnitudes, triggering the vanishing gradient problem or exploding gradients depending on the direction of the drift.
Batch normalization, introduced by Sergey Ioffe and Christian Szegedy in 2015, addressed this problem by normalizing activations across the batch dimension. For each feature, it computes the mean and variance over all samples in the current mini-batch and uses these statistics to standardize the activations [2]. Batch normalization proved highly effective for convolutional neural networks and quickly became a standard component of architectures like ResNet, GoogLeNet, and DenseNet.
However, batch normalization has several practical limitations:

- Its statistics depend on the batch size, and quality degrades sharply when batches are small.
- It behaves differently at training and inference time, relying on running averages that can become stale.
- It is awkward for variable-length sequences, where padding positions contaminate the per-feature statistics.
- It is ill-defined at batch size 1, as arises in autoregressive generation.
- In distributed training, computing exact batch statistics requires synchronization across devices.
These limitations motivated Ba, Kiros, and Hinton to develop layer normalization as an alternative that normalizes within each sample independently. Their original 2016 paper demonstrated that layer normalization stabilized training of long short-term memory networks on tasks where batch normalization had been impractical, including handwriting sequence generation, image-sentence ranking, and question answering on the bAbI dataset [1].
Given an input vector x of dimension H (representing the activations of a single sample at a single layer), layer normalization computes:
mu = (1/H) * sum(x_i) for i = 1 to H
sigma^2 = (1/H) * sum((x_i - mu)^2) for i = 1 to H
y_i = gamma_i * (x_i - mu) / sqrt(sigma^2 + epsilon) + beta_i
where:

- mu is the mean of the activations across the H features,
- sigma^2 is their variance,
- gamma and beta are learned per-feature scale and shift parameters, each of dimension H,
- epsilon is a small constant added for numerical stability, and
- y_i is the normalized output for feature i.
The critical difference from batch normalization is the axis of normalization. Layer normalization computes statistics across features (the hidden dimension) for each sample independently. Batch normalization computes statistics across samples (the batch dimension) for each feature independently. The two operations look almost identical on paper but differ on a single axis, and that single difference rewires nearly everything about how the operation behaves in practice.
For a transformer with hidden size d_model = 768 processing a batch of B sequences each of length n, the activation tensor has shape (B, n, 768). Layer normalization computes a separate mean and variance for each of the B times n token positions, normalizing across the 768-dimensional feature vector at each position. Batch normalization would instead compute a mean and variance for each of the 768 features, averaging across all B times n token positions in the batch.
In PyTorch, calling nn.LayerNorm(768) on this tensor produces 2 statistics (mu and sigma squared) per token, for a total of 2 times B times n statistics. The learned gamma and beta parameters are shared across positions and samples, with shape (768,). The number of learned parameters is therefore tiny relative to the rest of the model, typically a few thousand parameters per layer normalization module.
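As a concrete illustration (shapes chosen arbitrarily), the following sketch applies nn.LayerNorm to a transformer-shaped activation tensor and checks the parameter shapes and per-position statistics:

```python
import torch
import torch.nn as nn

B, n, d_model = 4, 128, 768                  # arbitrary batch size, sequence length, hidden size
x = torch.randn(B, n, d_model)

ln = nn.LayerNorm(d_model)                   # normalizes over the last (feature) dimension
y = ln(x)

print(y.shape)                               # torch.Size([4, 128, 768])
print(ln.weight.shape, ln.bias.shape)        # gamma and beta: torch.Size([768]) each

# Each of the B*n token positions is standardized independently:
# mean ~ 0 and variance ~ 1 across the 768 features at every position.
print(y.mean(dim=-1).abs().max())            # close to 0
print(y.var(dim=-1, unbiased=False).mean())  # close to 1
```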
Layer normalization is invariant to per-sample shifts and scalings of the input. If we replace x with a constant c plus x, the mean shifts by c and the centered values are unchanged, so the output is the same up to the learned gamma and beta. If we replace x with a positive scalar a times x, the variance scales by a squared and the standard deviation by a, so the normalized values are again unchanged. This re-centering and re-scaling invariance is what gives layer normalization its regularizing effect: the network does not need to learn how to compensate for the absolute magnitude of incoming activations, only their relative pattern across the feature axis.
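A quick way to see this invariance empirically (a toy check, not taken from the original paper) is to shift and scale an input and compare the outputs:

```python
import torch
import torch.nn as nn

ln = nn.LayerNorm(16)
x = torch.randn(2, 16)

shifted = x + 3.0          # per-sample shift by a constant c
scaled = 5.0 * x           # per-sample scaling by a positive scalar a

# Both comparisons hold up to the small epsilon term in the denominator.
print(torch.allclose(ln(x), ln(shifted), atol=1e-4))  # True: the shift is absorbed by the mean
print(torch.allclose(ln(x), ln(scaled), atol=1e-4))   # True: the scale is absorbed by the std
```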
The following table summarizes the key differences between layer normalization and batch normalization.
| property | layer normalization | batch normalization |
|---|---|---|
| normalization axis | feature dimension (per sample) | batch dimension (per feature) |
| depends on batch size | no | yes |
| behavior at inference | identical to training | uses running averages |
| works with variable-length sequences | yes | problematic |
| works with batch size 1 | yes | no (undefined statistics) |
| learned parameters | gamma, beta per feature | gamma, beta per feature |
| original domain | RNNs | CNNs |
| used in transformers | yes (standard) | rarely |
| train/test discrepancy | none | yes |
| helps regularization | weakly | yes (noise from batch stats) |
| sequential dependency along batch | none | yes |
In convolutional networks for image classification, batch normalization typically gives a small but measurable accuracy edge over layer normalization, because the noise injected by batch statistics acts as a regularization signal. In sequence models and transformers, the practical disadvantages of batch normalization (variable sequence lengths, autoregressive inference, distributed training overhead) outweigh that small advantage and layer normalization wins decisively.
Virtually all transformer architectures use layer normalization rather than batch normalization. Several properties of layer normalization align well with the requirements of transformer training.
In natural language processing, sequences within a batch typically have different lengths and are padded to a common length. Batch normalization would compute statistics that mix meaningful token positions with padding tokens, distorting the normalization. Layer normalization sidesteps this entirely by normalizing each token position independently. Padding tokens still receive a normalization, but they no longer pollute the statistics of real tokens.
Layer normalization produces identical outputs for a given input regardless of what other samples are in the batch. This property is critical for autoregressive generation, where the model processes one token at a time (effectively batch size 1 for the new token). It also simplifies distributed training, since there is no need to synchronize batch statistics across devices, and it removes a class of training-versus-inference bugs caused by stale running averages.
Transformers process sequences where each position has a d_model-dimensional representation. Layer normalization treats each position as an independent sample to normalize, which fits the position-wise structure of transformer computation. The same module can be applied to a tensor of shape (B, n, d_model), a tensor of shape (B, 1, d_model) during decoding, or a tensor of shape (B, d_model) for a classification head, without any reshaping or changes in behavior.
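For example, a single module handles all three shapes without modification (a small illustrative sketch):

```python
import torch
import torch.nn as nn

d_model = 512
ln = nn.LayerNorm(d_model)

full_seq = torch.randn(8, 100, d_model)   # (B, n, d_model) during training
one_step = torch.randn(8, 1, d_model)     # (B, 1, d_model) during autoregressive decoding
pooled = torch.randn(8, d_model)          # (B, d_model) for a classification head

# The same parameters and the same per-position normalization apply in every case.
for t in (full_seq, one_step, pooled):
    print(ln(t).shape)
```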
Empirically, layer normalization contributes to stable training of deep transformer models, particularly in conjunction with residual connections. The combination of residual connections and layer normalization allows gradients to flow effectively through networks with dozens or hundreds of layers. Models such as GPT-3, with 96 transformer blocks, would be very difficult to train without normalization keeping the residual stream bounded in magnitude.
The original transformer architecture by Vaswani et al. (2017) placed layer normalization after each residual sublayer, a configuration now called Post-Norm or Post-LN. The computation for each sublayer follows this pattern [3]:
Post-Norm: y = LayerNorm(x + Sublayer(x))
In this arrangement, the output of the sublayer (attention or feed-forward network) is added to the residual, and then layer normalization is applied to the sum.
An alternative placement, called Pre-Norm or Pre-LN, applies layer normalization before the sublayer:
Pre-Norm: y = x + Sublayer(LayerNorm(x))
Here, the input is first normalized, then passed through the sublayer, and the sublayer output is added to the original (unnormalized) input via the residual connection. The residual stream therefore carries unnormalized values from layer to layer, with each block reading a normalized snapshot of that stream and writing back an additive update.
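The two placements can be written side by side as a minimal sketch (the sublayer here is a stand-in for attention or the feed-forward network; class and variable names are illustrative):

```python
import torch
import torch.nn as nn

class PostNormBlock(nn.Module):
    def __init__(self, dim, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        # Post-Norm: normalize the sum of the residual and the sublayer output
        return self.norm(x + self.sublayer(x))

class PreNormBlock(nn.Module):
    def __init__(self, dim, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        # Pre-Norm: the residual stream itself stays unnormalized; the sublayer
        # reads a normalized snapshot and writes back an additive update
        return x + self.sublayer(self.norm(x))

# Example with a feed-forward sublayer
ffn = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
block = PreNormBlock(512, ffn)
y = block(torch.randn(2, 10, 512))
```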
Xiong et al. (2020) provided a theoretical and empirical analysis showing that Pre-Norm transformers have significantly better-behaved gradients at initialization compared to Post-Norm transformers [4]. In the Post-Norm configuration, the expected gradients of parameters near the output layer can be very large at initialization, making training unstable without a careful learning rate warmup schedule. Pre-Norm transformers do not suffer from this issue and can often be trained without any warmup.
The paper demonstrated that the Post-Norm transformer's need for learning rate warmup is not just a practical trick but a mathematical necessity given the gradient magnitudes at initialization. Pre-Norm placement resolves this by ensuring that the input to each sublayer is well-conditioned [4]. In experiments, Pre-Norm transformers without warmup matched the final quality of Post-Norm transformers with carefully tuned warmup, while training significantly faster and being far more forgiving of hyperparameter choices.
Pre-norm is not free of drawbacks. Because each block writes additively into an unnormalized residual stream, the magnitudes of activations in deep pre-norm networks tend to grow with depth. Liu et al. (2020) showed that this growth can hurt the effective expressive depth of pre-norm transformers, since later layers see inputs dominated by the cumulative residual rather than by the latest representation [12]. Some researchers report that post-norm models, when they can be trained successfully, achieve marginally better final loss than pre-norm models of the same size. The dominant choice in modern large-scale training is still pre-norm because the optimization stability gains far outweigh that small quality difference.
Several hybrid placements have been explored. Sandwich-LN applies layer normalization both before and after the sublayer, capturing some of the regularization of post-norm while keeping the optimization friendliness of pre-norm. Peri-LN, proposed in 2025 by Park et al., is a related variant that has shown improvements on certain tasks [4a]. These hybrids see occasional use, but the field has largely converged on plain pre-norm with RMSNorm for the largest models.
GPT-2 (Radford et al., 2019) was one of the first prominent models to adopt Pre-Norm, and this placement has since become the default for nearly all large language models. GPT-3, LLaMA 1, 2, and 3, Mistral, Mixtral, PaLM, Gemma, Qwen, and DeepSeek all use Pre-Norm. The original transformer and the BERT family (including RoBERTa, which reused BERT's architecture) used Post-Norm, though many later encoder models have moved to Pre-Norm or hybrid placements.
The following table compares the two configurations.
| property | pre-norm | post-norm |
|---|---|---|
| layer norm placement | before sublayer | after sublayer + residual |
| training stability | high | requires warmup |
| learning rate warmup | often unnecessary | critical |
| final model quality (small models) | slightly lower | slightly higher |
| gradient behavior at init | well-behaved | can be very large near output |
| residual magnitude growth | tends to grow with depth | bounded by normalization |
| used in | GPT-2/3, LLaMA, Mistral, PaLM, Gemma | original transformer, BERT |
Root Mean Square Layer Normalization, called RMSNorm, was proposed by Biao Zhang and Rico Sennrich in 2019. It is a simplified variant of layer normalization that has become the dominant normalization method in modern large language models [5].
Standard layer normalization performs two operations: re-centering (subtracting the mean) and re-scaling (dividing by the standard deviation). Zhang and Sennrich hypothesized that the re-centering step is not essential and that the re-scaling alone provides the key regularization benefit. Removing mean subtraction reduces computational overhead while maintaining comparable model quality. The intuition is that re-scaling invariance is what tames the optimization landscape, and that re-centering invariance, while elegant, mostly costs cycles without paying for itself in model quality [5].
Given an input vector x of dimension H, RMSNorm computes:
RMS(x) = sqrt((1/H) * sum(x_i^2) for i = 1 to H)
y_i = gamma_i * x_i / RMS(x)
Note that RMSNorm does not subtract the mean and does not include a learned bias term beta. It only has the learned scale parameter gamma. The computation is simpler than standard layer normalization in four ways:

- no mean needs to be computed,
- no subtraction of the mean from each activation,
- no learned bias term beta to store or apply, and
- only a single reduction over the feature dimension (the sum of squares) is required.
A partial variant called pRMSNorm estimates the RMS from only p% of the inputs, trading a small amount of accuracy for additional speed. This variant is rarely used in practice because the full RMSNorm is already cheap.
Zhang and Sennrich reported that RMSNorm reduces the running time of the normalization step by 7% to 64% compared to standard layer normalization across different model architectures, while achieving comparable performance on machine translation, text summarization, and image classification tasks [5]. The exact speedup depends on the hardware and the relative cost of normalization compared to attention and feed-forward computation. On modern GPUs, where memory bandwidth often dominates, the benefit of skipping the mean computation is most visible during inference and during the backward pass.
RMSNorm has been adopted by most leading open-weight large language models. The following table tracks normalization choices across major architectures.
| model | normalization | year | reference |
|---|---|---|---|
| original transformer | post-LN, standard LayerNorm | 2017 | Vaswani et al. [3] |
| GPT-2 | pre-LN, standard LayerNorm | 2019 | Radford et al. |
| GPT-3 | pre-LN, standard LayerNorm | 2020 | Brown et al. |
| BERT | post-LN, standard LayerNorm | 2018 | Devlin et al. |
| T5 | pre-LN, RMSNorm (no bias) | 2019 | Raffel et al. |
| LLaMA 1 / 2 / 3 | pre-RMSNorm | 2023, 2024 | Touvron et al. [6] |
| Mistral / Mixtral | pre-RMSNorm | 2023 | Jiang et al. |
| PaLM / PaLM 2 | pre-RMSNorm | 2022 | Chowdhery et al. |
| Qwen / Qwen 2 / Qwen 3 | pre-RMSNorm | 2023, 2024 | Bai et al. |
| Gemma / Gemma 2 | pre-RMSNorm | 2024 | Google DeepMind |
| DeepSeek-V2 / V3 | pre-RMSNorm | 2024 | DeepSeek-AI |
| Falcon | pre-LN, standard LayerNorm | 2023 | TII |
| Phi-2 / Phi-3 | pre-LN, standard LayerNorm | 2023, 2024 | Microsoft |
The convergence toward RMSNorm in pre-norm position reflects an empirical finding: this combination provides the best tradeoff between training stability, computational efficiency, and model quality for large-scale language model training.
Wang et al. (2022) introduced DeepNorm, a normalization scheme designed for transformers with hundreds or thousands of layers [10]. The DeepNet paper successfully trained a transformer with 1,000 layers (2,500 attention and feed-forward sublayers), an order of magnitude deeper than previous deep transformers.
DeepNorm modifies post-norm by scaling the residual branch with a depth-dependent constant alpha and the attention/FFN parameters at initialization with a constant beta:
DeepNorm: y = LayerNorm(alpha * x + Sublayer(x))
The constants alpha and beta are derived theoretically based on the model depth and architecture (encoder-only, decoder-only, or encoder-decoder). For an encoder-decoder transformer with N encoder layers and M decoder layers, alpha is roughly proportional to the fourth root of the layer count.
The core insight is that the model update at each step should remain bounded as depth grows. Plain post-norm fails this criterion, leading to gradient explosions. Plain pre-norm satisfies it but suffers from the residual magnitude growth problem mentioned earlier. DeepNorm threads the needle by combining the better gradient behavior of post-norm with a careful initialization that bounds updates.
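A minimal sketch of the DeepNorm residual follows; alpha is left as a constructor argument because the depth-dependent formulas for alpha and beta in the paper vary by architecture and are not reproduced here:

```python
import torch
import torch.nn as nn

class DeepNormBlock(nn.Module):
    """Post-norm residual with a depth-dependent residual scale alpha [10].

    alpha should be set from the model depth following the DeepNet paper;
    the sublayer's weights are additionally scaled by a constant beta at
    initialization (not shown here).
    """
    def __init__(self, dim, sublayer, alpha):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(dim)
        self.alpha = alpha

    def forward(self, x):
        # DeepNorm: y = LayerNorm(alpha * x + Sublayer(x))
        return self.norm(self.alpha * x + self.sublayer(x))
```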
On a multilingual machine translation benchmark covering 7,482 translation directions, the 200-layer 3.2-billion-parameter DeepNet model significantly outperformed shallower baselines, demonstrating that depth scaling can pay off when training stability is solved [10].
DeepNorm has not displaced pre-norm with RMSNorm in mainstream language models, partly because most large models are wide and moderately deep (around 32 to 96 layers) rather than extremely deep, and partly because the alpha and beta constants tie the architecture to a specific depth. For depth experiments and certain specialized use cases, however, DeepNorm remains an important technique.
Layer normalization is one of the most numerically sensitive operations in a transformer because it computes per-sample statistics that can have a wide dynamic range.
When training in mixed precision using float16 or bfloat16, the variance computation can lose significant precision. The squared deviations can underflow to zero when the activations are small, or overflow when the activations are large. Most production LLM training stacks compute layer normalization statistics in float32 even when the rest of the network runs in lower precision. PyTorch, JAX, and TensorFlow all upcast the normalization step automatically in their built-in modules, but custom kernels need to handle this explicitly.
bfloat16 is more forgiving than float16 because of its larger exponent range, but it has fewer mantissa bits. The result is that bfloat16 can represent the magnitudes that occur in practice but loses precision in the actual normalization division, sometimes producing visible artifacts in long training runs. The standard practice in the LLaMA, Gemma, and DeepSeek code bases is to keep the gamma parameter in bfloat16 but compute the RMS in float32.
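A common pattern, shown here as a sketch of the idea rather than any particular code base's implementation, is to upcast to float32 for the statistics and cast back to the working precision afterward:

```python
import torch
import torch.nn as nn

class MixedPrecisionRMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(dim))    # kept in the model's working dtype
        self.eps = eps

    def forward(self, x):
        in_dtype = x.dtype
        xf = x.float()                                # compute the RMS in float32
        rms = torch.sqrt(xf.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.gamma * (xf / rms).to(in_dtype)   # cast back before applying the scale
```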
The epsilon parameter in the denominator prevents division by zero when all activations are identical (giving zero variance). Different implementations use different default values. PyTorch's nn.LayerNorm defaults to 1e-5, while many LLM implementations use 1e-6 or 1e-8 for additional precision. T5 famously uses RMSNorm with eps = 1e-6 and no bias, a choice copied by many later models.
The epsilon also affects gradient stability. Too large a value introduces a noticeable bias into the normalization at small magnitudes; too small a value can produce very large gradients when activations happen to be small. 1e-5 is a reasonable default; 1e-6 is appropriate when training in bfloat16.
In practice, the separate operations of layer normalization (mean computation, variance computation, normalization, scale, shift) are fused into a single GPU kernel to avoid multiple passes over memory. NVIDIA's Apex library provides apex.normalization.FusedLayerNorm and FusedRMSNorm implementations that are several times faster than naive PyTorch implementations. The Triton language has been used to write efficient custom normalization kernels, and the FlashAttention repository ships fused layer-normalization kernels that combine the normalization with adjacent operations such as the residual addition and dropout [11].
For RMSNorm, the simpler computation makes fused kernels even more attractive. Many modern training frameworks ship with a hand-tuned fused RMSNorm that achieves close to the memory-bandwidth limit on the underlying GPU.
Layer normalization and its variants are available in every major deep learning framework.
| framework | layer norm API | RMSNorm API |
|---|---|---|
| PyTorch | torch.nn.LayerNorm(normalized_shape) | torch.nn.RMSNorm(normalized_shape) (PyTorch 2.4+) |
| TensorFlow / Keras | tf.keras.layers.LayerNormalization(axis=-1) | community modules; not built in as of TF 2.16 |
| JAX / Flax | flax.linen.LayerNorm() | flax.linen.RMSNorm() |
| JAX / Haiku | hk.LayerNorm(axis=-1) | community modules |
| MXNet | mxnet.gluon.nn.LayerNorm | community modules |
| MATLAB | layerNormalizationLayer | not built in |
Prior to the addition of nn.RMSNorm in PyTorch 2.4, RMSNorm was typically implemented in user code as a roughly four-line module. The official PyTorch implementation matches the original Zhang and Sennrich formulation and is accelerated by the same fused kernels as nn.LayerNorm.
A standard layer normalization module looks like this in PyTorch:
```python
import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(dim))   # learned per-feature scale
        self.beta = nn.Parameter(torch.zeros(dim))   # learned per-feature shift
        self.eps = eps

    def forward(self, x):
        mu = x.mean(dim=-1, keepdim=True)                  # per-sample mean over the feature axis
        var = x.var(dim=-1, unbiased=False, keepdim=True)  # biased variance, matching the formula
        x_hat = (x - mu) / torch.sqrt(var + self.eps)      # standardize
        return self.gamma * x_hat + self.beta              # scale and shift
```
And RMSNorm:
```python
class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(dim))   # learned per-feature scale; no bias term
        self.eps = eps

    def forward(self, x):
        # Root mean square over the feature axis; eps keeps the sqrt away from zero.
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.gamma * x / rms
```
Production code uses the framework-provided modules so that fused kernels and gradient checkpointing work transparently.
Several other normalization techniques have been explored as alternatives or supplements to layer normalization. Each draws its statistics from a different slice of the activation tensor.
Batch normalization, the predecessor of layer normalization, computes statistics across the batch dimension for each feature. It remains the dominant choice for image classification CNNs trained with reasonably large batches (32 or more).
Instance normalization, proposed by Ulyanov, Vedaldi, and Lempitsky in 2016, normalizes across the spatial dimensions (height and width) of each feature map for each sample individually [8]. It was developed for fast neural style transfer, where it dramatically improved the quality of generated images by removing the influence of the original image's contrast. Instance normalization is rarely used in transformers but remains popular in image generation networks for style transfer and image-to-image translation.
Group normalization, proposed by Yuxin Wu and Kaiming He in 2018, divides channels into groups and normalizes within each group [7]. It can be viewed as a generalization that interpolates between layer normalization (one group containing all channels) and instance normalization (each channel in its own group). On ResNet-50 with batch size 2 on ImageNet, group normalization produces 10.6% lower error than batch normalization, making it the standard choice for vision tasks like object detection and instance segmentation that must use small batches because of memory constraints.
Switchable normalization, proposed by Luo et al. in 2018, learns a weighted combination of batch, layer, and instance normalization at each layer. It rarely outperforms a well-chosen single normalization in practice, but it has been useful for understanding which normalization works best for which task by examining the learned weights.
Weight normalization (Salimans and Kingma, 2016) is not a normalization layer in the same sense: it normalizes parameters rather than activations, decomposing each weight vector into a unit direction and a scalar magnitude. Scale normalization, or ScaleNorm (Nguyen and Salazar, 2019), is closer in spirit to RMSNorm: it rescales each activation vector to a single learned length by dividing by its L2 norm. Both techniques have niche uses but did not displace layer normalization in transformers.
For a 4D activation tensor of shape (N, C, H, W), the following table summarizes the axes over which each method computes statistics, where N is the batch dimension, C the channel dimension, and H times W the spatial dimensions.
| method | normalizes across | batch dependent | learned parameters | primary domain | reference |
|---|---|---|---|---|---|
| batch normalization | N, H, W (per channel) | yes | gamma, beta | CNNs | Ioffe and Szegedy 2015 [2] |
| layer normalization | C, H, W (per sample) | no | gamma, beta | transformers, RNNs | Ba, Kiros, Hinton 2016 [1] |
| instance normalization | H, W (per sample, per channel) | no | gamma, beta | style transfer | Ulyanov et al. 2016 [8] |
| group normalization | C/G group, H, W | no | gamma, beta | vision (small batch) | Wu and He 2018 [7] |
| RMSNorm | C, H, W (per sample, no mean) | no | gamma only | modern LLMs | Zhang and Sennrich 2019 [5] |
| switchable norm | weighted mix of BN, LN, IN | partial | gamma, beta + weights | exploratory | Luo et al. 2018 [9] |
| weight normalization | per weight vector | no | scale per neuron | various | Salimans and Kingma 2016 |
For a 3D transformer activation tensor of shape (B, n, d_model), batch normalization would normalize across (B, n), layer normalization and RMSNorm across (d_model,), and instance/group normalization are not commonly applied in this setting.
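The axis differences can be made concrete by computing the statistics manually on a 4D tensor (an illustrative sketch; G is an arbitrary group count):

```python
import torch

N, C, H, W = 8, 64, 32, 32
G = 8                                    # number of groups for group normalization
x = torch.randn(N, C, H, W)

bn_mean = x.mean(dim=(0, 2, 3))          # batch norm: one value per channel -> (C,)
ln_mean = x.mean(dim=(1, 2, 3))          # layer norm: one value per sample -> (N,)
in_mean = x.mean(dim=(2, 3))             # instance norm: per sample, per channel -> (N, C)
gn_mean = x.view(N, G, C // G, H, W).mean(dim=(2, 3, 4))  # group norm: per sample, per group -> (N, G)

print(bn_mean.shape, ln_mean.shape, in_mean.shape, gn_mean.shape)
```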
The choice of normalization depends on the architecture, the data, and the training regime. The table below summarizes the practical defaults that have emerged over the past decade.
| use case | recommended normalization | reason |
|---|---|---|
| CNN image classification, batch >= 32 | batch normalization | best accuracy, mature tooling |
| CNN with small batch (< 8) | group normalization | accuracy independent of batch |
| object detection, segmentation | group normalization | small effective batch per device |
| style transfer, image-to-image | instance normalization | removes per-image contrast |
| RNN, LSTM, GRU | layer normalization | per-time-step stability |
| transformer encoder/decoder (small) | layer normalization, pre-norm | mature default |
| transformer LLM (large, modern) | RMSNorm, pre-norm | speed and quality at scale |
| extremely deep transformer (>200 layers) | DeepNorm | bounded updates at depth |
| autoregressive generation | layer normalization or RMSNorm | batch independence essential |
The default for a new transformer-based language model is unambiguously pre-norm with RMSNorm. The default for a new convolutional vision model is still batch normalization unless memory limits force a small batch, in which case group normalization is the standard fallback.
Layer normalization interacts with several other components of the transformer architecture to enable stable training of very deep models.
Residual connections provide a direct path for gradients to flow backward through the network. Layer normalization ensures that the activations at each layer remain in a well-conditioned range, preventing the accumulation of scale changes that could otherwise cause gradients to explode or vanish. Together, these two mechanisms allow transformers to be trained with dozens of layers (12 for BERT-base, 96 for GPT-3) without the severe optimization difficulties that plague unnormalized deep networks.
Weight initialization strategies must account for the presence of layer normalization. The common practice of using Xavier or Kaiming initialization for linear layers in transformers is predicated on the assumption that layer normalization will keep activations roughly standardized. Some architectures, such as GPT-2, include an additional scaling factor at initialization that divides residual layer weights by sqrt(2 * num_layers), further ensuring that the residual stream does not grow in magnitude through the network.
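A sketch of that residual scaling follows, using the commonly cited base standard deviation of 0.02 and counting num_layers as the number of transformer blocks (both assumptions, stated here for illustration):

```python
import math
import torch.nn as nn

def init_residual_projection(linear: nn.Linear, num_layers: int, std: float = 0.02):
    # Scale the output projections of attention and MLP sublayers so that the
    # variance of the residual stream does not grow with depth.
    nn.init.normal_(linear.weight, mean=0.0, std=std / math.sqrt(2 * num_layers))
    if linear.bias is not None:
        nn.init.zeros_(linear.bias)
```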
The Adam optimizer and its variants such as AdamW interact favorably with layer normalization because the per-parameter adaptive learning rates can adjust to the normalized scale of activations. This combination of layer normalization and adaptive optimization has proven robust across a wide range of model sizes and training configurations. With plain SGD, layer normalization still works but typically requires more careful tuning of the learning rate.
There is also a subtle interaction between layer normalization and gradient clipping. Because RMSNorm bounds the per-sample activation magnitude but not the per-feature one, models using RMSNorm sometimes show occasional spikes in particular features that gradient clipping needs to absorb. Modern training stacks set the gradient clipping norm to around 1.0 by default, a value that handles these spikes while not interfering with normal gradient updates.
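In PyTorch, clipping the global gradient norm is a single call placed between the backward pass and the optimizer step, as in this toy example:

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 16)                          # stand-in for a transformer
loss = model(torch.randn(4, 16)).pow(2).mean()
loss.backward()

# Clip the global gradient norm to 1.0 before the optimizer step.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```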
Layer normalization is not a free operation. Several costs and shortcomings are worth noting.
It adds learned parameters: gamma and beta (or just gamma for RMSNorm). The total parameter count is small relative to the rest of the model, but in resource-constrained settings every parameter counts. RMSNorm's removal of beta saves d_model parameters per layer, which adds up across hundreds of layers.
It computes statistics per sample, which introduces a sequential dependency along the feature axis. The mean and variance must be computed before the normalized values can be produced. On highly parallel hardware, this serial reduction is usually fast, but on extremely wide layers (d_model in the tens of thousands) it can become noticeable.
It does not regularize as strongly as batch normalization in CNNs. Batch normalization injects noise from batch statistics during training, which acts as an implicit form of regularization. Layer normalization is deterministic given a single sample, so it does not provide this regularization signal. CNN architectures using layer normalization typically need additional regularization techniques such as dropout or stronger data augmentation to match batch-normalized performance.
RMSNorm is slightly faster than full layer normalization but loses the subtractive normalization. In rare cases, this matters. Models with strong feature-mean drift can benefit from full layer normalization; in mainstream language modeling this drift is small and RMSNorm is fine.
Finally, normalization can interact badly with quantization. Per-sample statistics computed in float32 and then applied to int8 activations require careful handling to preserve accuracy. Most LLM quantization pipelines either keep normalization in higher precision or fold the gamma parameter into the surrounding linear layers to avoid the issue.
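The folding trick mentioned above relies on the fact that a linear layer applied to gamma * x_hat can absorb gamma into its weight columns. A small sketch (assuming RMSNorm followed by a bias-free linear layer, with illustrative shapes) shows the equivalence:

```python
import torch
import torch.nn as nn

d_in, d_out = 768, 3072
gamma = torch.randn(d_in).abs() + 0.5              # stand-in for a trained RMSNorm scale
linear = nn.Linear(d_in, d_out, bias=False)

# Fold gamma into the weight: W_folded[o, i] = W[o, i] * gamma[i]
folded = nn.Linear(d_in, d_out, bias=False)
with torch.no_grad():
    folded.weight.copy_(linear.weight * gamma)

x_hat = torch.randn(4, d_in)                       # pretend this is the normalized activation
out_original = linear(gamma * x_hat)
out_folded = folded(x_hat)
print(torch.allclose(out_original, out_folded, atol=1e-5))   # True
```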
The Ba, Kiros, and Hinton (2016) paper on layer normalization has become one of the most cited papers in deep learning, with tens of thousands of citations. While it was originally motivated by the challenges of normalizing recurrent neural networks, its greatest impact has been in the transformer era. Every major transformer model, from the original transformer to GPT-4 and beyond, uses some form of layer normalization.
The progression from standard layer normalization to RMSNorm reflects a broader trend in deep learning research. As models scale to hundreds of billions of parameters, even modest computational savings in frequently executed operations accumulate to meaningful reductions in training cost and time. RMSNorm's removal of the mean centering step, while seemingly minor, translates to tangible efficiency gains at the scale of modern LLM training.
The pre-norm versus post-norm question, settled in practice in favor of pre-norm by the GPT-2 architecture and the Xiong et al. analysis, is one of the cleanest examples of a small architectural choice having outsized practical consequences. Models that fail to adopt pre-norm at scale are notoriously hard to train and often fail silently, producing worse final loss without any obvious signal that the normalization placement is to blame. The convergence of nearly all modern LLMs on pre-norm with RMSNorm represents a hard-won consensus from the past five years of large-scale training experience.