# Normalization

> Source: https://aiwiki.ai/wiki/normalization
> Updated: 2026-06-21
> Categories: Data & Datasets, Machine Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

**Normalization** is the process of scaling numerical data to a standard range or distribution so that features and activations are comparable and downstream computation behaves predictably. In [machine learning](/wiki/machine_learning) and data science the term covers a family of techniques applied during [data preprocessing](/wiki/data_preprocessing) (rescaling each tabular feature), inside [neural network](/wiki/neural_network) architectures (normalization layers such as [batch normalization](/wiki/batch_normalization) and [layer normalization](/wiki/layer_normalization)), and in database design (organizing tables to reduce redundancy). By removing arbitrary scale, normalization helps algorithms converge faster, produce more stable gradients, and treat all input dimensions equitably. In one landmark result, adding batch normalization let a state-of-the-art image classifier reach the same accuracy with 14 times fewer training steps [1].

Normalization is especially important for algorithms that are sensitive to the magnitude of input features, including [gradient descent](/wiki/gradient_descent)-based optimizers, distance-based methods such as [k-nearest neighbors](/wiki/k_nearest_neighbors) and [k-means](/wiki/k-means) clustering, and most neural network architectures. Without normalization, features with larger numeric ranges can dominate the learning process, leading to slow convergence or poor generalization.

The word "normalization" carries different meanings in different parts of machine learning. In tabular preprocessing it usually means rescaling each column to a fixed range or distribution. Inside a deep network it means inserting a layer that standardizes activations on the fly. In information retrieval it means dividing each row vector by its length. The mathematics differ, but the underlying goal is the same: remove arbitrary scale so that downstream computation behaves predictably.

## What is data normalization (feature scaling)?

Data normalization, sometimes called [feature scaling](/wiki/feature_scaling), rescales individual features so they occupy a comparable numeric range before being fed into a model. The choice of method depends on the data distribution, the presence of outliers, and the downstream algorithm.

### Min-max normalization

Min-max normalization linearly maps each feature to a fixed interval, most commonly [0, 1]. The formula is:

```
x' = (x - x_min) / (x_max - x_min)
```

To scale to an arbitrary range [a, b]:

```
x' = a + (x - x_min)(b - a) / (x_max - x_min)
```

Min-max normalization preserves the original distribution shape and is straightforward to implement. However, it is highly sensitive to outliers: a single extreme value can compress all other values into a narrow band. In [scikit-learn](/wiki/scikit_learn), this method is available as `MinMaxScaler` [11].

A practical concern with min-max scaling is what happens when a value at inference time falls outside the range observed during training. The transformed value will lie outside [0, 1], which is acceptable for most models but breaks the assumption that all inputs are bounded. When the input data has known semantic limits (for example, 8-bit image pixel intensities in [0, 255]), those limits should be used directly rather than the empirical minimum and maximum [24].
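
The two formulas and the out-of-range case can be sketched in NumPy as follows. This is a minimal illustration, not the scikit-learn implementation: the helper name `min_max_scale` and the sample values are made up for the example.

```python
import numpy as np

def min_max_scale(x, x_min, x_max, a=0.0, b=1.0):
    """Map x linearly so that x_min -> a and x_max -> b (hypothetical helper)."""
    return a + (x - x_min) * (b - a) / (x_max - x_min)

# Made-up training column; the minimum and maximum come from training data only.
train = np.array([10.0, 20.0, 30.0, 50.0])
x_min, x_max = train.min(), train.max()

scaled_train = min_max_scale(train, x_min, x_max)

# A value larger than anything seen during training lands above 1.
scaled_new = min_max_scale(np.array([70.0]), x_min, x_max)
```

Here `scaled_train` spans exactly [0, 1], while the unseen value 70 maps to 1.5, which is the unbounded-input case described above.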

### Z-score normalization (standardization)

Z-score normalization, also called standardization, transforms each [feature](/wiki/feature) so that it has a mean of zero and a standard deviation of one:

```
x' = (x - mu) / sigma
```

where mu is the sample mean and sigma is the sample standard deviation. The resulting values are unbounded and can be negative. [Z-Score Normalization](/wiki/z-score_normalization) handles outliers more gracefully than min-max normalization because outliers affect the mean and variance but do not compress the rest of the distribution into a tiny range. It is the default choice for many linear models, support vector machines, and principal component analysis. In scikit-learn, this method is provided by `StandardScaler` [11].

A closely related question is whether to use the population variance (dividing by n) or the sample variance (dividing by n-1) when estimating sigma. For normalization purposes the difference is negligible at typical dataset sizes. The scikit-learn implementation uses the population variance for consistency with the `numpy.std` default [11].
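
A short NumPy sketch makes the two variance conventions concrete. The data values are made up; `ddof` is NumPy's name for the divisor correction (`ddof=0` divides by n, `ddof=1` divides by n-1).

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = x.mean()
sigma_pop = x.std(ddof=0)      # population estimate, the numpy.std default
sigma_sample = x.std(ddof=1)   # sample estimate, slightly larger

# Standardize with the population estimate.
z = (x - mu) / sigma_pop
```

With only eight values the two estimates differ by about 7 percent; the gap shrinks roughly as 1/(2n) as the dataset grows.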

### Max absolute scaling

Max absolute scaling divides each feature value by the maximum absolute value of that feature:

```
x' = x / max(|x|)
```

This maps the data to the range [-1, 1] without shifting the center, which means it does not destroy sparsity. It is useful for sparse datasets where preserving zero entries matters, including text feature matrices produced by `CountVectorizer` or `TfidfVectorizer`. In scikit-learn, this method is available as `MaxAbsScaler` [11].
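
The sparsity argument can be checked with a small NumPy sketch. The matrix is a made-up stand-in for a sparse feature matrix, and a column consisting only of zeros would need special handling that is omitted here.

```python
import numpy as np

# Rows are samples, columns are features; most entries are zero.
X = np.array([[0.0,  2.0, 0.0],
              [4.0,  0.0, 0.0],
              [0.0, -8.0, 1.0]])

max_abs = np.abs(X).max(axis=0)   # per-feature maximum absolute value
X_scaled = X / max_abs            # zeros stay exactly zero

# Mean-centering, by contrast, turns the zeros into nonzero values.
X_centered = X - X.mean(axis=0)
```

`X_scaled` has the same number of nonzero entries as `X`, whereas every entry of `X_centered` is nonzero for this matrix.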

### Robust scaling

Robust scaling uses the median and interquartile range (IQR) instead of the mean and standard deviation:

```
x' = (x - Q2) / (Q3 - Q1)
```

where Q1, Q2, and Q3 are the 25th, 50th, and 75th percentiles, respectively. Because the median and IQR are not influenced by extreme values, robust scaling is the preferred choice when the data contains significant outliers that cannot be removed. In scikit-learn, the corresponding class is `RobustScaler`. The user can configure which percentiles to use through the `quantile_range` parameter, with the default (25.0, 75.0) corresponding to the interquartile range [11].
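
The following NumPy sketch contrasts robust scaling with z-scoring on a made-up column containing one extreme value. It uses `np.percentile` with its default linear interpolation, which may differ slightly from other quantile conventions.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 1000.0])  # the last value is an outlier

q1, q2, q3 = np.percentile(x, [25, 50, 75])
robust = (x - q2) / (q3 - q1)          # median and IQR ignore the outlier
zscore = (x - x.mean()) / x.std()      # mean and std are dragged by it

# Spread of the five ordinary values under each scaling.
robust_spread = robust[:5].max() - robust[:5].min()
zscore_spread = zscore[:5].max() - zscore[:5].min()
```

Under robust scaling the five ordinary values keep a spread of 1.6, while under z-scoring they are squeezed into a band narrower than 0.02 because the outlier inflates the standard deviation.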

### L1 and L2 normalization (unit norm)

L1 and L2 normalization scale each sample (row) so that it has unit norm:

- **L1 normalization** divides each element by the sum of absolute values, so the L1 norm of the resulting vector equals 1.
- **L2 normalization** divides each element by the Euclidean length, so the L2 norm of the resulting vector equals 1.

These methods are common in text classification and information retrieval, where document vectors are normalized to unit length before computing cosine similarity. After L2 normalization, the dot product between any two vectors equals their cosine similarity, which simplifies and speeds up similarity search. In scikit-learn, the `Normalizer` class handles both L1 and L2 normalization [11].
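
A minimal NumPy sketch of both row-wise norms, with made-up vectors, also confirms the cosine-similarity identity for L2-normalized rows.

```python
import numpy as np

X = np.array([[ 3.0, 4.0],
              [ 1.0, 0.0],
              [-2.0, 2.0]])

l1 = X / np.abs(X).sum(axis=1, keepdims=True)        # rows sum to 1 in absolute value
l2 = X / np.linalg.norm(X, axis=1, keepdims=True)    # rows have Euclidean length 1

# Cosine similarity of rows 0 and 1, computed from the raw vectors...
cos_raw = X[0] @ X[1] / (np.linalg.norm(X[0]) * np.linalg.norm(X[1]))
# ...equals the plain dot product of the L2-normalized rows.
cos_unit = l2[0] @ l2[1]
```

Note that the third row contains a negative entry, so its normalized elements can be negative as well; only their absolute values (L1) or squares (L2) sum to 1.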

### Log transformations

When a feature spans many orders of magnitude or is heavily right-skewed, applying a logarithmic transformation can make it more amenable to learning. Common variants include the natural log (`log(x)`), `log(1 + x)` (which avoids issues at zero), and `log10(x)`. Log transformations are widely used for features such as income, price, population counts, and gene expression levels. The transformation must be applied only when all values are positive; for data containing zeros, `log1p` is the standard choice.
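
As a small illustration with made-up income values, `np.log1p` handles the zero entry and `np.expm1` inverts the transformation when predictions need to be mapped back to the original scale.

```python
import numpy as np

income = np.array([0.0, 9.0, 99.0, 999.0, 99_999.0])

log_income = np.log1p(income)     # log(1 + x), defined at x = 0
restored = np.expm1(log_income)   # inverse transform, exp(y) - 1
```

The raw values span five orders of magnitude, while the transformed values all fall between 0 and roughly 11.5.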

### Power transformations: Box-Cox and Yeo-Johnson

Power transformations seek to make a feature more closely resemble a Gaussian distribution by applying a parametric power function. The Box-Cox transformation, introduced by Box and Cox in 1964, is defined for strictly positive data as [13]:

```
x' = (x^lambda - 1) / lambda     if lambda != 0
x' = log(x)                       if lambda == 0
```

The parameter lambda is selected to maximize the log-likelihood of the transformed data under a normality assumption. The Yeo-Johnson transformation, introduced by Yeo and Johnson in 2000, extends Box-Cox to handle zero and negative values and is therefore more general [14]:

```
For x >= 0:
  x' = ((x + 1)^lambda - 1) / lambda          if lambda != 0
  x' = log(x + 1)                              if lambda == 0
For x < 0:
  x' = -((-x + 1)^(2 - lambda) - 1) / (2 - lambda)   if lambda != 2
  x' = -log(-x + 1)                                   if lambda == 2
```

In scikit-learn, both transformations are provided by `PowerTransformer` with the `method` argument set to `'box-cox'` or `'yeo-johnson'`. The class fits lambda separately for each feature [11].
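
The two definitions translate directly into NumPy. This sketch applies a fixed, hand-picked lambda; it does not perform the maximum-likelihood fit of lambda that `PowerTransformer` carries out, and the function names are illustrative.

```python
import numpy as np

def box_cox(x, lam):
    """Box-Cox transform for strictly positive x and a fixed lambda."""
    x = np.asarray(x, dtype=float)
    if lam == 0:
        return np.log(x)
    return (x**lam - 1.0) / lam

def yeo_johnson(x, lam):
    """Yeo-Johnson transform for a fixed lambda; accepts any real x."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0
    if lam != 0:
        out[pos] = ((x[pos] + 1.0) ** lam - 1.0) / lam
    else:
        out[pos] = np.log1p(x[pos])
    if lam != 2:
        out[~pos] = -((-x[~pos] + 1.0) ** (2.0 - lam) - 1.0) / (2.0 - lam)
    else:
        out[~pos] = -np.log1p(-x[~pos])
    return out

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
identity = yeo_johnson(x, 1.0)     # lambda = 1 leaves the data unchanged
compressed = yeo_johnson(x, 0.0)   # lambda = 0 gives log(1 + x) for x >= 0
```

Checking that lambda = 1 reproduces the input is a quick sanity test for both branches of the Yeo-Johnson definition.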

### Quantile transformation

A quantile transformation maps each feature to a target distribution by ranking the values and applying the inverse cumulative distribution function of the target. Two target distributions are common: uniform on [0, 1] and standard normal. Because the transformation is rank-based, it is robust to outliers and produces an output with a fixed shape regardless of the input distribution. The downside is that it is nonlinear: the mapping is monotonic, so the order of values is preserved, but distances between values and linear correlations between features can be distorted. Values not seen during fitting are mapped by interpolating between the learned quantiles. In scikit-learn, this is provided by `QuantileTransformer` [11]. Quantile transformation is useful when the goal is to make features comparable in shape, for example when feeding heterogeneous tabular data into a [deep learning](/wiki/deep_learning) model.
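
The uniform variant can be sketched in a few lines of NumPy. This is a simplified stand-in for `QuantileTransformer`, assuming 101 stored quantiles and linear interpolation between them; the helper name `to_uniform` and the synthetic data are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.exponential(scale=2.0, size=1000)   # heavily right-skewed feature

# "Fit": store the empirical quantiles of the training data.
probs = np.linspace(0.0, 1.0, 101)
quantiles = np.quantile(train, probs)

def to_uniform(x):
    """Map values to [0, 1] by interpolating between the stored quantiles."""
    return np.interp(x, quantiles, probs)

u_train = to_uniform(train)

# New values: far below the training range, at the training median, far above.
u_new = to_uniform(np.array([-5.0, np.median(train), 1e6]))
```

In this sketch, values outside the range seen during fitting are clipped to 0 or 1 by `np.interp`, and the training median maps to 0.5.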

### Comparison of feature scaling methods

| Method | Formula | Output range | Sensitive to outliers | Preserves sparsity | scikit-learn class |
|---|---|---|---|---|---|
| Min-max | (x - x_min) / (x_max - x_min) | [0, 1] | Yes | No | `MinMaxScaler` |
| Z-score | (x - mu) / sigma | Unbounded | Moderate | No | `StandardScaler` |
| Max absolute | x / max(abs(x)) | [-1, 1] | Yes | Yes | `MaxAbsScaler` |
| Robust | (x - median) / IQR | Unbounded | No | No | `RobustScaler` |
| L1 norm | x / sum(abs(x)) | [-1, 1] per sample ([0, 1] for non-negative data) | No | Yes | `Normalizer(norm='l1')` |
| L2 norm | x / sqrt(sum(x^2)) | [-1, 1] per sample | No | Yes | `Normalizer(norm='l2')` |
| Box-Cox | (x^lambda - 1) / lambda | Approximately Gaussian | Reduced after fit | No | `PowerTransformer(method='box-cox')` |
| Yeo-Johnson | Piecewise power | Approximately Gaussian | Reduced after fit | No | `PowerTransformer(method='yeo-johnson')` |
| Quantile (uniform) | Rank / N | [0, 1] | No | No | `QuantileTransformer(output_distribution='uniform')` |
| Quantile (normal) | Inverse normal CDF of rank | Approximately N(0, 1) | No | No | `QuantileTransformer(output_distribution='normal')` |
| Log | log(1 + x) | Compressed | Reduced | No | `FunctionTransformer(np.log1p)` |

## When should you normalize data?

Normalization is most beneficial in the following scenarios:

- **Distance-based algorithms.** Methods such as [k-nearest neighbors](/wiki/k_nearest_neighbors), [k-means](/wiki/k-means), and support vector machines compute distances between data points. If features are on different scales, features with larger magnitudes will dominate the distance calculation.
- **Gradient descent optimization.** When features have very different scales, the loss surface becomes elongated, causing [gradient descent](/wiki/gradient_descent) to oscillate and converge slowly. Normalization produces a more spherical loss landscape, allowing the optimizer to take more direct steps toward the minimum [24].
- **Neural networks.** Most [neural network](/wiki/neural_network) architectures benefit from normalized inputs because activation functions (such as sigmoid and tanh) are most sensitive in a limited numeric range. Feeding unnormalized data can push activations into saturated regions, causing vanishing gradients [22].
- **Regularized models.** Algorithms with L1 or L2 [regularization](/wiki/regularization) penalize feature weights. If features are on different scales, the penalty applies unevenly, distorting the model.
- **Principal component analysis and related methods.** PCA, factor analysis, and linear discriminant analysis all rely on covariance or correlation. Without standardization, features with high variance dominate the principal components even when their large variance reflects unit choice rather than information content.

## When should you not normalize data?

Normalization is unnecessary or even counterproductive in some situations:

- **Tree-based models.** [Decision trees](/wiki/decision_tree), [random forests](/wiki/random_forest), and [gradient boosting](/wiki/gradient_boosting) algorithms split on individual feature thresholds. Because they compare values within a single feature at a time, the relative scale across features does not affect their behavior. Tree-based models are largely invariant to monotonic transformations of any single feature.
- **Count-based features.** In some natural language processing tasks, raw counts (for example, word frequencies) carry meaningful absolute magnitude that normalization would obscure.
- **Pre-normalized data.** If features are already on similar scales (for example, pixel values in [0, 255] for all channels), additional normalization may not improve performance.
- **Categorical features encoded as integers.** Integer-encoded categorical variables should be one-hot encoded or treated as embeddings, not standardized. Z-scoring an arbitrary integer code only rescales the spurious ordering and spacing that the codes imply; it does not remove them.

## What is the difference between normalization and standardization?

The terms "normalization" and "standardization" are often used interchangeably in practice, but they refer to different operations in a strict sense.

| | Normalization (min-max) | Standardization (z-score) |
|---|---|---|
| Goal | Scale values to a fixed range (e.g., [0, 1]) | Center values at zero with unit variance |
| Formula | (x - x_min) / (x_max - x_min) | (x - mu) / sigma |
| Output range | Bounded ([0, 1] or [-1, 1]) | Unbounded |
| Outlier behavior | Outliers define the range and compress the remaining values | Outliers shift the mean and widen the spread |
| Best for | Data with known bounds; neural network inputs | Data that is roughly Gaussian; linear models, SVMs, PCA |

In everyday conversation, many practitioners use "normalization" as an umbrella term for any kind of [feature scaling](/wiki/feature_scaling). When precision matters, it is helpful to specify the exact method (for example, "min-max normalization" or "z-score standardization").

## How do you avoid data leakage when normalizing?

A critical rule when applying normalization in a machine learning pipeline is to **fit the scaler on the [training set](/wiki/training_set) only**. The scaler's parameters (minimum, maximum, mean, standard deviation, median, or IQR) must be computed exclusively from training data. The test set and validation set should then be transformed using those same parameters.

If the scaler is fit on the entire dataset before splitting, information from the test set leaks into the training process, producing overly optimistic performance estimates that do not reflect real-world generalization. In scikit-learn, the recommended approach is to place the scaler inside a `Pipeline` together with the model, which automatically applies `fit` only to the training fold during [cross-validation](/wiki/cross-validation).

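```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Example data; any feature matrix X and label vector y would do.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ('scaler', StandardScaler()),    # statistics are learned during fit only
    ('clf', LogisticRegression()),
])
pipe.fit(X_train, y_train)           # mean and std come from X_train alone
score = pipe.score(X_test, y_test)   # X_test is transformed with those values
```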

The leakage rule extends beyond simple train/test splits. In time series forecasting, the scaler should be fit only on data observed up to the cutoff time of each evaluation point. In stratified k-fold cross-validation, the scaler should be fit on the training folds within each iteration. The general principle is that any statistic the scaler depends on must come exclusively from data the model is allowed to see at training time.

## What are neural network normalization layers?

Beyond preprocessing the input data, modern [deep learning](/wiki/deep_learning) architectures apply normalization inside the network itself. These normalization layers stabilize activations during training, reduce sensitivity to weight initialization, and often allow higher learning rates. They all follow a common pattern: compute statistics over some subset of the activations, normalize using those statistics, and then apply a learned affine transformation (scale and shift).

The general form of an internal normalization layer is:

```
y = gamma * (x - mu) / sqrt(sigma^2 + epsilon) + beta
```

where `mu` and `sigma^2` are the mean and variance computed over a chosen set of dimensions, `epsilon` is a small constant (typically 1e-5) added for numerical stability, and `gamma` and `beta` are learnable scale and shift parameters. The choice of dimensions over which mu and sigma^2 are computed is what distinguishes the various normalization layers from each other.
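
A NumPy sketch of this general form shows how the choice of axes alone yields different layers. It assumes a 4-D activation tensor laid out as (batch, channels, height, width) and uses scalar `gamma` and `beta` for brevity, whereas real layers learn one value per channel or feature.

```python
import numpy as np

def normalize(x, axes, gamma=1.0, beta=0.0, eps=1e-5):
    """Standardize x over `axes`, then apply the affine scale and shift."""
    mu = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(8, 4, 5, 5))  # (batch, channels, height, width)

batch_norm = normalize(x, axes=(0, 2, 3))     # per channel, across batch and space
layer_norm = normalize(x, axes=(1, 2, 3))     # per sample, across all features
instance_norm = normalize(x, axes=(2, 3))     # per sample and per channel
```

Each output has zero mean and approximately unit variance along the axes it was normalized over, but not necessarily along the others.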

### Batch normalization

[Batch normalization](/wiki/batch_normalization) (BatchNorm), introduced by Ioffe and Szegedy in 2015 (arxiv 1502.03167), normalizes activations across the mini-batch dimension. For each channel in a layer, it computes the mean and variance over all spatial positions and all samples in the current mini-batch, then normalizes accordingly. BatchNorm was originally motivated by the problem of "internal covariate shift," where the distribution of layer inputs changes as earlier layers update their weights [1].

BatchNorm enables the use of much higher [learning rates](/wiki/learning_rate) and reduces the dependence on careful weight initialization. The original paper reported that "Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin" [1]. However, BatchNorm has notable limitations: its behavior depends on the mini-batch size, it performs poorly with very small batches, and its statistics differ between training (using mini-batch statistics) and inference (using running averages), which can cause inconsistencies. A 2018 study by Santurkar and colleagues argued that the actual benefit of BatchNorm comes from smoothing the loss landscape rather than from reducing covariate shift [12].

In PyTorch, BatchNorm is available as `torch.nn.BatchNorm1d`, `torch.nn.BatchNorm2d`, and `torch.nn.BatchNorm3d`, which differ only in the spatial rank of the input. During inference, the running mean and running variance computed during training are used in place of mini-batch statistics; this behavior is enabled by setting the module to evaluation mode with `model.eval()` [23].
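
The difference between training and evaluation mode can be illustrated with a toy NumPy class. This is a simplified sketch, not PyTorch's implementation: the class, its `training` flag, and the momentum value of 0.1 are assumptions made for the example, and details such as the variance estimator differ in real frameworks.

```python
import numpy as np

class ToyBatchNorm1d:
    """Toy batch normalization over inputs of shape (batch, features)."""

    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        self.gamma = np.ones(num_features)
        self.beta = np.zeros(num_features)
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum, self.eps = momentum, eps
        self.training = True

    def __call__(self, x):
        if self.training:
            mean, var = x.mean(axis=0), x.var(axis=0)
            # Exponential moving averages, used later at inference time.
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            mean, var = self.running_mean, self.running_var
        return self.gamma * (x - mean) / np.sqrt(var + self.eps) + self.beta

rng = np.random.default_rng(0)
bn = ToyBatchNorm1d(3)
for _ in range(200):                        # "training": statistics accumulate
    bn(rng.normal(5.0, 2.0, size=(64, 3)))

bn.training = False                         # analogous to calling model.eval()
single = bn(np.full((1, 3), 5.0))           # a batch of one is now well defined
```

In training mode a batch of one would have zero variance; in evaluation mode the stored running statistics make the output deterministic and independent of the other samples in the batch.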

### Layer normalization

[Layer normalization](/wiki/layer_normalization) (LayerNorm), proposed by Ba, Kiros, and [Geoffrey Hinton](/wiki/geoffrey_hinton) in 2016 (arxiv 1607.06450), computes normalization statistics across all neurons within a single layer for a single training example. Unlike BatchNorm, it does not depend on the batch dimension, making it straightforward to apply to recurrent neural networks and [transformer](/wiki/transformer) architectures where sequence lengths may vary [2].

LayerNorm has become the standard normalization technique in transformer-based language models, including [GPT](/wiki/gpt) and [BERT](/wiki/bert). Its independence from batch size makes it well suited for settings where batch sizes are small or variable. Because the same statistics are used at training and inference time, there is no train/test mismatch and no need for running averages. In PyTorch, LayerNorm is provided by `torch.nn.LayerNorm`, which takes a `normalized_shape` argument specifying the trailing dimensions to normalize over [23].

### Instance normalization

Instance normalization (InstanceNorm), introduced by Ulyanov, Vedaldi, and Lempitsky in 2016, computes the mean and variance for each individual channel of each individual sample. This means each feature map in each image is normalized independently [3].

InstanceNorm gained prominence in neural style transfer, where it effectively strips instance-specific contrast information from images, allowing the network to focus on high-level style features. It is commonly used in generative adversarial networks ([GANs](/wiki/generative_adversarial_network)) and image generation tasks. A close relative is adaptive instance normalization (AdaIN), introduced by Huang and Belongie in 2017, which uses the mean and variance from a style image as the affine parameters of the normalization, producing an effective real-time style transfer technique.
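
AdaIN can be sketched in NumPy as instance normalization of the content features followed by a rescale to the style statistics. The function name, the (channels, height, width) layout, and the random arrays standing in for encoder feature maps are assumptions made for this example.

```python
import numpy as np

def adain(content, style, eps=1e-5):
    """Adaptive instance normalization for feature maps of shape (C, H, W)."""
    c_mu = content.mean(axis=(1, 2), keepdims=True)
    c_std = np.sqrt(content.var(axis=(1, 2), keepdims=True) + eps)
    s_mu = style.mean(axis=(1, 2), keepdims=True)
    s_std = np.sqrt(style.var(axis=(1, 2), keepdims=True) + eps)
    # Strip the content statistics, then impose the style statistics.
    return s_std * (content - c_mu) / c_std + s_mu

rng = np.random.default_rng(0)
content = rng.normal(0.0, 1.0, size=(3, 8, 8))   # stand-in for content features
style = rng.normal(4.0, 0.5, size=(3, 8, 8))     # stand-in for style features
stylized = adain(content, style)
```

After the operation each channel of `stylized` carries the mean and standard deviation of the corresponding style channel while keeping the spatial pattern of the content.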

### Group normalization

[Group normalization](/wiki/group_normalization) (GroupNorm), proposed by Wu and He in 2018 (arxiv 1803.08494), divides channels into groups and computes normalization statistics within each group for each sample [4]. It can be seen as a middle ground between LayerNorm (one group containing all channels) and InstanceNorm (each channel in its own group).

GroupNorm is particularly valuable in computer vision tasks where memory constraints force the use of small batch sizes (for example, in object detection and video classification). The authors reported that "on ResNet-50 trained in ImageNet, GN has 10.6% lower error than its BN counterpart when using a batch size of 2" [4]. The number of groups is a hyperparameter; the original paper recommends 32 groups as a sensible default [4].
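
A NumPy sketch of the statistics computation shows the two limiting cases. It omits the learned scale and shift and assumes a (batch, channels, height, width) layout; the function name is illustrative.

```python
import numpy as np

def group_norm(x, num_groups, eps=1e-5):
    """Group normalization (without affine parameters) for x of shape (N, C, H, W)."""
    n, c, h, w = x.shape
    assert c % num_groups == 0, "channels must divide evenly into groups"
    g = x.reshape(n, num_groups, c // num_groups, h, w)
    mu = g.mean(axis=(2, 3, 4), keepdims=True)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    return ((g - mu) / np.sqrt(var + eps)).reshape(n, c, h, w)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8, 4, 4))

gn = group_norm(x, num_groups=4)
as_layer_norm = group_norm(x, num_groups=1)      # one group: LayerNorm-style statistics
as_instance_norm = group_norm(x, num_groups=8)   # one channel per group: InstanceNorm
```

Because every statistic is computed within a single sample, normalizing one sample alone gives the same result as normalizing it inside a larger batch.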

### Weight normalization

Weight normalization, introduced by Salimans and Kingma in 2016 (arxiv 1602.07868), takes a different approach. Instead of normalizing activations, it reparameterizes the weights of each layer as a product of a unit vector and a learnable scalar magnitude [5]:

```
w = g * (v / ||v||)
```

where `v` is a vector of free parameters, `||v||` is its Euclidean norm, and `g` is a learned scalar. This decouples the magnitude of the weight vector from its direction, which improves the conditioning of the optimization problem and accelerates training. Weight normalization can be combined with mean-only batch normalization for even better performance [5]. Compared to BatchNorm, it is independent of batch size and adds essentially no computational overhead. It is available in PyTorch as `torch.nn.utils.weight_norm`; recent PyTorch releases deprecate that function in favor of `torch.nn.utils.parametrizations.weight_norm`.
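
The decoupling of magnitude and direction is easy to verify numerically. The sketch below uses made-up values for `v` and `g` and plain NumPy rather than a framework module.

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.normal(size=5)     # direction parameters
g = 2.5                    # magnitude parameter

w = g * v / np.linalg.norm(v)

# Rescaling v changes nothing: only its direction matters.
w_rescaled = g * (10.0 * v) / np.linalg.norm(10.0 * v)
```

The norm of `w` is always exactly `g`, so the optimizer can adjust the length of the weight vector through a single scalar.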

### Root mean square layer normalization (RMSNorm)

[RMSNorm](/wiki/rmsnorm), introduced by Zhang and Sennrich in 2019 (arxiv 1910.07467), simplifies LayerNorm by removing the mean-centering step [6]. Instead of subtracting the mean and dividing by the standard deviation, RMSNorm divides activations only by their root mean square:

```
x' = x / RMS(x),  where RMS(x) = sqrt((1/n) * sum(x_i^2))
```

The authors hypothesized that the re-centering invariance of LayerNorm is dispensable, writing in the abstract that "RMSNorm regularizes the summed inputs to a neuron in one layer according to root mean square (RMS), giving the model re-scaling invariance property and implicit learning rate adaptation ability" and that it "achieves comparable performance against LayerNorm but reduces the running time by 7% to 64% on different models" [6]. RMSNorm has been adopted in several prominent large language models, including [LLaMA](/wiki/llama) and its successors, Mistral, Qwen, and DeepSeek [16]. Native support for RMSNorm was added to PyTorch as `torch.nn.RMSNorm` in version 2.4 [23].
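
A NumPy sketch of RMSNorm next to LayerNorm illustrates the re-scaling invariance and the relationship between the two. The learned `gain` vector and the placement of `eps` inside the square root are common implementation choices assumed here; the formula above omits both.

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm over the last axis, followed by a per-feature gain."""
    rms = np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)
    return gain * x / rms

def layer_norm(x, gain, bias, eps=1e-6):
    """LayerNorm over the last axis, for comparison."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gain * (x - mu) / np.sqrt(var + eps) + bias

x = np.array([[1.0, 2.0, 3.0, 4.0]])
gain = np.ones(4)

y = rms_norm(x, gain)
y_scaled = rms_norm(10.0 * x, gain)          # re-scaling invariance
y_centered = rms_norm(x - x.mean(), gain)    # zero-mean input
```

For input that already has zero mean, the RMS equals the standard deviation, so `y_centered` matches the LayerNorm output; the two layers differ only when the mean is nonzero.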

### Filter response normalization (FRN)

Filter response normalization (FRN), proposed by Singh and Krishnan in 2019, eliminates batch dependence by normalizing each channel of each sample using only the mean of the squared activations across spatial dimensions. It pairs with a thresholded linear unit (TLU) activation function, which replaces the fixed zero threshold of ReLU with a learnable one, so values below that threshold are clamped to it. FRN was shown to outperform BatchNorm in image classification with small batch sizes and to match it at large batch sizes [8].
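
A minimal NumPy sketch of FRN followed by TLU, under the description above. The per-channel parameter shapes, the fixed `eps`, and the threshold value of -0.5 are assumptions for illustration; in a real layer `gamma`, `beta`, and `tau` would be learned.

```python
import numpy as np

def frn_tlu(x, gamma, beta, tau, eps=1e-6):
    """FRN followed by TLU for x of shape (batch, channels, height, width)."""
    nu2 = np.mean(x**2, axis=(2, 3), keepdims=True)   # mean squared activation
    y = gamma * x / np.sqrt(nu2 + eps) + beta         # no mean subtraction
    return np.maximum(y, tau)                         # thresholded linear unit

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 3, 4, 4))
shape = (1, 3, 1, 1)                                  # one parameter per channel
gamma, beta, tau = np.ones(shape), np.zeros(shape), np.full(shape, -0.5)

out = frn_tlu(x, gamma, beta, tau)
```

No statistic crosses the batch axis, so the output for a sample does not change when the rest of the batch does.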

### Power normalization

Power normalization, introduced by Shen and colleagues in 2020, was developed for transformer models in NLP, where batch normalization tends to perform poorly. It replaces the per-batch variance with a moving average and scales activations by their quadratic mean without subtracting the mean. The authors reported improved performance over LayerNorm on machine translation and language modeling benchmarks, although the technique has not been widely adopted in production systems [9].

### Switchable normalization

Switchable normalization, proposed by Luo and colleagues in 2018, dynamically chooses among BatchNorm, LayerNorm, and InstanceNorm by learning a softmax weight over the three [10]. Different layers in the network can therefore use different normalizers, with the network selecting the most appropriate one for each. The approach yielded modest improvements on image classification but added implementation complexity.

### Local response normalization

Local response normalization (LRN), used in the AlexNet architecture (Krizhevsky and colleagues, 2012), normalizes each activation by a local sum of squares across nearby channels. While LRN was an important component of early deep convolutional networks, later work showed it is largely unnecessary when other techniques (such as BatchNorm) are used, and modern architectures rarely include it.

### Comparison of normalization layers

| Normalization layer | Normalizes over | Batch dependent | Typical use case | Original paper |
|---|---|---|---|---|
| [Batch normalization](/wiki/batch_normalization) | Batch + spatial | Yes | CNNs with large batches | Ioffe and Szegedy, 2015 (arxiv 1502.03167) |
| [Layer normalization](/wiki/layer_normalization) | All neurons in a layer | No | Transformers, RNNs | Ba, Kiros, and Hinton, 2016 (arxiv 1607.06450) |
| Instance normalization | Single channel, single sample | No | Style transfer, GANs | Ulyanov, Vedaldi, and Lempitsky, 2016 (arxiv 1607.08022) |
| [Group normalization](/wiki/group_normalization) | Groups of channels, single sample | No | CNNs with small batches | Wu and He, 2018 (arxiv 1803.08494) |
| Weight normalization | Weight vector reparameterization | No | Generative models, RL | Salimans and Kingma, 2016 (arxiv 1602.07868) |
| [RMSNorm](/wiki/rmsnorm) | All neurons (RMS only) | No | Large language models | Zhang and Sennrich, 2019 (arxiv 1910.07467) |
| Filter response normalization | Per-sample, per-channel spatial | No | Small-batch CNNs | Singh and Krishnan, 2019 (arxiv 1911.09737) |
| Power normalization | Channel-wise with running variance | Partial | NLP transformers | Shen et al., 2020 (arxiv 2003.07845) |
| Switchable normalization | Mixture of BN, LN, IN | Partial | General CNNs | Luo et al., 2018 (arxiv 1806.10779) |
| Local response normalization | Adjacent channels | No | Legacy CNNs (AlexNet) | Krizhevsky et al., 2012 |

### Where each layer is applied in a typical block

| Architecture family | Conventional placement | Common normalization choice |
|---|---|---|
| Vanilla CNN | After each conv, before activation | Batch normalization |
| ResNet | Inside residual blocks, before activation | Batch normalization |
| Transformer (post-LN) | After attention and feed-forward sublayer outputs | Layer normalization |
| Transformer (pre-LN) | Before attention and feed-forward sublayers | Layer normalization or RMSNorm |
| Vision Transformer (ViT) | Pre-LN | Layer normalization |
| Modern LLM (LLaMA, Mistral, Qwen, DeepSeek) | Pre-LN | RMSNorm |
| Style transfer | After conv layers | Instance normalization or AdaIN |
| Object detection backbone with small batches | Inside the backbone | Group normalization or synchronous BN |

## What is the difference between pre-LN and post-LN in transformers?

The original [Transformer](/wiki/transformer) architecture introduced by Vaswani and colleagues in 2017 placed LayerNorm after each sublayer's residual connection (post-norm or post-LN) [15]. This means each sublayer computes:

```
x' = LayerNorm(x + Sublayer(x))
```

While post-LN works well for moderately deep transformers (such as the original 6-layer encoder-decoder), training very deep post-LN transformers requires careful learning rate warmup and is prone to instability. Xiong and colleagues showed in 2020 (arxiv 2002.04745) that placing LayerNorm before each sublayer (pre-norm or pre-LN) yields a much better-conditioned optimization landscape and removes the need for warmup [7]. The pre-LN formulation is:

```
x' = x + Sublayer(LayerNorm(x))
```

The key practical difference is that gradients flow more smoothly through the residual path in pre-LN, because the residual connection is no longer crossed by a normalization layer. This makes very deep transformers easier to train. Most modern large language models, including [GPT-3](/wiki/gpt-3), [LLaMA](/wiki/llama), Mistral, and DeepSeek, use the pre-LN formulation. A small number of recent architectures combine both approaches (so-called sandwich norm or hybrid norm) by placing LayerNorm both before and after each sublayer.
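
The two placements can be compared with a small NumPy sketch. The `sublayer` here is a made-up stand-in (a single tanh-activated linear map) for attention or a feed-forward network, and the LayerNorm omits learned parameters.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_ln_block(x, sublayer):
    return layer_norm(x + sublayer(x))      # normalization sits on the residual path

def pre_ln_block(x, sublayer):
    return x + sublayer(layer_norm(x))      # residual path is a pure identity

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(16, 16))
sublayer = lambda h: np.tanh(h @ W)         # stand-in for attention or feed-forward

x = rng.normal(size=(4, 16))
h_post, h_pre = x, x
for _ in range(12):                         # stack twelve identical blocks
    h_post = post_ln_block(h_post, sublayer)
    h_pre = pre_ln_block(h_pre, sublayer)
```

In the pre-LN stack the input `x` reaches the output through additions only, which is why gradients pass through unchanged along that path. The hidden state is therefore not normalized at the end, and pre-LN models such as GPT-2 add one final LayerNorm after the last block.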

## Which normalization do modern LLMs use?

The choice of normalization layer in modern large language models has converged on a small set of preferences:

| Model family | Normalization | Placement |
|---|---|---|
| BERT (2018) | LayerNorm | Post-LN |
| GPT-2 (2019) | LayerNorm | Pre-LN |
| GPT-3 (2020) | LayerNorm | Pre-LN |
| T5 (2019) | RMSNorm | Pre-LN |
| LLaMA / LLaMA 2 / LLaMA 3 | RMSNorm | Pre-LN |
| Mistral 7B | RMSNorm | Pre-LN |
| Qwen / Qwen 2 / Qwen 2.5 | RMSNorm | Pre-LN |
| DeepSeek / DeepSeek V2 / V3 | RMSNorm | Pre-LN |
| Gemma | RMSNorm | Pre-LN |
| PaLM (2022) | RMSNorm | Pre-LN |
| Claude (Anthropic) | Not publicly documented | Not publicly documented |

The move from LayerNorm to RMSNorm in flagship LLMs is driven primarily by efficiency. RMSNorm removes the mean-centering step, eliminating a sum reduction and a subtraction at every token in every layer. With long sequences, large hidden dimensions, and very deep stacks, this saving compounds. The LLaMA paper adopted RMSNorm together with pre-normalization to improve training stability [16], and the RMSNorm authors reported performance comparable to LayerNorm [6]. As a result, most newer transformer-based models default to RMSNorm.

## Are normalization layers necessary?

A recurring research question is whether explicit normalization layers are even necessary. Several approaches remove them entirely or reduce how much training stability depends on them:

- **Fixup initialization** (Zhang and colleagues, 2019) carefully scales the initial weights of residual networks so that the network is well-conditioned at initialization without any normalization layer. With Fixup, very deep ResNets train successfully without BatchNorm [17].
- **NFNets** (Brock and colleagues, 2021) are normalizer-free networks that use weight standardization and adaptive gradient clipping to remain stable. They achieved state-of-the-art image classification accuracy without BatchNorm [18].
- **DeepNorm** (Wang and colleagues, 2022) is a modified residual scaling combined with post-LN that allows transformers with up to 1,000 layers to train stably [19].
- **Dynamic Tanh (DyT)**, proposed by Zhu, Chen, He, LeCun, and Liu and presented at CVPR 2025 (arxiv 2503.10622), replaces normalization layers with the element-wise operation DyT(x) = tanh(alpha * x), where alpha is a learnable scalar [25]. The idea is inspired by the observation that LayerNorm in transformers often produces tanh-like, S-shaped input-output mappings. The authors reported that "Transformers with DyT can match or exceed the performance of their normalized counterparts, mostly without hyperparameter tuning" across vision (ViT, ConvNeXt), self-supervised learning (MAE, DINO), diffusion (DiT), large language models (LLaMA), speech, and DNA sequence modeling [25]. The technique is still under active investigation and has not yet been adopted in any production frontier LLM, but it represents a continuing effort to find simpler alternatives to explicit normalization.
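
The DyT operation from the last item can be sketched in a few lines of NumPy. The optional `gamma` and `beta` arguments are an assumption that DyT keeps the scale and shift of the layer it replaces; with their defaults the function reduces to the tanh(alpha * x) form given above.

```python
import numpy as np

def dyt(x, alpha, gamma=1.0, beta=0.0):
    """Dynamic Tanh: element-wise tanh(alpha * x) with optional scale and shift."""
    return gamma * np.tanh(alpha * x) + beta

x = np.array([-100.0, -1.0, 0.0, 1.0, 100.0])
y = dyt(x, alpha=0.5)
```

Unlike a normalization layer, the function computes no mean or variance: each element is transformed independently, and extreme inputs are squashed toward -1 or 1.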

Despite these proposals, normalization layers remain the dominant approach in production deep learning systems because they are simple, robust across hyperparameter choices, and well-supported in standard frameworks.

## How does normalization affect training speed and convergence?

Normalization, whether applied to the input data or through internal network layers, consistently accelerates training and improves convergence. The main mechanisms are:

1. **Smoother loss landscape.** Normalized inputs and activations produce a loss function with more uniform curvature. The optimizer can use a larger [learning rate](/wiki/learning_rate) without overshooting, and gradient updates point more directly toward the optimum.
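2. **Reduced internal covariate shift.** Normalization layers keep the distribution of layer inputs relatively stable throughout training, so each layer does not need to continually adapt to shifting input distributions. This was the original motivation for BatchNorm [1], although later analysis attributes most of its benefit to a smoother loss landscape rather than to reduced covariate shift [12].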
3. **Better gradient flow.** By keeping activations in a moderate numeric range, normalization reduces the likelihood of vanishing or exploding gradients, especially in deep networks.
4. **Implicit regularization.** Some normalization methods (notably BatchNorm) introduce noise through mini-batch statistics, which can act as a mild regularizer similar to [dropout](/wiki/dropout).
5. **Reduced sensitivity to initialization.** Networks with normalization layers train successfully across a wider range of weight initialization schemes. This is particularly valuable when designing new architectures, where choosing a specific initialization can be a significant hyperparameter.

The original BatchNorm paper reported that a state-of-the-art image classifier reached the same accuracy with 14 times fewer training steps once batch normalization was added [1]. Similarly, choosing the right input normalization scheme can reduce the number of [epochs](/wiki/epoch) needed for convergence by an order of magnitude for algorithms like stochastic gradient descent [24].
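
The effect of input scaling on convergence can be illustrated with a small experiment. The sketch below runs plain batch gradient descent on a linear regression problem twice: once on raw features with very different ranges, and once on standardized features. The data, learning rates, tolerance, and step budget are all made-up values chosen for illustration, and the function names are hypothetical.

```python
def standardize(rows):
    """Z-score every column using its own mean and population standard deviation."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [(sum((v - m) ** 2 for v in c) / len(c)) ** 0.5
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def steps_to_converge(rows, targets, lr, tol=1e-6, max_steps=10_000):
    """Batch gradient descent on mean squared error for a linear model.

    Returns the number of updates needed to push the loss below `tol`,
    or `max_steps` if the budget runs out first.
    """
    n, d = len(rows), len(rows[0])
    w, b = [0.0] * d, 0.0
    for step in range(max_steps):
        errs = [sum(wj * xj for wj, xj in zip(w, row)) + b - y
                for row, y in zip(rows, targets)]
        if sum(e * e for e in errs) / n < tol:
            return step
        for j in range(d):
            w[j] -= lr * 2.0 / n * sum(e * row[j] for e, row in zip(errs, rows))
        b -= lr * 2.0 / n * sum(errs)
    return max_steps


# Made-up housing data: room count (1 to 5) and floor area (900 to 2100).
rows = [[1, 1500], [2, 900], [3, 2100], [4, 1200], [5, 1800]]
targets = [3 * r + 0.002 * a + 1 for r, a in rows]

# Raw features: the large-scale column forces a tiny learning rate to stay stable.
raw_steps = steps_to_converge(rows, targets, lr=1e-7)
# Standardized features: a learning rate millions of times larger is stable.
scaled_steps = steps_to_converge(standardize(rows), targets, lr=0.4)

print("raw features:         ", raw_steps, "steps")
print("standardized features:", scaled_steps, "steps")
```

On the raw features the step budget runs out long before the loss reaches the tolerance, because the learning rate that keeps the floor-area weight stable is far too small for the room-count weight and the intercept. After standardization all directions have similar curvature, so one moderate learning rate suits every parameter.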

## Practical implementation guide

A pragmatic workflow for deciding how to normalize tabular data and where to place internal normalization layers in a network looks like this.

### For tabular preprocessing

1. **Start with `StandardScaler`** if features are roughly Gaussian and the model is linear, an SVM, or a small neural network.
2. **Use `RobustScaler`** if the data contains outliers that you cannot remove. The interquartile range will not be distorted by extreme values.
3. **Use `MinMaxScaler` to [0, 1]** if the model expects bounded inputs (for example, when the features must share the [0, 1] range of another model's sigmoid output).
4. **Use `MaxAbsScaler` for sparse data** to preserve the location of zero entries.
5. **Apply a `PowerTransformer`** before standardization if the feature is heavily skewed and you want to make it more Gaussian.
6. **Use `QuantileTransformer`** as a last resort for heterogeneous tabular data whose columns follow very different distributions and feed into neural models.
7. **Skip normalization** for tree-based models such as XGBoost, LightGBM, and random forests, because their split thresholds are unaffected by monotonic rescaling of a feature.
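
The difference between the first three choices is easiest to see on a column that contains an outlier. The following sketch assumes scikit-learn and NumPy are installed and uses a made-up feature column; it measures how much of the scaled range the five typical values still occupy under each scaler.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical feature column: five typical values and one extreme outlier.
x = np.array([[10.0], [12.0], [11.0], [13.0], [12.0], [500.0]])

spreads = {}
for scaler in (StandardScaler(), RobustScaler(), MinMaxScaler()):
    scaled = scaler.fit_transform(x).ravel()
    inliers = scaled[:5]  # the five typical values
    spreads[type(scaler).__name__] = float(inliers.max() - inliers.min())
    print(f"{type(scaler).__name__:>14}: {np.round(scaled, 3)}")

print(spreads)
```

With `StandardScaler` and `MinMaxScaler` the outlier dominates the fitted statistics, so the typical values are compressed into a narrow band. `RobustScaler` centers on the median and divides by the interquartile range, so the typical values keep a usable spread.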

### For internal normalization layers

1. **Choose BatchNorm** for image classification, segmentation, and detection with batch sizes >= 32.
2. **Choose GroupNorm** for tasks with very small batch sizes (<= 4), particularly in detection backbones and 3D vision.
3. **Choose LayerNorm** for sequence models, recurrent networks, and standard transformers.
4. **Choose RMSNorm** when training a large language model from scratch and efficiency at scale matters.
5. **Choose InstanceNorm** for neural style transfer and image-to-image translation.
6. **Use pre-LN** for transformers deeper than ~12 layers.
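
The layer types in this list differ mainly in which axes the statistics are computed over. The NumPy sketch below makes that explicit for a tensor in (N, C, H, W) layout. It is a simplified illustration: the learnable scale and shift, running averages, and framework-specific details are omitted, and the helper name `normalize` is hypothetical.

```python
import numpy as np


def normalize(x, axes, eps=1e-5):
    """Subtract the mean and divide by the standard deviation taken over `axes`."""
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


rng = np.random.default_rng(0)
n, c, h, w = 8, 4, 5, 5
x = rng.normal(loc=3.0, scale=2.0, size=(n, c, h, w))  # layout: (N, C, H, W)

batch_norm = normalize(x, axes=(0, 2, 3))   # one statistic per channel, shared across the batch
layer_norm = normalize(x, axes=(1, 2, 3))   # one statistic per sample
instance_norm = normalize(x, axes=(2, 3))   # one statistic per sample and channel

# GroupNorm: split the channels into groups and normalize each group per sample.
groups = 2
grouped = x.reshape(n, groups, c // groups, h, w)
group_norm = normalize(grouped, axes=(2, 3, 4)).reshape(x.shape)

# RMSNorm: divide by the root mean square only, with no mean subtraction
# (taken over the same axes as LayerNorm in this sketch).
rms_norm = x / np.sqrt((x ** 2).mean(axis=(1, 2, 3), keepdims=True) + 1e-5)

# With a single sample, per-feature batch statistics carry no information:
# every value equals its own mean, so the output collapses to zeros.
single = normalize(np.array([[1.0, -2.0, 3.0]]), axes=(0,))
print(single)
```

Only the BatchNorm line includes axis 0, the batch axis, which is why it is the only variant whose output for one sample depends on the other samples in the mini-batch and why it degrades as the batch shrinks.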

## What is database normalization?

In database design, "normalization" refers to the process of organizing a relational database to reduce redundancy and improve data integrity. This is an entirely different concept from normalization in machine learning. Database normalization follows a series of "normal forms" (1NF, 2NF, 3NF, BCNF, and higher) that define rules about how data should be divided across tables and how relationships should be structured. While both uses of the term involve imposing structure and consistency, they operate in completely different domains.
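
For contrast with the numerical techniques in the rest of this article, the sketch below shows the database sense of the word using Python's built-in `sqlite3` module. The table and column names are hypothetical. Customer details are stored once and referenced by key, so an update touches a single row no matter how many orders refer to it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized layout: customer details are stored once and referenced by id.
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        item TEXT
    );
    INSERT INTO customers VALUES (1, 'Ada', 'ada@example.com');
    INSERT INTO orders VALUES (10, 1, 'keyboard'), (11, 1, 'monitor');
""")

# Changing the email touches exactly one row, however many orders exist.
conn.execute("UPDATE customers SET email = 'ada@new.example.com' WHERE id = 1")

rows = conn.execute("""
    SELECT o.item, c.email
    FROM orders o JOIN customers c ON c.id = o.customer_id
    ORDER BY o.id
""").fetchall()
print(rows)
```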

## Common mistakes

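- **Fitting the scaler on the entire dataset before splitting.** This leaks statistics from the validation and test sets into training and makes evaluation scores look better than they are. It is among the most common mistakes in applied machine learning. Always fit on the training set only.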
- **Forgetting to save the fitted scaler.** When deploying a model, the fitted scaler must be saved alongside the model weights so the same parameters are applied to incoming production data.
- **Normalizing target variables without inverting at prediction time.** If you scale the regression target during training, remember to invert the transformation when reporting predictions in the original units.
- **Mixing scalers across cross-validation folds.** Use a `Pipeline` to ensure the scaler is refit within each fold; otherwise the cross-validated score will be optimistically biased.
- **Standardizing one-hot encoded features.** Z-scoring binary indicators changes their semantics without improving learning. Leave categorical encodings alone.
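- **Applying BatchNorm with batch size 1.** BatchNorm divides by the square root of the mini-batch variance plus a small epsilon. With one sample and no spatial dimensions, every feature equals its own mean, so the variance is zero and the normalized output collapses to the learned shift regardless of the input. Use LayerNorm or GroupNorm in this case.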
- **Forgetting `model.eval()` at inference.** BatchNorm and dropout behave differently during evaluation. If the model stays in training mode, BatchNorm uses the current mini-batch statistics instead of its running averages and dropout keeps zeroing activations, so predictions become inconsistent.
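
The first four mistakes can be avoided together by wrapping the scaler and the model in one scikit-learn pipeline. The following is a minimal sketch on synthetic data, assuming scikit-learn and NumPy are installed; the dataset sizes and the choice of logistic regression are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, split before any scaling statistics are computed.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Inside cross-validation the pipeline refits the scaler on the
# training portion of every fold, so no fold sees held-out statistics.
model = make_pipeline(StandardScaler(), LogisticRegression())
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

# fit() learns the scaling statistics from the training split only;
# score() reuses those stored statistics on the held-out split.
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

# The fitted scaler lives inside the pipeline, next to the model.
scaler = model.named_steps["standardscaler"]
print("scaler means match training data:",
      np.allclose(scaler.mean_, X_train.mean(axis=0)))
```

Because the fitted scaler is a step of the pipeline, persisting the pipeline object (for example with `joblib.dump`) stores the scaling parameters together with the model, which also addresses the second mistake in the list.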

## Explain like I'm 5 (ELI5)

Imagine you and your friends are comparing how far you can throw a ball, how fast you can run, and how many pushups you can do. The throwing distances are in meters, the running times are in seconds, and pushups are just a count. If you try to add all three numbers together to pick a winner, the throwing distance (maybe 30 meters) would matter way more than the pushups (maybe 10). That is not fair.

Normalization is like converting each score to a number between 0 and 10 so every activity counts equally. The fastest runner gets a 10, the slowest gets a 0, and everyone else falls in between. Now when you add up the scores, no single activity can unfairly dominate.

In machine learning, computers face the same problem. Different measurements can have wildly different ranges, and normalization makes them all comparable so the computer can learn from every measurement equally.

Deep neural networks face a related problem inside their own layers. As the network learns, the numbers flowing between layers can grow very large or very small, which makes the math unstable. Internal normalization layers act like a thermostat: at every layer, they squeeze the numbers back to a sensible range so the next layer always receives well-behaved input.

## References

1. Ioffe, S. and Szegedy, C. (2015). "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift." *Proceedings of the 32nd International Conference on Machine Learning (ICML)*, 37:448-456. https://arxiv.org/abs/1502.03167
2. Ba, J.L., Kiros, J.R., and Hinton, G.E. (2016). "Layer Normalization." *arXiv preprint arXiv:1607.06450*. https://arxiv.org/abs/1607.06450
3. Ulyanov, D., Vedaldi, A., and Lempitsky, V. (2016). "Instance Normalization: The Missing Ingredient for Fast Stylization." *arXiv preprint arXiv:1607.08022*. https://arxiv.org/abs/1607.08022
4. Wu, Y. and He, K. (2018). "Group Normalization." *Proceedings of the European Conference on Computer Vision (ECCV)*, pp. 3-19. https://arxiv.org/abs/1803.08494
5. Salimans, T. and Kingma, D.P. (2016). "Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks." *Advances in Neural Information Processing Systems (NeurIPS)*, 29. https://arxiv.org/abs/1602.07868
6. Zhang, B. and Sennrich, R. (2019). "Root Mean Square Layer Normalization." *Advances in Neural Information Processing Systems (NeurIPS)*, 32. https://arxiv.org/abs/1910.07467
7. Xiong, R., Yang, Y., He, D., Zheng, K., Zheng, S., Xing, C., Zhang, H., Lan, Y., Wang, L., and Liu, T. (2020). "On Layer Normalization in the Transformer Architecture." *Proceedings of the 37th International Conference on Machine Learning (ICML)*. https://arxiv.org/abs/2002.04745
8. Singh, S. and Krishnan, S. (2019). "Filter Response Normalization Layer: Eliminating Batch Dependence in the Training of Deep Neural Networks." *arXiv preprint arXiv:1911.09737*. https://arxiv.org/abs/1911.09737
9. Shen, S., Yao, Z., Gholami, A., Mahoney, M.W., and Keutzer, K. (2020). "PowerNorm: Rethinking Batch Normalization in Transformers." *Proceedings of the 37th International Conference on Machine Learning (ICML)*. https://arxiv.org/abs/2003.07845
10. Luo, P., Ren, J., Peng, Z., Zhang, R., and Li, J. (2019). "Differentiable Learning-to-Normalize via Switchable Normalization." *International Conference on Learning Representations (ICLR)*. https://arxiv.org/abs/1806.10779
11. Pedregosa, F. et al. (2011). "Scikit-learn: Machine Learning in Python." *Journal of Machine Learning Research*, 12:2825-2830. https://scikit-learn.org/stable/modules/preprocessing.html
12. Santurkar, S., Tsipras, D., Ilyas, A., and Madry, A. (2018). "How Does Batch Normalization Help Optimization?" *Advances in Neural Information Processing Systems (NeurIPS)*, 31. https://arxiv.org/abs/1805.11604
13. Box, G.E.P. and Cox, D.R. (1964). "An Analysis of Transformations." *Journal of the Royal Statistical Society. Series B*, 26(2):211-252.
14. Yeo, I. and Johnson, R.A. (2000). "A New Family of Power Transformations to Improve Normality or Symmetry." *Biometrika*, 87(4):954-959.
15. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., and Polosukhin, I. (2017). "Attention Is All You Need." *Advances in Neural Information Processing Systems (NeurIPS)*, 30. https://arxiv.org/abs/1706.03762
16. Touvron, H. et al. (2023). "LLaMA: Open and Efficient Foundation Language Models." *arXiv preprint arXiv:2302.13971*. https://arxiv.org/abs/2302.13971
17. Zhang, H., Dauphin, Y.N., and Ma, T. (2019). "Fixup Initialization: Residual Learning Without Normalization." *International Conference on Learning Representations (ICLR)*. https://arxiv.org/abs/1901.09321
18. Brock, A., De, S., Smith, S.L., and Simonyan, K. (2021). "High-Performance Large-Scale Image Recognition Without Normalization." *Proceedings of the 38th International Conference on Machine Learning (ICML)*. https://arxiv.org/abs/2102.06171
19. Wang, H., Ma, S., Dong, L., Huang, S., Zhang, D., and Wei, F. (2022). "DeepNet: Scaling Transformers to 1,000 Layers." *arXiv preprint arXiv:2203.00555*. https://arxiv.org/abs/2203.00555
20. Goodfellow, I., Bengio, Y., and Courville, A. (2016). *Deep Learning*. MIT Press. Chapter 8: Optimization for Training Deep Models. https://www.deeplearningbook.org/
21. Bishop, C.M. (2006). *Pattern Recognition and Machine Learning*. Springer. Section 5.1: Feed-forward Network Functions.
22. Stanford CS231n. "Convolutional Neural Networks for Visual Recognition: Neural Networks Part 2." Course notes. https://cs231n.github.io/neural-networks-2/
23. PyTorch documentation. "torch.nn.LayerNorm," "torch.nn.BatchNorm2d," "torch.nn.RMSNorm." https://pytorch.org/docs/stable/nn.html
24. Google Developers. "Numerical Data: Normalization." *Machine Learning Crash Course*. https://developers.google.com/machine-learning/crash-course/numerical-data/normalization
25. Zhu, J., Chen, X., He, K., LeCun, Y., and Liu, Z. (2025). "Transformers without Normalization." *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)*. https://arxiv.org/abs/2503.10622