Normalization
Last reviewed
May 9, 2026
Sources
24 citations
Review status
Source-backed
Revision
v4 · 5,890 words
See also: Machine learning terms
In machine learning and data science, normalization refers to the process of scaling data to a standard range or distribution. The term is used broadly to describe a family of techniques applied during data preprocessing, within neural network architectures, and in database design. By transforming features to a common scale, normalization helps algorithms converge faster, produce more stable gradients, and treat all input dimensions equitably.
Normalization is especially important for algorithms that are sensitive to the magnitude of input features, including gradient descent-based optimizers, distance-based methods such as k-nearest neighbors and k-means clustering, and most neural network architectures. Without normalization, features with larger numeric ranges can dominate the learning process, leading to slow convergence or poor generalization.
The word "normalization" carries different meanings in different parts of machine learning. In tabular preprocessing it usually means rescaling each column to a fixed range or distribution. Inside a deep network it means inserting a layer that standardizes activations on the fly. In information retrieval it means dividing each row vector by its length. The mathematics differ, but the underlying goal is the same: remove arbitrary scale so that downstream computation behaves predictably.
Data normalization, sometimes called feature scaling, rescales individual features so they occupy a comparable numeric range before being fed into a model. The choice of method depends on the data distribution, the presence of outliers, and the downstream algorithm.
Min-max normalization linearly maps each feature to a fixed interval, most commonly [0, 1]. The formula is:
x' = (x - x_min) / (x_max - x_min)
To scale to an arbitrary range [a, b]:
x' = a + (x - x_min)(b - a) / (x_max - x_min)
Min-max normalization preserves the original distribution shape and is straightforward to implement. However, it is highly sensitive to outliers: a single extreme value can compress all other values into a narrow band. In scikit-learn, this method is available as MinMaxScaler.
A practical concern with min-max scaling is what happens when a value at inference time falls outside the range observed during training. The transformed value will lie outside [0, 1], which is acceptable for most models but breaks the assumption that all inputs are bounded. When the input data has known semantic limits (for example, 8-bit image pixel intensities in [0, 255]), those limits should be used directly rather than the empirical minimum and maximum.
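As a minimal sketch of the two formulas above, the following pure-Python snippet fits the minimum and maximum on a training list and reuses them for a value that was not seen during fitting. The helper names `fit_min_max` and `min_max_transform` and the sample values are illustrative assumptions, not part of any library.

```python
def fit_min_max(values):
    """Return the (min, max) pair observed in the training values."""
    return min(values), max(values)


def min_max_transform(values, lo, hi, a=0.0, b=1.0):
    """Map values linearly so that lo -> a and hi -> b."""
    if hi == lo:
        raise ValueError("constant feature: x_max equals x_min")
    return [a + (x - lo) * (b - a) / (hi - lo) for x in values]


train = [10.0, 20.0, 30.0, 50.0]
lo, hi = fit_min_max(train)                      # parameters come from training data only

scaled_train = min_max_transform(train, lo, hi)  # [0.0, 0.25, 0.5, 1.0]
scaled_new = min_max_transform([60.0], lo, hi)   # [1.25], outside [0, 1]
scaled_sym = min_max_transform(train, lo, hi, a=-1.0, b=1.0)  # [-1.0, -0.5, 0.0, 1.0]
```

The value 60 exceeds the training maximum of 50, so it maps to 1.25. This is the out-of-range behavior described above.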
Z-score normalization, also called standardization, transforms each feature so that it has a mean of zero and a standard deviation of one:
x' = (x - mu) / sigma
where mu is the sample mean and sigma is the sample standard deviation. The resulting values are unbounded and can be negative. Z-score normalization handles outliers more gracefully than min-max normalization because outliers shift the mean and inflate the variance but do not compress the rest of the distribution into a tiny range. It is the default choice for many linear models, support vector machines, and principal component analysis. In scikit-learn, this method is provided by StandardScaler.
A closely related question is whether to use the population variance (dividing by n) or the sample variance (dividing by n-1) when estimating sigma. For normalization purposes the difference is negligible at typical dataset sizes. The scikit-learn implementation uses the population variance for consistency with the numpy.std default.
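To make the population-versus-sample distinction concrete, here is a small sketch that standardizes the same list both ways. The function name `standardize` and its `ddof` argument (borrowed from NumPy's naming) are assumptions for illustration only.

```python
import math


def standardize(values, ddof=0):
    """Z-score a list; ddof=0 divides by n, ddof=1 divides by n - 1."""
    n = len(values)
    mu = sum(values) / n
    variance = sum((x - mu) ** 2 for x in values) / (n - ddof)
    sigma = math.sqrt(variance)
    return [(x - mu) / sigma for x in values]


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # mean 5, population std 2
z_population = standardize(data, ddof=0)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
z_sample = standardize(data, ddof=1)      # same values shrunk by sqrt(7 / 8)
```

With only eight points the two results differ by a factor of about 0.935. The factor sqrt((n - 1) / n) approaches 1 as n grows, which is why the choice rarely matters in practice.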
Max absolute scaling divides each feature value by the maximum absolute value of that feature:
x' = x / max(|x|)
This maps the data to the range [-1, 1] without shifting the center, which means it does not destroy sparsity. It is useful for sparse datasets where preserving zero entries matters, including text feature matrices produced by CountVectorizer or TfidfVectorizer. In scikit-learn, this method is available as MaxAbsScaler.
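The sparsity-preserving property can be seen in a short sketch. The function `max_abs_scale` and the sample column are hypothetical. Note that the divisor is the largest absolute value (4 here), not the largest signed value (2).

```python
def max_abs_scale(values):
    """Divide every entry by the largest absolute value in the feature."""
    scale = max(abs(x) for x in values)
    if scale == 0:
        return list(values)  # all-zero feature: nothing to scale
    return [x / scale for x in values]


sparse_column = [0.0, 0.0, -4.0, 0.0, 2.0, 0.0]
scaled = max_abs_scale(sparse_column)  # [0.0, 0.0, -1.0, 0.0, 0.5, 0.0]
```

Because nothing is subtracted before dividing, every zero entry stays exactly zero.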
Robust scaling uses the median and interquartile range (IQR) instead of the mean and standard deviation:
x' = (x - Q2) / (Q3 - Q1)
where Q1, Q2, and Q3 are the 25th, 50th, and 75th percentiles, respectively. Because the median and IQR are not influenced by extreme values, robust scaling is the preferred choice when the data contains significant outliers that cannot be removed. In scikit-learn, the corresponding class is RobustScaler. The user can configure which percentiles to use through the quantile_range parameter, with the default (25.0, 75.0) corresponding to the interquartile range.
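The following sketch applies the robust scaling formula to a list with one extreme value. It assumes quartiles computed by `statistics.quantiles` with the "inclusive" method (linear interpolation between order statistics). Other quartile conventions would give slightly different numbers, and `robust_scale` is an illustrative name.

```python
import statistics


def robust_scale(values):
    """Center on the median and divide by the interquartile range."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(x - q2) / (q3 - q1) for x in values]


data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 1000.0]  # one extreme outlier
scaled = robust_scale(data)
# Q1 = 3, Q2 = 5, Q3 = 7, so the eight inliers land in [-1.0, 0.75]
# and the outlier maps to (1000 - 5) / 4 = 248.75.
```

The outlier remains far from the rest after scaling, but it does not change the spacing of the inliers, which is the point of using the median and IQR.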
L1 and L2 normalization scale each sample (row) so that it has unit norm:
L1 normalization: x' = x / sum(|x_i|)
L2 normalization: x' = x / sqrt(sum(x_i^2))
These methods are common in text classification and information retrieval, where document vectors are normalized to unit length before computing cosine similarity. After L2 normalization, the dot product between any two vectors equals their cosine similarity, which simplifies and speeds up similarity search. In scikit-learn, the Normalizer class handles both L1 and L2 normalization.
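A brief sketch can confirm the claim that the dot product of L2-normalized vectors equals their cosine similarity. All function names and the two example vectors are illustrative assumptions.

```python
import math


def l1_normalize(v):
    """Scale a vector so its absolute values sum to 1."""
    norm = sum(abs(x) for x in v)
    return [x / norm for x in v]


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


a = [3.0, 4.0]
b = [4.0, 3.0]
cos_direct = cosine_similarity(a, b)                   # 24 / 25 = 0.96
cos_via_unit = dot(l2_normalize(a), l2_normalize(b))   # same value, no division needed
l1_total = sum(abs(x) for x in l1_normalize(a))        # 1.0 up to rounding
```

Once vectors are stored in unit-length form, a similarity search only needs dot products, which is the speed-up mentioned above.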
When a feature spans many orders of magnitude or is heavily right-skewed, applying a logarithmic transformation can make it more amenable to learning. Common variants include the natural log (log(x)), log(1 + x) (which avoids issues at zero), and log10(x). Log transformations are widely used for features such as income, price, population counts, and gene expression levels. The plain logarithm is defined only for strictly positive values; for non-negative data containing zeros, log(1 + x), available as log1p, is the standard choice.
Power transformations seek to make a feature more closely resemble a Gaussian distribution by applying a parametric power function. The Box-Cox transformation, introduced by Box and Cox in 1964, is defined for strictly positive data as:
x' = (x^lambda - 1) / lambda if lambda != 0
x' = log(x) if lambda == 0
The parameter lambda is selected to maximize the log-likelihood of the transformed data under a normality assumption. The Yeo-Johnson transformation, introduced by Yeo and Johnson in 2000, extends Box-Cox to handle zero and negative values and is therefore more general:
For x >= 0:
x' = ((x + 1)^lambda - 1) / lambda if lambda != 0
x' = log(x + 1) if lambda == 0
For x < 0:
x' = -((-x + 1)^(2 - lambda) - 1) / (2 - lambda) if lambda != 2
x' = -log(-x + 1) if lambda == 2
In scikit-learn, both transformations are provided by PowerTransformer with the method argument set to 'box-cox' or 'yeo-johnson'. The class fits lambda separately for each feature.
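The piecewise definitions above translate directly into code. This sketch evaluates both transformations for a lambda chosen by hand. It does not estimate lambda by maximum likelihood as a fitted transformer would, and the function names are illustrative.

```python
import math


def box_cox(x, lam):
    """Box-Cox transform of a strictly positive value for a given lambda."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive input")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1) / lam


def yeo_johnson(x, lam):
    """Yeo-Johnson transform of any real value for a given lambda."""
    if x >= 0:
        if lam == 0:
            return math.log1p(x)
        return ((x + 1) ** lam - 1) / lam
    if lam == 2:
        return -math.log1p(-x)
    return -((-x + 1) ** (2 - lam) - 1) / (2 - lam)


bc_sqrt = box_cox(4.0, 0.5)        # (2 - 1) / 0.5 = 2.0
bc_log = box_cox(1.0, 0.0)         # log(1) = 0.0
yj_pos = yeo_johnson(3.0, 1.0)     # lambda = 1 leaves positive values unchanged: 3.0
yj_neg = yeo_johnson(-3.0, 1.0)    # and negative values too: -3.0
yj_log = yeo_johnson(-1.0, 2.0)    # -log(2)
```

Setting lambda to 1 makes Yeo-Johnson the identity on both branches, which is a convenient sanity check for any implementation.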
A quantile transformation maps each feature to a target distribution by ranking the values and applying the inverse cumulative distribution function of the target. Two target distributions are common: uniform on [0, 1] and standard normal. Because the transformation is rank-based, it is robust to outliers and produces an output with a fixed shape regardless of the input distribution. The downside is that the mapping, although monotonic, is nonlinear, so it distorts distances and linear correlations; values not seen during fitting are mapped by interpolating between the learned quantiles. In scikit-learn, this is provided by QuantileTransformer. Quantile transformation is useful when the goal is to make features comparable in shape, for example when feeding heterogeneous tabular data into a deep learning model.
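The rank-and-interpolate idea can be sketched for the uniform target as follows. This is a deliberately simplified version under stated assumptions: every training value serves as a quantile landmark, out-of-range inputs are clipped to 0 or 1, and the names `fit_quantiles` and `to_uniform` are hypothetical. Library implementations differ in detail.

```python
from bisect import bisect_left


def fit_quantiles(train):
    """The sorted training values act as the learned quantile landmarks."""
    return sorted(train)


def to_uniform(x, ref):
    """Map x to [0, 1] by its interpolated rank among the reference values."""
    if x <= ref[0]:
        return 0.0
    if x >= ref[-1]:
        return 1.0
    i = bisect_left(ref, x)            # ref[i - 1] < x <= ref[i]
    lo, hi = ref[i - 1], ref[i]
    frac = (x - lo) / (hi - lo)        # position between the two neighbors
    return (i - 1 + frac) / (len(ref) - 1)


ref = fit_quantiles([1000.0, 1.0, 100.0, 2.0, 10.0])
u_seen = to_uniform(10.0, ref)      # middle of five landmarks: 0.5
u_unseen = to_uniform(55.0, ref)    # halfway between 10 and 100: 0.625
u_clipped = to_uniform(5000.0, ref) # beyond the fitted range: 1.0
```

The training values span three orders of magnitude, yet the outputs are evenly spaced ranks, so a single huge value cannot stretch the scale.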
| Method | Formula | Output range | Sensitive to outliers | Preserves sparsity | scikit-learn class |
|---|---|---|---|---|---|
| Min-max | (x - x_min) / (x_max - x_min) | [0, 1] | Yes | No | MinMaxScaler |
| Z-score | (x - mu) / sigma | Unbounded | Moderate | No | StandardScaler |
| Max absolute | x / max(abs(x)) | [-1, 1] | Yes | Yes | MaxAbsScaler |
| Robust | (x - median) / IQR | Unbounded | No | No | RobustScaler |
| L1 norm | x / sum(abs(x)) | [-1, 1] per sample | No | Yes | Normalizer(norm='l1') |
| L2 norm | x / sqrt(sum(x^2)) | [-1, 1] per sample | No | Yes | Normalizer(norm='l2') |
| Box-Cox | (x^lambda - 1) / lambda | Approximately Gaussian | Reduced after fit | No | PowerTransformer(method='box-cox') |
| Yeo-Johnson | Piecewise power | Approximately Gaussian | Reduced after fit | No | PowerTransformer(method='yeo-johnson') |
| Quantile (uniform) | Rank / N | [0, 1] | No | No | QuantileTransformer(output_distribution='uniform') |
| Quantile (normal) | Inverse normal CDF of rank | Approximately N(0, 1) | No | No | QuantileTransformer(output_distribution='normal') |
| Log | log(1 + x) | Compressed | Reduced | No | FunctionTransformer(np.log1p) |
Normalization is most beneficial in the following scenarios:
Normalization is unnecessary or even counterproductive in some situations:
The terms "normalization" and "standardization" are often used interchangeably in practice, but they refer to different operations in a strict sense.
| | Normalization (min-max) | Standardization (z-score) |
|---|---|---|
| Goal | Scale values to a fixed range (e.g., [0, 1]) | Center values at zero with unit variance |
| Formula | (x - x_min) / (x_max - x_min) | (x - mu) / sigma |
| Output range | Bounded ([0, 1] or [-1, 1]) | Unbounded |
| Outlier behavior | Outliers compress the remaining values into a narrow band | Outliers shift the mean and widen the spread |
| Best for | Data with known bounds; neural network inputs | Data that is roughly Gaussian; linear models, SVMs, PCA |
In everyday conversation, many practitioners use "normalization" as an umbrella term for any kind of feature scaling. When precision matters, it is helpful to specify the exact method (for example, "min-max normalization" or "z-score standardization").
A critical rule when applying normalization in a machine learning pipeline is to fit the scaler on the training set only. The scaler's parameters (minimum, maximum, mean, standard deviation, median, or IQR) must be computed exclusively from training data. The test set and validation set should then be transformed using those same parameters.
If the scaler is fit on the entire dataset before splitting, information from the test set leaks into the training process, producing overly optimistic performance estimates that do not reflect real-world generalization. In scikit-learn, the recommended approach is to place the scaler inside a Pipeline together with the model, which automatically applies fit only to the training fold during cross-validation.
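```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# X_train, y_train, X_test, and y_test are assumed to come from an earlier split.
pipe = Pipeline([
    ('scaler', StandardScaler()),    # statistics are learned during fit only
    ('clf', LogisticRegression())
])
pipe.fit(X_train, y_train)           # fits the scaler, then the classifier
score = pipe.score(X_test, y_test)   # test data is scaled with training statistics
```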
The leakage rule extends beyond simple train/test splits. In time series forecasting, the scaler should be fit only on data observed up to the cutoff time of each evaluation point. In stratified k-fold cross-validation, the scaler should be fit on the training folds within each iteration. The general principle is that any statistic the scaler depends on must come exclusively from data the model is allowed to see at training time.
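For the time series case, a minimal sketch of the rule might look like this. The series, the cutoff index, and the helper names are made-up illustrations. The "leaky" statistics are computed only to show how much they would differ.

```python
import math


def fit_standardizer(values):
    """Return the mean and population standard deviation of the values."""
    mean = sum(values) / len(values)
    var = sum((x - mean) ** 2 for x in values) / len(values)
    return mean, math.sqrt(var)


def apply_standardizer(values, mean, std):
    return [(x - mean) / std for x in values]


series = [1.0, 2.0, 3.0, 4.0, 5.0, 50.0, 60.0, 70.0]  # chronological order
cutoff = 5
train, test = series[:cutoff], series[cutoff:]

mean, std = fit_standardizer(train)                 # uses only data before the cutoff
train_scaled = apply_standardizer(train, mean, std)
test_scaled = apply_standardizer(test, mean, std)   # reuses the training statistics

leaky_mean, leaky_std = fit_standardizer(series)    # what NOT to do: sees the future
```

Here the training mean is 3.0 while the leaky mean is 24.375, so fitting on the full series would quietly tell the model that a regime change is coming.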
Beyond preprocessing the input data, modern deep learning architectures apply normalization inside the network itself. These normalization layers stabilize activations during training, reduce sensitivity to weight initialization, and often allow higher learning rates. They all follow a common pattern: compute statistics over some subset of the activations, normalize using those statistics, and then apply a learned affine transformation (scale and shift).
The general form of an internal normalization layer is:
y = gamma * (x - mu) / sqrt(sigma^2 + epsilon) + beta
where mu and sigma^2 are the mean and variance computed over a chosen set of dimensions, epsilon is a small constant (typically 1e-5) added for numerical stability, and gamma and beta are learnable scale and shift parameters. The choice of dimensions over which mu and sigma^2 are computed is what distinguishes the various normalization layers from each other.
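To illustrate how the choice of dimensions changes the result, the sketch below applies the same formula to a small batch in two ways: per feature across the batch (BatchNorm-style) and per sample across features (LayerNorm-style). It assumes scalar gamma and beta fixed at 1 and 0, whereas real layers learn one pair per feature. All names are illustrative.

```python
import math


def normalize(vec, gamma=1.0, beta=0.0, eps=1e-5):
    """y = gamma * (x - mu) / sqrt(sigma^2 + eps) + beta over one list."""
    mu = sum(vec) / len(vec)
    var = sum((x - mu) ** 2 for x in vec) / len(vec)
    return [gamma * (x - mu) / math.sqrt(var + eps) + beta for x in vec]


def normalize_per_feature(batch):
    """Statistics over the batch axis, one set per feature column."""
    columns = [normalize(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]


def normalize_per_sample(batch):
    """Statistics over the feature axis, one set per sample row."""
    return [normalize(row) for row in batch]


batch = [[1.0, 10.0, 100.0],   # rows are samples, columns are features
         [2.0, 20.0, 200.0],
         [3.0, 30.0, 300.0]]

per_feature = normalize_per_feature(batch)  # every column becomes about [-1.22, 0, 1.22]
per_sample = normalize_per_sample(batch)    # every row is centered on its own mean
```

The per-feature result changes if a different set of samples shares the batch. The per-sample result does not, which is the practical distinction drawn in the sections that follow.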
Batch normalization (BatchNorm), introduced by Ioffe and Szegedy in 2015 (arxiv 1502.03167), normalizes activations across the mini-batch dimension. For each channel in a layer, it computes the mean and variance over all spatial positions and all samples in the current mini-batch, then normalizes accordingly. BatchNorm was originally motivated by the problem of "internal covariate shift," where the distribution of layer inputs changes as earlier layers update their weights.
BatchNorm enables the use of much higher learning rates and reduces the dependence on careful weight initialization. In the original paper, it achieved the same accuracy as a baseline model with 14 times fewer training steps. However, BatchNorm has notable limitations: its behavior depends on the mini-batch size, it performs poorly with very small batches, and its statistics differ between training (using mini-batch statistics) and inference (using running averages), which can cause inconsistencies. A 2018 study by Santurkar and colleagues argued that the actual benefit of BatchNorm comes from smoothing the loss landscape rather than from reducing covariate shift.
In PyTorch, BatchNorm is available as torch.nn.BatchNorm1d, torch.nn.BatchNorm2d, and torch.nn.BatchNorm3d, which differ only in the spatial rank of the input. During inference, the running mean and running variance computed during training are used in place of mini-batch statistics; this behavior is enabled by setting the module to evaluation mode with model.eval().
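The difference between training and evaluation mode can be sketched for a single feature. `BatchNormSketch` is a hypothetical class written for this illustration. It omits gamma and beta and simplifies details such as the variance estimator used for the running update, so it should not be read as the PyTorch implementation.

```python
import math


class BatchNormSketch:
    """Single-feature batch normalization with running statistics."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.momentum = momentum
        self.eps = eps
        self.running_mean = 0.0
        self.running_var = 1.0
        self.training = True

    def __call__(self, batch):
        if self.training:
            mean = sum(batch) / len(batch)
            var = sum((x - mean) ** 2 for x in batch) / len(batch)
            # Exponential moving average of the mini-batch statistics.
            self.running_mean += self.momentum * (mean - self.running_mean)
            self.running_var += self.momentum * (var - self.running_var)
        else:
            mean, var = self.running_mean, self.running_var
        return [(x - mean) / math.sqrt(var + self.eps) for x in batch]


bn = BatchNormSketch()
for _ in range(200):              # simulated training steps on similar batches
    bn([4.0, 6.0, 8.0, 10.0])

train_a = bn([5.0, 100.0])[0]     # training mode: output for 5.0 depends on its batch
train_b = bn([5.0, -3.0])[0]

bn.training = False               # analogous to calling model.eval()
eval_a = bn([5.0, 100.0])[0]      # evaluation mode: batch companions no longer matter
eval_b = bn([5.0, -3.0])[0]
```

In training mode the same input 5.0 is mapped to a negative value in one batch and a positive value in the other. In evaluation mode both calls return the same number.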
Layer normalization (LayerNorm), proposed by Ba, Kiros, and Hinton in 2016 (arxiv 1607.06450), computes normalization statistics across all neurons within a single layer for a single training example. Unlike BatchNorm, it does not depend on the batch dimension, making it straightforward to apply to recurrent neural networks and transformer architectures where sequence lengths may vary.
LayerNorm has become the standard normalization technique in transformer-based language models, including GPT and BERT. Its independence from batch size makes it well suited for settings where batch sizes are small or variable. Because the same statistics are used at training and inference time, there is no train/test mismatch and no need for running averages. In PyTorch, LayerNorm is provided by torch.nn.LayerNorm, which takes a normalized_shape argument specifying the trailing dimensions to normalize over.
Instance normalization (InstanceNorm), introduced by Ulyanov, Vedaldi, and Lempitsky in 2016, computes the mean and variance for each individual channel of each individual sample. This means each feature map in each image is normalized independently.
InstanceNorm gained prominence in neural style transfer, where it effectively strips instance-specific contrast information from images, allowing the network to focus on high-level style features. It is commonly used in generative adversarial networks (GANs) and image generation tasks. A close relative is adaptive instance normalization (AdaIN), introduced by Huang and Belongie in 2017, which uses the mean and variance from a style image as the affine parameters of the normalization, producing an effective real-time style transfer technique.
Group normalization (GroupNorm), proposed by Wu and He in 2018 (arxiv 1803.08494), divides channels into groups and computes normalization statistics within each group for each sample. It can be seen as a middle ground between LayerNorm (one group containing all channels) and InstanceNorm (each channel in its own group).
GroupNorm is particularly valuable in computer vision tasks where memory constraints force the use of small batch sizes (for example, in object detection and video classification). On ResNet-50 trained on ImageNet with a batch size of 2, GroupNorm achieved an error rate 10.6 percentage points lower than the equivalent BatchNorm model. The number of groups is a hyperparameter; the original paper recommends 32 groups as a sensible default.
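The grouping idea can be sketched for one sample represented as a list of channels, each holding a few spatial activations. The function `group_norm` is illustrative and leaves out the learned scale and shift. Setting the number of groups to 1 or to the channel count gives the LayerNorm-like and InstanceNorm-like extremes described above.

```python
import math


def group_norm(sample, num_groups, eps=1e-5):
    """Normalize one sample; sample is a list of channels (lists of activations)."""
    num_channels = len(sample)
    if num_channels % num_groups != 0:
        raise ValueError("channel count must be divisible by num_groups")
    size = num_channels // num_groups
    out = []
    for g in range(num_groups):
        group = sample[g * size:(g + 1) * size]
        flat = [x for channel in group for x in channel]
        mu = sum(flat) / len(flat)
        var = sum((x - mu) ** 2 for x in flat) / len(flat)
        denom = math.sqrt(var + eps)
        out.extend([(x - mu) / denom for x in channel] for channel in group)
    return out


sample = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0], [30.0, 40.0]]  # 4 channels, 2 positions

gn_two = group_norm(sample, num_groups=2)            # channels {0, 1} and {2, 3}
gn_layer_like = group_norm(sample, num_groups=1)     # one group with all channels
gn_instance_like = group_norm(sample, num_groups=4)  # each channel on its own
```

No other sample is involved in any of the three calls, which is why the result is independent of batch size.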
Weight normalization, introduced by Salimans and Kingma in 2016 (arxiv 1602.07868), takes a different approach. Instead of normalizing activations, it reparameterizes the weights of each layer as a product of a unit vector and a learnable scalar magnitude:
w = g * (v / ||v||)
where v is a vector of free parameters, ||v|| is its Euclidean norm, and g is a learned scalar. This decouples the magnitude of the weight vector from its direction, which improves the conditioning of the optimization problem and accelerates training. Weight normalization can be combined with mean-only batch normalization for even better performance. Compared to BatchNorm, it is independent of batch size and adds essentially no computational overhead. It is available in PyTorch as torch.nn.utils.weight_norm.
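The reparameterization is short enough to write out directly. In this sketch `weight_norm` is an illustrative helper, not the PyTorch utility, and the vectors are arbitrary.

```python
import math


def weight_norm(v, g):
    """Return w = g * v / ||v||: direction from v, magnitude from g."""
    norm = math.sqrt(sum(x * x for x in v))
    return [g * x / norm for x in v]


w = weight_norm([3.0, 4.0], g=2.0)             # about [1.2, 1.6]
w_rescaled = weight_norm([30.0, 40.0], g=2.0)  # same result: only the direction of v matters
w_length = math.sqrt(sum(x * x for x in w))    # equals g, here 2.0
```

Rescaling v leaves w unchanged, so the optimizer controls the length of the weight vector solely through g.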
RMSNorm, introduced by Zhang and Sennrich in 2019 (arxiv 1910.07467), simplifies LayerNorm by removing the mean-centering step. Instead of subtracting the mean and dividing by the standard deviation, RMSNorm divides activations only by their root mean square:
x' = x / RMS(x), where RMS(x) = sqrt((1/n) * sum(x_i^2))
The authors hypothesized that the re-centering component of LayerNorm is not essential, and experiments confirmed that RMSNorm achieves comparable performance while reducing computational overhead by 7% to 64% depending on the model. RMSNorm has been adopted in several prominent large language models, including LLaMA and its successors, Mistral, Qwen, and DeepSeek. Native support for RMSNorm was added to PyTorch as torch.nn.RMSNorm in version 2.4.
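A side-by-side sketch shows what dropping the mean-centering step does. Both functions omit the learned gain, and placing epsilon inside the square root is one common convention assumed here rather than taken from the text.

```python
import math


def rms_norm(x, eps=1e-6):
    """Divide by the root mean square; the mean is not subtracted."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def layer_norm(x, eps=1e-5):
    """Subtract the mean, then divide by the standard deviation."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]


x = [1.0, 2.0, 3.0, 4.0]
y_rms = rms_norm(x)
y_ln = layer_norm(x)

mean_rms = sum(y_rms) / len(y_rms)  # about 0.913: the offset survives
mean_ln = sum(y_ln) / len(y_ln)     # about 0: LayerNorm re-centers
rms_after = math.sqrt(sum(v * v for v in y_rms) / len(y_rms))  # about 1
```

RMSNorm needs one reduction over the vector (the sum of squares) where LayerNorm needs two (mean and variance), which is where the reported savings come from.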
Filter response normalization (FRN), proposed by Singh and Krishnan in 2019, eliminates batch dependence by normalizing each channel of each sample using only the mean of the squared activations across spatial dimensions. It pairs with a thresholded linear unit (TLU) activation function, which clamps negative values to a learnable threshold rather than zero. FRN was shown to outperform BatchNorm in image classification with small batch sizes and to match it at large batch sizes.
Power normalization, introduced by Shen and colleagues in 2020, was developed for transformer models in computer vision and NLP. It uses a moving average of the variance, rather than the per-batch variance, and rescales activations by that statistic alone, without subtracting the mean. The authors reported improved performance over LayerNorm on machine translation and language modeling benchmarks, although the technique has not been widely adopted in production systems.
Switchable normalization, proposed by Luo and colleagues in 2018, dynamically chooses among BatchNorm, LayerNorm, and InstanceNorm by learning a softmax weight over the three. Different layers in the network can therefore use different normalizers, with the network selecting the most appropriate one for each. The approach yielded modest improvements on image classification but added implementation complexity.
Local response normalization (LRN), used in the AlexNet architecture (Krizhevsky and colleagues, 2012), normalizes each activation by a local sum of squares across nearby channels. While LRN was an important component of early deep convolutional networks, later work showed it is largely unnecessary when other techniques (such as BatchNorm) are used, and modern architectures rarely include it.
| Normalization layer | Normalizes over | Batch dependent | Typical use case | Original paper |
|---|---|---|---|---|
| Batch normalization | Batch + spatial | Yes | CNNs with large batches | Ioffe and Szegedy, 2015 (arxiv 1502.03167) |
| Layer normalization | All neurons in a layer | No | Transformers, RNNs | Ba, Kiros, and Hinton, 2016 (arxiv 1607.06450) |
| Instance normalization | Single channel, single sample | No | Style transfer, GANs | Ulyanov, Vedaldi, and Lempitsky, 2016 (arxiv 1607.08022) |
| Group normalization | Groups of channels, single sample | No | CNNs with small batches | Wu and He, 2018 (arxiv 1803.08494) |
| Weight normalization | Weight vector reparameterization | No | Generative models, RL | Salimans and Kingma, 2016 (arxiv 1602.07868) |
| RMSNorm | All neurons (RMS only) | No | Large language models | Zhang and Sennrich, 2019 (arxiv 1910.07467) |
| Filter response normalization | Per-sample, per-channel spatial | No | Small-batch CNNs | Singh and Krishnan, 2019 (arxiv 1911.09737) |
| Power normalization | Channel-wise with running variance | Partial | Vision and NLP transformers | Shen et al., 2020 (arxiv 2003.07845) |
| Switchable normalization | Mixture of BN, LN, IN | Partial | General CNNs | Luo et al., 2018 (arxiv 1806.10779) |
| Local response normalization | Adjacent channels | No | Legacy CNNs (AlexNet) | Krizhevsky et al., 2012 |
| Architecture family | Conventional placement | Common normalization choice |
|---|---|---|
| Vanilla CNN | After each conv, before activation | Batch normalization |
| ResNet | Inside residual blocks, before activation | Batch normalization |
| Transformer (post-LN) | After attention and feed-forward sublayer outputs | Layer normalization |
| Transformer (pre-LN) | Before attention and feed-forward sublayers | Layer normalization or RMSNorm |
| Vision Transformer (ViT) | Pre-LN | Layer normalization |
| Modern LLM (LLaMA, Mistral, Qwen, DeepSeek) | Pre-LN | RMSNorm |
| Style transfer | After conv layers | Instance normalization or AdaIN |
| Object detection backbone with small batches | Inside the backbone | Group normalization or synchronous BN |
The original Transformer architecture introduced by Vaswani and colleagues in 2017 placed LayerNorm after each sublayer's residual connection (post-norm or post-LN). This means each sublayer computes:
x' = LayerNorm(x + Sublayer(x))
While post-LN works well for moderately deep transformers (such as the original 6-layer encoder-decoder), training very deep post-LN transformers requires careful learning rate warmup and is prone to instability. Xiong and colleagues showed in 2020 (arxiv 2002.04745) that placing LayerNorm before each sublayer (pre-norm or pre-LN) yields a much better-conditioned optimization landscape and removes the need for warmup. The pre-LN formulation is:
x' = x + Sublayer(LayerNorm(x))
The key practical difference is that gradients flow more smoothly through the residual path in pre-LN, because the residual connection is no longer crossed by a normalization layer. This makes very deep transformers easier to train. Most modern large language models, including GPT-3, LLaMA, Mistral, and DeepSeek, use the pre-LN formulation. A small number of recent architectures combine both approaches (so-called sandwich norm or hybrid norm) by placing LayerNorm both before and after each sublayer.
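The two placements can be compared with a toy sublayer. In the sketch below, `sublayer` is a hypothetical stand-in for attention or a feed-forward block, LayerNorm has no learned parameters, and four blocks are stacked only to make the difference visible.

```python
import math


def layer_norm(x, eps=1e-5):
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]


def sublayer(x):
    """Hypothetical stand-in for attention or a feed-forward block."""
    return [2.0 * v + 1.0 for v in x]


def post_ln_block(x):
    """x' = LayerNorm(x + Sublayer(x)): the residual sum is normalized."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def pre_ln_block(x):
    """x' = x + Sublayer(LayerNorm(x)): the residual path is left untouched."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


x = [1.0, 2.0, 3.0, 4.0]
post_out, pre_out = x, x
for _ in range(4):                 # stack four blocks of each kind
    post_out = post_ln_block(post_out)
    pre_out = pre_ln_block(pre_out)

post_mean = sum(post_out) / len(post_out)  # about 0: renormalized after every block
pre_mean = sum(pre_out) / len(pre_out)     # about 6.5: input mean 2.5 plus 1.0 per block
```

In the pre-LN stack the input passes through the additions unchanged, so each block's contribution simply accumulates on the residual stream. This is the identity path that gradients can follow.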
The choice of normalization layer in modern large language models has converged on a small set of preferences:
| Model family | Normalization | Placement |
|---|---|---|
| BERT (2018) | LayerNorm | Post-LN |
| GPT-2 (2019) | LayerNorm | Pre-LN |
| GPT-3 (2020) | LayerNorm | Pre-LN |
| T5 (2019) | RMSNorm | Pre-LN |
| LLaMA / LLaMA 2 / LLaMA 3 | RMSNorm | Pre-LN |
| Mistral 7B | RMSNorm | Pre-LN |
| Qwen / Qwen 2 / Qwen 2.5 | RMSNorm | Pre-LN |
| DeepSeek / DeepSeek V2 / V3 | RMSNorm | Pre-LN |
| Gemma | RMSNorm | Pre-LN |
| PaLM (2022) | RMSNorm | Pre-LN |
| Claude (Anthropic) | LayerNorm or RMSNorm (architecture details not fully public) | Pre-LN |
The move from LayerNorm to RMSNorm in flagship LLMs is driven primarily by efficiency. RMSNorm removes the mean-centering step, eliminating a sum reduction and a subtraction at every token in every layer. With large hidden dimensions and very deep stacks, this saving compounds. The original RMSNorm experiments found quality comparable to LayerNorm. As a result, most newer transformer-based models default to RMSNorm.
A recurring research question is whether explicit normalization layers are even necessary. Several approaches have proposed normalization-free architectures:
Despite these proposals, normalization layers remain the dominant approach in production deep learning systems because they are simple, robust across hyperparameter choices, and well-supported in standard frameworks.
Normalization, whether applied to the input data or through internal network layers, consistently accelerates training and improves convergence. The main mechanisms are:
Empirical studies have shown that networks trained with BatchNorm can reach the same accuracy in a fraction of the training steps required without it. Similarly, choosing the right input normalization scheme can reduce the number of epochs needed for convergence by an order of magnitude for algorithms like stochastic gradient descent.
A pragmatic workflow for deciding how to normalize tabular data and where to place internal normalization layers in a network looks like this.
- Use StandardScaler if features are roughly Gaussian and the model is linear, an SVM, or a small neural network.
- Use RobustScaler if the data contains outliers that you cannot remove. The interquartile range will not be distorted by extreme values.
- Use MinMaxScaler to [0, 1] if the model expects bounded inputs (for example, when feeding into a sigmoid output of a different model).
- Use MaxAbsScaler for sparse data to preserve the location of zero entries.
- Use PowerTransformer before standardization if the feature is heavily skewed and you want to make it more Gaussian.
- Use QuantileTransformer as a last resort for mixed-distribution heterogeneous tabular data feeding into neural models.

In database design, "normalization" refers to the process of organizing a relational database to reduce redundancy and improve data integrity. This is an entirely different concept from normalization in machine learning. Database normalization follows a series of "normal forms" (1NF, 2NF, 3NF, BCNF, and higher) that define rules about how data should be divided across tables and how relationships should be structured. While both uses of the term involve imposing structure and consistency, they operate in completely different domains.
- Use a Pipeline to ensure the scaler is refit within each fold; otherwise the cross-validated score will be optimistically biased.
- Call model.eval() at inference. BatchNorm and dropout behave differently during evaluation; if the mode is not switched, BatchNorm uses the current mini-batch statistics instead of the running averages, producing inconsistent predictions.

Imagine you and your friends are comparing how far you can throw a ball, how fast you can run, and how many pushups you can do. The throwing distances are in meters, the running times are in seconds, and pushups are just a count. If you try to add all three numbers together to pick a winner, the throwing distance (maybe 30 meters) would matter way more than the pushups (maybe 10). That is not fair.
Normalization is like converting each score to a number between 0 and 10 so every activity counts equally. The fastest runner gets a 10, the slowest gets a 0, and everyone else falls in between. Now when you add up the scores, no single activity can unfairly dominate.
In machine learning, computers face the same problem. Different measurements can have wildly different ranges, and normalization makes them all comparable so the computer can learn from every measurement equally.
Deep neural networks face a related problem inside their own layers. As the network learns, the numbers flowing between layers can grow very large or very small, which makes the math unstable. Internal normalization layers act like a thermostat: at every layer, they squeeze the numbers back to a sensible range so the next layer always receives well-behaved input.