See also: Machine learning terms
In machine learning and data science, normalization refers to the process of scaling data to a standard range or distribution. The term is used broadly to describe a family of techniques applied during data preprocessing, within neural network architectures, and in database design. By transforming features to a common scale, normalization helps algorithms converge faster, produce more stable gradients, and treat all input dimensions equitably.
Normalization is especially important for algorithms that are sensitive to the magnitude of input features, including gradient descent-based optimizers, distance-based methods such as k-nearest neighbors and k-means clustering, and most neural network architectures. Without normalization, features with larger numeric ranges can dominate the learning process, leading to slow convergence or poor generalization.
Data normalization, sometimes called feature scaling, rescales individual features so they occupy a comparable numeric range before being fed into a model. The choice of method depends on the data distribution, the presence of outliers, and the downstream algorithm.
Min-max normalization linearly maps each feature to a fixed interval, most commonly [0, 1]. The formula is:
x' = (x - x_min) / (x_max - x_min)
To scale to an arbitrary range [a, b]:
x' = a + (x - x_min)(b - a) / (x_max - x_min)
Min-max normalization preserves the original distribution shape and is straightforward to implement. However, it is highly sensitive to outliers: a single extreme value can compress all other values into a narrow band. In scikit-learn, this method is available as MinMaxScaler.
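For illustration, the snippet below applies the formula directly and then the MinMaxScaler class mentioned above to a small made-up array; the values are arbitrary and include one outlier to show how it compresses the remaining values.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative data: one feature containing a single large outlier
X = np.array([[1.0], [5.0], [10.0], [100.0]])

# Manual min-max scaling to [0, 1]
x_min, x_max = X.min(), X.max()
manual = (X - x_min) / (x_max - x_min)

# Equivalent result with scikit-learn
scaled = MinMaxScaler().fit_transform(X)

print(manual.ravel())  # approx. [0, 0.04, 0.09, 1] -- the outlier 100 compresses the rest
print(scaled.ravel())
```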
Z-score normalization, also called standardization, transforms each feature so that it has a mean of zero and a standard deviation of one:
x' = (x - mu) / sigma
where mu is the sample mean and sigma is the sample standard deviation. The resulting values are unbounded and can be negative. Z-score normalization handles outliers more gracefully than min-max normalization because outliers affect the mean and variance but do not compress the rest of the distribution into a tiny range. It is the default choice for many linear models, support vector machines, and principal component analysis. In scikit-learn, this method is provided by StandardScaler.
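A brief sketch of StandardScaler on two hypothetical features with very different scales; the numbers are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative data: two features on very different scales
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

scaled = StandardScaler().fit_transform(X)

print(scaled.mean(axis=0))  # approximately [0, 0]
print(scaled.std(axis=0))   # approximately [1, 1]
```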
Max absolute scaling divides each feature value by the maximum absolute value of that feature:
x' = x / max(|x|)
This maps the data to the range [-1, 1] without shifting the center, which means it does not destroy sparsity. It is useful for sparse datasets where preserving zero entries matters. In scikit-learn, this method is available as MaxAbsScaler.
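The sketch below applies MaxAbsScaler to a small, made-up sparse matrix to show that zero entries remain zero; the values are illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import MaxAbsScaler

# Illustrative sparse matrix: most entries are zero
X = csr_matrix([[0.0, -4.0, 0.0],
                [2.0,  0.0, 0.0],
                [0.0,  8.0, 1.0]])

scaled = MaxAbsScaler().fit_transform(X)

# Each column is divided by its maximum absolute value; zeros stay zero
print(scaled.toarray())
```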
Robust scaling uses the median and interquartile range (IQR) instead of the mean and standard deviation:
x' = (x - Q2) / (Q3 - Q1)
where Q1, Q2, and Q3 are the 25th, 50th, and 75th percentiles, respectively. Because the median and IQR are not influenced by extreme values, robust scaling is the preferred choice when the data contains significant outliers that cannot be removed. In scikit-learn, the corresponding class is RobustScaler.
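A short comparison, on invented data with one extreme value, of RobustScaler against StandardScaler; the numbers are illustrative and only meant to show how the median and IQR keep the bulk of the values on an interpretable scale.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Illustrative data with one extreme outlier
X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

print(RobustScaler().fit_transform(X).ravel())    # bulk of the values stays near zero on a readable scale
print(StandardScaler().fit_transform(X).ravel())  # the outlier inflates sigma and squashes the rest toward zero
```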
L1 and L2 normalization scale each sample (row) so that it has unit norm:
x' = x / ||x||_1, where ||x||_1 = sum(|x_i|) (L1 normalization)
x' = x / ||x||_2, where ||x||_2 = sqrt(sum(x_i^2)) (L2 normalization)
These methods are common in text classification and information retrieval, where document vectors are normalized to unit length before computing cosine similarity. In scikit-learn, the Normalizer class handles both L1 and L2 normalization.
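A minimal sketch using the Normalizer class on a small, hypothetical count matrix; it only checks that each row ends up with unit L2 length or unit L1 mass.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Illustrative document-term counts (hypothetical)
X = np.array([[3.0, 4.0, 0.0],
              [1.0, 1.0, 1.0]])

l2 = Normalizer(norm='l2').fit_transform(X)
l1 = Normalizer(norm='l1').fit_transform(X)

print(np.linalg.norm(l2, axis=1))  # each row now has unit Euclidean length
print(np.abs(l1).sum(axis=1))      # each row's absolute values sum to 1
```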
| Method | Formula | Output range | Sensitive to outliers | Preserves sparsity | scikit-learn class |
|---|---|---|---|---|---|
| Min-max | (x - x_min) / (x_max - x_min) | [0, 1] | Yes | No | MinMaxScaler |
| Z-score | (x - mu) / sigma | Unbounded | Moderate | No | StandardScaler |
| Max absolute | x / max(abs(x)) | [-1, 1] | Yes | Yes | MaxAbsScaler |
| Robust | (x - median) / IQR | Unbounded | No | No | RobustScaler |
| L1 norm | x / sum(abs(x)) | [-1, 1] per sample | No | Yes | Normalizer(norm='l1') |
| L2 norm | x / sqrt(sum(x^2)) | [-1, 1] per sample | No | Yes | Normalizer(norm='l2') |
Normalization is most beneficial in the following scenarios:
- Gradient descent-based models (linear and logistic regression, neural networks), where features on very different scales slow convergence
- Distance-based methods such as k-nearest neighbors, k-means clustering, and support vector machines, where the largest-scale feature would otherwise dominate the distance metric
- Principal component analysis and other variance-based techniques, which otherwise load on the features with the largest numeric ranges
- Regularized models, where L1 and L2 penalties implicitly assume coefficients are on comparable scales
Normalization is unnecessary or even counterproductive in some situations:
- Tree-based models such as decision trees, random forests, and gradient boosting split on thresholds and are invariant to monotonic rescaling, so normalization adds cost without benefit
- Features that are already on a common, meaningful scale (for example, percentages or one-hot encoded variables) gain little from rescaling
- When the absolute magnitude of a feature carries information the model should see, rescaling can discard it or make the results harder to interpret
The terms "normalization" and "standardization" are often used interchangeably in practice, but they refer to different operations in a strict sense.
| | Normalization (min-max) | Standardization (z-score) |
|---|---|---|
| Goal | Scale values to a fixed range (e.g., [0, 1]) | Center values at zero with unit variance |
| Formula | (x - x_min) / (x_max - x_min) | (x - mu) / sigma |
| Output range | Bounded ([0, 1] or [-1, 1]) | Unbounded |
| Outlier behavior | Outliers compress all other values into a narrow band | Outliers inflate the mean and standard deviation but leave the rest of the distribution usable |
| Best for | Data with known bounds; neural network inputs | Data that is roughly Gaussian; linear models, SVMs, PCA |
In everyday conversation, many practitioners use "normalization" as an umbrella term for any kind of feature scaling. When precision matters, it is helpful to specify the exact method (for example, "min-max normalization" or "z-score standardization").
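To make the distinction concrete, the sketch below runs both operations on the same made-up feature: the min-max output is bounded to [0, 1], while the standardized output is centered at zero and unbounded.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# The same illustrative feature under both operations
x = np.array([[2.0], [4.0], [6.0], [8.0]])

print(MinMaxScaler().fit_transform(x).ravel())    # bounded: [0, 0.333, 0.667, 1]
print(StandardScaler().fit_transform(x).ravel())  # unbounded, zero mean, unit variance
```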
A critical rule when applying normalization in a machine learning pipeline is to fit the scaler on the training set only. The scaler's parameters (minimum, maximum, mean, standard deviation, median, or IQR) must be computed exclusively from training data. The test set and validation set should then be transformed using those same parameters.
If the scaler is fit on the entire dataset before splitting, information from the test set leaks into the training process, producing overly optimistic performance estimates that do not reflect real-world generalization. In scikit-learn, the recommended approach is to place the scaler inside a Pipeline together with the model, which automatically applies fit only to the training fold during cross-validation.
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression())
])

# The scaler is fit on the training data only; the test data is
# transformed with the training-set mean and standard deviation.
pipe.fit(X_train, y_train)
score = pipe.score(X_test, y_test)
```
Beyond preprocessing the input data, modern deep learning architectures apply normalization inside the network itself. These normalization layers stabilize activations during training, reduce sensitivity to weight initialization, and often allow higher learning rates. They all follow a common pattern: compute statistics over some subset of the activations, normalize using those statistics, and then apply a learned affine transformation (scale and shift).
Batch normalization (BatchNorm), introduced by Ioffe and Szegedy in 2015, normalizes activations across the mini-batch dimension. For each channel in a layer, it computes the mean and variance over all spatial positions and all samples in the current mini-batch, then normalizes accordingly. BatchNorm was originally motivated by the problem of "internal covariate shift," where the distribution of layer inputs changes as earlier layers update their weights.
BatchNorm enables the use of much higher learning rates and reduces the dependence on careful weight initialization. In the original paper, it achieved the same accuracy as a baseline model with 14 times fewer training steps. However, BatchNorm has notable limitations: its behavior depends on the mini-batch size, it performs poorly with very small batches, and its statistics differ between training (using mini-batch statistics) and inference (using running averages), which can cause inconsistencies.
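A minimal NumPy sketch of the statistics BatchNorm computes for activations in NCHW layout; the function name, shapes, and epsilon are illustrative, and a real implementation would also maintain the running averages used at inference time.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize over the batch and spatial axes, separately for each channel.

    x: activations of shape (N, C, H, W); gamma, beta: per-channel parameters of shape (C,).
    """
    mean = x.mean(axis=(0, 2, 3), keepdims=True)  # one mean per channel
    var = x.var(axis=(0, 2, 3), keepdims=True)    # one variance per channel
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

x = np.random.randn(8, 3, 4, 4)                   # toy mini-batch: 8 samples, 3 channels
y = batch_norm(x, gamma=np.ones(3), beta=np.zeros(3))
```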
Layer normalization (LayerNorm), proposed by Ba, Kiros, and Hinton in 2016, computes normalization statistics across all neurons within a single layer for a single training example. Unlike BatchNorm, it does not depend on the batch dimension, making it straightforward to apply to recurrent neural networks and transformer architectures where sequence lengths may vary.
LayerNorm has become the standard normalization technique in transformer-based language models, including GPT and BERT. Its independence from batch size makes it well suited for settings where batch sizes are small or variable.
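A corresponding NumPy sketch for LayerNorm, normalizing each position of each sample over its hidden dimension; shapes and the epsilon value are illustrative.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each sample (and sequence position) over its feature dimension."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.random.randn(2, 10, 512)                   # toy (batch, sequence, hidden) activations
y = layer_norm(x, gamma=np.ones(512), beta=np.zeros(512))
```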
Instance normalization (InstanceNorm), introduced by Ulyanov, Vedaldi, and Lempitsky in 2016, computes the mean and variance for each individual channel of each individual sample. This means each feature map in each image is normalized independently.
InstanceNorm gained prominence in neural style transfer, where it effectively strips instance-specific contrast information from images, allowing the network to focus on high-level style features. It is commonly used in generative adversarial networks (GANs) and image generation tasks.
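A NumPy sketch of the InstanceNorm statistics, computed over the spatial axes only, separately for every (sample, channel) pair; the learned per-channel affine transformation used in practice is omitted for brevity.

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """Normalize each channel of each sample over its spatial positions only."""
    mean = x.mean(axis=(2, 3), keepdims=True)  # one mean per (sample, channel) pair
    var = x.var(axis=(2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

x = np.random.randn(4, 3, 32, 32)              # toy batch of 3-channel images
y = instance_norm(x)
```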
Group normalization (GroupNorm), proposed by Wu and He in 2018, divides channels into groups and computes normalization statistics within each group for each sample. It can be seen as a middle ground between LayerNorm (one group containing all channels) and InstanceNorm (each channel in its own group).
GroupNorm is particularly valuable in computer vision tasks where memory constraints force the use of small batch sizes (for example, in object detection and video classification). On ResNet-50 trained on ImageNet with a batch size of 2, GroupNorm achieved 10.6% lower error than the equivalent BatchNorm model.
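A NumPy sketch of GroupNorm, under the assumption that the channel count is divisible by the number of groups; names and shapes are illustrative.

```python
import numpy as np

def group_norm(x, num_groups, gamma, beta, eps=1e-5):
    """Normalize within groups of channels, independently for each sample."""
    n, c, h, w = x.shape
    g = x.reshape(n, num_groups, c // num_groups, h, w)
    mean = g.mean(axis=(2, 3, 4), keepdims=True)  # statistics never cross samples
    var = g.var(axis=(2, 3, 4), keepdims=True)
    g = (g - mean) / np.sqrt(var + eps)
    x_hat = g.reshape(n, c, h, w)
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

x = np.random.randn(2, 8, 4, 4)                   # works even with a batch size of 2
y = group_norm(x, num_groups=4, gamma=np.ones(8), beta=np.zeros(8))
```

Setting num_groups equal to the number of channels recovers instance normalization, while a single group normalizes over all channels of each sample, which mirrors the middle-ground description above.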
Root mean square layer normalization (RMSNorm), introduced by Zhang and Sennrich in 2019, simplifies LayerNorm by removing the mean-centering step. Instead of subtracting the mean and dividing by the standard deviation, RMSNorm divides activations only by their root mean square:
x' = x / RMS(x), where RMS(x) = sqrt((1/n) * sum(x_i^2))
The authors hypothesized that the re-centering component of LayerNorm is not essential, and experiments confirmed that RMSNorm achieves comparable performance while reducing computational overhead by 7% to 64% depending on the model. RMSNorm has been adopted in several prominent large language models, including LLaMA and its successors.
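A minimal NumPy sketch of RMSNorm over the last axis; the epsilon and shapes are illustrative, and only a learned scale (no shift) is applied, following the formula above.

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-8):
    """Scale by the root mean square of the last axis; no mean subtraction."""
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return gamma * x / rms

x = np.random.randn(2, 16, 512)                   # toy (batch, sequence, hidden) activations
y = rms_norm(x, gamma=np.ones(512))
```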
| Normalization layer | Normalizes over | Batch dependent | Typical use case | Introduced |
|---|---|---|---|---|
| Batch normalization | Batch + spatial | Yes | CNNs with large batches | Ioffe and Szegedy, 2015 |
| Layer normalization | All neurons in a layer | No | Transformers, RNNs | Ba, Kiros, and Hinton, 2016 |
| Instance normalization | Single channel, single sample | No | Style transfer, GANs | Ulyanov, Vedaldi, and Lempitsky, 2016 |
| Group normalization | Groups of channels, single sample | No | CNNs with small batches | Wu and He, 2018 |
| RMSNorm | All neurons (RMS only) | No | Large language models | Zhang and Sennrich, 2019 |
Normalization, whether applied to the input data or through internal network layers, consistently accelerates training and improves convergence. The main mechanisms are:
- Better-conditioned optimization: when features or activations share a common scale, the loss surface is less elongated and gradient descent can take larger, more direct steps
- More stable gradients: normalized inputs and activations keep gradient magnitudes in a reasonable range, reducing vanishing and exploding gradients
- Equitable treatment of features: no input dimension dominates the updates simply because of its units
- Higher usable learning rates and reduced sensitivity to weight initialization, as noted for the normalization layers above
Empirical studies have shown that networks trained with BatchNorm can reach the same accuracy in a fraction of the training steps required without it. Similarly, choosing the right input normalization scheme can reduce the number of epochs needed for convergence by an order of magnitude for algorithms like stochastic gradient descent.
In database design, "normalization" refers to the process of organizing a relational database to reduce redundancy and improve data integrity. This is an entirely different concept from normalization in machine learning. Database normalization follows a series of "normal forms" (1NF, 2NF, 3NF, BCNF, and higher) that define rules about how data should be divided across tables and how relationships should be structured. While both uses of the term involve imposing structure and consistency, they operate in completely different domains.
Imagine you and your friends are comparing how far you can throw a ball, how fast you can run, and how many pushups you can do. The throwing distances are in meters, the running times are in seconds, and pushups are just a count. If you try to add all three numbers together to pick a winner, the throwing distance (maybe 30 meters) would matter way more than the pushups (maybe 10). That is not fair.
Normalization is like converting each score to a number between 0 and 10 so every activity counts equally. The fastest runner gets a 10, the slowest gets a 0, and everyone else falls in between. Now when you add up the scores, no single activity can unfairly dominate.
In machine learning, computers face the same problem. Different measurements can have wildly different ranges, and normalization makes them all comparable so the computer can learn from every measurement equally.