Normalization

In machine learning and data science, normalization refers to the process of scaling data to a standard range or distribution. The term is used broadly to describe a family of techniques applied during data preprocessing, within neural network architectures, and in database design. By transforming features to a common scale, normalization helps algorithms converge faster, produce more stable gradients, and treat all input dimensions equitably.

Normalization is especially important for algorithms that are sensitive to the magnitude of input features, including gradient descent-based optimizers, distance-based methods such as k-nearest neighbors and k-means clustering, and most neural network architectures. Without normalization, features with larger numeric ranges can dominate the learning process, leading to slow convergence or poor generalization.

The word "normalization" carries different meanings in different parts of machine learning. In tabular preprocessing it usually means rescaling each column to a fixed range or distribution. Inside a deep network it means inserting a layer that standardizes activations on the fly. In information retrieval it means dividing each row vector by its length. The mathematics differ, but the underlying goal is the same: remove arbitrary scale so that downstream computation behaves predictably.

Data normalization (feature scaling)

Data normalization, sometimes called feature scaling, rescales individual features so they occupy a comparable numeric range before being fed into a model. The choice of method depends on the data distribution, the presence of outliers, and the downstream algorithm.

Min-max normalization

Min-max normalization linearly maps each feature to a fixed interval, most commonly [0, 1]. The formula is:

x' = (x - x_min) / (x_max - x_min)

To scale to an arbitrary range [a, b]:

x' = a + (x - x_min)(b - a) / (x_max - x_min)

Min-max normalization preserves the original distribution shape and is straightforward to implement. However, it is highly sensitive to outliers: a single extreme value can compress all other values into a narrow band. In scikit-learn, this method is available as MinMaxScaler.

A practical concern with min-max scaling is what happens when a value at inference time falls outside the range observed during training. The transformed value will lie outside [0, 1], which is acceptable for most models but breaks the assumption that all inputs are bounded. When the input data has known semantic limits (for example, 8-bit image pixel intensities in [0, 255]), those limits should be used directly rather than the empirical minimum and maximum.
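
The two formulas above, and the out-of-range behavior at inference time, can be sketched in a few lines of NumPy. This is a minimal illustration rather than a replacement for MinMaxScaler; the helper names fit_min_max and min_max_transform are made up for this example, and the sketch assumes no feature is constant (which would divide by zero).

```python
import numpy as np

def fit_min_max(train):
    """Return the per-feature minimum and maximum of the training data."""
    return train.min(axis=0), train.max(axis=0)

def min_max_transform(x, x_min, x_max, a=0.0, b=1.0):
    """Map x linearly so that x_min goes to a and x_max goes to b."""
    return a + (x - x_min) * (b - a) / (x_max - x_min)

# Hypothetical single-feature training set.
train = np.array([[10.0], [20.0], [30.0]])
x_min, x_max = fit_min_max(train)

scaled_train = min_max_transform(train, x_min, x_max)

# A value larger than anything seen during training maps above 1.
unseen = min_max_transform(np.array([[40.0]]), x_min, x_max)
```

Here the training values map to 0, 0.5, and 1, while the unseen value 40 maps to 1.5, outside the nominal range.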

Z-score normalization (standardization)

Z-score normalization, also called standardization, transforms each feature so that it has a mean of zero and a standard deviation of one:

x' = (x - mu) / sigma

where mu is the sample mean and sigma is the sample standard deviation. The resulting values are unbounded and can be negative. Z-score normalization handles outliers more gracefully than min-max normalization: outliers still shift the mean and inflate the standard deviation, but they do not compress the rest of the distribution into a tiny range. It is the default choice for many linear models, support vector machines, and principal component analysis. In scikit-learn, this method is provided by StandardScaler.

A closely related question is whether to use the population variance (dividing by n) or the sample variance (dividing by n-1) when estimating sigma. For normalization purposes the difference is negligible at typical dataset sizes. The scikit-learn implementation uses the population variance for consistency with the numpy.std default.
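
As a small worked example of the formula and of the two variance conventions, the following sketch standardizes a toy array with NumPy. The data values are arbitrary; ddof=0 gives the population estimate and ddof=1 the sample estimate.

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = x.mean()
sigma_pop = x.std(ddof=0)     # divides by n (NumPy default)
sigma_sample = x.std(ddof=1)  # divides by n - 1

# Standardize with the population estimate.
z = (x - mu) / sigma_pop
```

For this array the mean is 5, the population standard deviation is 2, and the sample standard deviation is about 2.14; the gap between the two estimates shrinks as n grows.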

Max absolute scaling

Max absolute scaling divides each feature value by the maximum absolute value of that feature:

x' = x / max(|x|)

This maps the data to the range [-1, 1] without shifting the center, which means it does not destroy sparsity. It is useful for sparse datasets where preserving zero entries matters, including text feature matrices produced by CountVectorizer or TfidfVectorizer. In scikit-learn, this method is available as MaxAbsScaler.
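
A minimal NumPy sketch of the same idea, using a small dense array with many zeros as a stand-in for a sparse matrix (an assumption made for readability; MaxAbsScaler also accepts SciPy sparse input):

```python
import numpy as np

X = np.array([[0.0, -4.0],
              [2.0,  0.0],
              [0.0,  8.0]])

# Per-feature maximum absolute value.
max_abs = np.abs(X).max(axis=0)

# Pure rescaling, no shift: zeros stay exactly zero.
X_scaled = X / max_abs
```

Because nothing is subtracted, the zero pattern of X_scaled is identical to that of X.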

Robust scaling

Robust scaling uses the median and interquartile range (IQR) instead of the mean and standard deviation:

x' = (x - Q2) / (Q3 - Q1)

where Q1, Q2, and Q3 are the 25th, 50th, and 75th percentiles, respectively. Because the median and IQR are not influenced by extreme values, robust scaling is the preferred choice when the data contains significant outliers that cannot be removed. In scikit-learn, the corresponding class is RobustScaler. The user can configure which percentiles to use through the quantile_range parameter, with the default (25.0, 75.0) corresponding to the interquartile range.
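
The effect of a single extreme value can be seen in a short NumPy comparison between robust scaling and z-scoring. The data are invented for illustration, and np.percentile is used with its default linear interpolation, which may differ slightly from other quantile conventions.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 1000.0])  # one extreme outlier

# Robust scaling: center on the median, divide by the IQR.
q1, q2, q3 = np.percentile(x, [25, 50, 75])
x_robust = (x - q2) / (q3 - q1)

# Z-score for comparison: the outlier inflates the mean and sigma.
z = (x - x.mean()) / x.std()

# Spread of the four inliers under each method.
inlier_spread_robust = x_robust[:4].max() - x_robust[:4].min()
inlier_spread_z = z[:4].max() - z[:4].min()
```

Under robust scaling the four inliers keep a spread of 1.5, whereas under z-scoring they are squeezed into a band narrower than 0.01.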

L1 and L2 normalization (unit norm)

L1 and L2 normalization scale each sample (row) so that it has unit norm:

L1 normalization divides each element by the sum of absolute values, so the L1 norm of the resulting vector equals 1.
L2 normalization divides each element by the Euclidean length, so the L2 norm of the resulting vector equals 1.

These methods are common in text classification and information retrieval, where document vectors are normalized to unit length before computing cosine similarity. After L2 normalization, the dot product between any two vectors equals their cosine similarity, which simplifies and speeds up similarity search. In scikit-learn, the Normalizer class handles both L1 and L2 normalization.
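
The claim that the dot product of L2-normalized vectors equals cosine similarity can be checked directly. The sketch below uses two toy row vectors and assumes no row is all zeros (which would divide by zero).

```python
import numpy as np

X = np.array([[3.0, 4.0],
              [1.0, 0.0]])

# Row-wise L1 and L2 normalization.
l1 = X / np.abs(X).sum(axis=1, keepdims=True)
l2 = X / np.linalg.norm(X, axis=1, keepdims=True)

# Cosine similarity computed from the raw vectors...
cosine = (X[0] @ X[1]) / (np.linalg.norm(X[0]) * np.linalg.norm(X[1]))

# ...matches the plain dot product of the L2-normalized vectors.
dot_after_l2 = l2[0] @ l2[1]
```

Both quantities equal 0.6 for these vectors.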

Log transformations

When a feature spans many orders of magnitude or is heavily right-skewed, applying a logarithmic transformation can make it more amenable to learning. Common variants include the natural log (log(x)), log(1 + x) (which avoids issues at zero), and log10(x). Log transformations are widely used for features such as income, price, population counts, and gene expression levels. The plain logarithm is defined only for strictly positive values; for non-negative data containing zeros, log1p is the standard choice.
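
A brief sketch of log1p on a hypothetical income-like feature that includes a zero; np.expm1 is the exact inverse and can be used to map values back to the original units.

```python
import numpy as np

# Hypothetical feature spanning five orders of magnitude, including zero.
income = np.array([0.0, 9.0, 99.0, 999.0, 99999.0])

log_income = np.log1p(income)     # log(1 + x), defined at x = 0
restored = np.expm1(log_income)   # inverse transform: exp(y) - 1
```

In base 10 the same values become 0, 1, 2, 3, and 5, which shows how the transformation compresses the upper tail.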

Power transformations: Box-Cox and Yeo-Johnson

Power transformations seek to make a feature more closely resemble a Gaussian distribution by applying a parametric power function. The Box-Cox transformation, introduced by Box and Cox in 1964, is defined for strictly positive data as:

x' = (x^lambda - 1) / lambda     if lambda != 0
x' = log(x)                       if lambda == 0

The parameter lambda is selected to maximize the log-likelihood of the transformed data under a normality assumption. The Yeo-Johnson transformation, introduced by Yeo and Johnson in 2000, extends Box-Cox to handle zero and negative values and is therefore more general:

For x >= 0:
  x' = ((x + 1)^lambda - 1) / lambda          if lambda != 0
  x' = log(x + 1)                              if lambda == 0
For x < 0:
  x' = -((-x + 1)^(2 - lambda) - 1) / (2 - lambda)   if lambda != 2
  x' = -log(-x + 1)                                   if lambda == 2

In scikit-learn, both transformations are provided by PowerTransformer with the method argument set to 'box-cox' or 'yeo-johnson'. The class fits lambda separately for each feature.
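
The piecewise Yeo-Johnson definition translates almost line for line into NumPy. The sketch below applies the transformation for a fixed, user-supplied lambda; it does not estimate lambda by maximum likelihood, which is the part PowerTransformer handles. The function name yeo_johnson is chosen for this example.

```python
import numpy as np

def yeo_johnson(x, lam):
    """Apply the Yeo-Johnson transformation with a fixed lambda."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0

    # Branch for x >= 0.
    if lam != 0:
        out[pos] = ((x[pos] + 1.0) ** lam - 1.0) / lam
    else:
        out[pos] = np.log1p(x[pos])

    # Branch for x < 0.
    if lam != 2:
        out[~pos] = -((-x[~pos] + 1.0) ** (2.0 - lam) - 1.0) / (2.0 - lam)
    else:
        out[~pos] = -np.log1p(-x[~pos])
    return out

identity = yeo_johnson([-2.0, 0.0, 3.0], lam=1.0)  # lambda = 1 leaves x unchanged
```

A quick sanity check is that lambda = 1 reduces both branches to the identity, so the input is returned unchanged.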

Quantile transformation

A quantile transformation maps each feature to a target distribution by ranking the values and applying the inverse cumulative distribution function of the target. Two target distributions are common: uniform on [0, 1] and standard normal. Because the transformation is rank-based, it is robust to outliers and produces an output with a fixed shape regardless of the input distribution. The downside is that the mapping, although monotonic, is nonlinear, so it distorts distances and linear correlations between features. At transform time, values that fall between the fitted quantiles are interpolated, and values outside the fitted range are mapped to the bounds of the output distribution. In scikit-learn, this is provided by QuantileTransformer. Quantile transformation is useful when the goal is to make features comparable in shape, for example when feeding heterogeneous tabular data into a deep learning model.
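
A simplified sketch of the uniform-output case: quantile landmarks are estimated from the training data, and new values are mapped by linear interpolation between them. The helper names are invented for this example, and the details (number of quantiles, tie handling) are assumptions that differ from QuantileTransformer.

```python
import numpy as np

def fit_quantiles(train, n_quantiles=5):
    """Return evenly spaced probabilities and the matching training quantiles."""
    probs = np.linspace(0.0, 1.0, n_quantiles)
    return probs, np.quantile(train, probs)

def to_uniform(x, probs, landmarks):
    """Map x to [0, 1] by interpolating between the fitted quantiles.

    np.interp clamps inputs outside the fitted range to 0 or 1.
    """
    return np.interp(x, landmarks, probs)

train = np.array([1.0, 2.0, 4.0, 8.0, 1000.0])  # note the outlier
probs, landmarks = fit_quantiles(train)

u = to_uniform(train, probs, landmarks)
between = to_uniform(3.0, probs, landmarks)     # interpolated between 2 and 4
above = to_uniform(5000.0, probs, landmarks)    # clamped to the upper bound
```

The outlier 1000 simply becomes 1.0, the same value any larger input would receive, which is why the method is insensitive to extreme magnitudes.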

Comparison of feature scaling methods

Method	Formula	Output range	Sensitive to outliers	Preserves sparsity	scikit-learn class
Min-max	(x - x_min) / (x_max - x_min)	[0, 1]	Yes	No	`MinMaxScaler`
Z-score	(x - mu) / sigma	Unbounded	Moderate	No	`StandardScaler`
Max absolute	x / max(abs(x))	[-1, 1]	Yes	Yes	`MaxAbsScaler`
Robust	(x - median) / IQR	Unbounded	No	No	`RobustScaler`
L1 norm	x / sum(abs(x))	[-1, 1] per sample	No	Yes	`Normalizer(norm='l1')`
L2 norm	x / sqrt(sum(x^2))	[-1, 1] per sample	No	Yes	`Normalizer(norm='l2')`
Box-Cox	(x^lambda - 1) / lambda	Approximately Gaussian	Reduced after fit	No	`PowerTransformer(method='box-cox')`
Yeo-Johnson	Piecewise power	Approximately Gaussian	Reduced after fit	No	`PowerTransformer(method='yeo-johnson')`
Quantile (uniform)	Rank / N	[0, 1]	No	No	`QuantileTransformer(output_distribution='uniform')`
Quantile (normal)	Inverse normal CDF of rank	Approximately N(0, 1)	No	No	`QuantileTransformer(output_distribution='normal')`
Log	log(1 + x)	Compressed	Reduced	Yes (log(1 + 0) = 0)	`FunctionTransformer(np.log1p)`

When to normalize

Normalization is most beneficial in the following scenarios:

Distance-based algorithms. Methods such as k-nearest neighbors, k-means, and support vector machines compute distances between data points. If features are on different scales, features with larger magnitudes will dominate the distance calculation.
Gradient descent optimization. When features have very different scales, the loss surface becomes elongated, causing gradient descent to oscillate and converge slowly. Normalization produces a more spherical loss landscape, allowing the optimizer to take more direct steps toward the minimum.
Neural networks. Most neural network architectures benefit from normalized inputs because activation functions (such as sigmoid and tanh) are most sensitive in a limited numeric range. Feeding unnormalized data can push activations into saturated regions, causing vanishing gradients.
Regularized models. Algorithms with L1 or L2 regularization penalize feature weights. If features are on different scales, the penalty applies unevenly, distorting the model.
Principal component analysis and related methods. PCA, factor analysis, and linear discriminant analysis all rely on covariance or correlation. Without standardization, features with high variance dominate the principal components even when their large variance reflects unit choice rather than information content.
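
The first scenario can be made concrete with a toy example. The three rows below are hypothetical people described by age in years and income in dollars; the numbers are invented purely to show how one feature can swamp a Euclidean distance.

```python
import numpy as np

# Columns: [age in years, income in dollars] (hypothetical values).
X = np.array([[25.0, 50_000.0],
              [60.0, 51_000.0],
              [26.0, 90_000.0]])

# Distances from the first row on the raw scale.
raw_dist = np.linalg.norm(X - X[0], axis=1)

# Distances after standardizing each column.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
scaled_dist = np.linalg.norm(Z - Z[0], axis=1)

raw_ratio = raw_dist[2] / raw_dist[1]
scaled_ratio = scaled_dist[2] / scaled_dist[1]
```

On the raw scale the third row is roughly 40 times farther away than the second, because the 35-year age gap is invisible next to the income gap. After standardization the two distances are nearly equal, so both features contribute.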

When not to normalize

Normalization is unnecessary or even counterproductive in some situations:

Tree-based models. Decision trees, random forests, and gradient boosting algorithms split on individual feature thresholds. Because they compare values within a single feature at a time, the relative scale across features does not affect their behavior. Tree-based models are largely invariant to monotonic transformations of any single feature.
Count-based features. In some natural language processing tasks, raw counts (for example, word frequencies) carry meaningful absolute magnitude that normalization would obscure.
Pre-normalized data. If features are already on similar scales (for example, pixel values in [0, 255] for all channels), additional normalization may not improve performance.
Categorical features encoded as integers. Integer-encoded categorical variables should be one-hot encoded or treated as embeddings, not standardized. Z-scoring an arbitrary integer code keeps the spurious ordering and spacing implied by the codes and merely rescales it.

Normalization vs. standardization

The terms "normalization" and "standardization" are often used interchangeably in practice, but they refer to different operations in a strict sense.

	Normalization (min-max)	Standardization (z-score)
Goal	Scale values to a fixed range (e.g., [0, 1])	Center values at zero with unit variance
Formula	(x - x_min) / (x_max - x_min)	(x - mu) / sigma
Output range	Bounded ([0, 1] or [-1, 1])	Unbounded
Outlier behavior	Outliers compress the remaining values into a narrow band	Outliers shift the mean and inflate the standard deviation
Best for	Data with known bounds; neural network inputs	Data that is roughly Gaussian; linear models, SVMs, PCA

In everyday conversation, many practitioners use "normalization" as an umbrella term for any kind of feature scaling. When precision matters, it is helpful to specify the exact method (for example, "min-max normalization" or "z-score standardization").

Avoiding data leakage

A critical rule when applying normalization in a machine learning pipeline is to fit the scaler on the training set only. The scaler's parameters (minimum, maximum, mean, standard deviation, median, or IQR) must be computed exclusively from training data. The test set and validation set should then be transformed using those same parameters.

If the scaler is fit on the entire dataset before splitting, information from the test set leaks into the training process, producing overly optimistic performance estimates that do not reflect real-world generalization. In scikit-learn, the recommended approach is to place the scaler inside a Pipeline together with the model, which automatically applies fit only to the training fold during cross-validation.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression())
])
pipe.fit(X_train, y_train)
score = pipe.score(X_test, y_test)

The leakage rule extends beyond simple train/test splits. In time series forecasting, the scaler should be fit only on data observed up to the cutoff time of each evaluation point. In stratified k-fold cross-validation, the scaler should be fit on the training folds within each iteration. The general principle is that any statistic the scaler depends on must come exclusively from data the model is allowed to see at training time.
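
For the time series case, the principle can be sketched as a helper that computes the scaling statistics from observations before a cutoff index only. The series is synthetic and the function name scale_with_cutoff is made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic drifting series standing in for real observations.
series = np.cumsum(rng.normal(size=100)) + 50.0

def scale_with_cutoff(series, cutoff):
    """Standardize the whole series using statistics from series[:cutoff] only."""
    history = series[:cutoff]
    mu, sigma = history.mean(), history.std()
    return (series - mu) / sigma, mu, sigma

# Points from index 80 onward are the evaluation period; they are transformed
# with the historical statistics but never contribute to them.
scaled, mu, sigma = scale_with_cutoff(series, cutoff=80)
```

The historical segment has exactly zero mean and unit variance after scaling, while the evaluation segment generally does not, which is the expected and correct behavior.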

Neural network normalization layers

Beyond preprocessing the input data, modern deep learning architectures apply normalization inside the network itself. These normalization layers stabilize activations during training, reduce sensitivity to weight initialization, and often allow higher learning rates. They all follow a common pattern: compute statistics over some subset of the activations, normalize using those statistics, and then apply a learned affine transformation (scale and shift).

The general form of an internal normalization layer is:

y = gamma * (x - mu) / sqrt(sigma^2 + epsilon) + beta

where mu and sigma^2 are the mean and variance computed over a chosen set of dimensions, epsilon is a small constant (typically 1e-5) added for numerical stability, and gamma and beta are learnable scale and shift parameters. The choice of dimensions over which mu and sigma^2 are computed is what distinguishes the various normalization layers from each other.
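
The role of the reduction dimensions can be illustrated with one NumPy function that implements the general form and is called with different axes. This is a simplified sketch: gamma and beta are plain scalars here, whereas real layers learn one value per channel or per feature, and the input is assumed to have the layout (batch, channels, height, width).

```python
import numpy as np

def normalize(x, axes, gamma=1.0, beta=0.0, eps=1e-5):
    """y = gamma * (x - mu) / sqrt(sigma^2 + eps) + beta, with stats over `axes`."""
    mu = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(8, 4, 5, 5))  # (N, C, H, W)

batch_norm = normalize(x, axes=(0, 2, 3))     # per channel, over batch and space
layer_norm = normalize(x, axes=(1, 2, 3))     # per sample, over all features
instance_norm = normalize(x, axes=(2, 3))     # per sample and channel, over space
```

The arithmetic is identical in all three calls; only the set of activations that share a mean and variance changes.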

Batch normalization

Batch normalization (BatchNorm), introduced by Ioffe and Szegedy in 2015 (arxiv 1502.03167), normalizes activations across the mini-batch dimension. For each channel in a layer, it computes the mean and variance over all spatial positions and all samples in the current mini-batch, then normalizes accordingly. BatchNorm was originally motivated by the problem of "internal covariate shift," where the distribution of layer inputs changes as earlier layers update their weights.

BatchNorm enables the use of much higher learning rates and reduces the dependence on careful weight initialization. In the original paper, it achieved the same accuracy as a baseline model with 14 times fewer training steps. However, BatchNorm has notable limitations: its behavior depends on the mini-batch size, it performs poorly with very small batches, and its statistics differ between training (using mini-batch statistics) and inference (using running averages), which can cause inconsistencies. A 2018 study by Santurkar and colleagues argued that the actual benefit of BatchNorm comes from smoothing the loss landscape rather than from reducing covariate shift.

In PyTorch, BatchNorm is available as torch.nn.BatchNorm1d, torch.nn.BatchNorm2d, and torch.nn.BatchNorm3d, which differ only in the spatial rank of the input. During inference, the running mean and running variance computed during training are used in place of mini-batch statistics; this behavior is enabled by setting the module to evaluation mode with model.eval().
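
The difference between training and evaluation mode can be sketched with a small NumPy class for inputs of shape (batch, features). This is an illustrative simplification, not the PyTorch implementation: the class name, the momentum value of 0.1, and the use of the biased variance for the running estimate are assumptions of this sketch.

```python
import numpy as np

class BatchNorm1D:
    """Minimal batch normalization for inputs of shape (batch, features)."""

    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        self.gamma = np.ones(num_features)
        self.beta = np.zeros(num_features)
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum = momentum
        self.eps = eps
        self.training = True

    def __call__(self, x):
        if self.training:
            # Training: use mini-batch statistics and update the running averages.
            mu, var = x.mean(axis=0), x.var(axis=0)
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mu
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            # Evaluation: use the stored running averages.
            mu, var = self.running_mean, self.running_var
        return self.gamma * (x - mu) / np.sqrt(var + self.eps) + self.beta

bn = BatchNorm1D(num_features=2)
batch = np.array([[1.0, 10.0], [3.0, 30.0]])

train_out = bn(batch)   # normalized with this batch's mean and variance
bn.training = False     # analogous to calling model.eval()
eval_out = bn(batch)    # normalized with the running averages
```

After a single training step the running averages are still far from the batch statistics, so the same input produces different outputs in the two modes.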

Layer normalization

Layer normalization (LayerNorm), proposed by Ba, Kiros, and Hinton in 2016 (arxiv 1607.06450), computes normalization statistics across all neurons within a single layer for a single training example. Unlike BatchNorm, it does not depend on the batch dimension, making it straightforward to apply to recurrent neural networks and transformer architectures where sequence lengths may vary.

LayerNorm has become the standard normalization technique in transformer-based language models, including GPT and BERT. Its independence from batch size makes it well suited for settings where batch sizes are small or variable. Because the same statistics are used at training and inference time, there is no train/test mismatch and no need for running averages. In PyTorch, LayerNorm is provided by torch.nn.LayerNorm, which takes a normalized_shape argument specifying the trailing dimensions to normalize over.

Instance normalization

Instance normalization (InstanceNorm), introduced by Ulyanov, Vedaldi, and Lempitsky in 2016, computes the mean and variance for each individual channel of each individual sample. This means each feature map in each image is normalized independently.

InstanceNorm gained prominence in neural style transfer, where it effectively strips instance-specific contrast information from images, allowing the network to focus on high-level style features. It is commonly used in generative adversarial networks (GANs) and image generation tasks. A close relative is adaptive instance normalization (AdaIN), introduced by Huang and Belongie in 2017, which uses the mean and variance from a style image as the affine parameters of the normalization, producing an effective real-time style transfer technique.
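
A rough NumPy sketch of instance normalization and AdaIN on feature maps of shape (channels, height, width). The random arrays stand in for encoder features of a content image and a style image; the function names and the epsilon value are assumptions of this example.

```python
import numpy as np

def channel_stats(x, eps=1e-5):
    """Per-channel mean and standard deviation over the spatial dimensions."""
    mu = x.mean(axis=(1, 2), keepdims=True)
    std = np.sqrt(x.var(axis=(1, 2), keepdims=True) + eps)
    return mu, std

def instance_norm(x, eps=1e-5):
    mu, std = channel_stats(x, eps)
    return (x - mu) / std

def adain(content, style, eps=1e-5):
    """Normalize the content features, then rescale them with the style statistics."""
    s_mu, s_std = channel_stats(style, eps)
    return s_std * instance_norm(content, eps) + s_mu

rng = np.random.default_rng(0)
content = rng.normal(loc=0.0, scale=1.0, size=(3, 8, 8))  # hypothetical features
style = rng.normal(loc=5.0, scale=2.0, size=(3, 8, 8))

stylized = adain(content, style)
```

The output keeps the spatial structure of the content features while its per-channel mean and standard deviation match those of the style features.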

Group normalization

Group normalization (GroupNorm), proposed by Wu and He in 2018 (arxiv 1803.08494), divides channels into groups and computes normalization statistics within each group for each sample. It can be seen as a middle ground between LayerNorm (one group containing all channels) and InstanceNorm (each channel in its own group).

GroupNorm is particularly valuable in computer vision tasks where memory constraints force the use of small batch sizes (for example, in object detection and video classification). On ResNet-50 trained on ImageNet with a batch size of 2, GroupNorm achieved an error 10.6 percentage points lower than the equivalent BatchNorm model. The number of groups is a hyperparameter; the original paper uses 32 groups as its default.
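
The grouping step amounts to a reshape followed by a per-group reduction. The NumPy sketch below omits the learned per-channel scale and shift for brevity and assumes the number of channels is divisible by the number of groups.

```python
import numpy as np

def group_norm(x, num_groups, eps=1e-5):
    """Group normalization for x of shape (N, C, H, W), without affine parameters."""
    n, c, h, w = x.shape
    assert c % num_groups == 0, "channels must be divisible by num_groups"
    grouped = x.reshape(n, num_groups, c // num_groups, h, w)
    mu = grouped.mean(axis=(2, 3, 4), keepdims=True)
    var = grouped.var(axis=(2, 3, 4), keepdims=True)
    return ((grouped - mu) / np.sqrt(var + eps)).reshape(n, c, h, w)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8, 4, 4))

gn = group_norm(x, num_groups=4)                # 4 groups of 2 channels
as_layer_norm = group_norm(x, num_groups=1)     # one group: LayerNorm-style statistics
as_instance_norm = group_norm(x, num_groups=8)  # one channel per group: InstanceNorm
```

Setting the number of groups to 1 or to the channel count recovers the two limiting cases described above. Because the statistics never involve the batch axis, the result for one sample is the same regardless of batch size.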

Weight normalization

Weight normalization, introduced by Salimans and Kingma in 2016 (arxiv 1602.07868), takes a different approach. Instead of normalizing activations, it reparameterizes the weights of each layer as a product of a unit vector and a learnable scalar magnitude:

w = g * (v / ||v||)

where v is a vector of free parameters, ||v|| is its Euclidean norm, and g is a learned scalar. This decouples the magnitude of the weight vector from its direction, which improves the conditioning of the optimization problem and accelerates training. Weight normalization can be combined with mean-only batch normalization for even better performance. Compared to BatchNorm, it is independent of batch size and adds essentially no computational overhead. It is available in PyTorch as torch.nn.utils.weight_norm.
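
The reparameterization itself is a one-liner. The following sketch shows, for a single weight vector with arbitrary example values, that the norm of w is controlled entirely by g and that rescaling v leaves w unchanged.

```python
import numpy as np

def weight_norm(v, g):
    """Return w = g * v / ||v||: direction from v, magnitude from g."""
    return g * v / np.linalg.norm(v)

v = np.array([3.0, 4.0])   # free direction parameters
w = weight_norm(v, g=2.0)  # [1.2, 1.6]

# Scaling v by any positive constant does not change the effective weights.
w_rescaled = weight_norm(10.0 * v, g=2.0)
```

Since the norm of w always equals |g|, the optimizer can adjust magnitude and direction through separate parameters.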

Root mean square layer normalization (RMSNorm)

RMSNorm, introduced by Zhang and Sennrich in 2019 (arxiv 1910.07467), simplifies LayerNorm by removing the mean-centering step. Instead of subtracting the mean and dividing by the standard deviation, RMSNorm divides activations only by their root mean square:

x' = x / RMS(x),  where RMS(x) = sqrt((1/n) * sum(x_i^2))

The authors hypothesized that the re-centering component of LayerNorm is not essential, and experiments confirmed that RMSNorm achieves comparable performance while reducing computational overhead by 7% to 64% depending on the model. RMSNorm has been adopted in several prominent large language models, including LLaMA and its successors, Mistral, Qwen, and DeepSeek. Native support for RMSNorm was added to PyTorch as torch.nn.RMSNorm in version 2.4.
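
A minimal NumPy sketch of RMSNorm next to a plain LayerNorm, both applied over the last axis. The small epsilon inside the square root and the optional gain argument are additions for numerical safety and completeness; they do not appear in the formula above.

```python
import numpy as np

def rms_norm(x, gain=None, eps=1e-6):
    """Divide by the root mean square over the last axis (no mean subtraction)."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    out = x / rms
    return out if gain is None else gain * out

def layer_norm(x, eps=1e-6):
    """Subtract the mean and divide by the standard deviation over the last axis."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

x = np.array([[1.0, 2.0, 3.0, 4.0]])
y = rms_norm(x)                # unit RMS, but the mean is not removed

centered = x - x.mean(axis=-1, keepdims=True)
same = np.allclose(rms_norm(centered), layer_norm(centered))
```

The output of RMSNorm has unit root mean square but keeps a nonzero mean. On inputs that already have zero mean, the two functions coincide, which is the intuition behind dropping the re-centering step.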

Filter response normalization (FRN)

Filter response normalization (FRN), proposed by Singh and Krishnan in 2019, eliminates batch dependence by normalizing each channel of each sample using only the mean of the squared activations across spatial dimensions. It pairs with a thresholded linear unit (TLU) activation function, which clamps values below a learnable threshold to that threshold rather than to zero. FRN was shown to outperform BatchNorm in image classification with small batch sizes and to match it at large batch sizes.
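
The two pieces can be sketched in NumPy as follows. The learned per-channel scale and shift of the full FRN layer are left out, and the threshold tau is passed as a fixed number rather than learned; both are simplifications made for this example.

```python
import numpy as np

def frn(x, eps=1e-6):
    """Normalize each (sample, channel) map by the mean of its squared activations."""
    nu2 = np.mean(x ** 2, axis=(2, 3), keepdims=True)  # x has shape (N, C, H, W)
    return x / np.sqrt(nu2 + eps)

def tlu(x, tau):
    """Thresholded linear unit: values below tau are raised to tau."""
    return np.maximum(x, tau)

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=3.0, size=(2, 3, 4, 4))

normalized = frn(x)
activated = tlu(np.array([-2.0, -0.5, 1.0]), tau=-1.0)  # [-1.0, -0.5, 1.0]
```

Because no mean is subtracted, the normalized activations can be biased away from zero, which is the motivation for replacing ReLU's fixed threshold of zero with an adjustable one.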

Power normalization

Power normalization (PowerNorm), introduced by Shen and colleagues in 2020, was developed as a batch-normalization variant for transformer models in NLP. It normalizes activations by their quadratic mean without subtracting the mean, and it uses a running average of that statistic rather than the per-batch value. The authors reported improved performance over LayerNorm on machine translation and language modeling benchmarks, although the technique has not been widely adopted in production systems.

Switchable normalization

Switchable normalization, proposed by Luo and colleagues in 2018, dynamically chooses among BatchNorm, LayerNorm, and InstanceNorm by learning a softmax weight over the three. Different layers in the network can therefore use different normalizers, with the network selecting the most appropriate one for each. The approach yielded modest improvements on image classification but added implementation complexity.

Local response normalization

Local response normalization (LRN), used in the AlexNet architecture (Krizhevsky and colleagues, 2012), normalizes each activation by a local sum of squares across nearby channels. While LRN was an important component of early deep convolutional networks, later work showed it is largely unnecessary when other techniques (such as BatchNorm) are used, and modern architectures rarely include it.

Comparison of normalization layers

Normalization layer	Normalizes over	Batch dependent	Typical use case	Original paper
Batch normalization	Batch + spatial	Yes	CNNs with large batches	Ioffe and Szegedy, 2015 (arxiv 1502.03167)
Layer normalization	All neurons in a layer	No	Transformers, RNNs	Ba, Kiros, and Hinton, 2016 (arxiv 1607.06450)
Instance normalization	Single channel, single sample	No	Style transfer, GANs	Ulyanov, Vedaldi, and Lempitsky, 2016 (arxiv 1607.08022)
Group normalization	Groups of channels, single sample	No	CNNs with small batches	Wu and He, 2018 (arxiv 1803.08494)
Weight normalization	Weight vector reparameterization	No	Generative models, RL	Salimans and Kingma, 2016 (arxiv 1602.07868)
RMSNorm	All neurons (RMS only)	No	Large language models	Zhang and Sennrich, 2019 (arxiv 1910.07467)
Filter response normalization	Per-sample, per-channel spatial	No	Small-batch CNNs	Singh and Krishnan, 2019 (arxiv 1911.09737)
Power normalization	Channel-wise with running quadratic mean	Partial	NLP transformers	Shen et al., 2020 (arxiv 2003.07845)
Switchable normalization	Mixture of BN, LN, IN	Partial	General CNNs	Luo et al., 2018 (arxiv 1806.10779)
Local response normalization	Adjacent channels	No	Legacy CNNs (AlexNet)	Krizhevsky et al., 2012

Where each layer is applied in a typical block

Architecture family	Conventional placement	Common normalization choice
Vanilla CNN	After each conv, before activation	Batch normalization
ResNet	Inside residual blocks, before activation	Batch normalization
Transformer (post-LN)	After attention and feed-forward sublayer outputs	Layer normalization
Transformer (pre-LN)	Before attention and feed-forward sublayers	Layer normalization or RMSNorm
Vision Transformer (ViT)	Pre-LN	Layer normalization
Modern LLM (LLaMA, Mistral, Qwen, DeepSeek)	Pre-LN	RMSNorm
Style transfer	After conv layers	Instance normalization or AdaIN
Object detection backbone with small batches	Inside the backbone	Group normalization or synchronous BN

Pre-LN vs Post-LN in transformers

The original Transformer architecture introduced by Vaswani and colleagues in 2017 placed LayerNorm after each sublayer's residual connection (post-norm or post-LN). This means each sublayer computes:

x' = LayerNorm(x + Sublayer(x))

While post-LN works well for moderately deep transformers (such as the original 6-layer encoder-decoder), training very deep post-LN transformers requires careful learning rate warmup and is prone to instability. Xiong and colleagues showed in 2020 (arxiv 2002.04745) that placing LayerNorm before each sublayer (pre-norm or pre-LN) yields a much better-conditioned optimization landscape and removes the need for warmup. The pre-LN formulation is:

x' = x + Sublayer(LayerNorm(x))

The key practical difference is that gradients flow more smoothly through the residual path in pre-LN, because the input x reaches the output through an identity connection that no normalization layer interrupts. This makes very deep transformers easier to train. Most modern large language models, including GPT-3, LLaMA, Mistral, and DeepSeek, use the pre-LN formulation. A small number of recent architectures combine both approaches (so-called sandwich norm or hybrid norm) by placing LayerNorm both before and after each sublayer.
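
The two formulations differ only in where the normalization sits relative to the residual addition, which a short NumPy sketch makes explicit. The toy sublayer (a tanh of a random linear map) is a stand-in for attention or a feed-forward network, and LayerNorm is written without learned parameters.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_ln_block(x, sublayer):
    """x' = LayerNorm(x + Sublayer(x))"""
    return layer_norm(x + sublayer(x))

def pre_ln_block(x, sublayer):
    """x' = x + Sublayer(LayerNorm(x))"""
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(16, 16))  # hypothetical toy sublayer weights

def sublayer(h):
    return np.tanh(h @ W)

def zero_sublayer(h):
    return np.zeros_like(h)

x = rng.normal(size=(4, 16))
post_out = post_ln_block(x, sublayer)
pre_out = pre_ln_block(x, sublayer)

# With a sublayer that outputs zeros, pre-LN passes x through unchanged,
# while post-LN still renormalizes it.
pre_identity = pre_ln_block(x, zero_sublayer)
post_identity = post_ln_block(x, zero_sublayer)
```

The zero-sublayer case isolates the residual path: in pre-LN it is an exact identity, whereas in post-LN every block rescales the signal.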

Modern LLM normalization choices

The choice of normalization layer in modern large language models has converged on a small set of preferences:

Model family	Normalization	Placement
BERT (2018)	LayerNorm	Post-LN
GPT-2 (2019)	LayerNorm	Pre-LN
GPT-3 (2020)	LayerNorm	Pre-LN
T5 (2019)	RMSNorm	Pre-LN
LLaMA / LLaMA 2 / LLaMA 3	RMSNorm	Pre-LN
Mistral 7B	RMSNorm	Pre-LN
Qwen / Qwen 2 / Qwen 2.5	RMSNorm	Pre-LN
DeepSeek / DeepSeek V2 / V3	RMSNorm	Pre-LN
Gemma	RMSNorm	Pre-LN
PaLM (2022)	LayerNorm (no bias terms)	Pre-LN (parallel sublayers)
Claude (Anthropic)	Not publicly disclosed	Not publicly disclosed

The move from LayerNorm to RMSNorm in flagship LLMs is driven primarily by efficiency. RMSNorm removes the mean-centering step, eliminating a sum reduction and a subtraction at every token in every layer. With large hidden dimensions and very deep stacks, this saving compounds. The RMSNorm authors reported quality comparable to LayerNorm in their experiments. As a result, most newer transformer-based models default to RMSNorm.

Normalization-free alternatives

A recurring research question is whether explicit normalization layers are even necessary. Several approaches have proposed normalization-free architectures:

Fixup initialization (Zhang and colleagues, 2019) carefully scales the initial weights of residual networks so that the network is well-conditioned at initialization without any normalization layer. With Fixup, very deep ResNets train successfully without BatchNorm.
NFNets (Brock and colleagues, 2021) are normalizer-free networks that use weight standardization and adaptive gradient clipping to remain stable. They achieved state-of-the-art image classification accuracy without BatchNorm.
DeepNorm (Wang and colleagues, 2022) is a modified residual scaling combined with post-LN that allows transformers with up to 1,000 layers to train stably.
Dynamic Tanh (DyT), proposed in early 2025, replaces normalization layers with an element-wise tanh activation scaled by a learned parameter. The authors reported that swapping LayerNorm for DyT in transformers preserved accuracy while modestly reducing compute. The technique is still under active investigation and has not yet been adopted in any production frontier LLM, but it represents a continuing effort to find simpler alternatives to explicit normalization.

Despite these proposals, normalization layers remain the dominant approach in production deep learning systems because they are simple, robust across hyperparameter choices, and well-supported in standard frameworks.

Effect on training speed and convergence

Normalization, whether applied to the input data or through internal network layers, consistently accelerates training and improves convergence. The main mechanisms are:

Smoother loss landscape. Normalized inputs and activations produce a loss function with more uniform curvature. The optimizer can use a larger learning rate without overshooting, and gradient updates point more directly toward the optimum.
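Reduced internal covariate shift. Normalization layers keep the distribution of layer inputs relatively stable throughout training, so each layer does not need to continually adapt to shifting input distributions. This was the original motivation for BatchNorm, although Santurkar and colleagues (2018) argue that it accounts for little of the observed benefit.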
Better gradient flow. By keeping activations in a moderate numeric range, normalization reduces the likelihood of vanishing or exploding gradients, especially in deep networks.
Implicit regularization. Some normalization methods (notably BatchNorm) introduce noise through mini-batch statistics, which can act as a mild regularizer similar to dropout.
Reduced sensitivity to initialization. Networks with normalization layers train successfully across a wider range of weight initialization schemes. This is particularly valuable when designing new architectures, where choosing a specific initialization can be a significant hyperparameter.

In the original BatchNorm experiments, the normalized network matched the baseline's accuracy in 14 times fewer training steps. Similarly, on badly scaled inputs, choosing an appropriate input normalization scheme can substantially reduce the number of epochs needed for convergence with algorithms like stochastic gradient descent.

Practical implementation guide

A pragmatic workflow for deciding how to normalize tabular data and where to place internal normalization layers in a network looks like this.

For tabular preprocessing

Start with StandardScaler if features are roughly Gaussian and the model is linear, an SVM, or a small neural network.
Use RobustScaler if the data contains outliers that you cannot remove. The interquartile range will not be distorted by extreme values.
Use MinMaxScaler with the range [0, 1] if the model expects bounded inputs (for example, when the features are combined with sigmoid outputs from another model).
Use MaxAbsScaler for sparse data to preserve the location of zero entries.
Apply a PowerTransformer before standardization if the feature is heavily skewed and you want to make it more Gaussian.
Use QuantileTransformer as a last resort for mixed-distribution heterogeneous tabular data feeding into neural models.
Skip normalization for tree-based models such as XGBoost, LightGBM, and random forests.

For internal normalization layers

Choose BatchNorm for image classification, segmentation, and detection with batch sizes >= 32.
Choose GroupNorm for tasks with very small batch sizes (<= 4), particularly in detection backbones and 3D vision.
Choose LayerNorm for sequence models, recurrent networks, and standard transformers.
Choose RMSNorm when training a large language model from scratch and efficiency at scale matters.
Choose InstanceNorm for neural style transfer and image-to-image translation.
Use pre-LN for transformers deeper than ~12 layers.

Database normalization

In database design, "normalization" refers to the process of organizing a relational database to reduce redundancy and improve data integrity. This is an entirely different concept from normalization in machine learning. Database normalization follows a series of "normal forms" (1NF, 2NF, 3NF, BCNF, and higher) that define rules about how data should be divided across tables and how relationships should be structured. While both uses of the term involve imposing structure and consistency, they operate in completely different domains.

Common mistakes

Fitting the scaler on the entire dataset before splitting. This is the single most common mistake in applied machine learning. Always fit on the training set only.
Forgetting to save the fitted scaler. When deploying a model, the fitted scaler must be saved alongside the model weights so the same parameters are applied to incoming production data.
Normalizing target variables without inverting at prediction time. If you scale the regression target during training, remember to invert the transformation when reporting predictions in the original units.
Mixing scalers across cross-validation folds. Use a Pipeline to ensure the scaler is refit within each fold; otherwise the cross-validated score will be optimistically biased.
Standardizing one-hot encoded features. Z-scoring binary indicators changes their semantics without improving learning. Leave categorical encodings alone.
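Applying BatchNorm with batch size 1 during training. BatchNorm normalizes each channel using the mean and variance of the current mini-batch. With a single sample and no spatial dimensions, every activation equals its own batch mean, so the variance is zero and the normalized output collapses to zero, leaving only the learned bias. With spatial dimensions the statistics exist but are too noisy to be useful. Use LayerNorm or GroupNorm in this case.
Forgetting model.eval() at inference. BatchNorm and dropout behave differently in training and evaluation modes. If the model is left in training mode, BatchNorm uses the current mini-batch statistics instead of the running averages and dropout keeps zeroing activations, so predictions vary with batch composition and from call to call.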
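The first three mistakes above share one remedy: compute scaling parameters from the training split only, store them, and reuse exactly those parameters later. The sketch below shows that flow with a hypothetical SimpleStandardScaler class written for this example (it is not scikit-learn's StandardScaler), made-up target values, and JSON as an assumed storage format.

```python
import json
import statistics


class SimpleStandardScaler:
    """Library-free stand-in for a standard scaler on a single feature."""

    def fit(self, values):
        self.mean = statistics.fmean(values)
        self.std = statistics.pstdev(values) or 1.0  # guard against zero spread
        return self

    def transform(self, values):
        return [(v - self.mean) / self.std for v in values]

    def inverse_transform(self, values):
        return [v * self.std + self.mean for v in values]

    def to_json(self):
        return json.dumps({"mean": self.mean, "std": self.std})

    @classmethod
    def from_json(cls, payload):
        params = json.loads(payload)
        scaler = cls()
        scaler.mean, scaler.std = params["mean"], params["std"]
        return scaler


# Hypothetical regression targets, split BEFORE any statistics are computed.
train_y = [100.0, 120.0, 140.0, 160.0]
test_y = [130.0, 500.0]

# Fit on the training split only; the test split never influences mean or std.
scaler = SimpleStandardScaler().fit(train_y)
scaled_test = scaler.transform(test_y)

# Persist the fitted parameters, then reload them as a serving process would.
restored = SimpleStandardScaler.from_json(scaler.to_json())

# Invert scaled values to report results in the original units.
recovered = restored.inverse_transform(scaled_test)
print(recovered)
```

The extreme test value of 500 has no effect on the stored mean and standard deviation, which is the point of fitting on the training split alone. In a scikit-learn workflow the same guarantees are usually obtained by placing the scaler inside a Pipeline and saving the fitted pipeline as a whole.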

Explain like I'm 5 (ELI5)

Imagine you and your friends are comparing how far you can throw a ball, how fast you can run, and how many pushups you can do. The throwing distances are in meters, the running times are in seconds, and pushups are just a count. If you try to add all three numbers together to pick a winner, the throwing distance (maybe 30 meters) would matter way more than the pushups (maybe 10). That is not fair.

Normalization is like converting each score to a number between 0 and 10 so every activity counts equally. The fastest runner gets a 10, the slowest gets a 0, and everyone else falls in between. Now when you add up the scores, no single activity can unfairly dominate.

In machine learning, computers face the same problem. Different measurements can have wildly different ranges, and normalization makes them all comparable so the computer can learn from every measurement equally.

Deep neural networks face a related problem inside their own layers. As the network learns, the numbers flowing between layers can grow very large or very small, which makes the math unstable. Internal normalization layers act like a thermostat: at every layer, they squeeze the numbers back to a sensible range so the next layer always receives well-behaved input.

References

Ioffe, S. and Szegedy, C. (2015). "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift." *Proceedings of the 32nd International Conference on Machine Learning (ICML)*, 37:448-456. https://arxiv.org/abs/1502.03167
Ba, J.L., Kiros, J.R., and Hinton, G.E. (2016). "Layer Normalization." *arXiv preprint arXiv:1607.06450*. https://arxiv.org/abs/1607.06450
Ulyanov, D., Vedaldi, A., and Lempitsky, V. (2016). "Instance Normalization: The Missing Ingredient for Fast Stylization." *arXiv preprint arXiv:1607.08022*. https://arxiv.org/abs/1607.08022
Wu, Y. and He, K. (2018). "Group Normalization." *Proceedings of the European Conference on Computer Vision (ECCV)*, pp. 3-19. https://arxiv.org/abs/1803.08494
Salimans, T. and Kingma, D.P. (2016). "Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks." *Advances in Neural Information Processing Systems (NeurIPS)*, 29. https://arxiv.org/abs/1602.07868
Zhang, B. and Sennrich, R. (2019). "Root Mean Square Layer Normalization." *Advances in Neural Information Processing Systems (NeurIPS)*, 32. https://arxiv.org/abs/1910.07467
Xiong, R., Yang, Y., He, D., Zheng, K., Zheng, S., Xing, C., Zhang, H., Lan, Y., Wang, L., and Liu, T. (2020). "On Layer Normalization in the Transformer Architecture." *Proceedings of the 37th International Conference on Machine Learning (ICML)*. https://arxiv.org/abs/2002.04745
Singh, S. and Krishnan, S. (2019). "Filter Response Normalization Layer: Eliminating Batch Dependence in the Training of Deep Neural Networks." *arXiv preprint arXiv:1911.09737*. https://arxiv.org/abs/1911.09737
Shen, S., Yao, Z., Gholami, A., Mahoney, M.W., and Keutzer, K. (2020). "PowerNorm: Rethinking Batch Normalization in Transformers." *Proceedings of the 37th International Conference on Machine Learning (ICML)*. https://arxiv.org/abs/2003.07845
Luo, P., Ren, J., Peng, Z., Zhang, R., and Li, J. (2019). "Differentiable Learning-to-Normalize via Switchable Normalization." *International Conference on Learning Representations (ICLR)*. https://arxiv.org/abs/1806.10779
Pedregosa, F. et al. (2011). "Scikit-learn: Machine Learning in Python." *Journal of Machine Learning Research*, 12:2825-2830. https://scikit-learn.org/stable/modules/preprocessing.html
Santurkar, S., Tsipras, D., Ilyas, A., and Madry, A. (2018). "How Does Batch Normalization Help Optimization?" *Advances in Neural Information Processing Systems (NeurIPS)*, 31. https://arxiv.org/abs/1805.11604
Box, G.E.P. and Cox, D.R. (1964). "An Analysis of Transformations." *Journal of the Royal Statistical Society. Series B*, 26(2):211-252.
Yeo, I. and Johnson, R.A. (2000). "A New Family of Power Transformations to Improve Normality or Symmetry." *Biometrika*, 87(4):954-959.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., and Polosukhin, I. (2017). "Attention Is All You Need." *Advances in Neural Information Processing Systems (NeurIPS)*, 30. https://arxiv.org/abs/1706.03762
Touvron, H. et al. (2023). "LLaMA: Open and Efficient Foundation Language Models." *arXiv preprint arXiv:2302.13971*. https://arxiv.org/abs/2302.13971
Zhang, H., Dauphin, Y.N., and Ma, T. (2019). "Fixup Initialization: Residual Learning Without Normalization." *International Conference on Learning Representations (ICLR)*. https://arxiv.org/abs/1901.09321
Brock, A., De, S., Smith, S.L., and Simonyan, K. (2021). "High-Performance Large-Scale Image Recognition Without Normalization." *Proceedings of the 38th International Conference on Machine Learning (ICML)*. https://arxiv.org/abs/2102.06171
Wang, H., Ma, S., Dong, L., Huang, S., Zhang, D., and Wei, F. (2022). "DeepNet: Scaling Transformers to 1,000 Layers." *arXiv preprint arXiv:2203.00555*. https://arxiv.org/abs/2203.00555
Goodfellow, I., Bengio, Y., and Courville, A. (2016). *Deep Learning*. MIT Press. Chapter 8: Optimization for Training Deep Models. https://www.deeplearningbook.org/
Bishop, C.M. (2006). *Pattern Recognition and Machine Learning*. Springer. Section 5.1: Feed-forward Network Functions.
Stanford CS231n. "Convolutional Neural Networks for Visual Recognition: Neural Networks Part 2." Course notes. https://cs231n.github.io/neural-networks-2/
PyTorch documentation. "torch.nn.LayerNorm," "torch.nn.BatchNorm2d," "torch.nn.RMSNorm." https://pytorch.org/docs/stable/nn.html
Google Developers. "Numerical Data: Normalization." *Machine Learning Crash Course*. https://developers.google.com/machine-learning/crash-course/numerical-data/normalization
