Batch normalization (often abbreviated BatchNorm or BN) is a technique for improving the speed, stability, and performance of deep neural network training. Introduced by Sergey Ioffe and Christian Szegedy in their 2015 paper "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift," the method normalizes the inputs to each layer by re-centering and re-scaling them using statistics computed from the current mini-batch, then applies learnable scale and shift parameters that let the network recover any transformation it needs.
Batch normalization became one of the most widely adopted techniques in deep learning after its introduction. It was a key component of the Inception v2 architecture that achieved state-of-the-art results on the ImageNet benchmark, matching the previous best accuracy with 14 times fewer training steps. Nearly every major convolutional neural network architecture developed after 2015, including ResNet, DenseNet, Inception v2/v3, and EfficientNet, uses batch normalization.
Batch normalization operates on the activations (or pre-activations) of a layer during training by standardizing them to have zero mean and unit variance across the mini-batch. It then applies a learned linear transformation to restore representational capacity. The algorithm operates on each feature independently, normalizing across the examples in a mini-batch, and consists of four steps applied to each mini-batch of size m.
Given a mini-batch of values B = {x_1, x_2, ..., x_m} for a particular activation:
| Step | Operation | Formula |
|---|---|---|
| 1. Compute mini-batch mean | Average over the batch | $\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i$ |
| 2. Compute mini-batch variance | Variance over the batch | $\sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2$ |
| 3. Normalize | Subtract mean, divide by std | $\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$ |
| 4. Scale and shift | Apply learnable parameters | $y_i = \gamma \hat{x}_i + \beta$ |
The small constant epsilon (typically 1e-5) is added for numerical stability to prevent division by zero. The parameters gamma (scale) and beta (shift) are learnable parameters that are updated during training via backpropagation and gradient descent, just like the layer's weights.
These parameters are critical: without them, the normalization step would constrain the network to only represent activations with zero mean and unit variance, limiting the model's expressive power. For example, if the activation function is a sigmoid, forcing zero mean and unit variance would confine the inputs to the near-linear region of the sigmoid, eliminating the non-linearity.
By learning gamma and beta, the network can recover any linear transformation of the normalized values, including the identity transformation if that turns out to be optimal. In principle, the network can undo the normalization entirely (by setting gamma = sigma_B and beta = mu_B). In practice, the network learns whatever scale and shift works best for each layer, which is usually something between full normalization and no normalization at all.
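The four steps above translate directly into code. The following is a minimal sketch for a mini-batch of fully connected activations of shape (m, d); the function name batchnorm_forward_train is purely illustrative, and in practice PyTorch's nn.BatchNorm1d performs the same computation.

```python
import torch

def batchnorm_forward_train(x, gamma, beta, eps=1e-5):
    """Training-time batch norm for a (m, d) mini-batch (illustrative sketch)."""
    mu = x.mean(dim=0)                          # step 1: mini-batch mean per feature
    var = x.var(dim=0, unbiased=False)          # step 2: mini-batch (biased) variance
    x_hat = (x - mu) / torch.sqrt(var + eps)    # step 3: normalize
    return gamma * x_hat + beta                 # step 4: scale and shift

x = torch.randn(32, 64)                         # batch of 32 examples, 64 features
gamma, beta = torch.ones(64), torch.zeros(64)   # learnable parameters (initial values)
y = batchnorm_forward_train(x, gamma, beta)
```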
In CNNs, batch normalization is applied per channel (per feature map) rather than per individual activation. For a convolutional layer producing feature maps of shape (N, C, H, W), where N is the batch size, C is the number of channels, and H and W are spatial dimensions, the mean and variance are computed across the batch dimension and both spatial dimensions for each channel independently. This means each of the C channels has its own scalar mean, scalar variance, and its own learned gamma and beta parameters. The rationale is that all spatial locations within a single feature map should share the same normalization statistics because they are produced by the same convolutional filter.
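A short sketch of this per-channel reduction, assuming a PyTorch tensor of shape (N, C, H, W); nn.BatchNorm2d implements the same computation with the statistics and parameters managed internally.

```python
import torch

x = torch.randn(8, 64, 32, 32)                       # (N, C, H, W) feature maps

# One scalar mean and variance per channel, computed over N, H, and W
mu = x.mean(dim=(0, 2, 3), keepdim=True)             # shape (1, 64, 1, 1)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
x_hat = (x - mu) / torch.sqrt(var + 1e-5)

# gamma and beta are also per-channel and broadcast over N, H, W
gamma = torch.ones(1, 64, 1, 1)
beta = torch.zeros(1, 64, 1, 1)
y = gamma * x_hat + beta
```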
The original paper by Ioffe and Szegedy proposed applying batch normalization before the activation function, following the pattern: Convolution, BatchNorm, ReLU. The reasoning was that BN aims to normalize the inputs to the nonlinearity, keeping them in a range where gradients flow effectively. For ReLU activations, centering the distribution around zero before applying ReLU helps ensure that a meaningful proportion of activations are non-zero, which can reduce the "dying ReLU" problem.
However, this placement is not universally agreed upon. Some practitioners and researchers have experimented with placing batch normalization after the activation function, and for ReLU-based networks the empirical differences are often minimal. For bounded activation functions like tanh, placing BN after the activation has been reported to yield better performance on certain benchmarks. In practice, the pre-activation convention (BN before the nonlinearity) remains the most common default in modern frameworks and architectures.
Batch normalization behaves differently during training and inference, and understanding this distinction is essential for correct implementation. The distinction is also a frequent source of bugs in practice.
During training, the mean and variance are computed from the current mini-batch. This introduces a form of stochastic noise into the forward pass, because the normalization of each sample depends on the other samples in the same mini-batch. Each mini-batch produces slightly different statistics, which is what creates the regularization effect (see below). This noise acts as a mild regularizer, which can be beneficial but also means that predictions for a single input are not deterministic during training. The gradients flow through the mean and variance computation, so they are part of the computation graph.
Simultaneously, the BN layer maintains exponential moving averages of the batch mean and variance, called the running mean and running variance. After each mini-batch, these running statistics are updated:
| Parameter | Update rule |
|---|---|
| Running mean | running_mean = (1 - momentum) * running_mean + momentum * batch_mean |
| Running variance | running_var = (1 - momentum) * running_var + momentum * batch_var |
The momentum parameter (typically 0.1 or 0.9, depending on the framework convention) controls how quickly the running statistics adapt. In PyTorch, the default momentum value is 0.1.
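A small sketch of this update, assuming the PyTorch convention in which momentum weights the new batch statistic (Keras defines momentum as the weight on the old running average instead, with a default of 0.99); update_running_stats is a hypothetical helper name.

```python
import torch

def update_running_stats(running_mean, running_var, batch_mean, batch_var, momentum=0.1):
    # PyTorch convention: momentum is the weight given to the *new* batch statistic
    new_mean = (1 - momentum) * running_mean + momentum * batch_mean
    new_var = (1 - momentum) * running_var + momentum * batch_var
    return new_mean, new_var

running_mean, running_var = torch.zeros(64), torch.ones(64)
x = torch.randn(32, 64)                               # one mini-batch of activations
running_mean, running_var = update_running_stats(
    running_mean, running_var, x.mean(dim=0), x.var(dim=0, unbiased=False))
```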
At inference time, it is impractical (and undesirable) to compute statistics from a mini-batch, since predictions should be deterministic and may involve a single sample. Instead, the BN layer uses the accumulated running mean and running variance from training. The normalization at inference becomes a fixed linear transformation:
y = gamma * (x - running_mean) / sqrt(running_var + epsilon) + beta
This fixed transformation can be fused with the preceding linear or convolutional layer's weights and biases, eliminating the BN layer entirely at inference and reducing the computational overhead to zero.
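A sketch of such a fusion in PyTorch, assuming a trained model in inference mode; fuse_conv_bn is a hypothetical helper written for illustration (production toolchains provide their own fusion passes).

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold the fixed inference-time BN transform into the preceding conv (sketch)."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding,
                      dilation=conv.dilation, groups=conv.groups, bias=True)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)      # gamma / sqrt(var + eps)
    fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
    conv_bias = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
    fused.bias.copy_((conv_bias - bn.running_mean) * scale + bn.bias)
    return fused

conv, bn = nn.Conv2d(3, 64, 3, padding=1), nn.BatchNorm2d(64).eval()
x = torch.randn(1, 3, 32, 32)
print(torch.allclose(bn(conv(x)), fuse_conv_bn(conv, bn)(x), atol=1e-5))  # True
```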
Common practical issues with batch normalization, along with their typical causes and fixes, are summarized below.
| Problem | Cause | Solution |
|---|---|---|
| Training and inference results differ significantly | Model not switched to eval mode (batch stats vs. running stats) | Call model.eval() before inference in PyTorch |
| Poor accuracy with small batch sizes | Noisy batch statistics from too few samples | Use Group Normalization or increase batch size |
| Running statistics are inaccurate after fine-tuning | Running stats from pre-training do not match new data distribution | Reset or freeze BN statistics when fine-tuning on a new domain |
| NaN or unstable loss values | Extremely small batches or degenerate batches | Check batch composition; consider batch-size-independent normalization |
Ioffe and Szegedy motivated batch normalization as a solution to internal covariate shift, defined as the change in the distribution of layer inputs caused by updates to the parameters of preceding layers during training. They argued that this shifting distribution forced each layer to continuously adapt to new input statistics, slowing down training and requiring conservative learning rate choices and careful parameter initialization. By normalizing layer inputs, BN was claimed to stabilize these distributions and allow the network to train more effectively.
In 2018, Santurkar, Tsipras, Ilyas, and Madry published a highly influential paper titled "How Does Batch Normalization Help Optimization?" at NeurIPS that challenged the internal covariate shift narrative. Through careful experiments, they demonstrated that networks with noise deliberately injected after BN layers, which induces severe covariate shift, still trained as quickly as standard batch-normalized networks, and that BN does not necessarily reduce internal covariate shift at all. They argued instead that BN's real benefit is to make the optimization landscape significantly smoother.
Specifically, BN improves the Lipschitzness of both the loss function and its gradients (also known as beta-smoothness). A smoother loss landscape means that gradients are more predictive of the actual direction of improvement, which in practice permits larger learning rates, faster convergence, and reduced sensitivity to hyperparameter choices. The competing explanations for why batch normalization works are summarized below.
| Explanation | Proposed by | Year | Status |
|---|---|---|---|
| Reduces internal covariate shift | Ioffe and Szegedy | 2015 | Disputed |
| Smooths the loss landscape | Santurkar et al. | 2018 | Widely accepted |
| Acts as a regularizer (noise from batch statistics) | Ioffe and Szegedy (secondary claim) | 2015 | Generally accepted |
This loss landscape smoothing explanation is now widely accepted as the more accurate account of why batch normalization helps training, though the exact mechanisms by which batch normalization achieves this smoothing are still being studied.
Because the normalization statistics are computed over mini-batches, each sample's normalized value depends on the other samples in the batch. This introduces noise into the training process in a manner somewhat analogous to dropout. This stochastic regularization effect can reduce overfitting, and Ioffe and Szegedy noted in their original paper that batch normalization sometimes eliminates the need for dropout entirely. The regularization effect is stronger with smaller batch sizes (since the batch statistics are noisier) and weaker with larger batch sizes.
Batch normalization makes networks significantly more robust to the choice of weight initialization. Without BN, poor initialization can cause activations to explode or vanish as they propagate through many layers, leading to very slow training or complete failure. With BN, the normalization step re-centers and re-scales activations at every layer, preventing such cascading effects. While proper initialization (such as He initialization for ReLU networks) is still beneficial and can speed up convergence, BN provides a safety net that prevents catastrophic initialization failures.
Similarly, BN allows practitioners to use much higher learning rates than would otherwise be stable. The smoother loss landscape allows the use of learning rates that would cause divergence without normalization. Ioffe and Szegedy reported that their batch-normalized Inception network could be trained with learning rates an order of magnitude higher than the non-normalized version while still converging.
Combining batch normalization and dropout in the same network requires care, as the two techniques can conflict. Li et al. (2019) identified a "variance shift" problem that arises when dropout is placed before a BN layer. During training, dropout randomly zeros out activations with probability 1 - p (where p denotes the retention probability), which changes the variance of the layer's outputs. BN then computes running statistics based on these dropout-modified activations. At inference time, however, dropout is disabled and all activations are present (scaled by p), producing a different variance than what BN's running statistics expect. This mismatch between training-time and inference-time statistics can degrade performance.
Practical recommendations for combining the two techniques include:
- Place dropout only after all batch normalization layers, for example in the final classifier head, so that BN's running statistics are never computed on dropout-modified activations (a sketch follows this list)
- Reduce or omit dropout in convolutional blocks that already use BN, since BN itself provides some regularization
- Where dropout before BN is unavoidable, use the more variance-stable dropout variants suggested by Li et al.
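As a concrete illustration of the first recommendation, the following sketch places the only dropout layer after every BN layer, in the classifier head; the architecture itself is arbitrary and chosen purely for illustration.

```python
import torch.nn as nn

# Dropout appears only after all BN layers, so BN's running statistics
# are never computed on dropout-modified activations.
model = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(inplace=True),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Dropout(p=0.5),   # placed after every BN layer
    nn.Linear(64, 10),
)
```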
Batch normalization significantly reduces, but does not eliminate, the importance of weight initialization. Before BN became widespread, techniques like Xavier initialization and He initialization were critical for training deep networks, as they aimed to maintain stable activation variances across layers. With BN, the normalization step automatically corrects activation scales at every layer, providing resilience against poor initialization. In practice, He initialization combined with BN and skip connections is the standard recipe for training very deep networks (50+ layers). Even with BN, very poor initialization choices can slow convergence, so using established initialization methods remains a best practice.
Batch normalization's reliance on mini-batch statistics creates a fundamental problem when batch sizes are small. The sample mean and variance become noisy estimates of the true population statistics, with estimation errors scaling as O(1/sqrt(m)), where m is the batch size. With a batch size of 32, these estimates are reasonably stable. But as the batch size shrinks:
| Batch size | Relative estimation error | Typical impact on performance |
|---|---|---|
| 32+ | Low | Batch normalization works well |
| 8-16 | Moderate | Slight degradation; still usable |
| 2-4 | High | Noticeable performance drop |
| 1 | Undefined (zero variance for single sample) | Batch normalization breaks entirely |
With a batch size of 1, the mini-batch mean is just the single example's value, and the variance is zero. The numerator of the normalization step is therefore also zero, so every normalized activation collapses to zero (and the output to beta) regardless of the input, making the computation meaningless. Wu and He (2018) demonstrated that ResNet-50 with BN lost 5.6 percentage points of top-1 accuracy on ImageNet when the batch size was reduced from 32 to 2, a dramatic degradation.
Several practical scenarios force the use of small batch sizes:
- Object detection and semantic segmentation, where high-resolution inputs leave room for only a few images per GPU
- Video models and 3D medical imaging, where each individual example is very large
- Training on memory-constrained hardware
- Distributed training where the per-device batch is small
In these settings, batch normalization's noisy statistics cause training instability, poor convergence, and degraded final performance. This has driven the adoption of alternative normalization methods.
One partial solution for multi-GPU setups is synchronized batch normalization (SyncBN), which computes batch statistics across all GPUs rather than independently on each device. If each of 8 GPUs processes 2 examples, SyncBN computes statistics over an effective batch of 16, reducing the noise problem.
However, SyncBN requires an all-reduce communication step at every batch normalization layer during both forward and backward passes, adding significant overhead. It also does not help when the total batch across all devices is still small.
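In PyTorch, a model's BN layers can be converted to synchronized batch normalization for distributed training; a minimal sketch is shown below (the distributed process-group setup is elided).

```python
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 64, 3), nn.BatchNorm2d(64), nn.ReLU())

# Replace every BatchNorm layer with SyncBatchNorm; during distributed training,
# batch statistics are then all-reduced across processes in the process group.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
```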
Batch normalization is poorly suited for transformer architectures and recurrent neural networks that process variable-length sequences. In these settings, different positions in the sequence may carry different semantic meaning, and computing statistics across the batch at each position mixes incomparable information. Additionally, during autoregressive inference, tokens are processed one at a time, making batch statistics meaningless. This is the primary reason that transformers universally use layer normalization or its variants instead of batch normalization.
In online learning scenarios where data arrives one sample at a time, or in streaming settings with non-stationary data distributions, batch normalization cannot compute meaningful batch statistics. The running statistics accumulated during training may also become stale if the data distribution shifts over time.
The limitations of batch normalization have motivated the development of several alternative normalization methods. Each computes statistics over a different set of dimensions.
| Technique | Normalizes over | Batch-size dependent? | Primary use case | Key paper |
|---|---|---|---|---|
| Batch Normalization | Batch + spatial dims, per channel | Yes | CNNs with large batch sizes | Ioffe and Szegedy, 2015 |
| Layer Normalization | All features in a single sample | No | Transformers, RNNs | Ba, Kiros, and Hinton, 2016 |
| Instance Normalization | Spatial dims per channel, per sample | No | Style transfer, image generation | Ulyanov, Vedaldi, and Lempitsky, 2016 |
| Group Normalization | Groups of channels per sample | No | CNNs with small batch sizes, detection | Wu and He, 2018 |
| RMSNorm | All features (RMS only, no mean subtraction) | No | Large language models | Zhang and Sennrich, 2019 |
Layer normalization (Ba, Kiros, and Hinton, 2016) normalizes across all features within a single training example rather than across the batch. For a hidden vector of dimension d, layer normalization computes the mean and variance over the d features of that single example. Because it does not depend on batch size, it works with a batch size of 1 and does not require running statistics. Layer normalization became the standard normalization technique for transformers and is used in models like BERT, GPT, and their successors.
Key differences from batch normalization:
| Property | Batch Normalization | Layer Normalization |
|---|---|---|
| Normalizes across | Batch dimension | Feature dimension |
| Depends on batch size | Yes | No |
| Works for batch size 1 | Poorly | Yes |
| Suited for sequence models | No | Yes |
| Training/inference difference | Yes (running stats) | No |
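A small sketch of the difference in PyTorch; because layer normalization uses only a sample's own features, its output for a given example does not depend on what else is in the batch.

```python
import torch
import torch.nn as nn

x = torch.randn(4, 512)          # 4 examples, 512 features each

bn = nn.BatchNorm1d(512)         # statistics over the 4 examples, per feature
ln = nn.LayerNorm(512)           # statistics over the 512 features, per example

bn_out = bn(x)                   # each feature normalized across the batch
ln_out = ln(x)                   # each example normalized independently

# LayerNorm output for one sample is identical whether it is alone or in a batch
print(torch.allclose(ln(x[:1]), ln(x)[:1], atol=1e-6))  # True
```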
Instance normalization (Ulyanov, Vedaldi, and Lempitsky, 2016) normalizes across the spatial dimensions (H, W) of each feature map for each sample independently. It was introduced specifically for neural style transfer, where removing instance-specific contrast information from the content image improves stylization quality. It is equivalent to layer normalization applied to each channel independently and is most commonly used in generative models for image synthesis and style transfer applications.
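A brief usage sketch of PyTorch's nn.InstanceNorm2d, which normalizes each channel of each sample over its spatial dimensions only.

```python
import torch
import torch.nn as nn

x = torch.randn(2, 64, 56, 56)               # (N, C, H, W)

# Per-sample, per-channel normalization over (H, W); affine adds gamma and beta
inorm = nn.InstanceNorm2d(64, affine=True)
y = inorm(x)
```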
Group Normalization (Wu and He, 2018) divides channels into groups and normalizes within each group independently for each example. With G groups and C channels, each group contains C/G channels. Statistics are computed over the spatial dimensions (H, W) and the C/G channels within each group, for each example independently. It serves as a middle ground between layer normalization (which normalizes over all channels) and instance normalization (which normalizes each channel separately).
The critical advantage is that group normalization's computation depends only on a single example, not on the batch. This means the layer behaves identically during training and inference, requires no running statistics, and its accuracy is essentially independent of batch size, remaining usable even at a batch size of 1.
Group normalization became the standard choice in object detection frameworks like Detectron2 and MMDetection and in segmentation models where high-resolution inputs necessitate small batches. FAIR's Detectron2, for example, defaults to group normalization with 32 groups for all backbone networks. Wu and He showed that while BN lost 5.6 percentage points of top-1 accuracy on ImageNet when the batch size dropped from 32 to 2, group normalization lost only 0.3 percentage points.
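A minimal usage sketch of PyTorch's nn.GroupNorm with 32 groups (the Detectron2 default mentioned above); because the statistics are per example, the output is unchanged when a sample is processed on its own. The comparison table below summarizes the reported behavior of each method at large and small batch sizes.

```python
import torch
import torch.nn as nn

x = torch.randn(2, 64, 56, 56)                        # tiny batch of 2 feature maps

gn = nn.GroupNorm(num_groups=32, num_channels=64)     # 64 / 32 = 2 channels per group
y = gn(x)

# No batch dependence: a single example normalizes exactly the same way
print(torch.allclose(gn(x[:1]), y[:1], atol=1e-6))    # True
```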
| Normalization method | Performance at batch size 32 | Performance at batch size 2 | Batch size dependence |
|---|---|---|---|
| Batch Normalization | Best | Significantly degraded (-5.6% top-1) | Strong |
| Group Normalization (32 groups) | Slightly below BN (-0.5%) | Near-optimal (-0.3% from BN@32) | None |
| Layer Normalization | Below BN | Stable | None |
| Instance Normalization | Below BN | Stable | None |
RMSNorm (Root Mean Square Layer Normalization) was proposed by Zhang and Sennrich (2019). RMSNorm simplifies layer normalization by removing the mean-centering step and normalizing only by the root mean square of the activations:
RMSNorm(x) = gamma * x / sqrt((1/d) * sum(x_i^2) + epsilon)
The authors found that the mean subtraction in standard layer normalization is not necessary for the technique to work. Removing it reduces computational overhead because only a single reduction (the mean of the squared activations) is needed per normalization instead of two, no subtraction has to be applied to every element, and the learnable shift parameter beta can be dropped, leaving only the gain.
By eliminating the mean computation, RMSNorm reduces computational overhead by 7% to 64% depending on the model and implementation, while achieving comparable training performance. RMSNorm has become the dominant normalization choice in modern large language models, including LLaMA, Mistral, and Gemma.
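A compact sketch of the RMSNorm formula above as a PyTorch module; this follows the published definition, though production implementations (for example in LLaMA-style codebases) differ in details such as the epsilon value and precision handling.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root mean square normalization: no mean subtraction, gain only (sketch)."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))    # gamma (per-feature gain)

    def forward(self, x):
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x / rms

x = torch.randn(2, 16, 512)      # (batch, sequence length, hidden size)
y = RMSNorm(512)(x)
```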
Batch normalization was designed for convolutional neural networks processing images in fixed-size batches. Several properties of transformer models make batch normalization a poor fit: inputs are variable-length token sequences (usually padded), so per-position batch statistics are computed over incomparable tokens; autoregressive decoding generates one token at a time, leaving no batch to normalize over at inference; and batch statistics in NLP training fluctuate heavily from one mini-batch to the next.
Research has confirmed this empirically. Shen et al. (2020) showed that standard batch normalization leads to "significant performance degradation" in transformers for NLP tasks.
The evolution of normalization techniques in large language models tells a clear story: batch normalization was never suited for language modeling, and the field has progressively simplified normalization toward more efficient variants. RMSNorm has become the normalization method of choice in most state-of-the-art LLMs developed since 2023. The LLaMA family of models (Meta, 2023) adopted RMSNorm, and this choice was followed by many subsequent open-weight models.
| Model | Normalization method |
|---|---|
| GPT-2 (2019) | Layer Normalization |
| GPT-3 (2020) | Layer Normalization |
| BERT (2018) | Layer Normalization |
| LLaMA (2023) | RMSNorm |
| LLaMA 2 (2023) | RMSNorm |
| LLaMA 3 (2024) | RMSNorm |
| Mistral (2023) | RMSNorm |
| Gemma (2024) | RMSNorm |
The shift to RMSNorm is driven by practical benefits: comparable training stability with lower computational overhead per layer. When training models with billions of parameters across trillions of tokens, even small efficiency improvements per operation compound into meaningful savings in time and energy.
Beyond the choice of normalization function, where normalization is placed within the transformer block has a significant impact on training stability.
The original transformer (Vaswani et al., 2017) used post-norm placement:
x = LayerNorm(x + Sublayer(x))
Most modern LLMs use pre-norm placement:
x = x + Sublayer(LayerNorm(x))
Xiong et al. (2020) showed that pre-norm placement makes the gradient norms more predictable across layers, which stabilizes training and allows the use of larger learning rates. Pre-norm transformers can often be trained without warmup, while post-norm transformers typically require careful warmup to avoid divergence.
The tradeoff is that pre-norm placement can lead to slightly worse final performance compared to post-norm in some settings, because the residual stream is not normalized. However, the training stability benefits have made pre-norm the dominant choice in practice.
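A schematic sketch of the two placements, using a generic sublayer in place of the attention or feed-forward blocks of a real transformer; the class names are illustrative only.

```python
import torch
import torch.nn as nn

class PostNormBlock(nn.Module):
    """Original transformer placement: x = LayerNorm(x + Sublayer(x))."""
    def __init__(self, dim, sublayer):
        super().__init__()
        self.norm, self.sublayer = nn.LayerNorm(dim), sublayer

    def forward(self, x):
        return self.norm(x + self.sublayer(x))

class PreNormBlock(nn.Module):
    """Modern placement: x = x + Sublayer(LayerNorm(x))."""
    def __init__(self, dim, sublayer):
        super().__init__()
        self.norm, self.sublayer = nn.LayerNorm(dim), sublayer

    def forward(self, x):
        return x + self.sublayer(self.norm(x))

x = torch.randn(2, 16, 512)                      # (batch, sequence, hidden)
y = PreNormBlock(512, nn.Linear(512, 512))(x)    # Linear stands in for attention/FFN
```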
Recognizing the tradeoffs between pre-norm and post-norm, Microsoft introduced DeepNorm (Wang et al., 2022) for training very deep transformers (up to 1,000 layers). DeepNorm modifies the residual connection with a scaling factor:
x = LayerNorm(alpha * x + Sublayer(x))
where alpha is a constant that depends on the number of layers. DeepNorm achieves the training stability of pre-norm with the performance benefits of post-norm, enabling the training of transformers that are significantly deeper than previously possible.
| Year | Model | Normalization | Placement | Notes |
|---|---|---|---|---|
| 2017 | Original Transformer | Layer Normalization | Post-norm | First transformer architecture |
| 2018 | BERT | Layer Normalization | Post-norm | Followed original transformer |
| 2019 | GPT-2 | Layer Normalization | Pre-norm | Shifted to pre-norm for stability |
| 2020 | GPT-3 | Layer Normalization | Pre-norm | Continued pre-norm convention |
| 2022 | DeepNorm | Modified Layer Norm | Hybrid | For very deep transformers |
| 2023 | LLaMA | RMSNorm | Pre-norm | Shifted to RMSNorm for efficiency |
| 2023 | Mistral | RMSNorm | Pre-norm | Followed LLaMA convention |
| 2024 | LLaMA 3 | RMSNorm | Pre-norm | RMSNorm now standard |
| 2024 | Gemma | RMSNorm | Pre-norm | Google also adopted RMSNorm |
| 2024 | DeepSeek-V2 | RMSNorm | Pre-norm | MoE models also use RMSNorm |
The convergence of virtually all modern LLMs on RMSNorm with pre-norm placement reflects a consensus that this combination provides the best balance of training stability, computational efficiency, and model quality. Batch normalization plays no role in this landscape, having been superseded by architectures where per-example normalization is more natural and more efficient.
PyTorch provides batch normalization through torch.nn.BatchNorm1d, torch.nn.BatchNorm2d, and torch.nn.BatchNorm3d for 1D, 2D, and 3D inputs respectively.
```python
import torch
import torch.nn as nn

# BatchNorm for 2D convolutions (input: N, C, H, W)
bn = nn.BatchNorm2d(
    num_features=64,           # Number of channels
    eps=1e-5,                  # Epsilon for numerical stability
    momentum=0.1,              # Momentum for running stats
    affine=True,               # Learn gamma and beta
    track_running_stats=True   # Track running mean/var
)

# Example usage in a CNN block
class ConvBlock(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, 3, padding=1)
        self.bn = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn(self.conv(x)))

# Important: switch to eval mode for inference
model = ConvBlock(3, 64)
model.eval()  # Uses running statistics instead of batch statistics
```
Key parameters:
- num_features: matches the number of channels (C dimension) of the input
- momentum: controls the update rate of running statistics (default 0.1)
- affine: when True (default), the layer learns gamma and beta; when False, normalization has no learnable parameters
- track_running_stats: when True (default), maintains running mean and variance for inference
In TensorFlow and Keras, batch normalization is available via tf.keras.layers.BatchNormalization.
```python
import tensorflow as tf

# Keras functional API
inputs = tf.keras.Input(shape=(32, 32, 3))
x = tf.keras.layers.Conv2D(64, 3, padding='same')(inputs)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.ReLU()(x)

# The 'training' flag is handled automatically in model.fit()
# For manual control:
# x = tf.keras.layers.BatchNormalization()(x, training=True)
```
Keras automatically handles the training/inference mode switch when using model.fit() and model.predict(). For custom training loops, the training argument must be passed explicitly.
Batch normalization had a transformative effect on the practice of training deep neural networks when it was introduced. Before BN, training very deep networks was an arduous process that required meticulous hyperparameter tuning, careful initialization schemes, and conservative learning rates. The original paper demonstrated that a batch-normalized version of the Inception network matched the accuracy of the original with 14 times fewer training steps, and an ensemble of BN-Inception models achieved 4.9% top-5 error on the ImageNet validation set, surpassing the state of the art at the time.
The technique's success in CNNs for computer vision helped establish the modern deep learning training recipe: BN (or a normalization variant), residual connections, and adaptive optimizers like Adam. Nearly every ImageNet competition winner and major vision architecture from 2015 onward incorporated batch normalization. While transformers and large language models have moved toward layer normalization and RMSNorm, batch normalization remains the default choice for convolutional architectures in computer vision, and its conceptual legacy, normalizing intermediate representations to stabilize training, underlies all modern normalization techniques.
Imagine a classroom where each student takes a different version of a math test. Some tests have questions where all the numbers are between 1 and 10, while other tests have numbers between 1,000 and 1,000,000. The students with the huge numbers are going to struggle more, not because the math is harder, but because the scale makes everything more confusing.
Batch normalization is like a teacher who collects everyone's test, adjusts all the numbers to be on the same comfortable scale, and then hands them back. Now every student is working with similar-sized numbers, and they can focus on actually learning the math instead of wrestling with wildly different scales. The teacher also lets each student shift and stretch the numbers a little if that helps them learn better (those are the gamma and beta parameters). At the end of the school year (inference time), the teacher uses the average adjustments from the whole year instead of recalculating for each new test.