# Batch Normalization

> Source: https://aiwiki.ai/wiki/batch_normalization
> Updated: 2026-07-11
> Categories: Deep Learning, Machine Learning, Neural Networks
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

**Batch normalization** (often abbreviated **BatchNorm** or **BN**) is a technique for improving the speed, stability, and performance of deep [neural network](/wiki/neural_network) training by normalizing the inputs to each layer using statistics computed from the current mini-batch. Introduced by Sergey Ioffe and Christian Szegedy of Google in their 2015 paper "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift," the method re-centers and re-scales each layer's inputs to have zero mean and unit variance across the mini-batch, then applies two learnable parameters (a scale gamma and a shift beta) that let the network recover any transformation it needs.[1] On a state-of-the-art image classification model, batch normalization reached the same accuracy with 14 times fewer training steps, and an ensemble of batch-normalized networks set a new record of 4.9% top-5 validation error on the [ImageNet](/wiki/imagenet) benchmark, surpassing the accuracy of human raters at the time.[1]

Batch normalization became one of the most widely adopted techniques in [deep learning](/wiki/deep_learning) after its introduction. It was a key component of the [Inception](/wiki/inception) v2 architecture that achieved those ImageNet results.[1] Nearly every major [convolutional neural network](/wiki/convolutional_neural_network) architecture developed from 2015 onward, including [ResNet](/wiki/resnet)[8], [DenseNet](/wiki/densenet), Inception v2/v3, and [EfficientNet](/wiki/efficientnet), uses batch normalization. As Ioffe and Szegedy summarized in the paper's abstract, "Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout."[1]

## What is batch normalization used for?

Batch normalization is used to make deep neural networks train faster, more stably, and with less hyperparameter tuning. By normalizing each layer's inputs, it lets practitioners use higher [learning rates](/wiki/learning_rate), worry less about weight initialization, and in many cases drop other regularizers such as [dropout](/wiki/dropout_regularization).[1] It is the default normalization layer in convolutional architectures for computer vision, where it is typically inserted after each convolution and before the [activation function](/wiki/activation_function). The remainder of this article explains how the algorithm works, why it helps, where it breaks down, and the alternatives that are used in its place in [transformer](/wiki/transformer) models.

## How batch normalization works

Batch normalization operates on the activations (or pre-activations) of a layer during training by standardizing them to have zero mean and unit variance across the mini-batch. It then applies a learned linear transformation to restore representational capacity. The algorithm operates on each feature independently, normalizing across the examples in a mini-batch, and consists of four steps applied to each mini-batch of size *m*.

### Step-by-step algorithm

Given a mini-batch of values **B** = {x_1, x_2, ..., x_m} for a particular activation:

| Step | Operation | Formula |
|---|---|---|
| 1. Compute mini-batch mean | Average over the batch | $\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i$ |
| 2. Compute mini-batch variance | Variance over the batch | $\sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2$ |
| 3. Normalize | Subtract mean, divide by std | $\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$ |
| 4. Scale and shift | Apply learnable parameters | $y_i = \gamma \hat{x}_i + \beta$ |

The small constant $$\epsilon$$ (typically $$10^{-5}$$) is added for numerical stability to prevent division by zero. Note that the variance in step 2 is the biased estimate, which divides by *m* rather than *m* - 1. The scale gamma and the shift beta are learnable parameters, updated during training via [backpropagation](/wiki/backpropagation) and [gradient descent](/wiki/gradient_descent) just like the layer's weights.
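The four steps in the table can be sketched in a few lines of NumPy. This is a minimal illustration for a fully connected layer, assuming inputs of shape (m, features); the function name `batch_norm_train` is made up for this example and is not a library API.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Training-mode batch normalization for an array of shape (m, features)."""
    mu = x.mean(axis=0)                      # step 1: per-feature mini-batch mean
    var = x.var(axis=0)                      # step 2: biased variance (divides by m)
    x_hat = (x - mu) / np.sqrt(var + eps)    # step 3: normalize
    y = gamma * x_hat + beta                 # step 4: scale and shift
    return y, mu, var

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(64, 4))   # 64 examples, 4 features
y, mu, var = batch_norm_train(x, gamma=np.ones(4), beta=np.zeros(4))

print(y.mean(axis=0))  # close to 0 for every feature
print(y.std(axis=0))   # close to 1 for every feature
```

With gamma set to one and beta set to zero, the output is simply the standardized input; during training both would be updated by the optimizer.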

### Why gamma and beta matter

These parameters are critical: without them, the normalization step would constrain the network to only represent activations with zero mean and unit variance, limiting the model's expressive power. For example, if the [activation function](/wiki/activation_function) is a [sigmoid](/wiki/sigmoid_function), forcing zero mean and unit variance would confine most inputs to the near-linear region of the sigmoid, largely removing the non-linearity.

By learning gamma and beta, the network can apply any per-feature affine transformation to the normalized values, including the one that restores the original activations if that turns out to be optimal. In principle, the network can undo the normalization entirely (by setting $$\gamma = \sqrt{\sigma_B^2 + \epsilon}$$, which is approximately $$\sigma_B$$, and $$\beta = \mu_B$$). In practice, the network learns whatever scale and shift works best for each layer, which is usually something between full normalization and no normalization at all.
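The claim that gamma and beta can undo the normalization is easy to check numerically. The sketch below assumes per-feature statistics on a small random batch and sets gamma to the square root of the batch variance plus epsilon (that is, approximately the batch standard deviation) and beta to the batch mean.

```python
import numpy as np

eps = 1e-5
rng = np.random.default_rng(1)
x = rng.normal(loc=-2.0, scale=0.5, size=(32, 3))   # 32 examples, 3 features

mu = x.mean(axis=0)
var = x.var(axis=0)
x_hat = (x - mu) / np.sqrt(var + eps)   # normalized activations

gamma = np.sqrt(var + eps)   # scale that cancels the division
beta = mu                    # shift that cancels the mean subtraction
y = gamma * x_hat + beta

print(np.allclose(y, x))  # True: the original activations are recovered
```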

### Batch normalization in convolutional neural networks

In CNNs, batch normalization is applied per channel (per feature map) rather than per individual activation. For a convolutional layer producing feature maps of shape (N, C, H, W), where N is the batch size, C is the number of channels, and H and W are spatial dimensions, the mean and variance are computed across the batch dimension and both spatial dimensions for each channel independently. This means each of the C channels has its own scalar mean, scalar variance, and its own learned gamma and beta parameters. The rationale is that all spatial locations within a single feature map should share the same normalization statistics because they are produced by the same [convolutional](/wiki/convolution) filter.
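As a rough illustration of the per-channel computation, the following NumPy sketch reduces over the batch and spatial axes of an (N, C, H, W) array. The shapes are arbitrary example values.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=1.5, scale=3.0, size=(8, 16, 5, 5))   # (N, C, H, W)

# One mean and one variance per channel, pooled over N, H, and W
mu = x.mean(axis=(0, 2, 3), keepdims=True)    # shape (1, C, 1, 1)
var = x.var(axis=(0, 2, 3), keepdims=True)
x_hat = (x - mu) / np.sqrt(var + 1e-5)

# One gamma and one beta per channel, broadcast over N, H, and W
gamma = np.ones((1, 16, 1, 1))
beta = np.zeros((1, 16, 1, 1))
y = gamma * x_hat + beta

print(mu.shape)   # (1, 16, 1, 1)
```

Each channel's statistics are therefore estimated from N x H x W values rather than N, which is one reason BN in convolutional layers tolerates smaller batches than BN in fully connected layers.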

### Placement: before or after the activation function?

The original paper by Ioffe and Szegedy proposed applying batch normalization before the activation function, following the pattern: Convolution, BatchNorm, [ReLU](/wiki/relu).[1] The reasoning was that BN aims to normalize the inputs to the nonlinearity, keeping them in a range where gradients flow effectively. For ReLU activations, centering the distribution around zero before applying ReLU helps ensure that a meaningful proportion of activations are non-zero, which can reduce the "dying ReLU" problem.

However, this placement is not universally agreed upon. Some practitioners and researchers have experimented with placing batch normalization after the activation function, and for ReLU-based networks the empirical differences are often minimal. For bounded activation functions like tanh, placing BN after the activation has been reported to yield better performance on certain benchmarks. In practice, the pre-activation convention (BN before the nonlinearity) remains the most common default in modern frameworks and architectures.

## How does batch normalization behave during training vs. inference?

Batch normalization behaves differently during training and inference, and understanding this distinction is essential for correct implementation. The distinction is also a frequent source of bugs in practice.

### During training

During training, the mean and variance are computed from the current mini-batch. This introduces a form of stochastic noise into the forward pass, because the normalization of each sample depends on the other samples in the same mini-batch, and each mini-batch produces slightly different statistics. This noise acts as a mild regularizer (see below), which can be beneficial, but it also means that the output for a given input depends on which other examples share its batch. The [gradients](/wiki/gradient) flow through the mean and variance computation, so the statistics are part of the computation graph.

Simultaneously, the BN layer maintains exponential moving averages of the batch mean and variance, called the **running mean** and **running variance**. After each mini-batch, these running statistics are updated:

| Parameter | Update rule |
|---|---|
| Running mean | $$\text{running\_mean} = (1 - \text{momentum}) \cdot \text{running\_mean} + \text{momentum} \cdot \text{batch\_mean}$$ |
| Running variance | $$\text{running\_var} = (1 - \text{momentum}) \cdot \text{running\_var} + \text{momentum} \cdot \text{batch\_var}$$ |

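The momentum parameter controls how quickly the running statistics adapt, and its meaning depends on the framework convention. In the update rule above, which PyTorch follows, momentum weights the new batch statistic and defaults to 0.1. Keras uses the opposite convention, in which momentum weights the old running value, with a default of 0.99.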
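The update rule in the table can be sketched as a small helper. This is an illustrative NumPy version using the convention shown above (momentum weights the new batch statistic); the helper name `update_running` and the synthetic data are assumptions made for the example.

```python
import numpy as np

def update_running(running, batch_stat, momentum=0.1):
    """Exponential moving average: momentum is the weight of the new batch statistic."""
    return (1.0 - momentum) * running + momentum * batch_stat

rng = np.random.default_rng(3)
running_mean, running_var = np.zeros(4), np.ones(4)   # common initial values

for _ in range(200):
    batch = rng.normal(loc=5.0, scale=3.0, size=(32, 4))   # true mean 5, true variance 9
    running_mean = update_running(running_mean, batch.mean(axis=0))
    running_var = update_running(running_var, batch.var(axis=0))

print(running_mean)  # approximately 5 for each feature
print(running_var)   # approximately 9 for each feature
```

A larger momentum (in this convention) tracks recent batches more closely but gives noisier estimates; a smaller one is smoother but slower to adapt.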

### During inference

At inference time, it is impractical (and undesirable) to compute statistics from a mini-batch, since predictions should be deterministic and may involve a single sample. Instead, the BN layer uses the accumulated running mean and running variance from training. The normalization at inference becomes a fixed linear transformation:

$$
y = \gamma \frac{x - \text{running\_mean}}{\sqrt{\text{running\_var} + \epsilon}} + \beta
$$

This fixed transformation can be fused with the preceding linear or convolutional layer's weights and biases, eliminating the BN layer entirely at inference and reducing the computational overhead to zero.
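The fusion works because the inference-time BN is just a per-output scale and shift, which can be absorbed into the preceding layer's weights and bias. The sketch below shows this for a fully connected layer in NumPy with randomly chosen parameters; the variable names are illustrative, and a convolution would be handled the same way with one scale per output channel.

```python
import numpy as np

rng = np.random.default_rng(4)
eps = 1e-5

# Preceding linear layer: z = x @ W.T + b, with 5 inputs and 3 outputs
W, b = rng.normal(size=(3, 5)), rng.normal(size=3)

# BN parameters and running statistics for the 3 outputs (random stand-ins)
gamma, beta = rng.normal(size=3), rng.normal(size=3)
running_mean = rng.normal(size=3)
running_var = rng.uniform(0.5, 2.0, size=3)

# Fold BN into the linear layer
scale = gamma / np.sqrt(running_var + eps)
W_fused = W * scale[:, None]                    # scale each output row of W
b_fused = (b - running_mean) * scale + beta

x = rng.normal(size=(10, 5))
z = x @ W.T + b
y_reference = gamma * (z - running_mean) / np.sqrt(running_var + eps) + beta
y_fused = x @ W_fused.T + b_fused

print(np.allclose(y_reference, y_fused))  # True
```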

### Common pitfalls

| Problem | Cause | Solution |
|---|---|---|
| Training and inference results differ significantly | Model not switched to eval mode (batch stats vs. running stats) | Call `model.eval()` before inference in PyTorch |
| Poor accuracy with small batch sizes | Noisy batch statistics from too few samples | Use Group Normalization or increase batch size |
| Running statistics are inaccurate after fine-tuning | Running stats from pre-training do not match new data distribution | Reset or freeze BN statistics when fine-tuning on a new domain |
| NaN or unstable loss values | Extremely small batches or degenerate batches | Check batch composition; consider batch-size-independent normalization |

## Why does batch normalization work?

### The original explanation: internal covariate shift

Ioffe and Szegedy motivated batch normalization as a solution to **internal covariate shift**, defined as the change in the distribution of layer inputs caused by updates to the parameters of preceding layers during training. They argued that this shifting distribution forced each layer to continuously adapt to new input statistics, slowing down training and requiring conservative [learning rate](/wiki/learning_rate) choices and careful parameter initialization. In their words, this "slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift."[1] By normalizing layer inputs, BN was claimed to stabilize these distributions and allow the network to train more effectively.[1]

### The revised understanding: loss landscape smoothing

In 2018, Santurkar, Tsipras, Ilyas, and Madry published a highly influential paper titled "How Does Batch Normalization Help Optimization?" at [NeurIPS](/wiki/neurips) that challenged the internal covariate shift narrative.[2] They concluded bluntly that "such distributional stability of layer inputs has little to do with the success of BatchNorm."[2] Through careful experiments, they demonstrated:

- Adding batch normalization did not consistently reduce internal covariate shift as measured by distribution statistics. Networks with batch normalization can exhibit just as much distributional change in layer inputs as networks without it.
- Networks with batch normalization plus artificially injected covariate shift (random noise added after normalization) still trained better than networks without batch normalization.
- The actual benefit appeared to be that batch normalization makes the [loss function](/wiki/loss_function) landscape significantly smoother.

In their own summary, the key effect is that "it makes the optimization landscape significantly smoother. This smoothness induces a more predictive and stable behavior of the gradients."[2] Specifically, BN improves the Lipschitzness of both the loss function and its gradients (also known as beta-smoothness).[2] A smoother loss landscape means that gradients are more predictive of the actual direction of improvement, which has several practical consequences:

- The optimizer can take larger steps (higher learning rates) without overshooting
- Gradient directions change more slowly, making optimization more stable
- The training process is less likely to get trapped in sharp, poorly generalizing minima

| Explanation | Proposed by | Year | Status |
|---|---|---|---|
| Reduces internal covariate shift | Ioffe and Szegedy | 2015 | Disputed |
| Smooths the loss landscape | Santurkar et al. | 2018 | Widely accepted |
| Acts as a regularizer (noise from batch statistics) | Ioffe and Szegedy (secondary claim) | 2015 | Generally accepted |

This loss landscape smoothing explanation is now widely accepted as the more accurate account of why batch normalization helps training, though the exact mechanisms by which batch normalization achieves this smoothing are still being studied.

### Implicit regularization

Because the normalization statistics are computed over mini-batches, each sample's normalized value depends on the other samples in the batch. This introduces noise into the training process in a manner somewhat analogous to [dropout](/wiki/dropout_regularization). This stochastic regularization effect can reduce [overfitting](/wiki/overfitting)[12], and Ioffe and Szegedy noted in their original paper that batch normalization sometimes eliminates the need for dropout entirely.[1] The regularization effect is stronger with smaller batch sizes (since the batch statistics are noisier) and weaker with larger batch sizes.[12]

### Reduced sensitivity to initialization and learning rate

Batch normalization makes networks significantly more robust to the choice of weight initialization. Without BN, poor initialization can cause activations to explode or vanish as they propagate through many layers, leading to very slow training or complete failure. With BN, the normalization step re-centers and re-scales activations at every layer, preventing such cascading effects. While proper initialization (such as He initialization for ReLU networks) is still beneficial and can speed up convergence[9], BN provides a safety net that prevents catastrophic initialization failures.

Similarly, BN allows practitioners to use much higher learning rates than would otherwise be stable. The smoother loss landscape allows the use of learning rates that would cause divergence without normalization.[10] Ioffe and Szegedy reported that their batch-normalized Inception network could be trained with learning rates an order of magnitude higher than the non-normalized version while still converging.[1]

## Interaction with dropout

Combining batch normalization and dropout in the same network requires care, as the two techniques can conflict. Li et al. (2019) identified a "variance shift" problem that arises when dropout is placed before a BN layer.[7] During training, dropout randomly zeros out each activation with probability (1 - p), where p is the probability of keeping a unit, which changes the variance of the layer's outputs. BN then accumulates running statistics based on these dropout-modified activations. At inference time, however, dropout is disabled and all activations are present (scaled by p in the original dropout formulation), producing a different variance than what BN's running statistics expect. This mismatch between training-time and inference-time statistics can degrade performance.[7]
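The variance shift can be seen in a small simulation. The sketch below assumes zero-mean, unit-variance activations and treats p as the keep probability, with units dropped during training and all units scaled by p at inference, as described above. It is a toy illustration, not a reproduction of the experiments in Li et al.

```python
import numpy as np

rng = np.random.default_rng(5)
p = 0.5                                   # keep probability (assumed value for this demo)
x = rng.normal(size=1_000_000)            # zero-mean, unit-variance activations

mask = rng.random(x.shape) < p            # keep each unit with probability p
train_out = x * mask                      # training: dropped units are zeroed
infer_out = x * p                         # inference: all units kept, scaled by p

train_var = train_out.var()
infer_var = infer_out.var()
print(train_var)   # about p = 0.5
print(infer_var)   # about p**2 = 0.25
```

A BN layer placed after this dropout would accumulate a running variance near 0.5 during training but see inputs with variance near 0.25 at inference. With inverted dropout (rescaling by 1/p during training instead), the absolute values differ but the ratio between the two variances is the same factor of 1/p.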

Practical recommendations for combining the two techniques include:

- Place dropout after BN layers, not before them
- Use the ordering: Linear/Conv, BN, Activation, Dropout
- Consider using only batch normalization without dropout, since BN already provides [regularization](/wiki/regularization)
- If using both, tune the dropout rate carefully and validate that the combination improves performance on a held-out set

## Interaction with weight initialization

Batch normalization significantly reduces, but does not eliminate, the importance of weight initialization. Before BN became widespread, techniques like Xavier initialization and He initialization were critical for training deep networks, as they aimed to maintain stable activation variances across layers. With BN, the normalization step automatically corrects activation scales at every layer, providing resilience against poor initialization. In practice, He initialization[9] combined with BN and skip connections[8] is the standard recipe for training very deep networks (50+ layers). Even with BN, very poor initialization choices can slow convergence, so using established initialization methods remains a best practice.

## Limitations

### Small batch sizes

Batch normalization's reliance on mini-batch statistics creates a fundamental problem when batch sizes are small. The sample mean and variance become noisy estimates of the true population statistics, with estimation errors scaling as $$O(1/\sqrt{m})$$, where m is the batch size. With a batch size of 32, these estimates are reasonably stable. But as the batch size shrinks:

| Batch size | Relative estimation error | Typical impact on performance |
|---|---|---|
| 32+ | Low | Batch normalization works well |
| 8-16 | Moderate | Slight degradation; still usable |
| 2-4 | High | Noticeable performance drop |
| 1 | Undefined (zero variance for single sample) | Batch normalization breaks entirely |

With a batch size of 1 and per-activation statistics, the mini-batch mean is just the single example's value and the variance is zero. The numerator $$x - \mu_B$$ is then also zero, so every normalized value collapses to zero and the layer outputs only beta, discarding all information about the input. In convolutional layers the statistics are still pooled over spatial positions, so the layer does not collapse in the same way, but it no longer normalizes across examples. Wu and He (2018) reported that the top-1 ImageNet error of ResNet-50 with BN rose from 23.6% at a batch size of 32 to 34.7% at a batch size of 2, a dramatic degradation. Their abstract states the resulting gap to group normalization: "GN has 10.6% lower error than its BN counterpart when using a batch size of 2."[4]
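Both effects can be illustrated with a short simulation: the spread of the mini-batch mean grows roughly as one over the square root of the batch size, and with a single example the per-feature normalization returns zeros regardless of the input. The data here are synthetic standard-normal values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)

# Spread of the mini-batch mean across many batches, for several batch sizes
spreads = {}
for m in (32, 8, 2):
    batches = rng.normal(size=(10_000, m))        # 10,000 mini-batches of size m
    spreads[m] = batches.mean(axis=1).std()
    print(m, round(spreads[m], 3), round(1 / np.sqrt(m), 3))   # measured vs. 1/sqrt(m)

# Batch size 1: x - mu is exactly zero, so x_hat carries no information about x
x = np.array([[3.7, -1.2, 42.0]])                 # one example, three features
x_hat = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + 1e-5)
print(x_hat)   # [[0. 0. 0.]]
```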

### When small batches are unavoidable

Several practical scenarios force the use of small batch sizes:

- **[Object detection](/wiki/object_detection) and segmentation**: These tasks require high-resolution inputs (e.g., 800x1200 pixels), which consume so much GPU memory that only 1-2 images fit per GPU.
- **3D medical imaging**: Volumetric data (CT scans, MRI) can require gigabytes per sample, limiting batch sizes to 1-2.
- **Video models**: Processing multiple frames per example multiplies memory requirements.
- **Large models on limited hardware**: Training or fine-tuning large models on consumer GPUs often forces batch sizes below 8.

In these settings, batch normalization's noisy statistics cause training instability, poor convergence, and degraded final performance. This has driven the adoption of alternative normalization methods.

### Synchronized batch normalization

One partial solution for multi-GPU setups is **synchronized batch normalization** (SyncBN), which computes batch statistics across all GPUs rather than independently on each device. If each of 8 GPUs processes 2 examples, SyncBN computes statistics over an effective batch of 16, reducing the noise problem.

However, SyncBN requires an all-reduce communication step at every batch normalization layer during both forward and backward passes, adding significant overhead. It also does not help when the total batch across all devices is still small.
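One way to see what is being communicated: each device only needs to contribute a count, a sum, and a sum of squares per feature, and the reduced totals give the same mean and variance as a single large batch. The NumPy sketch below simulates 8 devices with 2 examples each in one process. It is a conceptual illustration and not how any particular framework implements SyncBN.

```python
import numpy as np

rng = np.random.default_rng(7)
# 8 simulated devices, each holding 2 examples with 4 features
shards = [rng.normal(loc=1.0, scale=2.0, size=(2, 4)) for _ in range(8)]

# Per-device partial sums; an all-reduce would add these across devices
count = sum(s.shape[0] for s in shards)
total = sum(s.sum(axis=0) for s in shards)
total_sq = sum((s ** 2).sum(axis=0) for s in shards)

global_mean = total / count
global_var = total_sq / count - global_mean ** 2   # E[x^2] - (E[x])^2, biased variance

# Reference: statistics of the combined batch of 16 examples
full = np.concatenate(shards, axis=0)
print(np.allclose(global_mean, full.mean(axis=0)))  # True
print(np.allclose(global_var, full.var(axis=0)))    # True
```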

### Sequence models and variable-length inputs

Batch normalization is poorly suited for [transformer](/wiki/transformer) architectures and recurrent neural networks that process variable-length sequences. In these settings, different positions in the sequence may carry different semantic meaning, and computing statistics across the batch at each position mixes incomparable information. Additionally, during autoregressive inference, tokens are processed one at a time, making batch statistics meaningless. These are the main reasons that transformers almost universally use layer normalization or its variants instead of batch normalization.

### Online and streaming settings

In online learning scenarios where data arrives one sample at a time, or in streaming settings with non-stationary data distributions, batch normalization cannot compute meaningful batch statistics. The running statistics accumulated during training may also become stale if the data distribution shifts over time.

## Alternatives and comparison of [normalization](/wiki/normalization) techniques

The limitations of batch normalization have motivated the development of several alternative normalization methods. Each computes statistics over a different set of dimensions.

| Technique | Normalizes over | Batch-size dependent? | Primary use case | Key paper |
|---|---|---|---|---|
| Batch Normalization | Batch + spatial dims, per channel | Yes | CNNs with large batch sizes | Ioffe and Szegedy, 2015 |
| Layer Normalization | All features in a single sample | No | Transformers, RNNs | Ba, Kiros, and Hinton, 2016 |
| Instance Normalization | Spatial dims per channel, per sample | No | Style transfer, image generation | Ulyanov, Vedaldi, and Lempitsky, 2016 |
| Group Normalization | Groups of channels per sample | No | CNNs with small batch sizes, detection | Wu and He, 2018 |
| RMSNorm | All features (RMS only, no mean subtraction) | No | Large language models | Zhang and Sennrich, 2019 |

### Layer normalization

[Layer normalization](/wiki/layer_normalization) (Ba, Kiros, and Hinton, 2016) normalizes across all features within a single training example rather than across the batch.[3] For a hidden vector of dimension d, layer normalization computes the mean and variance over the d features of that single example. Because it does not depend on batch size, it works with a batch size of 1 and does not require running statistics. Layer normalization became the standard normalization technique for transformers and is used in models like BERT, GPT, and their successors.[3]

Key differences from batch normalization:

| Property | Batch Normalization | Layer Normalization |
|---|---|---|
| Normalizes across | Batch dimension | Feature dimension |
| Depends on batch size | Yes | No |
| Works for batch size 1 | Poorly | Yes |
| Suited for sequence models | No | Yes |
| Training/inference difference | Yes (running stats) | No |
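A minimal NumPy sketch of layer normalization makes the batch independence concrete: the statistics are taken over the last (feature) axis, so an example's output is the same whether it is processed alone or alongside others. The function below is illustrative and not a framework API.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each example over its own features (last axis)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(8)
d = 16
gamma, beta = np.ones(d), np.zeros(d)

batch = rng.normal(size=(4, d))   # four examples
single = batch[:1]                # the first example as a batch of size 1

out_in_batch = layer_norm(batch, gamma, beta)[0]
out_alone = layer_norm(single, gamma, beta)[0]
print(np.allclose(out_in_batch, out_alone))  # True
```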

### Instance normalization

Instance normalization (Ulyanov, Vedaldi, and Lempitsky, 2016) normalizes across the spatial dimensions (H, W) of each feature map for each sample independently. It was introduced specifically for neural style transfer, where removing instance-specific contrast information from the content image improves stylization quality.[6] It is equivalent to layer normalization applied to each channel independently and is most commonly used in generative models for image synthesis and style transfer applications.

### Group normalization

[Group Normalization](/wiki/group_normalization) (Wu and He, 2018) divides channels into groups and normalizes within each group independently for each example.[4] With G groups and C channels, each group contains C/G channels. Statistics are computed over the spatial dimensions (H, W) and the C/G channels within each group, for each example independently. It serves as a middle ground between layer normalization (which normalizes over all channels) and instance normalization (which normalizes each channel separately).
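The relationship between the three per-example methods can be checked directly. The sketch below implements the normalization step of group normalization in NumPy (the learnable scale and shift are omitted for brevity) and compares its two extreme settings against instance normalization and layer normalization over (C, H, W).

```python
import numpy as np

def group_norm(x, num_groups, eps=1e-5):
    """Normalization step of group norm for an (N, C, H, W) array, without gamma/beta."""
    n, c, h, w = x.shape
    assert c % num_groups == 0, "C must be divisible by the number of groups"
    g = x.reshape(n, num_groups, c // num_groups, h, w)
    mu = g.mean(axis=(2, 3, 4), keepdims=True)     # per example, per group
    var = g.var(axis=(2, 3, 4), keepdims=True)
    return ((g - mu) / np.sqrt(var + eps)).reshape(n, c, h, w)

rng = np.random.default_rng(9)
x = rng.normal(size=(2, 8, 4, 4))
eps = 1e-5

# Instance norm: statistics over (H, W) for each example and channel
inst = (x - x.mean(axis=(2, 3), keepdims=True)) / np.sqrt(x.var(axis=(2, 3), keepdims=True) + eps)
# Layer norm over (C, H, W) for each example
layer = (x - x.mean(axis=(1, 2, 3), keepdims=True)) / np.sqrt(x.var(axis=(1, 2, 3), keepdims=True) + eps)

print(np.allclose(group_norm(x, num_groups=8), inst))   # True: G = C is instance norm
print(np.allclose(group_norm(x, num_groups=1), layer))  # True: G = 1 is layer norm
```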

The critical advantage is that group normalization's computation depends only on a single example, not on the batch. As Wu and He put it, "GN's computation is independent of batch sizes, and its accuracy is stable in a wide range of batch sizes."[4] This means:

- It works identically with batch size 1.
- There is no difference between training and inference behavior (no running statistics needed).
- It is fully compatible with [gradient accumulation](/wiki/gradient_accumulation) and variable batch sizes.

Group normalization became a common choice in object detection and segmentation models, where high-resolution inputs necessitate small batches, and frameworks such as Detectron2 and MMDetection support it as a configurable normalization layer. Wu and He used 32 groups by default.[4] In their ImageNet experiments with ResNet-50, BN's top-1 error rose by about 11 percentage points when the batch size dropped from 32 to 2, while GN's error stayed essentially unchanged at about 24.1%; at a batch size of 2, GN's error was 10.6 percentage points lower than BN's.[4]

| Normalization method | Performance at batch size 32 | Performance at batch size 2 | Batch size dependence |
|---|---|---|---|
| Batch Normalization | Best | Significantly degraded (about 11 points higher top-1 error) | Strong |
| Group Normalization (32 groups) | Slightly below BN (about 0.5 points) | Essentially unchanged from batch size 32 | None |
| Layer Normalization | Below BN | Stable | None |
| Instance Normalization | Below BN | Stable | None |

### RMSNorm

RMSNorm (Root Mean Square Layer Normalization) was proposed by Zhang and Sennrich (2019).[5] RMSNorm simplifies layer normalization by removing the mean-centering step and normalizing only by the root mean square of the activations:

$$
\mathrm{RMSNorm}(x) = \gamma \frac{x}{\sqrt{\frac{1}{d} \sum_{i=1}^{d} x_i^2 + \epsilon}}
$$

The authors found that the mean subtraction in standard layer normalization is not necessary for the technique to work.[5] Removing it reduces computational overhead because:

- There is no need to compute the mean.
- There is no subtraction operation.
- The learned shift parameter (beta) is removed, reducing the number of learnable parameters.

By eliminating the mean computation, RMSNorm reduces the running time by 7% to 64% on different models, while achieving comparable training performance.[5] RMSNorm has become the dominant normalization choice in modern large language models, including LLaMA, Mistral, and Gemma.
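The formula above translates almost directly into code. The following NumPy sketch is illustrative only; the epsilon value and the function name are assumptions, and production implementations differ in details such as numeric precision handling.

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    """Scale each vector by the root mean square of its features (last axis)."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return gamma * x / rms   # no mean subtraction and no beta

rng = np.random.default_rng(10)
x = rng.normal(loc=5.0, size=(3, 8))     # deliberately not zero-mean
y = rms_norm(x, gamma=np.ones(8))

print(np.sqrt(np.mean(y ** 2, axis=-1)))  # each close to 1
print(y.mean(axis=-1))                    # not zero: the mean is left in place
```

Unlike layer normalization, the output keeps whatever mean the input had (rescaled), which is the simplification the authors found to be harmless in practice.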

## Why do transformers use Layer Normalization, not Batch Normalization?

Batch normalization was designed for convolutional neural networks processing images in fixed-size batches. Several properties of transformer models make batch normalization a poor fit:

1. **Variable sequence lengths:** Text sequences have varying lengths. Batch normalization would compute statistics across different sequence positions, mixing statistics from the start of one sentence with the middle of another.
2. **Sequence position semantics:** In a CNN, the same pixel position across images has roughly comparable meaning. In a transformer, position 5 in one sentence and position 5 in another sentence may represent completely different linguistic functions.
3. **Small effective batch sizes:** When training large transformers, memory constraints often force small per-device batch sizes. Batch normalization performs poorly with small batches because the batch statistics become noisy.
4. **Autoregressive inference:** During text generation, transformers process one token at a time. There is no batch to compute statistics over.

Research has confirmed this empirically. Shen et al. (2020) showed that standard batch normalization leads to "significant performance degradation" in transformers for NLP tasks.[13]

## The normalization landscape in modern LLMs

The evolution of normalization techniques in [large language models](/wiki/large_language_model) tells a clear story: batch normalization was never suited for language modeling, and the field has progressively simplified normalization toward more efficient variants. RMSNorm has become the normalization method of choice in most state-of-the-art LLMs developed since 2023. The [LLaMA](/wiki/llama) family of models (Meta, 2023) adopted RMSNorm, and this choice was followed by many subsequent open-weight models.[15]

| Model | Normalization method |
|---|---|
| BERT (2018) | Layer Normalization |
| GPT-2 (2019) | Layer Normalization |
| GPT-3 (2020) | Layer Normalization |
| LLaMA (2023) | RMSNorm |
| LLaMA 2 (2023) | RMSNorm |
| LLaMA 3 (2024) | RMSNorm |
| Mistral (2023) | RMSNorm |
| Gemma (2024) | RMSNorm |

The shift to RMSNorm is driven by practical benefits: comparable training stability with lower computational overhead per layer. When training models with billions of parameters across trillions of tokens, even small efficiency improvements per operation compound into meaningful savings in time and energy.

### Pre-norm vs. post-norm placement

Beyond the choice of normalization function, **where** normalization is placed within the transformer block has a significant impact on training stability.

The original transformer (Vaswani et al., 2017) used **post-norm** placement:

$$
x = \mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))
$$

Most modern LLMs use **pre-norm** placement:

$$
x = x + \mathrm{Sublayer}(\mathrm{LayerNorm}(x))
$$

Xiong et al. (2020) showed that pre-norm placement makes the gradient norms more predictable across layers, which stabilizes training and allows the use of larger learning rates.[11] Pre-norm transformers can often be trained without warmup, while post-norm transformers typically require careful warmup to avoid divergence.

The tradeoff is that pre-norm placement can lead to slightly worse final performance compared to post-norm in some settings, because the residual stream is not normalized. However, the training stability benefits have made pre-norm the dominant choice in practice.
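The two placements differ only in where the normalization sits relative to the residual addition. The NumPy sketch below stacks twelve toy blocks of each kind; the `sublayer` function is a hypothetical stand-in for attention or a feed-forward network, and the layer normalization omits learnable parameters.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Layer normalization over the last axis, without learnable parameters."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_norm_block(x, sublayer):
    return layer_norm(x + sublayer(x))      # normalize after the residual addition

def pre_norm_block(x, sublayer):
    return x + sublayer(layer_norm(x))      # normalize only the sublayer's input

rng = np.random.default_rng(11)
W = rng.normal(scale=0.1, size=(16, 16))

def sublayer(h):
    # Hypothetical stand-in for an attention or feed-forward sublayer
    return np.tanh(h @ W)

x_post = x_pre = rng.normal(size=(4, 16))
for _ in range(12):
    x_post = post_norm_block(x_post, sublayer)
    x_pre = pre_norm_block(x_pre, sublayer)

print(x_post.var(axis=-1))  # about 1: the output is re-normalized at every block
print(x_pre.var(axis=-1))   # typically drifts away from 1: the residual stream is never normalized
```

In the pre-norm version the identity path from input to output is untouched by normalization, which is the property usually credited for its better-behaved gradients.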

### DeepNorm

Recognizing the tradeoffs between pre-norm and post-norm, Microsoft introduced **DeepNorm** (Wang et al., 2022) for training very deep transformers (up to 1,000 layers).[14] DeepNorm modifies the residual connection with a scaling factor:

$$
x = \mathrm{LayerNorm}(\alpha x + \mathrm{Sublayer}(x))
$$

where $$\alpha$$ is a constant that depends on the number of layers. DeepNorm achieves the training stability of pre-norm with the performance benefits of post-norm, enabling the training of transformers that are significantly deeper than previously possible.
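As a rough sketch of the residual scaling only, the block below adds an `alpha` argument to a post-norm block. The value of alpha used here is arbitrary and chosen for illustration, not taken from the paper. The paper also pairs the scaled residual with a rescaled weight initialization, which this sketch omits.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Layer normalization over the last axis, without learnable parameters."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def deepnorm_block(x, sublayer, alpha):
    """Post-norm block with the residual input scaled by alpha."""
    return layer_norm(alpha * x + sublayer(x))

rng = np.random.default_rng(12)
W = rng.normal(scale=0.1, size=(16, 16))

def sublayer(h):
    # Hypothetical stand-in for an attention or feed-forward sublayer
    return np.tanh(h @ W)

x = rng.normal(size=(4, 16))
plain_post_norm = layer_norm(x + sublayer(x))

# alpha = 1 reduces to the ordinary post-norm block
print(np.allclose(deepnorm_block(x, sublayer, alpha=1.0), plain_post_norm))  # True

# alpha > 1 gives the residual path more weight relative to the sublayer output
out = deepnorm_block(x, sublayer, alpha=2.0)   # 2.0 is an arbitrary illustrative value
```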

### Timeline of normalization in language models

| Year | Model | Normalization | Placement | Notes |
|---|---|---|---|---|
| 2017 | Original Transformer | Layer Normalization | Post-norm | First transformer architecture |
| 2018 | [BERT](/wiki/bert) | Layer Normalization | Post-norm | Followed original transformer |
| 2019 | [GPT-2](/wiki/gpt) | Layer Normalization | Pre-norm | Shifted to pre-norm for stability |
| 2020 | GPT-3 | Layer Normalization | Pre-norm | Continued pre-norm convention |
| 2022 | DeepNorm | Modified Layer Norm | Hybrid | For very deep transformers |
| 2023 | [LLaMA](/wiki/llama) | RMSNorm | Pre-norm | Shifted to RMSNorm for efficiency |
| 2023 | [Mistral](/wiki/mistral) | RMSNorm | Pre-norm | Followed LLaMA convention |
| 2024 | LLaMA 3 | RMSNorm | Pre-norm | RMSNorm now standard |
| 2024 | [Gemma](/wiki/gemma) | RMSNorm | Pre-norm | Google also adopted RMSNorm |
| 2024 | [DeepSeek-V2](/wiki/deepseek) | RMSNorm | Pre-norm | MoE models also use RMSNorm |

The convergence of virtually all modern LLMs on RMSNorm with pre-norm placement reflects a consensus that this combination provides the best balance of training stability, computational efficiency, and model quality. Batch normalization plays no role in this landscape, having been superseded by architectures where per-example normalization is more natural and more efficient.

## Implementation

### PyTorch

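PyTorch provides batch normalization through `torch.nn.BatchNorm1d`, `torch.nn.BatchNorm2d`, and `torch.nn.BatchNorm3d`, which expect inputs of shape (N, C) or (N, C, L), (N, C, H, W), and (N, C, D, H, W) respectively. In each case, statistics are computed per channel C over all remaining dimensions.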

```python
import torch
import torch.nn as nn

# BatchNorm for 2D convolutions (input: N, C, H, W)
bn = nn.BatchNorm2d(
    num_features=64,     # Number of channels
    eps=1e-5,            # Epsilon for numerical stability
    momentum=0.1,        # Momentum for running stats
    affine=True,         # Learn gamma and beta
    track_running_stats=True  # Track running mean/var
)

# Example usage in a CNN block
class ConvBlock(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
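        # bias=False: a conv bias is redundant because BN subtracts the mean and adds its own beta
        self.conv = nn.Conv2d(in_channels, out_channels, 3, padding=1, bias=False)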
        self.bn = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn(self.conv(x)))

# Important: switch to eval mode for inference
model = ConvBlock(3, 64)
model.eval()  # Uses running statistics instead of batch statistics
```

Key parameters:
- `num_features`: matches the number of channels (C dimension) of the input
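- `momentum`: weight given to the current batch statistic when updating the running statistics (default 0.1), i.e. `running = (1 - momentum) * running + momentum * batch_stat`; this is the reverse of the convention used for optimizer momentum
- `affine`: when True (default), the layer learns gamma and beta; when False, normalization has no learnable parameters
- `track_running_stats`: when True (default), maintains running mean and variance for inference; when False, batch statistics are used in both training and evaluation mode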
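To make the roles of `momentum` and the training/evaluation switch concrete, the following plain-Python sketch mimics the behavior for a single feature. `TinyBatchNorm1d` is a hypothetical illustration class, not part of PyTorch; it omits gamma and beta and assumes a batch of at least two values.

```python
class TinyBatchNorm1d:
    """Single-feature batch norm illustrating train vs. eval behavior (no gamma/beta)."""

    def __init__(self, eps=1e-5, momentum=0.1):
        self.eps = eps
        self.momentum = momentum
        self.running_mean = 0.0   # initial running statistics
        self.running_var = 1.0
        self.training = True

    def __call__(self, batch):
        if self.training:
            n = len(batch)
            mean = sum(batch) / n
            # Biased variance (divide by n) is used to normalize the batch itself.
            var = sum((v - mean) ** 2 for v in batch) / n
            # Unbiased variance (divide by n - 1) feeds the running estimate.
            unbiased_var = var * n / (n - 1)
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * unbiased_var
        else:
            # Evaluation mode: use the accumulated running statistics.
            mean, var = self.running_mean, self.running_var
        return [(v - mean) / (var + self.eps) ** 0.5 for v in batch]

bn = TinyBatchNorm1d()
train_out = bn([2.0, 4.0, 6.0, 8.0])   # normalized with batch statistics
bn.training = False
eval_out = bn([2.0, 4.0, 6.0, 8.0])    # normalized with running statistics
```

After a single training step the running mean has moved only 10% of the way from 0 toward the batch mean of 5, so the evaluation output differs noticeably from the training output. This is why running statistics need many training steps before evaluation results become reliable. When porting between frameworks, note that Keras uses the reverse convention (`moving = momentum * moving + (1 - momentum) * batch_stat`, default 0.99), so PyTorch's `momentum=0.1` corresponds to a Keras `momentum` of 0.9.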

### TensorFlow / Keras

In TensorFlow and Keras, batch normalization is available via `tf.keras.layers.BatchNormalization`.

```python
import tensorflow as tf

# Keras functional API
inputs = tf.keras.Input(shape=(32, 32, 3))
x = tf.keras.layers.Conv2D(64, 3, padding='same')(inputs)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.ReLU()(x)

# The 'training' flag is handled automatically in model.fit()
# For manual control:
# x = tf.keras.layers.BatchNormalization()(x, training=True)
```

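Keras automatically handles the training/inference mode switch when using `model.fit()` and `model.predict()`. In a custom training loop, the `training` argument must be passed explicitly: `training=True` when calling the model inside a training step, and `training=False` for evaluation and inference.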

## When was batch normalization introduced?

Batch normalization was introduced in February 2015, when Sergey Ioffe and Christian Szegedy of Google posted "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift" to arXiv (arXiv:1502.03167); the paper was published later that year at the 32nd International Conference on Machine Learning (ICML 2015).[1] It quickly became one of the most-cited papers in deep learning. A revised theoretical understanding arrived in 2018 with Santurkar et al. at NeurIPS.[2] The main alternatives to BN appeared within a few years of the original paper: Layer Normalization (2016),[3] Instance Normalization (2016),[6] Group Normalization (2018),[4] and RMSNorm (2019).[5]

## Impact on deep learning

Batch normalization had a transformative effect on the practice of training deep neural networks when it was introduced. Before BN, training very deep networks was an arduous process that required meticulous hyperparameter tuning, careful initialization schemes, and conservative learning rates. The original paper demonstrated that a batch-normalized version of the Inception network matched the accuracy of the original with 14 times fewer training steps, and an ensemble of BN-Inception models achieved 4.9% top-5 error on the ImageNet validation set, surpassing the state of the art at the time.[1]

The technique's success in CNNs for computer vision helped establish the modern deep learning training recipe: BN (or a normalization variant), residual connections, and adaptive optimizers like Adam. Nearly every ImageNet competition winner and major vision architecture from 2015 onward incorporated batch normalization. While transformers and large language models have moved toward layer normalization and RMSNorm, batch normalization remains the default choice for convolutional architectures in computer vision, and its conceptual legacy, normalizing intermediate representations to stabilize training, underlies all modern normalization techniques.

## ELI5: Batch normalization in simple terms

Imagine a classroom where each student takes a different version of a math test. Some tests have questions where all the numbers are between 1 and 10, while other tests have numbers between 1,000 and 1,000,000. The students with the huge numbers are going to struggle more, not because the math is harder, but because the scale makes everything more confusing.

Batch normalization is like a teacher who collects everyone's test, adjusts all the numbers to be on the same comfortable scale, and then hands them back. Now every student is working with similar-sized numbers, and they can focus on actually learning the math instead of wrestling with wildly different scales. The teacher also lets each student shift and stretch the numbers a little if that helps them learn better (those are the gamma and beta parameters). At the end of the school year (inference time), the teacher uses the average adjustments from the whole year instead of recalculating for each new test.

## References

1. Ioffe, S. and Szegedy, C. (2015). "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift." Proceedings of the 32nd International Conference on Machine Learning (ICML). [arXiv:1502.03167](https://arxiv.org/abs/1502.03167)
2. Santurkar, S., Tsipras, D., Ilyas, A., and Madry, A. (2018). "How Does Batch Normalization Help Optimization?" Advances in Neural Information Processing Systems 31 (NeurIPS). [arXiv:1805.11604](https://arxiv.org/abs/1805.11604)
3. Ba, J. L., Kiros, J. R., and Hinton, G. E. (2016). "Layer Normalization." [arXiv:1607.06450](https://arxiv.org/abs/1607.06450)
4. Wu, Y. and He, K. (2018). "Group Normalization." Proceedings of the European Conference on Computer Vision (ECCV). [arXiv:1803.08494](https://arxiv.org/abs/1803.08494)
5. Zhang, B. and Sennrich, R. (2019). "Root Mean Square Layer Normalization." Advances in Neural Information Processing Systems 32 (NeurIPS). [arXiv:1910.07467](https://arxiv.org/abs/1910.07467)
6. Ulyanov, D., Vedaldi, A., and Lempitsky, V. (2016). "Instance Normalization: The Missing Ingredient for Fast Stylization." [arXiv:1607.08022](https://arxiv.org/abs/1607.08022)
7. Li, X., Chen, S., Hu, X., and Yang, J. (2019). "Understanding the Disharmony between Dropout and Batch Normalization by Variance Shift." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
8. He, K., Zhang, X., Ren, S., and Sun, J. (2016). "Deep Residual Learning for Image Recognition." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). [arXiv:1512.03385](https://arxiv.org/abs/1512.03385)
9. He, K., Zhang, X., Ren, S., and Sun, J. (2015). "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification." Proceedings of the IEEE International Conference on Computer Vision (ICCV). [arXiv:1502.01852](https://arxiv.org/abs/1502.01852)
10. Bjorck, N., Gomes, C. P., Selman, B., and Weinberger, K. Q. (2018). "Understanding Batch Normalization." Advances in Neural Information Processing Systems 31 (NeurIPS). [arXiv:1806.02375](https://arxiv.org/abs/1806.02375)
11. Xiong, R., Yang, Y., He, D., Zheng, K., Zheng, S., Xing, C., Zhang, H., Lan, Y., Wang, L., and Liu, T. (2020). "On Layer Normalization in the Transformer Architecture." Proceedings of the 37th International Conference on Machine Learning (ICML). [arXiv:2002.04745](https://arxiv.org/abs/2002.04745)
12. Luo, P., Wang, X., Shao, W., and Peng, Z. (2019). "Towards Understanding Regularization in Batch Normalization." Proceedings of the International Conference on Learning Representations (ICLR). [arXiv:1809.00846](https://arxiv.org/abs/1809.00846)
13. Shen, S., et al. (2020). "PowerNorm: Rethinking Batch Normalization in Transformers." Proceedings of the 37th International Conference on Machine Learning (ICML). [PDF](http://proceedings.mlr.press/v119/shen20e/shen20e.pdf)
14. Wang, H., Ma, S., Dong, L., Huang, S., Zhang, D., and Wei, F. (2022). "DeepNet: Scaling Transformers to 1,000 Layers." [arXiv:2203.00555](https://arxiv.org/abs/2203.00555)
15. Touvron, H., et al. (2023). "LLaMA: Open and Efficient Foundation Language Models." [arXiv:2302.13971](https://arxiv.org/abs/2302.13971)