Batch normalization (often abbreviated BatchNorm or BN) is a technique for improving the speed, stability, and performance of deep neural network training. Introduced by Sergey Ioffe and Christian Szegedy in their 2015 paper "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift," the method normalizes the inputs to each layer by re-centering and re-scaling them using statistics computed from the current mini-batch, then applies learnable scale and shift parameters that let the network recover any transformation it needs.
Batch normalization became one of the most widely adopted techniques in deep learning after its introduction. It was a key component of the Inception v2 architecture that achieved state-of-the-art results on the ImageNet benchmark, matching the previous best accuracy with 14 times fewer training steps. Nearly every major convolutional neural network architecture developed after 2015, including ResNet, DenseNet, Inception v2/v3, and EfficientNet, uses batch normalization.
Batch normalization operates on the activations (or pre-activations) of a layer during training by standardizing them to have zero mean and unit variance across the mini-batch. It then applies a learned linear transformation to restore representational capacity. The algorithm operates on each feature independently, normalizing across the examples in a mini-batch, and consists of four steps applied to each mini-batch of size m.
Given a mini-batch of values B = {x_1, x_2, ..., x_m} for a particular activation:
| Step | Operation | Formula |
|---|---|---|
| 1. Compute mini-batch mean | Average over the batch | $\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i$ |
| 2. Compute mini-batch variance | Variance over the batch | $\sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2$ |
| 3. Normalize | Subtract mean, divide by std | $\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$ |
| 4. Scale and shift | Apply learnable parameters | $y_i = \gamma \hat{x}_i + \beta$ |
The small constant epsilon (typically 1e-5) is added for numerical stability to prevent division by zero. The parameters gamma (scale) and beta (shift) are learnable parameters that are updated during training via backpropagation and gradient descent, just like the layer's weights.
These parameters are critical: without them, the normalization step would constrain the network to only represent activations with zero mean and unit variance, limiting the model's expressive power. For example, if the activation function is a sigmoid, forcing zero mean and unit variance would confine the inputs to the near-linear region of the sigmoid, eliminating the non-linearity.
By learning gamma and beta, the network can recover any linear transformation of the normalized values, including the identity transformation if that turns out to be optimal. In principle, the network can undo the normalization entirely (by setting gamma = sigma_B and beta = mu_B). In practice, the network learns whatever scale and shift works best for each layer, which is usually something between full normalization and no normalization at all.
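The four steps above translate directly into code. The following is a minimal sketch for a mini-batch of fully connected activations of shape (m, d); the function name batchnorm_forward_train is purely illustrative, and in practice PyTorch's nn.BatchNorm1d performs the same computation.

```python
import torch

def batchnorm_forward_train(x, gamma, beta, eps=1e-5):
    """Training-time batch norm for a (m, d) mini-batch (illustrative sketch)."""
    mu = x.mean(dim=0)                          # step 1: mini-batch mean per feature
    var = x.var(dim=0, unbiased=False)          # step 2: mini-batch (biased) variance
    x_hat = (x - mu) / torch.sqrt(var + eps)    # step 3: normalize
    return gamma * x_hat + beta                 # step 4: scale and shift

x = torch.randn(32, 64)                         # batch of 32 examples, 64 features
gamma, beta = torch.ones(64), torch.zeros(64)   # learnable parameters (initial values)
y = batchnorm_forward_train(x, gamma, beta)
```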
In CNNs, batch normalization is applied per channel (per feature map) rather than per individual activation. For a convolutional layer producing feature maps of shape (N, C, H, W), where N is the batch size, C is the number of channels, and H and W are spatial dimensions, the mean and variance are computed across the batch dimension and both spatial dimensions for each channel independently. This means each of the C channels has its own scalar mean, scalar variance, and its own learned gamma and beta parameters. The rationale is that all spatial locations within a single feature map should share the same normalization statistics because they are produced by the same convolutional filter.
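A short sketch of this per-channel reduction, assuming a PyTorch tensor of shape (N, C, H, W); nn.BatchNorm2d implements the same computation with the statistics and parameters managed internally.

```python
import torch

x = torch.randn(8, 64, 32, 32)                       # (N, C, H, W) feature maps

# One scalar mean and variance per channel, computed over N, H, and W
mu = x.mean(dim=(0, 2, 3), keepdim=True)             # shape (1, 64, 1, 1)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
x_hat = (x - mu) / torch.sqrt(var + 1e-5)

# gamma and beta are also per-channel and broadcast over N, H, W
gamma = torch.ones(1, 64, 1, 1)
beta = torch.zeros(1, 64, 1, 1)
y = gamma * x_hat + beta
```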
The original paper by Ioffe and Szegedy proposed applying batch normalization before the activation function, following the pattern: Convolution, BatchNorm, ReLU. The reasoning was that BN aims to normalize the inputs to the nonlinearity, keeping them in a range where gradients flow effectively. For ReLU activations, centering the distribution around zero before applying ReLU helps ensure that a meaningful proportion of activations are non-zero, which can reduce the "dying ReLU" problem.
However, this placement is not universally agreed upon. Some practitioners and researchers have experimented with placing batch normalization after the activation function, and for ReLU-based networks the empirical differences are often minimal. For bounded activation functions like tanh, placing BN after the activation has been reported to yield better performance on certain benchmarks. In practice, the pre-activation convention (BN before the nonlinearity) remains the most common default in modern frameworks and architectures.
Batch normalization behaves differently during training and inference, and understanding this distinction is essential for correct implementation. The distinction is also a frequent source of bugs in practice.
During training, the mean and variance are computed from the current mini-batch. This introduces a form of stochastic noise into the forward pass, because the normalization of each sample depends on the other samples in the same mini-batch. Each mini-batch produces slightly different statistics, which is what creates the regularization effect (see below). This noise acts as a mild regularizer, which can be beneficial but also means that predictions for a single input are not deterministic during training. The gradients flow through the mean and variance computation, so they are part of the computation graph.
Simultaneously, the BN layer maintains exponential moving averages of the batch mean and variance, called the running mean and running variance. After each mini-batch, these running statistics are updated:
| Parameter | Update rule |
|---|---|
| Running mean | running_mean = (1 - momentum) * running_mean + momentum * batch_mean |
| Running variance | running_var = (1 - momentum) * running_var + momentum * batch_var |
The momentum parameter (typically 0.1 or 0.9, depending on the framework convention) controls how quickly the running statistics adapt. In PyTorch, the default momentum value is 0.1.
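A small sketch of this update, assuming the PyTorch convention in which momentum weights the new batch statistic (Keras defines momentum as the weight on the old running average instead, with a default of 0.99); update_running_stats is a hypothetical helper name.

```python
import torch

def update_running_stats(running_mean, running_var, batch_mean, batch_var, momentum=0.1):
    # PyTorch convention: momentum is the weight given to the *new* batch statistic
    new_mean = (1 - momentum) * running_mean + momentum * batch_mean
    new_var = (1 - momentum) * running_var + momentum * batch_var
    return new_mean, new_var

running_mean, running_var = torch.zeros(64), torch.ones(64)
x = torch.randn(32, 64)                               # one mini-batch of activations
running_mean, running_var = update_running_stats(
    running_mean, running_var, x.mean(dim=0), x.var(dim=0, unbiased=False))
```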
At inference time, it is impractical (and undesirable) to compute statistics from a mini-batch, since predictions should be deterministic and may involve a single sample. Instead, the BN layer uses the accumulated running mean and running variance from training. The normalization at inference becomes a fixed linear transformation:
y = gamma * (x - running_mean) / sqrt(running_var + epsilon) + beta
This fixed transformation can be fused with the preceding linear or convolutional layer's weights and biases, eliminating the BN layer entirely at inference and reducing the computational overhead to zero.
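A sketch of such a fusion in PyTorch, assuming a trained model in inference mode; fuse_conv_bn is a hypothetical helper written for illustration (production toolchains provide their own fusion passes).

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold the fixed inference-time BN transform into the preceding conv (sketch)."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding,
                      dilation=conv.dilation, groups=conv.groups, bias=True)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)      # gamma / sqrt(var + eps)
    fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
    conv_bias = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
    fused.bias.copy_((conv_bias - bn.running_mean) * scale + bn.bias)
    return fused

conv, bn = nn.Conv2d(3, 64, 3, padding=1), nn.BatchNorm2d(64).eval()
x = torch.randn(1, 3, 32, 32)
print(torch.allclose(bn(conv(x)), fuse_conv_bn(conv, bn)(x), atol=1e-5))  # True
```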
Common practical issues with batch normalization, along with their typical causes and fixes, are summarized below.
| Problem | Cause | Solution |
|---|---|---|
| Training and inference results differ significantly | Model not switched to eval mode (batch stats vs. running stats) | Call model.eval() before inference in PyTorch |
| Poor accuracy with small batch sizes | Noisy batch statistics from too few samples | Use Group Normalization or increase batch size |
| Running statistics are inaccurate after fine-tuning | Running stats from pre-training do not match new data distribution | Reset or freeze BN statistics when fine-tuning on a new domain |
| NaN or unstable loss values | Extremely small batches or degenerate batches | Check batch composition; consider batch-size-independent normalization |
Ioffe and Szegedy motivated batch normalization as a solution to internal covariate shift, defined as the change in the distribution of layer inputs caused by updates to the parameters of preceding layers during training. They argued that this shifting distribution forced each layer to continuously adapt to new input statistics, slowing down training and requiring conservative learning rate choices and careful parameter initialization. By normalizing layer inputs, BN was claimed to stabilize these distributions and allow the network to train more effectively.
In 2018, Santurkar, Tsipras, Ilyas, and Madry published a highly influential paper titled "How Does Batch Normalization Help Optimization?" at NeurIPS that challenged the internal covariate shift narrative. Through careful experiments, they demonstrated that networks with noise deliberately injected after BN layers, which induces severe covariate shift, still trained as quickly as standard batch-normalized networks, and that BN does not necessarily reduce internal covariate shift at all. They argued instead that BN's real benefit is to make the optimization landscape significantly smoother.
Specifically, BN improves the Lipschitzness of both the loss function and its gradients (also known as beta-smoothness). A smoother loss landscape means that gradients are more predictive of the actual direction of improvement, which in practice permits larger learning rates, faster convergence, and reduced sensitivity to hyperparameter choices. The competing explanations for why batch normalization works are summarized below.
| Explanation | Proposed by | Year | Status |
|---|---|---|---|
| Reduces internal covariate shift | Ioffe and Szegedy | 2015 | Disputed |
| Smooths the loss landscape | Santurkar et al. | 2018 | Widely accepted |
| Acts as a regularizer (noise from batch statistics) | Ioffe and Szegedy (secondary claim) | 2015 | Generally accepted |
This loss landscape smoothing explanation is now widely accepted as the more accurate account of why batch normalization helps training, though the exact mechanisms by which batch normalization achieves this smoothing are still being studied.
Because the normalization statistics are computed over mini-batches, each sample's normalized value depends on the other samples in the batch. This introduces noise into the training process in a manner somewhat analogous to dropout. This stochastic regularization effect can reduce overfitting, and Ioffe and Szegedy noted in their original paper that batch normalization sometimes eliminates the need for dropout entirely. The regularization effect is stronger with smaller batch sizes (since the batch statistics are noisier) and weaker with larger batch sizes.
Batch normalization makes networks significantly more robust to the choice of weight initialization. Without BN, poor initialization can cause activations to explode or vanish as they propagate through many layers, leading to very slow training or complete failure. With BN, the normalization step re-centers and re-scales activations at every layer, preventing such cascading effects. While proper initialization (such as He initialization for ReLU networks) is still beneficial and can speed up convergence, BN provides a safety net that prevents catastrophic initialization failures.
Similarly, BN allows practitioners to use much higher learning rates than would otherwise be stable. The smoother loss landscape allows the use of learning rates that would cause divergence without normalization. Ioffe and Szegedy reported that their batch-normalized Inception network could be trained with learning rates an order of magnitude higher than the non-normalized version while still converging.
Combining batch normalization and dropout in the same network requires care, as the two techniques can conflict. Li et al. (2019) identified a "variance shift" problem that arises when dropout is placed before a BN layer. During training, dropout randomly zeros out activations with probability 1 - p (where p denotes the retention probability), which changes the variance of the layer's outputs. BN then computes running statistics based on these dropout-modified activations. At inference time, however, dropout is disabled and all activations are present (scaled by p), producing a different variance than what BN's running statistics expect. This mismatch between training-time and inference-time statistics can degrade performance.
Practical recommendations for combining the two techniques include:
- Place dropout only after all batch normalization layers, for example in the final classifier head, so that BN's running statistics are never computed on dropout-modified activations (a sketch follows this list)
- Reduce or omit dropout in convolutional blocks that already use BN, since BN itself provides some regularization
- Where dropout before BN is unavoidable, use the more variance-stable dropout variants suggested by Li et al.
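As a concrete illustration of the first recommendation, the following sketch places the only dropout layer after every BN layer, in the classifier head; the architecture itself is arbitrary and chosen purely for illustration.

```python
import torch.nn as nn

# Dropout appears only after all BN layers, so BN's running statistics
# are never computed on dropout-modified activations.
model = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(inplace=True),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Dropout(p=0.5),   # placed after every BN layer
    nn.Linear(64, 10),
)
```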
Batch normalization significantly reduces, but does not eliminate, the importance of weight initialization. Before BN became widespread, techniques like Xavier initialization and He initialization were critical for training deep networks, as they aimed to maintain stable activation variances across layers. With BN, the normalization step automatically corrects activation scales at every layer, providing resilience against poor initialization. In practice, He initialization combined with BN and skip connections is the standard recipe for training very deep networks (50+ layers). Even with BN, very poor initialization choices can slow convergence, so using established initialization methods remains a best practice.
Batch normalization's reliance on mini-batch statistics creates a fundamental problem when batch sizes are small. The sample mean and variance become noisy estimates of the true population statistics, with estimation errors scaling as O(1/sqrt(m)), where m is the batch size. With a batch size of 32, these estimates are reasonably stable. But as the batch size shrinks:
| Batch size | Relative estimation error | Typical impact on performance |
|---|---|---|
| 32+ | Low | Batch normalization works well |
| 8-16 | Moderate | Slight degradation; still usable |
| 2-4 | High | Noticeable performance drop |
| 1 | Undefined (zero variance for single sample) | Batch normalization breaks entirely |
With a batch size of 1, the mini-batch mean is just the single example's value, and the variance is zero. The numerator of the normalization step is therefore also zero, so every normalized activation collapses to zero (and the output to beta) regardless of the input, making the computation meaningless. Wu and He (2018) demonstrated that ResNet-50 with BN lost 5.6 percentage points of top-1 accuracy on ImageNet when the batch size was reduced from 32 to 2, a dramatic degradation.
Several practical scenarios force the use of small batch sizes:
- Object detection and semantic segmentation, where high-resolution inputs leave room for only a few images per GPU
- Video models and 3D medical imaging, where each individual example is very large
- Training on memory-constrained hardware
- Distributed training where the per-device batch is small
In these settings, batch normalization's noisy statistics cause training instability, poor convergence, and degraded final performance. This has driven the adoption of alternative normalization methods.
One partial solution for multi-GPU setups is synchronized batch normalization (SyncBN), which computes batch statistics across all GPUs rather than independently on each device. If each of 8 GPUs processes 2 examples, SyncBN computes statistics over an effective batch of 16, reducing the noise problem.
However, SyncBN requires an all-reduce communication step at every batch normalization layer during both forward and backward passes, adding significant overhead. It also does not help when the total batch across all devices is still small.
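In PyTorch, a model's BN layers can be converted to synchronized batch normalization for distributed training; a minimal sketch is shown below (the distributed process-group setup is elided).

```python
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 64, 3), nn.BatchNorm2d(64), nn.ReLU())

# Replace every BatchNorm layer with SyncBatchNorm; during distributed training,
# batch statistics are then all-reduced across processes in the process group.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
```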
Batch normalization is poorly suited for transformer architectures and recurrent neural networks that process variable-length sequences. In these settings, different positions in the sequence may carry different semantic meaning, and computing statistics across the batch at each position mixes incomparable information. Additionally, during autoregressive inference, tokens are processed one at a time, making batch statistics meaningless. This is the primary reason that transformers universally use layer normalization or its variants instead of batch normalization.
In online learning scenarios where data arrives one sample at a time, or in streaming settings with non-stationary data distributions, batch normalization cannot compute meaningful batch statistics. The running statistics accumulated during training may also become stale if the data distribution shifts over time.
The limitations of batch normalization have motivated the development of several alternative normalization methods. Each computes statistics over a different set of dimensions.
| Technique | Normalizes over | Batch-size dependent? | Primary use case | Key paper |
|---|---|---|---|---|
| Batch Normalization | Batch + spatial dims, per channel | Yes | CNNs with large batch sizes | Ioffe and Szegedy, 2015 |
| Layer Normalization | All features in a single sample | No | Transformers, RNNs | Ba, Kiros, and Hinton, 2016 |
| Instance Normalization | Spatial dims per channel, per sample | No | Style transfer, image generation | Ulyanov, Vedaldi, and Lempitsky, 2016 |
| Group Normalization | Groups of channels per sample | No | CNNs with small batch sizes, detection | Wu and He, 2018 |
| RMSNorm | All features (RMS only, no mean subtraction) | No | Large language models | Zhang and Sennrich, 2019 |
Layer normalization (Ba, Kiros, and Hinton, 2016) normalizes across all features within a single training example rather than across the batch. For a hidden vector of dimension d, layer normalization computes the mean and variance over the d features of that single example. Because it does not depend on batch size, it works with a batch size of 1 and does not require running statistics. Layer normalization became the standard normalization technique for transformers and is used in models like BERT, GPT, and their successors.
Key differences from batch normalization:
| Property | Batch Normalization | Layer Normalization |
|---|---|---|
| Normalizes across | Batch dimension | Feature dimension |
| Depends on batch size | Yes | No |
| Works for batch size 1 | Poorly | Yes |
| Suited for sequence models | No | Yes |
| Training/inference difference | Yes (running stats) | No |
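A small sketch of the difference in PyTorch; because layer normalization uses only a sample's own features, its output for a given example does not depend on what else is in the batch.

```python
import torch
import torch.nn as nn

x = torch.randn(4, 512)          # 4 examples, 512 features each

bn = nn.BatchNorm1d(512)         # statistics over the 4 examples, per feature
ln = nn.LayerNorm(512)           # statistics over the 512 features, per example

bn_out = bn(x)                   # each feature normalized across the batch
ln_out = ln(x)                   # each example normalized independently

# LayerNorm output for one sample is identical whether it is alone or in a batch
print(torch.allclose(ln(x[:1]), ln(x)[:1], atol=1e-6))  # True
```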
Instance normalization (Ulyanov, Vedaldi, and Lempitsky, 2016) normalizes across the spatial dimensions (H, W) of each feature map for each sample independently. It was introduced specifically for neural style transfer, where removing instance-specific contrast information from the content image improves stylization quality. It is equivalent to layer normalization applied to each channel independently and is most commonly used in generative models for image synthesis and style transfer applications.
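A brief usage sketch of PyTorch's nn.InstanceNorm2d, which normalizes each channel of each sample over its spatial dimensions only.

```python
import torch
import torch.nn as nn

x = torch.randn(2, 64, 56, 56)               # (N, C, H, W)

# Per-sample, per-channel normalization over (H, W); affine adds gamma and beta
inorm = nn.InstanceNorm2d(64, affine=True)
y = inorm(x)
```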
Group Normalization (Wu and He, 2018) divides channels into groups and normalizes within each group independently for each example. With G groups and C channels, each group contains C/G channels. Statistics are computed over the spatial dimensions (H, W) and the C/G channels within each group, for each example independently. It serves as a middle ground between layer normalization (which normalizes over all channels) and instance normalization (which normalizes each channel separately).
The critical advantage is that group normalization's computation depends only on a single example, not on the batch. This means the layer behaves identically during training and inference, requires no running statistics, and its accuracy is essentially independent of batch size, remaining usable even at a batch size of 1.
Group normalization became the standard choice in object detection frameworks like Detectron2 and MMDetection and in segmentation models where high-resolution inputs necessitate small batches. FAIR's Detectron2, for example, defaults to group normalization with 32 groups for all backbone networks. Wu and He showed that while BN lost 5.6 percentage points of top-1 accuracy on ImageNet when the batch size dropped from 32 to 2, group normalization lost only 0.3 percentage points.
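A minimal usage sketch of PyTorch's nn.GroupNorm with 32 groups (the Detectron2 default mentioned above); because the statistics are per example, the output is unchanged when a sample is processed on its own. The comparison table below summarizes the reported behavior of each method at large and small batch sizes.

```python
import torch
import torch.nn as nn

x = torch.randn(2, 64, 56, 56)                        # tiny batch of 2 feature maps

gn = nn.GroupNorm(num_groups=32, num_channels=64)     # 64 / 32 = 2 channels per group
y = gn(x)

# No batch dependence: a single example normalizes exactly the same way
print(torch.allclose(gn(x[:1]), y[:1], atol=1e-6))    # True
```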
| Normalization method | Performance at batch size 32 | Performance at batch size 2 | Batch size dependence |
|---|---|---|---|
| Batch Normalization | Best | Significantly degraded (-5.6% top-1) | Strong |
| Group Normalization (32 groups) | Slightly below BN (-0.5%) | Near-optimal (-0.3% from BN@32) | None |
| Layer Normalization | Below BN | Stable | None |
| Instance Normalization | Below BN | Stable | None |
RMSNorm (Root Mean Square Layer Normalization) was proposed by Zhang and Sennrich (2019). RMSNorm simplifies layer normalization by removing the mean-centering step and normalizing only by the root mean square of the activations:
RMSNorm(x) = gamma * x / sqrt((1/d) * sum(x_i^2) + epsilon)
The authors found that the mean subtraction in standard layer normalization is not necessary for the technique to work. Removing it reduces computational overhead because only a single reduction (the mean of the squared activations) is needed per normalization instead of two, no subtraction has to be applied to every element, and the learnable shift parameter beta can be dropped, leaving only the gain.
By eliminating the mean computation, RMSNorm reduces computational overhead by 7% to 64% depending on the model and implementation, while achieving comparable training performance. RMSNorm has become the dominant normalization choice in modern large language models, including LLaMA, Mistral, and Gemma.
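A compact sketch of the RMSNorm formula above as a PyTorch module; this follows the published definition, though production implementations (for example in LLaMA-style codebases) differ in details such as the epsilon value and precision handling.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root mean square normalization: no mean subtraction, gain only (sketch)."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))    # gamma (per-feature gain)

    def forward(self, x):
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x / rms

x = torch.randn(2, 16, 512)      # (batch, sequence length, hidden size)
y = RMSNorm(512)(x)
```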
Batch normalization was designed for convolutional neural networks processing images in fixed-size batches. Several properties of transformer models make batch normalization a poor fit: inputs are variable-length token sequences (usually padded), so per-position batch statistics are computed over incomparable tokens; autoregressive decoding generates one token at a time, leaving no batch to normalize over at inference; and batch statistics in NLP training fluctuate heavily from one mini-batch to the next.
Research has confirmed this empirically. Shen et al. (2020) showed that standard batch normalization leads to "significant performance degradation" in transformers for NLP tasks.
The evolution of normalization techniques in large language models tells a clear story: batch normalization was never suited for language modeling, and the field has progressively simplified normalization toward more efficient variants. RMSNorm has become the normalization method of choice in most state-of-the-art LLMs developed since 2023. The LLaMA family of models (Meta, 2023) adopted RMSNorm, and this choice was followed by many subsequent open-weight models.
| Model | Normalization method |
|---|---|
| GPT-2 (2019) | Layer Normalization |
| GPT-3 (2020) | Layer Normalization |
| BERT (2018) | Layer Normalization |
| LLaMA (2023) | RMSNorm |
| LLaMA 2 (2023) | RMSNorm |
| LLaMA 3 (2024) | RMSNorm |
| Mistral (2023) | RMSNorm |
| Gemma (2024) | RMSNorm |
The shift to RMSNorm is driven by practical benefits: comparable training stability with lower computational overhead per layer. When training models with billions of parameters across trillions of tokens, even small efficiency improvements per operation compound into meaningful savings in time and energy.
Beyond the choice of normalization function, where normalization is placed within the transformer block has a significant impact on training stability.
The original transformer (Vaswani et al., 2017) used post-norm placement:
x = LayerNorm(x + Sublayer(x))
Most modern LLMs use pre-norm placement:
x = x + Sublayer(LayerNorm(x))
Xiong et al. (2020) showed that pre-norm placement makes the gradient norms more predictable across layers, which stabilizes training and allows the use of larger learning rates. Pre-norm transformers can often be trained without warmup, while post-norm transformers typically require careful warmup to avoid divergence.
The tradeoff is that pre-norm placement can lead to slightly worse final performance compared to post-norm in some settings, because the residual stream is not normalized. However, the training stability benefits have made pre-norm the dominant choice in practice.
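A schematic sketch of the two placements, using a generic sublayer in place of the attention or feed-forward blocks of a real transformer; the class names are illustrative only.

```python
import torch
import torch.nn as nn

class PostNormBlock(nn.Module):
    """Original transformer placement: x = LayerNorm(x + Sublayer(x))."""
    def __init__(self, dim, sublayer):
        super().__init__()
        self.norm, self.sublayer = nn.LayerNorm(dim), sublayer

    def forward(self, x):
        return self.norm(x + self.sublayer(x))

class PreNormBlock(nn.Module):
    """Modern placement: x = x + Sublayer(LayerNorm(x))."""
    def __init__(self, dim, sublayer):
        super().__init__()
        self.norm, self.sublayer = nn.LayerNorm(dim), sublayer

    def forward(self, x):
        return x + self.sublayer(self.norm(x))

x = torch.randn(2, 16, 512)                      # (batch, sequence, hidden)
y = PreNormBlock(512, nn.Linear(512, 512))(x)    # Linear stands in for attention/FFN
```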
Recognizing the tradeoffs between pre-norm and post-norm, Microsoft introduced DeepNorm (Wang et al., 2022) for training very deep transformers (up to 1,000 layers). DeepNorm modifies the residual connection with a scaling factor:
x = LayerNorm(alpha * x + Sublayer(x))
where alpha is a constant that depends on the number of layers. DeepNorm achieves the training stability of pre-norm with the performance benefits of post-norm, enabling the training of transformers that are significantly deeper than previously possible.
| Year | Model | Normalization | Placement | Notes |
|---|---|---|---|---|
| 2017 | Original Transformer | Layer Normalization | Post-norm | First transformer architecture |
| 2018 | BERT | Layer Normalization | Post-norm | Followed original transformer |
| 2019 | GPT-2 | Layer Normalization | Pre-norm | Shifted to pre-norm for stability |
| 2020 | GPT-3 | Layer Normalization | Pre-norm | Continued pre-norm convention |
| 2022 | DeepNorm | Modified Layer Norm | Hybrid | For very deep transformers |
| 2023 | LLaMA | RMSNorm | Pre-norm | Shifted to RMSNorm for efficiency |
| 2023 | Mistral | RMSNorm | Pre-norm | Followed LLaMA convention |
| 2024 | LLaMA 3 | RMSNorm | Pre-norm | RMSNorm now standard |
| 2024 | Gemma | RMSNorm | Pre-norm | Google also adopted RMSNorm |
| 2024 | DeepSeek-V2 | RMSNorm | Pre-norm | MoE models also use RMSNorm |
The convergence of virtually all modern LLMs on RMSNorm with pre-norm placement reflects a consensus that this combination provides the best balance of training stability, computational efficiency, and model quality. Batch normalization plays no role in this landscape, having been superseded by architectures where per-example normalization is more natural and more efficient.
PyTorch provides batch normalization through torch.nn.BatchNorm1d, torch.nn.BatchNorm2d, and torch.nn.BatchNorm3d for 1D, 2D, and 3D inputs respectively.
```python
import torch
import torch.nn as nn

# BatchNorm for 2D convolutions (input: N, C, H, W)
bn = nn.BatchNorm2d(
    num_features=64,           # Number of channels
    eps=1e-5,                  # Epsilon for numerical stability
    momentum=0.1,              # Momentum for running stats
    affine=True,               # Learn gamma and beta
    track_running_stats=True   # Track running mean/var
)

# Example usage in a CNN block
class ConvBlock(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, 3, padding=1)
        self.bn = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn(self.conv(x)))

# Important: switch to eval mode for inference
model = ConvBlock(3, 64)
model.eval()  # Uses running statistics instead of batch statistics
```
Key parameters:
- num_features: matches the number of channels (C dimension) of the input
- momentum: controls the update rate of running statistics (default 0.1)
- affine: when True (default), the layer learns gamma and beta; when False, normalization has no learnable parameters
- track_running_stats: when True (default), maintains running mean and variance for inference
In TensorFlow and Keras, batch normalization is available via tf.keras.layers.BatchNormalization.
```python
import tensorflow as tf

# Keras functional API
inputs = tf.keras.Input(shape=(32, 32, 3))
x = tf.keras.layers.Conv2D(64, 3, padding='same')(inputs)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.ReLU()(x)

# The 'training' flag is handled automatically in model.fit()
# For manual control:
# x = tf.keras.layers.BatchNormalization()(x, training=True)
```
Keras automatically handles the training/inference mode switch when using model.fit() and model.predict(). For custom training loops, the training argument must be passed explicitly.
Batch normalization had a transformative effect on the practice of training deep neural networks when it was introduced. Before BN, training very deep networks was an arduous process that required meticulous hyperparameter tuning, careful initialization schemes, and conservative learning rates. The original paper demonstrated that a batch-normalized version of the Inception network matched the accuracy of the original with 14 times fewer training steps, and an ensemble of BN-Inception models achieved 4.9% top-5 error on the ImageNet validation set, surpassing the state of the art at the time.
The technique's success in CNNs for computer vision helped establish the modern deep learning training recipe: BN (or a normalization variant), residual connections, and adaptive optimizers like Adam. Nearly every ImageNet competition winner and major vision architecture from 2015 onward incorporated batch normalization. While transformers and large language models have moved toward layer normalization and RMSNorm, batch normalization remains the default choice for convolutional architectures in computer vision, and its conceptual legacy, normalizing intermediate representations to stabilize training, underlies all modern normalization techniques.
Imagine a classroom where each student takes a different version of a math test. Some tests have questions where all the numbers are between 1 and 10, while other tests have numbers between 1,000 and 1,000,000. The students with the huge numbers are going to struggle more, not because the math is harder, but because the scale makes everything more confusing.
Batch normalization is like a teacher who collects everyone's test, adjusts all the numbers to be on the same comfortable scale, and then hands them back. Now every student is working with similar-sized numbers, and they can focus on actually learning the math instead of wrestling with wildly different scales. The teacher also lets each student shift and stretch the numbers a little if that helps them learn better (those are the gamma and beta parameters). At the end of the school year (inference time), the teacher uses the average adjustments from the whole year instead of recalculating for each new test.