# Spatial Pooling

> Source: https://aiwiki.ai/wiki/spatial_pooling
> Updated: 2026-04-09
> Categories: Computer Vision, Machine Learning, Neural Networks
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**Spatial pooling** is a family of operations used in [convolutional neural networks](/wiki/convolutional_neural_network) (CNNs) to reduce the spatial dimensions of feature maps while preserving the most relevant information. By summarizing local regions of a feature map into single values, pooling layers shrink the activations passed to later layers (and with them the parameter count of any downstream fully connected layers), lower computational cost, expand the receptive field of subsequent layers, and introduce a degree of translation invariance. Spatial pooling has been a core building block of CNN architectures since the earliest designs in the 1990s and remains widely used in modern [computer vision](/wiki/computer_vision) systems.

Pooling layers are deterministic, parameter-free operations in their standard forms: they contain no learnable weights and instead apply a fixed aggregation function (such as taking the maximum or computing the mean) over a sliding window. This distinguishes them from [convolutional layers](/wiki/convolutional_layer), which learn their filter weights during training.

## Explain like I'm 5 (ELI5)

Imagine you have a huge painting with thousands of tiny details. You want to describe that painting to a friend, but you do not have time to talk about every single brushstroke. So instead, you look at small sections of the painting one at a time and pick out the single most important thing in each section (like the brightest color or the biggest shape). When you are done, you have a much shorter description that still captures what the painting looks like. That is what spatial pooling does for a computer: it shrinks a big picture of numbers down to a smaller one by keeping only the most useful information from each little area.

## Historical background

The concept of pooling in neural networks has roots in biological vision research. Early 20th-century neuroanatomists identified local pooling as a mechanism supporting translation-invariant pattern recognition in the visual cortex. Haldan Keffer Hartline provided electrophysiological evidence in 1940 through studies of retinal ganglion cells, and Hubel and Wiesel's Nobel Prize-winning experiments in the 1960s demonstrated that the cat visual system contains cells that sum over inputs from lower layers, a behavior analogous to pooling.

In artificial neural networks, pooling operations appeared as early as 1990 for speech processing and 1992 for image processing in the Cresceptron architecture. Yann LeCun's LeNet-5 (1998) formalized the use of subsampling layers (a form of average pooling) in combination with [convolution](/wiki/convolution) layers and [fully connected layers](/wiki/fully_connected_layer), establishing the architectural template that dominated CNN design for over a decade. Max pooling later gained popularity and was used prominently in [AlexNet](/wiki/alexnet) (Krizhevsky et al., 2012), which won the ImageNet Large Scale Visual Recognition Challenge and launched the modern [deep learning](/wiki/deep_learning) era.

## How spatial pooling works

A pooling layer slides a fixed-size window (the pooling kernel) across each channel of the input feature map, computing a summary statistic for each window position. The key hyperparameters are:

| Parameter | Description | Typical value |
|-----------|-------------|---------------|
| Kernel size (f) | The height and width of the pooling window | 2x2 or 3x3 |
| [Stride](/wiki/stride) (s) | The number of pixels the window moves between positions | 2 |
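| Padding (p) | The number of pixels added around the input border (zeros for average pooling; max pooling implementations typically pad with negative infinity so that padded cells are never selected) | 0 (no padding) |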

The output spatial dimensions are computed as:

**H_out = floor((H_in - f + 2p) / s) + 1**

**W_out = floor((W_in - f + 2p) / s) + 1**

The depth (number of channels) of the output is always equal to the depth of the input, because pooling operates independently on each channel.

With the most common configuration of a 2x2 kernel and [stride](/wiki/stride) of 2, pooling discards approximately 75% of the activations, reducing each spatial dimension by half. This aggressive [downsampling](/wiki/downsampling) is the primary mechanism through which pooling reduces computational cost in deeper layers.
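The output-size formula can be checked with a few lines of Python. This is a minimal sketch: the helper name `pool_output_size` and the 224x224 input are illustrative choices, not part of any library.

```python
def pool_output_size(size_in: int, kernel: int, stride: int, padding: int = 0) -> int:
    """Output length along one spatial axis: floor((size_in - f + 2p) / s) + 1."""
    return (size_in - kernel + 2 * padding) // stride + 1


# 2x2 kernel, stride 2, no padding, applied to a 224x224 feature map.
h_out = pool_output_size(224, kernel=2, stride=2)
w_out = pool_output_size(224, kernel=2, stride=2)
kept_fraction = (h_out * w_out) / (224 * 224)
print(h_out, w_out, kept_fraction)  # 112 112 0.25

# With an odd input size the floor drops the last row or column.
print(pool_output_size(7, kernel=2, stride=2))  # 3
```

Keeping one quarter of the activations corresponds to the 75% reduction described above.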

### Properties of pooling layers

- **No learnable parameters.** Standard pooling operations (max, average, global) do not contain any trainable weights or biases. This means they add zero parameters to the model.
- **Channel independence.** Pooling is applied to each feature map channel separately; it does not mix information across channels.
- **Translation invariance.** By summarizing local regions, pooling makes the network less sensitive to small shifts in the position of features. If a feature moves slightly but stays within the same pooling window, the max-pooled output is unchanged; a shift that crosses a window boundary can still change the output.
- **Receptive field expansion.** Each neuron in layers following a pooling operation effectively "sees" a larger region of the original input, enabling higher layers to capture more global patterns.

## Types of spatial pooling

### Max pooling

Max pooling selects the maximum value within each pooling window. Formally, for a pooling region R at output position (i, j) in channel c:

**MaxPool(i, j, c) = max(x_R)**

where x_R contains all activation values within the window.

Max pooling is the most widely used pooling operation in modern CNNs. It preserves the strongest activations (which often correspond to detected features such as edges, textures, or object parts) while discarding weaker responses. This makes max pooling particularly effective at retaining high-frequency, spatially localized features.

During backpropagation, [gradients](/wiki/gradient) are routed only to the position that held the maximum value in the forward pass. The network records the indices of these maxima (sometimes called "switches" or "masks") during the forward pass so that gradient routing is efficient. All non-maximum positions receive a gradient of zero.

**Advantages:**
- Retains the strongest feature activations
- Provides robustness to small translations and distortions
- Works well in practice for most classification and detection tasks

**Disadvantages:**
- Discards all information except the maximum, which can be lossy
- Only a single activation per window influences the gradient during training
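As a concrete illustration, the following sketch applies 2x2 max pooling with stride 2 to a single-channel feature map stored as a nested list. The function name `max_pool2d` and the sample values are illustrative; real frameworks operate on batched tensors and are far faster.

```python
def max_pool2d(x, kernel=2, stride=2):
    """Max-pool a single-channel 2D list of numbers (no padding)."""
    h, w = len(x), len(x[0])
    out_h = (h - kernel) // stride + 1
    out_w = (w - kernel) // stride + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Collect every value covered by the window at output position (i, j).
            window = [
                x[i * stride + di][j * stride + dj]
                for di in range(kernel)
                for dj in range(kernel)
            ]
            row.append(max(window))
        out.append(row)
    return out


feature_map = [
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 9, 5],
    [1, 1, 3, 7],
]
pooled = max_pool2d(feature_map)
print(pooled)  # [[6, 2], [2, 9]]
```

Each output value is the single strongest activation in its window; the other three values in that window are discarded.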

### Average pooling

Average pooling computes the arithmetic mean of all values within each pooling window:

**AvgPool(i, j, c) = (1 / f^2) * sum(x_R)**

Average pooling produces a smoother, more uniform representation compared to max pooling. It was the dominant pooling method in early CNN architectures such as LeNet-5 but has since been largely replaced by max pooling in hidden layers. The Stanford CS231n course notes observe that "average pooling was often used historically but has recently fallen out of favor compared to the max pooling operation, which has been shown to work better in practice."

During backpropagation, the gradient from the output is distributed equally among all positions in the pooling window, since every element contributed equally to the output.

**Advantages:**
- Considers all activations in the window, preserving more information
- Produces smoother feature maps that can be beneficial for certain tasks
- Performs well when handling noisy data

**Disadvantages:**
- May dilute strong activations by averaging them with weaker ones
- Less effective at preserving sharp edges and high-frequency features
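The dilution effect can be seen in a small sketch that averages 2x2 windows of a single-channel feature map. The function name `avg_pool2d` and the sample values are illustrative only.

```python
def avg_pool2d(x, kernel=2, stride=2):
    """Average-pool a single-channel 2D list of numbers (no padding)."""
    h, w = len(x), len(x[0])
    out_h = (h - kernel) // stride + 1
    out_w = (w - kernel) // stride + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            window = [
                x[i * stride + di][j * stride + dj]
                for di in range(kernel)
                for dj in range(kernel)
            ]
            # Equivalent to (1 / f^2) * sum(x_R) for a square f x f window.
            row.append(sum(window) / len(window))
        out.append(row)
    return out


feature_map = [
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 9, 5],
    [1, 1, 3, 7],
]
averaged = avg_pool2d(feature_map)
print(averaged)  # [[3.5, 1.0], [1.0, 6.0]]
```

The strong activation of 9 in the bottom-right window is pulled down to 6.0 by its weaker neighbors, whereas max pooling would have kept it unchanged.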

### Global average pooling

Global average pooling (GAP) computes the mean of all spatial positions within each channel, collapsing an H x W x C feature map to a 1 x 1 x C vector. Introduced by Min Lin, Qiang Chen, and Shuicheng Yan in the "Network in Network" paper (2013), GAP was proposed as a replacement for [fully connected layers](/wiki/fully_connected_layer) at the end of classification networks.

GAP offers several benefits over fully connected layers:

| Property | Global average pooling | Fully connected layer |
|----------|----------------------|----------------------|
| Parameters | 0 | Large (depends on input size) |
| [Overfitting](/wiki/overfitting) risk | Low | High (many parameters) |
| Input size flexibility | Accepts any spatial size | Requires fixed input size |
| Interpretability | Each channel can map directly to a class (when the last convolutional layer emits one channel per class) | Less interpretable |

GAP became standard in architectures such as GoogLeNet/[Inception](/wiki/inception), where it replaced fully connected layers and improved top-1 accuracy by approximately 0.6%. It is now used in most modern classification networks including [ResNet](/wiki/resnet), [DenseNet](/wiki/densenet), [MobileNet](/wiki/mobilenet), and [EfficientNet](/wiki/efficientnet).
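The sketch below shows the GAP operation on a tiny feature map and compares classifier sizes with and without it. The 7x7x512 feature map and 1000 classes are hypothetical figures chosen for illustration, and `global_avg_pool` is an illustrative helper rather than a library function.

```python
def global_avg_pool(feature_map):
    """Collapse a [C][H][W] nested list to a length-C list of channel means."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]


fmap = [
    [[1.0, 2.0], [3.0, 4.0]],  # channel 0
    [[0.0, 0.0], [0.0, 8.0]],  # channel 1
]
pooled = global_avg_pool(fmap)
print(pooled)  # [2.5, 2.0]

# Hypothetical classifier head: 7x7x512 feature map, 1000 output classes.
h, w, c, num_classes = 7, 7, 512, 1000
fc_on_flattened = h * w * c * num_classes + num_classes  # weights + biases
fc_after_gap = c * num_classes + num_classes             # GAP itself adds 0
print(fc_on_flattened, fc_after_gap)  # 25089000 513000
```

Under these assumed sizes, pooling before the classifier cuts the head from roughly 25 million parameters to about half a million.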

### Global max pooling

Global max pooling (GMP) takes the maximum value across the entire spatial extent of each channel, producing a 1 x 1 x C vector. Like GAP, it is used before the final classification layer. GMP captures the single strongest activation per channel and is sometimes used in tasks where the presence (rather than the spatial distribution) of a feature is the primary signal.

### Overlapping pooling

In the most common (non-overlapping) configuration, the stride equals the kernel size, so pooling windows do not overlap. Overlapping pooling uses a stride smaller than the kernel size, meaning adjacent windows share some input elements. [AlexNet](/wiki/alexnet) used overlapping pooling with a 3x3 kernel and stride of 2, which reduced the top-1 error rate by 0.4 percentage points and the top-5 error rate by 0.3 percentage points compared to non-overlapping 2x2 pooling with stride 2. Krizhevsky et al. also observed that models with overlapping pooling were slightly more resistant to [overfitting](/wiki/overfitting).
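To make the overlap concrete, the sketch below lists which input indices each window covers along one axis. The helper `window_indices` is an illustrative name; the 55-wide input is the size of AlexNet's first convolutional output, used here only to show that both schemes give the same output size.

```python
def window_indices(size_in, kernel, stride):
    """Input indices covered by each pooling window along one axis (no padding)."""
    n_out = (size_in - kernel) // stride + 1
    return [list(range(i * stride, i * stride + kernel)) for i in range(n_out)]


non_overlapping = window_indices(7, kernel=2, stride=2)
overlapping = window_indices(7, kernel=3, stride=2)
print(non_overlapping)  # [[0, 1], [2, 3], [4, 5]]
print(overlapping)      # [[0, 1, 2], [2, 3, 4], [4, 5, 6]]

# Both schemes produce the same number of outputs on a 55-wide input.
print(len(window_indices(55, 2, 2)), len(window_indices(55, 3, 2)))  # 27 27
```

In the overlapping case, indices 2 and 4 each appear in two adjacent windows, so neighboring outputs share part of their input.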

### Stochastic pooling

Stochastic pooling, introduced by Zeiler and Fergus (2013), replaces the deterministic selection of max pooling with a random sampling procedure. Instead of always picking the maximum activation, it samples from the pooling region according to a multinomial distribution defined by the normalized activations:

**p(k) = x_k / sum(x_R)**

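where x_k is the activation at position k and x_R is the set of activations in the pooling region. The activations must be non-negative (for example, ReLU outputs) for p to be a valid probability distribution.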

Stochastic pooling acts as a [regularization](/wiki/regularization) technique: during training, the random sampling prevents the network from relying too heavily on any single activation, reducing overfitting. At test time, the method computes a probability-weighted average. The approach is hyperparameter-free and can be combined with other regularization methods such as [dropout](/wiki/dropout_regularization) and data augmentation. Zeiler and Fergus demonstrated state-of-the-art performance on several image benchmarks at the time of publication.
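The training-time sampling and test-time weighting can be sketched for a single pooling window as follows. The function names, the fallback to 0.0 for an all-zero window, and the fixed random seed are choices made for this illustration, not details taken from the paper.

```python
import random


def stochastic_pool_train(window, rng):
    """Sample one activation with probability p(k) = x_k / sum(x_R)."""
    total = sum(window)
    if total == 0:
        return 0.0  # assumed convention for an all-zero window
    probs = [x / total for x in window]
    return rng.choices(window, weights=probs, k=1)[0]


def stochastic_pool_test(window):
    """Probability-weighted average: sum(p(k) * x_k) = sum(x_k^2) / sum(x_k)."""
    total = sum(window)
    if total == 0:
        return 0.0
    return sum(x * x for x in window) / total


window = [1.0, 2.0, 3.0, 4.0]  # non-negative activations, e.g. after ReLU
rng = random.Random(0)
sample = stochastic_pool_train(window, rng)
expected = stochastic_pool_test(window)
print(sample in window, expected)  # True 3.0
```

The test-time value of 3.0 lies between the window's mean (2.5) and its maximum (4.0), because larger activations receive larger weights.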

### Lp pooling

Lp pooling generalizes both average pooling and max pooling through the Lp norm:

**LpPool(i, j, c) = (1/N * sum(|x_k|^p))^(1/p)**

where p is a positive real number and N is the number of elements in the pooling region. When p = 1, Lp pooling reduces to the mean of the absolute values, which equals average pooling for non-negative activations (for example, after a ReLU). As p approaches infinity, it converges to the largest absolute value in the window, which equals max pooling under the same non-negativity condition. The special case p = 2 is sometimes called "square-root pooling" or "L2 pooling." The value of p can be fixed as a hyperparameter or learned during training, providing a smooth interpolation between average and max pooling.
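A short numerical sketch of the formula shows the interpolation for one window. The function name `lp_pool` and the sample window are illustrative.

```python
def lp_pool(window, p):
    """(1/N * sum(|x_k|^p))^(1/p) over one pooling window."""
    n = len(window)
    return (sum(abs(x) ** p for x in window) / n) ** (1.0 / p)


window = [1.0, 2.0, 3.0, 4.0]
mean_like = lp_pool(window, 1)      # equals the mean, 2.5
rms_like = lp_pool(window, 2)       # root mean square, about 2.74
near_max = lp_pool(window, 100)     # close to, but below, the maximum of 4.0
print(mean_like, round(rms_like, 2), round(near_max, 2))
```

The result increases monotonically with p, moving from the window mean toward the window maximum.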

### Mixed pooling

Mixed pooling combines max pooling and average pooling through a weighted sum:

**MixedPool = w * MaxPool + (1 - w) * AvgPool**

where w is a mixing coefficient in the range [0, 1]. The weight w can be set as a hyperparameter or learned during training. Mixed pooling aims to capture the benefits of both methods: the feature-preserving properties of max pooling and the smoothing effect of average pooling.
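For a single window, the weighted sum can be sketched as below. The function name `mixed_pool` is illustrative, and w is treated as a fixed hyperparameter here rather than a learned value.

```python
def mixed_pool(window, w):
    """w * max(window) + (1 - w) * mean(window), with w in [0, 1]."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    return w * max(window) + (1.0 - w) * (sum(window) / len(window))


window = [1.0, 2.0, 3.0, 4.0]
print(mixed_pool(window, 0.0))  # 2.5  (pure average pooling)
print(mixed_pool(window, 0.5))  # 3.25
print(mixed_pool(window, 1.0))  # 4.0  (pure max pooling)
```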

### Fractional max pooling

Fractional max pooling, proposed by Benjamin Graham (2015), allows non-integer reduction ratios. Rather than always reducing spatial dimensions by integer factors (such as halving with a 2x2 kernel and stride 2), fractional max pooling uses stochastically generated, non-uniform pooling regions to achieve reduction factors like 1.5x or the square root of 2. This provides finer control over the rate of spatial reduction and can improve accuracy in some settings. PyTorch includes a built-in `FractionalMaxPool2d` layer.
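One way to obtain a non-integer reduction ratio along a single axis is to mix windows of length 1 and 2 in random order. The sketch below is a simplified, one-dimensional illustration with disjoint regions; the function names are hypothetical, and the paper also describes overlapping and pseudorandom variants that are not shown here.

```python
import random


def fractional_regions(size_in, size_out, rng):
    """Split range(size_in) into size_out disjoint windows of length 1 or 2."""
    if not size_out <= size_in <= 2 * size_out:
        raise ValueError("reduction ratio must lie between 1 and 2")
    # Window lengths must sum to size_in: use (size_in - size_out) twos, rest ones.
    steps = [2] * (size_in - size_out) + [1] * (2 * size_out - size_in)
    rng.shuffle(steps)
    regions, start = [], 0
    for step in steps:
        regions.append((start, start + step))
        start += step
    return regions


def fractional_max_pool1d(x, size_out, rng):
    """Max over each randomly sized region."""
    return [max(x[a:b]) for a, b in fractional_regions(len(x), size_out, rng)]


rng = random.Random(0)
regions = fractional_regions(10, 7, rng)  # ratio 10/7, about 1.43
pooled = fractional_max_pool1d(list(range(10)), 7, rng)
print(regions)
print(len(pooled))  # 7
```

Because the window lengths are reshuffled on every call, the same input can be pooled with different region boundaries from one forward pass to the next.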

## Specialized pooling methods

### Spatial pyramid pooling (SPP)

Spatial pyramid pooling, introduced by Kaiming He et al. (2014, published in IEEE TPAMI 2015), addresses a fundamental limitation of standard CNNs: the requirement for fixed-size input images. The fixed-size constraint arises from the fully connected layers that expect inputs of a predetermined length, not from the convolutional or pooling layers themselves.

SPP replaces the final pooling layer with a multi-level pooling structure that applies max pooling at several different spatial granularities (for example, 1x1, 2x2, and 4x4 grids) and concatenates the results into a fixed-length vector. This allows the network to accept images of any size or aspect ratio.
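The fixed-length property can be illustrated on a single channel. In this sketch the bin boundaries use a floor/ceil rule, which is one common choice and should be treated as an assumption rather than the exact scheme of the original implementation; `pyramid_max_pool` and `make_map` are illustrative helpers.

```python
import math


def pyramid_max_pool(channel, levels=(1, 2, 4)):
    """Max-pool one 2D channel over n x n grids and concatenate the results."""
    h, w = len(channel), len(channel[0])
    features = []
    for n in levels:
        for i in range(n):
            r0, r1 = (i * h) // n, math.ceil((i + 1) * h / n)
            for j in range(n):
                c0, c1 = (j * w) // n, math.ceil((j + 1) * w / n)
                features.append(
                    max(channel[r][c] for r in range(r0, r1) for c in range(c0, c1))
                )
    return features


def make_map(h, w):
    """Deterministic toy feature map of the requested size."""
    return [[(r * w + c) % 10 for c in range(w)] for r in range(h)]


small = pyramid_max_pool(make_map(5, 7))
large = pyramid_max_pool(make_map(13, 9))
print(len(small), len(large))  # 21 21
```

Both inputs yield 1 + 4 + 16 = 21 values per channel, so a network with C channels would pass a vector of length 21 x C to its fully connected layers regardless of image size.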

Key results of SPPNet:

| Metric | Result |
|--------|--------|
| Speed improvement over R-CNN | 24 to 102 times faster at test time |
| ILSVRC 2014 object detection | 2nd place among 38 teams |
| ILSVRC 2014 image classification | 3rd place among 38 teams |
| Accuracy | Better or comparable to R-CNN on Pascal VOC 2007 |

SPP is a hierarchical form of global pooling and was influential in the development of subsequent [object detection](/wiki/object_detection) architectures.

### Atrous spatial pyramid pooling (ASPP)

Atrous spatial pyramid pooling extends the SPP concept by using parallel atrous (dilated) convolutions at multiple dilation rates instead of standard pooling at multiple scales. Introduced in the DeepLab family of architectures by Liang-Chieh Chen et al. (2017), ASPP captures multi-scale context for dense prediction tasks such as [semantic segmentation](/wiki/image_segmentation).

A typical ASPP module, in the form used by later DeepLab versions, contains five parallel branches:

1. A 1x1 convolution
2. A 3x3 atrous convolution with a small dilation rate (e.g., 6)
3. A 3x3 atrous convolution with a medium dilation rate (e.g., 12)
4. A 3x3 atrous convolution with a large dilation rate (e.g., 18)
5. Global average pooling followed by a 1x1 convolution

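The outputs of all branches are concatenated and passed through a final 1x1 convolution. Because dilation spreads the same 3x3 weights over a wider area, each atrous branch enlarges its effective receptive field without using more parameters or multiply-adds than an undilated 3x3 convolution, making ASPP well suited for pixel-level prediction tasks. The parallel branches themselves still add convolutional weights and computation to the network.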
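The span covered by a dilated filter follows from a standard relation: a kernel of size k with dilation d covers k + (k - 1)(d - 1) input positions per axis. The sketch below evaluates it for the example rates listed above; the helper name is illustrative.

```python
def effective_kernel_size(kernel, dilation):
    """Input positions spanned per axis by a dilated convolution kernel."""
    return kernel + (kernel - 1) * (dilation - 1)


rates = (1, 6, 12, 18)
sizes = [effective_kernel_size(3, rate) for rate in rates]
for rate, span in zip(rates, sizes):
    # The number of weights per filter stays 3 * 3 = 9 at every rate.
    print(f"dilation {rate:>2}: spans {span}x{span} input positions, 9 weights")
```

A dilation rate of 18 lets a 3x3 filter span a 37x37 region while still using nine weights per input channel.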

### Region of interest (RoI) pooling

RoI pooling, introduced by Ross Girshick in Fast R-CNN (2015), is a specialized pooling operation designed for [object detection](/wiki/object_detection). Given a feature map computed from the entire image and a set of proposed bounding box regions, RoI pooling extracts a fixed-size feature representation (e.g., 7x7) for each region by dividing it into a grid of sub-windows and applying max pooling within each sub-window.

RoI pooling enables the network to process the entire image through the convolutional backbone only once, then efficiently extract features for each region proposal. This approach was significantly faster than the original R-CNN, which ran the full CNN independently for every proposed region. Later refinements include RoI Align (Mask R-CNN, 2017), which uses bilinear interpolation instead of quantized grid cells to improve spatial precision.
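The following sketch pools two differently sized regions of one feature-map channel to the same 2x2 output. It assumes the RoI is already expressed in integer feature-map coordinates as (row_start, col_start, row_end, col_end) with exclusive ends, and uses floor/ceil bin edges; the function name and coordinate convention are illustrative.

```python
import math


def roi_max_pool(channel, roi, out_size=(2, 2)):
    """Max-pool one region of a 2D channel to a fixed out_size grid."""
    r0, c0, r1, c1 = roi
    roi_h, roi_w = r1 - r0, c1 - c0
    out_h, out_w = out_size
    pooled = []
    for i in range(out_h):
        rs = r0 + (i * roi_h) // out_h
        re = r0 + math.ceil((i + 1) * roi_h / out_h)
        row = []
        for j in range(out_w):
            cs = c0 + (j * roi_w) // out_w
            ce = c0 + math.ceil((j + 1) * roi_w / out_w)
            row.append(max(channel[r][c] for r in range(rs, re) for c in range(cs, ce)))
        pooled.append(row)
    return pooled


# 6x6 toy feature map whose value at (r, c) is r * 6 + c.
fmap = [[r * 6 + c for c in range(6)] for r in range(6)]
square_roi = roi_max_pool(fmap, (1, 2, 5, 6))  # 4x4 region
wide_roi = roi_max_pool(fmap, (0, 0, 3, 5))    # 3x5 region
print(square_roi)  # [[15, 17], [27, 29]]
print(wide_roi)    # [[8, 10], [14, 16]]
```

Rounding the bin edges to whole cells is the quantization that RoI Align later replaced with bilinear interpolation.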

### Adaptive pooling

Adaptive pooling, available in frameworks like PyTorch (`AdaptiveAvgPool2d`, `AdaptiveMaxPool2d`), automatically computes the necessary kernel size and stride to produce an output of a specified spatial size, regardless of the input dimensions. For example, `AdaptiveAvgPool2d((1, 1))` performs global average pooling on any input size. This is particularly useful for building networks that can handle variable-size inputs.

## Comparison of pooling methods

| Method | Operation | Parameters | Strengths | Weaknesses | Typical use |
|--------|-----------|------------|-----------|------------|-------------|
| Max pooling | Takes maximum in window | 0 | Preserves strong features; robust to translation | Discards all but max value | Hidden layers of classification CNNs |
| Average pooling | Computes mean in window | 0 | Smooth output; uses all activations | Dilutes strong features | Early architectures; some hidden layers |
| Global average pooling | Mean across entire channel | 0 | Eliminates FC layers; reduces overfitting | Loses all spatial information | Before final classifier |
| Global max pooling | Max across entire channel | 0 | Captures strongest activation per channel | Ignores spatial distribution | Before final classifier |
| Stochastic pooling | Multinomial sampling | 0 | Regularization effect; hyperparameter-free | Slower; non-deterministic training | Training-time regularization |
| Lp pooling | Lp norm in window | 0 or 1 (if p is learned) | Generalizes max and average | Adds complexity | Research; specialized tasks |
| Mixed pooling | Weighted max + average | 0 or 1 (if w is learned) | Balances max and average benefits | Limited practical advantage | Research |
| Fractional max pooling | Max over non-uniform regions | 0 | Finer spatial reduction control | More complex implementation | Fine-grained classification |
| Spatial pyramid pooling | Multi-scale max pooling | 0 | Handles arbitrary input sizes | Produces large feature vectors | Object detection, classification |
| RoI pooling | Max pooling over regions | 0 | Efficient multi-region feature extraction | Quantization artifacts | Object detection |
| ASPP | Parallel atrous convolutions | Yes (conv weights) | Multi-scale context capture | Higher computational cost | Semantic segmentation |

## Pooling in notable CNN architectures

The role and configuration of pooling layers have evolved significantly across major CNN architectures:

| Architecture | Year | Pooling approach | Details |
|-------------|------|-----------------|----------|
| LeNet-5 | 1998 | Average pooling (subsampling) | 2x2 average pooling layers between convolutional stages |
| [AlexNet](/wiki/alexnet) | 2012 | Overlapping max pooling | 3x3 kernel, stride 2; reduced error vs. non-overlapping |
| [VGG](/wiki/vgg) | 2014 | Max pooling | 2x2 kernel, stride 2 after each conv block |
| GoogLeNet/[Inception](/wiki/inception) | 2014 | Max pooling + global average pooling | Max pooling within Inception modules and between groups; GAP before classifier |
| SPPNet | 2014 | Spatial pyramid pooling | Multi-level pooling for fixed-length output from any input size |
| [ResNet](/wiki/resnet) | 2015 | Max pooling + global average pooling | Initial 3x3 max pool; GAP before final FC layer |
| [DenseNet](/wiki/densenet) | 2017 | Average pooling | Average pooling in transition layers between dense blocks |
| [MobileNet](/wiki/mobilenet) | 2017 | Global average pooling | GAP before classifier; no intermediate pooling (uses strided depthwise convolutions) |
| [EfficientNet](/wiki/efficientnet) | 2019 | Global average pooling | GAP before classifier; strided convolutions for downsampling |

## The pooling debate: pooling vs. strided convolutions

A growing body of research questions whether dedicated pooling layers are necessary at all. The alternative is to use convolutional layers with a [stride](/wiki/stride) greater than 1 (strided convolutions), which also reduce spatial dimensions but do so with learnable filters rather than a fixed aggregation rule.

**Arguments for strided convolutions:**
- The network learns how to downsample rather than relying on a hand-designed rule
- Strided convolutions can preserve more spatial detail when learned appropriately
- Some modern architectures (such as all-convolutional networks proposed by Springenberg et al., 2015) have shown competitive performance without any pooling layers
- Generative models such as [variational autoencoders](/wiki/variational_autoencoder) and [generative adversarial networks](/wiki/generative_adversarial_network_gan) typically avoid pooling because spatial information needs to be preserved for reconstruction

**Arguments for keeping pooling:**
- Pooling introduces no additional parameters, keeping the model compact
- Max pooling provides built-in translation invariance without extra training
- Pooling layers are computationally cheaper than strided convolutions
- Decades of empirical evidence support their effectiveness in classification tasks

The CS231n course notes at Stanford suggest that it "seems likely that future architectures will feature very few to no pooling layers," reflecting a trend toward strided convolutions in recent designs. However, pooling remains widely used in practice, and many state-of-the-art architectures still employ at least global average pooling before the classification head.

## Pooling in vision transformers

The rise of [vision transformers](/wiki/transformer) (ViTs) has introduced new pooling paradigms beyond the sliding-window operations of CNNs. The original Vision Transformer (Dosovitskiy et al., 2020) used a learnable [CLS] token, inspired by BERT, whose output serves as the image representation for classification. An alternative approach applies global average pooling across all output patch tokens to produce the classification embedding.

Research has shown that both GAP and multihead attention pooling (MAP) can match or exceed the performance of the CLS token approach in vision transformers. Some hybrid architectures, such as Swin Transformer, combine local windowed [attention](/wiki/attention) with pooling-like downsampling operations between stages.

## Backpropagation through pooling layers

Although pooling layers have no learnable parameters, gradients still need to flow through them during backpropagation to update the weights in preceding convolutional layers.

**Max pooling:** The gradient from the output is passed only to the input position that had the maximum value. All other positions receive a gradient of zero. During the forward pass, the network records the indices of the maxima (the "switches") to enable efficient gradient routing.

**Average pooling:** The gradient from the output is divided equally among all positions in the pooling window, since each position contributed equally to the mean.

**Global average pooling:** The gradient is distributed uniformly across all H x W spatial positions within each channel.
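The two routing rules can be sketched for a single channel with 2x2 windows and stride 2. The function names are illustrative; ties in the max are resolved here by taking the first maximal position, which is a choice made for this example.

```python
def max_pool_backward(x, grad_out, kernel=2, stride=2):
    """Route each output gradient to the argmax position of its window."""
    h, w = len(x), len(x[0])
    grad_in = [[0.0] * w for _ in range(h)]
    for i, grad_row in enumerate(grad_out):
        for j, g in enumerate(grad_row):
            positions = [
                (i * stride + di, j * stride + dj)
                for di in range(kernel)
                for dj in range(kernel)
            ]
            r, c = max(positions, key=lambda rc: x[rc[0]][rc[1]])
            grad_in[r][c] += g
    return grad_in


def avg_pool_backward(shape, grad_out, kernel=2, stride=2):
    """Spread each output gradient equally over its window."""
    h, w = shape
    grad_in = [[0.0] * w for _ in range(h)]
    share = 1.0 / (kernel * kernel)
    for i, grad_row in enumerate(grad_out):
        for j, g in enumerate(grad_row):
            for di in range(kernel):
                for dj in range(kernel):
                    grad_in[i * stride + di][j * stride + dj] += g * share
    return grad_in


x = [
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 9, 5],
    [1, 1, 3, 7],
]
grad_out = [[1.0, 2.0], [3.0, 4.0]]
max_grad = max_pool_backward(x, grad_out)
avg_grad = avg_pool_backward((4, 4), grad_out)
print(max_grad)  # non-zero only at the four argmax positions
print(avg_grad)  # every position receives one quarter of its window's gradient
```

In the max-pooling case only 4 of the 16 input positions receive any gradient, which is the sparsity noted among the disadvantages of max pooling.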

## Limitations and considerations

While pooling is a useful tool, it has several known limitations:

1. **Information loss.** By reducing spatial resolution, pooling discards fine-grained positional information. This can be problematic for tasks that require precise localization, such as semantic segmentation or pose estimation.
2. **Violation of shift invariance.** Although pooling is often described as providing translation invariance, standard max pooling and average pooling with stride greater than 1 subsample without adequate low-pass filtering. This ignores the Nyquist sampling criterion and causes aliasing, so the outputs are not truly shift-invariant: a one-pixel shift of the input can change the pooled output. Zhang (2019) proposed antialiased pooling (blur pooling) as a remedy, applying a low-pass filter before downsampling.
3. **Loss of spatial relationships.** Pooling reduces spatial dimensions uniformly and does not preserve the relative positions of features, which can matter for tasks requiring spatial reasoning.
4. **Fixed operation.** Standard pooling applies the same aggregation function everywhere, regardless of the content of the feature map. Learnable alternatives (such as strided convolutions or attention-based pooling) can adapt their behavior to the data.
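The shift-sensitivity described in item 2 can be reproduced on a small periodic 1D signal. This is a minimal sketch: circular indexing and the [1, 2, 1] / 4 blur filter are assumptions made to keep the example short, and the function names are illustrative rather than taken from any library.

```python
def max_pool1d(x, kernel=2, stride=2):
    """Circular 1D max pooling, so every shift of x is well defined."""
    n = len(x)
    return [max(x[(i + d) % n] for d in range(kernel)) for i in range(0, n, stride)]


def blur_pool1d(x, stride=2):
    """Antialiased variant: dense max (stride 1), blur with [1, 2, 1] / 4, then subsample."""
    n = len(x)
    dense = max_pool1d(x, kernel=2, stride=1)
    blurred = [(dense[i - 1] + 2 * dense[i] + dense[(i + 1) % n]) / 4 for i in range(n)]
    return blurred[::stride]


signal = [0, 0, 1, 1, 0, 0, 1, 1]
shifted = signal[1:] + signal[:1]  # the same signal moved by one position

print(max_pool1d(signal), max_pool1d(shifted))
# [0, 1, 0, 1] [1, 1, 1, 1]
print(blur_pool1d(signal), blur_pool1d(shifted))
# [0.5, 1.0, 0.5, 1.0] [0.75, 0.75, 0.75, 0.75]
```

A one-position shift changes the plain max-pooled output by up to 1.0, while the blurred version changes by at most 0.25, so the low-pass filter reduces the sensitivity without removing it entirely.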

## Implementation example

In PyTorch, common pooling layers are defined as follows:

```python
import torch.nn as nn

# Max pooling with 2x2 kernel and stride 2
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)

# Average pooling with 2x2 kernel and stride 2
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)

# Global average pooling (output size 1x1)
global_avg_pool = nn.AdaptiveAvgPool2d((1, 1))

# Fractional max pooling with output ratio 0.7
frac_pool = nn.FractionalMaxPool2d(kernel_size=3, output_ratio=0.7)
```

In TensorFlow/Keras:

```python
from tensorflow.keras.layers import (
    MaxPooling2D, AveragePooling2D, GlobalAveragePooling2D
)

# Max pooling with 2x2 kernel and stride 2
max_pool = MaxPooling2D(pool_size=(2, 2), strides=2)

# Average pooling with 2x2 kernel and stride 2
avg_pool = AveragePooling2D(pool_size=(2, 2), strides=2)

# Global average pooling
global_avg_pool = GlobalAveragePooling2D()
```

## See also

- [Convolutional neural network](/wiki/convolutional_neural_network)
- [Convolutional layer](/wiki/convolutional_layer)
- [Pooling](/wiki/pooling)
- [Stride](/wiki/stride)
- [Downsampling](/wiki/downsampling)
- [Feature extraction](/wiki/feature_extraction)
- [Overfitting](/wiki/overfitting)
- [Regularization](/wiki/regularization)

## References

1. LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). "Gradient-based learning applied to document recognition." *Proceedings of the IEEE*, 86(11), 2278-2324.
2. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). "ImageNet classification with deep convolutional neural networks." *Advances in Neural Information Processing Systems*, 25.
3. Lin, M., Chen, Q., & Yan, S. (2014). "Network in Network." *International Conference on Learning Representations (ICLR)*. arXiv:1312.4400.
4. Zeiler, M. D. & Fergus, R. (2013). "Stochastic pooling for regularization of deep convolutional neural networks." *International Conference on Learning Representations (ICLR)*. arXiv:1301.3557.
5. He, K., Zhang, X., Ren, S., & Sun, J. (2015). "Spatial pyramid pooling in deep convolutional networks for visual recognition." *IEEE Transactions on Pattern Analysis and Machine Intelligence*, 37(9), 1904-1916.
6. Girshick, R. (2015). "Fast R-CNN." *Proceedings of the IEEE International Conference on Computer Vision (ICCV)*, 1440-1448.
7. Szegedy, C., Liu, W., Jia, Y., et al. (2015). "Going deeper with convolutions." *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*.
8. Springenberg, J. T., Dosovitskiy, A., Brox, T., & Riedmiller, M. (2015). "Striving for simplicity: The all convolutional net." *International Conference on Learning Representations (ICLR) Workshop*. arXiv:1412.6806.
9. Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. L. (2017). "DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs." *IEEE Transactions on Pattern Analysis and Machine Intelligence*, 40(4), 834-848.
10. Graham, B. (2015). "Fractional max-pooling." arXiv:1412.6071.
11. Zhang, R. (2019). "Making convolutional networks shift-invariant again." *Proceedings of the International Conference on Machine Learning (ICML)*. arXiv:1904.11486.
12. Dosovitskiy, A., Beyer, L., Kolesnikov, A., et al. (2021). "An image is worth 16x16 words: Transformers for image recognition at scale." *International Conference on Learning Representations (ICLR)*. arXiv:2010.11929.
13. Gholamalinezhad, H. & Khosravi, H. (2020). "Pooling methods in deep neural networks, a review." arXiv:2009.07485.
14. Zafar, A., Saba, N., Arshad, A., et al. (2024). "Convolutional neural networks: A comprehensive evaluation and benchmarking of pooling layer variants." *Symmetry*, 16(11), 1516.
