See also: convolutional neural network, convolutional layer, feature map, downsampling
Pooling is a downsampling operation used in neural networks to reduce the spatial dimensions of feature maps while retaining essential information. It is a fundamental building block in convolutional neural networks (CNNs), where pooling layers typically follow convolutional layers to progressively shrink the spatial resolution of learned representations. By aggregating local regions of a feature map into single summary values, pooling achieves several goals at once: it reduces computational cost, lowers memory consumption, increases the receptive field of subsequent layers, and introduces a degree of translational invariance that helps models generalize to unseen data.
Pooling operations do not contain learnable parameters in their most common forms. Unlike convolutional layers, which learn filter weights during training, standard pooling layers apply fixed aggregation functions (such as taking the maximum or computing the mean) over local neighborhoods of the input. This makes pooling layers computationally inexpensive, easy to implement, and trivially differentiable when needed for backpropagation. The original LeNet-5 design did include trainable scale and bias coefficients per channel inside the subsampling step (LeCun et al., 1998), but the modern convention since AlexNet (2012) has been to use parameter-free max or average pooling.
The concept of spatial subsampling in neural networks dates back to the neocognitron proposed by Kunihiko Fukushima in 1980, which alternated layers of feature extraction with layers of spatial averaging. The idea was refined in Yann LeCun's LeNet-5 architecture in 1998, which used trainable subsampling layers after each convolutional stage. The specific operation of max pooling was first applied to neural networks by Yamaguchi et al. (1990) for speaker-independent isolated word recognition with time delay neural networks, and the same idea was independently advocated by Riesenhuber and Poggio (1999) as a model of complex cells in the primate visual cortex. As deep learning matured through the 2010s, pooling became a standard component of nearly every CNN architecture used for image recognition, object detection, and segmentation tasks.
A pooling layer operates on each channel of the input independently. Given an input feature map of size H x W x C (height, width, channels), the pooling operation slides a window of size p_h x p_w across each spatial plane with a specified stride s and computes a summary statistic within that window. The output feature map has reduced spatial dimensions but retains the same number of channels.
The output dimensions are calculated as follows:
Output height = floor((H - p_h + 2 * padding) / s) + 1
Output width = floor((W - p_w + 2 * padding) / s) + 1
The most common configuration uses a 2x2 window with a stride of 2 and no padding, which halves the spatial dimensions in each direction and reduces the total number of spatial elements by a factor of four. In most deep learning frameworks, the default stride equals the pool size when no stride is explicitly specified, so adjacent pooling windows do not overlap. AlexNet (Krizhevsky et al., 2012) was a notable exception: it used 3x3 windows with stride 2, which overlap by one pixel on each side. Krizhevsky reported that overlapping pooling reduced the top-1 error rate by 0.4% and the top-5 error rate by 0.3% on ImageNet compared with the equivalent non-overlapping setup, and that models with overlapping pooling were slightly harder to overfit.
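As a quick sanity check of these formulas, a small helper (the function name pool_output_size is just for illustration) can be compared against the shapes PyTorch actually produces:

import math
import torch
import torch.nn as nn

def pool_output_size(size, pool, stride, padding=0):
    # floor((size - pool + 2 * padding) / stride) + 1
    return math.floor((size - pool + 2 * padding) / stride) + 1

x = torch.randn(1, 3, 224, 224)
# Non-overlapping 2x2 pooling with stride 2 (the common default)
print(pool_output_size(224, 2, 2))            # 112
print(nn.MaxPool2d(2, stride=2)(x).shape)     # torch.Size([1, 3, 112, 112])
# AlexNet-style overlapping 3x3 pooling with stride 2
print(pool_output_size(224, 3, 2))            # 111
print(nn.MaxPool2d(3, stride=2)(x).shape)     # torch.Size([1, 3, 111, 111])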
Consider a 4x4 input feature map and a 2x2 max pooling operation with stride 2:
Input:                    Output (max pool 2x2, stride 2):
[1 3 2 4]                 [6 8]
[5 6 7 8]                 [9 3]
[9 4 3 2]
[1 2 3 1]
The top-left 2x2 region contains {1, 3, 5, 6}, so the pooled value is 6. The top-right region {2, 4, 7, 8} pools to 8. The bottom-left region {9, 4, 1, 2} pools to 9. The bottom-right region {3, 2, 3, 1} pools to 3. With average pooling the same windows would produce 3.75, 5.25, 4.0, and 2.25 respectively. The 4x4 input is reduced to 2x2 in either case.
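The same example can be reproduced with PyTorch's functional pooling operations; the following sketch builds the 4x4 map above as a tensor with batch and channel dimensions of 1:

import torch
import torch.nn.functional as F

x = torch.tensor([[1., 3., 2., 4.],
                  [5., 6., 7., 8.],
                  [9., 4., 3., 2.],
                  [1., 2., 3., 1.]]).reshape(1, 1, 4, 4)  # (batch, channels, H, W)

print(F.max_pool2d(x, kernel_size=2, stride=2))
# tensor([[[[6., 8.],
#           [9., 3.]]]])
print(F.avg_pool2d(x, kernel_size=2, stride=2))
# tensor([[[[3.7500, 5.2500],
#           [4.0000, 2.2500]]]])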
| Parameter | Description | Typical values |
|---|---|---|
| Pool size (kernel size) | The height and width of the pooling window | 2x2, 3x3 |
| Stride | The step size for sliding the window across the input | 2 (matches pool size by default) |
| Padding | Zero-padding added to the input borders before pooling | 0 (no padding) or 1 |
| Channels | Pooling operates independently on each channel | Unchanged from input |
| Dilation | Spacing between elements in the pooling window | 1 (standard); supported by some frameworks |
Although pooling has no learnable parameters, gradients still need to flow back through it during training. For max pooling, the gradient with respect to the output is routed only to the input position that produced the maximum value within each window; all other positions receive zero gradient. Frameworks store the argmax indices during the forward pass to make the backward pass efficient. For average pooling, the output gradient is divided equally among all input positions in the window. This difference has practical consequences: max pooling produces sparse gradient updates, while average pooling distributes the signal across the whole receptive field.
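A small experiment (sketch) makes the routing difference visible: backpropagating the sum of the pooled outputs leaves nonzero gradient only at the argmax positions for max pooling, while 2x2 average pooling spreads a gradient of 0.25 over every input position.

import torch
import torch.nn.functional as F

x = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4).requires_grad_(True)

F.max_pool2d(x, 2).sum().backward()
print(x.grad.reshape(4, 4))   # 1 at the four window maxima, 0 everywhere else

x.grad = None
F.avg_pool2d(x, 2).sum().backward()
print(x.grad.reshape(4, 4))   # 0.25 at every position: the gradient is split evenly per window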
Max pooling selects the maximum value within each pooling window. For a 2x2 window the operation examines four values and outputs the single largest one. This preserves the strongest activations in each local region, which often correspond to the most prominent features such as edges, corners, and textures. The intuition is that ReLU activations are non-negative and a high value indicates the presence of a feature, so taking the maximum keeps the evidence that the feature exists somewhere in the window without committing to its exact location.
Max pooling is the most widely used pooling variant in modern CNNs. It was applied to neural networks for speech by Yamaguchi et al. (1990), generalized as a model of cortical complex cells by Riesenhuber and Poggio (1999), used in the Cresceptron of Weng, Ahuja and Huang (1992) for image processing, and popularized for deep learning by AlexNet (Krizhevsky et al., 2012). It remains the default pooling operation in architectures such as VGGNet (Simonyan and Zisserman, 2014) and the early stages of ResNet.
For a pooling region R, max pooling computes:
y = max(x_i) for all x_i in R
Max pooling provides a form of translational invariance: if a feature such as an edge shifts slightly within the pooling window, the output remains the same as long as the maximum value does not change. This property helps CNNs recognize patterns regardless of their exact spatial position, although the invariance is local and breaks down for shifts larger than the pool size.
Average pooling computes the arithmetic mean of all values in the pooling window. For a pooling region R containing n elements:
y = (1/n) * sum(x_i) for all x_i in R
Average pooling produces smoother output feature maps compared with max pooling because it considers all values rather than only the peak activation. It was the original form of pooling used in LeCun's LeNet-5 (1998), where the subsampling layers computed the sum of four neighboring inputs, multiplied by a trainable coefficient, added a trainable bias, and passed the result through a sigmoid. Each S2 channel had only two trainable parameters (one weight and one bias), and the same scheme was used in S4.
In practice, max pooling tends to outperform average pooling in most classification tasks because it preserves the strongest signals and produces sparser, more discriminative features. However, average pooling can be preferable when smoother feature representations are desired, such as in certain generative models, when the input contains significant noise, or in transition layers that aggregate dense features. DenseNet (Huang et al., 2017) uses 2x2 average pooling in its transition layers between dense blocks for this reason.
Global average pooling (GAP) computes the mean of each entire feature map channel, collapsing the spatial dimensions to a single value per channel. For an input of size H x W x C, GAP produces an output of size 1 x 1 x C.
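In code, GAP is simply a mean over the spatial dimensions; an adaptive pooling layer with output size 1 and an explicit .mean() give the same result (a minimal sketch):

import torch
import torch.nn as nn

x = torch.randn(8, 512, 7, 7)         # a batch of 7x7 feature maps with 512 channels
gap = nn.AdaptiveAvgPool2d(1)

a = gap(x).flatten(1)                 # (8, 512)
b = x.mean(dim=(2, 3))                # (8, 512)
print(torch.allclose(a, b))           # True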
GAP was introduced by Lin, Chen and Yan (2014) in the "Network in Network" paper as a replacement for the fully connected layers that traditionally appeared at the end of CNN classifiers. In their approach, the final convolutional layer produces one feature map per class and GAP averages each map into a single scalar that is fed directly into a softmax. This eliminates the large fully connected layers that dominated the parameter count in architectures like AlexNet and VGGNet. The original NIN paper notes that GAP enforces correspondence between feature maps and classes, which makes the final layer easier to interpret and acts as a structural regularizer because there are no parameters in the GAP itself.
The advantages of GAP over fully connected layers include:
- No learnable parameters, which shrinks the model and acts as a structural regularizer against overfitting.
- A direct correspondence between feature maps and output categories, which makes the final layer easier to interpret.
- Greater robustness to spatial translations, since each channel is summarized by its mean rather than by position-specific weights.
- Compatibility with variable input sizes, because the spatial average is defined for any H x W.
GAP became a standard component in modern architectures. GoogLeNet (Szegedy et al., 2015), ResNet (He et al., 2016), DenseNet (Huang et al., 2017), and EfficientNet (Tan and Le, 2019) all use GAP before the final classification layer. GAP is also a key ingredient in class activation maps (Zhou et al., 2016), where the spatial average is reversed to localize discriminative regions per class.
Global max pooling takes the maximum value from each entire feature map channel, producing a 1 x 1 x C output from an H x W x C input. It is less common than GAP in classification networks but is used in certain architectures and tasks where preserving the single strongest activation per channel is more informative than the average response. Multiple-instance learning models, weakly supervised localization heads, and some text classifiers (Kim, 2014) use global max pooling because it picks out the most active feature regardless of how many positions support it.
Stochastic pooling, proposed by Zeiler and Fergus (2013), replaces the deterministic selection of max or average pooling with a probabilistic sampling procedure. Within each pooling region the activation values are first normalized to form a probability distribution, and a single value is then sampled according to this multinomial distribution. During training the random sampling acts as a regularizer in a similar spirit to dropout, preventing the network from relying too heavily on any single activation. During inference the method uses a probability-weighted average, which is equivalent to average pooling in expectation.
Stochastic pooling achieved state-of-the-art results on several benchmarks at the time of its introduction and demonstrated that pooling operations could serve as regularization mechanisms. It is hyperparameter-free and can be combined with other regularization techniques like dropout and data augmentation.
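A rough sketch of the training-time behavior, assuming non-negative (post-ReLU) activations and a non-overlapping 2x2 window; the helper name stochastic_pool2d is made up for illustration:

import torch

def stochastic_pool2d(x, k=2):
    # x: (N, C, H, W) with non-negative activations; H and W assumed divisible by k
    n, c, h, w = x.shape
    windows = x.unfold(2, k, k).unfold(3, k, k).reshape(n, c, h // k, w // k, k * k)
    flat = windows.reshape(-1, k * k)
    totals = flat.sum(dim=1, keepdim=True)
    # Normalize each window's activations into a probability distribution;
    # fall back to a uniform distribution when a window is all zeros.
    probs = torch.where(totals > 0, flat / totals.clamp(min=1e-12),
                        torch.full_like(flat, 1.0 / (k * k)))
    idx = torch.multinomial(probs, num_samples=1)   # sample one activation per window
    return flat.gather(1, idx).reshape(n, c, h // k, w // k)

print(stochastic_pool2d(torch.rand(1, 3, 8, 8)).shape)  # torch.Size([1, 3, 4, 4])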
Fractional max pooling, introduced by Graham (2014), generalizes the integer downsampling factor used by standard pooling to non-integer values. Conventional max pooling reduces the spatial size by an integer factor alpha (typically 2), which can cause spatial information to be discarded too rapidly through a deep network. Graham proposed allowing alpha to take non-integer values such as sqrt(2) (approximately 1.41), which permits a more gradual reduction spread across more layers.
Because non-integer pooling regions cannot tile a grid exactly, the construction is stochastic: for each layer the pooling regions are sampled from one of several valid pseudo-random tilings, which introduces additional regularization during training. Graham reported state-of-the-art results on CIFAR-100 without dropout and competitive performance on CIFAR-10, MNIST and assorted other benchmarks.
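PyTorch exposes this operation as nn.FractionalMaxPool2d, where the caller specifies an output size or output ratio and the pseudo-random pooling regions are generated internally. A short example:

import torch
import torch.nn as nn

# Shrink each spatial dimension by roughly sqrt(2) instead of a factor of 2
frac_pool = nn.FractionalMaxPool2d(kernel_size=2, output_ratio=(0.7, 0.7))
x = torch.randn(1, 3, 32, 32)
print(frac_pool(x).shape)  # torch.Size([1, 3, 22, 22])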
Mixed pooling, proposed by Yu et al. (2014), combines max and average pooling stochastically. For each pooling window a random parameter lambda in {0, 1} chooses whether to apply max pooling (lambda = 1) or average pooling (lambda = 0). The chosen lambda is recorded so that the backward pass uses the matching gradient rule. Yu et al. reported that mixed pooling outperformed both max and average pooling on several image classification benchmarks. Subsequent variants include gated pooling and tree-structured pooling (Lee, Gallagher and Tu, 2016), which learn the mixing weights rather than sampling them.
Spatial pyramid pooling (SPP) was introduced by He et al. (2015) in the SPP-Net paper. Traditional CNNs require fixed-size input images because the fully connected layers expect a specific input dimension. SPP solves this constraint by applying pooling at multiple scales simultaneously. See also spatial pyramid pooling.
The SPP layer divides the input feature map into a hierarchy of grids (for example 1x1, 2x2 and 4x4 regions), performs max pooling within each grid cell, and concatenates all the results into a single fixed-length vector. With these three levels and a 256-channel feature map, the resulting vector has length 256 x (1 + 4 + 16) = 5,376 regardless of the input image size or aspect ratio. The classifier then operates on this fixed vector.
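A compact way to sketch an SPP layer in a modern framework is to run adaptive max pooling once per pyramid level and concatenate the flattened results (the helper name spp is illustrative; the levels and channel count match the example above):

import torch
import torch.nn.functional as F

def spp(x, levels=(1, 2, 4)):
    # x: (N, C, H, W) -> (N, C * sum(l*l for l in levels)), independent of H and W
    pooled = [F.adaptive_max_pool2d(x, output_size=l).flatten(1) for l in levels]
    return torch.cat(pooled, dim=1)

print(spp(torch.randn(1, 256, 13, 13)).shape)  # torch.Size([1, 5376])
print(spp(torch.randn(1, 256, 24, 17)).shape)  # torch.Size([1, 5376]), same length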
For object detection, SPP-Net computed convolutional features from the entire image only once and then pooled features from arbitrary sub-regions, making feature computation 30 to 170 times faster than R-CNN at test time while achieving better or comparable accuracy on the Pascal VOC 2007 benchmark.
Atrous spatial pyramid pooling (ASPP) was introduced in the DeepLab family of semantic segmentation models (Chen et al., 2017). Instead of pooling at multiple grid resolutions like SPP, ASPP applies several parallel atrous (dilated) convolutions to the same feature map, each with a different dilation rate. The outputs are concatenated and projected to the segmentation logits. By varying the dilation rate, ASPP captures context at multiple effective receptive fields without enlarging the kernels' parameter count and without downsampling the feature map. DeepLabv3 (Chen et al., 2017) and DeepLabv3+ (Chen et al., 2018) added a 1x1 convolution branch and an image-level GAP branch to the ASPP module to capture both local and global context.
Lp pooling generalizes max pooling and average pooling through the Lp norm. For a pooling region R:
y = (sum(|x_i|^p) / n)^(1/p)
When p = 1, this reduces to average pooling (of absolute values). As p approaches infinity, the result converges to max pooling. The case p = 2 (root mean square pooling, sometimes called L2 pooling) is biologically motivated and was studied by Sermanet, Chintala and LeCun (2012) for house number digit classification.
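With the definition above, Lp pooling can be written directly in terms of average pooling (a sketch; note that PyTorch's built-in nn.LPPool2d uses the unnormalized sum rather than the mean, so it differs from this formula by a constant factor per window size):

import torch
import torch.nn.functional as F

def lp_pool2d(x, p, kernel_size, stride=None):
    # ( mean(|x|^p) )^(1/p) over each pooling window
    return F.avg_pool2d(x.abs().pow(p), kernel_size, stride).pow(1.0 / p)

x = torch.randn(1, 3, 8, 8)
print(torch.allclose(lp_pool2d(x, p=1, kernel_size=2), F.avg_pool2d(x.abs(), 2)))  # True
print(lp_pool2d(x, p=10, kernel_size=2).shape)  # torch.Size([1, 3, 4, 4]); large p approaches max pooling of |x|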
Sum pooling outputs the sum rather than the mean of the values in each window. It is mathematically equivalent to average pooling up to a constant scaling factor and is rarely used as a standalone layer in modern architectures, although it appears occasionally in graph pooling and in Bag-of-Visual-Words pipelines.
Local response normalization (LRN), used in AlexNet, is not strictly a pooling operation but is sometimes grouped with it because it performs a local aggregation across nearby activations. It normalizes each activation by a function of the squared activations in a neighborhood (across channels in AlexNet) and has largely been replaced by batch normalization in modern networks.
| Pooling type | Operation | Learnable parameters | Key strength | Typical use case |
|---|---|---|---|---|
| Max pooling | Maximum value in window | No | Preserves strongest features | Most CNN classifiers and detectors |
| Average pooling | Mean of values in window | No | Produces smooth representations | Early architectures (LeNet), DenseNet transitions, generative models |
| Global average pooling | Mean over entire feature map | No | Eliminates fully connected layers | Final layer before softmax in modern CNNs |
| Global max pooling | Max over entire feature map | No | Preserves strongest activation per channel | Multiple-instance learning, text CNNs |
| Stochastic pooling | Probabilistic sampling from region | No | Regularization during training | Small training sets, CIFAR-scale benchmarks |
| Fractional max pooling | Non-integer downsampling factor with random tilings | No | Slow, fine-grained downsampling and regularization | Deep networks on small datasets |
| Mixed pooling | Random or learned mix of max and average | Optional | Combines smoothing and selection | Regularizing CNN classifiers |
| Spatial pyramid pooling | Multi-scale grid pooling, fixed output length | No | Handles variable input sizes | Object detection (SPP-Net) |
| Atrous spatial pyramid pooling | Parallel dilated convolutions at multiple rates | Yes (the convolutions) | Multi-scale context for dense prediction | Semantic segmentation (DeepLab) |
| Lp pooling | Lp norm over region | No (p is a hyperparameter) | Tunable between max and average | Specialized applications |
| Overlapping max pooling | Max with stride less than kernel size | No | Slight regularization, denser sampling | AlexNet (3x3 window, stride 2) |
Object detection models need to extract a fixed-size feature representation from variably sized region proposals so that downstream classifier and regression heads can operate on a uniform input. Pooling provides this with several specialized variants.
Region of Interest (RoI) Pooling was introduced as part of Fast R-CNN (Girshick, 2015). The RoI pooling layer takes two inputs: a convolutional feature map computed once for the whole image, and a list of region proposals encoded as bounding boxes. For each RoI it converts the corresponding sub-region of the feature map to a fixed H x W output (typically 7x7) by dividing the RoI into an H x W grid of sub-windows and applying max pooling within each sub-window. The result feeds into the per-RoI fully connected head for classification and bounding box regression.
Because the convolutional feature map is shared across all proposals, Fast R-CNN avoids recomputing the convolutional backbone for each proposal as the original R-CNN did, which produces a large speedup. Faster R-CNN (Ren et al., 2015) keeps the same RoI pooling layer and replaces the external selective search proposals with a learned region proposal network. See also RoI pooling.
A limitation of RoI pooling is that it quantizes the floating-point RoI coordinates twice: once when mapping the RoI onto the discrete feature map grid, and again when dividing the RoI into sub-windows. These quantizations introduce small misalignments between the RoI and the pooled features. While the misalignment has little effect on classification, it materially hurts pixel-accurate tasks like instance segmentation.
RoIAlign, introduced in Mask R-CNN (He et al., 2017), removes both quantizations. The RoI is divided into bins using exact floating-point coordinates, four sample points are placed inside each bin, and the feature map is sampled at those exact points using bilinear interpolation. The bin output is then taken as the max or average of the four samples. He et al. reported that RoIAlign improves mask accuracy by relative margins of roughly 10% to 50% over RoI pooling on the COCO benchmark, with the largest gains under strict localization metrics, and gives more modest but still meaningful gains for bounding box detection.
R-FCN (Dai et al., 2016) proposed position-sensitive RoI pooling (PSRoIPool) to make detectors fully convolutional. Instead of pooling from a generic feature map, the network produces k x k position-sensitive score maps per class, where each score map encodes the response for one spatial position within the RoI grid (top-left, top-middle, and so on). PSRoIPool then pools each k x k bin from its corresponding score map, encoding spatial position information without per-RoI fully connected layers. Using ResNet-101, R-FCN achieved 83.6% mAP on PASCAL VOC 2007 at 170 ms per image, which is 2.5x to 20x faster than the equivalent Faster R-CNN model. PSRoIPool is available in torchvision as ps_roi_pool.
| Variant | Paper | Year | Quantization | Output shape | Notable use |
|---|---|---|---|---|---|
| RoI pooling | Fast R-CNN (Girshick) | 2015 | Yes, on RoI and bins | Fixed H x W per RoI | Fast R-CNN, Faster R-CNN |
| Position-sensitive RoI pool | R-FCN (Dai et al.) | 2016 | Yes | k x k per RoI | R-FCN |
| RoIAlign | Mask R-CNN (He et al.) | 2017 | None (bilinear sampling) | Fixed H x W per RoI | Mask R-CNN, YOLO variants with RoI heads |
| Precise RoI pooling | Jiang et al. | 2018 | None (continuous integral) | Fixed H x W per RoI | IoU-Net |
Pooling serves several purposes in neural network architectures. Shrinking the spatial dimensions of feature maps reduces the cost of subsequent layers; a 2x2 pooling with stride 2 cuts the number of spatial elements by 75%, which compounds across multiple layers. Pooling also gives the network a degree of translational invariance: if a feature shifts by a few pixels, the pooled output may remain identical, which helps CNNs recognize objects regardless of precise location (the invariance is local and approximate rather than global). Each pooling step roughly doubles the effective receptive field of neurons in the next layer, so deeper layers can integrate information from larger regions of the original input. Reducing spatial dimensions also decreases the total parameter count, acting as implicit regularization, and stochastic, fractional and mixed pooling go further by injecting noise into the operation itself. Finally, smaller feature maps require less memory and fewer floating-point operations, which speeds up both training and inference and is critical for deployment on resource-constrained devices.
The necessity of dedicated pooling layers has been debated since the mid-2010s. Pooling discards spatial detail, which hurts tasks that need pixel-accurate predictions such as semantic segmentation, depth estimation, super-resolution and generation. U-Net uses pooling in the encoder but compensates with skip connections that re-inject the pre-pooled features. Standard pooling also cannot adapt its downsampling rule to the data, while strided convolutions use weights that the rest of the network learns, so the downsampling behavior is shaped by gradient descent.
Springenberg, Dosovitskiy, Brox and Riedmiller (2014) showed in "Striving for Simplicity: The All Convolutional Net" that max pooling can be replaced by a stride-2 convolutional layer without loss of accuracy on CIFAR-10, CIFAR-100 and ImageNet. They argued that pooling is a special case of strided convolution with hand-designed weights and that letting the network learn the downsampling is at least as good. This finding heavily influenced later CNN design. ResNet uses one 3x3 max pool early in the network and replaces all subsequent downsampling with stride-2 convolutions inside the residual stages, and ResNeXt, EfficientNet and most contemporary CNN backbones follow the same pattern. Vision transformers replace pooling with a patch embedding step that splits the image into non-overlapping patches and projects each one to a token embedding, effectively merging the first conv plus pool stage into a single linear projection.
Despite these alternatives, pooling has not disappeared. GAP remains the standard final aggregation in nearly every modern CNN classifier, and detection heads still rely on RoIAlign or its variants.
In the standard CNN pipeline, pooling layers are inserted after one or more convolutional layers. A typical block consists of a convolutional layer, an activation such as ReLU, and a pooling layer, repeated several times. Each stage reduces spatial dimensions while increasing the number of feature channels. A classic VGG-style progression looks like this:
Input (224x224x3) -> Conv + ReLU (224x224x64) -> MaxPool 2x2 (112x112x64) -> Conv + ReLU (112x112x128) -> MaxPool 2x2 (56x56x128) -> ... -> Flatten -> Fully connected layers -> Softmax (the original VGG ends in fully connected layers; most later architectures substitute GAP)
Notable CNN architectures and their use of pooling:
| Architecture | Year | Pooling strategy |
|---|---|---|
| LeNet-5 | 1998 | Trainable subsampling (sum, scale, bias, sigmoid) after each conv block |
| AlexNet | 2012 | 3x3 overlapping max pooling with stride 2; LRN |
| VGGNet | 2014 | 2x2 max pooling with stride 2 after each conv block |
| GoogLeNet (Inception) | 2015 | Max pooling within Inception modules, GAP before classifier |
| ResNet | 2016 | Single 3x3 max pool early on, stride-2 convolutions for the rest, GAP before classifier |
| DenseNet | 2017 | 2x2 average pooling in transition layers, GAP at the end |
| MobileNet / EfficientNet | 2017 / 2019 | Stride-2 depthwise convolutions for downsampling, GAP before classifier |
| Vision Transformer | 2021 | Patch embedding replaces early conv+pool; CLS token or GAP at the end |
Although pooling originated in computer vision, it plays an important role in transformer-based models as well. When a transformer like BERT processes a sentence it produces a sequence of token embeddings. Many downstream tasks (classification, semantic similarity, retrieval) require a single fixed-length vector to represent the entire input. Pooling provides the mechanism for aggregating token-level representations into a sentence-level embedding.
BERT and similar models prepend a special [CLS] token to the input sequence, and the final hidden state of this token is treated as the sentence representation. The model is trained so that information from the rest of the sequence flows into the CLS position via self-attention. CLS pooling works well when the model has been fine-tuned on a downstream classification task, but the raw CLS embedding from a pretrained BERT does not capture sentence semantics well in zero-shot settings (Reimers and Gurevych, 2019).
Averaging the hidden states of all tokens (excluding padding) produces a balanced representation that gives every token equal influence. Mean pooling is the default strategy in the Sentence-BERT library (Reimers and Gurevych, 2019), where the authors compared CLS, mean and max pooling and found mean pooling produced the best results on STS and NLI benchmarks. Most modern open-source sentence encoders, including the sentence-transformers family, default to mean pooling.
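A typical masked mean-pooling implementation over transformer outputs looks like the following sketch; the tensor shapes are illustrative and no particular model is loaded:

import torch

def mean_pool(last_hidden_state, attention_mask):
    # last_hidden_state: (batch, seq_len, dim); attention_mask: (batch, seq_len) of 0/1
    mask = attention_mask.unsqueeze(-1).float()       # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)    # padded positions contribute nothing
    counts = mask.sum(dim=1).clamp(min=1e-9)          # number of real tokens per sequence
    return summed / counts                            # (batch, dim)

hidden = torch.randn(2, 10, 768)
mask = torch.tensor([[1] * 10, [1] * 6 + [0] * 4])
print(mean_pool(hidden, mask).shape)  # torch.Size([2, 768])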
Taking the element-wise maximum across all token hidden states captures the most salient features at each embedding dimension. Max pooling can be useful for tasks where specific keywords or phrases carry disproportionate importance, although Reimers and Gurevych reported it underperformed mean pooling for general semantic similarity.
Applying learned or heuristic weights to each token before averaging allows the model to emphasize certain positions. A common variant uses the attention mask only (treating padded tokens as zero weight), while more elaborate schemes weight tokens by inverse document frequency, by self-attention scores, or by a learned linear projection.
Attention pooling generalizes mean pooling by computing a learned weighted combination of token vectors. A small set of learnable query vectors attends to the token sequence via a single multi-head attention layer, and the resulting vectors form the pooled representation. Pooling by Multi-head Attention (PMA), introduced in the Set Transformer (Lee et al., 2019), uses a fixed number of seed queries and lets attention decide which input elements to focus on for each seed. Attention pooling is now common in image classification heads (CaiT, BEiT, DINOv2), in dense retrieval (Contriever, NV-Embed), and in audio and video models.
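A minimal PMA-style attention pooling head can be sketched with a learnable seed query and a standard multi-head attention layer (the module name and sizes are illustrative and not taken from any particular paper's code):

import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    def __init__(self, dim, num_heads=8, num_seeds=1):
        super().__init__()
        self.seed = nn.Parameter(torch.randn(1, num_seeds, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens):            # tokens: (batch, seq_len, dim)
        query = self.seed.expand(tokens.size(0), -1, -1)
        pooled, _ = self.attn(query, tokens, tokens)
        return pooled                     # (batch, num_seeds, dim)

pool = AttentionPooling(dim=768)
print(pool(torch.randn(2, 50, 768)).shape)  # torch.Size([2, 1, 768])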
For large language models with causal attention, the standard pooling choice for embedding tasks is the last token's hidden state, which by virtue of the causal mask has attended over the full input. Some encoders use the mean of the last layer's hidden states, others mix several layers.
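Assuming right-padded batches, last-token pooling amounts to indexing each sequence at its final non-padded position (a sketch with illustrative shapes):

import torch

def last_token_pool(last_hidden_state, attention_mask):
    # last_hidden_state: (batch, seq_len, dim); attention_mask: (batch, seq_len) of 0/1
    last_idx = attention_mask.sum(dim=1) - 1              # index of the final real token
    batch_idx = torch.arange(last_hidden_state.size(0))
    return last_hidden_state[batch_idx, last_idx]         # (batch, dim)

hidden = torch.randn(2, 10, 768)
mask = torch.tensor([[1] * 10, [1] * 7 + [0] * 3])
print(last_token_pool(hidden, mask).shape)  # torch.Size([2, 768])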
Vision transformers (ViTs) blend the BERT-style CLS-token approach with options inherited from CNNs. The original ViT (Dosovitskiy et al., 2021) uses a CLS token in emulation of BERT and reports that GAP over the patch tokens reaches comparable accuracy after careful tuning. Subsequent papers have shown that GAP and attention pooling often outperform the CLS token, especially when the model is trained with strong augmentation regimes or self-supervised objectives. CaiT (Touvron et al., 2021), Swin Transformer (Liu et al., 2021), and BEiT v2 (Peng et al., 2022) use GAP or attention pooling rather than CLS at the classifier head.
A related observation from Raghu et al. (2021) is that ViTs trained with GAP show less localized attention patterns than CLS-trained ViTs. The choice between CLS and GAP therefore affects not only accuracy but also interpretability tools like attention rollouts.
Adaptive pooling is a variant where the user specifies the desired output size rather than the kernel size and stride. The framework automatically computes the necessary kernel size and stride to produce the requested output dimensions from whatever input size is provided. This makes architectures more flexible, since the same model can process inputs of varying spatial dimensions.
Adaptive pooling is particularly useful for:
- Handling datasets with variable image sizes without resizing every input to a fixed resolution.
- Producing a fixed-size representation before fully connected layers, regardless of the backbone's output resolution.
- Implementing global pooling as the special case of output size (1, 1).
In PyTorch, adaptive pooling is available as nn.AdaptiveAvgPool2d and nn.AdaptiveMaxPool2d. Setting the output size to (1, 1) is equivalent to global pooling.
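The key property is that the same layer maps differently sized inputs to the same output size, as a short check illustrates:

import torch
import torch.nn as nn

adaptive = nn.AdaptiveAvgPool2d(output_size=(7, 7))
print(adaptive(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 7, 7])
print(adaptive(torch.randn(1, 64, 45, 61)).shape)  # torch.Size([1, 64, 7, 7])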
PyTorch provides pooling layers in the torch.nn module and detection-specific pooling in torchvision.ops.
import torch
import torch.nn as nn
from torchvision.ops import roi_pool, roi_align, ps_roi_pool
# Max pooling: 2x2 window, stride 2
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
# Average pooling: 2x2 window, stride 2
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)
# Global average pooling (adaptive, output size 1x1)
global_avg_pool = nn.AdaptiveAvgPool2d(output_size=(1, 1))
# Example: input tensor with batch=1, channels=3, height=8, width=8
x = torch.randn(1, 3, 8, 8)
print(max_pool(x).shape) # torch.Size([1, 3, 4, 4])
print(avg_pool(x).shape) # torch.Size([1, 3, 4, 4])
print(global_avg_pool(x).shape) # torch.Size([1, 3, 1, 1])
# RoI pooling and RoIAlign: input is (N, C, H, W); rois is (K, 5)
# where each row is (batch_idx, x1, y1, x2, y2) in input coordinates.
feats = torch.randn(1, 256, 32, 32)
rois = torch.tensor([[0, 4.0, 4.0, 20.0, 20.0]])
pooled = roi_pool(feats, rois, output_size=(7, 7), spatial_scale=1.0)
aligned = roi_align(feats, rois, output_size=(7, 7), spatial_scale=1.0,
sampling_ratio=2, aligned=True)
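# Each per-RoI output has shape (num_rois, C, 7, 7)
print(pooled.shape)   # torch.Size([1, 256, 7, 7])
print(aligned.shape)  # torch.Size([1, 256, 7, 7])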
PyTorch also provides 1D and 3D variants (MaxPool1d, MaxPool3d, AvgPool1d, AvgPool3d) for sequential and volumetric data.
In TensorFlow / Keras, pooling layers are available in tf.keras.layers.
import tensorflow as tf
# Max pooling: 2x2 window, stride 2
max_pool = tf.keras.layers.MaxPooling2D(pool_size=(2, 2), strides=2)
# Average pooling: 2x2 window, stride 2
avg_pool = tf.keras.layers.AveragePooling2D(pool_size=(2, 2), strides=2)
# Global pooling
global_avg_pool = tf.keras.layers.GlobalAveragePooling2D()
global_max_pool = tf.keras.layers.GlobalMaxPooling2D()
Keras layers accept data_format as either "channels_last" (default, NHWC) or "channels_first" (NCHW), and padding as either "valid" (no padding) or "same" (zero-padding to preserve dimensions).
Flax exposes pooling through flax.linen.max_pool and flax.linen.avg_pool, both of which accept window_shape and strides arguments. The sentence-transformers library provides a models.Pooling module that wraps mean, max, and CLS strategies for transformer outputs and lets the user combine several modes by element-wise concatenation.
Pooling techniques appear in a wide range of deep learning applications, including image classification, object detection, semantic segmentation, sentence and document embedding, and audio and video modeling.
A few rules of thumb that follow from common practice:
- Use 2x2 max pooling with stride 2 (or a stride-2 convolution) as the default way to halve spatial resolution in a CNN stage.
- Prefer global average pooling over large fully connected layers at the end of a classifier.
- Use average pooling where smoother representations help, such as DenseNet-style transition layers or noisy inputs.
- Choose RoIAlign rather than RoI pooling for detection and segmentation heads that need precise localization.
- For transformer sentence embeddings, mean pooling over non-padded tokens is a strong default, and last-token pooling is the usual choice for causal language models.
Imagine you have a big picture made of lots of tiny colored squares. Pooling is like squishing that picture down to make it smaller. One way to squish is to look at a small group of squares and keep only the brightest one (that is max pooling). Another way is to mix all the colors in the group together to get the average color (that is average pooling). Either way the picture gets smaller but you can still tell what it is. This helps the computer work faster because it has fewer squares to look at, and it also means the computer does not care if the cat in the picture is a tiny bit to the left or the right.