See also: convolutional neural network, convolutional layer, feature map, downsampling
Pooling is a downsampling operation used in neural networks to reduce the spatial dimensions of feature maps while retaining essential information. It is a fundamental building block in convolutional neural networks (CNNs), where pooling layers typically follow convolutional layers to progressively shrink the spatial resolution of learned representations. By aggregating local regions of a feature map into single summary values, pooling achieves several goals at once: it reduces computational cost, lowers memory consumption, increases the receptive field of subsequent layers, and introduces a degree of translational invariance that helps models generalize to unseen data.
Pooling operations do not contain learnable parameters in their most common forms. Unlike convolutional layers, which learn filter weights during training, standard pooling layers apply fixed aggregation functions (such as taking the maximum or computing the mean) over local neighborhoods of the input. This makes pooling layers computationally inexpensive, easy to implement, and trivially differentiable when needed for backpropagation. The original LeNet-5 design did include trainable scale and bias coefficients per channel inside the subsampling step (LeCun et al., 1998), but the modern convention since AlexNet (2012) has been to use parameter-free max or average pooling.
The concept of spatial subsampling in neural networks dates back to the neocognitron proposed by Kunihiko Fukushima in 1980, which alternated layers of feature extraction with layers of spatial averaging. The idea was refined in Yann LeCun's LeNet-5 architecture in 1998, which used trainable subsampling layers after each convolutional stage. The specific operation of max pooling was first applied to neural networks by Yamaguchi et al. (1990) for speaker-independent isolated word recognition with time delay neural networks, and the same idea was independently advocated by Riesenhuber and Poggio (1999) as a model of complex cells in the primate visual cortex. As deep learning matured through the 2010s, pooling became a standard component of nearly every CNN architecture used for image recognition, object detection, and segmentation tasks.
A pooling layer operates on each channel of the input independently. Given an input feature map of size H x W x C (height, width, channels), the pooling operation slides a window of size p_h x p_w across each spatial plane with a specified stride s and computes a summary statistic within that window. The output feature map has reduced spatial dimensions but retains the same number of channels.
The output dimensions are calculated as follows:
Output height = floor((H - p_h + 2 * padding) / s) + 1
Output width = floor((W - p_w + 2 * padding) / s) + 1
The most common configuration uses a 2x2 window with a stride of 2 and no padding, which halves the spatial dimensions in each direction and reduces the total number of spatial elements by a factor of four. In most deep learning frameworks, the default stride equals the pool size when no stride is explicitly specified, so adjacent pooling windows do not overlap. AlexNet (Krizhevsky et al., 2012) was a notable exception: it used 3x3 windows with stride 2, which overlap by one pixel on each side. Krizhevsky reported that overlapping pooling reduced the top-1 error rate by 0.4% and the top-5 error rate by 0.3% on ImageNet compared with the equivalent non-overlapping setup, and that models with overlapping pooling were slightly harder to overfit.
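As a quick sanity check of these formulas, a small helper (the function name pool_output_size is just for illustration) can be compared against the shapes PyTorch actually produces:

import math
import torch
import torch.nn as nn

def pool_output_size(size, pool, stride, padding=0):
    # floor((size - pool + 2 * padding) / stride) + 1
    return math.floor((size - pool + 2 * padding) / stride) + 1

x = torch.randn(1, 3, 224, 224)
# Non-overlapping 2x2 pooling with stride 2 (the common default)
print(pool_output_size(224, 2, 2))            # 112
print(nn.MaxPool2d(2, stride=2)(x).shape)     # torch.Size([1, 3, 112, 112])
# AlexNet-style overlapping 3x3 pooling with stride 2
print(pool_output_size(224, 3, 2))            # 111
print(nn.MaxPool2d(3, stride=2)(x).shape)     # torch.Size([1, 3, 111, 111])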
Consider a 4x4 input feature map and a 2x2 max pooling operation with stride 2:
Input:                    Output (max pool 2x2, stride 2):
[1 3 2 4]                 [6 8]
[5 6 7 8]                 [9 3]
[9 4 3 2]
[1 2 3 1]
The top-left 2x2 region contains {1, 3, 5, 6}, so the pooled value is 6. The top-right region {2, 4, 7, 8} pools to 8. The bottom-left region {9, 4, 1, 2} pools to 9. The bottom-right region {3, 2, 3, 1} pools to 3. With average pooling the same windows would produce 3.75, 5.25, 4.0, and 2.25 respectively. The 4x4 input is reduced to 2x2 in either case.
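The same example can be reproduced with PyTorch's functional pooling operations; the following sketch builds the 4x4 map above as a tensor with batch and channel dimensions of 1:

import torch
import torch.nn.functional as F

x = torch.tensor([[1., 3., 2., 4.],
                  [5., 6., 7., 8.],
                  [9., 4., 3., 2.],
                  [1., 2., 3., 1.]]).reshape(1, 1, 4, 4)  # (batch, channels, H, W)

print(F.max_pool2d(x, kernel_size=2, stride=2))
# tensor([[[[6., 8.],
#           [9., 3.]]]])
print(F.avg_pool2d(x, kernel_size=2, stride=2))
# tensor([[[[3.7500, 5.2500],
#           [4.0000, 2.2500]]]])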
| Parameter | Description | Typical values |
|---|---|---|
| Pool size (kernel size) | The height and width of the pooling window | 2x2, 3x3 |
| Stride | The step size for sliding the window across the input | 2 (matches pool size by default) |
| Padding | Zero-padding added to the input borders before pooling | 0 (no padding) or 1 |
| Channels | Pooling operates independently on each channel | Unchanged from input |
| Dilation | Spacing between elements in the pooling window | 1 (standard); supported by some frameworks |
Although pooling has no learnable parameters, gradients still need to flow back through it during training. For max pooling, the gradient with respect to the output is routed only to the input position that produced the maximum value within each window; all other positions receive zero gradient. Frameworks store the argmax indices during the forward pass to make the backward pass efficient. For average pooling, the output gradient is divided equally among all input positions in the window. This difference has practical consequences: max pooling produces sparse gradient updates, while average pooling distributes the signal across the whole receptive field.
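A small experiment (sketch) makes the routing difference visible: backpropagating the sum of the pooled outputs leaves nonzero gradient only at the argmax positions for max pooling, while 2x2 average pooling spreads a gradient of 0.25 over every input position.

import torch
import torch.nn.functional as F

x = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4).requires_grad_(True)

F.max_pool2d(x, 2).sum().backward()
print(x.grad.reshape(4, 4))   # 1 at the four window maxima, 0 everywhere else

x.grad = None
F.avg_pool2d(x, 2).sum().backward()
print(x.grad.reshape(4, 4))   # 0.25 at every position: the gradient is split evenly per window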
Max pooling selects the maximum value within each pooling window. For a 2x2 window the operation examines four values and outputs the single largest one. This preserves the strongest activations in each local region, which often correspond to the most prominent features such as edges, corners, and textures. The intuition is that ReLU activations are non-negative and a high value indicates the presence of a feature, so taking the maximum keeps the evidence that the feature exists somewhere in the window without committing to its exact location.
Max pooling is the most widely used pooling variant in modern CNNs. It was applied to neural networks for speech by Yamaguchi et al. (1990), generalized as a model of cortical complex cells by Riesenhuber and Poggio (1999), used in the Cresceptron of Weng, Ahuja and Huang (1992) for image processing, and popularized for deep learning by AlexNet (Krizhevsky et al., 2012). It remains the default pooling operation in architectures such as VGGNet (Simonyan and Zisserman, 2014) and the early stages of ResNet.
For a pooling region R, max pooling computes:
y = max(x_i) for all x_i in R
Max pooling provides a form of translational invariance: if a feature such as an edge shifts slightly within the pooling window, the output remains the same as long as the maximum value does not change. This property helps CNNs recognize patterns regardless of their exact spatial position, although the invariance is local and breaks down for shifts larger than the pool size.
Average pooling computes the arithmetic mean of all values in the pooling window. For a pooling region R containing n elements:
y = (1/n) * sum(x_i) for all x_i in R
Average pooling produces smoother output feature maps compared with max pooling because it considers all values rather than only the peak activation. It was the original form of pooling used in LeCun's LeNet-5 (1998), where the subsampling layers computed the sum of four neighboring inputs, multiplied by a trainable coefficient, added a trainable bias, and passed the result through a sigmoid. Each S2 channel had only two trainable parameters (one weight and one bias), and the same scheme was used in S4.
In practice, max pooling tends to outperform average pooling in most classification tasks because it preserves the strongest signals and produces sparser, more discriminative features. However, average pooling can be preferable when smoother feature representations are desired, such as in certain generative models, when the input contains significant noise, or in transition layers that aggregate dense features. DenseNet (Huang et al., 2017) uses 2x2 average pooling in its transition layers between dense blocks for this reason.
Global average pooling (GAP) computes the mean of each entire feature map channel, collapsing the spatial dimensions to a single value per channel. For an input of size H x W x C, GAP produces an output of size 1 x 1 x C.
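In code, GAP is simply a mean over the spatial dimensions; an adaptive pooling layer with output size 1 and an explicit .mean() give the same result (a minimal sketch):

import torch
import torch.nn as nn

x = torch.randn(8, 512, 7, 7)         # a batch of 7x7 feature maps with 512 channels
gap = nn.AdaptiveAvgPool2d(1)

a = gap(x).flatten(1)                 # (8, 512)
b = x.mean(dim=(2, 3))                # (8, 512)
print(torch.allclose(a, b))           # True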
GAP was introduced by Lin, Chen and Yan (2014) in the "Network in Network" paper as a replacement for the fully connected layers that traditionally appeared at the end of CNN classifiers. In their approach, the final convolutional layer produces one feature map per class and GAP averages each map into a single scalar that is fed directly into a softmax. This eliminates the large fully connected layers that dominated the parameter count in architectures like AlexNet and VGGNet. The original NIN paper notes that GAP enforces correspondence between feature maps and classes, which makes the final layer easier to interpret and acts as a structural regularizer because there are no parameters in the GAP itself.
The advantages of GAP over fully connected layers include:
- No learnable parameters, which shrinks the model and acts as a structural regularizer against overfitting.
- A direct correspondence between feature maps and output categories, which makes the final layer easier to interpret.
- Greater robustness to spatial translations, since each channel is summarized by its mean rather than by position-specific weights.
- Compatibility with variable input sizes, because the spatial average is defined for any H x W.
GAP became a standard component in modern architectures. GoogLeNet (Szegedy et al., 2015), ResNet (He et al., 2016), DenseNet (Huang et al., 2017), and EfficientNet (Tan and Le, 2019) all use GAP before the final classification layer. GAP is also a key ingredient in class activation maps (Zhou et al., 2016), where the spatial average is reversed to localize discriminative regions per class.
Global max pooling takes the maximum value from each entire feature map channel, producing a 1 x 1 x C output from an H x W x C input. It is less common than GAP in classification networks but is used in certain architectures and tasks where preserving the single strongest activation per channel is more informative than the average response. Multiple-instance learning models, weakly supervised localization heads, and some text classifiers (Kim, 2014) use global max pooling because it picks out the most active feature regardless of how many positions support it.
Stochastic pooling, proposed by Zeiler and Fergus (2013), replaces the deterministic selection of max or average pooling with a probabilistic sampling procedure. Within each pooling region the activation values are first normalized to form a probability distribution, and a single value is then sampled according to this multinomial distribution. During training the random sampling acts as a regularizer in a similar spirit to dropout, preventing the network from relying too heavily on any single activation. During inference the method uses a probability-weighted average, which is equivalent to average pooling in expectation.
Stochastic pooling achieved state-of-the-art results on several benchmarks at the time of its introduction and demonstrated that pooling operations could serve as regularization mechanisms. It is hyperparameter-free and can be combined with other regularization techniques like dropout and data augmentation.
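A rough sketch of the training-time behavior, assuming non-negative (post-ReLU) activations and a non-overlapping 2x2 window; the helper name stochastic_pool2d is made up for illustration:

import torch

def stochastic_pool2d(x, k=2):
    # x: (N, C, H, W) with non-negative activations; H and W assumed divisible by k
    n, c, h, w = x.shape
    windows = x.unfold(2, k, k).unfold(3, k, k).reshape(n, c, h // k, w // k, k * k)
    flat = windows.reshape(-1, k * k)
    totals = flat.sum(dim=1, keepdim=True)
    # Normalize each window's activations into a probability distribution;
    # fall back to a uniform distribution when a window is all zeros.
    probs = torch.where(totals > 0, flat / totals.clamp(min=1e-12),
                        torch.full_like(flat, 1.0 / (k * k)))
    idx = torch.multinomial(probs, num_samples=1)   # sample one activation per window
    return flat.gather(1, idx).reshape(n, c, h // k, w // k)

print(stochastic_pool2d(torch.rand(1, 3, 8, 8)).shape)  # torch.Size([1, 3, 4, 4])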
Fractional max pooling, introduced by Graham (2014), generalizes the integer downsampling factor used by standard pooling to non-integer values. Conventional max pooling reduces the spatial size by an integer factor alpha (typically 2), which can cause spatial information to be discarded too rapidly through a deep network. Graham proposed allowing alpha to take non-integer values such as sqrt(2) (approximately 1.41), which permits a more gradual reduction spread across more layers.
Because non-integer pooling regions cannot tile a grid exactly, the construction is stochastic: for each layer the pooling regions are sampled from one of several valid pseudo-random tilings, which introduces additional regularization during training. Graham reported state-of-the-art results on CIFAR-100 without dropout and competitive performance on CIFAR-10, MNIST and assorted other benchmarks.
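PyTorch exposes this operation as nn.FractionalMaxPool2d, where the caller specifies an output size or output ratio and the pseudo-random pooling regions are generated internally. A short example:

import torch
import torch.nn as nn

# Shrink each spatial dimension by roughly sqrt(2) instead of a factor of 2
frac_pool = nn.FractionalMaxPool2d(kernel_size=2, output_ratio=(0.7, 0.7))
x = torch.randn(1, 3, 32, 32)
print(frac_pool(x).shape)  # torch.Size([1, 3, 22, 22])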
Mixed pooling, proposed by Yu et al. (2014), combines max and average pooling stochastically. For each pooling window a random parameter lambda in {0, 1} chooses whether to apply max pooling (lambda = 1) or average pooling (lambda = 0). The chosen lambda is recorded so that the backward pass uses the matching gradient rule. Yu et al. reported that mixed pooling outperformed both max and average pooling on several image classification benchmarks. Subsequent variants include gated pooling and tree-structured pooling (Lee, Gallagher and Tu, 2016), which learn the mixing weights rather than sampling them.
Spatial pyramid pooling (SPP) was introduced by He et al. (2015) in the SPP-Net paper. Traditional CNNs require fixed-size input images because the fully connected layers expect a specific input dimension. SPP solves this constraint by applying pooling at multiple scales simultaneously. See also spatial pyramid pooling.
The SPP layer divides the input feature map into a hierarchy of grids (for example 1x1, 2x2 and 4x4 regions), performs max pooling within each grid cell, and concatenates all the results into a single fixed-length vector. With these three levels and a 256-channel feature map, the resulting vector has length 256 x (1 + 4 + 16) = 5,376 regardless of the input image size or aspect ratio. The classifier then operates on this fixed vector.
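A compact way to sketch an SPP layer in a modern framework is to run adaptive max pooling once per pyramid level and concatenate the flattened results (the helper name spp is illustrative; the levels and channel count match the example above):

import torch
import torch.nn.functional as F

def spp(x, levels=(1, 2, 4)):
    # x: (N, C, H, W) -> (N, C * sum(l*l for l in levels)), independent of H and W
    pooled = [F.adaptive_max_pool2d(x, output_size=l).flatten(1) for l in levels]
    return torch.cat(pooled, dim=1)

print(spp(torch.randn(1, 256, 13, 13)).shape)  # torch.Size([1, 5376])
print(spp(torch.randn(1, 256, 24, 17)).shape)  # torch.Size([1, 5376]), same length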
For object detection, SPP-Net computed convolutional features from the entire image only once and then pooled features from arbitrary sub-regions, making feature computation 30 to 170 times faster than R-CNN at test time while achieving better or comparable accuracy on the Pascal VOC 2007 benchmark.
Atrous spatial pyramid pooling (ASPP) was introduced in the DeepLab family of semantic segmentation models (Chen et al., 2017). Instead of pooling at multiple grid resolutions like SPP, ASPP applies several parallel atrous (dilated) convolutions to the same feature map, each with a different dilation rate. The outputs are concatenated and projected to the segmentation logits. By varying the dilation rate, ASPP captures context at multiple effective receptive fields without enlarging the kernels' parameter count and without downsampling the feature map. DeepLabv3 (Chen et al., 2017) and DeepLabv3+ (Chen et al., 2018) added a 1x1 convolution branch and an image-level GAP branch to the ASPP module to capture both local and global context.
Lp pooling generalizes max pooling and average pooling through the Lp norm. For a pooling region R:
y = (sum(|x_i|^p) / n)^(1/p)
When p = 1, this reduces to average pooling (of absolute values). As p approaches infinity, the result converges to max pooling. The case p = 2 (root mean square pooling, sometimes called L2 pooling) is biologically motivated and was studied by Sermanet, Chintala and LeCun (2012) for house number digit classification.
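With the definition above, Lp pooling can be written directly in terms of average pooling (a sketch; note that PyTorch's built-in nn.LPPool2d uses the unnormalized sum rather than the mean, so it differs from this formula by a constant factor per window size):

import torch
import torch.nn.functional as F

def lp_pool2d(x, p, kernel_size, stride=None):
    # ( mean(|x|^p) )^(1/p) over each pooling window
    return F.avg_pool2d(x.abs().pow(p), kernel_size, stride).pow(1.0 / p)

x = torch.randn(1, 3, 8, 8)
print(torch.allclose(lp_pool2d(x, p=1, kernel_size=2), F.avg_pool2d(x.abs(), 2)))  # True
print(lp_pool2d(x, p=10, kernel_size=2).shape)  # torch.Size([1, 3, 4, 4]); large p approaches max pooling of |x|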
Sum pooling outputs the sum rather than the mean of the values in each window. It is mathematically equivalent to average pooling up to a constant scaling factor and is rarely used as a standalone layer in modern architectures, although it appears occasionally in graph pooling and in Bag-of-Visual-Words pipelines.
Local response normalization (LRN), used in AlexNet, is not strictly a pooling operation but is sometimes grouped with it because it performs a local aggregation across nearby activations. It normalizes each activation by a function of the squared activations in a neighborhood (across channels in AlexNet) and has largely been replaced by batch normalization in modern networks.
| Pooling type | Operation | Learnable parameters | Key strength | Typical use case |
|---|---|---|---|---|
| Max pooling | Maximum value in window | No | Preserves strongest features | Most CNN classifiers and detectors |
| Average pooling | Mean of values in window | No | Produces smooth representations | Early architectures (LeNet), DenseNet transitions, generative models |
| Global average pooling | Mean over entire feature map | No | Eliminates fully connected layers | Final layer before softmax in modern CNNs |
| Global max pooling | Max over entire feature map | No | Preserves strongest activation per channel | Multiple-instance learning, text CNNs |
| Stochastic pooling | Probabilistic sampling from region | No | Regularization during training | Small training sets, CIFAR-scale benchmarks |
| Fractional max pooling | Non-integer downsampling factor with random tilings | No | Slow, fine-grained downsampling and regularization | Deep networks on small datasets |
| Mixed pooling | Random or learned mix of max and average | Optional | Combines smoothing and selection | Regularizing CNN classifiers |
| Spatial pyramid pooling | Multi-scale grid pooling, fixed output length | No | Handles variable input sizes | Object detection (SPP-Net) |
| Atrous spatial pyramid pooling | Parallel dilated convolutions at multiple rates | Yes (the convolutions) | Multi-scale context for dense prediction | Semantic segmentation (DeepLab) |
| Lp pooling | Lp norm over region | No (p is a hyperparameter) | Tunable between max and average | Specialized applications |
| Overlapping max pooling | Max with stride less than kernel size | No | Slight regularization, denser sampling | AlexNet (3x3 window, stride 2) |
Object detection models need to extract a fixed-size feature representation from variably sized region proposals so that downstream classifier and regression heads can operate on a uniform input. Pooling provides this with several specialized variants.
Region of Interest (RoI) Pooling was introduced as part of Fast R-CNN (Girshick, 2015). The RoI pooling layer takes two inputs: a convolutional feature map computed once for the whole image, and a list of region proposals encoded as bounding boxes. For each RoI it converts the corresponding sub-region of the feature map to a fixed H x W output (typically 7x7) by dividing the RoI into an H x W grid of sub-windows and applying max pooling within each sub-window. The result feeds into the per-RoI fully connected head for classification and bounding box regression.
Because the convolutional feature map is shared across all proposals, Fast R-CNN avoids recomputing the convolutional backbone for each proposal as the original R-CNN did, which produces a large speedup. Faster R-CNN (Ren et al., 2015) keeps the same RoI pooling layer and replaces the external selective search proposals with a learned region proposal network. See also RoI pooling.
A limitation of RoI pooling is that it quantizes the floating-point RoI coordinates twice: once when mapping the RoI onto the discrete feature map grid, and again when dividing the RoI into sub-windows. These quantizations introduce small misalignments between the RoI and the pooled features. While the misalignment has little effect on classification, it materially hurts pixel-accurate tasks like instance segmentation.
RoIAlign, introduced in Mask R-CNN (He et al., 2017), removes both quantizations. The RoI is divided into bins using exact floating-point coordinates, four sample points are placed inside each bin, and the feature map is sampled at those exact points using bilinear interpolation. The bin output is then taken as the max or average of the four samples. He et al. reported that RoIAlign improves mask accuracy by relative margins of roughly 10% to 50% over RoI pooling on the COCO benchmark, with the largest gains under strict localization metrics, and gives more modest but still meaningful gains for bounding box detection.
R-FCN (Dai et al., 2016) proposed position-sensitive RoI pooling (PSRoIPool) to make detectors fully convolutional. Instead of pooling from a generic feature map, the network produces k x k position-sensitive score maps per class, where each score map encodes the response for one spatial position within the RoI grid (top-left, top-middle, and so on). PSRoIPool then pools each k x k bin from its corresponding score map, encoding spatial position information without per-RoI fully connected layers. Using ResNet-101, R-FCN achieved 83.6% mAP on PASCAL VOC 2007 at 170 ms per image, which is 2.5x to 20x faster than the equivalent Faster R-CNN model. PSRoIPool is available in torchvision as ps_roi_pool.
| Variant | Paper | Year | Quantization | Output shape | Notable use |
|---|---|---|---|---|---|
| RoI pooling | Fast R-CNN (Girshick) | 2015 | Yes, on RoI and bins | Fixed H x W per RoI | Fast R-CNN, Faster R-CNN |
| Position-sensitive RoI pool | R-FCN (Dai et al.) | 2016 | Yes | k x k per RoI | R-FCN |
| RoIAlign | Mask R-CNN (He et al.) | 2017 | None (bilinear sampling) | Fixed H x W per RoI | Mask R-CNN, YOLO variants with RoI heads |
| Precise RoI pooling | Jiang et al. | 2018 | None (continuous integral) | Fixed H x W per RoI | IoU-Net |
Pooling serves several purposes in neural network architectures. Shrinking the spatial dimensions of feature maps reduces the cost of subsequent layers; a 2x2 pooling with stride 2 cuts the number of spatial elements by 75%, which compounds across multiple layers. Pooling also gives the network a degree of translational invariance: if a feature shifts by a few pixels, the pooled output may remain identical, which helps CNNs recognize objects regardless of precise location (the invariance is local and approximate rather than global). Each pooling step roughly doubles the effective receptive field of neurons in the next layer, so deeper layers can integrate information from larger regions of the original input. Reducing spatial dimensions also decreases the total parameter count, acting as implicit regularization, and stochastic, fractional and mixed pooling go further by injecting noise into the operation itself. Finally, smaller feature maps require less memory and fewer floating-point operations, which speeds up both training and inference and is critical for deployment on resource-constrained devices.
The necessity of dedicated pooling layers has been debated since the mid-2010s. Pooling discards spatial detail, which hurts tasks that need pixel-accurate predictions such as semantic segmentation, depth estimation, super-resolution and generation. U-Net uses pooling in the encoder but compensates with skip connections that re-inject the pre-pooled features. Standard pooling also cannot adapt its downsampling rule to the data, while strided convolutions use weights that the rest of the network learns, so the downsampling behavior is shaped by gradient descent.
Springenberg, Dosovitskiy, Brox and Riedmiller (2014) showed in "Striving for Simplicity: The All Convolutional Net" that max pooling can be replaced by a stride-2 convolutional layer without loss of accuracy on CIFAR-10, CIFAR-100 and ImageNet. They argued that pooling is a special case of strided convolution with hand-designed weights and that letting the network learn the downsampling is at least as good. This finding heavily influenced later CNN design. ResNet uses one 3x3 max pool early in the network and replaces all subsequent downsampling with stride-2 convolutions inside the residual stages, and ResNeXt, EfficientNet and most contemporary CNN backbones follow the same pattern. Vision transformers replace pooling with a patch embedding step that splits the image into non-overlapping patches and projects each one to a token embedding, effectively merging the first conv plus pool stage into a single linear projection.
Despite these alternatives, pooling has not disappeared. GAP remains the standard final aggregation in nearly every modern CNN classifier, and detection heads still rely on RoIAlign or its variants.
In the standard CNN pipeline, pooling layers are inserted after one or more convolutional layers. A typical block consists of a convolutional layer, an activation such as ReLU, and a pooling layer, repeated several times. Each stage reduces spatial dimensions while increasing the number of feature channels. A classic VGG-style progression looks like this:
Input (224x224x3) -> Conv + ReLU (224x224x64) -> MaxPool 2x2 (112x112x64) -> Conv + ReLU (112x112x128) -> MaxPool 2x2 (56x56x128) -> ... -> Flatten -> Fully connected layers -> Softmax (the original VGG ends in fully connected layers; most later architectures substitute GAP)
Notable CNN architectures and their use of pooling:
| Architecture | Year | Pooling strategy |
|---|---|---|
| LeNet-5 | 1998 | Trainable subsampling (sum, scale, bias, sigmoid) after each conv block |
| AlexNet | 2012 | 3x3 overlapping max pooling with stride 2; LRN |
| VGGNet | 2014 | 2x2 max pooling with stride 2 after each conv block |
| GoogLeNet (Inception) | 2015 | Max pooling within Inception modules, GAP before classifier |
| ResNet | 2016 | Single 3x3 max pool early on, stride-2 convolutions for the rest, GAP before classifier |
| DenseNet | 2017 | 2x2 average pooling in transition layers, GAP at the end |
| MobileNet / EfficientNet | 2017 / 2019 | Stride-2 depthwise convolutions for downsampling, GAP before classifier |
| Vision Transformer | 2021 | Patch embedding replaces early conv+pool; CLS token or GAP at the end |
Although pooling originated in computer vision, it plays an important role in transformer-based models as well. When a transformer like BERT processes a sentence it produces a sequence of token embeddings. Many downstream tasks (classification, semantic similarity, retrieval) require a single fixed-length vector to represent the entire input. Pooling provides the mechanism for aggregating token-level representations into a sentence-level embedding.
BERT and similar models prepend a special [CLS] token to the input sequence, and the final hidden state of this token is treated as the sentence representation. The model is trained so that information from the rest of the sequence flows into the CLS position via self-attention. CLS pooling works well when the model has been fine-tuned on a downstream classification task, but the raw CLS embedding from a pretrained BERT does not capture sentence semantics well in zero-shot settings (Reimers and Gurevych, 2019).
Averaging the hidden states of all tokens (excluding padding) produces a balanced representation that gives every token equal influence. Mean pooling is the default strategy in the Sentence-BERT library (Reimers and Gurevych, 2019), where the authors compared CLS, mean and max pooling and found mean pooling produced the best results on STS and NLI benchmarks. Most modern open-source sentence encoders, including the sentence-transformers family, default to mean pooling.
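A typical masked mean-pooling implementation over transformer outputs looks like the following sketch; the tensor shapes are illustrative and no particular model is loaded:

import torch

def mean_pool(last_hidden_state, attention_mask):
    # last_hidden_state: (batch, seq_len, dim); attention_mask: (batch, seq_len) of 0/1
    mask = attention_mask.unsqueeze(-1).float()       # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)    # padded positions contribute nothing
    counts = mask.sum(dim=1).clamp(min=1e-9)          # number of real tokens per sequence
    return summed / counts                            # (batch, dim)

hidden = torch.randn(2, 10, 768)
mask = torch.tensor([[1] * 10, [1] * 6 + [0] * 4])
print(mean_pool(hidden, mask).shape)  # torch.Size([2, 768])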
Taking the element-wise maximum across all token hidden states captures the most salient features at each embedding dimension. Max pooling can be useful for tasks where specific keywords or phrases carry disproportionate importance, although Reimers and Gurevych reported it underperformed mean pooling for general semantic similarity.
Applying learned or heuristic weights to each token before averaging allows the model to emphasize certain positions. A common variant uses the attention mask only (treating padded tokens as zero weight), while more elaborate schemes weight tokens by inverse document frequency, by self-attention scores, or by a learned linear projection.
Attention pooling generalizes mean pooling by computing a learned weighted combination of token vectors. A small set of learnable query vectors attends to the token sequence via a single multi-head attention layer, and the resulting vectors form the pooled representation. Pooling by Multi-head Attention (PMA), introduced in the Set Transformer (Lee et al., 2019), uses a fixed number of seed queries and lets attention decide which input elements to focus on for each seed. Attention pooling is now common in image classification heads (CaiT, BEiT, DINOv2), in dense retrieval (Contriever, NV-Embed), and in audio and video models.
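A minimal PMA-style attention pooling head can be sketched with a learnable seed query and a standard multi-head attention layer (the module name and sizes are illustrative and not taken from any particular paper's code):

import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    def __init__(self, dim, num_heads=8, num_seeds=1):
        super().__init__()
        self.seed = nn.Parameter(torch.randn(1, num_seeds, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens):            # tokens: (batch, seq_len, dim)
        query = self.seed.expand(tokens.size(0), -1, -1)
        pooled, _ = self.attn(query, tokens, tokens)
        return pooled                     # (batch, num_seeds, dim)

pool = AttentionPooling(dim=768)
print(pool(torch.randn(2, 50, 768)).shape)  # torch.Size([2, 1, 768])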
For large language models with causal attention, the standard pooling choice for embedding tasks is the last token's hidden state, which by virtue of the causal mask has attended over the full input. Some encoders use the mean of the last layer's hidden states, others mix several layers.
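Assuming right-padded batches, last-token pooling amounts to indexing each sequence at its final non-padded position (a sketch with illustrative shapes):

import torch

def last_token_pool(last_hidden_state, attention_mask):
    # last_hidden_state: (batch, seq_len, dim); attention_mask: (batch, seq_len) of 0/1
    last_idx = attention_mask.sum(dim=1) - 1              # index of the final real token
    batch_idx = torch.arange(last_hidden_state.size(0))
    return last_hidden_state[batch_idx, last_idx]         # (batch, dim)

hidden = torch.randn(2, 10, 768)
mask = torch.tensor([[1] * 10, [1] * 7 + [0] * 3])
print(last_token_pool(hidden, mask).shape)  # torch.Size([2, 768])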
Vision transformers (ViTs) blend the BERT-style CLS-token approach with options inherited from CNNs. The original ViT (Dosovitskiy et al., 2021) uses a CLS token in emulation of BERT and reports that GAP over the patch tokens reaches comparable accuracy after careful tuning. Subsequent papers have shown that GAP and attention pooling often outperform the CLS token, especially when the model is trained with strong augmentation regimes or self-supervised objectives. CaiT (Touvron et al., 2021), Swin Transformer (Liu et al., 2021), and BEiT v2 (Peng et al., 2022) use GAP or attention pooling rather than CLS at the classifier head.
A related observation from Raghu et al. (2021) is that ViTs trained with GAP show less localized attention patterns than CLS-trained ViTs. The choice between CLS and GAP therefore affects not only accuracy but also interpretability tools like attention rollouts.
Adaptive pooling is a variant where the user specifies the desired output size rather than the kernel size and stride. The framework automatically computes the necessary kernel size and stride to produce the requested output dimensions from whatever input size is provided. This makes architectures more flexible, since the same model can process inputs of varying spatial dimensions.
Adaptive pooling is particularly useful for:
- Handling datasets with variable image sizes without resizing every input to a fixed resolution.
- Producing a fixed-size representation before fully connected layers, regardless of the backbone's output resolution.
- Implementing global pooling as the special case of output size (1, 1).
In PyTorch, adaptive pooling is available as nn.AdaptiveAvgPool2d and nn.AdaptiveMaxPool2d. Setting the output size to (1, 1) is equivalent to global pooling.
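The key property is that the same layer maps differently sized inputs to the same output size, as a short check illustrates:

import torch
import torch.nn as nn

adaptive = nn.AdaptiveAvgPool2d(output_size=(7, 7))
print(adaptive(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 7, 7])
print(adaptive(torch.randn(1, 64, 45, 61)).shape)  # torch.Size([1, 64, 7, 7])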
PyTorch provides pooling layers in the torch.nn module and detection-specific pooling in torchvision.ops.
import torch
import torch.nn as nn
from torchvision.ops import roi_pool, roi_align, ps_roi_pool
# Max pooling: 2x2 window, stride 2
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
# Average pooling: 2x2 window, stride 2
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)
# Global average pooling (adaptive, output size 1x1)
global_avg_pool = nn.AdaptiveAvgPool2d(output_size=(1, 1))
# Example: input tensor with batch=1, channels=3, height=8, width=8
x = torch.randn(1, 3, 8, 8)
print(max_pool(x).shape) # torch.Size([1, 3, 4, 4])
print(avg_pool(x).shape) # torch.Size([1, 3, 4, 4])
print(global_avg_pool(x).shape) # torch.Size([1, 3, 1, 1])
# RoI pooling and RoIAlign: input is (N, C, H, W); rois is (K, 5)
# where each row is (batch_idx, x1, y1, x2, y2) in input coordinates.
feats = torch.randn(1, 256, 32, 32)
rois = torch.tensor([[0, 4.0, 4.0, 20.0, 20.0]])
pooled = roi_pool(feats, rois, output_size=(7, 7), spatial_scale=1.0)
aligned = roi_align(feats, rois, output_size=(7, 7), spatial_scale=1.0,
sampling_ratio=2, aligned=True)
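# Each per-RoI output has shape (num_rois, C, 7, 7)
print(pooled.shape)   # torch.Size([1, 256, 7, 7])
print(aligned.shape)  # torch.Size([1, 256, 7, 7])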
PyTorch also provides 1D and 3D variants (MaxPool1d, MaxPool3d, AvgPool1d, AvgPool3d) for sequential and volumetric data.
In TensorFlow / Keras, pooling layers are available in tf.keras.layers.
import tensorflow as tf
# Max pooling: 2x2 window, stride 2
max_pool = tf.keras.layers.MaxPooling2D(pool_size=(2, 2), strides=2)
# Average pooling: 2x2 window, stride 2
avg_pool = tf.keras.layers.AveragePooling2D(pool_size=(2, 2), strides=2)
# Global pooling
global_avg_pool = tf.keras.layers.GlobalAveragePooling2D()
global_max_pool = tf.keras.layers.GlobalMaxPooling2D()
Keras layers accept data_format as either "channels_last" (default, NHWC) or "channels_first" (NCHW), and padding as either "valid" (no padding) or "same" (zero-padding to preserve dimensions).
Flax exposes pooling through flax.linen.max_pool and flax.linen.avg_pool, both of which accept window_shape and strides arguments. The sentence-transformers library provides a models.Pooling module that wraps mean, max, and CLS strategies for transformer outputs and lets the user combine several modes by element-wise concatenation.
Pooling techniques appear in a wide range of deep learning applications, including image classification, object detection, semantic segmentation, sentence and document embedding, and audio and video modeling.
A few rules of thumb that follow from common practice:
- Use 2x2 max pooling with stride 2 (or a stride-2 convolution) as the default way to halve spatial resolution in a CNN stage.
- Prefer global average pooling over large fully connected layers at the end of a classifier.
- Use average pooling where smoother representations help, such as DenseNet-style transition layers or noisy inputs.
- Choose RoIAlign rather than RoI pooling for detection and segmentation heads that need precise localization.
- For transformer sentence embeddings, mean pooling over non-padded tokens is a strong default, and last-token pooling is the usual choice for causal language models.
Imagine you have a big picture made of lots of tiny colored squares. Pooling is like squishing that picture down to make it smaller. One way to squish is to look at a small group of squares and keep only the brightest one (that is max pooling). Another way is to mix all the colors in the group together to get the average color (that is average pooling). Either way the picture gets smaller but you can still tell what it is. This helps the computer work faster because it has fewer squares to look at, and it also means the computer does not care if the cat in the picture is a tiny bit to the left or the right.