Stride is a hyperparameter in convolutional neural networks (CNNs) that controls how many pixels (or units) a filter moves across the input at each step during the convolution or pooling operation. It directly determines the spatial dimensions of the output feature map and plays a central role in balancing computational efficiency, spatial resolution, and the network's ability to capture patterns at different scales.
In the context of a convolutional layer, a filter (also called a kernel) slides over the input data to compute dot products at each position. The stride specifies the number of positions the filter shifts between successive applications. A stride of 1 means the filter moves one pixel at a time, covering every possible position. A stride of 2 means the filter jumps two pixels at each step, skipping every other position.
Stride can be specified as a single integer (applied equally in both the horizontal and vertical directions) or as a tuple such as (2, 2) or (1, 2) to allow different stride values along each spatial dimension. Most frameworks default to a stride of 1 when no value is explicitly provided.
Consider a 7x7 input with a 3x3 filter. With stride 1, the filter can be placed at 5 positions along each axis, producing a 5x5 output. With stride 2, the filter lands at positions 0, 2, and 4 along each axis, producing a 3x3 output. The larger stride reduces the output size because the filter visits fewer positions.
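These numbers are easy to verify in code. Below is a minimal PyTorch check (assuming a single-channel input and an untrained 3x3 kernel; the layer names are illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 7, 7)  # batch of one single-channel 7x7 input

conv_s1 = nn.Conv2d(1, 1, kernel_size=3, stride=1)  # no padding
conv_s2 = nn.Conv2d(1, 1, kernel_size=3, stride=2)  # no padding

print(conv_s1(x).shape)  # torch.Size([1, 1, 5, 5])
print(conv_s2(x).shape)  # torch.Size([1, 1, 3, 3])
```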
The relationship between stride and output dimensions is governed by a well-known formula. For a single spatial dimension, the output size is:
O = floor((W - K + 2P) / S) + 1
Where:
| Symbol | Meaning |
|---|---|
| W | Input size (width or height) |
| K | Kernel size |
| P | Padding amount |
| S | Stride |
| O | Output size |
For two-dimensional inputs, the formula is applied independently to each spatial dimension: once for the height and once for the width, each with its own values of W, K, P, and S.
When dilation is used, the formula generalizes to:
O = floor((W + 2P - D * (K - 1) - 1) / S) + 1
where D is the dilation rate.
| Input Size | Kernel Size | Padding | Stride | Output Size |
|---|---|---|---|---|
| 32x32 | 3x3 | 0 | 1 | 30x30 |
| 32x32 | 3x3 | 1 | 1 | 32x32 |
| 32x32 | 3x3 | 1 | 2 | 16x16 |
| 224x224 | 7x7 | 3 | 2 | 112x112 |
| 8x8 | 3x3 | 0 | 3 | 2x2 |
The table illustrates that stride 1 with appropriate padding preserves the input dimensions, while stride 2 halves them. This is the most common configuration in modern CNN architectures.
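The formula itself is one line of code. The sketch below (plain Python, with parameter names matching the symbols above and using the dilated form of the formula) reproduces the table's rows:

```python
import math

def conv_output_size(w: int, k: int, p: int = 0, s: int = 1, d: int = 1) -> int:
    """Output size along one spatial dimension.
    W = input size, K = kernel size, P = padding, S = stride, D = dilation."""
    return math.floor((w + 2 * p - d * (k - 1) - 1) / s) + 1

print(conv_output_size(32, 3, p=0, s=1))   # 30
print(conv_output_size(32, 3, p=1, s=2))   # 16
print(conv_output_size(224, 7, p=3, s=2))  # 112
```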
The two most frequently used stride values are 1 and 2. They serve fundamentally different purposes in a network's design.
With stride 1 and "same" padding (P = (K - 1) / 2 for odd kernel sizes), the output feature map retains the same spatial dimensions as the input. This is the default setting in most convolutional layers and is preferred when fine-grained spatial detail must be preserved. Early layers in a network and layers within residual blocks commonly use stride 1.
Stride 2 roughly halves the spatial dimensions of the input, serving as a form of downsampling. This reduces the number of elements in the feature map by approximately 75%, significantly lowering the computational cost and memory usage of subsequent layers. Stride 2 is commonly used at transition points in a network where spatial resolution is intentionally reduced.
Networks such as VGG used max pooling with stride 2 to downsample, while more recent architectures like ResNet and many others use strided convolutions for the same purpose.
Stride is also a parameter in pooling layers, including max pooling and average pooling. In pooling, the stride controls how the pooling window moves across the feature map. The same output size formula applies.
In many standard pooling configurations, the stride equals the pool size (for example, a 2x2 max pooling window with stride 2), which produces non-overlapping pooling regions and halves the spatial dimensions. When the stride is smaller than the pool size, overlapping pooling occurs. AlexNet famously used overlapping max pooling with a 3x3 window and stride 2, which was shown to slightly reduce overfitting compared to non-overlapping pooling.
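A short PyTorch comparison (with a hypothetical 16x16, 64-channel feature map) shows the effect of the two configurations:

```python
import torch
import torch.nn as nn

fmap = torch.randn(1, 64, 16, 16)

non_overlap = nn.MaxPool2d(kernel_size=2, stride=2)  # stride == pool size
overlap = nn.MaxPool2d(kernel_size=3, stride=2)      # AlexNet-style, stride < pool size

print(non_overlap(fmap).shape)  # torch.Size([1, 64, 8, 8])
print(overlap(fmap).shape)      # torch.Size([1, 64, 7, 7])
```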
Traditionally, CNNs alternated convolutional layers with pooling layers to progressively reduce spatial dimensions. In 2015, Springenberg et al. published "Striving for Simplicity: The All Convolutional Net," which demonstrated that max pooling layers could be entirely replaced by convolutional layers with stride 2 without any loss in accuracy on benchmarks including CIFAR-10, CIFAR-100, and ImageNet.
The key insight of this work is that pooling is a fixed, non-learnable operation, while a strided convolution is a learnable downsampling operation. By replacing pooling with strided convolution, the network gains additional trainable parameters that allow it to learn the optimal way to reduce spatial dimensions for the task at hand. The authors found that "when pooling is replaced by an additional convolution layer with stride 2, performance stabilizes and even improves on the base model."
This approach has been widely adopted. Many modern architectures, including ResNet, DenseNet, and ConvNeXt, use strided convolutions for downsampling rather than pooling layers.
| Approach | Type | Learnable | Parameters | Typical Use |
|---|---|---|---|---|
| Max pooling (stride 2) | Fixed operation | No | 0 | Classical architectures (VGG, AlexNet) |
| Average pooling (stride 2) | Fixed operation | No | 0 | Transition layers, global pooling |
| Strided convolution (stride 2) | Learned operation | Yes | K x K x C_in x C_out | Modern architectures (ResNet, ConvNeXt) |
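In code, the swap amounts to moving the stride from the pooling layer into the convolution. The sketch below uses hypothetical channel counts; both blocks halve the spatial dimensions of an even-sized input:

```python
import torch.nn as nn

# Fixed downsampling: stride-1 convolution followed by max pooling
pooled = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),  # contributes no parameters
)

# Learned downsampling: the stride-2 convolution does both jobs
strided = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),  # 3*3*64*128 weights
    nn.ReLU(),
)
```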
Transposed convolutions (also called fractionally strided convolutions or sometimes incorrectly called deconvolutions) use stride in the opposite manner compared to standard convolutions. While stride greater than 1 in a standard convolution reduces spatial dimensions, stride greater than 1 in a transposed convolution increases spatial dimensions, performing upsampling.
In a transposed convolution, the stride parameter controls how much spacing is inserted between input elements before the convolution is applied. A stride of 2 effectively doubles the spatial dimensions of the output. This is why the operation is also called "fractionally strided convolution": a stride of 2 over the output is equivalent to a stride of 1/2 over the input.
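For completeness, the corresponding output-size formula for a transposed convolution (ignoring output padding and dilation, consistent with the convention used above) is:

O = S * (W - 1) + K - 2P

For example, a 16x16 input with K = 4, S = 2, and P = 1 yields O = 2 * 15 + 4 - 2 = 32, exactly doubling the spatial dimensions.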
Transposed convolutions with stride 2 are widely used in:

- the generator networks of GANs such as DCGAN, which progressively upsample a low-resolution latent representation into a full-resolution image
- the decoder half of convolutional autoencoders
- the upsampling path of encoder-decoder segmentation models such as FCN and U-Net
Transposed convolutions with certain stride and kernel size combinations can produce checkerboard artifacts in the output. This occurs because of uneven overlap when the kernel size is not divisible by the stride. Odena, Dumoulin, and Olah (2016) documented this issue in their influential article "Deconvolution and Checkerboard Artifacts" and proposed using nearest-neighbor or bilinear upsampling followed by a standard convolution (the "resize-convolution" approach) as a solution. Choosing kernel sizes that are divisible by the stride (for example, a 4x4 kernel with stride 2) also helps reduce these artifacts.
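A minimal sketch of the two upsampling options discussed above, assuming a 2x upsampling step and illustrative channel counts:

```python
import torch.nn as nn

# Transposed convolution: kernel size divisible by stride to limit artifacts
up_transposed = nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1)

# Resize-convolution: fixed nearest-neighbor upsampling, then a stride-1 convolution
up_resize = nn.Sequential(
    nn.Upsample(scale_factor=2, mode='nearest'),
    nn.Conv2d(128, 64, kernel_size=3, stride=1, padding=1),
)
```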
The receptive field of a neuron in a CNN is the region of the input image that influences that neuron's activation. Stride has a significant effect on the receptive field of neurons in deeper layers.
When a layer uses stride greater than 1, neurons in subsequent layers effectively "see" a larger region of the original input. This is because each position in the downsampled feature map corresponds to a larger area of the input. Specifically, the receptive field size at layer l can be computed recursively:
r_l = r_(l-1) + (k_l - 1) * (s_1 * s_2 * ... * s_(l-1))
where r_l is the receptive field at layer l, k_l is the kernel size at layer l, s_i is the stride at layer i, and the base case is r_0 = 1 (a single input pixel). The product term accumulates all preceding stride values, so a stride of 2 at any layer effectively doubles the contribution of all subsequent layers to the receptive field.
This relationship means that using strided convolutions or strided pooling in early layers is an efficient way to rapidly increase the receptive field, allowing deeper neurons to capture large-scale patterns and global context in the input.
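The recursion is straightforward to compute. The sketch below (a hypothetical helper, not from any particular library) shows how an early stride inflates the receptive field of every layer after it:

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples, in network order.
    Returns the receptive field of the final layer, starting from r_0 = 1."""
    r, jump = 1, 1  # jump = product of strides seen so far
    for k, s in layers:
        r += (k - 1) * jump
        jump *= s
    return r

# Three 3x3 stride-1 layers: the receptive field grows by 2 per layer
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7

# The same stack with a stride-2 first layer
print(receptive_field([(3, 2), (3, 1), (3, 1)]))  # 11
```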
Different architectures use stride strategically at specific points in the network. The table below summarizes stride usage in well-known models.
| Architecture | Year | First Layer Stride | Downsampling Method | Notes |
|---|---|---|---|---|
| LeNet-5 | 1998 | 1 | Average pooling (stride 2) | Pioneer CNN for digit recognition |
| AlexNet | 2012 | 4 | Overlapping max pooling (stride 2) | Large initial stride to reduce 227x227 input |
| VGG | 2014 | 1 | Max pooling (stride 2) | All conv layers use stride 1; pooling for downsampling |
| GoogLeNet | 2014 | 2 | Mixed (pooling + strided conv) | Inception modules with stride 1 convolutions |
| ResNet | 2015 | 2 | Strided convolution (stride 2) | 7x7 conv with stride 2 at input; strided conv at transitions |
| DenseNet | 2017 | 2 | Strided conv + pooling in transition layers | Dense blocks with stride 1 |
| EfficientNet | 2019 | 2 | Strided depthwise convolution | Compound scaling of depth, width, and resolution |
| ConvNeXt | 2022 | 4 | Strided convolution (stride 4 patchify stem) | Patchify stem inspired by Vision Transformers |
All major deep learning frameworks support stride as a parameter in convolution and pooling layers.
In PyTorch, the stride parameter is specified as an integer or tuple:
```python
import torch.nn as nn

# Stride 1 (default) - preserves spatial dimensions with padding
conv1 = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, stride=1, padding=1)

# Stride 2 - halves spatial dimensions
conv2 = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, stride=2, padding=1)

# Asymmetric stride
conv3 = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=(3, 5), stride=(2, 1), padding=(1, 2))

# Transposed convolution with stride 2 for upsampling
deconv = nn.ConvTranspose2d(in_channels=128, out_channels=64, kernel_size=4, stride=2, padding=1)
```
In TensorFlow and Keras, the parameter is named strides (plural):
```python
from tensorflow.keras.layers import Conv2D, Conv2DTranspose

# Stride 1 (default)
conv1 = Conv2D(filters=64, kernel_size=3, strides=1, padding='same')

# Stride 2 - halves spatial dimensions
conv2 = Conv2D(filters=128, kernel_size=3, strides=2, padding='same')

# Transposed convolution for upsampling
deconv = Conv2DTranspose(filters=64, kernel_size=4, strides=2, padding='same')
```
A key difference between the two frameworks is the default data format: PyTorch expects channel-first input (batch, channels, height, width), while TensorFlow/Keras defaults to channel-last input (batch, height, width, channels). The stride parameter behaves identically in both cases and is applied to the spatial dimensions.
Stride and dilation (also called atrous convolution) both affect how a convolution interacts with its input, but they serve different purposes. Stride controls how many positions the filter moves between applications, while dilation controls the spacing between filter elements.
A dilated convolution with dilation rate d expands the receptive field without reducing the spatial resolution and without increasing the number of parameters. This is in contrast to stride, which increases the receptive field by reducing spatial resolution. Dilated convolutions are commonly used in models like DeepLab for semantic segmentation, where maintaining spatial resolution is important.
| Property | Stride > 1 | Dilation > 1 |
|---|---|---|
| Effect on output size | Reduces spatial dimensions | Preserves spatial dimensions |
| Effect on receptive field | Increases (indirectly) | Increases (directly) |
| Additional parameters | None (same kernel) | None (same kernel) |
| Primary use case | Downsampling, efficiency | Enlarging receptive field without downsampling |
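The contrast is easy to verify in PyTorch. In the sketch below (illustrative channel counts), both layers use the same 3x3 kernel, but only the strided one shrinks the feature map:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)

strided = nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1)
dilated = nn.Conv2d(64, 64, kernel_size=3, dilation=2, padding=2)  # stride 1

print(strided(x).shape)  # torch.Size([1, 64, 16, 16])
print(dilated(x).shape)  # torch.Size([1, 64, 32, 32]) - resolution preserved
```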
When choosing a stride value for a particular layer, several factors should be considered:

- Required output resolution: dense prediction tasks such as segmentation and detection need fine spatial detail, favoring stride 1 (or dilation) over aggressive downsampling, while classification can tolerate much coarser feature maps.
- Computational budget: each stride-2 layer cuts the feature map area by roughly 75%, reducing the cost of every subsequent layer.
- Receptive field growth: strides placed early in the network multiply the receptive-field contribution of all later layers.
- Relationship to kernel size: a stride larger than the kernel size skips input positions entirely, discarding information, and is rarely useful.
- Architectural convention: most networks confine downsampling to a few transition points and use stride 1 everywhere else.
Imagine you have a big picture made of tiny colored squares, like a mosaic. You want to look at the picture through a small magnifying glass that can only see a few squares at a time.
If the stride is 1, you move your magnifying glass one square at a time, looking at almost every spot on the picture. You get a very detailed idea of what the picture looks like, but it takes a long time.
If the stride is 2, you skip a square each time you move the magnifying glass. You finish looking at the picture much faster, but the picture you remember is smaller because you skipped some spots.
If the stride is 3, you skip even more squares. You finish very quickly, but you might miss some details.
So the stride is just how many squares you skip each time you move your magnifying glass. A small stride gives you more detail but takes more work. A big stride is faster but gives you a smaller, less detailed picture.