Stride is a hyperparameter in convolutional neural networks (CNNs) that controls how many pixels (or units) a filter moves across the input at each step during the convolution or pooling operation. It directly determines the spatial dimensions of the output feature map and plays a central role in balancing computational efficiency, spatial resolution, and the network's ability to capture patterns at different scales.
In the context of a convolutional layer, a filter (also called a kernel) slides over the input data to compute dot products at each position. The stride specifies the number of positions the filter shifts between successive applications. A stride of 1 means the filter moves one pixel at a time, covering every possible position. A stride of 2 means the filter jumps two pixels at each step, skipping every other position.
Stride can be specified as a single integer (applied equally in both the horizontal and vertical directions) or as a tuple such as (2, 2) or (1, 2) to allow different stride values along each spatial dimension. Most frameworks default to a stride of 1 when no value is explicitly provided.
Consider a 7x7 input with a 3x3 filter. With stride 1, the filter can be placed at 5 positions along each axis, producing a 5x5 output. With stride 2, the filter lands at positions 0, 2, and 4 along each axis, producing a 3x3 output. The larger stride reduces the output size because the filter visits fewer positions.
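These numbers are easy to verify in code. Below is a minimal PyTorch check (assuming a single-channel input and an untrained 3x3 kernel; the layer names are illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 7, 7)  # batch of one single-channel 7x7 input

conv_s1 = nn.Conv2d(1, 1, kernel_size=3, stride=1)  # no padding
conv_s2 = nn.Conv2d(1, 1, kernel_size=3, stride=2)  # no padding

print(conv_s1(x).shape)  # torch.Size([1, 1, 5, 5])
print(conv_s2(x).shape)  # torch.Size([1, 1, 3, 3])
```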
The relationship between stride and output dimensions is governed by a well-known formula. For a single spatial dimension, the output size is:
O = floor((W - K + 2P) / S) + 1
Where:
| Symbol | Meaning |
|---|---|
| W | Input size (width or height) |
| K | Kernel size |
| P | Padding amount |
| S | Stride |
| O | Output size |
For two-dimensional inputs, the formula is applied independently to each spatial dimension: once for the height and once for the width, each with its own values of W, K, P, and S.
When dilation is used, the formula generalizes to:
O = floor((W + 2P - D * (K - 1) - 1) / S) + 1
where D is the dilation rate.
| Input Size | Kernel Size | Padding | Stride | Output Size |
|---|---|---|---|---|
| 32x32 | 3x3 | 0 | 1 | 30x30 |
| 32x32 | 3x3 | 1 | 1 | 32x32 |
| 32x32 | 3x3 | 1 | 2 | 16x16 |
| 224x224 | 7x7 | 3 | 2 | 112x112 |
| 8x8 | 3x3 | 0 | 3 | 2x2 |
The table illustrates that stride 1 with appropriate padding preserves the input dimensions, while stride 2 halves them. This is the most common configuration in modern CNN architectures.
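The formula itself is one line of code. The sketch below (plain Python, with parameter names matching the symbols above and using the dilated form of the formula) reproduces the table's rows:

```python
import math

def conv_output_size(w: int, k: int, p: int = 0, s: int = 1, d: int = 1) -> int:
    """Output size along one spatial dimension.
    W = input size, K = kernel size, P = padding, S = stride, D = dilation."""
    return math.floor((w + 2 * p - d * (k - 1) - 1) / s) + 1

print(conv_output_size(32, 3, p=0, s=1))   # 30
print(conv_output_size(32, 3, p=1, s=2))   # 16
print(conv_output_size(224, 7, p=3, s=2))  # 112
```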
The two most frequently used stride values are 1 and 2. They serve fundamentally different purposes in a network's design.
With stride 1 and "same" padding (P = (K - 1) / 2 for odd kernel sizes), the output feature map retains the same spatial dimensions as the input. This is the default setting in most convolutional layers and is preferred when fine-grained spatial detail must be preserved. Early layers in a network and layers within residual blocks commonly use stride 1.
Stride 2 roughly halves the spatial dimensions of the input, serving as a form of downsampling. This reduces the number of elements in the feature map by approximately 75%, significantly lowering the computational cost and memory usage of subsequent layers. Stride 2 is commonly used at transition points in a network where spatial resolution is intentionally reduced.
Networks such as VGG used max pooling with stride 2 to downsample, while more recent architectures like ResNet and many others use strided convolutions for the same purpose.
Stride is also a parameter in pooling layers, including max pooling and average pooling. In pooling, the stride controls how the pooling window moves across the feature map. The same output size formula applies.
In many standard pooling configurations, the stride equals the pool size (for example, a 2x2 max pooling window with stride 2), which produces non-overlapping pooling regions and halves the spatial dimensions. When the stride is smaller than the pool size, overlapping pooling occurs. AlexNet famously used overlapping max pooling with a 3x3 window and stride 2, which was shown to slightly reduce overfitting compared to non-overlapping pooling.
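A short PyTorch comparison (with a hypothetical 16x16, 64-channel feature map) shows the effect of the two configurations:

```python
import torch
import torch.nn as nn

fmap = torch.randn(1, 64, 16, 16)

non_overlap = nn.MaxPool2d(kernel_size=2, stride=2)  # stride == pool size
overlap = nn.MaxPool2d(kernel_size=3, stride=2)      # AlexNet-style, stride < pool size

print(non_overlap(fmap).shape)  # torch.Size([1, 64, 8, 8])
print(overlap(fmap).shape)      # torch.Size([1, 64, 7, 7])
```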
Traditionally, CNNs alternated convolutional layers with pooling layers to progressively reduce spatial dimensions. In 2015, Springenberg et al. published "Striving for Simplicity: The All Convolutional Net," which demonstrated that max pooling layers could be entirely replaced by convolutional layers with stride 2 without any loss in accuracy on benchmarks including CIFAR-10, CIFAR-100, and ImageNet.
The key insight of this work is that pooling is a fixed, non-learnable operation, while a strided convolution is a learnable downsampling operation. By replacing pooling with strided convolution, the network gains additional trainable parameters that allow it to learn the optimal way to reduce spatial dimensions for the task at hand. The authors found that "when pooling is replaced by an additional convolution layer with stride 2, performance stabilizes and even improves on the base model."
This approach has been widely adopted. Many modern architectures, including ResNet, DenseNet, and ConvNeXt, use strided convolutions for downsampling rather than pooling layers.
| Approach | Type | Learnable | Parameters | Typical Use |
|---|---|---|---|---|
| Max pooling (stride 2) | Fixed operation | No | 0 | Classical architectures (VGG, AlexNet) |
| Average pooling (stride 2) | Fixed operation | No | 0 | Transition layers, global pooling |
| Strided convolution (stride 2) | Learned operation | Yes | K x K x C_in x C_out | Modern architectures (ResNet, ConvNeXt) |
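In code, the swap amounts to moving the stride from the pooling layer into the convolution. The sketch below uses hypothetical channel counts; both blocks halve the spatial dimensions of an even-sized input:

```python
import torch.nn as nn

# Fixed downsampling: stride-1 convolution followed by max pooling
pooled = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),  # contributes no parameters
)

# Learned downsampling: the stride-2 convolution does both jobs
strided = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),  # 3*3*64*128 weights
    nn.ReLU(),
)
```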
Transposed convolutions (also called fractionally strided convolutions or sometimes incorrectly called deconvolutions) use stride in the opposite manner compared to standard convolutions. While stride greater than 1 in a standard convolution reduces spatial dimensions, stride greater than 1 in a transposed convolution increases spatial dimensions, performing upsampling.
In a transposed convolution, the stride parameter controls how much spacing is inserted between input elements before the convolution is applied. A stride of 2 effectively doubles the spatial dimensions of the output. This is why the operation is also called "fractionally strided convolution": a stride of 2 over the output is equivalent to a stride of 1/2 over the input.
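For completeness, the corresponding output-size formula for a transposed convolution (ignoring output padding and dilation, consistent with the convention used above) is:

O = S * (W - 1) + K - 2P

For example, a 16x16 input with K = 4, S = 2, and P = 1 yields O = 2 * 15 + 4 - 2 = 32, exactly doubling the spatial dimensions.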
Transposed convolutions with stride 2 are widely used in:

- the generator networks of GANs such as DCGAN, which progressively upsample a low-resolution latent representation into a full-resolution image
- the decoder half of convolutional autoencoders
- the upsampling path of encoder-decoder segmentation models such as FCN and U-Net
Transposed convolutions with certain stride and kernel size combinations can produce checkerboard artifacts in the output. This occurs because of uneven overlap when the kernel size is not divisible by the stride. Odena, Dumoulin, and Olah (2016) documented this issue in their influential article "Deconvolution and Checkerboard Artifacts" and proposed using nearest-neighbor or bilinear upsampling followed by a standard convolution (the "resize-convolution" approach) as a solution. Choosing kernel sizes that are divisible by the stride (for example, a 4x4 kernel with stride 2) also helps reduce these artifacts.
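A minimal sketch of the two upsampling options discussed above, assuming a 2x upsampling step and illustrative channel counts:

```python
import torch.nn as nn

# Transposed convolution: kernel size divisible by stride to limit artifacts
up_transposed = nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1)

# Resize-convolution: fixed nearest-neighbor upsampling, then a stride-1 convolution
up_resize = nn.Sequential(
    nn.Upsample(scale_factor=2, mode='nearest'),
    nn.Conv2d(128, 64, kernel_size=3, stride=1, padding=1),
)
```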
The receptive field of a neuron in a CNN is the region of the input image that influences that neuron's activation. Stride has a significant effect on the receptive field of neurons in deeper layers.
When a layer uses stride greater than 1, neurons in subsequent layers effectively "see" a larger region of the original input. This is because each position in the downsampled feature map corresponds to a larger area of the input. Specifically, the receptive field size at layer l can be computed recursively:
r_l = r_(l-1) + (k_l - 1) * (s_1 * s_2 * ... * s_(l-1))
where r_l is the receptive field at layer l, k_l is the kernel size at layer l, s_i is the stride at layer i, and the base case is r_0 = 1 (a single input pixel). The product term accumulates all preceding stride values, so a stride of 2 at any layer effectively doubles the contribution of all subsequent layers to the receptive field.
This relationship means that using strided convolutions or strided pooling in early layers is an efficient way to rapidly increase the receptive field, allowing deeper neurons to capture large-scale patterns and global context in the input.
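The recursion is straightforward to compute. The sketch below (a hypothetical helper, not from any particular library) shows how an early stride inflates the receptive field of every layer after it:

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples, in network order.
    Returns the receptive field of the final layer, starting from r_0 = 1."""
    r, jump = 1, 1  # jump = product of strides seen so far
    for k, s in layers:
        r += (k - 1) * jump
        jump *= s
    return r

# Three 3x3 stride-1 layers: the receptive field grows by 2 per layer
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7

# The same stack with a stride-2 first layer
print(receptive_field([(3, 2), (3, 1), (3, 1)]))  # 11
```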
Different architectures use stride strategically at specific points in the network. The table below summarizes stride usage in well-known models.
| Architecture | Year | First Layer Stride | Downsampling Method | Notes |
|---|---|---|---|---|
| LeNet-5 | 1998 | 1 | Average pooling (stride 2) | Pioneer CNN for digit recognition |
| AlexNet | 2012 | 4 | Overlapping max pooling (stride 2) | Large initial stride to reduce 227x227 input |
| VGG | 2014 | 1 | Max pooling (stride 2) | All conv layers use stride 1; pooling for downsampling |
| GoogLeNet | 2014 | 2 | Mixed (pooling + strided conv) | Inception modules with stride 1 convolutions |
| ResNet | 2015 | 2 | Strided convolution (stride 2) | 7x7 conv with stride 2 at input; strided conv at transitions |
| DenseNet | 2017 | 2 | Strided conv + pooling in transition layers | Dense blocks with stride 1 |
| EfficientNet | 2019 | 2 | Strided depthwise convolution | Compound scaling of depth, width, and resolution |
| ConvNeXt | 2022 | 4 | Strided convolution (stride 4 patchify stem) | Patchify stem inspired by Vision Transformers |
All major deep learning frameworks support stride as a parameter in convolution and pooling layers.
In PyTorch, the stride parameter is specified as an integer or tuple:
```python
import torch.nn as nn

# Stride 1 (default) - preserves spatial dimensions with padding
conv1 = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, stride=1, padding=1)

# Stride 2 - halves spatial dimensions
conv2 = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, stride=2, padding=1)

# Asymmetric stride
conv3 = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=(3, 5), stride=(2, 1), padding=(1, 2))

# Transposed convolution with stride 2 for upsampling
deconv = nn.ConvTranspose2d(in_channels=128, out_channels=64, kernel_size=4, stride=2, padding=1)
```
In TensorFlow and Keras, the parameter is named strides (plural):
```python
from tensorflow.keras.layers import Conv2D, Conv2DTranspose

# Stride 1 (default)
conv1 = Conv2D(filters=64, kernel_size=3, strides=1, padding='same')

# Stride 2 - halves spatial dimensions
conv2 = Conv2D(filters=128, kernel_size=3, strides=2, padding='same')

# Transposed convolution for upsampling
deconv = Conv2DTranspose(filters=64, kernel_size=4, strides=2, padding='same')
```
A key difference between the two frameworks is the default data format: PyTorch expects channel-first input (batch, channels, height, width), while TensorFlow/Keras defaults to channel-last input (batch, height, width, channels). The stride parameter behaves identically in both cases and is applied to the spatial dimensions.
Stride and dilation (also called atrous convolution) both affect how a convolution interacts with its input, but they serve different purposes. Stride controls how many positions the filter moves between applications, while dilation controls the spacing between filter elements.
A dilated convolution with dilation rate d expands the receptive field without reducing the spatial resolution and without increasing the number of parameters. This is in contrast to stride, which increases the receptive field by reducing spatial resolution. Dilated convolutions are commonly used in models like DeepLab for semantic segmentation, where maintaining spatial resolution is important.
| Property | Stride > 1 | Dilation > 1 |
|---|---|---|
| Effect on output size | Reduces spatial dimensions | Preserves spatial dimensions |
| Effect on receptive field | Increases (indirectly) | Increases (directly) |
| Additional parameters | None (same kernel) | None (same kernel) |
| Primary use case | Downsampling, efficiency | Enlarging receptive field without downsampling |
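The contrast is easy to verify in PyTorch. In the sketch below (illustrative channel counts), both layers use the same 3x3 kernel, but only the strided one shrinks the feature map:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)

strided = nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1)
dilated = nn.Conv2d(64, 64, kernel_size=3, dilation=2, padding=2)  # stride 1

print(strided(x).shape)  # torch.Size([1, 64, 16, 16])
print(dilated(x).shape)  # torch.Size([1, 64, 32, 32]) - resolution preserved
```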
When choosing a stride value for a particular layer, several factors should be considered:

- Required output resolution: dense prediction tasks such as segmentation and detection need fine spatial detail, favoring stride 1 (or dilation) over aggressive downsampling, while classification can tolerate much coarser feature maps.
- Computational budget: each stride-2 layer cuts the feature map area by roughly 75%, reducing the cost of every subsequent layer.
- Receptive field growth: strides placed early in the network multiply the receptive-field contribution of all later layers.
- Relationship to kernel size: a stride larger than the kernel size skips input positions entirely, discarding information, and is rarely useful.
- Architectural convention: most networks confine downsampling to a few transition points and use stride 1 everywhere else.
Imagine you have a big picture made of tiny colored squares, like a mosaic. You want to look at the picture through a small magnifying glass that can only see a few squares at a time.
If the stride is 1, you move your magnifying glass one square at a time, looking at almost every spot on the picture. You get a very detailed idea of what the picture looks like, but it takes a long time.
If the stride is 2, you skip a square each time you move the magnifying glass. You finish looking at the picture much faster, but the picture you remember is smaller because you skipped some spots.
If the stride is 3, you skip even more squares. You finish very quickly, but you might miss some details.
So the stride is just how many squares you skip each time you move your magnifying glass. A small stride gives you more detail but takes more work. A big stride is faster but gives you a smaller, less detailed picture.