# Stride

> Source: https://aiwiki.ai/wiki/stride
> Updated: 2026-06-23
> Categories: Computer Vision, Deep Learning, Machine Learning
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**Stride** is the step size by which a filter (or pooling window) moves across the input in a [convolutional neural network](/wiki/convolutional_neural_network) (CNN): a stride of 1 shifts the filter one position at a time and visits every location, while a stride greater than 1 skips positions and downsamples the output. Stride is the hyperparameter that, together with kernel size and [padding](/wiki/padding), determines the spatial dimensions of the output [feature map](/wiki/feature), and it directly trades spatial resolution against computational cost. The textbook "Dive into Deep Learning" defines it concisely as "the number of rows and columns traversed per slide." [7]

Stride applies to both the [convolution](/wiki/convolution) operation and to [pooling](/wiki/pooling). It plays a central role in balancing computational efficiency, spatial resolution, and the network's ability to capture patterns at different scales. [9]

## What does stride do in a convolutional layer?

In the context of a [convolutional layer](/wiki/convolutional_layer), a filter (also called a [kernel](/wiki/kernel)) slides over the input data to compute dot products at each position. The stride specifies the number of positions the filter shifts between successive applications. A stride of 1 means the filter moves one pixel at a time, covering every possible position. A stride of 2 means the filter jumps two pixels at each step, skipping every other position. [7]

Stride can be specified as a single integer (applied equally in both the horizontal and vertical directions) or as a tuple such as (2, 2) or (1, 2) to allow different stride values along each spatial dimension. Most frameworks default to a stride of 1 when no value is explicitly provided. [10]

### Visual Intuition

Consider a 7x7 input with a 3x3 filter. With stride 1, the filter can be placed at 5 positions along each axis, producing a 5x5 output. With stride 2, the filter lands at positions 0, 2, and 4 along each axis, producing a 3x3 output. The larger stride reduces the output size because the filter visits fewer positions.
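The positions in this example can be enumerated directly. The following is a minimal sketch for one axis with no padding; the helper name `filter_positions` is illustrative and not part of any framework.

```python
def filter_positions(input_size, kernel_size, stride):
    """Return the start indices where the filter fits along one axis (no padding)."""
    last_start = input_size - kernel_size  # last index where the filter still fits
    return list(range(0, last_start + 1, stride))


print(filter_positions(7, 3, 1))  # [0, 1, 2, 3, 4] -> 5 positions, 5x5 output
print(filter_positions(7, 3, 2))  # [0, 2, 4]       -> 3 positions, 3x3 output
```

The number of returned positions along each axis is the output size along that axis.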

## How is output size calculated from stride?

The relationship between stride and output dimensions is governed by a well-known formula. For a single spatial dimension, the output size is:

**O = floor((W - K + 2P) / S) + 1**

Where:

| Symbol | Meaning |
|--------|---------|
| W | Input size (width or height) |
| K | [Kernel](/wiki/kernel) size |
| P | [Padding](/wiki/padding) amount |
| S | Stride |
| O | Output size |

For two-dimensional inputs, the formula is applied independently to each spatial dimension: [7]

- **W_out = floor((W_in - K_w + 2 * P_w) / S_w) + 1**
- **H_out = floor((H_in - K_h + 2 * P_h) / S_h) + 1**

When [dilation](/wiki/dilation) is used, the formula generalizes to: [3]

**O = floor((W + 2P - D * (K - 1) - 1) / S) + 1**

where D is the dilation rate.

### Worked Example

| Input Size | Kernel Size | Padding | Stride | Output Size |
|------------|-------------|---------|--------|-------------|
| 32x32 | 3x3 | 0 | 1 | 30x30 |
| 32x32 | 3x3 | 1 | 1 | 32x32 |
| 32x32 | 3x3 | 1 | 2 | 16x16 |
| 224x224 | 7x7 | 3 | 2 | 112x112 |
| 8x8 | 3x3 | 0 | 3 | 2x2 |

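The table illustrates that stride 1 with appropriate padding preserves the input dimensions, while stride 2 with the same padding halves them; these two settings are the most common in modern CNN architectures. The 224x224 to 112x112 row is exactly the input stem used by [ResNet](/wiki/resnet), whose first layer is a 7x7 convolution with stride 2. [4] The last row shows the effect of the floor: with stride 3 the filter fits only at positions 0 and 3, so the final two rows and columns of the 8x8 input are never covered.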
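The rows of the table can be reproduced with a small helper. This is a sketch of the output size formula, including the dilation term, for a single spatial dimension; the function name `conv_output_size` and its argument names are assumptions made for illustration.

```python
def conv_output_size(w, k, s=1, p=0, d=1):
    """Output length along one axis: floor((w + 2p - d*(k-1) - 1) / s) + 1."""
    if s < 1 or d < 1:
        raise ValueError("stride and dilation must be positive integers")
    effective_k = d * (k - 1) + 1  # span of the kernel after dilation
    if w + 2 * p < effective_k:
        raise ValueError("kernel does not fit in the padded input")
    return (w + 2 * p - effective_k) // s + 1


# (input size, kernel size, padding, stride) for each row of the table
rows = [(32, 3, 0, 1), (32, 3, 1, 1), (32, 3, 1, 2), (224, 7, 3, 2), (8, 3, 0, 3)]
for w, k, p, s in rows:
    print(w, k, p, s, "->", conv_output_size(w, k, s=s, p=p))
# Output sizes, in order: 30, 32, 16, 112, 2
```

For a two-dimensional input, the same function is called once per axis with that axis's kernel size, padding and stride.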

## How does stride 1 differ from stride 2?

The two most frequently used stride values are 1 and 2. They serve fundamentally different purposes in a network's design.

### Stride = 1 (Preserving Dimensions)

With stride 1 and "same" padding (P = (K - 1) / 2 for odd kernel sizes), the output feature map retains the same spatial dimensions as the input. This is the default setting in most [convolutional layers](/wiki/convolutional_layer) and is preferred when fine-grained spatial detail must be preserved. Early layers in a network and layers within residual blocks commonly use stride 1. [10]

### Stride = 2 (Halving Dimensions)

Stride 2 roughly halves the spatial dimensions of the input, serving as a form of [downsampling](/wiki/downsampling). Because the output has roughly half the width and half the height, it contains about one quarter as many elements, reducing the number of activations in the feature map by approximately 75% and significantly lowering the computational cost and memory usage of subsequent layers. Stride 2 is commonly used at transition points in a network where spatial resolution is intentionally reduced.
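The one-quarter figure can be checked with simple arithmetic. The sketch below assumes a 3x3 kernel with padding 1 and an unchanged channel count; the 56x56x64 feature map is an arbitrary example, and in practice many networks increase the channel count at the same point, which offsets part of the saving.

```python
def strided_size(size, stride=2, kernel=3, padding=1):
    """Output length along one axis for a padded convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def activation_count(height, width, channels):
    """Number of elements in a feature map."""
    return height * width * channels


h = w = 56
c = 64
before = activation_count(h, w, c)
after = activation_count(strided_size(h), strided_size(w), c)
print(strided_size(h))  # 28
print(after / before)   # 0.25
```

For odd input sizes the ratio is only approximately one quarter, because the output size is rounded.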

Networks such as [VGG](/wiki/vgg) used max [pooling](/wiki/pooling) with stride 2 to downsample [6], while more recent architectures like [ResNet](/wiki/resnet) and many others use strided convolutions for the same purpose. [4]

## How does stride work in pooling layers?

Stride is also a parameter in [pooling](/wiki/pooling) layers, including max pooling and average pooling. In pooling, the stride controls how the pooling window moves across the feature map. The same output size formula applies. [7]

In many standard pooling configurations, the stride equals the pool size (for example, a 2x2 max pooling window with stride 2), which produces non-overlapping pooling regions and halves the spatial dimensions. When the stride is smaller than the pool size, overlapping pooling occurs. [AlexNet](/wiki/alexnet) famously used overlapping max pooling with a 3x3 window and stride 2. In the words of Krizhevsky, Sutskever, and Hinton, "This scheme reduces the top-1 and top-5 error rates by 0.4% and 0.3%, respectively, as compared with the non-overlapping scheme s = 2, z = 2," and they added that "models with overlapping pooling find it slightly more difficult to overfit." [5]
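The difference between non-overlapping and overlapping pooling can be sketched in plain Python. The `max_pool2d` helper below is an illustrative implementation without padding, not a framework function, and the 5x5 input is an arbitrary example.

```python
def max_pool2d(x, window, stride):
    """Max pooling over a 2D list of numbers, without padding."""
    h, w = len(x), len(x[0])
    out_h = (h - window) // stride + 1
    out_w = (w - window) // stride + 1
    return [
        [
            max(
                x[i * stride + di][j * stride + dj]
                for di in range(window)
                for dj in range(window)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# 5x5 input holding the values 0..24 in row-major order
x = [[5 * r + c for c in range(5)] for r in range(5)]

print(max_pool2d(x, window=2, stride=2))  # [[6, 8], [16, 18]]   (non-overlapping)
print(max_pool2d(x, window=3, stride=2))  # [[12, 14], [22, 24]] (overlapping)
```

Both settings produce a 2x2 output here, but the 3x3 windows share a row and a column with their neighbors, while the 2x2 windows never touch the last row and column of the input.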

## Can strided convolution replace pooling?

Traditionally, CNNs alternated [convolutional layers](/wiki/convolutional_layer) with [pooling](/wiki/pooling) layers to progressively reduce spatial dimensions. In 2014, Springenberg, Dosovitskiy, Brox, and Riedmiller published "Striving for Simplicity: The All Convolutional Net" (ICLR 2015 workshop), which demonstrated that max pooling layers could be entirely replaced by convolutional layers with stride 2. The paper's central claim is that "max-pooling can simply be replaced by a convolutional layer with increased stride without loss in accuracy on several image recognition benchmarks," including [CIFAR-10](/wiki/cifar_10), CIFAR-100, and [ImageNet](/wiki/imagenet). [1]

The key insight of this work is that pooling is a fixed, non-learnable operation, while a strided convolution is a learnable downsampling operation. By replacing pooling with strided convolution, the network gains additional trainable parameters that allow it to learn the optimal way to reduce spatial dimensions for the task at hand. The authors reported that "when pooling is replaced by an additional convolution layer with stride r=2 performance stabilizes and even improves on the base model." [1] Their all-convolutional network reached a test error of 9.08% on CIFAR-10 without data augmentation and 7.25% with augmentation, 33.71% on CIFAR-100, and a top-1 validation error of 41.2% on ImageNet. [1]

This approach has been widely adopted. Many modern architectures, including [ResNet](/wiki/resnet) [4] and [ConvNeXt](/wiki/convnext), perform most of their [downsampling](/wiki/downsampling) with strided convolutions rather than pooling layers, although some, such as [DenseNet](/wiki/densenet), still pool in their transition layers. ConvNeXt, for example, replaces the classic ResNet stem (a 7x7 convolution with stride 2 followed by max pooling, a 4x downsampling overall) with a single non-overlapping 4x4 convolution with stride 4, a "patchify" stem borrowed from [Vision Transformers](/wiki/vision_transformer); this change nudged ImageNet top-1 accuracy from 79.4% to 79.5% in the authors' modernization study. [11]

| Approach | Type | Learnable | Parameters | Typical Use |
|----------|------|-----------|------------|-------------|
| Max pooling (stride 2) | Fixed operation | No | 0 | Classical architectures (VGG, AlexNet) |
| Average pooling (stride 2) | Fixed operation | No | 0 | Transition layers, global pooling |
| Strided convolution (stride 2) | Learned operation | Yes | K x K x C_in x C_out | Modern architectures (ResNet, ConvNeXt) |
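The parameter column can be made concrete with a short calculation. This sketch assumes a standard (non-grouped) square convolution; `conv_param_count` is an illustrative name, and the optional bias term is an addition to the weight count given in the table.

```python
def conv_param_count(k, c_in, c_out, bias=True):
    """Trainable parameters of a K x K convolution: weights plus optional biases."""
    weights = k * k * c_in * c_out
    return weights + (c_out if bias else 0)


# A 3x3 strided convolution mapping 64 channels to 128 channels
print(conv_param_count(3, 64, 128, bias=False))  # 73728
print(conv_param_count(3, 64, 128))              # 73856

# Max and average pooling have no trainable parameters at all.
```

The stride itself does not appear in the count: a stride-2 convolution has exactly as many parameters as the same convolution with stride 1.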

## How does stride behave in transposed convolutions?

[Transposed convolutions](/wiki/transposed_convolution) (also called fractionally strided convolutions, or sometimes, incorrectly, deconvolutions) use stride in the opposite way to standard convolutions. A stride greater than 1 in a standard convolution reduces the spatial dimensions, whereas a stride greater than 1 in a transposed convolution increases them, performing upsampling. [3]

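In a transposed convolution, the stride parameter controls how much spacing is inserted between input elements before the convolution is applied: a stride of S places S - 1 zeros between neighboring input elements. A stride of 2 therefore roughly doubles the spatial dimensions of the output, and doubles them exactly for suitable kernel size and padding (for example, a 4x4 kernel with padding 1). This is why the operation is also called "fractionally strided convolution": because of the inserted zeros, the kernel advances over the original input elements at half the pace of a unit-stride convolution, which corresponds to a stride of 1/2. [3]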
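The zero-insertion view can be checked in one dimension. The sketch below, written in plain Python with illustrative function names, computes a transposed convolution in two ways: by scattering each input value through the kernel, and by inserting zeros between inputs and then running an ordinary stride-1 convolution with the flipped kernel. Padding is left out for simplicity.

```python
def conv_transpose1d(x, kernel, stride):
    """Transposed 1D convolution (no padding): each input scatters a scaled kernel."""
    k = len(kernel)
    out = [0.0] * ((len(x) - 1) * stride + k)
    for i, value in enumerate(x):
        for j, weight in enumerate(kernel):
            out[i * stride + j] += value * weight
    return out


def zero_insert_then_conv(x, kernel, stride):
    """Same result via zero insertion followed by a stride-1 convolution."""
    k = len(kernel)
    # Insert (stride - 1) zeros between neighboring input elements.
    spaced = []
    for i, value in enumerate(x):
        spaced.append(value)
        if i < len(x) - 1:
            spaced.extend([0.0] * (stride - 1))
    # Pad k - 1 zeros on both sides, then slide the flipped kernel one step at a time.
    padded = [0.0] * (k - 1) + spaced + [0.0] * (k - 1)
    flipped = kernel[::-1]
    return [
        sum(padded[n + m] * flipped[m] for m in range(k))
        for n in range(len(padded) - k + 1)
    ]


x = [1.0, 2.0, 3.0]
kernel = [1.0, 2.0, 3.0, 4.0]
print(conv_transpose1d(x, kernel, stride=2))
print(zero_insert_then_conv(x, kernel, stride=2) == conv_transpose1d(x, kernel, stride=2))  # True
```

With 3 inputs, a kernel of size 4 and stride 2, the unpadded output has (3 - 1) * 2 + 4 = 8 elements. Cropping one element from each side, which is what a padding of 1 does in a transposed convolution, leaves 6 elements, exactly twice the input length.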

Transposed convolutions with stride 2 are widely used in:

- **Semantic segmentation** networks such as [FCN](/wiki/fcn) and [U-Net](/wiki/unet), which upsample feature maps back to the original input resolution for pixel-wise classification. [8]
- **[Generative adversarial networks](/wiki/generative_adversarial_network)** (GANs), where the generator transforms low-dimensional [latent vectors](/wiki/latent_space) into high-resolution images through a series of upsampling layers.
- **Super-resolution** models that increase the spatial resolution of images.

### What causes checkerboard artifacts?

Transposed convolutions with certain stride and kernel size combinations can produce checkerboard artifacts in the output. Odena, Dumoulin, and Olah (2016) documented this issue in their influential Distill article "Deconvolution and Checkerboard Artifacts," explaining that "deconvolution has uneven overlap when the kernel size (the output window size) is not divisible by the stride." [2] As a remedy they proposed to "resize the image (using nearest-neighbor interpolation or bilinear interpolation) and then do a convolutional layer" (the "resize-convolution" approach). [2] Choosing kernel sizes that are divisible by the stride (for example, a 4x4 kernel with stride 2) also helps reduce these artifacts. [2]
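The uneven overlap can be made visible by counting how many kernel placements write to each output position. The following one-dimensional sketch ignores padding and kernel weights; `overlap_counts` is an illustrative helper, not a library function.

```python
def overlap_counts(input_len, kernel_size, stride):
    """Count how many kernel placements touch each output position (1D, no padding)."""
    out = [0] * ((input_len - 1) * stride + kernel_size)
    for i in range(input_len):
        for j in range(kernel_size):
            out[i * stride + j] += 1
    return out


# Kernel size 3 is not divisible by stride 2: interior counts alternate between 2 and 1.
print(overlap_counts(5, 3, 2))  # [1, 1, 2, 1, 2, 1, 2, 1, 2, 1, 1]

# Kernel size 4 is divisible by stride 2: interior counts are uniform.
print(overlap_counts(5, 4, 2))  # [1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1]
```

In two dimensions the alternating pattern appears along both axes, which produces the characteristic checkerboard.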

## How does stride affect the receptive field?

The [receptive field](/wiki/receptive_field) of a neuron in a CNN is the region of the input image that influences that neuron's activation. Stride has a significant effect on the receptive field of neurons in deeper layers.

When a layer uses stride greater than 1, neurons in subsequent layers effectively "see" a larger region of the original input. This is because each position in the downsampled feature map corresponds to a larger area of the input. Specifically, the receptive field size at layer l can be computed recursively:

**r_l = r_(l-1) + (k_l - 1) * product of all strides in layers 1 to (l-1)**

where r_l is the receptive field at layer l, k_l is the kernel size at layer l, and the product term accumulates all preceding stride values. A stride of 2 at any layer effectively doubles the contribution of all subsequent layers to the receptive field. [9]

This relationship means that using strided convolutions or strided pooling in early layers is an efficient way to rapidly increase the receptive field, allowing deeper neurons to capture large-scale patterns and global context in the input.
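The recursive formula above translates into a short loop. This is a sketch for a plain sequential stack of layers without dilation; the function name and the `(kernel_size, stride)` tuple layout are assumptions made for the example.

```python
def receptive_field(layers):
    """Receptive field of the last layer; layers is a list of (kernel_size, stride)."""
    r = 1     # a single input pixel sees only itself
    jump = 1  # product of the strides of all preceding layers
    for kernel_size, stride in layers:
        r += (kernel_size - 1) * jump
        jump *= stride
    return r


# Three 3x3 convolutions, all with stride 1
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7

# Same stack, but the first layer uses stride 2
print(receptive_field([(3, 2), (3, 1), (3, 1)]))  # 11
```

Moving a single stride of 2 to the first layer grows the receptive field from 7 to 11 pixels with no extra parameters, because every later layer's contribution is doubled.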

## How do popular CNN architectures use stride?

Different architectures use stride strategically at specific points in the network. The table below summarizes stride usage in well-known models.

| Architecture | Year | First Layer Stride | Downsampling Method | Notes |
|-------------|------|--------------------|---------------------|-------|
| [LeNet-5](/wiki/lenet) | 1998 | 1 | Average pooling (stride 2) | Pioneer CNN for digit recognition |
| [AlexNet](/wiki/alexnet) | 2012 | 4 | Overlapping max pooling (stride 2) | Large initial stride to reduce the input; 60 million parameters [5] |
| [VGG](/wiki/vgg) | 2014 | 1 | Max pooling (stride 2) | All conv layers use stride 1; pooling for downsampling [6] |
| [GoogLeNet](/wiki/googlenet) | 2014 | 2 | Mixed (pooling + strided conv) | Inception modules with stride 1 convolutions |
| [ResNet](/wiki/resnet) | 2015 | 2 | Strided convolution (stride 2) | 7x7 conv with stride 2 at input; strided conv at transitions [4] |
| [DenseNet](/wiki/densenet) | 2017 | 2 | Strided stem convolution; average pooling (stride 2) in transition layers | Dense blocks with stride 1 |
| [EfficientNet](/wiki/efficientnet) | 2019 | 2 | Strided depthwise convolution | Compound scaling of depth, width, and resolution |
| [ConvNeXt](/wiki/convnext) | 2022 | 4 | Strided convolution (stride 4 patchify stem) | Patchify stem inspired by [Vision Transformers](/wiki/vision_transformer) [11] |

AlexNet's first convolutional layer "filters the 224 x 224 x 3 input image with 96 kernels of size 11 x 11 x 3 with a stride of 4 pixels," the large stride being chosen to shrink the high-resolution input quickly. [5]

## Implementation in Deep Learning Frameworks

All major [deep learning](/wiki/deep_learning) frameworks support stride as a parameter in convolution and pooling layers.

### PyTorch

In [PyTorch](/wiki/pytorch), the `stride` parameter is specified as an integer or tuple: [10]

```python
import torch.nn as nn

# Stride 1 (default) - preserves spatial dimensions with padding
conv1 = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, stride=1, padding=1)

# Stride 2 - halves spatial dimensions
conv2 = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, stride=2, padding=1)

# Asymmetric stride
conv3 = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=(3, 5), stride=(2, 1), padding=(1, 2))

# Transposed convolution with stride 2 for upsampling
deconv = nn.ConvTranspose2d(in_channels=128, out_channels=64, kernel_size=4, stride=2, padding=1)
```

### TensorFlow / Keras

In [TensorFlow](/wiki/tensorflow) and Keras, the parameter is named `strides` (plural):

```python
from tensorflow.keras.layers import Conv2D, Conv2DTranspose

# Stride 1 (default)
conv1 = Conv2D(filters=64, kernel_size=3, strides=1, padding='same')

# Stride 2 - halves spatial dimensions
conv2 = Conv2D(filters=128, kernel_size=3, strides=2, padding='same')

# Transposed convolution for upsampling
deconv = Conv2DTranspose(filters=64, kernel_size=4, strides=2, padding='same')
```

A key difference between the two frameworks is the default data format: PyTorch expects channel-first input (batch, channels, height, width), while TensorFlow/Keras defaults to channel-last input (batch, height, width, channels). The stride parameter behaves identically in both cases and is applied to the spatial dimensions.
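The point about data formats can be illustrated with a framework-independent shape calculation. The helper below is a hypothetical sketch that uses explicit integer padding in place of Keras's `'same'` and `'valid'` strings; it only shows that the stride acts on height and width wherever the channel axis sits.

```python
def conv2d_output_shape(input_shape, out_channels, kernel, stride, padding, data_format):
    """Output shape of a 2D convolution for 'NCHW' or 'NHWC' layouts."""
    if data_format == "NCHW":
        n, _, h, w = input_shape
    elif data_format == "NHWC":
        n, h, w, _ = input_shape
    else:
        raise ValueError("data_format must be 'NCHW' or 'NHWC'")
    # Stride, kernel and padding are (height, width) pairs in both layouts.
    out_h = (h + 2 * padding[0] - kernel[0]) // stride[0] + 1
    out_w = (w + 2 * padding[1] - kernel[1]) // stride[1] + 1
    if data_format == "NCHW":
        return (n, out_channels, out_h, out_w)
    return (n, out_h, out_w, out_channels)


args = dict(out_channels=64, kernel=(3, 3), stride=(2, 1), padding=(1, 1))
print(conv2d_output_shape((8, 3, 32, 48), data_format="NCHW", **args))  # (8, 64, 16, 48)
print(conv2d_output_shape((8, 32, 48, 3), data_format="NHWC", **args))  # (8, 16, 48, 64)
```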

## How is stride different from dilation?

Stride and [dilation](/wiki/dilation) (also called atrous convolution) both affect how a convolution interacts with its input, but they serve different purposes. Stride controls how many positions the filter moves between applications, while dilation controls the spacing between filter elements. [3]

A dilated convolution with dilation rate d expands the receptive field without reducing the spatial resolution and without increasing the number of parameters. This is in contrast to stride, which increases the receptive field by reducing spatial resolution. Dilated convolutions are commonly used in models like [DeepLab](/wiki/deeplab) for semantic segmentation, where maintaining spatial resolution is important.

| Property | Stride > 1 | Dilation > 1 |
|----------|------------|---------------|
| Effect on output size | Reduces spatial dimensions | Preserves spatial dimensions (with matching padding) |
| Effect on receptive field | Increases (indirectly) | Increases (directly) |
| Additional parameters | None (same kernel) | None (same kernel) |
| Primary use case | Downsampling, efficiency | Enlarging receptive field without downsampling |
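The contrast in the table can be checked numerically with the dilated output size formula given earlier in the article. This is a minimal sketch for one axis; the function names are illustrative.

```python
def conv_output_size(w, k, s=1, p=0, d=1):
    """Output length along one axis: floor((w + 2p - d*(k-1) - 1) / s) + 1."""
    return (w + 2 * p - d * (k - 1) - 1) // s + 1


def effective_kernel(k, d):
    """Span covered by a kernel of size k with dilation d."""
    return d * (k - 1) + 1


# 3x3 kernel on a 32-pixel axis
print(conv_output_size(32, 3, s=2, p=1))       # 16: stride 2 halves the output
print(conv_output_size(32, 3, s=1, p=2, d=2))  # 32: dilation 2 keeps the size (padding 2)
print(effective_kernel(3, 2))                  # 5: the dilated kernel spans 5 pixels
```

Both layers use the same nine weights per channel pair; only the sampling pattern differs. Note that the dilated layer keeps the output size only because its padding was raised from 1 to 2 to match the larger effective kernel.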

## Practical Considerations

When choosing a stride value for a particular layer, several factors should be considered:

- **Spatial resolution vs. efficiency**: Larger strides reduce computational cost and memory usage but discard spatial information. For tasks requiring precise localization (such as [object detection](/wiki/object_detection) or segmentation), aggressive downsampling early in the network can hurt performance.
- **Information loss**: Unlike pooling, which selects or averages values, a strided convolution computes new features at fewer positions. The filter still "sees" its full receptive field at each position, but positions between stride steps are skipped entirely.
- **Alignment with kernel size**: The kernel size should generally be larger than or equal to the stride to avoid gaps in coverage. A 3x3 kernel with stride 2 provides overlapping coverage, while a 2x2 kernel with stride 2 produces exactly non-overlapping coverage.
- **Transposed convolution artifacts**: When using transposed convolutions for upsampling, choose kernel sizes divisible by the stride to minimize checkerboard artifacts. [2]
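The kernel size and stride alignment point can be checked by listing the input positions that no filter placement ever covers. This is a one-dimensional sketch without padding; `uncovered_positions` is an illustrative helper.

```python
def uncovered_positions(input_size, kernel_size, stride):
    """Input indices along one axis that are never covered by the filter (no padding)."""
    covered = set()
    for start in range(0, input_size - kernel_size + 1, stride):
        covered.update(range(start, start + kernel_size))
    return [i for i in range(input_size) if i not in covered]


print(uncovered_positions(9, 3, 2))  # []     overlapping coverage
print(uncovered_positions(8, 2, 2))  # []     exactly non-overlapping coverage
print(uncovered_positions(8, 2, 3))  # [2, 5] kernel smaller than stride leaves gaps
```

Even when the kernel is at least as large as the stride, trailing positions can still be dropped if the last window does not fit, which is one reason padding is usually chosen together with the stride.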

## Explain Like I'm 5 (ELI5)

Imagine you have a big picture made of tiny colored squares, like a mosaic. You want to look at the picture through a small magnifying glass that can only see a few squares at a time.

If the stride is 1, you move your magnifying glass one square at a time, looking at every spot on the picture. You get a very detailed idea of what the picture looks like, but it takes a long time.

If the stride is 2, you skip a square each time you move the magnifying glass. You finish looking at the picture much faster, but the picture you remember is smaller because you skipped some spots.

If the stride is 3, you skip even more squares. You finish very quickly, but you might miss some details.

So the stride is just how many squares you move your magnifying glass each time. A small stride gives you more detail but takes more work. A big stride is faster but gives you a smaller, less detailed picture.

## References

1. Springenberg, J. T., Dosovitskiy, A., Brox, T., & Riedmiller, M. (2015). "Striving for Simplicity: The All Convolutional Net." *ICLR 2015 Workshop Track*. [arXiv:1412.6806](https://arxiv.org/abs/1412.6806)
2. Odena, A., Dumoulin, V., & Olah, C. (2016). "Deconvolution and Checkerboard Artifacts." *Distill*. [https://distill.pub/2016/deconv-checkerboard/](https://distill.pub/2016/deconv-checkerboard/)
3. Dumoulin, V. & Visin, F. (2016). "A Guide to Convolution Arithmetic for Deep Learning." [arXiv:1603.07285](https://arxiv.org/abs/1603.07285)
4. He, K., Zhang, X., Ren, S., & Sun, J. (2016). "Deep Residual Learning for Image Recognition." *CVPR 2016*. [arXiv:1512.03385](https://arxiv.org/abs/1512.03385)
5. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). "ImageNet Classification with Deep Convolutional Neural Networks." *NeurIPS 2012*. [https://proceedings.neurips.cc/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html](https://proceedings.neurips.cc/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html)
6. Simonyan, K. & Zisserman, A. (2015). "Very Deep Convolutional Networks for Large-Scale Image Recognition." *ICLR 2015*. [arXiv:1409.1556](https://arxiv.org/abs/1409.1556)
7. Zhang, A., Lipton, Z. C., Li, M., & Smola, A. J. (2023). "Dive into Deep Learning." Section 7.3: Padding and Stride. [https://d2l.ai/chapter_convolutional-neural-networks/padding-and-strides.html](https://d2l.ai/chapter_convolutional-neural-networks/padding-and-strides.html)
8. Long, J., Shelhamer, E., & Darrell, T. (2015). "Fully Convolutional Networks for Semantic Segmentation." *CVPR 2015*. [arXiv:1411.4038](https://arxiv.org/abs/1411.4038)
9. Goodfellow, I., Bengio, Y., & Courville, A. (2016). *Deep Learning*. MIT Press. Chapter 9: Convolutional Networks. [https://www.deeplearningbook.org/](https://www.deeplearningbook.org/)
10. PyTorch Documentation. "torch.nn.Conv2d." [https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html](https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html)
11. Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., & Xie, S. (2022). "A ConvNet for the 2020s." *CVPR 2022*. [arXiv:2201.03545](https://arxiv.org/abs/2201.03545)

