# Convolution

> Source: https://aiwiki.ai/wiki/convolution
> Updated: 2026-07-12
> Categories: Deep Learning, Machine Learning, Mathematics
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

*See also: [Machine learning terms](/wiki/machine_learning_terms), [Convolutional layer](/wiki/convolutional_layer), [Convolutional filter](/wiki/convolutional_filter)*

## Introduction

Convolution is a mathematical operation that combines two functions to produce a third function, expressing how the shape of one is modified by the other. In its most general form, convolution measures the overlap between one function and a reversed, shifted copy of another function, accumulated over all possible shifts. The operation appears throughout mathematics, physics, engineering, and computer science, and it serves as the foundational building block behind [convolutional neural networks](/wiki/convolutional_neural_network) (CNNs) and much of modern [deep learning](/wiki/deep_learning).

In signal processing, convolution describes how a linear time-invariant (LTI) system transforms an input signal: the output is the convolution of the input with the system's impulse response. In probability and statistics, the convolution of two probability density functions gives the density of the sum of the corresponding independent random variables. In [machine learning](/wiki/machine_learning), and especially in the domain of deep learning, convolution is used to extract features from structured input data such as images, audio, and text. By applying small, learnable [convolutional filters](/wiki/convolutional_filter) across an input, [convolutional layers](/wiki/convolutional_layer) capture spatial and temporal patterns at multiple scales of abstraction.

Convolution is the foundational operation behind many breakthroughs in [computer vision](/wiki/computer_vision), [natural language processing](/wiki/natural_language_processing), [speech recognition](/wiki/speech_recognition), and time series analysis. Its impact became undeniable in 2012, when the convolutional [AlexNet](/wiki/alexnet) won the ImageNet Large Scale Visual Recognition Challenge with a top-5 error rate of 15.3 percent, compared with 26.2 percent for the second-place entry, a margin of roughly 10.8 percentage points that launched the modern deep learning era.[2][25] Over the past decade, researchers have introduced numerous variants of the convolution operation, each designed to address specific computational or representational challenges.

## Explain Like I'm 5 (ELI5)

Imagine you have a rubber stamp with a small pattern on it, like a star. Now imagine you have a big sheet of paper with a drawing on it. You press your stamp down on one corner of the drawing, and it tells you how much the stamp's pattern matches what is underneath. Then you slide the stamp one step to the right and check again. You keep sliding and checking across the entire drawing, writing down the match score each time. When you are done, you have a new picture made entirely of match scores. That new picture is the convolution.

In real life, this is how computers look at photos. They use tiny stamps (called kernels or filters) to scan across an image. One stamp might look for edges, another for corners, another for colors. By scanning with many different stamps, the computer builds up an understanding of what is in the image, like recognizing a cat, a car, or a face.

The key idea is simple: slide a small pattern over a big input, multiply and add up what overlaps at each position, and record the result. That is convolution.

## When was convolution invented?

The convolution integral first appeared in the work of Jean le Rond d'Alembert in 1754, within his derivation of Taylor's theorem in *Recherches sur differents points importants du systeme du monde*. Sylvestre Francois Lacroix used similar expressions in his treatise published between 1797 and 1800. Vito Volterra examined composition products, a closely related concept, in his 1913 work on integral equations.

The term "convolution" itself did not come into wide use until the 1950s or 1960s. Before that, the operation went by several names in different communities: *Faltung* (meaning "folding" in German), composition product, superposition integral, and Carson's integral.

In electrical engineering, convolution became central to the theory of linear time-invariant systems during the mid-20th century, providing the mathematical link between a system's impulse response and its output for any given input. The development of the Fast Fourier Transform (FFT) algorithm by Cooley and Tukey in 1965 made convolution computationally practical for large signals by exploiting the convolution theorem.[23]

Convolution entered the field of [neural networks](/wiki/neural_network) through the work of Kunihiko Fukushima, whose Neocognitron (1980) used a convolution-like operation for visual pattern recognition.[24] Yann LeCun and colleagues formalized the modern [convolutional neural network](/wiki/convolutional_neural_network) in 1989, applying backpropagation to train convolutional filters for handwritten zip-code digit recognition; the paper "Backpropagation Applied to Handwritten Zip Code Recognition" appeared in *Neural Computation* and is widely regarded as the earliest real-world deployment of a backpropagation-trained neural network.[1] The landmark AlexNet architecture (Krizhevsky et al., 2012) demonstrated that deep convolutional networks trained on GPUs could dramatically outperform traditional methods on large-scale image recognition, halving the previous best top-5 error and launching the deep learning revolution.[2][25]

## Mathematical Foundation

### Continuous Convolution

The convolution of two continuous functions, $$f$$ and $$g$$, is mathematically denoted as $$(f * g)$$ and is defined by the integral:

$$
(f * g)(t) = \int_{-\infty}^{\infty} f(\tau) \, g(t - \tau) \, d\tau
$$

This integral computes the area of overlap between $$f$$ and a reversed, shifted version of $$g$$ at each point $$t$$. In signal processing and mathematics, this operation captures how the shape of one function is modified by another. The reversal of $$g$$ (replacing $$\tau$$ with $$t - \tau$$) is what distinguishes convolution from cross-correlation.

### Discrete Convolution

In the context of discrete data like images or digital signals, the continuous integral is replaced by a sum:

$$
(f * g)[n] = \sum_m f[m] \, g[n - m]
$$

For two finite sequences of length N, computing this sum directly requires $$O(N^2)$$ operations. For images, the convolution operation can be represented as a double summation over a two-dimensional kernel, $$K$$, applied to an input image, $$I$$:

$$
(I * K)(x, y) = \sum_m \sum_n I(m, n) \, K(x - m, y - n)
$$

The kernel is a smaller matrix compared to the input image, and it is used to extract features by sliding over the input image, performing element-wise multiplication and summation of the overlapping elements.

### What is the difference between convolution and cross-correlation?

Although the operation used in deep learning is commonly called "convolution," most frameworks actually implement **cross-correlation** rather than true mathematical convolution. The difference is subtle but important: true convolution flips the kernel (rotates it by 180 degrees) before sliding it across the input, while cross-correlation does not flip the kernel. As Goodfellow, Bengio, and Courville note in *Deep Learning*, "Many machine learning libraries implement cross-correlation but call it convolution."[22]

The 2D cross-correlation formula is:

$$
(X \star K)(i, j) = \sum_a \sum_b X(i + a, j + b) \, K(a, b)
$$

Notice that in cross-correlation, the indices of the input use addition $$(i + a, j + b)$$, while in true convolution they use subtraction $$(x - m, y - n)$$. Because the kernel weights are learned during [backpropagation](/wiki/backpropagation) rather than hand-designed, the distinction between convolution and cross-correlation has no practical consequence. A learned kernel and its flipped version simply represent equivalent solutions.[22] For this reason, deep learning literature and frameworks (including [PyTorch](/wiki/pytorch) and [TensorFlow](/wiki/tensorflow)) refer to the operation as "convolution" even though the implementation is technically cross-correlation.

### Kernels and Feature Extraction

Kernels or filters are pivotal in the convolution process, as they are responsible for extracting specific features from the input data. Common kernel examples include edge detection, blur, and sharpen filters. These kernels are designed to emphasize or suppress certain properties in the input data, allowing the machine learning model to learn more effectively from the transformed data.

In deep learning, the kernels are usually initialized with random values and learned by the model during the training process. As the model learns, the kernels become more specialized in extracting useful features for the specific task at hand.

## Properties of Convolution

Convolution satisfies several important algebraic properties that have both theoretical and practical significance.

| Property | Formula | Practical Implication |
|---|---|---|
| Commutativity | $$f * g = g * f$$ | The order of operands does not matter; a signal convolved with a filter gives the same result as the filter convolved with the signal |
| Associativity | $$f * (g * h) = (f * g) * h$$ | A cascade of filters can be pre-combined into a single equivalent filter |
| Distributivity | $$f * (g + h) = (f * g) + (f * h)$$ | Convolution distributes over addition, enabling parallel filter application |
| Scalar multiplication | $$a(f * g) = (af) * g$$ | Scaling can be applied before or after convolution |
| Identity | $$f * \delta = f$$ | Convolving any function with the Dirac delta (or Kronecker delta for discrete signals) returns the original function unchanged |

**Commutativity** is proven by a simple substitution of variables in the convolution integral. In practice, it means we can interpret a system's effect on a signal equivalently as the signal's effect on the system's impulse response.

**Associativity** is particularly useful in signal processing: if a signal passes through two filters in series with impulse responses g and h, the combined effect is the same as a single filter whose impulse response is $$(g * h)$$. This allows engineers to analyze complex multi-stage systems as a single equivalent system.

**Distributivity** means that if two signals are added together and then convolved with a filter, the result is the same as convolving each signal separately and then adding the results. This property underpins the principle of superposition in linear systems theory.

Note that convolution does **not** possess a general multiplicative identity in the space of integrable functions. However, the Dirac delta function serves as an identity element: convolving any function with the delta function returns the original function.

## Convolution Theorem

The convolution theorem is one of the most important results connecting convolution with the Fourier transform. It states that convolution in the spatial (or time) domain is equivalent to element-wise multiplication in the frequency domain:

$$
\mathcal{F}\{f * g\} = \mathcal{F}\{f\} \cdot \mathcal{F}\{g\}
$$

where $$\mathcal{F}$$ denotes the Fourier transform. The converse also holds: multiplication in the spatial domain corresponds to convolution in the frequency domain:

$$
\mathcal{F}\{f \cdot g\} = \mathcal{F}\{f\} * \mathcal{F}\{g\}
$$

This theorem extends to other transforms as well, including the Laplace transform, the Z-transform, and the Mellin transform.

### Practical Significance

The convolution theorem has profound practical implications:

- **Computational efficiency:** For two sequences of length N, direct convolution requires $$O(N^2)$$ operations. Using the FFT, both sequences can be transformed in $$O(N \log N)$$ time, multiplied element-wise in $$O(N)$$ time, and inverse-transformed in $$O(N \log N)$$ time, yielding an overall complexity of $$O(N \log N)$$.[23] This speedup is critical for processing long signals in audio, telecommunications, and seismology.
- **Filter design:** In signal processing, filters are often designed in the frequency domain (specifying which frequencies to pass or reject) and then converted to time-domain impulse responses via the inverse Fourier transform. The convolution theorem guarantees that applying this impulse response via convolution achieves the desired frequency-domain filtering.
- **Analysis of linear systems:** The theorem provides a direct way to compute the frequency response of an LTI system. If the impulse response is $$h(t)$$, then the frequency response is simply $$H(f) = \mathcal{F}\{h(t)\}$$, and the output spectrum is the input spectrum multiplied by $$H(f)$$.

## Convolution in Signal Processing

In signal processing, convolution is the primary tool for describing how linear time-invariant (LTI) systems process inputs. If a system has impulse response $$h(t)$$, then for any input $$x(t)$$, the output $$y(t)$$ is:

$$
y(t) = (x * h)(t) = \int_{-\infty}^{\infty} x(\tau) \, h(t - \tau) \, d\tau
$$

This relationship is fundamental to fields including audio engineering (reverb is convolution with a room's impulse response), telecommunications (channel equalization), radar (matched filtering), and medical imaging (CT reconstruction). Digital filters, both FIR (finite impulse response) and IIR (infinite impulse response), are implemented as discrete convolutions of the input signal with the filter coefficients.

## Convolution by Dimensionality

Convolutional operations exist in 1D, 2D, and 3D variants, each suited to data with different structural properties.

### 1D Convolution

1D convolution slides the kernel along a single axis. It is used extensively for sequential and temporal data. Key applications include:

- **Text classification:** 1D convolution extracts n-gram-like features by sliding over sequences of [word embeddings](/wiki/word_embedding). Filters of different widths (e.g., 3, 4, 5) capture phrases of varying length. Yoon Kim's 2014 paper "Convolutional Neural Networks for Sentence Classification" demonstrated that simple 1D CNN architectures could rival [recurrent neural networks](/wiki/recurrent_neural_network) on text classification benchmarks.[20]
- **[Time series analysis](/wiki/time_series_analysis):** Financial data, sensor readings, and physiological signals such as electrocardiograms (ECGs) benefit from 1D convolution because the operation captures local temporal patterns.
- **Audio and speech processing:** [WaveNet](/wiki/wavenet), developed by [DeepMind](/wiki/deepmind) in 2016, uses stacks of dilated 1D causal convolutions to generate raw audio waveforms sample by sample.[10]

The 1D convolution formula for an input signal $$x$$ of length $$L$$ and a kernel $$k$$ of size $$K$$ is:

$$
y[n] = \sum_{m=0}^{K-1} x[n + m] \, k[m]
$$

The output length depends on the padding and [stride](/wiki/stride) settings (discussed below).

### 2D Convolution

2D convolution is the most widely used form and the default when researchers mention "convolution" in the context of deep learning. The kernel slides along two spatial axes (height and width) and is the backbone of all major image recognition architectures, including [AlexNet](/wiki/alexnet),[2] [VGGNet](/wiki/vggnet),[4] [ResNet](/wiki/resnet),[8] and [Inception](/wiki/inception).[5]

A 2D convolution with input of shape $$(C_{in}, H, W)$$ and a kernel of shape $$(C_{in}, K_h, K_w)$$ produces a single 2D output feature map. Multiple kernels produce multiple output channels, resulting in an output of shape $$(C_{out}, H_{out}, W_{out})$$.

### 3D Convolution

3D convolution extends the operation to three spatial or spatiotemporal axes. The kernel slides along height, width, and a third dimension (often depth or time). Primary applications include:

- **Video understanding:** The C3D network (Tran et al., 2015) used 3x3x3 kernels to learn spatiotemporal features from consecutive video frames.[7] Later architectures such as I3D (Inflated 3D) and SlowFast Networks built upon this idea.
- **Medical imaging:** Volumetric data from CT scans and MRI require 3D convolution to capture spatial relationships across slices.
- **Point cloud processing:** Some methods voxelize 3D point clouds and apply 3D convolutions for tasks like 3D object detection.

3D convolutions are considerably more expensive than their 2D counterparts because the kernel must cover an additional spatial dimension, increasing both the number of parameters and the number of multiply-accumulate operations.

## Key Parameters

Several hyperparameters control the behavior and output dimensions of a [convolutional layer](/wiki/convolutional_layer).

### Kernel (Filter) Size

The kernel size determines the spatial extent of the region the filter examines at each position. Common choices include:

| Kernel Size | Typical Use | Notes |
|---|---|---|
| 1x1 | Channel mixing, dimensionality reduction | Introduced in Network in Network (Lin et al., 2013)[3] |
| 3x3 | General-purpose feature extraction | Most popular choice since [VGGNet](/wiki/vggnet) (Simonyan & Zisserman, 2014)[4] |
| 5x5 | Wider receptive field per layer | Used in early Inception modules[5] |
| 7x7 | First layer of deep networks | Used in [ResNet](/wiki/resnet) and Inception v1 for initial downsampling[8] |

Smaller kernels (3x3) stacked in multiple layers can achieve the same receptive field as a single large kernel while using fewer parameters. Two stacked 3x3 convolutions cover the same 5x5 receptive field with $$2 \times (3 \times 3) = 18$$ weights instead of 25, and three stacked 3x3 convolutions cover a 7x7 field with 27 weights instead of 49.[4]

### Stride

[Stride](/wiki/stride) specifies the step size by which the kernel moves across the input. A stride of 1 moves the filter one pixel at a time, producing an output whose spatial dimensions are similar to the input. A stride of 2 moves the filter two pixels at a time, halving the spatial dimensions. Increasing the stride reduces the output resolution and computational cost, and it is often used as an alternative to [pooling](/wiki/pooling) layers for spatial downsampling.

The output dimension for one axis is computed as:

$$
O = \left\lfloor \frac{W - K + 2P}{S} \right\rfloor + 1
$$

where $$W$$ is the input size, $$K$$ is the kernel size, $$P$$ is the padding, and $$S$$ is the [stride](/wiki/stride).[12]

### Padding

Padding adds extra values (typically zeros) around the border of the input before the convolution is performed. This is necessary to control the spatial dimensions of the output.

| Padding Type | Description | Output Size (stride = 1) |
|---|---|---|
| Valid (no padding) | No padding is added; the kernel only visits positions where it fully overlaps the input | Smaller than input: $$(W - K + 1)$$ |
| Same (zero padding) | Enough zeros are added so the output has the same spatial dimensions as the input | Equal to input: $$W$$ |
| Full padding | Enough zeros are added so every element of the input participates in every possible kernel position | Larger than input: $$(W + K - 1)$$ |

"Same" padding is the most common choice in modern architectures because it simplifies dimension calculations when stacking many layers.

### Dilation (Atrous Convolution)

Dilation inserts gaps (zeros) between kernel elements, effectively expanding the receptive field without increasing the number of parameters. A dilation rate of $$r$$ spreads a kernel of size $$K$$ so its effective size becomes:

$$
K_{\text{eff}} = K + (K - 1)(r - 1)
$$

For example, a 3x3 kernel with dilation rate 2 has an effective receptive field of 5x5, and with dilation rate 3 it covers 7x7, all while maintaining only 9 learnable weights.

Dilated convolutions were originally developed for efficient wavelet decomposition in signal processing. In deep learning, they became prominent through:

- **DeepLab** (Chen et al., 2015, 2017): Used atrous (dilated) convolutions for [semantic segmentation](/wiki/semantic_segmentation), replacing pooling layers to preserve spatial resolution while maintaining a wide receptive field. DeepLab v2 introduced Atrous Spatial Pyramid Pooling (ASPP), which runs multiple dilated convolutions in parallel at dilation rates of 6, 12, and 18 to capture multi-scale context.[13]
- **WaveNet** (van den Oord et al., 2016): Stacked dilated causal 1D convolutions with exponentially increasing dilation rates (1, 2, 4, 8, ..., 512) to model long-range temporal dependencies in audio.[10]
- **Multi-Scale Context Aggregation** (Yu & Koltun, 2016): Proposed systematic use of dilated convolutions for dense prediction tasks.[9]

## Feature Maps and Channels

A feature map is the output produced by applying a single [convolutional filter](/wiki/convolutional_filter) to the input. Each filter detects a specific pattern, and the collection of all feature maps from a [convolutional layer](/wiki/convolutional_layer) forms a 3D volume with dimensions $$(C_{out}, H_{out}, W_{out})$$, where $$C_{out}$$ is the number of output channels (one per filter).

The input to a convolutional layer also has multiple channels. For a color image, the input has 3 channels (red, green, blue). For deeper layers, the number of input channels equals the number of filters in the preceding convolutional layer. Each filter in a convolutional layer spans all input channels: a filter in a layer with $$C_{in}$$ input channels has shape $$(C_{in}, K_h, K_w)$$, and the total number of parameters for one filter (excluding bias) is $$C_{in} \times K_h \times K_w$$.

The total number of parameters in a convolutional layer with C_out filters is:

$$
\text{Parameters} = C_{out} \times (C_{in} \times K_h \times K_w + 1)
$$

where the +1 accounts for the bias term per filter.

## Why are convolutional layers more efficient than fully connected layers?

Convolutional layers exploit two properties that make them far more efficient than fully connected layers: **parameter sharing** and **local connectivity** (also called sparse interactions).

### Parameter Sharing

In a [convolutional layer](/wiki/convolutional_layer), the same filter weights are applied at every spatial position of the input. This means a single set of $$K_h \times K_w \times C_{in}$$ weights is reused across the entire input, regardless of the input size. Parameter sharing encodes the assumption of **translation equivariance**: a pattern that is useful to detect in one part of the input is equally useful in other parts. For example, an edge detector should work the same way whether the edge appears in the top-left or bottom-right of the image.

Parameter sharing reduces the number of learnable weights by several orders of magnitude compared to a fully connected layer with the same input and output sizes. For an input image of 224x224x3 mapped to a feature map of 224x224x64, a fully connected layer would require roughly $$224 \times 224 \times 3 \times 224 \times 224 \times 64$$, which is approximately 4.8 trillion parameters. A convolutional layer with 64 filters of size 3x3 requires only $$64 \times (3 \times 3 \times 3 + 1) = 1{,}792$$ parameters.

### Local Connectivity (Sparse Interactions)

Each output neuron in a convolutional layer is connected to only a small, local region of the input (determined by the kernel size), rather than to every input element. This sparse connectivity reduces computation and also provides an inductive bias well suited to data with local spatial structure, such as images and audio. Deep stacking of locally connected layers allows the network to build up large **receptive fields** gradually: a neuron in a deep layer indirectly depends on a large region of the original input, even though each individual layer examines only a small neighborhood.

## Types of Convolution

### Standard Convolution

Standard convolution applies each filter across all input channels simultaneously. Given an input with $$C_{in}$$ channels and $$C_{out}$$ filters of size $$K \times K$$, the computational cost (in multiply-accumulate operations) for producing one output feature map of spatial size $$H_{out} \times W_{out}$$ is:

$$
\text{FLOPs} = H_{out} \times W_{out} \times C_{in} \times C_{out} \times K \times K
$$

Standard convolution is the most expressive form because every output channel can learn arbitrary combinations of all input channels. However, it is also the most expensive, which has motivated the development of lighter alternatives.

### Depthwise Separable Convolution

Depthwise separable convolution, popularized by [MobileNet](/wiki/mobilenet) (Howard et al., 2017)[15] and also used in [Xception](/wiki/xception) (Chollet, 2017),[16] factorizes a standard convolution into two steps:

1. **Depthwise convolution:** A separate $$K \times K$$ spatial filter is applied to each input channel independently. This produces $$C_{in}$$ feature maps (one per channel) and costs $$H_{out} \times W_{out} \times C_{in} \times K \times K$$ operations.
2. **Pointwise convolution:** A 1x1 convolution combines the depthwise outputs across channels, projecting from $$C_{in}$$ to $$C_{out}$$ channels. This costs $$H_{out} \times W_{out} \times C_{in} \times C_{out}$$ operations.

The total cost of depthwise separable convolution is:

$$
\text{FLOPs} = H_{out} \times W_{out} \times C_{in} \times (K \times K + C_{out})
$$

The reduction factor compared to standard convolution is:

$$
\text{Reduction} = 1/C_{out} + 1/K^2
$$

For a common configuration with $$K = 3$$ and $$C_{out} = 256$$, this yields a reduction factor of approximately $$1/256 + 1/9$$, which is roughly 0.115, meaning depthwise separable convolution uses only about 11-12% of the computation of standard convolution.

The MobileNet authors state that their model "uses 3x3 depthwise separable convolutions which uses between 8 to 9 times less computation than standard convolutions at only a small reduction in accuracy," making the design suitable for mobile and edge devices.[15] [MobileNet V2](/wiki/mobilenet_v2) refined this with inverted residual blocks and linear bottlenecks.[18]

### Grouped Convolution

Grouped convolution divides the input channels into $$G$$ non-overlapping groups and performs independent convolutions within each group. It was first used in [AlexNet](/wiki/alexnet) (Krizhevsky et al., 2012) for a practical reason: the model was split across two GPUs, and each GPU processed half the channels.[2] Later work showed that grouping also provides a useful regularization effect.

The computational cost of grouped convolution is:

$$
\text{FLOPs} = H_{out} \times W_{out} \times (C_{in} / G) \times (C_{out} / G) \times K \times K \times G = H_{out} \times W_{out} \times C_{in} \times C_{out} \times K \times K / G
$$

This represents a $$1/G$$ reduction compared to standard convolution. [ResNeXt](/wiki/resnext) (Xie et al., 2017) demonstrated that increasing the number of groups (called "cardinality") while keeping the total number of parameters fixed improved accuracy over deeper or wider ResNets.[17] ShuffleNet (Zhang et al., 2018) combined grouped convolutions with channel shuffling to allow information flow between groups.

Note that depthwise convolution is a special case of grouped convolution where $$G = C_{in}$$ (each channel is its own group).

### Deformable Convolution

Deformable convolution, introduced by Dai et al. (2017) at ICCV, augments the standard fixed grid sampling pattern with learnable 2D offsets.[14] In a standard 3x3 convolution, the nine sampling locations form a regular grid. In deformable convolution, each of these locations is displaced by an offset $$(\delta_x, \delta_y)$$ that is predicted by an auxiliary convolutional layer applied to the preceding feature map.

The output at position $$p$$ is:

$$
y(p) = \sum_k w_k \, x(p + p_k + \delta_{p_k})
$$

where $$p_k$$ are the fixed grid offsets and $$\delta_{p_k}$$ are the learned offsets. Since the offsets are typically fractional, bilinear interpolation is used to sample the input at non-integer locations.

Key properties of deformable convolution:

- The offsets are conditioned on the input, so the sampling pattern adapts to the content of each image. Objects with irregular shapes or varying scales benefit from this adaptive geometry.
- The module adds only a small number of extra parameters (the offset prediction layer) and can replace standard convolutions in existing architectures with minimal overhead.
- Deformable Convolutional Networks V2 (Zhu et al., 2019) added a modulation mechanism that also learns a scalar weight for each sampling position, further improving performance.[19]

Deformable convolution has proven particularly effective for [object detection](/wiki/object_detection) and [semantic segmentation](/wiki/semantic_segmentation) tasks where objects appear at varying scales, aspect ratios, and orientations.

### Dilated (Atrous) Convolution

Dilated convolution inserts zeros between kernel elements to expand the receptive field without increasing the parameter count. The dilation rate $$r$$ controls the spacing. See the [Dilation section](#dilation-atrous-convolution) above for the mathematical details and applications in DeepLab and WaveNet.

Dilated convolutions are especially valuable in dense prediction tasks (semantic segmentation, depth estimation) where maintaining high spatial resolution is important. They provide an alternative to encoder-decoder architectures by keeping feature maps at full resolution while still capturing long-range context.

### Transposed Convolution

Transposed convolution (sometimes called "deconvolution" or "fractionally strided convolution") performs spatial upsampling with learnable parameters. It is the approximate inverse of a strided convolution: while a strided convolution reduces spatial dimensions, a transposed convolution increases them.

The operation works by inserting zeros between input elements (effectively applying a fractional [stride](/wiki/stride)) and then performing a standard convolution.[12] For an input of size $$H_{in}$$ and a transposed convolution with kernel size $$K$$, stride $$S$$, and padding $$P$$, the output size is:

$$
H_{out} = (H_{in} - 1) \times S - 2P + K
$$

It is important to note that transposed convolution is not a true mathematical inverse (deconvolution). A true deconvolution would recover the exact original input from a convolution's output, which is generally an ill-posed problem. Transposed convolution only reverses the spatial dimensions while performing a learnable upsampling, not a value-level inversion.

Transposed convolutions are widely used in:

- **[Generative adversarial networks](/wiki/generative_adversarial_network) (GANs):** The generator network uses transposed convolutions to upsample a low-dimensional latent vector into a full-resolution image.
- **Semantic segmentation:** Fully Convolutional Networks (Long et al., 2015) used transposed convolutions to upsample feature maps back to the input resolution for pixel-wise prediction.[6]
- **[Autoencoders](/wiki/autoencoder):** The decoder portion uses transposed convolutions to reconstruct the input from a compressed representation.

One known issue with transposed convolutions is the "checkerboard artifact," where overlapping kernel regions during upsampling create a grid pattern in the output. This can be mitigated by choosing kernel sizes divisible by the stride or by using alternative upsampling strategies such as nearest-neighbor interpolation followed by a standard convolution.

### 1x1 Convolution

1x1 convolution, first proposed in the Network in Network paper (Lin et al., 2013), applies a 1x1 spatial filter across all input channels.[3] Despite having no spatial extent, it serves several important functions:

- **Channel mixing:** It computes a learned linear combination of all input channels at each spatial location. Lin et al. described this as "cross-channel parametric pooling," enabling complex, learnable interactions between channels.
- **[Dimensionality reduction](/wiki/dimension_reduction):** By using fewer output channels than input channels, 1x1 convolutions reduce the channel dimension and the associated computational cost of subsequent layers. [GoogLeNet](/wiki/googlenet) (Szegedy et al., 2014) used 1x1 convolutions within Inception modules to reduce channel dimensions before expensive 3x3 and 5x5 convolutions, cutting the computational cost by a large factor.[5]
- **Dimensionality expansion:** Conversely, 1x1 convolutions can increase the channel dimension, as in the inverted residual blocks of MobileNet V2.[18]
- **Adding nonlinearity:** When followed by a [ReLU](/wiki/relu) or other [activation function](/wiki/activation_function), a 1x1 convolution adds an additional layer of nonlinearity without changing the spatial dimensions.

The computational cost of a 1x1 convolution is $$H \times W \times C_{in} \times C_{out}$$, with no dependence on the spatial kernel size. This makes it one of the cheapest convolution operations per parameter.

## Comparison of Convolution Types

The following table summarizes the key convolution variants discussed above.

| Type | Parameters (per layer) | FLOPs (relative to standard) | Receptive Field | Key Advantage | Representative Architecture |
|---|---|---|---|---|---|
| Standard | $$C_{out} \times C_{in} \times K^2$$ | 1x (baseline) | $$K \times K$$ per layer | Full cross-channel interaction | [VGGNet](/wiki/vggnet), [ResNet](/wiki/resnet) |
| Depthwise Separable | $$C_{in} \times K^2 + C_{in} \times C_{out}$$ | $$\sim 1/K^2 + 1/C_{out}$$ (approx. 0.11x for $$K=3$$) | $$K \times K$$ per layer | Major computation savings | [MobileNet](/wiki/mobilenet), [Xception](/wiki/xception) |
| Grouped (G groups) | $$C_{out} \times C_{in} \times K^2 / G$$ | $$1/G$$ | $$K \times K$$ per layer | Tunable efficiency-accuracy tradeoff | [AlexNet](/wiki/alexnet), [ResNeXt](/wiki/resnext) |
| Dilated (rate r) | $$C_{out} \times C_{in} \times K^2$$ | Same as standard (same params) | $$K + (K-1)(r-1)$$ per layer | Larger receptive field, no extra params | DeepLab, [WaveNet](/wiki/wavenet) |
| Deformable | $$C_{out} \times C_{in} \times K^2$$ + offset params | ~1x + small offset overhead | Adaptive (input-dependent) | Content-adaptive sampling | DCN (Dai et al., 2017) |
| Transposed | $$C_{in} \times C_{out} \times K^2$$ | Depends on output size | N/A (upsampling operation) | Learnable upsampling | FCN, DCGAN |
| 1x1 | $$C_{out} \times C_{in}$$ | $$H \times W \times C_{in} \times C_{out}$$ | 1x1 (no spatial extent) | Channel mixing, cheap dimensionality change | Network in Network, [Inception](/wiki/inception) |

## How is convolution implemented efficiently?

While the sliding-window description of convolution is intuitive, deep learning frameworks employ several optimized strategies to implement convolution at high speed.

### im2col (Image to Column)

The most common approach is the **im2col** transformation, which converts the convolution operation into a single large matrix multiplication (GEMM: General Matrix-Matrix Multiplication) by rearranging the input data:

1. For each position where the kernel is applied, extract the local patch of input values that the kernel overlaps.
2. Flatten each patch into a column vector.
3. Stack all column vectors side by side to form a large matrix of shape $$(C_{in} \times K_h \times K_w, H_{out} \times W_{out})$$.
4. Reshape all convolutional filters into a matrix of shape $$(C_{out}, C_{in} \times K_h \times K_w)$$.
5. Multiply the filter matrix by the patch matrix. The result is the output feature maps, reshaped to $$(C_{out}, H_{out}, W_{out})$$.

Matrix multiplication is one of the most heavily optimized operations in computing. Highly tuned BLAS (Basic Linear Algebra Subprograms) libraries such as cuBLAS (for [NVIDIA](/wiki/nvidia) GPUs), oneDNN (for Intel CPUs), and Apple's Accelerate framework provide near-peak-performance GEMM implementations. The drawback of im2col is memory overhead: it duplicates input data because overlapping patches share elements. For an input of size $$(C_{in}, H, W)$$ with a $$K \times K$$ kernel, the expanded matrix requires $$O(K^2 \times C_{in} \times H_{out} \times W_{out})$$ memory, which can be several times larger than the original input.

### Winograd-Based Convolution

Winograd's minimal filtering algorithms reduce the number of multiplications needed for small convolutions. For a 3x3 kernel applied to a 4x4 input tile, the Winograd F(2x2, 3x3) algorithm reduces the multiplications from 36 (direct) to 16, a 2.25x savings. Lavin and Gray (2016) demonstrated speedups of up to 4x for 3x3 convolutions in CNNs.[11]

The algorithm works by transforming both the input tile and the kernel into a special domain (using pre-computed transformation matrices), performing element-wise multiplication there, and transforming back. This is analogous to the convolution theorem but uses transforms tailored to small discrete filters rather than the Fourier transform.

A significant limitation is numerical precision. The transformation matrices grow in magnitude for larger tile sizes, causing floating-point errors to accumulate. This makes Winograd convolution less suitable for large kernels or for low-precision (e.g., INT8) computation, though recent research has made progress on quantized Winograd implementations.

### FFT-Based Convolution

The convolution theorem enables an alternative implementation using the Fast Fourier Transform:

1. Compute the FFT of both the input and the kernel (padding both to the same size).
2. Perform element-wise multiplication of the two frequency-domain representations.
3. Compute the inverse FFT of the product to obtain the convolution result in the spatial domain.

Direct spatial convolution of an input of size $$N \times N$$ with a kernel of size $$K \times K$$ costs $$O(N^2 K^2)$$ operations. FFT-based convolution costs $$O(N^2 \log N)$$ for the transforms plus $$O(N^2)$$ for the element-wise multiplication. Mathieu, Henaff, and LeCun (2014) demonstrated that FFT-based convolution could accelerate CNN training significantly compared to spatial implementations.[21]

In modern practice, FFT-based convolution is less commonly used for standard CNN architectures because most popular architectures use small 3x3 kernels where the overhead of FFT (computing transforms, padding, managing complex arithmetic) outweighs the theoretical benefit. Winograd-based algorithms offer better performance for small kernels without requiring frequency-domain transforms. However, FFT-based convolution remains relevant for applications involving large kernels (such as global convolutions in architectures like ConvNeXt V2) and in traditional signal processing where kernels may span hundreds or thousands of elements.

### Other Approaches

| Method | Description | Best For |
|---|---|---|
| Implicit GEMM | Dynamically computes im2col layout during GEMM, avoiding explicit data rearrangement | NVIDIA cuDNN, Google TPU |
| Indirect convolution | Replaces data duplication with a pointer-based indirection buffer (Dukhan, 2019) | Memory-constrained environments |
| Direct convolution | Naive nested loops; simple but slow | Very small inputs, educational use |

## Convolutional Neural Networks

[Convolutional neural networks](/wiki/convolutional_neural_network) (CNNs) are a class of [deep learning](/wiki/deep_learning) models that exploit the convolution operation for processing grid-like data, such as images, videos, and spectrograms. CNNs have demonstrated exceptional performance in tasks like [image classification](/wiki/image_classification_models), [object detection](/wiki/object_detection), and [semantic segmentation](/wiki/semantic_segmentation).

A typical CNN architecture consists of several layers, including [convolutional layers](/wiki/convolutional_layer), [activation functions](/wiki/activation_function), [pooling](/wiki/pooling) layers, and fully connected layers. The convolutional layers perform the actual convolution operation, while the activation functions introduce non-linearity to the model, allowing it to learn complex patterns. Pooling layers are used to reduce the spatial dimensions of the feature maps, and fully connected layers serve as a final classifier or regressor.[22]

Modern CNN architectures have evolved to incorporate many of the convolution variants described above. [EfficientNet](/wiki/efficientnet) uses depthwise separable convolutions with compound scaling. [ConvNeXt](/wiki/convnext) revisits pure convolutional designs using large depthwise kernels and modern training techniques, achieving performance competitive with [Vision Transformers](/wiki/vision_transformer). The ongoing development of new convolution types and architectures continues to expand the capabilities and efficiency of convolutional networks.

## References

1. LeCun, Y., Boser, B., Denker, J. S., et al. (1989). "Backpropagation applied to handwritten zip code recognition." *Neural Computation*, 1(4), 541-551. See also LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). "Gradient-based learning applied to document recognition." *Proceedings of the IEEE*, 86(11), 2278-2324.
2. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). "[ImageNet](/wiki/imagenet) classification with deep convolutional neural networks." *Advances in Neural Information Processing Systems*, 25.
3. Lin, M., Chen, Q., & Yan, S. (2013). "Network in Network." *arXiv preprint arXiv:1312.4400*.
4. Simonyan, K. & Zisserman, A. (2014). "Very deep convolutional networks for large-scale image recognition." *arXiv preprint arXiv:1409.1556*.
5. Szegedy, C., Liu, W., Jia, Y., et al. (2014). "Going deeper with convolutions." *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*.
6. Long, J., Shelhamer, E., & Darrell, T. (2015). "Fully convolutional networks for semantic segmentation." *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*.
7. Tran, D., Bourdev, L., Fergus, R., Torresani, L., & Paluri, M. (2015). "Learning spatiotemporal features with 3D convolutional networks." *Proceedings of the IEEE International Conference on Computer Vision (ICCV)*.
8. He, K., Zhang, X., Ren, S., & Sun, J. (2016). "Deep residual learning for image recognition." *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*.
9. Yu, F. & Koltun, V. (2016). "Multi-scale context aggregation by dilated convolutions." *International Conference on Learning Representations (ICLR)*.
10. van den Oord, A., Dieleman, S., Zen, H., et al. (2016). "WaveNet: A generative model for raw audio." *arXiv preprint arXiv:1609.03499*.
11. Lavin, A. & Gray, S. (2016). "Fast algorithms for convolutional neural networks." *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*.
12. Dumoulin, V. & Visin, F. (2016). "A guide to convolution arithmetic for deep learning." *arXiv preprint arXiv:1603.07285*.
13. Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. L. (2017). "DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs." *IEEE Transactions on Pattern Analysis and Machine Intelligence*, 40(4), 834-848.
14. Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., & Wei, Y. (2017). "Deformable convolutional networks." *Proceedings of the IEEE International Conference on Computer Vision (ICCV)*, 764-773.
15. Howard, A. G., Zhu, M., Chen, B., et al. (2017). "MobileNets: Efficient convolutional neural networks for mobile vision applications." *arXiv preprint arXiv:1704.04861*.
16. Chollet, F. (2017). "Xception: [Deep learning](/wiki/deep_learning) with depthwise separable convolutions." *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*.
17. Xie, S., Girshick, R., Dollar, P., Tu, Z., & He, K. (2017). "Aggregated residual transformations for deep neural networks." *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*.
18. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., & Chen, L.-C. (2018). "MobileNetV2: Inverted residuals and linear bottlenecks." *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*.
19. Zhu, X., Hu, H., Lin, S., & Dai, J. (2019). "Deformable ConvNets V2: More deformable, better results." *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*.
20. Kim, Y. (2014). "Convolutional neural networks for sentence classification." *Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP)*.
21. Mathieu, M., Henaff, M., & LeCun, Y. (2014). "Fast training of convolutional networks through FFTs." *International Conference on Learning Representations (ICLR)*.
22. Goodfellow, I., Bengio, Y., & Courville, A. (2016). *Deep Learning*. MIT Press. Chapter 9: Convolutional Networks.
23. Cooley, J. W. & Tukey, J. W. (1965). "An algorithm for the machine calculation of complex Fourier series." *Mathematics of Computation*, 19(90), 297-301.
24. Fukushima, K. (1980). "Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position." *Biological Cybernetics*, 36(4), 193-202.
25. Russakovsky, O., Deng, J., Su, H., et al. (2015). "ImageNet Large Scale Visual Recognition Challenge." *International Journal of Computer Vision*, 115(3), 211-252. (ILSVRC 2012 results: AlexNet top-5 error 15.3% vs 26.2% for the runner-up.)