See also: Machine learning terms, Convolutional layer, Convolutional filter
Convolution is a mathematical operation that combines two functions to produce a third function, expressing how the shape of one is modified by the other. In its most general form, convolution measures the overlap between one function and a reversed, shifted copy of another function, accumulated over all possible shifts. The operation appears throughout mathematics, physics, engineering, and computer science, and it serves as the foundational building block behind convolutional neural networks (CNNs) and much of modern deep learning.
In signal processing, convolution describes how a linear time-invariant (LTI) system transforms an input signal: the output is the convolution of the input with the system's impulse response. In probability and statistics, the convolution of two probability density functions gives the density of the sum of the corresponding independent random variables. In machine learning, and especially in the domain of deep learning, convolution is used to extract features from structured input data such as images, audio, and text. By applying small, learnable convolutional filters across an input, convolutional layers capture spatial and temporal patterns at multiple scales of abstraction.
Convolution is the foundational operation behind many breakthroughs in computer vision, natural language processing, speech recognition, and time series analysis. Over the past decade, researchers have introduced numerous variants of the convolution operation, each designed to address specific computational or representational challenges.
Imagine you have a rubber stamp with a small pattern on it, like a star. Now imagine you have a big sheet of paper with a drawing on it. You press your stamp down on one corner of the drawing, and it tells you how much the stamp's pattern matches what is underneath. Then you slide the stamp one step to the right and check again. You keep sliding and checking across the entire drawing, writing down the match score each time. When you are done, you have a new picture made entirely of match scores. That new picture is the convolution.
In real life, this is how computers look at photos. They use tiny stamps (called kernels or filters) to scan across an image. One stamp might look for edges, another for corners, another for colors. By scanning with many different stamps, the computer builds up an understanding of what is in the image, like recognizing a cat, a car, or a face.
The key idea is simple: slide a small pattern over a big input, multiply and add up what overlaps at each position, and record the result. That is convolution.
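For readers who prefer code, the stamp-and-slide idea can be written directly as a pair of nested loops. The sketch below uses NumPy with arbitrary example sizes; it is meant to illustrate the idea, not to be an efficient implementation.

```python
import numpy as np

def match_scores(image, stamp):
    """Slide `stamp` over `image`; at each position, multiply the
    overlapping values and add them up."""
    H, W = image.shape
    kh, kw = stamp.shape
    scores = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(scores.shape[0]):
        for j in range(scores.shape[1]):
            scores[i, j] = np.sum(image[i:i + kh, j:j + kw] * stamp)
    return scores

drawing = np.random.rand(6, 6)                  # the "big drawing"
stamp = np.array([[1.0, 0.0], [0.0, 1.0]])      # the "rubber stamp"
print(match_scores(drawing, stamp).shape)       # (5, 5) grid of match scores
```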
The convolution integral first appeared in the work of Jean le Rond d'Alembert in 1754, within his derivation of Taylor's theorem in Recherches sur différents points importants du système du monde. Sylvestre François Lacroix used similar expressions in his treatise published between 1797 and 1800. Vito Volterra examined composition products, a closely related concept, in his 1913 work on integral equations.
The term "convolution" itself did not come into wide use until the 1950s or 1960s. Before that, the operation went by several names in different communities: Faltung (meaning "folding" in German), composition product, superposition integral, and Carson's integral.
In electrical engineering, convolution became central to the theory of linear time-invariant systems during the mid-20th century, providing the mathematical link between a system's impulse response and its output for any given input. The development of the Fast Fourier Transform (FFT) algorithm by Cooley and Tukey in 1965 made convolution computationally practical for large signals by exploiting the convolution theorem.
Convolution entered the field of neural networks through the work of Kunihiko Fukushima, whose Neocognitron (1980) used a convolution-like operation for visual pattern recognition. Yann LeCun and colleagues formalized the modern convolutional neural network in 1989 with LeNet, applying backpropagation to train convolutional filters for handwritten digit recognition. The landmark AlexNet architecture (Krizhevsky et al., 2012) demonstrated that deep convolutional networks trained on GPUs could dramatically outperform traditional methods on large-scale image recognition, launching the deep learning revolution.
The convolution of two continuous functions, f and g, is mathematically denoted as (f * g) and is defined by the integral:
(f * g)(t) = integral from negative infinity to positive infinity of f(tau) * g(t - tau) d(tau)
This integral computes the area of overlap between f and a reversed, shifted version of g at each point t. In signal processing and mathematics, this operation captures how the shape of one function is modified by another. The reversal of g (replacing tau with t - tau) is what distinguishes convolution from cross-correlation.
In the context of discrete data like images or digital signals, the continuous integral is replaced by a sum:
(f * g)[n] = sum over m of f[m] * g[n - m]
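As a quick check of the discrete definition, the following sketch evaluates the sum directly and compares the result with NumPy's built-in np.convolve; the example sequences are arbitrary.

```python
import numpy as np

def conv1d_direct(f, g):
    """Discrete convolution (f * g)[n] = sum_m f[m] * g[n - m],
    evaluated directly from the definition ('full' output length)."""
    N, M = len(f), len(g)
    out = np.zeros(N + M - 1)
    for n in range(len(out)):
        for m in range(N):
            if 0 <= n - m < M:
                out[n] += f[m] * g[n - m]
    return out

f = np.array([1.0, 2.0, 3.0])
g = np.array([0.0, 1.0, 0.5])
print(conv1d_direct(f, g))   # direct evaluation of the sum
print(np.convolve(f, g))     # NumPy's built-in convolution agrees
```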
For two finite sequences of length N, computing this sum directly requires O(N^2) operations. For images, the convolution operation can be represented as a double summation over a two-dimensional kernel, K, applied to an input image, I:
(I * K)(x, y) = sum over m, sum over n of I(m, n) * K(x - m, y - n)
The kernel is a smaller matrix compared to the input image, and it is used to extract features by sliding over the input image, performing element-wise multiplication and summation of the overlapping elements.
Although the operation used in deep learning is commonly called "convolution," most frameworks actually implement cross-correlation rather than true mathematical convolution. The difference is subtle but important: true convolution flips the kernel (rotates it by 180 degrees) before sliding it across the input, while cross-correlation does not flip the kernel.
The 2D cross-correlation formula is:
(X star K)(i, j) = sum over a, sum over b of X(i + a, j + b) * K(a, b)
Notice that in cross-correlation, the indices of the input use addition (i + a, j + b), while in true convolution they use subtraction (x - m, y - n). Because the kernel weights are learned during backpropagation rather than hand-designed, the distinction between convolution and cross-correlation has no practical consequence. A learned kernel and its flipped version simply represent equivalent solutions. For this reason, deep learning literature and frameworks (including PyTorch and TensorFlow) refer to the operation as "convolution" even though the implementation is technically cross-correlation.
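The sketch below (assuming NumPy and SciPy are available; the arrays are arbitrary examples) makes the relationship concrete: cross-correlating with a kernel that has been rotated 180 degrees reproduces true convolution.

```python
import numpy as np
from scipy.signal import convolve2d

def cross_correlate2d(x, k):
    """'Valid' cross-correlation: slide k over x without flipping it."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

x = np.random.rand(5, 5)
k = np.random.rand(3, 3)

# True convolution equals cross-correlation with the kernel rotated 180 degrees.
true_conv = convolve2d(x, k, mode="valid")
corr_with_flipped_kernel = cross_correlate2d(x, np.flip(k))

print(np.allclose(true_conv, corr_with_flipped_kernel))  # True
```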
Kernels or filters are pivotal in the convolution process, as they are responsible for extracting specific features from the input data. Common kernel examples include edge detection, blur, and sharpen filters. These kernels are designed to emphasize or suppress certain properties in the input data, allowing the machine learning model to learn more effectively from the transformed data.
In deep learning, the kernels are usually initialized with random values and learned by the model during the training process. As the model learns, the kernels become more specialized in extracting useful features for the specific task at hand.
Convolution satisfies several important algebraic properties that have both theoretical and practical significance.
| Property | Formula | Practical Implication |
|---|---|---|
| Commutativity | f * g = g * f | The order of operands does not matter; a signal convolved with a filter gives the same result as the filter convolved with the signal |
| Associativity | f * (g * h) = (f * g) * h | A cascade of filters can be pre-combined into a single equivalent filter |
| Distributivity | f * (g + h) = (f * g) + (f * h) | Convolution distributes over addition, enabling parallel filter application |
| Scalar multiplication | a(f * g) = (af) * g | Scaling can be applied before or after convolution |
| Identity | f * delta = f | Convolving any function with the Dirac delta (or Kronecker delta for discrete signals) returns the original function unchanged |
Commutativity is proven by a simple substitution of variables in the convolution integral. In practice, it means we can interpret a system's effect on a signal equivalently as the signal's effect on the system's impulse response.
Associativity is particularly useful in signal processing: if a signal passes through two filters in series with impulse responses g and h, the combined effect is the same as a single filter whose impulse response is (g * h). This allows engineers to analyze complex multi-stage systems as a single equivalent system.
Distributivity means that if two signals are added together and then convolved with a filter, the result is the same as convolving each signal separately and then adding the results. This property underpins the principle of superposition in linear systems theory.
Note that convolution does not possess a general multiplicative identity in the space of integrable functions. However, the Dirac delta function serves as an identity element: convolving any function with the delta function returns the original function.
The convolution theorem is one of the most important results connecting convolution with the Fourier transform. It states that convolution in the spatial (or time) domain is equivalent to element-wise multiplication in the frequency domain:
F{f * g} = F{f} . F{g}
where F denotes the Fourier transform. The converse also holds: multiplication in the spatial domain corresponds to convolution in the frequency domain:
F{f . g} = F{f} * F{g}
This theorem extends to other transforms as well, including the Laplace transform, the Z-transform, and the Mellin transform.
The convolution theorem has profound practical implications: a large convolution can be computed by transforming both operands to the frequency domain, multiplying them element-wise, and transforming the product back, which underlies the FFT-based implementations discussed later in this article.
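The theorem is easy to verify numerically. The sketch below (NumPy, arbitrary example signals) computes a linear convolution directly and again via the FFT, zero-padding both signals to the full output length.

```python
import numpy as np

f = np.random.rand(64)
g = np.random.rand(16)

# Direct ("full") convolution in the time domain.
direct = np.convolve(f, g)

# Same result via the convolution theorem: multiply FFTs, then invert.
n = len(f) + len(g) - 1                      # full output length
freq = np.fft.rfft(f, n) * np.fft.rfft(g, n) # element-wise product in frequency domain
via_fft = np.fft.irfft(freq, n)

print(np.allclose(direct, via_fft))          # True (up to floating-point error)
```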
In signal processing, convolution is the primary tool for describing how linear time-invariant (LTI) systems process inputs. If a system has impulse response h(t), then for any input x(t), the output y(t) is:
y(t) = (x * h)(t) = integral from negative infinity to positive infinity of x(tau) * h(t - tau) d(tau)
This relationship is fundamental to fields including audio engineering (reverb is convolution with a room's impulse response), telecommunications (channel equalization), radar (matched filtering), and medical imaging (CT reconstruction). Digital filters, both FIR (finite impulse response) and IIR (infinite impulse response), are implemented as discrete convolutions of the input signal with the filter coefficients.
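A minimal FIR example (NumPy, arbitrary test signal): the "system" is a 5-tap moving average, and its output is simply the convolution of the input with that impulse response.

```python
import numpy as np

# A 5-tap moving-average FIR filter: its impulse response is 5 equal weights.
h = np.ones(5) / 5.0

# A noisy input signal x(t).
t = np.linspace(0.0, 1.0, 200)
x = np.sin(2 * np.pi * 5 * t) + 0.3 * np.random.randn(t.size)

# The LTI system's output is the convolution of the input with h.
y = np.convolve(x, h, mode="same")  # smoothed version of x
```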
Convolutional operations exist in 1D, 2D, and 3D variants, each suited to data with different structural properties.
1D convolution slides the kernel along a single axis. It is used extensively for sequential and temporal data, with key applications in audio and speech processing, text and other token sequences, and time series analysis.
The 1D convolution formula for an input signal x of length L and a kernel k of size K is:
y[n] = sum from m=0 to K-1 of x[n + m] * k[m]
The output length depends on the padding and stride settings (discussed below).
2D convolution is the most widely used form and the default when researchers mention "convolution" in the context of deep learning. The kernel slides along two spatial axes (height and width) and is the backbone of all major image recognition architectures, including AlexNet, VGGNet, ResNet, and Inception.
A 2D convolution with input of shape (C_in, H, W) and a kernel of shape (C_in, K_h, K_w) produces a single 2D output feature map. Multiple kernels produce multiple output channels, resulting in an output of shape (C_out, H_out, W_out).
3D convolution extends the operation to three spatial or spatiotemporal axes. The kernel slides along height, width, and a third dimension (often depth or time). Primary applications include video understanding, where the third axis is time, and volumetric data such as CT and MRI scans, where it is depth.
3D convolutions are considerably more expensive than their 2D counterparts because the kernel must cover an additional spatial dimension, increasing both the number of parameters and the number of multiply-accumulate operations.
Several hyperparameters control the behavior and output dimensions of a convolutional layer.
The kernel size determines the spatial extent of the region the filter examines at each position. Common choices include:
| Kernel Size | Typical Use | Notes |
|---|---|---|
| 1x1 | Channel mixing, dimensionality reduction | Introduced in Network in Network (Lin et al., 2013) |
| 3x3 | General-purpose feature extraction | Most popular choice since VGGNet (Simonyan & Zisserman, 2014) |
| 5x5 | Wider receptive field per layer | Used in early Inception modules |
| 7x7 | First layer of deep networks | Used in ResNet and Inception v1 for initial downsampling |
Smaller kernels (3x3) stacked in multiple layers can achieve the same receptive field as a single large kernel while using fewer parameters. Two stacked 3x3 convolutions cover the same 5x5 receptive field with 2 * (3 * 3) = 18 weights instead of 25, and three stacked 3x3 convolutions cover a 7x7 field with 27 weights instead of 49.
Stride specifies the step size by which the kernel moves across the input. A stride of 1 moves the filter one pixel at a time, producing an output whose spatial dimensions are close to those of the input (and identical to them with "same" padding). A stride of 2 moves the filter two pixels at a time, roughly halving the spatial dimensions. Increasing the stride reduces the output resolution and computational cost, and it is often used as an alternative to pooling layers for spatial downsampling.
The output dimension for one axis is computed as:
O = floor((W - K + 2P) / S) + 1
where W is the input size, K is the kernel size, P is the padding, and S is the stride.
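This formula is easy to check against a framework's behavior. The sketch below assumes PyTorch is available and uses arbitrary example sizes.

```python
import torch
import torch.nn as nn

def conv_output_size(W, K, P, S):
    """O = floor((W - K + 2P) / S) + 1, per the formula above."""
    return (W - K + 2 * P) // S + 1

W, K, P, S = 224, 3, 1, 2
layer = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=K,
                  stride=S, padding=P)
x = torch.randn(1, 3, W, W)

print(conv_output_size(W, K, P, S))  # 112
print(layer(x).shape)                # torch.Size([1, 64, 112, 112])
```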
Padding adds extra values (typically zeros) around the border of the input before the convolution is performed. This is necessary to control the spatial dimensions of the output.
| Padding Type | Description | Output Size (stride = 1) |
|---|---|---|
| Valid (no padding) | No padding is added; the kernel only visits positions where it fully overlaps the input | Smaller than input: (W - K + 1) |
| Same (zero padding) | Enough zeros are added so the output has the same spatial dimensions as the input | Equal to input: W |
| Full padding | Enough zeros are added so every element of the input participates in every possible kernel position | Larger than input: (W + K - 1) |
"Same" padding is the most common choice in modern architectures because it simplifies dimension calculations when stacking many layers.
Dilation inserts gaps (zeros) between kernel elements, effectively expanding the receptive field without increasing the number of parameters. A dilation rate of r spreads a kernel of size K so its effective size becomes:
K_eff = K + (K - 1)(r - 1)
For example, a 3x3 kernel with dilation rate 2 has an effective receptive field of 5x5, and with dilation rate 3 it covers 7x7, all while maintaining only 9 learnable weights.
Dilated convolutions were originally developed for efficient wavelet decomposition in signal processing. In deep learning, they became prominent through semantic segmentation architectures such as DeepLab, which uses dilated (atrous) convolutions to capture multi-scale context, and WaveNet, which stacks dilated convolutions to model long-range dependencies in raw audio.
A feature map is the output produced by applying a single convolutional filter to the input. Each filter detects a specific pattern, and the collection of all feature maps from a convolutional layer forms a 3D volume with dimensions (C_out, H_out, W_out), where C_out is the number of output channels (one per filter).
The input to a convolutional layer also has multiple channels. For a color image, the input has 3 channels (red, green, blue). For deeper layers, the number of input channels equals the number of filters in the preceding convolutional layer. Each filter in a convolutional layer spans all input channels: a filter in a layer with C_in input channels has shape (C_in, K_h, K_w), and the total number of parameters for one filter (excluding bias) is C_in * K_h * K_w.
The total number of parameters in a convolutional layer with C_out filters is:
Parameters = C_out * (C_in * K_h * K_w + 1)
where the +1 accounts for the bias term per filter.
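The count can be confirmed against a framework layer. The sketch below assumes PyTorch and uses arbitrary example channel counts.

```python
import torch.nn as nn

C_in, C_out, K_h, K_w = 64, 128, 3, 3
layer = nn.Conv2d(C_in, C_out, kernel_size=(K_h, K_w))

formula = C_out * (C_in * K_h * K_w + 1)              # 128 * (64*9 + 1) = 73,856
actual = sum(p.numel() for p in layer.parameters())   # weights + biases
print(formula, actual, formula == actual)             # 73856 73856 True
```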
Convolutional layers exploit two properties that make them far more efficient than fully connected layers: parameter sharing and local connectivity (also called sparse interactions).
In a convolutional layer, the same filter weights are applied at every spatial position of the input. This means a single set of K_h * K_w * C_in weights is reused across the entire input, regardless of the input size. Parameter sharing encodes the assumption of translation equivariance: a pattern that is useful to detect in one part of the input is equally useful in other parts. For example, an edge detector should work the same way whether the edge appears in the top-left or bottom-right of the image.
Parameter sharing reduces the number of learnable weights by several orders of magnitude compared to a fully connected layer with the same input and output sizes. For an input image of 224x224x3 mapped to a feature map of 224x224x64, a fully connected layer would require roughly 224 * 224 * 3 * 224 * 224 * 64 weights, which is approximately 4.8 * 10^11, or about 483 billion parameters. A convolutional layer with 64 filters of size 3x3 requires only 64 * (3 * 3 * 3 + 1) = 1,792 parameters.
Each output neuron in a convolutional layer is connected to only a small, local region of the input (determined by the kernel size), rather than to every input element. This sparse connectivity reduces computation and also provides an inductive bias well suited to data with local spatial structure, such as images and audio. Deep stacking of locally connected layers allows the network to build up large receptive fields gradually: a neuron in a deep layer indirectly depends on a large region of the original input, even though each individual layer examines only a small neighborhood.
Standard convolution applies each filter across all input channels simultaneously. Given an input with C_in channels and C_out filters of size K x K, the computational cost (in multiply-accumulate operations) for producing one output feature map of spatial size H_out x W_out is:
FLOPs = H_out * W_out * C_in * C_out * K * K
Standard convolution is the most expressive form because every output channel can learn arbitrary combinations of all input channels. However, it is also the most expensive, which has motivated the development of lighter alternatives.
Depthwise separable convolution, popularized by MobileNet (Howard et al., 2017) and also used in Xception (Chollet, 2017), factorizes a standard convolution into two steps: a depthwise convolution, which applies a single K x K filter to each input channel independently, followed by a pointwise (1x1) convolution, which combines the resulting channels into the C_out output channels.
The total cost of depthwise separable convolution is:
FLOPs = H_out * W_out * C_in * (K * K + C_out)
The reduction factor compared to standard convolution is:
Reduction = 1/C_out + 1/K^2
For a common configuration with K = 3 and C_out = 256, this yields a reduction factor of approximately 1/256 + 1/9, which is roughly 0.115, meaning depthwise separable convolution uses only about 11-12% of the computation of standard convolution.
MobileNet V1 demonstrated that depthwise separable convolutions could achieve accuracy comparable to standard architectures while being 8-9 times faster, making them suitable for mobile and edge devices. MobileNet V2 refined this with inverted residual blocks and linear bottlenecks.
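The cost formulas above can be compared directly in a few lines of Python; the layer sizes below are arbitrary examples.

```python
def standard_conv_flops(H_out, W_out, C_in, C_out, K):
    return H_out * W_out * C_in * C_out * K * K

def depthwise_separable_flops(H_out, W_out, C_in, C_out, K):
    return H_out * W_out * C_in * (K * K + C_out)

H_out = W_out = 56
C_in, C_out, K = 128, 256, 3

std = standard_conv_flops(H_out, W_out, C_in, C_out, K)
sep = depthwise_separable_flops(H_out, W_out, C_in, C_out, K)

print(sep / std)              # ~0.115, i.e. about 11-12% of the standard cost
print(1 / C_out + 1 / K**2)   # matches the reduction factor 1/C_out + 1/K^2
```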
Grouped convolution divides the input channels into G non-overlapping groups and performs independent convolutions within each group. It was first used in AlexNet (Krizhevsky et al., 2012) for a practical reason: the model was split across two GPUs, and each GPU processed half the channels. Later work showed that grouping also provides a useful regularization effect.
The computational cost of grouped convolution is:
FLOPs = H_out * W_out * (C_in / G) * (C_out / G) * K * K * G = H_out * W_out * C_in * C_out * K * K / G
This represents a 1/G reduction compared to standard convolution. ResNeXt (Xie et al., 2017) demonstrated that increasing the number of groups (called "cardinality") while keeping the total number of parameters fixed improved accuracy over deeper or wider ResNets. ShuffleNet (Zhang et al., 2018) combined grouped convolutions with channel shuffling to allow information flow between groups.
Note that depthwise convolution is a special case of grouped convolution where G = C_in (each channel is its own group).
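In PyTorch, grouped and depthwise convolutions are exposed through the `groups` argument of `nn.Conv2d`. The sketch below (arbitrary channel counts, biases omitted for clarity) shows the resulting parameter reduction.

```python
import torch.nn as nn

C_in, C_out, K = 64, 128, 3

standard  = nn.Conv2d(C_in, C_out, K, bias=False)
grouped   = nn.Conv2d(C_in, C_out, K, groups=4, bias=False)
depthwise = nn.Conv2d(C_in, C_in, K, groups=C_in, bias=False)  # G = C_in

def n_params(m):
    return sum(p.numel() for p in m.parameters())

print(n_params(standard))   # 64 * 128 * 9 = 73,728
print(n_params(grouped))    # 73,728 / 4   = 18,432
print(n_params(depthwise))  # 64 * 9       = 576 (one 3x3 filter per channel)
```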
Deformable convolution, introduced by Dai et al. (2017) at ICCV, augments the standard fixed grid sampling pattern with learnable 2D offsets. In a standard 3x3 convolution, the nine sampling locations form a regular grid. In deformable convolution, each of these locations is displaced by an offset (delta_x, delta_y) that is predicted by an auxiliary convolutional layer applied to the preceding feature map.
The output at position p is:
y(p) = sum over k of w_k * x(p + p_k + delta_p_k)
where p_k are the fixed grid offsets and delta_p_k are the learned offsets. Since the offsets are typically fractional, bilinear interpolation is used to sample the input at non-integer locations.
Deformable convolution has three key properties: the sampling grid adapts to the content of the input rather than being fixed, the offsets are learned end-to-end together with the regular kernel weights, and the additional cost is limited to the small auxiliary layer that predicts the offsets.
Deformable convolution has proven particularly effective for object detection and semantic segmentation tasks where objects appear at varying scales, aspect ratios, and orientations.
Dilated convolution inserts zeros between kernel elements to expand the receptive field without increasing the parameter count. The dilation rate r controls the spacing. See the Dilation section above for the mathematical details and applications in DeepLab and WaveNet.
Dilated convolutions are especially valuable in dense prediction tasks (semantic segmentation, depth estimation) where maintaining high spatial resolution is important. They provide an alternative to encoder-decoder architectures by keeping feature maps at full resolution while still capturing long-range context.
Transposed convolution (sometimes called "deconvolution" or "fractionally strided convolution") performs spatial upsampling with learnable parameters. It is the approximate inverse of a strided convolution: while a strided convolution reduces spatial dimensions, a transposed convolution increases them.
The operation works by inserting zeros between input elements (effectively applying a fractional stride) and then performing a standard convolution. For an input of size H_in and a transposed convolution with kernel size K, stride S, and padding P, the output size is:
H_out = (H_in - 1) * S - 2P + K
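The output-size formula can be checked against a framework's transposed convolution; the sketch below assumes PyTorch and uses arbitrary example sizes.

```python
import torch
import torch.nn as nn

H_in, K, S, P = 16, 4, 2, 1
layer = nn.ConvTranspose2d(in_channels=8, out_channels=8,
                           kernel_size=K, stride=S, padding=P)
x = torch.randn(1, 8, H_in, H_in)

formula = (H_in - 1) * S - 2 * P + K   # (16 - 1) * 2 - 2 + 4 = 32
print(formula)                          # 32
print(layer(x).shape)                   # torch.Size([1, 8, 32, 32])
```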
It is important to note that transposed convolution is not a true mathematical inverse (deconvolution). A true deconvolution would recover the exact original input from a convolution's output, which is generally an ill-posed problem. Transposed convolution only reverses the spatial dimensions while performing a learnable upsampling, not a value-level inversion.
Transposed convolutions are widely used in semantic segmentation decoders such as FCN, in generative models such as DCGAN, and more generally in encoder-decoder architectures that must map low-resolution feature maps back to a higher resolution.
One known issue with transposed convolutions is the "checkerboard artifact," where overlapping kernel regions during upsampling create a grid pattern in the output. This can be mitigated by choosing kernel sizes divisible by the stride or by using alternative upsampling strategies such as nearest-neighbor interpolation followed by a standard convolution.
1x1 convolution, first proposed in the Network in Network paper (Lin et al., 2013), applies a 1x1 spatial filter across all input channels. Despite having no spatial extent, it serves several important functions: it reduces or expands the number of channels cheaply, it mixes information across channels at each spatial position, and, when followed by an activation function, it adds non-linearity without enlarging the receptive field.
The computational cost of a 1x1 convolution is H * W * C_in * C_out, with no dependence on the spatial kernel size. This makes it one of the cheapest convolution operations per parameter.
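A short PyTorch sketch (arbitrary sizes) showing a 1x1 convolution used to reduce the channel count while leaving the spatial dimensions untouched:

```python
import torch
import torch.nn as nn

# Reduce 256 channels to 64 with a 1x1 convolution (channel mixing only).
reduce = nn.Conv2d(256, 64, kernel_size=1)
x = torch.randn(1, 256, 56, 56)

print(reduce(x).shape)                               # torch.Size([1, 64, 56, 56])
print(sum(p.numel() for p in reduce.parameters()))   # 256*64 + 64 = 16,448 parameters
```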
The following table summarizes the key convolution variants discussed above.
| Type | Parameters (per layer) | FLOPs (relative to standard) | Receptive Field | Key Advantage | Representative Architecture |
|---|---|---|---|---|---|
| Standard | C_out * C_in * K^2 | 1x (baseline) | K x K per layer | Full cross-channel interaction | VGGNet, ResNet |
| Depthwise Separable | C_in * K^2 + C_in * C_out | ~1/K^2 + 1/C_out (approx. 0.11x for K=3) | K x K per layer | Major computation savings | MobileNet, Xception |
| Grouped (G groups) | C_out * C_in * K^2 / G | 1/G | K x K per layer | Tunable efficiency-accuracy tradeoff | AlexNet, ResNeXt |
| Dilated (rate r) | C_out * C_in * K^2 | Same as standard (same params) | K + (K-1)(r-1) per layer | Larger receptive field, no extra params | DeepLab, WaveNet |
| Deformable | C_out * C_in * K^2 + offset params | ~1x + small offset overhead | Adaptive (input-dependent) | Content-adaptive sampling | DCN (Dai et al., 2017) |
| Transposed | C_in * C_out * K^2 | Depends on output size | N/A (upsampling operation) | Learnable upsampling | FCN, DCGAN |
| 1x1 | C_out * C_in | 1/K^2 of a K x K standard convolution | 1x1 (no spatial extent) | Channel mixing, cheap dimensionality change | Network in Network, Inception |
While the sliding-window description of convolution is intuitive, deep learning frameworks employ several optimized strategies to implement convolution at high speed.
The most common approach is the im2col transformation, which converts the convolution operation into a single large matrix multiplication (GEMM: General Matrix-Matrix Multiplication) by rearranging the input data: every K x K x C_in patch of the input is flattened into one column of a large matrix, the filters are flattened into the rows of a second matrix, and their product yields all output feature maps at once.
Matrix multiplication is one of the most heavily optimized operations in computing. Highly tuned BLAS (Basic Linear Algebra Subprograms) libraries such as cuBLAS (for NVIDIA GPUs), oneDNN (for Intel CPUs), and Apple's Accelerate framework provide near-peak-performance GEMM implementations. The drawback of im2col is memory overhead: it duplicates input data because overlapping patches share elements. For an input of size (C_in, H, W) with a K x K kernel, the expanded matrix requires O(K^2 * C_in * H_out * W_out) memory, which can be several times larger than the original input.
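A toy single-channel, stride-1 im2col in NumPy makes the idea concrete; this is a sketch for illustration, not an optimized implementation.

```python
import numpy as np

def im2col(x, K):
    """Rearrange all K x K patches of a single-channel image into the
    columns of a matrix (stride 1, no padding)."""
    H, W = x.shape
    H_out, W_out = H - K + 1, W - K + 1
    cols = np.empty((K * K, H_out * W_out))
    for i in range(H_out):
        for j in range(W_out):
            cols[:, i * W_out + j] = x[i:i + K, j:j + K].ravel()
    return cols

x = np.random.rand(6, 6)
k = np.random.rand(3, 3)

# The sliding-window operation becomes one matrix multiplication.
out = k.ravel() @ im2col(x, 3)     # shape (16,)
print(out.reshape(4, 4).shape)     # (4, 4) output feature map
```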
Winograd's minimal filtering algorithms reduce the number of multiplications needed for small convolutions. For a 3x3 kernel applied to a 4x4 input tile, the Winograd F(2x2, 3x3) algorithm reduces the multiplications from 36 (direct) to 16, a 2.25x savings. Lavin and Gray (2016) demonstrated speedups of up to 4x for 3x3 convolutions in CNNs.
The algorithm works by transforming both the input tile and the kernel into a special domain (using pre-computed transformation matrices), performing element-wise multiplication there, and transforming back. This is analogous to the convolution theorem but uses transforms tailored to small discrete filters rather than the Fourier transform.
A significant limitation is numerical precision. The transformation matrices grow in magnitude for larger tile sizes, causing floating-point errors to accumulate. This makes Winograd convolution less suitable for large kernels or for low-precision (e.g., INT8) computation, though recent research has made progress on quantized Winograd implementations.
The convolution theorem enables an alternative implementation using the Fast Fourier Transform: the input and the (zero-padded) kernel are both transformed to the frequency domain, multiplied element-wise, and the product is transformed back with the inverse FFT.
Direct spatial convolution of an input of size N x N with a kernel of size K x K costs O(N^2 * K^2) operations. FFT-based convolution costs O(N^2 * log N) for the transforms plus O(N^2) for the element-wise multiplication. Mathieu, Henaff, and LeCun (2014) demonstrated that FFT-based convolution could accelerate CNN training significantly compared to spatial implementations.
In modern practice, FFT-based convolution is less commonly used for standard CNN architectures because most popular architectures use small 3x3 kernels where the overhead of FFT (computing transforms, padding, managing complex arithmetic) outweighs the theoretical benefit. Winograd-based algorithms offer better performance for small kernels without requiring frequency-domain transforms. However, FFT-based convolution remains relevant for applications involving large kernels (such as global convolutions in architectures like ConvNeXt V2) and in traditional signal processing where kernels may span hundreds or thousands of elements.
| Method | Description | Best For |
|---|---|---|
| Implicit GEMM | Dynamically computes the im2col layout on the fly during GEMM, avoiding explicit data rearrangement | GPU and accelerator libraries (NVIDIA cuDNN, Google TPU) |
| Indirect convolution | Replaces data duplication with a pointer-based indirection buffer (Dukhan, 2019) | Memory-constrained environments |
| Direct convolution | Naive nested loops; simple but slow | Very small inputs, educational use |
Convolutional neural networks (CNNs) are a class of deep learning models that exploit the convolution operation for processing grid-like data, such as images, videos, and spectrograms. CNNs have demonstrated exceptional performance in tasks like image classification, object detection, and semantic segmentation.
A typical CNN architecture consists of several layers, including convolutional layers, activation functions, pooling layers, and fully connected layers. The convolutional layers perform the actual convolution operation, while the activation functions introduce non-linearity to the model, allowing it to learn complex patterns. Pooling layers are used to reduce the spatial dimensions of the feature maps, and fully connected layers serve as a final classifier or regressor.
Modern CNN architectures have evolved to incorporate many of the convolution variants described above. EfficientNet uses depthwise separable convolutions with compound scaling. ConvNeXt revisits pure convolutional designs using large depthwise kernels and modern training techniques, achieving performance competitive with Vision Transformers. The ongoing development of new convolution types and architectures continues to expand the capabilities and efficiency of convolutional networks.