The convolutional operation is a mathematical procedure that combines two functions to produce a third function expressing how the shape of one is modified by the other. In the context of convolutional neural networks (CNNs), the operation applies learned filters (also called kernels) to input data, enabling the automatic extraction of spatial features such as edges, textures, and patterns. The convolutional operation forms the computational backbone of nearly all modern computer vision systems, and its variants have been extended to audio, text, and volumetric data.
Imagine you have a big picture made of tiny colored squares. You also have a small stamp with a pattern on it. You place the stamp on the top-left corner of the picture, look at the squares underneath, do some simple math (multiply and add), and write down a single number. Then you slide the stamp one square to the right and repeat. When you reach the end of a row, you move down and start again. When you finish, all those numbers you wrote down form a new, smaller picture. That new picture highlights the parts of the original that match the stamp's pattern, like edges or certain shapes. That sliding-and-multiplying process is the convolutional operation, and the stamp is the filter (or kernel). In a neural network, the computer learns the best stamp patterns on its own by looking at thousands of examples.
The mathematical concept of convolution dates back to the 18th century, with roots in the work of Euler and Laplace on integral transforms. In modern applied mathematics, convolution became a standard tool in signal processing and linear systems theory during the 20th century.
The application of convolution-like operations to artificial neural networks began with Kunihiko Fukushima's Neocognitron in 1980. Fukushima was inspired by the discoveries of David Hubel and Torsten Wiesel, who showed in the early 1960s that neurons in the cat visual cortex respond to oriented edges within localized receptive fields. The Neocognitron introduced a hierarchical architecture with alternating layers of simple and complex cells, anticipating many structural features of today's CNNs. However, the Neocognitron was trained with an unsupervised, self-organizing learning rule rather than with backpropagation.
In 1989, Yann LeCun and colleagues at Bell Labs demonstrated that a convolutional neural network trained with backpropagation could recognize handwritten zip codes. This work led to LeNet-5 (1998), the architecture that established CNNs as practical tools for image recognition. The convolutional operation as used in CNNs, with learned filters, weight sharing, and gradient-based training, traces directly to LeCun's contributions.
The field remained relatively quiet until 2012, when AlexNet won the ImageNet Large Scale Visual Recognition Challenge by a wide margin, demonstrating that deep convolutional networks trained on GPUs could achieve unprecedented accuracy on large-scale image classification. This result triggered the modern deep learning era and led to rapid development of new convolution variants.
For two continuous functions f and g, the convolution is defined as:
(f * g)(t) = ∫ f(τ) g(t − τ) dτ
The integral runs over all values of τ (from −∞ to +∞). The key characteristic is that one of the functions is flipped (reversed) before the integration. This flipping distinguishes convolution from the closely related cross-correlation operation.
For discrete signals, the continuous integral is replaced with a summation:
(f * g)[n] = ∑_k f[k] · g[n − k]
This form is more directly applicable to digital signals and images, where data is represented as sequences or grids of discrete values.
In image processing, both the input and the kernel are two-dimensional. The 2D discrete convolution of an input X with a kernel K is:
(X * K)[i, j] = ∑_m ∑_n X[m, n] · K[i − m, j − n]
Here, the kernel K is flipped both horizontally and vertically before being slid across the input. The output at each position (i, j) is the sum of element-wise products between the flipped kernel and the overlapping region of the input.
In practice, most machine learning frameworks (PyTorch, TensorFlow, JAX) implement cross-correlation rather than true mathematical convolution. The 2D cross-correlation formula is:
(X ☆ K)[i, j] = ∑_m ∑_n X[i + m, j + n] · K[m, n]
The only difference from convolution is the absence of the kernel flip. In cross-correlation, the kernel slides over the input without being reversed.
This distinction is practically irrelevant in neural networks because the kernel weights are learned from data during training. If the network needed a flipped version of some filter, backpropagation would simply learn the flipped weights directly. As a result, CNN implementations universally use cross-correlation for computational simplicity but call the operation "convolution" by convention.
| Property | True convolution | Cross-correlation (used in CNNs) |
|---|---|---|
| Kernel flipped | Yes (180-degree rotation) | No |
| Mathematical notation | f * g | f ☆ g |
| Commutativity | Commutative (f * g = g * f) | Not commutative |
| Associativity | Associative | Not associative |
| Used in ML frameworks | Rarely | Almost universally |
| Effect on learned weights | Weights learned in flipped orientation | Weights learned directly |
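The distinction is easiest to see in code. Below is a minimal single-channel NumPy sketch (stride 1, no padding; the function names are illustrative, not from any framework): cross-correlation slides the kernel as-is, while true convolution first rotates it by 180 degrees.

```python
import numpy as np

def cross_correlate2d(x, k):
    """Slide the kernel over the input without flipping it (what CNN layers compute)."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def convolve2d(x, k):
    """True convolution: the same sliding sum, but with the kernel rotated 180 degrees."""
    return cross_correlate2d(x, np.flip(k))

x = np.arange(25, dtype=float).reshape(5, 5)
k = np.array([[1.0, 0.0, -1.0]] * 3)      # a simple vertical-edge kernel
print(cross_correlate2d(x, k))
print(convolve2d(x, k))                   # differs, since this kernel is not symmetric under rotation
```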
Several hyperparameters control how the convolutional operation is applied and determine the size and properties of the output.
The kernel size defines the spatial extent of the filter, typically expressed as height x width. Common choices include 1x1, 3x3, 5x5, and 7x7. Smaller kernels (especially 3x3) have become the standard in modern architectures because stacking multiple small kernels achieves the same receptive field as a single large kernel while using fewer parameters and introducing more nonlinearity through additional activation functions.
Stride specifies the number of positions the kernel moves between consecutive applications. A stride of 1 means the kernel moves one pixel at a time, producing a dense output. A stride of 2 moves two pixels at a time, reducing each spatial dimension by roughly half. Larger strides downsample the output, which reduces computation and increases the effective receptive field.
Padding adds extra values (usually zeros) around the border of the input before convolution. Without padding, each convolutional layer shrinks the spatial dimensions, and information at the edges of the input is underrepresented.
| Padding type | Description | Output size (for stride = 1) |
|---|---|---|
| Valid (no padding) | No padding applied; output is smaller than input | (n - k + 1) x (n - k + 1) |
| Same padding | Padding chosen so output has same dimensions as input | n x n |
| Full padding | Maximum padding (k - 1 on each side); every element of the input participates in every possible kernel position | (n + k - 1) x (n + k - 1) |
In the table above, n represents the input dimension and k represents the kernel dimension. Most modern architectures use same padding with 3x3 kernels (i.e., 1 pixel of zero padding on each side).
Each filter produces one output channel (also called a feature map). A convolutional layer typically uses multiple filters in parallel, each learning to detect a different pattern. For example, a layer with 64 filters applied to an input produces an output with 64 channels.
Dilation inserts gaps between kernel elements, expanding the receptive field without increasing the number of parameters. A dilation rate of 1 is a standard convolution; a dilation rate of 2 inserts one gap between each kernel element. See the section on dilated convolution below for details.
The spatial dimensions of the output feature map are determined by a standard formula. Given an input of size I, a kernel of size K, padding P, stride S, and dilation rate D, the output size O along one dimension is:
O = floor((I + 2P - D(K - 1) - 1) / S) + 1
For the common case with dilation rate 1, this simplifies to:
O = floor((I - K + 2P) / S) + 1
| Example configuration | Input size | Kernel | Padding | Stride | Dilation | Output size |
|---|---|---|---|---|---|---|
| Standard 3x3 with same padding | 32 | 3 | 1 | 1 | 1 | 32 |
| Standard 3x3 with valid padding | 32 | 3 | 0 | 1 | 1 | 30 |
| Strided 3x3 | 32 | 3 | 1 | 2 | 1 | 16 |
| Dilated 3x3 (rate=2) | 32 | 3 | 2 | 1 | 2 | 32 |
| Large 7x7 with same padding | 224 | 7 | 3 | 2 | 1 | 112 |
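As a sanity check, the formula can be evaluated directly; the short helper below (an illustrative sketch, not a framework API) reproduces the output sizes in the table above.

```python
from math import floor

def conv_output_size(i, k, p, s, d=1):
    """O = floor((I + 2P - D(K - 1) - 1) / S) + 1, along one spatial dimension."""
    return floor((i + 2 * p - d * (k - 1) - 1) / s) + 1

# (input, kernel, padding, stride, dilation) for each row of the table
cases = [(32, 3, 1, 1, 1), (32, 3, 0, 1, 1), (32, 3, 1, 2, 1),
         (32, 3, 2, 1, 2), (224, 7, 3, 2, 1)]
print([conv_output_size(*c) for c in cases])   # [32, 30, 16, 32, 112]
```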
The total number of learnable parameters in a convolutional layer depends on the kernel size, number of input channels (C_in), number of output channels (C_out), and whether a bias term is included:
Parameters = (K_h x K_w x C_in + 1) x C_out
The "+1" accounts for one bias value per output channel. For example, a layer with 64 filters of size 3x3 applied to an input with 128 channels has (3 x 3 x 128 + 1) x 64 = 73,792 parameters.
A single filter uses the same set of weights at every spatial location in the input. This weight sharing reduces the number of parameters compared to fully connected layers and encodes the assumption that the same pattern can appear anywhere in the input.
Convolutional layers are translation equivariant: if the input shifts by a certain amount, the output shifts by the same amount. Formally, if T is a translation operator and f is the convolution operation, then f(T(x)) = T(f(x)). This property means that a feature detector learned at one location automatically applies everywhere in the input, without requiring separate detectors for each position.
Translation equivariance should not be confused with translation invariance: convolution by itself shifts its output when the input shifts rather than ignoring the shift. Pooling layers and global aggregation operations are what contribute approximate translation invariance, meaning the final classification output remains similar regardless of where an object appears in the image.
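The equivariance identity can be checked numerically. The sketch below (PyTorch, illustrative sizes) uses circular padding together with a circular shift so that the identity holds exactly even at the borders; with zero padding it holds everywhere except near the edges.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, padding_mode='circular', bias=False)
x = torch.randn(1, 1, 16, 16)

shift = lambda t: torch.roll(t, shifts=(3, 5), dims=(2, 3))          # translate by (3, 5) pixels
print(torch.allclose(conv(shift(x)), shift(conv(x)), atol=1e-6))     # True: f(T(x)) = T(f(x))
```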
Each output unit depends only on a small region of the input, defined by the kernel size. This locality is both computationally efficient and well suited for capturing local spatial patterns like edges, corners, and textures. Higher-level layers, through stacking, build receptive fields that span larger portions of the input.
When convolutional layers are stacked, the network learns a hierarchy of features. Early layers detect simple patterns such as edges and color gradients. Middle layers combine these into more complex shapes like corners, contours, and texture elements. Deeper layers compose these into high-level features representing object parts or entire objects. This hierarchical decomposition is a central reason why CNNs work well for visual recognition.
During training, convolutional layers are optimized using gradient descent and backpropagation. The backward pass through a convolutional layer involves two gradient computations:
Gradient with respect to the input (needed for propagating gradients to earlier layers): this is computed by convolving the upstream gradient with the kernel rotated 180 degrees, which frameworks implement as a transposed convolution.
Gradient with respect to the kernel weights (needed for updating the filter): this is computed as a cross-correlation between the input and the upstream gradient.
Because of weight sharing, gradients from all spatial positions are accumulated into a single gradient for each filter weight. This makes gradient computation efficient and ensures that every spatial location contributes to the filter update.
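Both gradient computations can be verified with automatic differentiation. The sketch below uses PyTorch on a single-channel, stride-1, unpadded layer; the tensor shapes are illustrative.

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 8, 8, requires_grad=True)
w = torch.randn(1, 1, 3, 3, requires_grad=True)
y = F.conv2d(x, w)               # cross-correlation; output shape 1 x 1 x 6 x 6
g = torch.randn_like(y)          # stand-in for the upstream gradient dL/dy
y.backward(g)

# Kernel gradient: cross-correlate the input with the upstream gradient.
print(torch.allclose(w.grad, F.conv2d(x.detach(), g), atol=1e-5))            # True

# Input gradient: transposed convolution of the upstream gradient with the kernel
# (equivalently, a "full" convolution with the kernel rotated 180 degrees).
print(torch.allclose(x.grad, F.conv_transpose2d(g, w.detach()), atol=1e-5))  # True
```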
Modern deep learning frameworks use several strategies to compute convolutions efficiently on hardware accelerators.
The most widely used approach, called im2col (image-to-column), rearranges overlapping input patches into columns of a matrix. The convolution is then computed as a single large matrix multiplication (GEMM, General Matrix-Matrix Multiply). This approach leverages highly optimized BLAS libraries and GPU tensor cores but requires additional memory because overlapping patches are duplicated in the rearranged matrix.
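A stripped-down, single-channel version of the idea (illustrative names; no batching, stride 1, no padding) shows how the sliding window becomes one matrix multiply:

```python
import numpy as np

def im2col_conv2d(x, k):
    """Cross-correlation (the CNN 'convolution') via im2col followed by one GEMM."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    # Copy every kh x kw patch into a row of a (oh*ow, kh*kw) matrix.
    cols = np.empty((oh * ow, kh * kw))
    for i in range(oh):
        for j in range(ow):
            cols[i * ow + j] = x[i:i + kh, j:j + kw].ravel()
    # The whole convolution is now a single matrix-vector (or matrix-matrix) product.
    return (cols @ k.ravel()).reshape(oh, ow)

x = np.random.default_rng(0).standard_normal((6, 6))
k = np.ones((3, 3)) / 9.0                    # a 3x3 box-blur kernel
print(im2col_conv2d(x, k).shape)             # (4, 4)
```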
The convolution theorem states that convolution in the spatial domain is equivalent to element-wise multiplication in the frequency domain. By transforming both the input and the kernel to the frequency domain using the Fast Fourier Transform (FFT), performing element-wise multiplication, and transforming back, convolution can be computed in O(n log n) time rather than O(n^2). This approach is advantageous for large kernel sizes (typically larger than 5x5) but incurs overhead from the domain transforms that makes it less efficient for the small 3x3 kernels common in modern architectures.
The Winograd minimal filtering algorithm reduces the number of multiplications needed for small, fixed-size convolutions (particularly 3x3 with stride 1). It trades multiplications for additions, which are cheaper on most hardware. Many deep learning libraries (cuDNN, MKL-DNN) automatically select Winograd-based implementations for eligible convolutions.
Direct implementation applies the sliding-window operation as written, without matrix rearrangement. While conceptually simple, direct convolution is generally slower on GPUs than im2col or Winograd methods. It remains useful on specialized hardware or for unusual kernel configurations.
| Method | Best kernel sizes | Memory overhead | Computational complexity (n x n input, k x k kernel) |
|---|---|---|---|
| im2col + GEMM | Any | High (patch duplication) | O(n^2 k^2) |
| FFT-based | Large (>5x5) | Moderate (frequency buffers) | O(n^2 log n) |
| Winograd | Small, fixed (3x3, 5x5) | Low | Fewer multiplications than direct |
| Direct | Any | None | O(n^2 k^2) |
Researchers have developed many specialized variants of the standard convolution to address different requirements in terms of receptive field size, computational cost, spatial resolution, and input structure.
A 1x1 convolution applies a 1x1 filter across all input channels at each spatial location. It performs no spatial filtering; instead, it computes a linear combination of channel values. Its primary uses include:
- reducing or expanding the number of channels, for example before and after an expensive 3x3 convolution in a bottleneck block;
- mixing information across channels while leaving the spatial resolution untouched;
- adding nonlinearity cheaply when followed by an activation function.
The concept was introduced in the Network-in-Network paper (Lin et al., 2013) and became a standard building block in architectures like Inception (GoogLeNet), ResNet bottleneck blocks, and MobileNet.
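For example, a 1x1 convolution can shrink a 256-channel feature map to 64 channels without touching the spatial grid (a PyTorch sketch; the sizes are arbitrary):

```python
import torch
import torch.nn as nn

reduce_channels = nn.Conv2d(256, 64, kernel_size=1)   # channel mixing only, no spatial filtering
x = torch.randn(1, 256, 14, 14)
print(reduce_channels(x).shape)                        # torch.Size([1, 64, 14, 14])
```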
Dilated convolution, also called atrous convolution (from the French "à trous," meaning "with holes"), inserts gaps between kernel elements. A dilation rate of r means there are r - 1 gaps between each pair of adjacent kernel values. This expands the receptive field (exponentially, when the dilation rate is doubled across stacked layers) without increasing the number of parameters or reducing spatial resolution.
For a 3x3 kernel with dilation rate 2, the kernel effectively covers a 5x5 area but only uses 9 weights. With dilation rate 4, it covers a 9x9 area with the same 9 weights.
Dilated convolutions are widely used in:
- semantic segmentation (e.g., the DeepLab family), where dense per-pixel prediction needs large receptive fields at full resolution;
- autoregressive audio generation (WaveNet), where stacked dilated convolutions cover long temporal contexts;
- other dense prediction tasks that cannot afford to downsample the input.
One known issue is the gridding effect: when dilated convolutions with the same rate are stacked, certain input positions are never sampled. Solutions include using a mix of dilation rates or hybrid dilated convolution schedules.
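The effect on coverage and output size can be seen directly (a PyTorch sketch with illustrative sizes): a 3x3 kernel with dilation rate 2 shrinks an unpadded output exactly as a dense 5x5 kernel would.

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 32, 32)
w = torch.randn(1, 1, 3, 3)

print(F.conv2d(x, w, dilation=1).shape)             # torch.Size([1, 1, 30, 30])
print(F.conv2d(x, w, dilation=2).shape)             # torch.Size([1, 1, 28, 28]) -- 5x5 coverage, 9 weights
print(F.conv2d(x, w, dilation=2, padding=2).shape)  # torch.Size([1, 1, 32, 32]) -- resolution preserved
```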
A depthwise separable convolution factorizes a standard convolution into two sequential steps:
Depthwise convolution: A separate spatial filter (e.g., 3x3) is applied independently to each input channel. If the input has C channels, C separate filters are used, each operating on a single channel.
Pointwise convolution: A 1x1 convolution combines the outputs across channels.
This factorization reduces the number of parameters and computations to roughly a fraction 1/C_out + 1/K^2 of a standard convolution. For a 3x3 kernel, depthwise separable convolutions therefore require roughly 8 to 9 times fewer multiply-add operations.
Depthwise separable convolutions were popularized by MobileNet (Howard et al., 2017) and are used in EfficientNet, Xception, and other lightweight architectures designed for mobile and edge deployment.
| Convolution type | Parameters (K=3, C_in=128, C_out=256) | Multiply-adds (spatial 8x8) | Relative cost |
|---|---|---|---|
| Standard 3x3 | 3 x 3 x 128 x 256 = 294,912 | 18,874,368 | 1.0x |
| Depthwise separable | (3 x 3 x 128) + (128 x 256) = 33,920 | 2,170,880 | ~0.12x |
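The parameter counts in the table can be reproduced with two stacked layers (a PyTorch sketch; biases are omitted so the numbers match exactly):

```python
import torch.nn as nn

depthwise = nn.Conv2d(128, 128, kernel_size=3, padding=1, groups=128, bias=False)   # one 3x3 filter per channel
pointwise = nn.Conv2d(128, 256, kernel_size=1, bias=False)                          # 1x1 channel mixing
standard  = nn.Conv2d(128, 256, kernel_size=3, padding=1, bias=False)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard))                       # 294912
print(count(depthwise) + count(pointwise))   # 33920
```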
Transposed convolution (sometimes incorrectly called "deconvolution") performs an upsampling operation that increases the spatial dimensions of the input. It is the gradient operation of a normal convolution and can be thought of as a convolution with fractional strides.
Transposed convolutions are used in:
- decoder stages of encoder-decoder segmentation networks, to recover spatial resolution;
- generators of generative adversarial networks, to upsample from a low-resolution latent grid;
- super-resolution and other image-to-image models that must produce outputs larger than their intermediate feature maps.
A known artifact of transposed convolutions is the checkerboard pattern, caused by uneven overlap when the kernel size is not divisible by the stride. This can be mitigated by using resize-convolution (nearest-neighbor upsampling followed by a standard convolution) instead.
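A sketch of the upsampling behavior (PyTorch, illustrative sizes), using a 4x4 kernel with stride 2 so the kernel size is divisible by the stride and the checkerboard pattern is avoided:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 8, 8)
w = torch.randn(1, 1, 4, 4)

# Output size: (in - 1) * stride - 2 * padding + kernel = (8 - 1) * 2 - 2 + 4 = 16
y = F.conv_transpose2d(x, w, stride=2, padding=1)
print(y.shape)   # torch.Size([1, 1, 16, 16]) -- spatial resolution doubled
```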
Group convolution divides the input channels and output channels into G groups and performs separate convolutions within each group. A standard convolution is the special case where G = 1, and a depthwise convolution is the case where G equals the number of input channels.
Group convolution was introduced in AlexNet (Krizhevsky et al., 2012) as a practical solution for splitting computation across two GPUs. It was later adopted as a deliberate architectural choice in ResNeXt (Xie et al., 2017), where the "cardinality" (number of groups) was shown to improve accuracy more effectively than increasing depth or width alone.
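The parameter count shrinks in proportion to the number of groups; a quick check (PyTorch, 3x3 kernels, 64 channels in and out, no bias; the sizes are illustrative):

```python
import torch.nn as nn

count = lambda m: sum(p.numel() for p in m.parameters())
for g in (1, 2, 4, 64):   # g=1 is a standard convolution, g=64 is depthwise
    conv = nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=g, bias=False)
    print(g, count(conv))  # 36864, 18432, 9216, 576 -- reduced by a factor of g
```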
A 3D convolution uses a kernel with three spatial dimensions (height, width, depth or time). The kernel slides along all three axes, producing a 3D output. This is useful for data with a natural third dimension:
- video, where time is the third axis and motion patterns span consecutive frames;
- volumetric medical imaging such as CT and MRI scans, where structures extend through depth.
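A minimal example of the tensor shapes involved (PyTorch; the clip dimensions are illustrative):

```python
import torch
import torch.nn as nn

# Input layout for video: (batch, channels, time, height, width)
conv3d = nn.Conv3d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
clip = torch.randn(1, 3, 8, 64, 64)      # 8 RGB frames of 64 x 64 pixels
print(conv3d(clip).shape)                # torch.Size([1, 16, 8, 64, 64])
```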
A causal convolution ensures that the output at time step t depends only on inputs at time t and earlier, never on future inputs. This is achieved by asymmetric padding (padding only on one side). Causal convolutions are used for sequence modeling tasks where the temporal order must be respected, such as:
- autoregressive audio generation (WaveNet);
- language modeling and other autoregressive text models;
- time-series forecasting with temporal convolutional networks (TCNs).
When combined with dilation, causal convolutions can achieve very large receptive fields over temporal sequences while maintaining strict causal ordering. Several studies have shown that TCNs with dilated causal convolutions can match or outperform recurrent neural networks (RNNs and LSTMs) on many sequence tasks, with the additional benefit of parallelizable computation.
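A minimal causal 1D convolution can be built from an ordinary convolution by padding only on the left (a sketch; the helper name and sizes are illustrative):

```python
import torch
import torch.nn.functional as F

def causal_conv1d(x, weight, dilation=1):
    """x: (batch, channels, time); weight: (out_ch, in_ch, kernel). Output[t] sees only x[:t+1]."""
    kernel_size = weight.shape[-1]
    left_pad = (kernel_size - 1) * dilation
    return F.conv1d(F.pad(x, (left_pad, 0)), weight, dilation=dilation)

x = torch.randn(1, 1, 10)
w = torch.randn(1, 1, 3)
print(causal_conv1d(x, w, dilation=2).shape)   # torch.Size([1, 1, 10]) -- same length, strictly causal
```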
Deformable convolution, introduced by Dai et al. (2017), adds learnable 2D offsets to each sampling position in the kernel grid. Instead of sampling on a fixed regular grid, the network learns to shift each sample point based on the input content. This allows the receptive field to adapt to the scale and shape of objects in the image.
A parallel convolutional layer predicts the offsets, which are applied to the sampling locations of the main convolution. The offsets are typically fractional, so bilinear interpolation is used to sample the input at non-integer positions. Deformable convolutions have been adopted in modern object detection architectures to improve localization of objects with irregular shapes.
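A minimal sketch, assuming torchvision's deform_conv2d operator is available: with all offsets set to zero the operation reduces to an ordinary convolution, and in practice a small convolutional branch would predict the offsets from the input.

```python
import torch
from torchvision.ops import deform_conv2d

x = torch.randn(1, 3, 16, 16)
w = torch.randn(8, 3, 3, 3)                  # 8 output channels, 3x3 kernel
# One (dy, dx) offset per kernel tap per output location; zeros mean "sample the regular grid".
offset = torch.zeros(1, 2 * 3 * 3, 14, 14)   # output is 14 x 14 without padding
print(deform_conv2d(x, offset, w).shape)     # torch.Size([1, 8, 14, 14])
```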
Although the convolutional operation is most closely associated with 2D image processing, it has been successfully applied across various data types.
| Domain | Input dimensionality | Convolution type | Example applications |
|---|---|---|---|
| Images | 2D (height x width x channels) | 2D convolution | Classification, detection, segmentation |
| Audio and speech | 1D (time x channels) | 1D convolution | Speech recognition, music generation |
| Video | 3D (time x height x width x channels) | 3D convolution | Action recognition, video captioning |
| Text | 1D (sequence length x embedding dim) | 1D convolution | Text classification, NLP |
| Medical volumes | 3D (depth x height x width) | 3D convolution | Tumor detection, organ segmentation |
| Graph-structured data | Irregular | Graph convolution | Social network analysis, molecule property prediction |
| Variant | Receptive field per layer | Parameter efficiency | Spatial resolution | Primary use case |
|---|---|---|---|---|
| Standard convolution | K x K | Baseline | Preserved (stride=1) or reduced | General feature extraction |
| 1x1 convolution | 1 x 1 | Very high | Preserved | Channel mixing, dimensionality change |
| Dilated convolution | Expanded (by dilation rate) | Same as standard | Preserved | Segmentation, dense prediction |
| Depthwise separable | K x K (factorized) | ~8-9x fewer ops | Preserved or reduced | Mobile and edge deployment |
| Transposed convolution | K x K (upsampling) | Baseline | Increased | Decoders, generators |
| Group convolution | K x K (per group) | Reduced by factor G | Preserved or reduced | Multi-GPU, increased cardinality |
| Deformable convolution | Adaptive | Slightly more (offset params) | Preserved | Irregular object detection |
| 3D convolution | K x K x K | Higher (extra dimension) | Preserved or reduced per axis | Video, volumetric data |
| Causal convolution | Asymmetric temporal | Same as 1D | Preserved | Sequence modeling |
The convolutional operation appears in virtually every modern vision architecture, though its role has evolved alongside newer building blocks like self-attention.
The convolution theorem establishes that convolution in the spatial (or time) domain corresponds to element-wise multiplication in the frequency domain:
F(f * g) = F(f) · F(g)
where F denotes the Fourier transform. This means a convolution can be computed by:
- transforming both the input and the kernel to the frequency domain,
- multiplying the two transforms element-wise, and
- applying the inverse transform to the product.
This approach has O(n log n) complexity compared to the O(n^2) complexity of direct convolution. However, for the small 3x3 kernels used in most modern CNNs, the overhead of the FFT transforms outweighs the savings, so spatial-domain methods (im2col, Winograd) are preferred in practice. FFT-based convolution remains useful in signal processing applications and in research architectures that use large kernels.
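The theorem is easy to verify numerically for a 1D signal (a NumPy sketch; the signal and kernel lengths are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.standard_normal(64)
g = rng.standard_normal(9)

direct = np.convolve(f, g)                   # direct linear convolution, length 64 + 9 - 1

# Convolution theorem: zero-pad both signals to the full output length,
# multiply their FFTs element-wise, then transform back.
n = len(f) + len(g) - 1
fft_based = np.fft.irfft(np.fft.rfft(f, n) * np.fft.rfft(g, n), n)

print(np.allclose(direct, fft_based))        # True
```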