Convolutional Operation
Last reviewed
Jun 2, 2026
Sources
22 citations
Review status
Source-backed
Revision
v3 · 5,546 words
The convolutional operation is a mathematical procedure that combines two functions to produce a third function expressing how the shape of one is modified by the other. In the context of convolutional neural networks (CNNs), the operation applies learned filters (also called kernels) to input data, enabling the automatic extraction of spatial features such as edges, textures, and patterns. The convolutional operation forms the computational backbone of nearly all modern computer vision systems, and its variants have been extended to audio, text, and volumetric data.
Imagine you have a big picture made of tiny colored squares. You also have a small stamp with a pattern on it. You place the stamp on the top-left corner of the picture, look at the squares underneath, do some simple math (multiply and add), and write down a single number. Then you slide the stamp one square to the right and repeat. When you reach the end of a row, you move down and start again. When you finish, all those numbers you wrote down form a new, smaller picture. That new picture highlights the parts of the original that match the stamp's pattern, like edges or certain shapes. That sliding-and-multiplying process is the convolutional operation, and the stamp is the filter (or kernel). In a neural network, the computer learns the best stamp patterns on its own by looking at thousands of examples.
The mathematical concept of convolution dates back to the 18th century, with roots in the work of Euler and Laplace on integral transforms. In modern applied mathematics, convolution became a standard tool in signal processing and linear systems theory during the 20th century.
The application of convolution-like operations to artificial neural networks began with Kunihiko Fukushima's Neocognitron in 1980. Fukushima was inspired by the discoveries of David Hubel and Torsten Wiesel, who showed in the early 1960s that neurons in the cat visual cortex respond to oriented edges within localized receptive fields.[11] The Neocognitron introduced a hierarchical architecture with alternating layers of simple and complex cells, anticipating many structural features of today's CNNs.[12] However, the Neocognitron was trained with unsupervised, self-organizing learning rules rather than with backpropagation.
In 1989, Yann LeCun and colleagues at Bell Labs demonstrated that a convolutional neural network trained with backpropagation could recognize handwritten zip codes.[1] This work led to LeNet-5 (1998), the architecture that established CNNs as practical tools for image recognition.[2] The convolutional operation as used in CNNs, with learned filters, weight sharing, and gradient-based training, traces directly to LeCun's contributions.
The field remained relatively quiet until 2012, when AlexNet won the ImageNet Large Scale Visual Recognition Challenge by a wide margin, demonstrating that deep convolutional networks trained on GPUs could achieve unprecedented accuracy on large-scale image classification.[3] This result triggered the modern deep learning era and led to rapid development of new convolution variants.
For two continuous functions f and g, the convolution is defined as:
(f * g)(t) = ∫ f(τ) g(t - τ) dτ
The integral runs over all values of τ (from -∞ to +∞). The key characteristic is that one of the functions is flipped (reversed) before the integration. This flipping distinguishes convolution from the closely related cross-correlation operation.
For discrete signals, the continuous integral is replaced with a summation:
(f * g)[n] = ∑_k f[k] · g[n - k]
This form is more directly applicable to digital signals and images, where data is represented as sequences or grids of discrete values.
In image processing, both the input and the kernel are two-dimensional. The 2D discrete convolution of an input X with a kernel K is:[4]
(X * K)[i, j] = ∑_m ∑_n X[m, n] · K[i - m, j - n]
Here, the kernel K is flipped both horizontally and vertically before being slid across the input. The output at each position (i, j) is the sum of element-wise products between the flipped kernel and the overlapping region of the input.
In practice, most machine learning frameworks (PyTorch, TensorFlow, JAX) implement cross-correlation rather than true mathematical convolution.[4][5] The 2D cross-correlation formula is:
(X ☆ K)[i, j] = ∑_m ∑_n X[i + m, j + n] · K[m, n]
The only difference from convolution is the absence of the kernel flip. In cross-correlation, the kernel slides over the input without being reversed.
This distinction is practically irrelevant in neural networks because the kernel weights are learned from data during training. If the network needed a flipped version of some filter, backpropagation would simply learn the flipped weights directly. As a result, CNN implementations universally use cross-correlation for computational simplicity but call the operation "convolution" by convention.
| Property | True convolution | Cross-correlation (used in CNNs) |
|---|---|---|
| Kernel flipped | Yes (180-degree rotation) | No |
| Mathematical notation | f * g | f ☆ g |
| Commutativity | Commutative (f * g = g * f) | Not commutative |
| Associativity | Associative | Not associative |
| Used in ML frameworks | Rarely | Almost universally |
| Effect on learned weights | Weights learned in flipped orientation | Weights learned directly |
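The relationship between the two operations can be checked on a tiny example. The following is a minimal sketch in plain Python, assuming a single-channel input, no padding, and stride 1; the function names `cross_correlate2d`, `flip180`, and `convolve2d` are illustrative and not taken from any framework.

```python
def cross_correlate2d(x, k):
    """Valid-mode 2D cross-correlation: slide k over x without flipping it."""
    kh, kw = len(k), len(k[0])
    out_h = len(x) - kh + 1
    out_w = len(x[0]) - kw + 1
    return [
        [
            sum(x[i + m][j + n] * k[m][n] for m in range(kh) for n in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def flip180(k):
    """Rotate a kernel by 180 degrees (reverse the rows, then each row)."""
    return [row[::-1] for row in k[::-1]]


def convolve2d(x, k):
    """Valid-mode true convolution: cross-correlation with the flipped kernel."""
    return cross_correlate2d(x, flip180(k))


x = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
k = [[1, 0],
     [0, -1]]

print(cross_correlate2d(x, k))  # [[-4, -4], [-4, -4]]
print(convolve2d(x, k))         # [[4, 4], [4, 4]]
```

The two results differ only because this kernel is not symmetric under a 180-degree rotation; for a symmetric kernel the two operations coincide.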
Several hyperparameters control how the convolutional operation is applied and determine the size and properties of the output.
The kernel size defines the spatial extent of the filter, typically expressed as height x width. Common choices include 1x1, 3x3, 5x5, and 7x7. Smaller kernels (especially 3x3) have become the standard in modern architectures because stacking multiple small kernels achieves the same receptive field as a single large kernel while using fewer parameters and introducing more nonlinearity through additional activation functions. For instance, two stacked 3x3 layers cover the same 5x5 receptive field as one 5x5 layer but use 2 x (3 x 3) = 18 weights per channel pair instead of 25, a point that VGG made explicit.[14]
Stride specifies the number of positions the kernel moves between consecutive applications. A stride of 1 means the kernel moves one pixel at a time, producing a dense output. A stride of 2 moves two pixels at a time, reducing each spatial dimension by roughly half. Larger strides downsample the output, which reduces computation and enlarges the receptive field, measured on the original input, of every subsequent layer.
Padding adds extra values (usually zeros) around the border of the input before convolution. Without padding, each convolutional layer shrinks the spatial dimensions, and information at the edges of the input is underrepresented.
| Padding type | Description | Output size (for stride = 1) |
|---|---|---|
| Valid (no padding) | No padding applied; output is smaller than input | (n - k + 1) x (n - k + 1) |
| Same padding | Padding chosen so output has same dimensions as input | n x n |
| Full padding | Maximum padding (k - 1 on each side); the kernel visits every position where it overlaps the input by at least one element | (n + k - 1) x (n + k - 1) |
In the table above, n represents the input dimension and k represents the kernel dimension. Most modern architectures use same padding with 3x3 kernels (i.e., 1 pixel of zero padding on each side).
Each filter produces one output channel (also called a feature map). A convolutional layer typically uses multiple filters in parallel, each learning to detect a different pattern. For example, a layer with 64 filters applied to an input produces an output with 64 channels.
Dilation inserts gaps between kernel elements, expanding the receptive field without increasing the number of parameters. A dilation rate of 1 is a standard convolution; a dilation rate of 2 inserts one gap between each kernel element. See the section on dilated convolution below for details.
The spatial dimensions of the output feature map are determined by a standard formula. Given an input of size I, a kernel of size K, padding P, stride S, and dilation rate D, the output size O along one dimension is:[9][14]
O = floor((I + 2P - D(K - 1) - 1) / S) + 1
For the common case with dilation rate 1, this simplifies to:
O = floor((I - K + 2P) / S) + 1
The floor operation means that when the stride does not divide the padded input evenly, some input columns at the right and bottom edges are dropped. This is the convention used by PyTorch, TensorFlow, and the framework-agnostic reference by Dumoulin and Visin.[9]
| Example configuration | Input size | Kernel | Padding | Stride | Dilation | Output size |
|---|---|---|---|---|---|---|
| Standard 3x3 with same padding | 32 | 3 | 1 | 1 | 1 | 32 |
| Standard 3x3 with valid padding | 32 | 3 | 0 | 1 | 1 | 30 |
| Strided 3x3 | 32 | 3 | 1 | 2 | 1 | 16 |
| Dilated 3x3 (rate=2) | 32 | 3 | 2 | 1 | 2 | 32 |
| Strided 7x7 with padding 3 | 224 | 7 | 3 | 2 | 1 | 112 |
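The output-size formula is easy to encode and check against the rows of the table. This is a small sketch; the helper name `conv_output_size` and its argument order are assumptions made for illustration, and it does not validate that the kernel actually fits inside the padded input.

```python
def conv_output_size(i, k, p=0, s=1, d=1):
    """Output length along one dimension:
    floor((I + 2P - D(K - 1) - 1) / S) + 1
    """
    return (i + 2 * p - d * (k - 1) - 1) // s + 1


# (input, kernel, padding, stride, dilation) for each row of the table
rows = [
    (32, 3, 1, 1, 1),
    (32, 3, 0, 1, 1),
    (32, 3, 1, 2, 1),
    (32, 3, 2, 1, 2),
    (224, 7, 3, 2, 1),
]
print([conv_output_size(*row) for row in rows])  # [32, 30, 16, 32, 112]
```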
The total number of learnable parameters in a convolutional layer depends on the kernel size, number of input channels (C_in), number of output channels (C_out), and whether a bias term is included:
Parameters = (K_h x K_w x C_in + 1) x C_out
The "+1" accounts for one bias value per output channel. For example, a layer with 64 filters of size 3x3 applied to an input with 128 channels has (3 x 3 x 128 + 1) x 64 = 73,792 parameters. A key property is that this count is independent of the input's spatial size: the same filter is reused at every location, so a 224x224 input and a 7x7 input require exactly the same number of weights.
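The parameter formula can be written as a one-line helper. The sketch below uses a hypothetical function name, `conv2d_param_count`, and reproduces the worked example from the text.

```python
def conv2d_param_count(k_h, k_w, c_in, c_out, bias=True):
    """Learnable parameters of a standard 2D convolutional layer."""
    weights = k_h * k_w * c_in * c_out
    biases = c_out if bias else 0  # one bias per output channel
    return weights + biases


print(conv2d_param_count(3, 3, 128, 64))              # 73792
print(conv2d_param_count(3, 3, 128, 64, bias=False))  # 73728
```

Note that no spatial size appears among the arguments, which reflects the independence from input resolution described above.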
While the parameter count measures memory, the computational cost of a convolutional layer is usually measured in multiply-accumulate operations (MACs) or floating-point operations (FLOPs). Each output element requires one multiply and one add for every weight that touches it. Counting bias as negligible, the number of MACs for a 2D convolutional layer is the product of the output spatial size and the per-output work:
MACs = H_out x W_out x C_in x K_h x K_w x C_out
Because each multiply-accumulate is conventionally counted as two floating-point operations (one multiplication and one addition), the FLOP count is approximately twice the MAC count:
FLOPs ~ 2 x H_out x W_out x C_in x K_h x K_w x C_out
For example, a 3x3 convolution with 128 input channels and 256 output channels producing a 56x56 output map costs 56 x 56 x 128 x 3 x 3 x 256, which is about 0.92 billion MACs, or roughly 1.85 GFLOPs. Unlike the parameter count, the MAC count scales with the spatial resolution of the feature map, which is why early layers operating on large feature maps often dominate a network's compute budget even though they hold few parameters.[14] These two metrics, parameters and FLOPs, are reported separately because a layer can be parameter-light but compute-heavy (an early high-resolution layer) or parameter-heavy but compute-light (a 1x1 layer deep in the network).
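The arithmetic in this example can be reproduced with a short helper. This is a sketch with an illustrative function name (`conv2d_macs`); it ignores the bias term, as the text does, and uses the two-FLOPs-per-MAC convention.

```python
def conv2d_macs(h_out, w_out, c_in, k_h, k_w, c_out):
    """Multiply-accumulate operations for one 2D convolutional layer."""
    return h_out * w_out * c_in * k_h * k_w * c_out


macs = conv2d_macs(56, 56, 128, 3, 3, 256)
flops = 2 * macs  # one multiplication plus one addition per MAC

print(macs)   # 924844032
print(flops)  # 1849688064
```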
A single filter uses the same set of weights at every spatial location in the input. This weight sharing reduces the number of parameters compared to fully connected layers and encodes the assumption that the same pattern can appear anywhere in the input.
Convolutional layers are translation equivariant: if the input shifts by a certain amount, the output shifts by the same amount. Formally, if T is a translation operator and f is the convolution operation, then f(T(x)) = T(f(x)). This property means that a feature detector learned at one location automatically applies everywhere in the input, without requiring separate detectors for each position.
Translation equivariance should not be confused with translation invariance. Pooling layers and global aggregation operations contribute to approximate translation invariance, meaning the final classification output remains similar regardless of where an object appears in the image.
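The equivariance identity f(T(x)) = T(f(x)) can be verified numerically. The sketch below makes a simplifying assumption: it uses a 1D signal with circular (wrap-around) boundaries so that the identity holds exactly at the edges; with zero padding it holds only away from the border. The function names are illustrative.

```python
def circular_cross_correlate(x, k):
    """1D cross-correlation with wrap-around indexing."""
    n = len(x)
    return [sum(x[(i + m) % n] * k[m] for m in range(len(k))) for i in range(n)]


def shift(x, s):
    """Circularly shift a list s positions to the right."""
    s %= len(x)
    return x[-s:] + x[:-s]


x = [3, 1, 4, 1, 5, 9, 2, 6]
k = [1, -2, 1]

filter_then_shift = shift(circular_cross_correlate(x, k), 3)
shift_then_filter = circular_cross_correlate(shift(x, 3), k)
print(filter_then_shift == shift_then_filter)  # True
```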
Each output unit depends only on a small region of the input, defined by the kernel size. This locality is both computationally efficient and well suited for capturing local spatial patterns like edges, corners, and textures. Higher-level layers, through stacking, build receptive fields that span larger portions of the input.
When convolutional layers are stacked, the network learns a hierarchy of features. Early layers detect simple patterns such as edges and color gradients. Middle layers combine these into more complex shapes like corners, contours, and texture elements. Deeper layers compose these into high-level features representing object parts or entire objects. This hierarchical decomposition is a central reason why CNNs work well for visual recognition.
During training, convolutional layers are optimized using gradient descent and backpropagation.[1][4] The backward pass through a convolutional layer involves two gradient computations:
Gradient with respect to the input (needed for propagating gradients to earlier layers): This is computed as a convolution of the upstream gradient with the transposed (flipped) kernel.
Gradient with respect to the kernel weights (needed for updating the filter): This is computed as a convolution between the input and the upstream gradient.
Because of weight sharing, gradients from all spatial positions are accumulated into a single gradient for each filter weight. This makes gradient computation efficient and ensures that every spatial location contributes to the filter update.
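The two gradient computations can be sketched for the 1D, single-channel case. The code below assumes the forward pass is a valid-mode cross-correlation with stride 1 and takes the loss to be a weighted sum of the outputs, so that the upstream gradient `g` is known exactly; all names are illustrative.

```python
def cross_correlate1d(x, k):
    """Forward pass: y[i] = sum_m x[i + m] * k[m] (valid mode, stride 1)."""
    return [
        sum(x[i + m] * k[m] for m in range(len(k)))
        for i in range(len(x) - len(k) + 1)
    ]


def grad_kernel(x, g, k_len):
    """dL/dk[m] = sum_i g[i] * x[i + m]: the input cross-correlated with the
    upstream gradient. The sum over i is the accumulation across positions."""
    return [sum(g[i] * x[i + m] for i in range(len(g))) for m in range(k_len)]


def grad_input(g, k, x_len):
    """dL/dx[j] = sum_m g[j - m] * k[m]: equivalent to sliding the flipped
    kernel over the zero-padded upstream gradient."""
    return [
        sum(g[j - m] * k[m] for m in range(len(k)) if 0 <= j - m < len(g))
        for j in range(x_len)
    ]


def loss(x, k, g):
    """Toy loss L = sum_i g[i] * y[i], chosen so that dL/dy equals g."""
    return sum(gi * yi for gi, yi in zip(g, cross_correlate1d(x, k)))


x = [1, 2, 3, 4, 5]
k = [2, -1, 3]
g = [1, 0, -2]

print(grad_kernel(x, g, len(k)))  # [-5, -6, -7]
print(grad_input(g, k, len(x)))   # [2, -1, -1, 2, -6]
```

Because this toy loss is linear in the input, increasing any single `x[j]` by 1 changes the loss by exactly the corresponding entry of `grad_input`, which gives a simple way to check the formulas.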
Modern deep learning frameworks use several strategies to compute convolutions efficiently on hardware accelerators.
The most widely used approach, called im2col (image-to-column), rearranges overlapping input patches into columns of a matrix. The convolution is then computed as a single large matrix multiplication (GEMM, General Matrix-Matrix Multiply). This approach leverages highly optimized BLAS libraries and GPU tensor cores but requires additional memory because overlapping patches are duplicated in the rearranged matrix. The technique was introduced by Chellapilla, Puri, and Simard in 2006 for high-performance document processing and is now used by most major frameworks, including the original Caffe convolution.[16]
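The rearrangement can be illustrated in plain Python. The sketch below is simplified under stated assumptions: one input channel, one filter, no padding, stride 1, and patches stored as rows rather than columns. With a single filter the GEMM reduces to a matrix-vector product; a real implementation would stack one flattened filter per output channel and call an optimized GEMM routine. The function names are illustrative.

```python
def im2col(x, kh, kw):
    """Unroll every kh x kw patch of a 2D input into one row of a matrix."""
    out_h = len(x) - kh + 1
    out_w = len(x[0]) - kw + 1
    return [
        [x[i + m][j + n] for m in range(kh) for n in range(kw)]
        for i in range(out_h)
        for j in range(out_w)
    ]


def conv_via_im2col(x, k):
    """Cross-correlation computed as (patch matrix) times (flattened kernel)."""
    kh, kw = len(k), len(k[0])
    out_w = len(x[0]) - kw + 1
    patches = im2col(x, kh, kw)
    k_flat = [w for row in k for w in row]
    flat = [sum(a * b for a, b in zip(patch, k_flat)) for patch in patches]
    # Fold the flat result back into a 2D output map.
    return [flat[r:r + out_w] for r in range(0, len(flat), out_w)]


x = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
k = [[1, 0],
     [0, -1]]

print(im2col(x, 2, 2))        # [[1, 2, 4, 5], [2, 3, 5, 6], [4, 5, 7, 8], [5, 6, 8, 9]]
print(conv_via_im2col(x, k))  # [[-4, -4], [-4, -4]]
```

The patch matrix holds 16 values for a 9-element input, which shows the memory duplication described above.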
The convolution theorem states that convolution in the spatial domain is equivalent to element-wise multiplication in the frequency domain. By transforming both the input and the kernel to the frequency domain using the Fast Fourier Transform (FFT), performing element-wise multiplication, and transforming back, convolution can be computed in O(n log n) time rather than O(n^2). This approach is advantageous for large kernel sizes (typically larger than 5x5) but incurs overhead from the domain transforms that makes it less efficient for the small 3x3 kernels common in modern architectures.
The Winograd minimal filtering algorithm reduces the number of multiplications needed for small, fixed-size convolutions (particularly 3x3 with stride 1). It trades multiplications for additions, which are cheaper on most hardware. Lavin and Gray showed in 2016 that the F(2x2, 3x3) Winograd algorithm computes a 2x2 output tile from a 3x3 filter using only 16 multiplications, versus the 36 required by direct convolution, a 2.25x reduction, and that this made Winograd faster than direct convolution and FFT on every layer of VGG.[17] Many deep learning libraries (cuDNN, MKL-DNN) automatically select Winograd-based implementations for eligible convolutions. The main caveat is reduced numerical precision, since the input and filter transforms can amplify rounding error, especially for larger output tiles.[17]
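The idea is easiest to see in one dimension. The sketch below implements F(2, 3), the 1D building block that is nested to obtain F(2x2, 3x3): it produces two outputs of a 3-tap filter from four inputs with 4 multiplications instead of 6 (nesting gives 4 x 4 = 16 versus 6 x 6 = 36). The function names are illustrative, and a real library would precompute the filter transform once per filter.

```python
def winograd_f23(d, g):
    """Two outputs of a 3-tap filter over 4 inputs, using 4 multiplications."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    # Filter transform (depends only on the filter, so it can be cached).
    u0 = g0
    u1 = (g0 + g1 + g2) / 2
    u2 = (g0 - g1 + g2) / 2
    u3 = g2
    # The four element-wise multiplications.
    m1 = (d0 - d2) * u0
    m2 = (d1 + d2) * u1
    m3 = (d2 - d1) * u2
    m4 = (d1 - d3) * u3
    # Output transform (additions only).
    return [m1 + m2 + m3, m2 - m3 - m4]


def direct_f23(d, g):
    """Reference: the same two outputs with 6 multiplications."""
    return [sum(d[i + m] * g[m] for m in range(3)) for i in range(2)]


d = [1, 2, 3, 4]
g = [0.5, 1.0, -2.0]
print(winograd_f23(d, g))  # [-3.5, -4.0]
print(direct_f23(d, g))    # [-3.5, -4.0]
```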
Direct implementation applies the sliding-window operation as written, without matrix rearrangement. While conceptually simple, direct convolution is generally slower on GPUs than im2col or Winograd methods. It remains useful on specialized hardware or for unusual kernel configurations.
| Method | Best kernel sizes | Memory overhead | Computational complexity (n x n input, k x k kernel) |
|---|---|---|---|
| im2col + GEMM | Any | High (patch duplication) | O(n^2 k^2) total, O(k^2) per output element |
| FFT-based | Large (>5x5) | Moderate (frequency buffers) | O(n^2 log n) |
| Winograd | Small, fixed (3x3, 5x5) | Low | Fewer multiplications than direct |
| Direct | Any | None | O(n^2 k^2) total, O(k^2) per output element |
Researchers have developed many specialized variants of the standard convolution to address different requirements in terms of receptive field size, computational cost, spatial resolution, and input structure.
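A 1x1 convolution applies a 1x1 filter across all input channels at each spatial location. It performs no spatial filtering; instead, it computes a linear combination of channel values. Its primary uses are mixing information across channels and changing the channel count, either reducing it to cut the cost of later layers or expanding it again.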
The concept was introduced in the Network-in-Network paper (Lin et al., 2013), which framed the 1x1 layer as a tiny multilayer perceptron applied at each spatial location.[18] It became a standard building block in architectures like Inception (GoogLeNet), ResNet bottleneck blocks, and MobileNet. In ResNet bottleneck blocks, a 1x1 layer first reduces the channel count, a 3x3 layer does the spatial filtering on the cheaper reduced representation, and a second 1x1 layer restores the channel count, which sharply lowers the cost of deep residual networks.
Dilated convolution, also called atrous convolution (from the French "à trous," meaning "with holes"), inserts gaps between kernel elements. A dilation rate of r means there are r - 1 gaps between each pair of adjacent kernel values, so a kernel of size K spans r(K - 1) + 1 input positions. This expands the receptive field without increasing the number of parameters or reducing spatial resolution.
For a 3x3 kernel with dilation rate 2, the kernel effectively covers a 5x5 area but only uses 9 weights. With dilation rate 4, it covers a 9x9 area with the same 9 weights. Yu and Koltun showed that stacking dilated convolutions with exponentially increasing rates expands the receptive field exponentially with depth while keeping the parameter count and resolution fixed.[7]
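These numbers follow from two small formulas, sketched below with illustrative function names. The stacked receptive field calculation assumes every layer has stride 1 and the same kernel size.

```python
def effective_kernel_size(k, d):
    """Extent covered by a size-k kernel with dilation rate d."""
    return d * (k - 1) + 1


def stacked_receptive_field(k, dilations):
    """Receptive field of stacked stride-1 layers with the given dilation rates."""
    rf = 1
    for d in dilations:
        rf += d * (k - 1)  # each layer widens the field by its dilated extent minus one
    return rf


print(effective_kernel_size(3, 2))            # 5
print(effective_kernel_size(3, 4))            # 9
print(stacked_receptive_field(3, [1, 1, 1]))  # 7
print(stacked_receptive_field(3, [1, 2, 4]))  # 15
```

Three undilated 3x3 layers reach a 7x7 receptive field, while the same three layers with rates 1, 2, and 4 reach 15x15 using the same number of weights.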
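Dilated convolutions are widely used in semantic segmentation and other dense prediction tasks, where the output must keep the spatial resolution of the input.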
One known issue is the gridding effect: when dilated convolutions with the same rate are stacked, certain input positions are never sampled. Solutions include using a mix of dilation rates or hybrid dilated convolution schedules.
A depthwise separable convolution factorizes a standard convolution into two sequential steps:
Depthwise convolution: A separate spatial filter (e.g., 3x3) is applied independently to each input channel. If the input has C channels, C separate filters are used, each operating on a single channel.
Pointwise convolution: A 1x1 convolution combines the outputs across channels.
The MobileNet paper expresses the cost reduction precisely. The ratio of depthwise separable cost to standard convolution cost is:[6]
(1 / C_out) + (1 / K^2)
where C_out is the number of output channels and K is the kernel size. For a 3x3 kernel (K = 3) the second term is 1/9, and since C_out is usually large the first term is small, so the combined factor is close to 1/9. This is why depthwise separable convolutions require roughly 8 to 9 times fewer multiply-add operations than a standard 3x3 convolution.[6]
This factorization should not be confused with spatially separable convolution, an older idea that splits a 2D kernel into two 1D kernels (for example a 3x3 kernel into a 3x1 followed by a 1x3). Spatially separable convolution only works for kernels that are mathematically separable, whereas depthwise separable convolution factorizes along the channel dimension and applies to any learned filter.
Depthwise separable convolutions were popularized by MobileNet (Howard et al., 2017) and are used in EfficientNet, Xception, and other lightweight architectures designed for mobile and edge deployment.[6] Xception (Chollet, 2017) interpreted the design as an extreme form of the Inception module, hypothesizing that cross-channel correlations and spatial correlations can be mapped entirely separately.[19]
| Convolution type | Parameters (K=3, C_in=128, C_out=256) | Multiply-adds (spatial 8x8) | Relative cost |
|---|---|---|---|
| Standard 3x3 | 3 x 3 x 128 x 256 = 294,912 | 18,874,368 | 1.0x |
| Depthwise separable | (3 x 3 x 128) + (128 x 256) = 33,920 | 2,170,880 | ~0.12x |
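The table entries and the MobileNet cost ratio can be reproduced with a few lines. This sketch uses illustrative function names, ignores bias terms, and assumes stride 1 with the output spatial size given directly.

```python
def standard_cost(k, c_in, c_out, h, w):
    """(parameters, multiply-adds) of a standard k x k convolution."""
    params = k * k * c_in * c_out
    return params, params * h * w


def depthwise_separable_cost(k, c_in, c_out, h, w):
    """(parameters, multiply-adds) of a depthwise k x k plus pointwise 1x1 pair."""
    params = k * k * c_in + c_in * c_out
    return params, params * h * w


std_params, std_macs = standard_cost(3, 128, 256, 8, 8)
sep_params, sep_macs = depthwise_separable_cost(3, 128, 256, 8, 8)

print(std_params, std_macs)  # 294912 18874368
print(sep_params, sep_macs)  # 33920 2170880

ratio = sep_macs / std_macs
predicted = 1 / 256 + 1 / 3 ** 2  # (1 / C_out) + (1 / K^2)
print(abs(ratio - predicted) < 1e-12)  # True
```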
Transposed convolution (sometimes incorrectly called "deconvolution") performs an upsampling operation that increases the spatial dimensions of the input. It is the gradient operation of a normal convolution and can be thought of as a convolution with fractional strides, which is why it is also called a fractionally strided convolution.[9] The name "deconvolution" is misleading because the operation does not invert a convolution in the signal-processing sense; it only reverses the spatial transformation of the shapes, not the values.[15]
For a transposed convolution, the output size grows with the stride rather than shrinking. With input size I, kernel K, padding P, stride S, dilation D, and output padding A, the output size along one dimension is:[9]
O = (I - 1) x S - 2P + D(K - 1) + A + 1
In the common case of dilation 1 and no output padding, this reduces to O = (I - 1) x S + K - 2P. The optional output padding term resolves the ambiguity that several input sizes can map to the same convolution output, so it adjusts the result by a fixed amount without adding any computation on those extra rows or columns.
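The role of the output padding term is easiest to see by pairing the transposed formula with the forward one. The sketch below uses illustrative helper names; the forward formula is the floor-based output-size rule for a standard convolution.

```python
def conv_output_size(i, k, p=0, s=1, d=1):
    """Forward convolution: floor((I + 2P - D(K - 1) - 1) / S) + 1."""
    return (i + 2 * p - d * (k - 1) - 1) // s + 1


def conv_transpose_output_size(i, k, p=0, s=1, d=1, a=0):
    """Transposed convolution: (I - 1) * S - 2P + D(K - 1) + A + 1."""
    return (i - 1) * s - 2 * p + d * (k - 1) + a + 1


# A 4x4 kernel with stride 2 and padding 1 doubles the spatial size.
print(conv_transpose_output_size(16, k=4, p=1, s=2))  # 32

# Inputs of size 7 and 8 both map to 4 under a strided 3x3 convolution...
print(conv_output_size(7, k=3, p=1, s=2), conv_output_size(8, k=3, p=1, s=2))  # 4 4

# ...so output padding selects which of the two sizes to recover.
print(conv_transpose_output_size(4, k=3, p=1, s=2, a=0))  # 7
print(conv_transpose_output_size(4, k=3, p=1, s=2, a=1))  # 8
```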
Transposed convolutions are used wherever a network must upsample feature maps, most commonly in decoders and generators.
A known artifact of transposed convolutions is the checkerboard pattern, caused by uneven overlap when the kernel size is not divisible by the stride. This can be mitigated by using resize-convolution (nearest-neighbor upsampling followed by a standard convolution) instead.[15]
Group convolution divides the input channels and output channels into G groups and performs separate convolutions within each group. Each group sees only 1/G of the input channels, so the parameter count and computational cost both drop by a factor of G:
Parameters = (K_h x K_w x C_in x C_out) / G
A standard convolution is the special case where G = 1, and a depthwise convolution is the case where G equals the number of input channels. The trade-off is that channels in different groups can no longer interact within the layer, which is often compensated by adding a 1x1 convolution or channel shuffle afterward.
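The two special cases can be confirmed with a small helper. This is a sketch with a hypothetical function name; it counts weights only (no bias) and assumes both channel counts are divisible by the number of groups.

```python
def group_conv_param_count(k_h, k_w, c_in, c_out, groups=1):
    """Weights of a grouped 2D convolution: K_h * K_w * C_in * C_out / G."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by the number of groups")
    # Each output channel sees only c_in // groups input channels.
    return k_h * k_w * (c_in // groups) * c_out


print(group_conv_param_count(3, 3, 128, 256, groups=1))    # 294912 (standard)
print(group_conv_param_count(3, 3, 128, 256, groups=32))   # 9216
print(group_conv_param_count(3, 3, 128, 128, groups=128))  # 1152 (depthwise)
```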
Group convolution was introduced in AlexNet (Krizhevsky et al., 2012) as a practical solution for splitting computation across two GPUs with limited memory.[3] It was later adopted as a deliberate architectural choice in ResNeXt (Xie et al., 2017), where the "cardinality" (number of groups) was shown to improve accuracy more effectively than increasing depth or width alone.[13]
A 3D convolution uses a kernel with three spatial dimensions (height, width, depth or time). The kernel slides along all three axes, producing a 3D output. This is useful for data with a natural third dimension, such as video (where the third axis is time) and volumetric medical scans (where it is depth).
A causal convolution ensures that the output at time step t depends only on inputs at time t and earlier, never on future inputs. This is achieved by asymmetric padding: the input is padded only on the side of the past (the left, for a sequence ordered in time). Causal convolutions are used for sequence modeling tasks where the temporal order must be respected.
When combined with dilation, causal convolutions can achieve very large receptive fields over temporal sequences while maintaining strict causal ordering. Bai, Kolter, and Koltun showed in 2018 that a generic temporal convolutional network (TCN) built from dilated causal convolutions and residual connections can match or outperform recurrent neural networks (RNNs and LSTMs) across a broad set of sequence-modeling benchmarks, with the additional benefits of parallelizable computation and longer effective memory.[22]
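A dilated causal convolution can be sketched in a few lines of plain Python. The code below assumes a single channel and treats samples before time 0 as zero, which is equivalent to left-padding the input with (K - 1) x dilation zeros; the indexing convention, in which `k[0]` weights the current sample, is a choice made for this illustration.

```python
def causal_conv1d(x, k, dilation=1):
    """y[t] = sum_m k[m] * x[t - m * dilation], with x treated as 0 before t = 0."""
    return [
        sum(
            k[m] * x[t - m * dilation]
            for m in range(len(k))
            if t - m * dilation >= 0
        )
        for t in range(len(x))
    ]


x = [1, 2, 3, 4, 5]
print(causal_conv1d(x, [1, 1]))              # [1, 3, 5, 7, 9]
print(causal_conv1d(x, [1, 1], dilation=2))  # [1, 2, 4, 6, 8]

# Changing a future sample leaves all earlier outputs untouched.
x_future = [1, 2, 3, 4, 100]
print(causal_conv1d(x_future, [1, 1])[:4] == causal_conv1d(x, [1, 1])[:4])  # True
```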
Deformable convolution, introduced by Dai et al. (2017), adds learnable 2D offsets to each sampling position in the kernel grid.[8] Instead of sampling on a fixed regular grid, the network learns to shift each sample point based on the input content. This allows the receptive field to adapt to the scale and shape of objects in the image.
A parallel convolutional layer predicts the offsets, which are applied to the sampling locations of the main convolution. The offsets are typically fractional, so bilinear interpolation is used to sample the input at non-integer positions. Deformable convolutions have been adopted in modern object detection architectures to improve localization of objects with irregular shapes.
Although the convolutional operation is most closely associated with 2D image processing, it has been successfully applied across various data types.
| Domain | Input dimensionality | Convolution type | Example applications |
|---|---|---|---|
| Images | 2D (height x width x channels) | 2D convolution | Classification, detection, segmentation |
| Audio and speech | 1D (time x channels) | 1D convolution | Speech recognition, music generation |
| Video | 3D (time x height x width x channels) | 3D convolution | Action recognition, video captioning |
| Text | 1D (sequence length x embedding dim) | 1D convolution | Text classification, NLP |
| Medical volumes | 3D (depth x height x width) | 3D convolution | Tumor detection, organ segmentation |
| Graph-structured data | Irregular | Graph convolution | Social network analysis, molecule property prediction |
| Variant | Receptive field per layer | Parameter efficiency | Spatial resolution | Primary use case |
|---|---|---|---|---|
| Standard convolution | K x K | Baseline | Preserved (stride=1) or reduced | General feature extraction |
| 1x1 convolution | 1 x 1 | Very high | Preserved | Channel mixing, dimensionality change |
| Dilated convolution | Expanded (by dilation rate) | Same as standard | Preserved | Segmentation, dense prediction |
| Depthwise separable | K x K (factorized) | ~8-9x fewer ops | Preserved or reduced | Mobile and edge deployment |
| Transposed convolution | K x K (upsampling) | Baseline | Increased | Decoders, generators |
| Group convolution | K x K (per group) | Reduced by factor G | Preserved or reduced | Multi-GPU, increased cardinality |
| Deformable convolution | Adaptive | Slightly more (offset params) | Preserved | Irregular object detection |
| 3D convolution | K x K x K | Higher (extra dimension) | Preserved or reduced per axis | Video, volumetric data |
| Causal convolution | Asymmetric temporal | Same as 1D | Preserved | Sequence modeling |
The convolutional operation appears in virtually every modern vision architecture, though its role has evolved alongside newer building blocks like self-attention.
The convolution theorem establishes that convolution in the spatial (or time) domain corresponds to element-wise multiplication in the frequency domain:
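F(f * g) = F(f) · F(g)
where F denotes the Fourier transform. This means a convolution can be computed by transforming both the input and the kernel to the frequency domain with the Fast Fourier Transform (FFT), multiplying the results element-wise, and transforming the product back.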
This approach has O(n log n) complexity compared to the O(n^2) complexity of direct convolution.[4] However, for the small 3x3 kernels used in most modern CNNs, the overhead of the FFT transforms outweighs the savings, so spatial-domain methods (im2col, Winograd) are preferred in practice. FFT-based convolution remains useful in signal processing applications and in research architectures that use large kernels.
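The convolution theorem itself can be checked numerically on a short signal. The sketch below is only a demonstration of the identity: it deliberately uses a naive O(n^2) discrete Fourier transform from the standard library rather than an FFT, so it shows correctness and not the speed advantage. Both sequences are zero-padded to length n + m - 1 so that the circular convolution implied by the DFT equals the linear convolution. All function names are illustrative.

```python
import cmath


def dft(x, inverse=False):
    """Naive discrete Fourier transform (or its inverse) of a sequence."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(x[t] * cmath.exp(sign * 2j * cmath.pi * f * t / n) for t in range(n))
        for f in range(n)
    ]
    return [v / n for v in out] if inverse else out


def conv_via_dft(f, g):
    """Linear convolution computed as an element-wise product of spectra."""
    n = len(f) + len(g) - 1
    f_pad = list(f) + [0] * (n - len(f))
    g_pad = list(g) + [0] * (n - len(g))
    product = [a * b for a, b in zip(dft(f_pad), dft(g_pad))]
    return [v.real for v in dft(product, inverse=True)]


def conv_direct(f, g):
    """Reference: linear convolution from the summation definition."""
    out = [0] * (len(f) + len(g) - 1)
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            out[i + j] += a * b
    return out


f = [1, 2, 3]
g = [0, 1, 0.5]
direct = conv_direct(f, g)
spectral = conv_via_dft(f, g)
print(direct)  # [0, 1, 2.5, 4.0, 1.5]
print(all(abs(a - b) < 1e-9 for a, b in zip(direct, spectral)))  # True
```

The spectral result matches the direct one only up to floating-point rounding, which is why the comparison uses a tolerance.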