See also: Machine learning terms, Convolutional layer, Convolutional filter
Convolution is a mathematical operation that combines two functions to produce a third function, expressing how the shape of one is modified by the other. In its most general form, convolution measures the overlap between one function and a reversed, shifted copy of another function, accumulated over all possible shifts. The operation appears throughout mathematics, physics, engineering, and computer science, and it serves as the foundational building block behind convolutional neural networks (CNNs) and much of modern deep learning.
In signal processing, convolution describes how a linear time-invariant (LTI) system transforms an input signal: the output is the convolution of the input with the system's impulse response. In probability and statistics, the convolution of two probability density functions gives the density of the sum of the corresponding independent random variables. In machine learning, and especially in the domain of deep learning, convolution is used to extract features from structured input data such as images, audio, and text. By applying small, learnable convolutional filters across an input, convolutional layers capture spatial and temporal patterns at multiple scales of abstraction.
Convolution is the foundational operation behind many breakthroughs in computer vision, natural language processing, speech recognition, and time series analysis. Over the past decade, researchers have introduced numerous variants of the convolution operation, each designed to address specific computational or representational challenges.
Imagine you have a rubber stamp with a small pattern on it, like a star. Now imagine you have a big sheet of paper with a drawing on it. You press your stamp down on one corner of the drawing, and it tells you how much the stamp's pattern matches what is underneath. Then you slide the stamp one step to the right and check again. You keep sliding and checking across the entire drawing, writing down the match score each time. When you are done, you have a new picture made entirely of match scores. That new picture is the convolution.
In real life, this is how computers look at photos. They use tiny stamps (called kernels or filters) to scan across an image. One stamp might look for edges, another for corners, another for colors. By scanning with many different stamps, the computer builds up an understanding of what is in the image, like recognizing a cat, a car, or a face.
The key idea is simple: slide a small pattern over a big input, multiply and add up what overlaps at each position, and record the result. That is convolution.
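For readers who prefer code, the stamp-and-slide idea can be written directly as a pair of nested loops. The sketch below uses NumPy with arbitrary example sizes; it is meant to illustrate the idea, not to be an efficient implementation.

```python
import numpy as np

def match_scores(image, stamp):
    """Slide `stamp` over `image`; at each position, multiply the
    overlapping values and add them up."""
    H, W = image.shape
    kh, kw = stamp.shape
    scores = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(scores.shape[0]):
        for j in range(scores.shape[1]):
            scores[i, j] = np.sum(image[i:i + kh, j:j + kw] * stamp)
    return scores

drawing = np.random.rand(6, 6)                  # the "big drawing"
stamp = np.array([[1.0, 0.0], [0.0, 1.0]])      # the "rubber stamp"
print(match_scores(drawing, stamp).shape)       # (5, 5) grid of match scores
```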
The convolution integral first appeared in the work of Jean le Rond d'Alembert in 1754, within his derivation of Taylor's theorem in Recherches sur différents points importants du système du monde. Sylvestre François Lacroix used similar expressions in his treatise published between 1797 and 1800. Vito Volterra examined composition products, a closely related concept, in his 1913 work on integral equations.
The term "convolution" itself did not come into wide use until the 1950s or 1960s. Before that, the operation went by several names in different communities: Faltung (meaning "folding" in German), composition product, superposition integral, and Carson's integral.
In electrical engineering, convolution became central to the theory of linear time-invariant systems during the mid-20th century, providing the mathematical link between a system's impulse response and its output for any given input. The development of the Fast Fourier Transform (FFT) algorithm by Cooley and Tukey in 1965 made convolution computationally practical for large signals by exploiting the convolution theorem.
Convolution entered the field of neural networks through the work of Kunihiko Fukushima, whose Neocognitron (1980) used a convolution-like operation for visual pattern recognition. Yann LeCun and colleagues formalized the modern convolutional neural network in 1989 with LeNet, applying backpropagation to train convolutional filters for handwritten digit recognition. The landmark AlexNet architecture (Krizhevsky et al., 2012) demonstrated that deep convolutional networks trained on GPUs could dramatically outperform traditional methods on large-scale image recognition, launching the deep learning revolution.
The convolution of two continuous functions, f and g, is mathematically denoted as (f * g) and is defined by the integral:
(f * g)(t) = integral from negative infinity to positive infinity of f(tau) * g(t - tau) d(tau)
This integral computes the area of overlap between f and a reversed, shifted version of g at each point t. In signal processing and mathematics, this operation captures how the shape of one function is modified by another. The reversal of g (replacing tau with t - tau) is what distinguishes convolution from cross-correlation.
In the context of discrete data like images or digital signals, the continuous integral is replaced by a sum:
(f * g)[n] = sum over m of f[m] * g[n - m]
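As a quick check of the discrete definition, the following sketch evaluates the sum directly and compares the result with NumPy's built-in np.convolve; the example sequences are arbitrary.

```python
import numpy as np

def conv1d_direct(f, g):
    """Discrete convolution (f * g)[n] = sum_m f[m] * g[n - m],
    evaluated directly from the definition ('full' output length)."""
    N, M = len(f), len(g)
    out = np.zeros(N + M - 1)
    for n in range(len(out)):
        for m in range(N):
            if 0 <= n - m < M:
                out[n] += f[m] * g[n - m]
    return out

f = np.array([1.0, 2.0, 3.0])
g = np.array([0.0, 1.0, 0.5])
print(conv1d_direct(f, g))   # direct evaluation of the sum
print(np.convolve(f, g))     # NumPy's built-in convolution agrees
```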
For two finite sequences of length N, computing this sum directly requires O(N^2) operations. For images, the convolution operation can be represented as a double summation over a two-dimensional kernel, K, applied to an input image, I:
(I * K)(x, y) = sum over m, sum over n of I(m, n) * K(x - m, y - n)
The kernel is a smaller matrix compared to the input image, and it is used to extract features by sliding over the input image, performing element-wise multiplication and summation of the overlapping elements.
Although the operation used in deep learning is commonly called "convolution," most frameworks actually implement cross-correlation rather than true mathematical convolution. The difference is subtle but important: true convolution flips the kernel (rotates it by 180 degrees) before sliding it across the input, while cross-correlation does not flip the kernel.
The 2D cross-correlation formula is:
(X star K)(i, j) = sum over a, sum over b of X(i + a, j + b) * K(a, b)
Notice that in cross-correlation, the indices of the input use addition (i + a, j + b), while in true convolution they use subtraction (x - m, y - n). Because the kernel weights are learned during backpropagation rather than hand-designed, the distinction between convolution and cross-correlation has no practical consequence. A learned kernel and its flipped version simply represent equivalent solutions. For this reason, deep learning literature and frameworks (including PyTorch and TensorFlow) refer to the operation as "convolution" even though the implementation is technically cross-correlation.
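The sketch below (assuming NumPy and SciPy are available; the arrays are arbitrary examples) makes the relationship concrete: cross-correlating with a kernel that has been rotated 180 degrees reproduces true convolution.

```python
import numpy as np
from scipy.signal import convolve2d

def cross_correlate2d(x, k):
    """'Valid' cross-correlation: slide k over x without flipping it."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

x = np.random.rand(5, 5)
k = np.random.rand(3, 3)

# True convolution equals cross-correlation with the kernel rotated 180 degrees.
true_conv = convolve2d(x, k, mode="valid")
corr_with_flipped_kernel = cross_correlate2d(x, np.flip(k))

print(np.allclose(true_conv, corr_with_flipped_kernel))  # True
```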
Kernels or filters are pivotal in the convolution process, as they are responsible for extracting specific features from the input data. Common kernel examples include edge detection, blur, and sharpen filters. These kernels are designed to emphasize or suppress certain properties in the input data, allowing the machine learning model to learn more effectively from the transformed data.
In deep learning, the kernels are usually initialized with random values and learned by the model during the training process. As the model learns, the kernels become more specialized in extracting useful features for the specific task at hand.
Convolution satisfies several important algebraic properties that have both theoretical and practical significance.
| Property | Formula | Practical Implication |
|---|---|---|
| Commutativity | f * g = g * f | The order of operands does not matter; a signal convolved with a filter gives the same result as the filter convolved with the signal |
| Associativity | f * (g * h) = (f * g) * h | A cascade of filters can be pre-combined into a single equivalent filter |
| Distributivity | f * (g + h) = (f * g) + (f * h) | Convolution distributes over addition, enabling parallel filter application |
| Scalar multiplication | a(f * g) = (af) * g | Scaling can be applied before or after convolution |
| Identity | f * delta = f | Convolving any function with the Dirac delta (or Kronecker delta for discrete signals) returns the original function unchanged |
Commutativity is proven by a simple substitution of variables in the convolution integral. In practice, it means we can interpret a system's effect on a signal equivalently as the signal's effect on the system's impulse response.
Associativity is particularly useful in signal processing: if a signal passes through two filters in series with impulse responses g and h, the combined effect is the same as a single filter whose impulse response is (g * h). This allows engineers to analyze complex multi-stage systems as a single equivalent system.
Distributivity means that if two signals are added together and then convolved with a filter, the result is the same as convolving each signal separately and then adding the results. This property underpins the principle of superposition in linear systems theory.
Note that convolution does not possess a general multiplicative identity in the space of integrable functions. However, the Dirac delta function serves as an identity element: convolving any function with the delta function returns the original function.
The convolution theorem is one of the most important results connecting convolution with the Fourier transform. It states that convolution in the spatial (or time) domain is equivalent to element-wise multiplication in the frequency domain:
F{f * g} = F{f} . F{g}
where F denotes the Fourier transform. The converse also holds: multiplication in the spatial domain corresponds to convolution in the frequency domain:
F{f . g} = F{f} * F{g}
This theorem extends to other transforms as well, including the Laplace transform, the Z-transform, and the Mellin transform.
The convolution theorem has profound practical implications: a large convolution can be computed by transforming both operands to the frequency domain, multiplying them element-wise, and transforming the product back, which underlies the FFT-based implementations discussed later in this article.
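The theorem is easy to verify numerically. The sketch below (NumPy, arbitrary example signals) computes a linear convolution directly and again via the FFT, zero-padding both signals to the full output length.

```python
import numpy as np

f = np.random.rand(64)
g = np.random.rand(16)

# Direct ("full") convolution in the time domain.
direct = np.convolve(f, g)

# Same result via the convolution theorem: multiply FFTs, then invert.
n = len(f) + len(g) - 1                      # full output length
freq = np.fft.rfft(f, n) * np.fft.rfft(g, n) # element-wise product in frequency domain
via_fft = np.fft.irfft(freq, n)

print(np.allclose(direct, via_fft))          # True (up to floating-point error)
```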
In signal processing, convolution is the primary tool for describing how linear time-invariant (LTI) systems process inputs. If a system has impulse response h(t), then for any input x(t), the output y(t) is:
y(t) = (x * h)(t) = integral from negative infinity to positive infinity of x(tau) * h(t - tau) d(tau)
This relationship is fundamental to fields including audio engineering (reverb is convolution with a room's impulse response), telecommunications (channel equalization), radar (matched filtering), and medical imaging (CT reconstruction). Digital filters, both FIR (finite impulse response) and IIR (infinite impulse response), are implemented as discrete convolutions of the input signal with the filter coefficients.
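A minimal FIR example (NumPy, arbitrary test signal): the "system" is a 5-tap moving average, and its output is simply the convolution of the input with that impulse response.

```python
import numpy as np

# A 5-tap moving-average FIR filter: its impulse response is 5 equal weights.
h = np.ones(5) / 5.0

# A noisy input signal x(t).
t = np.linspace(0.0, 1.0, 200)
x = np.sin(2 * np.pi * 5 * t) + 0.3 * np.random.randn(t.size)

# The LTI system's output is the convolution of the input with h.
y = np.convolve(x, h, mode="same")  # smoothed version of x
```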
Convolutional operations exist in 1D, 2D, and 3D variants, each suited to data with different structural properties.
1D convolution slides the kernel along a single axis. It is used extensively for sequential and temporal data, with key applications in audio and speech processing, text and other token sequences, and time series analysis.
The 1D convolution formula for an input signal x of length L and a kernel k of size K is:
y[n] = sum from m=0 to K-1 of x[n + m] * k[m]
The output length depends on the padding and stride settings (discussed below).
2D convolution is the most widely used form and the default when researchers mention "convolution" in the context of deep learning. The kernel slides along two spatial axes (height and width) and is the backbone of all major image recognition architectures, including AlexNet, VGGNet, ResNet, and Inception.
A 2D convolution with input of shape (C_in, H, W) and a kernel of shape (C_in, K_h, K_w) produces a single 2D output feature map. Multiple kernels produce multiple output channels, resulting in an output of shape (C_out, H_out, W_out).
3D convolution extends the operation to three spatial or spatiotemporal axes. The kernel slides along height, width, and a third dimension (often depth or time). Primary applications include video understanding, where the third axis is time, and volumetric data such as CT and MRI scans, where it is depth.
3D convolutions are considerably more expensive than their 2D counterparts because the kernel must cover an additional spatial dimension, increasing both the number of parameters and the number of multiply-accumulate operations.
Several hyperparameters control the behavior and output dimensions of a convolutional layer.
The kernel size determines the spatial extent of the region the filter examines at each position. Common choices include:
| Kernel Size | Typical Use | Notes |
|---|---|---|
| 1x1 | Channel mixing, dimensionality reduction | Introduced in Network in Network (Lin et al., 2013) |
| 3x3 | General-purpose feature extraction | Most popular choice since VGGNet (Simonyan & Zisserman, 2014) |
| 5x5 | Wider receptive field per layer | Used in early Inception modules |
| 7x7 | First layer of deep networks | Used in ResNet and Inception v1 for initial downsampling |
Smaller kernels (3x3) stacked in multiple layers can achieve the same receptive field as a single large kernel while using fewer parameters. Two stacked 3x3 convolutions cover the same 5x5 receptive field with 2 * (3 * 3) = 18 weights instead of 25, and three stacked 3x3 convolutions cover a 7x7 field with 27 weights instead of 49.
Stride specifies the step size by which the kernel moves across the input. A stride of 1 moves the filter one pixel at a time, producing an output whose spatial dimensions are close to those of the input (and identical to them with "same" padding). A stride of 2 moves the filter two pixels at a time, roughly halving the spatial dimensions. Increasing the stride reduces the output resolution and computational cost, and it is often used as an alternative to pooling layers for spatial downsampling.
The output dimension for one axis is computed as:
O = floor((W - K + 2P) / S) + 1
where W is the input size, K is the kernel size, P is the padding, and S is the stride.
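This formula is easy to check against a framework's behavior. The sketch below assumes PyTorch is available and uses arbitrary example sizes.

```python
import torch
import torch.nn as nn

def conv_output_size(W, K, P, S):
    """O = floor((W - K + 2P) / S) + 1, per the formula above."""
    return (W - K + 2 * P) // S + 1

W, K, P, S = 224, 3, 1, 2
layer = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=K,
                  stride=S, padding=P)
x = torch.randn(1, 3, W, W)

print(conv_output_size(W, K, P, S))  # 112
print(layer(x).shape)                # torch.Size([1, 64, 112, 112])
```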
Padding adds extra values (typically zeros) around the border of the input before the convolution is performed. This is necessary to control the spatial dimensions of the output.
| Padding Type | Description | Output Size (stride = 1) |
|---|---|---|
| Valid (no padding) | No padding is added; the kernel only visits positions where it fully overlaps the input | Smaller than input: (W - K + 1) |
| Same (zero padding) | Enough zeros are added so the output has the same spatial dimensions as the input | Equal to input: W |
| Full padding | Enough zeros are added so every element of the input participates in every possible kernel position | Larger than input: (W + K - 1) |
"Same" padding is the most common choice in modern architectures because it simplifies dimension calculations when stacking many layers.
Dilation inserts gaps (zeros) between kernel elements, effectively expanding the receptive field without increasing the number of parameters. A dilation rate of r spreads a kernel of size K so its effective size becomes:
K_eff = K + (K - 1)(r - 1)
For example, a 3x3 kernel with dilation rate 2 has an effective receptive field of 5x5, and with dilation rate 3 it covers 7x7, all while maintaining only 9 learnable weights.
Dilated convolutions were originally developed for efficient wavelet decomposition in signal processing. In deep learning, they became prominent through semantic segmentation architectures such as DeepLab, which uses dilated (atrous) convolutions to capture multi-scale context, and WaveNet, which stacks dilated convolutions to model long-range dependencies in raw audio.
A feature map is the output produced by applying a single convolutional filter to the input. Each filter detects a specific pattern, and the collection of all feature maps from a convolutional layer forms a 3D volume with dimensions (C_out, H_out, W_out), where C_out is the number of output channels (one per filter).
The input to a convolutional layer also has multiple channels. For a color image, the input has 3 channels (red, green, blue). For deeper layers, the number of input channels equals the number of filters in the preceding convolutional layer. Each filter in a convolutional layer spans all input channels: a filter in a layer with C_in input channels has shape (C_in, K_h, K_w), and the total number of parameters for one filter (excluding bias) is C_in * K_h * K_w.
The total number of parameters in a convolutional layer with C_out filters is:
Parameters = C_out * (C_in * K_h * K_w + 1)
where the +1 accounts for the bias term per filter.
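The count can be confirmed against a framework layer. The sketch below assumes PyTorch and uses arbitrary example channel counts.

```python
import torch.nn as nn

C_in, C_out, K_h, K_w = 64, 128, 3, 3
layer = nn.Conv2d(C_in, C_out, kernel_size=(K_h, K_w))

formula = C_out * (C_in * K_h * K_w + 1)              # 128 * (64*9 + 1) = 73,856
actual = sum(p.numel() for p in layer.parameters())   # weights + biases
print(formula, actual, formula == actual)             # 73856 73856 True
```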
Convolutional layers exploit two properties that make them far more efficient than fully connected layers: parameter sharing and local connectivity (also called sparse interactions).
In a convolutional layer, the same filter weights are applied at every spatial position of the input. This means a single set of K_h * K_w * C_in weights is reused across the entire input, regardless of the input size. Parameter sharing encodes the assumption of translation equivariance: a pattern that is useful to detect in one part of the input is equally useful in other parts. For example, an edge detector should work the same way whether the edge appears in the top-left or bottom-right of the image.
Parameter sharing reduces the number of learnable weights by several orders of magnitude compared to a fully connected layer with the same input and output sizes. For an input image of 224x224x3 mapped to a feature map of 224x224x64, a fully connected layer would require roughly 224 * 224 * 3 * 224 * 224 * 64 weights, which is approximately 4.8 * 10^11, or about 483 billion parameters. A convolutional layer with 64 filters of size 3x3 requires only 64 * (3 * 3 * 3 + 1) = 1,792 parameters.
Each output neuron in a convolutional layer is connected to only a small, local region of the input (determined by the kernel size), rather than to every input element. This sparse connectivity reduces computation and also provides an inductive bias well suited to data with local spatial structure, such as images and audio. Deep stacking of locally connected layers allows the network to build up large receptive fields gradually: a neuron in a deep layer indirectly depends on a large region of the original input, even though each individual layer examines only a small neighborhood.
Standard convolution applies each filter across all input channels simultaneously. Given an input with C_in channels and C_out filters of size K x K, the computational cost (in multiply-accumulate operations) for producing one output feature map of spatial size H_out x W_out is:
FLOPs = H_out * W_out * C_in * C_out * K * K
Standard convolution is the most expressive form because every output channel can learn arbitrary combinations of all input channels. However, it is also the most expensive, which has motivated the development of lighter alternatives.
Depthwise separable convolution, popularized by MobileNet (Howard et al., 2017) and also used in Xception (Chollet, 2017), factorizes a standard convolution into two steps: a depthwise convolution, which applies a single K x K filter to each input channel independently, followed by a pointwise (1x1) convolution, which combines the resulting channels into the C_out output channels.
The total cost of depthwise separable convolution is:
FLOPs = H_out * W_out * C_in * (K * K + C_out)
The reduction factor compared to standard convolution is:
Reduction = 1/C_out + 1/K^2
For a common configuration with K = 3 and C_out = 256, this yields a reduction factor of approximately 1/256 + 1/9, which is roughly 0.115, meaning depthwise separable convolution uses only about 11-12% of the computation of standard convolution.
MobileNet V1 demonstrated that depthwise separable convolutions could achieve accuracy comparable to standard architectures while being 8-9 times faster, making them suitable for mobile and edge devices. MobileNet V2 refined this with inverted residual blocks and linear bottlenecks.
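The cost formulas above can be compared directly in a few lines of Python; the layer sizes below are arbitrary examples.

```python
def standard_conv_flops(H_out, W_out, C_in, C_out, K):
    return H_out * W_out * C_in * C_out * K * K

def depthwise_separable_flops(H_out, W_out, C_in, C_out, K):
    return H_out * W_out * C_in * (K * K + C_out)

H_out = W_out = 56
C_in, C_out, K = 128, 256, 3

std = standard_conv_flops(H_out, W_out, C_in, C_out, K)
sep = depthwise_separable_flops(H_out, W_out, C_in, C_out, K)

print(sep / std)              # ~0.115, i.e. about 11-12% of the standard cost
print(1 / C_out + 1 / K**2)   # matches the reduction factor 1/C_out + 1/K^2
```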
Grouped convolution divides the input channels into G non-overlapping groups and performs independent convolutions within each group. It was first used in AlexNet (Krizhevsky et al., 2012) for a practical reason: the model was split across two GPUs, and each GPU processed half the channels. Later work showed that grouping also provides a useful regularization effect.
The computational cost of grouped convolution is:
FLOPs = H_out * W_out * (C_in / G) * (C_out / G) * K * K * G = H_out * W_out * C_in * C_out * K * K / G
This represents a 1/G reduction compared to standard convolution. ResNeXt (Xie et al., 2017) demonstrated that increasing the number of groups (called "cardinality") while keeping the total number of parameters fixed improved accuracy over deeper or wider ResNets. ShuffleNet (Zhang et al., 2018) combined grouped convolutions with channel shuffling to allow information flow between groups.
Note that depthwise convolution is a special case of grouped convolution where G = C_in (each channel is its own group).
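In PyTorch, grouped and depthwise convolutions are exposed through the `groups` argument of `nn.Conv2d`. The sketch below (arbitrary channel counts, biases omitted for clarity) shows the resulting parameter reduction.

```python
import torch.nn as nn

C_in, C_out, K = 64, 128, 3

standard  = nn.Conv2d(C_in, C_out, K, bias=False)
grouped   = nn.Conv2d(C_in, C_out, K, groups=4, bias=False)
depthwise = nn.Conv2d(C_in, C_in, K, groups=C_in, bias=False)  # G = C_in

def n_params(m):
    return sum(p.numel() for p in m.parameters())

print(n_params(standard))   # 64 * 128 * 9 = 73,728
print(n_params(grouped))    # 73,728 / 4   = 18,432
print(n_params(depthwise))  # 64 * 9       = 576 (one 3x3 filter per channel)
```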
Deformable convolution, introduced by Dai et al. (2017) at ICCV, augments the standard fixed grid sampling pattern with learnable 2D offsets. In a standard 3x3 convolution, the nine sampling locations form a regular grid. In deformable convolution, each of these locations is displaced by an offset (delta_x, delta_y) that is predicted by an auxiliary convolutional layer applied to the preceding feature map.
The output at position p is:
y(p) = sum over k of w_k * x(p + p_k + delta_p_k)
where p_k are the fixed grid offsets and delta_p_k are the learned offsets. Since the offsets are typically fractional, bilinear interpolation is used to sample the input at non-integer locations.
Deformable convolution has three key properties: the sampling grid adapts to the content of the input rather than being fixed, the offsets are learned end-to-end together with the regular kernel weights, and the additional cost is limited to the small auxiliary layer that predicts the offsets.
Deformable convolution has proven particularly effective for object detection and semantic segmentation tasks where objects appear at varying scales, aspect ratios, and orientations.
Dilated convolution inserts zeros between kernel elements to expand the receptive field without increasing the parameter count. The dilation rate r controls the spacing. See the Dilation section above for the mathematical details and applications in DeepLab and WaveNet.
Dilated convolutions are especially valuable in dense prediction tasks (semantic segmentation, depth estimation) where maintaining high spatial resolution is important. They provide an alternative to encoder-decoder architectures by keeping feature maps at full resolution while still capturing long-range context.
Transposed convolution (sometimes called "deconvolution" or "fractionally strided convolution") performs spatial upsampling with learnable parameters. It is the approximate inverse of a strided convolution: while a strided convolution reduces spatial dimensions, a transposed convolution increases them.
The operation works by inserting zeros between input elements (effectively applying a fractional stride) and then performing a standard convolution. For an input of size H_in and a transposed convolution with kernel size K, stride S, and padding P, the output size is:
H_out = (H_in - 1) * S - 2P + K
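The output-size formula can be checked against a framework's transposed convolution; the sketch below assumes PyTorch and uses arbitrary example sizes.

```python
import torch
import torch.nn as nn

H_in, K, S, P = 16, 4, 2, 1
layer = nn.ConvTranspose2d(in_channels=8, out_channels=8,
                           kernel_size=K, stride=S, padding=P)
x = torch.randn(1, 8, H_in, H_in)

formula = (H_in - 1) * S - 2 * P + K   # (16 - 1) * 2 - 2 + 4 = 32
print(formula)                          # 32
print(layer(x).shape)                   # torch.Size([1, 8, 32, 32])
```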
It is important to note that transposed convolution is not a true mathematical inverse (deconvolution). A true deconvolution would recover the exact original input from a convolution's output, which is generally an ill-posed problem. Transposed convolution only reverses the spatial dimensions while performing a learnable upsampling, not a value-level inversion.
Transposed convolutions are widely used in semantic segmentation decoders such as FCN, in generative models such as DCGAN, and more generally in encoder-decoder architectures that must map low-resolution feature maps back to a higher resolution.
One known issue with transposed convolutions is the "checkerboard artifact," where overlapping kernel regions during upsampling create a grid pattern in the output. This can be mitigated by choosing kernel sizes divisible by the stride or by using alternative upsampling strategies such as nearest-neighbor interpolation followed by a standard convolution.
1x1 convolution, first proposed in the Network in Network paper (Lin et al., 2013), applies a 1x1 spatial filter across all input channels. Despite having no spatial extent, it serves several important functions: it reduces or expands the number of channels cheaply, it mixes information across channels at each spatial position, and, when followed by an activation function, it adds non-linearity without enlarging the receptive field.
The computational cost of a 1x1 convolution is H * W * C_in * C_out, with no dependence on the spatial kernel size. This makes it one of the cheapest convolution operations per parameter.
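A short PyTorch sketch (arbitrary sizes) showing a 1x1 convolution used to reduce the channel count while leaving the spatial dimensions untouched:

```python
import torch
import torch.nn as nn

# Reduce 256 channels to 64 with a 1x1 convolution (channel mixing only).
reduce = nn.Conv2d(256, 64, kernel_size=1)
x = torch.randn(1, 256, 56, 56)

print(reduce(x).shape)                               # torch.Size([1, 64, 56, 56])
print(sum(p.numel() for p in reduce.parameters()))   # 256*64 + 64 = 16,448 parameters
```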
The following table summarizes the key convolution variants discussed above.
| Type | Parameters (per layer) | FLOPs (relative to standard) | Receptive Field | Key Advantage | Representative Architecture |
|---|---|---|---|---|---|
| Standard | C_out * C_in * K^2 | 1x (baseline) | K x K per layer | Full cross-channel interaction | VGGNet, ResNet |
| Depthwise Separable | C_in * K^2 + C_in * C_out | ~1/K^2 + 1/C_out (approx. 0.11x for K=3) | K x K per layer | Major computation savings | MobileNet, Xception |
| Grouped (G groups) | C_out * C_in * K^2 / G | 1/G | K x K per layer | Tunable efficiency-accuracy tradeoff | AlexNet, ResNeXt |
| Dilated (rate r) | C_out * C_in * K^2 | Same as standard (same params) | K + (K-1)(r-1) per layer | Larger receptive field, no extra params | DeepLab, WaveNet |
| Deformable | C_out * C_in * K^2 + offset params | ~1x + small offset overhead | Adaptive (input-dependent) | Content-adaptive sampling | DCN (Dai et al., 2017) |
| Transposed | C_in * C_out * K^2 | Depends on output size | N/A (upsampling operation) | Learnable upsampling | FCN, DCGAN |
| 1x1 | C_out * C_in | 1/K^2 of a K x K standard convolution | 1x1 (no spatial extent) | Channel mixing, cheap dimensionality change | Network in Network, Inception |
While the sliding-window description of convolution is intuitive, deep learning frameworks employ several optimized strategies to implement convolution at high speed.
The most common approach is the im2col transformation, which converts the convolution operation into a single large matrix multiplication (GEMM: General Matrix-Matrix Multiplication) by rearranging the input data: every K x K x C_in patch of the input is flattened into one column of a large matrix, the filters are flattened into the rows of a second matrix, and their product yields all output feature maps at once.
Matrix multiplication is one of the most heavily optimized operations in computing. Highly tuned BLAS (Basic Linear Algebra Subprograms) libraries such as cuBLAS (for NVIDIA GPUs), oneDNN (for Intel CPUs), and Apple's Accelerate framework provide near-peak-performance GEMM implementations. The drawback of im2col is memory overhead: it duplicates input data because overlapping patches share elements. For an input of size (C_in, H, W) with a K x K kernel, the expanded matrix requires O(K^2 * C_in * H_out * W_out) memory, which can be several times larger than the original input.
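A toy single-channel, stride-1 im2col in NumPy makes the idea concrete; this is a sketch for illustration, not an optimized implementation.

```python
import numpy as np

def im2col(x, K):
    """Rearrange all K x K patches of a single-channel image into the
    columns of a matrix (stride 1, no padding)."""
    H, W = x.shape
    H_out, W_out = H - K + 1, W - K + 1
    cols = np.empty((K * K, H_out * W_out))
    for i in range(H_out):
        for j in range(W_out):
            cols[:, i * W_out + j] = x[i:i + K, j:j + K].ravel()
    return cols

x = np.random.rand(6, 6)
k = np.random.rand(3, 3)

# The sliding-window operation becomes one matrix multiplication.
out = k.ravel() @ im2col(x, 3)     # shape (16,)
print(out.reshape(4, 4).shape)     # (4, 4) output feature map
```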
Winograd's minimal filtering algorithms reduce the number of multiplications needed for small convolutions. For a 3x3 kernel applied to a 4x4 input tile, the Winograd F(2x2, 3x3) algorithm reduces the multiplications from 36 (direct) to 16, a 2.25x savings. Lavin and Gray (2016) demonstrated speedups of up to 4x for 3x3 convolutions in CNNs.
The algorithm works by transforming both the input tile and the kernel into a special domain (using pre-computed transformation matrices), performing element-wise multiplication there, and transforming back. This is analogous to the convolution theorem but uses transforms tailored to small discrete filters rather than the Fourier transform.
A significant limitation is numerical precision. The transformation matrices grow in magnitude for larger tile sizes, causing floating-point errors to accumulate. This makes Winograd convolution less suitable for large kernels or for low-precision (e.g., INT8) computation, though recent research has made progress on quantized Winograd implementations.
The convolution theorem enables an alternative implementation using the Fast Fourier Transform: the input and the (zero-padded) kernel are both transformed to the frequency domain, multiplied element-wise, and the product is transformed back with the inverse FFT.
Direct spatial convolution of an input of size N x N with a kernel of size K x K costs O(N^2 * K^2) operations. FFT-based convolution costs O(N^2 * log N) for the transforms plus O(N^2) for the element-wise multiplication. Mathieu, Henaff, and LeCun (2014) demonstrated that FFT-based convolution could accelerate CNN training significantly compared to spatial implementations.
In modern practice, FFT-based convolution is less commonly used for standard CNN architectures because most popular architectures use small 3x3 kernels where the overhead of FFT (computing transforms, padding, managing complex arithmetic) outweighs the theoretical benefit. Winograd-based algorithms offer better performance for small kernels without requiring frequency-domain transforms. However, FFT-based convolution remains relevant for applications involving large kernels (such as global convolutions in architectures like ConvNeXt V2) and in traditional signal processing where kernels may span hundreds or thousands of elements.
| Method | Description | Best For |
|---|---|---|
| Implicit GEMM | Dynamically computes the im2col layout on the fly during GEMM, avoiding explicit data rearrangement | GPU and accelerator libraries (NVIDIA cuDNN, Google TPU) |
| Indirect convolution | Replaces data duplication with a pointer-based indirection buffer (Dukhan, 2019) | Memory-constrained environments |
| Direct convolution | Naive nested loops; simple but slow | Very small inputs, educational use |
Convolutional neural networks (CNNs) are a class of deep learning models that exploit the convolution operation for processing grid-like data, such as images, videos, and spectrograms. CNNs have demonstrated exceptional performance in tasks like image classification, object detection, and semantic segmentation.
A typical CNN architecture consists of several layers, including convolutional layers, activation functions, pooling layers, and fully connected layers. The convolutional layers perform the actual convolution operation, while the activation functions introduce non-linearity to the model, allowing it to learn complex patterns. Pooling layers are used to reduce the spatial dimensions of the feature maps, and fully connected layers serve as a final classifier or regressor.
Modern CNN architectures have evolved to incorporate many of the convolution variants described above. EfficientNet uses depthwise separable convolutions with compound scaling. ConvNeXt revisits pure convolutional designs using large depthwise kernels and modern training techniques, achieving performance competitive with Vision Transformers. The ongoing development of new convolution types and architectures continues to expand the capabilities and efficiency of convolutional networks.