The convolutional operation is a mathematical procedure that combines two functions to produce a third function expressing how the shape of one is modified by the other. In the context of convolutional neural networks (CNNs), the operation applies learned filters (also called kernels) to input data, enabling the automatic extraction of spatial features such as edges, textures, and patterns. The convolutional operation forms the computational backbone of nearly all modern computer vision systems, and its variants have been extended to audio, text, and volumetric data.
Imagine you have a big picture made of tiny colored squares. You also have a small stamp with a pattern on it. You place the stamp on the top-left corner of the picture, look at the squares underneath, do some simple math (multiply and add), and write down a single number. Then you slide the stamp one square to the right and repeat. When you reach the end of a row, you move down and start again. When you finish, all those numbers you wrote down form a new, smaller picture. That new picture highlights the parts of the original that match the stamp's pattern, like edges or certain shapes. That sliding-and-multiplying process is the convolutional operation, and the stamp is the filter (or kernel). In a neural network, the computer learns the best stamp patterns on its own by looking at thousands of examples.
The mathematical concept of convolution dates back to the 18th century, with roots in the work of Euler and Laplace on integral transforms. In modern applied mathematics, convolution became a standard tool in signal processing and linear systems theory during the 20th century.
The application of convolution-like operations to artificial neural networks began with Kunihiko Fukushima's Neocognitron in 1980. Fukushima was inspired by the discoveries of David Hubel and Torsten Wiesel, who showed in the early 1960s that neurons in the cat visual cortex respond to oriented edges within localized receptive fields. The Neocognitron introduced a hierarchical architecture with alternating layers of simple and complex cells, anticipating many structural features of today's CNNs. However, the Neocognitron was trained with an unsupervised, self-organizing learning rule rather than with backpropagation.
In 1989, Yann LeCun and colleagues at Bell Labs demonstrated that a convolutional neural network trained with backpropagation could recognize handwritten zip codes. This work led to LeNet-5 (1998), the architecture that established CNNs as practical tools for image recognition. The convolutional operation as used in CNNs, with learned filters, weight sharing, and gradient-based training, traces directly to LeCun's contributions.
The field remained relatively quiet until 2012, when AlexNet won the ImageNet Large Scale Visual Recognition Challenge by a wide margin, demonstrating that deep convolutional networks trained on GPUs could achieve unprecedented accuracy on large-scale image classification. This result triggered the modern deep learning era and led to rapid development of new convolution variants.
For two continuous functions f and g, the convolution is defined as:
(f * g)(t) = ∫ f(τ) g(t − τ) dτ
The integral runs over all values of τ (from −∞ to +∞). The key characteristic is that one of the functions is flipped (reversed) before the integration. This flipping distinguishes convolution from the closely related cross-correlation operation.
For discrete signals, the continuous integral is replaced with a summation:
(f * g)[n] = ∑_k f[k] · g[n − k]
This form is more directly applicable to digital signals and images, where data is represented as sequences or grids of discrete values.
In image processing, both the input and the kernel are two-dimensional. The 2D discrete convolution of an input X with a kernel K is:
(X * K)[i, j] = ∑_m ∑_n X[m, n] · K[i − m, j − n]
Here, the kernel K is flipped both horizontally and vertically before being slid across the input. The output at each position (i, j) is the sum of element-wise products between the flipped kernel and the overlapping region of the input.
In practice, most machine learning frameworks (PyTorch, TensorFlow, JAX) implement cross-correlation rather than true mathematical convolution. The 2D cross-correlation formula is:
(X ☆ K)[i, j] = ∑_m ∑_n X[i + m, j + n] · K[m, n]
The only difference from convolution is the absence of the kernel flip. In cross-correlation, the kernel slides over the input without being reversed.
This distinction is practically irrelevant in neural networks because the kernel weights are learned from data during training. If the network needed a flipped version of some filter, backpropagation would simply learn the flipped weights directly. As a result, CNN implementations universally use cross-correlation for computational simplicity but call the operation "convolution" by convention.
| Property | True convolution | Cross-correlation (used in CNNs) |
|---|---|---|
| Kernel flipped | Yes (180-degree rotation) | No |
| Mathematical notation | f * g | f ☆ g |
| Commutativity | Commutative (f * g = g * f) | Not commutative |
| Associativity | Associative | Not associative |
| Used in ML frameworks | Rarely | Almost universally |
| Effect on learned weights | Weights learned in flipped orientation | Weights learned directly |
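The distinction is easiest to see in code. Below is a minimal single-channel NumPy sketch (stride 1, no padding; the function names are illustrative, not from any framework): cross-correlation slides the kernel as-is, while true convolution first rotates it by 180 degrees.

```python
import numpy as np

def cross_correlate2d(x, k):
    """Slide the kernel over the input without flipping it (what CNN layers compute)."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def convolve2d(x, k):
    """True convolution: the same sliding sum, but with the kernel rotated 180 degrees."""
    return cross_correlate2d(x, np.flip(k))

x = np.arange(25, dtype=float).reshape(5, 5)
k = np.array([[1.0, 0.0, -1.0]] * 3)      # a simple vertical-edge kernel
print(cross_correlate2d(x, k))
print(convolve2d(x, k))                   # differs, since this kernel is not symmetric under rotation
```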
Several hyperparameters control how the convolutional operation is applied and determine the size and properties of the output.
The kernel size defines the spatial extent of the filter, typically expressed as height x width. Common choices include 1x1, 3x3, 5x5, and 7x7. Smaller kernels (especially 3x3) have become the standard in modern architectures because stacking multiple small kernels achieves the same receptive field as a single large kernel while using fewer parameters and introducing more nonlinearity through additional activation functions.
Stride specifies the number of positions the kernel moves between consecutive applications. A stride of 1 means the kernel moves one pixel at a time, producing a dense output. A stride of 2 moves two pixels at a time, reducing each spatial dimension by roughly half. Larger strides downsample the output, which reduces computation and increases the effective receptive field.
Padding adds extra values (usually zeros) around the border of the input before convolution. Without padding, each convolutional layer shrinks the spatial dimensions, and information at the edges of the input is underrepresented.
| Padding type | Description | Output size (for stride = 1) |
|---|---|---|
| Valid (no padding) | No padding applied; output is smaller than input | (n - k + 1) x (n - k + 1) |
| Same padding | Padding chosen so output has same dimensions as input | n x n |
| Full padding | Maximum padding (k - 1 on each side); every element of the input participates in every possible kernel position | (n + k - 1) x (n + k - 1) |
In the table above, n represents the input dimension and k represents the kernel dimension. Most modern architectures use same padding with 3x3 kernels (i.e., 1 pixel of zero padding on each side).
Each filter produces one output channel (also called a feature map). A convolutional layer typically uses multiple filters in parallel, each learning to detect a different pattern. For example, a layer with 64 filters applied to an input produces an output with 64 channels.
Dilation inserts gaps between kernel elements, expanding the receptive field without increasing the number of parameters. A dilation rate of 1 is a standard convolution; a dilation rate of 2 inserts one gap between each kernel element. See the section on dilated convolution below for details.
The spatial dimensions of the output feature map are determined by a standard formula. Given an input of size I, a kernel of size K, padding P, stride S, and dilation rate D, the output size O along one dimension is:
O = floor((I + 2P - D(K - 1) - 1) / S) + 1
For the common case with dilation rate 1, this simplifies to:
O = floor((I - K + 2P) / S) + 1
| Example configuration | Input size | Kernel | Padding | Stride | Dilation | Output size |
|---|---|---|---|---|---|---|
| Standard 3x3 with same padding | 32 | 3 | 1 | 1 | 1 | 32 |
| Standard 3x3 with valid padding | 32 | 3 | 0 | 1 | 1 | 30 |
| Strided 3x3 | 32 | 3 | 1 | 2 | 1 | 16 |
| Dilated 3x3 (rate=2) | 32 | 3 | 2 | 1 | 2 | 32 |
| Large 7x7 with same padding | 224 | 7 | 3 | 2 | 1 | 112 |
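As a sanity check, the formula can be evaluated directly; the short helper below (an illustrative sketch, not a framework API) reproduces the output sizes in the table above.

```python
from math import floor

def conv_output_size(i, k, p, s, d=1):
    """O = floor((I + 2P - D(K - 1) - 1) / S) + 1, along one spatial dimension."""
    return floor((i + 2 * p - d * (k - 1) - 1) / s) + 1

# (input, kernel, padding, stride, dilation) for each row of the table
cases = [(32, 3, 1, 1, 1), (32, 3, 0, 1, 1), (32, 3, 1, 2, 1),
         (32, 3, 2, 1, 2), (224, 7, 3, 2, 1)]
print([conv_output_size(*c) for c in cases])   # [32, 30, 16, 32, 112]
```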
The total number of learnable parameters in a convolutional layer depends on the kernel size, number of input channels (C_in), number of output channels (C_out), and whether a bias term is included:
Parameters = (K_h x K_w x C_in + 1) x C_out
The "+1" accounts for one bias value per output channel. For example, a layer with 64 filters of size 3x3 applied to an input with 128 channels has (3 x 3 x 128 + 1) x 64 = 73,792 parameters.
A single filter uses the same set of weights at every spatial location in the input. This weight sharing reduces the number of parameters compared to fully connected layers and encodes the assumption that the same pattern can appear anywhere in the input.
Convolutional layers are translation equivariant: if the input shifts by a certain amount, the output shifts by the same amount. Formally, if T is a translation operator and f is the convolution operation, then f(T(x)) = T(f(x)). This property means that a feature detector learned at one location automatically applies everywhere in the input, without requiring separate detectors for each position.
Translation equivariance should not be confused with translation invariance: convolution by itself shifts its output when the input shifts rather than ignoring the shift. Pooling layers and global aggregation operations are what contribute approximate translation invariance, meaning the final classification output remains similar regardless of where an object appears in the image.
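The equivariance identity can be checked numerically. The sketch below (PyTorch, illustrative sizes) uses circular padding together with a circular shift so that the identity holds exactly even at the borders; with zero padding it holds everywhere except near the edges.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, padding_mode='circular', bias=False)
x = torch.randn(1, 1, 16, 16)

shift = lambda t: torch.roll(t, shifts=(3, 5), dims=(2, 3))          # translate by (3, 5) pixels
print(torch.allclose(conv(shift(x)), shift(conv(x)), atol=1e-6))     # True: f(T(x)) = T(f(x))
```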
Each output unit depends only on a small region of the input, defined by the kernel size. This locality is both computationally efficient and well suited for capturing local spatial patterns like edges, corners, and textures. Higher-level layers, through stacking, build receptive fields that span larger portions of the input.
When convolutional layers are stacked, the network learns a hierarchy of features. Early layers detect simple patterns such as edges and color gradients. Middle layers combine these into more complex shapes like corners, contours, and texture elements. Deeper layers compose these into high-level features representing object parts or entire objects. This hierarchical decomposition is a central reason why CNNs work well for visual recognition.
During training, convolutional layers are optimized using gradient descent and backpropagation. The backward pass through a convolutional layer involves two gradient computations:
Gradient with respect to the input (needed for propagating gradients to earlier layers): this is computed by convolving the upstream gradient with the kernel rotated 180 degrees, which frameworks implement as a transposed convolution.
Gradient with respect to the kernel weights (needed for updating the filter): this is computed as a cross-correlation between the input and the upstream gradient.
Because of weight sharing, gradients from all spatial positions are accumulated into a single gradient for each filter weight. This makes gradient computation efficient and ensures that every spatial location contributes to the filter update.
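Both gradient computations can be verified with automatic differentiation. The sketch below uses PyTorch on a single-channel, stride-1, unpadded layer; the tensor shapes are illustrative.

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 8, 8, requires_grad=True)
w = torch.randn(1, 1, 3, 3, requires_grad=True)
y = F.conv2d(x, w)               # cross-correlation; output shape 1 x 1 x 6 x 6
g = torch.randn_like(y)          # stand-in for the upstream gradient dL/dy
y.backward(g)

# Kernel gradient: cross-correlate the input with the upstream gradient.
print(torch.allclose(w.grad, F.conv2d(x.detach(), g), atol=1e-5))            # True

# Input gradient: transposed convolution of the upstream gradient with the kernel
# (equivalently, a "full" convolution with the kernel rotated 180 degrees).
print(torch.allclose(x.grad, F.conv_transpose2d(g, w.detach()), atol=1e-5))  # True
```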
Modern deep learning frameworks use several strategies to compute convolutions efficiently on hardware accelerators.
The most widely used approach, called im2col (image-to-column), rearranges overlapping input patches into columns of a matrix. The convolution is then computed as a single large matrix multiplication (GEMM, General Matrix-Matrix Multiply). This approach leverages highly optimized BLAS libraries and GPU tensor cores but requires additional memory because overlapping patches are duplicated in the rearranged matrix.
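A stripped-down, single-channel version of the idea (illustrative names; no batching, stride 1, no padding) shows how the sliding window becomes one matrix multiply:

```python
import numpy as np

def im2col_conv2d(x, k):
    """Cross-correlation (the CNN 'convolution') via im2col followed by one GEMM."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    # Copy every kh x kw patch into a row of a (oh*ow, kh*kw) matrix.
    cols = np.empty((oh * ow, kh * kw))
    for i in range(oh):
        for j in range(ow):
            cols[i * ow + j] = x[i:i + kh, j:j + kw].ravel()
    # The whole convolution is now a single matrix-vector (or matrix-matrix) product.
    return (cols @ k.ravel()).reshape(oh, ow)

x = np.random.default_rng(0).standard_normal((6, 6))
k = np.ones((3, 3)) / 9.0                    # a 3x3 box-blur kernel
print(im2col_conv2d(x, k).shape)             # (4, 4)
```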
The convolution theorem states that convolution in the spatial domain is equivalent to element-wise multiplication in the frequency domain. By transforming both the input and the kernel to the frequency domain using the Fast Fourier Transform (FFT), performing element-wise multiplication, and transforming back, convolution can be computed in O(n log n) time rather than O(n^2). This approach is advantageous for large kernel sizes (typically larger than 5x5) but incurs overhead from the domain transforms that makes it less efficient for the small 3x3 kernels common in modern architectures.
The Winograd minimal filtering algorithm reduces the number of multiplications needed for small, fixed-size convolutions (particularly 3x3 with stride 1). It trades multiplications for additions, which are cheaper on most hardware. Many deep learning libraries (cuDNN, MKL-DNN) automatically select Winograd-based implementations for eligible convolutions.
Direct implementation applies the sliding-window operation as written, without matrix rearrangement. While conceptually simple, direct convolution is generally slower on GPUs than im2col or Winograd methods. It remains useful on specialized hardware or for unusual kernel configurations.
| Method | Best kernel sizes | Memory overhead | Computational complexity (n x n input, k x k kernel) |
|---|---|---|---|
| im2col + GEMM | Any | High (patch duplication) | O(n^2 k^2) |
| FFT-based | Large (>5x5) | Moderate (frequency buffers) | O(n^2 log n) |
| Winograd | Small, fixed (3x3, 5x5) | Low | Fewer multiplications than direct |
| Direct | Any | None | O(n^2 k^2) |
Researchers have developed many specialized variants of the standard convolution to address different requirements in terms of receptive field size, computational cost, spatial resolution, and input structure.
A 1x1 convolution applies a 1x1 filter across all input channels at each spatial location. It performs no spatial filtering; instead, it computes a linear combination of channel values. Its primary uses include:
- reducing or expanding the number of channels, for example before and after an expensive 3x3 convolution in a bottleneck block;
- mixing information across channels while leaving the spatial resolution untouched;
- adding nonlinearity cheaply when followed by an activation function.
The concept was introduced in the Network-in-Network paper (Lin et al., 2013) and became a standard building block in architectures like Inception (GoogLeNet), ResNet bottleneck blocks, and MobileNet.
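For example, a 1x1 convolution can shrink a 256-channel feature map to 64 channels without touching the spatial grid (a PyTorch sketch; the sizes are arbitrary):

```python
import torch
import torch.nn as nn

reduce_channels = nn.Conv2d(256, 64, kernel_size=1)   # channel mixing only, no spatial filtering
x = torch.randn(1, 256, 14, 14)
print(reduce_channels(x).shape)                        # torch.Size([1, 64, 14, 14])
```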
Dilated convolution, also called atrous convolution (from the French "à trous," meaning "with holes"), inserts gaps between kernel elements. A dilation rate of r means there are r - 1 gaps between each pair of adjacent kernel values. This expands the receptive field (exponentially, when the dilation rate is doubled across stacked layers) without increasing the number of parameters or reducing spatial resolution.
For a 3x3 kernel with dilation rate 2, the kernel effectively covers a 5x5 area but only uses 9 weights. With dilation rate 4, it covers a 9x9 area with the same 9 weights.
Dilated convolutions are widely used in:
- semantic segmentation (e.g., the DeepLab family), where dense per-pixel prediction needs large receptive fields at full resolution;
- autoregressive audio generation (WaveNet), where stacked dilated convolutions cover long temporal contexts;
- other dense prediction tasks that cannot afford to downsample the input.
One known issue is the gridding effect: when dilated convolutions with the same rate are stacked, certain input positions are never sampled. Solutions include using a mix of dilation rates or hybrid dilated convolution schedules.
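The effect on coverage and output size can be seen directly (a PyTorch sketch with illustrative sizes): a 3x3 kernel with dilation rate 2 shrinks an unpadded output exactly as a dense 5x5 kernel would.

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 32, 32)
w = torch.randn(1, 1, 3, 3)

print(F.conv2d(x, w, dilation=1).shape)             # torch.Size([1, 1, 30, 30])
print(F.conv2d(x, w, dilation=2).shape)             # torch.Size([1, 1, 28, 28]) -- 5x5 coverage, 9 weights
print(F.conv2d(x, w, dilation=2, padding=2).shape)  # torch.Size([1, 1, 32, 32]) -- resolution preserved
```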
A depthwise separable convolution factorizes a standard convolution into two sequential steps:
Depthwise convolution: A separate spatial filter (e.g., 3x3) is applied independently to each input channel. If the input has C channels, C separate filters are used, each operating on a single channel.
Pointwise convolution: A 1x1 convolution combines the outputs across channels.
This factorization reduces the number of parameters and computations to roughly a fraction 1/C_out + 1/K^2 of a standard convolution. For a 3x3 kernel, depthwise separable convolutions therefore require roughly 8 to 9 times fewer multiply-add operations.
Depthwise separable convolutions were popularized by MobileNet (Howard et al., 2017) and are used in EfficientNet, Xception, and other lightweight architectures designed for mobile and edge deployment.
| Convolution type | Parameters (K=3, C_in=128, C_out=256) | Multiply-adds (spatial 8x8) | Relative cost |
|---|---|---|---|
| Standard 3x3 | 3 x 3 x 128 x 256 = 294,912 | 18,874,368 | 1.0x |
| Depthwise separable | (3 x 3 x 128) + (128 x 256) = 33,920 | 2,170,880 | ~0.12x |
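The parameter counts in the table can be reproduced with two stacked layers (a PyTorch sketch; biases are omitted so the numbers match exactly):

```python
import torch.nn as nn

depthwise = nn.Conv2d(128, 128, kernel_size=3, padding=1, groups=128, bias=False)   # one 3x3 filter per channel
pointwise = nn.Conv2d(128, 256, kernel_size=1, bias=False)                          # 1x1 channel mixing
standard  = nn.Conv2d(128, 256, kernel_size=3, padding=1, bias=False)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard))                       # 294912
print(count(depthwise) + count(pointwise))   # 33920
```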
Transposed convolution (sometimes incorrectly called "deconvolution") performs an upsampling operation that increases the spatial dimensions of the input. It is the gradient operation of a normal convolution and can be thought of as a convolution with fractional strides.
Transposed convolutions are used in:
- decoder stages of encoder-decoder segmentation networks, to recover spatial resolution;
- generators of generative adversarial networks, to upsample from a low-resolution latent grid;
- super-resolution and other image-to-image models that must produce outputs larger than their intermediate feature maps.
A known artifact of transposed convolutions is the checkerboard pattern, caused by uneven overlap when the kernel size is not divisible by the stride. This can be mitigated by using resize-convolution (nearest-neighbor upsampling followed by a standard convolution) instead.
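A sketch of the upsampling behavior (PyTorch, illustrative sizes), using a 4x4 kernel with stride 2 so the kernel size is divisible by the stride and the checkerboard pattern is avoided:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 8, 8)
w = torch.randn(1, 1, 4, 4)

# Output size: (in - 1) * stride - 2 * padding + kernel = (8 - 1) * 2 - 2 + 4 = 16
y = F.conv_transpose2d(x, w, stride=2, padding=1)
print(y.shape)   # torch.Size([1, 1, 16, 16]) -- spatial resolution doubled
```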
Group convolution divides the input channels and output channels into G groups and performs separate convolutions within each group. A standard convolution is the special case where G = 1, and a depthwise convolution is the case where G equals the number of input channels.
Group convolution was introduced in AlexNet (Krizhevsky et al., 2012) as a practical solution for splitting computation across two GPUs. It was later adopted as a deliberate architectural choice in ResNeXt (Xie et al., 2017), where the "cardinality" (number of groups) was shown to improve accuracy more effectively than increasing depth or width alone.
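The parameter count shrinks in proportion to the number of groups; a quick check (PyTorch, 3x3 kernels, 64 channels in and out, no bias; the sizes are illustrative):

```python
import torch.nn as nn

count = lambda m: sum(p.numel() for p in m.parameters())
for g in (1, 2, 4, 64):   # g=1 is a standard convolution, g=64 is depthwise
    conv = nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=g, bias=False)
    print(g, count(conv))  # 36864, 18432, 9216, 576 -- reduced by a factor of g
```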
A 3D convolution uses a kernel with three spatial dimensions (height, width, depth or time). The kernel slides along all three axes, producing a 3D output. This is useful for data with a natural third dimension:
- video, where time is the third axis and motion patterns span consecutive frames;
- volumetric medical imaging such as CT and MRI scans, where structures extend through depth.
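A minimal example of the tensor shapes involved (PyTorch; the clip dimensions are illustrative):

```python
import torch
import torch.nn as nn

# Input layout for video: (batch, channels, time, height, width)
conv3d = nn.Conv3d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
clip = torch.randn(1, 3, 8, 64, 64)      # 8 RGB frames of 64 x 64 pixels
print(conv3d(clip).shape)                # torch.Size([1, 16, 8, 64, 64])
```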
A causal convolution ensures that the output at time step t depends only on inputs at time t and earlier, never on future inputs. This is achieved by asymmetric padding (padding only on one side). Causal convolutions are used for sequence modeling tasks where the temporal order must be respected, such as:
- autoregressive audio generation (WaveNet);
- language modeling and other autoregressive text models;
- time-series forecasting with temporal convolutional networks (TCNs).
When combined with dilation, causal convolutions can achieve very large receptive fields over temporal sequences while maintaining strict causal ordering. Several studies have shown that TCNs with dilated causal convolutions can match or outperform recurrent neural networks (RNNs and LSTMs) on many sequence tasks, with the additional benefit of parallelizable computation.
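A minimal causal 1D convolution can be built from an ordinary convolution by padding only on the left (a sketch; the helper name and sizes are illustrative):

```python
import torch
import torch.nn.functional as F

def causal_conv1d(x, weight, dilation=1):
    """x: (batch, channels, time); weight: (out_ch, in_ch, kernel). Output[t] sees only x[:t+1]."""
    kernel_size = weight.shape[-1]
    left_pad = (kernel_size - 1) * dilation
    return F.conv1d(F.pad(x, (left_pad, 0)), weight, dilation=dilation)

x = torch.randn(1, 1, 10)
w = torch.randn(1, 1, 3)
print(causal_conv1d(x, w, dilation=2).shape)   # torch.Size([1, 1, 10]) -- same length, strictly causal
```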
Deformable convolution, introduced by Dai et al. (2017), adds learnable 2D offsets to each sampling position in the kernel grid. Instead of sampling on a fixed regular grid, the network learns to shift each sample point based on the input content. This allows the receptive field to adapt to the scale and shape of objects in the image.
A parallel convolutional layer predicts the offsets, which are applied to the sampling locations of the main convolution. The offsets are typically fractional, so bilinear interpolation is used to sample the input at non-integer positions. Deformable convolutions have been adopted in modern object detection architectures to improve localization of objects with irregular shapes.
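A minimal sketch, assuming torchvision's deform_conv2d operator is available: with all offsets set to zero the operation reduces to an ordinary convolution, and in practice a small convolutional branch would predict the offsets from the input.

```python
import torch
from torchvision.ops import deform_conv2d

x = torch.randn(1, 3, 16, 16)
w = torch.randn(8, 3, 3, 3)                  # 8 output channels, 3x3 kernel
# One (dy, dx) offset per kernel tap per output location; zeros mean "sample the regular grid".
offset = torch.zeros(1, 2 * 3 * 3, 14, 14)   # output is 14 x 14 without padding
print(deform_conv2d(x, offset, w).shape)     # torch.Size([1, 8, 14, 14])
```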
Although the convolutional operation is most closely associated with 2D image processing, it has been successfully applied across various data types.
| Domain | Input dimensionality | Convolution type | Example applications |
|---|---|---|---|
| Images | 2D (height x width x channels) | 2D convolution | Classification, detection, segmentation |
| Audio and speech | 1D (time x channels) | 1D convolution | Speech recognition, music generation |
| Video | 3D (time x height x width x channels) | 3D convolution | Action recognition, video captioning |
| Text | 1D (sequence length x embedding dim) | 1D convolution | Text classification, NLP |
| Medical volumes | 3D (depth x height x width) | 3D convolution | Tumor detection, organ segmentation |
| Graph-structured data | Irregular | Graph convolution | Social network analysis, molecule property prediction |
| Variant | Receptive field per layer | Parameter efficiency | Spatial resolution | Primary use case |
|---|---|---|---|---|
| Standard convolution | K x K | Baseline | Preserved (stride=1) or reduced | General feature extraction |
| 1x1 convolution | 1 x 1 | Very high | Preserved | Channel mixing, dimensionality change |
| Dilated convolution | Expanded (by dilation rate) | Same as standard | Preserved | Segmentation, dense prediction |
| Depthwise separable | K x K (factorized) | ~8-9x fewer ops | Preserved or reduced | Mobile and edge deployment |
| Transposed convolution | K x K (upsampling) | Baseline | Increased | Decoders, generators |
| Group convolution | K x K (per group) | Reduced by factor G | Preserved or reduced | Multi-GPU, increased cardinality |
| Deformable convolution | Adaptive | Slightly more (offset params) | Preserved | Irregular object detection |
| 3D convolution | K x K x K | Higher (extra dimension) | Preserved or reduced per axis | Video, volumetric data |
| Causal convolution | Asymmetric temporal | Same as 1D | Preserved | Sequence modeling |
The convolutional operation appears in virtually every modern vision architecture, though its role has evolved alongside newer building blocks like self-attention.
The convolution theorem establishes that convolution in the spatial (or time) domain corresponds to element-wise multiplication in the frequency domain:
F(f * g) = F(f) · F(g)
where F denotes the Fourier transform. This means a convolution can be computed by:
- transforming both the input and the kernel to the frequency domain,
- multiplying the two transforms element-wise, and
- applying the inverse transform to the product.
This approach has O(n log n) complexity compared to the O(n^2) complexity of direct convolution. However, for the small 3x3 kernels used in most modern CNNs, the overhead of the FFT transforms outweighs the savings, so spatial-domain methods (im2col, Winograd) are preferred in practice. FFT-based convolution remains useful in signal processing applications and in research architectures that use large kernels.
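The theorem is easy to verify numerically for a 1D signal (a NumPy sketch; the signal and kernel lengths are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.standard_normal(64)
g = rng.standard_normal(9)

direct = np.convolve(f, g)                   # direct linear convolution, length 64 + 9 - 1

# Convolution theorem: zero-pad both signals to the full output length,
# multiply their FFTs element-wise, then transform back.
n = len(f) + len(g) - 1
fft_based = np.fft.irfft(np.fft.rfft(f, n) * np.fft.rfft(g, n), n)

print(np.allclose(direct, fft_based))        # True
```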