Translational invariance (also called translation invariance or shift invariance) is a property of a function, system, or machine learning model whose output does not change when its input is translated (shifted) in space. In the context of computer vision and deep learning, translational invariance is the property that the prediction made by a network for an image is unchanged when the contents of the image are shifted by some amount in any direction. The concept is most closely associated with convolutional neural networks (CNNs), which were designed in part to encode this property as an architectural prior or inductive bias. The notion is closely related, but not identical, to translational equivariance, where the output transforms in a predictable way when the input is shifted.
Translational invariance plays an important role across signal processing, pattern recognition, and neuroscience. The visual systems of mammals, for example, recognize objects regardless of where they appear in the visual field, and this observation directly inspired early models of pattern recognition such as the neocognitron and the convolutional architectures that followed it. The concept also has deep roots in physics, where the invariance of physical laws under spatial translation is connected by Noether's theorem to the conservation of linear momentum.
Let $T_v$ denote a translation operator that shifts a function $x: \mathbb{R}^d \to \mathbb{R}$ by a vector $v \in \mathbb{R}^d$, so that $(T_v x)(u) = x(u - v)$. A function $f$ acting on inputs $x$ is called translation invariant if
$$f(T_v x) = f(x)$$
for all translations $v$ in some set (typically the entire translation group). In words, translating the input has no effect on the output.
A related but distinct notion is translation equivariance: a function $f$ is translation equivariant if there exists a translation operator $T'_v$ acting on the output space such that
$$f(T_v x) = T'_v f(x)$$
for all $v$. In equivariant systems, translating the input by $v$ shifts the output by the corresponding amount. The two concepts are connected: composing an equivariant map with an invariant aggregation (such as taking a maximum or an average over the entire spatial domain) yields a fully invariant map.
The distinction between invariance and equivariance is central to modern deep learning theory. The two terms describe different relationships between an input transformation and the resulting change (or lack thereof) in the output, and conflating them obscures how CNNs actually behave.
| Property | Definition | Typical role in CNNs | Example |
|---|---|---|---|
| Translation invariance | Output stays the same when the input is shifted: $f(T_v x) = f(x)$. | Achieved by global pooling or by a final classification head. | A network outputs the same class label regardless of where a digit appears in the image. |
| Translation equivariance | Output shifts in a corresponding, predictable way when the input is shifted: $f(T_v x) = T'_v f(x)$. | Achieved (in idealized form) by convolutional layers. | A feature map activation moves to a new position when the object in the image moves. |
In standard CNN parlance, the convolution operation is best described as translation equivariant, while the combination of convolutions with global pooling or a permutation-invariant readout produces translation invariance. Because invariance can be obtained by composing equivariance with a final invariant operation, equivariance is often the more fundamental architectural property. This perspective is emphasized in the geometric deep learning framework of Bronstein, Bruna, Cohen, and Velickovic.
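The relationship can be checked numerically. The following sketch (PyTorch, using circular padding and circular shifts so that image boundaries do not interfere; these choices are illustrative rather than drawn from any paper cited here) confirms that a single convolution is shift equivariant and that taking a global maximum over the spatial dimensions afterwards yields shift invariance.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

x = torch.randn(1, 1, 32, 32)   # a random single-channel "image"
w = torch.randn(1, 1, 3, 3)     # one 3x3 filter

def conv(img):
    # Circular padding keeps the shift symmetry exact (no boundary effects).
    return F.conv2d(F.pad(img, (1, 1, 1, 1), mode="circular"), w)

shift = (5, 7)                  # shift by 5 rows and 7 columns
x_shifted = torch.roll(x, shifts=shift, dims=(2, 3))

# Equivariance: convolving the shifted image equals shifting the convolved image.
lhs = conv(x_shifted)
rhs = torch.roll(conv(x), shifts=shift, dims=(2, 3))
print(torch.allclose(lhs, rhs, atol=1e-5))   # True

# Invariance: a global max over the spatial dimensions removes the shift entirely.
print(torch.allclose(conv(x_shifted).amax(dim=(2, 3)),
                     conv(x).amax(dim=(2, 3)), atol=1e-5))   # True
```

With zero padding instead of circular padding, the equalities hold only approximately, which foreshadows the fragility discussed later in this article.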
In classical signal processing, a foundational result holds that the only linear, shift-equivariant operators on infinite or periodic signals are convolutions. Equivalently, every linear map that commutes with the shift operator can be written as a convolution against some kernel. The convolution theorem makes this precise in the Fourier domain: convolution in the spatial domain corresponds to pointwise multiplication in the frequency domain, and the eigenvectors of the shift operator are the Fourier basis. This mathematical structure is one reason convolution shows up so naturally whenever shift symmetry is desired in a learned function.
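The Fourier-domain statement can be verified directly. The sketch below, assuming periodic (circular) signals of length 64 filled with arbitrary random data, compares a circular convolution computed from its definition with the inverse FFT of the pointwise product of FFTs.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 64
x = rng.standard_normal(N)   # arbitrary periodic signal
h = rng.standard_normal(N)   # arbitrary kernel with the same period

# Circular convolution from the definition: y[n] = sum_m x[m] * h[(n - m) mod N]
direct = np.zeros(N)
for n in range(N):
    for m in range(N):
        direct[n] += x[m] * h[(n - m) % N]

# Convolution theorem: pointwise multiplication in the frequency domain.
via_fft = np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(h)))

print(np.allclose(direct, via_fft))   # True
```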
A convenient way to describe translational invariance is to imagine recognizing a cat in a photograph. A human viewer identifies the cat as a cat regardless of whether it is in the upper left corner, the lower right corner, or the center of the frame. The perceived identity does not depend on the absolute pixel coordinates of the cat. A vision model with translational invariance behaves the same way: it produces the same prediction whether the cat is shifted left, right, up, or down by a few pixels.
A classic textbook framing is the game of "I Spy." The player must spot a target object in a scene. Whether the object is at the top of the page or at the bottom, the player still recognizes it. A translation invariant classifier is a model that has learned to play I Spy in this position-blind way.
The study of translation invariance in artificial systems is rooted in neuroscience. In 1962, David Hubel and Torsten Wiesel published influential studies of the cat's primary visual cortex. They identified two main classes of orientation-selective cells. Simple cells responded to oriented edges in specific positions of the visual field, with separate "on" and "off" subregions in their receptive fields. Complex cells also responded to oriented edges but were less sensitive to the exact spatial location of the stimulus within their receptive field, providing a degree of local position invariance. The hierarchical organization where complex cells pool over simple cells with shifted receptive fields became the canonical model of how invariance might be built up in biological vision. Hubel and Wiesel later received the 1981 Nobel Prize in Physiology or Medicine for this body of work.
In 1980, the Japanese researcher Kunihiko Fukushima introduced the neocognitron, described in his paper Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position, published in Biological Cybernetics. The neocognitron was directly inspired by Hubel and Wiesel's findings. It used alternating layers of S-cells (analogous to simple cells, performing feature extraction with shared weights) and C-cells (analogous to complex cells, performing local pooling). All S-cells in the same plane shared the same receptive field profile, with only their position shifted, which is the same weight sharing scheme that later defined CNNs. C-cells aggregated responses over local neighborhoods to produce gradual position invariance. The architecture demonstrated that hierarchical feature extraction with pooling could achieve shift invariant pattern recognition without explicit alignment.
Yann LeCun and collaborators at AT&T Bell Laboratories built on these ideas in the late 1980s. In 1989, LeCun et al. published Backpropagation Applied to Handwritten Zip Code Recognition in Neural Computation, in which a convolutional network trained with backpropagation learned to read handwritten digits from U.S. mail. The network had hand-engineered constraints (local connectivity and shared weights across positions) that built in approximate translation equivariance. The training data consisted of 9298 16x16 grayscale images, with 7291 used for training and 2007 for testing.
LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner expanded on this work in their 1998 Proceedings of the IEEE paper Gradient-Based Learning Applied to Document Recognition, which introduced LeNet-5. LeNet-5 became the canonical early CNN architecture and combined convolutional layers, subsampling layers, and fully connected layers to perform handwritten character recognition. The shared-weight convolutions provided translation equivariance, and the subsampling operations provided local position invariance, mirroring the S-cell / C-cell hierarchy of the neocognitron.
The success of AlexNet on ImageNet in 2012 brought CNNs to the center of computer vision. Subsequent architectures such as VGG, GoogLeNet, and ResNet all relied on convolution and pooling as their core operations, and translation invariance was widely cited as a key reason for their success. Beginning around 2018, however, a series of papers documented that this invariance is far more fragile in practice than the textbook account suggests, prompting a wave of follow-up work on how to recover or strengthen the property.
The convolutional layer is the source of approximate translation equivariance in a CNN. A learned kernel (or filter) is slid over the input feature map, computing the same dot product at every spatial location. Because the same kernel is applied at every position, the same feature can be detected wherever it appears. Mathematically, in the continuous setting, convolution commutes with translation, so a shifted input produces a shifted output.
In practice, deep learning libraries implement cross-correlation rather than mathematical convolution, but the equivariance property is the same. The weight sharing across spatial positions is what gives CNNs both their translation equivariance and a dramatic reduction in parameter count compared to a fully connected network applied to the same input.
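To make the parameter savings concrete, the following back-of-the-envelope comparison assumes an illustrative 224 by 224 RGB input mapped to 64 feature maps at the same resolution; the sizes are not taken from any particular architecture discussed here.

```python
import torch.nn as nn

# One 3x3 convolution mapping a 3-channel 224x224 image to 64 feature maps.
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
conv_params = sum(p.numel() for p in conv.parameters())
print(f"convolutional layer: {conv_params:,} parameters")   # 1,792

# A fully connected layer producing an output of the same shape would need a
# weight for every (input position, output position) pair. Computed analytically,
# since actually allocating it would require terabytes of memory.
in_features = 3 * 224 * 224
out_features = 64 * 224 * 224
fc_params = in_features * out_features + out_features   # weights + biases
print(f"equivalent fully connected layer: {fc_params:,} parameters")   # ~483 billion
```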
Pooling layers provide local invariance to small spatial perturbations. The two most common forms are max pooling, which takes the maximum activation within a small spatial window, and average pooling, which takes the mean. After max pooling with a $k \times k$ window and a stride of $k$, the output is unchanged by translations that keep each dominant activation within its original pooling window. Stacked pooling layers compound this effect: invariance to small shifts at one layer becomes invariance to larger shifts after several layers.
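A toy illustration of this local invariance, using a single bright pixel and a 2 by 2 max pool (an arbitrary setup chosen for the example, not taken from the literature discussed here):

```python
import torch
import torch.nn.functional as F

# A single bright pixel inside a 4x4 feature map.
x = torch.zeros(1, 1, 4, 4)
x[0, 0, 0, 0] = 1.0

# Shift by one pixel: the bright pixel stays inside the same 2x2 pooling window.
x_small_shift = torch.zeros(1, 1, 4, 4)
x_small_shift[0, 0, 1, 1] = 1.0

# Shift by two pixels: the bright pixel crosses into a different pooling window.
x_large_shift = torch.zeros(1, 1, 4, 4)
x_large_shift[0, 0, 2, 2] = 1.0

pool = lambda t: F.max_pool2d(t, kernel_size=2, stride=2)
print(torch.equal(pool(x), pool(x_small_shift)))   # True: invariant to the small shift
print(torch.equal(pool(x), pool(x_large_shift)))   # False: the larger shift is visible
```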
Global average pooling (GAP) and global max pooling are special cases that aggregate over the entire spatial extent of a feature map, producing a single scalar per channel. Because global pooling does not depend on position, it provides full translation invariance over the receptive field of the feature map. GAP was popularized by Network in Network (Lin, Chen, and Yan, 2014) and is used in many modern architectures including ResNet and DenseNet to replace large fully connected classification heads.
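A minimal sketch of a GAP-based classification head, assuming PyTorch and arbitrary layer sizes: because the pooled descriptor is one value per channel, the classifier weights attach to features rather than to spatial positions, and the same head accepts inputs of different resolutions.

```python
import torch
import torch.nn as nn

# Minimal classifier: convolutional features -> global average pooling -> linear head.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),   # global average pooling: one value per channel
    nn.Flatten(),
    nn.Linear(32, 10),         # class scores no longer depend on spatial position
)

print(model(torch.randn(1, 3, 64, 64)).shape)   # torch.Size([1, 10])
print(model(torch.randn(1, 3, 96, 96)).shape)   # also [1, 10]: GAP removes the spatial dims
```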
Many classic CNN architectures, including LeNet-5 and AlexNet, end with one or more fully connected layers on top of the spatial feature maps. Fully connected layers are not translation invariant by themselves, since they have a separate weight for every spatial position. Their position dependence means that the overall network's output can change in unintended ways when the input is translated. Replacing fully connected heads with global average pooling, as in modern architectures, partially addresses this issue.
In 2018, Aharon Azulay and Yair Weiss posted the preprint Why do deep convolutional networks generalize so poorly to small image transformations? (later published in the Journal of Machine Learning Research in 2019). The paper provided a striking empirical demonstration that modern CNNs, despite their architectural priors, are not robust to small translations. Shifting an image by even a single pixel can change the predicted class probability dramatically, and predictions can flip between correct and incorrect classes when the input is translated by a few pixels.
Azulay and Weiss attributed this behavior to two factors. First, modern CNNs use strided convolutions and strided pooling, which act as downsampling operations. Downsampling without proper anti-alias filtering violates the Nyquist sampling theorem and introduces aliasing, which destroys the underlying shift equivariance. Second, the authors showed that data augmentation does not fully compensate for these architectural shortcomings. Networks trained with translation augmentation become invariant only in regions of the input space close to typical training images and remain fragile elsewhere.
A related finding by Osman Semih Kayhan and Jan C. van Gemert, presented in their CVPR 2020 paper On Translation Invariance in CNNs: Convolutional Layers Can Exploit Absolute Spatial Location, showed that CNN filters can and do learn to respond to absolute image positions. This is possible because of boundary effects (zero padding at the image edges), which give the network a positional signal that propagates inward through deep stacks of convolutions. Modern architectures with large receptive fields can therefore exploit absolute spatial location everywhere in the image, undermining the assumption that a CNN treats all positions identically. The authors proposed simple modifications to remove this positional encoding and reported improved generalization, especially on small datasets.
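A minimal illustration of the boundary effect (not the authors' experimental setup): convolving a perfectly uniform image with zero padding yields different activations at corners, edges, and the interior, so units near the border receive a signal about absolute position.

```python
import torch
import torch.nn.functional as F

# A uniform image: every location looks locally identical except near the border.
x = torch.ones(1, 1, 8, 8)
w = torch.ones(1, 1, 3, 3)

y = F.conv2d(x, w, padding=1)   # zero padding, "same" output size
print(y[0, 0])
# Interior values are 9, edge values are 6, corner values are 4:
# the zeros padded in at the border leak absolute position into the activations.
```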
Taken together, these findings reframed CNNs as networks with approximate, learned, and partial translation invariance, rather than networks that are translation invariant by design. The empirical fragility has practical consequences in safety-critical applications such as medical imaging, autonomous vehicles, and biometric systems, where small shifts in the input can occur naturally and should not change the prediction. The findings also motivated a body of follow-up research into stronger architectural priors and better training procedures.
In 2019, Richard Zhang published Making Convolutional Networks Shift-Invariant Again at ICML. The paper applied the classical signal processing fix for aliasing: insert a low-pass filter (a blurring step) before any downsampling operation. Zhang integrated this idea into max pooling, average pooling, and strided convolution. The resulting anti-aliased CNNs demonstrated improved consistency under input shifts, higher classification accuracy on ImageNet, and increased robustness to common image corruptions. The approach was released as an open-source library and was integrated into popular architectures such as ResNet, DenseNet, and MobileNet.
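The core idea can be sketched in a few lines: evaluate the pooling operation densely, then blur with a fixed low-pass filter before subsampling. The module below is a simplified illustration with a fixed 3 by 3 binomial filter; it is not the released antialiased-cnns implementation, and the interface and filter size are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlurPool2d(nn.Module):
    """Low-pass filter (binomial blur) followed by subsampling, in the spirit of
    Zhang (2019). A minimal sketch, not the official antialiased-cnns code."""
    def __init__(self, channels, stride=2):
        super().__init__()
        k = torch.tensor([1.0, 2.0, 1.0])
        k = torch.outer(k, k)
        k = (k / k.sum()).view(1, 1, 3, 3)
        self.register_buffer("kernel", k.repeat(channels, 1, 1, 1))  # depthwise filter
        self.stride = stride
        self.channels = channels

    def forward(self, x):
        x = F.pad(x, (1, 1, 1, 1), mode="reflect")
        return F.conv2d(x, self.kernel, stride=self.stride, groups=self.channels)

def antialiased_maxpool(x, blur):
    # Dense (stride-1) max pooling, then blur + subsample instead of a strided max.
    x = F.max_pool2d(x, kernel_size=2, stride=1)
    return blur(x)

blur = BlurPool2d(channels=8)
x = torch.randn(1, 8, 32, 32)
print(antialiased_maxpool(x, blur).shape)   # torch.Size([1, 8, 16, 16])
```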
A more general approach extends equivariance beyond translations to other symmetry groups. In the 2016 ICML paper Group Equivariant Convolutional Networks, Taco S. Cohen and Max Welling introduced G-CNNs, in which feature maps are functions on a group $G$ and the network's layers are equivariant to the action of $G$. For the discrete group of translations combined with 90-degree rotations and reflections, the resulting G-convolutions enjoy a higher degree of weight sharing than standard convolutions and require no extra parameters. G-CNNs achieved state-of-the-art results on rotated MNIST and CIFAR-10 at the time of publication. Subsequent work generalized the framework to continuous rotation groups, scale groups, and Lie groups, leading to the broader research area of equivariant neural networks.
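A stripped-down sketch of the first ("lifting") layer of a p4 G-CNN conveys the mechanism: the same filter is applied at all four 90-degree rotations, producing a feature map with an extra rotation axis. This is an illustrative reconstruction with arbitrary layer sizes, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class P4LiftingConv(nn.Module):
    """Minimal sketch of the lifting layer of a p4 group equivariant CNN
    (Cohen & Welling, 2016): one filter, applied at four rotations."""
    def __init__(self, in_channels, out_channels, kernel_size=3):
        super().__init__()
        self.padding = kernel_size // 2
        self.weight = nn.Parameter(
            torch.randn(out_channels, in_channels, kernel_size, kernel_size) * 0.1)

    def forward(self, x):
        outs = []
        for r in range(4):  # 0, 90, 180, 270 degrees
            w = torch.rot90(self.weight, r, dims=(-2, -1))
            outs.append(F.conv2d(x, w, padding=self.padding))
        # Output shape: (batch, out_channels, 4 rotations, height, width)
        return torch.stack(outs, dim=2)

layer = P4LiftingConv(3, 8)
print(layer(torch.randn(1, 3, 32, 32)).shape)   # torch.Size([1, 8, 4, 32, 32])
```

Rotating the input by 90 degrees rotates these feature maps and cyclically permutes the rotation axis, which is the p4 analogue of translation equivariance.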
In the 2015 NeurIPS paper Spatial Transformer Networks, Max Jaderberg, Karen Simonyan, Andrew Zisserman, and Koray Kavukcuoglu proposed a learnable, differentiable module that explicitly performs a spatial transformation on its input. A spatial transformer consists of a localization network that predicts transformation parameters, a grid generator that produces a sampling grid, and a sampler that interpolates the input at the sampled positions. By inserting spatial transformers into a CNN, the model learns to actively warp inputs to a canonical form, providing invariance not just to translation but also to scale, rotation, and more general affine or non-rigid deformations.
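A minimal PyTorch sketch of such a module, assuming an affine transformation and an arbitrary small localization network (the layer sizes are illustrative): the localization network predicts a 2 by 3 matrix, `affine_grid` produces the sampling grid, and `grid_sample` performs the differentiable interpolation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialTransformer(nn.Module):
    """Minimal sketch of a spatial transformer module (Jaderberg et al., 2015)."""
    def __init__(self, channels):
        super().__init__()
        self.loc = nn.Sequential(
            nn.Conv2d(channels, 8, kernel_size=7), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(8 * 4 * 4, 6),
        )
        # Initialize to the identity transform so training starts from "no warp".
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, x):
        theta = self.loc(x).view(-1, 2, 3)                    # per-image 2x3 affine matrix
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)    # warped input

stn = SpatialTransformer(1)
print(stn(torch.randn(4, 1, 28, 28)).shape)   # torch.Size([4, 1, 28, 28])
```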
Geoffrey Hinton has long argued that pooling-based invariance throws away valuable spatial relationships between features. In the 2017 NeurIPS paper Dynamic Routing Between Capsules, Sara Sabour, Nicholas Frosst, and Geoffrey Hinton proposed capsule networks, in which a capsule is a group of neurons whose activity vector represents the instantiation parameters of a part or object. The length of the activity vector encodes the probability that the entity is present, while its orientation encodes pose information. Active capsules at one level make predictions, via learned transformation matrices, for the instantiation parameters of higher-level capsules; when these predictions agree, the higher-level capsule becomes active. The system is referred to as routing-by-agreement.
In the framing of Sabour, Frosst, and Hinton, capsule networks aim to encode the intrinsic spatial relationship between a part and a whole as viewpoint invariant knowledge that generalizes to novel viewpoints, rather than relying on max pooling to discard pose information. Capsule networks reported strong performance on MNIST and were noted for their ability to disentangle highly overlapping digits.
The simplest and most widely used method for encouraging translation invariance is data augmentation. During training, input images are randomly translated, cropped, flipped, scaled, or otherwise transformed before being shown to the network. The network never sees the same image twice in the same position, which forces it to learn representations that generalize across positions. Data augmentation is now standard practice in image classification pipelines and is often combined with the architectural priors of CNNs. Empirical work suggests that data augmentation is the dominant factor for achieving useful translation invariance in practice, even more so than the architectural inductive bias of convolution and pooling.
Data augmentation has the advantage that it works with any architecture, including transformer-based models that lack convolutional priors. Its limitation is that it teaches invariance only over the range of transformations sampled during training and does not extend to large or unusual shifts.
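In PyTorch pipelines this is usually expressed with torchvision transforms; the particular transforms and parameter ranges below are illustrative defaults rather than a recommendation.

```python
from torchvision import transforms

# A typical augmentation pipeline that exposes the network to shifted, flipped,
# and rescaled views of each training image (parameter values are illustrative).
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),        # random crop + rescale
    transforms.RandomHorizontalFlip(),                          # mirror flip
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),   # shift by up to 10%
    transforms.ToTensor(),
])
# Each epoch the same source image is seen at a different position and scale,
# pushing the classifier toward position-independent predictions.
```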
The Vision Transformer (ViT), introduced by Alexey Dosovitskiy and colleagues in the 2020 paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, applies the transformer architecture to images by partitioning each image into a sequence of non-overlapping patches and processing them as tokens. ViT explicitly does not have the convolutional inductive biases of locality and translation equivariance. Instead, it relies on global self-attention and learned positional encodings to relate tokens.
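A minimal sketch of ViT-style tokenization (sizes match the paper's base configuration, but the code is an illustrative reconstruction, not the authors' implementation) shows why translation equivariance is not built in: each patch embedding is summed with a positional embedding indexed by absolute patch location.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Sketch of ViT tokenization: non-overlapping patches, a linear projection,
    and a learned positional embedding tied to absolute patch position."""
    def __init__(self, image_size=224, patch_size=16, in_channels=3, dim=768):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # A convolution with stride equal to its kernel size extracts and projects each patch.
        self.proj = nn.Conv2d(in_channels, dim, kernel_size=patch_size, stride=patch_size)
        self.pos = nn.Parameter(torch.zeros(1, num_patches, dim))

    def forward(self, x):
        tokens = self.proj(x).flatten(2).transpose(1, 2)   # (batch, num_patches, dim)
        return tokens + self.pos                           # absolute position added here

emb = PatchEmbedding()
print(emb(torch.randn(1, 3, 224, 224)).shape)   # torch.Size([1, 196, 768])
```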
Dosovitskiy et al. observed that, on small or medium-sized datasets such as ImageNet-1k, ViT trains less efficiently than CNNs of comparable size, which the authors attribute partly to ViT's missing inductive biases. When pretrained on much larger datasets such as ImageNet-21k or JFT-300M, however, ViT matches or surpasses state-of-the-art CNNs, suggesting that scale can substitute for inductive bias. Subsequent work has explored hybrid architectures that combine convolutional patch embeddings with attention layers to recover some of the locality and equivariance properties for free, and the broader research community has explored translation-invariant or translation-equivariant attention mechanisms as a design space.
Translation is only one of many natural transformations of an image. Other common ones include:
| Transformation | Geometric description | Typical mechanism for invariance |
|---|---|---|
| Translation | Shift of all pixels by a constant vector | Convolution + pooling, data augmentation |
| Rotation | Rotation about an axis (in 2D, about the image center) | Group equivariant CNNs, augmentation, spatial transformers |
| Scale | Uniform stretching or shrinking | Image pyramids, multi-scale features, augmentation, spatial transformers |
| Reflection | Mirror flip (horizontal or vertical) | Group equivariant CNNs (e.g., the p4m group), augmentation |
| Affine | Linear transformations including shear | Spatial transformer networks |
| Photometric | Brightness, contrast, color shifts | Augmentation, normalization |
Unlike translation, rotation and scale invariance are not built into standard CNN architectures. CNNs trained without rotation augmentation typically misclassify images that are rotated by more than a modest amount. This limitation spurred the development of group equivariant CNNs (Cohen and Welling 2016) for rotations and reflections, scattering networks (Mallat 2012), which provide translation invariance and stability to deformations, and a long line of follow-up work on equivariant networks for arbitrary symmetry groups.
The principle that physical laws should not depend on absolute spatial position has a deep mathematical consequence captured by Noether's theorem, proved by Emmy Noether in 1915 and published in 1918. The theorem states that every continuous symmetry of the action of a physical system corresponds to a conserved quantity. Translation symmetry in space corresponds to conservation of linear momentum, translation symmetry in time corresponds to conservation of energy, and rotation symmetry corresponds to conservation of angular momentum.
The analogy with deep learning is more than poetic. Designing a neural network to be invariant or equivariant under a symmetry group is mathematically the same kind of constraint as building physical theories that respect the symmetries of nature, and the resulting parameter sharing in equivariant networks is the deep-learning analogue of conservation laws. The geometric deep learning program of Bronstein, Bruna, Cohen, and Velickovic (2021) makes this connection explicit, recasting the design of architectures such as CNNs, graph neural networks, and transformers as a unified study of equivariance under different symmetry groups.
Imagine playing "I Spy" with a friend, looking for a small picture of a cat in a big poster. The cat might be in the corner, the middle, or the edge. You can still find it because you know what a cat looks like, no matter where it is on the page. Translational invariance is the same idea for a computer program. A program that has translational invariance can still recognize a cat (or a number, or a face) even when the picture moves around. Convolutional neural networks try to have this skill built in, but they are not perfect at it: sometimes when a picture shifts by just one pixel, the program changes its mind. Researchers have invented many tricks (special filters, extra training, new kinds of layers) to try to make computer programs as good at this game as humans are.