Perceiver
Last reviewed: Jun 3, 2026
Sources: 10 citations
Review status: Source-backed
Revision: v1 · 1,573 words
Perceiver is a family of general-purpose neural network architectures from DeepMind built around attention and a small latent bottleneck. The first model, "Perceiver: General Perception with Iterative Attention," was presented at ICML 2021 and aimed to handle many input types, such as images, audio, video, and 3D point clouds, with a single design and very few modality-specific assumptions [1][2]. A follow-up, Perceiver IO, generalized the approach so the same backbone could also produce arbitrary structured outputs, extending it from classification to tasks like language understanding and optical flow [3][4]. The central idea in both is to use cross-attention to compress a large input into a fixed-size array of latent variables, then do the heavy processing among those latents, which keeps compute from blowing up as inputs grow.
Standard Transformers apply self-attention directly over their inputs, so cost grows with the square of the input length. That quadratic scaling is manageable for short token sequences but becomes prohibitive for high-resolution images, long audio, or video, where the number of elements can run into the tens or hundreds of thousands. The common workaround is to bake in domain-specific structure first, for example the local grids and convolutions used by virtually all vision models, which add helpful inductive biases but tie an architecture to one modality [1].
The Perceiver authors, led by Andrew Jaegle, set out to remove most of those assumptions while still scaling to very large inputs the way convolutional networks do. The motivation was partly biological: perception in living systems fuses high-dimensional signals from vision, hearing, touch, and other senses, rather than relying on a separate hand-built system per sense [1]. The goal was an architecture that makes few assumptions about how its inputs are arranged yet remains practical on raw, high-dimensional data.
The Perceiver is built from two repeating components. A cross-attention module maps a large input array (for example a flattened image) together with a much smaller learned latent array into an updated latent array. A latent Transformer, a stack of standard self-attention blocks, then processes the latents [1][2]. Crucially, the queries in the cross-attention come from the small latent array while the keys and values come from the large input, so the expensive attention step scales linearly with the input size rather than quadratically. After that bottleneck, the self-attention tower operates only on the fixed set of latents, so its cost is independent of how big the input was [4][5].
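The asymmetric cross-attention step can be illustrated with a minimal NumPy sketch. It is not DeepMind's implementation: it uses a single head and omits layer normalization, residual connections, and the MLP, and every name and size in it (`cross_attend`, `M`, `N`, the random weights) is a hypothetical stand-in chosen for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # stabilize before exponentiating
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(latents, inputs, w_q, w_k, w_v):
    """Single-head cross-attention: queries from latents, keys/values from inputs."""
    q = latents @ w_q                         # (N, D)
    k = inputs @ w_k                          # (M, D)
    v = inputs @ w_v                          # (M, D)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (N, M): grows linearly with M
    return softmax(scores, axis=-1) @ v       # (N, D): size set by the latents

rng = np.random.default_rng(0)
M, C = 1024, 3   # hypothetical input: 1,024 elements with 3 channels each
N, D = 16, 8     # hypothetical latent array: 16 latents with 8 channels

inputs = rng.normal(size=(M, C))
latents = rng.normal(size=(N, D))   # learned in a real model, random here
w_q = rng.normal(size=(D, D)) * 0.1
w_k = rng.normal(size=(C, D)) * 0.1
w_v = rng.normal(size=(C, D)) * 0.1

updated = cross_attend(latents, inputs, w_q, w_k, w_v)
print(updated.shape)  # (16, 8)
```

The score matrix has shape `(N, M)`, so doubling the input doubles its width, while the output keeps the shape of the latent array whatever the input length.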
This asymmetry is the key trick. Because the size of the latent array is a hyperparameter, typically set far smaller than the input, the network can attend to enormous inputs and still keep the bulk of its computation cheap. DeepMind describes the latents as a "tight latent bottleneck" that inputs are iteratively distilled into [1]. The model can alternate cross-attention and latent self-attention several times, repeatedly querying the input, which is the "iterative attention" of the title. The authors found that sharing weights across the repeated cross-attention and Transformer blocks made the network more parameter-efficient and gave it a recurrent, RNN-like character [5].
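The alternation and weight sharing can be sketched as a loop that reuses one set of parameters on every pass. This is a simplified illustration under stated assumptions, not the published model: attention is single-head, normalization and MLPs are left out, and the helper names (`init_attention`, `attend`, `iterative_encode`) and all sizes are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def init_attention(rng, d_query, d_context, d_out):
    """Random projection weights for one single-head attention module."""
    return {
        "q": rng.normal(size=(d_query, d_out)) * 0.1,
        "k": rng.normal(size=(d_context, d_out)) * 0.1,
        "v": rng.normal(size=(d_context, d_out)) * 0.1,
    }

def attend(x, context, p):
    """Queries come from x; keys and values come from context."""
    q, k, v = x @ p["q"], context @ p["k"], context @ p["v"]
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def iterative_encode(inputs, latents, cross_p, tower_p, num_iterations):
    # The same cross_p and tower_p are reused on every pass (weight sharing),
    # so unrolling more iterations adds compute but no parameters.
    for _ in range(num_iterations):
        latents = latents + attend(latents, inputs, cross_p)       # re-query the input
        for block_p in tower_p:
            latents = latents + attend(latents, latents, block_p)  # latent self-attention
    return latents

rng = np.random.default_rng(0)
M, C, N, D = 256, 4, 8, 16           # hypothetical sizes
inputs = rng.normal(size=(M, C))
latents0 = rng.normal(size=(N, D))   # learned in a real model, random here
cross_p = init_attention(rng, D, C, D)
tower_p = [init_attention(rng, D, D, D) for _ in range(2)]

out = iterative_encode(inputs, latents0, cross_p, tower_p, num_iterations=3)
print(out.shape)  # (8, 16)
```

Reusing the same parameters each time the loop runs is what gives the unrolled network its recurrent flavor: the latents act like a hidden state that is refined each time the input is read again.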
Because attention alone is permutation-invariant and carries no sense of position, the Perceiver injects position and modality information through Fourier feature position encodings appended to the inputs, rather than relying on a grid baked into the architecture [1][2]. Swapping these encodings is much of what lets one design serve different modalities.
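A rough sketch of Fourier feature position encodings for a small 2D grid follows. The specific choices here are assumptions for illustration (coordinates scaled to [-1, 1], frequencies spaced linearly up to half of a hypothetical `max_freq`, raw coordinates kept alongside the sine and cosine features) and may differ from the published configuration.

```python
import numpy as np

def fourier_features(positions, num_bands, max_freq):
    """positions: (M, d) coordinates in [-1, 1].

    Returns an (M, d * (2 * num_bands + 1)) array: the raw coordinates plus
    a sine and a cosine feature per frequency band and per dimension.
    """
    freqs = np.linspace(1.0, max_freq / 2.0, num_bands)           # (K,)
    angles = np.pi * positions[..., None] * freqs                  # (M, d, K)
    waves = np.concatenate([np.sin(angles), np.cos(angles)], -1)   # (M, d, 2K)
    waves = waves.reshape(positions.shape[0], -1)                  # (M, d * 2K)
    return np.concatenate([positions, waves], axis=-1)

# Hypothetical 4x4 image with 3 channels, flattened to 16 input elements.
h = w = 4
ys, xs = np.meshgrid(np.linspace(-1, 1, h), np.linspace(-1, 1, w), indexing="ij")
pos = np.stack([ys.ravel(), xs.ravel()], axis=-1)   # (16, 2)

enc = fourier_features(pos, num_bands=6, max_freq=4)
pixels = np.random.default_rng(0).normal(size=(h * w, 3))
inputs = np.concatenate([pixels, enc], axis=-1)     # encodings appended per element
print(enc.shape, inputs.shape)  # (16, 26) (16, 29)
```

Because position lives in the features rather than in the architecture, the same attention stack could be reused for audio or point clouds by generating `pos` with one or three coordinates per element instead of two.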
For ImageNet image classification, the published configuration attended to the full array of 50,176 pixels (a 224 by 224 image) and used a latent array of 512 latents with 1,024 channels, repeating a six-block latent Transformer eight times [5][6]. On ImageNet the Perceiver reached accuracy comparable to ResNet-50 and the Vision Transformer (ViT) without using 2D convolutions, by attending directly to the pixels [1]. It was also competitive across point clouds, audio, video, and combined video plus audio, including on the AudioSet benchmark [1].
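The savings implied by these numbers can be checked with a few lines of arithmetic. The sketch below counts attention-score entries per head for one layer, using only the sizes quoted above; the variable names are illustrative.

```python
height = width = 224
num_inputs = height * width        # 50,176 pixels
num_latents = 512
tower_repeats = 8                  # the latent Transformer is applied eight times
blocks_per_tower = 6

# Attention-score entries per head for a single layer of each kind.
full_self_attention = num_inputs ** 2         # pixels attending to pixels
cross_attention = num_latents * num_inputs    # latents attending to pixels
latent_self_attention = num_latents ** 2      # latents attending to latents

print(f"inputs:                {num_inputs:,}")
print(f"full self-attention:   {full_self_attention:,}")    # 2,517,630,976
print(f"cross-attention:       {cross_attention:,}")        # 25,690,112
print(f"latent self-attention: {latent_self_attention:,}")  # 262,144
print(f"latent blocks overall: {tower_repeats * blocks_per_tower}")  # 48
print(f"self/cross ratio:      {full_self_attention // cross_attention}")  # 98
```

One cross-attention over the pixels needs 98 times fewer score entries than a single layer of self-attention over the same pixels, and each of the 48 latent self-attention blocks is smaller again by a further factor of 98.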
A notable limitation of the original model was its output. The latent bottleneck made it natural to pool the latents into a single vector for classification, but it offered no clean way to emit large, structured outputs such as a per-pixel map or a sequence [3][4].
Perceiver IO, posted in mid-2021, addressed that limitation. Its full title is "Perceiver IO: A General Architecture for Structured Inputs & Outputs," and DeepMind's accompanying blog post appeared on 3 August 2021 [3][7]. The architecture keeps the same cross-attention encoder and latent self-attention core, then adds a cross-attention decoder. To read out a result, it constructs an output query array, one query per output element, and cross-attends those queries against the processed latents. The queries carry the structure and semantics the task needs, so the output array can be any size and shape the user specifies [3][4].
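The decoding step mirrors the encoder with the roles swapped: queries now come from the output side and keys and values from the latents. The NumPy sketch below shows that idea only; it is single-head, skips normalization and MLPs, and uses random arrays as hypothetical stand-ins for both the processed latents and the output queries, which a real model would learn or build from task-specific features.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def decode(output_queries, latents, w_q, w_k, w_v):
    """Cross-attention read-out: one output row per query row."""
    q = output_queries @ w_q                  # (O, E)
    k = latents @ w_k                         # (N, E)
    v = latents @ w_v                         # (N, E)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (O, N): linear in output size O
    return softmax(scores) @ v                # (O, E)

rng = np.random.default_rng(0)
N, D = 16, 8     # hypothetical processed latent array
Q, E = 5, 4      # hypothetical query width and output width
latents = rng.normal(size=(N, D))
w_q = rng.normal(size=(Q, E)) * 0.1
w_k = rng.normal(size=(D, E)) * 0.1
w_v = rng.normal(size=(D, E)) * 0.1

# The same latents and weights serve very different output sizes.
single = decode(rng.normal(size=(1, Q)), latents, w_q, w_k, w_v)     # e.g. one label
dense = decode(rng.normal(size=(1000, Q)), latents, w_q, w_k, w_v)   # e.g. a dense map
print(single.shape, dense.shape)  # (1, 4) (1000, 4)
```

Only the number of query rows changes between the two calls, which is why the output shape is a property of the task setup rather than of the core network.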
This makes the whole pipeline scale linearly with the size of both inputs and outputs, since neither the input nor the output ever participates in a quadratic self-attention step. Only the fixed latent array does [3][4]. The result is a single backbone that can be pointed at very different problems by changing the input preprocessing and the output queries rather than the core network.
DeepMind reported that Perceiver IO matched or beat a comparable BERT baseline on the GLUE language benchmark while operating directly on raw UTF-8 bytes, with no learned tokenizer; the byte-level vocabulary is just 256 byte values plus a handful of special symbols [3][4][8]. On the Sintel optical flow benchmark it reached state-of-the-art results with no flow-specific machinery such as explicit multiscale matching [3][4][8]. The same architecture was also applied to StarCraft II, to ImageNet classification, and to joint audio-video-label autoencoding of Kinetics videos [3][4].
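Byte-level input needs no learned tokenizer because the mapping from text to IDs is fixed. A minimal sketch, assuming the special symbols occupy the lowest IDs and that there are six of them (the hypothetical `NUM_SPECIAL` below; the source says only "a handful"):

```python
NUM_SPECIAL = 6                  # assumption: number of reserved special-symbol IDs
VOCAB_SIZE = 256 + NUM_SPECIAL   # 256 byte values plus the special symbols

def encode_bytes(text):
    """Map text to IDs: each UTF-8 byte is shifted past the special-symbol range."""
    return [b + NUM_SPECIAL for b in text.encode("utf-8")]

def decode_bytes(ids):
    """Invert encode_bytes, dropping any special-symbol IDs."""
    return bytes(i - NUM_SPECIAL for i in ids if i >= NUM_SPECIAL).decode("utf-8")

ids = encode_bytes("héllo")
print(len(ids))            # 6: "é" takes two bytes in UTF-8
print(decode_bytes(ids))   # héllo
```

The trade-off is sequence length: every character costs at least one input element, which is affordable here because the cross-attention encoder scales linearly with input size.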
| Aspect | Perceiver (ICML 2021) | Perceiver IO (2021) |
|---|---|---|
| Encoder | Cross-attention to latent array | Cross-attention to latent array |
| Core | Latent self-attention Transformer | Latent self-attention Transformer |
| Decoder / output | Pooled latents, classification only | Cross-attention with output queries, arbitrary structured outputs |
| Scaling | Linear in input size | Linear in input and output size |
| Example results | ImageNet comparable to ResNet-50 / ViT; AudioSet | GLUE on par with BERT (byte-level); Sintel optical flow SOTA |
Across the two papers, the architecture was demonstrated on a wide span of modalities and tasks: image classification, 3D point cloud classification, audio and video understanding, multimodal video plus audio, byte-level language tasks, optical flow estimation, multimodal autoencoding, and the symbolic-plus-spatial inputs of StarCraft II [1][3][4]. DeepMind framed it as "an off-the-shelf way to handle many kinds of data" without building a specialized system per domain [7]. The team released code for both models on GitHub, and Perceiver IO was later integrated into the Hugging Face Transformers library with variants for text classification, masked language modeling, image classification, optical flow, and multimodal autoencoding [4].
The Perceiver's latent-bottleneck idea spread beyond DeepMind's original demonstrations. The most cited example is the Perceiver Resampler inside Flamingo, DeepMind's 2022 visual language model. The Resampler takes a variable number of features from a frozen vision encoder and uses a set of learned latent queries to cross-attend over them, producing a small, fixed number of visual tokens (64 in practice) for the language model to consume. Flamingo's authors state that this component is based on the original Perceiver paper [9][10]. The same compress-with-learned-queries pattern reappears in other multimodal systems that need to feed many image or video features into a language model without quadratic cost.
DeepMind also extended the line directly with Perceiver AR, an autoregressive variant aimed at long-context generation [4]. More broadly, the work is often cited as an early argument that a single attention-based backbone, freed from modality-specific structure, can serve as a general perception module, an aim shared by later generalist efforts such as Gato, although Gato itself uses a conventional tokenized Transformer rather than Perceiver's latent bottleneck.
The Perceiver mattered less as a record-setting model on any one benchmark than as a demonstration that quadratic self-attention is not a hard requirement for applying Transformers to large, diverse inputs. By routing everything through a fixed latent array, it decoupled the cost of the core network from the size of the data and showed that one design could be competitive across images, audio, video, point clouds, and language at the same time. Perceiver IO turned that encoder into a full input-to-output system, and the latent-bottleneck-with-learned-queries motif it popularized became a reusable building block for efficient multimodal models that followed.