Infini-Attention
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,053 words
Infini-attention is an attention mechanism introduced by Google researchers Tsendsuren Munkhdalai, Manaal Faruqui, and Siddharth Gopal in the April 2024 paper "Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention." It augments standard transformer attention with a compressive associative memory that is updated linearly with every new chunk of tokens, allowing a model to attend over effectively unbounded context while keeping the per-step memory and compute budget bounded.[^1] Each transformer layer in an Infini-Transformer reuses its query, key, and value projections for two complementary attention paths inside one block: a regular masked dot-product attention over the local segment, and a linear attention retrieval over a fixed-size memory matrix that absorbs all earlier segments. A learned sigmoid gate per attention head mixes the two contexts before the output projection.[^1] The authors reported that a 1B model fine-tuned with Infini-attention solved 1M-token passkey retrieval, and that an 8B model produced state-of-the-art ROUGE scores on 500K-token book summarization with constant memory footprint per step.[^1]
The vanilla Transformer computes attention as a softmax over the full key matrix, so training compute grows quadratically with sequence length and the KV cache grows linearly during streaming inference.[^2] For a context of length L this means O(L^2) operations and O(L) cached states per layer, which becomes prohibitive when L is in the hundreds of thousands. A large branch of research has tried to soften that scaling by sparsifying the attention pattern (sliding windows, dilated and routed sparsity), by adding recurrent buffers that store hidden states between segments, or by replacing softmax with kernels that admit constant-state recurrent updates (linear attention).[^3]
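The scaling described above can be made concrete with a little arithmetic. The sketch below is illustrative only: the model shape (32 layers, 32 heads of dimension 128, 2 bytes per cached value) is a hypothetical configuration chosen for round numbers, not one taken from the paper.

```python
def kv_cache_bytes(seq_len, n_layers, n_heads, d_head, bytes_per_value=2):
    """Bytes needed to cache keys and values for seq_len tokens."""
    # Factor 2: one tensor for keys and one for values.
    return 2 * seq_len * n_layers * n_heads * d_head * bytes_per_value


def attention_score_count(seq_len):
    """Query-key scores per head under a causal mask: L * (L + 1) / 2."""
    return seq_len * (seq_len + 1) // 2


# Hypothetical model shape, used only for illustration.
config = dict(n_layers=32, n_heads=32, d_head=128)

for seq_len in (8_192, 131_072, 1_048_576):
    gib = kv_cache_bytes(seq_len, **config) / 2**30
    scores = attention_score_count(seq_len)
    print(f"L={seq_len:>9,}  KV cache = {gib:6.1f} GiB  scores per head = {scores:.2e}")
```

With these assumed dimensions the cache costs 0.5 MiB per token, so it grows from 4 GiB at 8,192 tokens to 512 GiB at 1,048,576 tokens, while the number of attention scores grows with the square of the length.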
Three lines of work directly anticipated Infini-attention. Transformer-XL kept the previous segment's keys and values as an extra read-only buffer accessible to the next segment, which extended the effective context but still scaled storage linearly with how far back the model could see.[^4] The Compressive Transformer from DeepMind, introduced by Jack Rae and colleagues in November 2019, added a coarse compressed cache below Transformer-XL's recurrence buffer; the compression reduced storage but used dedicated learned compression networks per layer.[^5] The Memorizing Transformer and other retrieval-based architectures stored explicit key-value pairs in a large external bank and used k-nearest-neighbour lookup at attention time, which scaled compute sublinearly but required a database alongside the model.[^6] Infini-attention sits in the same family but stores its long-term memory as a single fixed-shape state matrix per layer that is updated in place, rather than as an explicit cache of past activations or an external index. The matrix is recurrent state, not a learned parameter.[^1]
A key building block is the linear attention reformulation introduced by Katharopoulos, Vyas, Pappas, and Fleuret in 2020, which rewrites attention using a kernel feature map and then exploits the associativity of matrix multiplication.[^7] Once softmax(QK^T)V is approximated by phi(Q)(phi(K)^T V), the product phi(K)^T V is a single matrix that can be accumulated incrementally as new tokens arrive. This realises a recurrent update with constant state, sometimes called an associative memory.[^7] The same matrix update can be improved by a delta rule that subtracts an estimate of what the memory already predicts for a key before writing, reducing redundant overwriting; this idea recurs across linear-attention variants and is the basis of Infini-attention's compressive memory update.[^1]
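The regrouping can be checked numerically. The following is a minimal sketch, assuming the ELU(x) + 1 feature map for phi and leaving out normalisation and causal masking so that only the associativity argument is shown; the tensor sizes are arbitrary.

```python
import numpy as np


def phi(x):
    """ELU(x) + 1 feature map (assumed here); positive for every input."""
    return np.where(x > 0, x + 1.0, np.exp(x))


rng = np.random.default_rng(0)
n_tokens, d_key, d_value = 6, 4, 3
Q = rng.normal(size=(n_tokens, d_key))
K = rng.normal(size=(n_tokens, d_key))
V = rng.normal(size=(n_tokens, d_value))

# Quadratic grouping: materialise the full (n_tokens, n_tokens) score matrix.
quadratic = (phi(Q) @ phi(K).T) @ V

# Linear grouping: keep only a (d_key, d_value) state, built one token at a time.
state = np.zeros((d_key, d_value))
for k_t, v_t in zip(K, V):
    state += np.outer(phi(k_t), v_t)
linear = phi(Q) @ state

print(np.allclose(quadratic, linear))
```

The state matrix has the same shape no matter how many tokens have been folded into it, which is what makes the recurrent form attractive for streaming.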
Tsendsuren Munkhdalai had previously worked on memory-augmented neural networks (the Neural Semantic Encoder and related models) before joining Google, where Manaal Faruqui and Siddharth Gopal led related long-context research in the same team. The Infini-attention preprint appeared on arXiv on 10 April 2024 and was revised on 9 August 2024.[^1] Coverage in technology outlets described it explicitly as a Google research result framed around enabling infinite context language models.[^8]
Inside an Infini-Transformer block, a sequence is processed segment by segment. For segment s of length N the layer first computes the usual query, key, and value projections Q_s, K_s, V_s from the input tokens. Two attention outputs are then produced from the same projections.[^1]
The first is local causal attention. The block computes A_dot = softmax(Q_s K_s^T / sqrt(d_model)) V_s with a causal mask, exactly as in standard attention. This handles the immediate context of the current segment, with a small fixed window cost.[^1]
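A minimal single-head NumPy sketch of this local path is shown below. The function name and the `scale_dim` parameter are illustrative; `scale_dim` stands for whatever dimension appears under the square root (d_model in the formula above).

```python
import numpy as np


def causal_softmax_attention(Q, K, V, scale_dim):
    """Masked dot-product attention over one segment.

    Q, K: (N, d_key); V: (N, d_value); scale_dim: dimension under the sqrt.
    """
    n = Q.shape[0]
    scores = Q @ K.T / np.sqrt(scale_dim)
    # Causal mask: position i may only attend to positions j <= i.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Numerically stable row-wise softmax.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```

Because of the mask, the first position can only see itself, so its output is exactly the first value row, and changing a later token never alters an earlier output.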
The second is memory retrieval. A persistent memory matrix M_{s-1} of shape (d_key, d_value) and a normalisation vector z_{s-1} of length d_key have been carried over from the previous segment. The layer applies the linear-attention feature map sigma(x) = ELU(x) + 1 to each query, giving sigma(Q_s) of shape (N, d_key), and retrieves the long-term context as A_mem = sigma(Q_s) M_{s-1} / (sigma(Q_s) z_{s-1}). The denominator implements the same normalisation that softmax provides in standard attention, but using accumulated key sums rather than a per-step exponential normaliser.[^1]
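The retrieval formula can be sketched in a few lines of NumPy. The helper names are illustrative, and the sketch assumes the memory has already been written at least once so that the denominator is non-zero; how an empty memory is handled is an implementation choice not specified here.

```python
import numpy as np


def elu_plus_one(x):
    """Feature map sigma(x) = ELU(x) + 1."""
    return np.where(x > 0, x + 1.0, np.exp(x))


def retrieve_from_memory(Q, M_prev, z_prev):
    """A_mem = sigma(Q) M / (sigma(Q) z), one row per query."""
    sq = elu_plus_one(Q)                  # (N, d_key)
    numerator = sq @ M_prev               # (N, d_value)
    denominator = sq @ z_prev             # (N,)
    return numerator / denominator[:, None]


# A memory holding a single binding returns that value for any query.
k = np.array([[0.5, -1.0, 2.0]])
v = np.array([[1.0, 2.0]])
M = elu_plus_one(k).T @ v
z = elu_plus_one(k).sum(axis=0)
q = np.array([[-0.3, 0.7, 0.1]])
print(retrieve_from_memory(q, M, z))      # equals v up to floating-point rounding
```

With several bindings stored, each read becomes a positively weighted average of the stored values, which is the sense in which the denominator plays the role of the softmax normaliser.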
The two outputs are fused with a per-head sigmoid gate that interpolates between them: A = sigmoid(beta) * A_mem + (1 - sigmoid(beta)) * A_dot, where beta is a scalar parameter learned per attention head. After fusion the result is projected through the output matrix in the usual way. The added parameter cost is small: one scalar gate per head per layer.[^1]
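The gating step is simple enough to sketch directly. The tensor layout below (heads first) is an assumption made for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def fuse(A_mem, A_dot, beta):
    """Blend memory and local outputs with one scalar gate per head.

    A_mem, A_dot: (n_heads, N, d_value); beta: (n_heads,).
    """
    g = sigmoid(beta)[:, None, None]      # broadcast over positions and channels
    return g * A_mem + (1.0 - g) * A_dot


# Three heads: strongly local, evenly mixed, strongly memory-driven.
A_mem = np.ones((3, 2, 4))
A_dot = np.zeros((3, 2, 4))
fused = fuse(A_mem, A_dot, np.array([-20.0, 0.0, 20.0]))
print(fused[:, 0, 0])                     # gate value for each head
```

A strongly negative beta leaves a head almost purely local, a strongly positive beta makes it almost purely memory-driven, and beta = 0 averages the two paths.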
After the segment has been processed, the memory and normaliser are updated to absorb the new keys and values. The paper's preferred update is a linear delta rule: M_s <- M_{s-1} + sigma(K_s)^T (V_s - sigma(K_s) M_{s-1} / (sigma(K_s) z_{s-1})), z_s <- z_{s-1} + sum_t sigma(K_{s,t}).[^1]
The first equation writes the residual between the current values and what the memory already predicts for the new keys, multiplied by the feature-mapped keys; this is a generalisation of the classical delta rule used in associative-memory networks. If the bindings (K_s, V_s) are already accurately stored, the residual is near zero and the memory is barely changed, which the authors describe as preventing redundant updates.[^1] The vector z_s tracks the sum of feature-mapped keys ever written, providing the denominator that converts unnormalised memory reads into proper averages.[^1]
The paper also reports a simpler additive variant M_s <- M_{s-1} + sigma(K_s)^T V_s without the delta term; the delta-rule version performed slightly better on long-context language modelling but both worked.[^1]
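Both update rules can be sketched together. This is a minimal single-head illustration, not the authors' code; in particular, falling back to the additive write when the memory is still empty is an assumption made here to avoid dividing by a zero normaliser.

```python
import numpy as np


def elu_plus_one(x):
    """Feature map sigma(x) = ELU(x) + 1."""
    return np.where(x > 0, x + 1.0, np.exp(x))


def update_memory(M_prev, z_prev, K, V, use_delta=True):
    """Return (M_s, z_s) after absorbing one segment's keys and values."""
    sk = elu_plus_one(K)                              # (N, d_key)
    target = V
    # Assumption: an empty memory (z == 0) predicts nothing, so the first
    # write uses the additive form instead of dividing by zero.
    if use_delta and np.any(z_prev):
        predicted = (sk @ M_prev) / (sk @ z_prev)[:, None]
        target = V - predicted                        # residual to write
    M_new = M_prev + sk.T @ target
    z_new = z_prev + sk.sum(axis=0)
    return M_new, z_new


# Write one binding, then write the same binding a second time.
k = np.array([[0.5, -1.0, 2.0]])
v = np.array([[1.0, 2.0]])
M0, z0 = np.zeros((3, 2)), np.zeros(3)
M1, z1 = update_memory(M0, z0, k, v)
M_delta, _ = update_memory(M1, z1, k, v, use_delta=True)
M_add, _ = update_memory(M1, z1, k, v, use_delta=False)
print(np.allclose(M_delta, M1), np.allclose(M_add, 2 * M1))
```

In this toy case the delta rule sees a zero residual on the second write and leaves the matrix unchanged, while the additive variant doubles it, which illustrates the redundant-update behaviour described above.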
Because M_s has fixed shape (d_key, d_value) regardless of how many tokens have been processed, the per-step memory footprint of the long-term context is constant. The local attention path scales as O(N^2) inside the segment but N is held fixed (commonly 2048 in the paper's experiments), so the cost per token is constant in the global sequence length. The model can stream over arbitrarily long inputs at constant compute and memory per token, which is the property the title refers to as "infinite context."[^1]
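The constant-state property can be illustrated with a short streaming loop. The sketch covers only the memory write path, uses the simpler additive update, and omits the local attention path entirely; the function name is illustrative.

```python
import numpy as np


def elu_plus_one(x):
    """Feature map sigma(x) = ELU(x) + 1."""
    return np.where(x > 0, x + 1.0, np.exp(x))


def stream_segments(K, V, segment_len):
    """Fold a long (L, d) key/value stream into a fixed-size memory."""
    d_key, d_value = K.shape[1], V.shape[1]
    M = np.zeros((d_key, d_value))
    z = np.zeros(d_key)
    for start in range(0, K.shape[0], segment_len):
        sk = elu_plus_one(K[start:start + segment_len])
        M += sk.T @ V[start:start + segment_len]      # additive write
        z += sk.sum(axis=0)
    return M, z


rng = np.random.default_rng(0)
for length in (1_000, 10_000):
    K = rng.normal(size=(length, 8))
    V = rng.normal(size=(length, 4))
    M, z = stream_segments(K, V, segment_len=128)
    print(length, M.shape, z.shape)
```

The printed shapes are identical for both stream lengths: only the number of loop iterations grows with the input, not the state carried between segments.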
The paper's ablation reports compression ratios relative to retrieval baselines. On the long-context language modelling experiments Infini-attention used 114 times less memory than a Memorizing Transformer with a 65K-token external cache while improving perplexity, illustrating the storage advantage of the fixed-size matrix.[^1]
The sigmoid gate is the only mechanism the model has to choose between local and global context at each head, and the authors argue it lets different heads specialise. In their analysis, heads fell into two types after training: specialised heads, whose gate values settled close to 0 (behaving like standard local-attention heads) or close to 1 (behaving like memory-only heads), and "mixer" heads with intermediate gate values that blend both signals.[^1]
The gate is initialised so the model starts roughly in the standard-attention regime, which the authors say lets continual pretraining warm up without immediately disrupting the base model. Because the gate is a single learned scalar per head rather than a function of the input, the mixing is not data-dependent at the token level; this is one of the design choices that downstream reproductions found problematic when trying to train the gate to actually move (see Limitations).[^9]
Infini-attention is presented as a drop-in modification to standard pretrained transformer layers. The authors continued pretraining base models with the Infini-attention modification, using the Adafactor optimiser and a learning rate around 1e-4 for the experiments reported.[^1]
Two training regimes are described. Long-context language modelling experiments on PG19 and Arxiv-math, including a run at 100K sequence length, were used to validate the approach on raw next-token prediction. For the 1M passkey and 500K BookSum tasks, the authors warm-started from existing 1B and 8B models, continued pretraining with the Infini-attention path active for a relatively small number of steps, then fine-tuned on task-specific data (5K-length passkey sequences and BookSum chapters respectively).[^1] The paper notes that gradients flow through the memory recurrence during training in a manner similar to truncated backpropagation through time used in recurrent neural networks.[^1]
On long-form text modelling Infini-Transformer was evaluated on PG19 (long fiction books from Project Gutenberg, the benchmark introduced alongside the Compressive Transformer) and Arxiv-math (mathematical preprints). With a segment length of 2048 tokens the model reached perplexity 9.65 on PG19 and around 2.23 on Arxiv-math, beating the Memorizing Transformer's 11.37 PG19 perplexity at 65K external memory, despite the Infini-Transformer using a fixed-size memory matrix that the paper reports as roughly 114 times smaller than the Memorizing Transformer's KV bank.[^1]
These language-modelling experiments are important because they were the only ones in the paper run from a clean training configuration on a public benchmark, and they used segment lengths short enough to admit many memory-update steps per document. Improving over Transformer-XL and Memorizing Transformer baselines on PG19 at substantially smaller persistent state suggests that the linear-attention delta memory is genuinely doing useful compression rather than serving as dead weight that the model routes around via the local softmax path.[^1]
The authors also reported a training run at 100K sequence length, showing that Infini-attention scaled cleanly to that input size with the same architectural pattern.[^1]
The headline result is passkey retrieval at sequence lengths up to one million tokens, the standard needle-in-a-haystack style stress test for long-context models. The paper trained a 1B-parameter model with Infini-attention on 5K-length passkey sequences for 400 fine-tuning steps, then evaluated retrieval at much longer test lengths.[^1]
After fine-tuning, the 1B Infini-Transformer reached effectively 100% accuracy across all tested lengths (32K, 128K, 256K, 512K, and 1M) and across passkey positions (start, middle, end of the document).[^1] In the zero-shot setting at 1M with the model only continually pretrained but not task-fine-tuned, accuracy was much lower at start positions but high near the end of the document, indicating that fine-tuning was necessary to teach the model to actually use the compressive memory.[^1]
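To make the task format concrete, the sketch below builds a toy passkey prompt with the needle placed at a chosen relative position. The filler sentences, the needle wording, and the function name are all illustrative assumptions and are not claimed to match the exact template used in the paper.

```python
import random

# Illustrative distractor text; the real benchmark wording may differ.
FILLER = "The grass is green. The sky is blue. The sun is yellow. "


def make_passkey_prompt(n_filler, position, seed=0):
    """Build a haystack of repeated filler with one passkey sentence inserted.

    position is a fraction in [0, 1]: 0.0 = start, 0.5 = middle, 1.0 = end.
    Returns the prompt text and the passkey the model is expected to recall.
    """
    rng = random.Random(seed)
    passkey = rng.randint(10_000, 99_999)
    needle = f"The pass key is {passkey}. Remember it. "
    chunks = [FILLER] * n_filler
    chunks.insert(round(position * n_filler), needle)
    prompt = "".join(chunks) + "What is the pass key?"
    return prompt, passkey


for position in (0.0, 0.5, 1.0):
    prompt, passkey = make_passkey_prompt(n_filler=1_000, position=position)
    offset = prompt.index(str(passkey)) / len(prompt)
    print(f"position={position:.1f}  passkey={passkey}  relative offset={offset:.2f}")
```

Raising `n_filler` lengthens the haystack without changing the needle, which is how evaluation at longer test lengths than the fine-tuning length can be set up.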
On the BookSum benchmark, an 8B Infini-Transformer continually pretrained from a base model and then fine-tuned for summarization was evaluated by feeding entire books up to 500K tokens into the model in a single pass. It reported ROUGE scores of 40.0 R1, 8.8 R2, and 17.9 RL, which the paper presents as state-of-the-art on BookSum at the time, outperforming retrieval-based and hierarchical-summarization baselines such as BART + Unlimiformer and PRIMERA on the same task.[^1]
| Task | Model | Segment / context | Headline metric |
|---|---|---|---|
| PG19 long-form LM | Infini-Transformer | 2048-token segments | 9.65 perplexity[^1] |
| Arxiv-math LM | Infini-Transformer | 2048-token segments | ~2.23 perplexity[^1] |
| Passkey retrieval | Infini-Transformer 1B (fine-tuned) | up to 1M tokens | ~100% at all positions[^1] |
| BookSum summarization | Infini-Transformer 8B | up to 500K tokens | ROUGE 40.0 / 8.8 / 17.9 (R1/R2/RL)[^1] |
Google did not release weights or training code for Infini-attention, which constrained verification of the published results.[^9] Several open-source efforts attempted to reproduce or build on the mechanism in the months after the preprint.
Independent researcher jlamprou released a PyTorch implementation in 2024 targeting Hugging Face Transformers integration. The repository contains the Infini-attention module, a modified Qwen1.5-MoE-A2.7B modelling file, a causal-LM training script, and a 1M-context passkey retrieval evaluation harness, with a segmented batch collator to handle the recurrent memory state.[^10] The author reports that once the learnable beta gates had received enough training data, the model recovered accuracy comparable to standard scaled-dot-product attention on the sequence lengths they tested, while flagging the work as "definitely not production-ready" and noting fundamental limitations in the non-learnable compression scheme.[^10]
Another community port by the Korean researcher Beomi, called InfiniTransformer, adapted Gemma and Llama 3 to use Infini-attention layers. The repository offers two flavours: an attention-layer-only modification suitable for swapping into an existing model, and a wider model and training stack with rewritten forward passes for segmented streaming.[^11]
A work-in-progress pull request to the Hugging Face Transformers library by GitHub user joey00072 added Infini-attention support to Llama in PyTorch, aiming to make the architecture available for fine-tuning experiments inside the standard transformers training stack.[^12]
The most detailed external reproduction was a HuggingFace blog post published 14 August 2024 by Leandro von Werra, Lewis Tunstall, Nouamane Tazi, Omar Sanseviero, Pedro Cuenca, and Norm, with contributions from a larger team.[^9] They continued pretraining Llama 3 8B with the Infini-attention modification on FineWeb data at 8192-token contexts with two rollouts of 4096-token segments, and ran companion 200M-scale experiments to debug the architecture in detail.[^9]
The HuggingFace team identified several practical problems. The balance factor beta failed to move during training under standard hyperparameters because Adam's gradient normalisation combined with a learning rate of 3e-4 and weight decay 0.1 kept the sigmoid output stuck near 0.5 across roughly 95% of heads. They raised the gating learning rate to 0.01, removed weight decay on the gates, increased rollouts from four to sixteen, and shortened segments to 64 tokens for the small-scale runs. With these adjustments the gates spread across the (0, 1) range, but convergence was still incomplete after 18000 steps and the 200M model could not reliably retrieve information across segment boundaries even when in-segment retrieval worked at 100%.[^9] Their stated conclusion was that "Infini-attention's performance gets worse as we increase the number of times we compress the memory" and that ring attention, YaRN and RoPE scaling were better choices for extending pretrained models to longer context windows.[^9]
The architectural value of Infini-attention is that the dominant cost of streaming inference no longer grows with how much context has already been seen. For applications such as long-document question answering, log analysis, codebase navigation, audio or video transcript reasoning, and long agent trajectories, an Infini-attention layer in principle lets a model continue indefinitely with the same per-token cost. This is a different operating point from external retrieval over a vector database, which scales sublinearly in time but requires explicit chunking and a separate retrieval stack.[^1]
The reported 8B BookSum result is notable because the entire book is fed in a single pass rather than chunked and recursively summarised, which is the standard approach used by long-document summarisation pipelines that target a small context window.[^1]
Several secondary sources speculated that Infini-attention or related techniques underpinned the long-context capabilities of Gemini 1.5 and later Gemini releases, which expose 1M-token and (in Gemini 1.5 Pro previews) 2M-token windows.[^8][^13] Google has not confirmed which long-context mechanism actually backs Gemini in production, and the published Gemini technical reports do not name Infini-attention; the connection is therefore an industry inference rather than a stated fact.[^8] Coverage in Towards AI explicitly framed Infini-attention as the technique "powering Gemini 2M token window," but this remains a journalistic attribution rather than a confirmation by the company.[^13]
Beyond any specific product deployment, Infini-attention sharpened a broader research move toward hybrid attention architectures that combine a local softmax window with a global linear-attention or state-space pathway. Later architectures such as Gated DeltaNet, Mamba-2, RWKV-7, and Kimi Linear all use a fixed-state recurrent memory updated by a delta-style or gated rule, with softmax attention replaced or augmented at sequence-level scale.[^14]
Because Google released neither model weights nor training code, the headline 1M passkey and 500K BookSum numbers have not been independently verified at scale.[^9] Independent reproductions at smaller scale (200M, 8B Llama 3 continually pretrained) found that the model could not reliably do cross-segment needle retrieval and that performance degraded as more memory-update steps separated the relevant context from the query.[^9]
The mechanism intrinsically encodes a recency bias. Recent tokens are addressed at full token-level resolution via softmax attention, while everything beyond the current segment is addressed only through the compressed linear-attention matrix. This asymmetry assumes recent context is more relevant than distant context, an assumption that breaks for tasks requiring symmetric access across the document such as fact lookup in a randomly chosen page.[^8] The 1M passkey result is consistent with this view in the zero-shot setting, where retrieval accuracy was high near the end of the document and low at the start before task-specific fine-tuning was added.[^1]
The HuggingFace team's debugging of the beta gate showed that the model has only a thin gradient signal for moving the local-vs-global mix, and that standard optimiser settings can leave the gates effectively non-functional. This is not described as a failure of the algorithm in principle but as a sign that careful per-parameter hyperparameter tuning is needed for the gate to do its job.[^9]
Because the entire history is squeezed into a fixed-shape matrix M of size d_key by d_value per layer, there is a hard upper bound on the amount of distinct content the memory can store before older information is overwritten by the delta rule. Practical evidence from reproductions suggests the effective capacity is much smaller than the storage capacity implied by the matrix dimensions, especially when later writes share key directions with earlier ones.[^9]
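Interference in a fixed-size memory can be seen even in a toy setting. The sketch below writes random bindings with the additive update and reads each key back; it is a synthetic illustration with arbitrary dimensions and random data, and does not reproduce the measurements reported by the reproductions.

```python
import numpy as np


def elu_plus_one(x):
    """Feature map sigma(x) = ELU(x) + 1."""
    return np.where(x > 0, x + 1.0, np.exp(x))


def retrieval_error(n_bindings, d_key=16, d_value=16, seed=0):
    """Write n_bindings random (key, value) pairs, then read each key back.

    Returns the mean absolute difference between readout and stored value.
    """
    rng = np.random.default_rng(seed)
    K = rng.normal(size=(n_bindings, d_key))
    V = rng.normal(size=(n_bindings, d_value))
    sk = elu_plus_one(K)
    M = sk.T @ V                          # additive write of every binding
    z = sk.sum(axis=0)
    readout = (sk @ M) / (sk @ z)[:, None]
    return float(np.mean(np.abs(readout - V)))


for n in (1, 4, 16, 64, 256):
    print(f"{n:>4} bindings  mean abs error = {retrieval_error(n):.3f}")
```

A single binding is read back exactly. Because the feature map makes every key positive, any two stored keys overlap, so each additional binding contaminates the readout of the others even when the number of bindings is below d_key.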
For applied users the most practical limitation is that no Infini-attention model has been released by Google for download, evaluation, or fine-tuning. Practitioners wanting to use the architecture must rely on third-party reproductions whose authors themselves report partial success.[^10][^9]
Infini-attention belongs to a wider family of long-context attention modifications. The table below summarises how it differs from the most closely related approaches.
| Approach | Memory mechanism | Per-token cost | Capacity bound | Notes |
|---|---|---|---|---|
| Transformer-XL (Dai et al., 2019) | Cached keys and values of previous segment | O(N + M) per token where M is cache length | Linear in cache | Read-only previous-segment KV[^4] |
| Compressive Transformer (Rae et al., 2019) | Coarse compressed cache below recurrence buffer | O(N + M + M_compressed) | Linear, but compressed | Learned compression network per layer[^5] |
| Memorizing Transformer (Wu et al., 2022) | External k-NN cache of KV pairs | O(N + k) | Up to 65K KV pairs in paper | Retrieval at attention time[^6] |
| Recurrent Memory Transformer (Bulatov et al., 2022) | Special memory tokens passed between segments | O(N) | Bounded by token count | Architecture-agnostic[^15] |
| Infini-attention (Munkhdalai et al., 2024) | Fixed (d_key, d_value) matrix updated by delta rule | O(N) per token | Fixed matrix bound | Hybrid local softmax + linear-attention memory[^1] |
| Mamba (Gu and Dao, 2023) | Selective state-space recurrence | O(N) per token | Fixed hidden state | No softmax attention[^16] |
| RWKV (Peng et al., 2023) | Linear recurrent RNN-like update | O(N) per token | Fixed state | Token-mix and channel-mix architecture[^17] |
| Ring Attention (Liu et al., 2023) | Distributed full attention across devices | O(N^2 / P) per device | Same as standard attention | Exact attention via blockwise communication[^18] |
Mamba and RWKV also reach O(N)-per-token cost at long context, but they do so by abandoning softmax attention. Mamba uses a selective state space model in which the transition matrix becomes input-dependent, giving a constant-state recurrence with stronger expressivity than vanilla linear attention.[^16] RWKV uses a token-mixing recurrence that interpolates between attention-like and RNN-like behaviour.[^17] Infini-attention is architecturally closer to a hybrid: it keeps the standard softmax attention path inside each segment and adds a linear-attention long-term memory beside it, rather than replacing softmax entirely.
A practical consequence is that an Infini-Transformer can be initialised from a standard pretrained transformer with minimal added parameters, so existing checkpoints can in principle be upgraded by continued pretraining, while Mamba and RWKV require training from scratch (or extensive distillation) to convert from a softmax-attention base.[^1]
Other strategies extend the context of an existing softmax-attention transformer without modifying the architecture. Ring Attention partitions the full quadratic attention across devices, yielding exact softmax at long context at the cost of substantial hardware (the HuggingFace team noted Llama 3 8B at 1M context needs around 512 GPUs at batch size 1).[^9] YaRN and RoPE scaling instead modify positional encodings to let a model trained at, for example, 8K tokens generalise to longer windows; these are zero-additional-parameter tweaks but do not address the underlying O(L^2) compute cost during training.[^9]
Retrieval-augmented generation pipelines, by contrast, never feed the whole history into a single attention pass; they chunk it, embed each chunk, and retrieve a few chunks to place in the prompt. These pipelines scale to arbitrary document sizes but trade off granularity: a token-level needle may not survive chunking, while Infini-attention in principle keeps a token-level memory inside the model.[^1]