Infini-Attention

Updated Jul 23, 2026

Infini-attention is an attention mechanism introduced by Google researchers Tsendsuren Munkhdalai, Manaal Faruqui, and Siddharth Gopal in the April 2024 paper "Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention."^[1] It augments standard transformer attention with a compressive associative memory that is updated linearly with every new chunk of tokens, allowing a model to attend over effectively unbounded context while keeping the per-step memory and compute budget bounded.^[1] In the authors' words, the technique "incorporates a compressive memory into the vanilla attention mechanism and builds in both masked local attention and long-term linear attention mechanisms in a single Transformer block," so that a Transformer-based large language model can be scaled "to infinitely long inputs with bounded memory and computation."^[1] Each transformer layer in an Infini-Transformer reuses its query, key, and value projections for two complementary attention paths inside one block: a regular masked dot-product attention over the local segment, and a linear attention retrieval over a fixed-size memory matrix that absorbs all earlier segments. A learned sigmoid gate per attention head mixes the two contexts before the output projection.^[1] The authors reported that a 1B model fine-tuned with Infini-attention solved 1M-token passkey retrieval, and that an 8B model produced state-of-the-art ROUGE scores on 500K-token book summarization with constant memory footprint per step.^[1]

What problem does Infini-attention solve?

Why does standard attention struggle with long context?

The vanilla Transformer computes attention as a softmax over the full key matrix, so attention compute grows quadratically with sequence length while the KV cache used for streaming inference grows linearly.^[2] For a context of length $L$ this means $O(L^2)$ operations and $O(L)$ cached states per layer, which becomes prohibitive when $L$ is in the hundreds of thousands. A large branch of research has tried to soften that scaling by sparsifying the attention pattern (sliding windows, dilated and routed sparsity), by adding recurrent buffers that store hidden states between segments, or by replacing softmax with kernels that admit constant-state recurrent updates (linear attention).^[3] Infini-attention targets the same bottleneck but, unlike a sparsity pattern or a longer context window, it removes the dependence of per-step memory and compute on how much context has already been processed, holding both bounded as the input grows without limit.^[1]
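The scaling described above can be made concrete with a little arithmetic. The sketch below counts attention-score entries and cached key/value floats for one layer; the head count and head width are illustrative assumptions, not figures from the paper, and the causal mask's factor of roughly two is ignored.

```python
def attention_costs(seq_len: int, d_head: int = 128, n_heads: int = 8):
    """Return (score_entries, kv_cache_floats) for one layer of vanilla attention.

    d_head and n_heads are hypothetical sizes chosen only for illustration.
    """
    # One score per (query, key) pair per head: quadratic in sequence length.
    score_entries = n_heads * seq_len * seq_len
    # One key and one value vector per past token per head: linear in length.
    kv_cache_floats = 2 * n_heads * seq_len * d_head
    return score_entries, kv_cache_floats


for L in (2_048, 131_072, 1_048_576):
    scores, cache = attention_costs(L)
    print(f"L={L:>9,}  scores={scores:.3e}  kv_cache_floats={cache:.3e}")
```

Doubling the sequence length quadruples the score count but only doubles the cache, which is why the two costs are usually discussed separately.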

What prior architectures led to Infini-attention?

Three lines of work directly anticipated Infini-attention. Transformer-XL kept the previous segment's keys and values as an extra read-only buffer accessible to the next segment, which extended the effective context but still scaled storage linearly with how far back the model could see.^[4] The Compressive Transformer from DeepMind, introduced by Jack Rae and colleagues in November 2019, added a coarse compressed cache below Transformer-XL's recurrence buffer; the compression reduced storage but used dedicated learned compression networks per layer.^[5] The Memorizing Transformer and other retrieval-based architectures stored explicit key-value pairs in a large external bank and used k-nearest-neighbour lookup at attention time, which scaled compute sublinearly but required a database alongside the model.^[6] Infini-attention sits in the same family but stores its long-term memory as a single fixed-shape parameter matrix per layer that is updated in place, rather than as an explicit cache of past activations or an external index.^[1]

What is linear attention and the delta rule?

A key building block is the linear attention reformulation introduced by Katharopoulos, Vyas, Pappas, and Fleuret in 2020, which rewrites attention in terms of a kernel feature map and then exploits the associativity of matrix multiplication.^[7] Once $\mathrm{softmax}(QK^\top)V$ is approximated by $\phi(Q)(\phi(K)^\top V)$, the inner product $\phi(K)^\top V$ is a single matrix that can be accumulated incrementally as new tokens arrive. This realises a recurrent update with constant state, sometimes called an associative memory.^[7] The same matrix update can be improved by a delta rule that subtracts an estimate of what the memory already predicts for a key before writing, reducing redundant overwriting; this idea recurs across linear-attention variants and is the basis of Infini-attention's compressive memory update.^[1]
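A minimal NumPy sketch of that associativity argument follows. It computes causal linear attention twice, once through the full token-by-token similarity matrix and once through a running state, and checks that the two agree. The feature map is ELU(x) + 1; the sizes, the random inputs, and the names `S` and `z` are assumptions made for the illustration.

```python
import numpy as np


def phi(x):
    """ELU(x) + 1 feature map: strictly positive, so normalisers stay positive."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


rng = np.random.default_rng(0)
T, d_k, d_v = 6, 4, 3  # toy sizes, chosen arbitrarily
Q = rng.normal(size=(T, d_k))
K = rng.normal(size=(T, d_k))
V = rng.normal(size=(T, d_v))

# Quadratic form: build the full T x T causal similarity matrix.
scores = np.tril(phi(Q) @ phi(K).T)
quadratic = (scores @ V) / scores.sum(axis=1, keepdims=True)

# Recurrent form: carry only S = sum_t phi(k_t)^T v_t and z = sum_t phi(k_t).
S = np.zeros((d_k, d_v))
z = np.zeros(d_k)
recurrent = np.zeros((T, d_v))
for t in range(T):
    S += np.outer(phi(K[t]), V[t])
    z += phi(K[t])
    recurrent[t] = (phi(Q[t]) @ S) / (phi(Q[t]) @ z)

print(np.allclose(quadratic, recurrent))
```

The recurrent form never materialises anything larger than the `d_k` by `d_v` state, which is the property a compressive memory relies on.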

Who created Infini-attention?

Tsendsuren Munkhdalai had previously worked on memory-augmented neural networks (the Neural Semantic Encoder and related models) before joining Google, where Manaal Faruqui and Siddharth Gopal led related long-context research in the same team. The Infini-attention preprint appeared on arXiv on 10 April 2024 and was revised on 9 August 2024.^[1] Coverage in technology outlets described it explicitly as a Google research result framed around enabling infinite context language models.^[8]

How does Infini-attention work?

What happens inside the Infini-attention block?

Inside an Infini-Transformer block, a sequence is processed segment by segment. For segment s of length N the layer first computes the usual query, key, and value projections $Q_s, K_s, V_s$ from the input tokens. Two attention outputs are then produced from the same projections.^[1]

The first is local causal attention. The block computes $A_{\text{dot}} = \mathrm{softmax}(Q_s K_s^\top / \sqrt{d_{\text{model}}}) V_s$ with a causal mask, exactly as in standard attention. This handles the immediate context of the current segment, with a small fixed window cost.^[1]

The second is memory retrieval. A persistent memory matrix $M_{s-1}$ of shape $(d_{\text{key}}, d_{\text{value}})$ and a normalisation vector $z_{s-1}$ of length $d_{\text{key}}$ have been carried over from the previous segment. The layer applies the linear-attention feature map $\sigma(x) = \mathrm{ELU}(x) + 1$ to each query, giving $\sigma(Q_s)$ of shape $(N, d_{\text{key}})$, and retrieves the long-term context as $A_{\text{mem}} = \frac{\sigma(Q_s) M_{s-1}}{\sigma(Q_s) z_{s-1}}$. The denominator implements the same normalisation that softmax provides in standard attention, but using accumulated key sums rather than a per-step exponential normaliser.^[1]
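The retrieval formula can be sketched in a few lines of NumPy for a single head. The function name `retrieve`, the small `eps` guard in the denominator, and the way the toy memory is filled are assumptions for illustration only; the paper does not prescribe them.

```python
import numpy as np


def sigma(x):
    """ELU(x) + 1 feature map."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def retrieve(Q_s, M_prev, z_prev, eps=1e-8):
    """A_mem = sigma(Q) M / (sigma(Q) z). eps is an added numerical guard."""
    sq = sigma(Q_s)        # (N, d_key)
    numer = sq @ M_prev    # (N, d_value)
    denom = sq @ z_prev    # (N,)
    return numer / (denom[:, None] + eps)


rng = np.random.default_rng(0)
N, d_key, d_value = 4, 8, 5

# Pretend earlier segments wrote ten key-value bindings into the memory.
K_old = rng.normal(size=(10, d_key))
V_old = rng.normal(size=(10, d_value))
M_prev = sigma(K_old).T @ V_old
z_prev = sigma(K_old).sum(axis=0)

A_mem = retrieve(rng.normal(size=(N, d_key)), M_prev, z_prev)
print(A_mem.shape)
```

Because the feature map is positive, each retrieved row is a weighted average of the stored values, with weights given by the similarity between the query and each stored key.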

The two outputs are fused with a per-head sigmoid gate that interpolates between them: $A = \mathrm{sigmoid}(\beta) A_{\text{mem}} + (1 - \mathrm{sigmoid}(\beta)) A_{\text{dot}}$, where $\beta$ is a scalar parameter learned per attention head. After fusion the result is projected through the output matrix in the usual way. The added parameter cost is small: one scalar gate per head per layer.^[1]
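As a small illustration of the gate, the sketch below fuses placeholder memory and local outputs for three heads whose $\beta$ values are strongly negative, zero, and strongly positive. The array shapes and the constant placeholder tensors are assumptions chosen so the effect of the gate is easy to read off.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def fuse(A_mem, A_dot, beta):
    """Blend memory and local outputs with one scalar gate per head.

    A_mem, A_dot: (heads, N, d_value); beta: (heads,)
    """
    g = sigmoid(beta)[:, None, None]  # broadcast over tokens and channels
    return g * A_mem + (1.0 - g) * A_dot


heads, N, d_value = 3, 4, 5
A_mem = np.ones((heads, N, d_value))   # placeholder "memory" output
A_dot = np.zeros((heads, N, d_value))  # placeholder "local" output
beta = np.array([-20.0, 0.0, 20.0])

A = fuse(A_mem, A_dot, beta)
print(A[:, 0, 0])  # one value per head: near 0, exactly 0.5, near 1
```

The same gate value applies to every token a head processes, so the blend is fixed per head rather than chosen per position.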

How does the compressive memory work?

After the segment has been processed, the memory and normaliser are updated to absorb the new keys and values. The paper's preferred update is a linear delta rule.^[1]

$$
\begin{aligned}
M_s &\leftarrow M_{s-1} + \sigma(K_s)^\top \left(V_s - \frac{\sigma(K_s) M_{s-1}}{\sigma(K_s) z_{s-1}}\right) \\
z_s &\leftarrow z_{s-1} + \sum_t \sigma(K_{s,t})
\end{aligned}
$$

The first equation writes the residual between the current values and what the memory already predicts for the new keys, multiplied by the feature-mapped keys; this is a generalisation of the classical delta rule used in associative-memory networks. If the bindings $(K_s, V_s)$ are already accurately stored, the residual is near zero and the memory is barely changed, which the authors describe as preventing redundant updates.^[1] The vector $z_s$ tracks the sum of feature-mapped keys ever written, providing the denominator that converts unnormalised memory reads into proper averages.^[1]

The paper also reports a simpler additive variant $M_s \leftarrow M_{s-1} + \sigma(K_s)^\top V_s$ without the delta term; the delta-rule version performed slightly better on long-context language modelling but both worked.^[1]
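The difference between the two update rules can be seen in a toy NumPy sketch. It writes a single key-value binding into an empty memory and then writes the same binding again: the delta rule leaves the matrix essentially unchanged, while the additive rule doubles it. The function names and the `eps` guard (which also handles the all-zero initial state) are assumptions; a single binding is used because it is the case where the memory's prediction is exact, whereas with many interfering bindings the residual is generally not zero.

```python
import numpy as np


def sigma(x):
    """ELU(x) + 1 feature map."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def update_additive(M, z, K_s, V_s):
    """M <- M + sigma(K)^T V ; z <- z + sum_t sigma(K_t)."""
    sk = sigma(K_s)
    return M + sk.T @ V_s, z + sk.sum(axis=0)


def update_delta(M, z, K_s, V_s, eps=1e-8):
    """Write only the residual between V and what the memory already predicts."""
    sk = sigma(K_s)
    predicted = (sk @ M) / ((sk @ z)[:, None] + eps)  # eps: added guard
    return M + sk.T @ (V_s - predicted), z + sk.sum(axis=0)


rng = np.random.default_rng(0)
d_key, d_value = 8, 5
K = rng.normal(size=(1, d_key))    # a one-token "segment"
V = rng.normal(size=(1, d_value))
M0, z0 = np.zeros((d_key, d_value)), np.zeros(d_key)

M1, z1 = update_delta(M0, z0, K, V)        # empty memory: residual equals V
M2_delta, _ = update_delta(M1, z1, K, V)   # same binding again: residual ~ 0
M2_add, _ = update_additive(M1, z1, K, V)  # additive rule writes it twice

print(np.allclose(M2_delta, M1), np.allclose(M2_add, 2 * M1))
```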

Why is the memory footprint bounded?

Because $M_s$ has fixed shape $(d_{\text{key}}, d_{\text{value}})$ regardless of how many tokens have been processed, the per-step memory footprint of the long-term context is constant. The local attention path scales as $O(N^2)$ inside the segment but $N$ is held fixed (commonly 2048 in the paper's experiments), so the cost per token is constant in the global sequence length. The model can stream over arbitrarily long inputs at constant compute and memory per token, which is the property the title refers to as "infinite context."^[1]
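The following single-head NumPy sketch strings the pieces together as a streaming loop and records the size of the carried state after every segment. It is a simplified illustration rather than the authors' implementation: it omits multi-head handling and the output projection, scales the local scores by the square root of the projection width, adds an `eps` guard, and uses random weights and hypothetical names throughout.

```python
import numpy as np


def sigma(x):
    """ELU(x) + 1 feature map."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def causal_softmax_attention(Q, K, V):
    """Standard masked dot-product attention within one segment."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)  # hide future tokens
    scores = np.where(mask, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    return (w / w.sum(axis=-1, keepdims=True)) @ V


def infini_attention_stream(X, Wq, Wk, Wv, beta, N, eps=1e-8):
    """Process X segment by segment, carrying only (M, z) between segments."""
    d_key, d_value = Wk.shape[1], Wv.shape[1]
    M, z = np.zeros((d_key, d_value)), np.zeros(d_key)
    gate = 1.0 / (1.0 + np.exp(-beta))
    outputs, state_sizes = [], []
    for start in range(0, X.shape[0], N):
        seg = X[start:start + N]
        Q, K, V = seg @ Wq, seg @ Wk, seg @ Wv
        A_dot = causal_softmax_attention(Q, K, V)       # local path
        sq, sk = sigma(Q), sigma(K)
        A_mem = (sq @ M) / ((sq @ z)[:, None] + eps)    # read M_{s-1}
        outputs.append(gate * A_mem + (1.0 - gate) * A_dot)
        predicted = (sk @ M) / ((sk @ z)[:, None] + eps)
        M = M + sk.T @ (V - predicted)                  # delta-rule write
        z = z + sk.sum(axis=0)
        state_sizes.append(M.size + z.size)
    return np.concatenate(outputs), state_sizes


rng = np.random.default_rng(0)
d_model, d_key, d_value, N = 16, 8, 8, 8
X = rng.normal(size=(64, d_model))
Wq, Wk = rng.normal(size=(d_model, d_key)), rng.normal(size=(d_model, d_key))
Wv = rng.normal(size=(d_model, d_value))

out, sizes = infini_attention_stream(X, Wq, Wk, Wv, beta=0.0, N=N)
print(out.shape, set(sizes))
```

However many segments are streamed, the carried state stays at `d_key * d_value + d_key` numbers, while each segment's local attention touches at most `N` by `N` scores.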

The paper's ablation reports compression ratios relative to retrieval baselines. The authors state that the approach "outperforms baseline models on long-context language modeling benchmarks while having 114x comprehension ratio in terms of memory size," using 114 times less memory than a Memorizing Transformer with a 65K-token external cache while improving perplexity, illustrating the storage advantage of the fixed-size matrix.^[1]

How does the gating mechanism choose between local and global context?

The sigmoid gate is the only mechanism the model has to choose between local and global context at each head, and the authors argue it lets different heads specialise. In their analysis, two kinds of heads emerged during training: specialised heads, whose gate values sat close to 0 (behaving like standard local-attention heads) or close to 1 (behaving like memory-only heads), and "mixer" heads, many of which landed at intermediate values and blended both signals.^[1]

The gate is initialised so the model starts roughly in the standard-attention regime, which the authors say lets continual pretraining warm up without immediately disrupting the base model. Because the gate is a single scalar per head rather than a per-position value, the mixing is not data-dependent at the token level; this is one of the design choices that downstream reproductions found problematic when trying to train the gate to actually move (see Limitations).^[9]

How is an Infini-attention model trained?

Infini-attention is presented as a drop-in modification to standard pretrained transformer layers. The authors continued pretraining base models with the Infini-attention modification, using the Adafactor optimiser and a learning rate around 1e-4 for the experiments reported.^[1]

Two training regimes are described. Long-context language modelling experiments on PG19 and Arxiv-math, including a run at a training sequence length of 100K tokens, were used to validate the approach on raw next-token prediction. For the 1M passkey and 500K BookSum tasks, the authors warm-started from existing 1B and 8B models, continued pretraining with the Infini-attention path active for a relatively small number of steps, then fine-tuned on task-specific data (5K-length passkey sequences and BookSum chapters respectively).^[1] The paper notes that gradients flow through the memory recurrence during training in a manner similar to truncated backpropagation through time used in recurrent neural networks.^[1]

How well does Infini-attention perform?

How does it do on long-context language modelling?

On long-form text modelling Infini-Transformer was evaluated on PG19 (long fiction books from Project Gutenberg, the benchmark introduced alongside the Compressive Transformer) and Arxiv-math (mathematical preprints). With a segment length of 2048 tokens the model reached perplexity 9.65 on PG19 and around 2.23 on Arxiv-math, beating the Memorizing Transformer's 11.37 PG19 perplexity at 65K external memory, despite the Infini-Transformer using a fixed-size memory matrix that the paper reports as roughly 114 times smaller than the Memorizing Transformer's KV bank.^[1]

These language-modelling experiments are important because they were the only ones in the paper run from a clean training configuration on a public benchmark, and they used segment lengths short enough to admit many memory-update steps per document. Improving over Transformer-XL and Memorizing Transformer baselines on PG19 at substantially smaller persistent state suggests that the linear-attention delta memory is genuinely doing useful compression rather than serving as dead weight that the model routes around via the local softmax path.^[1]

The authors also reported a training run with the sequence length raised to 100K tokens, showing that Infini-attention scaled cleanly to that length with the same architectural pattern.^[1]

Can Infini-attention retrieve a passkey at one million tokens?

The headline result is passkey retrieval at sequence lengths up to one million tokens, the standard needle-in-a-haystack style stress test for long-context models. The authors report that "a 1B LLM naturally scales to 1M sequence length and solves the passkey retrieval task when injected with Infini-attention."^[1] In practice they trained a 1B-parameter model with Infini-attention on 5K-length passkey sequences for 400 fine-tuning steps, then evaluated retrieval at much longer test lengths.^[1]

After fine-tuning, the 1B Infini-Transformer reached effectively 100% accuracy across all tested lengths (32K, 128K, 256K, 512K, and 1M) and across passkey positions (start, middle, end of the document).^[1] In the zero-shot setting at 1M with the model only continually pretrained but not task-fine-tuned, accuracy was much lower at start positions but high near the end of the document, indicating that fine-tuning was necessary to teach the model to actually use the compressive memory.^[1]
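For readers unfamiliar with the task, a passkey test document can be sketched as a needle sentence hidden at a chosen position inside repetitive filler, followed by a question. The generator below is a hypothetical illustration: the filler sentence, the needle wording, the question, and the function name are all assumptions and are not taken from the paper's evaluation code.

```python
def build_passkey_prompt(passkey: int, n_filler: int, position: str) -> str:
    """Hide a passkey sentence at the start, middle, or end of filler text.

    All wording here is illustrative, not the paper's exact prompt.
    """
    filler = "The grass is green. The sky is blue. The sun is yellow. "
    needle = f"The pass key is {passkey}. Remember it. "
    question = "What is the pass key?"
    insert_at = {"start": 0, "middle": n_filler // 2, "end": n_filler}[position]
    chunks = [filler] * n_filler
    chunks.insert(insert_at, needle)  # place the needle among the filler chunks
    return "".join(chunks) + question


for pos in ("start", "middle", "end"):
    prompt = build_passkey_prompt(passkey=73914, n_filler=1000, position=pos)
    print(pos, len(prompt), prompt.find("The pass key is"))
```

Scaling `n_filler` up lengthens the document without changing the answer, which is what lets the same test be run at 32K, 128K, or 1M tokens.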

What were the results on 500K book summarization?

On the BookSum benchmark, an 8B Infini-Transformer continually pretrained from a base model and then fine-tuned for summarization was evaluated by feeding entire books up to 500K tokens into the model in a single pass. The paper states that "a 8B model with Infini-attention reaches a new SOTA result on a 500K length book summarization task after continual pre-training and task fine-tuning."^[1] It reported ROUGE scores of 40.0 R1, 8.8 R2, and 17.9 RL, which the paper presents as state-of-the-art on BookSum at the time, outperforming retrieval-based and hierarchical-summarization baselines such as BART + Unlimiformer and PRIMERA on the same task.^[1]

Results at a glance

Task	Model	Segment / context	Headline metric
PG19 long-form LM	Infini-Transformer 1B	2048-token segments	9.65 perplexity^[1]
Arxiv-math LM	Infini-Transformer 1B	2048-token segments	~2.23 perplexity^[1]
Passkey retrieval	Infini-Transformer 1B (fine-tuned)	up to 1M tokens	~100% at all positions^[1]
BookSum summarization	Infini-Transformer 8B	up to 500K tokens	ROUGE 40.0 / 8.8 / 17.9 (R1/R2/RL)^[1]
Memory size vs Memorizing Transformer	Infini-Transformer	65K external cache baseline	114x smaller memory^[1]

What implementations and reproductions exist?

Google did not release weights or training code for Infini-attention, which constrained verification of the published results.^[9] Several open-source efforts attempted to reproduce or build on the mechanism in the months after the preprint.

jlamprou's PyTorch port

Independent researcher jlamprou released a PyTorch implementation in 2024 targeting Hugging Face Transformers integration. The repository contains the Infini-attention module, a modified Qwen1.5-MoE-A2.7B modelling file, a causal-LM training script, and a 1M-context passkey retrieval evaluation harness, with a segmented batch collator to handle the recurrent memory state.^[10] The author reports that once the learnable beta gates had received enough training data, the model recovered accuracy comparable to standard scaled-dot-product attention on the sequence lengths they tested, while flagging the work as "definitely not production-ready" and noting fundamental limitations in the non-learnable compression scheme.^[10]

Beomi's Gemma and Llama 3 implementation

Another community port by the Korean researcher Beomi, called InfiniTransformer, adapted Gemma and Llama 3 to use Infini-attention layers. The repository offers two flavours: an attention-layer-only modification suitable for swapping into an existing model, and a wider model and training stack with rewritten forward passes for segmented streaming.^[11]

Hugging Face transformers pull request

A work-in-progress pull request to the Hugging Face Transformers library by GitHub user joey00072 added Infini-attention support to Llama in PyTorch, aiming to make the architecture available for fine-tuning experiments inside the standard transformers training stack.^[12]

Hugging Face reproduction study

The most detailed external reproduction was a HuggingFace blog post published 14 August 2024 by Leandro von Werra, Lewis Tunstall, Nouamane Tazi, Omar Sanseviero, Pedro Cuenca, and Norm, with contributions from a larger team.^[9] They continued pretraining Llama 3 8B with the Infini-attention modification on FineWeb data at 8192-token contexts with two rollouts of 4096-token segments, and ran companion 200M-scale experiments to debug the architecture in detail.^[9]

The HuggingFace team identified several practical problems. The balance factor beta failed to move during training under standard hyperparameters because Adam's gradient normalisation combined with a learning rate of 3e-4 and weight decay 0.1 kept the sigmoid output stuck near 0.5 across roughly 95% of heads. They raised the gating learning rate to 0.01, removed weight decay on the gates, increased rollouts from four to sixteen, and shortened segments to 64 tokens for the small-scale runs. With these adjustments the gates spread across the (0, 1) range, but convergence was still incomplete after 18000 steps and the 200M model could not reliably retrieve information across segment boundaries even when in-segment retrieval worked at 100%.^[9] Their stated conclusion was that "Infini-attention's performance gets worse as we increase the number of times we compress the memory" and that ring attention, YaRN and RoPE scaling were better choices for extending pretrained models to longer context windows.^[9]

What is Infini-attention used for and why does it matter?

What does Infini-attention enable?

The architectural value of Infini-attention is that the dominant cost of streaming inference no longer grows with how much context has already been seen. For applications such as long-document question answering, log analysis, codebase navigation, audio or video transcript reasoning, and long agent trajectories, an Infini-attention layer in principle lets a model continue indefinitely with the same per-token cost. This is a different operating point from external retrieval over a vector database, which scales sublinearly in time but requires explicit chunking and a separate retrieval stack.^[1]

The reported 8B BookSum result is notable because the entire book is fed in a single pass rather than chunked and recursively summarised, which is the standard approach used by long-document summarisation pipelines that target a small context window.^[1]

Does Infini-attention power Gemini?

Several secondary sources speculated that Infini-attention or related techniques underpinned the long-context capabilities of Gemini 1.5 and later Gemini releases, which expose 1M-token and (in Gemini 1.5 Pro previews) 2M-token windows.^[8]^[13] Google has not confirmed which long-context mechanism actually backs Gemini in production, and the published Gemini technical reports do not name Infini-attention; the connection is therefore an industry inference rather than a stated fact.^[8] Coverage in Towards AI explicitly framed Infini-attention as the technique "powering Gemini 2M token window," but this remains a journalistic attribution rather than a confirmation by the company.^[13]

Why is Infini-attention significant for research?

Beyond any specific product deployment, Infini-attention sharpened a broader research move toward hybrid attention architectures that combine a local softmax window with a global linear-attention or state-space pathway. Later architectures such as Gated DeltaNet, Mamba-2, RWKV-7, and Kimi Linear all use related ideas: a fixed-state recurrent memory updated by something like a delta or gated rule, with softmax replaced or augmented at sequence-level scale.^[14]

What are the limitations and criticisms of Infini-attention?

Is Infini-attention reproducible?

Because Google released neither model weights nor training code, the headline 1M passkey and 500K BookSum numbers have not been independently verified at scale.^[9] Independent reproductions at smaller scale (200M, 8B Llama 3 continually pretrained) found that the model could not reliably do cross-segment needle retrieval and that performance degraded as more memory-update steps separated the relevant context from the query.^[9]

Does the architecture have a recency bias?

The mechanism intrinsically encodes a recency bias. Recent tokens are addressed at full token-level resolution via softmax attention, while everything beyond the current segment is addressed only through the compressed linear-attention matrix. This asymmetry assumes recent context is more relevant than distant context, an assumption that breaks for tasks requiring symmetric access across the document such as fact lookup in a randomly chosen page.^[8] The 1M passkey result is consistent with this view in the zero-shot setting, where retrieval accuracy was high near the end of the document and low at the start before task-specific fine-tuning was added.^[1]

Why is gate training fragile?

The HuggingFace team's debugging of the beta gate showed that the model has only a thin gradient signal for moving the local-vs-global mix, and that standard optimiser settings can leave the gates effectively non-functional. This is not described as a failure of the algorithm in principle but as a sign that careful per-parameter hyperparameter tuning is needed for the gate to do its job.^[9]

How much can the compressed memory actually hold?

Because the entire history is squeezed into a fixed-shape matrix $M$ of size $d_{\text{key}}$ by $d_{\text{value}}$ per layer, there is a hard upper bound on the amount of distinct content the memory can store before older information is overwritten by the delta rule. Practical evidence from reproductions suggests the effective capacity is much smaller than the storage capacity implied by the matrix dimensions, especially when later writes share key directions with earlier ones.^[9]
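The capacity limit can be illustrated with a toy experiment: write a growing number of random key-value bindings into one fixed-size matrix and measure how well each value is read back with its own key. This sketch uses the simpler additive write, random untrained keys, and arbitrary dimensions, all of which are assumptions; it shows the interference effect in principle and is not evidence about any trained model.

```python
import numpy as np


def sigma(x):
    """ELU(x) + 1 feature map."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def recall_error(n_bindings, d_key=16, d_value=16, seed=0):
    """Relative error when reading stored values back with their own keys."""
    rng = np.random.default_rng(seed)
    K = rng.normal(size=(n_bindings, d_key))
    V = rng.normal(size=(n_bindings, d_value))
    sk = sigma(K)
    M = sk.T @ V                          # additive write of every binding
    z = sk.sum(axis=0)
    V_hat = (sk @ M) / (sk @ z)[:, None]  # normalised read
    return np.linalg.norm(V_hat - V) / np.linalg.norm(V)


for n in (1, 4, 16, 64, 256):
    print(n, round(float(recall_error(n)), 3))
```

A single binding is recovered exactly, but because the positive feature map makes all keys partly overlap, every additional write blurs the others and each read drifts toward an average of the stored values.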

Are there public weights?

For applied users the most practical limitation is that no Infini-attention model has been released by Google for download, evaluation, or fine-tuning. Practitioners wanting to use the architecture must rely on third-party reproductions whose authors themselves report partial success.^[10]^[9]

How does Infini-attention compare to other long-context methods?

Which long-context architectures are adjacent?

Infini-attention belongs to a wider family of long-context attention modifications. The table below summarises how it differs from the most closely related approaches.

Approach	Memory mechanism	Per-token cost	Capacity bound	Notes
Transformer-XL (Dai et al., 2019)	Cached keys and values of previous segment	$O(N + M)$ per token where M is cache length	Linear in cache	Read-only previous-segment KV^[4]
Compressive Transformer (Rae et al., 2019)	Coarse compressed cache below recurrence buffer	$O(N + M + M_{\text{compressed}})$	Linear, but compressed	Learned compression network per layer^[5]
Memorizing Transformer (Wu et al., 2022)	External k-NN cache of KV pairs	$O(N + k)$	Up to 65K KV pairs in paper	Retrieval at attention time^[6]
Recurrent Memory Transformer (Bulatov et al., 2022)	Special memory tokens passed between segments	$O(N)$	Bounded by token count	Architecture-agnostic^[15]
Infini-attention (Munkhdalai et al., 2024)	Fixed $(d_{\text{key}}, d_{\text{value}})$ matrix updated by delta rule	$O(N)$ per token	Fixed matrix bound	Hybrid local softmax + linear-attention memory^[1]
Mamba (Gu and Dao, 2023)	Selective state-space recurrence	$O(N)$ per token	Fixed hidden state	No softmax attention^[16]
RWKV (Peng et al., 2023)	Linear recurrent RNN-like update	$O(N)$ per token	Fixed state	Token-mix and channel-mix architecture^[17]
Ring Attention (Liu et al., 2023)	Distributed full attention across devices	$O(N^2 / P)$ per device	Same as standard attention	Exact attention via blockwise communication^[18]

How does Infini-attention differ from Mamba and RWKV?

Mamba and RWKV also reach a long-context cost that is linear in total sequence length with a fixed-size recurrent state, but they do so by abandoning softmax attention. Mamba uses a selective state space model in which the transition matrix becomes input-dependent, giving a constant-state recurrence with stronger expressivity than vanilla linear attention.^[16] RWKV uses a token-mixing recurrence that interpolates between attention-like and RNN-like behaviour.^[17] Infini-attention is architecturally closer to a hybrid: it keeps the standard softmax attention path inside each segment and adds a linear-attention long-term memory beside it, rather than replacing softmax entirely.

A practical consequence is that an Infini-Transformer can be initialised from a standard pretrained transformer with minimal added parameters, so existing checkpoints can in principle be upgraded by continued pretraining, while Mamba and RWKV require training from scratch (or extensive distillation) to convert from a softmax-attention base.^[1]

How does it compare to retrieval and context-extension methods?

Other strategies extend the context of an existing softmax-attention transformer without modifying the architecture. Ring Attention partitions the full quadratic attention across devices, yielding exact softmax at long context for the cost of substantial hardware (the HuggingFace team noted Llama 3 8B at 1M context needs around 512 GPUs at batch size 1).^[9] YaRN and RoPE scaling instead modify positional encodings to let a model trained at, for example, 8K tokens generalise to longer windows; these are zero-additional-parameter tweaks but do not address the underlying $O(L^2)$ compute cost during training.^[9]

Retrieval-augmented generation pipelines, by contrast, never feed the whole history into a single attention pass; they chunk it, embed each chunk, and retrieve a few chunks to place in the prompt. These pipelines scale to arbitrary document sizes but trade off granularity: a token-level needle may not survive chunking, while Infini-attention in principle keeps a token-level memory inside the model.^[1]

ELI5: what is Infini-attention in plain terms?

Imagine reading a very long book. A normal transformer tries to keep every word it has read open on the desk at once, so the desk has to get bigger and bigger and eventually runs out of room. Infini-attention instead keeps only the current page open in full detail, and after finishing each page it writes a tidy summary into a notebook of fixed size. To answer a question it both looks at the current page and flips through the notebook. Because the notebook never grows, the model can keep reading forever without needing a bigger desk. The trade-off is that the notebook can only hold so much, so very old details can get smudged or written over.^[1]

References

1. Tsendsuren Munkhdalai, Manaal Faruqui, Siddharth Gopal, "Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention", arXiv, 2024-04-10 (revised 2024-08-09). https://arxiv.org/abs/2404.07143. Accessed 2026-06-28.
2. Ashish Vaswani et al., "Attention Is All You Need", arXiv, 2017-06-12. https://arxiv.org/abs/1706.03762. Accessed 2026-06-28.
3. Yi Tay, Mostafa Dehghani, Dara Bahri, Donald Metzler, "Efficient Transformers: A Survey", arXiv, 2020-09-14. https://arxiv.org/abs/2009.06732. Accessed 2026-06-28.
4. Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov, "Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context", arXiv, 2019-01-09. https://arxiv.org/abs/1901.02860. Accessed 2026-06-28.
5. Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Timothy P. Lillicrap, "Compressive Transformers for Long-Range Sequence Modelling", arXiv, 2019-11-13. https://arxiv.org/abs/1911.05507. Accessed 2026-06-28.
6. Yuhuai Wu, Markus N. Rabe, DeLesley Hutchins, Christian Szegedy, "Memorizing Transformers", arXiv, 2022-03-16. https://arxiv.org/abs/2203.08913. Accessed 2026-06-28.
7. Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, Francois Fleuret, "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention", Proceedings of the 37th International Conference on Machine Learning (ICML), 2020-07-13. https://proceedings.mlr.press/v119/katharopoulos20a.html. Accessed 2026-06-28.
8. Ben Dickson, "Google's new technique gives LLMs infinite context", VentureBeat, 2024-04-15. https://venturebeat.com/ai/googles-new-technique-gives-llms-infinite-context/. Accessed 2026-06-28.
9. Leandro von Werra, Lewis Tunstall, Nouamane Tazi, Omar Sanseviero, Pedro Cuenca, Norm, et al., "A failed experiment: Infini-Attention, and why we should keep trying?", HuggingFace Blog, 2024-08-14. https://huggingface.co/blog/infini-attention. Accessed 2026-06-28.
10. jlamprou, "Infini-Attention: Efficient Infinite Context Transformers with Infini-attention Pytorch Implementation + QwenMoE Implementation + Training Script + 1M context keypass retrieval", GitHub repository, 2024. https://github.com/jlamprou/Infini-Attention. Accessed 2026-06-28.
11. Beomi, "InfiniTransformer: Gemma/Llama3 based Infini-attention Implementation", Hugging Face post, 2024-04-19. https://huggingface.co/posts/beomi/277288382277555. Accessed 2026-06-28.
12. joey00072, "[WIP] adding infini-attention", Pull Request #31736 to huggingface/transformers, GitHub, 2024. https://github.com/huggingface/transformers/pull/31736. Accessed 2026-06-28.
13. "Inside Infini Attention: Google DeepMind's Technique Powering Gemini 2M Token Window", Towards AI, 2024. https://towardsai.net/p/artificial-intelligence/inside-infini-attention-google-deepminds-technique-powering-gemini-2m-token-window. Accessed 2026-06-28.
14. Songlin Yang et al., "Gated Delta Networks: Improving Mamba2 with Delta Rule", arXiv, 2024-12-09. https://arxiv.org/abs/2412.06464. Accessed 2026-06-28.
15. Aydar Bulatov, Yury Kuratov, Mikhail S. Burtsev, "Recurrent Memory Transformer", arXiv, 2022-07-14. https://arxiv.org/abs/2207.06881. Accessed 2026-06-28.
16. Albert Gu, Tri Dao, "Mamba: Linear-Time Sequence Modeling with Selective State Spaces", arXiv, 2023-12-01. https://arxiv.org/abs/2312.00752. Accessed 2026-06-28.
17. Bo Peng et al., "RWKV: Reinventing RNNs for the Transformer Era", arXiv, 2023-05-22. https://arxiv.org/abs/2305.13048. Accessed 2026-06-28.
18. Hao Liu, Matei Zaharia, Pieter Abbeel, "Ring Attention with Blockwise Transformers for Near-Infinite Context", arXiv, 2023-10-03. https://arxiv.org/abs/2310.01889. Accessed 2026-06-28.
