Retentive Network (RetNet)
Last reviewed
May 16, 2026
Sources
15 citations
Review status
Source-backed
Revision
v1 · 3,731 words
See also: Transformer, Mamba, RWKV, Linear Attention, Microsoft Research
Retentive Network, commonly abbreviated as RetNet, is a foundation architecture for sequence modeling and large language modeling introduced in July 2023 by researchers at Microsoft Research Asia and Tsinghua University. The paper, titled "Retentive Network: A Successor to Transformer for Large Language Models," was uploaded to arXiv on 17 July 2023 under the identifier 2307.08621. The authors are Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei.
The central proposal of RetNet is a new sequence operator called retention, which the authors derive from a theoretical connection between recurrence and attention. Retention can be expressed in three mathematically equivalent forms: a parallel form for efficient training, a recurrent form for constant-time inference, and a chunkwise recurrent form that mixes the two for long sequences. This duality is the main technical lever that allows RetNet to claim simultaneous progress on three properties that are usually in tension for sequence models, namely training parallelism, low-cost inference, and competitive language modeling quality. The authors describe this combination as resolving the "impossible triangle" of large language model architectures.
Reported experiments compare RetNet against Transformer baselines at 1.3 billion, 2.7 billion, and 6.7 billion parameters. RetNet matches or beats Transformer perplexity at scales above roughly 2 billion parameters while decoding 8.4 times faster than a 7 billion parameter Transformer with key-value caching at an 8K sequence length, with about 70 percent lower memory use. The architecture has since been adopted, modified, and studied across a wide range of follow-up work, including the Gated RetNet (gRet) variant used in Microsoft's YOCO decoder-decoder architecture, vision adaptations such as RMT and RetViT, and the LION family of bidirectional retention models.
By mid-2023, the Transformer introduced by Vaswani et al. in 2017 dominated language modeling, machine translation, and most other large-scale sequence tasks. Its core component, scaled dot-product self-attention, computes pairwise interactions between every token in a sequence. This gives the architecture two practical strengths: every layer can be trained fully in parallel across the sequence, and any pair of tokens can directly influence each other regardless of distance.
The same property is also the Transformer's most stubborn weakness. Self-attention has time and memory complexity of O(L^2) in the sequence length L. During autoregressive inference, decoder-only Transformers cache the keys and values produced for every previously generated token, so memory consumption grows linearly with context length, and per-token decoding cost grows with the size of the cache. Long-context inference is therefore expensive in both wall-clock time and GPU memory, even when the rest of the system is highly optimized.
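The growth of the cache during decoding can be made concrete with a small NumPy sketch. It is illustrative only: a single attention head with no batching, and the function name `decode_with_kv_cache` and the toy sizes are assumptions introduced here, not taken from any particular implementation.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1D vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def decode_with_kv_cache(queries, keys, values):
    """Causal softmax attention, one token at a time, with a growing KV cache."""
    d = queries.shape[1]
    k_cache, v_cache = [], []
    outputs, cache_sizes = [], []
    for q, k, v in zip(queries, keys, values):
        k_cache.append(k)
        v_cache.append(v)
        K = np.stack(k_cache)                  # (t, d): one more row every step
        V = np.stack(v_cache)
        weights = softmax(K @ q / np.sqrt(d))  # O(t) work for this single token
        outputs.append(weights @ V)
        cache_sizes.append(K.size + V.size)    # stored scalars after this step
    return np.stack(outputs), cache_sizes

rng = np.random.default_rng(0)
L, d = 6, 4
Q, K, V = (rng.standard_normal((L, d)) for _ in range(3))
out, cache_sizes = decode_with_kv_cache(Q, K, V)
print(cache_sizes)  # grows by 2 * d scalars per generated token
```

Both the stored cache and the work per token scale with the number of tokens already generated, which is the cost that retention is designed to avoid.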
Researchers had spent several years trying to remove this quadratic bottleneck. The main families of work included:
| Family | Representative work | Approach |
|---|---|---|
| Sparse attention | Longformer, BigBird | Attend only to local or pattern-selected tokens |
| Linear Attention | Performer, Linear Transformers | Replace softmax with kernel feature maps for O(L) cost |
| Recurrent revival | RWKV | Combine attention-style training with RNN-style inference |
| State space models | S4, S5, Mamba | Continuous-time dynamical systems with structured matrices |
| IO-aware exact attention | FlashAttention | Keep the O(L^2) algorithm but reduce memory traffic |
Many of these approaches improved efficiency, but each tended to sacrifice something. Sparse and linear attention often gave up modeling quality at scale. Pure RNN architectures suffered from sequential training and poor parallel utilization on modern GPUs. FlashAttention sped up the standard Transformer dramatically but did not change the underlying O(L^2) cost.
RetNet's framing positions three properties (training parallelism, low-cost inference, and strong language modeling quality) as the corners of a triangle, of which prior architectures could reach only two at a time:
| Property | Transformer | Linear attention | Recurrent network |
|---|---|---|---|
| Training parallelism | Yes | Yes | No |
| O(1) inference cost per token | No | Approximate | Yes |
| Strong language modeling quality at scale | Yes | Mixed | Mixed |
The RetNet paper argues that retention satisfies all three corners at once. Whether that claim survives careful scrutiny at very large scales is still debated, but the framing of the impossible triangle has become a useful shorthand when comparing efficient sequence architectures.
Retention is best understood as a particular kind of linear recurrence with a complex exponential decay, written in a form that admits a closed-form parallel computation. The paper develops the mechanism from first principles, starting with a recurrent equation and then deriving an equivalent matrix expression that can be evaluated all at once during training.
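Consider a sequence of input vectors x_1, x_2, ..., x_L. RetNet first projects each x_n into a query vector Q_n, a key vector K_n, and a value vector V_n using learned linear maps, similar to standard attention. The retention recurrence maintains a hidden state matrix S_n that is updated at every step using a complex decay factor gamma:
S_n = gamma * S_(n-1) + K_n^T V_n
Retention(X_n) = Q_n S_n
Here Q_n, K_n, and V_n are treated as row vectors, so K_n^T V_n is an outer product with the same shape as the state.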
The state S_n is a fixed-size matrix that summarizes the entire history up to step n. Decoding a new token requires only a single matrix update and a single matrix-vector product, so the per-token cost during generation is independent of how many tokens have already been emitted. This gives RetNet its O(1) inference cost per step.
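A minimal NumPy sketch of this recurrent mode follows. It assumes the update S_n = gamma * S_(n-1) + K_n^T V_n with output Q_n S_n, a single head, and a real scalar decay. The paper's normalisation and scaling details are left out, and the function name is illustrative.

```python
import numpy as np

def retention_recurrent(Q, K, V, gamma):
    """Single-head retention, one token at a time, with a real scalar decay.

    Q, K have shape (L, d_k); V has shape (L, d_v).
    Returns the per-token outputs and the final state.
    """
    L, d_k = Q.shape
    d_v = V.shape[1]
    S = np.zeros((d_k, d_v))                   # fixed-size state, independent of L
    outputs = np.empty((L, d_v))
    for n in range(L):
        S = gamma * S + np.outer(K[n], V[n])   # decay old state, add new outer product
        outputs[n] = Q[n] @ S                  # read out with the current query
    return outputs, S

rng = np.random.default_rng(0)
Q = rng.standard_normal((8, 4))
K = rng.standard_normal((8, 4))
V = rng.standard_normal((8, 6))
out, state = retention_recurrent(Q, K, V, gamma=0.9)
print(out.shape, state.shape)
```

The state keeps the shape (d_k, d_v) however many tokens have been processed, which is why the per-token cost stays constant.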
The gamma factor controls how quickly older information decays. In RetNet, it is a complex-valued scalar (or, more precisely, parameterised by an angle that gives it a rotation as well as a magnitude), which connects the retention mechanism to relative position encoding through the same trick used by xPos and RoPE. The decay factor causes tokens that are far apart in the sequence to influence each other less than nearby tokens, which gives retention its name and provides an implicit positional bias without explicit position embeddings.
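The link between a complex decay and rotary-style position encoding can be checked numerically. The sketch below assumes one rotation angle per feature dimension (the vector `theta`, a hypothetical choice) and compares two ways of computing the causal score matrix: raising gamma * e^(i*theta) to the relative distance, versus rotating queries and keys by their absolute positions and then applying a real decay.

```python
import numpy as np

def scores_complex_decay(Q, K, gamma, theta):
    """Scores using (gamma * e^{i*theta}) raised to the relative distance n - m."""
    pos = np.arange(Q.shape[0])
    rel = (pos[:, None] - pos[None, :])[..., None]       # (L, L, 1)
    decay = (gamma * np.exp(1j * theta)) ** rel          # (L, L, d)
    scores = (Q[:, None, :] * decay * K[None, :, :]).sum(axis=-1)
    return np.tril(scores)                               # keep causal part only

def scores_rotated(Q, K, gamma, theta):
    """Scores using position-rotated queries/keys and a real decay gamma^(n - m)."""
    pos = np.arange(Q.shape[0])
    rot = np.exp(1j * pos[:, None] * theta[None, :])     # (L, d) rotations
    Qr, Kr = Q * rot, K * rot
    rel = pos[:, None] - pos[None, :]
    real_decay = gamma ** np.maximum(rel, 0).astype(float)
    return np.tril((Qr @ Kr.conj().T) * real_decay)

rng = np.random.default_rng(0)
L, d = 5, 4
Q = rng.standard_normal((L, d))
K = rng.standard_normal((L, d))
theta = rng.uniform(0.0, np.pi, size=d)
a = scores_complex_decay(Q, K, 0.9, theta)
b = scores_rotated(Q, K, 0.9, theta)
print(np.allclose(a, b))
```

The two views agree because the absolute rotations cancel down to a rotation by the relative offset, leaving the magnitude gamma^(n - m) as a separate real decay.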
The recurrent form is convenient for inference but hard to train efficiently, because each step depends on the previous one. RetNet derives a closed-form parallel expression for the same operator by unrolling the recurrence. The result looks similar to a single attention step, with one key difference: the softmax of standard attention is replaced by a fixed lower-triangular decay matrix D, whose entries D[i,j] = gamma^(i-j) for i >= j and zero otherwise.
The parallel retention operation can be written as:
Retention(X) = (Q K^T ⊙ D) V
where Q, K, and V are the query, key, and value matrices stacked across the sequence, D is the decay mask, and ⊙ denotes the elementwise product, so D rescales and causally masks the scores rather than entering as a further matrix multiplication. Because there is no softmax, this expression is a sequence of dense matrix multiplications and can be computed for the whole sequence in one pass on a GPU. This gives RetNet its training parallelism. Crucially, this parallel form is mathematically equivalent to the recurrent form rather than a separate approximation, so the same trained weights can be used in either mode.
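The equivalence of the two forms can be verified on random inputs. The following sketch assumes a single head and a real scalar decay, applies the decay mask elementwise to the score matrix, and omits the paper's normalisation. The function names are illustrative.

```python
import numpy as np

def decay_mask(L, gamma):
    """Lower-triangular mask with D[i, j] = gamma^(i - j) for i >= j, else 0."""
    idx = np.arange(L)
    diff = idx[:, None] - idx[None, :]
    return np.where(diff >= 0, gamma ** np.maximum(diff, 0).astype(float), 0.0)

def retention_parallel(Q, K, V, gamma):
    """Whole-sequence retention: elementwise-masked scores times values."""
    D = decay_mask(Q.shape[0], gamma)
    return ((Q @ K.T) * D) @ V

def retention_recurrent(Q, K, V, gamma):
    """Reference recurrence: S_n = gamma * S_{n-1} + K_n^T V_n, o_n = Q_n S_n."""
    S = np.zeros((Q.shape[1], V.shape[1]))
    out = np.empty((Q.shape[0], V.shape[1]))
    for n in range(Q.shape[0]):
        S = gamma * S + np.outer(K[n], V[n])
        out[n] = Q[n] @ S
    return out

rng = np.random.default_rng(0)
Q = rng.standard_normal((16, 8))
K = rng.standard_normal((16, 8))
V = rng.standard_normal((16, 8))
same = np.allclose(retention_parallel(Q, K, V, 0.95),
                   retention_recurrent(Q, K, V, 0.95))
print(same)
```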
The parallel form is fast but uses O(L^2) memory for the intermediate Q * K^T product, which becomes a problem for very long sequences. The pure recurrent form uses O(1) state but cannot be parallelised across the sequence dimension. RetNet's chunkwise recurrent form interpolates between them: the sequence is split into chunks of length C, the parallel form is applied inside each chunk, and the recurrent form is used to pass state between chunks.
The resulting compute and memory cost is linear in L while still using GPU-friendly dense matrix operations within each chunk. The authors recommend chunkwise computation as the default for long-sequence training. This is broadly the same idea that later appeared as Structured State Space Duality (SSD) in Mamba 2, where Tri Dao and Albert Gu formalised the connection between linear recurrences and chunked matrix multiplications.
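A sketch of the chunkwise computation, under the same simplifying assumptions (single head, real scalar decay, no normalisation): inside each chunk the masked parallel form is used, the contribution of earlier chunks is read from a carried state scaled by gamma^(j+1) for the j-th position of the chunk, and the state is then advanced by one chunk. The function names and chunk size are illustrative.

```python
import numpy as np

def retention_parallel(Q, K, V, gamma):
    """Reference: full O(L^2) parallel form with an elementwise decay mask."""
    idx = np.arange(Q.shape[0])
    D = np.tril(gamma ** np.abs(idx[:, None] - idx[None, :]).astype(float))
    return ((Q @ K.T) * D) @ V

def retention_chunkwise(Q, K, V, gamma, chunk_size):
    """Parallel inside each chunk, recurrent state passed between chunks."""
    L, d_k = Q.shape
    d_v = V.shape[1]
    S = np.zeros((d_k, d_v))                  # summary of all previous chunks
    out = np.empty((L, d_v))
    for start in range(0, L, chunk_size):
        q = Q[start:start + chunk_size]
        k = K[start:start + chunk_size]
        v = V[start:start + chunk_size]
        B = q.shape[0]                        # last chunk may be shorter
        j = np.arange(B)
        D = np.tril(gamma ** np.abs(j[:, None] - j[None, :]).astype(float))
        inner = ((q @ k.T) * D) @ v           # tokens within this chunk
        cross = (gamma ** (j + 1.0))[:, None] * (q @ S)   # tokens from earlier chunks
        out[start:start + B] = inner + cross
        # Advance the state to the end of this chunk.
        S = gamma ** B * S + (k * (gamma ** (B - 1.0 - j))[:, None]).T @ v
    return out

rng = np.random.default_rng(0)
Q = rng.standard_normal((10, 4))
K = rng.standard_normal((10, 4))
V = rng.standard_normal((10, 3))
chunked = retention_chunkwise(Q, K, V, 0.9, chunk_size=4)
print(np.allclose(chunked, retention_parallel(Q, K, V, 0.9)))
```

Only a chunk-sized score matrix is materialised at any time, so memory for intermediate scores depends on the chunk length rather than on the full sequence length.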
Using a single global decay factor would force every position to forget the past at the same rate. RetNet uses a multi-head variant called Multi-Scale Retention (MSR), in which each head has its own decay factor gamma_h. Different heads end up specialising in different temporal scales, with some heads tracking very local context and others retaining information across much longer windows. This is analogous to multi-head attention but uses a fixed exponential schedule of decay rates rather than learned attention patterns.
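To give a feel for what a fixed exponential schedule means, the sketch below assumes the schedule gamma_h = 1 - 2^(-5 - h) for head index h. That formula is recalled from the RetNet paper rather than stated in the text above, so treat it as an assumption. The "effective horizon" 1 / (1 - gamma) is simply the sum of the geometric decay weights, used here as a rough measure of how far back a head looks.

```python
import numpy as np

def multiscale_decays(num_heads):
    """Per-head decay rates under the assumed schedule gamma_h = 1 - 2^(-5 - h)."""
    return 1.0 - 2.0 ** (-5.0 - np.arange(num_heads))

gammas = multiscale_decays(8)
# Sum of 1 + gamma + gamma^2 + ... = 1 / (1 - gamma): a rough memory length.
horizons = 1.0 / (1.0 - gammas)

for h, (g, n) in enumerate(zip(gammas, horizons)):
    print(f"head {h}: gamma = {g:.6f}, effective horizon ~ {n:.0f} tokens")
```

Under this assumed schedule the horizon doubles from one head to the next, so a handful of heads covers scales from tens to thousands of tokens.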
The full RetNet block applies multi-scale retention followed by a feed-forward network, with later updates adopting a swish-gated linear unit (SwiGLU) for the feed-forward layer and RMSNorm as the normalisation layer. The block structure is otherwise close to a standard Transformer block, which makes RetNet a near drop-in replacement at the architectural level.
In May 2024, the same group introduced a refinement called Gated Retention (gRet), also referred to as RetNet-3, as part of the YOCO architecture. Gated retention adds a data-dependent gating term to the retention recurrence, so that the decay between tokens is no longer a fixed schedule but is conditioned on the input. This addresses one of the limitations of the original formulation, where the rate of forgetting is hardwired into the architecture rather than learned from data. The result is closer in spirit to selective state space models like Mamba, which similarly make their dynamics input-dependent.
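The effect of a data-dependent decay can be sketched by letting gamma vary per position. The gate parameterisation used below, a sigmoid of a linear projection raised to the power 1/tau, is an assumption based on a recollection of the YOCO paper; `w_gamma` and `tau` are hypothetical names and values. The parallel mask then holds products of the per-step decays instead of powers of a constant.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_retention_recurrent(Q, K, V, gammas):
    """S_n = gammas[n] * S_{n-1} + K_n^T V_n, o_n = Q_n S_n."""
    S = np.zeros((Q.shape[1], V.shape[1]))
    out = np.empty((Q.shape[0], V.shape[1]))
    for n in range(Q.shape[0]):
        S = gammas[n] * S + np.outer(K[n], V[n])
        out[n] = Q[n] @ S
    return out

def gated_retention_parallel(Q, K, V, gammas):
    """Same operator with mask D[n, m] = prod_{i=m+1..n} gammas[i] for n >= m."""
    log_cum = np.cumsum(np.log(gammas))
    D = np.tril(np.exp(log_cum[:, None] - log_cum[None, :]))
    return ((Q @ K.T) * D) @ V

rng = np.random.default_rng(0)
L, d_model = 12, 8
X = rng.standard_normal((L, d_model))
Q, K, V = (X @ rng.standard_normal((d_model, d_model)) for _ in range(3))
w_gamma = rng.standard_normal(d_model)        # hypothetical gate projection
tau = 16.0                                    # hypothetical temperature
gammas = sigmoid(X @ w_gamma) ** (1.0 / tau)  # one decay in (0, 1) per position
match = np.allclose(gated_retention_recurrent(Q, K, V, gammas),
                    gated_retention_parallel(Q, K, V, gammas))
print(match)
```

With a constant `gammas` vector this reduces to the original fixed-decay retention, which is one way to see gating as a generalisation rather than a separate mechanism.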
Because the parallel form is a sequence of dense matrix multiplications without softmax, RetNet maps naturally onto GPU matrix-multiply hardware. The paper reports 25 to 50 percent memory savings and roughly 7x training speedup over a standard Transformer at similar parameter counts. It also claims a small advantage over Transformer with FlashAttention, although the size of that advantage shrinks as the attention implementation is more aggressively optimised.
In recurrent mode, RetNet emits each new token with constant time and constant memory regardless of context length, since the state matrix S is a fixed-size summary of the past. This contrasts sharply with a decoder-only Transformer, whose KV cache grows linearly with the number of generated tokens and whose per-token attention cost grows with cache size. The paper quotes an 8.4x speedup over a 7 billion parameter Transformer with KV caching at 8K sequence length, along with roughly 70 percent lower GPU memory consumption. The speedup is even larger at longer contexts, since the Transformer's costs keep growing while RetNet's stay flat.
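A back-of-the-envelope comparison shows why the two memory curves diverge. The model dimensions below are hypothetical, chosen only to be roughly 7B-like, and the calculation counts only the KV cache or retention state per sequence. It ignores weights and activations, so it is not expected to reproduce the paper's end-to-end memory figures.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Keys and values for every layer, head, and cached token."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

def retention_state_bytes(num_layers, num_heads, key_dim, value_dim, bytes_per_value=2):
    """One (key_dim x value_dim) state matrix per layer and head, independent of length."""
    return num_layers * num_heads * key_dim * value_dim * bytes_per_value

# Hypothetical configuration: 32 layers, 32 heads, head dimension 128, 16-bit values.
layers, heads, dim = 32, 32, 128
state = retention_state_bytes(layers, heads, dim, dim)
for seq_len in (2048, 8192, 32768):
    cache = kv_cache_bytes(layers, heads, dim, seq_len)
    print(f"L={seq_len:>6}: KV cache {cache / 2**20:8.0f} MiB, "
          f"retention state {state / 2**20:4.0f} MiB")
```

In this toy configuration the cache grows from 1 GiB to 16 GiB per sequence across the three lengths, while the retention state stays at 32 MiB.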
Because the decay factor gamma effectively encodes relative position, RetNet does not need separate positional embeddings. The authors argue this gives the architecture better length-extrapolation behaviour, since there is no learned position table to run off the end of. Decay-based positional encoding is closely related to the xPos scheme developed by some of the same authors in earlier work on length-extrapolatable Transformers.
The chunkwise recurrent form is designed to keep most computation in the form of dense matrix multiplications, which are well-served by tensor cores on modern GPUs. This is one of the more practical advantages of RetNet relative to architectures that rely heavily on scans or other operations that GPUs handle less efficiently.
The original paper trained RetNet and Transformer baselines on a mixed corpus drawn from The Pile, C4, and The Stack, using the TorchScale library on 512 AMD MI200 GPUs. Models were trained at 1.3B, 2.7B, and 6.7B parameter scales for comparison.
The headline result is that RetNet's perplexity is competitive with Transformer's at smaller scales and surpasses it once the model crosses roughly 2 billion parameters. The scaling curve has a more favourable slope than Transformer, meaning RetNet's advantage grows with scale rather than shrinking.
| Model size | RetNet perplexity vs Transformer | Notes |
|---|---|---|
| 1.3B | Slightly behind to roughly comparable | Within noise of Transformer baseline |
| 2.7B | Roughly comparable | Crossover point around this scale |
| 6.7B | Better than Transformer | Favourable scaling trend reported |
On downstream zero-shot and few-shot evaluation tasks including LAMBADA, HellaSwag, PIQA, WinoGrande, ARC, BoolQ, COPA, and Story Cloze, RetNet-6.7B reports accuracy that is broadly on par with a similarly sized Transformer, with some tasks favouring each architecture.
RetNet's main advantage shows up in inference benchmarks. The paper reports the following for a 6.7 billion parameter model on A100-80GB GPUs at an 8,192 token context length:
| Metric | Transformer (with KV cache) | RetNet | Improvement |
|---|---|---|---|
| Decoding speed | 1x baseline | ~8.4x | Per-token cost is constant |
| GPU memory consumption | 1x baseline | ~0.3x | About 70 percent reduction |
| Latency vs context length | Grows with length | Flat | O(1) decoding |
| Throughput vs batch size | Limited by KV cache | Higher | Smaller per-sequence footprint |
The gap widens as context length increases. At 32K tokens or beyond, the Transformer KV cache dominates GPU memory and limits batch size, while RetNet's fixed-size state lets the system use most of the available memory for additional concurrent sequences.
Using the chunkwise recurrent form, RetNet reportedly achieved 25 to 50 percent memory savings and around 7x throughput improvement over a standard Transformer implementation in PyTorch. Against FlashAttention, the throughput advantage is smaller but still positive at long sequence lengths.
Microsoft released a reference implementation of RetNet inside the TorchScale library, available on GitHub at microsoft/torchscale. The library exposes a RetNetConfig and RetNetDecoder class that can be used as a drop-in replacement for a Transformer decoder. The repository was updated in October 2023 to make RMSNorm and SwiGLU the default modules inside RetNet blocks. A second copy of the implementation sits inside microsoft/unilm/retnet, the broader UniLM research repository.
Independent open-source implementations followed quickly. Two of the most cited community ports are Jamie-Stirling/RetNet and fkodom/yet-another-retnet, both of which provide PyTorch implementations of multi-scale retention and the three computation forms. These have been used as starting points for many downstream research projects.
In May 2024, several of the original RetNet authors published "You Only Cache Once: Decoder-Decoder Architectures for Language Models" (arXiv 2405.05254). YOCO uses two stacks of decoder blocks: a self-decoder that uses efficient sequence operators (including gated retention), and a cross-decoder that uses attention with a single shared global KV cache produced by the self-decoder. The paper reports that YOCO can be extended to 1 million token contexts with near-perfect needle-in-a-haystack retrieval, consuming roughly 9.38 times less memory than a Transformer baseline with Grouped Query Attention, Flash-Decoding, and kernel fusion at the same length.
Gated retention, used as the self-decoder backbone in YOCO, is sometimes called RetNet-3 to signal its lineage. It adds a learned, data-dependent gate to the retention recurrence, which improves quality on tasks where a fixed decay schedule was too rigid.
A 2025 survey of retentive networks (arXiv 2506.06708) catalogued more than fifty distinct adaptations of RetNet across domains. Notable variants include:
| Variant | Domain | Notes |
|---|---|---|
| RMT | Vision | Retentive networks meet vision transformers; Manhattan self-retention |
| RetViT | Vision | Retention as a drop-in for vision transformer attention |
| ViR | Vision | Vision retention networks for image recognition |
| LION / LION-D | Language | Bidirectional retention framework for encoder tasks |
| DenseRetNet | Language | Densely connected retention layers |
| RetCompletion | Image inpainting | High-speed image completion with retention |
| JetRetNet | Particle physics | Retentive networks for jet tagging |
| CellFM | Transcriptomics | Foundation model for single-cell genomics |
| RetEEG | Neuroscience | EEG signal modeling with retention |
| MonoRetNet | 3D vision | Monocular depth estimation |
Most vision adaptations swap attention for retention in the same overall block layout used by Vision Transformer or Swin Transformer, sometimes generalising the 1D causal decay to 2D Manhattan or Chebyshev distances over image patches.
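The 2D generalisation of the decay mask can be sketched as follows. This is an illustrative construction for patches on a regular grid flattened in row-major order, not the exact formulation of any specific vision model. The function name and the `metric` argument are assumptions.

```python
import numpy as np

def grid_decay_mask(height, width, gamma, metric="manhattan"):
    """Decay mask over image patches, where entry (a, b) is gamma^distance(a, b).

    Patches are indexed in row-major order. There is no causal masking here,
    because image patches have no natural left-to-right order.
    """
    ys, xs = np.divmod(np.arange(height * width), width)   # row and column of each patch
    dy = np.abs(ys[:, None] - ys[None, :])
    dx = np.abs(xs[:, None] - xs[None, :])
    if metric == "manhattan":
        dist = dx + dy
    elif metric == "chebyshev":
        dist = np.maximum(dx, dy)
    else:
        raise ValueError(f"unknown metric: {metric}")
    return gamma ** dist.astype(float)

mask = grid_decay_mask(3, 4, gamma=0.5)
print(mask.shape)    # one row and one column per patch
print(mask[0, -1])   # top-left patch vs bottom-right patch
```

Such a mask would be multiplied elementwise into the patch-to-patch score matrix, in the same role the lower-triangular matrix D plays for 1D sequences.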
RetNet sits in the same general family as several other efficient sequence architectures, each of which takes a different angle on the same basic problem. The table below summarises the main points of contrast at a high level.
| Architecture | Mechanism | Training | Inference per token | Decay / selectivity | Notable variants |
|---|---|---|---|---|---|
| Transformer | Softmax self-attention | Parallel, O(L^2) | O(L) with KV cache | Learned via attention | Decoder-only LLMs |
| RetNet | Multi-scale retention, exponential decay | Parallel O(L^2) or chunkwise O(L) | O(1) recurrent | Fixed decay per head | Gated RetNet, YOCO |
| Linear Attention | Kernel feature maps replace softmax | Parallel, O(L) | O(1) recurrent | None or simple | Performer, Linear Transformer |
| RWKV | Time-mix and channel-mix with exponential decay | Parallel (via WKV) | O(1) recurrent | Learned channel-wise decay | RWKV-4, RWKV-5, RWKV-6 |
| Mamba | Selective state space model | Parallel scan, O(L) | O(1) recurrent | Input-dependent | Mamba 2, Jamba |
| Mamba 2 | Structured state space duality | Chunked dense matmul, O(L) | O(1) recurrent | Input-dependent | SSD framework |
The family resemblances are not accidental. The 2024 SSD paper that introduced Mamba 2 shows that a wide class of linear-state architectures, including RetNet, Linear Attention, RWKV, and a scalar-A state space model, can all be expressed in a unified semiseparable-matrix framework. From this view, RetNet is a specific point in a larger design space, with a fixed exponential decay schedule across heads and no input-dependent gating.
Relative to RWKV, RetNet has a tighter parallel form and uses a complex exponential decay tied to relative positions, while RWKV uses a learned per-channel decay with a different parallelisation scheme. Relative to Mamba, RetNet keeps its dynamics input-independent in the original formulation, which trades some expressiveness for a cleaner parallel form. Gated retention, introduced later, closes part of this gap by adding a learned gate.
RetNet drew significant attention immediately after its arXiv release in July 2023. The framing as "a successor to Transformer" attracted both enthusiasm and pushback. Coverage in technical blogs, industry newsletters, and Chinese AI media often described the paper as the most credible proposal yet to challenge Transformer dominance, while academic responses were more measured.
The paper was submitted to ICLR 2024. Reviewers generally praised the theoretical framework connecting recurrence and attention, the engineering quality of the experiments, and the clarity of the three-form derivation. The most common criticisms focused on:
At the same time, RetNet has been broadly influential. The retention mechanism is referenced as a baseline or starting point in most papers on efficient sequence modeling published since 2024. The chunkwise recurrent form anticipated the structured state space duality framework. Microsoft has continued to use retention as part of YOCO and follow-up architectures, and independent labs have built on retention for domain-specific models in vision, biology, and physical science.
Whether RetNet itself displaces Transformer at the largest scales remains an open question. As of 2026 there is no publicly available frontier-scale large language model that uses pure retention as its sole sequence operator, and the dominant hybrid pattern in efficient architectures today is to interleave attention layers with linear-state operators rather than committing fully to one or the other. RetNet's lasting contribution may turn out to be conceptual rather than architectural: it sharpened the trilemma between training parallelism, inference cost, and modeling quality, and gave the community a clean reference point for arguing about it.