# Retentive Network (RetNet)

> Source: https://aiwiki.ai/wiki/retnet
> Updated: 2026-07-12
> Categories: AI Research, Neural Networks, Open Source AI
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

*See also: [Transformer](/wiki/transformer), [Mamba](/wiki/mamba), [RWKV](/wiki/rwkv), [Linear Attention](/wiki/linear_attention), [Microsoft Research](/wiki/microsoft_research)*

## What is RetNet?

**RetNet (Retentive Network)** is a sequence-modeling architecture proposed by [Microsoft Research](/wiki/microsoft_research) and Tsinghua University in July 2023 as a successor to the [Transformer](/wiki/transformer) for large language models. Its defining feature is a sequence operator called **retention** that can be computed in three mathematically equivalent forms: a parallel form for efficient training, a recurrent form for constant-time $$O(1)$$ inference, and a chunkwise recurrent form for long sequences. This lets RetNet claim all three corners of the so-called impossible triangle at once: training parallelism, low-cost inference, and strong performance. For a 7 billion parameter model at an 8,000 token context length, the paper reports that RetNet decodes 8.4 times faster and uses 70 percent less memory than a Transformer with key-value caching.[1]

The architecture was introduced in the paper "Retentive Network: A Successor to Transformer for Large Language Models," uploaded to arXiv on 17 July 2023 under the identifier 2307.08621. Its authors are Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei, working at Microsoft Research and Tsinghua University.[1] The paper's abstract states the goal directly: RetNet is proposed "as a foundation architecture for large language models, simultaneously achieving training parallelism, low-cost inference, and good performance," and the authors conclude that "the intriguing properties make RetNet a strong successor to Transformer for large language models."[1]

The central technical claim of RetNet is that retention, derived from a theoretical connection between recurrence and attention, reconciles three properties that are usually in tension for sequence models. The same trained weights can be run in any of the three forms, so a model trained in parallel can be deployed in $$O(1)$$ recurrent mode without retraining. RetNet has since been adopted, modified, and studied across a wide range of follow-up work, including the Gated RetNet (gRet) variant used in Microsoft's YOCO decoder-decoder architecture, vision adaptations such as RMT and RetViT, and the LION family of bidirectional retention models.

## Background

### The sequence model landscape before RetNet

By mid-2023, the [Transformer](/wiki/transformer) introduced by Vaswani et al. in 2017 dominated language modeling, machine translation, and most other large-scale sequence tasks. Its core component, scaled dot-product self-attention, computes pairwise interactions between every token in a sequence. This gives the architecture two practical strengths: every layer can be trained fully in parallel across the sequence, and any pair of tokens can directly influence each other regardless of distance.

The same property is also the Transformer's most stubborn weakness. Self-attention has time and memory complexity of $$O(L^2)$$ in the sequence length $$L$$. During autoregressive inference, decoder-only Transformers cache the keys and values produced for every previously generated token, so memory consumption grows linearly with context length, and per-token decoding cost grows with the size of the cache. Long-context inference is therefore expensive in both wall-clock time and GPU memory, even when the rest of the system is highly optimized.
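As a rough illustration of the linear growth, the sketch below estimates the KV-cache size of a hypothetical decoder. The layer count, model width, and 2-byte (fp16) element size are assumptions chosen for illustration, not figures from the RetNet paper, and the formula assumes standard multi-head attention with no key-value sharing.

```python
def kv_cache_bytes(seq_len, n_layers=32, d_model=4096, bytes_per_elem=2, batch=1):
    """Approximate KV-cache size: one key and one value vector of width
    d_model per token, per layer (hypothetical dimensions, fp16 by default)."""
    return 2 * n_layers * seq_len * d_model * bytes_per_elem * batch


for seq_len in (2048, 8192, 32768):
    gib = kv_cache_bytes(seq_len) / 2**30
    print(f"{seq_len:>6} tokens -> {gib:.1f} GiB")
# prints 1.0, 4.0 and 16.0 GiB: the cache grows in direct proportion to context length
```

With these assumed dimensions each additional token adds 0.5 MiB of cache, which is why long contexts and large batches quickly compete for the same GPU memory.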

Researchers had spent several years trying to remove this quadratic bottleneck. The main families of work included:

| Family | Representative work | Approach |
|--------|---------------------|----------|
| Sparse attention | Longformer, BigBird | Attend only to local or pattern-selected tokens |
| [Linear Attention](/wiki/linear_attention) | Performer, Linear Transformers | Replace softmax with kernel feature maps for $$O(L)$$ cost |
| Recurrent revival | [RWKV](/wiki/rwkv) | Combine attention-style training with RNN-style inference |
| State space models | S4, S5, [Mamba](/wiki/mamba) | Continuous-time dynamical systems with structured matrices |
| IO-aware attention kernels | FlashAttention | Keep the $$O(L^2)$$ algorithm but reduce memory traffic |

Many of these approaches improved efficiency, but each tended to sacrifice something. Sparse and linear attention often gave up modeling quality at scale. Pure RNN architectures suffered from sequential training and poor parallel utilization on modern GPUs. FlashAttention sped up the standard Transformer dramatically but did not change the underlying $$O(L^2)$$ cost.

### What is the impossible triangle?

The impossible triangle is RetNet's framing of three properties of which prior sequence architectures could achieve at most two at a time: training parallelism, low-cost (ideally $$O(1)$$) inference, and strong language modeling quality at scale.[1] Each prior family missed a different corner of the triangle:

| Property | Transformer | Linear attention | Recurrent network |
|----------|-------------|------------------|-------------------|
| Training parallelism | Yes | Yes | No |
| $$O(1)$$ inference cost per token | No | Approximate | Yes |
| Strong language modeling quality at scale | Yes | Mixed | Mixed |

The RetNet paper argues that retention satisfies all three corners at once. Whether that claim survives careful scrutiny at very large scales is still debated, but the framing of the impossible triangle has become a useful shorthand when comparing efficient sequence architectures.

## How does the retention mechanism work?

Retention is best understood as a particular kind of linear recurrence with a complex exponential decay, written in a form that admits a closed-form parallel computation. The paper develops the mechanism from first principles, starting with a recurrent equation and then deriving an equivalent matrix expression that can be evaluated all at once during training. The abstract summarizes the design as a mechanism "which supports three computation paradigms, i.e., parallel, recurrent, and chunkwise recurrent."[1]

### Recurrent form

Consider a sequence of input vectors $$x_1, x_2, \ldots, x_L$$. RetNet first projects each $$x_n$$ into a query vector $$Q_n$$, a key vector $$K_n$$, and a value vector $$V_n$$ using learned linear maps, similar to standard attention. The retention recurrence maintains a hidden state matrix $$S_n$$ that is updated at every step using a complex decay factor $$\gamma$$:

$$
S_n = \gamma S_{n-1} + K_n^\top V_n
$$

$$
O_n = Q_n S_n
$$

The state $$S_n$$ is a fixed-size matrix that summarizes the entire history up to step $$n$$. Decoding a new token requires only a single matrix update and a single matrix-vector product, so the per-token cost during generation is independent of how many tokens have already been emitted. This gives RetNet its $$O(1)$$ inference cost per step. The authors note that the recurrent representation "enables low-cost O(1) inference, which improves decoding throughput, latency, and GPU memory without sacrificing performance."[1]
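The recurrence above can be sketched in a few lines of NumPy. This is a minimal single-head illustration with a real-valued decay; the function name `retention_step`, the dimensions, and the value of `gamma` are assumptions for the example, and the position rotation and normalisation used in the full model are omitted.

```python
import numpy as np


def retention_step(S, q, k, v, gamma):
    """One decoding step: decay the state, add the new key-value outer
    product, then read the state out with the query."""
    S = gamma * S + np.outer(k, v)   # S_n = gamma * S_{n-1} + K_n^T V_n
    return S, q @ S                  # O_n = Q_n S_n


rng = np.random.default_rng(0)
d_k, d_v, gamma = 4, 6, 0.9          # illustrative sizes and decay
S = np.zeros((d_k, d_v))
outputs = []
for _ in range(100):                 # 100 decoding steps
    q, k, v = rng.normal(size=d_k), rng.normal(size=d_k), rng.normal(size=d_v)
    S, o = retention_step(S, q, k, v, gamma)
    outputs.append(o)

print(S.shape)  # (4, 6): the state has the same size after any number of tokens
```

The work per step depends only on `d_k` and `d_v`, never on how many tokens came before, which is the sense in which decoding is $$O(1)$$ per token.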

The $$\gamma$$ factor controls how quickly older information decays. In the paper's derivation the decay is complex-valued: its magnitude sets the rate of forgetting and its angle applies a rotation, which connects the retention mechanism to relative position encoding through the same trick used by xPos and RoPE. The decay factor causes tokens that are far apart in the sequence to influence each other less than nearby tokens, which gives retention its name and provides an implicit positional bias without explicit position embeddings.
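The link to relative position can be made explicit with a short worked identity. As an assumption for illustration, write the complex decay as $$\gamma e^{i\theta}$$ with real magnitude $$\gamma$$ and a single scalar angle $$\theta$$, and let $$K_m^\dagger$$ denote the conjugate transpose. The notation here is illustrative and only sketches the argument in the paper.[1]

$$
Q_n \left(\gamma e^{i\theta}\right)^{n-m} K_m^\dagger = \gamma^{\,n-m} \left(Q_n e^{in\theta}\right) \left(K_m e^{im\theta}\right)^\dagger
$$

The rotation $$e^{in\theta}$$ can therefore be folded into the queries and keys, as in RoPE and xPos, leaving a real decay $$\gamma^{n-m}$$ that depends only on the distance $$n - m$$.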

### Parallel form

The recurrent form is convenient for inference but hard to train efficiently, because each step depends on the previous one. RetNet derives a closed-form parallel expression for the same operator by unrolling the recurrence. The result looks similar to a single attention step, with one key difference: the softmax of standard attention is replaced by a fixed lower-triangular decay matrix $$D$$, whose entries are $$D[i,j] = \gamma^{i-j}$$ for $$i \ge j$$ and zero otherwise.
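The unrolling step can be written out explicitly. Assuming a zero initial state $$S_0 = 0$$ (an assumption made here for the sketch), repeated substitution of the recurrence gives:

$$
S_n = \sum_{m=1}^{n} \gamma^{n-m} K_m^\top V_m
\qquad\Longrightarrow\qquad
O_n = Q_n S_n = \sum_{m=1}^{n} \gamma^{n-m} \left(Q_n K_m^\top\right) V_m
$$

Each output is a decay-weighted sum over earlier positions, and the weight $$\gamma^{n-m}$$ is exactly the entry $$D[n,m]$$ of the decay matrix.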

The parallel retention operation can be written as:

$$
\mathrm{Retention}(X) = (Q K^\top \odot D) V
$$

where $$Q$$, $$K$$, and $$V$$ are the query, key, and value matrices stacked across the sequence, $$D$$ is the decay mask, and $$\odot$$ denotes the elementwise (Hadamard) product. Because there is no softmax, this expression reduces to dense matrix multiplications plus an elementwise mask and can be computed for the whole sequence in one pass on a GPU. This gives RetNet its training parallelism. Crucially, this parallel form is mathematically equivalent to the recurrent form rather than a separate approximation, so the same trained weights can be used in either mode.
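The equivalence of the two forms can be checked numerically. The following is a minimal single-head sketch with a real-valued decay and random inputs; the function names, sizes, and `gamma` value are assumptions for illustration, and the decay mask is applied to $$Q K^\top$$ elementwise. It is not the reference implementation.

```python
import numpy as np


def decay_mask(length, gamma):
    """Lower-triangular D with D[i, j] = gamma**(i - j) for i >= j, else 0."""
    idx = np.arange(length)
    diff = idx[:, None] - idx[None, :]
    return np.where(diff >= 0, gamma ** np.maximum(diff, 0), 0.0)


def retention_parallel(Q, K, V, gamma):
    """All positions at once: mask Q K^T elementwise with D, then apply to V."""
    return ((Q @ K.T) * decay_mask(Q.shape[0], gamma)) @ V


def retention_recurrent(Q, K, V, gamma):
    """Token by token, carrying a fixed-size state matrix S."""
    S = np.zeros((K.shape[1], V.shape[1]))
    out = np.empty((Q.shape[0], V.shape[1]))
    for n in range(Q.shape[0]):
        S = gamma * S + np.outer(K[n], V[n])
        out[n] = Q[n] @ S
    return out


rng = np.random.default_rng(0)
L, d_k, d_v, gamma = 16, 8, 8, 0.95   # illustrative sizes
Q = rng.normal(size=(L, d_k))
K = rng.normal(size=(L, d_k))
V = rng.normal(size=(L, d_v))

same = np.allclose(retention_parallel(Q, K, V, gamma),
                   retention_recurrent(Q, K, V, gamma))
print(same)  # True: both forms compute the same outputs up to rounding
```

Because the two functions share no parameters beyond `gamma` and the projected inputs, weights learned with the parallel form carry over unchanged to recurrent decoding.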

### Chunkwise recurrent form

The parallel form is fast but uses $$O(L^2)$$ memory for the intermediate $$Q K^\top$$ product, which becomes a problem for very long sequences. The pure recurrent form uses $$O(1)$$ state but cannot be parallelised across the sequence dimension. RetNet's chunkwise recurrent form interpolates between them: the sequence is split into chunks of length $$C$$, the parallel form is applied inside each chunk, and the recurrent form is used to pass state between chunks. The paper describes this form as one in which "each chunk is encoded parallelly while recurrently summarizing the chunks," giving long-sequence modeling with linear complexity.[1]

The resulting compute and memory cost is linear in $$L$$ while still using GPU-friendly dense matrix operations within each chunk. The authors recommend chunkwise computation as the default for long-sequence training. This is broadly the same idea that later appeared as Structured State Space Duality (SSD) in [Mamba 2](/wiki/mamba_2), where Tri Dao and Albert Gu formalised the connection between linear recurrences and chunked matrix multiplications.[10]
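The chunkwise scheme can be sketched as follows. This is a minimal single-head NumPy illustration under the same simplifications as a plain real-valued retention (no rotation or normalisation); the function names and the handling of a shorter final chunk are assumptions of this sketch rather than details taken from the paper.

```python
import numpy as np


def decay_mask(length, gamma):
    """Lower-triangular D with D[i, j] = gamma**(i - j) for i >= j, else 0."""
    idx = np.arange(length)
    diff = idx[:, None] - idx[None, :]
    return np.where(diff >= 0, gamma ** np.maximum(diff, 0), 0.0)


def retention_parallel(Q, K, V, gamma):
    """Reference: full parallel form over the whole sequence."""
    return ((Q @ K.T) * decay_mask(Q.shape[0], gamma)) @ V


def retention_chunkwise(Q, K, V, gamma, chunk_size):
    """Parallel form inside each chunk, recurrent state between chunks."""
    seq_len, d_v = Q.shape[0], V.shape[1]
    S = np.zeros((K.shape[1], d_v))           # state carried across chunks
    out = np.empty((seq_len, d_v))
    for start in range(0, seq_len, chunk_size):
        q, k, v = (M[start:start + chunk_size] for M in (Q, K, V))
        c = q.shape[0]                         # the last chunk may be shorter
        pos = np.arange(c)
        inner = ((q @ k.T) * decay_mask(c, gamma)) @ v        # within-chunk part
        cross = (q @ S) * (gamma ** (pos + 1))[:, None]       # contribution of earlier chunks
        out[start:start + c] = inner + cross
        # Decay the old state over the whole chunk and add this chunk's keys and values.
        S = gamma ** c * S + k.T @ (v * (gamma ** (c - 1 - pos))[:, None])
    return out


rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(10, 4)) for _ in range(3))
same = np.allclose(retention_chunkwise(Q, K, V, 0.9, chunk_size=4),
                   retention_parallel(Q, K, V, 0.9))
print(same)  # True: chunking changes the schedule of the computation, not its result
```

Only a `chunk_size` by `chunk_size` score matrix is ever materialised, so peak memory no longer grows quadratically with the full sequence length.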

### Multi-scale retention

Using a single global decay factor would force every position to forget the past at the same rate. RetNet uses a multi-head variant called **Multi-Scale Retention (MSR)**, in which each head has its own decay factor $$\gamma_h$$. Different heads end up specialising in different temporal scales, with some heads tracking very local context and others retaining information across much longer windows. This is analogous to multi-head attention but uses a fixed exponential schedule of decay rates rather than learned attention patterns.
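To make the idea of different temporal scales concrete, the sketch below assumes the per-head schedule $$\gamma_h = 1 - 2^{-5-h}$$ described in the paper [1] and prints a rough "horizon" for each head. The helper name `head_decays` and the horizon measure $$1/(1-\gamma_h)$$, which is the sum of the geometric decay weights, are choices made for this illustration.

```python
def head_decays(n_heads):
    """Per-head decay gamma_h = 1 - 2**(-5 - h) for h = 0 .. n_heads - 1
    (schedule assumed from the paper's description)."""
    return [1.0 - 2.0 ** (-5 - h) for h in range(n_heads)]


for h, gamma in enumerate(head_decays(4)):
    horizon = 1.0 / (1.0 - gamma)   # sum of the weights 1 + gamma + gamma**2 + ...
    print(f"head {h}: gamma = {gamma}, horizon ~ {horizon:.0f} tokens")
```

Under this assumed schedule the horizon doubles from one head to the next (32, 64, 128, 256 tokens for the first four heads), so low-index heads focus on recent tokens while high-index heads retain a much longer window.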

The full RetNet block applies multi-scale retention followed by a feed-forward network. Later updates to the reference implementation use a SwiGLU feed-forward network and [RMSNorm](/wiki/rmsnorm) as the normalisation layer by default. The block structure is otherwise close to a standard Transformer block, which makes RetNet a near drop-in replacement at the architectural level.

### Gated retention

In May 2024, the same group introduced a refinement called **Gated Retention (gRet)**, also referred to as RetNet-3, as part of the YOCO architecture. Gated retention adds a data-dependent gating term to the retention recurrence, so that the decay between tokens is no longer a fixed schedule but is conditioned on the input. This addresses one of the limitations of the original formulation, where the rate of forgetting is hardwired into the architecture rather than learned from data. The result is closer in spirit to selective state space models like [Mamba](/wiki/mamba), which similarly make their dynamics input-dependent.
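The difference from fixed-decay retention can be sketched by replacing the constant $$\gamma$$ with one decay value per token computed from the input. The sigmoid gate and the projection vector `w_gate` below are hypothetical choices made for illustration; the exact gate parameterisation in YOCO may differ.

```python
import numpy as np


def gated_retention_recurrent(Q, K, V, gates):
    """Retention with a per-step decay gates[n] in (0, 1) instead of a fixed gamma."""
    S = np.zeros((K.shape[1], V.shape[1]))
    out = np.empty((Q.shape[0], V.shape[1]))
    for n in range(Q.shape[0]):
        S = gates[n] * S + np.outer(K[n], V[n])   # decay now depends on the token
        out[n] = Q[n] @ S
    return out


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


rng = np.random.default_rng(0)
L, d_model, d_k, d_v = 12, 16, 8, 8                # illustrative sizes
X = rng.normal(size=(L, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d)) for d in (d_k, d_k, d_v))
w_gate = rng.normal(size=d_model)                  # hypothetical gate projection

gates = sigmoid(X @ w_gate)                        # one input-dependent decay per token
out = gated_retention_recurrent(X @ W_q, X @ W_k, X @ W_v, gates)
print(out.shape)  # (12, 8)
```

Setting every entry of `gates` to the same constant recovers the original fixed-decay recurrence, which makes the original formulation a special case of the gated one in this sketch.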

## Theoretical advantages

### Training cost

Because the parallel form is a sequence of dense matrix multiplications without softmax, RetNet maps naturally onto GPU matrix-multiply hardware. The paper reports that, during training, "RetNet also achieves 25-50% memory saving and 7x acceleration than standard Transformer" at similar parameter counts.[1] It also claims a small advantage over Transformer with FlashAttention, although the size of that advantage shrinks as the attention implementation is more aggressively optimised.

### Inference cost

In recurrent mode, RetNet emits each new token with constant time and constant memory regardless of context length, since the state matrix $$S$$ is a fixed-size summary of the past. This contrasts sharply with a decoder-only Transformer, whose KV cache grows linearly with the number of generated tokens and whose per-token attention cost grows with cache size. The paper states that "for a 7B model and 8k sequence length, RetNet decodes 8.4x faster and saves 70% of memory than Transformers with key-value caches."[1] The speedup is even larger at longer contexts, since the Transformer's costs keep growing while RetNet's stay flat.
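A back-of-the-envelope comparison shows why the fixed-size state matters. All dimensions below are hypothetical (32 layers, 16 heads, key width 256 and value width 512 per head, fp16), chosen only to make the arithmetic concrete; the figures are not the measured results from the paper, which also include model weights and activations.

```python
def kv_bytes_per_token(n_layers=32, d_model=4096, bytes_per_elem=2):
    """KV cache added per generated token: one key and one value per layer."""
    return 2 * n_layers * d_model * bytes_per_elem


def retention_state_bytes(n_layers=32, n_heads=16, d_k=256, d_v=512, bytes_per_elem=2):
    """Total recurrent state: one d_k x d_v matrix per head per layer,
    independent of how many tokens have been processed."""
    return n_layers * n_heads * d_k * d_v * bytes_per_elem


state = retention_state_bytes()
breakeven = state // kv_bytes_per_token()
print(f"retention state: {state / 2**20:.0f} MiB at any context length")   # 128 MiB
print(f"KV cache reaches the same size after {breakeven} tokens")          # 256 tokens
```

Under these assumptions the KV cache overtakes the retention state after a few hundred tokens and keeps growing, while the state stays the same size at 8K, 32K, or longer contexts.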

### Length extrapolation

Because the decay factor $$\gamma$$ effectively encodes relative position, RetNet does not need separate positional embeddings. The authors argue this gives the architecture better length-extrapolation behaviour, since there is no learned position table to run off the end of. Decay-based positional encoding is closely related to the xPos scheme developed by some of the same authors in earlier work on length-extrapolatable Transformers.[13]

### Hardware friendliness

The chunkwise recurrent form is designed to keep most computation in the form of dense matrix multiplications, which are well-served by tensor cores on modern GPUs. This is one of the more practical advantages of RetNet relative to architectures that rely heavily on scans or other operations that GPUs handle less efficiently.

## What were RetNet's empirical results?

The original paper trained RetNet and Transformer baselines on a mixed corpus drawn from The Pile, C4, and The Stack, using the TorchScale library on 512 AMD MI200 GPUs. Models were trained at 1.3B, 2.7B, and 6.7B parameter scales for comparison.[1]

### Language modeling perplexity

The headline result is that RetNet's perplexity is competitive with the Transformer's at smaller scales and surpasses it once the model crosses roughly 2 billion parameters. The scaling curve has a more favourable slope than the Transformer's, meaning RetNet's advantage grows with scale rather than shrinking.

| Model size | RetNet perplexity vs Transformer | Notes |
|------------|----------------------------------|-------|
| 1.3B | Slightly behind to roughly comparable | Within noise of Transformer baseline |
| 2.7B | Roughly comparable | Crossover point around this scale |
| 6.7B | Better than Transformer | Favourable scaling trend reported |

On downstream zero-shot and few-shot evaluation tasks including LAMBADA, HellaSwag, PIQA, WinoGrande, ARC, BoolQ, COPA, and Story Cloze, RetNet-6.7B reports accuracy that is broadly on par with a similarly sized Transformer, with some tasks favouring each architecture.

### Inference throughput and memory

RetNet's main advantage shows up in inference benchmarks. The paper reports the following for a 6.7 billion parameter model on A100-80GB GPUs at an 8,192 token context length:[1]

| Metric | Transformer (with KV cache) | RetNet | Notes |
|--------|------------------------------|--------|-------------|
| Decoding speed | 1x baseline | ~8.4x | Per-token cost is constant |
| GPU memory consumption | 1x baseline | ~0.3x | About 70 percent reduction |
| Latency vs context length | Grows with length | Flat | $$O(1)$$ decoding |
| Throughput vs batch size | Limited by KV cache | Higher | Smaller per-sequence footprint |

The gap widens as context length increases. At 32K tokens or beyond, the Transformer KV cache dominates GPU memory and limits batch size, while RetNet's fixed-size state lets the system use most of the available memory for additional concurrent sequences.

### Training throughput

Using the chunkwise recurrent form, RetNet reportedly achieved 25 to 50 percent memory savings and around 7x throughput improvement over a standard Transformer implementation in PyTorch.[1] Against FlashAttention, the throughput advantage is smaller but still positive at long sequence lengths.

## Implementations and adoption

### Is RetNet open source?

Yes. Microsoft released a reference implementation of RetNet inside the **TorchScale** library, available on GitHub at `microsoft/torchscale` under the MIT license.[4] The library exposes a `RetNetConfig` and `RetNetDecoder` class that can be used as a drop-in replacement for a Transformer decoder. The repository was updated in October 2023 to make [RMSNorm](/wiki/rmsnorm) and SwiGLU the default modules inside RetNet blocks. A second copy of the implementation sits inside `microsoft/unilm/retnet`, the broader UniLM research repository.[5]

Independent open-source implementations followed quickly. Two of the most cited community ports are `Jamie-Stirling/RetNet` and `fkodom/yet-another-retnet`, both of which provide PyTorch implementations of multi-scale retention and the three computation forms.[6][7] These have been used as starting points for many downstream research projects.

### YOCO and gated retention

In May 2024, several of the original RetNet authors published "You Only Cache Once: Decoder-Decoder Architectures for Language Models" (arXiv 2405.05254).[3] YOCO uses two stacks of decoder blocks: a **self-decoder** that uses efficient sequence operators (including gated retention), and a **cross-decoder** that uses attention with a single shared global KV cache produced by the self-decoder. The paper reports that YOCO can be extended to 1 million token contexts with near-perfect needle-in-a-haystack retrieval, and that at 1M length it consumes 9.38 times less memory than a Transformer baseline with Grouped Query Attention, Flash-Decoding, and kernel fusion.[3]

Gated retention, used as the self-decoder backbone in YOCO, is sometimes called RetNet-3 to signal its lineage. It adds a learned, data-dependent gate to the retention recurrence, which improves quality on tasks where a fixed decay schedule was too rigid.

### Vision and other adaptations

A 2025 survey of retentive networks (arXiv 2506.06708) catalogued more than fifty distinct adaptations of RetNet across domains.[8] Notable variants include:

| Variant | Domain | Notes |
|---------|--------|-------|
| RMT | Vision | Retentive networks meet vision transformers; Manhattan self-retention |
| RetViT | Vision | Retention as a drop-in for vision transformer attention |
| ViR | Vision | Vision retention networks for image recognition |
| LION / LION-D | Language | Bidirectional retention framework for encoder tasks |
| DenseRetNet | Language | Densely connected retention layers |
| RetCompletion | Image inpainting | High-speed image completion with retention |
| JetRetNet | Particle physics | Retentive networks for jet tagging |
| CellFM | Transcriptomics | Foundation model for single-cell genomics |
| RetEEG | Neuroscience | EEG signal modeling with retention |
| MonoRetNet | 3D vision | Monocular depth estimation |

Most vision adaptations swap attention for retention in the same overall block layout used by Vision Transformer or Swin Transformer, sometimes generalising the 1D causal decay to 2D Manhattan or Chebyshev distances over image patches.
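A two-dimensional Manhattan decay can be sketched as a mask over flattened image patches. The function name, the row-major patch ordering, and the grid size are assumptions for illustration; the mask here is symmetric (bidirectional) rather than causal, and individual vision models may add further details.

```python
import numpy as np


def manhattan_decay_mask(height, width, gamma):
    """D[p, q] = gamma ** (|row_p - row_q| + |col_p - col_q|) over a flattened
    height x width patch grid (row-major order)."""
    rows, cols = np.divmod(np.arange(height * width), width)
    dist = (np.abs(rows[:, None] - rows[None, :])
            + np.abs(cols[:, None] - cols[None, :]))
    return gamma ** dist


D = manhattan_decay_mask(3, 4, gamma=0.9)
print(D.shape)   # (12, 12): one row and one column per patch
print(D[0, 5])   # patch (0, 0) against patch (1, 1): distance 2, so about 0.9 ** 2
```

Replacing the 1D mask $$\gamma^{i-j}$$ with this matrix keeps the same masked-product structure while making the decay depend on spatial distance between patches.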

## How does RetNet compare to Transformers, Mamba, and RWKV?

RetNet sits in the same general family as several other efficient sequence architectures, each of which takes a different angle on the same basic problem. The table below summarises the main points of contrast at a high level.

| Architecture | Mechanism | Training | Inference per token | Decay / selectivity | Notable variants |
|--------------|-----------|----------|---------------------|---------------------|-------------------|
| [Transformer](/wiki/transformer) | Softmax self-attention | Parallel, $$O(L^2)$$ | $$O(L)$$ with KV cache | Learned via attention | Decoder-only LLMs |
| RetNet | Multi-scale retention, exponential decay | Parallel $$O(L^2)$$ or chunkwise $$O(L)$$ | $$O(1)$$ recurrent | Fixed decay per head | Gated RetNet, YOCO |
| [Linear Attention](/wiki/linear_attention) | Kernel feature maps replace softmax | Parallel, $$O(L)$$ | $$O(1)$$ recurrent | None or simple | Performer, Linear Transformer |
| [RWKV](/wiki/rwkv) | Time-mix and channel-mix with exponential decay | Parallel (via WKV) | $$O(1)$$ recurrent | Learned channel-wise decay | RWKV-4, RWKV-5, RWKV-6 |
| [Mamba](/wiki/mamba) | Selective state space model | Parallel scan, $$O(L)$$ | $$O(1)$$ recurrent | Input-dependent | Mamba 2, Jamba |
| [Mamba 2](/wiki/mamba_2) | Structured state space duality | Chunked dense matmul, $$O(L)$$ | $$O(1)$$ recurrent | Input-dependent | SSD framework |

The family resemblances are not accidental. The 2024 SSD paper that introduced [Mamba 2](/wiki/mamba_2) shows that a wide class of linear-state architectures, including RetNet, [Linear Attention](/wiki/linear_attention), [RWKV](/wiki/rwkv), and a scalar-A state space model, can all be expressed in a unified semiseparable-matrix framework.[10] From this view, RetNet is a specific point in a larger design space, with a fixed exponential decay schedule across heads and no input-dependent gating.

Relative to RWKV, RetNet has a tighter parallel form and uses a complex exponential decay tied to relative positions, while RWKV uses a learned per-channel decay with a different parallelisation scheme.[12] Relative to Mamba, RetNet keeps its dynamics input-independent in the original formulation, which trades some expressiveness for a cleaner parallel form.[11] Gated retention, introduced later, closes part of this gap by adding a learned gate.

## Reception

RetNet drew significant attention immediately after its arXiv release in July 2023. The framing as "a successor to Transformer" attracted both enthusiasm and pushback. Coverage in technical blogs, industry newsletters, and Chinese AI media often described the paper as the most credible proposal yet to challenge Transformer dominance, while academic responses were more measured.

The paper was submitted to ICLR 2024.[15] Reviewers generally praised the theoretical framework connecting recurrence and attention, the engineering quality of the experiments, and the clarity of the three-form derivation. The most common criticisms focused on:

- **Scale of evaluation.** The 6.7 billion parameter scale is small relative to frontier language models. Whether RetNet's favourable scaling continues past tens or hundreds of billions of parameters has not been settled by independent runs at that scale.
- **Fixed decay.** Hardwiring the decay schedule per head means the model cannot learn task-specific forgetting patterns. The 2025 retention survey identifies this as one of the main open research directions, and the later gated retention variant addresses it directly.[8]
- **Benchmark coverage.** The original paper evaluated on language modeling perplexity and a set of zero-shot tasks, but not on the broader range of benchmarks (MMLU, HumanEval, GSM8K, long-context recall) that became standard for large language models in 2024 and beyond.
- **In-context learning and recall.** Studies of efficient architectures, including the Zoology line of work from Stanford's Hazy Research group, found that linear-state models including RetNet, RWKV, and earlier state space models lag Transformers on associative recall and multi-query retrieval tasks.[14] This is a structural consequence of compressing the past into a fixed-size state.

At the same time, RetNet has been broadly influential. The retention mechanism is referenced as a baseline or starting point in most papers on efficient sequence modeling published since 2024. The chunkwise recurrent form anticipated the structured state space duality framework. Microsoft has continued to use retention as part of YOCO and follow-up architectures, and independent labs have built on retention for domain-specific models in vision, biology, and physical science.

Whether RetNet itself displaces Transformer at the largest scales remains an open question. As of 2026 there is no publicly available frontier-scale large language model that uses pure retention as its sole sequence operator, and the dominant hybrid pattern in efficient architectures today is to interleave attention layers with linear-state operators rather than committing fully to one or the other. RetNet's lasting contribution may turn out to be conceptual rather than architectural: it sharpened the trilemma between training parallelism, inference cost, and modeling quality, and gave the community a clean reference point for arguing about it.

## See also

- [Transformer](/wiki/transformer)
- [Mamba](/wiki/mamba)
- [Mamba 2](/wiki/mamba_2)
- [RWKV](/wiki/rwkv)
- [Linear Attention](/wiki/linear_attention)
- [Microsoft Research](/wiki/microsoft_research)
- [Attention mechanism](/wiki/attention_mechanism)
- [Large language model](/wiki/large_language_model)
- [Recurrent neural network](/wiki/recurrent_neural_network)
- [State space model](/wiki/state_space_model)
- [FlashAttention](/wiki/flash_attention)
- [RMSNorm](/wiki/rmsnorm)

## References

1. Sun, Y., Dong, L., Huang, S., Ma, S., Xia, Y., Xue, J., Wang, J., & Wei, F. (2023). "Retentive Network: A Successor to Transformer for Large Language Models." arXiv:2307.08621. https://arxiv.org/abs/2307.08621
2. Sun, Y., Dong, L., Huang, S., Ma, S., Xia, Y., Xue, J., Wang, J., & Wei, F. (2023). "Retentive Network: A Successor to Transformer for Large Language Models." Microsoft Research publication page. https://www.microsoft.com/en-us/research/publication/retentive-network-a-successor-to-transformer-for-large-language-models/
3. Sun, Y., Dong, L., Pan, Y., Huang, S., Ma, S., Xia, Y., Xue, J., Wang, J., & Wei, F. (2024). "You Only Cache Once: Decoder-Decoder Architectures for Language Models." arXiv:2405.05254. https://arxiv.org/abs/2405.05254
4. Microsoft. (2023). TorchScale library, including RetNet reference implementation. https://github.com/microsoft/torchscale
5. Microsoft. (2023). UniLM repository, RetNet directory. https://github.com/microsoft/unilm/tree/master/retnet
6. Stirling, J. (2023). "An implementation of Retentive Network: A Successor to Transformer for Large Language Models." https://github.com/Jamie-Stirling/RetNet
7. Kodom, F. (2023). "yet-another-retnet: A simple but robust PyTorch implementation of RetNet." https://github.com/fkodom/yet-another-retnet
8. A Survey of Retentive Network (2025). arXiv:2506.06708. https://arxiv.org/abs/2506.06708
9. Vaswani, A., et al. (2017). "Attention Is All You Need." Advances in Neural Information Processing Systems 30.
10. Dao, T., & Gu, A. (2024). "Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality." arXiv:2405.21060. https://arxiv.org/abs/2405.21060
11. Gu, A., & Dao, T. (2023). "Mamba: Linear-Time Sequence Modeling with Selective State Spaces." arXiv:2312.00752. https://arxiv.org/abs/2312.00752
12. Peng, B., et al. (2023). "RWKV: Reinventing RNNs for the Transformer Era." arXiv:2305.13048. https://arxiv.org/abs/2305.13048
13. Sun, Y., Dong, L., et al. (2023). "A Length-Extrapolatable Transformer." Proceedings of ACL 2023. (xPos position encoding background.)
14. Hazy Research. (2023). "Zoology: Measuring and Improving Recall in Efficient Language Models." https://hazyresearch.stanford.edu/blog/2023-12-11-zoology1-analysis
15. RetNet OpenReview submission. ICLR 2024 review thread. https://openreview.net/forum?id=UU9Icwbhin