Retentive Network (RetNet)

Updated Jul 12, 2026

What is RetNet?

RetNet (Retentive Network) is a sequence-modeling architecture proposed by Microsoft Research and Tsinghua University in July 2023 as a successor to the Transformer for large language models. Its defining feature is a sequence operator called retention that can be computed in three mathematically equivalent forms: a parallel form for efficient training, a recurrent form for constant-time $O(1)$ inference, and a chunkwise recurrent form for long sequences. This lets RetNet claim all three corners of the so-called impossible triangle at once: training parallelism, low-cost inference, and strong performance. For a 7 billion parameter model at an 8,000 token context length, the paper reports that RetNet decodes 8.4 times faster and uses 70 percent less memory than a Transformer with key-value caching.^[1]

The architecture was introduced in the paper "Retentive Network: A Successor to Transformer for Large Language Models," uploaded to arXiv on 17 July 2023 under the identifier 2307.08621. Its authors are Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei, working at Microsoft Research and Tsinghua University.^[1] The paper's abstract states the goal directly: RetNet is proposed "as a foundation architecture for large language models, simultaneously achieving training parallelism, low-cost inference, and good performance," and the authors conclude that "the intriguing properties make RetNet a strong successor to Transformer for large language models."^[1]

The central technical claim of RetNet is that retention, derived from a theoretical connection between recurrence and attention, resolves three properties that are usually in tension for sequence models. The same trained weights can be run in any of the three forms, so a model trained in parallel can be deployed in $O(1)$ recurrent mode without retraining. RetNet has since been adopted, modified, and studied across a wide range of follow-up work, including the Gated RetNet (gRet) variant used in Microsoft's YOCO decoder-decoder architecture, vision adaptations such as RMT and RetViT, and the LION family of bidirectional retention models.

Background

The sequence model landscape before RetNet

By mid-2023, the Transformer introduced by Vaswani et al. in 2017 dominated language modeling, machine translation, and most other large-scale sequence tasks. Its core component, scaled dot-product self-attention, computes pairwise interactions between every token in a sequence. This gives the architecture two practical strengths: every layer can be trained fully in parallel across the sequence, and any pair of tokens can directly influence each other regardless of distance.

The same property is also the Transformer's most stubborn weakness. Self-attention has time and memory complexity of $O(L^2)$ in the sequence length $L$. During autoregressive inference, decoder-only Transformers cache the keys and values produced for every previously generated token, so memory consumption grows linearly with context length, and per-token decoding cost grows with the size of the cache. Long-context inference is therefore expensive in both wall-clock time and GPU memory, even when the rest of the system is highly optimized.

Researchers had spent several years trying to remove this quadratic bottleneck. The main families of work included:

Family	Representative work	Approach
Sparse attention	Longformer, BigBird	Attend only to local or pattern-selected tokens
Linear attention	Performer, Linear Transformers	Replace softmax with kernel feature maps for $O(L)$ cost
Recurrent revival	RWKV	Combine attention-style training with RNN-style inference
State space models	S4, S5 (Mamba followed in December 2023)	Continuous-time dynamical systems with structured matrices
IO-aware exact attention	FlashAttention	Keep the $O(L^2)$ algorithm but reduce memory traffic

Many of these approaches improved efficiency, but each tended to sacrifice something. Sparse and linear attention often gave up modeling quality at scale. Pure RNN architectures suffered from sequential training and poor parallel utilization on modern GPUs. FlashAttention sped up the standard Transformer dramatically but did not change the underlying $O(L^2)$ cost.

What is the impossible triangle?

The impossible triangle is RetNet's framing of three properties, of which prior sequence architectures could achieve at most two at a time: training parallelism, low-cost (ideally $O(1)$) inference, and strong language modeling quality at scale.^[1] Each prior family covered a different part of the triangle:

Property	Transformer	Linear attention	Recurrent network
Training parallelism	Yes	Yes	No
$O(1)$ inference cost per token	No	Yes	Yes
Strong language modeling quality at scale	Yes	Mixed	Mixed

The RetNet paper argues that retention satisfies all three corners at once. Whether that claim survives careful scrutiny at very large scales is still debated, but the framing of the impossible triangle has become a useful shorthand when comparing efficient sequence architectures.

How does the retention mechanism work?

Retention is best understood as a particular kind of linear recurrence with a complex exponential decay, written in a form that admits a closed-form parallel computation. The paper develops the mechanism from first principles, starting with a recurrent equation and then deriving an equivalent matrix expression that can be evaluated all at once during training. The abstract summarizes the design as a mechanism "which supports three computation paradigms, i.e., parallel, recurrent, and chunkwise recurrent."^[1]

Recurrent form

Consider a sequence of input vectors $x_1, x_2, \ldots, x_L$. RetNet first projects each $x_n$ into a query vector $Q_n$, a key vector $K_n$, and a value vector $V_n$ using learned linear maps, similar to standard attention. The retention recurrence maintains a hidden state matrix $S_n$ that is updated at every step using a decay factor $\gamma$:

$$S_n = \gamma S_{n-1} + K_n^\top V_n$$

$$O_n = Q_n S_n$$

The state $S_n$ is a fixed-size matrix that summarizes the entire history up to step $n$ . Decoding a new token requires only a single matrix update and a single matrix-vector product, so the per-token cost during generation is independent of how many tokens have already been emitted. This gives RetNet its $O(1)$ inference cost per step. The authors note that the recurrent representation "enables low-cost O(1) inference, which improves decoding throughput, latency, and GPU memory without sacrificing performance."^[1]

The $\gamma$ factor controls how quickly older information decays. In the paper's derivation the decay starts out complex-valued, with both a magnitude and a rotation angle. The final formulation keeps $\gamma$ as a real scalar between 0 and 1 and folds the rotation into the queries and keys, which connects the retention mechanism to relative position encoding through the same trick used by xPos and RoPE. The decay factor causes tokens that are far apart in the sequence to influence each other less than nearby tokens, which gives retention its name and provides an implicit positional bias without explicit position embeddings.
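The recurrence above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, assuming a real-valued scalar decay and leaving out the rotation, normalisation, and gating of the full model. The function name `recurrent_retention` and the toy shapes are illustrative and are not taken from the reference implementation.

```python
import numpy as np

def recurrent_retention(Q, K, V, gamma):
    """Single-head retention, one token at a time.

    Q, K: (L, d_k) arrays; V: (L, d_v) array; gamma: scalar decay in (0, 1).
    Returns an (L, d_v) array of outputs.
    """
    L, d_k = Q.shape
    d_v = V.shape[1]
    S = np.zeros((d_k, d_v))          # fixed-size state, independent of L
    outputs = np.empty((L, d_v))
    for n in range(L):
        # S_n = gamma * S_{n-1} + K_n^T V_n (outer product of key and value)
        S = gamma * S + np.outer(K[n], V[n])
        # O_n = Q_n S_n
        outputs[n] = Q[n] @ S
    return outputs

# Toy inputs (random, for illustration only).
rng = np.random.default_rng(0)
Q = rng.standard_normal((6, 4))
K = rng.standard_normal((6, 4))
V = rng.standard_normal((6, 3))
out = recurrent_retention(Q, K, V, gamma=0.9)
print(out.shape)  # (6, 3)
```

The state `S` has the same shape at every step, which is why the per-token work does not depend on how many tokens came before.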

Parallel form

The recurrent form is convenient for inference but hard to train efficiently, because each step depends on the previous one. RetNet derives a closed-form parallel expression for the same operator by unrolling the recurrence. The result looks similar to a single attention step, with one key difference: the softmax of standard attention is replaced by an elementwise product with a fixed lower-triangular decay matrix $D$, whose entries are $D_{ij} = \gamma^{i-j}$ for $i \ge j$ and zero otherwise.
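The unrolling step can be written out explicitly. The derivation below is a sketch that assumes a zero initial state ($S_0 = 0$) and a real scalar decay; it uses the same symbols as the recurrence, with $n$ indexing the query position and $m$ the key position.

```latex
\begin{aligned}
S_n &= \gamma S_{n-1} + K_n^\top V_n, \qquad S_0 = 0 \\
    &= \sum_{m=1}^{n} \gamma^{\,n-m} K_m^\top V_m \\
O_n &= Q_n S_n
     = \sum_{m=1}^{n} \gamma^{\,n-m} \left(Q_n K_m^\top\right) V_m
     = \sum_{m=1}^{L} D_{nm} \left(Q K^\top\right)_{nm} V_m,
\qquad
D_{nm} =
\begin{cases}
\gamma^{\,n-m} & n \ge m \\
0 & n < m
\end{cases}
\end{aligned}
```

Each output is a decay-weighted sum over earlier positions, so stacking all $O_n$ gives a masked matrix product with no step-to-step dependency.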

The parallel retention operation can be written as:

$$\mathrm{Retention}(X) = (Q K^\top \odot D) V$$

where $Q$, $K$, and $V$ are the query, key, and value matrices stacked across the sequence, and $D$ is the decay mask, which is multiplied elementwise with $Q K^\top$. Because there is no softmax, this expression is a sequence of dense matrix multiplications and can be computed for the whole sequence in one pass on a GPU. This gives RetNet its training parallelism. Crucially, this parallel form is mathematically equivalent to the recurrent form rather than a separate approximation, so the same trained weights can be used in either mode.
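The equivalence claim is easy to check numerically. The sketch below builds the decay mask, applies it elementwise to $Q K^\top$, and compares the result with a step-by-step recurrence. It assumes a single head and a real scalar decay; all function names are illustrative.

```python
import numpy as np

def decay_mask(L, gamma):
    """Lower-triangular mask with D[i, j] = gamma**(i - j) for i >= j, else 0."""
    idx = np.arange(L)
    diff = idx[:, None] - idx[None, :]
    # Clamp negative offsets before exponentiating; they are zeroed out anyway.
    return np.where(diff >= 0, gamma ** np.maximum(diff, 0), 0.0)

def parallel_retention(Q, K, V, gamma):
    """All positions at once: ((Q K^T) * D) V, with * taken elementwise."""
    D = decay_mask(Q.shape[0], gamma)
    return ((Q @ K.T) * D) @ V

def recurrent_retention(Q, K, V, gamma):
    """Reference recurrence, used here only to check equivalence."""
    S = np.zeros((Q.shape[1], V.shape[1]))
    out = []
    for q, k, v in zip(Q, K, V):
        S = gamma * S + np.outer(k, v)
        out.append(q @ S)
    return np.stack(out)

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 4)) for _ in range(3))
same = np.allclose(parallel_retention(Q, K, V, 0.9),
                   recurrent_retention(Q, K, V, 0.9))
print(same)
```

The parallel version materialises an $L \times L$ matrix, which is the memory cost that motivates the chunkwise form.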

Chunkwise recurrent form

The parallel form is fast but uses $O(L^2)$ memory for the intermediate $Q K^\top$ product, which becomes a problem for very long sequences. The pure recurrent form uses $O(1)$ state but cannot be parallelised across the sequence dimension. RetNet's chunkwise recurrent form interpolates between them: the sequence is split into chunks of length $C$, the parallel form is applied inside each chunk, and the recurrent form is used to pass state between chunks. The paper describes this form as one in which "each chunk is encoded parallelly while recurrently summarizing the chunks," giving long-sequence modeling with linear complexity.^[1]

The resulting compute and memory cost is linear in $L$ while still using GPU-friendly dense matrix operations within each chunk. The authors recommend chunkwise computation as the default for long-sequence training. This is broadly the same idea that later appeared as Structured State Space Duality (SSD) in Mamba 2, where Tri Dao and Albert Gu formalised the connection between linear recurrences and chunked matrix multiplications.^[10]
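A small NumPy sketch makes the chunk bookkeeping concrete. Inside each chunk it uses the masked parallel form; across chunks it carries a single state matrix, decayed by the number of steps in the chunk. This is an illustration under simplifying assumptions (one head, real scalar decay, no normalisation), with hypothetical function names, and it is checked against a plain recurrence rather than against any released implementation.

```python
import numpy as np

def chunkwise_retention(Q, K, V, gamma, chunk_size):
    """Retention computed chunk by chunk; the last chunk may be shorter."""
    L, d_k = Q.shape
    R = np.zeros((d_k, V.shape[1]))       # state carried across chunks
    outputs = []
    for start in range(0, L, chunk_size):
        q, k, v = (M[start:start + chunk_size] for M in (Q, K, V))
        c = q.shape[0]
        pos = np.arange(c)
        diff = pos[:, None] - pos[None, :]
        D = np.where(diff >= 0, gamma ** np.maximum(diff, 0), 0.0)
        inner = ((q @ k.T) * D) @ v                       # parallel form inside the chunk
        cross = (q @ R) * (gamma ** (pos + 1))[:, None]   # contribution of earlier chunks
        outputs.append(inner + cross)
        # Decay the old state over c steps, then add this chunk's keys and values.
        R = gamma ** c * R + k.T @ (v * (gamma ** (c - 1 - pos))[:, None])
    return np.concatenate(outputs)

def recurrent_retention(Q, K, V, gamma):
    """Token-by-token reference used to validate the chunked version."""
    S = np.zeros((Q.shape[1], V.shape[1]))
    out = []
    for q, k, v in zip(Q, K, V):
        S = gamma * S + np.outer(k, v)
        out.append(q @ S)
    return np.stack(out)

rng = np.random.default_rng(0)
Q = rng.standard_normal((10, 4))
K = rng.standard_normal((10, 4))
V = rng.standard_normal((10, 3))
chunked = chunkwise_retention(Q, K, V, gamma=0.9, chunk_size=4)
reference = recurrent_retention(Q, K, V, gamma=0.9)
print(np.allclose(chunked, reference))
```

The largest intermediate is a `chunk_size` by `chunk_size` matrix, so memory grows with the chunk length rather than with the full sequence length.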

Multi-scale retention

Using a single global decay factor would force every position to forget the past at the same rate. RetNet uses a multi-head variant called Multi-Scale Retention (MSR), in which each head has its own decay factor $\gamma_h$. Different heads end up specialising in different temporal scales, with some heads tracking very local context and others retaining information across much longer windows. This is analogous to multi-head attention but uses a fixed exponential schedule of decay rates rather than learned attention patterns.
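To get a feel for the schedule, the sketch below assumes the form given in the RetNet paper, $\gamma_h = 1 - 2^{-5-h}$ for head index $h$ starting at 0, and reports a rough "effective horizon" for each head. The horizon is simply the sum of the geometric weights, $1/(1-\gamma)$; it is an interpretive aid added here, not a quantity defined in the paper, and the function names are illustrative.

```python
def multiscale_decays(num_heads):
    """Per-head decay rates gamma_h = 1 - 2**(-5 - h) for h = 0..num_heads-1."""
    return [1.0 - 2.0 ** (-5 - h) for h in range(num_heads)]

def effective_horizon(gamma):
    """Sum of the weights 1 + gamma + gamma**2 + ... = 1 / (1 - gamma)."""
    return 1.0 / (1.0 - gamma)

gammas = multiscale_decays(4)
for h, g in enumerate(gammas):
    print(h, g, effective_horizon(g))
# 0 0.96875 32.0
# 1 0.984375 64.0
# 2 0.9921875 128.0
# 3 0.99609375 256.0
```

Each additional head doubles the horizon, so a modest number of heads spans context windows from tens to thousands of tokens.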

The full RetNet block applies multi-scale retention followed by a feed-forward network, with a normalisation layer before each sub-layer. Later updates to the reference implementation use a SwiGLU feed-forward network and RMSNorm as the defaults. The block structure is otherwise close to a standard Transformer block, which makes RetNet a near drop-in replacement at the architectural level.

Gated retention

In May 2024, the same group introduced a refinement called Gated Retention (gRet), also referred to as RetNet-3, as part of the YOCO architecture. Gated retention adds a data-dependent gating term to the retention recurrence, so that the decay between tokens is no longer a fixed schedule but is conditioned on the input. This addresses one of the limitations of the original formulation, where the rate of forgetting is hardwired into the architecture rather than learned from data. The result is closer in spirit to selective state space models like Mamba, which similarly make their dynamics input-dependent.
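The change relative to the original recurrence is small enough to show directly: the scalar decay becomes a per-token value computed from the input. The sketch below assumes one plausible parameterisation, a sigmoid of a linear projection of the input sharpened by a temperature. Treat the exact gate form, the projection `w_gate`, and the temperature `tau` as assumptions for illustration rather than the published definition.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_retention(Q, K, V, gates):
    """Retention with a per-token decay.

    gates: (L,) array with values in (0, 1); gates[n] replaces the fixed gamma
    when the state is updated at step n.
    """
    S = np.zeros((Q.shape[1], V.shape[1]))
    out = np.empty((Q.shape[0], V.shape[1]))
    for n in range(Q.shape[0]):
        S = gates[n] * S + np.outer(K[n], V[n])   # data-dependent forgetting
        out[n] = Q[n] @ S
    return out

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 8))     # toy token representations
w_gate = rng.standard_normal(8)     # hypothetical gate projection
tau = 16.0                          # hypothetical temperature; pushes gates toward 1
gates = sigmoid(X @ w_gate) ** (1.0 / tau)

Q = rng.standard_normal((6, 4))
K = rng.standard_normal((6, 4))
V = rng.standard_normal((6, 3))
out = gated_retention(Q, K, V, gates)
print(out.shape)  # (6, 3)
```

Setting every gate to the same constant recovers the fixed-decay recurrence, which is one way to see gated retention as a strict generalisation.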

Theoretical advantages

Training cost

Because the parallel form is a sequence of dense matrix multiplications without softmax, RetNet maps naturally onto GPU matrix-multiply hardware. The paper reports that, during training, "RetNet also achieves 25-50% memory saving and 7x acceleration than standard Transformer" at similar parameter counts.^[1] It also claims a small advantage over Transformer with FlashAttention, although the size of that advantage shrinks as the attention implementation is more aggressively optimised.

Inference cost

In recurrent mode, RetNet emits each new token with constant time and constant memory regardless of context length, since the state matrix $S$ is a fixed-size summary of the past. This contrasts sharply with a decoder-only Transformer, whose KV cache grows linearly with the number of generated tokens and whose per-token attention cost grows with cache size. The paper states that "for a 7B model and 8k sequence length, RetNet decodes 8.4x faster and saves 70% of memory than Transformers with key-value caches."^[1] The speedup is even larger at longer contexts, since the Transformer's costs keep growing while RetNet's stay flat.
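A back-of-the-envelope calculation shows why the two curves diverge. The sketch below counts only the per-sequence cache or state, not the model weights or activations, so it does not reproduce the paper's measured figures. The configuration (32 layers, model width 4096, 16 heads, 256-dimensional keys and values per head, 2 bytes per value) is hypothetical and chosen only to make the arithmetic easy to follow.

```python
def kv_cache_bytes(seq_len, num_layers, d_model, bytes_per_value=2):
    """Keys and values: two (seq_len x d_model) tensors per layer."""
    return 2 * num_layers * seq_len * d_model * bytes_per_value

def retention_state_bytes(num_layers, num_heads, d_key, d_value, bytes_per_value=2):
    """One (d_key x d_value) state matrix per head per layer; no seq_len term."""
    return num_layers * num_heads * d_key * d_value * bytes_per_value

MIB = 2 ** 20
state = retention_state_bytes(num_layers=32, num_heads=16, d_key=256, d_value=256)
for seq_len in (2048, 8192, 32768):
    cache = kv_cache_bytes(seq_len, num_layers=32, d_model=4096)
    print(seq_len, cache // MIB, state // MIB)
# 2048 1024 64
# 8192 4096 64
# 32768 16384 64
```

Under these assumptions the cache grows from 1 GiB to 16 GiB per sequence while the retention state stays at 64 MiB, which is the mechanism behind the flat memory and latency curves.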

Length extrapolation

Because the decay factor $\gamma$ effectively encodes relative position, RetNet does not need separate positional embeddings. The authors argue this gives the architecture better length-extrapolation behaviour, since there is no learned position table to run off the end of. Decay-based positional encoding is closely related to the xPos scheme developed by some of the same authors in earlier work on length-extrapolatable Transformers.^[13]
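The relative-position property can be seen in a toy scalar example. The sketch assumes an xPos-style scheme in which the query and key at each position are rotated by an angle proportional to that position and the score is scaled by the decay. With that setup the score depends only on the offset between the two positions, not on where they sit in the sequence. The function name and all numeric values are illustrative.

```python
import cmath

def rotated_score(q, k, n, m, theta, gamma):
    """Score between query position n and key position m <= n (scalar features)."""
    q_n = q * cmath.exp(1j * n * theta)     # rotate the query by its position
    k_m = k * cmath.exp(1j * m * theta)     # rotate the key by its position
    # Conjugating the key cancels the absolute angles, leaving (n - m) * theta.
    return (gamma ** (n - m)) * q_n * k_m.conjugate()

q, k = 0.5 + 1j, 2 - 0.3j
near_start = rotated_score(q, k, n=7, m=4, theta=0.1, gamma=0.9)
far_along = rotated_score(q, k, n=103, m=100, theta=0.1, gamma=0.9)
print(abs(near_start - far_along) < 1e-9)  # same offset of 3, same score
```

Because nothing in the score refers to an absolute index, positions beyond the training length are handled by the same formula as positions within it.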

Hardware friendliness

The chunkwise recurrent form is designed to keep most computation in the form of dense matrix multiplications, which are well-served by tensor cores on modern GPUs. This is one of the more practical advantages of RetNet relative to architectures that rely heavily on scans or other operations that GPUs handle less efficiently.

What were RetNet's empirical results?

The original paper trained RetNet and Transformer baselines on a mixed corpus drawn from The Pile, C4, and The Stack, using the TorchScale library on 512 AMD MI200 GPUs. Models were trained at 1.3B, 2.7B, and 6.7B parameter scales for comparison.^[1]

Language modeling perplexity

The headline result is that RetNet's perplexity is competitive with the Transformer's at smaller scales and surpasses it once the model crosses roughly 2 billion parameters. RetNet's scaling curve has a more favourable slope, meaning its advantage grows with scale rather than shrinking.

Model size	RetNet perplexity vs Transformer	Notes
1.3B	Slightly behind to roughly comparable	Within noise of Transformer baseline
2.7B	Roughly comparable	Crossover point around this scale
6.7B	Better than Transformer	Favourable scaling trend reported

On downstream zero-shot and few-shot evaluation tasks including LAMBADA, HellaSwag, PIQA, WinoGrande, ARC, BoolQ, COPA, and Story Cloze, RetNet-6.7B reports accuracy that is broadly on par with a similarly sized Transformer, with some tasks favouring each architecture.

Inference throughput and memory

RetNet's main advantage shows up in inference benchmarks. The paper reports the following for a 6.7 billion parameter model on A100-80GB GPUs at an 8,192 token context length:^[1]

Metric	Transformer (with KV cache)	RetNet	Improvement
Decoding speed	1x baseline	~8.4x	Per-token cost is constant
GPU memory consumption	1x baseline	~0.3x	About 70 percent reduction
Latency vs context length	Grows with length	Flat	$O(1)$ decoding
Throughput vs batch size	Limited by KV cache	Higher	Smaller per-sequence footprint

The gap widens as context length increases. At 32K tokens or beyond, the Transformer KV cache dominates GPU memory and limits batch size, while RetNet's fixed-size state lets the system use most of the available memory for additional concurrent sequences.

Training throughput

Using the chunkwise recurrent form, RetNet reportedly achieved 25 to 50 percent memory savings and around 7x throughput improvement over a standard Transformer implementation in PyTorch.^[1] Against FlashAttention, the throughput advantage is smaller but still positive at long sequence lengths.

Implementations and adoption

Is RetNet open source?

Yes. Microsoft released a reference implementation of RetNet inside the TorchScale library, available on GitHub at microsoft/torchscale under the MIT license.^[4] The library exposes a RetNetConfig and RetNetDecoder class that can be used as a drop-in replacement for a Transformer decoder. The repository was updated in October 2023 to make RMSNorm and SwiGLU the default modules inside RetNet blocks. A second copy of the implementation sits inside microsoft/unilm/retnet, the broader UniLM research repository.^[5]

Independent open-source implementations followed quickly. Two of the most cited community ports are Jamie-Stirling/RetNet and fkodom/yet-another-retnet, both of which provide PyTorch implementations of multi-scale retention and the three computation forms.^[6]^[7] These have been used as starting points for many downstream research projects.

YOCO and gated retention

In May 2024, several of the original RetNet authors published "You Only Cache Once: Decoder-Decoder Architectures for Language Models" (arXiv 2405.05254).^[3] YOCO uses two stacks of decoder blocks: a self-decoder that uses efficient sequence operators (including gated retention), and a cross-decoder that uses attention with a single shared global KV cache produced by the self-decoder. The paper reports that YOCO can be extended to 1 million token contexts with near-perfect needle-in-a-haystack retrieval, and that at 1M length it consumes 9.38 times less memory than a Transformer baseline with Grouped Query Attention, Flash-Decoding, and kernel fusion.^[3]

Gated retention, used as the self-decoder backbone in YOCO, is sometimes called RetNet-3 to signal its lineage. It adds a learned, data-dependent gate to the retention recurrence, which improves quality on tasks where a fixed decay schedule was too rigid.

Vision and other adaptations

A 2025 survey of retentive networks (arXiv 2506.06708) catalogued more than fifty distinct adaptations of RetNet across domains.^[8] Notable variants include:

Variant	Domain	Notes
RMT	Vision	Retentive networks meet vision transformers, manhattan self-retention
RetViT	Vision	Retention as a drop-in for vision transformer attention
ViR	Vision	Vision retention networks for image recognition
LION / LION-D	Language	Bidirectional retention framework for encoder tasks
DenseRetNet	Language	Densely connected retention layers
RetCompletion	Image inpainting	High-speed image completion with retention
JetRetNet	Particle physics	Retentive networks for jet tagging
CellFM	Transcriptomics	Foundation model for single-cell genomics
RetEEG	Neuroscience	EEG signal modeling with retention
MonoRetNet	3D vision	Monocular depth estimation

Most vision adaptations swap attention for retention in the same overall block layout used by Vision Transformer or Swin Transformer, sometimes generalising the 1D causal decay to 2D Manhattan or Chebyshev distances over image patches.
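The 2D generalisation can be illustrated with a small mask builder. The sketch below assumes image patches laid out on a grid and flattened row by row, and sets the decay between two patches to $\gamma$ raised to their Manhattan distance. It shows the general idea only and is not the implementation used by RMT or any other listed variant; the function name is illustrative.

```python
def manhattan_decay_mask(height, width, gamma):
    """Mask over flattened patches: entry [i][j] = gamma**(|dr| + |dc|)."""
    coords = [(r, c) for r in range(height) for c in range(width)]
    return [
        [gamma ** (abs(r1 - r2) + abs(c1 - c2)) for (r2, c2) in coords]
        for (r1, c1) in coords
    ]

mask = manhattan_decay_mask(2, 3, gamma=0.5)
print(mask[0])  # [1.0, 0.5, 0.25, 0.5, 0.25, 0.125]
```

Unlike the causal 1D mask, this one is symmetric and has no zeroed triangle, since every patch can see every other patch in an image.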

How does RetNet compare to Transformers, Mamba, and RWKV?

RetNet sits in the same general family as several other efficient sequence architectures, each of which takes a different angle on the same basic problem. The table below summarises the main points of contrast at a high level.

Architecture	Mechanism	Training	Inference per token	Decay / selectivity	Notable variants
Transformer	Softmax self-attention	Parallel, $O(L^2)$	$O(L)$ with KV cache	Learned via attention	Decoder-only LLMs
RetNet	Multi-scale retention, exponential decay	Parallel $O(L^2)$ or chunkwise $O(L)$	$O(1)$ recurrent	Fixed decay per head	Gated RetNet, YOCO
Linear Attention	Kernel feature maps replace softmax	Parallel, $O(L)$	$O(1)$ recurrent	None or simple	Performer, Linear Transformer
RWKV	Time-mix and channel-mix with exponential decay	Parallel (via WKV)	$O(1)$ recurrent	Learned channel-wise decay	RWKV-4, RWKV-5, RWKV-6
Mamba	Selective state space model	Parallel scan, $O(L)$	$O(1)$ recurrent	Input-dependent	Mamba 2, Jamba
Mamba 2	Structured state space duality	Chunked dense matmul, $O(L)$	$O(1)$ recurrent	Input-dependent	SSD framework

The family resemblances are not accidental. The 2024 SSD paper that introduced Mamba 2 shows that a wide class of linear-state architectures, including RetNet, Linear Attention, RWKV, and a scalar-A state space model, can all be expressed in a unified semiseparable-matrix framework.^[10] From this view, RetNet is a specific point in a larger design space, with a fixed exponential decay schedule across heads and no input-dependent gating.

Relative to RWKV, RetNet has a tighter parallel form and uses a complex exponential decay tied to relative positions, while RWKV uses a learned per-channel decay with a different parallelisation scheme.^[12] Relative to Mamba, RetNet keeps its dynamics input-independent in the original formulation, which trades some expressiveness for a cleaner parallel form.^[11] Gated retention, introduced later, closes part of this gap by adding a learned gate.

Reception

RetNet drew significant attention immediately after its arXiv release in July 2023. The framing as "a successor to Transformer" attracted both enthusiasm and pushback. Coverage in technical blogs, industry newsletters, and Chinese AI media often described the paper as the most credible proposal yet to challenge Transformer dominance, while academic responses were more measured.

The paper was submitted to ICLR 2024.^[15] Reviewers generally praised the theoretical framework connecting recurrence and attention, the engineering quality of the experiments, and the clarity of the three-form derivation. The most common criticisms focused on:

Scale of evaluation. The 6.7 billion parameter scale is small relative to frontier language models. Whether RetNet's favourable scaling continues past tens or hundreds of billions of parameters has not been settled by independent runs at that scale.
Fixed decay. Hardwiring the decay schedule per head means the model cannot learn task-specific forgetting patterns. The 2025 retention survey identifies this as one of the main open research directions, and the later gated retention variant addresses it directly.^[8]
Benchmark coverage. The original paper evaluated on language modeling perplexity and a set of zero-shot tasks, but not on the broader range of benchmarks (MMLU, HumanEval, GSM8K, long-context recall) that became standard for large language models in 2024 and beyond.
In-context learning and recall. Studies of efficient architectures, including the Zoology line of work from Stanford's Hazy Research group, found that linear-state models including RetNet, RWKV, and earlier state space models lag Transformers on associative recall and multi-query retrieval tasks.^[14] This is a structural consequence of compressing the past into a fixed-size state.

At the same time, RetNet has been broadly influential. The retention mechanism is referenced as a baseline or starting point in many papers on efficient sequence modeling published since 2024. The chunkwise recurrent form anticipated the structured state space duality framework. Microsoft has continued to use retention as part of YOCO and follow-up architectures, and independent labs have built on retention for domain-specific models in vision, biology, and physical science.

Whether RetNet itself displaces Transformer at the largest scales remains an open question. As of 2026 there is no publicly available frontier-scale large language model that uses pure retention as its sole sequence operator, and the dominant hybrid pattern in efficient architectures today is to interleave attention layers with linear-state operators rather than committing fully to one or the other. RetNet's lasting contribution may turn out to be conceptual rather than architectural: it sharpened the trilemma between training parallelism, inference cost, and modeling quality, and gave the community a clean reference point for arguing about it.

References

1. Sun, Y., Dong, L., Huang, S., Ma, S., Xia, Y., Xue, J., Wang, J., & Wei, F. (2023). "Retentive Network: A Successor to Transformer for Large Language Models." arXiv:2307.08621. https://arxiv.org/abs/2307.08621
2. Sun, Y., Dong, L., Huang, S., Ma, S., Xia, Y., Xue, J., Wang, J., & Wei, F. (2023). "Retentive Network: A Successor to Transformer for Large Language Models." Microsoft Research publication page. https://www.microsoft.com/en-us/research/publication/retentive-network-a-successor-to-transformer-for-large-language-models/
3. Sun, Y., Dong, L., Pan, Y., Huang, S., Ma, S., Xia, Y., Xue, J., Wang, J., & Wei, F. (2024). "You Only Cache Once: Decoder-Decoder Architectures for Language Models." arXiv:2405.05254. https://arxiv.org/abs/2405.05254
4. Microsoft. (2023). TorchScale library, including RetNet reference implementation. https://github.com/microsoft/torchscale
5. Microsoft. (2023). UniLM repository, RetNet directory. https://github.com/microsoft/unilm/tree/master/retnet
6. Stirling, J. (2023). "An implementation of Retentive Network: A Successor to Transformer for Large Language Models." https://github.com/Jamie-Stirling/RetNet
7. Odom, F. (2023). "yet-another-retnet: A simple but robust PyTorch implementation of RetNet." https://github.com/fkodom/yet-another-retnet
8. A Survey of Retentive Network (2025). arXiv:2506.06708. https://arxiv.org/abs/2506.06708
9. Vaswani, A., et al. (2017). "Attention Is All You Need." Advances in Neural Information Processing Systems 30.
10. Dao, T., & Gu, A. (2024). "Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality." arXiv:2405.21060. https://arxiv.org/abs/2405.21060
11. Gu, A., & Dao, T. (2023). "Mamba: Linear-Time Sequence Modeling with Selective State Spaces." arXiv:2312.00752. https://arxiv.org/abs/2312.00752
12. Peng, B., et al. (2023). "RWKV: Reinventing RNNs for the Transformer Era." arXiv:2305.13048. https://arxiv.org/abs/2305.13048
13. Sun, Y., Dong, L., et al. (2023). "A Length-Extrapolatable Transformer." Proceedings of ACL 2023. (xPos position encoding background.)
14. Hazy Research. (2023). "Zoology: Measuring and Improving Recall in Efficient Language Models." https://hazyresearch.stanford.edu/blog/2023-12-11-zoology1-analysis
15. RetNet OpenReview submission. ICLR 2024 review thread. https://openreview.net/forum?id=UU9Icwbhin
