Gated DeltaNet


Updated Jul 11, 2026


Gated DeltaNet is a linear attention sequence model that augments the delta rule, a key-value error-correction update used in the earlier DeltaNet architecture, with a data-dependent gate that lets the model quickly erase its memory. It was introduced in the paper "Gated Delta Networks: Improving Mamba2 with Delta Rule" by Songlin Yang (MIT CSAIL), Jan Kautz and Ali Hatamizadeh (NVIDIA), posted in December 2024 and published at ICLR 2025. Like other linear-attention methods, Gated DeltaNet runs as a recurrent neural network with a matrix-valued hidden state, which gives it linear-time training and constant-memory autoregressive inference. What sets it apart is a memory-management rule that pairs targeted, content-addressed updates with fast forgetting. The design has since been adopted as the linear-attention component of several hybrid large language models, including Alibaba's Qwen3-Next and, in a refined form, Moonshot AI's Kimi Linear. ^[1]

Overview

The central problem Gated DeltaNet addresses is the limited, fixed-size memory of linear-recurrent sequence models. A standard transformer keeps every past token in a growing key-value cache, so it can retrieve any earlier detail but pays a cost that grows with sequence length. Linear-attention and state space model layers instead compress the entire past into a single fixed-size state, which makes them cheap but forces a constant tension: how to write new information into that bounded memory without overwriting useful old information, and how to discard information that is no longer relevant. ^[1]
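To make the trade-off concrete, the following back-of-the-envelope sketch counts the numbers stored per layer by a key-value cache and by a fixed-size matrix state. The head count and dimensions are hypothetical, chosen only for illustration, and the helper names are not taken from any library.

```python
def kv_cache_elements(seq_len, num_heads, head_dim):
    """Numbers stored by one attention layer: a key and a value per past token."""
    return 2 * seq_len * num_heads * head_dim


def recurrent_state_elements(num_heads, d_k, d_v):
    """Numbers stored by one linear-recurrent layer: one d_v x d_k matrix per head."""
    return num_heads * d_v * d_k


# Hypothetical sizes, for illustration only.
HEADS, DIM = 16, 128

cache_1k = kv_cache_elements(1_024, HEADS, DIM)      # grows with context length
cache_128k = kv_cache_elements(131_072, HEADS, DIM)  # 128x longer context, 128x larger cache
state = recurrent_state_elements(HEADS, DIM, DIM)    # same size at any context length

print(cache_1k, cache_128k, state)
```

Under these assumed sizes the cache grows in proportion to the context, while the recurrent state stays at the same size no matter how many tokens have been read.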

Gated DeltaNet resolves this tension by combining two mechanisms that prior models used only in isolation. The delta rule, inherited from DeltaNet, performs a precise update that removes the value currently associated with an incoming key before writing the new value, which avoids blindly piling associations on top of one another. A scalar gate, inherited from Mamba2 and the gated-linear-attention line of work, multiplicatively decays the whole state, which lets the model flush its memory rapidly when the context shifts. The paper frames the result as adding the delta rule on top of Mamba2's gating, hence the subtitle "Improving Mamba2 with Delta Rule." ^[1]

Background: linear attention, the delta rule, and gating

Linear attention as a linear RNN

Ordinary self-attention computes $\mathrm{softmax}(QK^\top)V$ , which costs time and memory quadratic in sequence length. Linear attention, introduced for transformers by Katharopoulos and colleagues in 2020, drops the softmax so the computation becomes associative and can be reordered into a recurrence. In the causal setting this is equivalent to maintaining a matrix-valued state S that summarizes the past. At each step the model reads the state with the query and then writes the new key-value pair into it. The plainest version uses pure additive accumulation, $S_t = S_{t-1} + v_t k_t^\top$ , with the output $o_t = S_t q_t$ . This is a linear RNN whose hidden state is a matrix rather than a vector, and it gives O(N) training cost and a constant-size state at inference. ^[4]
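The equivalence between the recurrent and the attention-style views can be checked numerically. This is a minimal pure-Python sketch, not an efficient implementation; the function names and the toy inputs are illustrative assumptions.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def linear_attention_recurrent(qs, ks, vs):
    """o_t = S_t q_t with S_t = S_{t-1} + v_t k_t^T (S is d_v x d_k)."""
    d_v, d_k = len(vs[0]), len(ks[0])
    S = [[0.0] * d_k for _ in range(d_v)]
    outputs = []
    for q, k, v in zip(qs, ks, vs):
        for i in range(d_v):
            for j in range(d_k):
                S[i][j] += v[i] * k[j]          # additive write of the new pair
        outputs.append([dot(row, q) for row in S])  # read with the query
    return outputs


def linear_attention_parallel(qs, ks, vs):
    """Same outputs from the causal, softmax-free sum o_t = sum_{i<=t} (k_i . q_t) v_i."""
    outputs = []
    for t, q in enumerate(qs):
        o = [0.0] * len(vs[0])
        for i in range(t + 1):                  # causal: only positions up to t
            w = dot(ks[i], q)
            o = [oj + w * vj for oj, vj in zip(o, vs[i])]
        outputs.append(o)
    return outputs


qs = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
ks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
vs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

rec = linear_attention_recurrent(qs, ks, vs)
par = linear_attention_parallel(qs, ks, vs)
```

The recurrent version touches each token once and keeps only `S`, whereas the parallel version revisits every earlier position for each query.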

The delta rule and DeltaNet

Pure addition never removes anything, so old associations accumulate and collide, which is one reason naive linear attention is weak at recall. The delta rule fixes this with an error-correction step borrowed from classic associative memory and the fast-weight literature. Instead of just adding $v_t k_t^\top$ , the model first predicts the value that its current memory associates with the incoming key, $S_{t-1} k_t$ , computes the prediction error against the true value $v_t$ , and writes only a fraction of that error back. The update becomes $S_t = S_{t-1} + \beta_t (v_t - S_{t-1} k_t) k_t^\top$ , where $\beta_t \in (0,1)$ is a writing strength, or learning rate, that controls how aggressively the memory is corrected. Expanding the product gives the equivalent form $S_t = S_{t-1} (I - \beta_t k_t k_t^\top) + \beta_t v_t k_t^\top$ , in which the factor $(I - \beta_t k_t k_t^\top)$ erases the old content stored at key $k_t$ before the new value is inserted. This delta-rule update for linear transformers was introduced by Imanol Schlag, Kazuki Irie and Jürgen Schmidhuber in 2021, who named the model DeltaNet. Songlin Yang and colleagues made it practical at scale in 2024 by reparameterizing the recurrence as a product of generalized Householder transformations and deriving a chunkwise parallel training algorithm, releasing it in the flash-linear-attention library. ^[3]^[2]
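The difference between additive and delta-rule writes shows up when the same key is written twice. The sketch below is a toy illustration in pure Python: the function names are hypothetical, the key is unit-norm, and $\beta = 1$ is used as a limiting value so that the overwrite is exact.

```python
def matvec(S, x):
    return [sum(s * xj for s, xj in zip(row, x)) for row in S]


def additive_write(S, k, v):
    """Plain linear attention: S + v k^T."""
    return [[S[i][j] + v[i] * k[j] for j in range(len(k))] for i in range(len(v))]


def delta_write(S, k, v, beta):
    """Delta rule: S + beta * (v - S k) k^T."""
    pred = matvec(S, k)                          # value the memory currently returns for k
    err = [vi - pi for vi, pi in zip(v, pred)]   # prediction error
    return [[S[i][j] + beta * err[i] * k[j] for j in range(len(k))] for i in range(len(v))]


def delta_write_expanded(S, k, v, beta):
    """Equivalent form: S (I - beta k k^T) + beta v k^T."""
    d_k = len(k)
    erase = [[(1.0 if a == b else 0.0) - beta * k[a] * k[b] for b in range(d_k)]
             for a in range(d_k)]
    return [[sum(S[i][a] * erase[a][j] for a in range(d_k)) + beta * v[i] * k[j]
             for j in range(d_k)] for i in range(len(v))]


zero = [[0.0, 0.0], [0.0, 0.0]]
k = [1.0, 0.0]                                   # unit-norm key
v_old, v_new = [1.0, 1.0], [5.0, -2.0]

S_add = additive_write(additive_write(zero, k, v_old), k, v_new)
S_delta = delta_write(delta_write(zero, k, v_old, 1.0), k, v_new, 1.0)

read_add = matvec(S_add, k)      # [6.0, -1.0]: old and new values are summed
read_delta = matvec(S_delta, k)  # [5.0, -2.0]: the new value replaced the old one
```

With $\beta_t$ strictly between 0 and 1, as in the text, the stored value moves only part of the way toward the new one on each write.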

Gating and decay

DeltaNet has no notion of forgetting: its update only erases and corrects the content stored along the incoming key and never shrinks the state as a whole, so information persists indefinitely unless a later key happens to overwrite it. Gated linear attention and Mamba2 take the opposite emphasis. They multiply the state by a data-dependent decay each step, $S_t = \alpha_t S_{t-1} + v_t k_t^\top$ in the scalar case, where $\alpha_t$ close to 0 wipes the memory and $\alpha_t$ close to 1 preserves it. Gated linear attention (GLA), also from Yang and colleagues, uses a fine-grained diagonal gate, while Mamba2 uses a single scalar gate per head. Gating gives direct control over how long memories live but, on its own, still writes new associations by plain addition, so it inherits linear attention's collision problem. Gated DeltaNet is the synthesis: it keeps the delta rule's targeted writes and adds the scalar gate's global forgetting. ^[5]^[1]
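Unrolling the scalar-gated recurrence shows how the gate sets memory lifetime. This is a short derivation that assumes an empty initial state $S_0 = 0$ , an assumption made here only for simplicity:

$$S_t = \sum_{i=1}^{t} \Big( \prod_{j=i+1}^{t} \alpha_j \Big) v_i k_i^\top$$

Each write is scaled by the product of every gate applied after it. With a constant $\alpha = 0.9$ , for example, a write made 50 steps earlier carries weight $0.9^{50} \approx 0.005$ , while a single step with $\alpha_t$ near 0 removes nearly everything written before it.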

How Gated DeltaNet works

Gated DeltaNet's recurrence is the gated delta rule:

$$S_t = \alpha_t S_{t-1} (I - \beta_t k_t k_t^\top) + \beta_t v_t k_t^\top$$

with output $o_t = S_t q_t$ . Here $S_t$ is the matrix-valued state, $k_t$ and $v_t$ are the key and value, $q_t$ is the query, $\beta_t \in (0,1)$ is the writing strength of the delta update, and $\alpha_t \in (0,1)$ is a scalar gate that decays the entire state. Both $\alpha_t$ and $\beta_t$ are produced from the input by small learned projections, so the model decides at every position how much to forget and how strongly to write. The expression cleanly contains its predecessors as special cases. Setting $\alpha_t = 1$ recovers DeltaNet, which never forgets. Dropping the delta term $(I - \beta_t k_t k_t^\top)$ and writing additively recovers Mamba2's scalar-gated update. Using a diagonal gate in place of the scalar but keeping additive writes recovers GLA. Gated DeltaNet is the only one of the four that keeps both the gate and the error-correcting delta term, so it can both flush stale context and overwrite a specific association in place. ^[1]
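The recurrence and its two behaviors, overwriting one association and flushing the whole state, can be traced on a toy example. This is a minimal sequential sketch in pure Python, not the paper's kernel. The function names are hypothetical, the keys are orthonormal, and the gate values 0 and 1 are used as limiting cases of the open interval $(0,1)$ .

```python
def gated_delta_step(S, k, v, alpha, beta):
    """One step of S_t = alpha * S_{t-1} (I - beta k k^T) + beta v k^T."""
    d_v, d_k = len(v), len(k)
    Sk = [sum(S[i][j] * k[j] for j in range(d_k)) for i in range(d_v)]  # S_{t-1} k_t
    return [[alpha * (S[i][j] - beta * Sk[i] * k[j]) + beta * v[i] * k[j]
             for j in range(d_k)] for i in range(d_v)]


def read(S, q):
    """Output o_t = S_t q_t."""
    return [sum(s * x for s, x in zip(row, q)) for row in S]


def run(steps, d_v=2, d_k=2):
    S = [[0.0] * d_k for _ in range(d_v)]
    for k, v, alpha, beta in steps:
        S = gated_delta_step(S, k, v, alpha, beta)
    return S


k1, k2 = [1.0, 0.0], [0.0, 1.0]                  # orthonormal keys
store = [(k1, [1.0, 2.0], 1.0, 1.0),             # alpha = 1: keep everything (DeltaNet case)
         (k2, [3.0, 4.0], 1.0, 1.0)]

# Rewrite k1 with the gate open: only the k1 association changes.
S_overwrite = run(store + [(k1, [9.0, 9.0], 1.0, 1.0)])
# Rewrite k1 with alpha = 0: the previous state is flushed first.
S_flush = run(store + [(k1, [9.0, 9.0], 0.0, 1.0)])

print(read(S_overwrite, k1), read(S_overwrite, k2))  # [9.0, 9.0] [3.0, 4.0]
print(read(S_flush, k1), read(S_flush, k2))          # [9.0, 9.0] [0.0, 0.0]
```

In a trained layer $\alpha_t$ and $\beta_t$ would come from learned projections of the input rather than being set by hand as they are here.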

The harder engineering problem is training. Each step multiplies the state by a scalar decay and by an identity-minus-rank-one erase matrix, and naively unrolling this token by token would be sequential and far too slow for GPUs. The authors derive a chunkwise parallel form that splits the sequence into fixed-size chunks, computes the cross-chunk contribution through a single carried state, and expresses the within-chunk computation as dense matrix multiplications. They fold the per-step gates into this scheme using a memory-efficient WY representation and the UT transform of Joffrain and colleagues, which rewrite the accumulated Householder-style products as a few matrix multiplies with only minor overhead from the scalar gating terms. The kernels are released through flash-linear-attention. The practical outcome reported in the paper is that Gated DeltaNet trains at roughly the same throughput as DeltaNet and only slightly below Mamba2, while remaining linear in sequence length. ^[1]
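The chunkwise idea is easiest to see on a simplified case. The sketch below applies it to plain additive linear attention, with no gate, no delta term, and no WY or UT machinery, so it illustrates only the split into a carried cross-chunk state and a causal within-chunk computation. It is not the paper's algorithm, and the function names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def write(S, k, v):
    """Additive write S + v k^T."""
    return [[S[i][j] + v[i] * k[j] for j in range(len(k))] for i in range(len(v))]


def sequential(qs, ks, vs):
    """Token-by-token reference: S_t = S_{t-1} + v_t k_t^T, o_t = S_t q_t."""
    S = [[0.0] * len(ks[0]) for _ in range(len(vs[0]))]
    outs = []
    for q, k, v in zip(qs, ks, vs):
        S = write(S, k, v)
        outs.append([dot(row, q) for row in S])
    return outs


def chunkwise(qs, ks, vs, chunk):
    """Same outputs, computed chunk by chunk with one carried state."""
    d_v = len(vs[0])
    S = [[0.0] * len(ks[0]) for _ in range(d_v)]   # summary of all earlier chunks
    outs = []
    for start in range(0, len(qs), chunk):
        Q = qs[start:start + chunk]
        K = ks[start:start + chunk]
        V = vs[start:start + chunk]
        for t, q in enumerate(Q):
            inter = [dot(row, q) for row in S]     # cross-chunk part, via carried state
            intra = [0.0] * d_v
            for i in range(t + 1):                 # within-chunk part, causally masked
                w = dot(K[i], q)
                intra = [a + w * b for a, b in zip(intra, V[i])]
            outs.append([a + b for a, b in zip(inter, intra)])
        for k, v in zip(K, V):                     # advance the carried state once per chunk
            S = write(S, k, v)
    return outs


qs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0], [0.5, 0.5]]
ks = [[1.0, 2.0], [0.0, 1.0], [1.0, -1.0], [3.0, 0.0], [1.0, 1.0]]
vs = [[1.0], [2.0], [3.0], [4.0], [5.0]]

reference = sequential(qs, ks, vs)
chunked = chunkwise(qs, ks, vs, chunk=2)
```

In a GPU kernel the within-chunk loops would be replaced by masked matrix multiplications, which is where the parallel speedup comes from.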

Results

The paper trains models at 340 million and 1.3 billion parameters on the FineWeb-Edu corpus (15 billion and 100 billion tokens respectively) and compares against Mamba2 and DeltaNet under identical settings. At the 1.3B scale, Gated DeltaNet improves both language-modeling perplexity and downstream accuracy over both baselines, and the hybrid variants improve further. ^[1]

| Model (1.3B, 100B tokens) | WikiText perplexity | LAMBADA perplexity | Common-sense average accuracy (%) | Real-world recall average (%) |
|---|---|---|---|---|
| Mamba2 | 16.56 | 12.56 | 54.89 | 29.8 |
| DeltaNet | 17.71 | 16.88 | 52.14 | 26.2 |
| Gated DeltaNet | 16.42 | 12.17 | 55.32 | 30.6 |
| Gated DeltaNet-H1 | 16.07 | 12.12 | 56.40 | 39.0 |
| Gated DeltaNet-H2 | 15.91 | 12.55 | 56.18 | 40.1 |

(Lower perplexity is better; higher accuracy is better.) Beyond average accuracy, the largest gains appear on recall-intensive tasks. On the synthetic multi-query associative-recall and needle-in-a-haystack probes, Gated DeltaNet stays accurate where Mamba2 collapses: on the harder S-NIAH-2 retrieval task at a 4,000-token context it scored about 92 percent versus roughly 56 percent for Mamba2, reflecting the delta rule's ability to keep distinct key-value pairs from interfering. It also extrapolates to sequences longer than those seen in training without the perplexity blow-up that pure-additive linear models often show. On a suite of real-world recall datasets (SWDE, SQuAD, TriviaQA, NaturalQuestions and others), the pure recurrent models still trail a same-size transformer, which averages about 37 percent, but interleaving Gated DeltaNet with a few attention layers closes that gap. The two hybrids combine Gated DeltaNet with sliding-window attention (H1) and with both Mamba2 and sliding-window attention (H2). H2 reaches about 40 percent on real-world recall and has the lowest WikiText perplexity in the study, while H1 has the best LAMBADA perplexity and common-sense average. ^[1]

Adoption in hybrid models

Gated DeltaNet has become one of the default linear-attention blocks in the wave of hybrid language models that mix a cheap recurrent mixer with occasional full-attention layers.

Alibaba's Qwen3-Next, released in September 2025, is the most prominent example. Its 80-billion-parameter mixture-of-experts backbone (about 3 billion active parameters) uses a 3 to 1 layout in which three of every four layers are Gated DeltaNet and the fourth is a gated full-attention layer, a structure the Qwen team credits for strong long-context behavior at low cost. The same Gated DeltaNet block carried into the subsequent Qwen3.5 generation. ^[9]

Moonshot AI's Kimi Linear, described in a technical report dated October 2025, builds directly on Gated DeltaNet. Its core mixer, Kimi Delta Attention (KDA), extends the gated delta rule with a finer-grained channel-wise gate in place of the single scalar, which the authors argue uses the finite-state memory more efficiently. Kimi Linear is a 48-billion-parameter mixture-of-experts model (about 3 billion active) that interleaves KDA with multi-head latent attention at a 3 to 1 ratio, reports up to a 75 percent reduction in key-value cache size, and reaches up to roughly 6 times faster decoding at million-token contexts. ^[10] The Allen Institute for AI later used a 3 to 1 Gated DeltaNet-to-attention mix in its 7-billion-parameter Olmo Hybrid model in early 2026. ^[11]
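The 3 to 1 layouts described above can be written as a simple layer schedule. The sketch below is illustrative only: the helper names and the 8-layer depth are hypothetical, and the cache estimate assumes every attention layer stores a cache of the same size while the linear layers' fixed state is ignored.

```python
def layer_schedule(num_layers, linear_per_attention=3):
    """Every (linear_per_attention + 1)-th layer is attention; the rest are linear."""
    period = linear_per_attention + 1
    return ["attention" if (i + 1) % period == 0 else "linear"
            for i in range(num_layers)]


def kv_cache_fraction(schedule):
    """Share of layers that still keep a growing key-value cache."""
    return schedule.count("attention") / len(schedule)


schedule = layer_schedule(8)   # hypothetical 8-layer stack
fraction = kv_cache_fraction(schedule)

print(schedule)
print(fraction)  # 0.25
```

Under these assumptions only a quarter of the layers keep a cache, which is consistent with the reduction of up to 75 percent reported for Kimi Linear.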

One claim worth correcting is that Gated DeltaNet underlies NVIDIA's Nemotron-H. Although two of Gated DeltaNet's three authors are at NVIDIA, the Nemotron-H family and the related Nemotron Nano 2 are hybrid Mamba-Transformer models whose linear mixer is Mamba2, not Gated DeltaNet. ^[12]

Relationship to Mamba and gated linear attention

Gated DeltaNet sits inside a family of sub-quadratic sequence models that can all be written as a matrix-valued linear RNN, distinguished by how each one writes into and erases from its state. Through the lens of structured state-space duality, Mamba2 is the scalar-gated additive special case and GLA its diagonal-gated cousin, while DeltaNet contributes the error-correcting write without any gate. Gated DeltaNet is the member that keeps both a gate and the delta rule, which is why the paper presents it as Mamba2 plus the delta rule, or equivalently DeltaNet plus a forget gate. Kimi's KDA pushes one step further by pairing the delta rule with a diagonal gate rather than a scalar one. ^[1]^[5]

The delta rule also connects these models to online learning and test-time training. Writing $\beta_t (v_t - S_{t-1} k_t) k_t^\top$ is exactly one step of online gradient descent on the squared error of an associative memory that maps keys to values, with $\beta_t$ as the learning rate; the multiplicative gate $\alpha_t$ plays the role of weight decay on that fast-changing memory. This view links Gated DeltaNet to the fast-weight programmers of Schlag and colleagues and to a broader 2024 to 2026 trend of treating a sequence layer's state as a small model updated by gradient descent at inference time. Like RetNet, RWKV and Mamba, Gated DeltaNet exploits the duality between a parallel form for efficient training and a recurrent form for efficient generation, and in practice it is paired with a small number of softmax-attention layers, because exact long-range retrieval remains the one capability a bounded recurrent state cannot fully match. ^[3]^[1]
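The gradient-descent reading can be checked in a few lines. This sketch assumes the per-step loss is the squared error with the conventional factor of one half; without that factor the learning rate simply absorbs a factor of 2.

$$\mathcal{L}_t(S) = \tfrac{1}{2}\,\lVert S k_t - v_t \rVert^2, \qquad \nabla_S \mathcal{L}_t(S) = (S k_t - v_t)\,k_t^\top$$

$$S_{t-1} - \beta_t \nabla_S \mathcal{L}_t(S_{t-1}) = S_{t-1} + \beta_t (v_t - S_{t-1} k_t)\,k_t^\top$$

Taking the same step from the decayed state $\alpha_t S_{t-1}$ instead of $S_{t-1}$ gives the gated delta rule:

$$\alpha_t S_{t-1} - \beta_t (\alpha_t S_{t-1} k_t - v_t)\,k_t^\top = \alpha_t S_{t-1} (I - \beta_t k_t k_t^\top) + \beta_t v_t k_t^\top$$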

References

1. Yang, Songlin; Kautz, Jan; Hatamizadeh, Ali. "Gated Delta Networks: Improving Mamba2 with Delta Rule." International Conference on Learning Representations (ICLR) 2025. arXiv:2412.06464, December 2024. https://arxiv.org/abs/2412.06464
2. Yang, Songlin; Wang, Bailin; Zhang, Yu; Shen, Yikang; Kim, Yoon. "Parallelizing Linear Transformers with the Delta Rule over Sequence Length." Advances in Neural Information Processing Systems (NeurIPS) 2024. arXiv:2406.06484. https://arxiv.org/abs/2406.06484
3. Schlag, Imanol; Irie, Kazuki; Schmidhuber, Jürgen. "Linear Transformers Are Secretly Fast Weight Programmers." Proceedings of ICML 2021. arXiv:2102.11174. https://arxiv.org/abs/2102.11174
4. Katharopoulos, Angelos; Vyas, Apoorv; Pappas, Nikolaos; Fleuret, François. "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention." Proceedings of ICML 2020. arXiv:2006.16236. https://arxiv.org/abs/2006.16236
5. Yang, Songlin; Wang, Bailin; Shen, Yikang; Panda, Rameswar; Kim, Yoon. "Gated Linear Attention Transformers with Hardware-Efficient Training." Proceedings of ICML 2024. arXiv:2312.06635. https://arxiv.org/abs/2312.06635
6. Dao, Tri; Gu, Albert. "Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality." Proceedings of ICML 2024 (Mamba2). arXiv:2405.21060. https://arxiv.org/abs/2405.21060
7. Gu, Albert; Dao, Tri. "Mamba: Linear-Time Sequence Modeling with Selective State Spaces." arXiv:2312.00752, December 2023. https://arxiv.org/abs/2312.00752
8. NVlabs. "GatedDeltaNet: Official PyTorch Implementation of Gated Delta Networks." GitHub repository. https://github.com/NVlabs/GatedDeltaNet
9. Qwen Team, Alibaba. "Qwen3-Next: Towards Ultimate Training and Inference Efficiency." September 2025. https://qwen.ai/blog
10. Kimi Team, Moonshot AI. "Kimi Linear: An Expressive, Efficient Attention Architecture." arXiv:2510.26692, October 2025. https://arxiv.org/abs/2510.26692
11. Allen Institute for AI (Ai2). "Olmo Hybrid: Combining Transformers and Linear RNNs for Superior Scaling." March 2026. https://allenai.org/blog/olmohybrid
12. NVIDIA ADLR. "Nemotron-H: A Family of Accurate, Efficient Hybrid Mamba-Transformer Models." 2025. https://research.nvidia.com/labs/adlr/nemotronh/
