Gated DeltaNet
Last reviewed: Jun 8, 2026
Sources: 12 citations
Review status: Source-backed
Revision: v1 · 2,276 words
Gated DeltaNet is a linear attention sequence model that augments the delta rule, a key-value error-correction update used in the earlier DeltaNet architecture, with a data-dependent gate that lets the model quickly erase its memory. It was introduced in the paper "Gated Delta Networks: Improving Mamba2 with Delta Rule" by Songlin Yang (MIT CSAIL), Jan Kautz and Ali Hatamizadeh (NVIDIA), posted in December 2024 and published at ICLR 2025. Like other linear-attention methods, Gated DeltaNet runs as a recurrent neural network with a matrix-valued hidden state, which gives it linear-time training and constant-memory autoregressive inference. What sets it apart is a memory-management rule that pairs targeted, content-addressed updates with fast forgetting. The design has since been adopted as the linear-attention component of several hybrid large language models, including Alibaba's Qwen3-Next and, in a refined form, Moonshot AI's Kimi Linear. [1]
The central problem Gated DeltaNet addresses is the limited, fixed-size memory of linear-recurrent sequence models. A standard transformer keeps every past token in a growing key-value cache, so it can retrieve any earlier detail but pays a cost that grows with sequence length. Linear-attention and state space model layers instead compress the entire past into a single fixed-size state, which makes them cheap but forces a constant tension: how to write new information into that bounded memory without overwriting useful old information, and how to discard information that is no longer relevant. [1]
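The contrast between a growing cache and a fixed-size state can be made concrete with a little arithmetic. The sketch below counts stored numbers for a single head; the head dimensions and sequence lengths are hypothetical values chosen for illustration, not figures from the paper.

```python
def kv_cache_entries(seq_len: int, d_k: int, d_v: int) -> int:
    """Numbers one attention head stores after seq_len tokens:
    one key (d_k) and one value (d_v) per past token."""
    return seq_len * (d_k + d_v)


def recurrent_state_entries(d_k: int, d_v: int) -> int:
    """Numbers in one head's matrix-valued state (d_v x d_k).
    The count does not depend on sequence length."""
    return d_k * d_v


d_k = d_v = 128  # hypothetical head dimensions
for seq_len in (1_000, 100_000):
    print(
        f"tokens={seq_len:>7}  "
        f"kv_cache={kv_cache_entries(seq_len, d_k, d_v):>10}  "
        f"fixed_state={recurrent_state_entries(d_k, d_v):>6}"
    )
```

The cache grows in proportion to the number of tokens, while the recurrent state stays the same size, which is why everything the layer remembers has to fit into that one matrix.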
Gated DeltaNet resolves this tension by combining two mechanisms that prior models used only in isolation. The delta rule, inherited from DeltaNet, performs a precise update that removes the value currently associated with an incoming key before writing the new value, which avoids blindly piling associations on top of one another. A scalar gate, inherited from Mamba2 and the gated-linear-attention line of work, multiplicatively decays the whole state, which lets the model flush its memory rapidly when the context shifts. The paper frames the result as adding the delta rule on top of Mamba2's gating, hence the subtitle "Improving Mamba2 with Delta Rule." [1]
Ordinary self-attention computes softmax(QK^T)V, which costs time and memory quadratic in sequence length. Linear attention, introduced for transformers by Katharopoulos and colleagues in 2020, drops the softmax so the computation becomes associative and can be reordered into a recurrence. In the causal setting this is equivalent to maintaining a matrix-valued state S that summarizes the past. At each step the model writes the new key-value pair into the state and then reads the updated state with the query. The plainest version uses pure additive accumulation, S_t = S_{t-1} + v_t k_t^T, with the output o_t = S_t q_t. This is a linear RNN whose hidden state is a matrix rather than a vector, and it gives O(N) training cost in sequence length N and a constant-size state at inference. [4]
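A minimal NumPy sketch of this equivalence follows. It assumes a single head, no feature map and no normalization, with the state stored as a d_v x d_k matrix; the function names are illustrative and do not come from any library.

```python
import numpy as np


def linear_attention_recurrent(Q, K, V):
    """Token-by-token form. Q, K: (T, d_k); V: (T, d_v). Returns (T, d_v)."""
    T, d_k = K.shape
    d_v = V.shape[1]
    S = np.zeros((d_v, d_k))            # matrix-valued state
    outputs = np.zeros((T, d_v))
    for t in range(T):
        S = S + np.outer(V[t], K[t])    # S_t = S_{t-1} + v_t k_t^T
        outputs[t] = S @ Q[t]           # o_t = S_t q_t
    return outputs


def linear_attention_parallel(Q, K, V):
    """Same outputs computed as softmax-free causal attention."""
    scores = Q @ K.T                    # (T, T) query-key dot products
    causal = np.tril(np.ones_like(scores))  # keep s <= t, diagonal included
    return (scores * causal) @ V


rng = np.random.default_rng(0)
Q = rng.normal(size=(6, 4))
K = rng.normal(size=(6, 4))
V = rng.normal(size=(6, 3))
forms_agree = np.allclose(
    linear_attention_recurrent(Q, K, V), linear_attention_parallel(Q, K, V)
)
print("recurrent and parallel forms agree:", forms_agree)
```

The recurrent form touches only the fixed-size state at each step, while the parallel form materializes a T x T score matrix; both produce the same outputs.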
Pure addition never removes anything, so old associations accumulate and collide, which is one reason naive linear attention is weak at recall. The delta rule fixes this with an error-correction step borrowed from classic associative memory and the fast-weight literature. Instead of just adding v_t k_t^T, the model first predicts the value that its current memory associates with the incoming key, S_{t-1} k_t, computes the prediction error against the true value v_t, and writes only a fraction of that error back. The update becomes S_t = S_{t-1} + beta_t (v_t - S_{t-1} k_t) k_t^T, where beta_t in (0,1) is a writing strength, or learning rate, that controls how aggressively the memory is corrected. Expanding the product gives the equivalent form S_t = S_{t-1} (I - beta_t k_t k_t^T) + beta_t v_t k_t^T, in which the factor (I - beta_t k_t k_t^T) erases the old content stored at key k_t before the new value is inserted. This delta-rule update for linear transformers was introduced by Imanol Schlag, Kazuki Irie and Jurgen Schmidhuber in 2021, who named the model DeltaNet. Songlin Yang and colleagues made it practical at scale in 2024 by reparameterizing the recurrence as a product of generalized Householder transformations and deriving a chunkwise parallel training algorithm, releasing it in the flash-linear-attention library. [3][2]
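The two forms of the delta update, and the difference from plain addition, can be checked numerically. This is a sketch under stated assumptions: the key is unit-norm, and beta is set to 1.0 (the limiting full-strength write) purely so that the overwrite is exact; function and variable names are illustrative.

```python
import numpy as np


def additive_step(S, k, v):
    """Plain linear-attention write: S + v k^T."""
    return S + np.outer(v, k)


def delta_step(S, k, v, beta):
    """Error-correcting write: S + beta (v - S k) k^T."""
    prediction = S @ k                  # value the memory currently returns for k
    return S + beta * np.outer(v - prediction, k)


def delta_step_expanded(S, k, v, beta):
    """Equivalent form: S (I - beta k k^T) + beta v k^T."""
    identity = np.eye(k.shape[0])
    return S @ (identity - beta * np.outer(k, k)) + beta * np.outer(v, k)


k = np.array([1.0, 0.0, 0.0])           # unit-norm key (assumption)
v_old = np.array([1.0, 2.0])
v_new = np.array([-3.0, 5.0])
empty = np.zeros((2, 3))

# Write the same key twice with different values.
S_add = additive_step(additive_step(empty, k, v_old), k, v_new)
S_delta = delta_step(delta_step(empty, k, v_old, 1.0), k, v_new, 1.0)

print("additive read:", S_add @ k)
print("delta read:   ", S_delta @ k)
```

Reading with the same key returns v_old + v_new from the additive memory, because both associations pile up, and returns v_new from the delta-rule memory, because the old value was subtracted before the new one was written.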
DeltaNet has no notion of forgetting: in its update the state is only ever rotated and corrected, never globally shrunk, so information persists indefinitely unless a later key happens to overwrite it. Gated linear attention and Mamba2 take the opposite emphasis. They multiply the state by a data-dependent decay each step, S_t = alpha_t S_{t-1} + v_t k_t^T, where alpha_t close to 0 wipes the memory and alpha_t close to 1 preserves it. Gated linear attention (GLA), also from Yang and colleagues, uses a fine-grained diagonal gate, while Mamba2 uses a single scalar gate per head. Gating gives precise control over how long memories live but, on its own, still writes new associations by plain addition, so it inherits linear attention's collision problem. Gated DeltaNet is the synthesis: it keeps the delta rule's targeted writes and adds the scalar gate's global forgetting. [5][1]
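The effect of the scalar gate on an existing memory can be seen in a two-key toy example. The sketch assumes orthonormal keys so that reads do not interfere, and the alpha values are hand-picked rather than produced by a learned projection.

```python
import numpy as np


def gated_additive_step(S, k, v, alpha):
    """Scalar-gated additive update: alpha * S + v k^T."""
    return alpha * S + np.outer(v, k)


k1, k2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])   # orthonormal keys (assumption)
v1, v2 = np.array([4.0, -2.0]), np.array([1.0, 1.0])

# Store (k1, v1), then write (k2, v2) under two different gate values.
S0 = gated_additive_step(np.zeros((2, 2)), k1, v1, alpha=1.0)
S_keep = gated_additive_step(S0, k2, v2, alpha=0.99)
S_wipe = gated_additive_step(S0, k2, v2, alpha=0.01)

read_keep = S_keep @ k1     # old association scaled by 0.99
read_wipe = S_wipe @ k1     # old association scaled by 0.01
print("alpha=0.99 read of k1:", read_keep)
print("alpha=0.01 read of k1:", read_wipe)
```

The gate scales every stored association by the same factor, so it can clear the memory quickly but cannot single out one key, which is the part the delta rule supplies.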
Gated DeltaNet's recurrence is the gated delta rule:
S_t = alpha_t S_{t-1} (I - beta_t k_t k_t^T) + beta_t v_t k_t^T
with output o_t = S_t q_t. Here S_t is the matrix-valued state, k_t and v_t are the key and value, q_t is the query, beta_t in (0,1) is the writing strength of the delta update, and alpha_t in (0,1) is a scalar gate that decays the entire state. Both alpha_t and beta_t are produced from the input by small learned projections, so the model decides at every position how much to forget and how strongly to write. The expression cleanly contains its predecessors as special cases. Setting alpha_t = 1 recovers DeltaNet, which never forgets. Dropping the delta term (I - beta_t k_t k_t^T) and writing additively recovers Mamba2's scalar-gated update. Using a diagonal gate in place of the scalar but keeping additive writes recovers GLA. Gated DeltaNet is the only one of the four that keeps both the gate and the error-correcting delta term, so it can both flush stale context and overwrite a specific association in place. [1]
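A token-by-token reference sketch of the gated delta rule is given below. It is not the authors' kernel: it assumes a single head, L2-normalized keys, and random sigmoid outputs standing in for the learned projections that produce alpha_t and beta_t. All names are illustrative.

```python
import numpy as np


def gated_delta_step(S, k, v, alpha, beta):
    """S_t = alpha_t S_{t-1} (I - beta_t k_t k_t^T) + beta_t v_t k_t^T."""
    identity = np.eye(k.shape[0])
    return alpha * S @ (identity - beta * np.outer(k, k)) + beta * np.outer(v, k)


def gated_delta_net(Q, K, V, alpha, beta):
    """Sequential recurrence. Q, K: (T, d_k); V: (T, d_v); alpha, beta: (T,)."""
    T, d_k = K.shape
    d_v = V.shape[1]
    S = np.zeros((d_v, d_k))
    outputs = np.zeros((T, d_v))
    for t in range(T):
        S = gated_delta_step(S, K[t], V[t], alpha[t], beta[t])
        outputs[t] = S @ Q[t]           # o_t = S_t q_t
    return outputs, S


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


rng = np.random.default_rng(0)
T, d_k, d_v = 8, 4, 3
Q = rng.normal(size=(T, d_k))
K = rng.normal(size=(T, d_k))
K = K / np.linalg.norm(K, axis=1, keepdims=True)   # unit-norm keys (assumption)
V = rng.normal(size=(T, d_v))
alpha = sigmoid(rng.normal(size=T))   # stand-in for the learned gate projection
beta = sigmoid(rng.normal(size=T))    # stand-in for the learned write-strength projection

outputs, S_final = gated_delta_net(Q, K, V, alpha, beta)

# Special case: alpha_t = 1 should reproduce the ungated delta rule.
S_ref = np.zeros((d_v, d_k))
for t in range(T):
    S_ref = S_ref + beta[t] * np.outer(V[t] - S_ref @ K[t], K[t])
_, S_no_gate = gated_delta_net(Q, K, V, np.ones(T), beta)
matches_deltanet = np.allclose(S_no_gate, S_ref)
print("alpha = 1 recovers the delta rule:", matches_deltanet)
```

The loop is sequential and therefore slow; it is useful mainly as a correctness reference against which a parallel implementation can be compared.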
The harder engineering problem is training. The state update is now a product of a scalar decay and a rank-one correction matrix, and naively unrolling it token by token would be sequential and far too slow for GPUs. The authors derive a chunkwise parallel form that splits the sequence into fixed-size chunks, computes the cross-chunk contribution through a single carried state, and expresses the within-chunk computation as dense matrix multiplications. They fold the per-step gates into this scheme using a memory-efficient WY representation and the UT transform of Joffrain and colleagues, which rewrite the accumulated Householder-style products as a few matrix multiplies with only minor overhead from the diagonal gating terms. The kernels are released through flash-linear-attention. The practical outcome reported in the paper is that Gated DeltaNet trains at roughly the same throughput as DeltaNet and only slightly below Mamba2, while remaining linear in sequence length. [1]
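The chunking idea on its own can be illustrated with plain additive linear attention, which is a deliberate simplification: the sketch below omits both the gate and the delta rule, and therefore the WY representation and UT transform that the paper needs to handle them. It shows only how a carried state supplies the cross-chunk contribution while a masked matrix multiply handles the within-chunk part. Function names are illustrative.

```python
import numpy as np


def linear_attention_sequential(Q, K, V):
    """Reference: S_t = S_{t-1} + v_t k_t^T, o_t = S_t q_t."""
    S = np.zeros((V.shape[1], K.shape[1]))
    outputs = np.zeros((K.shape[0], V.shape[1]))
    for t in range(K.shape[0]):
        S = S + np.outer(V[t], K[t])
        outputs[t] = S @ Q[t]
    return outputs


def linear_attention_chunkwise(Q, K, V, chunk_size):
    """Same outputs, computed one chunk at a time with dense matmuls."""
    T, d_k = K.shape
    d_v = V.shape[1]
    S = np.zeros((d_v, d_k))                  # state carried across chunks
    outputs = np.zeros((T, d_v))
    for start in range(0, T, chunk_size):
        stop = start + chunk_size
        q, k, v = Q[start:stop], K[start:stop], V[start:stop]
        inter = q @ S.T                       # contribution of all earlier chunks
        intra = np.tril(q @ k.T) @ v          # causal attention inside the chunk
        outputs[start:stop] = inter + intra
        S = S + v.T @ k                       # fold this chunk into the state
    return outputs


rng = np.random.default_rng(0)
Q = rng.normal(size=(10, 4))
K = rng.normal(size=(10, 4))
V = rng.normal(size=(10, 3))
chunked_matches = all(
    np.allclose(linear_attention_chunkwise(Q, K, V, c), linear_attention_sequential(Q, K, V))
    for c in (1, 4, 10)
)
print("chunkwise matches sequential:", chunked_matches)
```

Only the loop over chunks is sequential; the work inside each chunk is matrix multiplication, which is the shape of computation GPUs handle efficiently.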
The paper trains models at 340 million and 1.3 billion parameters on the FineWeb-Edu corpus (15 billion and 100 billion tokens respectively) and compares against Mamba2 and DeltaNet under identical settings. At the 1.3B scale, Gated DeltaNet improves both language-modeling perplexity and downstream accuracy over both baselines, and the hybrid variants improve further. [1]
| Model (1.3B, 100B tokens) | WikiText ppl | LAMBADA ppl | Common-sense avg acc (%) | Real-world recall avg (%) |
|---|---|---|---|---|
| Mamba2 | 16.56 | 12.56 | 54.89 | 29.8 |
| DeltaNet | 17.71 | 16.88 | 52.14 | 26.2 |
| Gated DeltaNet | 16.42 | 12.17 | 55.32 | 30.6 |
| Gated DeltaNet-H1 | 16.07 | 12.12 | 56.40 | 39.0 |
| Gated DeltaNet-H2 | 15.91 | 12.55 | 56.18 | 40.1 |
(Lower perplexity is better; higher accuracy is better.) Beyond average accuracy, the largest gains appear on recall-intensive tasks. On the synthetic multi-query associative-recall and needle-in-a-haystack probes, Gated DeltaNet stays accurate where Mamba2 collapses: on the harder S-NIAH-2 retrieval task at a 4,000-token context it scored about 92 percent versus roughly 56 percent for Mamba2, reflecting the delta rule's ability to keep distinct key-value pairs from interfering. It also extrapolates to sequences longer than those seen in training without the perplexity blow-up that pure-additive linear models often show. On a suite of real-world recall datasets (SWDE, SQuAD, TriviaQA, NaturalQuestions and others), pure recurrent models still trail a same-size transformer, which averages about 37 percent, but interleaving Gated DeltaNet with a few attention layers closes that gap. The two hybrids combine Gated DeltaNet with sliding-window attention (H1) and with both Mamba2 and sliding-window attention (H2). In the table above, H1 and H2 reach 39.0 and 40.1 percent on real-world recall; H2 posts the best WikiText perplexity and recall average, while H1 leads on LAMBADA perplexity and common-sense accuracy. [1]
Gated DeltaNet has become one of the default linear-attention blocks in the wave of hybrid language models that mix a cheap recurrent mixer with occasional full-attention layers.
Alibaba's Qwen3-Next, released in September 2025, is the most prominent example. Its 80-billion-parameter mixture-of-experts backbone (about 3 billion active parameters) uses a 3 to 1 layout in which three of every four layers are Gated DeltaNet and the fourth is a gated full-attention layer, a structure the Qwen team credits for strong long-context behavior at low cost. The same Gated DeltaNet block carried into the subsequent Qwen3.5 generation. [9]
Moonshot AI's Kimi Linear, described in a technical report dated October 2025, builds directly on Gated DeltaNet. Its core mixer, Kimi Delta Attention (KDA), extends the gated delta rule with a finer-grained channel-wise gate in place of the single scalar, which the authors argue uses the finite-state memory more efficiently. Kimi Linear is a 48-billion-parameter mixture-of-experts model (about 3 billion active) that interleaves KDA with multi-head latent attention at a 3 to 1 ratio, reports up to a 75 percent reduction in key-value cache size, and reaches up to roughly 6 times faster decoding at million-token contexts. [10] The Allen Institute for AI later used a 3 to 1 Gated DeltaNet-to-attention mix in its 7-billion-parameter Olmo Hybrid model in early 2026. [11]
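The 3 to 1 layout shared by these models can be written down as a simple layer pattern. The sketch below also shows one plausible reading of why such a layout cuts key-value cache size by about three quarters: it assumes each full-attention layer keeps a cache of the same size it would have in an all-attention stack, and that the linear layers' fixed-size state is negligible by comparison. The function and the layer labels are hypothetical and do not correspond to any model's configuration format.

```python
def hybrid_layout(num_layers: int, linear_per_attention: int = 3) -> list[str]:
    """Repeat `linear_per_attention` linear-attention layers followed by
    one full-attention layer."""
    period = linear_per_attention + 1
    return [
        "attention" if (i + 1) % period == 0 else "linear"
        for i in range(num_layers)
    ]


layout = hybrid_layout(8)
cached_fraction = layout.count("attention") / len(layout)
kv_cache_reduction = 1 - cached_fraction

print(layout)
print(f"layers holding a KV cache: {cached_fraction:.0%}")
print(f"KV-cache reduction vs all-attention: {kv_cache_reduction:.0%}")
```

Under these assumptions only one layer in four keeps a growing cache, which is consistent with the up-to-75-percent figure quoted for Kimi Linear, although the report's own accounting may differ.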
One claim worth correcting is that Gated DeltaNet underlies NVIDIA's Nemotron-H. Despite Gated DeltaNet itself originating at NVIDIA, the Nemotron-H family, and the related Nemotron Nano 2, are hybrid Mamba-Transformer models whose linear mixer is Mamba2, not Gated DeltaNet. [12]
Gated DeltaNet sits inside a family of sub-quadratic sequence models that can all be written as a matrix-valued linear RNN, distinguished by how each one writes into and erases from its state. Through the lens of structured state-space duality, Mamba2 is the scalar-gated additive special case and GLA its diagonal-gated cousin, while DeltaNet contributes the error-correcting write without any gate. Gated DeltaNet is the member that keeps both a gate and the delta rule, which is why the paper presents it as Mamba2 plus the delta rule, or equivalently DeltaNet plus a forget gate. Kimi's KDA pushes one step further by pairing the delta rule with a diagonal gate rather than a scalar one. [1][5]
The delta rule also connects these models to online learning and test-time training. Writing beta_t (v_t - S_{t-1} k_t) k_t^T is exactly one step of online gradient descent on the squared error of an associative memory that maps keys to values, with beta_t as the learning rate; the multiplicative gate alpha_t plays the role of weight decay on that fast-changing memory. This view links Gated DeltaNet to the fast-weight programmers of Schlag and colleagues and to a broader 2024 to 2026 trend of treating a sequence layer's state as a small model updated by gradient descent at inference time. Like RetNet, RWKV and Mamba, Gated DeltaNet exploits the duality between a parallel form for efficient training and a recurrent form for efficient generation, and in practice it is paired with a small number of softmax-attention layers, because exact long-range retrieval remains the one capability a bounded recurrent state cannot fully match. [3][1]
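The gradient-descent reading can be verified numerically. The sketch below assumes the loss is written with a factor of one half, L(S) = 0.5 * ||S k - v||^2, so that the learning rate equals beta_t exactly, and it uses a finite-difference gradient so that nothing is taken on trust. It also checks one way to read the weight-decay analogy: shrinking the state by alpha_t and then taking the same gradient step reproduces the gated delta rule. Names and constants are illustrative.

```python
import numpy as np


def memory_loss(S, k, v):
    """Squared error of the associative memory on one key-value pair."""
    residual = S @ k - v
    return 0.5 * float(residual @ residual)


def numerical_gradient(S, k, v, eps=1e-6):
    """Central finite-difference gradient of memory_loss with respect to S."""
    grad = np.zeros_like(S)
    for idx in np.ndindex(S.shape):
        step = np.zeros_like(S)
        step[idx] = eps
        grad[idx] = (memory_loss(S + step, k, v) - memory_loss(S - step, k, v)) / (2 * eps)
    return grad


rng = np.random.default_rng(0)
S = rng.normal(size=(3, 4))
k = rng.normal(size=4)
k = k / np.linalg.norm(k)
v = rng.normal(size=3)
alpha, beta = 0.9, 0.5                  # hand-picked gate and write strength

# Delta rule versus one gradient-descent step with learning rate beta.
sgd_update = S - beta * numerical_gradient(S, k, v)
delta_update = S + beta * np.outer(v - S @ k, k)
delta_is_sgd = np.allclose(sgd_update, delta_update, atol=1e-6)

# Gated delta rule versus "decay the weights, then take the gradient step".
decayed = alpha * S
decay_then_sgd = decayed - beta * numerical_gradient(decayed, k, v)
gated_delta = alpha * S @ (np.eye(4) - beta * np.outer(k, k)) + beta * np.outer(v, k)
gated_is_decay_plus_sgd = np.allclose(decay_then_sgd, gated_delta, atol=1e-6)

print("delta rule == SGD step:", delta_is_sgd)
print("gated delta rule == decay + SGD step:", gated_is_decay_plus_sgd)
```

Because the loss is quadratic in S, the central difference is exact up to rounding, so the agreement reflects the algebra rather than a loose tolerance.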