RWKV-7 (Goose)
Last reviewed: May 16, 2026
Sources: 10 citations
Review status: Source-backed
Revision: v1 · 3,546 words
RWKV-7, codenamed Goose, is the seventh major iteration of the RWKV (Receptance Weighted Key Value) sequence modeling architecture. It was introduced in the March 2025 paper RWKV-7 "Goose" with Expressive Dynamic State Evolution, posted to arXiv as 2503.14456 on 18 March 2025 and co-authored by Peng Bo, Ruichong Zhang, Daniel Goldstein, Eric Alcaide, and roughly two dozen further contributors from the RWKV Project, EleutherAI, Recursal AI, and several universities. The work delivers a recurrent neural network family that trains in parallel like a Transformer but runs inference in linear time with constant memory, and it positions itself as an alternative both to attention-based models and to Mamba-style state space architectures.
The headline release is a 2.9-billion-parameter model that reaches a new state of the art at the 3B scale on a broad suite of multilingual benchmarks. Its RWKV-7 training phase covered roughly 3.1 trillion tokens (about 5.6 trillion when the inherited RWKV-6 pretraining is counted), a fraction of what comparable open Transformers consumed. The paper also argues that RWKV-7 is provably more expressive than standard attention under common complexity assumptions, since it can perform full state tracking and recognize all regular languages, capabilities that lie outside the TC⁰ class to which softmax attention and linear attention belong.
Weights, training data manifests, training code, and inference kernels were released under the Apache 2.0 license through HuggingFace and GitHub. The RWKV Project itself sits inside the Linux Foundation AI and Data foundation, where it became the first generative AI model accepted as a hosted project, and Goose is the version of the architecture that the community is now treating as the reference baseline for downstream work.
RWKV started as a hobby project by Peng Bo, an independent researcher then loosely affiliated with EleutherAI, who wanted to know whether an RNN could match a Transformer if you redesigned the recurrence to be parallelizable across the time axis. The first versions, RWKV-1 through RWKV-3, were small experiments that mostly served as proofs of concept. The line became serious with RWKV-4 in 2023, which reached parity with comparably sized GPT models on the Pile and demonstrated that an RNN could be trained at multi-billion parameter scale without the usual instabilities. RWKV-4 is also the version that introduced the now-familiar split between a time-mix block, which carries the recurrent state, and a channel-mix block, which acts as a position-wise feed-forward layer.
RWKV-5, sometimes called Eagle, refined the recurrence by promoting the hidden state from a vector to a matrix and by introducing multi-headed time mixing. This lifted long-context performance considerably and was the first version to be trained on the RWKV World multilingual corpus, a deliberate move away from the English-heavy Pile. RWKV-6, codenamed Finch, replaced the static decay vector with a data-dependent one, allowing the network to choose how aggressively it forgets state on a per-token basis, and added low-rank adapters inside the time-mix block so that each token could project itself into a learned subspace before the recurrence acted on it. Finch was the first RWKV model to be widely deployed in production, mostly via the Recursal AI and Featherless services and via community quantizations.
By late 2024, however, two problems were obvious. The first was that even Finch, while strong on perplexity, struggled on in-context recall tasks where a Transformer of similar size would do well. The second was that several concurrent linear-attention architectures, especially Gated DeltaNet and Mamba 2, were closing the gap with Transformers from another angle, by adding non-trivial transition matrices to their recurrences. RWKV-7 was designed to absorb the best ideas from those efforts while keeping the engineering and licensing posture that had made RWKV easy to ship.
The core change in Goose is a generalized delta rule that turns the recurrent state update into something significantly more expressive than the diagonal decay used in earlier RWKV versions. The state in RWKV-7 is still a matrix, written wkv, that lives at every time step and is read out to produce the output token. What is new is how it evolves.
In RWKV-4 through RWKV-6, the state update could be written, in simplified form, as state_t = decay * state_{t-1} + v_t k_t^T, where decay was either a static learned vector (earlier versions) or a learned data-dependent vector (Finch). That structure amounts to a diagonal transition matrix, which means each channel of the state evolves independently of the others. It is fast and stable, but it limits how richly the state can mix information across channels.
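A minimal sketch of that simplified update in plain Python may make the per-channel independence concrete. The function name `diagonal_update` and the tiny 2×2 shapes are illustrative assumptions, not code from the RWKV repository.

```python
def diagonal_update(state, w, v, k):
    """One step of state_t = decay * state_{t-1} + v k^T (simplified form).

    state is a list of rows indexed [value_channel][key_channel];
    w and k have one entry per key channel, v one entry per value channel.
    """
    return [
        [w[j] * state[i][j] + v[i] * k[j] for j in range(len(k))]
        for i in range(len(v))
    ]


state = [[0.0, 0.0], [0.0, 0.0]]
# Step 1: write the outer product of v and k into an empty state.
state = diagonal_update(state, w=[0.9, 0.5], v=[1.0, 2.0], k=[1.0, 1.0])
# Step 2: write nothing, so only the decay acts.
state = diagonal_update(state, w=[0.9, 0.5], v=[0.0, 0.0], k=[0.0, 0.0])
print(state)  # column 0 shrank by 0.9, column 1 by 0.5
```

Each key channel is only ever multiplied by its own decay entry, so nothing stored in one column can influence another.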
RWKV-7 replaces the diagonal decay with a non-diagonal transition matrix of the form G_t = diag(w_t) - kappa_t^T (a_t * kappa_t). The first term is the familiar diagonal decay, parameterized by a per-channel decay vector w_t whose values lie in a stable band roughly between 0.545 and 1.0. The second term is a rank-one correction that subtracts a learned amount of the previous state along the direction kappa_t, a removal key derived from the current token. The vector a_t, computed by a small low-rank MLP, acts as an in-context learning rate that lives in the open interval (0, 1) per channel and decides how aggressively the network should overwrite previous information.
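The transition matrix can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the reference kernel: kappa is treated as a unit-length row vector, the write term `v^T k_tilde` in `delta_step` is assumed, and the values w = 1 and a = 1 are limiting cases chosen so the effect is easy to see (the real model keeps both strictly below 1).

```python
def transition_matrix(w, a, kappa):
    """G = diag(w) - kappa^T (a * kappa), with kappa treated as a row vector."""
    n = len(w)
    return [
        [(w[i] if i == j else 0.0) - kappa[i] * a[j] * kappa[j] for j in range(n)]
        for i in range(n)
    ]


def delta_step(state, w, a, kappa, v, k_tilde):
    """state_t = state_{t-1} G_t + v^T k_tilde (the write term is an assumption)."""
    n = len(w)
    G = transition_matrix(w, a, kappa)
    return [
        [sum(row[i] * G[i][j] for i in range(n)) + v[r] * k_tilde[j] for j in range(n)]
        for r, row in enumerate(state)
    ]


kappa = [0.6, 0.8]  # unit-length removal key
G = transition_matrix(w=[1.0, 1.0], a=[1.0, 1.0], kappa=kappa)
# One stored row pointing exactly along kappa, followed by a step that writes nothing.
state = delta_step([[0.6, 0.8]], [1.0, 1.0], [1.0, 1.0], kappa, v=[0.0], k_tilde=[0.0, 0.0])
print(G)      # off-diagonal entries are non-zero, so channels mix
print(state)  # the content along kappa has been removed (values near zero)
```

With these limiting values G is a projection that erases whatever was stored along kappa, which is the "removal" behaviour described above. Smaller entries of a_t would erase only part of it.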
This is the generalized delta rule that gives the paper its subtitle. It is a strict superset of the DeltaNet update, which only allowed a scalar learning rate, and it is also a strict superset of the diagonal decay used by Mamba and earlier RWKV. The non-diagonal correction is what lets RWKV-7 perform genuine state tracking. The paper proves that with this update the network can recognize all regular languages, which, under the widely held conjecture that TC⁰ ≠ NC¹, puts it beyond the TC⁰ complexity class that contains softmax attention and standard linear attention.
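A toy example may help show why non-diagonal transitions matter for state tracking. The sketch below is not the paper's construction and does not show how RWKV-7 realizes such matrices. It only contrasts a permutation-style transition, which can move information between channels, with a diagonal one, which cannot. All names are hypothetical.

```python
def step(state, G):
    """Row-vector state times transition matrix."""
    n = len(state)
    return [sum(state[i] * G[i][j] for i in range(n)) for j in range(n)]


def perm_matrix(perm):
    """Matrix that moves the content of slot i to slot perm[i]."""
    n = len(perm)
    return [[1.0 if perm[i] == j else 0.0 for j in range(n)] for i in range(n)]


SWAP_01 = perm_matrix([1, 0, 2])   # swap cups 0 and 1
SWAP_12 = perm_matrix([0, 2, 1])   # swap cups 1 and 2
DIAGONAL = [[0.9, 0.0, 0.0], [0.0, 0.9, 0.0], [0.0, 0.0, 0.9]]

tracked = [1.0, 0.0, 0.0]   # ball starts under cup 0
decayed = [1.0, 0.0, 0.0]
for G in (SWAP_01, SWAP_12, SWAP_01):
    tracked = step(tracked, G)         # non-diagonal: the ball moves
    decayed = step(decayed, DIAGONAL)  # diagonal: the ball can only fade

ball_position = tracked.index(1.0)
print(ball_position, decayed)
```

The permutation transitions follow the ball to cup 2, while the diagonal transition leaves all of its mass in slot 0 no matter which swaps occur.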
A second innovation is the decoupling of the key that participates in removal from the key that participates in addition. The removal key kappa_t and the replacement key tilde_k_t are both derived from a single learned key k_t, but they are scaled by different learned parameters before they enter the update. This lets the network forget along one direction while writing new content along a slightly different one, which the paper argues is necessary for some of the state-tracking tasks the architecture was designed to handle.
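The decoupling can be sketched as two different elementwise scalings of one shared key. The exact scaling functions are not given above, so the parameter names `removal_scale` and `replacement_scale`, and the L2 normalization of the removal key, are assumptions made for illustration.

```python
import math


def removal_key(k, removal_scale):
    """Scale the shared key, then L2-normalize it (normalization is an assumption)."""
    scaled = [ki * s for ki, s in zip(k, removal_scale)]
    norm = math.sqrt(sum(x * x for x in scaled)) or 1.0
    return [x / norm for x in scaled]


def replacement_key(k, replacement_scale):
    """Scale the same shared key with a different learned vector."""
    return [ki * s for ki, s in zip(k, replacement_scale)]


k = [3.0, 4.0]
kappa = removal_key(k, removal_scale=[1.0, 0.5])
k_tilde = replacement_key(k, replacement_scale=[0.5, 1.0])
# A non-zero 2-D cross product means the two keys point in different directions.
cross = kappa[0] * k_tilde[1] - kappa[1] * k_tilde[0]
print(kappa, k_tilde, cross)
```

Because the two scalings differ, the direction that is erased and the direction that is written are no longer forced to coincide.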
Goose also introduces value residual learning, a trick that prevents value vectors in deep layers from drifting too far from the values seen at the first layer. In practice, each layer's value vector is a learned interpolation between the layer-0 value and a freshly computed one. This stabilizes training and, according to ablations in the paper, gives roughly a one-point average gain on downstream benchmarks at the 1.5B scale.
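The interpolation itself is simple enough to write out. In this sketch the function name `value_residual` and the convention that a gate of 1 selects the layer-0 value are assumptions. In the real model the gate would be produced by learned parameters rather than fixed by hand.

```python
def value_residual(v_fresh, v_first, gate):
    """Per-channel interpolation: gate=0 keeps the fresh value, gate=1 the layer-0 value."""
    return [f + g * (v0 - f) for f, v0, g in zip(v_fresh, v_first, gate)]


v_first = [1.0, 1.0, 1.0]   # value computed at layer 0
v_fresh = [3.0, 3.0, 3.0]   # value computed at the current layer
mixed = value_residual(v_fresh, v_first, gate=[0.0, 0.5, 1.0])
print(mixed)  # [3.0, 2.0, 1.0]
```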
The decay vector w_t in RWKV-7 is parameterized as w_t = exp(-exp(-0.5) * sigmoid(d_t)), where d_t is the output of a small low-rank MLP with a tanh nonlinearity, applied to the input token. Because the sigmoid lies in (0, 1), this nested-exponential form keeps the decay strictly inside the stable interval (e^(-e^(-0.5)), 1), roughly (0.545, 1.0), which prevents the kind of state explosion that plagued earlier linear-attention designs while still letting the model choose how quickly to forget on a per-token basis.
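Evaluating the formula directly is a quick way to check the bounds. The sketch below treats d_t as a free scalar input, whereas in the model it comes from a learned MLP. The lower bound is the limit as d_t grows large, and it evaluates to about 0.545.

```python
import math


def decay(d):
    """w = exp(-exp(-0.5) * sigmoid(d)) for a single channel."""
    sigmoid = 1.0 / (1.0 + math.exp(-d))
    return math.exp(-math.exp(-0.5) * sigmoid)


lower_bound = math.exp(-math.exp(-0.5))   # limit as d -> +infinity
samples = [decay(d) for d in (-10.0, -1.0, 0.0, 1.0, 10.0)]
print(round(lower_bound, 3))  # 0.545
print(samples)                # decreasing from just under 1 toward the lower bound
```

Large negative d_t gives a decay close to 1 (almost nothing forgotten), and large positive d_t pushes the decay toward the floor without ever reaching it.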
The channel-mix block in RWKV-7 is a plain position-wise feed-forward layer, simpler than the more elaborate variants tried in RWKV-5 and RWKV-6. The block stack alternates time mix and channel mix, with RMSNorm before each sublayer and a residual connection around each. Because the model is purely recurrent at inference time, there is no KV cache and no attention matrix to materialize, which keeps memory flat as context grows.
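A back-of-the-envelope count shows what "flat memory" means in practice. The sketch uses the 32-layer, 2560-wide shape of the 2.9B model and assumes a head size of 64, which this article does not state. It compares the recurrent state with the KV cache of a hypothetical Transformer of the same shape, ignoring small extras such as token-shift buffers and assuming no grouped-query sharing.

```python
LAYERS = 32      # 2.9B model depth
WIDTH = 2560     # 2.9B model width
HEAD_SIZE = 64   # assumed head size; not stated in this article


def recurrent_state_elements():
    """Each head keeps a HEAD_SIZE x HEAD_SIZE matrix; WIDTH // HEAD_SIZE heads per layer."""
    heads = WIDTH // HEAD_SIZE
    return LAYERS * heads * HEAD_SIZE * HEAD_SIZE


def kv_cache_elements(context_length):
    """Keys and values for every past token in a same-shaped Transformer."""
    return 2 * LAYERS * WIDTH * context_length


for tokens in (4096, 16384):
    ratio = kv_cache_elements(tokens) / recurrent_state_elements()
    print(tokens, recurrent_state_elements(), kv_cache_elements(tokens), ratio)
```

Under these assumptions the recurrent state stays at about 5.2 million elements regardless of context, while the KV cache is 128 times larger at 4,096 tokens and 512 times larger at 16,384.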
The Goose family released alongside the paper consists of four base models trained on the RWKV World v3 corpus, plus a smaller set of Pile-trained reference models for ablations. All sizes use the same architecture, scaled in width and depth.
| Model | Parameters | Layers | Width | Total training tokens | Notes |
|---|---|---|---|---|---|
| RWKV7-World3-0.1B | 0.19B | 12 | 768 | 1.6T | 0.6T as RWKV-5 plus 1.0T as RWKV-7 |
| RWKV7-World3-0.4B | 0.45B | 24 | 1024 | 3.1T | 1.1T as RWKV-5 plus 2.0T as RWKV-7 |
| RWKV7-World3-1.5B | 1.52B | 24 | 2048 | 5.6T | 2.5T as RWKV-6 plus 3.1T as RWKV-7 |
| RWKV7-World3-2.9B | 2.91B | 32 | 2560 | 5.6T | 2.5T as RWKV-6 plus 3.1T as RWKV-7 |
The paper also reports Pile reference checkpoints at 0.17B, 0.4B, and 1.47B trained on 332B tokens of the original Pile with the GPT-NeoX tokenizer. These are used for apples-to-apples comparisons against the Mamba, Mamba 2, and Transformer++ baselines that were originally evaluated on the same corpus.
A larger 7.2B Goose checkpoint was released later in 2025 as part of the RWKV-7 G1 line, alongside instruction-tuned and reasoning-tuned variants. The G1 models share the Goose architecture exactly; the differences are in data mixture and post-training.
Training used the RWKV World v3 corpus, a 3.119-trillion-token multilingual mixture that the RWKV Project assembled from public sources including arXiv, GitHub, Wikipedia, public-domain books, and a substantial slice of Chinese web fiction. The composition is roughly 80 percent English, 10 percent other languages, and 10 percent code, with more than a hundred languages represented in the long tail. The dataset is itself released under permissive terms; the curation scripts and source manifests are public on HuggingFace.
The larger Goose models were not trained from scratch. The 1.5B and 2.9B checkpoints started from RWKV-6 weights trained on 2.5T tokens and were then continued for another 3.1T tokens after a surgical conversion of the time-mix blocks to the new generalized delta rule. The smaller 0.1B and 0.4B models were initialized from RWKV-5 weights in the same way. The paper presents this as an efficiency choice: reusing well-trained checkpoints from earlier RWKV versions reduced total compute, and the team reports that the new architecture was stable enough to absorb the surgery without loss spikes once the new parameters were carefully initialized.
Training ran on clusters of H100 and A100 GPUs supplied by Recursal AI, EleutherAI's StabilityAI compute grant, and several smaller donors. The largest run, for the 2.9B model, used a context length of 4096 tokens during the bulk of training, extended to 16,384 for a final long-context phase. The team reports a throughput of about 259,000 tokens per second on a 4-node, 8-GPU H100 cluster at context length 8192.
Optimizer details follow the conventions established for earlier RWKV runs. The team uses Adam with selective weight decay applied only to large matrices, a cosine learning-rate schedule with linear warmup, and a small amount of dropout in the early phases that is annealed to zero. The paper emphasizes that RWKV-7 is spike-free at every scale they tested, which is a meaningful claim for an architecture this new because instability has historically been the main reason linear-attention models fail to scale.
The benchmark story for Goose is split between English and multilingual evaluations, with the multilingual case being the more dramatic.
The 2.9B Goose model is competitive with the strongest open 3B Transformers on standard English benchmarks despite seeing roughly three times fewer tokens. The table below reports normalized accuracy on six of the tasks the paper highlights. The Average row is not the mean of the six rows shown, so it evidently covers a larger task set from the paper.
| Benchmark | RWKV7-2.9B | Qwen2.5-3B | Llama 3.2 3B |
|---|---|---|---|
| LAMBADA | 73.4 | 67.1 | 70.5 |
| HellaSwag | 76.4 | 73.5 | 73.6 |
| PIQA | 79.7 | 78.6 | 76.7 |
| ARC-Easy | 81.0 | 77.4 | 74.5 |
| ARC-Challenge | 48.7 | 45.0 | 42.2 |
| MMLU | 55.0 | 65.7 | 56.5 |
| Average | 71.5 | 71.4 | 67.8 |
Goose wins on five of the six tasks shown and loses meaningfully only on MMLU, a result the authors attribute to the smaller English subset of the World v3 corpus. Qwen2.5-3B, which was trained on roughly 18 trillion tokens, has a clear advantage on knowledge-dense benchmarks. On reasoning and surface-form tasks, Goose holds its own despite its roughly threefold smaller token budget.
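On cross-lingual benchmarks, RWKV-7 reaches a new state of the art at the 3B scale by a comfortable margin. The table reports zero-shot accuracy except for LAMBADA-M, which is perplexity (lower is better). The Average row is not the mean of the four accuracy rows shown, so it evidently reflects additional tasks from the paper.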
| Benchmark | RWKV7-2.9B | Qwen2.5-3B | Llama 3.2 3B |
|---|---|---|---|
| LAMBADA-M (ppl) | 18 | 36 | 30 |
| XCOPA | 63.1 | 59.0 | 58.5 |
| XNLI | 45.4 | 38.5 | 44.2 |
| XStoryCloze | 64.7 | 59.6 | 60.6 |
| xWinogrande | 82.4 | 79.8 | 79.2 |
| Average | 61.1 | 55.6 | 58.1 |
The paper credits the multilingual lead to a combination of the World v3 data mixture, the larger recurrent state, and the in-context learning rate that lets the network shift its language model conditioning quickly when a passage switches languages.
The Mechanistic Architecture Design (MAD) suite, introduced by Poli et al. in their work on hybrid architecture design, tests recurrent and attention-based models on synthetic tasks such as recall, compression, and selective copying. RWKV-7 posts the strongest average score of the architectures evaluated.
| Task | RWKV-7 | Transformer | Mamba | DeltaNet |
|---|---|---|---|---|
| Compress | 44.5 | 51.6 | 52.7 | 42.2 |
| Fuzzy recall | 43.2 | 29.8 | 6.7 | 35.7 |
| In-context recall | 100 | 94.1 | 90.4 | 100 |
| Memorize | 89.1 | 85.2 | 89.5 | 52.8 |
| Noisy recall | 100 | 86.8 | 90.1 | 100 |
| Selective copy | 98.8 | 99.6 | 86.3 | 100 |
| Average | 79.3 | 74.5 | 69.3 | 71.8 |
The most interesting line is fuzzy recall. Mamba scores 6.7 on this task, an obvious failure mode for diagonal state space models, while RWKV-7 reaches 43.2 thanks to the non-diagonal correction in the generalized delta rule.
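The Average row of the MAD table can be reproduced from the six task scores. The short check below copies the numbers from the table. The dictionary layout is only a convenience for the calculation.

```python
# Task order: compress, fuzzy recall, in-context recall, memorize, noisy recall, selective copy
mad_scores = {
    "RWKV-7":      [44.5, 43.2, 100.0, 89.1, 100.0, 98.8],
    "Transformer": [51.6, 29.8, 94.1, 85.2, 86.8, 99.6],
    "Mamba":       [52.7, 6.7, 90.4, 89.5, 90.1, 86.3],
    "DeltaNet":    [42.2, 35.7, 100.0, 52.8, 100.0, 100.0],
}

# Unweighted mean over the six tasks, rounded to one decimal as in the table.
averages = {name: round(sum(s) / len(s), 1) for name, s in mad_scores.items()}
best = max(averages, key=averages.get)
print(averages)
print(best)
```

The unweighted means match the table (79.3, 74.5, 69.3, and 71.8), which confirms that the Average row is a plain mean over the six tasks.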
A forward pass through the RWKV-7 time-mix kernel at sequence length 16,384 and batch size 8 takes about 7.9 milliseconds on an H100, or 11.2 milliseconds if the state is also written to global memory. The corresponding backward pass is 22.5 milliseconds. Flash Attention v3 on the same shape takes 33.9 milliseconds for the forward pass alone. The asymptotic difference is the usual one: RWKV-7 is linear in sequence length, attention is quadratic, and the gap grows as contexts get longer.
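The reported timings can be turned into a rough scaling model. The sketch assumes idealized behaviour: RWKV-7 cost grows exactly linearly and attention cost exactly quadratically from the measured point at 16,384 tokens. Real kernels deviate from this because of launch overheads and memory effects, so the extrapolated numbers are indicative only.

```python
T0 = 16384            # sequence length of the reported measurement
RWKV_FWD_MS = 7.9     # RWKV-7 forward time at T0 (from the text)
FLASH_FWD_MS = 33.9   # Flash Attention v3 forward time at T0 (from the text)


def rwkv_ms(tokens):
    """Idealized linear scaling from the measured point."""
    return RWKV_FWD_MS * (tokens / T0)


def attention_ms(tokens):
    """Idealized quadratic scaling from the measured point."""
    return FLASH_FWD_MS * (tokens / T0) ** 2


def speedup(tokens):
    return attention_ms(tokens) / rwkv_ms(tokens)


for tokens in (T0, 2 * T0, 4 * T0):
    print(tokens, round(speedup(tokens), 2))
```

At the measured length the forward-pass advantage is a little over 4x, and under this model it doubles with every doubling of the sequence length.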
Memory is also flat. The paper reports 18 bfloat16 variables stored per layer during training, against 10 for Flash Attention v3, but the variables are short vectors rather than full attention matrices, so the absolute footprint at long contexts is much smaller.
Everything in the Goose release is published under the Apache 2.0 license. This applies to the model weights for all four World v3 checkpoints and the Pile reference checkpoints, to the training code in the RWKV-LM repository on GitHub, to the inference kernels including the CUDA and Triton implementations of the new delta-rule update, and to the dataset manifests and curation scripts. The license permits commercial use, modification, and redistribution without any acceptable-use clause or downstream restriction, which puts RWKV-7 among the most permissive open models at the 3B scale alongside OLMo and Pythia.
The permissive license is a deliberate part of the project's positioning. RWKV is hosted by the Linux Foundation AI and Data foundation and is required to use a recognized open-source license to remain a hosted project; the project has consistently chosen Apache 2.0 over weight-available licenses like the Llama community license.
The paper situates Goose against the other major non-attention architectures by checking which of four expressivity properties each one supports. The four properties are large state (LS), flexible decay (FD), dynamic dependence (DD), and generalized eigenvalue spectrum (GE).
| Architecture | LS | FD | DD | GE | Notes |
|---|---|---|---|---|---|
| Mamba | yes | no | yes | no | Diagonal state space, data-dependent inputs |
| Mamba 2 | yes | yes | yes | no | Adds scalar decay; structured state-space duality |
| RetNet | yes | no | no | no | Retentive network with fixed decay |
| DeltaNet | yes | no | yes | no | Delta rule but scalar learning rate |
| Gated DeltaNet | yes | yes | yes | yes | Concurrent work, similar expressivity to Goose |
| RWKV-6 (Finch) | yes | yes | no | no | Data-dependent diagonal decay |
| RWKV-7 (Goose) | yes | yes | yes | yes | All four properties |
Goose and Gated DeltaNet are the only two non-attention designs that the paper credits with all four properties. The two architectures were developed in parallel and arrive at the same conclusion: vector-valued learning rates and non-diagonal transitions are needed together to break the TC⁰ expressivity ceiling.
Against attention, the comparison is the familiar one. A Transformer is more expressive than a linear-attention or state-space model on tasks that require comparing every token to every other token. Goose narrows that gap on most of the benchmarks the community currently runs, and on a few tasks like in-context recall it reaches the Transformer ceiling.
The RWKV Project joined the Linux Foundation AI and Data foundation in early 2024, becoming the first generative AI model accepted as a hosted project under the Generative AI Commons sub-foundation. The acceptance covered the architecture, the training code, and the World dataset; it did not cover any single trained checkpoint, which is consistent with how the Linux Foundation treats data assets in other AI projects. By the time the Goose paper was published, the project had moved from sandbox to incubation stage, the second of three maturity tiers, and was discussing graduation criteria.
The governance structure is light. Peng Bo remains the lead maintainer of the architecture and writes most of the reference code. A technical steering committee that includes representatives from EleutherAI, Recursal AI, Featherless, and several universities oversees the roadmap and dataset decisions. Day-to-day discussion happens on a public Discord server, and pull requests land in the RWKV-LM repository, which had roughly 14,500 stars and a thousand forks at the time the Goose paper was posted.
Commercial users of RWKV-7 include Recursal AI, which offers hosted inference, and Featherless, which serves the World v3 checkpoints behind an OpenAI-compatible API. Several research groups have built multimodal extensions on top of Goose, including VisualRWKV-7 for image-conditioned generation and AudioRWKV-7 for speech.
Reaction to the paper was mostly positive within the niche of researchers who follow non-attention sequence models, and more skeptical outside it. The OpenReview submission of an earlier draft attracted reviews that praised the theoretical contribution and the engineering of the delta-rule kernel but pushed back on the multilingual benchmark claims, arguing that the comparison set was thin and that LAMBADA-M was the wrong perplexity benchmark to highlight. The published version of the paper added more baselines and explicitly noted the MMLU gap.
Independent practitioners who downloaded the 1.5B and 2.9B checkpoints in the weeks after release reported that the models held up well on long-context tasks and on languages with relatively little training data, but that the instruction-following behaviour of the base checkpoints was raw. The G1 line, which adds supervised fine-tuning and a small reasoning corpus, addressed most of the practical complaints later in 2025.
The most consequential reception, arguably, was in the linear attention research community, where Goose and Gated DeltaNet together made the case that the gap with attention is closable and that the next generation of recurrent designs should default to vector-valued learning rates and non-diagonal transitions. By late 2025 several follow-up papers had taken the generalized delta rule as a starting point rather than as a curiosity.
On the commercial side, the picture is more mixed. Apache 2.0 weights at 3B scale are useful, but the practical alternatives, particularly Qwen 2.5 and Llama 3.2, were trained on far more data and still ship with more polished instruction-tuned variants. RWKV-7 finds its audience mainly among teams that need long context with bounded memory, teams that need to fine-tune on languages or modalities the larger labs do not cover well, and teams that want a permissively licensed base to build on without negotiating community-license terms.