RWKV-7 (Goose)

Updated Jun 24, 2026

RWKV-7, codenamed Goose, is an attention-free, RNN-style large-language-model architecture introduced in March 2025 that runs inference with constant memory and constant compute per token, so total cost grows linearly with sequence length, while still training in parallel like a Transformer. It is the seventh major iteration of the RWKV (Receptance Weighted Key Value) sequence-modeling line, and its headline 2.9-billion-parameter model set a new state of the art at the 3B scale on multilingual benchmarks while being trained on far fewer tokens than competing open models. ^[1] The paper's central claim is that RWKV-7 "can perform state tracking and recognize all regular languages, while retaining parallelizability of training," which it argues "exceeds the capabilities of Transformers under standard complexity conjectures, which are limited to TC0." ^[1]

The architecture was presented in the paper RWKV-7 "Goose" with Expressive Dynamic State Evolution, posted to arXiv as 2503.14456 on 18 March 2025 and authored by Bo Peng (the project lead, who publishes as BlinkDL), Ruichong Zhang, Daniel Goldstein, Eric Alcaide, and 14 further co-authors (18 authors total) drawn from the RWKV Project, EleutherAI, Recursal AI, and several universities. ^[1] RWKV-7 positions itself as a recurrent neural network alternative to attention-based models and to Mamba-style state space models.

The key technical change is a generalized delta rule with vector-valued gating and in-context learning rates, which the authors call "dynamic state evolution." ^[1] Weights, training-data manifests, training code, and inference kernels were released under the Apache License 2.0 through HuggingFace and GitHub. The RWKV Project sits inside the Linux Foundation AI and Data foundation, where in September 2023 it became the first generative AI model accepted as a hosted project under the Generative AI Commons. ^[4]^[5] Goose is the version of the architecture that the community now treats as the reference baseline for downstream work.

What is the RWKV family, from v1 to v6?

RWKV started as a hobby project by Bo Peng, an independent researcher then loosely affiliated with EleutherAI, who wanted to know whether an RNN could match a Transformer if you redesigned the recurrence to be parallelizable across the time axis. The name comes from its four core parameters: Receptance, Weight, Key, and Value. ^[2] The first versions, RWKV-1 through RWKV-3, were small experiments that mostly served as proofs of concept. The line became serious with RWKV-4 in 2023, which reached parity with comparably sized GPT models on the Pile and demonstrated that an RNN could be trained at multi-billion parameter scale without the usual instabilities. RWKV-4 is also the version that introduced the now-familiar split between a time-mix block, which carries the recurrent state, and a channel-mix block, which acts as a position-wise feed-forward layer.

RWKV-5, codenamed Eagle, refined the recurrence by promoting the hidden state from a vector to a matrix and by introducing multi-headed time mixing. This lifted long-context performance considerably and was the first version to be trained on the RWKV World multilingual corpus, a deliberate move away from the English-heavy Pile. RWKV-6, codenamed Finch, replaced the static decay vector with a data-dependent one, allowing the network to choose how aggressively it forgets state on a per-token basis, and added low-rank adapters inside the time-mix block so that each token could project itself into a learned subspace before the recurrence acted on it. Finch was the first RWKV model to be widely deployed in production, mostly via the Recursal AI and Featherless services and via community quantizations.

By late 2024, however, two problems were obvious. The first was that even Finch, while strong on perplexity, struggled on in-context recall tasks where a Transformer of similar size would do well. The second was that several concurrent linear attention architectures, especially Gated DeltaNet and Mamba 2, were closing the gap with Transformers from another angle, by adding non-trivial transition matrices to their recurrences. RWKV-7 was designed to absorb the best ideas from those efforts while keeping the engineering and licensing posture that had made RWKV easy to ship.

The table below summarizes the lineage. ^[2]

Version	Codename	Year	Key change
RWKV-4	(Dove)	2023	First official version; time-mix / channel-mix split; trained on the Pile
RWKV-5	Eagle	2023	Matrix-valued state; multi-headed time mixing; World corpus
RWKV-6	Finch	2024	Data-dependent diagonal decay; LoRA-style time-mix adapters
RWKV-7	Goose	2025	Generalized delta rule; non-diagonal dynamic state evolution

How does the RWKV-7 architecture work?

The core change in Goose is a generalized delta rule that turns the recurrent state update into something significantly more expressive than the diagonal decay used in earlier RWKV versions. The state in RWKV-7 is still a matrix, written wkv, that lives at every time step and is read out to produce the output token. What is new is how it evolves.

From diagonal decay to dynamic state evolution

In RWKV-4 through RWKV-6, the state update could be written, in simplified form, as state_t = decay * state_{t-1} + v_t k_t^T, where decay was either a learned static vector (before Finch) or a learned data-dependent vector (Finch). That structure is mathematically a diagonal transition matrix, which means each channel of the state evolves independently of the others. It is fast and stable, but it limits how richly the state can mix information across channels.
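The channel independence of a diagonal update can be seen in a few lines of NumPy. This is a minimal sketch of the simplified formula above, not the RWKV kernel; the function names, shapes, and random inputs are illustrative assumptions.

```python
import numpy as np

def diagonal_decay_step(state, decay, k, v):
    """One step of the simplified update: decay each key channel, then add v k^T."""
    # state: (d_v, d_k), decay and k: (d_k,), v: (d_v,)
    return state * decay[None, :] + np.outer(v, k)

def run(decay, keys, values):
    """Apply the update over a whole sequence, starting from an empty state."""
    state = np.zeros((values.shape[1], keys.shape[1]))
    for k, v in zip(keys, values):
        state = diagonal_decay_step(state, decay, k, v)
    return state

rng = np.random.default_rng(0)
keys = rng.normal(size=(5, 4))     # 5 tokens, 4 key channels
values = rng.normal(size=(5, 3))   # 5 tokens, 3 value channels

decay_a = np.array([0.9, 0.8, 0.7, 0.6])
decay_b = decay_a.copy()
decay_b[0] = 0.1                   # change the decay of key channel 0 only

state_a = run(decay_a, keys, values)
state_b = run(decay_b, keys, values)
```

Changing the decay of channel 0 alters only column 0 of the final state; the remaining columns are identical in both runs, which is what "each channel evolves independently" means in practice.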

RWKV-7 replaces the diagonal decay with a non-diagonal transition matrix of the form G_t = diag(w_t) - kappa_t^T (a_t * kappa_t). The first term is the familiar diagonal decay, parameterized by a per-channel decay vector w_t whose values lie in a stable band roughly between 0.545 and 1.0. The second term is a rank-one correction that subtracts a learned amount of the previous state along the direction kappa_t, a removal key derived from the current token. The vector a_t, computed by a small low-rank MLP, acts as an in-context learning rate that lives in the open interval (0, 1) per channel and decides how aggressively the network should overwrite previous information.
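In matrix notation, treating keys and values as row vectors, the transition can be written out as follows. The second line is an assumption that combines the transition with the additive term of the earlier simplified update, using the replacement key \tilde{k}_t (described in the next subsection) in place of k_t; S_t is shorthand introduced here for the wkv state.

```latex
\begin{aligned}
G_t &= \operatorname{diag}(w_t) - \kappa_t^{\top}\,(a_t \odot \kappa_t) \\
S_t &= S_{t-1}\,G_t + v_t^{\top}\,\tilde{k}_t \\
    &= S_{t-1}\operatorname{diag}(w_t)
       - \bigl(S_{t-1}\,\kappa_t^{\top}\bigr)\,(a_t \odot \kappa_t)
       + v_t^{\top}\,\tilde{k}_t
\end{aligned}
```

The expanded form shows the rank-one term at work: S_{t-1} \kappa_t^{\top} reads out what the state currently stores along \kappa_t, and that readout is subtracted back along the same direction, scaled per channel by a_t.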

This is the generalized delta rule behind the "dynamic state evolution" of the paper's title. It strictly generalizes the DeltaNet update, which allows only a scalar learning rate, and it also strictly generalizes the diagonal decay used by Mamba and earlier RWKV versions. The non-diagonal correction is what lets RWKV-7 perform genuine state tracking. The paper proves that with this update the network can recognize all regular languages, which, under standard complexity conjectures, puts it above the TC0 complexity class that contains softmax attention and standard linear attention. ^[1]
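Two limiting cases make the relationship to earlier rules concrete. The NumPy sketch below is illustrative only: the function names are hypothetical, the removal key is assumed to be unit-norm, and a_t = 1 is used as a limiting value even though the text places a_t strictly inside (0, 1).

```python
import numpy as np

def transition(w, a, kappa):
    """G = diag(w) - kappa^T (a * kappa), with kappa treated as a row vector."""
    return np.diag(w) - np.outer(kappa, a * kappa)

def delta_step(state, w, a, kappa, k_new, v):
    """state_t = state_{t-1} @ G + v^T k_new (row-vector convention)."""
    return state @ transition(w, a, kappa) + np.outer(v, k_new)

rng = np.random.default_rng(0)
d_k, d_v = 4, 3
k = rng.normal(size=d_k)
k /= np.linalg.norm(k)                 # unit-norm key (assumption)
v_old = rng.normal(size=d_v)
v_new = rng.normal(size=d_v)
ones, zeros = np.ones(d_k), np.zeros(d_k)

state = np.outer(v_old, k)             # the state already stores v_old under key k

# a = 1, w = 1: classic delta rule, the old association is erased before writing.
overwritten = delta_step(state, ones, ones, k, k, v_new)
# a = 0, w = 1: no removal term, so this is plain additive linear attention.
accumulated = delta_step(state, ones, zeros, k, k, v_new)

read_overwritten = overwritten @ k     # query the state with the same key
read_accumulated = accumulated @ k
```

With the removal term active, querying with k returns v_new alone; with it switched off, the query returns v_old + v_new, the interference that a purely additive or diagonal update cannot undo.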

Decoupled removal and replacement keys

A second innovation is the decoupling of the key that participates in removal from the key that participates in addition, together with what the abstract calls "a relaxed value replacement rule." ^[1] The removal key kappa_t and the replacement key tilde_k_t are both derived from a single learned key k_t, but they are scaled by different learned parameters before they enter the update. This lets the network forget along one direction while writing new content along a slightly different one, which the paper argues is necessary for some of the state-tracking tasks the architecture was designed to handle.
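One way to picture the decoupling is sketched below. The specific scaling forms are assumptions made for illustration, not a statement of the paper's equations: `removal_scale` and `replacement_mix` are hypothetical names for the two learned parameters, the removal key is normalized to unit length, and the replacement key is blended between k and k * a.

```python
import numpy as np

def decoupled_keys(k, a, removal_scale, replacement_mix):
    """Derive a removal key and a replacement key from one shared key k.

    Hypothetical forms (assumptions, see the lead-in):
      removal key     = normalize(k * removal_scale)
      replacement key = k * lerp(1, a, replacement_mix)
    """
    kappa = k * removal_scale
    kappa_hat = kappa / np.linalg.norm(kappa)
    k_tilde = k * (1.0 + (a - 1.0) * replacement_mix)
    return kappa_hat, k_tilde

k = np.array([0.5, -1.0, 2.0, 0.25])
a = np.array([0.9, 0.1, 0.5, 0.7])                 # in-context learning rate
removal_scale = np.array([1.0, 1.0, 0.5, 2.0])     # hypothetical learned vector
replacement_mix = np.array([0.0, 1.0, 0.5, 0.0])   # hypothetical learned vector

kappa_hat, k_tilde = decoupled_keys(k, a, removal_scale, replacement_mix)

# Cosine similarity between the forgetting direction and the writing direction.
cosine = kappa_hat @ k_tilde / np.linalg.norm(k_tilde)
```

Because the two keys are scaled differently, the direction along which the state forgets is no longer parallel to the direction along which it writes, even though both come from the same k.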

Value residual learning

Goose also introduces value residual learning, a trick that prevents value vectors in deep layers from drifting too far from the values seen at the first layer. In practice, each layer's value vector is a learned interpolation between the layer-0 value and a freshly computed one. This stabilizes training and, according to ablations in the paper, gives roughly a one-point average gain on downstream benchmarks at the 1.5B scale.
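The interpolation can be sketched as a per-channel blend. This is a minimal illustration, assuming a gate in [0, 1] (for example a sigmoid of learned logits); the function and argument names are hypothetical.

```python
import numpy as np

def value_residual(v_fresh, v_first_layer, gate):
    """Blend a layer's freshly computed value with the layer-0 value.

    gate = 0 keeps the fresh value, gate = 1 reuses the layer-0 value.
    """
    return v_fresh + gate * (v_first_layer - v_fresh)

v_first_layer = np.array([1.0, 2.0, 3.0])
v_fresh = np.array([3.0, 0.0, -1.0])

only_fresh = value_residual(v_fresh, v_first_layer, np.zeros(3))
only_first = value_residual(v_fresh, v_first_layer, np.ones(3))
halfway = value_residual(v_fresh, v_first_layer, np.full(3, 0.5))
```

Anchoring deep layers to the layer-0 value in this way bounds how far the values can drift, which is the stabilizing effect the paragraph above describes.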

In-context weight decay

The decay vector w_t in RWKV-7 is parameterized as w_t = exp(-exp(-0.5) * sigmoid(d_t)), where d_t is the output of a small low-rank MLP, with a tanh nonlinearity, applied to the input token. Because the sigmoid lies in (0, 1), this nested-exponential form keeps the decay strictly inside the stable interval (e^(-e^(-0.5)), 1), roughly (0.545, 1.0), which prevents the kind of state explosion that plagued earlier linear-attention designs while still letting the model choose how quickly to forget on a per-token basis.
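The bounds of the decay can be checked numerically. The sketch below evaluates the formula over a wide range of hypothetical logits d_t; the helper names are illustrative and the low-rank MLP that would produce d_t is omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decay_from_logits(d):
    """w = exp(-exp(-0.5) * sigmoid(d)), applied elementwise."""
    return np.exp(-np.exp(-0.5) * sigmoid(d))

# Limit of w as d -> +infinity, where sigmoid(d) -> 1.
lower_bound = float(np.exp(-np.exp(-0.5)))

d = np.linspace(-20.0, 20.0, 401)   # stand-in for the MLP output
w = decay_from_logits(d)
```

Here `lower_bound` evaluates to about 0.545 (note that exp(-0.5) itself is about 0.607, which is the exponent's scale rather than the bound), and every entry of `w` falls strictly between that bound and 1.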

Channel mix and overall block layout

The channel-mix block in RWKV-7 is a plain position-wise feed-forward layer, simpler than the more elaborate variants tried in RWKV-5 and RWKV-6, and the token-shift mechanism was likewise simplified to speed up training and inference. ^[2] The block stack alternates time mix and channel mix, with LayerNorm before each sublayer and a residual connection around each. At inference time the model runs purely as a recurrence, so there is no KV cache and no attention matrix to materialize, which keeps memory flat as context grows.
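A back-of-envelope comparison shows what flat memory means in practice. The sketch assumes a head size of 64, 2-byte (bfloat16) values, one square state matrix per head, and a same-shaped Transformer that caches full keys and values without grouped-query attention; smaller per-layer buffers such as token-shift state are ignored. All function names are hypothetical.

```python
def kv_cache_bytes(context_len, n_layers, width, bytes_per_value=2):
    """Transformer KV cache: one key and one value vector per token per layer."""
    return 2 * n_layers * context_len * width * bytes_per_value

def recurrent_state_bytes(n_layers, width, head_size=64, bytes_per_value=2):
    """Recurrent state: one (head_size x head_size) matrix per head per layer."""
    n_heads = width // head_size
    return n_layers * n_heads * head_size * head_size * bytes_per_value

# Shape of the 2.9B model from the release table: 32 layers, width 2560.
state_bytes = recurrent_state_bytes(n_layers=32, width=2560)
cache_4k = kv_cache_bytes(context_len=4096, n_layers=32, width=2560)
cache_16k = kv_cache_bytes(context_len=16384, n_layers=32, width=2560)
```

Under these assumptions the recurrent state is 10 MiB at any context length, while the cache of the comparison Transformer grows from 1.25 GiB at 4,096 tokens to 5 GiB at 16,384 tokens.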

What model sizes were released?

The Goose family released alongside the paper consists of four base models trained on the RWKV World v3 corpus, plus a smaller set of Pile-trained reference models for ablations. All four base models share the same architecture, scaled in width and depth, and range from 0.19 billion to 2.9 billion parameters. ^[1]

Model	Parameters	Layers	Width	Total training tokens	Notes
RWKV7-World3-0.1B	0.19B	12	768	1.6T	0.6T from RWKV-5 init plus 1.0T as RWKV-7
RWKV7-World3-0.4B	0.45B	24	1024	3.1T	1.1T RWKV-5 plus 2.0T RWKV-7
RWKV7-World3-1.5B	1.52B	24	2048	5.6T	2.5T RWKV-6 plus 3.1T RWKV-7
RWKV7-World3-2.9B	2.91B	32	2560	5.6T	2.5T RWKV-6 plus 3.1T RWKV-7

The paper also reports Pile reference checkpoints at 0.17B, 0.4B, and 1.47B trained on 332B tokens of the original Pile with the GPT-NeoX tokenizer. These are used for apples-to-apples comparisons against the Mamba, Mamba 2, and Transformer++ baselines that were originally evaluated on the same corpus.

A larger 7.2B Goose checkpoint was released later in 2025 as part of the RWKV-7 G1 line, alongside instruction-tuned and reasoning-tuned variants. The G1 models share the Goose architecture exactly; the differences are in data mixture and post-training.

How was RWKV-7 trained, and on what data?

Training used the RWKV World v3 corpus, a multilingual mixture of roughly 3.1 trillion tokens that the paper released as open source alongside the models. ^[1] The RWKV Project assembled it from public sources including arXiv, GitHub, Wikipedia, public-domain books, and a substantial slice of Chinese web fiction. The composition is roughly 80 percent English, 10 percent other languages, and 10 percent code, with more than a hundred languages represented in the long tail. The dataset component listing is published on HuggingFace. ^[3]

The larger Goose models were not trained from scratch. The 1.5B and 2.9B checkpoints started from RWKV-6 weights trained on 2.5T tokens and were then continued for another 3.1T tokens after a surgical conversion of the time-mix blocks to the new generalized delta rule. The smaller 0.1B and 0.4B models were initialized from RWKV-5 weights in the same way. The paper presents this as an efficiency choice: reusing well-trained checkpoints from earlier RWKV versions reduced total compute, and the team reports that the new architecture was stable enough to absorb the surgery without loss spikes once the new parameters were carefully initialized.
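The token accounting in the release table can be reproduced directly. The figures are the ones quoted in this article, in trillions of tokens; the dictionary layout is just an illustrative choice.

```python
# (tokens inherited from the earlier-version checkpoint, tokens trained as RWKV-7)
stages = {
    "RWKV7-World3-0.1B": (0.6, 1.0),   # initialized from RWKV-5
    "RWKV7-World3-0.4B": (1.1, 2.0),   # initialized from RWKV-5
    "RWKV7-World3-1.5B": (2.5, 3.1),   # initialized from RWKV-6
    "RWKV7-World3-2.9B": (2.5, 3.1),   # initialized from RWKV-6
}

totals = {
    name: round(inherited + new, 1)
    for name, (inherited, new) in stages.items()
}

# Share of each model's total token budget that was spent after the conversion.
new_fraction = {
    name: new / (inherited + new)
    for name, (inherited, new) in stages.items()
}
```

For the two largest models, a little over half of the 5.6T total was seen after the conversion to the RWKV-7 time-mix block; the rest was inherited from the RWKV-6 checkpoint.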

Training ran on clusters of H100 and A100 GPUs supplied by Recursal AI, EleutherAI's StabilityAI compute grant, and several smaller donors. The largest run, for the 2.9B model, used a context length of 4096 tokens during the bulk of training, extended to 16,384 for a final long-context phase. The team reports a throughput of about 259,000 tokens per second on a 4-node, 8-GPU H100 cluster at context length 8192.

Optimizer details follow the conventions established for earlier RWKV runs. The team uses Adam with selective weight decay applied only to large matrices, a cosine learning-rate schedule with linear warmup, and a small amount of dropout in the early phases that is annealed to zero. The paper emphasizes that RWKV-7 is spike-free at every scale they tested, which is a meaningful claim for an architecture this new because instability has historically been the main reason linear-attention models fail to scale.

How does RWKV-7 perform on benchmarks?

The benchmark story for Goose is split between English and multilingual evaluations, with the multilingual case being the more dramatic. The headline efficiency comparison is stark: RWKV7-World3-2.9B was trained on 5.6 trillion tokens, against roughly 18 trillion tokens for Qwen2.5-3B, yet it matches Qwen2.5 on English downstream tasks and beats it on multilingual ones. ^[1]

English downstream tasks

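The 2.9B Goose model is competitive with the strongest open 3B Transformers on standard English benchmarks despite seeing roughly a third as many tokens as Qwen2.5-3B. The table below reports normalized accuracy on the tasks the paper highlights. The Average row does not equal the mean of the six rows shown (69.0, 67.9, and 65.7 respectively), so it appears to be computed over a longer task list than the one reproduced here.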

Benchmark	RWKV7-2.9B	Qwen2.5-3B	Llama 3.2 3B
LAMBADA	73.4	67.1	70.5
HellaSwag	76.4	73.5	73.6
PIQA	79.7	78.6	76.7
ARC-Easy	81.0	77.4	74.5
ARC-Challenge	48.7	45.0	42.2
MMLU	55.0	65.7	56.5
Average	71.5	71.4	67.8

Goose wins on five of the six tasks listed and loses meaningfully only on MMLU, a result the authors attribute to the smaller English subset of the World v3 corpus. Qwen2.5-3B, which was trained on roughly 18 trillion tokens, has a clear advantage on knowledge-dense benchmarks. On reasoning and surface-form tasks, the model trained on roughly a third as many tokens holds its own.
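The win count and the token ratio quoted above can be recomputed from the table. The snippet uses only the six rows listed in this article and a plain unweighted mean, which is an assumption about how one might average them, not a reconstruction of the paper's own Average row.

```python
models = ("RWKV7-2.9B", "Qwen2.5-3B", "Llama 3.2 3B")
scores = {
    "LAMBADA":       (73.4, 67.1, 70.5),
    "HellaSwag":     (76.4, 73.5, 73.6),
    "PIQA":          (79.7, 78.6, 76.7),
    "ARC-Easy":      (81.0, 77.4, 74.5),
    "ARC-Challenge": (48.7, 45.0, 42.2),
    "MMLU":          (55.0, 65.7, 56.5),
}

# Tasks on which the RWKV-7 column holds the highest score.
wins = [task for task, row in scores.items() if row[0] == max(row)]

# Unweighted mean of the listed rows, per model.
row_means = {
    model: round(sum(row[i] for row in scores.values()) / len(scores), 1)
    for i, model in enumerate(models)
}

# Training-token ratio: roughly 18T for Qwen2.5-3B against 5.6T for RWKV7-2.9B.
token_ratio = 18.0 / 5.6
```

The means over the listed rows differ from the table's Average row, which suggests that row covers more tasks than are shown; the ordering of the three models is the same either way.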

Multilingual tasks

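On cross-lingual benchmarks, RWKV-7 reaches a new state of the art at the 3B scale by a comfortable margin. ^[1] The table reports zero-shot accuracy except for LAMBADA-M, which is perplexity (lower is better). The Average row does not match the mean of the four accuracy rows shown (63.9, 59.2, and 60.6 respectively), so it appears to cover additional tasks not reproduced here.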

Benchmark	RWKV7-2.9B	Qwen2.5-3B	Llama 3.2 3B
LAMBADA-M (ppl)	18	36	30
XCOPA	63.1	59.0	58.5
XNLI	45.4	38.5	44.2
XStoryCloze	64.7	59.6	60.6
xWinogrande	82.4	79.8	79.2
Average	61.1	55.6	58.1

The paper credits the multilingual lead to a combination of the World v3 data mixture, the larger recurrent state, and the in-context learning rate that lets the network shift its language-model conditioning quickly when a passage switches languages.

Recall and synthetic tasks

The Mechanistic Architecture Design (MAD) suite tests recurrent and attention-based models on small synthetic tasks such as recall, compression, memorization, and selective copying. RWKV-7 posts the strongest average score of the architectures evaluated.

Task	RWKV-7	Transformer	Mamba	DeltaNet
Compress	44.5	51.6	52.7	42.2
Fuzzy recall	43.2	29.8	6.7	35.7
In-context recall	100	94.1	90.4	100
Memorize	89.1	85.2	89.5	52.8
Noisy recall	100	86.8	90.1	100
Selective copy	98.8	99.6	86.3	100
Average	79.3	74.5	69.3	71.8

The most interesting line is fuzzy recall. Mamba scores 6.7 on this task, an obvious failure mode for diagonal state space models, while RWKV-7 reaches 43.2 thanks to the non-diagonal correction in the generalized delta rule.
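The averages in the MAD table follow from the six task rows, as the short check below shows. It also records which architecture leads on each task; the data are copied from the table and the variable names are illustrative.

```python
architectures = ("RWKV-7", "Transformer", "Mamba", "DeltaNet")
mad = {
    "Compress":          (44.5, 51.6, 52.7, 42.2),
    "Fuzzy recall":      (43.2, 29.8, 6.7, 35.7),
    "In-context recall": (100.0, 94.1, 90.4, 100.0),
    "Memorize":          (89.1, 85.2, 89.5, 52.8),
    "Noisy recall":      (100.0, 86.8, 90.1, 100.0),
    "Selective copy":    (98.8, 99.6, 86.3, 100.0),
}

# Unweighted mean over the six tasks, per architecture.
averages = {
    arch: round(sum(row[i] for row in mad.values()) / len(mad), 1)
    for i, arch in enumerate(architectures)
}
best_average = max(averages, key=averages.get)

# Architectures that reach the top score on each task (ties are kept).
leaders = {
    task: [a for a, s in zip(architectures, row) if s == max(row)]
    for task, row in mad.items()
}

fuzzy_gap = round(mad["Fuzzy recall"][0] - mad["Fuzzy recall"][2], 1)
```

RWKV-7 has the best average without leading every row: Mamba is ahead on Compress and Memorize, and DeltaNet on Selective copy, so the lead comes from being consistently near the top rather than from any single task.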

Speed and memory

A forward pass through the RWKV-7 time-mix kernel at sequence length 16,384 and batch size 8 takes about 7.9 milliseconds on an H100, or 11.2 milliseconds if the state is also written to global memory. The corresponding backward pass is 22.5 milliseconds. Flash Attention v3 on the same shape takes 33.9 milliseconds for the forward pass alone. The asymptotic difference is the usual one: RWKV-7 is linear in sequence length, attention is quadratic, and the gap grows as contexts get longer.
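The measured gap and the asymptotic argument can be put side by side. The timings are the ones quoted above; the projection to longer sequences is an idealized extrapolation (pure linear versus pure quadratic scaling, ignoring constant overheads and hardware effects), not a measurement, and `projected_ms` is a hypothetical helper.

```python
rwkv7_forward_ms = 7.9               # time-mix kernel, sequence length 16,384
flash_attention_v3_forward_ms = 33.9  # same shape

speedup_at_16k = flash_attention_v3_forward_ms / rwkv7_forward_ms

def projected_ms(base_ms, base_len, new_len, exponent):
    """Scale a measured time by (new_len / base_len) ** exponent."""
    return base_ms * (new_len / base_len) ** exponent

# Idealized projection to twice the sequence length.
rwkv7_32k = projected_ms(rwkv7_forward_ms, 16384, 32768, exponent=1)
attention_32k = projected_ms(flash_attention_v3_forward_ms, 16384, 32768, exponent=2)
speedup_at_32k = attention_32k / rwkv7_32k
```

Under this model the speedup itself grows linearly with sequence length: about 4.3x at 16,384 tokens and about 8.6x at 32,768.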

Memory is also flat. The paper reports 18 bfloat16 variables stored per layer during training, against 10 for Flash Attention v3, but the variables are short vectors rather than full attention matrices, so the absolute footprint at long contexts is much smaller.

Is RWKV-7 open source?

Yes. Everything in the Goose release is published under the Apache 2.0 license. ^[1] This applies to the model weights for all four World v3 checkpoints and the Pile reference checkpoints, to the training code in the RWKV-LM repository on GitHub, to the inference kernels including the CUDA and Triton implementations of the new delta-rule update, and to the dataset manifests and curation scripts. The license permits commercial use, modification, and redistribution without any acceptable-use clause or downstream restriction, which puts RWKV-7 among the most permissive open models at the 3B scale alongside OLMo and Pythia.

The permissive license is a deliberate part of the project's positioning. RWKV is hosted by the Linux Foundation AI and Data foundation and is required to use a recognized open-source license to remain a hosted project; the project has consistently chosen Apache 2.0 over weight-available licenses like the Llama community license. ^[4]

How does RWKV-7 compare to peer architectures?

The paper situates Goose against the other major non-attention architectures by checking which of four expressivity properties each one supports. The four properties are large state (LS), flexible decay (FD), dynamic dependence (DD), and generalized eigenvalue spectrum (GE).

Architecture	LS	FD	DD	GE	Notes
Mamba	yes	no	yes	no	Diagonal state space, data-dependent inputs
Mamba 2	yes	yes	yes	no	Adds scalar decay; structured state-space duality
RetNet	yes	no	no	no	Retentive network with fixed decay
DeltaNet	yes	no	yes	no	Delta rule but scalar learning rate
Gated DeltaNet	yes	yes	yes	yes	Concurrent work, similar expressivity to Goose
RWKV-6 (Finch)	yes	yes	no	no	Data-dependent diagonal decay
RWKV-7 (Goose)	yes	yes	yes	yes	All four properties

Goose and Gated DeltaNet are the only two non-attention designs that the paper credits with all four properties. The two architectures were developed in parallel and arrive at the same conclusion: vector-valued learning rates and non-diagonal transitions are needed together to break the TC0 expressivity ceiling.
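The property table can be encoded as a small lookup to check this reading and to list what each architecture lacks. The entries are transcribed from the table above; the structure and names are illustrative.

```python
# LS = large state, FD = flexible decay, DD = dynamic dependence,
# GE = generalized eigenvalue spectrum.
properties = {
    "Mamba":          {"LS": True, "FD": False, "DD": True,  "GE": False},
    "Mamba 2":        {"LS": True, "FD": True,  "DD": True,  "GE": False},
    "RetNet":         {"LS": True, "FD": False, "DD": False, "GE": False},
    "DeltaNet":       {"LS": True, "FD": False, "DD": True,  "GE": False},
    "Gated DeltaNet": {"LS": True, "FD": True,  "DD": True,  "GE": True},
    "RWKV-6 (Finch)": {"LS": True, "FD": True,  "DD": False, "GE": False},
    "RWKV-7 (Goose)": {"LS": True, "FD": True,  "DD": True,  "GE": True},
}

# Architectures credited with all four properties.
all_four = sorted(name for name, p in properties.items() if all(p.values()))

# Properties each architecture is missing, in table order.
missing = {
    name: [prop for prop, present in p.items() if not present]
    for name, p in properties.items()
}
```

Every design in the table has a large state; the generalized eigenvalue spectrum is the property that separates Goose and Gated DeltaNet from the rest.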

Against attention, the comparison is the familiar one. A Transformer is more expressive than a linear-attention or state-space model on tasks that require comparing every token to every other token. Goose narrows that gap on most of the benchmarks the community currently runs, and on a few tasks like in-context recall it reaches the Transformer ceiling.

Who maintains RWKV, and how is it governed?

The RWKV Project joined the Linux Foundation AI and Data foundation in September 2023, becoming the first generative AI model accepted as a hosted project under the Generative AI Commons. ^[4]^[5] The project entered as an EleutherAI community effort led by Bo Peng that was then donated to the foundation. ^[4] The acceptance covered the architecture, the training code, and the World dataset; it did not cover any single trained checkpoint, which is consistent with how the Linux Foundation treats data assets in other AI projects.

The governance structure is light. Bo Peng remains the lead maintainer of the architecture and writes most of the reference code. A technical steering committee that includes representatives from EleutherAI, Recursal AI, Featherless, and several universities oversees the roadmap and dataset decisions. Day-to-day discussion happens on a public Discord server, and pull requests land in the RWKV-LM repository. ^[2]

Commercial users of RWKV-7 include Recursal AI, which offers hosted inference, and Featherless, which serves the World v3 checkpoints behind an OpenAI-compatible API. Several research groups have built multimodal extensions on top of Goose, including VisualRWKV-7 for image-conditioned generation and AudioRWKV-7 for speech.

Reception

Reaction to the paper was mostly positive within the niche of researchers who follow non-attention sequence models, and more skeptical outside it. The OpenReview submission of an earlier draft attracted reviews that praised the theoretical contribution and the engineering of the delta-rule kernel but pushed back on the multilingual benchmark claims, arguing that the comparison set was thin and that LAMBADA-M was the wrong perplexity benchmark to highlight. ^[8] The published version of the paper added more baselines and explicitly noted the MMLU gap.

Independent practitioners who downloaded the 1.5B and 2.9B checkpoints in the weeks after release reported that the models held up well on long-context tasks and on languages with relatively little training data, but that the instruction-following behaviour of the base checkpoints was raw. The G1 line, which adds supervised fine-tuning and a small reasoning corpus, addressed most of the practical complaints later in 2025.

The most consequential reception, arguably, was in the linear attention research community, where Goose and Gated DeltaNet together made the case that the gap with attention is closable and that the next generation of recurrent designs should default to vector-valued learning rates and non-diagonal transitions. By late 2025 several follow-up papers had taken the generalized delta rule as a starting point rather than as a curiosity, and the RWKV Project had begun an experimental RWKV-8 line codenamed Heron. ^[2]

On the commercial side, the picture is more mixed. Apache 2.0 weights at 3B scale are useful, but the practical alternatives, particularly Qwen 2.5 and Llama 3.2, were trained on far more data and still ship with more polished instruction-tuned variants. RWKV-7 finds its audience mainly among teams that need long context with bounded memory, teams that need to fine-tune on languages or modalities the larger labs do not cover well, and teams that want a permissively licensed base to build on without negotiating community-license terms.

References

1. Peng, B., Zhang, R., Goldstein, D., Alcaide, E., et al. (2025). *RWKV-7 "Goose" with Expressive Dynamic State Evolution*. arXiv:2503.14456. https://arxiv.org/abs/2503.14456
2. RWKV Project. *RWKV-LM repository*. GitHub. https://github.com/RWKV/RWKV-LM
3. RWKV. *RWKV models and dataset listing*. HuggingFace. https://huggingface.co/RWKV
4. RWKV Project. *RWKV joins the Linux Foundation as the first AI model under the Generative AI Commons*. RWKV Blog, 2023. https://blog.rwkv.com/p/rwkv-joins-the-linux-foundation-as
5. Linux Foundation AI and Data. *RWKV project page*. https://lfaidata.foundation/projects/rwkv/
6. RWKV Project. *RWKV Wiki: Architecture history*. https://wiki.rwkv.com/basic/architecture.html
7. Papers With Code. *RWKV-7 Goose with Expressive Dynamic State Evolution*. https://paperswithcode.com/paper/rwkv-7-goose-with-expressive-dynamic-state
8. OpenReview. *RWKV-7 Goose with Expressive Dynamic State Evolution* (submission and reviews). https://openreview.net/forum?id=ayB1PACN5j
9. HuggingFace Papers. *RWKV-7 Goose with Expressive Dynamic State Evolution*. https://huggingface.co/papers/2503.14456
10. InfoQ. *RWKV Project Open-Sources LLM Eagle 7B*. March 2024. https://www.infoq.com/news/2024/03/rwkv-llm-eagle-7b/
