RWKV

Deep Learning Machine Learning Model Architecture Neural Networks

23 min read

Updated Jun 21, 2026

Suggest edit History Talk

RawGraph

Last edited

Jun 21, 2026

Fact-checked

In review queue

Sources

12 citations

Revision

v6 · 4,505 words

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

RWKV (pronounced "RwaKuv") is an open-source neural network architecture that combines the parallelizable training of Transformers with the constant-time, constant-memory inference of recurrent neural networks (RNNs).^[1] The name stands for Receptance Weighted Key Value, after the four core parameters in its mechanism. Proposed by Bo Peng (known online as BlinkDL) and 33 collaborators in the 2023 paper "RWKV: Reinventing RNNs for the Transformer Era," the architecture uses a linear attention mechanism that lets the same trained model be formulated as either a Transformer (for training) or an RNN (for inference).^[1] During training, RWKV processes sequences in parallel like a Transformer; during inference, it runs as an RNN with O(1) memory and time per token, avoiding the quadratic complexity that limits standard attention.^[1] The official project describes it as "an RNN with great LLM performance, which can also be directly trained like a GPT transformer (parallelizable)."^[12]

RWKV-4 was scaled to 14 billion parameters, which the authors state is "by far the largest dense RNN ever trained" at the time of its release, performing on par with similarly sized Transformers across a range of natural language processing tasks.^[1] The project originated within the EleutherAI community and on September 20, 2023 joined the Linux Foundation, becoming the first open-source AI model hosted under the LF AI & Data Foundation.^[11] All RWKV models and code are released under the Apache 2.0 license.^[9] As of mid-2025 the current stable architecture is RWKV-7 "Goose," whose 2.9B model set a new state of the art for 3B-scale models on multilingual benchmarks despite training on far fewer tokens than competitors.^[3]

What problem does RWKV solve?

The Transformer architecture, introduced by Vaswani et al. in 2017, reshaped the field of natural language processing through its self-attention mechanism.^[7] Self-attention allows each token in a sequence to attend to every other token, capturing long-range dependencies far more effectively than prior architectures.^[7] However, this mechanism computes pairwise interactions between all tokens in a sequence, resulting in O(T^2) time and memory complexity for a sequence of length T.^[1] This quadratic scaling becomes a serious bottleneck when processing long sequences, particularly as context windows grow beyond 100,000 tokens. The memory required for the KV cache also grows linearly with sequence length during autoregressive generation, adding further constraints.^[1]

Traditional RNNs, including LSTMs and GRUs, process sequences with O(T) time complexity and O(1) memory per step, since they maintain a fixed-size hidden state.^[1] However, RNNs suffer from vanishing and exploding gradients during backpropagation through time, making them difficult to train on long sequences. Their strictly sequential nature also prevents efficient parallelization on modern GPU hardware, where massive parallelism is essential for practical training speeds.^[1]

Several lines of research have attempted to address these limitations. Linear attention mechanisms (Katharopoulos et al., 2020) remove the softmax from attention to achieve linear complexity but often sacrifice model quality.^[8] State space models like S4 (Gu et al., 2021) reformulate sequence modeling as a continuous-time dynamical system. Attention-Free Transformers (Zhai et al., 2021) replace attention with element-wise operations.^[1]

RWKV was designed to bridge the RNN and Transformer paradigms. By replacing the softmax-based attention mechanism with a linear attention variant that uses exponential decay, RWKV achieves the best of both worlds: efficient parallel training (like Transformers) and efficient sequential inference (like RNNs).^[1] The architecture eliminates the need for a KV cache during inference, enabling constant memory usage regardless of sequence length.^[1]

How does the RWKV architecture work?

The RWKV architecture stacks N residual blocks, each containing two sub-layers: a time-mixing block and a channel-mixing block.^[1] These are analogous to the self-attention and feed-forward network (FFN) layers in a standard Transformer, respectively. Each sub-layer uses layer normalization (pre-LN) and residual connections. An embedding layer maps input tokens to dense vectors before the first block, and a final layer normalization followed by a linear projection produces output logits.^[1]

The name RWKV derives from the four key parameters used in the architecture:

Parameter	Name	Role
R	Receptance	Controls how much past information the model accepts at each position, applied as a sigmoid gate (values between 0 and 1)
W	Weight (Decay)	A trainable positional weight decay vector that determines how quickly past information fades, applied as an exponential decay factor
K	Key	Functions similarly to keys in standard attention, determining the importance of values to be stored in memory
V	Value	Represents the actual information content to be stored, analogous to values in standard attention

Token Shift

Before computing the R, K, and V vectors, RWKV applies a token shift operation that linearly interpolates between the current token's embedding and the previous token's embedding.^[1] This mixing is controlled by learnable per-channel parameters (mu values), allowing different feature dimensions to integrate temporal information at different rates. In mathematical terms, for a given input x_t at time step t:

x't = mu * x_t + (1 - mu) * x{t-1}

This simple operation provides the model with access to local context without requiring explicit attention over multiple positions. Importantly, different projections (R, K, V) each have their own set of mu parameters, so the model can learn to mix temporal information differently for each role.^[1] In later versions (RWKV-6), token shift was upgraded to use data-dependent dynamic linear interpolation with low-rank adaptation, making the mixing weights input-dependent.^[2]

Time-Mixing Block

The time-mixing block is the core computational unit that replaces self-attention. After token shifting, the block computes receptance (r), key (k), and value (v) vectors through linear projections. The WKV operator then computes a weighted combination of values using exponential decay.^[1]

The WKV operator maintains two running accumulators (a numerator and a denominator) that track an exponential moving average of key-value interactions. At each time step, the accumulators are decayed by the factor exp(-w) and updated with the current key-value contribution. The decay rate w is a trainable per-channel parameter, meaning each feature dimension can forget past information at a different rate. A separate "bonus" parameter (u) allows the most recent token to be weighted differently from historical tokens, giving the model fine-grained control over recency bias.^[1]

The output of the WKV operator is then gated by the sigmoid of the receptance vector:

output_t = sigmoid(r_t) * WKV(k_t, v_t)

This gating mechanism controls how much of the computed attention-like output passes through to the residual stream. The receptance gate acts as a learned filter that determines channel-wise responsiveness to the aggregated past information.^[1]

The key insight is that the exponentials in this formulation can be interpreted two ways: during training (time-parallel mode), they resemble softmax normalization, enabling parallelization across the sequence through prefix-sum (scan) operations; during inference (RNN mode), they function as multiplicative decay in a persistent memory, enabling O(1) per-step computation. This mathematical equivalence is what allows the same model to switch seamlessly between modes.^[1]

Channel-Mixing Block

The channel-mixing block performs feature integration across the embedding dimensions (channels), where each element in the feature vector gathers information from other dimensions to update its own value. It follows a structure similar to the FFN in Transformers but uses RWKV-specific components:^[1]

A linear "key" projection followed by a squared ReLU nonlinearity (the element-wise square of ReLU activations), which provides stronger sparsity than standard ReLU
A linear "value" projection that produces the output
A sigmoid-applied "receptance" gate that multiplies the output element-wise, controlling the flow of information

The channel-mixing block also uses token shift for its inputs, with separate learned mixing parameters distinct from those used in the time-mixing block.^[1]

How can one model run as both a Transformer and an RNN?

A defining feature of RWKV is its ability to operate in two mathematically equivalent modes:

Mode	Used During	Complexity (Time)	Complexity (Memory)	Description
Time-Parallel	Training	O(BTd^2)	O(BTd)	Processes entire sequences in parallel across GPUs, similar to Transformer training
RNN	Inference	O(Td)	O(d)	Processes one token at a time with fixed-size state buffers, no KV cache needed

Here B is batch size, T is sequence length, and d is model dimension. The equivalence between these two modes is achieved through algebraic manipulation of the exponential terms in the WKV operator. This duality means the same set of trained weights can be used for both high-throughput batch processing and low-latency single-token generation without any conversion or approximation.^[1]

Version History

RWKV has evolved through seven major versions since its initial release in 2021, with each iteration introducing significant architectural improvements.^[10] The following table provides a summary of the key differences across versions.

Version	Codename	Year	State Type	Decay Mechanism	Token Shift	Key Innovation
RWKV-4	Dove	2023	Vector	Fixed (w parameter)	lerp	First published paper, scaled to 14B
RWKV-5	Eagle	2023	Matrix-valued	Fixed per-head	lerp	Multi-headed matrix states (32x capacity)
RWKV-6	Finch	2023-2024	Matrix-valued	Dynamic per-timestep	ddlerp + LoRA	Data-dependent recurrence, state tuning
RWKV-7	Goose	2024-2025	Matrix-valued	Generalized delta rule	Simplified	Dynamic state evolution, regular language recognition

RWKV-1 (August 2021)

The first version replaced attention with long convolution and introduced the alternating time-mixing and channel-mixing block structure. It drew inspiration from the Attention Free Transformer (AFT) architecture proposed by Zhai et al.^[10]

RWKV-2 (2022)

The second version introduced the first true RNN formulation, implementing exponential moving averages for key-value pairs and a headQK mechanism for in-context learning. This version achieved genuine RNN-style inference with fixed-size hidden states, establishing the dual-mode property that became central to the architecture.^[10]

RWKV-3 (2022)

A transitional version that added trainable TimeMix factors for the R, K, and V parameters and switched from post-layer normalization to pre-layer normalization, improving training stability and convergence behavior.^[10]

RWKV-4 "Dove" (May 2023)

The first version published with a formal research paper, accepted at the Findings of EMNLP 2023.^[1] The paper, "RWKV: Reinventing RNNs for the Transformer Era," had 34 co-authors led by Bo Peng.^[1] RWKV-4 resolved numerical stability issues in the WKV operator and introduced the formal token shift concept using causal convolution. Models were released in sizes from 169M to 14B parameters, all trained on the Pile dataset (331 billion tokens).^[1] The authors summarize the result directly: "We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find RWKV performs on par with similarly sized Transformers."^[1]

Variants included RWKV-4-Raven (instruction-tuned for chat), RWKV-4-World (multilingual, supporting 100+ languages), and RWKV-4-Music (trained on MIDI and ABC music notation for music generation).^[1]

Model	Layers	Hidden Dimension	Parameters
RWKV-4-169M	12	768	169M
RWKV-4-430M	24	1,024	430M
RWKV-4-1.5B	24	2,048	1.5B
RWKV-4-3B	32	2,560	3B
RWKV-4-7B	32	4,096	7B
RWKV-4-14B	40	5,120	14B

RWKV-5 "Eagle" (2023)

Eagle introduced matrix-valued states to replace the vector-valued states of RWKV-4. The state was split into 64x64 matrices per head, increasing the state capacity by roughly 32 times compared to the previous version.^[2] This multi-headed design was described in the paper "Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence" (Peng et al., 2024).^[2] Eagle models ranged from 0.46B to 7.52B parameters and were trained on 1.12 trillion tokens from a new multilingual corpus covering over 100 languages.^[2] A custom fast tokenizer based on greedy matching improved multilingual tokenization efficiency.^[2]

The Eagle 7B model demonstrated competitive performance against Transformer models of similar size. On multilingual benchmarks across 23 languages (xLAMBDA, xStoryCloze, xWinograd, xCopa), Eagle 7B outperformed all other 7B-parameter models, including Mistral 7B, Falcon 7B, and LLaMA 2 7B.^[2] On English benchmarks, it performed comparably, surpassing competitors on tasks like LAMBADA (perplexity of 3.36, close to Mistral 7B's 3.18), StoryCloze, WinoGrande, and SciQ.^[2]

RWKV-6 "Finch" (2023-2024)

Finch built on Eagle by adding dynamic recurrence through low-rank adaptation. The key innovation was making the decay parameter (w) time-varying and input-dependent, rather than static across the sequence.^[2] This means the model can dynamically decide how much past information to retain at each position based on the current input, rather than relying on fixed decay rates. Token shifting was also upgraded from simple linear interpolation (lerp) to data-dependent dynamic linear interpolation (ddlerp) with LoRA-based adaptation.^[2]

Finch models were released in sizes of 0.1B, 1.6B, 3B, 7B, and 14B parameters.^[2] Finch also introduced state tuning, a parameter-efficient technique for fine-tuning only the initial hidden states of the model for alignment and downstream tasks, without modifying the main model weights.^[2] Compared to Eagle 7B, Finch 7B improved by 5.38% across all benchmarks, and Finch 14B improved by an additional 7.14%. Evaluations covered 235 benchmarks in total using the LM evaluation harness.^[2]

RWKV-7 "Goose" (2024-2025)

Goose represents the most substantial architectural change since RWKV-4, introducing expressive dynamic state evolution based on a generalized formulation of the delta rule.^[3] The paper describes the change in one sentence: RWKV-7 "introduces a newly generalized formulation of the delta rule with vector-valued gating and in-context learning rates, as well as a relaxed value replacement rule."^[3] The 2025 paper "RWKV-7 'Goose' with Expressive Dynamic State Evolution" (Peng et al.), with 18 co-authors, describes three key innovations:^[3]

Vector-valued gating: Replaces scalar gating with vector-valued gates, allowing per-channel control over information retention and removal. This provides significantly more expressivity than the scalar gates used in earlier versions.
In-context learning rates: Each channel can selectively update the state at different rates, enabling the model to simulate a form of gradient descent during inference. This allows the model to adapt its internal representations to the current context more effectively.
Relaxed value replacement rule: Decouples the keys used for removing information from the state and the keys used for adding new information, providing greater flexibility in state management and allowing the model to selectively forget old information while storing new content in a different manner.

RWKV-7 also simplifies the token shift and channel mixing modules compared to RWKV-6, removing data dependency from token shift and simplifying channel mixing gating. These simplifications improve both training and inference speed without sacrificing quality.^[3]

A significant theoretical result is that RWKV-7 "can perform state tracking and recognize all regular languages, while retaining parallelizability of training," a capability the authors note "exceeds the capabilities of Transformers under standard complexity conjectures, which are limited to TC0."^[3] This means RWKV-7 is strictly more powerful than Transformers in certain formal language recognition tasks.^[3]

Four Goose models were released (0.19B, 0.4B, 1.5B, and 2.9B parameters), trained on a 3.1 trillion token multilingual corpus.^[3] The 2.9B model achieved a new state-of-the-art for 3B-scale models on multilingual benchmarks, outperforming Qwen 2.5-3B, LLaMA 3.2-3B, and SmolLM2 by significant margins.^[3] On English benchmarks, RWKV-7 2.9B matched Qwen 2.5-3B (71.5% vs 71.4% average accuracy), despite being trained on 5.6 trillion tokens compared to Qwen's 18 trillion tokens.^[3] On a per-training-FLOP basis, the 2.9B model even outperformed 7B Transformer models, demonstrating superior compute efficiency.^[3]

How does RWKV compare to Transformers and Mamba?

RWKV occupies a unique position among modern sequence modeling architectures. It can be described as the ratio of two linear RNNs, derived from a mechanism close to standard attention but using a receptance vector instead of a query vector.^[5] The following table summarizes the key differences between RWKV, standard Transformers, and Mamba (the leading state space model alternative).

Feature	Transformer	RWKV	Mamba
Attention Type	Softmax (quadratic)	Linear (exponential decay)	Selective state spaces
Training Complexity (Time)	O(T^2 d)	O(T d^2)	O(T d)
Inference Complexity (Time per token)	O(T d)	O(d)	O(d)
Inference Memory	O(T d) (KV cache)	O(d) (fixed state)	O(d) (fixed state)
Parallelizable Training	Yes	Yes	Yes (via scan)
KV Cache Required	Yes	No	No
Maximum Context Length	Fixed (architectural)	Unlimited (theoretical)	Unlimited (theoretical)
State Tracking Capability	Limited (TC^0)	All regular languages (v7)	Limited
In-Context Learning	Strong	Improving (v7)	Moderate
Training Maturity	Very high	Growing	Growing
Model Scale (largest trained)	1T+ parameters	14B parameters	8B parameters
Open Source	Varies	Yes (Apache 2.0)	Yes

RWKV vs. Transformers

RWKV's primary advantage over Transformers is inference efficiency. A Transformer model must store and attend to all previous tokens through its KV cache, which grows linearly with sequence length. RWKV maintains a fixed-size state that summarizes all past context, enabling constant memory usage.^[1] For example, the RWKV-4 14B model can generate text on sequences of arbitrary length using only 3 GB of VRAM with INT8 quantization.^[1] A comparably sized Transformer would require substantially more memory for long sequences due to the growing KV cache.

Another area where RWKV differs is time-to-first-token (TTFT). Transformers exhibit quadratic growth in TTFT as the input prompt grows, because the full attention matrix must be computed over the entire prompt. RWKV scales linearly in TTFT, processing each prompt token in constant time per token.^[1]

On language modeling benchmarks, RWKV-4 models perform competitively with similarly sized Transformer models from the Pythia, OPT, and BLOOM families when matched on training compute (FLOPs).^[1] Evaluations covered twelve NLP tasks including ARC (Easy and Challenge), BoolQ, COPA, HeadQA, HellaSwag, LAMBADA, OpenBookQA, PIQA, ReCoRD, SciQ, and WinoGrande.^[1] The RWKV-4 models were trained for one epoch on the Pile (approximately 331 billion tokens), similar to the training budgets of these comparison models.^[1]

However, Transformers currently benefit from a much larger ecosystem of optimization techniques, fine-tuning methods, and hardware-specific optimizations (such as Flash Attention, paged attention, and speculative decoding).^[5] Transformer models have also been scaled to far larger sizes (hundreds of billions to over a trillion parameters), while the largest RWKV model remains 14B.

RWKV vs. Mamba

Mamba (Gu and Dao, 2023) is a selective state space model (SSM) that also achieves linear-time inference.^[4] Both RWKV and Mamba avoid quadratic attention and offer constant-memory inference, but they differ in their underlying mechanisms. RWKV uses exponential decay with linear attention derived from the Attention Free Transformer lineage, while Mamba uses selective state space dynamics with input-dependent gating inspired by continuous-time dynamical systems.^[4]

In practice, Mamba tends to achieve slightly better language modeling perplexity at comparable model sizes and has been described as the first attention-free model to match strong Transformer training recipes.^[4] RWKV offers faster inference and lower memory usage in some configurations. On vision tasks such as semantic segmentation, studies have found that Mamba achieves higher segmentation quality while RWKV provides faster inference times and lower memory usage.^[5]

Both architectures are active areas of research, and hybrid models that combine elements of attention, SSMs, and linear RNNs have become an emerging trend.^[5] Google's Griffin architecture, for example, mixes local attention with linear recurrences.

RWKV vs. Linear Transformers

Linear Transformers (Katharopoulos et al., 2020) also remove the softmax from attention to achieve sub-quadratic complexity.^[8] However, standard Linear Transformers have O(Td^2) complexity, while RWKV achieves O(Td) complexity for inference.^[1] Linear attention also suffers from non-injectivity, meaning it can assign identical attention weights to different query vectors, causing semantic confusion. RWKV sidesteps this issue through its receptance gating mechanism and exponential decay, which provide a different computational path than kernel-based linear attention.^[1]

How are RWKV models trained?

RWKV models are trained using standard language modeling objectives (next-token prediction) with the AdamW optimizer.^[1] The training infrastructure has evolved across versions:

RWKV-4: Trained on the Pile dataset (331B tokens) using compute provided by Stability AI and EleutherAI.^[1] The 14B model was trained on NVIDIA A100 40GB GPUs. Estimated training cost for the 14B model was approximately $100,000, though improvements to the training code have since reduced this estimate to around $40,000. All models were trained for a single epoch on the Pile, taking 8,043 mini-epochs to complete one pass.^[1]
RWKV-5 (Eagle): Trained on a new multilingual corpus of 1.12 trillion tokens spanning 100+ languages, with a custom fast tokenizer based on greedy matching for enhanced multilingual performance.^[2]
RWKV-6 (Finch): Continued training on expanded multilingual data. The 14B model is the largest Finch variant.^[2]
RWKV-7 (Goose): Trained on an extended open-source multilingual corpus. The World v3 dataset contains 3.1 trillion tokens, and the World v3.5 dataset contains 5.16 trillion tokens. The 2.9B model was trained on 5.6 trillion tokens total.^[3]

All RWKV models and training code are released under the Apache 2.0 license, making them available for both personal and commercial use.^[9]

What is RWKV used for?

RWKV's constant-memory inference makes it particularly well-suited for several deployment scenarios that are challenging for Transformer-based models.^[5]

Natural Language Processing

RWKV has been applied to chatbots, code generation, and text summarization.^[5] The ChatRWKV project provides a conversational interface similar to ChatGPT, powered by RWKV models. Instruction-tuned variants (Raven series) support multi-turn dialogue. The RWKV-4-World models support over 100 languages, making them one of the most multilingual open-source language models available.^[1]

Edge and Mobile Deployment

RWKV's fixed memory footprint makes it especially attractive for resource-constrained devices such as smartphones, wearable gadgets, drones, and other mobile robots.^[6] The RWKV-edge project applies compression techniques (low-rank approximation, sparsity predictors, and clustering heads) to achieve 3.8x to 4.95x compression with only a 2.95 percentage point loss in accuracy.^[6] Compressed models run on devices as small as a Raspberry Pi 5, generating text at usable speeds.^[6]

Mobile deployment is supported through multiple backends: llama.cpp for Android CPU inference, NCNN for running lightweight models, MLX for Apple Silicon devices, and CoreML for Apple Neural Engine.^[5] The RWKV-7 2.9B model achieves approximately 33 tokens per second on mobile hardware. Community-developed Android and iOS apps enable local, offline inference without any server connection.

Computer Vision

RWKV has been adapted for vision tasks, including image classification, object detection, semantic segmentation, and image restoration.^[5] Notable variants include PointRWKV (3D point cloud processing), RWKV-CLIP (vision-language representation learning), and Vision-RWKV (general-purpose vision backbone).^[5] RWKV-SAM applies the architecture to segment anything tasks, outperforming several Mamba-based vision models in both classification and segmentation metrics.^[5]

Audio and Speech

RWKV-based models have been developed for voice activity detection (showing strong noise resilience), automatic speech recognition (RWKV-ASR and RWKV-Transducer for streaming ASR), and music generation and genre classification.^[5]

Time Series and Forecasting

Applications include stock price prediction, photovoltaic power generation forecasting (MSRWKV-2DTCN), clinical outcome prediction (CLOPPS framework), and population demographic forecasting.^[5]

RWKV Foundation and Community

The RWKV project was initially developed by Bo Peng (known online as BlinkDL) and grew out of the EleutherAI community.^[9] On September 20, 2023, RWKV officially joined the Linux Foundation, becoming the first open-source AI model to be hosted under the Generative AI Commons.^[11] It is an incubation-stage project of the LF AI & Data Foundation.^[11]

The project operates as an open-source, non-profit organization. Computing resources have been sponsored by Stability AI and EleutherAI, among others.^[9] The community maintains implementations in over 15 programming languages (C, C++, Rust, Go, Julia, Zig, Java, and others) and supports multiple deep learning frameworks (TensorFlow, Keras, JAX) alongside the primary PyTorch implementation.^[9] Specialized optimizations exist for CUDA, ROCm, ONNX, WebAssembly (WASM), and WebGL, covering server, desktop, mobile, and browser deployment scenarios.

RWKV has an active open-source community contributing to model training, evaluation, tooling, and applications. The project maintains official repositories on GitHub (BlinkDL/RWKV-LM) and distributes models through Hugging Face.^[12] Community tools include RWKV Runner (a cross-platform GUI for model management and inference) and various integration plugins for popular frameworks.

What are RWKV's limitations?

Despite its advantages, RWKV has several notable limitations compared to Transformer models:

In-context learning: Earlier RWKV versions (v4, v5) showed weaker in-context learning performance compared to Transformers, particularly on tasks requiring precise retrieval of information provided in the prompt. RWKV-7 significantly narrows this gap with its delta rule formulation and in-context learning rates, but the gap has not been fully closed.^[3]
Recall on long sequences: Because RWKV compresses all past context into a fixed-size state, it can lose fine-grained information from early in long sequences.^[5] Transformers with full attention can in principle attend to any past token with equal fidelity. This trade-off is inherent to fixed-state architectures.
Ecosystem maturity: The Transformer ecosystem has years of accumulated tooling, hardware optimizations (Flash Attention, paged attention, speculative decoding), and best practices.^[5] RWKV's tooling is growing but less mature in comparison.
Scale: The largest RWKV model is 14B parameters, while leading Transformer models exceed hundreds of billions.^[5] It remains to be demonstrated whether RWKV's performance parity holds at very large scales, though the architecture's scaling behavior up to 14B has followed expected scaling laws.
Community and adoption: Transformer-based models dominate commercial deployments. RWKV adoption is growing but remains primarily within the research and open-source communities.^[5]

Future Directions

As of early 2025, RWKV-7 represents the current stable release.^[3] Development of RWKV-8 (codenamed "Heron") is underway, with experimental features including DeepEmbed (a sparse, edge-friendly design that trains a learnable high-dimensional vector for each vocabulary token in every layer's FFN, which can be offloaded to RAM or SSD to eliminate Mixture-of-Experts VRAM overhead) and ROSA (RWKV Online Suffix Automaton), a neurosymbolic token prediction mechanism that aims to provide efficient, genuinely infinite context beyond traditional attention.^[10]

Broader trends in the field point toward hybrid architectures that combine elements of attention, linear RNNs, and state space models.^[5] RWKV's continued evolution reflects this convergence, incorporating increasingly sophisticated state management while preserving the efficiency guarantees that differentiate it from standard Transformers. The growing interest in efficient inference for edge deployment and long-context applications positions RWKV as a significant alternative in the landscape of sequence modeling architectures.

References

Peng, B., Alcaide, E., Anthony, Q., et al. (2023). "RWKV: Reinventing RNNs for the Transformer Era." *Findings of the Association for Computational Linguistics: EMNLP 2023*. arXiv:2305.13048 ↩
Peng, B., Goldstein, D., Anthony, Q., et al. (2024). "Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence." arXiv:2404.05892 ↩
Peng, B., Zhang, R., Goldstein, D., et al. (2025). "RWKV-7 'Goose' with Expressive Dynamic State Evolution." arXiv:2503.14456 ↩
Gu, A. and Dao, T. (2023). "Mamba: Linear-Time Sequence Modeling with Selective State Spaces." arXiv:2312.00752 ↩
Duan, H., et al. (2024). "A Survey of RWKV." *Neurocomputing*. arXiv:2412.14847 ↩
Xu, Y., et al. (2024). "RWKV-edge: Deeply Compressed RWKV for Resource-Constrained Devices." arXiv:2412.10856 ↩
Vaswani, A., et al. (2017). "Attention Is All You Need." *Advances in Neural Information Processing Systems 30*. ↩
Katharopoulos, A., et al. (2020). "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention." *ICML 2020*. ↩
RWKV Foundation. "RWKV Language Model." https://www.rwkv.com/ ↩
RWKV Wiki. "Architecture History." https://wiki.rwkv.com/basic/architecture.html ↩
LF AI & Data Foundation. "RWKV Project." https://lfaidata.foundation/projects/rwkv/ ↩
BlinkDL. "RWKV-LM GitHub Repository." https://github.com/BlinkDL/RWKV-LM ↩

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

5 revisions by 1 contributors · full history

Suggest edit

RWKV

What problem does RWKV solve?

How does the RWKV architecture work?

Token Shift

Time-Mixing Block

Channel-Mixing Block

How can one model run as both a Transformer and an RNN?

Version History

RWKV-1 (August 2021)

RWKV-2 (2022)

RWKV-3 (2022)

RWKV-4 "Dove" (May 2023)

RWKV-5 "Eagle" (2023)

RWKV-6 "Finch" (2023-2024)

RWKV-7 "Goose" (2024-2025)

How does RWKV compare to Transformers and Mamba?

RWKV vs. Transformers

RWKV vs. Mamba

RWKV vs. Linear Transformers

How are RWKV models trained?

What is RWKV used for?

Natural Language Processing

Edge and Mobile Deployment

Computer Vision

Audio and Speech

Time Series and Forecasting

RWKV Foundation and Community

What are RWKV's limitations?

Future Directions

References

Improve this article

What links here (24 of 27)

What links here (24 of 27)

What problem does RWKV solve?

How does the RWKV architecture work?

Token Shift

Time-Mixing Block

Channel-Mixing Block

How can one model run as both a Transformer and an RNN?

Version History

RWKV-1 (August 2021)

RWKV-2 (2022)

RWKV-3 (2022)

RWKV-4 "Dove" (May 2023)

RWKV-5 "Eagle" (2023)

RWKV-6 "Finch" (2023-2024)

RWKV-7 "Goose" (2024-2025)

How does RWKV compare to Transformers and Mamba?

RWKV vs. Transformers

RWKV vs. Mamba

RWKV vs. Linear Transformers

How are RWKV models trained?

What is RWKV used for?

Natural Language Processing

Edge and Mobile Deployment

Computer Vision

Audio and Speech

Time Series and Forecasting

RWKV Foundation and Community

What are RWKV's limitations?

Future Directions

References

Improve this article

Related Articles

Long Short-Term Memory (LSTM)

Multi-head Latent Attention

Multi-Head Self-Attention

Recurrent Neural Network

Self-attention

LSTM

What links here (24 of 27)

Related Articles

Long Short-Term Memory (LSTM)

Multi-head Latent Attention

Multi-Head Self-Attention

Recurrent Neural Network

Self-attention

LSTM

What links here (24 of 27)