RWKV (pronounced "RwaKuv") stands for Receptance Weighted Key Value, a novel neural network architecture that combines the parallelizable training of Transformers with the constant-time inference of recurrent neural networks (RNNs). Proposed by Bo Peng and collaborators in 2023, RWKV uses a linear attention mechanism that allows the model to be formulated as either a Transformer or an RNN, depending on the phase of operation. During training, RWKV processes sequences in parallel like a Transformer. During inference, it operates as an RNN with O(1) memory and time per token, avoiding the quadratic complexity that limits standard attention mechanisms.
RWKV has been scaled up to 14 billion parameters, making it the largest dense RNN ever trained at the time of its release. Benchmark evaluations show that RWKV performs on par with similarly sized Transformer models across a range of natural language processing tasks. The project originated within the EleutherAI community and joined the Linux Foundation in September 2023, becoming the first open-source AI model hosted under the LF AI & Data Foundation.
The Transformer architecture, introduced by Vaswani et al. in 2017, reshaped the field of natural language processing through its self-attention mechanism. Self-attention allows each token in a sequence to attend to every other token, capturing long-range dependencies far more effectively than prior architectures. However, this mechanism computes pairwise interactions between all tokens in a sequence, resulting in O(T^2) time and memory complexity for a sequence of length T. This quadratic scaling becomes a serious bottleneck when processing long sequences, particularly as context windows grow beyond 100,000 tokens. The memory required for the KV cache also grows linearly with sequence length during autoregressive generation, adding further constraints.
Traditional RNNs, including LSTMs and GRUs, process sequences with O(T) time complexity and O(1) memory per step, since they maintain a fixed-size hidden state. However, RNNs suffer from vanishing and exploding gradients during backpropagation through time, making them difficult to train on long sequences. Their strictly sequential nature also prevents efficient parallelization on modern GPU hardware, where massive parallelism is essential for practical training speeds.
Several lines of research have attempted to address these limitations. Linear attention mechanisms (Katharopoulos et al., 2020) remove the softmax from attention to achieve linear complexity but often sacrifice model quality. State space models like S4 (Gu et al., 2021) reformulate sequence modeling as a continuous-time dynamical system. Attention-Free Transformers (Zhai et al., 2021) replace attention with element-wise operations.
RWKV was designed to bridge the RNN and Transformer paradigms. By replacing the softmax-based attention mechanism with a linear attention variant that uses exponential decay, RWKV achieves the best of both worlds: efficient parallel training (like Transformers) and efficient sequential inference (like RNNs). The architecture eliminates the need for a KV cache during inference, enabling constant memory usage regardless of sequence length.
The RWKV architecture stacks N residual blocks, each containing two sub-layers: a time-mixing block and a channel-mixing block. These are analogous to the self-attention and feed-forward network (FFN) layers in a standard Transformer, respectively. Each sub-layer uses layer normalization (pre-LN) and residual connections. An embedding layer maps input tokens to dense vectors before the first block, and a final layer normalization followed by a linear projection produces output logits.
The name RWKV derives from the four key parameters used in the architecture:
| Parameter | Name | Role |
|---|---|---|
| R | Receptance | Controls how much past information the model accepts at each position, applied as a sigmoid gate (values between 0 and 1) |
| W | Weight (Decay) | A trainable positional weight decay vector that determines how quickly past information fades, applied as an exponential decay factor |
| K | Key | Functions similarly to keys in standard attention, determining the importance of values to be stored in memory |
| V | Value | Represents the actual information content to be stored, analogous to values in standard attention |
Before computing the R, K, and V vectors, RWKV applies a token shift operation that linearly interpolates between the current token's embedding and the previous token's embedding. This mixing is controlled by learnable per-channel parameters (mu values), allowing different feature dimensions to integrate temporal information at different rates. In mathematical terms, for a given input x_t at time step t:
x'_t = mu * x_t + (1 - mu) * x_{t-1}
This simple operation provides the model with access to local context without requiring explicit attention over multiple positions. Importantly, different projections (R, K, V) each have their own set of mu parameters, so the model can learn to mix temporal information differently for each role. In later versions (RWKV-6), token shift was upgraded to use data-dependent dynamic linear interpolation with low-rank adaptation, making the mixing weights input-dependent.
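The token shift operation can be sketched in a few lines. This is a minimal NumPy illustration; the names `token_shift` and `mu` are ours, and in the reference implementation each projection (R, K, V) applies its own per-channel mixing parameters:

```python
import numpy as np

def token_shift(x, mu):
    """Token shift: linear interpolation between each token and its predecessor.

    x  : (T, d) sequence of token embeddings
    mu : (d,) per-channel mixing weights in [0, 1]
    """
    # Shift the sequence right by one position; position 0 sees zeros.
    x_prev = np.vstack([np.zeros((1, x.shape[1])), x[:-1]])
    return mu * x + (1.0 - mu) * x_prev

T, d = 4, 3
x = np.arange(T * d, dtype=np.float64).reshape(T, d)
mu = np.full(d, 0.5)          # mix current and previous token equally
shifted = token_shift(x, mu)
```

Because `mu` is per-channel, some feature dimensions can stay close to the current token (`mu` near 1) while others lean on the previous one, which is how different dimensions integrate temporal information at different rates.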
The time-mixing block is the core computational unit that replaces self-attention. After token shifting, the block computes receptance (r), key (k), and value (v) vectors through linear projections. The WKV operator then computes a weighted combination of values using exponential decay.
The WKV operator maintains two running accumulators (a numerator and a denominator) that track an exponential moving average of key-value interactions. At each time step, the accumulators are decayed by the factor exp(-w) and updated with the current key-value contribution. The decay rate w is a trainable per-channel parameter, meaning each feature dimension can forget past information at a different rate. A separate "bonus" parameter (u) allows the most recent token to be weighted differently from historical tokens, giving the model fine-grained control over recency bias.
The output of the WKV operator is then gated by the sigmoid of the receptance vector:
output_t = sigmoid(r_t) * WKV(k_t, v_t)
This gating mechanism controls how much of the computed attention-like output passes through to the residual stream. The receptance gate acts as a learned filter that determines channel-wise responsiveness to the aggregated past information.
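In RNN mode, one step of the WKV recurrence plus the receptance gate can be sketched as follows. This is a numerically naive NumPy illustration of the RWKV-4 formulation; production kernels additionally track a running maximum exponent so the exponentials cannot overflow:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def wkv_step(a, b, r_t, k_t, v_t, w, u):
    """One RNN-mode step of the WKV operator (RWKV-4 style sketch).

    a, b : running numerator / denominator accumulators, shape (d,)
    w    : trainable per-channel decay rate, applied as exp(-w)
    u    : per-channel 'bonus' weighting the current token separately
    """
    e_now = np.exp(u + k_t)                      # current token, with bonus
    wkv = (a + e_now * v_t) / (b + e_now)        # decayed weighted average of values
    out = sigmoid(r_t) * wkv                     # receptance gate
    a_new = np.exp(-w) * a + np.exp(k_t) * v_t   # decay history, add current term
    b_new = np.exp(-w) * b + np.exp(k_t)
    return out, a_new, b_new

d = 3
a, b = np.zeros(d), np.zeros(d)
out, a, b = wkv_step(a, b, np.zeros(d), np.zeros(d), np.ones(d),
                     w=np.full(d, 0.5), u=np.zeros(d))
```

On the very first step the accumulators are empty, so `wkv` reduces to the current value vector and the output is simply `sigmoid(r) * v`.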
The key insight is that the exponentials in this formulation can be interpreted two ways: during training (time-parallel mode), they resemble softmax normalization, enabling parallelization across the sequence through prefix-sum (scan) operations; during inference (RNN mode), they function as multiplicative decay in a persistent memory, enabling O(1) per-step computation. This mathematical equivalence is what allows the same model to switch seamlessly between modes.
The channel-mixing block performs feature integration across the embedding dimensions (channels), where each element in the feature vector gathers information from other dimensions to update its own value. It follows a structure similar to the FFN in Transformers but uses RWKV-specific components.
The channel-mixing block also uses token shift for its inputs, with separate learned mixing parameters distinct from those used in the time-mixing block.
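A minimal sketch of the channel-mixing computation, assuming the RWKV-4 design of a squared-ReLU key activation gated by sigmoid receptance (the weight names here are illustrative, not the reference implementation's):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_mix(x_t, x_prev, mu_k, mu_r, Wk, Wr, Wv):
    """Channel-mixing sub-layer for a single position (illustrative sketch).

    Both paths apply token shift with their own mixing parameters; the
    FFN-like path uses a squared-ReLU activation, and the result is
    gated channel-wise by sigmoid(receptance).
    """
    xk = mu_k * x_t + (1.0 - mu_k) * x_prev  # token shift, key path
    xr = mu_r * x_t + (1.0 - mu_r) * x_prev  # token shift, receptance path
    k = np.maximum(xk @ Wk, 0.0) ** 2        # squared ReLU
    return sigmoid(xr @ Wr) * (k @ Wv)       # gated projection back to d dims
```

Note the two token shifts use separate `mu` vectors, matching the point above that the channel-mixing block's mixing parameters are distinct from the time-mixing block's.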
A defining feature of RWKV is its ability to operate in two mathematically equivalent modes:
| Mode | Used During | Complexity (Time) | Complexity (Memory) | Description |
|---|---|---|---|---|
| Time-Parallel | Training | O(BTd^2) | O(BTd) | Processes entire sequences in parallel across GPUs, similar to Transformer training |
| RNN | Inference | O(Td) | O(d) | Processes one token at a time with fixed-size state buffers, no KV cache needed |
Here B is batch size, T is sequence length, and d is model dimension. The equivalence between these two modes is achieved through algebraic manipulation of the exponential terms in the WKV operator. This duality means the same set of trained weights can be used for both high-throughput batch processing and low-latency single-token generation without any conversion or approximation.
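The mode equivalence can be demonstrated on a simplified, numerator-only version of the recurrence (omitting the denominator and the bonus term u for clarity): the same decayed sums are computed once token-by-token with an O(d) state, and once in parallel via an explicit causal decay matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 8, 4
k = rng.normal(size=(T, d))
v = rng.normal(size=(T, d))
w = 0.3  # scalar decay for simplicity; RWKV learns this per channel

# RNN mode: one token at a time, fixed-size state.
a = np.zeros(d)
rnn_out = []
for t in range(T):
    a = np.exp(-w) * a + np.exp(k[t]) * v[t]
    rnn_out.append(a.copy())
rnn_out = np.stack(rnn_out)

# Time-parallel mode: all positions at once through a decay matrix.
decay = np.exp(-w * (np.arange(T)[:, None] - np.arange(T)[None, :]))
decay = np.tril(decay)                        # causal mask: no future tokens
par_out = np.einsum('ts,sd->td', decay, np.exp(k) * v)

assert np.allclose(rnn_out, par_out)
```

Both paths evaluate the same sum, sum over s <= t of exp(-w (t - s)) exp(k_s) v_s, which is why trained weights transfer between modes without conversion or approximation.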
RWKV has evolved through seven major versions since its initial release in 2021, with each iteration introducing significant architectural improvements. The following table provides a summary of the key differences across versions.
| Version | Codename | Year | State Type | Decay Mechanism | Token Shift | Key Innovation |
|---|---|---|---|---|---|---|
| RWKV-4 | Dove | 2023 | Vector | Fixed (w parameter) | lerp | First published paper, scaled to 14B |
| RWKV-5 | Eagle | 2023 | Matrix-valued | Fixed per-head | lerp | Multi-headed matrix states (32x capacity) |
| RWKV-6 | Finch | 2023-2024 | Matrix-valued | Dynamic per-timestep | ddlerp + LoRA | Data-dependent recurrence, state tuning |
| RWKV-7 | Goose | 2024-2025 | Matrix-valued | Generalized delta rule | Simplified | Dynamic state evolution, regular language recognition |
The first version replaced attention with long convolution and introduced the alternating time-mixing and channel-mixing block structure. It drew inspiration from the Attention Free Transformer (AFT) architecture proposed by Zhai et al.
The second version introduced the first true RNN formulation, implementing exponential moving averages for key-value pairs and a headQK mechanism for in-context learning. This version achieved genuine RNN-style inference with fixed-size hidden states, establishing the dual-mode property that became central to the architecture.
The third version was a transitional release that added trainable TimeMix factors for the R, K, and V parameters and switched from post-layer normalization to pre-layer normalization, improving training stability and convergence behavior.
The first version published with a formal research paper, accepted at the Findings of EMNLP 2023. The paper, "RWKV: Reinventing RNNs for the Transformer Era," had 34 co-authors led by Bo Peng. RWKV-4 resolved numerical stability issues in the WKV operator and introduced the formal token shift concept using causal convolution. Models were released in sizes from 169M to 14B parameters, all trained on the Pile dataset (331 billion tokens). At the time, the 14B model was the largest dense RNN ever trained.
Variants included RWKV-4-Raven (instruction-tuned for chat), RWKV-4-World (multilingual, supporting 100+ languages), and RWKV-4-Music (trained on MIDI and ABC music notation for music generation).
| Model | Layers | Hidden Dimension | Parameters |
|---|---|---|---|
| RWKV-4-169M | 12 | 768 | 169M |
| RWKV-4-430M | 24 | 1,024 | 430M |
| RWKV-4-1.5B | 24 | 2,048 | 1.5B |
| RWKV-4-3B | 32 | 2,560 | 3B |
| RWKV-4-7B | 32 | 4,096 | 7B |
| RWKV-4-14B | 40 | 5,120 | 14B |
Eagle introduced matrix-valued states to replace the vector-valued states of RWKV-4. The state was split into 64x64 matrices per head, increasing the state capacity by roughly 32 times compared to the previous version. This multi-headed design was described in the paper "Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence" (Peng et al., 2024). Eagle models ranged from 0.46B to 7.52B parameters and were trained on 1.12 trillion tokens from a new multilingual corpus covering over 100 languages. A custom fast tokenizer based on greedy matching improved multilingual tokenization efficiency.
The Eagle 7B model demonstrated competitive performance against Transformer models of similar size. On multilingual benchmarks across 23 languages (xLAMBDA, xStoryCloze, xWinograd, xCopa), Eagle 7B outperformed all other 7B-parameter models, including Mistral 7B, Falcon 7B, and LLaMA 2 7B. On English benchmarks, it performed comparably, surpassing competitors on tasks like LAMBADA (perplexity of 3.36, close to Mistral 7B's 3.18), StoryCloze, WinoGrande, and SciQ.
Finch built on Eagle by adding dynamic recurrence through low-rank adaptation. The key innovation was making the decay parameter (w) time-varying and input-dependent, rather than static across the sequence. This means the model can dynamically decide how much past information to retain at each position based on the current input, rather than relying on fixed decay rates. Token shifting was also upgraded from simple linear interpolation (lerp) to data-dependent dynamic linear interpolation (ddlerp) with LoRA-based adaptation.
Finch models were released in sizes of 0.1B, 1.6B, 3B, 7B, and 14B parameters. Finch also introduced state tuning, a parameter-efficient technique for fine-tuning only the initial hidden states of the model for alignment and downstream tasks, without modifying the main model weights. Compared to Eagle 7B, Finch 7B improved by 5.38% across all benchmarks, and Finch 14B improved by an additional 7.14%. Evaluations covered 235 benchmarks in total using the LM evaluation harness.
Goose represents the most substantial architectural change since RWKV-4, introducing expressive dynamic state evolution based on a generalized formulation of the delta rule with vector-valued gating and in-context learning rates, described in the paper "RWKV-7 'Goose' with Expressive Dynamic State Evolution" (Peng et al., 2025), which had 18 co-authors.
RWKV-7 also simplifies the token shift and channel mixing modules compared to RWKV-6, removing data dependency from token shift and simplifying channel mixing gating. These simplifications improve both training and inference speed without sacrificing quality.
A significant theoretical result is that RWKV-7 can perform state tracking and recognize all regular languages, which exceeds the capabilities of Transformers under standard complexity-theoretic conjectures (Transformers are limited to TC^0). This means RWKV-7 is strictly more powerful than Transformers in certain formal language recognition tasks.
Four Goose models were released (0.19B, 0.4B, 1.5B, and 2.9B parameters), trained on a 3.1 trillion token multilingual corpus. The 2.9B model achieved a new state-of-the-art for 3B-scale models on multilingual benchmarks, outperforming Qwen 2.5-3B, LLaMA 3.2-3B, and SmolLM2 by significant margins. On English benchmarks, RWKV-7 2.9B matched Qwen 2.5-3B (71.5% vs 71.4% average accuracy), despite being trained on far fewer tokens (3.1 trillion, versus roughly 18 trillion for Qwen). On a per-training-FLOP basis, the 2.9B model even outperformed 7B Transformer models, demonstrating superior compute efficiency.
RWKV occupies a unique position among modern sequence modeling architectures. It can be described as the ratio of two linear RNNs, derived from a mechanism close to standard attention but using a receptance vector instead of a query vector. The following table summarizes the key differences between RWKV, standard Transformers, and Mamba (the leading state space model alternative).
| Feature | Transformer | RWKV | Mamba |
|---|---|---|---|
| Attention Type | Softmax (quadratic) | Linear (exponential decay) | Selective state spaces |
| Training Complexity (Time) | O(T^2 d) | O(T d^2) | O(T d) |
| Inference Complexity (Time per token) | O(T d) | O(d) | O(d) |
| Inference Memory | O(T d) (KV cache) | O(d) (fixed state) | O(d) (fixed state) |
| Parallelizable Training | Yes | Yes | Yes (via scan) |
| KV Cache Required | Yes | No | No |
| Maximum Context Length | Fixed (architectural) | Unlimited (theoretical) | Unlimited (theoretical) |
| State Tracking Capability | Limited (TC^0) | All regular languages (v7) | Limited |
| In-Context Learning | Strong | Improving (v7) | Moderate |
| Training Maturity | Very high | Growing | Growing |
| Model Scale (largest trained) | 1T+ parameters | 14B parameters | 8B parameters |
| Open Source | Varies | Yes (Apache 2.0) | Yes |
RWKV's primary advantage over Transformers is inference efficiency. A Transformer model must store and attend to all previous tokens through its KV cache, which grows linearly with sequence length. RWKV maintains a fixed-size state that summarizes all past context, enabling constant memory usage. For example, the RWKV-4 14B model can generate text on sequences of arbitrary length using only 3 GB of VRAM with INT8 quantization. A comparably sized Transformer would require substantially more memory for long sequences due to the growing KV cache.
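A back-of-envelope calculation makes the gap concrete. The numbers below assume a hypothetical Transformer with the RWKV-4 14B shape (40 layers, d_model = 5120) storing its KV cache in fp16 (2 bytes per element); the RWKV state-size estimate of a few d-sized vectors per layer is likewise illustrative:

```python
# Back-of-envelope memory comparison (illustrative assumptions, see lead-in).
layers, d, bytes_per = 40, 5120, 2
T = 100_000  # context length in tokens

# Transformer KV cache: keys + values, per layer, per token -> grows with T.
kv_cache_gb = 2 * layers * T * d * bytes_per / 1e9

# RWKV RNN state: a handful of d-sized vectors per layer, independent of T.
state_gb = 5 * layers * d * bytes_per / 1e9

print(f"KV cache at T={T}: {kv_cache_gb:.1f} GB; RWKV state: {state_gb:.4f} GB")
```

At a 100,000-token context, the hypothetical KV cache runs to tens of gigabytes while the recurrent state stays in the low megabytes, regardless of sequence length.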
Another area where RWKV differs is time-to-first-token (TTFT). Transformers exhibit quadratic growth in TTFT as the input prompt grows, because the full attention matrix must be computed over the entire prompt. RWKV scales linearly in TTFT, processing each prompt token in constant time per token.
On language modeling benchmarks, RWKV-4 models perform competitively with similarly sized Transformer models from the Pythia, OPT, and BLOOM families when matched on training compute (FLOPs). Evaluations covered twelve NLP tasks including ARC (Easy and Challenge), BoolQ, COPA, HeadQA, HellaSwag, LAMBADA, OpenBookQA, PIQA, ReCoRD, SciQ, and WinoGrande. The RWKV-4 models were trained for one epoch on the Pile (approximately 331 billion tokens), similar to the training budgets of these comparison models.
However, Transformers currently benefit from a much larger ecosystem of optimization techniques, fine-tuning methods, and hardware-specific optimizations (such as Flash Attention, paged attention, and speculative decoding). Transformer models have also been scaled to far larger sizes (hundreds of billions to over a trillion parameters), while the largest RWKV model remains 14B.
Mamba (Gu and Dao, 2023) is a selective state space model (SSM) that also achieves linear-time inference. Both RWKV and Mamba avoid quadratic attention and offer constant-memory inference, but they differ in their underlying mechanisms. RWKV uses exponential decay with linear attention derived from the Attention Free Transformer lineage, while Mamba uses selective state space dynamics with input-dependent gating inspired by continuous-time dynamical systems.
In practice, Mamba tends to achieve slightly better language modeling perplexity at comparable model sizes and has been described as the first attention-free model to match strong Transformer training recipes. RWKV offers faster inference and lower memory usage in some configurations. On vision tasks such as semantic segmentation, studies have found that Mamba achieves higher segmentation quality while RWKV provides faster inference times and lower memory usage.
Both architectures are active areas of research, and hybrid models that combine elements of attention, SSMs, and linear RNNs have become an emerging trend. Google's Griffin architecture, for example, mixes local attention with linear recurrences.
Linear Transformers (Katharopoulos et al., 2020) also remove the softmax from attention to achieve sub-quadratic complexity. However, standard Linear Transformers have O(Td^2) complexity, while RWKV achieves O(Td) complexity for inference. Linear attention also suffers from non-injectivity, meaning it can assign identical attention weights to different query vectors, causing semantic confusion. RWKV sidesteps this issue through its receptance gating mechanism and exponential decay, which provide a different computational path than kernel-based linear attention.
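The non-injectivity issue can be shown with a toy example. Here we use an elementwise exponential feature map (an illustrative choice for this demo; Katharopoulos et al. use elu + 1): under exp, any uniform shift of the query rescales all scores by the same factor, so the normalized attention weights are identical for two different queries:

```python
import numpy as np

def linear_attn_weights(q, K, phi=np.exp):
    """Normalized kernel linear-attention weights for one query.

    phi is the feature map; scores are phi(q) . phi(k_i), normalized
    over the keys. Elementwise exp is used here purely for illustration.
    """
    scores = phi(K) @ phi(q)
    return scores / scores.sum()

rng = np.random.default_rng(1)
K = rng.normal(size=(5, 4))
q1 = rng.normal(size=4)
q2 = q1 + 3.0                      # a genuinely different query vector...

w1 = linear_attn_weights(q1, K)
w2 = linear_attn_weights(q2, K)
assert np.allclose(w1, w2)         # ...yet identical attention weights
```

Two distinct queries thus produce exactly the same distribution over keys, which is the semantic confusion the text describes.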
RWKV models are trained using standard language modeling objectives (next-token prediction) with the AdamW optimizer. The training infrastructure has evolved across versions, moving from single-epoch runs on the Pile (RWKV-4) to progressively larger multilingual corpora for RWKV-5 through RWKV-7.
All RWKV models and training code are released under the Apache 2.0 license, making them available for both personal and commercial use.
RWKV's constant-memory inference makes it particularly well-suited for several deployment scenarios that are challenging for Transformer-based models.
RWKV has been applied to chatbots, code generation, and text summarization. The ChatRWKV project provides a conversational interface similar to ChatGPT, powered by RWKV models. Instruction-tuned variants (Raven series) support multi-turn dialogue. The RWKV-4-World models support over 100 languages, making them one of the most multilingual open-source language models available.
RWKV's fixed memory footprint makes it especially attractive for resource-constrained devices such as smartphones, wearable gadgets, drones, and other mobile robots. The RWKV-edge project applies compression techniques (low-rank approximation, sparsity predictors, and clustering heads) to achieve 3.8x to 4.95x compression with only a 2.95 percentage point loss in accuracy. Compressed models run on devices as small as a Raspberry Pi 5, generating text at usable speeds.
Mobile deployment is supported through multiple backends: llama.cpp for Android CPU inference, NCNN for running lightweight models, MLX for Apple Silicon devices, and CoreML for Apple Neural Engine. The RWKV-7 2.9B model achieves approximately 33 tokens per second on mobile hardware. Community-developed Android and iOS apps enable local, offline inference without any server connection.
RWKV has been adapted for vision tasks, including image classification, object detection, semantic segmentation, and image restoration. Notable variants include PointRWKV (3D point cloud processing), RWKV-CLIP (vision-language representation learning), and Vision-RWKV (general-purpose vision backbone). RWKV-SAM applies the architecture to segment anything tasks, outperforming several Mamba-based vision models in both classification and segmentation metrics.
RWKV-based models have been developed for voice activity detection (showing strong noise resilience), automatic speech recognition (RWKV-ASR and RWKV-Transducer for streaming ASR), and music generation and genre classification.
Applications include stock price prediction, photovoltaic power generation forecasting (MSRWKV-2DTCN), clinical outcome prediction (CLOPPS framework), and population demographic forecasting.
The RWKV project was initially developed by Bo Peng (known online as BlinkDL) and grew out of the EleutherAI community. On September 20, 2023, RWKV officially joined the Linux Foundation, becoming the first open-source AI model to be hosted under the Generative AI Commons. It is an incubation-stage project of the LF AI & Data Foundation.
The project operates as an open-source, non-profit organization. Computing resources have been sponsored by Stability AI and EleutherAI, among others. The community maintains implementations in over 15 programming languages (C, C++, Rust, Go, Julia, Zig, Java, and others) and supports multiple deep learning frameworks (TensorFlow, Keras, JAX) alongside the primary PyTorch implementation. Specialized optimizations exist for CUDA, ROCm, ONNX, WebAssembly (WASM), and WebGL, covering server, desktop, mobile, and browser deployment scenarios.
RWKV has an active open-source community contributing to model training, evaluation, tooling, and applications. The project maintains official repositories on GitHub (BlinkDL/RWKV-LM) and distributes models through Hugging Face. Community tools include RWKV Runner (a cross-platform GUI for model management and inference) and various integration plugins for popular frameworks.
Despite its advantages, RWKV has notable limitations compared to Transformer models. Because the fixed-size state must compress all past context, exact recall of details from distant tokens is weaker than with full attention, and outputs can be more sensitive to prompt formatting. The surrounding ecosystem of tooling and optimizations also remains less mature, and the largest trained RWKV model (14B parameters) is far smaller than frontier Transformers.
As of early 2025, RWKV-7 represents the current stable release. Development of RWKV-8 (codenamed "Heron") is underway, with experimental features including DeepEmbed (a sparse, edge-friendly design that eliminates Mixture of Experts VRAM overhead) and ROSA (Rapid Online Suffix Automaton), a neurosymbolic token prediction mechanism that replaces traditional attention with a pattern-matching approach.
Broader trends in the field point toward hybrid architectures that combine elements of attention, linear RNNs, and state space models. RWKV's continued evolution reflects this convergence, incorporating increasingly sophisticated state management while preserving the efficiency guarantees that differentiate it from standard Transformers. The growing interest in efficient inference for edge deployment and long-context applications positions RWKV as a significant alternative in the landscape of sequence modeling architectures.