RecurrentGemma
Last reviewed: Jun 3, 2026
Sources: 12 citations
Review status: Source-backed
Revision: v1 · 1,281 words
RecurrentGemma is a family of open-weight language models released by Google DeepMind that is built on the Griffin architecture rather than the standard Transformer. Instead of relying on global self-attention, Griffin mixes gated linear recurrences with local sliding-window attention, which gives the model a fixed-size recurrent state and therefore lower memory use and faster generation on long sequences. RecurrentGemma is positioned as a sibling of the Transformer-based Gemma models: it draws on the same training data and tooling and reaches broadly comparable benchmark quality, while trading the Transformer's growing key-value cache for a constant-size state.[1][2]
RecurrentGemma grew out of two threads of research at Google DeepMind. The first is the Gemma program, a line of lightweight open models built from the same research and technology as Gemini; the second is a 2024 paper on efficient recurrent architectures. In late February 2024, DeepMind researchers published "Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models," which introduced two architectures: Hawk, a purely recurrent model, and Griffin, a hybrid that interleaves recurrence with local attention. The paper reported that Griffin matched the quality of Llama-2 while training on roughly six times fewer tokens, and that the family scaled to 14 billion parameters.[3]
RecurrentGemma applies that architecture to the Gemma setting. The model was introduced in the paper "RecurrentGemma: Moving Past Transformers for Efficient Open Language Models," credited to the Griffin, RLHF, and Gemma teams at Google and first posted to arXiv on 11 April 2024.[2] The 2B model was made available on Kaggle the same week, and a larger 9B variant followed on 11 June 2024.[1][4] The work sits alongside the other Gemma derivatives such as CodeGemma for code and PaliGemma for vision-language tasks, but it is the only member of the family that departs from the Transformer backbone.
Griffin is a hybrid sequence model. Its layers alternate between two kinds of temporal mixing blocks. The recurrent blocks are built around the Real-Gated Linear Recurrent Unit, or RG-LRU, a gated linear recurrence whose design is informed by linear state-space models. The attention blocks use local, sliding-window attention, so each position attends only to a fixed span of recent tokens rather than the entire sequence.[2][5]
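The sliding-window restriction can be illustrated with a small causal mask. This is a minimal sketch, not the reference implementation: the function name `local_causal_mask` is hypothetical, and it assumes the window counts the current token as one of its positions.

```python
def local_causal_mask(seq_len: int, window: int) -> list:
    """Return mask[i][j] == True when query position i may attend to key position j."""
    mask = []
    for i in range(seq_len):
        # Causal: j <= i. Local: j lies within the `window` most recent positions
        # (assumption: the current token counts toward the window).
        row = [(j <= i) and (i - j < window) for j in range(seq_len)]
        mask.append(row)
    return mask


# Toy sizes so the pattern is visible; the article gives 2,048 as the real window.
mask = local_causal_mask(seq_len=6, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Once the sequence is longer than the window, every row has the same number of permitted positions, which is why the per-token attention cost stops growing.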
The RG-LRU is the core novelty carried over from the Griffin paper. It is a linear recurrent layer with two input-dependent gates that, unlike a classic LSTM or GRU gate, depend only on the current input and not on the previous recurrent state. Within a recurrent block the input is split into two branches: one branch applies a small separable 1D convolution (with a temporal filter width of 4) followed by the RG-LRU, while the other applies a GeLU nonlinearity; the branches are then combined by element-wise multiplication and projected back out.[2][5] The local attention blocks cap the attention window at 2,048 tokens, which keeps the per-step cost bounded as sequences grow.[2][6]
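The gating structure described above can be sketched in plain Python. The article does not give the exact update rule, so the form below is an assumption modelled on the general shape of the recurrence in the Griffin paper: a per-channel decay raised to a gate-dependent power, with the constant `c = 8.0` and the square-root input normalization both assumed. The gates here use per-channel scalar weights for brevity, whereas a real model would use learned projections, and all parameter names are hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def rg_lru_scan(xs, lam, w_r, b_r, w_i, b_i, c=8.0):
    """Run a per-channel gated linear recurrence over a sequence.

    xs: list of time steps, each a list of floats (one per channel).
    lam, w_r, b_r, w_i, b_i: per-channel parameters (hypothetical names).
    """
    width = len(lam)
    h = [0.0] * width  # fixed-size state: one float per channel
    outputs = []
    for x in xs:
        new_h = []
        for k in range(width):
            r = sigmoid(w_r[k] * x[k] + b_r[k])  # recurrence gate, depends on input only
            i = sigmoid(w_i[k] * x[k] + b_i[k])  # input gate, depends on input only
            a = sigmoid(lam[k]) ** (c * r)       # decay factor in (0, 1)
            # The update is linear in h: no nonlinearity touches the previous state.
            new_h.append(a * h[k] + math.sqrt(1.0 - a * a) * (i * x[k]))
        h = new_h
        outputs.append(h)
    return outputs


params = dict(lam=[2.0, -1.0], w_r=[0.5, 0.5], b_r=[0.0, 0.0],
              w_i=[1.0, 1.0], b_i=[0.0, 0.0])
xs = [[1.0, -0.5], [0.0, 2.0], [0.3, 0.3]]
states = rg_lru_scan(xs, **params)
print(len(states), len(states[-1]))
```

Neither gate reads `h`, which is the property the paragraph contrasts with LSTM and GRU gates, and the state keeps the same length however many steps are processed.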
RecurrentGemma makes only a single change to the published Griffin recipe: it multiplies the input embeddings by a constant equal to the square root of the model width. The output embedding (the language-model head) is left unscaled.[2] In the reference 2B configuration the model has 26 layers arranged so that recurrent blocks dominate, with a local-attention block inserted periodically; the open-source configuration describes the repeating unit as two recurrent blocks followed by one attention block.[6][7]
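As a rough illustration of the layout and the embedding change, the sketch below tiles the two-recurrent-plus-one-attention pattern over 26 layers and computes the scaling constant. It assumes the pattern starts at the first layer and is simply truncated at 26, and it uses 2,560 as the width because that is the figure this article lists for the 2B model. The helper name `scale_input_embedding` is hypothetical.

```python
import math

NUM_LAYERS = 26      # layer count of the 2B configuration
MODEL_WIDTH = 2560   # model width listed for the 2B model
PATTERN = ("recurrent", "recurrent", "attention")

# Assumption: the pattern is tiled from the first layer and truncated at 26.
layer_types = [PATTERN[i % len(PATTERN)] for i in range(NUM_LAYERS)]

embed_scale = math.sqrt(MODEL_WIDTH)


def scale_input_embedding(vec):
    """Multiply an input embedding vector by sqrt(model width)."""
    return [embed_scale * v for v in vec]


print(layer_types.count("recurrent"), layer_types.count("attention"))
print(round(embed_scale, 2))
```

Under that tiling assumption the stack has 18 recurrent blocks and 8 attention blocks, and the input scale works out to about 50.6. The output head would skip `scale_input_embedding` entirely.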
Two sizes were released, each with a pretrained base model and an instruction-tuned variant. The parameter counts below are taken from the technical report. The original abstract describes the smaller model by its non-embedding parameter count (about 2B), and the headline names "2B" and "9B" label the two sizes rather than stating exact totals; the total and non-embedding counts differ because the 256,000-token vocabulary makes the embedding table large.[2]
| Model | Total parameters | Non-embedding parameters | Model width | Training tokens | Released |
|---|---|---|---|---|---|
| RecurrentGemma-2B | 2.68B | 2.03B | 2,560 | 2T | April 2024 |
| RecurrentGemma-9B | 8.58B | 7.53B | 4,096 | 2T | June 2024 |
For comparison, the report notes that the Transformer-based Gemma-2B was trained on 3 trillion tokens and Gemma-7B on 6 trillion, so RecurrentGemma reaches similar quality on fewer training tokens.[2] Both RecurrentGemma models share the Gemma tokenizer and vocabulary, and both were trained on TPUv5e using JAX.[8]
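The gap between the total and non-embedding counts in the table can be checked with simple arithmetic. The sketch below assumes a single embedding table of vocabulary size times model width, so that any sharing between input and output embeddings is counted once. That is an assumption made for the estimate, not something the table states.

```python
VOCAB_SIZE = 256_000

models = {
    # name: (total params, non-embedding params, model width), from the table
    "RecurrentGemma-2B": (2.68e9, 2.03e9, 2560),
    "RecurrentGemma-9B": (8.58e9, 7.53e9, 4096),
}


def embedding_params(vocab_size: int, width: int) -> int:
    """One row of `width` weights per vocabulary entry (single table assumed)."""
    return vocab_size * width


for name, (total, non_embed, width) in models.items():
    estimated = embedding_params(VOCAB_SIZE, width)
    table_gap = total - non_embed
    print(f"{name}: estimated {estimated / 1e9:.2f}B, table gap {table_gap / 1e9:.2f}B")
```

The estimates, roughly 0.66B and 1.05B, agree with the differences in the table to within rounding, which supports reading the gap as the cost of the vocabulary.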
The practical appeal of RecurrentGemma comes from how it stores context. A standard Transformer keeps a key-value (KV) cache whose size grows linearly with the sequence length, so memory consumption and per-token latency climb as generation continues. Griffin's recurrent blocks instead compress the past into a fixed-size state, and its attention blocks only look back over a bounded local window. As a result the memory footprint does not grow without bound, which lets the model generate longer sequences on hardware with limited memory and run inference at larger batch sizes.[1][2][5]
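The difference in growth can be made concrete with a small calculation for a single attention layer. This is a sketch under stated assumptions: the head count and head dimension are hypothetical values chosen only for illustration, the cache is assumed to store 2-byte values, and the function names are not taken from any library.

```python
from typing import Optional


def kv_cache_entries(tokens_seen: int, window: Optional[int] = None) -> int:
    """Number of token positions whose keys and values one attention layer keeps.

    window=None models global attention; an integer models sliding-window attention.
    """
    if window is None:
        return tokens_seen
    return min(tokens_seen, window)


def cache_bytes(positions: int, num_kv_heads: int, head_dim: int,
                bytes_per_value: int = 2) -> int:
    # Factor of 2: one key vector and one value vector per cached position.
    return 2 * positions * num_kv_heads * head_dim * bytes_per_value


# Hypothetical layer shape, chosen only for illustration.
NUM_KV_HEADS, HEAD_DIM = 1, 256

for n in (1_024, 2_048, 8_192, 65_536):
    global_b = cache_bytes(kv_cache_entries(n), NUM_KV_HEADS, HEAD_DIM)
    local_b = cache_bytes(kv_cache_entries(n, window=2_048), NUM_KV_HEADS, HEAD_DIM)
    print(f"{n:>6} tokens  global: {global_b:>10,} B  local(2048): {local_b:>10,} B")
```

The global figure keeps rising with the token count, while the local figure stops changing once the sequence exceeds the window. The recurrent blocks add a state whose size does not depend on the token count at all.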
DeepMind reports that this translates into substantially higher sampling throughput on long sequences. Because the state size is constant, RecurrentGemma-9B sustains roughly steady throughput as the generated sequence lengthens, whereas Gemma-7B slows down as its KV cache expands, so the gap between the two widens with sequence length.[2][9] Prompt-processing speed (the initial pass over the input) is similar between the two, since that stage is compute-bound rather than memory-bound for both architectures.[2]
Despite the architectural change and the smaller training budget, RecurrentGemma scores close to the equivalent Gemma Transformers on standard benchmarks. The figures below are drawn from the model cards and technical report; metrics follow the Gemma evaluation conventions (for example, MMLU at 5-shot and HellaSwag at 0-shot).[2][8][10]
| Benchmark | RecurrentGemma-2B | RecurrentGemma-9B |
|---|---|---|
| MMLU (5-shot) | 38.4 | 60.5 |
| HellaSwag (0-shot) | 71.0 | 80.4 |
| ARC-e (0-shot) | 72.9 | 78.8 |
| HumanEval (pass@1) | 21.3 | 31.1 |
| GSM8K (maj@1) | 13.4 | 42.6 |
The technical report frames these results as evidence that a non-Transformer model can be competitive with a Transformer of similar size while offering better inference characteristics, rather than as a claim of state-of-the-art quality.[2] The instruction-tuned 9B variant was also evaluated for safety and human-preference behavior using the same procedures as the Gemma instruction-tuned models.[2][10]
RecurrentGemma was released as open weights. Checkpoints for both sizes, in pretrained and instruction-tuned forms, are distributed through Kaggle and Hugging Face, with Flax and PyTorch variants available.[1][11] Use of the weights is governed by the Gemma Terms of Use, the same license that covers the rest of the Gemma family, and users must accept those terms before downloading.[8][11]
DeepMind also published a reference implementation on GitHub under the Apache License 2.0. The repository provides a JAX/Flax codebase, described by the maintainers as the more optimized path and including custom Pallas kernels for the linear-scan recurrence, alongside a PyTorch implementation intended mainly for reference. The library targets CPU, GPU, and TPU, and the model was subsequently integrated into the Hugging Face Transformers library as the RecurrentGemma model type.[5][12] Because the recurrent design is less common than the Transformer, tooling and community support for RecurrentGemma remain narrower than for mainstream Transformer models, a tradeoff DeepMind's own documentation acknowledges.[6]