Gemma Scope
Last reviewed
Jun 8, 2026
Sources
5 citations
Review status
Source-backed
Revision
v1 · 1,846 words
Gemma Scope is an open, comprehensive suite of sparse autoencoders (SAEs) released by Google DeepMind in 2024 to support mechanistic interpretability research on its open-weight Gemma 2 language models. The headline release comprises more than 400 JumpReLU SAEs that together learn more than 30 million features. They are trained on internal activations at every layer and sub-layer of Gemma 2 2B and Gemma 2 9B, plus selected layers of the 27B model. Counting the multiple sparsity settings shipped per location, the total number of released autoencoders exceeds 2,000. [1][2]
The project was announced on the Google DeepMind blog on July 31, 2024, and described in the technical report "Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2" by Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, János Kramár, Anca Dragan, Rohin Shah, and Neel Nanda. The report was posted to arXiv on August 9, 2024 and presented at the BlackboxNLP workshop at EMNLP 2024. [1][2] The name evokes a microscope or telescope for looking inside a model, and the "Everywhere All At Once" in the title reflects the goal of covering every part of the network rather than a handful of layers.
A sparse autoencoder is an unsupervised tool that decomposes a model's dense activation vectors into a much larger set of mostly inactive features, each of which is intended to track a single human-interpretable concept. Training one SAE well is expensive and finicky, which had previously confined high-quality SAE work to a few well-resourced labs and to one or two layers of a model at a time. Gemma Scope was conceived to remove that barrier: by training and openly publishing SAEs for an entire capable model, DeepMind aimed to let outside researchers do interpretability experiments, such as feature steering, circuit analysis, and safety probes, without first spending substantial compute training autoencoders of their own. [1][2]
Gemma Scope trains SAEs at three distinct sites within each transformer layer, capturing the model's representations at different points in the computation: the attention head outputs (taken before the final output projection W_O is applied), the MLP outputs, and the post-MLP residual stream. Training at all three sites across every layer is what makes the suite "comprehensive" rather than a single probe point. [2]
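The three sites can be located in a simplified transformer layer. The sketch below is a toy NumPy layer, not Gemma 2's actual implementation: it omits normalization layers, uses a plain ReLU MLP, and the dimensions, weight names, and the `toy_layer` function are all hypothetical. It only illustrates where each activation would be read.

```python
import numpy as np

def softmax(a, axis=-1):
    # Numerically stable softmax; masked (-inf) entries become exactly 0.
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def toy_layer(resid_pre, W_Q, W_K, W_V, W_O, W_in, W_out):
    """Simplified transformer layer that returns the three SAE sites."""
    seq = resid_pre.shape[0]
    d_head = W_Q.shape[-1]
    # Per-head queries, keys, values: shape (heads, seq, d_head).
    q = np.einsum("sd,hde->hse", resid_pre, W_Q)
    k = np.einsum("sd,hde->hse", resid_pre, W_K)
    v = np.einsum("sd,hde->hse", resid_pre, W_V)
    scores = np.einsum("hqe,hke->hqk", q, k) / np.sqrt(d_head)
    causal_mask = np.triu(np.ones((seq, seq), dtype=bool), k=1)
    scores = np.where(causal_mask, -np.inf, scores)
    z = np.einsum("hqk,hke->hqe", softmax(scores), v)
    # Site 1: per-head outputs concatenated, read BEFORE the projection W_O.
    attn_heads_out = z.transpose(1, 0, 2).reshape(seq, -1)
    resid_mid = resid_pre + attn_heads_out @ W_O
    # Site 2: output of the MLP block.
    mlp_out = np.maximum(resid_mid @ W_in, 0.0) @ W_out
    # Site 3: residual stream after the MLP output has been added.
    resid_post = resid_mid + mlp_out
    return {
        "attn_heads_out": attn_heads_out,
        "mlp_out": mlp_out,
        "resid_post": resid_post,
    }

# Toy dimensions, chosen only so the shapes are easy to tell apart.
rng = np.random.default_rng(0)
seq, d_model, n_heads, d_head, d_mlp = 5, 16, 4, 3, 32
resid_pre = rng.normal(size=(seq, d_model))
W_Q, W_K, W_V = (rng.normal(size=(n_heads, d_model, d_head)) for _ in range(3))
W_O = rng.normal(size=(n_heads * d_head, d_model))
W_in = rng.normal(size=(d_model, d_mlp))
W_out = rng.normal(size=(d_mlp, d_model))

sites = toy_layer(resid_pre, W_Q, W_K, W_V, W_O, W_in, W_out)
for name, act in sites.items():
    print(name, act.shape)
```

Note that in this toy the attention site has width `n_heads * d_head` rather than `d_model`, because it is read before `W_O` maps the concatenated heads back into the residual stream.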
The coverage across the Gemma 2 family is as follows. [2]
| Model | Layers covered | Sites | Notes |
|---|---|---|---|
| Gemma 2 2B (base) | All 26 layers | Attention output, MLP output, residual stream | Full suite across all widths |
| Gemma 2 9B (base) | All 42 layers | Attention output, MLP output, residual stream | Full suite across all widths |
| Gemma 2 27B (base) | Selected layers | Residual stream | Partial coverage of the largest model |
| Gemma 2 9B (instruction-tuned) | Selected layers | Residual stream | Lets researchers compare base and chat behavior |
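As a quick sanity check on the table above, the number of distinct (layer, site) locations in the two fully covered models can be counted directly. This is simple arithmetic on the figures in the table; the variable and function names are illustrative, and the partially covered 27B and instruction-tuned models are left out because the table does not give their layer counts.

```python
# Layer counts and sites as stated in the coverage table.
LAYERS = {"gemma-2-2b": 26, "gemma-2-9b": 42}
SITES = ("attention output", "MLP output", "residual stream")

def full_coverage_locations(layers: dict[str, int], sites: tuple[str, ...]) -> dict[str, int]:
    """Number of (layer, site) pairs for each fully covered model."""
    return {model: n_layers * len(sites) for model, n_layers in layers.items()}

locations = full_coverage_locations(LAYERS, SITES)
total_locations = sum(locations.values())

print(locations)        # {'gemma-2-2b': 78, 'gemma-2-9b': 126}
print(total_locations)  # 204
```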
Each SAE has a dictionary "width," the number of features it can represent, chosen from powers of two. The released autoencoders span widths from 2^14 (about 16,000 features) up to 2^20 (roughly 1 million features), with intermediate widths of 2^15, 2^16, 2^17, 2^18, and 2^19. [1][2] Wider dictionaries can carve the activation space into finer concepts but are more expensive to train and use. Summed over every layer, site, and width, the headline release contains more than 400 SAEs and more than 30 million learned features in total, although DeepMind cautions that many of these features overlap or are duplicated across the different autoencoders, so the count is not 30 million distinct concepts. [1][2]
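To make the width range concrete, the sketch below lists the seven powers of two and estimates how parameter count grows with width. The parameter layout (encoder and decoder matrices, an encoder bias, a per-feature threshold, and a decoder bias) and the example model dimension of 2,304 are assumptions made for illustration, not figures taken from the sources cited here.

```python
def sae_param_count(d_model: int, width: int) -> int:
    """Parameters in an SAE under the assumed layout:
    W_enc (d_model x width), W_dec (width x d_model),
    b_enc (width), threshold (width), b_dec (d_model)."""
    return 2 * d_model * width + 2 * width + d_model

# Dictionary widths from 2^14 to 2^20 inclusive.
widths = {exp: 2 ** exp for exp in range(14, 21)}

D_MODEL = 2304  # assumed activation dimension, for illustration only

for exp, width in widths.items():
    params = sae_param_count(D_MODEL, width)
    print(f"2^{exp} = {width:>9,} features, {params:>13,} parameters")
```

Under this layout the cost is dominated by the two weight matrices, so each doubling of the width roughly doubles both the memory footprint and the work per forward pass.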
In addition to varying the width, Gemma Scope ships several SAEs at each location trained to different levels of sparsity, controlled during training by sweeping the coefficient on the sparsity penalty. Sparsity is commonly summarized by the average L0, the typical number of features active on a given input; lower L0 means a sparser, more selective decomposition but usually higher reconstruction error. Releasing a range of sparsity settings lets researchers pick the reconstruction-versus-interpretability tradeoff that suits their experiment. Accounting for these multiple sparsity variants, DeepMind published the weights of over 2,000 SAEs in total. [2]
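The two quantities being traded off can be computed in a few lines. The sketch below is illustrative: average L0 follows the definition in the paragraph above, while fraction of variance unexplained is one common way to summarize reconstruction error and is chosen here as an assumption, not necessarily the metric DeepMind reports. The arrays are toy data.

```python
import numpy as np

def average_l0(feature_acts: np.ndarray) -> float:
    """Mean number of non-zero features per input (rows are inputs)."""
    return float((feature_acts != 0).sum(axis=-1).mean())

def fraction_variance_unexplained(x: np.ndarray, x_hat: np.ndarray) -> float:
    """Squared reconstruction error divided by the total variance of x."""
    residual = ((x - x_hat) ** 2).sum()
    total = ((x - x.mean(axis=0)) ** 2).sum()
    return float(residual / total)

# Two inputs, four features: one active feature, then two active features.
acts = np.array([[0.0, 1.5, 0.0, 0.0],
                 [2.0, 0.0, 0.5, 0.0]])
print(average_l0(acts))  # 1.5

# Toy activations and a slightly noisy "reconstruction" of them.
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 8))
x_hat = x + 0.1 * rng.normal(size=x.shape)
print(fraction_variance_unexplained(x, x_hat))
```

A perfect reconstruction gives a fraction of variance unexplained of 0, and a sparser SAE will typically sit further from 0 than a denser one trained at the same location.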
Gemma Scope uses the JumpReLU SAE architecture, a variant introduced by an overlapping DeepMind team in July 2024. A JumpReLU SAE replaces the usual rectified-linear encoder activation with a per-feature learned threshold: any feature whose pre-activation falls below its own threshold is set to exactly zero, while features above the threshold pass through at full magnitude. This lets the model be trained directly against an L0 sparsity objective (using straight-through gradient estimators to get a usable training signal through the otherwise non-differentiable threshold) and avoids the "shrinkage" bias that an L1 penalty imposes on the features it keeps. The DeepMind authors reported that JumpReLU SAEs achieved state-of-the-art reconstruction fidelity at a fixed sparsity on Gemma 2 9B activations, which is why the architecture was chosen as the backbone for the whole suite. [1][2][3]
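The difference between ReLU and JumpReLU is easiest to see on a small example. The following is a minimal forward-pass sketch in NumPy; the function and parameter names are illustrative rather than taken from DeepMind's code, and it leaves out training entirely, including the straight-through estimators mentioned above.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def jumprelu(z, threshold):
    """Zero every entry that does not exceed its per-feature threshold;
    entries above the threshold pass through unchanged."""
    return np.where(z > threshold, z, 0.0)

def sae_forward(x, W_enc, b_enc, threshold, W_dec, b_dec):
    """Encode activations into sparse features, then reconstruct them."""
    pre_acts = x @ W_enc + b_enc
    acts = jumprelu(pre_acts, threshold)
    recon = acts @ W_dec + b_dec
    return acts, recon

pre_acts = np.array([-0.5, 0.2, 0.9, 1.4])
threshold = np.array([0.1, 0.5, 0.5, 2.0])  # one threshold per feature

print(relu(pre_acts).tolist())                 # [0.0, 0.2, 0.9, 1.4]
print(jumprelu(pre_acts, threshold).tolist())  # [0.0, 0.0, 0.9, 0.0]

# Toy forward pass: 3 inputs of dimension 6, dictionary of 4 features.
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 6))
W_enc = rng.normal(size=(6, 4))
W_dec = rng.normal(size=(4, 6))
acts, recon = sae_forward(x, W_enc, np.zeros(4), threshold, W_dec, np.zeros(6))
print(acts.shape, recon.shape)  # (3, 4) (3, 6)
```

The surviving value 0.9 is passed through at full magnitude rather than being reduced by the threshold, which is the property that distinguishes JumpReLU from simply shifting a ReLU.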
Building Gemma Scope was an unusually large engineering effort for an interpretability project. SAEs are trained on activations harvested by running the underlying model over large amounts of text, and the text used was drawn from the same distribution as the Gemma pretraining data. Each individual SAE was trained on roughly 4 billion to 16 billion tokens of activations. [2] Because the activations for every layer and site had to be generated and stored, the pipeline saved on the order of 20 pebibytes of activation data to disk over the course of the project, which DeepMind likened to roughly a million copies of English Wikipedia. [1][2]
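The storage figure and the Wikipedia comparison can be related with a short calculation. The per-copy size printed below is derived purely from the two numbers in the paragraph above (20 pebibytes and one million copies); it is an implied value, not a figure stated by DeepMind.

```python
PIB = 2 ** 50  # bytes in one pebibyte
GIB = 2 ** 30  # bytes in one gibibyte

total_bytes = 20 * PIB
copies = 1_000_000

# Size each "copy of English Wikipedia" would need to be for the comparison to hold.
gib_per_copy = total_bytes / copies / GIB

print(f"{total_bytes:,} bytes")             # 22,517,998,136,852,480 bytes
print(f"{gib_per_copy:.2f} GiB per copy")   # 20.97 GiB per copy
```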
The total training cost was substantial by interpretability standards. The technical report states that producing the suite consumed more than 20 percent of the compute that was used to pretrain GPT-3, and the announcement framed the same figure as roughly 15 percent of the compute used to train Gemma 2 9B itself. [1][2] The two figures are consistent: they simply compare the same training budget against two different reference models. The process produced hundreds of billions of SAE parameters in aggregate. [1] To put the throughput in context, the report noted that a single 131,000-width SAE could process one training batch in about 45 milliseconds on 8 TPUv3 chips, corresponding to a model FLOP utilization of roughly 50 percent. [2] The autoencoders were trained in 32-bit floating point precision. [2]
The Gemma Scope weights are published openly on Hugging Face under the permissive CC-BY-4.0 license, together with a tutorial and example code so that researchers can load any SAE and run it on Gemma 2 activations. [1][4] Because the SAEs are tied to specific, openly available Gemma 2 checkpoints, results obtained with them are straightforward for others to reproduce.
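Loading one SAE might look roughly like the sketch below. It uses `hf_hub_download` from the `huggingface_hub` library and `numpy.load`, both of which are real calls, but the repository id and the `layer_/width_/average_l0_` file layout are assumptions recalled from the public tutorial and should be checked against the Hugging Face model cards before use. The helper names `sae_filename` and `load_sae_params` are hypothetical.

```python
import numpy as np

# ASSUMPTION: repository id for the Gemma 2 2B residual-stream SAEs.
REPO_ID = "google/gemma-scope-2b-pt-res"

def sae_filename(layer: int, width: str, average_l0: int) -> str:
    """Build a file path under the ASSUMED repository layout."""
    return f"layer_{layer}/width_{width}/average_l0_{average_l0}/params.npz"

def load_sae_params(repo_id: str, filename: str) -> dict:
    """Download one SAE parameter file and return its arrays by name.

    Requires network access and the huggingface_hub package.
    """
    from huggingface_hub import hf_hub_download  # imported lazily

    path = hf_hub_download(repo_id=repo_id, filename=filename)
    with np.load(path) as params:
        return {name: params[name] for name in params.files}

# Example path only; the layer, width, and L0 values are illustrative.
print(sae_filename(20, "16k", 71))
# layer_20/width_16k/average_l0_71/params.npz
```

Calling `load_sae_params(REPO_ID, sae_filename(20, "16k", 71))` would fetch the file if a matching path exists in the repository; printing the shapes of the returned arrays is a quick way to confirm which parameters a given SAE ships with.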
For interactive exploration, DeepMind partnered with Neuronpedia, an open platform for browsing and experimenting with SAE features. Neuronpedia hosts a dedicated Gemma Scope interface where users can search for features by the text that activates them, inspect which inputs cause a given feature to fire, and steer the model by manually amplifying or suppressing individual features to see how its outputs change. [1][5] This demo lowers the entry barrier further, since it allows people to study the model's learned concepts directly in a browser without writing any code or downloading multi-gigabyte weight files. The combination of openly licensed weights, a code tutorial, and a hosted feature browser was a deliberate part of the release strategy, aimed at making SAE-based interpretability accessible to a broad audience including students and independent researchers. [1]
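Feature steering of the kind described above can be sketched as adding a scaled copy of a feature's decoder direction to the model's activations. This is a simplified illustration on toy data, assuming a decoder matrix with one row per feature; Neuronpedia's actual steering implementation may differ, and the `steer` function is hypothetical.

```python
import numpy as np

def steer(resid: np.ndarray, W_dec: np.ndarray, feature_idx: int, strength: float) -> np.ndarray:
    """Add `strength` times one feature's decoder direction at every position.

    Positive strength amplifies the feature; negative strength suppresses it.
    """
    direction = W_dec[feature_idx]  # shape (d_model,)
    return resid + strength * direction

# Toy data: 5 token positions, model dimension 8, dictionary of 16 features.
rng = np.random.default_rng(0)
resid = rng.normal(size=(5, 8))
W_dec = rng.normal(size=(16, 8))

steered = steer(resid, W_dec, feature_idx=3, strength=4.0)
print(steered.shape)  # (5, 8)
```

In a real experiment the steered activations would replace the originals at the chosen layer during the forward pass, and the effect would be judged from the change in the model's generated text.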
Gemma Scope's main contribution is one of scale and accessibility rather than a new conceptual technique. Before its release, an interpretability researcher who wanted to study a frontier-scale open model typically had to train their own sparse autoencoders, a process demanding both expertise and significant compute, and most public SAEs covered only a layer or two. By openly publishing high-quality SAEs for every layer and sub-layer of a capable model, DeepMind turned SAE-based feature analysis into something closer to a shared, off-the-shelf research instrument. The project was widely taken up by the interpretability community and helped make Gemma 2 a common testbed for work on features, circuits, and steering. [1][2]
DeepMind explicitly framed the release in terms of AI safety. The blog post argued that better tools for seeing inside a model could help identify and address problems such as hallucination, deception, and undesirable behavior in autonomous agents, and that opening the suite would let the wider safety community contribute to that effort. [1] The dense, layer-by-layer coverage is particularly suited to circuit-style analysis, where researchers trace how information flows and is transformed across many layers, rather than examining a single bottleneck in isolation. [2] Gemma Scope also reinforced JumpReLU as a widely adopted SAE design, since the suite became the most prominent practical demonstration of that architecture. [2][3]
Gemma Scope inherits the open methodological questions of sparse autoencoders generally. SAE features are only approximately monosemantic: some learned features remain hard to interpret, some concepts are split across multiple features, and the "more than 30 million features" headline includes substantial overlap between autoencoders, so it overstates the number of genuinely distinct concepts. [2] Reconstruction is also imperfect, so passing a model's activations through an SAE and back loses some information. The available range of sparsity settings lets a researcher choose a point on the tradeoff between faithful reconstruction and clean, sparse features, but it does not eliminate that tradeoff.
The coverage is deliberately uneven. The 2B and 9B base models receive the full treatment, but the 27B model is covered only at selected layers and only on the residual stream, and the instruction-tuned coverage is similarly partial, so studies of the largest or chat-tuned variants have fewer SAEs to work with. [2] The suite is also specific to the Gemma 2 family and to the particular text distribution it was trained on, so the features it learns do not automatically transfer to other models or to very different input domains. Finally, DeepMind itself positioned Gemma Scope as a resource to accelerate research rather than a finished solution: whether the features it surfaces are the "right" units for understanding computation, and how reliably they support claims about model behavior, remained active areas of investigation, and later work continued to refine SAE training and evaluation. [1][2]