JumpReLU SAE


Updated Jun 28, 2026

A JumpReLU sparse autoencoder (JumpReLU SAE) is a variant of the sparse autoencoder used in mechanistic interpretability whose encoder applies a learnable per-feature threshold: a feature is forced to exactly zero unless its pre-activation clears the threshold, in which case the value passes through unchanged. Introduced by Google DeepMind researchers Senthooran Rajamanoharan, Tom Lieberum, Neel Nanda, and colleagues in July 2024 (arXiv:2407.14435), it achieves state-of-the-art reconstruction fidelity at a given sparsity level on Gemma 2 9B activations, matching or exceeding the contemporaneous Gated and TopK SAEs, and it is the architecture behind DeepMind's Gemma Scope suite of more than 400 open SAEs. ^[1]^[2]

Overview

A JumpReLU sparse autoencoder (JumpReLU SAE) is a variant of the sparse autoencoder (SAE) used in mechanistic interpretability to decompose the dense internal activations of a neural network into a much larger set of sparse, individually interpretable features. Its defining feature is the JumpReLU activation in the encoder: a per-feature learned threshold below which a feature is forced to exactly zero and above which the pre-activation passes through unchanged. This replaces the standard rectified linear (ReLU) activation and lets the model train against a direct sparsity objective rather than the indirect L1 penalty used by earlier SAEs. ^[1]

The technique was introduced in "Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders," posted to arXiv on 19 July 2024 by Senthooran Rajamanoharan, Tom Lieberum, Nicolas Sonnerat, Arthur Conmy, Vikrant Varma, Janos Kramar, and Neel Nanda of Google DeepMind. ^[1] The authors reported that JumpReLU SAEs reach state-of-the-art reconstruction fidelity at a fixed level of sparsity on activations from Gemma 2 9B, matching or surpassing the two strongest alternatives of the time, Gated SAEs (also from DeepMind) and TopK SAEs (from OpenAI), while yielding features that human and automated raters judged to be equally interpretable. ^[1] The architecture is best known as the backbone of Gemma Scope, DeepMind's July 2024 open release of more than 400 SAEs spanning every layer of Gemma 2. ^[2]

The paper frames the core problem in one sentence: "To be useful for downstream tasks, SAEs need to decompose LM activations faithfully; yet to be interpretable the decomposition must be sparse -- two objectives that are in tension." ^[1] JumpReLU SAEs are one answer to that tension.

What is a sparse autoencoder, and why is it used for interpretability?

A central obstacle to interpreting language models is superposition: models appear to represent far more distinct concepts than they have neurons, packing many features into overlapping linear directions so that any single neuron responds to a jumble of unrelated concepts (it is polysemantic) ^[7]. Sparse autoencoders attempt to undo this packing. An SAE is a wide, shallow network trained to reconstruct a model's activation vector x while passing it through a sparse bottleneck. The encoder maps x to a high-dimensional vector of feature activations f(x), most of which are zero for any given input, and the decoder reconstructs an approximation of x as a sparse weighted sum of learned feature directions (the columns of the decoder weight matrix). The hope is that the recovered directions are monosemantic, each corresponding to one human-understandable concept. This dictionary-learning approach to interpretability was popularized by Anthropic's "Towards Monosemanticity" (October 2023) and by Cunningham et al. the same year. ^[5]^[6]

The standard recipe applies a ReLU encoder and trains it to minimize a reconstruction error plus an L1 penalty on the feature activations, where the L1 term encourages sparsity because it is the convex surrogate for the true (but non-differentiable) count of active features. This setup has two well-documented weaknesses. First, an L1 penalty applied to the activations themselves creates a shrinkage bias: because the penalty grows with activation magnitude, the optimizer systematically pushes feature values below their correct levels, harming reconstruction even when the right features have been selected. Second, there is an inherent reconstruction-versus-sparsity tradeoff, and the L1 surrogate sits on a worse point of that frontier than necessary. Gated SAEs (Rajamanoharan et al., April 2024) and TopK SAEs (Gao et al., OpenAI, June 2024) were each proposed to push this Pareto frontier outward; JumpReLU SAEs continue that line of work. ^[3]^[4]
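The shrinkage bias can be seen in a one-feature toy problem. The sketch below assumes a single feature whose decoder direction has unit norm, so the reconstruction error reduces to (x - f)^2 and the L1 penalty to lam * |f|; the function names and the numbers are illustrative and are not taken from any of the cited papers.

```python
def l1_optimal_activation(x: float, lam: float) -> float:
    """Minimizer over f >= 0 of (x - f)**2 + lam * f (scalar toy problem)."""
    # Setting the derivative -2 * (x - f) + lam to zero gives f = x - lam / 2,
    # clipped at zero because feature activations are non-negative.
    return max(x - lam / 2.0, 0.0)


def toy_loss(x: float, f: float, lam: float) -> float:
    """Squared reconstruction error plus L1 penalty for one feature."""
    return (x - f) ** 2 + lam * abs(f)


true_value = 2.0  # activation that would reconstruct x exactly
lam = 0.5         # hypothetical sparsity coefficient
shrunk = l1_optimal_activation(true_value, lam)
print(shrunk)  # 1.75: the penalized optimum sits below the true value of 2.0
```

In this toy setting the penalized loss is lower at 1.75 than at the exact value 2.0, so the optimizer prefers the shrunken activation even though the correct feature has been selected.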

How does the JumpReLU activation work?

A JumpReLU SAE keeps the usual encoder and decoder structure. The encoder produces feature activations

f(x) = JumpReLU_theta(W_enc x + b_enc),

and the decoder reconstructs the input as

x_hat(f) = W_dec f + b_dec.

The JumpReLU activation is applied elementwise and is defined as

JumpReLU_theta(z) = z * H(z - theta),

where H is the Heaviside step function (one for strictly positive arguments, zero otherwise) and theta is a strictly positive threshold. ^[1] Crucially, theta is a learned vector with one entry per feature, so every feature gets its own cutoff. The activation acts as a shifted Heaviside gate multiplied by the input: any pre-activation at or below its feature's threshold is set to zero, while any pre-activation above it is passed through at full magnitude. This is the source of the name: as the pre-activation crosses the cutoff, the output "jumps" discontinuously from zero to the threshold value. Because the gate either kills a feature or leaves it untouched, the architecture does not shrink the magnitudes of the features it keeps, directly addressing the shrinkage bias of L1-trained ReLU SAEs. At inference the activation is just a threshold comparison, so a JumpReLU SAE costs essentially the same to run as a plain ReLU SAE: one matrix multiply followed by an elementwise gate. ^[1]
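The encoder, decoder, and activation defined above can be sketched in a few lines. This is a minimal illustration using plain Python lists so that it runs without dependencies; the helper names (matvec, encode, decode) and the toy weights are assumptions made for the example, not values from the paper.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, v: Vector) -> Vector:
    """Multiply matrix w (a list of rows) by vector v."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in w]


def jumprelu(z: Vector, theta: Vector) -> Vector:
    """Elementwise z * H(z - theta): zero at or below the threshold, z above it."""
    return [z_i if z_i > t_i else 0.0 for z_i, t_i in zip(z, theta)]


def encode(x: Vector, w_enc: Matrix, b_enc: Vector, theta: Vector) -> Vector:
    """f(x) = JumpReLU_theta(W_enc x + b_enc)."""
    pre = [p + b for p, b in zip(matvec(w_enc, x), b_enc)]
    return jumprelu(pre, theta)


def decode(f: Vector, w_dec: Matrix, b_dec: Vector) -> Vector:
    """x_hat(f) = W_dec f + b_dec."""
    return [r + b for r, b in zip(matvec(w_dec, f), b_dec)]


# Toy SAE: 2-dimensional input, 3 features (all numbers are made up).
w_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_enc = [0.0, 0.0, 0.0]
theta = [0.5, 0.5, 2.0]  # one threshold per feature
w_dec = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
b_dec = [0.0, 0.0]

x = [1.0, 0.2]
f = encode(x, w_enc, b_enc, theta)
x_hat = decode(f, w_dec, b_dec)
print(f)      # [1.0, 0.0, 0.0]: only the first feature clears its threshold
print(x_hat)  # [1.0, 0.0]
```

The surviving feature keeps its full pre-activation value of 1.0 rather than a reduced one, which is the behavior the paragraph above describes.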

How is a JumpReLU SAE trained through a discontinuous activation?

Freed from the need to use L1 as a differentiable proxy, the authors train directly against the quantity they actually care about, the L0 "norm" (the number of nonzero features). The loss combines squared reconstruction error with an L0 sparsity penalty:

L(x) = || x - x_hat(f(x)) ||_2^2 + lambda * || f(x) ||_0,

where lambda controls the reconstruction-versus-sparsity tradeoff. ^[1] The difficulty is that both the Heaviside gate and the L0 count are piecewise constant in the threshold theta, so their gradient with respect to theta is zero almost everywhere and undefined at the jump. Ordinary backpropagation therefore provides no signal to move the thresholds.
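To make the zero-gradient problem concrete, the sketch below evaluates the loss for a single feature and nudges its threshold. It assumes an identity encoder and decoder (so x equals the pre-activation z and x_hat equals f); the function names and numbers are illustrative only.

```python
def l0_loss(x, x_hat, f, lam):
    """Squared reconstruction error plus lam times the number of active features."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    l0 = sum(1 for v in f if v != 0.0)
    return recon + lam * l0


def scalar_loss(z, theta, lam):
    """Loss for one feature with identity encoder/decoder (x = z, x_hat = f)."""
    f = z if z > theta else 0.0
    return l0_loss([z], [f], [f], lam)


lam = 0.1  # hypothetical sparsity coefficient
left = scalar_loss(1.0, 0.30, lam)
right = scalar_loss(1.0, 0.31, lam)
finite_difference = (right - left) / 0.01
print(finite_difference)  # 0.0: moving theta slightly does not change the loss
```

The loss only changes when the threshold crosses the pre-activation itself (here at theta = 1.0, where it steps from 0.1 to 1.0), so an ordinary derivative carries no information about which way to move theta.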

The paper solves this with straight-through estimators (STEs). On the backward pass it substitutes custom pseudo-derivatives for the Heaviside and JumpReLU functions, built from a kernel function K with a small bandwidth epsilon. The effect is that the gradient with respect to a feature's threshold is estimated only from the pre-activations that fall within a narrow window of width epsilon around that threshold, which amounts to a kernel density estimate of how many pre-activations sit near the cutoff. The authors show that this approximates the gradient of the expected loss, balancing two pressures at each threshold: lowering it admits more features and improves reconstruction, while raising it removes features and reduces the L0 penalty. In their experiments K is a rectangle (boxcar) kernel and the bandwidth is set to epsilon = 0.001 on input data normalized to unit mean squared norm. ^[1] The STE is used only during training; the forward pass, and all inference, uses the exact hard threshold.
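The windowed estimate can be sketched as follows. This is a hedged illustration, not the authors' implementation: the rectangle kernel matches the description above, but the exact scaling of the pseudo-derivatives (a factor of 1/epsilon for the Heaviside gate and theta/epsilon for JumpReLU, both negative) is an assumption that should be checked against the paper, and the bandwidth of 0.1 is chosen only to keep the toy numbers readable.

```python
def rectangle_kernel(u: float) -> float:
    """Boxcar kernel: 1 inside (-0.5, 0.5), 0 elsewhere."""
    return 1.0 if -0.5 < u < 0.5 else 0.0


def d_heaviside_dtheta(z: float, theta: float, eps: float) -> float:
    """Pseudo-derivative of H(z - theta) w.r.t. theta (assumed scaling)."""
    return -(1.0 / eps) * rectangle_kernel((z - theta) / eps)


def d_jumprelu_dtheta(z: float, theta: float, eps: float) -> float:
    """Pseudo-derivative of JumpReLU_theta(z) w.r.t. theta (assumed scaling)."""
    return -(theta / eps) * rectangle_kernel((z - theta) / eps)


def threshold_grad(z, theta, dloss_df, lam, eps):
    """Per-example gradient estimate for one threshold.

    dloss_df is the gradient of the reconstruction error with respect to this
    feature's activation; lam weights the L0 term.
    """
    return (dloss_df * d_jumprelu_dtheta(z, theta, eps)
            + lam * d_heaviside_dtheta(z, theta, eps))


theta, eps = 1.0, 0.1
batch = [0.2, 0.97, 1.02, 3.0]  # made-up pre-activations for one feature
in_window = [rectangle_kernel((z - theta) / eps) for z in batch]
print(in_window)  # [0.0, 1.0, 1.0, 0.0]: only values near theta contribute
```

Pre-activations far from the threshold contribute nothing, so the estimate is driven entirely by how many examples sit near the cutoff and by whether admitting them would help reconstruction more than it costs in sparsity.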

How does it compare to Gated and TopK SAEs?

JumpReLU SAEs are one of three closely related responses to the shortcomings of L1-trained ReLU SAEs, and the three are easy to confuse. The table below summarizes the differences.

SAE variant	Sparsity mechanism	Sparsity penalty	Shrinkage bias	Origin
Standard (ReLU)	ReLU on encoder pre-activations	L1 on feature activations	Yes	Bricken et al. / Cunningham et al., 2023 ^[5]^[6]
Gated	separate "gate" path selects features, "magnitude" path sizes them	L1 on gate pre-activations only	Largely avoided	Rajamanoharan et al., April 2024 ^[3]
TopK	keep the k largest pre-activations, zero the rest	none (k fixes L0 directly)	Avoided	Gao et al. (OpenAI), June 2024 ^[4]
JumpReLU	per-feature learned threshold, gate times input	L0 via straight-through estimator	Avoided	Rajamanoharan et al., July 2024 ^[1]

There is a direct lineage from Gated to JumpReLU SAEs. The Gated SAE paper had already observed that a Gated SAE with tied weights is mathematically equivalent to a single-layer encoder using a discontinuous "JumpReLU" activation, a name it borrowed from earlier work on that activation function. ^[3] The JumpReLU SAE paper turns this observation into a first-class architecture, training the threshold directly rather than realizing it through an auxiliary gating path, which makes the encoder simpler and cheaper at inference. ^[1]
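The equivalence can be checked numerically for a single feature. The weight-tying scheme used here (a shared encoder direction giving a scalar u, a positive per-feature rescaling for the magnitude path, and separate gate and magnitude biases) is an assumption made for illustration, as are all the numbers; under it, the gated output matches a JumpReLU applied to the magnitude pre-activation with threshold b_mag - scale * b_gate, provided that threshold is positive.

```python
def gated_feature(u: float, scale: float, b_gate: float, b_mag: float) -> float:
    """Gate path decides on/off; magnitude path (ReLU) sets the value."""
    gate = 1.0 if u + b_gate > 0.0 else 0.0
    magnitude = max(scale * u + b_mag, 0.0)
    return gate * magnitude


def jumprelu_feature(u: float, scale: float, b_mag: float, theta: float) -> float:
    """Single-path encoder: JumpReLU on the magnitude pre-activation."""
    z = scale * u + b_mag
    return z if z > theta else 0.0


scale, b_gate, b_mag = 2.0, -1.0, 0.5   # hypothetical tied parameters
theta = b_mag - scale * b_gate           # implied threshold: 2.5

grid = [i * 0.25 for i in range(-8, 17)]  # u from -2.0 to 4.0
gated_out = [gated_feature(u, scale, b_gate, b_mag) for u in grid]
jump_out = [jumprelu_feature(u, scale, b_mag, theta) for u in grid]
print(gated_out == jump_out)  # True
```

The gate fires exactly when u exceeds -b_gate, which is the same condition as the magnitude pre-activation exceeding theta, so the auxiliary gate path can be folded into a single learned threshold.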

The contrast with TopK is also instructive. A TopK SAE enforces exactly k active features for every input by keeping only the k largest pre-activations, which sets the L0 sparsity precisely but requires a top-k selection at inference and imposes the same feature count on every token. A JumpReLU SAE instead lets the number of active features vary from input to input, since a given input simply activates however many features clear their thresholds. The DeepMind team argued this is more natural, because some inputs genuinely contain more active concepts than others. ^[1] In their head-to-head evaluation on Gemma 2 9B, across the residual stream, attention outputs, and MLP outputs at layers 9, 20, and 31, JumpReLU SAEs delivered similar or better reconstruction fidelity at a given sparsity than both Gated and TopK SAEs. ^[1] Interpretability was assessed with a blind manual rating study and an automated study using a Gemini model to simulate feature activations; all three architectures produced similarly interpretable features, with the automated metric showing a small improvement from Gated to JumpReLU. ^[1] One noted tradeoff is that, like TopK, JumpReLU SAEs tend to learn a few more very high frequency features than Gated SAEs, though these remained a small fraction of the dictionary (fewer than roughly 0.06 percent of features in the 131K-width SAEs), and both architectures produced few dead features without the resampling tricks earlier SAEs required. ^[1]
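The difference in per-input feature counts can be illustrated with two made-up pre-activation vectors, one dominated by a single concept and one with several strong features. The function names, the value of k, and the thresholds are assumptions for this sketch.

```python
def topk_activation(z, k):
    """Keep the k largest pre-activations and zero the rest."""
    order = sorted(range(len(z)), key=lambda i: z[i], reverse=True)
    keep = set(order[:k])
    return [z_i if i in keep else 0.0 for i, z_i in enumerate(z)]


def jumprelu(z, theta):
    """Elementwise z * H(z - theta) with one threshold per feature."""
    return [z_i if z_i > t_i else 0.0 for z_i, t_i in zip(z, theta)]


def count_active(f):
    """Number of nonzero features (the L0 of the activation vector)."""
    return sum(1 for v in f if v != 0.0)


simple_input = [3.0, 0.1, 0.2, 0.1]  # one strong feature
rich_input = [3.0, 2.5, 2.0, 1.5]    # several strong features
theta = [1.0, 1.0, 1.0, 1.0]
k = 2

topk_counts = [count_active(topk_activation(z, k)) for z in (simple_input, rich_input)]
jump_counts = [count_active(jumprelu(z, theta)) for z in (simple_input, rich_input)]
print(topk_counts)  # [2, 2]: the same count for every input
print(jump_counts)  # [1, 4]: the count follows the input
```

In this toy case TopK admits a weak feature (0.2) on the simple input and drops two strong ones on the rich input, while the thresholded version activates however many features clear their cutoffs.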

What is Gemma Scope, and how does it use JumpReLU SAEs?

The most prominent deployment of the architecture is Gemma Scope, an open suite of JumpReLU SAEs released by DeepMind in July 2024 and described in "Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2" by Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, Janos Kramar, Anca Dragan, Rohin Shah, and Neel Nanda (arXiv, August 2024; presented at the BlackboxNLP workshop at EMNLP 2024). ^[2] Gemma Scope trains SAEs at three sites in every layer (the attention head outputs before the final output projection, the MLP outputs, and the post-MLP residual stream) of Gemma 2 2B and 9B, plus selected layers of Gemma 2 27B. ^[2]

The release comprises more than 400 SAEs in the headline set (and over 2,000 once the multiple sparsity levels per site are counted), totaling more than 30 million learned features, with dictionary widths ranging from about 16,000 (2^14) up to roughly one million (2^20) features. ^[2] Each SAE was trained on 4 to 16 billion tokens of text. Producing the suite was a large undertaking: the authors report that training consumed more than 20 percent of the compute used to pretrain GPT-3 and required saving roughly 20 pebibytes of model activations to disk. ^[2] The weights are published on Hugging Face under a permissive CC-BY-4.0 license, with an interactive feature browser hosted on Neuronpedia, lowering the barrier to interpretability research that previously required training SAEs from scratch. ^[2]^[8]^[9]

Why does JumpReLU matter for AI interpretability?

JumpReLU SAEs represent the convergence, in mid-2024, of a fast-moving line of work on making sparse autoencoders both more faithful and more practical. By training against an L0 objective through a straight-through estimator, the method removes the shrinkage bias of the L1 penalty and advances the reconstruction-versus-sparsity frontier, while keeping the encoder as cheap to run as a standard ReLU SAE and avoiding the per-token feature-count rigidity of TopK. ^[1] Its broader importance comes from Gemma Scope, which made hundreds of high-quality SAEs freely available on a capable open model and helped turn SAE-based feature analysis into a standard, reproducible tool for interpretability and AI safety research. ^[2]^[9] Training remains sensitive to the STE bandwidth and threshold initialization, and later work has continued to refine SAE methods, so JumpReLU is best understood as a strong, widely adopted point on an actively evolving design space rather than a final answer.

References

1. Senthooran Rajamanoharan, Tom Lieberum, Nicolas Sonnerat, Arthur Conmy, Vikrant Varma, Janos Kramar, Neel Nanda. "Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders." arXiv:2407.14435, July 2024. https://arxiv.org/abs/2407.14435
2. Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, Janos Kramar, Anca Dragan, Rohin Shah, Neel Nanda. "Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2." arXiv:2408.05147, August 2024 (BlackboxNLP 2024). https://arxiv.org/abs/2408.05147
3. Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, Janos Kramar, Rohin Shah, Neel Nanda. "Improving Dictionary Learning with Gated Sparse Autoencoders." arXiv:2404.16014, April 2024. https://arxiv.org/abs/2404.16014
4. Leo Gao, Tom Dupre la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, Jeffrey Wu. "Scaling and Evaluating Sparse Autoencoders." arXiv:2406.04093, June 2024. https://arxiv.org/abs/2406.04093
5. Trenton Bricken, Adly Templeton, Joshua Batson, et al. "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning." Anthropic, Transformer Circuits, October 2023. https://transformer-circuits.pub/2023/monosemantic-features/index.html
6. Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, Lee Sharkey. "Sparse Autoencoders Find Highly Interpretable Features in Language Models." arXiv:2309.08600, September 2023. https://arxiv.org/abs/2309.08600
7. Nelson Elhage, Tristan Hume, Catherine Olsson, et al. "Toy Models of Superposition." Anthropic, Transformer Circuits, September 2022. https://transformer-circuits.pub/2022/toy_model/index.html
8. "Gemma Scope." Hugging Face model collection, Google DeepMind. https://huggingface.co/google/gemma-scope
9. "Gemma Scope: helping the safety community shed light on the inner workings of language models." Google DeepMind blog, July 2024. https://deepmind.google/discover/blog/gemma-scope-helping-the-safety-community-shed-light-on-the-inner-workings-of-language-models/
