JumpReLU SAE
Last reviewed
Jun 8, 2026
Sources
9 citations
Review status
Source-backed
Revision
v1 · 2,113 words
A JumpReLU sparse autoencoder (JumpReLU SAE) is a variant of the sparse autoencoder (SAE) used in mechanistic interpretability to decompose the dense internal activations of a neural network into a much larger set of sparse, individually interpretable features. Its defining component is the JumpReLU activation in the encoder, which gives each feature its own learned threshold: a pre-activation at or below the threshold is set to exactly zero, and one above it passes through unchanged. This replaces the standard rectified linear (ReLU) activation and lets the model train against a direct sparsity objective rather than the indirect L1 penalty used by earlier SAEs. [1]
The technique was introduced in "Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders," posted to arXiv in July 2024 by Senthooran Rajamanoharan, Tom Lieberum, Nicolas Sonnerat, Arthur Conmy, Vikrant Varma, Janos Kramar, and Neel Nanda of Google DeepMind. [1] The authors reported that JumpReLU SAEs reach state-of-the-art reconstruction fidelity at a fixed level of sparsity on activations from Gemma 2 9B, matching or surpassing the two strongest alternatives of the time, Gated SAEs (also from DeepMind) and TopK SAEs (from OpenAI), while yielding features that human and automated raters judged to be equally interpretable. [1] The architecture is best known as the backbone of Gemma Scope, DeepMind's July 2024 open release of more than 400 SAEs spanning every layer of Gemma 2. [2]
A central obstacle to interpreting language models is superposition: models appear to represent far more distinct concepts than they have neurons, packing many features into overlapping linear directions so that any single neuron responds to a jumble of unrelated concepts (it is polysemantic) [7]. Sparse autoencoders attempt to undo this packing. An SAE is a wide, shallow network trained to reconstruct a model's activation vector x while passing it through a sparse bottleneck. The encoder maps x to a high-dimensional vector of feature activations f(x), most of which are zero for any given input, and the decoder reconstructs an approximation of x as a sparse weighted sum of learned feature directions (the columns of the decoder weight matrix). The hope is that the recovered directions are monosemantic, each corresponding to one human-understandable concept. This dictionary-learning approach to interpretability was popularized by Anthropic's "Towards Monosemanticity" (October 2023) and by Cunningham et al. the same year. [5][6]
The standard recipe applies a ReLU encoder and trains it to minimize a reconstruction error plus an L1 penalty on the feature activations, where the L1 term encourages sparsity because it is the convex surrogate for the true (but non-differentiable) count of active features. This setup has two well-documented weaknesses. First, an L1 penalty applied to the activations themselves creates a shrinkage bias: because the penalty grows with activation magnitude, the optimizer systematically pushes feature values below their correct levels, harming reconstruction even when the right features have been selected. Second, there is an inherent reconstruction-versus-sparsity tradeoff, and the L1 surrogate sits on a worse point of that frontier than necessary. Gated SAEs (Rajamanoharan et al., April 2024) and TopK SAEs (Gao et al., OpenAI, June 2024) were each proposed to push this Pareto frontier outward; JumpReLU SAEs continue that line of work. [3][4]
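The shrinkage bias can be seen in a one-feature toy problem. The sketch below assumes a single feature with a unit-norm decoder direction and no interaction with other features, so the per-feature objective reduces to minimizing (z - a)^2 + lambda * a over a nonnegative activation a, where z is the value that would reconstruct the input exactly. The function names and numbers are illustrative and do not come from the cited papers.

```python
import numpy as np

def l1_optimal_activation(z, lam):
    """Closed-form minimizer over a >= 0 of (z - a)**2 + lam * a (soft-thresholding)."""
    return np.maximum(0.0, z - lam / 2.0)

def brute_force_minimizer(z, lam, grid):
    """Check the closed form by scanning a grid of candidate activations."""
    losses = (z - grid) ** 2 + lam * np.abs(grid)
    return grid[np.argmin(losses)]

z_true = np.array([0.1, 1.0, 3.0])   # hypothetical "correct" feature values
lam = 0.4                            # hypothetical L1 coefficient

shrunk = l1_optimal_activation(z_true, lam)
grid = np.linspace(0.0, 4.0, 40001)
checked = np.array([brute_force_minimizer(z, lam, grid) for z in z_true])

# Every surviving activation is pulled below its correct value by lam / 2,
# regardless of how large the correct value is.
print(shrunk)    # small values are zeroed, larger ones are reduced by 0.2
print(checked)   # the grid search agrees with the closed form
```

In this toy setting the penalty does two jobs at once: it zeroes small activations, which is the intended sparsity effect, and it subtracts a constant from every activation that survives, which is the shrinkage.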
A JumpReLU SAE keeps the usual encoder and decoder structure. The encoder produces feature activations
f(x) = JumpReLU_theta(W_enc x + b_enc),
and the decoder reconstructs the input as
x_hat(f) = W_dec f + b_dec.
The JumpReLU activation is applied elementwise and is defined as
JumpReLU_theta(z) = z * H(z - theta),
where H is the Heaviside step function (one for strictly positive arguments, zero otherwise) and theta is a strictly positive threshold. [1] Crucially, theta is a learned vector with one entry per feature, so every feature gets its own cutoff. The activation acts as a shifted Heaviside gate multiplied by the input: any pre-activation at or below its feature's threshold is set to zero, while any pre-activation above it is passed through at full magnitude. This is the source of the name: as a pre-activation crosses the cutoff, the output "jumps" discontinuously from zero to the threshold value. Because the gate either zeroes a feature or leaves it untouched, the architecture does not shrink the magnitudes of the features it keeps, directly addressing the shrinkage bias of L1-trained ReLU SAEs. At inference the activation is just a threshold comparison, so a JumpReLU SAE costs essentially the same to run as a plain ReLU SAE: the encoder is one matrix multiply followed by an elementwise gate. [1]
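The forward pass defined by the three equations above can be sketched in a few lines of NumPy. This is a minimal illustration under assumed toy shapes and random weights, not the authors' implementation; the names `jumprelu` and `sae_forward` are hypothetical. A pre-activation exactly equal to its threshold is treated as inactive, matching the description in the text.

```python
import numpy as np

def jumprelu(z, theta):
    """JumpReLU_theta(z) = z * H(z - theta): zero at or below theta, identity above."""
    return np.where(z > theta, z, 0.0)

def sae_forward(x, W_enc, b_enc, W_dec, b_dec, theta):
    """Return (feature activations f, reconstruction x_hat) for one input vector x."""
    pre = W_enc @ x + b_enc          # encoder pre-activations, one per feature
    f = jumprelu(pre, theta)         # per-feature hard threshold
    x_hat = W_dec @ f + b_dec        # sparse weighted sum of decoder columns
    return f, x_hat

# Toy dimensions and random parameters (assumptions for illustration only).
rng = np.random.default_rng(0)
d_model, n_features = 4, 16
W_enc = rng.normal(size=(n_features, d_model))
W_dec = rng.normal(size=(d_model, n_features))
b_enc = np.zeros(n_features)
b_dec = np.zeros(d_model)
theta = np.full(n_features, 0.5)     # one positive threshold per feature

x = rng.normal(size=d_model)
f, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec, theta)

# Elementwise behaviour on hand-picked pre-activations with threshold 0.5:
demo = jumprelu(np.array([-1.0, 0.3, 0.5, 0.7, 2.0]), 0.5)
print(demo)   # values at or below 0.5 become 0; 0.7 and 2.0 pass through unchanged
```

Every nonzero entry of `f` is strictly greater than its threshold, so no surviving feature is rescaled on its way to the decoder.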
Freed from the need to use L1 as a differentiable proxy, the authors train directly against the quantity they actually care about, the L0 "norm" (the number of nonzero features). The loss combines squared reconstruction error with an L0 sparsity penalty:
L(x) = || x - x_hat(f(x)) ||_2^2 + lambda * || f(x) ||_0,
where lambda controls the reconstruction-versus-sparsity tradeoff. [1] The difficulty is that both the Heaviside gate and the L0 count are piecewise constant in the threshold theta, so their gradient with respect to theta is zero almost everywhere and undefined at the jump. Ordinary backpropagation therefore provides no signal to move the thresholds.
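A small numerical check makes the zero-gradient problem concrete. The sketch below evaluates the loss for fixed, hand-chosen pre-activations (bypassing the encoder for brevity) and an identity decoder; all values are illustrative assumptions. Nudging the thresholds slightly leaves the loss unchanged, while moving one threshold past a pre-activation changes it in a single step.

```python
import numpy as np

def jumprelu(z, theta):
    """Zero at or below theta, identity above."""
    return np.where(z > theta, z, 0.0)

def sae_loss(x, pre, W_dec, b_dec, theta, lam):
    """Squared reconstruction error plus lam times the number of active features."""
    f = jumprelu(pre, theta)
    x_hat = W_dec @ f + b_dec
    recon = np.sum((x - x_hat) ** 2)
    l0 = np.count_nonzero(f)
    return recon + lam * l0

# Hand-picked toy values (assumptions for illustration only).
pre = np.array([0.2, 0.9, 1.5])      # encoder pre-activations for one input
x = np.array([0.0, 1.0, 1.5])        # the activation vector being reconstructed
W_dec, b_dec = np.eye(3), np.zeros(3)
lam = 0.1
theta = np.array([0.5, 0.5, 0.5])

base = sae_loss(x, pre, W_dec, b_dec, theta, lam)
nudged = sae_loss(x, pre, W_dec, b_dec, theta + 1e-4, lam)
jumped = sae_loss(x, pre, W_dec, b_dec, np.array([0.5, 1.0, 0.5]), lam)

print(base, nudged)   # identical: a small change in theta does not change the loss
print(jumped)         # different: theta[1] crossed pre[1], so the feature switched off
```

Because the loss is flat between crossings, a finite-difference or autodiff gradient with respect to theta is zero for almost every input, which is why a surrogate gradient is needed.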
The paper solves this with straight-through estimators (STEs). On the backward pass it substitutes custom pseudo-derivatives for the Heaviside and JumpReLU functions, built from a kernel function K with a small bandwidth epsilon. The effect is that the gradient with respect to a feature's threshold is estimated only from the pre-activations that fall within a narrow window of width epsilon around that threshold, which amounts to a kernel density estimate of how many pre-activations sit near the cutoff. The authors show that this approximates the gradient of the expected loss, balancing two pressures at each threshold: lowering it admits more features and improves reconstruction, while raising it removes features and reduces the L0 penalty. In their experiments K is a rectangle (boxcar) kernel and the bandwidth is set to epsilon = 0.001 on input data normalized to unit mean squared norm. [1] The pseudo-derivatives affect only the backward pass during training; the forward pass, in training and at inference alike, uses the exact hard threshold.
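The threshold gradient can be sketched by hand for a single input. The pseudo-derivative forms used below, -(theta / epsilon) * K((z - theta) / epsilon) for JumpReLU and -(1 / epsilon) * K((z - theta) / epsilon) for the Heaviside step, are one reading of the estimator in [1] and should be checked against the paper; the function names, identity decoder, and numbers are assumptions chosen for illustration.

```python
import numpy as np

def rectangle_kernel(u):
    """Boxcar kernel: 1 for |u| < 1/2, else 0 (integrates to 1)."""
    return (np.abs(u) < 0.5).astype(float)

def threshold_pseudo_grad(x, pre, W_dec, b_dec, theta, lam, eps):
    """Pseudo-gradient of the loss w.r.t. theta for one input (assumed STE forms)."""
    f = np.where(pre > theta, pre, 0.0)           # exact forward pass
    x_hat = W_dec @ f + b_dec
    dL_df = -2.0 * W_dec.T @ (x - x_hat)          # true gradient of the squared error
    k = rectangle_kernel((pre - theta) / eps) / eps
    d_f_d_theta = -theta * k                      # pseudo-derivative of JumpReLU
    d_h_d_theta = -k                              # pseudo-derivative of the Heaviside step
    return dL_df * d_f_d_theta + lam * d_h_d_theta

# Toy values: only the middle pre-activation lies within eps of its threshold.
pre = np.array([0.2, 0.5003, 1.5])
x = np.array([0.0, 1.0, 1.5])
W_dec, b_dec = np.eye(3), np.zeros(3)
theta = np.full(3, 0.5)
lam, eps = 0.1, 0.001

grad = threshold_pseudo_grad(x, pre, W_dec, b_dec, theta, lam, eps)
print(grad)   # nonzero only for the feature whose pre-activation is near its threshold
```

In this example the middle feature's gradient is positive, so a gradient-descent step lowers its threshold: dropping the feature would cost more in reconstruction error than the L0 penalty it would save. Averaged over a batch, the kernel term counts how many pre-activations fall in the window, which is the kernel density estimate described above.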
JumpReLU SAEs are one of three closely related responses to the shortcomings of L1-trained ReLU SAEs, and the three are easy to confuse. The table below summarizes the differences.
| SAE variant | Sparsity mechanism | Sparsity penalty | Shrinkage bias | Origin |
|---|---|---|---|---|
| Standard (ReLU) | ReLU on encoder pre-activations | L1 on feature activations | Yes | Bricken et al. / Cunningham et al., 2023 [5][6] |
| Gated | separate "gate" path selects features, "magnitude" path sizes them | L1 on gate pre-activations only | Largely avoided | Rajamanoharan et al., April 2024 [3] |
| TopK | keep the k largest pre-activations, zero the rest | none (k fixes L0 directly) | Avoided | Gao et al. (OpenAI), June 2024 [4] |
| JumpReLU | per-feature learned threshold, gate times input | L0 via straight-through estimator | Avoided | Rajamanoharan et al., July 2024 [1] |
There is a direct lineage from Gated to JumpReLU SAEs. The Gated SAE paper had already observed that a Gated SAE with tied weights is mathematically equivalent to a single-layer encoder using a discontinuous "JumpReLU" activation, a name it adopted from earlier work by Erichson et al. (2019). [3] The JumpReLU SAE paper turns this observation into a first-class architecture, training the threshold directly rather than realizing it through an auxiliary gating path, which makes the encoder simpler and cheaper at inference. [1]
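The equivalence can be checked numerically under one assumed form of weight tying: the magnitude path reuses the gate weights with a per-feature positive rescaling exp(r_mag) and has its own bias. That tying scheme, and the names `W_gate`, `b_gate`, `r_mag`, and `b_mag`, are assumptions not stated in this article and should be verified against [3]. Under them, the gate condition can be rewritten as a threshold on the magnitude pre-activation, giving a JumpReLU with threshold max(b_mag - exp(r_mag) * b_gate, 0).

```python
import numpy as np

def gated_tied_encoder(x, W_gate, b_gate, r_mag, b_mag):
    """Gated encoder with tied weights (assumed form): gate selects, magnitude sizes."""
    wx = W_gate @ x
    pi_gate = wx + b_gate                    # gate path decides which features fire
    pi_mag = np.exp(r_mag) * wx + b_mag      # magnitude path sets their size
    return np.where(pi_gate > 0, np.maximum(pi_mag, 0.0), 0.0)

def jumprelu_encoder(x, W_gate, b_gate, r_mag, b_mag):
    """Single-path encoder with a JumpReLU whose threshold absorbs the gate bias."""
    W_enc = np.exp(r_mag)[:, None] * W_gate
    theta = np.maximum(b_mag - np.exp(r_mag) * b_gate, 0.0)
    pre = W_enc @ x + b_mag
    return np.where(pre > theta, pre, 0.0)

# Random toy parameters and inputs (assumptions for illustration only).
rng = np.random.default_rng(0)
d_model, n_features = 4, 8
W_gate = rng.normal(size=(n_features, d_model))
b_gate = rng.normal(size=n_features)
r_mag = rng.normal(size=n_features)
b_mag = rng.normal(size=n_features)
X = rng.normal(size=(200, d_model))

gated = np.stack([gated_tied_encoder(x, W_gate, b_gate, r_mag, b_mag) for x in X])
jump = np.stack([jumprelu_encoder(x, W_gate, b_gate, r_mag, b_mag) for x in X])
print(np.allclose(gated, jump))   # the two encoders agree on these random inputs
```

The single-path version needs one matrix multiply and one comparison per feature, which illustrates why treating the threshold as a parameter in its own right gives a simpler encoder.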
The contrast with TopK is also instructive. A TopK SAE enforces exactly k active features for every input by keeping only the k largest pre-activations, which sets the L0 sparsity precisely but requires a top-k selection at inference and imposes the same feature count on every token. A JumpReLU SAE instead lets the number of active features vary from input to input, since a given input simply activates however many features clear their thresholds. The DeepMind team argued this is more natural, because some inputs genuinely contain more active concepts than others. [1] In their head-to-head evaluation on Gemma 2 9B, across the residual stream, attention outputs, and MLP outputs at layers 9, 20, and 31, JumpReLU SAEs delivered similar or better reconstruction fidelity at a given sparsity than both Gated and TopK SAEs. [1] Interpretability was assessed with a blind manual rating study and an automated study using a Gemini model to simulate feature activations; all three architectures produced similarly interpretable features, with the automated metric showing a small improvement from Gated to JumpReLU. [1] One noted tradeoff is that, like TopK, JumpReLU SAEs tend to learn a few more very high frequency features than Gated SAEs, though these remained a small fraction of the dictionary (fewer than roughly 0.06 percent of features in the 131K-width SAEs), and both architectures produced few dead features without the resampling tricks earlier SAEs required. [1]
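The difference in per-input sparsity can be illustrated with two hand-made pre-activation vectors, one "quiet" and one "busy". This is a toy comparison with made-up numbers, a shared threshold of 0.8, and k = 2; it is not drawn from the paper's experiments.

```python
import numpy as np

def topk_activation(pre, k):
    """Keep the k largest pre-activations and zero the rest."""
    out = np.zeros_like(pre)
    idx = np.argpartition(pre, -k)[-k:]   # indices of the k largest entries
    out[idx] = pre[idx]
    return out

def jumprelu(pre, theta):
    """Zero at or below theta, identity above."""
    return np.where(pre > theta, pre, 0.0)

quiet = np.array([0.1, 0.2, 1.2, 0.05, 0.3])   # one clearly strong feature
busy = np.array([1.4, 0.2, 1.2, 0.9, 1.1])     # four strong features
theta = np.full(5, 0.8)
k = 2

topk_counts = [int(np.count_nonzero(topk_activation(p, k))) for p in (quiet, busy)]
jump_counts = [int(np.count_nonzero(jumprelu(p, theta))) for p in (quiet, busy)]

print(topk_counts)   # [2, 2]: TopK fires exactly k features on both inputs
print(jump_counts)   # [1, 4]: JumpReLU fires however many clear their thresholds
```

On the quiet input TopK is forced to keep a weak second feature, and on the busy input it discards two strong ones, whereas the thresholded activation adapts its count to the input.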
The most prominent deployment of the architecture is Gemma Scope, an open suite of JumpReLU SAEs released by DeepMind in July 2024 and described in "Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2" by Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, Janos Kramar, Anca Dragan, Rohin Shah, and Neel Nanda (arXiv, August 2024; presented at the BlackboxNLP workshop at EMNLP 2024). [2] Gemma Scope trains SAEs at three sites in every layer (the attention head outputs before the final output projection, the MLP outputs, and the post-MLP residual stream) of Gemma 2 2B and 9B, plus selected layers of Gemma 2 27B. [2]
The release comprises more than 400 SAEs in the headline set (and over 2,000 once the multiple sparsity levels per site are counted), totaling more than 30 million learned features, with dictionary widths ranging from about 16,000 (2^14) up to roughly one million (2^20) features. [2] Each SAE was trained on 4 to 16 billion tokens of text. Producing the suite was a large undertaking: the authors report that training consumed more than 20 percent of the compute used to pretrain GPT-3 and required saving roughly 20 pebibytes of model activations to disk. [2] The weights are published on Hugging Face under a permissive CC-BY-4.0 license, with an interactive feature browser hosted on Neuronpedia, lowering the barrier to interpretability research that previously required training SAEs from scratch. [2][8][9]
JumpReLU SAEs represent the convergence, in mid-2024, of a fast-moving line of work on making sparse autoencoders both more faithful and more practical. By training against an L0 objective through a straight-through estimator, the method removes the shrinkage bias of the L1 penalty and advances the reconstruction-versus-sparsity frontier, while keeping the encoder as cheap to run as a standard ReLU SAE and avoiding the per-token feature-count rigidity of TopK. [1] Its broader importance comes from Gemma Scope, which made hundreds of high-quality SAEs freely available on a capable open model and helped turn SAE-based feature analysis into a standard, reproducible tool for interpretability and AI safety research. [2][9] Training remains sensitive to the STE bandwidth and threshold initialization, and later work has continued to refine SAE methods, so JumpReLU is best understood as a strong, widely adopted point on an actively evolving design space rather than a final answer.