Gated SAE

Updated Jun 28, 2026

A Gated sparse autoencoder (Gated SAE) is a sparse-autoencoder architecture for mechanistic interpretability that splits the encoder into a gating path, which decides which features are active, and a magnitude path, which estimates how strongly they fire, and applies the L1 sparsity penalty to the gate alone. This separation removes the systematic "shrinkage" bias that the L1 penalty causes in a standard ReLU sparse autoencoder, letting a Gated SAE reach a better fidelity-versus-sparsity (Pareto) frontier. It was introduced by Senthooran Rajamanoharan and colleagues at Google DeepMind in April 2024 and is a direct precursor to the JumpReLU SAE. ^[1]

What is a Gated SAE?

A Gated sparse autoencoder (Gated SAE) is a variant of the sparse autoencoder (SAE) developed for mechanistic interpretability, the practice of decomposing the dense internal activations of a neural network into a much larger set of sparse, individually interpretable features. Its central idea is to split the encoder into two paths: a gating path that decides which features are active for a given input, and a magnitude path that estimates how strongly the active features fire. By applying the sparsity penalty only to the gating path, a Gated SAE removes the systematic downward bias on feature magnitudes, the "shrinkage," that the standard L1 penalty introduces in plain ReLU SAEs. ^[1]

The technique was introduced in "Improving Dictionary Learning with Gated Sparse Autoencoders," posted to arXiv on 25 April 2024 by Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, Janos Kramar, Rohin Shah, and Neel Nanda, all of Google DeepMind, and later published at the 2024 Conference on Neural Information Processing Systems (NeurIPS). ^[1]^[2] Training SAEs on language models of up to 7 billion parameters, the authors reported that Gated SAEs are a Pareto improvement over standard SAEs at fixed training compute: they solve shrinkage, produce features that human raters judged at least as interpretable as the baseline, and require roughly half as many firing features to reach the same reconstruction fidelity. ^[1] As the paper's abstract puts it, "The key insight of Gated SAEs is to separate the functionality of (a) determining which directions to use and (b) estimating the magnitudes of those directions: this enables us to apply the L1 penalty only to the former, limiting the scope of undesirable side effects." ^[1] The Gated SAE is one of three closely related responses, alongside TopK SAEs and JumpReLU SAEs, to the shortcomings of L1-trained SAEs, and the paper itself drew the link to JumpReLU by observing that a weight-tied Gated SAE is equivalent to a single-layer encoder with a JumpReLU activation, a name it attributes to Erichson et al. (2019). ^[1]

What problem does shrinkage cause in standard SAEs?

Sparse autoencoders attempt to undo superposition, the phenomenon in which a model packs more distinct concepts than it has neurons into overlapping linear directions, so that any single neuron responds to a jumble of unrelated concepts. An SAE is a wide, shallow network trained to reconstruct an activation vector x while forcing it through a sparse bottleneck. The encoder maps x to a high-dimensional vector of feature activations f(x), most of which are zero for any given input, and the decoder reconstructs an approximation x_hat as a sparse weighted sum of learned feature directions (the columns of the decoder weight matrix). This dictionary learning approach to interpretability was popularized by Anthropic's "Towards Monosemanticity" (October 2023) and by Cunningham et al. the same year. ^[4]^[5]

The standard recipe uses a ReLU encoder and minimizes a reconstruction error plus an L1 penalty on the feature activations f(x). The L1 term serves as a convex surrogate, one that supplies useful gradients, for the true sparsity objective, the L0 "norm," which counts how many features are nonzero and whose gradient is zero almost everywhere. ^[1] The problem is that the same activations f(x) that drive the reconstruction are also the quantity being penalized. Because the L1 penalty grows in proportion to activation magnitude, the optimizer can always reduce the penalty by shrinking feature values, even when the correct features have been selected. The result is a systematic underestimation of feature activations known as shrinkage: the SAE reports magnitudes that are too small, which damages reconstruction fidelity for reasons unrelated to choosing the wrong features. ^[1] Shrinkage is a bias rather than noise, so it does not average out, and the authors demonstrate it empirically by fitting an optimal scalar rescaling of the reconstruction after training (the paper's relative reconstruction bias) and showing that for baseline SAEs the best rescaling reliably scales the reconstruction up. ^[1]
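The shrinkage effect can be seen in a one-feature toy case. The sketch below rests on simplifying assumptions that are not taken from the paper: a single feature with a unit-norm decoder direction, a positive scalar input x, and the loss (x - f)^2 + lam * f minimized over f >= 0. Under those assumptions the minimizer is f = max(x - lam / 2, 0), so the right feature is selected but its activation lands below the true value. The function names are hypothetical.

```python
import numpy as np

def l1_optimal_activation(x: float, lam: float) -> float:
    """Closed-form minimizer of (x - f)**2 + lam * f over f >= 0."""
    return max(x - lam / 2.0, 0.0)

def grid_search_activation(x: float, lam: float, steps: int = 200_001) -> float:
    """Brute-force check of the closed form on a fine grid over [0, 2x]."""
    grid = np.linspace(0.0, 2.0 * x, steps)
    losses = (x - grid) ** 2 + lam * grid
    return float(grid[np.argmin(losses)])

x_true, lam = 3.0, 1.0
f_closed = l1_optimal_activation(x_true, lam)  # 2.5, below the true value 3.0
f_grid = grid_search_activation(x_true, lam)   # agrees up to grid resolution
```

The gap of lam / 2 does not depend on how much data is seen, which is why the text describes shrinkage as a bias rather than noise.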

How does a Gated SAE work?

The Gated SAE attacks shrinkage by decoupling the two jobs the encoder is implicitly doing: deciding which feature directions to use, and estimating the magnitudes of those directions. ^[1] The encoder is reorganized around two pre-activation vectors computed from the centered input. A gating pre-activation,

pi_gate(x) = W_gate (x - b_dec) + b_gate,

decides which features turn on, and a magnitude pre-activation,

pi_mag(x) = W_mag (x - b_dec) + b_mag,

sets their sizes. The feature activations are then formed by gating the magnitude path with a hard binary switch:

f(x) = Heaviside(pi_gate(x)) * ReLU(pi_mag(x)),

where the Heaviside step function returns one for positive inputs and zero otherwise, and the multiplication is elementwise. ^[1] A feature fires only if its gate pre-activation is positive, and when it fires its value comes from the ReLU of the magnitude path. The reconstruction is the usual linear decode, x_hat = W_dec f(x) + b_dec.
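The forward pass described above can be sketched in NumPy. This is an illustrative sketch rather than the authors' implementation: the function name, the batch-first layout, and the shape conventions (encoder matrices of shape (M, n), decoder of shape (n, M)) are assumptions chosen for readability.

```python
import numpy as np

def gated_sae_forward(x, W_gate, b_gate, W_mag, b_mag, W_dec, b_dec):
    """Gated SAE forward pass for a batch x of shape (batch, n).

    Assumed shapes: W_gate, W_mag (M, n); b_gate, b_mag (M,);
    W_dec (n, M); b_dec (n,).
    """
    x_centered = x - b_dec
    pi_gate = x_centered @ W_gate.T + b_gate   # decides which features fire
    pi_mag = x_centered @ W_mag.T + b_mag      # estimates how strongly they fire
    gate = (pi_gate > 0).astype(x.dtype)       # Heaviside step: 1 if positive, else 0
    f = gate * np.maximum(pi_mag, 0.0)         # elementwise gate times ReLU(pi_mag)
    x_hat = f @ W_dec.T + b_dec                # linear decode
    return f, x_hat, pi_gate

# Tiny hand-checkable example with n = 2 inputs and M = 2 features.
x = np.array([[1.0, 2.0]])
eye = np.eye(2)
f, x_hat, pi_gate = gated_sae_forward(
    x,
    W_gate=eye, b_gate=np.array([-1.5, 0.0]),
    W_mag=2.0 * eye, b_mag=np.array([0.0, 0.5]),
    W_dec=eye, b_dec=np.zeros(2),
)
# pi_gate is [-0.5, 2.0], so only the second feature fires, with value 4.5.
```

In this example the first feature has a positive magnitude pre-activation (2.0) but is still switched off, because its gate pre-activation is negative.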

The key move is where the sparsity penalty is applied. Rather than penalizing the feature activations f(x) directly, the Gated SAE applies the L1 penalty to ReLU(pi_gate(x)), the positive part of the gate pre-activations. ^[1] Because the magnitudes that feed the reconstruction are never penalized, there is no pressure to shrink them, which is precisely how the architecture avoids the shrinkage bias.

How do weight sharing and the auxiliary loss keep the two paths aligned?

Naively, splitting the encoder into two independent linear maps would roughly double the encoder parameter count and risk the two paths drifting apart. The paper resolves this with a weight-sharing scheme: the magnitude weights are tied to the gate weights up to a learned per-feature rescaling,

W_mag = exp(r_mag) * W_gate,

where r_mag is a learnable vector with one entry per feature and the exponential is elementwise, so each feature's gate direction is multiplied by its own positive factor exp(r_mag_i). ^[1] The two paths therefore use the same directions in activation space, differing only by a positive per-feature scale, so a weight-tied Gated SAE adds only the rescaling vector r_mag and the bias b_mag over a baseline SAE, that is, 2M additional parameters for a dictionary of M features. ^[1]
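A short NumPy sketch of the weight tying and the resulting parameter count follows. The helper name and the (M, n) row-per-feature layout are assumptions made for this illustration, and the baseline is assumed to consist of one encoder matrix, one encoder bias, a decoder matrix, and a decoder bias.

```python
import numpy as np

def tied_magnitude_weights(W_gate, r_mag):
    """Return W_mag with W_mag[i, :] = exp(r_mag[i]) * W_gate[i, :]."""
    return np.exp(r_mag)[:, None] * W_gate

M, n = 4, 3                                  # dictionary width, activation width
rng = np.random.default_rng(0)
W_gate = rng.normal(size=(M, n))
r_mag = np.array([0.0, 0.5, -0.5, 1.0])
W_mag = tied_magnitude_weights(W_gate, r_mag)

# Parameter counts under the assumptions stated above.
baseline_params = M * n + M + n * M + n      # W_enc, b_enc, W_dec, b_dec
gated_tied_params = baseline_params + 2 * M  # plus r_mag and b_mag
extra_params = gated_tied_params - baseline_params
```

Because exp(r_mag) is always positive, each row of W_mag points in the same direction as the matching row of W_gate, which is what keeps the two paths from drifting apart.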

The hard gate creates a subtlety: the Heaviside step has zero gradient almost everywhere, so the main reconstruction loss sends no gradient through pi_gate, and the L1 penalty on ReLU(pi_gate) only pushes the gates toward sparsity, not toward reconstructing well. To keep the gate path aligned with the data, the training objective adds an auxiliary reconstruction loss. The full Gated SAE loss is

L(x) = || x - x_hat(f(x)) ||_2^2 + lambda || ReLU(pi_gate(x)) ||_1 + || x - x_hat_frozen(ReLU(pi_gate(x))) ||_2^2,

combining the main reconstruction error, the L1 sparsity penalty on the gate, and an auxiliary term that asks a copy of the decoder to reconstruct x directly from ReLU(pi_gate(x)). ^[1] Crucially, the auxiliary decoder is a frozen (stop-gradient) copy: its weights receive no gradient from the auxiliary term, so the auxiliary loss trains the gate path to be reconstructive without letting that secondary objective distort the main decoder. The authors report ablations showing that freezing the decoder in the auxiliary task is important to the method's performance. ^[1]
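The three terms of the loss can be computed as in the NumPy sketch below. Everything about the packaging is an assumption for illustration: the function name, the shape conventions, and averaging over the batch. NumPy has no automatic differentiation, so the frozen decoder copy is indicated only by a comment; in an autodiff framework the decoder parameters in the auxiliary term would be wrapped in that framework's stop-gradient operation.

```python
import numpy as np

def gated_sae_loss(x, W_gate, b_gate, r_mag, b_mag, W_dec, b_dec, lam):
    """Return (total, reconstruction, sparsity, auxiliary), averaged over the batch."""
    x_centered = x - b_dec
    pi_gate = x_centered @ W_gate.T + b_gate
    W_mag = np.exp(r_mag)[:, None] * W_gate          # weight tying
    pi_mag = x_centered @ W_mag.T + b_mag
    f = (pi_gate > 0) * np.maximum(pi_mag, 0.0)      # gated feature activations
    x_hat = f @ W_dec.T + b_dec

    relu_gate = np.maximum(pi_gate, 0.0)
    # Auxiliary reconstruction from the gate path. With autodiff, W_dec and
    # b_dec in this line would be stop-gradient copies (the frozen decoder).
    x_hat_aux = relu_gate @ W_dec.T + b_dec

    recon = np.sum((x - x_hat) ** 2, axis=-1).mean()
    sparsity = lam * np.sum(relu_gate, axis=-1).mean()  # L1 of a nonnegative vector
    aux = np.sum((x - x_hat_aux) ** 2, axis=-1).mean()
    return recon + sparsity + aux, recon, sparsity, aux

x = np.array([[1.0, 2.0]])
eye = np.eye(2)
total, recon, sparsity, aux = gated_sae_loss(
    x, W_gate=eye, b_gate=np.array([-1.5, 0.0]), r_mag=np.zeros(2),
    b_mag=np.array([0.0, 0.5]), W_dec=eye, b_dec=np.zeros(2), lam=0.1,
)
```

Note that the sparsity term depends only on pi_gate, so nothing in it pulls the magnitudes ReLU(pi_mag) toward zero.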

How is a Gated SAE connected to JumpReLU?

A central theoretical observation in the paper is that the weight-tied Gated SAE is a reparameterization of a simpler architecture. When the gate and magnitude paths share directions exactly through W_mag = exp(r_mag) * W_gate, the combined operation of "gate by Heaviside(pi_gate), then take ReLU of the rescaled magnitude path" collapses to a single linear encoder followed by a discontinuous, pointwise JumpReLU activation. ^[1] The JumpReLU activation, which the paper attributes to Erichson et al. (2019), behaves like a ReLU with a per-feature threshold theta:

JumpReLU_theta(z) = z if z > theta, else 0.

Any pre-activation at or below its feature's threshold is forced to exactly zero, while any pre-activation above it passes through unchanged. Because the gate either kills a feature or leaves its magnitude untouched, this construction does not shrink the features it keeps, which is the same property that gives the full Gated SAE its advantage. ^[1]
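The equivalence can be checked numerically. In the sketch below the implied threshold theta = b_mag - exp(r_mag) * b_gate comes from substituting the tied weights into the gate condition pi_gate > 0, and clamping it at zero covers features whose implied threshold is negative, where the ReLU on the magnitude path is the binding constraint. That clamping detail follows from the algebra of this sketch and is not a quotation from the paper. The decoder bias is set to zero for brevity.

```python
import numpy as np

def jumprelu(z, theta):
    """Pass z through unchanged where z > theta; output exactly zero elsewhere."""
    return np.where(z > theta, z, 0.0)

rng = np.random.default_rng(0)
M, n, batch = 8, 5, 1000
W_gate = rng.normal(size=(M, n))
b_gate = rng.normal(size=M)
r_mag = rng.normal(size=M)
b_mag = rng.normal(size=M)
x = rng.normal(size=(batch, n))               # b_dec assumed to be zero here

scale = np.exp(r_mag)
pi_gate = x @ W_gate.T + b_gate
pi_mag = scale * (pi_gate - b_gate) + b_mag   # same as (exp(r_mag) * W_gate) x + b_mag

# Two-path form: Heaviside gate times ReLU of the magnitude path.
f_gated = (pi_gate > 0) * np.maximum(pi_mag, 0.0)

# Single-path form: one pre-activation followed by a thresholded activation.
theta = b_mag - scale * b_gate
f_jump = jumprelu(pi_mag, np.maximum(theta, 0.0))

max_abs_diff = float(np.max(np.abs(f_gated - f_jump)))
```

On these random inputs the two forms should agree up to floating-point rounding, and any feature that survives keeps its full pre-activation value.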

This observation seeded a direct line of follow-up work. The later JumpReLU SAE paper (Rajamanoharan et al., July 2024) promoted the JumpReLU activation to a first-class architecture, learning the threshold theta directly through straight-through gradient estimators rather than realizing it indirectly through a gating path and auxiliary loss. ^[3] In this sense the Gated SAE both diagnosed the shrinkage problem and supplied the conceptual seed for its own successor.

What results did the Gated SAE paper report?

The authors trained Gated and baseline SAEs on three models of increasing scale: a one-layer model with GELU activations (GELU-1L), Pythia-2.8B, and Gemma-7B, the predecessor of Gemma 2. For each model they trained SAEs at three sites: the residual stream, the MLP layer outputs, and the attention layer outputs taken before the output projection W_O. ^[1] Reconstruction quality was measured by "loss recovered," the fraction of the model's cross-entropy loss preserved when the SAE's reconstruction is spliced back into the forward pass (0 percent for zero-ablation, 100 percent for a perfect reconstruction), and sparsity by L0, the average number of features active per input. ^[1]
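The two metrics can be written as small helper functions. This is a sketch of one common way to define them that is consistent with the endpoints given above (0 for zero-ablation, 1 for a perfect reconstruction); the function and argument names are hypothetical, and the cross-entropy values are assumed to have been measured separately.

```python
import numpy as np

def loss_recovered(ce_clean, ce_spliced, ce_zero_ablation):
    """Fraction of the clean-vs-ablated loss gap recovered by the SAE splice.

    ce_clean: loss of the unmodified model.
    ce_spliced: loss with the SAE reconstruction substituted at the site.
    ce_zero_ablation: loss with the site's activations replaced by zeros.
    """
    return (ce_zero_ablation - ce_spliced) / (ce_zero_ablation - ce_clean)

def mean_l0(features):
    """Average number of nonzero features per input; features is (batch, M)."""
    return float(np.count_nonzero(features, axis=-1).mean())

example_recovered = loss_recovered(ce_clean=2.0, ce_spliced=3.0, ce_zero_ablation=6.0)
example_l0 = mean_l0(np.array([[0.0, 1.5, 0.0],
                               [2.0, 0.1, 0.0]]))
```

A Pareto comparison then plots loss recovered against mean L0 for each trained SAE and asks which architecture sits higher at a given sparsity.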

Across these models and sites, Gated SAEs traced out a better reconstruction-versus-sparsity frontier than standard SAEs at equal training compute, a Pareto improvement, with the headline result that at many sites a Gated SAE needs roughly half the L0 to reach the same loss recovered as a baseline SAE. ^[1] The shrinkage bias was effectively removed: the optimal post-hoc rescaling that significantly improves a baseline SAE's activations was close to the identity for Gated SAEs, indicating little residual shrinkage to correct. ^[1]
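The rescaling test for shrinkage can be sketched as a least-squares fit. The version below assumes a single scalar gamma applied to the whole reconstruction, chosen to minimize the mean of ||x_hat / gamma - x||^2, which has the closed form gamma = sum(x_hat . x_hat) / sum(x_hat . x). The function name is hypothetical, and the synthetic "shrunk" reconstruction is made up for the demonstration.

```python
import numpy as np

def relative_reconstruction_bias(x, x_hat):
    """Scalar gamma minimizing mean ||x_hat / gamma - x||^2 over the batch.

    gamma < 1 means the reconstruction is systematically too small
    (shrinkage); gamma close to 1 means no rescaling would help.
    """
    return float(np.sum(x_hat * x_hat) / np.sum(x_hat * x))

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 4))
gamma_shrunk = relative_reconstruction_bias(x, 0.8 * x)   # shrunk reconstruction
gamma_exact = relative_reconstruction_bias(x, x)          # perfect reconstruction
```

Under this definition, a reconstruction that is uniformly 20 percent too small gives gamma = 0.8, while an unbiased one gives gamma = 1.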

To check that the reconstruction gains did not cost feature quality, the team ran a blind human interpretability study. Raters were shown features in random order without being told which SAE, site, or layer each came from, covering 150 features on Pythia-2.8B (five raters) and 192 features on Gemma-7B (seven raters). ^[1] Gated SAE features came out at least as interpretable as baseline features. The authors did not overclaim: the difference was not statistically conclusive (a p-value of about 0.06, with a confidence interval on the mean difference that included zero), so they reported Gated features as comparably interpretable rather than definitively more so. ^[1]

How does a Gated SAE differ from TopK, JumpReLU, and standard SAEs?

The Gated SAE is one of three architectures from 2024, each designed to escape the shrinkage and frontier limitations of L1-trained ReLU SAEs by separating the decision of which features fire from the estimate of how strongly they fire. They differ in how that separation is implemented and trained.

SAE variant	Sparsity mechanism	Sparsity objective	Shrinkage bias	Origin
Standard (ReLU)	ReLU on encoder pre-activations	L1 on feature activations	Yes	Bricken et al. / Cunningham et al., 2023 ^[4]^[5]
Gated	separate gate path selects features, magnitude path sizes them	L1 on gate pre-activations only	Largely avoided	Rajamanoharan et al., April 2024 ^[1]
TopK	keep the k largest pre-activations, zero the rest	none (k fixes L0 directly)	Avoided	Gao et al. (OpenAI), June 2024 ^[6]
JumpReLU	per-feature learned threshold, gate times input	L0 via straight-through estimator	Avoided	Rajamanoharan et al., July 2024 ^[3]

A TopK SAE (Gao et al., OpenAI, June 2024) enforces a fixed number k of active features per input by keeping only the k largest pre-activations and zeroing the rest, which sets L0 exactly but applies the same feature count to every token. A JumpReLU SAE lets the number of active features vary per input, since each input activates however many features clear their thresholds, and learns those thresholds against an L0 objective. The Gated SAE sits between these in spirit: like JumpReLU it allows a variable number of active features and learns data-driven hard gates, but it achieves this through an explicit two-path encoder with an auxiliary loss rather than a single thresholded activation. ^[1]^[3]
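The difference between a fixed and a variable number of active features shows up on a small example. The sketch below applies a TopK selection and a thresholded (JumpReLU-style) activation to the same made-up pre-activations; the helper names are hypothetical and neither function is taken from the cited papers' code.

```python
import numpy as np

def topk_activation(pre, k):
    """Keep the k largest pre-activations in each row and zero the rest."""
    idx = np.argpartition(pre, -k, axis=-1)[..., -k:]   # indices of the k largest
    out = np.zeros_like(pre)
    np.put_along_axis(out, idx, np.take_along_axis(pre, idx, axis=-1), axis=-1)
    return out

def jumprelu(pre, theta):
    """Keep pre-activations strictly above theta and zero the rest."""
    return np.where(pre > theta, pre, 0.0)

pre = np.array([[3.0, 0.2, 1.5, 0.1],    # an input with two strong features
                [0.3, 0.1, 0.2, 0.4]])   # an input with no strong features
topk_out = topk_activation(pre, k=2)
jump_out = jumprelu(pre, theta=1.0)

topk_l0 = np.count_nonzero(topk_out, axis=-1)   # k active features in every row
jump_l0 = np.count_nonzero(jump_out, axis=-1)   # varies with the input
```

TopK activates two features on the second input even though none is strong, whereas the thresholded activation leaves that input with no active features.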

In the head-to-head comparison reported in the later JumpReLU SAE paper, evaluated on Gemma 2 9B, JumpReLU matched or slightly exceeded both Gated and TopK SAEs on reconstruction fidelity at fixed sparsity, while all three produced similarly interpretable features. ^[3] The advantages the JumpReLU paper claimed over the Gated SAE are a simpler encoder (one activation rather than two paths plus an auxiliary objective) and lower training cost: at run time both architectures reduce to a linear map followed by an elementwise gate, but the Gated SAE's auxiliary loss adds extra work to every training step. The Gated SAE remains historically important as the work that named the shrinkage problem precisely, introduced the gate-versus-magnitude framing the whole family now shares, and first connected sparse autoencoders to the JumpReLU activation its successor would adopt. ^[1]^[3]

References

1. Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, Janos Kramar, Rohin Shah, Neel Nanda. "Improving Dictionary Learning with Gated Sparse Autoencoders." arXiv:2404.16014, April 2024. https://arxiv.org/abs/2404.16014
2. Senthooran Rajamanoharan et al. "Improving Sparse Decomposition of Language Model Activations with Gated Sparse Autoencoders." Advances in Neural Information Processing Systems 37 (NeurIPS 2024). https://proceedings.neurips.cc/paper_files/paper/2024/file/01772a8b0420baec00c4d59fe2fbace6-Paper-Conference.pdf
3. Senthooran Rajamanoharan, Tom Lieberum, Nicolas Sonnerat, Arthur Conmy, Vikrant Varma, Janos Kramar, Neel Nanda. "Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders." arXiv:2407.14435, July 2024. https://arxiv.org/abs/2407.14435
4. Trenton Bricken, Adly Templeton, Joshua Batson, et al. "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning." Anthropic, Transformer Circuits, October 2023. https://transformer-circuits.pub/2023/monosemantic-features/index.html
5. Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, Lee Sharkey. "Sparse Autoencoders Find Highly Interpretable Features in Language Models." arXiv:2309.08600, September 2023. https://arxiv.org/abs/2309.08600
6. Leo Gao, Tom Dupre la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, Jeffrey Wu. "Scaling and Evaluating Sparse Autoencoders." arXiv:2406.04093, June 2024. https://arxiv.org/abs/2406.04093
