Gated SAE
Last reviewed
Jun 8, 2026
Sources
6 citations
Review status
Source-backed
Revision
v1 · 2,144 words
A Gated sparse autoencoder (Gated SAE) is a variant of the sparse autoencoder (SAE) developed for mechanistic interpretability, the practice of decomposing the dense internal activations of a neural network into a much larger set of sparse, individually interpretable features. Its central idea is to split the encoder into two paths: a gating path that decides which features are active for a given input, and a magnitude path that estimates how strongly the active features fire. By applying the sparsity penalty only to the gating path, a Gated SAE removes the systematic downward bias on feature magnitudes, the "shrinkage," that the standard L1 penalty introduces in plain ReLU SAEs. [1]
The technique was introduced in "Improving Dictionary Learning with Gated Sparse Autoencoders," posted to arXiv on 25 April 2024 by Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, Janos Kramar, Rohin Shah, and Neel Nanda, all of Google DeepMind, and later published at the 2024 Conference on Neural Information Processing Systems (NeurIPS). [1][2] Training SAEs on language models up to 7 billion parameters, the authors reported that Gated SAEs are a Pareto improvement over standard SAEs at fixed training compute: they solve shrinkage, produce features that human raters judged at least as interpretable as the baseline, and require roughly half as many firing features to reach the same reconstruction fidelity. [1] The Gated SAE is one of three closely related responses, alongside TopK SAEs and JumpReLU SAEs, to the shortcomings of L1-trained SAEs, and the paper itself pointed toward the JumpReLU SAE by observing that a weight-tied Gated SAE reduces to a single linear encoder with a JumpReLU activation. [1]
Sparse autoencoders attempt to undo superposition, the phenomenon in which a model packs more distinct concepts than it has neurons into overlapping linear directions, so that any single neuron responds to a jumble of unrelated concepts. An SAE is a wide, shallow network trained to reconstruct an activation vector x while forcing it through a sparse bottleneck. The encoder maps x to a high-dimensional vector of feature activations f(x), most of which are zero for any given input, and the decoder reconstructs an approximation x_hat as a sparse weighted sum of learned feature directions (the columns of the decoder weight matrix). This dictionary learning approach to interpretability was popularized by Anthropic's "Towards Monosemanticity" (October 2023) and by Cunningham et al. the same year. [4][5]
The standard recipe uses a ReLU encoder and minimizes a reconstruction error plus an L1 penalty on the feature activations f(x). The L1 term serves as a convex surrogate for the true sparsity objective, the L0 "norm," which counts how many features are nonzero and is piecewise constant, so it provides no useful gradient. [1] The problem is that the same activations f(x) that drive the reconstruction are also the quantity being penalized. Because the L1 penalty grows in proportion to activation magnitude, the optimizer can always reduce the penalty by shrinking feature values, even when the correct features have been selected. The result is a systematic underestimation of feature activations known as shrinkage: the SAE reports magnitudes that are too small, which damages reconstruction fidelity for reasons unrelated to choosing the wrong features. [1] Shrinkage is a bias rather than noise, so it does not average out, and the authors demonstrate it empirically by fitting an optimal per-feature rescaling after training and showing that this rescaling reliably increases activations. [1]
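The shrinkage effect can be reproduced in a one-feature toy problem. The sketch below assumes a single feature with a unit-norm decoder direction, so the per-example objective reduces to (a - f)^2 + lambda * |f|, where a is the true magnitude. The function names and numbers are illustrative and do not come from the paper.

```python
def l1_objective(f, a, lam):
    """Squared reconstruction error plus an L1 penalty for one feature."""
    return (a - f) ** 2 + lam * abs(f)


def best_activation(a, lam, steps=10_000, f_max=10.0):
    """Brute-force the activation f in [0, f_max] that minimizes the objective."""
    candidates = (i * f_max / steps for i in range(steps + 1))
    return min(candidates, key=lambda f: l1_objective(f, a, lam))


def closed_form(a, lam):
    """Analytic minimizer: the true magnitude shifted down by lam / 2, floored at 0."""
    return max(a - lam / 2.0, 0.0)


true_magnitude = 3.0
for lam in (0.0, 0.5, 2.0):
    f_star = best_activation(true_magnitude, lam)
    print(f"lambda={lam}: grid minimum={f_star:.3f}, "
          f"closed form={closed_form(true_magnitude, lam):.3f}")
```

With the true magnitude fixed at 3.0, the minimizer moves to 2.75 for lambda = 0.5 and to 2.0 for lambda = 2.0. The feature is still selected, but its reported value is biased downward by lambda / 2.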
The Gated SAE attacks shrinkage by decoupling the two jobs the encoder is implicitly doing: deciding which feature directions to use, and estimating the magnitudes of those directions. [1] The encoder is reorganized around two pre-activation vectors computed from the centered input. A gating pre-activation,
pi_gate(x) = W_gate (x - b_dec) + b_gate,
decides which features turn on, and a magnitude pre-activation,
pi_mag(x) = W_mag (x - b_dec) + b_mag,
sets their sizes. The feature activations are then formed by gating the magnitude path with a hard binary switch:
f(x) = Heaviside(pi_gate(x)) * ReLU(pi_mag(x)),
where the Heaviside step function returns one for positive inputs and zero otherwise, and the multiplication is elementwise. [1] A feature fires only if its gate pre-activation is positive, and when it fires its value comes from the ReLU of the magnitude path. The reconstruction is the usual linear decode, x_hat = W_dec f(x) + b_dec.
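The forward pass described above can be sketched in a few lines of dependency-free Python. This is a minimal illustration rather than the authors' implementation: the helper names, the row-per-feature layout of W_gate and W_mag, and the toy weights are all assumptions made for the example.

```python
def matvec(W, v):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def gated_sae_forward(x, W_gate, b_gate, W_mag, b_mag, W_dec, b_dec):
    """Return (f, x_hat) for one input vector x."""
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    pi_gate = [p + b for p, b in zip(matvec(W_gate, centered), b_gate)]
    pi_mag = [p + b for p, b in zip(matvec(W_mag, centered), b_mag)]
    # Heaviside(pi_gate) * ReLU(pi_mag), elementwise.
    f = [(1.0 if g > 0 else 0.0) * max(m, 0.0) for g, m in zip(pi_gate, pi_mag)]
    x_hat = [r + b for r, b in zip(matvec(W_dec, f), b_dec)]
    return f, x_hat


# Toy setup: 2-dimensional input, 3 features (hypothetical values).
x = [1.0, 2.0]
b_dec = [0.0, 0.0]
W_gate = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]   # one row per feature
b_gate = [0.0, -3.0, 0.0]
W_mag = [[2.0, 0.0], [0.0, 2.0], [-2.0, -2.0]]
b_mag = [0.0, 0.0, 0.0]
W_dec = [[1.0, 0.0, -0.5], [0.0, 1.0, -0.5]]      # one column per feature

f, x_hat = gated_sae_forward(x, W_gate, b_gate, W_mag, b_mag, W_dec, b_dec)
print(f, x_hat)
```

In this toy input the second feature has a positive magnitude pre-activation (4.0) but a negative gate pre-activation (-1.0), so it is switched off. Only the first feature fires, with the value 2.0 taken from the magnitude path.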
The key move is where the sparsity penalty is applied. Rather than penalizing the feature activations f(x) directly, the Gated SAE applies the L1 penalty to ReLU(pi_gate(x)), the positive part of the gate pre-activations. [1] Because the magnitudes that feed the reconstruction are never penalized, there is no pressure to shrink them, which is precisely how the architecture avoids the shrinkage bias.
Naively, splitting the encoder into two independent linear maps would roughly double the encoder parameter count and risk the two paths drifting apart. The paper resolves this with a weight-sharing scheme: the magnitude weights are tied to the gate weights up to a learned per-feature rescaling,
W_mag = exp(r_mag) * W_gate,
where r_mag is a learnable vector and the exponential is elementwise. [1] The two paths therefore use the same directions in activation space, differing only by a positive per-feature scale, so a weight-tied Gated SAE adds only the rescaling vector r_mag and one extra bias vector (2M additional parameters for a dictionary of width M) over a baseline SAE. [1]
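As a rough illustration of the tying scheme, the sketch below builds W_mag from W_gate and counts parameters. It assumes one row of W_gate per feature and a baseline SAE with a single encoder matrix, encoder bias, decoder matrix, and decoder bias; the function names are hypothetical.

```python
import math


def tie_magnitude_weights(W_gate, r_mag):
    """Scale each feature's gate row by exp(r_mag[i]) to obtain W_mag."""
    return [[math.exp(r) * w for w in row] for row, r in zip(W_gate, r_mag)]


def baseline_param_count(d_model, width):
    """Encoder matrix + encoder bias + decoder matrix + decoder bias."""
    return d_model * width + width + width * d_model + d_model


def tied_gated_param_count(d_model, width):
    """Baseline parameters plus the rescaling vector and one extra bias vector."""
    return baseline_param_count(d_model, width) + 2 * width


W_gate = [[1.0, 2.0], [3.0, 4.0]]
r_mag = [0.0, math.log(2.0)]
print(tie_magnitude_weights(W_gate, r_mag))
print(tied_gated_param_count(512, 4096) - baseline_param_count(512, 4096))
```

Because exp(r_mag) is always positive, each magnitude row points in the same direction as its gate row. Only the scale is free to differ.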
This sharing creates a subtlety: the L1 penalty on ReLU(pi_gate) provides gradients only to keep the gates sparse, not to make them reconstruct well. To keep the gate path aligned with the data, the training objective adds an auxiliary reconstruction loss. The full Gated SAE loss is
L(x) = || x - x_hat(f(x)) ||_2^2 + lambda || ReLU(pi_gate(x)) ||_1 + || x - x_hat_frozen(ReLU(pi_gate(x))) ||_2^2,
combining the main reconstruction error, the L1 sparsity penalty on the gate, and an auxiliary term that asks a copy of the decoder to reconstruct x directly from ReLU(pi_gate(x)). [1] Crucially, the auxiliary decoder is a frozen (stop-gradient) copy: its weights receive no gradient from the auxiliary term, so the auxiliary loss trains the gate path to be reconstructive without letting that secondary objective distort the main decoder. The authors report ablations showing that freezing the decoder in the auxiliary task is important to the method's performance. [1]
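The three terms of the loss can be evaluated separately, as in the sketch below. It only computes values: plain Python has no automatic differentiation, so the frozen decoder copy is marked by a comment, and in a real training loop that copy would be wrapped in the framework's stop-gradient operation. All names and numbers are illustrative assumptions.

```python
def matvec(W, v):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def decode(acts, W_dec, b_dec):
    """Linear decode: W_dec @ acts + b_dec."""
    return [r + b for r, b in zip(matvec(W_dec, acts), b_dec)]


def squared_error(x, x_hat):
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))


def gated_sae_loss_terms(x, pi_gate, f, W_dec, b_dec, lam):
    """Return (reconstruction, sparsity, auxiliary) terms for one input."""
    relu_gate = [max(g, 0.0) for g in pi_gate]
    reconstruction = squared_error(x, decode(f, W_dec, b_dec))
    sparsity = lam * sum(relu_gate)  # L1 norm of a non-negative vector
    # Auxiliary term: W_dec and b_dec act as a frozen copy here. The value is
    # unchanged by freezing; only the gradient flow differs during training.
    auxiliary = squared_error(x, decode(relu_gate, W_dec, b_dec))
    return reconstruction, sparsity, auxiliary


x = [1.0, 2.0]
pi_gate = [1.0, -1.0, -3.0]
f = [2.0, 0.0, 0.0]
W_dec = [[1.0, 0.0, -0.5], [0.0, 1.0, -0.5]]
b_dec = [0.0, 0.0]

terms = gated_sae_loss_terms(x, pi_gate, f, W_dec, b_dec, lam=0.1)
print(terms, sum(terms))
```

Keeping the terms separate makes it easy to log them individually, which helps when checking that the auxiliary term is not dominating the main reconstruction error.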
A central theoretical observation in the paper is that the weight-tied Gated SAE is a reparameterization of a simpler architecture. When the gate and magnitude paths share directions exactly through W_mag = exp(r_mag) * W_gate, the combined operation of "gate by Heaviside(pi_gate), then take ReLU of the rescaled magnitude path" collapses to a single linear encoder followed by a discontinuous, pointwise JumpReLU activation. [1] The JumpReLU activation, which the paper attributes to Erichson et al. (2019), behaves like a ReLU with a per-feature threshold theta:
JumpReLU_theta(z) = z if z > theta, else 0.
Any pre-activation at or below its feature's threshold is forced to exactly zero, while any pre-activation above it passes through unchanged. Because the gate either kills a feature or leaves its magnitude untouched, this construction does not shrink the features it keeps, which is the same property that gives the full Gated SAE its advantage. [1]
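The equivalence can be checked numerically for a single feature. In the sketch below, `pre` stands for the shared projection of the centered input onto the feature's gate direction. Under weight tying the gate is open exactly when pi_mag exceeds theta = b_mag - exp(r_mag) * b_gate, and because the ReLU still zeroes negative magnitudes, the effective threshold used here is max(theta, 0). The parameter values are arbitrary assumptions chosen for illustration.

```python
import math


def jump_relu(z, theta):
    """Pass z through unchanged above the threshold, otherwise return 0."""
    return z if z > theta else 0.0


def gated_activation(pre, r_mag, b_gate, b_mag):
    """Weight-tied gated form: Heaviside(pi_gate) * ReLU(pi_mag)."""
    pi_gate = pre + b_gate
    pi_mag = math.exp(r_mag) * pre + b_mag
    return (1.0 if pi_gate > 0 else 0.0) * max(pi_mag, 0.0)


def jump_relu_form(pre, r_mag, b_gate, b_mag):
    """The same feature written as one linear map followed by JumpReLU."""
    pi_mag = math.exp(r_mag) * pre + b_mag
    theta = b_mag - math.exp(r_mag) * b_gate
    return jump_relu(pi_mag, max(theta, 0.0))


grid = [i * 0.25 for i in range(-20, 21)]
for b_gate in (-0.6, 0.6):
    same = all(
        gated_activation(p, 0.5, b_gate, 0.3) == jump_relu_form(p, 0.5, b_gate, 0.3)
        for p in grid
    )
    print(f"b_gate={b_gate}: forms agree on the grid -> {same}")
```

Any value that survives the threshold is passed through at full size, which is the no-shrinkage property the paragraph describes.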
This observation seeded a direct line of follow-up work. The later JumpReLU SAE paper (Rajamanoharan et al., July 2024) promoted the JumpReLU activation to a first-class architecture, learning the threshold theta directly through straight-through gradient estimators rather than realizing it indirectly through a gating path and auxiliary loss. [3] In this sense the Gated SAE both diagnosed the shrinkage problem and supplied the conceptual seed for its own successor.
The authors trained Gated and baseline SAEs on three models of increasing scale: a one-layer model with GELU activations (GELU-1L), Pythia-2.8B, and Gemma-7B, the predecessor of Gemma 2. For each model they trained SAEs at three sites: the residual stream, the MLP layer outputs, and the attention layer outputs taken before the output projection W_O. [1] Reconstruction quality was measured by "loss recovered," the fraction of the model's cross-entropy loss preserved when the SAE's reconstruction is spliced back into the forward pass (0 percent for zero-ablation, 100 percent for a perfect reconstruction), and sparsity by L0, the average number of features active per input. [1]
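Both metrics are simple to compute once the relevant losses and activations are in hand. The sketch below assumes the linear normalization implied by the two endpoints above (zero-ablation maps to 0, the unmodified model maps to 1); the function names and sample numbers are hypothetical.

```python
def loss_recovered(ce_clean, ce_spliced, ce_zero_ablated):
    """Fraction of the loss gap closed by the SAE reconstruction.

    ce_clean:        cross-entropy of the unmodified model
    ce_spliced:      cross-entropy with the SAE reconstruction spliced in
    ce_zero_ablated: cross-entropy with the activation replaced by zeros
    """
    return (ce_zero_ablated - ce_spliced) / (ce_zero_ablated - ce_clean)


def mean_l0(feature_activations):
    """Average number of non-zero features per input in a batch."""
    counts = [sum(1 for a in acts if a != 0.0) for acts in feature_activations]
    return sum(counts) / len(counts)


print(loss_recovered(ce_clean=2.0, ce_spliced=2.5, ce_zero_ablated=6.0))
print(mean_l0([[2.0, 0.0, 0.0], [0.0, 1.0, 3.0]]))
```

A reconstruction/sparsity frontier is then the set of (mean_l0, loss_recovered) points obtained by sweeping the sparsity coefficient.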
Across these models and sites, Gated SAEs traced out a better reconstruction-versus-sparsity frontier than standard SAEs at equal training compute, a Pareto improvement, with the headline result that at many sites a Gated SAE needs roughly half the L0 to reach the same loss recovered as a baseline SAE. [1] The shrinkage bias was effectively removed: the optimal post-hoc rescaling that significantly improves a baseline SAE's activations was close to the identity for Gated SAEs, indicating little residual shrinkage to correct. [1]
To check that the reconstruction gains did not cost feature quality, the team ran a blind human interpretability study. Raters were shown features in random order without being told which SAE, site, or layer each came from, covering 150 features on Pythia-2.8B (five raters) and 192 features on Gemma-7B (seven raters). [1] Gated SAE features came out at least as interpretable as baseline features. The authors did not overclaim: the difference was not statistically conclusive (a p-value of about 0.06, with a confidence interval on the mean difference that included zero), so they reported Gated features as comparably interpretable rather than definitively more so. [1]
The Gated SAE is one of three architectures from 2024, each designed to escape the shrinkage and frontier limitations of L1-trained ReLU SAEs by separating the decision of which features fire from the estimate of how strongly they fire. They differ in how that separation is implemented and trained.
| SAE variant | Sparsity mechanism | Sparsity objective | Shrinkage bias | Origin |
|---|---|---|---|---|
| Standard (ReLU) | ReLU on encoder pre-activations | L1 on feature activations | Yes | Bricken et al. / Cunningham et al., 2023 [4][5] |
| Gated | separate gate path selects features, magnitude path sizes them | L1 on gate pre-activations only | Largely avoided | Rajamanoharan et al., April 2024 [1] |
| TopK | keep the k largest pre-activations, zero the rest | none (k fixes L0 directly) | Avoided | Gao et al. (OpenAI), June 2024 [6] |
| JumpReLU | per-feature learned threshold, gate times input | L0 via straight-through estimator | Avoided | Rajamanoharan et al., July 2024 [3] |
A TopK SAE (Gao et al., OpenAI, June 2024) enforces a fixed number k of active features per input by keeping only the k largest pre-activations and zeroing the rest, which sets L0 exactly but applies the same feature count to every token. A JumpReLU SAE lets the number of active features vary per input, since each input activates however many features clear their thresholds, and learns those thresholds against an L0 objective. The Gated SAE sits between these in spirit: like JumpReLU it allows a variable number of active features and learns data-driven gates, but it achieves this through an explicit two-path encoder with an auxiliary loss rather than a single thresholded activation. [1][3]
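The difference between a fixed and a variable feature count is easy to see on two toy pre-activation vectors. The sketch below is illustrative only: the helper names, the value of k, and the thresholds are assumptions, and the TopK helper applies no additional ReLU.

```python
def topk_activation(pre, k):
    """Keep the k largest pre-activations and zero the rest."""
    order = sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)
    keep = set(order[:k])
    return [p if i in keep else 0.0 for i, p in enumerate(pre)]


def jump_relu_activation(pre, thresholds):
    """Keep each pre-activation only if it exceeds its own threshold."""
    return [p if p > t else 0.0 for p, t in zip(pre, thresholds)]


def count_active(acts):
    return sum(1 for a in acts if a != 0.0)


strong_input = [3.0, 0.2, 1.5, 0.1]
weak_input = [0.3, 0.2, 0.4, 0.1]
thresholds = [1.0, 1.0, 1.0, 1.0]

for name, pre in (("strong", strong_input), ("weak", weak_input)):
    topk = topk_activation(pre, k=2)
    jump = jump_relu_activation(pre, thresholds)
    print(name, "TopK active:", count_active(topk),
          "JumpReLU active:", count_active(jump))
```

TopK reports two active features for both inputs, while the thresholded activation reports two for the strong input and none for the weak one.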
In the head-to-head comparison reported in the later JumpReLU SAE paper, evaluated on Gemma 2 9B, JumpReLU matched or slightly exceeded both Gated and TopK SAEs on reconstruction fidelity at fixed sparsity, while all three produced similarly interpretable features. [3] The advantages the JumpReLU paper claimed over the Gated SAE are a simpler encoder (one activation rather than two paths plus an auxiliary objective) and lower training cost: the extra cost of the Gated SAE lies in its training-time machinery, since at run time its forward pass is also just a linear map followed by an elementwise gate. The Gated SAE remains historically important as the work that named the shrinkage problem precisely, introduced the gate-versus-magnitude framing the whole family now shares, and first connected SAEs to the JumpReLU activation its successor would adopt. [1][3]