TopK SAE

Updated Jun 9, 2026

Overview

A TopK SAE (TopK sparse autoencoder) is a variant of sparse autoencoder that enforces sparsity by keeping only the K largest latent pre-activations for each input and zeroing all the rest, rather than encouraging sparsity indirectly through an L1 penalty. Because exactly K latents are active per input, the TopK activation directly fixes the L0 "norm" (the count of nonzero activations) to K, which removes the activation-shrinkage bias of L1-regularized autoencoders and eliminates the need to tune a sparsity coefficient ^[1]^[2].

The technique is an application of the k-sparse autoencoder introduced by Alireza Makhzani and Brendan Frey in 2013 ^[3]. It was brought to prominence in mechanistic interpretability by the OpenAI paper "Scaling and evaluating sparse autoencoders" (Leo Gao, Tom Dupre la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, Jeffrey Wu), released on June 6, 2024, which used a TopK activation to train a 16-million-latent autoencoder on GPT-4 residual-stream activations ^[1]^[2].

Background: SAEs and the L1 problem

Sparse autoencoders are an unsupervised method for decomposing the internal activations of a neural network into a larger set of sparsely activating, more interpretable features. The motivating hypothesis is superposition: a model represents far more features than it has neurons by encoding them as nearly orthogonal directions in activation space, so that each neuron participates in many features (polysemanticity). An SAE learns an over-complete dictionary that re-expresses each activation vector as a sparse combination of these feature directions ^[4].

The standard baseline is the ReLU SAE used by Anthropic's "Towards Monosemanticity" work ^[4]. For an input vector x from the residual stream and n latent dimensions, the encoder and decoder are:

z = ReLU(W_enc (x - b_pre) + b_enc)
x_hat = W_dec z + b_pre

It is trained to minimize a reconstruction error plus a sparsity penalty, L = ||x - x_hat||_2^2 + lambda ||z||_1, where the L1 term encourages most latents to be zero and lambda is a hyperparameter ^[1].
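The encoder, decoder, and loss above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code from any of the cited papers: the toy sizes, the random weights, the batch averaging, and the function name `relu_sae_loss` are all assumptions made for the example.

```python
import numpy as np

def relu_sae_loss(x, W_enc, b_enc, W_dec, b_pre, lam):
    """Return (loss, z, x_hat) for a batch x of shape (batch, d)."""
    z = np.maximum(0.0, (x - b_pre) @ W_enc.T + b_enc)   # (batch, n) latents
    x_hat = z @ W_dec.T + b_pre                          # (batch, d) reconstruction
    recon = np.sum((x - x_hat) ** 2, axis=1)             # ||x - x_hat||_2^2
    l1 = np.sum(np.abs(z), axis=1)                       # ||z||_1
    return float(np.mean(recon + lam * l1)), z, x_hat

rng = np.random.default_rng(0)
d, n, batch = 8, 32, 4                 # toy sizes, chosen only for illustration
W_dec = rng.normal(size=(d, n))
W_dec /= np.linalg.norm(W_dec, axis=0, keepdims=True)    # unit-norm decoder columns
W_enc = W_dec.T.copy()                                   # shape (n, d)
b_enc = np.zeros(n)
b_pre = np.zeros(d)
x = rng.normal(size=(batch, d))

loss, z, x_hat = relu_sae_loss(x, W_enc, b_enc, W_dec, b_pre, lam=0.1)
```

Note that `lam` multiplies the same activations `z` that produce `x_hat`, which is the coupling behind both weaknesses of this objective.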

This formulation has two well-known weaknesses ^[1]^[2]:

Shrinkage bias. The L1 penalty acts on the same activation magnitudes that drive reconstruction, so the optimizer can reduce the loss simply by shrinking all positive activations toward zero. This systematically underestimates the true feature magnitudes, a phenomenon studied since the LASSO ^[5] and termed "shrinkage" in the SAE literature ^[6].
Indirect sparsity control. L1 is only a convex surrogate for the true L0 objective. The achieved sparsity depends on lambda in a way that is sensitive to scale and architecture, making lambda awkward to tune and making it hard to compare autoencoders trained at different sparsity levels.
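Shrinkage is easy to see in a one-latent toy problem. Assume a single feature with a unit-norm decoder direction and true magnitude a > 0, so the objective reduces to (a - z)^2 + lambda z over z >= 0; setting the derivative to zero gives z = a - lambda/2, which is always below a. The sketch below checks that closed form against a brute-force grid search. The setup and the name `l1_optimal_activation` are illustrative assumptions, not taken from the cited papers.

```python
import numpy as np

def l1_optimal_activation(a, lam):
    """Minimiser of (a - z)^2 + lam * z over z >= 0 (one latent, unit decoder)."""
    return max(0.0, a - lam / 2.0)

a, lam = 1.0, 0.2                           # true magnitude and L1 coefficient (toy values)
grid = np.linspace(0.0, 2.0, 20001)         # candidate activations
objective = (a - grid) ** 2 + lam * grid
z_grid = float(grid[np.argmin(objective)])  # brute-force optimum
z_closed = l1_optimal_activation(a, lam)    # closed-form optimum

print(z_closed, z_grid)                     # both fall below the true magnitude a
```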

How TopK SAEs work

A TopK SAE replaces the ReLU activation with a TopK operation that keeps only the K largest pre-activations and sets the others to zero ^[1]:

z = TopK(W_enc (x - b_pre))

The decoder is unchanged. Because exactly K latents are nonzero, the training objective reduces to pure reconstruction error with no penalty term, L = ||x - x_hat||_2^2 ^[1]. (In OpenAI's implementation a ReLU is also applied so activations stay non-negative, but for reasonable K the K largest pre-activations are almost always positive, so the training curves are indistinguishable ^[1].)
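A minimal NumPy sketch of the TopK forward pass follows. It is not OpenAI's released implementation; the function names are hypothetical, and `np.argpartition` is used as one convenient way to do the partial selection.

```python
import numpy as np

def topk_activation(pre, k):
    """Keep the k largest entries in each row of pre and zero the rest."""
    idx = np.argpartition(pre, -k, axis=1)[:, -k:]   # indices of the k largest per row
    rows = np.arange(pre.shape[0])[:, None]
    z = np.zeros_like(pre)
    z[rows, idx] = np.maximum(pre[rows, idx], 0.0)   # ReLU on the survivors
    return z

def topk_sae_forward(x, W_enc, W_dec, b_pre, k):
    """z = TopK(W_enc (x - b_pre)), x_hat = W_dec z + b_pre."""
    z = topk_activation((x - b_pre) @ W_enc.T, k)
    x_hat = z @ W_dec.T + b_pre
    return z, x_hat

pre = np.array([[0.1, 3.0, -1.0, 2.0],
                [5.0, 4.0, 0.5, 0.2]])
z = topk_activation(pre, k=2)
```

In this toy input each row keeps exactly two nonzero entries, so the per-row L0 equals K without any penalty term.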

OpenAI reports several benefits of this design ^[1]:

It removes the L1 penalty entirely, and with it the shrinkage bias. Ablation experiments that re-optimize the magnitudes of a frozen set of active latents show ReLU activations are biased toward being too small, whereas TopK activations are not ^[1].
It sets the L0 sparsity directly to K, replacing the continuous lambda with an interpretable integer. This simplifies tuning, enables clean comparison between models, and can be combined with arbitrary activation functions.
It empirically improves the reconstruction-versus-sparsity Pareto frontier over ReLU SAEs, and the advantage grows with scale.
It increases the monosemanticity of features by clamping small spurious activations to exactly zero, giving TopK models fewer spurious positives than ReLU counterparts.

Input and decoder normalization. As in prior SAE work, inputs are mean-subtracted over the model dimension and normalized to unit norm before encoding, and the decoder columns are kept at unit norm so that latent magnitudes are comparable ^[1].
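Both normalization steps are short to write down. The sketch below is illustrative only: the helper names and the small `eps` guard against division by zero are assumptions, and when the decoder is renormalized during training is left open.

```python
import numpy as np

def normalize_inputs(x, eps=1e-8):
    """Subtract the mean over the model dimension, then scale each row to unit norm."""
    x = x - x.mean(axis=1, keepdims=True)
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def renormalize_decoder(W_dec, eps=1e-8):
    """Scale every decoder column (one per latent) to unit norm."""
    return W_dec / (np.linalg.norm(W_dec, axis=0, keepdims=True) + eps)

rng = np.random.default_rng(0)
x = normalize_inputs(rng.normal(size=(3, 16)))          # (batch, d)
W_dec = renormalize_decoder(rng.normal(size=(16, 64)))  # (d, n)
```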

Preventing dead latents (AuxK). A "dead" latent is one that stops activating entirely during training; in large SAEs this can affect a large fraction of latents (OpenAI observed up to 90 percent without mitigation, and Anthropic's 34-million-latent autoencoder had only about 12 million alive ^[1]^[7]). Dead latents waste capacity and worsen reconstruction. OpenAI uses two ingredients to suppress them: initializing the encoder as the transpose of the decoder, and an auxiliary loss called AuxK that models the residual reconstruction error using the top-k_aux currently dead latents. With these techniques, only about 7 percent of latents were dead even in the 16-million-latent autoencoder ^[1].
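The AuxK idea can be sketched as follows, under stated assumptions: `dead_mask` is taken as given (how a latent is flagged as dead is not specified here), a ReLU is applied to the selected dead latents, and the weight with which this term is added to the main loss is left to the caller. The function name `auxk_loss` is hypothetical.

```python
import numpy as np

def auxk_loss(x, x_hat, pre, dead_mask, W_dec, k_aux):
    """Squared error of reconstructing the residual x - x_hat from dead latents only."""
    e = x - x_hat                                      # what the main pass failed to explain
    k_aux = min(k_aux, int(dead_mask.sum()))
    if k_aux == 0:
        return 0.0                                     # no dead latents, nothing to revive
    masked = np.where(dead_mask[None, :], pre, -np.inf)  # hide live latents from the top-k
    idx = np.argpartition(masked, -k_aux, axis=1)[:, -k_aux:]
    rows = np.arange(pre.shape[0])[:, None]
    z_aux = np.zeros_like(pre)
    z_aux[rows, idx] = np.maximum(pre[rows, idx], 0.0)
    e_hat = z_aux @ W_dec.T                            # residual predicted by dead latents
    return float(np.mean(np.sum((e - e_hat) ** 2, axis=1)))

# Toy check: latent 1 is dead and its decoder column matches the residual exactly.
W_dec = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])                    # (d=2, n=3)
x = np.array([[1.0, 1.0]])
x_hat = np.array([[1.0, 0.0]])
pre = np.array([[0.5, 1.0, 0.2]])
aux = auxk_loss(x, x_hat, pre, np.array([False, True, False]), W_dec, k_aux=1)
```

Because the term is computed only from dead latents, it gives them a training signal even though they never win the main TopK selection.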

OpenAI scaling work

The 2024 OpenAI paper is the work that established TopK SAEs as a practical recipe at frontier scale. Its main contributions were a state-of-the-art training methodology, clean scaling laws, and new evaluation metrics ^[1]^[2].

To demonstrate scalability, the authors trained a 16-million-latent TopK autoencoder on GPT-4 residual-stream activations for 40 billion tokens, choosing a layer roughly 5/6 of the way into the network; smaller GPT-2 small autoencoders were trained at layer 8 with a 64-token context ^[1]. They report two regimes of scaling law: an irreducible-loss power law L(C) relating reconstruction mean-squared error (MSE) to training compute, and a joint law over the number of latents n and the sparsity K:

L(n, k) = exp(alpha + beta_k log(k) + beta_n log(n) + gamma log(k) log(n)) + exp(zeta + eta log(k))

with fitted values alpha = -0.50, beta_k = 0.26, beta_n = -0.017, gamma = -0.042, zeta = -1.32, eta = -0.085 on GPT-4 autoencoders, indicating that adding latents reliably lowers MSE and that the L(n) curve steepens as K grows ^[1].
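The joint law is straightforward to evaluate with the constants quoted above. In this sketch natural logarithms are assumed, and the function names are hypothetical; the slope helper simply differentiates the exponent of the first term with respect to log(n).

```python
import math

# Fitted constants quoted above for the GPT-4 autoencoders.
ALPHA, BETA_K, BETA_N, GAMMA, ZETA, ETA = -0.50, 0.26, -0.017, -0.042, -1.32, -0.085

def joint_scaling_law(n, k):
    """L(n, k) for n latents and k active latents (natural logs assumed)."""
    ln_n, ln_k = math.log(n), math.log(k)
    first = math.exp(ALPHA + BETA_K * ln_k + BETA_N * ln_n + GAMMA * ln_k * ln_n)
    second = math.exp(ZETA + ETA * ln_k)   # depends on k only, not on n
    return first + second

def loglog_slope_in_n(k):
    """d log(first term) / d log(n) at fixed k, i.e. beta_n + gamma * log(k)."""
    return BETA_N + GAMMA * math.log(k)

for k in (32, 128, 512):
    print(k, loglog_slope_in_n(k), joint_scaling_law(2 ** 24, k))
```

Because gamma is negative, the slope becomes more negative as k increases, which is the steepening of L(n) described above.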

Because better reconstruction is not the ultimate goal, the paper introduces interpretability-oriented metrics that "generally improve with autoencoder size" ^[1]^[2]:

Downstream loss: the increase in language-model loss (measured by KL divergence and delta cross-entropy) when the residual stream is replaced by the SAE reconstruction. As a scale anchor, substituting the 16-million-latent SAE into GPT-4 gives a loss equivalent to a model trained with about 10 percent of GPT-4's pretraining compute; the zero-ablation reconstruction fidelity is 98.2 percent ^[1].
Probe loss: whether a 1D logistic probe on a single latent can recover hypothesized features across 61 binary classification tasks.
Explainability: whether latents have simple, high-precision and high-recall explanations, evaluated with the N2G (Neuron to Graph) method.
Ablation sparsity: whether ablating a single latent has a sparse effect on downstream logits.
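As an illustration of the first metric, the sketch below computes delta cross-entropy and KL divergence from two sets of next-token logits: one from the unmodified model and one from a run where the residual stream was replaced by the SAE reconstruction. Producing those logits requires a language model and is outside the sketch; the function names and array shapes are assumptions.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def downstream_metrics(logits_clean, logits_spliced, targets):
    """Return (delta cross-entropy, KL(clean || spliced)), averaged over tokens."""
    logp = log_softmax(logits_clean)      # (tokens, vocab), original model
    logq = log_softmax(logits_spliced)    # (tokens, vocab), SAE spliced in
    rows = np.arange(len(targets))
    # CE_spliced - CE_clean = (-logq[target]) - (-logp[target])
    delta_ce = np.mean(logp[rows, targets] - logq[rows, targets])
    kl = np.mean(np.sum(np.exp(logp) * (logp - logq), axis=-1))
    return float(delta_ce), float(kl)

rng = np.random.default_rng(0)
clean = rng.normal(size=(6, 10))                       # toy logits
spliced = clean + 0.1 * rng.normal(size=clean.shape)   # slightly perturbed logits
targets = rng.integers(0, 10, size=6)
delta_ce, kl = downstream_metrics(clean, spliced, targets)
```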

Multi-TopK and progressive recovery. A drawback of training at a fixed K is "overfitting to K": if the activation function is later swapped for TopK with a different K', MSE worsens sharply once K' exceeds the training K, so the code is only "progressive" up to K. To fix this, the paper proposes Multi-TopK, which sums TopK losses at multiple K values. Training with L(k) + L(4k)/8 is enough to yield a code that reconstructs well at all K', so the same autoencoder can be used with either a fixed or a variable number of active latents per token, including by substituting a JumpReLU (fixed-threshold) activation at test time ^[1].
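The Multi-TopK objective quoted above amounts to two forward passes through the same weights with different K. A minimal sketch, with hypothetical helper names and random toy weights:

```python
import numpy as np

def topk(pre, k):
    """Keep the k largest entries per row and zero the rest."""
    idx = np.argpartition(pre, -k, axis=1)[:, -k:]
    rows = np.arange(pre.shape[0])[:, None]
    z = np.zeros_like(pre)
    z[rows, idx] = pre[rows, idx]
    return z

def recon_loss(x, W_enc, W_dec, b_pre, k):
    """Mean squared reconstruction error L(k) with k active latents."""
    z = topk((x - b_pre) @ W_enc.T, k)
    x_hat = z @ W_dec.T + b_pre
    return float(np.mean(np.sum((x - x_hat) ** 2, axis=1)))

def multi_topk_loss(x, W_enc, W_dec, b_pre, k):
    """L(k) + L(4k) / 8, sharing one set of weights across both terms."""
    return (recon_loss(x, W_enc, W_dec, b_pre, k)
            + recon_loss(x, W_enc, W_dec, b_pre, 4 * k) / 8.0)

rng = np.random.default_rng(0)
d, n, k = 8, 64, 4                      # toy sizes; 4 * k must not exceed n
W_dec = rng.normal(size=(d, n))
W_dec /= np.linalg.norm(W_dec, axis=0, keepdims=True)
W_enc = W_dec.T.copy()
b_pre = np.zeros(d)
x = rng.normal(size=(16, d))
loss = multi_topk_loss(x, W_enc, W_dec, b_pre, k)
```

The second term penalizes codes that only reconstruct well at exactly K active latents.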

OpenAI released training code, a full suite of GPT-2 small autoencoders, and a feature visualizer alongside the paper ^[1]^[2].

Comparison to Gated and JumpReLU SAEs

TopK is one of several 2024 architectures that attack the L1 shrinkage problem by decoupling feature selection from magnitude estimation or by training L0 more directly. The two most-cited alternatives both came from Google DeepMind ^[6]^[8].

Architecture	Sparsity mechanism	Shrinkage	Sparsity control	Notes
ReLU SAE ^[4]	L1 penalty on activations	Present	Indirect (lambda)	Standard baseline
Gated SAE ^[6]	Separate gating path selects active latents; L1 applied only to the gate	Avoided	Indirect (lambda on gate)	Decouples selection from magnitude; about half as many firing features for comparable fidelity at 7B scale
JumpReLU SAE ^[8]	Discontinuous JumpReLU threshold, `J_theta(x) = x * 1_{x > theta}`, trained with straight-through estimators	Avoided	Indirect (theta), variable per token	Elementwise (no sort); state-of-the-art fidelity on Gemma 2 9B
TopK SAE ^[1]	Keep K largest pre-activations	Avoided	Direct (L0 fixed to K)	Requires a partial sort; fixed K per token

The Gated SAE (Rajamanoharan et al., first posted to arXiv in April 2024) separates the decision of which directions to use from the estimation of their magnitudes, applying the L1 penalty only to the selection gate; this "solves shrinkage," is similarly interpretable, and needs roughly half as many firing features for comparable reconstruction on models up to 7B parameters ^[6].

The JumpReLU SAE (Rajamanoharan et al., released July 19, 2024) replaces the ReLU with a discontinuous JumpReLU that zeros any value below a learned threshold theta, and trains the L0 objective directly using straight-through estimators. DeepMind reports it reaches reconstruction fidelity that is state-of-the-art on Gemma 2 9B activations and at least as good as TopK, while remaining elementwise and therefore cheaper than TopK, which requires a partial sort over the latents ^[8]. A practical distinction is that JumpReLU and Gated SAEs allow the number of active latents to vary across tokens, whereas plain TopK forces exactly K everywhere; OpenAI's own work notes a TopK-trained model can be evaluated with a JumpReLU at test time, and that Multi-TopK makes the TopK and JumpReLU test-time curves nearly coincide ^[1].
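The difference between a fixed and a variable number of active latents per token shows up directly when the two activations are applied to the same pre-activations. This sketch covers the forward pass only (training JumpReLU with straight-through estimators is not shown); the threshold value and the toy inputs are arbitrary choices for the example.

```python
import numpy as np

def jumprelu(pre, theta):
    """Elementwise: pass values above the threshold theta unchanged, zero the rest."""
    return np.where(pre > theta, pre, 0.0)

def topk(pre, k):
    """Keep the k largest entries per row and zero the rest (needs a partial sort)."""
    idx = np.argpartition(pre, -k, axis=1)[:, -k:]
    rows = np.arange(pre.shape[0])[:, None]
    z = np.zeros_like(pre)
    z[rows, idx] = pre[rows, idx]
    return z

pre = np.array([[0.9, 0.1, 0.8, 0.05],    # a token with two strong features
                [0.2, 0.1, 0.05, 0.7]])   # a token with one strong feature
l0_topk = np.count_nonzero(topk(pre, 2), axis=1)        # [2, 2]: always exactly K
l0_jump = np.count_nonzero(jumprelu(pre, 0.5), axis=1)  # [2, 1]: varies per token
```

TopK keeps the weak 0.2 activation on the second token to fill its quota, while the threshold drops it.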

Limitations

OpenAI lists several limitations of the TopK approach ^[1]:

Fixed K per token is likely suboptimal. Forcing exactly K active latents for every input ignores that some inputs are genuinely simpler than others. The authors note one would ideally constrain the expected L0 across tokens rather than the per-token L0, which JumpReLU and Gated variants do more naturally.
Computational cost of the sort. The TopK operation requires a partial sort, unlike the elementwise activations of ReLU and JumpReLU.
Overfitting to K. Without Multi-TopK, a TopK autoencoder reconstructs poorly when evaluated at a sparsity above its training K.
Optimization and dead latents. The training recipe could likely be improved with learning-rate scheduling, better optimizers, and better auxiliary losses; even with AuxK a few percent of latents die at the largest scales.
Residual interpretability and reconstruction gap. A nonzero irreducible reconstruction loss remains, a meaningful fraction of GPT-4 features are not yet adequately monosemantic, and the 64-token context used for the smaller experiments may be too short to surface the most interesting behaviors. The authors suggest combining SAEs with mixture-of-experts routing to scale further.

Beyond these, an SAE that improves the reconstruction-sparsity frontier is not automatically more useful: an infinitely wide, maximally sparse (K = 1) autoencoder can reconstruct perfectly while learning structureless, uninteresting latents, which is why the downstream and explainability metrics matter as much as MSE ^[1].
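A finite stand-in for this degenerate case is a dictionary with one latent per training example. Assuming unit-norm, pairwise distinct inputs, each input's largest pre-activation is its own latent, so K = 1 reconstructs it exactly while the latents encode nothing but example identity. The construction below is an illustrative assumption, not an experiment from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))
x /= np.linalg.norm(x, axis=1, keepdims=True)   # unit-norm inputs

W_dec = x.T.copy()    # (d, n): one decoder column per data point
W_enc = x.copy()      # (n, d): encoder is the decoder transpose

pre = x @ W_enc.T                        # cosine similarities; the diagonal is 1
rows = np.arange(len(x))
winner = np.argmax(pre, axis=1)          # K = 1: keep the single largest latent
z = np.zeros_like(pre)
z[rows, winner] = pre[rows, winner]
x_hat = z @ W_dec.T                      # each input is rebuilt from its own column

mse = float(np.mean(np.sum((x - x_hat) ** 2, axis=1)))
```

The reconstruction error is zero up to floating-point precision, yet no latent corresponds to a reusable feature.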

References

1. Gao, L., Dupre la Tour, T., Tillman, H., Goh, G., Troll, R., Radford, A., Sutskever, I., Leike, J., Wu, J. (2024). "Scaling and evaluating sparse autoencoders." OpenAI. arXiv:2406.04093. https://arxiv.org/abs/2406.04093
2. OpenAI (2024). "Scaling and evaluating sparse autoencoders" (PDF). https://cdn.openai.com/papers/sparse-autoencoders.pdf
3. Makhzani, A., Frey, B. (2013). "k-Sparse Autoencoders." arXiv:1312.5663 (presented at ICLR 2014). https://arxiv.org/abs/1312.5663
4. Bricken, T., et al. (2023). "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning." Anthropic / Transformer Circuits. https://transformer-circuits.pub/2023/monosemantic-features/index.html
5. Tibshirani, R. (1996). "Regression Shrinkage and Selection via the Lasso." Journal of the Royal Statistical Society: Series B, 58(1), 267-288.
6. Rajamanoharan, S., Conmy, A., Smith, L., Lieberum, T., Varma, V., Kramar, J., Shah, R., Nanda, N. (2024). "Improving Dictionary Learning with Gated Sparse Autoencoders." Google DeepMind. arXiv:2404.16014. https://arxiv.org/abs/2404.16014
7. Templeton, A., et al. (2024). "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet." Anthropic / Transformer Circuits. https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html
8. Rajamanoharan, S., Lieberum, T., Sonnerat, N., Conmy, A., Varma, V., Kramar, J., Nanda, N. (2024). "Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders." Google DeepMind. arXiv:2407.14435. https://arxiv.org/abs/2407.14435
