TopK SAE
Last reviewed
Jun 9, 2026
Sources
8 citations
Review status
Source-backed
Revision
v2 · 2,029 words
A TopK SAE (TopK sparse autoencoder) is a variant of sparse autoencoder that enforces sparsity by keeping only the K largest latent pre-activations for each input and zeroing all the rest, rather than encouraging sparsity indirectly through an L1 penalty. Because exactly K latents are active per input, the TopK activation directly fixes the L0 "norm" (the count of nonzero activations) to K, which removes the activation-shrinkage bias of L1-regularized autoencoders and eliminates the need to tune a sparsity coefficient [1][2].
The technique is an application of the k-sparse autoencoder introduced by Alireza Makhzani and Brendan Frey in 2013 [3]. It was brought to prominence in mechanistic interpretability by the OpenAI paper "Scaling and evaluating sparse autoencoders" (Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, Jeffrey Wu), released on June 6, 2024, which used a TopK activation to train a 16-million-latent autoencoder on GPT-4 residual-stream activations [1][2].
Sparse autoencoders are an unsupervised method for decomposing the internal activations of a neural network into a larger set of sparsely activating, more interpretable features. The motivating hypothesis is superposition: a model represents far more features than it has neurons by encoding them as nearly orthogonal directions in activation space, so that each neuron participates in many features (polysemanticity). An SAE learns an over-complete dictionary that re-expresses each activation vector as a sparse combination of these feature directions [4].
The standard baseline is the ReLU SAE used by Anthropic's "Towards Monosemanticity" work [4]. For an input vector x from the residual stream and n latent dimensions, the encoder and decoder are:
z = ReLU(W_enc (x - b_pre) + b_enc)
x_hat = W_dec z + b_pre
It is trained to minimize a reconstruction error plus a sparsity penalty, L = ||x - x_hat||_2^2 + lambda ||z||_1, where the L1 term encourages most latents to be zero and lambda is a hyperparameter [1].
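The encoder, decoder, and loss above can be sketched in NumPy as follows. This is a minimal illustration for a single input vector, not the implementation from any of the cited papers; the toy sizes, the random initialization, and the value lam = 0.1 are assumptions chosen only to make the example run.

```python
import numpy as np

def relu_sae_loss(x, W_enc, b_enc, W_dec, b_pre, lam):
    """Loss of the L1-penalized ReLU SAE for a single input vector x."""
    z = np.maximum(W_enc @ (x - b_pre) + b_enc, 0.0)  # z = ReLU(W_enc (x - b_pre) + b_enc)
    x_hat = W_dec @ z + b_pre                          # x_hat = W_dec z + b_pre
    recon = float(np.sum((x - x_hat) ** 2))            # ||x - x_hat||_2^2
    l1 = float(np.sum(np.abs(z)))                      # ||z||_1
    return recon + lam * l1, z, x_hat

# Toy sizes (assumed for illustration): d = 4 model dimensions, n = 16 latents.
rng = np.random.default_rng(0)
d, n = 4, 16
W_dec = rng.normal(size=(d, n))
W_enc = W_dec.T.copy()
b_enc, b_pre = np.zeros(n), np.zeros(d)
x = rng.normal(size=d)

loss, z, x_hat = relu_sae_loss(x, W_enc, b_enc, W_dec, b_pre, lam=0.1)
loss_no_penalty, _, _ = relu_sae_loss(x, W_enc, b_enc, W_dec, b_pre, lam=0.0)
```

Setting lam to zero leaves only the reconstruction term, so the gap between the two losses is exactly the L1 penalty on z.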
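This formulation has two well-known weaknesses [1][2]: the L1 penalty shrinks the magnitudes of the active latents (activation shrinkage), and sparsity is controlled only indirectly through the coefficient lambda, which has to be tuned.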
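The activation-shrinkage bias mentioned in the introduction can be seen in a one-latent toy case. The sketch below assumes a scalar input x, a single latent z with a unit decoder weight (so x_hat = z), and a hypothetical lambda of 0.4; none of these values come from the cited papers. It minimizes (x - z)^2 + lambda * |z| over a grid of candidate activations.

```python
import numpy as np

x_scalar, lam = 1.0, 0.4                     # assumed toy values
z_grid = np.linspace(0.0, 2.0, 2001)         # candidate activations, step 0.001
objective = (x_scalar - z_grid) ** 2 + lam * np.abs(z_grid)
z_best = float(z_grid[np.argmin(objective)])

# For z > 0 the derivative -2 (x - z) + lam vanishes at z = x - lam / 2.
z_analytic = x_scalar - lam / 2
```

The minimizer sits at x - lambda/2 rather than at x: the penalty pulls the activation below the magnitude that would reconstruct the input exactly, and the gap grows with lambda.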
A TopK SAE replaces the ReLU activation with a TopK operation that keeps only the K largest pre-activations and sets the others to zero [1]:
z = TopK(W_enc (x - b_pre))
The decoder is unchanged. Because exactly K latents are nonzero, the training objective reduces to pure reconstruction error with no penalty term, L = ||x - x_hat||_2^2 [1]. (In OpenAI's implementation a ReLU is also applied so activations stay non-negative, but for reasonable K the K largest pre-activations are almost always positive, so the training curves are indistinguishable [1].)
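A minimal NumPy sketch of the TopK activation and the resulting forward pass is shown below. It assumes a single 1-D input vector and uses `np.argpartition` as the partial sort; the function names are illustrative and are not taken from OpenAI's released code.

```python
import numpy as np

def topk_activation(pre, k):
    """Keep the k largest entries of a 1-D pre-activation vector and zero the rest."""
    idx = np.argpartition(pre, -k)[-k:]      # indices of the k largest values (partial sort)
    z = np.zeros_like(pre)
    z[idx] = np.maximum(pre[idx], 0.0)       # ReLU keeps the kept activations non-negative
    return z

def topk_sae_forward(x, W_enc, W_dec, b_pre, k):
    """Return latents, reconstruction, and squared error; there is no sparsity penalty."""
    z = topk_activation(W_enc @ (x - b_pre), k)
    x_hat = W_dec @ z + b_pre
    return z, x_hat, float(np.sum((x - x_hat) ** 2))

pre = np.array([0.5, -1.0, 2.0, 0.1, 1.5])
z = topk_activation(pre, k=2)                # keeps 2.0 and 1.5, zeros everything else
```

Ties between equal pre-activations are broken arbitrarily by `np.argpartition`, which is harmless for a sketch but worth keeping in mind when comparing implementations.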
OpenAI reports several benefits of this design [1]:
Decoder normalization. As in prior SAE work, inputs are mean-subtracted over the model dimension and normalized to unit norm before encoding, and the decoder columns are kept at unit norm so that latent magnitudes are comparable [1].
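The two normalization steps described above might look like the following in NumPy. This is a sketch under the assumption that the decoder is stored as a (d, n) matrix with one column per latent; the small `eps` guard against division by zero is an addition for numerical safety, not something stated in the source.

```python
import numpy as np

def normalize_input(x, eps=1e-8):
    """Subtract the mean over the model dimension, then scale to unit L2 norm."""
    x = x - x.mean()
    return x / (np.linalg.norm(x) + eps)

def renormalize_decoder(W_dec, eps=1e-8):
    """Rescale every decoder column (one per latent) to unit L2 norm."""
    return W_dec / (np.linalg.norm(W_dec, axis=0, keepdims=True) + eps)

x_n = normalize_input(np.array([1.0, 2.0, 3.0, 6.0]))
W_unit = renormalize_decoder(np.random.default_rng(0).normal(size=(4, 16)))
```

With unit-norm decoder columns, the size of a latent's activation is the length of its contribution to the reconstruction, which is what makes magnitudes comparable across latents.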
Preventing dead latents (AuxK). A "dead" latent is one that stops activating entirely during training; in large SAEs this can affect a large fraction of latents (OpenAI observed up to 90 percent without mitigation, and Anthropic's 34-million-latent autoencoder had only about 12 million alive [1][7]). Dead latents waste capacity and worsen reconstruction. OpenAI uses two ingredients to suppress them: initializing the encoder as the transpose of the decoder, and an auxiliary loss called AuxK that models the residual reconstruction error using the top-k_aux currently dead latents. With these techniques, only about 7 percent of latents were dead even in the 16-million-latent autoencoder [1].
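The AuxK idea, reconstructing the residual error from dead latents only, can be sketched as below. This is an illustrative reading of the description above rather than OpenAI's implementation: how `dead_mask` is computed, the ReLU on the auxiliary activations, and the coefficient with which this term is added to the main loss are not given in this article and are treated here as assumptions left to the caller.

```python
import numpy as np

def auxk_loss(x, x_hat, pre, dead_mask, W_dec, k_aux):
    """Squared error of reconstructing the residual e = x - x_hat from dead latents only."""
    e = x - x_hat
    n_dead = int(dead_mask.sum())
    if n_dead == 0:
        return 0.0                                   # nothing to revive
    k = min(k_aux, n_dead)
    masked = np.where(dead_mask, pre, -np.inf)       # live latents can never be selected
    idx = np.argpartition(masked, -k)[-k:]           # top-k_aux among the dead latents
    z_aux = np.zeros_like(pre)
    z_aux[idx] = np.maximum(pre[idx], 0.0)           # assumed: same ReLU as the main path
    e_hat = W_dec @ z_aux
    return float(np.sum((e - e_hat) ** 2))

# Toy case with an identity decoder (an assumption that makes the arithmetic easy to follow).
W_dec = np.eye(3)
x = np.array([1.0, 2.0, 3.0])
x_hat = np.array([1.0, 2.0, 0.0])                    # main path missed the third coordinate
pre = np.array([5.0, 4.0, 3.0])

helpful = auxk_loss(x, x_hat, pre, np.array([False, False, True]), W_dec, k_aux=2)
unhelpful = auxk_loss(x, x_hat, pre, np.array([False, True, False]), W_dec, k_aux=1)
```

In the first call the dead latent points along the unexplained direction, so the auxiliary error is zero; in the second it points elsewhere and the error is large. Because only dead latents enter this term, its gradient reaches them even though they receive none from the main reconstruction loss.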
The 2024 OpenAI paper is the work that established TopK SAEs as a practical recipe at frontier scale. Its main contributions were a state-of-the-art training methodology, clean scaling laws, and new evaluation metrics [1][2].
To demonstrate scalability, the authors trained a 16-million-latent TopK autoencoder on GPT-4 residual-stream activations for 40 billion tokens, choosing a layer roughly 5/6 of the way into the network; smaller autoencoders for GPT-2 small were trained at layer 8 with a 64-token context [1]. They report two kinds of scaling law: a power law with an irreducible-loss term, L(C), relating reconstruction mean-squared error (MSE) to training compute, and a joint law over the number of latents n and the number of active latents k (the K used elsewhere in this article):
L(n, k) = exp(alpha + beta_k log(k) + beta_n log(n) + gamma log(k) log(n)) + exp(zeta + eta log(k))
with fitted values alpha = -0.50, beta_k = 0.26, beta_n = -0.017, gamma = -0.042, zeta = -1.32, eta = -0.085 on GPT-4 autoencoders, indicating that adding latents reliably lowers MSE and that the L(n) curve steepens as K grows [1].
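The joint law can be evaluated directly from the fitted constants. The sketch below assumes natural logarithms and treats the first exponential as the reducible term and the second as the irreducible term; those labels and the helper names are this example's own, not the paper's.

```python
import math

# Fitted values quoted above for the GPT-4 autoencoders.
ALPHA, BETA_K, BETA_N, GAMMA, ZETA, ETA = -0.50, 0.26, -0.017, -0.042, -1.32, -0.085

def joint_scaling_loss(n, k):
    """Evaluate L(n, k) for n latents and k active latents (natural logs assumed)."""
    log_n, log_k = math.log(n), math.log(k)
    reducible = math.exp(ALPHA + BETA_K * log_k + BETA_N * log_n + GAMMA * log_k * log_n)
    irreducible = math.exp(ZETA + ETA * log_k)
    return reducible + irreducible

def slope_in_log_n(k):
    """d log(reducible term) / d log(n) = beta_n + gamma * log(k)."""
    return BETA_N + GAMMA * math.log(k)

small = joint_scaling_loss(2 ** 15, 32)
large = joint_scaling_loss(2 ** 20, 32)   # more latents at the same k
```

Because beta_n and gamma are both negative, the slope in log(n) is negative for every k >= 1 and becomes more negative as k increases, which is the steepening described above.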
Because better reconstruction is not the ultimate goal, the paper introduces interpretability-oriented metrics that "generally improve with autoencoder size" [1][2]:
Multi-TopK and progressive recovery. A drawback of training at a fixed K is "overfitting to K": if the activation function is later swapped for TopK with a different K', MSE worsens sharply once K' exceeds the training K, so the code is only "progressive" up to K. To mitigate this, the paper proposes Multi-TopK, which sums TopK reconstruction losses at multiple values of K. Training with L(k) + L(4k)/8 is enough to yield a code that reconstructs well at all K', so the same autoencoder can be used with either a fixed or a variable number of active latents per token, including by substituting a JumpReLU (fixed-threshold) activation at test time [1].
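The L(k) + L(4k)/8 objective can be written as a small function that reuses one set of pre-activations at both sparsity levels. This is a sketch for a single input vector; the helper names and the identity-matrix toy dictionary are assumptions made so the arithmetic is easy to check by hand.

```python
import numpy as np

def topk(pre, k):
    """Keep the k largest pre-activations (clamped at zero) and zero the rest."""
    idx = np.argpartition(pre, -k)[-k:]
    z = np.zeros_like(pre)
    z[idx] = np.maximum(pre[idx], 0.0)
    return z

def multi_topk_loss(x, W_enc, W_dec, b_pre, k):
    """Multi-TopK objective L(k) + L(4k)/8 for a single input vector."""
    pre = W_enc @ (x - b_pre)

    def recon_loss(kk):
        kk = min(kk, pre.size)                       # cannot keep more latents than exist
        x_hat = W_dec @ topk(pre, kk) + b_pre
        return float(np.sum((x - x_hat) ** 2))

    return recon_loss(k) + recon_loss(4 * k) / 8.0

# Toy dictionary: identity encoder and decoder over 8 dimensions.
I8 = np.eye(8)
x = np.array([4.0, 3.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0])
loss = multi_topk_loss(x, I8, I8, np.zeros(8), k=1)
```

Here L(1) keeps only the largest coordinate and leaves an error of 9 + 4 + 1 = 14, while L(4) reconstructs x exactly, so the combined loss is 14. In training, the second term rewards latents ranked just below the top k for still being useful, which is what keeps the code progressive past K.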
OpenAI released training code, a full suite of GPT-2 small autoencoders, and a feature visualizer alongside the paper [1][2].
TopK is one of several 2024 architectures that attack the L1 shrinkage problem by decoupling feature selection from magnitude estimation or by training L0 more directly. The two most-cited alternatives both came from Google DeepMind [6][8].
| Architecture | Sparsity mechanism | Shrinkage | Sparsity control | Notes |
|---|---|---|---|---|
| ReLU SAE [4] | L1 penalty on activations | Present | Indirect (lambda) | Standard baseline |
| Gated SAE [6] | Separate gating path selects active latents; L1 applied only to the gate | Avoided | Indirect (lambda on gate) | Decouples selection from magnitude; about half as many firing features for comparable fidelity at 7B scale |
| JumpReLU SAE [8] | Discontinuous JumpReLU threshold, J_theta(x) = x * 1_{x > theta}, trained with straight-through estimators | Avoided | Indirect (theta), variable per token | Elementwise (no sort); state-of-the-art fidelity on Gemma 2 9B |
| TopK SAE [1] | Keep K largest pre-activations | Avoided | Direct (L0 fixed to K) | Requires a partial sort; fixed K per token |
The Gated SAE (Rajamanoharan et al., released May 1, 2024) separates the decision of which directions to use from the estimation of their magnitudes, applying the L1 penalty only to the selection gate; this "solves shrinkage," is similarly interpretable, and needs roughly half as many firing features for comparable reconstruction on models up to 7B parameters [6].
The JumpReLU SAE (Rajamanoharan et al., released July 19, 2024) replaces the ReLU with a discontinuous JumpReLU that zeros any value below a learned threshold theta, and trains the L0 objective directly using straight-through estimators. DeepMind reports it reaches reconstruction fidelity that is state-of-the-art on Gemma 2 9B activations and at least as good as TopK, while remaining elementwise and therefore cheaper than TopK, which requires a partial sort over the latents [8]. A practical distinction is that JumpReLU and Gated SAEs allow the number of active latents to vary across tokens, whereas plain TopK forces exactly K everywhere; OpenAI's own work notes a TopK-trained model can be evaluated with a JumpReLU at test time, and that Multi-TopK makes the TopK and JumpReLU test-time curves nearly coincide [1].
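The difference between a fixed and a variable number of active latents is easy to see on two toy tokens. The sketch below uses a single scalar threshold for brevity, which is an assumption: in the JumpReLU SAE the thresholds are learned. The pre-activation values are invented for illustration.

```python
import numpy as np

def jumprelu(pre, theta):
    """J_theta(x) = x * 1[x > theta], applied elementwise (no sort required)."""
    return np.where(pre > theta, pre, 0.0)

def topk(pre, k):
    """Keep the k largest pre-activations (clamped at zero) and zero the rest."""
    idx = np.argpartition(pre, -k)[-k:]
    z = np.zeros_like(pre)
    z[idx] = np.maximum(pre[idx], 0.0)
    return z

token_a = np.array([0.2, 1.5, 3.0, 0.9])
token_b = np.array([2.0, 1.2, 1.1, 4.0])
theta, k = 1.0, 2

jump_counts = [int(np.count_nonzero(jumprelu(t, theta))) for t in (token_a, token_b)]  # [2, 4]
topk_counts = [int(np.count_nonzero(topk(t, k))) for t in (token_a, token_b)]          # [2, 2]
```

The threshold lets the second token use four latents while the first uses two, whereas TopK spends exactly two on each regardless of how many pre-activations are large.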
OpenAI lists several limitations of the TopK approach [1]:
Beyond these, an SAE that improves the reconstruction-sparsity frontier is not automatically more useful: an infinitely wide, maximally sparse (K = 1) autoencoder can reconstruct perfectly while learning structureless, uninteresting latents, which is why the downstream and explainability metrics matter as much as MSE [1].