Sparse autoencoder

A sparse autoencoder (SAE) is a type of autoencoder that adds a sparsity regularisation term to its training loss, encouraging only a small number of units in the hidden (bottleneck) layer to activate for any given input. The result is a sparse, often overcomplete representation that tends to be more interpretable and feature-rich than the dense codes learned by a standard autoencoder. Sparse autoencoders sit at the intersection of two research lineages: the classical sparse-coding tradition that began with Bruno Olshausen and David Field's 1996 Nature paper and became textbook material through Andrew Ng's Stanford CS294A lecture notes around 2011, and the modern wave of mechanistic interpretability research, in which sparse autoencoders are trained on the internal activations of large language models to decompose those activations into human-interpretable features. The interpretability application, launched by Cunningham et al. (2023) and Anthropic's Towards Monosemanticity report later that year, turned the SAE from a niche pre-training trick into one of the most active tools in AI safety and interpretability research.

By 2026 the field has matured around a handful of architectural variants (vanilla L1, Top-K, JumpReLU, Gated, Matryoshka), a public catalogue of open-weight SAEs that includes Google DeepMind's Gemma Scope, OpenAI's Top-K release for GPT-2 small and GPT-4, and Goodfire's Llama 3 SAEs, and an active scientific debate about how much SAEs really tell us about what a model is doing. This article surveys both halves of the SAE story: the classical sparse-coding roots and the modern interpretability turn.

Overview

A standard autoencoder is a neural network trained to copy its input through a narrow internal representation. Sparse autoencoders break with that framing in a specific way: instead of compressing the input into a smaller code, they expand it into a much larger code whose entries are mostly zero. The few entries that are nonzero are what the network believes is going on in this particular input. A sparse, overcomplete code can carve up the world into more natural categories than a dense, compressed one. If a learned dictionary has ten thousand directions and only fifteen fire on any given image, those fifteen are doing real work and can in principle be interpreted one at a time.

In the interpretability era, sparse autoencoders are trained on the internal activations of an already-trained network, not on raw data. A transformer has, say, four thousand residual-stream channels at a given layer, but it plausibly tracks many more than four thousand concepts at once. Researchers therefore look for a much wider sparse code that decomposes the dense activation vector into thousands or millions of interpretable directions. Those directions, called features, are what the SAE actually delivers.

Definition and motivation

A standard autoencoder learns an encoder $f: \mathbb{R}^d \to \mathbb{R}^h$ and a decoder $g: \mathbb{R}^h \to \mathbb{R}^d$ that together minimise the reconstruction error $\|x - g(f(x))\|_2^2$. When $h < d$, the bottleneck forces the model to compress information, much like principal component analysis. A sparse autoencoder differs in two ways. First, the hidden layer is allowed (and in many modern variants required) to be overcomplete, with $h \gg d$. Second, the loss includes a sparsity term that penalises the average number of active units, so that for any input only a small subset of the $h$ hidden directions fire.

The resulting code is no longer a low-dimensional summary but a sparse, often wide, decomposition. Each input is reconstructed as a sparse linear combination of decoder columns, sometimes called "feature directions" or "dictionary atoms". The decoder columns play the role of a learned dictionary, the hidden activations are the sparse coefficients, and the encoder is an amortised solver that picks those coefficients in one forward pass instead of running an iterative optimisation at inference time.

The motivation is partly biological. Olshausen and Field argued that mammalian primary visual cortex uses a sparse, overcomplete code because it gives a more statistically independent and energy-efficient representation of natural images than a dense one. In machine learning, sparsity tends to push each hidden unit to specialise: instead of every neuron weighing in on every input, a small group of neurons takes responsibility for each pattern. That specialisation is what makes SAE units candidates for interpretable features rather than tangled mixtures.

The motivation is also engineering. A sparse code lends itself to efficient retrieval and to explicit symbolic manipulation, because the active coordinates are countable and named. In the mechanistic interpretability context, that property is more important than the compression itself: a researcher who wants to ask "is the model thinking about the Golden Gate Bridge here?" needs a representation in which Golden Gate Bridge is a single, addressable direction, not a tangle of contributions from a thousand neurons.

Classical history

Predecessors: efficient coding and Barlow's hypothesis

The idea that biological vision uses a sparse code goes back to Horace Barlow's 1961 essay Possible Principles Underlying the Transformations of Sensory Messages, which argued that the early visual system should re-represent its input to minimise statistical redundancy. Barlow framed neurons as feature detectors whose firing patterns ought to be relatively independent, an idea later formalised as the efficient coding hypothesis. David Field's 1987 paper Relations between the statistics of natural images and the response properties of cortical cells applied this lens to natural-image statistics and prepared the ground for an explicit sparse-coding model.

Olshausen and Field 1996

The first concrete demonstration came in June 1996, when Bruno Olshausen and David Field published Emergence of simple-cell receptive field properties by learning a sparse code for natural images in volume 381 of Nature. They showed that an unsupervised algorithm trained to find a sparse linear code for small patches of natural images develops basis functions that look strikingly like the localised, oriented, bandpass receptive fields measured in primary visual cortex of cats and macaques. The paper became a founding text of computational neuroscience and is the canonical citation for almost every sparse autoencoder paper written since.

Olshausen and Field did not actually use a neural-network autoencoder. Their model alternated between a decoder weight update and an iterative inference step that, for each image patch, solved an L1-penalised reconstruction problem. The link to autoencoders came later, when researchers replaced the iterative inference step with an amortised encoder, a feed-forward network trained to predict the sparse coefficients in one pass. That substitution is what turns sparse coding into a sparse autoencoder.

The 2000s and unsupervised pretraining

Between 2000 and 2010 sparse coding was extended in several directions that fed back into the autoencoder line. Honglak Lee, Alexis Battle, Rajat Raina and Andrew Ng's 2007 NIPS paper Efficient sparse coding algorithms introduced practical methods for solving the L1-penalised dictionary learning problem at scale. Geoff Hinton and Ruslan Salakhutdinov's 2006 Science paper Reducing the Dimensionality of Data with Neural Networks revived the deep autoencoder by showing that pretraining with stacked restricted Boltzmann machines made it feasible to fine-tune very deep encoder-decoder pairs. Marc'Aurelio Ranzato, Christopher Poultney and Yann LeCun's 2007 paper Efficient learning of sparse representations with an energy-based model explicitly combined an encoder, a decoder, and a sparsity-inducing penalty in a single network, in what reads in retrospect as one of the earliest "true" sparse autoencoders.

Andrew Ng's CS294A and the UFLDL treatment

Andrew Ng's 2011 Stanford CS294A lecture notes, titled Sparse autoencoder, codified the textbook treatment. They presented the L1 and KL-divergence formulations side by side, derived the backpropagation updates, and discussed how to tune the target sparsity $\rho$. The same material was lifted into the Stanford UFLDL (Unsupervised Feature Learning and Deep Learning) tutorial wiki, which for years served as a standard teaching resource.

Le et al. 2012: the cat-detector network

The high-water mark of classical sparse autoencoders in deep learning practice was Quoc Le, Marc'Aurelio Ranzato, Rajat Monga, Matthieu Devin, Kai Chen, Greg Corrado, Jeff Dean, and Andrew Ng's 2012 paper Building high-level features using large-scale unsupervised learning, published at ICML and circulated on arXiv (1112.6209) from December 2011. The team trained a nine-layer locally connected sparse autoencoder, with about one billion parameters, on ten million still frames sampled from YouTube videos, using 1,000 machines (16,000 cores) for three days. Without any labels, the network developed individual units that responded selectively to human faces, human bodies, and cats, the last of which became the now-famous "cat detector" unit. Press coverage made the project the public face of unsupervised deep learning.

After 2013, attention shifted to supervised pretraining, word embeddings, and self-supervised methods like contrastive learning. Yoshua Bengio, Aaron Courville and Pascal Vincent's 2013 IEEE TPAMI review Representation Learning: A Review and New Perspectives collected sparse-autoencoder, sparse-coding, and denoising-autoencoder approaches into one conceptual framework, but the framework itself was already losing ground to learned word vectors and end-to-end supervised pretraining.

k-Sparse autoencoders

A particularly important paper from the lull years is Alireza Makhzani and Brendan Frey's 2013 k-Sparse Autoencoders (arXiv:1312.5663). Instead of penalising the L1 norm, they kept only the $k$ largest activations per input and set the rest to zero. The k-sparse autoencoder gives direct control over the average number of active units and sidesteps L1 shrinkage. Their idea would lie largely dormant for a decade before OpenAI's interpretability team rediscovered and rebranded it as the Top-K SAE in 2024.

Mathematical formulation

A vanilla single-layer SAE writes

$$z = \sigma(W_e x + b_e), \quad \hat{x} = W_d z + b_d,$$

where $W_e \in \mathbb{R}^{h \times d}$ and $W_d \in \mathbb{R}^{d \times h}$ are the encoder and decoder weights, $b_e$ and $b_d$ are biases, and $\sigma$ is a non-negative activation such as ReLU.

Training minimises the sum of a squared reconstruction error,

$$\mathcal{L}_{\text{recon}} = \|x - \hat{x}\|_2^2,$$

and a sparsity penalty that pushes most coordinates of $z$ towards zero.

L1 (Lasso) penalty

The simplest and historically dominant choice is an absolute-value penalty,

$$\mathcal{L}_{\text{sparse}} = \lambda \|z\|_1,$$

which is differentiable almost everywhere and slots into standard gradient descent. The coefficient $\lambda$ trades off reconstruction error against sparsity. It is the workhorse of Olshausen-style sparse coding and the default in the first wave of interpretability SAEs in 2023.
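The forward pass and the L1-penalised loss can be sketched in a few lines of NumPy. This is a minimal illustration, not a training recipe: the dimensions, the random initialisation, and the value of `lam` (standing in for $\lambda$) are arbitrary choices made for the example.

```python
import numpy as np

def sae_forward(x, W_e, b_e, W_d, b_d):
    """Vanilla SAE: z = ReLU(W_e x + b_e), x_hat = W_d z + b_d."""
    z = np.maximum(W_e @ x + b_e, 0.0)   # non-negative code, shape (h,)
    x_hat = W_d @ z + b_d                # reconstruction, shape (d,)
    return z, x_hat

def sae_loss(x, x_hat, z, lam):
    """Squared reconstruction error plus lam * ||z||_1."""
    recon = np.sum((x - x_hat) ** 2)
    sparse = lam * np.sum(np.abs(z))
    return recon + sparse, recon, sparse

rng = np.random.default_rng(0)
d, h = 8, 32                              # overcomplete: h > d (illustrative sizes)
W_e = rng.normal(scale=0.1, size=(h, d))  # encoder weights, shape (h, d)
W_d = rng.normal(scale=0.1, size=(d, h))  # decoder weights, shape (d, h)
b_e = np.zeros(h)
b_d = np.zeros(d)
x = rng.normal(size=d)

z, x_hat = sae_forward(x, W_e, b_e, W_d, b_d)
total, recon, sparse = sae_loss(x, x_hat, z, lam=1e-2)
```

With untrained random weights the code is not yet sparse; sparsity emerges only as gradient descent trades the two loss terms against each other.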

L1 has two known pathologies. First, it applies uniform downward pressure on all nonzero coefficients, so recovered feature magnitudes systematically underestimate the values the data called for, an effect known as L1 shrinkage. Second, L1 controls sparsity only indirectly, through the strength of the penalty, and not the actual count of active units: two SAEs trained on the same data with the same $\lambda$ can have very different average $L_0$ depending on initialisation and learning rate.
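Shrinkage is easiest to see in the scalar case. Assuming a single feature with a unit-norm decoder direction, the problem reduces to minimising $(x - z)^2 + \lambda |z|$, whose minimiser is the soft-thresholded value $\operatorname{sign}(x)\max(|x| - \lambda/2, 0)$. The sketch below checks that closed form against a brute-force grid search; the numbers are illustrative.

```python
import numpy as np

def l1_optimal_code(x, lam):
    """Minimiser of (x - z)^2 + lam * |z| for scalar x (soft thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam / 2.0, 0.0)

x_true, lam = 1.0, 0.4
z_star = l1_optimal_code(x_true, lam)     # closed form: 1.0 - 0.4 / 2

# Brute-force check of the same objective on a fine grid.
grid = np.linspace(-2.0, 2.0, 400001)
objective = (x_true - grid) ** 2 + lam * np.abs(grid)
z_grid = grid[np.argmin(objective)]

shrinkage = x_true - z_star               # how far the code undershoots the input
```

The recovered coefficient undershoots the true value by $\lambda/2$ whenever the feature is active, which is the bias that Top-K, JumpReLU, and Gated designs try to remove.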

KL-divergence sparsity

The second formulation that Andrew Ng's CS294A notes popularised targets a low average activation $\rho$ (for example $\rho = 0.05$) per neuron across a mini-batch. If $\hat{\rho}_j$ is the average activation of unit $j$, the penalty is

$$\mathcal{L}_{\text{KL}} = \beta \sum_j \left[ \rho \log \frac{\rho}{\hat{\rho}_j} + (1-\rho) \log \frac{1-\rho}{1-\hat{\rho}_j} \right].$$

This directly targets the average firing rate but requires batch statistics, which complicates online training. It is rarely used in modern interpretability SAEs but remains a staple of classical pedagogy.
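A minimal sketch of the KL penalty follows. It assumes activations lie in $(0, 1)$, as they do with a sigmoid hidden layer, and it clips $\hat{\rho}_j$ away from 0 and 1 for numerical safety; the clipping constant `eps` is an implementation choice of this example.

```python
import numpy as np

def kl_sparsity_penalty(Z, rho=0.05, beta=1.0, eps=1e-8):
    """KL sparsity penalty for a batch of activations Z with shape (batch, h).

    Assumes entries of Z lie in (0, 1), e.g. sigmoid outputs.
    """
    rho_hat = np.clip(Z.mean(axis=0), eps, 1.0 - eps)   # average activation per unit
    kl = (rho * np.log(rho / rho_hat)
          + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat)))
    return beta * kl.sum()

Z_on_target = np.full((64, 10), 0.05)    # every unit fires at exactly the target rate
Z_too_active = np.full((64, 10), 0.5)    # every unit fires far too often

penalty_on_target = kl_sparsity_penalty(Z_on_target)
penalty_too_active = kl_sparsity_penalty(Z_too_active)
```

The penalty is zero when every unit's average activation equals $\rho$ and grows as units drift away from the target in either direction.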

Top-K, JumpReLU, and Gated activations

Three variants emerging from the interpretability era avoid the L1 pathologies by changing the activation function or the encoder structure rather than the penalty. Top-K keeps only the $k$ largest preactivations per input and sets the rest to zero. JumpReLU uses a thresholded activation that fires only above a per-feature learned threshold. Gated SAEs split the encoder into a gating path that decides which features to fire and a magnitude path that decides how strongly. Each of these is discussed in detail in the next section.
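The Top-K and JumpReLU activations can be sketched as forward-pass functions. Only the forward computation is shown; training the JumpReLU thresholds requires gradient estimators that are not reproduced here, and the threshold values in `theta` are hypothetical.

```python
import numpy as np

def topk_activation(pre, k):
    """Keep the k largest preactivations in each row and zero the rest."""
    idx = np.argpartition(pre, -k, axis=-1)[..., -k:]   # indices of the k largest
    out = np.zeros_like(pre)
    np.put_along_axis(out, idx, np.take_along_axis(pre, idx, axis=-1), axis=-1)
    return out

def jumprelu_activation(pre, theta):
    """Pass a preactivation through unchanged only if it exceeds its threshold."""
    return pre * (pre > theta)

pre = np.array([[0.1, 2.0, -1.0, 0.7, 1.5]])   # one input, five features
theta = np.array([0.5, 0.5, 0.5, 1.0, 0.5])    # hypothetical per-feature thresholds

z_topk = topk_activation(pre, k=2)             # exactly two active features
z_jump = jumprelu_activation(pre, theta)       # 0.7 is dropped: below its threshold of 1.0
```

Top-K fixes the number of active features per input by construction, whereas JumpReLU lets that number vary with the input and controls it through the learned thresholds.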

In every case the total loss is $\mathcal{L}_{\text{recon}} + \mathcal{L}_{\text{sparse}}$, where for Top-K the sparsity term vanishes because the activation itself enforces sparsity. A common variant ties the decoder weights to the transpose of the encoder, $W_d = W_e^T$, which halves the parameter count. In modern interpretability practice, the decoder columns are usually unit-norm or norm-constrained, so that comparisons between feature magnitudes are meaningful and one cannot game the L1 penalty by shrinking activations and growing decoder columns.
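The reason for the norm constraint can be shown directly. In the sketch below, scaling the decoder up by a factor `c` and the code down by the same factor leaves the reconstruction untouched while cutting the L1 penalty by `c`; renormalising the decoder columns closes that loophole. The factor and the dimensions are illustrative.

```python
import numpy as np

def normalise_decoder_columns(W_d, eps=1e-12):
    """Rescale each decoder column (one per feature) to unit L2 norm."""
    norms = np.linalg.norm(W_d, axis=0, keepdims=True)
    return W_d / np.maximum(norms, eps)

rng = np.random.default_rng(0)
W_d = rng.normal(size=(8, 32))             # decoder, shape (d, h)
z = np.maximum(rng.normal(size=32), 0.0)   # a non-negative code

# Without a norm constraint: shrink z by c and grow W_d by c.
c = 10.0
x_hat = W_d @ z
x_hat_gamed = (c * W_d) @ (z / c)          # identical reconstruction
l1 = np.abs(z).sum()
l1_gamed = np.abs(z / c).sum()             # penalty is c times smaller

# Projecting back to unit-norm columns removes the free rescaling.
W_unit = normalise_decoder_columns(W_d)
```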

Sparse coding vs sparse autoencoder

In Olshausen-Field sparse coding, the sparse code $z$ for input $x$ is found by iteratively solving $\min_z \|x - W_d z\|_2^2 + \lambda \|z\|_1$ at inference time. In a sparse autoencoder, the encoder $f(x) = \sigma(W_e x + b_e)$ produces $z$ in a single forward pass; it is an amortised solver, jointly trained with the decoder to imitate what the iterative procedure would have produced. The bargain is fast inference at the cost of some fidelity loss, since no single-layer feed-forward network can exactly reproduce the global optimum of a non-trivial L1-penalised inverse problem.
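To make the contrast concrete, the iterative side can be sketched with ISTA (iterative shrinkage-thresholding), one standard solver for this L1-penalised objective. It is used here purely as an illustration and is not the specific procedure from the 1996 paper; the dictionary, step count, and $\lambda$ are arbitrary.

```python
import numpy as np

def soft_threshold(v, t):
    """Shrink every entry of v towards zero by t, clamping at zero."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def objective(x, W_d, z, lam):
    """||x - W_d z||_2^2 + lam * ||z||_1."""
    return np.sum((x - W_d @ z) ** 2) + lam * np.sum(np.abs(z))

def ista(x, W_d, lam, n_steps=500):
    """Iteratively minimise the objective over z for one input x."""
    lipschitz = 2.0 * np.linalg.norm(W_d, ord=2) ** 2   # of the quadratic term's gradient
    step = 1.0 / lipschitz
    z = np.zeros(W_d.shape[1])
    for _ in range(n_steps):
        grad = 2.0 * W_d.T @ (W_d @ z - x)              # gradient of the squared error
        z = soft_threshold(z - step * grad, step * lam) # proximal step for the L1 term
    return z

rng = np.random.default_rng(0)
d, h = 16, 64
W_d = rng.normal(size=(d, h))
W_d /= np.linalg.norm(W_d, axis=0, keepdims=True)       # unit-norm dictionary atoms
x = rng.normal(size=d)
lam = 0.5

z_ista = ista(x, W_d, lam)
```

Each input needs hundreds of matrix-vector products here, whereas an amortised encoder produces its code with one; that difference is the speed-for-fidelity bargain described above.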

In the modern interpretability formulation, the input $x$ is not a raw image or token embedding but an internal activation vector of a trained neural network, for example the residual stream of a transformer at a particular layer. The decoder columns $W_d[:,j]$ are then interpreted as directions in activation space that correspond to specific features of the model's input.

Sparsity mechanisms

Several mechanisms have been proposed for enforcing sparsity, each with tradeoffs between simplicity, sparsity control, and reconstruction fidelity. The most widely used are summarised below.

Mechanism	Originator	How it works	Strengths	Weaknesses
L1 penalty	Olshausen and Field 1996; Andrew Ng 2011	Add $\lambda \|z\|_1$ to the loss; convex and differentiable almost everywhere	Simple, universally supported	Shrinks activations, biases magnitudes downward, sparsity is implicit
KL-divergence sparsity	Andrew Ng's CS294A 2011 notes	Penalise the KL divergence between target and observed average activation per neuron	Directly targets average firing rate $\rho$	Requires batch statistics; less common in modern work
Top-K activation	Makhzani and Frey 2013; revived by OpenAI in 2024	Keep only the $k$ largest activations per input, set the rest to zero	Direct, exact control over $L_0$; no shrinkage from a soft penalty	Discontinuous; needs straight-through estimators or explicit masking
JumpReLU	Rajamanoharan et al. (DeepMind) 2024	Replace ReLU with a thresholded "jump" activation that fires only above a learned threshold	State-of-the-art reconstruction-vs-sparsity Pareto frontier; trains directly against $L_0$ via straight-through estimators	Discontinuous activation, requires careful gradient estimation
Gated SAE	Rajamanoharan et al. (DeepMind) 2024	Separate "where to fire" and "how much" into two paths; apply L1 only to the gating path	Removes L1 shrinkage on magnitudes; halves the number of firing features for the same reconstruction quality	More parameters, more complex training

L1 was the workhorse of both classical sparse autoencoders and early interpretability SAEs in 2023. Top-K, JumpReLU, and Gated SAEs were developed in response to two L1 problems: it systematically underestimates feature magnitudes ("shrinkage"), and it controls sparsity only indirectly, through the penalty strength, rather than fixing the actual count of active units, which complicates comparison across runs.

Matryoshka SAEs, introduced by Bart Bussmann, Noa Nabeshima, Adam Karvonen and Neel Nanda in Learning Multi-Level Features with Matryoshka Sparse Autoencoders (arXiv:2503.17547), train nested groups of features of increasing size that must each independently reconstruct the input. The result is a hierarchy in which the inner groups learn high-level concepts and the outer groups specialise, mitigating feature absorption, where a coarse feature gets cannibalised by finer-grained variants at higher widths. The architecture is orthogonal to the choice of activation (Matryoshka L1 and Matryoshka JumpReLU are both reasonable combinations).
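The nested-reconstruction idea can be sketched as a loss that decodes from successively longer prefixes of the latent vector. This is a simplified illustration: the prefix sizes in `group_sizes` are hypothetical, the terms are weighted equally, and details of the published method such as how prefix sizes are chosen are not reproduced.

```python
import numpy as np

def matryoshka_recon_loss(x, z, W_d, b_d, group_sizes):
    """Sum of reconstruction errors using only the first m latents, for each m."""
    per_prefix = []
    for m in group_sizes:
        x_hat_m = W_d[:, :m] @ z[:m] + b_d      # decode from the first m latents only
        per_prefix.append(np.sum((x - x_hat_m) ** 2))
    return sum(per_prefix), per_prefix

rng = np.random.default_rng(0)
d, h = 8, 32
W_d = rng.normal(scale=0.1, size=(d, h))
b_d = np.zeros(d)
z = np.maximum(rng.normal(size=h), 0.0)
x = rng.normal(size=d)

total, per_prefix = matryoshka_recon_loss(x, z, W_d, b_d, group_sizes=[4, 8, 16, 32])
```

Because the earliest latents appear in every term, they are pushed towards features that are useful at all widths, which is the mechanism behind the coarse-to-fine hierarchy.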

Foundational literature timeline

A compact bibliography of the foundational era is useful for context.

Year	Authors	Contribution
1961	Horace Barlow	Efficient-coding hypothesis: cortical neurons should re-represent input to reduce statistical redundancy
1987	David Field	Statistics of natural images and orientations of receptive fields
1996	Olshausen and Field	Sparse linear coding of natural-image patches yields V1-like simple-cell basis functions (Nature)
1997	Olshausen and Field	Sparse coding with an overcomplete basis set: a strategy employed by V1? (Vision Research)
2006	Hinton and Salakhutdinov	Greedy layer-wise RBM pretraining enables deep autoencoders (Science)
2007	Lee, Battle, Raina, Ng	Efficient sparse coding algorithms (NIPS)
2007	Ranzato, Poultney, LeCun	Energy-based sparse representation learning, an early explicit sparse autoencoder
2008	Vincent et al.	Denoising autoencoders
2011	Andrew Ng (Stanford CS294A)	Sparse autoencoder lecture notes that codified the L1 and KL-divergence formulations
2012	Le, Ranzato, Monga, Devin, Chen, Corrado, Dean, Ng	Cat-detector network, billion-parameter sparse autoencoder trained on 10M YouTube frames (ICML; arXiv:1112.6209)
2013	Makhzani and Frey	k-Sparse autoencoders, precursor of Top-K (arXiv:1312.5663)
2013	Bengio, Courville, Vincent	Representation Learning: A Review and New Perspectives (IEEE TPAMI; arXiv:1206.5538)

Interpretability turn: superposition and SAEs

The modern resurgence of interest in sparse autoencoders began with a problem in mechanistic interpretability. Researchers had repeatedly found that individual neurons in trained transformers respond to many unrelated concepts at once, a phenomenon called polysemanticity. A neuron that fires for both Python for loops and German prepositions cannot be cleanly labelled.

Toy Models of Superposition (2022)

In September 2022, Anthropic published Toy Models of Superposition (Elhage, Hume, Olsson and colleagues), which argued that polysemanticity is best explained by superposition. Networks have far more conceptual features they want to represent than they have neurons, so they pack features into nearly orthogonal directions in activation space. The directions cannot align with neuron axes without sacrificing capacity, but they should be recoverable from a sparse, overcomplete decomposition. The paper demonstrated that superposition arises in carefully constructed toy models and conjectured that real transformers were doing something analogous. The recipe was clear: train a sparse autoencoder on activations of a real network and the underlying features should appear as its hidden units.

Cunningham et al. (September 2023)

The first end-to-end demonstration came from Hoagy Cunningham, Aidan Ewart, Logan Riggs Smith, Robert Huben and Lee Sharkey in their September 2023 paper Sparse Autoencoders Find Highly Interpretable Features in Language Models (arXiv:2309.08600, later presented at an ICLR 2024 workshop). The work was a collaboration involving researchers from EleutherAI, Apollo Research, and Anthropic. They trained SAEs on residual-stream activations of small language models (Pythia variants) and showed that the resulting features were measurably more interpretable, on automated metrics, than directions found by principal component analysis. The paper introduced the experimental template that nearly every subsequent SAE-for-interpretability work follows: collect activations from a frozen LM, train an SAE on those activations, and then evaluate the resulting features for interpretability and downstream utility.

Towards Monosemanticity (October 2023)

A few weeks later, on 4 October 2023, Anthropic published Towards Monosemanticity: Decomposing Language Models With Dictionary Learning, by Trenton Bricken, Adly Templeton, Joshua Batson and colleagues. The team trained sparse autoencoders on MLP activations from a one-layer transformer, expanding the 512-dimensional MLP activations into SAE codes with hidden widths from roughly 4,096 up to about 131,000 features. They reported recovering thousands of interpretable, monosemantic features for concepts like Arabic script, DNA sequences, base64-encoded text, Hebrew text, legal language, HTTP requests, nutrition labels, and named entities. Human raters judged about 70% of the recovered features to map cleanly to a single concept, compared with a much smaller fraction of the model's raw neurons.

The report introduced several phenomena that have since become standard SAE vocabulary: feature splitting (increasing SAE width subdivides one general concept into several more specific ones), dead features (units that essentially never activate after a few thousand steps of training), resampling (periodically reinitialising dead units), and feature universality (some features show up in independent SAE training runs and across different base models).

Scaling Monosemanticity (May 2024)

On 21 May 2024 Anthropic published the follow-up Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet by Adly Templeton, Tom Conerly, Jonathan Marcus and colleagues. They trained SAEs on residual-stream activations roughly halfway through Claude 3 Sonnet, with hidden widths of approximately 1 million, 4 million, and 34 million features. The recovered features included multilingual and multimodal concepts (the same direction firing for the Golden Gate Bridge in English, in other languages, and in images of the bridge), plus features for safety-relevant concerns like deception, sycophancy, code with security vulnerabilities, biological-weapons information, racial bias, hatred and slurs, and inner conflict. The team demonstrated feature steering: forcing a feature to fire strongly during a forward pass changes the model's behaviour in a way that matches the feature's interpretation. Clamping the Golden Gate Bridge feature at roughly ten times its maximum observed activation made Claude insist that it was the bridge. Anthropic released a public demo, Golden Gate Claude, that exposed feature steering to users for about 24 hours starting 23 May 2024, and later published a follow-up paper, Evaluating Feature Steering, that systematically measured what feature steering does and does not change.
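One simple way to express feature clamping is to add the feature's decoder direction to the activation, scaled by the gap between the target value and the feature's current activation. The sketch below illustrates that idea under stated assumptions and is not Anthropic's implementation: the tied encoder, the feature index `j`, and `max_observed` are hypothetical.

```python
import numpy as np

def clamp_feature(x, z, W_d, j, value):
    """Edit activation x so that feature j contributes `value` instead of z[j]."""
    return x + (value - z[j]) * W_d[:, j]

rng = np.random.default_rng(0)
d, h = 8, 32
W_d = rng.normal(size=(d, h))
W_d /= np.linalg.norm(W_d, axis=0, keepdims=True)   # unit-norm decoder columns
W_e = W_d.T.copy()                                  # tied encoder, for the demo only

x = rng.normal(size=d)                              # stand-in for a residual-stream vector
z = np.maximum(W_e @ x, 0.0)                        # current feature activations

j = 3                                               # hypothetical feature index
max_observed = 2.0                                  # hypothetical maximum activation
x_steered = clamp_feature(x, z, W_d, j, value=10.0 * max_observed)
```

Editing the activation in this additive form leaves whatever the SAE failed to reconstruct untouched, so the only change passed on to later layers is the one along feature `j`.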

OpenAI Top-K SAEs (June 2024)

OpenAI's June 2024 paper Scaling and evaluating sparse autoencoders by Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh and colleagues took a different route. The team revived Makhzani and Frey's k-sparse autoencoder, now called Top-K, and showed that it scales cleanly to a 16-million-feature SAE trained on GPT-4 activations across 40 billion tokens, with smaller-scale runs on GPT-2 small. Top-K eliminates L1 shrinkage by construction and gives exact control over the number of active features per input. The paper proposed evaluation metrics based on probe loss (recovering hypothesised features), explainability of activation patterns, downstream sparsity, and "feature recovery" against a held-out feature set.

OpenAI also reported that even at 16 million features, the SAE struggled to fully reconstruct GPT-4 activations: average loss recovered topped out below 95%. Training cost was non-trivial; the company estimated the compute for its largest run to be on the order of one percent of GPT-4 pretraining.

Gated SAEs and JumpReLU (DeepMind, 2024)

In parallel, Google DeepMind pursued its own SAE research line. Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, Janos Kramar, Rohin Shah and Neel Nanda introduced the Gated SAE in April 2024 (arXiv:2404.16014, Improving Dictionary Learning with Gated Sparse Autoencoders), and the JumpReLU SAE in July 2024 (arXiv:2407.14435, Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders). Both designs improve the reconstruction-vs-sparsity Pareto frontier. The Gated SAE separates a feature's gating decision (whether to fire) from its magnitude, applying the L1 penalty only to the gate, and showed that on models of up to 7B parameters it requires roughly half as many firing features as a vanilla L1 SAE for the same reconstruction quality. JumpReLU replaces the ReLU activation with a thresholded "jump" function and trains directly against the $L_0$ count of active features using straight-through estimators, with each feature learning its own activation threshold.
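The gate-and-magnitude split can be sketched as a forward pass. The weight-sharing scheme used here, in which the magnitude path reuses the gate weights with a learned per-feature rescaling `exp(r_mag)`, is an assumption of this sketch, and the parameter values are placeholders; auxiliary losses used in training are omitted.

```python
import numpy as np

def gated_encoder(x, W_gate, b_gate, r_mag, b_mag, b_dec):
    """Gated encoder: a binary gate picks features, a magnitude path scales them."""
    x_c = x - b_dec                                   # centre by the decoder bias
    pi_gate = W_gate @ x_c + b_gate                   # gate preactivations
    gate = (pi_gate > 0).astype(x.dtype)              # which features fire
    W_mag = np.exp(r_mag)[:, None] * W_gate           # assumed per-feature rescaling
    mag = np.maximum(W_mag @ x_c + b_mag, 0.0)        # how strongly they fire
    return gate * mag, pi_gate

rng = np.random.default_rng(0)
d, h = 8, 32
W_gate = rng.normal(scale=0.3, size=(h, d))
b_gate = np.zeros(h)
r_mag = np.zeros(h)            # exp(0) = 1, so both paths start out identical
b_mag = np.zeros(h)
b_dec = np.zeros(d)
x = rng.normal(size=d)

z, pi_gate = gated_encoder(x, W_gate, b_gate, r_mag, b_mag, b_dec)
# During training the L1 penalty would be applied to ReLU(pi_gate), not to z.
```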

Gemma Scope (July-August 2024)

The JumpReLU architecture underpins Gemma Scope, announced by DeepMind in July 2024 and detailed in arXiv:2408.05147 (August 2024). Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy and colleagues released more than 400 JumpReLU SAEs trained on every layer and sublayer of Gemma 2 2B and 9B and select layers of Gemma 2 27B, with more than thirty million learned features in total, all under an open-weights licence on Hugging Face. Gemma Scope was the first time a major lab made production-scale SAEs available for the broader interpretability community to use directly. A follow-up, Gemma Scope 2, extended coverage in 2025 to instruction-tuned variants, additional sublayer types, and SAEs focused on jailbreak and refusal analysis.

Goodfire's Llama 3 SAEs (late 2024 to early 2025)

Goodfire AI, an interpretability startup founded in 2024, released open-source SAEs for Llama 3.1 8B and Llama 3.3 70B in late December 2024 (announced on 8 January 2025 alongside the Ember API). At the time the company claimed they were the first open-source SAEs at 70B-instruct scale. The 8B SAE targets layer 19 of Llama 3.1 8B Instruct with L0 around 91; the 70B SAE targets layer 50 of Llama 3.3 70B Instruct with L0 around 121. The accompanying Ember API exposes SAE features as building blocks for steering, classification, and feature search, the most fully commercial productisation of SAE features to date.

Transluce, Apollo, and the academic ecosystem

Several newer organisations have built infrastructure on top of SAEs. Transluce, founded in 2024 by Jacob Steinhardt and colleagues, released Monitor, an AI-driven observability interface that uses SAE features; their Predictive Concept Decoders work uses SAEs to forecast model behaviour. Apollo Research released the e2e_sae library and contributed the end-to-end SAE training method described in Dan Braun, Jordan Taylor, Nicholas Goldowsky-Dill and Lee Sharkey's 2024 NeurIPS paper (arXiv:2405.12241), which trains an SAE so that replacing the model's activation with its reconstruction minimises downstream language-model loss rather than just the activation MSE. EleutherAI continues to maintain the EleutherAI/sae library and a large Hugging Face collection of pretrained SAEs for Pythia and other open models.

Architectural variants for interpretability

The interpretability community has converged on a small zoo of SAE variants. Several can be combined: a JumpReLU transcoder is a perfectly reasonable thing to train, and Matryoshka and end-to-end objectives can stack on top of any base activation.

Variant	Year and origin	Key idea	Notes
Vanilla L1 SAE	Cunningham et al. 2023; Bricken et al. 2023	ReLU encoder with L1 penalty on hidden activations	Simplest design, suffers from shrinkage and dead features
Top-K SAE	Makhzani and Frey 2013; OpenAI Gao et al. 2024	Keep only the $k$ largest activations per input	Exact control over $L_0$; underlies OpenAI's 16M-feature GPT-4 SAE
JumpReLU SAE	Rajamanoharan et al. (DeepMind) 2024	Threshold activation that fires only above a learned per-feature cutoff; trained against $L_0$ via straight-through estimators	State-of-the-art reconstruction-fidelity at fixed sparsity; powers Gemma Scope
Gated SAE	Rajamanoharan et al. (DeepMind) 2024	Two paths: a gate decides whether to fire, a magnitude head decides how much; L1 only on the gate	Cures L1 shrinkage; comparable interpretability to vanilla SAEs
End-to-end SAE (e2e)	Braun et al. 2024	Train the SAE so that replacing the model's activation with $g(f(x))$ minimises downstream loss, not just reconstruction	Aligns SAE with what actually matters for the task; Apollo Research library
Matryoshka SAE	Bussmann, Nabeshima, Karvonen, Nanda 2025	Train nested groups of latents that each independently reconstruct input	Reduces feature absorption; supports multi-resolution interpretation (ICML 2025)
Transcoder	Templeton et al. 2024; Dunefsky et al. 2024	Sparse map from one layer's input to the next layer's output, replacing a whole MLP block	Used for circuit tracing in Anthropic's On the Biology of a Large Language Model (2025)
Skip transcoder	Various 2024 to 2025	Transcoder with a linear bypass for the easy part of the residual stream	Improves reconstruction without changing interpretability metrics
Crosscoder	Anthropic, October 2024	Single SAE that reads from and writes to multiple layers, or multiple models	Identifies cross-layer features and supports model diffing across base and chat fine-tunes
Sparse autoencoder on multimodal activations	Templeton et al. 2024; various 2024 to 2025	Train an SAE on the joint activations of a vision-language model	Captures multimodal concepts like "images of the Golden Gate Bridge"
Lorsa (Low-rank sparse attention)	OpenMOSS 2025	Decompose an attention layer into a wide set of low-rank, sparsely active "attention units" that replace dense attention heads	Extends the SAE programme to attention superposition; supported in the Language-Model-SAEs framework
Complete replacement model	Anthropic / community 2025 to 2026	Replace every MLP and attention block in a model with a sparse transcoder or Lorsa, then study the fully sparsified network	Aspirational end-state of the sparse-circuits programme; early implementations exist for small open models

These variants are not mutually exclusive. The general trend is towards architectures that give more direct control over sparsity, support downstream analyses like circuit-level attribution, and decouple feature identification (the gate or top-k decision) from feature magnitude (how strongly it fires).

Training pipeline in practice

A modern interpretability SAE pipeline shares the same broad steps across labs. First, the researcher collects a large corpus of activations from a frozen base model on a representative dataset, typically tens to hundreds of billions of tokens for production-scale SAEs; activations are usually stored on disk and streamed during training. Second, those activations are pre-processed, often by subtracting a running mean and normalising the scale, with decoder columns initialised to unit norm and projected back to the unit sphere after each step. Third, the SAE is trained with a chosen sparsity mechanism (L1 penalty, Top-K, JumpReLU, Gated) using Adam or a similar optimiser. Resampling is run periodically: any feature whose firing rate falls below a threshold for many steps is reset by reinitialising its encoder row to a high-loss data point and its decoder column to a small random direction. Fourth, the SAE is evaluated on held-out activations using a basket of metrics (L0, MSE, variance explained, loss recovered, auto-interp scores) and visualised through feature dashboards. Finally, the SAE can be spliced into the base model as an approximation of the original activation, for circuit tracing or steering experiments.
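Two of these steps, pre-processing and the detection of dead features ahead of resampling, can be sketched as follows. The example uses a batch mean in place of a running mean, rescales so the average activation norm is $\sqrt{d}$, and uses a firing-rate threshold of `1e-6`; all three are assumptions of this sketch, not settings reported by any particular lab.

```python
import numpy as np

def preprocess(acts):
    """Centre activations and rescale so their average L2 norm is sqrt(d)."""
    centred = acts - acts.mean(axis=0, keepdims=True)
    scale = np.sqrt(acts.shape[1]) / np.linalg.norm(centred, axis=1).mean()
    return centred * scale

def dead_feature_mask(Z, min_rate=1e-6):
    """Z: (n_tokens, h) SAE activations. True where a feature fires too rarely."""
    firing_rate = (Z > 0).mean(axis=0)        # fraction of tokens on which each feature fires
    return firing_rate < min_rate

rng = np.random.default_rng(0)
acts = rng.normal(loc=3.0, size=(1000, 16))   # stand-in for collected activations
X = preprocess(acts)

Z = np.maximum(rng.normal(size=(1000, 8)), 0.0)
Z[:, 5] = 0.0                                 # simulate a feature that never fires
dead = dead_feature_mask(Z)                   # candidates for resampling
```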

The sheer scale of activation collection is one of the larger hidden costs. OpenAI's Scaling and evaluating sparse autoencoders reports compute on the order of 1% of GPT-4 pretraining for its largest SAE. Storage for activations on a long-context corpus can run into petabytes if not handled carefully.
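The storage figure can be sanity-checked with back-of-the-envelope arithmetic. The numbers below (100 billion tokens, a 4,096-dimensional residual stream, 16-bit values) are illustrative assumptions, not figures from any of the cited papers.

```python
def activation_storage_bytes(n_tokens, d_model, bytes_per_value=2, n_sites=1):
    """Bytes needed to cache one activation vector per token at each hooked site."""
    return n_tokens * d_model * bytes_per_value * n_sites


one_site = activation_storage_bytes(n_tokens=100_000_000_000, d_model=4096)
four_sites = activation_storage_bytes(n_tokens=100_000_000_000, d_model=4096, n_sites=4)

print(f"{one_site / 1e15:.2f} PB for one site, {four_sites / 1e15:.2f} PB for four")
```

Under these assumptions a single hooked site already needs roughly 0.8 PB, which is why activations are often generated on the fly or stored for only a subset of layers.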

Evaluation metrics

There is no single accepted way to evaluate an interpretability SAE. In practice, researchers report a basket of complementary metrics. The most common are summarised below.

Metric	What it measures	Notes
$L_0$	Average number of active features per input token	Lower is sparser; Top-K fixes it exactly and JumpReLU penalises it directly
Reconstruction loss (MSE)	Squared error between the SAE's output and the original activation	Sensitive to activation magnitude across layers
Variance explained	$1 - \|x - \hat{x}\|^2 / \|x - \bar{x}\|^2$, with $\bar{x}$ the mean activation	Normalises MSE by the variance of the activations; comparable across layers
Loss recovered	Fraction of language-model loss preserved when activations are replaced by SAE reconstructions, usually measured relative to an ablation baseline	Downstream metric, captures "does the SAE keep the model working?"
Manual interpretability	Human raters score whether each feature maps to a single concept	Gold standard but slow and small-sample
Auto-interp scores	LLM writes a description from top-activating examples; second LLM scores prediction quality on held-out text	Introduced by Bills et al. 2023; scales
Probe loss	How well a linear probe trained on the SAE's features predicts a target attribute	Tests downstream utility
Feature recovery	Fraction of hypothesised "true" features the SAE recovers on synthetic benchmarks	Lets you compare SAEs on a shared yardstick
Sparse probing on board games	Recover game-state features from board-game models	Karvonen et al. 2024, Measuring Progress in Dictionary Learning
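The first four rows of the table can be computed directly from arrays of activations and features. The sketch below is illustrative: the function names are hypothetical, variance explained is measured around the dataset mean (which coincides with the uncentred ratio when activations have already been centred), and loss recovered uses an ablation baseline, one common convention among several.

```python
import numpy as np

def l0(features):
    """Mean number of non-zero features per token; features is (n_tokens, n_features)."""
    return float((features != 0).sum(axis=1).mean())

def mse(x, x_hat):
    """Mean squared reconstruction error per token."""
    return float(((x - x_hat) ** 2).sum(axis=1).mean())

def variance_explained(x, x_hat, centre=True):
    """1 - residual sum of squares / total sum of squares."""
    baseline = x.mean(axis=0) if centre else 0.0
    resid = ((x - x_hat) ** 2).sum()
    total = ((x - baseline) ** 2).sum()
    return float(1.0 - resid / total)

def loss_recovered(clean_loss, spliced_loss, ablated_loss):
    """Share of the gap between the ablated and clean model that the SAE closes."""
    return (ablated_loss - spliced_loss) / (ablated_loss - clean_loss)

x = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
features = np.array([[0.0, 1.0, 0.0], [2.0, 0.0, 3.0], [0.0, 0.0, 0.5]])
print(l0(features), variance_explained(x, x), loss_recovered(2.0, 2.5, 4.0))
```

A perfect reconstruction gives a variance explained of 1, while predicting the mean activation for every token gives 0.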

A recurring concern, raised most directly in Sparse but Wrong: Incorrect L0 Leads to Incorrect Features in Sparse Autoencoders (arXiv:2508.16560), is that $L_0$ is a hyperparameter that must be set correctly to recover the true feature basis: too low and the SAE merges correlated features to save reconstruction budget, too high and it splits one feature into several near-duplicates. As a result, sparsity-vs-reconstruction Pareto plots are not by themselves a sound measure of feature quality. Several recent papers, including Paulo and Belrose 2025 Sparse Autoencoders Trained on the Same Data Learn Different Features, also document substantial instability in which features SAEs recover across training runs even when reconstruction and sparsity are comparable.

SAEBench

The most concerted attempt to put SAE evaluation on a common footing is SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability (Karvonen, Rager and colleagues, arXiv:2503.09532, March 2025; ICML 2025). SAEBench runs an SAE through eight diverse metrics covering reconstruction (loss recovered), automated interpretability scoring, sparse probing, feature absorption, targeted probe perturbation (TPP), spurious correlation removal (SCR), the RAVEL disentanglement benchmark, and unlearning. The accompanying release includes more than 200 open-source SAEs spanning seven architectures, so practitioners can compare a new design against published baselines on a fixed yardstick rather than rebuilding the evaluation pipeline from scratch.

A central finding is that gains on the proxy metrics SAE designers usually track (L0 and reconstruction MSE) do not reliably translate into better practical performance on downstream tasks. In the typical L0 range of 20 to 200, Matryoshka SAEs outperform other architectures on TPP, SCR, RAVEL, and feature absorption, and come close on sparse probing. JumpReLU and Top-K SAEs remain strong on reconstruction-centric metrics but trade off some feature disentanglement. The benchmark's interactive interface, hosted by Neuronpedia, lets researchers slice metrics by architecture, training algorithm, model, and layer. A complementary effort, CE-Bench: Towards a Reliable Contrastive Evaluation Benchmark of Sparse Autoencoders (arXiv:2509.00691, September 2025), provides a contrastive evaluation that probes whether SAE features generalise beyond the precise activations they were trained on.

Famous findings

The Golden Gate Bridge feature

The most famous SAE feature is the Golden Gate Bridge direction in Claude 3 Sonnet, identified in Scaling Monosemanticity and made public via Golden Gate Claude. The feature fires for English text mentioning the bridge, for translations into French, German, Mandarin, and Japanese, and for images of the bridge fed to the model's multimodal interface. Clamping it high turns Claude into a bridge enthusiast that responds to almost any prompt with bridge-themed content. The viral success of the demo did more to popularise sparse autoencoders in the wider AI conversation than any technical paper.

Safety-relevant features

More consequentially, Anthropic and others have documented features that fire on deception, sycophancy, racial slurs, hate speech, biological-weapons information, code with security vulnerabilities, and prompts asking the model to break character. The Scaling Monosemanticity report includes case studies of what happens when these features are clamped or suppressed. A separate Anthropic post, Evaluating Feature Steering, measured how much feature interventions actually change model behaviour on standard benchmarks and found the changes are real but often modest, especially compared with the more dramatic effects of fine-tuning.

Programming, entity, and multilingual features

Language-model SAEs routinely surface features for programming-language constructs: Python for loops, Rust lifetime annotations, SQL JOIN clauses, JSON keys, regex patterns, base64-encoded strings. Anthropic, OpenAI, DeepMind, and Goodfire have all reported very similar code features in their respective base-model SAEs, which is suggestive evidence for the feature universality hypothesis that similar features emerge across different models and independent SAE training runs. Gemma Scope SAEs and Goodfire Llama 3 SAEs surface features for individual famous people, institutions, currencies and countries, and fictional universes. Several papers document that semantically related text in many languages activates the same SAE features; the Golden Gate Bridge feature is the canonical example, but the pattern holds for thousands of concepts.

Sparsity-reconstruction tradeoff and scaling

Every SAE faces a fundamental trade between sparsity and reconstruction. At one extreme, a fully dense linear autoencoder can reconstruct activations exactly, but its features are no more interpretable than the original neurons. At the other, a maximally sparse code lets at most a few features fire and almost certainly drops important content. The Pareto frontier between L0 and MSE (or loss recovered) is the standard way to summarise an SAE's quality.
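Extracting the Pareto frontier from a set of trained SAEs is a short computation. The sketch below assumes each SAE has been summarised as an (L0, MSE) pair where lower is better on both axes; the function name and the example numbers are made up for illustration.

```python
def pareto_frontier(points):
    """Return the (l0, mse) points not dominated by any other point.

    A point is dominated if some other point is at least as sparse and at
    least as accurate, and strictly better on one of the two axes.
    """
    frontier = []
    best_mse = float("inf")
    for l0, mse in sorted(points):      # ascending L0, then ascending MSE
        if mse < best_mse:              # strictly better than every sparser SAE
            frontier.append((l0, mse))
            best_mse = mse
    return frontier


runs = [(20, 0.30), (50, 0.18), (50, 0.22), (100, 0.19), (200, 0.10)]
frontier = pareto_frontier(runs)
```

Here the run at L0 = 100 drops out because a sparser run already achieves a lower MSE.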

Gao et al.'s OpenAI paper laid out the clearest empirical scaling laws for SAEs to date. With Top-K activations, holding $k$ fixed, the reconstruction loss decreases as a power law in SAE width, similar to but slower than the loss scaling laws of language-model pretraining. With width fixed, increasing $k$ improves reconstruction roughly logarithmically. The compute-optimal frontier (best loss for a given training budget) lives at a non-trivial combination of width and $k$, and large-width Top-K SAEs continue to improve well past the point where smaller architectures plateau.
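A power-law relationship of this kind is usually estimated as a straight line in log-log space. The sketch below fits $L(n) = a \, n^{-\alpha}$ by ordinary least squares on the logarithms; it ignores any irreducible-loss offset, the function name is hypothetical, and the data points are synthetic values generated from a known power law rather than measurements from the paper.

```python
import math

def fit_power_law(widths, losses):
    """Least-squares fit of log(loss) = log(a) - alpha * log(width)."""
    xs = [math.log(w) for w in widths]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return math.exp(intercept), -slope      # (a, alpha)


widths = [2**15, 2**16, 2**17, 2**18]
losses = [3.0 * w ** -0.25 for w in widths]   # synthetic, not measured data
a, alpha = fit_power_law(widths, losses)
```

On real measurements the points would scatter around the line, and the fitted exponent would depend on the range of widths included.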

DeepMind's JumpReLU paper found a similar story for JumpReLU SAEs and showed that JumpReLU dominates Gated SAEs and vanilla L1 SAEs on the Pareto frontier across model scales from 2B to 9B. The Anthropic Towards Monosemanticity report and follow-on work suggested that the linear features hypothesis (concepts are encoded as directions in activation space, and the network's activation is a sparse sum of those directions) implies a particular relationship between SAE width and feature count, and that as width grows past some threshold, increasingly fine-grained features emerge (feature splitting).

Three tensions complicate the simple Pareto picture. First, more features and lower L0 can degrade interpretability if individual features become more diffuse; researchers have to look at the actual features, not just the curve. Second, the appropriate evaluation metric depends on the downstream task: steering accuracy, probe loss, and circuit-level attribution can each rank SAEs differently from raw reconstruction MSE. Third, the choice of L0 controls feature granularity, not just feature quality: an L0 of 50 and an L0 of 200 produce qualitatively different decompositions of the same activations, and the right L0 for one analysis (concept erasure) may not be the right L0 for another (steering).

Use cases

The classical role of a sparse autoencoder was as a self-supervised pre-training method: train it on unlabelled data, then transplant the encoder into a downstream supervised classifier as a feature extractor. That use largely faded as supervised pre-training, contextual embeddings, and self-supervised foundation models took over.

Feature discovery

The most common modern use is feature discovery in trained models, especially large language models, vision models like CLIP, and multimodal models. Anthropic, DeepMind, OpenAI, and Goodfire have demonstrated SAE features for concepts ranging from grammatical patterns to abstract entities to safety-relevant behaviours. Browsable dashboards (Neuronpedia, individual lab tools) make hundreds of thousands of SAE features inspectable.

Steering

Closely related is steering, in which a researcher clamps a feature to a high or low value at inference time to amplify or suppress an associated behaviour. The Golden Gate Claude demo was the most public example, but feature steering has also been used to push models towards more honest answers, to control persona and style, to suppress refusal behaviour for red-team analysis, and to study how features compose. Goodfire's Ember API is built explicitly around feature steering for Llama 3 SAEs.
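Clamping a feature amounts to editing one coordinate of the sparse code and decoding again. The sketch below is a minimal illustration with random weights standing in for a trained SAE; the function names are hypothetical, and adding the SAE's reconstruction error back to the edited activation is one common choice rather than a fixed part of the method.

```python
import numpy as np

def encode(x, W_enc, b_enc, b_dec):
    """ReLU encoder applied to the bias-centred activation."""
    return np.maximum(W_enc @ (x - b_dec) + b_enc, 0.0)

def steer(x, W_enc, b_enc, W_dec, b_dec, feature, value):
    """Clamp one SAE feature to `value` and return the edited activation."""
    f = encode(x, W_enc, b_enc, b_dec)
    error = x - (W_dec @ f + b_dec)      # residual the SAE does not capture
    f_steered = f.copy()
    f_steered[feature] = value           # clamp a single feature
    return W_dec @ f_steered + b_dec + error

rng = np.random.default_rng(0)
d_model, n_features = 8, 32
W_enc = rng.normal(size=(n_features, d_model))   # random stand-in weights
W_dec = rng.normal(size=(d_model, n_features))
b_enc, b_dec = np.zeros(n_features), np.zeros(d_model)
x = rng.normal(size=d_model)

x_steered = steer(x, W_enc, b_enc, W_dec, b_dec, feature=3, value=10.0)
```

Because the error term is kept, the edit reduces to shifting the activation along the chosen feature's decoder direction; setting `value` to 0 ablates the feature instead of amplifying it.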

Concept erasure and safety probing

SAE features have been used as targets for concept erasure: identify the features that encode a target concept, then ablate or down-weight them to remove the concept from the model's response. Karvonen et al. 2024 Measuring Progress in Dictionary Learning evaluated several SAE variants on a benchmark of board-game state features. SAE features for deception, sycophancy, sandbagging, and refusal have also become standard probes in alignment work, with researchers asking whether a deception feature fires more on prompts where the model is suspected of being dishonest. Gemma Scope 2 includes SAEs specifically intended for jailbreak and refusal analysis.

Circuit tracing

The most ambitious use is circuit tracing: decomposing whole computations through a model into SAE-level features and routes between them. Anthropic's Circuit Tracing: Revealing Computational Graphs in Language Models and the companion On the Biology of a Large Language Model (Lindsey et al., 2025) use cross-layer transcoders to build attribution graphs for Claude 3.5 Haiku, revealing multi-step reasoning, planning, and hallucination-suppression circuits. The open-source release of those tools, integrated with Neuronpedia, has made circuit tracing reproducible on smaller open models.

Probing, concept bottlenecks, and beyond language

SAE features feed probing studies and bias detection work that searches for representations of protected attributes; a linear probe trained on SAE features often achieves similar accuracy with far better interpretability than one trained on raw activations. They also serve as building blocks for concept-bottleneck models, in which downstream classifiers read only through interpretable feature dimensions, a workflow Goodfire's Ember has productised. Outside language modelling, SAEs have been applied to protein language models (Adams et al. 2025, PNAS), recommendation models, audio models, and reinforcement-learning policies, following the same recipe of training on internal activations and browsing the resulting features for domain concepts.

Limitations

Reconstruction is never perfect. Any SAE sparse enough to be interpretable leaves residual error that the underlying model would not have, and splicing a lossy SAE back into the model degrades downstream performance. L1-based SAEs additionally suffer from shrinkage: the L1 penalty pushes activation magnitudes downward in addition to driving them to zero, so recovered feature magnitudes systematically underestimate the true coefficients. Top-K, JumpReLU, and Gated SAEs ameliorate this but introduce their own complications around discontinuous activations and gradient estimation.
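Shrinkage can be seen in the simplest possible case. Assume a single feature with a unit-norm decoder direction and a true coefficient $c$, and an objective of squared error plus an L1 penalty, $(c - a)^2 + \lambda a$ with $a \ge 0$. Setting the derivative to zero gives $a = c - \lambda/2$, so the recovered magnitude is biased downward by $\lambda/2$ whenever the feature fires. The function name below is illustrative.

```python
def l1_optimal_activation(true_coeff, l1_coeff):
    """Minimiser of (true_coeff - a)**2 + l1_coeff * a over a >= 0."""
    return max(true_coeff - l1_coeff / 2.0, 0.0)


strong = l1_optimal_activation(true_coeff=1.0, l1_coeff=0.5)   # fires, but shrunk
weak = l1_optimal_activation(true_coeff=0.1, l1_coeff=0.5)     # driven to zero
```

The same penalty that zeroes out weak features also shaves a constant amount off the strong ones, which is the bias that Top-K, JumpReLU, and Gated designs try to avoid.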

Feature splitting and feature absorption are two related phenomena that are still not fully solved. Feature splitting subdivides a coherent concept into many narrow features at high widths; feature absorption is the inverse, where a higher-level concept gets cannibalised by a more specific child feature. Matryoshka SAEs explicitly target absorption, but no architecture so far cleanly resolves both. Paulo and Belrose 2025 Sparse Autoencoders Trained on the Same Data Learn Different Features documents substantial instability of the recovered feature basis across training seeds, even when reconstruction and L0 are comparable.

Large SAEs are computationally expensive. A 34-million-feature SAE is bigger than many of the models being interpreted, and training one requires substantial compute and storage of activations. OpenAI's largest SAE for GPT-4 reportedly cost on the order of 1% of GPT-4 pretraining. Dead features (units that essentially never fire after a few thousand steps) waste capacity and require ad hoc resampling tricks to recover. Recovered features are not always cleanly monosemantic; some remain polysemantic.
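The size comparison follows from counting weights. The sketch below assumes a standard SAE with one encoder matrix, one decoder matrix, and a bias vector on each side; the residual-stream width of 4,096 is a hypothetical value chosen for illustration, since the width of the model in question is not stated here.

```python
def sae_parameter_count(d_model, n_features):
    """Encoder and decoder matrices plus encoder and decoder bias vectors."""
    return 2 * d_model * n_features + n_features + d_model


params = sae_parameter_count(d_model=4096, n_features=34_000_000)
print(f"{params / 1e9:.1f} billion parameters")
```

Parameter count grows linearly in both the dictionary size and the width of the activation being decomposed, so very wide dictionaries on large models quickly reach the scale of the models themselves.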

Lack of ground truth is a deeper problem. There is no oracle that says "this is the right feature basis". Evaluations rely on reconstruction metrics, human raters, auto-interp scores, and synthetic benchmarks, each with known failure modes. Two SAEs that score similarly on reconstruction and L0 can produce different feature inventories, with no principled way to say which is right.

In April 2025, DeepMind's mechanistic interpretability team published a candid update titled Negative Results for Sparse Autoencoders On Downstream Tasks and Deprioritising SAE Research, reporting that SAE features did not consistently outperform simpler probing baselines on the downstream tasks they tried. The post sparked a broader debate about whether SAEs deliver enough scientific value to justify the engineering investment. Other groups have continued to invest, and as of 2026 the field is active but more sober.

Finally, there is a theoretical critique. The linear features hypothesis, on which the whole SAE programme implicitly rests, is the assumption that the model's relevant computations live in linear directions in activation space that can be cleanly separated by a sparse linear code. Real models almost certainly use some non-linear or distributed representations that do not decompose this way, and there is active debate about whether SAEs are the right primitive for studying those phenomena.

Implementations, tools, and libraries

Several open-source tools and pre-trained models exist for sparse-autoencoder research.

Tool / resource	Maintainer	What it provides
Gemma Scope	Google DeepMind, 2024	400+ JumpReLU SAEs across Gemma 2 2B, 9B, 27B, on Hugging Face; Gemma Scope 2 update in 2025
Top-K SAEs for GPT-2 / GPT-4	OpenAI, 2024	Code accompanying Scaling and evaluating sparse autoencoders; GPT-2 small SAEs released publicly
Goodfire Llama 3 SAEs	Goodfire AI, 2024 to 2025	Open-source SAEs for Llama 3.1 8B and Llama 3.3 70B; Ember API for feature steering
Anthropic Sparse Autoencoder Library	Anthropic	Closed source; powers Towards Monosemanticity, Scaling Monosemanticity, Circuit Tracing
SAELens	Joseph Bloom, Curt Tigges, Anthony Duong, David Chanin and others	Python library for training and analysing SAEs on transformer language models, integrated with TransformerLens
TransformerLens	Neel Nanda and contributors	Interpretability toolkit on which SAELens depends; provides hooks for reading and writing internal activations
EleutherAI `sae` library	EleutherAI	Training infrastructure used for SAE experiments on Pythia and other open models
Apollo `e2e_sae` library	Apollo Research	End-to-end SAE training, used in Braun et al. 2024
Matryoshka SAE code	Bart Bussmann	Reference implementation of Matryoshka SAEs
Neuronpedia	Decode Research	Interactive front-end for browsing SAE features; dashboards for Gemma Scope, OpenAI GPT-2 SAEs, and others; powers attribution-graph visualisations
SAEDashboard	Joseph Bloom	Reusable feature-dashboard component for SAE inspection
NNsight	Northeastern NDIF	Library for instrumenting LLM activations, including SAE intervention
Goodfire SDK	Goodfire AI	Python SDK for the Ember API, exposing SAE features for Llama 3

Most interpretability research today combines several of these: SAELens or EleutherAI/sae for training, Neuronpedia and SAEDashboard for browsing, TransformerLens or NNsight for hooking SAEs into a base model, and either the Apollo or Goodfire libraries for downstream applications.

Comparison with other interpretability methods

Sparse autoencoders are one tool among many in the modern interpretability toolkit. The table below sketches how they compare with several other widely used approaches.

Method	What it does	Strengths	Weaknesses
Sparse autoencoder	Decomposes activations into sparse, interpretable features	Many human-interpretable features per layer; supports steering	Lossy, hyperparameter-sensitive, computationally heavy
Probing classifier	Trains a small classifier on activations to predict a target attribute	Simple, sample-efficient	Tells you what is decodable, not what the model uses
Activation patching	Swaps activations between two runs to test causal effect	Causal evidence, supports circuit discovery	Tedious; many candidate components
Logit lens	Projects intermediate activations through the unembedding matrix	Cheap, intuitive	Only meaningful where the residual stream is aligned with output space
Mechanistic circuit analysis	Identifies subnetworks (e.g. induction heads) behind specific behaviours	Deep mechanistic understanding	Labour-intensive, model-specific
Integrated Gradients, SmoothGrad	Input-attribution via gradients	Standard input attribution	Operates at the input level only
SHAP	Input-attribution via Shapley values	Game-theoretically grounded	Expensive, also input-level
Linear probing on raw activations	Linear classifier directly on hidden states	Cheap, comparable across layers	Hard to interpret; no decomposition
Transcoders + attribution graphs	Replace MLP blocks with sparse transcoders, then attribute behaviour to feature chains	Native circuit-tracing pipeline; used in Anthropic's 2025 On the Biology of a Large Language Model	Heavier than vanilla SAEs; less mature tooling

In practice, modern interpretability papers combine these methods: SAEs identify candidate features, activation patching or attribution tests which features are causally implicated in a behaviour, and probing or logit-lens analyses triangulate the resulting story. The Anthropic circuit-tracing line is arguably the closest existing thing to a complete pipeline, with SAEs (or their transcoder cousins) as building blocks and attribution graphs as the connective tissue.

Recent context (2024 to 2026)

SAE research has been one of the more active subfields of interpretability since 2023. Open-weight SAEs are now available for many widely studied models, including the Pythia, Gemma 2, and Llama 3 families, with smaller research releases on Mistral, Qwen, and Mixtral. The release of Gemma Scope in 2024 lowered the barrier to entry significantly, and Gemma Scope 2 in 2025 extended coverage to additional layer types and to jailbreak and refusal analysis. Anthropic's release of its circuit-tracing tools in 2025, alongside On the Biology of a Large Language Model, has put cross-layer transcoders front and centre as a complement to flat-SAE feature discovery.

Several conceptual extensions have matured. Crosscoders generalise SAEs across layers and even across models, supporting model-diffing analyses that compare the feature inventories of base and instruction-tuned variants. Transcoders replace dense MLP blocks with sparse approximations, enabling end-to-end "sparse circuits". Matryoshka SAEs introduce nested-dictionary hierarchies that reduce feature absorption. A growing literature on automated interpretability uses LLMs to label SAE features at scale.

The field has become more critical of itself. The April 2025 DeepMind "negative results" post and a series of papers documenting SAE instability, hyperparameter sensitivity, and the limits of L0 as a quality proxy have pushed practitioners to triangulate SAE evidence with other interpretability tools rather than rely on it alone. Recent surveys (for example, arXiv:2503.05613) document the breadth of the field and its open problems. Paulo, Shabalin and Belrose's Transcoders Beat Sparse Autoencoders for Interpretability (arXiv:2501.18823, January 2025) reported that transcoders, especially skip transcoders, match or beat vanilla SAEs on automated interpretability scores at the same sparsity level and produce smaller increases in model loss when used as drop-in replacements for MLP blocks. That paper and the broader rise of cross-layer transcoders in Anthropic's circuit-tracing line shifted some of the community's attention from flat SAEs to transcoder-shaped architectures. In August 2025, the Circuits Research Landscape report (Anthropic, Decode Research, EleutherAI, Goodfire and Google DeepMind) catalogued replications and extensions of attribution-graph methods across more than a dozen open models, all built on SAE or transcoder primitives.

Anthropic's May 2026 release of Natural Language Autoencoders (described in its own section below) reframed the question of what comes after SAEs. Where a sparse autoencoder hands the analyst a labelled inventory of feature directions, an NLA hands them an English description of what the activation seems to represent. The two are not in competition: SAE features remain the only practical primitive for fine-grained steering and for circuit-tracing tools like attribution graphs, but NLAs make it easier to audit specific activations for surprising content (evaluation awareness, hidden chains of reasoning) without first having to scan thousands of latents.

By 2026, sparse autoencoders are best understood as one well-developed but partial lens onto neural-network internals: extremely useful in combination with other tools, and unlikely on their own to deliver a complete account of what a large language model is doing.

Natural language autoencoders (2026)

In May 2026, Anthropic published Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations on the Transformer Circuits Thread, alongside a research blog post and a release of training code and pretrained natural language autoencoders (NLAs) for several open models. NLAs are best understood as the next step in the lineage that began with sparse autoencoders: they replace the SAE's sparse numerical code with a stretch of natural-language text that an LLM both generates from the activation and uses to reconstruct it.

An NLA has two LLM-based components trained jointly with reinforcement learning. An activation verbalizer (AV) reads an activation vector from the residual stream of the target model and emits a short text description of what the activation seems to represent. An activation reconstructor (AR) reads that description back and predicts the original activation. The pair is optimised so that the AR's prediction matches the true activation under a reconstruction loss, while the AV is constrained to produce coherent, free-form English. Because the bottleneck between the two modules is text rather than a sparse vector, the explanation is read off directly rather than recovered by labelling latents after the fact.

Anthropic's report contrasts the NLA workflow with the SAE workflow on the same activations. Sparse autoencoders carve an activation into thousands or millions of latents that an analyst must then interpret one at a time, usually with the help of auto-interp pipelines. NLAs skip the latent dictionary and emit an English sentence about the activation in a single forward pass, at the cost of giving up the SAE's neat addressability (an NLA description is harder to clamp or ablate than a single feature direction). The two approaches are complementary, and the paper notes that SAE features remain better suited to feature steering and to circuit-tracing pipelines like cross-layer transcoders.

The headline application was a pre-deployment audit of Claude Opus 4.6. Run on SWE-bench Verified, NLAs surfaced evaluation awareness (the model representing the situation as a test) on roughly 26 percent of problems and on about 16 percent of trials in a destructive-action coding probe, even on prompts where Claude never said out loud that it was being tested. On a comparable sample of real claude.ai traffic the same indicator fired on under 1 percent of inputs. NLAs also caught a latent in an early Claude Opus 4.6 checkpoint that pushed responses to switch into a non-English language for English prompts, and on a different internal model (Claude Mythos Preview) they revealed unverbalised reasoning about how to avoid detection while cheating on a training task. None of those findings would have been impossible with SAEs, but each took a description rather than a labelled latent to surface.

NLAs do not replace sparse autoencoders. They depend on the same underlying assumption that residual-stream activations contain meaningful, decodable structure, and they share the SAE pitfalls of imperfect reconstruction, hyperparameter sensitivity, and lack of ground truth. They also shift evaluation onto the AV and AR LLMs: a hallucinated or sycophantic description is harder to detect than a mis-labelled feature. The community in mid-2026 broadly treats NLAs and SAEs as different read-out heads on the same underlying problem, with NLAs better suited to qualitative auditing and SAEs better suited to quantitative analysis and steering.

Related concepts

Sparse autoencoders are part of a broader family of representation-learning and interpretability methods. Related techniques include the standard autoencoder, the variational autoencoder, the masked autoencoder (MAE), denoising autoencoders, contractive autoencoders, and dictionary-learning methods more generally. The feature directions they learn are conceptually close to the notion of a sparse feature in classical machine learning, with the difference that the sparsity is over a learned overcomplete basis rather than over the original input coordinates.

On the interpretability side, the closest neighbours are mechanistic interpretability, circuits, activation patching, the logit lens, and probing classifiers. SAEs are also linked to the broader theory of superposition in neural networks, the linear representation hypothesis, and the features as directions framing that has become standard in transformer interpretability.
