Dictionary learning (for interpretability)

Updated Jul 7, 2026

Overview

Dictionary learning, in the context of mechanistic interpretability, is the framework of decomposing the dense internal activations of a neural network into a sparse, weighted combination of directions drawn from a large, overcomplete learned set called a dictionary. Each direction is treated as a candidate interpretable "feature," and the goal is to recover features that are monosemantic, meaning each one corresponds to a single human-understandable concept, even though the individual neurons of the network are typically polysemantic and respond to many unrelated concepts at once. ^[1]^[2] The approach reached production scale in 2024, when Anthropic used it to extract roughly 34 million candidate features from the middle-layer activations of its deployed model Claude 3 Sonnet, at the time the largest published decomposition of a frontier model. ^[2]

The idea has deep roots in classical signal processing and computational neuroscience, where dictionary learning and sparse coding were developed in the 1990s and 2000s to represent natural signals as sparse linear combinations of basis elements (often called atoms). ^[3]^[4]^[5] Its application to interpretability rests on the superposition hypothesis: the conjecture that a network packs more distinct features into its activation space than it has neurons, by storing them as a set of nearly orthogonal directions rather than aligning each feature with a single neuron. ^[6] If superposition holds, then recovering the underlying features is a dictionary learning problem, and the modern tool of choice for solving it on transformer activations is the sparse autoencoder (SAE). ^[1]^[2] Dictionary learning is therefore best described not as a single algorithm but as the conceptual foundation that connects classical sparse representation theory to the present generation of interpretability methods.

It is useful to keep two objects distinct throughout. The dictionary is the fixed set of feature directions, realized in an SAE as the columns of the decoder weight matrix; the sparse code is the per-input vector of feature activations (also called feature coefficients), which says how strongly each dictionary atom is present in a given activation and is zero for almost all atoms. The dictionary is shared across all inputs, while the sparse code changes from input to input. ^[1]^[7]

Classical dictionary learning and sparse coding

The signal-processing version of the problem is to approximate a collection of data vectors as sparse linear combinations of dictionary atoms. Given a data matrix Y with one signal per column, one seeks a dictionary D and a coefficient matrix X minimizing the reconstruction error || Y - D X ||_F^2 (the squared Frobenius norm), subject to a sparsity constraint on X (few nonzero entries per column) and a norm constraint on the atoms (the columns of D). ^[5] When the dictionary has more atoms than the dimension of the signal it is called overcomplete, which allows a rich set of specialized atoms while still keeping each individual representation sparse.
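
As a minimal illustration of these objects, the sketch below builds a small overcomplete dictionary, generates signals as sparse combinations of its atoms, and evaluates the reconstruction error. The sizes, the random data, and the helper name `reconstruction_error` are assumptions made for the example, not values from any cited work.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 32 atoms in 8 dimensions, so the dictionary is overcomplete.
n_dims, n_atoms, n_signals = 8, 32, 100

# Dictionary D: one unit-norm atom per column (the norm constraint on atoms).
D = rng.normal(size=(n_dims, n_atoms))
D /= np.linalg.norm(D, axis=0, keepdims=True)

# Coefficient matrix X: each column (one signal's sparse code) has 3 nonzeros.
X = np.zeros((n_atoms, n_signals))
for j in range(n_signals):
    active = rng.choice(n_atoms, size=3, replace=False)
    X[active, j] = rng.normal(size=3)

# Data matrix Y: one signal per column, each a sparse combination of atoms.
Y = D @ X

def reconstruction_error(Y, D, X):
    """Squared Frobenius norm of the residual Y - D X."""
    return float(np.sum((Y - D @ X) ** 2))

nonzeros_per_signal = np.count_nonzero(X, axis=0)
print(reconstruction_error(Y, D, X), nonzeros_per_signal.max())
```

Here D and X are known by construction, so the error is zero; the learning problem is the reverse direction, where only Y is observed and both D and X must be recovered.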

The foundational result for interpretability came from computational neuroscience. In 1996, Bruno Olshausen and David Field showed that a learning algorithm trained to find a sparse linear code for small patches of natural images spontaneously develops a dictionary of localized, oriented, bandpass filters that closely resemble the receptive fields of simple cells in the primary visual cortex. ^[3] Their 1997 follow-up framed this explicitly as an "overcomplete basis set" approach to sparse coding. ^[4] The work is widely cited as evidence that sparsity is a powerful objective for discovering meaningful, interpretable structure in data, and it is the direct intellectual ancestor of using sparsity to find features in artificial networks. ^[1]^[6]

Two threads of subsequent work matter for the modern story. The first is sparse coding, the problem of finding the sparse coefficients for a fixed dictionary, for which classic methods include matching pursuit (Mallat and Zhang, 1993) and basis pursuit / the LASSO, which relax the intractable count of nonzero entries (the L0 "norm") into a tractable L1 penalty. ^[8] The second is dictionary learning proper, the problem of learning the dictionary itself from data, typically by alternating between a sparse-coding step and a dictionary-update step. The Method of Optimal Directions (Engan, Aase, and Husoy, 1999) updates the whole dictionary in closed form by least squares, and K-SVD (Aharon, Elad, and Bruckstein, 2006) updates atoms one at a time using a singular value decomposition, generalizing the k-means clustering algorithm. ^[5]^[9] These algorithms established the standard alternating structure (sparse codes, then dictionary, then repeat). Sparse autoencoders later pursue the same kind of objective but replace the alternation with joint gradient descent on the dictionary and a learned encoder.
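
The alternating structure can be sketched in a few lines. The example below is an illustration under stated assumptions rather than a reproduction of any cited algorithm: the sparse-coding step uses iterative soft thresholding (ISTA, a standard solver for the L1-penalized problem, chosen here for brevity in place of matching pursuit), and the dictionary step is a least-squares update in the spirit of the Method of Optimal Directions. Function names, sizes, and hyperparameters are hypothetical.

```python
import numpy as np

def soft_threshold(Z, t):
    """Shrink every entry toward zero by t; entries within t of zero become 0."""
    return np.sign(Z) * np.maximum(np.abs(Z) - t, 0.0)

def sparse_code_ista(Y, D, lam, n_iter=100):
    """Approximately minimize 0.5*||Y - D X||_F^2 + lam*||X||_1 over X, D fixed."""
    step = 1.0 / (np.linalg.norm(D, ord=2) ** 2)  # 1 / Lipschitz constant
    X = np.zeros((D.shape[1], Y.shape[1]))
    for _ in range(n_iter):
        grad = D.T @ (D @ X - Y)
        X = soft_threshold(X - step * grad, step * lam)
    return X

def update_dictionary_ls(Y, X, eps=1e-8):
    """Least-squares dictionary for fixed codes, then renormalize the atoms."""
    D = Y @ np.linalg.pinv(X)
    norms = np.linalg.norm(D, axis=0, keepdims=True)
    return D / np.maximum(norms, eps)  # unused atoms stay at zero

def learn_dictionary(Y, n_atoms, lam=0.1, n_rounds=10, seed=0):
    rng = np.random.default_rng(seed)
    D = rng.normal(size=(Y.shape[0], n_atoms))
    D /= np.linalg.norm(D, axis=0, keepdims=True)
    for _ in range(n_rounds):
        X = sparse_code_ista(Y, D, lam)   # sparse-coding step
        D = update_dictionary_ls(Y, X)    # dictionary-update step
    return D, sparse_code_ista(Y, D, lam)

# Toy data: 200 random signals in 8 dimensions, 16 atoms (overcomplete).
Y = np.random.default_rng(1).normal(size=(8, 200))
D, X = learn_dictionary(Y, n_atoms=16)
print("fraction of zero coefficients:", np.mean(X == 0))
```

K-SVD would differ in the dictionary step, updating one atom at a time from a rank-one approximation of the residual, but the outer loop has the same shape.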

What is the superposition hypothesis?

The bridge from classical sparse coding to neural network interpretability is the superposition hypothesis, articulated most fully in Anthropic's 2022 paper "Toy Models of Superposition." ^[6] A long-standing obstacle to interpreting networks is polysemanticity: individual neurons frequently fire for collections of unrelated inputs, so a neuron's activation rarely corresponds to a single concept. The superposition hypothesis explains this by proposing that the true unit of representation is not the neuron but a direction in activation space, and that a network represents far more such feature directions than it has neurons by storing them in superposition, as a set of almost orthogonal (but not exactly orthogonal) directions. ^[6] Because the number of nearly orthogonal directions available in a high-dimensional space grows much faster than the number of strictly orthogonal axes, a model can compress many sparsely active features into a smaller activation vector, accepting a small amount of interference between features as the cost. The toy-models work demonstrated empirically that small networks do this when features are sparse and the model is incentivized to represent more of them than it has dimensions. ^[6]
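
The geometric claim about nearly orthogonal directions can be checked numerically. The sketch below is an illustration with arbitrary sizes, not an experiment from the cited paper: it draws four times more random unit vectors than there are dimensions and reports the worst-case overlap between any two of them, once in a low-dimensional space and once in a higher-dimensional one.

```python
import numpy as np

rng = np.random.default_rng(0)

def max_abs_cosine(n_vectors, n_dims):
    """Largest |cosine similarity| between any two distinct random unit vectors."""
    V = rng.normal(size=(n_vectors, n_dims))
    V /= np.linalg.norm(V, axis=1, keepdims=True)
    G = V @ V.T                 # pairwise cosine similarities
    np.fill_diagonal(G, 0.0)    # ignore each vector's similarity with itself
    return float(np.max(np.abs(G)))

# Same 4:1 ratio of directions to dimensions in both cases.
low = max_abs_cosine(n_vectors=64, n_dims=16)
high = max_abs_cosine(n_vectors=2048, n_dims=512)
print(f"worst overlap in 16 dims: {low:.2f}, in 512 dims: {high:.2f}")
```

At a fixed ratio of directions to dimensions, the worst-case overlap shrinks as the dimension grows, which is the sense in which a wide activation space leaves room for many features with only mild interference.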

Superposition reframes interpretability as a dictionary learning problem. If activations are sparse linear combinations of an overcomplete set of feature directions, then recovering those directions is exactly what dictionary learning does: the unknown dictionary is the set of feature directions the model uses, the sparse code is which features are active on a given input, and the polysemantic neuron basis is simply a basis that is not aligned with those directions, so each neuron reads off a mixture of several features. ^[1]^[6] This is why a method that finds a sparse, overcomplete decomposition of activations is expected to surface monosemantic features even when no individual neuron is monosemantic.

How do sparse autoencoders implement dictionary learning?

The dominant implementation of dictionary learning on transformer activations is the sparse autoencoder. An SAE is a wide, shallow network trained to reconstruct an activation vector x while routing it through a sparse, high-dimensional bottleneck. The encoder maps x to a vector of feature activations f(x), most of whose entries are zero for any given input, and the decoder reconstructs x as a sparse weighted sum of learned feature directions, the columns of the decoder weight matrix W_dec:

x_hat = W_dec f(x) + b.

The columns of W_dec are the dictionary atoms (the feature directions), and f(x) is the sparse code. The standard training objective combines a reconstruction term with a sparsity penalty, classically the squared error plus an L1 penalty on the feature activations:

L(x) = || x - x_hat ||_2^2 + lambda || f(x) ||_1,

where the L1 term is the convex surrogate for the true count of active features and lambda trades reconstruction against sparsity. ^[1]^[7] This is recognizably the same objective as classical dictionary learning (reconstruction error plus a sparsity penalty), now optimized end to end by gradient descent rather than by alternating updates, with the encoder amortizing the sparse-coding step so that codes can be computed in a single forward pass.
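
The forward pass and loss above can be written out directly. The following is a minimal sketch assuming a one-layer ReLU encoder of the form f(x) = ReLU(W_enc x + b_enc); the encoder parameterization, the sizes, the random initialization, and the value of lambda are assumptions for illustration, and a real SAE would learn these parameters by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 16, 64   # hypothetical sizes; n_features > d_model

# Hypothetical, randomly initialized parameters (not trained).
W_enc = rng.normal(scale=0.1, size=(n_features, d_model))
b_enc = np.zeros(n_features)
W_dec = rng.normal(size=(d_model, n_features))
W_dec /= np.linalg.norm(W_dec, axis=0, keepdims=True)  # unit-norm atoms (columns)
b = np.zeros(d_model)

def encode(x):
    """Sparse code f(x): non-negative feature activations."""
    return np.maximum(W_enc @ x + b_enc, 0.0)

def decode(f):
    """x_hat = W_dec f(x) + b: weighted sum of dictionary atoms."""
    return W_dec @ f + b

def sae_loss(x, lam=1e-3):
    """L(x) = ||x - x_hat||_2^2 + lam * ||f(x)||_1."""
    f = encode(x)
    x_hat = decode(f)
    reconstruction = np.sum((x - x_hat) ** 2)
    sparsity = np.sum(np.abs(f))
    return float(reconstruction + lam * sparsity), f, x_hat

x = rng.normal(size=d_model)   # stand-in for one activation vector
loss, f, x_hat = sae_loss(x)
print(loss, np.count_nonzero(f))
```

Computing f(x) costs one matrix multiplication and a ReLU, which is the sense in which the encoder amortizes the iterative sparse-coding step of the classical methods.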

The approach was brought to language models in 2023. Anthropic's "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning" trained SAEs on the activations of a one-layer transformer and recovered thousands of features that were substantially more monosemantic than the underlying neurons, while Cunningham, Ewart, Riggs, Huben, and Sharkey independently reported that "Sparse Autoencoders Find Highly Interpretable Features in Language Models." ^[1]^[7] In May 2024, Anthropic scaled the method dramatically in "Scaling Monosemanticity," training three SAEs of about 1 million, 4 million, and 34 million features (precisely 1,048,576, 4,194,304, and 33,554,432 latents) on the middle-layer residual stream of the production model Claude 3 Sonnet, and using neural scaling laws to set hyperparameters. ^[2] In the largest dictionary, about 65 percent of the features were dead (they never fired), leaving roughly 12 million of the 34 million features "alive"; fewer than 300 features were active on a typical token; and the SAE reconstruction explained at least 65 percent of the variance in the model's activations. ^[2] The resulting features were multilingual and multimodal and spanned highly abstract concepts. Anthropic wrote that "the features we found in Sonnet have a depth, breadth, and abstraction reflecting Sonnet's advanced capabilities." ^[12] Parallel efforts followed at the other major labs: OpenAI's "Scaling and Evaluating Sparse Autoencoders" (June 2024) trained a 16-million-latent TopK autoencoder on GPT-4 activations, and Google DeepMind released Gemma Scope (2024), an open suite of more than 400 JumpReLU SAEs covering every layer and sublayer of Gemma 2 2B and 9B (plus select layers of the 27B base model), with more than 30 million features in total. ^[10]^[11]
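
The Claude 3 Sonnet figures in this paragraph can be cross-checked with a little arithmetic. The sketch below uses only numbers quoted in the text; the variable names are illustrative.

```python
# Dictionary sizes quoted for the three SAEs are exact powers of two.
sizes = {"1M": 2**20, "4M": 2**22, "34M": 2**25}
print(sizes)

# About 65 percent of the largest dictionary was dead, so about 35 percent was alive.
dead_fraction = 0.65
alive = round(sizes["34M"] * (1 - dead_fraction))
print(f"alive features: about {alive / 1e6:.1f} million")

# Fewer than 300 active features per token is a tiny share of the dictionary.
active_share = 300 / sizes["34M"]
print(f"active share per token: under {active_share:.1e}")
```

The alive count comes out just under 12 million, consistent with the "roughly 12 million" figure, and the per-token activity is below one feature in a hundred thousand, which shows how sparse the code is.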

Work	Org	Year	Model / target	Dictionary size	Sparsity mechanism
Olshausen and Field	UC Berkeley / Cornell	1996 to 1997	natural image patches	overcomplete basis	L1-style sparse prior ^[3]^[4]
K-SVD	Technion	2006	general signals	overcomplete dictionary	L0 (matching pursuit) ^[5]
Towards Monosemanticity	Anthropic	2023	1-layer transformer	up to ~131,000 features	L1 on ReLU SAE ^[1]
Sparse Autoencoders Find...	Cunningham et al.	2023	language model activations	overcomplete SAE	L1 on ReLU SAE ^[7]
Scaling Monosemanticity	Anthropic	2024	Claude 3 Sonnet residual stream	up to ~34 million features	L1 on ReLU SAE ^[2]
Scaling and Evaluating SAEs	OpenAI	2024	GPT-4 activations	up to 16 million latents	TopK ^[10]
Gemma Scope	Google DeepMind	2024	Gemma 2 (all layers)	16K to ~1M per SAE	JumpReLU (L0) ^[11]

What is dictionary learning used for in interpretability?

Recovering an interpretable dictionary enables several downstream uses. The most striking is feature steering, also called activation steering or clamping: because each feature is a direction in activation space, one can intervene during a forward pass by artificially raising or lowering a chosen feature's activation and observe a corresponding, interpretable change in behavior. Anthropic's demonstration "Golden Gate Claude" clamped a Golden Gate Bridge feature in Claude 3 Sonnet to a high value, causing the model to steer nearly every response toward the bridge; asked about its physical form, the altered model replied "I am the Golden Gate Bridge," which served as direct causal evidence that the feature genuinely represents that concept rather than merely correlating with it. ^[2]^[12] The same paper showed that suppressing or amplifying features tied to behaviors such as sycophancy, scams, and unsafe code changed the model's outputs accordingly. ^[2]^[12]
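
Clamping can be sketched as an edit to a single entry of the sparse code. The example below is a hedged illustration, not Anthropic's implementation: the tiny random SAE, the feature index, and the clamp value are all hypothetical, and carrying the SAE's reconstruction error through unchanged is a design choice assumed here so that the intervention alters only the chosen feature.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 16, 64   # hypothetical sizes

# Hypothetical untrained SAE parameters, standing in for a trained dictionary.
W_enc = rng.normal(scale=0.1, size=(n_features, d_model))
W_dec = rng.normal(size=(d_model, n_features))
W_dec /= np.linalg.norm(W_dec, axis=0, keepdims=True)

def encode(x):
    return np.maximum(W_enc @ x, 0.0)

def decode(f):
    return W_dec @ f

def clamp_feature(x, feature_idx, value):
    """Return a steered activation with one feature's activation set to `value`."""
    f = encode(x)
    error = x - decode(f)            # part of x the SAE does not explain
    f_steered = f.copy()
    f_steered[feature_idx] = value   # raise or lower the chosen feature
    return decode(f_steered) + error

x = rng.normal(size=d_model)         # stand-in for a residual-stream activation
x_steered = clamp_feature(x, feature_idx=3, value=10.0)

# The edit moves x along a single dictionary atom.
delta = x_steered - x
print(np.allclose(delta, (10.0 - encode(x)[3]) * W_dec[:, 3]))
```

In a real model the steered vector would replace the original activation during the forward pass, and the behavioral change would be read off from the model's subsequent outputs.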

Beyond steering, dictionary features support circuit analysis: by expressing activations in a sparse, interpretable basis, researchers can trace how features in one layer give rise to features in later layers, an effort that has grown into work on attribution graphs and "circuit tracing" (see below). Features also provide a vocabulary for AI safety, since they include directions associated with deception, power-seeking, dangerous capabilities such as biological-weapon information, and bias, which can in principle be monitored or used as classifiers. ^[2]^[12] Open dictionaries such as Gemma Scope, with browsable features hosted on tools like Neuronpedia, have made this kind of analysis reproducible without training SAEs from scratch. ^[11]

What are the latest developments in dictionary learning?

Since the 2024 scaling results, dictionary learning has moved from finding features to wiring them into circuits and stress-testing whether the decomposition is real. Three lines of work stand out.

Crosscoders and transcoders generalize the single-layer SAE. In October 2024 Anthropic introduced sparse crosscoders, a variant that reads from and writes to several layers at once, which lets researchers track a persistent feature as it moves through the residual stream and compare (or "diff") the features of two different models. ^[13] Transcoders, a related tool, instead learn a sparse dictionary that predicts a later layer's activations from an earlier one, approximating a nonlinear sublayer with an interpretable sparse map.
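
The "reads from and writes to several layers" structure can be made concrete with a forward pass. This is a minimal sketch assuming per-layer encoder and decoder weights and one shared sparse code; the shapes, names, and random parameters are hypothetical, and the training loss used in the cited work is not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, d_model, n_features = 3, 16, 64   # hypothetical sizes

# One encoder and one decoder matrix per layer, one shared feature space.
W_enc = rng.normal(scale=0.1, size=(n_layers, n_features, d_model))
b_enc = np.zeros(n_features)
W_dec = rng.normal(scale=0.1, size=(n_layers, d_model, n_features))
b_dec = np.zeros((n_layers, d_model))

def crosscoder_forward(acts):
    """acts: array of shape (n_layers, d_model) for one token position."""
    # Read: sum each layer's contribution into one pre-activation vector.
    pre = np.einsum("lfd,ld->f", W_enc, acts) + b_enc
    f = np.maximum(pre, 0.0)                  # shared sparse code
    # Write: reconstruct every layer from the same code.
    recon = np.einsum("ldf,f->ld", W_dec, f) + b_dec
    return f, recon

acts = rng.normal(size=(n_layers, d_model))
f, recon = crosscoder_forward(acts)
print(f.shape, recon.shape)
```

A transcoder, by contrast, would encode a sublayer's input and be trained so that the decoder output matches that sublayer's output instead of the encoder's own input.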

Attribution graphs and circuit tracing put these dictionaries to work. In March 2025 Anthropic released "Circuit Tracing: Revealing Computational Graphs in Language Models" and the companion study "On the Biology of a Large Language Model," which used cross-layer transcoders to build attribution graphs of Claude 3.5 Haiku. ^[14]^[15] The graphs traced concrete mechanisms, including multi-step reasoning, forward planning when writing rhyming poetry, and a shared multilingual representation, offering some of the first end-to-end causal accounts of how features chain together to produce a specific output. ^[15]

New architectures target the failure modes. Matryoshka SAEs, introduced by Bussmann, Nabeshima, Karvonen, and Nanda in March 2025, train several nested dictionaries of increasing size at once and require the smaller dictionaries to reconstruct the input on their own, which pushes broad concepts into the small dictionaries and narrow ones into the large dictionaries and measurably reduces feature splitting and absorption. ^[16]
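
The nested-dictionary idea can be sketched as a sum of reconstruction losses over prefixes of one feature list. The example below is a simplified illustration under stated assumptions: the prefix sizes, the random parameters, and the plain ReLU encoder are hypothetical, and the sparsity term and other details of the published objective are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 16, 64
prefix_sizes = [8, 16, 32, 64]   # hypothetical nested dictionary sizes

# Hypothetical untrained parameters.
W_enc = rng.normal(scale=0.1, size=(n_features, d_model))
W_dec = rng.normal(scale=0.1, size=(d_model, n_features))
b = np.zeros(d_model)

def matryoshka_reconstruction_loss(x):
    """Sum of squared errors when reconstructing from each nested prefix."""
    f = np.maximum(W_enc @ x, 0.0)
    per_prefix = []
    for m in prefix_sizes:
        # Only the first m features (and their atoms) are allowed to contribute.
        x_hat_m = W_dec[:, :m] @ f[:m] + b
        per_prefix.append(float(np.sum((x - x_hat_m) ** 2)))
    return sum(per_prefix), per_prefix

x = rng.normal(size=d_model)
total, per_prefix = matryoshka_reconstruction_loss(x)
print(total, per_prefix)
```

Because the first few features must reconstruct the input without help from the rest, training under such a loss gives them an incentive to carry broad, high-variance structure, leaving finer distinctions to later features.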

What are the limitations and open debates?

Dictionary learning for interpretability is an active and contested research area, and several limitations are well documented. Feature splitting is the observation that as the dictionary grows, a concept captured by one feature in a smaller SAE often fractures into many narrower features in a larger one, which complicates any claim that a particular feature is "the" representation of a concept and suggests the decomposition depends on the chosen dictionary size. ^[2] Completeness is a related concern: even very large SAEs do not capture all of a model's behavior. OpenAI reported that routing GPT-4's activations through its sparse autoencoder degraded performance to roughly that of a model trained with about ten times less compute, and DeepMind has noted that fully covering the concepts in a frontier model might require billions or trillions of features. ^[10]^[11] There are also reconstruction-versus-sparsity tradeoffs and pathologies such as dead features (atoms that never activate) and very high frequency features, which motivated successor architectures including Gated SAEs, TopK SAEs, JumpReLU SAEs, and Matryoshka SAEs that aim to improve the tradeoff between reconstruction quality and sparsity. ^[10]^[11]^[16]
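
Two of the notions above are easy to state in code: a TopK activation, which fixes the number of active features per input directly instead of penalizing the L1 norm, and the dead-feature fraction of a dictionary. The sketch below is an illustration on synthetic pre-activations; the function names, sizes, and the trick of forcing two features to be dead are assumptions for the example.

```python
import numpy as np

def topk_activation(pre_acts, k):
    """Keep the k largest pre-activations in each row and zero the rest."""
    idx = np.argpartition(pre_acts, -k, axis=-1)[..., -k:]
    out = np.zeros_like(pre_acts)
    kept = np.take_along_axis(pre_acts, idx, axis=-1)
    np.put_along_axis(out, idx, kept, axis=-1)
    return out

def dead_fraction(feature_acts):
    """Fraction of features (columns) that are zero on every input (row)."""
    ever_active = np.any(feature_acts != 0, axis=0)
    return float(1.0 - ever_active.mean())

rng = np.random.default_rng(0)
pre_acts = rng.normal(size=(100, 32))   # 100 inputs, 32 features (hypothetical)
pre_acts[:, :2] -= 100.0                # push two features far below the rest

acts = topk_activation(pre_acts, k=4)
print(np.count_nonzero(acts, axis=1).max(), dead_fraction(acts))
```

Every input ends up with exactly four active features, and the two suppressed features never fire, so they count as dead on this sample.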

A deeper debate concerns whether the recovered features are the right unit of analysis at all. The method assumes the linear representation hypothesis (that concepts correspond to directions) and a sparse, roughly independent feature structure, assumptions that may not hold uniformly. Critics have raised the possibility of an "interpretability illusion," in which features look clean under cherry-picked examples but do not faithfully or completely describe the model's computation, and have questioned whether SAE features are stable, causally meaningful, and useful for concrete downstream tasks relative to simpler baselines. A 2025 study by Subhash Kantamneni, Joshua Engels, and colleagues tested SAEs on the concrete task of probing model activations for known concepts and found that, "although SAEs occasionally perform better than baselines on individual datasets," the authors were "unable to design ensemble methods combining SAEs with baselines that consistently outperform ensemble methods solely using baselines," indicating that on this benchmark SAE features did not beat simple probes such as logistic regression. ^[17] Whether dictionary learning ultimately yields a faithful, complete, and minimal description of what a network computes, or whether features are an imperfect intermediate abstraction, remains open. The technique is therefore best understood as a powerful and rapidly improving lens on model internals rather than a solved decomposition.

References

1. Trenton Bricken, Adly Templeton, Joshua Batson, et al. "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning." Anthropic, Transformer Circuits Thread, October 2023. https://transformer-circuits.pub/2023/monosemantic-features/index.html
2. Adly Templeton, Tom Conerly, Jonathan Marcus, et al. "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet." Anthropic, Transformer Circuits Thread, May 2024. https://transformer-circuits.pub/2024/scaling-monosemanticity/
3. Bruno A. Olshausen and David J. Field. "Emergence of simple-cell receptive field properties by learning a sparse code for natural images." Nature 381, 607 to 609, 1996. https://www.nature.com/articles/381607a0
4. Bruno A. Olshausen and David J. Field. "Sparse coding with an overcomplete basis set: A strategy employed by V1?" Vision Research 37(23), 3311 to 3325, 1997. https://www.sciencedirect.com/science/article/pii/S0042698997001697
5. Michal Aharon, Michael Elad, and Alfred Bruckstein. "K-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation." IEEE Transactions on Signal Processing 54(11), 4311 to 4322, 2006. https://www.cs.technion.ac.il/~elad/publications/journals/2004/KSVD_IEEE_TSP.pdf
6. Nelson Elhage, Tristan Hume, Catherine Olsson, et al. "Toy Models of Superposition." Anthropic, Transformer Circuits Thread, September 2022. https://transformer-circuits.pub/2022/toy_model/index.html
7. Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, Lee Sharkey. "Sparse Autoencoders Find Highly Interpretable Features in Language Models." arXiv:2309.08600, September 2023. https://arxiv.org/abs/2309.08600
8. Stephane G. Mallat and Zhifeng Zhang. "Matching pursuits with time-frequency dictionaries." IEEE Transactions on Signal Processing 41(12), 3397 to 3415, 1993. https://ieeexplore.ieee.org/document/258082
9. Kjersti Engan, Sven Ole Aase, and John Hakon Husoy. "Method of optimal directions for frame design." Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1999. https://ieeexplore.ieee.org/document/760624
10. Leo Gao, Tom Dupre la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, Jeffrey Wu. "Scaling and Evaluating Sparse Autoencoders." arXiv:2406.04093, June 2024. https://arxiv.org/abs/2406.04093
11. Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, et al. "Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2." arXiv:2408.05147, August 2024. https://arxiv.org/abs/2408.05147
12. "Mapping the Mind of a Large Language Model." Anthropic, May 21, 2024. https://www.anthropic.com/research/mapping-mind-language-model
13. Jack Lindsey, Adly Templeton, Jonathan Marcus, Thomas Conerly, Joshua Batson, Christopher Olah. "Sparse Crosscoders for Cross-Layer Features and Model Diffing." Anthropic, Transformer Circuits Thread, October 2024. https://transformer-circuits.pub/2024/crosscoders/index.html
14. Emmanuel Ameisen, Jack Lindsey, Adam Pearce, et al. "Circuit Tracing: Revealing Computational Graphs in Language Models." Anthropic, Transformer Circuits Thread, March 2025. https://transformer-circuits.pub/2025/attribution-graphs/methods.html
15. Jack Lindsey, et al. "On the Biology of a Large Language Model." Anthropic, Transformer Circuits Thread, March 2025. https://transformer-circuits.pub/2025/attribution-graphs/biology.html
16. Bart Bussmann, Noa Nabeshima, Adam Karvonen, Neel Nanda. "Learning Multi-Level Features with Matryoshka Sparse Autoencoders." arXiv:2503.17547, March 2025. https://arxiv.org/abs/2503.17547
17. Subhash Kantamneni, Joshua Engels, Senthooran Rajamanoharan, Max Tegmark, Neel Nanda. "Are Sparse Autoencoders Useful? A Case Study in Sparse Probing." arXiv:2502.16681, February 2025. https://arxiv.org/abs/2502.16681
