Dictionary learning (for interpretability)
Last reviewed
Sources
12 citations
Review status
Source-backed
Revision
v1 · 2,385 words
Dictionary learning, in the context of mechanistic interpretability, is the framework of decomposing the dense internal activations of a neural network into a sparse, weighted combination drawn from a large, overcomplete set of learned directions called a dictionary. Each direction is treated as a candidate interpretable "feature," and the goal is to recover features that are monosemantic, meaning each one corresponds to a single human-understandable concept, even though the individual neurons of the network are typically polysemantic and respond to many unrelated concepts at once. [1][2]
The idea has deep roots in classical signal processing and computational neuroscience, where dictionary learning and sparse coding were developed in the 1990s and 2000s to represent natural signals as sparse linear combinations of basis elements (often called atoms). [3][4][5] Its application to interpretability rests on the superposition hypothesis: the conjecture that a network packs more distinct features into its activation space than it has neurons, by storing them as a set of nearly orthogonal directions rather than aligning each feature with a single neuron. [6] If superposition holds, then recovering the underlying features is a dictionary learning problem, and the modern tool of choice for solving it on transformer activations is the sparse autoencoder (SAE). [1][2] Dictionary learning is therefore best described not as a single algorithm but as the conceptual foundation that connects classical sparse representation theory to the present generation of interpretability methods.
It is useful to keep two objects distinct throughout. The dictionary is the fixed set of feature directions, realized in an SAE as the columns of the decoder weight matrix; the sparse code is the per-input vector of feature activations (also called feature coefficients), which says how strongly each dictionary atom is present in a given activation and is zero for almost all atoms. The dictionary is shared across all inputs, while the sparse code changes from input to input. [1][7]
The signal-processing version of the problem is to approximate a collection of data vectors as sparse linear combinations of dictionary atoms. Given a data matrix Y with one signal per column, one seeks a dictionary D and a coefficient matrix X minimizing the reconstruction error || Y - D X ||_F (the Frobenius norm, which aggregates the squared error over all signals), subject to a sparsity constraint on X (few nonzero entries per column) and a norm constraint on the atoms (the columns of D), which removes the scaling ambiguity between D and X. [5] When the dictionary has more atoms than the dimension of the signal it is called overcomplete, which is what allows a rich set of specialized atoms while still keeping each individual representation sparse.
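Written out, one common form of this problem looks as follows. This is a sketch under assumptions the text above does not fix: the squared Frobenius norm as the error measure, a per-signal sparsity budget k, and unit-norm atoms. The symbols n (signal dimension), m (number of atoms), N (number of signals), and k are introduced here for illustration.

```latex
\min_{D \in \mathbb{R}^{n \times m},\; X \in \mathbb{R}^{m \times N}}
  \; \| Y - D X \|_F^2
\quad \text{subject to} \quad
  \| x_i \|_0 \le k \;\; (i = 1, \dots, N),
\qquad
  \| d_j \|_2 = 1 \;\; (j = 1, \dots, m)
```

Here x_i is the i-th column of X, d_j is the j-th column of D, and the overcomplete case corresponds to m > n.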
The foundational result for interpretability came from computational neuroscience. In 1996, Bruno Olshausen and David Field showed that a learning algorithm trained to find a sparse linear code for small patches of natural images spontaneously develops a dictionary of localized, oriented, bandpass filters that closely resemble the receptive fields of simple cells in the primary visual cortex. [3] Their 1997 follow-up framed this explicitly as an "overcomplete basis set" approach to sparse coding. [4] The work is widely cited as evidence that sparsity is a powerful objective for discovering meaningful, interpretable structure in data, and it is the direct intellectual ancestor of using sparsity to find features in artificial networks. [1][6]
Two threads of subsequent work matter for the modern story. The first is sparse coding, the problem of finding the sparse coefficients for a fixed dictionary, for which classic methods include matching pursuit (Mallat and Zhang, 1993) and basis pursuit / the LASSO, which relax the intractable count of nonzero entries (the L0 "norm") into a tractable L1 penalty. [8] The second is dictionary learning proper, the problem of learning the dictionary itself from data, typically by alternating between a sparse-coding step and a dictionary-update step. The Method of Optimal Directions (Engan, Aase, and Husoy, 1999) updates the whole dictionary in closed form by least squares, and K-SVD (Aharon, Elad, and Bruckstein, 2006) updates atoms one at a time using a singular value decomposition, generalizing the k-means clustering algorithm. [5][9] These algorithms established the standard alternating structure (sparse codes, then dictionary, then repeat). Sparse autoencoders keep the same two ingredients, a sparse code and a dictionary, but train both jointly by gradient descent instead of alternating.
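The alternating structure can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation of any published algorithm: it pairs a simple greedy pursuit for the sparse-coding step with a least-squares dictionary update in the spirit of the Method of Optimal Directions. The function names, the synthetic data, and every hyperparameter below are assumptions made for the example.

```python
import numpy as np

def omp(D, y, k):
    """Greedy sparse coding: approximate y with at most k atoms (columns) of D."""
    coef = np.zeros(D.shape[1])
    support = []
    residual = y.copy()
    for _ in range(k):
        # Pick the atom most correlated with what is still unexplained.
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j not in support:
            support.append(j)
        # Refit all selected atoms jointly by least squares.
        sol, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        coef[:] = 0.0
        coef[support] = sol
        residual = y - D @ coef
    return coef

def mod_update(Y, X, D_old):
    """Least-squares dictionary update for fixed codes, then renormalize atoms."""
    D = Y @ np.linalg.pinv(X)
    norms = np.linalg.norm(D, axis=0)
    unused = norms < 1e-10           # atoms no signal selected: keep the old atom
    D[:, unused] = D_old[:, unused]
    norms[unused] = 1.0
    return D / norms

def learn_dictionary(Y, n_atoms, k, n_iters=20, seed=0):
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((Y.shape[0], n_atoms))
    D /= np.linalg.norm(D, axis=0)
    for _ in range(n_iters):
        X = np.column_stack([omp(D, y, k) for y in Y.T])  # sparse-coding step
        D = mod_update(Y, X, D)                           # dictionary-update step
    X = np.column_stack([omp(D, y, k) for y in Y.T])
    return D, X

# Synthetic data: 200 signals in 8 dimensions, each a mix of 2 of 16 hidden atoms.
data_rng = np.random.default_rng(42)
true_D = data_rng.standard_normal((8, 16))
true_D /= np.linalg.norm(true_D, axis=0)
codes = np.zeros((16, 200))
for i in range(200):
    idx = data_rng.choice(16, size=2, replace=False)
    codes[idx, i] = data_rng.standard_normal(2)
Y = true_D @ codes

D, X = learn_dictionary(Y, n_atoms=16, k=2)
rel_err = np.linalg.norm(Y - D @ X) / np.linalg.norm(Y)
print(f"relative reconstruction error: {rel_err:.3f}")
```

The renormalization after each update is a simplification; it keeps the atoms on the unit sphere and lets the next sparse-coding pass absorb the change in scale.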
The bridge from classical sparse coding to neural network interpretability is the superposition hypothesis, articulated most fully in Anthropic's 2022 paper "Toy Models of Superposition." [6] A long-standing obstacle to interpreting networks is polysemanticity: individual neurons frequently fire for collections of unrelated inputs, so a neuron's activation rarely corresponds to a single concept. The superposition hypothesis explains this by proposing that the true unit of representation is not the neuron but a direction in activation space, and that a network represents far more such feature directions than it has neurons by storing them in superposition, as a set of almost orthogonal (but not exactly orthogonal) directions. [6] Because the number of nearly orthogonal directions available in a high-dimensional space grows much faster than the number of strictly orthogonal axes, a model can compress many sparsely active features into a smaller activation vector, accepting a small amount of interference between features as the cost. The toy-models work demonstrated that small networks do this in practice when features are sparse and the model is incentivized to represent more of them than it has dimensions. [6]
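The geometric claim about nearly orthogonal directions can be checked numerically. The sketch below is only an illustration: it uses random unit vectors, which are an assumption of the example and not what a trained network learns, and it measures the worst-case overlap (absolute cosine similarity) when four times more directions than dimensions are packed into the space. The function name and the factor of four are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def max_interference(n_features, dim):
    """Largest |cosine similarity| between distinct random unit directions."""
    W = rng.standard_normal((n_features, dim))
    W /= np.linalg.norm(W, axis=1, keepdims=True)
    G = W @ W.T                 # pairwise cosine similarities
    np.fill_diagonal(G, 0.0)    # ignore each direction's overlap with itself
    return float(np.abs(G).max())

# Pack 4x more directions than dimensions and watch the worst overlap shrink.
results = {dim: max_interference(4 * dim, dim) for dim in (16, 64, 256)}
for dim, worst in results.items():
    print(f"dim={dim:4d}  features={4 * dim:5d}  worst overlap={worst:.3f}")
```

The overlap never reaches zero, which is the interference the hypothesis describes, but it falls as the dimension grows even though the number of directions grows in proportion.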
Superposition reframes interpretability as a dictionary learning problem. If activations are sparse linear combinations of an overcomplete set of feature directions, then recovering those directions is exactly what dictionary learning does: the unknown dictionary is the set of feature directions the model uses, the sparse code is which features are active on a given input, and the polysemantic neuron basis is just an inconvenient rotation of the underlying monosemantic feature basis. [1][6] This is why a method that finds a sparse, overcomplete decomposition of activations is expected to surface monosemantic features even when no individual neuron is monosemantic.
The dominant implementation of dictionary learning on transformer activations is the sparse autoencoder. An SAE is a wide, shallow network trained to reconstruct an activation vector x while routing it through a sparse, high-dimensional bottleneck. The encoder maps x to a vector of feature activations f(x), most of whose entries are zero for any given input, and the decoder reconstructs x as a sparse weighted sum of learned feature directions, the columns of the decoder weight matrix W_dec:
x_hat = W_dec f(x) + b.
The columns of W_dec are the dictionary atoms (the feature directions), and f(x) is the sparse code. The standard training objective combines a reconstruction term with a sparsity penalty, classically the squared error plus an L1 penalty on the feature activations:
L(x) = || x - x_hat ||_2^2 + lambda || f(x) ||_1,
where the L1 term is the convex surrogate for the true count of active features and lambda trades reconstruction against sparsity. [1][7] This is recognizably the same reconstruction-plus-sparsity objective as classical dictionary learning, now optimized end to end by gradient descent rather than by alternating updates, with the encoder amortizing the sparse-coding step so that codes can be computed in a single forward pass.
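The forward pass and loss above can be sketched in NumPy as follows. The decoder and loss follow the two formulas directly; the encoder form (a linear map followed by ReLU), the names W_enc and b_enc, the layer sizes, and the value of lam are assumptions made for illustration, and the weights are random and untrained.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 16, 64   # overcomplete: more features than dimensions

# Hypothetical, untrained parameters; a real SAE learns these by gradient descent.
W_enc = rng.standard_normal((n_features, d_model)) * 0.1
b_enc = np.zeros(n_features)
W_dec = rng.standard_normal((d_model, n_features))
W_dec /= np.linalg.norm(W_dec, axis=0)   # columns are unit-norm dictionary atoms
b = np.zeros(d_model)

def encode(x):
    """Sparse code f(x): nonnegative feature activations (assumed ReLU encoder)."""
    return np.maximum(0.0, W_enc @ x + b_enc)

def decode(f):
    """x_hat = W_dec f(x) + b: a weighted sum of dictionary atoms."""
    return W_dec @ f + b

def sae_loss(x, lam=1e-3):
    """Squared reconstruction error plus an L1 penalty on the activations."""
    f = encode(x)
    x_hat = decode(f)
    recon = np.sum((x - x_hat) ** 2)
    sparsity = np.sum(np.abs(f))
    return recon + lam * sparsity, f, x_hat

x = rng.standard_normal(d_model)
loss, f, x_hat = sae_loss(x)
print(f"loss={loss:.3f}, active features={np.count_nonzero(f)} of {n_features}")
```

With random weights the code is not yet sparse; training is what pushes most entries of f(x) to zero while keeping the reconstruction close to x.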
The approach was brought to language models in 2023. Anthropic's "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning" trained SAEs on the activations of a one-layer transformer and recovered thousands of features that were substantially more monosemantic than the underlying neurons, while Cunningham, Ewart, Riggs, Huben, and Sharkey independently reported that "Sparse Autoencoders Find Highly Interpretable Features in Language Models." [1][7] In May 2024, Anthropic scaled the method dramatically in "Scaling Monosemanticity," training SAEs with up to roughly 34 million features on the middle-layer residual stream of the production model Claude 3 Sonnet, and using neural scaling laws to set hyperparameters. [2] The resulting features were multilingual and multimodal and spanned highly abstract concepts. Parallel efforts followed at the other major labs: OpenAI's "Scaling and Evaluating Sparse Autoencoders" (June 2024) trained a 16 million latent TopK autoencoder on GPT-4 activations, and Google DeepMind released Gemma Scope (2024), an open suite of more than 400 JumpReLU SAEs covering every layer of Gemma 2, with more than 30 million features in total. [10][11]
| Work | Org / authors | Year | Model / target | Dictionary size | Sparsity mechanism |
|---|---|---|---|---|---|
| Olshausen and Field | UC Berkeley / Cornell | 1996 to 1997 | natural image patches | overcomplete basis | L1-style sparse prior [3][4] |
| K-SVD | Technion | 2006 | general signals | overcomplete dictionary | L0 (matching pursuit) [5] |
| Towards Monosemanticity | Anthropic | 2023 | 1-layer transformer | up to ~131,000 features | L1 on ReLU SAE [1] |
| Sparse Autoencoders Find... | Cunningham et al. | 2023 | language model activations | overcomplete SAE | L1 on ReLU SAE [7] |
| Scaling Monosemanticity | Anthropic | 2024 | Claude 3 Sonnet residual stream | up to ~34 million features | L1 on ReLU SAE [2] |
| Scaling and Evaluating SAEs | OpenAI | 2024 | GPT-4 activations | up to 16 million latents | TopK [10] |
| Gemma Scope | Google DeepMind | 2024 | Gemma 2 (all layers) | 16K to ~1M per SAE | JumpReLU (L0) [11] |
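The three SAE sparsity mechanisms named in the table differ mainly in the activation applied to the encoder's pre-activations. The sketch below is a simplified illustration, not any lab's implementation: the function names are hypothetical, the TopK variant omits any extra nonlinearity, and the JumpReLU threshold (learned per feature in practice) is passed in as a fixed value.

```python
import numpy as np

def relu(pre):
    """ReLU: keep positive pre-activations; sparsity comes from an L1 penalty in the loss."""
    return np.maximum(0.0, pre)

def topk(pre, k):
    """TopK: keep the k largest pre-activations and zero the rest."""
    out = np.zeros_like(pre)
    idx = np.argpartition(pre, -k)[-k:]
    out[idx] = pre[idx]
    return out

def jumprelu(pre, theta):
    """JumpReLU: pass a pre-activation unchanged only where it exceeds its threshold."""
    return np.where(pre > theta, pre, 0.0)

pre = np.array([-0.5, 0.05, 0.3, 1.2, 0.8])
print("ReLU:    ", relu(pre))
print("TopK(2): ", topk(pre, k=2))
print("JumpReLU:", jumprelu(pre, theta=0.25))
```

On this input ReLU keeps four entries, TopK keeps exactly two, and JumpReLU drops the small positive entry 0.05 that ReLU lets through, which is the kind of weak activation an L1 penalty otherwise has to suppress.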
Recovering an interpretable dictionary enables several downstream uses. The most striking is feature steering, also called activation steering or clamping: because each feature is a direction in activation space, one can intervene during a forward pass by artificially raising or lowering a chosen feature's activation and observe a corresponding, interpretable change in behavior. Anthropic's demonstration "Golden Gate Claude" clamped a Golden Gate Bridge feature in Claude 3 Sonnet to a high value, causing the model to steer nearly every response toward the bridge, which served as direct causal evidence that the feature genuinely represents that concept rather than merely correlating with it. [2][12] The Scaling Monosemanticity paper also showed that suppressing or amplifying features tied to behaviors such as sycophancy, scams, and unsafe code changed the model's outputs accordingly. [2][12]
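Mechanically, clamping amounts to editing one entry of the sparse code and decoding again. The following NumPy sketch illustrates the idea under stated assumptions: the weights are random stand-ins, biases are omitted for brevity, the function name clamp_feature is hypothetical, and the choice to add back the SAE's reconstruction error so that only the chosen feature changes is a design decision of this example.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 16, 64

# Random stand-ins for a trained SAE's weights (biases omitted for brevity).
W_enc = rng.standard_normal((n_features, d_model)) * 0.1
W_dec = rng.standard_normal((d_model, n_features))
W_dec /= np.linalg.norm(W_dec, axis=0)

def encode(x):
    return np.maximum(0.0, W_enc @ x)

def clamp_feature(x, feature_idx, value):
    """Return the activation x with one feature's activation set to `value`."""
    f = encode(x)
    error = x - W_dec @ f             # part of x the SAE fails to reconstruct
    f_steered = f.copy()
    f_steered[feature_idx] = value    # the intervention
    return W_dec @ f_steered + error  # everything else is left untouched

x = rng.standard_normal(d_model)
x_steered = clamp_feature(x, feature_idx=3, value=10.0)

# The edit moves x along exactly one dictionary direction.
shift = x_steered - x
print(np.allclose(shift, (10.0 - encode(x)[3]) * W_dec[:, 3]))
```

In a real model the steered vector would replace the original activation at the chosen layer, and the rest of the forward pass would run on it.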
Beyond steering, dictionary features support circuit analysis: by expressing activations in a sparse, interpretable basis, researchers can trace how features in one layer give rise to features in later layers, an effort that has grown into work on attribution graphs and "circuit tracing." Features also provide a vocabulary for AI safety, since they include directions associated with deception, power-seeking, dangerous capabilities such as biological-weapon information, and bias, which can in principle be monitored or used as classifiers. [2][12] Open dictionaries such as Gemma Scope, with browsable features hosted on tools like Neuronpedia, have made this kind of analysis reproducible without training SAEs from scratch. [11]
Dictionary learning for interpretability is an active and contested research area, and several limitations are well documented. Feature splitting is the observation that as the dictionary grows, a concept captured by one feature in a smaller SAE often fractures into many narrower features in a larger one, which complicates any claim that a particular feature is "the" representation of a concept and suggests the decomposition depends on the chosen dictionary size. [2] Completeness is a related concern: even very large SAEs do not capture all of a model's behavior. OpenAI reported that routing GPT-4's activations through its sparse autoencoder degraded performance to roughly that of a model trained with about ten times less compute, and noted that fully mapping the concepts in a frontier model might require billions or trillions of features. [10][11] There are also reconstruction-versus-sparsity tradeoffs and pathologies such as dead features (atoms that never activate) and very high frequency features, which motivated successor architectures including Gated SAEs, TopK SAEs, and JumpReLU SAEs that aim to improve the reconstruction-sparsity frontier. [10][11]
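Dead and very high frequency features are straightforward to detect once feature activations have been collected over a dataset. The sketch below shows one way to compute such diagnostics. The function name, the dictionary keys, and the high_freq threshold are hypothetical choices for this example, and the small activation matrix is made up for illustration.

```python
import numpy as np

def feature_diagnostics(F, high_freq=0.1):
    """F: (n_inputs, n_features) array of nonnegative feature activations."""
    active = F > 0
    freq = active.mean(axis=0)          # fraction of inputs on which each feature fires
    mean_l0 = active.sum(axis=1).mean() # average number of active features per input
    return {
        "freq": freq,
        "mean_l0": float(mean_l0),
        "dead": np.flatnonzero(freq == 0),          # never fire on this dataset
        "dense": np.flatnonzero(freq > high_freq),  # fire on a large share of inputs
    }

# Toy activations: 4 inputs, 4 features.
F = np.array([
    [0.0, 1.2, 0.0, 0.4],
    [0.0, 0.0, 0.0, 0.9],
    [0.0, 0.0, 0.0, 0.1],
    [0.0, 0.7, 0.0, 0.3],
])
stats = feature_diagnostics(F, high_freq=0.75)
print("dead features: ", stats["dead"])
print("dense features:", stats["dense"])
print("mean L0:       ", stats["mean_l0"])
```

Here features 0 and 2 never fire, feature 3 fires on every input, and an average of 1.5 features are active per input. A feature that is dead on one dataset may still fire on another, so the result depends on the data used.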
A deeper debate concerns whether the recovered features are the right unit of analysis at all. The method assumes the linear representation hypothesis (that concepts correspond to directions) and a sparse, roughly independent feature structure, assumptions that may not hold uniformly. Critics have raised the possibility of an "interpretability illusion," in which features look clean under cherry-picked examples but do not faithfully or completely describe the model's computation, and have questioned whether SAE features are stable, causally meaningful, and useful for concrete downstream tasks relative to simpler baselines. Whether dictionary learning ultimately yields a faithful, complete, and minimal description of what a network computes, or whether features are an imperfect intermediate abstraction, remains open. The technique is therefore best understood as a powerful and rapidly improving lens on model internals rather than a solved decomposition.