Towards Monosemanticity
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,476 words
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning is a research publication by Anthropic's interpretability team, released on October 5, 2023 on the Transformer Circuits Thread.[^1] Its authors are Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E. Burke, Tristan Hume, Shan Carter, Tom Henighan, and Christopher Olah. The paper demonstrates that training a sparse autoencoder on the MLP activations of a small one-layer transformer recovers a dictionary of thousands of interpretable, monosemantic features. These features correspond to recognizable linguistic and structural concepts such as Arabic script, base64 strings, DNA sequences, Hebrew text, HTTP requests, legal language, and nutrition statements, even though the underlying neurons themselves are polysemantic.[^1][^2]
The work is widely regarded as the empirical pivot that turned the sparse-autoencoder hypothesis, suggested by the 2022 paper Toy Models of Superposition, into a practical research program. By giving a constructive demonstration that the superposition hypothesis can be inverted with a relatively simple dictionary learning method, Towards Monosemanticity triggered a wave of follow-up work both inside Anthropic (Scaling Monosemanticity in May 2024, Circuits Updates throughout 2024 and 2025, Sparse Crosscoders, Transcoders, and the On the Biology of a Large Language Model circuit-tracing program in 2025) and outside it (OpenAI's Scaling and Evaluating Sparse Autoencoders, Google DeepMind's Gemma Scope, the SAEBench benchmark suite, and many academic SAE projects).[^3][^4][^5][^6][^7]
The motivation for Towards Monosemanticity comes directly from Anthropic's 2022 paper Toy Models of Superposition by Nelson Elhage and colleagues.[^8] That paper formalized a long-standing observation in neural-network interpretability: individual neurons in trained models rarely correspond to single, clean concepts. Instead, the same neuron may fire on apparently unrelated inputs (a phenomenon called polysemanticity), and conversely, a single concept can be distributed across many neurons.
Toy Models of Superposition proposed that polysemanticity arises because neural networks are storing more features than they have neurons. If a model has d neurons but needs to represent n features with n greater than d, then provided the features are sufficiently sparse (only a few of them active on any given input), the model can pack them into the activation space as a set of n directions that are not all mutually orthogonal. This compression strategy is called superposition. The toy-model paper showed that superposition really does occur in carefully controlled settings, that it organizes features into geometric structures such as digons, triangles, pentagons, and tetrahedrons, and that there is a phase transition between regimes where features live in clean monosemantic neurons and regimes where they live in superposition.
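The pentagon geometry mentioned above can be made concrete with a few lines of arithmetic. The following is a minimal sketch, not code from either paper: it places five feature directions in a two-dimensional space and reads them back with a ReLU, under the simplifying assumption that exactly one feature is active. The function names are illustrative.

```python
import math

def pentagon_directions():
    """Five unit vectors in 2-D, spaced 72 degrees apart (n = 5 features, d = 2)."""
    return [(math.cos(2 * math.pi * k / 5), math.sin(2 * math.pi * k / 5))
            for k in range(5)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def readout(active_index, directions):
    """Embed one active feature (value 1.0) and read every feature back.

    The hidden state is simply the active feature's direction. The ReLU
    clips negative interference from non-adjacent directions to zero.
    """
    hidden = directions[active_index]
    return [max(0.0, dot(d, hidden)) for d in directions]

directions = pentagon_directions()
recovered = readout(0, directions)
# recovered[0] is 1.0 (the true feature); the two adjacent directions pick up
# cos(72 degrees), about 0.31, of interference; the two far directions read 0.0.
```

The residual interference on the adjacent directions is the price of packing five features into two dimensions; it stays tolerable only because features are assumed to be sparse, so that several rarely fire at once.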
The implication is striking. If superposition is what is happening inside real language-model MLPs, then the features of a model are linear directions in activation space, not individual neurons, and the number of such features can be substantially larger than the model's width. To recover them, one needs a tool that takes activations as input and produces a sparse, overcomplete decomposition: dictionary learning. Towards Monosemanticity is the first published demonstration that a simple variant of this idea works on real transformer activations.
The paper has twenty-five named contributors, with Trenton Bricken and Adly Templeton listed as core contributors and Christopher Olah as senior author. It was published on the Transformer Circuits Thread (transformer-circuits.pub), an Anthropic-curated venue for long-form interpretability research, on October 5, 2023.[^1] An accompanying blog post titled Decomposing Language Models Into Understandable Components appeared the same week on the Anthropic news site to summarize the result for a wider audience.[^2]
To keep the system tractable, the authors trained two small one-layer transformers, referred to as A (the primary model studied throughout the paper) and B (used to study universality of features across runs). Each model has a single transformer block consisting of one attention layer and one MLP layer with 512 neurons in the MLP hidden dimension, and was trained on approximately 100 billion tokens from The Pile.[^1][^9]
This is small enough that activations on hundreds of millions of tokens can be cached on disk, and the model's residual stream and MLP outputs can be studied exhaustively. At the same time, it is large enough to exhibit superposition and to produce the kind of contextual behaviour that makes interpretability questions interesting. The choice of one layer is deliberate: a single MLP layer is the simplest place in which superposition can already be expected to be a dominant phenomenon, and isolating it from cross-layer effects keeps the experimental question clean.
The central method is a sparse autoencoder (SAE) trained on the MLP hidden-layer activations of the one-layer transformer, taken after the nonlinearity. The autoencoder has the form
f = ReLU(W_e (x − b_d) + b_e)
x_hat = W_d f + b_d
where x is the MLP activation vector of dimension 512, W_e is the encoder weight matrix mapping into a wider feature dimension D, W_d is the decoder weight matrix mapping back down, b_e and b_d are biases, and f is the vector of non-negative feature activations.[^1] The encoder, ReLU non-linearity, and tied input bias b_d are deliberately simple; the authors stress that the SAE is not meant to be a clever model, only a clean instrument for reading out a dictionary.
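The two equations above translate directly into code. The following is a minimal sketch in plain Python, using toy dimensions (8 inputs, 32 features) rather than the 512-dimensional activations of the paper; the random initialization and the helper names are assumptions made for illustration only.

```python
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def sae_forward(x, W_e, b_e, W_d, b_d):
    """f = ReLU(W_e (x - b_d) + b_e);  x_hat = W_d f + b_d."""
    centered = [xi - bi for xi, bi in zip(x, b_d)]      # subtract the decoder bias
    pre_acts = matvec(W_e, centered)
    f = [max(0.0, p + b) for p, b in zip(pre_acts, b_e)]  # non-negative features
    x_hat = [r + b for r, b in zip(matvec(W_d, f), b_d)]
    return f, x_hat

rng = random.Random(0)
n_inputs, n_features = 8, 32   # toy sizes chosen for illustration
W_e = [[rng.gauss(0.0, 0.1) for _ in range(n_inputs)] for _ in range(n_features)]
W_d = [[rng.gauss(0.0, 0.1) for _ in range(n_features)] for _ in range(n_inputs)]
b_e = [0.0] * n_features
b_d = [0.0] * n_inputs

x = [rng.gauss(0.0, 1.0) for _ in range(n_inputs)]
f, x_hat = sae_forward(x, W_e, b_e, W_d, b_d)
```

Note that b_d appears twice, once subtracted before encoding and once added after decoding, which is what the text means by a tied input bias.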
The training objective is a weighted sum of mean-squared reconstruction error and an L1 sparsity penalty on the feature activations:
L = || x − x_hat ||_2^2 + λ || f ||_1
with the L1 coefficient λ controlling the trade-off between reconstruction fidelity and sparsity. Smaller λ yields better reconstruction but denser, less interpretable features; larger λ yields cleaner features but more reconstruction error. Open-source replications of the paper report values around 10^−3 for λ as roughly matching the published behaviour.[^9]
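As a worked instance of the objective, the sketch below evaluates the loss for a single activation vector with small hand-picked numbers. It is illustrative only: the values of x, x_hat, f, and the coefficient are invented, and a real training loop would average this quantity over a batch.

```python
def sae_loss(x, x_hat, f, l1_coeff):
    """Squared reconstruction error plus an L1 penalty on feature activations."""
    reconstruction = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity = sum(abs(v) for v in f)
    return reconstruction + l1_coeff * sparsity

# Hand-picked toy values: the reconstruction misses the second coordinate by 1,
# and one feature is active with value 3.
loss = sae_loss(x=[1.0, 2.0], x_hat=[1.0, 1.0], f=[0.0, 3.0], l1_coeff=0.5)
# reconstruction term = 1.0, sparsity term = 0.5 * 3.0 = 1.5, so loss = 2.5
```

Raising l1_coeff makes every active feature more expensive, which is the mechanism behind the trade-off described above.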
The autoencoder is trained on a large pool of cached MLP activations drawn from text passed through the trained transformer. Each training example is one activation vector, sampled across many tokens and contexts. Replications and follow-up notes describe training pools on the order of a few billion activation vectors, with batch sizes around 8,192 and learning rates around 3 × 10^−4.[^9]
A practical issue with L1-regularized autoencoders is dead features: a fraction of the dictionary entries never activate on any input, contributing nothing to reconstruction and wasting capacity. The paper introduces a neuron resampling procedure that periodically detects features that have not fired for a long stretch of training steps and reinitializes their encoder and decoder weights, biasing the reinitialization toward inputs the current dictionary reconstructs poorly. With this procedure, the authors report that the great majority of features in their main dictionary stay alive and contribute to reconstruction, with only a small percentage remaining in ultra-low-density modes.[^1][^9]
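The resampling idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's exact procedure: dead features are taken to be those with a zero firing count over some monitoring window, replacement inputs are sampled with probability proportional to squared reconstruction loss, and the encoder_scale value of 0.2 is an illustrative choice. The decoder is stored as one vector per feature for readability.

```python
import math
import random

def find_dead_features(fire_counts):
    """Indices of features that never fired in the monitoring window."""
    return [i for i, count in enumerate(fire_counts) if count == 0]

def unit(vector):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]

def resample_dead_features(encoder_rows, decoder_vectors, encoder_bias,
                           fire_counts, inputs, losses, rng, encoder_scale=0.2):
    """Reinitialize dead features toward poorly reconstructed inputs (in place)."""
    dead = find_dead_features(fire_counts)
    if not dead:
        return dead
    # Inputs with high reconstruction loss are more likely to be chosen.
    weights = [loss * loss for loss in losses]
    picks = rng.choices(range(len(inputs)), weights=weights, k=len(dead))
    for feature, pick in zip(dead, picks):
        direction = unit(inputs[pick])
        decoder_vectors[feature] = direction
        encoder_rows[feature] = [encoder_scale * d for d in direction]
        encoder_bias[feature] = 0.0
    return dead

# Toy state: feature 1 is dead, and only the third input is badly reconstructed.
encoder_rows = [[0.1, 0.0], [0.0, 0.0], [0.0, 0.1]]
decoder_vectors = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]]
encoder_bias = [0.0, -5.0, 0.0]
dead = resample_dead_features(
    encoder_rows, decoder_vectors, encoder_bias,
    fire_counts=[10, 0, 7],
    inputs=[[1.0, 0.0], [0.0, 1.0], [3.0, 4.0]],
    losses=[0.0, 0.0, 5.0],
    rng=random.Random(0),
)
```

In this toy case the only input with non-zero loss is [3.0, 4.0], so the dead feature's decoder vector becomes its unit direction. A full implementation would also need to reset optimizer state for the reinitialized weights.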
A key empirical contribution of the paper is a scan over dictionary sizes. Starting from a dictionary equal in size to the MLP itself (512 features) and increasing roughly geometrically up to about 256 times that width (more than 131,000 features), the authors train a series of autoencoders denoted A/0, A/1, A/2, A/3, A/4, A/5, and study how the dictionary changes as it grows.
| Run | Dictionary size | Approx. ratio to MLP width | Notes |
|---|---|---|---|
| A/0 | 512 | 1× | Same width as MLP; many polysemantic features |
| A/1 | 4,096 | 8× | Primary detailed-interpretability dictionary; >4,000 features studied |
| A/2 | 16,384 | 32× | Finer-grained features; clear feature splitting |
| A/3 | 32,768 | 64× | Continued refinement |
| A/4 | 65,536 | 128× | Approaches high-density regime |
| A/5 | 131,072 | 256× | Largest dictionary; most fine-grained splits |
A/1 (the 4,096-feature dictionary) is the focus of the bulk of the paper's detailed interpretability analysis, and is the run the authors recommend new readers begin with.[^1][^4] The larger dictionaries are mainly used to study feature splitting and universality.
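The expansion ratios in the table follow from dividing each dictionary size by the 512-neuron MLP width. A short check, using only the numbers given in the table:

```python
MLP_WIDTH = 512

DICTIONARY_SIZES = {
    "A/0": 512,
    "A/1": 4096,
    "A/2": 16384,
    "A/3": 32768,
    "A/4": 65536,
    "A/5": 131072,
}

# Ratio of dictionary size to MLP width for each run.
ratios = {run: size // MLP_WIDTH for run, size in DICTIONARY_SIZES.items()}
# {'A/0': 1, 'A/1': 8, 'A/2': 32, 'A/3': 64, 'A/4': 128, 'A/5': 256}
```

The step from A/0 to A/1 is a factor of 8 and from A/1 to A/2 a factor of 4, after which each run doubles the previous one.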
The headline observation is that the resulting dictionaries are interpretable. Across the dictionary, many features cleanly correspond to a single human-recognizable concept and rarely activate outside of inputs matching that concept. The paper showcases a wide range of examples; a representative selection is summarized below.[^1][^2][^4]
| Feature category | Example | Comment |
|---|---|---|
| Script / language detection | Arabic script | Activates on Arabic characters and falls off when text leaves Arabic |
| Script / language detection | Hebrew | Similar behaviour to Arabic feature, on Hebrew text |
| Script / language detection | Korean Hangul | Activates on Hangul-rich passages |
| Encoded data | Base64 | Activates on dense base64-encoded strings |
| Encoded data | URL / HTTP request | Fires on URL-like and HTTP header patterns |
| Programming context | Code comments | Activates inside comments in source code |
| Programming context | Python function definitions | Activates after the def keyword and similar code constructs |
| Biology | DNA sequences | Activates on runs of A/C/G/T characters |
| Domain text | Legal text | Activates in contracts, statutes, and legalese |
| Domain text | Nutrition statements | Activates on food labels, calorie listings, ingredient lists |
| Lexical | Token-in-context features | Activate on specific tokens in particular surrounding contexts |
| Mathematical | LaTeX | Activates on LaTeX math markup |
Crucially, these features are not the activations of any single MLP neuron. The paper provides explicit comparisons in which inspecting a single neuron yields a "muddle" of unrelated activations, while the corresponding dictionary feature is far cleaner. This is a direct, constructive demonstration that the MLP is in fact representing many more features than neurons, exactly as predicted by the superposition hypothesis.
A nice property of monosemantic features is that one can write down a computational proxy for them: a simple rule (regex, classifier, or external statistical model) that predicts when the feature should fire, and check empirically how well the SAE feature matches that proxy. The paper uses four different evidentiary strategies, in increasing scale.[^1]
The first is detailed investigation of a few features. For features like Arabic script or DNA, the authors construct external proxies (for example, a regex or character-class detector for Arabic Unicode), then plot the feature activation against the proxy probability across very many contexts. The features track their proxies with high fidelity.
The second is human analysis of a large random sample, in which human raters look at top-activating examples for many features and judge interpretability.
The third uses an automated interpretability score: a separate language model is shown top-activating examples for a feature and asked to propose an interpretation; the same model then scores how well that interpretation predicts feature activation on held-out contexts. Larger dictionaries yield higher average scores up to a point, and the autoencoder dictionary scores well above raw MLP neurons.
The fourth is automated analysis of logit weights: for each feature, the linear effect on output logits often picks out a coherent token cluster (Arabic letters, base64 characters, etc.) consistent with the feature's interpretation. This is described later in the paper as the direct logit effect analysis.
Together, these four lines of evidence produce a picture in which the sparse dictionary is much more monosemantic than the neuron basis.
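The proxy comparison used in the first strategy can be sketched with a character-class detector and a correlation. This is a simplified illustration, not the paper's analysis: the proxy is the fraction of a token's characters in the basic Arabic Unicode block (U+0600 to U+06FF), the feature activations are invented numbers standing in for real SAE outputs, and agreement is summarized with a Pearson correlation.

```python
import math
import re

ARABIC_CHAR = re.compile(r"[\u0600-\u06FF]")

def arabic_proxy(token):
    """Fraction of characters that fall in the basic Arabic Unicode block."""
    if not token:
        return 0.0
    return len(ARABIC_CHAR.findall(token)) / len(token)

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences with non-zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

tokens = ["the", "مرحبا", "base64", "العالم", "hello"]
# Hypothetical activations of an "Arabic script" feature on those tokens.
feature_acts = [0.0, 4.1, 0.1, 3.8, 0.0]

proxy_scores = [arabic_proxy(t) for t in tokens]
agreement = pearson(proxy_scores, feature_acts)
```

A monosemantic feature should show agreement close to 1 with a well-chosen proxy, whereas a polysemantic neuron that also fires on unrelated text would score noticeably lower.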
As the dictionary size grows from A/0 (512) up to A/5 (~131,000), the paper observes a robust phenomenon it calls feature splitting. A single feature in a small dictionary representing, for example, "base64-encoded data" splits into multiple features in larger dictionaries, with each of the resulting features specializing in a more specific kind of base64 string, distinguishing for example tokens at the beginning of a base64 block from tokens in the interior, or distinguishing different surrounding contexts (URLs, raw text, JSON).[^1][^4]
The same is reported for many other features. In a small dictionary, the Arabic-script feature is one feature; in a larger dictionary, it splits into features that distinguish Arabic in Quranic contexts, Arabic in news prose, Arabic letters appearing as transliterations, and so on. The paper interprets this as evidence that the underlying feature space is itself hierarchically structured: there are coarse features the small dictionary is forced to combine into a single direction, and finer features that the larger dictionary can resolve separately. This connects sparse dictionary learning to the broader interpretability question of how features compose, and motivates the later work on transcoders, crosscoders, and attribution graphs that treat the dictionary as a substrate for circuits rather than as an end in itself.
A persistent worry about any dictionary-learning method is that the dictionary might be an artifact of the training process: different runs with different random seeds might produce very different dictionaries. Towards Monosemanticity directly tests this in two ways.[^1][^4]
First, the authors train multiple sparse autoencoders on the same transformer (model A) with different seeds, and measure feature-level activation correlations between runs. They report that, on a large random sample of features, the median maximum activation correlation between two independent SAEs is around 0.72, substantially above the median maximum correlation between SAE features and raw neurons (around 0.46). In other words, two independently trained dictionaries agree with each other far more than either one agrees with the underlying neuron basis. Second, the authors train SAEs on a different one-layer transformer (model B) trained from a different random seed and study how often the same kinds of features appear. Many features (Arabic script, base64, DNA, code comments, and so on) reliably reappear, providing evidence that the dictionary is recovering structure that lives in the data and the architecture rather than artifacts of a particular training run.
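The "median maximum activation correlation" statistic can be computed as sketched below. This is a minimal illustration assuming each dictionary is represented as a list of per-feature activation sequences over the same sample of tokens; the toy activations are invented, and the function names are not taken from the paper.

```python
import math
import statistics

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either sequence is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def median_max_correlation(acts_a, acts_b):
    """For each feature in A, find its best-correlated feature in B, then take the median."""
    best_matches = [max(pearson(fa, fb) for fb in acts_b) for fa in acts_a]
    return statistics.median(best_matches)

# Toy case: dictionary B contains rescaled, reordered copies of A's features.
acts_a = [[1.0, 0.0, 2.0, 0.0], [0.0, 3.0, 0.0, 1.0]]
acts_b = [[0.0, 6.0, 0.0, 2.0], [2.0, 0.0, 4.0, 0.0]]
score = median_max_correlation(acts_a, acts_b)
```

Because correlation ignores scale and the maximum ignores ordering, the statistic is insensitive to the arbitrary permutation and rescaling of features between training runs, which is what makes it suitable for comparing dictionaries.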
This universality result is significant. It supports the interpretation that what the SAE recovers is, in some loose sense, the "true" feature basis of the model, and provides a basis for cross-model comparison work that came later (Sparse Crosscoders, persona vectors).
The paper's contributions can be summarized as a small set of empirical claims.[^1][^2]
Taken together, these claims constitute the first end-to-end empirical demonstration that the superposition hypothesis can be inverted with a practical method, on a real (if small) transformer.
The authors are explicit about the limitations of the result.[^1]
The model studied is a one-layer transformer with a 512-neuron MLP, not a frontier language model. There is no guarantee a priori that the same method will work on deeper architectures, on attention rather than MLP activations, on the residual stream of a layer mid-stack, or on activations from much wider MLPs. The paper frames its result explicitly as a proof of concept, with scaling deferred to subsequent work.
A meaningful fraction of features at small dictionary sizes remain hard to interpret, or interpretable only with caveats. Even in the carefully studied A/1 dictionary, not every feature has a clean computational proxy; some are token-in-context features that fire only on a specific token in a particular set of contexts, and others are mixtures that are themselves a form of mild polysemanticity. Reconstruction is also not perfect. The SAE leaves some residual error that the language model would have used; downstream behaviour depends on this residual, and the paper acknowledges that the dictionary represents the MLP only approximately.
Dead features and the related phenomenon of ultra-low-density features (which fire only on a vanishing fraction of inputs) are addressed by the resampling procedure but not fully solved. The dictionary's effective size is therefore somewhat smaller than its nominal size. Finally, the paper does not yet integrate dictionary features into a full circuit-level account of model behaviour: it identifies what the features are, not how they are used to compute outputs. That is the question taken up in subsequent work on transcoders, crosscoders, and attribution graphs.
The publication of Towards Monosemanticity was a turning point for the practical mechanistic interpretability program. Within Anthropic and across the broader community, it triggered an unusually rapid sequence of follow-up papers.
Eight months after Towards Monosemanticity, Anthropic released Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet.[^3] Templeton, Bricken, and many of the same authors trained much larger sparse autoencoders on residual-stream activations of Anthropic's mid-sized production model, Claude 3 Sonnet, extracting dictionaries of up to roughly 34 million features and demonstrating monosemantic features for concrete entities such as the Golden Gate Bridge as well as for more abstract concepts (computer code, deception, sycophancy, biological weapons). The paper showed that the recipe from Towards Monosemanticity could be scaled multiple orders of magnitude with no fundamental change in methodology, addressing the major open question of the 2023 paper.
In March 2025, Anthropic published On the Biology of a Large Language Model by Jack Lindsey and colleagues, alongside the methodology paper Circuit Tracing: Revealing Computational Graphs in Language Models.[^5] These papers introduced attribution graphs, a tool that uses dictionary features (via cross-layer transcoders) and Jacobian-style backward attribution to produce feature-level circuit diagrams for individual inputs to Claude 3.5 Haiku. The methodology is a direct descendant of Towards Monosemanticity: it treats the SAE dictionary as the substrate on which circuits live and shows how to compose features across layers into mechanistic explanations of model behaviour.
A second strand of follow-up work refined the SAE recipe itself. Transcoders learn sparse mappings from one layer's activations to another, replacing dense MLP computations with interpretable sparse intermediates and making it possible to read off features that are computed by an MLP layer rather than just present at it. Sparse Crosscoders, published by Anthropic in October 2024, train a single dictionary across multiple layers (and even multiple models), giving a shared feature space in which cross-layer and cross-model alignment becomes a first-class object.[^7] Both of these are direct generalizations of the Towards Monosemanticity dictionary, and both rely on the same core empirical claim: that activations decompose sparsely into a learned overcomplete basis.
In 2025, Anthropic and collaborators published work on Persona Vectors: Monitoring and Controlling Character Traits in Language Models, extracting linear directions in activation space that correspond to character traits such as sycophancy, evil, hallucination tendency, optimism, impoliteness, apathy, and humor.[^10] The persona-vector paper explicitly connects each trait direction to finer-grained sets of SAE features, applying the Towards Monosemanticity dictionary view to a new domain (model character and safety) and using it to causally steer model behaviour by amplifying or suppressing features along the trait direction.
Outside Anthropic, OpenAI released Scaling and Evaluating Sparse Autoencoders by Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, and collaborators (arXiv:2406.04093).[^6] The paper trains k-sparse autoencoders (top-k activation rather than L1 penalty) on GPT-style activations, including a 16-million-feature dictionary on GPT-4 residual-stream activations trained on 40 billion tokens. It introduces a suite of evaluation metrics for SAE quality and explicitly cites Towards Monosemanticity as the conceptual basis of its method.
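The difference between an L1-penalized ReLU and a top-k activation can be shown in a few lines. This is a generic sketch of the top-k idea, not OpenAI's implementation: it keeps the k largest pre-activations and zeroes the rest, so sparsity is fixed by construction rather than encouraged by a penalty term.

```python
def relu(pre_acts):
    """Standard ReLU: sparsity depends on how many pre-activations are positive."""
    return [max(0.0, p) for p in pre_acts]

def topk_activation(pre_acts, k):
    """Keep the k largest pre-activations and set every other entry to zero."""
    if k <= 0:
        return [0.0] * len(pre_acts)
    ranked = sorted(range(len(pre_acts)), key=lambda i: pre_acts[i], reverse=True)
    keep = set(ranked[:k])
    return [p if i in keep else 0.0 for i, p in enumerate(pre_acts)]

pre_acts = [0.2, -1.0, 3.0, 0.5]
dense = relu(pre_acts)                  # [0.2, 0.0, 3.0, 0.5]
sparse = topk_activation(pre_acts, 2)   # [0.0, 0.0, 3.0, 0.5]
```

With top-k the number of active features per input is exactly k, which removes the need to tune an L1 coefficient against reconstruction error.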
Google DeepMind released Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2 by Tom Lieberum and colleagues in August 2024.[^11] Gemma Scope is a public release of more than 400 JumpReLU sparse autoencoders trained on every layer and sub-layer of Gemma 2 2B and 9B (and selected layers of 27B), totalling more than 30 million features. The project explicitly frames itself as an open-source counterpart to Anthropic's internal Scaling Monosemanticity work, intended to give the safety community a common substrate on which to study circuits and feature dynamics, and inherits its methodology from Towards Monosemanticity.
EleutherAI and academic collaborators have built an ecosystem of open SAE training pipelines, model-weight releases, and benchmarks. SAEBench, a 2025 benchmark suite for sparse autoencoder quality, defines a battery of eight evaluation tasks spanning concept detection, feature disentanglement, downstream unlearning, and reconstruction fidelity, and releases more than 200 SAEs across eight architectures and training regimes.[^12] The benchmark explicitly addresses an open question highlighted by Towards Monosemanticity: how to compare SAE training methods quantitatively when human-judged interpretability does not always agree with cheaper unsupervised proxies. The papers proposing JumpReLU SAEs, top-k SAEs, Matryoshka SAEs, and gated SAEs are all part of this ecosystem.
Open-source replications of Towards Monosemanticity (for example, the shehper/sparse-dictionary-learning repository) have reproduced the central qualitative finding on similarly-sized one-layer transformers trained on OpenWebText, observing the same kinds of monosemantic features and similar dead-feature dynamics with neuron resampling.[^9] These replications established that the result is not an idiosyncrasy of Anthropic's proprietary infrastructure and accelerated adoption of the method in the wider community.
Within the larger mechanistic interpretability program, Towards Monosemanticity plays a foundational role analogous to the role of the original induction-head paper for circuits-style analyses. Where the induction heads line of work argued that one could identify discrete, named subcomponents inside attention layers, Towards Monosemanticity argued that one could identify discrete, named features inside MLP activations. Together, these two threads supply mechanistic interpretability with its two basic building blocks: heads / circuits, and features / dictionaries.
The paper also influenced the field's conceptual vocabulary. The dictionary-learning view treats the model's hidden states as sparse, overcomplete codes over a basis of learned features. Almost all subsequent SAE, transcoder, and crosscoder work, including Anthropic's Biology of a Large Language Model and OpenAI's GPT-4 SAEs, uses this framing. The notions of feature splitting, feature universality, and monosemantic features as the right unit of analysis are all now common currency.
More speculatively, Towards Monosemanticity changed the strategic outlook of safety-motivated interpretability research. Prior to October 2023, it was an open question whether the superposition hypothesis would ever be operationally useful: it provided a coherent theoretical story about why neurons are messy, but no constructive way to recover the underlying features. The paper's contribution is to convert that hypothesis into a practical engineering recipe, with a clear scaling story (bigger dictionaries, bigger models) and a clear empirical signature (more features, more splitting, persistent universality). This is what allowed Anthropic and others to plan multi-year programs around scaling sparse autoencoders, training circuit-tracing tools, and ultimately using feature-level interventions to debug, monitor, and steer frontier language models.
The October 2023 paper therefore occupies a somewhat unusual position: a result on a tiny model, written in a long-form blog format on a niche publication venue, that nonetheless reshaped the research agenda of nearly every major interpretability group in 2024 and 2025. The follow-up work has not refuted the central claim; rather, it has affirmed and extended it, scaling the dictionary from thousands of features on a one-layer toy to tens of millions of features on production frontier models, and turning isolated monosemantic features into the substrate for full feature-level circuit explanations.