Towards Monosemanticity


Updated Jul 23, 2026


Towards Monosemanticity is an October 2023 mechanistic interpretability paper from Anthropic that used a sparse autoencoder to decompose the internal activations of a small language model into thousands of human-interpretable, monosemantic features. In the authors' own summary, "we decompose a layer with 512 neurons into more than 4000 features which separately represent things like DNA sequences, legal language, HTTP requests, Hebrew text, nutrition statements, and much, much more."^[1]^[2] It was among the first published demonstrations that the superposition problem, the reason individual neurons are hard to interpret, can be reversed in a real transformer using dictionary learning, and it is the direct precursor to Anthropic's 2024 Scaling Monosemanticity work on Claude 3 Sonnet.

The full title is Towards Monosemanticity: Decomposing Language Models With Dictionary Learning, a research publication by Anthropic's interpretability team, released on October 5, 2023 on the Transformer Circuits Thread.^[1] Authored by Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E. Burke, Tristan Hume, Shan Carter, Tom Henighan, and Christopher Olah, the paper demonstrates that training a sparse autoencoder on the MLP activations of a tiny one-layer transformer recovers a dictionary of thousands of interpretable, monosemantic features. These features correspond to recognizable linguistic and structural concepts such as Arabic script, base64 strings, DNA sequences, Hebrew text, HTTP requests, legal language, and nutrition statements, even though the underlying neurons themselves are polysemantic.^[1]^[2]

The paper's central thesis is that "there are better units of analysis than individual neurons," and that those better units, called features, "correspond to patterns (linear combinations) of neuron activations."^[1] The work is widely regarded as the empirical pivot that turned the sparse-autoencoder hypothesis, suggested by the 2022 paper Toy Models of Superposition, into a practical research program. By giving a constructive demonstration that the superposition hypothesis can be inverted with a relatively simple dictionary learning method, Towards Monosemanticity triggered a wave of follow-up work both inside Anthropic (Scaling Monosemanticity in May 2024, Circuits Updates throughout 2024 and 2025, Sparse Crosscoders, Transcoders, and the On the Biology of a Large Language Model circuit-tracing program in 2025) and outside it (OpenAI's Scaling and Evaluating Sparse Autoencoders, Google DeepMind's Gemma Scope, the SAEBench benchmark suite, and many academic SAE projects).^[3]^[4]^[5]^[6]^[7]

What problem does Towards Monosemanticity solve? Superposition and polysemantic neurons

The motivation for Towards Monosemanticity comes directly from Anthropic's 2022 paper Toy Models of Superposition by Nelson Elhage and colleagues.^[8] That paper formalized a long-standing observation in neural-network interpretability: individual neurons in trained models rarely correspond to single, clean concepts. Instead, the same neuron may fire on apparently unrelated inputs (a phenomenon called polysemanticity), and conversely, a single concept can be distributed across many neurons.

Toy Models of Superposition proposed that polysemanticity arises because neural networks are storing more features than they have neurons. If a model has d neurons but needs to represent n features with n greater than d, then provided the features are sufficiently sparse (only a few of them active on any given input), the model can pack them into the activation space as a set of n directions that are not all mutually orthogonal. This compression strategy is called superposition. The toy-model paper showed that superposition really does occur in carefully controlled settings, that it organizes features into geometric structures such as digons, triangles, pentagons, and tetrahedrons, and that there is a phase transition between regimes where features live in clean monosemantic neurons and regimes where they live in superposition.
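The packing argument can be made concrete with a minimal sketch. It assumes a hand-built toy rather than a trained model: five features are assigned to five unit directions spaced evenly in a two-dimensional space (the pentagon arrangement), and a ReLU readout with a negative bias recovers whichever single feature is active. The bias value of -0.35 and all function names are illustrative assumptions, not values from either paper.

```python
import math

N_FEATURES, D_HIDDEN = 5, 2

# Assumed toy setup: feature k points at angle 2*pi*k/5 (a regular pentagon).
directions = [
    (math.cos(2 * math.pi * k / N_FEATURES), math.sin(2 * math.pi * k / N_FEATURES))
    for k in range(N_FEATURES)
]

def compress(features):
    """Map n sparse feature values to a d-dimensional hidden vector."""
    return tuple(
        sum(value * direction[axis] for value, direction in zip(features, directions))
        for axis in range(D_HIDDEN)
    )

def read_out(hidden, bias=-0.35):
    """Project back onto every feature direction; the ReLU and the negative
    bias suppress interference from non-orthogonal neighbours."""
    return [
        max(0.0, hidden[0] * dx + hidden[1] * dy + bias)
        for dx, dy in directions
    ]

one_hot = [0.0, 0.0, 1.0, 0.0, 0.0]       # only feature 2 is active
recovered = read_out(compress(one_hot))
print([round(v, 3) for v in recovered])    # only index 2 is non-zero
```

When several features are active at once the readout can pick up spurious activations (in this toy, features 0 and 2 together also light up feature 1), which is why the argument depends on sparsity.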

The implication is striking. If superposition is what is happening inside real language-model MLPs, then the features of a model are linear directions in activation space, not individual neurons, and the number of such features can be substantially larger than the model's width. To recover them, one needs a tool that takes activations as input and produces a sparse, overcomplete decomposition: dictionary learning. Towards Monosemanticity is one of the first published demonstrations that a simple variant of this idea works on real transformer activations.

The paper

Who wrote it and where was it published?

The paper has twenty-five named contributors, with Trenton Bricken and Adly Templeton listed among the core contributors and Christopher Olah as senior author. It was published on the Transformer Circuits Thread (transformer-circuits.pub), an Anthropic-curated venue for long-form interpretability research, on October 5, 2023.^[1] An accompanying blog post titled Decomposing Language Models Into Understandable Components appeared the same week on the Anthropic news site to summarize the result for a wider audience.^[2]

What model did the paper study? The one-layer transformer

To keep the system tractable, the authors trained two small one-layer transformers, internally referred to as A (the primary model studied throughout the paper) and B (used to study universality of features across runs). Each model has a single transformer block consisting of one attention layer and one MLP layer with 512 neurons in the MLP hidden dimension, and was trained for approximately 100 billion tokens from The Pile.^[1]^[9]

This is small enough that activations on hundreds of millions of tokens can be cached on disk, and the model's residual stream and MLP outputs can be studied exhaustively. At the same time, it is large enough to exhibit superposition and to produce the kind of contextual behaviour that makes interpretability questions interesting. The choice of one layer is deliberate: a single MLP layer is the simplest place in which superposition can already be expected to be a dominant phenomenon, and isolating it from cross-layer effects keeps the experimental question clean.

How does the sparse autoencoder work?

The central method is a sparse autoencoder (SAE) trained on the MLP hidden-layer activations of the one-layer transformer, taken after the nonlinearity. The autoencoder has the form

f = ReLU(W_e (x - b_d) + b_e)
x_hat = W_d f + b_d

where x is the MLP activation vector of dimension 512, W_e is the encoder weight matrix mapping into a wider feature dimension D, W_d is the decoder weight matrix mapping back down, b_e and b_d are biases, and f is the vector of non-negative feature activations.^[1] The encoder, ReLU non-linearity, and tied input bias b_d are deliberately simple; the authors stress that the SAE is not meant to be a clever model, only a clean instrument for reading out a dictionary.
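The two equations above translate almost line for line into code. The following is a minimal sketch in plain Python lists, assuming toy sizes (4 inputs, 8 features) and random weights; the function name `sae_forward` and the weight layout are choices made for this illustration, not the authors' implementation.

```python
import random

def sae_forward(x, W_e, b_e, W_d, b_d):
    """Return (f, x_hat) for one activation vector x.

    W_e has shape [D][n] (one row per feature); W_d has shape [n][D].
    """
    centred = [xi - bi for xi, bi in zip(x, b_d)]             # x - b_d
    pre = [sum(w * c for w, c in zip(row, centred)) + b
           for row, b in zip(W_e, b_e)]                        # W_e (x - b_d) + b_e
    f = [max(0.0, p) for p in pre]                             # ReLU
    x_hat = [sum(w * fj for w, fj in zip(row, f)) + b
             for row, b in zip(W_d, b_d)]                      # W_d f + b_d
    return f, x_hat

# Toy sizes (assumption): n = 4 inputs, D = 8 features; the paper uses n = 512.
rng = random.Random(0)
n, D = 4, 8
W_e = [[rng.gauss(0, 0.5) for _ in range(n)] for _ in range(D)]
W_d = [[rng.gauss(0, 0.5) for _ in range(D)] for _ in range(n)]
b_e, b_d = [0.0] * D, [0.1] * n
x = [rng.random() for _ in range(n)]

f, x_hat = sae_forward(x, W_e, b_e, W_d, b_d)
print(len(f), len(x_hat))   # 8 4
```

Because the same b_d is subtracted on the way in and added on the way out, an input on which no feature fires is reconstructed as b_d itself.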

The training objective is a weighted sum of mean-squared reconstruction error and an L1 sparsity penalty on the feature activations:

L = || x - x_hat ||_2^2  +  lambda || f ||_1

with the L1 coefficient lambda controlling the trade-off between reconstruction fidelity and sparsity. Smaller lambda yields better reconstruction but denser, less interpretable features; larger lambda yields cleaner features but more reconstruction error. Open-source replications of the paper report values around 10^-3 for lambda as roughly matching the published behaviour.^[9]
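A small worked example makes the trade-off visible. The sketch below evaluates the objective for a single activation vector with made-up numbers; the name `sae_loss` is hypothetical, and a real training loop would typically average this quantity over a batch.

```python
def sae_loss(x, x_hat, f, lam):
    """Squared reconstruction error plus lam times the L1 norm of f."""
    reconstruction = sum((xi - xh) ** 2 for xi, xh in zip(x, x_hat))
    sparsity = sum(abs(fj) for fj in f)
    return reconstruction + lam * sparsity

# Made-up values: reconstruction error is 0.5, the L1 norm of f is 3.0.
x     = [1.0, 2.0, 0.0]
x_hat = [1.0, 1.5, 0.5]
f     = [2.0, 0.0, 0.0, 1.0]

for lam in (0.0, 1e-3, 1e-1):
    print(lam, sae_loss(x, x_hat, f, lam))
```

With lambda at zero only the reconstruction term matters; as lambda grows, the same feature activations cost more, so the optimizer is pushed toward using fewer and smaller activations.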

The autoencoder is trained on a large pool of cached MLP activations drawn from text passed through the trained transformer. Each training example is one activation vector, sampled across many tokens and contexts. Replications and follow-up notes describe training pools on the order of a few billion activation vectors, with batch sizes around 8,192 and learning rates around 3 x 10^-4.^[9]

Resampling dead features

A practical issue with L1-regularized autoencoders is dead features: a fraction of the dictionary entries never activate on any input, contributing nothing to reconstruction and wasting capacity. The paper introduces a neuron resampling procedure that periodically detects features that have not fired for a long stretch of training steps and reinitializes their encoder and decoder weights, biasing the reinitialization toward inputs the current dictionary reconstructs poorly. With this procedure, the authors report that the great majority of features in their main dictionary stay alive and contribute to reconstruction, with only a small percentage remaining in ultra-low-density modes.^[1]^[9]
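The resampling idea can be sketched as follows. This is a simplified illustration, not the paper's exact procedure: the `dead_after` threshold, the data layout (one decoder direction per feature), and the function name are assumptions, and details such as rescaling the new encoder weights or resetting optimizer state are left out.

```python
import math
import random

def resample_dead_features(encoder_rows, encoder_bias, decoder_dirs,
                           steps_since_fired, inputs, losses,
                           dead_after=1000, rng=random):
    """Reinitialise features that have not fired for `dead_after` steps.

    Each dead feature is pointed at an input drawn with probability
    proportional to its squared reconstruction loss, so poorly
    reconstructed inputs are favoured. Returns the indices resampled.
    """
    dead = [j for j, s in enumerate(steps_since_fired) if s >= dead_after]
    weights = [loss * loss for loss in losses]
    for j in dead:
        x = rng.choices(inputs, weights=weights, k=1)[0]
        norm = math.sqrt(sum(v * v for v in x)) or 1.0
        unit = [v / norm for v in x]
        decoder_dirs[j] = unit[:]        # decoder points at the hard input
        encoder_rows[j] = unit[:]        # encoder reads the same direction
        encoder_bias[j] = 0.0
        steps_since_fired[j] = 0
    return dead

rng = random.Random(0)
encoder_rows = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
encoder_bias = [0.0, 0.0, -0.2]
decoder_dirs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
steps_since_fired = [3, 5000, 12]        # feature 1 has gone quiet
inputs = [[3.0, 4.0], [1.0, 0.0]]
losses = [0.9, 0.0]                      # the first input is reconstructed badly

resampled = resample_dead_features(encoder_rows, encoder_bias, decoder_dirs,
                                   steps_since_fired, inputs, losses, rng=rng)
print(resampled)          # [1]
print(decoder_dirs[1])    # [0.6, 0.8]
```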

How many features did it extract? The dictionary size scan

A key empirical contribution of the paper is a scan over dictionary sizes. Starting from a dictionary equal in size to the MLP itself (512 features) and increasing roughly geometrically up to about 256 times that width (more than 131,000 features), the authors train a series of autoencoders denoted A/0, A/1, A/2, A/3, A/4, A/5, and study how the dictionary changes as it grows.

Run	Dictionary size	Approx. ratio to MLP width	Notes
A/0	512	1x	Same width as MLP; many polysemantic features
A/1	4,096	8x	Primary detailed-interpretability dictionary; >4,000 features studied
A/2	16,384	32x	Finer-grained features; clear feature splitting
A/3	32,768	64x	Continued refinement
A/4	65,536	128x	Approaches high-density regime
A/5	~131,072	~256x	Largest dictionary; most fine-grained splits

A/1 (the 4,096-feature dictionary) is the focus of the bulk of the paper's detailed interpretability analysis, and is the run the authors recommend new readers begin with.^[1]^[4] The larger dictionaries are mainly used to study feature splitting and universality. It is this 4,096-feature run that produces the headline result of decomposing the 512-neuron MLP into "more than 4000 features."^[1]^[2]
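As a quick arithmetic check, the expansion ratios in the table follow directly from dividing each dictionary size by the MLP width of 512. The sizes below are copied from the table above; the variable names are illustrative.

```python
MLP_WIDTH = 512
dictionary_sizes = {"A/0": 512, "A/1": 4_096, "A/2": 16_384,
                    "A/3": 32_768, "A/4": 65_536, "A/5": 131_072}

# Integer division is exact here because every size is a multiple of 512.
expansion = {run: size // MLP_WIDTH for run, size in dictionary_sizes.items()}
print(expansion)
# {'A/0': 1, 'A/1': 8, 'A/2': 32, 'A/3': 64, 'A/4': 128, 'A/5': 256}
```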

Discovering monosemantic features

The headline observation is that the resulting dictionaries are interpretable. Across the dictionary, many features cleanly correspond to a single human-recognizable concept and rarely activate outside of inputs matching that concept. The paper showcases a wide range of examples; a representative selection is summarized below.^[1]^[2]^[4]

Feature category	Example	Comment
Script / language detection	Arabic script	Activates on Arabic characters and falls off when text leaves Arabic
Script / language detection	Hebrew	Similar behaviour to Arabic feature, on Hebrew text
Script / language detection	Korean Hangul	Activates on Hangul-rich passages
Encoded data	Base64	Activates on dense base64-encoded strings
Encoded data	URL / HTTP request	Fires on URL-like and HTTP header patterns
Programming context	Code comments	Activates inside comments in source code
Programming context	Python function definitions	Activates after the `def` keyword and similar code constructs
Biology	DNA sequences	Activates on runs of A/C/G/T characters
Domain text	Legal text	Activates in contracts, statutes, and legalese
Domain text	Nutrition statements	Activates on food labels, calorie listings, ingredient lists
Lexical	Token-in-context features	Activate on specific tokens in particular surrounding contexts
Mathematical	LaTeX	Activates on LaTeX math markup

Crucially, these features are not the activations of any single MLP neuron. The paper provides explicit comparisons in which inspecting a single neuron yields a "muddle" of unrelated activations, while the corresponding dictionary feature is far cleaner. This is a direct, constructive demonstration that the MLP is in fact representing many more features than neurons, exactly as predicted by the superposition hypothesis.

How do you know the features are really interpretable? Feature interpretability scoring

A nice property of monosemantic features is that one can write down a computational proxy for them: a simple rule (regex, classifier, or external statistical model) that predicts when the feature should fire, and check empirically how well the SAE feature matches that proxy. The paper offers "four different lines of evidence" that the autoencoder features are more monosemantic than neurons.^[1]

The first is detailed investigation of a few features. For features like Arabic script or DNA, the authors construct external proxies (for example, a regex or character-class detector for Arabic Unicode), then plot the feature activation against the proxy probability across very many contexts. The features track their proxies with high fidelity. The second is human analysis of a large random sample, in which human raters look at top-activating examples for many features and judge interpretability. The third uses an automated interpretability score: a separate language model is shown top-activating examples for a feature and asked to propose an interpretation; the same model then scores how well that interpretation predicts feature activation on held-out contexts. Larger dictionaries yield higher average scores up to a point, and the autoencoder dictionary scores well above raw MLP neurons. The fourth is automated analysis of logit weights: for each feature, the linear effect on output logits often picks out a coherent token cluster (Arabic letters, base64 characters, etc.) consistent with the feature's interpretation. This is described later in the paper as the direct logit effect analysis. Together, these four lines of evidence produce a picture in which the sparse dictionary is much more monosemantic than the neuron basis.
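The proxy comparison in the first line of evidence can be illustrated with a minimal sketch. It assumes a binary regex proxy for DNA-like tokens, which is cruder than a probabilistic proxy, and compares it with made-up feature activations using a Pearson correlation. The token list, the activation values, and the helper names are all hypothetical.

```python
import math
import re

# Assumed proxy: a token "looks like DNA" if it is a run of A/C/G/T of length >= 4.
DNA_PROXY = re.compile(r"^[ACGT]{4,}$")

def proxy_score(token):
    return 1.0 if DNA_PROXY.match(token) else 0.0

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

tokens = ["The", "gene", "ACGTTGCA", "GGCATA", "was", "CAT", "TTAGGC"]
# Hypothetical activations of a "DNA" feature on those tokens (made-up numbers).
feature_acts = [0.0, 0.1, 4.2, 3.7, 0.0, 0.3, 3.9]

proxy = [proxy_score(t) for t in tokens]
print(proxy)                                   # [0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0]
print(round(pearson(proxy, feature_acts), 3))  # close to 1 for a monosemantic feature
```

A polysemantic neuron run through the same check would be expected to correlate far less with any single proxy, since it also fires on unrelated inputs.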

Feature splitting

As the dictionary size grows from A/0 (512) up to A/5 (~131,000), the paper observes a robust phenomenon it calls feature splitting. A single feature in a small dictionary representing, for example, "base64-encoded data" splits into multiple features in larger dictionaries, with each of the resulting features specializing in a more specific kind of base64 string, distinguishing for example tokens at the beginning of a base64 block from tokens in the interior, or distinguishing different surrounding contexts (URLs, raw text, JSON).^[1]^[4]

The same is reported for many other features. In a small dictionary, the Arabic-script feature is one feature; in a larger dictionary, it splits into features that distinguish Arabic in Quranic contexts, Arabic in news prose, Arabic letters appearing as transliterations, and so on. The paper interprets this as evidence that the underlying feature space is itself hierarchically structured: there are coarse features the small dictionary is forced to combine into a single direction, and finer features that the larger dictionary can resolve separately. This connects sparse dictionary learning to the broader interpretability question of how features compose, and motivates the later work on transcoders, crosscoders, and attribution graphs that treat the dictionary as a substrate for circuits rather than as an end in itself.
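One simple way to look for splitting, offered here as an assumption of this sketch rather than the paper's own procedure, is to compare decoder directions across dictionary sizes: features in the larger dictionary whose directions lie close to a feature in the smaller one are candidate children. The directions, labels, and the 0.8 threshold below are made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def children_of(parent_dir, large_dictionary, threshold=0.8):
    """Names of features in the larger dictionary whose decoder direction
    is close to the parent's decoder direction."""
    return [name for name, direction in large_dictionary.items()
            if cosine(parent_dir, direction) >= threshold]

# Made-up 3-d decoder directions; the labels are hypothetical.
base64_small = [1.0, 0.0, 0.0]
large = {
    "base64 (start of block)": [0.9, 0.3, 0.0],
    "base64 (interior)":       [0.9, -0.2, 0.2],
    "Arabic script":           [0.0, 1.0, 0.1],
}
print(children_of(base64_small, large))
# ['base64 (start of block)', 'base64 (interior)']
```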

Are the features universal across runs?

A persistent worry about any dictionary-learning method is that the dictionary might be an artifact of the training process: different runs with different random seeds might produce very different dictionaries. Towards Monosemanticity directly tests this in two ways.^[1]^[4]

First, the authors train multiple sparse autoencoders on the same transformer (model A) with different seeds, and measure feature-level activation correlations between runs. They report that, on a large random sample of features, the median maximum activation correlation between two independent SAEs is around 0.72, and substantially above the median maximum correlation between SAE features and raw neurons (around 0.46). In other words, two independently trained dictionaries agree with each other far more than either one agrees with the underlying neuron basis. Second, the authors train SAEs on a different one-layer transformer (model B) trained from a different random seed and study how often the same kinds of features appear. Many features (Arabic script, base64, DNA, code comments, and so on) reliably reappear, providing evidence that the dictionary is recovering structure that lives in the data and the architecture rather than artefacts of a particular training run.
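The "median maximum activation correlation" statistic can be sketched as follows: for each feature in one dictionary, find its best-correlated partner in the other, then take the median of those best scores. The activation lists are made-up toy data over six tokens, and the function names are illustrative.

```python
import math
import statistics

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either sequence is constant."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def median_max_correlation(acts_a, acts_b):
    """acts_a[i] and acts_b[j] are activation lists over the same tokens.
    For each feature in A take its best-correlated partner in B, then
    return the median of those best correlations."""
    best = [max(pearson(fa, fb) for fb in acts_b) for fa in acts_a]
    return statistics.median(best)

# Toy activations for two independently trained dictionaries (made-up data).
run_a = [[0, 1, 0, 2, 0, 0], [3, 0, 0, 0, 1, 0]]
run_b = [[0, 0, 0, 0, 0, 1], [0, 2, 0, 4, 0, 0], [2.9, 0, 0.1, 0, 1.2, 0]]

print(round(median_max_correlation(run_a, run_b), 3))
```

Running the same function with neuron activations in place of `acts_b` gives the baseline the comparison is made against.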

This universality result is significant. It supports the interpretation that what the SAE recovers is, in some loose sense, the "true" feature basis of the model, and provides a basis for cross-model comparison work that came later (Sparse Crosscoders, persona vectors).

Key findings

The paper's contributions can be summarized as a small set of empirical claims.^[1]^[2]

A sparse autoencoder trained with an L1 sparsity penalty on the MLP activations of a one-layer transformer recovers thousands of features that are far more monosemantic than the underlying neurons.
With an 8x dictionary (4,096 features for a 512-neuron MLP), more than 4,000 interpretable features can be identified, covering scripts and languages, encoded data, code constructs, biology, legal text, lexical patterns, and many domain-specific contexts.
Features have clean computational proxies (regex, classifiers, external detectors) that track their activations across hundreds of millions of tokens, providing quantitative evidence for monosemanticity.
As the dictionary grows from 512 up to ~131,000 features, features split into more specialized siblings rather than fragmenting into noise, suggesting a hierarchical structure in the feature space.
Features are universal: independently trained SAEs on the same model and on different models recover many of the same features, with high feature-level correlations across runs.
Dead-feature resampling and an appropriate L1 coefficient yield dictionaries in which the great majority of features are alive and contribute to reconstruction.
Aggregated across the dictionary, automated interpretability scores rise sharply over the raw neuron baseline, and increase with dictionary size up to a regime governed by reconstruction error.

Taken together, these claims constitute one of the first end-to-end empirical demonstrations that the superposition hypothesis can be inverted with a practical method, on a real (if small) transformer.

What are the limitations of Towards Monosemanticity?

The authors are explicit about the limitations of the result.^[1]

The model studied is a one-layer transformer with a 512-neuron MLP, not a frontier language model. There is no guarantee a priori that the same method will work on deeper architectures, on attention rather than MLP activations, on the residual stream of a layer mid-stack, or on activations from much wider MLPs. The paper frames its result explicitly as a proof of concept, with scaling deferred to subsequent work.

A meaningful fraction of features at small dictionary sizes remain hard to interpret, or interpretable only with caveats. Even in the carefully studied A/1 dictionary, not every feature has a clean computational proxy; some are token-in-context features that fire only on a specific token in a particular set of contexts, and others are mixtures that are themselves a form of mild polysemanticity. Reconstruction is also not perfect. The SAE leaves some residual error that the language model would have used; downstream behaviour depends on this residual, and the paper acknowledges that the dictionary represents the MLP only approximately.

Dead features and the related phenomenon of ultra-low-density features (which fire only on a vanishing fraction of inputs) are addressed by the resampling procedure but not fully solved. The dictionary's effective size is therefore somewhat smaller than its nominal size. Finally, the paper does not yet integrate dictionary features into a full circuit-level account of model behaviour: it identifies what the features are, not how they are used to compute outputs. That is the question taken up in subsequent work on transcoders, crosscoders, and attribution graphs.

How did Towards Monosemanticity influence later research? Influence and follow-up

The publication of Towards Monosemanticity was a turning point for the practical mechanistic interpretability program. Within Anthropic and across the broader community, it triggered an unusually rapid sequence of follow-up papers.

How does Towards Monosemanticity relate to Scaling Monosemanticity (May 2024)?

About seven and a half months after Towards Monosemanticity, Anthropic released Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet on May 21, 2024.^[3] Templeton, Bricken, and many of the same authors trained much larger sparse autoencoders on residual-stream activations of Anthropic's mid-sized production model, Claude 3 Sonnet, using three dictionary sizes (roughly 1 million, 4 million, and 34 million features) and demonstrating monosemantic features for abstract concepts (the Golden Gate Bridge, computer code, deception, sycophancy, biological weapons) as well as concrete entities. The paper showed that the recipe from Towards Monosemanticity could be scaled by roughly four orders of magnitude (from thousands of features on a one-layer toy to tens of millions on a production model) with no fundamental change in methodology, directly addressing the major open question of the 2023 paper.

Circuit tracing and Attribution Graphs (March 2025)

In March 2025, Anthropic published On the Biology of a Large Language Model by Jack Lindsey and colleagues, alongside the methodology paper Circuit Tracing: Revealing Computational Graphs in Language Models.^[5] These papers introduced attribution graphs, a tool that uses dictionary features (via cross-layer transcoders) and Jacobian-style backward attribution to produce feature-level circuit diagrams for individual inputs to Claude 3.5 Haiku. The methodology is a direct descendant of Towards Monosemanticity: it treats the SAE dictionary as the substrate on which circuits live and shows how to compose features across layers into mechanistic explanations of model behaviour.

Transcoders and Crosscoders

A second strand of follow-up work refined the SAE recipe itself. Transcoders learn sparse mappings from one layer's activations to another, replacing dense MLP computations with interpretable sparse intermediates and making it possible to read off features that are computed by an MLP layer rather than just present at it. Sparse Crosscoders, published by Anthropic in October 2024, train a single dictionary across multiple layers (and even multiple models), giving a shared feature space in which cross-layer and cross-model alignment becomes a first-class object.^[7] Both of these are direct generalizations of the Towards Monosemanticity dictionary, and both rely on the same core empirical claim: that activations decompose sparsely into a learned overcomplete basis.

Persona Vectors (2025)

In 2025, Anthropic and collaborators published work on Persona Vectors: Monitoring and Controlling Character Traits in Language Models, extracting linear directions in activation space that correspond to character traits such as sycophancy, evil, hallucination tendency, optimism, impoliteness, apathy, and humor.^[10] The persona-vector paper explicitly connects each trait direction to finer-grained sets of SAE features, applying the Towards Monosemanticity dictionary view to a new domain (model character and safety) and using it to causally steer model behaviour by amplifying or suppressing features along the trait direction.

OpenAI Scaling and Evaluating Sparse Autoencoders (June 2024)

Outside Anthropic, OpenAI released Scaling and Evaluating Sparse Autoencoders by Leo Gao, Tom Dupre la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, and collaborators (arXiv:2406.04093).^[6] The paper trains k-sparse autoencoders (top-k activation rather than L1 penalty) on GPT-style activations, including a 16-million-feature dictionary on GPT-4 residual-stream activations trained on 40 billion tokens. It introduces a suite of evaluation metrics for SAE quality and explicitly cites Towards Monosemanticity as the conceptual basis of its method.
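The difference between a top-k activation and an L1 penalty is easiest to see in code. The sketch below is a minimal illustration, not OpenAI's implementation: it keeps the k largest pre-activations and zeroes the rest, so the number of active features is fixed by construction rather than encouraged by a penalty term. The function name and input values are hypothetical.

```python
def topk_activation(pre_acts, k):
    """Keep the k largest pre-activations and zero out the rest."""
    if k <= 0:
        return [0.0] * len(pre_acts)
    ranked = sorted(range(len(pre_acts)), key=lambda i: pre_acts[i], reverse=True)
    keep = set(ranked[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(pre_acts)]

pre_acts = [0.2, 1.5, -0.3, 0.9, 0.05]
print(topk_activation(pre_acts, k=2))   # [0.0, 1.5, 0.0, 0.9, 0.0]
```

Because sparsity is set directly through k, there is no lambda coefficient to tune against reconstruction error.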

Gemma Scope (August 2024)

Google DeepMind released Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2 by Tom Lieberum and colleagues in August 2024.^[11] Gemma Scope is a public release of more than 400 JumpReLU sparse autoencoders trained on every layer and sub-layer of Gemma 2 2B and 9B (and selected layers of 27B), totalling more than 30 million features. The project explicitly frames itself as an open-source counterpart to Anthropic's internal Scaling Monosemanticity work, intended to give the safety community a common substrate on which to study circuits and feature dynamics, and inherits its methodology from Towards Monosemanticity.
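For readers unfamiliar with the JumpReLU activation, a minimal sketch follows, contrasted with the plain ReLU used in Towards Monosemanticity. A pre-activation passes through unchanged only if it exceeds a per-feature threshold; in the real method the thresholds are learned, whereas the values here are made up and the function names are illustrative.

```python
def relu(pre_acts):
    """Standard ReLU: negative values become zero."""
    return [max(0.0, z) for z in pre_acts]

def jumprelu(pre_acts, thresholds):
    """Pass a pre-activation through unchanged if it exceeds its
    per-feature threshold, otherwise output zero."""
    return [z if z > theta else 0.0 for z, theta in zip(pre_acts, thresholds)]

pre_acts   = [0.05, 0.40, -0.20, 1.30]
thresholds = [0.10, 0.10, 0.10, 0.50]     # made-up values
print(relu(pre_acts))                      # [0.05, 0.4, 0.0, 1.3]
print(jumprelu(pre_acts, thresholds))      # [0.0, 0.4, 0.0, 1.3]
```

The small positive activation that ReLU lets through (0.05) is cut off by the threshold, which is how this activation removes weak, noisy firings without shrinking the strong ones.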

EleutherAI and SAEBench (2025)

EleutherAI and academic collaborators have built an ecosystem of open SAE training pipelines, model-weight releases, and benchmarks. SAEBench, a 2025 benchmark suite for sparse autoencoder quality, defines a battery of eight evaluation tasks spanning concept detection, feature disentanglement, downstream unlearning, and reconstruction fidelity, and releases more than 200 SAEs across eight architectures and training regimes.^[12] The benchmark explicitly addresses an open question highlighted by Towards Monosemanticity: how to compare SAE training methods quantitatively when human-judged interpretability does not always agree with cheaper unsupervised proxies. The papers proposing JumpReLU SAEs, top-k SAEs, Matryoshka SAEs, and gated SAEs are all part of this ecosystem.

Independent replications

Open-source replications of Towards Monosemanticity (for example, the shehper/sparse-dictionary-learning repository) have reproduced the central qualitative finding on similarly-sized one-layer transformers trained on OpenWebText, observing the same kinds of monosemantic features and similar dead-feature dynamics with neuron resampling.^[9] These replications established that the result is not an idiosyncrasy of Anthropic's proprietary infrastructure and accelerated adoption of the method in the wider community.

Why does Towards Monosemanticity matter? Significance for the mechanistic interpretability program

Within the larger mechanistic interpretability program, Towards Monosemanticity plays a foundational role analogous to the role of the original induction-head paper for circuits-style analyses. Where the induction heads line of work argued that one could identify discrete, named subcomponents inside attention layers, Towards Monosemanticity argued that one could identify discrete, named features inside MLP activations. Together, these two threads supply mechanistic interpretability with its two basic building blocks: heads / circuits, and features / dictionaries.

The paper also influenced the field's conceptual vocabulary. The dictionary-learning view treats the model's hidden states as sparse, overcomplete codes over a basis of learned features. Almost all subsequent SAE, transcoder, and crosscoder work, including Anthropic's Biology of a Large Language Model and OpenAI's GPT-4 SAEs, uses this framing. The notion of feature splitting, of feature universality, and of monosemantic features as the right unit of analysis are all now common currency.

More speculatively, Towards Monosemanticity changed the strategic outlook of safety-motivated interpretability research. Prior to October 2023, it was an open question whether the superposition hypothesis would ever be operationally useful: it provided a coherent theoretical story about why neurons are messy, but no constructive way to recover the underlying features. The paper's contribution is to convert that hypothesis into a practical engineering recipe, with a clear scaling story (bigger dictionaries, bigger models) and a clear empirical signature (more features, more splitting, persistent universality). This is what allowed Anthropic and others to plan multi-year programs around scaling sparse autoencoders, training circuit-tracing tools, and ultimately using feature-level interventions to debug, monitor, and steer frontier language models.

The October 2023 paper therefore occupies a somewhat unusual position: a result on a tiny model, written in a long-form blog format on a niche publication venue, that nonetheless reshaped the research agenda of nearly every major interpretability group in 2024 and 2025. The follow-up work has not refuted the central claim; rather, it has affirmed and extended it, scaling the dictionary from thousands of features on a one-layer toy to tens of millions of features on production frontier models, and turning isolated monosemantic features into the substrate for full feature-level circuit explanations.

References

[1] Bricken, T., Templeton, A., Batson, J., Chen, B., Jermyn, A., Conerly, T., Turner, N., Anil, C., Denison, C., Askell, A., Lasenby, R., Wu, Y., Kravec, S., Schiefer, N., Maxwell, T., Joseph, N., Hatfield-Dodds, Z., Tamkin, A., Nguyen, K., McLean, B., Burke, J. E., Hume, T., Carter, S., Henighan, T., Olah, C. *Towards Monosemanticity: Decomposing Language Models With Dictionary Learning*. Transformer Circuits Thread, October 5, 2023. https://transformer-circuits.pub/2023/monosemantic-features/index.html. Accessed 2026-06-24.
[2] Anthropic. *Decomposing Language Models Into Understandable Components*. Anthropic News, October 5, 2023. https://www.anthropic.com/news/decomposing-language-models-into-understandable-components. Accessed 2026-06-24.
[3] Templeton, A., Conerly, T., Marcus, J., Lindsey, J., Bricken, T., Chen, B., Pearce, A., Citro, C., Ameisen, E., Jones, A., Cunningham, H., Turner, N. L., McDougall, C., MacDiarmid, M., Tamkin, A., Durmus, E., Hume, T., Mosconi, F., Freeman, C. D., Sumers, T. R., Rees, E., Batson, J., Jermyn, A., Carter, S., Olah, C., Henighan, T. *Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet*. Transformer Circuits Thread, May 21, 2024. https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html. Accessed 2026-06-24.
[4] Anthropic. *Towards Monosemanticity*, AlignmentForum cross-post. https://www.alignmentforum.org/posts/TDqvQFks6TWutJEKu/towards-monosemanticity-decomposing-language-models-with. Accessed 2026-06-24.
[5] Lindsey, J., et al. *On the Biology of a Large Language Model*. Transformer Circuits Thread, March 2025. https://transformer-circuits.pub/2025/attribution-graphs/biology.html. Accessed 2026-06-24.
[6] Gao, L., Dupre la Tour, T., Tillman, H., Goh, G., Troll, R., Radford, A., Sutskever, I., Leike, J., Wu, J. *Scaling and Evaluating Sparse Autoencoders*. arXiv:2406.04093, June 6, 2024. https://arxiv.org/abs/2406.04093. Accessed 2026-06-24.
[7] Lindsey, J., Templeton, A., Marcus, J., Conerly, T., Batson, J., Olah, C. *Sparse Crosscoders for Cross-Layer Features and Model Diffing*. Transformer Circuits Thread, October 25, 2024. https://transformer-circuits.pub/2024/crosscoders/index.html. Accessed 2026-06-24.
[8] Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., Hatfield-Dodds, Z., Lasenby, R., Drain, D., Chen, C., Grosse, R., McCandlish, S., Kaplan, J., Amodei, D., Wattenberg, M., Olah, C. *Toy Models of Superposition*. Transformer Circuits Thread / arXiv:2209.10652, September 14, 2022. https://transformer-circuits.pub/2022/toy_model/index.html. Accessed 2026-06-24.
[9] shehper. *sparse-dictionary-learning: An Open Source Implementation of Anthropic's Paper "Towards Monosemanticity"*. GitHub repository, 2023-2024. https://github.com/shehper/sparse-dictionary-learning. Accessed 2026-06-24.
[10] Chen, R., et al. *Persona Vectors: Monitoring and Controlling Character Traits in Language Models*. arXiv:2507.21509, 2025. https://arxiv.org/abs/2507.21509. Accessed 2026-06-24.
[11] Lieberum, T., Rajamanoharan, S., Conmy, A., Smith, L., Sonnerat, N., Varma, V., Kramar, J., Dragan, A., Shah, R., Nanda, N. *Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2*. arXiv:2408.05147, August 9, 2024. https://arxiv.org/abs/2408.05147. Accessed 2026-06-24.
[12] Karvonen, A., Wright, B., Rager, C., Angell, R., Brinkmann, J., Smith, L., Mayne Verdu, C., Bushnaq, L., Goodfire team, Marks, S., Nanda, N. *SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability*. arXiv:2503.09532, March 2025. https://arxiv.org/abs/2503.09532. Accessed 2026-06-24.
