Monosemanticity
Last reviewed
May 24, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,212 words
Monosemanticity is the property of an internal feature or neuron in a neural network that responds to a single, human-interpretable concept rather than to a heterogeneous collection of unrelated inputs.[1][2] The term is the antonym of polysemanticity, the more common condition in which a single neuron fires for many unrelated patterns and resists clean explanation.[1][3] In the modern usage popularised by Anthropic's interpretability team, monosemanticity is the target of dictionary-learning techniques (notably the sparse autoencoder) that decompose the activations of a transformer into thousands or millions of latent directions, each of which is intended to track one concept across diverse contexts.[2][4][5] The pursuit of monosemantic features has become a central methodological goal of mechanistic interpretability, because monosemantic units are easier to label, easier to causally probe with interventions such as feature steering, and easier to compose into circuits.[2][4][6]
The distinction between monosemantic and polysemantic neurons was developed in a sequence of Distill articles produced by the Clarity team at OpenAI and later carried into Anthropic's research thread. The 2017 article "Feature Visualization" by Chris Olah, Alexander Mordvintsev, and Ludwig Schubert used activation-maximisation images to inspect what individual channels in convolutional image classifiers responded to.[7] The authors documented neurons that responded to a single coherent concept (for example, a particular animal or texture) and others whose activation-maximising image combined several unrelated motifs, and they used the latter observation to motivate later inquiry into whether interpretable units of analysis must be neurons at all.[7][8]
The 2020 article "Zoom In: An Introduction to Circuits" by Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter made the dichotomy explicit. The piece advanced three speculative claims: that features (not neurons) are the fundamental unit of neural networks, that features are connected by weights to form circuits, and that analogous features and circuits recur across models and tasks.[8] Olah and co-authors defined a pure or monosemantic feature as one that responds to a single latent variable, and a polysemantic neuron as one that responds to multiple unrelated inputs; they conjectured that polysemantic neurons "seem to result from" superposition, in which a circuit spreads a feature across many neurons in order to pack more features into a limited number of channels.[8] "Zoom In" framed polysemanticity as the principal obstacle to a circuits-style account of neural networks, since it prevents naming individual neurons cleanly. The article appeared on distill.pub in March 2020 and was the first instalment of a series called Circuits.[8]
This vocabulary was carried into the transformer era when many of the Clarity researchers, including Olah, joined Anthropic. Anthropic's interpretability programme adopted the framing of features-not-neurons and made the recovery of monosemantic features a central agenda.[2][4]
The principal explanation for why monosemanticity is rare among raw neurons is the superposition hypothesis, formalised in Anthropic's September 2022 paper "Toy Models of Superposition" by Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah.[3] The paper constructed deliberately simple toy networks in which the number of underlying generative features exceeded the number of available hidden dimensions, and showed that when those features were sparse, the network would store them in superposition: a set of features distributed across an overcomplete set of nearly-linear directions, rather than aligned one-to-one with neurons.[3]
The paper made three contributions that shaped the later research programme. It demonstrated a phase transition between dense, monosemantic representations (when features are common) and sparse, polysemantic ones (when features are rare).[3] It showed that the geometric arrangement of features in superposition tracks the geometry of uniform polytopes such as digons, triangles, pentagons, and tetrahedrons, with the specific configuration depending on the sparsity regime.[3] And it introduced a notion of feature dimensionality that quantifies what fraction of a hidden dimension is dedicated to a given feature.[3] The paper also drew a connection between superposition and adversarial examples, on the grounds that small perturbations along non-orthogonal feature directions can flip the readout of an entire bundle of superposed features.[3]
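The pentagon configuration mentioned above can be reproduced by hand. The sketch below is illustrative only: it packs five features into two hidden dimensions along evenly spaced directions and reads them back out with a ReLU and a negative bias. The weights and the bias are hand-set for the example (in the paper they are learned), and the names `pentagon_directions` and `toy_model` are hypothetical.

```python
import math

def pentagon_directions(n=5):
    """Hand-set feature directions: n unit vectors evenly spaced on a circle."""
    return [(math.cos(2 * math.pi * k / n), math.sin(2 * math.pi * k / n))
            for k in range(n)]

def toy_model(x, directions, bias):
    """Compress x into 2 hidden dims (h = W x), then read out ReLU(W^T h + bias)."""
    h = [sum(x[k] * d[i] for k, d in enumerate(directions)) for i in range(2)]
    return [max(0.0, d[0] * h[0] + d[1] * h[1] + bias) for d in directions]

directions = pentagon_directions()
# Assumption: the bias is chosen to cancel the positive overlap between
# neighbouring directions, cos(72 degrees). A trained model would learn it.
bias = -math.cos(2 * math.pi / 5)

sparse_input = [0.0, 0.0, 1.0, 0.0, 0.0]  # one feature active
dense_input = [1.0, 1.0, 1.0, 0.0, 0.0]   # three features active at once

sparse_out = toy_model(sparse_input, directions, bias)
dense_out = toy_model(dense_input, directions, bias)
print([round(v, 3) for v in sparse_out])
print([round(v, 3) for v in dense_out])
```

With a single active feature, only the matching output is non-zero (at a reduced scale), so the five features coexist in two dimensions. With three features active simultaneously, interference distorts the readout: the three equal inputs come back with unequal strengths. This is the sense in which superposition depends on sparsity.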
The practical implication of the superposition hypothesis is that monosemanticity is not a property one should expect raw neurons in a transformer to exhibit. If a model is representing more features than it has channels, then any individual channel will fire for whichever features happen to be assigned non-zero coefficients on it, and the assignment is determined by gradient descent rather than by human-relevant structure.[3][4] To recover a monosemantic decomposition, one needs to map the activations into a higher-dimensional space in which the features are aligned with individual coordinates. This is the core motivation for using a sparse autoencoder over raw activations.[4][5]
The October 2023 Anthropic paper "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning" by Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E. Burke, Tristan Hume, Shan Carter, Tom Henighan, and Chris Olah operationalised the superposition framework. The paper trained a small autoencoder with a sparsity penalty on the 512-dimensional MLP activations of a one-layer transformer and showed that the resulting overcomplete dictionary recovered thousands of directions that were substantially more monosemantic than the original neurons.[4]
The architecture is intentionally minimal. An encoder maps the model's activation vector into a much wider hidden space (in the headline run, 4,096 dimensions for a 512-neuron MLP, and in expanded runs up to 256 times the original width) and applies a ReLU non-linearity; a linear decoder is trained jointly to reconstruct the original activation from the resulting code.[4] The training loss combines a mean-squared-error reconstruction term with an L1 penalty on the hidden activations, the latter pushing the autoencoder to use as few non-zero entries as possible per input.[4] When trained successfully, only a small fraction of hidden units fire for any given input, and those units correspond to dictionary atoms with intuitive interpretations.[4]
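The forward pass and loss described in this paragraph can be sketched in a few lines. This is a minimal illustration with hand-set weights on a two-dimensional activation, not the paper's training code; the function names, the tiny dictionary, and the coefficient `l1_coeff` are assumptions made for the example.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def sae_forward(x, W_e, b_e, W_d, b_d):
    """Encoder with ReLU into a wider space, then a linear decoder."""
    f = [max(0.0, a + b) for a, b in zip(matvec(W_e, x), b_e)]
    x_hat = [a + b for a, b in zip(matvec(W_d, f), b_d)]
    return f, x_hat

def sae_loss(x, x_hat, f, l1_coeff):
    """Mean-squared reconstruction error plus an L1 penalty on the code."""
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    l1 = sum(abs(v) for v in f)
    return mse + l1_coeff * l1

# Hypothetical dictionary: 4 latents for a 2-dimensional activation.
W_e = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
b_e = [0.0, 0.0, 0.0, 0.0]
W_d = [[1.0, -1.0, 0.0, 0.0],
       [0.0, 0.0, 1.0, -1.0]]
b_d = [0.0, 0.0]

x = [1.0, -0.5]
f, x_hat = sae_forward(x, W_e, b_e, W_d, b_d)
loss = sae_loss(x, x_hat, f, l1_coeff=0.1)
print(f, x_hat, loss)
```

Here two of the four latents fire, the reconstruction is exact, and the whole loss comes from the sparsity term. In training, the weights are adjusted by gradient descent to trade the two terms off against each other.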
"Towards Monosemanticity" supplied four lines of evidence that the recovered features were monosemantic. First, the authors investigated a handful of specific features in detail (including a feature firing on Arabic script, a DNA-sequence feature, a base64 feature, and a Hebrew text feature) and verified that activations correlated tightly with surface tokens, that the feature's contribution to the model's output logits made sense, and that ablating the feature degraded the relevant behaviour.[4] Second, human raters labelled a random sample of features and judged the majority to be interpretable, with substantially higher interpretability scores than the original neurons.[2][4] Third, automated interpretability, in which a separate language model is given a feature's activating examples and asked to predict activations on held-out text, found that most features admitted concise descriptions whose predictions matched activation patterns.[4] Fourth, the same automated approach was applied to the logit-weight side of features rather than activation patterns, with similar results.[4]
Two empirical phenomena from this paper became standard vocabulary in the field. Feature splitting refers to the observation that as the autoencoder is made wider, a single feature in a small dictionary frequently resolves into multiple semantically related but distinguishable features in a larger one. The headline example was a base64 feature that split into three distinct base64-related features at higher widths, each tracking a finer-grained variant.[4] Universality refers to the finding that sparse autoencoders trained on different transformer language models recover largely overlapping feature dictionaries, with features more similar across models than they are to their own host model's neurons.[4] Both observations support the framing that monosemantic features track something about the data distribution, not idiosyncrasies of any particular initialisation.
A concurrent paper, "Sparse Autoencoders Find Highly Interpretable Features in Language Models" by Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey, posted to arXiv in September 2023, reached similar conclusions on a Pythia model and contributed to the rapid uptake of the technique across the interpretability community.[9]
The May 2024 paper "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet" by Adly Templeton and the Anthropic interpretability team demonstrated that the dictionary-learning approach worked at production scale, on Claude 3 Sonnet.[5] Three autoencoders of escalating size (with roughly 1 million, 4 million, and 34 million latents) were trained on activations from the middle-layer residual stream of the model.[5] In the largest run, about 12 million of the 34 million latents were alive in the sense that they activated on at least one example, with the remainder dead.[5]
The paper documented many features that activated on abstract, multimodal concepts independent of surface form. The Golden Gate Bridge feature activated on English mentions of the bridge, on translations into multiple languages including Japanese, Korean, and Russian, and on relevant images, indicating that the underlying representation tracks the concept rather than any particular token sequence.[5] A "code bugs" feature fired both on natural-language discussions of software defects and on actual buggy code, suggesting a unified representation across modalities.[5] Features were also found for safety-relevant categories such as bioweapon synthesis, deception and sycophancy, and code with security vulnerabilities, although the paper was careful to note that the existence of such features did not by itself establish that the model uses them in any particular harmful way.[5]
A central methodological contribution of "Scaling Monosemanticity" was the demonstration that activating features produced the predicted behavioural effects in the live model, a technique called feature steering or feature clamping.[5] When the Golden Gate Bridge feature was clamped to a large positive activation, the model's outputs across unrelated prompts began to reference the bridge: this was the basis of the public Golden Gate Claude demonstration.[10] Steering experiments on features for sycophancy, deception, and other categories produced analogous behavioural shifts, providing direct causal evidence that the recovered features were used by the underlying model rather than being epiphenomenal projections.[5]
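Feature clamping can be sketched as an edit to the autoencoder's code followed by a decode. The version below is an illustration under stated assumptions, not Anthropic's implementation: it keeps the autoencoder's reconstruction error fixed and changes only the clamped latent, and the dictionary, the function names, and the clamp value are all hypothetical.

```python
def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def encode(x, W_e, b_e):
    return [max(0.0, a + b) for a, b in zip(matvec(W_e, x), b_e)]

def decode(f, W_d, b_d):
    return [a + b for a, b in zip(matvec(W_d, f), b_d)]

def clamp_feature(x, index, value, W_e, b_e, W_d, b_d):
    """Return a steered activation with latent `index` forced to `value`."""
    f = encode(x, W_e, b_e)
    # Assumption: whatever the SAE fails to reconstruct is carried over unchanged.
    error = [a - b for a, b in zip(x, decode(f, W_d, b_d))]
    steered_code = list(f)
    steered_code[index] = value
    return [a + e for a, e in zip(decode(steered_code, W_d, b_d), error)]

# Hypothetical 3-latent dictionary over a 2-dimensional activation.
W_e = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
b_e = [0.0, 0.0, 0.0]
W_d = [[1.0, 0.0, -1.0],
       [0.0, 1.0, 0.0]]
b_d = [0.0, 0.0]

x = [0.2, -0.3]  # the second coordinate is not reconstructed by this SAE
steered = clamp_feature(x, index=1, value=2.0,
                        W_e=W_e, b_e=b_e, W_d=W_d, b_d=b_d)
print(steered)
```

The steered activation moves along the decoder direction of the clamped feature while the unreconstructed part of the original activation is preserved. In a live model the steered vector would be written back into the forward pass at the site where the autoencoder was trained.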
On 23 May 2024 Anthropic published a public chat interface, Golden Gate Claude, in which the Golden Gate Bridge feature in Claude 3 Sonnet was clamped to a high value for the duration of the conversation. The demo ran for approximately 24 hours.[10] Users who asked how to spend $10 received responses that recommended driving across the bridge and paying the toll; users who asked for a love story were told a tale of a car infatuated with its beloved bridge.[10] The demo was a deliberately whimsical illustration of the broader point that monosemantic features extracted by a sparse autoencoder map to causally efficacious internal variables, and that steering those variables produces predictable changes in model behaviour.[5][10] The example is widely cited as the moment when scaled SAE-based interpretability moved from a research curiosity to a tangible artifact non-specialists could interact with.
The standard SAE used to pursue monosemanticity is a single-hidden-layer architecture with a tied or untied decoder. Given an activation vector x at some site in a transformer (typically the residual stream or an MLP output), the encoder computes f = ReLU(W_e x + b_e), a sparse code in a higher-dimensional space, and the decoder reconstructs x_hat = W_d f + b_d.[4][5] The training objective combines a reconstruction loss with a sparsity penalty:
L = ||x - x_hat||_2^2 + lambda * ||f||_1
The L1 penalty on the latent code shrinks small activations toward zero and is the practical proxy for the L0 sparsity that one would prefer to use directly but cannot easily optimise.[4][9] The coefficient lambda controls the trade-off between reconstruction quality and code sparsity. Variants studied in the subsequent literature include TopK SAEs in which the k largest pre-activation values per input are kept and the rest zeroed out, and JumpReLU SAEs in which a learnable threshold replaces the ReLU.[11][12]
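The three activation rules named here differ only in how they decide which pre-activations survive. The sketch below places them side by side; it is a simplification (published implementations differ in details such as whether a ReLU is also applied after TopK, and JumpReLU thresholds are learned rather than fixed), and the threshold values are hypothetical.

```python
def relu(pre):
    """Standard SAE activation: keep every positive pre-activation."""
    return [max(0.0, z) for z in pre]

def topk(pre, k):
    """Keep the k largest pre-activations and zero out the rest."""
    ranked = sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)
    keep = set(ranked[:k])
    return [z if i in keep else 0.0 for i, z in enumerate(pre)]

def jump_relu(pre, thresholds):
    """Pass a pre-activation through unchanged only if it exceeds its threshold."""
    return [z if z > t else 0.0 for z, t in zip(pre, thresholds)]

pre = [0.05, 1.2, -0.3, 0.4, 0.9]
print(relu(pre))
print(topk(pre, k=2))
print(jump_relu(pre, thresholds=[0.5] * len(pre)))
```

On this input the ReLU leaves four latents active, including a weak 0.05 activation that an L1 penalty would merely shrink, whereas TopK and JumpReLU both reduce the code to the two strongest latents. TopK fixes the number of active latents directly, while JumpReLU lets it vary from input to input.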
In the resulting decomposition, each column of W_d is a feature direction in activation space, and the corresponding entry of f is the activation strength of that feature on the input. A monosemantic feature, in this formalism, is one whose direction in activation space tracks a single human-interpretable concept across the data distribution, and whose activation strength fires precisely when that concept is present.[4][5] The vast majority of the autoencoder's latents are zero on any given input, which is what makes the decomposition human-readable: explaining the model's behaviour on a token requires inspecting only the few features that actually fired.
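Reading an explanation off the decomposition amounts to listing the latents that fired and the vector each one contributes through its decoder column. The helper below is a hypothetical illustration of that bookkeeping; `active_features` and the example matrices are not taken from any published codebase.

```python
def active_features(f, W_d, top_n=3):
    """List (index, activation, contribution) for the strongest active latents.

    The contribution of latent i is its activation times column i of W_d.
    """
    fired = [(i, a) for i, a in enumerate(f) if a > 0.0]
    fired.sort(key=lambda pair: pair[1], reverse=True)
    report = []
    for i, a in fired[:top_n]:
        direction = [row[i] for row in W_d]  # column i of the decoder
        report.append((i, a, [a * w for w in direction]))
    return report

# Hypothetical decoder with 4 latents over a 2-dimensional activation.
W_d = [[1.0, 0.0, -1.0, 0.0],
       [0.0, 1.0, 0.0, -1.0]]
f = [0.0, 2.0, 0.5, 0.0]  # sparse code for one token

report = active_features(f, W_d)
for index, activation, contribution in report:
    print(index, activation, contribution)
```

Summing the listed contributions and adding the decoder bias gives the reconstruction x_hat, so for a sparse code the report is a complete account of what the autoencoder says about that token.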
The choice of insertion site matters. Early "Towards Monosemanticity" work targeted the MLP activations of a one-layer transformer.[4] "Scaling Monosemanticity" and most subsequent SAEs targeted the residual stream at a chosen layer of a multi-layer model, on the grounds that this is the model's working space and contains a superset of what individual MLPs write into.[5] Other variants train SAEs on attention outputs, on MLP intermediate activations, on logits, or jointly across multiple sites; the choice affects both what concepts the features capture and how the features compose into circuits.[11][13]
Reconstruction quality is typically measured by the fraction of the original activation's variance explained by the autoencoder, the loss in next-token prediction when the autoencoder's reconstruction is patched into the model in place of the original activations, and the average L0 (number of non-zero entries) of the latent code. In well-tuned setups for residual-stream activations at frontier model scale, autoencoders recover most of the variance and most of the model's downstream loss at L0 in the range of tens to low hundreds of active features per token, out of dictionaries containing millions of latents.[5][11][12] These numbers vary substantially with layer, site, and training regime, and there is no universally agreed sparsity-versus-reconstruction Pareto frontier; published reports give point estimates on specific configurations.
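Two of the three measurements above can be computed directly from stored activations and codes. The sketch below shows the fraction of variance explained and the mean L0 on a toy batch. The data are made up for illustration, and the third metric (downstream loss after patching) is omitted because it requires running the host model.

```python
def fraction_variance_explained(acts, recons):
    """1 - (squared reconstruction error) / (total variance around the mean)."""
    n, d = len(acts), len(acts[0])
    mean = [sum(a[j] for a in acts) / n for j in range(d)]
    resid = sum((a[j] - r[j]) ** 2
                for a, r in zip(acts, recons) for j in range(d))
    total = sum((a[j] - mean[j]) ** 2 for a in acts for j in range(d))
    return 1.0 - resid / total

def mean_l0(codes):
    """Average number of non-zero latents per input."""
    return sum(sum(1 for v in f if v != 0.0) for f in codes) / len(codes)

# Toy batch: four activations, their reconstructions, and their sparse codes.
acts = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
recons = [[0.9, 0.0], [0.0, 0.9], [-0.9, 0.0], [0.0, -0.9]]
codes = [[0.9, 0.0, 0.0, 0.0],
         [0.0, 0.9, 0.0, 0.0],
         [0.0, 0.0, 0.9, 0.1],
         [0.0, 0.0, 0.0, 0.9]]

fve = fraction_variance_explained(acts, recons)
l0 = mean_l0(codes)
print(fve, l0)
```

The two numbers pull against each other: a larger sparsity penalty lowers L0 but typically lowers the variance explained as well, which is why results are usually reported as pairs rather than as a single score.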
The SAE-based pursuit of monosemanticity was adopted across several labs in 2024.
OpenAI published "Scaling and evaluating sparse autoencoders" in June 2024, authored by Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, and Jeffrey Wu. The paper trained a 16 million latent autoencoder on residual-stream activations of GPT-4 using 40 billion tokens, introduced the TopK sparsity scheme as an alternative to L1, and proposed quantitative evaluation metrics for SAE features including downstream-effect sparsity, explanation precision, and recovery of hypothesised ground-truth features.[11] A companion blog post released the same day documented features at multiple semantic scales in GPT-4 activations.[14]
Google DeepMind released Gemma Scope in August 2024, an open suite of more than 400 JumpReLU sparse autoencoders covering all layers and sub-layers of Gemma 2 2B and 9B, plus selected layers of Gemma 2 27B. The release contained over 30 million learned features in aggregate and was accompanied by a research paper led by Tom Lieberum and colleagues.[12] The aim was to lower the cost barrier to monosemanticity research outside frontier labs by making a comprehensive pre-trained dictionary of features available for download. DeepMind also released Mishax, an internal interpretability tooling library, alongside the autoencoders.[12]
Anthropic continued its programme with circuit-tracing work that built on monosemantic features. The March 2025 paper "Circuit Tracing: Revealing Computational Graphs in Language Models" introduced cross-layer transcoders (CLTs), in which each feature reads from the residual stream at one layer and writes into the outputs of all subsequent MLP layers in the original model.[15] This replaced the model with a sparse, transcoder-based stand-in whose features remain interpretable while supporting attribution analysis of how features at different layers influence each other. The companion case-study paper "On the Biology of a Large Language Model" used the same tools to investigate behaviours in Claude 3.5 Sonnet and Claude 3.5 Haiku, drawing on monosemantic features throughout.[16] The library and an associated graph-exploration frontend were open-sourced in mid-2025.[15]
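The read-once, write-many structure of a cross-layer transcoder can be sketched as follows. This is a schematic under several assumptions rather than the published architecture: residual-stream vectors are supplied as fixed inputs, a feature is taken to write to its own layer's MLP output as well as to later ones, a plain ReLU is used, and all names and weights are hypothetical.

```python
def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def clt_forward(residuals, encoders, decoders):
    """Reconstruct each layer's MLP output from features read at earlier layers.

    residuals[l]     : residual-stream vector at layer l
    encoders[l]      : matrix mapping residuals[l] to the features of layer l
    decoders[(l, m)] : matrix mapping layer-l features to the MLP output at
                       layer m, defined for every m >= l
    """
    n_layers, d = len(residuals), len(residuals[0])
    acts = [[max(0.0, z) for z in matvec(encoders[l], residuals[l])]
            for l in range(n_layers)]
    outputs = []
    for m in range(n_layers):
        out = [0.0] * d
        for l in range(m + 1):  # every layer at or before m writes into m
            contrib = matvec(decoders[(l, m)], acts[l])
            out = [o + c for o, c in zip(out, contrib)]
        outputs.append(out)
    return acts, outputs

# Hypothetical two-layer example with one feature per layer.
residuals = [[2.0, 0.0], [0.0, 3.0]]
encoders = [[[1.0, 0.0]], [[0.0, 1.0]]]
decoders = {(0, 0): [[1.0], [0.0]],
            (0, 1): [[0.0], [0.5]],
            (1, 1): [[1.0], [0.0]]}

acts, outputs = clt_forward(residuals, encoders, decoders)
print(acts, outputs)
```

In the example, the layer-0 feature contributes to the reconstructed MLP output at both layers, which is what distinguishes a cross-layer transcoder from a separate autoencoder trained at each layer.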
A non-exhaustive list of further follow-up work includes "Sparse Autoencoders Enable Scalable and Reliable Circuit Identification in Language Models", which used SAEs as the substrate for finding circuits with attribution patching, and an expanding ecosystem of community SAE training runs hosted on platforms such as Neuronpedia.[17] By mid-2025 several open-source toolchains existed for training, hosting, and inspecting SAEs, and SAE-based features had been used as a substrate for tasks ranging from concept editing and knowledge localisation to debugging factual errors and probing for safety-relevant behaviours.[15][16][17]
The shift from raw-neuron analysis to feature-based analysis also altered which questions interpretability researchers asked. Pre-2023 work on transformer circuits, such as the induction-head analyses in Anthropic's earlier mechanistic interpretability papers, was forced to work with whatever atoms were available, which tended to be specific attention heads and identifiable circuits within attention rather than MLP-level features.[6] After "Towards Monosemanticity" the MLP and residual stream became tractable targets, and a typical 2024-25 interpretability paper begins by training or loading a relevant SAE and then performs analysis in the SAE's basis.[11][12][15]
The standard contrast to monosemanticity is polysemanticity. A polysemantic neuron is one whose top activating examples fall into multiple semantically distinct groups: in convolutional vision models, common examples include neurons that respond jointly to cat faces, the front of cars, and chair legs.[1][7] In transformer language models, polysemantic MLP neurons fire for combinations such as one writing system, a particular HTTP header, and an unrelated programming-language token.[4]
The interpretability case against polysemanticity is practical rather than principled. Polysemantic neurons cannot be given a single label without misleading the reader; their causal contribution to a downstream behaviour cannot be summarised by referring to that label; and they resist the construction of small, comprehensible circuit diagrams.[8] Under the superposition hypothesis, polysemanticity is the equilibrium that gradient descent reaches when sparse features outnumber the available channels, and so is to be expected rather than treated as a defect.[3] Monosemanticity in raw activations would imply that a model is using only as many features as it has channels, which is typically a wasteful allocation given the diversity of the input distribution.[3]
The recovery operation performed by an SAE can be read as a re-parameterisation: it preserves the model's behaviour (up to reconstruction error) while exposing its computation in an overcomplete basis in which most basis vectors are monosemantic.[4][5] The trade-off is that the new representation is wider, sparser, and not part of the running model; downstream interpretability analysis is performed on the SAE's latents rather than on raw activations.
Quantifying how monosemantic a given feature or dictionary actually is remains an active problem. Several lines of evaluation have been used.
Activation-pattern evaluations score how cleanly the inputs that maximally activate a feature share a single concept. In "Towards Monosemanticity" and "Scaling Monosemanticity", human raters scored sampled features on a clarity rubric, and an automated pipeline asked a separate language model to generate a candidate description from top activations and then predict activations on held-out text.[4][5] The latter approach, often called automated interpretability, yielded numerical interpretability scores that could be aggregated across thousands of features.[4]
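The scoring step of automated interpretability compares activations predicted from a description with the feature's real activations on held-out text. The sketch below uses Pearson correlation as the score, which is an assumption made for illustration (the text above does not specify the metric), and the predicted values are hand-written stand-ins for what an explainer model would produce.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# True activations of one feature on six held-out tokens.
true_acts = [0.0, 0.0, 3.0, 0.0, 5.0, 0.0]

# Hypothetical predictions simulated from two candidate descriptions.
good_prediction = [0.0, 0.0, 1.0, 0.0, 1.0, 0.0]  # fires on the right tokens
bad_prediction = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]   # fires on the wrong tokens

good_score = pearson(true_acts, good_prediction)
bad_score = pearson(true_acts, bad_prediction)
print(round(good_score, 3), round(bad_score, 3))
```

A description that picks out the right tokens scores close to 1 even when it misjudges activation magnitudes, while one that picks out the wrong tokens scores below zero. Aggregating such scores over many features gives the distribution-level summaries reported in the papers.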
Causal evaluations check whether interventions on a feature produce predictable behavioural changes. Feature steering, in which a feature's activation is clamped to an artificially high or low value during inference, is the canonical example; for a truly monosemantic feature, the resulting outputs should reflect the labelled concept disproportionately.[5][10] Subsequent work refined this with quantitative evaluation suites that measure both the targeted behavioural shift and unintended side effects on unrelated capabilities.[18]
Downstream-effect evaluations test whether the directions identified by the SAE matter for the model's computation. Metrics introduced by the OpenAI paper include the sparsity of downstream effects (how few features need to be intervened on to recover a given behaviour) and recovery of ground-truth features in synthetic settings.[11] The Gemma Scope release included benchmark tasks that test SAEs on probing, concept separability, and steering.[12]
No single metric is canonical, and reported numbers are sensitive to the choice of model, layer, dictionary size, sparsity coefficient, and dataset. The literature generally treats monosemanticity as a graded property and reports distributions of feature quality rather than binary yes/no judgements.[4][11][12]
Several open problems remained as of mid-2025.
Feature splitting and completeness. Because increasing the dictionary size repeatedly splits existing features into finer-grained ones, it is unclear how to choose a single "correct" level of granularity, or whether the feature hierarchy bottoms out.[4] Practical research uses multiple SAEs at different widths and treats the choice of granularity as task-dependent.
Dead and ultra-rare features. A substantial fraction of latents trained in large SAEs never activate on the training distribution (dead features), and many others activate only on a small handful of inputs (ultra-rare features). In "Scaling Monosemanticity", more than half of the 34M-latent dictionary's units were dead.[5] Whether dead features represent wasted capacity or are an artefact of L1 sparsity penalties is an active topic; TopK and JumpReLU variants were partly motivated by attempts to reduce the dead-feature rate.[11][12]
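Dead and ultra-rare latents are usually identified from firing counts gathered over a large sample of tokens. The sketch below classifies latents by firing frequency and then checks the arithmetic behind the figure quoted above. The counts are invented, and the frequency cut-off for "ultra-rare" is an arbitrary assumption, since the text gives no threshold.

```python
def classify_latents(fire_counts, n_tokens, rare_threshold=1e-6):
    """Return the fractions of dead and ultra-rare latents.

    dead       : never fired on the sample
    ultra-rare : fired, but on fewer than rare_threshold of the tokens
    """
    n = len(fire_counts)
    dead = sum(1 for c in fire_counts if c == 0)
    rare = sum(1 for c in fire_counts if 0 < c / n_tokens < rare_threshold)
    return {"dead": dead / n, "rare": rare / n}

# Invented firing counts for seven latents over ten million tokens.
fire_counts = [0, 0, 3, 1500, 0, 42000, 1]
stats = classify_latents(fire_counts, n_tokens=10_000_000)
print(stats)

# Approximate figures quoted in the article: about 12 million of the
# 34 million latents in the largest dictionary were alive.
dead_fraction_34m = 1 - 12 / 34
print(round(dead_fraction_34m, 2))
```

The second calculation gives a dead fraction of roughly 0.65 for the largest dictionary, consistent with the statement that more than half of its units were dead.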
Reconstruction error and faithfulness. No SAE perfectly reconstructs the activations it is trained on, and the residual error means that any circuit analysis performed in the SAE basis is approximate.[4][15] The cross-layer transcoders introduced in "Circuit Tracing" are partly an attempt to close this gap by integrating the dictionary-learning step with the model's forward computation.[15]
Evaluation of monosemanticity itself. Human ratings are expensive and noisy; automated interpretability is fast but relies on a separate language model that may share blind spots with the model being studied; causal evaluations measure behavioural effect but not whether a feature is "the right" decomposition.[4][11] A 2024 critique by an independent group, "Encourage or Inhibit Monosemanticity? Revisit Monosemanticity from a Feature Decorrelation Perspective", argued that pursuing monosemanticity per se may impair downstream task performance in some settings, and that high-quality decorrelation is a better intermediate target.[19]
Relation to safety guarantees. Monosemantic features make it easier to identify model components that activate on safety-relevant concepts such as deception or biothreat synthesis, but they do not by themselves provide guarantees about how the model uses those concepts. The literature is explicit that finding a "deception" feature is not the same as demonstrating that the model deceives, nor that steering away from that feature makes the model safe in any rigorous sense.[5][15]
Monosemanticity is not a separate research field from mechanistic interpretability but rather one of that field's central operational goals. Mechanistic interpretability aims to reverse-engineer the algorithms implemented inside trained neural networks, and the standard pipeline involves identifying interpretable features, identifying interpretable connections between features (circuits), and validating both with causal interventions.[6][8] Monosemanticity is the property that makes the first of those steps tractable: without a decomposition in which most features have intelligible labels, the subsequent circuit and intervention work has no atomic vocabulary to use.[4][6][15]
The dependency runs in both directions. Circuit work, in turn, is one of the main ways to validate that an SAE's features really do correspond to functional units of the model: a "feature" that participates in no circuit and produces no behavioural effect when steered is not obviously a feature of the model in any useful sense.[5][15] The 2025 circuit-tracing programme at Anthropic and the parallel ecosystem of community work on Gemma Scope and other open SAEs are best understood as the field's attempt to build out both halves of this loop at frontier-model scale.[12][15][16]