Polysemanticity

Updated Jun 24, 2026

Polysemanticity is the phenomenon in artificial neural networks in which a single neuron (or a comparable unit, such as a convolutional channel or an attention head) activates strongly for multiple, semantically unrelated inputs or concepts, which makes the network hard to interpret. The canonical example is a neuron in the InceptionV1 vision model, catalogued as unit 4e:55, that responds to cat faces, the fronts of cars, and cat legs.^[1] The term was popularized by the 2020 Distill "Circuits" thread, where Chris Olah and collaborators wrote that "neural networks often contain polysemantic neurons that respond to multiple unrelated inputs."^[1] Polysemanticity is widely regarded as one of the central obstacles to mechanistic interpretability, because it prevents researchers from treating individual neurons as the natural atoms of computation.

Polysemanticity is closely related to, but conceptually distinct from, superposition. Polysemanticity is the observation that a neuron fires on many things; superposition is a specific hypothesis about why that observation occurs, namely that networks pack more linearly represented features into a hidden layer than there are dimensions in that layer.^[2]^[3] Most modern work in interpretability treats superposition as the dominant explanation for polysemanticity in trained models, although other mechanisms (non-axis-aligned but still linear feature bases, non-linear feature codes) can in principle produce polysemantic neurons without superposition.^[3]

The term polysemanticity in this technical sense should not be confused with polysemy in linguistics, which refers to a single word having multiple related meanings. While there are conceptual analogies (and word-embedding work has used "superposition"-like language for polysemous words),^[4] the deep-learning usage refers to a property of internal computational units rather than a property of natural language.

Where does the term come from?

The phrase "polysemantic neuron" entered the interpretability literature through the Distill Circuits thread. The foundational article, "Zoom In: An Introduction to Circuits," was published on March 10, 2020 and authored by Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. It advances three "speculative claims" about the structure of neural networks: Features (features are the fundamental unit of neural networks), Circuits (features are connected by weights into circuits), and Universality (analogous features and circuits recur across models and tasks).^[1] Polysemantic neurons are introduced in that same article as a key challenge to the Features claim, because they show that individual neurons are not always clean feature units.^[1]

The Circuits authors had been studying InceptionV1, a 2014-era convolutional network trained on ImageNet, using feature visualization and dataset examples. Most neurons they examined were monosemantic, meaning a single coherent visual concept could explain the inputs that maximally activated them, for example a "curve detector," a "dog head detector," or a "high-low frequency detector." But a non-trivial fraction of neurons appeared to respond to multiple unrelated stimuli even after careful inspection. The Distill team named these "polysemantic neurons" and treated them as a major challenge to their broader research agenda. Their proposed working framework was that neural networks consist of meaningful "features" wired together into "circuits," and polysemanticity threatens both halves of that framework by undermining the assumption that neurons are clean feature units.^[1]

Three closely related Distill publications fleshed out the empirical picture during 2020 and 2021: "An Overview of Early Vision in InceptionV1," "Curve Detectors," and "Naturally Occurring Equivariance in Neural Networks." These papers documented many specific neurons in vivid detail, and most discussions of polysemanticity in the literature trace back to the catalog of examples those articles introduced.^[1]

The conceptual lineage of the term itself predates 2020: earlier neural-network interpretability work (notably the 2017 Distill article "Feature Visualization" by Chris Olah, Alexander Mordvintsev, and Ludwig Schubert) already noted that some neurons appeared to fire on unrelated images, though it did not coin the precise term.^[5] The 2020 Circuits paper made the phenomenon a named, central object of study.

What does polysemanticity look like in vision models?

The canonical example, repeated widely in subsequent work on the topic, is InceptionV1 unit 4e:55 (a channel in the mixed4e stage), which responds to cat faces, the fronts of cars, and cat legs. The Circuits authors used optimization-based feature visualization to confirm that this neuron is not picking up some subtle shared visual structure: separate visualizations show it looking specifically for the eyes and whiskers of a cat in one mode, for furry legs in another, and for the shiny grilles and headlight regions of automobile fronts in a third.^[1]

Other documented polysemantic neurons in InceptionV1 include channels that mix dog-head features with car features, channels that mix multiple animal categories, and various edge-case detectors in early layers that combine textures or frequencies. Subsequent reanalyses of InceptionV1 using sparse autoencoder methods, beginning around 2024, have shown that many channels classified as polysemantic in the original Circuits work indeed decompose cleanly into multiple monosemantic features in a learned overcomplete basis, providing direct empirical support for the superposition explanation in the vision setting.^[6]

The visual-domain examples remain pedagogically important because feature visualization (gradient-based optimization on the input pixels to maximize a unit's activation) provides a fairly direct window onto what a neuron "looks for." In language models this kind of intuitive visualization is harder, which historically made polysemanticity in transformers more difficult to study before the development of sparse-autoencoder tools.

How common is polysemanticity in language models?

Polysemanticity has been documented to be especially severe in transformer language models. In their 2022 work "Softmax Linear Units," Nelson Elhage, Neel Nanda, Chris Olah, and collaborators at Anthropic reported that for standard transformer MLPs, most neurons in middle layers respond to several apparently unrelated text features, and that this state of affairs is the norm rather than the exception.^[7]

Concrete examples reported in the early Anthropic work include MLP neurons that activate on tokens within Python code, on French-language tokens, and on certain types of numeric tokens; neurons that mix a syntactic role (such as "after a possessive") with a topical one (such as religion); and neurons that activate both for proper nouns of a particular language and for unrelated formatting tokens. Later work using sparse autoencoders on production language models, including Claude 3 Sonnet, identified individual features (not raw neurons) corresponding to highly specific concepts such as the Golden Gate Bridge, code with security vulnerabilities, deceptive behaviors, and the concept of "inner conflict," along with millions of others, which do not correspond to individual neurons.^[8]

A frequently cited intuition in this space is that neurons in transformers are not the right unit of analysis. Polysemanticity is part of the empirical case for that claim. The other part is the observation that features, defined as sparse, approximately linear directions in activation space, do appear to be a much better unit of analysis once they can be extracted.^[9]^[8]

Why does polysemanticity emerge? The capacity argument

The most influential theoretical account of polysemanticity is the capacity argument developed by Anthropic in the 2022 paper "Toy Models of Superposition," authored by Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah.^[2] The paper presents "a toy model where polysemanticity can be fully understood, arising as a result of models storing additional sparse features in superposition."^[2]

The setup is deliberately minimal: a small ReLU autoencoder is trained to reconstruct synthetic input vectors whose components ("features") have controlled importance and controlled sparsity (probability of being nonzero on any given example). The bottleneck has fewer dimensions than the input vector has features, so something has to give. The paper studies what the network learns as the sparsity and importance parameters are varied.^[2]
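The setup described above can be sketched in a few lines of NumPy. This is an illustrative sketch only: the tied-weight form `x_hat = ReLU(W^T W x + b)`, the uniform distribution of active feature values, the geometric importance schedule, and every function name are assumptions made for the example rather than details stated in this article. Training is omitted; only data generation, the forward pass, and the importance-weighted loss are shown.

```python
import numpy as np

def sample_features(n_samples, n_features, sparsity, rng):
    """Synthetic inputs: each feature is zero with probability `sparsity`,
    otherwise drawn uniformly from [0, 1) (an assumed value distribution)."""
    values = rng.uniform(0.0, 1.0, size=(n_samples, n_features))
    active = rng.uniform(size=(n_samples, n_features)) >= sparsity
    return values * active

def toy_forward(x, W, b):
    """Compress n_features -> n_hidden with W, decompress with W^T, apply ReLU.
    x: (batch, n_features), W: (n_hidden, n_features), b: (n_features,)."""
    hidden = x @ W.T
    return np.maximum(hidden @ W + b, 0.0)

def weighted_mse(x, x_hat, importance):
    """Reconstruction error, with each feature's error scaled by its importance."""
    return float(np.mean(np.sum(importance * (x - x_hat) ** 2, axis=1)))

rng = np.random.default_rng(0)
n_features, n_hidden = 5, 2                 # bottleneck: fewer dimensions than features
x = sample_features(1024, n_features, sparsity=0.9, rng=rng)
W = rng.normal(scale=0.5, size=(n_hidden, n_features))   # untrained weights
b = np.zeros(n_features)
importance = 0.9 ** np.arange(n_features)   # assumed geometric decay in importance
loss = weighted_mse(x, toy_forward(x, W, b), importance)
```

Sweeping `sparsity` and `importance` while minimizing this loss over `W` and `b` is the kind of experiment the paragraph describes.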

The key findings of "Toy Models of Superposition" are:

When features are dense (rarely zero), the network learns an orthogonal basis of the most important features and ignores the rest. Each hidden dimension corresponds to one feature; the model is monosemantic but cannot represent everything.^[2]
As features become sparser, the network increasingly represents more features than it has dimensions by placing them along almost-orthogonal directions. This is superposition. The hidden representation is then polysemantic: each neuron lights up on multiple features.^[2]
The transition is sharp. As sparsity is varied, the number of represented features increases in a non-smooth, sometimes phase-transition-like manner. The geometric arrangements of represented features form regular polytopes (digons, triangles, pentagons, tetrahedra) at certain sparsity regimes.^[2]
Superposition trades capacity for interference. Almost-orthogonal directions are not exactly orthogonal, so reading off any one feature picks up small amounts of others. The ReLU non-linearity is essential because it filters out small (interference) activations while preserving large (true feature) ones.^[2]

The high-dimensional fact underlying the argument is that a space of dimension d contains at most d mutually orthogonal nonzero vectors, but a number of approximately orthogonal vectors (pairwise cosine similarity bounded by a small constant) that grows exponentially in d. This is closely related to the Johnson-Lindenstrauss lemma. A network can therefore "fit" more sparse features into d dimensions than would naively appear possible, paying a cost only in interference noise.^[2]^[10]
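A small numerical check makes the geometric point concrete. The sketch below (assuming NumPy; the sizes and helper names are arbitrary choices for illustration) draws four times as many random unit vectors as there are dimensions and measures the worst pairwise overlap. It illustrates near-orthogonality for one random sample; it does not demonstrate the exponential bound itself.

```python
import numpy as np

def random_unit_vectors(n, d, rng):
    """Draw n directions uniformly at random on the unit sphere in R^d."""
    v = rng.normal(size=(n, d))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def max_abs_cosine(vectors):
    """Largest |cosine similarity| between any two distinct unit vectors."""
    gram = vectors @ vectors.T
    np.fill_diagonal(gram, 0.0)      # ignore each vector's similarity with itself
    return float(np.max(np.abs(gram)))

rng = np.random.default_rng(0)
d, n = 256, 1024                     # four times as many directions as dimensions
vectors = random_unit_vectors(n, d, rng)
worst_overlap = max_abs_cosine(vectors)
rank = np.linalg.matrix_rank(vectors)   # at most d: the vectors cannot all be orthogonal
```

The rank stays capped at `d`, yet every pair of the `n` directions overlaps only modestly, which is the slack that superposition exploits.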

The capacity argument explains polysemanticity as the geometric byproduct of superposition: because each feature gets assigned a direction that is some combination of standard-basis neurons rather than aligned with a single neuron, any one neuron is on the support of many features. Reading out one neuron's activations therefore aggregates contributions from all the features whose direction has a nonzero coordinate at that neuron, producing the observed multi-concept response patterns.^[2]

The "Toy Models" paper also draws a surprising connection between superposition and adversarial examples: the same interference that lets the model represent many features also exposes it to small input perturbations that flip its predictions.^[2] Subsequent work has explored these connections more systematically.

A complementary capacity-based analysis was given in "Polysemanticity and Capacity in Neural Networks" (Scherlis, Sachan, Jermyn, Benton, Shlegeris, 2022), which formalized the idea that polysemanticity emerges naturally when the number of features the model "wants" to represent exceeds the available neurons, and quantified the trade-off.^[11]

How does polysemanticity differ from superposition?

The relationship between polysemanticity and superposition is central enough to deserve its own treatment, because the two terms are often conflated in informal writing.

Polysemanticity is an observation about individual neurons: when you look at the set of inputs that maximally activate a given neuron, you find unrelated concepts. It is a property of the standard basis (the basis of "individual neurons") of the activation space.^[3]

Superposition is a hypothesis about the structure of the representation: the network encodes more sparse features than the layer has dimensions, with each feature assigned to an approximately orthogonal direction that is generally not aligned with any single neuron axis.^[2]

The implication runs strictly one way. Superposition implies polysemanticity: if there are more features than neurons, the features cannot all be axis-aligned, so at least some neurons must respond to multiple features. The converse does not hold. Polysemanticity can in principle arise without superposition: a network might encode the same number of features as neurons (or fewer), but along directions that are not aligned with the standard basis. Such a model would have a clean monosemantic feature basis, just not the neuron basis. In that case there is no "more features than dimensions" pressure, only a rotation.^[3]
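The two cases can be contrasted numerically. In the sketch below (assuming NumPy; the dimensions and variable names are illustrative), the first case stores d features along a rotated orthonormal basis, so a single change of basis recovers them exactly. The second case stores more features than dimensions, so the feature-to-activation map has a nontrivial null space and no linear map can invert it for arbitrary inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4

# Case 1: d features in d dimensions, stored along a rotated orthonormal basis.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))    # columns are feature directions
features = rng.uniform(size=(8, d))             # 8 example feature vectors
activations = features @ Q.T                    # in the neuron basis, neurons mix features
recovered = activations @ Q                     # rotating back undoes the mixing exactly

# Case 2: more features than dimensions (superposition).
n = 6
W = rng.normal(size=(d, n))
W /= np.linalg.norm(W, axis=0, keepdims=True)   # unit-norm feature directions
_, _, Vt = np.linalg.svd(W)                     # full SVD: Vt has shape (n, n)
null_direction = Vt[-1]                         # a feature-space direction that W erases
collision = W @ null_direction                  # approximately the zero vector
```

In the second case, feature vectors `f` and `f + null_direction` produce the same activations, which is why recovery has to lean on sparsity and an overcomplete dictionary rather than on a rotation.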

This distinction was drawn explicitly by Lawrence Chan in a 2023 alignment-research note, "Superposition is not 'just' neuron polysemanticity," which catalogs three ways a network can be polysemantic without being in superposition: non-neuron-aligned but still orthogonal feature bases, non-linear feature representations, and compositional codes.^[3]

For research purposes the distinction matters because the two phenomena suggest different interventions. If polysemanticity were due only to misaligned but orthogonal features, then a learned change of basis would suffice to recover monosemanticity, and the basis would have exactly as many features as neurons. If superposition is at work, by contrast, then any monosemantic decomposition will require an overcomplete basis (more features than neurons), which is what makes sparse autoencoder methods natural tools.

The current empirical consensus in mechanistic interpretability, supported by the success of overcomplete sparse-autoencoder dictionaries at scale, is that superposition is the dominant cause of polysemanticity in modern trained language models.^[9]^[8]

How is polysemanticity addressed?

Multiple families of methods have been proposed to recover interpretable, ideally monosemantic, units from networks that exhibit polysemanticity. They divide roughly into post-hoc decomposition methods and training-time interventions.

Sparse autoencoders and dictionary learning

The most influential family of methods is dictionary learning with sparse autoencoders. The idea is to train a wide autoencoder on the activations of a fixed, frozen pretrained model. The autoencoder has more hidden units than the original layer (an overcomplete dictionary) and is regularized to produce sparse codes, typically via an L1 or top-k constraint on the hidden activations. The hidden units of the autoencoder are interpreted as candidate features.^[9]
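As a rough illustration of the architecture just described, the following NumPy sketch implements a forward pass and loss for a sparse autoencoder with an optional top-k constraint. The parameter names, the choice to subtract the decoder bias before encoding, and the random stand-in activations are assumptions made for the example; published implementations differ in details, and training (gradient updates, decoder normalization) is omitted.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec, k=None):
    """x: (batch, d_model); W_enc: (d_model, n_features); W_dec: (n_features, d_model).
    Returns (codes, reconstruction). If k is given, keep roughly the k largest
    codes per row (exact ties at the threshold are all kept)."""
    codes = np.maximum((x - b_dec) @ W_enc + b_enc, 0.0)   # non-negative activations
    if k is not None:
        # Zero everything below each row's k-th largest activation.
        kth_largest = np.partition(codes, -k, axis=1)[:, -k][:, None]
        codes = np.where(codes >= kth_largest, codes, 0.0)
    return codes, codes @ W_dec + b_dec

def sae_loss(x, codes, x_hat, l1_coeff):
    """Mean squared reconstruction error plus an L1 sparsity penalty on the codes."""
    mse = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    l1 = np.mean(np.sum(np.abs(codes), axis=1))
    return float(mse + l1_coeff * l1)

rng = np.random.default_rng(0)
d_model, n_features = 16, 64                # overcomplete: 4x more features than dims
x = rng.normal(size=(32, d_model))          # stand-in for frozen model activations
W_enc = rng.normal(scale=0.1, size=(d_model, n_features))
W_dec = rng.normal(scale=0.1, size=(n_features, d_model))
b_enc, b_dec = np.zeros(n_features), np.zeros(d_model)
codes, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec, k=8)
loss = sae_loss(x, codes, x_hat, l1_coeff=1e-3)
```

Each column of `codes` plays the role of one candidate feature, and the matching row of `W_dec` is that feature's direction in the original activation space.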

The breakthrough demonstration for language models was Anthropic's "Towards Monosemanticity: Decomposing Language Models with Dictionary Learning," published October 5, 2023, with first author Trenton Bricken and co-authors including Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nicholas L. Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E. Burke, Tristan Hume, Shan Carter, Tom Henighan, and Christopher Olah.^[9]

The team trained sparse autoencoders on the MLP activations of a one-layer transformer language model, using roughly 8 billion activation vectors and autoencoder widths ranging from 512 up to about 131,000 features. With a sufficiently wide dictionary, the recovered features were dramatically more monosemantic than the underlying neurons. From a layer with just 512 neurons they extracted dictionaries of more than 4,000 features representing distinct concepts including DNA sequences, legal language, HTTP requests, Hebrew text, nutrition statements, and many more. Most of these properties were invisible at the neuron level.^[9]

This work was followed in May 2024 by "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet" by Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L. Turner, Callum McDougall, Monte MacDiarmid, Alex Tamkin, Esin Durmus, Tristan Hume, Francesco Mosconi, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, and Tom Henighan. They trained three sparse autoencoders, with roughly 1 million, 4 million, and 34 million features, on the middle-layer residual stream of Claude 3 Sonnet (Anthropic's production model released March 4, 2024); in the largest run only about 12 million of the 34 million features were "alive" (ever activated). Many features were highly abstract, multilingual, and multimodal, and the team identified features for concepts ranging from the Golden Gate Bridge to deceptive behaviors and to safety-relevant categories such as sycophancy and bias.^[8] Anthropic summarized the result by writing that "the features we found in Sonnet have a depth, breadth, and abstraction reflecting Sonnet's advanced capabilities."^[8]

A striking demonstration of feature directions was Golden Gate Claude: when the Golden Gate Bridge feature was clamped to roughly 10 times its maximum natural activation value, Claude 3 Sonnet began to self-identify with the landmark, answering "I am the Golden Gate Bridge, a famous suspension bridge that spans the San Francisco Bay" to unrelated questions. The same feature fired on English, Japanese, Korean, and Russian text and on images of the bridge, illustrating that a single learned direction can capture a concept independent of language or modality.^[8]
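One simple way to picture clamping is as a shift of the activation vector along the feature's decoder direction. The sketch below (assuming NumPy) is a schematic of that idea, not necessarily the procedure used in the cited work; the direction, the current activation, and the maximum natural activation are all hypothetical values chosen for illustration.

```python
import numpy as np

def clamp_feature(x, direction, current_activation, target_activation):
    """Shift activation vector x along a feature's decoder direction so that the
    feature's contribution moves from current_activation to target_activation."""
    return x + (target_activation - current_activation) * direction

rng = np.random.default_rng(0)
d_model = 16
direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)   # unit-norm decoder direction (assumed)
x = rng.normal(size=d_model)             # stand-in for a residual-stream vector
current = 0.5                            # hypothetical activation reported by the SAE
max_natural = 2.0                        # hypothetical maximum seen on real data
steered = clamp_feature(x, direction, current, 10.0 * max_natural)
```

Everything orthogonal to `direction` is left unchanged, which is what makes the intervention targeted when the direction tracks a single concept.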

The success of this approach is the strongest empirical evidence that polysemanticity in language models is driven by superposition: an overcomplete dictionary representing many more features than the layer has neurons is exactly what the superposition hypothesis predicts will be needed.

Training-time interventions: SoLU and engineering monosemanticity

A separate research thread asks whether the training process itself can be modified to produce more monosemantic models in the first place, rather than decomposing polysemantic ones after the fact.

Anthropic's "Softmax Linear Units" paper, published June 27, 2022 by Nelson Elhage, Tristan Hume, Catherine Olsson, Neel Nanda, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah, replaced the standard MLP activation function with SoLU(x) = x * softmax(x). The softmax encourages sparsity and discourages neurons from being equally active for many distinct features. In randomized, blinded human ratings, the authors reported that this change raised the fraction of MLP neurons judged interpretable from about 35 percent to around 60 percent, at little or no cost in language modeling loss when combined with an extra LayerNorm.^[7]
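The activation function itself is short enough to write out. The following NumPy sketch implements SoLU(x) = x * softmax(x) over the neuron axis; the extra LayerNorm mentioned above is omitted, and the example input is arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    shifted = x - np.max(x, axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(shifted)
    return e / np.sum(e, axis=axis, keepdims=True)

def solu(x, axis=-1):
    """SoLU(x) = x * softmax(x), with the softmax taken over the neuron axis."""
    return x * softmax(x, axis=axis)

pre_activations = np.array([10.0, 0.0, 1.0, -2.0])
out = solu(pre_activations)
```

When one pre-activation dominates, as in this example, it passes through almost unchanged while the others are scaled toward zero, which is the sparsifying pressure the paragraph describes.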

However, the SoLU work also produced an important caveat: the authors found evidence that SoLU was sometimes hiding superposition rather than eliminating it, noting that "SoLU may 'hide' some non-neuron-aligned features by decreasing their magnitude and then later recovering it with LayerNorm."^[7] This was one of the first concrete pieces of evidence that the superposition phenomenon was real and not just a side effect of a poorly chosen basis.^[7]

A complementary line of work in late 2022 by Adam Jermyn, Nicholas Schiefer, and Evan Hubinger ("Engineering Monosemanticity in Toy Models," November 2022) showed that, in toy models, the local minimum found during training has a large effect on monosemanticity. Different local minima with similar loss can have very different fractions of monosemantic neurons, and biased initialization plus negative biases can steer the model toward more monosemantic solutions. Adding extra width to the model also helped, consistent with the capacity-pressure account.^[12]

Sparsity priors and structural interventions

Beyond sparse autoencoders and activation-function changes, several lines of work have explored explicit sparsity priors during training, weight decay on dense connections, and architectural modifications that allocate dedicated capacity to specific tasks. Mixture-of-experts and sparse routing models can be read as moving in this direction in spirit, although they were not designed with monosemanticity as a goal.

A practical observation across these efforts is that training-time and architectural interventions require retraining and can trade off against model performance, whereas post-hoc methods such as sparse autoencoders can be applied to existing strong models as they are. This is a large part of why post-hoc decomposition has become the preferred tool.

Why does polysemanticity matter for interpretability and AI safety?

Polysemanticity has been described, by Olah and others, as one of the most important obstacles to the project of mechanistic interpretability: the attempt to reverse-engineer the algorithms implemented by trained neural networks into human-understandable form.^[1]

The reason is roughly combinatorial. If a layer has N polysemantic neurons each representing, say, five distinct features, then a downstream layer reading those activations effectively confronts not N inputs but 5N entangled signals. A circuit linking two such layers does not have N x N possible "wires" to analyze, but on the order of (5N) x (5N) = 25 N^2 effective connections, each of which may or may not be load-bearing. Manual circuit analysis at that scale is prohibitive.^[1]

Polysemanticity also undermines simple causal interventions. Ablating a polysemantic neuron damages multiple features at once, so behavioral changes after ablation cannot be cleanly attributed to a single concept. This complicates standard interpretability techniques such as activation patching, causal scrubbing, and direct attribution, all of which work most cleanly when units of analysis correspond to single concepts.
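The effect of ablating a polysemantic neuron can be seen in a small linear example. The sketch below (assuming NumPy) uses random unit-norm feature directions and a simple linear readout `W.T @ activations`; both are illustrative assumptions rather than a model of any particular network.

```python
import numpy as np

rng = np.random.default_rng(0)
n_neurons, n_features = 4, 8
W = rng.normal(size=(n_neurons, n_features))
W /= np.linalg.norm(W, axis=0, keepdims=True)   # column i: direction of feature i

f = np.zeros(n_features)
f[2] = 1.0                                      # only feature 2 is active
activations = W @ f                             # what the neurons actually show

neuron = 0
ablated = activations.copy()
ablated[neuron] = 0.0                           # zero-ablate a single neuron

# Linear readout of every feature before and after the ablation.
readout_change = W.T @ ablated - W.T @ activations
features_affected = int(np.sum(np.abs(readout_change) > 1e-12))
```

The change in each readout equals `-activations[neuron] * W[neuron, :]`, so every feature whose direction has a nonzero coordinate at the ablated neuron is disturbed, not only the feature that was active.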

For AI safety the stakes are concrete: a clean, monosemantic decomposition is what lets researchers search a model for safety-relevant internal concepts (deception, sycophancy, dangerous-capability knowledge) and verify whether a behavior is present, rather than inferring it only from outputs. The "Scaling Monosemanticity" work explicitly surfaced features for deception, bias, and dangerous content in a production model, which is the kind of audit that polysemanticity would otherwise block.^[8]

Once polysemanticity is addressed, by switching the unit of analysis from neurons to features extracted via sparse autoencoders, mechanistic interpretability becomes much more tractable. Analyses that work with features or other learned directions, rather than raw neurons, have produced clean accounts of indirect object identification, induction heads, modular arithmetic circuits, refusal behaviors, and much more in transformer language models. They have also enabled activation steering methods, which manipulate model behavior by adjusting the activation along a single feature direction; such steering is much more interpretable and surgical than steering along a polysemantic neuron, because a feature direction is intended to correspond to a single concept.

Recent advances (2024-2026)

Since the publication of "Scaling Monosemanticity" in May 2024, polysemanticity has remained an active research focus, with progress along several fronts.

Scaling and architecture of sparse autoencoders. Multiple groups have explored top-k SAEs, jumpReLU SAEs, gated SAEs, switch SAEs, and other variants intended to address pathologies of standard L1-regularized autoencoders such as feature shrinkage and dead features. These improvements have allowed sparse autoencoders to be trained on larger residual streams and across more layers efficiently.^[13]

Attribution graphs and replacement models. In March 2025, Anthropic published "On the Biology of a Large Language Model" by Jack Lindsey, Wes Gurnee, Emmanuel Ameisen, Brian Chen, Adam Pearce, Nicholas L. Turner, Craig Citro, and collaborators, introducing attribution graphs as a method to trace computation through a sparse, interpretable replacement model. The replacement uses a cross-layer transcoder (CLT) trained to reconstruct the network's behavior using features rather than raw polysemantic neurons; the example reported in the paper used a CLT with on the order of 30 million features across all layers of Claude 3.5 Haiku. The resulting graphs reveal multi-step reasoning chains, planning behaviors, and other circuit-level mechanisms that would be invisible at the neuron level due to polysemanticity.^[14]

Open-source circuit tracing. Following the Anthropic attribution-graphs work, multiple groups have re-implemented and extended the techniques, including an open Circuit Tracer library and related infrastructure released through the Anthropic Fellows program. Public dashboards such as Neuronpedia have made millions of extracted features browsable, lowering the barrier to entry for feature-level interpretability research.^[14]

Theoretical work on superposition and computation in superposition. Multiple papers in 2024-2026 have studied the capacity of linear-representation networks under the superposition hypothesis, formal bounds on how many features can be stored, and the computational complexity of operations carried out on representations in superposition. This line of work has begun to clarify the conditions under which polysemanticity should be expected to arise and the limits of what sparse-autoencoder extraction can recover.^[15]

The overall trajectory is from polysemanticity being identified as a major obstacle in 2020, to being explained mechanistically by superposition in 2022, to being addressed empirically by sparse-autoencoder decomposition at scale in 2023-2024, to being routinely worked around in modern circuit-level analysis from 2025 onward. It remains an active research area: open questions include how to handle features that are themselves represented non-linearly, how to detect "missed" features that no sparse autoencoder has yet found, and how the picture changes for very large models with many layers in superposition simultaneously.

References

1. Olah, Chris; Cammarata, Nick; Schubert, Ludwig; Goh, Gabriel; Petrov, Michael; and Carter, Shan. "Zoom In: An Introduction to Circuits." *Distill*, March 10, 2020. https://distill.pub/2020/circuits/zoom-in/ Accessed 2026-06-24.
2. Elhage, Nelson; Hume, Tristan; Olsson, Catherine; Schiefer, Nicholas; Henighan, Tom; Kravec, Shauna; Hatfield-Dodds, Zac; Lasenby, Robert; Drain, Dawn; Chen, Carol; Grosse, Roger; McCandlish, Sam; Kaplan, Jared; Amodei, Dario; Wattenberg, Martin; and Olah, Christopher. "Toy Models of Superposition." *Transformer Circuits Thread*, Anthropic, September 21, 2022. https://transformer-circuits.pub/2022/toy_model/index.html Also available as arXiv:2209.10652. Accessed 2026-06-24.
3. Chan, Lawrence. "Superposition is not 'just' neuron polysemanticity." *AI Alignment Forum / LessWrong*, 2023. https://www.alignmentforum.org/posts/8EyCQKuWo6swZpagS/superposition-is-not-just-neuron-polysemanticity Accessed 2026-06-24.
4. Arora, Sanjeev; Li, Yuanzhi; Liang, Yingyu; Ma, Tengyu; and Risteski, Andrej. "Linear Algebraic Structure of Word Senses, with Applications to Polysemy." *Transactions of the Association for Computational Linguistics*, 2018. arXiv:1601.03764. https://arxiv.org/abs/1601.03764 Accessed 2026-06-24.
5. Olah, Chris; Mordvintsev, Alexander; and Schubert, Ludwig. "Feature Visualization." *Distill*, 2017. https://distill.pub/2017/feature-visualization/ Accessed 2026-06-24.
6. Gorton, Liv. "The Missing Curve Detectors of InceptionV1: Applying Sparse Autoencoders to InceptionV1 Early Vision." arXiv:2406.03662, 2024. https://arxiv.org/abs/2406.03662 Accessed 2026-06-24.
7. Elhage, Nelson; Hume, Tristan; Olsson, Catherine; Nanda, Neel; Henighan, Tom; Kravec, Shauna; Hatfield-Dodds, Zac; Lasenby, Robert; Drain, Dawn; Chen, Carol; Grosse, Roger; McCandlish, Sam; Kaplan, Jared; Amodei, Dario; Wattenberg, Martin; and Olah, Christopher. "Softmax Linear Units." *Transformer Circuits Thread*, Anthropic, June 27, 2022. https://transformer-circuits.pub/2022/solu/index.html Accessed 2026-06-24.
8. Templeton, Adly; Conerly, Tom; Marcus, Jonathan; Lindsey, Jack; Bricken, Trenton; Chen, Brian; Pearce, Adam; Citro, Craig; Ameisen, Emmanuel; Jones, Andy; Cunningham, Hoagy; Turner, Nicholas L.; McDougall, Callum; MacDiarmid, Monte; Tamkin, Alex; Durmus, Esin; Hume, Tristan; Mosconi, Francesco; Freeman, C. Daniel; Sumers, Theodore R.; Rees, Edward; Batson, Joshua; Jermyn, Adam; Carter, Shan; Olah, Chris; and Henighan, Tom. "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet." *Transformer Circuits Thread*, Anthropic, May 21, 2024. https://transformer-circuits.pub/2024/scaling-monosemanticity/ Accessed 2026-06-24.
9. Bricken, Trenton; Templeton, Adly; Batson, Joshua; Chen, Brian; Jermyn, Adam; Conerly, Tom; Turner, Nicholas L.; Anil, Cem; Denison, Carson; Askell, Amanda; Lasenby, Robert; Wu, Yifan; Kravec, Shauna; Schiefer, Nicholas; Maxwell, Tim; Joseph, Nicholas; Hatfield-Dodds, Zac; Tamkin, Alex; Nguyen, Karina; McLean, Brayden; Burke, Josiah E.; Hume, Tristan; Carter, Shan; Henighan, Tom; and Olah, Christopher. "Towards Monosemanticity: Decomposing Language Models with Dictionary Learning." *Transformer Circuits Thread*, Anthropic, October 5, 2023. https://transformer-circuits.pub/2023/monosemantic-features Accessed 2026-06-24.
10. Johnson, William B.; and Lindenstrauss, Joram. "Extensions of Lipschitz mappings into a Hilbert space." *Contemporary Mathematics*, 26:189-206, 1984. (Background reference for the geometric capacity argument.) Accessed 2026-06-24.
11. Scherlis, Adam; Sachan, Kshitij; Jermyn, Adam S.; Benton, Joe; and Shlegeris, Buck. "Polysemanticity and Capacity in Neural Networks." arXiv:2210.01892, October 2022. https://arxiv.org/abs/2210.01892 Accessed 2026-06-24.
12. Jermyn, Adam S.; Schiefer, Nicholas; and Hubinger, Evan. "Engineering Monosemanticity in Toy Models." arXiv:2211.09169, November 16, 2022. https://arxiv.org/abs/2211.09169 Accessed 2026-06-24.
13. Rajamanoharan, Senthooran; et al. "Improving Dictionary Learning with Gated Sparse Autoencoders." (Survey and gated/jumpReLU/topk SAE work.) See also: "A Survey on Sparse Autoencoders: Interpreting the Internal Mechanisms of Large Language Models." arXiv:2503.05613, 2025. https://arxiv.org/abs/2503.05613 Accessed 2026-06-24.
14. Lindsey, Jack; Gurnee, Wes; Ameisen, Emmanuel; Chen, Brian; Pearce, Adam; Turner, Nicholas L.; Citro, Craig; and collaborators. "On the Biology of a Large Language Model." *Transformer Circuits Thread*, Anthropic, March 27, 2025. https://transformer-circuits.pub/2025/attribution-graphs/biology.html See also Anthropic's "Open-Source Circuit Tracing" release: https://www.anthropic.com/research/open-source-circuit-tracing Accessed 2026-06-24.
15. Adler, Micah. "On the Complexity of Neural Computation in Superposition." arXiv:2409.15318, 2024. https://arxiv.org/abs/2409.15318 Accessed 2026-06-24.
