Polysemanticity
Last reviewed
May 18, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,957 words
Polysemanticity is the empirical phenomenon in artificial neural networks in which a single neuron (or directional unit such as an attention head) activates strongly for multiple, semantically unrelated inputs or concepts. The term was popularized by the 2020 Distill "Circuits" thread, where Chris Olah and collaborators documented neurons in the InceptionV1 vision model that responded to disjoint categories such as cat faces, fronts of cars, and cat legs.[^1] Polysemanticity is widely regarded as one of the central obstacles to [[mechanistic_interpretability]], because it prevents researchers from treating individual neurons as the natural atoms of computation.
Polysemanticity is closely related to, but conceptually distinct from, [[superposition]]. Polysemanticity is the observation that a neuron fires on many things; superposition is a specific hypothesis about why that observation occurs, namely that networks pack more linearly represented features into a hidden layer than there are dimensions in that layer.[^2][^3] Most modern work in interpretability treats superposition as the dominant explanation for polysemanticity in trained models, although other mechanisms (non-axis-aligned but still linear feature bases, non-linear feature codes) can in principle produce polysemantic neurons without superposition.[^3]
The term polysemanticity in this technical sense should not be confused with polysemy in linguistics, which refers to a single word having multiple related meanings. While there are conceptual analogies (and word-embedding work has used "superposition"-like language for polysemous words),[^4] the deep-learning usage refers to a property of internal computational units rather than a property of natural language.
The phrase "polysemantic neuron" entered the interpretability literature through the Distill Circuits thread. The foundational article "Zoom In: An Introduction to Circuits," published on March 10, 2020 and authored by Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter, makes three "speculative claims" about the structure of neural networks (the Features, Circuits, and Universality claims) and introduces polysemantic neurons as a major complication for the Features claim.[^1]
The Circuits authors had been studying InceptionV1, a 2014-era convolutional network trained on ImageNet, using feature visualization and dataset examples. Most neurons they examined were monosemantic, meaning a single coherent visual concept could explain the inputs that maximally activated them, for example a "curve detector," a "dog head detector," or a "high-low frequency detector." But a non-trivial fraction of neurons appeared to respond to multiple unrelated stimuli even after careful inspection. The Distill team named these "polysemantic neurons" and treated them as a major challenge to their broader research agenda. Their proposed working framework was that neural networks consist of meaningful "features" wired together into "circuits," and polysemanticity threatens both halves of that framework by undermining the assumption that neurons are clean feature units.[^1]
Three closely related Distill publications fleshed out the empirical picture during 2020 and 2021: "An Overview of Early Vision in InceptionV1," "Curve Detectors," and "Naturally Occurring Equivariance in Neural Networks." These papers documented many specific neurons in vivid detail, and most discussions of polysemanticity in the literature trace back to the catalog of examples those articles introduced.[^1]
The conceptual lineage of the term itself predates 2020: earlier neural-network interpretability work (notably the 2017 Distill article "Feature Visualization" by Chris Olah, Alexander Mordvintsev, and Ludwig Schubert) already noted that some neurons appeared to fire on unrelated images, though it did not coin the precise term.[^5] The 2020 Circuits paper made the phenomenon a named, central object of study.
The canonical example, repeated in essentially every subsequent paper on the topic, is a particular channel in the late convolutional layers of InceptionV1 (often referenced as a neuron in the mixed4 or mixed5 stage) that responds to three unrelated stimuli: cat faces, the fronts of cars, and cat legs. The Circuits authors used optimization-based feature visualization to confirm that this neuron is not picking up some subtle shared visual structure: separate visualizations show the neuron looking specifically for eyes and whiskers in one mode, for furry legs in another, and for the shiny grilles and headlight regions of automobile fronts in a third.[^1]
Other documented polysemantic neurons in InceptionV1 include channels that mix dog-head features with car features, channels that mix multiple animal categories, and various edge-case detectors in early layers that combine textures or frequencies. Subsequent reanalyses of InceptionV1 using [[sparse_autoencoder]] methods, beginning around 2024, have shown that many channels classified as polysemantic in the original Circuits work indeed decompose cleanly into multiple monosemantic features in a learned overcomplete basis, providing direct empirical support for the superposition explanation in the vision setting.[^6]
The visual-domain examples remain pedagogically important because feature visualization (gradient-based optimization on the input pixels to maximize a unit's activation) provides a fairly direct window onto what a neuron "looks for." In language models this kind of intuitive visualization is harder, which historically made polysemanticity in transformers more difficult to study before the development of sparse-autoencoder tools.
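As a rough illustration of optimization-based feature visualization, the sketch below runs gradient ascent on the input of a toy unit. The unit is hypothetical: it is defined as the maximum of two linear responses (`w_a` and `w_b` are invented orthonormal patterns, not taken from InceptionV1), and the gradient is written out by hand rather than obtained from a real network. Under those assumptions, different starting points converge to different preferred patterns, which loosely mirrors the separate "modes" seen when visualizing a polysemantic neuron.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 16

# Hypothetical polysemantic unit: two orthonormal "preferred patterns".
patterns, _ = np.linalg.qr(rng.normal(size=(dim, 2)))
w_a, w_b = patterns[:, 0], patterns[:, 1]

def activation(x):
    """Toy unit: fires for whichever pattern matches the input best."""
    return max(w_a @ x, w_b @ x)

def gradient(x):
    """Gradient of the activation: the pattern that currently wins the max."""
    return w_a if w_a @ x >= w_b @ x else w_b

def visualize(x0, steps=200, lr=0.1):
    """Gradient ascent on the input, renormalized to unit length each step."""
    x = x0 / np.linalg.norm(x0)
    for _ in range(steps):
        x = x + lr * gradient(x)
        x /= np.linalg.norm(x)
    return x

# Two starting points, each a noisy copy of one pattern.
for start in (w_a + 0.1 * rng.normal(size=dim), w_b + 0.1 * rng.normal(size=dim)):
    x = visualize(start)
    print(round(float(w_a @ x), 3), round(float(w_b @ x), 3))
```

Real feature visualization differentiates through a full network and usually adds regularizers to keep the image natural-looking; none of that is modeled here.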
Polysemanticity has been documented to be especially severe in transformer language models. In their 2022 work "Softmax Linear Units," Nelson Elhage, Neel Nanda, Chris Olah, and collaborators at Anthropic reported that for standard transformer MLPs, most neurons in middle layers respond to several apparently unrelated text features, and that this state of affairs is the norm rather than the exception.[^7]
Concrete examples reported in the early Anthropic work include MLP neurons that activate on tokens within Python code, on French-language tokens, and on certain types of numeric tokens; neurons that mix a syntactic role (such as "after a possessive") with a topical one (such as religion); and neurons that activate both for proper nouns of a particular language and for unrelated formatting tokens. Later work using sparse autoencoders on production language models, including Claude 3 Sonnet, extracted dictionaries of tens of millions of features (not raw neurons), including features for highly specific concepts such as the Golden Gate Bridge, code with security vulnerabilities, deceptive behaviors, and "inner conflict"; these features generally do not align with individual neurons.[^8]
A frequently cited intuition in this space is that "neurons in transformers are not the right unit of analysis." Polysemanticity is part of the empirical case for that claim. The other part is the observation that features, defined as sparse, approximately linear directions in activation space, do appear to be a much better unit of analysis once they can be extracted.[^9][^8]
The most influential theoretical account of polysemanticity is the capacity argument developed by Anthropic in the 2022 paper "Toy Models of Superposition," authored by Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah.[^2]
The setup is deliberately minimal: a small ReLU autoencoder is trained to reconstruct synthetic input vectors whose components ("features") have controlled importance and controlled sparsity (probability of being nonzero on any given example). The bottleneck has fewer dimensions than the input vector has features, so something has to give. The paper studies what the network learns as the sparsity and importance parameters are varied.[^2]
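A minimal sketch of this kind of setup, assuming the tied-weight form in which the input is projected down by a matrix `W`, projected back up by its transpose, and passed through a ReLU. The sizes, the sparsity value, and the geometric importance schedule are illustrative assumptions, and no training loop is included.

```python
import numpy as np

rng = np.random.default_rng(0)

n_features, n_hidden = 5, 2                 # more features than bottleneck dimensions
sparsity = 0.9                              # probability that a feature is zero
importance = 0.9 ** np.arange(n_features)   # assumed: geometrically decaying importance

def sample_batch(batch_size):
    """Each feature is zero with probability `sparsity`, else uniform in [0, 1)."""
    values = rng.random((batch_size, n_features))
    active = rng.random((batch_size, n_features)) >= sparsity
    return values * active

def forward(W, b, x):
    """Project down with W, back up with W.T, add a bias, then apply ReLU."""
    h = x @ W.T                             # (batch, n_hidden)
    return np.maximum(h @ W + b, 0.0)       # (batch, n_features)

def loss(W, b, x):
    """Importance-weighted squared reconstruction error, averaged over the batch."""
    return np.mean(np.sum(importance * (x - forward(W, b, x)) ** 2, axis=1))

W = rng.normal(scale=0.1, size=(n_hidden, n_features))
b = np.zeros(n_features)
x = sample_batch(1024)
print(loss(W, b, x))
```

Minimizing this loss with any gradient-based optimizer while sweeping `sparsity` is the kind of experiment the paragraph above describes; the columns of `W` are then the directions assigned to each feature.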
The key findings of "Toy Models of Superposition" are:
When features are dense (rarely zero), the network learns an orthogonal basis of the most important features and ignores the rest. Each hidden dimension corresponds to one feature; the model is monosemantic but cannot represent everything.[^2]
As features become sparser, the network increasingly represents more features than it has dimensions by placing them along almost-orthogonal directions. This is superposition. The hidden representation is then polysemantic: each neuron lights up on multiple features.[^2]
The transition is sharp. As sparsity is varied, the number of represented features increases in a non-smooth, sometimes phase-transition-like manner. The geometric arrangements of represented features form regular polytopes (digons, triangles, pentagons, tetrahedra) at certain sparsity regimes.[^2]
Superposition trades capacity for interference. Almost-orthogonal directions are not exactly orthogonal, so reading off any one feature picks up small amounts of others. The ReLU non-linearity is essential because it filters out small (interference) activations while preserving large (true feature) ones.[^2]
The high-dimensional fact underlying the argument is that in a space of dimension d, there are at most d exactly orthogonal vectors, but exponentially many approximately orthogonal vectors of small cosine similarity. This is closely related to the Johnson-Lindenstrauss lemma. A network can therefore "fit" more sparse features into d dimensions than would naively appear possible, paying a cost only in interference noise.[^2][^10]
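The near-orthogonality claim can be checked numerically. The sketch below draws more random unit vectors than there are dimensions and reports the largest absolute cosine similarity between any pair; the vector count and the dimensions are arbitrary choices for illustration, and random vectors are only one (not necessarily optimal) way to pack directions.

```python
import numpy as np

def max_abs_cosine(n_vectors, dim, seed=0):
    """Largest |cosine similarity| between any two of n random unit vectors."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=(n_vectors, dim))
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    cos = v @ v.T
    np.fill_diagonal(cos, 0.0)              # ignore each vector's similarity to itself
    return np.abs(cos).max()

# 2000 vectors cannot all be exactly orthogonal in any of these spaces,
# but the worst-case overlap shrinks as the dimension grows.
for dim in (10, 100, 1000):
    print(dim, round(float(max_abs_cosine(2000, dim)), 3))
```

The worst-case overlap is what a downstream readout would experience as interference, which is why it matters that it falls with dimension.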
The capacity argument explains polysemanticity as the geometric byproduct of superposition: because each feature gets assigned a direction that is some combination of standard-basis neurons rather than aligned with a single neuron, any one neuron is on the support of many features. Reading out one neuron's activations therefore aggregates contributions from all the features whose direction has a nonzero coordinate at that neuron, producing the observed multi-concept response patterns.[^2]
The "Toy Models" paper also draws a surprising connection between superposition and adversarial examples: the same interference that lets the model represent many features also exposes it to small input perturbations that flip its predictions.[^2] Subsequent work has explored these connections more systematically.
A complementary capacity-based analysis was given in "Polysemanticity and Capacity in Neural Networks" (Scherlis, Sachan, Jermyn, Benton, Shlegeris, 2022), which formalized the idea that polysemanticity emerges naturally when the number of features the model "wants" to represent exceeds the available neurons, and quantified the trade-off.[^11]
The relationship between polysemanticity and [[superposition]] is central enough to deserve its own treatment, because the two terms are often conflated in informal writing.
Polysemanticity is an observation about individual neurons: when you look at the set of inputs that maximally activate a given neuron, you find unrelated concepts. It is a property of the standard basis (the basis of "individual neurons") of the activation space.[^3]
Superposition is a hypothesis about the structure of the representation: the network encodes more sparse features than the layer has dimensions, with each feature assigned to an approximately orthogonal direction that is generally not aligned with any single neuron axis.[^2]
The implication runs strictly one way. Superposition implies polysemanticity: if there are more features than neurons, the features cannot be axis-aligned, so at least some neurons must light up on multiple features. The converse does not hold. Polysemanticity can in principle arise without superposition: a network might encode the same number of features as neurons (or fewer), but along directions that are not aligned with the standard basis. Such a model would have a clean monosemantic feature basis, just not the neuron basis. In that case there is no "more features than dimensions" pressure, only a rotation.[^3]
This distinction was drawn explicitly by Lawrence Chan in a 2024 alignment-research note, "Superposition is not 'just' neuron polysemanticity," which catalogs three ways a network can be polysemantic without being in superposition: non-neuron-aligned but still orthogonal feature bases, non-linear feature representations, and compositional codes.[^3]
For research purposes the distinction matters because the two phenomena suggest different interventions. If polysemanticity were due only to misaligned but orthogonal features, then a learned change of basis would suffice to recover monosemanticity, and the basis would have exactly as many features as neurons. If superposition is at work, by contrast, then any monosemantic decomposition will require an overcomplete basis (more features than neurons), which is what makes [[sparse_autoencoder]] methods natural tools.
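The difference between the two cases can be sketched numerically. In the first case below, four features are stored along a rotated orthonormal basis in four neurons, so every neuron responds to several features, yet a change of basis recovers the features exactly. In the second, eight features share four neurons, and even the least-squares linear readout cannot separate them. All matrices are random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_neurons = 4

# Case 1: as many features as neurons, stored along a rotated orthonormal basis.
Q, _ = np.linalg.qr(rng.normal(size=(n_neurons, n_neurons)))
features = np.eye(n_neurons)                 # row i: feature i active alone
acts_rotated = features @ Q                  # neuron activations per feature
recovered = acts_rotated @ Q.T               # change of basis undoes the rotation

# Case 2: twice as many features as neurons (superposition).
n_features = 8
D = rng.normal(size=(n_features, n_neurons))
D /= np.linalg.norm(D, axis=1, keepdims=True)
acts_super = np.eye(n_features) @ D
readout = acts_super @ np.linalg.pinv(D)     # least-squares linear readout

neurons_per_feature = (np.abs(acts_rotated) > 1e-6).sum(axis=1)
rotation_error = np.abs(recovered - features).max()
superposition_error = np.abs(readout - np.eye(n_features)).max()
print(neurons_per_feature, rotation_error, superposition_error)
```

The second error cannot vanish because an 8-by-8 identity has rank 8 while any linear map through 4 neurons has rank at most 4; recovering the features requires an overcomplete, non-linear decoder that exploits sparsity.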
The current empirical consensus in mechanistic interpretability, supported by the success of overcomplete sparse-autoencoder dictionaries at scale, is that superposition is the dominant cause of polysemanticity in modern trained language models.[^9][^8]
Multiple families of methods have been proposed to recover interpretable, ideally monosemantic, units from networks that exhibit polysemanticity. They divide roughly into post-hoc decomposition methods and training-time interventions.
The most influential family of methods is dictionary learning with [[sparse_autoencoder]]s. The idea is to train a wide autoencoder on the activations of a fixed, frozen pretrained model. The autoencoder has more hidden units than the original layer (an overcomplete dictionary) and is regularized to produce sparse codes, typically via an L1 or top-k constraint on the hidden activations. The hidden units of the autoencoder are interpreted as candidate features.[^9]
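The following is a minimal sketch of the forward pass and loss of such an autoencoder, without a training loop. The layer sizes, the initialization scale, the L1 coefficient, and the choice to subtract the decoder bias before encoding are assumptions for illustration; published implementations differ in these details.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_dict = 8, 32          # overcomplete: 4x more dictionary units than dimensions

W_enc = rng.normal(scale=0.1, size=(d_model, d_dict))
b_enc = np.zeros(d_dict)
W_dec = rng.normal(scale=0.1, size=(d_dict, d_model))
b_dec = np.zeros(d_model)

def encode(acts):
    """Non-negative codes; each column is one candidate feature."""
    return np.maximum((acts - b_dec) @ W_enc + b_enc, 0.0)

def decode(codes):
    """Reconstruct activations as a weighted sum of dictionary directions."""
    return codes @ W_dec + b_dec

def sae_loss(acts, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that pushes codes toward sparsity."""
    codes = encode(acts)
    mse = np.mean(np.sum((acts - decode(codes)) ** 2, axis=1))
    l1 = np.mean(np.sum(np.abs(codes), axis=1))
    return mse + l1_coeff * l1, codes

def top_k(codes, k):
    """Alternative sparsity constraint: keep only the k largest codes per row."""
    idx = np.argpartition(codes, -k, axis=1)[:, -k:]
    sparse = np.zeros_like(codes)
    np.put_along_axis(sparse, idx, np.take_along_axis(codes, idx, axis=1), axis=1)
    return sparse

acts = rng.normal(size=(16, d_model))   # stand-in for frozen model activations
total, codes = sae_loss(acts)
print(total, (top_k(codes, 4) != 0).sum(axis=1))
```

In practice `acts` would be activations collected from the frozen model, and the rows of `W_dec` would be inspected as feature directions after training.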
The breakthrough demonstration for language models was Anthropic's "Towards Monosemanticity: Decomposing Language Models with Dictionary Learning," published October 5, 2023, with first author Trenton Bricken and co-authors including Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nicholas L. Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E. Burke, Tristan Hume, Shan Carter, Tom Henighan, and Christopher Olah.[^9]
The team trained sparse autoencoders on the MLP activations of a one-layer transformer language model and found that, with a sufficiently wide dictionary, the recovered features were dramatically more monosemantic than the underlying neurons. From a layer with 512 neurons they extracted dictionaries of over 4,000 features representing distinct concepts including DNA sequences, legal language, HTTP requests, Hebrew text, nutritional statements, and many more. Most of these properties were invisible at the neuron level.[^9]
This work was followed in May 2024 by "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet" by Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L. Turner, Callum McDougall, Monte MacDiarmid, Alex Tamkin, Esin Durmus, Tristan Hume, Francesco Mosconi, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, and Tom Henighan. They trained sparse autoencoders on the middle-layer residual stream of Claude 3 Sonnet (Anthropic's production model released March 4, 2024) and extracted dictionaries of up to roughly 34 million features. Many features were highly abstract, multilingual, and multimodal, and the team identified features for concepts ranging from the Golden Gate Bridge to deceptive behaviors and to safety-relevant categories such as sycophancy and bias.[^8]
The success of this approach is the strongest empirical evidence that polysemanticity in language models is driven by superposition: an overcomplete dictionary representing many more features than the layer has neurons is exactly what the superposition hypothesis predicts will be needed.
A separate research thread asks whether the training process itself can be modified to produce more monosemantic models in the first place, rather than decomposing polysemantic ones after the fact.
Anthropic's "Softmax Linear Units" paper, published June 27, 2022 by Nelson Elhage, Tristan Hume, Catherine Olsson, Neel Nanda, Tom Henighan, and collaborators, replaced the standard MLP activation function with SoLU(x) = x * softmax(x). The softmax encourages sparsity and discourages neurons from being equally active for many distinct features. The authors reported that this change roughly doubled the fraction of MLP neurons that researchers rated as interpretable, from approximately one third to around 60 percent, at little or no cost in language modeling loss when combined with an extra LayerNorm.[^7]
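A small sketch of the activation function as written above, followed by a plain LayerNorm. The LayerNorm here has no learned scale or shift, which is a simplifying assumption, and the input values are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def solu(x):
    """SoLU(x) = x * softmax(x), with the softmax taken across the neurons of a layer."""
    return x * softmax(x, axis=-1)

def layer_norm(x, eps=1e-5):
    """LayerNorm without learned parameters (a simplification)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

pre = np.array([[4.0, 1.0, 0.5, -2.0]])   # pre-activations for one token
print(solu(pre))
print(layer_norm(solu(pre)))
```

The largest pre-activation keeps most of its value while the others are scaled toward zero, which is the sparsifying effect the paragraph describes.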
However, the SoLU work also produced an important caveat: the authors found evidence that SoLU was sometimes hiding superposition rather than eliminating it. Some features that became hard to read off neurons appeared to still be present in distributed form. This was one of the first concrete pieces of evidence that the superposition phenomenon was real and not just a side effect of a poorly chosen basis.[^7]
A complementary line of work in late 2022 by Adam Jermyn, Nicholas Schiefer, and Evan Hubinger ("Engineering Monosemanticity in Toy Models," November 2022) showed that, in toy models, the local minimum found during training has a large effect on monosemanticity. Different local minima with similar loss can have very different fractions of monosemantic neurons, and biased initialization plus negative biases can steer the model toward more monosemantic solutions. Adding extra width to the model also helped, consistent with the capacity-pressure account.[^12]
Beyond sparse autoencoders and activation-function changes, several lines of work have explored explicit sparsity priors during training, weight decay on dense connections, and architectural modifications that allocate dedicated capacity to specific tasks. Mixture-of-experts and sparse routing models can be read as moving in this direction in spirit, although they were not designed with monosemanticity as a goal.
A practical observation across these efforts is that training-time interventions require retraining the model and can trade off against performance or, as with SoLU, hide superposition rather than remove it. Post-hoc methods like sparse autoencoders can be applied to existing strong models without retraining and are now the preferred tool.
Polysemanticity has been described, by Olah and others, as one of the most important obstacles to the project of [[mechanistic_interpretability]]: the attempt to reverse-engineer the algorithms implemented by trained neural networks into human-understandable form.[^1]
The reason is roughly combinatorial. If a layer has N polysemantic neurons each representing, say, five distinct features, then a downstream layer reading those activations effectively confronts not N inputs but 5N entangled signals. A circuit linking two such layers does not have N x N possible "wires" to analyze, but on the order of (5N) x (5N) = 25 N^2 effective connections, each of which may or may not be load-bearing. Manual circuit analysis at that scale is prohibitive.[^1]
Polysemanticity also undermines simple causal interventions. Ablating a polysemantic neuron damages multiple features at once, so behavioral changes after ablation cannot be cleanly attributed to a single concept. This complicates standard interpretability techniques such as activation patching, causal scrubbing, and direct attribution, all of which work most cleanly when units of analysis correspond to single concepts.
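A toy sketch of why neuron ablation is hard to interpret under superposition. Three hypothetical features are stored as unit vectors 120 degrees apart in two neurons; zero-ablating one neuron changes the readout of all three features, not just one. The directions are invented for illustration.

```python
import numpy as np

# Hypothetical feature directions: 3 features stored in 2 neurons.
directions = np.array([
    [1.0, 0.0],
    [-0.5, np.sqrt(3) / 2],
    [-0.5, -np.sqrt(3) / 2],
])

def readout(acts):
    """Project neuron activations onto each feature direction."""
    return acts @ directions.T

# Row i holds the neuron activations when feature i is active alone.
acts = np.eye(3) @ directions

ablated = acts.copy()
ablated[:, 0] = 0.0                      # zero-ablate neuron 0

# Drop in each feature's own readout caused by the ablation.
change = np.diag(readout(acts)) - np.diag(readout(ablated))
print(change)                            # [1.   0.25 0.25]
```

Every entry of `change` is nonzero, so a behavioral shift after this ablation could stem from any of the three features.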
Once polysemanticity is addressed, by switching the unit of analysis from neurons to features extracted via sparse autoencoders, [[mechanistic_interpretability]] becomes much more tractable. Feature-level analyses have produced clean accounts of indirect object identification, induction heads, modular arithmetic circuits, refusal behaviors, and much more in transformer language models. They have also enabled [[activation_steering]] methods, which manipulate model behavior by adjusting the activation along a single feature direction; such steering is much more interpretable and surgical than steering along a polysemantic neuron, because a feature direction corresponds (by construction) to a single concept.
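Steering along a feature direction can be sketched as adding a scaled unit vector to the activations. The helper names, the random directions, and the steering strength below are hypothetical; in practice the direction would come from a trained sparse autoencoder's decoder.

```python
import numpy as np

def unit(v):
    """Normalize a vector to unit length."""
    return v / np.linalg.norm(v)

def steer(acts, feature_direction, strength):
    """Shift every activation vector along one feature direction."""
    return acts + strength * unit(feature_direction)

def feature_readout(acts, feature_direction):
    """Activation of a feature, measured as projection onto its direction."""
    return acts @ unit(feature_direction)

rng = np.random.default_rng(0)
acts = rng.normal(size=(4, 8))           # stand-in for residual-stream activations
target = rng.normal(size=8)              # hypothetical feature to steer
other = rng.normal(size=8)
other -= (other @ unit(target)) * unit(target)   # make a second feature orthogonal

steered = steer(acts, target, strength=3.0)
target_shift = feature_readout(steered, target) - feature_readout(acts, target)
other_shift = feature_readout(steered, other) - feature_readout(acts, other)
print(target_shift, other_shift)
```

The second feature is unaffected here only because it was constructed to be exactly orthogonal; features in superposition are merely near-orthogonal, so real steering leaks a small amount into other features.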
Since the publication of "Scaling Monosemanticity" in May 2024, polysemanticity has remained an active research focus, with progress along several fronts.
Scaling and architecture of sparse autoencoders. Multiple groups have explored top-k SAEs, JumpReLU SAEs, gated SAEs, switch SAEs, and other variants intended to address pathologies of standard L1-regularized autoencoders such as feature shrinkage and dead features. These improvements have allowed sparse autoencoders to be trained efficiently on larger residual streams and across more layers.[^13]
Attribution graphs and replacement models. In March 2025, Anthropic published "On the Biology of a Large Language Model" by Jack Lindsey, Wes Gurnee, Emmanuel Ameisen, Brian Chen, Adam Pearce, Nicholas L. Turner, Craig Citro, and collaborators, introducing attribution graphs as a method to trace computation through a sparse, interpretable replacement model. The replacement uses a cross-layer transcoder (CLT) trained to reconstruct the network's behavior using features rather than raw polysemantic neurons; the example reported in the paper used a CLT with on the order of 30 million features across all layers of Claude 3.5 Haiku. The resulting graphs reveal multi-step reasoning chains, planning behaviors, and other circuit-level mechanisms that would be invisible at the neuron level due to polysemanticity.[^14]
Open-source circuit tracing. Following the Anthropic attribution-graphs work, multiple groups have re-implemented and extended the techniques, including an open Circuit Tracer library and related infrastructure released through the Anthropic Fellows program. Public dashboards such as Neuronpedia have made millions of extracted features browsable, lowering the barrier to entry for feature-level interpretability research.[^14]
Theoretical work on superposition and computation in superposition. Multiple papers in 2024-2026 have studied the capacity of linear-representation networks under the superposition hypothesis, formal bounds on how many features can be stored, and the computational complexity of operations carried out on representations in superposition. This line of work has begun to clarify the conditions under which polysemanticity should be expected to arise and the limits of what sparse-autoencoder extraction can recover.[^15]
The overall trajectory is from polysemanticity being identified as a major obstacle in 2020, to being explained mechanistically by superposition in 2022, to being addressed empirically by sparse-autoencoder decomposition at scale in 2023-2024, to being routinely worked around in modern circuit-level analysis from 2025 onward. It remains an active research area: open questions include how to handle features that are themselves represented non-linearly, how to detect "missed" features that no sparse autoencoder has yet found, and how the picture changes for very large models with many layers in superposition simultaneously.