Superposition (Mechanistic Interpretability)

Updated Jun 21, 2026

Superposition is the phenomenon in which an artificial neural network represents more distinct features than it has dimensions in its activation space, by assigning those features to nearly-orthogonal (rather than strictly orthogonal) directions and accepting a small amount of interference between them. As used in mechanistic interpretability, superposition describes a compression strategy: the network packs a large vocabulary of latent concepts into a comparatively low-dimensional vector, trading a controlled amount of cross-talk for greater representational capacity. It is widely regarded as one of the foundational obstacles to reverse-engineering modern deep learning systems, because it implies that individual neurons typically do not correspond to single human-interpretable concepts.^[1]^[2]

The modern formulation was introduced by researchers at Anthropic in the 2022 paper "Toy Models of Superposition" by Nelson Elhage and colleagues,^[1] which formalized earlier observations about polysemantic neurons in the Distill Circuits thread.^[3] The follow-up paper "Towards Monosemanticity" (Bricken et al., 2023) demonstrated that sparse autoencoders trained on transformer activations could recover individual features from superposed representations,^[2] establishing the principal experimental method that currently dominates the field. The concept is named by analogy to the linear superposition of waves; it has no relationship to quantum superposition in physics beyond the shared English word.

This article surveys the conceptual content of superposition, the empirical findings of the toy-model line of work, the connection to sparse autoencoder dictionary learning, the distinct notion of "computation in superposition," and recent extensions to frontier models.

What is the linear representation hypothesis?

The study of superposition presupposes a working hypothesis about how neural networks encode information: the linear representation hypothesis. This hypothesis, which can be traced through earlier work on word2vec and other distributed representations, asserts that high-level features inside a trained neural network correspond to directions in its activation space, and that the presence or strength of a feature in a given input is approximately given by the projection of the activation vector onto that direction. Under this hypothesis, the activation vector at a given layer is approximately a linear combination of feature directions, with coefficients given by the (sparse) set of features active for that input.^[1]^[4]
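As a minimal sketch of the hypothesis, the snippet below builds an activation vector as a sparse weighted sum of feature directions and reads each feature back by projection. The feature names, directions, and coefficients are hypothetical, and the directions are chosen orthonormal so the readout is exact; this is the no-superposition baseline, not a description of any real model.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

# Hypothetical feature directions in a 3-dimensional activation space.
features = {
    "is_french": normalize([1.0, 0.0, 0.0]),
    "is_code": normalize([0.0, 1.0, 0.0]),
    "is_question": normalize([0.0, 0.0, 1.0]),
}

# Sparse coefficients: only two of the three features are active.
coefficients = {"is_french": 2.0, "is_code": 0.0, "is_question": 0.5}

# The activation is the coefficient-weighted sum of feature directions.
activation = [
    sum(coefficients[name] * direction[k] for name, direction in features.items())
    for k in range(3)
]

# Projecting onto each direction recovers that feature's strength.
readout = {name: dot(activation, direction) for name, direction in features.items()}
```

With orthonormal directions the readout equals the coefficients exactly; superposition is the case where there are more directions than dimensions and the readout picks up interference.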

Toy Models of Superposition adopts this hypothesis as its starting point and asks what happens when the number of features a model "wants" to represent exceeds the dimensionality of the activation space available to it. If features had to occupy strictly orthogonal directions, a layer of width $d$ could represent at most $d$ features. But in high-dimensional vector spaces, one can pack exponentially many almost-orthogonal vectors. The mathematical content underlying this fact is captured by the Johnson-Lindenstrauss lemma and related concentration phenomena: for any $\epsilon > 0$, one can find $\exp(\Theta(\epsilon^2 d))$ vectors in $d$-dimensional space whose pairwise inner products are bounded in absolute value by $\epsilon$.^[1] These almost-orthogonal vectors are not interchangeable with orthogonal ones, but if features are sparse (rarely active simultaneously), the small interferences between non-orthogonal directions average out and can be filtered by a nonlinearity such as ReLU.
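The packing claim can be checked numerically in a rough way. The sketch below draws twice as many random unit vectors as there are dimensions and measures the largest pairwise cosine in absolute value. Random directions are only an illustration (an assumption of this example), not the optimal packing and not what a trained network learns; the sizes `dim` and `count` are arbitrary.

```python
import math
import random

def random_unit_vector(dim, rng):
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def max_abs_cosine(vectors):
    """Largest |cosine| over all distinct pairs of unit vectors."""
    worst = 0.0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            cos = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            worst = max(worst, abs(cos))
    return worst

rng = random.Random(0)
dim, count = 128, 256  # twice as many vectors as dimensions
vectors = [random_unit_vector(dim, rng) for _ in range(count)]
worst = max_abs_cosine(vectors)
```

Because `count > dim`, the vectors cannot all be orthogonal, yet `worst` stays well below 1: every pair is only partially aligned, which is the sense of "almost-orthogonal" used above.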

A 2023 paper by Kiho Park, Yo Joong Choe, and Victor Veitch developed a formal version of the linear representation hypothesis for large language models, defining it in terms of three distinct senses (unembedding, embedding, and probing representations) and introducing a "causal inner product" under which independent concepts are represented by orthogonal vectors.^[4] This formalism has since become a reference point for empirical work on linear features in transformer residual streams.

What problem does polysemanticity pose for interpretability?

The empirical observation that motivated the toy-model line of work is polysemanticity: many neurons in trained networks respond to seemingly unrelated stimuli. (See polysemanticity.) In the InceptionV1 vision model, the Distill Circuits team documented numerous neurons that fired for combinations such as cat faces, fronts of cars, and cat legs.^[3] Polysemantic neurons present a serious obstacle to interpretability research: if a single neuron's activation reflects a mixture of unrelated concepts, then one cannot read off a network's computation by inspecting individual neurons.

Two natural explanations for polysemanticity were canvassed in the early Circuits literature. The first is that "features" might simply be poorly aligned to the neuron basis, but nonetheless live in lower-dimensional subspaces of activation space; under this view, an appropriate rotation would recover monosemantic features. The second, more radical, explanation is that the network is representing more features than it has neurons, and so cannot give every feature its own direction. Toy Models of Superposition provided the strongest evidence to date for the second explanation by constructing toy networks in which the ground-truth set of features was known and showing that polysemanticity arose exactly when feature count exceeded hidden-layer dimensionality and feature activations were sparse.^[1]

The contrast with monosemantic neurons, those that respond cleanly to a single interpretable concept, frames the central question of the field: under what conditions do neurons (or other privileged basis elements) become monosemantic, and how can researchers recover monosemantic features when the network's natural basis is polysemantic?

What did Toy Models of Superposition (Elhage et al., 2022) show?

"Toy Models of Superposition" was published on the Anthropic Transformer Circuits Thread on September 14, 2022, with a near-simultaneous arXiv release as 2209.10652.^[1]^[5] The authors, in order, are Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah.^[5] The paper's abstract states:^[5]

"Neural networks often pack many unrelated concepts into a single neuron, a puzzling phenomenon known as 'polysemanticity' which makes interpretability much more challenging. This paper provides a toy model where polysemanticity can be fully understood, arising as a result of models storing additional sparse features in 'superposition.' We demonstrate the existence of a phase change, a surprising connection to the geometry of uniform polytopes, and evidence of a link to adversarial examples."

The toy setup

The principal model studied is an autoencoder-like network with a single hidden layer that aims to reconstruct a high-dimensional sparse input. Inputs are vectors $x \in \mathbb{R}^n$ whose components are independently zero with probability $S$ (where $S$ is the sparsity parameter) and uniformly distributed on $[0, 1]$ otherwise. The network projects $x$ down to a hidden space of dimension $d < n$ via a weight matrix $W \in \mathbb{R}^{d \times n}$, then back up via $W^T$, adds a bias, and applies a ReLU nonlinearity to produce a reconstruction $x' = \text{ReLU}(W^T W x + b)$. Each feature is assigned an "importance" $I_i$, a scalar weight on its squared reconstruction error; the loss is $\sum_i I_i (x_i - x'_i)^2$ averaged over inputs.^[1]
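The forward pass and loss can be written out directly. The following is a minimal pure-Python sketch of the setup as described here, with $W$ stored as a list of $d$ rows of length $n$; the function names and the tiny example weights are assumptions made for illustration, and no training loop is included.

```python
import random

def sample_input(n, sparsity, rng):
    """Each feature is zero with probability `sparsity`, else uniform on [0, 1]."""
    return [0.0 if rng.random() < sparsity else rng.random() for _ in range(n)]

def forward(W, b, x):
    """Compute x' = ReLU(W^T W x + b) for W given as d rows of length n."""
    d, n = len(W), len(x)
    h = [sum(W[k][i] * x[i] for i in range(n)) for k in range(d)]           # h = W x
    pre = [sum(W[k][i] * h[k] for k in range(d)) + b[i] for i in range(n)]  # W^T h + b
    return [max(0.0, p) for p in pre]

def weighted_loss(x, x_hat, importance):
    """Importance-weighted squared reconstruction error for one input."""
    return sum(I * (xi - xh) ** 2 for I, xi, xh in zip(importance, x, x_hat))

# Hypothetical example: n = 2 features squeezed into d = 1 hidden dimension.
W = [[1.0, -1.0]]
b = [0.0, 0.0]
importance = [1.0, 1.0]

one_active = forward(W, b, [0.7, 0.0])    # only feature 0 is on
both_active = forward(W, b, [0.7, 0.2])   # both features are on
loss_one = weighted_loss([0.7, 0.0], one_active, importance)
loss_both = weighted_loss([0.7, 0.2], both_active, importance)
```

In this example the reconstruction is exact when a single feature is active and lossy when both are, which is why the trade only pays off when co-activation is rare.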

By varying $n / d$, the sparsity $S$, and the importance profile, the authors map out a phase diagram of representations. In the dense regime (low $S$), the optimal strategy is to give each feature its own dimension and ignore the rest, recovering a kind of feature selection. As sparsity increases, the network begins to represent additional features by adding them to the directions already used by other features, accepting small interference because the probability of two features being active simultaneously is low. In the high-sparsity regime, many more than $d$ features may be represented, each as a nearly-orthogonal direction in the $d$-dimensional hidden space.

Feature importance, sparsity, and rank deficiency

Toy Models introduces the notion of feature dimensionality, defined for feature $i$ as $D_i = \|W_i\|^2 / \sum_j (\hat{W}_i \cdot W_j)^2$, where $W_i$ is the column of $W$ that embeds feature $i$ and $\hat{W}_i$ is the unit vector in that direction. In words, it is the squared norm of the feature's direction divided by the sum, over all features, of the squared dot products of the other directions with its unit vector. A feature with a dedicated orthogonal direction has dimensionality 1; a feature that shares its direction with others has fractional dimensionality; a feature that is not represented at all has dimensionality 0.^[1] When features are packed efficiently, the feature dimensionalities sum to approximately the rank of $W^T W$, which is bounded by $d$. This gives a precise sense in which the network "spends" rank on features, and the central observation is that under sparsity, the optimal allocation often gives many features fractional dimensionality less than 1.
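A small worked computation may make the definition concrete. The sketch below assumes the form $D_i = \|W_i\|^2 / \sum_j (\hat{W}_i \cdot W_j)^2$, with each feature given as its embedding vector $W_i$; the helper name and the example arrangements are illustrative choices, not taken from the paper's code.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def feature_dimensionality(columns, i, eps=1e-12):
    """D_i = ||W_i||^2 / sum_j (unit(W_i) . W_j)^2, with columns[j] = W_j."""
    w_i = columns[i]
    norm_sq = dot(w_i, w_i)
    if norm_sq < eps:
        return 0.0  # the feature is not represented at all
    unit = [a / math.sqrt(norm_sq) for a in w_i]
    return norm_sq / sum(dot(unit, w_j) ** 2 for w_j in columns)

orthogonal = [[1.0, 0.0], [0.0, 1.0]]   # two features, two dimensions
antipodal = [[1.0], [-1.0]]             # two features, one dimension
dropped = [[1.0, 0.0], [0.0, 0.0]]      # second feature is ignored

dims_orthogonal = [feature_dimensionality(orthogonal, i) for i in range(2)]
dims_antipodal = [feature_dimensionality(antipodal, i) for i in range(2)]
dims_dropped = [feature_dimensionality(dropped, i) for i in range(2)]
```

The orthogonal pair gets dimensionality 1 each, the antipodal pair gets 1/2 each (summing to the single hidden dimension), and the dropped feature gets 0.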

The paper documents a striking rank deficiency in the trained $W^T W$ matrix when superposition is occurring: although $W$ has $d$ output dimensions, the effective representation of any single feature uses far less than a full dimension. This framing connects superposition to the older statistical concept of rank-deficient matrices and to compressed sensing, where similar geometric packing arguments arise.^[1]

Phase transitions

A central empirical finding is that the transition between representing a feature with a dedicated dimension, representing it in superposition with others, and not representing it at all has the character of a phase change rather than a smooth interpolation. As one varies sparsity or importance, the network's solution shifts discretely between qualitatively different geometric arrangements, with hysteresis-like behavior. The authors compare this to phase transitions in physical systems and note an analogy (qualitative, not formal) to the fractional quantum Hall effect, in which discrete plateaus correspond to specific filling fractions.^[1]

Polytope geometry

Perhaps the most visually striking finding of Toy Models is that, in the high-sparsity superposition regime, features arrange themselves into the vertex sets of uniform polytopes. For two features sharing a one-dimensional subspace, the natural arrangement is an antipodal pair, in which the two features point in exactly opposite directions: when one is active the other suffers a negative projection, which the ReLU then clips to zero. As more features share a subspace, the geometry transitions through digons, triangles, pentagons, tetrahedra, and higher-symmetry uniform polytopes such as the square antiprism. These structures emerge because they minimize the worst-case interference among features sharing a subspace.^[1] The fact that the geometry of features inside a trained network corresponds so cleanly to classical polytopes is unexpected, and it provides a small set of canonical geometric primitives in terms of which superposition can be analyzed.

How do features differ from neurons?

A consequence of superposition is that the natural basis of the network, the neuron basis, is generally a poor coordinate system for the features the network is actually computing with. When features are stored in superposition, each neuron's activation is a linear combination of many feature activations; conversely each feature is distributed across many neurons. This is sometimes referred to as the curse of polysemanticity for interpretation: even if a researcher could enumerate all features the network represents, those features would not align with the neurons one could examine.^[1]^[2]

This shift in unit of analysis is what motivates the feature-centric view of mechanistic interpretability that dominates current practice. Rather than asking what a neuron means, one asks what features the network represents, and only secondarily how those features happen to be encoded in the neuron basis. The neuron basis becomes, at best, a "privileged basis" introduced by elementwise nonlinearities such as ReLU, which do not commute with arbitrary rotations but do not in general correspond to monosemantic features either.^[1]

How do sparse autoencoders resolve superposition (Bricken et al., 2023)?

If features live in superposition along nearly-orthogonal directions, then in principle one could recover them by dictionary learning: finding an overcomplete basis $\{f_1, \dots, f_K\}$ with $K \gg d$ such that activation vectors decompose sparsely as $\sum_i c_i f_i$. Anthropic's October 2023 paper "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning" demonstrated that sparse autoencoders trained on transformer activations recover such a dictionary, and that the learned dictionary elements are substantially more monosemantic than the original neurons.^[2]^[6]

The lead authors are Trenton Bricken and Adly Templeton, with Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E. Burke, Tristan Hume, Shan Carter, Tom Henighan, and Christopher Olah.^[2] The work was released on October 5, 2023.^[6]

Setup

The authors train a one-layer transformer language model with a 512-neuron MLP. They then train a sparse autoencoder on the MLP activations of this transformer, collected over a large corpus of text. The autoencoder consists of an encoder that maps the 512-dimensional MLP activation to a much higher-dimensional sparse code (the "dictionary"), and a decoder that reconstructs the MLP activation from the sparse code. The training objective is the sum of a mean squared reconstruction error and an L1 penalty on the sparse code, which together encourage the autoencoder to use as few dictionary elements as possible to reconstruct each activation vector.^[2]
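The objective described above can be sketched in a few lines. This is a generic sparse autoencoder forward pass and loss under simplifying assumptions (a plain ReLU encoder, a linear decoder, and toy dimensions); the function names, weights, and the coefficient `l1_coeff` are hypothetical and do not reproduce the paper's exact architecture or hyperparameters.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def matvec(M, v):
    return [sum(m * a for m, a in zip(row, v)) for row in M]

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode an activation into a sparse code, then decode it back."""
    code = relu([p + b for p, b in zip(matvec(W_enc, x), b_enc)])
    x_hat = [p + b for p, b in zip(matvec(W_dec, code), b_dec)]
    return code, x_hat

def sae_loss(x, code, x_hat, l1_coeff):
    """Mean squared reconstruction error plus an L1 penalty on the code."""
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    l1 = sum(abs(c) for c in code)
    return mse + l1_coeff * l1

# Toy example: a 2-dimensional activation and a 3-element dictionary.
W_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]   # one row per dictionary element
b_enc = [0.0, 0.0, 0.0]
W_dec = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]]     # one column per dictionary element
b_dec = [0.0, 0.0]

x = [2.0, 0.0]
code, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, code, x_hat, l1_coeff=0.1)
```

Here a single dictionary element reconstructs the input exactly, so the loss consists only of the L1 term; raising `l1_coeff` pushes training toward codes with fewer and smaller active elements.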

With this setup the authors decompose the 512-neuron MLP layer into more than 4,000 features, and report that the great majority of these features pass interpretability tests that fail for the underlying neurons.^[2]^[6] The decomposition includes specialized features for DNA sequences, base64 encoded text, HTTP requests, legal language, Hebrew text, nutrition facts panels, and many more concrete categories of input.^[6]

Evidence for monosemanticity

The paper offers four converging lines of evidence that the learned dictionary elements are genuinely monosemantic features:^[2]

Detailed case studies of specific features, in which the authors examine the contexts in which a feature activates, intervene by activating or suppressing the feature, and trace its downstream effects on the model's logits.
Human evaluation of a large random sample of features, in which raters were asked to identify a coherent activating concept.
Automated interpretability scoring of features by an external language model based on activation contexts.
Automated interpretability scoring based on the feature's effect on output logits.

The authors interpret these results as showing that sparse autoencoders resolve superposition, in the sense that they recover individual features from superposed activation vectors and present them as a more interpretable unit of analysis than the underlying neurons.

What is computation in superposition?

A conceptually distinct topic, often confused with feature superposition, is computation in superposition. Where feature superposition concerns how a network represents information at a given layer, computation in superposition concerns how a network performs nonlinear operations on those superposed features without first projecting them into an enlarged orthogonal basis.^[1]^[7]

The Toy Models paper discusses an early version of this idea, showing that a network can compute simple functions such as logical AND or absolute value on inputs that are themselves represented in superposition, at the cost of a controlled amount of interference. The intuition is that, in the same way the ReLU in the toy decoder filters out negative interferences in the representation, a ReLU applied to a linear combination of superposed features can implement a nonlinear computation that effectively operates on the underlying features rather than the activations directly.^[1]
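The two example functions have well-known single-ReLU-layer forms, shown below as a minimal sketch on clean scalar inputs. The snippet does not itself place the inputs in superposition; it only illustrates the ReLU identities that such constructions build on, and the function names are illustrative.

```python
def relu(z):
    return max(0.0, z)

def and_gate(a, b):
    """Logical AND of two {0, 1}-valued inputs using a single ReLU."""
    return relu(a + b - 1.0)

def absolute_value(x):
    """|x| written as the sum of two ReLU units with opposite-sign weights."""
    return relu(x) + relu(-x)

truth_table = {(a, b): and_gate(a, b) for a in (0.0, 1.0) for b in (0.0, 1.0)}
magnitudes = [absolute_value(v) for v in (-2.5, 0.0, 1.5)]
```

When the inputs are read out of a superposed representation they carry small interference terms, which pass through these units and show up as the controlled error mentioned above.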

The 2024 paper "Mathematical Models of Computation in Superposition" by Kaarel Hänni, Jake Mendel, Dmitry Vaintrob, and Lawrence Chan develops this idea formally.^[7] The authors construct one-layer MLPs that emulate Boolean circuits operating on $m$ inputs in superposition, and show that all $\binom{m}{2}$ pairwise ANDs of $m$ features can be computed by a network with only $\tilde{O}(m^{2/3})$ hidden neurons, even when the inputs and outputs are themselves in superposition. They also show that with error-correcting intermediate layers, deeper circuits can be emulated as long as they have polynomial depth. A complementary 2024 work by Micah Adler at MIT, "On the Complexity of Neural Computation in Superposition," establishes information-theoretic lower bounds for similar computations.^[8]

These results substantially complicate the picture of interpretability. Even if sparse autoencoders recover the features that a network represents at each layer, understanding how the network computes with those features inside an MLP block may require disentangling structures that are intrinsically not aligned to the feature basis. A network performing computation in superposition is, in effect, simulating a much wider network, and reverse-engineering its mechanisms requires recovering that virtual width.

What geometric structures appear in toy models?

The catalog of polytope-shaped feature configurations identified in Toy Models is a distinctive contribution of the paper and is worth surveying in slightly more detail.^[1] The table below summarizes the canonical arrangements, ordered by the number of features sharing a subspace.

| Configuration | Features | Subspace dim | Arrangement |
| --- | --- | --- | --- |
| Digon (antipodal pair) | 2 | 1 | Two features point in exactly opposite directions; ReLU clips the negative interference. |
| Triangle | 3 | 2 | Three features at 120-degree angles, equalizing pairwise interference. |
| Tetrahedron | 4 | 3 | Four features at the vertices of a regular tetrahedron, the maximally symmetric four-feature configuration. |
| Pentagon | 5 | 2 | Five features in a regular pentagon at maximum equidistant spacing. |
| Square antiprism | 8 | 3 | Eight features at the vertices of a square antiprism, a higher-count pattern with more complex symmetry. |

Digon (line segment with two opposite vertices). The simplest superposition arrangement, in which two features share a one-dimensional subspace and point in opposite directions. The ReLU suppresses the negative interference that arises when one feature is active. This is the antipodal pair.

Triangle. Three features share a two-dimensional subspace and arrange themselves at 120-degree angles, equalizing pairwise interference among them.

Tetrahedron. Four features arrange themselves at the vertices of a regular tetrahedron in three dimensions, the maximally symmetric four-feature configuration.

Pentagon. Five features arrange in a regular pentagon in two dimensions, sharing a single two-dimensional subspace at maximum equidistant spacing.

Square antiprism. Eight features arrange at the vertices of a square antiprism in three dimensions, providing a higher-feature-count superposition pattern with more complex geometric symmetry.

The empirical observation is that as one increases the number of features sharing a subspace, the arrangement that emerges is typically a uniform polytope (non-uniform importance or sparsity can deform it); the network discovers these geometric optima through gradient descent. The space of arrangements is discrete (the set of uniform polytopes of a given dimension is countable), and which polytope is chosen depends on the relative importance of features and the ambient sparsity. The fact that gradient descent reliably finds these classically symmetric geometric solutions in a high-dimensional optimization landscape is among the more surprising findings in the paper.^[1]
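The interference structure of the simpler arrangements can be inspected directly. The sketch below builds unit vectors for the digon, triangle, tetrahedron, and pentagon and computes, for each, the largest positive pairwise cosine, that is, the interference a bias-free ReLU would not clip when a single feature is active. The coordinates are standard constructions chosen for this example, and the square antiprism is omitted for brevity.

```python
import math

def regular_polygon(k):
    """Unit vectors at the k vertices of a regular polygon in the plane."""
    return [(math.cos(2 * math.pi * j / k), math.sin(2 * math.pi * j / k))
            for j in range(k)]

# Regular tetrahedron from alternating cube corners, scaled to unit length.
TETRAHEDRON = [tuple(c / math.sqrt(3) for c in v)
               for v in [(1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1)]]

def pairwise_cosines(vectors):
    return [sum(a * b for a, b in zip(u, v))
            for i, u in enumerate(vectors) for v in vectors[i + 1:]]

arrangements = {
    "digon": [(1.0,), (-1.0,)],
    "triangle": regular_polygon(3),
    "tetrahedron": TETRAHEDRON,
    "pentagon": regular_polygon(5),
}

# Largest positive cosine: interference that survives a bias-free ReLU.
worst_positive = {name: max(0.0, max(pairwise_cosines(vs)))
                  for name, vs in arrangements.items()}
```

For the digon, triangle, and tetrahedron every pairwise cosine is negative, so the value is zero; the pentagon has adjacent vertices at 72 degrees and therefore some positive interference, which in the toy model would have to be absorbed by the bias term.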

What are the open questions about superposition?

Despite the experimental success of sparse autoencoders in extracting interpretable features, several open questions remain about superposition and its resolution.

Are SAE features the "true" features? Sparse autoencoders identify a basis of directions in activation space such that activations decompose sparsely. But these directions are determined by the autoencoder's architecture, dictionary size, and L1 penalty; different choices produce different feature sets. An ICLR 2025 paper, "Sparse Autoencoders Do Not Find Canonical Units of Analysis," argues that the recovered dictionaries are not robust to these design choices and so do not necessarily reveal a unique ground truth.^[9] Whether neural networks have a privileged set of features at all, in a basis-independent sense, remains a live theoretical question.

Dead features and feature splitting. Empirically, a substantial fraction of an SAE's dictionary elements never activate after training ("dead features"), and the dictionary structure changes dramatically with dictionary size: increasing the dictionary's expansion factor causes coarse features to "split" into more specific sub-features, with no clear stopping point.^[10] These observations suggest that the SAE method is more like a compression schedule than a discovery procedure for a fixed underlying ontology.

Feature interactions and computation. Even granted a dictionary of features at each layer, understanding the network's computation requires tracing how features at one layer activate features at the next. This is the project of attribution graphs and circuit tracing, surveyed below. The success of this program depends both on the quality of the underlying features and on the structure of computation in superposition.

Non-linear and multi-dimensional features. The linear representation hypothesis underpins much of this work, but recent papers including "Not All Language Model Features Are One-Dimensional" (ICLR 2025) argue that some features are intrinsically multi-dimensional, occupying low-rank subspaces rather than rays.^[11] This requires generalizations of sparse autoencoder methods.

Distinguishing real features from data artifacts. Because SAE features are discovered from data, there is a persistent worry that some recovered features reflect dataset artifacts (such as base64 sequences or HTML markup) rather than mechanisms in the model. Distinguishing data features from compute features is an active topic of investigation.

How does superposition connect to dictionary learning?

The recovery of features from superposition is, mathematically, an instance of dictionary learning, a problem with a long history in signal processing and statistics. In the standard dictionary learning setup, one observes samples $x \in \mathbb{R}^d$ assumed to be sparse linear combinations of an unknown overcomplete basis $\{f_1, \dots, f_K\}$, and seeks to recover both the basis and the sparse coefficients. Sparse autoencoders are a specific neural-network-based approach to this problem, in which the encoder and decoder are jointly trained by gradient descent with an L1 sparsity penalty.^[2]
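One common way to write the standard problem, given here as a sketch rather than the formulation of any particular cited paper, is as an L1-regularized reconstruction objective over samples $x^{(t)}$. The sample index $t$, the penalty weight $\lambda$, and the unit-norm constraint on dictionary elements are conventions assumed for this example:

$$
\min_{\{f_i\},\, \{c^{(t)}\}} \; \sum_{t} \Big\| x^{(t)} - \sum_{i=1}^{K} c^{(t)}_i f_i \Big\|_2^2 \;+\; \lambda \sum_{t} \big\| c^{(t)} \big\|_1 \quad \text{subject to } \|f_i\|_2 = 1 .
$$

A sparse autoencoder replaces the per-sample optimization over $c^{(t)}$ with a learned encoder that predicts the coefficients in a single forward pass.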

The application of dictionary learning to neural network activations did not originate solely with the Anthropic work. In particular, the September 2023 paper "Sparse Autoencoders Find Highly Interpretable Features in Language Models" by Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey, released contemporaneously with the Anthropic effort, applied the same methodology to GPT-style language models and reached qualitatively similar conclusions.^[12] The convergence of these independent results lent credibility to the approach as a general technique for recovering superposed features.

Since 2023, dictionary learning of features has become the dominant practical methodology in mechanistic interpretability. Variants include top-k sparse autoencoders (which replace the L1 penalty with a hard top-$k$ selection), JumpReLU SAEs (which use a thresholded activation function), gated SAEs, and transcoders (which predict the next layer's activations rather than reconstructing the current layer's). These methods trade off reconstruction quality, dictionary size, and the interpretability of the recovered features in different ways, but all share the basic strategy of treating the model's activations as sparse combinations of an overcomplete feature dictionary.
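The difference between these variants lies mostly in the encoder's activation function. The sketch below contrasts a plain ReLU with a top-$k$ selection and a JumpReLU-style threshold on the same pre-activations. It is a simplified illustration: the function names are hypothetical, the threshold is a single fixed scalar rather than a learned per-feature parameter, and clipping the top-$k$ survivors at zero is a choice made for this example.

```python
def relu(pre):
    return [max(0.0, p) for p in pre]

def top_k_activation(pre, k):
    """Keep the k largest pre-activations (clipped at zero); zero the rest."""
    keep = set(sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[:k])
    return [max(0.0, p) if i in keep else 0.0 for i, p in enumerate(pre)]

def jump_relu(pre, threshold):
    """Pass a pre-activation through unchanged only if it exceeds the threshold."""
    return [p if p > threshold else 0.0 for p in pre]

pre_activations = [0.9, -0.3, 0.05, 0.4]
relu_code = relu(pre_activations)
top_k_code = top_k_activation(pre_activations, 2)
jump_code = jump_relu(pre_activations, 0.5)
```

The ReLU code keeps the weak 0.05 activation, the top-$k$ code fixes the number of active elements at two, and the thresholded code keeps only activations above 0.5, so sparsity is controlled by $k$ or by the threshold rather than by an L1 coefficient.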

How does superposition scale to frontier models?

The 2024 to 2026 period has seen rapid extensions of the superposition and dictionary learning framework to larger models and more sophisticated analyses.

Scaling to frontier models

Anthropic's May 2024 paper "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet" by Adly Templeton and colleagues trained sparse autoencoders on the residual stream of Claude 3 Sonnet at three scales, with dictionaries of approximately 1 million, 4 million, and 34 million features.^[13] At the largest scale, roughly 12 million of the 34 million dictionary elements were "alive" (activated on the training data), with the remainder dead features.^[13] The work demonstrated that the methodology scales to production language models and that the recovered features are abstract, multilingual, and multimodal. The most-cited finding is a feature that responds to the Golden Gate Bridge across textual descriptions in multiple languages and images of the bridge; clamping this feature to roughly ten times its maximum observed activation caused the model to describe itself in terms of the bridge, the basis of the public "Golden Gate Claude" demonstration.^[13] The paper also documents features related to deception, code vulnerabilities, sycophancy, and other safety-relevant behaviors.^[13]
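Feature clamping of this kind can be sketched as a linear edit to the activation vector. The snippet assumes a linear decoder, so that changing one feature's activation from its current value to a target value shifts the activation along that feature's decoder direction. The function name, vectors, and the "maximum observed" value are hypothetical, and this is not Anthropic's implementation.

```python
def clamp_feature(activation, code, decoder_directions, index, target):
    """Shift the activation so that feature `index` contributes `target`
    instead of its current code value, leaving other features untouched."""
    delta = target - code[index]
    direction = decoder_directions[index]
    return [a + delta * d for a, d in zip(activation, direction)]

# Toy 2-dimensional example with two dictionary elements.
activation = [1.0, 2.0]
code = [0.5, 0.0]                               # current feature activations
decoder_directions = [[1.0, 0.0], [0.0, 1.0]]   # one direction per feature

max_observed = 0.8                              # hypothetical maximum activation
steered = clamp_feature(activation, code, decoder_directions,
                        index=0, target=10 * max_observed)
```

Only the component along the clamped feature's direction changes, which is what allows a single feature to be amplified far beyond its natural range while the rest of the representation is left as it was.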

Crosscoders

A related line of work introduced in late 2024 generalizes sparse autoencoders to operate across multiple layers (or across multiple models) simultaneously. These crosscoders, introduced in "Sparse Crosscoders for Cross-Layer Features and Model Diffing" on the Transformer Circuits Thread, learn dictionaries that explain the activations of several layers jointly, accommodating the observation that some features persist across layers in the residual stream and so should not be re-learned independently at each layer.^[14] Crosscoders also enable model diffing: training a single dictionary that explains the activations of two related models (for example, a base model and a fine-tuned version) and reading off which features were introduced or modified by fine-tuning. This addresses a more subtle form of superposition, sometimes called cross-layer superposition, in which a feature is distributed not only across neurons within a layer but also across layers.^[14]

Circuit tracing and attribution graphs

The Anthropic paper "Circuit Tracing: Revealing Computational Graphs in Language Models" (2025), with a companion paper "On the Biology of a Large Language Model," used cross-layer transcoders, a related dictionary learning method, to build attribution graphs that trace the chain of feature activations through which a model produces a specific output.^[15] These graphs reveal interpretable multi-step mechanisms for behaviors such as planning, multi-hop reasoning, and refusal. The authors also open-sourced a circuit-tracing library hosted on Neuronpedia.^[15] Attribution graphs operate over the feature basis recovered by dictionary learning rather than the neuron basis, illustrating how the resolution of representational superposition enables a deeper analysis of the network's computation.

Continuing theoretical work

In parallel with the experimental program, theoretical work has continued to refine the picture of superposition. "Polysemanticity and Capacity in Neural Networks" (Scherlis et al., 2022) developed an analytic framework for the phase diagram of representations in toy models, defining a notion of per-feature capacity and computing it as a function of feature importance and sparsity.^[16] The 2024 work on mathematical models of computation in superposition, mentioned above, provides upper bounds on the efficiency of superposed Boolean computation, and a 2025 paper "Sparsity and Superposition in Mixture of Experts" extended the analysis to MoE architectures, finding that MoE models display more continuous transitions and greater monosemanticity than dense models of comparable size.^[17]

Is superposition the same as quantum superposition?

No. The word "superposition" is shared with quantum mechanics, where it refers to a state of a quantum system being a linear combination of basis states. The use of the term in mechanistic interpretability is unrelated to the quantum-mechanical concept: there is no quantum analogy intended beyond the shared metaphor of linear combination, and the mathematical objects involved (real-valued activation vectors, classical optimization) have no quantum content.^[1] The choice of name reflects the simpler analogy to the linear superposition of waves or vectors, in which two or more components combine additively in a shared medium.

Within mechanistic interpretability, the term "superposition" is used in at least two distinct senses, which should be carefully distinguished:^[1]^[7]

Features-in-superposition (representational superposition): the network represents more features than it has dimensions at a given layer by assigning them to nearly-orthogonal directions. This is the primary topic of Toy Models of Superposition and of dictionary-learning approaches.
Computation in superposition: the network performs nonlinear computations on superposed features without first projecting them into an orthogonal basis. This concerns the implementation of operations such as Boolean functions on packed inputs and is studied in the work of Hänni, Mendel, Vaintrob and Chan, Adler, and others.

These notions are related (computation in superposition presupposes features in superposition) but distinct, and the techniques developed for analyzing one do not automatically extend to the other.

Why does superposition matter for interpretability research?

Superposition has reshaped the methodology of mechanistic interpretability in several concrete ways.^[1]^[2]^[13]^[15] It has shifted the unit of analysis from neurons to features recovered by dictionary learning; it has motivated the construction of large-scale sparse autoencoder infrastructure for frontier language models; it has reframed the project of reverse-engineering a network in terms of attribution graphs over features rather than neurons; and it has imposed limits on what can be expected from purely "look at the neurons" approaches, including activation patching and direct logit attribution when these are applied neuron by neuron, which are blind to representations not aligned with the neuron basis.

Superposition is also a load-bearing assumption in safety-relevant applications such as activation steering, feature monitoring for deception, and probing for undesired model behaviors. To the extent that safety-relevant features are stored in superposition, identifying them requires the same dictionary-learning machinery developed for general interpretability, and the practical effectiveness of interpretability-based safety methods is tied to the success of feature recovery.^[13]

References

Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., Hatfield-Dodds, Z., Lasenby, R., Drain, D., Chen, C., Grosse, R., McCandlish, S., Kaplan, J., Amodei, D., Wattenberg, M., & Olah, C. (2022). Toy Models of Superposition. *Transformer Circuits Thread*. https://transformer-circuits.pub/2022/toy_model/index.html (Accessed 2026-05-19).
Bricken, T., Templeton, A., Batson, J., Chen, B., Jermyn, A., Conerly, T., Turner, N., Anil, C., Denison, C., Askell, A., Lasenby, R., Wu, Y., Kravec, S., Schiefer, N., Maxwell, T., Joseph, N., Hatfield-Dodds, Z., Tamkin, A., Nguyen, K., McLean, B., Burke, J. E., Hume, T., Carter, S., Henighan, T., & Olah, C. (2023). Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. *Transformer Circuits Thread*. https://transformer-circuits.pub/2023/monosemantic-features/index.html (Accessed 2026-05-19).
Olah, C., Cammarata, N., Schubert, L., Goh, G., Petrov, M., & Carter, S. (2020). Zoom In: An Introduction to Circuits. *Distill*. https://distill.pub/2020/circuits/zoom-in/ (Accessed 2026-05-19).
Park, K., Choe, Y. J., & Veitch, V. (2023). The Linear Representation Hypothesis and the Geometry of Large Language Models. arXiv preprint 2311.03658. https://arxiv.org/abs/2311.03658 (Accessed 2026-05-19).
Elhage, N., et al. (2022). Toy Models of Superposition (arXiv version). arXiv preprint 2209.10652. https://arxiv.org/abs/2209.10652 (Accessed 2026-05-19).
Anthropic (2023). Towards Monosemanticity: Decomposing Language Models With Dictionary Learning (announcement). https://www.anthropic.com/research/towards-monosemanticity-decomposing-language-models-with-dictionary-learning (Accessed 2026-05-19).
Hänni, K., Mendel, J., Vaintrob, D., & Chan, L. (2024). Mathematical Models of Computation in Superposition. arXiv preprint 2408.05451. https://arxiv.org/abs/2408.05451 (Accessed 2026-05-19).
Adler, M. (2024). On the Complexity of Neural Computation in Superposition. arXiv preprint 2409.15318. https://arxiv.org/abs/2409.15318 (Accessed 2026-05-19).
Leask, P., et al. (2025). Sparse Autoencoders Do Not Find Canonical Units of Analysis. *Proceedings of ICLR 2025*. https://proceedings.iclr.cc/paper_files/paper/2025/file/84ca3f2d9d9bfca13f69b48ea63eb4a5-Paper-Conference.pdf (Accessed 2026-05-19).
Templeton, A., et al. (2024). Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet (sections on feature splitting and dead features). *Transformer Circuits Thread*. https://transformer-circuits.pub/2024/scaling-monosemanticity/ (Accessed 2026-05-19).
Engels, J., et al. (2025). Not All Language Model Features Are One-Dimensionally Linear. *Proceedings of ICLR 2025*. https://proceedings.iclr.cc/paper_files/paper/2025/file/d3221cdb27e49d9c1cd35ad254feccfe-Paper-Conference.pdf (Accessed 2026-05-19).
Cunningham, H., Ewart, A., Riggs, L., Huben, R., & Sharkey, L. (2023). Sparse Autoencoders Find Highly Interpretable Features in Language Models. arXiv preprint 2309.08600. https://arxiv.org/abs/2309.08600 (Accessed 2026-05-19).
Templeton, A., Conerly, T., Marcus, J., Lindsey, J., Bricken, T., Chen, B., Pearce, A., Citro, C., Ameisen, E., Jones, A., Cunningham, H., Turner, N. L., McDougall, C., MacDiarmid, M., Tamkin, A., Durmus, E., Hume, T., Mosconi, F., Freeman, C. D., Sumers, T. R., Rees, E., Batson, J., Jermyn, A., Carter, S., Olah, C., & Henighan, T. (2024). Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. *Transformer Circuits Thread*. https://transformer-circuits.pub/2024/scaling-monosemanticity/ (Accessed 2026-05-19).
Lindsey, J., et al. (2024). Sparse Crosscoders for Cross-Layer Features and Model Diffing. *Transformer Circuits Thread*. https://transformer-circuits.pub/2024/crosscoders/index.html (Accessed 2026-05-19).
Ameisen, E., Lindsey, J., et al. (2025). Circuit Tracing: Revealing Computational Graphs in Language Models. *Transformer Circuits Thread*. https://transformer-circuits.pub/2025/attribution-graphs/methods.html (Accessed 2026-05-19).
Scherlis, A., Sachan, K., Jermyn, A., Benton, J., & Shlegeris, B. (2022). Polysemanticity and Capacity in Neural Networks. arXiv preprint 2210.01892. https://arxiv.org/abs/2210.01892 (Accessed 2026-05-19).
Anonymous (2025). Sparsity and Superposition in Mixture of Experts. arXiv preprint 2510.23671. https://arxiv.org/abs/2510.23671 (Accessed 2026-05-19).


