Toy Models of Superposition

Updated Jun 28, 2026

Toy Models of Superposition is a September 2022 mechanistic interpretability paper from Anthropic that shows how a neural network can represent more features than it has dimensions by packing them into overlapping, non-orthogonal directions, a phenomenon the authors call superposition. Using small, fully understood "toy" networks trained on synthetic sparse data, the paper demonstrates that superposition is what makes individual neurons polysemantic, and it reports three central results: the existence of a phase change, a connection to the geometry of uniform polytopes, and a link to adversarial examples ^[1]^[2]. It is one of the most cited works in modern interpretability and set the research agenda that led to later sparse-autoencoder and dictionary-learning work on monosemanticity.

The paper was published on the Transformer Circuits Thread on September 14, 2022, by Nelson Elhage, Tristan Hume, Catherine Olsson and colleagues, with Christopher Olah as the senior author. A version with sixteen listed authors was posted to arXiv (2209.10652) on September 21, 2022 ^[1]^[2]. The abstract states: "This paper provides a toy model where polysemanticity can be fully understood, arising as a result of models storing additional sparse features in 'superposition.'" ^[2]

What is "Toy Models of Superposition"?

The paper's stated goal is to explain polysemanticity, the long-observed fact that an individual neuron in a trained network often responds to several apparently unrelated inputs. Rather than treat this as noise, Elhage et al. argue that it is the visible side effect of a deliberate compression strategy: when the underlying features a network would like to represent outnumber its available dimensions, the network stores some of them in superposition, overlaying multiple feature directions on the same set of neurons. The tradeoff is interference, since reading off one feature picks up small contributions from the others, and the network must use nonlinearities to clean this up.

Because superposition is hard to study directly inside a large language model, the authors built small, fully understood "toy" models with synthetic data where the ground-truth features are known by construction. This lets them watch superposition appear and disappear as they vary the conditions, and characterize it precisely. The abstract summarizes the three main results: "We demonstrate the existence of a phase change, a surprising connection to the geometry of uniform polytopes, and evidence of a link to adversarial examples." ^[2]

What is superposition?

The paper works within the broader interpretability view that the meaningful units inside a network are features, treated as directions in activation space rather than individual neurons. A network that had enough dimensions could in principle give each feature its own orthogonal direction, making neurons cleanly interpretable. Two ideas push against this.

The first is that real-world features are numerous and the network has limited width, so it cannot afford a dedicated dimension for every feature it might find useful. The second is sparsity: many features occur only rarely, so on any given input most are inactive. The superposition hypothesis is that under these conditions a network behaves roughly like a compressed sensing system, exploiting sparsity to encode more features than dimensions and accepting occasional interference as the price ^[1]. Two properties of a feature drive the outcome:

Property	Meaning in the paper	Effect
Importance	How much a feature contributes to reducing the loss, modeled as a weight on its reconstruction error	More important features are likelier to get a dedicated, low-interference representation
Sparsity	How rarely a feature is active (high sparsity = mostly zero)	Higher sparsity makes superposition more attractive, since collisions between active features are rare
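
The effect of sparsity on collisions can be made concrete with a little arithmetic. The sketch below is only an illustration: it assumes every feature is independently zero with probability `sparsity`, and the function name `collision_probability` is hypothetical rather than anything from the paper. It computes the chance that, when one feature is active, at least one of the other features is active at the same time.

```python
def collision_probability(n_features: int, sparsity: float) -> float:
    """Chance that at least one *other* feature is active alongside a given one.

    Illustrative assumption: each feature is independently zero with
    probability `sparsity`, so all n_features - 1 others are silent with
    probability sparsity ** (n_features - 1).
    """
    return 1.0 - sparsity ** (n_features - 1)


# Sweep sparsity for a five-feature model: collisions become rarer as
# sparsity rises, which is what makes sharing directions affordable.
for s in (0.0, 0.5, 0.9, 0.99):
    print(f"sparsity={s:.2f}  P(collision)={collision_probability(5, s):.4f}")
```

With dense features a collision is certain, while at high sparsity an active feature usually has the hidden space to itself, so interference is paid only occasionally.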

How does the toy model work?

The central toy model is a small autoencoder-like network with a linear bottleneck. A high-dimensional sparse input vector of features is projected down to a low-dimensional hidden space by a weight matrix W, then mapped back up by W transpose followed by a bias and a ReLU nonlinearity, so the reconstruction is ReLU(W^T W x + b) ^[1]. The hidden space is the bottleneck where superposition must happen if it happens at all. Features are generated independently, are nonnegative, and are made sparse by setting each to zero with some probability; each feature is also assigned an importance that scales its term in the mean-squared-error loss.
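
The forward pass and loss described above are small enough to write out directly. What follows is a minimal sketch in plain Python, not the authors' code. The helper names, the hand-picked weights, the uniform draw for active features, and the `0.7 ** i` importance schedule are all assumptions made for illustration; the article itself specifies only that features are nonnegative, independently sparse, and weighted by importance.

```python
import random


def relu(v):
    """Elementwise max(0, v)."""
    return [max(0.0, a) for a in v]


def forward(W, b, x):
    """Return ReLU(W^T W x + b). W has m rows (hidden dims) and n columns (features)."""
    h = [sum(w * a for w, a in zip(row, x)) for row in W]  # h = W x
    pre = [sum(W[k][i] * h[k] for k in range(len(W))) + b[i] for i in range(len(x))]
    return relu(pre)


def sample_features(n, sparsity, rng):
    """Each feature is zero with probability `sparsity`, otherwise uniform in [0, 1)."""
    return [0.0 if rng.random() < sparsity else rng.random() for _ in range(n)]


def weighted_error(x, x_hat, importance):
    """Importance-weighted squared reconstruction error for a single input."""
    return sum(imp * (a - r) ** 2 for imp, a, r in zip(importance, x, x_hat))


# Hypothetical setup: 5 features, 2 hidden dimensions, hand-picked weights that
# keep features 0 and 1 on orthogonal axes and drop the rest.
W = [[1.0, 0.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0, 0.0]]
b = [0.0] * 5
importance = [0.7 ** i for i in range(5)]  # illustrative decay, an assumption

x = [0.5, 0.2, 0.7, 0.0, 0.0]
x_hat = forward(W, b, x)
print(x_hat)                                 # features 2..4 are not reconstructed
print(weighted_error(x, x_hat, importance))  # error comes only from feature 2

rng = random.Random(0)
print(sample_features(5, sparsity=0.9, rng=rng))
```

A training loop would adjust `W` and `b` to minimize this error averaged over many sampled inputs; that step is omitted here to keep the sketch short.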

A representative configuration projects five features into two hidden dimensions. With no sparsity the model behaves like principal component analysis, keeping only the two most important features along orthogonal axes and discarding the rest. As sparsity rises, the picture changes: the model starts representing more than two features in the two-dimensional space, with the less important features the first to be folded into superposition ^[1]. The ReLU and bias are what make this usable, since the nonlinearity suppresses the small negative interference terms that superposition introduces. The authors also note that simple computation, not just storage, can be carried out while features remain in superposition.
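
To see the interference that the ReLU has to clean up, consider five features packed into two dimensions as evenly spaced unit vectors. This is a hand-built arrangement for illustration, not trained weights, and the helper names are hypothetical. The sketch activates a single feature and compares the linear readout W^T W x with the same readout after the ReLU, using a zero bias.

```python
import math


def polygon_weights(n):
    """2 x n weight matrix whose columns are n unit vectors spaced evenly on a circle."""
    angles = [2 * math.pi * k / n for k in range(n)]
    return [[math.cos(a) for a in angles], [math.sin(a) for a in angles]]


def linear_readout(W, x):
    """Return W^T W x, the reconstruction before bias and ReLU."""
    h = [sum(w * a for w, a in zip(row, x)) for row in W]
    return [sum(W[k][i] * h[k] for k in range(len(W))) for i in range(len(x))]


def relu(v):
    """Elementwise max(0, v)."""
    return [max(0.0, a) for a in v]


W = polygon_weights(5)
x = [1.0, 0.0, 0.0, 0.0, 0.0]            # only feature 0 is active
raw = linear_readout(W, x)
print([round(r, 3) for r in raw])         # feature 0 plus interference on the others
print([round(r, 3) for r in relu(raw)])   # negative interference is clipped to zero
```

The two features opposite the active one receive negative interference, which the ReLU removes entirely. The two neighbouring features still show a small positive overlap in this zero-bias sketch.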

What did the paper find about phase changes and geometry?

Two results give the paper much of its lasting influence.

Whether a given feature is represented on its own dedicated direction or pushed into superposition is governed by a phase change. Holding importance fixed and sweeping sparsity, a feature switches fairly abruptly between three regimes: not represented at all, represented alone, or represented in superposition with others. The authors map this as a phase diagram over importance and sparsity, and the sharpness of the transitions is what makes the term "phase change" apt rather than metaphorical ^[1].

The geometry of superposition is the more striking finding. When several features share a low-dimensional space, they do not land in arbitrary positions. They arrange themselves into regular geometric figures that spread the directions out to minimize interference, the same uniform polytopes that appear in classical geometry ^[2]. In the five-feature, two-dimensional case the model passes through configurations such as antipodal pairs (two features placed on exactly opposite directions, used when interference must be kept low) and a regular pentagon (all five features placed symmetrically when sparsity is high enough to tolerate the resulting overlap). In higher hidden dimensions the authors observe digons, triangles, tetrahedra and other polytopes, and they describe how larger arrangements decompose into independent lower-dimensional pieces, a structure they relate to tegum products. To quantify all this they introduce a notion of feature dimensionality, the fraction of a hidden dimension effectively allocated to a given feature: a feature on its own dedicated direction has dimensionality 1, while a feature sharing an antipodal pair has dimensionality 1/2, and other geometries take their own characteristic fractional values ^[1]^[3].
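
The fractional values quoted above can be checked numerically. The sketch below assumes the definition of feature dimensionality from the paper, D_i = ||W_i||^2 / sum_j (Ŵ_i · W_j)^2, where W_i is the hidden-space direction of feature i and Ŵ_i is its unit vector. This article does not state the formula, so treat it as an assumption. The weight arrangements are hand-built rather than learned.

```python
import math


def feature_dimensionality(W_cols, i):
    """D_i = ||W_i||^2 / sum_j (W_i_hat . W_j)^2, with W_cols[j] the direction of feature j."""
    w_i = W_cols[i]
    norm_sq = sum(a * a for a in w_i)
    unit = [a / math.sqrt(norm_sq) for a in w_i]
    # The sum over j includes j == i, so a feature alone on its axis scores exactly 1.
    denom = sum(sum(u * a for u, a in zip(unit, w_j)) ** 2 for w_j in W_cols)
    return norm_sq / denom


dedicated = [[1.0, 0.0], [0.0, 1.0]]  # two features on orthogonal axes
antipodal = [[1.0], [-1.0]]           # two features sharing one dimension
pentagon = [[math.cos(2 * math.pi * k / 5), math.sin(2 * math.pi * k / 5)]
            for k in range(5)]        # five features sharing two dimensions

print(feature_dimensionality(dedicated, 0))  # 1.0
print(feature_dimensionality(antipodal, 0))  # 0.5
print(round(feature_dimensionality(pentagon, 0), 6))
```

Under this definition the pentagon gives each feature two fifths of a dimension, consistent with five features dividing two hidden dimensions evenly.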

What is polysemanticity and how does superposition explain it?

The toy models give a mechanistic account of polysemanticity, which the paper introduces as the phenomenon where "neural networks often pack many unrelated concepts into a single neuron" ^[2]. A neuron looks polysemantic precisely because it participates in several superposed feature directions at once, so it activates for whichever of those features is present. In the regime with no superposition the same neurons are monosemantic, each tracking a single feature. Polysemanticity is therefore not an intrinsic property of neurons but a consequence of the network choosing to compress more features than it has room for. This reframing matters for interpretability: it suggests that the obstacle to reading a network is not that its concepts are inherently entangled, but that they have been linearly superimposed and could in principle be recovered.
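
A short sketch makes the contrast between the two regimes visible. It treats each hidden coordinate as a "neuron" purely for illustration and uses hand-built weights rather than trained ones; the function names are hypothetical. For each weight matrix it lists which one-hot features move a given hidden coordinate.

```python
import math


def hidden_activations(W, x):
    """h = W x for a weight matrix with one row per hidden coordinate."""
    return [sum(w * a for w, a in zip(row, x)) for row in W]


def features_driving_neuron(W, neuron, tol=1e-9):
    """Indices of features whose one-hot input changes the given hidden coordinate."""
    n = len(W[0])
    driving = []
    for i in range(n):
        one_hot = [1.0 if j == i else 0.0 for j in range(n)]
        if abs(hidden_activations(W, one_hot)[neuron]) > tol:
            driving.append(i)
    return driving


# No superposition: features 0 and 1 each get their own coordinate.
dedicated = [[1.0, 0.0, 0.0, 0.0, 0.0],
             [0.0, 1.0, 0.0, 0.0, 0.0]]

# Superposition: five unit vectors spaced evenly in the same two dimensions.
angles = [2 * math.pi * k / 5 for k in range(5)]
pentagon = [[math.cos(a) for a in angles],
            [math.sin(a) for a in angles]]

print(features_driving_neuron(dedicated, 0))  # [0]
print(features_driving_neuron(pentagon, 0))   # [0, 1, 2, 3, 4]
```

In the dedicated case the coordinate is monosemantic. In the pentagon case the same coordinate responds to every feature, even though each feature still has a single well-defined direction.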

How did it influence later interpretability work?

The paper set up the problem that much of Anthropic's subsequent interpretability research tried to solve. It explicitly sketched candidate strategies for finding the hidden features that superposition obscures, including building models that avoid superposition through activation sparsity and using dictionary learning to recover an overcomplete feature basis from a model that already exhibits superposition ^[1]. The October 2023 paper Towards Monosemanticity pursued the dictionary-learning route, training a sparse autoencoder on the activations of a one-layer transformer to extract features that were far more interpretable than the raw neurons ^[4]. That line continued with Scaling Monosemanticity, which applied sparse autoencoders to the production model Claude 3 Sonnet and extracted millions of features. Sparse autoencoders became a standard tool in mechanistic interpretability, and the conceptual vocabulary used in the paper (features as directions, superposition, interference, and the importance and sparsity axes) is now routine in the field. Chris Olah, the paper's senior author, has continued to lead this research direction at Anthropic.

References

[1] Elhage, N., Hume, T., Olsson, C., et al. "Toy Models of Superposition." Transformer Circuits Thread, September 14, 2022. https://transformer-circuits.pub/2022/toy_model/index.html
[2] Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., Hatfield-Dodds, Z., Lasenby, R., Drain, D., Chen, C., Grosse, R., McCandlish, S., Kaplan, J., Amodei, D., Wattenberg, M., Olah, C. "Toy Models of Superposition." arXiv:2209.10652, submitted September 21, 2022. https://arxiv.org/abs/2209.10652
[3] Olah, C. "Toy Models of Superposition" (summary discussion). LessWrong / AI Alignment Forum, 2022. https://www.lesswrong.com/posts/CTh74TaWgvRiXnkS6/toy-models-of-superposition
[4] Bricken, T., Templeton, A., et al. "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning." Transformer Circuits Thread, October 2023. https://transformer-circuits.pub/2023/monosemantic-features
