Toy Models of Superposition
Last reviewed
Jun 3, 2026
Sources
4 citations
Review status
Source-backed
Revision
v1 · 1,338 words
Toy Models of Superposition is a 2022 research paper from Anthropic that studies how neural networks can represent more features than they have dimensions by packing those features into overlapping, non-orthogonal directions, a phenomenon the authors call superposition. It was published on the Transformer Circuits Thread on September 14, 2022, by Nelson Elhage, Tristan Hume, Catherine Olsson and colleagues, with Christopher Olah as the senior author, and a version was posted to arXiv (2209.10652) the following week [1][2]. The paper is one of the most cited works in modern mechanistic interpretability, and it framed the research agenda that led to later work on dictionary learning and monosemanticity.
The paper's stated goal is to explain polysemanticity, the long-observed fact that an individual neuron in a trained network often responds to several apparently unrelated inputs. Rather than treat this as noise, Elhage et al. argue that it is the visible side effect of a deliberate compression strategy: when the underlying features a network would like to represent outnumber its available dimensions, the network stores some of them in superposition, overlaying multiple feature directions on the same set of neurons. The tradeoff is interference, since reading off one feature picks up small contributions from the others, and the network must use nonlinearities to clean this up.
Because superposition is hard to study directly inside a large language model, the authors built small, fully understood "toy" models with synthetic data where the ground-truth features are known by construction. This lets them watch superposition appear and disappear as they vary the conditions and characterize it precisely. The abstract summarizes the three main results: the existence of a phase change, a connection to the geometry of uniform polytopes, and evidence linking superposition to adversarial examples [1].
The paper works within the broader interpretability view that the meaningful units inside a network are features, treated as directions in activation space rather than individual neurons. A network that had enough dimensions could in principle give each feature its own orthogonal direction, making neurons cleanly interpretable. Two ideas push against this.
The first is that real-world features are numerous and the network has limited width, so it cannot afford a dedicated dimension for every feature it might find useful. The second is sparsity: many features occur only rarely, so on any given input most are inactive. The superposition hypothesis is that under these conditions a network behaves roughly like a compressed sensing system, exploiting sparsity to encode more features than dimensions and accepting occasional interference as the price. Two properties of a feature drive the outcome:
| Property | Meaning in the paper | Effect |
|---|---|---|
| Importance | How much a feature contributes to reducing the loss, modeled as a weight on its reconstruction error | More important features are likelier to get a dedicated, low-interference representation |
| Sparsity | How rarely a feature is active (high sparsity = mostly zero) | Higher sparsity makes superposition more attractive, since collisions between active features are rare |
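The claim that collisions become rare at high sparsity can be checked with a little arithmetic. The sketch below assumes each of n features is independently zero with probability S (the sparsity) and computes the chance that at most one feature is active on a given input, which is the case where superposed features do not interfere with each other. The function name and the chosen sparsity values are illustrative, not taken from the paper.

```python
def prob_at_most_one_active(n_features: int, sparsity: float) -> float:
    """Probability that zero or one of n independent features is nonzero.

    Assumption: each feature is independently zero with probability
    `sparsity`, so it is active with probability 1 - sparsity.
    """
    p_active = 1.0 - sparsity
    p_none = sparsity ** n_features
    p_exactly_one = n_features * p_active * sparsity ** (n_features - 1)
    return p_none + p_exactly_one


# Sweep a few illustrative sparsity levels for five features.
for s in (0.0, 0.5, 0.9, 0.99):
    p = prob_at_most_one_active(5, s)
    print(f"sparsity={s:.2f}  P(at most one active)={p:.4f}")
```

With five features, the probability of a collision-free input rises from 0 when every feature is always active to more than 0.9 at a sparsity of 0.9, which is why interference becomes a tolerable cost in the sparse regime.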
The central toy model is a small autoencoder-like network with a linear bottleneck. A high-dimensional sparse input vector of features is projected down to a low-dimensional hidden space by a weight matrix W, then mapped back up by W transpose followed by a bias and a ReLU nonlinearity, so the reconstruction is ReLU(W^T W x + b) [1]. The hidden space is the bottleneck where superposition must happen if it happens at all. Features are generated independently, are nonnegative, and are made sparse by setting each to zero with some probability; each feature is also assigned an importance that scales its term in the mean-squared-error loss.
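The setup described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' code: active features are drawn uniformly from [0, 1), the geometric importance schedule `0.7 ** i` is only an example, and the function names are hypothetical. The sketch covers data generation, the forward pass ReLU(W^T W x + b), and the importance-weighted squared error; it does not train the weights.

```python
import numpy as np


def sample_features(rng, batch, n_features, sparsity):
    """Each feature is 0 with probability `sparsity`, else uniform on [0, 1).

    The uniform distribution for active features is an assumption of this sketch.
    """
    values = rng.uniform(0.0, 1.0, size=(batch, n_features))
    active = rng.uniform(size=(batch, n_features)) >= sparsity
    return values * active


def reconstruct(W, b, x):
    """Compute ReLU(W^T W x + b) for a batch x of shape (batch, n_features).

    W has shape (n_hidden, n_features); the hidden layer is purely linear.
    """
    h = x @ W.T                      # project down: (batch, n_hidden)
    return np.maximum(h @ W + b, 0)  # project back up, add bias, apply ReLU


def weighted_mse(x, x_hat, importance):
    """Squared reconstruction error, weighted per feature by its importance."""
    return np.mean(np.sum(importance * (x - x_hat) ** 2, axis=1))


rng = np.random.default_rng(0)
n_features, n_hidden = 5, 2
W = 0.1 * rng.normal(size=(n_hidden, n_features))  # untrained, random weights
b = np.zeros(n_features)
importance = 0.7 ** np.arange(n_features)          # illustrative schedule

x = sample_features(rng, batch=1024, n_features=n_features, sparsity=0.9)
print("loss at initialization:", weighted_mse(x, reconstruct(W, b, x), importance))
```

Minimizing this loss over W and b with any gradient-based optimizer, while sweeping `sparsity`, is the kind of experiment the paper uses to study when superposition appears.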
A representative configuration projects five features into two hidden dimensions. With no sparsity the model behaves like principal component analysis, keeping only the two most important features along orthogonal axes and discarding the rest. As sparsity rises, the picture changes: the model starts representing more than two features in the two-dimensional space, with the less important features the first to be folded into superposition. The ReLU and bias are what make this usable, since the nonlinearity suppresses the small negative interference terms that superposition introduces.
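The role of the ReLU can be seen in an even smaller case than the five-into-two configuration: two features sharing a single hidden dimension along opposite directions. The weights below are chosen by hand for illustration rather than learned, and the helper name is hypothetical.

```python
import numpy as np

# One hidden dimension, two features stored on opposite directions.
# These weights are hand-picked for illustration, not trained.
W = np.array([[1.0, -1.0]])  # shape (n_hidden=1, n_features=2)
b = np.zeros(2)


def reconstruct(x, relu=True):
    """Return ReLU(W^T W x + b), or the pre-activation if relu is False."""
    pre = (x @ W.T) @ W + b
    return np.maximum(pre, 0.0) if relu else pre


only_first = np.array([1.0, 0.0])
both = np.array([1.0, 1.0])

print("linear readout, first feature active:", reconstruct(only_first, relu=False))
print("with ReLU, first feature active:     ", reconstruct(only_first))
print("with ReLU, both features active:     ", reconstruct(both))
```

When only the first feature is active, the linear readout assigns the second feature a value of -1, and the ReLU clips that interference to zero so the input is recovered exactly. When both features are active they cancel in the hidden dimension and neither is recovered, which is the cost that high sparsity makes rare.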
Two results give the paper much of its lasting influence.
Whether a given feature is represented on its own dedicated direction or pushed into superposition is governed by a phase change. Holding importance fixed and sweeping sparsity, a feature switches fairly abruptly between three regimes: not represented at all, represented alone, or represented in superposition with others. The authors map this as a phase diagram over importance and sparsity, and the sharpness of the transitions is what makes the term "phase change" apt rather than metaphorical.
The geometry of superposition is the more striking finding. When several features share a low-dimensional space, they do not land in arbitrary positions. They arrange themselves into regular geometric figures that spread the directions out to minimize interference, the same uniform polytopes that appear in classical geometry. In the five-feature, two-dimensional case the model passes through configurations such as antipodal pairs (two features pointing in opposite directions, used when interference must be kept low) and a regular pentagon (all five features placed symmetrically when sparsity is high enough to tolerate the resulting overlap). In higher hidden dimensions the authors observe digons, triangles, tetrahedra and other polytopes, and they describe how larger arrangements decompose into independent lower-dimensional pieces, a structure they relate to tegum products. To quantify all this they introduce a notion of feature dimensionality, the fraction of a hidden dimension effectively allocated to a given feature, which takes characteristic fractional values (for example 1/2 for an antipodal pair) corresponding to the different geometries [1][3]. They also note that simple computation, not just storage, can be carried out while features remain in superposition.
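The fractional values of feature dimensionality can be reproduced for hand-built geometries. The sketch below assumes the definition D_i = ||W_i||^2 / sum_j (Ŵ_i · W_j)^2, where W_i is the i-th column of W and Ŵ_i is that column normalized to unit length; this formula is restated here as an assumption because the article does not spell it out. The weight matrices are constructed directly rather than learned, and the function names are illustrative.

```python
import numpy as np


def regular_polygon(n_features):
    """Return a (2, n) matrix whose columns are n evenly spaced unit vectors."""
    angles = 2.0 * np.pi * np.arange(n_features) / n_features
    return np.stack([np.cos(angles), np.sin(angles)])


def feature_dimensionality(W):
    """D_i = ||W_i||^2 / sum_j (W_i_hat . W_j)^2 for each column W_i of W."""
    norms = np.linalg.norm(W, axis=0)
    W_hat = W / norms                 # unit-length feature directions
    overlaps = (W_hat.T @ W) ** 2     # entry [i, j] is (W_i_hat . W_j)^2
    return norms ** 2 / overlaps.sum(axis=1)


antipodal = np.array([[1.0, -1.0]])   # two features in one hidden dimension
pentagon = regular_polygon(5)         # five features in two hidden dimensions

print("antipodal pair:", feature_dimensionality(antipodal))
print("pentagon:      ", feature_dimensionality(pentagon))
```

Each feature in the antipodal pair gets a dimensionality of 1/2, and each feature in the pentagon gets 2/5. In both of these constructed cases the values sum to the number of hidden dimensions, consistent with reading them as shares of the available space.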
The toy models give a mechanistic account of polysemanticity. A neuron looks polysemantic precisely because it participates in several superposed feature directions at once, so it activates for whichever of those features is present. In the regime with no superposition the same neurons are monosemantic, each tracking a single feature. Polysemanticity is therefore not an intrinsic property of neurons but a consequence of the network choosing to compress more features than it has room for. This reframing matters for interpretability: it suggests that the obstacle to reading a network is not that its concepts are inherently entangled, but that they have been linearly superimposed and could in principle be recovered.
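A small numerical comparison makes the contrast concrete. The sketch treats each hidden coordinate as a "neuron", which is a simplification: in the linear bottleneck model the hidden basis is not privileged, so this only illustrates the idea. Both weight matrices are hand-built, and the helper name is hypothetical. The code counts how many features each hidden coordinate responds to.

```python
import numpy as np


def features_per_neuron(W, tol=1e-9):
    """Count, for each hidden coordinate, the features with a nonzero weight.

    W has shape (n_hidden, n_features); presenting feature i alone produces
    the hidden activation W[:, i], so a nonzero entry means a response.
    """
    return (np.abs(W) > tol).sum(axis=1)


# No superposition: two of five features get their own axis, the rest are dropped.
dedicated = np.eye(2, 5)

# Superposition: five unit vectors evenly spaced around the circle (a pentagon).
angles = 2.0 * np.pi * np.arange(5) / 5
pentagon = np.stack([np.cos(angles), np.sin(angles)])

print("dedicated axes:", features_per_neuron(dedicated))
print("pentagon:      ", features_per_neuron(pentagon))
```

With dedicated axes each coordinate responds to exactly one feature. In the pentagon arrangement the first coordinate responds to all five features and the second to four of them, which is what polysemanticity looks like from the point of view of a single unit.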
The paper set up the problem that much of Anthropic's subsequent interpretability research tried to solve. It explicitly sketched candidate strategies for finding the hidden features that superposition obscures, including building models that avoid superposition through activation sparsity and using dictionary learning to recover an overcomplete feature basis from a model that already exhibits superposition [1]. The 2023 paper Towards Monosemanticity pursued the dictionary-learning route, training a sparse autoencoder on the activations of a one-layer transformer to extract features that were far more interpretable than the raw neurons. That line continued with Scaling Monosemanticity (2024), which applied sparse autoencoders to the production model Claude 3 Sonnet and pulled out millions of features. Sparse autoencoders became a standard tool in mechanistic interpretability, and the conceptual vocabulary introduced here (features as directions, superposition, interference, and the importance and sparsity axes) is now routine in the field. Chris Olah, the paper's senior author, has continued to lead this research direction at Anthropic.