Induction Heads
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 5,355 words
Induction heads are a circuit pattern found in Transformer language models in which a small set of attention heads, typically arranged across two layers, implement an in-context "match and copy" operation that completes patterns of the form [A][B] ... [A] -> [B].[^1][^2] The mechanism was first characterized in late 2021 in Anthropic's A Mathematical Framework for Transformer Circuits[^2] and elaborated in the 2022 paper In-context Learning and Induction Heads by Catherine Olsson, Nelson Elhage, Neel Nanda, and collaborators, which argues that induction heads are the dominant mechanism behind much of in-context learning in large language models.[^1] The discovery is considered a foundational result in mechanistic interpretability, in part because it ties a behavior visible in the loss curve (a sudden bump that coincides with the emergence of in-context learning) to a specific, decomposable circuit inside the network.[^1] The finding holds across scale: heads that satisfy the same prefix-matching and copying criteria appear in models ranging from small two-layer toy transformers up through frontier-scale production systems.[^1]
Modern decoder-only transformers process text autoregressively, with each layer mixing information across positions through multi-head self-attention and applying token-wise nonlinear transformations through MLP blocks.[^3] A long-standing puzzle for the field was that, in addition to the knowledge learned during pretraining, these models also display in-context learning: the ability to pick up patterns from a prompt and continue them without any weight updates.[^4] Standard architectural descriptions did not by themselves explain how a fixed set of weights could implement such flexible context-conditioned behavior.
The mechanistic-interpretability program initiated by Chris Olah and colleagues at Anthropic approached this question by trying to reverse-engineer concrete circuits inside small transformer models, in the same spirit that the earlier Circuits thread had reverse-engineered vision networks.[^2][^5] The first concrete circuit they identified, in A Mathematical Framework for Transformer Circuits (December 2021), was the induction head.[^2] The paper showed that a two-layer attention-only transformer, which has no MLPs, can already learn an algorithm for completing repeated subsequences, and that this algorithm can be cleanly decomposed into two attention heads in different layers communicating through the residual stream.[^2]
The follow-up paper In-context Learning and Induction Heads was first published in the Transformer Circuits Thread on March 8, 2022, and posted to arXiv as 2209.11895 on September 24, 2022.[^1][^6] It is a long, multi-part argument that the same circuit motif is responsible for the bulk of in-context learning capability not just in toy two-layer models but in the much larger transformers that constitute frontier language models.[^1] The paper involved more than 50,000 attention-head ablations across 34 transformer models of varying sizes trained from scratch for the study.[^1]
The paper's author list includes 26 researchers, with Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, and Ben Mann among the first authors, and Chris Olah as the final author.[^1] The work was produced almost entirely at Anthropic. One of the lead authors, Neel Nanda, later became one of the most visible communicators of the mechanistic-interpretability approach through his open-source library TransformerLens and his subsequent role at Google DeepMind.[^10]
An induction head is most easily described by its behavior on a sequence in which a short pattern repeats. Given a sequence ending in token A that has appeared somewhere earlier in the context, an induction head:[^1][^2]
- Prefix matching: in a sequence of the form ... A B ... A, the head at the trailing position attends to the position of the earlier B, the token that followed the previous occurrence of A.
- Copying: the head's output raises the logit of the token it attends to, so that B becomes a more likely next token.

Together these two behaviors implement the pattern [A][B] ... [A] -> [B].[^1] The original paper operationalizes the test using repeated random tokens: random sequences of tokens are concatenated with themselves, so that prefix matching and copying can be detected as numerical scores even when the underlying tokens have no semantic meaning.[^1] A head is classified as an induction head when it scores highly on both metrics.
The minimal mechanism, identified in two-layer attention-only models, uses a pair of heads in two consecutive layers connected through the residual stream.[^2]
The previous-token head is the first ingredient. It lives in the earlier layer and learns an attention pattern that simply attends from each position to the position immediately before it. Its OV circuit then writes a representation of the previous token into the residual stream at the current position. After this head fires, every token's residual stream carries information about what token preceded it.[^2]
The induction head is the second ingredient and lives in the later layer. Its queries are computed from the residual stream at the current position, but, crucially, its keys are computed from the residual stream of earlier positions after the previous-token head has written to them. So the keys at an earlier position now encode the token at that position together with the token that came before it. The induction head's QK circuit learns to match the current token against the "what came before" component of earlier keys, which means it places attention on positions whose preceding token equals the current token. Its OV circuit then copies the value (the token identity) from the attended position back to the output, raising that token's logit.[^1][^2]
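The two-head mechanism described above can be sketched as a hand-wired toy in NumPy. This is an illustrative simplification, not the learned circuit from either paper: the one-hot "residual stream", the separate current-token and previous-token subspaces, the `sharpness` constant, and the function names are all assumptions made for the example.

```python
import numpy as np

def one_hot(tokens, vocab_size):
    """Return a [seq_len, vocab_size] one-hot matrix for integer token ids."""
    out = np.zeros((len(tokens), vocab_size))
    out[np.arange(len(tokens)), tokens] = 1.0
    return out

def toy_induction_logits(tokens, vocab_size, sharpness=20.0):
    """Hand-wired two-head induction circuit (illustrative, not learned)."""
    # Subspace 1 of the toy residual stream: the token at each position.
    cur = one_hot(tokens, vocab_size)

    # Layer 1, previous-token head: position i attends to i-1 and writes that
    # token's identity into a second subspace. Position 0 has no predecessor.
    prev = np.zeros_like(cur)
    prev[1:] = cur[:-1]

    # Layer 2, induction head. Queries read the current token; keys read the
    # "previous token" subspace written by layer 1 (K-composition). The score
    # is high where the token before key position j equals the query token.
    scores = sharpness * (cur @ prev.T)          # [query_pos, key_pos]

    # Causal mask: a query may attend only to itself and earlier positions.
    seq_len = len(tokens)
    causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    scores = np.where(causal, scores, -np.inf)

    # Row-wise softmax. With no match the row is uniform; real models are
    # often observed to park attention on an early token instead.
    scores = scores - scores.max(axis=1, keepdims=True)
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=1, keepdims=True)

    # OV circuit: copy the attended token's identity into the output logits.
    logits = attn @ cur
    return attn, logits

# Sequence [A][B] ... [A] with A=5, B=7: the last position should predict 7.
attn, logits = toy_induction_logits([5, 7, 2, 9, 5], vocab_size=10)
print(int(np.argmax(attn[-1])), int(np.argmax(logits[-1])))
```

In this toy the final position attends to position 1 (the token after the earlier 5) and the largest logit is for token 7, which is the [A][B] ... [A] -> [B] completion.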
This pattern of one head's output feeding into another head's keys is what A Mathematical Framework for Transformer Circuits calls K-composition.[^2] The same framework also defines Q-composition (where queries come from earlier heads' outputs) and V-composition (where values are routed through earlier heads). Induction heads are the canonical and most prominent example of K-composition discovered in real transformers.[^2] When the framework decomposes a multi-head, multi-layer attention-only transformer into all of its possible head compositions, it produces a finite set of virtual attention heads: paths through the network that can be analyzed as if they were single heads operating in a single layer. The induction head is the simplest virtual head whose effect cannot be replicated by any single literal head, which is why two layers are the minimum required for the circuit.[^2]
One useful way to picture the difference between a single-layer model and a two-layer model is that the single-layer model can only attend to tokens based on their own identity (or their position), since the keys at each position are computed directly from the embedding of the token at that position.[^2] Adding a second layer changes this: a previous-token head in the first layer enriches the residual stream at each position with information about the preceding token, so the keys read by the second layer carry exactly the information an induction head needs to perform prefix matching. Anthropic's framework treats this as a general lesson about transformer expressivity: composing attention heads through the residual stream allows later heads to attend on the basis of computed features rather than raw token identity, and induction heads are the simplest behavior that requires this extra expressivity.[^2]
A central conceptual move in the Mathematical Framework paper is to decompose each attention head into two largely independent circuits, each built from a low-rank matrix acting on the residual stream: a bilinear form that sets the attention pattern and a linear map that sets the output.[^2]
The QK circuit of a head is the matrix product W_Q^T W_K, projected through the embedding and any upstream head outputs. It determines where the head attends, based on the dot product between the query at the current position and the key at each previous position.[^2]
The OV circuit of a head is the matrix product W_O W_V, projected through the embedding and any downstream contributions to the logits. It determines what is written to the residual stream when the head does attend, and hence how each attended token shifts the output distribution.[^2]
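As a small numerical illustration of the two products named above, the following sketch builds random head weights and checks that the combined matrices give the same attention score and the same head output as the per-head computation. The sizes, the random weights, and the variable names are assumptions for the example; the column-vector convention (W_Q of shape [d_head, d_model]) is chosen so that the products read as W_Q^T W_K and W_O W_V, and the embedding and unembedding projections mentioned in the text are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 16, 4          # illustrative sizes only

# Per-head weight matrices acting on column vectors in the residual stream.
W_Q = rng.normal(size=(d_head, d_model))
W_K = rng.normal(size=(d_head, d_model))
W_V = rng.normal(size=(d_head, d_model))
W_O = rng.normal(size=(d_model, d_head))

# QK circuit: a bilinear form on (query vector, key vector) pairs.
W_QK = W_Q.T @ W_K               # [d_model, d_model]
# OV circuit: a linear map from the attended vector to the head's output.
W_OV = W_O @ W_V                 # [d_model, d_model]

x_query = rng.normal(size=d_model)   # residual stream at the current position
x_key = rng.normal(size=d_model)     # residual stream at an earlier position

# Pre-softmax attention score, computed two equivalent ways.
score_per_head = (W_Q @ x_query) @ (W_K @ x_key)
score_circuit = x_query @ W_QK @ x_key

# What the head writes when it attends fully to x_key, two equivalent ways.
out_per_head = W_O @ (W_V @ x_key)
out_circuit = W_OV @ x_key

print(np.isclose(score_per_head, score_circuit))
print(np.allclose(out_per_head, out_circuit))
# Both combined matrices are d_model x d_model but have rank at most d_head.
print(np.linalg.matrix_rank(W_QK), np.linalg.matrix_rank(W_OV))
```

The rank bound is what makes both circuits "low-rank": each is a large matrix that factors through the much smaller head dimension.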
For an induction head, both circuits do something distinctive. The QK circuit, after K-composition with the previous-token head, implements a prefix-matching pattern: it produces large attention scores at positions whose preceding token equals the query token.[^1][^2] The OV circuit implements a copying pattern: when the head attends to a token, it raises the logit of that token in the output. Together these two patterns yield the [A][B] ... [A] -> [B] behavior.[^1]
The QK/OV split is more than a notational convenience: it makes head analysis tractable because the QK circuit is a low-rank bilinear form on pairs of residual-stream vectors and the OV circuit is a low-rank linear map, so each can be studied with ordinary linear algebra.[^2] In particular, when looking at induction heads, one can plot the QK circuit's attention pattern on repeated sequences and see prefix matching as a literal diagonal band off the main diagonal, and one can plot the OV circuit's effective contribution to the logits and see copying as a near-identity map from token embeddings to logit increments. These visualizations are the empirical signatures that the In-context Learning and Induction Heads paper uses to flag candidate induction heads in models from small attention-only toys up through frontier-scale transformers.[^1]
A natural worry about the basic story is that it sounds too literal: real transformers do far more than copy verbatim subsequences. The In-context Learning and Induction Heads paper argues that this is mostly a matter of how strictly one reads the QK and OV circuits.[^1] The paper distinguishes several relaxations of the literal match-and-copy reading.
These generalizations form a continuous spectrum from literal copy heads to behaviors that look, at the level of model output, like reasoning. The argument in the paper is that the same QK/OV decomposition, generalized in these directions, accounts for a large share of what is described colloquially as in-context learning.[^1]
The most striking empirical claim of In-context Learning and Induction Heads concerns the training dynamics of these circuits.[^1] When the authors train transformers from scratch and track both the formation of induction heads and a quantitative measure of in-context learning, they find that the two emerge simultaneously, during a narrow window early in training that is visible as a small bump in the training loss curve.[^1] Anthropic refers to this window informally as the "induction bump", and notes that it is one of very few clearly localized events in the otherwise smooth training-loss curves of language models.[^1]
The paper operationalizes in-context learning capability with a simple metric: the cross-entropy loss at the 500th token of the context minus the loss at the 50th token of the context, each averaged over dataset examples.[^1] If a model has good in-context learning, its loss should decrease as the context grows because later tokens benefit from earlier ones, so this difference becomes large and negative. The induction-head formation is detected directly through the prefix-matching and copying scores on repeated random tokens.[^1] In every multi-layer model studied, the two curves shift together: induction heads appear, prefix-matching and copying scores jump, in-context learning capability jumps, and the training loss briefly bumps.[^1]
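The loss-difference metric can be written down in a few lines. The sketch below assumes the per-token losses are already available as an array of shape [n_examples, n_ctx]; the function name, argument names, and the synthetic loss curve used to exercise it are illustrative and not taken from the paper.

```python
import numpy as np

def in_context_learning_score(per_token_loss, early=50, late=500):
    """Mean loss at the `late`-th token minus mean loss at the `early`-th token.

    per_token_loss: array of shape [n_examples, n_ctx] holding the
    cross-entropy loss of each token in each example.
    Token positions are 1-based, so the 50th token is column index 49.
    More negative values indicate stronger in-context learning.
    """
    per_token_loss = np.asarray(per_token_loss, dtype=float)
    if per_token_loss.ndim != 2 or per_token_loss.shape[1] < late:
        raise ValueError("need a 2-D array with at least `late` token columns")
    late_loss = per_token_loss[:, late - 1].mean()
    early_loss = per_token_loss[:, early - 1].mean()
    return late_loss - early_loss

# Synthetic example: loss falls linearly with position in every example.
positions = np.arange(512)
synthetic = np.tile(4.0 - 0.002 * positions, (8, 1))   # shape [8, 512]
print(in_context_learning_score(synthetic))            # negative: loss improves
```

A model whose loss does not depend on position would score zero under this definition, which is why the metric isolates the benefit of additional context rather than overall model quality.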
The same phenomenon shows up under perturbations. When the authors change the architecture in ways that move the timing of the loss bump, the formation of induction heads moves with it; in-context learning capability tracks the same shift.[^1] This co-perturbation is one of the strongest pieces of correlational evidence for a causal link between the two phenomena.[^1]
Subsequent work has refined this picture by studying what has to be true of the training data for induction heads to form. The 2024 paper What needs to go right for an induction head? by Aaditya Singh and collaborators trains transformers on carefully controlled distributions and identifies a small set of substructures inside the model whose simultaneous appearance is necessary and sufficient for induction-head behavior.[^22] The paper shows, for example, that the formation of induction heads depends sensitively on the presence of repeated subsequences with shared context in the training data and on the timing of certain attention patterns becoming sharp; without these ingredients, induction-head circuits fail to form and the phase change in in-context learning never appears.[^22] This kind of analysis sharpens the original Anthropic claims by tying them to specific properties of the training distribution rather than only to model size.
The paper organizes its argument as six complementary lines of evidence that the induction-head circuit is the mechanistic source of much of in-context learning at scale.[^1]
For small attention-only models the argument is causal: ablating the relevant heads removes the relevant behavior.[^1] For larger models with MLPs and many more layers, the evidence is mostly correlational; the paper is explicit that it can only show consistent statistical signatures, not a fully reverse-engineered circuit at scale.[^1]
Ablation is the central tool used to make the causal case in small models.[^1] The standard procedure is to identify candidate induction heads by their prefix-matching and copying scores on repeated random sequences, then either zero out their outputs (full ablation) or selectively suppress the induction-pattern portion of their attention (attention knockout).[^1] In the small attention-only models the paper studies, both kinds of ablation produce large reductions in the loss-difference metric for in-context learning, while ablations of random heads do not.[^1]
The combined ablation campaign spanned more than 50,000 individual attention-head ablations across the 34 models trained for the study, allowing the authors to perform statistical analyses of which heads matter for which contexts.[^1] This is one of the largest single ablation studies in mechanistic-interpretability work to date.[^1]
In attention-pattern knockout, only the part of a head's attention pattern that corresponds to induction-style prefix matching is suppressed, while attention to other positions is preserved.[^1] When the authors apply this targeted intervention, the loss-difference metric drops by roughly the same amount as when the whole head is zeroed out, suggesting that it is specifically the induction-style attention pattern, rather than other roles the head might play, that drives in-context learning. Combined with the formation-time correlations and the architectural co-perturbation experiments, this is the strongest causal evidence the paper offers for small models.[^1]
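The two interventions can be sketched on plain arrays. This is one possible reading of the procedures described above, not the paper's implementation: the array layouts, the function names, the use of a known repeat period to locate the induction-style entries, and the choice to renormalize rows after the knockout are all assumptions made for illustration.

```python
import numpy as np

def zero_ablate_head(head_outputs, head_index):
    """Full ablation: zero one head's contribution to the residual stream.

    head_outputs: array [n_heads, seq_len, d_model]; a copy is returned.
    """
    ablated = head_outputs.copy()
    ablated[head_index] = 0.0
    return ablated

def knock_out_induction_pattern(attn, period, renormalize=True):
    """Suppress only the induction-style entries of one head's attention.

    attn: [seq_len, seq_len] causal attention pattern on a sequence that
    repeats with the given period. For a query at position i in a later
    repeat, the induction target is i - period + 1 (the token that followed
    the previous occurrence of the current token).
    """
    knocked = attn.copy()
    seq_len = knocked.shape[0]
    queries = np.arange(period, seq_len)
    targets = queries - period + 1
    knocked[queries, targets] = 0.0
    if renormalize:
        # Assumption: redistribute the removed weight over the other keys.
        row_sums = knocked.sum(axis=1, keepdims=True)
        knocked = np.divide(knocked, row_sums,
                            out=np.zeros_like(knocked), where=row_sums > 0)
    return knocked

# Example: uniform causal attention on a length-6 sequence with period 3.
causal = np.tril(np.ones((6, 6)))
uniform = causal / causal.sum(axis=1, keepdims=True)
knocked = knock_out_induction_pattern(uniform, period=3)
print(knocked[3])   # entry for key position 1 is now zero
```

In a real experiment the modified attention pattern or head output would be fed back through the rest of the model and the loss-difference metric recomputed; that step depends on the model code and is left out here.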
The induction-head template has proven remarkably extensible. Researchers have built on it in two complementary directions: relaxing what counts as a "match" (so the same circuit can handle approximate or semantic repetition) and relaxing what counts as a "copy" (so the head can edit, translate, or otherwise transform the attended-to token rather than simply repeat it). Many subsequent interpretability findings can be read as decorations on the basic QK/OV decomposition that the induction-head paper introduced.
The basic induction head matches identical tokens. Follow-up work has shown that a continuum of related circuits performs fuzzy matching, attending to context positions whose surrounding tokens are similar in some learned representation but not necessarily identical.[^1][^7] This explains how the same family of circuits can support paraphrased or inflected pattern completion in natural text rather than only literal repetition.
Subsequent research formalized parts of this picture under the heading of semantic induction heads: heads that prefix-match in a representation space where the key and value carry semantic rather than purely token-level information.[^7] These heads remain interpretable using the same QK/OV decomposition but operate on more abstract representations carried by the residual stream after MLPs and earlier attention layers have processed it.[^7]
A natural extension of the basic two-head construction allows the previous-token head, or chains of such heads, to write information about not just one but several preceding tokens into each position.[^8] The induction head then prefix-matches against this longer context, effectively performing in-context n-gram lookup. This generalization has been explored both as an interpretability lens and as an inductive bias: one line of work explicitly hardcodes an n-gram induction layer into transformers as a drop-in replacement for parts of multi-head attention.[^8]
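At the level of input and output behavior, in-context n-gram lookup can be sketched as a plain function over token lists. This is a symbolic stand-in for what the circuit computes, not a description of how attention implements it; the function name and the rule of preferring the most recent match are assumptions for the example.

```python
def ngram_induction_predict(tokens, n=1):
    """Predict the next token by in-context n-gram lookup.

    Finds the most recent earlier place where the last `n` tokens occurred
    and returns the token that followed it, or None if there is no match.
    With n=1 this is the basic [A][B] ... [A] -> [B] rule.
    """
    tokens = list(tokens)
    if n < 1 or len(tokens) <= n:
        return None
    suffix = tokens[-n:]
    # Candidate matches must end before the final position so that a
    # "following token" exists inside the context; scan newest to oldest.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == suffix:
            return tokens[start + n]
    return None

context = [1, 2, 5, 3, 2, 7, 1, 2]
print(ngram_induction_predict(context, n=1))  # 7: most recent token after a 2
print(ngram_induction_predict(context, n=2))  # 5: token after the earlier 1, 2
```

The example shows why longer prefixes matter: matching on one token picks up the nearest earlier 2, while matching on two tokens finds the occurrence whose preceding context also agrees.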
More recent work has proposed the umbrella term generalized induction heads for circuits that extend the basic match-and-copy template to richer settings, including matching with arbitrary similarity functions, retrieving multi-token continuations, and integrating context across multiple lookups.[^9] These constructions retain the core idea of in-context retrieval followed by output editing but allow the matching and copying stages to be more flexible than the original two-head circuit.[^9]
A more recent strand of theoretical work has analyzed selective induction heads: induction-like circuits that learn to choose, in context, which causal structure to use when completing a sequence.[^20] Such heads can implement different match-and-copy rules conditional on properties of the prompt, providing a mechanistic story for tasks where the model must infer which kind of pattern is being demonstrated before completing it.[^20] Related work has also shown that two-layer transformers can provably represent induction heads over arbitrary-order Markov chains, providing a theoretical underpinning for the n-gram-style generalizations that empirical work observes.[^21]
The most widely used software for studying induction heads and related circuits is TransformerLens, an open-source library for mechanistic interpretability of GPT-style language models originally created by Neel Nanda.[^10] TransformerLens exposes hook points on every internal activation, making it straightforward to capture attention patterns, intervene on specific heads, and run ablations. The library's introductory tutorial is built around reproducing induction-head behavior on small models, and the prefix-matching/copying metrics defined in the In-context Learning and Induction Heads paper are standard utilities in the library.[^10]
TransformerLens has since become a de facto standard for circuit-level analyses, used in much of the academic mechanistic-interpretability literature and maintained by an open-source community.[^10]
For users who want a hands-on introduction, the Concrete Steps to Get Started in Transformer Mechanistic Interpretability guide by Neel Nanda walks new researchers through replicating the induction-head analysis on a small public model using TransformerLens, and the library's documentation includes a self-contained notebook that derives the prefix-matching and copying metrics from first principles.[^10] This pedagogical pipeline has made induction-head analysis one of the standard first projects for new students of mechanistic interpretability.[^10]
The repeated-random-tokens probe used in In-context Learning and Induction Heads is a deliberately stripped-down dataset: sequences of randomly sampled tokens concatenated with themselves so that the only useful pattern is the repetition.[^1] Because the tokens are random, any head whose attention pattern correlates with the repeat structure can be flagged as performing prefix matching, and any head whose OV circuit reliably boosts the repeated token can be flagged as copying.[^1]
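A minimal sketch of the probe and of a prefix-matching score follows. The scoring rule used here (mean attention from each second-half position to the token after its earlier occurrence) is a simplified assumption and may differ in detail from the paper's definition; the function names and the two reference attention patterns are illustrative, and no real model is involved.

```python
import numpy as np

def repeated_random_tokens(half_len, vocab_size, rng):
    """Sample `half_len` random token ids and concatenate them with themselves."""
    first_half = rng.integers(0, vocab_size, size=half_len)
    return np.concatenate([first_half, first_half])

def induction_target_positions(half_len):
    """For each query in the second half, the key one past its earlier copy."""
    queries = np.arange(half_len, 2 * half_len)
    keys = queries - half_len + 1
    return queries, keys

def prefix_matching_score(attn, half_len):
    """Mean attention weight placed on the induction targets (0 to 1)."""
    queries, keys = induction_target_positions(half_len)
    return float(attn[queries, keys].mean())

rng = np.random.default_rng(0)
half_len = 8
tokens = repeated_random_tokens(half_len, vocab_size=1000, rng=rng)

# Reference pattern 1: a head that spreads attention uniformly (causally).
causal = np.tril(np.ones((2 * half_len, 2 * half_len)))
uniform = causal / causal.sum(axis=1, keepdims=True)

# Reference pattern 2: an idealized induction head on the second half.
ideal = np.zeros_like(uniform)
q, k = induction_target_positions(half_len)
ideal[q, k] = 1.0

print(prefix_matching_score(uniform, half_len))  # small
print(prefix_matching_score(ideal, half_len))    # 1.0
```

With attention patterns captured from an actual model on such sequences, the same score could be computed per head and thresholded, which is the kind of head-identification signal the text describes.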
While useful, this probe has known limitations: heads that perform sophisticated context processing on natural text may fall back to induction-like behavior only on stripped-down repeated sequences, which can produce false positives if one identifies induction heads from repeated-random-tokens (RRT) scores alone.[^11] Subsequent analyses have therefore complemented RRT scores with task-specific behavioral tests, attention-pattern visualizations, and head-by-head ablations on natural data.[^11]
A line of research initiated by David Bau and collaborators identified function vectors: linear directions in transformer activation space that can be patched into a clean prompt to make the model perform a specific in-context task.[^12] Function vectors are constructed by averaging the outputs of a small set of attention heads, identified via causal mediation analysis, that mediate in-context learning across many tasks.[^12] These function-vector heads share the prefix-matching attention pattern that defines induction heads, and during training they tend to display high induction scores early on; as training proceeds, the function-vector score becomes the more predictive metric, suggesting that induction is a precursor or building block for function-vector behavior.[^13]
A closely related notion, task vectors, describes hidden-state representations that the model constructs from few-shot demonstrations and then uses to steer subsequent predictions.[^14] Empirically, the outputs of induction heads can serve as effective task vectors, providing a concrete link between the circuit-level description of induction and the more representational language of task vectors.[^13]
Anthropic's later interpretability work moved from circuits identified by hand to mechanized analyses based on sparse-coding decompositions and Attribution Graphs.[^15][^16] The 2025 paper On the Biology of a Large Language Model applies Attribution Graphs built from Cross-Layer Transcoders to study Claude 3.5 Haiku, including circuits for multi-hop reasoning, planning in poetry, and shared multilingual concepts.[^15][^16] Although this work emphasizes feature-level rather than head-level explanations, the induction-head story remains its conceptual ancestor: it was the first concrete demonstration that named circuits could be located inside production-scale transformers.[^15] Several attribution-graph case studies in On the Biology of a Large Language Model involve features and edges that perform exactly the kind of contextual matching and copying that induction heads first formalized.[^15]
The same line of work connects to feature-level interpretability using sparse autoencoder dictionaries, as developed in Towards Monosemanticity and Scaling Monosemanticity.[^17][^18] Where induction heads explain how the model routes information across positions, sparse autoencoders explain what the residual stream encodes at each position; both perspectives are needed to describe what a circuit actually does.
The induction-head picture has also influenced work on other interpretable attention heads. The logit lens technique, which projects intermediate residual-stream activations through the unembedding to read out what the model would predict at each layer, is commonly used to diagnose copying and induction behavior at intermediate layers.[^19] Closely related is the discovery of copy-suppression heads, which act as a counterweight to copy-like heads (including induction heads) by reducing the logits of tokens that other components are pushing to repeat; a single such head in GPT-2 Small accounts for a large fraction of certain self-repair phenomena observed when other heads are ablated.[^19]
A central practical claim of the original work is that induction-head-like circuits are not artefacts of toy models: they appear in transformers across orders of magnitude in scale.[^1] The Anthropic paper found induction heads in models ranging from two-layer attention-only toys up through models with many tens of layers and billions of parameters, and reported that the prefix-matching and copying scores remain reliable head-identification signals at scale.[^1] Subsequent independent work has confirmed prefix-matching, copying, and induction-like patterns in a wide variety of public GPT-2, Pythia, and Llama checkpoints.[^11][^13]
The dominance of induction-style mechanisms at scale is, however, more nuanced than the basic story suggests. More recent analyses have shown that in models above roughly a billion parameters, ablating heads identified by classic induction scores while preserving heads identified by function-vector scores leaves few-shot in-context learning mostly intact, while ablating the function-vector heads (which still exhibit induction-like patterns) impairs it much more.[^13] One interpretation is that induction is a precursor mechanism that, at scale, is partially absorbed into or supplemented by more abstract task-conditioned heads.[^13] The basic motif, prefix-matching followed by output copying through a QK/OV decomposition, persists across this transition, even as the heads that implement it become more specialized.
Another striking property of induction-head emergence is that the timescale at which the circuit forms appears to scale with context length. Theoretical analyses suggest that the number of training steps required to assemble a functional induction-head circuit grows roughly quadratically in the maximum context length, which matches the observed shift in the timing of the in-context-learning phase change in models trained with longer contexts.[^21] This is consistent with the broader pattern reported in In-context Learning and Induction Heads: induction heads are robust, but the exact moment at which they crystallize during training is sensitive to architectural and data choices in ways that other model capabilities are not.[^1]
Induction heads occupy an unusual position in the mechanistic interpretability literature: they are the first concrete circuit discovered inside a real language model that ties a mechanism to an emergent capability.[^1][^2] Several aspects of the result have shaped subsequent interpretability research.
First, induction heads gave the field a clean example of how a behavior identified at the loss-curve level (a bump in the training loss, a phase change in in-context learning ability) could be matched to a specific circuit at the parameter level.[^1] This linkage between training dynamics and circuit-level mechanism remains a model for what mechanistic-interpretability arguments aspire to.[^15]
Second, the QK/OV decomposition introduced alongside induction heads has become a default conceptual tool in interpretability writing.[^2] Almost every subsequent attention-head case study, from the GPT-2 Small IOI (indirect object identification) circuit through the copy-suppression analyses, uses some form of the QK/OV split to organize its findings.[^19]
Third, induction heads serve as the introductory example in essentially every modern mechanistic-interpretability tutorial.[^10] TransformerLens's onboarding materials, Neel Nanda's introductory lectures, and many independent course notes all use the construction as the first nontrivial circuit students reverse-engineer.[^10] The choice is pedagogically natural: induction heads sit at the boundary where toy theory meets real, scaling phenomena.
Finally, induction heads matter for AI safety because they provide an existence proof that one can locate, name, and intervene on the specific machinery responsible for a high-level capability.[^1] If similar arguments can be made for capabilities related to deception, situational awareness, or other safety-relevant behaviors, the same toolkit of circuit identification, training-dynamics analysis, and ablation could allow more direct safety reasoning. The induction-head paper is frequently cited as a proof of concept that such arguments are at least possible.[^1][^15]
Several caveats apply to the induction-head story, and the original paper is careful about them.[^1] At larger scales the evidence for a causal link between induction heads and in-context learning is correlational rather than causal: it is consistent with multiple mechanisms, of which induction is the most parsimonious but not the only candidate.[^1] The mechanistic two-head story is also explicitly minimal; real models implement induction-like behavior using compositions across more than two layers, multiple heads per role, and contributions from MLPs, none of which are captured by the basic construction.[^11]
The behavioral definition based on repeated random tokens, while convenient, can both miss heads that perform context-rich induction only on natural text and falsely flag heads that exhibit copy-like behavior only as a fallback on degenerate inputs.[^11] Several follow-up analyses have argued that the cleanest characterization of "induction" depends on the test distribution one uses, and that any single score should be treated as a coarse proxy rather than a definitive identifier.[^11]
Finally, the relationship between induction heads and other mechanisms responsible for in-context learning is now understood to be more complex than originally proposed. As noted above, function-vector heads and task-vector representations partially decouple from classic induction signatures at scale, and parts of in-context learning that involve abstract pattern recognition or task inference may be carried by mechanisms that go well beyond simple prefix-matching and copying.[^13]
| Concept | Relation to induction heads |
|---|---|
| In-Context Learning | The behavior that induction heads are hypothesized to implement at the circuit level. |
| Mechanistic interpretability | The broader research program that discovered induction heads as one of its first concrete circuits. |
| Attribution Graphs | A later interpretability framework that builds feature- and edge-level explanations for behaviors related to those that induction heads explain at the head level. |
| Sparse autoencoder | A complementary tool for decomposing the residual stream into features that induction heads then route across positions. |
| Logit lens | A diagnostic technique often paired with induction-head analyses to read out how copying behavior builds up across layers. |
| Towards Monosemanticity | Anthropic's first sparse-feature decomposition of a language model, conceptually adjacent to the head-level decomposition that defines induction heads. |
| Scaling Monosemanticity | The follow-up that scales sparse-feature analysis to Claude 3 Sonnet, complementing head-level analyses with feature-level ones. |
| On the Biology of a Large Language Model | Applies attribution graphs to Claude 3.5 Haiku and revisits induction-like circuits at a finer-grained level. |