Induction Heads

Updated Jul 23, 2026

Induction heads are a circuit pattern in Transformer language models in which a small set of attention heads, typically spread across two layers, performs an in-context "match and copy" operation that completes patterns of the form [A][B] ... [A] -> [B].^[1]^[2] In plain terms, when a token A has appeared earlier in the context followed by some token B, an induction head detects the second occurrence of A and raises the probability that the next token is B, letting the model continue a pattern it has just seen rather than relying only on what it memorized during pretraining.^[1] Anthropic's 2022 paper In-context Learning and Induction Heads defines them directly: "'Induction heads' are attention heads that implement a simple algorithm to complete token sequences like [A][B] ... [A] -> [B]", and argues they "might constitute the mechanism for the majority of all 'in-context learning' in large transformer models".^[1]

The mechanism was first characterized in late 2021 in Anthropic's A Mathematical Framework for Transformer Circuits^[2] and elaborated in In-context Learning and Induction Heads by Catherine Olsson, Nelson Elhage, Neel Nanda, and 23 collaborators, which makes the case that induction heads are the dominant mechanism behind much of In-Context Learning in large language models.^[1] The discovery is considered a foundational result in Mechanistic interpretability, in part because it ties a behavior visible in the loss curve (a sudden bump that coincides with the emergence of in-context learning) to a specific, decomposable circuit inside the network.^[1] Induction heads also persist across scale: heads that satisfy the same prefix-matching and copying criteria appear in models ranging from small two-layer toy transformers up through the largest models the paper examined.^[1]

What problem do induction heads solve?

Modern decoder-only transformers process text autoregressively, with each layer mixing information across positions through multi-head self-attention and applying token-wise nonlinear transformations through MLP blocks.^[3] A long-standing puzzle for the field was that, in addition to the knowledge learned during pretraining, these models also display in-context learning: the ability to pick up patterns from a prompt and continue them without any weight updates.^[4] Standard architectural descriptions did not by themselves explain how a fixed set of weights could implement such flexible context-conditioned behavior.

The mechanistic-interpretability program initiated by Chris Olah and colleagues at Anthropic approached this question by trying to reverse-engineer concrete circuits inside small transformer models, in the same spirit that the earlier Circuits thread had reverse-engineered vision networks.^[2]^[5] The first concrete circuit they identified, in A Mathematical Framework for Transformer Circuits (December 22, 2021), was the induction head.^[2] The paper showed that a two-layer attention-only transformer, which has no MLPs, can already learn an algorithm for completing repeated subsequences, and that this algorithm can be cleanly decomposed into two attention heads in different layers communicating through the residual stream.^[2]

The follow-up paper In-context Learning and Induction Heads was first published in the Transformer Circuits Thread on March 8, 2022, and posted to arXiv as 2209.11895 on September 24, 2022.^[1]^[6] It is a long, multi-part argument that the same circuit motif is responsible for the bulk of in-context learning capability not just in toy two-layer models but in the much larger transformers that constitute frontier language models.^[1] The study analyzed 34 transformers over the course of training, including more than 50,000 attention-head ablations, making it one of the largest single ablation studies in mechanistic-interpretability work to date.^[1]

The paper's author list includes 26 researchers, with Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, and Ben Mann among the first listed and Chris Olah as the final author.^[1] The work was produced almost entirely at Anthropic. One of the lead authors, Neel Nanda, went on to become one of the most visible communicators of the mechanistic-interpretability approach through his open-source library TransformerLens and his subsequent role at Google DeepMind.^[10]

How do induction heads work?

Behavioral definition

An induction head is most easily described by its behavior on a sequence in which a short pattern repeats. Given a sequence ending in token A that has appeared somewhere earlier in the context, an induction head shows two behaviors:^[1]^[2]

Prefix matching. When the query position is at the second occurrence of A, the head's attention pattern places weight on the token immediately following the first occurrence of A. In other words, if the context is ... A B ... ... A, the head at the trailing position attends to the position of the earlier B. The paper describes this as the head attending "back to previous tokens that were followed by the current and/or recent tokens".^[1]
Copying. The head's OV (output-value) circuit then increases the logit corresponding to the attended-to token. In the same example, this raises the predicted probability that the next token after the second A will again be B.^[1]

Together these two behaviors implement the pattern [A][B] ... [A] -> [B].^[1] The original paper operationalizes the test using repeated random tokens: random sequences of tokens are concatenated with themselves, so that prefix matching and copying can be detected as numerical scores even when the underlying tokens have no semantic meaning.^[1] A head is classified as an induction head when it scores highly on both metrics.
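The behavioral rule can be written down as an ordinary lookup over tokens. The following is a minimal sketch, not the mechanism a transformer actually uses: the function name `induction_predict` is hypothetical, and preferring the most recent match is an assumption made for simplicity (a real head spreads soft attention over all matches).

```python
from typing import Optional, Sequence


def induction_predict(tokens: Sequence[str]) -> Optional[str]:
    """Predict the next token with the [A][B] ... [A] -> [B] rule.

    Returns None when the final token has no earlier occurrence.
    """
    if not tokens:
        return None
    current = tokens[-1]
    # Prefix matching: scan backwards for the most recent earlier
    # occurrence of the current token (assumption: latest match wins).
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            # Copying: return the token that followed that occurrence.
            return tokens[i + 1]
    return None


print(induction_predict(["A", "B", "C", "A"]))  # B
```

On a sequence with no repetition the function returns None, which corresponds to the head having nothing to match against.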

The two-head mechanism

The minimal mechanism, identified in two-layer attention-only models, uses a pair of heads in two consecutive layers connected through the residual stream.^[2]

The previous-token head is the first ingredient. It lives in the earlier layer and learns an attention pattern that simply attends from each position to the position immediately before it. Its OV circuit then writes a representation of the previous token into the residual stream at the current position. After this head fires, every token's residual stream carries information about what token preceded it.^[2]

The induction head is the second ingredient and lives in the later layer. Its queries are computed from the residual stream at the current position, but, crucially, its keys are computed from the residual stream of earlier positions after the previous-token head has written to them. So the keys at an earlier position now encode the token at that position together with the token that came before it. The induction head's QK circuit learns to match the current token against the "what came before" component of earlier keys, which means it places attention on positions whose preceding token equals the current token. Its OV circuit then copies the value (the token identity) from the attended position back to the output, raising that token's logit.^[1]^[2] As the paper puts it, this first head "copies information from the previous token into each token", which "makes it possible for the second attention head to attend to tokens based on what happened before them, rather than their own content".^[1]
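The division of labor between the two heads can be sketched with symbolic tokens in place of vectors. This is an illustrative toy under stated assumptions: the residual stream is modeled as a (token, previous token) pair, attention is hard (0 or 1 before normalization), the current position is left out of the keys, and both function names are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Slot = Tuple[str, Optional[str]]  # (token at this position, token before it)


def previous_token_head(tokens: Sequence[str]) -> List[Slot]:
    """Earlier layer: write a copy of the preceding token into each position."""
    return [(tok, tokens[i - 1] if i > 0 else None) for i, tok in enumerate(tokens)]


def induction_head(stream: List[Slot]) -> Tuple[List[float], Optional[str]]:
    """Later layer: match the current token against the 'previous token' slot of earlier keys."""
    query_token = stream[-1][0]
    # QK step: an earlier position scores 1 when its preceding token equals the query token.
    scores = [1.0 if prev == query_token else 0.0 for (_, prev) in stream[:-1]]
    total = sum(scores)
    if total == 0:
        return [0.0] * len(scores), None
    attention = [s / total for s in scores]
    # OV step: copy the identity of the most attended token (ties go to the latest position).
    best = max(range(len(attention)), key=lambda j: (attention[j], j))
    return attention, stream[best][0]


attention, prediction = induction_head(previous_token_head(["A", "B", "C", "A"]))
print(attention, prediction)  # [0.0, 1.0, 0.0] B
```

Note that the induction step never inspects a key position's own token when scoring; it only reads the slot written by the earlier head, which is the point of K-composition.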

This pattern of one head's output feeding into another head's keys is what A Mathematical Framework for Transformer Circuits calls K-composition.^[2] The same framework also defines Q-composition (where queries come from earlier heads' outputs) and V-composition (where values are routed through earlier heads). Induction heads are the canonical and most prominent example of K-composition discovered in real transformers.^[2] When the framework expands a multi-head, multi-layer attention-only transformer into paths through its heads, V-composition gives rise to virtual attention heads: paths through more than one head that can be analyzed as if they were single heads operating in a single layer. The induction head is different in kind: it is a literal later-layer head whose attention pattern depends, through K-composition, on what an earlier head wrote. No head in a one-layer model has access to such composed keys, which is why two layers are the minimum required for the circuit.^[2]

One useful way to picture the difference between a single-layer model and a two-layer model is that the single-layer model can only attend to tokens based on their own identity (or their position), since the keys at each position are computed directly from the embedding of the token at that position.^[2] Adding a second layer changes this: a previous-token head in the first layer enriches the residual stream at each position with information about the preceding token, so the keys read by second-layer heads carry exactly the information an induction head needs to perform prefix matching. Anthropic's framework treats this as a general lesson about transformer expressivity: composing attention heads through the residual stream allows later heads to attend on the basis of computed features rather than raw token identity, and induction heads are the simplest behavior that requires this extra expressivity.^[2]

QK and OV circuits

A central conceptual move in the Mathematical Framework paper is to decompose each attention head into two largely independent circuits, each expressible as a low-rank matrix acting on the residual stream.^[2]

The QK circuit of a head is the matrix product W_Q^T W_K, projected through the embedding and any upstream head outputs. It determines where the head attends, based on the dot product between the query at the current position and the key at each previous position.^[2]

The OV circuit of a head is the matrix product W_O W_V, projected through the embedding and any downstream contributions to the logits. It determines what is written to the residual stream when the head does attend, and hence how each attended token shifts the output distribution.^[2]

For an induction head, both circuits do something distinctive. The QK circuit, after K-composition with the previous-token head, implements a prefix-matching pattern: it produces large attention scores at positions whose preceding token equals the query token.^[1]^[2] The OV circuit implements a copying pattern: when the head attends to a token, it raises the logit of that token in the output. Together these two patterns yield the [A][B] ... [A] -> [B] behavior.^[1]
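The two circuits can be written out for a single head. The notation below is a sketch under assumptions not stated in the text: x_i is the residual-stream vector at position i, W_Q, W_K, W_V have shape d_head × d_model, W_O has shape d_model × d_head, attention is causal, and the usual scaling factor on the scores is omitted.

```latex
\begin{aligned}
\mathrm{score}(i, j) &= x_i^{\top} \, W_Q^{\top} W_K \, x_j , \\
A_{ij} &= \operatorname{softmax}_{j \le i} \bigl( \mathrm{score}(i, j) \bigr) , \\
h_i &= \sum_{j \le i} A_{ij} \, W_O W_V \, x_j .
\end{aligned}
```

Under these shape assumptions both W_Q^T W_K and W_O W_V are d_model × d_model matrices of rank at most d_head. The first only affects where the head attends (the pattern A), and the second only affects what is written once the pattern is fixed.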

The QK/OV split is more than a notational convenience: it makes head analysis tractable because each circuit reduces to a single low-rank matrix (a bilinear form on pairs of positions for QK, a linear map for OV) that can be studied on its own.^[2] In particular, when looking at induction heads, one can plot the QK circuit's attention pattern on repeated sequences and see prefix matching as a literal diagonal band off the main diagonal, and one can plot the OV circuit's effective contribution to the logits and see copying as a near-identity map from token embeddings to logit increments. These visualizations are the empirical signatures that the In-context Learning and Induction Heads paper uses to flag candidate induction heads in models from small attention-only toys up through the largest transformers it examines.^[1]

Generalization beyond literal copying

A natural worry about the basic story is that it sounds too literal: real transformers do far more than copy verbatim subsequences. The In-context Learning and Induction Heads paper argues that this is mostly a matter of how strictly one reads the QK and OV circuits.^[1] The paper distinguishes several relaxations:

Translation-style induction. When the OV circuit maps token B not to B itself but to a related token B', the same circuit completes patterns like "the German for dog is Hund, the German for cat is ___" by writing "Katze" rather than copying.^[1]
Fuzzier matching. When the QK circuit attends to positions whose preceding context is similar, rather than literally equal, the same circuit handles paraphrased or partially varied repetitions.^[1]
Longer prefixes. Multi-head compositions can match on more than one preceding token, in effect approximating higher-order n-gram lookups rather than only matching against the single previous token.^[1]

These generalizations form a continuous spectrum from literal copy heads to behaviors that look, at the level of model output, like reasoning. The argument in the paper is that the same QK/OV decomposition, generalized in these directions, accounts for a large share of what is described colloquially as in-context learning.^[1]

How were induction heads discovered, and what is the induction bump?

Coincident emergence with in-context learning

The most striking empirical claim of In-context Learning and Induction Heads concerns the training dynamics of these circuits.^[1] When the authors train transformers from scratch and track both the formation of induction heads and a quantitative measure of in-context learning, they find that the two emerge simultaneously, during a narrow window early in training. As the paper states, "induction heads develop at precisely the same point as a sudden sharp increase in in-context learning ability, visible as a bump in the training loss".^[1] Anthropic refers to this window informally as the "induction bump" and the broader event as a "phase change", noting that it is one of very few clearly localized events in the otherwise smooth training-loss curves of language models.^[1]

The paper operationalizes in-context learning capability with a simple metric: "the loss of the 500th token in the context minus the average loss of the 50th token in the context, averaged over dataset examples".^[1] If a model has good in-context learning, its loss should decrease as the context grows because later tokens benefit from earlier ones, so this difference becomes large and negative. The induction-head formation is detected directly through the prefix-matching and copying scores on repeated random tokens.^[1] In every multi-layer model studied, the two curves shift together: induction heads appear, prefix-matching and copying scores jump, in-context learning capability jumps, and the training loss briefly bumps.^[1]
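The metric quoted above is a difference of two per-token losses averaged over examples. A minimal sketch follows, assuming the per-token losses have already been computed by some model; the function name and the 1-indexed position parameters are hypothetical conveniences.

```python
from typing import Sequence


def in_context_learning_score(
    per_token_losses: Sequence[Sequence[float]],
    early_position: int = 50,
    late_position: int = 500,
) -> float:
    """Mean over examples of (loss at the late token) minus (loss at the early token).

    Positions are 1-indexed to match the "50th" and "500th" wording.
    """
    if not per_token_losses:
        raise ValueError("need at least one example")
    diffs = []
    for losses in per_token_losses:
        if len(losses) < late_position:
            raise ValueError("every example must contain at least late_position tokens")
        diffs.append(losses[late_position - 1] - losses[early_position - 1])
    return sum(diffs) / len(diffs)


# Tiny illustration with 4-token "contexts" instead of 500-token ones.
toy_losses = [[4.0, 3.0, 2.0, 1.0], [4.0, 4.0, 4.0, 2.0]]
print(in_context_learning_score(toy_losses, early_position=1, late_position=4))  # -2.5
```

A more negative value means later tokens are predicted better than earlier ones, which is the sense in which the score measures in-context learning.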

The same phenomenon shows up under perturbations. When the authors change the architecture in ways that move the timing of the loss bump, the formation of induction heads moves with it; in-context learning capability tracks the same shift.^[1] This co-perturbation is one of the strongest pieces of correlational evidence for a causal link between the two phenomena.^[1]

Subsequent work has refined this picture by studying what has to be true of the training data for induction heads to form. The 2024 paper What needs to go right for an induction head? by Aaditya Singh and collaborators trains transformers on carefully controlled distributions and identifies a small set of substructures inside the model whose simultaneous appearance is necessary and sufficient for induction-head behavior.^[22] The paper shows, for example, that the formation of induction heads depends sensitively on the presence of repeated subsequences with shared context in the training data and on the timing of certain attention patterns becoming sharp; without these ingredients, induction-head circuits fail to form and the phase change in in-context learning never appears.^[22] This kind of analysis sharpens the original Anthropic claims by tying them to specific properties of the training distribution rather than only to model size.

What are the six lines of evidence?

In-context Learning and Induction Heads organizes its argument as six complementary lines of evidence that the induction-head circuit is the mechanistic source of much of in-context learning at scale.^[1]

Macroscopic co-occurrence. Induction heads form at the same point in training as the in-context learning bump (the phase change).^[1]
Macroscopic co-perturbation. Architectural changes that delay or accelerate one effect delay or accelerate the other in lockstep.^[1]
Direct ablation. Knocking out induction heads at test time in small models causes a large drop in in-context learning, as measured by the loss-difference metric.^[1]
Specific examples of generality. Induction-head-like behavior shows up in literal copying, simple translation tasks, and paraphrased pattern completion, suggesting that the underlying circuit motif is doing more than verbatim copying.^[1]
Mechanistic plausibility. The two-layer two-head construction is fully reverse-engineered in toy models, providing a concrete proof of concept that the QK and OV decomposition can implement the observed behavior.^[1]^[2]
Continuity across scale. Induction-head-like heads, defined by their prefix-matching and copying scores, are found in models from a few layers up through the largest transformers the authors examined, suggesting the same motif persists when MLPs and more layers are added.^[1]

For small attention-only models the argument is causal: ablating the relevant heads removes the relevant behavior.^[1] For larger models with MLPs and many more layers, the evidence is mostly correlational; the paper is explicit that it can only show consistent statistical signatures, not a fully reverse-engineered circuit at scale.^[1]

Ablation studies

Ablation is the central tool used to make the causal case in small models.^[1] The standard procedure is to identify candidate induction heads by their prefix-matching and copying scores on repeated random sequences, then either zero out their outputs (full ablation) or selectively suppress the induction-pattern portion of their attention (attention knockout).^[1] In the small attention-only models the paper studies, both kinds of ablation produce large reductions in the loss-difference metric for in-context learning, while ablations of random heads do not.^[1]
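Full ablation can be pictured as dropping one head's term from the sum that updates the residual stream. The sketch below is a toy with plain lists standing in for activations; the function name and the head labels such as "L1H1" are hypothetical, and it illustrates zero-ablation only, not the paper's exact procedure.

```python
from typing import Collection, Dict, List, Sequence


def residual_after_attention(
    residual_in: Sequence[float],
    head_outputs: Dict[str, Sequence[float]],
    ablated: Collection[str] = (),
) -> List[float]:
    """Add every head's output to the residual stream, skipping ablated heads."""
    residual = list(residual_in)
    for name, output in head_outputs.items():
        if name in ablated:
            continue  # full ablation: this head contributes nothing
        for d, value in enumerate(output):
            residual[d] += value
    return residual


heads = {"L1H0": [1.0, 0.0], "L1H1": [0.0, 2.0]}
print(residual_after_attention([0.5, 0.5], heads))                    # [1.5, 2.5]
print(residual_after_attention([0.5, 0.5], heads, ablated={"L1H1"}))  # [1.5, 0.5]
```

In an actual experiment the two residual streams would be run through the rest of the model and compared on a downstream metric such as the loss-difference score.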

The combined ablation campaign spanned more than 50,000 individual attention-head ablations across the 34 models analyzed in the study, allowing the authors to perform statistical analyses of which heads matter for which contexts.^[1]

The second of these interventions, attention-pattern knockout, suppresses only the part of a head's attention pattern that corresponds to induction-style prefix matching, while attention to other positions is preserved.^[1] When the authors apply this targeted intervention, the loss-difference metric drops by roughly the same amount as when the whole head is zeroed out, suggesting that it is specifically the induction-style attention pattern, rather than other roles the head might play, that drives in-context learning. Combined with the formation-time correlations and the architectural co-perturbation experiments, this is the strongest causal evidence the paper offers for small models.^[1]

What variants and generalizations of induction heads exist?

The induction-head template has proven remarkably extensible. Researchers have built on it in two complementary directions: relaxing what counts as a "match" (so the same circuit can handle approximate or semantic repetition) and relaxing what counts as a "copy" (so the head can edit, translate, or otherwise transform the attended-to token rather than simply repeat it). Many subsequent interpretability findings can be read as decorations on the basic QK/OV decomposition that the induction-head paper introduced.

Fuzzy and semantic induction

The basic induction head matches identical tokens. Follow-up work has shown that a continuum of related circuits performs fuzzy matching, attending to context positions whose surrounding tokens are similar in some learned representation but not necessarily identical.^[1]^[7] This explains how the same family of circuits can support paraphrased or inflected pattern completion in natural text rather than only literal repetition.

Subsequent research formalized parts of this picture under the heading of semantic induction heads: heads that prefix-match in a representation space where the key and value carry semantic rather than purely token-level information.^[7] These heads remain interpretable using the same QK/OV decomposition but operate on more abstract representations carried by the residual stream after MLPs and earlier attention layers have processed it.^[7]
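One way to picture fuzzy matching is to replace the exact equality test with a similarity score followed by a softmax. The sketch below is purely illustrative: the two-dimensional embeddings are made up, cosine similarity and the temperature are assumed choices rather than anything measured in a model, and the function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def fuzzy_induction_attention(
    tokens: Sequence[str],
    embeddings: Dict[str, Sequence[float]],
    temperature: float = 0.1,
) -> List[float]:
    """Soft attention from the last token over earlier positions 0..len-2.

    Position j is scored by how similar its *preceding* token is to the current token.
    Position 0 has no preceding token and receives weight 0.
    """
    query = embeddings[tokens[-1]]
    positions = range(1, len(tokens) - 1)
    scores = [cosine(query, embeddings[tokens[j - 1]]) / temperature for j in positions]
    if not scores:
        return [0.0] * max(len(tokens) - 1, 0)
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [0.0] + [e / total for e in exps]


# Made-up embeddings in which "puppy" is close to "dog".
emb = {"dog": [1.0, 0.0], "puppy": [0.9, 0.1], "car": [0.0, 1.0],
       "barks": [0.5, 0.5], "drives": [0.5, -0.5]}
tokens = ["puppy", "barks", "car", "drives", "dog"]
attn = fuzzy_induction_attention(tokens, emb)
print(tokens[attn.index(max(attn))])  # barks
```

Here the query "dog" never appeared earlier, yet most of the attention lands on "barks" because the token before it, "puppy", is similar to the query.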

N-gram induction

A natural extension of the basic two-head construction allows the previous-token head, or chains of such heads, to write information about not just one but several preceding tokens into each position.^[8] The induction head then prefix-matches against this longer context, effectively performing in-context n-gram lookup. This generalization has been explored both as an interpretability lens and as an inductive bias: one line of work explicitly hardcodes an n-gram induction layer into transformers as a drop-in replacement for parts of multi-head attention.^[8]
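The effect of matching on a longer prefix can be seen in a token-level lookup. This is a minimal sketch of in-context n-gram lookup, not the hardcoded layer from the cited work; the function name is hypothetical and the most recent match is preferred by assumption.

```python
from typing import Optional, Sequence


def ngram_induction_predict(tokens: Sequence[str], n: int) -> Optional[str]:
    """Predict the token that followed the most recent earlier copy of the last n tokens."""
    if n < 1 or len(tokens) <= n:
        return None
    suffix = list(tokens[-n:])
    # Each candidate window tokens[end - n:end] must end before the last
    # position so that a continuation token tokens[end] exists.
    for end in range(len(tokens) - 1, n - 1, -1):
        if list(tokens[end - n:end]) == suffix:
            return tokens[end]
    return None


context = ["A", "B", "X", "C", "B", "Y", "A", "B"]
print(ngram_induction_predict(context, n=1))  # Y  (matches the latest "B")
print(ngram_induction_predict(context, n=2))  # X  (matches the earlier "A B")
```

With n=1 the lookup is the basic induction rule; with n=2 the extra token of context disambiguates between the two earlier occurrences of "B".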

Generalized induction heads

More recent work has proposed the umbrella term generalized induction heads for circuits that extend the basic match-and-copy template to richer settings, including matching with arbitrary similarity functions, retrieving multi-token continuations, and integrating context across multiple lookups.^[9] These constructions retain the core idea of in-context retrieval followed by output editing but allow the matching and copying stages to be more flexible than the original two-head circuit.^[9]

Selective and structured induction

A more recent strand of theoretical work has analyzed selective induction heads: induction-like circuits that learn to choose, in context, which causal structure to use when completing a sequence.^[20] Such heads can implement different match-and-copy rules conditional on properties of the prompt, providing a mechanistic story for tasks where the model must infer which kind of pattern is being demonstrated before completing it.^[20] Related work has also shown that two-layer transformers can provably represent induction heads over arbitrary-order Markov chains, providing a theoretical underpinning for the n-gram-style generalizations that empirical work observes.^[21]

What tools and datasets are used to study induction heads?

TransformerLens

The most widely used software for studying induction heads and related circuits is TransformerLens, an open-source library for mechanistic interpretability of GPT-style language models originally created by Neel Nanda.^[10] TransformerLens exposes hook points on every internal activation, making it straightforward to capture attention patterns, intervene on specific heads, and run ablations. The library's introductory tutorial is built around reproducing induction-head behavior on small models, and the prefix-matching/copying metrics defined in the In-context Learning and Induction Heads paper are standard utilities in the library.^[10]

TransformerLens has since become a de facto standard for circuit-level analyses, used in much of the academic mechanistic-interpretability literature and maintained by an open-source community.^[10]

For users who want a hands-on introduction, the Concrete Steps to Get Started in Transformer Mechanistic Interpretability guide by Neel Nanda walks new researchers through replicating the induction-head analysis on a small public model using TransformerLens, and the library's documentation includes a self-contained notebook that derives the prefix-matching and copying metrics from first principles.^[10] This pedagogical pipeline has made induction-head analysis one of the standard first projects for new students of mechanistic interpretability.^[10]

Repeated random tokens and other probes

The repeated-random-tokens probe used in In-context Learning and Induction Heads is a deliberately stripped-down dataset: sequences of randomly sampled tokens concatenated with themselves so that the only useful pattern is the repetition.^[1] Because the tokens are random, any head whose attention pattern correlates with the repeat structure can be flagged as performing prefix matching, and any head whose OV circuit reliably boosts the repeated token can be flagged as copying.^[1]
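The probe and a simple prefix-matching score can be sketched as follows. This is a simplified illustration, assuming the attention pattern of one head is available as a nested list; the function names are hypothetical and the exact averaging used in the paper may differ.

```python
import random
from typing import List, Sequence


def repeated_random_tokens(half_length: int, vocab_size: int, seed: int = 0) -> List[int]:
    """Sample half_length random token ids and concatenate the sample with itself."""
    rng = random.Random(seed)
    first_half = [rng.randrange(vocab_size) for _ in range(half_length)]
    return first_half + first_half


def prefix_matching_score(attention: Sequence[Sequence[float]], half_length: int) -> float:
    """Average attention from each second-half position to the token after its earlier twin.

    attention[q][k] is the weight that query position q places on key position k.
    """
    targets = [
        attention[q][q - half_length + 1]
        for q in range(half_length, 2 * half_length)
    ]
    return sum(targets) / len(targets)


n = 3
tokens = repeated_random_tokens(n, vocab_size=50)
# An idealized induction pattern: all weight on the position after the earlier copy.
ideal = [[0.0] * (2 * n) for _ in range(2 * n)]
for q in range(n, 2 * n):
    ideal[q][q - n + 1] = 1.0
print(prefix_matching_score(ideal, n))  # 1.0
```

A head that attends uniformly over the causal context scores far lower on the same sequence, which is what lets the score separate induction-style heads from the rest.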

While useful, this probe has known limitations: heads that perform sophisticated context processing on natural text may fall back to induction-like behavior only on stripped-down repeated sequences, which can produce false positives if one identifies induction heads from repeated-random-tokens (RRT) scores alone.^[11] Subsequent analyses have therefore complemented RRT scores with task-specific behavioral tests, attention-pattern visualizations, and head-by-head ablations on natural data.^[11]

How do induction heads relate to other interpretability work?

Function vectors and task vectors

A line of research initiated by David Bau and collaborators identified function vectors: linear directions in transformer activation space that can be patched into a clean prompt to make the model perform a specific in-context task.^[12] Function vectors are constructed by averaging the outputs of a small set of attention heads, identified via causal mediation analysis, that mediate in-context learning across many tasks.^[12] These function-vector heads share the prefix-matching attention pattern that defines induction heads, and during training they tend to display high induction scores early on; as training proceeds, the function-vector score becomes the more predictive metric, suggesting that induction is a precursor or building block for function-vector behavior.^[13]

A closely related notion, task vectors, describes hidden-state representations that the model constructs from few-shot demonstrations and then uses to steer subsequent predictions.^[14] Empirically, the outputs of induction heads can serve as effective task vectors, providing a concrete link between the circuit-level description of induction and the more representational language of task vectors.^[13]

Attribution graphs and Claude

Anthropic's later interpretability work moved from circuits identified by hand to mechanized analyses based on sparse-coding decompositions and Attribution Graphs.^[15]^[16] The 2025 paper On the Biology of a Large Language Model applies Attribution Graphs built from Cross-Layer Transcoders to study Claude 3.5 Haiku, including circuits for multi-hop reasoning, planning in poetry, and shared multilingual concepts.^[15]^[16] Although this work emphasizes feature-level rather than head-level explanations, the induction-head story remains its conceptual ancestor: it was the first concrete demonstration that named circuits could be located inside production-scale transformers.^[15] Several attribution-graph case studies in On the Biology of a Large Language Model involve features and edges that perform exactly the kind of contextual matching and copying that induction heads first formalized.^[15]

The same line of work connects to feature-level interpretability using Sparse autoencoder dictionaries, as developed in Towards Monosemanticity and Scaling Monosemanticity.^[17]^[18] Where induction heads explain how the model routes information across positions, sparse autoencoders explain what the residual stream encodes at each position; both perspectives are needed to describe what a circuit actually does.

Logit lens and copy suppression

The induction-head picture has also influenced work on other interpretable attention heads. The Logit lens technique, which projects intermediate residual-stream activations through the unembedding to read out what the model would predict at each layer, is commonly used to diagnose copying and induction behavior at intermediate layers.^[19] Closely related is the discovery of copy-suppression heads, which act as a counterweight to copy-like heads (including induction heads) by reducing the logits of tokens that other components are pushing to repeat; a single such head in GPT-2 Small accounts for a large fraction of certain self-repair phenomena observed when other heads are ablated.^[19]
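The projection step of the logit lens is a single matrix-vector product followed by an argmax. The sketch below uses a made-up two-token vocabulary and unembedding matrix; the function name is hypothetical, and it omits the final layer normalization that practical implementations typically apply before unembedding.

```python
from typing import Sequence


def logit_lens(
    residual: Sequence[float],
    unembed: Sequence[Sequence[float]],
    vocab: Sequence[str],
) -> str:
    """Project one residual-stream vector through the unembedding and return the top token.

    unembed[d][v] is the weight from residual dimension d to the logit of vocab[v].
    """
    logits = [
        sum(residual[d] * unembed[d][v] for d in range(len(residual)))
        for v in range(len(vocab))
    ]
    best = max(range(len(vocab)), key=lambda v: logits[v])
    return vocab[best]


vocab = ["cat", "dog"]
unembed = [[1.0, 0.0], [0.0, 1.0]]  # toy identity unembedding
print(logit_lens([0.2, 0.9], unembed, vocab))  # dog
```

Applying the same projection to the residual stream before and after a suspected copying head is one way to see whether that head raised the logit of the attended token.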

How common are induction heads across model scales?

A central practical claim of the original work is that induction-head-like circuits are not artefacts of toy models: they appear in transformers across orders of magnitude in scale.^[1] The Anthropic paper found induction heads in models ranging from two-layer attention-only toys up through models with many tens of layers and billions of parameters, and reported that the prefix-matching and copying scores remain reliable head-identification signals at scale.^[1] Subsequent independent work has confirmed prefix-matching, copying, and induction-like patterns in a wide variety of public GPT-2, Pythia, and Llama checkpoints.^[11]^[13]

The dominance of induction-style mechanisms at scale is, however, more nuanced than the basic story suggests. More recent analyses have shown that in models above roughly a billion parameters, ablating heads identified by classic induction scores while preserving heads identified by function-vector scores leaves few-shot in-context learning mostly intact, while ablating the function-vector heads (which still exhibit induction-like patterns) impairs it much more.^[13] One interpretation is that induction is a precursor mechanism that, at scale, is partially absorbed into or supplemented by more abstract task-conditioned heads.^[13] The basic motif, prefix-matching followed by output copying through a QK/OV decomposition, persists across this transition, even as the heads that implement it become more specialized.

Another striking property of induction-head emergence is that the timescale at which the circuit forms appears to scale with context length. Theoretical analyses suggest that the number of training steps required to assemble a functional induction-head circuit grows roughly quadratically in the maximum context length, which matches the observed shift in the timing of the in-context-learning phase change in models trained with longer contexts.^[21] This is consistent with the broader pattern reported in In-context Learning and Induction Heads: induction heads are robust, but the exact moment at which they crystallize during training is sensitive to architectural and data choices in ways that other model capabilities are not.^[1]

Why do induction heads matter?

Induction heads occupy an unusual position in the Mechanistic interpretability literature: they are the first concrete circuit discovered inside a real language model that ties a mechanism to an emergent capability.^[1]^[2] Several aspects of the result have shaped subsequent interpretability research.

First, induction heads gave the field a clean example of how a behavior identified at the loss-curve level (a bump in the training loss, a phase change in in-context learning ability) could be matched to a specific circuit at the parameter level.^[1] This linkage between training dynamics and circuit-level mechanism remains a model for what mechanistic-interpretability arguments aspire to.^[15]

Second, the QK/OV decomposition introduced alongside induction heads has become a default conceptual tool in interpretability writing.^[2] Almost every subsequent attention-head case study, from the GPT-2 Small IOI (indirect object identification) circuit through the copy-suppression analyses, uses some form of the QK/OV split to organize its findings.^[19]

Third, induction heads serve as the introductory example in essentially every modern mechanistic-interpretability tutorial.^[10] TransformerLens's onboarding materials, Neel Nanda's introductory lectures, and many independent course notes all use the construction as the first nontrivial circuit students reverse-engineer.^[10] The choice is pedagogically natural: induction heads sit at the boundary where toy theory meets real, scaling phenomena.

Finally, induction heads matter for AI safety because they provide an existence proof that one can locate, name, and intervene on the specific machinery responsible for a high-level capability.^[1] If similar arguments can be made for capabilities related to deception, situational awareness, or other safety-relevant behaviors, the same toolkit of circuit identification, training-dynamics analysis, and ablation could allow more direct safety reasoning. The induction-head paper is frequently cited as a proof of concept that such arguments are at least possible.^[1]^[15]

What are the limitations of the induction-head story?

Several caveats apply to the induction-head story, and the original paper is careful about them.^[1] At larger scales the evidence linking induction heads to in-context learning is correlational rather than causal: it is consistent with multiple mechanisms, of which induction is the most parsimonious but not the only candidate.^[1] The mechanistic two-head story is also explicitly minimal; real models implement induction-like behavior using compositions across more than two layers, multiple heads per role, and contributions from MLPs, none of which are captured by the basic construction.^[11]

The behavioral definition based on repeated random tokens, while convenient, can both miss heads that perform context-rich induction only on natural text and falsely flag heads that exhibit copy-like behavior only as a fallback on degenerate inputs.^[11] Several follow-up analyses have argued that the cleanest characterization of "induction" depends on the test distribution one uses, and that any single score should be treated as a coarse proxy rather than a definitive identifier.^[11]
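To make the "coarse proxy" point concrete, here is a minimal sketch of one way a repeated-random-token score could be computed from a single head's attention pattern. The function name `induction_score`, the absence of a beginning-of-sequence token, and the three hand-built attention patterns are all assumptions made for illustration; real tooling differs in details such as BOS handling and averaging over batches.

```python
import numpy as np


def induction_score(pattern, half_len):
    """Mean attention from each second-half query to the key that sits
    right after the earlier copy of the query token.

    `pattern[q, k]` is the attention weight from query position q to key
    position k, for a sequence made of `half_len` random tokens repeated
    twice (no BOS token is assumed here).
    """
    seq_len = pattern.shape[0]
    queries = np.arange(half_len, seq_len)
    keys = queries - half_len + 1  # token after the previous occurrence
    return float(pattern[queries, keys].mean())


half_len = 6
seq_len = 2 * half_len

# Hypothetical ideal induction head: in the second half, all attention
# goes to the token after the earlier copy; elsewhere it rests on token 0.
ideal = np.zeros((seq_len, seq_len))
for q in range(seq_len):
    ideal[q, q - half_len + 1 if q >= half_len else 0] = 1.0

# Hypothetical previous-token head: always attends one position back.
prev_token = np.zeros((seq_len, seq_len))
prev_token[0, 0] = 1.0
for q in range(1, seq_len):
    prev_token[q, q - 1] = 1.0

# Baseline: uniform attention over all causally visible positions.
uniform = np.tril(np.ones((seq_len, seq_len)))
uniform /= uniform.sum(axis=1, keepdims=True)

for name, pat in [("ideal", ideal), ("prev_token", prev_token), ("uniform", uniform)]:
    print(name, round(induction_score(pat, half_len), 3))
```

Because the score only measures attention along one diagonal on one synthetic input distribution, it says nothing about what the head does on natural text, which is the gap the caveat above describes.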

Finally, the relationship between induction heads and other mechanisms responsible for in-context learning is now understood to be more complex than originally proposed. As noted above, function-vector heads and task-vector representations partially decouple from classic induction signatures at scale, and parts of in-context learning that involve abstract pattern recognition or task inference may be carried by mechanisms that go well beyond simple prefix-matching and copying.^[13]

ELI5: what is an induction head?

Imagine you are reading a sentence and you see the name "Dr. Lee" early on. Later you see "Dr." again and, before you even read the next word, you guess "Lee", because that is the word that followed "Dr." last time. An induction head is the little piece of a language model that makes exactly this guess: it looks back for where the current word appeared before, sees what came next, and bets that the same thing will come next again.^[1] It is made of two cooperating parts: one part tags every word with "the word before me was X", and the other part uses those tags to find a matching spot earlier in the text and copy whatever followed it.^[1]^[2] This simple copy-the-pattern trick turns out to power a lot of what makes large language models seem to "learn" from the examples you put in a prompt.^[1]
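The guess described above can be written as a few lines of ordinary code. This is a toy sketch of the match-and-copy idea only, not how a transformer implements it: the function name `induction_guess` and the example sentence are made up for illustration, and a real induction head produces a soft shift in next-token probabilities rather than a single hard answer.

```python
def induction_guess(tokens):
    """Return the token that followed the most recent earlier occurrence
    of the last token, or None if the last token has not appeared before."""
    if not tokens:
        return None
    current = tokens[-1]
    # Scan backwards over earlier positions, most recent first.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]  # copy whatever came next last time
    return None


context = ["Dr.", "Lee", "walked", "in", ".", "Everyone", "greeted", "Dr."]
print(induction_guess(context))  # Lee
```

If the last word has never appeared earlier in the context, the function returns `None`, which mirrors the fact that an induction head has nothing to copy in that case.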

Concept	Relation to induction heads
In-Context Learning	The behavior that induction heads are hypothesized to implement at the circuit level.
Mechanistic interpretability	The broader research program that discovered induction heads as one of its first concrete circuits.
Attribution Graphs	A later interpretability framework that builds feature- and edge-level explanations for behaviors related to those that induction heads explain at the head level.
Sparse autoencoder	A complementary tool for decomposing the residual stream into features that induction heads then route across positions.
Logit lens	A diagnostic technique often paired with induction-head analyses to read out how copying behavior builds up across layers.
Towards Monosemanticity	Anthropic's first sparse-feature decomposition of a language model, conceptually adjacent to the head-level decomposition that defines induction heads.
Scaling Monosemanticity	The follow-up that scales sparse-feature analysis to Claude 3 Sonnet, complementing head-level analyses with feature-level ones.
On the Biology of a Large Language Model	Applies attribution graphs to Claude 3.5 Haiku and revisits induction-like circuits at a finer-grained level.

References

1. Catherine Olsson, Nelson Elhage, Neel Nanda et al., "In-context Learning and Induction Heads", Transformer Circuits Thread / arXiv:2209.11895, 2022-09-24. https://arxiv.org/abs/2209.11895. Accessed 2026-05-20.
2. Nelson Elhage, Neel Nanda, Catherine Olsson et al., "A Mathematical Framework for Transformer Circuits", Anthropic / Transformer Circuits Thread, 2021-12-22. https://transformer-circuits.pub/2021/framework/index.html. Accessed 2026-05-20.
3. Ashish Vaswani et al., "Attention Is All You Need", arXiv:1706.03762, 2017-06-12. https://arxiv.org/abs/1706.03762. Accessed 2026-05-20.
4. Tom B. Brown et al., "Language Models are Few-Shot Learners", arXiv:2005.14165, 2020-05-28. https://arxiv.org/abs/2005.14165. Accessed 2026-05-20.
5. Chris Olah et al., "Zoom In: An Introduction to Circuits", Distill, 2020-03-10. https://distill.pub/2020/circuits/zoom-in/. Accessed 2026-05-20.
6. Catherine Olsson et al., "In-context Learning and Induction Heads", Transformer Circuits Thread, 2022-03-08. https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html. Accessed 2026-05-20.
7. Jie Ren et al., "Identifying Semantic Induction Heads to Understand In-Context Learning", arXiv:2402.13055, 2024-02-20. https://arxiv.org/abs/2402.13055. Accessed 2026-05-20.
8. Ilya Zisman et al., "N-Gram Induction Heads for In-Context RL: Improving Stability and Reducing Data Needs", arXiv:2411.01958, 2024-11-04. https://arxiv.org/abs/2411.01958. Accessed 2026-05-20.
9. Eunji Kim et al., "Interpretable Next-token Prediction via the Generalized Induction Head", arXiv:2411.00066, 2024-10-31. https://arxiv.org/abs/2411.00066. Accessed 2026-05-20.
10. Neel Nanda and contributors, "TransformerLens: A library for mechanistic interpretability of GPT-style language models", GitHub, 2022-09-01. https://github.com/TransformerLensOrg/TransformerLens. Accessed 2026-05-20.
11. Alex Spies, "Some common confusion about induction heads", LessWrong, 2023-03-29. https://www.lesswrong.com/posts/nJqftacoQGKurJ6fv/some-common-confusion-about-induction-heads. Accessed 2026-05-20.
12. Eric Todd, Millicent L. Li, Arnab Sen Sharma, Aaron Mueller, Byron C. Wallace, David Bau, "Function Vectors in Large Language Models", arXiv:2310.15213, 2023-10-23. https://arxiv.org/abs/2310.15213. Accessed 2026-05-20.
13. Kayo Yin et al., "Which Attention Heads Matter for In-Context Learning?", arXiv:2502.14010, 2025-02-19. https://arxiv.org/abs/2502.14010. Accessed 2026-05-20.
14. Roee Hendel, Mor Geva, Amir Globerson, "In-Context Learning Creates Task Vectors", arXiv:2310.15916, 2023-10-24. https://arxiv.org/abs/2310.15916. Accessed 2026-05-20.
15. Jack Lindsey et al., "On the Biology of a Large Language Model", Anthropic / Transformer Circuits Thread, 2025-03-27. https://transformer-circuits.pub/2025/attribution-graphs/biology.html. Accessed 2026-05-20.
16. Anthropic, "Circuit Tracing: Revealing Computational Graphs in Language Models", Transformer Circuits Thread, 2025-03-27. https://transformer-circuits.pub/2025/attribution-graphs/methods.html. Accessed 2026-05-20.
17. Trenton Bricken et al., "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning", Anthropic / Transformer Circuits Thread, 2023-10-04. https://transformer-circuits.pub/2023/monosemantic-features/index.html. Accessed 2026-05-20.
18. Adly Templeton et al., "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet", Anthropic / Transformer Circuits Thread, 2024-05-21. https://transformer-circuits.pub/2024/scaling-monosemanticity/. Accessed 2026-05-20.
19. Callum McDougall, Arthur Conmy, Cody Rushing, Thomas McGrath, Neel Nanda, "Copy Suppression: Comprehensively Understanding an Attention Head", arXiv:2310.04625, 2023-10-06. https://arxiv.org/abs/2310.04625. Accessed 2026-05-20.
20. Francesco D'Angelo et al., "Selective Induction Heads: How Transformers Select Causal Structures In Context", arXiv:2509.08184, 2025-09-10. https://arxiv.org/abs/2509.08184. Accessed 2026-05-20.
21. "What One Cannot, Two Can: Two-Layer Transformers Provably Represent Induction Heads on Any-Order Markov Chains", arXiv:2508.07208, 2025-08-10. https://arxiv.org/abs/2508.07208. Accessed 2026-05-20.
22. Aaditya K. Singh et al., "What needs to go right for an induction head? A mechanistic study of in-context learning circuits and their formation", arXiv:2404.07129, 2024-04-10. https://arxiv.org/abs/2404.07129. Accessed 2026-05-20.


