Induction head
Last reviewed
May 19, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,581 words
An induction head is a specific type of attention head in transformer language models that performs in-context pattern completion. Given a sequence containing the prefix pattern [A][B] ... [A], an induction head attends from the second occurrence of [A] back to the token [B] that followed the first [A], and increases the model's predicted probability that the next token will be [B].[^1] In short, induction heads implement the inference rule "if I saw A followed by B earlier in this context, then when I see A again, predict B."
Induction heads were introduced and named in two foundational papers from Anthropic on the mechanistic reverse-engineering of transformer language models: "A Mathematical Framework for Transformer Circuits" by Elhage et al. (2021)[^2] and "In-context Learning and Induction Heads" by Olsson et al. (2022).[^1] These papers argued that induction heads are a primary mechanism by which transformer language models perform in-context learning (the ability to use information presented earlier in the context window to inform later predictions), and that they emerge through a sharp "phase change" during training. Because they constitute one of the first concrete, mechanistically described circuits identified inside a working language model, induction heads have become a central object of study in the field of mechanistic interpretability.[^1][^2]
Large transformer language models exhibit the ability to improve their next-token predictions as they see more of a document, using the earlier portion of the context to inform predictions about later portions. This behavior is referred to as in-context learning (ICL).[^1] Olsson et al. (2022) operationalize ICL using an "in-context learning score," which they define as "the loss of the 500th token in the context minus the average loss of the 50th token in the context, averaged over dataset examples." A more negative score indicates that the model's loss decreases substantially over the course of a long context, meaning the model is making better predictions later in the context than earlier.[^1]
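The score quoted above can be computed directly from a table of per-token losses. The following is a minimal sketch, assuming the losses are already available as an array of shape (examples, context positions); the function name `in_context_learning_score` and the synthetic loss curve are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def in_context_learning_score(token_losses, early=50, late=500):
    """Loss at the `late`-th token minus loss at the `early`-th token,
    each averaged over examples. Token positions are 1-indexed."""
    token_losses = np.asarray(token_losses, dtype=float)
    if token_losses.ndim != 2 or token_losses.shape[1] < late:
        raise ValueError("expected shape (n_examples, n_ctx) with n_ctx >= late")
    late_loss = token_losses[:, late - 1].mean()
    early_loss = token_losses[:, early - 1].mean()
    return float(late_loss - early_loss)

# Synthetic losses, made up for illustration: loss decays with position.
rng = np.random.default_rng(0)
positions = np.arange(1, 513)
synthetic_losses = 2.0 + 3.0 / np.sqrt(positions) + 0.01 * rng.standard_normal((8, 512))

icl_score = in_context_learning_score(synthetic_losses)
print(f"in-context learning score: {icl_score:.3f}")  # negative: later tokens are easier
```

A model whose loss does not change with position would score close to zero under this definition.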
Before the induction-heads work, ICL was known empirically but its underlying mechanism inside the network was not characterized. The induction-heads research program asked: which specific components of a transformer are responsible for this behavior, and how do they implement it?[^1]
The framework introduced by Elhage et al. (2021) treats a transformer not as an opaque function but as a sum of interpretable end-to-end functions, each corresponding to a "path" through the model. The functions in this decomposition map tokens to changes in logits, and are linear if one freezes the attention patterns.[^2] The framework emphasizes that all transformer components "communicate with each other by reading and writing to different subspaces of the residual stream," so that the residual stream serves as a shared communication channel.[^2]
The framework decomposes each attention head into two independent computations: a QK ("query-key") circuit that determines which positions the head attends to, and an OV ("output-value") circuit that determines how each token affects the output if attended to.[^2] These two circuits can be analyzed separately. Key, query, and value vectors can be thought of as intermediate results in the computation of low-rank matrices that map between subspaces of the residual stream.[^2]
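The separation into QK and OV circuits can be checked numerically. The sketch below assumes a single head with randomly initialized weights and a row-vector convention (`x @ W`); the names `W_QK` and `W_OV` follow the framework's terminology, but the shapes, the scaling, and the convention are choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, n_pos = 16, 4, 5

# Hypothetical weights for one attention head (row-vector convention: x @ W).
W_Q = rng.standard_normal((d_model, d_head))
W_K = rng.standard_normal((d_model, d_head))
W_V = rng.standard_normal((d_model, d_head))
W_O = rng.standard_normal((d_head, d_model))

x = rng.standard_normal((n_pos, d_model))  # residual stream, one row per position

# Usual computation through explicit query, key, and value vectors.
q, k, v = x @ W_Q, x @ W_K, x @ W_V
scores = q @ k.T / np.sqrt(d_head)
mask = np.tril(np.ones((n_pos, n_pos), dtype=bool))  # causal mask
scores = np.where(mask, scores, -np.inf)
pattern = np.exp(scores - scores.max(axis=-1, keepdims=True))
pattern /= pattern.sum(axis=-1, keepdims=True)
head_out = pattern @ v @ W_O

# The same computation through the two combined low-rank matrices.
W_QK = W_Q @ W_K.T  # (d_model, d_model), rank <= d_head: where to attend
W_OV = W_V @ W_O    # (d_model, d_model), rank <= d_head: what to write
scores_qk = x @ W_QK @ x.T / np.sqrt(d_head)
head_out_ov = pattern @ (x @ W_OV)

print(np.allclose(scores_qk[mask], scores[mask]), np.allclose(head_out, head_out_ov))
```

Only `W_QK` enters the attention pattern and only `W_OV` enters the written output, which is why the two can be studied independently once the pattern is held fixed.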
A useful first observation from this framework is that one-layer attention-only transformers can be understood as an ensemble of bigram and skip-trigram models: they capture statistical regularities of the form "after seeing A, then B, predict C," with gaps allowed between the tokens.[^2] This already provides some nontrivial in-context behavior, but it is limited because a single attention head can only attend based on raw token identities, not on patterns formed earlier in the context. To go beyond skip-trigrams, two heads in different layers must be composed.[^2]
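Two attention heads in different layers can interact when the output that the earlier head writes into the residual stream becomes part of the input to the later head. Elhage et al. identify three forms of such interaction:
- Q-composition: the later head's query computation reads from a subspace of the residual stream that the earlier head wrote to.
- K-composition: the later head's key computation reads from a subspace that the earlier head wrote to.
- V-composition: the later head's value computation reads from a subspace that the earlier head wrote to.[^2]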
Q-composition and K-composition allow information to flow between the attention patterns of different heads, since they shape what the later head looks for and what it can match against. V-composition is qualitatively different: it affects what the later head writes when it attends to a position, but not which position it chooses to attend to.[^2] The induction-head circuit is an example of K-composition between two heads in different layers: the previous-token head writes information into the residual stream that the later head's keys then incorporate.[^2]
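One way to make K-composition concrete is to measure how strongly a later head's QK circuit reads from the subspace an earlier head's OV circuit writes to. The sketch below uses a normalized Frobenius-norm ratio in the spirit of the composition measure in Elhage et al. (2021); the row-vector convention, the placement of transposes, and the constructed weights are assumptions of this example rather than the paper's exact formulation.

```python
import numpy as np

def composition_score(A, B):
    """Frobenius norm of A @ B relative to the norms of its two factors."""
    return float(np.linalg.norm(A @ B) / (np.linalg.norm(A) * np.linalg.norm(B)))

rng = np.random.default_rng(0)
d_model, d_head = 16, 2

# Two disjoint, orthonormal subspaces of the residual stream.
basis, _ = np.linalg.qr(rng.standard_normal((d_model, d_model)))
sub_a = basis[:, :d_head]
sub_b = basis[:, d_head:2 * d_head]

# Earlier head: its OV circuit writes only into subspace a (x @ W_OV_1).
W_OV_1 = rng.standard_normal((d_model, d_head)) @ sub_a.T

# Later head: attention score is x_query @ W_QK @ x_key.T, with W_QK = W_Q @ W_K.T.
W_Q_2 = rng.standard_normal((d_model, d_head))
W_QK_reads_a = W_Q_2 @ sub_a.T  # keys read from subspace a (W_K = sub_a)
W_QK_reads_b = W_Q_2 @ sub_b.T  # keys read from subspace b (W_K = sub_b)

# K-composition term: x_query @ W_QK @ (x_src @ W_OV_1).T, so the relevant
# product under this convention is W_QK @ W_OV_1.T.
k_comp_aligned = composition_score(W_QK_reads_a, W_OV_1.T)
k_comp_disjoint = composition_score(W_QK_reads_b, W_OV_1.T)
print(k_comp_aligned, k_comp_disjoint)
```

The later head whose keys read from the written subspace shows a clearly nonzero score, while the head reading from a disjoint subspace scores essentially zero: the earlier head cannot influence what it matches against.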
Much of the foundational analysis of induction heads is performed in "attention-only" transformer models, which lack MLP layers and contain only token embeddings, stacked attention blocks, and an unembedding back to logits.[^2] Removing MLPs sacrifices some modeling capacity but greatly simplifies analysis: attention heads can be cleanly decomposed into QK and OV circuits, and their compositions can be enumerated and inspected.[^2] Elhage et al. (2021) and Olsson et al. (2022) both rely on attention-only models for the cleanest mechanistic story, and then ask how the resulting picture transfers to larger models that include MLPs.[^1][^2]
The minimal implementation of an induction head requires the composition of two attention heads in two different layers of an attention-only transformer.[^2] Elhage et al. (2021) and Olsson et al. (2022) describe the circuit as follows; the two-layer requirement is essential, since a single attention head cannot both record what token preceded each position and use that recorded information to match against the current position.[^1][^2]
In the first layer, a "previous-token head" attends from each position to the position immediately before it and copies information about the preceding token into the residual stream at the current position. After the first layer, each position therefore carries a representation that combines information about the current token and information about the token immediately preceding it.[^2]
In a later layer, the induction head uses the previous-token information to perform pattern matching. Concretely, suppose the input sequence contains an earlier occurrence of a token A followed by a token B, and the model now sees a second occurrence of A. The induction head sitting at the second A:
1. forms a query that in effect asks, "which earlier positions have A as their previous token?"
2. attends to the position immediately after the earlier occurrence of A, namely the position holding B.
3. reads the attended token B, and through its OV circuit it increases the model's logit for predicting B as the next token.[^1][^2]
The net effect is that the model predicts B after the second A, completing the pattern [A][B] ... [A] -> [B]. Because the mechanism requires two heads in different layers, induction heads cannot occur in single-layer attention-only models; they appear starting with two-layer attention-only models.[^1][^2]
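The match-then-copy behavior described above can be written as an idealized algorithm over token ids. This is a hand-coded sketch of the circuit's input-output behavior, not weights from a trained model; the function names and the hard (0/1) matching are assumptions made for clarity.

```python
import numpy as np

def induction_attention(tokens):
    """Idealized two-step induction circuit on a sequence of non-negative token ids.

    Step 1 (previous-token head): every position records the token before it.
    Step 2 (induction head): the final position attends to positions whose
    recorded previous token equals the current token.
    Returns the final position's attention weights over all positions.
    """
    tokens = np.asarray(tokens)
    prev_token = np.full(tokens.shape, -1)  # -1 marks "no previous token"
    prev_token[1:] = tokens[:-1]
    match = (prev_token == tokens[-1]).astype(float)
    total = match.sum()
    return match / total if total > 0 else match

def induction_logit_boost(tokens, vocab_size):
    """Copy step (OV circuit): each attended-to token votes for itself."""
    attention = induction_attention(tokens)
    boost = np.zeros(vocab_size)
    np.add.at(boost, np.asarray(tokens), attention)
    return boost

A, B, C, D = 0, 1, 2, 3
sequence = [A, B, C, D, A]
attention = induction_attention(sequence)
boost = induction_logit_boost(sequence, vocab_size=4)
predicted = int(np.argmax(boost))
print(attention, predicted)
```

In this example the final A attends only to position 1, which holds B, and B receives the entire logit boost. A single head cannot do this on its own because the `prev_token` record has to be written into the residual stream by an earlier layer before the later head can match against it.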
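To identify induction heads empirically in a trained model, Olsson et al. (2022) define two scores measured on sequences of repeated random tokens (sequences in which a long random subsequence is concatenated with itself so that strong induction-like behavior can be detected on a controlled distribution).[^1] The scores are:
- Prefix matching: the head attends back to the token that came immediately after an earlier occurrence of the current token, that is, the token that induction would suggest comes next.
- Copying: the head's output increases the logit of the token it attends to.[^1]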
A head that scores highly on both prefix matching and copying is operationally classified as an induction head. The pairing of the two criteria matters: a head with high prefix matching but low copying would identify the right earlier position without contributing to the prediction, while a head with high copying but no prefix matching would copy from arbitrary attended positions rather than from a pattern-matched one. Only the conjunction matches the functional role of an induction head.[^1] This empirical, behavioral definition allows induction heads to be detected and counted automatically across many models and training checkpoints, which is what makes large comparative studies across the 34 models in Olsson et al. (2022) tractable.[^1]
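The prefix-matching half of this test can be sketched given a head's attention pattern on a repeated random sequence. The code below covers prefix matching only (the copying measurement needs the head's output weights and is omitted); the averaging scheme and the idealized attention patterns are assumptions of this example, not the paper's exact procedure.

```python
import numpy as np

def prefix_matching_score(pattern, period):
    """Mean attention from each token in the repeat to the token that followed
    its earlier occurrence. pattern[i, j] is attention from query i to key j."""
    seq_len = pattern.shape[0]
    queries = np.arange(period, seq_len)
    targets = queries - period + 1
    return float(pattern[queries, targets].mean())

rng = np.random.default_rng(0)
period, vocab_size = 20, 1000
block = rng.integers(0, vocab_size, size=period)
tokens = np.concatenate([block, block])  # repeated random tokens
seq_len = tokens.size

# Idealized induction pattern (hypothetical, not read from a real model).
ideal = np.zeros((seq_len, seq_len))
ideal[np.arange(period), np.arange(period)] = 1.0  # first copy: attend to self
ideal[np.arange(period, seq_len), np.arange(1, period + 1)] = 1.0

# Baseline: uniform attention over all causally visible positions.
uniform = np.tril(np.ones((seq_len, seq_len)))
uniform /= uniform.sum(axis=-1, keepdims=True)

ideal_score = prefix_matching_score(ideal, period)
uniform_score = prefix_matching_score(uniform, period)
print(ideal_score, uniform_score)
```

Random tokens are used so that a high score cannot be explained by ordinary language statistics: the only cue for where to attend is the repetition itself.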
Olsson et al. (2022) present six lines of evidence that induction heads are not merely one of many mechanisms contributing to in-context learning, but are a primary mechanism for it across model scales.[^1] The authors describe the argument structure explicitly and present each line of evidence in turn.[^1]
The six lines of evidence are presented across the paper's 34 transformer models, which span a range of sizes from small attention-only networks (where the mechanistic dissection can be performed cleanly) up to comparatively large models that include MLP layers (where the case rests more on continuity and correlation arguments).[^1]
In a wide range of trained transformer language models, the formation of induction heads during training co-occurs in time with a large jump in the model's in-context learning ability. As induction heads form, the model's in-context learning score improves dramatically; a visible bump in the training loss accompanies both events. The authors describe this co-occurrence as a "phase change."[^1] Because the phase change is sharp and clearly identifiable, it is unlikely to be coincidental: the formation of induction heads and the improvement in in-context learning are visibly tied to the same training event in many independently trained models.[^1]
The authors apply a variety of architectural perturbations across their 34 trained transformers and observe that perturbations that shift when induction heads can form also shift when in-context learning improves, in a matching way. In other words, the two phenomena move together not just in baseline training but also under intervention.[^1] Co-perturbation strengthens the causal-style claim beyond mere co-occurrence: if induction-head formation can be moved in time by a manipulation, and the in-context learning improvement moves with it, then the two are mechanistically linked rather than being independent consequences of training progress.[^1]
When induction heads are ablated (knocked out) at test time in small models, the model's in-context learning ability decreases substantially. This is a causal, rather than purely correlational, line of evidence: removing induction heads removes much of the in-context learning capability.[^1] Ablation studies are most decisive in small attention-only models, where the set of induction heads can be enumerated and individually disabled. In larger models, ablation is more difficult both because there are many more heads and because some functions may be redundantly implemented.[^1]
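A generic form of head ablation is to remove a head's contribution before the head outputs are summed into the residual stream. The sketch below uses zero-ablation on random stand-in activations; this is an assumed, simplified ablation scheme for illustration and is not claimed to be the exact procedure used by Olsson et al. (2022).

```python
import numpy as np

def ablate_heads(head_outputs, heads_to_ablate):
    """Zero-ablation: drop the selected heads' contributions.

    head_outputs has shape (n_heads, n_pos, d_model); the layer's update to
    the residual stream is the sum over the first axis.
    """
    ablated = np.array(head_outputs, dtype=float, copy=True)
    ablated[list(heads_to_ablate)] = 0.0
    return ablated

rng = np.random.default_rng(0)
n_heads, n_pos, d_model = 4, 6, 8
head_outputs = rng.standard_normal((n_heads, n_pos, d_model))  # stand-in activations

residual_full = head_outputs.sum(axis=0)
residual_ablated = ablate_heads(head_outputs, heads_to_ablate=[2]).sum(axis=0)

# The difference is exactly the removed head's contribution.
print(np.allclose(residual_full - residual_ablated, head_outputs[2]))
```

In an actual experiment, the ablated residual stream would be passed through the rest of the model and the in-context learning score recomputed, so that the change in score can be attributed to the removed heads.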
The authors document that induction heads do not merely perform literal token-for-token copying. They observe induction heads implementing more abstract pattern-completion behaviors, including literal sequence copying, translation-like behavior, and pattern matching on more abstract sequence features.[^1] These examples suggest that induction heads support a substantial portion of what is normally called in-context learning rather than only verbatim repetition; abstract behaviors such as translation are explicitly cited as evidence that the same circuit family generalizes well beyond toy repeat-the-string demonstrations.[^1]
The authors argue that the induction-head mechanism naturally extends to more general in-context learning. Because induction heads can be viewed as performing a form of nearest-neighbor lookup over the context, and because the matching can be made fuzzy rather than exact, the same circuit type provides a plausible substrate for more general ICL behaviors.[^1] The argument here is not a direct experimental finding but a structural claim: the kind of computation an induction head performs (pattern matching followed by copying) is precisely the kind of computation one would expect to underlie ICL more broadly.[^1]
Behaviors characteristic of induction heads, such as prefix matching and copying, exhibit smooth continuity from small attention-only models to larger models with MLPs. The same scores can be computed across scales and identify analogous heads in both regimes, suggesting the small-model story extrapolates to larger models rather than being a peculiarity of the small-model regime.[^1] Continuity is important because the cleanest mechanistic case is made in small models, while the practical motivation is to understand large models; without continuity, the small-model story would risk being interesting but not actionable for large models.[^1]
Taken together, the authors argue that these six lines of evidence make the case that induction heads constitute a primary mechanism for in-context learning in transformer language models, with appropriately scaled confidence in small versus large models. The authors describe the overall argument as circumstantial and emphasize that subtle confounds and alternative hypotheses cannot be ruled out, while still arguing that the evidence collectively points strongly toward induction heads as a central ICL mechanism.[^1]
A particularly striking finding of Olsson et al. (2022) is that the emergence of induction heads during training is not gradual but abrupt. Over a narrow band of training steps, induction heads form, in-context learning ability rises sharply, and a visible bump appears in the training loss curve.[^1] The authors call this co-occurring set of events the "phase change."[^1]
The phase change has several notable properties as described in the paper:[^1]
The phase change has motivated significant follow-up research into how and why induction heads emerge so abruptly, which is described below.
Although the simplest induction heads perform exact, verbatim copying of [A][B] ... [A] -> [B], Olsson et al. (2022) emphasize that real induction heads in trained models go beyond verbatim copying. They describe "fuzzy" or "nearest neighbor" pattern completion, in which the head completes the pattern [A*][B*] ... [A] -> [B], where A* and B* are tokens similar to (but not identical to) A and B. Some induction heads also perform a fuzzy match over several preceding tokens, attending based on a window of recent context rather than a single token.[^1]
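The fuzzy variant can be pictured as a soft nearest-neighbor lookup in which exact token equality is replaced by embedding similarity. The sketch below is a toy illustration under that assumption; the embeddings, the cosine similarity, the `temperature` parameter, and the function name are all hypothetical choices, not a description of how any trained head computes its match.

```python
import numpy as np

def fuzzy_induction_logits(embeddings, tokens, vocab_size, temperature=0.1):
    """Soft nearest-neighbor version of [A*][B*] ... [A] -> [B].

    The final token is compared with the predecessor of every position;
    each position's token then votes for itself in proportion to that similarity.
    """
    tokens = np.asarray(tokens)
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    query = unit[tokens[-1]]   # current token A
    keys = unit[tokens[:-1]]   # candidate predecessors A*
    followers = tokens[1:]     # tokens B* that followed them
    similarity = keys @ query
    weights = np.exp((similarity - similarity.max()) / temperature)
    weights /= weights.sum()
    logits = np.zeros(vocab_size)
    np.add.at(logits, followers, weights)
    return logits

# Toy vocabulary with made-up embeddings: "kitten" is close to "cat".
embeddings = np.array([
    [1.0, 0.0, 0.0, 0.0],  # 0: "cat"
    [0.0, 1.0, 0.0, 0.0],  # 1: "sat"
    [0.0, 0.0, 1.0, 0.0],  # 2: "dog"
    [0.0, 0.0, 0.0, 1.0],  # 3: "ran"
    [0.9, 0.1, 0.0, 0.0],  # 4: "kitten"
])
sequence = [4, 1, 2, 3, 0]  # "kitten sat dog ran cat"
logits = fuzzy_induction_logits(embeddings, sequence, vocab_size=5)
predicted = int(np.argmax(logits))
print(predicted)
```

Here "cat" never appears earlier in the sequence, so exact matching would find nothing, but the similar token "kitten" still pulls the prediction toward "sat", the token that followed it.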
This generalization is significant because it explains how a mechanism originally identified through repeated random-token sequences can plausibly underlie more abstract in-context behaviors observed in larger language models, such as translation-like completion or abstract sequence continuation. The authors cite this generality as one of the reasons to think induction heads support a broad swath of in-context learning rather than only the special case of literal repetition.[^1]
Following the introduction of induction heads in 2021 and 2022, several research groups have studied their formation, the conditions under which they emerge, and their role in in-context learning. Two notable studies are described here.
Gautam Reddy's paper "The mechanistic basis of data dependence and abrupt learning in an in-context classification task," posted to arXiv in December 2023, investigates the abrupt emergence of induction heads in a controlled in-context classification setting.[^3] Reddy recapitulates the abrupt-emergence phenomenon in a minimal attention-only network trained on a simplified dataset and shows that in-context learning in this setting is driven by the abrupt emergence of an induction head, which subsequently competes with in-weights learning.[^3]
The paper attributes the sharpness of the transition to the sequential learning of three "nested logits" enabled by an intrinsic curriculum, and argues that the sharp transitions in attention-based networks arise due to a specific chain of multi-layer operations necessary to achieve ICL, implemented by nested nonlinearities that are sequentially learned during training.[^3] Reddy's work provides a phenomenological model that connects the empirical phase change observed by Olsson et al. (2022) to a mechanism of sequential, dependency-driven learning of the components needed to form an induction head.[^3]
Aaditya K. Singh, Ted Moskovitz, Felix Hill, Stephanie C. Y. Chan, and Andrew M. Saxe published "What needs to go right for an induction head? A mechanistic study of in-context learning circuits and their formation" on arXiv in April 2024.[^4] The paper introduces what the authors describe as an "optogenetics-inspired causal framework" for modifying activations throughout training, which allows them to selectively clamp subsets of activations and observe the effect on induction-head formation.[^4]
By clamping subsets of activations throughout training, Singh et al. identify three underlying subcircuits that interact to drive induction-head formation and yield the phase change.[^4] They also report that multiple induction heads operate in a "diverse and additive" manner, and that the identified subcircuits help explain data-dependent properties such as the timing of the phase change.[^4] This work complements Reddy's by providing a more direct, intervention-based dissection of the components whose joint development produces an induction head.[^4]
The introduction of induction heads has been influential in several ways:
Before the induction-heads work, much of the discussion of mechanistic interpretability focused on individual neurons or on simple toy circuits. The induction-head circuit, by contrast, is a concrete, multi-component computation identified inside trained language models, with a clear functional role and an operational definition that can be used to detect it.[^1][^2] It thereby served as an existence proof that meaningful circuits exist and can be reverse-engineered.[^1][^2]
The argument by Olsson et al. (2022) that induction heads are a primary mechanism of in-context learning provides a starting point for thinking about how large language models acquire and use in-context information.[^1] Subsequent work has examined to what extent ICL in large models is reducible to induction-head-like computations, and to what extent it involves additional mechanisms; the induction-head hypothesis serves as a reference point for these investigations.[^1]
The abrupt, phase-change-like emergence of induction heads has made them a popular case study for understanding how capabilities emerge in neural networks during training. Reddy (2023) and Singh et al. (2024) both treat the induction-head phase change as a tractable instance of more general "abrupt learning" phenomena, and use the controlled setting to study the dynamics of capability emergence.[^3][^4]
Several limitations of the induction-head story are explicitly noted by the original authors and have been discussed in subsequent work.
Olsson et al. (2022) note that the evidence for induction heads as the mechanism of ICL is strongest in small attention-only models, where direct ablation experiments are tractable and the circuit can be characterized in detail. The case for large models rests more heavily on continuity arguments (line of evidence 6) and on the observation that prefix-matching and copying scores identify analogous heads at scale, rather than on direct mechanistic dissection. The authors frame their claim accordingly, with stronger confidence at small scale and weaker but still substantive confidence at large scale.[^1]
The original paper does not claim that induction heads explain every aspect of in-context learning. They are described as a primary mechanism, and the authors leave open that additional mechanisms also contribute, particularly in large models that contain MLPs and many additional heads beyond those identified as induction heads. The fuzzy-induction discussion is itself an extension of the basic verbatim-induction story, and the boundary between "induction-head-like" and "other" ICL mechanisms is not sharp.[^1]
The prefix-matching and copying scores are operational behavioral measures defined on repeated random sequences. A head can score highly on these measures without necessarily implementing the exact circuit described by Elhage et al. (2021), particularly in larger models where many heads can contribute to similar behavior. Care is therefore needed when treating "induction head" as a single, well-defined category across scales.[^1][^2]
The clearest mechanistic account of induction heads is given in attention-only transformer models without MLP layers, where the QK and OV circuits can be analyzed cleanly.[^2] In production-scale models that include MLP layers, MLP nonlinearities can also contribute to in-context behavior in ways that are not fully captured by the two-head induction circuit. The induction-head story therefore provides a foundational picture rather than a complete account of ICL in modern large models.[^1][^2]