Activation patching
Last reviewed
May 19, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,813 words
Activation patching is a causal intervention technique used in mechanistic interpretability to identify which internal components of a neural network are causally responsible for a specific behaviour. The method works by replacing (or "patching") activations from one forward pass of a model with activations cached from a different forward pass, then measuring the resulting change in the model's output. By selectively patching different components and observing which ones restore or break the original behaviour, researchers can localise the computations responsible for a phenomenon of interest.[^1][^2][^3]
The technique is also known by several other names that reflect its origins in causal inference and its variants in interpretability research. The most common synonyms are causal tracing (popularised by Meng et al. in the ROME paper), causal mediation analysis (the broader statistical framework from which it descends), and interchange intervention (the formalisation developed by Geiger and colleagues at Stanford).[^2][^4][^5] Although these terms have slightly different historical lineages and methodological emphases, they refer to essentially the same underlying operation: substituting an internal model state with a counterfactual value to measure the causal effect on the output.
Activation patching has become one of the most widely used tools in mechanistic interpretability research for large language models such as GPT-2. It was central to landmark works including the ROME editing method, the Indirect Object Identification (IOI) circuit discovery in GPT-2 small, and the development of automated circuit discovery tools.[^2][^3][^6]
In the simplest formulation, activation patching involves running a transformer (or any other neural model) on two related but distinct inputs. One input, often called the clean input or base input, elicits the behaviour the researcher wishes to study; the other, called the corrupted input or source input, is a minimally different prompt that does not elicit the same behaviour. The model's activations at every layer and position are cached for both runs. The researcher then performs a third forward pass on one of the inputs, but at some chosen point (a particular layer, attention head, feedforward module, or even a single neuron) overwrites the activation with the cached activation from the other run. The resulting output is compared against the two original outputs using a chosen metric.[^1][^3]
The intuition is straightforward: if patching activation A from the clean run into the corrupted run restores the correct behaviour, then activation A carries information that is sufficient for producing that behaviour. Conversely, if patching activation A from the corrupted run into the clean run destroys the correct behaviour, then A is necessary (under that particular corruption). By repeating this procedure across every component of the model, researchers can produce a heatmap-like view of which parts of the network are causally implicated in the behaviour under study.[^1][^2]
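The procedure can be sketched on a toy model. The example below is a minimal illustration, not a real transformer: the three components (`head_a`, `head_b`, `mlp`), their arithmetic, and the two inputs are all hypothetical, chosen so that the patched outputs are easy to check by hand. It caches the clean activations, patches each one into the corrupted run in turn, and reports the fraction of the clean-versus-corrupted gap that is restored.

```python
from typing import Dict, Optional, Tuple

COMPONENTS = ("head_a", "head_b", "mlp")


def run(x: Tuple[float, float],
        patch: Optional[Dict[str, float]] = None) -> Tuple[float, Dict[str, float]]:
    """Run the toy model, overwriting any activation named in `patch`."""
    patch = patch or {}
    acts: Dict[str, float] = {}
    acts["head_a"] = patch.get("head_a", 2.0 * x[0])       # reads input 0
    acts["head_b"] = patch.get("head_b", x[1] + 1.0)       # reads input 1
    acts["mlp"] = patch.get("mlp", 3.0 * acts["head_a"])   # reads head_a only
    output = acts["mlp"] + 0.1 * acts["head_b"]
    return output, acts


clean_x, corrupted_x = (1.0, 1.0), (0.0, 0.0)

# Passes 1 and 2: clean run (cached) and corrupted run.
clean_out, clean_cache = run(clean_x)
corrupted_out, _ = run(corrupted_x)


def restored_fraction(name: str) -> float:
    """Pass 3: corrupted run with one clean activation patched in."""
    patched_out, _ = run(corrupted_x, patch={name: clean_cache[name]})
    return (patched_out - corrupted_out) / (clean_out - corrupted_out)


scores = {name: restored_fraction(name) for name in COMPONENTS}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Under these assumptions `head_a` and `mlp` each restore about 98% of the gap while `head_b` restores under 2%, which is the kind of per-component contrast a patching heatmap displays.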
Activation patching is fundamentally a causal technique rather than a correlational one. Earlier interpretability methods, such as probing classifiers, could establish whether information was present in a representation but not whether the model actually used that information when computing its output. Causal interventions like activation patching close this gap by directly manipulating the activations and observing the downstream effect.[^4]
The methodological roots of activation patching lie in causal mediation analysis, a statistical framework developed in epidemiology and the social sciences to decompose the total effect of a treatment on an outcome into a direct effect and an indirect effect through intermediate (mediating) variables. The framework was adapted to neural natural language processing by Vig, Gehrmann, Belinkov, Qian, Nevo, Singer, and Shieber in their 2020 NeurIPS paper "Investigating Gender Bias in Language Models Using Causal Mediation Analysis."[^4]
In that work, Vig and colleagues argued that existing interpretability methods could measure only whether information existed in hidden representations, not whether it was causally used by the model. They proposed a methodology grounded in causal mediation analysis in which information flows from input to output through internal components treated as mediators. By intervening on these mediators (for example, by replacing the activation at a particular attention head with an alternative value), the authors decomposed the model's gender-bias behaviour into direct effects flowing from the input and indirect effects flowing through specific mediators. They found that gender-bias effects in pre-trained transformer language models were sparse, concentrated in a small part of the network, and synergistic across components.[^4]
This work established the template for what later became known as activation patching: identify a behaviour, design controlled clean and corrupted inputs that differ only in the property of interest, and intervene on individual mediators to measure their causal contribution. Closely related formal work by Geiger and colleagues at Stanford developed the interchange intervention framework, which sets internal variables of a model to the values they would take for a different input and uses these interventions to test whether neural representations align with the variables of an interpretable causal model.[^5]
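The standard activation-patching workflow involves three forward passes: a clean run and a corrupted run, whose activations are cached, and a patched run in which a chosen activation is overwritten with its cached counterpart from the other run.[^1][^3]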
A canonical pair of clean and corrupted prompts, used in the ROME paper, is "The Space Needle is in downtown" (clean, eliciting the correct completion "Seattle") versus a noised version in which the embeddings of the subject tokens have been corrupted by added Gaussian noise.[^2] In the IOI work by Wang, Variengien, Conmy, Shlegeris, and Steinhardt, the clean and corrupted prompts differ in the identity of the names mentioned, while remaining structurally identical: for example, "When Mary and John went to the store, John gave a drink to" (clean) versus "When Alice and Bob went to the store, John gave a drink to" (corrupted).[^6]
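Activation patching can be performed in two complementary directions, called denoising and noising by Heimersheim and Nanda. In denoising, clean activations are patched into a corrupted run to test whether a component is sufficient to restore the behaviour; in noising, corrupted activations are patched into a clean run to test whether a component is necessary for it.[^1]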
These two directions are not symmetric in general, and conflating them can lead to misleading conclusions. As Heimersheim and Nanda observe, an AND circuit (where multiple components must all be active to produce the behaviour) and an OR circuit (where any of several redundant components can suffice) yield very different patterns under denoising versus noising. Denoising tends to highlight the final-stage components that are sufficient to produce the output, while noising tends to highlight all components that contribute to the computation, because removing any one of them disrupts the result.[^1]
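The asymmetry can be checked on two-input Boolean circuits. The sketch below is a deliberately simplified illustration: the component names `a` and `b`, and the choice of all-true clean and all-false corrupted activations, are assumptions made for the example.

```python
from typing import Callable, Dict


def and_circuit(a: bool, b: bool) -> bool:
    return a and b


def or_circuit(a: bool, b: bool) -> bool:
    return a or b


CLEAN = {"a": True, "b": True}         # activations on the clean input
CORRUPTED = {"a": False, "b": False}   # activations on the corrupted input


def patch_effects(circuit: Callable[..., bool]) -> Dict[str, Dict[str, bool]]:
    """Patch one component at a time in each direction."""
    clean_out = circuit(**CLEAN)
    results = {}
    for name in ("a", "b"):
        # Denoising: corrupted run with one clean activation patched in.
        denoised = dict(CORRUPTED, **{name: CLEAN[name]})
        # Noising: clean run with one corrupted activation patched in.
        noised = dict(CLEAN, **{name: CORRUPTED[name]})
        results[name] = {
            "denoising_restores": circuit(**denoised) == clean_out,
            "noising_breaks": circuit(**noised) != clean_out,
        }
    return results


and_results = patch_effects(and_circuit)
or_results = patch_effects(or_circuit)
print("AND:", and_results)
print("OR: ", or_results)
```

In the AND circuit, noising flags both components while denoising flags neither; in the OR circuit the pattern is reversed. Reading only one direction would therefore miss the components of one of the two circuits entirely.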
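The corruption used to generate the "corrupted" activation can take several forms, each with different statistical properties. Common choices include Gaussian noise added to token embeddings, replacement with activations from a different prompt, mean ablation, and zero ablation.[^1][^7]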
The choice between these is not merely cosmetic. Zhang and Nanda systematically demonstrated that the choice of corruption method and evaluation metric can lead to substantially different conclusions about which components are important, even on the same task.[^7]
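A patching experiment is summarised by a scalar metric that quantifies how close the patched run's output is to the clean baseline (versus the corrupted baseline). Common choices include the logit difference between the correct and an incorrect answer token, the probability assigned to the correct token, and the KL divergence from the clean output distribution.[^1][^7]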
Heimersheim and Nanda recommend running multiple metrics in parallel and treating disagreement among them as evidence of an interesting subtlety rather than as a nuisance. Zhang and Nanda further show empirically that varying the metric can change which components appear important in localisation experiments.[^1][^7]
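To make the disagreement between metrics concrete, the sketch below computes logit difference, correct-token probability, and KL divergence on hypothetical logits over a three-token vocabulary. The logit values and the `normalised` helper (which rescales a metric so that 0 is the corrupted baseline and 1 is the clean baseline) are assumptions for illustration, not taken from any of the cited papers.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_diff(logits: Sequence[float], correct: int, incorrect: int) -> float:
    return logits[correct] - logits[incorrect]


def kl_divergence(p_logits: Sequence[float], q_logits: Sequence[float]) -> float:
    """KL(P || Q) between the softmax distributions of two logit vectors."""
    p, q = softmax(p_logits), softmax(q_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def normalised(patched: float, clean: float, corrupted: float) -> float:
    """Rescale so that 0.0 is the corrupted baseline and 1.0 the clean one."""
    return (patched - corrupted) / (clean - corrupted)


# Hypothetical logits; index 0 is the correct token, index 1 the incorrect one.
clean_logits = [4.0, 1.0, 0.0]
corrupted_logits = [1.0, 4.0, 0.0]
patched_logits = [3.0, 2.0, 0.0]

ld_score = normalised(logit_diff(patched_logits, 0, 1),
                      logit_diff(clean_logits, 0, 1),
                      logit_diff(corrupted_logits, 0, 1))
prob_score = normalised(softmax(patched_logits)[0],
                        softmax(clean_logits)[0],
                        softmax(corrupted_logits)[0])
kl_patched = kl_divergence(clean_logits, patched_logits)
kl_corrupted = kl_divergence(clean_logits, corrupted_logits)

print(f"logit-difference recovery: {ld_score:.3f}")
print(f"probability recovery:      {prob_score:.3f}")
print(f"KL from clean: patched {kl_patched:.3f}, corrupted {kl_corrupted:.3f}")
```

On these numbers the same patch recovers two thirds of the gap by logit difference but a noticeably larger share by probability, because the softmax is nonlinear in the logits.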
Activation patching has spawned a family of related techniques, each adapted to particular research questions.
Path patching is a refinement introduced by Goldowsky-Dill and colleagues that restricts the patch to flow along a single computational path between a source component and a downstream receiver, rather than propagating through the entire downstream subgraph. In a standard activation patch at, say, the output of a particular attention head, the patched value can subsequently influence every later component that reads from the residual stream. Path patching instead patches the contribution of one component into the input of another specific component, allowing the researcher to isolate the direct effect of one component on another from any mediated effects through intermediate components.[^8]
This finer-grained analysis was central to the IOI circuit work, where the researchers needed to determine, for example, whether a "Name Mover" attention head's effect was direct (writing the correct name to the final position) or indirect (passing information through "S-Inhibition" heads before the name was written). Path patching provides a way to test such hypotheses quantitatively.[^6][^8]
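The distinction between a direct and a mediated effect can be sketched with a toy model in which a sender component `h` is read both by the output and by an intermediate component `m`. Everything here (the component names, the weights, and the `h_for_mediator` / `h_for_output` arguments) is hypothetical; the point is only that path patching overwrites what one specific reader sees, while a plain activation patch overwrites what every reader sees.

```python
from typing import Optional


def forward(x: float,
            h_for_mediator: Optional[float] = None,
            h_for_output: Optional[float] = None) -> float:
    """Toy model: the output reads sender h directly and via mediator m."""
    h = 2.0 * x
    # Each reader sees the true h unless a patched value is supplied for it.
    m = (h if h_for_mediator is None else h_for_mediator) + 1.0
    h_seen_by_output = h if h_for_output is None else h_for_output
    return 1.0 * h_seen_by_output + 0.5 * m


clean_x, corrupted_x = 1.0, 0.0
clean_h = 2.0 * clean_x

corrupted_out = forward(corrupted_x)

# Activation patch: every reader of h sees the clean value.
total = forward(corrupted_x, h_for_mediator=clean_h, h_for_output=clean_h) - corrupted_out
# Path patch h -> output: only the output's direct read is patched.
direct = forward(corrupted_x, h_for_output=clean_h) - corrupted_out
# Path patch h -> m -> output: only the mediator's read is patched.
mediated = forward(corrupted_x, h_for_mediator=clean_h) - corrupted_out

print(f"total={total}, direct={direct}, mediated={mediated}")
```

Because this toy model is linear, the direct and mediated effects sum to the total effect; in a real network with nonlinearities that additivity generally does not hold.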
A standard activation-patching sweep over every component in a large model is computationally expensive: it requires a separate forward pass for each component being patched. Attribution patching, proposed by Neel Nanda and developed for circuit discovery by Syed, Rager, and Conmy, is a fast linear approximation that estimates the effect of patching every activation from just two forward passes and one backward pass.[^9]
The idea is to take a first-order Taylor expansion of the patching metric with respect to each activation, using gradients computed at the clean (or corrupted) point. The estimated patching effect for activation a is then the product of the gradient with the difference between the corrupted and clean activations. While this approximation is exact only for infinitesimal perturbations, Syed and colleagues demonstrated that on standard benchmarks attribution patching identifies the same important circuits as full activation patching, while running orders of magnitude faster. Their paper, "Attribution Patching Outperforms Automated Circuit Discovery," was presented at the 7th BlackboxNLP workshop and reports that attribution patching, combined with edge pruning, outperformed the contemporaneous ACDC algorithm on several circuit discovery benchmarks.[^9]
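The approximation can be compared against exact patching on a small hand-differentiable metric. In the sketch below the metric, its analytic gradient (standing in for the backward pass), and the activation values are all assumptions chosen so the error is visible: the metric is quadratic in the first activation and linear in the second.

```python
from typing import List


def metric(acts: List[float]) -> float:
    """Hypothetical metric: quadratic in acts[0], linear in acts[1]."""
    return acts[0] ** 2 + 3.0 * acts[1]


def metric_grad(acts: List[float]) -> List[float]:
    """Analytic gradient of `metric`, standing in for a backward pass."""
    return [2.0 * acts[0], 3.0]


clean = [1.0, 2.0]
corrupted = [1.5, 0.0]

grad_at_clean = metric_grad(clean)

exact, approx = [], []
for i in range(len(clean)):
    # Exact effect: one extra forward pass per patched activation.
    patched = list(clean)
    patched[i] = corrupted[i]
    exact.append(metric(patched) - metric(clean))
    # Attribution estimate: gradient times (corrupted - clean) activation.
    approx.append(grad_at_clean[i] * (corrupted[i] - clean[i]))

for i, (e, a) in enumerate(zip(exact, approx)):
    print(f"activation {i}: exact={e:+.2f}, attribution={a:+.2f}")
```

The estimate matches exactly for the activation on which the metric is linear and is off by the second-order term for the other, which is the behaviour a first-order Taylor expansion predicts.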
Geiger and colleagues developed an interchange-intervention framework in which the goal is not only to localise a behaviour but to align internal neural representations with the variables of a hypothesised interpretable causal model. An interchange intervention takes a "source" input and a "base" input, processes both, and replaces the internal state of the base run with the corresponding internal state of the source run at the alignment point. If the model output then matches the prediction of the interpretable causal model under the corresponding intervention, this is evidence that the neural representation faithfully implements that causal variable. The same framework underlies interchange intervention training (IIT), in which a model is trained to make its internal representations conform to a target causal abstraction.[^5]
Causal scrubbing, introduced by researchers at Redwood Research (Chan, Garriga-Alonso, Goldowsky-Dill, Greenblatt, Nitishinskaya, Radhakrishnan, Shlegeris, and Thomas in 2022), is a more rigorous evaluation procedure for mechanistic hypotheses. The key idea is that an interpretability hypothesis can be reformulated as a claim about which activations inside a network can be resampled without changing the model's behaviour. Causal scrubbing then tests the hypothesis by performing a structured set of resampling ablations: any activation the hypothesis says is irrelevant to a given path is replaced with the activation that would be there for a randomly sampled input consistent with the hypothesis. If the model's behaviour is preserved on average, the hypothesis is at least consistent with the evidence; if behaviour breaks, the hypothesis is incomplete or incorrect.[^10]
Causal scrubbing is therefore not a discovery tool but a verification tool: it provides a principled way to evaluate the faithfulness of hypotheses generated by activation patching and other methods.[^10]
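The resampling idea can be sketched on a toy model. The example below is a heavily simplified illustration and not the procedure from the causal scrubbing write-up: the model, the two candidate hypotheses, and the use of output agreement rather than expected loss are all assumptions. Each hypothesis is expressed as a function (`equiv_key`) naming the only property of the input that the hidden activation is claimed to carry; the activation is then resampled from a random input that shares that property.

```python
import random
from itertools import product
from typing import Callable, Tuple

Input = Tuple[int, int]
INPUTS = list(product(range(4), repeat=2))  # all pairs (a, b) with values 0..3


def hidden(x: Input) -> int:
    return x[0] + x[1]


def readout(h: int) -> int:
    return h % 2


def model(x: Input) -> int:
    return readout(hidden(x))


def scrubbed_agreement(equiv_key: Callable[[Input], int],
                       rng: random.Random, trials: int = 2000) -> float:
    """Fraction of trials in which the output survives resampling `hidden`
    from a random input that the hypothesis treats as equivalent."""
    agree = 0
    for _ in range(trials):
        x = rng.choice(INPUTS)
        candidates = [y for y in INPUTS if equiv_key(y) == equiv_key(x)]
        y = rng.choice(candidates)
        agree += readout(hidden(y)) == model(x)
    return agree / trials


rng = random.Random(0)
# Hypothesis 1: the activation matters only through the parity of a + b.
good = scrubbed_agreement(lambda x: (x[0] + x[1]) % 2, rng)
# Hypothesis 2: the activation matters only through the parity of a.
bad = scrubbed_agreement(lambda x: x[0] % 2, rng)
print(f"hypothesis 1 agreement: {good:.2f}")
print(f"hypothesis 2 agreement: {bad:.2f}")
```

The first hypothesis preserves the output on every trial, so it is consistent with the model; the second preserves it only about half the time, which signals that it omits something the model relies on.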
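Activation patching can be applied at many granularities, from a whole layer through an individual attention head or feedforward module down to a single neuron, and the appropriate choice depends on the research question.[^1]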
Much of the modern activation-patching literature is implemented using TransformerLens, an open-source library originally developed by Neel Nanda for mechanistic interpretability of GPT-style transformer language models. Its central class is HookedTransformer, which wraps standard pre-trained transformers and exposes every internal activation through a system of named hooks.[^11]
The library provides two main methods for performing interventions. run_with_cache runs the model and returns its output together with a cache of all internal activations keyed by hook name, which is how the clean and corrupted activations are typically obtained. run_with_hooks runs the model while applying user-supplied functions at specified hook points; a patching hook is a function that, when called, overwrites the activation it receives with a cached counterfactual value. By combining these two methods, a researcher can implement activation patching at any of the granularities described above in just a few lines of PyTorch code.[^11]
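The cache-then-hook pattern can be illustrated without the library itself. The class below is a hypothetical pure-Python stand-in, not the TransformerLens API: it borrows only the method names `run_with_cache` and `run_with_hooks` from the description above, and its layers, hook signature `(activation, name)`, and list-of-floats activations are assumptions. In the real library activations are tensors and signatures may differ, so the library documentation should be consulted before adapting this.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Activation = List[float]                 # one value per token position
HookFn = Callable[[Activation, str], Activation]


class ToyHookedModel:
    """Hypothetical stand-in for a hooked model (not TransformerLens)."""

    def __init__(self) -> None:
        self.layers = [
            ("embed", lambda acts: [float(v) for v in acts]),
            ("mlp", lambda acts: [2.0 * v for v in acts]),
        ]

    def run_with_hooks(self, tokens: Sequence[float],
                       fwd_hooks: Sequence[Tuple[str, HookFn]] = ()) -> float:
        hooks = dict(fwd_hooks)
        acts: Activation = list(tokens)
        for name, layer in self.layers:
            acts = layer(acts)
            if name in hooks:            # let the hook replace the activation
                acts = hooks[name](acts, name)
        return sum(acts)                 # toy "logit": sum over positions

    def run_with_cache(self, tokens: Sequence[float]) -> Tuple[float, Dict[str, Activation]]:
        cache: Dict[str, Activation] = {}

        def record(acts: Activation, name: str) -> Activation:
            cache[name] = list(acts)
            return acts

        output = self.run_with_hooks(tokens, [(n, record) for n, _ in self.layers])
        return output, cache


def make_patch_hook(cache: Dict[str, Activation], position: int) -> HookFn:
    """Build a hook that overwrites one position with the cached value."""
    def hook(acts: Activation, name: str) -> Activation:
        patched = list(acts)
        patched[position] = cache[name][position]
        return patched
    return hook


model = ToyHookedModel()
clean_out, clean_cache = model.run_with_cache([1.0, 2.0, 3.0])
corrupted_tokens = [1.0, 0.0, 3.0]       # differs from clean at position 1
corrupted_out = model.run_with_hooks(corrupted_tokens)
patched_pos1 = model.run_with_hooks(
    corrupted_tokens, fwd_hooks=[("mlp", make_patch_hook(clean_cache, 1))])
patched_pos0 = model.run_with_hooks(
    corrupted_tokens, fwd_hooks=[("mlp", make_patch_hook(clean_cache, 0))])
print(clean_out, corrupted_out, patched_pos1, patched_pos0)
```

Patching the position where the two inputs differ restores the clean output, while patching an unaffected position leaves the corrupted output unchanged.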
The TransformerLens documentation and demo notebooks include a dedicated "Exploratory Analysis" tutorial that walks through activation patching on GPT-2 small for the IOI task, and the library is the de facto standard tooling for the mechanistic-interpretability research community.[^11]
The 2022 paper "Locating and Editing Factual Associations in GPT" by Meng, Bau, Andonian, and Belinkov introduced causal tracing, a systematic application of activation patching to locate where in a transformer a specific factual association is stored. The authors gave GPT-2 and GPT-J a series of prompts of the form "The Space Needle is in downtown ___" and recorded the model's normal completion. They then ran a corrupted version of each prompt in which the subject's token embeddings were perturbed with Gaussian noise, breaking the model's ability to produce the correct completion. Finally, they patched the hidden state at each (layer, position) pair from the clean run back into the corrupted run and measured whether this restored the correct completion.[^2]
This produced a heatmap of "indirect effect" over the (layer, position) grid. The result was strikingly localised: a strong band of effect appeared at middle-layer MLP modules processing the last token of the subject. A second, weaker band appeared at later layers attending to the end of the prompt, where the answer is read off. From this, Meng and colleagues concluded that a two-stage mechanism mediates factual recall: middle-layer MLPs at the subject's last token retrieve the fact, and later attention heads transport the retrieved information to the final position for prediction. The authors leveraged this localisation to design ROME (Rank-One Model Editing), a method that edits a single MLP layer at the identified location to change a model's stored factual association.[^2]
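The quantities plotted in such a heatmap can be written compactly. In the notation assumed here, $P[o]$ is the probability of the correct completion $o$ in the clean run, $P_{*}[o]$ is the same probability in the corrupted run, and $P_{*,\,\mathrm{clean}\ h_i^{(l)}}[o]$ is the probability in a corrupted run where the hidden state at token position $i$ and layer $l$ has been restored to its clean value.

```latex
\mathrm{TE} = P[o] - P_{*}[o],
\qquad
\mathrm{IE}\bigl(h_i^{(l)}\bigr) = P_{*,\,\mathrm{clean}\ h_i^{(l)}}[o] - P_{*}[o]
```

The total effect $\mathrm{TE}$ measures how much the corruption damages the prediction overall, and the indirect effect $\mathrm{IE}$ measures how much of that damage is undone by restoring a single hidden state; averaging $\mathrm{IE}$ over many prompts gives one cell of the (layer, position) grid.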
The ROME paper is frequently cited as the work that popularised activation patching (under the name causal tracing) in the broader interpretability community.[^2]
Wang, Variengien, Conmy, Shlegeris, and Steinhardt's 2022 paper "Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small" used activation patching and path patching as the primary tools to reverse-engineer how GPT-2 small completes sentences like "When Mary and John went to the store, John gave a drink to" with "Mary." The authors describe their work as the largest end-to-end attempt at reverse-engineering a natural behaviour "in the wild" in a language model.[^6]
By patching activations from clean prompts (where the indirect object is, say, "Mary") into corrupted prompts (where it is "Alice") at each attention head, the authors identified twenty-six attention heads in GPT-2 small that contributed to the IOI behaviour. Subsequent path-patching experiments grouped these heads into seven functional classes, including "Duplicate Token Heads" that detect the repeated name, "S-Inhibition Heads" that suppress the subject name at the final position, and "Name Mover Heads" that copy the indirect object's name to the output position. The IOI circuit has since served as a benchmark for evaluating circuit-discovery methods, including ACDC and attribution patching.[^6][^9][^12]
Conmy, Mavor-Parker, Lynch, Heimersheim, and Garriga-Alonso introduced the Automatic Circuit DisCovery (ACDC) algorithm, which automates the process of identifying which edges in a transformer's computational graph are important for a given behaviour. Starting from the full computational graph, ACDC iteratively prunes edges whose removal does not significantly change the patching metric, leaving a small subgraph that explains the behaviour. In the paper's headline result, ACDC rediscovered all five component types of the previously hand-identified circuit for the Greater-Than operation in GPT-2 small, selecting 68 of the model's roughly 32,000 edges.[^12]
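The pruning loop can be caricatured in a few lines. The sketch below is a loose illustration in the spirit of threshold-based edge pruning and should not be read as the ACDC algorithm itself: the edge names, their additive contributions, the absolute-difference metric, and the alphabetical traversal order are all assumptions, and a pruned edge is simply taken to contribute nothing.

```python
from typing import Dict, Set

# Hypothetical contribution of each edge to the output on the clean input.
EDGE_CONTRIBUTION: Dict[str, float] = {
    "e1": 2.0, "e2": 0.01, "e3": 1.5, "e4": -0.02, "e5": 0.0,
}


def output(kept: Set[str]) -> float:
    """Toy model output when only the edges in `kept` are active."""
    return sum(EDGE_CONTRIBUTION[edge] for edge in sorted(kept))


def metric(kept: Set[str]) -> float:
    """Distance between the pruned model and the full model."""
    return abs(output(set(EDGE_CONTRIBUTION)) - output(kept))


def prune(threshold: float) -> Set[str]:
    """Greedily drop edges whose removal worsens the metric by < threshold."""
    kept = set(EDGE_CONTRIBUTION)
    for edge in sorted(EDGE_CONTRIBUTION):
        trial = kept - {edge}
        if metric(trial) - metric(kept) < threshold:
            kept = trial  # removal is cheap enough: prune permanently
    return kept


circuit = prune(threshold=0.1)
print(sorted(circuit))
```

Each candidate edge costs one evaluation of the metric, which in a real model means one forward pass per edge; the threshold controls the trade-off between circuit size and faithfulness.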
ACDC is computationally intensive because each candidate edge requires a forward pass. This is one reason Syed and colleagues subsequently applied attribution patching to circuit discovery, reporting that it outperformed ACDC on several benchmarks while using only two forward passes and one backward pass.[^9][^12]
Activation patching is a powerful tool, but it comes with substantial caveats. Several of these were emphasised by Heimersheim and Nanda, Zhang and Nanda, and McGrath and colleagues.
McGrath, Rahtz, Kramár, Mikulik, and Legg's 2023 paper "The Hydra Effect: Emergent Self-repair in Language Model Computations" identified a phenomenon in which ablating one attention layer of a language model causes another, later layer to compensate. They named this the hydra effect in reference to the mythological monster that regrows heads when one is cut off. The researchers also identified a counterbalancing function of late MLP layers, which downregulate the maximum-likelihood token. Notably, both effects emerged in models trained without any form of dropout, indicating that they are not artefacts of that kind of regularisation.[^13]
The hydra effect has important implications for activation patching. Because backup or compensatory components activate when primary components are ablated, a noising experiment may underestimate the importance of the ablated component: the model still gets the right answer, not because the component was unimportant, but because another component took over. This complicates the interpretation of patching results and motivates the use of additional verification tools such as causal scrubbing.[^1][^13]
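The underestimation can be shown with a toy model containing a compensating component. The numbers and the compensation rule below are hypothetical (a backup component that makes up 80% of whatever the primary component fails to write); they are not drawn from the Hydra Effect paper.

```python
from typing import Optional, Tuple


def forward(primary_override: Optional[float] = None) -> Tuple[float, float, float]:
    """Toy model with a primary component and a compensating backup."""
    primary = 4.0 if primary_override is None else primary_override
    # The backup writes 80% of whatever the primary failed to contribute.
    backup = 0.8 * max(0.0, 4.0 - primary)
    return primary + backup, primary, backup


clean_out, clean_primary, clean_backup = forward()
ablated_out, ablated_primary, ablated_backup = forward(primary_override=0.0)

# Direct effect: change in the primary component's own contribution.
direct_effect = clean_primary - ablated_primary
# Total effect: change in the final output, after downstream compensation.
total_effect = clean_out - ablated_out

print(f"direct effect of ablation: {direct_effect:.1f}")
print(f"total effect on output:    {total_effect:.1f}")
print(f"backup contribution after ablation: {ablated_backup:.1f}")
```

Judged by the change in output alone, the primary component looks five times less important than its direct contribution indicates, because the backup absorbs most of the damage.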
A 2024 follow-up by Rushing and Nanda titled "Explorations of Self-Repair in Language Models" further characterised the phenomenon and explored its mechanism.[^14]
A standard activation patch propagates through every downstream component that reads from the patched location. The measured effect therefore conflates the direct effect (what the patched component itself contributes) with all indirect effects through intermediate components. Path patching mitigates this by restricting the patch to a particular downstream consumer, but it is more expensive and more difficult to interpret when many paths are at play. Vig and colleagues' original causal-mediation-analysis framework explicitly decomposed effects into direct and indirect components, and modern interpretability work often follows this decomposition.[^1][^4][^8]
Because activation patching depends on a carefully chosen pair of clean and corrupted prompts, its conclusions strictly apply only to the distribution of prompts tested. A circuit discovered on a narrow set of IOI examples may not generalise to broader natural-language settings, and patching results can be sensitive to small changes in prompt wording. Heimersheim and Nanda recommend testing circuits on multiple prompt distributions and documenting exactly which properties of the prompt are being varied versus held fixed.[^1]
Some attention heads in GPT-2 small (and other transformers) consistently harm the model's performance on a given task; excluding them from a circuit artificially inflates the apparent "recovered performance" of the circuit, complicating the assessment of circuit quality. Heimersheim and Nanda note that this is an unsolved problem in the field.[^1]
Zhang and Nanda's paper "Towards Best Practices of Activation Patching in Language Models: Metrics and Methods," posted to arXiv in 2023 and published at ICLR 2024, provides a systematic empirical study of how methodological choices affect activation-patching results. They find that varying hyperparameters such as the evaluation metric (logit difference versus probability versus KL divergence) and the corruption method (Gaussian noising of token embeddings versus symmetric token replacement) can produce substantially different conclusions about which components are important on the same task. They recommend symmetric token replacement, which keeps the corrupted prompt in distribution, together with logit difference as the default metric, and they caution against relying on Gaussian noising.[^7]
This paper and the related Heimersheim-Nanda tutorial are widely treated as the main methodological references for activation patching as of 2024.[^1][^7]
A successful patching experiment identifies a sufficient set of components for a behaviour, not necessarily a minimal one. The discovered circuit may include redundant components, in the sense that other components could take over their function if they were ablated. Causal scrubbing and similar verification methods attempt to address this by explicitly testing the necessity of every component identified by patching.[^1][^10]
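Drawing on the syntheses of Heimersheim and Nanda and of Zhang and Nanda, several practices have emerged as community standards for activation patching: stating whether an experiment is noising or denoising, reporting more than one metric, testing on more than one prompt distribution, and documenting which properties of the prompt are varied and which are held fixed.[^1][^7]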
Activation patching is one tool in a broader toolkit for mechanistic interpretability. It is complementary to representation-level methods such as sparse autoencoders, which decompose model activations into interpretable features, and to studies of superposition, which examines how multiple features are encoded in shared activation directions. Patching also complements broader research programmes such as attribution graphs and representation engineering, which combine causal interventions with representation analysis to understand model behaviour. Where sparse autoencoders aim to provide an interpretable vocabulary for model internals, activation patching answers the complementary question of which components causally drive a behaviour; the two methods are often used together in modern interpretability work.[^1][^11]