Activation patching

Updated Jun 27, 2026

Activation patching is a causal intervention technique used in mechanistic interpretability to identify which internal components of a neural network are causally responsible for a specific behaviour. The method runs a model on a clean prompt and on a closely matched corrupted prompt, caches the internal activations from both runs, then performs a third forward pass in which a chosen activation is overwritten ("patched") with the cached value from the other run, and measures the resulting change in the model's output. By selectively patching different components and observing which ones restore or break the original behaviour, researchers can localise the computations responsible for a phenomenon of interest.^[1]^[2]^[3]

As Heimersheim and Nanda put it in their 2024 best-practices guide, activation patching is "the technique of replacing internal activations of a neural net," specifically the technique "where we overwrite some activations during a model run with cached activations from a previous run (on a different input), and observe how this affects the model's output."^[1] Because the researcher directly manipulates the activation and watches the downstream effect, the conclusion is causal rather than merely correlational, which distinguishes patching from earlier methods such as probing classifiers.^[1]^[4]

The technique is also known by several other names that reflect its origins in causal inference and its variants in interpretability research. The most common synonyms are causal tracing (popularised by Meng et al. in the ROME paper), causal mediation analysis (the broader statistical framework from which it descends), and interchange intervention (the formalisation developed by Geiger and colleagues at Stanford).^[2]^[4]^[5] Although these terms have slightly different historical lineages and methodological emphases, they refer to essentially the same underlying operation: substituting an internal model state with a counterfactual value to measure the causal effect on the output.^[1]

Activation patching has become one of the most widely used tools in mechanistic interpretability research for large language models such as GPT-2. It was central to landmark works including the ROME editing method, the Indirect Object Identification (IOI) circuit discovery in GPT-2 small, and the development of automated circuit discovery tools.^[2]^[3]^[6]

What is activation patching?

In the simplest formulation, activation patching involves running a transformer (or any other neural model) on two related but distinct inputs. One input, often called the clean input or base input, elicits the behaviour the researcher wishes to study; the other, called the corrupted input or source input, is a minimally different prompt that does not elicit the same behaviour. The model's activations at every layer and position are cached for both runs. The researcher then performs a third forward pass on one of the inputs, but at some chosen point (a particular layer, attention head, feedforward module, or even a single neuron) overwrites the activation with the cached activation from the other run. The resulting output is compared against the two original outputs using a chosen metric.^[1]^[3]

The intuition is straightforward: if patching activation A from the clean run into the corrupted run restores the correct behaviour, then activation A carries information that is sufficient for producing that behaviour. Conversely, if patching activation A from the corrupted run into the clean run destroys the correct behaviour, then A is necessary (under that particular corruption). By repeating this procedure across every component of the model, researchers can produce a heatmap-like view of which parts of the network are causally implicated in the behaviour under study.^[1]^[2]

Activation patching is fundamentally a causal technique rather than a correlational one. Earlier interpretability methods, such as probing classifiers, could establish whether information was present in a representation but not whether the model actually used that information when computing its output. Causal interventions like activation patching close this gap by directly manipulating the activations and observing the downstream effect.^[4]

Where does activation patching come from?

The methodological roots of activation patching lie in causal mediation analysis, a statistical framework developed in epidemiology and the social sciences to decompose the total effect of a treatment on an outcome into a direct effect and an indirect effect through intermediate (mediating) variables. The framework was adapted to neural natural language processing by Vig, Gehrmann, Belinkov, Qian, Nevo, Singer, and Shieber in their 2020 NeurIPS paper "Investigating Gender Bias in Language Models Using Causal Mediation Analysis."^[4]

In that work, Vig and colleagues argued that existing interpretability methods could measure only whether information existed in hidden representations, not whether it was causally used by the model. They proposed a methodology grounded in causal mediation analysis in which information flows from input to output through internal components treated as mediators. By intervening on these mediators (for example, by replacing the activation at a particular attention head with an alternative value), the authors decomposed the model's gender-bias behaviour into direct effects flowing from the input and indirect effects flowing through specific mediators. They found that gender-bias effects in pre-trained transformer language models were sparse, concentrated in a small part of the network, and synergistic across components.^[4]

This work established the template for what later became known as activation patching: identify a behaviour, design controlled clean and corrupted inputs that differ only in the property of interest, and intervene on individual mediators to measure their causal contribution. Closely related formal work by Geiger and colleagues at Stanford developed the interchange intervention framework, which sets internal variables of a model to the values they would take for a different input and uses these interventions to test whether neural representations align with the variables of an interpretable causal model.^[5]

How does activation patching work?

The standard activation-patching workflow involves three forward passes.^[1]^[3]

Clean run. The model is run on a clean prompt that elicits the target behaviour. All internal activations (the residual stream at each layer and position, the outputs of each attention head, the outputs of each MLP block, and so on) are cached.
Corrupted run. The model is run on a closely matched corrupted prompt that does not elicit the target behaviour. Its activations are also cached.
Patched run. The model is run on one of the two prompts (depending on whether the experiment is denoising or noising, see below), but at some chosen location the activation is overwritten with the cached activation from the other run. The output of this run is then compared against the clean and corrupted baselines via a chosen metric.
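The three passes can be sketched on a toy model in plain Python. This is an illustration under stated assumptions, not code from any of the cited papers: `run_model`, its two hidden activations `h1` and `h2`, and the normalised recovery score are all hypothetical stand-ins for a real network and metric.

```python
def run_model(x, patch=None):
    """Toy stand-in for a network: two hidden activations feed one output.

    `patch` maps an activation name to a value that overwrites it mid-run.
    Returns the output and a cache of every named activation.
    """
    patch = patch or {}
    cache = {}

    def record(name, value):
        value = patch.get(name, value)  # apply the patch, if any
        cache[name] = value
        return value

    h1 = record("h1", 2.0 * x[0] + x[1])  # depends on both inputs
    h2 = record("h2", x[1] - 1.0)         # depends only on x[1]
    return h1 + 0.5 * h2, cache


clean_input = (3.0, 1.0)
corrupted_input = (0.0, 1.0)  # differs from the clean input only in x[0]

# Passes 1 and 2: run both inputs and cache their activations.
clean_out, clean_cache = run_model(clean_input)
corrupted_out, corrupted_cache = run_model(corrupted_input)

# Pass 3 (one per component): rerun the corrupted input with one clean activation.
restored = {}
for name in ("h1", "h2"):
    patched_out, _ = run_model(corrupted_input, patch={name: clean_cache[name]})
    # 0.0 means no recovery, 1.0 means the clean output is fully restored.
    restored[name] = (patched_out - corrupted_out) / (clean_out - corrupted_out)

print(restored)  # {'h1': 1.0, 'h2': 0.0}
```

Only `h1` carries the information that differs between the two inputs, so patching it restores the clean output while patching `h2` changes nothing.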

A canonical pair of clean and corrupted prompts, used in the ROME paper, is "The Space Needle is in downtown" (clean, eliciting the correct completion "Seattle") versus a noised version in which the embeddings of the subject tokens have been corrupted by added Gaussian noise.^[2] In the IOI work by Wang, Variengien, Conmy, Shlegeris, and Steinhardt, the clean and corrupted prompts differ in the identity of the names mentioned, while remaining structurally identical: for example, "When Mary and John went to the store, John gave a drink to" (clean) versus "When Alice and Bob went to the store, John gave a drink to" (corrupted).^[6]

What is the difference between denoising and noising?

Activation patching can be performed in two complementary directions, called denoising and noising by Heimersheim and Nanda.^[1]

Denoising (also called the "clean-to-corrupted" or "restoration" direction) patches an activation from the clean run into the corrupted run. The question being asked is: "Does restoring this single piece of information from the clean computation recover the correct behaviour?" A positive result means the patched component is sufficient to produce the behaviour in question.^[1]
Noising (also called the "corrupted-to-clean" or "ablation" direction) patches an activation from the corrupted run into the clean run. The question being asked is: "Does corrupting this single piece of information disrupt the correct behaviour?" A positive result means the patched component is necessary for the behaviour.^[1]

These two directions are not symmetric in general, and conflating them can lead to misleading conclusions. Heimersheim and Nanda warn that "an important and underrated point is that these two directions can be very different, and are not just symmetric mirrors of each other."^[1] An AND circuit (where multiple components must all be active to produce the behaviour) and an OR circuit (where any of several redundant components can suffice) yield very different patterns under denoising versus noising. Denoising tends to highlight the final-stage components that are sufficient to produce the output, while noising tends to highlight all components that contribute to the computation, because removing any one of them disrupts the result.^[1]
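The asymmetry is easiest to see on two-component Boolean circuits. The sketch below is a toy illustration, with hypothetical components `a` and `b` that are both active on the clean input and both inactive on the corrupted one.

```python
def and_circuit(a, b):
    return int(a and b)   # fires only if both components are active

def or_circuit(a, b):
    return int(a or b)    # fires if either redundant component is active

CLEAN = {"a": 1, "b": 1}      # both components active, output 1
CORRUPTED = {"a": 0, "b": 0}  # both inactive, output 0

def run_patched(circuit, base, source, component):
    """Run `circuit` on `base` activations with one component taken from `source`."""
    acts = dict(base)
    acts[component] = source[component]
    return circuit(acts["a"], acts["b"])

def summarize(circuit):
    report = {}
    for comp in ("a", "b"):
        denoised = run_patched(circuit, CORRUPTED, CLEAN, comp)  # clean into corrupted
        noised = run_patched(circuit, CLEAN, CORRUPTED, comp)    # corrupted into clean
        report[comp] = {"denoising_restores": denoised == 1,
                        "noising_breaks": noised == 0}
    return report

and_report = summarize(and_circuit)
or_report = summarize(or_circuit)
print("AND:", and_report)
print("OR: ", or_report)
```

In the AND circuit, noising flags each component as necessary while denoising a single component restores nothing. In the OR circuit the pattern flips: each component is sufficient under denoising, yet noising either one alone leaves the output intact.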

Types of corruption and ablation

The corruption used to generate the "corrupted" activation can take several forms, each with different statistical properties.^[1]^[7]

Resample ablation replaces the activation with the value it took on a different (typically randomly selected) input from the same distribution. This is the most common approach in modern circuit discovery work and is generally preferred because it keeps the patched activation within the natural distribution the model expects.^[1]^[7]
Mean ablation replaces the activation with its mean computed over a distribution of inputs. This is more principled than zero ablation because it removes the information carried by the activation while preserving its average magnitude.^[1]
Zero ablation overwrites the activation with zeros. This is simple to implement but can push activations far outside the model's normal operating range, leading to misleading results because the model may behave erratically when fed unnatural inputs.^[1]
Gaussian-noise corruption of the input embeddings, used in the ROME causal tracing experiments, perturbs the token embeddings for selected positions with sampled noise. Heimersheim and Nanda note that this method's sensitivity to the chosen noise level limits its practical use compared with resample ablation.^[1]^[2]

The choice between these is not merely cosmetic. Zhang and Nanda systematically demonstrated that the choice of corruption method and evaluation metric can lead to substantially different conclusions about which components are important, even on the same task.^[7]
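As a rough sketch, the four corruption styles differ only in what value replaces the original activation. The helper names below are hypothetical, activations are plain lists of floats, and `reference` stands in for activations of the same component collected on other inputs.

```python
import random
from statistics import fmean

def zero_ablate(activation):
    # Overwrite with zeros, regardless of the activation's usual scale.
    return [0.0] * len(activation)

def mean_ablate(reference_activations):
    # Per-dimension mean over a reference set of activation vectors.
    return [fmean(column) for column in zip(*reference_activations)]

def resample_ablate(reference_activations, rng):
    # The value this component took on another, randomly chosen input.
    return list(rng.choice(reference_activations))

def gaussian_corrupt(embedding, sigma, rng):
    # Additive noise on an input embedding, in the style of causal tracing.
    return [x + rng.gauss(0.0, sigma) for x in embedding]

rng = random.Random(0)  # seeded so the sketch is reproducible
reference = [[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]]
activation = reference[0]

zeroed = zero_ablate(activation)
averaged = mean_ablate(reference)
resampled = resample_ablate(reference, rng)
noised = gaussian_corrupt(activation, sigma=0.5, rng=rng)
```

Here `resampled` is always a vector the component really produced, whereas `zeroed` may lie far from anything in `reference`, which is the off-distribution concern raised above.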

What metrics are used for activation patching?

A patching experiment is summarised by a scalar metric that quantifies how close the patched run's output is to the clean baseline (versus the corrupted baseline). Common choices include the following.^[1]^[7]

Logit difference. The difference between the logits assigned to the correct and incorrect answer tokens. Logit difference is linear in the contributions of the residual stream and is sensitive to partial effects, making it a popular default for circuit-discovery work. A known weakness is that it can be inflated by components that suppress the incorrect answer rather than promoting the correct one.^[1]^[7]
Logit of the correct token or log-probability of the correct token. These capture how confident the model becomes in the correct answer. They saturate when the correct token is already the top prediction and cannot distinguish between components that act on the correct token versus those that suppress alternatives.^[1]
Probability of the correct answer. Because softmax is non-linear, raw probability is a distorted measure of patching effects, especially near decision thresholds.^[1]
KL divergence between the patched run's output distribution and the clean run's output distribution. KL is a holistic measure that captures any deviation in the full output distribution, but it does not preferentially weight the tokens the researcher cares about.^[1]
Accuracy or rank. These are discrete metrics that miss continuous effects below the threshold of changing the top prediction and are therefore poorly suited to exploratory analysis.^[1]

Heimersheim and Nanda recommend running multiple metrics in parallel and treating disagreement among them as evidence of an interesting subtlety rather than as a nuisance. Zhang and Nanda further show empirically that varying the metric can change which components appear important in localisation experiments.^[1]^[7]
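The continuous metrics above can all be computed from the final-position logits. The following is a minimal sketch with made-up three-token logits; the token indices, the KL direction (clean relative to patched), and the normalisation that maps the corrupted baseline to 0 and the clean baseline to 1 are assumptions chosen for illustration.

```python
import math

def log_softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def logit_diff(logits, correct, incorrect):
    return logits[correct] - logits[incorrect]

def log_prob(logits, correct):
    return log_softmax(logits)[correct]

def kl_divergence(p_logits, q_logits):
    """KL(P || Q) between the output distributions of two runs."""
    lp, lq = log_softmax(p_logits), log_softmax(q_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))

def normalized(patched, clean, corrupted):
    # 0.0 at the corrupted baseline, 1.0 at the clean baseline.
    return (patched - corrupted) / (clean - corrupted)

# Hypothetical logits over a three-token vocabulary.
CORRECT, INCORRECT = 0, 1
clean = [4.0, 1.0, 0.0]
corrupted = [1.0, 3.0, 0.0]
patched = [3.0, 2.0, 0.0]

ld = {name: logit_diff(run, CORRECT, INCORRECT)
      for name, run in (("clean", clean), ("corrupted", corrupted), ("patched", patched))}
recovered = normalized(ld["patched"], ld["clean"], ld["corrupted"])
kl_to_clean = kl_divergence(clean, patched)
print(ld, recovered, kl_to_clean)
```

Computing several of these on the same patched run makes it easy to spot cases where, for example, logit difference recovers substantially while KL divergence to the clean distribution stays large.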

What are the main variants of activation patching?

Activation patching has spawned a family of related techniques, each adapted to particular research questions.

Path patching

Path patching is a refinement, used in the IOI circuit work and formalised by Goldowsky-Dill and colleagues, that restricts the patch to flow along a single computational path between a source component and a downstream receiver, rather than propagating through the entire downstream subgraph. In a standard activation patch at, say, the output of a particular attention head, the patched value can subsequently influence every later component that reads from the residual stream. Path patching instead patches the contribution of one component into the input of another specific component, allowing the researcher to isolate the direct effect of one component on another from any mediated effects through intermediate components.^[8]

This finer-grained analysis was central to the IOI circuit work, where the researchers needed to determine, for example, whether a "Name Mover" attention head's effect was direct (writing the correct name to the final position) or indirect (passing information through "S-Inhibition" heads before the name was written). Path patching provides a way to test such hypotheses quantitatively.^[6]^[8]
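The difference between patching an activation and patching a single path can be shown on a three-node toy graph. This is a sketch with hypothetical components: `A` feeds the output directly and also through `B`, and the two keyword arguments are an invented way of overwriting what each reader of `A` receives.

```python
def forward(x, a_seen_by_out=None, a_seen_by_b=None):
    """Toy graph: A feeds the output directly and also through B.

    The optional arguments overwrite what each reader of A receives,
    which is the edge-level intervention that path patching performs.
    """
    a = 3.0 * x
    a_out = a if a_seen_by_out is None else a_seen_by_out
    a_b = a if a_seen_by_b is None else a_seen_by_b
    b = 2.0 * a_b
    return a_out + b, a

clean_out, clean_a = forward(1.0)  # a = 3, b = 6, output 9
corrupted_out, _ = forward(0.0)    # every activation is 0

# Ordinary activation patch: every reader of A sees the clean value.
total, _ = forward(0.0, a_seen_by_out=clean_a, a_seen_by_b=clean_a)
# Path patch A -> output: B still sees the corrupted A.
direct, _ = forward(0.0, a_seen_by_out=clean_a)
# Path patch A -> B -> output: the output's direct read of A stays corrupted.
via_b, _ = forward(0.0, a_seen_by_b=clean_a)

print(total, direct, via_b)  # 9.0 3.0 6.0
```

The two path-level effects sum to the total only because this toy graph is linear; in a real transformer the decomposition is generally not additive.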

Attribution patching

A standard activation-patching sweep over every component in a large model is computationally expensive: it requires a separate forward pass for each component being patched. Attribution patching, proposed by Neel Nanda and developed for automated circuit discovery by Syed, Rager, and Conmy, is a fast linear approximation that estimates the effect of patching every activation from just two forward passes and one backward pass.^[9]

The idea is to take a first-order Taylor expansion of the patching metric with respect to each activation, using gradients computed at the clean (or corrupted) point. The estimated patching effect for activation a is then the product of the gradient with the difference between the corrupted and clean activations. While this approximation is exact only for infinitesimal perturbations, Syed and colleagues demonstrated that on standard benchmarks attribution patching identifies the same important circuits as full activation patching, while running orders of magnitude faster. Their paper, "Attribution Patching Outperforms Automated Circuit Discovery," first circulated as a 2023 arXiv preprint and appears in the proceedings of the 7th BlackboxNLP workshop (2024); it reports that a method based on attribution patching "outperforms all existing methods while requiring just two forward passes and a backward pass," beating the contemporaneous ACDC algorithm on several circuit discovery benchmarks.^[9]
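A small numerical sketch makes the approximation concrete. Everything here is hypothetical: `metric` is an invented function of two activations, central differences stand in for the backward pass, and the estimate is taken in the denoising direction (gradient at the corrupted point, multiplied by clean minus corrupted). Swapping `base` and `source` gives the noising direction.

```python
def metric(acts):
    # Hypothetical smooth metric: nonlinear in acts[0], linear in acts[1].
    return acts[0] ** 2 + 3.0 * acts[1]

def gradient(fn, point, eps=1e-6):
    # Central differences stand in for the single backward pass.
    grads = []
    for i in range(len(point)):
        up = list(point)
        up[i] += eps
        down = list(point)
        down[i] -= eps
        grads.append((fn(up) - fn(down)) / (2 * eps))
    return grads

def attribution_estimates(fn, base, source):
    # First-order estimate: gradient at `base` times (source - base).
    grads = gradient(fn, base)
    return [g * (s - b) for g, s, b in zip(grads, source, base)]

def exact_patch_effects(fn, base, source):
    # Ground truth: one extra evaluation per patched activation.
    effects = []
    for i in range(len(base)):
        patched = list(base)
        patched[i] = source[i]
        effects.append(fn(patched) - fn(base))
    return effects

corrupted = [1.0, 0.0]
clean = [1.2, 1.0]
estimated = attribution_estimates(metric, corrupted, clean)
exact = exact_patch_effects(metric, corrupted, clean)
print(estimated, exact)
```

For the linear activation the estimate matches the exact effect, while for the quadratic one it is roughly 0.40 against a true 0.44, which illustrates how the error grows with the size of the perturbation and the curvature of the metric.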

Interchange intervention and causal abstraction

Geiger and colleagues developed an interchange-intervention framework in which the goal is not only to localise a behaviour but to align internal neural representations with the variables of a hypothesised interpretable causal model. An interchange intervention takes a "source" input and a "base" input, processes both, and replaces the internal state of the base run with the corresponding internal state of the source run at the alignment point. If the model output then matches the prediction of the interpretable causal model under the corresponding intervention, this is evidence that the neural representation faithfully implements that causal variable. The same framework underlies interchange intervention training (IIT), in which a model is trained to make its internal representations conform to a target causal abstraction.^[5]
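The alignment test can be sketched with a hand-written causal model and a toy "network". All names are hypothetical: the causal model has one intermediate variable `S = a + b`, and the network exposes two candidate sites, `h_sum` and `h_a`, that might or might not implement `S`.

```python
def causal_model(a, b, c, s_override=None):
    s = a + b if s_override is None else s_override  # high-level variable S
    return s * c

def network(a, b, c, patch=None):
    patch = patch or {}
    h_a = patch.get("h_a", a)            # encodes only a
    h_sum = patch.get("h_sum", h_a + b)  # candidate site for S
    return h_sum * c, {"h_a": h_a, "h_sum": h_sum}

def interchange(base, source, site):
    """Run the network on `base` with `site` set to its value on `source`."""
    _, source_cache = network(*source)
    out, _ = network(*base, patch={site: source_cache[site]})
    return out

base = (1.0, 2.0, 3.0)
source = (4.0, 5.0, 6.0)

# What the causal model predicts when S takes its value from the source input.
predicted = causal_model(*base, s_override=source[0] + source[1])

matches = {site: interchange(base, source, site) == predicted
           for site in ("h_sum", "h_a")}
print(predicted, matches)  # 27.0 {'h_sum': True, 'h_a': False}
```

Only the intervention at `h_sum` reproduces the causal model's counterfactual prediction, so only that site is a candidate alignment for `S`. In practice the agreement is averaged over many base and source pairs.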

Causal scrubbing

Causal scrubbing, introduced by researchers at Redwood Research (Chan, Garriga-Alonso, Goldowsky-Dill, Greenblatt, Nitishinskaya, Radhakrishnan, Shlegeris, and Thomas in 2022), is a more rigorous evaluation procedure for mechanistic hypotheses. The key idea is that an interpretability hypothesis can be reformulated as a claim about which activations inside a network can be resampled without changing the model's behaviour. Causal scrubbing then tests the hypothesis by performing a structured set of resampling ablations: any activation the hypothesis says is irrelevant to a given path is replaced with the activation that would be there for a randomly sampled input consistent with the hypothesis. If the model's behaviour is preserved on average, the hypothesis is at least consistent with the evidence; if behaviour breaks, the hypothesis is incomplete or incorrect.^[10]

Causal scrubbing is therefore not a discovery tool but a verification tool: it provides a principled way to evaluate the faithfulness of hypotheses generated by activation patching and other methods.^[10]
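The resampling idea can be illustrated at the level of inputs. This is a deliberately simplified sketch, not the full recursive algorithm of Chan and colleagues: the hypothesis under test is that the output depends only on `x0`, both models are hypothetical, and all resamples are enumerated rather than drawn at random.

```python
def scrubbed_agreement(model, dataset):
    """Share of (input, resample) pairs whose output survives the scrub.

    The hypothesis says the output depends only on x0, so x1 is swapped
    for the x1 of every dataset example. Enumerating all pairs gives the
    expectation of the random resampling used in practice.
    """
    agree = 0
    total = 0
    for x0, x1 in dataset:
        for _, x1_resampled in dataset:
            agree += model(x0, x1) == model(x0, x1_resampled)
            total += 1
    return agree / total

def model_consistent(x0, x1):
    return x0 > 0        # really ignores x1

def model_inconsistent(x0, x1):
    return x0 + x1 > 0   # secretly uses x1

dataset = [(-1.0, 5.0), (1.0, -5.0), (2.0, 0.0), (-2.0, 0.0)]
score_consistent = scrubbed_agreement(model_consistent, dataset)
score_inconsistent = scrubbed_agreement(model_inconsistent, dataset)
print(score_consistent, score_inconsistent)  # 1.0 0.5
```

Behaviour is fully preserved for the model that matches the hypothesis and degrades for the one that does not, which is the signal causal scrubbing uses to reject a hypothesis.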

What can be patched? (Granularity)

Activation patching can be applied at many granularities, and the appropriate choice depends on the research question.^[1]

Residual stream at a given layer and token position. This is the broadest unit: it asks whether the residual stream at that point carries information relevant to the behaviour.
Attention head outputs. The output of a single head at a particular layer can be patched, allowing per-head analysis. This was the unit at which much of the IOI circuit was characterised.^[6]
MLP layer outputs. Patching the output of an MLP module isolates the contribution of feedforward computation at a particular layer. ROME's causal tracing showed that middle-layer MLP outputs at the final subject-token position carry the bulk of the causal effect for factual recall.^[2]
Individual neurons. At the finest level, individual neurons inside an MLP or attention head can be patched, although this is rarely tractable for large models without further automation.
Attention patterns. Rather than patching the output of an attention head, one can patch its attention pattern (the post-softmax weights over keys), which isolates "where" the head attended from "what" it wrote.^[1]
Edges between components. Path patching and attribution patching generalise to patching the directed edge from a producer component to a consumer component rather than a single activation.^[6]^[8]^[9]

What tools are used for activation patching?

Much of the modern activation-patching literature is implemented using TransformerLens, an open-source library originally developed by Neel Nanda for mechanistic interpretability of GPT-style transformer language models. Its central class is HookedTransformer, which wraps standard pre-trained transformers and exposes every internal activation through a system of named hooks.^[11]

The library provides two main functions for performing interventions. run_with_cache runs the model and returns all internal activations in a dictionary keyed by hook name, which is how the clean and corrupted activations are typically obtained. run_with_hooks runs the model while applying user-supplied functions at specified hook points; a patching hook is a function that, when called, overwrites the activation it receives with a cached counterfactual value. By combining these two functions, a researcher can implement activation patching at any of the granularities listed above in just a few lines of PyTorch code.^[11]
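A denoising patch of the residual stream might look like the sketch below. It assumes the `transformer_lens` package is installed and that the two IOI prompts tokenise to the same length; the helper names, the choice of layer 9, and the choice of the final position are illustrative assumptions rather than recommendations from the library.

```python
def make_position_patch_hook(source_cache, position):
    """Return a hook that overwrites one token position with a cached value."""
    def hook_fn(activation, hook):
        # Residual-stream activations have shape [batch, position, d_model].
        activation[:, position, :] = source_cache[hook.name][:, position, :]
        return activation
    return hook_fn


def run_demo(layer=9):
    # Imported lazily so the helper above can be reused without the library.
    from transformer_lens import HookedTransformer, utils

    model = HookedTransformer.from_pretrained("gpt2")
    clean_tokens = model.to_tokens(
        "When Mary and John went to the store, John gave a drink to")
    corrupted_tokens = model.to_tokens(
        "When Alice and Bob went to the store, John gave a drink to")
    assert clean_tokens.shape == corrupted_tokens.shape, "prompts must align"

    correct = model.to_single_token(" Mary")
    incorrect = model.to_single_token(" John")

    def logit_diff(logits):
        return (logits[0, -1, correct] - logits[0, -1, incorrect]).item()

    # Clean run with caching, plain corrupted run, then the patched run.
    clean_logits, clean_cache = model.run_with_cache(clean_tokens)
    corrupted_logits = model(corrupted_tokens)
    last = clean_tokens.shape[1] - 1
    patched_logits = model.run_with_hooks(
        corrupted_tokens,
        fwd_hooks=[(utils.get_act_name("resid_pre", layer),
                    make_position_patch_hook(clean_cache, last))],
    )
    return (logit_diff(clean_logits),
            logit_diff(corrupted_logits),
            logit_diff(patched_logits))
```

Calling `run_demo()` downloads GPT-2 small on first use and returns the clean, corrupted, and patched logit differences. Sweeping `layer` and the patched position over their full ranges yields the kind of heatmap described elsewhere in this article.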

The TransformerLens documentation and demo notebooks include a dedicated "Exploratory Analysis" tutorial that walks through activation patching on GPT-2 small for the IOI task, and the library is the de facto standard tooling for the mechanistic-interpretability research community.^[11]

What is activation patching used for? (Notable applications)

What is causal tracing in the ROME paper?

The 2022 paper "Locating and Editing Factual Associations in GPT" by Meng, Bau, Andonian, and Belinkov introduced causal tracing, a systematic application of activation patching to locate where in a transformer a specific factual association is stored. The authors gave GPT-2 and GPT-J a series of prompts of the form "The Space Needle is in downtown ___" and recorded the model's normal completion. They then ran a corrupted version of each prompt in which the subject's token embeddings were perturbed with Gaussian noise, breaking the model's ability to produce the correct completion. Finally, they patched the hidden state at each (layer, position) pair from the clean run back into the corrupted run and measured whether this restored the correct completion.^[2]

This produced a heatmap of "indirect effect" over the (layer, position) grid. The result was strikingly localised: a strong band of effect appeared at middle-layer MLP modules processing the last token of the subject. A second, weaker band appeared in late-layer attention at the final token of the prompt, where the answer is read off. From this, Meng and colleagues concluded that a two-stage mechanism mediates factual recall: in their words, "each midlayer MLP module accepts inputs that encode a subject, then produces outputs that recall memorized properties about that subject," after which "the summed information is copied to the last token by attention at high layers."^[2] The authors leveraged this localisation to design ROME (Rank-One Model Editing), a method that edits a single MLP layer at the identified location to change a model's stored factual association, a form of knowledge editing.^[2]

The ROME paper is frequently cited as the work that popularised activation patching (under the name causal tracing) in the broader interpretability community.^[2]

What is the IOI circuit in GPT-2 small?

Wang, Variengien, Conmy, Shlegeris, and Steinhardt's 2022 paper "Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small" used activation patching and path patching as the primary tools to reverse-engineer how GPT-2 small completes sentences like "When Mary and John went to the store, John gave a drink to" with "Mary." The authors describe their work as, "to our knowledge, the largest end-to-end attempt at reverse-engineering a natural behavior 'in the wild' in a language model."^[6]

By patching activations from clean prompts (where the indirect object is, say, "Mary") into corrupted prompts (where it is "Alice") at each attention head, the authors identified 26 attention heads in GPT-2 small that contributed to the IOI behaviour, out of the model's 144 attention heads in total (roughly 18 percent). Subsequent path-patching experiments grouped these 26 heads into seven functional classes, including "Duplicate Token Heads" that detect the repeated name, "S-Inhibition Heads" that suppress the subject name at the final position, and "Name Mover Heads" that copy the indirect object's name to the output position. The IOI circuit has since served as a benchmark for evaluating circuit-discovery methods, including ACDC and attribution patching.^[6]^[9]^[12]

What is automated circuit discovery (ACDC)?

Conmy, Mavor-Parker, Lynch, Heimersheim, and Garriga-Alonso introduced the Automatic Circuit DisCovery (ACDC) algorithm, which automates the process of identifying which edges in a transformer's computational graph are important for a given behaviour using path patching. Starting from the full computational graph, ACDC iteratively prunes edges whose removal does not significantly change the patching metric, leaving a minimal subgraph that explains the behaviour. On the Greater-Than task, the authors report that "the ACDC algorithm rediscovered 5/5 of the component types in a circuit in GPT-2 Small that computes the Greater-Than operation," and that "ACDC selected 68 of the 32,000 edges in GPT-2 Small, all of which were manually found by previous work." On the IOI task, every attention head ACDC recovered was confirmed to belong to the human-identified IOI circuit, spanning classes such as Previous Token, S-Inhibition, and Name Mover heads.^[12]
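The pruning loop can be sketched on a flat toy graph. This is a simplification under stated assumptions, not the published algorithm: the edge names and weights are invented, a scalar output difference replaces the KL-based metric, edges are visited in a fixed sorted order rather than from output to input, and `tau` is an arbitrary threshold.

```python
def run_circuit(weights, clean, corrupted, kept):
    """Output when kept edges carry clean activations and pruned edges corrupted ones."""
    return sum(w * (clean[e] if e in kept else corrupted[e])
               for e, w in weights.items())

def prune_edges(weights, clean, corrupted, tau):
    kept = set(weights)
    full = run_circuit(weights, clean, corrupted, kept)
    for edge in sorted(weights):
        trial = kept - {edge}
        before = abs(run_circuit(weights, clean, corrupted, kept) - full)
        after = abs(run_circuit(weights, clean, corrupted, trial) - full)
        if after - before < tau:  # removal barely moves the metric: prune it
            kept = trial
    return kept

# Hypothetical edges into the output node, with invented weights.
weights = {"h0->out": 2.0, "h1->out": 0.01, "h2->out": -1.5, "h3->out": 0.02}
clean = {edge: 1.0 for edge in weights}
corrupted = {edge: 0.0 for edge in weights}

circuit = prune_edges(weights, clean, corrupted, tau=0.1)
print(sorted(circuit))  # ['h0->out', 'h2->out']
```

Raising `tau` prunes more aggressively and yields a smaller subgraph at the cost of faithfulness, which is the central trade-off when tuning a threshold of this kind.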

ACDC is computationally intensive because each candidate edge requires a forward pass, which is one reason Syed and colleagues subsequently developed attribution patching as a faster alternative that they showed outperformed ACDC on several benchmarks while using only two forward passes and one backward pass.^[9]^[12]

What are the limitations of activation patching?

Activation patching is a powerful tool, but it comes with substantial caveats. Several of these were emphasised by Heimersheim and Nanda, Zhang and Nanda, and McGrath and colleagues.

The hydra effect and self-repair

McGrath, Rahtz, Kramar, Mikulik, and Legg's 2023 paper "The Hydra Effect: Emergent Self-repair in Language Model Computations" identified a phenomenon in which ablating one attention layer of a language model causes another, later layer to compensate. They named this the hydra effect in reference to the mythological monster that regrows heads when one is cut off. The researchers also identified a counterbalancing function of late MLP layers that downregulate the maximum-likelihood token. Notably, both effects emerged in models trained without any form of dropout, indicating that they are not artefacts of regularisation but rather a fundamental property of the architecture and training.^[13]

The hydra effect has important implications for activation patching. Because backup or compensatory components activate when primary components are ablated, a noising experiment may underestimate the importance of the ablated component: the model still gets the right answer, not because the component was unimportant, but because another component took over. This complicates the interpretation of patching results and motivates the use of additional verification tools such as causal scrubbing.^[1]^[13]
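A toy model with a compensating component shows how self-repair masks importance. The functional form is an assumption made purely for illustration: a hypothetical primary component `p` carries the signal, and a backup `b` supplies 80 percent of whatever `p` fails to write.

```python
def forward(x, p_override=None, b_override=None):
    p = x if p_override is None else p_override  # primary component
    if b_override is None:
        b = 0.8 * max(0.0, x - p)                # backup fills in missing signal
    else:
        b = b_override
    return p + b, {"p": p, "b": b}

clean_out, clean_cache = forward(1.0)  # p = 1.0, b = 0.0, output 1.0

# Noising: ablate the primary component and let the backup react.
ablated_out, _ = forward(1.0, p_override=0.0)
total_effect = clean_out - ablated_out

# Same ablation with the backup frozen at its clean value.
frozen_out, _ = forward(1.0, p_override=0.0, b_override=clean_cache["b"])
direct_effect = clean_out - frozen_out

print(total_effect, direct_effect)
```

The total effect of the ablation is only about 0.2 even though the component's direct contribution is 1.0. Comparing the two quantities, as in this sketch, is one way to detect that compensation is taking place.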

A 2024 follow-up by Rushing and Nanda titled "Explorations of Self-Repair in Language Models" further characterised the phenomenon and explored its mechanism.^[14]

Direct versus indirect effects

A standard activation patch propagates through every downstream component that reads from the patched location. The measured effect therefore conflates the direct effect (what the patched component itself contributes) with all indirect effects through intermediate components. Path patching mitigates this by restricting the patch to a particular downstream consumer, but it is more expensive and more difficult to interpret when many paths are at play. Vig and colleagues' original causal-mediation-analysis framework explicitly decomposed effects into direct and indirect components, and modern interpretability work often follows this decomposition.^[1]^[4]^[8]

Distribution shifts and prompt sensitivity

Because activation patching depends on a carefully chosen pair of clean and corrupted prompts, its conclusions strictly apply only to the distribution of prompts tested. A circuit discovered on a narrow set of IOI examples may not generalise to broader natural-language settings, and patching results can be sensitive to small changes in prompt wording. Heimersheim and Nanda recommend testing circuits on multiple prompt distributions and documenting exactly which properties of the prompt are being varied versus held fixed.^[1]

Negative components

Some attention heads in GPT-2 small (and other transformers) consistently harm the model's performance on a given task; excluding them from a circuit artificially inflates the apparent "recovered performance" of the circuit, complicating the assessment of circuit quality. Heimersheim and Nanda note that this is an unsolved problem in the field.^[1]

Methodological inconsistency

Zhang and Nanda's 2023 paper "Towards Best Practices of Activation Patching in Language Models: Metrics and Methods," published at ICLR 2024, provides a systematic empirical study of how methodological choices affect activation-patching results. They find that varying hyperparameters such as the choice of evaluation metric (logit difference versus probability versus KL divergence) and the choice of corruption method (resample versus mean versus zero ablation) can produce substantially different conclusions about which components are important on the same task. They argue for resample-based corruption with carefully chosen metrics and against the historically common zero-ablation approach.^[7]

The arXiv version of this work and the related Heimersheim-Nanda tutorial together constitute the de facto methodological standards for activation patching as of 2024.^[1]^[7]

No minimality guarantee

A successful patching experiment identifies a sufficient set of components for a behaviour, not necessarily a minimal one. The discovered circuit may include components that are redundant in the sense that other components could perform the same function if they were ablated. Causal scrubbing and similar verification methods attempt to address this by explicitly testing the necessity of every component identified by patching.^[1]^[10]

What are the best practices for activation patching?

Drawing on the syntheses of Heimersheim and Nanda and of Zhang and Nanda, the following practices have emerged as community standards for activation patching.^[1]^[7]

Start coarse, then refine. Begin by patching at the residual-stream level to identify broad regions of importance, then progressively narrow down to attention heads, MLPs, and individual neurons.
Use resample ablation by default. Replace activations with values from other inputs in the same distribution rather than with zeros, which push activations off-distribution.
Run both denoising and noising. The two directions give complementary evidence about sufficiency and necessity, and discrepancies between them are diagnostic of OR or AND circuit structure.
Report multiple metrics. Run logit difference, log-probability, and KL divergence in parallel; treat disagreements as informative rather than as noise.
Document the prompt distribution. Specify exactly which properties differ between clean and corrupted prompts and which are held fixed.
Beware of compensatory effects. When a patch produces little effect, consider whether the hydra effect or other backup mechanisms could be masking a true causal role.
Verify with path patching or causal scrubbing. Activation patching is a discovery tool; subsequent verification with finer-grained methods provides stronger evidence that an identified circuit is faithful.
Use attribution patching for scale. When the number of components to test is large, the linear-approximation attribution-patching estimate from two forward passes and one backward pass is often a tractable substitute for exhaustive activation patching, with the caveat that it is exact only for small perturbations.

How does activation patching relate to other interpretability techniques?

Activation patching is one tool in a broader toolkit for mechanistic interpretability. It is complementary to representation-level methods such as sparse autoencoders, which decompose model activations into interpretable features, and to studies of superposition, which examine how multiple features are encoded in shared activation directions. Patching also complements broader research programmes such as attribution graphs and representation engineering, which combine causal interventions with representation analysis to understand model behaviour. Where sparse autoencoders aim to provide an interpretable vocabulary for model internals, activation patching answers the complementary question of which components causally drive a behaviour; the two methods are often used together in modern interpretability work.^[1]^[11]

ELI5: activation patching in plain language

Imagine you have a machine that almost always gives the right answer, and you want to know which of its thousands of tiny parts actually does the important work. With activation patching you run the machine twice: once on a question it answers correctly (the "clean" run) and once on a slightly changed question it gets wrong (the "corrupted" run). You secretly record what every internal part was doing in both runs. Then you run the broken case again, but you swap in one part's recording from the run that worked, like transplanting a single organ, and check whether the answer suddenly becomes correct. If swapping that one part fixes the answer, that part was carrying the missing information. By trying this for every part, researchers draw a map of which pieces of an AI model are responsible for a given behaviour.^[1]^[2]
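The thought experiment above can be written as a short toy program. This is a sketch only: the three-part "machine", its part names (`reader`, `counter`, `lookup`), and the example questions are hypothetical stand-ins for a real model's components and prompts. Each part's output is recorded on the clean run, then transplanted one at a time into the run that fails.

```python
# Hypothetical three-part "machine"; every name here is illustrative only.
CAPITALS = {"France": "Paris", "Japan": "Tokyo"}

def run_machine(question, patch=None):
    """Run the machine once, optionally overwriting some parts' outputs.

    Returns the final answer and a recording of what every part produced.
    """
    patch = patch or {}
    record = {}
    # Part 1: reads the subject of the question.
    record["reader"] = patch.get("reader", question["subject"])
    # Part 2: counts filler words (irrelevant to the answer).
    record["counter"] = patch.get("counter", len(question["filler"]))
    # Part 3: looks up the capital, using whatever the reader produced.
    record["lookup"] = patch.get(
        "lookup", CAPITALS.get(record["reader"], "unknown")
    )
    return record["lookup"], record

clean_question = {
    "subject": "France",
    "filler": ["what", "is", "the", "capital", "of"],
}
corrupted_question = {
    "subject": "Atlantis",
    "filler": ["what", "is", "the", "capital", "of", "lost"],
}

clean_answer, clean_record = run_machine(clean_question)
corrupted_answer, _ = run_machine(corrupted_question)

# Rerun the broken case once per part, transplanting that part's clean recording.
restores = {}
for part in clean_record:
    patched_answer, _ = run_machine(
        corrupted_question, patch={part: clean_record[part]}
    )
    restores[part] = patched_answer == clean_answer

print(restores)
```

In this toy, transplanting `reader` or `lookup` fixes the answer while transplanting `counter` does not, which is the kind of map the paragraph describes. Parts downstream of a transplant are recomputed from the patched value, as they would be in a real forward pass.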

References

1. Heimersheim, S. and Nanda, N. (2024). "How to use and interpret activation patching." arXiv:2404.15255. https://arxiv.org/abs/2404.15255. Accessed 2026-06-27.
2. Meng, K., Bau, D., Andonian, A., and Belinkov, Y. (2022). "Locating and Editing Factual Associations in GPT." Advances in Neural Information Processing Systems 35 (NeurIPS 2022). arXiv:2202.05262. https://arxiv.org/abs/2202.05262. Accessed 2026-06-27.
3. Meng, K., Bau, D., Andonian, A., and Belinkov, Y. (2022). "Locating and Editing Factual Associations in GPT." Project website. https://rome.baulab.info/. Accessed 2026-06-27.
4. Vig, J., Gehrmann, S., Belinkov, Y., Qian, S., Nevo, D., Singer, Y., and Shieber, S. (2020). "Investigating Gender Bias in Language Models Using Causal Mediation Analysis." Advances in Neural Information Processing Systems 33 (NeurIPS 2020). https://proceedings.neurips.cc/paper/2020/hash/92650b2e92217715fe312e6fa7b90d82-Abstract.html. Accessed 2026-06-27.
5. Geiger, A., Lu, H., Icard, T., and Potts, C. (2021). "Causal Abstractions of Neural Networks." Advances in Neural Information Processing Systems 34 (NeurIPS 2021). arXiv:2106.02997. https://arxiv.org/abs/2106.02997. Accessed 2026-06-27.
6. Wang, K., Variengien, A., Conmy, A., Shlegeris, B., and Steinhardt, J. (2022). "Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small." arXiv:2211.00593. https://arxiv.org/abs/2211.00593. Accessed 2026-06-27.
7. Zhang, F. and Nanda, N. (2023). "Towards Best Practices of Activation Patching in Language Models: Metrics and Methods." Published at ICLR 2024. arXiv:2309.16042. https://arxiv.org/abs/2309.16042. Accessed 2026-06-27.
8. Goldowsky-Dill, N., MacLeod, C., Sato, L., and Arora, A. (2023). "Localizing Model Behavior with Path Patching." arXiv:2304.05969. https://arxiv.org/abs/2304.05969. Accessed 2026-06-27.
9. Syed, A., Rager, C., and Conmy, A. (2023). "Attribution Patching Outperforms Automated Circuit Discovery." arXiv:2310.10348. Presented at the 7th BlackboxNLP Workshop (EMNLP 2024). https://arxiv.org/abs/2310.10348. Accessed 2026-06-27.
10. Chan, L., Garriga-Alonso, A., Goldowsky-Dill, N., Greenblatt, R., Nitishinskaya, J., Radhakrishnan, A., Shlegeris, B., and Thomas, N. (2022). "Causal Scrubbing: a method for rigorously testing interpretability hypotheses." AI Alignment Forum (Redwood Research). https://www.alignmentforum.org/posts/JvZhhzycHu2Yd57RN/causal-scrubbing-a-method-for-rigorously-testing. Accessed 2026-06-27.
11. TransformerLens Documentation (2024). "TransformerLens: A library for mechanistic interpretability of GPT-style language models." https://transformerlensorg.github.io/TransformerLens/. Accessed 2026-06-27.
12. Conmy, A., Mavor-Parker, A. N., Lynch, A., Heimersheim, S., and Garriga-Alonso, A. (2023). "Towards Automated Circuit Discovery for Mechanistic Interpretability." Advances in Neural Information Processing Systems 36 (NeurIPS 2023). arXiv:2304.14997. https://arxiv.org/abs/2304.14997. Accessed 2026-06-27.
13. McGrath, T., Rahtz, M., Kramar, J., Mikulik, V., and Legg, S. (2023). "The Hydra Effect: Emergent Self-repair in Language Model Computations." arXiv:2307.15771. https://arxiv.org/abs/2307.15771. Accessed 2026-06-27.
14. Rushing, C. and Nanda, N. (2024). "Explorations of Self-Repair in Language Models." Proceedings of the 41st International Conference on Machine Learning. arXiv:2402.15390. https://arxiv.org/abs/2402.15390. Accessed 2026-06-27.
