Causal scrubbing

Updated Jun 29, 2026

Causal scrubbing is a methodology in mechanistic interpretability for rigorously and quantitatively testing hypotheses about the internal computational structure of a neural network. Introduced in December 2022 by a team at Redwood Research, it treats an interpretability hypothesis as a precise claim about which internal activations can be swapped between inputs without changing the network's behavior. It then performs the maximal set of resampling ablations that the hypothesis implies should preserve behavior and measures how much task performance survives. A hypothesis that recovers nearly all of the model's original performance is treated as a strong candidate for a complete explanation; one that recovers little has missed important structure ^[1]^[2].

What is causal scrubbing?

Causal scrubbing turns the informal statement "circuit X implements behavior Y" into an operational, falsifiable test grounded in causal interventions. The method replaces ("scrubs") activations along paths the hypothesis deems irrelevant with activations resampled from other inputs the hypothesis treats as equivalent, then measures how much of the original task performance is preserved. If the hypothesis fully and correctly explains the behavior, performance should be retained, and the fraction recovered is read as a measure of how complete the explanation is ^[1]^[2].

The technique is most associated with validating circuit-level claims, such as how induction heads implement in-context copying. It was introduced in a December 2022 sequence on the AI Alignment Forum and LessWrong, authored by Lawrence Chan, Adrià Garriga-Alonso, Nicholas Goldowsky-Dill, Ryan Greenblatt, Jenny Nitishinskaya, Ansh Radhakrishnan, Buck Shlegeris, and Nate Thomas ^[1]^[3]. Redwood Research summarizes the core idea this way: "The key insight behind this work is that mechanistic interpretability hypotheses can be thought of as defining what activations inside a neural network can be resampled without affecting behavior." ^[3]

Property	Detail
Introduced	December 2022 ^[1]^[3]
Developed by	Redwood Research ^[1]
Authors	Chan, Garriga-Alonso, Goldowsky-Dill, Greenblatt, Nitishinskaya, Radhakrishnan, Shlegeris, Thomas ^[1]^[3]
Field	Mechanistic interpretability ^[1]
Core operation	Behavior-preserving resampling ablations ^[2]^[3]
Headline metric	Fraction of expected loss recovered ^[2]
Direction of test	Can reject a hypothesis, cannot uniquely confirm one ^[2]^[4]

Why was causal scrubbing developed?

Mechanistic interpretability aims to explain a network's behavior in terms of human-understandable algorithms implemented by its weights. A recurring problem is that such explanations are usually stated informally and are hard to check. Matching the model's outputs is not sufficient evidence that an explanation is correct: as the authors put it, it is trivial to write an algorithm that reproduces a network's outputs on a test set by building a large lookup table, yet such a lookup table explains nothing about what the weights actually do ^[1]^[4].

Causal scrubbing was designed to provide a principled, quantitative answer to the question "how good is this mechanistic explanation?" The key insight is that a mechanistic hypothesis can be reframed as a claim about invariances: it specifies which activations inside the network are claimed not to matter for a given downstream computation, and therefore which activations could be resampled from a different input without affecting the output. This reframing makes the hypothesis testable by intervention rather than by inspection ^[1]^[2].

A related motivation was the safety-relevant goal of being able to trust an interpretation. An interpretation is only as reassuring as the worst-case implementation consistent with it: two mechanisms can behave identically on a test set yet diverge badly off-distribution. The authors wanted a test that rejects explanations which fail to pin down the computation, rather than one that merely confirms surface-level agreement ^[4]^[5].

How does causal scrubbing work?

How is a hypothesis expressed as a correspondence between two graphs?

A causal-scrubbing hypothesis is formalized as a correspondence between two graphs. The first is an interpretation graph (a high-level, human-understandable computation): the algorithm the researcher believes the model runs, whose nodes are interpretable sub-computations and whose edges are claimed information dependencies. The second is the model's own computational graph: the forward pass expressed as nodes (for example attention heads, MLP layers, or individual activations) with edges between them. The hypothesis maps each interpretation-graph node to a node or set of components in the model's graph, and claims each high-level node is "extensionally equal" to the corresponding low-level computation, meaning it produces the same outputs from the same inputs ^[1]^[2].

Crucially, a missing edge in the interpretation graph is itself a claim: that the source has no causal influence on the target for the behavior in question. These claimed non-dependencies are exactly what causal scrubbing stress-tests ^[1].
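The correspondence can be made concrete with a small data-structure sketch. Everything named here (the `Graph` class, the node names, the `unexplained_components` helper) is hypothetical and chosen for illustration; it is not the representation used by the original authors. The sketch shows how the components a hypothesis leaves unmapped fall out of the correspondence, and these are the components the hypothesis implicitly claims can be scrubbed.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Graph:
    """A directed graph stored as: node name -> list of parent (input) node names."""
    parents: dict[str, list[str]] = field(default_factory=dict)

    def nodes(self) -> set[str]:
        # Every node that appears either as a child or as a parent.
        found = set(self.parents)
        for ps in self.parents.values():
            found.update(ps)
        return found


# Hypothetical model graph: two attention heads feed an MLP, which feeds the logits.
model = Graph({
    "head_a": ["tokens"],
    "head_b": ["tokens"],
    "mlp": ["head_a", "head_b"],
    "logits": ["mlp"],
})

# Hypothetical interpretation graph: a single feature is claimed to drive the output.
interpretation = Graph({
    "prev_token_feature": ["tokens"],
    "output": ["prev_token_feature"],
})

# The hypothesis: each interpretation node maps to a set of model components.
correspondence = {
    "tokens": {"tokens"},
    "prev_token_feature": {"head_a"},
    "output": {"mlp", "logits"},
}


def unexplained_components(model: Graph, correspondence: dict[str, set[str]]) -> set[str]:
    """Model components no interpretation node maps to, i.e. claimed to be irrelevant."""
    mapped = set().union(*correspondence.values())
    return model.nodes() - mapped


claimed_irrelevant = unexplained_components(model, correspondence)
print(claimed_irrelevant)
```

In this toy hypothesis `head_b` is left unmapped, so the hypothesis implicitly claims that its activation can be resampled freely without affecting the behavior.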

What is treeification?

Before intervening, causal scrubbing rewrites the model's computational graph into a tree, a step the authors call treeification. In a normal computation graph a node's output is shared by all of its downstream consumers. Treeification instead duplicates ("unfolds") shared nodes so that every path from the input to the output has its own private copy of each intermediate node. The graph becomes a tree whose leaves are separate copies of the input ^[2]^[6].

Treeification is what gives causal scrubbing its expressive power. Because each path has independent leaves, the method can feed a different input to each path, mixing many inputs in a single forward pass. This lets it intervene on individual paths through the network rather than only on whole nodes, allowing finer-grained and stricter tests than interventions on the original (un-treeified) graph permit. The authors describe treeification both as a way that causal scrubbing is more demanding than other methods and as a way that it is more expressive ^[2]^[6].
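The unfolding step can be sketched in a few lines of Python. This is a minimal illustration on a hypothetical toy graph, not the authors' implementation: each tree node is identified by the full path from the output down to it, so a node shared by several consumers gets one private copy per path.

```python
def treeify(parents, node, path=()):
    """Unfold a DAG (node -> list of parents) into a tree rooted at `node`.

    Each tree node records the path from the output to itself, so two
    different routes to the same DAG node produce two separate copies.
    """
    path = path + (node,)
    children = [treeify(parents, p, path) for p in parents.get(node, [])]
    return {"name": node, "path": path, "children": children}


def leaves(tree):
    """Return the path of every leaf; each leaf is a separate copy of the input."""
    if not tree["children"]:
        return [tree["path"]]
    return [leaf for child in tree["children"] for leaf in leaves(child)]


# Hypothetical toy graph: `emb` and `h1` are each read by more than one consumer.
parents = {
    "out": ["h1", "h2"],
    "h1": ["emb"],
    "h2": ["emb", "h1"],
    "emb": ["x"],
}

tree = treeify(parents, "out")
input_copies = leaves(tree)
for path in input_copies:
    print(" <- ".join(path))
```

The original graph has one input `x`, but the tree has one copy of `x` per path from the output, which is what allows a different input to be fed along each path.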

What are resampling ablations and the scrubbing procedure?

For a given hypothesis, causal scrubbing proceeds roughly as follows ^[1]^[2]:

1. Sample a reference input on which to evaluate the behavior, and run the treeified network to get its values.
2. For every part of the input or activation that the hypothesis claims is not used by a given node, replace it with the corresponding activation computed on a different input. The replacement input is drawn so that, according to the interpretation graph, it is equivalent for that node: it belongs to the same input equivalence class the hypothesis defines.
3. Apply all such replacements simultaneously across the treeified graph, choosing the swaps so that the corresponding intervention in the high-level interpretation graph would leave its outputs unchanged.
4. Run the forward pass on this mixed set of inputs and measure the model's loss or task metric.

These swaps are called resampling ablations: instead of deleting a component's contribution, the method resamples it from another, hypothesis-equivalent input. If the hypothesis is correct, every scrub it licenses is behavior-preserving, so the scrubbed model should attain essentially the same loss as the unmodified model. The recovered loss, often reported as the percentage of performance retained after the maximally aggressive set of licensed scrubs, is interpreted as a measure of how complete the hypothesis is: recovering nearly all of the loss means the hypothesis explains nearly all of the relevant computation, while recovering little means it has missed important structure ^[1]^[2].
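One way to turn the scrubbed loss into a percentage is to normalize it between the unmodified model and an uninformative baseline. The formula below is a sketch of that normalization under stated assumptions: the symbol names are illustrative, and the choice of baseline (for example, a run in which the inputs are fully randomized) is an assumption rather than something this article specifies.

```latex
\[
\text{fraction of loss recovered}
  \;=\;
  \frac{L_{\text{scrubbed}} - L_{\text{baseline}}}
       {L_{\text{model}} - L_{\text{baseline}}}
\]
% L_model    : expected loss of the unmodified model
% L_scrubbed : expected loss after all licensed resampling ablations
% L_baseline : expected loss of an uninformative baseline (assumed)
```

With purely hypothetical values L_model = 1.0, L_baseline = 3.0, and L_scrubbed = 1.5, the fraction is (1.5 - 3.0) / (1.0 - 3.0) = 0.75, so the hypothesis would be said to recover 75% of the loss. A value of 1 means scrubbing cost nothing; a value near 0 means the scrubbed model does no better than the baseline.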

An important asymmetry follows: causal scrubbing can reject a hypothesis (a large performance drop shows it licensed a swap the model actually depended on) but cannot, on its own, confirm one as the unique correct explanation ^[2]^[4].
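The procedure and the rejection case can be sketched on a toy problem. The "network" below is a hypothetical two-argument function, and each hypothesis is written as a map from an input slot to a function that assigns inputs to equivalence classes; both are assumptions made for illustration. Each slot is resampled independently from a donor in the same class as the reference input, which mirrors feeding different inputs along different treeified paths.

```python
import random


def model(a, b):
    """Hypothetical toy 'network': outputs 1.0 when a + b is even."""
    return float((a + b) % 2 == 0)


# A hypothesis maps each input slot to a function giving an input's equivalence class.
# Inputs in the same class are claimed to be interchangeable for that slot.
HYP_A_ONLY = {0: lambda x: x[0] % 2, 1: lambda x: 0}        # claims b is irrelevant
HYP_BOTH = {0: lambda x: x[0] % 2, 1: lambda x: x[1] % 2}   # claims both parities matter


def scrubbed_loss(dataset, hypothesis, rng):
    """Mean squared gap between scrubbed and unscrubbed outputs over the dataset."""
    total = 0.0
    for ref in dataset:
        target = model(*ref)  # unscrubbed behavior on the reference input
        scrubbed_args = []
        for slot, class_of in hypothesis.items():
            # Donors: every input the hypothesis treats as equivalent to `ref` for this slot.
            donors = [x for x in dataset if class_of(x) == class_of(ref)]
            scrubbed_args.append(rng.choice(donors)[slot])
        total += (model(*scrubbed_args) - target) ** 2
    return total / len(dataset)


dataset = [(a, b) for a in range(6) for b in range(6)]
rng = random.Random(0)
loss_both = scrubbed_loss(dataset, HYP_BOTH, rng)      # every licensed swap preserves the output
loss_a_only = scrubbed_loss(dataset, HYP_A_ONLY, rng)  # swaps b freely, which the model depends on
print(loss_both, loss_a_only)
```

The hypothesis that tracks both parities incurs no loss under scrubbing, while the one that declares `b` irrelevant incurs a large loss (about 0.5 in expectation) and is rejected. Note that the surviving hypothesis is only shown to be consistent with the behavior; the sketch does not show it is the unique explanation.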

How does resampling differ from zero and mean ablation?

The choice of resampling ablation is deliberate and distinguishes causal scrubbing from older ablation styles. The table below summarizes the contrast.

Ablation style	What the component is set to	Main drawback
Zero ablation	The activation is set to zero	Drives the network far off the data distribution; zero is often not a meaningful value
Mean ablation	The activation is replaced by its average over a dataset	Destroys input-dependent structure; the mean may itself be off-distribution for any single context
Resampling ablation (causal scrubbing)	The activation is replaced by its value on another input the hypothesis treats as equivalent	Keeps activations on-distribution, but requires a hypothesis that specifies the equivalence classes

Zero and mean ablation can mislead because they push activations into regions the network never sees, so a performance drop may reflect distribution shift rather than a genuine causal dependency. Resampling draws replacement values from real forward passes on real inputs, keeping the intervention on-distribution and tying the test to the hypothesis's own claims about which inputs are interchangeable ^[2]^[6].
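The three ablation styles can be compared side by side on a small activation matrix. This is a minimal NumPy sketch; the activations and the class labels are randomly generated placeholders, and the `classes` array stands in for whatever equivalence classes a real hypothesis would define.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical activations: 8 inputs x 4 hidden units, centred well away from zero.
acts = rng.normal(loc=3.0, scale=1.0, size=(8, 4))
# Hypothetical equivalence class of each input under some hypothesis.
classes = np.array([0, 0, 1, 1, 0, 1, 0, 1])


def zero_ablate(acts, unit):
    """Set one unit to zero on every input."""
    out = acts.copy()
    out[:, unit] = 0.0
    return out


def mean_ablate(acts, unit):
    """Replace one unit with its dataset mean on every input."""
    out = acts.copy()
    out[:, unit] = acts[:, unit].mean()
    return out


def resample_ablate(acts, unit, classes, rng):
    """Replace one unit with its value on a random input from the same class.

    The donor pool includes the input itself, so a swap can be a no-op.
    """
    out = acts.copy()
    for i in range(len(acts)):
        donors = np.flatnonzero(classes == classes[i])
        out[i, unit] = acts[rng.choice(donors), unit]
    return out


zeroed = zero_ablate(acts, unit=0)
averaged = mean_ablate(acts, unit=0)
resampled = resample_ablate(acts, unit=0, classes=classes, rng=rng)
```

Only the resampled column consists entirely of values the unit actually took on real inputs, and each value comes from an input the hypothesis considers equivalent to the one it replaces.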

What has causal scrubbing been used for?

Redwood Research demonstrated causal scrubbing on several mechanistic case studies, reporting recovered-loss figures as the headline result ^[3].

Induction in small language models. Causal scrubbing was used to test how induction heads and the surrounding circuit produce in-context repetition. A naive first hypothesis recovered only about 35% of the loss, while a slightly more general hypothesis recovered roughly 89%, illustrating how the recovered-loss metric distinguishes a crude story from a refined one. Replacing an induction head's attention scores with a simple approximation recovered on the order of 65% of the relevant loss difference ^[3]^[7].
Balanced-parenthesis classifier. Applied to a small transformer trained to judge whether a parenthesis string is balanced, the high-level claims of the hypothesized algorithm held up well, with roughly 88% to 93% of the loss recovered under scrubbing ^[3]^[8].
Indirect object identification (IOI). The IOI circuit in GPT-2 small, originally mapped by Wang, Variengien, Conmy, Shlegeris, and Steinhardt using path patching ^[9], was re-examined with causal scrubbing. Scrubbing the model down to the identified circuit retained only about 50% of the original logit difference between the correct and incorrect names, indicating that the circuit as specified did not capture the full mechanism and that other components contributed to the behavior ^[3]^[9].

Causal scrubbing also underpinned Redwood's REMIX research program, held in Berkeley over the winter of 2022 to 2023, which trained participants to produce and test circuit hypotheses on real models ^[10].

How does causal scrubbing relate to activation patching and causal abstraction?

Causal scrubbing belongs to the same family as activation patching and interchange interventions: all of these replace the activations of some network components with activations computed on a different input and observe the effect on the output. The differences lie in scope and intent ^[6].

Activation patching (including path patching) typically swaps one component or path at a time, usually between a single clean and a single corrupted input, to localize where a behavior is computed. Causal scrubbing instead applies a whole battery of swaps at once, derived systematically from a complete hypothesis, and mixes many inputs through the treeified graph rather than just two. Its goal is not to find a circuit but to evaluate a fully specified explanation end to end ^[2]^[6].
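The difference in scope can be illustrated with a toy example. The two-component "model" below and the donor inputs are hypothetical. `activation_patch` swaps a single component between one clean and one corrupted input to see where the behavior lives, while `scrub_all` replaces every component at once, each from its own donor, as a complete hypothesis would license.

```python
def components(x):
    """Hypothetical toy 'model': two components read the input."""
    return {"h1": 2.0 * x, "h2": float(x > 0)}


def output(acts):
    """The toy output simply sums the two component activations."""
    return acts["h1"] + acts["h2"]


def activation_patch(clean, corrupted, name):
    """Run on `clean`, but take one component's activation from `corrupted`."""
    acts = components(clean)
    acts[name] = components(corrupted)[name]
    return output(acts)


def scrub_all(donors):
    """Replace every component at once, each with its value on its own donor input."""
    return output({name: components(x)[name] for name, x in donors.items()})


clean, corrupted = 1.0, -1.0
baseline = output(components(clean))

# Patching localizes: one swap at a time, two inputs in total.
effect_h1 = activation_patch(clean, corrupted, "h1") - baseline
effect_h2 = activation_patch(clean, corrupted, "h2") - baseline

# Scrubbing evaluates a whole hypothesis: here, "h2 depends only on the sign of x",
# so h2 may be taken from any positive donor while h1 keeps the reference input.
mixed = scrub_all({"h1": clean, "h2": 3.0})
print(baseline, effect_h1, effect_h2, mixed)
```

Patching reports a separate effect for each component, which helps locate the computation. The scrubbed run mixes activations from two different inputs in one forward pass and leaves the output unchanged, which is the outcome the sign-only hypothesis about `h2` predicts.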

The method is also closely related to the causal abstraction framework and to interchange-intervention training. Both of these, like causal scrubbing, formalize the idea that a high-level causal model abstracts a low-level network, and test the abstraction by intervening so that the corresponding high-level intervention would leave the high-level output unchanged. Causal scrubbing's distinctive ingredients here are treeification and its specific resampling protocol ^[2]^[6].

What are the limitations of causal scrubbing?

Causal scrubbing is a one-directional test and has several well-documented weaknesses that the authors and later commentators have stressed ^[4]^[5].

It can only reject, not confirm. Passing the test shows a hypothesis is consistent with the model's behavior under the licensed swaps; it does not establish that the hypothesis is the true or unique mechanism. A different, even contradictory, explanation might pass equally well ^[2]^[4].
Wrong or overly permissive hypotheses can pass. A hypothesis that is too liberal about which activations are interchangeable, or that mislabels what a component computes, can still recover high loss, producing a false sense of understanding. This is a form of interpretability illusion: the explanation looks validated but does not faithfully reflect the underlying computation ^[4]^[5].
Faithfulness gaps when the model is imperfect. The test can fail to flag an unfaithful hypothesis (as opposed to merely an incomplete one) particularly when the network being interpreted does not achieve near-perfect loss on the task, because the slack between the model's loss and optimal loss can absorb the effect of an incorrect swap ^[5].
Treeification and off-distribution subtleties. Although resampling keeps individual activations on-distribution, mixing inputs across many treeified paths can still create globally implausible combinations, and the treeification step itself is a modeling choice that affects how strict the test is ^[2]^[6].
Practical and adoption difficulties. The methodology is laborious to apply and its metric tends to report that human-written explanations are far from complete. Some original developers came to view the consistently low recovered-loss numbers as evidence that fully faithful mechanistic explanations are harder to produce than hoped, contributing to limited uptake of the specific technique even as resampling ablations and the underlying framing remained influential ^[5].

These caveats motivate treating causal scrubbing as a necessary stress test a good explanation should survive, rather than as sufficient proof of correctness, and have informed later work on more robust faithfulness metrics for transformer circuits ^[11].

References

1. Chan, L., Garriga-Alonso, A., Goldowsky-Dill, N., Greenblatt, R., Nitishinskaya, J., Radhakrishnan, A., Shlegeris, B., Thomas, N. "Causal Scrubbing: a method for rigorously testing interpretability hypotheses." Redwood Research, AI Alignment Forum / LessWrong, December 2022. https://www.lesswrong.com/posts/JvZhhzycHu2Yd57RN/causal-scrubbing-a-method-for-rigorously-testing
2. "A comparison of causal scrubbing, causal abstractions, and related methods." AI Alignment Forum. https://www.alignmentforum.org/posts/uLMWMeBG3ruoBRhMW/a-comparison-of-causal-scrubbing-causal-abstractions-and
3. Chan, L. "Causal Scrubbing: a method for rigorously testing interpretability hypotheses." Author publication page. https://chanlawrence.me/publication/chan-2022-causal/
4. "Causal Scrubbing." AI Safety Tokyo study notes. https://aisafety.tokyo/benkyoukai/causal-scrubbing
5. "Practical Pitfalls of Causal Scrubbing." AI Alignment Forum. https://www.alignmentforum.org/posts/DFarDnQjMnjsKvW8s/practical-pitfalls-of-causal-scrubbing
6. "Causal scrubbing: Appendix." AI Alignment Forum / LessWrong, December 2022. https://www.lesswrong.com/posts/kcZZAsEjwrbczxN2i/causal-scrubbing-appendix
7. "Causal scrubbing: results on induction heads." AI Alignment Forum. https://www.alignmentforum.org/posts/j6s9H9SHrEhEfuJnq/causal-scrubbing-results-on-induction-heads
8. "Causal scrubbing: results on a paren balance checker." AI Alignment Forum. https://www.alignmentforum.org/posts/kjudfaQazMmC74SbF/causal-scrubbing-results-on-a-paren-balance-checker
9. Wang, K., Variengien, A., Conmy, A., Shlegeris, B., Steinhardt, J. "Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small." arXiv:2211.00593, November 2022. https://arxiv.org/abs/2211.00593
10. "Apply to the Redwood Research Mechanistic Interpretability Experiment (REMIX), a research program in Berkeley." LessWrong. https://www.lesswrong.com/posts/nqwzrpkPvviLHWXaE/apply-to-the-redwood-research-mechanistic-interpretability
11. "Transformer Circuit Faithfulness Metrics are not Robust." arXiv:2407.08734, 2024. https://arxiv.org/abs/2407.08734
