Causal scrubbing
Last reviewed
Jun 8, 2026
Sources
11 citations
Review status
Source-backed
Revision
v1 · 2,156 words
Causal scrubbing is a methodology in mechanistic interpretability for rigorously and quantitatively testing hypotheses about the internal computational structure of a neural network. It was introduced in December 2022 by a team at Redwood Research and treats an interpretability hypothesis as a precise claim about which internal activations can be swapped between inputs without changing the network's behavior. The method replaces ("scrubs") activations along paths the hypothesis deems irrelevant with activations resampled from other inputs the hypothesis treats as equivalent, then measures how much task performance survives. If the hypothesis fully and correctly explains the behavior, performance should be preserved, and the fraction recovered is read as a measure of how complete the explanation is [1][2].
The technique is most associated with validating circuit-level claims, such as how induction heads implement in-context copying. Its central contribution is to turn the informal statement "circuit X implements behavior Y" into an operational, falsifiable test grounded in causal interventions [1].
The original write-up was authored by Lawrence Chan, Adrià Garriga-Alonso, Nicholas Goldowsky-Dill, Ryan Greenblatt, Jenny Nitishinskaya, Ansh Radhakrishnan, Buck Shlegeris, and Nate Thomas, and was published as a sequence on the AI Alignment Forum and LessWrong [1][3].
Mechanistic interpretability aims to explain a network's behavior in terms of human-understandable algorithms implemented by its weights. A recurring problem is that such explanations are usually stated informally and are hard to check. Matching the model's outputs is not sufficient evidence that an explanation is correct: as the authors put it, it is trivial to write an algorithm that reproduces a network's outputs on a test set by building a large lookup table, yet such a lookup table explains nothing about what the weights actually do [1][4].
Causal scrubbing was designed to provide a principled, quantitative answer to the question "how good is this mechanistic explanation?" The key insight is that a mechanistic hypothesis can be reframed as a claim about invariances: it specifies which activations inside the network are claimed not to matter for a given downstream computation, and therefore which activations could be resampled from a different input without affecting the output. This reframing makes the hypothesis testable by intervention rather than by inspection [1][2].
A related motivation was the safety-relevant goal of being able to trust an interpretation. An interpretation is only as reassuring as the worst-case implementation consistent with it: two mechanisms can behave identically on a test set yet diverge badly off-distribution. The authors wanted a test that rejects explanations which fail to pin down the computation, rather than one that merely confirms surface-level agreement [4][5].
A causal-scrubbing hypothesis is formalized as a correspondence between two graphs. The first is an interpretation graph (a high-level, human-understandable computation): the algorithm the researcher believes the model runs, whose nodes are interpretable sub-computations and whose edges are claimed information dependencies. The second is the model's own computational graph: the forward pass expressed as nodes (for example attention heads, MLP layers, or individual activations) with edges between them. The hypothesis maps each interpretation-graph node to a node or set of components in the model's graph, and claims each high-level node is "extensionally equal" to the corresponding low-level computation, meaning it produces the same outputs from the same inputs [1][2].
Crucially, a missing edge in the interpretation graph is itself a claim: that the source has no causal influence on the target for the behavior in question. These claimed non-dependencies are exactly what causal scrubbing stress-tests [1].
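The idea that missing edges are claims can be made concrete with a small sketch. It assumes a simplified representation in which a model edge counts as "accounted for" only if some interpretation-graph edge maps onto it. The `Hypothesis` class, the helper function, and all component names (`emb`, `head0.1`, `head0.3`, `head1.5`, `logits`) are hypothetical and chosen for illustration; they are not taken from the original write-up.

```python
from dataclasses import dataclass
from typing import Dict, Set, Tuple

Edge = Tuple[str, str]  # (source, target)


@dataclass
class Hypothesis:
    """Illustrative container; not the data model used by the original authors."""
    interp_edges: Set[Edge]              # claimed dependencies between high-level nodes
    correspondence: Dict[str, Set[str]]  # high-level node -> model components


def claimed_non_dependencies(hyp: Hypothesis, model_edges: Set[Edge]) -> Set[Edge]:
    """Return the model edges that no interpretation-graph edge accounts for."""
    explained: Set[Edge] = set()
    for src, dst in hyp.interp_edges:
        for u in hyp.correspondence[src]:
            for v in hyp.correspondence[dst]:
                explained.add((u, v))
    return model_edges - explained


# Hypothetical two-layer attention model (every name here is made up).
MODEL_EDGES: Set[Edge] = {
    ("emb", "head0.1"), ("emb", "head0.3"), ("emb", "head1.5"),
    ("head0.1", "head1.5"), ("head0.3", "head1.5"),
    ("head1.5", "logits"), ("emb", "logits"),
}

hyp = Hypothesis(
    interp_edges={
        ("token", "prev_token"), ("token", "induction"),
        ("prev_token", "induction"), ("induction", "output"),
    },
    correspondence={
        "token": {"emb"}, "prev_token": {"head0.1"},
        "induction": {"head1.5"}, "output": {"logits"},
    },
)

# Edges the hypothesis implicitly claims do not matter for the behavior.
scrubbable = claimed_non_dependencies(hyp, MODEL_EDGES)
```

In this toy setup the edges through `head0.3` and the direct `emb` to `logits` edge are left unexplained, so they are the ones a scrubbing test would be entitled to resample.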
Before intervening, causal scrubbing rewrites the model's computational graph into a tree, a step the authors call treeification. In a normal computation graph a node's output is shared by all of its downstream consumers. Treeification instead duplicates ("unfolds") shared nodes so that every path from the input to the output has its own private copy of each intermediate node. The graph becomes a tree whose leaves are separate copies of the input [2][6].
Treeification is what gives causal scrubbing its expressive power. Because each path has independent leaves, the method can feed a different input to each path, mixing many inputs in a single forward pass. This lets it intervene on individual paths through the network rather than only on whole nodes, allowing finer-grained and stricter tests than interventions on the original (un-treeified) graph permit. The authors describe treeification as making causal scrubbing both more demanding and more expressive than other methods [2][6].
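A minimal sketch of the unfolding step, assuming the computation graph is given as a dictionary from each node to the nodes it reads from. The graph, node names, and helper functions are hypothetical; the sketch only shows how shared nodes get duplicated, not how a real implementation handles activations.

```python
from typing import Dict, List, Tuple

# Hypothetical toy graph: each node maps to the nodes it reads from.
# "x" is the input; "b" reads both the input and "a"; "out" reads "a" and "b".
GRAPH: Dict[str, List[str]] = {
    "x": [],
    "a": ["x"],
    "b": ["x", "a"],
    "out": ["a", "b"],
}


def treeify(graph: Dict[str, List[str]], node: str) -> Tuple[str, tuple]:
    """Unfold the DAG rooted at `node` into a tree by copying shared parents."""
    return (node, tuple(treeify(graph, parent) for parent in graph[node]))


def leaf_paths(tree: Tuple[str, tuple], prefix: Tuple[str, ...] = ()) -> List[Tuple[str, ...]]:
    """List every output-to-input path; each path ends in its own copy of the input."""
    name, children = tree
    path = prefix + (name,)
    if not children:
        return [path]
    paths: List[Tuple[str, ...]] = []
    for child in children:
        paths.extend(leaf_paths(child, path))
    return paths


tree = treeify(GRAPH, "out")
paths = leaf_paths(tree)
for p in paths:
    print(" <- ".join(p))
```

The original graph has a single input node, but the unfolded tree has one input leaf per path, and each leaf could in principle be fed a different input.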
For a given hypothesis, causal scrubbing proceeds roughly as follows [1][2]:
These swaps are called resampling ablations: instead of deleting a component's contribution, the method resamples it from another, hypothesis-equivalent input. If the hypothesis is correct, every scrub it licenses is behavior-preserving, so the scrubbed model should attain essentially the same loss as the unmodified model. The recovered loss, often reported as the percentage of performance retained after the maximally aggressive set of licensed scrubs, is interpreted as a measure of how complete the hypothesis is: recovering nearly all of the loss means the hypothesis explains nearly all of the relevant computation, while recovering little means it has missed important structure [1][2].
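The recovered-loss figure can be computed along the following lines. This is a sketch under an assumption: it normalizes the scrubbed loss between the unmodified model's loss and a baseline loss (for example, the loss obtained when every input is resampled at random). The function name and the choice of baseline are illustrative and are not specified in the text above.

```python
def fraction_loss_recovered(scrubbed_loss: float, model_loss: float, baseline_loss: float) -> float:
    """Fraction of the gap between a baseline and the model that survives scrubbing.

    Returns 1.0 when scrubbing leaves the loss unchanged and 0.0 when the
    scrubbed model is no better than the (assumed) baseline.
    """
    if baseline_loss == model_loss:
        raise ValueError("baseline and model loss must differ to normalize")
    return (baseline_loss - scrubbed_loss) / (baseline_loss - model_loss)


# Hypothetical numbers: model loss 1.0, baseline loss 3.0, scrubbed loss 1.5.
recovered = fraction_loss_recovered(scrubbed_loss=1.5, model_loss=1.0, baseline_loss=3.0)
print(f"{recovered:.0%} of the loss recovered")
```

With these made-up numbers the hypothesis would be credited with three quarters of the performance gap; a value can fall below zero if scrubbing makes the model worse than the baseline.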
An important asymmetry follows: causal scrubbing can reject a hypothesis (a large performance drop shows it licensed a swap the model actually depended on) but cannot, on its own, confirm one as the unique correct explanation [2][4].
The choice of resampling ablation is deliberate and distinguishes causal scrubbing from older ablation styles. The table below summarizes the contrast.
| Ablation style | What the component is set to | Main drawback |
|---|---|---|
| Zero ablation | The activation is set to zero | Drives the network far off the data distribution; zero is often not a meaningful value |
| Mean ablation | The activation is replaced by its average over a dataset | Destroys input-dependent structure; the mean may itself be off-distribution for any single context |
| Resampling ablation (causal scrubbing) | The activation is replaced by its value on another input the hypothesis treats as equivalent | Keeps activations on-distribution, but requires a hypothesis that specifies the equivalence classes |
Zero and mean ablation can mislead because they push activations into regions the network never sees, so a performance drop may reflect distribution shift rather than a genuine causal dependency. Resampling draws replacement values from real forward passes on real inputs, keeping the intervention on-distribution and tying the test to the hypothesis's own claims about which inputs are interchangeable [2][6].
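The contrast between the three styles can be illustrated on a toy model. Everything in this sketch is an assumption made for illustration: the model, its single internal `component`, and the hypothesis that the component depends only on the first input coordinate. Under that (here correct) hypothesis, two inputs are treated as equivalent when they share the first coordinate.

```python
import random
from statistics import mean


def component(x):
    """Hypothetical internal activation; it depends only on x[0]."""
    return 10.0 + x[0]


def model(x, h=None):
    """Toy readout; pass `h` to override the component's activation."""
    if h is None:
        h = component(x)
    return 2.0 * h + x[1]


rng = random.Random(0)
data = [(rng.randint(0, 3), rng.randint(0, 3)) for _ in range(200)]
mean_h = mean(component(x) for x in data)


def zero_ablate(x):
    return model(x, h=0.0)


def mean_ablate(x):
    return model(x, h=mean_h)


def resample_ablate(x):
    # Draw the activation from another input the hypothesis treats as equivalent.
    donors = [d for d in data if d[0] == x[0]]
    return model(x, h=component(rng.choice(donors)))


def mse(ablated):
    """Mean squared change in output caused by the ablation."""
    return mean((ablated(x) - model(x)) ** 2 for x in data)


errors = {
    "zero": mse(zero_ablate),
    "mean": mse(mean_ablate),
    "resample": mse(resample_ablate),
}
print(errors)
```

Because the assumed hypothesis is exactly right for this toy model, resampling leaves every output unchanged, while zero ablation produces the largest disturbance since the component never takes a value near zero on real inputs.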
Redwood Research demonstrated causal scrubbing on several mechanistic case studies, reporting recovered-loss figures as the headline result [3].
Causal scrubbing also underpinned Redwood's 2023 REMIX research program, which trained participants to produce and test circuit hypotheses on real models [10].
Causal scrubbing belongs to the same family as activation patching and interchange interventions: all of these replace the activations of some network components with activations computed on a different input and observe the effect on the output. The differences lie in scope and intent [6].
Activation patching (including path patching) typically swaps one component or path at a time, usually between a single clean and a single corrupted input, to localize where a behavior is computed. Causal scrubbing instead applies a whole battery of swaps at once, derived systematically from a complete hypothesis, and mixes many inputs through the treeified graph rather than just two. Its goal is not to find a circuit but to evaluate a fully specified explanation end to end [2][6].
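For comparison, single-component activation patching between one clean and one corrupted input might look like the sketch below. The toy model, its component names `h1` and `h2`, and the inputs are all hypothetical.

```python
from typing import Dict, Tuple

Input = Tuple[float, float]


def components(x: Input) -> Dict[str, float]:
    """Hypothetical internal activations of a toy model."""
    return {"h1": 3.0 * x[0], "h2": 0.5 * x[1]}


def readout(acts: Dict[str, float]) -> float:
    """Toy output head; it reads h1 and ignores h2."""
    return 2.0 * acts["h1"] + 1.0


def patch_effects(clean: Input, corrupted: Input) -> Dict[str, float]:
    """Patch one component at a time from the corrupted run into the clean run."""
    clean_acts = components(clean)
    corrupted_acts = components(corrupted)
    baseline = readout(clean_acts)
    effects: Dict[str, float] = {}
    for name in clean_acts:
        patched = dict(clean_acts)
        patched[name] = corrupted_acts[name]
        effects[name] = abs(readout(patched) - baseline)
    return effects


effects = patch_effects(clean=(1.0, 2.0), corrupted=(4.0, -2.0))
print(effects)
```

Only `h1` shows a nonzero effect here, which localizes the behavior to that component. A causal-scrubbing test would instead apply every swap licensed by a hypothesis at the same time and summarize the outcome as a single recovered-loss figure.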
The method is also closely related to the causal abstraction framework and to interchange-intervention training. Both formalize the idea that a high-level causal model abstracts a low-level network, and both test the abstraction by intervention: swapping activations in the network should have the same effect on the output as the corresponding intervention on the high-level model. Causal scrubbing focuses on swaps that the hypothesis predicts will leave the output unchanged, and its distinctive ingredients are treeification and its specific resampling protocol [2][6].
Causal scrubbing is a one-directional test and has several well-documented weaknesses that the authors and later commentators have stressed [4][5].
These caveats motivate treating causal scrubbing as a necessary stress test that a good explanation should survive, rather than as sufficient proof of correctness. They have also informed later work on more robust faithfulness metrics for transformer circuits [11].