Circuit discovery
Last reviewed
May 25, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,265 words
Circuit discovery is a research program in mechanistic interpretability that aims to identify sparse computational subgraphs inside trained neural networks, called circuits, that implement specific behaviors or sub-tasks.[1][2] A circuit consists of a subset of model components (attention heads, MLP neurons, or features extracted by a sparse autoencoder) together with the weighted connections between them that, taken on their own, reproduce a target behavior to within some faithfulness threshold.[1][3] The field originated with computer-vision work by Chris Olah and collaborators at OpenAI on Inception v1,[1] was extended to transformers by a team at Anthropic in 2021,[2] and grew into a portfolio of manual case studies (such as the indirect object identification circuit in GPT-2 small)[3] and automated discovery methods (ACDC, edge attribution patching, sparse feature circuits, and Anthropic's attribution-graph "circuit tracing" pipeline).[4][5][6][7] Circuit discovery is one of the two main bottom-up strategies in mechanistic interpretability, alongside feature discovery via sparse autoencoders, and is increasingly applied to safety-relevant behaviors such as refusal, sycophancy, and deceptive reasoning.[6][7]
The vocabulary of "features" and "circuits" was introduced in the Distill thread "Zoom In: An Introduction to Circuits" by Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter, published on 10 March 2020.[1] The essay laid out three speculative claims that have guided the field since: features are the fundamental unit of neural networks and correspond to directions in activation space; features are connected by weights to form circuits, defined as computational subgraphs consisting of features and the weighted edges between them; and analogous features and circuits recur across models trained on similar data ("universality").[1] In the Inception v1 case studies that accompanied the essay, the authors documented examples such as curve detectors, high-low frequency detectors, and the "car circuit," which composed wheel and window detectors into a car feature.[1]
A circuit is distinct from a feature. A feature is a single direction or unit that responds to an interpretable input pattern; a circuit is a structured set of features and connections whose joint computation explains a behavior.[1] In transformer language models the analogue of an Inception v1 neuron is an attention head, an MLP neuron, or, in later work, a feature extracted from a sparse autoencoder trained on residual-stream activations.[2][8] The distinction matters because the same neuron can participate in many circuits, and because polysemanticity and superposition make it possible for a single direction to encode several unrelated features at once, complicating any one-to-one map between neurons and concepts.[8]
Elhage and colleagues at Anthropic generalized the circuits program to transformers in "A Mathematical Framework for Transformer Circuits," published on 22 December 2021.[2] The framework decomposes an attention-only transformer into a sum of paths through the residual stream, splits each attention head into a query-key (QK) circuit that determines where the head attends and an output-value (OV) circuit that determines what it writes, and treats the residual stream as a shared communication channel that earlier components write to and later components read from.[2] The same paper introduced the notion of attention-head composition through Q, K, and V inputs, which the authors used to predict and then verify the existence of two-head "induction" mechanisms in two-layer models.[2]
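The QK/OV split described above can be illustrated with a few lines of NumPy. This is a minimal sketch, not code from the framework paper: the toy sizes, the random weights, and the row-vector convention (`x @ W`) are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 8, 2  # hypothetical toy sizes

# Per-head weights, row-vector convention (activations multiply from the left).
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
W_V = rng.normal(size=(d_model, d_head))
W_O = rng.normal(size=(d_head, d_model))

# QK circuit: a bilinear form that scores (destination, source) residual vectors.
W_QK = W_Q @ W_K.T  # shape (d_model, d_model), rank at most d_head
# OV circuit: a linear map from the attended-to residual vector to the head's output.
W_OV = W_V @ W_O    # shape (d_model, d_model), rank at most d_head

x = rng.normal(size=(5, d_model))  # residual stream at 5 positions

# Pre-softmax attention scores computed the usual way and via the QK circuit.
scores_standard = (x @ W_Q) @ (x @ W_K).T
scores_qk = x @ W_QK @ x.T
```

Both score matrices agree, which is the sense in which the head's "where to attend" computation depends only on the product of the query and key weights, and its "what to write" computation only on the product of the value and output weights.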
The mathematical framework analyzed zero-layer, one-layer, and two-layer attention-only transformers in increasing complexity. Zero-layer transformers reduce to bigram statistics directly accessible from the embedding and unembedding matrices, since no attention or MLP non-linearity intervenes between input and output.[2] One-layer attention-only transformers can be expressed as an ensemble of bigram and "skip-trigram" patterns, in which a single head copies a token from earlier in the context into the prediction at the current position.[2] Two-layer models, by contrast, can implement compositional algorithms because attention heads in the second layer can read from outputs that earlier heads have written into the residual stream.[2] The framework treats this writing and reading explicitly: each component projects activations into a low-rank subspace, the residual stream sums all such projections, and downstream components recover information by reading from those subspaces.[2] Path expansion then expresses the model's output as a sum over discrete paths of (embedding, head, head, ..., unembedding) and allows researchers to attribute behaviors to individual paths.[2]
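As a rough illustration of path expansion, the sketch below builds a one-layer attention-only model with random weights and checks that its logits equal the direct (bigram-like) path plus one path per head. It is a simplified toy under stated assumptions: no layer normalization, no positional embeddings, and attention patterns treated as fixed once computed. All names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model, d_head, n_heads, seq = 11, 8, 2, 3, 5  # hypothetical toy sizes

W_E = rng.normal(size=(vocab, d_model))   # embedding
W_U = rng.normal(size=(d_model, vocab))   # unembedding
shapes = {"Q": (d_model, d_head), "K": (d_model, d_head),
          "V": (d_model, d_head), "O": (d_head, d_model)}
heads = [{k: rng.normal(size=s) for k, s in shapes.items()} for _ in range(n_heads)]

def causal_attention(x, W_Q, W_K):
    """Softmax attention pattern with a causal mask."""
    scores = (x @ W_Q) @ (x @ W_K).T / np.sqrt(W_Q.shape[1])
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

tokens = rng.integers(0, vocab, size=seq)
x0 = W_E[tokens]

# Ordinary forward pass: every head reads x0 and adds its output to the residual stream.
patterns = [causal_attention(x0, h["Q"], h["K"]) for h in heads]
resid = x0 + sum(A @ x0 @ h["V"] @ h["O"] for A, h in zip(patterns, heads))
logits_forward = resid @ W_U

# Path expansion: one direct path plus one (embedding, head, unembedding) path per head.
direct_path = x0 @ W_U
head_paths = [A @ x0 @ (h["V"] @ h["O"]) @ W_U for A, h in zip(patterns, heads)]
logits_paths = direct_path + sum(head_paths)
```

Each entry of `head_paths` is the logit contribution of a single head, so a behavior can be attributed to individual paths by inspecting these terms one at a time.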
Manual circuit discovery proceeds by formulating hypotheses about which components matter for a behavior, then testing those hypotheses with causal interventions such as activation patching, path patching, and ablation.[3] The general workflow involves four steps: choose a narrow behavior that the model performs reliably and that can be summarized by a scalar metric (typically logit difference between a correct and incorrect completion); construct a paired dataset of "clean" inputs that elicit the behavior and "corrupted" inputs that share surface structure but produce a different answer; run the model on both, intervening at specific internal sites; and iterate, narrowing down to the smallest set of components whose intervention recovers the behavior.[3][4] Two case studies anchor the manual tradition.
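The first two steps of that workflow, a scalar metric and a paired dataset, might look like the sketch below for an IOI-style prompt. The template, the name pool, and the choice to represent logits as a dictionary keyed by token string are simplifying assumptions; a real pipeline would work with tokenizer IDs and model outputs.

```python
import random

TEMPLATE = "When {a} and {b} went to the store, {s} gave a drink to"
NAMES = ["John", "Mary", "Alice", "Tom"]  # hypothetical name pool

def make_pair(rng):
    """Build a clean prompt and a corrupted prompt with the same surface structure."""
    a, b = rng.sample(NAMES, 2)
    return {
        "clean": TEMPLATE.format(a=a, b=b, s=a),      # subject repeats a, so the answer is b
        "corrupted": TEMPLATE.format(a=a, b=b, s=b),  # subject repeats b, so the answer flips to a
        "correct": b,
        "incorrect": a,
    }

def logit_diff(final_logits, correct, incorrect):
    """Scalar metric: logit of the correct completion minus logit of the incorrect one."""
    return final_logits[correct] - final_logits[incorrect]

pair = make_pair(random.Random(0))
example_metric = logit_diff({pair["correct"]: 3.0, pair["incorrect"]: 1.0},
                            pair["correct"], pair["incorrect"])
```

The corrupted prompt differs from the clean one in a single token, so any change in the metric under intervention can be attributed to how the model processes that difference rather than to unrelated surface variation.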
Olsson and colleagues at Anthropic followed the mathematical framework with "In-context Learning and Induction Heads," published on 8 March 2022.[9] The paper defined an induction head as a head that searches the context for previous instances of the current token, attends to whatever token followed it, and increases the probability of that token at the next position.[9] Mechanistically the behavior is implemented by a two-head circuit: a "previous-token" head copies information from position N-1 into position N, and an "induction" head at a later layer uses that signal to locate earlier instances of the current token and attend to the token that came next.[9] The paper presented six lines of evidence that induction heads account for the bulk of in-context learning in large transformers, including a sharp phase change during training in which induction heads form and the in-context learning score jumps at the same time.[9]
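One simple way to check whether a head behaves like an induction head is to feed the model a random block of tokens repeated twice and measure how much attention each position in the second copy pays to the token that followed its earlier occurrence. The function below is a sketch of such a score operating on a precomputed attention pattern; the scoring rule and the two hand-built patterns are illustrative assumptions rather than the exact procedure from the paper.

```python
import numpy as np

def induction_score(pattern, block_len):
    """Mean attention from each second-copy position to the token after its first-copy match.

    `pattern` is a (2 * block_len, 2 * block_len) attention matrix (rows = destinations)
    for a sequence made of one block of tokens repeated twice.
    """
    pattern = np.asarray(pattern)
    dest = np.arange(block_len, 2 * block_len)  # positions in the second copy
    src = dest - block_len + 1                  # token after the earlier occurrence
    return float(pattern[dest, src].mean())

block_len = 4
n = 2 * block_len

# A hand-built "perfect" induction pattern: second-copy rows attend exactly to src.
ideal = np.zeros((n, n))
for i in range(n):
    target = i - block_len + 1 if i >= block_len else i
    ideal[i, target] = 1.0

# A baseline head that attends uniformly over all earlier positions.
uniform = np.tril(np.ones((n, n)))
uniform /= uniform.sum(axis=-1, keepdims=True)

ideal_score = induction_score(ideal, block_len)
uniform_score = induction_score(uniform, block_len)
```

A head whose score on real attention patterns is close to the ideal value and far above the uniform baseline is a candidate induction head, which would then be confirmed with causal interventions.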
Wang, Variengien, Conmy, Shlegeris, and Steinhardt published "Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small" on 1 November 2022.[3] The paper studied the indirect-object-identification (IOI) task, in which a sentence such as "When John and Mary went to the store, John gave a drink to" should be completed with "Mary."[3] Using path patching and other causal interventions on GPT-2 small, the authors identified a circuit of 26 attention heads grouped into seven functional classes: duplicate-token heads, previous-token heads, induction heads, S-inhibition heads, name-mover heads, negative name-mover heads, and backup name-mover heads.[3] Duplicate-token heads fire on the repeated subject token, S-inhibition heads suppress the repeated name, and name-mover heads copy the unique name (the indirect object) into the output position; backup name-mover heads take over if the primary name-mover heads are ablated, a form of redundancy the authors describe as "backup behavior."[3] The paper evaluated the discovered circuit against three criteria, faithfulness, completeness, and minimality, and described it as the largest end-to-end reverse-engineering of a natural language behavior in a transformer at the time of publication.[3]
Other manual case studies followed the same template, including circuits for greater-than comparisons, modular addition, docstring completion, and acronym prediction, but the IOI circuit became the de facto benchmark against which automated methods are evaluated.[4] The IOI study also became a methodological reference point because it specified the criteria a circuit should meet. Faithfulness asks whether the circuit alone, with the rest of the model ablated to its mean activation, recovers the metric of interest; completeness asks whether removing the circuit's components from the model destroys the behavior; and minimality asks whether every component in the proposed circuit is necessary, in the sense that ablating it alone causes a meaningful drop in the metric.[3] Subsequent papers including ACDC and Sparse Feature Circuits adopted variants of these criteria to score their automated outputs against the manually documented IOI baseline.[4][6]
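The three criteria can be made concrete on a toy model. The sketch below is purely illustrative: the component names echo the IOI head classes, but the contribution values, the additive metric, and the rule that a backup component activates only when the name mover is ablated are invented for the example, and "ablation" here simply drops a component's contribution rather than replacing it with a mean activation.

```python
# Hypothetical per-component contributions to the metric (e.g. logit difference).
CONTRIB = {"name_mover": 3.0, "s_inhibition": 1.5, "duplicate_token": 1.0, "other": 0.1}
BACKUP_BONUS = 2.0  # contribution of the backup component when the name mover is ablated
ALL = set(CONTRIB) | {"backup_name_mover"}

def metric(active):
    """Toy metric for a model in which only the components in `active` are left intact."""
    total = sum(CONTRIB.get(c, 0.0) for c in active)
    if "name_mover" not in active and "backup_name_mover" in active:
        total += BACKUP_BONUS
    return total

circuit = {"name_mover", "s_inhibition", "duplicate_token"}
full = metric(ALL)

# Faithfulness: how much of the full-model metric the circuit alone recovers.
faithfulness = metric(circuit) / full
# Completeness check: what the model still achieves once the circuit is removed.
residual_without_circuit = metric(ALL - circuit)
# Minimality: drop in the circuit's metric when each component is ablated on its own.
minimality = {c: metric(circuit) - metric(circuit - {c}) for c in circuit}
```

In this toy, the circuit is highly faithful, yet the model keeps a sizeable share of the behavior after the circuit is removed because the backup component switches on. That is the kind of gap the completeness criterion is meant to expose.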
The labor cost of manual circuit discovery, which the IOI authors estimated at months of researcher time per behavior, motivated a sequence of algorithms that automate the search.[4]
"Towards Automated Circuit Discovery for Mechanistic Interpretability" by Arthur Conmy, Augustine Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga-Alonso was submitted on 28 April 2023 and accepted as a NeurIPS 2023 spotlight.[4] The Automatic Circuit Discovery (ACDC) algorithm formalizes the manual process into three steps: choose a metric and a dataset that elicits the behavior, run activation patching to measure the causal effect of every edge in the computational graph, and greedily prune edges whose effect on the metric is below a threshold.[4] On the IOI task ACDC recovers 68 critical edges out of roughly 32,000 in GPT-2 small, and on the Greater-Than circuit it rediscovers five of five component classes identified in prior manual work.[4] The paper also introduced ROC-style metrics for comparing automated circuit-discovery algorithms against ground-truth manual circuits, and noted that the per-edge patching cost makes ACDC prohibitively slow on larger models.[4]
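The greedy pruning step can be sketched in a few lines. This is a deliberately simplified illustration and not the published ACDC implementation: the graph, the edge effects, and the additive metric are hypothetical, and in a real model removing an edge means patching it with corrupted activations and rerunning the network to re-measure the metric.

```python
# Hypothetical causal effect of each edge on the metric in a tiny graph.
EDGE_EFFECT = {
    ("in", "a"): 1.0,
    ("in", "b"): 0.02,
    ("a", "out"): 2.0,
    ("b", "out"): 0.01,
}

def toy_metric(kept_edges):
    """Additive stand-in for 'run the model with only these edges left unpatched'."""
    return sum(EDGE_EFFECT[e] for e in kept_edges)

def greedy_prune(edges, metric, threshold):
    """Visit each edge once and drop it if removing it changes the metric by less than threshold."""
    kept = set(edges)
    for edge in edges:
        trial = kept - {edge}
        if abs(metric(kept) - metric(trial)) < threshold:
            kept = trial  # the edge is not needed at this threshold
    return kept

circuit_edges = greedy_prune(list(EDGE_EFFECT), toy_metric, threshold=0.1)
```

The threshold controls the sparsity of the result: raising it prunes more aggressively, and sweeping it is one way to trace out the kind of ROC-style curves used to compare discovered circuits with manually documented ones.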
Aaquib Syed, Can Rager, and Arthur Conmy responded with "Attribution Patching Outperforms Automated Circuit Discovery," submitted on 16 October 2023 and presented at the NeurIPS 2023 workshop on attribution methods.[5] Edge attribution patching (EAP) replaces the exhaustive activation patching loop with a linear, gradient-based approximation: each edge in the computational graph is scored by a first-order Taylor expansion of the metric around the clean run, using activations from a clean forward pass, activations from a corrupted forward pass, and a single backward pass on the metric.[5] The total cost is two forward passes and one backward pass per task, regardless of the number of edges, whereas ACDC's cost grows with the number of edges it has to patch.[5] On IOI, Greater-Than, and other benchmark tasks, EAP recovers circuits at least as faithful as those found by ACDC at a fraction of the compute, and the technique has since been extended in follow-up work such as edge pruning and integrated-gradients attribution.[5]
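A minimal sketch of the edge score, on a toy graph with two upstream nodes feeding one downstream node, is shown below. Everything here is an assumption made for illustration: the `tanh` read-out, the analytic gradient standing in for a backward pass, and the node names. The point is only that one gradient at the downstream input is reused for every incoming edge.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
w = rng.normal(size=d)  # hypothetical read-out weights of the downstream node

def metric_from_input(v_in):
    """Downstream node v: a nonlinearity followed by a linear read-out."""
    return float(w @ np.tanh(v_in))

def grad_metric(v_in):
    """Analytic gradient of the metric with respect to v's input (stands in for backprop)."""
    return w * (1.0 - np.tanh(v_in) ** 2)

# Upstream outputs on the clean and corrupted runs (the two "forward passes").
clean = {"u1": rng.normal(size=d), "u2": rng.normal(size=d)}
corrupt = {"u1": clean["u1"] + 0.01 * rng.normal(size=d),  # slightly different
           "u2": clean["u2"].copy()}                        # identical on both runs

v_in_clean = sum(clean.values())
g = grad_metric(v_in_clean)  # the single "backward pass", shared by all edges into v

# EAP-style score for edge u -> v: (corrupted output - clean output) dotted with the gradient.
eap_score = {u: float((corrupt[u] - clean[u]) @ g) for u in clean}

def exact_effect(u):
    """Activation-patching ground truth: actually swap in u's corrupted output and rerun v."""
    patched = v_in_clean - clean[u] + corrupt[u]
    return metric_from_input(patched) - metric_from_input(v_in_clean)
```

For small differences between clean and corrupted activations the score tracks `exact_effect` closely; the two drift apart as the difference grows and the nonlinearity matters more.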
The edge-level formulation matters because earlier attribution-patching work (Neel Nanda, 2022) scored individual activation sites rather than edges between components. Moving the scoring from sites to edges allowed EAP to recover the directed graph structure that ACDC produces. Syed et al. also published reference code on GitHub that runs against the same IOI, Greater-Than, and docstring tasks used to benchmark ACDC, which made cross-method comparison straightforward and contributed to the rapid uptake of attribution patching as the default first pass in circuit-discovery pipelines.[5]
Samuel Marks, Can Rager, Eric Michaud, Yonatan Belinkov, David Bau, and Aaron Mueller published "Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models" on arXiv on 28 March 2024, and the paper received an oral presentation at ICLR 2025.[6] The method swaps the units of circuit discovery from attention heads and neurons to features extracted by sparse autoencoder modules trained on residual-stream activations, and then runs an attribution-patching-style pass over the resulting feature graph.[6] Because the features are more nearly monosemantic than raw neurons, the resulting circuits are easier to read off as human-understandable computations.[6] The same paper introduced SHIFT, a method that improves the out-of-distribution generalization of a classifier by ablating features that a human annotator judges to be task-irrelevant, and demonstrated a fully unsupervised pipeline that discovers thousands of feature circuits for automatically clustered model behaviors.[6]
Activation patching and attribution patching are the causal-intervention techniques on which most circuit-discovery work depends.[4][5] Activation patching, also called causal mediation or interchange-intervention analysis, runs the model on a clean input, runs it again on a corrupted input that differs in some controlled way, and then copies activations from one run into the other at chosen internal sites; the change in the output metric measures the causal importance of those sites.[4] ACDC applies activation patching to every edge in the computational graph, which is exact but expensive.[4]
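The copy-and-rerun procedure can be shown on a two-layer toy network. This is a sketch under stated assumptions: the network, its random weights, and the choice to patch clean activations into the corrupted run are all illustrative, and a real experiment would use hooks on a transformer instead of a `patch` argument.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 4))  # input -> hidden (hypothetical toy weights)
W2 = rng.normal(size=4)       # hidden -> scalar metric

def forward(x, patch=None):
    """Run the toy model; optionally overwrite one hidden unit with a stored value."""
    h = np.maximum(x @ W1, 0.0)
    if patch is not None:
        idx, value = patch
        h = h.copy()
        h[idx] = value
    return float(h @ W2), h

x_clean = rng.normal(size=3)
x_corrupt = rng.normal(size=3)

m_clean, h_clean = forward(x_clean)  # cache clean activations
m_corrupt, _ = forward(x_corrupt)    # baseline on the corrupted input

# Patch each hidden unit from the clean run into the corrupted run, one at a time.
effects = [forward(x_corrupt, patch=(i, h_clean[i]))[0] - m_corrupt for i in range(4)]
```

Each entry of `effects` is the change in the metric caused by restoring one site. In this toy the per-site effects add up to the full clean-versus-corrupted gap only because the read-out is linear in the hidden units; in a deep network the effects generally interact.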
Attribution patching, introduced by Neel Nanda and used in the EAP paper, replaces the explicit copy-and-rerun step with a first-order Taylor expansion: the gradient of the output metric with respect to an activation, multiplied by the difference between the clean and corrupted activations at that site, approximates the patch's effect.[5] The approximation is exact when the function from the activation to the metric is linear and degrades smoothly when it is not. The trade-off is well documented: attribution patching is one to two orders of magnitude faster than activation patching but can miss edges whose effect is highly non-linear, which is why ACDC and EAP are often run together for verification.[4][5]
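Written out, and taking the clean run as the expansion point (an assumption for this sketch; the symbols M for the metric and a for the activation at the patched site are introduced here for illustration), the approximation reads:

```latex
\Delta M_{\text{patch}}
  \;=\; M(a_{\text{corrupt}}) - M(a_{\text{clean}})
  \;\approx\; (a_{\text{corrupt}} - a_{\text{clean}})^{\top}
      \left.\frac{\partial M}{\partial a}\right|_{a = a_{\text{clean}}}

% Exactness in the linear case: if M(a) = c^{\top} a + b, then
% \partial M / \partial a = c everywhere, and
M(a_{\text{corrupt}}) - M(a_{\text{clean}})
  \;=\; c^{\top}(a_{\text{corrupt}} - a_{\text{clean}})
```

The neglected terms are second order in the activation difference, which is why the estimate is least reliable at sites where clean and corrupted activations are far apart or where the downstream computation saturates.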
Several variants of activation patching exist and are used selectively in circuit-discovery work. "Resample ablation" replaces an activation with its value on a randomly sampled input from the corrupted distribution, "zero ablation" replaces it with zero, and "mean ablation" replaces it with the mean across the clean distribution; results can differ depending on the choice, in part because zero is not a generic point on the model's activation manifold.[4] Path patching, used heavily in the IOI study, patches a specific path through the computational graph rather than a single site, by patching the source of one component's input while letting the rest of the model run on the clean input; this isolates the contribution of one edge at a time.[3] Subspace patching restricts the intervention to a particular direction in residual-stream space and underlies the refusal-direction literature.[12] Tooling for these interventions is provided by libraries such as TransformerLens, which exposes hooks at every internal site of a transformer and is the de facto standard implementation used in ACDC, EAP, and many follow-up papers.[4][5]
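The ablation variants differ only in what value is written into the site. The helper below is a hypothetical sketch, not an API from any library: `reference` is assumed to be an array of cached activations at the same site, one row per example, drawn from whichever distribution the experimenter has chosen.

```python
import numpy as np

def ablate(activation, kind, reference=None, rng=None):
    """Return the replacement value for one activation vector under the named ablation."""
    if kind == "zero":
        return np.zeros_like(activation)
    if kind == "mean":
        return reference.mean(axis=0)                   # average over the reference examples
    if kind == "resample":
        return reference[rng.integers(len(reference))]  # one randomly chosen reference example
    raise ValueError(f"unknown ablation kind: {kind}")

rng = np.random.default_rng(0)
activation = rng.normal(size=6)
reference = rng.normal(loc=2.0, size=(50, 6))  # hypothetical cached activations

zeroed = ablate(activation, "zero")
averaged = ablate(activation, "mean", reference=reference)
resampled = ablate(activation, "resample", reference=reference, rng=rng)
```

In this example the reference activations are centered near 2, so the zero vector lies well outside the range the downstream computation normally sees, while the mean and resampled values stay inside it.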
The arrival of sparse autoencoder training pipelines that scale to frontier models shifted circuit discovery from the neuron basis to the feature basis.[8][10] Anthropic's "Towards Monosemanticity," published in October 2023, demonstrated that sparse autoencoders trained on a one-layer transformer extract approximately monosemantic features that are more interpretable than the underlying neurons.[8] "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet," by Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, and colleagues, was published on 21 May 2024 and applied the technique to a production Claude 3 Sonnet model, finding millions of features including ones for cities, programming bugs, gender bias, sycophancy, and deception.[10] These features became the units for the next generation of circuit work.
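For readers unfamiliar with the architecture, the forward pass of a sparse autoencoder can be sketched as follows. This is a generic ReLU autoencoder with an L1 sparsity penalty and untrained random weights; the sizes, the penalty coefficient, and the exact placement of the decoder bias are assumptions for illustration and are not taken from either paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 8, 32  # hypothetical sizes; real dictionaries are far larger

W_enc = rng.normal(size=(d_model, n_features)) / np.sqrt(d_model)
b_enc = np.zeros(n_features)
W_dec = rng.normal(size=(n_features, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)  # unit-norm dictionary directions
b_dec = np.zeros(d_model)

def sae_forward(x, l1_coeff=1e-3):
    """Encode activations into non-negative features, decode, and compute the training loss."""
    f = np.maximum((x - b_dec) @ W_enc + b_enc, 0.0)  # feature activations
    x_hat = f @ W_dec + b_dec                         # reconstruction
    mse = float(np.mean((x - x_hat) ** 2))
    l1 = float(np.abs(f).sum(axis=-1).mean())         # sparsity penalty
    l0 = float((f > 0).sum(axis=-1).mean())           # mean number of active features per input
    return f, x_hat, mse + l1_coeff * l1, l0

x = rng.normal(size=(16, d_model))  # stand-in for residual-stream activations
features, recon, loss, l0 = sae_forward(x)
```

After training, each row of `W_dec` is a candidate feature direction and each column of `features` is that feature's activation, which is what later circuit methods treat as a node.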
The Marks et al. paper described above is the first systematic SAE-based circuit-discovery method.[6] It treats SAE features as nodes, attention patterns as a separate class of nodes, and uses attribution patching to identify the sparsest set of feature-to-feature and feature-to-attention edges that account for a target metric.[6] The technique has been used to study spurious-correlation features in classifiers and to attribute Hindi-language outputs in a multilingual model.[6]
Anthropic's interpretability team released two companion papers on 27 March 2025: "Circuit Tracing: Revealing Computational Graphs in Language Models," describing the methodology, and "On the Biology of a Large Language Model," describing case studies on Claude 3.5 Haiku.[7][11] Jack Lindsey, Wes Gurnee, Emmanuel Ameisen, Brian Chen, Adam Pearce, Nicholas L. Turner, Craig Citro, and additional Anthropic contributors authored the work, with Joshua Batson as the corresponding author for the methods paper.[7][11] The pipeline trains a cross-layer transcoder (CLT) that replaces the MLP blocks in the model with roughly 30 million interpretable features, then builds a "local replacement model" that combines the transcoder features with the attention patterns taken from the original model's forward pass and adds error terms to account for whatever the transcoder fails to reconstruct; for any given prompt the team computes an attribution graph of direct linear attributions between features.[7] Pruning the graph keeps roughly 10 percent of nodes and edges while preserving about 80 percent of the explanatory value, and intervention experiments validate the discovered mechanisms by perturbing features and checking whether the model's behavior changes as predicted.[7]
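The pruning idea, keeping a small set of nodes that accounts for most of the graph's explanatory value, can be sketched as a cumulative-threshold selection. This is a hypothetical simplification: it assumes each node already has a scalar influence score, and it does not reproduce how the paper computes influence or handles edges.

```python
def prune_by_influence(influence, keep_fraction=0.8):
    """Keep the highest-influence nodes until they account for keep_fraction of the total."""
    total = sum(influence.values())
    kept, running = [], 0.0
    for node, score in sorted(influence.items(), key=lambda kv: kv[1], reverse=True):
        if running >= keep_fraction * total:
            break
        kept.append(node)
        running += score
    return kept

# Hypothetical influence scores for four nodes of an attribution graph.
example_influence = {"a": 5.0, "b": 3.0, "c": 1.0, "d": 1.0}
kept_nodes = prune_by_influence(example_influence, keep_fraction=0.8)
```

Because influence in real graphs tends to be concentrated in a few nodes, a threshold of this kind can discard most of the graph while retaining most of the total, which is what makes the pruned graphs small enough for a person to read.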
The biology paper applied this machinery to a dozen behaviors in Claude 3.5 Haiku, including multi-step reasoning in which the model computes intermediate quantities "in its head," forward planning in poetry composition (the model selects rhyming words before generating the line that leads up to them), multilingual abstraction in which concepts share a representation across English, French, and Chinese before being routed to a language-specific output, and metacognitive limitations in which chain-of-thought explanations diverge from the underlying computation.[11] Lindsey and colleagues estimated that the attribution-graph methodology yields useful mechanistic hypotheses on roughly a quarter of the prompts they investigated, framing the work as a source of testable hypotheses rather than definitive explanations.[11]
Circuit discovery has moved from foundational case studies toward safety-relevant behaviors. Marks et al. used sparse feature circuits to identify and ablate features driving bias in a profession-classifier setup, improving out-of-distribution generalization by removing features tied to gender rather than to the task.[6] Anthropic's biology paper documented circuits implicated in refusal, jailbreak susceptibility, hidden-goal pursuit, and harmful-content detection, and showed that targeted feature interventions can steer or suppress the corresponding behaviors.[11]
A separate line of work on the refusal direction showed that refusal behavior in chat-tuned models is mediated by a low-dimensional subspace that can be identified and ablated through difference-in-means probes, an instance in which a "circuit" reduces to a single direction in residual-stream space.[12] The Marks et al. and Anthropic results are continuous with that finding: they extend the same intervention logic from one direction to a structured graph of features.[6][11]
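The difference-in-means construction and the corresponding ablation are short enough to sketch directly. The data below are synthetic: two clouds of activations that differ along one planted axis stand in for residual-stream activations on prompts the model refuses and prompts it answers. The sizes and the planted offset are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
planted = np.zeros(d)
planted[0] = 1.0  # hypothetical axis along which the two prompt sets differ

# Synthetic stand-ins for activations on refused and answered prompts.
refused = rng.normal(size=(200, d)) + 3.0 * planted
answered = rng.normal(size=(200, d))

# Difference-in-means direction, normalized to unit length.
direction = refused.mean(axis=0) - answered.mean(axis=0)
direction /= np.linalg.norm(direction)

def ablate_direction(x, r):
    """Remove the component of each activation vector along the unit vector r."""
    return x - np.outer(x @ r, r)

ablated = ablate_direction(refused, direction)
```

After ablation the activations have no component left along the estimated direction, while everything orthogonal to it is untouched. This is the sense in which the intervention targets a single direction instead of a set of heads or neurons.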
Sparse feature circuits and attribution graphs are also being used to study sycophancy, deception features identified during the Scaling Monosemanticity run, and "addition" or "arithmetic" circuits in which the model represents intermediate sums as features.[10][11] In each case the goal is not just understanding but actionable intervention: identify the circuit, perturb it, and observe whether the externally measurable behavior changes in the predicted direction.
The Scaling Monosemanticity paper showed concrete examples of safety-relevant features at the sparse autoencoder level before any explicit circuit had been mapped. Templeton and colleagues documented features that activated on descriptions of unsafe code, sycophantic praise, dangerous biological agents, and references to deceptive behavior, and they verified causality by clamping the feature high or low and observing changes in model output.[10] This single-feature evidence motivated the attribution-graph follow-up: if individual features carry safety-relevant signals, the next question is how those features feed into one another to produce the externally observed behavior, which is what a circuit answers.[11]
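Clamping a feature can be sketched as adding a multiple of that feature's decoder direction to the activation. This is one plausible way to implement the intervention, stated here as an assumption and not as the paper's code: the shift moves the activation as if the chosen feature had the clamped value in the decoder's reconstruction, and it leaves the autoencoder's reconstruction error in place.

```python
import numpy as np

def clamp_feature(x, f, W_dec, idx, value):
    """Shift activation x along feature idx's decoder direction so its contribution matches `value`."""
    return x + (value - f[idx]) * W_dec[idx]

rng = np.random.default_rng(0)
n_features, d_model = 6, 4                # hypothetical toy sizes
W_dec = rng.normal(size=(n_features, d_model))
f = np.abs(rng.normal(size=n_features))   # stand-in feature activations
x = f @ W_dec + 0.1 * rng.normal(size=d_model)  # activation = reconstruction + small error

steered = clamp_feature(x, f, W_dec, idx=2, value=10.0)
```

The modified activation is then passed on through the remaining layers of the model, and the change in the output is compared with what the feature's interpretation predicts.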
Anthropic's biology paper documented one concrete safety case: a circuit in Claude 3.5 Haiku that responded to attempts to elicit instructions for synthesizing dangerous compounds.[11] By tracing the attribution graph, the team identified features that recognized the request, features that activated refusal, and features that mediated the model's hedging language; intervening on these features changed the model's behavior in the predicted direction, providing a worked example of how circuit-level understanding can support targeted safety interventions.[11]
Outside Anthropic, sparse feature circuits and EAP have been used by independent researchers and academic labs to study tasks ranging from gender-pronoun resolution to syntactic agreement to in-context arithmetic. The David Bau group at Northeastern, which co-authored the sparse feature circuits paper, has applied the technique to spurious-feature ablation in toxicity classifiers and to attributing language identity in multilingual generation.[6] The pattern of work is consistent: discover the feature dictionary, run attribution-patching across feature edges, prune to a sparse circuit, and intervene to confirm causality.[6]
Several open problems define the current research frontier.
Faithfulness and completeness. A circuit is faithful if running the model with only the circuit's edges reproduces the behavior, and complete if the rest of the network is irrelevant once the circuit is fixed.[3] Wang et al.'s IOI work used both criteria to evaluate the manual circuit, and Conmy et al. carried them over to ACDC.[3][4] In practice, discovered circuits often achieve high faithfulness on the curated dataset that elicits the behavior but degrade off-distribution, and the question of whether a circuit is "the" mechanism or merely "a sufficient" mechanism remains contested.[4][6]
Scaling. ACDC's exhaustive activation patching does not scale to frontier models with hundreds of billions of parameters.[4] EAP improves the per-edge cost but still requires constructing the full edge graph.[5] SAE-based pipelines push further by working at the level of interpretable features rather than raw activations, and the Anthropic attribution-graph pipeline scales to Claude 3.5 Haiku by training a 30-million-feature cross-layer transcoder and aggressively pruning the resulting graph.[7] Whether these techniques will continue to scale to larger models, and how much human effort each discovered circuit still requires, is an active question.[7]
Polysemanticity in the patching basis. When the basis of patching is raw attention heads or neurons, superposition means that a single unit can carry several unrelated features, so an edge that looks important may carry several behaviors at once.[2][8] SAE features mitigate this problem but introduce their own approximations: any feature missed by the autoencoder is invisible to the circuit, and error terms in cross-layer transcoders can absorb meaningful structure.[7][10]
Discovery versus editing. Marks et al.'s SHIFT experiments and Anthropic's perturbation studies argue that circuit discovery is most valuable when paired with downstream editing or steering, but the field still lacks general-purpose methods for using discovered circuits to robustly modify a deployed model.[6][11]
Universality. Olah and colleagues' 2020 claim that analogous features and circuits recur across models has been partially supported by induction-head studies and by SAE features that transfer across model sizes, but the question of whether circuits in Claude 3.5 Haiku, GPT-4, and Llama 3 implement the "same" indirect-object-identification mechanism remains open and is one motivation for ongoing cross-model circuit benchmarks.[1][9][10]
Behavior versus mechanism. A circuit that recovers a target metric on a curated dataset has not necessarily captured the mechanism the model uses across the full distribution of inputs. The IOI authors documented this gap by observing backup name-mover heads, which take over when primary name-mover heads are ablated: any "minimal" circuit that excludes backup heads is incomplete in a precise sense, because the model retains the behavior after the primary heads are removed.[3] More generally, circuits with high faithfulness on a narrow dataset can still leave open whether the same components participate in the same way on different but related inputs, and whether the components have other roles that the circuit description omits.[4][6]
Composition with feature dictionaries. SAE-based circuits inherit the feature dictionary's coverage and error structure. Cross-layer transcoders in particular introduce error nodes that absorb whatever the transcoder cannot represent, and large reconstruction errors can hide load-bearing computations.[7] Templeton et al. discuss this trade-off as the "tightness" of the SAE: tighter SAEs (with higher L0 norm, that is, more active features per token) reconstruct activations more faithfully but produce circuits whose feature counts are harder to inspect, while looser SAEs are easier to read but skip more of the computation.[10]