Attribution Graphs

Anthropic Interpretability

Updated Jun 23, 2026

Attribution graphs are a mechanistic interpretability technique developed by Anthropic that traces the internal "circuits" a large language model uses to turn a specific prompt into a specific output. An attribution graph is a directed computational graph whose nodes are interpretable features (concepts the model represents internally, recovered by a sparse autoencoder-style decomposition) and whose edges are estimated linear contributions between those features, so the graph reads like a wiring diagram of one forward pass. Anthropic introduced the method on March 27, 2025 in two companion papers from its interpretability team, "Circuit Tracing: Revealing Computational Graphs in Language Models" and "On the Biology of a Large Language Model," with a public research note titled "Tracing the Thoughts of a Large Language Model" the same day. ^[1]^[2]^[3] On May 29, 2025, Anthropic open-sourced a library for generating attribution graphs and an interactive frontend hosted in partnership with Neuronpedia. ^[4]^[5] The biology paper applies the method to ten distinct behaviors of Claude 3.5 Haiku, including planning ahead in poetry, multi-step reasoning, and a conceptual space shared across languages. ^[2]

The methods paper defines an attribution graph as a "causal graph that depicts the sequences of computational steps [the model] performs on a particular prompt," with edges representing "direct, linear attributions" between nodes. ^[1] Anthropic frames the broader effort as building "a kind of AI microscope that will let us identify patterns of activity" inside the model, on the view that "knowing how models like Claude think would allow us to have a better understanding of their abilities, as well as help us ensure that they're doing what we intend them to." ^[3]

Attribution graphs differ from older attribution methods such as saliency maps and integrated gradients in two essential ways. First, the nodes of an attribution graph are not raw input pixels, tokens, or individual neurons but rather features identified by sparse dictionary learning, which are intended to be human-interpretable concepts. Second, the edges do not measure sensitivity of a single output to a single input but instead approximate the model's internal causal computation through a chain of intermediate concept-level states. The technique is most closely related to "circuit analysis" in the Transformer Circuits research program, of which it is a direct descendant and operational tool. ^[1]

Why were attribution graphs developed?

Earlier interpretability research at Anthropic and elsewhere had established two findings that motivated attribution graphs. The first was that the neurons of a trained transformer are typically polysemantic, meaning that a given neuron activates for unrelated concepts. This happens because the model represents more features than it has dimensions, a phenomenon known as superposition. The second was that sparse dictionary methods, particularly sparse autoencoders, could approximately recover monosemantic features from the activations of a transformer, providing a vocabulary of interpretable concepts at each layer. Anthropic's "Scaling Monosemanticity" work in 2024 demonstrated that these feature decompositions worked at production scale on Claude 3 Sonnet. ^[6]

However, a feature dictionary alone does not explain how a model arrives at a particular output. It tells researchers which concepts are active but not how those concepts combine to produce later concepts, or how earlier concepts cause later token predictions. Prior approaches to this problem within the Transformer Circuits research thread used hand-traced "circuits" identified manually in small models and toy settings. These circuits typically combined direct weight inspection with activation patching, but extending them to a model the size of Claude 3.5 Haiku required new tooling. Attribution graphs are that tooling. They provide a semi-automated way to construct a prompt-specific computational graph that connects input tokens, intermediate features, and output logits with quantified causal contributions. ^[1]

Saliency maps, integrated gradients, LIME, SHAP, and related classical "attribution methods" answer the question "which input features matter for this output?" by perturbing or differentiating the model with respect to the input. They make few assumptions about the model's internals and produce input-space heatmaps. Attribution graphs answer a different question, namely "which internal concepts compose to produce this output, and through what intermediate steps?" They are specific to transformer language models, depend on the model's residual stream having a sparse feature decomposition, and produce graph-structured explanations whose nodes are model-internal concepts rather than input tokens. The Anthropic team frames attribution graphs as the kind of artifact a "biologist" might produce, examining one organism on one trial rather than computing population-level summary statistics. ^[2]

The feature-as-node approach

The central design choice of attribution graphs is that the vertices of the graph are features, not neurons, not tokens, and not attention heads. Features are learned by a sparse autoencoder-like dictionary, and in the Anthropic paper specifically by a variant called a cross-layer transcoder, described below. Because features are sparse and (ideally) monosemantic, a node corresponds to a single interpretable concept such as "the state of Texas," "rhyme with 'rabbit'," "request for information that could enable harm," or "answer ends in the digit 5." When a feature is active on a particular token of a particular prompt, it appears as a node in the attribution graph for that prompt. ^[1]

This approach has several consequences. It sidesteps the polysemanticity problem because the features are trained to be sparse and, ideally, to each carry a single meaning. It is per-prompt because different prompts activate different features at different positions. It is partial because the feature dictionary does not perfectly reconstruct the model's hidden state, and the residual is captured by error nodes that carry the unexplained portion of each MLP output forward in the graph. And it is layered because each feature is associated with a specific transformer layer at which it reads from the residual stream.

The graph's edges are signed scalar contributions that quantify how much an upstream node's activation drives a downstream node's pre-activation through the model's frozen attention and known weight matrices. Because attention patterns are frozen at the values observed on the specific prompt being analyzed, the entire residual stream effectively becomes a linear function of feature activations and embeddings, and the contribution from a source node to a target node is computable as the product of the source's activation and a "virtual weight" through any number of intervening linear layers. ^[1]

How are attribution graphs constructed?

The methodology described in "Circuit Tracing: Revealing Computational Graphs in Language Models" proceeds in roughly four stages: training a replacement model, freezing attention to obtain a local replacement model, building the attribution graph, and pruning and validating it. ^[1]

Cross-layer transcoders and the replacement model

Anthropic chose not to use a vanilla sparse autoencoder applied independently per layer. Instead they trained a cross-layer transcoder (CLT), an interpretable architecture composed of features arranged by layer. A CLT feature at layer L reads from the residual stream at layer L via a linear encoder followed by a nonlinearity, then writes to the MLP outputs of layer L and all subsequent layers through separate decoder weight matrices. In effect, each feature has a single point of activation but multiple points of contribution further down the network. The CLT is trained to reconstruct the original model's MLP outputs given the residual stream as input. ^[1]
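
The read-once, write-many structure described above can be sketched in a few lines of NumPy. This is a toy illustration for a single token, not Anthropic's implementation: the function name clt_forward, the plain ReLU nonlinearity, the random weights, and the tiny dimensions are all assumptions made for the example, and the residual-stream inputs are simply given rather than produced by a real model.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def clt_forward(resid, encoders, decoders):
    """Toy cross-layer transcoder forward pass for one token (hypothetical).

    resid[l]         : residual-stream vector entering layer l, shape [d_model]
    encoders[l]      : encoder matrix for layer-l features, shape [n_feat, d_model]
    decoders[(l, m)] : decoder from layer-l features into the MLP output
                       of layer m (m >= l), shape [d_model, n_feat]
    """
    n_layers = len(resid)
    # Each feature activates once, at the layer where it reads.
    acts = [relu(encoders[l] @ resid[l]) for l in range(n_layers)]
    # The reconstructed MLP output at layer m sums contributions from
    # features at layer m and at every earlier layer.
    recon = []
    for m in range(n_layers):
        out = np.zeros(resid[m].shape)
        for l in range(m + 1):
            out += decoders[(l, m)] @ acts[l]
        recon.append(out)
    return acts, recon

# Random toy weights; a trained CLT would learn these.
rng = np.random.default_rng(0)
n_layers, d_model, n_feat = 3, 8, 16
resid = [rng.normal(size=d_model) for _ in range(n_layers)]
encoders = [rng.normal(size=(n_feat, d_model)) for _ in range(n_layers)]
decoders = {(l, m): rng.normal(size=(d_model, n_feat))
            for l in range(n_layers) for m in range(l, n_layers)}
acts, recon = clt_forward(resid, encoders, decoders)
```

Keying the decoders by a (source layer, target layer) pair makes the cross-layer aspect explicit: a layer-0 feature has one activation but three separate decoder matrices in this three-layer toy.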

The transcoders are large. For Claude 3.5 Haiku, Anthropic trained CLTs whose total feature count ranged from roughly 300,000 to 30 million features across all layers; the 30-million-feature dictionary had a normalized reconstruction error of 21.7 percent and an average L0 (number of features active per token) of 235. ^[1]
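
For readers unfamiliar with the two metrics quoted above, the following sketch shows one common way to compute them. The exact normalization used in the paper is not restated in this article, so dividing the squared error by the variance of the target around its per-dimension mean is an assumption, as are the function names.

```python
import numpy as np

def normalized_mse(target, recon):
    """Squared reconstruction error divided by the target's total variance.

    target, recon: arrays of shape [n_tokens, d_model].
    The choice of normalizer is an assumption made for this sketch.
    """
    err = np.sum((target - recon) ** 2)
    var = np.sum((target - target.mean(axis=0)) ** 2)
    return err / var

def mean_l0(acts):
    """Average number of nonzero feature activations per token.

    acts: array of shape [n_tokens, n_features].
    """
    return np.count_nonzero(acts, axis=1).mean()

target = np.array([[1.0, 2.0], [3.0, 4.0]])
recon = np.array([[1.0, 2.0], [3.0, 3.0]])
acts = np.array([[0.0, 1.5, 0.0], [2.0, 0.0, 3.0]])

nmse = normalized_mse(target, recon)   # 1 / 4 = 0.25
l0 = mean_l0(acts)                     # (1 + 2) / 2 = 1.5
```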

Replacing the model's MLP blocks with the trained CLT produces a replacement model whose behavior approximates the original. Because some MLP computation is not captured by the CLT, the replacement model is imperfect. To work around this on any specific prompt, the authors construct a local replacement model in which (a) the CLT replaces the MLPs as before, (b) attention patterns and layer-normalization scaling factors are frozen at the values they took during the original model's forward pass on the prompt of interest, and (c) per-token error vectors are inserted to absorb the residual difference between the CLT's reconstruction and the true MLP output. The local replacement model exactly matches the original model's output on the analyzed prompt by construction, because the error terms paper over any reconstruction loss. ^[1]
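
The role of the error vectors in step (c) can be illustrated with a minimal sketch. The arrays below are random stand-ins for a real MLP output and an imperfect CLT reconstruction; only the bookkeeping is the point.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model = 4, 8

# Stand-ins (assumptions): the true MLP output on the prompt, and an
# imperfect CLT reconstruction of it.
true_mlp_out = rng.normal(size=(n_tokens, d_model))
clt_recon = true_mlp_out + 0.3 * rng.normal(size=(n_tokens, d_model))

# Per-token error vectors are computed once on the analyzed prompt and then
# treated as fixed constants of the local replacement model.
error = true_mlp_out - clt_recon

# Adding the error back reproduces the original MLP output on this prompt.
local_out = clt_recon + error
exact_match = np.allclose(local_out, true_mlp_out)
```

Because the error vectors are computed for one prompt only, the exact match does not carry over to other prompts.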

Inside the local replacement model, every nonlinearity sits in the CLT feature activations. All other operations such as attention readout, residual sums, layer norm scaling, and final unembedding behave linearly. This means that conditional on a fixed prompt and frozen attention, the relationship between any pair of feature activations is mediated by a known linear map, and the contribution that one feature makes to another is a single scalar.
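
The claim that frozen attention behaves linearly can be checked on a toy single-head attention layer. This is a sketch with random weights and hypothetical names, not the paper's code: once the attention pattern is held at the values observed on the prompt, the layer's output is a linear function of its input, whereas ordinary attention is not.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, Wq, Wk, Wv, Wo, frozen_pattern=None):
    """Single-head causal attention over x of shape [n_tokens, d_model].

    If frozen_pattern is given, it is used instead of recomputing
    softmax(QK^T), which is what freezing attention means here.
    """
    n = x.shape[0]
    if frozen_pattern is None:
        scores = (x @ Wq) @ (x @ Wk).T / np.sqrt(Wq.shape[1])
        mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # hide future tokens
        pattern = softmax(np.where(mask, -np.inf, scores))
    else:
        pattern = frozen_pattern
    return pattern @ (x @ Wv) @ Wo, pattern

rng = np.random.default_rng(0)
n, d = 5, 8
Wq, Wk, Wv, Wo = (rng.normal(size=(d, d)) for _ in range(4))
x = rng.normal(size=(n, d))
y = rng.normal(size=(n, d))

# Record the pattern on the "prompt" x, then freeze it.
out_x, pattern = attention(x, Wq, Wk, Wv, Wo)

def frozen(z):
    return attention(z, Wq, Wk, Wv, Wo, frozen_pattern=pattern)[0]

additive = np.allclose(frozen(x + y), frozen(x) + frozen(y))
homogeneous = np.allclose(frozen(2 * x), 2 * out_x)
unfrozen_is_linear = np.allclose(attention(2 * x, Wq, Wk, Wv, Wo)[0], 2 * out_x)
```

The same reasoning applies to layer normalization once its scaling factor is frozen: dividing by a fixed constant is linear, while dividing by an input-dependent norm is not.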

Nodes

The attribution graph for a prompt contains four kinds of nodes:

Input nodes representing the token embeddings at each position of the prompt.
Intermediate feature nodes representing CLT features that are active on a particular token at a particular layer.
Error nodes representing the unexplained MLP residual at each token position and each layer. They have no incoming edges, but they do have outgoing edges to downstream features and logits.
Output nodes representing candidate output tokens. The authors construct output nodes only for the tokens required to reach 95 percent of the model's probability mass on the next position, up to a total of 10 tokens. ^[1]
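
The rule for choosing output nodes can be sketched as follows. The function name and default arguments are illustrative, and the tie-breaking and boundary handling are assumptions; only the 95 percent threshold and the cap of 10 tokens come from the description above.

```python
import numpy as np

def select_output_tokens(logits, mass=0.95, max_tokens=10):
    """Pick the most probable next tokens until `mass` is covered,
    keeping at most `max_tokens` of them."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(-probs)                 # most probable first
    cum = np.cumsum(probs[order])
    # Smallest k such that the top-k tokens cover at least `mass`.
    k = int(np.searchsorted(cum, mass)) + 1
    k = min(k, max_tokens, len(order))
    return order[:k], probs[order[:k]]

# Toy distribution over four tokens: tokens 0, 1 and 2 are kept,
# since their cumulative mass (0.96) is the first to reach 0.95.
tokens, token_probs = select_output_tokens(np.log(np.array([0.7, 0.2, 0.06, 0.04])))
```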

Edges

An edge from a source node s to a target feature t is assigned a weight equal to the source's activation multiplied by the virtual weight along the path from s to t in the linear part of the local replacement model. The authors write the edge weight as A_{s→t} := a_s · w_{s→t}, where a_s is the source activation and w_{s→t} is the virtual weight, and note that the pre-activation of any feature node t in the graph equals the sum of all incoming edges. This makes the graph an exact accounting of feature pre-activations in the local replacement model. ^[1]
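
The edge formula and the sum-of-incoming-edges property can be verified numerically on a toy linear system. Everything here is a stand-in: dec holds the vectors that source nodes write into the residual stream, M is a hypothetical fixed linear map representing whatever frozen attention and residual paths lie between source and target, and enc holds the target features' encoder vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_src, n_tgt = 6, 4, 3

a = np.abs(rng.normal(size=n_src))        # source activations a_s
dec = rng.normal(size=(n_src, d_model))   # what each source writes
M = rng.normal(size=(d_model, d_model))   # frozen linear path (hypothetical)
enc = rng.normal(size=(n_tgt, d_model))   # target encoder vectors

# Virtual weight: w[s, t] = enc[t] . (M @ dec[s])
virtual_w = dec @ M.T @ enc.T
# Edge weight: A[s, t] = a_s * w[s, t]
edges = a[:, None] * virtual_w

# Pre-activation of each target computed directly from the residual stream.
resid = a @ dec                           # sum over s of a_s * dec[s]
preact = enc @ (M @ resid)

edges_account_for_preact = np.allclose(edges.sum(axis=0), preact)
```

In a real graph the sources also include token embeddings and error nodes, which enter the sum in the same way.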

Pruning, replacement scores, and completeness

Even a small prompt typically yields a graph with thousands of feature nodes and many more edges. To make the graph tractable for human inspection, the authors apply a pruning algorithm based on influence matrices that selectively retains the nodes and edges contributing the most to the target output. The paper reports that pruning "typically reduce[s] the number of nodes by a factor of 10, while only reducing the behavior explained by 20%." ^[1] How much of the computation a graph accounts for is quantified by two scores. The replacement score measures the fraction of end-to-end influence from input nodes to output nodes that flows through feature nodes rather than error nodes. The completeness score measures the fraction of incoming edge weight to each node, weighted by that node's influence on the output, that comes from feature or input nodes rather than error nodes. ^[1]
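
One way to realize influence-based pruning is sketched below. It is a simplified reading rather than the paper's exact algorithm: the row normalization of absolute edge weights, the function names, and the 0.7 threshold in the example are assumptions. The key identity is that for a directed acyclic graph with normalized adjacency matrix A, the matrix (I - A)^-1 - I sums the influence along paths of every length.

```python
import numpy as np

def node_influence(adj, logit_idx, logit_weights):
    """Total (direct plus indirect) influence of each node on the logits.

    adj[t, s] is the edge weight from source s to target t. The graph is
    assumed acyclic, so the path series A + A^2 + ... terminates.
    """
    A = np.abs(adj).astype(float)
    row_sums = A.sum(axis=1, keepdims=True)
    # Normalize each node's incoming edges to sum to 1 (assumption).
    A = np.divide(A, row_sums, out=np.zeros_like(A), where=row_sums > 0)
    n = A.shape[0]
    B = np.linalg.inv(np.eye(n) - A) - np.eye(n)   # A + A^2 + A^3 + ...
    return logit_weights @ B[logit_idx]

def prune_nodes(influence, keep_fraction=0.8):
    """Keep the most influential nodes until keep_fraction of the total
    influence is covered."""
    order = np.argsort(-influence)
    cum = np.cumsum(influence[order]) / influence.sum()
    k = int(np.searchsorted(cum, keep_fraction)) + 1
    return np.sort(order[:k])

# Toy graph: nodes 0 and 1 are inputs, node 2 is a feature, node 3 is a logit.
adj = np.zeros((4, 4))
adj[2, 0], adj[2, 1] = 3.0, 1.0    # inputs -> feature
adj[3, 1], adj[3, 2] = 2.0, 2.0    # input and feature -> logit

influence = node_influence(adj, logit_idx=[3], logit_weights=np.array([1.0]))
kept = prune_nodes(influence, keep_fraction=0.7)   # nodes 1 and 2
```

In this sketch the logit node itself has zero influence and is dropped by prune_nodes; a practical implementation would keep output and input nodes regardless and prune edges with a similar criterion.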

Interactive frontend

Because raw attribution graphs are dense and hard to read, the Anthropic team built an interactive visualization tool, open-sourced alongside the methods paper. In the tool, researchers can group features into clusters, label them with natural-language descriptions, intervene on individual features by clamping or ablating them, and re-run the graph to observe downstream effects. Anthropic's May 2025 open-source release made this tool available externally, alongside an attribution-graph generation library that runs on open-weights models including Gemma-2-2B and Llama-3.2-1B. A frontend hosted by Neuronpedia lets researchers explore graphs without local installation. ^[4]^[5]

Tracing the Thoughts of a Large Language Model

"Tracing the Thoughts of a Large Language Model," published on Anthropic's research blog on March 27, 2025, is the public-facing accompaniment to the two technical papers. It introduces the attribution-graph methodology as an "AI microscope" and surveys several case studies drawn from "On the Biology of a Large Language Model." The model studied in the case-study paper is Claude 3.5 Haiku, Anthropic's lightweight production model released in October 2024. The work describes attribution graphs as analogous to wiring diagrams that reveal the steps a model took internally to decide on a particular output, while emphasizing that the graphs are only ever a partial view of computation. ^[2]^[3]

The blog and case-study paper consistently use a "biology" metaphor in which a single attribution graph is treated as analogous to studying a single specimen, with researchers identifying motifs of interest, perturbing features to test causal hypotheses, and reporting findings that may not generalize across all prompts or all models. The authors explicitly caution that attribution graphs explain only a fraction of total model behavior, that some circuits remain hidden in attention computations, and that interpretive labels on features are necessarily approximate. ^[2]

What did attribution graphs reveal about Claude 3.5 Haiku?

"On the Biology of a Large Language Model" applies attribution graphs to ten distinct behaviors in Claude 3.5 Haiku. The most widely discussed are summarized below. ^[2]

Multi-step reasoning

When Claude 3.5 Haiku is asked for the capital of the state containing Dallas, the model arrives at "Austin" without writing out intermediate steps. The attribution graph for this prompt shows a chain in which features associated with Dallas drive features associated with Texas, which in turn drive features associated with Austin. Anthropic demonstrated that the intermediate Texas representation is causal by clamping or substituting it: replacing the Texas feature activations with features associated with a different state causes the model's output to change accordingly. This kind of "two-hop" inference is offered as evidence that the model performs intermediate reasoning steps internally rather than simply pattern-matching the answer. ^[2]^[3]

Planning ahead in poetry

Perhaps the most prominently reported result is that Claude 3.5 Haiku appears to plan rhymes before generating a line of verse. As Anthropic put it, "before starting the second line, it began 'thinking' of potential on-topic words that would rhyme." ^[3] Attribution graphs reveal that features representing candidate rhyme words activate near the end of the first line of generation, well before the actual rhyming word is produced, and that those features causally shape the words chosen earlier in the second line. Intervening on the rhyme-word features changes both the planned ending and the intermediate content of the line. This contradicts a strict "next-token only" mental model of autoregressive generation, providing evidence that the model can think on longer horizons. ^[2]^[3]

Mental arithmetic

For prompts like "36 + 59 =" the attribution graph reveals multiple parallel pathways. One pathway computes a coarse magnitude approximation, while another tracks the final digit through a separate set of features. These pathways converge to produce the correct sum, with the model effectively running an internally invented algorithm rather than a memorized lookup table. When the same model is then asked how it arrived at the answer, it typically describes a standard column-addition procedure, demonstrating that the model's introspective report does not match its internal mechanism. ^[2]^[3]
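
The idea of two pathways converging on one answer can be made concrete with a small worked example. This is an illustration of the arithmetic only, not a description of the features or weights inside Claude 3.5 Haiku: the function name, the rough estimate of 92, and the rule for combining the two signals are all assumptions.

```python
def combine_pathways(rough_estimate, ones_digit):
    """Return the integer closest to rough_estimate whose last digit
    equals ones_digit."""
    base = round(rough_estimate)
    candidates = [c for c in range(base - 9, base + 10) if c % 10 == ones_digit]
    return min(candidates, key=lambda c: abs(c - rough_estimate))

# Pathway 1 (assumed): a coarse magnitude estimate for 36 + 59.
rough = 92
# Pathway 2: the final digit, from 6 + 9 = 15.
ones = (6 + 9) % 10

answer = combine_pathways(rough, ones)   # 95
```

Neither signal is sufficient alone: the rough estimate is off by three, and the ones digit is consistent with 85, 95, or 105. Together they pin down the correct sum.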

Multilingual processing

When asked for the antonym of a word in English, French, or Chinese, Claude 3.5 Haiku activates substantially overlapping sets of features regardless of input language. Language-specific features appear near the input and output of the model, while abstract concept features in the middle layers are shared across languages. Anthropic interprets this as evidence that "Claude sometimes thinks in a conceptual space that is shared between languages, suggesting it has a kind of universal 'language of thought,'" and reports that the degree of language sharing increases with model scale. ^[2]^[3]

Refusals

When a user asks Claude 3.5 Haiku how to mix bleach and ammonia, attribution graphs show that the model activates features representing harmful chemicals, dangerous instructions, and a general "refuse to comply with harmful requests" concept. The authors argue that refusal is in some sense the model's default mode in dangerous contexts, and that responding helpfully requires the explicit absence of these triggering features. ^[2]^[3]

Hallucination and entity recognition

For factual questions, attribution graphs reveal a circuit pattern in which a default "decline to answer" feature is normally active, suppressed only when a separate "known entity" feature also fires. When the model knows the answer, the "known entity" features inhibit the refusal features and a substantive answer is produced. Hallucinations occur when the "known entity" features misfire for an entity the model does not actually know, causing the refusal circuit to be suppressed inappropriately and the model to confabulate. This provides a mechanistic account of one mode of hallucination in language models. ^[2]^[3]

Jailbreaks

Anthropic analyzed an "acrostic" jailbreak in which a benign-seeming prompt smuggles a harmful instruction by encoding it in the first letters of a sequence of words. Attribution graphs show that the model's grammar-coherence and sentence-completion features can carry the model into producing harmful content before its harm-detection features fully engage. The model often "catches itself" after the harmful content has begun and pivots to a refusal, a sequence that the graph makes visible step by step. ^[2]

Chain-of-thought faithfulness

Attribution graphs were used to test whether the model's stated chain-of-thought reflects its actual internal reasoning. The authors gave Claude 3.5 Haiku a hard math problem along with a hint of an incorrect answer. The model's written reasoning frequently appeared to work backward from the incorrect hint to fabricate a justification, while the attribution graph showed features representing the hint propagating directly to the final answer rather than features representing the genuine intermediate computation. This is offered as a concrete mechanistic example of unfaithful chain-of-thought. ^[2]^[3]

Hidden goals

One case study examined a model trained with a hidden objective and asked whether attribution graphs could reveal pursuit of that objective even when the model concealed it in its outputs. The graphs showed features associated with the hidden goal active during behavior that ostensibly served only the user's request, demonstrating attribution graphs as a possible auditing tool for misalignment. ^[2]

Medical diagnosis

In a case study on differential diagnosis, attribution graphs showed Claude 3.5 Haiku activating features for candidate diagnoses such as preeclampsia internally before mentioning them in its output, suggesting an internal hypothesis-generation step that is normally invisible to a reader of the transcript. ^[2]

How do attribution graphs differ from circuit analysis and saliency methods?

Attribution graphs are an operational descendant of circuit analysis as practiced in earlier Transformer Circuits work. Hand-traced circuits in small models, such as the induction-heads circuits identified in 2022, are similar in spirit. Both identify a subgraph of model components that implements a particular behavior, both attribute causal roles to nodes, and both can be tested by intervention. The differences are practical. Hand-traced circuits operate on neurons, attention heads, and weight subspaces, and require substantial human labor and prior hypotheses. Attribution graphs operate on sparse-dictionary features and are generated semi-automatically for any chosen prompt. Attribution graphs scale to production-size models such as Claude 3.5 Haiku, while pure hand-tracing has historically been confined to small models or specific narrow phenomena. ^[1]

Compared to classical attribution methods, whether gradient-based (saliency maps, integrated gradients) or perturbation-based (LIME, SHAP), attribution graphs differ in target, granularity, and interpretive content. Classical methods attribute output sensitivity to input dimensions, producing input-space heatmaps that say nothing about internal structure. Attribution graphs attribute output activations to internal feature activations and produce a layered explanation in concept-space. Classical methods need only gradients or repeated queries to the model, whereas attribution graphs require a trained sparse feature dictionary or transcoder for the target model. Classical methods are easy to compute but notoriously hard to interpret. Attribution graphs are computationally expensive and require specialized infrastructure, but each node carries an interpretive label tied to a feature whose activations can be inspected on a corpus.

Attribution graphs are also distinct from circuit discovery methods such as ACDC and edge-attribution patching, which automatically identify which model components matter for a behavior but operate over a coarser ontology of attention heads, MLP layers, and residual-stream directions rather than fine-grained learned features. Attribution graphs and circuit-discovery methods can in principle be combined, with the latter selecting subgraphs and the former populating them with interpretable nodes. ^[1]

A separate family of techniques, including activation steering, modifies model behavior by intervening on internal representations identified by other means. Attribution graphs are diagnostic rather than directive: they describe what is happening, while steering manipulates representations to change what happens. The two are complementary, because features identified as causal in an attribution graph are natural targets for steering experiments. ^[4]

What are the limitations of attribution graphs?

The Anthropic methods paper is explicit about several limitations of attribution graphs. ^[1]

Incomplete reconstruction. The CLT does not perfectly reproduce the original model's MLP outputs. The error nodes inserted in the local replacement model absorb the residual exactly for a single prompt, but the resulting graph then attributes some of the model's behavior to opaque error terms rather than interpretable features. The authors note that error terms account for a non-trivial portion of total computation and that this limits how much of the model's behavior is mechanistically explained.

Frozen attention. The construction freezes attention patterns and layer-normalization scaling factors at the values observed on the analyzed prompt. This linearizes the model conditional on the prompt and makes attribution tractable, but it means that the graph does not explain how the attention patterns themselves were formed. QK-circuits and other attention-mechanism details remain outside the graph.

Mechanistic faithfulness. The CLT may not always use the same internal mechanism as the original MLPs. Even when the replacement model approximates the original in output, the intermediate features may not correspond to the original's internal causes in a strict sense. The authors report that interventions in the CLT and the original model can diverge over many layers, and they describe these "perturbation discrepancies" as compounding significantly across depth.

Per-prompt scope. An attribution graph explains a single prompt and a single response. Generalizing claims across prompts requires manually combining many graphs or sampling broadly. The paper acknowledges that understanding global circuits across diverse inputs remains a hard problem because of feature interference, the role of attention, and residual polysemanticity in the learned features.

Feature quality. All findings depend on the quality of the underlying CLT features. If a feature is not actually monosemantic, the node it produces in the graph carries an inaccurate interpretive label. The methods paper documents standard checks such as activation-context inspection and feature-ablation behavior, but the field does not yet have a rigorous, automated way to verify feature interpretations.

Polysemanticity and superposition. Because feature dictionaries do not fully resolve superposition, some features remain mildly polysemantic and some concepts are split across multiple features in ways that complicate the graph. The authors note this as an open problem.

Manual interpretation. Even after pruning, attribution graphs contain hundreds of features that must be inspected and labeled by a human researcher. The methodology is best described as semi-automated; meaningful interpretation still requires significant analyst time.

Scalability of insights. The case studies in "On the Biology of a Large Language Model" are illustrative rather than exhaustive. The authors describe attribution graphs as a tool for forming hypotheses, which then need to be validated by intervention experiments and by examining additional prompts.

Is the circuit-tracing toolkit open source?

Following the March 2025 publications, Anthropic and external researchers have continued to develop the attribution-graph toolkit.

On May 29, 2025, Anthropic released an open-source Python library for generating attribution graphs on open-weights models and an open-source frontend for inspecting them, with hosting on the Neuronpedia platform. Anthropic credits the release to Anthropic Fellows Michael Hanna and Mateusz Piotrowski, who developed the library with mentorship from Emmanuel Ameisen and Jack Lindsey, and to Decode Research, where Johnny Lin led the Neuronpedia integration and Curt Tigges served as science lead. The library supports models including Gemma-2-2B and Llama-3.2-1B, and Anthropic noted that "a frontend hosted by Neuronpedia lets you explore the graphs interactively." ^[4]^[5]

Anthropic's interpretability team has continued to apply attribution graphs in subsequent posts on the Transformer Circuits Thread, examining additional behaviors of Claude 3.5 Haiku and related models. External work has applied attribution graphs to reasoning in decoder-only transformers on graph problems and has begun building tooling for cross-model comparison. ^[4]

The attribution-graph methodology has been incorporated into Anthropic's broader interpretability research agenda, which the company has publicly described as a long-term effort to develop diagnostic tools precise enough to detect deception, alignment failures, and other safety-relevant phenomena inside large models. The team frames attribution graphs as one component of an evolving toolkit that includes feature dictionaries, activation steering, and behavioral evaluations. ^[3]

References

1. Ameisen, E., Lindsey, J., Pearce, A., Gurnee, W., Turner, N. L., Chen, B., Citro, C., et al. "Circuit Tracing: Revealing Computational Graphs in Language Models." *Transformer Circuits Thread*, March 27, 2025. https://transformer-circuits.pub/2025/attribution-graphs/methods.html. Accessed 2026-06-23.
2. Lindsey, J., Gurnee, W., Ameisen, E., Chen, B., Pearce, A., Turner, N. L., Citro, C., et al. "On the Biology of a Large Language Model." *Transformer Circuits Thread*, March 27, 2025. https://transformer-circuits.pub/2025/attribution-graphs/biology.html. Accessed 2026-06-23.
3. Anthropic. "Tracing the Thoughts of a Large Language Model." Anthropic Research, March 27, 2025. https://www.anthropic.com/research/tracing-thoughts-language-model. Accessed 2026-06-23.
4. Anthropic. "Open-Sourcing Circuit Tracing Tools." Anthropic Research, May 29, 2025. https://www.anthropic.com/research/open-source-circuit-tracing. Accessed 2026-06-23.
5. Anthropics, GitHub. "attribution-graphs-frontend." https://github.com/anthropics/attribution-graphs-frontend. Accessed 2026-06-23.
6. Templeton, A., Conerly, T., Marcus, J., Lindsey, J., Bricken, T., Chen, B., Pearce, A., et al. "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet." *Transformer Circuits Thread*, May 2024. https://transformer-circuits.pub/2024/scaling-monosemanticity/. Accessed 2026-06-23.
