Attribution Graphs
Last reviewed
May 18, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,865 words
Attribution graphs are a [[mechanistic_interpretability|mechanistic interpretability]] technique developed by [[anthropic|Anthropic]] for tracing how internal model computations causally produce a specific output. An attribution graph is a directed computational graph in which nodes correspond to interpretable features extracted from a [[sparse_autoencoder|sparse autoencoder]]-style decomposition of a transformer's hidden state and edges correspond to estimated linear contributions between those features. The methodology was introduced publicly in March 2025 in two companion papers from Anthropic's interpretability team, "Circuit Tracing: Revealing Computational Graphs in Language Models" and "On the Biology of a Large Language Model," accompanied by a research note titled "Tracing the Thoughts of a Large Language Model." [^1][^2][^3] In May 2025, Anthropic open-sourced a library for generating attribution graphs along with an interactive frontend hosted in partnership with Neuronpedia. [^4][^5]
Attribution graphs differ from older attribution methods such as saliency maps and integrated gradients in two essential ways. First, the nodes of an attribution graph are not raw input pixels, tokens, or individual neurons but rather features identified by sparse dictionary learning, which are intended to be human-interpretable concepts. Second, the edges do not measure sensitivity of a single output to a single input but instead approximate the model's internal causal computation through a chain of intermediate concept-level states. The technique is most closely related to "circuit analysis" in the Transformer Circuits research program, of which it is a direct descendant and operational tool. [^1]
Earlier interpretability research at Anthropic and elsewhere had established two findings that motivated attribution graphs. The first was that the neurons of a trained transformer are typically polysemantic, meaning that a given neuron activates for unrelated concepts because the model packs many features into fewer dimensions, a phenomenon known as [[superposition]]. The second was that sparse dictionary methods, particularly sparse autoencoders, could approximately recover monosemantic features from the activations of a transformer, providing a vocabulary of interpretable concepts at each layer. Anthropic's "Scaling Monosemanticity" work in 2024 demonstrated that these feature decompositions worked at production scale on Claude 3 Sonnet. [^6]
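Superposition can be illustrated with a toy example. The sketch below is not drawn from any real model: it assumes three hypothetical feature directions (`texas`, `rhyme`, `digit_5`) packed into a two-dimensional activation space, spaced 120 degrees apart. Each coordinate ("neuron") then responds to more than one feature, while projecting onto the feature directions still picks out the active feature, with some interference.

```python
import math

# Three hypothetical unit-norm feature directions in a 2-D activation space,
# spaced 120 degrees apart (an illustrative choice, not taken from any model).
FEATURES = {
    name: (math.cos(math.radians(angle)), math.sin(math.radians(angle)))
    for name, angle in [("texas", 0.0), ("rhyme", 120.0), ("digit_5", 240.0)]
}

def activation(active):
    """Sum the directions of the active features into one 2-D activation."""
    x = sum(FEATURES[n][0] for n in active)
    y = sum(FEATURES[n][1] for n in active)
    return (x, y)

def readout(act):
    """Project an activation onto every feature direction."""
    return {n: act[0] * d[0] + act[1] * d[1] for n, d in FEATURES.items()}

# Neuron 0 (the x coordinate) has a non-zero response to all three features,
# which is what "polysemantic" means at the level of a single neuron.
neuron0 = {n: round(d[0], 3) for n, d in FEATURES.items()}

# With a single feature active, the readout along that feature's direction
# is the largest; the other two directions pick up interference.
scores = readout(activation(["rhyme"]))
best = max(scores, key=scores.get)
```

In this toy setting the readout only works cleanly because a single feature is active at a time, which is the sparsity assumption that dictionary methods rely on.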
However, a feature dictionary alone does not explain how a model arrives at a particular output. It tells researchers which concepts are active but not how those concepts combine to produce later concepts, or how earlier concepts cause later token predictions. Prior approaches to this problem within the Transformer Circuits research thread used hand-traced "circuits" identified manually in small models and toy settings. These circuits typically combined direct weight inspection with activation patching, but extending them to a model the size of Claude 3.5 Haiku required new tooling. Attribution graphs are that tooling. They provide a semi-automated way to construct a prompt-specific computational graph that connects input tokens, intermediate features, and output logits with quantified causal contributions. [^1]
Saliency maps, integrated gradients, LIME, SHAP, and related classical "attribution methods" answer the question "which input features matter for this output?" by perturbing or differentiating the model with respect to the input. They apply to broad classes of models (LIME and SHAP are fully model-agnostic, while gradient-based methods need only a differentiable model) and produce input-space heatmaps. Attribution graphs answer a different question, namely "which internal concepts compose to produce this output, and through what intermediate steps?" They are specific to transformer language models, depend on the model's residual stream having a sparse feature decomposition, and produce graph-structured explanations whose nodes are model-internal concepts rather than input tokens. The Anthropic team frames attribution graphs as the kind of artifact a "biologist" might produce, examining one organism on one trial rather than computing population-level summary statistics. [^2]
The central design choice of attribution graphs is that the vertices of the graph are features, not neurons, not tokens, and not attention heads. Features are learned by a [[sparse_autoencoder|sparse autoencoder]]-like dictionary, and in the Anthropic paper specifically by a variant called a cross-layer transcoder, described below. Because features are sparse and (ideally) monosemantic, a node corresponds to a single interpretable concept such as "the state of Texas," "rhyme with 'rabbit'," "request for information that could enable harm," or "answer ends in the digit 5." When a feature is active on a particular token of a particular prompt, it appears as a node in the attribution graph for that prompt. [^1]
This approach has several consequences. It mitigates the polysemanticity problem because the features are trained to be sparse and, ideally, to each carry a single meaning. It is per-prompt because different prompts activate different features at different positions. It is partial because the feature dictionary does not perfectly reconstruct the model's hidden state, and the residual is captured by error nodes that carry the unexplained portion of each MLP output forward in the graph. And it is layered because each feature is associated with a specific transformer layer at which it reads from the residual stream.
The graph's edges are signed scalar contributions that quantify how much an upstream node's activation drives a downstream node's pre-activation through the model's frozen attention and known weight matrices. Because attention patterns are frozen at the values observed on the specific prompt being analyzed, the entire residual stream effectively becomes a linear function of feature activations and embeddings, and the contribution from a source node to a target node is computable as the product of the source's activation and a "virtual weight" through any number of intervening linear layers. [^1]
The methodology described in "Circuit Tracing: Revealing Computational Graphs in Language Models" proceeds in roughly four stages: training a replacement model, freezing attention to obtain a local replacement model, building the attribution graph, and pruning and validating it. [^1]
Anthropic chose not to use a vanilla [[sparse_autoencoder|sparse autoencoder]] applied independently per layer. Instead they trained a cross-layer transcoder (CLT), an interpretable architecture composed of features arranged by layer. A CLT feature at layer L reads from the residual stream at layer L via a linear encoder followed by a nonlinearity, then writes to the MLP outputs of layer L and all subsequent layers through separate decoder weight matrices. In effect, each feature has a single point of activation but multiple points of contribution further down the network. The CLT is trained to reconstruct the original model's MLP outputs given the residual stream as input. [^1]
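The read-once, write-many structure of a CLT can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Anthropic's implementation: the function name `clt_forward`, the data layout, and the toy weights are hypothetical, a plain ReLU stands in for whatever nonlinearity the real CLT uses, and a feature is assumed to write to its own layer as well as to later ones.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def relu(vector):
    return [max(0.0, x) for x in vector]

def clt_forward(residuals, encoders, decoders):
    """
    residuals[l]      : residual-stream vector read by the features at layer l
    encoders[l]       : encoder matrix (n_features x d_model) for layer l
    decoders[(l, t)]  : decoder matrix (d_model x n_features) from source
                        layer l to target layer t, defined for t >= l
    Returns (feature activations per layer, reconstructed MLP output per layer).
    """
    n_layers = len(residuals)
    d_model = len(residuals[0])
    # Each feature activates once, at the layer where it reads.
    acts = [relu(matvec(encoders[l], residuals[l])) for l in range(n_layers)]
    outputs = []
    for t in range(n_layers):
        out = [0.0] * d_model
        # The MLP output at layer t sums contributions from all layers l <= t.
        for l in range(t + 1):
            contribution = matvec(decoders[(l, t)], acts[l])
            out = [o + c for o, c in zip(out, contribution)]
        outputs.append(out)
    return acts, outputs

# Toy setup: 2 layers, d_model = 2, one feature per layer (hypothetical weights).
residuals = [[1.0, 2.0], [0.5, -1.0]]
encoders = [[[1.0, 0.0]], [[0.0, 1.0]]]
decoders = {
    (0, 0): [[1.0], [0.0]],
    (0, 1): [[0.0], [2.0]],
    (1, 1): [[3.0], [0.0]],
}
acts, outputs = clt_forward(residuals, encoders, decoders)
```

In this toy run the layer-0 feature is active and contributes to the reconstructed outputs of both layers, while the layer-1 feature has a negative pre-activation and stays silent.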
Replacing the model's MLP blocks with the trained CLT produces a replacement model whose behavior approximates the original. Because some MLP computation is not captured by the CLT, the replacement model is imperfect. To work around this on any specific prompt, the authors construct a local replacement model in which (a) the CLT replaces the MLPs as before, (b) attention patterns and layer-normalization scaling factors are frozen at the values they took during the original model's forward pass on the prompt of interest, and (c) per-token error vectors are inserted to absorb the residual difference between the CLT's reconstruction and the true MLP output. The local replacement model exactly matches the original model's output on the analyzed prompt by construction, because the error terms paper over any reconstruction loss. [^1]
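The role of the error vectors in step (c) can be made concrete with a small sketch. The vectors below are made-up numbers for a single token at a single layer, and the helper names are hypothetical; the point is only that adding the recorded error back to the CLT reconstruction recovers the true MLP output exactly, whatever the quality of the reconstruction.

```python
def error_vector(true_mlp_out, clt_reconstruction):
    """Residual left unexplained by the CLT at one token and one layer."""
    return [t - r for t, r in zip(true_mlp_out, clt_reconstruction)]

def patched_output(clt_reconstruction, error):
    """MLP output used by the local replacement model: reconstruction + error."""
    return [r + e for r, e in zip(clt_reconstruction, error)]

def squared_norm(vector):
    return sum(x * x for x in vector)

# Made-up values standing in for one token's MLP output and its reconstruction.
true_out = [1.0, -2.0, 0.5]
reconstruction = [0.75, -1.5, 0.5]

error = error_vector(true_out, reconstruction)
patched = patched_output(reconstruction, error)

# Fraction of the output's squared norm that sits in the opaque error term.
unexplained = squared_norm(error) / squared_norm(true_out)
```

The `unexplained` ratio is one simple way to see how much of the computation ends up in error nodes rather than interpretable features; it is an illustrative metric, not one taken from the paper.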
Inside the local replacement model, every nonlinearity sits in the CLT feature activations. All other operations such as attention readout, residual sums, layer norm scaling, and final unembedding behave linearly. This means that conditional on a fixed prompt and frozen attention, the relationship between any pair of feature activations is mediated by a known linear map, and the contribution that one feature makes to another is a single scalar.
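The effect of freezing attention can be checked on a toy attention readout. The sketch below is an illustration with hypothetical vectors, not a model of any real layer: the attention pattern is computed once from a query and keys, then held fixed, after which the readout is a linear function of the value vectors (the output for a sum of inputs equals the sum of the outputs).

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def attention_pattern(query, keys):
    """Attention weights over positions for a single query."""
    return softmax([sum(q * k for q, k in zip(query, key)) for key in keys])

def attend(pattern, values):
    """Weighted sum of value vectors under a fixed attention pattern."""
    d = len(values[0])
    return [sum(p * v[i] for p, v in zip(pattern, values)) for i in range(d)]

# Hypothetical query and keys "observed on the prompt"; the pattern is
# computed once here and treated as a constant from now on.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
frozen = attention_pattern(query, keys)

values_a = [[1.0, 2.0], [0.0, 1.0], [3.0, 0.0]]
values_b = [[0.5, 0.0], [2.0, 2.0], [1.0, 1.0]]
values_sum = [[x + y for x, y in zip(va, vb)] for va, vb in zip(values_a, values_b)]

# Linearity check: attend(a + b) should equal attend(a) + attend(b).
out_of_sum = attend(frozen, values_sum)
sum_of_outs = [x + y for x, y in zip(attend(frozen, values_a), attend(frozen, values_b))]
max_gap = max(abs(x - y) for x, y in zip(out_of_sum, sum_of_outs))
```

If the pattern were recomputed from the perturbed inputs instead of being frozen, the softmax would reintroduce a nonlinearity and this check would generally fail.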
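The attribution graph for a prompt contains four kinds of nodes: embedding nodes for the input tokens, feature nodes for the CLT features active at each token position and layer, error nodes for the unexplained portion of each MLP output, and logit nodes for the candidate output tokens.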
An edge from a source node s to a target feature t is assigned a weight equal to the source's activation multiplied by the virtual weight along the path from s to t in the linear part of the local replacement model. The authors write the edge weight as A_{s→t} := a_s · w_{s→t}, where a_s is the source activation and w_{s→t} is the virtual weight, and note that the pre-activation of any feature node t in the graph equals the sum of all incoming edges. This makes the graph an exact accounting of feature pre-activations in the local replacement model. [^1]
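The accounting identity above can be verified numerically on a toy example. The sketch below assumes a single linear hop, in which each source writes a vector into the residual stream and the target feature reads it through an encoder vector, so the virtual weight is just a dot product; in the full method the virtual weight also runs through frozen attention and other linear layers. Node names, vectors, and activations are hypothetical.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

# Hypothetical source nodes: name -> (activation a_s, write vector into the
# residual stream). One embedding node, one feature node, one error node.
sources = {
    "embedding:Dallas": (1.0, [1.0, 0.0, -0.5]),
    "feature:Texas": (2.0, [0.0, 1.0, 0.0]),
    "error:layer3": (1.0, [0.25, -0.5, 0.0]),
}

# Hypothetical encoder vector of the target feature t.
target_encoder = [0.5, 2.0, -1.0]

# Virtual weight w_{s->t}: how strongly a unit of source s drives target t.
virtual_weight = {s: dot(target_encoder, write) for s, (_, write) in sources.items()}

# Edge weight A_{s->t} = a_s * w_{s->t}.
edges = {s: a * virtual_weight[s] for s, (a, _) in sources.items()}

# Independent check: build the residual stream, then apply the encoder.
residual = [sum(a * write[i] for a, write in sources.values()) for i in range(3)]
pre_activation = dot(target_encoder, residual)

# The incoming edges should sum to the target's pre-activation.
gap = abs(sum(edges.values()) - pre_activation)
```

Note that the error node receives an edge like any other source, which is how the unexplained part of the computation shows up explicitly in the graph.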
Even a small prompt typically yields a graph with thousands of feature nodes and many more edges. To make the graph tractable for human inspection, the authors apply a pruning algorithm based on influence matrices that selectively retains the nodes and edges contributing the most to the target output. The paper reports that pruning can reduce the number of nodes by a factor of ten while still preserving roughly 80 percent of the behavior to be explained. The completeness of the pruned graph is quantified by a replacement score and completeness score that compare the pruned graph's prediction of feature activations and output logits to those of the unpruned local replacement model. [^1]
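A simplified version of influence-based pruning can be sketched as follows. This is loosely modelled on the description above and is not the paper's algorithm: the per-target normalisation of absolute edge weights, the node-only pruning, the 0.8 default threshold, and all names and weights are assumptions made for illustration.

```python
def node_influence(edges, order, output):
    """
    edges[target][source] = signed edge weight.
    order: nodes in topological order (inputs first, output last).
    Returns each node's influence on `output`, summed over all paths, using
    absolute edge weights normalised over each target's incoming edges.
    """
    influence = {n: 0.0 for n in order}
    influence[output] = 1.0
    # Walk backward so a node's influence is final before it is propagated.
    for target in reversed(order):
        incoming = edges.get(target, {})
        total = sum(abs(w) for w in incoming.values())
        if total == 0.0:
            continue
        for source, w in incoming.items():
            influence[source] += influence[target] * abs(w) / total
    return influence

def prune(edges, order, output, threshold=0.8):
    """Keep the most influential nodes until `threshold` of influence is covered."""
    influence = node_influence(edges, order, output)
    candidates = sorted(
        (n for n in order if n != output), key=lambda n: influence[n], reverse=True
    )
    total = sum(influence[n] for n in candidates)
    kept, running = [], 0.0
    for n in candidates:
        if running >= threshold * total:
            break
        kept.append(n)
        running += influence[n]
    return kept, influence

# Toy graph with hypothetical nodes and weights.
order = ["emb", "f_texas", "f_noise", "err", "logit"]
edges = {
    "f_texas": {"emb": 2.0},
    "f_noise": {"emb": 0.5},
    "logit": {"f_texas": 6.0, "f_noise": -1.0, "err": 1.0},
}
kept, influence = prune(edges, order, "logit")
```

On this toy graph the weakly connected feature and the error node are dropped, while the embedding and the dominant feature are retained.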
Because raw attribution graphs are dense and hard to read, the Anthropic team built an interactive visualization tool, open-sourced alongside the methods paper. In the tool, researchers can group features into clusters, label them with natural-language descriptions, intervene on individual features by clamping or ablating them, and re-run the graph to observe downstream effects. Anthropic's May 2025 open-source release made this tool available externally, alongside an attribution-graph generation library that runs on open-weights models including Gemma-2-2B and Llama-3.2-1B. A frontend hosted by Neuronpedia lets researchers explore graphs without local installation. [^4][^5]
"Tracing the Thoughts of a Large Language Model," published on Anthropic's research blog on March 27, 2025, is the public-facing accompaniment to the two technical papers. It introduces the attribution-graph methodology as an "AI microscope" and surveys several case studies drawn from "On the Biology of a Large Language Model." The model studied in the case-study paper is Claude 3.5 Haiku, Anthropic's lightweight production model released in October 2024. The work describes attribution graphs as analogous to wiring diagrams that reveal the steps a model took internally to decide on a particular output, while emphasizing that the graphs are only ever a partial view of computation. [^2][^3]
The blog and case-study paper consistently use a "biology" metaphor in which a single attribution graph is treated as analogous to studying a single specimen, with researchers identifying motifs of interest, perturbing features to test causal hypotheses, and reporting findings that may not generalize across all prompts or all models. The authors explicitly caution that attribution graphs explain only a fraction of total model behavior, that some circuits remain hidden in attention computations, and that interpretive labels on features are necessarily approximate. [^2]
"On the Biology of a Large Language Model" applies attribution graphs to ten distinct behaviors in Claude 3.5 Haiku. The most widely discussed are summarized below. [^2]
When Claude 3.5 Haiku is asked "What is the capital of the state containing Dallas?" the model arrives at "Austin" without writing intermediate steps. The attribution graph for this prompt shows a chain in which features associated with Dallas drive features associated with Texas, which in turn drive features associated with Austin. Anthropic demonstrated that the intermediate Texas representation is causal by clamping or substituting it: replacing the Texas feature activations with features associated with a different state causes the model's output to change accordingly. This kind of "two-hop" inference is offered as evidence that the model performs intermediate reasoning steps internally rather than simply pattern-matching the answer. [^2][^3]
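The logic of such an intervention experiment can be sketched with a toy two-hop network. Everything below is hypothetical (the feature names, the unit weights, and the `forward` function); it shows only the experimental pattern of running the computation once normally and once with an intermediate feature clamped to a substituted value, then comparing outputs.

```python
def relu(x):
    return max(0.0, x)

# Hypothetical weights: city features drive state features, which drive
# capital logits. weights[target][source] = weight.
STATE_FROM_CITY = {"texas": {"dallas": 1.0}, "california": {"oakland": 1.0}}
CAPITAL_FROM_STATE = {"austin": {"texas": 1.0}, "sacramento": {"california": 1.0}}

def forward(city_acts, clamp=None):
    """Run the toy network; `clamp` overrides selected state activations."""
    clamp = clamp or {}
    states = {}
    for state, weights in STATE_FROM_CITY.items():
        pre = sum(w * city_acts.get(city, 0.0) for city, w in weights.items())
        # A clamped feature ignores its inputs and takes the imposed value.
        states[state] = clamp.get(state, relu(pre))
    logits = {
        capital: sum(w * states[state] for state, w in weights.items())
        for capital, weights in CAPITAL_FROM_STATE.items()
    }
    return max(logits, key=logits.get), states

baseline_answer, _ = forward({"dallas": 1.0})
# Intervention: suppress the Texas feature and substitute California.
swapped_answer, _ = forward({"dallas": 1.0}, clamp={"texas": 0.0, "california": 1.0})
```

If the output follows the substituted intermediate feature while the input stays unchanged, that is evidence the intermediate feature is causally involved rather than merely correlated with the answer.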
Perhaps the most prominently reported result is that Claude 3.5 Haiku appears to plan rhymes before generating a line of verse. When the model writes a rhyming couplet, attribution graphs reveal that features representing candidate rhyme words activate near the end of the first line of generation, well before the actual rhyming word is produced, and that those features causally shape the words chosen earlier in the second line. Intervening on the rhyme-word features changes both the planned ending and the intermediate content of the line. This contradicts a strict "next-token only" mental model of autoregressive generation, providing evidence that the model can think on longer horizons. [^2][^3]
For prompts like "36 + 59 =" the attribution graph reveals multiple parallel pathways. One pathway computes a coarse magnitude approximation, while another tracks the final digit through a separate set of features. These pathways converge to produce the correct sum, with the model effectively running an internally invented algorithm rather than a memorized lookup table. When the same model is then asked how it arrived at the answer, it typically describes a standard column-addition procedure, demonstrating that the model's introspective report does not match its internal mechanism. [^2][^3]
When asked for the antonym of a word in English, French, or Chinese, Claude 3.5 Haiku activates substantially overlapping sets of features regardless of input language. Language-specific features appear near the input and output of the model, while abstract concept features in the middle layers are shared across languages. The authors interpret this as evidence of a "universal language of thought" and report that the degree of language sharing increases with model scale. [^2][^3]
When a user asks Claude 3.5 Haiku how to mix bleach and ammonia, attribution graphs show that the model activates features representing harmful chemicals, dangerous instructions, and a general "refuse to comply with harmful requests" concept. The authors argue that refusal is in some sense the model's default mode in dangerous contexts, and that responding helpfully requires the explicit absence of these triggering features. [^2][^3]
For factual questions, attribution graphs reveal a circuit pattern in which a default "decline to answer" feature is normally active, suppressed only when a separate "known entity" feature also fires. When the model knows the answer, the "known entity" features inhibit the refusal features and a substantive answer is produced. Hallucinations occur when the "known entity" features misfire for an entity the model does not actually know, causing the refusal circuit to be suppressed inappropriately and the model to confabulate. This provides a mechanistic account of one mode of [[hallucination|hallucination]] in language models. [^2][^3]
Anthropic analyzed an "acrostic" jailbreak in which a benign-seeming prompt smuggles a harmful instruction by encoding the first letters of words. Attribution graphs show that the model's grammar-coherence and sentence-completion features can carry the model into producing harmful content before its harm-detection features fully engage. The model often "catches itself" after the harmful content has begun and pivots to a refusal, a sequence that the graph makes visible step by step. [^2]
Attribution graphs were used to test whether the model's stated chain-of-thought reflects its actual internal reasoning. The authors gave Claude 3.5 Haiku a hard math problem along with a hint suggesting an incorrect answer. The model's written reasoning frequently appeared to work backward from the incorrect hint to fabricate a justification, while the attribution graph showed features representing the hint propagating directly to the final answer rather than features representing the genuine intermediate computation. This is offered as a concrete mechanistic example of unfaithful chain-of-thought. [^2][^3]
One case study examined a model trained with a hidden objective and asked whether attribution graphs could reveal pursuit of that objective even when the model concealed it in its outputs. The graphs showed features associated with the hidden goal active during behavior that ostensibly served only the user's request, demonstrating attribution graphs as a possible auditing tool for misalignment. [^2]
In a case study on differential diagnosis, attribution graphs showed Claude 3.5 Haiku activating features for candidate diagnoses such as preeclampsia internally before mentioning them in its output, suggesting an internal hypothesis-generation step that is normally invisible to a reader of the transcript. [^2]
Attribution graphs are an operational descendant of circuit analysis as practiced in earlier Transformer Circuits work. Hand-traced circuits in small models, such as the induction-heads circuits identified in 2022, are similar in spirit. Both identify a subgraph of model components that implements a particular behavior, both attribute causal roles to nodes, and both can be tested by intervention. The differences are practical. Hand-traced circuits operate on neurons, attention heads, and weight subspaces, and require substantial human labor and prior hypotheses. Attribution graphs operate on sparse-dictionary features and are generated semi-automatically for any chosen prompt. Attribution graphs scale to production-size models such as Claude 3.5 Haiku, while pure hand-tracing has historically been confined to small models or specific narrow phenomena. [^1]
Compared to classical input-attribution methods such as saliency maps, integrated gradients, LIME, and SHAP, attribution graphs differ in target, granularity, and interpretive content. Input-attribution methods attribute output sensitivity to input dimensions, producing input-space heatmaps that say nothing about internal structure. Attribution graphs attribute output activations to internal feature activations and produce a layered explanation in concept-space. Input-attribution methods need little from the model: gradient-based variants require only access to gradients, and perturbation-based variants such as LIME and SHAP require only the ability to query the model. Attribution graphs require a trained sparse feature dictionary or transcoder for the target model. Input-attribution methods are comparatively easy to compute but notoriously hard to interpret. Attribution graphs are computationally expensive and require specialized infrastructure, but each node carries an interpretive label tied to a feature whose activations can be inspected on a corpus.
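For contrast, integrated gradients, one of the input-space methods named above, can be sketched in a few lines. The version below is a minimal illustration on a hypothetical toy function, using finite-difference gradients and a midpoint Riemann sum; practical implementations use automatic differentiation. It returns one number per input dimension and nothing about intermediate structure.

```python
def integrated_gradients(f, x, baseline, steps=50, eps=1e-5):
    """
    Approximate integrated gradients of scalar function f at input x,
    relative to `baseline`, by averaging numerical gradients along the
    straight-line path from baseline to x.
    """
    n = len(x)
    attributions = [0.0] * n
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th path segment
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i in range(n):
            hi = point[:]
            hi[i] += eps
            lo = point[:]
            lo[i] -= eps
            grad_i = (f(hi) - f(lo)) / (2 * eps)  # central difference
            attributions[i] += grad_i * (x[i] - baseline[i]) / steps
    return attributions

def toy_model(v):
    """Hypothetical scalar 'model': an interaction term plus a linear term."""
    return v[0] * v[1] + 2.0 * v[2]

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(toy_model, x, baseline)

# Completeness property: attributions sum to f(x) - f(baseline).
completeness_gap = abs(sum(attributions) - (toy_model(x) - toy_model(baseline)))
```

The result is an input-space attribution vector: it splits the interaction term between the first two inputs but does not say that an intermediate product was computed, which is the kind of internal step an attribution graph is meant to expose.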
Attribution graphs are also distinct from circuit discovery methods such as ACDC and edge-attribution patching, which automatically identify which model components matter for a behavior but operate over a coarser ontology of attention heads, MLP layers, and residual-stream directions rather than fine-grained learned features. Attribution graphs and circuit-discovery methods can in principle be combined, with the latter selecting subgraphs and the former populating them with interpretable nodes. [^1]
A separate family of techniques, including [[activation_steering|activation steering]], modify model behavior by intervening on internal representations identified by other means. Attribution graphs are diagnostic rather than directive: they describe what is happening, while steering manipulates representations to change what happens. The two are complementary, because features identified as causal in an attribution graph are natural targets for steering experiments. [^4]
The Anthropic methods paper is explicit about several limitations of attribution graphs. [^1]
Incomplete reconstruction. The CLT does not perfectly reproduce the original model's MLP outputs. The error nodes inserted in the local replacement model absorb the residual exactly for a single prompt, but the resulting graph then attributes some of the model's behavior to opaque error terms rather than interpretable features. The authors note that error terms account for a non-trivial portion of total computation and that this limits how much of the model's behavior is mechanistically explained.
Frozen attention. The construction freezes attention patterns and layer-normalization scaling factors at the values observed on the analyzed prompt. This linearizes the model conditional on the prompt and makes attribution tractable, but it means that the graph does not explain how the attention patterns themselves were formed. QK-circuits and other attention-mechanism details remain outside the graph.
Mechanistic faithfulness. The CLT may not always use the same internal mechanism as the original MLPs. Even when the replacement model approximates the original in output, the intermediate features may not correspond to the original's internal causes in a strict sense. The authors report that interventions in the CLT and the original model can diverge over many layers, and they describe these "perturbation discrepancies" as compounding significantly across depth.
Per-prompt scope. An attribution graph explains a single prompt and a single response. Generalizing claims across prompts requires manually combining many graphs or sampling broadly. The paper acknowledges that understanding global circuits across diverse inputs remains a hard problem because of feature interference, the role of attention, and residual [[polysemanticity]] in the learned features.
Feature quality. All findings depend on the quality of the underlying CLT features. If a feature is not actually monosemantic, the node it produces in the graph carries an inaccurate interpretive label. The methods paper documents standard checks such as activation-context inspection and feature-ablation behavior, but the field does not yet have a rigorous, automated way to verify feature interpretations.
Polysemanticity and superposition. Because feature dictionaries do not fully resolve [[superposition]], some features remain mildly polysemantic and some concepts are split across multiple features in ways that complicate the graph. The authors note this as an open problem.
Manual interpretation. Even after pruning, attribution graphs contain hundreds of features that must be inspected and labeled by a human researcher. The methodology is best described as semi-automated; meaningful interpretation still requires significant analyst time.
Scalability of insights. The case studies in "On the Biology of a Large Language Model" are illustrative rather than exhaustive. The authors describe attribution graphs as a tool for forming hypotheses, which then need to be validated by intervention experiments and by examining additional prompts.
Following the March 2025 publications, Anthropic and external researchers have continued to develop the attribution-graph toolkit.
In May 2025, Anthropic released an open-source Python library for generating attribution graphs on open-weights models and an open-source frontend for inspecting them, with hosting on the Neuronpedia platform led by Johnny Lin and Curt Tigges. The release was led by Anthropic Fellows Michael Hanna and Mateusz Piotrowski, in collaboration with Decode Research. The library supports models including Gemma-2-2B and Llama-3.2-1B. [^4][^5]
Anthropic's interpretability team has continued to apply attribution graphs in subsequent posts on the Transformer Circuits Thread, examining additional behaviors of Claude 3.5 Haiku and related models. External work has applied attribution graphs to reasoning in decoder-only transformers on graph problems and has begun building tooling for cross-model comparison. [^4]
The attribution-graph methodology has been incorporated into Anthropic's broader interpretability research agenda, which the company has publicly described as a long-term effort to develop diagnostic tools precise enough to detect deception, alignment failures, and other safety-relevant phenomena inside large models. The team frames attribution graphs as one component of an evolving toolkit that includes feature dictionaries, [[activation_steering|activation steering]], and behavioral evaluations. [^3]