Mechanistic interpretability (often abbreviated as mech interp or MI) is a research program within AI safety and machine learning that aims to reverse-engineer the internal computations of neural networks, particularly transformers, into human-understandable algorithms. Rather than treating models as black boxes and studying only their input-output behavior, mechanistic interpretability seeks to identify the specific weights, activations, features, and circuits that a model uses to produce its outputs. The field has grown rapidly since 2020, driven by the recognition that understanding what is happening inside increasingly powerful AI systems is critical for ensuring their safe deployment. In 2026, MIT Technology Review named mechanistic interpretability one of its 10 Breakthrough Technologies, reflecting the field's transition from a niche research interest to a central concern in alignment work at Anthropic, Google DeepMind, OpenAI, and a growing ecosystem of academic labs and startups [1].
As large language models have grown in capability and deployment, the need to understand their internal workings has become more pressing. Traditional approaches to model evaluation focus on behavioral testing: measuring accuracy on benchmarks, probing for biases in outputs, or checking alignment with human preferences. While behavioral testing remains valuable, it cannot reveal whether a model has learned a genuinely robust algorithm or is relying on spurious correlations, nor can it reliably predict how a model will behave in novel situations outside the test distribution.
Mechanistic interpretability addresses this gap by looking inside the model. The core analogy is to reverse engineering in computer science or biology. Just as a security researcher might disassemble a binary to understand what a program does, or a neuroscientist might trace neural circuits to understand how the brain processes information, mechanistic interpretability researchers attempt to identify the computational structures that neural networks use to perform tasks [2]. The goal is not just an explanation that sounds plausible, but a causal account: a description of which components do what, why, and how they combine to produce specific outputs.
The motivation runs along several axes. The first is safety. If a model is going to be deployed in high-stakes settings, regulators and developers want some way of knowing whether it harbors deceptive subroutines, dangerous capabilities, or systematic biases that input-output testing might miss. The second is debugging. When a model fails, knowing which features fired and which circuits routed them gives engineers a much more useful diagnostic than a confusion matrix. The third is editing. Tools like feature steering and activation patching open the door to surgical interventions on model behavior that do not require fine-tuning or retraining. The fourth is scientific understanding. Transformers are arguably the most studied family of artificial systems in history, and yet, until recently, almost nothing was known about the algorithms they implement. Mechanistic interpretability is the closest thing AI has to a research program for understanding what its own creations are actually doing.
Mechanistic interpretability is a subfield of the broader area of interpretability and explainable AI (XAI), but it differs from most other approaches in scope and method.
| Approach | What it studies | Typical method | Limitations |
|---|---|---|---|
| Behavioral interpretability | Input-output mapping | Benchmarks, prompt audits | Cannot reveal internal mechanisms |
| Feature attribution (SHAP, LIME, saliency maps) | Which inputs influence the output | Gradient and perturbation analysis | Shows correlation, not causation |
| Probing classifiers | What information is encoded in activations | Train small linear or MLP probe on hidden states | Reveals what is present but not how it is used |
| Concept-based interpretability (TCAV, concept bottlenecks) | High-level concept relevance | Concept activation vectors | Limited to predefined concepts |
| Mechanistic interpretability | Internal weights, activations, circuits | Activation patching, sparse autoencoders, circuit tracing | Labor-intensive, hard to validate |
XAI methods generally treat the model as a black box and try to summarize its behavior with simpler explanations. Mechanistic interpretability is white-box: it works directly with the parameters and the residual stream. The two approaches answer different questions. SHAP can tell you which input pixels mattered for a classification; mechanistic interpretability can tell you which attention head copied a name from earlier in the prompt.
Work that would later be called mechanistic interpretability has roots in computer vision research from the late 2010s. Chris Olah, then at Google Brain and later at OpenAI and Anthropic, led a series of investigations into convolutional networks that culminated in the Distill "Circuits" thread starting in 2020. The opening essay, "Zoom In: An Introduction to Circuits" by Olah, Cammarata, Schubert, Goh, Petrov, and Carter, articulated three speculative claims that would shape the field: features are the fundamental units of neural networks, features are connected by weights to form circuits, and analogous features and circuits recur across different models and tasks [2]. The companion essay "An Overview of Early Vision in InceptionV1" walked through the first five layers of InceptionV1, showing how the network builds up from raw pixels to edge detectors, curve detectors, and eventually crude detectors for eyes and small heads.
The move from vision to language happened around 2021. Anthropic was founded in 2021 by Dario Amodei, Daniela Amodei, and several colleagues from OpenAI, and the company quickly established the Transformer Circuits Thread as a venue for publishing detailed interpretability research on language models. The first major entry, "A Mathematical Framework for Transformer Circuits" by Elhage, Nanda, and many others, provided a notation and a set of tools for analyzing attention-only transformers. It introduced the idea of treating the residual stream as a communication channel between attention heads and MLPs, and showed how to decompose attention heads into composable QK and OV circuits [3].
The field's modern era began in 2022. Within a few months that year, Anthropic published "In-context Learning and Induction Heads" (Olsson et al.), which identified induction heads as the mechanism behind in-context learning [4], and "Toy Models of Superposition" (Elhage, Hume, et al.), which provided the theoretical grounding for why neurons are often polysemantic [5]. Wang, Variengien, Conmy, and collaborators published "Interpretability in the Wild," the first end-to-end reverse engineering of a substantial natural-language behavior in GPT-2 small [6]. By the end of 2022, the rough vocabulary of the field was settled: features, circuits, superposition, polysemanticity, activation patching, path patching.
From 2023 onward the field shifted again, this time toward dictionary learning with sparse autoencoders. "Towards Monosemanticity" (Bricken, Templeton, et al., 2023) demonstrated that sparse autoencoders trained on a one-layer transformer could recover thousands of human-interpretable features [7]. A year later, "Scaling Monosemanticity" (Templeton, Conerly, et al., 2024) extended the approach to Claude 3 Sonnet and extracted millions of features, including the now-famous Golden Gate Bridge feature [8]. Google DeepMind released Gemma Scope in August 2024 [9], OpenAI released its work on scaling sparse autoencoders to GPT-4 in June 2024 [10], and Anthropic published its circuit tracing methodology and case studies on Claude 3.5 Haiku in March 2025 [11][12]. By 2026 the field is one of the most active research areas in alignment.
Several organizations have made foundational contributions.
Anthropic's interpretability team, led by Chris Olah, has produced many of the field's landmark papers, including "Zoom In," "A Mathematical Framework for Transformer Circuits," "Toy Models of Superposition," "Towards Monosemanticity," "Scaling Monosemanticity," and the 2025 circuit tracing work. Anthropic has publicly committed to the goal of being able to reliably detect most AI model problems by 2027 using interpretability tools [11].
Google DeepMind's mechanistic interpretability team is led by Neel Nanda, who previously worked on Anthropic's interpretability team under Olah and then conducted independent research before joining DeepMind. Nanda created TransformerLens, the most widely used open-source library for the field, and his team has produced Gemma Scope, the JumpReLU SAE architecture, and a long line of papers on automated circuit discovery, attribution patching, and SAE evaluation [9][13].
OpenAI maintains an interpretability effort that has focused on scaling sparse autoencoders. The June 2024 paper "Scaling and evaluating sparse autoencoders" by Gao, Dupré la Tour, Tillman, Goh, Troll, Radford, Sutskever, Leike, and Wu introduced top-K sparse autoencoders and trained a 16-million-feature autoencoder on GPT-4 activations [10].
EleutherAI, the open-source AI research collective, has contributed both to tooling and to specific reverse-engineering efforts. Its researchers developed the tuned lens [22], and the collective has used TransformerLens to study reinforcement learning from human feedback (RLHF) effects on transformer internals [14].
Academic labs have grown substantially. David Bau's group at Northeastern University has produced NNsight, the National Deep Inference Fabric (NDIF) for remote interpretability work, and the Sparse Feature Circuits paper (Marks, Rager, Michaud, Belinkov, Bau, Mueller, 2024) [15][16]. Stanford NLP has produced pyvene for systematic causal interventions. The MATS (ML Alignment Theory Scholars) program has trained many MI researchers and supported a range of widely cited papers.
A cluster of independent labs and startups has emerged. Apollo Research focuses on detecting deceptive alignment and dangerous capabilities, often using mechanistic methods. Goodfire, founded in June 2024 by Eric Ho, Dan Balsam, and Tom McGrath, is the first venture-backed company built around interpretability as a product, with its Ember platform providing SAE-based feature access and steering for enterprise customers including Rakuten, Apollo Research, and Haize Labs [17]. Open Philanthropy, the Long-Term Future Fund, and several research grant programs have funded substantial portions of the field's growth.
In mechanistic interpretability, a feature is a meaningful unit of representation inside a neural network. The term is intentionally informal. Olah and collaborators describe a feature as any property of an input that a sufficiently large neural network would dedicate a neuron to representing. For a language model, features can be very concrete ("the next token is a closing parenthesis") or quite abstract ("the speaker is angry," "this is legal language," "the text is about the Golden Gate Bridge") [7].
A core empirical observation is that individual neurons are often a poor unit of analysis. A given neuron in a transformer's MLP layer typically activates for many unrelated concepts, a phenomenon called polysemanticity. Conversely, a single concept may be spread across many neurons. Treating neurons as features therefore fails. Mech interp instead looks for features as directions in the high-dimensional activation space, with the activation strength along each direction representing how strongly the feature is present.
Superposition is the phenomenon where a network represents more features than it has dimensions by encoding features as overlapping directions in activation space. In a layer with 512 neurons, the model might use thousands of feature directions, none of which is aligned with the neuron basis. The features overlap, and a given neuron will fire (partially) for any feature whose direction has a non-zero component along that neuron's axis [5].
Superposition works because real-world features are usually sparse: for any given input, only a tiny fraction of all possible features is relevant. As long as features rarely co-occur, the network can tolerate small amounts of interference. The model essentially compresses a high-dimensional feature space into a lower-dimensional neuron space, paying for the compression with reconstruction noise that the rest of the network has learned to ignore.
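The compression can be demonstrated with a small numerical sketch (an illustration, not the setup from the superposition paper): pack many sparse features into far fewer dimensions as random directions, then read each one back by projecting onto its direction.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features, n_active = 64, 512, 4

# Random unit-norm feature directions (one column each): far more features than dimensions.
directions = rng.normal(size=(d_model, n_features))
directions /= np.linalg.norm(directions, axis=0)

# A sparse feature vector: only a handful of features are "on" for this input.
true_features = np.zeros(n_features)
active = rng.choice(n_features, size=n_active, replace=False)
true_features[active] = rng.uniform(0.8, 1.2, size=n_active)

# Encode: the activation vector is the superposed sum of the active directions.
activation = directions @ true_features            # shape (64,)

# Decode: project onto every direction; interference is small because features are sparse.
readout = directions.T @ activation                # shape (512,)
recovered = np.argsort(readout)[-n_active:]
print(sorted(active), sorted(recovered))           # typically the same indices
```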
The theoretical framework was laid out in "Toy Models of Superposition" (Elhage, Hume, et al., 2022). Using small synthetic networks, the authors showed that superposition exhibits phase transitions: as features become sparser, the network abruptly switches from monosemantic representations (one feature per neuron) to superposed representations. The paper revealed an unexpected connection to the geometry of uniform polytopes and provided early evidence that superposition is linked to adversarial examples. It also gave the field its first clean explanation of why neurons in language models are polysemantic [5].
Polysemanticity is the property that a single neuron responds to multiple unrelated concepts. Monosemanticity is the goal of representing each concept with a single, dedicated unit (a neuron, or more usefully, a learned feature direction). Most interpretability work since 2022 can be read as a sustained attempt to recover monosemantic features from polysemantic neurons, using sparse autoencoders or related dictionary learning techniques.
A circuit is a subgraph of the model's computational graph that implements a specific algorithm. The circuits framework, introduced in Olah et al. (2020), holds that neural networks can be understood as compositions of features connected by weights, forming meaningful circuits analogous to logic circuits in digital hardware. The claim is threefold: features are the fundamental units, features are connected by weights to form circuits, and analogous features and circuits recur across different models and tasks (universality) [2]. Circuit analysis identifies which components (attention heads, MLP layers, specific features) are responsible for a particular behavior, and traces how information flows between them.
| Concept | Definition | Why it matters |
|---|---|---|
| Feature | Direction in activation space corresponding to a concept | Basic unit of analysis above the neuron |
| Circuit | Subnetwork that implements a specific function | Lets you describe what the model is doing as an algorithm |
| Superposition | Representing more features than there are dimensions | Explains polysemanticity and motivates SAEs |
| Polysemanticity | One neuron, many unrelated concepts | The main reason individual neurons are hard to read |
| Monosemanticity | One unit, one clean concept | The interpretability ideal |
| Residual stream | Sum of all layer outputs in a transformer | Acts as the main communication bus inside the model |
| Probing | Train a small classifier on hidden activations | Cheap way to test what information is present |
| Activation patching | Replace activations from one input with those from another | Tests causal role of specific components |
| Path patching | Patch only the edge from component A to component B | Identifies fine-grained information flow |
| Sparse autoencoder | Autoencoder with a wide hidden layer and a sparsity penalty | Decomposes activations into monosemantic features |
| Attribution graph | Causal graph showing which features drive an output | Compact summary of the model's computation on a prompt |
The field has developed through a series of landmark publications, each building on prior work.
| Year | Paper | Authors / Org | Key contribution |
|---|---|---|---|
| 2017-2019 | Distill "Feature Visualization" and "Building Blocks of Interpretability" | Chris Olah et al. / Google Brain, OpenAI | Established feature visualization for CNNs |
| 2020 | "Zoom In: An Introduction to Circuits" | Olah, Cammarata, Schubert, Goh, Petrov, Carter / OpenAI Distill | Articulated the circuits framework [2] |
| 2020 | "An Overview of Early Vision in InceptionV1" | Olah, Cammarata, et al. / Distill | Walked through curve, edge, and shape circuits in vision models |
| 2021 | "A Mathematical Framework for Transformer Circuits" | Elhage, Nanda, et al. / Anthropic | Formal language for analyzing transformer circuits [3] |
| 2022 | "In-context Learning and Induction Heads" | Olsson, Elhage, Nanda, et al. / Anthropic | Identified induction heads as the mechanism behind in-context learning [4] |
| 2022 | "Toy Models of Superposition" | Elhage, Hume, et al. / Anthropic | Theory of superposition and polysemanticity [5] |
| 2022 | "Interpretability in the Wild" | Wang, Variengien, Conmy, et al. | Reverse-engineered the IOI circuit in GPT-2 small [6] |
| 2023 | "Sparse Autoencoders Find Highly Interpretable Features" | Cunningham, Ewart, et al. | Independent confirmation of SAE feasibility [18] |
| 2023 | "Towards Monosemanticity" | Bricken, Templeton, et al. / Anthropic | SAEs extract interpretable features in toy models [7] |
| 2023 | "Progress measures for grokking via mechanistic interpretability" | Nanda, Chan, Lieberum, Smith, Steinhardt / ICLR | Reverse-engineered grokking in modular addition [19] |
| 2023 | "Towards Automated Circuit Discovery" (ACDC) | Conmy, Mavor-Parker, Lynch, Heimersheim, Garriga-Alonso / NeurIPS | Algorithmic circuit search [20] |
| 2023 | "How does GPT-2 compute greater-than?" | Hanna, Liu, Variengien / NeurIPS | Identified the greater-than circuit [21] |
| 2023 | "Eliciting Latent Predictions from Transformers with the Tuned Lens" | Belrose, Furman, et al. / EleutherAI | Affine probes for layer-by-layer prediction [22] |
| 2024 | "Sparse Feature Circuits" | Marks, Rager, Michaud, Belinkov, Bau, Mueller / Northeastern | Combined SAE features with circuit analysis [16] |
| 2024 | "Improving Dictionary Learning with Gated Sparse Autoencoders" | Rajamanoharan, Conmy, Smith, Lieberum, Varma, Kramar, Shah, Nanda / DeepMind | Gated SAEs reduce shrinkage bias [23] |
| 2024 | "Scaling and evaluating sparse autoencoders" | Gao et al. / OpenAI | Top-K SAEs and 16M-feature SAE on GPT-4 [10] |
| 2024 | "Scaling Monosemanticity" | Templeton, Conerly, et al. / Anthropic | 34M features in Claude 3 Sonnet, including the Golden Gate feature [8] |
| 2024 | "Gemma Scope" | Lieberum, Rajamanoharan, Conmy, Smith, Sonnerat, Varma, Kramar, Nanda / DeepMind | 400+ open-source SAEs on Gemma 2 with JumpReLU [9] |
| 2024 | "Golden Gate Claude" public demo | Anthropic | Public demonstration of feature steering [24] |
| 2025 | "Circuit Tracing" methods paper | Anthropic | Cross-layer transcoders and attribution graphs [11] |
| 2025 | "On the Biology of a Large Language Model" | Anthropic | Nine case studies in Claude 3.5 Haiku [12] |
| 2025 | Open-sourced circuit tracing tools | Anthropic / Neuronpedia | Public Python library and web frontend [25] |
Sparse autoencoders (SAEs) are, for the moment, the field's dominant technique. The core idea is straightforward: if a model encodes more features than it has dimensions as overlapping directions in activation space, an autoencoder with a much wider hidden layer and a sparsity constraint should be able to decompose those activations back into individual, mostly monosemantic features.
A sparse autoencoder is trained to reconstruct the internal activations of a language model at a specific layer (usually the residual stream after a particular block). The encoder maps an activation vector x of dimension d_model into a wider hidden vector f of dimension d_sae, where d_sae is typically 8x to 64x larger than d_model. The decoder maps back to the original space. The sparsity constraint ensures that for any given input, only a small number of hidden units (features) are active.
The simplest form is the L1-penalized SAE introduced in "Towards Monosemanticity." Given an activation x, the SAE computes:
f = ReLU(W_enc * (x - b_dec) + b_enc)
x_hat = W_dec * f + b_dec
The loss combines reconstruction error (L2 distance between x and x_hat) with an L1 penalty on the feature activations f. The L1 penalty pushes most features to zero, leaving only a small set firing for each input.
The decoder columns of W_dec are interpreted as feature directions in the model's activation space. When a feature is active, its decoder column is added to the residual stream, effectively steering the model's computation in a particular direction. This gives a natural interpretation: each feature is a direction, and the feature's activation strength indicates how much that direction contributes to the current computation [7].
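A minimal PyTorch sketch of this recipe is below; the dimensions, initialization, optimizer, and penalty weight are illustrative stand-ins, not the settings used in the papers.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x):
        f = torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)   # feature activations
        x_hat = f @ self.W_dec + self.b_dec                          # reconstruction
        return f, x_hat

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that pushes most features to zero.
    return ((x - x_hat) ** 2).sum(-1).mean() + l1_coeff * f.abs().sum(-1).mean()

# One training step on a batch of residual-stream activations (random stand-ins here).
sae = SparseAutoencoder(d_model=512, d_sae=8192)    # 16x expansion
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
acts = torch.randn(64, 512)
f, x_hat = sae(acts)
loss = sae_loss(acts, x_hat, f)
loss.backward()
opt.step()
```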
Anthropic's "Towards Monosemanticity" (Bricken, Templeton, et al., October 2023) was the paper that put SAEs on the mainstream MI map. The team trained SAEs on a one-layer transformer, with an 8,192-dimensional latent space projecting from a 512-dimensional residual stream (a 16x expansion). They extracted thousands of features and showed that human raters judged about 70% of them as cleanly mapping to single concepts: Arabic script, DNA motifs, legal language, HTTP requests, Hebrew text, nutrition statements, and many more. For comparison, the underlying neurons in the same layer were judged as monosemantic only a small fraction of the time. The paper provided the first strong empirical evidence that dictionary learning with sparse autoencoders could resolve superposition in a scalable, unsupervised way [7].
The May 2024 follow-up, "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet" by Templeton, Conerly, and the Anthropic interpretability team, took the same recipe to a production-scale language model. The team trained SAEs of three sizes on the residual stream of Claude 3 Sonnet, extracting roughly 1 million, 4 million, and 34 million features respectively [8].
The extracted features were strikingly abstract. They were multilingual: a feature for the Golden Gate Bridge fired on English, French, Japanese, and Chinese mentions of the bridge. They were multimodal: the same feature fired on images of the bridge as well as text mentions. They generalized between concrete and abstract references: the bridge feature fired on "the famous orange suspension bridge near San Francisco" even when the name was not used. Beyond the bridge, the team identified features for sycophantic praise, secrecy, deception, hidden agendas, gender bias, dangerous biological content, and many other safety-relevant concepts. The paper also showed that artificially clamping a feature's activation could reliably steer the model toward or away from the corresponding behavior [8].
A week after publication, Anthropic released the Golden Gate Claude public demo. For 24 hours starting May 23, 2024, anyone could chat with a version of Claude in which the Golden Gate Bridge feature was clamped to roughly 10x its normal maximum activation. The result was a model that referenced the bridge in nearly every response, sometimes claiming to be the bridge itself. Asked how to spend $10, it suggested driving across the bridge and paying the toll. Asked for a love story, it produced a romance between a car and the bridge on a foggy day [24]. The demo was widely discussed in the press and helped move mech interp from a research curiosity into the public conversation.
In June 2024, an OpenAI team led by Leo Gao published "Scaling and evaluating sparse autoencoders," introducing the top-K SAE. Instead of using an L1 penalty to encourage sparsity, the top-K SAE simply selects the k largest pre-activations after the encoder and zeros out the rest. This directly controls sparsity, simplifies training, and reduces dead latents (features that never fire on any input). To demonstrate scalability, the team trained a 16-million-latent SAE on GPT-4 activations using 40 billion tokens of training data, the largest reported SAE training run at the time [10][26].
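The top-K idea itself is a few lines; the sketch below is an illustration of the activation rule, not OpenAI's implementation (the encoder and decoder around it are the same as in the L1 recipe above).

```python
import torch

def topk_activation(pre_acts: torch.Tensor, k: int) -> torch.Tensor:
    # Keep the k largest encoder pre-activations per example and zero the rest,
    # so sparsity is set directly rather than tuned through an L1 coefficient.
    values, indices = pre_acts.topk(k, dim=-1)
    return torch.zeros_like(pre_acts).scatter(-1, indices, torch.relu(values))

pre_acts = torch.randn(4, 8192)          # encoder pre-activations for 4 tokens
f = topk_activation(pre_acts, k=32)      # at most 32 nonzero features per token
```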
In August 2024, Google DeepMind released Gemma Scope, a collection of more than 400 open-source SAEs trained on every layer and sublayer output of Gemma 2 2B and 9B. The release totaled more than 30 million features and used the new JumpReLU SAE architecture, which uses a discontinuous activation function with a learnable threshold to balance feature detection against magnitude estimation. The Gemma Scope team (Tom Lieberum, Sen Rajamanoharan, Arthur Conmy, Lewis Smith, Nic Sonnerat, Vikrant Varma, Janos Kramar, Neel Nanda) framed the release as infrastructure for the broader safety community, in the spirit of "open sparse autoencoders everywhere all at once on Gemma 2" [9]. A follow-up Gemma Scope 2 was released in 2025.
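The JumpReLU activation itself is easy to state; the sketch below is an illustrative rendering that omits the machinery needed to train the per-feature threshold.

```python
import torch

def jumprelu(pre_acts: torch.Tensor, threshold: torch.Tensor) -> torch.Tensor:
    # Pre-activations below a learnable per-feature threshold are zeroed;
    # those above the threshold pass through unchanged.
    return pre_acts * (pre_acts > threshold)

pre_acts = torch.randn(4, 16384)              # encoder pre-activations
threshold = torch.full((16384,), 0.5)         # stand-in for learned thresholds
f = jumprelu(pre_acts, threshold)
```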
The basic SAE recipe has spawned a family of variants, each addressing a different limitation.
| Variant | Key idea | Main advantage |
|---|---|---|
| L1 SAE | L1 penalty on feature activations | Original recipe; simple but suffers from shrinkage |
| Top-K SAE (Gao et al., OpenAI 2024) | Keep top k features, zero the rest | Direct sparsity control, fewer dead latents |
| BatchTopK SAE | Apply top-k across a batch instead of per token | Better sparsity-reconstruction tradeoff |
| Gated SAE (Rajamanoharan et al., DeepMind 2024) | Separate detection from magnitude with a gating unit | Reduces shrinkage bias, half as many firing features [23] |
| JumpReLU SAE (DeepMind 2024) | Discontinuous activation with learnable threshold | State of the art for Gemma Scope [9] |
| Transcoder | Approximates an MLP layer's input-output map through a sparse bottleneck | Makes MLP computation attributable to interpretable features |
| Cross-layer transcoder (CLT, Anthropic 2025) | Reads from one layer's residual stream, writes to all subsequent MLP layers | Enables attribution graph construction [11] |
SAEs are powerful but imperfect. They reconstruct activations with some error, and the residual error contains information that the SAE's features alone cannot explain. The features themselves are subjective: human and automated labels can disagree, and many features are partially polysemantic or split across multiple SAE latents. Training large SAEs is expensive: Anthropic's 34M-feature SAE on Claude 3 Sonnet required substantial infrastructure, and training Gemma Scope's 400 SAEs was a major engineering effort. DeepMind itself published a sober progress update in 2025 noting that SAEs had underperformed on several downstream evaluations and that the team was deprioritizing some lines of SAE research in favor of other interpretability methods [9].
While SAEs have dominated headlines since 2023, much of the field's most concrete progress has come from end-to-end reverse engineering of specific behaviors using attention-head and MLP analysis.
Induction heads are the best-understood circuit in transformer language models. An induction head implements a simple algorithm: given a sequence pattern like [A][B] ... [A], the head predicts that B will follow the second occurrence of A. Despite this simplicity, induction heads appear to be the main driver of in-context learning in transformers across a wide range of sizes, from small attention-only models to large LLMs [4].
The induction circuit is two layers deep. In an earlier layer, a previous-token head attends to the token immediately before the current position and copies information about it into the residual stream. In a later layer, the induction head performs a match-and-copy operation: its query (derived from the current token) matches against keys (derived from the output of the previous-token head), effectively finding earlier positions where the same token appeared and copying whatever followed it.
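The match-and-copy signature can be read directly off attention patterns. The sketch below (assuming TransformerLens and GPT-2 small; the threshold is arbitrary) feeds the model a repeated random sequence and scores each head by how much attention it pays to the token that followed the previous occurrence of the current token.

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
seq_len = 50
rand_tokens = torch.randint(1000, 10000, (1, seq_len), device=model.cfg.device)
bos = model.to_tokens("", prepend_bos=True)
tokens = torch.cat([bos, rand_tokens, rand_tokens], dim=1)   # [BOS] + sequence + repeat

_, cache = model.run_with_cache(tokens)

for layer in range(model.cfg.n_layers):
    pattern = cache["pattern", layer]        # [batch, head, query_pos, key_pos]
    # Attention from each position back to the position seq_len - 1 earlier,
    # i.e. the token that followed the first occurrence of the current token.
    diag = pattern.diagonal(offset=1 - seq_len, dim1=-2, dim2=-1)
    scores = diag[0].mean(dim=-1)            # average per head over query positions
    for head, score in enumerate(scores):
        if score.item() > 0.4:               # arbitrary threshold for illustration
            print(f"L{layer}H{head}: induction score {score.item():.2f}")
```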
Olsson, Elhage, Nanda, and 23 collaborators at Anthropic presented six complementary lines of evidence that induction heads are the mechanism behind in-context learning. The most striking is the phase change: induction heads form during a sharp transition early in training, and in-context learning ability improves dramatically at the same training step, visible as a small bump in the loss curve. When induction heads are knocked out at test time, in-context learning performance drops sharply [4]. The result is one of the field's clearest examples of a learned algorithm appearing on a discrete training step.
The IOI circuit, described in Wang, Variengien, Conmy, and collaborators (2022), was the first end-to-end reverse engineering of a substantial natural-language behavior in a real language model. The task involves sentences such as "When Mary and John went to the store, John gave a drink to ___," where the model should predict "Mary," the indirect object [6].
The authors found that GPT-2 small uses 26 attention heads grouped into 7 functional classes to solve this task. Duplicate-token heads identify that a name has appeared twice. S-inhibition heads suppress the repeated name. Name mover heads copy the remaining name to the output position. Backup name movers compensate when the primary name movers are ablated, providing a kind of redundancy. Negative name movers actively decrease the probability of correct names, in a way the authors did not fully explain. The work introduced path patching as a systematic methodology for tracing information flow through the network and proposed three quantitative criteria for circuit explanations: faithfulness, completeness, and minimality [6].
The 2023 ICLR paper "Progress measures for grokking via mechanistic interpretability" by Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt fully reverse-engineered the algorithm a small transformer learns when trained on modular addition. The network turns out to use a discrete Fourier transform and trigonometric identities to convert addition into rotation around a circle. The paper defined progress measures for grokking, splitting training into three phases: memorization, circuit formation, and cleanup. Grokking, in this view, is not a sudden phase change in the loss but a gradual amplification of structured weights followed by the slow removal of memorizing components [19]. The work is one of the cleanest demonstrations that interpretability can recover an algorithm exactly, not just qualitatively.
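The recovered algorithm is compact enough to state in a few lines. The sketch below illustrates the final logit computation described in the paper with made-up key frequencies (the learned set varies by run): summing cosines of (a + b − c) over a handful of frequencies is maximized exactly at c = (a + b) mod p.

```python
import numpy as np

p = 113
key_freqs = [14, 35, 41, 52, 73]      # illustrative frequencies, not a specific trained run

def mod_add_logits(a: int, b: int) -> np.ndarray:
    cs = np.arange(p)
    # cos(2*pi*k*(a+b-c)/p) equals 1 exactly when c == (a + b) mod p, for every frequency k.
    return sum(np.cos(2 * np.pi * k * (a + b - cs) / p) for k in key_freqs)

a, b = 47, 98
print(np.argmax(mod_add_logits(a, b)), (a + b) % p)   # both print 32
```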
Michael Hanna, Ollie Liu, and Alexandre Variengien (NeurIPS 2023) studied how GPT-2 small computes greater-than. Given a sentence like "The war lasted from the year 1732 to the year 17," the model should produce two-digit completions strictly greater than 32. The authors identified a circuit in which MLPs 9 and 10 directly compute the greater-than operation by upweighting years greater than the input, while specific attention heads communicate the relevant year information. The result generalized across diverse contexts, suggesting GPT-2 small uses a real algorithm rather than memorized completions [21].
Subsequent work has identified circuits for docstring completion, acronym prediction, refusal of harmful requests, multilingual translation, multi-hop reasoning, and many small algorithmic behaviors. The 2025 Anthropic paper "On the Biology of a Large Language Model" alone contains nine case studies in Claude 3.5 Haiku, including two-hop deduction, planning ahead in poetry composition, multilingual circuit organization, addition mechanisms, medical diagnosis, entity recognition and hallucination, and refusal mechanisms for harmful requests [12].
The core methodological contribution of MI beyond SAEs is a family of techniques for testing the causal role of internal components.
Activation patching, also called causal tracing or interchange intervention, is the workhorse technique for localizing which components are responsible for a behavior. The procedure: run the model on a clean input that produces the desired behavior, run it again on a corrupted input that does not, then systematically replace activations from the corrupted run with those from the clean run, one component at a time, and measure how much the output changes. Components whose patching restores the clean behavior are causally important [27].
Activation patching can be applied at various granularities: at individual neurons, at attention head outputs, at residual stream positions, at MLP outputs, and at SAE features. The Heimersheim and Nanda 2024 "How to use and interpret activation patching" paper is the standard reference and covers the many subtleties of running these experiments correctly [27].
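A minimal sketch of the procedure is below (assuming TransformerLens; the prompt pair and metric are illustrative, and real experiments sweep token positions as well as layers). Clean residual-stream activations are patched into a corrupted run one layer at a time, and the logit of the clean answer is recorded.

```python
import torch
from functools import partial
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")

clean_tokens = model.to_tokens("When Mary and John went to the store, John gave a drink to")
corrupt_tokens = model.to_tokens("When Mary and John went to the store, Mary gave a drink to")
answer = model.to_single_token(" Mary")

_, clean_cache = model.run_with_cache(clean_tokens)

def patch_resid(resid, hook, pos, clean_cache):
    # resid: [batch, pos, d_model]; overwrite one position with the clean activation.
    resid[:, pos, :] = clean_cache[hook.name][:, pos, :]
    return resid

final_pos = clean_tokens.shape[1] - 1
for layer in range(model.cfg.n_layers):
    hook_name = utils.get_act_name("resid_pre", layer)
    patched_logits = model.run_with_hooks(
        corrupt_tokens,
        fwd_hooks=[(hook_name, partial(patch_resid, pos=final_pos, clean_cache=clean_cache))],
    )
    print(f"layer {layer:2d}: ' Mary' logit after patching = "
          f"{patched_logits[0, -1, answer].item():.2f}")
```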
Path patching, introduced in the IOI paper, extends activation patching to study edges (connections between components) rather than nodes. Instead of patching the entire output of component A, path patching patches only the output of A as it flows into a specific downstream component B, leaving A's influence on all other downstream components unchanged. This gives a much finer picture of information flow and is essential for distinguishing direct effects from indirect effects routed through other components [6].
Attribution patching, introduced in 2023 by Nanda and collaborators, uses gradient-based attribution as a fast approximation to activation patching. Instead of running many forward passes to evaluate every patch, attribution patching uses a single backward pass to estimate the effect of patching every component at once. The 2023 paper "Attribution Patching Outperforms Automated Circuit Discovery" showed that attribution patching could rival ACDC in accuracy at a fraction of the compute cost [28].
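At its core the approximation is one dot product per candidate patch. The helper below is a sketch of the estimate, not the original implementation; corrupt_grad would come from a single backward pass on the metric with gradients retained at the relevant hook points.

```python
import torch

def attribution_score(clean_act: torch.Tensor,
                      corrupt_act: torch.Tensor,
                      corrupt_grad: torch.Tensor) -> torch.Tensor:
    # First-order estimate of how the metric would change if the corrupted
    # activation were replaced by the clean one: (patch direction) . (local gradient).
    return ((clean_act - corrupt_act) * corrupt_grad).sum()
```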
ACDC, introduced by Arthur Conmy and collaborators (NeurIPS 2023), automates circuit discovery. The algorithm iterates from outputs to inputs in the computational graph and at each node tries to remove as many incoming edges as possible while preserving model performance, using KL divergence as the fidelity metric. ACDC successfully rediscovered all five component types in the greater-than circuit and produced a 1,041-edge subgraph for the IOI circuit. Its main weakness is that it tends to miss redundant or backup components, and it struggles with OR-gate-like structures where multiple inputs can produce the same output [20].
In March 2025, Anthropic published two companion papers, "Circuit Tracing: Revealing Computational Graphs in Language Models" (the methods paper) and "On the Biology of a Large Language Model" (the case studies paper). The methods paper introduces cross-layer transcoders (CLTs), a new variant of sparse autoencoder that reads from one layer's residual stream but can write to all subsequent MLP layers. The model's MLPs are replaced with CLTs to produce a replacement model whose features are interpretable [11].
The key innovation is that once the MLPs are replaced with CLTs, and once attention patterns and normalization denominators are frozen, the direct interactions between features become linear. This linearity allows the construction of attribution graphs: causal diagrams depicting the computational steps the model takes to produce a particular output. Attribution graphs contain four kinds of nodes:
| Node type | Role |
|---|---|
| Output nodes | Candidate output tokens at the final position |
| Intermediate nodes | Active CLT features representing interpretable concepts |
| Input nodes | Token embeddings from the prompt |
| Error nodes | Unexplained MLP residual (what the CLTs failed to capture) |
Edges represent linear attributions between nodes. The methodology supports pruning: the analyst can keep only the features that influence the output of interest, leaving a compact, readable graph of the computation [11].
The case studies paper applied circuit tracing to Claude 3.5 Haiku across nine behaviors. Highlights include: the model performs two-hop reasoning internally ("the capital of the state containing Dallas is Austin" is solved by representing "Texas" as an intermediate); the model plans rhyming words when writing poetry, identifying the planned rhyme several tokens before it appears; multilingual circuits involve a mixture of language-specific and abstract, language-independent components, with the language-independent ones more prominent in the larger Haiku than in smaller models; and refusal of harmful requests can be traced to specific features that, when ablated, allow the model to comply with requests it would otherwise refuse [12].
Before SAEs and attribution graphs, the main tool for peering inside transformers was the logit lens, introduced by the pseudonymous researcher nostalgebraist in 2020. The logit lens applies the unembedding matrix to intermediate residual stream states, producing token distributions for each layer that show how the model's prediction takes shape across the depth of the network. The tuned lens, introduced by Nora Belrose and collaborators at EleutherAI in 2023, refines this idea by training an affine probe at each layer to minimize KL divergence with the final output. The tuned lens is more predictive, more reliable, and less biased than the logit lens, and has been shown to detect prompt injection attacks with near-perfect accuracy in some settings [22].
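A logit lens pass is only a few lines with TransformerLens (a sketch under that assumption; the tuned lens would additionally apply a learned affine map per layer before unembedding).

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
prompt = "The Eiffel Tower is located in the city of"
_, cache = model.run_with_cache(model.to_tokens(prompt))

for layer in range(model.cfg.n_layers):
    resid = cache["resid_post", layer][:, -1:, :]          # residual stream at the final position
    layer_logits = model.unembed(model.ln_final(resid))    # final LayerNorm + unembedding
    top_token = model.tokenizer.decode(layer_logits[0, -1].argmax().item())
    print(f"layer {layer:2d}: {top_token!r}")
```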
The field's growth has been enabled by a robust ecosystem of open-source tools.
TransformerLens, originally created by Neel Nanda, is the most widely used Python library for mechanistic interpretability of GPT-style language models. It supports loading 50+ open-source language models and exposes all internal activations as named tensors during a forward pass. Users can cache activations and add hook functions to edit, remove, or replace them as the model runs. Nanda built the library after leaving Anthropic, citing frustration with the lack of open-source tooling for the kind of exploratory work that Anthropic's internal Garcon tool enabled. TransformerLens v3 was released in alpha in 2025 with significant performance improvements for large models [13].
SAE Lens, developed primarily by Joseph Bloom, is the standard library for training and analyzing sparse autoencoders. It supports any PyTorch-based model (not only TransformerLens models) and includes evaluation metrics, feature visualizations, and integration with Neuronpedia.
Neuronpedia, created by Johnny Lin, is the central web platform for SAE feature exploration. It provides feature search, steering, activation visualization, circuit and graph exploration, automated interpretability scoring, and benchmarks. In 2025, Anthropic's circuit tracing tools were integrated with Neuronpedia, making attribution graphs viewable in the browser. The August 2025 "Circuits Research Landscape" page on Neuronpedia became a community hub for tracking new papers and releases [25].
NNsight, from David Bau's group at Northeastern, defines tracing contexts that build computation graphs from PyTorch models for both local and remote execution. The companion National Deep Inference Fabric (NDIF) provides remote access to large open-weights models, lowering the bar for academic researchers without local GPU clusters [15].
Penzai, from Google DeepMind, is a JAX-based toolkit for building, editing, and visualizing neural networks. It supports research into interpretability, model surgery, and training dynamics, and includes a reference Transformer implementation that can load Gemma, Llama, Mistral, and GPT-NeoX/Pythia weights [29].
Pyvene, from Stanford NLP, provides systematic support for causal interventions including activation patching, distributed alignment, and various forms of causal scrubbing.
| Tool | Primary purpose | Maintainer | Key feature |
|---|---|---|---|
| TransformerLens | Model loading and activation access | TransformerLens Org / Neel Nanda | Hook-based activation editing for 50+ models |
| SAE Lens | SAE training and analysis | Joseph Bloom | Model-agnostic SAE training and metrics |
| Neuronpedia | Feature visualization and sharing | Johnny Lin | Web-based exploration of SAE features and attribution graphs |
| NNsight + NDIF | Remote model intervention | NDIF / Northeastern | Run interpretability on large models without local GPUs |
| Penzai | JAX model manipulation | Google DeepMind | Functional model representation in JAX |
| Pyvene | Causal interventions | Stanford NLP | Distributed alignment and causal scrubbing |
| Circuit Tracer | Attribution graph generation | Anthropic | Open-source pipeline for CLTs and attribution graphs |
One of the most striking practical applications of MI is feature steering: directly modifying model behavior at the activation level by clamping SAE features. The technique was developed in the Scaling Monosemanticity work and demonstrated publicly with Golden Gate Claude. Steering is appealing because it can change behavior without retraining, can be applied selectively at inference time, and produces effects that, at least in principle, correspond to interpretable concepts [8][24].
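A simplified sketch of the mechanism is below (assuming TransformerLens; a random unit vector stands in for an SAE decoder column, and adding a scaled direction stands in for the full clamp-and-reconstruct procedure used in the Anthropic work).

```python
import torch
from functools import partial
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")
layer, strength = 6, 8.0
direction = torch.randn(model.cfg.d_model, device=model.cfg.device)
direction /= direction.norm()                  # stand-in for sae.W_dec[feature_id]

def steer(resid, hook, direction, strength):
    # Push the residual stream along the feature direction at every position.
    return resid + strength * direction

hook_name = utils.get_act_name("resid_post", layer)
with model.hooks(fwd_hooks=[(hook_name, partial(steer, direction=direction, strength=strength))]):
    out = model.generate(model.to_tokens("My favorite place to visit is"), max_new_tokens=20)
print(model.to_string(out[0]))
```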
The limits became clearer over time. Anthropic's October 2024 follow-up paper, "Evaluating feature steering," took a more careful look at when steering helps and when it hurts. Steering on small numbers of safety-relevant features sometimes reduced harmful outputs, but often at the cost of degraded general capabilities; steering on larger feature sets produced unpredictable behavior. The paper concluded that feature steering as an alignment technique is promising but not yet a drop-in replacement for RLHF or constitutional AI methods [30].
Goodfire's commercial product, Ember, exposes feature steering as an API for enterprise users. The pitch is that organizations can identify and amplify desirable features ("conciseness," "technical accuracy") and suppress undesirable ones ("refusal of legitimate medical questions," "PII exposure") on top of an existing model without retraining. As of 2026 the technique is widely seen as one of the most plausible commercial applications of mechanistic interpretability [17].
The field's stated goals fall into roughly five clusters.
| Goal | What it looks like in practice |
|---|---|
| Safety auditing | Detect deception, dangerous capabilities, misalignment in production models |
| Debugging | Identify which features and circuits are responsible for specific failures |
| Steering and editing | Modify behavior without retraining via feature clamping, ablation, or activation patching |
| Scientific understanding | Build a mechanistic theory of how transformers and other architectures work |
| Capability evaluation | Ground capability claims in identified algorithms, not just behavioral benchmarks |
The safety case has been the most prominent in 2024-2026. Anthropic's stated goal of being able to reliably detect most AI model problems by 2027 using interpretability tools sets a public benchmark; whether the field can hit it remains an open question.
The field has changed shape considerably in the last two years. Sparse autoencoders dominated 2023 and 2024 as the main object of study, with major SAE releases from Anthropic, OpenAI, and DeepMind. By 2025 the focus had broadened to include attribution graphs, automated circuit discovery, transcoders, and a growing emphasis on practical applications.
Major labs have made interpretability a strategic priority. Anthropic's interpretability team is one of the largest internal teams at the company, and the March 2025 circuit tracing release was accompanied by an open-source tool release intended to lower the entry barrier for outside researchers. DeepMind's Gemma Scope releases follow the same playbook on a different model family. OpenAI has published work on SAEs at GPT-4 scale, and the company maintains a smaller but consistent interpretability output [10][11][9].
The academic and independent ecosystem has expanded. The MATS program has produced a steady stream of researchers. Apollo Research focuses on detecting deceptive alignment using mechanistic methods. Goodfire, EleutherAI, and a number of smaller groups round out a community of perhaps a few hundred full-time researchers worldwide as of 2026. Conferences and workshops dedicated to mechanistic interpretability have grown from informal alignment forum gatherings to peer-reviewed tracks at NeurIPS, ICLR, and ICML.
MIT Technology Review's selection of mechanistic interpretability as a 2026 Breakthrough Technology marked a kind of mainstream arrival. The journalism around the selection emphasized both the field's recent progress (Golden Gate Claude, attribution graphs, multilingual circuits) and the fact that interpretability is now seen by AI companies as part of the deployment safety story, not just a research pursuit [1].
Mech interp has not solved interpretability. Several persistent problems shape the research agenda.
Training SAEs on production-scale models is expensive. Anthropic's 34M-feature SAE on Claude 3 Sonnet and OpenAI's 16M-feature SAE on GPT-4 each required substantial compute. As models grow into the hundreds of billions or trillions of parameters, the cost of extracting and analyzing features grows roughly proportionally. Attribution graph construction adds another layer: every prompt of interest requires its own analysis, and scaling MI to millions of production prompts is currently infeasible [8][10].
A persistent worry is whether MI explanations faithfully represent what the model is actually doing. SAEs introduce reconstruction error: the features they extract might be artifacts of autoencoder training rather than genuine features of the underlying model. The error nodes in attribution graphs explicitly track the unexplained portion of computation, but how to interpret a graph in which 30% of the computation is unexplained is unclear. Heimersheim, Nanda, and others have written extensively about the difficulty of validating that a circuit explanation actually captures the model's behavior on out-of-distribution inputs [11][27].
Even when a circuit is identified, it is hard to verify that the identified circuit is complete. Backup name movers in the IOI circuit and similar redundancy mechanisms in other circuits show that models often have multiple pathways for the same behavior; an analysis that ignores these will overstate how clean the model's algorithm is. The IOI authors made completeness a stated criterion for circuit explanations partly in response to these concerns [6].
Labeling what a feature represents requires human or automated judgment, and labels disagree. Many features are partially polysemantic, fire on multiple loosely related concepts, or split a single concept across several SAE latents. Automated interpretability methods, in which a language model is asked to describe what activates a feature, are useful but unreliable enough that human verification remains necessary for high-stakes claims. Anthropic, OpenAI, and DeepMind have all spent significant effort on automated feature labeling without yet producing a reliable solution [8][10].
The circuit tracing methodology freezes attention patterns and works only on the MLP-replaced replacement model. The QK-circuits that govern attention itself remain largely outside the framework. Progress on interpretable attention is one of the field's open frontiers as of 2026 [11].
In a noteworthy 2025 progress update, the DeepMind mechanistic interpretability team reported that several downstream tasks where SAEs were expected to outperform other methods showed negative or null results. The team announced it was deprioritizing some lines of SAE research in favor of probing-based approaches, attention-circuit work, and other directions. The update was a useful reminder that the field's central technique is not yet a settled solution.
A short list of open problems that the field is actively working on as of 2026:
| Problem | Why it matters |
|---|---|
| Better SAE architectures | Reduce cost, improve fidelity, eliminate dead and ultra-rare latents |
| Interpretable attention | Attention QK-circuits are not yet handled by attribution graphs |
| Cross-layer dynamics | Features change meaning across layers; current tools do not track this well |
| Multi-token computation | Most analyses look at one token at a time; sequential reasoning is harder to capture |
| Automated feature labeling | Manual labeling does not scale to millions of features |
| Faithfulness validation | Need rigorous tests that circuit explanations match actual model behavior |
| Cost reduction | Make MI cheap enough to apply to every prompt or every model checkpoint |
| Connection to alignment | Translate mechanistic findings into actionable safety guarantees |
| Combining SAEs with circuits | Sparse Feature Circuits and circuit tracing both push in this direction |
| Generalization across model families | Most work is on transformers; do similar circuits exist in Mamba, diffusion models, MoE models? |
The overarching question is whether mechanistic interpretability can scale and mature fast enough to be useful for safety work on frontier models. Anthropic's stated goal of detecting most model problems by 2027 is an aggressive bet on yes; the DeepMind 2025 progress update is a reminder of how hard the bet is. The field's next phase will determine whether MI becomes a routine part of model deployment, the way unit tests are routine for software, or remains a research practice applied selectively to the most consequential systems.