Mechanistic interpretability is a subfield of AI safety and machine learning research that aims to reverse-engineer the internal computations of neural networks, particularly transformers, to understand how they process information and arrive at their outputs. Rather than treating models as black boxes and studying only their input-output behavior, mechanistic interpretability seeks to identify the specific algorithms, representations, and circuits that a model has learned during training. The field has grown rapidly since 2020, driven by the recognition that understanding what is happening inside increasingly powerful AI systems is critical for ensuring their safe deployment.
As large language models have grown in capability and deployment, the need to understand their internal workings has become more pressing. Traditional approaches to model evaluation focus on behavioral testing: measuring accuracy on benchmarks, probing for biases in outputs, or checking alignment with human preferences. While behavioral testing remains valuable, it cannot reveal whether a model has learned a genuinely robust algorithm or is relying on spurious correlations, nor can it reliably predict how a model will behave in novel situations outside the test distribution.
Mechanistic interpretability addresses this gap by looking inside the model. The core analogy is to reverse engineering in computer science or biology: just as a security researcher might disassemble a binary to understand what a program does, or a neuroscientist might trace neural circuits to understand how the brain processes information, mechanistic interpretability researchers attempt to identify the computational structures that neural networks use to perform tasks [1].
This stands in contrast to several other interpretability approaches:
| Approach | Method | Limitations |
|---|---|---|
| Behavioral interpretability | Probing model inputs and outputs | Cannot reveal internal mechanisms |
| Feature attribution (e.g., saliency maps) | Highlighting which input tokens influenced the output | Shows correlation, not causation |
| Probing classifiers | Training small classifiers on internal representations | Reveals what information is present but not how it is used |
| Mechanistic interpretability | Reverse-engineering internal circuits and features | More labor-intensive but provides causal understanding |
Several research organizations have made foundational contributions to mechanistic interpretability.
Anthropic Interpretability Team. Led by Chris Olah, whose 2020 "Zoom In" essay (written at OpenAI) helped define the circuits agenda, Anthropic's interpretability team has produced many of the field's landmark papers, including "Toy Models of Superposition," "Towards Monosemanticity," and "Scaling Monosemanticity." In 2025, the team published the circuit tracing work that represented a major step toward understanding computations in production-scale language models. Anthropic has publicly stated its goal to reliably detect most AI model problems by 2027 using interpretability tools [2].
Google DeepMind. Neel Nanda leads the mechanistic interpretability team at Google DeepMind. Before joining DeepMind, Nanda worked at Anthropic under Chris Olah and subsequently conducted influential independent research. He created TransformerLens, one of the most widely used tools in the field, and has published important work on induction heads, grokking, and other mechanistic phenomena [3].
EleutherAI. The open-source AI research collective EleutherAI has contributed to mechanistic interpretability through both tool development and research. EleutherAI researchers have used TransformerLens to study reinforcement learning from human feedback (RLHF)-trained models and have contributed to the broader ecosystem of open-source interpretability tools [4].
MATS (ML Alignment Theory Scholars). The MATS program has trained many mechanistic interpretability researchers and supported a range of research projects, including work on sparse autoencoders, circuit discovery, and feature analysis. Several widely cited papers in the field originated from MATS scholars [5].
In the context of mechanistic interpretability, a "feature" refers to a meaningful unit of representation within a neural network. The term is deliberately informal: a feature is any property of an input that a sufficiently large neural network would dedicate a neuron to represent. For example, a language model might have features corresponding to concepts like "the text is written in French," "this token is inside a quotation," or "the current topic is about legal contracts" [6].
A longstanding observation in neural network research is that individual neurons often do not correspond to clean, interpretable features. Instead, a single neuron may activate in response to multiple unrelated concepts, a phenomenon called polysemanticity. Conversely, a single concept may be spread across many neurons. This makes it difficult to understand what a model is doing by examining individual neurons in isolation.
Superposition is the phenomenon where a neural network represents more features than it has dimensions (neurons) by encoding features as directions in activation space rather than as individual neurons. In a model with, say, 512 neurons in a given layer, the network might represent thousands of distinct features by assigning each feature to a specific direction in the 512-dimensional space. The features are not aligned with the neuron basis but instead form an overcomplete set of directions [7].
This is possible because real-world features tend to be sparse: for any given input, only a small fraction of all possible features are relevant. As long as two features rarely co-occur, the network can tolerate some interference between them. The network essentially compresses a high-dimensional feature space into a lower-dimensional neuron space, relying on the sparsity of activation patterns to keep interference manageable.
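This compression can be illustrated numerically (illustrative sizes, not taken from any particular model): random unit directions in a 512-dimensional space are nearly orthogonal, so a sparse sum of a few "feature" directions can be read back out with only modest interference.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, d_model = 2000, 512  # more features than dimensions

# Assign each feature a random unit direction in activation space.
W = rng.normal(size=(n_features, d_model))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# A sparse input: only a handful of features are active at once.
active = rng.choice(n_features, size=5, replace=False)
x = W[active].sum(axis=0)  # superposed activation vector

# Reading each feature out along its direction: active features score
# near 1, while inactive features pick up only small interference
# because random high-dimensional directions are nearly orthogonal.
scores = W @ x
print(scores[active].round(2))                            # all close to 1
print(np.abs(np.delete(scores, active)).max().round(2))   # much smaller
```

If features co-occurred densely rather than sparsely, the interference terms would accumulate and this readout would fail, which is exactly the sparsity condition described above.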
The theoretical framework for understanding superposition was developed in "Toy Models of Superposition" (Elhage et al., 2022), which used small synthetic networks to demonstrate when and why superposition arises. The paper showed that superposition exhibits phase transitions: as features become sparser, the network abruptly switches from representing features monosemantically (one feature per neuron) to representing them in superposition. The paper also revealed a surprising connection to the geometry of uniform polytopes and provided preliminary evidence linking superposition to adversarial examples [7].
A circuit is a subgraph of a neural network's computational graph that implements a specific behavior or algorithm. The circuits framework, introduced in Olah et al. (2020), proposes that neural networks can be understood as compositions of features connected by weights, forming meaningful circuits analogous to logic circuits in digital hardware. The claim is threefold: (1) features are the fundamental units of neural networks, (2) these features are connected by weights to form circuits, and (3) these circuits can be understood through analogies to real-world concepts [1].
Circuit analysis involves identifying which components of a model (attention heads, MLP layers, specific features) are responsible for a particular behavior, and then tracing how information flows between these components to produce the final output.
The field has developed through a series of landmark publications, each building on prior work.
| Year | Paper | Authors / Organization | Key Contribution |
|---|---|---|---|
| 2020 | "Zoom In: An Introduction to Circuits" | Olah, Cammarata, et al. / OpenAI | Introduced the circuits framework for understanding neural networks [1] |
| 2021 | "A Mathematical Framework for Transformer Circuits" | Elhage, Nanda, et al. / Anthropic | Formalized how to analyze transformer circuits mathematically [8] |
| 2022 | "In-context Learning and Induction Heads" | Olsson, Elhage, Nanda, et al. / Anthropic | Identified induction heads as a key mechanism for in-context learning [9] |
| 2022 | "Toy Models of Superposition" | Elhage, Hume, et al. / Anthropic | Provided theoretical framework for understanding superposition [7] |
| 2022 | "Interpretability in the Wild" | Wang, Variengien, Conmy, et al. | Reverse-engineered the indirect object identification circuit in GPT-2 small [10] |
| 2023 | "Towards Monosemanticity" | Bricken, Templeton, et al. / Anthropic | Demonstrated sparse autoencoders can extract interpretable features from language models [6] |
| 2024 | "Scaling Monosemanticity" | Templeton, Conerly, et al. / Anthropic | Scaled sparse autoencoders to Claude 3 Sonnet, extracting millions of interpretable features [11] |
| 2025 | "Circuit Tracing" | Anthropic Interpretability Team | Introduced cross-layer transcoders and attribution graphs for production-scale models [2] |
Sparse autoencoders (SAEs) have become the primary tool for addressing superposition and extracting interpretable features from neural networks. The core idea is straightforward: if features are encoded as directions in a lower-dimensional activation space, an autoencoder with a wider hidden layer and a sparsity constraint should be able to decompose activations back into the individual features.
A sparse autoencoder is trained to reconstruct the internal activations of a language model at a specific layer. The autoencoder has an encoder that maps the model's residual stream activations (of dimension d_model) to a much larger hidden layer (of dimension d_sae, where d_sae >> d_model), and a decoder that maps back to the original dimension. The sparsity constraint ensures that for any given input, only a small number of hidden units (features) are active [6].
Formally, given a model activation vector x, the SAE computes:
f = ReLU(W_enc * (x - b_dec) + b_enc)
x_hat = W_dec * f + b_dec
where f is the sparse feature activation vector, W_enc and W_dec are the encoder and decoder weight matrices, and b_enc and b_dec are bias terms. The training loss combines a reconstruction term (how well x_hat matches x) with a sparsity penalty on f.
The decoder columns of W_dec represent the feature directions in the model's activation space. When a feature is active, its decoder column is added to the residual stream, effectively steering the model's computation in a particular direction. This provides a natural interpretation: each feature is a direction in activation space, and the feature's activation strength indicates how much that direction contributes to the current computation.
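As a concrete sketch, the forward pass and loss above can be written out directly. This uses untrained random weights and illustrative sizes, purely to show the shapes and loss terms involved; a real SAE would be trained by gradient descent on cached model activations.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 512, 4096  # d_sae >> d_model (illustrative sizes)

# Randomly initialised SAE parameters (a trained SAE would learn these).
W_enc = rng.normal(size=(d_sae, d_model)) * 0.01
W_dec = rng.normal(size=(d_model, d_sae)) * 0.01
b_enc = np.zeros(d_sae)
b_dec = np.zeros(d_model)

def sae_forward(x, sparsity_coeff=1e-3):
    # f = ReLU(W_enc (x - b_dec) + b_enc): sparse feature activations
    f = np.maximum(W_enc @ (x - b_dec) + b_enc, 0.0)
    # x_hat = W_dec f + b_dec: reconstruction from active feature directions
    x_hat = W_dec @ f + b_dec
    recon_loss = np.sum((x - x_hat) ** 2)
    l1_penalty = sparsity_coeff * np.sum(f)  # pushes most features to zero
    return f, x_hat, recon_loss + l1_penalty

x = rng.normal(size=d_model)        # stand-in residual-stream activation
f, x_hat, loss = sae_forward(x)
print(f.shape, x_hat.shape)         # (4096,) (512,)
```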
Anthropic's "Towards Monosemanticity" paper (Bricken, Templeton, et al., 2023) demonstrated that sparse autoencoders trained on a one-layer transformer could extract features that were significantly more interpretable and monosemantic than individual neurons. The paper showed features that corresponded to specific, identifiable concepts: particular languages, programming constructs, types of text, and more. This provided the first strong evidence that the dictionary learning approach using sparse autoencoders could resolve superposition in a scalable, unsupervised manner [6].
The "Scaling Monosemanticity" paper (Templeton, Conerly, et al., 2024) extended this approach to Claude 3 Sonnet, a production-scale language model. The team trained sparse autoencoders of different sizes, extracting approximately 1 million, 4 million, and 34 million features. The resulting features were highly abstract: they were multilingual, multimodal, and generalized between concrete and abstract references. For example, a feature associated with the Golden Gate Bridge would activate for text mentions of the bridge in any language, for images of the bridge, and even for abstract references to San Francisco landmarks [11].
The paper also demonstrated that these features could be used to steer model behavior. By artificially increasing the activation of specific features, the researchers could cause the model to behave differently in predictable ways. Safety-relevant features were identified, including features related to deception, sycophancy, bias, and dangerous content, suggesting that sparse autoencoders could become a tool for monitoring and controlling model behavior [11].
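Mechanically, this kind of steering amounts to adding a scaled copy of a feature's decoder direction to the residual stream. A minimal sketch, using random unit-norm directions and an illustrative steering coefficient rather than trained values:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 512, 4096
W_dec = rng.normal(size=(d_model, d_sae))
W_dec /= np.linalg.norm(W_dec, axis=0, keepdims=True)  # unit feature directions

def steer(residual, feature_idx, alpha):
    """Feature steering (sketch): add a multiple of one feature's decoder
    direction to the residual stream, nudging the computation toward the
    concept that feature represents."""
    return residual + alpha * W_dec[:, feature_idx]

x = rng.normal(size=d_model)
x_steered = steer(x, feature_idx=123, alpha=5.0)

# The projection onto the chosen direction grows by exactly alpha.
print(round(float((x_steered - x) @ W_dec[:, 123]), 1))  # 5.0
```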
Several variants of the basic sparse autoencoder architecture have been explored:
| Variant | Key Difference | Advantage |
|---|---|---|
| TopK SAE | Uses a fixed number of active features (top-k) instead of a learned threshold | Better control over sparsity level |
| BatchTopK SAE | Applies top-k selection across a batch rather than per sample | Improved sparsity/reconstruction trade-off with stable training |
| Gated SAE | Uses a gating mechanism to separate feature detection from magnitude estimation | Reduced shrinkage bias |
| Transcoder | Maps from one layer's activations to another layer's activations | Captures cross-layer computations |
| Cross-layer transcoder (CLT) | Reads from one layer's residual stream but can write to all subsequent MLP layers | Enables attribution graph construction [2] |
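For example, the TopK variant replaces the ReLU-plus-L1 recipe with a hard selection of the k largest pre-activations, fixing the number of active features exactly instead of controlling it indirectly through a penalty. A sketch of the nonlinearity (illustrative k):

```python
import numpy as np

def topk_activation(pre_acts, k=32):
    """TopK SAE nonlinearity (sketch): keep only the k largest
    pre-activations and zero the rest, so at most k features fire."""
    f = np.zeros_like(pre_acts)
    idx = np.argpartition(pre_acts, -k)[-k:]   # indices of k largest values
    f[idx] = np.maximum(pre_acts[idx], 0.0)    # ReLU on the survivors
    return f

pre = np.random.default_rng(0).normal(size=4096)
f = topk_activation(pre, k=32)
print(np.count_nonzero(f))  # at most 32 active features
```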
Induction heads are one of the best-understood circuits in transformer models. An induction head implements a simple but powerful algorithm: given a sequence pattern like [A][B] ... [A], the head predicts that B will follow the second occurrence of A. Despite its simplicity, this mechanism is believed to be a core driver of in-context learning in transformers, with the most direct evidence coming from small models [9].
The induction circuit is a two-layer circuit. In the earlier layer, a "previous token head" attends to the token immediately before the current position and copies information about it into the residual stream. In the later layer, the induction head performs a match-and-copy operation: its query (derived from the current token A) matches against keys (derived from the output of the previous token head), effectively finding positions where A appeared before and copying the token that followed it.
Olsson et al. (2022) presented six complementary lines of evidence that induction heads are the mechanistic source of general in-context learning. They showed that induction heads form during a specific "phase change" early in training, and that in-context learning improves dramatically at exactly the same point. When induction heads are knocked out at test time, in-context learning performance drops significantly [9].
The indirect object identification (IOI) circuit, described in Wang et al. (2022), was at publication the largest end-to-end reverse engineering of a natural behavior in a language model. The task involves sentences like "When Mary and John went to the store, John gave a drink to __", where the model should predict "Mary" (the indirect object, i.e., the name that is not the subject of the final clause) [10].
The researchers found that GPT-2 small uses 26 attention heads grouped into 7 functional classes to solve this task. The circuit includes duplicate token heads (which identify that a name has appeared twice), S-inhibition heads (which suppress the repeated name), and name mover heads (which copy the remaining name to the output position). The work introduced path patching as a systematic methodology for tracing information flow through the network [10].
Activation patching (also called causal tracing or interchange intervention) is a fundamental technique for localizing which components of a model are responsible for a particular behavior. The procedure involves running the model on two inputs: a "clean" input that produces the desired behavior and a "corrupted" input that does not. The researcher then systematically replaces (patches) internal activations from the corrupted run with those from the clean run, one component at a time, and measures how much the output changes. Components whose patching restores the clean behavior are identified as causally important for that behavior [12].
Path patching extends activation patching to study edges (connections between components) rather than individual nodes. Instead of patching the entire output of a component, path patching patches only the output of component A as it flows into component B, leaving A's influence on all other downstream components unchanged. This provides a much more fine-grained picture of how information flows through the network. Path patching was introduced in the IOI circuit paper and has since become a standard technique in circuit discovery [10].
In March 2025, Anthropic published "Circuit Tracing: Revealing Computational Graphs in Language Models," which introduced a new approach to understanding model internals at production scale. The method replaces a model's MLP layers with cross-layer transcoders (CLTs), a new type of sparse autoencoder. Unlike standard SAEs that read from and write to the same layer, CLTs read from one layer's residual stream but can provide output to all subsequent MLP layers. Each CLT feature reads from the residual stream at its layer using a linear encoder followed by a nonlinearity [2].
The key innovation is that once the MLPs are replaced with CLTs, the direct interactions between features become linear (after freezing attention patterns and normalization denominators). This linearity allows researchers to construct attribution graphs: causal diagrams depicting the computational steps the model takes to produce a particular output. Attribution graphs contain four types of nodes:
| Node Type | Description |
|---|---|
| Output nodes | Candidate tokens at the output position |
| Intermediate nodes | Active transcoder features representing interpretable concepts |
| Input nodes | Token embeddings from the input sequence |
| Error nodes | Unexplained MLP portions (reconstruction error) |
Edges in the graph represent linear attributions between nodes. The methodology enables researchers to prune away features that do not influence the output under investigation, leaving a compact, interpretable graph of the computation [2].
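The pruning step can be illustrated on a hypothetical toy graph (node names and weights invented for illustration; the actual pipeline derives these attributions from the trained transcoders):

```python
# Toy attribution graph: edges carry linear attribution weights, and
# nodes whose total influence on the output falls below a threshold
# are pruned, leaving a compact explanatory subgraph.
edges = {  # (source, target): attribution weight (illustrative values)
    ("emb_the", "feat_A"): 0.90,
    ("emb_the", "feat_B"): 0.02,
    ("feat_A", "output"): 0.80,
    ("feat_B", "output"): 0.03,
}

def influence_on(node, target, edges):
    """Total attribution from `node` to `target`, summed over all paths
    (path attributions multiply along edges in a linear graph)."""
    if node == target:
        return 1.0
    return sum(w * influence_on(dst, target, edges)
               for (src, dst), w in edges.items() if src == node)

threshold = 0.05
nodes = {n for edge in edges for n in edge}
kept = {n for n in nodes
        if abs(influence_on(n, "output", edges)) >= threshold}
print(sorted(kept))  # feat_B's influence (0.03) falls below the threshold
```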
The accompanying case studies paper applied circuit tracing to Claude 3.5 Haiku across nine behavioral case studies, including multi-step reasoning (two-hop deduction), planning in poem composition, multilingual circuit organization, addition mechanisms, medical diagnosis processes, entity recognition and hallucination, and refusal mechanisms for harmful requests. The work demonstrated that mechanisms behind multi-step reasoning, hallucination, and jailbreak resistance can be surfaced and studied at the scale of production models [2].
The mechanistic interpretability community has developed a robust ecosystem of open-source tools.
TransformerLens is a Python library created by Neel Nanda for mechanistic interpretability of GPT-style language models. It supports loading over 50 different open-source language models and exposes all internal activations during a forward pass. Users can cache any internal activation (attention patterns, residual stream states, MLP outputs) and add hook functions to edit, remove, or replace activations as the model runs. TransformerLens v3, released in alpha in 2025, works well with large models and offers significantly more flexibility than earlier versions [3].
SAE Lens is a library for training and analyzing sparse autoencoders, developed primarily by Joseph Bloom. It supports training SAEs on any PyTorch-based model (not just TransformerLens models) and includes built-in metrics for evaluating SAE performance. SAE Lens integrates with external tools like Neuronpedia for feature visualization and sharing. The v6 release represented a major refactor of the training code structure [13].
Neuronpedia is an open-source platform for interpretability research, created by Johnny Lin. It provides a web interface for exploring, visualizing, and sharing sparse autoencoder features. Capabilities include feature steering, activation visualization, circuit and graph exploration, automated interpretability scoring, inference, search and filtering, dashboards, and benchmarks. In 2025, Anthropic's circuit tracing tools were integrated with Neuronpedia, making attribution graphs accessible through a web frontend [14].
| Tool | Primary Purpose | Creator / Maintainer | Key Feature |
|---|---|---|---|
| TransformerLens | Model loading and activation access | Neel Nanda / Bryce Meyer | Hook-based activation editing for 50+ models |
| SAE Lens | SAE training and analysis | Joseph Bloom | Model-agnostic SAE training with quality metrics |
| Neuronpedia | Feature visualization and sharing | Johnny Lin | Web-based exploration of SAE features and circuits |
| pyvene | Intervention and activation patching | Stanford NLP | Systematic support for causal interventions |
| NNsight | Remote model access and editing | NDIF | Enables interpretability research on large models without local compute |
One of the most significant challenges is scaling mechanistic interpretability to the largest models. Training sparse autoencoders on production-scale models requires enormous compute resources. Anthropic's "Scaling Monosemanticity" work trained autoencoders with up to 34 million features on Claude 3 Sonnet, requiring substantial infrastructure. As models grow to hundreds of billions or trillions of parameters, the cost of extracting and analyzing features grows correspondingly [11].
A persistent concern is whether the explanations produced by mechanistic interpretability methods faithfully represent what the model is actually doing. Sparse autoencoders introduce reconstruction error: they do not perfectly reconstruct the original activations. The features they extract might be artifacts of the autoencoder training rather than genuine features of the underlying model. Similarly, attribution graphs based on cross-layer transcoders leave some computation unexplained (captured by "error nodes"), and the significance of this unexplained portion is an open question [2].
The circuit tracing approach has additional limitations. Attention circuits (QK-circuits) remain largely unexplained by the current methodology, and the freezing of attention patterns and normalization denominators introduces approximations whose impact is not fully characterized [2].
Even when circuits are successfully identified, it is difficult to verify that the identified circuit is complete. There may be additional pathways and backup mechanisms that contribute to the behavior but are not captured by the analysis. The IOI circuit, for example, identified 26 attention heads, but the researchers acknowledged that other heads might play supporting roles that were not detected by their methodology [10].
Manual analysis of features and circuits does not scale. Researchers have explored automated approaches where language models are used to generate descriptions of what features represent, and these descriptions are then evaluated for accuracy. While promising, automated interpretability methods are still unreliable enough that human verification remains necessary for high-stakes claims [11].
Mechanistic interpretability has entered a period of rapid growth and increasing institutional recognition. MIT Technology Review named mechanistic interpretability a 2026 Breakthrough Technology, reflecting the field's growing maturity and the urgency of the problems it addresses [15].
Anthropic's circuit tracing work on Claude 3.5 Haiku, published in March 2025, demonstrated that it is possible to trace meaningful computational pathways in production-scale language models. The open-sourcing of circuit tracing tools, including a Python library compatible with any open-weights model and a frontend hosted on Neuronpedia, significantly lowered the barrier to entry for researchers outside of large labs [2].
In July 2025, Anthropic published a circuits update describing further progress, including improvements to transcoder training, better methods for handling attention circuits, and expanded case studies. The pace of publications from the Anthropic interpretability team and others has accelerated, with new papers appearing regularly on topics including SAE variants, circuit discovery algorithms, and applications to model safety [16].
OpenAI has also invested in sparse autoencoders, applying them to GPT-4 in 2024 and publishing results on feature extraction at scale. The convergence of multiple major labs on similar approaches suggests a growing consensus that dictionary learning with sparse autoencoders, combined with circuit-level analysis, represents the most promising path toward understanding large neural networks [17].
The field faces important open questions. Can mechanistic interpretability scale to the largest frontier models with hundreds of billions of parameters? Can it provide actionable safety guarantees, not just post-hoc explanations? Can automated methods reduce the need for labor-intensive manual analysis? These questions will shape the next phase of research as the community works toward what Anthropic has called the goal of building interpretability tools that can reliably detect model problems before they cause harm.