Transcoder
Last reviewed
May 19, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,247 words
A transcoder is a sparse neural network used in mechanistic interpretability research to approximate the input-to-output function of a component inside a transformer (most commonly an MLP sublayer) using a wider, sparsely activating hidden representation. Like a sparse autoencoder (SAE), a transcoder learns an overcomplete dictionary of features that are intended to be more monosemantic than the original neurons, but unlike an SAE it is not trained to reconstruct its own input. Instead, it takes the input of the targeted component and is trained to predict that component's output, effectively learning a sparse, interpretable replacement for the component.[^1][^2]
Transcoders were introduced in the paper "Transcoders Find Interpretable LLM Feature Circuits" by Jacob Dunefsky, Philippe Chlenski, and Neel Nanda, which was accepted at NeurIPS 2024.[^1][^3] They have since been extended and used in several follow-up works, most notably in Anthropic's "Circuit Tracing" research on Claude 3.5 Haiku, which uses a variant called a cross-layer transcoder (CLT) to build attribution graphs.[^4][^5]
Mechanistic interpretability aims to reverse engineer neural networks into computational circuits that can be understood by humans. A core difficulty is that individual neurons in large language models are often polysemantic, meaning they respond to multiple unrelated concepts as a result of superposition.[^1] Sparse autoencoders have become a leading tool for extracting more monosemantic features from a model's internal activations, but standard SAEs only describe what features exist at a particular point in the network. They do not directly explain how an MLP transforms those features.[^2]
A transcoder addresses this gap by sitting in place of an MLP rather than next to it. Given the input activations to an MLP sublayer, a transcoder produces an approximate version of that MLP's output by routing the computation through a sparse bottleneck of learned features. Because each feature has both an encoder direction (which determines when it activates) and a decoder direction (which determines what it writes to the output), the function computed by the MLP can be decomposed into a small set of weighted feature interactions. This decomposition is the central reason transcoders are useful: it cleanly separates the input-dependent part of the computation (which features fire) from the input-invariant part (how feature activations are mapped to outputs via the decoder).[^1]
In a transformer, each layer typically contains an attention sublayer and an MLP sublayer. Attention has been studied in detail because it is approximately linear given the attention pattern. MLPs are harder to analyze because they apply a nonlinearity to a high-dimensional vector and entangle many features into each neuron. Dunefsky, Chlenski, and Nanda described the problem this way: interpretable features are typically linear combinations of many neurons, each with its own nonlinearity, so naive circuit analysis through MLPs either produces intractably large graphs or fails to disentangle local and global behavior.[^1][^3]
SAEs partially solve this by giving researchers a sparse description of activations at the input or output of an MLP. However, an SAE trained on MLP input activations and another SAE trained on MLP output activations are independent dictionaries with no built-in mapping between them. To trace how an input feature contributes to an output feature one has to compute attribution through the MLP itself, which reintroduces the dense, nonlinear behavior the researcher was trying to avoid.[^1]
A transcoder solves this in a different way. Rather than describing both endpoints of the MLP, it learns one sparse description that operates as a stand-in for the entire MLP sublayer. The encoder of the transcoder reads MLP inputs, the decoder writes MLP outputs, and the sparsity penalty ensures that only a small number of features are active for any given input. Once trained, the transcoder can be substituted for the MLP at inference time. If the substitution preserves model behavior on downstream tokens, then the transcoder's features can be treated as a faithful, interpretable model of what that MLP actually computes.[^1][^2]
The standard transcoder used in Dunefsky et al. is a wide, single hidden layer ReLU MLP placed in parallel to a target MLP sublayer.[^1] Let MLP(x) denote the function computed by the original MLP sublayer of a transformer, where x is the MLP's input vector. The transcoder TC(x) is defined as:
z_TC(x) = ReLU(W_enc x + b_enc)
TC(x) = W_dec z_TC(x) + b_dec
Here W_enc is an encoder matrix that projects the d_model-dimensional input into a much wider d_features-dimensional hidden space, and W_dec is a decoder matrix that projects back to d_model. b_enc and b_dec are bias vectors. The hidden dimension d_features is typically several times larger than d_model, making the dictionary overcomplete, while sparsity is enforced via the loss function so that only a small number of features are nonzero on any given input.[^1]
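The two equations above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the helper names `init_transcoder` and `transcoder_forward`, the toy dimensions, and the random initialization scheme are all assumptions made for the example.

```python
import numpy as np

def init_transcoder(d_model, d_features, seed=0):
    """Random parameters for a toy transcoder (illustrative init, an assumption)."""
    rng = np.random.default_rng(seed)
    return {
        "W_enc": rng.normal(0.0, d_model ** -0.5, size=(d_features, d_model)),
        "b_enc": np.zeros(d_features),
        "W_dec": rng.normal(0.0, d_features ** -0.5, size=(d_model, d_features)),
        "b_dec": np.zeros(d_model),
    }

def transcoder_forward(params, x):
    """Return (z, y) for a batch x of shape (batch, d_model)."""
    # z_TC(x) = ReLU(W_enc x + b_enc): sparse feature activations
    z = np.maximum(x @ params["W_enc"].T + params["b_enc"], 0.0)
    # TC(x) = W_dec z_TC(x) + b_dec: prediction of the MLP's output
    y = z @ params["W_dec"].T + params["b_dec"]
    return z, y

# Overcomplete dictionary: d_features is several times d_model.
params = init_transcoder(d_model=16, d_features=64)
x = np.random.default_rng(1).normal(size=(4, 16))  # stand-in for MLP inputs
z, y = transcoder_forward(params, x)
print(z.shape, y.shape)  # (4, 64) (4, 16)
```

In a real setting `x` would be the activations entering the target MLP sublayer, and the parameters would be trained rather than sampled.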
The transcoder is trained to minimize a loss combining mean squared reconstruction error against the original MLP's output and an L1 sparsity penalty on the hidden activations:
L_TC(x) = || MLP(x) - TC(x) ||_2^2 + lambda_1 || z_TC(x) ||_1
The L1 term acts as a differentiable surrogate for the L0 "norm" that counts the number of active features, and the hyperparameter lambda_1 controls the tradeoff between sparsity and reconstruction fidelity.[^1] This loss is structurally similar to the loss used to train SAEs but differs in the first term: an SAE penalizes the squared error between its output and its own input, whereas a transcoder penalizes the squared error between its output and the MLP's output.[^1]
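As a worked example of the loss, the sketch below evaluates L_TC on a single hand-picked input. The function name `transcoder_loss` and the toy numbers are assumptions for illustration; averaging over the batch is also a choice made here, not something the formula specifies.

```python
import numpy as np

def transcoder_loss(target, pred, z, lambda_1):
    """Squared reconstruction error plus L1 sparsity penalty, averaged over the batch.

    target, pred: (batch, d_model); z: (batch, d_features).
    For a transcoder, target is MLP(x). Passing target = x instead
    gives the SAE-style loss with the same structure.
    """
    sq_err = np.sum((target - pred) ** 2, axis=-1)  # || MLP(x) - TC(x) ||_2^2
    l1 = np.sum(np.abs(z), axis=-1)                 # || z_TC(x) ||_1
    return float(np.mean(sq_err + lambda_1 * l1))

mlp_out = np.array([[1.0, 2.0]])   # MLP(x)
tc_out = np.array([[0.0, 0.0]])    # TC(x)
z = np.array([[3.0, 0.0, 1.0]])    # hidden activations, two of three active
loss = transcoder_loss(mlp_out, tc_out, z, lambda_1=0.5)
print(loss)  # 7.0  (squared error 5 + 0.5 * L1 norm 4)
```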
Dunefsky and colleagues trained transcoders on GPT2-small, Pythia-410M, and Pythia-1.4B, using around 3.2 million tokens of OpenWebText data for the sparsity-accuracy evaluations described in the paper. They built on training infrastructure from Joseph Bloom's SAE library (SAELens) and released their analysis code at the transcoder_circuits GitHub repository under standard open-source terms.[^1][^6]
A 2025 follow-up by Goncalo Paulo, Stepan Shabalin, and Nora Belrose, "Transcoders Beat Sparse Autoencoders for Interpretability", introduced two further variants. First, they replaced the L1 penalty with a TopK activation function that fixes the number of active features per input directly, eliminating the need to tune lambda_1. They tested k values of 32, 64, and 128.[^2] Second, they introduced skip transcoders, which add an affine skip connection from input to output:
f(x) = W_2 TopK(W_1 x + b_1) + W_skip x + b_2
Here W_2 and W_skip are initialized to zero, and b_2 is initialized to the mean of the MLP output. This affine bypass absorbs the easy-to-model linear component of the MLP, allowing the sparse bottleneck to focus on the nonlinear part and reducing reconstruction error without compromising interpretability.[^2] All variants in that paper were trained with mean squared error against the original MLP output and no auxiliary loss term beyond the sparsity mechanism implied by TopK.[^2]
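The skip transcoder and its initialization can be sketched as follows. This is a hedged illustration under stated assumptions: the `topk` helper keeps the k largest pre-activations unchanged (whether an additional ReLU is applied is not specified in the text above), and all names, shapes, and the random stand-in data are hypothetical.

```python
import numpy as np

def topk(pre, k):
    """Keep the k largest entries in each row of pre and zero out the rest."""
    # argpartition leaves the indices of the k largest values in the last k slots
    idx = np.argpartition(pre, -k, axis=-1)[..., -k:]
    out = np.zeros_like(pre)
    np.put_along_axis(out, idx, np.take_along_axis(pre, idx, axis=-1), axis=-1)
    return out

def skip_transcoder_forward(x, W_1, b_1, W_2, W_skip, b_2, k):
    """f(x) = W_2 TopK(W_1 x + b_1) + W_skip x + b_2 for x of shape (batch, d_model)."""
    z = topk(x @ W_1.T + b_1, k)
    return z, z @ W_2.T + x @ W_skip.T + b_2

rng = np.random.default_rng(0)
d_model, d_features, k = 8, 32, 4
x = rng.normal(size=(5, d_model))          # stand-in for MLP inputs
mlp_out = rng.normal(size=(5, d_model))    # stand-in for MLP outputs

W_1 = rng.normal(size=(d_features, d_model))
b_1 = np.zeros(d_features)
W_2 = np.zeros((d_model, d_features))      # zero-initialized decoder
W_skip = np.zeros((d_model, d_model))      # zero-initialized skip connection
b_2 = mlp_out.mean(axis=0)                 # mean of the MLP output

z, y = skip_transcoder_forward(x, W_1, b_1, W_2, W_skip, b_2, k)
```

With this initialization the untrained skip transcoder is a constant function: every row of `y` equals `b_2`, the mean MLP output, and exactly `k` features are active per input.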
The central conceptual difference between a transcoder and an SAE is what they reconstruct.[^1][^2]
An SAE is trained to minimize || x - SAE(x) ||_2^2 + lambda_1 || z_SAE(x) ||_1, so the network learns a sparse coding of the activation distribution at one point in the network.[^1]
A transcoder replaces x with MLP(x) on the reconstruction side while keeping the sparsity penalty on the hidden activations.[^1]
This difference has practical consequences for circuit analysis. With an SAE, one obtains a feature dictionary at one location, but the MLP between two SAE dictionaries remains opaque and must be analyzed using attribution techniques that compute through the dense nonlinearity. With a transcoder, the MLP is replaced by a sparse linear-then-nonlinear-then-linear computation whose connectivity is given by the encoder and decoder weights. The contribution of an upstream transcoder feature i at layer l to a downstream transcoder feature i' at layer l' can be written, on a given input, as the product of the upstream activation and a dot product of weight vectors:
contribution = z_TC^(l,i)(x) * ( f_dec^(l,i) . f_enc^(l',i') )
The first factor is input-dependent (it depends on whether the upstream feature fires), and the second factor is input-invariant (it is a fixed property of the trained transcoder weights). Dunefsky et al. emphasize this clean factorization as the main reason transcoders enable weights-based circuit analysis through MLPs, which is harder to do with stacked SAEs alone.[^1]
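The factorization can be made concrete with a small NumPy sketch that computes the contribution for every pair of upstream and downstream features at once. It is a simplified illustration: it considers only the direct path from the upstream decoder to the downstream encoder and ignores layer normalization and attention, and the function names and random weights are assumptions.

```python
import numpy as np

def pairwise_weights(W_dec_up, W_enc_down):
    """Input-invariant factor: f_dec^(l,i) . f_enc^(l',i') for every pair (i, i').

    W_dec_up: (d_model, n_up), columns are upstream decoder directions.
    W_enc_down: (n_down, d_model), rows are downstream encoder directions.
    """
    return W_dec_up.T @ W_enc_down.T  # shape (n_up, n_down)

def contributions(z_up, weights):
    """Input-dependent contributions: z_up[i] * weights[i, i']."""
    return z_up[:, None] * weights

rng = np.random.default_rng(0)
d_model, n_up, n_down = 8, 20, 12
W_dec_up = rng.normal(size=(d_model, n_up))
W_enc_down = rng.normal(size=(n_down, d_model))

weights = pairwise_weights(W_dec_up, W_enc_down)   # computed once from the weights
z_up = np.maximum(rng.normal(size=n_up), 0.0)      # upstream activations on one input
contrib = contributions(z_up, weights)

# Summing over upstream features recovers what the upstream transcoder's
# output (without its bias) adds to each downstream pre-activation.
direct = W_enc_down @ (W_dec_up @ z_up)
print(np.allclose(contrib.sum(axis=0), direct))  # True
```

Upstream features that are inactive on the input (`z_up[i] == 0`) contribute nothing, regardless of how large their weight-based connection is.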
Quantitatively, Dunefsky and colleagues reported that transcoders perform at least on par with SAEs along three axes that they used to compare interpretability tools: sparsity (mean L0 of features), faithfulness (cross-entropy loss difference when the transcoder or SAE replaces the corresponding MLP in the model), and human-judged interpretability. They reported a blind interpretability evaluation in which raters judged 50 random features from each method, finding 41/50 transcoder features and 38/50 SAE features interpretable on GPT2-small. They also reported that the sparsity-versus-faithfulness Pareto frontier of transcoders was equal to or better than that of SAEs across the models they tested, and that the gap appears to widen on larger models.[^1]
Paulo, Shabalin, and Belrose extended this comparison in 2025 using an automated interpretability pipeline driven by Llama 3.1 70B as a scorer. On Pythia-160M with 32 active latents, skip transcoders achieved a fuzzing score of 86.4 percent versus 74.6 percent for SAEs, a detection score of 80.9 percent versus 70.2 percent, and a simulation score of 0.47 versus 0.28. They also reported that on Gemma 2 2B, skip transcoders incurred a 0.5 percent cross-entropy loss increase relative to the original model while explaining 67.1 percent of variance, compared with 1.1 percent loss increase and 16.5 percent variance explained for SAEs. They concluded that, in their setting, skip transcoders dominated both transcoders and SAEs on a joint Pareto frontier of reconstruction quality and interpretability.[^2]
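A third member of the SAE family, called a crosscoder, was introduced by Jack Lindsey, Adly Templeton, Jonathan Marcus, Thomas Conerly, Joshua Batson, and Christopher Olah at Anthropic in October 2024.[^7] Like an SAE and a transcoder, a crosscoder uses a sparse bottleneck of learned features. Its distinguishing property is that it reads from and writes to multiple layers at once. Lindsey and colleagues contrasted the three approaches in terms of where each one reads and writes: an autoencoder encodes and reconstructs activations at a single point, a transcoder reads activations at one point and predicts activations at a later one, and a crosscoder reads from and writes to several layers jointly.[^7]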
The motivation for crosscoders is cross-layer superposition: the phenomenon where the network represents a single conceptual feature using directions in the residual stream that drift across many adjacent layers. A per-layer SAE will detect such a feature several times at different depths, while a crosscoder can absorb the entire trajectory into a single feature with multiple decoder vectors, one per layer it writes to. Anthropic also proposed using crosscoders for model diffing, where a single sparse dictionary is jointly trained across multiple model variants (e.g., a base model and a finetune) so that shared and divergent features can be identified directly.[^7]
The relationship between transcoders and crosscoders is that both move beyond the single-point view of an SAE, but in different directions. A transcoder collapses the function between two points (an MLP's input and output) into sparse features. A crosscoder collapses the representation across many points (the residual stream at multiple layers) into sparse features. The two ideas were combined in Anthropic's follow-up work on attribution graphs through a structure called a cross-layer transcoder.[^4][^5]
In March 2025, Anthropic published a pair of papers, "Circuit Tracing: Revealing Computational Graphs in Language Models" and "On the Biology of a Large Language Model", which introduced the cross-layer transcoder (CLT) as the dictionary used to build attribution graphs for Claude 3.5 Haiku and a smaller 18-layer model.[^4][^5][^8]
A CLT consists of a set of features distributed across L layers, the same number of layers as the underlying transformer. Each feature is associated with a layer at which it reads from the residual stream and computes its activation, but its decoder writes to the MLP outputs of that layer and of all subsequent layers of the original model. The activations are computed using JumpReLU, which is a thresholded activation that produces either zero or the raw pre-activation depending on whether it exceeds a learned threshold. In notation used by the methods paper, the feature activation at layer l is:[^4]
a^l = JumpReLU( W_enc^l x^l )
and the reconstruction of the MLP output at layer l aggregates contributions from features at the current and all earlier layers:
yhat^l = sum over l' <= l of W_dec^(l' -> l) a^l'
Training jointly minimizes the mean squared reconstruction error across all layers and a tanh-based sparsity penalty of the form lambda * sum_i tanh( c * || W_dec,i^l || * a_i^l ), where lambda and c are hyperparameters.[^4]
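The CLT equations above can be sketched on a toy model as follows. This is an illustration under stated assumptions rather than Anthropic's implementation: one residual-stream vector per layer stands in for a full forward pass, thresholds are fixed instead of learned, and the norm in the penalty is taken over each feature's decoder vectors to all of its target layers concatenated, which is one reading of || W_dec,i^l ||. All names are hypothetical.

```python
import numpy as np

def jump_relu(pre, theta):
    """Pass a pre-activation through unchanged if it exceeds its threshold, else 0."""
    return np.where(pre > theta, pre, 0.0)

def clt_forward(xs, W_enc, theta, W_dec):
    """xs[l]: (d_model,) residual stream at layer l.
    W_enc[l]: (n_feat, d_model); theta[l]: (n_feat,).
    W_dec[(src, tgt)]: (d_model, n_feat), defined for src <= tgt.
    """
    n_layers = len(xs)
    # a^l = JumpReLU(W_enc^l x^l)
    acts = [jump_relu(W_enc[l] @ xs[l], theta[l]) for l in range(n_layers)]
    # yhat^l = sum over l' <= l of W_dec^(l' -> l) a^l'
    y_hat = [sum(W_dec[(src, l)] @ acts[src] for src in range(l + 1))
             for l in range(n_layers)]
    return acts, y_hat

def clt_sparsity_penalty(acts, W_dec, lam, c):
    """lam * sum_i tanh(c * ||W_dec,i^l|| * a_i^l), summed over layers l."""
    n_layers = len(acts)
    total = 0.0
    for src in range(n_layers):
        # squared norm of each feature's decoder vectors over all target layers
        sq = sum(np.sum(W_dec[(src, tgt)] ** 2, axis=0) for tgt in range(src, n_layers))
        total += np.sum(np.tanh(c * np.sqrt(sq) * acts[src]))
    return lam * float(total)

rng = np.random.default_rng(0)
n_layers, d_model, n_feat = 3, 4, 6
xs = [rng.normal(size=d_model) for _ in range(n_layers)]
W_enc = [rng.normal(size=(n_feat, d_model)) for _ in range(n_layers)]
theta = [np.full(n_feat, 0.5) for _ in range(n_layers)]
W_dec = {(s, t): rng.normal(size=(d_model, n_feat))
         for s in range(n_layers) for t in range(s, n_layers)}

acts, y_hat = clt_forward(xs, W_enc, theta, W_dec)
penalty = clt_sparsity_penalty(acts, W_dec, lam=1e-3, c=1.0)
```

Note that the reconstruction at the first layer depends only on first-layer features, while the reconstruction at the last layer sums decoder outputs from features at every layer.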
Anthropic trained CLTs ranging from 300 thousand to 10 million features on the 18-layer model and up to 30 million features on Claude 3.5 Haiku. They report that on GPT-2-style settings the CLT achieves explained variance of approximately 0.8 and replacement score of approximately 0.8, with attribution graph completeness around 0.95 on standard prompts. They also report that on the 18-layer model, the largest CLT explains roughly half of the original model's computation in the sense that the local replacement model reproduces the original's output on around fifty percent of prompts.[^4]
The motivation for the cross-layer design, as described in the methods paper, is that per-layer transcoders (PLTs) trained independently at each layer tend to produce attribution graphs with long, redundant paths in which intermediate features simply amplify or relay information across layers. CLTs absorb these patterns into single features that span multiple layers, which substantially shortens the resulting attribution graphs. Anthropic reports an average path length of 2.3 steps for CLT-based graphs versus 3.7 steps for per-layer transcoder graphs in the same setting. Independent comparisons have reported that on certain circuit recovery benchmarks, CLT-based circuits recover much more of the original model's accuracy than PLT-based ones.[^4][^9]
Beyond the choice of transcoder, the Anthropic methodology uses a local replacement model, which is obtained by freezing the attention patterns and normalization denominators of the original model on a specific prompt, substituting the CLT for the MLPs, and adding small per-position error correction nodes so that the replacement model's activations exactly match those of the underlying model on that prompt. Edges in the attribution graph are then computed by tracing linear gradients through this local replacement model, with thresholds applied to keep only the strongest connections. The result is a graph of interpretable feature nodes and weighted edges that approximates how the model produced its output on that specific prompt.[^4]
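The role of the error correction nodes can be illustrated with a very small sketch. It covers only the residual bookkeeping at one layer, not attention freezing or gradient tracing, and the arrays are random stand-ins; the function name `error_nodes` is an assumption.

```python
import numpy as np

def error_nodes(mlp_out, clt_out):
    """Per-position residual between the true MLP output and the CLT reconstruction."""
    return mlp_out - clt_out

rng = np.random.default_rng(0)
n_positions, d_model = 5, 8
mlp_out = rng.normal(size=(n_positions, d_model))                   # true MLP outputs on one prompt
clt_out = mlp_out + 0.1 * rng.normal(size=(n_positions, d_model))   # imperfect reconstruction

err = error_nodes(mlp_out, clt_out)
patched = clt_out + err   # what the local replacement model passes downstream

# Fraction of the MLP output's norm that the error nodes have to account for.
relative_error = np.linalg.norm(err) / np.linalg.norm(mlp_out)
```

Because the error terms are computed for one specific prompt, `patched` matches `mlp_out` on that prompt by construction. The error nodes have no interpretable inputs of their own, so a larger `relative_error` means more of the computation is left unexplained by the graph.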
A recurring methodological question in transcoder research is how to compare a transcoder to a sparse autoencoder or to another transcoder. The benchmarks used in the literature group around three axes.[^1][^2]
Sparsity measures how many features are active on a typical input. The most common metric is mean L0 of the hidden activations, that is, the average number of features whose activations are nonzero on a forward pass through the dictionary. TopK activations fix this quantity directly, while JumpReLU dictionaries control it indirectly through learned thresholds and L1-trained dictionaries control it indirectly through the penalty weight.[^1][^2]
Faithfulness measures how well the dictionary reproduces the function it is meant to model. For transcoders this is typically reported as the increase in the model's overall cross-entropy loss when the original MLP is replaced by the transcoder, and sometimes as the fraction of variance in the MLP output explained by the transcoder. Lower loss increase and higher explained variance both indicate better faithfulness.[^1][^2]
Interpretability measures whether the resulting features correspond to human-understandable concepts. Dunefsky et al. used a human-rater protocol on a random sample of features, supplemented by case studies and blind interpretation exercises.[^1] Paulo, Shabalin, and Belrose used an automated interpretability pipeline driven by Llama 3.1 70B, producing detection, fuzzing, and simulation scores for each feature based on whether an automated rater could distinguish positive examples from negatives.[^2] Both protocols are imperfect approximations of human understanding, but they have allowed direct numerical comparisons between dictionaries on shared test sets.
A consistent finding across these evaluations is that transcoders and SAEs occupy similar regions of the sparsity-faithfulness Pareto frontier when trained on comparable data, with transcoders meeting or exceeding SAEs on the interpretability axis. Skip transcoders shift the frontier further by reducing reconstruction error at fixed sparsity while preserving interpretability.[^1][^2]
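The sparsity and faithfulness metrics described above are simple to compute once activations have been collected. The sketch below shows mean L0 and one common definition of explained variance (one minus the ratio of residual to total variance, pooled over dimensions); papers differ in the exact normalization, so this definition is an assumption, as are the function names and toy values.

```python
import numpy as np

def mean_l0(z):
    """Average number of nonzero features per input; z has shape (batch, d_features)."""
    return float(np.mean(np.count_nonzero(z, axis=-1)))

def explained_variance(target, pred):
    """1 - (residual sum of squares) / (total sum of squares around the mean)."""
    resid = np.sum((target - pred) ** 2)
    total = np.sum((target - target.mean(axis=0)) ** 2)
    return float(1.0 - resid / total)

z = np.array([[0.0, 1.0, 0.0, 2.0],
              [0.0, 0.0, 0.0, 3.0]])
print(mean_l0(z))  # 1.5  (two active features, then one)

mlp_out = np.array([[0.0, 0.0], [2.0, 4.0]])   # true MLP outputs
tc_out = np.array([[0.0, 1.0], [2.0, 3.0]])    # transcoder predictions
ev = explained_variance(mlp_out, tc_out)       # residual 2, total 10, so about 0.8
```

The third common metric, the increase in cross-entropy loss when the transcoder replaces the MLP, requires running the full language model and is not shown here.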
Transcoders have been used in several specific case studies that demonstrate their value for circuit discovery.
Dunefsky et al. used GPT2-small transcoders to revisit the so-called "greater-than circuit", which is studied with prompts such as "The war lasted from 1737 to 17". The model is expected to continue with a two-digit number greater than 37, so that the end year falls after the start year. Using transcoders, the authors identified that MLP10 contains a small set of features that fire predominantly on two-digit numbers, which is striking given that GPT-2's vocabulary contains more than 50,000 tokens. They reported that around 24 transcoder features were sufficient to capture most of the circuit's behavior, far fewer than would be needed at the neuron level, and used the input-invariant decomposition to characterize how each feature boosts the logits for subsequent years.[^1]
To test whether transcoder features can be interpreted from their structure alone, Dunefsky et al. performed blind case studies in which a researcher inspected only the encoder and decoder weights of a candidate feature, without looking at the actual tokens on which it activates, and then formed a hypothesis about its behavior. In one example, they identified that a feature in GPT2-small fires on semicolons that occur in academic citation patterns such as (Vaswani et al. 2017; Elhage et al. 2021), by recognizing that the upstream features included surname-like and year-like detectors with the appropriate positional relationships. Six of the nine case studies they reported were carried out under restrictions that blocked access to direct token information for parts of the analysis.[^1]
Anthropic used the CLT-based attribution graph methodology to study a wide range of behaviors in Claude 3.5 Haiku, including multilingual processing, planning during poetry generation, mathematical reasoning, and patterns associated with hallucination and unfaithful reasoning. They report that the local replacement models constructed from CLTs reproduced the underlying model's outputs in a substantial fraction of cases, and they used the resulting attribution graphs to identify mechanisms that produced specific completions on specific prompts. The companion blog post describes ten behaviors examined in this way.[^5][^8]
A separate line of work on sparse feature circuits by Samuel Marks, Can Rager, Eric J. Michaud, Yonatan Belinkov, David Bau, and Aaron Mueller, published as "Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models" on arXiv in March 2024 and later appearing at ICLR 2025, also targets the goal of building circuits out of human-interpretable features rather than attention heads or raw neurons. While that paper used SAEs and other sparse coders as its primary feature dictionary, subsequent work has explored using transcoders within similar circuit-discovery pipelines, treating the transcoder features as the nodes whose causal contributions are estimated.[^10]
The original transcoder implementation by Dunefsky was released at github.com/jacobdunefsky/transcoder_circuits, including training utilities adapted from Joseph Bloom's SAELens library and Jupyter notebooks for the case studies in the paper. The repository depends on TransformerLens, an MIT-licensed library for mechanistic interpretability of transformer language models that exposes hooks at the input and output of each sublayer.[^1][^6][^11]
SAELens itself, hosted at github.com/decoderesearch/SAELens, supports training and analyzing a range of sparse coding variants, including standard SAEs, TopK and JumpReLU SAEs, and, more recently, transcoder-style configurations and crosscoders as these have spread through the wider interpretability ecosystem.[^12] Pretrained transcoder weights for several models have been released on Hugging Face by individual researchers; for example, sets of per-layer transcoders trained on small language models have been published as community collections.[^9]
For attribution graph construction in the style of Anthropic's CLT work, third-party tools such as Neuronpedia's circuit-tracing interface and open-source libraries for cross-layer transcoders have appeared, allowing researchers outside Anthropic to train CLTs, build attribution graphs, and visualize the resulting feature circuits.[^9]
Transcoders inherit many of the limitations of sparse dictionaries in general, and the CLT variant introduces some additional ones.
Approximation error. A transcoder is only an approximation of the underlying MLP. The cross-entropy loss of the model when the transcoder is substituted in place of the MLP is higher than the original model's cross-entropy loss, and the explained variance of the MLP output is less than one. Dunefsky et al. quantify this through their faithfulness measure and emphasize that any claim about the original model based on a transcoder analysis is only as faithful as the transcoder itself.[^1] In the CLT setting, Anthropic similarly notes that on Claude 3.5 Haiku the CLT achieves a normalized mean reconstruction error of around 21.7 percent and that substantial error nodes persist in attribution graphs.[^4]
Polysemantic and abstraction-mismatched features. Even though transcoder features are designed to be more monosemantic than raw neurons, Anthropic reports that CLT features in Claude 3.5 Haiku can still respond to unrelated concepts and may not be at the level of abstraction that would make a circuit easiest to understand. They cite an example of a feature that activates for the token "rhythm", for Michael Jordan, and for several other unrelated concepts, and they note that this kind of imperfect monosemanticity persists despite the sparse training objective.[^4]
Missing attention mechanisms. The transcoder replaces only the MLP. It does not explain how attention patterns are formed. Anthropic explicitly states that their attribution graph methodology does not attempt to explain attention pattern formation, which makes the method blind to phenomena that are driven primarily by attention, such as induction heads. Their example graphs may show direct edges from token-level features to downstream features without exposing the actual attention mechanism that connects them.[^4]
Inhibitory and absence-based circuits. Transcoder-based attribution graphs primarily explain why features that are active are active. They are less effective at explaining why other features did not fire. Anthropic notes that many interesting circuits involve features inhibiting other features, and that suppression effects are not well represented in their graphs.[^4]
Mechanistic faithfulness of CLTs. Because a CLT has a different causal structure from the underlying model (features at one layer can write to many later layers), perturbations to CLT features do not always produce the same downstream effects as corresponding interventions on the underlying model. Anthropic reports that perturbation discrepancies compound across layers, with magnitudes showing weaker correlation than directions, and that the local replacement model only guarantees matching outputs on the specific prompts on which the per-position error corrections were computed, not mechanistic equivalence in general.[^4]
Disagreements between PLTs and CLTs. Independent analyses comparing per-layer transcoders and cross-layer transcoders have observed that the two approaches sometimes imply sharply different circuit-level interpretations for the same underlying model behavior. This raises the question of which kind of transcoder, if either, is reporting the true mechanism, and is a current area of active research.[^9]
Training cost. Training high-quality transcoders, and particularly large CLTs with tens of millions of features, requires substantial compute. Anthropic trained CLTs with up to 30 million features on Claude 3.5 Haiku, and follow-up work by Paulo, Shabalin, and Belrose used 8 billion tokens of training data per transcoder. The cost is amortized across many subsequent analyses, but it remains a barrier to applying transcoders to the very largest production models.[^2][^4]