Crosscoder
Last reviewed
May 19, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,817 words
A crosscoder is a mechanistic interpretability technique introduced by Anthropic in October 2024 that generalizes the sparse autoencoder (SAE) by simultaneously reading from and writing to multiple components of a neural network. While a conventional SAE is trained to reconstruct the activations of a single layer of a single model, a crosscoder learns a shared dictionary of features whose encoders and decoders can span multiple layers, multiple model checkpoints, or entirely different models. This change of scope enables two applications that single-layer SAEs cannot perform directly: explicit modelling of features that are spread across layers, and "model diffing," the localization of representational differences between two related models such as a base language model and its fine-tuned variant.[^1]
Crosscoders were proposed by Jack Lindsey, Adly Templeton, Jonathan Marcus, Thomas Conerly, Joshua Batson, and Christopher Olah in the research update "Sparse Crosscoders for Cross-Layer Features and Model Diffing," published on Anthropic's Transformer Circuits Thread on October 25, 2024.[^1] The release was described by the authors as preliminary work rather than a fully self-contained paper, and a substantial follow-up community literature has since built on the original framework, including peer-reviewed work at NeurIPS 2025 on overcoming sparsity artifacts and a 2025 Anthropic update on improving model diffing quality.[^2][^3]
The motivation for crosscoders comes from a recurring obstacle in mechanistic interpretability: individual neurons in modern language models are typically polysemantic, meaning that a single unit activates on a heterogeneous mixture of unrelated concepts. The dominant explanation for this phenomenon is the superposition hypothesis, formalized in Anthropic's 2022 "Toy Models of Superposition" paper, which proposes that networks represent more features than they have dimensions by encoding each feature as a sparse linear combination of neurons.[^4] Under this view, the "natural" interpretable units of a network are not individual neurons but a set of feature directions in activation space.
Sparse autoencoders attempt to recover those directions empirically. A conventional SAE for an interpretability target encodes a layer's activation vector into a much larger, sparsely active hidden vector, then decodes it back to the original activation, while a sparsity penalty (typically L1 on the hidden codes) discourages the use of more features than necessary. Anthropic's "Scaling Monosemanticity" report demonstrated that this approach scales to production-grade models, with SAEs trained on Claude 3 Sonnet recovering large libraries of interpretable features ranging from concrete (specific cities, code constructs) to highly abstract (security vulnerabilities, sycophancy, deception).[^5]
Two limitations of single-layer SAEs motivated the crosscoder generalization. First, the same conceptual feature often appears in several layers of a transformer because the transformer residual stream carries information forward and many computations are spread across multiple attention and MLP blocks. Training a separate SAE per layer recovers the same feature redundantly and obscures the fact that it is a single concept. Second, SAEs are not natively designed to compare two models, so questions like "what new representations does instruction tuning introduce?" cannot be answered without ad hoc post-hoc alignment of separately trained dictionaries.[^1]
In the original presentation, the authors describe crosscoders informally as variants of SAEs that "read and write to multiple layers": "Where autoencoders encode and predict activations at a single layer, and transcoders use activations from one layer to predict the next, a crosscoder reads and writes to multiple layers."[^1] A crosscoder therefore takes as input a collection of activation tensors (for example, residual stream vectors from several layers, or the activations of two different models at a chosen layer) and is trained to reconstruct all of them simultaneously with a single shared dictionary of features.
Formally, a crosscoder maintains a feature bank of size N. Each feature i has, for each input source s in the set of sources (layers or models), an encoder direction and a decoder direction. Given the bundle of source activations, the encoder produces a single sparse activation vector indexed by features; the decoder then projects each feature's activation back into each source's activation space through the source-specific decoder, and the loss is the sum of reconstruction errors across sources plus a sparsity penalty on the shared feature activations. The original Anthropic update used an L1 penalty with a fixed coefficient and trained the model end-to-end on activations gathered from the chosen sources.[^1][^6]
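The construction above can be written compactly. The notation here is an assumption chosen for illustration, since the article does not fix symbols: $a^{s}(x)$ is the activation of source $s$ on input $x$, $W_{\text{enc}}^{s}$ and $W_{\text{dec}}^{s}$ are the source-specific encoder and decoder weights, $S$ is the set of sources, and $\lambda$ is the sparsity coefficient. Weighting each feature's activation by the sum of its per-source decoder norms is one common way to write the L1 penalty, and it may not match every implementation exactly.

```latex
\begin{aligned}
f(x) &= \mathrm{ReLU}\Big(\sum_{s \in S} W_{\text{enc}}^{s}\, a^{s}(x) + b_{\text{enc}}\Big) \in \mathbb{R}^{N}, \\
\hat{a}^{s}(x) &= W_{\text{dec}}^{s}\, f(x) + b_{\text{dec}}^{s}, \\
\mathcal{L}(x) &= \sum_{s \in S} \big\lVert a^{s}(x) - \hat{a}^{s}(x) \big\rVert_2^{2}
  + \lambda \sum_{i=1}^{N} f_i(x) \sum_{s \in S} \big\lVert W_{\text{dec},i}^{s} \big\rVert_2 .
\end{aligned}
```

With a single source ($|S| = 1$) these expressions reduce to an ordinary sparse autoencoder.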
A key property follows directly from this construction: because feature activations are shared across sources, the magnitudes (or norms) of the decoder directions across sources reveal where each feature is "used." A feature with a large decoder norm in one source and a near-zero decoder norm in another can be interpreted as being specific to the first source; a feature with comparable decoder norms across sources is shared. This norm-decomposition is what makes crosscoders useful for both cross-layer feature analysis and model diffing.[^1][^2]
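The norm decomposition can be sketched in a few lines. This is a minimal illustration with plain Python lists and toy numbers; the function names `decoder_norms` and `norm_shares` are hypothetical, and a real implementation would use an array library over trained weights.

```python
import math

def decoder_norms(decoders):
    """decoders[s][i] is feature i's decoder vector for source s.
    Returns norms[i][s]: the L2 norm of feature i's decoder in source s."""
    n_sources = len(decoders)
    n_features = len(decoders[0])
    return [
        [math.sqrt(sum(w * w for w in decoders[s][i])) for s in range(n_sources)]
        for i in range(n_features)
    ]

def norm_shares(norms_for_feature):
    """Fraction of a feature's total decoder norm carried by each source."""
    total = sum(norms_for_feature)
    if total == 0.0:
        return [0.0 for _ in norms_for_feature]
    return [n / total for n in norms_for_feature]

# Toy example: 2 sources, 3 features, 2-dimensional activations.
decoders = [
    [[3.0, 4.0], [1.0, 0.0], [0.0, 0.0]],  # source 0
    [[0.0, 0.0], [0.0, 1.0], [6.0, 8.0]],  # source 1
]
norms = decoder_norms(decoders)
shares = [norm_shares(n) for n in norms]
# Feature 0 lives only in source 0, feature 1 is shared, feature 2 lives only in source 1.
```

When the sources are layers rather than models, the same per-feature norm profile shows which span of layers a feature occupies.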
The Anthropic update distinguishes three architectural families along the dimension of how information is allowed to flow between sources. The naming reflects how much causal structure of the underlying model the crosscoder is forced to respect.[^1]
An acausal crosscoder ignores any causal ordering among its sources. Each feature can read from and write to every source freely. When the sources are multiple layers, this means features can be encoded from later-layer activations and decoded to earlier-layer activations, which obviously cannot reflect a real computational mechanism in the underlying network but is convenient for purely descriptive analysis of "what is represented where." The authors trained a global, acausal crosscoder on the residual stream activations of all layers of an 18-layer model and compared it against eighteen single-layer SAEs trained separately, finding that the crosscoder identifies shared structure across layers and can lump duplicates together as one cross-layer feature.[^1][^6]
A weakly causal crosscoder allows a feature to read from earlier layers and write to later layers, but not vice versa, approximating the directionality of the forward pass while still permitting a single feature to span multiple write targets. This is closer to a multi-layer transcoder, and it preserves enough causal structure to be plugged into circuit analysis, but the relaxed encoder allows the feature to absorb representations from anywhere upstream.[^1]
A strict crosscoder enforces strict causal ordering at both encoding and decoding: a feature's encoder reads only from one layer (or strictly earlier layers in some variants), and the decoder writes only to that layer or later. Strict crosscoders are closest in spirit to ordinary transcoders, but they can still share features across multiple write layers, which is the defining novelty.[^1] The authors note that combinations are possible: for example, strictly causal crosscoders for MLP outputs together with weakly causal crosscoders for attention outputs, which mirrors the asymmetric way computation flows through a transformer block.[^6]
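One way to picture the three families is as read and write masks over layers. The sketch below follows the descriptions in the preceding paragraphs literally; the notion of a feature's "home" layer and the function `source_masks` are assumptions made for illustration, and published variants may draw the boundaries slightly differently.

```python
def source_masks(variant, home, n_layers):
    """Return (read_mask, write_mask) over layers for a feature anchored at
    layer `home`. True means the feature may read from / write to that layer."""
    layers = range(n_layers)
    if variant == "acausal":
        # No ordering constraint: read and write everywhere.
        read = [True for _ in layers]
        write = [True for _ in layers]
    elif variant == "weakly_causal":
        # Read from the home layer and earlier, write to the home layer and later.
        read = [layer <= home for layer in layers]
        write = [layer >= home for layer in layers]
    elif variant == "strict":
        # Read from the home layer only, write to the home layer and later.
        read = [layer == home for layer in layers]
        write = [layer >= home for layer in layers]
    else:
        raise ValueError(f"unknown variant: {variant}")
    return read, write

strict_read, strict_write = source_masks("strict", home=1, n_layers=4)
```

In a trainer, masks like these would typically be multiplied into the per-layer encoder and decoder weights so that disallowed blocks stay at zero.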
The simplest way to place crosscoders in the larger taxonomy of sparse coding models is by the number of sources they read from and write to.[^1][^7] An ordinary SAE has one source: it reads activations from a single layer and reconstructs the same activations. A transcoder has two sources but a fixed input-output relationship: it reads from one layer's input and predicts the output of a downstream component, typically an MLP block. A crosscoder generalizes this to an arbitrary set of sources for both reading and writing, with the further property that the same feature dictionary is shared across all sources.
In particular, a per-layer SAE is the degenerate special case of an acausal crosscoder with a single source. A single-pair strict crosscoder is essentially a transcoder. The defining new behavior, when the source set has size greater than one, is the cross-source aggregation: each feature is forced to "explain" its activation across all sources at once, which is what makes the recovered features candidates for being the same concept across layers or models.[^1]
The first headline application is using a crosscoder over multiple layers of a single model. In a transformer, the residual stream at layer L is the sum of the token embedding and all earlier components' contributions, and many features persist across many layers because no MLP or attention head erases them. Training one SAE per layer would in this case allocate a separate dictionary slot to the same feature once for every layer it persists through, producing a misleadingly large feature inventory and obscuring the fact that there is one underlying concept being carried through the network.[^1]
The Anthropic update reports that when an acausal crosscoder is trained jointly on residual stream activations across all layers of an 18-layer model, the learned features exhibit a clear pattern: many features have decoder norms that are large across a contiguous span of layers (indicating a cross-layer feature) while others are localized to a single layer or a narrow band. The authors interpret this as evidence of a significant degree of redundant, linearly correlated structure across layers, which the crosscoder can compress into single feature directions, allowing the user to reason about a network's representations at a coarser, more conceptual level.[^1][^6]
Beyond compressing representations, cross-layer crosscoders are framed as a substrate for circuit simplification. In the standard per-layer SAE setting, a circuit diagram must include cross-layer "identity edges" wherever the same feature is carried forward from one layer to the next; these edges add visual clutter and obscure the actual computation. If features are represented as cross-layer entities from the start, those identity edges collapse, and only the substantive computational steps (where one feature genuinely causes another) remain. The October 2024 update flagged this benefit as a primary motivation but explicitly deferred concrete circuit analysis results to future work.[^1]
That future work emerged in 2025 with Anthropic's attribution graphs line of research. The methods paper "Circuit Tracing: Revealing Computational Graphs in Language Models" introduced the cross-layer transcoder (CLT), an architecture in which each feature in layer L reads from the residual stream at layer L using a linear encoder followed by a nonlinearity, and then contributes to the reconstruction of the MLP outputs of layers L, L+1, ..., up to the final layer.[^8] A CLT can be viewed as a particular variant in the broader crosscoder family: it has multiple write targets (a per-feature property characteristic of crosscoders) but maintains a layer-indexed reading structure (more in the spirit of a transcoder). CLTs are reported to substitute for a model's MLPs while matching the underlying model's outputs in roughly half of cases on the training distribution, and they form the substrate on which the attribution graphs work investigated Claude 3.5 Haiku's internal mechanisms in "On the Biology of a Large Language Model."[^8][^9]
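The read and write pattern of a cross-layer transcoder can be sketched as follows. This is a toy illustration under stated assumptions: ReLU stands in for the unspecified nonlinearity, biases are omitted, and the names `clt_reconstruct`, `enc`, and `dec` are hypothetical rather than taken from any released code.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def clt_reconstruct(resid, enc, dec):
    """resid[l]: residual-stream vector at layer l.
    enc[l]: encoder matrix (n_features x d_model) for features living at layer l.
    dec[(l, t)]: decoder matrix (d_model x n_features) from layer-l features
                 to the MLP output of layer t, defined only for t >= l.
    Returns mlp_hat[t]: the reconstructed MLP output at each layer t."""
    n_layers = len(resid)
    d_model = len(resid[0])
    feats = [relu(matvec(enc[l], resid[l])) for l in range(n_layers)]
    mlp_hat = [[0.0] * d_model for _ in range(n_layers)]
    for l in range(n_layers):
        for t in range(l, n_layers):  # write to this layer and all later ones
            contrib = matvec(dec[(l, t)], feats[l])
            mlp_hat[t] = [a + b for a, b in zip(mlp_hat[t], contrib)]
    return mlp_hat

# Toy example: 2 layers, d_model = 2, one feature per layer.
resid = [[1.0, -1.0], [2.0, 0.0]]
enc = [[[1.0, 0.0]], [[0.5, 0.0]]]
dec = {
    (0, 0): [[1.0], [0.0]],
    (0, 1): [[0.0], [2.0]],
    (1, 1): [[3.0], [0.0]],
}
mlp_hat = clt_reconstruct(resid, enc, dec)
```

The layer-1 reconstruction sums contributions from features at layers 0 and 1, which is the multiple-write-target property the paragraph describes.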
The relationship is therefore one of architectural inheritance rather than identity: the CLT is a particular cross-layer variant that prioritizes the requirements of circuit analysis (causally aligned reads, multiple write targets), while the original crosscoder formulation is broader and includes acausal, weakly causal, and strict choices for the read direction.[^1][^8]
The second headline application of crosscoders is model diffing: training a crosscoder whose two sources are the activations of two related models, typically at the same architectural location, and using the resulting decoder norms to localize where the models differ in representational content. The original update describes this as follows: crosscoders "can produce shared sets of features across models," including the same model across training or fine-tuning and also "completely independent models with different architectures."[^1]
In a two-model diffing setup, each feature has two decoder vectors, one per model. The norms of these vectors can be inspected to classify each feature into three categories: shared, base-only, and fine-tune-only. Features with comparable decoder norms in both models are interpreted as shared representations; features with large norm in only one model's decoder are interpreted as specific to that model. The asymmetry between "base-only" and "fine-tune-only" populations then provides a quantitative window into what fine-tuning has added or removed.[^1][^2]
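A minimal sketch of this classification uses the relative decoder norm, the fine-tuned model's share of a feature's total decoder norm. The threshold values and the function name `classify_latent` are illustrative assumptions, not settings prescribed by the article.

```python
def classify_latent(norm_base, norm_tuned, low=0.1, high=0.9, shared_band=(0.4, 0.6)):
    """Classify one feature from the L2 norms of its two decoder vectors.
    Thresholds are illustrative defaults, not canonical values."""
    total = norm_base + norm_tuned
    if total == 0.0:
        return "dead"
    relative = norm_tuned / total  # 0 = base only, 1 = fine-tuned only
    if relative <= low:
        return "base-only"
    if relative >= high:
        return "tuned-only"
    if shared_band[0] <= relative <= shared_band[1]:
        return "shared"
    return "intermediate"

labels = [
    classify_latent(1.0, 1.0),
    classify_latent(5.0, 0.0),
    classify_latent(0.0, 2.0),
    classify_latent(1.0, 3.0),
]
```

Features that fall between the bands are kept as a separate "intermediate" group here so that borderline cases are not silently counted as shared or model-specific.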
The clearest public model-diffing experiment using the original crosscoder formulation came from an open-source replication by Connor Kissane and collaborators, published on LessWrong shortly after Anthropic's release. Following the recipe in Anthropic's update, the replication trained a 16,384-latent crosscoder on the middle-layer residual stream activations of Gemma 2 2B base and the corresponding instruction-tuned (IT) variant, using 400 million tokens drawn evenly from the Pile (an uncopyrighted subset) and the LmSys-Chat-1M dialogue dataset.[^10]
The authors reported that the trained crosscoder reached an average L0 sparsity of 81 latents firing per activation, explained variance of 77%, and 95% loss recovered relative to zero ablation. By analyzing decoder norm ratios across the two models, the authors identified three latent clusters: a large set of shared latents with similar decoder norms in both models and highly aligned decoder vectors, approximately 60 base-only latents, and approximately 225 chat-only latents. The strongly asymmetric counts were taken as evidence that instruction tuning introduces substantially more new representations than it removes from this model pair.[^10]
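The three quality metrics quoted above can be computed roughly as follows. The definitions used here are the common ones and are assumptions about what the replication measured; the function names and toy inputs are hypothetical.

```python
def average_l0(feature_acts):
    """Mean number of nonzero latents per activation (one row per token)."""
    return sum(sum(1 for f in row if f > 0.0) for row in feature_acts) / len(feature_acts)

def explained_variance(x, x_hat):
    """1 - (residual sum of squares) / (total sum of squares around the mean)."""
    n, d = len(x), len(x[0])
    mean = [sum(row[j] for row in x) / n for j in range(d)]
    rss = sum((a - b) ** 2 for row, rec in zip(x, x_hat) for a, b in zip(row, rec))
    tss = sum((a - m) ** 2 for row in x for a, m in zip(row, mean))
    return 1.0 - rss / tss

def loss_recovered(ce_clean, ce_spliced, ce_zero):
    """Fraction of the gap between zero ablation and the clean model that is
    closed when reconstructions are spliced into the forward pass."""
    return (ce_zero - ce_spliced) / (ce_zero - ce_clean)

l0 = average_l0([[0.0, 1.0, 2.0], [0.0, 0.0, 3.0]])
ev = explained_variance([[0.0, 0.0], [2.0, 2.0]], [[0.0, 1.0], [2.0, 2.0]])
lr = loss_recovered(ce_clean=2.0, ce_spliced=2.5, ce_zero=12.0)
```

The cross-entropy values passed to `loss_recovered` are made-up numbers chosen only to exercise the formula.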
A small number of chat-only features were inspected qualitatively. One example, labeled latent 2325, was reported to activate at the end of an instruction, often just before the assistant begins to produce a response, consistent with a representation of the user-assistant turn boundary that the base model lacks. The authors cautioned that they examined only a handful of cherry-picked latents and offered no systematic characterization of the entire chat-only set.[^10]
Subsequent peer-reviewed work questioned how much of the "model-specific" structure recovered by L1-trained crosscoders is genuine rather than an artifact of training dynamics. In "Overcoming Sparsity Artifacts in Crosscoders to Interpret Chat-Tuning," accepted at NeurIPS 2025, Julian Minder, Clément Dumas, Caden Juang, Bilal Chughtai, and Neel Nanda identified two failure modes of the standard L1 crosscoder.[^2][^11]
The first, which they call complete shrinkage, occurs when the L1 penalty on decoder norms drives a feature's decoder norm in one source to zero even though the feature is genuinely present in both sources; the optimizer settles into a local minimum where one source's contribution is provided by other features. The second, called latent decoupling, occurs when a single shared concept is redundantly split into two model-specific features whose decoder vectors are nearly aligned but each lives in only one model's slot. Together, these effects systematically over-attribute concepts to the fine-tuned model.[^2]
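Latent decoupling suggests a simple check: look for pairs of nominally model-specific latents whose decoder vectors point in nearly the same direction. The sketch below assumes the two models' activation spaces are directly comparable (as when one model is a fine-tune of the other); the function names and the default threshold are illustrative.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def find_twins(base_only, tuned_only, threshold=0.9):
    """base_only / tuned_only map latent ids to the decoder vector in the one
    model slot where that latent is active. Returns (base_id, tuned_id, cosine)
    for every pair whose cosine similarity exceeds the threshold."""
    twins = []
    for i, d_base in base_only.items():
        for j, d_tuned in tuned_only.items():
            c = cosine(d_base, d_tuned)
            if c > threshold:
                twins.append((i, j, c))
    return twins

base_only = {0: [1.0, 0.0]}
tuned_only = {7: [0.99, 0.1], 8: [0.0, 1.0]}
twins = find_twins(base_only, tuned_only)
```

A pair flagged this way is a candidate for a single shared concept that the dictionary has split in two, not proof of it.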
The authors introduce a diagnostic called latent scaling, which fits per-source scaling coefficients to measure how well each supposedly chat-only latent explains base model errors and reconstructions. They also introduce a training fix called the BatchTopK crosscoder, in which the sparsity constraint is enforced not with an L1 penalty but by selecting the top-K most strongly activating latents across a whole batch at each step. Re-running the Gemma 2 2B diffing experiment, they reported that the L1 crosscoder produced 3,176 nominally chat-only latents, but that 18% of them in fact overlapped with the shared latent distribution and that 109 base-only/chat-only "twin" pairs had cosine similarity above 0.9, confirming widespread latent decoupling. The BatchTopK crosscoder, by contrast, yielded only 134 chat-only latents, but those latents were authentic and interpretable.[^2]
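The core of a scaling diagnostic of this kind is a one-parameter least-squares fit. The sketch below is a simplified reading of the idea, not the authors' code: for a single latent it finds the scalar that best scales the latent's contribution (activation times decoder vector) to match a set of target vectors, and then compares the fits obtained for the two models. All names and numbers are hypothetical.

```python
def latent_scale(acts, decoder, targets):
    """Closed-form solution of  min_beta  sum_n || beta * acts[n] * decoder - targets[n] ||^2.
    acts: latent activation per token; decoder: the latent's decoder vector;
    targets: one target vector per token (e.g. activations or residual errors)."""
    num = sum(f * sum(d * y for d, y in zip(decoder, t)) for f, t in zip(acts, targets))
    den = sum(f * f for f in acts) * sum(d * d for d in decoder)
    return num / den if den > 0.0 else 0.0

acts = [1.0, 2.0]
decoder = [1.0, 0.0]
targets_tuned = [[1.0, 0.0], [2.0, 0.0]]
targets_base = [[0.5, 1.0], [1.0, -1.0]]

beta_tuned = latent_scale(acts, decoder, targets_tuned)
beta_base = latent_scale(acts, decoder, targets_base)
ratio = beta_base / beta_tuned  # near 0 suggests a genuinely tuned-only latent
```

In this toy case the ratio is well above zero, which would indicate that the latent also explains a meaningful part of the base model's activations.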
Examples of genuinely chat-specific concepts recovered from the BatchTopK crosscoder included representations of false information, personal questions, chat template tokens, multiple refusal-related mechanisms with different trigger nuances, and persona-related behaviors.[^2] The authors also observed that chat tuning substantially amplifies the representational norms of concepts already present in the base model rather than creating wholly novel mechanisms, complicating the simple "shared vs specific" dichotomy.
Anthropic itself returned to model diffing in a 2025 Transformer Circuits update titled "Insights on Crosscoder Model Diffing," authored by Siddharth Mishra-Sharma, Trenton Bricken, Jack Lindsey, Adam Jermyn, Jonathan Marcus, Kelley Rivoire, Christopher Olah, and Thomas Henighan.[^3] This update independently documented an unexpected phenomenon: features that are exclusive to one model in a crosscoder tend to be more polysemantic and denser in their activations than shared features, which makes them hard to interpret and undermines the original promise of model diffing.
The proposed explanation is "competition for limited feature capacity." Because shared features can be used to reconstruct activations in both models, the optimizer prefers to allocate dictionary slots to shared features unless model-specific features are doing extra reconstruction work. Genuinely model-specific concepts then end up packed into a smaller number of slots, forcing each one to absorb multiple unrelated phenomena and lose monosemanticity. The mitigation introduced in the update is to designate a small set of "shared baseline" features that receive a reduced sparsity penalty, effectively guaranteeing them capacity for the common-case representations and freeing the remaining features to specialize cleanly on per-model differences. The authors report that this intervention produces more interpretable, monosemantic exclusive features when applied to real models.[^3]
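The mitigation amounts to a per-feature sparsity coefficient. The sketch below assumes an L1-style penalty weighted by decoder norms and a hypothetical `shared_scale` factor for the designated features; the actual coefficient values and any additional constraints used in the update are not given in this article.

```python
def sparsity_penalty(acts, dec_norm_sums, shared_ids, lam=1.0, shared_scale=0.1):
    """Weighted L1 penalty for one token.
    acts[i]: activation of feature i; dec_norm_sums[i]: sum of its decoder
    norms across models; shared_ids: indices of designated shared features,
    which pay only `shared_scale` times the usual coefficient."""
    total = 0.0
    for i, (f, n) in enumerate(zip(acts, dec_norm_sums)):
        coeff = lam * shared_scale if i in shared_ids else lam
        total += coeff * f * n
    return total

penalty = sparsity_penalty(
    acts=[2.0, 1.0],
    dec_norm_sums=[1.0, 3.0],
    shared_ids={0},
)
```

Because feature 0 is cheap to activate, the optimizer has an incentive to route common-case structure through it, leaving the full-price features for model-specific differences.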
Model diffing with crosscoders has begun to be used as a downstream tool by external groups. A 2025 study, "Beyond the Leaderboard: Understanding Performance Disparities in Large Language Models via Model Diffing" by Sabri Boughorbel, Fahim Dalvi, Nadir Durrani, and Majd Hawasly, applied BatchTopK crosscoders to compare three Gemma-2-9b variants (pretrained, instruction-tuned, and a SimPO-fine-tuned model) by training crosscoders on 200 million tokens of FineWeb and LMSYS data at layer 20, using 114,688 latent dimensions and Claude 3 Opus to annotate features along thirty capability dimensions.[^12]
Their categorization assigned each model-specific latent to a capability, then aggregated relative changes across categories. The SimPO-fine-tuned variant was reported to acquire latent concepts predominantly enhancing safety mechanisms (+32.8%), multilingual capabilities (+43.8%), and instruction-following (+151.7%) relative to the instruction-tuned baseline, while showing reductions in latents associated with hallucination management (-68.5%) and self-reference (-44.1%). The authors interpret these patterns as evidence that benchmark gains can reflect stylistic and dialogic improvements rather than substantive capability changes, and that crosscoder-based diffing surfaces details that aggregate benchmark scores hide.[^12]
The publicly reported training regimes for crosscoders share several features. The training data is a stream of activations harvested from the underlying model(s) on a chosen text corpus. In the open-source Gemma 2 replication, the training corpus was a 50/50 mix of the Pile (uncopyrighted) and LmSys-Chat-1M, with 400 million tokens of activations from the middle layer's residual stream, and a 16,384-latent dictionary.[^10] The Boughorbel et al. capability study, from the Qatar Computing Research Institute, used 200 million tokens with a much larger dictionary of 114,688 latents and layer 20 activations from Gemma 2 9B models.[^12] These choices follow the same general pattern as standard SAE training but require materializing activations from multiple sources at once, which roughly doubles or triples memory demands relative to a single-layer SAE training run.
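A back-of-the-envelope calculation shows why the number of sources matters. The hidden size of 2,304 for Gemma 2 2B and the 2-byte storage format are assumptions brought in from outside this article, and the function name is hypothetical.

```python
def activation_bytes(n_tokens, d_model, n_sources, bytes_per_value=2):
    """Bytes needed to hold one activation vector per token per source."""
    return n_tokens * d_model * n_sources * bytes_per_value

# 400 million tokens, two models, assumed hidden size 2304, 16-bit values.
two_model_total = activation_bytes(400_000_000, 2304, n_sources=2)
single_source_total = activation_bytes(400_000_000, 2304, n_sources=1)
terabytes = two_model_total / 1e12
```

Storage grows linearly with the number of sources, so a two-model run needs about 3.7 TB under these assumptions, which is one reason trainers usually stream activations through a bounded buffer instead of storing the full dataset.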
The loss formulation depends on the variant. The original Anthropic update used an L1 sparsity penalty with a fixed coefficient.[^1][^6] The Minder et al. BatchTopK variant replaces L1 with batchwise top-K selection, in which the sparsity constraint is implicit: only the top-K most strongly activating features across an entire batch are allowed to write to the decoder at each step, removing direct norm penalties and creating implicit competition among features.[^2] The 2025 Anthropic insights update further proposed reserving a small subset of features with a reduced sparsity penalty as "baseline" features for shared representations.[^3]
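Batchwise top-K selection can be sketched as below. The budget of `k` times the batch size follows the usual BatchTopK formulation and is an assumption here, as is the function name; a crosscoder variant may also weight the scores (for example by decoder norms) before ranking them.

```python
def batch_topk(acts, k):
    """acts: one row of latent scores per sample. Keeps the k * batch_size
    largest positive scores across the whole batch and zeroes the rest, so
    individual samples may keep more or fewer than k latents."""
    budget = k * len(acts)
    entries = [(v, r, c) for r, row in enumerate(acts) for c, v in enumerate(row)]
    entries.sort(key=lambda e: e[0], reverse=True)
    keep = {(r, c) for v, r, c in entries[:budget] if v > 0.0}
    return [
        [v if (r, c) in keep else 0.0 for c, v in enumerate(row)]
        for r, row in enumerate(acts)
    ]

sparse = batch_topk([[0.9, 0.1, 0.8], [0.2, 0.0, 0.3]], k=1)
```

In the toy batch both surviving latents come from the first sample, which illustrates the competition across samples that a per-sample top-K would not allow.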
Several open-source repositories have implemented variants of the crosscoder. The Kissane et al. replication ("crosscoder-model-diff-replication") is the closest public mirror of Anthropic's original Gemma 2 2B diffing experiment.[^10] Neel Nanda's "Crosscoders" repository provides a more general training scaffold.[^13] The Minder et al. BatchTopK code accompanies the NeurIPS 2025 paper.[^2] These repositories typically expose a configuration in which the user supplies a list of model checkpoints and layer indices, and the trainer handles activation collection, dictionary initialization, sparsity scheduling, and the source-specific decoder norm statistics that drive downstream diffing analysis.
The crosscoder framework inherits the well-known limitations of sparse dictionary methods for interpretability. Reconstruction loss is an indirect proxy for "interpretability"; a low-loss crosscoder can still recover features that are linear combinations of underlying ground-truth features rather than the features themselves, and there is no closed-form guarantee that the recovered dictionary is monosemantic. Additionally, the choice of layer, dictionary size, sparsity level, and sparsity loss form all materially affect the recovered features, and the field has not converged on canonical settings.[^1][^2]
Specific to crosscoders, several open issues have been documented.
Sparsity artifacts in model diffing. As discussed above, L1-trained crosscoders systematically over-attribute features to the fine-tuned model through complete shrinkage and latent decoupling. The BatchTopK variant mitigates but does not entirely eliminate this; some classification of exclusive-versus-shared latents remains sensitive to training choices. The Minder et al. work argues that latent scaling diagnostics should accompany any crosscoder-based diffing claim.[^2]
Capacity competition and polysemanticity of exclusive features. Anthropic's 2025 update documented that, even with a clean training run, the features the crosscoder marks as model-specific tend to be denser and more polysemantic than shared features, because shared features absorb the cheaply explained mass of the activation distribution and exclusive features must do extra work. The shared-baseline mitigation helps, but the broader question of how to allocate capacity fairly between shared and specific phenomena remains open.[^3]
Choice of architectural variant. The October 2024 update is explicit that the three variants (acausal, weakly causal, strict) trade off descriptive flexibility for causal alignment with the underlying model. Acausal crosscoders are convenient for representational summaries but cannot be plugged into a circuit. Strict crosscoders are circuit-compatible but lose the ability to read from later layers' representations. There is no single best choice; different applications call for different variants, and combinations are possible.[^1][^6]
Validation against ground truth. Most existing evaluations of crosscoders rely on qualitative inspection of cherry-picked features (e.g., the latent 2325 turn-boundary feature in the Gemma 2 replication) or on aggregate statistics like decoder norm distributions. Causal interventions (e.g., suppressing or amplifying a feature and measuring behavioral change) are gradually being adopted but remain expensive at scale.[^10][^2]
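A feature-level intervention of the kind mentioned above can be sketched as a rescaling of one feature's contribution to an activation vector. This is an illustrative sketch; the function name and the toy numbers are hypothetical, and measuring the behavioral effect would require running the modified activation through the rest of the model.

```python
def intervene(activation, feature_act, decoder, scale):
    """Rescale one feature's contribution (feature_act * decoder) inside an
    activation vector. scale=0.0 suppresses the feature, scale=1.0 leaves the
    activation unchanged, and scale>1.0 amplifies the feature."""
    return [a + (scale - 1.0) * feature_act * d for a, d in zip(activation, decoder)]

activation = [1.0, 2.0]
suppressed = intervene(activation, feature_act=0.5, decoder=[2.0, 0.0], scale=0.0)
amplified = intervene(activation, feature_act=0.5, decoder=[2.0, 0.0], scale=2.0)
```

Editing the activation directly, as opposed to replacing it with the full reconstruction, leaves the crosscoder's reconstruction error in place, so any change in behavior can be attributed to the one feature.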
Generalization across model pairs. The most studied diffing pairs are base-versus-instruction-tuned variants of small open models like Gemma 2 2B and 9B. Whether the same architectural choices and sparsity levels work for larger production models, for RLHF-tuned versus DPO-tuned pairs, or for completely independent models trained from scratch is largely an open empirical question, though the original Anthropic update explicitly framed the latter as part of the intended scope.[^1][^3]
A clean way to relate crosscoders to neighboring techniques is by which activation relationships they are trying to capture. Sparse autoencoders model the activation of a single component as a sparse linear combination of features. Transcoders model a single component-to-component map (typically an MLP block's input-to-output) as a sparse linear combination of features, with the input encoder applied to the input activations and the output decoder applied to the output activations. Crosscoders generalize this further by allowing both the read source and the write source to be arbitrary tuples of activation tensors, with the same feature dictionary shared across all of them.[^1][^7]
The attribution graphs research line developed by Anthropic in 2025 uses the cross-layer transcoder (CLT) as a substrate. Each CLT feature reads from one layer's residual stream and writes to the MLP outputs of that layer and all subsequent layers, so it fits the crosscoder template (multiple write targets, a shared sparse feature dictionary) while also fitting the transcoder template (input-output relationship across MLP boundaries).[^8] In the companion paper "On the Biology of a Large Language Model," CLT features were used to build attribution graphs for Claude 3.5 Haiku across behaviors including multi-step reasoning, multilingual processing, arithmetic, hallucination handling, and safety refusals.[^9] Among the reported findings was that Claude 3.5 Haiku shares roughly twice the proportion of features between languages compared to smaller models, a result obtained by inspecting which CLT features are activated by parallel-meaning inputs in multiple languages.[^9]
The interpretability community has not adopted a single fixed taxonomy, and the boundary between "crosscoder" and "transcoder" is sometimes drawn differently. The most common convention follows Anthropic's October 2024 framing: a crosscoder is any SAE-like model whose feature dictionary is shared across two or more sources (layers or models) with at least one of the read or write structures spanning multiple sources, while a transcoder is the special case of a single input-output map with separate input and output activation spaces.[^1][^7] Under this convention, the CLT is best described as a cross-layer transcoder that is also, in the broader sense, a crosscoder variant, which is how the Anthropic methods paper refers to it.[^8]
Crosscoders sit at the intersection of two practical needs in alignment research. First, scaling interpretability to larger and longer transformers requires techniques that handle features distributed across many layers without combinatorial blowup in the feature inventory; the cross-layer aspect of crosscoders is a direct response to that need.[^1] Second, evaluating the effects of post-training procedures such as supervised fine-tuning and RLHF on model internals requires a representational comparison primitive, not just behavioral benchmarks; the model-diffing aspect of crosscoders is a candidate for that primitive.[^2][^12]
A representative use case in this space is jailbreak and refusal analysis. The Minder et al. BatchTopK crosscoder on Gemma 2 2B identified multiple chat-specific refusal-related latents with different trigger nuances, providing a candidate inventory of features that mediate safety-tuned behavior.[^2] If such an inventory can be made reliable, it opens the door to mechanistic safety audits: checking whether a given refusal behavior is mediated by a few interpretable features (and thus potentially fragile against targeted suppression) or by a distributed mechanism that is harder to disable. Independent work has used crosscoder-derived insights to study jailbreak fragility under feature-level intervention and to compare safety-relevant features across fine-tuning regimes, though this literature is still young and the methodological best practices remain in flux.[^2][^12]
A condensed timeline of the crosscoder line of work, restricted to publications and code releases with verifiable sources, is as follows.