Patchscopes

Updated Jul 13, 2026

Patchscopes is an interpretability framework for inspecting hidden representations of large language models by patching an internal activation from a source computation into a separate target inference whose prompt is designed to elicit a natural-language description of what that activation encodes. The framework was introduced in January 2024 by Asma Ghandeharioun, Avi Caciularu, Adam Pearce, Lucas Dixon, and Mor Geva at Google Research in the paper "Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models", later presented at the 41st International Conference on Machine Learning (ICML 2024).^[1]^[2] Rather than projecting hidden states into the vocabulary with a fixed unembedding, Patchscopes uses the model itself, often the same one being studied, as a decoder: the patched representation is read out by letting the model continue generation from a carefully chosen target prompt.^[1] Many earlier inspection methods, including the logit lens and several activation-intervention techniques, can be recast as specific instances of this unified scheme.^[1]^[3]

Background

Understanding what a transformer large language model computes inside its residual stream is a central goal of mechanistic interpretability. By 2023 the community had accumulated several complementary tools, each addressing a different facet of the same question: given a high-dimensional vector that appears in the middle of a forward pass, what does it represent? The logit lens, introduced by the LessWrong user nostalgebraist in August 2020, projects intermediate hidden states through the model's own output embedding matrix and reads the resulting token distribution as a proxy for what the model "believes" at that depth.^[4] The technique is computationally trivial because it reuses the unembedding matrix that the model already learned, but it is widely reported to be brittle in the early layers, where activations have not yet been rotated into the output basis and the implied token distribution often looks nonsensical.^[4]^[5]

The Tuned Lens, introduced by Nora Belrose and collaborators in March 2023, addresses this brittleness by training a small affine probe per block so that every hidden state can be decoded into a distribution over the vocabulary, with results reported on autoregressive models up to about 20 billion parameters.^[5] The Tuned Lens is more predictive, more reliable, and less biased than the logit lens, but its output is still a probability vector over tokens rather than free-form language, and it requires training one probe per layer per model. Linear probes more generally, in which a small classifier is trained on top of frozen activations to recover a target attribute, were the dominant supervised approach for asking whether a representation contains a piece of information.^[1] Probes are simple and quantitatively interpretable, but they are limited to whatever label set the analyst defined a priori, and they cannot describe a representation that does not fit any of the labels they were trained on.
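The difference between the two lenses can be illustrated with a toy sketch in plain Python. Everything numeric here is an assumption made for illustration: `W_U`, `A`, `b`, and `h_early` are made-up values rather than weights from any real model, and the final layer normalization that real implementations typically apply is omitted. The logit lens decodes a hidden state with the unembedding alone, while a tuned-lens-style readout first passes it through a per-layer affine map.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vec):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

def logit_lens(h, W_U):
    """Decode a hidden state directly with the unembedding matrix."""
    return softmax(matvec(W_U, h))

def tuned_lens(h, W_U, A, b):
    """Apply a per-layer affine map before the unembedding."""
    translated = [a + bias for a, bias in zip(matvec(A, h), b)]
    return softmax(matvec(W_U, translated))

# Hypothetical toy values: vocabulary of 3 tokens, hidden size 2.
W_U = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
h_early = [0.2, 2.0]           # an "early-layer" state in the wrong basis
A = [[0.0, 1.0], [1.0, 0.0]]   # stand-in for a learned map: swaps coordinates
b = [0.0, 0.0]

p_logit = logit_lens(h_early, W_U)
p_tuned = tuned_lens(h_early, W_U, A, b)
top_logit = argmax(p_logit)    # token favoured by the raw projection
top_tuned = argmax(p_tuned)    # token favoured after the affine correction
```

In both cases the output is a distribution over a fixed vocabulary, which is the limitation the paragraph above points to.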

A different line of work is activation patching (sometimes called causal tracing or interchange intervention), which replaces an activation in one inference with an activation cached from another and measures the change in output. It traces back to causal mediation analysis applied to neural NLP by Jesse Vig and collaborators in 2020 and to the formalization of interchange intervention by Atticus Geiger and collaborators in 2021.^[6] Because activation patching is causal rather than correlational, it is widely treated as the gold standard for localization questions of the form "where is this fact stored?" or "which component is responsible for this behavior?". Its outputs, however, are deltas on the final logits, not phrases that describe the contents of the patched representation; the technique tells one where the wires run, not what is on them.

Patchscopes synthesizes these strands. It keeps the causal intervention machinery of activation patching but, instead of measuring an effect on a specific output token, it uses the patched representation to seed a fresh generation from a target prompt whose continuation serves as a natural-language description. The paper shows that the logit lens, Tuned Lens, several entity-resolution probes, and various intervention experiments can all be expressed as particular configurations of a single quintuple, and that this generality makes it possible to inspect representations that previously evaded simple vocabulary projection, particularly in the lower layers of the network.^[1]^[3] The framing is deliberately reminiscent of a microscope: each prior method is a "scope" with a fixed lens and a fixed sample preparation, and Patchscopes parameterizes the family of scopes by exposing every choice the experimenter is implicitly making.

Definition and the patching mechanism

A Patchscope is defined by two pieces. The source side specifies the representation to be inspected via a tuple $(S, i, M, \ell)$: an input sequence S, a token position i inside that sequence, a model M, and a layer index $\ell$. Running M on S yields a hidden state h at position i and layer $\ell$. The target side is specified by a quintuple $(T, i^*, f, M^*, \ell^*)$: a target prompt T, a target position $i^*$ inside that prompt, a mapping function f that can adjust dimensionality or coordinate frame, a target model $M^*$ (which may equal M or differ from it), and a target layer $\ell^*$ in the target model.^[1]^[3]

During inference, the target model is run on T up to layer $\ell^*$. At position $i^*$ the existing hidden state $\bar{h}$ is overwritten with $f(h)$. The target model then continues its forward pass and emits a continuation. The Patchscope therefore consists of two coupled forward passes, one to extract h and a second to read it out, with the substitution performed in between. By varying T, the experimenter chooses the question being asked of the representation; by varying $\ell$ and $\ell^*$, the experimenter chooses where to look in the source and where to inject in the target; by varying M and $M^*$, the experimenter can patch across different models.^[1]^[3]
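The two coupled forward passes can be sketched with a toy stand-in for a transformer. This is a minimal illustration under stated assumptions, not the paper's implementation: `causal_mix_layer`, `forward`, and `patchscope` are hypothetical names, the "layers" are simple causal averaging functions rather than attention blocks, layer 0 denotes the embeddings, and the readout stops at the final hidden states instead of generating text.

```python
from typing import Callable, List, Optional, Tuple

Vector = List[float]
Layer = Callable[[List[Vector]], List[Vector]]

def causal_mix_layer(scale: float) -> Layer:
    """Toy layer: each position adds a scaled mean of itself and earlier positions."""
    def layer(states: List[Vector]) -> List[Vector]:
        out = []
        for j, h in enumerate(states):
            prefix = states[: j + 1]
            mean = [sum(col) / len(prefix) for col in zip(*prefix)]
            out.append([x + scale * m for x, m in zip(h, mean)])
        return out
    return layer

def forward(layers: List[Layer], embeddings: List[Vector],
            patch: Optional[Tuple[int, int, Vector]] = None) -> List[List[Vector]]:
    """Return hidden states per depth (index 0 = embeddings).

    `patch` = (position, depth, vector) overwrites one hidden state at that
    depth before the remaining layers run.
    """
    states = [list(v) for v in embeddings]
    trace = []
    for depth in range(len(layers) + 1):
        if depth > 0:
            states = layers[depth - 1](states)
        if patch is not None and patch[1] == depth:
            position, _, vector = patch
            states = [list(v) for v in states]
            states[position] = list(vector)
        trace.append(states)
    return trace

def patchscope(source_layers, source_emb, i, ell,
               target_layers, target_emb, i_star, ell_star, f=lambda h: h):
    """Pass 1 extracts h at (i, ell); pass 2 injects f(h) at (i_star, ell_star)."""
    h = forward(source_layers, source_emb)[ell][i]
    patched = forward(target_layers, target_emb, patch=(i_star, ell_star, f(h)))
    return h, patched[-1]

layers = [causal_mix_layer(0.5) for _ in range(3)]
source = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]   # toy source "prompt" S
target = [[0.0, 0.0], [0.0, 0.0]]               # toy target "prompt" T
h, final_states = patchscope(layers, source, i=2, ell=1,
                             target_layers=layers, target_emb=target,
                             i_star=1, ell_star=1)
unpatched_final = forward(layers, target)[-1]
```

Comparing `final_states` with `unpatched_final` shows that only the patched position changes in this causal toy model: the earlier position never sees the injected state.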

When f is the identity and M equals $M^*$, a Patchscope reduces to a standard activation patch with a flexible target prompt. When the target prompt is empty and the target layer equals the final layer with f equal to the identity, the configuration recovers logit-lens-style decoding; when f is a learned per-layer affine map it approximates the Tuned Lens.^[1] In this sense Patchscopes is a strict generalization: known techniques sit at particular settings of $(T, i^*, f, M^*, \ell^*)$, and the framework's contribution is to demonstrate that other, previously unexplored settings yield strictly more information about the same internal states.

Target prompts

Target prompts are the most distinctive design choice. The paper and Google Research's accompanying blog post describe several recurring templates. For next-token style decoding the authors use a short few-shot block that demonstrates token repetition, such as "tok1 -> tok1 ; tok2 -> tok2 ; ... ; ? ->", with the patched representation placed at the question-mark position so that the model is encouraged to emit the token "encoded" by the activation.^[1]^[7] For attribute extraction, the prompt is a templated relation, for example "The official currency of x is" or "The largest city in x is", with x being the patched position; the continuation is then read as the model's answer about that attribute of whatever entity the source representation encodes.^[1]^[7] For entity description, the prompt consists of few-shot examples of entities described in a couple of sentences, again with x in place of the entity to be inspected, eliciting a free-form description.^[7]
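The three templates can be sketched as small prompt builders, together with a helper that finds the position to patch. This is an illustrative sketch only: the function names are hypothetical, the demonstration entities and descriptions are made up, and whitespace splitting stands in for a real tokenizer, whose token positions would differ.

```python
def token_identity_prompt(demo_tokens):
    """Few-shot repetition prompt; the patched position is the '?' token."""
    demos = " ; ".join(f"{tok} -> {tok}" for tok in demo_tokens)
    return f"{demos} ; ? ->"

def relation_prompt(relation_template):
    """Relation template that already contains the placeholder 'x'."""
    if "x" not in relation_template.split():
        raise ValueError("template must contain the placeholder token 'x'")
    return relation_template

def entity_description_prompt(demos):
    """Few-shot entity descriptions followed by the placeholder 'x'."""
    shots = ", ".join(f"{name}: {description}" for name, description in demos)
    return f"{shots}, x"

def patch_position(prompt, placeholder):
    """Return (tokens, index) of the single placeholder under whitespace tokenization."""
    tokens = prompt.split()
    hits = [idx for idx, tok in enumerate(tokens) if tok == placeholder]
    if len(hits) != 1:
        raise ValueError(f"expected exactly one {placeholder!r}, found {len(hits)}")
    return tokens, hits[0]

identity = token_identity_prompt(["cat", "dog"])
relation = relation_prompt("The official currency of x is")
description = entity_description_prompt([
    ("Syria", "Country in the Middle East"),            # hypothetical demonstration
    ("Samsung", "South Korean electronics company"),    # hypothetical demonstration
])

_, identity_pos = patch_position(identity, "?")
_, relation_pos = patch_position(relation, "x")
description_tokens, description_pos = patch_position(description, "x")
```

Keeping the placeholder lookup separate from the template makes it easy to swap the question being asked while leaving the source activation untouched.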

Because the target prompt fully controls what question is asked, the same source activation can be probed in different ways without any retraining. This open-vocabulary, train-free property is one of the main advantages the paper claims over linear probes, which are limited to a predefined label set and must be retrained for each attribute.^[1]^[3]

Layer choices

The paper studies how the source layer $\ell$ and target layer $\ell^*$ interact. For token-identity Patchscopes, the authors report that performance is uniformly strong from roughly layer 10 onward across the models tested, and that the diagonal configuration $(\ell = \ell^*)$ tends to give the highest precision, with gains over the logit lens and Tuned Lens of up to about 98 percent in Precision-at-1 in layers 18 through 22 of the models studied.^[1]^[3] In the early layers, before features have been rotated into the output basis, vocabulary projection methods often fail outright while a Patchscope with a token-identity target prompt continues to recover useful information, which is one of the paper's headline results.^[1]^[3]

Use cases

Decoding hidden representations to natural language

The most direct application of Patchscopes is to convert an internal activation into a phrase that describes its content. With a token-identity target prompt the natural-language description is a single token; with an entity-description prompt it can be several sentences; with a relation-template prompt it is a short answer to a specific question about the entity encoded in the source representation. The paper's experiments on next-token prediction compare Patchscopes against the logit lens and Tuned Lens across GPT-J (6B), Vicuna (7B and 13B), Pythia (12B), and LLaMA 2 (13B). The token-identity Patchscope consistently outperforms both baselines from roughly layer 10 onward on both Precision-at-1 and surprisal, with the largest gaps in the lower-middle layers where the logit lens is known to be unreliable.^[1]^[3]

The advantage in low and mid layers matters because those are precisely the layers where many interesting computations are taking place. Subject resolution, relation lookup, and intermediate stages of multi-step reasoning typically occur well before the final layers, which by training objective are increasingly specialized for next-token prediction. A method that breaks down precisely in those layers, as the logit lens does, leaves the experimenter blind during the most interpretive part of the forward pass. Patchscopes restore visibility into those layers by letting the target model translate a still-abstract representation into language, rather than asking the unembedding to do work it was not trained for.

Cross-model patching

Because the source and target models can differ, Patchscopes can use a larger model as a "translator" for a smaller one. In the cross-model experiments reported in the paper, hidden representations are extracted from Vicuna 7B and patched into Vicuna 13B at a corresponding layer, with f a learned affine mapping between the two residual streams.^[1]^[3] Lexical similarity of the resulting descriptions, measured by RougeL, exceeds what is obtained when Vicuna 7B inspects its own representations on both popular and rare entities, suggesting that the stronger model can articulate information that the smaller model encodes but cannot itself verbalize.^[1] The same idea applies in principle to patching between unrelated families given a suitable f, and is one of the directions the paper highlights as a new capability rather than an extension of prior tools.^[1]^[7]

The cross-model setting raises a subtle methodological point: the affine map f is fit on paired hidden states, typically obtained by running the same input through both models and matching activations at the same relative depth. Because residual streams can have different dimensionalities and different learned bases, f does double duty as a dimensionality adapter and as a translator between representation spaces. The Patchscopes paper does not claim universal alignment between arbitrary models; instead, it shows that within a family with related pretraining, a learned linear map suffices, and that the resulting cross-model readouts beat in-model readouts at the smaller model. This is a notable result because it implies that representations carry information the smaller model "knows but cannot say," which has implications both for interpretability (the small model is more informative than its own outputs suggest) and for distillation (a strong model may be a better readout than a weak one's own decoder).
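One way to fit such a map is ordinary least squares on paired hidden states. The sketch below is written under that assumption; the article does not specify the paper's exact fitting procedure. The helper names, the toy dimensions (2 to 3), and the synthetic pairs are all illustrative. It solves the normal equations in plain Python with a bias column appended to each source state.

```python
def solve_linear(a, b):
    """Solve a @ x = b (b may have several columns) by Gauss-Jordan elimination."""
    n = len(a)
    aug = [list(row_a) + list(row_b) for row_a, row_b in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: paired states are not varied enough")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_val = aug[col][col]
        aug[col] = [v / pivot_val for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def fit_affine(sources, targets):
    """Least-squares fit of f(h) = W h + b from paired (source, target) states."""
    xs = [list(h) + [1.0] for h in sources]          # extra 1.0 carries the bias
    d, k = len(xs[0]), len(targets[0])
    xtx = [[sum(x[r] * x[c] for x in xs) for c in range(d)] for r in range(d)]
    xty = [[sum(x[r] * y[c] for x, y in zip(xs, targets)) for c in range(k)]
           for r in range(d)]
    theta = solve_linear(xtx, xty)                   # shape: d rows, k columns
    W = [[theta[r][c] for r in range(d - 1)] for c in range(k)]
    b = list(theta[d - 1])
    return W, b

def apply_affine(W, b, h):
    """Map a source-space state into the target space."""
    return [sum(w * x for w, x in zip(row, h)) + bias for row, bias in zip(W, b)]

# Synthetic paired states generated from a known map (illustration only).
W_true = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.0]]
b_true = [0.5, -0.5, 1.0]
small_states = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 3.0]]
large_states = [apply_affine(W_true, b_true, h) for h in small_states]

W_fit, b_fit = fit_affine(small_states, large_states)
mapped = apply_affine(W_fit, b_fit, [1.0, 1.0])
```

Because the map changes the vector length from 2 to 3, the same fit serves as both the dimensionality adapter and the translator between representation spaces.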

Multi-hop reasoning analysis and correction

Patchscopes can also be used as a causal intervention rather than only a passive readout. The paper analyzes two-step factual questions of the form "the spouse of the singer of Imagine" or "the largest city in the country whose capital is Bangkok", where $\omega_1 = \sigma_2$, that is, the object of the first relation is the subject of the second.^[1]^[3] On 46 such queries that meet strict criteria on Vicuna 13B, a vanilla baseline (asking the model the full question directly) answers 19.57 percent correctly and a chain-of-thought baseline reaches 35.71 percent. A Patchscope that extracts the model's internal representation of the intermediate entity and patches it into a fresh continuation reaches 50 percent.^[1]^[3]^[7] The interpretation is that the model often computes the correct first-hop answer internally but fails to compose the two hops; reinjecting the intermediate representation effectively repairs the missing connection.

This experiment is more than a probe: it demonstrates that Patchscopes can be used both diagnostically, to locate where reasoning goes wrong, and therapeutically, to recover the correct answer by routing information differently. The 2025 follow-up Auto-Patch by Aviv Jan, Dean Tahory, Omer Talmi, and Omar Abo Mokh trains a classifier to decide when a hidden state should be patched at inference time and reports a solve-rate improvement on the MuSiQue multi-hop dataset from 18.45 percent to 23.63 percent, closing part of the gap to chain-of-thought prompting.^[8]

Entity resolution and contextualization

A fourth use case in the paper concerns how an entity name is gradually resolved across layers. The authors run an entity-description Patchscope over the first ten layers of Vicuna 13B on 200 popular and 200 rare named entities and find that RougeL similarity between the generated description and a reference description rises through layers one to five and then plateaus.^[1]^[3] Phrases such as "Diana, Princess of Wales" come into focus over the first half-dozen layers, while purely surface tokens dominate the very earliest layers. Because Patchscopes work where the logit lens fails, this is one of the few interpretability tools that can directly visualize early-layer contextualization in natural language rather than as an opaque distribution.^[1]^[7]

This experiment is one of the most concrete demonstrations of the framework's "early-layer advantage." Vocabulary projection through the trained unembedding effectively asks the model what next token it would predict given a particular layer's activations, a question that has little meaning in the first few layers, when the model has barely begun to process the input. A Patchscope, by contrast, asks the model to describe whatever the activation represents at that depth, so it can return a meaningful readout in layer two as well as in layer twenty, even though the descriptions become more accurate as the entity is resolved. The result is a high-resolution trajectory of how a token like "Diana" goes from a literal subword to a contextualized referent over the network's first six layers, providing a kind of slow-motion replay of comprehension.

Locating behavior circuits

The original paper frames refusal and other behaviors only as future work, but the same patching mechanism lends itself to behavioral analysis. Steering-vector work, including the 2024 result that refusal in chat models of up to 72 billion parameters is mediated by a roughly one-dimensional direction in the residual stream, produces candidate directions whose content can be characterized with Patchscopes-style readouts, by patching the direction into target prompts that solicit a description.^[9] The framework's open-vocabulary readout is especially useful here because it can return phrases such as "refuses to comply" or "harmful content" rather than a probability over a fixed label set.

Behavioral analyses of this kind tend to combine three operations: identifying a candidate direction via probing or steering, intervening on that direction during inference to confirm its causal role, and then describing the direction with a Patchscope-style readout to assign it a human-readable label. Each operation is independently useful, but the readout step is what closes the loop: without it, a "refusal direction" is just a vector with a strong effect on outputs, while with a Patchscope readout it can be summarized as "a representation that encourages declining requests about harmful content." The same workflow generalizes to directions encoding sentiment, factual claims, sycophancy, or other behaviors that can be operationalized as steering vectors.

Self-correction

The multi-hop reasoning experiment is sometimes described as a form of self-correction, since the model effectively corrects its own composition error using its own intermediate state. Rather than retrieving information from outside the network, the procedure reroutes information that the network already computed, demonstrating that the failure was one of composition rather than of recall. The 2025 Superscopes work by Jonathan Jacobi and Gal Niv extends the same idea by amplifying superposed features in MLP outputs and hidden states before patching them into a new context, drawing analogy to classifier-free guidance in diffusion; it reports being able to interpret representations that plain Patchscopes returns as uninformative.^[10] Auto-Patch, also from 2025, automates the "when to patch" decision with a learned classifier rather than requiring the experimenter to manually select source representations for each query.^[8]

Methodology

In practice, applying Patchscopes to a research question reduces to four choices: the source layer, the target prompt, the target layer, and the evaluation criterion. The paper recommends sweeping $\ell$ from one to the network's depth and choosing a target layer near or equal to $\ell$, because diagonal configurations dominate in their reported tables.^[1] When the inspection is about a token-level attribute, the few-shot identity prompt is the default; when it is about an entity, an entity-description prompt with two or three demonstrations is used; when it is about a relation, a templated relation prompt is used.^[1]^[7] For cross-model patching, f is a learned linear or affine map fit on a small set of paired hidden states.^[1]
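The layer sweep described above can be sketched as a small grid search. The function `sweep_layers` and the scoring callback are hypothetical; in a real experiment the callback would run a Patchscope at the given source and target layers and return the chosen evaluation metric, whereas `toy_score` below is a made-up placeholder that simply peaks on the diagonal.

```python
def sweep_layers(depth, score, diagonal_only=True):
    """Score (source_layer, target_layer) pairs and return the best one.

    With diagonal_only=True only configurations with equal source and
    target layers are evaluated, which costs `depth` runs instead of
    `depth ** 2`.
    """
    results = {}
    for ell in range(1, depth + 1):
        target_layers = [ell] if diagonal_only else range(1, depth + 1)
        for ell_star in target_layers:
            results[(ell, ell_star)] = score(ell, ell_star)
    best = max(results, key=results.get)
    return best, results

def toy_score(ell, ell_star):
    """Placeholder metric (not real data): prefers the diagonal and layer 4."""
    return -abs(ell - ell_star) - 0.1 * abs(ell - 4)

best_diag, diag_results = sweep_layers(6, toy_score)
best_full, full_results = sweep_layers(6, toy_score, diagonal_only=False)
```

Starting with the diagonal sweep and widening to the full grid only when needed keeps the number of extra forward passes manageable.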

Evaluation differs by task. For next-token prediction, Patchscopes is compared against the logit lens and the Tuned Lens using Precision-at-1 against the model's own argmax at the final layer and using surprisal, with results aggregated across many input prompts.^[1]^[3] For attribute extraction, the metric is task accuracy against ground-truth labels, with logistic-regression linear probes as a baseline; on 12 tasks (5 commonsense, 7 factual) on GPT-J, the zero-shot feature-extraction Patchscope significantly outperforms the trained probes on 6 of 12 tasks under a strict significance threshold, despite requiring no labeled data.^[1]^[3] For entity resolution, the metric is RougeL between generated and reference descriptions; for multi-hop reasoning, it is exact-match accuracy on the gold final answer.^[1]^[3]
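The three families of metrics can be sketched in plain Python. The definitions below are assumptions about the usual forms of these metrics rather than the paper's evaluation code: Precision-at-1 as agreement between the readout's top token and the reference token, surprisal as the negative log-probability (in nats) the readout assigns to the reference token, and RougeL as the F1 score over the longest common subsequence of whitespace tokens.

```python
import math

def precision_at_1(decoded_top1, reference_tokens):
    """Fraction of examples where the readout's top token matches the reference."""
    matches = sum(1 for d, r in zip(decoded_top1, reference_tokens) if d == r)
    return matches / len(reference_tokens)

def surprisal(probs, reference_token):
    """Negative log-probability (nats) that the readout assigns to the reference."""
    return -math.log(probs[reference_token])

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def rouge_l_f1(candidate, reference):
    """LCS-based F1 between a generated and a reference description."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

p_at_1 = precision_at_1(["a", "b", "c", "d"], ["a", "b", "x", "d"])
nats = surprisal({"a": 0.5, "b": 0.5}, "a")
rouge = rouge_l_f1("princess of wales", "Diana princess of Wales")
```

Library implementations of RougeL may differ in tokenization and stemming, so scores from this sketch are not directly comparable to published numbers.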

The methodology also requires some care to avoid spurious results. Because the target model can hallucinate, a Patchscope must be paired with a sanity check that the readout depends on the patched state. The standard control is a "no-patch" target inference using the same template, which should produce a generic completion; comparing the patched and unpatched continuations reveals whether the patched state actually steered the readout. A complementary control is to patch a randomized or zeroed activation, which should yield uninformative descriptions; if the readout looks structured even under such controls, the prompt is doing too much work and the inference is not faithful. The paper applies these and related checks throughout its experiments.^[1]
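The control runs described above can be organized as a small harness. This is a sketch under stated assumptions: `run_controls` and both toy readouts are hypothetical, `readout(state)` stands for "run the target inference with this state patched in, or with no patch when `state` is `None`", and flagging a control only when its output is identical to the patched output is a deliberately crude criterion.

```python
import random

def run_controls(readout, source_state, seed=0):
    """Compare the patched readout against no-patch, zero, and random controls."""
    rng = random.Random(seed)
    dim = len(source_state)
    outputs = {
        "patched": readout(source_state),
        "no_patch": readout(None),
        "zero_patch": readout([0.0] * dim),
        "random_patch": readout([rng.gauss(0.0, 1.0) for _ in range(dim)]),
    }
    # A control that reproduces the patched output suggests the prompt,
    # not the patched state, is driving the description.
    suspicious = [name for name, text in outputs.items()
                  if name != "patched" and text == outputs["patched"]]
    return outputs, suspicious

GENERIC = "<generic completion>"

def toy_readout(state):
    """Stand-in readout: names a concept only if the state aligns with it."""
    if state is None:
        return GENERIC
    concepts = {"currency": [1.0, 0.0], "city": [0.0, 1.0]}
    scores = {name: sum(s * c for s, c in zip(state, vec))
              for name, vec in concepts.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0.9 else GENERIC

def prompt_dominated_readout(state):
    """Pathological readout that ignores the patched state entirely."""
    return "currency"

outputs, suspicious = run_controls(toy_readout, [1.0, 0.0])
_, dominated_flags = run_controls(prompt_dominated_readout, [1.0, 0.0])
```

In practice the comparison would use a similarity score over many prompts rather than exact string equality, but the structure of the check is the same.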

Comparison to prior tools

Method	Decoder	Training cost	Works in early layers	Output form	Reference
Logit lens	Final unembedding	None	Often fails	Token distribution	nostalgebraist, 2020^[4]
Tuned Lens	Per-layer affine probe	Trained per layer	Improved	Token distribution	Belrose et al., 2023^[5]
Early-exit decoding	Frozen unembedding at layer $\ell$	None	Often fails	Token distribution	Standard practice in LLMs^[4]^[5]
Linear probe	Linear classifier on activations	Trained per attribute	Mixed; depends on task	Predefined labels	Standard supervised probe^[1]
Activation patching	None (causal change in output)	None	Yes (causal)	Effect on output logits	Vig et al., 2020^[6]
Patchscopes	Target prompt continuation in same or different LLM	None unless f is learned	Yes	Open-vocabulary natural language	Ghandeharioun et al., 2024^[1]

The table makes the framework's design space explicit. The logit lens uses the trained unembedding as a fixed decoder and provides no notion of a target prompt, which is why it cannot disentangle aspects of a representation that have not yet been rotated into output coordinates. The Tuned Lens learns an affine decoder per layer; this fixes the early-layer brittleness of the logit lens but still produces a token distribution rather than a textual description. Early-exit decoding, in which the model is forced to terminate at layer $\ell$ and emit a token, is similar to the logit lens for diagnostic purposes but is normally used as an inference-acceleration mechanism rather than an interpretability tool.^[5] Linear probes are highly tunable but require labels and are restricted to whatever discrete label set the probe was trained for. Activation patching is causal, but its output is a delta on the final logits, not a phrase; the Patchscopes paper observes that one can interpret activation patching as a Patchscope with an empty target prompt and target layer equal to the source layer.^[1]

By making the decoder a forward pass of the same or another LLM, Patchscopes is the only one of these tools whose output is open-vocabulary natural language and whose decoder leverages the model's nonlinear computation rather than a single linear map. This is the main advantage cited in the paper's discussion, alongside a caveat: the same nonlinearity that lets Patchscopes describe rich representations also means the generated description is not guaranteed to be a faithful summary of the patched state, since the target model can hallucinate.^[1]^[3]

Empirical findings

The headline numbers reported in the ICML 2024 paper are as follows. On token-identity decoding across GPT-J, Vicuna 7B and 13B, Pythia 12B, and LLaMA 2 13B, the Patchscope's Precision-at-1 against the model's final-layer argmax exceeds the logit lens and Tuned Lens at every layer from layer 10 onward, with the largest absolute gap (up to about 98 percentage points) in mid layers 18 through 22 of the studied models.^[1]^[3] In the first ten layers, where the logit lens essentially fails, the Patchscope continues to recover useful information.^[1]

On attribute extraction with GPT-J across 12 commonsense and factual tasks, the zero-shot Patchscope is competitive with or beats a logistic-regression probe in the early-to-mid layers and trails it in late layers, where representations are reported to shift toward the next-token prediction objective and lose easy access to fine-grained attributes.^[1] On six of the twelve tasks, the Patchscope advantage is statistically significant at $p < 10^{-5}$. Individual examples reported in the paper include "substance phase" (Patchscope 91.9 percent versus probe 73.8 percent), "company CEO" (47.8 percent versus 5.0 percent), and "country currency" (51.0 percent versus 17.7 percent).^[1]^[3]

On the entity-description experiment with 200 popular and 200 rare entities on Vicuna 13B, the paper shows RougeL between the Patchscope's generated description and a reference description rising through layers one to five and stabilizing thereafter, a pattern the authors call "gradual entity resolution".^[1]^[3] For cross-model patching between Vicuna 7B and Vicuna 13B, RougeL is higher when the larger model is the target than when the smaller model inspects itself.^[1]

On the multi-hop reasoning experiment, the Patchscope achieves 50 percent accuracy on a curated set of 46 two-hop factual queries, versus 35.71 percent for chain-of-thought prompting and 19.57 percent for vanilla generation, all with Vicuna 13B.^[1]^[3] The paper interprets this result as evidence that the model often computes correct intermediate states that are not used by its later layers, so an intervention that reroutes the intermediate state can recover the answer.

Adoption and follow-up

Patchscopes was made available alongside an interactive explorable hosted by Google's People + AI Research group, framing the technique for a broader audience by visualizing how internal representations are decoded into natural language.^[11]^[7] An accompanying Google Research blog post from April 2024 by Avi Caciularu and Asma Ghandeharioun summarizes the method and emphasizes the unification claim.^[7] The project page maintained by Pair Code at pair-code.github.io continues to host examples and a reference implementation.^[11] The framework was rapidly picked up in interpretability tutorials, and reading lists on the subject in 2024 and 2025 commonly include Patchscopes as one of the standard references for decoding intermediate activations.

Within a year of publication, Patchscopes had appeared in several follow-ups. Auto-Patch (May 2025) by Jan, Tahory, Talmi, and Abo Mokh trains a classifier to decide which hidden states to patch and reports a multi-hop accuracy gain on MuSiQue from 18.45 percent to 23.63 percent, narrowing the gap to chain-of-thought prompting on that dataset without retraining the base model.^[8] Superscopes (March 2025) by Jacobi and Niv extends the framework by amplifying superposed features in MLP outputs before patching, aiming to surface representations that an unamplified Patchscope returns as uninformative.^[10] Open Problems in Mechanistic Interpretability (Sharkey et al., 2025) lists Patchscopes among the standard intervention-based tools available to the field.^[12] Reviews of mechanistic interpretability such as Bereska and Gavves (2024) cite Patchscopes as a unifying point for vocabulary-projection and intervention methods.^[13]

The framework is sometimes mentioned alongside circuit-style work such as Anthropic's attribution graphs and scaling monosemanticity, which target a different layer of explanation: circuits and attribution graphs aim to identify which features and computational paths produce a given behavior, while Patchscopes aims to verbalize what a single representation encodes. The two are complementary rather than overlapping. Anthropic's "On the Biology of a Large Language Model" (March 2025) by Jack Lindsey and collaborators uses attribution graphs to study multi-step reasoning, planning in poetry, and hallucination inhibition in Claude 3.5 Haiku; it shares the goal of interpreting how a model computes an answer but uses replacement-model and cross-layer-transcoder machinery rather than Patchscope-style decoding.^[14] Tools such as TransformerLens and various open-source sparse autoencoder libraries provide complementary infrastructure that Patchscopes can build on, since extracting and patching activations is a standard operation in those tools.

A second strand of adoption is in safety-oriented analyses. The 2024 paper on refusal directions by Andy Arditi and collaborators used activation interventions across 13 open chat models up to 72B parameters to show that a one-dimensional residual-stream subspace mediates refusal, and the workflow of identifying a direction and then reading it out is a natural fit for Patchscopes.^[9] In the 2025 review literature, Patchscopes is grouped with other "natural-language probes" alongside techniques that ask the model to summarize parts of its own computation.^[12]^[13] The framework is also referenced in tutorials on circuit discovery and activation patching, which note that the same primitive operations underlie both circuit identification and Patchscopes-style readout, with the difference lying in what the experimenter does with the second forward pass.^[6]^[11]

Limitations

The paper itself catalogs several limitations. First, Patchscopes inherit prompt sensitivity from any few-shot setup: the target prompt strongly determines what aspect of the representation is verbalized, and a poorly chosen prompt can hide rather than reveal information.^[1]^[3] Second, generation faithfulness is not guaranteed; because the target model performs a normal forward pass, it can hallucinate descriptions that are plausible but not actually grounded in the patched state.^[1]^[3] Third, the framework requires running a second forward pass for every probe, making it more expensive than a logit-lens projection. Fourth, in cross-model patching, the affine mapping f is only as good as the paired data used to fit it, and some representations may not have a clean correspondence between source and target. Fifth, the experiments in the paper are run on a small set of models (GPT-J 6B, Vicuna 7B and 13B, Pythia 12B, LLaMA 2 13B), and the multi-hop reasoning result is reported on a curated 46-sample set; extrapolating numerical claims beyond those settings is not directly supported by the paper.^[1]^[3]

A more conceptual concern is that the framework explains representations in terms of natural language sampled from the same model, which means that any pathology of the target model (biases, refusals, stylistic preferences) becomes a confound in the readout. This is acknowledged in the discussion as a direction for further work and is closely related to the broader question of whether model self-explanations are reliable in interpretability research.^[1]^[7]

Patchscopes belongs to the broader family of methods in mechanistic interpretability that read or modify a model's internal state to understand its behavior. The closest neighbors are the logit lens and Tuned Lens for layer-wise decoding, activation patching for causal localization, and linear probes for supervised attribute extraction. It is complementary to feature-based approaches such as sparse autoencoders and to the circuit-discovery work that culminated in Scaling Monosemanticity and attribution graphs. Whereas circuits and SAE features describe what the model represents in terms of units, Patchscopes describes a particular representation in terms of words.

The framework has natural connections to model editing and knowledge localization. Activation patching has been used for years to localize where a fact is stored, and Patchscopes can be thought of as the natural extension that asks not just where a fact lives but what the localized representation says when allowed to speak. Steering and refusal directions in the residual stream, which became a major topic of activation-engineering research in 2024 and 2025, are well-suited to Patchscope readouts because the open-vocabulary description can summarize the behavioral content of an inferred direction.^[9]

References

1. Asma Ghandeharioun, Avi Caciularu, Adam Pearce, Lucas Dixon, Mor Geva, "Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models", arXiv, 2024-01-11. https://arxiv.org/abs/2401.06102. Accessed 2026-05-20.
2. Asma Ghandeharioun, Avi Caciularu, Adam Pearce, Lucas Dixon, Mor Geva, "Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models", Proceedings of the 41st International Conference on Machine Learning (PMLR vol. 235, pp. 15466-15490), 2024-07-21. https://proceedings.mlr.press/v235/ghandeharioun24a.html. Accessed 2026-05-20.
3. Asma Ghandeharioun et al., "Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models (HTML version v3)", arXiv, 2024. https://arxiv.org/html/2401.06102v3. Accessed 2026-05-20.
4. nostalgebraist, "Interpreting GPT: The Logit Lens", LessWrong, 2020-08-31. https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru. Accessed 2026-05-20.
5. Nora Belrose, Zach Furman, Logan Smith, Danny Halawi, Igor Ostrovsky, Lev McKinney, Stella Biderman, Jacob Steinhardt, "Eliciting Latent Predictions from Transformers with the Tuned Lens", arXiv, 2023-03-14. https://arxiv.org/abs/2303.08112. Accessed 2026-05-20.
6. Fred Zhang and Neel Nanda, "Towards Best Practices of Activation Patching in Language Models: Metrics and Methods", arXiv, 2023-09-27. https://arxiv.org/abs/2309.16042. Accessed 2026-05-20.
7. Avi Caciularu and Asma Ghandeharioun, "Patchscopes: A unifying framework for inspecting hidden representations of language models", Google Research Blog, 2024-04-11. https://research.google/blog/patchscopes-a-unifying-framework-for-inspecting-hidden-representations-of-language-models/. Accessed 2026-05-20.
8. Aviv Jan, Dean Tahory, Omer Talmi, Omar Abo Mokh, "Auto-Patching: Enhancing Multi-Hop Reasoning in Language Models", arXiv, 2025-05-31. https://arxiv.org/abs/2506.00483. Accessed 2026-05-20.
9. Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, Neel Nanda, "Refusal in Language Models Is Mediated by a Single Direction", arXiv, 2024-06-17. https://arxiv.org/abs/2406.11717. Accessed 2026-05-20.
10. Jonathan Jacobi, Gal Niv, "Superscopes: Amplifying Internal Feature Representations for Language Model Interpretation", arXiv, 2025-03-03. https://arxiv.org/abs/2503.02078. Accessed 2026-05-20.
11. Pair Code (Google), "Patchscopes (project page)", PAIR Interpretability, 2024. https://pair-code.github.io/interpretability/patchscopes/. Accessed 2026-05-20.
12. Lee Sharkey et al., "Open Problems in Mechanistic Interpretability", arXiv, 2025-01-27. https://arxiv.org/abs/2501.16496. Accessed 2026-05-20.
13. Leonard Bereska, Efstratios Gavves, "Mechanistic Interpretability for AI Safety: A Review", arXiv, 2024-04-22. https://arxiv.org/abs/2404.14082. Accessed 2026-05-20.
14. Jack Lindsey, Emmanuel Ameisen, Adam Pearce, Joshua Batson, et al., "On the Biology of a Large Language Model", Transformer Circuits Thread (Anthropic), 2025-03-27. https://transformer-circuits.pub/2025/attribution-graphs/biology.html. Accessed 2026-05-20.
