Logit lens

The logit lens is a foundational technique in mechanistic interpretability for inspecting the intermediate computations of transformer language models. It works by projecting the hidden activations of any layer through the model's own output (unembedding) matrix, producing a probability distribution over the vocabulary at every depth of the network. By tracking how these layer-wise distributions evolve, researchers can observe a model's "running predictions" as input tokens are processed and refined into a final answer.[1]

The technique was introduced by the pseudonymous blogger nostalgebraist in an August 31, 2020 post on LessWrong titled "interpreting GPT: the logit lens," where it was applied to GPT-2.[1] Despite its origin outside of formal peer review, the post has become one of the most influential pieces of interpretability research, with the logit lens forming the conceptual basis for many subsequent techniques including the Tuned Lens, Future Lens, DoLa, and Patchscopes.[2][3][4][5]

Introduction

A transformer language model produces token predictions by passing input embeddings through a stack of identical residual blocks and then applying a final linear projection (the unembedding matrix) followed by a softmax to yield a probability distribution over the vocabulary. Conventionally, only this final output is treated as a "prediction"; the intermediate activations are regarded as opaque internal state. The logit lens questions that assumption. Because GPT-style architectures use a residual stream where each layer adds an update to a running representation, the intermediate activations live in the same vector space as the final output and can in principle be decoded with the same unembedding map.[1]

When that decoding is performed, the resulting distributions are often interpretable. Early layers tend to produce nonsense or shallow guesses, middle layers begin to converge on plausible candidates, and later layers refine toward the model's final prediction. This "iterative refinement" view, originally articulated by nostalgebraist, has since been formalized and tested across many model families and scales.[1][2]

The logit lens has two qualities that explain its widespread adoption. First, it requires no training: anyone with access to a model's weights and intermediate activations can apply it. Second, it produces outputs in vocabulary space rather than in some learned abstract feature space, which means the results can be read directly by a human as candidate tokens. These properties have made it a default first step in many interpretability workflows.[6][7]

The technique: residual stream and unembedding

The logit lens rests on two structural features of standard transformer language models.

Residual stream

A decoder-only transformer is organized around a residual stream: a sequence-position-by-hidden-dimension tensor that is read from and written to by each block. Every attention sublayer and every feedforward sublayer adds its output back to this running representation rather than replacing it. As a result, the hidden state at the input of layer $\ell+1$ equals the hidden state at the input of layer $\ell$ plus whatever the layer-$\ell$ sublayers wrote. Because every layer's output enters the same additive stream, all intermediate hidden states live in the same vector space and share its coordinate axes, even if the directions they emphasize differ from layer to layer.[6][8]
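The additive structure can be illustrated with a toy sketch. The blocks below are hypothetical stand-ins (a random linear map followed by tanh), not real attention or feedforward sublayers; the only point is that each block adds to the stream, so the final state is the embedding plus the sum of all updates.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_layers = 8, 4

# Hypothetical stand-ins for each block's attention + feedforward sublayers.
block_weights = [rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(n_layers)]

def block_update(h, W):
    """What one toy block writes into the residual stream."""
    return np.tanh(W @ h)

h = rng.normal(size=d_model)   # embedding of a single token position
stream = [h.copy()]            # stream[l] is the state at the input of layer l
updates = []
for W in block_weights:
    u = block_update(h, W)
    h = h + u                  # blocks add to the stream; they never overwrite it
    updates.append(u)
    stream.append(h.copy())

# The final state equals the embedding plus the sum of every block's update.
reconstructed = stream[0] + np.sum(updates, axis=0)
```

Every entry of `stream` has the same shape, which is what makes it possible to feed any of them to the same read-out.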

Unembedding

After the final residual block, the model applies a layer normalization (often called LN_F) and then multiplies the result by an unembedding matrix to produce logits, the unnormalized scores from which the next-token distribution is computed via softmax. In the original GPT-2 implementation, the unembedding matrix is tied to the input embedding matrix: the same weights that map token IDs into the residual stream at the start of the network are transposed and used to map the final hidden state back to a vocabulary distribution at the end.[1]

The lens

The logit lens takes the hidden state at an intermediate layer $\ell$, applies the same final-layer-norm-and-unembedding pipeline that the model would normally apply only at the top of the stack, and reads the result as a token distribution. In its simplest form, the procedure is:

  1. Run the model forward on a prompt.
  2. Capture the residual stream at layer $\ell$.
  3. Apply LN_F to that residual stream.
  4. Multiply by the unembedding matrix to obtain a logit vector.
  5. Apply softmax (optionally) to obtain a probability distribution.

The output can be inspected directly. For instance, the argmax of the layer-$\ell$ logits gives the token the model would predict if the residual stream at layer $\ell$ were treated as the final hidden state.[1][7]
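A minimal NumPy sketch of steps 3 to 5 follows. It assumes the hidden states from steps 1 and 2 have already been captured; here they are random placeholders, and the names `logit_lens`, `W_U`, `gamma`, and `beta` are illustrative rather than taken from any particular library.

```python
import numpy as np

def layer_norm(h, gamma, beta, eps=1e-5):
    """Final layer norm (LN_F) applied over the hidden dimension."""
    mu = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return gamma * (h - mu) / np.sqrt(var + eps) + beta

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def logit_lens(h_layer, W_U, gamma, beta):
    """Steps 3-5: apply LN_F, multiply by the unembedding, then softmax."""
    logits = layer_norm(h_layer, gamma, beta) @ W_U.T
    return logits, softmax(logits)

# Toy placeholders standing in for a real model's weights and activations.
rng = np.random.default_rng(0)
d_model, vocab, n_layers = 16, 50, 6
W_U = rng.normal(size=(vocab, d_model))                    # unembedding, shape (V, d)
gamma, beta = np.ones(d_model), np.zeros(d_model)          # LN_F parameters
hidden_states = rng.normal(size=(n_layers + 1, d_model))   # one state per layer

top_tokens = []
for h in hidden_states:
    logits, probs = logit_lens(h, W_U, gamma, beta)
    top_tokens.append(int(logits.argmax()))                # layer-wise "running prediction"
```

With a real model, `hidden_states` would come from forward hooks on the residual stream, and `top_tokens` would be decoded with the tokenizer to read the per-layer guesses.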

History: nostalgebraist's 2020 LessWrong post

The technique was introduced in a post titled "interpreting GPT: the logit lens" published on LessWrong on August 31, 2020 by the pseudonymous user nostalgebraist.[1] The post observed that GPT's probabilistic predictions are a linear function of its final-layer activations and argued that, because of weight tying, the same linear function could be applied to earlier layers' activations to produce intelligible token distributions. The author tested this primarily on GPT-2 1558M (often called GPT-2 XL), which has 48 transformer blocks and a vocabulary of 50,257 tokens, with a hidden dimension of 1,600.[1]

The post made several empirical claims that have shaped subsequent interpretability research:

  • The input is "scrambled" almost immediately. After the first layer, the residual stream no longer resembles the input embeddings in the logit lens projection; the KL divergence between the layer-1 lens output and the layer-0 (input) distribution jumps discontinuously.[1]
  • Predictions are refined iteratively. From the middle layers onward, the lens outputs converge roughly monotonically toward the final prediction, with intermediate layers producing recognizable "early guesses" that are progressively sharpened.[1]
  • Lens outputs become interpretable mid-stack. Early layers tend to be nonsense, mid layers produce tokens that are the right part of speech or topic, and late layers approach the final answer.[1]

A follow-up notebook released by nostalgebraist in May 2021 extended the analysis to models ranging from 125M to 2.7B parameters, including GPT-Neo and CTRL.[1] The original post and its follow-up have been widely cited in subsequent interpretability work, including the Tuned Lens paper, which describes nostalgebraist's contribution as the source of the logit lens technique.[2]

Mathematical formulation

Let a decoder-only transformer have $L$ layers, hidden dimension $d$, and vocabulary size $V$. Let $h^{(\ell)} \in \mathbb{R}^d$ denote the residual-stream hidden state at layer $\ell$ for some token position. The final logits produced by the model are:

$$z = W_U \cdot \text{LN}_F(h^{(L)})$$

where $W_U \in \mathbb{R}^{V \times d}$ is the unembedding matrix and $\text{LN}_F$ is the final layer norm. The final next-token distribution is then $p = \text{softmax}(z)$.[2]

The logit lens applies the same read-out to an arbitrary intermediate layer:

$$z^{(\ell)}_{\text{lens}} = W_U \cdot \text{LN}_F(h^{(\ell)})$$

with $p^{(\ell)}_{\text{lens}} = \text{softmax}(z^{(\ell)}_{\text{lens}})$.[2] The same unembedding matrix and final layer norm are reused for every layer; no additional parameters are learned. This makes the logit lens an identity probe: it assumes intermediate states already encode their predictions in the basis of the final residual stream.[2][7]

Two diagnostic quantities are commonly computed:

  • Top-1 token agreement: the fraction of positions where $\arg\max_v z^{(\ell)}_{\text{lens},v}$ equals $\arg\max_v z_v$. This measures how often the lens at layer $\ell$ already prefers the same token the model ultimately outputs.[1]
  • KL divergence: $D_{\text{KL}}(p^{(\ell)}_{\text{lens}} \,\|\, p)$, measuring how far the layer-$\ell$ lens distribution is from the final distribution. For models on which the lens works well, this divergence typically shrinks as $\ell$ grows, though not necessarily monotonically; the original post documents a sharp discontinuity after layer 0.[1][2]
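Both diagnostics are short to compute once lens logits and final logits are available. The sketch below uses random toy logits of shape (positions, vocabulary); the function names are illustrative, and the KL direction follows the definition given above (lens distribution first).

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def top1_agreement(lens_logits, final_logits):
    """Fraction of positions where the lens argmax equals the final argmax."""
    return float(np.mean(lens_logits.argmax(axis=-1) == final_logits.argmax(axis=-1)))

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) per position, in nats; eps guards against log(0)."""
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

# Toy data: lens logits are the final logits plus noise.
rng = np.random.default_rng(0)
final_logits = rng.normal(size=(5, 20))
lens_logits = final_logits + rng.normal(scale=0.5, size=(5, 20))

agreement = top1_agreement(lens_logits, final_logits)
kl_per_position = kl_divergence(softmax(lens_logits), softmax(final_logits))
```

Plotting `agreement` and the mean of `kl_per_position` against layer index gives the two curves most lens analyses report.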

When the input embedding and output unembedding are weight-tied (as in GPT-2), $W_U$ equals the transpose of the embedding matrix $W_E$. In modern open-weights families such as Llama, the unembedding matrix is often a distinct parameter ("lm_head"), but the technique is unchanged: one simply uses whichever matrix is applied at the end of the forward pass.[9]

Failure modes and limitations

Although the logit lens often works well on GPT-2, subsequent research has documented systematic failures across other model families and settings.[2] These limitations motivated the development of the Tuned Lens and other learned alternatives.

Representational basis drift

The logit lens implicitly assumes that intermediate residual-stream states live in the same coordinate system as the final state. When that assumption fails (for example, when middle layers represent information in a rotated or shifted basis relative to the unembedding), the lens produces brittle and biased outputs.[2] Belrose et al. (2023) document this concretely across multiple families: the logit lens performs well on GPT-2 but is systematically less informative on BLOOM, OPT, GPT-Neo, GPT-J, and the Pythia suite, where intermediate hidden states are not directly aligned with the unembedding.[2]

Translation bias

A specific instance of basis drift is what Belrose et al. call translation bias: a constant additive shift between the intermediate residual stream and the residual stream that the unembedding expects. Even when the linear structure of the intermediate space is similar to the final space, an additive offset can cause the logit lens to consistently prefer the wrong tokens. The Tuned Lens corrects this by learning an explicit bias term for each layer.[2]
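A deliberately contrived toy example shows how a constant offset alone can flip the lens's top token. It omits the final layer norm for simplicity and uses an unembedding with orthonormal rows; the offset, token indices, and scales are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 16, 6

# Toy unembedding with orthonormal rows, so each token has its own clean direction.
Q, _ = np.linalg.qr(rng.normal(size=(d_model, d_model)))
W_U = Q[:vocab]                          # shape (vocab, d_model)

correct_token, biased_token = 2, 4
h_aligned = 2.0 * W_U[correct_token]     # state in the basis the unembedding expects
offset = 5.0 * W_U[biased_token]         # hypothetical constant shift at this layer
h_layer = h_aligned + offset             # what the intermediate layer actually holds

# Plain lens: the offset dominates, so the wrong token wins.
lens_top1 = int(np.argmax(W_U @ h_layer))
# Bias-corrected read-out: removing the offset recovers the intended token.
corrected_top1 = int(np.argmax(W_U @ (h_layer - offset)))
```

Here the correction is known by construction; in practice it has to be estimated from data, which is the role of the learned per-layer bias.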

Early-layer non-interpretability

Across model families, lens outputs from the earliest layers are generally not informative. nostalgebraist noted that GPT-2's first layer produces a discontinuity in the lens output; later work has repeatedly observed that early layers in encoder-decoder and instruction-tuned models do not decode to meaningful token distributions at all.[1][5] Ghandeharioun et al. cite this "failure in inspecting early layers" as a primary motivation for their Patchscopes framework.[5]

Heavily fine-tuned or non-linear models

The lens assumes that the relationship between the residual stream and the final output is approximately linear and stable across layers. Heavy instruction tuning, reinforcement learning from human feedback, or architectures with significant non-linearities between the final layer and the unembedding can violate this assumption, producing lens outputs that diverge from the model's actual behavior.[7][10]

Encoder-decoder models

The original logit lens technique targets decoder-only models. For encoder-decoder transformers, where the encoder does not directly produce tokens, the lens must be adapted. Langedijk et al. (2023) introduced DecoderLens, which decodes encoder activations through the decoder, addressing a gap that the bare logit lens does not handle.[11]

Correlation versus causation

A more conceptual limitation is that the logit lens shows what information is legible at a layer when projected through the unembedding, but not whether that information is actually used by downstream layers in the model's forward pass. Belrose et al. address this in part with causal experiments demonstrating that the Tuned Lens uses features similar to those the model itself uses, but the bare logit lens makes no such guarantee.[2]

The Tuned Lens

The most influential extension of the logit lens is the Tuned Lens, introduced by Nora Belrose, Zach Furman, Logan Smith, Danny Halawi, Igor Ostrovsky, Lev McKinney, Stella Biderman, and Jacob Steinhardt in March 2023.[2] Their paper, "Eliciting Latent Predictions from Transformers with the Tuned Lens," addresses the logit lens's brittleness by training a small affine probe per layer rather than using the identity transformation implicit in the bare logit lens.[2]

Method

For each layer $\ell$, the Tuned Lens trains a per-layer affine translator $A_\ell$ such that:

$$z^{(\ell)}_{\text{tuned}} = W_U \cdot \text{LN}_F(A_\ell(h^{(\ell)}))$$

where $A_\ell(h) = W_\ell h + b_\ell$ is a learned affine map. The parameters $W_\ell$ and $b_\ell$ are trained, with the base model frozen, to minimize the KL divergence between $\text{softmax}(z^{(\ell)}_{\text{tuned}})$ and the model's final distribution.[2] Because the bias term $b_\ell$ is learned per layer, the Tuned Lens directly corrects the translation bias that affects the bare logit lens.
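The forward pass and training objective can be sketched in NumPy as follows. This is not the tuned-lens library's code: the weights are random placeholders, the identity initialization is an assumption made for illustration, and the KL direction (final distribution as the first argument) is likewise an assumption, since the text above does not fix it. No optimizer step is shown.

```python
import numpy as np

def layer_norm(h, gamma, beta, eps=1e-5):
    return gamma * (h - h.mean()) / np.sqrt(h.var() + eps) + beta

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def tuned_lens_logits(h, W_l, b_l, W_U, gamma, beta):
    """z_tuned = W_U . LN_F(W_l h + b_l) for a single position."""
    return W_U @ layer_norm(W_l @ h + b_l, gamma, beta)

def kl_divergence(p, q, eps=1e-12):
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

# Toy placeholders for a frozen model's weights and two captured hidden states.
rng = np.random.default_rng(0)
d_model, vocab = 8, 12
W_U = rng.normal(size=(vocab, d_model))
gamma, beta = np.ones(d_model), np.zeros(d_model)
h_mid, h_final = rng.normal(size=d_model), rng.normal(size=d_model)
final_logits = W_U @ layer_norm(h_final, gamma, beta)

# Assumed identity initialization: the translator starts out as the plain logit lens.
W_l, b_l = np.eye(d_model), np.zeros(d_model)
tuned = tuned_lens_logits(h_mid, W_l, b_l, W_U, gamma, beta)
plain = W_U @ layer_norm(h_mid, gamma, beta)

# Loss that training would reduce by updating only W_l and b_l.
loss = kl_divergence(softmax(final_logits), softmax(tuned))
```

With $W_\ell = I$ and $b_\ell = 0$ the translator reproduces the bare logit lens exactly, which makes the lens a natural starting point for training.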

Empirical results

The authors evaluate the Tuned Lens on autoregressive language models of up to 20 billion parameters, including GPT-Neo, OPT, BLOOM, and the Pythia suite.[2] Across these models, the Tuned Lens consistently achieves lower perplexity and better next-token prediction than the bare logit lens, and the improvement is largest precisely on the models where the logit lens performs worst.[2]

Two further results are notable. First, causal experiments show that interventions on the Tuned Lens directions affect the model's behavior in ways consistent with using the same features the model uses internally, suggesting the lens is faithful rather than merely descriptive. Second, the trajectory of latent predictions across layers can be used to detect malicious or anomalous inputs with high accuracy, illustrating a practical safety application of the technique.[2]

Implementation

The Tuned Lens is distributed as an open-source Python library (tuned-lens) released under the MIT license and developed at AlignmentResearch.[12] It is installable from PyPI and requires PyTorch 1.13 or later. The library provides utilities for training new lenses on arbitrary frozen transformer language models, evaluating lens quality, and integrating with downstream interpretability workflows.[12]

Other extensions and modern alternatives

The Tuned Lens is the most prominent direct extension, but the logit lens family has grown considerably since 2020.

Jump to Conclusions

Alexander Yom Din, Taelin Karidi, Leshem Choshen, and Mor Geva introduced a related approach in March 2023 in a paper titled "Jump to Conclusions: Short-Cutting Transformers With Linear Transformations."[13] Like the Tuned Lens, the method uses linear transformations to recast intermediate hidden states in the basis of the final layer; unlike the Tuned Lens, the paper emphasizes the efficiency angle, showing that GPT-2 and BERT often predict the final output already in early layers. The authors report that targeting 95% accuracy retention with their method saves 7.9% of layers for GPT-2 and 5.4% for BERT relative to baseline early-exit strategies, and that attention sublayers are more tolerant of such substitution than feedforward sublayers.[13] The work was published at LREC-COLING 2024.[13]
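The idea of a linear shortcut from an intermediate layer to the final layer can be sketched with ordinary least squares. The data below are synthetic, and the use of `np.linalg.lstsq` is an assumption for illustration; the paper's exact fitting procedure is not described here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, d_model = 200, 8

# Synthetic stand-in: final-layer states are a fixed linear function of
# layer-l states plus a little noise. Real states would come from a model.
true_map = rng.normal(size=(d_model, d_model))
H_mid = rng.normal(size=(n_samples, d_model))
H_final = H_mid @ true_map.T + 0.01 * rng.normal(size=(n_samples, d_model))

# Fit A so that A @ h_mid approximates h_final in the least-squares sense.
A_T, *_ = np.linalg.lstsq(H_mid, H_final, rcond=None)
A = A_T.T

# Compare against the identity shortcut implicit in the bare logit lens.
identity_error = float(np.mean((H_mid - H_final) ** 2))
fitted_error = float(np.mean((H_mid @ A.T - H_final) ** 2))
```

The fitted map `A` would then be applied before the usual layer norm and unembedding, in place of running the remaining layers.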

Future Lens

Koyena Pal, Jiuding Sun, Andrew Yuan, Byron C. Wallace, and David Bau introduced Future Lens in a November 2023 paper accepted at CoNLL 2023.[3] Where the logit lens reads the most-likely next token from a hidden state, the Future Lens asks whether a single hidden state at position $t$ encodes information about tokens at positions $t+2$ and beyond. Working with GPT-J-6B, the authors find that linear approximation and causal intervention methods allow them to recover the model's prediction of subsequent (not just immediate) tokens with more than 48% accuracy from a single hidden state at certain layers.[3] The work suggests that transformer hidden states encode richer future-context information than the next-token-only view implied by the bare logit lens.

Patchscopes

In January 2024, Asma Ghandeharioun, Avi Caciularu, Adam Pearce, Lucas Dixon, and Mor Geva introduced Patchscopes, a unifying framework for inspecting hidden representations of language models, presented at ICML 2024.[5] Patchscopes generalizes prior interpretability methods by patching a source-prompt representation into a target prompt and letting the model itself produce a natural-language explanation of what the patched representation encodes. The paper explicitly observes that "many prior interpretability methods based on projecting representations into the vocabulary space and intervening on the LLM computation can be viewed as instances of this framework," with the logit lens being a canonical example.[5] Patchscopes addresses several known limitations of the bare logit lens, including failure on early layers and limited expressivity, and enables novel uses such as employing a more capable model to explain a smaller model's representations.[5]

DoLa: Decoding by Contrasting Layers

A practical downstream application of layer-wise logit decoding is DoLa ("Decoding by Contrasting Layers"), introduced by Yung-Sung Chuang, Yujia Xie, Hongyin Luo, Yoon Kim, James Glass, and Pengcheng He in September 2023.[4] DoLa is not an interpretability method per se but rather a decoding strategy aimed at reducing model hallucinations. It works by contrasting the logits obtained by projecting later layers to the vocabulary against the logits obtained by projecting earlier layers, on the hypothesis that factual knowledge is more localized in later layers and that down-weighting earlier-layer predictions reduces hallucinations. The authors report improvements of 12 to 17 percentage points on TruthfulQA across the LLaMA family without external retrieval or fine-tuning.[4] DoLa relies on the same underlying mechanism as the logit lens, namely the ability to project an intermediate residual stream through the unembedding matrix, but uses the resulting distributions as an inference-time signal rather than as an analysis tool.
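A simplified sketch of the contrast step is shown below. It assumes a single fixed early layer and a plausibility threshold `alpha` that masks tokens the late layer considers unlikely; both are illustrative choices, and the logits are hand-picked toy values rather than model outputs.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def contrast_layers(late_logits, early_logits, alpha=0.1):
    """Score tokens by how much their log-probability grows from early to late layer."""
    late_lp, early_lp = log_softmax(late_logits), log_softmax(early_logits)
    # Keep only tokens whose late-layer probability is at least alpha times the maximum.
    plausible = late_lp >= late_lp.max() + np.log(alpha)
    return np.where(plausible, late_lp - early_lp, -np.inf)

# Toy lens logits for a 4-token vocabulary.
late_logits = np.array([2.0, 1.9, -3.0, -3.0])    # late layer: tokens 0 and 1 nearly tied
early_logits = np.array([2.0, 0.0, -3.0, -3.0])   # early layer: already confident in token 0

scores = contrast_layers(late_logits, early_logits)
late_choice = int(np.argmax(late_logits))
contrast_choice = int(np.argmax(scores))
```

In this toy case the late layer alone narrowly prefers token 0, while the contrast prefers token 1, whose probability rose most between the two layers.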

Multilingual logit lens analysis

Chris Wendler, Veniamin Veselovsky, Giovanni Monea, and Robert West used the logit lens to analyze how multilingual Llama 2 models process non-English text, in a paper titled "Do Llamas Work in English? On the Latent Language of Multilingual Transformers," accepted at ACL 2024.[9] Applying the logit lens to non-English prompts revealed a three-phase trajectory in which intermediate embeddings start far from any output token, then in middle layers prefer the English version of a semantically correct continuation over its target-language equivalent, then in final layers move into the input-language-specific region.[9] The authors interpret this as evidence that the middle layers of Llama 2 operate in an abstract "concept space" that is biased toward English rather than literally translating to English and restarting the forward pass.[9] The work is a notable example of how the logit lens can yield substantive empirical findings about model behavior, not merely descriptive visualizations.

Use cases

The logit lens and its descendants have been applied across a broad range of interpretability and engineering tasks.

Watching factual recall develop

A canonical use case is tracking when, across layers, a model "knows" a particular fact. By applying the logit lens to a residual stream position corresponding to a factual prompt (such as "The capital of France is"), one can identify the layer at which the correct token first becomes the top prediction. This has been used as part of larger investigations into how transformers store and retrieve factual associations, including work that localizes factual recall to specific mid-stack MLP modules.[14]
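Finding the layer at which a target token first tops the lens output is a short loop. The sketch below uses synthetic per-layer logits in which the target's score grows with depth; the helper name and the token IDs are hypothetical.

```python
import numpy as np

def first_layer_where_top1(lens_logits_by_layer, target_id):
    """Return the earliest layer whose lens argmax is target_id, or None if never."""
    for layer, logits in enumerate(lens_logits_by_layer):
        if int(np.argmax(logits)) == target_id:
            return layer
    return None

# Synthetic lens logits: 6 layers, 5-token vocabulary.
n_layers, vocab, target_id = 6, 5, 3
lens_logits_by_layer = np.zeros((n_layers, vocab))
lens_logits_by_layer[:, 0] = 1.0                             # a distractor that leads early on
lens_logits_by_layer[:, 1] = 0.5
lens_logits_by_layer[:, target_id] = 0.3 * np.arange(n_layers)  # target strengthens with depth

emergence_layer = first_layer_where_top1(lens_logits_by_layer, target_id)
```

For a prompt such as "The capital of France is", `target_id` would be the tokenizer's ID for the expected answer and the logits would come from applying the lens at the final prompt position.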

Hallucination detection and mitigation

Two complementary applications target hallucinations. As an analysis tool, the logit lens (and Tuned Lens) can be used to inspect cases where a model produces incorrect output, asking whether the correct answer ever surfaced at an intermediate layer and was overridden, or whether it never surfaced at all. As an intervention, DoLa shows that contrasting layer-wise logit distributions at decoding time can directly reduce hallucination rates on factuality benchmarks.[4] Belrose et al. additionally show that the trajectory of Tuned Lens predictions can serve as a feature for detecting anomalous or adversarial inputs.[2]

Model debugging and behavioral analysis

When a model produces an unexpected output, the logit lens provides a low-cost way to trace where in the network the unexpected behavior first becomes visible. If the lens shows that the wrong token was already preferred at layer 5, that points toward the embedding or early attention computations; if it emerges only in the final layers, the readout or late-stack feedforwards are the more likely source. Because the lens is correlational, such a localization is a hypothesis to be checked with causal methods rather than a diagnosis. This kind of layer-localized inspection is now a standard first step in interpretability case studies.[6][7]

Early-exit and efficient inference

If a model already "knows" the correct answer by some middle layer, one can in principle skip subsequent layers at inference time to save compute. The Jump to Conclusions paper formalizes this idea using linear shortcuts, demonstrating measurable layer savings while retaining most of the model's accuracy.[13]

Prompt engineering and behavior analysis

Practitioners use logit-lens-style inspection to verify that prompts elicit the intended internal representations. For example, if a chain-of-thought prompt is intended to make a model "think about" a particular intermediate concept, applying the lens to the relevant residual stream positions can confirm or refute whether that concept actually appears as a high-probability token at any layer.[6]

Multilingual analysis

The Wendler et al. findings illustrate how the logit lens can be used to probe macro-level questions about model behavior, such as whether a multilingual model uses one of its training languages as an internal pivot. Such analyses would be difficult or impossible to perform with the bare model output, which only exposes the final prediction.[9]

Implementations and tooling

Several open-source libraries provide ready-to-use logit-lens functionality.

  • TransformerLens is a library by Neel Nanda (maintained by Bryce Meyer) for mechanistic interpretability of GPT-style language models. It supports over 50 open-source models and exposes the hooks needed to extract intermediate residual-stream activations, on top of which a logit lens can be applied straightforwardly.[15]
  • tuned-lens is the official implementation of the Tuned Lens technique, also usable as a bare logit-lens utility for the base model. It is distributed by AlignmentResearch under the MIT license.[12]

Both libraries are widely used in academic interpretability research and have been adopted as standard tooling in mechanistic interpretability courses and tutorials.[7][15]

Conceptual significance

The logit lens occupies an unusual position in the interpretability literature. It is methodologically simple to the point of being almost trivial: apply the model's own final linear projection to its own intermediate states. Yet its empirical productivity has been enormous, both because the resulting visualizations are interpretable and because the technique inspired a research program that now includes the Tuned Lens, Future Lens, Jump to Conclusions, Patchscopes, DoLa, and many smaller variations.[2][3][4][5][13]

The lens family also provides one of the cleanest empirical illustrations of the iterative inference view of transformers, under which the model maintains and progressively refines a prediction throughout the depth of the network rather than computing all of its work at the end. This view is partially descriptive (what the logit lens shows) and partially mechanistic (what the Tuned Lens and Patchscopes attempt to verify causally), and has shaped subsequent work on early-exit inference, layer-wise decoding, and circuit-level interpretation of transformer behavior.[2][5][13]

At the same time, careful work since 2023 has clarified the lens's limits. The bare logit lens is reliable on weight-tied, decoder-only models trained on English (most prominently GPT-2) and progressively less reliable as architectures and training regimes diverge from that template. The Tuned Lens and Patchscopes can be read as principled responses to those limitations, preserving the conceptual core (project intermediate states to a vocabulary distribution) while adding either learned corrections or causal verification to address the failure modes of the original technique.[2][5]

References

  1. nostalgebraist. "interpreting GPT: the logit lens." LessWrong. August 31, 2020. https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens (Accessed 2026-05-19).
  2. Belrose, N., Furman, Z., Smith, L., Halawi, D., Ostrovsky, I., McKinney, L., Biderman, S., and Steinhardt, J. "Eliciting Latent Predictions from Transformers with the Tuned Lens." arXiv preprint arXiv:2303.08112. Submitted March 14, 2023. https://arxiv.org/abs/2303.08112 (Accessed 2026-05-19).
  3. Pal, K., Sun, J., Yuan, A., Wallace, B. C., and Bau, D. "Future Lens: Anticipating Subsequent Tokens from a Single Hidden State." arXiv preprint arXiv:2311.04897. Accepted at CoNLL 2023. https://arxiv.org/abs/2311.04897 (Accessed 2026-05-19).
  4. Chuang, Y.-S., Xie, Y., Luo, H., Kim, Y., Glass, J., and He, P. "DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models." arXiv preprint arXiv:2309.03883. Submitted September 7, 2023. https://arxiv.org/abs/2309.03883 (Accessed 2026-05-19).
  5. Ghandeharioun, A., Caciularu, A., Pearce, A., Dixon, L., and Geva, M. "Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models." arXiv preprint arXiv:2401.06102. ICML 2024. https://arxiv.org/abs/2401.06102 (Accessed 2026-05-19).
  6. "Patchscopes: A unifying framework for inspecting hidden representations of language models." Google Research Blog. https://research.google/blog/patchscopes-a-unifying-framework-for-inspecting-hidden-representations-of-language-models/ (Accessed 2026-05-19).
  7. "The Logit Lens and Tuned Lens." Learn Mechanistic Interpretability. https://learnmechinterp.com/topics/logit-lens-and-tuned-lens/ (Accessed 2026-05-19).
  8. "Patchscopes." PAIR (People + AI Research). https://pair-code.github.io/interpretability/patchscopes/ (Accessed 2026-05-19).
  9. Wendler, C., Veselovsky, V., Monea, G., and West, R. "Do Llamas Work in English? On the Latent Language of Multilingual Transformers." Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024). https://aclanthology.org/2024.acl-long.820/ (Accessed 2026-05-19).
  10. Brenndoerfer, M. "Logit Lens: Decoding Transformer Hidden States Layer by Layer." https://mbrenndoerfer.com/writing/logit-lens (Accessed 2026-05-19).
  11. Langedijk, A., Mohebbi, H., Sarti, G., Zuidema, W., and Jumelet, J. "DecoderLens: Layerwise Interpretation of Encoder-Decoder Transformers." arXiv preprint arXiv:2310.03686. https://arxiv.org/pdf/2310.03686 (Accessed 2026-05-19).
  12. AlignmentResearch. "tuned-lens: Tools for understanding how transformer predictions are built layer-by-layer." GitHub repository. https://github.com/AlignmentResearch/tuned-lens (Accessed 2026-05-19).
  13. Din, A. Y., Karidi, T., Choshen, L., and Geva, M. "Jump to Conclusions: Short-Cutting Transformers With Linear Transformations." arXiv preprint arXiv:2303.09435. Submitted March 16, 2023; LREC-COLING 2024. https://arxiv.org/abs/2303.09435 (Accessed 2026-05-19).
  14. Meng, K., Bau, D., Andonian, A., and Belinkov, Y. "Locating and Editing Factual Associations in GPT" (ROME). arXiv preprint arXiv:2202.05262. https://arxiv.org/pdf/2202.05262 (Accessed 2026-05-19).
  15. TransformerLensOrg. "TransformerLens: A library for mechanistic interpretability of GPT-style language models." GitHub repository. https://github.com/TransformerLensOrg/TransformerLens (Accessed 2026-05-19).
