TransformerLens

Updated Jun 25, 2026

TransformerLens is an open-source Python library for the mechanistic interpretability of GPT-style language models. It loads a pretrained transformer such as GPT-2, exposes every internal activation through a network of named hook points, and lets researchers cache, inspect, edit, ablate, and replace those intermediate computations during a forward pass. The library was created by interpretability researcher Neel Nanda and originally released under the name EasyTransformer; it is distributed under the MIT License and, as of 2026, is maintained by Bryce Meyer and Jonah Larson through the TransformerLensOrg GitHub organization, where it has accumulated roughly 3,600 stars and over 600 forks.^[1]^[2]

TransformerLens is widely used as standard tooling for mechanistic interpretability research, and the hook-point naming conventions it introduced appear in many published interpretability papers and blog posts. The library's own README states that "the goal of mechanistic interpretability is to take a trained model and reverse engineer the algorithms the model learned during training from its weights," and describes itself simply as "a library for doing mechanistic interpretability of GPT-2 Style language models."^[1] Its HookedTransformer class (and, from version 3.0, the TransformerBridge adapter) wraps a transformer language model with hook points that expose every residual stream addition, attention pattern, MLP activation, and projection in the model. Built-in methods such as run_with_cache and run_with_hooks use these hook points to support causal interventions, activation patching, direct logit attribution, and other interpretability techniques.^[1]^[3]

What is TransformerLens used for?

TransformerLens is used to reverse-engineer the internal algorithms of transformer language models, a research program known as mechanistic interpretability. In practice this means loading an open-weight model, running text through it, and then reading or rewriting the model's intermediate tensors to test hypotheses about how a behavior is computed. Typical workflows include locating the attention circuit responsible for a task (for example the indirect-object-identification circuit in GPT-2 Small), identifying induction heads that implement in-context copying, performing activation patching to measure which components are causally necessary for a prediction, and decomposing the model's output logits into the contributions of individual heads and layers. Because the library exposes a stable, named hook for every interesting tensor, these interventions can be expressed in a few lines of code without modifying the underlying model.^[1]^[8]^[9]
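One common heuristic for spotting induction heads can be sketched without any model at all. On a sequence made of a random block of tokens repeated twice, an induction head at a second-half position tends to attend to the token that followed the previous occurrence of the current token. The pure-Python sketch below scores a synthetic attention pattern this way; the helper names and the 0.8 attention weight are illustrative assumptions, not values or functions from the library.

```python
def induction_score(pattern, seq_len):
    """Mean attention from each second-half position to the token that
    followed the previous occurrence of the current token."""
    rows = range(seq_len, 2 * seq_len)
    return sum(pattern[i][i - seq_len + 1] for i in rows) / seq_len


def uniform_causal(total_len):
    """Attention spread evenly over all visible (non-future) positions."""
    return [
        [1.0 / (i + 1) if j <= i else 0.0 for j in range(total_len)]
        for i in range(total_len)
    ]


def toy_induction_head(seq_len, weight=0.8):
    """Synthetic pattern that puts `weight` on the induction target."""
    total_len = 2 * seq_len
    pattern = uniform_causal(total_len)
    for i in range(seq_len, total_len):
        target = i - seq_len + 1
        rest = (1.0 - weight) / i  # shared by the i other visible positions
        pattern[i] = [
            weight if j == target else (rest if j <= i else 0.0)
            for j in range(total_len)
        ]
    return pattern


seq_len = 3  # the toy sequence is three tokens repeated twice
head_score = induction_score(toy_induction_head(seq_len), seq_len)
baseline_score = induction_score(uniform_causal(2 * seq_len), seq_len)
```

A head whose score is far above the uniform baseline is a candidate induction head; in real work the pattern would come from a cached attention tensor rather than a synthetic matrix.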

History and origins

EasyTransformer (2022)

The project that became TransformerLens began life in 2022 as EasyTransformer, a PyTorch reimplementation of GPT-style transformers tailored for interpretability work. Neel Nanda, who had previously worked on the interpretability team at Anthropic, created EasyTransformer because existing open-source tooling did not expose model internals in a form convenient for mechanistic analysis. A fork of an early EasyTransformer code base was used by Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt for their 2022 Redwood Research paper "Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small." A snapshot of that fork remains available on GitHub under the redwoodresearch/Easy-Transformer repository, which explicitly notes that the project is a one-time code release and recommends that interested researchers use TransformerLens instead.^[4]^[5]

When was TransformerLens released?

The library was renamed TransformerLens ahead of its first stable release on the Python Package Index. Version 1.0.0 was published to PyPI on 16 January 2023 with Neel Nanda listed as the author and the MIT License attached.^[6] The recommended academic citation for the library names Neel Nanda and Joseph Bloom as authors and uses 2022 as the project year, reflecting the EasyTransformer origin date even though the rename and first tagged release occurred in early 2023. The exact BibTeX entry on the documentation site reads @misc{nanda2022transformerlens, title = {TransformerLens}, author = {Neel Nanda and Joseph Bloom}, year = {2022}, ...}.^[7]

Transition to TransformerLensOrg

After Nanda joined Google DeepMind, day-to-day maintenance of the library moved to a dedicated GitHub organization, TransformerLensOrg. The README now lists the project as "Maintained by Bryce Meyer and Jonah Larson," who coordinate releases, code review, and roadmap discussions as of the v3.x series, while the PyPI metadata lists the TransformerLensOrg organization as the maintainer contact.^[1]^[2]

How does TransformerLens work?

TransformerLens replaces standard transformer modules with versions instrumented for interpretability. The core idea is to insert a small no-op module called a HookPoint at every place inside the network where an interpretability researcher might want to read or write an intermediate tensor.

HookPoint and HookedRootModule

A HookPoint is implemented as a PyTorch module whose forward method is the identity function: it returns its input unchanged. Its purpose is to provide a stable, named location in the model graph at which forward and backward hooks can be attached. Because every interesting tensor (the embeddings, the queries, keys, values, attention scores and patterns, attention output, MLP pre- and post-activations, and the residual stream additions for each block) flows through a dedicated HookPoint, an interpretability researcher can refer to any of these tensors by name and intervene on it without modifying the underlying model code.^[8]

HookedRootModule is the parent class shared by the library's instrumented models. It provides utilities for registering hooks across the whole model, removing them again at the end of a context, and managing nesting so that hooks attached at one level can be cleared without disturbing hooks at another level. Each registered hook is wrapped in a LensHandle that records metadata about whether the hook is permanent and at which context level it was registered.^[8]
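The identity-plus-hooks idea can be illustrated with a framework-free toy. The sketch below is not the library's implementation (which builds on PyTorch modules); ToyHookPoint, add_hook, and remove_hooks are hypothetical names chosen for this example, and plain lists stand in for tensors.

```python
from typing import Callable, List, Optional

# A hook receives the activation and the hook point's name.
# Returning None means "read only"; returning a list replaces the activation.
Hook = Callable[[list, str], Optional[list]]


class ToyHookPoint:
    """Identity 'layer': returns its input unchanged unless a hook edits it."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.hooks: List[Hook] = []

    def add_hook(self, hook: Hook) -> None:
        self.hooks.append(hook)

    def remove_hooks(self) -> None:
        self.hooks.clear()

    def __call__(self, activation: list) -> list:
        for hook in self.hooks:
            result = hook(activation, self.name)
            if result is not None:
                activation = result
        return activation


# With no hooks attached, the hook point is a pure identity function.
point = ToyHookPoint("blocks.0.hook_resid_pre")
identity_out = point([1.0, 2.0, 3.0])

seen = {}


def record(activation: list, name: str) -> None:
    """Read-only hook: store a copy of the activation under its name."""
    seen[name] = list(activation)


def double(activation: list, name: str) -> list:
    """Writing hook: replace the activation with a scaled copy."""
    return [2 * x for x in activation]


point.add_hook(record)
point.add_hook(double)
hooked_out = point([1.0, 2.0, 3.0])

# Clearing the hooks restores the identity behaviour.
point.remove_hooks()
restored_out = point([1.0, 2.0, 3.0])
```

Because the hook point does nothing on its own, adding one to a model changes no outputs; it only creates a named place where reading and writing can happen.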

HookedTransformer

The flagship class historically exposed by the library is HookedTransformer. It is a from-scratch PyTorch implementation of a generic decoder-only transformer that exposes hook points at all of the locations described above. When a pretrained model is loaded via HookedTransformer.from_pretrained(...), TransformerLens downloads the Hugging Face weights for the requested model and rewrites them into the library's standardized parameter layout. This "weight standardization" step is what allows GPT-2, Pythia, Llama, Gemma, Mistral, and many other architectures to share the same interpretability API, but it also means that initial model loading with TransformerLens can take noticeably longer than loading the same model directly through Hugging Face Transformers.^[1]^[9]

ActivationCache

Cached activations are returned wrapped in an ActivationCache object. The cache behaves like a dictionary keyed by hook name, but it also provides convenience methods that are common in interpretability work: stacking activations across layers into a single tensor, applying a layer's cached layer norm scaling to a stack of residual-stream components, decomposing the residual stream into per-head contributions, computing direct logit attributions, and slicing along the batch or position dimensions. These helpers reflect a deliberate design choice to make common analysis patterns one-line operations rather than ad-hoc tensor manipulation.^[9]
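Direct logit attribution rests on a simple linearity argument: the final residual stream is the sum of what each component wrote to it, so a token's logit splits into one dot product per component. The sketch below works through that arithmetic with made-up three-dimensional vectors. It deliberately ignores the final layer norm and the unembedding bias, and the component names and numbers are illustrative assumptions rather than values from any real model.

```python
# Vectors written to the residual stream by each component (toy values).
components = {
    "embed": [0.5, -1.0, 0.0],
    "blocks.0.attn": [1.0, 0.5, -0.5],
    "blocks.0.mlp": [-0.5, 2.0, 1.0],
}

# Unembedding direction for one target token: the vector whose dot product
# with the final residual stream gives that token's logit.
token_direction = [1.0, 0.5, -2.0]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# The final residual stream is the elementwise sum of all component outputs.
final_resid = [sum(values) for values in zip(*components.values())]
total_logit = dot(final_resid, token_direction)

# Direct logit attribution: each component's own share of that logit.
attribution = {
    name: dot(vector, token_direction) for name, vector in components.items()
}
```

Here the attention layer contributes 2.25, the MLP contributes -1.5, and the embedding contributes 0.0, which together account for the full logit of 0.75.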

TransformerBridge (v3.0+)

Version 3.0 of the library, released in April 2026, introduced a new abstraction called TransformerBridge. Instead of re-implementing every architecture from scratch inside TransformerLens, the bridge wraps a Hugging Face model in place and attaches HookPoints to its existing modules through adapters. The stated goal of this change is to expand model coverage dramatically while reducing the maintenance burden of keeping bespoke implementations in sync with upstream model releases. According to the README, TransformerBridge is the recommended 3.0 path and "supports 9,000+ models across 50+ architecture families"; HookedTransformer is deprecated as of 3.0 but remains available through a compatibility layer for the duration of the 3.x branch.^[1]^[10]
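The adapter idea can be pictured as a translation table between an upstream model's module paths and a canonical naming scheme. The sketch below is a hypothetical illustration only: the RULES table and to_canonical function are invented for this example and are not the bridge's adapter API, although the Hugging Face GPT-2 paths and the TransformerLens-style names they map to follow the conventions of those libraries.

```python
import re

# Hypothetical rules mapping Hugging Face GPT-2 module paths to
# TransformerLens-style names. Real adapters do considerably more.
RULES = [
    (re.compile(r"^transformer\.wte$"), "embed"),
    (re.compile(r"^transformer\.h\.(\d+)\.attn$"), r"blocks.\1.attn"),
    (re.compile(r"^transformer\.h\.(\d+)\.mlp$"), r"blocks.\1.mlp"),
    (re.compile(r"^transformer\.ln_f$"), "ln_final"),
]


def to_canonical(module_path: str) -> str:
    """Translate an upstream module path into a canonical name."""
    for pattern, template in RULES:
        if pattern.match(module_path):
            return pattern.sub(template, module_path)
    raise KeyError(f"no adapter rule for {module_path!r}")


canonical = [
    to_canonical(path)
    for path in ("transformer.wte", "transformer.h.5.attn", "transformer.ln_f")
]
```

Keeping the mapping as data rather than as a reimplemented model is what lets one set of analysis code address many architectures by the same names.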

Which models does TransformerLens support?

TransformerLens primarily supports open-weight decoder-only language models, with more limited support for some other architectures. The library exposes a list called OFFICIAL_MODEL_NAMES that enumerates the model identifiers it can load directly, and it also accepts arbitrary Hugging Face checkpoints that share an architecture with one of those families. Documented supported families include:^[1]^[11]

GPT-2 (all four sizes: small, medium, large, xl), the canonical example used throughout the documentation; see GPT-2.
Pythia (EleutherAI), with model sizes ranging from 14M to 12B parameters; widely used because EleutherAI publishes hundreds of intermediate training checkpoints, enabling developmental interpretability research.
LLaMA, Llama 2, Llama 3 / 3.1 / 3.2 / 3.3 (Meta).
Mistral and Mixtral (Mistral AI).
Gemma, Gemma 2, Gemma 3 (Google DeepMind).
Qwen, Qwen2, Qwen3 (Alibaba).
Phi and Phi-3 (Microsoft).
GPT-J, GPT-Neo, GPT-NeoX (EleutherAI).
BLOOM, OPT, CodeLlama.
Encoder models such as BERT and T5, and audio models such as HuBERT and Wav2Vec2 through more recent bridges.
Experimental support for state-space models, including Mamba-1 and Mamba-2.

The TransformerLensOrg release notes for the v3.0 transition describe the TransformerBridge mechanism as expanding compatibility from roughly 200 directly supported models to approximately 9,000 models when including all sub-variants on Hugging Face, although the documentation cautions that only a subset are formally verified.^[10]

What are the key TransformerLens APIs?

The two methods most often used in TransformerLens workflows are run_with_cache and run_with_hooks. These are inherited from HookedRootModule and exposed on both HookedTransformer and TransformerBridge.

run_with_cache

run_with_cache(input) runs a forward pass and returns a tuple of (logits, ActivationCache). Every tensor that flows through a HookPoint is recorded under its hook name. A minimal example, drawn from the library's quick-start documentation, looks like this:^[1]

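from transformer_lens.model_bridge import TransformerBridge

# Wrap the Hugging Face GPT-2 checkpoint with hook points, running on CPU
bridge = TransformerBridge.boot_transformers("gpt2", device="cpu")
# Run one forward pass and record every hooked activation by name
logits, cache = bridge.run_with_cache("Hello World")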

The returned cache can be queried with names such as cache["blocks.5.attn.hook_pattern"] to obtain the attention pattern at layer 5, or with helpers such as cache.stack_head_results() to assemble a tensor containing the contribution of every attention head across every layer.^[9]

run_with_hooks

run_with_hooks(input, fwd_hooks=[...], bwd_hooks=[...]) runs a forward (and optionally backward) pass while temporarily attaching user-supplied hook functions. Each entry in fwd_hooks or bwd_hooks pairs a hook name with a hook function. The function receives the activation tensor as its first argument and the corresponding HookPoint object as the keyword argument hook, and may return a modified tensor that replaces the original activation. By default, hooks are removed when the call returns, leaving the model in its original state. This API is the primary mechanism for activation patching, zero or mean ablations, and other causal interventions.^[8]
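The temporary-hook pattern behind zero ablation and activation patching can be shown on a toy two-step computation. The sketch below is a stand-in written in plain Python, not the library's code: ToyModel, its "hidden" and "out" sites, and the use of a site name in place of a HookPoint object are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

Activation = List[float]
HookFn = Callable[..., Activation]


class ToyModel:
    """Two-step computation with a named hook site after each step."""

    def __init__(self) -> None:
        self.hooks: Dict[str, List[HookFn]] = {"hidden": [], "out": []}

    def _site(self, name: str, value: Activation) -> Activation:
        # Activation is positional; the site name is passed as `hook`.
        for fn in self.hooks[name]:
            value = fn(value, hook=name)
        return value

    def forward(self, x: Activation) -> float:
        hidden = self._site("hidden", [x[0] + x[1], x[0] - x[1]])
        out = self._site("out", [2.0 * hidden[0] + hidden[1]])
        return out[0]

    def run_with_hooks(self, x: Activation, fwd_hooks: List[Tuple[str, HookFn]]) -> float:
        for name, fn in fwd_hooks:
            self.hooks[name].append(fn)
        try:
            return self.forward(x)
        finally:
            # Always detach the temporary hooks, even if the pass fails.
            for attached in self.hooks.values():
                attached.clear()

    def run_with_cache(self, x: Activation):
        cache: Dict[str, Activation] = {}

        def save(value: Activation, hook: str) -> Activation:
            cache[hook] = list(value)
            return value

        out = self.run_with_hooks(x, [("hidden", save), ("out", save)])
        return out, cache


model = ToyModel()
clean_out, clean_cache = model.run_with_cache([3.0, 1.0])
corrupted_out = model.forward([1.0, 1.0])


def zero_second(value: Activation, hook: str) -> Activation:
    """Zero ablation: knock out the second hidden unit."""
    return [value[0], 0.0]


def patch_first(value: Activation, hook: str) -> Activation:
    """Activation patching: copy the clean run's first hidden unit."""
    return [clean_cache[hook][0], value[1]]


ablated_out = model.run_with_hooks([3.0, 1.0], [("hidden", zero_second)])
patched_out = model.run_with_hooks([1.0, 1.0], [("hidden", patch_first)])
after_out = model.forward([1.0, 1.0])  # hooks are gone again
```

Patching the clean value of one hidden unit into the corrupted run moves the output from 4.0 to 8.0, most of the way to the clean output of 10.0, which is the kind of evidence used to argue that a component matters causally.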

Additional utilities

Beyond the two headline methods, TransformerLens exposes a number of utilities that are heavily used in research code:

to_tokens and to_str_tokens, which convert text to token IDs and to a list of per-token strings respectively, supporting per-token interpretability.
tokens_to_residual_directions, which maps tokens to their unembedding vectors, that is, the residual-stream directions whose dot product with the final residual stream gives each token's logit, so that logit attributions can be decomposed by component.
Layer-norm "folding" that absorbs the affine parameters of layer normalization into adjacent linear layers, simplifying analyses that treat layer norm as a fixed rescaling.
Stateful generation utilities that combine cache reuse with hook-based interventions.^[1]^[9]
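The layer-norm folding mentioned in the list above follows from a short identity. If a linear layer computes W(γ ⊙ x̂ + β) + b on the normalized input x̂, the same output is produced by a layer with weights W·diag(γ) and bias Wβ + b applied to x̂ directly. The sketch below checks this with small made-up numbers in plain Python. It uses a matrix-times-column-vector convention for readability and covers only the affine part, so it should be read as an illustration of the algebra rather than of the library's folding routine.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a column vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


# Already-normalized input and the layer norm's affine parameters (toy values).
x_hat = [1.0, -1.0]
gamma = [2.0, 0.5]
beta = [1.0, -2.0]

# The linear layer that reads the layer norm's output.
W = [[1.0, 2.0], [0.0, 4.0]]
b = [0.5, 0.0]

# Unfolded path: apply the affine step, then the linear layer.
ln_out = [g * x + shift for g, x, shift in zip(gamma, x_hat, beta)]
unfolded = [y + bias for y, bias in zip(matvec(W, ln_out), b)]

# Folded path: absorb gamma into the weights and beta into the bias.
W_folded = [[w * g for w, g in zip(row, gamma)] for row in W]
b_folded = [wb + bias for wb, bias in zip(matvec(W, beta), b)]
folded = [y + bias for y, bias in zip(matvec(W_folded, x_hat), b_folded)]
```

Both paths give [-1.5, -10.0], so after folding the layer norm can be treated as pure centering and rescaling with no learned parameters of its own.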

Notable research using TransformerLens

The library's official documentation maintains a gallery of papers that use it. Representative examples include:^[12]

"Progress Measures for Grokking via Mechanistic Interpretability" (ICLR Spotlight, 2023) by Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith and Jacob Steinhardt, which reverse-engineered the algorithm a small transformer learns to perform modular addition.
"Towards Automated Circuit Discovery for Mechanistic Interpretability" (Conmy et al., 2023), which introduced the ACDC algorithm for automatically identifying minimal circuits responsible for a given behavior and was implemented on top of TransformerLens.
"Finding Neurons in a Haystack: Case Studies with Sparse Probing" (Gurnee, Nanda, Pauly, Harvey, Troitskii, Bertsimas, 2023), which used TransformerLens to study how specific concepts are encoded across MLP neurons.
"A Toy Model of Universality" (Chughtai, Chan, Nanda, ICML 2023), which examined whether networks trained on the same task converge to the same internal algorithm.
"Actually, Othello-GPT Has A Linear Emergent World Representation" by Neel Nanda, which reanalyzed Kenneth Li and colleagues' Othello-GPT model and used TransformerLens activations to argue for a linear representation of the board state.
"Eliciting Latent Predictions from Transformers with the Tuned Lens" (Belrose, Furman, Smith, Halawi, Ostrovsky, McKinney, Biderman, Steinhardt, 2023), which extended the logit-lens technique by training affine probes on intermediate residual streams.
"N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models" (Foote, Nanda, Kran, Konstas, Barez, ICLR RTML Workshop 2023).
"A circuit for Python docstrings in a 4-layer attention-only transformer" (Heimersheim and Janiak), a community write-up published on the Alignment Forum.

The "Interpretability in the Wild" paper on indirect object identification, which originally relied on the EasyTransformer fork, is one of the most widely cited demonstrations of the circuit-analysis style of work that TransformerLens enables. A tutorial-style replication of it, using activation patching through run_with_hooks, is one of the canonical examples taught in the official documentation.^[4]^[12]

Outside the gallery, TransformerLens has been used as the modeling layer for replications of "Locating and Editing Factual Associations in GPT" (the ROME paper by Meng, Bau, Andonian, and Belinkov), as well as in feature-level circuit work and in studies of refusal and steering behaviors. The original ROME implementation was a separate code base, but a number of follow-up papers and reproductions have reimplemented its causal tracing procedure on top of TransformerLens because the library's hook system makes the required activation surgery straightforward.^[13]

How does TransformerLens compare to nnsight and Captum?

Several other libraries occupy adjacent positions in the interpretability tooling landscape. TransformerLens is most often compared to nnsight and Captum.

nnsight and NDIF

nnsight, developed by the National Deep Inference Fabric (NDIF) project at Northeastern University, operates on arbitrary PyTorch networks rather than only on transformers. Its design centres on building a serializable "intervention graph" that can be sent to a remote machine where the target model is already resident in GPU memory; this is the mechanism by which NDIF makes very large open-weight models (such as Llama 3.1 405B) available to outside researchers. The 2024 paper "NNsight and NDIF: Democratizing Access to Open-Weight Foundation Model Internals" explicitly compares the two libraries and concludes that TransformerLens provides a more ergonomic interface for the specific class of models it supports, with named HookPoints and a richer cache abstraction, while nnsight is more general and integrates directly with the unmodified Hugging Face model. The same paper notes that TransformerLens model loading is approximately three times slower than the alternatives it benchmarks, attributing the difference to TransformerLens's weight-standardization preprocessing. The paper's overall recommendation is that the libraries are complementary: TransformerLens for rapid exploration on smaller models, nnsight when exact Hugging Face behaviour or very large model access is required.^[14]

Captum

Captum, developed by Meta, is a general-purpose interpretability library for PyTorch. It implements feature-attribution methods such as integrated gradients, saliency maps, SmoothGrad, layer-wise relevance propagation, and Shapley value sampling, along with evaluation metrics for those methods. Captum is broader in scope than TransformerLens but does not provide an analogue to the named hook-point graph that makes circuit-style work convenient: it is oriented toward attribution and importance scoring rather than toward causal interventions and circuit discovery on transformer internals.^[15]

SAELens

Although not a direct alternative, SAELens is a closely related library in the same ecosystem. SAELens trains and serves sparse autoencoders on the activations of language models, and historically integrated with TransformerLens through a class called HookedSAETransformer. TransformerLens v2.0 removed that class and migrated the corresponding functionality to SAELens itself, formalizing the division of responsibilities: TransformerLens handles transformer instrumentation, SAELens handles sparse-feature analysis. SAELens is maintained by Joseph Bloom and collaborators, and it reuses TransformerLens hook names to identify the sites at which SAEs are trained.^[1]^[16]

Community and ecosystem

ARENA

The largest organized teaching context for TransformerLens is the Alignment Research Engineer Accelerator (ARENA), an in-person and online curriculum led by Callum McDougall. Chapter 1 of the ARENA curriculum, "Transformer Interpretability," walks students through building a transformer from scratch in PyTorch, then transitions to TransformerLens for circuit analysis, including locating induction heads in a two-layer model and reproducing the indirect-object-identification circuit in GPT-2 Small. Later chapters cover feature superposition, sparse autoencoders (via SAELens), and activation-vector steering. All ARENA materials are released freely on GitHub.^[17]

AI Safety Camp and other community programs

TransformerLens is also widely used in the project portfolio of the AI Safety Camp (AISC), an annual remote research program. AISC 2024 ran multiple mechanistic-interpretability streams whose project summaries explicitly reference TransformerLens as the default tool for accessing model internals, alongside related programs and reading lists organized through Apart Research and the Alignment Forum.^[18]

Documentation, tutorials, and demos

The official documentation site at transformerlensorg.github.io/TransformerLens hosts a substantial collection of Jupyter-notebook demos, including a "Main Demo" walking through model loading, caching, and hook installation; demos for specific architectures such as Llama and BERT; and a curated "Getting Started in Mechanistic Interpretability" page that links to Neel Nanda's "200 Concrete Open Problems," his glossary, and his paper-walkthrough video channel. The same page advertises the field as one with a low barrier to entry and emphasises that, in the maintainers' view, low participation rather than technical difficulty explains many of the discipline's unsolved problems.^[9]^[19]

Related interpretability infrastructure

TransformerLens is frequently combined with tooling for visualising and sharing the features it helps researchers discover. Neuronpedia, a web platform for browsing sparse-autoencoder features, ingests SAEs trained with SAELens and indexes them by the TransformerLens hook names at which they were trained, so a feature discovered in a notebook can be linked to its public Neuronpedia entry with no additional metadata. The library is also commonly used in conjunction with the open-source circuitsvis package for inline visualisations of attention patterns inside Jupyter notebooks, and with attribution graphs constructed via causal patching of cached activations between forward passes.^[16]

Is TransformerLens open source?

Yes. TransformerLens is distributed under the MIT License, as recorded both in the GitHub repository and in the PyPI metadata for the package. The MIT terms permit unrestricted use, modification, and redistribution provided that the copyright notice is preserved.^[1]^[6]

Governance is informal but follows a maintainer-led model. Bryce Meyer and Jonah Larson review pull requests and tag releases on behalf of the TransformerLensOrg organization, while broader design discussions happen on GitHub issues and on the mechanistic interpretability Slack community linked from the README. Neel Nanda continues to participate as the project's original author and to publish research that uses the library, but is no longer the day-to-day maintainer.^[1]^[2]

Release cadence

Tagged releases on PyPI follow semantic versioning. Major version transitions to date include version 1.0.0 in January 2023, version 2.0 (which removed HookedSAETransformer and required Python 3.10 or newer), and version 3.0 in April 2026 (which introduced TransformerBridge). As of mid-2026, the repository's GitHub statistics list approximately 3,600 stars and more than 600 forks.^[1]^[2]^[10]

The contributing guidelines invite outside pull requests to add new model adapters, expand the demo notebooks, and improve documentation. Adding a new architecture in the v3.x line typically involves writing an adapter that maps the upstream model's modules to the TransformerLens hook-point naming convention and registering the resulting bridge with the loader. Test coverage is run automatically via GitHub Actions, and recent maintenance work has focused on tightening quantisation handling, hardening the continuous-integration pipeline, and pinning dependencies in response to specific upstream security advisories.^[10]

ELI5: What does TransformerLens do?

Imagine an AI language model as a huge machine with thousands of tiny gears turning inside it as it reads a sentence. Normally you only see what comes out of the machine, not what the gears are doing. TransformerLens is like fitting the machine with little glass windows (called hook points) at every gear, so you can watch each one, write down what it is doing (caching), and even reach in and nudge a gear to see how the answer changes (patching and ablation). Researchers use these windows to figure out the hidden "recipes" the model taught itself, which is the whole point of mechanistic interpretability.^[1]^[8]

References

1. TransformerLensOrg. "TransformerLens: A library for mechanistic interpretability of GPT-style language models." GitHub README. https://github.com/TransformerLensOrg/TransformerLens. Accessed 2026-06-25.
2. Python Package Index. "transformer-lens" project page. https://pypi.org/project/transformer-lens/. Accessed 2026-06-25.
3. Open Source Initiative. "The MIT License." https://opensource.org/license/mit. Accessed 2026-06-25.
4. Redwood Research. "Easy-Transformer" repository README. https://github.com/redwoodresearch/Easy-Transformer. Accessed 2026-06-25.
5. Wang, Kevin; Variengien, Alexandre; Conmy, Arthur; Shlegeris, Buck; Steinhardt, Jacob. "Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small." arXiv:2211.00593, 2022. https://arxiv.org/abs/2211.00593. Accessed 2026-06-25.
6. Python Package Index. "transformer-lens 1.0.0" release page. https://pypi.org/project/transformer-lens/1.0.0/. Accessed 2026-06-25.
7. TransformerLens Documentation. "Citation." https://transformerlensorg.github.io/TransformerLens/content/citation.html. Accessed 2026-06-25.
8. TransformerLens Documentation. "transformer_lens.hook_points module." https://transformerlensorg.github.io/TransformerLens/generated/code/transformer_lens.hook_points.html. Accessed 2026-06-25.
9. TransformerLens Documentation. "Main Demo Notebook." https://transformerlensorg.github.io/TransformerLens/generated/demos/Main_Demo.html. Accessed 2026-06-25.
10. TransformerLensOrg. "TransformerLens releases page." https://github.com/TransformerLensOrg/TransformerLens/releases. Accessed 2026-06-25.
11. TransformerLens Documentation. "Model Properties Table." https://transformerlensorg.github.io/TransformerLens/generated/model_properties_table.html. Accessed 2026-06-25.
12. TransformerLens Documentation. "Gallery." https://transformerlensorg.github.io/TransformerLens/content/gallery.html. Accessed 2026-06-25.
13. Meng, Kevin; Bau, David; Andonian, Alex; Belinkov, Yonatan. "Locating and Editing Factual Associations in GPT." Advances in Neural Information Processing Systems, 2022. https://arxiv.org/abs/2202.05262. Accessed 2026-06-25.
14. Fiotto-Kaufman, Jaden; et al. "NNsight and NDIF: Democratizing Access to Open-Weight Foundation Model Internals." arXiv:2407.14561, 2024. https://arxiv.org/abs/2407.14561. Accessed 2026-06-25.
15. Kokhlikyan, Narine; et al. "Captum: A unified and generic model interpretability library for PyTorch." arXiv:2009.07896, 2020. https://arxiv.org/abs/2009.07896. Accessed 2026-06-25.
16. Bloom, Joseph; et al. "SAELens" repository. https://github.com/decoderesearch/SAELens. Accessed 2026-06-25.
17. McDougall, Callum. "ARENA 3.0" repository. https://github.com/callummcdougall/ARENA_3.0. Accessed 2026-06-25.
18. AI Safety Camp. "AISC 2024 Project Summaries." Alignment Forum. https://www.alignmentforum.org/posts/npkvZG67hRvBneoQ9/aisc-2024-project-summaries-1. Accessed 2026-06-25.
19. TransformerLens Documentation. "Getting Started in Mechanistic Interpretability." https://transformerlensorg.github.io/TransformerLens/content/getting_started_mech_interp.html. Accessed 2026-06-25.
