TransformerLens
Last reviewed
Sources
No citations yet
Review status
Needs citations
Revision
v3 · 3,442 words
TransformerLens is an open-source Python library for the mechanistic interpretability of GPT-style language models. It loads a pretrained transformer such as GPT-2, exposes every internal activation through a network of named hook points, and lets researchers cache, inspect, edit, ablate, and replace those intermediate computations during a forward pass. The library was created by interpretability researcher Neel Nanda and originally released under the name EasyTransformer; it is distributed under the MIT License and, as of 2026, is maintained by Bryce Meyer and Jonah Larson through the TransformerLensOrg GitHub organization, where it has accumulated roughly 3,600 stars and over 600 forks.[1][2]
TransformerLens has become the de facto standard tooling for mechanistic interpretability research, and the hook-point naming conventions it introduced now appear in a large fraction of published interpretability papers and blog posts. The library's own README states that "the goal of mechanistic interpretability is to take a trained model and reverse engineer the algorithms the model learned during training from its weights," and describes itself simply as "a library for doing mechanistic interpretability of GPT-2 Style language models."[1] Its HookedTransformer class (and, from version 3.0, the TransformerBridge adapter) wraps a transformer language model with hook points that expose every residual stream addition, attention pattern, MLP activation, and projection in the model, so that built-in methods such as run_with_cache and run_with_hooks can perform causal interventions, activation patching, direct logit attribution, and other interpretability techniques.[1][3]
TransformerLens is used to reverse-engineer the internal algorithms of transformer language models, a research program known as mechanistic interpretability. In practice this means loading an open-weight model, running text through it, and then reading or rewriting the model's intermediate tensors to test hypotheses about how a behavior is computed. Typical workflows include locating the attention circuit responsible for a task (for example the indirect-object-identification circuit in GPT-2 Small), identifying induction heads that implement in-context copying, performing activation patching to measure which components are causally necessary for a prediction, and decomposing the model's output logits into the contributions of individual heads and layers. Because the library exposes a stable, named hook for every interesting tensor, these interventions can be expressed in a few lines of code without modifying the underlying model.[1][8][9]
The project that became TransformerLens began in 2022 as EasyTransformer, a PyTorch reimplementation of GPT-style transformers tailored for interpretability work. Neel Nanda, who had previously worked on the interpretability team at Anthropic, created EasyTransformer because existing open-source tooling did not expose model internals in a form convenient for mechanistic analysis. A fork of an early EasyTransformer code base was used by Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt, in work carried out at Redwood Research, for their 2022 paper "Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small." A snapshot of that fork remains available on GitHub under the redwoodresearch/Easy-Transformer repository, which explicitly notes that the project is a one-time code release and recommends that interested researchers use TransformerLens instead.[4][5]
The library was renamed TransformerLens ahead of its first stable release on the Python Package Index. Version 1.0.0 was published to PyPI on 16 January 2023 with Neel Nanda listed as the author and the MIT License attached.[6] The recommended academic citation for the library names Neel Nanda and Joseph Bloom as authors and uses 2022 as the project year, reflecting the EasyTransformer origin date even though the rename and first tagged release occurred in early 2023. The exact BibTeX entry on the documentation site reads @misc{nanda2022transformerlens, title = {TransformerLens}, author = {Neel Nanda and Joseph Bloom}, year = {2022}, ...}.[7]
After Nanda joined Google DeepMind, day-to-day maintenance of the library moved to a dedicated GitHub organization, TransformerLensOrg. The README now lists the project as "Maintained by Bryce Meyer and Jonah Larson," who coordinate releases, code review, and roadmap discussions as of the v3.x series, while the PyPI metadata lists the TransformerLensOrg organization as the maintainer contact.[1][2]
TransformerLens replaces standard transformer modules with versions instrumented for interpretability. The core idea is to insert a small no-op module called a HookPoint at every place inside the network where an interpretability researcher might want to read or write an intermediate tensor.
A HookPoint is implemented as a PyTorch module whose forward method is the identity function: it returns its input unchanged. Its purpose is to provide a stable, named location in the model graph at which forward and backward hooks can be attached. Because every interesting tensor (the embeddings, the queries, keys, values, attention scores and patterns, attention output, MLP pre- and post-activations, and the residual stream additions for each block) flows through a dedicated HookPoint, an interpretability researcher can refer to any of these tensors by name and intervene on it without modifying the underlying model code.[8]
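The identity-plus-hooks idea can be illustrated without PyTorch. The following is a toy sketch, not the library's implementation: the class borrows the HookPoint name, but it is a plain Python object rather than a PyTorch module, activations are lists of floats rather than tensors, and toy_mlp is a hypothetical computation invented for the example.

```python
from typing import Callable, Dict, List, Optional

Vector = List[float]
HookFn = Callable[[Vector, "HookPoint"], Optional[Vector]]


class HookPoint:
    """Toy stand-in for a hook point: an identity step that can run hooks."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.hooks: List[HookFn] = []

    def add_hook(self, fn: HookFn) -> None:
        self.hooks.append(fn)

    def remove_hooks(self) -> None:
        self.hooks.clear()

    def __call__(self, x: Vector) -> Vector:
        # With no hooks attached, the input is returned unchanged.
        for fn in self.hooks:
            replacement = fn(x, self)
            if replacement is not None:  # a hook may substitute a new value
                x = replacement
        return x


def toy_mlp(x: Vector, hook_pre: HookPoint, hook_post: HookPoint) -> Vector:
    """Hypothetical two-step computation with a hook point after each step."""
    pre = hook_pre([2.0 * v - 1.0 for v in x])    # stand-in for a linear layer
    post = hook_post([max(0.0, v) for v in pre])  # ReLU
    return post


hook_pre = HookPoint("blocks.0.mlp.hook_pre")
hook_post = HookPoint("blocks.0.mlp.hook_post")
inputs = [0.0, 1.0, 2.0]

baseline = toy_mlp(inputs, hook_pre, hook_post)

# A read-only hook: it records what it sees and returns None.
seen: Dict[str, Vector] = {}
hook_pre.add_hook(lambda x, hook: seen.update({hook.name: list(x)}))
observed = toy_mlp(inputs, hook_pre, hook_post)

# A writing hook: it replaces the pre-activation with zeros.
hook_pre.remove_hooks()
hook_pre.add_hook(lambda x, hook: [0.0 for _ in x])
edited = toy_mlp(inputs, hook_pre, hook_post)
hook_pre.remove_hooks()
```

The read-only hook leaves the output identical to the baseline, while the writing hook changes it. The toy_mlp code is the same in both runs, which is the property the paragraph above describes.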
HookedRootModule is the parent class shared by the library's instrumented models. It provides utilities for registering hooks across the whole model, removing them again at the end of a context, and managing nesting so that hooks attached at one level can be cleared without disturbing hooks at another level. Each registered hook is wrapped in a LensHandle that records metadata about whether the hook is permanent and at which context level it was registered.[8]
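The nesting behavior can be sketched with a small registry. This is an illustrative toy under stated assumptions, not the library's code: ToyHandle and ToyRootModule are hypothetical names, hooks act on a single float, and the only behavior modeled is that leaving a context clears the temporary hooks registered at that level.

```python
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Callable, Iterator, List, Sequence

Fn = Callable[[float], float]


@dataclass
class ToyHandle:
    """Metadata of the kind the text attributes to LensHandle."""
    fn: Fn
    is_permanent: bool
    context_level: int


class ToyRootModule:
    """Toy registry: applies every registered hook, in order, to a number."""

    def __init__(self) -> None:
        self.handles: List[ToyHandle] = []
        self.context_level = 0

    def add_hook(self, fn: Fn, is_permanent: bool = False) -> None:
        self.handles.append(ToyHandle(fn, is_permanent, self.context_level))

    def run(self, x: float) -> float:
        for handle in self.handles:
            x = handle.fn(x)
        return x

    @contextmanager
    def hooks(self, fns: Sequence[Fn]) -> Iterator["ToyRootModule"]:
        self.context_level += 1
        try:
            for fn in fns:
                self.add_hook(fn)
            yield self
        finally:
            # Clear only the temporary hooks registered at this level.
            level = self.context_level
            self.handles = [
                h for h in self.handles
                if h.is_permanent or h.context_level != level
            ]
            self.context_level -= 1


root = ToyRootModule()
root.add_hook(lambda x: x + 1.0, is_permanent=True)

with root.hooks([lambda x: x * 10.0]):
    with root.hooks([lambda x: x - 3.0]):
        nested = root.run(1.0)       # permanent, outer, and inner hooks apply
    after_inner = root.run(1.0)      # inner hook is gone, outer remains
after_outer = root.run(1.0)          # only the permanent hook remains
```

Recording the level on each handle is what lets the inner context exit without touching the outer context's hooks.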
The flagship class historically exposed by the library is HookedTransformer. It is a from-scratch PyTorch implementation of a generic decoder-only transformer that exposes hook points at all of the locations described above. When a pretrained model is loaded via HookedTransformer.from_pretrained(...), TransformerLens downloads the Hugging Face weights for the requested model and rewrites them into the library's standardized parameter layout. This "weight standardization" step is what allows GPT-2, Pythia, Llama, Gemma, Mistral, and many other architectures to share the same interpretability API, but it also means that initial model loading with TransformerLens can take noticeably longer than loading the same model directly through Hugging Face Transformers.[1][9]
Cached activations are returned wrapped in an ActivationCache object. The cache behaves like a dictionary keyed by hook name, but it also provides convenience methods that are common in interpretability work: stacking activations across layers into a single tensor, applying a layer's cached layer-norm scaling to a stack of residual-stream components, decomposing the residual stream into per-head contributions, computing direct logit attributions, and slicing along the batch or position dimensions. These helpers reflect a deliberate design choice to make common analysis patterns one-line operations rather than ad-hoc tensor manipulation.[9]
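Direct logit attribution rests on the fact that the residual stream is a sum of component outputs, so a logit computed by a dot product splits into one term per component. The sketch below shows only that arithmetic, in plain Python with made-up numbers. The component labels and the token_direction vector are hypothetical, and the final layer norm, which a real analysis has to account for, is deliberately ignored.

```python
from typing import Dict, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


# Hypothetical residual-stream contributions at one position (d_model = 3).
components: Dict[str, Vector] = {
    "embed": [1.0, 0.0, 0.5],
    "attn_out": [0.5, 2.0, 0.0],
    "mlp_out": [-0.5, 1.0, 1.0],
}

# Hypothetical unembedding direction for the token of interest.
token_direction: Vector = [2.0, -1.0, 0.0]

# The final residual stream is the elementwise sum of all contributions.
final_resid = [sum(values) for values in zip(*components.values())]
total_logit = dot(final_resid, token_direction)

# Because the dot product is linear, the logit splits into one term per component.
attributions = {name: dot(vec, token_direction) for name, vec in components.items()}
```

Here the per-component terms add up exactly to total_logit. In a real model the same bookkeeping is done on cached tensors, with layer-norm scaling applied before the projection.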
Version 3.0 of the library, released in April 2026, introduced a new abstraction called TransformerBridge. Instead of re-implementing every architecture from scratch inside TransformerLens, the bridge wraps a Hugging Face model in place and attaches HookPoints to its existing modules through adapters. The stated goal of this change is to expand model coverage dramatically while reducing the maintenance burden of keeping bespoke implementations in sync with upstream model releases. According to the README, TransformerBridge is the recommended 3.0 path and "supports 9,000+ models across 50+ architecture families"; HookedTransformer is deprecated as of 3.0 but remains available through a compatibility layer for the duration of the 3.x branch.[1][10]
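One ingredient of such an adapter is a translation between upstream module paths and the library's naming scheme. The following is a hypothetical sketch of that one idea, not the bridge's actual adapter API: the RULES table and to_lens_name function are invented for illustration, and the correspondence shown (Hugging Face GPT-2 paths on the left, HookedTransformer-style names on the right) is an assumption about how such a mapping might look.

```python
import re
from typing import Optional

# Assumed correspondence between Hugging Face GPT-2 module paths and
# TransformerLens-style names; illustrative only, not the bridge's real tables.
RULES = [
    (re.compile(r"^transformer\.wte$"), "embed"),
    (re.compile(r"^transformer\.wpe$"), "pos_embed"),
    (re.compile(r"^transformer\.h\.(\d+)\.attn$"), r"blocks.\1.attn"),
    (re.compile(r"^transformer\.h\.(\d+)\.mlp$"), r"blocks.\1.mlp"),
    (re.compile(r"^transformer\.ln_f$"), "ln_final"),
]


def to_lens_name(hf_path: str) -> Optional[str]:
    """Translate an upstream module path, or return None if no rule matches."""
    for pattern, template in RULES:
        match = pattern.match(hf_path)
        if match:
            return match.expand(template)  # fills in the captured layer index
    return None


layer_five_attention = to_lens_name("transformer.h.5.attn")
final_norm = to_lens_name("transformer.ln_f")
unmapped = to_lens_name("lm_head")
```

A table-driven mapping of this kind keeps the architecture-specific knowledge in data rather than in a reimplemented model, which is consistent with the maintenance goal described above.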
TransformerLens supports a wide range of open-source decoder-only language models. The library exposes a list called OFFICIAL_MODEL_NAMES that enumerates the model identifiers it can load directly, and it also accepts arbitrary Hugging Face checkpoints that share an architecture with one of those families. Documented supported families include:[1][11]
The TransformerLensOrg release notes for the v3.0 transition describe the TransformerBridge mechanism as expanding compatibility from roughly 200 directly supported models to approximately 9,000 models when including all sub-variants on Hugging Face, although the documentation cautions that only a subset are formally verified.[10]
The two methods most often used in TransformerLens workflows are run_with_cache and run_with_hooks. These are inherited from HookedRootModule and exposed on both HookedTransformer and TransformerBridge.
run_with_cache(input) runs a forward pass and returns a tuple of (logits, ActivationCache). Every tensor that flows through a HookPoint is recorded under its hook name. A minimal example, drawn from the library's quick-start documentation, looks like this:[1]
from transformer_lens.model_bridge import TransformerBridge
bridge = TransformerBridge.boot_transformers("gpt2", device="cpu")
logits, cache = bridge.run_with_cache("Hello World")
The returned cache can be queried with names such as cache["blocks.5.attn.hook_pattern"] to obtain the attention pattern at layer 5, or with helpers such as cache.stack_head_results() to assemble a tensor containing the contribution of every attention head across every layer.[9]
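The dictionary-style access pattern can be mimicked with an ordinary dict. This is a toy sketch that needs no model download: hook_name and stack_layers are hypothetical helpers written for this example, and each cached value is a short list of floats standing in for a tensor.

```python
from typing import Dict, List


def hook_name(layer: int, site: str) -> str:
    """Build a name following the 'blocks.<layer>.<site>' pattern quoted above."""
    return f"blocks.{layer}.{site}"


# Hypothetical cached values: one short vector per layer instead of a tensor.
toy_cache: Dict[str, List[float]] = {
    hook_name(layer, "hook_resid_post"): [float(layer), layer + 0.5]
    for layer in range(3)
}


def stack_layers(cache: Dict[str, List[float]], site: str, n_layers: int) -> List[List[float]]:
    """Collect one site's activation from every layer, ordered by layer index."""
    return [cache[hook_name(layer, site)] for layer in range(n_layers)]


stacked = stack_layers(toy_cache, "hook_resid_post", n_layers=3)
```

Because every layer uses the same naming pattern, collecting one site across the whole model reduces to a loop over layer indices.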
run_with_hooks(input, fwd_hooks=[...], bwd_hooks=[...]) runs a forward (and optionally backward) pass while temporarily attaching user-supplied hook functions. Each entry in fwd_hooks or bwd_hooks pairs a hook name with a function. The function receives the activation tensor as its first argument and the HookPoint object as the keyword argument hook, and may return a modified tensor that replaces the original activation. By default, hooks are removed when the call returns, leaving the model in its original state. This API is the primary mechanism for activation patching, zero or mean ablations, and other causal interventions.[8]
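The logic of activation patching and zero ablation can be shown on a toy function. The sketch below does not call the library: toy_model, run_and_cache, and the site names head_a and head_b are hypothetical, and scalars stand in for tensors. It caches activations from a "clean" input, reruns a "corrupted" input while restoring one clean activation at a time, and measures how much of the output gap each patch recovers.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

Hook = Callable[[float, str], Optional[float]]


def toy_model(x: float, fwd_hooks: Sequence[Tuple[str, Hook]] = ()) -> float:
    """Hypothetical model: two named components whose outputs are summed."""
    hooks = dict(fwd_hooks)

    def site(name: str, value: float) -> float:
        fn = hooks.get(name)
        if fn is None:
            return value
        replacement = fn(value, name)
        return value if replacement is None else replacement

    a = site("head_a", 3.0 * x)
    b = site("head_b", x * x)
    return a + b


def run_and_cache(x: float) -> Tuple[float, Dict[str, float]]:
    cache: Dict[str, float] = {}

    def save(value: float, name: str) -> None:
        cache[name] = value  # read-only hook: returns None

    return toy_model(x, [("head_a", save), ("head_b", save)]), cache


clean_out, clean_cache = run_and_cache(2.0)   # "clean" input
corrupted_out = toy_model(1.0)                # "corrupted" input

# Activation patching: rerun the corrupted input, restoring one clean activation.
patched = {
    name: toy_model(1.0, [(name, lambda value, site_name: clean_cache[site_name])])
    for name in ("head_a", "head_b")
}
# Fraction of the clean-vs-corrupted gap recovered by each patch.
recovered = {
    name: (out - corrupted_out) / (clean_out - corrupted_out)
    for name, out in patched.items()
}

# Zero ablation: overwrite one activation with zero on the clean input.
ablated_out = toy_model(2.0, [("head_a", lambda value, site_name: 0.0)])
```

In this toy each component recovers half of the gap. With a real model the same comparison is made on a metric such as a logit difference, one hook site at a time.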
Beyond the two headline methods, TransformerLens exposes a number of utilities that are heavily used in research code:
to_tokens and to_str_tokens for tokenization and detokenization with offsets, supporting per-token interpretability.
tokens_to_residual_directions, which maps vocabulary tokens to their unembedding directions in the residual stream so that researchers can decompose logit attributions along those directions.
The library's official documentation maintains a gallery of papers that use it. Representative examples include:[12]
The "Interpretability in the Wild" paper on indirect object identification, which originally relied on the EasyTransformer fork, is the most-cited demonstration of the circuit-analysis style of work that TransformerLens enables. Its tutorial-style replication using run_with_hooks activation patching is one of the canonical examples taught in the official documentation.[4][12]
Outside the gallery, TransformerLens has been used as the modeling layer for replications of "Locating and Editing Factual Associations in GPT" (the ROME paper by Meng, Bau, Andonian, and Belinkov), as well as in feature-level circuit work and in studies of refusal and steering behaviors. The original ROME implementation was a separate code base, but a number of follow-up papers and reproductions have reimplemented its causal tracing procedure on top of TransformerLens because the library's hook system makes the required activation surgery straightforward.[13]
Several other libraries occupy adjacent positions in the interpretability tooling landscape. TransformerLens is most often compared to nnsight and Captum.
nnsight, developed by the National Deep Inference Fabric (NDIF) project at Northeastern University, operates on arbitrary PyTorch networks rather than only on transformers. Its design centres on building a serializable "intervention graph" that can be sent to a remote machine where the target model is already resident in GPU memory; this is the mechanism by which NDIF makes very large open-weight models (such as Llama 3.1 405B) available to outside researchers. The 2024 paper "NNsight and NDIF: Democratizing Access to Open-Weight Foundation Model Internals" explicitly compares the two libraries and concludes that TransformerLens provides a more ergonomic interface for the specific class of models it supports, with named HookPoints and a richer cache abstraction, while nnsight is more general and integrates directly with the unmodified Hugging Face model. The same paper notes that TransformerLens model loading is approximately three times slower than the alternatives it benchmarks, attributing the difference to TransformerLens's weight-standardization preprocessing. The paper's overall recommendation is that the libraries are complementary: TransformerLens for rapid exploration on smaller models, nnsight when exact Hugging Face behaviour or very large model access is required.[14]
Captum, developed by Meta, is a general-purpose interpretability library for PyTorch. It implements feature-attribution methods such as integrated gradients, saliency maps, SmoothGrad, layer-wise relevance propagation, and Shapley value sampling, along with evaluation metrics for those methods. Captum is broader in scope than TransformerLens but does not provide an analogue to the named hook-point graph that makes circuit-style work convenient: it is oriented toward attribution and importance scoring rather than toward causal interventions and circuit discovery on transformer internals.[15]
Although not a direct alternative, SAELens is a closely related library in the same ecosystem. SAELens trains and serves sparse autoencoders on the activations of language models, and historically integrated with TransformerLens through a class called HookedSAETransformer. TransformerLens v2.0 removed that class and migrated the corresponding functionality to SAELens itself, formalizing the division of responsibilities: TransformerLens handles transformer instrumentation, SAELens handles sparse-feature analysis. SAELens is maintained by Joseph Bloom and collaborators, and it reuses TransformerLens hook names to identify the sites at which SAEs are trained.[1][16]
The largest organized teaching context for TransformerLens is the Alignment Research Engineer Accelerator (ARENA), an in-person and online curriculum led by Callum McDougall. Chapter 1 of the ARENA curriculum, "Transformer Interpretability," walks students through building a transformer from scratch in PyTorch, then transitions to TransformerLens for circuit analysis, including locating induction heads in a two-layer model and reproducing the indirect-object-identification circuit in GPT-2 Small. Later chapters cover feature superposition, sparse autoencoders (via SAELens), and activation-vector steering. All ARENA materials are released freely on GitHub.[17]
TransformerLens is also widely used in projects run through AI Safety Camp (AISC), an annual remote research program. AISC 2024 ran multiple mechanistic-interpretability streams whose project summaries explicitly reference TransformerLens as the default tool for accessing model internals. The library also appears in related programs and reading lists organized through Apart Research and the Alignment Forum.[18]
The official documentation site at transformerlensorg.github.io/TransformerLens hosts a substantial collection of Jupyter-notebook demos, including a "Main Demo" walking through model loading, caching, and hook installation; demos for specific architectures such as Llama and BERT; and a curated "Getting Started in Mechanistic Interpretability" page that links to Neel Nanda's "200 Concrete Open Problems," his glossary, and his paper-walkthrough video channel. The same page advertises the field as one with a low barrier to entry and emphasises that, in the maintainers' view, low participation rather than technical difficulty explains many of the discipline's unsolved problems.[9][19]
TransformerLens is frequently combined with tooling for visualising and sharing the features it helps researchers discover. Neuronpedia, a web platform for browsing sparse-autoencoder features, ingests SAEs trained with SAELens and indexes them by the TransformerLens hook names at which they were trained, so a feature discovered in a notebook can be linked to its public Neuronpedia entry with no additional metadata. The library is also commonly used in conjunction with the open-source circuitsvis package for inline visualisations of attention patterns inside Jupyter notebooks, and with attribution graphs constructed via causal patching of cached activations between forward passes.[16]
TransformerLens is open source. It is distributed under the MIT License, as recorded both in the GitHub repository and in the PyPI metadata for the package. The MIT terms permit use, modification, and redistribution with very few restrictions, provided that the copyright and permission notice is preserved.[1][6]
Governance is informal but follows a maintainer-led model. Bryce Meyer and Jonah Larson review pull requests and tag releases on behalf of the TransformerLensOrg organization, while broader design discussions happen on GitHub issues and on the mechanistic interpretability Slack community linked from the README. Neel Nanda continues to participate as the project's original author and to publish research that uses the library, but is no longer the day-to-day maintainer.[1][2]
Tagged releases on PyPI follow semantic versioning. Major version transitions to date include version 1.0.0 in January 2023, version 2.0 (which removed HookedSAETransformer and required Python 3.10 or newer), and version 3.0 in April 2026 (which introduced TransformerBridge). As of mid-2026, the repository's GitHub statistics list approximately 3,600 stars and more than 600 forks.[1][2][10]
The contributing guidelines invite outside pull requests to add new model adapters, expand the demo notebooks, and improve documentation. Adding a new architecture in the v3.x line typically involves writing an adapter that maps the upstream model's modules to the TransformerLens hook-point naming convention and registering the resulting bridge with the loader. Test coverage is run automatically via GitHub Actions, and recent maintenance work has focused on tightening quantisation handling, hardening the continuous-integration pipeline, and pinning dependencies in response to specific upstream security advisories.[10]
Imagine an AI language model as a huge machine with thousands of tiny gears turning inside it as it reads a sentence. Normally you only see what comes out of the machine, not what the gears are doing. TransformerLens is like fitting the machine with little glass windows (called hook points) at every gear, so you can watch each one, write down what it is doing (caching), and even reach in and nudge a gear to see how the answer changes (patching and ablation). Researchers use these windows to figure out the hidden "recipes" the model taught itself, which is the whole point of mechanistic interpretability.[1][8]