TransformerLens
Last reviewed
May 19, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,068 words
TransformerLens is an open-source Python library for the mechanistic interpretability of GPT-style language models. It provides programmatic access to the internal activations of transformer models, allowing researchers to cache, inspect, edit, ablate, and replace intermediate computations during a forward pass. The library was created by interpretability researcher Neel Nanda and originally released as a successor to an earlier project called EasyTransformer; as of 2026 it is maintained by Bryce Meyer through the TransformerLensOrg GitHub organization.[1][2]
TransformerLens has become one of the most widely cited tools in mechanistic interpretability research. Its HookedTransformer class (and, from version 3.0, the TransformerBridge adapter) wraps a transformer language model with a network of named hook points that expose every residual stream addition, attention pattern, MLP activation, and projection in the model. Built-in methods such as run_with_cache and run_with_hooks are used to perform causal interventions, activation patching, direct logit attribution, and other interpretability techniques. The library is distributed under the MIT License.[1][3]
The project that became TransformerLens began in 2022 as EasyTransformer, a PyTorch reimplementation of GPT-style transformers tailored for interpretability work. Neel Nanda, who had previously worked on the interpretability team at Anthropic, created EasyTransformer because existing open-source tooling did not expose model internals in a form convenient for mechanistic analysis. A fork of an early EasyTransformer code base was used by Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt for their 2022 paper "Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small," work carried out at Redwood Research. A snapshot of that fork remains available on GitHub under the redwoodresearch/Easy-Transformer repository, which explicitly notes that the project is a one-time code release and recommends that interested researchers use TransformerLens instead.[4][5]
The library was renamed TransformerLens ahead of its first stable release on the Python Package Index. Version 1.0.0 was published to PyPI on 16 January 2023 with Neel Nanda listed as the author and the MIT License attached.[6] The recommended academic citation for the library names Neel Nanda and Joseph Bloom as authors and uses 2022 as the project year, reflecting the EasyTransformer origin date even though the rename and first tagged release occurred in early 2023.[7]
After Nanda joined Google DeepMind, day-to-day maintenance of the library moved to a dedicated GitHub organization, TransformerLensOrg. Bryce Meyer became the principal maintainer and continues to coordinate releases, code review, and roadmap discussions as of the v3.x series.[1][2] The PyPI metadata for the package now lists the TransformerLensOrg organization as the maintainer contact.[2]
TransformerLens replaces standard transformer modules with versions instrumented for interpretability. The core idea is to insert a small no-op module called a HookPoint at every place inside the network where an interpretability researcher might want to read or write an intermediate tensor.
A HookPoint is implemented as a PyTorch module whose forward method is the identity function: it returns its input unchanged. Its purpose is to provide a stable, named location in the model graph at which forward and backward hooks can be attached. Because every interesting tensor (the embeddings, the queries, keys, values, attention scores and patterns, attention output, MLP pre- and post-activations, and the residual stream additions for each block) flows through a dedicated HookPoint, an interpretability researcher can refer to any of these tensors by name and intervene on it without modifying the underlying model code.[8]
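The idea of an identity step that doubles as a named interception point can be sketched without PyTorch. The following is a minimal pure-Python illustration, not the library's implementation: ToyHookPoint, record, and scale_by are hypothetical names, and activations are plain lists of floats rather than tensors.

```python
from typing import Callable, List, Optional

Activation = List[float]
# A hook sees the activation and the hook point; returning None means "leave unchanged".
HookFn = Callable[[Activation, "ToyHookPoint"], Optional[Activation]]


class ToyHookPoint:
    """Hypothetical stand-in for a hook point: an identity step with a name."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.hooks: List[HookFn] = []

    def add_hook(self, fn: HookFn) -> None:
        self.hooks.append(fn)

    def __call__(self, x: Activation) -> Activation:
        # With no hooks attached this is the identity function.
        for fn in self.hooks:
            replacement = fn(x, self)
            if replacement is not None:
                x = replacement
        return x


def record(seen: list) -> HookFn:
    """Build a read-only hook that stores a copy of whatever passes through."""
    def hook(x: Activation, point: ToyHookPoint) -> None:
        seen.append((point.name, list(x)))
    return hook


def scale_by(factor: float) -> HookFn:
    """Build a write hook that replaces the activation with a scaled copy."""
    def hook(x: Activation, point: ToyHookPoint) -> Activation:
        return [factor * v for v in x]
    return hook
```

A read-only hook leaves the value untouched, while a write hook such as scale_by(0.0) turns the same location into a zero ablation. The surrounding computation does not need to change in either case.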
HookedRootModule is the parent class shared by the library's instrumented models. It provides utilities for registering hooks across the whole model, removing them again at the end of a context, and managing nesting so that hooks attached at one level can be cleared without disturbing hooks at another level. Each registered hook is wrapped in a LensHandle that records metadata about whether the hook is permanent and at which context level it was registered.[8]
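The bookkeeping described here (a permanence flag plus a context level per hook) can be illustrated with a small registry. This is a sketch under stated assumptions rather than the library's code: ToyHandle and ToyHookRegistry are hypothetical, hooks act on a single float, and the clearing rules are one plausible reading of the behaviour described above.

```python
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ToyHandle:
    """Hypothetical record kept for each registered hook."""
    fn: Callable[[float], float]
    is_permanent: bool = False
    context_level: Optional[int] = None


class ToyHookRegistry:
    def __init__(self) -> None:
        self.handles: List[ToyHandle] = []
        self.context_level = 0

    def add_hook(self, fn: Callable[[float], float], is_permanent: bool = False) -> ToyHandle:
        handle = ToyHandle(fn, is_permanent, self.context_level)
        self.handles.append(handle)
        return handle

    def reset_hooks(self, level: Optional[int] = None, including_permanent: bool = False) -> None:
        def keep(handle: ToyHandle) -> bool:
            if handle.is_permanent and not including_permanent:
                return True  # permanent hooks survive ordinary resets
            if level is not None and handle.context_level != level:
                return True  # hooks from other nesting levels are left alone
            return False
        self.handles = [h for h in self.handles if keep(h)]

    @contextmanager
    def hooks(self, fns: List[Callable[[float], float]]):
        # Enter a deeper level, attach hooks there, and clear only that level on exit.
        self.context_level += 1
        level = self.context_level
        try:
            for fn in fns:
                self.add_hook(fn)
            yield self
        finally:
            self.reset_hooks(level=level)
            self.context_level -= 1

    def apply(self, x: float) -> float:
        for handle in self.handles:
            x = handle.fn(x)
        return x
```

Tagging each hook with the level at which it was registered is what lets an inner context clean up after itself without removing hooks that an outer context still relies on.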
The flagship class historically exposed by the library is HookedTransformer. It is a from-scratch PyTorch implementation of a generic decoder-only transformer that exposes hook points at all of the locations described above. When a pretrained model is loaded via HookedTransformer.from_pretrained(...), TransformerLens downloads the Hugging Face weights for the requested model and rewrites them into the library's standardized parameter layout. This "weight standardization" step is what allows GPT-2, Pythia, Llama, Gemma, Mistral, and many other architectures to share the same interpretability API, but it also means that initial model loading with TransformerLens can take noticeably longer than loading the same model directly through Hugging Face Transformers.[1][9]
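As a rough illustration of the kind of reshaping a layout conversion involves, the sketch below splits a fused query/key/value matrix into separate per-head matrices. The shapes are assumptions made for the example (a fused matrix of shape d_model by 3·d_model, and per-head matrices indexed as head, d_model, d_head); split_fused_qkv is a hypothetical helper, and the library's actual conversion code is not reproduced here.

```python
from typing import Dict, List

Matrix = List[List[float]]


def split_fused_qkv(fused: Matrix, n_heads: int) -> Dict[str, List[Matrix]]:
    """Split a [d_model, 3*d_model] matrix into per-head W_Q, W_K, W_V.

    Each result is indexed [head][d_model][d_head]. Assumed layout, for illustration.
    """
    d_model = len(fused)
    if any(len(row) != 3 * d_model for row in fused):
        raise ValueError("expected a fused matrix of shape [d_model, 3 * d_model]")
    if d_model % n_heads != 0:
        raise ValueError("d_model must be divisible by n_heads")
    d_head = d_model // n_heads

    def per_head(col_offset: int) -> List[Matrix]:
        # For each head, take its d_head-wide column slice from every row.
        return [
            [row[col_offset + h * d_head: col_offset + (h + 1) * d_head] for row in fused]
            for h in range(n_heads)
        ]

    return {
        "W_Q": per_head(0),
        "W_K": per_head(d_model),
        "W_V": per_head(2 * d_model),
    }
```

Once every architecture's weights are expressed in one per-head layout, analysis code can index a head the same way regardless of how the upstream checkpoint stored it. The cost is paid once, at load time.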
Cached activations are returned wrapped in an ActivationCache object. The cache behaves like a dictionary keyed by hook name, but it also provides convenience methods that are common in interpretability work: stacking activations across layers into a single tensor, applying the cached layer norm scaling to a stack of residual stream components, decomposing the residual stream into per-head contributions, computing direct logit attributions, and slicing along the batch or position dimensions. These helpers reflect a deliberate design choice to make common analysis patterns one-line operations rather than ad-hoc tensor manipulation.[9]
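Direct logit attribution rests on a linearity argument: if the final residual stream is a sum of component outputs, then its projection onto a token's unembedding direction is the sum of each component's projection. The sketch below shows that identity on toy vectors. It deliberately ignores the final layer norm, and the component names and helper functions are hypothetical rather than taken from the library.

```python
from typing import Dict, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def residual_sum(components: Dict[str, Vector]) -> Vector:
    """Add up every component's contribution to the residual stream."""
    width = len(next(iter(components.values())))
    return [sum(vec[i] for vec in components.values()) for i in range(width)]


def direct_logit_attribution(components: Dict[str, Vector], direction: Vector) -> Dict[str, float]:
    """Project each component onto the unembedding direction of one token."""
    return {name: dot(vec, direction) for name, vec in components.items()}


# Toy decomposition of a width-2 residual stream at one position.
components = {
    "embed": [1.0, 0.0],
    "L0H0": [0.5, 2.0],
    "L0_mlp": [-1.0, 1.0],
}
token_direction = [2.0, 1.0]  # stand-in for one column of the unembedding matrix

attributions = direct_logit_attribution(components, token_direction)
total_logit = dot(residual_sum(components), token_direction)
```

Here the per-component attributions sum to the same value as the logit computed from the full residual stream, which is what allows a logit to be broken down head by head.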
Version 3.0 of the library, released on 17 April 2026, introduced a new abstraction called TransformerBridge. Instead of re-implementing every architecture from scratch inside TransformerLens, the bridge wraps a Hugging Face model in place and attaches HookPoints to its existing modules through adapters. The stated goal of this change is to expand model coverage dramatically while reducing the maintenance burden of keeping bespoke implementations in sync with upstream model releases. The TransformerBridge approach is documented as supporting "50+" architectures and is now the recommended entry point for new users; HookedTransformer remains available for backward compatibility.[1][10]
TransformerLens supports a wide range of open-source decoder-only language models. The library exposes a list called OFFICIAL_MODEL_NAMES that enumerates the model identifiers it can load directly, and it also accepts arbitrary Hugging Face checkpoints that share an architecture with one of those families. Documented supported families include:[1][11]
The TransformerLensOrg release notes for the v3.0 transition claim that the TransformerBridge mechanism expands compatibility from roughly 200 directly supported models to approximately 9,000 models when including all sub-variants on Hugging Face, although the documentation cautions that only a subset are formally verified.[10]
The two methods most often used in TransformerLens workflows are run_with_cache and run_with_hooks. These are inherited from HookedRootModule and exposed on both HookedTransformer and TransformerBridge.
run_with_cache(input) runs a forward pass and returns a tuple of (logits, ActivationCache). Every tensor that flows through a HookPoint is recorded under its hook name. A minimal example, drawn from the library's quick-start documentation, looks like this:[1]
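from transformer_lens.model_bridge import TransformerBridge
# Wrap the Hugging Face GPT-2 checkpoint and attach hook points to it.
bridge = TransformerBridge.boot_transformers("gpt2", device="cpu")
# Run one forward pass; the cache holds each hooked activation under its hook name.
logits, cache = bridge.run_with_cache("Hello World")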
The returned cache can be queried with names such as cache["blocks.5.attn.hook_pattern"] to obtain the attention pattern at layer 5, or with helpers such as cache.stack_head_results() to assemble a tensor containing the contribution of every attention head across every layer.[9]
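Because the cache is keyed by structured strings, analysis code often builds hook names programmatically rather than typing them out. The helper below is a hypothetical sketch of that pattern using an ordinary dictionary. Only the name blocks.5.attn.hook_pattern comes from the text above; the other site names and the attention/MLP grouping are assumptions made for illustration.

```python
from typing import Any, Dict, List

# Assumed grouping of short site names by the submodule they live under.
ATTN_SITES = {"q", "k", "v", "z", "attn_scores", "pattern"}
MLP_SITES = {"pre", "post"}


def toy_act_name(short_name: str, layer: int) -> str:
    """Build a full hook name such as 'blocks.5.attn.hook_pattern'."""
    if short_name in ATTN_SITES:
        return f"blocks.{layer}.attn.hook_{short_name}"
    if short_name in MLP_SITES:
        return f"blocks.{layer}.mlp.hook_{short_name}"
    return f"blocks.{layer}.hook_{short_name}"


def per_layer(cache: Dict[str, Any], short_name: str, n_layers: int) -> List[Any]:
    """Collect the same site from every layer, in layer order."""
    return [cache[toy_act_name(short_name, layer)] for layer in range(n_layers)]
```

Given a dictionary-like cache, per_layer(cache, "pattern", n_layers) would return the attention patterns for every layer as a list, ready to be stacked or plotted.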
run_with_hooks(input, fwd_hooks=[...], bwd_hooks=[...]) runs a forward (and optionally backward) pass while temporarily attaching user-supplied hook functions. Each hook is called with the activation tensor as its first argument and the HookPoint object as a keyword argument named hook, and may return a modified tensor that is substituted for the original. By default, hooks are removed when the call returns, leaving the model in its original state. This API is the primary mechanism for activation patching, zero or mean ablations, and other causal interventions.[8]
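The attach, run, and remove pattern can be shown on a toy model with two named sites. This is an illustrative sketch, not the library's implementation: ToyModel and zero_ablate are hypothetical, activations are single floats, and each hook receives the site name where the real API would pass a HookPoint object.

```python
from typing import Callable, Dict, List, Optional, Tuple

HookFn = Callable[..., Optional[float]]


class ToyModel:
    """Two 'heads' add into a running sum; each output passes a named site."""

    POINTS = ("head0.hook_out", "head1.hook_out")

    def __init__(self) -> None:
        self.hooks: Dict[str, List[HookFn]] = {name: [] for name in self.POINTS}

    def _hook_point(self, name: str, value: float) -> float:
        for fn in self.hooks[name]:
            replacement = fn(value, hook=name)
            if replacement is not None:
                value = replacement
        return value

    def forward(self, x: float) -> float:
        head0 = self._hook_point("head0.hook_out", 2.0 * x)
        head1 = self._hook_point("head1.hook_out", x + 1.0)
        return x + head0 + head1

    def run_with_hooks(self, x: float, fwd_hooks: List[Tuple[str, HookFn]]) -> float:
        for name, fn in fwd_hooks:
            self.hooks[name].append(fn)
        try:
            return self.forward(x)
        finally:
            # Remove only the hooks this call attached, even if forward() raised.
            for name, fn in fwd_hooks:
                self.hooks[name].remove(fn)


def zero_ablate(value: float, hook: str) -> float:
    """Replace the activation at this site with zero."""
    return 0.0
```

Running the model on 3.0 with head0 zero-ablated drops the output from 13.0 to 7.0, and a subsequent plain forward pass returns 13.0 again because the hook has been detached.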
Beyond the two headline methods, TransformerLens exposes a number of utilities that are heavily used in research code:
to_tokens and to_str_tokens for tokenization and detokenization with offsets, supporting per-token interpretability.
tokens_to_residual_directions, which projects vocabulary tokens into the residual stream basis so that researchers can decompose logit attributions in the embedding space.
The library's official documentation maintains a gallery of papers that use it. Representative examples include:[12]
The "Interpretability in the Wild" paper on indirect object identification, which originally relied on the EasyTransformer fork, is the most-cited demonstration of the circuit-analysis style of work that TransformerLens enables. Its tutorial-style replication using run_with_hooks activation patching is one of the canonical examples taught in the official documentation.[4][12]
Outside the gallery, TransformerLens has been used as the modeling layer for replications of "Locating and Editing Factual Associations in GPT" (the ROME paper by Meng, Bau, Andonian, and Belinkov), as well as in feature-level circuit work and in studies of refusal and steering behaviors. The original ROME implementation was a separate code base, but a number of follow-up papers and reproductions have reimplemented its causal tracing procedure on top of TransformerLens because the library's hook system makes the required activation surgery straightforward.[13]
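The activation surgery behind such experiments follows a common recipe: cache activations from a clean run, rerun on a corrupted input while overwriting one site with its clean value, and measure how much of the clean output is recovered. The sketch below applies that recipe to a hand-written toy function. It is a simplified stand-in rather than the ROME procedure or any library routine, and toy_forward, its site names, and the recovery metric are assumptions chosen for the example.

```python
from typing import Dict, Optional, Tuple


def toy_forward(x: float, patch: Optional[Dict[str, float]] = None) -> Tuple[float, Dict[str, float]]:
    """Run a tiny fixed computation, optionally overwriting named sites."""
    patch = patch or {}
    cache: Dict[str, float] = {}

    def site(name: str, value: float) -> float:
        value = patch.get(name, value)  # use the patched value if one was supplied
        cache[name] = value
        return value

    head0 = site("head0", 4.0 * x)      # depends on the input
    head1 = site("head1", 1.0)          # ignores the input
    mlp = site("mlp", 0.5 * head0)      # reads head0's output
    return mlp + head1 + 0.5 * head0, cache


def activation_patching(clean_x: float, corrupted_x: float) -> Dict[str, float]:
    """Fraction of the clean-vs-corrupted output gap restored by patching each site."""
    clean_out, clean_cache = toy_forward(clean_x)
    corrupted_out, _ = toy_forward(corrupted_x)
    recovered = {}
    for name, clean_value in clean_cache.items():
        patched_out, _ = toy_forward(corrupted_x, patch={name: clean_value})
        recovered[name] = (patched_out - corrupted_out) / (clean_out - corrupted_out)
    return recovered
```

With a clean input of 2.0 and a corrupted input of 0.0, patching head0 restores the whole gap, patching mlp restores half of it, and patching head1 restores none. This is the kind of contrast used to localize where input-dependent information is carried.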
Several other libraries occupy adjacent positions in the interpretability tooling landscape. TransformerLens is most often compared to nnsight and Captum.
nnsight, developed by the National Deep Inference Fabric (NDIF) project at Northeastern University, operates on arbitrary PyTorch networks rather than only on transformers. Its design centres on building a serializable "intervention graph" that can be sent to a remote machine where the target model is already resident in GPU memory; this is the mechanism by which NDIF makes very large open-weight models (such as Llama 3.1 405B) available to outside researchers. The 2024 paper "NNsight and NDIF: Democratizing Access to Open-Weight Foundation Model Internals" explicitly compares the two libraries and concludes that TransformerLens provides a more ergonomic interface for the specific class of models it supports, with named HookPoints and a richer cache abstraction, while nnsight is more general and integrates directly with the unmodified Hugging Face model. The same paper notes that TransformerLens model loading is approximately three times slower than the alternatives it benchmarks, attributing the difference to TransformerLens's weight-standardization preprocessing. The paper's overall recommendation is that the libraries are complementary: TransformerLens for rapid exploration on smaller models, nnsight when exact Hugging Face behaviour or very large model access is required.[14]
Captum, developed by Meta, is a general-purpose interpretability library for PyTorch. It implements feature-attribution methods such as integrated gradients, saliency maps, SmoothGrad, layer-wise relevance propagation, and Shapley value sampling, along with evaluation metrics for those methods. Captum is broader in scope than TransformerLens but does not provide an analogue to the named hook-point graph that makes circuit-style work convenient: it is oriented toward attribution and importance scoring rather than toward causal interventions and circuit discovery on transformer internals.[15]
Although not a direct alternative, SAELens is a closely related library in the same ecosystem. SAELens trains and serves sparse autoencoders on the activations of language models, and historically integrated with TransformerLens through a class called HookedSAETransformer. TransformerLens v2.0 removed that class and migrated the corresponding functionality to SAELens itself, formalizing the division of responsibilities: TransformerLens handles transformer instrumentation, SAELens handles sparse-feature analysis. SAELens is maintained by Joseph Bloom and collaborators, and it reuses TransformerLens hook names to identify the sites at which SAEs are trained.[1][16]
The largest organized teaching context for TransformerLens is the Alignment Research Engineer Accelerator (ARENA), an in-person and online curriculum led by Callum McDougall. Chapter 1 of the ARENA curriculum, "Transformer Interpretability," walks students through building a transformer from scratch in PyTorch, then transitions to TransformerLens for circuit analysis, including locating induction heads in a two-layer model and reproducing the indirect-object-identification circuit in GPT-2 Small. Later chapters cover feature superposition, sparse autoencoders (via SAELens), and activation-vector steering. All ARENA materials are released freely on GitHub.[17]
TransformerLens is also widely used in the project portfolio of the AI Safety Camp (AISC), an annual remote research program. AISC 2024 ran multiple mechanistic-interpretability streams whose project summaries explicitly reference TransformerLens as the default tool for accessing model internals, alongside related programs and reading lists organized through Apart Research and the Alignment Forum.[18]
The official documentation site at transformerlensorg.github.io/TransformerLens hosts a substantial collection of Jupyter-notebook demos, including a "Main Demo" walking through model loading, caching, and hook installation; demos for specific architectures such as Llama and BERT; and a curated "Getting Started in Mechanistic Interpretability" page that links to Neel Nanda's "200 Concrete Open Problems," his glossary, and his paper-walkthrough video channel. The same page advertises the field as one with a low barrier to entry and emphasises that, in the maintainers' view, low participation rather than technical difficulty explains many of the discipline's unsolved problems.[9][19]
TransformerLens is frequently combined with tooling for visualising and sharing the features it helps researchers discover. Neuronpedia, a web platform for browsing sparse-autoencoder features, ingests SAEs trained with SAELens and indexes them by the TransformerLens hook names at which they were trained, so a feature discovered in a notebook can be linked to its public Neuronpedia entry with no additional metadata. The library is also commonly used in conjunction with the open-source circuitsvis package for inline visualisations of attention patterns inside Jupyter notebooks, and with attribution graphs constructed via causal patching of cached activations between forward passes.[16]
TransformerLens is distributed under the MIT License, as recorded both in the GitHub repository and in the PyPI metadata for the package. The MIT terms permit unrestricted use, modification, and redistribution provided that the copyright notice is preserved.[1][6]
Governance is informal but follows a maintainer-led model. Bryce Meyer reviews pull requests and tags releases on behalf of the TransformerLensOrg organization, while broader design discussions happen on GitHub issues and on the mechanistic interpretability Slack community linked from the README. Neel Nanda continues to participate as the project's original author and to publish research that uses the library, but is no longer the day-to-day maintainer.[1][2]
Tagged releases on PyPI follow semantic versioning. Major version transitions to date include version 1.0.0 in January 2023, version 2.0 (which removed HookedSAETransformer and required Python 3.10 or newer), and version 3.0 in April 2026 (which introduced TransformerBridge). As of May 2026, the most recent release is version 3.2.1, published on 9 May 2026, which addressed multimodal Gemma 3 support among other fixes. The repository's GitHub statistics list approximately 3,400 stars and more than 570 forks at that point.[1][2][10]
The contributing guidelines invite outside pull requests to add new model adapters, expand the demo notebooks, and improve documentation. Adding a new architecture in the v3.x line typically involves writing an adapter that maps the upstream model's modules to the TransformerLens hook-point naming convention and registering the resulting bridge with the loader. Test coverage is run automatically via GitHub Actions, and recent maintenance work has focused on tightening quantisation handling, hardening the continuous-integration pipeline, and pinning dependencies in response to specific upstream security advisories.[10]
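The mapping step at the heart of an adapter can be pictured as a translation table from upstream module paths to one shared naming scheme. The sketch below is purely illustrative: the rule tables, the TOY_ADAPTERS registry, and to_standard_name are hypothetical and do not reflect the library's adapter interface. The upstream paths follow the module naming commonly seen in Hugging Face GPT-2 and Llama checkpoints.

```python
import re
from typing import Dict, Optional

# Hypothetical rule tables: upstream module path pattern -> standardized prefix.
TOY_GPT2_RULES = {
    r"transformer\.h\.(\d+)\.attn": "blocks.{layer}.attn",
    r"transformer\.h\.(\d+)\.mlp": "blocks.{layer}.mlp",
    r"transformer\.h\.(\d+)": "blocks.{layer}",
}
TOY_LLAMA_RULES = {
    r"model\.layers\.(\d+)\.self_attn": "blocks.{layer}.attn",
    r"model\.layers\.(\d+)\.mlp": "blocks.{layer}.mlp",
    r"model\.layers\.(\d+)": "blocks.{layer}",
}

# Hypothetical registry keyed by architecture family.
TOY_ADAPTERS: Dict[str, Dict[str, str]] = {
    "gpt2": TOY_GPT2_RULES,
    "llama": TOY_LLAMA_RULES,
}


def to_standard_name(architecture: str, upstream_path: str) -> Optional[str]:
    """Translate an upstream module path, or return None if no rule covers it."""
    for pattern, template in TOY_ADAPTERS[architecture].items():
        match = re.fullmatch(pattern, upstream_path)
        if match:
            return template.format(layer=match.group(1))
    return None
```

Two differently named upstream modules, such as transformer.h.5.attn and model.layers.5.self_attn, resolve to the same standardized prefix, so downstream analysis code can address either model with the same hook names.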