Interpretability

42 articles

Activation patching

Activation patching is a causal intervention technique used in mechanistic interpretability to identify which internal components of a neural network are...

Activation steering

Activation steering is a family of inference-time techniques in mechanistic interpretability and AI safety that modify a neural network's internal activations...

AI Safety, Large Language Models

Attribution Graphs

Attribution graphs are a mechanistic interpretability technique developed by Anthropic that traces the internal "circuits" a large language model uses to turn...

Anthropic

Christopher Olah

Christopher Olah (commonly Chris Olah) is a Canadian machine learning researcher, a co-founder of Anthropic, and the researcher most often credited with...

Anthropic, People

Circuit discovery

Circuit discovery is a research program in mechanistic interpretability that aims to identify sparse computational subgraphs inside trained neural networks,...

Crosscoder

A crosscoder is a mechanistic interpretability tool, introduced by Anthropic in October 2024, that generalizes the sparse autoencoder (SAE) and the transcoder...

Anthropic

DeepLIFT

DeepLIFT (Deep Learning Important FeaTures) is a feature attribution method for deep neural networks introduced by Avanti Shrikumar, Peyton Greenside, and...

Deep Learning, Machine Learning

Explainable AI

Explainable AI (XAI) refers to artificial intelligence systems and techniques designed so that humans can understand how and why the system reaches its...

AI Ethics

Feature Importances

Feature importances are numeric scores that quantify how much each input feature contributes to the predictions of a machine learning model. The three dominant...

Machine Learning, Model Evaluation

Golden Gate Claude

Golden Gate Claude was a temporary, research-oriented public demonstration released by Anthropic on May 23, 2024, in which a modified version of the Claude 3...

Anthropic

Goodfire AI

Goodfire AI is a San Francisco-based artificial intelligence research lab and public benefit corporation focused on mechanistic interpretability, the science...

AI Companies, AI Safety

Grad-CAM

Grad-CAM (Gradient-weighted Class Activation Mapping) is a technique for producing visual explanations from convolutional neural network (CNN) models by using...

Computer Vision, Deep Learning

Induction Heads

Induction heads are a circuit pattern in Transformer language models in which a small set of attention heads, typically spread across two layers, perform an...

Transformer Models

Influence functions (machine learning)

An influence function is a tool for estimating how a machine learning model's predictions would change if a single training example were removed or perturbed,...

Machine Learning

Integrated Gradients

Integrated Gradients (IG) is a feature-attribution method for explainable AI that explains a neural network prediction by assigning each input feature an...

Deep Learning, Machine Learning

LIME

LIME (Local Interpretable Model-Agnostic Explanations) is a technique for explaining individual predictions of any black-box machine learning classifier or...

Layer-wise Relevance Propagation (LRP)

Layer-wise Relevance Propagation (LRP) is an explainable AI method that explains the prediction of a deep neural network by propagating the model's output...

Deep Learning, Machine Learning

Linear Probes

A linear probe is a small linear classifier (or linear regressor) trained on the frozen internal activations of a neural network to test whether a particular...

Neural Networks

Logit lens

The logit lens is a foundational technique in mechanistic interpretability for inspecting the intermediate computations of transformer language models. It...

Transformer Models

Mechanistic interpretability

Mechanistic interpretability (often abbreviated as mech interp or MI) is the field that reverse-engineers the internal computations of neural networks,...

AI Safety

Monosemanticity

Monosemanticity is the property whereby an internal feature or neuron in a neural network responds to a single, human-interpretable concept rather than...

Neural Networks

On the Biology of a Large Language Model

On the Biology of a Large Language Model is a mechanistic interpretability paper published by Anthropic on March 27, 2025, in the Transformer Circuits...

AI Research, Anthropic

OpenAI Microscope

OpenAI Microscope is a publicly accessible collection of visualizations of the neurons, channels, and features inside a number of significant, commonly studied...

OpenAI

Patchscopes

Patchscopes is an interpretability framework for inspecting hidden representations of large language models by patching an internal activation from a source...

Large Language Models

Permutation variable importances

Permutation variable importance is a model-agnostic technique that measures how much a fitted machine learning model relies on a given feature by randomly...

Machine Learning

Persona vectors

Persona vectors are single linear directions in the activation space of a large language model that correspond to high-level character traits such as evil,...

AI Safety, Large Language Models

Polysemanticity

Polysemanticity is the phenomenon in artificial neural networks in which a single neuron (or directional unit such as an attention head) activates strongly for...

Neural Networks

Refusal direction

The refusal direction is a finding from mechanistic interpretability research that the refusal behavior of safety fine-tuned chat language models is mediated...

AI Safety, Large Language Models

Representation Engineering

Representation Engineering (often abbreviated RepE) is a top-down approach to artificial-intelligence transparency and control that reads and manipulates...

AI Safety

SHAP (SHapley Additive exPlanations)

See also: explainable AI, feature importance, LIME, permutation feature importance

SHAP (SHapley Additive exPlanations) is a game-theoretic method that...

Saliency map

A saliency map is an explainable AI visualization that highlights which parts of an input, most often the individual pixels of an image, most influenced a deep...

Computer Vision, Deep Learning

Scaling Monosemanticity

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet is the May 21, 2024 mechanistic interpretability paper in which Anthropic used...

AI Research, Anthropic

SmoothGrad

SmoothGrad is a saliency map technique that reduces visual noise in gradient-based explanations of neural network predictions by averaging gradients over many...

Deep Learning

Sparse Coding

Sparse coding is a representation learning principle in which a signal is encoded as a linear combination of a small number of elements drawn from a (usually...

Machine Learning, Neural Networks

Sparse autoencoder

A sparse autoencoder (SAE) is a neural network that adds a sparsity penalty to an autoencoder's training loss so that only a small number of hidden units...

Deep Learning, Machine Learning

Superposition (Mechanistic Interpretability)

Superposition is the phenomenon in which an artificial neural network represents more distinct features than it has dimensions in its activation space, by...

Neural Networks

Towards Monosemanticity

Towards Monosemanticity is an October 2023 mechanistic interpretability paper from Anthropic that used a sparse autoencoder to decompose the internal...

AI Research, Anthropic

Toy Models of Superposition

Toy Models of Superposition is a September 2022 mechanistic interpretability paper from Anthropic that shows how a neural network can represent more features...

AI Research, Anthropic

Transcoder

A transcoder is a sparse neural network used in mechanistic interpretability research to approximate the input-to-output function of a component inside a...

Neural Networks

TransformerLens

TransformerLens is an open-source Python library for the mechanistic interpretability of GPT-style language models. It loads a pretrained transformer such as...

Developer Tools, Open Source AI

Variable importances

Variable importances, also called feature importances, are scores assigned to each input variable of a predictive model that measure how much that variable...

Machine Learning

nnsight

nnsight is an open-source Python library for the interpretation and intervention of deep learning models, developed by the Bau Lab at Northeastern University....

Developer Tools, Open Source AI