AI Wiki
Category: Interpretability

42 articles

Activation patching

Activation steering

AI Safety, Large Language Models

Attribution Graphs

Anthropic

Christopher Olah

Anthropic, People

Circuit discovery

Crosscoder

Anthropic

DeepLIFT

Deep Learning, Machine Learning

Explainable AI

AI Ethics

Feature Importances

Machine Learning, Model Evaluation

Golden Gate Claude

Anthropic

Goodfire AI

AI Companies, AI Safety

Grad-CAM

Computer Vision, Deep Learning, Machine Learning

Induction Heads

Transformer Models

Influence functions (machine learning)

Machine Learning

Integrated Gradients

Deep Learning, Machine Learning

LIME

Layer-wise Relevance Propagation (LRP)

Deep Learning, Machine Learning

Linear Probes

Neural Networks

Logit lens

Transformer Models

Mechanistic interpretability

AI Safety

Monosemanticity

Neural Networks

On the Biology of a Large Language Model

AI Research, Anthropic

OpenAI Microscope

OpenAI

Patchscopes

Large Language Models

Permutation variable importances

Machine Learning

Persona vectors

AI Safety, Large Language Models

Polysemanticity

Neural Networks

Refusal direction

AI Safety, Large Language Models

Representation Engineering

AI Safety

SHAP (SHapley Additive exPlanations)

Saliency map

Computer Vision, Deep Learning

Scaling Monosemanticity

AI Research, Anthropic

SmoothGrad

Deep Learning

Sparse Coding

Machine Learning, Neural Networks

Sparse autoencoder

Deep Learning, Machine Learning

Superposition (Mechanistic Interpretability)

Neural Networks

Towards Monosemanticity

AI Research, Anthropic

Toy Models of Superposition

AI Research, Anthropic

Transcoder

Neural Networks

TransformerLens

Developer Tools, Open Source AI

Variable importances

Machine Learning

nnsight

Developer Tools, Open Source AI