Interpretability
42 articles
Activation patching
Activation steering
AI Safety, Large Language Models
Attribution Graphs
Anthropic
Christopher Olah
Anthropic, People
Circuit discovery
Crosscoder
Anthropic
DeepLIFT
Deep Learning, Machine Learning
Explainable AI
AI Ethics
Feature Importances
Machine Learning, Model Evaluation
Golden Gate Claude
Anthropic
Goodfire AI
AI Companies, AI Safety
Grad-CAM
Computer Vision, Deep Learning, Machine Learning
Induction Heads
Transformer Models
Influence functions (machine learning)
Machine Learning
Integrated Gradients
Deep Learning, Machine Learning
LIME
Layer-wise Relevance Propagation (LRP)
Deep Learning, Machine Learning
Linear Probes
Neural Networks
Logit lens
Transformer Models
Mechanistic interpretability
AI Safety
Monosemanticity
Neural Networks
On the Biology of a Large Language Model
AI Research, Anthropic
OpenAI Microscope
OpenAI
Patchscopes
Large Language Models
Permutation variable importances
Machine Learning
Persona vectors
AI Safety, Large Language Models
Polysemanticity
Neural Networks
Refusal direction
AI Safety, Large Language Models
Representation Engineering
AI Safety
SHAP (SHapley Additive exPlanations)
Saliency map
Computer Vision, Deep Learning
Scaling Monosemanticity
AI Research, Anthropic
SmoothGrad
Deep Learning
Sparse Coding
Machine Learning, Neural Networks
Sparse autoencoder
Deep Learning, Machine Learning
Superposition (Mechanistic Interpretability)
Neural Networks
Towards Monosemanticity
AI Research, Anthropic
Toy Models of Superposition
AI Research, Anthropic
Transcoder
Neural Networks
TransformerLens
Developer Tools, Open Source AI
Variable importances
Machine Learning
nnsight
Developer Tools, Open Source AI