See also: Machine learning terms
Interpretability in machine learning refers to the degree to which humans can understand and explain the decisions made by a model. An interpretable model is one where a person can look at its inputs, internal workings, or outputs and form a reliable mental model of why a particular prediction was made. This goes beyond simply knowing that a model achieves high accuracy; it requires understanding the reasoning process behind individual decisions and the general patterns the model has learned.
Interpretability has become increasingly important as machine learning systems are deployed in high-stakes domains such as healthcare, criminal justice, lending, and autonomous vehicles. When a model denies a loan application, recommends a medical treatment, or drives a car, the people affected by those decisions (and the people responsible for them) need to understand why the model behaved as it did. In recent years, the field has expanded from post-hoc explanation methods to mechanistic interpretability, which attempts to reverse-engineer the internal computations of neural networks at a detailed level.
The terms "interpretability" and "explainability" are often used interchangeably, but some researchers draw a meaningful distinction between the two.
| Term | Meaning | Typical association |
|---|---|---|
| Interpretability | The degree to which a human can understand the model's internal mechanisms directly. The model's structure itself is transparent. | Intrinsically interpretable models (decision trees, linear models) |
| Explainability | The degree to which the model's predictions can be explained to a human, possibly through a post-hoc approximation that simplifies or abstracts away the actual mechanism. | Post-hoc methods (LIME, SHAP, saliency maps) |
Under this distinction, a decision tree is interpretable because you can read its logic directly, while a LIME explanation of a deep neural network provides explainability because it offers a simplified local approximation rather than revealing the actual computation. In practice, most practitioners use the terms interchangeably, and the field is commonly referred to as "Explainable AI" (XAI) in industry contexts and "interpretability" in research contexts. Some researchers argue that interpretability and fidelity are both components of explainability: a useful explanation must be intelligible to humans (interpretability) while also accurately depicting the model's behavior across its feature space (fidelity).
The demand for interpretability comes from several overlapping concerns.
Practitioners and end users are more willing to deploy and rely on models they can understand. A doctor who can see that a diagnostic model is focusing on relevant medical features (rather than artifacts in the imaging process) will be more confident in its recommendations. Conversely, unexplainable predictions erode trust and slow adoption, even when the model is statistically accurate.
Interpretability tools help developers find and fix problems in models. If a model is making incorrect predictions, understanding which features it relies on can reveal issues with the training data, feature engineering, or model architecture. For example, researchers discovered that an image classifier that appeared to distinguish wolves from huskies was actually relying on the presence of snow in the background, not the animal itself [1].
Regulations increasingly require that automated decision-making systems be explainable. The European Union's General Data Protection Regulation (GDPR), enacted in 2018, includes provisions around the right to an explanation for automated decisions. The EU AI Act, which entered into force in 2024, classifies high-risk AI systems and imposes transparency requirements on them [2]. In the United States, the Equal Credit Opportunity Act requires lenders to provide specific reasons when denying credit, which effectively requires interpretability in credit scoring models.
Interpretability techniques can reveal whether a model is making decisions based on protected characteristics like race, gender, or age. Even when these features are not directly included as inputs, models can learn to use proxy variables that correlate with protected attributes. Understanding which features drive predictions is necessary for identifying and mitigating such biases.
As AI systems become more capable, understanding their internal reasoning becomes a safety concern. If a model has learned to pursue unintended objectives or exhibits deceptive behavior, interpretability methods may be the primary way to detect these problems before deployment. This motivation is central to the field of AI safety.
| Motivation | Stakeholder | Example |
|---|---|---|
| Trust | End users, patients, customers | Doctor needs to trust a diagnostic AI |
| Debugging | ML engineers, researchers | Finding that a model uses spurious correlations |
| Regulation | Regulators, legal teams | GDPR right to explanation |
| Fairness | Affected individuals, auditors | Detecting racial bias in lending models |
| Safety | AI researchers, policymakers | Detecting deceptive reasoning in advanced AI |
Interpretability can be categorized along several dimensions.
Global interpretability means understanding the model's overall behavior across all inputs. What general rules has the model learned? Which features are most important on average? Global interpretability answers questions about the model as a whole.
Local interpretability means understanding why the model made a specific prediction for a specific input. Why did this particular patient get flagged as high-risk? What would need to change for the prediction to be different? Local interpretability answers questions about individual decisions.
Most practical interpretability methods provide either local or global explanations, though some (like SHAP) can provide both.
Intrinsically interpretable models are transparent by design. Decision trees, linear regression, and rule-based systems are intrinsically interpretable because their structure directly reveals the decision-making process. A decision tree can be read as a series of if-then rules. A linear model shows the weight assigned to each feature.
Post-hoc interpretability refers to methods applied after training to explain the behavior of complex, opaque models (often called "black box" models). Most post-hoc methods treat the model as a function and analyze its input-output behavior, sometimes supplemented with access to internal representations like gradients or hidden layer activations.
| Category | Interpretable by design? | Examples |
|---|---|---|
| Intrinsically interpretable | Yes | Decision trees, linear models, rule lists, GAMs |
| Post-hoc explainable | No (explanation added after training) | LIME, SHAP, saliency maps, probing classifiers |
Model-specific methods exploit the internal structure of a particular model type. For example, attention visualization is specific to transformer models, and feature importance from tree splits is specific to tree-based models.
Model-agnostic methods work with any model by treating it as a black box and analyzing its input-output behavior. LIME and SHAP are model-agnostic because they only require the ability to feed inputs to the model and observe its outputs.
| Dimension | Option A | Option B |
|---|---|---|
| Scope | Global (entire model behavior) | Local (single prediction) |
| Timing | Intrinsic (built into model) | Post-hoc (applied after training) |
| Specificity | Model-specific (exploits architecture) | Model-agnostic (treats model as black box) |
Intrinsically interpretable models are designed so that their internal logic can be directly inspected and understood by humans. Because the underlying mathematical function is simple enough for users to access and analyze directly, these models provide an exact description of how a prediction is computed, not merely an approximation.
Linear regression models predict outcomes as a weighted sum of input features. Each coefficient directly indicates the direction and magnitude of each feature's influence, making the model fully transparent. Logistic regression extends this to classification by passing the linear combination through a sigmoid function. In both cases, standardized coefficients serve as natural measures of feature importance.
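As a minimal illustration (dataset and model choices are arbitrary), the following scikit-learn sketch fits a logistic regression on standardized features and reads the coefficients directly as the explanation:

```python
# Minimal sketch: a logistic regression is its own explanation.
# Standardizing features first makes coefficient magnitudes comparable.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X, y)

# Sign gives the direction of each feature's influence; magnitude gives strength.
coefs = pipe.named_steps["logisticregression"].coef_[0]
for name, weight in sorted(zip(X.columns, coefs), key=lambda t: -abs(t[1]))[:5]:
    print(f"{name}: {weight:+.3f}")
```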
Decision trees split the input space through a series of if-then rules, which can be visualized as a flowchart. Each path from root to leaf represents a complete decision rule. Rule lists (ordered collections of if-then rules) offer similar transparency. Both formats are widely used in healthcare and criminal justice, where stakeholders need to audit individual decisions.
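A fitted tree can be printed as if-then rules directly; a small sketch using scikit-learn's export_text (dataset is illustrative):

```python
# Minimal sketch: print a decision tree as readable if-then rules.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(data.data, data.target)

# Each printed path from root to leaf is one complete decision rule.
print(export_text(tree, feature_names=list(data.feature_names)))
```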
Generalized additive models (GAMs) model the target as a sum of smooth functions of individual features, combined through a link function g: g(E[Y]) = β0 + f1(x1) + f2(x2) + ... + fp(xp). Each component function can be plotted independently, showing the exact shape of each feature's effect. Explainable Boosting Machines (EBMs), implemented in Microsoft's InterpretML library, are a modern GAM variant that achieves accuracy competitive with random forest and gradient boosting models while remaining interpretable [3].
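A brief sketch with InterpretML, assuming the interpret package is installed (dataset is illustrative); explain_global exposes each feature's learned shape function:

```python
# Minimal sketch: train an Explainable Boosting Machine and inspect
# its per-feature shape functions.
from interpret import show
from interpret.glassbox import ExplainableBoostingClassifier
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
ebm = ExplainableBoostingClassifier().fit(X, y)

# Opens an interactive view plotting each feature's contribution curve.
show(ebm.explain_global())
```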
Feature importance methods rank the input features of a model by how much they contribute to predictions. Several approaches exist: impurity-based importance derived from tree splits, coefficient magnitudes in standardized linear models, and permutation importance, which measures how much a model's score degrades when a single feature's values are randomly shuffled.
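A sketch of the permutation approach with scikit-learn (data split and model are illustrative):

```python
# Minimal sketch: permutation importance shuffles one feature at a time
# on held-out data and records the resulting drop in score.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)
print(result.importances_mean)  # average score drop per feature
```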
Partial dependence plots (PDPs) show the relationship between one or two features and the model's predicted outcome, averaging over the values of all other features. They provide a global view of how a feature influences predictions. PDPs were introduced by Friedman in 2001 as a companion to gradient boosting [5]. A key limitation of PDPs is that they assume feature independence; when features are correlated, the averaging process can include unrealistic data combinations, leading to misleading visualizations.
Individual conditional expectation (ICE) plots, introduced by Goldstein et al. (2015), address a limitation of PDPs by displaying one line per instance rather than a single averaged curve [6]. Each line shows how an individual observation's prediction changes as one feature varies while all other features remain at their observed values. Where a PDP shows the average effect, ICE plots reveal heterogeneity: if different observations respond differently to a feature, the ICE lines will fan out or cross, signaling an interaction effect. Practitioners often overlay a PDP curve on top of ICE lines for a combined view.
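Both views are available in scikit-learn; a brief sketch (feature names assume the bundled diabetes dataset):

```python
# Minimal sketch: kind="both" overlays the averaged PDP curve on
# per-instance ICE lines for the selected features.
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import PartialDependenceDisplay

X, y = load_diabetes(return_X_y=True, as_frame=True)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

PartialDependenceDisplay.from_estimator(model, X, features=["bmi", "s5"], kind="both")
plt.show()
```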
LIME (Local Interpretable Model-agnostic Explanations), introduced by Ribeiro, Singh, and Guestrin in 2016 [1], explains individual predictions by fitting a simple, interpretable model (typically a sparse linear model) in the local neighborhood of the input. The process works as follows:
1. Generate perturbed samples in the neighborhood of the input to be explained.
2. Query the black-box model for its predictions on the perturbed samples.
3. Weight each sample by its proximity to the original input.
4. Fit an interpretable surrogate model to the weighted samples.
5. Present the surrogate's coefficients as the explanation.
LIME is model-agnostic and works for tabular data, text, and images. Its main limitation is that the explanations are local approximations and may not faithfully represent the model's behavior if the decision boundary is highly non-linear in the region of interest. Different runs of LIME can also produce different explanations for the same input due to the random perturbation process.
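A sketch using the lime package on tabular data (dataset and model are illustrative):

```python
# Minimal sketch: explain one random-forest prediction with LIME.
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
model = RandomForestClassifier(random_state=0).fit(data.data, data.target)

explainer = LimeTabularExplainer(
    data.data,
    feature_names=data.feature_names,
    class_names=list(data.target_names),
    mode="classification",
)
exp = explainer.explain_instance(data.data[0], model.predict_proba, num_features=4)
print(exp.as_list())  # (feature condition, weight) pairs from the local linear model
```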
SHAP (SHapley Additive exPlanations), introduced by Lundberg and Lee in 2017 [7], is based on Shapley values from cooperative game theory. The idea is to treat each feature as a "player" in a game where the "payoff" is the model's prediction. The Shapley value for each feature represents its average marginal contribution to the prediction across all possible combinations of features.
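For reference, the Shapley value of feature i averages its marginal contribution over every subset S of the remaining features F:

```latex
\phi_i = \sum_{S \subseteq F \setminus \{i\}}
  \frac{|S|!\,(|F| - |S| - 1)!}{|F|!}
  \left[ f_{S \cup \{i\}}\!\left(x_{S \cup \{i\}}\right) - f_S\!\left(x_S\right) \right]
```

Here f_S denotes the model's prediction using only the features in S; in practice, absent features are marginalized out rather than the model being retrained.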
SHAP provides several desirable properties:
| Property | Description |
|---|---|
| Local accuracy | The sum of SHAP values for all features equals the difference between the model's prediction and the average prediction |
| Consistency | If a feature's contribution increases in a revised model, its SHAP value will not decrease |
| Missingness | Features missing from the simplified input receive a SHAP value of zero |
SHAP can provide both local explanations (SHAP values for a single prediction) and global explanations (average absolute SHAP values across many predictions). The main downside is computational cost: exact Shapley values require evaluating the model for every possible subset of features, which grows exponentially. Efficient approximations exist for specific model types, such as TreeSHAP for tree-based models and DeepSHAP for neural networks.
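A sketch with recent versions of the shap package, using TreeSHAP on a gradient-boosted regressor (dataset is illustrative):

```python
# Minimal sketch: TreeSHAP gives fast exact Shapley values for tree
# ensembles, supporting both local and global views.
import shap
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor

X, y = load_diabetes(return_X_y=True, as_frame=True)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer(X.iloc[:200])  # an Explanation object

shap.plots.waterfall(shap_values[0])  # local: one prediction decomposed
shap.plots.beeswarm(shap_values)      # global: value distribution across instances
```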
| Method | Scope | Model dependency | Key strength | Key limitation |
|---|---|---|---|---|
| LIME | Local | Model-agnostic | Works with any model; intuitive linear explanations | Instability across runs; faithfulness concerns |
| SHAP | Local and global | Model-agnostic (with optimized variants) | Theoretical guarantees from game theory | Computationally expensive for exact values |
| PDP | Global | Model-agnostic | Shows average feature effects | Assumes feature independence |
| ICE | Local | Model-agnostic | Reveals individual-level heterogeneity | Can be visually cluttered with many instances |
| Permutation importance | Global | Model-agnostic | Simple to compute and interpret | Sensitive to correlated features |
For image and deep learning models, gradient-based attribution methods identify which parts of an input are most relevant to the model's prediction. The simplest approach, vanilla saliency maps, computes the gradient of the output with respect to the input pixels; regions with large gradient magnitudes are highlighted as important.
Several refinements have been developed:
- SmoothGrad reduces visual noise by averaging gradients over many noisy copies of the input.
- Integrated Gradients accumulates gradients along a straight-line path from a baseline input (such as an all-black image) to the actual input.
- Grad-CAM uses the gradients flowing into the final convolutional layer to produce coarse, class-discriminative heat maps.
These methods are model-specific because they require access to the model's gradients, but they apply to any differentiable model, including convolutional neural networks and vision transformers.
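A sketch with Captum on a PyTorch vision model (untrained weights and a random input keep it self-contained; the target class index is arbitrary):

```python
# Minimal sketch: vanilla saliency and Integrated Gradients via Captum.
import torch
from captum.attr import IntegratedGradients, Saliency
from torchvision.models import resnet18

model = resnet18(weights=None).eval()  # untrained stand-in to avoid a download
image = torch.randn(1, 3, 224, 224, requires_grad=True)  # stand-in input

# Vanilla saliency: gradient of the class score w.r.t. input pixels.
saliency_map = Saliency(model).attribute(image, target=207)

# Integrated Gradients: accumulate gradients along a straight-line path
# from a baseline (here an all-zeros image) to the input.
ig = IntegratedGradients(model)
attributions = ig.attribute(image, baselines=torch.zeros_like(image), target=207)
print(saliency_map.shape, attributions.shape)
```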
In transformer models, attention weights show how much each token attends to every other token at each layer and head. Visualizing these attention patterns can provide intuitions about what the model is "looking at" when processing input. For example, attention heads that consistently attend to the previous token might be implementing a simple positional heuristic, while heads that attend to semantically related tokens might be performing meaning-based processing.
However, attention weights are not straightforward explanations. Jain and Wallace (2019) showed that attention weights do not reliably indicate which inputs are important for predictions, and alternative attention distributions can produce the same outputs [9]. Attention should be interpreted as a description of information flow rather than as an explanation of decision-making. Subsequent work by Chefer et al. (2021) demonstrated that combining attention with gradient information and multi-layer aggregation can produce more faithful explanations.
Counterfactual explanations describe the smallest change to an input that would alter the model's prediction. Rather than explaining what features the model used, they answer the question: "What would need to be different for the outcome to change?" For example, if a loan application is denied, a counterfactual explanation might state: "If the applicant's annual income were $5,000 higher and they had no outstanding debts, the loan would have been approved."
Wachter, Mittelstadt, and Russell (2017) formalized counterfactual explanations as an optimization problem, seeking the nearest data point to the original input that produces a different classification [10]. This approach is attractive for several reasons: it does not require access to the model's internals (model-agnostic), the explanations are intuitive to non-technical users, and they directly support recourse by telling individuals what they can change to receive a different decision. Challenges include ensuring that the counterfactual represents a realistic scenario (not just a mathematically optimal perturbation) and managing situations where multiple valid counterfactuals exist.
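A simplified sketch of the Wachter-style objective (using a plain L1 distance rather than the paper's MAD-weighted distance, and a derivative-free optimizer; data and model are illustrative):

```python
# Minimal sketch: search for the nearest input that flips a classifier's
# decision by minimizing (f(x') - target)^2 + lam * ||x' - x||_1.
import numpy as np
from scipy.optimize import minimize
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=4, random_state=0)
clf = LogisticRegression().fit(X, y)

x = X[0]                                          # instance to explain
target = 1.0 - clf.predict(x.reshape(1, -1))[0]   # probability of the opposite class
lam = 0.1                                         # distance penalty weight

def objective(x_prime):
    prob = clf.predict_proba(x_prime.reshape(1, -1))[0, 1]
    return (prob - target) ** 2 + lam * np.abs(x_prime - x).sum()

result = minimize(objective, x0=x, method="Nelder-Mead")
print("original class:      ", clf.predict(x.reshape(1, -1))[0])
print("counterfactual class:", clf.predict(result.x.reshape(1, -1))[0])
print("feature changes:     ", result.x - x)
```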
Probing classifiers (also called diagnostic classifiers) test whether specific information is encoded in a model's internal representations. The method works by training a simple classifier (usually a linear model) to predict a property of interest (such as part-of-speech, syntactic structure, or factual knowledge) from the hidden states of a neural network.
If a linear probe can accurately predict part-of-speech tags from a particular layer's representations, this suggests that syntactic information is linearly accessible at that layer. Probing has been widely used to study what linguistic information BERT and other language models encode at different layers [11].
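A sketch of a linear probe; here hidden_states and pos_tags are random stand-ins for real model activations and labels, so accuracy will sit near chance, but with real activations a high held-out score suggests the property is linearly accessible:

```python
# Minimal sketch: train a linear probe on hidden representations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(2000, 768))  # stand-in for one layer's token vectors
pos_tags = rng.integers(0, 5, size=2000)      # stand-in part-of-speech labels

X_train, X_test, y_train, y_test = train_test_split(
    hidden_states, pos_tags, random_state=0)

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out probe accuracy:", probe.score(X_test, y_test))
```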
The main criticism of probing is the "probe complexity" concern: a sufficiently powerful probe might learn the task itself from the data, rather than extracting information that was already present in the representations. Using simple linear probes partially addresses this, but the concern remains.
A long-standing assumption in machine learning holds that more complex models achieve higher predictive accuracy but are harder to interpret. Under this view, practitioners face a tradeoff: choose a transparent model like linear regression and sacrifice some performance, or choose a complex model like a deep neural network and sacrifice interpretability.
There is evidence supporting this tradeoff in many settings. Deep neural networks and large ensemble methods generally outperform simple models on tasks with complex, non-linear patterns in the data. However, recent research has challenged the idea that the tradeoff is inevitable.
Explainable Boosting Machines and other modern GAMs have demonstrated accuracy competitive with black-box models on many tabular datasets while maintaining full interpretability [3]. Rudin (2019) argued in an influential paper that for high-stakes decisions, the machine learning community should invest more effort in developing inherently interpretable models rather than explaining black boxes after the fact [12]. Her central claim is that the accuracy gap between interpretable and black-box models is often much smaller than assumed, and that post-hoc explanations can be misleading because they approximate, rather than reveal, the model's actual reasoning.
The tradeoff remains real for certain problem types, particularly those involving unstructured data (images, text, audio) where deep learning's advantage is substantial. For tabular data, however, the gap has narrowed considerably.
Interpretability is not merely an academic concern; it has direct practical consequences in domains where regulations, professional standards, or public trust demand transparency.
Clinical decision support systems must provide explanations that physicians can evaluate and override. Radiologists use Grad-CAM heat maps overlaid on medical images to verify that an AI system is focusing on clinically relevant regions rather than imaging artifacts. In predictive risk models for conditions like sepsis or readmission, SHAP values help clinicians understand which patient factors drive a high-risk score, enabling more informed conversations with patients about care plans.
Credit scoring and fraud detection are two areas where interpretability is legally mandated in many jurisdictions. In the United States, the Equal Credit Opportunity Act and Regulation B require that lenders provide specific adverse action reasons when denying credit applications. Financial institutions use SHAP-based explanations to generate these reason codes automatically. Fraud detection systems at large payment processors use explainability layers to help human analysts triage alerts, distinguishing genuine fraud from false positives.
Risk assessment tools used in criminal sentencing and parole decisions have drawn significant scrutiny. The COMPAS recidivism prediction tool, for example, was the subject of a ProPublica investigation in 2016 that found racial disparities in its risk scores. Interpretability methods are essential for auditing such systems and for defendants who seek to understand and challenge the basis of algorithmic assessments.
The EU AI Act (Regulation 2024/1689) is the first comprehensive legal framework for AI. Article 13 requires that high-risk AI systems be designed with sufficient transparency for deployers to interpret outputs and use them appropriately. Article 86 establishes a right to explanation for individuals subject to decisions made by high-risk AI systems that produce legal effects or significantly affect them [2]. The GDPR's Articles 13-15 and 22 similarly address automated decision-making, though the precise scope of the "right to explanation" under GDPR remains debated among legal scholars.
Mechanistic interpretability is a subfield that aims to reverse-engineer the internal computations of neural networks at the level of individual neurons, features, and circuits. Rather than treating the model as a black box and explaining its input-output behavior, mechanistic interpretability opens the black box and tries to understand the algorithms the model has learned.
Features: A feature is a direction in a model's representation space that corresponds to a human-understandable concept. For example, in a vision model, a feature might correspond to "curved edges" or "the color red." In a language model, a feature might correspond to "text written in French" or "the concept of deception." Features are the building blocks that mechanistic interpretability researchers try to identify [13].
Circuits: A circuit is a subgraph of the neural network that implements a specific computation. For example, a circuit for indirect object identification in a language model might consist of attention heads that copy information about the indirect object to the output position. Circuits describe how features interact to produce behavior [14].
Superposition and polysemanticity: One of the central challenges in mechanistic interpretability is that individual neurons in neural networks are typically polysemantic, meaning they respond to multiple unrelated concepts. This happens because neural networks represent more features than they have neurons, a phenomenon called superposition. A single neuron might activate for both "academic citations" and "the color blue" because these concepts rarely co-occur, so the network can reuse the same neuron for both [13].
To address polysemanticity, researchers have developed sparse autoencoders (SAEs) as a tool for decomposing neural network activations into interpretable features. An SAE is trained to reconstruct a model's internal activations using a much larger set of hidden units with a sparsity constraint. The idea is that if you expand the representation into a higher-dimensional space and require that only a few units are active at any time, each unit is more likely to correspond to a single interpretable concept.
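A minimal PyTorch sketch of the idea (dimensions, names, and the plain L1 penalty are illustrative, not any lab's actual implementation):

```python
# Minimal sketch: a sparse autoencoder that expands d_model activations
# into a larger, sparsely active feature basis and reconstructs them.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=512, d_hidden=8192):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, acts):
        features = torch.relu(self.encoder(acts))  # sparse feature activations
        recon = self.decoder(features)             # reconstructed activations
        return recon, features

sae = SparseAutoencoder()
acts = torch.randn(64, 512)  # stand-in for a batch of residual stream activations

recon, features = sae(acts)
l1_coeff = 1e-3  # sparsity strength (illustrative)
loss = ((recon - acts) ** 2).mean() + l1_coeff * features.abs().sum(dim=-1).mean()
loss.backward()
```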
Anthropic published foundational work on this approach in 2023-2024. Its 2023 study trained sparse autoencoders on a small one-layer transformer and found that human evaluators judged a large share of the extracted features to be cleanly interpretable, mapping to specific concepts like "Arabic script" or "DNA sequences." Follow-up work in 2024 scaled dictionary learning to Claude 3 Sonnet, training on billions of residual stream activations and extracting millions of features, including abstract concepts such as "expressions of sycophancy" [15].
OpenAI published parallel work applying sparse autoencoders to GPT-4 in 2024, independently confirming that the approach scales to frontier models [16].
In March 2025, Anthropic introduced circuit tracing, a method that combines several earlier techniques into a more comprehensive approach [17]. Circuit tracing replaces a model's MLP layers with cross-layer transcoders, a variant of sparse autoencoders that read from one layer's residual stream and can provide output to all subsequent MLP layers. This produces an "interpretable replacement model" where the building blocks are sparse, human-readable features rather than polysemantic neurons.
The method generates attribution graphs that trace the chain of intermediate steps a model uses to transform a specific input into an output. Researchers used attribution graphs on Claude 3.5 Haiku to study various behaviors:
| Behavior studied | Finding |
|---|---|
| Multi-language processing | The model uses a shared conceptual space where reasoning happens before being translated into a specific language |
| Poem generation | The model plans both forward and backward, identifying rhyming words before crafting lines to reach them |
| Sycophancy | Specific features fire in response to user disagreement that push the model toward changing its answer to agree with the user |
| Known-entity recognition | The model routes through entity-specific features that encode factual knowledge about named entities |
Anthropic released the circuit tracing tools as open-source software, including a Python library compatible with any open-weights model and a frontend on Neuronpedia for exploring attribution graphs visually [18].
Mechanistic interpretability has progressed rapidly. Active research directions as of 2025-2026 include scaling sparse autoencoders and transcoders to frontier models, automating the discovery and validation of circuits, and establishing that extracted features causally drive behavior rather than merely correlating with it.
Despite significant progress, interpretability research faces several fundamental challenges.
A central concern is whether a given explanation accurately reflects the model's actual reasoning process. An explanation is "faithful" if it truly describes why the model made a particular prediction, as opposed to merely providing a plausible-sounding story. LIME and SHAP approximations can disagree with each other for the same prediction, raising questions about which (if either) faithfully represents the model. Gradient-based saliency maps can be manipulated to produce arbitrary outputs while leaving the model's predictions unchanged, a finding that undermines trust in these explanations [9]. Attention-based explanations face the same faithfulness concern: plausible attention patterns do not guarantee causal importance.
There is no universally accepted standard for evaluating the quality of explanations. Human evaluation studies are expensive and subjective. Automated metrics (e.g., measuring prediction changes when highlighted features are removed) test specific aspects of faithfulness but do not capture the full picture. The literature on evaluation approaches remains relatively scarce, with no uniform, well-established protocols for either qualitative or quantitative assessment.
Many interpretability methods were developed for relatively small models and datasets. Applying them to models with billions of parameters and complex, multimodal inputs introduces computational and conceptual challenges. Exact Shapley values are intractable for high-dimensional inputs. Mechanistic interpretability techniques require significant engineering effort to apply to each new model architecture.
Core concepts in mechanistic interpretability, such as "feature," still lack rigorous formal definitions. Computational complexity results demonstrate that many interpretability queries are intractable in the worst case. This theoretical murkiness makes it difficult to make strong claims about what interpretability tools have actually revealed about model behavior.
Interpretability ultimately involves a human who must understand and act on explanations. Research in human-computer interaction has shown that people can be misled by explanations, placing too much trust in confident-sounding but inaccurate descriptions. The design of explanation interfaces, the cognitive biases of users, and the context in which explanations are presented all affect whether interpretability tools actually achieve their intended purpose.
Several open-source tools support interpretability research and practice.
| Tool | Purpose | Model support | URL |
|---|---|---|---|
| SHAP | Shapley value explanations for any model | Model-agnostic; optimized for trees and deep models | https://github.com/shap/shap |
| LIME | Local model-agnostic explanations | Model-agnostic | https://github.com/marcotcr/lime |
| Captum | Attribution methods for PyTorch models (integrated gradients, saliency maps, SmoothGrad, and more) | PyTorch models | https://captum.ai |
| InterpretML | Unified framework for interpretable models (EBMs) and black-box explanations | Model-agnostic and glassbox | https://github.com/interpretml/interpret |
| TransformerLens | Mechanistic interpretability for transformers | Transformer architectures | https://github.com/TransformerLensOrg/TransformerLens |
| Anthropic circuit tracing | Circuit-level analysis of language models | Compatible with open-weights models | https://github.com/anthropics/circuit-tracing |
| Neuronpedia | Visual exploration of attribution graphs | Language models | https://www.neuronpedia.org |
| ELI5 | Debug and explain ML classifiers | Scikit-learn, XGBoost, and others | https://github.com/eli5-org/eli5 |
| Alibi | Counterfactual explanations and other methods | Model-agnostic | https://github.com/SeldonIO/alibi |
Imagine you have a friend who is really good at guessing things. You show them a picture and they say, "That's a cat!" You ask, "How did you know?" If your friend can point to the ears and whiskers and explain their thinking, that is interpretability. You can follow their reasoning step by step.
Now imagine a different friend who just says, "I know it's a cat, but I can't really explain why." That is like a black-box model. It might be right, but you cannot check its work.
Interpretability in machine learning is all about making sure we can check the computer's work. When a computer program makes a decision (like whether to approve a loan, or whether an email is spam), we want to be able to ask "why?" and get a real answer. Some programs are built to be easy to understand from the start, like a simple set of rules. Others are really complicated, so scientists have invented special tools (like LIME and SHAP) to peek inside and figure out what the program is paying attention to. This matters because if the program is making mistakes or being unfair, we need to be able to spot those problems and fix them.