See also: Machine learning terms
Interpretability in machine learning refers to the degree to which humans can understand and explain the decisions made by a model. An interpretable model is one where a person can look at its inputs, internal workings, or outputs and form a reliable mental model of why a particular prediction was made. This goes beyond simply knowing that a model achieves high accuracy; it requires understanding the reasoning process behind individual decisions and the general patterns the model has learned.
Interpretability has become increasingly important as machine learning systems are deployed in high-stakes domains such as healthcare, criminal justice, lending, and autonomous vehicles. When a model denies a loan application, recommends a medical treatment, or drives a car, the people affected by those decisions (and the people responsible for them) need to understand why the model behaved as it did. In recent years, the field has expanded from post-hoc explanation methods to mechanistic interpretability, which attempts to reverse-engineer the internal computations of neural networks at a detailed level.
The terms "interpretability" and "explainability" are often used interchangeably, but some researchers draw a meaningful distinction between the two.
| Term | Meaning | Typical association |
|---|---|---|
| Interpretability | The degree to which a human can understand the model's internal mechanisms directly. The model's structure itself is transparent. | Intrinsically interpretable models (decision trees, linear models) |
| Explainability | The degree to which the model's predictions can be explained to a human, possibly through a post-hoc approximation that simplifies or abstracts away the actual mechanism. | Post-hoc methods (LIME, SHAP, saliency maps) |
Under this distinction, a decision tree is interpretable because you can read its logic directly, while a LIME explanation of a deep neural network provides explainability because it offers a simplified local approximation rather than revealing the actual computation. In practice, most practitioners use the terms interchangeably, and the field is commonly referred to as "Explainable AI" (XAI) in industry contexts and "interpretability" in research contexts. Some researchers argue that interpretability and fidelity are both components of explainability: a useful explanation must be intelligible to humans (interpretability) while also accurately depicting the model's behavior across its feature space (fidelity).
The demand for interpretability comes from several overlapping concerns.
Practitioners and end users are more willing to deploy and rely on models they can understand. A doctor who can see that a diagnostic model is focusing on relevant medical features (rather than artifacts in the imaging process) will be more confident in its recommendations. Conversely, unexplainable predictions erode trust and slow adoption, even when the model is statistically accurate.
Interpretability tools help developers find and fix problems in models. If a model is making incorrect predictions, understanding which features it relies on can reveal issues with the training data, feature engineering, or model architecture. For example, researchers discovered that an image classifier that appeared to distinguish wolves from huskies was actually relying on the presence of snow in the background, not the animal itself [1].
Regulations increasingly require that automated decision-making systems be explainable. The European Union's General Data Protection Regulation (GDPR), enacted in 2018, includes provisions around the right to an explanation for automated decisions. The EU AI Act, which entered into force in 2024, classifies high-risk AI systems and imposes transparency requirements on them [2]. In the United States, the Equal Credit Opportunity Act requires lenders to provide specific reasons when denying credit, which effectively requires interpretability in credit scoring models.
Interpretability techniques can reveal whether a model is making decisions based on protected characteristics like race, gender, or age. Even when these features are not directly included as inputs, models can learn to use proxy variables that correlate with protected attributes. Understanding which features drive predictions is necessary for identifying and mitigating such biases.
As AI systems become more capable, understanding their internal reasoning becomes a safety concern. If a model has learned to pursue unintended objectives or exhibits deceptive behavior, interpretability methods may be the primary way to detect these problems before deployment. This motivation is central to the field of AI safety.
| Motivation | Stakeholder | Example |
|---|---|---|
| Trust | End users, patients, customers | Doctor needs to trust a diagnostic AI |
| Debugging | ML engineers, researchers | Finding that a model uses spurious correlations |
| Regulation | Regulators, legal teams | GDPR right to explanation |
| Fairness | Affected individuals, auditors | Detecting racial bias in lending models |
| Safety | AI researchers, policymakers | Detecting deceptive reasoning in advanced AI |
Interpretability can be categorized along several dimensions.
Global interpretability means understanding the model's overall behavior across all inputs. What general rules has the model learned? Which features are most important on average? Global interpretability answers questions about the model as a whole.
Local interpretability means understanding why the model made a specific prediction for a specific input. Why did this particular patient get flagged as high-risk? What would need to change for the prediction to be different? Local interpretability answers questions about individual decisions.
Most practical interpretability methods provide either local or global explanations, though some (like SHAP) can provide both.
Intrinsically interpretable models are transparent by design. Decision trees, linear regression, and rule-based systems are intrinsically interpretable because their structure directly reveals the decision-making process. A decision tree can be read as a series of if-then rules. A linear model shows the weight assigned to each feature.
Post-hoc interpretability refers to methods applied after training to explain the behavior of complex, opaque models (often called "black box" models). Most post-hoc methods treat the model as a function and analyze its input-output behavior, sometimes supplemented with access to internal representations like gradients or hidden layer activations.
| Category | Interpretable by design? | Examples |
|---|---|---|
| Intrinsically interpretable | Yes | Decision trees, linear models, rule lists, GAMs |
| Post-hoc explainable | No (explanation added after training) | LIME, SHAP, saliency maps, probing classifiers |
Model-specific methods exploit the internal structure of a particular model type. For example, attention visualization is specific to transformer models, and feature importance from tree splits is specific to tree-based models.
Model-agnostic methods work with any model by treating it as a black box and analyzing its input-output behavior. LIME and SHAP are model-agnostic because they only require the ability to feed inputs to the model and observe its outputs.
| Dimension | Option A | Option B |
|---|---|---|
| Scope | Global (entire model behavior) | Local (single prediction) |
| Timing | Intrinsic (built into model) | Post-hoc (applied after training) |
| Specificity | Model-specific (exploits architecture) | Model-agnostic (treats model as black box) |
Intrinsically interpretable models are designed so that their internal logic can be directly inspected and understood by humans. Because the underlying mathematical function is simple enough for users to access and analyze directly, these models provide an exact description of how a prediction is computed, not merely an approximation.
Linear regression models predict outcomes as a weighted sum of input features. Each coefficient directly indicates the direction and magnitude of each feature's influence, making the model fully transparent. Logistic regression extends this to classification by passing the linear combination through a sigmoid function. In both cases, standardized coefficients serve as natural measures of feature importance.
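As a minimal illustration (dataset and model choices are arbitrary), the following scikit-learn sketch fits a logistic regression on standardized features and reads the coefficients directly as the explanation:

```python
# Minimal sketch: a logistic regression is its own explanation.
# Standardizing features first makes coefficient magnitudes comparable.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X, y)

# Sign gives the direction of each feature's influence; magnitude gives strength.
coefs = pipe.named_steps["logisticregression"].coef_[0]
for name, weight in sorted(zip(X.columns, coefs), key=lambda t: -abs(t[1]))[:5]:
    print(f"{name}: {weight:+.3f}")
```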
Decision trees split the input space through a series of if-then rules, which can be visualized as a flowchart. Each path from root to leaf represents a complete decision rule. Rule lists (ordered collections of if-then rules) offer similar transparency. Both formats are widely used in healthcare and criminal justice, where stakeholders need to audit individual decisions.
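A fitted tree can be printed as if-then rules directly; a small sketch using scikit-learn's export_text (dataset is illustrative):

```python
# Minimal sketch: print a decision tree as readable if-then rules.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(data.data, data.target)

# Each printed path from root to leaf is one complete decision rule.
print(export_text(tree, feature_names=list(data.feature_names)))
```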
Generalized additive models (GAMs) model the target as a sum of smooth functions of individual features, combined through a link function g: g(E[Y]) = β0 + f1(x1) + f2(x2) + ... + fp(xp). Each component function can be plotted independently, showing the exact shape of each feature's effect. Explainable Boosting Machines (EBMs), implemented in Microsoft's InterpretML library, are a modern GAM variant that achieves accuracy competitive with random forest and gradient boosting models while remaining interpretable [3].
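A brief sketch with InterpretML, assuming the interpret package is installed (dataset is illustrative); explain_global exposes each feature's learned shape function:

```python
# Minimal sketch: train an Explainable Boosting Machine and inspect
# its per-feature shape functions.
from interpret import show
from interpret.glassbox import ExplainableBoostingClassifier
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
ebm = ExplainableBoostingClassifier().fit(X, y)

# Opens an interactive view plotting each feature's contribution curve.
show(ebm.explain_global())
```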
Feature importance methods rank the input features of a model by how much they contribute to predictions. Several approaches exist: impurity-based importance derived from tree splits, coefficient magnitudes in standardized linear models, and permutation importance, which measures how much a model's score degrades when a single feature's values are randomly shuffled.
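A sketch of the permutation approach with scikit-learn (data split and model are illustrative):

```python
# Minimal sketch: permutation importance shuffles one feature at a time
# on held-out data and records the resulting drop in score.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)
print(result.importances_mean)  # average score drop per feature
```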
Partial dependence plots (PDPs) show the relationship between one or two features and the model's predicted outcome, averaging over the values of all other features. They provide a global view of how a feature influences predictions. PDPs were introduced by Friedman in 2001 as a companion to gradient boosting [5]. A key limitation of PDPs is that they assume feature independence; when features are correlated, the averaging process can include unrealistic data combinations, leading to misleading visualizations.
Individual conditional expectation (ICE) plots, introduced by Goldstein et al. (2015), address a limitation of PDPs by displaying one line per instance rather than a single averaged curve [6]. Each line shows how an individual observation's prediction changes as one feature varies while all other features remain at their observed values. Where a PDP shows the average effect, ICE plots reveal heterogeneity: if different observations respond differently to a feature, the ICE lines will fan out or cross, signaling an interaction effect. Practitioners often overlay a PDP curve on top of ICE lines for a combined view.
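Both views are available in scikit-learn; a brief sketch (feature names assume the bundled diabetes dataset):

```python
# Minimal sketch: kind="both" overlays the averaged PDP curve on
# per-instance ICE lines for the selected features.
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import PartialDependenceDisplay

X, y = load_diabetes(return_X_y=True, as_frame=True)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

PartialDependenceDisplay.from_estimator(model, X, features=["bmi", "s5"], kind="both")
plt.show()
```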
LIME (Local Interpretable Model-agnostic Explanations), introduced by Ribeiro, Singh, and Guestrin in 2016 [1], explains individual predictions by fitting a simple, interpretable model (typically a sparse linear model) in the local neighborhood of the input. The process works as follows:
1. Generate perturbed samples in the neighborhood of the input to be explained.
2. Query the black-box model for its predictions on the perturbed samples.
3. Weight each sample by its proximity to the original input.
4. Fit an interpretable surrogate model to the weighted samples.
5. Present the surrogate's coefficients as the explanation.
LIME is model-agnostic and works for tabular data, text, and images. Its main limitation is that the explanations are local approximations and may not faithfully represent the model's behavior if the decision boundary is highly non-linear in the region of interest. Different runs of LIME can also produce different explanations for the same input due to the random perturbation process.
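A sketch using the lime package on tabular data (dataset and model are illustrative):

```python
# Minimal sketch: explain one random-forest prediction with LIME.
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
model = RandomForestClassifier(random_state=0).fit(data.data, data.target)

explainer = LimeTabularExplainer(
    data.data,
    feature_names=data.feature_names,
    class_names=list(data.target_names),
    mode="classification",
)
exp = explainer.explain_instance(data.data[0], model.predict_proba, num_features=4)
print(exp.as_list())  # (feature condition, weight) pairs from the local linear model
```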
SHAP (SHapley Additive exPlanations), introduced by Lundberg and Lee in 2017 [7], is based on Shapley values from cooperative game theory. The idea is to treat each feature as a "player" in a game where the "payoff" is the model's prediction. The Shapley value for each feature represents its average marginal contribution to the prediction across all possible combinations of features.
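For reference, the Shapley value of feature i averages its marginal contribution over every subset S of the remaining features F:

```latex
\phi_i = \sum_{S \subseteq F \setminus \{i\}}
  \frac{|S|!\,(|F| - |S| - 1)!}{|F|!}
  \left[ f_{S \cup \{i\}}\!\left(x_{S \cup \{i\}}\right) - f_S\!\left(x_S\right) \right]
```

Here f_S denotes the model's prediction using only the features in S; in practice, absent features are marginalized out rather than the model being retrained.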
SHAP provides several desirable properties:
| Property | Description |
|---|---|
| Local accuracy | The sum of SHAP values for all features equals the difference between the model's prediction and the average prediction |
| Consistency | If a feature's contribution increases in a revised model, its SHAP value will not decrease |
| Missingness | Features missing from the simplified input receive a SHAP value of zero |
SHAP can provide both local explanations (SHAP values for a single prediction) and global explanations (average absolute SHAP values across many predictions). The main downside is computational cost: exact Shapley values require evaluating the model for every possible subset of features, which grows exponentially. Efficient approximations exist for specific model types, such as TreeSHAP for tree-based models and DeepSHAP for neural networks.
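A sketch with recent versions of the shap package, using TreeSHAP on a gradient-boosted regressor (dataset is illustrative):

```python
# Minimal sketch: TreeSHAP gives fast exact Shapley values for tree
# ensembles, supporting both local and global views.
import shap
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor

X, y = load_diabetes(return_X_y=True, as_frame=True)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer(X.iloc[:200])  # an Explanation object

shap.plots.waterfall(shap_values[0])  # local: one prediction decomposed
shap.plots.beeswarm(shap_values)      # global: value distribution across instances
```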
| Method | Scope | Model dependency | Key strength | Key limitation |
|---|---|---|---|---|
| LIME | Local | Model-agnostic | Works with any model; intuitive linear explanations | Instability across runs; faithfulness concerns |
| SHAP | Local and global | Model-agnostic (with optimized variants) | Theoretical guarantees from game theory | Computationally expensive for exact values |
| PDP | Global | Model-agnostic | Shows average feature effects | Assumes feature independence |
| ICE | Local | Model-agnostic | Reveals individual-level heterogeneity | Can be visually cluttered with many instances |
| Permutation importance | Global | Model-agnostic | Simple to compute and interpret | Sensitive to correlated features |
For image and deep learning models, gradient-based attribution methods identify which parts of an input are most relevant to the model's prediction. The simplest approach, vanilla saliency maps, computes the gradient of the output with respect to the input pixels; regions with large gradient magnitudes are highlighted as important.
Several refinements have been developed:
- SmoothGrad reduces visual noise by averaging gradients over many noisy copies of the input.
- Integrated Gradients accumulates gradients along a straight-line path from a baseline input (such as an all-black image) to the actual input.
- Grad-CAM uses the gradients flowing into the final convolutional layer to produce coarse, class-discriminative heat maps.
These methods are model-specific because they require access to the model's gradients, but they apply to any differentiable model, including convolutional neural networks and vision transformers.
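A sketch with Captum on a PyTorch vision model (untrained weights and a random input keep it self-contained; the target class index is arbitrary):

```python
# Minimal sketch: vanilla saliency and Integrated Gradients via Captum.
import torch
from captum.attr import IntegratedGradients, Saliency
from torchvision.models import resnet18

model = resnet18(weights=None).eval()  # untrained stand-in to avoid a download
image = torch.randn(1, 3, 224, 224, requires_grad=True)  # stand-in input

# Vanilla saliency: gradient of the class score w.r.t. input pixels.
saliency_map = Saliency(model).attribute(image, target=207)

# Integrated Gradients: accumulate gradients along a straight-line path
# from a baseline (here an all-zeros image) to the input.
ig = IntegratedGradients(model)
attributions = ig.attribute(image, baselines=torch.zeros_like(image), target=207)
print(saliency_map.shape, attributions.shape)
```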
In transformer models, attention weights show how much each token attends to every other token at each layer and head. Visualizing these attention patterns can provide intuitions about what the model is "looking at" when processing input. For example, attention heads that consistently attend to the previous token might be implementing a simple positional heuristic, while heads that attend to semantically related tokens might be performing meaning-based processing.
However, attention weights are not straightforward explanations. Jain and Wallace (2019) showed that attention weights do not reliably indicate which inputs are important for predictions, and alternative attention distributions can produce the same outputs [9]. Attention should be interpreted as a description of information flow rather than as an explanation of decision-making. Subsequent work by Chefer et al. (2021) demonstrated that combining attention with gradient information and multi-layer aggregation can produce more faithful explanations.
Counterfactual explanations describe the smallest change to an input that would alter the model's prediction. Rather than explaining what features the model used, they answer the question: "What would need to be different for the outcome to change?" For example, if a loan application is denied, a counterfactual explanation might state: "If the applicant's annual income were $5,000 higher and they had no outstanding debts, the loan would have been approved."
Wachter, Mittelstadt, and Russell (2017) formalized counterfactual explanations as an optimization problem, seeking the nearest data point to the original input that produces a different classification [10]. This approach is attractive for several reasons: it does not require access to the model's internals (model-agnostic), the explanations are intuitive to non-technical users, and they directly support recourse by telling individuals what they can change to receive a different decision. Challenges include ensuring that the counterfactual represents a realistic scenario (not just a mathematically optimal perturbation) and managing situations where multiple valid counterfactuals exist.
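A simplified sketch of the Wachter-style objective (using a plain L1 distance rather than the paper's MAD-weighted distance, and a derivative-free optimizer; data and model are illustrative):

```python
# Minimal sketch: search for the nearest input that flips a classifier's
# decision by minimizing (f(x') - target)^2 + lam * ||x' - x||_1.
import numpy as np
from scipy.optimize import minimize
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=4, random_state=0)
clf = LogisticRegression().fit(X, y)

x = X[0]                                          # instance to explain
target = 1.0 - clf.predict(x.reshape(1, -1))[0]   # probability of the opposite class
lam = 0.1                                         # distance penalty weight

def objective(x_prime):
    prob = clf.predict_proba(x_prime.reshape(1, -1))[0, 1]
    return (prob - target) ** 2 + lam * np.abs(x_prime - x).sum()

result = minimize(objective, x0=x, method="Nelder-Mead")
print("original class:      ", clf.predict(x.reshape(1, -1))[0])
print("counterfactual class:", clf.predict(result.x.reshape(1, -1))[0])
print("feature changes:     ", result.x - x)
```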
Probing classifiers (also called diagnostic classifiers) test whether specific information is encoded in a model's internal representations. The method works by training a simple classifier (usually a linear model) to predict a property of interest (such as part-of-speech, syntactic structure, or factual knowledge) from the hidden states of a neural network.
If a linear probe can accurately predict part-of-speech tags from a particular layer's representations, this suggests that syntactic information is linearly accessible at that layer. Probing has been widely used to study what linguistic information BERT and other language models encode at different layers [11].
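A sketch of a linear probe; here hidden_states and pos_tags are random stand-ins for real model activations and labels, so accuracy will sit near chance, but with real activations a high held-out score suggests the property is linearly accessible:

```python
# Minimal sketch: train a linear probe on hidden representations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(2000, 768))  # stand-in for one layer's token vectors
pos_tags = rng.integers(0, 5, size=2000)      # stand-in part-of-speech labels

X_train, X_test, y_train, y_test = train_test_split(
    hidden_states, pos_tags, random_state=0)

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out probe accuracy:", probe.score(X_test, y_test))
```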
The main criticism of probing is the "probe complexity" concern: a sufficiently powerful probe might learn the task itself from the data, rather than extracting information that was already present in the representations. Using simple linear probes partially addresses this, but the concern remains.
A long-standing assumption in machine learning holds that more complex models achieve higher predictive accuracy but are harder to interpret. Under this view, practitioners face a tradeoff: choose a transparent model like linear regression and sacrifice some performance, or choose a complex model like a deep neural network and sacrifice interpretability.
There is evidence supporting this tradeoff in many settings. Deep neural networks and large ensemble methods generally outperform simple models on tasks with complex, non-linear patterns in the data. However, recent research has challenged the idea that the tradeoff is inevitable.
Explainable Boosting Machines and other modern GAMs have demonstrated accuracy competitive with black-box models on many tabular datasets while maintaining full interpretability [3]. Rudin (2019) argued in an influential paper that for high-stakes decisions, the machine learning community should invest more effort in developing inherently interpretable models rather than explaining black boxes after the fact [12]. Her central claim is that the accuracy gap between interpretable and black-box models is often much smaller than assumed, and that post-hoc explanations can be misleading because they approximate, rather than reveal, the model's actual reasoning.
The tradeoff remains real for certain problem types, particularly those involving unstructured data (images, text, audio) where deep learning's advantage is substantial. For tabular data, however, the gap has narrowed considerably.
Interpretability is not merely an academic concern; it has direct practical consequences in domains where regulations, professional standards, or public trust demand transparency.
Clinical decision support systems must provide explanations that physicians can evaluate and override. Radiologists use Grad-CAM heat maps overlaid on medical images to verify that an AI system is focusing on clinically relevant regions rather than imaging artifacts. In predictive risk models for conditions like sepsis or readmission, SHAP values help clinicians understand which patient factors drive a high-risk score, enabling more informed conversations with patients about care plans.
Credit scoring and fraud detection are two areas where interpretability is legally mandated in many jurisdictions. In the United States, the Equal Credit Opportunity Act and Regulation B require that lenders provide specific adverse action reasons when denying credit applications. Financial institutions use SHAP-based explanations to generate these reason codes automatically. Fraud detection systems at large payment processors use explainability layers to help human analysts triage alerts, distinguishing genuine fraud from false positives.
Risk assessment tools used in criminal sentencing and parole decisions have drawn significant scrutiny. The COMPAS recidivism prediction tool, for example, was the subject of a ProPublica investigation in 2016 that found racial disparities in its risk scores. Interpretability methods are essential for auditing such systems and for defendants who seek to understand and challenge the basis of algorithmic assessments.
The EU AI Act (Regulation 2024/1689) is the first comprehensive legal framework for AI. Article 13 requires that high-risk AI systems be designed with sufficient transparency for deployers to interpret outputs and use them appropriately. Article 86 establishes a right to explanation for individuals subject to decisions made by high-risk AI systems that produce legal effects or significantly affect them [2]. The GDPR's Articles 13-15 and 22 similarly address automated decision-making, though the precise scope of the "right to explanation" under GDPR remains debated among legal scholars.
Mechanistic interpretability is a subfield that aims to reverse-engineer the internal computations of neural networks at the level of individual neurons, features, and circuits. Rather than treating the model as a black box and explaining its input-output behavior, mechanistic interpretability opens the black box and tries to understand the algorithms the model has learned.
Features: A feature is a direction in a model's representation space that corresponds to a human-understandable concept. For example, in a vision model, a feature might correspond to "curved edges" or "the color red." In a language model, a feature might correspond to "text written in French" or "the concept of deception." Features are the building blocks that mechanistic interpretability researchers try to identify [13].
Circuits: A circuit is a subgraph of the neural network that implements a specific computation. For example, a circuit for indirect object identification in a language model might consist of attention heads that copy information about the indirect object to the output position. Circuits describe how features interact to produce behavior [14].
Superposition and polysemanticity: One of the central challenges in mechanistic interpretability is that individual neurons in neural networks are typically polysemantic, meaning they respond to multiple unrelated concepts. This happens because neural networks represent more features than they have neurons, a phenomenon called superposition. A single neuron might activate for both "academic citations" and "the color blue" because these concepts rarely co-occur, so the network can reuse the same neuron for both [13].
To address polysemanticity, researchers have developed sparse autoencoders (SAEs) as a tool for decomposing neural network activations into interpretable features. An SAE is trained to reconstruct a model's internal activations using a much larger set of hidden units with a sparsity constraint. The idea is that if you expand the representation into a higher-dimensional space and require that only a few units are active at any time, each unit is more likely to correspond to a single interpretable concept.
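A minimal PyTorch sketch of the idea (dimensions, names, and the plain L1 penalty are illustrative, not any lab's actual implementation):

```python
# Minimal sketch: a sparse autoencoder that expands d_model activations
# into a larger, sparsely active feature basis and reconstructs them.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=512, d_hidden=8192):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, acts):
        features = torch.relu(self.encoder(acts))  # sparse feature activations
        recon = self.decoder(features)             # reconstructed activations
        return recon, features

sae = SparseAutoencoder()
acts = torch.randn(64, 512)  # stand-in for a batch of residual stream activations

recon, features = sae(acts)
l1_coeff = 1e-3  # sparsity strength (illustrative)
loss = ((recon - acts) ** 2).mean() + l1_coeff * features.abs().sum(dim=-1).mean()
loss.backward()
```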
Anthropic published foundational work on this approach in 2023-2024. Its 2023 study trained sparse autoencoders on a small one-layer transformer and found that human evaluators judged a large share of the extracted features to be cleanly interpretable, mapping to specific concepts like "Arabic script" or "DNA sequences." Follow-up work in 2024 scaled dictionary learning to Claude 3 Sonnet, training on billions of residual stream activations and extracting millions of features, including abstract concepts such as "expressions of sycophancy" [15].
OpenAI published parallel work applying sparse autoencoders to GPT-4 in 2024, independently confirming that the approach scales to frontier models [16].
In March 2025, Anthropic introduced circuit tracing, a method that combines several earlier techniques into a more comprehensive approach [17]. Circuit tracing replaces a model's MLP layers with cross-layer transcoders, a variant of sparse autoencoders that read from one layer's residual stream and can provide output to all subsequent MLP layers. This produces an "interpretable replacement model" where the building blocks are sparse, human-readable features rather than polysemantic neurons.
The method generates attribution graphs that trace the chain of intermediate steps a model uses to transform a specific input into an output. Researchers used attribution graphs on Claude 3.5 Haiku to study various behaviors:
| Behavior studied | Finding |
|---|---|
| Multi-language processing | The model uses a shared conceptual space where reasoning happens before being translated into a specific language |
| Poem generation | The model plans both forward and backward, identifying rhyming words before crafting lines to reach them |
| Sycophancy | Specific features fire in response to user disagreement that push the model toward changing its answer to agree with the user |
| Known-entity recognition | The model routes through entity-specific features that encode factual knowledge about named entities |
Anthropic released the circuit tracing tools as open-source software, including a Python library compatible with any open-weights model and a frontend on Neuronpedia for exploring attribution graphs visually [18].
Mechanistic interpretability has progressed rapidly. Active research directions as of 2025-2026 include scaling sparse autoencoders and transcoders to frontier models, automating the discovery and validation of circuits, and establishing that extracted features causally drive behavior rather than merely correlating with it.
Despite significant progress, interpretability research faces several fundamental challenges.
A central concern is whether a given explanation accurately reflects the model's actual reasoning process. An explanation is "faithful" if it truly describes why the model made a particular prediction, as opposed to merely providing a plausible-sounding story. LIME and SHAP approximations can disagree with each other for the same prediction, raising questions about which (if either) faithfully represents the model. Gradient-based saliency maps can be manipulated to produce arbitrary outputs while leaving the model's predictions unchanged, a finding that undermines trust in these explanations [9]. Attention-based explanations face the same faithfulness concern: plausible attention patterns do not guarantee causal importance.
There is no universally accepted standard for evaluating the quality of explanations. Human evaluation studies are expensive and subjective. Automated metrics (e.g., measuring prediction changes when highlighted features are removed) test specific aspects of faithfulness but do not capture the full picture. The literature on evaluation approaches remains relatively scarce, with no uniform, well-established protocols for either qualitative or quantitative assessment.
Many interpretability methods were developed for relatively small models and datasets. Applying them to models with billions of parameters and complex, multimodal inputs introduces computational and conceptual challenges. Exact Shapley values are intractable for high-dimensional inputs. Mechanistic interpretability techniques require significant engineering effort to apply to each new model architecture.
Core concepts in mechanistic interpretability, such as "feature," still lack rigorous formal definitions. Computational complexity results demonstrate that many interpretability queries are intractable in the worst case. This theoretical murkiness makes it difficult to make strong claims about what interpretability tools have actually revealed about model behavior.
Interpretability ultimately involves a human who must understand and act on explanations. Research in human-computer interaction has shown that people can be misled by explanations, placing too much trust in confident-sounding but inaccurate descriptions. The design of explanation interfaces, the cognitive biases of users, and the context in which explanations are presented all affect whether interpretability tools actually achieve their intended purpose.
Several open-source tools support interpretability research and practice.
| Tool | Purpose | Model support | URL |
|---|---|---|---|
| SHAP | Shapley value explanations for any model | Model-agnostic; optimized for trees and deep models | https://github.com/shap/shap |
| LIME | Local model-agnostic explanations | Model-agnostic | https://github.com/marcotcr/lime |
| Captum | Attribution methods for PyTorch models (integrated gradients, saliency maps, SmoothGrad, and more) | PyTorch models | https://captum.ai |
| InterpretML | Unified framework for interpretable models (EBMs) and black-box explanations | Model-agnostic and glassbox | https://github.com/interpretml/interpret |
| TransformerLens | Mechanistic interpretability for transformers | Transformer architectures | https://github.com/TransformerLensOrg/TransformerLens |
| Anthropic circuit tracing | Circuit-level analysis of language models | Compatible with open-weights models | https://github.com/anthropics/circuit-tracing |
| Neuronpedia | Visual exploration of attribution graphs | Language models | https://www.neuronpedia.org |
| ELI5 | Debug and explain ML classifiers | Scikit-learn, XGBoost, and others | https://github.com/eli5-org/eli5 |
| Alibi | Counterfactual explanations and other methods | Model-agnostic | https://github.com/SeldonIO/alibi |
Imagine you have a friend who is really good at guessing things. You show them a picture and they say, "That's a cat!" You ask, "How did you know?" If your friend can point to the ears and whiskers and explain their thinking, that is interpretability. You can follow their reasoning step by step.
Now imagine a different friend who just says, "I know it's a cat, but I can't really explain why." That is like a black-box model. It might be right, but you cannot check its work.
Interpretability in machine learning is all about making sure we can check the computer's work. When a computer program makes a decision (like whether to approve a loan, or whether an email is spam), we want to be able to ask "why?" and get a real answer. Some programs are built to be easy to understand from the start, like a simple set of rules. Others are really complicated, so scientists have invented special tools (like LIME and SHAP) to peek inside and figure out what the program is paying attention to. This matters because if the program is making mistakes or being unfair, we need to be able to spot those problems and fix them.