Explainable AI (XAI) refers to artificial intelligence systems and techniques designed so that humans can understand how and why the system reaches its decisions, predictions, or recommendations. Unlike opaque "black box" models that produce outputs without revealing their internal reasoning, explainable AI aims to make the decision-making process transparent, interpretable, and accountable. The field has grown from a niche research interest into a central pillar of AI safety, regulatory compliance, and responsible AI, driven by the rapid adoption of deep learning models in high-stakes domains like healthcare, criminal justice, and finance.
XAI sits at the intersection of machine learning, human-computer interaction, cognitive science, law, and philosophy. It encompasses everything from simple coefficient inspection in a logistic regression to mechanistic dissection of attention heads in a frontier large language model. Some researchers treat the field as a unified discipline; others argue that the techniques used for tabular gradient-boosted trees and the techniques used for billion-parameter transformers have little in common.
Explainable AI encompasses methods, tools, and design principles that allow humans to comprehend the behavior of AI systems. This includes understanding which input features drove a particular prediction, how a model generalizes from training data, and whether the model's reasoning aligns with domain knowledge and ethical norms.
The term "explainable AI" is sometimes used interchangeably with "interpretable AI," though researchers increasingly draw a distinction. Interpretability refers to the degree to which a human can consistently predict a model's outputs given its inputs. A decision tree with a handful of rules is inherently interpretable. Explainability, by contrast, refers to the ability to provide post-hoc reasons for a model's behavior, often through auxiliary techniques applied after training. A complex neural network is not inherently interpretable, but a method like LIME or SHAP can offer a simplified account of a given prediction [1]. The scope of XAI extends beyond individual predictions to cover global model behavior, fairness auditing, debugging, and scientific discovery.
Explainability is one piece of a broader set of governance properties for AI systems. The US National Institute of Standards and Technology (NIST), in its 2023 AI Risk Management Framework, treats transparency, explainability, and interpretability as distinct but mutually supportive characteristics of trustworthy AI [25]. Transparency answers "what happened?" (visibility of inputs, outputs, training data, and design choices). Explainability answers "how was the decision made?" (a representation of underlying mechanisms). Interpretability answers "why was this decision made, and what does it mean?" (the meaning of an output in functional context). Accountability adds a fourth dimension: who is responsible when the system fails, and what recourse is available?
A related distinction comes from Cynthia Rudin and other researchers who reserve "interpretable" for models whose structure is directly readable by humans and "explainable" for post-hoc approximations of opaque models [14]. Under that convention, a sparse decision tree is interpretable; a deep network with a SHAP overlay is explainable. The distinction matters because the two approaches have different failure modes.
Several forces have made explainability an urgent priority.
Trust and adoption. Practitioners and end users are reluctant to rely on AI systems they cannot understand. A physician needs to know why a model flagged a particular condition; a loan officer needs to explain the reasoning to the applicant. Without explanations, humans tend to either over-trust AI systems (accepting flawed outputs uncritically) or under-trust them (ignoring useful recommendations out of suspicion).
Debugging and model improvement. Explanations help machine learning engineers identify bugs, spurious correlations, and data quality problems. If a model is classifying wolves correctly but the explanation reveals it is relying on snow in the background rather than the animal's features, the developer can correct the training data or model architecture. This husky-vs-wolf example from the LIME paper [5] became a canonical illustration of why explanation matters in practice.
Regulatory compliance. The European Union's AI Act and the General Data Protection Regulation (GDPR) both contain provisions related to the transparency and explainability of automated decision-making systems. Organizations deploying AI in the EU must demonstrate that affected individuals can receive meaningful explanations of decisions that significantly impact them [2] [3].
Fairness and accountability. Explainability is a precondition for auditing AI systems for bias. If a hiring algorithm systematically disadvantages candidates from certain demographic groups, explanation techniques can reveal whether protected attributes (or proxies) are driving decisions. This information is essential for remediation and for legal compliance with anti-discrimination laws.
Scientific understanding. Explanations of model behavior can yield genuine scientific insights. AlphaFold's predictions of protein structure, for instance, prompted a wave of work asking what evolutionary signal the model had picked up that allowed it to outperform decades of hand-engineered methods.
Safety in capable systems. As AI systems take on more agentic roles, the cost of unexplained failures grows. Mechanistic interpretability work at Anthropic, DeepMind, and OpenAI is increasingly framed as a safety priority: being able to look inside a model's computations is a prerequisite for catching deceptive or misaligned behavior before deployment.
The question of how to explain AI decisions is as old as AI itself. Early expert systems of the 1970s and 1980s, such as MYCIN (a medical diagnosis system developed at Stanford), included built-in explanation capabilities. Because these systems operated on explicit if-then rules, they could trace their reasoning and present it to users. MYCIN could show the chain of rules that led to an antibiotic recommendation, what evidence supported each rule, and how confident the system was. William Clancey's studies at Stanford in the early 1980s found that even rule-based explanations were not always satisfying to physicians, because the rules captured surface patterns rather than the deeper causal models clinicians used; this early disappointment foreshadowed a recurring theme in modern XAI. As neural networks and statistical models began to outperform expert systems in the late 1990s and 2000s, the field traded interpretability for predictive accuracy, setting up the tension that gave rise to modern XAI research.
A pivotal moment came in 2015, when Rich Caruana and colleagues at Microsoft Research published Intelligible Models for HealthCare: Predicting Pneumonia Risk and Hospital 30-day Readmission at KDD [26]. The paper used generalized additive models with pairwise interactions (GA2Ms) to build risk prediction models that matched the accuracy of less interpretable approaches, then used the interpretable structure to audit the model's reasoning. The audit revealed that the model had learned a counterintuitive pattern: patients with asthma had lower predicted mortality risk from pneumonia than patients without it. The pattern was real in the training data, but only because asthmatic patients were routinely admitted to intensive care faster, which lowered their observed mortality. A black-box model with the same accuracy would have shipped this artifact straight into a clinical decision support tool. The intelligible model surfaced the problem before deployment, and Caruana's paper became one of the most cited arguments for inherently interpretable models in high-stakes domains.
The deep learning revolution, catalyzed by AlexNet's breakthrough performance on the ImageNet challenge in 2012, ushered in an era of models with millions (and later billions) of parameters. These models achieved remarkable accuracy on tasks from image recognition to natural language processing, but their internal workings were opaque. Researchers began to refer to them as "black boxes," and concerns about their opacity grew alongside their capabilities.
By the mid-2010s, several research groups were developing methods to peer inside these black boxes. Simonyan, Vedaldi, and Zisserman proposed saliency maps for vision models in 2013 [7]. Bach, Binder, Montavon, Klauschen, Müller, and Samek introduced Layer-wise Relevance Propagation (LRP) in 2015 [27]. The publication of two seminal papers, LIME in 2016 and SHAP in 2017, marked a clear turning point: both produced model-agnostic local explanations and both came with reference implementations that practitioners could use immediately.
The most significant institutional catalyst for the modern field was the Defense Advanced Research Projects Agency (DARPA) Explainable Artificial Intelligence program. DARPA began formulating the program in 2015 and released its call for proposals in August 2016. Development began in May 2017 with eleven research teams selected to develop explainable learning systems and one team focused on psychological models of explanation [4]. The program's stated goal was to "create a suite of new or modified machine learning techniques that produce explainable models that, when combined with effective explanation techniques, enable end users to understand, appropriately trust, and effectively manage the emerging generation of AI systems" [4].
The four-year program concluded in 2021. In a retrospective published that year in Applied AI Letters, program manager David Gunning and colleagues noted that simply showing explanations did not always improve user trust calibration; better explanations sometimes helped, but the relationship was less direct than program designers had hoped. The program left behind a toolkit of methods, several open-source releases, and the term "XAI" itself, which has since become standard usage [4].
Since the DARPA program, the field has expanded rapidly. Thousands of papers are published annually on explainability methods. Major technology companies have released open-source tools for model interpretation. Regulatory frameworks in the EU, US, and elsewhere have codified requirements for transparency and explanation. And the emergence of large language models (LLMs) has introduced entirely new challenges and opportunities for explainability, including the burgeoning field of mechanistic interpretability.
XAI methods can be classified along several dimensions. The taxonomy below is the one used in Christoph Molnar's widely cited textbook Interpretable Machine Learning and broadly in the academic literature.
| Dimension | Option A | Option B |
|---|---|---|
| Source of interpretability | Intrinsic (model is interpretable by construction) | Post-hoc (interpretation applied after training) |
| Scope | Local (explains a single prediction) | Global (explains overall model behavior) |
| Coupling to model | Model-specific (uses model internals) | Model-agnostic (treats model as black box) |
| Output format | Static (one-time explanation) | Interactive (user can probe inputs and explore counterfactuals) |
| Causal grounding | Correlational (feature attribution) | Counterfactual (what would change the outcome) |
Intrinsic interpretability refers to models that are inherently understandable by virtue of their structure. Linear regression, logistic regression, decision trees, rule lists, and generalized additive models all fall into this category. Their parameters or decision rules can be directly inspected by humans.
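A minimal sketch of what direct inspection looks like, using scikit-learn and a toy loan-approval dataset with invented feature values:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Toy loan-approval data (invented values): debt-to-income ratio, credit score,
# years of employment; y = 1 means the loan was approved.
X = np.array([[0.35, 720, 5], [0.85, 580, 1], [0.40, 690, 8], [0.90, 600, 2]])
y = np.array([1, 0, 1, 0])
feature_names = ["debt_to_income", "credit_score", "employment_years"]

# Standardizing puts coefficients on a comparable scale (log-odds per standard deviation).
X_scaled = StandardScaler().fit_transform(X)
model = LogisticRegression().fit(X_scaled, y)

# The model's entire "reasoning" is these three numbers: no post-hoc method is needed.
for name, coef in zip(feature_names, model.coef_[0]):
    print(f"{name}: {coef:+.3f}")
```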
Post-hoc methods are techniques applied after a model has been trained to generate explanations of its behavior. These are necessary for complex models like deep neural networks, random forests, and gradient-boosted machines, whose internal representations are not directly human-readable. LIME, SHAP, saliency maps, and attention visualization are all post-hoc methods.
Local explanations describe why a model made a specific prediction for a single input instance. For example, "this loan application was denied because the applicant's debt-to-income ratio was 0.85, which the model weighted heavily."
Global explanations describe the model's overall behavior across all inputs. For example, "across the entire dataset, the three most influential features for loan approval are credit score, debt-to-income ratio, and employment length." Global explanations help stakeholders understand the model's general strategy, while local explanations address individual decisions.
Model-specific methods are designed for a particular type of model. Attention visualization, for instance, applies specifically to transformer-based models. Gradient-based saliency maps require differentiable models. These methods can exploit the model's internal structure to produce more detailed explanations.
Model-agnostic methods treat the model as a black box and work with any type of model. They operate by perturbing inputs and observing changes in outputs. LIME and SHAP (in its model-agnostic variant) are the best-known examples. The advantage of model-agnostic methods is their generality; the trade-off is that they cannot leverage model-specific structure and may produce less faithful explanations.
Most classical XAI tools produce a static artifact such as a feature importance bar chart, saliency heatmap, or counterfactual example. Interactive tools like Google's What-If Tool [28] and Microsoft's InterpretML dashboard let users vary inputs, ask hypotheticals, and watch predictions update in real time. Empirical studies generally suggest that interactive exploration supports calibrated trust better than static reports do.
Before turning to post-hoc methods, it is worth describing the model families that aim for accuracy without sacrificing interpretability.
| Model family | Key idea | Trade-off |
|---|---|---|
| Linear / logistic regression | Output is a weighted sum of features | Limited capacity for nonlinear interactions |
| Decision trees | Hierarchical if-then rules over feature thresholds | Variance is high; small data changes can reshape the tree |
| Decision sets and rule lists | A small ordered set of human-readable rules | Manual tuning; limited capacity on complex data |
| Generalized additive models (GAMs) | Sum of one-dimensional functions of each feature | Cannot natively capture interactions without GA2M extensions |
| Explainable Boosting Machine (EBM) | GA2M trained with cyclic gradient boosting | Slower to train than vanilla GBM but often comparable in accuracy |
| Risk score models (e.g., RiskSLIM) | Integer-coefficient linear models for paper-and-pencil use | Restricted expressivity |
The Explainable Boosting Machine deserves particular attention. EBMs use cyclic gradient boosting to fit shape functions for each feature, plus optional pairwise interaction terms. Microsoft's InterpretML team has shown that on many tabular benchmarks EBMs match XGBoost and LightGBM in accuracy while remaining fully decomposable into per-feature shape plots [22]. For tabular tasks subject to fairness audits or regulatory review, EBMs are increasingly the default choice.
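A hedged sketch of the InterpretML workflow (assuming a tabular training set `X_train`, `y_train` and a held-out `X_test`, `y_test`; the calls shown are the interpret package's glassbox interface):

```python
from interpret.glassbox import ExplainableBoostingClassifier
from interpret import show

# Fit an EBM: cyclic gradient boosting over per-feature shape functions,
# with a limited number of pairwise interaction terms discovered automatically.
ebm = ExplainableBoostingClassifier(interactions=10)
ebm.fit(X_train, y_train)

# Global explanation: one shape plot per feature (and per interaction),
# showing exactly how each value range contributes to the prediction.
show(ebm.explain_global())

# Local explanation: the additive contributions behind individual predictions.
show(ebm.explain_local(X_test[:5], y_test[:5]))
```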
The table below summarizes the most widely used XAI methods. It is not exhaustive, but it covers the techniques that appear most often in production pipelines and in the academic literature.
| Method | Year | Authors | Type | Scope | Description |
|---|---|---|---|---|---|
| Decision trees | 1980s+ | Quinlan, Breiman et al. | Intrinsic | Local and global | Hierarchical splits on feature thresholds |
| Linear models / logistic regression | classical | various | Intrinsic | Global | Coefficients give per-feature contribution |
| GAM / GA2M / EBM | 2015+ | Caruana, Lou et al. | Intrinsic | Local and global | Additive shape functions per feature |
| Permutation feature importance | 2001 / 2018 | Breiman / Fisher, Rudin, Dominici | Model-agnostic, post-hoc | Global | Score drop when a feature is randomly shuffled |
| Partial dependence plot (PDP) | 2001 | Friedman | Model-agnostic, post-hoc | Global | Average prediction as a function of one or two features |
| Individual conditional expectation (ICE) | 2015 | Goldstein et al. | Model-agnostic, post-hoc | Local | Per-instance trajectory of prediction across a feature |
| Saliency map | 2013 | Simonyan, Vedaldi, Zisserman | Model-specific (gradient-based), post-hoc | Local | Gradient of output with respect to input |
| LRP | 2015 | Bach, Binder, Montavon et al. | Model-specific, post-hoc | Local | Backward propagation of relevance through layers |
| DeepLIFT | 2017 | Shrikumar, Greenside, Kundaje | Model-specific, post-hoc | Local | Difference-from-reference attribution through the network |
| LIME | 2016 | Ribeiro, Singh, Guestrin | Model-agnostic, post-hoc | Local | Local linear surrogate fit to perturbed inputs |
| SHAP | 2017 | Lundberg, Lee | Model-agnostic and model-specific, post-hoc | Local and global | Shapley value attribution to features |
| Integrated Gradients | 2017 | Sundararajan, Taly, Yan | Model-specific, post-hoc | Local | Path integral of gradients from baseline to input |
| Grad-CAM | 2017 | Selvaraju et al. | Model-specific (CNNs), post-hoc | Local | Gradient-weighted activation map of last conv layer |
| Occlusion sensitivity | 2014 | Zeiler, Fergus | Model-specific, post-hoc | Local | Mask patches of input and measure prediction change |
| Anchors | 2018 | Ribeiro, Singh, Guestrin | Model-agnostic, post-hoc | Local | High-precision IF-THEN rule with coverage guarantee |
| Counterfactual explanations | 2017 | Wachter, Mittelstadt, Russell | Model-agnostic, post-hoc | Local | Smallest input change that flips the prediction |
| TCAV | 2018 | Kim, Wattenberg, Gilmer et al. | Model-specific, post-hoc | Global | Sensitivity to user-defined high-level concepts |
| Influence functions | 2017 | Koh, Liang | Model-specific, post-hoc | Global | Identify training points that most influenced a prediction |
Local Interpretable Model-agnostic Explanations (LIME) was introduced by Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin in their 2016 paper Why Should I Trust You? Explaining the Predictions of Any Classifier [5]. For any individual prediction, LIME generates a set of perturbed versions of the input, obtains the model's predictions for each, and fits a simple interpretable model (usually a linear regression or sparse linear model) to approximate the complex model's behavior in the local neighborhood of that input. For text classifiers, it randomly removes words; for image classifiers, it segments into superpixels and tests which are critical. The original paper's husky-versus-wolf example, where a classifier appeared accurate but was actually keying on snowy backgrounds, became a defining demonstration of why local explanations matter for debugging.
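A minimal sketch of the tabular workflow with the lime package (the training array `X_train`, the `feature_names` list, the class names, and the fitted `model` are all assumptions for illustration):

```python
from lime.lime_tabular import LimeTabularExplainer

# X_train: numpy array of training rows; model: any fitted classifier
# exposing predict_proba (e.g. a random forest or gradient-boosted trees).
explainer = LimeTabularExplainer(
    X_train,
    feature_names=feature_names,
    class_names=["denied", "approved"],
    mode="classification",
)

# Explain one prediction: LIME perturbs the instance, queries the model on the
# perturbations, and fits a sparse linear surrogate weighted by proximity.
explanation = explainer.explain_instance(
    X_test[0], model.predict_proba, num_features=5
)
print(explanation.as_list())  # [(feature condition, local weight), ...]
```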
LIME's strengths are generality and intuitive output. Its limitations include sensitivity to the perturbation strategy, instability (different runs can produce different explanations for the same input), and the fact that a local linear approximation may not faithfully represent a highly nonlinear model [1]. A 2020 follow-up by Slack et al. showed that adversaries can construct biased models that produce innocuous LIME explanations, raising concerns about LIME as an audit tool.
SHapley Additive exPlanations (SHAP) was proposed by Scott Lundberg and Su-In Lee in their 2017 paper A Unified Approach to Interpreting Model Predictions [6]. SHAP draws on Shapley values from cooperative game theory, which distribute the "payout" of a prediction among the input features. Shapley values are the unique solution satisfying three properties: local accuracy (explanation values sum to the model's output), missingness (absent features receive zero attribution), and consistency (if a feature's contribution increases in a revised model, its attribution should not decrease). Lundberg and Lee showed that LIME, DeepLIFT, and classic Shapley regression values are all special cases of a single framework they called additive feature attribution methods [6].
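The underlying quantity is the classical Shapley value. Writing $F$ for the set of all features and $f(S)$ for the model's output when only the features in $S$ are present (the rest marginalized out), feature $i$ receives its marginal contribution averaged over every subset that excludes it:

$$\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(|F|-|S|-1)!}{|F|!}\,\bigl[f(S \cup \{i\}) - f(S)\bigr]$$

This is the standard game-theoretic formula rather than any one SHAP estimator; the variants described below differ in how they approximate it.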
SHAP has model-agnostic variants (KernelSHAP) and model-specific variants. TreeSHAP, the variant for decision tree ensembles like XGBoost, LightGBM, and CatBoost, computes exact Shapley values in polynomial time and has become the default explanation method for gradient-boosted tree models in production. DeepSHAP applies the framework to neural networks via DeepLIFT-style backpropagation, and LinearSHAP closes the loop for linear models. Unlike LIME, SHAP provides both local and global explanations and carries stronger theoretical guarantees. Exact Shapley computation is exponential in the number of features, so approximations are necessary for high-dimensional inputs, and KernelSHAP can be slow for large datasets [1] [6].
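A hedged sketch of a typical TreeSHAP workflow with the shap package (the data splits, the use of XGBoost, and the DataFrame `X_test` are illustrative assumptions):

```python
import shap
import xgboost as xgb

# model: a fitted tree ensemble (XGBoost, LightGBM, CatBoost, or sklearn trees).
model = xgb.XGBClassifier().fit(X_train, y_train)

# TreeSHAP computes exact Shapley values in polynomial time for tree models.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Local view: per-feature contributions pushing one prediction away from the baseline.
shap.force_plot(explainer.expected_value, shap_values[0], X_test.iloc[0])

# Global view: aggregate feature importance and effect direction across the dataset.
shap.summary_plot(shap_values, X_test)
```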
Integrated Gradients, introduced by Sundararajan, Taly, and Yan in 2017 [8], attribute a prediction to its input features by integrating the gradient of the output along a straight line from a baseline (often a vector of zeros) to the actual input. The integral satisfies two axioms: sensitivity (a feature whose value differs between baseline and input and affects the output receives non-zero attribution) and implementation invariance (functionally equivalent networks produce the same attributions). Integrated Gradients is implemented in PyTorch's Captum and is widely used for vision and NLP because it is well-defined for any differentiable network and avoids the noise problems of plain saliency.
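A hedged sketch using Captum (the trained `model`, the `input_tensor` with a leading batch dimension, and the `predicted_class` index are assumptions):

```python
import torch
from captum.attr import IntegratedGradients

# model: any differentiable PyTorch module; input_tensor: e.g. shape (1, 3, 224, 224).
ig = IntegratedGradients(model)
baseline = torch.zeros_like(input_tensor)  # common choice: all-zeros baseline

# Approximate the path integral of gradients from baseline to input with
# n_steps interpolation points; attributions have the same shape as the input.
attributions, delta = ig.attribute(
    input_tensor,
    baselines=baseline,
    target=predicted_class,
    n_steps=50,
    return_convergence_delta=True,
)
```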
Deep Learning Important FeaTures (DeepLIFT) was introduced by Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje at ICML 2017 [29]. Like LRP and integrated gradients, it propagates importance backward through the network, but it does so by comparing each neuron's activation to a reference activation and assigning contribution scores according to the difference. Unlike vanilla gradients, which can saturate in flat regions of the activation function, DeepLIFT can attribute importance to features that the gradient overlooks. It also separates positive and negative contributions, revealing canceling effects that other methods miss. DeepLIFT was a major influence on DeepSHAP.
Saliency maps, introduced by Simonyan, Vedaldi, and Zisserman in 2013, were among the earliest post-hoc explanation techniques for deep neural networks [7]. The basic idea is to compute the gradient of the model's output with respect to each input feature; features with large gradients are those that, if slightly changed, would most affect the prediction.
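In a modern framework the basic computation is just a backward pass to the input; a minimal PyTorch sketch (the trained `model` and the `image` tensor with a batch dimension are assumptions):

```python
import torch

# image: input tensor of shape (1, C, H, W); model: a trained image classifier.
image = image.clone().requires_grad_(True)

scores = model(image)                       # forward pass
top_class = scores.argmax(dim=1).item()
scores[0, top_class].backward()             # gradient of the top logit w.r.t. the input

# The saliency map is the magnitude of the input gradient: the pixels whose
# small changes would most move the predicted score.
saliency = image.grad.abs().max(dim=1).values   # collapse the channel dimension
```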
Plain saliency maps tend to be noisy, so several refinements exist. SmoothGrad averages gradients over multiple noisy copies of the input. Integrated Gradients accumulates gradients along a path. Grad-CAM, introduced by Selvaraju et al. in 2017, produces coarse localization maps by weighting the activations of the final convolutional layer by the gradients flowing into it, yielding visually interpretable heatmaps that have become the de facto explanation method for CNN-based image classifiers in medical imaging, autonomous driving, and content moderation. Adebayo et al. (2018) introduced "sanity checks" for saliency methods and found that several popular techniques produced visually similar maps even when the model's parameters were randomized, a finding that should temper trust in any single saliency-style attribution.
LRP, introduced by Bach, Binder, Montavon, Klauschen, Müller, and Samek in 2015 [27], reverses the forward pass of a neural network and redistributes the output relevance score backward to the input features, layer by layer. A conservation principle keeps total relevance constant per layer: the relevance of a neuron is split among the inputs that contributed to its activation, in proportion to their weighted contributions. Unlike pure gradient methods, LRP takes both weights and activations into account, often producing cleaner heatmaps for image classifiers. It has been extended to recurrent networks, transformers, and even support vector machines, and is widely used in medical imaging including MRI-based Alzheimer's classification.
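The conservation principle can be written concretely as the basic LRP-0 rule: the relevance $R_k$ of a neuron in one layer is redistributed to the neurons $j$ of the previous layer in proportion to their weighted activations $a_j w_{jk}$ (practical variants add a small stabilizing term to the denominator):

$$R_j = \sum_k \frac{a_j\, w_{jk}}{\sum_{j'} a_{j'}\, w_{j'k}}\, R_k$$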
In transformer-based models, attention weights indicate how much each token attends to every other token when computing its representation. Visualizing these weights can provide intuitive explanations: if a sentiment classifier assigns a positive label and the attention is concentrated on the word "excellent," this suggests the word drove the prediction. However, the relationship between attention weights and model decisions is contested. Jain and Wallace (2019) showed that attention weights frequently do not correlate with other measures of feature importance and that alternative attention distributions can yield the same predictions [9]. Wiegreffe and Pinter (2019) responded that attention is not explanation by itself but can be a useful component of explanation when combined with other evidence [10]. The debate continues, and researchers now generally advise against treating raw attention weights as faithful explanations without further validation.
Testing with Concept Activation Vectors (TCAV), proposed by Been Kim, Martin Wattenberg, Justin Gilmer, and colleagues at Google in 2018, takes a fundamentally different approach [11]. Rather than attributing predictions to low-level input features, TCAV tests the model's sensitivity to high-level, human-defined concepts. The method first collects examples of a concept (such as images containing "stripes") and trains a linear classifier in the model's activation space to distinguish concept examples from random examples. The resulting Concept Activation Vector (CAV) represents the direction in activation space that corresponds to the concept, and TCAV uses directional derivatives to measure how sensitive the model's predictions are to changes along it. TCAV's advantage is that it provides explanations in terms that domain experts naturally use. It has been integrated into PyTorch's Captum library and extended in subsequent work, including a 2025 framework called Global Concept Activation Vectors (GCAV) that unifies CAVs across layers using contrastive learning [12].
Anchors, proposed by Ribeiro, Singh, and Guestrin in 2018 [30], are high-precision IF-THEN rules that explain individual predictions while explicitly stating their scope of applicability. Where LIME approximates the model with a local linear surrogate, Anchors finds a small subset of the input's features such that the model's prediction is almost certain to remain the same for any other input sharing those features. A user study reported in the original paper found that Anchors helped non-experts predict model behavior on unseen inputs more accurately than LIME did. The trade-off is computational: finding a high-precision anchor requires repeated sampling, which is expensive for high-dimensional models.
Counterfactual explanations answer the question: "What is the smallest change to the input that would change the model's decision?" A counterfactual for a denied loan might say: "Your application would have been approved if your annual income were $5,000 higher." This approach, formalized by Sandra Wachter, Brent Mittelstadt, and Chris Russell in their 2017 paper Counterfactual Explanations Without Opening the Black Box [13], maps directly onto how humans reason about decisions. The same paper argued that counterfactual explanations could satisfy the GDPR's transparency obligations without requiring data controllers to reveal model internals, influencing subsequent guidance from the Article 29 Working Party. Later work has added feasibility (the suggested change should be something the user can act on), diversity (multiple distinct counterfactuals), and causal validity constraints. The DiCE library from Microsoft Research and Alibi Explain from Seldon both implement counterfactual algorithms suitable for production use.
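Wachter and colleagues frame the search as an optimization: find the candidate input $x'$ closest to the original $x$ under a distance $d$ (they suggest an L1 norm weighted by each feature's median absolute deviation) whose prediction reaches the desired outcome $y'$. Paraphrasing their objective:

$$\arg\min_{x'} \max_{\lambda} \; \lambda\,\bigl(f(x') - y'\bigr)^2 + d(x, x')$$

The first term pushes the prediction toward the target outcome; the second keeps the counterfactual close to the original input.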
For classical machine learning, two simple model-agnostic global techniques remain workhorses. Permutation feature importance, proposed by Breiman in 2001 and generalized by Fisher, Rudin, and Dominici in 2018, measures the drop in model performance when a feature's values are randomly shuffled. It is fast and works for any model with a scoring metric, but can be unreliable when features are highly correlated. Partial dependence plots (PDPs), introduced by Friedman in 2001, show the average model prediction as a function of one or two features, marginalized over the rest. Individual Conditional Expectation (ICE) plots, from Goldstein et al. in 2015, plot one curve per instance to expose heterogeneity that PDPs hide.
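Both are available in scikit-learn's inspection module; a minimal sketch assuming a fitted `model`, a DataFrame `X_test` containing an illustrative `debt_to_income` column, labels `y_test`, and a `feature_names` list:

```python
from sklearn.inspection import permutation_importance, PartialDependenceDisplay

# Permutation importance: shuffle each feature n_repeats times and record
# the drop in the model's score on held-out data.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for name, mean, std in zip(feature_names, result.importances_mean, result.importances_std):
    print(f"{name}: {mean:.4f} +/- {std:.4f}")

# Partial dependence and ICE for one feature: kind="both" overlays the
# average curve (PDP) on the per-instance curves (ICE) that it can hide.
PartialDependenceDisplay.from_estimator(
    model, X_test, features=["debt_to_income"], kind="both"
)
```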
Koh and Liang's 2017 Understanding Black-box Predictions via Influence Functions introduced a different style of explanation: instead of asking which features mattered, ask which training points mattered. By approximating how a model's predictions would change if a particular training example were upweighted or removed, influence functions surface the most influential training points for a given prediction. This makes them useful for debugging label noise, detecting memorization, and explaining surprising outputs.
A longstanding assumption in machine learning is that there is a fundamental tradeoff between model accuracy and interpretability. Simple models like linear regression and small decision trees are easy to understand but may lack capacity for complex patterns; deep neural networks and large ensembles fit highly nonlinear relationships but are difficult or impossible to interpret directly.
This assumption has been challenged. Cynthia Rudin, in her influential 2019 Nature Machine Intelligence paper Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead [14], argued that for many practical problems, inherently interpretable models can achieve accuracy comparable to black-box models given sufficient effort in feature engineering and model design. She contended that the apparent superiority of black-box models often reflects insufficient effort spent on interpretable alternatives, and that post-hoc explanations introduce additional sources of error. Rudin went further, arguing that high-stakes decisions in criminal justice, healthcare, and credit should be made only with interpretable models, and that regulators should not accept post-hoc explanations as a substitute. Empirical work has supported the argument: the COMPAS recidivism model has been matched by simple two-feature linear models, Caruana's pneumonia work [26] showed the same in healthcare, and the Explainable Boosting Machine has matched gradient-boosted trees across dozens of tabular benchmarks [22]. Proponents of post-hoc explanation respond that for unstructured data like images, audio, and natural language, complex models genuinely outperform interpretable ones by wide margins, and post-hoc methods are the only practical option. A common pragmatic position is to use interpretable models when they are competitive and reserve post-hoc methods for cases where deep learning is genuinely required.
The most fundamental challenge in XAI is ensuring that explanations are faithful to the model's actual reasoning process, not just plausible-sounding stories. An explanation is faithful if it accurately reflects the factors that caused the model's prediction; it is unfaithful if it highlights features that seem reasonable to humans but are not actually what drove the model's decision. Post-hoc methods are particularly susceptible to this problem. LIME's local linear approximation may miss important nonlinear interactions. SHAP's approximations in KernelSHAP and DeepSHAP can introduce errors. Attention weights may not reflect the true causal structure of a transformer's computation [9]. Gradient-based methods can produce misleading attributions when the decision landscape is complex.
A related problem is disagreement among explanation methods. Krishna and colleagues at Harvard ran a 2022 study in which they applied multiple popular XAI methods to the same models and inputs and measured agreement. Different methods routinely produced different rankings for the same prediction, with no principled way to decide which to trust. Practitioners interviewed for the study described running several explainers, picking the one whose output they liked best, and presenting that to stakeholders. Common faithfulness metrics include deletion tests (remove important features and check whether the prediction changes), sufficiency tests (add only important features and check whether the prediction is reproduced), and stability tests (consistency across similar inputs), but no universal standard has emerged.
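A deletion test is straightforward to implement. The sketch below is schematic rather than a standard benchmark implementation: it assumes a scikit-learn-style model with `predict_proba`, a per-feature attribution vector from any explainer, and a baseline vector (for example, training-set means) used to "remove" features:

```python
import numpy as np

def deletion_curve(model, x, attributions, baseline):
    """Mask features in order of attributed importance and track the prediction.

    x: 1-D instance; attributions: per-feature scores from any explainer;
    baseline: per-feature replacement values (e.g. training means).
    """
    order = np.argsort(-np.abs(attributions))              # most important first
    pred_class = int(np.argmax(model.predict_proba(x.reshape(1, -1))[0]))
    x_masked = x.astype(float).copy()
    curve = [model.predict_proba(x_masked.reshape(1, -1))[0][pred_class]]
    for i in order:
        x_masked[i] = baseline[i]                          # "remove" feature i
        curve.append(model.predict_proba(x_masked.reshape(1, -1))[0][pred_class])
    # A faithful explanation should make this curve drop quickly; comparing its
    # area against a random feature ordering gives a simple faithfulness score.
    return np.array(curve)
```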
Explainability requirements appear in a growing patchwork of regulations across jurisdictions. The table below summarizes the most influential frameworks as of 2026.
| Regulation | Jurisdiction | Year | Status | Explainability requirement |
|---|---|---|---|---|
| Equal Credit Opportunity Act / Reg B | US | 1974 | In force | Specific reasons required for adverse credit action |
| Fair Credit Reporting Act | US | 1970 | In force | Disclosure of factors in adverse credit decisions |
| GDPR Articles 13, 14, 15, 22 | EU | 2018 | In force | Meaningful information about logic of automated decisions; right not to be subject to solely automated decisions |
| NYC Local Law 144 (AEDT) | New York City | 2023 | In force (enforcement July 2023) | Annual independent bias audit of automated employment decision tools, public posting of results, candidate notification |
| NIST AI Risk Management Framework | US | 2023 | Voluntary | "Explainable and interpretable" listed as one of seven trustworthy AI characteristics |
| White House Executive Order 14110 (rescinded 2025) | US | 2023 | Rescinded by EO 14179 in January 2025 | Originally required NIST guidance and red-teaming reports for frontier models |
| EU AI Act | EU | 2024 | In force since August 2024; phased in through August 2026; GPAI rules August 2025 | Article 13 transparency to deployers; Article 86 right to explanation for high-risk decisions |
| Colorado AI Act (SB 24-205) | Colorado | 2024 | Enacted May 2024; effective June 30, 2026 | Documentation, impact assessments, consumer notice for high-risk AI systems |
| California AI Transparency Act (SB-942) | California | 2024 | Enacted September 2024; effective January 1, 2026 | Watermarking and disclosure for generative AI providers with > 1M monthly users |
| Texas Responsible AI Governance Act | Texas | 2025 | Phased | Transparency obligations for high-risk AI deployers |
| China generative AI services rules | China | 2023 | In force | Content labeling, training data transparency, security review for public-facing generative AI |
The EU AI Act, which entered into force on August 1, 2024, is the world's first comprehensive legal framework for regulating artificial intelligence. Article 13 requires providers of high-risk AI systems to design and develop them so that their operation is sufficiently transparent for deployers to interpret outputs and use them appropriately, and to supply detailed instructions for use covering intended purpose, capabilities, limitations, accuracy, robustness, cybersecurity, computational requirements, and logging mechanisms [31]. Article 14 requires high-risk systems to be designed for effective human oversight, including technical measures that facilitate interpretation of outputs by deployers. Article 86 then creates a right to explanation of individual decision-making: if a decision is made about a person using a high-risk AI system that significantly affects health, safety, or fundamental rights, the person has the right to obtain "clear and meaningful explanations of the role of the AI system in the decision-making procedure and the main elements of the decision taken" [2].
The AI Act is being phased in: prohibited practices in February 2025, general-purpose AI rules in August 2025, and high-risk system obligations in August 2026. Organizations deploying AI in the EU are investing in explainability tools, technical documentation, and post-market monitoring to meet these obligations.
The General Data Protection Regulation (GDPR), in effect since May 2018, addresses automated decision-making in Article 22. Individuals have the right not to be subject to a decision based solely on automated processing, including profiling, which produces legal effects or similarly significantly affects them. When such automated decision-making is permitted (for example, with the individual's explicit consent), the data controller must provide "meaningful information about the logic involved" [3].
The precise scope of the GDPR's "right to explanation" has been debated extensively by legal scholars. Wachter, Mittelstadt, and Floridi argued in a 2017 paper, Why a Right to Explanation of Automated Decision-Making Does Not Exist in the General Data Protection Regulation, that Articles 13, 14, 15, and 22 together require only generic information about how an automated system works, not a specific explanation of any particular decision. Critics including Selbst and Powles responded that a more expansive reading is consistent with the spirit of the regulation and with subsequent guidance from the Article 29 Working Party. The debate has not been definitively settled by a court [3].
Wachter and colleagues' counterfactual explanations paper [13], discussed above, was in part a response to this legal ambiguity: counterfactuals were proposed as a form of explanation that could be required without forcing data controllers to expose model internals.
The US National Institute of Standards and Technology released the AI Risk Management Framework (AI RMF 1.0) in January 2023, identifying seven characteristics of trustworthy AI: valid and reliable, safe, secure and resilient, accountable and transparent, explainable and interpretable, privacy-enhanced, and fair with harmful bias managed [25]. Although voluntary, the framework is used by federal agencies and many large enterprises as an AI governance baseline. NIST published a Generative AI Profile (NIST AI 600-1) in July 2024 to address the specific risks of generative systems.
New York City's Local Law 144, enacted in 2021 and enforced from July 2023, was the first US law specifically targeting AI in employment. It requires employers using automated employment decision tools (AEDTs) to commission an independent annual bias audit, publish results, and notify candidates that an AEDT is being used. Penalties start at $500 per violation and rise to $1,500 per day for ongoing violations.
Colorado SB 24-205, signed by Governor Jared Polis in May 2024 and effective June 30, 2026, was the first US state law to impose comprehensive obligations on developers and deployers of high-risk AI systems. It requires developers to supply "model cards, dataset cards, or other impact assessments" to deployers and requires deployers to give consumers notice of high-risk AI use, offer a right to correct inaccurate personal data, and explain adverse decisions when feasible. Texas and Utah have followed with their own statutes, creating a fragmented compliance landscape.
California SB-942, enacted in September 2024 and effective January 1, 2026, focuses on generative AI. Providers with more than one million monthly users must offer a free AI detection tool, allow users to attach manifest disclosures to generated content, and embed latent disclosures (a form of watermarking) identifying the provider and system. It is the first US statute to require watermarking of generative AI output.
A 2023 White House executive order (EO 14110) directed federal agencies to develop AI safety guidance and required frontier model developers to share red-teaming results, but it was rescinded by EO 14179 in January 2025, shifting the federal posture toward voluntary commitments. China's 2023 regulations on generative AI services include content labeling, training data transparency, and security review obligations. The UK has so far favored a sector-by-sector approach without a single AI act.
Explainability is critical in healthcare, where AI-assisted decisions affect patient outcomes. Diagnostic models that identify cancers, predict patient deterioration, or recommend treatments must provide explanations that clinicians can evaluate against their own expertise. The US Food and Drug Administration has signaled that explainability will factor into approval of AI-based medical devices, especially for adaptive systems whose behavior can change after deployment. LIME and SHAP are widely used for tabular clinical data, while saliency maps and Grad-CAM apply to medical imaging models, and LRP has been used for MRI-based Alzheimer's classification. The Caruana pneumonia case study [26] remains the standard example of why intelligible models matter in clinical decision support.
Financial institutions use AI for credit scoring, fraud detection, algorithmic trading, and anti-money laundering. The Equal Credit Opportunity Act (Reg B) and Fair Credit Reporting Act in the United States have required lenders to provide specific reasons for adverse credit decisions since the 1970s, and those rules apply to automated systems just as to human underwriters. SHAP, especially TreeSHAP, is the standard tool in financial model explanation, and the US Federal Reserve's SR 11-7 guidance on model risk management implicitly requires interpretability for any model used in capital-relevant decisions. Insurers face similar demands and commonly rely on SHAP and on interpretable generalized additive models for actuarial work subject to regulatory review.
NYC Local Law 144 has reshaped how vendors of resume-screening and interview-scoring tools document their models. Most major HR technology vendors now publish annual bias audits and offer customers explanation reports for each candidate evaluation. Illinois passed an AI Video Interview Act, and the EU AI Act lists employment as a high-risk category that triggers full transparency obligations.
In criminal justice, the COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) system was the subject of a 2016 ProPublica investigation that found significant racial disparities in its risk scores [16]. Without understanding how the system arrived at predictions, it was impossible for defendants, judges, or the public to assess fairness. Subsequent academic work showed that COMPAS's accuracy could be matched by a simple two-feature linear model, supporting Rudin's argument for inherently interpretable models in high-stakes settings [14].
When an autonomous vehicle is involved in an accident, investigators need to understand why the vehicle's AI made the decisions it did. Saliency maps and attention visualization are commonly used to explain perception models, and Waymo, Cruise, and other operators publish post-incident reports with sensor data and reconstructed model behavior. The EU's Digital Services Act imposes a related obligation on very large online platforms to explain the main parameters of their recommender systems and offer users a non-personalized alternative.
The emergence of large language models has introduced new challenges for explainability. Traditional XAI methods like LIME and SHAP were designed for models with well-defined input features such as tabular columns and image pixels. Applying them to models with billions of parameters that process token sequences requires fundamentally new approaches.
Mechanistic interpretability is the term that has stuck for the research program that tries to reverse-engineer the algorithms a neural network has learned, working at the level of individual neurons, attention heads, and circuits. The field traces its lineage to Chris Olah's Distill essays from 2017 to 2020, particularly Zoom In: An Introduction to Circuits, which argued that neural networks could be understood mechanistically if researchers were willing to put in the same effort as biologists studying cells.
A key obstacle is polysemanticity: individual neurons often respond to multiple unrelated concepts. A single neuron might activate for images of cars, the color red, and certain textures, making it impossible to assign a clean semantic interpretation to it. This phenomenon, sometimes called superposition, means the model encodes more concepts than it has neurons. Anthropic's Toy Models of Superposition paper in 2022 gave the phenomenon a clean theoretical treatment.
Anthropic's 2023 paper Towards Monosemanticity applied sparse autoencoders (SAEs) to decompose the activations of a small language model into interpretable features, each corresponding to a recognizable concept like a programming language, a type of punctuation, or a safety-relevant behavior [17]. In 2024, Anthropic scaled the approach in Scaling Monosemanticity, applying SAEs to Claude 3 Sonnet and finding abstract, multilingual, multimodal features for cities, famous people, programming concepts, and dangerous topics. They demonstrated that these features could be artificially activated to steer the model's behavior, confirming causal relevance [18]. OpenAI published its own work in June 2024, training a 16-million-feature k-sparse autoencoder on GPT-4 activations [32], and DeepMind followed with the open-source release of Gemma Scope, a suite of SAEs trained on every layer of the Gemma 2 model family.
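The core architecture is small; a schematic PyTorch sketch of a sparse autoencoder (the dimensions and the L1 sparsity penalty are illustrative choices, not Anthropic's or OpenAI's exact training recipe):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Decompose d_model-dimensional activations into n_features sparse features."""

    def __init__(self, d_model=512, n_features=16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, activations):
        features = torch.relu(self.encoder(activations))   # sparse, non-negative codes
        reconstruction = self.decoder(features)
        return reconstruction, features

sae = SparseAutoencoder()
# acts: residual-stream or MLP activations collected from the model, shape (batch, d_model);
# random values stand in here for real cached activations.
acts = torch.randn(1024, 512)
recon, feats = sae(acts)

# Training objective: reconstruct the activation while keeping the feature vector
# sparse, so each learned feature tends to fire for one recognizable concept.
l1_coeff = 1e-3
loss = ((recon - acts) ** 2).mean() + l1_coeff * feats.abs().mean()
loss.backward()
```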
A complementary line of work uses causal interventions on a model's internal activations to isolate the role of specific components. Activation patching, popularized by the Indirect Object Identification (IOI) study from Wang et al. in 2022, replaces the activation at a particular layer or attention head with the activation from a different input and measures how the output changes, identifying the causal role of individual components rather than just their correlational ones. Linear probes train a simple linear classifier on a model's hidden activations to test whether the model represents a particular property like part-of-speech, sentiment, or geographic location. The Tuned Lens, introduced by Belrose et al. in 2023, refines the older Logit Lens technique by learning small affine transformations that make per-layer predictions more accurate, giving a cleaner picture of how a transformer's predictions evolve from layer to layer.
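A linear probe takes only a few lines once activations are cached; a hedged sketch using TransformerLens to collect residual-stream activations and scikit-learn for the probe (the probing task, layer choice, texts, and labels are all illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

# Hypothetical probing task: is a short text about sports (0) or politics (1)?
texts = [
    "The striker scored twice in the final",
    "The goalkeeper saved a late penalty",
    "The senate passed the budget bill",
    "The minister announced a new tax policy",
]
labels = np.array([0, 0, 1, 1])

features = []
for text in texts:
    tokens = model.to_tokens(text)
    _, cache = model.run_with_cache(tokens)
    # Residual stream after layer 6, at the final token position.
    features.append(cache["resid_post", 6][0, -1].detach().cpu().numpy())

# A probe that separates the classes shows the property is linearly decodable
# from that layer's activations -- evidence of representation, not of use.
probe = LogisticRegression(max_iter=1000).fit(np.array(features), labels)
print("probe accuracy:", probe.score(np.array(features), labels))
```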
In March 2025, Anthropic introduced circuit tracing, a unified framework that replaces a model's multi-layer perceptrons (MLPs) with cross-layer transcoders (CLTs), a new type of sparse autoencoder that reads from one layer's residual stream but can contribute output to all subsequent MLP layers [19]. This produces an interpretable "replacement model" whose building blocks are sparse, human-readable features rather than polysemantic neurons. The output is an attribution graph: a directed graph whose nodes represent features, token embeddings, and output logits, and whose edges represent the causal interactions between them. Anthropic applied this technique to Claude 3.5 Haiku and published detailed case studies showing how the model processes multi-step reasoning tasks and how it decides to refuse harmful requests, then open-sourced the tools in 2025, including a Python library compatible with any open-weights model and a visual frontend hosted on Neuronpedia [20].
When large language models produce step-by-step reasoning under chain-of-thought prompting, the reasoning text looks like an explanation. The question is whether it actually is one. Anthropic addressed this directly in two studies, Measuring Faithfulness in Chain-of-Thought Reasoning (2023) and Reasoning Models Don't Always Say What They Think (2025) [33] [34]. In the first, the team intervened on the chain-of-thought by truncating it, paraphrasing it, or inserting mistakes, then measured how the final answer changed. They found that smaller models tended to rely on their stated reasoning, while larger and more capable models often arrived at the same answer regardless of what their chain-of-thought said. The 2025 follow-up examined reasoning-tuned models including Claude 3.7 Sonnet and DeepSeek-R1, and found that when the team planted hints suggesting an incorrect answer, the models often used the hints internally but failed to mention them in their visible chain-of-thought. The conclusion was sobering: chain-of-thought is sometimes faithful, sometimes not, and the more capable the model, the less guarantee there is. This matters because chain-of-thought monitoring has been proposed as a safety mechanism for catching deceptive behavior, and an unfaithful chain-of-thought offers no such guarantee.
Despite rapid progress, mechanistic interpretability for LLMs faces significant challenges. Core concepts like "feature" lack rigorous mathematical definitions. Computational complexity results suggest that many interpretability queries are theoretically intractable. Cross-layer transcoders match the underlying model's outputs in only about 50% of cases, meaning the replacement model is an approximation rather than a perfect substitute. And practical methods still underperform simple baselines on some safety-relevant evaluation tasks [21].
A landmark collaborative paper published in January 2025 by 29 researchers across 18 organizations established the field's consensus open problems. MIT Technology Review named mechanistic interpretability one of its 10 Breakthrough Technologies for 2026, reflecting both the field's promise and the growing expectation that it will deliver practical safety tools [21].
A mature ecosystem of open-source XAI tooling now supports research, deployment, and audit work. The table below covers the most widely used libraries.
| Tool | Developer | Language | Focus |
|---|---|---|---|
| SHAP | Lundberg et al. | Python | Shapley value-based explanations for any model; includes TreeSHAP, DeepSHAP, KernelSHAP, LinearSHAP |
| LIME | Ribeiro et al. | Python | Local linear surrogate explanations for tabular, text, and image models |
| Captum | Meta (Facebook) | Python / PyTorch | Model interpretability for PyTorch; includes integrated gradients, TCAV, saliency, occlusion |
| InterpretML | Microsoft | Python | Glassbox EBM models plus blackbox explanation methods (LIME, SHAP, PDP) |
| AI Explainability 360 (AIX360) | IBM (now LF AI & Data) | Python | Diverse algorithm catalog including ProtoDash, Disentangled Inferred Prior VAE, Contrastive Explanations Method, LIME, SHAP |
| What-If Tool (WIT) | Google PAIR | Python / TensorBoard | Code-free interactive probing of trained models, including counterfactual analysis and fairness metrics |
| Language Interpretability Tool (LIT) | Google PAIR | Python | Interactive analysis of NLP and seq2seq models |
| Alibi Explain | Seldon | Python | Counterfactuals (CFProto, CEM), Anchors, Integrated Gradients, ALE plots |
| ELI5 | Open source | Python | Lightweight library for explaining scikit-learn, XGBoost, and other model predictions |
| tf-explain | Open source | Python / TensorFlow | Grad-CAM, occlusion sensitivity, vanilla saliency for TensorFlow / Keras models |
| TransformerLens | Neel Nanda et al. | Python / PyTorch | Mechanistic interpretability toolkit for transformer models; activation patching, probing, circuit analysis |
| SAELens | Open-source community | Python / PyTorch | Training and analyzing sparse autoencoders for mechanistic interpretability |
| Neuronpedia | Open-source community | Web | Interactive frontend for exploring SAE features and Anthropic's circuit traces |
| DiCE | Microsoft Research | Python | Diverse counterfactual explanations, with feasibility constraints |
| iNNvestigate | Open source | Python | Implementations of LRP, DeepLIFT, integrated gradients, and other deep learning attribution methods |
SHAP is the most widely used XAI library. It provides fast, exact implementations for tree-based models via TreeSHAP, approximate methods for deep learning models via DeepSHAP, and a general model-agnostic approach via KernelSHAP. Its visualization tools (summary, dependence, force, waterfall plots) are standard in data science practice [6]. Captum, from Meta, is the primary interpretability library for the PyTorch ecosystem and implements over 20 attribution algorithms including integrated gradients, LIME, SHAP, and TCAV [12]. InterpretML, from Microsoft Research, ships glassbox models (Explainable Boosting Machines) alongside blackbox methods like LIME, SHAP, and PDPs [22]. AI Explainability 360, originally developed by IBM Research and now hosted as a Linux Foundation AI & Data project, ships a diverse algorithm catalog including ProtoDash, the Contrastive Explanations Method, LIME, and SHAP [35]. Google's What-If Tool, from the People + AI Research initiative, provides a no-code visual interface for probing model behavior, generating counterfactuals, and comparing fairness metrics across slices [28]. TransformerLens supports activation caching, hook-based interventions, probing, and circuit discovery on GPT-2, GPT-Neo, and other open transformer models [23], with SAELens complementing it for sparse autoencoder work.
The legal status of an automated decision "right to explanation" remains contested. Wachter, Mittelstadt, and Floridi's 2017 analysis [3] argued that the GDPR provides only a right to general information about logic, not to specific explanations of individual decisions. Even if a more expansive reading is correct, what counts as a satisfactory explanation is undefined. Should it be a feature attribution? A counterfactual? A natural-language summary? Different audiences (regulators, affected individuals, internal auditors) need different things, and most explanations optimized for one audience are inadequate for the others.
Post-hoc explanation methods frequently disagree with each other and with the underlying model's actual computation. The lack of an accepted faithfulness metric means there is no principled way to choose between competing explanations of the same prediction, and the community lacks standardized evaluation benchmarks. Different papers use different criteria: is an explanation good because it is faithful, because humans find it useful, or because it leads to better decisions? These criteria can conflict, since a faithful explanation that exposes a confusing statistical pattern may be less useful than a simplified one. The XAI-Bench benchmark and related initiatives aim to provide common ground, but consensus remains elusive. This is the field's deepest open problem.
Rudin's argument that the accuracy-interpretability tradeoff is mostly a myth in tabular settings has been broadly supported by empirical work, but it has not been universally accepted. The picture for unstructured data (images, audio, language) is different: interpretable models built without deep learning typically lose substantial accuracy. A productive synthesis is to use inherently interpretable architectures wherever they are competitive and to invest in mechanistic interpretability where they are not.
Explanations can be biased even when the underlying model is not. The 2020 work by Slack and colleagues on adversarial attacks against LIME and SHAP showed that an attacker can construct a model that behaves discriminatorily in deployment but produces innocuous explanations when queried by audit tools. This "fairwashing" attack undermines the use of explanations for auditing and accountability and highlights the need for explanation methods that are robust to adversarial manipulation [24].
Explanations are only valuable if humans can use them correctly. Feature importance scores can be misunderstood, saliency maps can create false confidence, and some studies find that exposure to model explanations actually decreases the quality of human decisions by encouraging over-reliance. Scalability is a parallel concern: exact Shapley computation is exponential in the number of features, mechanistic interpretability requires significant compute, and for frontier models with hundreds of billions of parameters, providing comprehensive explanations remains beyond current capabilities. Sparse autoencoders are a major step forward but still recover only a fraction of model behavior and require substantial compute to train.
| Researcher / lab | Affiliation | Contribution |
|---|---|---|
| David Gunning | DARPA | Founding manager of the DARPA XAI program |
| Cynthia Rudin | Duke University | Inherently interpretable models for high-stakes settings; risk scores and rule lists |
| Christoph Molnar | Independent | Author of Interpretable Machine Learning textbook |
| Marco Túlio Ribeiro | Microsoft Research | LIME and Anchors |
| Scott Lundberg | Microsoft Research | SHAP and TreeSHAP |
| Rich Caruana | Microsoft Research | Intelligible models in healthcare; Explainable Boosting Machine |
| Been Kim | Google DeepMind | TCAV and concept-based interpretability |
| Sandra Wachter, Brent Mittelstadt | Oxford Internet Institute | Counterfactual explanations and GDPR right-to-explanation analysis |
| Chris Olah | Anthropic | Distill Circuits essays; mechanistic interpretability program |
| Neel Nanda | Google DeepMind | TransformerLens; mechanistic interpretability community |
| Wojciech Samek | Fraunhofer HHI | Layer-wise Relevance Propagation |
| Mukund Sundararajan | Google | Integrated Gradients |
| Anthropic, OpenAI, DeepMind interpretability teams | Frontier labs | Sparse autoencoders, circuit tracing, Gemma Scope, faithfulness studies |
As of early 2026, the field is characterized by several notable trends.
Regulatory pressure is increasing. The EU AI Act's transparency obligations for general-purpose AI models took effect in August 2025, and the full enforcement regime for high-risk AI systems arrives in August 2026. State-level AI laws in Colorado, California, and Texas have created additional transparency requirements, though federal US legislation remains absent and the 2023 federal AI executive order was rescinded in January 2025 [2] [15].
Mechanistic interpretability is maturing. What began as a niche academic pursuit has grown into a recognized subdiscipline with dedicated teams at Anthropic, OpenAI, Google DeepMind, and multiple universities. The open-sourcing of Anthropic's circuit tracer, DeepMind's Gemma Scope, and community libraries like TransformerLens and SAELens has accelerated research. MIT Technology Review named it one of the 10 Breakthrough Technologies for 2026 [21].
SHAP and LIME remain dominant for traditional ML. For tabular data, tree-based models, and classical pipelines, SHAP (particularly TreeSHAP) and LIME remain the most widely used methods, sustained by their integration into standard data science workflows [1].
LLM explainability is an active frontier. Beyond mechanistic interpretability, researchers are exploring chain-of-thought analysis, probing classifiers, representation engineering, activation steering, and "LLM-as-judge" approaches in which one model explains or critiques another. Constitutional AI and other principle-based safety techniques are sometimes framed as a form of explainable safety, where the rules a model is meant to follow are explicit and inspectable even when its internal computation is not. Whether these methods can scale to provide meaningful safety guarantees for frontier models remains open [21].
Industry adoption is uneven. Large financial institutions, healthcare organizations, and regulated industries have made significant investments in XAI tooling. Many smaller organizations still deploy models without systematic explainability practices, leaving a wide gap between research capabilities and production reality.
The faithfulness problem persists. The community still lacks reliable methods to verify that explanations truly reflect model reasoning. This poses a fundamental challenge to the regulatory project of requiring explanations, since unfaithful explanations may provide false assurance, and it remains the field's central open problem.