Explainable AI (XAI) refers to artificial intelligence systems and techniques designed so that humans can understand how and why the system reaches its decisions, predictions, or recommendations. Unlike opaque "black box" models that produce outputs without revealing their internal reasoning, explainable AI aims to make the decision-making process transparent, interpretable, and accountable. The field has grown from a niche research interest into a central pillar of AI safety, regulatory compliance, and responsible deployment, driven by the rapid adoption of deep learning models in high-stakes domains like healthcare, criminal justice, and finance.
At its core, explainable AI encompasses methods, tools, and design principles that allow humans to comprehend the behavior of AI systems. This includes understanding which input features drove a particular prediction, how a model generalizes from training data, and whether the model's reasoning aligns with domain knowledge and ethical norms.
The term "explainable AI" is sometimes used interchangeably with "interpretable AI," though researchers increasingly draw a distinction between the two. Interpretability refers to the degree to which a human can consistently predict a model's outputs given its inputs. A decision tree with a handful of rules, for example, is inherently interpretable because a person can trace the logic from root to leaf. Explainability, by contrast, refers to the ability to provide post-hoc reasons for a model's behavior, often through auxiliary techniques applied after training. A complex neural network is not inherently interpretable, but an explanation method like LIME or SHAP can offer a simplified account of why it made a given prediction [1].
The scope of XAI extends beyond individual predictions. It also covers global model behavior (how does the model behave across all inputs?), fairness auditing (does the model discriminate against protected groups?), debugging (why did the model fail on this input?), and scientific discovery (what patterns has the model found in the data?).
Several forces have made explainability an urgent priority.
Trust and adoption. Practitioners and end users are reluctant to rely on AI systems they cannot understand. A physician presented with a diagnostic recommendation needs to know why the model flagged a particular condition. A loan officer reviewing an automated credit decision needs to explain the reasoning to the applicant. Without explanations, humans tend to either over-trust AI systems (accepting flawed outputs uncritically) or under-trust them (ignoring useful recommendations out of suspicion).
Debugging and model improvement. Explanations help machine learning engineers identify bugs, spurious correlations, and data quality problems. If a model is classifying images of wolves correctly but the explanation reveals it is relying on the presence of snow in the background rather than the animal's features, the developer can correct the training data or model architecture.
Regulatory compliance. The European Union's AI Act and the General Data Protection Regulation (GDPR) both contain provisions related to the transparency and explainability of automated decision-making systems. As these regulations come into force, organizations deploying AI in the EU must demonstrate that affected individuals can receive meaningful explanations of decisions that significantly impact them [2] [3].
Fairness and accountability. Explainability is a precondition for auditing AI systems for bias. If a hiring algorithm systematically disadvantages candidates from certain demographic groups, explanation techniques can reveal whether protected attributes (or proxies for them) are driving the model's decisions. This information is essential for remediation and for legal compliance with anti-discrimination laws.
Scientific understanding. In research contexts, explanations of model behavior can yield genuine scientific insights. When a model trained on molecular data identifies novel drug candidates, understanding its reasoning can help chemists develop new hypotheses about molecular interactions.
The question of how to explain AI decisions is as old as AI itself. Early expert systems of the 1970s and 1980s, such as MYCIN (a medical diagnosis system), included built-in explanation capabilities. Because these systems operated on explicit if-then rules written by human experts, they could trace their reasoning and present it to users in a straightforward manner. When MYCIN recommended an antibiotic, it could show the chain of rules that led to its conclusion.
However, the explanatory capacity of these systems was limited by the rigidity of their rule bases. As machine learning methods, particularly neural networks and statistical models, began to outperform expert systems in the late 1990s and 2000s, the field traded interpretability for predictive accuracy. This created a growing tension that would eventually give rise to modern XAI research.
The deep learning revolution, catalyzed by AlexNet's breakthrough performance on the ImageNet challenge in 2012, ushered in an era of models with millions (and later billions) of parameters. These models achieved remarkable accuracy on tasks from image recognition to natural language processing, but their internal workings were opaque. Researchers began to refer to them as "black boxes," and concerns about their opacity grew alongside their capabilities.
By the mid-2010s, several research groups were developing methods to peer inside these black boxes. The publication of two seminal papers, LIME in 2016 and SHAP in 2017, marked a turning point.
The most significant institutional catalyst for explainable AI research was the Defense Advanced Research Projects Agency (DARPA) Explainable Artificial Intelligence (XAI) program. DARPA began formulating the program in 2015 and released its call for proposals in August 2016. The program launched in May 2017, with eleven research teams selected to develop explainable learning systems and one team focused on psychological models of explanation [4].
The program's stated goal was to "create a suite of new or modified machine learning techniques that produce explainable models that, when combined with effective explanation techniques, enable end users to understand, appropriately trust, and effectively manage the emerging generation of AI systems" [4]. It addressed two challenge areas: classification of events in heterogeneous multimedia data and construction of decision policies for autonomous systems performing simulated missions.
The four-year program concluded in 2021. In a retrospective published that year, program manager David Gunning and colleagues assessed the results, noting that while significant progress had been made, the fundamental challenge of explaining complex models remained open. The program helped popularize the term "XAI" and spurred widespread adoption of explanation techniques across industry and academia [4].
Since the DARPA program, the field has expanded rapidly. Thousands of papers are published annually on explainability methods. Major technology companies have released open-source tools for model interpretation. Regulatory frameworks in the EU, US, and elsewhere have codified requirements for transparency and explanation. And the emergence of large language models (LLMs) has introduced entirely new challenges and opportunities for explainability, including the burgeoning field of mechanistic interpretability.
XAI methods can be classified along several dimensions.
Intrinsic interpretability refers to models that are inherently understandable by virtue of their structure. Linear regression, logistic regression, decision trees, and rule lists all fall into this category. Their parameters or decision rules can be directly inspected by humans.
Post-hoc methods are techniques applied after a model has been trained to generate explanations of its behavior. These are necessary for complex models like deep neural networks, random forests, and gradient-boosted machines, whose internal representations are not directly human-readable. LIME, SHAP, saliency maps, and attention visualization are all post-hoc methods.
Local explanations describe why a model made a specific prediction for a single input instance. For example, "this loan application was denied because the applicant's debt-to-income ratio was 0.85, which the model weighted heavily."
Global explanations describe the model's overall behavior across all inputs. For example, "across the entire dataset, the three most influential features for loan approval are credit score, debt-to-income ratio, and employment length." Global explanations help stakeholders understand the model's general strategy, while local explanations address individual decisions.
Model-specific methods are designed for a particular type of model. Attention visualization, for instance, applies specifically to transformer-based models. Gradient-based saliency maps require differentiable models. These methods can exploit the model's internal structure to produce more detailed explanations.
Model-agnostic methods treat the model as a black box and work with any type of model. They operate by perturbing inputs and observing changes in outputs. LIME and SHAP (in its model-agnostic variant) are the best-known examples. The advantage of model-agnostic methods is their generality; the trade-off is that they cannot leverage model-specific structure and may produce less faithful explanations.
The following table summarizes the most widely used XAI methods.
| Method | Year | Authors | Type | Scope | Description |
|---|---|---|---|---|---|
| LIME | 2016 | Ribeiro, Singh, Guestrin | Model-agnostic, post-hoc | Local | Fits a local interpretable surrogate model (typically linear) around a prediction by perturbing the input |
| SHAP | 2017 | Lundberg, Lee | Model-agnostic (with model-specific variants), post-hoc | Local and global | Uses Shapley values from cooperative game theory to assign each feature an importance score |
| Saliency maps | 2013 | Simonyan, Vedaldi, Zisserman | Model-specific (gradient-based), post-hoc | Local | Computes the gradient of the output with respect to the input to highlight influential pixels or features |
| Attention visualization | 2015+ | Various | Model-specific (transformers), post-hoc | Local | Visualizes attention weights to show which input tokens or patches the model attends to |
| TCAV | 2018 | Kim, Wattenberg, Gilmer et al. | Model-specific, post-hoc | Global | Tests how sensitive a model's predictions are to user-defined high-level concepts (e.g., "stripes" for zebra classification) |
| Counterfactual explanations | 2018+ | Wachter, Mittelstadt, Russell et al. | Model-agnostic, post-hoc | Local | Identifies the minimal change to the input that would change the model's prediction |
| Feature importance (permutation) | 2001/2018 | Breiman / Fisher, Rudin, Dominici | Model-agnostic, post-hoc | Global | Measures the decrease in model performance when a feature's values are randomly shuffled |
| Integrated Gradients | 2017 | Sundararajan, Taly, Yan | Model-specific (gradient-based), post-hoc | Local | Attributes predictions by integrating gradients along a path from a baseline input to the actual input |
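One entry in the table, permutation feature importance, is simple enough to sketch end to end. The following toy example uses an invented threshold "model" and synthetic data purely for illustration: shuffling an informative feature's column degrades accuracy, while shuffling a noise feature does not.

```python
import random

# Toy dataset: label depends only on the first feature.
random.seed(0)
X = [[random.random(), random.random()] for _ in range(500)]
y = [1 if row[0] > 0.5 else 0 for row in X]

# Stand-in "model": thresholds the first feature (what a trained model might learn).
def model(row):
    return 1 if row[0] > 0.5 else 0

def accuracy(X, y):
    return sum(model(r) == t for r, t in zip(X, y)) / len(y)

def permutation_importance(X, y, feature, trials=10):
    """Average accuracy drop when `feature`'s column is randomly shuffled."""
    base = accuracy(X, y)
    drops = []
    for _ in range(trials):
        col = [row[feature] for row in X]
        random.shuffle(col)
        Xp = [row[:feature] + [v] + row[feature + 1:] for row, v in zip(X, col)]
        drops.append(base - accuracy(Xp, y))
    return sum(drops) / trials

imp0 = permutation_importance(X, y, 0)  # informative feature: large drop
imp1 = permutation_importance(X, y, 1)  # noise feature: no drop
```

Because the toy model ignores the second feature entirely, its permutation importance is exactly zero, while shuffling the first feature costs roughly half the accuracy.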
Local Interpretable Model-agnostic Explanations (LIME) was introduced by Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin in their 2016 paper "Why Should I Trust You? Explaining the Predictions of Any Classifier" [5]. The core idea is simple but powerful: for any individual prediction, LIME generates a set of perturbed versions of the input, obtains the model's predictions for each, and fits a simple interpretable model (usually a linear regression or sparse linear model) to approximate the complex model's behavior in the local neighborhood of that input.
For a text classifier, LIME might randomly remove words from the input and observe how the prediction changes, then fit a linear model that shows which words most strongly influenced the prediction. For an image classifier, it segments the image into superpixels and tests which ones are critical to the classification.
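The word-removal idea can be sketched in a few lines. This is a deliberate simplification of LIME, assuming a made-up additive keyword "model": instead of fitting a full weighted linear regression over perturbations, it estimates each word's coefficient as the proximity-weighted difference in model output with the word present versus absent.

```python
import math
import random

# Toy black-box sentiment "model": scores a sentence by fixed keyword weights.
WEIGHTS = {"excellent": 2.0, "plot": 0.1, "boring": -1.5, "the": 0.0, "was": 0.0}
def black_box(words):
    return sum(WEIGHTS.get(w, 0.0) for w in words)

def lime_style_explain(sentence, n_samples=2000, seed=0):
    """Drop random word subsets, weight each perturbation by its proximity to
    the original sentence, and score each word by the weighted difference in
    output when it is kept versus removed (a simplification of LIME's
    weighted sparse linear fit)."""
    rng = random.Random(seed)
    words = sentence.split()
    n = len(words)
    stats = {i: [0.0, 0.0, 0.0, 0.0] for i in range(n)}  # sum_on, w_on, sum_off, w_off
    for _ in range(n_samples):
        mask = [rng.random() < 0.5 for _ in range(n)]
        kept = [w for w, m in zip(words, mask) if m]
        dist = sum(1 for m in mask if not m) / n          # fraction of words removed
        weight = math.exp(-(dist ** 2) / 0.25)            # proximity kernel
        score = black_box(kept)
        for i, m in enumerate(mask):
            if m:
                stats[i][0] += weight * score
                stats[i][1] += weight
            else:
                stats[i][2] += weight * score
                stats[i][3] += weight
    return {words[i]: s[0] / s[1] - s[2] / s[3] for i, s in stats.items()}

expl = lime_style_explain("the plot was excellent")
```

On this toy model the explanation correctly identifies "excellent" as the dominant word, with a coefficient close to its true weight of 2.0.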
LIME's strengths include its generality (it works with any model) and its intuitive output. Its limitations include sensitivity to the perturbation strategy, instability (different runs can produce different explanations for the same input), and the fact that a local linear approximation may not faithfully represent the behavior of a highly nonlinear model [1].
SHapley Additive exPlanations (SHAP) was proposed by Scott Lundberg and Su-In Lee in their 2017 paper "A Unified Approach to Interpreting Model Predictions" [6]. SHAP draws on the concept of Shapley values from cooperative game theory, which provide a principled way to distribute the "payout" of a prediction among the input features.
The key insight is that Shapley values are the unique solution satisfying three desirable properties: local accuracy (the explanation values sum to the model's output), missingness (features absent from the input receive zero attribution), and consistency (if a feature's contribution increases in a revised model, its attribution should not decrease). Lundberg and Lee showed that several existing explanation methods, including LIME, DeepLIFT, and classic Shapley regression values, are special cases of a single framework they called additive feature attribution methods [6].
SHAP has both model-agnostic variants (KernelSHAP) and model-specific variants optimized for particular architectures (TreeSHAP for tree-based models, DeepSHAP for neural networks). TreeSHAP, in particular, is computationally efficient and has become the default explanation method for gradient-boosted tree models in many production settings.
Unlike LIME, SHAP provides both local explanations (per-prediction feature attributions) and global explanations (aggregated feature importance across the dataset). It also comes with stronger theoretical guarantees. However, exact Shapley value computation is exponential in the number of features, so approximations are necessary for high-dimensional inputs [1] [6].
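The exponential cost of exact Shapley values is easy to see in code. The sketch below enumerates every feature coalition for a tiny invented set-function "model" (three features, one interaction term) and verifies the local-accuracy property; real SHAP implementations exist precisely because this enumeration is infeasible for more than a handful of features.

```python
from itertools import combinations
from math import factorial

def exact_shapley(f, n_features):
    """Exact Shapley values for a set-function f over n_features.
    Cost is exponential in n_features, which is why SHAP approximates."""
    players = list(range(n_features))
    phi = [0.0] * n_features
    for i in players:
        others = [j for j in players if j != i]
        for size in range(len(others) + 1):
            for S in combinations(others, size):
                # Shapley kernel: |S|! (n - |S| - 1)! / n!
                w = factorial(len(S)) * factorial(n_features - len(S) - 1) / factorial(n_features)
                phi[i] += w * (f(set(S) | {i}) - f(set(S)))
    return phi

# Toy "model": additive effects plus an interaction between features 0 and 1.
def f(present):
    out = 0.0
    if 0 in present:
        out += 3.0
    if 1 in present:
        out += 1.0
    if {0, 1} <= present:
        out += 2.0  # interaction term, split equally by symmetry
    return out

phi = exact_shapley(f, 3)
# Local accuracy: attributions sum to f(all features) - f(no features).
assert abs(sum(phi) - (f({0, 1, 2}) - f(set()))) < 1e-9
```

The interaction term is split equally between features 0 and 1 (by the symmetry axiom), giving attributions of 4.0, 2.0, and 0.0.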
Saliency maps, introduced by Simonyan, Vedaldi, and Zisserman in 2013, were among the earliest post-hoc explanation techniques for deep neural networks [7]. The basic idea is to compute the gradient of the model's output with respect to each input feature (e.g., each pixel in an image). Features with large gradients are those that, if slightly changed, would most affect the prediction.
Plain saliency maps tend to be noisy, so several refinements have been proposed. SmoothGrad averages gradients over multiple noisy copies of the input. Integrated Gradients, proposed by Sundararajan, Taly, and Yan in 2017, accumulates gradients along a path from a neutral baseline to the actual input, satisfying desirable axioms like sensitivity and implementation invariance [8]. Grad-CAM produces coarse localization maps by weighting the activations of the final convolutional layer by the gradients flowing into it, yielding visually interpretable heatmaps.
These methods are widely used in computer vision, but they have known shortcomings. Gradient-based attributions can be misleading when the model's decision boundary is complex, and the choice of baseline in integrated gradients can significantly affect the result.
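Integrated Gradients itself is just a path integral, which can be approximated with a Riemann sum. The sketch below uses an invented two-variable function with a hand-written analytic gradient (no autodiff framework) and checks the completeness axiom: attributions sum to the difference between the prediction at the input and at the baseline.

```python
def integrated_gradients(grad, x, baseline, steps=1000):
    """Approximate IG_i = (x_i - b_i) * integral of df/dx_i along the straight
    path from baseline to x, using a midpoint Riemann sum."""
    n = len(x)
    ig = [0.0] * n
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint rule
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = grad(point)
        for i in range(n):
            ig[i] += (x[i] - baseline[i]) * g[i] / steps
    return ig

# Toy differentiable model f(x) = x0^2 + 3*x1, with its analytic gradient.
def f(x):
    return x[0] ** 2 + 3.0 * x[1]

def grad(x):
    return [2.0 * x[0], 3.0]

x, baseline = [2.0, 1.0], [0.0, 0.0]
ig = integrated_gradients(grad, x, baseline)
# Completeness axiom: sum of attributions equals f(x) - f(baseline).
assert abs(sum(ig) - (f(x) - f(baseline))) < 1e-6
```

Note how the result depends on the baseline: a different choice of `baseline` redistributes the attributions, which is exactly the sensitivity mentioned above.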
In transformer-based models, attention weights indicate how much each token "attends to" every other token when computing its representation. Visualizing these weights can provide intuitive explanations: if a sentiment classifier assigns a positive label and the attention is concentrated on the word "excellent," this suggests the word drove the prediction.
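The mechanics of an attention visualization are straightforward: scaled dot-product scores between a query and the token keys, normalized with a softmax. The example below uses invented two-dimensional embeddings chosen so the query aligns with "excellent"; it illustrates how such weights are computed, not a claim that they faithfully explain a real model.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention_weights(query, keys):
    """Scaled dot-product attention: softmax(q . k / sqrt(d)) over the keys."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    return softmax(scores)

# Toy token key vectors, chosen so the query aligns with "excellent".
tokens = ["the", "movie", "was", "excellent"]
keys = [[0.1, 0.0], [0.2, 0.1], [0.0, 0.1], [0.9, 2.0]]
query = [1.0, 2.0]

weights = attention_weights(query, keys)
top_token = tokens[max(range(len(tokens)), key=lambda i: weights[i])]
```

Here the weight on "excellent" dominates, which is what an attention heatmap would display for this toy example.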
However, the relationship between attention weights and model decisions is contested. Jain and Wallace (2019) showed that attention weights frequently do not correlate with other measures of feature importance and that alternative attention distributions can yield the same predictions [9]. Wiegreffe and Pinter (2019) responded with a more nuanced view, arguing that attention is not explanation by itself but can be a useful component of explanation when combined with other evidence [10]. The debate remains active, and researchers now generally advise against treating raw attention weights as faithful explanations without further validation.
Testing with Concept Activation Vectors (TCAV), proposed by Been Kim, Martin Wattenberg, Justin Gilmer, and colleagues at Google in 2018, takes a fundamentally different approach to explanation [11]. Rather than attributing predictions to low-level input features (pixels, words), TCAV tests the model's sensitivity to high-level, human-defined concepts.
The method works by first collecting examples of a concept (e.g., images containing "stripes") and training a linear classifier in the model's activation space to distinguish concept examples from random examples. The resulting Concept Activation Vector (CAV) represents the direction in activation space that corresponds to the concept. TCAV then uses directional derivatives to measure how sensitive the model's predictions are to changes in the direction of the concept. For instance, TCAV can answer the question: "How important is the concept 'stripes' to the model's classification of zebras?"
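The three steps above can be sketched with synthetic data. Everything here is an assumption for illustration: a fake two-dimensional "activation space" where the concept shifts along one axis, a perceptron standing in for TCAV's linear classifier, and a hand-written logit in place of a real network layer.

```python
import random

random.seed(1)

# Toy "activation space": concept examples (e.g. "stripes") shift along dim 0.
concept_acts = [[random.gauss(2.0, 0.5), random.gauss(0.0, 0.5)] for _ in range(100)]
random_acts = [[random.gauss(0.0, 0.5), random.gauss(0.0, 0.5)] for _ in range(100)]

# Step 1: train a linear separator; its normalized weight vector is the CAV.
w, b, lr = [0.0, 0.0], 0.0, 0.1
for _ in range(50):  # perceptron epochs
    for acts, label in [(a, 1) for a in concept_acts] + [(a, -1) for a in random_acts]:
        if label * (sum(wi * ai for wi, ai in zip(w, acts)) + b) <= 0:
            w = [wi + lr * label * ai for wi, ai in zip(w, acts)]
            b += lr * label
norm = sum(wi * wi for wi in w) ** 0.5
cav = [wi / norm for wi in w]

# Step 2: a stand-in class logit ("zebra") computed from the activations.
def zebra_logit(acts):
    return 1.5 * acts[0] + 0.1 * acts[1] ** 2

# Step 3: TCAV score = fraction of inputs whose logit rises along the CAV.
def directional_derivative(acts, cav, eps=1e-4):
    shifted = [a + eps * c for a, c in zip(acts, cav)]
    return (zebra_logit(shifted) - zebra_logit(acts)) / eps

inputs = [[random.gauss(1.0, 1.0), random.gauss(0.0, 1.0)] for _ in range(50)]
tcav_score = sum(directional_derivative(a, cav) > 0 for a in inputs) / len(inputs)
```

Because the toy logit genuinely depends on the concept direction, the TCAV score comes out near 1.0: the "zebra" prediction is sensitive to "stripes".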
TCAV's advantage is that it provides explanations in terms that domain experts naturally use, rather than in terms of individual pixels or features. It has been integrated into PyTorch's Captum library and has been extended in subsequent work, including a 2025 framework called Global Concept Activation Vectors (GCAV) that unifies CAVs across layers using contrastive learning [12].
Counterfactual explanations answer the question: "What is the smallest change to the input that would change the model's decision?" For example, a counterfactual explanation for a denied loan might say: "Your application would have been approved if your annual income were $5,000 higher."
This approach, formalized by Wachter, Mittelstadt, and Russell in 2018, has strong intuitive appeal because it maps directly onto how humans naturally reason about decisions [13]. Counterfactuals also have legal relevance: they can help satisfy regulatory requirements for explaining automated decisions, since they tell the affected individual what they could change to obtain a different outcome.
Challenges include generating realistic counterfactuals (the proposed change should be plausible, not just mathematically minimal), handling multiple valid counterfactuals for a single prediction, and computational cost for complex models.
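A minimal counterfactual search can be sketched as a one-dimensional sweep. The loan scorer and its coefficients below are invented for illustration, and the search only varies income in fixed steps; real methods search over all features and penalize implausible changes, as discussed above.

```python
def loan_model(income, dti):
    """Toy scorer over annual income and debt-to-income ratio."""
    return 0.00002 * income - 1.2 * dti + 0.1

def approved(income, dti):
    return loan_model(income, dti) >= 0

def income_counterfactual(income, dti, step=500, max_steps=200):
    """Smallest income increase (in fixed $500 steps) that flips a denial to
    an approval, holding the other feature fixed."""
    if approved(income, dti):
        return income  # already approved; nothing to change
    for k in range(1, max_steps + 1):
        if approved(income + k * step, dti):
            return income + k * step
    return None  # no counterfactual found within the search budget

orig_income, dti = 40000, 0.83
cf_income = income_counterfactual(orig_income, dti)
```

For this toy model the answer is an extra $5,000 of income, which is exactly the kind of statement ("approved if your income were $5,000 higher") a counterfactual explanation delivers to the applicant.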
A longstanding assumption in machine learning is that there exists a fundamental tradeoff between model accuracy and interpretability. Simple, interpretable models like linear regression and small decision trees are easy to understand but may lack the capacity to capture complex patterns. Complex models like deep neural networks and large ensembles can fit highly nonlinear relationships but are difficult or impossible to interpret directly.
This assumption has been challenged. Cynthia Rudin, in her influential 2019 paper "Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead," argued that for many practical problems, inherently interpretable models can achieve accuracy comparable to black-box models when given sufficient effort in feature engineering and model design [14]. She contended that the apparent superiority of black-box models often reflects insufficient effort spent on interpretable alternatives, and that the use of post-hoc explanations for black-box models introduces additional sources of error and potential harm.
The debate continues. Proponents of post-hoc explanation argue that for certain tasks, particularly those involving unstructured data like images, audio, and natural language, complex models genuinely outperform interpretable ones by wide margins. In these settings, post-hoc methods are the only practical option. The resolution may depend on the specific application, the consequences of errors, and the regulatory environment.
The EU AI Act, which entered into force on August 1, 2024, is the world's first comprehensive legal framework for regulating artificial intelligence. It establishes transparency and explainability as core principles. Article 86 of the AI Act creates a right to explanation of individual decision-making: if a decision is made about a person using a high-risk AI system, and that decision has a significant impact on the person's health, safety, or fundamental rights, the person has the right to obtain "clear and meaningful explanations of the role of the AI system in the decision-making procedure and the main elements of the decision taken" [2].
The transparency rules of the AI Act become fully applicable in August 2026. Providers of high-risk AI systems must ensure that the system's operation is sufficiently transparent for deployers to interpret its output and use it appropriately. This includes requirements for logging, documentation, and human oversight mechanisms [2].
The General Data Protection Regulation (GDPR), in effect since May 2018, addresses automated decision-making in Article 22. Individuals have the right not to be subject to a decision based solely on automated processing, including profiling, which produces legal effects or similarly significantly affects them. When such automated decision-making is permitted (for example, with the individual's explicit consent), the data controller must provide "meaningful information about the logic involved" [3].
The precise scope of the GDPR's "right to explanation" has been debated extensively by legal scholars. Some argue that Articles 13, 14, and 15 of the GDPR, which require disclosure of the "logic involved" in automated processing, constitute a robust right to explanation. Others contend that these provisions require only generic information about the system's logic, not a specific explanation of individual decisions [3].
Article 86 of the AI Act explicitly defers to existing EU law: where the GDPR or other EU legislation already provides a right to explanation for fully automated systems, the AI Act right does not apply. However, the AI Act fills gaps. There are scenarios where GDPR protections do not apply (for example, when a decision is not "solely" automated, or when the decision does not produce "legal effects"), but the AI system is still classified as high-risk under the AI Act. In those cases, the AI Act's right to explanation provides additional coverage [3].
Beyond the EU, multiple jurisdictions have enacted or proposed requirements for AI transparency. In the United States, the Equal Credit Opportunity Act has long required lenders to provide specific reasons for adverse credit decisions, a requirement that applies equally to automated systems. The Fair Credit Reporting Act imposes similar obligations. Several US jurisdictions, including Colorado and New York City, have enacted AI-specific transparency laws. China's regulations on generative AI services also include transparency and disclosure requirements [15].
Explainability is particularly critical in healthcare, where AI-assisted decisions directly affect patient outcomes. Diagnostic models that identify cancers in medical images, predict patient deterioration, or recommend treatments must provide explanations that clinicians can evaluate against their own expertise. The US Food and Drug Administration has signaled that explainability will be a factor in the regulatory approval of AI-based medical devices.
In practice, LIME and SHAP are widely used to explain predictions from tabular clinical data (lab values, vital signs, patient demographics), while saliency maps and Grad-CAM are applied to medical imaging models. Research has also explored using TCAV to test whether radiology models rely on clinically meaningful concepts rather than spurious artifacts [1].
Financial institutions use AI for credit scoring, fraud detection, algorithmic trading, and anti-money laundering. Regulatory requirements in many jurisdictions mandate that consumers receive explanations for adverse credit decisions. SHAP has become a standard tool in financial model explanation, partly because TreeSHAP integrates efficiently with the gradient-boosted tree models widely used in the industry.
Beyond regulatory compliance, financial institutions use explanations for model risk management: understanding why a model makes certain predictions helps risk teams identify potential failures before they cause losses.
The use of AI in criminal justice, including predictive policing, recidivism risk assessment, and sentencing recommendations, has attracted significant scrutiny. The COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) system, widely used in the US, was the subject of a 2016 investigation by ProPublica that found significant racial disparities in its risk scores [16]. The case highlighted the importance of explainability: without understanding how the system arrived at its predictions, it was impossible for defendants, judges, or the public to assess whether the system was fair.
Self-driving vehicles must make split-second decisions with potentially life-or-death consequences. When an autonomous vehicle is involved in an accident, investigators need to understand why the vehicle's AI made the decisions it did. Explainability is also important for public trust: surveys consistently show that consumers are more willing to ride in autonomous vehicles when they can understand the system's reasoning. Saliency maps and attention visualization are commonly used to explain perception models in autonomous driving.
The emergence of large language models has introduced new challenges for explainability. Traditional XAI methods like LIME and SHAP were designed for models with well-defined input features (tabular columns, image pixels). Applying them to models with billions of parameters that process sequences of tokens requires fundamentally new approaches.
A key obstacle to understanding neural networks is polysemanticity: individual neurons often respond to multiple unrelated concepts. A single neuron might activate for images of cars, the color red, and certain textures, making it impossible to assign a clean semantic interpretation to any individual neuron. This phenomenon, sometimes called superposition, means that the model encodes more concepts than it has neurons, using distributed representations where each concept is spread across many neurons [17].
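Superposition can be illustrated with a deliberately tiny example: three concept directions packed into a two-dimensional "neuron" space. Since the directions cannot all be orthogonal, reading one concept back out produces interference on the others, which is why no single neuron corresponds cleanly to a single concept.

```python
import math

# Three concept directions packed into a 2-dimensional activation space,
# spaced 120 degrees apart. With more concepts than dimensions, the
# directions cannot be mutually orthogonal.
directions = [
    [1.0, 0.0],
    [math.cos(2 * math.pi / 3), math.sin(2 * math.pi / 3)],
    [math.cos(4 * math.pi / 3), math.sin(4 * math.pi / 3)],
]

def encode(intensities):
    """Superpose concept intensities into one shared activation vector."""
    return [sum(a * d[k] for a, d in zip(intensities, directions)) for k in (0, 1)]

def readout(activation, concept):
    """Dot-product readout of one concept from the shared activations."""
    return sum(x * d for x, d in zip(activation, directions[concept]))

# Activate only concept 0. Reading it back works, but the other two
# readouts are nonzero (-0.5 each): interference from the overlap.
act = encode([1.0, 0.0, 0.0])
r = [readout(act, c) for c in range(3)]
```

Sparse autoencoders, discussed next, aim to undo exactly this kind of packing by finding an overcomplete basis in which each active feature is interpretable on its own.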
Anthropic has been at the forefront of addressing polysemanticity through mechanistic interpretability research. In their 2023 paper "Towards Monosemanticity," Anthropic researchers applied sparse autoencoders (SAEs) to decompose the activations of a small language model into interpretable features. Each feature corresponded to a recognizable concept: a specific programming language, a type of punctuation, a particular topic, or a safety-relevant behavior [17].
In 2024, Anthropic scaled this approach dramatically in "Scaling Monosemanticity," applying sparse autoencoders to Claude 3 Sonnet, a production-scale model. The resulting features were remarkably abstract, multilingual, and multimodal. The researchers found features corresponding to specific cities, famous people, programming concepts, and even potentially dangerous topics. They demonstrated that these features could be artificially activated to steer the model's behavior, confirming that they were causally relevant to the model's computations rather than mere correlational artifacts [18].
In March 2025, Anthropic introduced circuit tracing, a technique that combines several earlier methods into a unified framework for understanding how language models process information [19]. The approach replaces a model's multi-layer perceptrons (MLPs) with cross-layer transcoders (CLTs), a new type of sparse autoencoder that reads from one layer's residual stream but can contribute output to all subsequent MLP layers. This produces an interpretable "replacement model" where the building blocks are sparse, human-readable features rather than polysemantic neurons.
The output of circuit tracing is an attribution graph: a directed graph whose nodes represent features, token embeddings, and output logits, and whose edges represent the causal interactions between them. These graphs describe, for a specific input, the sequence of computational steps the model uses to produce its output. Anthropic applied this technique to Claude 3.5 Haiku and published detailed case studies showing, for example, how the model processes multi-step reasoning tasks and how it decides to refuse harmful requests [19].
Anthropic open-sourced the circuit tracing tools in 2025, including a Python library compatible with any open-weights model and a visual frontend hosted on Neuronpedia [20].
Despite rapid progress, mechanistic interpretability for LLMs faces significant challenges. Core concepts like "feature" lack rigorous mathematical definitions. Computational complexity results suggest that many interpretability queries are theoretically intractable. Cross-layer transcoders match the underlying model's outputs in only about 50% of cases, meaning the replacement model is an approximation rather than a perfect substitute. And practical methods still underperform simple baselines on some safety-relevant evaluation tasks [21].
A landmark collaborative paper published in January 2025 by 29 researchers across 18 organizations established the field's consensus open problems. MIT Technology Review named mechanistic interpretability one of its 10 Breakthrough Technologies for 2026, reflecting both the field's promise and the growing expectation that it will deliver practical safety tools [21].
Several open-source tools support XAI research and deployment.
| Tool | Developer | Language | Focus |
|---|---|---|---|
| SHAP | Lundberg et al. | Python | Shapley value-based explanations for any model; includes TreeSHAP, DeepSHAP, and KernelSHAP |
| Captum | Meta (Facebook) | Python/PyTorch | Model interpretability for PyTorch; includes integrated gradients, TCAV, saliency, and more |
| InterpretML | Microsoft | Python | Unified framework for both glassbox (inherently interpretable) and blackbox explanation methods |
| TransformerLens | Neel Nanda et al. | Python/PyTorch | Mechanistic interpretability toolkit for transformer models; enables activation patching, probing, and circuit analysis |
| ELI5 | Various | Python | Lightweight library for explaining scikit-learn, XGBoost, and other model predictions |
| Alibi Explain | Seldon | Python | Explanations for classification and regression models; includes counterfactual methods |
| SAE Lens | Open-source community | Python/PyTorch | Training and analyzing sparse autoencoders for mechanistic interpretability research |
SHAP is perhaps the most widely used XAI library. It provides fast, exact implementations for tree-based models (via TreeSHAP), approximate methods for deep learning models (DeepSHAP), and a general model-agnostic approach (KernelSHAP). Its visualization tools, including summary plots, dependence plots, and force plots, have become standard in data science practice [6].
Captum, developed by Meta, is the primary interpretability library for the PyTorch ecosystem. It implements over 20 attribution algorithms, including integrated gradients, LIME, SHAP, and TCAV. It supports both standard neural networks and transformer architectures [12].
InterpretML, developed by Microsoft Research, takes a two-pronged approach. It offers glassbox models (inherently interpretable models like Explainable Boosting Machines) alongside blackbox explanation methods (LIME, SHAP, partial dependence plots). The Explainable Boosting Machine (EBM), InterpretML's flagship model, is a generalized additive model that often achieves accuracy competitive with gradient-boosted trees while remaining fully interpretable [22].
TransformerLens, created by Neel Nanda and collaborators, is designed specifically for mechanistic interpretability research on transformer-based models. It provides tools for activation caching, hook-based interventions, probing, and circuit discovery. The library supports GPT-2, GPT-Neo, and other open-source transformer models and has become a standard tool in the mechanistic interpretability research community [23].
The most fundamental challenge in XAI is ensuring that explanations are faithful to the model's actual reasoning process, not just plausible-sounding stories. An explanation is faithful if it accurately reflects the factors that caused the model's prediction. An explanation is unfaithful if it highlights features that seem reasonable to humans but are not actually what drove the model's decision.
Post-hoc methods are particularly susceptible to faithfulness problems. LIME's local linear approximation may miss important nonlinear interactions. SHAP's approximations (in KernelSHAP and DeepSHAP) can introduce errors. Attention weights may not reflect the true causal structure of a transformer's computation [9]. And gradient-based methods can produce misleading attributions when the model saturates: a feature can matter to the prediction even where the local gradient with respect to it is near zero.
Researchers have proposed various faithfulness metrics, but there is no universally accepted standard. Common approaches include checking whether removing the features identified as important actually changes the prediction (deletion tests), whether adding only the important features is sufficient to reproduce the prediction (sufficiency tests), and whether the explanation is consistent across similar inputs (stability tests).
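A deletion test can be sketched model-agnostically in a few lines. This is an illustrative implementation, not a standard library API: the toy `model`, the choice of the feature mean as the deletion baseline, and the function name are all assumptions made for the example.

```python
import numpy as np

# Hypothetical model: heavily driven by feature 0, barely by feature 2.
def model(X):
    return 5.0 * X[:, 0] + 1.0 * X[:, 1] + 0.01 * X[:, 2]

def deletion_score(model, X, feature, baseline):
    """How much does the prediction move when one feature is replaced
    by its baseline value? Larger = more important to the model."""
    X_del = X.copy()
    X_del[:, feature] = baseline[feature]
    return float(np.abs(model(X) - model(X_del)).mean())

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
baseline = X.mean(axis=0)

scores = [deletion_score(model, X, j, baseline) for j in range(3)]
# A faithful explanation should rank features in the same order as
# these deletion scores: feature 0 >> feature 1 >> feature 2.
print(scores)
```

The choice of baseline is itself a modeling decision (mean, zero, or a draw from the data distribution), and different baselines can change the resulting importance ranking.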
Many XAI methods become computationally expensive or impractical at the scale of modern AI systems. Exact Shapley value computation is exponential in the number of features. Mechanistic interpretability techniques require significant compute to analyze even medium-sized models. For frontier models with hundreds of billions of parameters, providing comprehensive explanations of model behavior remains beyond current capabilities.
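The exponential cost of exact Shapley values is visible directly in a brute-force implementation: each feature requires enumerating every coalition of the remaining features, i.e. 2^(n-1) model evaluations per feature. The sketch below is a textbook construction for illustration, verified against the known closed form for linear models (phi_i = w_i * (x_i - baseline_i)).

```python
from itertools import combinations
from math import factorial

def exact_shapley(f, x, baseline):
    """Exact Shapley values by enumerating all 2^(n-1) coalitions per
    feature -- fine for a handful of features, hopeless at scale."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for S in combinations(others, size):
                # Shapley kernel weight |S|! (n-|S|-1)! / n!
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                with_i  = [x[j] if j in S or j == i else baseline[j] for j in range(n)]
                without = [x[j] if j in S else baseline[j] for j in range(n)]
                phi[i] += weight * (f(with_i) - f(without))
    return phi

# Sanity check on a linear model, where phi_i = w_i * (x_i - baseline_i).
w = [2.0, -1.0, 0.5]
f = lambda z: sum(wi * zi for wi, zi in zip(w, z))
print(exact_shapley(f, [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]))  # ~ [2.0, -1.0, 0.5]
```

TreeSHAP avoids this blowup by exploiting tree structure to get polynomial-time exact values, and KernelSHAP by sampling coalitions; both trade-offs exist precisely because the naive computation above does not scale.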
This scalability challenge is particularly acute for LLM interpretability. While sparse autoencoders have been successfully applied to models like Claude 3 Sonnet, the computational cost of training the autoencoders, validating the extracted features, and constructing attribution graphs is substantial. Scaling these techniques to the largest models will require both algorithmic advances and significant investment in compute infrastructure.
The XAI community lacks standardized metrics for evaluating the quality of explanations. Different papers use different benchmarks, making it difficult to compare methods. Is an explanation good because it is faithful to the model? Because humans find it useful? Because it leads to better decisions? These criteria can conflict: a faithful explanation that reveals the model's reliance on a confusing statistical pattern may be less useful to a decision-maker than a simplified (but less faithful) explanation that highlights the most actionable factors.
Efforts to standardize evaluation are underway. The XAI-Bench benchmark and related initiatives aim to provide common ground, but consensus remains elusive.
Explanations are only valuable if humans can understand and use them correctly. Research in human-computer interaction has shown that users do not always interpret explanations as intended. Feature importance scores can be misunderstood. Saliency maps can create false confidence. Counterfactuals can be misinterpreted as causal claims. Designing explanations that are both technically faithful and cognitively accessible remains a significant design challenge.
An emerging concern is that explanation methods can themselves be manipulated. Researchers have demonstrated that it is possible to construct models that produce biased decisions while generating innocuous-looking explanations, effectively using the explanation method as camouflage. This "fairwashing" attack undermines the use of explanations for auditing and accountability, and highlights the need for explanation methods that are robust to adversarial manipulation [24].
As of early 2026, the field of explainable AI is characterized by several notable trends.
Regulatory pressure is increasing. The EU AI Act's transparency obligations for general-purpose AI models took effect in August 2025, and the full enforcement regime, including requirements for high-risk AI systems, arrives in August 2026. Organizations deploying AI in Europe are actively investing in explainability tools and processes to ensure compliance. In the United States, state-level AI laws in Colorado, California, and Texas have created additional transparency requirements, though federal legislation remains absent [2] [15].
Mechanistic interpretability is maturing. What began as a niche academic pursuit has grown into a recognized subdiscipline with dedicated teams at Anthropic, OpenAI, Google DeepMind, and multiple universities. The open-sourcing of tools like Anthropic's circuit tracer and community libraries like TransformerLens and SAE Lens has accelerated research. MIT Technology Review's recognition of the field as a breakthrough technology for 2026 signals growing mainstream awareness [21].
SHAP and LIME remain dominant for traditional ML. For tabular data, tree-based models, and classical machine learning pipelines, SHAP (particularly TreeSHAP) and LIME remain the most widely used explanation methods. Their integration into standard data science workflows, availability in open-source libraries, and familiarity among practitioners ensure their continued relevance even as newer methods emerge [1].
LLM explainability is an active frontier. The interpretability challenges posed by large language models have spawned an entirely new research program. Beyond mechanistic interpretability, researchers are exploring chain-of-thought analysis (examining the intermediate reasoning steps of models trained with chain-of-thought prompting), probing classifiers, representation engineering, and activation steering. The question of whether these methods can scale to provide meaningful safety guarantees for the most capable models remains open [21].
Industry adoption is uneven. Large financial institutions, healthcare organizations, and regulated industries have made significant investments in XAI tooling. Many smaller organizations, by contrast, still deploy models without systematic explainability practices. The gap between research capabilities and production deployment remains wide.
The faithfulness problem persists. Despite years of research, the community still lacks reliable methods to verify that explanations truly reflect model reasoning. This is not merely a technical inconvenience; it poses a fundamental challenge to the regulatory project of requiring explanations, since unfaithful explanations may provide false assurance. Addressing this challenge, whether through improved post-hoc methods, broader adoption of inherently interpretable models, or advances in mechanistic interpretability, remains the field's central open problem.