# Variable importances

> Source: https://aiwiki.ai/wiki/variable_importances
> Updated: 2026-06-25
> Categories: Interpretability, Machine Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

## What are variable importances?

Variable importances, also called feature importances, are scores assigned to each input variable of a predictive model that measure how much that variable contributes to the model's output. The scores rank the inputs from most influential to least influential, usually as a single number per feature that can be plotted as a bar chart. Variable importance is one of the oldest and most heavily used tools in [explainable AI](/wiki/explainable_ai): data scientists and statisticians rely on it to interpret models, select features, debug data pipelines, and produce regulatory explanations for high-stakes decisions [1][19].

The term covers a large family of methods. Some are model specific (such as the impurity reduction reported by a [random forest](/wiki/random_forest) or the absolute coefficients of a linear model), and others are model agnostic (such as permutation importance, [SHAP](/wiki/shap) and [LIME](/wiki/lime)) [1][6]. They differ in what they measure, how they compute the measurement, and what assumptions they make about the data. A naive practitioner who reads a single bar chart and concludes that the most important feature is the most causally relevant has likely made a mistake. Importance values reflect the model's behaviour on a particular dataset, not the underlying physics of the world [15].

## What is variable importance used for?

There is no single canonical use case. Practitioners reach for variable importance in several different settings.

| Use case | What the score answers | Example |
|----------|------------------------|---------|
| Model interpretation | Which inputs drive predictions in this model | Showing a credit risk model leans heavily on debt-to-income ratio |
| [Feature engineering](/wiki/feature_engineering) | Which inputs to keep, drop, or transform | Removing low-importance columns to shrink a tabular pipeline |
| Debugging | Whether the model relies on something it should not | Detecting that an image classifier latched onto a watermark |
| Regulatory explanation | What factors led to a specific decision | Explaining a denied loan under GDPR or the EU AI Act |
| Scientific discovery | Which variables warrant further study | Highlighting genes whose expression correlates with disease state |
| Fairness audit | Whether protected attributes leak through proxies | Spotting that ZIP code is acting as a stand-in for race |

These goals overlap, but they are not identical. A score that is fine for ranking columns by predictive power is not necessarily a good causal explanation, and a metric that works on the training set may give a different answer on held-out data [19]. Choosing a method that fits the goal is half the work.

## What are the main categories of methods?

Variable importance methods fall into two broad camps. Model-specific methods exploit the structure of a particular learner (a tree, a linear model, a neural network) to compute importance cheaply. Model-agnostic methods treat the trained model as a black box and probe it with carefully chosen inputs. Each camp has trade-offs.

| Family | Examples | Pros | Cons |
|--------|----------|------|------|
| Linear coefficients | Standardised absolute coefficient, t-statistic, LASSO selection | Cheap, exact for the model | Only works for linear models, sensitive to scaling and collinearity |
| Tree impurity (MDI) | sklearn `feature_importances_`, R `randomForest::importance` | Free during training | Biased toward high-cardinality features, computed on training data |
| Tree gain or split count | XGBoost `gain`, `weight`, `cover` | Cheap, integrates with boosted ensembles | Different metrics give different rankings |
| Permutation importance | `sklearn.inspection.permutation_importance` | Model agnostic, uses real loss | Misleading with correlated features, costly for large data |
| Shapley values | [SHAP](/wiki/shap), TreeSHAP, KernelSHAP | Strong axiomatic basis, local and global | Expensive in general, choice of background matters |
| Local linear surrogates | [LIME](/wiki/lime), Anchors | Works on any model, easy to grasp | Stochastic, sensitive to neighbourhood definition |
| Counterfactuals | Wachter counterfactuals, DiCE | Fits regulatory framing | Many valid counterfactuals exist, no single ranking |
| Gradient attributions | Saliency, [Integrated Gradients](/wiki/integrated_gradients), DeepLIFT, Grad-CAM | Designed for differentiable models | Sensitive to baseline, gradient saturation |

## How is feature importance computed for linear models?

For a [linear model](/wiki/linear_model) with response y = w_1 x_1 + ... + w_d x_d + b, the absolute value of each coefficient |w_j| is the standard variable importance. The interpretation is direct: |w_j| measures how much the prediction changes when x_j changes by one unit, holding the other inputs fixed. The score makes sense only if the inputs are on comparable scales, so practitioners standardise the features (subtracting the mean and dividing by the standard deviation) before reading off importances.
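A minimal sketch, assuming scikit-learn and synthetic data invented for illustration: two features with comparable effects per standard deviation, one of them recorded on a scale 1000 times larger. Raw coefficients make the large-scale feature look negligible; after standardisation its coefficient is the larger of the two.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
# Hypothetical data: x1 is recorded on a scale 1000 times larger than x0.
x0 = rng.normal(0.0, 1.0, n)
x1 = rng.normal(0.0, 1000.0, n)
X = np.column_stack([x0, x1])
y = 2.0 * x0 + 0.003 * x1 + rng.normal(0.0, 0.1, n)

raw = LinearRegression().fit(X, y)
print("raw |w|:         ", np.abs(raw.coef_))   # about [2, 0.003]: x1 looks negligible

scaled = make_pipeline(StandardScaler(), LinearRegression()).fit(X, y)
w_std = np.abs(scaled[-1].coef_)                # coefficients on standardised inputs
print("standardised |w|:", w_std)               # about [2, 3]: x1 has the larger effect
```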

Regularised linear models add a second layer of importance signal. LASSO (L1 regularisation) drives many coefficients to exactly zero, so the surviving features form a selected subset. Ridge regression (L2) keeps all features but shrinks correlated ones together. Elastic net interpolates between the two. The shrinkage path itself, plotted as features enter the active set as the regularisation parameter decreases, is sometimes used as an importance ranking.
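The shrinkage-path ranking can be sketched with scikit-learn's `lasso_path`, assuming standardised synthetic data in which features 1 and 3 carry no signal. Features are ranked by how early along the path (that is, at how large a regularisation parameter) their coefficients first become non-zero; the helper `entry_step` is hypothetical and this is one simple reading of the path, not a standard API.

```python
import numpy as np
from sklearn.linear_model import lasso_path
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n, d = 300, 5
X = StandardScaler().fit_transform(rng.normal(size=(n, d)))
true_w = np.array([3.0, 0.0, 1.5, 0.0, 0.5])   # features 1 and 3 carry no signal
y = X @ true_w + rng.normal(0.0, 0.5, n)
y = y - y.mean()                                # lasso_path fits no intercept

# alphas come back in decreasing order; coefs has shape (n_features, n_alphas)
alphas, coefs, _ = lasso_path(X, y)

def entry_step(path_row, tol=1e-10):
    """Index of the first alpha at which a coefficient becomes non-zero."""
    nonzero = np.flatnonzero(np.abs(path_row) > tol)
    return nonzero[0] if nonzero.size else len(path_row)   # never entered

entry = np.array([entry_step(row) for row in coefs])
ranking = np.argsort(entry, kind="stable")
print("features in order of entry:", ranking)
```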

The coefficient interpretation breaks down when features are highly correlated. Two near-duplicate columns can split the credit and end up with small coefficients each, even though the underlying signal is strong. Variance inflation factors and condition numbers help diagnose multicollinearity before trusting coefficient magnitudes.
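Variance inflation factors are simple to compute directly: regress each column on the others and take 1 / (1 - R^2). The sketch below uses only NumPy and synthetic data containing one near-duplicate pair; a common rule of thumb treats values above about 5 or 10 as a warning sign.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing column j on all others."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    vifs = np.empty(d)
    for j in range(d):
        target = X[:, j]
        design = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])  # other columns plus intercept
        beta, *_ = np.linalg.lstsq(design, target, rcond=None)
        r2 = 1.0 - (target - design @ beta).var() / target.var()
        vifs[j] = 1.0 / (1.0 - r2)
    return vifs

rng = np.random.default_rng(0)
a = rng.normal(size=1000)
b = a + rng.normal(scale=0.05, size=1000)   # near-duplicate of a
c = rng.normal(size=1000)                   # unrelated column
vif = variance_inflation_factors(np.column_stack([a, b, c]))
print(np.round(vif, 1))                     # a and b are in the hundreds; c is close to 1
```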

## How is feature importance computed in random forests?

Methods built on trees are everywhere because tabular data is everywhere. The two dominant scores are mean decrease in impurity and gain.

Mean decrease in impurity (MDI), the default for sklearn's `RandomForestClassifier.feature_importances_` and `GradientBoostingRegressor.feature_importances_`, sums the reduction in [Gini impurity](/wiki/gini_impurity) (or entropy or mean squared error) at every split that uses a given feature, weighted by the fraction of training samples reaching that node [19]. The values are normalised across all features to sum to one. MDI is essentially free because the impurity reductions are already computed during tree induction, which is one reason it became the default.
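To make the definition concrete, the sketch below recomputes MDI for a single scikit-learn regression tree from its fitted `tree_` arrays (`children_left`, `children_right`, `feature`, `impurity`, `weighted_n_node_samples`) and compares the result with `feature_importances_`. For a forest, scikit-learn averages the per-tree scores; this sketch covers one tree only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=4, noise=10.0, random_state=0)
tree = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X, y)
t = tree.tree_

mdi = np.zeros(X.shape[1])
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:  # leaf: no split, so no impurity decrease
        continue
    # Sample-weighted impurity decrease produced by this split.
    decrease = (t.weighted_n_node_samples[node] * t.impurity[node]
                - t.weighted_n_node_samples[left] * t.impurity[left]
                - t.weighted_n_node_samples[right] * t.impurity[right])
    mdi[t.feature[node]] += decrease

mdi /= t.weighted_n_node_samples[0]  # turn sample counts into fractions of the training set
mdi /= mdi.sum()                     # normalise so the scores sum to one
print(np.round(mdi, 4))
print(np.round(tree.feature_importances_, 4))
```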

The MDI score has a well documented bias. As the scikit-learn documentation warns, "impurity-based feature importance for trees is strongly biased and favor high cardinality features (typically numerical features) over low cardinality features such as binary features or categorical variables with a small number of possible categories" [19]. Carolin Strobl and colleagues showed in 2007 that random forest variable importance measures are "not reliable in situations where potential predictor variables vary in their scale of measurement or their number of categories" [2]. High-cardinality features (such as ZIP code, user ID, or any categorical with many levels) and continuous features with many unique values have more candidate split points than low-cardinality features, so they are more likely to be picked at random and end up with inflated MDI scores [2][19]. Bootstrap sampling with replacement compounds the bias. The recommended fixes include conditional inference forests with unbiased split selection, holding out validation data for the importance calculation, and switching to permutation importance as a sanity check [2][19].
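The bias is easy to reproduce, in the spirit of the example in the scikit-learn documentation. The sketch below appends two pure-noise columns to synthetic data, one continuous and one binary, and compares MDI with permutation importance on a held-out split. The data and column names are invented for illustration; exact numbers vary, but the continuous noise column typically receives a visibly non-zero MDI while its held-out permutation importance stays near zero.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)
rng = np.random.default_rng(0)
# Two noise columns with no relation to y: one continuous (many unique values),
# one binary (two unique values).
random_num = rng.normal(size=len(y))
random_bin = rng.integers(0, 2, size=len(y))
X = np.column_stack([X, random_num, random_bin])
names = ["f0", "f1", "f2", "f3", "random_num", "random_bin"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
forest = RandomForestClassifier(random_state=0).fit(X_train, y_train)

mdi = forest.feature_importances_                    # computed from training data
perm = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=0)

for name, m, p in zip(names, mdi, perm.importances_mean):
    print(f"{name:>10}  MDI={m:.3f}  held-out permutation={p:+.3f}")
```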

Tree based libraries report several alternative importance scores, each measuring a slightly different thing.

| Score | What it counts | Where it shows up |
|-------|----------------|-------------------|
| MDI / mean decrease in Gini | Total impurity reduction, weighted by samples | sklearn `feature_importances_`, randomForest `importance(type=2)` |
| Mean decrease in accuracy (MDA) | Drop in OOB accuracy after permuting the feature | randomForest `importance(type=1)` |
| Gain | Sum of loss reduction at every split using the feature | XGBoost (default for sklearn API), [LightGBM](/wiki/lightgbm) `gain` |
| Weight (or split count) | Number of times the feature is used to split | XGBoost `weight`, LightGBM `split` |
| Cover | Number (or sum of Hessians) of training samples touched by splits on the feature | XGBoost `cover`, `total_cover` |

[XGBoost](/wiki/xgboost) makes the choice of metric explicit. The library exposes `gain`, `weight`, `cover`, `total_gain`, and `total_cover` [20]. Gain is usually the most informative, weight rewards features that are used as splits often even when each split improves the loss only a little, and cover weighs by the number of training samples affected. The three metrics can disagree dramatically on the same model, which is why practitioners now treat XGBoost importance as a starting point rather than a definitive answer and follow up with SHAP [8].
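Disagreement between count-based and gain-based scores can be seen without XGBoost. The sketch below is a stand-in, not XGBoost's API: it fits a scikit-learn `GradientBoostingRegressor`, counts how often each feature is used to split (analogous to `weight`), and prints that ranking next to the impurity-based ranking from `feature_importances_` (analogous to `gain`).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=1000, n_features=5, n_informative=3,
                       noise=5.0, random_state=0)
gbr = GradientBoostingRegressor(n_estimators=100, max_depth=3, random_state=0).fit(X, y)

# "weight"-style score: how many splits use each feature, summed over all trees.
split_counts = np.zeros(X.shape[1], dtype=int)
for est in gbr.estimators_.ravel():   # one regression tree per boosting stage
    f = est.tree_.feature
    f = f[f >= 0]                      # leaves carry a negative placeholder feature index
    split_counts += np.bincount(f, minlength=X.shape[1])

# "gain"-style score: scikit-learn's impurity-based importance.
gain_like = gbr.feature_importances_

print("rank by split count:", np.argsort(-split_counts))
print("rank by impurity   :", np.argsort(-gain_like))
```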

## What is permutation importance?

Permutation importance was introduced by Leo Breiman in his 2001 paper Random Forests [1]. To measure the importance of feature j, take a fitted model, shuffle the column for j across rows, recompute the model's prediction performance, and record the drop. A large drop means the model relied heavily on j; a small drop means the model could shrug off the loss of that signal. The procedure is fully model agnostic because it only needs to call the model's prediction function [19].
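Breiman's procedure fits in a few lines. The sketch below is a hand-written version for illustration (the function name `permutation_importance_manual` is hypothetical), scored with R^2 on a held-out split of synthetic data where only the first two features are informative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def permutation_importance_manual(predict, X, y, score, n_repeats=5, seed=0):
    """Mean and standard deviation of the score drop after shuffling each column."""
    rng = np.random.default_rng(seed)
    baseline = score(y, predict(X))
    drops = np.empty((X.shape[1], n_repeats))
    for j in range(X.shape[1]):
        for r in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])  # break the link between column j and y
            drops[j, r] = baseline - score(y, predict(X_perm))
    return drops.mean(axis=1), drops.std(axis=1)

# shuffle=False keeps the two informative features in columns 0 and 1.
X, y = make_regression(n_samples=600, n_features=4, n_informative=2,
                       shuffle=False, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

mean, std = permutation_importance_manual(model.predict, X_test, y_test, r2_score)
for j, (m, s) in enumerate(zip(mean, std)):
    print(f"feature {j}: {m:.3f} +/- {s:.3f}")
```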

[scikit-learn](/wiki/scikit-learn) implements this through `sklearn.inspection.permutation_importance`. The function repeats the shuffle several times (`n_repeats`, which defaults to 5; larger values such as 10 or 30 give steadier estimates) and reports both the mean and the standard deviation of the importance, which lets users tell apart features whose scores are reliably different from random noise [19]. The scikit-learn documentation notes that "permutation importances can be computed either on the training set or on a held-out testing or validation set," and that "using a held-out set makes it possible to highlight which features contribute the most to the generalization power of the inspected model" [19].
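A typical call, following the pattern in the scikit-learn documentation, scores a model on a validation split and keeps only features whose mean importance exceeds two standard deviations. The dataset, the model, and the two-standard-deviation cut-off are illustrative choices.

```python
from sklearn.datasets import load_diabetes
from sklearn.inspection import permutation_importance
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
model = Ridge(alpha=1e-2).fit(X_train, y_train)

# Score on held-out data, repeating each shuffle 30 times.
result = permutation_importance(model, X_val, y_val, n_repeats=30, random_state=0)

# Report features whose mean drop is more than two standard deviations above zero.
for i in result.importances_mean.argsort()[::-1]:
    if result.importances_mean[i] - 2 * result.importances_std[i] > 0:
        print(f"{X.columns[i]:<6} {result.importances_mean[i]:.3f} "
              f"+/- {result.importances_std[i]:.3f}")
```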

Fisher, Rudin, and Dominici generalised the idea with model class reliance, in work first posted in 2018 and published in JMLR in 2019 [13]. Instead of scoring one fitted model, model class reliance uses permutation to measure the range of reliance on a feature across all models in a hypothesis class that achieve near-optimal accuracy [13]. The framework formalises the intuition that a single trained model's importance score is one estimate of an underlying quantity, not the quantity itself.

Permutation importance has a known failure mode. When two features are strongly correlated, permuting one does not break the model, because the other still carries the same information. The model's score barely drops, so both features get a low permutation importance even though their joint signal is critical. Strobl and colleagues proposed conditional permutation importance in 2008: the column is permuted only within strata defined by the features it is correlated with, which breaks its association with the outcome while preserving its correlation with the other predictors [3]. The conditional version is harder to compute and depends on choosing the right stratification, but it is the cleanest fix when correlations cannot be ignored.
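The failure mode can be reproduced with synthetic data. The sketch below trains one forest on a single noisy measurement of a hidden driver and another on two correlated measurements, then shuffles columns on held-out rows. Rather than Strobl's conditional scheme, it shows the simpler grouped-permutation mitigation (shuffling correlated columns together with one shared permutation); the helper `permutation_drop` is hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n = 2000
z = rng.normal(size=n)                      # hidden driver of the outcome
x1 = z + rng.normal(scale=0.1, size=n)      # two noisy, highly correlated measurements of z
x2 = z + rng.normal(scale=0.1, size=n)
noise = rng.normal(size=n)                  # irrelevant column
y = z + rng.normal(scale=0.1, size=n)

def permutation_drop(model, X, y, cols, rng, n_repeats=5):
    """Mean R^2 drop when the columns in `cols` are shuffled with one shared permutation."""
    base = r2_score(y, model.predict(X))
    drops = []
    for _ in range(n_repeats):
        X_perm = X.copy()
        idx = rng.permutation(len(X))
        X_perm[:, cols] = X[idx][:, cols]   # same new row order for every column in the group
        drops.append(base - r2_score(y, model.predict(X_perm)))
    return float(np.mean(drops))

half = n // 2
XA = np.column_stack([x1, noise])           # model A sees one measurement
XB = np.column_stack([x1, x2, noise])       # model B sees both measurements
params = dict(n_estimators=50, max_features=1, random_state=0)
model_a = RandomForestRegressor(**params).fit(XA[:half], y[:half])
model_b = RandomForestRegressor(**params).fit(XB[:half], y[:half])

imp_a = permutation_drop(model_a, XA[half:], y[half:], [0], rng)
imp_b_single = permutation_drop(model_b, XB[half:], y[half:], [0], rng)
imp_b_group = permutation_drop(model_b, XB[half:], y[half:], [0, 1], rng)
print(f"x1 alone: {imp_a:.2f}  x1 beside x2: {imp_b_single:.2f}  "
      f"x1 and x2 together: {imp_b_group:.2f}")
```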

## What are SHAP values?

Shapley values come from cooperative game theory. They divide the total payout of a coalition of players among individual players in the unique way that satisfies the efficiency, symmetry, dummy, and additivity axioms. Scott Lundberg and Su-In Lee, in their 2017 NeurIPS paper A Unified Approach to Interpreting Model Predictions, applied this idea to model explanations [6]. They treat each feature as a player and the model's prediction (relative to a baseline) as the payout, then assign each feature its Shapley value. The authors describe their contribution as "the identification of a new class of additive feature importance measures, and theoretical results showing there is a unique solution in this class with a set of desirable properties" [6]. The resulting framework, [SHAP](/wiki/shap), unifies several earlier explainers (LIME, DeepLIFT, layer-wise relevance propagation) as members of that class [6].
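For a handful of features the Shapley value can be computed exactly by enumerating every coalition. The sketch below assumes an interventional value function (features outside the coalition are filled in from background rows and the predictions averaged), which is one of the definitions discussed later in this article rather than the conditional expectation of the original paper. The model and data are invented for illustration.

```python
import itertools
import math
import numpy as np

def exact_shapley(f, x, background):
    """Exact Shapley values for one instance x by enumerating every coalition.

    v(S) is the mean prediction when features in S come from x and the rest
    from each background row. The cost is exponential in the number of features.
    """
    M = len(x)

    def value(S):
        Z = background.copy()
        Z[:, list(S)] = x[list(S)]
        return f(Z).mean()

    phi = np.zeros(M)
    for j in range(M):
        others = [k for k in range(M) if k != j]
        for size in range(M):
            for S in itertools.combinations(others, size):
                weight = math.factorial(size) * math.factorial(M - size - 1) / math.factorial(M)
                phi[j] += weight * (value(S + (j,)) - value(S))
    return phi

rng = np.random.default_rng(0)
background = rng.normal(size=(50, 3))
w = np.array([2.0, -1.0, 0.5])

def f(Z):
    """Hypothetical model: linear terms plus a small interaction."""
    return Z @ w + 0.3 * Z[:, 0] * Z[:, 1]

x = np.array([1.0, 2.0, -1.0])
phi = exact_shapley(f, x, background)
# Efficiency: the attributions sum to f(x) minus the mean background prediction.
print(phi, phi.sum(), f(x[None, :])[0] - f(background).mean())
```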

The naive computation is exponential in the number of features because the Shapley value averages contributions over all subsets. Practical implementations approximate. KernelSHAP regresses a weighted local linear model over sampled coalitions, similar in spirit to LIME but with the Shapley weighting kernel. DeepSHAP adapts DeepLIFT for neural networks. The most successful is TreeSHAP, presented by Lundberg, Erion, and Lee in 2018 and expanded in their 2020 Nature Machine Intelligence paper From local explanations to global understanding with explainable AI for trees [7][8]. TreeSHAP exploits the tree structure to compute exact Shapley values for tree ensembles in polynomial time, reducing the cost from O(T L 2^M) to O(T L D^2), where T is the number of trees, L the maximum number of leaves, D the maximum depth, and M the number of features [8]. The same paper introduces SHAP interaction values, which separate main effects from pairwise interactions, and SHAP summary plots, which combine local and global views in a single chart [8].
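The KernelSHAP idea can be checked on a toy cooperative game whose Shapley values are known. The sketch below enumerates all coalitions (so it is exact rather than sampled), weights them with the Shapley kernel, and enforces the two boundary constraints approximately with a very large weight rather than handling them exactly. The game and its payoffs are hypothetical.

```python
import itertools
import math
import numpy as np

def kernel_shap(value, M, big=1e6):
    """KernelSHAP-style weighted regression over ALL 2^M coalitions (small M only).

    value(S) returns the payoff of a coalition given as a tuple of feature indices.
    The empty and full coalitions get a very large weight, which approximately
    enforces phi_0 = v(empty) and phi_0 + sum(phi) = v(all).
    """
    rows, targets, weights = [], [], []
    for size in range(M + 1):
        for S in itertools.combinations(range(M), size):
            z = np.zeros(M)
            z[list(S)] = 1.0
            if size in (0, M):
                w = big
            else:
                # Shapley kernel weight: (M - 1) / (C(M, |S|) * |S| * (M - |S|))
                w = (M - 1) / (math.comb(M, size) * size * (M - size))
            rows.append(np.concatenate([[1.0], z]))   # leading 1 is the intercept phi_0
            targets.append(value(S))
            weights.append(w)
    sw = np.sqrt(np.array(weights))
    A = np.array(rows) * sw[:, None]
    b = np.array(targets, dtype=float) * sw
    coef, *_ = np.linalg.lstsq(A, b, rcond=None)
    return coef[0], coef[1:]

base = np.array([1.0, 2.0, 3.0, 4.0])

def game(S):
    """Hypothetical payoff: additive parts plus two synergies."""
    S = set(S)
    v = sum(base[i] for i in S)
    if {0, 1} <= S:
        v += 2.0   # pairwise synergy, shared equally by players 0 and 1
    if {1, 2, 3} <= S:
        v += 3.0   # three-way synergy, shared equally by players 1, 2 and 3
    return v

phi0, phi = kernel_shap(game, M=4)
print(np.round(phi0, 4), np.round(phi, 4))   # exact Shapley values are [2, 4, 4, 5]
```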

SHAP values are local: each prediction has its own attribution. Aggregating local SHAP values across a dataset (taking the mean absolute SHAP value per feature) gives a global importance ranking that respects the same axioms [8].
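For a linear model with an interventional background, SHAP values have a closed form, phi_j = w_j (x_j - mean of x_j), so the local-to-global aggregation can be shown without any SHAP library. The weights and feature spreads below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3)) * np.array([1.0, 3.0, 0.5])   # features with different spreads
w = np.array([2.0, 1.0, 4.0])                                 # hypothetical linear model weights

# Local SHAP values: one row per prediction, one column per feature.
shap_values = w * (X - X.mean(axis=0))

# Global importance: mean absolute SHAP value per feature.
global_importance = np.abs(shap_values).mean(axis=0)
print("global mean |SHAP|:", np.round(global_importance, 3))

# A single prediction can rank features differently from the global summary.
print("local |SHAP| for row 0:", np.round(np.abs(shap_values[0]), 3))
```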

The SHAP framework also brought a long-running theoretical debate to the surface. The original paper used conditional expectations to define the value of a coalition (what does the model predict given the features in the coalition, integrating over the rest in a way that respects the data distribution) [6]. Dominik Janzing, Lenon Minorics, and Patrick Bloebaum argued in their 2020 paper Feature relevance quantification in explainable AI: A causal problem that this conflates correlation with causation [14]. They show that unconditional (interventional) expectations match Pearl's interventional semantics and avoid attributing importance to a feature that has no causal effect on the prediction but happens to correlate with one that does [14]. The two definitions agree when features are independent and diverge when they are correlated, which is the more interesting case in practice. Modern SHAP implementations expose both modes; users have to pick.
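A two-feature toy case makes the difference concrete. Assume (x1, x2) is standard bivariate normal with correlation rho and the model uses only x1. Interventional values give x2 exactly zero credit; conditional values give it rho * x2 / 2, because knowing x2 shifts the expected value of x1. The numbers are illustrative.

```python
import numpy as np

rho = 0.8                    # assumed correlation between x1 and x2
x = np.array([1.0, 2.0])     # instance to explain: x1 = 1, x2 = 2
# Assumption: (x1, x2) is standard bivariate normal, and the model is f(x) = x1.

def shapley_two_features(v):
    """Two-player Shapley values from a dict mapping coalitions to values."""
    phi1 = 0.5 * ((v[(1,)] - v[()]) + (v[(1, 2)] - v[(2,)]))
    phi2 = 0.5 * ((v[(2,)] - v[()]) + (v[(1, 2)] - v[(1,)]))
    return np.array([phi1, phi2])

# Interventional: a feature outside the coalition is drawn from its marginal,
# so fixing x2 alone leaves E[f] = E[x1] = 0.
v_int = {(): 0.0, (1,): x[0], (2,): 0.0, (1, 2): x[0]}

# Conditional: fixing x2 alone gives E[x1 | x2] = rho * x2.
v_cond = {(): 0.0, (1,): x[0], (2,): rho * x[1], (1, 2): x[0]}

print("interventional:", shapley_two_features(v_int))   # x2 gets exactly 0
print("conditional:   ", shapley_two_features(v_cond))  # x2 gets rho * x2 / 2 = 0.8
```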

## How do LIME, Anchors, and counterfactuals differ?

[LIME](/wiki/lime) (Local Interpretable Model-agnostic Explanations) was introduced by Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin at KDD 2016 [4]. The recipe is short: pick a prediction to explain, sample perturbed instances around it, weight the samples by proximity, fit a sparse linear model on those weighted samples, and present the linear coefficients as the explanation. LIME treats the original model as a black box and assumes that any complex decision boundary looks linear when zoomed in close enough.
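The recipe can be sketched in a few lines. This is not the `lime` package, which differs in details such as discretisation and feature selection; the sketch perturbs with Gaussian noise, uses an exponential proximity kernel with an illustrative width, and fits scikit-learn's `Lasso` with sample weights. The black box is a hypothetical linear function, so the surrogate should recover its weights.

```python
import numpy as np
from sklearn.linear_model import Lasso

def lime_explain(predict, x, scale, n_samples=5000, alpha=0.01, seed=0):
    """LIME-style local surrogate for one tabular instance (a sketch, not the lime package)."""
    rng = np.random.default_rng(seed)
    kernel_width = 0.75 * np.sqrt(x.size)   # illustrative kernel width
    # 1. Sample perturbed instances around x.
    Z = x + rng.normal(size=(n_samples, x.size)) * scale
    # 2. Weight each sample by its proximity to x with an exponential kernel.
    dist = np.linalg.norm((Z - x) / scale, axis=1)
    weights = np.exp(-(dist ** 2) / kernel_width ** 2)
    # 3. Fit a sparse (L1-penalised) linear model to the black-box outputs.
    surrogate = Lasso(alpha=alpha).fit(Z - x, predict(Z), sample_weight=weights)
    # 4. The surrogate's coefficients are the explanation.
    return surrogate.coef_

w_true = np.array([3.0, -2.0, 0.0])

def black_box(Z):
    """Hypothetical model; linear here so the right answer is known."""
    return Z @ w_true + 1.0

x = np.array([0.5, -1.0, 2.0])
coef = lime_explain(black_box, x, scale=np.ones(3))
print(np.round(coef, 2))
```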

LIME's strength is interpretability. A linear model with five non-zero terms is easy to read. Its weakness is stability. Small changes in the perturbation kernel, the sampling seed, or the choice of distance metric can produce different explanations for the same instance, which makes it hard to use LIME for high-stakes decisions without ensembling or careful tuning.

Anchors, introduced by the same authors at AAAI 2018, replaces the linear surrogate with a high-precision rule [5]. An anchor is an if-then condition on the inputs (for example, if income > 50k AND age < 35) that locks in the prediction with a stated probability across the perturbed neighbourhood. The result is a sufficient explanation rather than a continuous attribution, which fits cases where users want to understand which feature values guarantee the outcome.
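Finding anchors requires a search procedure, but checking a candidate anchor's precision is simple: sample from a perturbation distribution, keep the samples that satisfy the rule, and measure how often the model's prediction matches the instance's. The model, distribution, and rules below are hypothetical; the helper `anchor_precision` returns precision together with coverage (the fraction of samples the rule applies to).

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical perturbation distribution: income (thousands) and age (years).
samples = np.column_stack([rng.uniform(20, 120, 20000), rng.uniform(18, 70, 20000)])

def model(X):
    """Toy black box: approve (1) if income > 50 and age < 35, otherwise deny (0)."""
    return ((X[:, 0] > 50) & (X[:, 1] < 35)).astype(int)

def anchor_precision(rule, instance, predict, samples):
    """Precision and coverage of a candidate anchor over perturbed samples."""
    covered = samples[rule(samples)]
    target = predict(instance[None, :])[0]
    precision = float(np.mean(predict(covered) == target))
    coverage = len(covered) / len(samples)
    return precision, coverage

def full_rule(X):
    return (X[:, 0] > 50) & (X[:, 1] < 35)

def income_only(X):
    return X[:, 0] > 50

instance = np.array([80.0, 30.0])   # an approved applicant
p_full, cov_full = anchor_precision(full_rule, instance, model, samples)
p_inc, cov_inc = anchor_precision(income_only, instance, model, samples)
print(f"income > 50 AND age < 35: precision {p_full:.2f}, coverage {cov_full:.2f}")
print(f"income > 50 alone:        precision {p_inc:.2f}, coverage {cov_inc:.2f}")
```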

Counterfactual explanations approach the problem differently. Instead of asking how each feature contributed, they ask for the smallest change to the input that would flip the prediction. Sandra Wachter, Brent Mittelstadt, and Chris Russell introduced unconditional counterfactual explanations in 2017 specifically as a way to satisfy the GDPR's right-to-explanation requirements without revealing model internals [12]. A loan denial counterfactual might say "if your annual income had been 5,000 higher, the model would have approved the loan," which is actionable in a way that a SHAP plot sometimes is not.
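Wachter and colleagues frame the search as an optimisation: minimise a weighted squared gap between the model's output and the desired output plus a distance to the original input, increasing the weight until the desired outcome is reached. The sketch below follows that shape for a hypothetical two-feature logistic model, using squared Euclidean distance and plain gradient descent for simplicity (the paper uses a distance weighted by median absolute deviation).

```python
import numpy as np

# Hypothetical logistic loan model on two standardised features (income, debt ratio).
w = np.array([1.5, -2.0])
b = -0.5

def predict_proba(x):
    """Probability of approval under the toy model."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

def wachter_counterfactual(x, target=1.0, lambdas=(1, 10, 100, 1000), steps=5000):
    """Minimise lam * (p(x') - target)^2 + ||x' - x||^2, raising lam until the decision flips."""
    for lam in lambdas:
        x_cf = x.copy()
        lr = 1.0 / (2.0 + 0.5 * lam * (w @ w))   # conservative step size for this lam
        for _ in range(steps):
            p = predict_proba(x_cf)
            grad = 2.0 * lam * (p - target) * p * (1.0 - p) * w + 2.0 * (x_cf - x)
            x_cf = x_cf - lr * grad
        if predict_proba(x_cf) >= 0.5:
            return x_cf, lam
    return None, None

x = np.array([-0.5, 0.5])   # a denied applicant
x_cf, lam = wachter_counterfactual(x)
print(f"p before: {predict_proba(x):.3f}  p after: {predict_proba(x_cf):.3f}  change: {x_cf - x}")
```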

## How is feature importance computed for neural networks?

Neural networks are differentiable, which opens a different family of attribution methods. The simplest is the gradient of the output with respect to the input (the saliency map), but raw gradients have known problems: they saturate when activations are flat, they can give zero attribution to features the network clearly uses, and they fail certain natural axioms [9].

[Integrated Gradients](/wiki/integrated_gradients), proposed by Mukund Sundararajan, Ankur Taly, and Qiqi Yan at ICML 2017 in Axiomatic Attribution for Deep Networks, addresses these problems [9]. The method picks a baseline (often the all-zero input or a blurred version of the image) and integrates the gradient along a straight line from the baseline to the actual input. The resulting attribution satisfies two axioms the authors call sensitivity and implementation invariance, which other gradient methods violate [9]. In practice the integral is approximated with a Riemann sum over twenty to a few hundred steps, and the attributions sum to the difference between the prediction at the input and the prediction at the baseline [9].
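Saturation and completeness are easy to see on a two-input variant of the saturating example in the paper, f(x) = 1 - ReLU(1 - x1 - x2), with the all-zero baseline. At x = (1, 1) the raw gradient is zero, yet the output differs from the baseline's by 1; a midpoint Riemann sum of Integrated Gradients recovers that difference. Gradients are written analytically so the sketch needs only NumPy.

```python
import numpy as np

def f(x):
    """Saturating toy model: f(x) = 1 - ReLU(1 - x1 - x2)."""
    return 1.0 - max(0.0, 1.0 - x[0] - x[1])

def grad_f(x):
    """Analytic gradient of f: each partial is 1 while x1 + x2 < 1, else 0."""
    g = 1.0 if x[0] + x[1] < 1.0 else 0.0
    return np.array([g, g])

def integrated_gradients(f_grad, x, baseline, steps=50):
    """Midpoint Riemann-sum approximation of Integrated Gradients."""
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.array([f_grad(baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)

x = np.array([1.0, 1.0])
baseline = np.zeros(2)

print("saliency (raw gradient):", grad_f(x))               # [0. 0.]: saturated
ig = integrated_gradients(grad_f, x, baseline)
print("integrated gradients:", ig)                          # [0.5 0.5]
print("completeness check:", ig.sum(), f(x) - f(baseline))  # both 1.0
```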

DeepLIFT, by Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje at ICML 2017, computes attributions in a single backward pass by comparing each neuron's activation to a reference activation and propagating contribution scores through the network [10]. DeepLIFT separates positive and negative contributions, which can reveal patterns that gradient-based methods miss.
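The core of DeepLIFT's Rescale rule fits a one-hidden-layer ReLU network: each ReLU's multiplier is its change in output divided by its change in input relative to the reference, and multipliers are chained back through the linear layers in one pass. The weights below are random placeholders; the check is that contributions sum to the change in output (the summation-to-delta property). DeepLIFT's RevealCancel rule, which treats positive and negative parts separately, is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical tiny network: h = ReLU(W1 @ x + b1), y = w2 @ h + b2
W1 = rng.normal(size=(4, 3))
b1 = rng.normal(size=4)
w2 = rng.normal(size=4)
b2 = 0.1

def forward(x):
    z = W1 @ x + b1
    h = np.maximum(z, 0.0)
    return z, h, w2 @ h + b2

def deeplift_rescale(x, x_ref):
    """Input contributions under DeepLIFT's Rescale rule, relative to a reference input."""
    z, h, y = forward(x)
    z_ref, h_ref, y_ref = forward(x_ref)
    dz, dh = z - z_ref, h - h_ref
    # Rescale rule: each ReLU's multiplier is delta-output / delta-input;
    # fall back to the ordinary gradient when the input barely changed.
    moved = np.abs(dz) > 1e-12
    m_hidden = np.where(moved, dh / np.where(moved, dz, 1.0), (z > 0).astype(float))
    # Linear layers pass multipliers back unchanged (chain rule for multipliers).
    m_input = (w2 * m_hidden) @ W1
    return m_input * (x - x_ref), y - y_ref

x = np.array([1.0, -0.5, 2.0])
contributions, delta_y = deeplift_rescale(x, np.zeros(3))
print(np.round(contributions, 3), contributions.sum(), delta_y)   # the sum equals delta_y
```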

Grad-CAM, by Ramprasaath Selvaraju and colleagues at ICCV 2017, is the standard tool for explaining convolutional networks [11]. It uses the gradients flowing into the last convolutional layer to produce a coarse heat map highlighting the image regions that contributed to a class prediction. Grad-CAM is widely used in medical imaging because the resulting maps line up well with what a radiologist would point to.
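Given the last convolutional layer's activations and the gradients of the class score with respect to them (obtained from any autodiff framework), the Grad-CAM map is a ReLU of the feature maps weighted by their average gradients. The sketch below assumes those two arrays are already available, uses random arrays as placeholders, and omits the upsampling to input resolution that is normally applied before overlaying the map.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM heat map for one image.

    activations: array (K, H, W), the K feature maps A^k of the last conv layer.
    gradients:   array (K, H, W), d(class score) / dA^k for the same image.
    """
    alpha = gradients.mean(axis=(1, 2))                  # global-average-pool the gradients
    cam = np.tensordot(alpha, activations, axes=(0, 0))  # weighted sum of feature maps, (H, W)
    cam = np.maximum(cam, 0.0)                           # keep only positive evidence
    if cam.max() > 0:
        cam = cam / cam.max()                            # scale to [0, 1] for display
    return cam

rng = np.random.default_rng(0)
A = rng.random((8, 7, 7))            # placeholder activations
G = rng.normal(size=(8, 7, 7))       # placeholder gradients
heat = grad_cam(A, G)
print(heat.shape, heat.min(), heat.max())
```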

Attention weights inside transformer models are sometimes presented as a free explanation, since attention values are already produced during the forward pass. Sarthak Jain and Byron Wallace reported in 2019 that attention weights can often be replaced with very different distributions while leaving the prediction unchanged. Sarah Wiegreffe and Yuval Pinter's EMNLP 2019 reply, Attention is not not Explanation, argued that this does not by itself rule out attention as an explanation and proposed diagnostics for testing it [15]. Either way, attention weights should not be treated as faithful attributions without further validation.

## What are the most common pitfalls?

Variable importance is easy to compute and easy to misread. The literature is full of cautionary results.

| Pitfall | What goes wrong | Mitigation |
|---------|-----------------|------------|
| MDI bias | High-cardinality features inflate scores | Use permutation importance on held-out data, or unbiased trees |
| Permutation with correlated features | Importance is shared across correlated columns and looks weak | Conditional permutation, group features, or look at SHAP interactions |
| Training set scoring | Importance reflects memorisation rather than generalisation | Always compute on a validation or test set |
| Single-method reliance | Different methods give different rankings | Compute several scores and look for agreement |
| Causal vs predictive confusion | High importance does not mean the variable causes the outcome | Use causal inference, do-calculus, or randomised experiments for causal claims |
| Local vs global confusion | The most important feature on average may be irrelevant for an individual prediction | Look at SHAP local explanations, not just the global summary |
| Stochastic methods | LIME and KernelSHAP have run-to-run variance | Average across many seeds, report variance |
| Ignoring interactions | A single number per feature loses signal when features interact | Plot SHAP interaction values, ICE curves, or PDPs |
| Baseline choice for IG | Different baselines yield different attributions | Try several baselines, average, document the choice |

Janzing's critique deserves repeating because it cuts deep. A SHAP value computed with conditional expectations can attribute importance to a feature that has no influence on the model's prediction function whenever that feature is correlated with one that does [14]. The fix (interventional Shapley values) is mathematically clean, but it evaluates the model on feature combinations that break the data distribution, which sometimes means predictions on impossible inputs. There is no escape from making the trade-off explicit.

## What does a practical workflow look like?

The following workflow captures what experienced practitioners actually do when asked to explain a tabular model.

1. Train and validate the model. Importance is meaningless if the model itself does not generalise.
2. Compute several importance scores in parallel. Pair MDI with permutation importance and SHAP. If they agree, the answer is robust. If they disagree, dig into why.
3. Always compute permutation importance and SHAP on held-out data, never on the training set [19].
4. Plot a SHAP summary plot to see both the magnitude and the direction of each feature's effect [8].
5. Plot SHAP dependence plots for the top features to see how the effect changes across the feature's range and whether interactions are present.
6. Validate by ablation: drop the top features and retrain. The model performance should fall by roughly the amount the importance scores predicted.
7. If the goal is feature selection, use recursive feature elimination or LASSO regularisation. Pure ranking is rarely the right answer.
8. If the goal is causal, use a method designed for causal inference (instrumental variables, regression discontinuity, do-calculus). Importance is not a substitute [14].
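Step 6 can be sketched as drop-column importance: retrain without each feature in turn and record the held-out score lost. The dataset and model are illustrative choices; the cost is one retraining per feature, and scores can be slightly negative when a feature only adds noise.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(n_estimators=100, random_state=0)
full_score = clone(model).fit(X_train, y_train).score(X_val, y_val)   # held-out R^2

drop_importance = {}
for col in X.columns:
    # Retrain without the column and measure how much held-out R^2 is lost.
    reduced = clone(model).fit(X_train.drop(columns=col), y_train)
    drop_importance[col] = full_score - reduced.score(X_val.drop(columns=col), y_val)

for col, imp in sorted(drop_importance.items(), key=lambda kv: -kv[1]):
    print(f"{col:<6} {imp:+.3f}")
```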

## Which libraries implement variable importance?

Most of the methods discussed have battle-tested open source implementations.

| Library | Methods | Notes |
|---------|---------|-------|
| scikit-learn | `feature_importances_`, `permutation_importance` | Built into every tree-based estimator |
| SHAP (Python) | KernelSHAP, TreeSHAP, DeepSHAP, GradientSHAP, LinearSHAP | Reference implementation by Scott Lundberg |
| LIME (Python) | Tabular, text, and image explainers | Original implementation by Marco Tulio Ribeiro |
| Captum (PyTorch) | Integrated Gradients, DeepLIFT, GradientSHAP, Saliency, Grad-CAM | Maintained by Meta AI |
| Alibi Explain | Anchors, counterfactuals, integrated gradients | Maintained by Seldon |
| InterpretML | Explainable Boosting Machines plus wrappers around SHAP and LIME | From Microsoft Research |
| H2O | MOJO scoring with SHAP for GBM, GLM, RF | Production-friendly Java runtime |
| XGBoost / LightGBM / CatBoost | Built-in `gain`, `weight`, `cover` plus integrated SHAP | SHAP support is native, not a wrapper |
| dalex | Permutation importance, partial dependence, SHAP, ceteris paribus profiles | R and Python |
| ELI5 | Permutation importance, formatted explanations | Python |

A minimal SHAP example for a [gradient boosting](/wiki/gradient_boosting) model in scikit-learn looks like this.

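```python
import shap
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# California housing: eight numeric features; the target is median house value.
X, y = fetch_california_housing(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)

# TreeExplainer uses the TreeSHAP algorithm for tree ensembles.
explainer = shap.TreeExplainer(model)
# Explain held-out rows; the result has one row per sample and one column per feature.
shap_values = explainer.shap_values(X_test)

# Summary plot: one row per feature, one dot per test sample.
shap.summary_plot(shap_values, X_test)
```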

The summary plot shows each feature on a row, with one dot per test sample positioned by its SHAP value and coloured by the feature's value. This single chart reveals magnitude, direction, and interaction structure at once, which is why it has become the default global explanation in tabular data science [8].

## How is variable importance changing in modern AI?

Feature importance grew up in the world of small tabular models. Two trends are reshaping it.

The first is regulation. Article 22 of the General Data Protection Regulation gives data subjects the right not to be subject to decisions based solely on automated processing that produce legal or similarly significant effects, and Article 15(1)(h) requires meaningful information about the logic involved [21]. The EU AI Act (Regulation (EU) 2024/1689), which became law in 2024 and is rolling out through 2026, adds Article 86, which grants an affected person the right to obtain from the deployer "clear and meaningful explanations of the role of the AI system in the decision-making procedure and the main elements of the decision taken" for decisions from a high-risk AI system that significantly affect their health, safety, or fundamental rights [22]. SHAP and counterfactual explanations are among the methods most often cited in compliance discussions, because both produce per-instance explanations that can help meet the meaningful-information requirement without forcing the company to reveal the entire model. Insurance, lending, hiring, and healthcare deployments increasingly bundle SHAP explanations with every prediction.

The second is mechanistic interpretability for [large language models](/wiki/large_language_model). Feature importance over input tokens is shallow; modern LLM interpretability digs into the network's internal representations. Anthropic's 2023 paper Towards Monosemanticity by Trenton Bricken and colleagues used sparse autoencoders to "decompose a layer with 512 neurons into more than 4000 features," which human raters and automated analysis judged far more interpretable than the underlying neurons [16]. The 2024 follow-up Scaling Monosemanticity by Adly Templeton and colleagues scaled the technique to [Claude 3](/wiki/claude_3) Sonnet, extracting roughly 34 million features and discovering interpretable directions for concepts ranging from the Golden Gate Bridge to insecure code [17]. These methods are not labelled feature importance, but they answer the same underlying question (what is this model relying on?) at a much finer grain than input-level scores can. For tabular data, classical importance still rules. For language models, sparse autoencoders and circuit analysis are the new front line.

## What are the limitations of variable importance?

Variable importance is a tool, not a theorem. Several limitations apply across methods.

A single number per feature loses interaction information. If two features only matter when they appear together (think latitude and longitude), neither will look important in isolation. SHAP interaction values, partial dependence plots, and individual conditional expectation curves recover some of the lost structure [8].
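A product of two features is the simplest stand-in for the latitude and longitude case. In the sketch below (synthetic data, hypothetical model), the partial dependence curve for x1 is flat because positive and negative values of x2 cancel on average, while the individual conditional expectation (ICE) curves show that x1 matters a great deal for each sample.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
X -= X.mean(axis=0)                      # centre both features exactly

def f(X):
    """Hypothetical model with a pure interaction: x1 matters only through x1 * x2."""
    return X[:, 0] * X[:, 1]

grid = np.linspace(-2.0, 2.0, 9)
# ICE: one curve per sample, sweeping x1 over the grid while keeping that sample's x2.
ice = np.empty((len(X), len(grid)))
for k, v in enumerate(grid):
    X_mod = X.copy()
    X_mod[:, 0] = v
    ice[:, k] = f(X_mod)
pd_curve = ice.mean(axis=0)              # partial dependence is the average ICE curve

pd_range = pd_curve.max() - pd_curve.min()
ice_range = (ice.max(axis=1) - ice.min(axis=1)).mean()
print(f"PD range: {pd_range:.2e}   mean ICE range: {ice_range:.2f}")
```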

Local and global importance are different. The feature with the highest mean importance across a dataset may be irrelevant for explaining one specific prediction, and vice versa. Both views matter.

Importance is not causation. Judea Pearl has argued for years that statistical methods alone cannot answer causal questions; you need a causal model. A feature can have high importance because it really drives the outcome, because it acts as a proxy for the true cause, or because the data collection introduced a confounder [14]. Distinguishing the cases requires either a randomised experiment or a causal model with assumptions you are willing to defend.

Finally, importance scores depend on the data the model was trained on. Move the model to a different population, and the importances can shift even if the model's parameters are unchanged [19]. Practitioners working across domains should recompute importance on data drawn from the deployment population, not the training set.

## Explain like I'm 5

Imagine you are baking a cake and want to know which ingredient matters most for how good it tastes. You could bake the cake the normal way and ask people to score it. Then you bake it again, but secretly swap the sugar for sand. People hate it, so sugar must be very important. You bake again, this time leaving out one tablespoon of vanilla. People barely notice, so vanilla is less important. Doing this for every ingredient gives you a ranking from most important to least important. That is what variable importance does for a machine learning model. The model is the recipe, the ingredients are the input columns, and the ranking tells you which inputs the model leaned on the most when it made its predictions.

## References

1. Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5-32. https://www.stat.berkeley.edu/~breiman/randomforest2001.pdf
2. Strobl, C., Boulesteix, A.-L., Zeileis, A., and Hothorn, T. (2007). Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics, 8, 25. https://link.springer.com/article/10.1186/1471-2105-8-25
3. Strobl, C., Boulesteix, A.-L., Kneib, T., Augustin, T., and Zeileis, A. (2008). Conditional variable importance for random forests. BMC Bioinformatics, 9, 307. https://link.springer.com/article/10.1186/1471-2105-9-307
4. Ribeiro, M. T., Singh, S., and Guestrin, C. (2016). Why Should I Trust You? Explaining the Predictions of Any Classifier. KDD 2016. https://arxiv.org/abs/1602.04938
5. Ribeiro, M. T., Singh, S., and Guestrin, C. (2018). Anchors: High-Precision Model-Agnostic Explanations. AAAI 2018. https://homes.cs.washington.edu/~marcotcr/aaai18.pdf
6. Lundberg, S. M., and Lee, S.-I. (2017). A Unified Approach to Interpreting Model Predictions. NeurIPS 2017. https://arxiv.org/abs/1705.07874
7. Lundberg, S. M., Erion, G., and Lee, S.-I. (2018). Consistent Individualized Feature Attribution for Tree Ensembles. https://arxiv.org/abs/1802.03888
8. Lundberg, S. M., Erion, G., Chen, H., et al. (2020). From local explanations to global understanding with explainable AI for trees. Nature Machine Intelligence, 2, 56-67. https://www.nature.com/articles/s42256-019-0138-9
9. Sundararajan, M., Taly, A., and Yan, Q. (2017). Axiomatic Attribution for Deep Networks. ICML 2017. https://arxiv.org/abs/1703.01365
10. Shrikumar, A., Greenside, P., and Kundaje, A. (2017). Learning Important Features Through Propagating Activation Differences. ICML 2017. https://arxiv.org/abs/1704.02685
11. Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., and Batra, D. (2017). Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization. ICCV 2017. https://arxiv.org/abs/1610.02391
12. Wachter, S., Mittelstadt, B., and Russell, C. (2017). Counterfactual Explanations Without Opening the Black Box: Automated Decisions and the GDPR. Harvard Journal of Law and Technology, 31, 841-887. https://arxiv.org/abs/1711.00399
13. Fisher, A., Rudin, C., and Dominici, F. (2019). All Models are Wrong, but Many are Useful: Learning a Variable's Importance by Studying an Entire Class of Prediction Models Simultaneously. Journal of Machine Learning Research, 20(177), 1-81. https://jmlr.org/papers/v20/18-760.html
14. Janzing, D., Minorics, L., and Bloebaum, P. (2020). Feature relevance quantification in explainable AI: A causal problem. AISTATS 2020. https://arxiv.org/abs/1910.13413
15. Wiegreffe, S., and Pinter, Y. (2019). Attention is not not Explanation. EMNLP 2019. https://arxiv.org/abs/1908.04626
16. Bricken, T., Templeton, A., Batson, J., et al. (2023). Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. Anthropic. https://transformer-circuits.pub/2023/monosemantic-features/index.html
17. Templeton, A., Conerly, T., Marcus, J., et al. (2024). Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. Anthropic. https://transformer-circuits.pub/2024/scaling-monosemanticity/
18. Molnar, C. (2024). Interpretable Machine Learning, 2nd edition. https://christophm.github.io/interpretable-ml-book/
19. scikit-learn developers. Permutation feature importance. https://scikit-learn.org/stable/modules/permutation_importance.html
20. XGBoost developers. Feature importance types (Python API). https://xgboost.readthedocs.io/en/latest/python/python_api.html
21. European Parliament and Council. Regulation (EU) 2016/679 (General Data Protection Regulation), Article 22. https://gdpr-info.eu/art-22-gdpr/
22. European Parliament and Council. Regulation (EU) 2024/1689 (Artificial Intelligence Act), Article 86. https://artificialintelligenceact.eu/article/86/