# Feature Importances

> Source: https://aiwiki.ai/wiki/feature_importances
> Updated: 2026-06-23
> Categories: Interpretability, Machine Learning, Model Evaluation
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Feature importances** are numeric scores that quantify how much each input [feature](/wiki/feature) contributes to the predictions of a [machine learning](/wiki/machine_learning) model. The three dominant techniques answer different questions: impurity-based (Gini) importance measures how much a feature reduces training loss inside a [decision tree](/wiki/decision_tree), permutation importance measures how much shuffling a feature degrades held-out accuracy, and [SHAP](/wiki/shap) values measure each feature's contribution to a single prediction using cooperative game theory. Understanding which features matter most is central to model [interpretability](/wiki/interpretability), [feature engineering](/wiki/feature_engineering), [feature selection](/wiki/feature_selection), debugging, and building trust in deployed systems. A wide range of methods exist, from measures built into specific model families to model-agnostic approaches that work with any estimator.

The topic sits at the intersection of statistics, optimization, and [explainable AI](/wiki/explainable_ai). Feature importance methods do not all answer the same question: some measure how a feature shapes the loss during training, some measure how it affects predictions on held-out data, and some measure how it contributes to a single prediction for a single sample. A key practical caveat is that the most common built-in method, impurity importance, is systematically biased toward high-cardinality features, a finding established by Strobl et al. (2007).[5]

## Why does feature importance matter?

Knowing which features drive a model's predictions serves several practical purposes:

- **Model interpretability:** Stakeholders and regulators often need to understand why a model makes certain decisions, especially in healthcare, finance, and criminal justice. Frameworks such as the European Union AI Act and the U.S. Equal Credit Opportunity Act effectively demand that operators of high-stakes models articulate the reasons behind individual decisions.
- **Feature selection:** Removing uninformative features reduces training time, lowers the risk of [overfitting](/wiki/overfitting), and simplifies models without sacrificing performance. Stable importance rankings drive [feature selection](/wiki/feature_selection) wrappers such as Recursive Feature Elimination (RFE) and Boruta.
- **Debugging:** If a model assigns high importance to a feature that should be irrelevant (for example, a row index or a record ID), that signals data leakage or a modeling error. Importance audits routinely catch leakage where future information has accidentally been included in training data.
- **Domain insight:** Feature importance rankings can reveal unexpected relationships in the data, generating hypotheses for further scientific investigation. In computational biology, importance scores from tree ensembles routinely surface candidate genes for follow-up wet-lab experiments.
- **Bias and fairness audits:** Surfaced feature contributions help identify proxies for protected attributes. If a sensitive attribute or its proxy ranks highly, that finding feeds into a [fairness](/wiki/fairness) intervention.
- **Model compression:** When deploying to embedded devices, dropping low-importance inputs reduces memory and inference cost without notably hurting accuracy.

Across all of these uses, the key question is whether the method's ranking matches the predictive structure that the user actually cares about. Different methods give different answers because they each measure a different mathematical object.

## What is the difference between global and local importance?

Feature importance methods split into two broad categories based on the scope of the explanation they provide:

- **Global importance** describes how each feature contributes across the entire dataset or population. Global scores are useful for feature selection, model documentation, and high-level audits. Examples include impurity-based importance for tree ensembles and permutation importance computed over a full test set.
- **Local importance** describes how each feature contributes to a single prediction for a single instance. Local explanations are necessary for adverse-action notices in lending or per-patient risk reports in clinical decision support. Examples include [SHAP](/wiki/shap) values for an individual sample and [LIME](/wiki/lime) explanations.

A second axis distinguishes **model-specific** methods, which exploit the internal structure of a particular family (such as the split statistics of a [decision tree](/wiki/decision_tree)), from **model-agnostic** methods, which treat the model as a black box. Model-specific methods are usually faster and can be exact but apply only to compatible models. Model-agnostic methods generalize across architectures at the cost of additional computation.

The two axes combine to give a 2x2 taxonomy that helps practitioners locate any method in the space:

| Scope vs. coupling | Model-specific | Model-agnostic |
|---|---|---|
| **Global** | Mean Decrease in Impurity, XGBoost gain/weight/cover, linear coefficient magnitude | Permutation importance, drop-column (LOCO), global SHAP aggregates |
| **Local** | TreeSHAP, DeepLIFT, Integrated Gradients (uses model gradients) | KernelSHAP, LIME, counterfactual explanations |

## Built-in (model-specific) methods

### How does impurity-based importance (mean decrease in impurity) work?

Impurity-based importance, also called Mean Decrease in Impurity (MDI) or Gini importance, is the default feature importance method for tree-based models such as [random forest](/wiki/random_forest), [gradient boosting](/wiki/gradient_boosting), and individual decision trees. During training, each node in a tree splits on a feature to reduce impurity (measured by the Gini index for classification or variance for regression). The importance of a feature is the total reduction in impurity it provides across all splits in all trees, weighted by the number of samples reaching each split.

Formally, for a single tree the MDI importance of feature $j$ is:

$$\mathrm{MDI}(j) = \sum_{t \in T_j} \frac{N_t}{N} \cdot \Delta i(t)$$

where $T_j$ is the set of internal nodes that split on feature $j$, $N_t$ is the number of training samples reaching node $t$, $N$ is the total number of training samples, and $\Delta i(t)$ is the impurity decrease at node $t$. For a forest of $B$ trees, the per-feature importance is averaged: $\mathrm{MDI}_\mathrm{forest}(j) = \frac{1}{B} \sum_{b=1}^{B} \mathrm{MDI}_b(j)$.

In [scikit-learn](/wiki/scikit_learn), impurity-based importances are accessible through the `feature_importances_` attribute of any fitted tree-based estimator:

```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Impurity-based importances
importances = model.feature_importances_
```
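
As a sanity check on the formula, the sketch below recomputes MDI for a single tree from the fitted tree's node arrays and compares the result with `feature_importances_`. The synthetic dataset and the variable names are illustrative assumptions, and the final rescaling step assumes that scikit-learn normalizes the importances so they sum to one.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration only)
X, y = make_classification(n_samples=300, n_features=5, n_informative=3, random_state=0)
clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
tree = clf.tree_

mdi = np.zeros(X.shape[1])
n_total = tree.weighted_n_node_samples[0]  # N: samples at the root
for t in range(tree.node_count):
    left, right = tree.children_left[t], tree.children_right[t]
    if left == -1:  # leaf node: no split, no impurity decrease
        continue
    n_t = tree.weighted_n_node_samples[t]
    n_l = tree.weighted_n_node_samples[left]
    n_r = tree.weighted_n_node_samples[right]
    # Impurity decrease at node t: parent impurity minus weighted child impurities
    delta = tree.impurity[t] - (n_l / n_t) * tree.impurity[left] - (n_r / n_t) * tree.impurity[right]
    mdi[tree.feature[t]] += (n_t / n_total) * delta

# Rescale so the importances sum to one, matching the library's convention
mdi_normalized = mdi / mdi.sum()
print(np.allclose(mdi_normalized, clf.feature_importances_))
```

The same loop applied to every tree in a forest, followed by averaging, gives the forest-level score described above.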

**Advantages:** MDI is fast to compute because it requires no additional evaluation after training. It is readily available for all tree-based models in scikit-learn, [XGBoost](/wiki/xgboost), [LightGBM](/wiki/lightgbm), and [CatBoost](/wiki/catboost).

**Limitations:** MDI is biased toward high-cardinality features (features with many unique values, such as continuous variables or categorical variables with many categories). Because high-cardinality features offer more candidate split points, they are more likely to produce a seemingly good split purely by chance. MDI is also computed on training data, so it can inflate the importance of features the model has overfit to. Strobl et al. (2007) demonstrated this bias directly, concluding that "the variable importance measures of Breiman's original random forest method are not reliable in situations where potential predictor variables vary in their scale of measurement or their number of categories."[5] The scikit-learn documentation echoes the warning, noting that "impurity-based feature importance for trees is strongly biased and favor high cardinality features (typically numerical features) over low cardinality features such as binary features or categorical variables with a small number of possible categories."[29] Hooker and Mentch (2019) showed that permutation-based importances are not a complete remedy: when features are correlated, permuting one of them forces the model to extrapolate to unrealistic inputs, which distorts the resulting scores.[7]

### XGBoost importance: gain, weight, and cover

[XGBoost](/wiki/xgboost) exposes three native global importance measures, each computed from the gradient-boosted tree ensemble. Practitioners often inspect more than one because rankings can disagree.

| XGBoost importance type | Definition | What it measures | Typical use |
|---|---|---|---|
| **gain** | Average loss reduction contributed by splits on a feature | How much accuracy a feature buys when it is used | Default for `feature_importances_` in the scikit-learn wrapper with tree boosters; closest to MDI |
| **weight** (also `frequency`) | Number of times a feature is used as a split | Raw split count | Default for `Booster.get_score`; sensitive to deep trees and many small-gain splits |
| **cover** | Average coverage of splits on a feature, where coverage is a Hessian-weighted count of the samples reaching those splits | Coverage of the feature in the dataset | Useful for spotting features that affect rare regions |

In XGBoost the importance type is selected via the `importance_type` argument:

```python
import xgboost as xgb

model = xgb.XGBClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

gain = model.get_booster().get_score(importance_type="gain")
weight = model.get_booster().get_score(importance_type="weight")
cover = model.get_booster().get_score(importance_type="cover")
```

In [LightGBM](/wiki/lightgbm), `feature_importance(importance_type="split")` returns the weight-style count and `importance_type="gain"` returns the gain-style loss reduction. CatBoost exposes a `PredictionValuesChange` importance and a `LossFunctionChange` importance, the latter similar in spirit to permutation importance.

### Coefficient magnitude for linear models

For [linear regression](/wiki/linear_regression) and [logistic regression](/wiki/logistic_regression), the absolute value of each feature's learned coefficient can serve as a measure of importance. A larger absolute coefficient means the feature has a stronger influence on the prediction.

However, raw coefficients are only comparable when all features share the same scale. If one feature is measured in thousands and another in fractions, their coefficients will reflect those scales rather than true importance. The standard practice is to standardize all features (zero mean, unit variance) before training, which produces standardized coefficients that can be directly compared:

```python
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
import numpy as np

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

model = LogisticRegression().fit(X_train_scaled, y_train)
importances = np.abs(model.coef_[0])
```

For a generalized linear model with link $g$, the contribution of feature $x_j$ to the linear predictor is $\beta_j x_j$. The signed contribution is what regulators typically expect for adverse-action notices, since it preserves direction.
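
A minimal sketch of these signed contributions for a logistic regression follows. The synthetic dataset and variable names are assumptions made for illustration; the point is only that the per-feature terms $\beta_j x_j$ plus the intercept add up to the model's log-odds for that instance.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (illustrative assumption)
X, y = make_classification(n_samples=400, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)
X_scaled = StandardScaler().fit_transform(X)
clf = LogisticRegression().fit(X_scaled, y)

x = X_scaled[0]                       # the instance to explain
contributions = clf.coef_[0] * x      # signed beta_j * x_j, one value per feature
log_odds = clf.intercept_[0] + contributions.sum()

# The contributions reconstruct the linear predictor (log-odds) exactly
print(contributions)
print(np.isclose(log_odds, clf.decision_function(X_scaled[:1])[0]))
```

Because the features were standardized, a contribution of zero corresponds to a feature sitting at its training mean.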

**Limitations:** When features are highly correlated ([multicollinearity](/wiki/multicollinearity)), coefficients become unstable and can swing between positive and negative values. In such cases, coefficient magnitude is unreliable. [L1](/wiki/l1_regularization) and [L2](/wiki/l2_regularization) regularization help stabilize coefficients but do not fully resolve the interpretation problem. Lasso zeroes out some coefficients entirely, which is helpful for selection but discards correlated alternatives.

## Model-agnostic methods

Model-agnostic methods can be applied to any fitted model, regardless of its internal structure. This makes them especially valuable for comparing feature importance across different model types or for explaining proprietary black-box systems.

### How does permutation importance work?

Permutation importance was originally introduced by [Leo Breiman](/wiki/leo_breiman) (2001) as part of the [random forest](/wiki/random_forest) algorithm[1] and later generalized into a fully model-agnostic technique by Fisher, Rudin, and Dominici (2019), who called it "model reliance."[4] The core idea is simple: if a feature is important, randomly shuffling its values should degrade model performance; if the feature is unimportant, shuffling it should have little effect.

**Algorithm:**

1. Train the model and compute a baseline performance score (e.g., accuracy, R-squared, or any scoring metric) on a held-out dataset.
2. For each feature $j$:
   - Randomly shuffle (permute) the values of feature $j$ across all samples.
   - Recompute the model's performance score on the permuted data.
   - Record the importance as the decrease in score: $I_j = S_\mathrm{baseline} - S_\mathrm{permuted}$.
3. Repeat the permutation multiple times and average the results to reduce variance.
4. Rank features by descending importance.
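
The steps above can be sketched directly in NumPy. This is a minimal illustration rather than a reference implementation: the helper name `manual_permutation_importance`, the synthetic regression data, and the use of the estimator's default `score` method (R-squared here) are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def manual_permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Mean and std of the score drop when each column is shuffled (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    baseline = model.score(X, y)                      # step 1: baseline score
    drops = np.zeros((X.shape[1], n_repeats))
    for j in range(X.shape[1]):                       # step 2: loop over features
        for r in range(n_repeats):                    # step 3: repeat to reduce variance
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops[j, r] = baseline - model.score(X_perm, y)
    return drops.mean(axis=1), drops.std(axis=1)

# shuffle=False keeps the two informative features in columns 0 and 1
X, y = make_regression(n_samples=500, n_features=4, n_informative=2,
                       shuffle=False, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)

means, stds = manual_permutation_importance(model, X_test, y_test)
ranking = np.argsort(means)[::-1]                     # step 4: rank by importance
print(ranking, means.round(3))
```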

In scikit-learn, permutation importance is available through the `permutation_importance` function, added in scikit-learn 0.22 (released December 2019):

```python
from sklearn.inspection import permutation_importance

result = permutation_importance(
    model, X_test, y_test,
    n_repeats=30,
    random_state=42
)

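# Report only features whose mean importance exceeds zero by more than two
# standard deviations (feature_names is assumed to be a list of column names
# aligned with the columns of X_test)
for i in result.importances_mean.argsort()[::-1]:
    if result.importances_mean[i] - 2 * result.importances_std[i] > 0:
        print(f"{feature_names[i]}: "
              f"{result.importances_mean[i]:.3f} "
              f"+/- {result.importances_std[i]:.3f}")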
```

Breiman's original definition computed permutation importance on the out-of-bag (OOB) samples of a random forest, which gave a free held-out evaluation without an explicit train-test split. Most modern implementations instead evaluate on an explicit held-out split.

**Advantages:** Permutation importance is model-agnostic, does not require retraining, and can be computed on a held-out test set (which means it reflects generalization performance rather than training-set memorization). It also does not exhibit the high-cardinality bias that affects MDI.

**Limitations:** When features are correlated, permuting one feature does not substantially hurt performance because the model can still extract the same information from its correlated partner. The permutation step can also create unrealistic data points (for instance, shuffling "number of bedrooms" independently of "square footage" produces houses that would never exist), which can distort estimates. Conditional permutation importance, introduced by Strobl et al. (2008), addresses this by permuting within strata of correlated features.[6]

### Drop-column importance (leave-one-covariate-out)

Drop-column importance, also known as Leave-One-Covariate-Out (LOCO), provides perhaps the most direct answer to the question "how much does this feature contribute?" The procedure drops each feature one at a time, retrains the model from scratch, and measures the change in performance.

**Algorithm:**

1. Train the model on all features and record baseline performance.
2. For each feature $j$:
   - Remove feature $j$ from the dataset.
   - Retrain the model on the remaining features.
   - Compute the new performance score.
   - Record the importance as: $I_j = S_\mathrm{baseline} - S_\mathrm{without\,j}$.
3. Rank features by descending importance.
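
A minimal sketch of this procedure, assuming a scikit-learn style estimator, is shown below. The helper name `drop_column_importance` and the synthetic data are illustrative assumptions; `sklearn.base.clone` is used so that every retraining starts from an unfitted copy with the same hyperparameters.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def drop_column_importance(estimator, X_train, y_train, X_val, y_val):
    """Score drop after retraining without each feature (hypothetical helper)."""
    baseline = clone(estimator).fit(X_train, y_train).score(X_val, y_val)
    importances = np.zeros(X_train.shape[1])
    for j in range(X_train.shape[1]):
        keep = [k for k in range(X_train.shape[1]) if k != j]   # all columns except j
        model_j = clone(estimator).fit(X_train[:, keep], y_train)
        importances[j] = baseline - model_j.score(X_val[:, keep], y_val)
    return importances

# shuffle=False keeps the two informative features in columns 0 and 1
X, y = make_regression(n_samples=500, n_features=4, n_informative=2,
                       shuffle=False, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

loco = drop_column_importance(LinearRegression(), X_train, y_train, X_val, y_val)
print(loco.round(3))
```

The loop fits one model per feature plus the baseline, which is where the cost discussed below comes from.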

**Advantages:** Drop-column importance directly measures each feature's contribution to overall model performance. It avoids the unrealistic data problem of permutation importance because the model never sees the dropped feature during training.

**Limitations:** This method is computationally expensive because it requires retraining the model once per feature. For a dataset with 100 features and a model that takes an hour to train, drop-column importance requires over 100 hours of computation. It also measures a slightly different quantity than permutation importance: the marginal value of adding a feature to a model that already has all other features. For [deep learning](/wiki/deep_learning) models with stochastic optimizers, retraining variance can swamp the signal unless many seeds are averaged.

### What are SHAP values?

[SHAP](/wiki/shap) (SHapley Additive exPlanations), introduced by Lundberg and Lee (2017) at NeurIPS, applies [Shapley values](/wiki/shapley_values) from cooperative game theory to explain individual predictions.[2] Each feature receives a SHAP value representing its contribution to pushing the prediction away from the average prediction. The paper's central theoretical claim is that its novel components include "the identification of a new class of additive feature importance measures, and theoretical results showing there is a unique solution in this class with a set of desirable properties."[2] Those desirable properties are local accuracy, missingness, and consistency: SHAP values are the only additive feature attribution method that satisfies all three.[2]

Formally, the SHAP value of feature $i$ for an instance $x$ is the Shapley value:

$$\phi_i(x) = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(|F|-|S|-1)!}{|F|!} \, [v(S \cup \{i\}) - v(S)]$$

where $F$ is the set of all features and $v(S)$ is the model's prediction conditioned on knowing only the values of features in subset $S$. Computing this exactly is exponential in the number of features, which is why specialized fast variants exist.

SHAP provides both **local explanations** (why a specific prediction was made) and **global importance** (which features matter most across the entire dataset). Global SHAP importance is typically computed as the mean absolute SHAP value for each feature across all samples: $\mathrm{Imp}(j) = \frac{1}{N} \sum_{i=1}^{N} |\phi_j(x_i)|$.
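
To make the formula concrete, the following sketch enumerates every coalition for a small model. It is not how the SHAP library computes values; it assumes an interventional value function in which features outside $S$ are filled in from a background sample, and the helper name `exact_shapley` and the toy linear model are illustrative. For a linear model under this assumption, each Shapley value reduces to the coefficient times the feature's deviation from the background mean, which gives a simple check.

```python
import itertools
import math
import numpy as np

def exact_shapley(predict, x, background):
    """Brute-force Shapley values for one instance (hypothetical helper, exponential cost)."""
    n = x.shape[0]

    def value(subset):
        # v(S): average prediction with features in S fixed to x, others from background
        X_mix = background.copy()
        idx = list(subset)
        if idx:
            X_mix[:, idx] = x[idx]
        return predict(X_mix).mean()

    phi = np.zeros(n)
    for i in range(n):
        others = [k for k in range(n) if k != i]
        for size in range(n):
            weight = math.factorial(size) * math.factorial(n - size - 1) / math.factorial(n)
            for S in itertools.combinations(others, size):
                phi[i] += weight * (value(S + (i,)) - value(S))
    return phi

# Toy linear model with made-up coefficients
w, b = np.array([2.0, -1.0, 0.5]), 0.3
predict = lambda X: X @ w + b

rng = np.random.default_rng(0)
background = rng.normal(size=(50, 3))
x = np.array([1.0, 2.0, -1.0])

phi = exact_shapley(predict, x, background)
print(phi)
# Local accuracy: attributions sum to f(x) minus the average background prediction
print(np.isclose(phi.sum(), predict(x[None, :])[0] - predict(background).mean()))
```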

The SHAP library offers specialized explainers optimized for different model types:

| Explainer | Target models | Speed | Exactness |
|---|---|---|---|
| TreeExplainer | [Random forest](/wiki/random_forest), [XGBoost](/wiki/xgboost), [LightGBM](/wiki/lightgbm), [CatBoost](/wiki/catboost) | Fast | Exact for trees |
| LinearExplainer | Linear models, logistic regression | Fast | Exact for linear models |
| KernelExplainer | Any model (model-agnostic) | Slow | Approximate |
| DeepExplainer | Deep [neural networks](/wiki/neural_network) | Moderate | Approximate (DeepLIFT-based) |
| GradientExplainer | Differentiable models | Moderate | Approximate (expected gradients, an extension of Integrated Gradients) |
| PartitionExplainer | Hierarchical feature groupings | Moderate | Owen value approximation |

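```python
import numpy as np
import shap

# TreeExplainer expects a fitted tree-based model
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
# Note: for some classifiers and SHAP versions, shap_values holds one set of
# attributions per class; select a single class before aggregating.

# Global importance: mean absolute SHAP value per feature
# (assumes shap_values is a 2-D array of shape (n_samples, n_features))
global_importance = np.abs(shap_values).mean(axis=0)

# Summary plot showing feature importance and effect direction
shap.summary_plot(shap_values, X_test)
```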

#### TreeSHAP and computational complexity

The naive Shapley value computation has complexity $O(2^M)$ in the number of features $M$, which is infeasible for typical tabular models. TreeSHAP, introduced by Lundberg, Erion, and Lee (2018)[14] and extended in Lundberg et al. (2020) for *Nature Machine Intelligence*,[13] computes exact Shapley values for tree-based models in polynomial time. The complexity is $O(T L D^2)$ per prediction, where $T$ is the number of trees, $L$ is the maximum number of leaves in any tree, and $D$ is the maximum depth. This brings exact local attribution within reach for ensembles of hundreds of trees on datasets with thousands of features.

TreeSHAP supports two estimands:

- **Interventional feature perturbation:** uses the marginal distribution of a background dataset; treats unobserved features as if intervened upon. Recommended when interpreting interventional or causal contributions.
- **Tree-path-dependent (observational):** uses the empirical conditional distribution implied by the tree paths. Recommended when correlations between features should be reflected in the attribution.

KernelSHAP, the model-agnostic variant, uses weighted linear regression on perturbed coalitions and converges to Shapley values in expectation but is much slower.

**Advantages:** SHAP values have a strong theoretical foundation in game theory, provide both local and global explanations, and show the direction of each feature's effect (not just magnitude). TreeExplainer is computationally efficient for tree-based models. The SHAP library ships rich plotting utilities (summary plots, dependence plots, force plots, decision plots, waterfall plots).

**Limitations:** KernelExplainer (the model-agnostic variant) is computationally expensive, scaling poorly to large datasets and large feature counts. SHAP importance is on the scale of the prediction (not the loss), which makes it answer a subtly different question than permutation importance. Like permutation importance, SHAP values can be affected by feature correlations, especially in the interventional formulation. Janzing, Minorics, and Bloebaum (2020) have argued that the choice between observational and interventional SHAP corresponds to fundamentally different causal interpretations.[15]

### LIME (Local Interpretable Model-Agnostic Explanations)

[LIME](/wiki/lime), proposed by Ribeiro, Singh, and Guestrin (2016), explains individual predictions by fitting a simple interpretable model (typically a sparse linear model) in the local neighborhood of the instance being explained.[3] LIME generates perturbed samples around the instance, weights them by proximity, obtains the black-box model's predictions for these samples, and then fits a sparse linear model to approximate the local decision boundary.

**Algorithm:**

1. Select the instance to explain.
2. Generate perturbed samples in the neighborhood of that instance.
3. Get the black-box model's predictions for each perturbed sample.
4. Weight each perturbed sample by its proximity to the original instance using a kernel such as $\pi_x(z) = \exp(-d(x, z)^2 / \sigma^2)$.
5. Fit a sparse linear model (e.g., Lasso) on the weighted perturbed data.
6. The coefficients of the local model represent feature importances for that specific prediction.
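
The steps above can be sketched without the `lime` package. This is a simplified illustration under stated assumptions: perturbations are Gaussian noise around the instance, and the local surrogate is a weighted ridge regression rather than the sparse model described above. The helper name `lime_style_explanation` and the toy black box are hypothetical; the black box is deliberately linear so the recovered local weights can be checked against known values.

```python
import numpy as np
from sklearn.linear_model import Ridge

def lime_style_explanation(predict, x, n_samples=2000, scale=0.5, sigma=1.0, seed=0):
    """Local linear surrogate around x (hypothetical helper, not the lime API)."""
    rng = np.random.default_rng(seed)
    # Step 2: perturb the instance
    Z = x + rng.normal(0.0, scale, size=(n_samples, x.shape[0]))
    # Step 3: query the black box
    preds = predict(Z)
    # Step 4: proximity kernel exp(-d(x, z)^2 / sigma^2)
    dist_sq = ((Z - x) ** 2).sum(axis=1)
    weights = np.exp(-dist_sq / sigma ** 2)
    # Step 5: weighted linear surrogate (ridge swapped in for Lasso)
    surrogate = Ridge(alpha=1e-3).fit(Z, preds, sample_weight=weights)
    # Step 6: coefficients are the local importances
    return surrogate.coef_

def black_box(Z):
    # Toy model: feature 2 is ignored entirely
    return 3.0 * Z[:, 0] - 2.0 * Z[:, 1]

x0 = np.array([1.0, 2.0, -1.0])
local_coefs = lime_style_explanation(black_box, x0)
print(local_coefs.round(2))
```

Changing `seed`, `scale`, or `sigma` on a nonlinear black box is a quick way to observe the instability discussed under the limitations.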

LIME has tabular, text, and image variants. The image variant uses superpixel segmentation as the perturbation unit, while the text variant masks individual tokens.

**Advantages:** LIME is model-agnostic and works with any classifier or regressor. It is intuitive because explanations are expressed as simple linear weights. The user can control the number of features in the explanation.

**Limitations:** LIME explanations can be unstable: running LIME twice on the same instance can produce different results because of randomness in the perturbation process. The quality of the explanation depends on the choice of kernel width and the number of perturbations, which require careful tuning. Unlike SHAP, LIME lacks a strong theoretical guarantee about the uniqueness or optimality of its explanations. Slack et al. (2020) demonstrated that LIME and SHAP can be "fooled" by adversarial models that detect the perturbation distribution and behave benignly only on perturbed inputs.[16]

## Gradient-based attribution for deep networks

Deep [neural networks](/wiki/neural_network) accept inputs that often have thousands or millions of dimensions, such as pixels in an image or token embeddings in a language model. Tree-based importance does not apply, and exact Shapley computation is intractable. A family of gradient-based attribution methods has been developed specifically for differentiable models.

### Vanilla saliency

The simplest method computes the gradient of the model's output with respect to each input dimension. For a model $f$ and an input $x$, the saliency of input $x_i$ is $|\partial f(x) / \partial x_i|$. This was popularized by Simonyan, Vedaldi, and Zisserman (2013) for visualizing convolutional networks.[32] Vanilla gradients are fast but suffer from gradient saturation and noisy attributions.

### Integrated Gradients

Integrated Gradients, introduced by Sundararajan, Taly, and Yan (2017), addresses gradient saturation and enforces two axioms: sensitivity and implementation invariance.[8] The attribution for input dimension $i$ is the path integral of the gradient from a baseline $x'$ to the input $x$:

$$\mathrm{IG}_i(x) = (x_i - x'_i) \int_{0}^{1} \frac{\partial f(x' + \alpha (x - x'))}{\partial x_i} \, d\alpha$$

In practice the integral is approximated by a Riemann sum with 50 to 200 steps. The choice of baseline matters: a black image, a Gaussian noise sample, or the dataset mean each yields different attributions. Integrated Gradients sums to the difference between the prediction at $x$ and the prediction at $x'$ (completeness axiom), which is analogous to the local accuracy property of SHAP values.
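
The Riemann-sum approximation and the completeness property can be illustrated on a toy differentiable function with a hand-written gradient. The model below (a sigmoid of a linear score with made-up weights), the midpoint rule, and the helper name `integrated_gradients` are assumptions for this sketch; a real network would obtain gradients through automatic differentiation.

```python
import numpy as np

w = np.array([1.5, -2.0, 0.5])  # hypothetical weights of a toy model

def f(x):
    # Sigmoid of a linear score
    return 1.0 / (1.0 + np.exp(-(x @ w)))

def grad_f(x):
    # Analytic gradient of f with respect to x
    s = f(x)
    return s * (1.0 - s) * w

def integrated_gradients(x, baseline, grad_fn, steps=100):
    # Midpoint Riemann sum over the straight path from baseline to x
    alphas = (np.arange(steps) + 0.5) / steps
    path_grads = np.stack([grad_fn(baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * path_grads.mean(axis=0)

x = np.array([1.0, -0.5, 2.0])
baseline = np.zeros_like(x)
attributions = integrated_gradients(x, baseline, grad_f)

# Completeness: attributions should sum to f(x) - f(baseline), up to discretization error
completeness_gap = attributions.sum() - (f(x) - f(baseline))
print(attributions, completeness_gap)
```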

### DeepLIFT

DeepLIFT, proposed by Shrikumar, Greenside, and Kundaje (2017), assigns contribution scores by comparing each neuron's activation to a reference activation.[9] It propagates contributions backward through the network using either Rescale rules or RevealCancel rules. DeepLIFT was designed to overcome gradient saturation issues that affect vanilla gradients, especially in deep networks with ReLU activations. Lundberg and Lee (2017) showed that DeepLIFT can be reformulated as an approximation to SHAP values, which led to the SHAP DeepExplainer implementation.[2]

### SmoothGrad

SmoothGrad, introduced by Smilkov, Thorat, Kim, Viegas, and Wattenberg (2017), reduces visual noise in saliency maps by averaging gradients over multiple noisy copies of the input:[10]

$$\mathrm{SmoothGrad}_i(x) = \frac{1}{n} \sum_{k=1}^{n} \frac{\partial f(x + \epsilon_k)}{\partial x_i}$$

where each $\epsilon_k \sim \mathcal{N}(0, \sigma^2 I)$. SmoothGrad is often combined with Integrated Gradients (SmoothGrad-IG) for sharper, more stable attributions on image classification.
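
A minimal NumPy sketch of the averaging step is given below. The toy model, its weights, and the helper name `smoothgrad` are illustrative assumptions; with the noise level set to zero the result collapses to the plain gradient, which serves as a simple check.

```python
import numpy as np

w = np.array([1.5, -2.0, 0.5])  # hypothetical weights of a toy model

def grad_f(x):
    # Gradient of sigmoid(w . x) with respect to x
    s = 1.0 / (1.0 + np.exp(-(x @ w)))
    return s * (1.0 - s) * w

def smoothgrad(x, grad_fn, sigma=0.1, n=50, seed=0):
    # Average the gradient over n Gaussian-perturbed copies of the input
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, sigma, size=(n, x.shape[0]))
    return np.mean([grad_fn(x + eps) for eps in noise], axis=0)

x = np.array([1.0, -0.5, 2.0])
smoothed = smoothgrad(x, grad_f)
unsmoothed = smoothgrad(x, grad_f, sigma=0.0)  # no noise: equals the vanilla gradient
print(smoothed, unsmoothed)
```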

### Layer-wise Relevance Propagation (LRP)

Layer-wise Relevance Propagation, introduced by Bach et al. (2015), distributes the prediction backward through the network using conservation rules at each layer.[11] Several propagation rules exist, including LRP-0, LRP-epsilon, and LRP-gamma, each with different stability and faithfulness trade-offs. Montavon et al. (2018) provided a unified framework analyzing how various propagation rules affect attributions.[33]

### Grad-CAM

For convolutional networks, Grad-CAM (Selvaraju et al., 2017) localizes the regions of an image that influenced a class prediction by combining gradients with feature maps from a chosen convolutional layer.[12] Grad-CAM is widely used in medical imaging to highlight tumor regions or lesions that drive a model's diagnosis.

### Comparison of gradient-based methods

| Method | Required model property | Output | Strengths | Weaknesses |
|---|---|---|---|---|
| Vanilla saliency | Differentiability | Per-input gradient | Cheap | Noisy, saturation |
| Integrated Gradients | Differentiability, baseline | Path integral | Completeness, axiomatic | Baseline-dependent |
| DeepLIFT | ReLU-style nonlinearities | Reference-based contribution | Avoids saturation | Reference-dependent |
| SmoothGrad | Differentiability | Averaged gradient | Visual stability | Adds compute |
| Layer-wise Relevance Propagation | Differentiability | Propagated relevance | Conservation | Rule choice non-trivial |
| Grad-CAM | CNN with feature maps | Coarse heatmap | Class-discriminative | Resolution limited |

### Captum

[Captum](/wiki/captum) is an open-source PyTorch library released by Meta AI in 2019 that implements a unified API for these gradient-based attribution methods, plus several additional ones such as Occlusion, Feature Ablation, and Shapley Value Sampling.[31] The library exposes a consistent interface where each algorithm subclasses the `Attribution` class:

```python
import torch
from captum.attr import IntegratedGradients

model.eval()
ig = IntegratedGradients(model)
input_tensor = torch.randn(1, 3, 224, 224, requires_grad=True)
baseline = torch.zeros_like(input_tensor)

attributions, delta = ig.attribute(
    input_tensor, baselines=baseline, target=0,
    return_convergence_delta=True,
)
```

Captum integrates with [PyTorch](/wiki/pytorch) modules, supports both image and text models, and includes utilities for visualization and noise tunneling (the SmoothGrad pattern). In the TensorFlow ecosystem, comparable tooling includes tf-explain and iNNvestigate, and IBM's AIX360 toolkit collects further explanation methods.

## Attribution for transformers and large language models

[Transformer](/wiki/transformer) architectures used in modern [large language models](/wiki/large_language_model) introduce additional considerations for feature attribution. Inputs are token embeddings rather than raw scalar features, and the model contains attention layers whose weights are often misinterpreted as importance scores.

### Is attention a reliable measure of feature importance?

Attention weights describe how each token attends to others within a layer. It is tempting to read the attention weight $\alpha_{ij}$ from token $i$ to token $j$ as a measure of how much token $j$ contributes to token $i$'s representation. Jain and Wallace (2019) argued that attention weights are not reliable explanations: they can be perturbed without changing predictions, and different attention distributions can yield identical outputs.[17] Wiegreffe and Pinter (2019) challenged parts of that argument, noting that the conclusion depends on how "explanation" is defined, and proposed tests for when attention can be informative.[18] Modern guidance treats raw attention as a diagnostic, not an explanation.

### Token-level Integrated Gradients

Integrated Gradients applied at the embedding layer of a transformer assigns contribution scores to individual tokens. The baseline is typically a sequence of zero embeddings, a mean-embedding sequence, or, when the baseline is specified as token IDs, a sequence of padding tokens. Captum's `LayerIntegratedGradients` is a widely used tool for this:

```python
from captum.attr import LayerIntegratedGradients

lig = LayerIntegratedGradients(model, model.embeddings)
attributions, delta = lig.attribute(
    input_ids,
    baselines=baseline_ids,
    target=target_class,
    return_convergence_delta=True,
)
```

Token attributions are commonly summed across the embedding dimension and visualized as heatmaps over the input text.

### Attention rollout and attention flow

Abnar and Zuidema (2020) proposed *attention rollout*, which composes attention matrices across layers via matrix multiplication, and *attention flow*, which formulates the importance problem as max-flow over an attention graph.[19] These methods give a layer-aggregated view of token-to-token influence.
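
Attention rollout is simple enough to sketch in NumPy. The version below assumes heads are averaged within each layer and that the residual connection is modeled by mixing each attention matrix equally with the identity before multiplying; the random attention tensors and the helper name `attention_rollout` are illustrative stand-ins for matrices extracted from a real model.

```python
import numpy as np

def attention_rollout(attentions):
    """attentions: list of (heads, tokens, tokens) arrays, ordered first layer to last."""
    n_tokens = attentions[0].shape[-1]
    rollout = np.eye(n_tokens)
    for layer_attn in attentions:
        a = layer_attn.mean(axis=0)                  # average over heads
        a = 0.5 * a + 0.5 * np.eye(n_tokens)         # account for the residual connection
        a = a / a.sum(axis=-1, keepdims=True)        # keep rows normalized
        rollout = a @ rollout                        # compose with earlier layers
    return rollout

# Random softmax-normalized attention: 3 layers, 4 heads, 5 tokens (illustrative only)
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 4, 5, 5))
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

rollout = attention_rollout(list(attn))
print(rollout.round(3))
```

Row $i$ of the result can be read as an estimate of how much each input token contributes to position $i$ at the final layer.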

### Mechanistic interpretability

Beyond attribution scores, the [mechanistic interpretability](/wiki/mechanistic_interpretability) program seeks to identify circuits, features, and computations within transformer weights. Anthropic's 2023 work on monosemantic features used [sparse autoencoders](/wiki/sparse_autoencoder) to decompose neuron activations into interpretable features. The 2024 "Scaling Monosemanticity" paper extended this to Claude 3 Sonnet and identified millions of features. This line of work targets a finer-grained question than feature importance: not just *which* input contributes, but *what computation* is performed and *why*. Mechanistic interpretability and feature importance address complementary aspects of model understanding.

## How do the methods compare?

The following table summarizes the key characteristics of the most widely used feature importance methods:

| Method | Scope | Model requirement | Computation cost | Handles correlated features well? | Provides direction of effect? |
|---|---|---|---|---|---|
| Impurity-based (MDI) | Global | Tree-based models only | Very low (computed during training) | No (splits importance across correlated features) | No |
| XGBoost gain | Global | XGBoost / boosted trees | Very low | No | No |
| XGBoost weight | Global | XGBoost / boosted trees | Very low | No | No |
| XGBoost cover | Global | XGBoost / boosted trees | Very low | No | No |
| Coefficient magnitude | Global | Linear models only | Very low (read from trained model) | No (unstable with multicollinearity) | Yes (sign of coefficient) |
| Permutation importance | Global | Any model | Moderate (no retraining) | No (can understate or overstate correlated features) | No |
| Conditional permutation | Global | Any model | High | Yes (within strata) | No |
| Drop-column (LOCO) | Global | Any model | Very high (retrains per feature) | Partially (retraining captures redistribution) | No |
| KernelSHAP | Local and global | Any model | High | Partial (interventional vs. observational) | Yes |
| TreeSHAP | Local and global | Tree-based models | Low ($O(TLD^2)$) | Partial | Yes |
| DeepSHAP / DeepLIFT | Local | Differentiable models | Moderate | Partial | Yes |
| LIME | Local | Any model | Moderate | No (local perturbations affected by correlations) | Yes (sign of local coefficient) |
| Integrated Gradients | Local | Differentiable models | Moderate | Partial (baseline-dependent) | Yes |
| Layer-wise Relevance Propagation | Local | Differentiable models | Moderate | Partial | Yes |
| Grad-CAM | Local | CNNs with feature maps | Low | Partial | Partial (ReLU keeps only positive evidence) |
| Attention weights | Local | Attention-based models | Free (already computed) | No (not a faithful explanation) | No (weights are non-negative) |

## Common pitfalls and best practices

### Correlated features

Correlated features are the most common source of misleading feature importance results, and this problem affects virtually every method. When two features carry similar information, importance methods tend to split the total importance between them, making each appear less important than it would be in isolation. Permutation importance can also err in the opposite direction: shuffling one feature of a correlated pair creates feature combinations that never occur in the data, which forces the model to extrapolate and can inflate the measured importance. Hooker and Mentch (2019) analyze how feature correlation distorts permutation-based importance measures.[7]

**Practical solutions:**

- Perform hierarchical clustering on the Spearman rank-order correlation matrix, select a threshold, and retain one representative feature from each cluster.
- Apply [principal component analysis](/wiki/principal_component_analysis) (PCA) to create uncorrelated features before computing importance.
- Use conditional permutation importance, which permutes a feature conditional on the values of its correlated partners.
- Group correlated features and compute group importance using the SHAP `PartitionExplainer` or grouped permutation tests.
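The clustering approach in the first bullet can be sketched with SciPy. This is a minimal illustration on synthetic data; the helper name `select_representatives`, the Ward linkage, and the threshold of 0.5 are assumptions chosen for the example, and the right threshold is problem-specific.

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr

def select_representatives(X, threshold=0.5):
    """Cluster features by Spearman correlation; keep the first feature per cluster."""
    corr, _ = spearmanr(X)                 # (n_features, n_features) for 3+ columns
    corr = (corr + corr.T) / 2             # enforce exact symmetry
    np.fill_diagonal(corr, 1.0)
    distance = 1.0 - np.abs(corr)          # strongly correlated -> small distance
    linkage = hierarchy.ward(squareform(distance))
    cluster_ids = hierarchy.fcluster(linkage, threshold, criterion="distance")
    representatives = {}
    for idx, cid in enumerate(cluster_ids):
        representatives.setdefault(cid, idx)   # first feature seen in each cluster
    return sorted(representatives.values()), cluster_ids

# Synthetic data: columns 0 and 1 are near-duplicates, columns 2 and 3 are independent.
rng = np.random.default_rng(0)
base = rng.normal(size=500)
X = np.column_stack([
    base,
    base + 0.1 * rng.normal(size=500),
    rng.normal(size=500),
    rng.normal(size=500),
])
selected, cluster_ids = select_representatives(X, threshold=0.5)
print(selected, cluster_ids)
```

Importance is then computed on a model refit with only the selected columns, so each cluster's shared signal is credited to a single feature.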

### Why is impurity importance biased toward high-cardinality features?

Impurity-based importance is systematically biased toward features with many unique values. A continuous feature with 10,000 distinct values will tend to score higher than a binary feature, even if the binary feature is the true driver. The mechanism is that high-cardinality features offer many more candidate split points, so they have a greater chance of yielding a strong split by chance alone. This was demonstrated by Strobl et al. (2007), who showed that random forests preferentially select high-cardinality variables for splits.[5]

**Practical solution:** Use permutation importance or SHAP instead of MDI when your dataset contains a mix of continuous and categorical features with varying cardinality. Another option is the Altmann et al. (2010) heuristic, which repeatedly permutes the response variable to build a null distribution of importance scores and reports a p-value for each feature.[30]
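The bias is easy to reproduce on synthetic data. The sketch below is an illustration under stated assumptions: a binary feature drives the label (with 20% label noise), a continuous noise feature is unrelated to it, and the forest is left fully grown. Exact scores depend on the seed and library version.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
driver = rng.integers(0, 2, size=n)      # binary feature, the true driver
noise = rng.normal(size=n)               # continuous feature, unrelated to the label
flip = rng.random(n) < 0.2               # 20% label noise
y = np.where(flip, 1 - driver, driver)
X = np.column_stack([driver, noise])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# MDI is computed from training-set splits; the noise column offers many split points.
mdi = forest.feature_importances_
# Permutation importance is computed on held-out data.
perm = permutation_importance(
    forest, X_test, y_test, n_repeats=10, random_state=0
).importances_mean

for name, m, p in zip(["driver (binary)", "noise (continuous)"], mdi, perm):
    print(f"{name:>20s}: MDI={m:.3f}  permutation={p:.3f}")
```

The noise column receives a substantial share of MDI because the trees keep splitting on it to memorize the flipped labels, while its held-out permutation importance stays well below that of the binary driver.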

### Importance from overfit models

If a model has memorized the training data, the importance scores derived from that training data will be unreliable. A random noise feature may appear important simply because the model has overfit to it.

**Practical solution:** Always compute importance on a held-out test set or use cross-validated importance estimates. Scikit-learn's `permutation_importance` function accepts any dataset, so passing the test set rather than the training set is straightforward.

### Confusing importance with causation

Feature importance measures statistical association, not causation. A feature may appear important because it is correlated with the true causal factor, not because it directly influences the outcome. For example, ice cream sales might appear important for predicting drowning rates, but the true driver is temperature. The [causal inference](/wiki/causal_inference) literature provides tools, such as do-calculus, that bridge to genuinely causal contributions, but these require strong assumptions and a known causal graph.

### Importance depends on model quality

Feature importance is only meaningful when the underlying model performs well. If the model has a low cross-validation score, its importance rankings may be unreliable and unstable. Always validate model performance before interpreting feature importances.

### Stability and confidence intervals

A single importance score is a point estimate. Repeating the permutation, bootstrapping the data, or training with multiple random seeds yields a distribution of importance scores. Reporting standard deviations or confidence intervals exposes features whose apparent importance is dominated by noise.[27] Scikit-learn's `permutation_importance` returns `importances_mean` and `importances_std` precisely so users can quantify this uncertainty.
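A short sketch of how these two arrays might be used. The synthetic dataset, the model, and the "mean minus two standard deviations" cutoff are illustrative assumptions; the cutoff is a rule of thumb, not a formal hypothesis test.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=600, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Each feature is shuffled n_repeats times; result.importances holds every repeat.
result = permutation_importance(model, X_test, y_test, n_repeats=20, random_state=0)

# Flag features whose importance is not clearly above zero.
lower = result.importances_mean - 2 * result.importances_std
for j in np.argsort(result.importances_mean)[::-1]:
    note = "" if lower[j] > 0 else "  (not clearly above zero)"
    print(f"feature {j}: {result.importances_mean[j]:.3f} "
          f"+/- {result.importances_std[j]:.3f}{note}")
```

The raw per-repeat scores in `result.importances` can also be passed to a box plot to show the full spread for each feature.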

### Adversarial robustness of explanations

Slack et al. (2020) showed that black-box models can be crafted to deceive both LIME and SHAP.[16] Ghorbani, Abid, and Zou (2019) showed that small input perturbations can drastically change saliency map attributions even when the prediction is unchanged.[20] Auditors should treat any single explanation as one signal among many.

### Baseline dependence in attribution

Integrated Gradients and other reference-based methods depend on the choice of baseline. A black image, a gray image, a Gaussian noise sample, and a dataset mean baseline can give different attributions for the same input. Sturmfels, Lundberg, and Lee (2020) recommend using multiple baselines and reporting averaged attributions.[21]
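The effect of the baseline can be seen on a toy function with an analytic gradient. The sketch below is an illustration, not a library implementation: the function `f`, the three baselines, and the midpoint Riemann approximation are all assumptions chosen so the result can be checked by hand.

```python
import numpy as np

def f(x):
    # Toy differentiable "model": sum of squares plus one interaction term.
    return float(np.sum(x ** 2) + x[0] * x[1])

def grad_f(x):
    g = 2.0 * x          # new array, x itself is not modified
    g[0] += x[1]
    g[1] += x[0]
    return g

def integrated_gradients(x, baseline, steps=50):
    # Midpoint Riemann sum of the gradient along the straight path baseline -> x.
    alphas = (np.arange(steps) + 0.5) / steps
    path_grads = np.array([grad_f(baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * path_grads.mean(axis=0)

x = np.array([1.0, 2.0, -1.0])
baselines = [np.zeros(3), np.full(3, 0.5), np.array([-1.0, 0.0, 1.0])]

per_baseline = np.array([integrated_gradients(x, b) for b in baselines])
averaged = per_baseline.mean(axis=0)     # attribution averaged over baselines

for b, attr in zip(baselines, per_baseline):
    print("baseline", b, "->", attr.round(3))
print("averaged ->", averaged.round(3))
```

Each row satisfies the completeness property (the attributions sum to `f(x) - f(baseline)`), yet the per-feature values differ from baseline to baseline, which is why a single baseline can give a misleading picture.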

## Implementation and libraries

Several mature libraries implement feature importance and attribution. The choice depends on the framework and the type of model.

| Library | Language / framework | Primary methods | Notes |
|---|---|---|---|
| [scikit-learn](/wiki/scikit_learn) | Python | MDI, permutation importance | Standard tabular ML toolkit |
| [SHAP](/wiki/shap) | Python | KernelSHAP, TreeSHAP, DeepSHAP, GradientSHAP | Reference SHAP implementation |
| [LIME](/wiki/lime) | Python | LIME for tabular, text, image | Original Ribeiro et al. implementation |
| [Captum](/wiki/captum) | Python / [PyTorch](/wiki/pytorch) | IG, DeepLIFT, SmoothGrad, Saliency, Occlusion | Meta AI library |
| eli5 | Python | Permutation importance, MDI, weights | Older but still useful |
| AIX360 | Python | SHAP, LIME, ProtoDash, BRCG | IBM toolkit |
| iNNvestigate | Python / TensorFlow | LRP variants, SmoothGrad, IG | Keras-focused |
| InterpretML | Python | EBM (glass-box), SHAP, LIME | From Microsoft Research |
| dalex | Python / R | Break-Down, Shapley, Ceteris Paribus | Predictive model audit |

### How do you compute feature importance in scikit-learn?

Scikit-learn provides two primary interfaces for computing feature importance:

| Interface | Function / attribute | Method type | Available since |
|---|---|---|---|
| Impurity-based | `model.feature_importances_` | Built-in (tree models) | scikit-learn 0.1 |
| Permutation | `sklearn.inspection.permutation_importance()` | Model-agnostic | scikit-learn 0.22 |

The scikit-learn documentation explicitly recommends preferring permutation importance over impurity-based importance when accuracy of rankings matters, noting that MDI importances are biased towards high-cardinality features and are computed on training set statistics that do not reflect generalization to the test set.[29]

A typical workflow combines both methods:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Stand-in dataset so the example runs end to end; substitute your own X, y
data = load_breast_cancer()
X, y, feature_names = data.data, data.target, data.feature_names

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train model
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)

# Method 1: Impurity-based (fast but potentially biased, uses training data)
mdi_importances = rf.feature_importances_

# Method 2: Permutation importance on held-out data (more reliable)
perm_result = permutation_importance(
    rf, X_test, y_test, n_repeats=30, random_state=42
)
perm_importances = perm_result.importances_mean

# Compare, sorted by permutation importance (highest first)
for name, mdi, perm in sorted(
    zip(feature_names, mdi_importances, perm_importances),
    key=lambda x: x[2], reverse=True
):
    print(f"{name:>25s}: MDI={mdi:.4f}  Permutation={perm:.4f}")
```

### Implementation with SHAP

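The SHAP library is published under the MIT license and supports common tree-based libraries such as XGBoost, LightGBM, CatBoost, and scikit-learn, alongside model-agnostic and deep learning explainers:

```python
import shap
import xgboost as xgb

# Assumes X_train and X_test are pandas DataFrames with an "age" column and
# y_train holds binary labels; params is an illustrative configuration.
params = {"objective": "binary:logistic", "max_depth": 4, "eta": 0.1}
dtrain = xgb.DMatrix(X_train, label=y_train)

booster = xgb.train(params, dtrain, num_boost_round=200)

# TreeExplainer applies the TreeSHAP algorithm to tree ensembles
explainer = shap.TreeExplainer(booster)
shap_values = explainer.shap_values(X_test)  # one row of contributions per sample

# Per-sample contributions for every feature, colored by feature value
shap.summary_plot(shap_values, X_test, plot_type="dot")
# SHAP value of "age" plotted against the value of "age"
shap.dependence_plot("age", shap_values, X_test)
```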

## Worked example: house price regression

Consider a regression model that predicts house prices from features such as square footage, number of bedrooms, lot size, neighborhood, year built, and number of bathrooms. The following observations illustrate how different methods may rank the same features:

- **MDI on a random forest** might rank `square_footage` first because it offers many split points and is heavily used during training.
- **Permutation importance** on the test set may rank `neighborhood` higher than MDI suggests because shuffling the categorical neighborhood label catastrophically degrades predictions.
- **TreeSHAP** values may show that `year_built` has a positive contribution for newer homes and a negative contribution for older homes, revealing the direction of the effect, which an aggregate importance score hides.
- **Drop-column importance** may show that removing `bedrooms` or `bathrooms` alone barely changes performance, even though removing both does, because the two act as redundant proxies for size.

The disagreement is informative. Examining several methods together is the standard professional practice when stakes are high.
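The drop-column behaviour described above can be reproduced with a small simulation. This is a sketch on synthetic data, not real housing data: the coefficients, noise levels, linear model, and helper name `drop_column_importance` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def drop_column_importance(model, X_train, y_train, X_test, y_test):
    """Importance of feature j = drop in test R^2 after retraining without column j."""
    baseline = clone(model).fit(X_train, y_train).score(X_test, y_test)
    drops = []
    for j in range(X_train.shape[1]):
        reduced = clone(model).fit(np.delete(X_train, j, axis=1), y_train)
        drops.append(baseline - reduced.score(np.delete(X_test, j, axis=1), y_test))
    return np.array(drops)

# Synthetic data: bedrooms and bathrooms are noisy proxies for an unobserved size.
rng = np.random.default_rng(0)
n = 1000
size = rng.normal(size=n)
bedrooms = size + 0.3 * rng.normal(size=n)
bathrooms = size + 0.3 * rng.normal(size=n)
lot_size = rng.normal(size=n)
price = 3.0 * size + 2.0 * lot_size + 0.5 * rng.normal(size=n)

X = np.column_stack([bedrooms, bathrooms, lot_size])
names = ["bedrooms", "bathrooms", "lot_size"]
X_train, X_test, y_train, y_test = train_test_split(X, price, random_state=0)

importances = drop_column_importance(
    LinearRegression(), X_train, y_train, X_test, y_test
)
for name, imp in zip(names, importances):
    print(f"{name:>10s}: {imp:.3f}")
```

Dropping either proxy alone costs little because the other still carries the size signal, while dropping `lot_size` removes information that no other column can replace.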

## Related interpretability tools

Feature importance gives a single score per feature, but several adjacent tools answer related questions and are commonly reported alongside importance scores. Partial dependence plots (PDPs), introduced by Friedman (2001), show the average effect of one or two features on the prediction.[22] Individual Conditional Expectation (ICE) plots show the same relationship per-instance.[23] Accumulated Local Effects (ALE) plots, introduced by Apley and Zhu (2020), are an alternative to PDPs that remains reliable when features are correlated.[24] SHAP interaction values decompose each prediction into main effects and pairwise interaction terms. Counterfactual explanations describe the smallest input changes that would alter a prediction, and *anchors* (Ribeiro, Singh, Guestrin, 2018) provide rule-based local explanations.[25]
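The relationship between ICE and PDP curves can be made concrete with a manual computation. The sketch below assumes a toy prediction function with a known form so the result can be verified; the names `ice_curves` and `predict` are hypothetical, and any fitted model's `predict` method could be substituted.

```python
import numpy as np

def ice_curves(predict, X, feature, grid):
    """One curve per row: predictions as `feature` is swept over `grid`."""
    curves = np.empty((X.shape[0], len(grid)))
    for k, value in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, feature] = value        # overwrite the feature for every row
        curves[:, k] = predict(X_mod)
    return curves

def predict(X):
    # Toy model with a known form: linear in feature 0, quadratic in feature 1.
    return 2.0 * X[:, 0] + X[:, 1] ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
grid = np.linspace(-2.0, 2.0, 9)

ice = ice_curves(predict, X, feature=0, grid=grid)   # shape (200, 9)
pdp = ice.mean(axis=0)                               # PDP is the average ICE curve
print(pdp.round(3))
```

For this additive toy model every ICE curve has the same slope, so the PDP summarizes them faithfully; when curves cross or diverge, the average hides interactions that the ICE plot reveals.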

## Explain like I'm 5 (ELI5)

Imagine you are baking cookies, and you want to know which ingredient matters most for making them taste good. You could try leaving out the sugar one time, leaving out the butter another time, and leaving out the vanilla another time. Whichever ingredient, when removed, makes the cookies taste the worst is the most important ingredient. Feature importance works the same way: it tests what happens to a computer's predictions when each piece of information is taken away or scrambled, and the pieces that cause the biggest mess when removed are the most important ones.

## See also

- [Interpretability](/wiki/interpretability)
- [Explainable AI](/wiki/explainable_ai)
- [SHAP](/wiki/shap)
- [LIME](/wiki/lime)
- [Shapley values](/wiki/shapley_values)
- [Random forest](/wiki/random_forest)
- [Gradient boosting](/wiki/gradient_boosting)
- [XGBoost](/wiki/xgboost)
- [LightGBM](/wiki/lightgbm)
- [Decision tree](/wiki/decision_tree)
- [Captum](/wiki/captum)
- [Mechanistic interpretability](/wiki/mechanistic_interpretability)
- [Feature engineering](/wiki/feature_engineering)
- [Feature selection](/wiki/feature_selection)

## References

1. Breiman, L. (2001). "Random Forests." *Machine Learning*, 45(1), 5-32. [doi:10.1023/A:1010933404324](https://doi.org/10.1023/A:1010933404324)
2. Lundberg, S. M., & Lee, S.-I. (2017). "A Unified Approach to Interpreting Model Predictions." *Advances in Neural Information Processing Systems 30 (NeurIPS)*, 4765-4774. [arXiv:1705.07874](https://arxiv.org/abs/1705.07874)
3. Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). "'Why Should I Trust You?': Explaining the Predictions of Any Classifier." *Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining*, 1135-1144. [arXiv:1602.04938](https://arxiv.org/abs/1602.04938) [doi:10.1145/2939672.2939778](https://doi.org/10.1145/2939672.2939778)
4. Fisher, A., Rudin, C., & Dominici, F. (2019). "All Models are Wrong, but Many are Useful: Learning a Variable's Importance by Studying an Entire Class of Prediction Models Simultaneously." *Journal of Machine Learning Research*, 20(177), 1-81.
5. Strobl, C., Boulesteix, A.-L., Zeileis, A., & Hothorn, T. (2007). "Bias in Random Forest Variable Importance Measures: Illustrations, Sources and a Solution." *BMC Bioinformatics*, 8, 25. [doi:10.1186/1471-2105-8-25](https://doi.org/10.1186/1471-2105-8-25)
6. Strobl, C., Boulesteix, A.-L., Kneib, T., Augustin, T., & Zeileis, A. (2008). "Conditional Variable Importance for Random Forests." *BMC Bioinformatics*, 9, 307. [doi:10.1186/1471-2105-9-307](https://doi.org/10.1186/1471-2105-9-307)
7. Hooker, G., & Mentch, L. (2019). "Please Stop Permuting Features: An Explanation and Alternatives." [arXiv:1905.03151](https://arxiv.org/abs/1905.03151)
8. Sundararajan, M., Taly, A., & Yan, Q. (2017). "Axiomatic Attribution for Deep Networks." *Proceedings of the 34th International Conference on Machine Learning*, 3319-3328. [arXiv:1703.01365](https://arxiv.org/abs/1703.01365)
9. Shrikumar, A., Greenside, P., & Kundaje, A. (2017). "Learning Important Features Through Propagating Activation Differences." *Proceedings of the 34th International Conference on Machine Learning*, 3145-3153. [arXiv:1704.02685](https://arxiv.org/abs/1704.02685)
10. Smilkov, D., Thorat, N., Kim, B., Viegas, F., & Wattenberg, M. (2017). "SmoothGrad: removing noise by adding noise." [arXiv:1706.03825](https://arxiv.org/abs/1706.03825)
11. Bach, S., Binder, A., Montavon, G., Klauschen, F., Mueller, K.-R., & Samek, W. (2015). "On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation." *PLOS ONE*, 10(7), e0130140. [doi:10.1371/journal.pone.0130140](https://doi.org/10.1371/journal.pone.0130140)
12. Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., & Batra, D. (2017). "Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization." *Proceedings of the IEEE International Conference on Computer Vision*, 618-626. [arXiv:1610.02391](https://arxiv.org/abs/1610.02391)
13. Lundberg, S. M., Erion, G., Chen, H., DeGrave, A., Prutkin, J. M., Nair, B., Katz, R., Himmelfarb, J., Bansal, N., & Lee, S.-I. (2020). "From Local Explanations to Global Understanding with Explainable AI for Trees." *Nature Machine Intelligence*, 2, 56-67. [doi:10.1038/s42256-019-0138-9](https://doi.org/10.1038/s42256-019-0138-9)
14. Lundberg, S. M., Erion, G., & Lee, S.-I. (2018). "Consistent Individualized Feature Attribution for Tree Ensembles." [arXiv:1802.03888](https://arxiv.org/abs/1802.03888)
15. Janzing, D., Minorics, L., & Bloebaum, P. (2020). "Feature relevance quantification in explainable AI: A causal problem." *International Conference on Artificial Intelligence and Statistics*. [arXiv:1910.13413](https://arxiv.org/abs/1910.13413)
16. Slack, D., Hilgard, S., Jia, E., Singh, S., & Lakkaraju, H. (2020). "Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods." *Proceedings of the 2020 AAAI/ACM Conference on AI, Ethics, and Society*. [arXiv:1911.02508](https://arxiv.org/abs/1911.02508)
17. Jain, S., & Wallace, B. C. (2019). "Attention is not Explanation." *Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics*. [arXiv:1902.10186](https://arxiv.org/abs/1902.10186)
18. Wiegreffe, S., & Pinter, Y. (2019). "Attention is not not Explanation." *Proceedings of EMNLP-IJCNLP 2019*. [arXiv:1908.04626](https://arxiv.org/abs/1908.04626)
19. Abnar, S., & Zuidema, W. (2020). "Quantifying Attention Flow in Transformers." *Proceedings of ACL 2020*. [arXiv:2005.00928](https://arxiv.org/abs/2005.00928)
20. Ghorbani, A., Abid, A., & Zou, J. (2019). "Interpretation of Neural Networks is Fragile." *Proceedings of the AAAI Conference on Artificial Intelligence*, 33(1), 3681-3688. [arXiv:1710.10547](https://arxiv.org/abs/1710.10547)
21. Sturmfels, P., Lundberg, S., & Lee, S.-I. (2020). "Visualizing the Impact of Feature Attribution Baselines." *Distill*. [doi:10.23915/distill.00022](https://doi.org/10.23915/distill.00022)
22. Friedman, J. H. (2001). "Greedy Function Approximation: A Gradient Boosting Machine." *Annals of Statistics*, 29(5), 1189-1232.
23. Goldstein, A., Kapelner, A., Bleich, J., & Pitkin, E. (2015). "Peeking Inside the Black Box: Visualizing Statistical Learning With Plots of Individual Conditional Expectation." *Journal of Computational and Graphical Statistics*, 24(1), 44-65.
24. Apley, D. W., & Zhu, J. (2020). "Visualizing the Effects of Predictor Variables in Black Box Supervised Learning Models." *Journal of the Royal Statistical Society Series B*, 82(4), 1059-1086. [arXiv:1612.08468](https://arxiv.org/abs/1612.08468)
25. Ribeiro, M. T., Singh, S., & Guestrin, C. (2018). "Anchors: High-Precision Model-Agnostic Explanations." *AAAI Conference on Artificial Intelligence*.
26. Wager, S., & Athey, S. (2018). "Estimation and Inference of Heterogeneous Treatment Effects using Random Forests." *Journal of the American Statistical Association*, 113(523), 1228-1242.
27. Mentch, L., & Hooker, G. (2016). "Quantifying Uncertainty in Random Forests via Confidence Intervals and Hypothesis Tests." *Journal of Machine Learning Research*, 17(26), 1-41.
28. Molnar, C. (2022). *Interpretable Machine Learning: A Guide for Making Black Box Models Explainable* (2nd ed.). [christophm.github.io/interpretable-ml-book](https://christophm.github.io/interpretable-ml-book/)
29. Scikit-learn Developers. "Permutation Feature Importance." *scikit-learn documentation*. [scikit-learn.org/stable/modules/permutation_importance.html](https://scikit-learn.org/stable/modules/permutation_importance.html)
30. Altmann, A., Tolosi, L., Sander, O., & Lengauer, T. (2010). "Permutation Importance: A Corrected Feature Importance Measure." *Bioinformatics*, 26(10), 1340-1347. [doi:10.1093/bioinformatics/btq134](https://doi.org/10.1093/bioinformatics/btq134)
31. Kokhlikyan, N., Miglani, V., Martin, M., et al. (2020). "Captum: A unified and generic model interpretability library for PyTorch." [arXiv:2009.07896](https://arxiv.org/abs/2009.07896)
32. Simonyan, K., Vedaldi, A., & Zisserman, A. (2013). "Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps." [arXiv:1312.6034](https://arxiv.org/abs/1312.6034)
33. Montavon, G., Samek, W., & Mueller, K.-R. (2018). "Methods for interpreting and understanding deep neural networks." *Digital Signal Processing*, 73, 1-15.