Feature Importances

Feature importances are numeric scores that quantify how much each input feature contributes to the predictions of a machine learning model. Understanding which features matter most is central to model interpretability, feature engineering, debugging, and building trust in deployed systems. A wide range of techniques exist for measuring feature importance, from methods built into specific model families to model-agnostic approaches that work with any estimator. Modern practice draws on tree-based attribution measures, post-hoc explanations grounded in cooperative game theory, and gradient-based attribution methods designed for deep networks.

The topic sits at the intersection of statistics, optimization, and explainable AI. Feature importance methods do not all answer the same question: some measure how a feature shapes the loss during training, some measure how it affects predictions on held-out data, and some measure how it contributes to a single prediction for a single sample.

Why feature importance matters

Knowing which features drive a model's predictions serves several practical purposes:

Model interpretability: Stakeholders and regulators often need to understand why a model makes certain decisions, especially in healthcare, finance, and criminal justice. Frameworks such as the European Union AI Act and the U.S. Equal Credit Opportunity Act effectively demand that operators of high-stakes models articulate the reasons behind individual decisions.
Feature selection: Removing uninformative features reduces training time, lowers the risk of overfitting, and simplifies models without sacrificing performance. Stable importance rankings drive feature selection wrappers such as Recursive Feature Elimination (RFE) and Boruta.
Debugging: If a model assigns high importance to a feature that should be irrelevant (for example, a row index or a record ID), that signals data leakage or a modeling error. Importance audits routinely catch leakage where future information has accidentally been included in training data.
Domain insight: Feature importance rankings can reveal unexpected relationships in the data, generating hypotheses for further scientific investigation. In computational biology, importance scores from tree ensembles routinely surface candidate genes for follow-up wet-lab experiments.
Bias and fairness audits: Surfaced feature contributions help identify proxies for protected attributes. If a sensitive attribute or its proxy ranks highly, that finding feeds into a fairness intervention.
Model compression: When deploying to embedded devices, dropping low-importance inputs reduces memory and inference cost without notably hurting accuracy.

Across all of these uses, the key question is whether the method's ranking matches the predictive structure that the user actually cares about. Different methods give different answers because they each measure a different mathematical object.

Global versus local importance

Feature importance methods split into two broad categories based on the scope of the explanation they provide:

Global importance describes how each feature contributes across the entire dataset or population. Global scores are useful for feature selection, model documentation, and high-level audits. Examples include impurity-based importance for tree ensembles and permutation importance computed over a full test set.
Local importance describes how each feature contributes to a single prediction for a single instance. Local explanations are necessary for adverse-action notices in lending or per-patient risk reports in clinical decision support. Examples include SHAP values for an individual sample and LIME explanations.

A second axis distinguishes model-specific methods, which exploit the internal structure of a particular family (such as the split statistics of a decision tree), from model-agnostic methods, which treat the model as a black box. Model-specific methods are usually faster and can be exact but apply only to compatible models. Model-agnostic methods generalize across architectures at the cost of additional computation.

The two axes combine to give a 2x2 taxonomy that helps practitioners locate any method in the space:

Scope vs. coupling	Model-specific	Model-agnostic
Global	Mean Decrease in Impurity, XGBoost gain/weight/cover, linear coefficient magnitude	Permutation importance, drop-column (LOCO), global SHAP aggregates
Local	TreeSHAP, DeepLIFT, Integrated Gradients (uses model gradients)	KernelSHAP, LIME, counterfactual explanations

Built-in (model-specific) methods

Impurity-based importance (mean decrease in impurity)

Impurity-based importance, also called Mean Decrease in Impurity (MDI) or Gini importance, is the default feature importance method for tree-based models such as random forest, gradient boosting, and individual decision trees. During training, each node in a tree splits on a feature to reduce impurity (measured by the Gini index for classification or variance for regression). The importance of a feature is the total reduction in impurity it provides across all splits in all trees, weighted by the number of samples reaching each split.

Formally, for a single tree the MDI importance of feature $j$ is:

$$\mathrm{MDI}(j) = \sum_{t \in T_j} \frac{N_t}{N} \cdot \Delta i(t)$$

where $T_j$ is the set of internal nodes that split on feature $j$, $N_t$ is the number of training samples reaching node $t$, $N$ is the total number of training samples, and $\Delta i(t)$ is the impurity decrease at node $t$. For a forest of $B$ trees, the per-feature importance is averaged: $\mathrm{MDI}_{\mathrm{forest}}(j) = \frac{1}{B} \sum_{b=1}^{B} \mathrm{MDI}_b(j)$.
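The weighting in the single-tree formula can be checked by hand on a tiny tree. The sketch below is illustrative only: the two splits and their class counts are made-up numbers, and the helpers `gini` and `impurity_decrease` are hypothetical names, not library functions.

```python
def gini(counts):
    """Gini impurity of a node given its per-class sample counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)


def impurity_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of its children."""
    n = sum(parent)
    return (gini(parent)
            - sum(left) / n * gini(left)
            - sum(right) / n * gini(right))


# Hypothetical tree with two internal nodes, 10 training samples, 2 classes.
# Each record: (feature index, parent counts, left child counts, right child counts)
splits = [
    (0, [5, 5], [4, 0], [1, 5]),  # root splits on feature 0
    (1, [1, 5], [1, 0], [0, 5]),  # right child of the root splits on feature 1
]
n_total = 10

mdi = {}
for feature, parent, left, right in splits:
    node_weight = sum(parent) / n_total  # N_t / N
    mdi[feature] = mdi.get(feature, 0.0) + node_weight * impurity_decrease(parent, left, right)

print(mdi)  # feature 0 is about 1/3, feature 1 is about 1/6
```

Because every leaf in this toy tree is pure, the two scores add up to the root impurity of 0.5. Libraries commonly rescale the scores so that they sum to 1; scikit-learn does so for `feature_importances_`.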

In scikit-learn, impurity-based importances are accessible through the feature_importances_ attribute of any fitted tree-based estimator:

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Impurity-based importances
importances = model.feature_importances_

Advantages: MDI is fast to compute because it requires no additional evaluation after training. It is readily available for all tree-based models in scikit-learn, XGBoost, LightGBM, and CatBoost.

Limitations: MDI is biased toward high-cardinality features (features with many unique values, such as continuous variables or categorical variables with many categories). Because high-cardinality features offer more candidate split points, they have a greater chance of producing a good split by chance. MDI is also computed on training data, so it can inflate the importance of features the model has overfit to. These limitations were documented by Strobl et al. (2007) and are also noted in the scikit-learn documentation. Switching to permutation-based importance does not remove every distortion: Hooker and Mentch (2019) showed that permutation-based importances can themselves be misleading when features are correlated.

XGBoost importance: gain, weight, and cover

XGBoost exposes three native global importance measures, each computed from the gradient-boosted tree ensemble. Practitioners often inspect more than one because rankings can disagree.

XGBoost importance type	Definition	What it measures	Typical use
gain	Average loss reduction contributed by splits on a feature	How much accuracy a feature buys when it is used	Default for feature_importances_ in the scikit-learn wrapper (tree boosters); closest to MDI
weight (also `frequency`)	Number of times a feature is used as a split	Raw split count	Default for Booster.get_score; sensitive to deep trees and many small-gain splits
cover	Average number of samples affected by splits on a feature	Coverage of the feature in the dataset	Useful for spotting features that affect rare regions
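To see why the three rankings can disagree, the definitions in the table can be applied to a handful of split records. This is a minimal sketch with invented records and feature names; it also simplifies cover to a plain sample count, following the table's wording, and does not call XGBoost itself.

```python
from collections import defaultdict

# Hypothetical split records: (feature name, loss reduction, samples covered)
splits = [
    ("age", 12.0, 1000),
    ("age", 2.0, 100),
    ("age", 1.0, 50),
    ("income", 9.0, 600),
]

totals = defaultdict(lambda: {"n": 0, "gain": 0.0, "cover": 0.0})
for feature, split_gain, split_cover in splits:
    totals[feature]["n"] += 1
    totals[feature]["gain"] += split_gain
    totals[feature]["cover"] += split_cover

weight = {f: t["n"] for f, t in totals.items()}               # split count
gain = {f: t["gain"] / t["n"] for f, t in totals.items()}     # average gain per split
cover = {f: t["cover"] / t["n"] for f, t in totals.items()}   # average cover per split

print(weight)  # age is used more often
print(gain)    # income has the larger average gain
```

Here `age` wins on weight because it is split on three times, while `income` wins on gain because its single split is a strong one.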

In XGBoost the importance type is selected via the importance_type argument:

import xgboost as xgb

model = xgb.XGBClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

gain = model.get_booster().get_score(importance_type="gain")
weight = model.get_booster().get_score(importance_type="weight")
cover = model.get_booster().get_score(importance_type="cover")

In LightGBM, feature_importance(importance_type="split") returns the weight-style count and importance_type="gain" returns the gain-style loss reduction. CatBoost exposes a PredictionValuesChange importance and a LossFunctionChange importance, the latter similar in spirit to permutation importance.

Coefficient magnitude for linear models

For linear regression and logistic regression, the absolute value of each feature's learned coefficient can serve as a measure of importance. A larger absolute coefficient means the feature has a stronger influence on the prediction.

However, raw coefficients are only comparable when all features share the same scale. If one feature is measured in thousands and another in fractions, their coefficients will reflect those scales rather than true importance. The standard practice is to standardize all features (zero mean, unit variance) before training, which produces standardized coefficients that can be directly compared:

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
import numpy as np

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

model = LogisticRegression().fit(X_train_scaled, y_train)
# coef_ has shape (1, n_features) for a binary classifier; take the single row
importances = np.abs(model.coef_[0])

For a generalized linear model with link $g$, the contribution of feature $x_j$ to the linear predictor is $\beta_j x_j$. The signed contribution is what regulators typically expect for adverse-action notices, since it preserves direction.
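As a small illustration of signed contributions, the sketch below computes $\beta_j x_j$ for one instance of a logistic model. The coefficients, intercept, and feature names are hypothetical values chosen for the example, and the features are assumed to be standardized already.

```python
import math

# Hypothetical fitted logistic regression on standardized features
intercept = -0.5
coefficients = {"debt_ratio": 1.2, "income": -0.8, "account_age": -0.3}

# One (hypothetical) instance, expressed in standardized units
x = {"debt_ratio": 1.5, "income": -0.5, "account_age": 2.0}

# Signed contribution of each feature to the linear predictor: beta_j * x_j
contributions = {name: coefficients[name] * x[name] for name in coefficients}

linear_predictor = intercept + sum(contributions.values())
probability = 1.0 / (1.0 + math.exp(-linear_predictor))  # inverse logit link

# Rank from the feature that pushes the score up the most to the one that pulls it down
ranked = sorted(contributions.items(), key=lambda item: item[1], reverse=True)
for name, value in ranked:
    print(f"{name}: {value:+.2f}")
```

The sign is what distinguishes a feature that raised the score from one that lowered it, which the absolute coefficient alone cannot show.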

Limitations: When features are highly correlated (multicollinearity), coefficients become unstable and can swing between positive and negative values. In such cases, coefficient magnitude is unreliable. L1 and L2 regularization help stabilize coefficients but do not fully resolve the interpretation problem. Lasso zeroes out some coefficients entirely, which is helpful for selection but discards correlated alternatives.

Model-agnostic methods

Model-agnostic methods can be applied to any fitted model, regardless of its internal structure. This makes them especially valuable for comparing feature importance across different model types or for explaining proprietary black-box systems.

Permutation importance

Permutation importance was originally introduced by Leo Breiman (2001) as part of the random forest algorithm and later generalized into a fully model-agnostic technique by Fisher, Rudin, and Dominici (2019), who called it "model reliance." The core idea is simple: if a feature is important, randomly shuffling its values should degrade model performance; if the feature is unimportant, shuffling it should have little effect.

Algorithm:

Train the model and compute a baseline performance score (e.g., accuracy, R-squared, or any scoring metric) on a held-out dataset.
For each feature $j$:
- Randomly shuffle (permute) the values of feature $j$ across all samples.
- Recompute the model's performance score on the permuted data.
- Record the importance as the decrease in score: $I_j = S_\mathrm{baseline} - S_\mathrm{permuted}$.
Repeat the permutation multiple times and average the results to reduce variance.
Rank features by descending importance.
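The steps above can be sketched from scratch with NumPy. This is an illustrative implementation under stated assumptions, not the scikit-learn code: the function name `permutation_importance_sketch` is hypothetical, the "model" is a fixed prediction function standing in for a fitted estimator, and the score is R-squared.

```python
import numpy as np


def r2_score(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def permutation_importance_sketch(predict, X, y, n_repeats=10, seed=0):
    """Mean and std of the score drop when each column is shuffled."""
    rng = np.random.default_rng(seed)
    baseline = r2_score(y, predict(X))
    drops = np.zeros((X.shape[1], n_repeats))
    for j in range(X.shape[1]):
        for r in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])  # shuffle one column only
            drops[j, r] = baseline - r2_score(y, predict(X_perm))
    return drops.mean(axis=1), drops.std(axis=1)


# Toy held-out data: the target depends on columns 0 and 1, never on column 2
rng = np.random.default_rng(42)
X_holdout = rng.normal(size=(500, 3))
y_holdout = 3.0 * X_holdout[:, 0] + 1.0 * X_holdout[:, 1]


def predict(X):
    """Stand-in for a fitted model's predict method."""
    return 3.0 * X[:, 0] + 1.0 * X[:, 1]


mean_imp, std_imp = permutation_importance_sketch(predict, X_holdout, y_holdout)
ranking = np.argsort(mean_imp)[::-1]  # descending importance
print(ranking, mean_imp.round(3))
```

Column 2 receives an importance of exactly zero here because the prediction function never reads it, so shuffling it cannot change the score.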

In scikit-learn, permutation importance is available through the permutation_importance function:

from sklearn.inspection import permutation_importance

result = permutation_importance(
    model, X_test, y_test,
    n_repeats=30,
    random_state=42
)

for i in result.importances_mean.argsort()[::-1]:
    if result.importances_mean[i] - 2 * result.importances_std[i] > 0:
        print(f"{feature_names[i]}: "
              f"{result.importances_mean[i]:.3f} "
              f"+/- {result.importances_std[i]:.3f}")

Breiman's original definition computed permutation importance on the out-of-bag (OOB) samples of a random forest, which gave a free held-out evaluation without an explicit train-test split. Modern model-agnostic implementations usually rely on an explicit held-out set instead.

Advantages: Permutation importance is model-agnostic, does not require retraining, and can be computed on a held-out test set (which means it reflects generalization performance rather than training-set memorization). It also does not exhibit the high-cardinality bias that affects MDI.

Limitations: When features are correlated, permuting one feature does not substantially hurt performance because the model can still extract the same information from its correlated partner. The permutation step can also create unrealistic data points (for instance, shuffling "number of bedrooms" independently of "square footage" produces houses that would never exist), which can distort estimates. Conditional permutation importance, introduced by Strobl et al. (2008), addresses this by permuting within strata of correlated features.

Drop-column importance (leave-one-covariate-out)

Drop-column importance, also known as Leave-One-Covariate-Out (LOCO), provides perhaps the most direct answer to the question "how much does this feature contribute?" The procedure drops each feature one at a time, retrains the model from scratch, and measures the change in performance.

Algorithm:

Train the model on all features and record baseline performance.
For each feature $j$:
- Remove feature $j$ from the dataset.
- Retrain the model on the remaining features.
- Compute the new performance score.
- Record the importance as: $I_j = S_\mathrm{baseline} - S_\mathrm{without,j}$.
Rank features by descending importance.
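A minimal sketch of this procedure follows, assuming ordinary least squares as the model so that retraining is cheap. The helper names and the synthetic data are hypothetical; any estimator with a fit and score step could take the place of `fit_and_score`.

```python
import numpy as np


def fit_and_score(X_train, y_train, X_test, y_test):
    """Fit least squares with an intercept and return R-squared on the test split."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    beta, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    residual = y_test - A_test @ beta
    return 1.0 - np.sum(residual ** 2) / np.sum((y_test - y_test.mean()) ** 2)


def drop_column_importance(X_train, y_train, X_test, y_test):
    """Retrain once per feature and record the drop in held-out score."""
    baseline = fit_and_score(X_train, y_train, X_test, y_test)
    importances = []
    for j in range(X_train.shape[1]):
        keep = [k for k in range(X_train.shape[1]) if k != j]
        score = fit_and_score(X_train[:, keep], y_train, X_test[:, keep], y_test)
        importances.append(baseline - score)
    return np.array(importances)


# Synthetic data: strong effect of column 0, weak effect of column 1, none of column 2
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=400)
X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

importances = drop_column_importance(X_train, y_train, X_test, y_test)
print(importances.round(3))
```

The importance of an irrelevant column can come out slightly negative, since removing a noise feature occasionally improves the held-out score.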

Advantages: Drop-column importance directly measures each feature's contribution to overall model performance. It avoids the unrealistic data problem of permutation importance because the model never sees the dropped feature during training.

Limitations: This method is computationally expensive because it requires retraining the model once per feature. For a dataset with 100 features and a model that takes an hour to train, drop-column importance requires over 100 hours of computation. It also measures a slightly different quantity than permutation importance: the marginal value of adding a feature to a model that already has all other features. For deep learning models with stochastic optimizers, retraining variance can swamp the signal unless many seeds are averaged.

SHAP (SHapley Additive exPlanations)

SHAP (SHapley Additive exPlanations), introduced by Lundberg and Lee (2017), applies Shapley values from cooperative game theory to explain individual predictions. Each feature receives a SHAP value representing how far it pushes the prediction away from the average prediction. The key theoretical property is that SHAP values are the unique additive feature attribution satisfying three desirable axioms: local accuracy, missingness, and consistency.

Formally, the SHAP value of feature $i$ for an instance $x$ is the Shapley value:

$$\phi_i(x) = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(|F|-|S|-1)!}{|F|!} \, [v(S \cup \{i\}) - v(S)]$$

where $F$ is the set of all features and $v(S)$ is the model's prediction conditioned on knowing only the values of features in subset $S$. Computing this exactly is exponential in the number of features, which is why specialized fast variants exist.
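For a handful of features the sum over subsets can be evaluated directly. The sketch below is a brute-force illustration, not how the SHAP library computes values. It makes one simplifying assumption that should be read as such: $v(S)$ is approximated by evaluating the model with the features outside $S$ replaced by a single fixed baseline, rather than by a conditional expectation. The toy model and all names are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(value, n_features):
    """Exact Shapley values by enumerating every subset (exponential cost)."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def model(x):
    """Toy model with an additive term and an interaction term."""
    return 2.0 * x[0] + x[1] * x[2]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]


def value(subset):
    """v(S): features in S come from x, the rest from the baseline (an assumption)."""
    mixed = [x[j] if j in subset else baseline[j] for j in range(len(x))]
    return model(mixed)


phi = shapley_values(value, len(x))
print(phi)  # the interaction x[1] * x[2] = 6 is shared equally between features 1 and 2
```

The attributions sum to `model(x) - model(baseline)`, which is the local accuracy property in this simplified setting.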

SHAP provides both local explanations (why a specific prediction was made) and global importance (which features matter most across the entire dataset). Global SHAP importance is typically computed as the mean absolute SHAP value for each feature across all samples: $\mathrm{Imp}(j) = \frac{1}{N} \sum_{i=1}^{N} |\phi_j(x_i)|$.

The SHAP library offers specialized explainers optimized for different model types:

Explainer	Target models	Speed	Exactness
TreeExplainer	Random forest, XGBoost, LightGBM, CatBoost	Fast	Exact for trees
LinearExplainer	Linear models, logistic regression	Fast	Exact for linear models
KernelExplainer	Any model (model-agnostic)	Slow	Approximate
DeepExplainer	Deep neural networks	Moderate	Approximate (DeepLIFT-based)
GradientExplainer	Differentiable models	Moderate	Approximate (expected gradients, an extension of Integrated Gradients)
PartitionExplainer	Hierarchical feature groupings	Moderate	Owen value approximation

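import numpy as np
import shap

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Global importance: mean absolute SHAP value per feature.
# Assumes shap_values is a 2-D (n_samples, n_features) array; for classifiers,
# some shap versions return one set of values per class, so select a class first.
global_importance = np.abs(shap_values).mean(axis=0)

# Summary plot showing feature importance and effect direction
shap.summary_plot(shap_values, X_test)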

TreeSHAP and computational complexity

The naive Shapley value computation has complexity $O(2^M)$ in the number of features $M$, which is infeasible for typical tabular models. TreeSHAP, introduced by Lundberg, Erion, and Lee (2018) and extended by Lundberg et al. (2020) in Nature Machine Intelligence, computes exact Shapley values for tree-based models in polynomial time. The complexity is $O(T L D^2)$ per prediction, where $T$ is the number of trees, $L$ is the maximum number of leaves in any tree, and $D$ is the maximum depth. This brings exact local attribution within reach for ensembles of hundreds of trees on datasets with thousands of features.

TreeSHAP supports two estimands, selected in the SHAP library through the feature_perturbation argument of TreeExplainer:

Interventional feature perturbation: uses the marginal distribution of a background dataset; treats unobserved features as if intervened upon. Recommended when interpreting interventional or causal contributions.
Tree-path-dependent (observational): uses the empirical conditional distribution implied by the tree paths, so no background dataset is needed. Recommended when correlations between features should be reflected in the attribution.

KernelSHAP, the model-agnostic variant, uses weighted linear regression on perturbed coalitions and converges to Shapley values in expectation but is much slower.

Advantages: SHAP values have a strong theoretical foundation in game theory, provide both local and global explanations, and show the direction of each feature's effect (not just magnitude). TreeExplainer is computationally efficient for tree-based models. The SHAP library ships rich plotting utilities (summary plots, dependence plots, force plots, decision plots, waterfall plots).

Limitations: KernelExplainer (the model-agnostic variant) is computationally expensive, scaling poorly to large datasets and large feature counts. SHAP importance is on the scale of the prediction (not the loss), which makes it answer a subtly different question than permutation importance. Like permutation importance, SHAP values can be affected by feature correlations, especially in the interventional formulation. Janzing, Minorics, and Bloebaum (2020) have argued that the choice between observational and interventional SHAP corresponds to fundamentally different causal interpretations.

LIME (Local Interpretable Model-Agnostic Explanations)

LIME, proposed by Ribeiro, Singh, and Guestrin (2016), explains individual predictions by fitting a simple interpretable model (typically a sparse linear model) in the local neighborhood of the instance being explained. LIME generates perturbed samples around the instance, weights them by proximity, obtains the black-box model's predictions for these samples, and then fits a sparse linear model to approximate the local decision boundary.

Algorithm:

Select the instance to explain.
Generate perturbed samples in the neighborhood of that instance.
Get the black-box model's predictions for each perturbed sample.
Weight each perturbed sample by its proximity to the original instance using a kernel such as $\pi_x(z) = \exp(-d(x, z)^2 / \sigma^2)$.
Fit a sparse linear model (e.g., Lasso) on the weighted perturbed data.
The coefficients of the local model represent feature importances for that specific prediction.
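The steps above can be sketched for tabular data with NumPy. This is a simplified illustration, not the `lime` package: it uses Gaussian perturbations and plain weighted least squares in place of a sparse (Lasso) fit, and the function name, kernel width, and toy black-box model are all assumptions made for the example.

```python
import numpy as np


def lime_sketch(predict, x, n_samples=1000, scale=0.1, kernel_width=0.25, seed=0):
    """Local linear surrogate around x; returns one coefficient per feature."""
    rng = np.random.default_rng(seed)
    # Perturb the instance with Gaussian noise
    Z = x + rng.normal(scale=scale, size=(n_samples, x.shape[0]))
    # Query the black-box model on the perturbed samples
    y = predict(Z)
    # Weight samples by proximity: exp(-d^2 / sigma^2)
    distances = np.linalg.norm(Z - x, axis=1)
    weights = np.exp(-(distances ** 2) / kernel_width ** 2)
    # Weighted least squares on offsets from x (intercept in column 0)
    A = np.column_stack([np.ones(n_samples), Z - x])
    sqrt_w = np.sqrt(weights)
    coef, *_ = np.linalg.lstsq(A * sqrt_w[:, None], y * sqrt_w, rcond=None)
    return coef[1:]


def black_box(Z):
    """Toy nonlinear model: depends on feature 0 only."""
    return Z[:, 0] ** 2 + 0.0 * Z[:, 1]


x = np.array([1.0, 0.0])
local_importance = lime_sketch(black_box, x)
print(local_importance.round(3))  # close to the local slope (2, 0) at x
```

Changing `seed`, `scale`, or `kernel_width` shifts the coefficients slightly, which is a small-scale view of the instability discussed under the limitations of LIME.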

LIME has tabular, text, and image variants. The image variant uses superpixel segmentation as the perturbation unit, while the text variant masks individual tokens.

Advantages: LIME is model-agnostic and works with any classifier or regressor. It is intuitive because explanations are expressed as simple linear weights. The user can control the number of features in the explanation.

Limitations: LIME explanations can be unstable: running LIME twice on the same instance can produce different results because of randomness in the perturbation process. The quality of the explanation depends on the choice of kernel width and the number of perturbations, which require careful tuning. Unlike SHAP, LIME lacks a strong theoretical guarantee about the uniqueness or optimality of its explanations. Slack et al. (2020) demonstrated that LIME and SHAP can be "fooled" by adversarial models that detect the perturbation distribution and behave benignly only on perturbed inputs.

Gradient-based attribution for deep networks

Deep neural networks accept inputs that often have thousands or millions of dimensions, such as pixels in an image or token embeddings in a language model. Tree-based importance does not apply, and exact Shapley computation is intractable. A family of gradient-based attribution methods has been developed specifically for differentiable models.

Vanilla saliency

The simplest method computes the gradient of the model's output with respect to each input dimension. For a model $f$ and an input $x$, the saliency of input $x_i$ is $|\partial f(x) / \partial x_i|$. This was popularized by Simonyan, Vedaldi, and Zisserman (2013) for visualizing convolutional networks. Vanilla gradients are fast but suffer from gradient saturation and noisy attributions.

Integrated Gradients

Integrated Gradients, introduced by Sundararajan, Taly, and Yan (2017), addresses gradient saturation and enforces two axioms: sensitivity and implementation invariance. The attribution for input dimension $i$ is the path integral of the gradient from a baseline $x'$ to the input $x$:

$$\mathrm{IG}_i(x) = (x_i - x'_i) \int_{0}^{1} \frac{\partial f(x' + \alpha (x - x'))}{\partial x_i} \, d\alpha$$

In practice the integral is approximated by a Riemann sum with 50 to 200 steps. The choice of baseline matters: a black image, a Gaussian noise sample, or the dataset mean each yields different attributions. Integrated Gradients sums to the difference between the prediction at $x$ and the prediction at $x'$ (completeness axiom), which is analogous to the local accuracy property of SHAP values.
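The Riemann-sum approximation and the completeness property can be checked on a toy function whose gradient is known in closed form. The sketch below is an assumption-laden illustration: it uses a midpoint rule, an analytic gradient instead of automatic differentiation, and a zero baseline; the function names are hypothetical.

```python
def model(x):
    """Toy differentiable model."""
    return x[0] ** 2 + 3.0 * x[1]


def gradient(x):
    """Analytic gradient of `model` (stands in for autograd)."""
    return [2.0 * x[0], 3.0]


def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Midpoint Riemann sum of the gradient along the straight path baseline -> x."""
    n = len(x)
    avg_grad = [0.0] * n
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = grad_fn(point)
        for i in range(n):
            avg_grad[i] += g[i] / steps
    # Scale the averaged gradient by the input difference
    return [(xi - b) * gi for xi, b, gi in zip(x, baseline, avg_grad)]


x = [2.0, 1.0]
baseline = [0.0, 0.0]
attributions = integrated_gradients(gradient, x, baseline)

# Completeness: attributions sum to f(x) - f(baseline)
completeness_gap = sum(attributions) - (model(x) - model(baseline))
print(attributions, completeness_gap)
```

For real networks the gap is rarely this small; a large value suggests that more integration steps are needed.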

DeepLIFT

DeepLIFT, proposed by Shrikumar, Greenside, and Kundaje (2017), assigns contribution scores by comparing each neuron's activation to a reference activation. It propagates contributions backward through the network using either Rescale rules or RevealCancel rules. DeepLIFT was designed to overcome gradient saturation issues that affect vanilla gradients, especially in deep networks with ReLU activations. Lundberg and Lee (2017) showed that DeepLIFT can be reformulated as an approximation to SHAP values, which led to the SHAP DeepExplainer implementation.

SmoothGrad

SmoothGrad, introduced by Smilkov, Thorat, Kim, Viegas, and Wattenberg (2017), reduces visual noise in saliency maps by averaging gradients over multiple noisy copies of the input:

$$\mathrm{SmoothGrad}_i(x) = \frac{1}{n} \sum_{k=1}^{n} \frac{\partial f(x + \epsilon_k)}{\partial x_i}$$

where each $\epsilon_k \sim \mathcal{N}(0, \sigma^2 I)$. SmoothGrad is often combined with Integrated Gradients (SmoothGrad-IG) for sharper, more stable attributions on image classification.
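The effect of the averaging can be seen on a one-dimensional toy function whose gradient oscillates rapidly around a steady trend. This is a sketch under assumptions: the function, the noise level `sigma`, and the sample count are chosen for illustration, and the gradient is written analytically rather than obtained from a network.

```python
import math
import random


def gradient(x):
    """Derivative of f(x) = x + 0.1 * sin(50 x): slope 1 plus a fast oscillation."""
    return 1.0 + 5.0 * math.cos(50.0 * x)


def smoothgrad(grad_fn, x, sigma=0.2, n=2000, seed=0):
    """Average the gradient over n Gaussian-perturbed copies of the input."""
    rng = random.Random(seed)
    return sum(grad_fn(x + rng.gauss(0.0, sigma)) for _ in range(n)) / n


x = 0.3
vanilla = gradient(x)              # dominated by the local oscillation
smoothed = smoothgrad(gradient, x)  # close to the underlying slope of 1

print(f"vanilla={vanilla:.2f}, smoothed={smoothed:.2f}")
```

The vanilla gradient at this point even has the wrong sign relative to the overall trend, while the averaged gradient recovers a value near 1.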

Layer-wise Relevance Propagation (LRP)

Layer-wise Relevance Propagation, introduced by Bach et al. (2015), distributes the prediction backward through the network using conservation rules at each layer. Several propagation rules exist, including LRP-0, LRP-epsilon, and LRP-gamma, each with different stability and faithfulness trade-offs. Montavon et al. (2018) provided a unified framework analyzing how various propagation rules affect attributions.

Grad-CAM

For convolutional networks, Grad-CAM (Selvaraju et al., 2017) localizes the regions of an image that influenced a class prediction by combining gradients with feature maps from a chosen convolutional layer. Grad-CAM is widely used in medical imaging to highlight tumor regions or lesions that drive a model's diagnosis.
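The combination step can be sketched with NumPy on hand-made arrays. The example assumes the feature maps and the gradients of the class score with respect to them have already been extracted from a network; the tiny 2x2 arrays and the function name `grad_cam` are illustrative only.

```python
import numpy as np


def grad_cam(feature_maps, gradients):
    """Heatmap from arrays of shape (channels, height, width)."""
    # Channel weights: global average pooling of the gradients
    channel_weights = gradients.mean(axis=(1, 2))
    # Weighted sum of feature maps over the channel axis
    cam = np.tensordot(channel_weights, feature_maps, axes=1)
    # Keep only regions with a positive influence on the class score
    cam = np.maximum(cam, 0.0)
    # Normalize to [0, 1] for display when the map is not all zero
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam


# Two hypothetical channels: one fires top-left, the other bottom-right
feature_maps = np.array([
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 1.0]],
])
# Channel 0 supports the class (positive gradient), channel 1 opposes it
gradients = np.array([np.ones((2, 2)), -np.ones((2, 2))])

heatmap = grad_cam(feature_maps, gradients)
print(heatmap)
```

Only the top-left cell survives, because the bottom-right activation belongs to a channel whose gradient is negative and is removed by the ReLU.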

Comparison of gradient-based methods

Method	Required model property	Output	Strengths	Weaknesses
Vanilla saliency	Differentiability	Per-input gradient	Cheap	Noisy, saturation
Integrated Gradients	Differentiability, baseline	Path integral	Completeness, axiomatic	Baseline-dependent
DeepLIFT	ReLU-style nonlinearities	Reference-based contribution	Avoids saturation	Reference-dependent
SmoothGrad	Differentiability	Averaged gradient	Visual stability	Adds compute
Layer-wise Relevance Propagation	Differentiability	Propagated relevance	Conservation	Rule choice non-trivial
Grad-CAM	CNN with feature maps	Coarse heatmap	Class-discriminative	Resolution limited

Captum

Captum is an open-source PyTorch library released by Meta AI in 2019 that implements a unified API for these gradient-based attribution methods, plus several additional ones such as Occlusion, Feature Ablation, and Shapley Value Sampling. The library exposes a consistent interface where each algorithm subclasses the Attribution class:

import torch
from captum.attr import IntegratedGradients

model.eval()
ig = IntegratedGradients(model)
input_tensor = torch.randn(1, 3, 224, 224, requires_grad=True)
baseline = torch.zeros_like(input_tensor)

attributions, delta = ig.attribute(
    input_tensor, baselines=baseline, target=0,
    return_convergence_delta=True,
)

Captum integrates with PyTorch modules, supports both image and text models, and includes utilities for visualization and noise tunneling (the SmoothGrad pattern). Comparable tooling outside PyTorch includes tf-explain and iNNvestigate for TensorFlow and Keras models, and the AIX360 toolkit from IBM.

Attribution for transformers and large language models

Transformer architectures used in modern large language models introduce additional considerations for feature attribution. Inputs are token embeddings rather than raw scalar features, and the model contains attention layers whose weights are often misinterpreted as importance scores.

Attention as importance: a contested signal

Attention weights describe how each token attends to others within a layer. It is tempting to read the attention weight $\alpha_{ij}$ from token $i$ to token $j$ as a measure of how much token $j$ contributes to token $i$'s representation. Jain and Wallace (2019) and Wiegreffe and Pinter (2019) argued that attention weights are not reliable explanations: they can be perturbed without changing predictions, and different attention distributions can yield identical outputs. Modern guidance treats raw attention as a diagnostic, not an explanation.

Token-level Integrated Gradients

Integrated Gradients applied at the embedding layer of a transformer assigns contribution scores to individual tokens. The baseline is typically a sequence of zero embeddings or a mean-embedding sequence. Captum's LayerIntegratedGradients is the standard tool for this:

from captum.attr import LayerIntegratedGradients

lig = LayerIntegratedGradients(model, model.embeddings)
attributions, delta = lig.attribute(
    input_ids,
    baselines=baseline_ids,
    target=target_class,
    return_convergence_delta=True,
)

Token attributions are commonly summed across the embedding dimension and visualized as heatmaps over the input text.

Attention rollout and attention flow

Abnar and Zuidema (2020) proposed attention rollout, which composes attention matrices across layers via matrix multiplication, and attention flow, which formulates the importance problem as max-flow over an attention graph. These methods give a layer-aggregated view of token-to-token influence.
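Attention rollout reduces to a chain of matrix products. The sketch below assumes each layer's attention has already been averaged over heads into a row-stochastic matrix, and it mixes in the identity with equal weight to account for residual connections; that mixing weight and the two toy layers are assumptions of this example.

```python
import numpy as np


def attention_rollout(attentions):
    """Compose per-layer (tokens x tokens) attention matrices, first layer first."""
    n_tokens = attentions[0].shape[0]
    rollout = np.eye(n_tokens)
    for attention in attentions:
        # Blend with the identity to represent the residual stream, then renormalize rows
        mixed = 0.5 * attention + 0.5 * np.eye(n_tokens)
        mixed = mixed / mixed.sum(axis=1, keepdims=True)
        # Later layers multiply on the left
        rollout = mixed @ rollout
    return rollout


# Two hypothetical layers over two tokens
layer_1 = np.array([[0.5, 0.5], [0.0, 1.0]])
layer_2 = np.array([[0.0, 1.0], [1.0, 0.0]])

rollout = attention_rollout([layer_1, layer_2])
print(rollout)  # row i: how much each input token feeds output position i
```

Each row still sums to 1, so the result can be read as a distribution over input tokens for every output position.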

Mechanistic interpretability

Beyond attribution scores, the mechanistic interpretability program seeks to identify circuits, features, and computations within transformer weights. Anthropic's 2023 work on monosemantic features used sparse autoencoders to decompose neuron activations into interpretable features. The 2024 "Scaling Monosemanticity" paper extended this to Claude 3 Sonnet and identified millions of features. This line of work targets a finer-grained question than feature importance: not just which input contributes, but what computation is performed and why. Mechanistic interpretability and feature importance address complementary aspects of model understanding.

Comparison of methods

The following table summarizes the key characteristics of the most widely used feature importance methods:

Method	Scope	Model requirement	Computation cost	Handles correlated features well?	Provides direction of effect?
Impurity-based (MDI)	Global	Tree-based models only	Very low (computed during training)	No (splits importance across correlated features)	No
XGBoost gain	Global	XGBoost / boosted trees	Very low	No	No
XGBoost weight	Global	XGBoost / boosted trees	Very low	No	No
XGBoost cover	Global	XGBoost / boosted trees	Very low	No	No
Coefficient magnitude	Global	Linear models only	Very low (read from trained model)	No (unstable with multicollinearity)	Yes (sign of coefficient)
Permutation importance	Global	Any model	Moderate (no retraining)	No (underestimates correlated features)	No
Conditional permutation	Global	Any model	High	Yes (within strata)	No
Drop-column (LOCO)	Global	Any model	Very high (retrains per feature)	Partially (retraining captures redistribution)	No
KernelSHAP	Local and global	Any model	High	Partial (interventional vs. observational)	Yes
TreeSHAP	Local and global	Tree-based models	Low ($O(TLD^2)$)	Partial	Yes
DeepSHAP / DeepLIFT	Local	Differentiable models	Moderate	Partial	Yes
LIME	Local	Any model	Moderate	No (local perturbations affected by correlations)	Yes (sign of local coefficient)
Integrated Gradients	Local	Differentiable models	Moderate	Partial (baseline-dependent)	Yes
Layer-wise Relevance Propagation	Local	Differentiable models	Moderate	Partial	Yes
Grad-CAM	Local	CNNs with feature maps	Low	Partial	Yes
Attention weights	Local	Attention-based models	Free (already computed)	No (not a faithful explanation)	No (weights are non-negative)

Common pitfalls and best practices

Correlated features

Correlated features are the most common source of misleading feature importance results, and this problem affects virtually every method. When two features carry similar information, importance methods tend to split the total importance between them, making each appear less important than it would be in isolation. Permutation importance can also overestimate the importance of correlated features by creating impossible feature combinations. Hooker and Mentch (2019) provide an analysis of how feature correlation distorts both MDI and permutation importance.

Practical solutions:

Perform hierarchical clustering on the Spearman rank-order correlation matrix, select a threshold, and retain one representative feature from each cluster.
Apply principal component analysis (PCA) to create uncorrelated features before computing importance.
Use conditional permutation importance, which permutes a feature conditional on the values of its correlated partners.
Group correlated features and compute group importance using the SHAP PartitionExplainer or grouped permutation tests.
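The first option can be approximated without a clustering library. The sketch below swaps hierarchical clustering for a simpler greedy filter: it computes Spearman correlations from ranks and keeps a feature only if it is not strongly correlated with one already kept. The threshold of 0.9, the helper names, and the synthetic data are assumptions, and ties in the data are not handled.

```python
import numpy as np


def spearman_matrix(X):
    """Spearman correlation as the Pearson correlation of column-wise ranks (no ties)."""
    ranks = np.argsort(np.argsort(X, axis=0), axis=0).astype(float)
    return np.corrcoef(ranks, rowvar=False)


def select_representatives(X, threshold=0.9):
    """Greedy stand-in for clustering: keep one feature per correlated group."""
    rho = np.abs(spearman_matrix(X))
    kept = []
    for j in range(X.shape[1]):
        if all(rho[j, k] < threshold for k in kept):
            kept.append(j)
    return kept


# Synthetic data: column 1 is a near-copy of column 0, column 2 is independent
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
X = np.column_stack([a, a + 0.05 * rng.normal(size=500), b])

kept = select_representatives(X)
print(kept)  # indices of the retained representative features
```

Importance computed on the retained columns then describes each correlated group as a whole rather than splitting credit among near-duplicates.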

High-cardinality bias

Impurity-based importance is systematically biased toward features with many unique values. A continuous feature with 10,000 distinct values will tend to score higher than a binary feature, even if the binary feature is the true driver. This was demonstrated by Strobl et al. (2007), who showed that random forests preferentially select high-cardinality variables for splits.

Practical solution: Use permutation importance or SHAP instead of MDI when your dataset contains a mix of continuous and categorical features with varying cardinality.

Importance from overfit models

If a model has memorized the training data, the importance scores derived from that training data will be unreliable. A random noise feature may appear important simply because the model has overfit to it.

Practical solution: Always compute importance on a held-out test set or use cross-validated importance estimates. Scikit-learn's permutation_importance function accepts any dataset, so passing the test set rather than the training set is straightforward.
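The difference between training-set and held-out importance can be illustrated with a deliberately overfit model. This is a minimal sketch on synthetic data, assuming scikit-learn is installed; the unpruned decision tree and the noise level are illustrative choices.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 600
signal = rng.normal(size=n)
noise = rng.normal(size=n)                       # unrelated to the label
y = (signal + rng.normal(size=n) > 0).astype(int)
X = np.column_stack([signal, noise])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# An unpruned tree memorizes the training set, using the noise column to do so.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
train_accuracy = tree.score(X_train, y_train)

train_imp = permutation_importance(
    tree, X_train, y_train, n_repeats=10, random_state=0
).importances_mean
test_imp = permutation_importance(
    tree, X_test, y_test, n_repeats=10, random_state=0
).importances_mean

print(f"train accuracy: {train_accuracy:.2f}")
print(f"noise importance on train: {train_imp[1]:.3f}")
print(f"noise importance on test:  {test_imp[1]:.3f}")
```

The only change between the two calls is the dataset passed in, which is why evaluating on held-out data costs nothing extra in code.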

Confusing importance with causation

Feature importance measures statistical association, not causation. A feature may appear important because it is correlated with the true causal factor, not because it directly influences the outcome. For example, ice cream sales might appear important for predicting drowning rates, but the true driver is temperature. The causal inference literature provides tools, such as do-calculus, that bridge to genuinely causal contributions, but these require strong assumptions and a known causal graph.
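The ice cream example can be simulated directly. The sketch below uses invented numbers and plain NumPy least squares; it works only because the simulation knows that temperature is the confounder, which is exactly the kind of assumption real causal analyses must justify.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
temperature = rng.normal(25.0, 5.0, size=n)                     # common cause
ice_cream = 10.0 * temperature + rng.normal(0.0, 20.0, size=n)  # driven by temperature
drownings = 0.3 * temperature + rng.normal(0.0, 1.0, size=n)    # driven by temperature only


def ols_coefficients(features, target):
    """Least-squares slopes (intercept fitted but not returned)."""
    design = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef[1:]


# Ice cream sales alone look predictive of drownings.
naive_slope = ols_coefficients(ice_cream, drownings)[0]

# Once temperature is included, the ice cream slope collapses toward zero.
adjusted_slope = ols_coefficients(
    np.column_stack([ice_cream, temperature]), drownings
)[0]

print(f"ice cream slope, alone:             {naive_slope:.4f}")
print(f"ice cream slope, with temperature:  {adjusted_slope:.4f}")
```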

Importance depends on model quality

Feature importance is only meaningful when the underlying model performs well. If the model has a low cross-validation score, its importance rankings may be unreliable and unstable. Always validate model performance before interpreting feature importances.

Stability and confidence intervals

A single importance score is a point estimate. Repeating the permutation, bootstrapping the data, or training with multiple random seeds yields a distribution of importance scores. Reporting standard deviations or confidence intervals exposes features whose apparent importance is dominated by noise. Scikit-learn's permutation_importance returns importances_mean and importances_std precisely so users can quantify this uncertainty.
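A minimal sketch of reporting this uncertainty, assuming scikit-learn is installed. The synthetic data and the "mean minus two standard deviations" rule are illustrative; the rule is a rough screening heuristic, not a formal significance test.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = 3.0 * X[:, 0] + rng.normal(size=400)   # only column 0 carries signal

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)
model = LinearRegression().fit(X_train, y_train)

# Each feature is shuffled n_repeats times; result.importances has one row per feature.
result = permutation_importance(
    model, X_test, y_test, n_repeats=30, random_state=0
)

# Heuristic: treat a feature as clearly important only if its
# importance stays above zero after subtracting two standard deviations.
lower_bound = result.importances_mean - 2.0 * result.importances_std
clearly_positive = lower_bound > 0

for i in range(X.shape[1]):
    print(
        f"feature {i}: {result.importances_mean[i]:.3f} "
        f"+/- {result.importances_std[i]:.3f}  clearly positive: {clearly_positive[i]}"
    )
```

The standard deviation here reflects only the randomness of the shuffling. Variation across training seeds or bootstrap resamples would have to be measured by refitting the model.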

Adversarial robustness of explanations

Slack et al. (2020) showed that black-box models can be crafted to deceive both LIME and SHAP. Ghorbani, Abid, and Zou (2019) showed that small input perturbations can drastically change saliency map attributions even when the prediction is unchanged. Auditors should treat any single explanation as one signal among many.

Baseline dependence in attribution

Integrated Gradients and other reference-based methods depend on the choice of baseline. A black image, a gray image, a Gaussian noise sample, and a dataset mean baseline can give different attributions for the same input. Sturmfels, Lundberg, and Lee (2020) recommend using multiple baselines and reporting averaged attributions.
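Baseline dependence can be seen even on a toy model. The sketch below implements Integrated Gradients with a midpoint Riemann sum in plain NumPy for a hypothetical three-feature logistic model; the weights, the input, and the three baselines (including the stand-in "dataset mean") are invented for illustration.

```python
import numpy as np

W = np.array([2.0, -1.0, 0.5])   # hypothetical weights of a toy logistic model


def predict(x):
    return 1.0 / (1.0 + np.exp(-(x @ W)))


def gradient(x):
    # Analytic gradient of the sigmoid output with respect to the input.
    p = predict(x)
    return p * (1.0 - p) * W


def integrated_gradients(x, baseline, steps=200):
    # Midpoint rule along the straight path from baseline to x.
    alphas = (np.arange(steps) + 0.5) / steps
    path = baseline + alphas[:, None] * (x - baseline)
    grads = np.array([gradient(point) for point in path])
    return (x - baseline) * grads.mean(axis=0)


x = np.array([1.0, 2.0, -1.0])
baselines = {
    "zeros": np.zeros(3),
    "mean": np.array([0.5, 0.5, 0.5]),                  # stand-in for a dataset mean
    "noise": np.random.default_rng(0).normal(size=3),   # one Gaussian noise sample
}

attributions = {name: integrated_gradients(x, b) for name, b in baselines.items()}
averaged = np.mean(list(attributions.values()), axis=0)

for name, attr in attributions.items():
    print(f"{name:>6s}: {np.round(attr, 4)}")
print(f"   avg: {np.round(averaged, 4)}")
```

Each attribution vector sums to the difference between the prediction at the input and the prediction at its baseline, so changing the baseline necessarily changes what is being explained.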

Implementation and libraries

Several mature libraries implement feature importance and attribution. The choice depends on the framework and the type of model.

Library	Language / framework	Primary methods	Notes
scikit-learn	Python	MDI, permutation importance	Standard tabular ML toolkit
SHAP	Python	KernelSHAP, TreeSHAP, DeepSHAP, GradientSHAP	Reference SHAP implementation
LIME	Python	LIME for tabular, text, image	Original Ribeiro et al. implementation
Captum	Python / PyTorch	IG, DeepLIFT, SmoothGrad, Saliency, Occlusion	Meta AI library
eli5	Python	Permutation importance, MDI, weights	Older but still useful
AIX360	Python	SHAP, LIME, ProtoDash, BRCG	IBM toolkit
iNNvestigate	Python / TensorFlow	LRP variants, SmoothGrad, IG	Keras-focused
InterpretML	Python	EBM (glass-box), SHAP, LIME	From Microsoft Research
dalex	Python / R	Break-Down, Shapley, Ceteris Paribus	Predictive model audit

Implementation in scikit-learn

Scikit-learn provides two primary interfaces for computing feature importance:

Interface	Function / attribute	Method type	Available since
Impurity-based	`model.feature_importances_`	Built-in (tree models)	scikit-learn 0.1
Permutation	`sklearn.inspection.permutation_importance()`	Model-agnostic	scikit-learn 0.22

The scikit-learn documentation explicitly recommends preferring permutation importance over impurity-based importance when accuracy of rankings matters, noting that MDI importances are biased towards high-cardinality features and are computed on training set statistics that do not reflect generalization to the test set.

A typical workflow combines both methods:

from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
import numpy as np

# Split data (X, y, and feature_names are assumed to be defined already)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train model
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)

# Method 1: Impurity-based (fast but potentially biased)
mdi_importances = rf.feature_importances_

# Method 2: Permutation importance (more reliable)
perm_result = permutation_importance(
    rf, X_test, y_test, n_repeats=30, random_state=42
)
perm_importances = perm_result.importances_mean

# Compare
for name, mdi, perm in sorted(
    zip(feature_names, mdi_importances, perm_importances),
    key=lambda x: x[2], reverse=True  # sort by permutation importance, descending
):
    print(f"{name:>20s}: MDI={mdi:.4f}  Permutation={perm:.4f}")

Implementation with SHAP

The SHAP library is published under the MIT license. Its TreeExplainer supports tree ensembles from XGBoost, LightGBM, CatBoost, and scikit-learn:

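import shap
import xgboost as xgb

# params, dtrain, and X_test are assumed to be defined already;
# X_test should be a pandas DataFrame that includes an "age" column.
booster = xgb.train(params, dtrain, num_boost_round=200)

# TreeExplainer computes SHAP values for tree ensembles
explainer = shap.TreeExplainer(booster)
shap_values = explainer.shap_values(X_test)

# Global summary plot, then a dependence plot for a single feature
shap.summary_plot(shap_values, X_test, plot_type="dot")
shap.dependence_plot("age", shap_values, X_test)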

Worked example: house price regression

Consider a regression model that predicts house prices from features such as square footage, number of bedrooms, lot size, neighborhood, year built, and number of bathrooms. The following observations illustrate how different methods may rank the same features:

MDI on a random forest might rank square_footage first because it offers many split points and is heavily used during training.
Permutation importance on the test set may rank neighborhood higher than MDI suggests because shuffling the categorical neighborhood label catastrophically degrades predictions.
TreeSHAP values may show that year_built has a positive contribution for newer homes and a negative contribution for older homes, revealing a non-linear effect that aggregate importance hides.
Drop-column importance may reveal that bedrooms and bathrooms jointly matter but neither is critical alone, because they act as redundant proxies for size.

The disagreement is informative. Examining several methods together is the standard professional practice when stakes are high.
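The redundancy effect described for bedrooms and bathrooms can be reproduced on synthetic data. This is a sketch only, assuming scikit-learn is installed: the latent size variable, the noise levels, the lot_noise column, and the holdout_r2 helper are all invented for illustration and do not come from a real housing dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
size = rng.normal(size=n)                     # latent driver, never shown to the model
bedrooms = size + 0.3 * rng.normal(size=n)    # noisy proxy for size
bathrooms = size + 0.3 * rng.normal(size=n)   # second noisy proxy for size
lot_noise = rng.normal(size=n)                # irrelevant feature
price = size + 0.1 * rng.normal(size=n)

names = ["bedrooms", "bathrooms", "lot_noise"]
X = np.column_stack([bedrooms, bathrooms, lot_noise])
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.5, random_state=0
)


def holdout_r2(columns):
    """Retrain on the given columns and return R^2 on the test split."""
    model = LinearRegression().fit(X_train[:, columns], y_train)
    return model.score(X_test[:, columns], y_test)


full_r2 = holdout_r2([0, 1, 2])

# Drop-column importance: loss in test R^2 after retraining without one feature.
drop_one = {
    name: full_r2 - holdout_r2([j for j in range(3) if j != i])
    for i, name in enumerate(names)
}
# Dropping both proxies together removes almost all predictive power.
drop_both = full_r2 - holdout_r2([2])

for name, value in drop_one.items():
    print(f"drop {name:>9s}: {value:.3f}")
print(f"drop bedrooms + bathrooms: {drop_both:.3f}")
```

Because each proxy can largely stand in for the other, removing one costs little, while removing both is costly. Single-feature scores miss this pattern.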

Related interpretability tools

Feature importance gives a single score per feature, but several adjacent tools answer related questions and are commonly reported alongside importance scores. Partial dependence plots (PDPs), introduced by Friedman (2001), show the average effect of one or two features on the prediction. Individual Conditional Expectation (ICE) plots show the same relationship per-instance. Accumulated Local Effects (ALE) plots, introduced by Apley and Zhu (2020), generalize PDPs to handle correlated features. SHAP interaction values decompose each prediction into main effects and pairwise interaction terms. Counterfactual explanations and anchors (Ribeiro, Singh, and Guestrin, 2018) provide rule-based local explanations.

Explain like I'm 5 (ELI5)

Imagine you are baking cookies, and you want to know which ingredient matters most for making them taste good. You could try leaving out the sugar one time, leaving out the butter another time, and leaving out the vanilla another time. Whichever ingredient, when removed, makes the cookies taste the worst is the most important ingredient. Feature importance works the same way: it tests what happens to a computer's predictions when each piece of information is taken away or scrambled, and the pieces that cause the biggest mess when removed are the most important ones.

References

Breiman, L. (2001). "Random Forests." *Machine Learning*, 45(1), 5-32. doi:10.1023/A:1010933404324
Lundberg, S. M., & Lee, S.-I. (2017). "A Unified Approach to Interpreting Model Predictions." *Advances in Neural Information Processing Systems 30 (NeurIPS)*, 4765-4774. arXiv:1705.07874
Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). "'Why Should I Trust You?': Explaining the Predictions of Any Classifier." *Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining*, 1135-1144. arXiv:1602.04938 doi:10.1145/2939672.2939778
Fisher, A., Rudin, C., & Dominici, F. (2019). "All Models are Wrong, but Many are Useful: Learning a Variable's Importance by Studying an Entire Class of Prediction Models Simultaneously." *Journal of Machine Learning Research*, 20(177), 1-81.
Strobl, C., Boulesteix, A.-L., Zeileis, A., & Hothorn, T. (2007). "Bias in Random Forest Variable Importance Measures: Illustrations, Sources and a Solution." *BMC Bioinformatics*, 8, 25. doi:10.1186/1471-2105-8-25
Strobl, C., Boulesteix, A.-L., Kneib, T., Augustin, T., & Zeileis, A. (2008). "Conditional Variable Importance for Random Forests." *BMC Bioinformatics*, 9, 307. doi:10.1186/1471-2105-9-307
Hooker, G., & Mentch, L. (2019). "Please Stop Permuting Features: An Explanation and Alternatives." arXiv:1905.03151
Sundararajan, M., Taly, A., & Yan, Q. (2017). "Axiomatic Attribution for Deep Networks." *Proceedings of the 34th International Conference on Machine Learning*, 3319-3328. arXiv:1703.01365
Shrikumar, A., Greenside, P., & Kundaje, A. (2017). "Learning Important Features Through Propagating Activation Differences." *Proceedings of the 34th International Conference on Machine Learning*, 3145-3153. arXiv:1704.02685
Smilkov, D., Thorat, N., Kim, B., Viegas, F., & Wattenberg, M. (2017). "SmoothGrad: removing noise by adding noise." arXiv:1706.03825
Bach, S., Binder, A., Montavon, G., Klauschen, F., Mueller, K.-R., & Samek, W. (2015). "On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation." *PLOS ONE*, 10(7), e0130140. doi:10.1371/journal.pone.0130140
Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., & Batra, D. (2017). "Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization." *Proceedings of the IEEE International Conference on Computer Vision*, 618-626. arXiv:1610.02391
Lundberg, S. M., Erion, G., Chen, H., DeGrave, A., Prutkin, J. M., Nair, B., Katz, R., Himmelfarb, J., Bansal, N., & Lee, S.-I. (2020). "From Local Explanations to Global Understanding with Explainable AI for Trees." *Nature Machine Intelligence*, 2, 56-67. doi:10.1038/s42256-019-0138-9
Lundberg, S. M., Erion, G., & Lee, S.-I. (2018). "Consistent Individualized Feature Attribution for Tree Ensembles." arXiv:1802.03888
Janzing, D., Minorics, L., & Bloebaum, P. (2020). "Feature relevance quantification in explainable AI: A causal problem." *International Conference on Artificial Intelligence and Statistics*. arXiv:1910.13413
Slack, D., Hilgard, S., Jia, E., Singh, S., & Lakkaraju, H. (2020). "Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods." *Proceedings of the 2020 AAAI/ACM Conference on AI, Ethics, and Society*. arXiv:1911.02508
Jain, S., & Wallace, B. C. (2019). "Attention is not Explanation." *Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics*. arXiv:1902.10186
Wiegreffe, S., & Pinter, Y. (2019). "Attention is not not Explanation." *Proceedings of EMNLP-IJCNLP 2019*. arXiv:1908.04626
Abnar, S., & Zuidema, W. (2020). "Quantifying Attention Flow in Transformers." *Proceedings of ACL 2020*. arXiv:2005.00928
Ghorbani, A., Abid, A., & Zou, J. (2019). "Interpretation of Neural Networks is Fragile." *Proceedings of the AAAI Conference on Artificial Intelligence*, 33(1), 3681-3688. arXiv:1710.10547
Sturmfels, P., Lundberg, S., & Lee, S.-I. (2020). "Visualizing the Impact of Feature Attribution Baselines." *Distill*. doi:10.23915/distill.00022
Friedman, J. H. (2001). "Greedy Function Approximation: A Gradient Boosting Machine." *Annals of Statistics*, 29(5), 1189-1232.
Goldstein, A., Kapelner, A., Bleich, J., & Pitkin, E. (2015). "Peeking Inside the Black Box: Visualizing Statistical Learning With Plots of Individual Conditional Expectation." *Journal of Computational and Graphical Statistics*, 24(1), 44-65.
Apley, D. W., & Zhu, J. (2020). "Visualizing the Effects of Predictor Variables in Black Box Supervised Learning Models." *Journal of the Royal Statistical Society Series B*, 82(4), 1059-1086. arXiv:1612.08468
Ribeiro, M. T., Singh, S., & Guestrin, C. (2018). "Anchors: High-Precision Model-Agnostic Explanations." *AAAI Conference on Artificial Intelligence*.
Wager, S., & Athey, S. (2018). "Estimation and Inference of Heterogeneous Treatment Effects using Random Forests." *Journal of the American Statistical Association*, 113(523), 1228-1242.
Mentch, L., & Hooker, G. (2016). "Quantifying Uncertainty in Random Forests via Confidence Intervals and Hypothesis Tests." *Journal of Machine Learning Research*, 17(26), 1-41.
Molnar, C. (2022). *Interpretable Machine Learning: A Guide for Making Black Box Models Explainable* (2nd ed.). christophm.github.io/interpretable-ml-book
Scikit-learn Developers. "Permutation Feature Importance." *scikit-learn documentation*. scikit-learn.org/stable/modules/permutation_importance.html
Altmann, A., Tolosi, L., Sander, O., & Lengauer, T. (2010). "Permutation Importance: A Corrected Feature Importance Measure." *Bioinformatics*, 26(10), 1340-1347. doi:10.1093/bioinformatics/btq134
Kokhlikyan, N., Miglani, V., Martin, M., et al. (2020). "Captum: A unified and generic model interpretability library for PyTorch." arXiv:2009.07896
Simonyan, K., Vedaldi, A., & Zisserman, A. (2013). "Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps." arXiv:1312.6034
Montavon, G., Samek, W., & Mueller, K.-R. (2018). "Methods for interpreting and understanding deep neural networks." *Digital Signal Processing*, 73, 1-15.
