Feature Importances
Last reviewed
May 9, 2026
Sources
33 citations
Review status
Source-backed
Revision
v3 · 6,498 words
Feature importances are numeric scores that quantify how much each input feature contributes to the predictions of a machine learning model. Understanding which features matter most is central to model interpretability, feature engineering, debugging, and building trust in deployed systems. A wide range of techniques exist for measuring feature importance, from methods built into specific model families to model-agnostic approaches that work with any estimator. Modern practice draws on tree-based attribution measures, post-hoc explanations grounded in cooperative game theory, and gradient-based attribution methods designed for deep networks.
The topic sits at the intersection of statistics, optimization, and explainable AI. Feature importance methods do not all answer the same question: some measure how a feature shapes the loss during training, some measure how it affects predictions on held-out data, and some measure how it contributes to a single prediction for a single sample.
Knowing which features drive a model's predictions serves several practical purposes:
Across all of these uses, the key question is whether the method's ranking matches the predictive structure that the user actually cares about. Different methods give different answers because they each measure a different mathematical object.
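Feature importance methods split into two broad categories based on the scope of the explanation they provide: global methods, which summarize how much each feature matters across an entire dataset, and local methods, which attribute a single prediction for a single sample to its input features.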
A second axis distinguishes model-specific methods, which exploit the internal structure of a particular family (such as the split statistics of a decision tree), from model-agnostic methods, which treat the model as a black box. Model-specific methods are usually faster and can be exact but apply only to compatible models. Model-agnostic methods generalize across architectures at the cost of additional computation.
The two axes combine to give a 2x2 taxonomy that helps practitioners locate any method in the space:
| Scope vs. coupling | Model-specific | Model-agnostic |
|---|---|---|
| Global | Mean Decrease in Impurity, XGBoost gain/weight/cover, linear coefficient magnitude | Permutation importance, drop-column (LOCO), global SHAP aggregates |
| Local | TreeSHAP, DeepLIFT, Integrated Gradients (uses model gradients) | KernelSHAP, LIME, counterfactual explanations |
Impurity-based importance, also called Mean Decrease in Impurity (MDI) or Gini importance, is the default feature importance method for tree-based models such as random forest, gradient boosting, and individual decision trees. During training, each node in a tree splits on a feature to reduce impurity (measured by the Gini index for classification or variance for regression). The importance of a feature is the total reduction in impurity it provides across all splits in all trees, weighted by the number of samples reaching each split.
Formally, for a single tree the MDI importance of feature $j$ is:
$$\mathrm{MDI}(j) = \sum_{t \in T_j} \frac{N_t}{N} \cdot \Delta i(t)$$
where $T_j$ is the set of internal nodes that split on feature $j$, $N_t$ is the number of training samples reaching node $t$, $N$ is the total number of training samples, and $\Delta i(t)$ is the impurity decrease at node $t$. For a forest of $B$ trees, the per-feature importance is averaged: $\mathrm{MDI}_{\mathrm{forest}}(j) = \frac{1}{B} \sum_{b=1}^{B} \mathrm{MDI}_b(j)$.
In scikit-learn, impurity-based importances are accessible through the feature_importances_ attribute of any fitted tree-based estimator:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Impurity-based importances
importances = model.feature_importances_
Advantages: MDI is fast to compute because it requires no additional evaluation after training. It is readily available for all tree-based models in scikit-learn, XGBoost, LightGBM, and CatBoost.
Limitations: MDI is biased toward high-cardinality features (features with many unique values, such as continuous variables or categorical variables with many categories). Because high-cardinality features offer more candidate split points, they have a greater chance of producing a good split by chance. MDI is also computed on training data, so it can inflate the importance of features the model has overfit to. These limitations were documented by Strobl et al. (2007) and confirmed by the scikit-learn team. Hooker and Mentch (2019) showed the bias persists even with permutation-based importances when features are correlated.
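As a worked check of the MDI formula, the sketch below evaluates it by hand on a small hypothetical tree with two internal nodes. The feature names, class counts, and tree shape are invented for illustration and do not come from any fitted model; the final normalization mirrors the fact that scikit-learn reports importances rescaled to sum to one.

```python
# Hypothetical tree trained on N = 100 samples (all numbers are invented).
# Each internal node records the feature it splits on and the class counts
# (negatives, positives) at the node and in its two children.

def gini(counts):
    """Gini impurity of a node given its class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def impurity_decrease(parent, left, right):
    """Delta i(t): parent impurity minus the sample-weighted child impurity."""
    n = sum(parent)
    return (gini(parent)
            - sum(left) / n * gini(left)
            - sum(right) / n * gini(right))

N = 100
internal_nodes = [
    # (feature, parent counts, left child counts, right child counts)
    ("age",    (50, 50), (40, 10), (10, 40)),  # root, N_t = 100
    ("income", (40, 10), (30, 0),  (10, 10)),  # left child of root, N_t = 50
]

mdi = {}
for feature, parent, left, right in internal_nodes:
    weight = sum(parent) / N  # the N_t / N factor
    mdi[feature] = mdi.get(feature, 0.0) + weight * impurity_decrease(parent, left, right)

# Rescale so the importances sum to one.
total = sum(mdi.values())
normalized = {feature: value / total for feature, value in mdi.items()}
```

In this toy tree the root split on age removes 0.18 units of weighted impurity and the deeper split on income removes 0.06, so the normalized importances are 0.75 and 0.25. The deeper split is discounted because only half of the samples reach it.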
XGBoost exposes three native global importance measures, each computed from the gradient-boosted tree ensemble. Practitioners often inspect more than one because rankings can disagree.
| XGBoost importance type | Definition | What it measures | Typical use |
|---|---|---|---|
| gain | Average loss reduction contributed by splits on a feature | How much accuracy a feature buys when it is used | Default for feature_importances_ in the scikit-learn wrapper with tree boosters (get_score itself defaults to weight); closest to MDI |
| weight (also frequency) | Number of times a feature is used as a split | Raw split count | Sensitive to deep trees and many small-gain splits |
| cover | Average number of samples affected by splits on a feature | Coverage of the feature in the dataset | Useful for spotting features that affect rare regions |
In XGBoost the importance type is selected via the importance_type argument:
import xgboost as xgb
model = xgb.XGBClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)
gain = model.get_booster().get_score(importance_type="gain")
weight = model.get_booster().get_score(importance_type="weight")
cover = model.get_booster().get_score(importance_type="cover")
In LightGBM, feature_importance(importance_type="split") returns the weight-style count and importance_type="gain" returns the gain-style loss reduction. CatBoost exposes a PredictionValuesChange importance and a LossFunctionChange importance, the latter similar in spirit to permutation importance.
For linear regression and logistic regression, the absolute value of each feature's learned coefficient can serve as a measure of importance. A larger absolute coefficient means the feature has a stronger influence on the prediction.
However, raw coefficients are only comparable when all features share the same scale. If one feature is measured in thousands and another in fractions, their coefficients will reflect those scales rather than true importance. The standard practice is to standardize all features (zero mean, unit variance) before training, which produces standardized coefficients that can be directly compared:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
import numpy as np
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
model = LogisticRegression().fit(X_train_scaled, y_train)
importances = np.abs(model.coef_[0])
For a generalized linear model with link $g$, the contribution of feature $x_j$ to the linear predictor is $\beta_j x_j$. The signed contribution is what regulators typically expect for adverse-action notices, since it preserves direction.
Limitations: When features are highly correlated (multicollinearity), coefficients become unstable and can swing between positive and negative values. In such cases, coefficient magnitude is unreliable. L1 and L2 regularization help stabilize coefficients but do not fully resolve the interpretation problem. Lasso zeroes out some coefficients entirely, which is helpful for selection but discards correlated alternatives.
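The scale dependence of raw coefficients can be seen in a small NumPy experiment. This is a sketch on synthetic data: the feature names, distributions, and the ols_coefficients helper are assumptions made for illustration, and ordinary least squares stands in for whatever linear model is actually used.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
income_dollars = rng.normal(50_000, 15_000, n)  # large-scale feature
rate = rng.normal(0.05, 0.01, n)                # small-scale feature

# Synthetic target: both features contribute equally in standardized units.
y = ((income_dollars - 50_000) / 15_000
     + (rate - 0.05) / 0.01
     + rng.normal(0, 0.1, n))

def ols_coefficients(X, y):
    """Least-squares slopes for y ~ X with an intercept (hypothetical helper)."""
    design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[1:]  # drop the intercept

X_raw = np.column_stack([income_dollars, rate])
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

raw_coefs = ols_coefficients(X_raw, y)  # magnitudes reflect the units
std_coefs = ols_coefficients(X_std, y)  # magnitudes are comparable
```

On the raw inputs the coefficient for rate is several orders of magnitude larger than the coefficient for income_dollars, purely because of units. After standardization both coefficients are close to one, matching how the target was constructed.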
Model-agnostic methods can be applied to any fitted model, regardless of its internal structure. This makes them especially valuable for comparing feature importance across different model types or for explaining proprietary black-box systems.
Permutation importance was originally introduced by Leo Breiman (2001) as part of the random forest algorithm and later generalized into a fully model-agnostic technique by Fisher, Rudin, and Dominici (2019), who called it "model reliance." The core idea is simple: if a feature is important, randomly shuffling its values should degrade model performance; if the feature is unimportant, shuffling it should have little effect.
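Algorithm:
1. Fit the model and compute a baseline score on a dataset, ideally a held-out set.
2. For each feature, randomly shuffle that feature's column while leaving all other columns unchanged, then recompute the score.
3. Record the importance of the feature as the baseline score minus the shuffled score.
4. Repeat the shuffle several times and report the mean and standard deviation of the drop.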
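The shuffle-and-rescore procedure can be sketched from scratch in a few lines of NumPy. This is an illustrative sketch, not scikit-learn's implementation: the function names, the synthetic data, and the toy_model stand-in for a fitted estimator are all assumptions, and R² is used as the score.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination (higher is better)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def permutation_importance_sketch(predict, X, y, n_repeats=10, seed=0):
    """Mean and std of the score drop when each column is shuffled."""
    rng = np.random.default_rng(seed)
    baseline = r2_score(y, predict(X))
    drops = np.zeros((X.shape[1], n_repeats))
    for j in range(X.shape[1]):
        for r in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])  # break the link to y
            drops[j, r] = baseline - r2_score(y, predict(X_perm))
    return drops.mean(axis=1), drops.std(axis=1)

# Synthetic held-out data: feature 0 matters most, feature 2 is pure noise.
rng = np.random.default_rng(1)
X_val = rng.normal(size=(400, 3))
y_val = 3.0 * X_val[:, 0] + 1.0 * X_val[:, 1] + rng.normal(0, 0.1, 400)

def toy_model(X):
    """Hypothetical stand-in for an already fitted model's predict method."""
    return 3.0 * X[:, 0] + 1.0 * X[:, 1]

mean_drop, std_drop = permutation_importance_sketch(toy_model, X_val, y_val)
```

Because toy_model never reads the third column, shuffling it leaves the score unchanged and its importance is exactly zero, while the first column produces the largest drop.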
In scikit-learn, permutation importance is available through the permutation_importance function:
from sklearn.inspection import permutation_importance
result = permutation_importance(
    model, X_test, y_test,
    n_repeats=30,
    random_state=42
)
for i in result.importances_mean.argsort()[::-1]:
    if result.importances_mean[i] - 2 * result.importances_std[i] > 0:
        print(f"{feature_names[i]}: "
              f"{result.importances_mean[i]:.3f} "
              f"+/- {result.importances_std[i]:.3f}")
Breiman's original definition computed permutation importance on the out-of-bag (OOB) samples of a random forest, which gave a free held-out evaluation without an explicit train-test split. Modern implementations are usually train-test based.
Advantages: Permutation importance is model-agnostic, does not require retraining, and can be computed on a held-out test set (which means it reflects generalization performance rather than training-set memorization). It also does not exhibit the high-cardinality bias that affects MDI.
Limitations: When features are correlated, permuting one feature does not substantially hurt performance because the model can still extract the same information from its correlated partner. The permutation step can also create unrealistic data points (for instance, shuffling "number of bedrooms" independently of "square footage" produces houses that would never exist), which can distort estimates. Conditional permutation importance, introduced by Strobl et al. (2008), addresses this by permuting within strata of correlated features.
Drop-column importance, also known as Leave-One-Covariate-Out (LOCO), provides perhaps the most direct answer to the question "how much does this feature contribute?" The procedure drops each feature one at a time, retrains the model from scratch, and measures the change in performance.
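Algorithm:
1. Train the model on all features and record its baseline performance on a validation set.
2. For each feature, remove that feature's column, retrain the model from scratch, and measure performance on the same validation set.
3. Record the importance of the feature as the baseline performance minus the performance of the model trained without it.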
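The drop-and-retrain loop can be sketched as follows, using ordinary least squares as a cheap stand-in for the model being retrained. The helper names and the synthetic data (in which the second feature is a near-duplicate of the first) are assumptions chosen to show how retraining lets a correlated partner absorb a dropped feature's signal.

```python
import numpy as np

def fit_ols(X, y):
    """Fit least squares with an intercept; returns the coefficient vector."""
    design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta

def predict_ols(beta, X):
    return np.column_stack([np.ones(len(X)), X]) @ beta

def r2_score(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def drop_column_importance(X_train, y_train, X_val, y_val):
    """Validation-score loss when the model is retrained without each feature."""
    full = r2_score(y_val, predict_ols(fit_ols(X_train, y_train), X_val))
    importances = []
    for j in range(X_train.shape[1]):
        keep = [k for k in range(X_train.shape[1]) if k != j]
        beta = fit_ols(X_train[:, keep], y_train)  # retrain from scratch
        reduced = r2_score(y_val, predict_ols(beta, X_val[:, keep]))
        importances.append(full - reduced)
    return np.array(importances)

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 3))
X[:, 1] = X[:, 0] + rng.normal(0, 0.05, 600)  # near-duplicate of feature 0
y = 2.0 * X[:, 0] + 1.0 * X[:, 2] + rng.normal(0, 0.1, 600)

loco = drop_column_importance(X[:400], y[:400], X[400:], y[400:])
```

Feature 0 has the largest true coefficient, yet its drop-column importance is small because the retrained model recovers most of the signal from its near-duplicate. Feature 2 has no substitute and therefore shows the largest loss.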
Advantages: Drop-column importance directly measures each feature's contribution to overall model performance. It avoids the unrealistic data problem of permutation importance because the model never sees the dropped feature during training.
Limitations: This method is computationally expensive because it requires retraining the model once per feature. For a dataset with 100 features and a model that takes an hour to train, drop-column importance requires over 100 hours of computation. It also measures a slightly different quantity than permutation importance: the marginal value of adding a feature to a model that already has all other features. For deep learning models with stochastic optimizers, retraining variance can swamp the signal unless many seeds are averaged.
SHAP (SHapley Additive exPlanations), introduced by Lundberg and Lee (2017), applies Shapley values from cooperative game theory to explain individual predictions. Each feature receives a SHAP value representing its contribution to pushing the prediction away from the average prediction. The key theoretical property is that SHAP values are the only additive feature attribution method satisfying three desirable axioms: local accuracy, missingness, and consistency.
Formally, the SHAP value of feature $i$ for an instance $x$ is the Shapley value:
$$\phi_i(x) = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(|F|-|S|-1)!}{|F|!} \, [v(S \cup \{i\}) - v(S)]$$
where $F$ is the set of all features and $v(S)$ is the model's prediction conditioned on knowing only the values of features in subset $S$. Computing this exactly is exponential in the number of features, which is why specialized fast variants exist.
SHAP provides both local explanations (why a specific prediction was made) and global importance (which features matter most across the entire dataset). Global SHAP importance is typically computed as the mean absolute SHAP value for each feature across all samples: $\mathrm{Imp}(j) = \frac{1}{N} \sum_{i=1}^{N} |\phi_j(x_i)|$.
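For a handful of features the Shapley sum can be evaluated by brute force, which makes the weighting and the local accuracy property concrete. The sketch below uses only the standard library. It makes one simplifying assumption that the SHAP library does not: features outside the coalition are replaced by a single fixed baseline value rather than averaged over a background dataset. The model function and inputs are invented.

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley values; features outside S take their baseline value."""
    M = len(x)

    def v(S):
        z = [x[k] if k in S else baseline[k] for k in range(M)]
        return f(z)

    phi = [0.0] * M
    for i in range(M):
        others = [k for k in range(M) if k != i]
        for size in range(M):
            weight = factorial(size) * factorial(M - size - 1) / factorial(M)
            for S in combinations(others, size):
                phi[i] += weight * (v(set(S) | {i}) - v(set(S)))
    return phi

def model(z):
    """Toy model with one additive term and one interaction term."""
    return 2.0 * z[0] + z[1] * z[2]

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(model, x, baseline)
```

Here the additive term gives feature 0 an attribution of 2, and the interaction term of size 6 is split evenly between features 1 and 2. The three values sum to 8, which is the prediction at x minus the prediction at the baseline.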
The SHAP library offers specialized explainers optimized for different model types:
| Explainer | Target models | Speed | Exactness |
|---|---|---|---|
| TreeExplainer | Random forest, XGBoost, LightGBM, CatBoost | Fast | Exact for trees |
| LinearExplainer | Linear models, logistic regression | Fast | Exact for linear models |
| KernelExplainer | Any model (model-agnostic) | Slow | Approximate |
| DeepExplainer | Deep neural networks | Moderate | Approximate (DeepLIFT-based) |
| GradientExplainer | Differentiable models | Moderate | Approximate (uses Integrated Gradients) |
| PartitionExplainer | Hierarchical feature groupings | Moderate | Owen value approximation |
import numpy as np
import shap
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
# Global importance: mean absolute SHAP value per feature
global_importance = np.abs(shap_values).mean(axis=0)
# Summary plot showing feature importance and effect direction
shap.summary_plot(shap_values, X_test)
The naive Shapley value computation has complexity $O(2^M)$ in the number of features $M$, which is infeasible for typical tabular models. TreeSHAP, introduced by Lundberg, Erion, and Lee (2018) and extended by Lundberg et al. (2020) in Nature Machine Intelligence, computes exact Shapley values for tree-based models in polynomial time. The complexity is $O(T L D^2)$ per prediction, where $T$ is the number of trees, $L$ is the maximum number of leaves in any tree, and $D$ is the maximum depth. This brings exact local attribution within reach for ensembles of hundreds of trees on datasets with thousands of features.
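TreeSHAP supports two estimands: an interventional one, which breaks the dependence between features by evaluating the model against a background dataset, and a path-dependent (observational) one, which uses the training-sample counts stored in the tree nodes to approximate conditional expectations.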
KernelSHAP, the model-agnostic variant, uses weighted linear regression on perturbed coalitions and converges to Shapley values in expectation but is much slower.
Advantages: SHAP values have a strong theoretical foundation in game theory, provide both local and global explanations, and show the direction of each feature's effect (not just magnitude). TreeExplainer is computationally efficient for tree-based models. The SHAP library ships rich plotting utilities (summary plots, dependence plots, force plots, decision plots, waterfall plots).
Limitations: KernelExplainer (the model-agnostic variant) is computationally expensive, scaling poorly to large datasets and large feature counts. SHAP importance is on the scale of the prediction (not the loss), which makes it answer a subtly different question than permutation importance. Like permutation importance, SHAP values can be affected by feature correlations, especially in the interventional formulation. Janzing, Minorics, and Bloebaum (2020) have argued that the choice between observational and interventional SHAP corresponds to fundamentally different causal interpretations.
LIME, proposed by Ribeiro, Singh, and Guestrin (2016), explains individual predictions by fitting a simple interpretable model (typically a sparse linear model) in the local neighborhood of the instance being explained. LIME generates perturbed samples around the instance, weights them by proximity, obtains the black-box model's predictions for these samples, and then fits a sparse linear model to approximate the local decision boundary.
Algorithm:
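The perturb, weight, and fit steps can be sketched in NumPy as below. This is a simplified illustration rather than the LIME library's implementation: it uses Gaussian perturbations, an exponential proximity kernel, and plain weighted least squares instead of a sparse linear model, and the function names, kernel width, and black_box model are assumptions.

```python
import numpy as np

def lime_tabular_sketch(predict, x, n_samples=2000, scale=0.5,
                        kernel_width=0.75, seed=0):
    """Local linear weights around x from a proximity-weighted fit."""
    rng = np.random.default_rng(seed)
    # 1. Perturb the instance with Gaussian noise.
    Z = x + rng.normal(0.0, scale, size=(n_samples, x.size))
    # 2. Query the black-box model on the perturbed samples.
    y = predict(Z)
    # 3. Weight samples by proximity to x (exponential kernel).
    distance = np.linalg.norm(Z - x, axis=1)
    weights = np.exp(-(distance ** 2) / kernel_width ** 2)
    # 4. Weighted least squares of y on the offsets (Z - x) plus an intercept.
    design = np.column_stack([np.ones(n_samples), Z - x])
    sqrt_w = np.sqrt(weights)
    coef, *_ = np.linalg.lstsq(design * sqrt_w[:, None], y * sqrt_w, rcond=None)
    return coef[1:]  # local feature weights, intercept dropped

def black_box(Z):
    """Hypothetical non-linear model to be explained."""
    return np.sin(Z[:, 0]) + Z[:, 1] ** 2

x = np.array([0.0, 1.0])
local_weights = lime_tabular_sketch(black_box, x)
```

Near x the true partial derivatives are 1 and 2, and the fitted local weights land close to those values. Changing the seed, the perturbation scale, or the kernel width shifts the result, which is the instability discussed for LIME in practice.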
LIME has tabular, text, and image variants. The image variant uses superpixel segmentation as the perturbation unit, while the text variant masks individual tokens.
Advantages: LIME is model-agnostic and works with any classifier or regressor. It is intuitive because explanations are expressed as simple linear weights. The user can control the number of features in the explanation.
Limitations: LIME explanations can be unstable: running LIME twice on the same instance can produce different results because of randomness in the perturbation process. The quality of the explanation depends on the choice of kernel width and the number of perturbations, which require careful tuning. Unlike SHAP, LIME lacks a strong theoretical guarantee about the uniqueness or optimality of its explanations. Slack et al. (2020) demonstrated that LIME and SHAP can be "fooled" by adversarial models that detect the perturbation distribution and behave benignly only on perturbed inputs.
Deep neural networks accept inputs that often have thousands or millions of dimensions, such as pixels in an image or token embeddings in a language model. Tree-based importance does not apply, and exact Shapley computation is intractable. A family of gradient-based attribution methods has been developed specifically for differentiable models.
The simplest method computes the gradient of the model's output with respect to each input dimension. For a model $f$ and an input $x$, the saliency of input $x_i$ is $|\partial f(x) / \partial x_i|$. This was popularized by Simonyan, Vedaldi, and Zisserman (2013) for visualizing convolutional networks. Vanilla gradients are fast but suffer from gradient saturation and noisy attributions.
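Gradient saturation is easy to reproduce on a one-layer toy model. The sketch below assumes a hypothetical logistic model with hand-picked weights and uses its analytic gradient, so no deep learning framework is needed.

```python
import numpy as np

w = np.array([4.0, -2.0, 0.0])  # hypothetical weights; the last input is unused

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def f(x):
    return sigmoid(w @ x)

def saliency(x):
    """|df/dx_i| for f(x) = sigmoid(w . x), from the analytic gradient."""
    p = f(x)
    return np.abs(p * (1.0 - p) * w)

near_boundary = saliency(np.array([0.1, 0.1, 1.0]))  # output near 0.5
saturated = saliency(np.array([5.0, 0.0, 1.0]))      # output near 1.0
```

Near the decision boundary the saliency tracks the weight magnitudes. At the saturated input every entry collapses toward zero, even though the first input is the reason the output is close to 1.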
Integrated Gradients, introduced by Sundararajan, Taly, and Yan (2017), addresses gradient saturation and enforces two axioms: sensitivity and implementation invariance. The attribution for input dimension $i$ is the path integral of the gradient from a baseline $x'$ to the input $x$:
$$\mathrm{IG}_i(x) = (x_i - x'_i) \int_{0}^{1} \frac{\partial f(x' + \alpha (x - x'))}{\partial x_i} \, d\alpha$$
In practice the integral is approximated by a Riemann sum with 50 to 200 steps. The choice of baseline matters: a black image, a Gaussian noise sample, or the dataset mean each yields different attributions. Integrated Gradients sums to the difference between the prediction at $x$ and the prediction at $x'$ (completeness axiom), which is analogous to the local accuracy property of SHAP values.
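The Riemann-sum approximation and the completeness check can be sketched in NumPy. The logistic toy model, its weights, and the zero baseline are assumptions for illustration; a midpoint rule with 100 steps is used for the integral.

```python
import numpy as np

w = np.array([4.0, -2.0, 0.0])  # hypothetical model weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def f(x):
    return sigmoid(w @ x)

def grad_f(x):
    p = f(x)
    return p * (1.0 - p) * w

def integrated_gradients(x, baseline, steps=100):
    """Midpoint Riemann-sum approximation of the IG path integral."""
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.array([grad_f(baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)

x = np.array([5.0, 0.0, 1.0])  # a saturated input: f(x) is close to 1
baseline = np.zeros(3)
ig = integrated_gradients(x, baseline)

# Completeness: attributions should sum to f(x) - f(baseline).
completeness_gap = ig.sum() - (f(x) - f(baseline))
```

Although the gradient at x itself is almost zero, Integrated Gradients assigns roughly 0.5 to the first input, which equals the full change in output from the baseline. The second input receives zero because it equals its baseline value, and the third because its weight is zero.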
DeepLIFT, proposed by Shrikumar, Greenside, and Kundaje (2017), assigns contribution scores by comparing each neuron's activation to a reference activation. It propagates contributions backward through the network using either Rescale rules or RevealCancel rules. DeepLIFT was designed to overcome gradient saturation issues that affect vanilla gradients, especially in deep networks with ReLU activations. Lundberg and Lee (2017) showed that DeepLIFT can be reformulated as an approximation to SHAP values, which led to the SHAP DeepExplainer implementation.
SmoothGrad, introduced by Smilkov, Thorat, Kim, Viegas, and Wattenberg (2017), reduces visual noise in saliency maps by averaging gradients over multiple noisy copies of the input:
$$\mathrm{SmoothGrad}_i(x) = \frac{1}{n} \sum_{k=1}^{n} \frac{\partial f(x + \epsilon_k)}{\partial x_i}$$
where each $\epsilon_k \sim \mathcal{N}(0, \sigma^2 I)$. SmoothGrad is often combined with Integrated Gradients (SmoothGrad-IG) for sharper, more stable attributions on image classification.
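The averaging formula can be illustrated with a toy function whose gradient oscillates rapidly in one input. The function, the noise level, and the sample count below are assumptions chosen to make the effect visible.

```python
import numpy as np

def grad_f(x):
    """Gradient of the toy function f(x) = x0 + 0.1 * sin(50 * x1)."""
    return np.array([1.0, 5.0 * np.cos(50.0 * x[1])])

def smoothgrad(x, sigma=0.2, n=500, seed=0):
    """Average the gradient over n Gaussian-perturbed copies of x."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, sigma, size=(n, x.size))
    return np.mean([grad_f(x + eps) for eps in noise], axis=0)

x = np.array([0.3, 0.0])
raw = grad_f(x)          # second entry is 5.0, driven by a tiny ripple
smooth = smoothgrad(x)   # ripple averages out; first entry stays at 1.0
```

The raw gradient suggests the second input is five times more influential than the first, although its total effect on the output is bounded by 0.1. Averaging over noisy copies suppresses that high-frequency component and leaves the stable slope of the first input.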
Layer-wise Relevance Propagation, introduced by Bach et al. (2015), distributes the prediction backward through the network using conservation rules at each layer. Several propagation rules exist, including LRP-0, LRP-epsilon, and LRP-gamma, each with different stability and faithfulness trade-offs. Montavon et al. (2018) provided a unified framework analyzing how various propagation rules affect attributions.
For convolutional networks, Grad-CAM (Selvaraju et al., 2017) localizes the regions of an image that influenced a class prediction by combining gradients with feature maps from a chosen convolutional layer. Grad-CAM is widely used in medical imaging to highlight tumor regions or lesions that drive a model's diagnosis.
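The Grad-CAM combination step is compact enough to sketch in NumPy once the feature maps and their gradients are available. In the sketch below both arrays are small invented examples standing in for values that would normally come from a forward and backward pass through a CNN.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Combine feature maps into a class heatmap.

    Both arguments have shape (channels, height, width): the chosen layer's
    feature maps and the gradient of the class score with respect to them.
    """
    channel_weights = gradients.mean(axis=(1, 2))  # global-average-pooled gradients
    cam = np.tensordot(channel_weights, activations, axes=1)
    return np.maximum(cam, 0.0)  # ReLU keeps only positive evidence

# Invented 2-channel, 2x2 example.
activations = np.array([
    [[1.0, 0.0], [0.0, 0.0]],  # channel 0 fires in the top-left cell
    [[0.0, 0.0], [0.0, 1.0]],  # channel 1 fires in the bottom-right cell
])
gradients = np.array([
    np.full((2, 2), 1.0),   # channel 0 supports the class
    np.full((2, 2), -1.0),  # channel 1 counts against the class
])

heatmap = grad_cam(activations, gradients)
```

Only the top-left cell survives in the heatmap. The bottom-right cell is zeroed by the ReLU because its channel has a negative weight, so the map shows supporting evidence only.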
| Method | Required model property | Output | Strengths | Weaknesses |
|---|---|---|---|---|
| Vanilla saliency | Differentiability | Per-input gradient | Cheap | Noisy, saturation |
| Integrated Gradients | Differentiability, baseline | Path integral | Completeness, axiomatic | Baseline-dependent |
| DeepLIFT | ReLU-style nonlinearities | Reference-based contribution | Avoids saturation | Reference-dependent |
| SmoothGrad | Differentiability | Averaged gradient | Visual stability | Adds compute |
| Layer-wise Relevance Propagation | Differentiability | Propagated relevance | Conservation | Rule choice non-trivial |
| Grad-CAM | CNN with feature maps | Coarse heatmap | Class-discriminative | Resolution limited |
Captum is an open-source PyTorch library released by Meta AI in 2019 that implements a unified API for these gradient-based attribution methods, plus several additional ones such as Occlusion, Feature Ablation, and Shapley Value Sampling. The library exposes a consistent interface where each algorithm subclasses the Attribution class:
import torch
from captum.attr import IntegratedGradients
model.eval()
ig = IntegratedGradients(model)
input_tensor = torch.randn(1, 3, 224, 224, requires_grad=True)
baseline = torch.zeros_like(input_tensor)
attributions, delta = ig.attribute(
    input_tensor, baselines=baseline, target=0,
    return_convergence_delta=True,
)
Captum integrates with PyTorch modules, supports both image and text models, and includes utilities for visualization and noise tunneling (the SmoothGrad pattern). The TensorFlow community uses tf-explain, iNNvestigate, and Captum-style implementations within the AIX360 toolkit from IBM.
Transformer architectures used in modern large language models introduce additional considerations for feature attribution. Inputs are token embeddings rather than raw scalar features, and the model contains attention layers whose weights are often misinterpreted as importance scores.
Attention weights describe how each token attends to others within a layer. It is tempting to read the attention weight $\alpha_{ij}$ from token $i$ to token $j$ as a measure of how much token $j$ contributes to token $i$'s representation. Jain and Wallace (2019) argued that attention weights are not reliable explanations: they can be perturbed without changing predictions, and different attention distributions can yield identical outputs. Wiegreffe and Pinter (2019) qualified this conclusion, arguing that whether attention counts as an explanation depends on the definition of explanation being used, and proposed alternative tests. Modern guidance treats raw attention as a diagnostic, not an explanation.
Integrated Gradients applied at the embedding layer of a transformer assigns contribution scores to individual tokens. The baseline is typically a sequence of zero embeddings or a mean-embedding sequence. Captum's LayerIntegratedGradients is the standard tool for this:
from captum.attr import LayerIntegratedGradients
lig = LayerIntegratedGradients(model, model.embeddings)
attributions, delta = lig.attribute(
    input_ids,
    baselines=baseline_ids,
    target=target_class,
    return_convergence_delta=True,
)
Token attributions are commonly summed across the embedding dimension and visualized as heatmaps over the input text.
Abnar and Zuidema (2020) proposed attention rollout, which composes attention matrices across layers via matrix multiplication, and attention flow, which formulates the importance problem as max-flow over an attention graph. These methods give a layer-aggregated view of token-to-token influence.
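Attention rollout reduces to a short matrix recursion, sketched below in NumPy. The sketch assumes attention has already been averaged over heads, so each layer contributes one row-stochastic matrix, and it mixes in the identity with equal weight to account for residual connections. The random attention matrices are placeholders for values read from a real model.

```python
import numpy as np

def attention_rollout(attentions):
    """Compose per-layer attention matrices, first layer first.

    Each matrix has shape (tokens, tokens) with rows summing to one.
    """
    n = attentions[0].shape[0]
    rollout = np.eye(n)
    for A in attentions:
        A_res = 0.5 * A + 0.5 * np.eye(n)              # add the residual path
        A_res = A_res / A_res.sum(axis=-1, keepdims=True)
        rollout = A_res @ rollout                      # propagate one layer up
    return rollout

rng = np.random.default_rng(0)

def random_attention(n):
    """Placeholder row-stochastic attention matrix (softmax of random logits)."""
    logits = rng.normal(size=(n, n))
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

layers = [random_attention(4) for _ in range(3)]
rollout = attention_rollout(layers)
```

Row i of the result gives a distribution over input tokens, read as how much of each input token's information reaches position i at the top layer under this approximation.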
Beyond attribution scores, the mechanistic interpretability program seeks to identify circuits, features, and computations within transformer weights. Anthropic's 2023 work on monosemantic features used sparse autoencoders to decompose neuron activations into interpretable features. The 2024 "Scaling Monosemanticity" paper extended this to Claude 3 Sonnet and identified millions of features. This line of work targets a finer-grained question than feature importance: not just which input contributes, but what computation is performed and why. Mechanistic interpretability and feature importance address complementary aspects of model understanding.
The following table summarizes the key characteristics of the most widely used feature importance methods:
| Method | Scope | Model requirement | Computation cost | Handles correlated features well? | Provides direction of effect? |
|---|---|---|---|---|---|
| Impurity-based (MDI) | Global | Tree-based models only | Very low (computed during training) | No (splits importance across correlated features) | No |
| XGBoost gain | Global | XGBoost / boosted trees | Very low | No | No |
| XGBoost weight | Global | XGBoost / boosted trees | Very low | No | No |
| XGBoost cover | Global | XGBoost / boosted trees | Very low | No | No |
| Coefficient magnitude | Global | Linear models only | Very low (read from trained model) | No (unstable with multicollinearity) | Yes (sign of coefficient) |
| Permutation importance | Global | Any model | Moderate (no retraining) | No (underestimates correlated features) | No |
| Conditional permutation | Global | Any model | High | Yes (within strata) | No |
| Drop-column (LOCO) | Global | Any model | Very high (retrains per feature) | Partially (retraining captures redistribution) | No |
| KernelSHAP | Local and global | Any model | High | Partial (interventional vs. observational) | Yes |
| TreeSHAP | Local and global | Tree-based models | Low ($O(TLD^2)$) | Partial | Yes |
| DeepSHAP / DeepLIFT | Local | Differentiable models | Moderate | Partial | Yes |
| LIME | Local | Any model | Moderate | No (local perturbations affected by correlations) | Yes (sign of local coefficient) |
| Integrated Gradients | Local | Differentiable models | Moderate | Partial (baseline-dependent) | Yes |
| Layer-wise Relevance Propagation | Local | Differentiable models | Moderate | Partial | Yes |
| Grad-CAM | Local | CNNs with feature maps | Low | Partial | Yes |
| Attention weights | Local | Attention-based models | Free (already computed) | No (not a faithful explanation) | No (weights are non-negative) |
Correlated features are the most common source of misleading feature importance results, and this problem affects virtually every method. When two features carry similar information, importance methods tend to split the total importance between them, making each appear less important than it would be in isolation. Permutation importance can also overestimate the importance of correlated features by creating impossible feature combinations. Hooker and Mentch (2019) provide an analysis of how feature correlation distorts both MDI and permutation importance.
Practical solutions:
PartitionExplainer or grouped permutation tests.
Impurity-based importance is systematically biased toward features with many unique values. A continuous feature with 10,000 distinct values will tend to score higher than a binary feature, even if the binary feature is the true driver. This was demonstrated by Strobl et al. (2007), who showed that random forests preferentially select high-cardinality variables for splits.
Practical solution: Use permutation importance or SHAP instead of MDI when your dataset contains a mix of continuous and categorical features with varying cardinality.
If a model has memorized the training data, the importance scores derived from that training data will be unreliable. A random noise feature may appear important simply because the model has overfit to it.
Practical solution: Always compute importance on a held-out test set or use cross-validated importance estimates. Scikit-learn's permutation_importance function accepts any dataset, so passing the test set rather than the training set is straightforward.
Feature importance measures statistical association, not causation. A feature may appear important because it is correlated with the true causal factor, not because it directly influences the outcome. For example, ice cream sales might appear important for predicting drowning rates, but the true driver is temperature. The causal inference literature provides tools, such as do-calculus, that bridge to genuinely causal contributions, but these require strong assumptions and a known causal graph.
Feature importance is only meaningful when the underlying model performs well. If the model has a low cross-validation score, its importance rankings may be unreliable and unstable. Always validate model performance before interpreting feature importances.
A single importance score is a point estimate. Repeating permutation, bootstrapping the data, or running multiple training seeds yields a distribution of importance scores. Reporting standard deviations or confidence intervals catches features whose importance is dominated by noise. Scikit-learn's permutation_importance returns importances_mean and importances_std precisely so users can quantify this uncertainty.
Slack et al. (2020) showed that black-box models can be crafted to deceive both LIME and SHAP. Ghorbani, Abid, and Zou (2019) showed that small input perturbations can drastically change saliency map attributions even when the prediction is unchanged. Auditors should treat any single explanation as one signal among many.
Integrated Gradients and other reference-based methods depend on the choice of baseline. A black image, a gray image, a Gaussian noise sample, and a dataset mean baseline can give different attributions for the same input. Sturmfels, Lundberg, and Lee (2020) recommend using multiple baselines and reporting averaged attributions.
Several mature libraries implement feature importance and attribution. The choice depends on the framework and the type of model.
| Library | Language / framework | Primary methods | Notes |
|---|---|---|---|
| scikit-learn | Python | MDI, permutation importance | Standard tabular ML toolkit |
| SHAP | Python | KernelSHAP, TreeSHAP, DeepSHAP, GradientSHAP | Reference SHAP implementation |
| LIME | Python | LIME for tabular, text, image | Original Ribeiro et al. implementation |
| Captum | Python / PyTorch | IG, DeepLIFT, SmoothGrad, Saliency, Occlusion | Meta AI library |
| eli5 | Python | Permutation importance, MDI, weights | Older but still useful |
| AIX360 | Python | SHAP, LIME, ProtoDash, BRCG | IBM toolkit |
| iNNvestigate | Python / TensorFlow | LRP variants, SmoothGrad, IG | Keras-focused |
| InterpretML | Python | EBM (glass-box), SHAP, LIME | From Microsoft Research |
| dalex | Python / R | Break-Down, Shapley, Ceteris Paribus | Predictive model audit |
Scikit-learn provides two primary interfaces for computing feature importance:
| Interface | Function / attribute | Method type | Available since |
|---|---|---|---|
| Impurity-based | model.feature_importances_ | Built-in (tree models) | scikit-learn 0.1 |
| Permutation | sklearn.inspection.permutation_importance() | Model-agnostic | scikit-learn 0.22 |
The scikit-learn documentation explicitly recommends preferring permutation importance over impurity-based importance when accuracy of rankings matters, noting that MDI importances are biased towards high-cardinality features and are computed on training set statistics that do not reflect generalization to the test set.
A typical workflow combines both methods:
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
import numpy as np
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Train model
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
# Method 1: Impurity-based (fast but potentially biased)
mdi_importances = rf.feature_importances_
# Method 2: Permutation importance (more reliable)
perm_result = permutation_importance(
    rf, X_test, y_test, n_repeats=30, random_state=42
)
perm_importances = perm_result.importances_mean
# Compare
for name, mdi, perm in sorted(
    zip(feature_names, mdi_importances, perm_importances),
    key=lambda x: x[2], reverse=True
):
    print(f"{name:>20s}: MDI={mdi:.4f} Permutation={perm:.4f}")
The SHAP library is published under the MIT license and supports all major tabular ML libraries:
import shap
import xgboost as xgb
booster = xgb.train(params, dtrain, num_boost_round=200)
explainer = shap.TreeExplainer(booster)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test, plot_type="dot")
shap.dependence_plot("age", shap_values, X_test)
Consider a regression model that predicts house prices from features such as square footage, number of bedrooms, lot size, neighborhood, year built, and number of bathrooms. The following observations illustrate how different methods may rank the same features:
- MDI ranks square_footage first because it offers many split points and is heavily used during training.
- Permutation importance ranks neighborhood higher than MDI suggests because shuffling the categorical neighborhood label catastrophically degrades predictions.
- SHAP shows that year_built has a positive contribution for newer homes and a negative contribution for older homes, revealing a non-linear effect that aggregate importance hides.
- Drop-column importance suggests that bedrooms and bathrooms jointly matter but neither is critical alone, because they act as redundant proxies for size.
The disagreement is informative. Examining several methods together is the standard professional practice when stakes are high.
Feature importance gives a single score per feature, but several adjacent tools answer related questions and are commonly reported alongside importance scores. Partial dependence plots (PDPs), introduced by Friedman (2001), show the average effect of one or two features on the prediction. Individual Conditional Expectation (ICE) plots show the same relationship per-instance. Accumulated Local Effects (ALE) plots, introduced by Apley and Zhu (2020), generalize PDPs to handle correlated features. SHAP interaction values decompose each prediction into main effects and pairwise interaction terms. Counterfactual explanations and anchors (Ribeiro, Singh, Guestrin, 2018) provide rule-based local explanations.
Imagine you are baking cookies, and you want to know which ingredient matters most for making them taste good. You could try leaving out the sugar one time, leaving out the butter another time, and leaving out the vanilla another time. Whichever ingredient, when removed, makes the cookies taste the worst is the most important ingredient. Feature importance works the same way: it tests what happens to a computer's predictions when each piece of information is taken away or scrambled, and the pieces that cause the biggest mess when removed are the most important ones.