Feature importances quantify how much each input feature contributes to the predictions of a machine learning model. Understanding which features matter most is central to model interpretability, feature engineering, debugging, and building trust in deployed systems. A wide range of techniques exist for measuring feature importance, from methods built into specific model families to model-agnostic approaches that work with any estimator.
Knowing which features drive a model's predictions serves several practical purposes:

- Interpretability: explaining to stakeholders why the model makes the decisions it does.
- Feature engineering: identifying weak or redundant inputs that can be dropped, and promising ones worth refining.
- Debugging: surfacing data leakage or spurious correlations the model has latched onto.
- Trust: giving users, auditors, and regulators evidence that a deployed system behaves sensibly.
Impurity-based importance, also called Mean Decrease in Impurity (MDI) or Gini importance, is the default feature importance method for tree-based models such as random forest, gradient boosting, and individual decision trees. During training, each node in a tree splits on a feature to reduce impurity (measured by the Gini index for classification or variance for regression). The importance of a feature is the total reduction in impurity it provides across all splits in all trees, weighted by the number of samples reaching each split.
In scikit-learn, impurity-based importances are accessible through the feature_importances_ attribute of any fitted tree-based estimator:
```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Impurity-based importances
importances = model.feature_importances_
```
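To make the weighting concrete, the sketch below recomputes MDI for a single decision tree directly from the fitted tree's low-level arrays (`tree_.impurity`, `tree_.weighted_n_node_samples`, and friends) and checks the result against `feature_importances_`. The dataset here is an arbitrary choice; any classification data would do.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

t = tree.tree_  # low-level arrays describing the fitted tree
mdi = np.zeros(X.shape[1])
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:  # leaf: no split, so no impurity reduction
        continue
    n = t.weighted_n_node_samples
    # Sample-weighted impurity decrease produced by this split
    decrease = (n[node] * t.impurity[node]
                - n[left] * t.impurity[left]
                - n[right] * t.impurity[right])
    mdi[t.feature[node]] += decrease

mdi /= mdi.sum()  # scikit-learn normalizes importances to sum to 1
assert np.allclose(mdi, tree.feature_importances_)
```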
Advantages: MDI is fast to compute because it requires no additional evaluation after training. It is readily available for all tree-based models in scikit-learn, XGBoost, LightGBM, and CatBoost.
Limitations: MDI is biased toward high-cardinality features (features with many unique values, such as continuous variables or categorical variables with many categories). Because high-cardinality features offer more candidate split points, they have a greater chance of producing a good split by chance. MDI is also computed on training data, so it can inflate the importance of features the model has overfit to. These limitations were documented by Strobl et al. (2007) and have been confirmed by the scikit-learn development team.
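The cardinality bias is easy to reproduce. The following sketch (the dataset and hyperparameters are arbitrary illustrative choices) appends two pure-noise columns to a real dataset: MDI typically ranks the continuous noise above the binary noise, while permutation importance on held-out data pushes both toward zero.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(42)
X, y = load_breast_cancer(return_X_y=True)

# Two pure-noise columns: one continuous (high cardinality), one binary
noise_cont = rng.normal(size=(X.shape[0], 1))
noise_bin = rng.randint(0, 2, size=(X.shape[0], 1))
X_aug = np.hstack([X, noise_cont, noise_bin])

X_tr, X_te, y_tr, y_te = train_test_split(X_aug, y, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# MDI typically scores the continuous noise above the binary noise
print("MDI  continuous noise:", rf.feature_importances_[-2])
print("MDI  binary noise:    ", rf.feature_importances_[-1])

# Permutation importance drives both toward zero on held-out data
perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
print("Perm continuous noise:", perm.importances_mean[-2])
print("Perm binary noise:    ", perm.importances_mean[-1])
```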
For linear regression and logistic regression, the absolute value of each feature's learned coefficient can serve as a measure of importance. A larger absolute coefficient means the feature has a stronger influence on the prediction.
However, raw coefficients are only comparable when all features share the same scale. If one feature is measured in thousands and another in fractions, their coefficients will reflect those scales rather than true importance. The standard practice is to standardize all features (zero mean, unit variance) before training, which produces standardized coefficients that can be directly compared:
```python
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
import numpy as np

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
model = LogisticRegression().fit(X_train_scaled, y_train)
importances = np.abs(model.coef_[0])
```
Limitations: When features are highly correlated (multicollinearity), coefficients become unstable and can swing dramatically between positive and negative values. In such cases, coefficient magnitude is an unreliable indicator of true feature importance. Regularization techniques like L1 and L2 regularization help stabilize coefficients but do not fully resolve the interpretation problem.
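A small synthetic example makes the instability visible. Here (an illustrative setup, not drawn from any particular dataset) two near-duplicate copies of one signal receive large, opposite-signed coefficients even though their combined effect is a stable +3:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
x = rng.normal(size=(200, 1))
# Two near-duplicate copies of the same underlying signal
X = np.hstack([x, x + 1e-6 * rng.normal(size=(200, 1))])
y = 3 * x[:, 0] + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y)
# The individual coefficients are typically huge and opposite-signed...
print(model.coef_)
# ...while their sum stays close to the true effect of 3
print(model.coef_.sum())
```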
Model-agnostic methods can be applied to any fitted model, regardless of its internal structure. This makes them especially valuable for comparing feature importance across different model types.
Permutation importance was originally introduced by Breiman (2001) as part of the random forest algorithm and later generalized into a fully model-agnostic technique by Fisher, Rudin, and Dominici (2019), who called it "model reliance." The core idea is simple: if a feature is important, randomly shuffling its values should degrade model performance; if the feature is unimportant, shuffling it should have little effect.
Algorithm:

1. Fit the model and record a baseline score on a held-out dataset.
2. For each feature, randomly shuffle (permute) that feature's column while leaving all other columns intact.
3. Re-score the model on the permuted data; the importance is the drop from the baseline score.
4. Repeat the shuffle several times per feature and average the drops to reduce variance.
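For reference, here is a minimal NumPy rendering of these steps, using accuracy as the score (the model and data names are placeholders; scikit-learn's built-in version, shown next, is what you would use in practice):

```python
import numpy as np
from sklearn.metrics import accuracy_score

def permutation_importance_manual(model, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy when each column is shuffled in isolation."""
    rng = np.random.RandomState(seed)
    baseline = accuracy_score(y, model.predict(X))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])  # shuffle one column
            drops.append(baseline - accuracy_score(y, model.predict(X_perm)))
        importances[j] = np.mean(drops)
    return importances
```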
In scikit-learn, permutation importance is available through the permutation_importance function:
```python
from sklearn.inspection import permutation_importance

result = permutation_importance(
    model, X_test, y_test,
    n_repeats=30,
    random_state=42
)

# Report only features whose importance is reliably above zero
# (mean minus two standard deviations still positive)
for i in result.importances_mean.argsort()[::-1]:
    if result.importances_mean[i] - 2 * result.importances_std[i] > 0:
        print(f"{feature_names[i]}: "
              f"{result.importances_mean[i]:.3f} "
              f"+/- {result.importances_std[i]:.3f}")
```
Advantages: Permutation importance is model-agnostic, does not require retraining, and can be computed on a held-out test set (which means it reflects generalization performance rather than training-set memorization). It also does not exhibit the high-cardinality bias that affects MDI.
Limitations: When features are correlated, permuting one feature does not substantially hurt performance because the model can still extract the same information from its correlated partner. This causes both correlated features to appear less important than they truly are. Additionally, the permutation step can create unrealistic data points (for instance, shuffling "number of bedrooms" independently of "square footage" produces houses that would never exist), which can distort importance estimates.
Drop-column importance, also known as Leave-One-Covariate-Out (LOCO), provides perhaps the most direct answer to the question "how much does this feature contribute?" The procedure drops each feature one at a time, retrains the model from scratch, and measures the change in performance.
Algorithm:

1. Train the model on all features and record a baseline score (ideally cross-validated).
2. For each feature, remove its column, retrain the model from scratch on the remaining features, and re-score.
3. The feature's importance is the baseline score minus the score of the model trained without it.
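A minimal sketch of the procedure, assuming `X` is a NumPy array and `model` is any scikit-learn estimator (cross-validated scoring is used here to stabilize the comparison):

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import cross_val_score

def drop_column_importance(model, X, y, cv=5):
    """Score drop from retraining the model without each column."""
    baseline = cross_val_score(clone(model), X, y, cv=cv).mean()
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        X_drop = np.delete(X, j, axis=1)  # remove column j
        score = cross_val_score(clone(model), X_drop, y, cv=cv).mean()
        importances[j] = baseline - score
    return importances
```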
Advantages: Drop-column importance directly measures each feature's contribution to overall model performance. It avoids the unrealistic data problem of permutation importance because the model never sees the dropped feature during training.
Limitations: This method is computationally expensive because it requires retraining the model once per feature. For a dataset with 100 features and a model that takes an hour to train, drop-column importance requires over 100 hours of computation. It also measures a slightly different quantity than permutation importance: the marginal value of adding a feature to a model that already has all other features.
SHAP (SHapley Additive exPlanations), introduced by Lundberg and Lee (2017), applies Shapley values from cooperative game theory to explain individual predictions. Each feature receives a SHAP value representing its contribution to pushing the prediction away from the average prediction. The key theoretical property is that SHAP values are the only additive feature attribution method satisfying three desirable axioms: local accuracy, missingness, and consistency.
SHAP provides both local explanations (why a specific prediction was made) and global importance (which features matter most across the entire dataset). Global SHAP importance is typically computed as the mean absolute SHAP value for each feature across all samples.
The SHAP library offers specialized explainers optimized for different model types:
| Explainer | Target Models | Speed | Exactness |
|---|---|---|---|
| TreeExplainer | Random forest, XGBoost, LightGBM, CatBoost | Fast | Exact for trees |
| LinearExplainer | Linear models, logistic regression | Fast | Exact for linear models |
| KernelExplainer | Any model (model-agnostic) | Slow | Approximate |
| DeepExplainer | Deep neural networks | Moderate | Approximate |
```python
import numpy as np
import shap

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Global importance: mean absolute SHAP value per feature
global_importance = np.abs(shap_values).mean(axis=0)

# Summary plot showing feature importance and effect direction
shap.summary_plot(shap_values, X_test)
```
Advantages: SHAP values have a strong theoretical foundation in game theory, provide both local and global explanations, and show the direction of each feature's effect (not just magnitude). The TreeExplainer is computationally efficient for tree-based models.
Limitations: KernelExplainer (the model-agnostic variant) is computationally expensive, scaling poorly to large datasets. SHAP importance is on the scale of the prediction (not the loss), which makes it answer a subtly different question than permutation importance. Like permutation importance, SHAP values can be affected by feature correlations.
LIME, proposed by Ribeiro, Singh, and Guestrin (2016), explains individual predictions by fitting a simple interpretable model (typically a linear model) in the local neighborhood of the instance being explained. LIME generates perturbed samples around the instance, weights them by proximity, obtains the black-box model's predictions for these samples, and then fits a sparse linear model to approximate the local decision boundary.
Algorithm:

1. Generate perturbed samples in the neighborhood of the instance to be explained.
2. Obtain the black-box model's predictions for each perturbed sample.
3. Weight the perturbed samples by their proximity to the original instance (closer samples count more).
4. Fit a sparse linear model to the weighted samples; its coefficients are the local explanation.
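To make the procedure concrete, here is a stripped-down, LIME-style local surrogate for tabular data. The Gaussian perturbations, RBF proximity kernel, and weighted Lasso are stand-ins chosen for illustration; this is a sketch of the idea, not the `lime` package's actual implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso

def lime_style_explanation(predict_fn, x, scale, n_samples=5000,
                           kernel_width=0.75, alpha=0.01, seed=0):
    """Sketch of a LIME-style local linear surrogate around instance x."""
    rng = np.random.RandomState(seed)
    # 1. Perturb the instance with per-feature Gaussian noise
    Z = x + rng.normal(size=(n_samples, x.shape[0])) * scale
    # 2. Query the black-box model
    preds = predict_fn(Z)
    # 3. Weight samples by proximity (RBF kernel on scaled distance)
    dists = np.linalg.norm((Z - x) / scale, axis=1)
    weights = np.exp(-dists ** 2 / kernel_width ** 2)
    # 4. Fit a sparse weighted linear model; coefficients = explanation
    surrogate = Lasso(alpha=alpha).fit(Z - x, preds, sample_weight=weights)
    return surrogate.coef_

# Example call (placeholder names): explain the positive-class probability
# coefs = lime_style_explanation(
#     lambda Z: clf.predict_proba(Z)[:, 1], x_instance, X_train.std(axis=0))
```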
Advantages: LIME is model-agnostic and works with any classifier or regressor. It is intuitive because explanations are expressed as simple linear weights. The user can control the number of features in the explanation.
Limitations: LIME explanations can be unstable: running LIME twice on the same instance can produce different results because of randomness in the perturbation process. The quality of the explanation depends on the choice of kernel width and the number of perturbations, which require careful tuning. Unlike SHAP, LIME lacks a strong theoretical guarantee about the uniqueness or optimality of its explanations.
The following table summarizes the key characteristics of the most widely used feature importance methods:
| Method | Scope | Model Requirement | Computation Cost | Handles Correlated Features Well? | Provides Direction of Effect? |
|---|---|---|---|---|---|
| Impurity-based (MDI) | Global | Tree-based models only | Very low (computed during training) | No (splits importance across correlated features) | No |
| Coefficient magnitude | Global | Linear models only | Very low (read from trained model) | No (unstable with multicollinearity) | Yes (sign of coefficient) |
| Permutation importance | Global | Any model | Moderate (no retraining) | No (underestimates correlated features) | No |
| Drop-column (LOCO) | Global | Any model | Very high (retrains per feature) | Partially (retraining captures redistribution) | No |
| SHAP | Local and global | Any model (with specialized fast explainers for trees and linear models) | Low to high (depends on explainer) | No (affected by correlations) | Yes (sign of SHAP value) |
| LIME | Local | Any model | Moderate | No (local perturbations affected by correlations) | Yes (sign of local coefficient) |
Correlated features are the most common source of misleading feature importance results, and this problem affects virtually every method. When two features carry similar information, importance methods tend to split the total importance between them, making each appear less important than it would be in isolation. Permutation importance can also overestimate the importance of correlated features in some settings by creating impossible feature combinations.
Practical solutions:

- Cluster highly correlated features (for example, hierarchical clustering on a Spearman rank-correlation matrix) and keep one representative per cluster before computing importance; a sketch of this approach follows the list.
- Permute or drop groups of correlated features together rather than one feature at a time.
- Prefer drop-column importance when it is affordable, since retraining lets the model redistribute weight among the remaining correlated partners.
- Inspect the correlation matrix before interpreting any ranking, so you know which scores may have been split between partners.
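Here is a minimal version of the clustering approach, in the spirit of the scikit-learn documentation's multicollinearity example; the Ward linkage and the distance threshold are illustrative choices.

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr

def select_uncorrelated_features(X, threshold=1.0):
    """Keep one representative feature per correlation cluster."""
    corr = spearmanr(X).correlation            # rank-correlation matrix
    corr = (corr + corr.T) / 2                 # force exact symmetry
    np.fill_diagonal(corr, 1.0)
    distance = 1 - np.abs(corr)                # strongly correlated => close
    linkage = hierarchy.ward(squareform(distance, checks=False))
    cluster_ids = hierarchy.fcluster(linkage, threshold, criterion="distance")
    # Keep the first feature encountered in each cluster
    return sorted(np.flatnonzero(cluster_ids == c)[0]
                  for c in np.unique(cluster_ids))
```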
Impurity-based importance is systematically biased toward features with many unique values. A continuous feature with 10,000 distinct values will tend to score higher than a binary feature, even if the binary feature is the true driver. This was demonstrated by Strobl et al. (2007), who showed that random forests preferentially select high-cardinality variables for splits.
Practical solution: Use permutation importance or SHAP instead of MDI when your dataset contains a mix of continuous and categorical features with varying cardinality.
If a model has memorized the training data, the importance scores derived from that training data will be unreliable. A random noise feature may appear important simply because the model has overfit to it.
Practical solution: Always compute importance on a held-out test set or use cross-validated importance estimates. Scikit-learn's permutation_importance function accepts any dataset, so passing the test set rather than the training set is straightforward.
Feature importance measures statistical association, not causation. A feature may appear important because it is correlated with the true causal factor, not because it directly influences the outcome. For example, ice cream sales might appear important for predicting drowning rates, but the true driver is temperature.
Feature importance is only meaningful when the underlying model performs well. If the model has a low cross-validation score, its importance rankings may be unreliable and unstable. Always validate model performance before interpreting feature importances.
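A quick cross-validation check (here `model`, `X`, and `y` are placeholders for your own objects) is enough to catch this failure mode before any interpretation work:

```python
from sklearn.model_selection import cross_val_score

# Sanity-check generalization before trusting any importance ranking
scores = cross_val_score(model, X, y, cv=5)
print(f"CV score: {scores.mean():.3f} +/- {scores.std():.3f}")
# If this is near chance level, any importance ranking is likely noise
```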
Scikit-learn provides two primary interfaces for computing feature importance:
| Interface | Function / Attribute | Method Type | Available Since |
|---|---|---|---|
| Impurity-based | model.feature_importances_ | Built-in (tree models) | Early versions |
| Permutation | sklearn.inspection.permutation_importance() | Model-agnostic | scikit-learn 0.22 |
The scikit-learn documentation explicitly recommends preferring permutation importance over impurity-based importance when accuracy of rankings matters, noting that MDI importances are "biased towards high cardinality features" and "are computed on training set statistics and therefore do not reflect the ability of feature to be useful to make predictions that generalize to the test set."
A typical workflow combines both methods:
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
import numpy as np

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train model
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)

# Method 1: Impurity-based (fast but potentially biased)
mdi_importances = rf.feature_importances_

# Method 2: Permutation importance (more reliable)
perm_result = permutation_importance(
    rf, X_test, y_test, n_repeats=30, random_state=42
)
perm_importances = perm_result.importances_mean

# Compare
for name, mdi, perm in sorted(
    zip(feature_names, mdi_importances, perm_importances),
    key=lambda x: x[2], reverse=True
):
    print(f"{name:>20s}: MDI={mdi:.4f} Permutation={perm:.4f}")
```
Imagine you are baking cookies, and you want to know which ingredient matters most for making them taste good. You could try leaving out the sugar one time, leaving out the butter another time, and leaving out the vanilla another time. Whichever ingredient, when removed, makes the cookies taste the worst is the most important ingredient. Feature importance works the same way: it tests what happens to a computer's predictions when each piece of information is taken away or scrambled, and the pieces that cause the biggest mess when removed are the most important ones.