See also: Variable importances, Feature engineering, Machine learning terms
Permutation variable importance (often shortened to permutation importance or permutation feature importance (PFI), and also known as mean decrease in accuracy (MDA)) is a model-agnostic technique for ranking the predictors used by a fitted machine learning model. The score for a feature is computed by shuffling that feature's column in a held-out dataset, recomputing the model's performance, and measuring how much the performance drops. A feature whose values can be randomized without hurting performance was not really being used; a feature whose permutation causes a large drop was carrying much of the signal the model relied on.
The method is part of the broader family of variable importances used in explainable AI and feature selection. It was introduced by Leo Breiman for random forests in his 2001 paper and was later generalized to any predictor by Fisher, Rudin, and Dominici under the name model reliance.[^breiman2001][^fisher2019]
Permutation importance grew out of the random forest literature. In the original Breiman 2001 formulation each tree is grown on a bootstrap sample of the training data; the rows that are not in the bootstrap (the out-of-bag, or OOB, sample) are used as a built-in test set. To score variable m, the OOB rows are run down the tree twice: once with the real values of variable m, and once with the values of variable m permuted across the OOB rows. The drop in correct-vote count, averaged over every tree in the forest, is the raw importance for that variable.[^breiman2001]
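A compact sketch of the OOB scheme, using hand-rolled bootstrap samples and scikit-learn decision trees so that each tree's OOB rows are known (the dataset, helper names, and the use of OOB accuracy rather than raw vote counts are illustrative choices, not Breiman's original code):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
n_samples, n_features = X.shape
n_trees = 100
rng = np.random.default_rng(0)
importances = np.zeros(n_features)

for _ in range(n_trees):
    # Each tree is grown on a bootstrap sample; the rows it never saw form its OOB set.
    boot = rng.integers(0, n_samples, n_samples)
    oob = np.setdiff1d(np.arange(n_samples), boot)
    tree = DecisionTreeClassifier(max_features="sqrt").fit(X[boot], y[boot])
    baseline = (tree.predict(X[oob]) == y[oob]).mean()
    for m in range(n_features):
        # Run the OOB rows down the tree again with variable m permuted.
        X_perm = X[oob].copy()
        X_perm[:, m] = rng.permutation(X_perm[:, m])
        importances[m] += baseline - (tree.predict(X_perm) == y[oob]).mean()

importances /= n_trees  # mean drop in OOB accuracy over the forest
print(np.argsort(importances)[::-1][:5])  # indices of the five most important variables
```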
Breiman framed the technique as a way to peek inside what is otherwise a black box: a forest of hundreds of deep trees has no single coefficient or rule a person can read, but it does have a measurable response when you scramble its inputs.
In 2008 Strobl and colleagues showed that the original scheme has a problem with correlated predictors. Permuting variable m unconditionally creates rows that no longer respect the joint distribution of the data, which inflates the apparent importance of correlated predictors and confounds it with the importance of the variables they happen to be tied to. Strobl proposed conditional permutation importance, which permutes within strata defined by the other correlated variables and is now part of the R package party.[^strobl2008]
Fisher, Rudin, and Dominici (2019) recast permutation importance as model reliance and extended it from a single fitted model to the entire Rashomon set of well-performing models in a class. Their model class reliance gives an interval of plausible importance values rather than a single point estimate, which is useful when many different models fit the data about equally well.[^fisher2019] Their paper also studies the U-statistic structure of the estimator and provides finite-sample confidence bounds.
The modern, model-agnostic recipe followed by libraries such as scikit-learn is:[^sklearndoc]

1. Score the fitted model on a held-out dataset to get a baseline.
2. For each feature, shuffle that column n_repeats times; after every shuffle, rescore the model on the permuted data and record the drop relative to the baseline.
3. Report each feature's importance as the mean drop across repeats, with the standard deviation as an uncertainty estimate.
A negative importance is possible and means the permuted version scored slightly better than the real one, which is a sign the feature is irrelevant noise plus sampling jitter.
The algorithm is independent of how the predictor was fit. It only requires the ability to call predict and to compute the chosen metric. This is what makes it model-agnostic: the same loop works for a random forest, a gradient boosted tree, a deep network, or a hand-tuned rule list.
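A minimal sketch of that loop, assuming a fitted model, NumPy arrays, and a regression metric; the helper name is hypothetical:

```python
import numpy as np
from sklearn.metrics import r2_score

def permutation_importances(model, X_val, y_val, metric=r2_score,
                            n_repeats=10, seed=0):
    """Mean and std of the score drop when each column of X_val is shuffled."""
    rng = np.random.default_rng(seed)
    baseline = metric(y_val, model.predict(X_val))
    drops = np.zeros((X_val.shape[1], n_repeats))
    for j in range(X_val.shape[1]):
        for r in range(n_repeats):
            X_perm = X_val.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])  # shuffle one column only
            drops[j, r] = baseline - metric(y_val, model.predict(X_perm))
    return drops.mean(axis=1), drops.std(axis=1)
```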
The canonical example in the scikit-learn documentation runs permutation_importance on the diabetes regression dataset with a Ridge regressor:[^sklearndoc]
```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.inspection import permutation_importance

data = load_diabetes()
X_train, X_val, y_train, y_val = train_test_split(
    data.data, data.target, random_state=0)

model = Ridge(alpha=1e-2).fit(X_train, y_train)
result = permutation_importance(
    model, X_val, y_val, n_repeats=30, random_state=0)

# Report features whose mean importance clears two standard deviations of the
# permutation noise, largest first.
for i in result.importances_mean.argsort()[::-1]:
    if result.importances_mean[i] - 2 * result.importances_std[i] > 0:
        print(f"{data.feature_names[i]:<8}"
              f"{result.importances_mean[i]:.3f} +/- "
              f"{result.importances_std[i]:.3f}")
```
The output ranks s5 (a serum measurement), bmi, bp (blood pressure), and sex as the only features whose importance is more than two standard deviations above zero. Features whose mean importance is comparable to the noise from the permutation repeats are filtered out.
The signature for the scikit-learn helper sklearn.inspection.permutation_importance is:[^sklearnapi]
permutation_importance(estimator, X, y, *, scoring=None, n_repeats=5, n_jobs=None, random_state=None, sample_weight=None, max_samples=1.0)
| Parameter | Default | Purpose |
|---|---|---|
| estimator | required | A fitted model with a predict method (or predict_proba if the scorer needs probabilities). |
| X, y | required | The held-out feature matrix and target. Should not be the training data unless the user understands the overfitting risk. |
| scoring | None | Metric name or callable. If None, the estimator's default score is used. Multiple metrics can be passed at once. |
| n_repeats | 5 | Number of times each feature is shuffled. Higher values reduce variance but cost more compute. |
| n_jobs | None | Parallelism over features. |
| random_state | None | Seed for the permutation RNG to make results reproducible. |
| sample_weight | None | Weights passed to the scorer. |
| max_samples | 1.0 | Subsample fraction or count for very large X, useful for speed. |
The call returns a Bunch with importances_mean, importances_std, and the full (n_features, n_repeats) array of raw drops in importances. The default n_repeats=5 is the bare minimum; the user guide and most tutorials use 10 to 30.
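For example, continuing the Ridge/diabetes snippet above (the multi-metric form assumes scikit-learn 1.0 or later, where scoring accepts a list and the call returns one Bunch per metric):

```python
# Raw per-repeat drops and the multi-metric form, continuing the example above.
raw = result.importances  # shape (n_features, n_repeats)

multi = permutation_importance(
    model, X_val, y_val,
    scoring=["r2", "neg_mean_absolute_error"],
    n_repeats=10, random_state=0)
print(multi["r2"].importances_mean.round(3))
print(multi["neg_mean_absolute_error"].importances_mean.round(3))
```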
Permutation importance is one of several methods that try to score features after a model has been trained. The table below summarizes how it compares with the other common options.
| Method | Model scope | Cost | Handles correlation | Notes |
|---|---|---|---|---|
| Permutation importance | Any predictor | Moderate (n_features × n_repeats × predict) | Poor; correlated features split importance | Global; depends on the scoring metric |
| Mean decrease impurity (MDI / Gini) | Tree ensembles only | Free; computed during training | Poor; biased toward high-cardinality features | Computed on training data, can flag overfitted noise as important[^sklearnmdi] |
| Drop-column importance | Any predictor | Very high; one full retraining per feature | Better; the model adjusts to the missing feature | Considered a gold standard but rarely tractable |
| SHAP values | Any predictor (with model-specific fast paths) | High in general, fast for trees | Better; conditional expectations preserve structure | Provides per-prediction attributions plus global aggregates |
| Linear model coefficients | Linear and generalized linear models | Free | Distorted by multicollinearity | Only meaningful after standardization |
| LIME | Any predictor | Moderate per query | Local fit can mask global correlations | Local rather than global; perturbs a neighborhood around one example |
For random forests in particular, the scikit-learn user guide explicitly warns that the impurity-based importance favors numerical and high-cardinality features, while permutation importance does not show that bias because it scores on held-out data.[^sklearnmdi]
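A short sketch of that contrast, with dataset and settings chosen purely for illustration:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
names = load_breast_cancer().feature_names
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

mdi = rf.feature_importances_                    # impurity-based, from the training data
perm = permutation_importance(rf, X_te, y_te,    # permutation, scored on held-out data
                              n_repeats=10, random_state=0).importances_mean

print("top 5 by MDI:        ", list(names[np.argsort(mdi)[::-1][:5]]))
print("top 5 by permutation:", list(names[np.argsort(perm)[::-1][:5]]))
```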
The single biggest pitfall in permutation importance is handling correlated predictors. If two columns carry largely the same information (for example, height in inches and height in centimetres, or two lab tests that measure overlapping biology), permuting one leaves the model with a near-perfect substitute. The drop in score is small for both columns and the conclusion looks like neither feature matters, when in fact the underlying signal is critical.[^strobl2008][^molnar]
Two additional effects make the problem worse: permuting a correlated feature unconditionally manufactures off-manifold rows the model never saw during training, so part of the measured drop reflects behaviour on unrealistic inputs, and the shared signal is split across the correlated group, so each member looks individually weak even though the group as a whole is essential.
Several remedies exist:
| Variant | Idea | Trade-off |
|---|---|---|
| Conditional permutation importance | Permute within strata defined by the other correlated variables | Closer to true marginal effect; needs a way to define the strata |
| Group permutation importance | Permute the whole correlated group together (see the sketch after this table) | Reports the joint contribution; cannot rank within the group |
| Hierarchical clustering then drop | Cluster features by Spearman correlation, keep one representative per cluster, then run permutation importance | Loses the within-group ranking; recommended in the scikit-learn user guide |
| Drop-column importance | Retrain without the feature | Most faithful but most expensive |
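A minimal sketch of the group variant from the table above, with a hypothetical helper name; every column in the group is shuffled with the same row permutation, so within-group correlation survives while the group's link to the target is broken:

```python
import numpy as np
from sklearn.metrics import r2_score

def group_permutation_importance(model, X_val, y_val, group, n_repeats=10, seed=0):
    """Score drop when the columns listed in `group` are shuffled jointly."""
    rng = np.random.default_rng(seed)
    baseline = r2_score(y_val, model.predict(X_val))
    drops = []
    for _ in range(n_repeats):
        perm = rng.permutation(len(X_val))
        X_perm = X_val.copy()
        X_perm[:, group] = X_val[perm][:, group]  # one permutation for the whole group
        drops.append(baseline - r2_score(y_val, model.predict(X_perm)))
    return float(np.mean(drops)), float(np.std(drops))
```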
There is a long-running debate about whether permutation importance should be computed on the training set or on a held-out set. The training-set version answers the question "how much does the trained model rely on this feature on the data it saw" while the test-set version answers "how much does this feature help the model generalize." When the model overfits, the training-set version inflates importances of noisy features that the model memorized.[^molnar][^sklearndoc]
The scikit-learn user guide recommends the test set for that reason, and it goes further: features that look unimportant on a poorly fit model can become important when the model is fit well, so it is essential to verify that the model has reasonable held-out performance before reading anything into the importance scores.
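Concretely, reusing model, X_train, and X_val from the diabetes example above, the two questions correspond to two calls:

```python
# Same model, two datasets: the training-set numbers describe what the fitted
# model leans on for data it has already seen; the held-out numbers describe
# what actually helps it generalize.
imp_train = permutation_importance(model, X_train, y_train,
                                   n_repeats=10, random_state=0)
imp_val = permutation_importance(model, X_val, y_val,
                                 n_repeats=10, random_state=0)
print(imp_train.importances_mean.round(3))
print(imp_val.importances_mean.round(3))
```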
The marginal (standard) and conditional flavours differ along several axes:

| Aspect | Marginal (standard) | Conditional |
|---|---|---|
| Definition | Shuffle one column independently | Shuffle one column within strata defined by other features |
| What it measures | Drop in accuracy if the feature is unavailable | Drop in accuracy beyond what correlated features already provide |
| Off-manifold rows | Many; can yield extrapolation | Few; rows stay closer to the joint distribution |
| Cost | Cheap | Higher; needs a model of the conditional distribution |
| Original reference | Breiman 2001[^breiman2001] | Strobl 2008[^strobl2008] |
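A minimal sketch of the conditional flavour, with a hypothetical helper that uses quantile bins of one correlated companion column as the strata (the party implementation instead derives the strata from the trees' own split points):

```python
import numpy as np
from sklearn.metrics import r2_score

def conditional_importance(model, X_val, y_val, col, cond_col,
                           n_bins=5, n_repeats=10, seed=0):
    """Score drop when `col` is shuffled within quantile bins of `cond_col`."""
    rng = np.random.default_rng(seed)
    baseline = r2_score(y_val, model.predict(X_val))
    edges = np.quantile(X_val[:, cond_col], np.linspace(0, 1, n_bins + 1)[1:-1])
    strata = np.digitize(X_val[:, cond_col], edges)
    drops = []
    for _ in range(n_repeats):
        X_perm = X_val.copy()
        for s in np.unique(strata):
            rows = np.where(strata == s)[0]
            X_perm[rows, col] = rng.permutation(X_val[rows, col])  # shuffle inside the stratum
        drops.append(baseline - r2_score(y_val, model.predict(X_perm)))
    return float(np.mean(drops)), float(np.std(drops))
```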
Permutation importance is implemented in a number of libraries:

| Library | Language | Notes |
|---|---|---|
| sklearn.inspection.permutation_importance | Python | Reference model-agnostic implementation; supports multiple scorers in one pass[^sklearnapi] |
| eli5.sklearn.PermutationImportance and eli5.permutation_importance.get_score_importances | Python | Earlier popular implementation; works for sklearn estimators and arbitrary score functions[^eli5docs] |
| rfpimp | Python | Companion to the Beware Default Random Forest Importances article; supports column-drop and group importances[^parr2018] |
| mlxtend.evaluate.feature_importance_permutation | Python | Standalone implementation by Sebastian Raschka |
| iml::FeatureImp | R | Part of the iml package; implements both ratio and difference variants |
| vip::vi | R | Unified interface to several importance methods, including permutation |
| party::varimp | R | Original conditional permutation importance from Strobl et al.[^strobl2008] |
| caret::varImp | R | Wraps several model-specific and permutation-based importances |
Feature selection. Drop features whose mean importance is at or below zero, then retrain. This is a cheap pruning step that often shrinks input dimensionality without hurting performance and can be more robust than coefficient-based selection on nonlinear models.
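A sketch of that pruning step, reusing result and the Ridge/diabetes split from the example earlier:

```python
import numpy as np

# Keep only the columns whose mean permutation importance is positive, then refit.
keep = np.where(result.importances_mean > 0)[0]
pruned = Ridge(alpha=1e-2).fit(X_train[:, keep], y_train)
print(f"kept {len(keep)} of {X_train.shape[1]} features, "
      f"held-out R^2 = {pruned.score(X_val[:, keep], y_val):.3f}")
```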
Model debugging. If a feature that should not matter shows up at the top, that is often a sign of leakage. A common example is a customer-id column that the model has accidentally learned to memorize, or a timestamp that correlates with the train-test split.
Regulatory and audit explanations. The European GDPR and the US Equal Credit Opportunity Act both put pressure on model providers to explain the factors driving a decision. Permutation importance gives an answer per feature that does not depend on internal model details.
Scientific discovery. In genomics, ecology, and clinical research the goal is often to identify which measurements matter, not to deploy a predictor. A random forest plus permutation importance has become a standard way to screen large numbers of candidate variables in these fields.
Permutation importance is global and single-feature: it does not reveal interactions between two features unless they are tested together as a group. It does not say which direction the feature pushes the prediction, only that knocking the feature out of the input hurts. It depends entirely on the choice of scoring metric, so an importance ranking under accuracy can differ from one under AUROC or log loss. It is computed on a particular fitted model, so two equally good models trained on the same data can give different rankings, which is the observation that motivated Fisher and colleagues' model class reliance.[^fisher2019]
There is also a deeper concern about extrapolation. Hooker and Mentch (2019) and follow-up work showed that unconditional permutation often forces the model to predict on inputs that are nowhere near the training distribution, which can produce importance scores that say more about how the model behaves off-manifold than about how it actually works on real data.[^hooker2021] Conditional and grouped variants reduce but do not fully remove this issue.
For tabular machine learning permutation importance is still one of the first tools reached for. It is built into scikit-learn, easy to read, and gives a usable answer in a few seconds for most problems. Tutorials in the genomics, finance, and clinical literature continue to use it as the default global importance measure.
For deep learning on images, text, and audio it is less common. The cost of running thousands of forward passes for each input feature is high, and gradient-based attribution methods such as Integrated Gradients, SmoothGrad, and SHAP variants tailored to neural networks are usually preferred. On structured tabular deep learning models such as TabNet or the FT-Transformer, however, permutation importance remains a reasonable global check.
Interest has shifted toward conditional and grouped variants in part because regulators and auditors increasingly ask for importance scores that respect feature correlations, and the marginal version is hard to defend on heavily correlated industrial data.
Imagine you have a recipe with ten ingredients and a friend who is very good at telling whether a cake will taste good just by reading the recipe. To find out which ingredient your friend really cares about, you make ten copies of the recipe. In each copy you scramble the amount of one ingredient, leaving the others alone. You hand the scrambled recipes back and ask your friend to grade each cake. The ingredient whose scramble made the predicted taste worst is the one your friend was paying the most attention to. The one whose scramble did not change the grade was being ignored. That is permutation importance: scramble one column at a time, watch how much the score drops, and use the drop as the score for that column.