See also: Variable importances, Feature engineering, Machine learning terms
Permutation variable importance (often shortened to permutation importance or permutation feature importance (PFI), and also known as mean decrease in accuracy (MDA)) is a model-agnostic technique for ranking the predictors used by a fitted machine learning model. The score for a feature is computed by shuffling that feature's column in a held-out dataset, recomputing the model's performance, and measuring how much the performance drops. A feature whose values can be randomized without hurting performance was not really being used; a feature whose permutation causes a large drop was carrying much of the signal the model relied on.
The method is part of the broader family of variable importances used in explainable AI and feature selection. It was introduced by Leo Breiman for random forests in his 2001 paper and was later generalized to any predictor by Fisher, Rudin, and Dominici under the name model reliance.[^breiman2001][^fisher2019]
Permutation importance grew out of the random forest literature. In the original Breiman 2001 formulation each tree is grown on a bootstrap sample of the training data; the rows that are not in the bootstrap (the out-of-bag, or OOB, sample) are used as a built-in test set. To score variable m, the OOB rows are run down the tree twice: once with the real values of variable m, and once with the values of variable m permuted across the OOB rows. The drop in correct-vote count, averaged over every tree in the forest, is the raw importance for that variable.[^breiman2001]
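A compact sketch of the OOB scheme, using hand-rolled bootstrap samples and scikit-learn decision trees so that each tree's OOB rows are known (the dataset, helper names, and the use of OOB accuracy rather than raw vote counts are illustrative choices, not Breiman's original code):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
n_samples, n_features = X.shape
n_trees = 100
rng = np.random.default_rng(0)
importances = np.zeros(n_features)

for _ in range(n_trees):
    # Each tree is grown on a bootstrap sample; the rows it never saw form its OOB set.
    boot = rng.integers(0, n_samples, n_samples)
    oob = np.setdiff1d(np.arange(n_samples), boot)
    tree = DecisionTreeClassifier(max_features="sqrt").fit(X[boot], y[boot])
    baseline = (tree.predict(X[oob]) == y[oob]).mean()
    for m in range(n_features):
        # Run the OOB rows down the tree again with variable m permuted.
        X_perm = X[oob].copy()
        X_perm[:, m] = rng.permutation(X_perm[:, m])
        importances[m] += baseline - (tree.predict(X_perm) == y[oob]).mean()

importances /= n_trees  # mean drop in OOB accuracy over the forest
print(np.argsort(importances)[::-1][:5])  # indices of the five most important variables
```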
Breiman framed the technique as a way to peek inside what is otherwise a black box: a forest of hundreds of deep trees has no single coefficient or rule a person can read, but it does have a measurable response when you scramble its inputs.
In 2008 Strobl and colleagues showed that the original scheme has a problem with correlated predictors. Permuting variable m unconditionally creates rows that no longer respect the joint distribution of the data, which inflates the apparent importance of correlated predictors and confounds it with the importance of the variables they happen to be tied to. Strobl proposed conditional permutation importance, which permutes within strata defined by the other correlated variables and is now part of the R package party.[^strobl2008]
Fisher, Rudin, and Dominici (2019) recast permutation importance as model reliance and extended it from a single fitted model to the entire Rashomon set of well-performing models in a class. Their model class reliance gives an interval of plausible importance values rather than a single point estimate, which is useful when many different models fit the data about equally well.[^fisher2019] Their paper also studies the U-statistic structure of the estimator and provides finite-sample confidence bounds.
The modern, model-agnostic recipe followed by libraries such as scikit-learn is:[^sklearndoc]

1. Score the fitted model on a held-out dataset to get a baseline.
2. For each feature, shuffle that column n_repeats times; after every shuffle, rescore the model on the permuted data and record the drop relative to the baseline.
3. Report each feature's importance as the mean drop across repeats, with the standard deviation as an uncertainty estimate.
A negative importance is possible and means the permuted version scored slightly better than the real one, which is a sign the feature is irrelevant noise plus sampling jitter.
The algorithm is independent of how the predictor was fit. It only requires the ability to call predict and to compute the chosen metric. This is what makes it model-agnostic: the same loop works for a random forest, a gradient boosted tree, a deep network, or a hand-tuned rule list.
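A minimal sketch of that loop, assuming a fitted model, NumPy arrays, and a regression metric; the helper name is hypothetical:

```python
import numpy as np
from sklearn.metrics import r2_score

def permutation_importances(model, X_val, y_val, metric=r2_score,
                            n_repeats=10, seed=0):
    """Mean and std of the score drop when each column of X_val is shuffled."""
    rng = np.random.default_rng(seed)
    baseline = metric(y_val, model.predict(X_val))
    drops = np.zeros((X_val.shape[1], n_repeats))
    for j in range(X_val.shape[1]):
        for r in range(n_repeats):
            X_perm = X_val.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])  # shuffle one column only
            drops[j, r] = baseline - metric(y_val, model.predict(X_perm))
    return drops.mean(axis=1), drops.std(axis=1)
```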
The canonical example in the scikit-learn documentation runs permutation_importance on the diabetes regression dataset with a Ridge regressor:[^sklearndoc]
```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.inspection import permutation_importance

data = load_diabetes()
X_train, X_val, y_train, y_val = train_test_split(
    data.data, data.target, random_state=0)

model = Ridge(alpha=1e-2).fit(X_train, y_train)
result = permutation_importance(
    model, X_val, y_val, n_repeats=30, random_state=0)

# Report features whose mean importance clears two standard deviations of the
# permutation noise, largest first.
for i in result.importances_mean.argsort()[::-1]:
    if result.importances_mean[i] - 2 * result.importances_std[i] > 0:
        print(f"{data.feature_names[i]:<8}"
              f"{result.importances_mean[i]:.3f} +/- "
              f"{result.importances_std[i]:.3f}")
```
The output ranks s5 (a serum measurement), bmi, bp (blood pressure), and sex as the only features whose importance is more than two standard deviations above zero. Features whose mean importance is comparable to the noise from the permutation repeats are filtered out.
The signature for the scikit-learn helper sklearn.inspection.permutation_importance is:[^sklearnapi]
permutation_importance(estimator, X, y, *, scoring=None, n_repeats=5, n_jobs=None, random_state=None, sample_weight=None, max_samples=1.0)
| Parameter | Default | Purpose |
|---|---|---|
| estimator | required | A fitted model with a predict method (or predict_proba if the scorer needs probabilities). |
| X, y | required | The held-out feature matrix and target. Should not be the training data unless the user understands the overfitting risk. |
| scoring | None | Metric name or callable. If None, the estimator's default score is used. Multiple metrics can be passed at once. |
| n_repeats | 5 | Number of times each feature is shuffled. Higher values reduce variance but cost more compute. |
| n_jobs | None | Parallelism over features. |
| random_state | None | Seed for the permutation RNG to make results reproducible. |
| sample_weight | None | Weights passed to the scorer. |
| max_samples | 1.0 | Subsample fraction or count for very large X, useful for speed. |
The call returns a Bunch with importances_mean, importances_std, and the full (n_features, n_repeats) array of raw drops in importances. The default n_repeats=5 is the bare minimum; the user guide and most tutorials use 10 to 30.
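For example, continuing the Ridge/diabetes snippet above (the multi-metric form assumes scikit-learn 1.0 or later, where scoring accepts a list and the call returns one Bunch per metric):

```python
# Raw per-repeat drops and the multi-metric form, continuing the example above.
raw = result.importances  # shape (n_features, n_repeats)

multi = permutation_importance(
    model, X_val, y_val,
    scoring=["r2", "neg_mean_absolute_error"],
    n_repeats=10, random_state=0)
print(multi["r2"].importances_mean.round(3))
print(multi["neg_mean_absolute_error"].importances_mean.round(3))
```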
Permutation importance is one of several methods that try to score features after a model has been trained. The table below summarizes how it compares with the other common options.
| Method | Model scope | Cost | Handles correlation | Notes |
|---|---|---|---|---|
| Permutation importance | Any predictor | Moderate (n_features × n_repeats × predict) | Poor; correlated features split importance | Global; depends on the scoring metric |
| Mean decrease impurity (MDI / Gini) | Tree ensembles only | Free; computed during training | Poor; biased toward high-cardinality features | Computed on training data, can flag overfitted noise as important[^sklearnmdi] |
| Drop-column importance | Any predictor | Very high; one full retraining per feature | Better; the model adjusts to the missing feature | Considered a gold standard but rarely tractable |
| SHAP values | Any predictor (with model-specific fast paths) | High in general, fast for trees | Better; conditional expectations preserve structure | Provides per-prediction attributions plus global aggregates |
| Linear model coefficients | Linear and generalized linear models | Free | Distorted by multicollinearity | Only meaningful after standardization |
| LIME | Any predictor | Moderate per query | Local fit can mask global correlations | Local rather than global; perturbs a neighborhood around one example |
For random forests in particular, the scikit-learn user guide explicitly warns that the impurity-based importance favors numerical and high-cardinality features, while permutation importance does not show that bias because it scores on held-out data.[^sklearnmdi]
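A short sketch of that contrast, with dataset and settings chosen purely for illustration:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
names = load_breast_cancer().feature_names
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

mdi = rf.feature_importances_                    # impurity-based, from the training data
perm = permutation_importance(rf, X_te, y_te,    # permutation, scored on held-out data
                              n_repeats=10, random_state=0).importances_mean

print("top 5 by MDI:        ", list(names[np.argsort(mdi)[::-1][:5]]))
print("top 5 by permutation:", list(names[np.argsort(perm)[::-1][:5]]))
```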
The single biggest pitfall in permutation importance is handling correlated predictors. If two columns carry largely the same information (for example, height in inches and height in centimetres, or two lab tests that measure overlapping biology), permuting one leaves the model with a near-perfect substitute. The drop in score is small for both columns and the conclusion looks like neither feature matters, when in fact the underlying signal is critical.[^strobl2008][^molnar]
Two additional effects make the problem worse: permuting a correlated feature unconditionally manufactures off-manifold rows the model never saw during training, so part of the measured drop reflects behaviour on unrealistic inputs, and the shared signal is split across the correlated group, so each member looks individually weak even though the group as a whole is essential.
Several remedies exist:
| Variant | Idea | Trade-off |
|---|---|---|
| Conditional permutation importance | Permute within strata defined by the other correlated variables | Closer to true marginal effect; needs a way to define the strata |
| Group permutation importance | Permute the whole correlated group together (see the sketch after this table) | Reports the joint contribution; cannot rank within the group |
| Hierarchical clustering then drop | Cluster features by Spearman correlation, keep one representative per cluster, then run permutation importance | Loses the within-group ranking; recommended in the scikit-learn user guide |
| Drop-column importance | Retrain without the feature | Most faithful but most expensive |
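A minimal sketch of the group variant from the table above, with a hypothetical helper name; every column in the group is shuffled with the same row permutation, so within-group correlation survives while the group's link to the target is broken:

```python
import numpy as np
from sklearn.metrics import r2_score

def group_permutation_importance(model, X_val, y_val, group, n_repeats=10, seed=0):
    """Score drop when the columns listed in `group` are shuffled jointly."""
    rng = np.random.default_rng(seed)
    baseline = r2_score(y_val, model.predict(X_val))
    drops = []
    for _ in range(n_repeats):
        perm = rng.permutation(len(X_val))
        X_perm = X_val.copy()
        X_perm[:, group] = X_val[perm][:, group]  # one permutation for the whole group
        drops.append(baseline - r2_score(y_val, model.predict(X_perm)))
    return float(np.mean(drops)), float(np.std(drops))
```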
There is a long-running debate about whether permutation importance should be computed on the training set or on a held-out set. The training-set version answers the question "how much does the trained model rely on this feature on the data it saw" while the test-set version answers "how much does this feature help the model generalize." When the model overfits, the training-set version inflates importances of noisy features that the model memorized.[^molnar][^sklearndoc]
The scikit-learn user guide recommends the test set for that reason, and it goes further: features that look unimportant on a poorly fit model can become important when the model is fit well, so it is essential to verify that the model has reasonable held-out performance before reading anything into the importance scores.
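Concretely, reusing model, X_train, and X_val from the diabetes example above, the two questions correspond to two calls:

```python
# Same model, two datasets: the training-set numbers describe what the fitted
# model leans on for data it has already seen; the held-out numbers describe
# what actually helps it generalize.
imp_train = permutation_importance(model, X_train, y_train,
                                   n_repeats=10, random_state=0)
imp_val = permutation_importance(model, X_val, y_val,
                                 n_repeats=10, random_state=0)
print(imp_train.importances_mean.round(3))
print(imp_val.importances_mean.round(3))
```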
The marginal (standard) and conditional flavours differ along several axes:

| Aspect | Marginal (standard) | Conditional |
|---|---|---|
| Definition | Shuffle one column independently | Shuffle one column within strata defined by other features |
| What it measures | Drop in accuracy if the feature is unavailable | Drop in accuracy beyond what correlated features already provide |
| Off-manifold rows | Many; can yield extrapolation | Few; rows stay closer to the joint distribution |
| Cost | Cheap | Higher; needs a model of the conditional distribution |
| Original reference | Breiman 2001[^breiman2001] | Strobl 2008[^strobl2008] |
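A minimal sketch of the conditional flavour, with a hypothetical helper that uses quantile bins of one correlated companion column as the strata (the party implementation instead derives the strata from the trees' own split points):

```python
import numpy as np
from sklearn.metrics import r2_score

def conditional_importance(model, X_val, y_val, col, cond_col,
                           n_bins=5, n_repeats=10, seed=0):
    """Score drop when `col` is shuffled within quantile bins of `cond_col`."""
    rng = np.random.default_rng(seed)
    baseline = r2_score(y_val, model.predict(X_val))
    edges = np.quantile(X_val[:, cond_col], np.linspace(0, 1, n_bins + 1)[1:-1])
    strata = np.digitize(X_val[:, cond_col], edges)
    drops = []
    for _ in range(n_repeats):
        X_perm = X_val.copy()
        for s in np.unique(strata):
            rows = np.where(strata == s)[0]
            X_perm[rows, col] = rng.permutation(X_val[rows, col])  # shuffle inside the stratum
        drops.append(baseline - r2_score(y_val, model.predict(X_perm)))
    return float(np.mean(drops)), float(np.std(drops))
```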
Permutation importance is implemented in a number of libraries:

| Library | Language | Notes |
|---|---|---|
| sklearn.inspection.permutation_importance | Python | Reference model-agnostic implementation; supports multiple scorers in one pass[^sklearnapi] |
| eli5.sklearn.PermutationImportance and eli5.permutation_importance.get_score_importances | Python | Earlier popular implementation; works for sklearn estimators and arbitrary score functions[^eli5docs] |
| rfpimp | Python | Companion to the Beware Default Random Forest Importances article; supports column-drop and group importances[^parr2018] |
| mlxtend.evaluate.feature_importance_permutation | Python | Standalone implementation by Sebastian Raschka |
| iml::FeatureImp | R | Part of the iml package; implements both ratio and difference variants |
| vip::vi | R | Unified interface to several importance methods, including permutation |
| party::varimp | R | Original conditional permutation importance from Strobl et al.[^strobl2008] |
| caret::varImp | R | Wraps several model-specific and permutation-based importances |
Feature selection. Drop features whose mean importance is at or below zero, then retrain. This is a cheap pruning step that often shrinks input dimensionality without hurting performance and can be more robust than coefficient-based selection on nonlinear models.
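A sketch of that pruning step, reusing result and the Ridge/diabetes split from the example earlier:

```python
import numpy as np

# Keep only the columns whose mean permutation importance is positive, then refit.
keep = np.where(result.importances_mean > 0)[0]
pruned = Ridge(alpha=1e-2).fit(X_train[:, keep], y_train)
print(f"kept {len(keep)} of {X_train.shape[1]} features, "
      f"held-out R^2 = {pruned.score(X_val[:, keep], y_val):.3f}")
```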
Model debugging. If a feature that should not matter shows up at the top, that is often a sign of leakage. A common example is a customer-id column that the model has accidentally learned to memorize, or a timestamp that correlates with the train-test split.
Regulatory and audit explanations. The European GDPR and the US Equal Credit Opportunity Act both put pressure on model providers to explain the factors driving a decision. Permutation importance gives an answer per feature that does not depend on internal model details.
Scientific discovery. In genomics, ecology, and clinical research the goal is often to identify which measurements matter, not to deploy a predictor. A random forest plus permutation importance has become a standard way to screen large numbers of candidate variables in these fields.
Permutation importance is global and single-feature: it does not reveal interactions between two features unless they are tested together as a group. It does not say which direction the feature pushes the prediction, only that knocking the feature out of the input hurts. It depends entirely on the choice of scoring metric, so an importance ranking under accuracy can differ from one under AUROC or log loss. It is computed on a particular fitted model, so two equally good models trained on the same data can give different rankings, which is the observation that motivated Fisher and colleagues' model class reliance.[^fisher2019]
There is also a deeper concern about extrapolation. Hooker and Mentch (2019) and follow-up work showed that unconditional permutation often forces the model to predict on inputs that are nowhere near the training distribution, which can produce importance scores that say more about how the model behaves off-manifold than about how it actually works on real data.[^hooker2021] Conditional and grouped variants reduce but do not fully remove this issue.
For tabular machine learning permutation importance is still one of the first tools reached for. It is built into scikit-learn, easy to read, and gives a usable answer in a few seconds for most problems. Tutorials in the genomics, finance, and clinical literature continue to use it as the default global importance measure.
For deep learning on images, text, and audio it is less common. The cost of running thousands of forward passes for each input feature is high, and gradient-based attribution methods such as Integrated Gradients, SmoothGrad, and SHAP variants tailored to neural networks are usually preferred. On structured tabular deep learning models such as TabNet or the FT-Transformer, however, permutation importance remains a reasonable global check.
Interest has shifted toward conditional and grouped variants in part because regulators and auditors increasingly ask for importance scores that respect feature correlations, and the marginal version is hard to defend on heavily correlated industrial data.
Imagine you have a recipe with ten ingredients and a friend who is very good at telling whether a cake will taste good just by reading the recipe. To find out which ingredient your friend really cares about, you make ten copies of the recipe. In each copy you scramble the amount of one ingredient, leaving the others alone. You hand the scrambled recipes back and ask your friend to grade each cake. The ingredient whose scramble made the predicted taste worst is the one your friend was paying the most attention to. The one whose scramble did not change the grade was being ignored. That is permutation importance: scramble one column at a time, watch how much the score drops, and use the drop as the score for that column.