# Feature Selection

> Source: https://aiwiki.ai/wiki/feature_selection
> Updated: 2026-06-21
> Categories: Algorithms, Data & Datasets, Machine Learning
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**Feature selection** is the process of choosing a subset of the most relevant input variables (features) from a larger candidate pool for use in a [machine learning](/wiki/machine_learning) model. The goal is to find the smallest set of features that retains, or even improves, predictive performance while cutting computational cost, mitigating overfitting, and producing models that humans can more easily interpret. Methods fall into three classical families (filter, wrapper, and embedded), and the task is distinct from [dimensionality reduction](/wiki/dimensionality_reduction): feature selection keeps a subset of the original variables intact, whereas dimensionality reduction techniques such as [principal component analysis](/wiki/principal_component_analysis) project the data into a new coordinate system whose axes are linear or nonlinear combinations of the originals.

Feature selection sits at the core of the broader practice of [feature engineering](/wiki/feature_engineering). The topic was popularised in machine learning by Isabelle Guyon and Andre Elisseeff in their 2003 *Journal of Machine Learning Research* survey, "An Introduction to Variable and Feature Selection" (volume 3, pages 1157-1182), which framed selection as a way to improve prediction accuracy, lower inference cost, and gain insight into the data-generating process.[1] The Guyon and Elisseeff paper still serves as the canonical reference and has been cited more than 15,000 times.[1] Earlier foundational work by Ron Kohavi and George John in 1997, "Wrappers for Feature Subset Selection," published in *Artificial Intelligence* (volume 97, pages 273-324), formalised the wrapper paradigm in which selection is treated as a search problem guided by an actual learner's cross-validated score.[2]

In the modern era of large neural networks, explicit feature selection is sometimes assumed to be obsolete because deep models perform end-to-end representation learning. That perception is only partially correct. For unstructured data such as images, text, audio, and video, learned representations have indeed displaced hand-crafted feature pipelines. For structured tabular data, however, where gradient-boosted decision trees such as [XGBoost](/wiki/xgboost), [LightGBM](/wiki/lightgbm), and CatBoost still dominate Kaggle leaderboards and production pipelines, careful feature selection remains a high-leverage activity. Recent benchmarks across diverse tabular datasets continue to show that boosted trees with well-engineered, well-selected features match or beat large transformer-based tabular models on the majority of problems.[10]

## Why does feature selection matter?

When the number of input variables grows, models become harder to estimate, slower to train and serve, and more prone to memorising noise. The classical motivations for selecting a small subset of features are summarised below.

### What is the curse of dimensionality?

Richard Bellman coined the phrase *curse of dimensionality* in 1957 to describe the geometric and statistical pathologies that arise when the number of dimensions grows.[5] As the dimensionality *d* of the input space increases, the volume of the space grows exponentially, and any fixed sample of points becomes increasingly sparse. Distances between points become more uniform, which weakens distance-based methods such as k-nearest neighbours and clustering. The number of training examples required to densely sample a *d*-dimensional space grows exponentially in *d*, a fact that limits any nonparametric learner.[5]
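The concentration of distances can be observed numerically. The sketch below is illustrative only: it assumes points drawn uniformly from the unit hypercube, and the helper name `relative_contrast` and the sample sizes are choices made here rather than anything taken from the cited sources. It measures how much farther the farthest point is than the nearest one, relative to the nearest.

```python
import numpy as np


def relative_contrast(d, n_points=500, seed=0):
    """Return (max - min) / min distance from one query point to random points.

    Hypothetical helper: points and query are uniform in the unit hypercube.
    """
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, d))
    query = rng.random(d)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()


# Evaluate the contrast at increasing dimensionality.
contrasts = {d: relative_contrast(d) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"d={d:>4}  relative contrast={c:.2f}")
```

Under these assumptions the contrast shrinks steadily as *d* grows, which is one way to see why nearest-neighbour comparisons lose discriminating power in high dimensions.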

A closely related result is the *Hughes phenomenon*, which states that for a fixed training-set size the expected accuracy of a classifier first rises with the number of features and then, beyond some peak dimensionality, falls again.[6] The peak occurs because each additional feature carries useful signal up to a point but then begins to inject more noise and estimation variance than information. Feature selection is one of the simplest tools for keeping a model on the favourable side of that peak.

### Bias, variance, and generalisation

The bias-variance decomposition makes the cost of irrelevant features concrete. With too few features the model is biased and underfits; with too many it has high variance and overfits the training set. Each parameter in a high-dimensional model must be estimated from finite data, so the estimate's variance grows with the number of parameters. By eliminating features that contribute no signal, feature selection reduces the effective parameter count, lowers variance, and often improves out-of-sample accuracy.[1] This rationale aligns with *Occam's razor*: among models that explain the data equally well, the simpler one tends to generalise better.

### Cost, latency, and interpretability

Fewer features mean smaller models, faster training, lower inference latency, and reduced storage. In production systems where features must be computed at request time from upstream services, every dropped feature also reduces the number of network calls and feature-store lookups. Sparse models are also easier to debug and to explain to non-technical stakeholders, which is increasingly important in regulated domains such as finance, healthcare, and credit scoring where model decisions must be auditable.

### How does feature selection differ from feature extraction?

*Feature selection* and [feature extraction](/wiki/feature_extraction) are often confused. Feature selection chooses a subset of the existing variables and leaves them otherwise untouched, so the chosen variables retain their original semantics and units. Feature extraction transforms the input space into a new one. PCA, autoencoders, kernel projections, and word embeddings are extraction methods because each output dimension is a function of many inputs. Selection preserves interpretability at the cost of being limited to whatever signal already exists in the raw inputs, while extraction can synthesise new informative coordinates at the cost of opaqueness. In practice the two are often combined: extract a wide set of candidate features, then select a sparse, predictive subset.

## What are the categories of feature selection?

The community generally groups feature selection algorithms into three families: filter methods, wrapper methods, and embedded methods.[14] A fourth, hybrid methods, blends elements of the others. The taxonomy was crystallised by the Kohavi and John paper and refined by Guyon and Elisseeff.[2][1]

### Filter vs wrapper vs embedded

| Property | Filter | Wrapper | Embedded |
| --- | --- | --- | --- |
| Selection signal | Statistical score independent of any learner | Cross-validated score of an actual learner | Score produced as a side effect of model training |
| Computational cost | Lowest, scales to millions of features | Highest, requires fitting many models | Moderate, one model fit per regularisation path |
| Captures feature interactions | Mostly no, univariate by default | Yes, evaluates subsets jointly | Partially, depends on model class |
| Risk of overfitting the selection step | Low | High when the number of candidate subsets is large | Low to moderate, controlled by [regularization](/wiki/regularization) |
| Coupling to downstream model | Loose, the same selection works for many models | Tight, results are tuned to one learner | Tight, baked into the learner |
| Typical examples | Pearson correlation, chi-square, ANOVA F-test, mutual information, variance threshold | Forward selection, backward elimination, recursive feature elimination, exhaustive search | L1 (Lasso) regression, Elastic Net, tree-based feature importance, SHAP-guided pruning |
| Best when | Quick first-pass screening on very wide data | Small to medium feature counts and a fixed downstream model | The same model is used for both selection and final prediction |

### Filter methods

Filter methods rank features using a statistic computed directly from the data, with no reference to a downstream learner.[14] Because most filters score each feature independently of the others, they are fast and embarrassingly parallel, at the price of ignoring feature interactions. That makes them the workhorse for the first pass over very wide datasets such as gene-expression matrices, sparse document-term matrices, or click-stream logs with thousands of one-hot encoded categorical levels.

Common filters include:

- **Variance threshold.** Drops any feature whose variance falls below a chosen cut-off. Useful for removing constant or near-constant columns produced by one-hot encoding rare categories.
- **Pearson correlation.** For numeric features and a numeric target, ranks features by the absolute value of their linear correlation with the target. Cheap but blind to nonlinear relationships.
- **Chi-square test.** Measures the dependence between a categorical feature and a categorical target by comparing observed cell counts to the counts expected under independence. Requires non-negative input.
- **ANOVA F-test.** Tests whether the means of a numeric feature differ across the classes of a categorical target. Implemented in scikit-learn as `f_classif`; the regression counterpart, `f_regression`, is a univariate linear-regression F-test for a numeric target rather than an ANOVA.[11]
- **Mutual information.** A nonparametric, model-free measure of statistical dependence that captures both linear and nonlinear relationships. Mutual information is zero only when the feature and target are statistically independent, which gives it a strong theoretical grounding but a higher sample-complexity cost than Pearson or chi-square.
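- **ReliefF and its derivatives.** Estimate feature relevance by comparing each sample with its nearest neighbours of the same and of different classes. Because the neighbourhoods are defined over all features, these methods can detect some interactions even though they output one score per feature.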

A pragmatic filter pipeline first drops zero-variance and duplicate columns, then ranks the survivors by a target-aware score such as mutual information, and finally keeps a top fraction such as the top 10 percent for downstream evaluation.
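The pipeline just described can be sketched with scikit-learn's own selectors. This is a minimal example on synthetic data, not a recommended configuration: the dataset sizes, the two appended constant columns, and the step names are assumptions made for illustration, and duplicate-column removal is omitted for brevity.

```python
from functools import partial

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import (
    SelectPercentile,
    VarianceThreshold,
    mutual_info_classif,
)
from sklearn.pipeline import Pipeline

# Synthetic data: 50 real-valued columns, 5 of them informative.
X, y = make_classification(
    n_samples=400, n_features=50, n_informative=5, n_redundant=0, random_state=0
)
# Append two constant columns so the variance filter has something to drop.
X = np.hstack([X, np.zeros((X.shape[0], 2))])

filter_pipeline = Pipeline(
    [
        # Step 1: remove zero-variance columns (target-agnostic).
        ("drop_constant", VarianceThreshold(threshold=0.0)),
        # Step 2: keep the top 10 percent by mutual information with the target.
        (
            "top_10_percent",
            SelectPercentile(
                partial(mutual_info_classif, random_state=0), percentile=10
            ),
        ),
    ]
)

X_selected = filter_pipeline.fit_transform(X, y)
print(X.shape, "->", X_selected.shape)
```

Fixing `random_state` through `functools.partial` keeps the mutual-information estimate reproducible, since the estimator adds a small amount of noise to continuous features.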

### Wrapper methods

Wrapper methods treat feature selection as a search problem in which each candidate subset is scored by training a learner and measuring its cross-validated performance.[2] They are conceptually appealing because the score they optimise is the metric the user actually cares about, and they detect interactions because subsets are evaluated jointly. The trade-off is computational: with *p* features there are 2 to the power *p* possible subsets, so exhaustive search is feasible only for very small *p*. Practical wrappers therefore use heuristic search.

Widely used wrapper algorithms include:

- **Forward sequential selection.** Starts with the empty set and at each step adds the single feature that most improves the cross-validated score. Halts when no further improvement is possible or a target subset size is reached.
- **Backward sequential elimination.** Starts with the full feature set and iteratively removes the feature whose absence most improves, or least harms, the cross-validated score.
- **Floating selection (SFFS, SFBS).** Augments forward or backward search with conditional inclusion or removal steps that revisit earlier decisions, reducing the chance of getting stuck in local optima.
- **Recursive Feature Elimination (RFE).** Trains an estimator that exposes feature weights or importances, removes the least important feature (or block of features), refits, and repeats until a target count is reached. Works particularly well with linear models and tree ensembles.
- **RFE with cross-validation (RFECV).** Wraps RFE in an outer cross-validation loop to also choose the optimal number of features automatically.
- **Exhaustive search.** Evaluates every subset; feasible only for roughly 20 features or fewer, since 20 features already yield about a million subsets.
- **Genetic algorithms and simulated annealing.** Stochastic search heuristics sometimes used for very large search spaces, although they rarely beat well-tuned RFE in practice.

Wrappers can overfit the selection process itself when the number of candidate subsets evaluated approaches or exceeds the number of training examples.[2] Nested cross-validation, in which feature selection runs inside an inner loop and an outer loop estimates generalisation, is the standard guard against optimistic bias.
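Nested cross-validation can be sketched as follows. The example assumes synthetic data and a logistic-regression learner chosen here for speed; the fold counts are arbitrary. `RFECV` performs the inner loop (choosing the subset size), while `cross_val_score` supplies the outer loop that estimates generalisation of the whole select-then-fit procedure.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(
    n_samples=300, n_features=20, n_informative=4, n_redundant=2, random_state=0
)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Inner loop: RFECV picks the number of features using only the data it is fit on.
selector = RFECV(
    LogisticRegression(max_iter=1000), step=1, cv=inner_cv, scoring="accuracy"
)
model = make_pipeline(StandardScaler(), selector, LogisticRegression(max_iter=1000))

# Outer loop: each fold refits scaler, selector, and classifier from scratch,
# so the reported score is not biased by the selection step.
outer_scores = cross_val_score(model, X, y, cv=outer_cv, scoring="accuracy")
print("outer accuracy per fold:", outer_scores.round(3))

# Fit once on all data to inspect the subset size chosen by the inner loop.
model.fit(X, y)
n_selected = model.named_steps["rfecv"].n_features_
print("features kept:", n_selected)
```

The subset chosen in the final fit may differ from the subsets chosen inside each outer fold; the outer score describes the procedure, not one particular subset.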

### Embedded methods

Embedded methods perform selection as part of the model-fitting procedure itself.[14] They are usually cheaper than wrappers because the selection signal is produced for free during a single model fit, and they capture feature interactions through whatever mechanism the underlying learner uses.

The most widely used embedded approaches are:

- **L1 (Lasso) regularisation.** The Lasso, introduced by Robert Tibshirani in 1996, augments the squared-error loss of linear regression with a penalty proportional to the sum of the absolute values of the coefficients.[3] In Tibshirani's own description, the lasso "minimizes the residual sum of squares subject to the sum of the absolute value of the coefficients being less than a constant," and "because of the nature of this constraint it tends to produce some coefficients that are exactly 0 and hence gives interpretable models."[3] Fitting a Lasso therefore trains the model and selects features simultaneously, which makes it the canonical example of embedded selection. See [lasso regression](/wiki/lasso_regression) for details on the optimisation algorithm and tuning of the regularisation strength alpha.
- **Elastic Net.** Adds an L2 penalty alongside L1, which yields a Lasso-like sparsity pattern but better handles correlated feature groups by selecting them together rather than arbitrarily picking one representative.[4]
- **Tree-based feature importance.** Random forests, gradient-boosted trees, XGBoost, LightGBM, and CatBoost each compute a feature-importance score during training. Common variants include split count, mean decrease in impurity (MDI), and gain. Features with consistently low importance can be pruned and the model retrained.
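- **Permutation importance.** Records the drop in score when the values of a single feature are randomly shuffled, breaking its relationship with the target. Strictly a post hoc technique applied to an already fitted model rather than part of training, but it is commonly used in the same pruning role. Model-agnostic and more reliable than MDI, but more expensive because it requires repeated model evaluations.[11]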
- **SHAP-based pruning.** Computes Shapley additive explanation values for each prediction, then aggregates them per feature.[8] Features whose mean absolute SHAP value falls below a threshold are dropped. Tools such as `shap-select` formalise this with statistical significance tests.
- **Regularised neural networks.** L1 weight penalties, group Lasso applied to input weight matrices, dropout-based importance, and explicit gating layers (for example concrete dropout) all let neural networks learn sparse input usage as part of optimisation.
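As an illustration of the first item in the list above, the sketch below uses an L1-penalised linear model as a selector through `SelectFromModel`. The synthetic data and the penalty strength `alpha=1.0` are assumptions made for the example; in practice the strength would normally be tuned by cross-validation.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 30 features, of which 5 carry signal; true_coef holds the generating weights.
X, y, true_coef = make_regression(
    n_samples=300,
    n_features=30,
    n_informative=5,
    noise=5.0,
    coef=True,
    random_state=0,
)

# Standardise first so the L1 penalty treats every feature on the same scale.
selector = make_pipeline(StandardScaler(), SelectFromModel(Lasso(alpha=1.0)))
selector.fit(X, y)

# Features whose Lasso coefficient was not shrunk to (numerically) zero.
support = selector.named_steps["selectfrommodel"].get_support()
selected = np.flatnonzero(support)

print("selected columns:     ", selected)
print("truly informative cols:", np.flatnonzero(true_coef))
```

A larger `alpha` yields a sparser subset at the risk of dropping weak but real signals; a smaller one keeps more noise features.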

### Hybrid methods

Hybrid pipelines combine families to balance speed and accuracy. A typical recipe is to use a cheap filter to shrink the feature pool from tens of thousands to hundreds, then run a wrapper such as RFE on the survivors, and finally fit an embedded model such as Lasso on the chosen subset for the production model. The Boruta algorithm, described later, is sometimes called a hybrid because it wraps a random forest in a statistical test.[7]

## Common Algorithms and Their Implementations

The table below summarises algorithms practitioners reach for most often, along with the family they belong to and a representative software entry point.

| Algorithm | Family | Strengths | Limitations | Representative implementation |
| --- | --- | --- | --- | --- |
| Variance threshold | Filter | Removes constants, very fast | Ignores the target | `sklearn.feature_selection.VarianceThreshold` |
| Pearson correlation ranking | Filter | Simple, interpretable | Linear only | `pandas.DataFrame.corr` |
| Chi-square test | Filter | Sound for non-negative categorical features | Requires binning of numeric features | `sklearn.feature_selection.chi2` |
| ANOVA F-test | Filter | Cheap, principled for numeric vs categorical | Assumes roughly normal, equal-variance classes; detects only differences in class means | `sklearn.feature_selection.f_classif` |
| Mutual information | Filter | Captures nonlinear dependence | Higher sample complexity | `sklearn.feature_selection.mutual_info_classif` |
| Forward sequential selection | Wrapper | Detects interactions, returns small subsets | O(p^2) model fits | `sklearn.feature_selection.SequentialFeatureSelector` |
| Backward elimination | Wrapper | Strong when most features are useful | Cannot start when p is huge | `SequentialFeatureSelector(direction='backward')` |
| Recursive Feature Elimination | Wrapper | One fit per round, scales to thousands | Needs `coef_` or `feature_importances_` | `sklearn.feature_selection.RFE` |
| RFE with cross-validation | Wrapper | Picks the subset size automatically | More expensive than RFE | `sklearn.feature_selection.RFECV` |
| Exhaustive search | Wrapper | Globally optimal for the chosen metric | Combinatorial cost | `mlxtend.feature_selection.ExhaustiveFeatureSelector` |
| Lasso (L1) | Embedded | Sparse, interpretable, statistically grounded | Linear, can drop correlated peers | `sklearn.linear_model.Lasso` |
| Elastic Net | Embedded | Handles correlated groups | Two hyperparameters to tune | `sklearn.linear_model.ElasticNet` |
| Tree feature importance | Embedded | Cheap, captures interactions | MDI biased toward high-cardinality features | `xgboost.Booster.get_score`, `lightgbm.Booster.feature_importance` |
| Permutation importance | Embedded | Model-agnostic, low bias | Expensive | `sklearn.inspection.permutation_importance` |
| SHAP-guided pruning | Embedded | Theoretically grounded, model-agnostic | SHAP can be slow on non-tree models | `shap.TreeExplainer`, `shap-select` |
| Boruta | Hybrid | All-relevant, principled cut-off | Slow on wide data | `boruta_py` (`scikit-learn-contrib`) |
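
Two of the table's entries, impurity-based tree importance and permutation importance, can be computed side by side. The sketch below assumes a random forest on synthetic data in which, because `shuffle=False`, the first three columns are the informative ones; dataset sizes and the number of repeats are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# shuffle=False keeps the 3 informative features in columns 0-2.
X, y = make_classification(
    n_samples=500,
    n_features=10,
    n_informative=3,
    n_redundant=0,
    shuffle=False,
    random_state=0,
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, stratify=y, random_state=0
)

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)

# Mean decrease in impurity: a by-product of training, computed on training data.
mdi = forest.feature_importances_

# Permutation importance: score drop on held-out data when one column is shuffled.
perm = permutation_importance(forest, X_val, y_val, n_repeats=10, random_state=0)

for j in np.argsort(mdi)[::-1]:
    print(f"col {j}: MDI={mdi[j]:.3f}  permutation={perm.importances_mean[j]:.3f}")
```

Computing permutation importance on a held-out split, as here, measures what the model needs in order to generalise rather than what it used to fit the training set.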

### What is the Boruta algorithm?

Boruta, introduced by Miron Kursa and Witold Rudnicki of the University of Warsaw in 2010, takes the *all-relevant* view of selection: it tries to find every feature that carries any predictive signal, rather than the smallest sufficient subset.[7] This contrasts with minimal-optimal approaches such as Lasso, which seek the smallest predictive subset rather than every relevant variable.[7] Boruta works by duplicating each feature, randomly shuffling the values in the copy to destroy any relationship with the target, and concatenating the shuffled *shadow* features to the original matrix.[7] A random forest is then fit, and each real feature's importance is compared with the maximum importance achieved by any shadow feature. A real feature that beats the best shadow significantly more often than expected under a Bonferroni-corrected binomial test is confirmed as relevant; one that consistently loses is rejected.[7] The procedure repeats until every feature has been classified or a maximum number of iterations is reached. Boruta is popular on Kaggle and is well suited to medium-width tabular datasets where any subtle predictive signal matters.[13]
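The shadow-feature idea can be sketched in a few lines. This is a deliberately simplified illustration and not the Boruta package's implementation: it assumes impurity-based importances from scikit-learn's random forest, counts raw "hits" over a fixed number of rounds, and omits the binomial test and the iterative confirm/reject bookkeeping. The function name `shadow_hits` is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def shadow_hits(X, y, n_rounds=10, seed=0):
    """Count how often each real feature beats the best shadow feature."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    hits = np.zeros(n_features, dtype=int)
    for _ in range(n_rounds):
        # Shadow copy: every column shuffled independently, destroying any
        # relationship with the target while keeping each marginal distribution.
        shadows = rng.permuted(X, axis=0)
        forest = RandomForestClassifier(
            n_estimators=100, random_state=int(rng.integers(1_000_000))
        )
        forest.fit(np.hstack([X, shadows]), y)
        importances = forest.feature_importances_
        best_shadow = importances[n_features:].max()
        # A "hit" is a real feature whose importance exceeds the best shadow.
        hits += importances[:n_features] > best_shadow
    return hits


# shuffle=False keeps the 3 informative features in columns 0-2.
X, y = make_classification(
    n_samples=400,
    n_features=10,
    n_informative=3,
    n_redundant=0,
    shuffle=False,
    random_state=0,
)
hits = shadow_hits(X, y)
print("hits per feature over 10 rounds:", hits)
```

In the full algorithm the hit counts feed the statistical test described above, and confirmed or rejected features are removed from later rounds; libraries such as Boruta-Py handle those details.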

### How does SHAP-based pruning work?

SHAP (SHapley Additive exPlanations) values, introduced by Scott Lundberg and Su-In Lee in 2017, allocate each prediction's deviation from the model's mean prediction to the input features in a manner that satisfies the Shapley fairness axioms from cooperative game theory.[8] The authors motivated the work by noting that "understanding why a model makes a certain prediction can be as crucial as the prediction's accuracy in many applications."[8] The mean absolute SHAP value of a feature across a representative sample of predictions is a model-agnostic, theoretically principled importance score. Pruning workflows compute SHAP values, drop features whose mean absolute SHAP value falls below a threshold, retrain, and verify that performance is preserved. The newer `shap-select` library combines SHAP scoring with logistic-regression significance tests to choose a subset automatically.
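The pruning loop can be illustrated without the `shap` package in one special case. For a linear model whose features are treated as independent, the SHAP value of feature *j* for a sample *x* reduces to the coefficient times the feature's deviation from its mean. The sketch below relies on that assumption; the 1 percent cut-off is a hypothetical threshold chosen for the example, and for tree ensembles one would use `shap.TreeExplainer` instead.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# shuffle=False keeps the 3 informative features in columns 0-2.
X, y = make_regression(
    n_samples=300,
    n_features=8,
    n_informative=3,
    noise=1.0,
    shuffle=False,
    random_state=0,
)
model = LinearRegression().fit(X, y)

# Linear-model SHAP values under feature independence:
# phi[i, j] = coef[j] * (X[i, j] - mean of column j)
shap_values = model.coef_ * (X - X.mean(axis=0))

# Additivity: per-sample SHAP values sum to prediction minus mean prediction.
predictions = model.predict(X)
assert np.allclose(shap_values.sum(axis=1), predictions - predictions.mean())

# Global importance: mean absolute SHAP value per feature.
mean_abs_shap = np.abs(shap_values).mean(axis=0)

# Hypothetical rule: keep features contributing at least 1% of total importance.
threshold = 0.01 * mean_abs_shap.sum()
keep = np.flatnonzero(mean_abs_shap >= threshold)
print("mean |SHAP|:", mean_abs_shap.round(2))
print("kept columns:", keep)
```

A complete workflow would then retrain on `X[:, keep]` and compare held-out performance against the full-feature model before accepting the smaller subset.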

## How do you do feature selection in scikit-learn?

Most practical Python feature-selection work uses [scikit-learn](https://scikit-learn.org). The library exposes selectors as transformers with a uniform `fit`, `transform`, and `get_support` interface so they slot into `Pipeline` objects and `GridSearchCV` cleanly.[11]

| Class or function | Module | Family | Typical use |
| --- | --- | --- | --- |
| `VarianceThreshold(threshold)` | `sklearn.feature_selection` | Filter | Drops low-variance features such as one-hot columns of rare categories |
| `SelectKBest(score_func, k)` | `sklearn.feature_selection` | Filter | Keeps the top *k* features by a univariate score such as `chi2`, `f_classif`, `mutual_info_classif` |
| `SelectPercentile(score_func, percentile)` | `sklearn.feature_selection` | Filter | Same as `SelectKBest` but expressed as a percentage |
| `SelectFpr`, `SelectFdr`, `SelectFwe` | `sklearn.feature_selection` | Filter | Univariate selection at a target false-positive, false-discovery, or family-wise error rate |
| `GenericUnivariateSelect` | `sklearn.feature_selection` | Filter | Configurable wrapper over the above |
| `chi2`, `f_classif`, `f_regression`, `mutual_info_classif`, `mutual_info_regression`, `r_regression` | `sklearn.feature_selection` | Filter | Score functions consumed by the univariate selectors |
| `RFE(estimator, n_features_to_select, step)` | `sklearn.feature_selection` | Wrapper | Recursive feature elimination with a chosen subset size |
| `RFECV(estimator, cv, scoring)` | `sklearn.feature_selection` | Wrapper | RFE that tunes the subset size by cross-validation |
| `SequentialFeatureSelector(estimator, n_features_to_select, direction, scoring, cv)` | `sklearn.feature_selection` | Wrapper | Greedy forward or backward selection that does not require `coef_` or `feature_importances_` |
| `SelectFromModel(estimator, threshold)` | `sklearn.feature_selection` | Embedded | Keeps features whose absolute coefficient or importance is above a threshold; works with Lasso, Elastic Net, [SVMs](/wiki/support_vector_machine_svm) with an L1 penalty, and tree ensembles |
| `permutation_importance(estimator, X, y)` | `sklearn.inspection` | Embedded | Computes permutation importance for any fitted estimator |

A worked example that combines these classes might filter low-variance columns, rank the survivors with mutual information, and then run RFECV with a logistic-regression base learner inside a `Pipeline`. Wrapping the whole pipeline in `GridSearchCV` lets the cross-validation procedure choose both the filter cut-off and the final subset size jointly, and it avoids the data leakage that arises when selection is performed on the full dataset before it is split.[11]
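A compact version of such a pipeline is sketched below. To keep the grid explicit and the run time short, it swaps `RFECV` for plain `RFE` and lets `GridSearchCV` choose the subset size; the synthetic data, step names, and grid values are all assumptions made for illustration.

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.feature_selection import (
    RFE,
    SelectKBest,
    VarianceThreshold,
    mutual_info_classif,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(
    n_samples=400, n_features=40, n_informative=5, n_redundant=5, random_state=0
)
# Hold out a test set before any selection takes place.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

pipeline = Pipeline(
    [
        ("variance", VarianceThreshold()),  # drop constant columns
        ("scale", StandardScaler()),
        ("filter", SelectKBest(partial(mutual_info_classif, random_state=0))),
        ("rfe", RFE(LogisticRegression(max_iter=1000))),
        ("clf", LogisticRegression(max_iter=1000)),
    ]
)

# The filter cut-off and the final subset size are tuned jointly; every
# candidate is refit inside each training fold, so no fold sees its own
# validation data during selection.
param_grid = {
    "filter__k": [10, 20],
    "rfe__n_features_to_select": [3, 5, 8],
}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print("best parameters:", search.best_params_)
print("held-out accuracy:", round(test_accuracy, 3))
```

Because the selectors are pipeline steps, calling `search.predict` on new data applies the same column selection automatically.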

## Beyond scikit-learn

Several specialised libraries extend the core ecosystem:

- **Featuretools** automates the *generation* of candidate features from relational data via deep feature synthesis. It is most useful upstream of selection, producing the wide candidate matrix that Boruta or RFE then prune. The Featuretools tutorials explicitly demonstrate pairing it with Boruta and the Optuna hyperparameter tuner for an end-to-end automated pipeline.[12]
- **Boruta-Py** and **BorutaSHAP** implement the Boruta algorithm in Python, with the SHAP variant replacing random-forest MDI with the more reliable SHAP score for the importance comparison.[13]
- **mlxtend** provides exhaustive search and floating sequential selectors that complement the scikit-learn defaults.
- **scikit-feature**, maintained by Arizona State University, ships a wide library of classical filters such as ReliefF, Fisher score, and Laplacian score, useful for bioinformatics workloads.
- **shap-select** combines SHAP values with statistical significance tests for automatic selection.
- **mRMR** (minimum Redundancy Maximum Relevance) implementations choose features that are individually informative and jointly non-redundant; the algorithm is widely used in genomics.

## What is a good feature-selection workflow?

A defensible feature-selection workflow follows a few simple rules.

1. **Split first, then select.** Never compute selection statistics on the test set. Even on the training set, fit the selector inside cross-validation so that hyperparameters such as *k* and the regularisation strength are chosen without peeking.[11]
2. **Start cheap.** Drop constants, duplicates, and obvious leakage columns before running anything statistical.
3. **Use filters as a first pass on wide data.** Mutual information or the ANOVA F-test, configured to keep the top few hundred features, can shrink a 100,000-feature matrix into something a wrapper or embedded method can handle.
4. **Match the selector to the model.** If the production model is XGBoost, prefer SHAP-guided pruning or RFE with an XGBoost base learner. If it is logistic regression, Lasso or Elastic Net does double duty as selector and final fit.
5. **Verify with held-out data.** Compare the selected-feature model against the full-feature model on a held-out set. Prefer the smaller model when accuracy is unchanged or better.
6. **Re-evaluate when data shifts.** Features useful at training time may decay after distribution shift. Schedule periodic re-selection alongside retraining.
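
The first rule can be made concrete with a small experiment. The sketch below assumes pure-noise features and labels that are unrelated to them, so no honest procedure should beat chance; the sample size, feature count, and `k=20` are arbitrary choices. Selecting features on the full data before cross-validating nevertheless produces a flattering score, while putting the selector inside the pipeline does not.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5000))  # noise only: no feature is truly predictive
y = np.arange(100) % 2            # balanced labels, unrelated to X

# Leaky: the selector sees every label, including those of future test folds.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_acc = cross_val_score(
    LogisticRegression(max_iter=1000), X_leaky, y, cv=5
).mean()

# Honest: the selector is refit inside each training fold.
honest_model = make_pipeline(
    SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000)
)
honest_acc = cross_val_score(honest_model, X, y, cv=5).mean()

print(f"selection before CV: {leaky_acc:.2f}")
print(f"selection inside CV: {honest_acc:.2f}")
```

The gap between the two numbers is entirely an artefact of the evaluation protocol, since the data contain no signal by construction.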

## Theoretical Foundations

Feature selection sits at the intersection of statistics, information theory, and combinatorial optimisation. Several theoretical results frame what selection can and cannot achieve.

- **Strong and weak relevance.** Kohavi and John formalised relevance by distinguishing *strongly relevant* features (whose removal degrades the accuracy of the optimal classifier), *weakly relevant* features (not strongly relevant, but able to improve accuracy given some subset of the other features, for example once a redundant peer has been removed), and *irrelevant* features.[2] Univariate filters score each feature in isolation, so they cannot tell whether a feature's signal is redundant given the others or emerges only in combination with them.
- **No free lunch.** No selection algorithm dominates on every dataset; the best choice depends on the data distribution, the downstream model, and the loss function.[1]
- **Markov blankets.** Under standard assumptions such as faithfulness, the optimal feature subset for predicting a target is the target's Markov blanket in the underlying causal graph, comprising its parents, its children, and the other parents of its children. Algorithms such as IAMB and HITON attempt to recover this set directly.
- **Stability.** A selection procedure is *stable* if small perturbations of the training set yield similar subsets. Stability selection by Meinshausen and Bühlmann formalises this with subsampling and frequency thresholds; it produces conservative, reproducible subsets.[9]
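
The subsampling idea behind stability selection can be sketched as follows. This is a simplified illustration rather than the published procedure: it assumes a single fixed Lasso penalty instead of a range of regularisation values, half-sized subsamples drawn without replacement, and a hypothetical frequency threshold of 0.8. The function name `selection_frequencies` is also an assumption.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler


def selection_frequencies(X, y, alpha=1.0, n_subsamples=50, seed=0):
    """Fraction of random half-samples in which Lasso keeps each feature."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    counts = np.zeros(n_features)
    for _ in range(n_subsamples):
        # Draw half of the rows without replacement.
        idx = rng.choice(n_samples, size=n_samples // 2, replace=False)
        X_sub = StandardScaler().fit_transform(X[idx])
        coef = Lasso(alpha=alpha).fit(X_sub, y[idx]).coef_
        counts += coef != 0
    return counts / n_subsamples


# shuffle=False keeps the 4 informative features in columns 0-3.
X, y = make_regression(
    n_samples=200,
    n_features=30,
    n_informative=4,
    noise=10.0,
    shuffle=False,
    random_state=0,
)
freqs = selection_frequencies(X, y)

# Hypothetical cut-off: keep features selected in at least 80% of subsamples.
stable = np.flatnonzero(freqs >= 0.8)
print("selection frequencies:", freqs.round(2))
print("stable features:", stable)
```

Features that enter the model only on some subsamples receive intermediate frequencies and fall below the threshold, which is what makes the resulting subset conservative.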

## Is feature selection still needed in the deep learning era?

Deep neural networks blur the line between feature engineering, selection, and extraction. Hidden layers learn distributed representations that subsume much of the work that previously belonged to hand-crafted features, and convolutional, recurrent, and attention-based architectures discover useful local statistics on their own. For unstructured modalities such as image classification, machine translation, and speech recognition, end-to-end representation learning is now standard and explicit feature selection is rarely useful.

For structured data the picture is more nuanced. Tabular benchmarks such as the work by Shwartz-Ziv and Armon (2022, *Information Fusion*, volume 81, pages 84-90), and follow-up surveys, repeatedly find that gradient-boosted trees match or exceed deep tabular models such as TabNet and FT-Transformer on the majority of public datasets, especially when training samples are limited; the authors report that "XGBoost outperforms these deep models" across the datasets they evaluate while requiring much less tuning.[10] In those settings, careful feature selection is still one of the highest-return investments a practitioner can make.

Within deep learning itself, several techniques recover some of the benefits of selection without exiting the differentiable framework. L1 penalties on input-layer weights push individual weights to zero, while group Lasso, which treats all outgoing weights of one input as a single group, can zero out an entire input column and thereby remove the feature. Concrete dropout and the L0 regularisation of Louizos and colleagues add learnable gates that produce truly sparse input usage. Attention weights over input tokens can be inspected post hoc as a soft selection mechanism, and integrated gradients or DeepLIFT scores serve a similar role to SHAP for tree ensembles.

## Challenges and Pitfalls

- **Selection bias.** Performing selection on all the data and only then splitting into train and test inflates measured accuracy. Always split first.[1]
- **Multiple testing.** Filters that score many features will occasionally find spuriously significant relationships by chance. Use false-discovery-rate corrections such as Benjamini-Hochberg when it matters.
- **Correlated features.** Tree-based importance scores tend to spread credit among correlated columns, while Lasso tends to pick only one; both behaviours can mislead. Permutation importance with grouped permutations or stability selection helps.[9]
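- **Categorical encodings.** One-hot encoding inflates the column count and splits a categorical variable's importance across many sparse columns, while MDI-based importance is biased in favour of high-cardinality and continuous features. Either effect can distort a ranking.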
- **Concept drift.** A feature informative at training time may stop carrying signal after distribution shift. Monitoring feature-importance stability over time is part of MLOps hygiene.
- **Interpretability traps.** A small selected subset is *not* a causal explanation. Selection answers "what predicts well", not "what causes".
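
The multiple-testing pitfall can be demonstrated with scikit-learn's `SelectFpr` and `SelectFdr`, the latter of which applies the Benjamini-Hochberg procedure. The sketch assumes 2,000 pure-noise features and labels unrelated to them, so every selected feature is a false discovery; the sizes and the 0.05 level are arbitrary choices.

```python
import numpy as np
from sklearn.feature_selection import SelectFdr, SelectFpr, f_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2000))  # noise only: every "discovery" is spurious
y = np.arange(200) % 2            # balanced labels, unrelated to X

# Uncorrected: keep every feature whose F-test p-value is below 0.05.
n_fpr = int(SelectFpr(f_classif, alpha=0.05).fit(X, y).get_support().sum())

# Benjamini-Hochberg: control the expected share of false discoveries at 0.05.
n_fdr = int(SelectFdr(f_classif, alpha=0.05).fit(X, y).get_support().sum())

print("kept without correction:      ", n_fpr)
print("kept with Benjamini-Hochberg: ", n_fdr)
```

Without correction, roughly 5 percent of the noise columns pass the test by chance alone, whereas the corrected selector keeps far fewer.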

## Summary

Feature selection chooses a small, useful subset of input variables to feed to a model, balancing predictive accuracy against complexity, latency, and interpretability. The field's three classical families, filters, wrappers, and embedded methods, each trade computational cost for selection quality differently, and modern hybrid pipelines combine them to scale gracefully from a handful of features to millions.[14] While deep learning has reduced the need for manual feature engineering on unstructured data, feature selection remains central to tabular machine learning, where boosted trees still dominate and every dropped feature pays dividends in cost, latency, and clarity.[10] The discipline rests on a long lineage of statistical and information-theoretic results, summarised most influentially by Guyon and Elisseeff in 2003, and continues to evolve through tools such as Boruta, SHAP-guided pruning, and `shap-select`.[1]

## See Also

- [Machine Learning](/wiki/machine_learning)
- [Feature Engineering](/wiki/feature_engineering)
- [Feature Extraction](/wiki/feature_extraction)
- [Dimensionality Reduction](/wiki/dimensionality_reduction)
- [Principal Component Analysis](/wiki/principal_component_analysis)
- [Lasso Regression](/wiki/lasso_regression)
- [Regularization](/wiki/regularization)
- [XGBoost](/wiki/xgboost)
- [LightGBM](/wiki/lightgbm)
- [Support Vector Machine (SVM)](/wiki/support_vector_machine_svm)

## References

1. Guyon, I., and Elisseeff, A. (2003). "An Introduction to Variable and Feature Selection." *Journal of Machine Learning Research*, 3, 1157-1182. [https://www.jmlr.org/papers/v3/guyon03a.html](https://www.jmlr.org/papers/v3/guyon03a.html)
2. Kohavi, R., and John, G. H. (1997). "Wrappers for Feature Subset Selection." *Artificial Intelligence*, 97(1-2), 273-324. [https://ai.stanford.edu/~ronnyk/wrappersPrint.pdf](https://ai.stanford.edu/~ronnyk/wrappersPrint.pdf)
3. Tibshirani, R. (1996). "Regression Shrinkage and Selection via the Lasso." *Journal of the Royal Statistical Society. Series B*, 58(1), 267-288. [https://rss.onlinelibrary.wiley.com/doi/10.1111/j.2517-6161.1996.tb02080.x](https://rss.onlinelibrary.wiley.com/doi/10.1111/j.2517-6161.1996.tb02080.x)
4. Zou, H., and Hastie, T. (2005). "Regularization and Variable Selection via the Elastic Net." *Journal of the Royal Statistical Society. Series B*, 67(2), 301-320.
5. Bellman, R. E. (1957). *Dynamic Programming*. Princeton University Press. (Origin of the term *curse of dimensionality*.)
6. Hughes, G. F. (1968). "On the Mean Accuracy of Statistical Pattern Recognizers." *IEEE Transactions on Information Theory*, 14(1), 55-63.
7. Kursa, M. B., and Rudnicki, W. R. (2010). "Feature Selection with the Boruta Package." *Journal of Statistical Software*, 36(11), 1-13. [https://www.jstatsoft.org/v36/i11/](https://www.jstatsoft.org/v36/i11/)
8. Lundberg, S. M., and Lee, S.-I. (2017). "A Unified Approach to Interpreting Model Predictions." *Advances in Neural Information Processing Systems* 30. [https://arxiv.org/abs/1705.07874](https://arxiv.org/abs/1705.07874)
9. Meinshausen, N., and Bühlmann, P. (2010). "Stability Selection." *Journal of the Royal Statistical Society. Series B*, 72(4), 417-473.
10. Shwartz-Ziv, R., and Armon, A. (2022). "Tabular Data: Deep Learning is Not All You Need." *Information Fusion*, 81, 84-90. [https://arxiv.org/abs/2106.03253](https://arxiv.org/abs/2106.03253)
11. scikit-learn developers. "1.13. Feature Selection." *scikit-learn documentation*. [https://scikit-learn.org/stable/modules/feature_selection.html](https://scikit-learn.org/stable/modules/feature_selection.html)
12. Featuretools documentation. [https://docs.featuretools.com](https://docs.featuretools.com)
13. Boruta-Py repository, scikit-learn-contrib. [https://github.com/scikit-learn-contrib/boruta_py](https://github.com/scikit-learn-contrib/boruta_py)
14. Raschka, S. "What is the difference between filter, wrapper, and embedded methods for feature selection?" [https://sebastianraschka.com/faq/docs/feature_sele_categories.html](https://sebastianraschka.com/faq/docs/feature_sele_categories.html)

