Feature selection is the process of selecting a subset of relevant input variables (features) from a larger pool of candidates for use in machine learning model construction. The goal is to identify the smallest set of features that retains, or even improves, predictive performance while reducing computational cost, mitigating overfitting, and producing models that humans can more easily interpret. Feature selection sits at the core of the broader practice of feature engineering and is closely related to dimensionality reduction, although the two are not the same. Feature selection keeps a subset of the original variables intact, whereas dimensionality reduction techniques such as principal component analysis project the data into a new coordinate system whose axes are linear or nonlinear combinations of the originals.
The topic was popularised in machine learning by Isabelle Guyon and André Elisseeff in their 2003 Journal of Machine Learning Research survey, "An Introduction to Variable and Feature Selection," which framed selection as a way to improve prediction accuracy, lower inference cost, and gain insight into the data-generating process. The Guyon and Elisseeff paper still serves as the canonical reference and has been cited more than fifteen thousand times. Earlier foundational work by Ron Kohavi and George John in 1997, "Wrappers for Feature Subset Selection," formalised the wrapper paradigm in which selection is treated as a search problem guided by an actual learner's cross-validated score.
In the modern era of large neural networks, explicit feature selection is sometimes assumed to be obsolete because deep models perform end-to-end representation learning. That perception is only partially correct. For unstructured data such as images, text, audio, and video, learned representations have indeed displaced hand-crafted feature pipelines. For structured tabular data, however, where gradient-boosted decision trees such as XGBoost, LightGBM, and CatBoost still dominate Kaggle leaderboards and production pipelines, careful feature selection remains a high-leverage activity. Recent benchmarks across diverse tabular datasets continue to show that boosted trees with well-engineered, well-selected features match or beat large transformer-based tabular models on the majority of problems.
When the number of input variables grows, models become harder to estimate, slower to train and serve, and more prone to memorising noise. The classical motivations for selecting a small subset of features are summarised below.
Richard Bellman coined the phrase curse of dimensionality in 1957 to describe the geometric and statistical pathologies that arise when the number of dimensions grows. As the dimensionality d of the input space increases, the volume of the space grows exponentially, and any fixed sample of points becomes increasingly sparse. Distances between points become more uniform, which weakens distance-based methods such as k-nearest neighbours and clustering. The number of training examples required to densely sample a d-dimensional space grows exponentially in d, a fact that limits any nonparametric learner.
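The concentration of distances is easy to see in a few lines of simulation. The sketch below (sample size and dimensions are illustrative) draws uniform random points and compares the farthest and nearest neighbour distances from one reference point; the ratio shrinks toward 1 as the dimension grows.

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    points = rng.random((500, d))          # 500 uniform points in [0, 1]^d
    dists = np.linalg.norm(points[1:] - points[0], axis=1)
    print(d, dists.max() / dists.min())    # ratio approaches 1 as d grows
```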
A closely related result is the Hughes phenomenon, which states that for a fixed training-set size the expected accuracy of a classifier first rises with the number of features and then, beyond some peak dimensionality, falls again. The peak occurs because each additional feature carries useful signal up to a point but then begins to inject more noise and estimation variance than information. Feature selection is one of the simplest tools for keeping a model on the favourable side of that peak.
The bias-variance decomposition makes the cost of irrelevant features concrete. With too few features the model is biased and underfits; with too many it has high variance and overfits the training set. Each parameter in a high-dimensional model must be estimated from finite data, so the estimate's variance grows with the number of parameters. By eliminating features that contribute no signal, feature selection reduces the effective parameter count, lowers variance, and often improves out-of-sample accuracy. This rationale aligns with Occam's razor: among models that explain the data equally well, the simpler one tends to generalise better.
Fewer features mean smaller models, faster training, lower inference latency, and reduced storage. In production systems where features must be computed at request time from upstream services, every dropped feature also reduces the number of network calls and feature-store lookups. Sparse models are also easier to debug and to explain to non-technical stakeholders, which is increasingly important in regulated domains such as finance, healthcare, and credit scoring where model decisions must be auditable.
Feature selection and feature extraction are often confused. Feature selection chooses a subset of the existing variables and leaves them otherwise untouched, so the chosen variables retain their original semantics and units. Feature extraction transforms the input space into a new one. PCA, autoencoders, kernel projections, and word embeddings are extraction methods because each output dimension is a function of many inputs. Selection preserves interpretability at the cost of being limited to whatever signal already exists in the raw inputs, while extraction can synthesise new informative coordinates at the cost of opaqueness. In practice the two are often combined: extract a wide set of candidate features, then select a sparse, predictive subset.
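A short sketch makes the contrast concrete; the toy dataset and the choice of five output dimensions are illustrative.

```python
# Selection keeps named original columns; extraction builds new blended axes.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Selection: the five surviving columns keep their names and units.
selector = SelectKBest(mutual_info_classif, k=5).fit(X, y)
print(X.columns[selector.get_support()].tolist())

# Extraction: each of the five components mixes all 30 original inputs.
pca = PCA(n_components=5).fit(X)
print(pca.components_.shape)  # (5, 30): every component is a blend
```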
The community generally groups feature selection algorithms into three families: filter methods, wrapper methods, and embedded methods. A fourth, hybrid methods, blends elements of the others. The taxonomy was crystallised by the Kohavi and John paper and refined by Guyon and Elisseeff.
| Property | Filter | Wrapper | Embedded |
|---|---|---|---|
| Selection signal | Statistical score independent of any learner | Cross-validated score of an actual learner | Score produced as a side effect of model training |
| Computational cost | Lowest, scales to millions of features | Highest, requires fitting many models | Moderate, typically a single model fit or one regularisation path |
| Captures feature interactions | Mostly no, univariate by default | Yes, evaluates subsets jointly | Partially, depends on model class |
| Risk of overfitting the selection step | Low | High when the number of candidate subsets is large | Low to moderate, controlled by regularisation |
| Coupling to downstream model | Loose, the same selection works for many models | Tight, results are tuned to one learner | Tight, baked into the learner |
| Typical examples | Pearson correlation, chi-square, ANOVA F-test, mutual information, variance threshold | Forward selection, backward elimination, recursive feature elimination, exhaustive search | L1 (Lasso) regression, Elastic Net, tree-based feature importance, SHAP-guided pruning |
| Best when | Quick first-pass screening on very wide data | Small to medium feature counts and a fixed downstream model | The same model is used for both selection and final prediction |
Filter methods rank features using a statistic computed directly from the data, with no reference to a downstream learner. Because they ignore feature interactions, they are fast and embarrassingly parallel, which makes them the workhorse for the first pass over very wide datasets such as gene-expression matrices, sparse text-document term matrices, or click-stream logs with thousands of one-hot encoded categorical levels.
Common filters include:
- Variance threshold, which drops constant and near-constant columns without consulting the target.
- Pearson correlation between each feature and a numeric target, simple and interpretable but limited to linear dependence.
- The chi-square test, sound for non-negative categorical features but requiring binning of numeric ones.
- Mutual information, which captures arbitrary nonlinear dependence at the cost of higher sample complexity.
- The ANOVA F-test, exposed in scikit-learn as f_classif for classification and f_regression for regression.

A pragmatic filter pipeline first drops zero-variance and duplicate columns, then ranks the survivors by a target-aware score such as mutual information, and finally keeps a top fraction such as the top 10 percent for downstream evaluation, as sketched below.
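One way to express that pipeline in scikit-learn is sketched below; the synthetic data and the pandas duplicate-column step are illustrative assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import (SelectPercentile, VarianceThreshold,
                                       mutual_info_classif)
from sklearn.pipeline import Pipeline

# Synthetic stand-in for a wide matrix: 200 columns, 10 of them informative.
X_arr, y = make_classification(n_samples=500, n_features=200,
                               n_informative=10, random_state=0)
X = pd.DataFrame(X_arr)

# Drop exact duplicate columns first (pandas, outside the pipeline).
X = X.loc[:, ~X.T.duplicated()]

filter_pipeline = Pipeline([
    # Remove zero-variance (constant) columns.
    ("constant", VarianceThreshold(threshold=0.0)),
    # Rank survivors by mutual information and keep the top 10 percent.
    ("mi_top10", SelectPercentile(mutual_info_classif, percentile=10)),
])
X_selected = filter_pipeline.fit_transform(X, y)
print(X_selected.shape)  # roughly 10 percent of the surviving columns remain
```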
Wrapper methods treat feature selection as a search problem in which each candidate subset is scored by training a learner and measuring its cross-validated performance. They are conceptually appealing because the score they optimise is the metric the user actually cares about, and they detect interactions because subsets are evaluated jointly. The trade-off is computational: with p features there are 2^p possible subsets, so exhaustive search is feasible only for very small p. Practical wrappers therefore use heuristic search.
Widely used wrapper algorithms include:

- Forward sequential selection, which starts from the empty set and greedily adds the feature that most improves the cross-validated score.
- Backward elimination, which starts from the full set and greedily removes the feature whose loss hurts performance least.
- Recursive feature elimination (RFE), which fits a model, discards the lowest-ranked features, and repeats until the target subset size is reached.
- Exhaustive search over all subsets, globally optimal for the chosen metric but feasible only for very small feature counts.
Wrappers can overfit the selection process itself when the number of candidate subsets evaluated approaches or exceeds the number of training examples. Nested cross-validation, in which feature selection runs inside an inner loop and an outer loop estimates generalisation, is the standard guard against optimistic bias.
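The standard arrangement in scikit-learn is sketched below: placing the selector inside a Pipeline and passing the whole pipeline to cross_val_score produces the nested structure, because the selector is refit inside every outer fold. The dataset and the subset size of five are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)

inner = Pipeline([
    # Forward selection scored by 5-fold CV inside each outer training fold.
    ("select", SequentialFeatureSelector(
        LogisticRegression(max_iter=1000), n_features_to_select=5, cv=5)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Outer 5-fold loop: each fold refits the whole selection-plus-model pipeline,
# so the outer score is an unbiased estimate of generalisation.
scores = cross_val_score(inner, X, y, cv=5)
print(scores.mean())
```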
Embedded methods perform selection as part of the model-fitting procedure itself. They are usually cheaper than wrappers because the selection signal is produced for free during a single model fit, and they capture feature interactions through whatever mechanism the underlying learner uses.
The most widely used embedded approaches are:

- Lasso (L1-penalised) regression, whose penalty drives the coefficients of uninformative features exactly to zero.
- Elastic Net, which mixes L1 and L2 penalties and handles groups of correlated features more gracefully than the Lasso alone.
- Tree-ensemble feature importances, produced as a by-product of fitting random forests or gradient-boosted trees.
- SHAP-guided pruning, which drops features with negligible mean absolute SHAP values; libraries such as shap-select formalise this with statistical significance tests.

Hybrid pipelines combine families to balance speed and accuracy. A typical recipe, sketched below, is to use a cheap filter to shrink the feature pool from tens of thousands to hundreds, then run a wrapper such as RFE on the survivors, and finally fit an embedded model such as Lasso on the chosen subset for the production model. The Boruta algorithm, described later, is sometimes called a hybrid because it wraps a random forest in a statistical test.
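A minimal sketch of that recipe follows; the synthetic data and the stage sizes (2000 candidates, 200 after the filter, 20 after the wrapper) are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE, SelectKBest, f_regression
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.pipeline import Pipeline

X, y = make_regression(n_samples=400, n_features=2000, n_informative=20,
                       random_state=0)

hybrid = Pipeline([
    # Filter: shrink thousands of candidates to a few hundred, cheaply.
    ("filter", SelectKBest(f_regression, k=200)),
    # Wrapper: recursive elimination down to 20 features on the survivors.
    ("wrapper", RFE(LinearRegression(), n_features_to_select=20)),
    # Embedded: Lasso fits the production model on the chosen subset.
    ("model", Lasso(alpha=0.1)),
])
hybrid.fit(X, y)
```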
The table below summarises algorithms practitioners reach for most often, along with the family they belong to and a representative software entry point.
| Algorithm | Family | Strengths | Limitations | Representative implementation |
|---|---|---|---|---|
| Variance threshold | Filter | Removes constants, very fast | Ignores the target | sklearn.feature_selection.VarianceThreshold |
| Pearson correlation ranking | Filter | Simple, interpretable | Linear only | pandas.DataFrame.corr |
| Chi-square test | Filter | Sound for non-negative categorical features | Requires binning of numeric features | sklearn.feature_selection.chi2 |
| ANOVA F-test | Filter | Cheap, principled for numeric vs categorical | Assumes equal variance, linear separability | sklearn.feature_selection.f_classif |
| Mutual information | Filter | Captures nonlinear dependence | Higher sample complexity | sklearn.feature_selection.mutual_info_classif |
| Forward sequential selection | Wrapper | Detects interactions, returns small subsets | O(p^2) model fits | sklearn.feature_selection.SequentialFeatureSelector |
| Backward elimination | Wrapper | Strong when most features are useful | Cannot start when p is huge | SequentialFeatureSelector(direction='backward') |
| Recursive Feature Elimination | Wrapper | One fit per round, scales to thousands | Needs coef_ or feature_importances_ | sklearn.feature_selection.RFE |
| RFE with cross-validation | Wrapper | Picks the subset size automatically | More expensive than RFE | sklearn.feature_selection.RFECV |
| Exhaustive search | Wrapper | Globally optimal for the chosen metric | Combinatorial cost | mlxtend.feature_selection.ExhaustiveFeatureSelector |
| Lasso (L1) | Embedded | Sparse, interpretable, statistically grounded | Linear, can drop correlated peers | sklearn.linear_model.Lasso |
| Elastic Net | Embedded | Handles correlated groups | Two hyperparameters to tune | sklearn.linear_model.ElasticNet |
| Tree feature importance | Embedded | Cheap, captures interactions | MDI biased toward high-cardinality features | xgboost.Booster.get_score, lightgbm.Booster.feature_importance |
| Permutation importance | Embedded | Model-agnostic, low bias | Expensive | sklearn.inspection.permutation_importance |
| SHAP-guided pruning | Embedded | Theoretically grounded, model-agnostic | SHAP can be slow on non-tree models | shap.TreeExplainer, shap-select |
| Boruta | Hybrid | All-relevant, principled cut-off | Slow on wide data | boruta_py (scikit-learn-contrib) |
Boruta, introduced by Miron Kursa and Witold Rudnicki in 2010, takes the all-relevant view of selection: it tries to find every feature that carries any predictive signal, rather than the smallest sufficient subset. It works by duplicating each feature, randomly shuffling the values in the copy to destroy any relationship with the target, and concatenating the shuffled shadow features to the original matrix. A random forest is then fit, and each real feature's importance is compared with the maximum importance achieved by any shadow feature. A real feature that beats the best shadow significantly more often than expected by chance is confirmed as relevant; one that consistently loses is rejected. The procedure repeats until every feature has been classified or a maximum number of iterations is reached. Boruta is popular on Kaggle and is well suited to medium-width tabular datasets where any subtle predictive signal matters.
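A minimal usage sketch, assuming the boruta_py package (installable as Boruta on PyPI); the forest settings are illustrative.

```python
import numpy as np
from boruta import BorutaPy
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=30, n_informative=8,
                           random_state=0)

forest = RandomForestClassifier(n_jobs=-1, max_depth=5, random_state=0)
boruta = BorutaPy(forest, n_estimators='auto', max_iter=100, random_state=0)
boruta.fit(X, y)  # expects numpy arrays rather than DataFrames

print(np.flatnonzero(boruta.support_))       # features confirmed as relevant
print(np.flatnonzero(boruta.support_weak_))  # tentative, unresolved features
```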
SHAP (SHapley Additive exPlanations) values, introduced by Scott Lundberg and Su-In Lee in 2017, allocate each prediction's deviation from the model's mean prediction to the input features in a manner that satisfies the Shapley fairness axioms from cooperative game theory. The mean absolute SHAP value of a feature across a representative sample of predictions is a model-agnostic, theoretically principled importance score. Pruning workflows compute SHAP values, drop features whose mean absolute SHAP value falls below a threshold, retrain, and verify that performance is preserved. The newer shap-select library combines SHAP scoring with logistic-regression significance tests to choose a subset automatically.
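A minimal sketch of such a pruning pass on a tree ensemble follows; the model choice and the cut-off of 1 percent of total importance are illustrative, not part of any standard recipe.

```python
import numpy as np
import shap
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=40, n_informative=10,
                       random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Mean absolute SHAP value per feature over a representative sample.
shap_values = shap.TreeExplainer(model).shap_values(X[:200])
importance = np.abs(shap_values).mean(axis=0)

# Keep features above the threshold, retrain, and verify performance holds.
keep = importance > 0.01 * importance.sum()
pruned = RandomForestRegressor(n_estimators=200, random_state=0)
pruned.fit(X[:, keep], y)
```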
Most practical Python feature-selection work uses scikit-learn. The library exposes selectors as transformers with a uniform fit, transform, and get_support interface so they slot into Pipeline objects and GridSearchCV cleanly.
| Class or function | Module | Family | Typical use |
|---|---|---|---|
| VarianceThreshold(threshold) | sklearn.feature_selection | Filter | Drops low-variance features such as one-hot columns of rare categories |
| SelectKBest(score_func, k) | sklearn.feature_selection | Filter | Keeps the top k features by a univariate score such as chi2, f_classif, mutual_info_classif |
| SelectPercentile(score_func, percentile) | sklearn.feature_selection | Filter | Same as SelectKBest but expressed as a percentage |
| SelectFpr, SelectFdr, SelectFwe | sklearn.feature_selection | Filter | Univariate selection at a target false-positive, false-discovery, or family-wise error rate |
| GenericUnivariateSelect | sklearn.feature_selection | Filter | Configurable wrapper over the above |
| chi2, f_classif, f_regression, mutual_info_classif, mutual_info_regression, r_regression | sklearn.feature_selection | Filter | Score functions consumed by the univariate selectors |
| RFE(estimator, n_features_to_select, step) | sklearn.feature_selection | Wrapper | Recursive feature elimination with a chosen subset size |
| RFECV(estimator, cv, scoring) | sklearn.feature_selection | Wrapper | RFE that tunes the subset size by cross-validation |
| SequentialFeatureSelector(estimator, n_features_to_select, direction, scoring, cv) | sklearn.feature_selection | Wrapper | Greedy forward or backward selection that does not require coef_ or feature_importances_ |
| SelectFromModel(estimator, threshold) | sklearn.feature_selection | Embedded | Keeps features whose absolute coefficient or importance is above a threshold; works with Lasso, Elastic Net, linear SVMs with an L1 penalty, and tree ensembles |
| permutation_importance(estimator, X, y) | sklearn.inspection | Embedded | Computes permutation importance for any fitted estimator |
A worked example that combines these classes might filter low-variance columns, rank the survivors with mutual information, and then run RFECV with a logistic-regression base learner inside a Pipeline. Wrapping the whole pipeline in GridSearchCV lets the cross-validation procedure choose both the filter cut-off and the final subset size jointly, avoiding the data leakage that arises when selection is performed on the full training set before splitting.
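One way to write that example is sketched below; the synthetic data, step names, and grid values are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import (RFECV, SelectPercentile,
                                       VarianceThreshold, mutual_info_classif)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=400, n_features=100, n_informative=10,
                           random_state=0)

pipe = Pipeline([
    ("variance", VarianceThreshold()),
    ("mi", SelectPercentile(mutual_info_classif)),
    ("rfecv", RFECV(LogisticRegression(max_iter=1000), cv=3)),
])

# The grid tunes the filter cut-off; RFECV picks the final subset size itself.
# Because selection lives inside the pipeline, it is refit within every CV
# fold, which avoids the leakage described above.
search = GridSearchCV(pipe, {"mi__percentile": [10, 25, 50]}, cv=5)
search.fit(X, y)
print(search.best_params_)
```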
Several specialised libraries extend the core ecosystem:

- mlxtend, which provides ExhaustiveFeatureSelector along with richer sequential selectors than scikit-learn's built-in class.
- boruta_py, the scikit-learn-contrib implementation of the Boruta algorithm.
- shap and shap-select, for SHAP-based importance scoring and automated SHAP-guided subset selection.
A defensible feature-selection workflow follows a few simple rules:

- Select inside the cross-validation folds, never on the full dataset before splitting, or the evaluation will leak information.
- Start cheap: drop zero-variance and duplicate columns and apply univariate filters before launching any expensive wrapper search.
- Guard wrappers with nested cross-validation so the selection step cannot overfit.
- After pruning, retrain and verify on held-out data that performance is preserved.
Feature selection sits at the intersection of statistics, information theory, and combinatorial optimisation. Several theoretical results frame what selection can and cannot achieve: finding the optimal feature subset is NP-hard in general, which is why practical methods rely on greedy search, convex relaxations such as the L1 penalty, or randomised statistical tests; and Kohavi and John's distinction between strongly and weakly relevant features shows that relevance to the target does not by itself determine membership in the best subset.
Deep neural networks blur the line between feature engineering, selection, and extraction. Hidden layers learn distributed representations that subsume much of the work that previously belonged to hand-crafted features, and convolutional, recurrent, and attention-based architectures discover useful local statistics on their own. For unstructured modalities such as image classification, machine translation, and speech recognition, end-to-end representation learning is now standard and explicit feature selection is rarely useful.
For structured data the picture is more nuanced. Tabular benchmarks such as the work by Shwartz-Ziv and Armon (2021), and follow-up surveys, repeatedly find that gradient-boosted trees match or exceed deep tabular models such as TabNet and FT-Transformer on the majority of public datasets, especially when training samples are limited. In those settings, careful feature selection is still one of the highest-return investments a practitioner can make.
Within deep learning itself, several techniques recover some of the benefits of selection without exiting the differentiable framework. L1 penalties on input-layer weights drive entire input columns to zero; group Lasso applied across input neurons does the same at the column level. Concrete dropout and the L0 regularisation of Louizos and colleagues add learnable gates that produce truly sparse input usage. Attention weights over input tokens can be inspected post hoc as a soft selection mechanism, and integrated gradients or DeepLIFT scores serve a similar role to SHAP for tree ensembles.
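A minimal sketch, assuming PyTorch, of the first of those techniques: an L1 penalty on the input layer's weight matrix, whose columns correspond one-to-one with input features. The architecture, data, and penalty strength are illustrative.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(100, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
l1_strength = 1e-3

X = torch.randn(256, 100)  # toy data: 256 samples, 100 input features
y = torch.randn(256, 1)

for _ in range(100):
    optimizer.zero_grad()
    mse = nn.functional.mse_loss(model(X), y)
    # Penalise the input layer's weights; group Lasso would replace this
    # plain L1 sum with the column-wise L2 norm to zero whole columns.
    l1 = model[0].weight.abs().sum()
    (mse + l1_strength * l1).backward()
    optimizer.step()

# Input columns whose weight norm is near zero are effectively deselected.
column_norms = model[0].weight.norm(dim=0)
print((column_norms < 1e-3).sum().item(), "inputs driven to (near) zero")
```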
Feature selection chooses a small, useful subset of input variables to feed to a model, balancing predictive accuracy against complexity, latency, and interpretability. The field's three classical families - filters, wrappers, and embedded methods - each trade computational cost for selection quality differently, and modern hybrid pipelines combine them to scale gracefully from a handful of features to millions. While deep learning has reduced the need for manual feature engineering on unstructured data, feature selection remains central to tabular machine learning, where boosted trees still dominate and every dropped feature pays dividends in cost, latency, and clarity. The discipline rests on a long lineage of statistical and information-theoretic results, summarised most influentially by Guyon and Elisseeff in 2003, and continues to evolve through tools such as Boruta, SHAP-guided pruning, and shap-select.