See also: Feature, Feature Engineering, Feature Extraction, Feature Vector
In machine learning, a feature set is the collection of input variables (also called features, attributes, or predictors) used to train a model and make predictions. Each individual feature represents a measurable property of the data, and the feature set as a whole defines what information the model has access to during learning. The composition of a feature set has a direct effect on model accuracy, generalization ability, and computational cost.
A feature set is distinct from a feature vector, which is a single instance (row) of feature values for one data point. The feature set describes the columns or dimensions of the data, while a feature vector describes one observation within that space. The mathematical space defined by all possible combinations of feature values is called the feature space.
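A minimal illustration of the distinction, using a small hypothetical pandas DataFrame: the column names form the feature set, each row is one feature vector, and the set of all possible value combinations is the feature space.

```python
import pandas as pd

# Hypothetical dataset: three features describing people.
df = pd.DataFrame({
    "height_cm": [170.0, 182.5, 165.2],
    "weight_kg": [68.0, 90.1, 55.4],
    "smoker":    [0, 1, 0],
})

feature_set = list(df.columns)           # the feature set: names of the input variables
feature_vector = df.iloc[0].to_numpy()   # one feature vector: values for a single observation

print(feature_set)     # ['height_cm', 'weight_kg', 'smoker']
print(feature_vector)  # [170.   68.    0.]
```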
Selecting and constructing an effective feature set is one of the most consequential steps in any machine learning pipeline. A well-chosen feature set allows even simple models to achieve strong performance, while a poorly chosen one can limit the capabilities of even the most sophisticated algorithms.
Imagine you are trying to guess what animal someone is thinking of. You could ask questions like: "How many legs does it have? Does it have fur? Can it fly? Does it live in water?" Each question is a "feature," and all the questions you decide to ask together make up your "feature set."
If you pick good questions, you can guess the animal quickly. If you pick bad questions (like "What is its favorite color?"), you will not get useful answers. And if you ask too many questions, it takes forever and some of the answers might confuse you more than help. A feature set in machine learning works the same way: it is the list of things we tell the computer to pay attention to so it can learn and make good guesses.
Feature sets can contain several different types of features, each requiring different handling before a model can use them.
| Feature type | Description | Examples | Common encoding methods |
|---|---|---|---|
| Numeric (continuous) | Quantitative values on a continuous scale | Height, temperature, price, age | Normalization, standardization, log transform |
| Numeric (discrete) | Countable integer values | Number of rooms, word count, page views | May be used directly or binned |
| Categorical (nominal) | Qualitative labels with no inherent order | Color, country, product category, gender | One-hot encoding, target encoding, hashing |
| Categorical (ordinal) | Qualitative labels with a meaningful order | Education level, satisfaction rating, size (S/M/L) | Ordinal encoding, integer mapping |
| Binary | Two possible values | True/false, yes/no, male/female | 0/1 encoding |
| Text | Unstructured natural language | Product reviews, tweets, email body | Bag of words, TF-IDF, embeddings |
| Temporal | Date, time, or time-series data | Timestamps, durations, seasonal patterns | Cyclical encoding (sine/cosine), lag features |
| Image | Pixel data from images | Photographs, medical scans, satellite imagery | Pixel arrays, CNN-extracted features, pretrained embeddings |
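A minimal sketch of how the encodings in the table above might be applied in practice, assuming scikit-learn and a hypothetical mixed-type dataset (column names are illustrative only):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder

# Hypothetical mixed-type feature set.
X = pd.DataFrame({
    "age":        [34, 51, 29, 42],            # numeric (continuous)
    "page_views": [3, 17, 5, 9],               # numeric (discrete)
    "country":    ["DE", "US", "US", "FR"],    # categorical (nominal)
    "size":       ["S", "L", "M", "M"],        # categorical (ordinal)
})

preprocess = ColumnTransformer([
    ("scale",   StandardScaler(), ["age", "page_views"]),
    ("one_hot", OneHotEncoder(handle_unknown="ignore"), ["country"]),
    ("ordinal", OrdinalEncoder(categories=[["S", "M", "L"]]), ["size"]),
])

X_encoded = preprocess.fit_transform(X)
print(X_encoded.shape)  # (4, 6): 2 scaled + 3 one-hot + 1 ordinal columns
```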
The feature space is the n-dimensional mathematical space in which each axis corresponds to one feature in the feature set. Every data point is represented as a coordinate (or vector) within this space. For example, if a feature set contains two features, height and weight, then the feature space is a two-dimensional plane, and each person is a point on that plane.
The geometry of the feature space affects how machine learning algorithms operate. Algorithms such as k-nearest neighbors and support vector machines rely on distance calculations in the feature space. Clustering algorithms group data points that are close together in the feature space. Decision trees partition the feature space into rectangular regions, while neural networks learn complex, nonlinear decision boundaries within it.
When a feature set is transformed (for example, through principal component analysis or kernel methods), the data is projected into a new feature space where patterns may be easier to identify. The kernel trick used in SVMs, for instance, maps data into a higher-dimensional feature space where a linear separator can be found even if the original data is not linearly separable.
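A small sketch of this effect, using scikit-learn's synthetic concentric-circles data (an illustrative choice, not from the text above): the classes are not linearly separable in the original two-dimensional feature space, but an RBF-kernel SVM implicitly maps the data into a higher-dimensional space where a linear separator exists.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in the original 2-D feature space.
X, y = make_circles(n_samples=500, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear").fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf").fit(X_train, y_train)   # implicit mapping to a higher-dimensional space

print("linear kernel accuracy:", linear_svm.score(X_test, y_test))  # roughly chance level
print("RBF kernel accuracy:   ", rbf_svm.score(X_test, y_test))     # close to 1.0
```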
As the number of features in a feature set increases, the volume of the feature space grows exponentially. This phenomenon, known as the curse of dimensionality (a term coined by Richard Bellman in the 1950s), creates several problems for machine learning.
In high-dimensional spaces, data points become sparse. Distances between points converge to similar values, making distance-based algorithms less effective. Models require exponentially more training data to cover the feature space adequately, and the risk of overfitting increases because models can find spurious patterns in sparse, high-dimensional data.
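The distance-concentration effect can be demonstrated numerically. The sketch below (hypothetical uniform random data) measures the relative gap between the largest and smallest pairwise distances as the number of features grows; the gap shrinks, so nearest and farthest neighbors become nearly indistinguishable.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.random((500, d))        # 500 points uniform in the d-dimensional unit cube
    dists = pdist(X)                # all pairwise Euclidean distances
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:5d}  relative contrast={contrast:.2f}")
# The relative contrast shrinks as d grows: distances converge to similar values.
```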
The Hughes phenomenon illustrates a related effect: as features are added to a classification model, predictive accuracy initially improves but eventually degrades once the number of features exceeds a threshold relative to the training set size. A common heuristic suggests that at least five training examples per feature dimension are needed, though this varies depending on the signal-to-noise ratio and feature relevance.
Practical strategies for mitigating the curse of dimensionality include feature selection (removing irrelevant or redundant features), feature extraction (projecting into a lower-dimensional space), and regularization (penalizing model complexity).
Feature selection is the process of identifying and retaining the most relevant features from the full set while discarding irrelevant or redundant ones. Unlike feature extraction, which creates new derived features, feature selection works with the original features and preserves their interpretability. Feature selection serves several purposes: simplifying models, reducing training time, improving generalization, and avoiding the curse of dimensionality.
Feature selection methods are generally grouped into three categories: filter methods, wrapper methods, and embedded methods.
Filter methods evaluate features based on their intrinsic statistical properties, independent of any specific learning algorithm. They score each feature (or pair of features) using a statistical measure and rank or threshold them accordingly. Because they do not train a model during evaluation, filter methods are computationally efficient and scale well to large feature sets.
| Technique | What it measures | Applicable to | Notes |
|---|---|---|---|
| Variance threshold | Variance of each feature | Numeric features | Removes near-constant features; simplest filter |
| Pearson correlation | Linear correlation with target | Numeric target and features | Only captures linear relationships |
| Mutual information | Shared information between feature and target | Any feature/target types | Captures nonlinear relationships; nonparametric |
| Chi-squared test | Independence between categorical feature and target | Categorical features, categorical target | Tests whether observed frequencies differ from expected |
| ANOVA F-test | Difference in means across target classes | Numeric features, categorical target | High F-value means feature separates classes well |
| Relief / ReliefF | Feature relevance based on nearest neighbor distances | Any feature type | Considers feature interactions |
A key limitation of most filter methods is that they evaluate features individually (univariate filters), which means they can miss interactions between features. Multivariate filters such as minimum-redundancy-maximum-relevance (mRMR) address this by balancing relevance to the target with redundancy among selected features.
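A minimal sketch of two filter methods from the table, assuming scikit-learn and a synthetic classification dataset: a variance threshold to drop constant features, followed by a mutual-information ranking to keep the top-scoring ones.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import VarianceThreshold, SelectKBest, mutual_info_classif

# Hypothetical data: 20 features, of which only 5 are informative.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           n_redundant=5, random_state=0)

# Filter 1: drop (near-)constant features.
X_var = VarianceThreshold(threshold=0.0).fit_transform(X)

# Filter 2: keep the 5 features sharing the most mutual information with the target.
selector = SelectKBest(score_func=mutual_info_classif, k=5).fit(X_var, y)
X_selected = selector.transform(X_var)
print("selected feature indices:", np.flatnonzero(selector.get_support()))
```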
Wrapper methods evaluate feature subsets by training an actual model and measuring its performance on a held-out validation set. They search through the space of possible feature subsets, scoring each based on the model's predictive accuracy.
| Technique | Strategy | Computational cost |
|---|---|---|
| Forward selection | Start with no features; add the best one at each step | Moderate |
| Backward elimination | Start with all features; remove the worst one at each step | Moderate to high |
| Recursive feature elimination (RFE) | Train model, remove least important features, repeat | Moderate to high |
| Exhaustive search | Evaluate all possible feature subsets | Very high (2^n subsets) |
| Genetic algorithms | Evolve populations of feature subsets through selection and mutation | High but parallelizable |
Wrapper methods can detect feature interactions and are tuned to a specific model, so they often produce the best-performing feature subsets for that model. However, they are computationally expensive, especially with large feature sets, and they risk overfitting to the validation set.
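A sketch of two wrapper-style strategies from the table, assuming scikit-learn, a logistic regression as the wrapped model, and synthetic data: recursive feature elimination and greedy forward selection.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=400, n_features=15, n_informative=4, random_state=0)
model = LogisticRegression(max_iter=1000)

# Recursive feature elimination: repeatedly drop the least important feature.
rfe = RFE(estimator=model, n_features_to_select=4).fit(X, y)

# Forward selection: greedily add the feature that most improves cross-validated score.
forward = SequentialFeatureSelector(model, n_features_to_select=4,
                                    direction="forward", cv=5).fit(X, y)

print("RFE keeps:    ", rfe.get_support(indices=True))
print("Forward keeps:", forward.get_support(indices=True))
```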
Embedded methods perform feature selection as part of the model training process itself, combining the benefits of both filter and wrapper approaches. They are more computationally efficient than wrappers because they do not require training separate models for each feature subset.
| Technique | How it selects features | Model type |
|---|---|---|
| LASSO (L1 regularization) | Shrinks coefficients of irrelevant features to exactly zero | Linear regression, logistic regression |
| Elastic net | Combines L1 and L2 penalties; balances sparsity and grouping | Linear models |
| Tree-based feature importance | Ranks features by their contribution to information gain or impurity reduction | Decision trees, random forests, gradient boosting |
| Regularized trees | Penalizes features similar to previously selected ones at each split | Tree ensembles |
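A minimal sketch of an embedded method from the table: L1-regularized regression (LASSO) shrinks the coefficients of irrelevant features to zero during training, and scikit-learn's SelectFromModel keeps only the survivors. The dataset and regularization strength here are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=300, n_features=20, n_informative=5, noise=10.0,
                       random_state=0)
X = StandardScaler().fit_transform(X)   # L1 penalties are sensitive to feature scale

lasso = Lasso(alpha=1.0).fit(X, y)
print("non-zero coefficients:", np.sum(lasso.coef_ != 0))

# SelectFromModel keeps only the features whose coefficients survived the L1 penalty.
X_selected = SelectFromModel(lasso, prefit=True).transform(X)
print("reduced shape:", X_selected.shape)
```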
Several feature selection techniques draw on information theory to quantify the relevance and redundancy of features.
Minimum-redundancy-maximum-relevance (mRMR) selects features that have high mutual information with the target variable while having low mutual information with each other. This prevents selecting highly correlated features that provide overlapping information.
Joint mutual information (JMI) identifies features that add new information to the set of already-selected features by computing conditional mutual information.
Correlation feature selection (CFS) evaluates feature subsets using the principle that good subsets contain features highly correlated with the target but uncorrelated with each other.
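mRMR is not part of scikit-learn, but the greedy version can be sketched in a few lines under the assumption that both relevance and redundancy are measured with mutual-information estimates (the difference form of the criterion is used here; the dataset is synthetic).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def greedy_mrmr(X, y, k):
    """Greedy mRMR: maximize MI with the target, minimize mean MI with already-chosen features."""
    relevance = mutual_info_classif(X, y, random_state=0)    # MI(feature; target)
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(k):
        best, best_score = None, -np.inf
        for j in remaining:
            if selected:
                redundancy = np.mean([
                    mutual_info_regression(X[:, [j]], X[:, s], random_state=0)[0]
                    for s in selected
                ])
            else:
                redundancy = 0.0
            score = relevance[j] - redundancy
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
        remaining.remove(best)
    return selected

X, y = make_classification(n_samples=300, n_features=12, n_informative=4, random_state=0)
print("mRMR picks:", greedy_mrmr(X, y, k=4))
```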
Feature engineering is the broader process of using domain knowledge to create, transform, and select features for a machine learning model. While feature selection reduces the existing set, feature engineering also involves constructing entirely new features from raw data. The quality of feature engineering often determines the upper bound of model performance, particularly for tabular data.
| Transformation | Description | When to use |
|---|---|---|
| Log transform | Applies logarithm to compress large ranges and reduce skewness | Right-skewed distributions (income, population) |
| Polynomial features | Creates squared, cubed, or higher-order versions of features | Capturing nonlinear relationships in linear models |
| Interaction features | Multiplies two or more features together | When the combined effect of features matters (e.g., area = length x width) |
| Binning / bucketing | Groups continuous values into discrete intervals | When the exact value is less important than the range |
| Cyclical encoding | Uses sine and cosine to encode circular features | Time-of-day, day-of-week, month-of-year |
| Normalization (min-max) | Scales values to a fixed range, typically [0, 1] | When features have different scales and the algorithm is scale-sensitive |
| Standardization (z-score) | Centers values to mean 0 and standard deviation 1 | For algorithms that assume normally distributed inputs |
| One-hot encoding | Creates binary columns for each category | Nominal categorical features with few unique values |
| Target encoding | Replaces categories with the mean of the target variable | High-cardinality categorical features |
| Embeddings | Learns dense vector representations from data | Text, categorical features with many values, images |
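A short sketch applying a few of the transformations from the table (log transform, cyclical encoding, an interaction feature) to a hypothetical DataFrame; column names and values are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical raw features.
df = pd.DataFrame({
    "income": [32_000, 54_000, 250_000, 41_000],
    "hour":   [23, 8, 14, 3],          # hour of day, a circular quantity
    "length": [2.0, 3.5, 4.0, 1.5],
    "width":  [1.0, 2.0, 2.5, 1.0],
})

df["log_income"] = np.log1p(df["income"])                # compress a right-skewed range
df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)     # cyclical encoding:
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)     # 23:00 lands close to 00:00
df["area"] = df["length"] * df["width"]                  # interaction feature
print(df.round(3))
```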
Different application domains require specialized feature engineering approaches.
In natural language processing, raw text is converted into numerical features using techniques such as bag of words, TF-IDF (term frequency-inverse document frequency), or learned word and sentence embeddings. Features may also include text length, sentiment scores, named entity counts, or part-of-speech distributions.
In computer vision, traditional feature engineering used hand-crafted descriptors such as edge histograms, SIFT (scale-invariant feature transform), and HOG (histogram of oriented gradients). Modern approaches use convolutional neural networks or pretrained vision models to extract features automatically through transfer learning.
In time series analysis, feature engineering includes creating lag features (values from previous time steps), rolling statistics (moving averages, rolling standard deviations), seasonal decomposition, and Fourier-based frequency features.
In tabular data for business applications, common engineered features include ratios (revenue per employee), differences (change from previous period), aggregations (average transaction value per customer), and date-derived features (day of week, quarter, is_holiday).
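A minimal sketch of the lag, rolling, and date-derived features described above, using a hypothetical daily-sales series and pandas:

```python
import pandas as pd

# Hypothetical daily sales series.
sales = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=10, freq="D"),
    "revenue": [120, 135, 90, 160, 155, 170, 80, 95, 140, 150],
})

sales["lag_1"] = sales["revenue"].shift(1)                           # value from the previous day
sales["rolling_mean_3"] = sales["revenue"].rolling(window=3).mean()  # 3-day moving average
sales["day_of_week"] = sales["date"].dt.dayofweek                    # date-derived feature
sales["is_weekend"] = (sales["day_of_week"] >= 5).astype(int)
print(sales.head(6))
```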
Feature extraction creates new features by transforming or combining original features, typically producing a lower-dimensional representation that retains the most informative aspects of the data. Unlike feature selection, which chooses a subset of existing features, feature extraction constructs entirely new variables.
| Method | Type | How it works | Preserves original features? |
|---|---|---|---|
| Principal component analysis (PCA) | Linear, unsupervised | Finds orthogonal directions of maximum variance | No (creates new components) |
| Linear discriminant analysis (LDA) | Linear, supervised | Finds directions that maximize class separation | No (creates new components) |
| t-SNE | Nonlinear, unsupervised | Preserves local neighborhood structure for visualization | No (creates 2D/3D coordinates) |
| UMAP | Nonlinear, unsupervised | Preserves both local and global structure | No (creates low-dimensional coordinates) |
| Autoencoders | Nonlinear, unsupervised | Neural network learns compressed encoding through bottleneck layer | No (creates latent features) |
| Independent component analysis (ICA) | Linear, unsupervised | Separates signal into statistically independent components | No (creates independent components) |
The choice between feature selection and feature extraction involves tradeoffs. Feature selection maintains the original feature meanings, aiding interpretability. Feature extraction can achieve stronger dimension reduction but produces transformed features that may be harder to interpret.
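A sketch of the most common extraction method from the table, PCA, on scikit-learn's Iris data (an illustrative choice): the four original columns are replaced by two new orthogonal components ordered by explained variance.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)             # 4 original features
X_scaled = StandardScaler().fit_transform(X)  # PCA is sensitive to feature scale

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)       # 2 new components, not original features

print("reduced shape:", X_reduced.shape)                      # (150, 2)
print("variance explained:", pca.explained_variance_ratio_)   # share captured by each component
```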
Once a model is trained, assessing which features contribute most to its predictions helps validate the feature set, guide further feature engineering, and explain model behavior.
Permutation importance randomly shuffles each feature one at a time and measures how much the model's performance degrades. Features whose shuffling causes large performance drops are considered more important. This method is model-agnostic and works with any trained model.
Tree-based importance measures how much each feature contributes to reducing impurity (Gini impurity or entropy) across all splits in tree-based models such as random forests and gradient boosting models. While computationally cheap, this method can be biased toward high-cardinality features.
SHAP (SHapley Additive exPlanations) values, rooted in cooperative game theory, quantify each feature's contribution to individual predictions. SHAP provides both global feature importance (averaged across all predictions) and local explanations (for a single data point). It handles feature correlations more fairly than permutation importance by averaging each feature's marginal contribution over all possible feature coalitions.
Coefficient magnitudes in linear models (linear regression, logistic regression) indicate feature importance when features are standardized. Larger absolute coefficients correspond to more influential features.
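A minimal sketch of permutation importance using scikit-learn's implementation on a hypothetical random-forest model and synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=8, n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature 10 times and measure the average drop in held-out accuracy.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: mean importance {result.importances_mean[i]:.3f}")
```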
A feature store is a centralized system for managing, storing, and serving machine learning features in production environments. Feature stores emerged as a component of MLOps infrastructure to address the challenge of reusing and consistently serving features across multiple models and teams.
A typical feature store consists of two main components.
The offline store holds historical feature values and is used for model training and batch scoring. It is typically implemented using data warehouses, data lakes, or distributed file systems (such as Apache Parquet files on S3, Delta Lake, BigQuery, or Snowflake). The offline store supports point-in-time correct feature retrieval, which prevents data leakage during training by ensuring each training example only uses feature values that were available at that historical moment.
The online store serves current feature values with low latency for real-time inference. It is backed by fast key-value stores such as Redis, DynamoDB, or Cassandra. When a model needs to make a prediction in real time (for example, fraud detection at the point of a credit card transaction), the online store provides features within milliseconds.
| Feature store | Type | Developed by | Notable characteristics |
|---|---|---|---|
| Feast | Open source | Originally Gojek, now maintained by Tecton | Lightweight, pluggable backends, widely adopted |
| Tecton | Commercial | Tecton (creators of Uber's Michelangelo) | Fully managed, strong real-time streaming support |
| Hopsworks | Open source / commercial | Logical Clocks | Integrated with ML pipelines, supports Python and Spark |
| Vertex AI Feature Store | Commercial | Google Cloud | Integrated with Google Cloud AI Platform |
| SageMaker Feature Store | Commercial | Amazon Web Services | Integrated with SageMaker training and inference |
| Databricks Feature Store | Commercial | Databricks | Integrated with Unity Catalog and Delta Lake |
| Chronon | Open source | Airbnb | Designed for complex temporal aggregations |
| Feathr | Open source | LinkedIn (now part of Azure) | Supports both batch and real-time features |
Feature stores help organizations maintain consistency between the features used during training and those served during inference, a problem commonly referred to as training-serving skew.
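As a rough illustration of online serving, the sketch below shows what low-latency feature retrieval looks like with Feast. The repository layout, feature view name, and entity key are hypothetical, and the exact API varies across Feast versions.

```python
from feast import FeatureStore

# Hypothetical feature repository; the feature views would be defined separately in the repo.
store = FeatureStore(repo_path=".")

# At inference time, fetch the latest feature values for one entity with low latency.
features = store.get_online_features(
    features=["customer_stats:avg_transaction_value", "customer_stats:txn_count_30d"],
    entity_rows=[{"customer_id": 1001}],
).to_dict()
print(features)
```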
Automated feature engineering tools reduce the manual effort of constructing features by algorithmically generating and evaluating candidate features from raw data.
Featuretools is an open-source Python library that uses deep feature synthesis (DFS) to automatically create features from relational datasets. DFS works by applying sequences of aggregation and transformation operations across related tables, producing features such as "average order value per customer" or "maximum transaction amount in the last 30 days."
TSFresh (Time Series Feature Extraction based on Scalable Hypothesis Tests) specializes in extracting features from time series data. It computes a large library of time-domain features (means, variances, autocorrelations, spectral properties) and then uses statistical hypothesis testing to filter out features that are not significantly associated with the target variable.
AutoML frameworks such as Auto-sklearn, H2O AutoML, and Google Cloud AutoML often include automated feature engineering as part of their pipeline. These systems explore feature transformations, encoding strategies, and selection methods as part of their hyperparameter search.
Recent research (2025) has explored combining large language models with evolutionary search for automated feature engineering. A framework called LLM-FE uses the domain knowledge and reasoning capabilities of LLMs to propose candidate feature transformations, which are then refined through evolutionary optimization.
Start with domain knowledge. Understanding the problem domain helps identify which raw data fields are likely to be predictive and what transformations make sense. A data scientist building a credit risk model benefits from knowing that debt-to-income ratio is more informative than raw income alone.
Remove constant and near-constant features. Features with zero or near-zero variance provide no discriminating information and should be dropped early.
Handle missing values intentionally. Missing data can be informative (the absence of a value may itself be a signal). Common strategies include imputation (filling with mean, median, or mode), creating a binary indicator for missingness, or using algorithms that handle missing values natively (such as XGBoost).
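A small sketch of two of these strategies combined, on a hypothetical column: a binary missingness indicator plus median imputation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"annual_income": [52_000, np.nan, 61_000, np.nan, 48_000]})

# Keep the fact of missingness as its own (possibly predictive) feature...
df["income_missing"] = df["annual_income"].isna().astype(int)
# ...then fill the gaps with the median so downstream models can use the column.
df["annual_income"] = df["annual_income"].fillna(df["annual_income"].median())
print(df)
```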
Encode categorical variables appropriately. The choice of encoding (one-hot, ordinal, target, or embedding) depends on the cardinality of the feature, the model type, and whether the categories have a natural order.
Scale features when necessary. Algorithms that rely on distance calculations (k-nearest neighbors, support vector machines) or gradient-based optimization (neural networks, logistic regression) are sensitive to feature scale. Tree-based models are generally scale-invariant.
Watch for data leakage. Features that contain information about the target that would not be available at prediction time (such as future values or target-derived statistics) lead to overly optimistic performance estimates that do not generalize.
Iterate and evaluate. Feature set construction is an iterative process. Use cross-validation to assess feature sets, check feature importance after training, and refine based on results.
Document feature definitions. In production systems, clear documentation of how each feature is computed, what data sources it uses, and any assumptions it makes prevents errors and helps with debugging.
In deep learning, the traditional role of manual feature engineering is partially replaced by representation learning, where the model learns to extract useful features from raw data through its hidden layers. Convolutional neural networks learn spatial features from raw pixel inputs, recurrent neural networks learn temporal features from sequential data, and transformer models learn contextual features from text through attention mechanisms.
However, feature sets still matter in deep learning in several ways.
Input representation choices affect model performance. Decisions about tokenization schemes in NLP, image resolution and color space in vision, and audio spectrogram parameters in speech processing all define the initial feature set that the network receives.
Transfer learning uses features learned by a pretrained model (often on a large dataset) as the feature set for a new task. The penultimate layer of a pretrained network is commonly used as a fixed feature extractor, providing a high-quality feature set without task-specific feature engineering.
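A sketch of this pattern, assuming PyTorch and torchvision with a pretrained ResNet-18 (the model choice and input sizes are illustrative): dropping the classification head leaves the pooled penultimate activations as a fixed 512-dimensional feature set.

```python
import torch
import torchvision.models as models

# Load a network pretrained on ImageNet and drop its final classification layer.
resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
feature_extractor = torch.nn.Sequential(*list(resnet.children())[:-1]).eval()

# A batch of 4 hypothetical RGB images at 224x224.
images = torch.randn(4, 3, 224, 224)
with torch.no_grad():
    features = feature_extractor(images).flatten(start_dim=1)
print(features.shape)  # torch.Size([4, 512]): one 512-dimensional feature vector per image
```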
Tabular deep learning has not yet consistently outperformed gradient-boosted trees on structured data, partly because tabular data often benefits from hand-engineered features that encode domain knowledge. Feature engineering remains valuable even when using neural networks on tabular datasets.
The concept of feature sets has evolved significantly over the history of machine learning.
In the 1950s and 1960s, early pattern recognition systems worked with small, hand-selected feature sets. Researchers manually identified and computed features for each specific problem, such as pixel counts for character recognition or formant frequencies for speech recognition.
In the 1990s and 2000s, the growth of supervised learning and the availability of larger datasets led to more systematic feature engineering. Competitions like the Netflix Prize (2006) demonstrated that creative feature engineering could be the difference between winning and losing. The phrase "feature engineering is the key" became a common refrain in applied machine learning.
The 2010s saw two divergent trends. Deep learning showed that end-to-end models could learn features directly from raw data, reducing the need for manual feature engineering in domains like computer vision and natural language processing. Meanwhile, Kaggle competitions and industry applications of tabular ML continued to rely heavily on manual feature engineering.
The 2020s brought feature stores into mainstream MLOps practice, automated feature engineering tools matured, and research began exploring the use of large language models to assist in feature engineering.