See also: Feature, Feature Engineering, Feature Extraction, Feature Vector
In machine learning, a feature set is the collection of input variables (also called features, attributes, or predictors) used to train a model and make predictions. Each individual feature represents a measurable property of the data, and the feature set as a whole defines what information the model has access to during learning. The composition of a feature set has a direct effect on model accuracy, generalization ability, and computational cost.
A feature set is distinct from a feature vector, which is a single instance (row) of feature values for one data point. The feature set describes the columns or dimensions of the data, while a feature vector describes one observation within that space. The mathematical space defined by all possible combinations of feature values is called the feature space.
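A minimal illustration of the distinction, using a small hypothetical pandas DataFrame: the column names form the feature set, each row is one feature vector, and the set of all possible value combinations is the feature space.

```python
import pandas as pd

# Hypothetical dataset: three features describing people.
df = pd.DataFrame({
    "height_cm": [170.0, 182.5, 165.2],
    "weight_kg": [68.0, 90.1, 55.4],
    "smoker":    [0, 1, 0],
})

feature_set = list(df.columns)           # the feature set: names of the input variables
feature_vector = df.iloc[0].to_numpy()   # one feature vector: values for a single observation

print(feature_set)     # ['height_cm', 'weight_kg', 'smoker']
print(feature_vector)  # [170.   68.    0.]
```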
Selecting and constructing an effective feature set is one of the most consequential steps in any machine learning pipeline. A well-chosen feature set allows even simple models to achieve strong performance, while a poorly chosen one can limit the capabilities of even the most sophisticated algorithms.
Imagine you are trying to guess what animal someone is thinking of. You could ask questions like: "How many legs does it have? Does it have fur? Can it fly? Does it live in water?" Each question is a "feature," and all the questions you decide to ask together make up your "feature set."
If you pick good questions, you can guess the animal quickly. If you pick bad questions (like "What is its favorite color?"), you will not get useful answers. And if you ask too many questions, it takes forever and some of the answers might confuse you more than help. A feature set in machine learning works the same way: it is the list of things we tell the computer to pay attention to so it can learn and make good guesses.
Feature sets can contain several different types of features, each requiring different handling before a model can use them.
| Feature type | Description | Examples | Common encoding methods |
|---|---|---|---|
| Numeric (continuous) | Quantitative values on a continuous scale | Height, temperature, price, age | Normalization, standardization, log transform |
| Numeric (discrete) | Countable integer values | Number of rooms, word count, page views | May be used directly or binned |
| Categorical (nominal) | Qualitative labels with no inherent order | Color, country, product category, gender | One-hot encoding, target encoding, hashing |
| Categorical (ordinal) | Qualitative labels with a meaningful order | Education level, satisfaction rating, size (S/M/L) | Ordinal encoding, integer mapping |
| Binary | Two possible values | True/false, yes/no, male/female | 0/1 encoding |
| Text | Unstructured natural language | Product reviews, tweets, email body | Bag of words, TF-IDF, embeddings |
| Temporal | Date, time, or time-series data | Timestamps, durations, seasonal patterns | Cyclical encoding (sine/cosine), lag features |
| Image | Pixel data from images | Photographs, medical scans, satellite imagery | Pixel arrays, CNN-extracted features, pretrained embeddings |
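A minimal sketch of how the encodings in the table above might be applied in practice, assuming scikit-learn and a hypothetical mixed-type dataset (column names are illustrative only):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder

# Hypothetical mixed-type feature set.
X = pd.DataFrame({
    "age":        [34, 51, 29, 42],            # numeric (continuous)
    "page_views": [3, 17, 5, 9],               # numeric (discrete)
    "country":    ["DE", "US", "US", "FR"],    # categorical (nominal)
    "size":       ["S", "L", "M", "M"],        # categorical (ordinal)
})

preprocess = ColumnTransformer([
    ("scale",   StandardScaler(), ["age", "page_views"]),
    ("one_hot", OneHotEncoder(handle_unknown="ignore"), ["country"]),
    ("ordinal", OrdinalEncoder(categories=[["S", "M", "L"]]), ["size"]),
])

X_encoded = preprocess.fit_transform(X)
print(X_encoded.shape)  # (4, 6): 2 scaled + 3 one-hot + 1 ordinal columns
```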
The feature space is the n-dimensional mathematical space in which each axis corresponds to one feature in the feature set. Every data point is represented as a coordinate (or vector) within this space. For example, if a feature set contains two features, height and weight, then the feature space is a two-dimensional plane, and each person is a point on that plane.
The geometry of the feature space affects how machine learning algorithms operate. Algorithms such as k-nearest neighbors and support vector machines rely on distance calculations in the feature space. Clustering algorithms group data points that are close together in the feature space. Decision trees partition the feature space into rectangular regions, while neural networks learn complex, nonlinear decision boundaries within it.
When a feature set is transformed (for example, through principal component analysis or kernel methods), the data is projected into a new feature space where patterns may be easier to identify. The kernel trick used in SVMs, for instance, maps data into a higher-dimensional feature space where a linear separator can be found even if the original data is not linearly separable.
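A small sketch of this effect, using scikit-learn's synthetic concentric-circles data (an illustrative choice, not from the text above): the classes are not linearly separable in the original two-dimensional feature space, but an RBF-kernel SVM implicitly maps the data into a higher-dimensional space where a linear separator exists.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in the original 2-D feature space.
X, y = make_circles(n_samples=500, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear").fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf").fit(X_train, y_train)   # implicit mapping to a higher-dimensional space

print("linear kernel accuracy:", linear_svm.score(X_test, y_test))  # roughly chance level
print("RBF kernel accuracy:   ", rbf_svm.score(X_test, y_test))     # close to 1.0
```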
As the number of features in a feature set increases, the volume of the feature space grows exponentially. This phenomenon, known as the curse of dimensionality (a term coined by Richard Bellman in the 1950s), creates several problems for machine learning.
In high-dimensional spaces, data points become sparse. Distances between points converge to similar values, making distance-based algorithms less effective. Models require exponentially more training data to cover the feature space adequately, and the risk of overfitting increases because models can find spurious patterns in sparse, high-dimensional data.
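The distance-concentration effect can be demonstrated numerically. The sketch below (hypothetical uniform random data) measures the relative gap between the largest and smallest pairwise distances as the number of features grows; the gap shrinks, so nearest and farthest neighbors become nearly indistinguishable.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.random((500, d))        # 500 points uniform in the d-dimensional unit cube
    dists = pdist(X)                # all pairwise Euclidean distances
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:5d}  relative contrast={contrast:.2f}")
# The relative contrast shrinks as d grows: distances converge to similar values.
```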
The Hughes phenomenon illustrates a related effect: as features are added to a classification model, predictive accuracy initially improves but eventually degrades once the number of features exceeds a threshold relative to the training set size. A common heuristic suggests that at least five training examples per feature dimension are needed, though this varies depending on the signal-to-noise ratio and feature relevance.
Practical strategies for mitigating the curse of dimensionality include feature selection (removing irrelevant or redundant features), feature extraction (projecting into a lower-dimensional space), and regularization (penalizing model complexity).
Feature selection is the process of identifying and retaining the most relevant features from the full set while discarding irrelevant or redundant ones. Unlike feature extraction, which creates new derived features, feature selection works with the original features and preserves their interpretability. Feature selection serves several purposes: simplifying models, reducing training time, improving generalization, and avoiding the curse of dimensionality.
Feature selection methods are generally grouped into three categories: filter methods, wrapper methods, and embedded methods.
Filter methods evaluate features based on their intrinsic statistical properties, independent of any specific learning algorithm. They score each feature (or pair of features) using a statistical measure and rank or threshold them accordingly. Because they do not train a model during evaluation, filter methods are computationally efficient and scale well to large feature sets.
| Technique | What it measures | Applicable to | Notes |
|---|---|---|---|
| Variance threshold | Variance of each feature | Numeric features | Removes near-constant features; simplest filter |
| Pearson correlation | Linear correlation with target | Numeric target and features | Only captures linear relationships |
| Mutual information | Shared information between feature and target | Any feature/target types | Captures nonlinear relationships; nonparametric |
| Chi-squared test | Independence between categorical feature and target | Categorical features, categorical target | Tests whether observed frequencies differ from expected |
| ANOVA F-test | Difference in means across target classes | Numeric features, categorical target | High F-value means feature separates classes well |
| Relief / ReliefF | Feature relevance based on nearest neighbor distances | Any feature type | Considers feature interactions |
A key limitation of most filter methods is that they evaluate features individually (univariate filters), which means they can miss interactions between features. Multivariate filters such as minimum-redundancy-maximum-relevance (mRMR) address this by balancing relevance to the target with redundancy among selected features.
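A minimal sketch of two filter methods from the table, assuming scikit-learn and a synthetic classification dataset: a variance threshold to drop constant features, followed by a mutual-information ranking to keep the top-scoring ones.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import VarianceThreshold, SelectKBest, mutual_info_classif

# Hypothetical data: 20 features, of which only 5 are informative.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           n_redundant=5, random_state=0)

# Filter 1: drop (near-)constant features.
X_var = VarianceThreshold(threshold=0.0).fit_transform(X)

# Filter 2: keep the 5 features sharing the most mutual information with the target.
selector = SelectKBest(score_func=mutual_info_classif, k=5).fit(X_var, y)
X_selected = selector.transform(X_var)
print("selected feature indices:", np.flatnonzero(selector.get_support()))
```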
Wrapper methods evaluate feature subsets by training an actual model and measuring its performance on a held-out validation set. They search through the space of possible feature subsets, scoring each based on the model's predictive accuracy.
| Technique | Strategy | Computational cost |
|---|---|---|
| Forward selection | Start with no features; add the best one at each step | Moderate |
| Backward elimination | Start with all features; remove the worst one at each step | Moderate to high |
| Recursive feature elimination (RFE) | Train model, remove least important features, repeat | Moderate to high |
| Exhaustive search | Evaluate all possible feature subsets | Very high (2^n subsets) |
| Genetic algorithms | Evolve populations of feature subsets through selection and mutation | High but parallelizable |
Wrapper methods can detect feature interactions and are tuned to a specific model, so they often produce the best-performing feature subsets for that model. However, they are computationally expensive, especially with large feature sets, and they risk overfitting to the validation set.
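A sketch of two wrapper-style strategies from the table, assuming scikit-learn, a logistic regression as the wrapped model, and synthetic data: recursive feature elimination and greedy forward selection.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=400, n_features=15, n_informative=4, random_state=0)
model = LogisticRegression(max_iter=1000)

# Recursive feature elimination: repeatedly drop the least important feature.
rfe = RFE(estimator=model, n_features_to_select=4).fit(X, y)

# Forward selection: greedily add the feature that most improves cross-validated score.
forward = SequentialFeatureSelector(model, n_features_to_select=4,
                                    direction="forward", cv=5).fit(X, y)

print("RFE keeps:    ", rfe.get_support(indices=True))
print("Forward keeps:", forward.get_support(indices=True))
```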
Embedded methods perform feature selection as part of the model training process itself, combining the benefits of both filter and wrapper approaches. They are more computationally efficient than wrappers because they do not require training separate models for each feature subset.
| Technique | How it selects features | Model type |
|---|---|---|
| LASSO (L1 regularization) | Shrinks coefficients of irrelevant features to exactly zero | Linear regression, logistic regression |
| Elastic net | Combines L1 and L2 penalties; balances sparsity and grouping | Linear models |
| Tree-based feature importance | Ranks features by their contribution to information gain or impurity reduction | Decision trees, random forests, gradient boosting |
| Regularized trees | Penalizes features similar to previously selected ones at each split | Tree ensembles |
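A minimal sketch of an embedded method from the table: L1-regularized regression (LASSO) shrinks the coefficients of irrelevant features to zero during training, and scikit-learn's SelectFromModel keeps only the survivors. The dataset and regularization strength here are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=300, n_features=20, n_informative=5, noise=10.0,
                       random_state=0)
X = StandardScaler().fit_transform(X)   # L1 penalties are sensitive to feature scale

lasso = Lasso(alpha=1.0).fit(X, y)
print("non-zero coefficients:", np.sum(lasso.coef_ != 0))

# SelectFromModel keeps only the features whose coefficients survived the L1 penalty.
X_selected = SelectFromModel(lasso, prefit=True).transform(X)
print("reduced shape:", X_selected.shape)
```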
Several feature selection techniques draw on information theory to quantify the relevance and redundancy of features.
Minimum-redundancy-maximum-relevance (mRMR) selects features that have high mutual information with the target variable while having low mutual information with each other. This prevents selecting highly correlated features that provide overlapping information.
Joint mutual information (JMI) identifies features that add new information to the set of already-selected features by computing conditional mutual information.
Correlation feature selection (CFS) evaluates feature subsets using the principle that good subsets contain features highly correlated with the target but uncorrelated with each other.
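mRMR is not part of scikit-learn, but the greedy version can be sketched in a few lines under the assumption that both relevance and redundancy are measured with mutual-information estimates (the difference form of the criterion is used here; the dataset is synthetic).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def greedy_mrmr(X, y, k):
    """Greedy mRMR: maximize MI with the target, minimize mean MI with already-chosen features."""
    relevance = mutual_info_classif(X, y, random_state=0)    # MI(feature; target)
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(k):
        best, best_score = None, -np.inf
        for j in remaining:
            if selected:
                redundancy = np.mean([
                    mutual_info_regression(X[:, [j]], X[:, s], random_state=0)[0]
                    for s in selected
                ])
            else:
                redundancy = 0.0
            score = relevance[j] - redundancy
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
        remaining.remove(best)
    return selected

X, y = make_classification(n_samples=300, n_features=12, n_informative=4, random_state=0)
print("mRMR picks:", greedy_mrmr(X, y, k=4))
```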
Feature engineering is the broader process of using domain knowledge to create, transform, and select features for a machine learning model. While feature selection reduces the existing set, feature engineering also involves constructing entirely new features from raw data. The quality of feature engineering often determines the upper bound of model performance, particularly for tabular data.
| Transformation | Description | When to use |
|---|---|---|
| Log transform | Applies logarithm to compress large ranges and reduce skewness | Right-skewed distributions (income, population) |
| Polynomial features | Creates squared, cubed, or higher-order versions of features | Capturing nonlinear relationships in linear models |
| Interaction features | Multiplies two or more features together | When the combined effect of features matters (e.g., area = length x width) |
| Binning / bucketing | Groups continuous values into discrete intervals | When the exact value is less important than the range |
| Cyclical encoding | Uses sine and cosine to encode circular features | Time-of-day, day-of-week, month-of-year |
| Normalization (min-max) | Scales values to a fixed range, typically [0, 1] | When features have different scales and the algorithm is scale-sensitive |
| Standardization (z-score) | Centers values to mean 0 and standard deviation 1 | For algorithms that assume normally distributed inputs |
| One-hot encoding | Creates binary columns for each category | Nominal categorical features with few unique values |
| Target encoding | Replaces categories with the mean of the target variable | High-cardinality categorical features |
| Embeddings | Learns dense vector representations from data | Text, categorical features with many values, images |
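A short sketch applying a few of the transformations from the table (log transform, cyclical encoding, an interaction feature) to a hypothetical DataFrame; column names and values are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical raw features.
df = pd.DataFrame({
    "income": [32_000, 54_000, 250_000, 41_000],
    "hour":   [23, 8, 14, 3],          # hour of day, a circular quantity
    "length": [2.0, 3.5, 4.0, 1.5],
    "width":  [1.0, 2.0, 2.5, 1.0],
})

df["log_income"] = np.log1p(df["income"])                # compress a right-skewed range
df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)     # cyclical encoding:
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)     # 23:00 lands close to 00:00
df["area"] = df["length"] * df["width"]                  # interaction feature
print(df.round(3))
```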
Different application domains require specialized feature engineering approaches.
In natural language processing, raw text is converted into numerical features using techniques such as bag of words, TF-IDF (term frequency-inverse document frequency), or learned word and sentence embeddings. Features may also include text length, sentiment scores, named entity counts, or part-of-speech distributions.
In computer vision, traditional feature engineering used hand-crafted descriptors such as edge histograms, SIFT (scale-invariant feature transform), and HOG (histogram of oriented gradients). Modern approaches use convolutional neural networks or pretrained vision models to extract features automatically through transfer learning.
In time series analysis, feature engineering includes creating lag features (values from previous time steps), rolling statistics (moving averages, rolling standard deviations), seasonal decomposition, and Fourier-based frequency features.
In tabular data for business applications, common engineered features include ratios (revenue per employee), differences (change from previous period), aggregations (average transaction value per customer), and date-derived features (day of week, quarter, is_holiday).
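A minimal sketch of the lag, rolling, and date-derived features described above, using a hypothetical daily-sales series and pandas:

```python
import pandas as pd

# Hypothetical daily sales series.
sales = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=10, freq="D"),
    "revenue": [120, 135, 90, 160, 155, 170, 80, 95, 140, 150],
})

sales["lag_1"] = sales["revenue"].shift(1)                           # value from the previous day
sales["rolling_mean_3"] = sales["revenue"].rolling(window=3).mean()  # 3-day moving average
sales["day_of_week"] = sales["date"].dt.dayofweek                    # date-derived feature
sales["is_weekend"] = (sales["day_of_week"] >= 5).astype(int)
print(sales.head(6))
```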
Feature extraction creates new features by transforming or combining original features, typically producing a lower-dimensional representation that retains the most informative aspects of the data. Unlike feature selection, which chooses a subset of existing features, feature extraction constructs entirely new variables.
| Method | Type | How it works | Preserves original features? |
|---|---|---|---|
| Principal component analysis (PCA) | Linear, unsupervised | Finds orthogonal directions of maximum variance | No (creates new components) |
| Linear discriminant analysis (LDA) | Linear, supervised | Finds directions that maximize class separation | No (creates new components) |
| t-SNE | Nonlinear, unsupervised | Preserves local neighborhood structure for visualization | No (creates 2D/3D coordinates) |
| UMAP | Nonlinear, unsupervised | Preserves both local and global structure | No (creates low-dimensional coordinates) |
| Autoencoders | Nonlinear, unsupervised | Neural network learns compressed encoding through bottleneck layer | No (creates latent features) |
| Independent component analysis (ICA) | Linear, unsupervised | Separates signal into statistically independent components | No (creates independent components) |
The choice between feature selection and feature extraction involves tradeoffs. Feature selection maintains the original feature meanings, aiding interpretability. Feature extraction can achieve stronger dimension reduction but produces transformed features that may be harder to interpret.
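A sketch of the most common extraction method from the table, PCA, on scikit-learn's Iris data (an illustrative choice): the four original columns are replaced by two new orthogonal components ordered by explained variance.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)             # 4 original features
X_scaled = StandardScaler().fit_transform(X)  # PCA is sensitive to feature scale

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)       # 2 new components, not original features

print("reduced shape:", X_reduced.shape)                      # (150, 2)
print("variance explained:", pca.explained_variance_ratio_)   # share captured by each component
```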
Once a model is trained, assessing which features contribute most to its predictions helps validate the feature set, guide further feature engineering, and explain model behavior.
Permutation importance randomly shuffles each feature one at a time and measures how much the model's performance degrades. Features whose shuffling causes large performance drops are considered more important. This method is model-agnostic and works with any trained model.
Tree-based importance measures how much each feature contributes to reducing impurity (Gini impurity or entropy) across all splits in tree-based models such as random forests and gradient boosting models. While computationally cheap, this method can be biased toward high-cardinality features.
SHAP (SHapley Additive exPlanations) values, rooted in cooperative game theory, quantify each feature's contribution to individual predictions. SHAP provides both global feature importance (averaged across all predictions) and local explanations (for a single data point). It handles feature correlations more fairly than permutation importance by averaging each feature's marginal contribution over all possible feature coalitions.
Coefficient magnitudes in linear models (linear regression, logistic regression) indicate feature importance when features are standardized. Larger absolute coefficients correspond to more influential features.
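A minimal sketch of permutation importance using scikit-learn's implementation on a hypothetical random-forest model and synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=8, n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature 10 times and measure the average drop in held-out accuracy.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: mean importance {result.importances_mean[i]:.3f}")
```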
A feature store is a centralized system for managing, storing, and serving machine learning features in production environments. Feature stores emerged as a component of MLOps infrastructure to address the challenge of reusing and consistently serving features across multiple models and teams.
A typical feature store consists of two main components.
The offline store holds historical feature values and is used for model training and batch scoring. It is typically implemented using data warehouses, data lakes, or distributed file systems (such as Apache Parquet files on S3, Delta Lake, BigQuery, or Snowflake). The offline store supports point-in-time correct feature retrieval, which prevents data leakage during training by ensuring each training example only uses feature values that were available at that historical moment.
The online store serves current feature values with low latency for real-time inference. It is backed by fast key-value stores such as Redis, DynamoDB, or Cassandra. When a model needs to make a prediction in real time (for example, fraud detection at the point of a credit card transaction), the online store provides features within milliseconds.
| Feature store | Type | Developed by | Notable characteristics |
|---|---|---|---|
| Feast | Open source | Originally Gojek, now maintained by Tecton | Lightweight, pluggable backends, widely adopted |
| Tecton | Commercial | Tecton (creators of Uber's Michelangelo) | Fully managed, strong real-time streaming support |
| Hopsworks | Open source / commercial | Logical Clocks | Integrated with ML pipelines, supports Python and Spark |
| Vertex AI Feature Store | Commercial | Google Cloud | Integrated with Google Cloud AI Platform |
| SageMaker Feature Store | Commercial | Amazon Web Services | Integrated with SageMaker training and inference |
| Databricks Feature Store | Commercial | Databricks | Integrated with Unity Catalog and Delta Lake |
| Chronon | Open source | Airbnb | Designed for complex temporal aggregations |
| Feathr | Open source | LinkedIn (now part of Azure) | Supports both batch and real-time features |
Feature stores help organizations maintain consistency between the features used during training and those served during inference, a problem commonly referred to as training-serving skew.
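As a rough illustration of online serving, the sketch below shows what low-latency feature retrieval looks like with Feast. The repository layout, feature view name, and entity key are hypothetical, and the exact API varies across Feast versions.

```python
from feast import FeatureStore

# Hypothetical feature repository; the feature views would be defined separately in the repo.
store = FeatureStore(repo_path=".")

# At inference time, fetch the latest feature values for one entity with low latency.
features = store.get_online_features(
    features=["customer_stats:avg_transaction_value", "customer_stats:txn_count_30d"],
    entity_rows=[{"customer_id": 1001}],
).to_dict()
print(features)
```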
Automated feature engineering tools reduce the manual effort of constructing features by algorithmically generating and evaluating candidate features from raw data.
Featuretools is an open-source Python library that uses deep feature synthesis (DFS) to automatically create features from relational datasets. DFS works by applying sequences of aggregation and transformation operations across related tables, producing features such as "average order value per customer" or "maximum transaction amount in the last 30 days."
TSFresh (Time Series Feature Extraction based on Scalable Hypothesis Tests) specializes in extracting features from time series data. It computes a large library of time-domain features (means, variances, autocorrelations, spectral properties) and then uses statistical hypothesis testing to filter out features that are not significantly associated with the target variable.
AutoML frameworks such as Auto-sklearn, H2O AutoML, and Google Cloud AutoML often include automated feature engineering as part of their pipeline. These systems explore feature transformations, encoding strategies, and selection methods as part of their hyperparameter search.
Recent research (2025) has explored combining large language models with evolutionary search for automated feature engineering. A framework called LLM-FE uses the domain knowledge and reasoning capabilities of LLMs to propose candidate feature transformations, which are then refined through evolutionary optimization.
Start with domain knowledge. Understanding the problem domain helps identify which raw data fields are likely to be predictive and what transformations make sense. A data scientist building a credit risk model benefits from knowing that debt-to-income ratio is more informative than raw income alone.
Remove constant and near-constant features. Features with zero or near-zero variance provide no discriminating information and should be dropped early.
Handle missing values intentionally. Missing data can be informative (the absence of a value may itself be a signal). Common strategies include imputation (filling with mean, median, or mode), creating a binary indicator for missingness, or using algorithms that handle missing values natively (such as XGBoost).
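A small sketch of two of these strategies combined, on a hypothetical column: a binary missingness indicator plus median imputation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"annual_income": [52_000, np.nan, 61_000, np.nan, 48_000]})

# Keep the fact of missingness as its own (possibly predictive) feature...
df["income_missing"] = df["annual_income"].isna().astype(int)
# ...then fill the gaps with the median so downstream models can use the column.
df["annual_income"] = df["annual_income"].fillna(df["annual_income"].median())
print(df)
```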
Encode categorical variables appropriately. The choice of encoding (one-hot, ordinal, target, or embedding) depends on the cardinality of the feature, the model type, and whether the categories have a natural order.
Scale features when necessary. Algorithms that rely on distance calculations (k-nearest neighbors, support vector machines) or gradient-based optimization (neural networks, logistic regression) are sensitive to feature scale. Tree-based models are generally scale-invariant.
Watch for data leakage. Features that contain information about the target that would not be available at prediction time (such as future values or target-derived statistics) lead to overly optimistic performance estimates that do not generalize.
Iterate and evaluate. Feature set construction is an iterative process. Use cross-validation to assess feature sets, check feature importance after training, and refine based on results.
Document feature definitions. In production systems, clear documentation of how each feature is computed, what data sources it uses, and any assumptions it makes prevents errors and helps with debugging.
In deep learning, the traditional role of manual feature engineering is partially replaced by representation learning, where the model learns to extract useful features from raw data through its hidden layers. Convolutional neural networks learn spatial features from raw pixel inputs, recurrent neural networks learn temporal features from sequential data, and transformer models learn contextual features from text through attention mechanisms.
However, feature sets still matter in deep learning in several ways.
Input representation choices affect model performance. Decisions about tokenization schemes in NLP, image resolution and color space in vision, and audio spectrogram parameters in speech processing all define the initial feature set that the network receives.
Transfer learning uses features learned by a pretrained model (often on a large dataset) as the feature set for a new task. The penultimate layer of a pretrained network is commonly used as a fixed feature extractor, providing a high-quality feature set without task-specific feature engineering.
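A sketch of this pattern, assuming PyTorch and torchvision with a pretrained ResNet-18 (the model choice and input sizes are illustrative): dropping the classification head leaves the pooled penultimate activations as a fixed 512-dimensional feature set.

```python
import torch
import torchvision.models as models

# Load a network pretrained on ImageNet and drop its final classification layer.
resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
feature_extractor = torch.nn.Sequential(*list(resnet.children())[:-1]).eval()

# A batch of 4 hypothetical RGB images at 224x224.
images = torch.randn(4, 3, 224, 224)
with torch.no_grad():
    features = feature_extractor(images).flatten(start_dim=1)
print(features.shape)  # torch.Size([4, 512]): one 512-dimensional feature vector per image
```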
Tabular deep learning has not yet consistently outperformed gradient-boosted trees on structured data, partly because tabular data often benefits from hand-engineered features that encode domain knowledge. Feature engineering remains valuable even when using neural networks on tabular datasets.
The concept of feature sets has evolved significantly over the history of machine learning.
In the 1950s and 1960s, early pattern recognition systems worked with small, hand-selected feature sets. Researchers manually identified and computed features for each specific problem, such as pixel counts for character recognition or formant frequencies for speech recognition.
In the 1990s and 2000s, the growth of supervised learning and the availability of larger datasets led to more systematic feature engineering. Competitions like the Netflix Prize (2006) demonstrated that creative feature engineering could be the difference between winning and losing. The phrase "feature engineering is the key" became a common refrain in applied machine learning.
The 2010s saw two divergent trends. Deep learning showed that end-to-end models could learn features directly from raw data, reducing the need for manual feature engineering in domains like computer vision and natural language processing. Meanwhile, Kaggle competitions and industry applications of tabular ML continued to rely heavily on manual feature engineering.
The 2020s brought feature stores into mainstream MLOps practice, automated feature engineering tools matured, and research began exploring the use of large language models to assist in feature engineering.