# Feature Engineering

> Source: https://aiwiki.ai/wiki/feature_engineering
> Updated: 2026-06-20
> Categories: Data & Datasets, Machine Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Feature engineering** is the process of using domain knowledge to create, transform, and select [features](/wiki/feature) from raw data so that [machine learning](/wiki/machine_learning) models can learn more effectively.[1] It is widely regarded as one of the largest factors separating a mediocre model from a highly accurate one, and surveys consistently find that data work is where practitioners spend most of their effort. In a 2016 CrowdFlower (later Figure Eight) survey, data scientists reported spending roughly 80% of their time collecting, cleaning, and organizing data, and 76% named data preparation the least enjoyable part of their work.[12] As Andrew Ng has stated: "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering."[10]

While raw data can sometimes be fed directly into an algorithm, most real-world datasets contain noise, irrelevant columns, incompatible scales, or implicit patterns that a model cannot detect on its own. Feature engineering bridges the gap between raw data and the mathematical representations that algorithms require.[1]

## Why does feature engineering matter?

The choice of features has a more direct impact on model performance than the choice of algorithm in many practical settings.[10] A simple [logistic regression](/wiki/logistic_regression) trained on well-crafted features can outperform a complex [neural network](/wiki/neural_network) trained on poorly prepared inputs. Several factors explain this outsized influence:

- **Better signal extraction.** Transforming raw measurements into meaningful ratios, differences, or aggregations lets the model focus on the patterns that actually predict the target variable.
- **Reduced dimensionality.** Removing irrelevant or redundant columns decreases training time, lowers the risk of [overfitting](/wiki/overfitting), and makes the model easier to interpret.[11]
- **Handling data heterogeneity.** Real datasets mix [numerical data](/wiki/numerical_data), [categorical data](/wiki/categorical_data), free text, timestamps, and geospatial coordinates. Each type requires its own set of transformations before a model can use it.
- **Improved interpretability.** Domain-meaningful features make it easier for stakeholders to understand why a model makes a given prediction.

Kaggle competitions have repeatedly demonstrated this principle. Top competitors regularly report that feature engineering, not hyperparameter tuning, accounts for the largest share of their final score improvement.[1] The dependence of model quality on inputs is so strong that the authors of the Deep Feature Synthesis paper noted "the efficacy of a machine learning algorithm relies heavily on the input features," which is precisely why crafting good features has traditionally demanded so much human effort.[3]

## What are features in machine learning?

A feature (also called an attribute, variable, or predictor) is a measurable property of the phenomenon being modeled. In a tabular dataset, features correspond to columns and individual observations correspond to rows. For a house-price prediction task, typical features include the number of bedrooms, total living area in square feet, year built, and neighborhood.

Not all features carry useful information. Some are redundant (highly correlated with another feature), some are noisy (contain mostly random variation), and some are irrelevant (have no relationship with the target).[11] A central goal of feature engineering is to maximize the ratio of informative features to uninformative ones.

## Types of feature transformations

Different data types require different engineering strategies. The table below summarizes the most common transformation families.

| Data type | Common transformations | When to use |
|---|---|---|
| Numerical | Scaling, [normalization](/wiki/normalization), log transform, Box-Cox, binning, polynomial features | Continuous measurements with skewed distributions or differing scales |
| Categorical | [One-hot encoding](/wiki/one-hot_encoding), label encoding, target encoding, frequency encoding | Nominal or ordinal variables; the best encoding depends on whether the categories are ordered and on how many there are |
| Text | Bag of words, TF-IDF, [word embeddings](/wiki/word_embedding), n-grams | Unstructured text fields such as reviews, descriptions, or titles |
| Date and time | Component extraction, cyclical sine/cosine encoding, time deltas | Timestamps, event logs, scheduling data |
| Geospatial | Haversine distance, clustering, proximity to landmarks | Latitude/longitude coordinates, postal codes |

### Numerical transformations

Numerical features are the most straightforward to work with, but they still benefit from careful preprocessing.

**Scaling and standardization.** Many algorithms, including [linear models](/wiki/linear_model), support vector machines, and neural networks, are sensitive to the absolute scale of input features.[1] Min-max scaling rescales values to a fixed range (typically 0 to 1), while z-score standardization centers values around zero with unit variance. Tree-based models such as [random forests](/wiki/random_forest) and [gradient boosting](/wiki/gradient_boosting) are invariant to monotonic transformations and generally do not require scaling.
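
The two rescaling schemes described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a library implementation: the `incomes` column and the function names are hypothetical, and in practice scikit-learn's `MinMaxScaler` and `StandardScaler` would usually be used instead.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to rescale
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Center on the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


incomes = [20_000, 35_000, 50_000, 65_000, 80_000]  # hypothetical column
scaled = min_max_scale(incomes)    # [0.0, 0.25, 0.5, 0.75, 1.0]
zscores = standardize(incomes)     # mean 0, standard deviation 1
```

In a real workflow the minimum, maximum, mean, and standard deviation should be learned from the training split only and then reused on validation and serving data.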

**Log and power transforms.** Skewed distributions compress most values into a narrow range while a long tail stretches out in one direction. Applying a logarithmic transformation (or the more general Box-Cox or Yeo-Johnson transforms) pulls the tail inward, producing a distribution closer to normal.[1] This often improves the performance of linear models and other methods that are sensitive to heavy tails or that assume approximately normal errors. The plain logarithm and Box-Cox require strictly positive values, whereas Yeo-Johnson also accepts zero and negative values.
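
As a small illustration of how a logarithm pulls in a long right tail, the sketch below applies `log(1 + x)` to a made-up, heavily skewed column. The `purchase_amounts` data and the function name are assumptions for the example; `math.log1p` is used so that zeros do not cause an error.

```python
import math


def log1p_transform(values):
    """Apply log(1 + x), which is defined for every x > -1 (so zeros are fine)."""
    return [math.log1p(v) for v in values]


# Hypothetical right-skewed column: most values are small, one is very large.
purchase_amounts = [0, 9, 99, 999, 99_999]
transformed = log1p_transform(purchase_amounts)

raw_spread = max(purchase_amounts) - min(purchase_amounts)  # 99999
log_spread = max(transformed) - min(transformed)            # about 11.5
```

The transform is monotonic, so the ordering of the values is preserved while the gap between the largest value and the rest shrinks dramatically.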

**Binning (discretization).** Continuous values can be grouped into discrete bins. For example, age might be split into brackets such as 18 to 25, 26 to 35, and so on. Binning can smooth out noise and reduce the influence of outliers.[1] Scikit-learn provides `KBinsDiscretizer` for this purpose, supporting uniform, quantile, and k-means strategies.[9]
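
A minimal sketch of equal-width binning, using only the standard library, is shown below. The helper names and the `ages` sample are hypothetical; the approach corresponds in spirit to a uniform binning strategy, but it is not scikit-learn's implementation.

```python
from bisect import bisect_right


def equal_width_edges(values, n_bins):
    """Return the interior bin edges that split [min, max] into n_bins equal parts."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]


def assign_bins(values, edges):
    """Return a bin index per value; a value equal to an edge goes to the higher bin."""
    return [bisect_right(edges, v) for v in values]


ages = [18, 22, 25, 31, 40, 47, 58, 78]
edges = equal_width_edges(ages, 3)   # [38.0, 58.0]
bins = assign_bins(ages, edges)      # [0, 0, 0, 0, 1, 1, 2, 2]

# Hand-picked brackets (under 26, 26 to 35, 36 and over) work the same way.
brackets = assign_bins(ages, [26, 36])
```

Equal-width bins are sensitive to outliers (one extreme value stretches every bin), which is one reason quantile-based edges are often preferred for skewed data.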

**Polynomial and interaction features.** Creating squared terms, cubed terms, or products of two features allows linear models to capture nonlinear relationships.[1] Scikit-learn's `PolynomialFeatures` transformer systematically generates all combinations up to a specified degree. However, the number of generated features grows rapidly, so this technique works best when combined with feature selection.
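
The growth in feature count can be made concrete with a short sketch. The function below enumerates every monomial up to a given total degree for one row; it is an illustrative stand-in rather than scikit-learn's `PolynomialFeatures`, and unlike that transformer's default it omits the constant (bias) term.

```python
from itertools import combinations_with_replacement
from math import comb, prod


def polynomial_features(row, degree):
    """Return all monomials of the inputs with total degree 1..degree."""
    out = []
    for d in range(1, degree + 1):
        for idx in combinations_with_replacement(range(len(row)), d):
            out.append(prod(row[i] for i in idx))
    return out


def n_output_features(n_inputs, degree):
    """Number of monomials with total degree 1..degree: C(n + degree, degree) - 1."""
    return comb(n_inputs + degree, degree) - 1


row = [2, 3]                               # x1 = 2, x2 = 3
expanded = polynomial_features(row, 2)     # [x1, x2, x1^2, x1*x2, x2^2]
```

For this row the result is `[2, 3, 4, 6, 9]`. With 10 inputs and degree 3 the same formula already gives 285 columns, which is why the expansion is usually paired with feature selection or regularization.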

### Categorical transformations

Categorical variables represent discrete groups or labels. Different encoding strategies suit different situations.

**[One-hot encoding](/wiki/one-hot_encoding).** Each category becomes a separate binary column. This is the safest default for nominal (unordered) categories because it introduces no artificial ordinal relationship.[1] The downside is that high-cardinality features (such as zip codes with thousands of unique values) produce very wide, sparse matrices.
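
A hand-rolled version of one-hot encoding is sketched below to show what the output looks like. The function name and the `colors` sample are hypothetical, and the choice to encode unseen categories as all zeros is one possible convention, not the only one.

```python
def one_hot_encode(values, categories=None):
    """Map each value to a 0/1 vector with one position per known category."""
    if categories is None:
        categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        if v in index:            # categories not seen at fit time stay all zeros
            row[index[v]] = 1
        rows.append(row)
    return categories, rows


colors = ["red", "green", "blue", "green"]
cats, encoded = one_hot_encode(colors)
# cats    == ["blue", "green", "red"]
# encoded == [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]

# Reusing the training categories keeps the column layout stable at inference time.
_, new_rows = one_hot_encode(["purple"], categories=cats)
```

Each row has as many columns as there are categories, which is exactly what makes the representation wide and sparse for high-cardinality features.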

**Label encoding.** Each category is assigned a unique integer. This is appropriate for ordinal variables where the categories have a natural order (for example, "low", "medium", "high"). For nominal variables, label encoding can mislead the model into treating the integers as having meaningful magnitude.

**Target encoding (mean encoding).** Each category is replaced with the mean of the target variable for observations in that category. Target encoding is effective for high-cardinality features but introduces a risk of data leakage, because each row's encoded value is computed from target values that include its own.[2] Regularization techniques such as smoothing or leave-one-out encoding help mitigate this risk.[1]
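
The smoothing idea can be sketched as follows. This is one common additive-smoothing formulation, chosen here for illustration (other variants exist); the function names, the `smoothing` parameter, and the toy data are all assumptions. The mapping is learned on training rows only and then applied to other data.

```python
from collections import defaultdict


def fit_target_encoding(categories, targets, smoothing=10.0):
    """Learn a smoothed per-category target mean from training rows only."""
    global_mean = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    # Shrink each category mean toward the global mean; rare categories shrink more.
    mapping = {
        c: (sums[c] + smoothing * global_mean) / (counts[c] + smoothing)
        for c in counts
    }
    return mapping, global_mean


def apply_target_encoding(categories, mapping, global_mean):
    """Encode values; categories unseen during fitting fall back to the global mean."""
    return [mapping.get(c, global_mean) for c in categories]


train_city = ["A", "A", "A", "A", "B"]     # hypothetical categorical column
train_y = [1, 1, 0, 1, 0]                  # binary target
mapping, prior = fit_target_encoding(train_city, train_y, smoothing=1.0)
encoded = apply_target_encoding(["B", "C"], mapping, prior)
```

Here city "A" (four rows, raw mean 0.75) is encoded as roughly 0.72, while city "B" (a single row with target 0) is pulled up to roughly 0.30 rather than being encoded as 0. Within cross-validation, the fit step would be repeated on each training fold.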

**Frequency encoding.** Each category is replaced with its frequency (count or proportion) in the training set. This approach is useful when the prevalence of a category correlates with the target. It produces a single numeric column regardless of cardinality, although categories that happen to share the same frequency become indistinguishable after encoding.
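
A minimal sketch of frequency encoding, assuming proportions are learned on the training set and that unseen categories are encoded as 0.0 (a convention chosen for this example):

```python
from collections import Counter


def fit_frequency_encoding(train_values):
    """Learn each category's share of the training rows."""
    counts = Counter(train_values)
    total = len(train_values)
    return {c: n / total for c, n in counts.items()}


def apply_frequency_encoding(values, mapping):
    """Replace each category with its training proportion (0.0 if unseen)."""
    return [mapping.get(v, 0.0) for v in values]


train_browser = ["chrome", "chrome", "safari", "chrome", "firefox"]  # hypothetical
mapping = fit_frequency_encoding(train_browser)
encoded = apply_frequency_encoding(["chrome", "safari", "edge"], mapping)
# encoded == [0.6, 0.2, 0.0]
```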

| Encoding method | Handles high cardinality | Preserves ordinality | Risk of data leakage | Typical use case |
|---|---|---|---|---|
| One-hot | No (creates many columns) | No | None | Nominal features with few categories |
| Label | Yes | Yes | None | Ordinal features |
| Target | Yes | No | Moderate (needs regularization) | High-cardinality nominal features |
| Frequency | Yes | No | Low | High-cardinality features where category prevalence correlates with the target |

### Text transformations

Unstructured text must be converted into numeric vectors before a model can process it.

**Bag of words (BoW).** The simplest text representation counts the occurrences of each word in a document, ignoring word order and grammar. The result is a sparse vector whose length equals the vocabulary size.

**TF-IDF (Term Frequency-Inverse Document Frequency).** TF-IDF extends bag of words by weighting each term according to how common it is across the entire corpus.[1] Words that appear in many documents (such as "the" or "is") receive low weights, while words that are distinctive to a few documents receive high weights. TF-IDF remains widely used in search engines, text classification, and keyword extraction.
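
The weighting can be illustrated with a textbook formulation: term frequency times `log(N / df)`, where `N` is the number of documents and `df` is the number of documents containing the term. This is a sketch under that assumption; libraries typically use smoothed or normalized variants, so their numbers will differ. Tokenization here is a naive whitespace split.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)  # these raw counts are the bag-of-words vector
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


docs = ["the cat sat", "the dog barked", "the cat purred"]  # toy corpus
weights = tf_idf(docs)
```

In this toy corpus "the" appears in every document and receives a weight of exactly 0, "cat" appears in two of three documents and receives a small weight, and "sat" appears in only one and receives the largest weight in the first document.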

**N-grams.** Instead of treating each word independently, n-grams consider sequences of n consecutive words. Bigrams (n=2) can capture compound terms like "New York" or "machine learning" that a unigram model would split into separate tokens.
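
Word n-grams are straightforward to generate with a sliding window. The sketch below assumes whitespace tokenization, which is a simplification; the function name is hypothetical.

```python
def word_ngrams(text, n):
    """Return every run of n consecutive whitespace-separated tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


bigrams = word_ngrams("I moved to New York", 2)
# ["I moved", "moved to", "to New", "New York"]
```

The bigram "New York" now exists as a single token that a downstream bag-of-words or TF-IDF step can count.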

**[Word embeddings](/wiki/word_embedding).** Dense, low-dimensional vector representations learned from large text corpora.[7] Word2Vec, GloVe, and FastText are among the most widely used pre-trained embedding models. Unlike bag of words, embeddings capture semantic similarity: words with similar meanings are mapped to nearby points in the vector space. More recent transformer-based models such as BERT produce contextualized embeddings where the same word receives different vectors depending on surrounding context.

### Date and time transformations

Timestamps contain implicit patterns that raw numeric representations obscure.

**Component extraction.** Extracting the hour, day of week, month, quarter, or year from a timestamp creates individual features that capture periodic patterns. For example, retail sales tend to spike on weekends and during holiday months.

**Cyclical encoding.** Hours, days, and months are cyclical: hour 23 is close to hour 0, and December is close to January. Encoding these features as sine and cosine pairs preserves this circular relationship.[1] For a value x of a feature with period P, the encoded pair is sin(2πx / P) and cos(2πx / P). This ensures that the Euclidean distance between the encoded representations of 23:00 and 00:00 is small, matching the intuitive closeness of those times.
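
The distance claim can be checked directly. The sketch below encodes three hours of the day with period 24 and compares Euclidean distances; the function name is hypothetical.

```python
import math


def cyclical_encode(x, period):
    """Map a periodic value onto the unit circle as a (sin, cos) pair."""
    angle = 2 * math.pi * x / period
    return math.sin(angle), math.cos(angle)


h23, h0, h12 = (cyclical_encode(h, 24) for h in (23, 0, 12))

near = math.dist(h23, h0)   # 23:00 vs 00:00, about 0.26
far = math.dist(h0, h12)    # 00:00 vs 12:00, about 2.0 (opposite sides of the circle)
```

With a raw integer encoding the same two pairs would be 23 and 12 units apart, which gets the relationship backwards. Both components are needed: the sine alone cannot distinguish, for example, 03:00 from 09:00.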

**Time deltas.** Computing the elapsed time between events (for example, days since last purchase, seconds between page views) often reveals behavioral patterns that static timestamps miss.
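
Component extraction and time deltas can both be computed with Python's `datetime` module. The feature names and the two timestamps below are hypothetical.

```python
from datetime import datetime


def timestamp_features(ts, reference):
    """Extract calendar components from ts plus the elapsed days since reference."""
    return {
        "hour": ts.hour,
        "day_of_week": ts.weekday(),            # Monday = 0 ... Sunday = 6
        "month": ts.month,
        "quarter": (ts.month - 1) // 3 + 1,
        "is_weekend": ts.weekday() >= 5,
        "days_since_reference": (ts - reference).days,
    }


last_purchase = datetime(2024, 1, 15, 9, 30)
current_visit = datetime(2024, 3, 2, 18, 45)    # a Saturday
features = timestamp_features(current_visit, last_purchase)
```

For these inputs the result has `hour` 18, `day_of_week` 5, `quarter` 1, `is_weekend` true, and `days_since_reference` 47.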

### Geospatial transformations

Location data in the form of latitude and longitude coordinates requires specialized treatment.

**Distance calculations.** The Haversine formula computes the great-circle distance between two points on the Earth's surface. Features like "distance to city center" or "distance to nearest hospital" are common in real estate, logistics, and ride-hailing models.
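
A compact sketch of the Haversine formula follows. It assumes a spherical Earth with a mean radius of 6371 km, which is an approximation; the coordinates for Paris and London are rounded city-center values used only for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (spherical approximation)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# A quarter of the way around the equator is one quarter of the circumference.
quarter_equator = haversine_km(0, 0, 0, 90)

# Approximate city-center coordinates (assumed values).
paris_london = haversine_km(48.8566, 2.3522, 51.5074, -0.1278)
```

A "distance to city center" feature is then just this function applied between each row's coordinates and one fixed reference point.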

**Spatial clustering.** Applying k-means or DBSCAN to coordinates groups nearby points into clusters, creating a categorical "region" feature that can capture local effects.

**Proximity counts.** Counting the number of points of interest within a given radius (for example, restaurants within 500 meters) provides a density signal useful for property valuation and urban analytics.

## Feature creation

Beyond transforming existing columns, practitioners often create entirely new features by combining or aggregating raw data.

### Interaction features

[Interaction features](/wiki/feature_cross) capture joint effects that individual features cannot express.[1] Multiplying a building's footprint area by its number of floors gives its total floor area, a feature that may be more predictive than either input alone. Ratio features (feature A divided by feature B) and difference features (feature A minus feature B) are also common. In marketing analytics, dividing total revenue by number of transactions gives average order value, a feature with strong predictive power for customer lifetime value.
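
The average order value example can be sketched as a ratio feature. The main practical detail is the zero denominator; the fallback of 0.0 used here is an assumption, and a missing-value marker would be an equally reasonable choice. The customer records are made up.

```python
def safe_ratio(numerator, denominator, default=0.0):
    """Divide, returning a default when the denominator is zero."""
    return numerator / denominator if denominator else default


customers = [
    {"id": 1, "revenue": 500.0, "transactions": 10},
    {"id": 2, "revenue": 90.0, "transactions": 3},
    {"id": 3, "revenue": 0.0, "transactions": 0},   # new customer, no orders yet
]

for c in customers:
    c["avg_order_value"] = safe_ratio(c["revenue"], c["transactions"])

aov = [c["avg_order_value"] for c in customers]   # [50.0, 30.0, 0.0]
```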

### Domain-specific features

The most powerful features often come from deep understanding of the problem domain rather than generic transformations. Examples include:

- **Finance:** Moving averages, volatility measures, debt-to-income ratios.
- **Healthcare:** Body mass index (weight divided by height squared), age-adjusted lab values.
- **Natural language processing:** Sentiment scores, readability indices, named entity counts.
- **E-commerce:** Recency, frequency, and monetary (RFM) scores for customer segmentation.

Collaborating with subject matter experts is one of the most reliable ways to discover high-value domain features that automated tools would miss.[2]

### Aggregation features

When working with relational data (for example, a customer table joined to a transactions table), aggregation features summarize child records for each parent entity.[3] Common aggregations include count, sum, mean, min, max, and standard deviation. A customer's average transaction amount over the past 90 days is a typical aggregation feature in credit scoring.
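
A minimal sketch of aggregation features, assuming a flat list of transaction records keyed by a hypothetical `customer_id` field. In practice this is usually a group-by in SQL or a dataframe library, and a time window such as "past 90 days" would be applied by filtering the rows first.

```python
from collections import defaultdict
from statistics import mean


def aggregate_by_customer(rows):
    """Summarize each customer's transaction amounts into one feature row."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["customer_id"]].append(row["amount"])
    return {
        cid: {
            "txn_count": len(amounts),
            "txn_sum": sum(amounts),
            "txn_mean": mean(amounts),
            "txn_max": max(amounts),
        }
        for cid, amounts in grouped.items()
    }


transactions = [   # hypothetical child table
    {"customer_id": "c1", "amount": 20.0},
    {"customer_id": "c2", "amount": 5.0},
    {"customer_id": "c1", "amount": 40.0},
    {"customer_id": "c1", "amount": 60.0},
]
customer_features = aggregate_by_customer(transactions)
```

Each parent entity ends up with exactly one row of summary features, regardless of how many child records it has.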

## Feature selection

After generating a large pool of candidate features, selecting the most informative subset prevents overfitting and speeds up training.[11] Feature selection methods fall into three broad families.[9]

### Filter methods

Filter methods evaluate each feature independently of the model, using statistical measures to rank features by relevance.[11]

- **Correlation coefficient.** Pearson correlation measures linear relationships; Spearman and Kendall capture monotonic ones. Features with near-zero correlation to the target can often be dropped.
- **Mutual information.** Measures the amount of information that one variable provides about another. Unlike correlation, mutual information captures nonlinear dependencies.[9]
- **ANOVA F-test.** Tests whether the means of a numerical feature differ significantly across target classes. Useful for classification tasks.
- **Chi-squared test.** Evaluates the independence between categorical features and a categorical target.

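Filter methods are computationally cheap and scale well, but because each feature is scored in isolation, they miss feature interactions and cannot detect redundancy among the features they keep.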
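
As an illustration of a filter method, the sketch below ranks features by the absolute Pearson correlation with the target. The feature names and values are invented, and treating a constant column as having zero correlation is a convention adopted for this example (the coefficient is undefined in that case).

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def rank_by_correlation(features, target):
    """Score every feature independently and sort from most to least correlated."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


features = {   # hypothetical house-price columns
    "sqft": [800, 1000, 1200, 1500, 2000],
    "house_number": [12, 7, 31, 4, 18],
    "constant": [1, 1, 1, 1, 1],
}
price = [160, 200, 240, 300, 400]
ranking = rank_by_correlation(features, price)
```

Here `sqft` scores close to 1.0, `house_number` scores close to 0.1, and the constant column scores 0.0, so a simple threshold would keep only the first.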

### Wrapper methods

Wrapper methods train a model on different feature subsets and evaluate each subset using cross-validated performance.[11]

- **Recursive feature elimination (RFE).** Starts with all features, trains a model, removes the least important feature, and repeats until the desired number of features remains. Scikit-learn's `RFE` and `RFECV` classes automate this process.[9]
- **Forward selection.** Starts with no features and adds the one that improves performance the most at each step.
- **Backward elimination.** Starts with all features and removes the one whose removal hurts performance the least.

Wrapper methods tend to find better subsets than filter methods but are far more expensive because they require training the model many times.
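
The greedy loop behind forward selection can be sketched independently of any particular model. In the example below, `score_fn` stands in for cross-validated model performance on a feature subset; `toy_score` and its `GAIN` table are purely hypothetical numbers chosen so that the behavior is easy to follow.

```python
def forward_selection(candidates, score_fn, max_features):
    """Greedily add the feature that most improves score_fn; stop when none helps."""
    selected = []
    best_score = score_fn(selected)
    while len(selected) < max_features:
        trials = [(score_fn(selected + [f]), f) for f in candidates if f not in selected]
        if not trials:
            break
        trial_score, trial_feature = max(trials)
        if trial_score <= best_score:
            break                     # no remaining feature improves the score
        selected.append(trial_feature)
        best_score = trial_score
    return selected, best_score


# Hypothetical stand-in for a cross-validated metric.
GAIN = {"sqft": 0.50, "bedrooms": 0.20, "lot_size": 0.05, "house_number": 0.0}


def toy_score(subset):
    score = sum(GAIN[f] for f in subset) - 0.02 * len(subset)  # small cost per feature
    if "sqft" in subset and "bedrooms" in subset:
        score -= 0.19                 # the two features are largely redundant
    return score


selected, best = forward_selection(list(GAIN), toy_score, max_features=4)
```

With these numbers the loop picks `sqft`, then `lot_size`, and then stops: `bedrooms` looks useful on its own but adds nothing once `sqft` is present. This subset-dependent behavior is what filter methods miss, and the repeated calls to `score_fn` are where the computational cost comes from.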

### Embedded methods

Embedded methods perform feature selection as an integral part of the model training process.

- **Lasso (L1 regularization).** Lasso regression adds an L1 penalty to the loss function, which drives the coefficients of less important features to exactly zero. Features with zero coefficients are effectively removed.[9]
- **[Decision tree](/wiki/decision_tree) importance.** Tree-based models such as random forests and gradient boosting machines calculate feature importance based on the total reduction in impurity (for example, Gini impurity or entropy) that each feature contributes across all splits.[2] Features with low importance scores can be pruned.
- **Elastic net.** Combines L1 and L2 penalties, offering a balance between Lasso's sparsity and Ridge's stability when features are correlated.
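
To show how an L1 penalty produces exact zeros, the following sketch fits a Lasso model by cyclic coordinate descent with soft-thresholding. It assumes centered data with no intercept and minimizes `(1/2n) * ||y - Xw||^2 + lam * ||w||_1`; the data are synthetic, and a production solver would add convergence checks and standardization.

```python
def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam, returning exactly 0.0 inside [-lam, lam]."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iters=100):
    """Minimize (1/2n)*||y - Xw||^2 + lam*||w||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iters):
        for j in range(p):
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * w[k] for k in range(p) if k != j))
                for i in range(n)
            ) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            w[j] = soft_threshold(rho, lam) / z if z else 0.0
    return w


# Synthetic, centered data: y depends strongly on x1 and only weakly on x2.
x1 = [-2.0, -1.0, 0.0, 1.0, 2.0]
x2 = [1.0, -2.0, 0.0, 2.0, -1.0]
X = [[a, b] for a, b in zip(x1, x2)]
y = [3.0 * a + 0.1 * b for a, b in zip(x1, x2)]

w = lasso_coordinate_descent(X, y, lam=0.5)
```

The weak feature's coefficient comes out as exactly 0.0, while the strong feature's coefficient is shrunk from 3.0 to about 2.75. That shrinkage of the surviving coefficients is the price paid for sparsity.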

| Method family | Pros | Cons | Example techniques |
|---|---|---|---|
| Filter | Fast, scalable, model-agnostic | Ignores feature interactions | Correlation, mutual information, chi-squared |
| Wrapper | Finds strong subsets, accounts for interactions | Computationally expensive | RFE, forward selection, backward elimination |
| Embedded | Efficient, built into training | Tied to a specific model type | Lasso, tree-based importance, elastic net |

## How does feature engineering differ across model types?

The amount and type of feature engineering needed varies by algorithm family.

**Linear models** (linear regression, logistic regression, SVMs with linear kernels) learn weighted sums of inputs. They benefit heavily from scaling, normalization, polynomial features, and interaction terms because they cannot learn nonlinear relationships on their own.[1]

**Tree-based models** ([decision trees](/wiki/decision_tree), random forests, XGBoost, LightGBM) split the feature space using threshold rules. They are invariant to monotonic transformations, handle categorical variables natively (in some implementations), and automatically discover interaction effects through successive splits.[2] Feature scaling is unnecessary, though creating ratio or difference features can still help by reducing the number of splits required.

**[Deep learning](/wiki/deep_learning) models** (convolutional and recurrent neural networks, transformers) can learn hierarchical representations from raw data, reducing the need for manual feature engineering.[7] However, they still benefit from proper input normalization, embedding layers for categorical inputs, and data augmentation. On tabular data, deep learning models rarely outperform well-engineered gradient boosting pipelines without substantial effort.

| Model family | Scaling needed | Handles categoricals natively | Benefits from interaction terms | Typical feature engineering effort |
|---|---|---|---|---|
| Linear models | Yes | No | High | Heavy |
| Tree-based models | No | Partially | Moderate | Moderate |
| Neural networks | Yes (normalization) | Via embeddings | Low (learns them) | Light to moderate |

## Can feature engineering be automated?

Manual feature engineering is labor-intensive. Several open-source libraries aim to automate parts of the process.

**Featuretools** is the most widely adopted automated feature engineering library. It uses an algorithm called Deep Feature Synthesis (DFS), introduced by James Max Kanter and Kalyan Veeramachaneni at MIT in 2015, which stacks simple transformation and aggregation primitives across relational tables to generate complex features automatically.[3] For example, given a customers table and a transactions table, DFS can compute features like "mean transaction amount per customer in the last 30 days" without manual coding. The authors framed the core difficulty plainly: "Transforming raw data into features is often the part of the process that most heavily involves humans, because it is driven by intuition."[3] To test whether automation could match that intuition, they entered their Data Science Machine in 3 data science competitions that featured 906 other teams; it beat 615 of them, and in its best case beat 85.6% of the teams while reaching 95.7% of the top submission's score, with submissions "generally completed in under 12 hours" rather than the months a human team might spend.[3]

**autofeat** takes a different approach, generating nonlinear transformations of individual features (logarithms, squares, square roots) and then selecting the most useful ones with L1-regularized regression. It was designed with scientific datasets in mind and works well for heterogeneous data containing measurements with different physical units.[4]

**tsfresh** specializes in time-series data, automatically extracting hundreds of statistical features (autocorrelation, entropy, spectral peaks) from sequential observations.

While these tools accelerate exploration, they do not replace domain knowledge. The features they generate should be reviewed by a practitioner to ensure they are meaningful, free from leakage, and computationally feasible at inference time.[2]

## Feature stores and MLOps

As organizations scale from individual models to production ML systems, managing engineered features becomes a significant operational challenge. A feature store is a centralized repository that stores, versions, and serves precomputed features for both model training and real-time inference.

Key capabilities of a feature store include:

- **Reusability.** Features computed for one model can be shared across teams and projects, eliminating redundant work.
- **Consistency.** The same feature definitions are used in training and serving, preventing training-serving skew.
- **Versioning.** Historical feature values are retained, enabling reproducible experiments and point-in-time correct training datasets.
- **Low-latency serving.** Online feature stores provide low-latency lookups, typically on the order of milliseconds, for real-time prediction endpoints.
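
The idea of a point-in-time correct lookup can be sketched without any feature store product: for each training label, use the most recent feature value recorded at or before the label's timestamp, never a later one. The data layout below (a sorted list of `(timestamp, value)` pairs with integer timestamps) is an assumption for illustration and does not reflect any specific platform's API.

```python
from bisect import bisect_right


def point_in_time_lookup(history, as_of):
    """Return the latest feature value recorded at or before as_of, else None.

    history is a list of (timestamp, value) pairs sorted by timestamp.
    """
    timestamps = [ts for ts, _ in history]
    pos = bisect_right(timestamps, as_of)
    return history[pos - 1][1] if pos else None


# Hypothetical history of one customer's "30-day spend" feature.
spend_history = [(1, 100.0), (5, 140.0), (9, 90.0)]

label_times = [0, 5, 7, 12]
training_values = [point_in_time_lookup(spend_history, t) for t in label_times]
# [None, 140.0, 140.0, 90.0]
```

A label at time 7 sees the value written at time 5, not the one written at time 9. Joining on the latest value instead would leak future information into the training set.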

Prominent feature store platforms include Feast (open-source, highly configurable), Tecton (managed platform built by the creators of Uber's Michelangelo), and Hopsworks (an all-in-one AI lakehouse with strong governance features). Cloud providers also offer integrated feature stores within their ML platforms, such as Amazon SageMaker Feature Store and Google Vertex AI Feature Store.

## Historical context and representation learning

Feature engineering has been a central part of machine learning practice since the field's earliest days. In the 1990s and 2000s, the success of a machine learning project depended almost entirely on the skill of the practitioner in crafting features by hand. Computer vision researchers spent years designing features like SIFT (Scale-Invariant Feature Transform, introduced by David Lowe in 1999)[6] and HOG (Histogram of Oriented Gradients, introduced by Navneet Dalal and Bill Triggs in 2005)[5] that could robustly describe image content.

The rise of [deep learning](/wiki/deep_learning) shifted this paradigm. In 2006, Geoffrey Hinton and collaborators demonstrated that deep neural networks could learn useful representations directly from raw data through multiple layers of nonlinear transformations, a capability now called representation learning or [feature extraction](/wiki/feature_extraction).[7] The approach became dominant in the 2010s. The landmark success of AlexNet in the 2012 ImageNet competition showed that convolutional neural networks could learn image features far more powerful than any hand-designed descriptor: an ensemble of AlexNet models achieved a top-5 error rate of 15.3% in ILSVRC-2012, compared with 26.2% for the second-best entry, which relied on conventional hand-crafted features.[8]

Despite these advances, feature engineering remains essential for tabular and structured data, which still accounts for the majority of real-world ML applications.[1] Deep learning's ability to learn features automatically does not extend well to small, heterogeneous tabular datasets where gradient boosting with hand-engineered features continues to dominate benchmarks.

## Explain like I'm 5 (ELI5)

Imagine you are trying to guess which dog will win a race. You could look at every single thing about each dog: its color, its name, how many spots it has, what it ate for breakfast. But most of that information does not help you predict speed. Feature engineering is like picking out just the useful clues: the dog's leg length, its weight, and how fast it ran last time. You might even combine clues, like figuring out the ratio of leg length to body weight. The better your clues, the better your guess, even if your guessing method is simple.

## Best practices

1. **Start with exploratory data analysis.** Understand distributions, correlations, and missing value patterns before engineering features.
2. **Iterate between feature engineering and modeling.** Build a baseline model, examine errors, engineer features to address those errors, and repeat.
3. **Guard against data leakage.** Features must be computed using only information available at prediction time. Target encoding, for example, must use only training-fold statistics during cross-validation.[2]
4. **Document features.** Maintain a registry describing each feature's definition, source, and transformation logic. This is especially important in team settings.
5. **Monitor feature drift.** In production systems, the statistical properties of features can change over time. Detecting and responding to drift prevents silent model degradation.
6. **Use pipelines.** Scikit-learn's `Pipeline` and `ColumnTransformer` classes chain preprocessing steps together, ensuring that transformations are applied consistently during training and inference.[9]
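
The contract that such pipelines enforce (learn parameters on training data once, then apply the same parameters everywhere) can be sketched in plain Python. The `SimplePipeline`, `Log1p`, and `Standardizer` classes below are hypothetical teaching stand-ins that mimic the fit/transform pattern; they are not scikit-learn's classes.

```python
import math
from statistics import mean, pstdev


class Log1p:
    """Stateless step: nothing is learned during fit."""

    def fit(self, values):
        return self

    def transform(self, values):
        return [math.log1p(v) for v in values]


class Standardizer:
    """Stateful step: mean and standard deviation come from the training data."""

    def fit(self, values):
        self.mean_ = mean(values)
        self.std_ = pstdev(values) or 1.0   # avoid division by zero
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.std_ for v in values]


class SimplePipeline:
    """Chains steps so that each one is fitted on the output of the previous one."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, values):
        for step in self.steps:
            values = step.fit(values).transform(values)
        return self

    def transform(self, values):
        for step in self.steps:
            values = step.transform(values)
        return values


train = [0.0, 9.0, 99.0]                      # hypothetical training column
pipe = SimplePipeline([Log1p(), Standardizer()]).fit(train)
train_out = pipe.transform(train)
serve_out = pipe.transform([9.0])             # a single row at inference time
```

Because the serving call reuses the statistics learned during `fit`, the value 9.0 is encoded identically whether it arrives in a training batch or alone at inference time, which is the consistency property that guards against training-serving skew and leakage.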

## See also

- [OpenThoughts](/wiki/openthoughts)

## References

1. Zheng, A., & Casari, A. (2018). *Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists*. O'Reilly Media.
2. Kuhn, M., & Johnson, K. (2019). *Feature Engineering and Selection: A Practical Approach for Predictive Models*. CRC Press. http://www.feat.engineering/
3. Kanter, J. M., & Veeramachaneni, K. (2015). "Deep Feature Synthesis: Towards Automating Data Science Endeavors." *Proceedings of the IEEE International Conference on Data Science and Advanced Analytics (DSAA)*. https://groups.csail.mit.edu/EVO-DesignOpt/groupWebSite/uploads/Site/DSAA_DSM_2015.pdf
4. Horn, F., Pack, R., & Rieger, M. (2020). "The autofeat Python Library for Automated Feature Engineering and Selection." *Machine Learning under Resource Constraints*, Springer.
5. Dalal, N., & Triggs, B. (2005). "Histograms of Oriented Gradients for Human Detection." *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*.
6. Lowe, D. G. (1999). "Object Recognition from Local Scale-Invariant Features." *Proceedings of the IEEE International Conference on Computer Vision (ICCV)*.
7. Bengio, Y., Courville, A., & Vincent, P. (2013). "Representation Learning: A Review and New Perspectives." *IEEE Transactions on Pattern Analysis and Machine Intelligence*, 35(8), 1798-1828.
8. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). "ImageNet Classification with Deep Convolutional Neural Networks." *Advances in Neural Information Processing Systems (NeurIPS)*. https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks
9. Scikit-learn documentation. "Feature Selection." https://scikit-learn.org/stable/modules/feature_selection.html
10. Ng, A. (2013). "Machine Learning and AI via Brain Simulations." Stanford University lecture. Quoted in multiple sources regarding the centrality of feature engineering in applied ML.
11. Guyon, I., & Elisseeff, A. (2003). "An Introduction to Variable and Feature Selection." *Journal of Machine Learning Research*, 3, 1157-1182.
12. CrowdFlower (Figure Eight). (2016). *Data Science Report*. Survey finding that data scientists spend about 80% of their time on data preparation and that 76% view it as the least enjoyable part of their work.