# Synthetic Feature

> Source: https://aiwiki.ai/wiki/synthetic_feature
> Updated: 2026-07-12
> Categories: Data & Datasets, Data Science, Machine Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

A **synthetic feature** (also called a constructed feature or derived feature) is a variable created by transforming, combining, or otherwise manipulating one or more existing features in a dataset, rather than one that is measured or collected directly. Google's Machine Learning Glossary defines a synthetic feature as "a feature that is not present among the input features, but is created from one or more of them" [1]. Common ways to build one include feature crosses, bucketing a continuous value into range bins, polynomial transforms, and ratios that divide one feature by another. Because they are produced during the [feature engineering](/wiki/feature_engineering) process, synthetic features do not appear in the original raw data; they give [machine learning](/wiki/machine_learning) models information in a form that is easier to learn from. The term is used broadly across statistics, data science, and machine learning to describe any feature that a practitioner deliberately constructs rather than directly measures or collects.

Creating synthetic features is one of the most common and impactful steps in building predictive models. By encoding domain knowledge, mathematical relationships, or statistical summaries into new columns, data scientists can improve model accuracy, interpretability, and robustness. The payoff is well documented. In a widely cited 2012 paper, computer scientist Pedro Domingos wrote that "at the end of the day, some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used" [13].

## Explain like I'm 5 (ELI5)

Imagine you have a box of building blocks. Each block has a size and is made of a material (wood or metal). Now suppose you want to sort them by how heavy they feel, but you do not have a scale. You notice that bigger blocks are heavier, and metal blocks are heavier than wooden ones. So you make up a new rule: "heaviness score = size times material weight." That new score is not written on any block. You invented it by combining two things you already knew (size and material). In machine learning, a synthetic feature works the same way. It is a new piece of information you create by mixing together things you already have, so your model can make better predictions.

## Is a synthetic feature the same as synthetic data?

No. The two terms sound similar and are often confused, but they describe different things. A synthetic feature is a new column, a derived variable, computed from features that already exist in a dataset, such as a price-per-square-foot ratio or an age bucket. [Synthetic data](/wiki/synthetic_data), by contrast, means artificially generated records (rows) that imitate the statistical properties of real data. It is produced by simulations, generative models, or sampling methods, and used when real data is scarce, sensitive, or expensive to collect. Put simply, a synthetic feature adds a derived attribute to existing examples, while synthetic data creates new example records. A project can use both: a team might generate synthetic data to enlarge a training set and then engineer synthetic features on top of it.

## Why do machine learning models need synthetic features?

Raw datasets rarely contain all the information a model needs in an immediately usable form. A table of real estate listings, for example, might include the year a house was built and the current year, but not the house's age. A medical dataset might record a patient's height and weight but not their body mass index (BMI). In both cases, the relationship between existing columns carries predictive signal that the model cannot easily discover on its own, especially when using [linear regression](/wiki/linear_regression) or other models that assume a linear relationship between the inputs and the target.

Synthetic features address this gap. By explicitly constructing variables such as "age of house" (current year minus year built) or "BMI" (weight divided by height squared), the practitioner encodes domain knowledge directly into the data. This makes the model's job easier and often produces better results than relying on the model to infer these relationships from raw inputs alone.

The practice has deep roots in statistics, where variable transformation (such as taking the logarithm of a skewed variable or computing interaction terms in a regression model) has been standard for over a century. With the growth of modern machine learning, these techniques have been systematized, expanded, and in some cases automated.

## What are the main types of synthetic features?

Synthetic features can be grouped into several broad categories based on how they are constructed. The table below summarizes the main types.

| Type | Description | Example |
|---|---|---|
| Arithmetic combinations | New features formed by adding, subtracting, multiplying, or dividing existing features | Profit = shelf price - warehouse price |
| Ratio features | The quotient of two features, often expressing a rate or density | Population density = population / area |
| Polynomial features | Existing features raised to a power or multiplied together | $$x^2$$, $$x_1 x_2$$ |
| Interaction terms | Products of two or more features that capture joint effects | bedrooms * square footage |
| [Feature cross](/wiki/feature_cross) | Cartesian product of two or more categorical or bucketized features | latitude_bucket x longitude_bucket |
| Logarithmic or power transforms | Mathematical functions applied to reduce skew or stabilize variance | $$\log(\text{income})$$, $$\sqrt{\text{distance}}$$ |
| Binning (bucketizing) | Converting a continuous variable into discrete intervals | Age groups: 0-17, 18-34, 35-54, 55+ |
| Date/time extraction | Components extracted from timestamps | Hour of day, day of week, month, is_weekend |
| Cyclical encoding | Sine and cosine transforms of periodic features | $$\sin(2\pi \cdot \text{hour} / 24)$$, $$\cos(2\pi \cdot \text{hour} / 24)$$ |
| Aggregation features | Statistical summaries computed over groups or windows | Mean purchase amount per customer, rolling 7-day average |
| Text-derived features | Numerical representations extracted from text data | Word count, [TF-IDF](/wiki/bag_of_words) scores, [word embedding](/wiki/word_embedding) vectors |
| Indicator (dummy) variables | Binary flags encoding the presence or absence of a condition | is_holiday, has_garage, is_missing_value |
| Target encoding | Replacing a categorical value with a statistic of the target variable | Mean house price for each zip code |

## Arithmetic and ratio features

The simplest synthetic features are formed by applying basic arithmetic operations to existing columns. If a dataset contains both the purchase price and the selling price of an item, subtracting the purchase price from the selling price yields a profit feature. If it contains distance and time, dividing distance by time produces a speed feature.

Ratio features are especially useful because they normalize one quantity by another, making comparisons across different scales meaningful. In real estate modeling, for instance, price per square foot is often more predictive than raw price or raw square footage alone. In web analytics, click-through rate (clicks divided by impressions) is more informative than either raw count.

These features are easy to construct and interpret, which makes them a good starting point in any feature engineering workflow. However, care must be taken when the denominator can be zero, as this produces undefined values that require handling (for example, by adding a small constant or by treating the zero case separately).
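As a minimal sketch of the two zero-denominator strategies mentioned above, the following pandas snippet computes a click-through rate. The `ads` table, its column names, and the value of `eps` are made up for illustration.

```python
import pandas as pd

# Hypothetical web-analytics rows; the last two have zero impressions.
ads = pd.DataFrame({
    "clicks": [10, 0, 5, 0],
    "impressions": [200, 50, 0, 0],
})

# Option 1: treat the zero case separately. Zero denominators become NaN,
# and an indicator column records that the ratio was undefined.
safe_impressions = ads["impressions"].where(ads["impressions"] != 0)
ads["ctr"] = ads["clicks"] / safe_impressions
ads["ctr_missing"] = ads["ctr"].isna().astype(int)

# Option 2: add a small constant to the denominator (eps is an arbitrary choice).
# Note that a nonzero numerator over a zero denominator then becomes a huge value.
eps = 1e-9
ads["ctr_eps"] = ads["clicks"] / (ads["impressions"] + eps)

print(ads)
```

Which option is preferable depends on the model: the indicator approach keeps the undefined case explicit, while the small-constant approach avoids missing values at the cost of extreme outliers.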

## Polynomial features

Polynomial features are created by raising existing features to integer powers or by multiplying features together. They let linear models capture nonlinear relationships by expanding the input into a basis of higher-order terms [15]. As Google's Machine Learning Crash Course notes, when a practitioner's domain knowledge suggests that one variable relates to the square, cube, or other power of another, "it's useful to create a synthetic feature from one of the existing numerical features" [2]. For a two-dimensional input sample $$[a, b]$$, the degree-2 polynomial features are $$[1, a, b, a^2, ab, b^2]$$ [3].

This technique is motivated by the observation that many real-world relationships involve powers of variables. Gravitational force depends on the inverse square of the distance between two masses. Kinetic energy depends on the square of velocity. When a data scientist suspects such a relationship, adding a squared term as a synthetic feature enables a [linear regression](/wiki/linear_regression) model to fit a curve rather than a straight line.

### Mathematical formulation

Given an input vector $$\mathbf{x} = (x_1, x_2, \ldots, x_n)$$ and a maximum degree $$d$$, the polynomial feature expansion generates all monomials of the form:

$$
x_1^{k_1} \cdot x_2^{k_2} \cdots x_n^{k_n}
$$

where $$k_1 + k_2 + \cdots + k_n \le d$$ and each $$k_i \ge 0$$.

The number of output features (including the bias term) is given by the binomial coefficient $$\binom{n + d}{d}$$. For example, with $$n = 2$$ input features and degree $$d = 2$$, the output contains $$\binom{4}{2} = 6$$ features. With $$n = 3$$ and $$d = 3$$, it grows to $$\binom{6}{3} = 20$$ features.
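The count can be checked by brute force. The sketch below uses only the Python standard library; the helper name `count_polynomial_features` is hypothetical. It enumerates every exponent tuple that satisfies the constraint and compares the total with the binomial coefficient.

```python
from itertools import product
from math import comb


def count_polynomial_features(n: int, d: int) -> int:
    """Count exponent tuples (k_1, ..., k_n) with every k_i >= 0 and sum <= d."""
    return sum(1 for ks in product(range(d + 1), repeat=n) if sum(ks) <= d)


for n, d in [(2, 2), (3, 3)]:
    enumerated = count_polynomial_features(n, d)
    closed_form = comb(n + d, d)
    print(n, d, enumerated, closed_form)
# 2 2 6 6
# 3 3 20 20
```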

### Implementation in scikit-learn

The [scikit-learn](/wiki/scikit-learn) library provides the `PolynomialFeatures` class in its preprocessing module, which the documentation describes as generating "a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree" [3]:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2, 3],
              [4, 5]])

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
print(poly.get_feature_names_out())
# ['x0' 'x1' 'x0^2' 'x0 x1' 'x1^2']
print(X_poly)
# [[ 2.  3.  4.  6.  9.]
#  [ 4.  5. 16. 20. 25.]]
```

The key parameters of `PolynomialFeatures` are summarized below.

| Parameter | Default | Description |
|---|---|---|
| degree | 2 | Maximum degree of polynomial features. Can also accept a (min_degree, max_degree) tuple. |
| interaction_only | False | If True, only interaction features (products of distinct input features) are produced. Self-powers like x^2 are excluded. |
| include_bias | True | If True, a column of ones is included as a bias (intercept) term. |
| order | 'C' | Memory layout of the output array. 'F' (Fortran order) can be faster to compute. |

### Feature explosion warning

The number of polynomial features grows rapidly with both the number of input features and the degree. For 10 input features at degree 3, the output contains $$\binom{13}{3} = 286$$ features. At degree 5 with the same 10 inputs, the count rises to $$\binom{15}{5} = 3{,}003$$. This combinatorial growth increases the risk of [overfitting](/wiki/overfitting) and the computational cost. Practitioners typically keep the degree at 2 or 3 and combine polynomial expansion with [regularization](/wiki/regularization) (Lasso, Ridge, or Elastic Net) or feature selection to control model complexity [4].

## Interaction terms

An interaction term is a synthetic feature formed by multiplying two or more original features. It captures the idea that the effect of one feature on the target variable may depend on the value of another feature. In statistical modeling, this concept has been used for decades in the form of interaction effects in analysis of variance (ANOVA) and multiple regression.

Consider predicting house prices. The value added by an extra bedroom might be much higher for a large house (say, 3,000 square feet) than for a small apartment (600 square feet). A model with only separate features for bedrooms and square footage cannot capture this joint effect. Adding the interaction term bedrooms * square_footage allows the model to learn that the combination matters.

Interaction terms differ from full polynomial features in that they only include products of distinct features, not powers of individual features. In scikit-learn, setting `interaction_only=True` in `PolynomialFeatures` produces only interaction terms.

### When to use interaction terms

Interaction terms are most useful when:

- Domain knowledge suggests that two variables have a joint effect on the outcome.
- Exploratory data analysis reveals that the relationship between a feature and the target changes at different levels of another feature.
- A model with main effects alone shows systematic patterns in its residuals that suggest missed interactions.

## Feature crosses

A [feature cross](/wiki/feature_cross) is a synthetic feature created by taking the Cartesian product of two or more categorical or bucketized features [5]. Google's Machine Learning Glossary describes it as "a synthetic feature formed by 'crossing' categorical or bucketed features" [1]. While polynomial transforms operate on numerical data, feature crosses operate on categorical data. Both serve the same purpose: enabling linear models to learn nonlinear relationships.

For example, consider a leaf classification task with two categorical features: edge type (smooth, toothed, lobed) and leaf arrangement (opposite, alternate). Crossing these two features produces six combined categories: smooth_opposite, smooth_alternate, toothed_opposite, toothed_alternate, lobed_opposite, lobed_alternate. Each combination is encoded as a separate binary feature.
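One simple way to build such a cross in pandas is to join the category values into a single string and then one-hot encode the result. This is only a sketch; the `leaves` table is invented for illustration.

```python
import pandas as pd

# Hypothetical leaf observations.
leaves = pd.DataFrame({
    "edge": ["smooth", "toothed", "lobed", "smooth"],
    "arrangement": ["opposite", "alternate", "opposite", "alternate"],
})

# Cross the two categorical features by joining their values.
leaves["edge_x_arrangement"] = leaves["edge"] + "_" + leaves["arrangement"]

# One binary column per combination that actually occurs in the data.
crossed = pd.get_dummies(leaves["edge_x_arrangement"], dtype=int)
print(crossed)
```

Note that `get_dummies` only creates columns for combinations present in the sample (four here, not all six). If every possible combination needs a column, the crossed values can first be converted to a categorical type with the full list of categories.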

A well-known application comes from geospatial modeling. Individually, latitude and longitude have limited predictive power for property values. But crossing bucketized latitude with bucketized longitude defines specific grid cells (roughly, city blocks), and the model can learn that certain cells command higher prices than others.

### Sparsity considerations

Feature crosses can produce very high-dimensional, sparse feature spaces. Crossing a 100-element sparse feature with a 200-element sparse feature results in a 20,000-element feature. This sparsity increases memory consumption and can slow training. Techniques such as hashing and dimensionality reduction help manage the resulting feature space.
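The hashing idea can be sketched with the standard library alone. The function name `hashed_cross`, the bucket count, and the example bucket labels below are all assumptions made for illustration, not part of any particular library.

```python
import hashlib

NUM_BUCKETS = 1000  # assumed size, far smaller than the 20,000 possible pairs


def hashed_cross(a: str, b: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a pair of categorical values to a bucket index in [0, num_buckets)."""
    key = f"{a}|{b}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # A cryptographic hash is used only because it is stable across runs.
    return int(digest, 16) % num_buckets


idx = hashed_cross("lat_bucket_17", "lon_bucket_142")
print(idx)
```

The trade-off is that distinct pairs can collide in the same bucket, so the number of buckets controls a balance between memory use and collision rate.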

## Logarithmic and power transforms

Applying mathematical functions like log, square root, or Box-Cox transforms to individual features is a longstanding technique in statistics. These transforms serve several purposes:

- **Reducing skew.** Many real-world distributions (income, population, web traffic) are heavily right-skewed. Taking the logarithm pulls in extreme values and produces a more symmetric distribution, which benefits models that are sensitive to extreme values and feature scale (such as linear regression and [logistic regression](/wiki/logistic_regression)).
- **Stabilizing variance.** When the variance of a variable increases with its mean (a phenomenon called heteroscedasticity), a log or square root transform can equalize the variance across the range.
- **Linearizing relationships.** If the relationship between a feature and the target follows a power law or exponential curve, a log transform can make it approximately linear.

The choice of transform should be guided by the data distribution and domain knowledge. It is important to handle zero and negative values appropriately, since the logarithm is undefined for non-positive numbers. Common workarounds include log(x + 1) or the inverse hyperbolic sine transform.
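Both workarounds are available in NumPy. The sketch below uses made-up values to show `np.log1p` for non-negative data that includes zero and `np.arcsinh` for data that can be negative.

```python
import numpy as np

# Non-negative, right-skewed values that include zero.
income = np.array([0.0, 9.0, 99.0, 999.0, 9999.0])
log_income = np.log1p(income)  # log(x + 1), defined at x = 0

# Values that can be negative, for example account balances.
balance = np.array([-1000.0, -1.0, 0.0, 1.0, 1000.0])
asinh_balance = np.arcsinh(balance)  # symmetric around zero, log-like for large |x|

print(log_income)
print(asinh_balance)
```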

## Binning and bucketizing

Binning (also called discretization or bucketizing) converts a continuous numerical feature into a set of discrete intervals (bins), a standard quantization step in feature engineering practice [16]. Google's Machine Learning Glossary describes bucketing as "converting a single feature into multiple binary features called buckets or bins, typically based on a value range" [1]. Each data point is assigned to the bin that contains its value, and the bin membership is then encoded as a categorical feature (often using [one-hot encoding](/wiki/one-hot_encoding)).

There are several common binning strategies:

| Strategy | Description | Best for |
|---|---|---|
| Fixed-width (uniform) | Divides the range into equal-width intervals | Uniformly distributed data |
| Quantile-based | Creates bins with approximately equal numbers of observations | Skewed data |
| Domain-driven | Uses meaningful thresholds defined by domain experts | Variables with known breakpoints (e.g., age groups, income brackets) |
| Logarithmic | Bin widths increase exponentially | Data spanning several orders of magnitude |

Binning can reveal nonlinear patterns that a linear model would otherwise miss. For example, the relationship between age and insurance risk may not be linear, but grouping ages into brackets (18-25, 26-35, 36-50, 51-65, 66+) allows the model to assign different risk levels to each bracket. Binned features are also useful as inputs to feature crosses.

The main disadvantage of binning is information loss: the model can no longer distinguish between values within the same bin. Choosing too few bins loses detail; choosing too many leaves few examples in each bin, so the per-bin estimates become noisy and the benefit over the original continuous feature shrinks.
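As a sketch of two of the strategies in the table, pandas offers `pd.cut` for explicit (domain-driven or fixed-width) edges and `pd.qcut` for quantile-based bins. The ages and bin edges below are illustrative.

```python
import pandas as pd

ages = pd.Series([3, 17, 18, 34, 35, 54, 55, 90])

# Domain-driven bins for the age groups 0-17, 18-34, 35-54, 55+.
# right=False makes each interval closed on the left: [0, 18), [18, 35), ...
age_group = pd.cut(
    ages,
    bins=[0, 18, 35, 55, float("inf")],
    labels=["0-17", "18-34", "35-54", "55+"],
    right=False,
)

# Quantile-based bins: four bins with roughly equal numbers of observations.
age_quartile = pd.qcut(ages, q=4, labels=False)

print(pd.DataFrame({"age": ages, "group": age_group, "quartile": age_quartile}))
```

The resulting `age_group` column is categorical and can be one-hot encoded or used as an input to a feature cross.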

## Date and time features

Timestamp columns contain rich temporal information that most models cannot use directly. Extracting components from a datetime object produces several useful synthetic features:

- **Hour of day** (0-23)
- **Day of week** (Monday through Sunday)
- **Day of month** (1-31)
- **Month** (1-12)
- **Quarter** (1-4)
- **Year**
- **Is weekend** (binary flag)
- **Is holiday** (binary flag, requires a holiday calendar)
- **Time since an event** (e.g., days since last purchase)

In Python with pandas, these can be extracted using the `.dt` accessor:

```python
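import pandas as pd

# Small example frame; in practice df comes from your own data.
df = pd.DataFrame({'timestamp': pd.to_datetime(['2024-03-01 08:30', '2024-03-02 23:15'])})

df['hour'] = df['timestamp'].dt.hour
df['day_of_week'] = df['timestamp'].dt.dayofweek  # Monday=0, Sunday=6
df['month'] = df['timestamp'].dt.month
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)  # Saturday or Sunday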
```

### Cyclical encoding

Many time-based features are cyclical: hour 23 is close to hour 0, December is close to January, and Sunday is close to Monday. Encoding these as plain integers misleads distance-based and linear models, which treat 23 and 0 as far apart numerically.

Cyclical encoding addresses this by mapping each cyclical feature onto a circle using sine and cosine transforms [6]:

```python
import numpy as np

df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)
```

This produces two features that together preserve the circular distance between time points. The technique works well with [neural networks](/wiki/neural_network) and linear models. Tree-based models such as [random forests](/wiki/random_forest) and [gradient boosting](/wiki/gradient_boosting) generally do not require cyclical encoding because they can learn non-monotonic splits on integer-encoded time features.
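To see what "preserving circular distance" means in practice, the following sketch compares Euclidean distances between encoded hours. The helper names `encode_hour` and `encoded_distance` are hypothetical.

```python
import numpy as np


def encode_hour(hour):
    """Return the (sin, cos) point on the unit circle for an hour in [0, 24)."""
    angle = 2 * np.pi * np.asarray(hour) / 24
    return np.stack([np.sin(angle), np.cos(angle)], axis=-1)


def encoded_distance(h1, h2):
    """Euclidean distance between two hours after cyclical encoding."""
    return float(np.linalg.norm(encode_hour(h1) - encode_hour(h2)))


d_23_0 = encoded_distance(23, 0)    # one hour apart, across midnight
d_11_12 = encoded_distance(11, 12)  # one hour apart, at midday
d_0_12 = encoded_distance(0, 12)    # opposite sides of the clock

print(d_23_0, d_11_12, d_0_12)
```

With integer encoding, hours 23 and 0 differ by 23; after encoding, they are exactly as close as hours 11 and 12, and the largest distance is between hours twelve apart.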

## Encoding categorical features

Converting categorical variables into numerical form is itself a type of synthetic feature creation. The most common encoding methods are listed below.

| Encoding method | Description | Typical use case |
|---|---|---|
| [One-hot encoding](/wiki/one-hot_encoding) | Creates a binary column for each category | Low-cardinality nominal features |
| Label (ordinal) encoding | Assigns consecutive integers to categories | Ordinal features with a natural order |
| Binary encoding | Converts category indices to binary digits, with one column per bit | Medium-cardinality features |
| Target (mean) encoding | Replaces each category with the mean of the target variable for that category | High-cardinality features |
| Frequency encoding | Replaces each category with its frequency in the dataset | When category frequency carries signal |
| [Embedding](/wiki/embeddings) vectors | Learns a dense vector representation for each category via a neural network | Very high-cardinality features; [deep learning](/wiki/deep_learning) models |

### Target encoding and smoothing

Target encoding (also called mean encoding or likelihood encoding) replaces each category with the average target value for that category. For a [classification](/wiki/classification) task, the replacement value is the conditional probability of the positive class given the category. For [regression](/wiki/regression), it is the mean target value.

The main risk of target encoding is data leakage and overfitting, especially for rare categories with few observations. Smoothing mitigates this by blending the category-specific mean with the global mean:

$$
\text{encoded\_value} = \frac{\text{count} \cdot \text{category\_mean} + \text{smoothing} \cdot \text{global\_mean}}{\text{count} + \text{smoothing}}
$$

With this formula, categories that have many observations are encoded close to their own mean, while rare categories are pulled toward the global mean. The smoothing parameter controls the balance. Scikit-learn's `TargetEncoder` class can automatically select a suitable smoothing value using empirical Bayes variance estimates [7].
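The formula translates directly into a few lines of pandas. This is a minimal sketch with a fixed, hand-picked smoothing value and a hypothetical helper name. It is not scikit-learn's `TargetEncoder`, and because it fits on all rows at once it does nothing about leakage.

```python
import pandas as pd


def smoothed_target_encode(categories: pd.Series, target: pd.Series,
                           smoothing: float = 10.0) -> pd.Series:
    """Blend each category's mean target with the global mean."""
    global_mean = target.mean()
    stats = target.groupby(categories).agg(["count", "mean"])
    encoded = (stats["count"] * stats["mean"] + smoothing * global_mean) / (
        stats["count"] + smoothing
    )
    return categories.map(encoded)


# Toy data: category A has four observations, category B only one.
zips = pd.Series(["A", "A", "A", "A", "B"])
price = pd.Series([100.0, 100.0, 100.0, 100.0, 200.0])

enc = smoothed_target_encode(zips, price, smoothing=1.0)
print(enc.tolist())
# [104.0, 104.0, 104.0, 104.0, 160.0]
```

The global mean is 120. Category A, with four observations, moves only from 100 to 104, while the single-observation category B is pulled from 200 down to 160.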

## Text-derived features

Text data requires transformation into numerical features before it can be used by most machine learning models. Common approaches include:

- **[Bag of words](/wiki/bag_of_words):** Represents a document as a vector of word frequencies, ignoring word order and grammar.
- **TF-IDF (Term Frequency-Inverse Document Frequency):** Weights word frequencies by how rare each word is across the entire corpus, giving more importance to distinctive words.
- **[Word embeddings](/wiki/word_embedding):** Dense vector representations (such as [Word2Vec](/wiki/word2vec) or GloVe) that capture semantic relationships between words. Words with similar meanings have similar embedding vectors.
- **Character n-grams:** Counts of character subsequences, useful for capturing morphological patterns and handling misspellings.
- **Simple text statistics:** Word count, sentence count, average word length, proportion of uppercase letters, and similar surface-level features.

These text-derived features are synthetic in the sense that they are computed from the raw text and do not exist in the original dataset. In modern [natural language processing](/wiki/natural_language_processing), pretrained language models (such as BERT and GPT) produce contextual embeddings that serve as high-dimensional synthetic features for downstream tasks.
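The two simplest approaches in the list, surface statistics and bag of words, can be sketched with the standard library alone. The tokenization rule (runs of letters and apostrophes) and the function names are simplifying assumptions.

```python
import re
from collections import Counter


def text_features(text: str) -> dict:
    """Compute a few surface-level statistics for one document."""
    words = re.findall(r"[A-Za-z']+", text)
    letters = [c for c in text if c.isalpha()]
    return {
        "word_count": len(words),
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "uppercase_ratio": sum(c.isupper() for c in letters) / len(letters) if letters else 0.0,
    }


def bag_of_words(text: str) -> Counter:
    """Count lowercased word occurrences, ignoring order and grammar."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


doc = "The cat sat. The cat ran!"
stats = text_features(doc)
bow = bag_of_words(doc)
print(stats)
print(bow)
```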

## Aggregation and window features

When working with grouped or sequential data, aggregating existing features across groups or time windows produces informative synthetic features. Examples include:

- Mean, median, min, max, and standard deviation of a customer's past purchase amounts.
- Count of transactions in the last 7 days, 30 days, or 90 days.
- Ratio of the current value to the rolling mean (detecting anomalies or trends).
- Lag features: the value of a variable at a previous time step (t-1, t-2, etc.).
- Difference features: the change between consecutive time steps.

These features are common in time series forecasting, fraud detection, and recommendation systems. They encode temporal patterns and behavioral trends that raw point-in-time snapshots cannot capture.
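Several of the listed features can be sketched with pandas group-by operations. The `sales` table is invented for illustration. The statistics are computed from past rows only (via `shift`), so the current row's value does not feed into its own feature.

```python
import pandas as pd

# Hypothetical purchase history, one row per customer per day.
sales = pd.DataFrame({
    "customer": ["a", "a", "a", "b", "b"],
    "day": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03",
                           "2024-01-01", "2024-01-02"]),
    "amount": [10.0, 20.0, 30.0, 5.0, 15.0],
}).sort_values(["customer", "day"])

# Lag feature: the previous purchase amount of the same customer.
sales["lag_1"] = sales.groupby("customer")["amount"].shift(1)

# Difference feature: change since the previous purchase.
sales["diff_1"] = sales.groupby("customer")["amount"].diff()

# Aggregation feature: mean of all earlier purchases (current row excluded).
sales["past_mean"] = sales.groupby("customer")["amount"].transform(
    lambda s: s.shift(1).expanding().mean()
)

# Ratio of the current value to the customer's historical mean.
sales["ratio_to_past_mean"] = sales["amount"] / sales["past_mean"]

print(sales)
```

The first row of each customer has no history, so its derived values are missing and need to be imputed or flagged before modeling.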

## How does automated feature engineering work?

Manual feature engineering requires domain expertise and can be time-consuming. Automated feature engineering tools aim to generate large numbers of candidate features algorithmically and then select the most useful ones.

### Deep feature synthesis

Deep Feature Synthesis (DFS) is an algorithm introduced by Kanter and Veeramachaneni in 2015 that automatically creates features from relational and temporal data [8]. It works by:

1. Defining entities (tables) and the relationships between them.
2. Applying **transform primitives** (operations on a single table, such as computing the hour from a timestamp).
3. Applying **aggregation primitives** (operations across related tables, such as computing the mean of a customer's order amounts).
4. Stacking primitives to create "deep" features (for example, the standard deviation of the monthly average order amount).
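The steps above can be illustrated by hand in plain pandas. This is only a sketch of the idea on two invented tables. It does not use the Featuretools API, and the column names are hypothetical.

```python
import pandas as pd

# Step 1: two related tables (entities), linked by customer_id.
customers = pd.DataFrame({"customer_id": [1, 2]})
orders = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "placed_at": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-10",
                                 "2024-01-15", "2024-02-25"]),
    "amount": [10.0, 30.0, 50.0, 20.0, 40.0],
})

# Step 2: a transform primitive works within one table (month from a timestamp).
orders["month"] = orders["placed_at"].dt.month

# Step 3: an aggregation primitive summarizes child rows per parent row.
mean_amount = orders.groupby("customer_id")["amount"].mean().rename("mean_order_amount")

# Step 4: stacking, here the standard deviation of the monthly average order amount.
monthly_mean = orders.groupby(["customer_id", "month"])["amount"].mean()
std_monthly_mean = (
    monthly_mean.groupby(level="customer_id").std().rename("std_monthly_mean_amount")
)

features = (
    customers.join(mean_amount, on="customer_id")
             .join(std_monthly_mean, on="customer_id")
)
print(features)
```

An automated system applies many such primitives in every valid combination up to a chosen depth, which is why the number of candidate features grows quickly.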

When Kanter and Veeramachaneni tested their Data Science Machine (the system that implements DFS) against human analysts, it beat 615 of 906 teams across three data science competitions: 328 of 473 teams in the 2014 KDD Cup, 51 of 156 in the 2015 IJCAI repeat-buyer contest, and 237 of 277 in the 2015 KDD Cup [8]. The Featuretools library (maintained by Alteryx) provides an open-source Python implementation of DFS.

### Automated feature engineering tools

Several open-source libraries support automated feature generation.

| Tool | Focus area | Key capability |
|---|---|---|
| Featuretools | Relational and temporal data | Deep Feature Synthesis with customizable primitives |
| tsfresh | Time series data | Extracts hundreds of statistical, spectral, and nonlinear features from time series |
| Feature-engine | General tabular data | Scikit-learn-compatible transformers for encoding, discretization, and feature creation |
| tsflex | Time series data | Faster and more memory-efficient alternative to tsfresh |
| Category Encoders | Categorical data | 15+ encoding methods including target, binary, and hash encoding |

These tools reduce the manual effort involved in feature engineering but still require the practitioner to validate the generated features, check for data leakage, and manage the increased dimensionality.

## Which machine learning models benefit most from synthetic features?

The usefulness of synthetic features varies by model type. The table below compares how different model families interact with synthetic features.

| Model family | Needs synthetic features? | Reason |
|---|---|---|
| [Linear regression](/wiki/linear_regression), [logistic regression](/wiki/logistic_regression) | Often yes | Cannot represent nonlinear relationships without polynomial or interaction terms |
| [Decision trees](/wiki/decision_tree), [random forests](/wiki/random_forest), [gradient boosting](/wiki/gradient_boosting) | Sometimes | Can learn nonlinear splits natively, but ratio and aggregation features can still help |
| [Support vector machines](/wiki/support_vector_machine_svm) | Sometimes | Kernel trick handles some nonlinearity, but explicit features can improve linear kernels |
| [Neural networks](/wiki/neural_network), [deep learning](/wiki/deep_learning) | Less often | Automatically learn feature representations in hidden layers, but handcrafted features can accelerate training and improve results on small datasets |

As a general rule, simpler models benefit more from synthetic features, while complex models (especially deep neural networks) can discover useful representations on their own given enough data. However, even in deep learning pipelines, manually engineered features remain common in tabular data tasks, where neural networks have historically lagged behind tree-based methods. A 2022 benchmark study of 45 tabular datasets found that tree-based models such as XGBoost and random forests still outperformed deep learning approaches on medium-sized data (on the order of 10,000 samples), in part because uninformative features hurt neural networks more than they hurt tree-based models [9].

## Risks and best practices

Creating synthetic features introduces several risks that must be managed carefully.

### Overfitting

Adding too many features increases the capacity of the model to memorize the training data, leading to poor generalization. This is closely related to the curse of dimensionality: as the number of features grows relative to the number of training samples, the data becomes increasingly sparse in the high-dimensional feature space, and models need exponentially more data to maintain performance [10].

A commonly cited guideline is to maintain a sample-to-feature ratio of at least 10:1, with 20:1 or higher being preferable for stable and generalizable models.

### Data leakage

Some synthetic features can inadvertently leak information about the target variable into the training data. Target encoding is a common culprit: if the category mean is computed on the entire training set (including the current sample), it encodes target information that the model should not have access to at prediction time. Computing the encoding out-of-fold (cross-validated target encoding), so that each row's value is derived only from other rows, addresses the leakage; smoothing additionally limits overfitting to rare categories.
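A minimal sketch of out-of-fold target encoding is shown below. The function name, the random fold assignment, and the fallback to the training-fold mean for unseen categories are illustrative choices rather than a reference implementation.

```python
import numpy as np
import pandas as pd


def out_of_fold_target_encode(categories, target, n_splits=5, seed=0):
    """Encode each row with category means computed on the other folds only."""
    categories = pd.Series(categories).reset_index(drop=True)
    target = pd.Series(target, dtype=float).reset_index(drop=True)
    rng = np.random.default_rng(seed)
    fold_of_row = rng.permutation(len(target)) % n_splits
    encoded = pd.Series(np.nan, index=target.index)
    for fold in range(n_splits):
        held_out = fold_of_row == fold
        fit_means = target[~held_out].groupby(categories[~held_out]).mean()
        fallback = target[~held_out].mean()  # used for categories unseen in the fit rows
        values = categories[held_out].map(fit_means).fillna(fallback)
        encoded.loc[held_out] = values.to_numpy()
    return encoded


# Category B occurs once, with an extreme target value.
cats = pd.Series(["A"] * 6 + ["B"])
y = pd.Series([1.0] * 6 + [100.0])

naive = y.groupby(cats).transform("mean")   # B is encoded with its own target
oof = out_of_fold_target_encode(cats, y)    # B never sees its own target
print(naive.tolist())
print(oof.tolist())
```

In the naive version the single B row is encoded as 100, which is exactly its own target. In the out-of-fold version that row is encoded from the other rows only, so the feature no longer reveals the label.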

### Multicollinearity

Synthetic features are often correlated with the features they were derived from. For example, $$x$$ and $$x^2$$ are correlated, as are age and year_of_birth. High multicollinearity can destabilize coefficient estimates in linear models and make interpretation difficult. Checking variance inflation factors (VIF) and applying regularization are standard countermeasures.
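The variance inflation factor of feature $$j$$ is $$1 / (1 - R_j^2)$$, where $$R_j^2$$ comes from regressing that feature on all the others. The NumPy sketch below (with a hypothetical helper name) applies this to $$x$$ and $$x^2$$ for $$x = 1, \ldots, 10$$.

```python
import numpy as np


def variance_inflation_factors(X: np.ndarray) -> np.ndarray:
    """Return VIF_j = 1 / (1 - R^2_j) for every column j of X."""
    n, p = X.shape
    vifs = np.empty(p)
    for j in range(p):
        y = X[:, j]
        # Regress column j on an intercept plus all remaining columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        residual = y - others @ coef
        r2 = 1.0 - residual.var() / y.var()
        vifs[j] = 1.0 / (1.0 - r2)
    return vifs


x = np.arange(1.0, 11.0)
X = np.column_stack([x, x ** 2])
vifs = variance_inflation_factors(X)
print(vifs)
```

Both values come out well above 10, a level that common rules of thumb treat as a sign of problematic multicollinearity.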

### Computational cost

Polynomial and cross-product features can produce a very large number of new columns, increasing memory usage and training time. Feature selection methods (filter, wrapper, or embedded approaches) should be applied to prune uninformative features.

### Best practices summary

| Practice | Description |
|---|---|
| Start simple | Begin with arithmetic and ratio features before moving to polynomial or automated methods |
| Use domain knowledge | Features motivated by real-world understanding are more likely to generalize |
| Validate rigorously | Use [cross-validation](/wiki/cross-validation) to evaluate whether new features actually improve performance |
| Monitor feature importance | Remove features that do not contribute meaningfully, using permutation importance or SHAP values |
| Apply regularization | Use L1 ([Lasso](/wiki/regularization)), L2 (Ridge), or Elastic Net penalties to control complexity when using many synthetic features |
| Normalize after transforming | If a synthetic feature changes the scale of the data, apply [normalization](/wiki/normalization) or standardization |
| Watch for leakage | Ensure that no synthetic feature encodes future information or target values inappropriately |
| Document features | Record how each synthetic feature was created, including any parameters or thresholds used |

## How are synthetic features deployed in production?

In production [MLOps](/wiki/mlops) workflows, synthetic features must be computed consistently during both training and inference. A feature store is a centralized repository that stores feature definitions, computed feature values, and the code used to generate them [11]. Feature stores help teams:

- Reuse features across multiple models and projects.
- Ensure consistency between training-time and serving-time feature computation (avoiding training-serving skew).
- Version and audit features for governance and reproducibility.
- Serve precomputed features with low latency for real-time predictions.

Popular open-source and managed feature stores include Feast, Hopsworks, and the feature store components of Databricks and Amazon SageMaker.

## When did feature engineering emerge?

The idea of creating new variables from existing ones predates machine learning by many decades. In classical statistics, researchers routinely applied log transforms, computed interaction terms, and standardized variables as part of regression analysis. The Box-Cox transformation, introduced by George Box and David Cox in 1964, provided a systematic family of power transforms for normalizing data [12].

The term "feature engineering" became prominent in the machine learning community during the 2000s and 2010s, as practitioners recognized that the choice and construction of features often mattered more than the choice of algorithm. Andrew Ng summarized the manual effort in a 2013 Stanford lecture: "Coming up with features is difficult, time-consuming, requires expert knowledge," and "'Applied machine learning' is basically feature engineering" [14].

Research into automated feature engineering began in the 1990s, with commercial and open-source tools becoming available from 2016 onward. Deep Feature Synthesis (2015) and the subsequent release of Featuretools marked an important step toward reducing the manual burden of feature construction. More recently, deep learning approaches have shifted some of the feature engineering workload to the model itself, which learns internal representations (synthetic features, in a sense) through its hidden layers.

## See also

- [Feature engineering](/wiki/feature_engineering)
- [Feature extraction](/wiki/feature_extraction)
- [Feature cross](/wiki/feature_cross)
- [Feature vector](/wiki/feature_vector)
- [Feature set](/wiki/feature_set)
- [Feature importances](/wiki/feature_importances)
- [Overfitting](/wiki/overfitting)
- [Regularization](/wiki/regularization)
- [Preprocessing](/wiki/preprocessing)
- [Normalization](/wiki/normalization)
- [One-hot encoding](/wiki/one-hot_encoding)
- [Dimension reduction](/wiki/dimension_reduction)

## References

1. Google Developers. "synthetic feature." *Machine Learning Glossary*. https://developers.google.com/machine-learning/glossary
2. Google Developers. "Working with numerical data: Polynomial transforms." *Machine Learning Crash Course*. https://developers.google.com/machine-learning/crash-course/numerical-data/polynomial-transforms
3. Scikit-learn developers. "PolynomialFeatures." *scikit-learn documentation*, version 1.9. https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html
4. Hastie, T., Tibshirani, R., and Friedman, J. *The Elements of Statistical Learning: Data Mining, Inference, and Prediction*. 2nd ed. Springer, 2009.
5. Google Developers. "Categorical data: Feature crosses." *Machine Learning Crash Course*. https://developers.google.com/machine-learning/crash-course/categorical-data/feature-crosses
6. Scikit-learn developers. "Time-related feature engineering." *scikit-learn documentation*, version 1.9. https://scikit-learn.org/stable/auto_examples/applications/plot_cyclical_feature_engineering.html
7. Scikit-learn developers. "TargetEncoder." *scikit-learn documentation*, version 1.9. https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.TargetEncoder.html
8. Kanter, J. M. and Veeramachaneni, K. "Deep feature synthesis: Towards automating data science endeavors." *Proceedings of the IEEE International Conference on Data Science and Advanced Analytics (DSAA)*, 2015. https://ieeexplore.ieee.org/document/7344858
9. Grinsztajn, L., Oyallon, E., and Varoquaux, G. "Why do tree-based models still outperform deep learning on typical tabular data?" *Advances in Neural Information Processing Systems (NeurIPS)*, 2022.
10. Bellman, R. *Adaptive Control Processes: A Guided Tour*. Princeton University Press, 1961.
11. Hopsworks. "Feature Store: The Definitive Guide." https://www.hopsworks.ai/dictionary/feature-store
12. Box, G. E. P. and Cox, D. R. "An analysis of transformations." *Journal of the Royal Statistical Society, Series B*, 26(2):211-252, 1964.
13. Domingos, P. "A few useful things to know about machine learning." *Communications of the ACM*, 55(10):78-87, 2012.
14. Ng, A. "Machine Learning and AI via Brain simulations." Stanford University, 2013. https://ai.stanford.edu/~ang/slides/DeepLearning-Mar2013.pptx
15. Murphy, K. P. *Probabilistic Machine Learning: An Introduction*. MIT Press, 2022.
16. Zheng, A. and Casari, A. *Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists*. O'Reilly Media, 2018.