A synthetic feature (also called a constructed feature or derived feature) is a new variable created by transforming, combining, or otherwise manipulating one or more existing features in a dataset. Synthetic features do not appear in the original raw data; instead, they are produced during the feature engineering process to provide machine learning models with additional information that helps them learn patterns more effectively. The term is used broadly across statistics, data science, and machine learning to describe any feature that a practitioner deliberately constructs rather than directly measures or collects.
Creating synthetic features is one of the most common and impactful steps in building predictive models. According to Google's Machine Learning Crash Course, a synthetic feature is "a new feature created from existing numerical features based on domain knowledge" [1]. By encoding domain knowledge, mathematical relationships, or statistical summaries into new columns, data scientists can significantly improve model accuracy, interpretability, and robustness.
Imagine you have a box of colored building blocks. Each block has a color and a size. Now suppose you want to sort them by how heavy they feel, but you do not have a scale. You notice that bigger blocks are heavier, and metal blocks are heavier than wooden ones. So you make up a new rule: "heaviness score = size times material weight." That new score is not written on any block. You invented it by combining two things you already knew (size and material). In machine learning, a synthetic feature works the same way. It is a new piece of information you create by mixing together things you already have, so your model can make better predictions.
Raw datasets rarely contain all the information a model needs in an immediately usable form. A table of real estate listings, for example, might include the year a house was built and the current year, but not the house's age. A medical dataset might record a patient's height and weight but not their body mass index (BMI). In both cases, the relationship between existing columns carries predictive signal that the model cannot easily discover on its own, especially when using linear regression or other models that assume linear relationships among inputs.
Synthetic features address this gap. By explicitly constructing variables such as "age of house" (current year minus year built) or "BMI" (weight divided by height squared), the practitioner encodes domain knowledge directly into the data. This makes the model's job easier and often produces better results than relying on the model to infer these relationships from raw inputs alone.
The practice has deep roots in statistics, where variable transformation (such as taking the logarithm of a skewed variable or computing interaction terms in a regression model) has been standard for over a century. With the growth of modern machine learning, these techniques have been systematized, expanded, and in some cases automated.
Synthetic features can be grouped into several broad categories based on how they are constructed. The table below summarizes the main types.
| Type | Description | Example |
|---|---|---|
| Arithmetic combinations | New features formed by adding, subtracting, multiplying, or dividing existing features | Profit = shelf price - warehouse price |
| Ratio features | The quotient of two features, often expressing a rate or density | Population density = population / area |
| Polynomial features | Existing features raised to a power or multiplied together | x^2, x1 * x2 |
| Interaction terms | Products of two or more features that capture joint effects | bedrooms * square footage |
| Feature cross | Cartesian product of two or more categorical or bucketized features | latitude_bucket x longitude_bucket |
| Logarithmic or power transforms | Mathematical functions applied to reduce skew or stabilize variance | log(income), sqrt(distance) |
| Binning (bucketizing) | Converting a continuous variable into discrete intervals | Age groups: 0-17, 18-34, 35-54, 55+ |
| Date/time extraction | Components extracted from timestamps | Hour of day, day of week, month, is_weekend |
| Cyclical encoding | Sine and cosine transforms of periodic features | sin(2 pi * hour / 24), cos(2 pi * hour / 24) |
| Aggregation features | Statistical summaries computed over groups or windows | Mean purchase amount per customer, rolling 7-day average |
| Text-derived features | Numerical representations extracted from text data | Word count, TF-IDF scores, word embedding vectors |
| Indicator (dummy) variables | Binary flags encoding the presence or absence of a condition | is_holiday, has_garage, is_missing_value |
| Target encoding | Replacing a categorical value with a statistic of the target variable | Mean house price for each zip code |
The simplest synthetic features are formed by applying basic arithmetic operations to existing columns. If a dataset contains both the purchase price and the selling price of an item, subtracting one from the other yields a profit feature. If it contains distance and time, dividing one by the other produces a speed feature.
Ratio features are especially useful because they normalize one quantity by another, making comparisons across different scales meaningful. In real estate modeling, for instance, price per square foot is often more predictive than raw price or raw square footage alone. In web analytics, click-through rate (clicks divided by impressions) is more informative than either raw count.
These features are easy to construct and interpret, which makes them a good starting point in any feature engineering workflow. However, care must be taken when the denominator can be zero, as this produces undefined values that require handling (for example, by adding a small constant or by treating the zero case separately).
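A minimal pandas sketch of these ideas follows, assuming a hypothetical listings table with price, square_feet, clicks, and impressions columns (all names and values are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical listing data; column names are illustrative
df = pd.DataFrame({
    "price": [300_000, 450_000, 120_000],
    "square_feet": [1500, 2200, 800],
    "clicks": [30, 0, 12],
    "impressions": [1000, 0, 400],
})

# Ratio feature: price per square foot
df["price_per_sqft"] = df["price"] / df["square_feet"]

# Ratio with a possibly-zero denominator: guard against division by zero
df["click_through_rate"] = np.where(
    df["impressions"] > 0,
    df["clicks"] / df["impressions"],
    0.0,  # treat the zero-impression case separately
)
```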
Polynomial features are created by raising existing features to integer powers or by multiplying features together. They allow linear models to capture nonlinear relationships in the data. For a two-dimensional input sample [a, b], the degree-2 polynomial features are [1, a, b, a^2, ab, b^2] [2].
This technique is motivated by the observation that many real-world relationships involve powers of variables. Gravitational force varies with the inverse square of the distance between two masses. Kinetic energy is proportional to the square of velocity. When a data scientist suspects such a relationship, adding a squared term as a synthetic feature enables a linear regression model to fit a curve rather than a straight line.
Given an input vector x = (x_1, x_2, ..., x_n) and a maximum degree d, the polynomial feature expansion generates all monomials of the form:
x_1^{k_1} * x_2^{k_2} * ... * x_n^{k_n}
where k_1 + k_2 + ... + k_n <= d and each k_i >= 0.
The number of output features (including the bias term) is given by the binomial coefficient C(n + d, d). For example, with n = 2 input features and degree d = 2, the output contains C(4, 2) = 6 features. With n = 3 and d = 3, it grows to C(6, 3) = 20 features.
The scikit-learn library provides the PolynomialFeatures class in its preprocessing module for generating polynomial and interaction features:
```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2, 3],
              [4, 5]])

# Generate all degree-1 and degree-2 terms, without the constant column
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

print(list(poly.get_feature_names_out()))
# ['x0', 'x1', 'x0^2', 'x0 x1', 'x1^2']
print(X_poly)
# [[ 2.  3.  4.  6.  9.]
#  [ 4.  5. 16. 20. 25.]]
```
The key parameters of PolynomialFeatures are summarized below.
| Parameter | Default | Description |
|---|---|---|
| degree | 2 | Maximum degree of polynomial features. Can also accept a (min_degree, max_degree) tuple. |
| interaction_only | False | If True, only interaction features (products of distinct input features) are produced. Self-powers like x^2 are excluded. |
| include_bias | True | If True, a column of ones is included as a bias (intercept) term. |
| order | 'C' | Memory layout of the output array. 'F' (Fortran order) can be faster to compute. |
The number of polynomial features grows rapidly with both the number of input features and the degree. For 10 input features at degree 3, the output contains C(13, 3) = 286 features. At degree 5 with the same 10 inputs, the count rises to C(15, 5) = 3,003. This combinatorial growth increases both the risk of overfitting and the computational cost. Practitioners typically keep the degree at 2 or 3 and combine polynomial expansion with regularization (Lasso, Ridge, or Elastic Net) or feature selection to control model complexity [3].
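One common way to follow this advice is to chain the expansion with a regularized linear model. The sketch below is illustrative only: it uses a randomly generated regression dataset and combines degree-2 polynomial features with Ridge regression in a scikit-learn pipeline.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Illustrative synthetic regression data (not from the article)
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# Degree-2 expansion, scaling, then an L2 (Ridge) penalty to control complexity
model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    Ridge(alpha=1.0),
)

scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(scores.mean())
```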
An interaction term is a synthetic feature formed by multiplying two or more original features. It captures the idea that the effect of one feature on the target variable may depend on the value of another feature. In statistical modeling, this concept has been used for decades in the form of interaction effects in analysis of variance (ANOVA) and multiple regression.
Consider predicting house prices. The value added by an extra bedroom might be much higher for a large house (say, 3,000 square feet) than for a small apartment (600 square feet). A model with only separate features for bedrooms and square footage cannot capture this joint effect. Adding the interaction term bedrooms * square_footage allows the model to learn that the combination matters.
Interaction terms differ from full polynomial features in that they only include products of distinct features, not powers of individual features. In scikit-learn, setting interaction_only=True in PolynomialFeatures produces only interaction terms.
Interaction terms are most useful when domain knowledge suggests that the effect of one feature on the target depends on the level of another, as in the bedrooms and square footage example above; a short sketch follows.
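The following minimal sketch uses the scikit-learn option described above on a hypothetical housing matrix (column names and values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical rows: [bedrooms, square_footage]
X = np.array([[1, 600],
              [3, 3000]])

# interaction_only=True keeps products of distinct features and
# drops self-powers such as bedrooms^2 or square_footage^2
interactions = PolynomialFeatures(degree=2, interaction_only=True,
                                  include_bias=False)
X_int = interactions.fit_transform(X)

print(list(interactions.get_feature_names_out(["bedrooms", "sqft"])))
# ['bedrooms', 'sqft', 'bedrooms sqft']
print(X_int[:, -1])  # the interaction column: bedrooms * sqft for each row
```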
A feature cross is a synthetic feature created by taking the Cartesian product of two or more categorical or bucketized features [4]. While polynomial transforms operate on numerical data, feature crosses operate on categorical data. Both serve the same purpose: enabling linear models to learn nonlinear relationships.
For example, consider a leaf classification task with two categorical features: edge type (smooth, toothed, lobed) and leaf arrangement (opposite, alternate). Crossing these two features produces six combined categories: smooth_opposite, smooth_alternate, toothed_opposite, toothed_alternate, lobed_opposite, lobed_alternate. Each combination is encoded as a separate binary feature.
A well-known application comes from geospatial modeling. Individually, latitude and longitude have limited predictive power for property values. But their cross product defines specific city blocks, and the model can learn that certain blocks command higher prices than others.
Feature crosses can produce very high-dimensional, sparse feature spaces. Crossing a 100-element sparse feature with a 200-element sparse feature results in a 20,000-element feature. This sparsity increases memory consumption and can slow training. Techniques such as hashing and dimensionality reduction help manage the resulting feature space.
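A simple way to build a feature cross in pandas, assuming hypothetical edge and arrangement columns from the leaf example above, is to concatenate the category values and then one-hot encode the combined column:

```python
import pandas as pd

# Hypothetical leaf data with two categorical features
df = pd.DataFrame({
    "edge": ["smooth", "toothed", "lobed", "smooth"],
    "arrangement": ["opposite", "alternate", "opposite", "alternate"],
})

# Cross the two categories into a single combined category...
df["edge_x_arrangement"] = df["edge"] + "_" + df["arrangement"]

# ...then encode each observed combination as a separate binary column
crossed = pd.get_dummies(df["edge_x_arrangement"], prefix="cross")
```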
Applying mathematical functions such as the logarithm, square root, or Box-Cox transform to individual features is a longstanding technique in statistics. These transforms reduce skew, stabilize variance, and compress values that span several orders of magnitude into a more manageable range.
The choice of transform should be guided by the data distribution and domain knowledge. It is important to handle zero and negative values appropriately, since the logarithm is undefined for non-positive numbers. Common workarounds include log(x + 1) or the inverse hyperbolic sine transform.
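As a small sketch, assuming a hypothetical income column that contains zeros, numpy provides both workarounds directly:

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed data containing zeros
df = pd.DataFrame({"income": [0, 20_000, 55_000, 1_200_000]})

# log(x + 1) is defined at zero and compresses the long right tail
df["log_income"] = np.log1p(df["income"])

# The inverse hyperbolic sine also handles zero (and negative) values
df["asinh_income"] = np.arcsinh(df["income"])
```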
Binning (also called discretization or bucketizing) converts a continuous numerical feature into a set of discrete intervals (bins). Each data point is assigned to the bin that contains its value, and the bin membership is then encoded as a categorical feature (often using one-hot encoding).
There are several common binning strategies:
| Strategy | Description | Best for |
|---|---|---|
| Fixed-width (uniform) | Divides the range into equal-width intervals | Uniformly distributed data |
| Quantile-based | Creates bins with approximately equal numbers of observations | Skewed data |
| Domain-driven | Uses meaningful thresholds defined by domain experts | Variables with known breakpoints (e.g., age groups, income brackets) |
| Logarithmic | Bin widths increase exponentially | Data spanning several orders of magnitude |
Binning can reveal nonlinear patterns that a linear model would otherwise miss. For example, the relationship between age and insurance risk may not be linear, but grouping ages into brackets (18-25, 26-35, 36-50, 51-65, 65+) allows the model to assign different risk levels to each bracket. Binned features are also useful as inputs to feature crosses.
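A brief sketch of domain-driven and quantile binning with pandas, reusing the age brackets above (the values and column name are illustrative):

```python
import pandas as pd

ages = pd.Series([19, 27, 42, 58, 71], name="age")  # illustrative values

# Domain-driven bins matching the brackets in the text
age_group = pd.cut(
    ages,
    bins=[17, 25, 35, 50, 65, 120],
    labels=["18-25", "26-35", "36-50", "51-65", "65+"],
)

# Quantile-based alternative: four bins with roughly equal counts
age_quartile = pd.qcut(ages, q=4, labels=False)

# One-hot encode the bin membership for use in a linear model
age_dummies = pd.get_dummies(age_group, prefix="age")
```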
The main disadvantage of binning is information loss: the model can no longer distinguish between values within the same bin. Choosing too few bins loses detail; choosing too many bins approaches the original continuous feature and may add noise.
Timestamp columns contain rich temporal information that most models cannot use directly. Extracting components from a datetime object produces several useful synthetic features, such as the hour of day, day of week, month, and an is_weekend flag.
In Python with pandas, these can be extracted using the .dt accessor:
```python
import pandas as pd

# Assumes df['timestamp'] is already a datetime64 column
df['hour'] = df['timestamp'].dt.hour
df['day_of_week'] = df['timestamp'].dt.dayofweek  # Monday=0, Sunday=6
df['month'] = df['timestamp'].dt.month
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
```
Many time-based features are cyclical: hour 23 is close to hour 0, December is close to January, and Sunday is close to Monday. Encoding these as plain integers misleads distance-based and linear models, which treat 23 and 0 as far apart numerically.
Cyclical encoding addresses this by mapping each cyclical feature onto a circle using sine and cosine transforms [5]:
```python
import numpy as np

# Map the 24-hour clock onto a circle: hour 23 ends up next to hour 0
df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)
```
This produces two features that together preserve the circular distance between time points. The technique works well with neural networks and linear models. Tree-based models such as random forests and gradient boosting generally do not require cyclical encoding because they can approximate non-monotonic relationships through repeated splits on integer-encoded time features.
Converting categorical variables into numerical form is itself a type of synthetic feature creation. The most common encoding methods are listed below.
| Encoding method | Description | Typical use case |
|---|---|---|
| One-hot encoding | Creates a binary column for each category | Low-cardinality nominal features |
| Label (ordinal) encoding | Assigns consecutive integers to categories | Ordinal features with a natural order |
| Binary encoding | Converts category indices to binary digits, with one column per bit | Medium-cardinality features |
| Target (mean) encoding | Replaces each category with the mean of the target variable for that category | High-cardinality features |
| Frequency encoding | Replaces each category with its frequency in the dataset | When category frequency carries signal |
| Embedding vectors | Learns a dense vector representation for each category via a neural network | Very high-cardinality features; deep learning models |
Target encoding (also called mean encoding or likelihood encoding) replaces each category with the average target value for that category. For a classification task, the replacement value is the conditional probability of the positive class given the category. For regression, it is the mean target value.
The main risk of target encoding is data leakage and overfitting, especially for rare categories with few observations. Smoothing mitigates this by blending the category-specific mean with the global mean:
encoded_value = (count * category_mean + smoothing * global_mean) / (count + smoothing)
With this formula, categories that have many observations are encoded close to their own mean, while rare categories are pulled toward the global mean. The smoothing parameter controls the balance. Scikit-learn's TargetEncoder class can automatically select a suitable smoothing value using empirical Bayes variance estimates [6].
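A minimal pandas sketch of the smoothing formula above, assuming a hypothetical zip_code column and a numeric price target (scikit-learn's TargetEncoder provides a more robust, cross-fitted implementation):

```python
import pandas as pd

# Hypothetical training data
df = pd.DataFrame({
    "zip_code": ["10001", "10001", "10001", "94105", "94105", "73301"],
    "price":    [500_000, 520_000, 480_000, 900_000, 950_000, 300_000],
})

smoothing = 10.0
global_mean = df["price"].mean()
stats = df.groupby("zip_code")["price"].agg(["mean", "count"])

# Blend each category mean with the global mean, weighted by category count
encoding = (stats["count"] * stats["mean"] + smoothing * global_mean) / (
    stats["count"] + smoothing
)
df["zip_code_encoded"] = df["zip_code"].map(encoding)
```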
Text data must be transformed into numerical features before it can be used by most machine learning models. Common approaches include simple statistics such as word counts, TF-IDF weighting of terms, and dense word embedding vectors.
These text-derived features are synthetic in the sense that they are computed from the raw text and do not exist in the original dataset. In modern natural language processing, pretrained language models (such as BERT and GPT) produce contextual embeddings that serve as high-dimensional synthetic features for downstream tasks.
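A minimal sketch with scikit-learn's TfidfVectorizer, using illustrative strings, shows how each document becomes a row of numerical features:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative documents
docs = [
    "the house has three bedrooms",
    "spacious house with a garage",
    "small apartment near the station",
]

# Each document becomes a sparse row of TF-IDF scores,
# with one column per vocabulary term
vectorizer = TfidfVectorizer()
X_text = vectorizer.fit_transform(docs)

print(X_text.shape)                          # (3, number_of_terms)
print(vectorizer.get_feature_names_out()[:5])
```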
When working with grouped or sequential data, aggregating existing features across groups or time windows produces informative synthetic features, such as the mean purchase amount per customer or a rolling seven-day average of daily activity.
These features are common in time series forecasting, fraud detection, and recommendation systems. They encode temporal patterns and behavioral trends that raw point-in-time snapshots cannot capture.
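A brief pandas sketch, assuming a hypothetical transaction table with customer_id, date, and amount columns:

```python
import pandas as pd

# Hypothetical transactions
tx = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "date": pd.to_datetime(
        ["2024-01-01", "2024-01-03", "2024-01-01", "2024-01-02", "2024-01-08"]),
    "amount": [20.0, 35.0, 5.0, 12.0, 8.0],
})

# Group-level aggregation: mean purchase amount per customer,
# broadcast back to every transaction row
tx["customer_mean_amount"] = tx.groupby("customer_id")["amount"].transform("mean")

# Rolling window: 7-day rolling average of each customer's spending
tx = tx.sort_values(["customer_id", "date"])
tx["rolling_7d_mean"] = (
    tx.groupby("customer_id")
      .rolling("7D", on="date")["amount"]
      .mean()
      .reset_index(level=0, drop=True)
)
```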
Manual feature engineering requires domain expertise and can be time-consuming. Automated feature engineering tools aim to generate large numbers of candidate features algorithmically and then select the most useful ones.
Deep Feature Synthesis (DFS) is an algorithm introduced by Kanter and Veeramachaneni in 2015 that automatically creates features from relational and temporal data [7]. It works by following the relationships between tables in a relational dataset and recursively stacking feature primitives, such as aggregations (mean, count, max) and transformations, to build features of increasing depth.
In a competition hosted by the IEEE, models using DFS-generated features beat 615 of 906 human teams [7]. The Featuretools library (maintained by Alteryx) provides an open-source Python implementation of DFS.
Several open-source libraries support automated feature generation.
| Tool | Focus area | Key capability |
|---|---|---|
| Featuretools | Relational and temporal data | Deep Feature Synthesis with customizable primitives |
| tsfresh | Time series data | Extracts hundreds of statistical, spectral, and nonlinear features from time series |
| Feature-engine | General tabular data | Scikit-learn-compatible transformers for encoding, discretization, and feature creation |
| tsflex | Time series data | Faster and more memory-efficient alternative to tsfresh |
| Category Encoders | Categorical data | 15+ encoding methods including target, binary, and hash encoding |
These tools reduce the manual effort involved in feature engineering but still require the practitioner to validate the generated features, check for data leakage, and manage the increased dimensionality.
The usefulness of synthetic features varies by model type. The table below compares how different model families interact with synthetic features.
| Model family | Needs synthetic features? | Reason |
|---|---|---|
| Linear regression, logistic regression | Often yes | Cannot represent nonlinear relationships without polynomial or interaction terms |
| Decision trees, random forests, gradient boosting | Sometimes | Can learn nonlinear splits natively, but ratio and aggregation features can still help |
| Support vector machines | Sometimes | Kernel trick handles some nonlinearity, but explicit features can improve linear kernels |
| Neural networks, deep learning | Less often | Automatically learn feature representations in hidden layers, but handcrafted features can accelerate training and improve results on small datasets |
As a general rule, simpler models benefit more from synthetic features, while complex models (especially deep neural networks) can discover useful representations on their own given enough data. However, even in deep learning pipelines, manually engineered features remain common in tabular data tasks, where neural networks have historically lagged behind tree-based methods [8].
Creating synthetic features introduces several risks that must be managed carefully.
Adding too many features increases the capacity of the model to memorize the training data, leading to poor generalization. This is closely related to the curse of dimensionality: as the number of features grows relative to the number of training samples, the data becomes increasingly sparse in the high-dimensional feature space, and models need exponentially more data to maintain performance [9].
A commonly cited guideline is to maintain a sample-to-feature ratio of at least 10:1, with 20:1 or higher being preferable for stable and generalizable models.
Some synthetic features can inadvertently leak information about the target variable into the training data. Target encoding is a common culprit: if the category mean is computed on the entire training set (including the current sample), it encodes target information that the model should not have access to at prediction time. Using cross-validated target encoding or smoothing helps mitigate this risk.
Synthetic features are often correlated with the features they were derived from. For example, x and x^2 are correlated, as are age and year_of_birth. High multicollinearity can destabilize coefficient estimates in linear models and make interpretation difficult. Checking variance inflation factors (VIF) and applying regularization are standard countermeasures.
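A short, self-contained sketch of a VIF check with statsmodels, using an illustrative age column whose square is strongly collinear with it:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Hypothetical feature matrix containing a derived (and correlated) column
rng = np.random.default_rng(0)
age = rng.integers(20, 70, size=200).astype(float)
X = pd.DataFrame({
    "age": age,
    "age_squared": age ** 2,
    "noise": rng.normal(size=200),
})

# VIF is computed for each feature against all the others; add an intercept first
X_const = add_constant(X)
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i)
     for i in range(1, X_const.shape[1])],
    index=X_const.columns[1:],
)
print(vif)  # values well above roughly 5-10 usually signal problematic collinearity
```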
Polynomial and cross-product features can produce a very large number of new columns, increasing memory usage and training time. Feature selection methods (filter, wrapper, or embedded approaches) should be applied to prune uninformative features.
| Practice | Description |
|---|---|
| Start simple | Begin with arithmetic and ratio features before moving to polynomial or automated methods |
| Use domain knowledge | Features motivated by real-world understanding are more likely to generalize |
| Validate rigorously | Use cross-validation to evaluate whether new features actually improve performance |
| Monitor feature importance | Remove features that do not contribute meaningfully, using permutation importance or SHAP values |
| Apply regularization | Use L1 (Lasso), L2 (Ridge), or Elastic Net penalties to control complexity when using many synthetic features |
| Normalize after transforming | If a synthetic feature changes the scale of the data, apply normalization or standardization |
| Watch for leakage | Ensure that no synthetic feature encodes future information or target values inappropriately |
| Document features | Record how each synthetic feature was created, including any parameters or thresholds used |
In production MLOps workflows, synthetic features must be computed consistently during both training and inference. A feature store is a centralized repository that stores feature definitions, computed feature values, and the code used to generate them [10]. Feature stores help teams keep feature computation consistent between training and serving, share and reuse feature definitions across projects, and avoid recomputing the same features for every model.
Popular open-source and managed feature stores include Feast, Hopsworks, and the feature store components of Databricks and Amazon SageMaker.
The idea of creating new variables from existing ones predates machine learning by many decades. In classical statistics, researchers routinely applied log transforms, computed interaction terms, and standardized variables as part of regression analysis. The Box-Cox transformation, introduced by George Box and David Cox in 1964, provided a systematic family of power transforms for normalizing data [11].
The term "feature engineering" became prominent in the machine learning community during the 2000s and 2010s, as practitioners recognized that the choice and construction of features often mattered more than the choice of algorithm. Andrew Ng famously stated that "coming up with features is difficult, time-consuming, requires expert knowledge. Applied machine learning is basically feature engineering" [12].
Research into automated feature engineering began in the 1990s, with commercial and open-source tools becoming available from 2016 onward. Deep Feature Synthesis (2015) and the subsequent release of Featuretools marked an important step toward reducing the manual burden of feature construction. More recently, deep learning approaches have shifted some of the feature engineering workload to the model itself, which learns internal representations (synthetic features, in a sense) through its hidden layers.