# Synthetic Feature

> Source: https://aiwiki.ai/wiki/synthetic_feature
> Updated: 2026-04-26
> Categories: Data & Datasets, Data Science, Machine Learning
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

A **synthetic feature** (also called a constructed feature or derived feature) is a new variable created by transforming, combining, or otherwise manipulating one or more existing features in a dataset. Synthetic features do not appear in the original raw data; instead, they are produced during the [feature engineering](/wiki/feature_engineering) process to provide [machine learning](/wiki/machine_learning) models with additional information that helps them learn patterns more effectively. The term is used broadly across statistics, data science, and machine learning to describe any feature that a practitioner deliberately constructs rather than directly measures or collects.

Creating synthetic features is one of the most common and impactful steps in building predictive models. According to Google's Machine Learning Crash Course, a synthetic feature is "a new feature created from existing numerical features based on domain knowledge" [1]. By encoding domain knowledge, mathematical relationships, or statistical summaries into new columns, data scientists can significantly improve model accuracy, interpretability, and robustness.

## Explain like I'm 5 (ELI5)

Imagine you have a box of colored building blocks. Each block has a color, a size, and a material (wood or metal). Now suppose you want to sort them by how heavy they feel, but you do not have a scale. You notice that bigger blocks are heavier, and metal blocks are heavier than wooden ones. So you make up a new rule: "heaviness score = size times material weight." That new score is not written on any block. You invented it by combining two things you already knew (size and material). In machine learning, a synthetic feature works the same way. It is a new piece of information you create by mixing together things you already have, so your model can make better predictions.

## Background and motivation

Raw datasets rarely contain all the information a model needs in an immediately usable form. A table of real estate listings, for example, might include the year a house was built and the current year, but not the house's age. A medical dataset might record a patient's height and weight but not their body mass index (BMI). In both cases, the relationship between existing columns carries predictive signal that the model cannot easily discover on its own, especially when using [linear regression](/wiki/linear_regression) or other models that assume linear relationships among inputs.

Synthetic features address this gap. By explicitly constructing variables such as "age of house" (current year minus year built) or "BMI" (weight divided by height squared), the practitioner encodes domain knowledge directly into the data. This makes the model's job easier and often produces better results than relying on the model to infer these relationships from raw inputs alone.
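As a minimal sketch of the two examples above, the snippet below derives a house-age column and a BMI column with pandas. The table contents and column names (`year_built`, `listing_year`, `weight_kg`, `height_m`) are assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical listings table; column names are illustrative assumptions.
houses = pd.DataFrame({"year_built": [1990, 2015], "listing_year": [2024, 2024]})
houses["house_age"] = houses["listing_year"] - houses["year_built"]

# Hypothetical patient table: weight in kilograms, height in metres.
patients = pd.DataFrame({"weight_kg": [70.0, 90.0], "height_m": [1.75, 1.80]})
patients["bmi"] = patients["weight_kg"] / patients["height_m"] ** 2

print(houses)
print(patients.round(2))
```

Neither `house_age` nor `bmi` exists in the raw tables; both are computed entirely from columns that do.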

The practice has deep roots in statistics, where variable transformation (such as taking the logarithm of a skewed variable or computing interaction terms in a regression model) has been standard for over a century. With the growth of modern machine learning, these techniques have been systematized, expanded, and in some cases automated.

## Types of synthetic features

Synthetic features can be grouped into several broad categories based on how they are constructed. The table below summarizes the main types.

| Type | Description | Example |
|---|---|---|
| Arithmetic combinations | New features formed by adding, subtracting, multiplying, or dividing existing features | Profit = shelf price - warehouse price |
| Ratio features | The quotient of two features, often expressing a rate or density | Population density = population / area |
| Polynomial features | Existing features raised to a power or multiplied together | x^2, x1 * x2 |
| Interaction terms | Products of two or more features that capture joint effects | bedrooms * square footage |
| [Feature cross](/wiki/feature_cross) | Cartesian product of two or more categorical or bucketized features | latitude_bucket x longitude_bucket |
| Logarithmic or power transforms | Mathematical functions applied to reduce skew or stabilize variance | log(income), sqrt(distance) |
| Binning (bucketizing) | Converting a continuous variable into discrete intervals | Age groups: 0-17, 18-34, 35-54, 55+ |
| Date/time extraction | Components extracted from timestamps | Hour of day, day of week, month, is_weekend |
| Cyclical encoding | Sine and cosine transforms of periodic features | sin(2 pi * hour / 24), cos(2 pi * hour / 24) |
| Aggregation features | Statistical summaries computed over groups or windows | Mean purchase amount per customer, rolling 7-day average |
| Text-derived features | Numerical representations extracted from text data | Word count, [TF-IDF](/wiki/bag_of_words) scores, [word embedding](/wiki/word_embedding) vectors |
| Indicator (dummy) variables | Binary flags encoding the presence or absence of a condition | is_holiday, has_garage, is_missing_value |
| Target encoding | Replacing a categorical value with a statistic of the target variable | Mean house price for each zip code |

## Arithmetic and ratio features

The simplest synthetic features are formed by applying basic arithmetic operations to existing columns. If a dataset contains both the purchase price and the selling price of an item, subtracting one from the other yields a profit feature. If it contains distance and time, dividing one by the other produces a speed feature.

Ratio features are especially useful because they normalize one quantity by another, making comparisons across different scales meaningful. In real estate modeling, for instance, price per square foot is often more predictive than raw price or raw square footage alone. In web analytics, click-through rate (clicks divided by impressions) is more informative than either raw count.

These features are easy to construct and interpret, which makes them a good starting point in any feature engineering workflow. However, care must be taken when the denominator can be zero, as this produces undefined values that require handling (for example, by adding a small constant or by treating the zero case separately).
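The two ways of handling a zero denominator mentioned above can be sketched as follows. The `ads` table and the `epsilon` value are illustrative assumptions; which option is appropriate depends on what a zero denominator means in the data.

```python
import pandas as pd

# Hypothetical ad statistics; the second row has zero impressions.
ads = pd.DataFrame({"clicks": [5, 0, 12], "impressions": [100, 0, 300]})

# Option 1: treat the zero case separately by leaving the ratio missing (NaN).
nonzero_impressions = ads["impressions"].where(ads["impressions"] > 0)
ads["ctr"] = ads["clicks"] / nonzero_impressions

# Option 2: add a small constant so the ratio is always defined.
epsilon = 1e-9  # illustrative value
ads["ctr_eps"] = ads["clicks"] / (ads["impressions"] + epsilon)

print(ads)
```

Option 1 keeps the "undefined" case visible for later imputation, while option 2 silently maps it to a number, which may or may not be the intended meaning.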

## Polynomial features

Polynomial features are created by raising existing features to integer powers or by multiplying features together. They allow linear models to capture nonlinear relationships in the data. For a two-dimensional input sample [a, b], the degree-2 polynomial features are [1, a, b, a^2, ab, b^2] [2].

This technique is motivated by the observation that many real-world relationships involve powers of variables. Gravitational force is inversely proportional to the square of the distance between two masses. Kinetic energy is proportional to the square of velocity. When a data scientist suspects such a relationship, adding a squared term as a synthetic feature enables a [linear regression](/wiki/linear_regression) model to fit a curve rather than a straight line.

### Mathematical formulation

Given an input vector **x** = (x_1, x_2, ..., x_n) and a maximum degree d, the polynomial feature expansion generates all monomials of the form:

x_1^{k_1} * x_2^{k_2} * ... * x_n^{k_n}

where k_1 + k_2 + ... + k_n <= d and each k_i >= 0.

The number of output features (including the bias term) is given by the binomial coefficient C(n + d, d). For example, with n = 2 input features and degree d = 2, the output contains C(4, 2) = 6 features. With n = 3 and d = 3, it grows to C(6, 3) = 20 features.
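The count can be checked with a few lines of standard-library Python. The sketch below compares the closed-form binomial coefficient with a brute-force enumeration of monomials; the function names are hypothetical helpers, not part of any library.

```python
from itertools import combinations_with_replacement
from math import comb

def n_polynomial_features(n_inputs: int, degree: int) -> int:
    """Closed-form count of monomials with total degree <= degree (bias included)."""
    return comb(n_inputs + degree, degree)

def enumerate_monomials(n_inputs: int, degree: int) -> list:
    """List each monomial as a tuple of input indices, e.g. (0, 1) means x_0 * x_1."""
    monomials = []
    for total_degree in range(degree + 1):
        # The empty tuple (total_degree == 0) stands for the bias term.
        monomials.extend(combinations_with_replacement(range(n_inputs), total_degree))
    return monomials

print(n_polynomial_features(2, 2))      # 6
print(len(enumerate_monomials(3, 3)))   # 20
```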

### Implementation in scikit-learn

The [scikit-learn](/wiki/scikit-learn) library provides the `PolynomialFeatures` class in its preprocessing module for generating polynomial and interaction features:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2, 3],
              [4, 5]])

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
print(poly.get_feature_names_out())
# ['x0', 'x1', 'x0^2', 'x0 x1', 'x1^2']
print(X_poly)
# [[ 2.  3.  4.  6.  9.]
#  [ 4.  5. 16. 20. 25.]]
```

The key parameters of `PolynomialFeatures` are summarized below.

| Parameter | Default | Description |
|---|---|---|
| degree | 2 | Maximum degree of polynomial features. Can also accept a (min_degree, max_degree) tuple. |
| interaction_only | False | If True, only interaction features (products of distinct input features) are produced. Self-powers like x^2 are excluded. |
| include_bias | True | If True, a column of ones is included as a bias (intercept) term. |
| order | 'C' | Memory layout of the output array. 'F' (Fortran order) can be faster to compute. |

### Feature explosion warning

The number of polynomial features grows rapidly with both the number of input features and the degree. For 10 input features at degree 3, the output contains C(13, 3) = 286 features. At degree 5 with the same 10 inputs, the count rises to C(15, 5) = 3,003. This combinatorial growth increases both the risk of [overfitting](/wiki/overfitting) and the computational cost. Practitioners typically keep the degree at 2 or 3 and combine polynomial expansion with [regularization](/wiki/regularization) (Lasso, Ridge, or Elastic Net) or feature selection to control model complexity [3].

## Interaction terms

An interaction term is a synthetic feature formed by multiplying two or more original features. It captures the idea that the effect of one feature on the target variable may depend on the value of another feature. In statistical modeling, this concept has been used for decades in the form of interaction effects in analysis of variance (ANOVA) and multiple regression.

Consider predicting house prices. The value added by an extra bedroom might be much higher for a large house (say, 3,000 square feet) than for a small apartment (600 square feet). A model with only separate features for bedrooms and square footage cannot capture this joint effect. Adding the interaction term bedrooms * square_footage allows the model to learn that the combination matters.

Interaction terms differ from full polynomial features in that they only include products of distinct features, not powers of individual features. In scikit-learn, setting `interaction_only=True` in `PolynomialFeatures` produces only interaction terms.
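The house-price example can be illustrated with a small least-squares experiment. In this sketch the data are synthetic and the coefficients are invented; the target is constructed to contain a bedrooms-by-square-footage effect, so it only shows what a main-effects model misses, not how real prices behave.

```python
import numpy as np

# Synthetic grid of houses; all coefficients below are made up for illustration.
bedrooms, sqft = np.meshgrid(np.arange(1, 6), np.arange(600, 3001, 600))
bedrooms = bedrooms.ravel().astype(float)
sqft = sqft.ravel().astype(float)
price = 10_000 * bedrooms + 100 * sqft + 20 * bedrooms * sqft

def fit_rmse(columns):
    """Fit ordinary least squares with an intercept and return the training RMSE."""
    X = np.column_stack([np.ones_like(price)] + columns)
    coef, *_ = np.linalg.lstsq(X, price, rcond=None)
    return float(np.sqrt(np.mean((X @ coef - price) ** 2)))

rmse_main_effects = fit_rmse([bedrooms, sqft])
rmse_with_interaction = fit_rmse([bedrooms, sqft, bedrooms * sqft])

print(rmse_main_effects, rmse_with_interaction)
```

With only the two main effects, the linear model leaves a large systematic error; adding the product column lets it fit this synthetic target essentially exactly.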

### When to use interaction terms

Interaction terms are most useful when:

- Domain knowledge suggests that two variables have a joint effect on the outcome.
- Exploratory data analysis reveals that the relationship between a feature and the target changes at different levels of another feature.
- A model with main effects alone shows systematic patterns in its residuals that suggest missed interactions.

## Feature crosses

A [feature cross](/wiki/feature_cross) is a synthetic feature created by taking the Cartesian product of two or more categorical or bucketized features [4]. While polynomial transforms operate on numerical data, feature crosses operate on categorical data. Both serve the same purpose: enabling linear models to learn nonlinear relationships.

For example, consider a leaf classification task with two categorical features: edge type (smooth, toothed, lobed) and leaf arrangement (opposite, alternate). Crossing these two features produces six combined categories: smooth_opposite, smooth_alternate, toothed_opposite, toothed_alternate, lobed_opposite, lobed_alternate. Each combination is encoded as a separate binary feature.
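The leaf example can be sketched in plain Python: the crossed vocabulary is the Cartesian product of the two vocabularies, and each leaf is one-hot encoded over it. The helper name `one_hot_cross` is hypothetical.

```python
from itertools import product

edge_types = ["smooth", "toothed", "lobed"]
arrangements = ["opposite", "alternate"]

# Crossed vocabulary: every (edge type, arrangement) combination.
crossed_vocab = [f"{edge}_{arr}" for edge, arr in product(edge_types, arrangements)]

def one_hot_cross(edge: str, arrangement: str) -> list:
    """One-hot encode a single leaf over the crossed vocabulary."""
    key = f"{edge}_{arrangement}"
    return [1 if category == key else 0 for category in crossed_vocab]

print(crossed_vocab)
print(one_hot_cross("toothed", "alternate"))  # [0, 0, 0, 1, 0, 0]
```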

A well-known application comes from geospatial modeling. Individually, latitude and longitude have limited predictive power for property values. But crossing bucketized latitude with bucketized longitude defines specific grid cells (roughly, city blocks), and the model can learn that certain cells command higher prices than others.

### Sparsity considerations

Feature crosses can produce very high-dimensional, sparse feature spaces. Crossing a 100-element sparse feature with a 200-element sparse feature results in a 20,000-element feature. This sparsity increases memory consumption and can slow training. Techniques such as hashing and dimensionality reduction help manage the resulting feature space.
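One way to apply hashing to a feature cross is to map each crossed pair to a fixed number of buckets instead of materializing every combination. The sketch below is an assumption-laden illustration: the bucket count, the key format, and the choice of SHA-256 are arbitrary, and distinct pairs can collide in the same bucket.

```python
import hashlib

def hashed_cross_index(value_a: str, value_b: str, n_buckets: int = 1024) -> int:
    """Map a crossed pair to one of n_buckets columns using a stable hash."""
    key = f"{value_a}|{value_b}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    return int(digest, 16) % n_buckets

# Hypothetical bucket names for a latitude/longitude cross.
index = hashed_cross_index("lat_bucket_17", "lon_bucket_142")
print(index)
```

A stable hash (rather than Python's built-in `hash`, which is salted per process for strings) keeps the column assignment identical between training and serving.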

## Logarithmic and power transforms

Applying mathematical functions like log, square root, or Box-Cox transforms to individual features is a longstanding technique in statistics. These transforms serve several purposes:

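- **Reducing skew.** Many real-world distributions (income, population, web traffic) are heavily right-skewed. Taking the logarithm pulls in extreme values and produces a more symmetric distribution. This often helps linear models such as linear regression and [logistic regression](/wiki/logistic_regression), whose fits can be dominated by a few extreme values. (Linear regression assumes normally distributed errors, not normally distributed inputs, so the benefit comes from a better-behaved fit rather than from satisfying a normality assumption on the features.)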
- **Stabilizing variance.** When the variance of a variable increases with its mean (a phenomenon called heteroscedasticity), a log or square root transform can equalize the variance across the range.
- **Linearizing relationships.** If the relationship between a feature and the target follows a power law or exponential curve, a log transform can make it approximately linear.

The choice of transform should be guided by the data distribution and domain knowledge. It is important to handle zero and negative values appropriately, since the logarithm is undefined for non-positive numbers. Common workarounds include log(x + 1), which handles zeros in non-negative data, and the inverse hyperbolic sine transform, which is also defined for negative values.
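Both workarounds are available in NumPy as `np.log1p` and `np.arcsinh`. The arrays below are invented values chosen to include a zero and a negative entry.

```python
import numpy as np

# Non-negative, right-skewed values: log(x + 1) is defined at zero.
income = np.array([0.0, 20_000.0, 50_000.0, 1_000_000.0])
log_income = np.log1p(income)

# Values that can be negative: the inverse hyperbolic sine is defined everywhere.
balance = np.array([-500.0, 0.0, 500.0, 1_000_000.0])
asinh_balance = np.arcsinh(balance)

print(log_income.round(3))
print(asinh_balance.round(3))
```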

## Binning and bucketizing

Binning (also called discretization or bucketizing) converts a continuous numerical feature into a set of discrete intervals (bins). Each data point is assigned to the bin that contains its value, and the bin membership is then encoded as a categorical feature (often using [one-hot encoding](/wiki/one-hot_encoding)).

There are several common binning strategies:

| Strategy | Description | Best for |
|---|---|---|
| Fixed-width (uniform) | Divides the range into equal-width intervals | Uniformly distributed data |
| Quantile-based | Creates bins with approximately equal numbers of observations | Skewed data |
| Domain-driven | Uses meaningful thresholds defined by domain experts | Variables with known breakpoints (e.g., age groups, income brackets) |
| Logarithmic | Bin widths increase exponentially | Data spanning several orders of magnitude |

Binning can reveal nonlinear patterns that a linear model would otherwise miss. For example, the relationship between age and insurance risk may not be linear, but grouping ages into non-overlapping brackets (18-25, 26-35, 36-50, 51-65, 66+) allows the model to assign different risk levels to each bracket. Binned features are also useful as inputs to feature crosses.

The main disadvantage of binning is information loss: the model can no longer distinguish between values within the same bin. Choosing too few bins loses detail; choosing too many bins approaches the original continuous feature and may add noise.
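As a brief sketch, domain-driven and quantile-based binning can be done with pandas `cut` and `qcut`. The sample ages are invented, and the bin edges reproduce the age groups 0-17, 18-34, 35-54, and 55+.

```python
import pandas as pd

ages = pd.Series([5, 17, 18, 40, 54, 55, 90])

# Domain-driven bins: left-closed intervals [0, 18), [18, 35), [35, 55), [55, inf).
age_group = pd.cut(
    ages,
    bins=[0, 18, 35, 55, float("inf")],
    right=False,
    labels=["0-17", "18-34", "35-54", "55+"],
)

# Quantile-based bins: roughly equal numbers of observations per bin.
age_quartile = pd.qcut(ages, q=4, labels=False)

print(pd.DataFrame({"age": ages, "group": age_group, "quartile": age_quartile}))
```

Note that ages 35 and 54 land in the same group: this is the information loss described above.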

## Date and time features

Timestamp columns contain rich temporal information that most models cannot use directly. Extracting components from a datetime object produces several useful synthetic features:

- **Hour of day** (0-23)
- **Day of week** (Monday through Sunday)
- **Day of month** (1-31)
- **Month** (1-12)
- **Quarter** (1-4)
- **Year**
- **Is weekend** (binary flag)
- **Is holiday** (binary flag, requires a holiday calendar)
- **Time since an event** (e.g., days since last purchase)

In Python with pandas, these can be extracted using the `.dt` accessor:

```python
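import pandas as pd

# Small illustrative frame; any datetime64 column works the same way.
df = pd.DataFrame({'timestamp': pd.to_datetime(['2024-01-05 08:30', '2024-01-06 23:15'])})

df['hour'] = df['timestamp'].dt.hour
df['day_of_week'] = df['timestamp'].dt.dayofweek  # Monday=0, Sunday=6
df['month'] = df['timestamp'].dt.month
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)  # Saturday or Sunday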
```

### Cyclical encoding

Many time-based features are cyclical: hour 23 is close to hour 0, December is close to January, and Sunday is close to Monday. Encoding these as plain integers misleads distance-based and linear models, which treat 23 and 0 as far apart numerically.

Cyclical encoding addresses this by mapping each cyclical feature onto a circle using sine and cosine transforms [5]:

```python
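import numpy as np
import pandas as pd

# Illustrative frame with an integer hour column (0-23).
df = pd.DataFrame({'hour': [0, 6, 12, 23]})

df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)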
```

This produces two features that together preserve the circular distance between time points. The technique works well with [neural networks](/wiki/neural_network) and linear models. Tree-based models such as [random forests](/wiki/random_forest) and [gradient boosting](/wiki/gradient_boosting) generally do not require cyclical encoding because they can learn non-monotonic splits on integer-encoded time features.
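To see the circular distance being preserved, the sketch below compares Euclidean distances between encoded hours. The helper names are hypothetical; the encoding itself is the sine/cosine mapping described above.

```python
import numpy as np

def encode_hour(hour):
    """Map an hour in [0, 24) to a point (sin, cos) on the unit circle."""
    angle = 2 * np.pi * np.asarray(hour, dtype=float) / 24
    return np.stack([np.sin(angle), np.cos(angle)], axis=-1)

def encoded_distance(hour_a, hour_b):
    """Euclidean distance between two hours after cyclical encoding."""
    return float(np.linalg.norm(encode_hour(hour_a) - encode_hour(hour_b)))

d_23_to_0 = encoded_distance(23, 0)    # one hour apart, across midnight
d_11_to_12 = encoded_distance(11, 12)  # one hour apart, midday
d_0_to_12 = encoded_distance(0, 12)    # opposite sides of the clock

print(d_23_to_0, d_11_to_12, d_0_to_12)
```

Any two hours that are one hour apart end up the same distance from each other, whether or not the pair straddles midnight, and hours twelve apart are maximally distant.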

## Encoding categorical features

Converting categorical variables into numerical form is itself a type of synthetic feature creation. The most common encoding methods are listed below.

| Encoding method | Description | Typical use case |
|---|---|---|
| [One-hot encoding](/wiki/one-hot_encoding) | Creates a binary column for each category | Low-cardinality nominal features |
| Label (ordinal) encoding | Assigns consecutive integers to categories | Ordinal features with a natural order |
| Binary encoding | Converts category indices to binary digits, with one column per bit | Medium-cardinality features |
| Target (mean) encoding | Replaces each category with the mean of the target variable for that category | High-cardinality features |
| Frequency encoding | Replaces each category with its frequency in the dataset | When category frequency carries signal |
| [Embedding](/wiki/embeddings) vectors | Learns a dense vector representation for each category via a neural network | Very high-cardinality features; [deep learning](/wiki/deep_learning) models |

### Target encoding and smoothing

Target encoding (also called mean encoding or likelihood encoding) replaces each category with the average target value for that category. For a [classification](/wiki/classification) task, the replacement value is the conditional probability of the positive class given the category. For [regression](/wiki/regression), it is the mean target value.

The main risks of target encoding are data leakage and overfitting, especially for rare categories with few observations. Smoothing mitigates these risks by blending the category-specific mean with the global mean:

encoded_value = (count * category_mean + smoothing * global_mean) / (count + smoothing)

With this formula, categories that have many observations are encoded close to their own mean, while rare categories are pulled toward the global mean. The smoothing parameter controls the balance. Scikit-learn's `TargetEncoder` class can automatically select a suitable smoothing value using empirical Bayes variance estimates [6].
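The smoothing formula can be applied directly with a pandas group-by. This is a minimal sketch on invented data with an arbitrary `smoothing` value; it is not how scikit-learn's `TargetEncoder` is implemented, and it computes the statistics on the full table, so it does not by itself prevent leakage.

```python
import pandas as pd

# Invented data: category "A" has four observations, "B" has only one.
df = pd.DataFrame({
    "zip_code": ["A", "A", "A", "A", "B"],
    "price": [100.0, 120.0, 110.0, 130.0, 300.0],
})
smoothing = 2.0  # illustrative value, not a recommendation

global_mean = df["price"].mean()
stats = df.groupby("zip_code")["price"].agg(["count", "mean"])

# encoded = (count * category_mean + smoothing * global_mean) / (count + smoothing)
encoded = (stats["count"] * stats["mean"] + smoothing * global_mean) / (stats["count"] + smoothing)
df["zip_code_encoded"] = df["zip_code"].map(encoded)

print(encoded.round(2))
```

The single-observation category "B" moves from its raw mean of 300 most of the way toward the global mean of 152, while "A" stays much closer to its own mean of 115.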

## Text-derived features

Text data requires transformation into numerical features before it can be used by most machine learning models. Common approaches include:

- **[Bag of words](/wiki/bag_of_words):** Represents a document as a vector of word frequencies, ignoring word order and grammar.
- **TF-IDF (Term Frequency-Inverse Document Frequency):** Weights word frequencies by how rare each word is across the entire corpus, giving more importance to distinctive words.
- **[Word embeddings](/wiki/word_embedding):** Dense vector representations (such as [Word2Vec](/wiki/word2vec) or GloVe) that capture semantic relationships between words. Words with similar meanings have similar embedding vectors.
- **Character n-grams:** Counts of character subsequences, useful for capturing morphological patterns and handling misspellings.
- **Simple text statistics:** Word count, sentence count, average word length, proportion of uppercase letters, and similar surface-level features.

These text-derived features are synthetic in the sense that they are computed from the raw text and do not exist in the original dataset. In modern [natural language processing](/wiki/natural_language_processing), pretrained language models (such as BERT and GPT) produce contextual embeddings that serve as high-dimensional synthetic features for downstream tasks.
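The simplest of these features need nothing beyond the standard library. The sketch below computes a few surface statistics and a bag-of-words count using naive whitespace tokenization, which is an assumption made for brevity; real pipelines usually use a proper tokenizer.

```python
from collections import Counter

def text_statistics(text: str) -> dict:
    """Surface-level features computed from raw text."""
    words = text.split()
    letters = [ch for ch in text if ch.isalpha()]
    return {
        "word_count": len(words),
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "uppercase_ratio": sum(ch.isupper() for ch in letters) / len(letters) if letters else 0.0,
    }

def bag_of_words(text: str) -> Counter:
    """Case-insensitive word counts, ignoring word order."""
    return Counter(text.lower().split())

doc = "Free offer FREE today"
stats = text_statistics(doc)
counts = bag_of_words(doc)
print(stats)
print(counts)
```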

## Aggregation and window features

When working with grouped or sequential data, aggregating existing features across groups or time windows produces informative synthetic features. Examples include:

- Mean, median, min, max, and standard deviation of a customer's past purchase amounts.
- Count of transactions in the last 7 days, 30 days, or 90 days.
- Ratio of the current value to the rolling mean (detecting anomalies or trends).
- Lag features: the value of a variable at a previous time step (t-1, t-2, etc.).
- Difference features: the change between consecutive time steps.

These features are common in time series forecasting, fraud detection, and recommendation systems. They encode temporal patterns and behavioral trends that raw point-in-time snapshots cannot capture.
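Several of the listed features can be sketched with pandas group-by operations. The transaction table is invented, rows are assumed to be in time order within each customer, and row-based windows stand in for the day-based windows mentioned above.

```python
import pandas as pd

# Invented transactions, assumed to be sorted by time within each customer.
tx = pd.DataFrame({
    "customer": ["a", "a", "a", "b", "b"],
    "amount": [10.0, 20.0, 60.0, 5.0, 15.0],
})
grouped = tx.groupby("customer")["amount"]

tx["lag_1"] = grouped.shift(1)   # value at the previous step
tx["diff_1"] = grouped.diff()    # change since the previous step
# Mean of strictly earlier purchases, so the current row does not see itself.
tx["past_mean"] = grouped.transform(lambda s: s.shift(1).expanding().mean())
# Rolling mean over the last two purchases, then the ratio to it.
tx["rolling_mean_2"] = grouped.transform(lambda s: s.rolling(2, min_periods=1).mean())
tx["ratio_to_rolling"] = tx["amount"] / tx["rolling_mean_2"]

print(tx)
```

Shifting before aggregating is the design choice that matters here: it keeps each row's features limited to information available before that row.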

## Automated feature engineering

Manual feature engineering requires domain expertise and can be time-consuming. Automated feature engineering tools aim to generate large numbers of candidate features algorithmically and then select the most useful ones.

### Deep feature synthesis

Deep Feature Synthesis (DFS) is an algorithm introduced by Kanter and Veeramachaneni in 2015 that automatically creates features from relational and temporal data [7]. It works by:

1. Defining entities (tables) and the relationships between them.
2. Applying **transform primitives** (operations on a single table, such as computing the hour from a timestamp).
3. Applying **aggregation primitives** (operations across related tables, such as computing the mean of a customer's order amounts).
4. Stacking primitives to create "deep" features (for example, the standard deviation of the monthly average order amount).

Across three data science competitions, the authors' Data Science Machine, which relies on DFS-generated features, beat 615 of the 906 competing human teams [7]. The Featuretools library (maintained by Alteryx) provides an open-source Python implementation of DFS.
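The idea of stacking primitives can be illustrated by hand with plain pandas. The sketch below does not use the Featuretools API; it manually builds the "standard deviation of the monthly average order amount" feature from an invented orders table to show what a depth-2 feature looks like.

```python
import pandas as pd

# Invented orders table with a single relationship: orders belong to customers.
orders = pd.DataFrame({
    "customer_id": [1, 1, 1, 1, 2, 2],
    "order_date": pd.to_datetime([
        "2024-01-05", "2024-01-20", "2024-02-10", "2024-03-15",
        "2024-01-07", "2024-02-09",
    ]),
    "amount": [10.0, 30.0, 40.0, 60.0, 5.0, 15.0],
})

# Transform primitive: derive the month from the timestamp.
orders["month"] = orders["order_date"].dt.month

# Aggregation primitive: mean order amount per customer per month.
monthly_mean = orders.groupby(["customer_id", "month"])["amount"].mean()

# Stacked ("deep") feature: standard deviation of those monthly means per customer.
std_of_monthly_mean = monthly_mean.groupby(level="customer_id").std()

print(std_of_monthly_mean)
```

An automated tool enumerates many such compositions systematically; here a single one is written out explicitly.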

### Automated feature engineering tools

Several open-source libraries support automated feature generation.

| Tool | Focus area | Key capability |
|---|---|---|
| Featuretools | Relational and temporal data | Deep Feature Synthesis with customizable primitives |
| tsfresh | Time series data | Extracts hundreds of statistical, spectral, and nonlinear features from time series |
| Feature-engine | General tabular data | Scikit-learn-compatible transformers for encoding, discretization, and feature creation |
| tsflex | Time series data | Faster and more memory-efficient alternative to tsfresh |
| Category Encoders | Categorical data | 15+ encoding methods including target, binary, and hash encoding |

These tools reduce the manual effort involved in feature engineering but still require the practitioner to validate the generated features, check for data leakage, and manage the increased dimensionality.

## Synthetic features and model types

The usefulness of synthetic features varies by model type. The table below compares how different model families interact with synthetic features.

| Model family | Needs synthetic features? | Reason |
|---|---|---|
| [Linear regression](/wiki/linear_regression), [logistic regression](/wiki/logistic_regression) | Often yes | Cannot represent nonlinear relationships without polynomial or interaction terms |
| [Decision trees](/wiki/decision_tree), [random forests](/wiki/random_forest), [gradient boosting](/wiki/gradient_boosting) | Sometimes | Can learn nonlinear splits natively, but ratio and aggregation features can still help |
| [Support vector machines](/wiki/support_vector_machine_svm) | Sometimes | Kernel trick handles some nonlinearity, but explicit features can improve linear kernels |
| [Neural networks](/wiki/neural_network), [deep learning](/wiki/deep_learning) | Less often | Automatically learn feature representations in hidden layers, but handcrafted features can accelerate training and improve results on small datasets |

As a general rule, simpler models benefit more from synthetic features, while complex models (especially deep neural networks) can discover useful representations on their own given enough data. However, even in deep learning pipelines, manually engineered features remain common in tabular data tasks, where neural networks have historically lagged behind tree-based methods [8].

## Risks and best practices

Creating synthetic features introduces several risks that must be managed carefully.

### Overfitting

Adding too many features increases the capacity of the model to memorize the training data, leading to poor generalization. This is closely related to the curse of dimensionality: as the number of features grows relative to the number of training samples, the data becomes increasingly sparse in the high-dimensional feature space, and models need exponentially more data to maintain performance [9].

A commonly cited guideline is to maintain a sample-to-feature ratio of at least 10:1, with 20:1 or higher being preferable for stable and generalizable models.

### Data leakage

Some synthetic features can inadvertently leak information about the target variable into the training data. Target encoding is a common culprit: if the category mean is computed on the entire training set (including the current sample), it encodes target information that the model should not have access to at prediction time. Using cross-validated target encoding or smoothing helps mitigate this risk.
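A minimal sketch of cross-validated (out-of-fold) target encoding follows. The function name and the deterministic fold assignment are assumptions for illustration; in practice rows would be shuffled or split with a proper cross-validation utility, and smoothing could be combined with this scheme.

```python
import numpy as np
import pandas as pd

def out_of_fold_target_encode(categories: pd.Series, target: pd.Series, n_folds: int = 3) -> pd.Series:
    """Encode each row using category means computed on the other folds only."""
    encoded = pd.Series(np.nan, index=categories.index, dtype=float)
    fold_ids = np.arange(len(categories)) % n_folds  # deterministic folds for the example
    for fold in range(n_folds):
        held_out = fold_ids == fold
        train_means = target[~held_out].groupby(categories[~held_out]).mean()
        fallback = target[~held_out].mean()  # used for categories unseen in the other folds
        encoded.loc[held_out] = categories[held_out].map(train_means).fillna(fallback)
    return encoded

categories = pd.Series(["a", "a", "a", "b", "b", "b"])
target = pd.Series([1, 0, 1, 0, 0, 1])
encoded = out_of_fold_target_encode(categories, target)
print(encoded.tolist())  # [0.5, 1.0, 0.5, 0.5, 0.5, 0.0]
```

Because no row contributes to its own encoding, the encoded value cannot simply memorize that row's target.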

### Multicollinearity

Synthetic features are often correlated with the features they were derived from. For example, x and x^2 are correlated, as are age and year_of_birth. High multicollinearity can destabilize coefficient estimates in linear models and make interpretation difficult. Checking variance inflation factors (VIF) and applying regularization are standard countermeasures.
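The x and x^2 case can be checked numerically. The sketch below computes variance inflation factors from their definition, VIF_j = 1 / (1 - R_j^2), using NumPy least squares; the function is a hypothetical helper written for this example rather than a library routine.

```python
import numpy as np

def variance_inflation_factors(X: np.ndarray) -> np.ndarray:
    """VIF_j = 1 / (1 - R^2_j), where R^2_j regresses column j on the other columns."""
    n_samples, n_features = X.shape
    vifs = np.empty(n_features)
    for j in range(n_features):
        y = X[:, j]
        others = np.column_stack([np.ones(n_samples), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        residual = y - others @ coef
        r_squared = 1.0 - residual.var() / y.var()
        vifs[j] = 1.0 / (1.0 - r_squared)
    return vifs

x = np.arange(1.0, 11.0)  # the values 1 through 10
correlation = float(np.corrcoef(x, x ** 2)[0, 1])
vif = variance_inflation_factors(np.column_stack([x, x ** 2]))
print(round(correlation, 3), vif.round(1))
```

On this toy input the correlation between x and x^2 is about 0.97, which corresponds to a VIF of roughly 20 for both columns.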

### Computational cost

Polynomial and cross-product features can produce a very large number of new columns, increasing memory usage and training time. Feature selection methods (filter, wrapper, or embedded approaches) should be applied to prune uninformative features.

### Best practices summary

| Practice | Description |
|---|---|
| Start simple | Begin with arithmetic and ratio features before moving to polynomial or automated methods |
| Use domain knowledge | Features motivated by real-world understanding are more likely to generalize |
| Validate rigorously | Use [cross-validation](/wiki/cross-validation) to evaluate whether new features actually improve performance |
| Monitor feature importance | Remove features that do not contribute meaningfully, using permutation importance or SHAP values |
| Apply regularization | Use L1 ([Lasso](/wiki/regularization)), L2 (Ridge), or Elastic Net penalties to control complexity when using many synthetic features |
| Normalize after transforming | If a synthetic feature changes the scale of the data, apply [normalization](/wiki/normalization) or standardization |
| Watch for leakage | Ensure that no synthetic feature encodes future information or target values inappropriately |
| Document features | Record how each synthetic feature was created, including any parameters or thresholds used |

## Feature stores and production deployment

In production [MLOps](/wiki/mlops) workflows, synthetic features must be computed consistently during both training and inference. A feature store is a centralized repository that stores feature definitions, computed feature values, and the code used to generate them [10]. Feature stores help teams:

- Reuse features across multiple models and projects.
- Ensure consistency between training-time and serving-time feature computation (avoiding training-serving skew).
- Version and audit features for governance and reproducibility.
- Serve precomputed features with low latency for real-time predictions.

Popular open-source and managed feature stores include Feast, Hopsworks, and the feature store components of Databricks and Amazon SageMaker.
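The consistency goal can be illustrated without any particular feature store. In this sketch, a hypothetical registry of feature definitions is the single source of truth, and both the training path and the serving path call the same function. It is a toy stand-in for the idea, not the API of Feast, Hopsworks, or any other product.

```python
# Hypothetical registry: one definition per synthetic feature.
FEATURE_DEFINITIONS = {
    "house_age": lambda row: row["current_year"] - row["year_built"],
    "price_per_sqft": lambda row: row["price"] / row["sqft"] if row["sqft"] else None,
}

def compute_features(row: dict) -> dict:
    """Apply every registered definition to one raw record."""
    return {name: fn(row) for name, fn in FEATURE_DEFINITIONS.items()}

# Training path: batch over historical records.
historical = [
    {"current_year": 2024, "year_built": 1990, "price": 300_000.0, "sqft": 1_500.0},
    {"current_year": 2024, "year_built": 2015, "price": 450_000.0, "sqft": 0.0},
]
training_features = [compute_features(row) for row in historical]

# Serving path: the same function on a single incoming record.
serving_features = compute_features(historical[0])

print(training_features)
```

Because both paths share one definition, a change to a feature's formula cannot silently apply to training but not to serving.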

## Historical context

The idea of creating new variables from existing ones predates machine learning by many decades. In classical statistics, researchers routinely applied log transforms, computed interaction terms, and standardized variables as part of regression analysis. The Box-Cox transformation, introduced by George Box and David Cox in 1964, provided a systematic family of power transforms for normalizing data [11].

The term "feature engineering" became prominent in the machine learning community during the 2000s and 2010s, as practitioners recognized that the choice and construction of features often mattered more than the choice of algorithm. A remark widely attributed to Andrew Ng holds that "coming up with features is difficult, time-consuming, requires expert knowledge. Applied machine learning is basically feature engineering." Pedro Domingos made a similar point, arguing that the features used are "easily the most important factor" in whether a machine learning project succeeds [12].

Research into automated feature engineering began in the 1990s, with commercial and open-source tools becoming available from 2016 onward. Deep Feature Synthesis (2015) and the subsequent release of Featuretools marked an important step toward reducing the manual burden of feature construction. More recently, deep learning approaches have shifted some of the feature engineering workload to the model itself, which learns internal representations (synthetic features, in a sense) through its hidden layers.

## See also

- [Feature engineering](/wiki/feature_engineering)
- [Feature extraction](/wiki/feature_extraction)
- [Feature cross](/wiki/feature_cross)
- [Feature vector](/wiki/feature_vector)
- [Feature set](/wiki/feature_set)
- [Feature importances](/wiki/feature_importances)
- [Overfitting](/wiki/overfitting)
- [Regularization](/wiki/regularization)
- [Preprocessing](/wiki/preprocessing)
- [Normalization](/wiki/normalization)
- [One-hot encoding](/wiki/one-hot_encoding)
- [Dimension reduction](/wiki/dimension_reduction)

## References

1. Google Developers. "Working with numerical data: Polynomial transforms." *Machine Learning Crash Course*. https://developers.google.com/machine-learning/crash-course/numerical-data/polynomial-transforms
2. Scikit-learn developers. "PolynomialFeatures." *scikit-learn documentation*, version 1.8. https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html
3. Hastie, T., Tibshirani, R., and Friedman, J. *The Elements of Statistical Learning: Data Mining, Inference, and Prediction*. 2nd ed. Springer, 2009.
4. Google Developers. "Categorical data: Feature crosses." *Machine Learning Crash Course*. https://developers.google.com/machine-learning/crash-course/categorical-data/feature-crosses
5. Scikit-learn developers. "Time-related feature engineering." *scikit-learn documentation*, version 1.8. https://scikit-learn.org/stable/auto_examples/applications/plot_cyclical_feature_engineering.html
6. Scikit-learn developers. "TargetEncoder." *scikit-learn documentation*, version 1.8. https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.TargetEncoder.html
7. Kanter, J. M. and Veeramachaneni, K. "Deep feature synthesis: Towards automating data science endeavors." *Proceedings of the IEEE International Conference on Data Science and Advanced Analytics (DSAA)*, 2015. https://ieeexplore.ieee.org/document/7344858
8. Grinsztajn, L., Oyallon, E., and Varoquaux, G. "Why do tree-based models still outperform deep learning on typical tabular data?" *Advances in Neural Information Processing Systems (NeurIPS)*, 2022.
9. Bellman, R. *Adaptive Control Processes: A Guided Tour*. Princeton University Press, 1961.
10. Hopsworks. "Feature Store: The Definitive Guide." https://www.hopsworks.ai/dictionary/feature-store
11. Box, G. E. P. and Cox, D. R. "An analysis of transformations." *Journal of the Royal Statistical Society, Series B*, 26(2):211-252, 1964.
12. Domingos, P. "A few useful things to know about machine learning." *Communications of the ACM*, 55(10):78-87, 2012.
13. Murphy, K. P. *Probabilistic Machine Learning: An Introduction*. MIT Press, 2022.
14. Zheng, A. and Casari, A. *Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists*. O'Reilly Media, 2018.
