# Continuous Feature

> Source: https://aiwiki.ai/wiki/continuous_feature
> Updated: 2026-06-25
> Categories: Data & Datasets, Machine Learning, Statistics
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

A **continuous feature** is a numeric input variable in [machine learning](/wiki/machine_learning) and statistics that can take any value within a range, including decimals and fractions, rather than a fixed set of categories or counts. Google's Machine Learning Glossary defines it as "a floating-point feature with an infinite range of possible values, such as temperature or weight."[14] Common examples include height (175.3 cm), temperature (22.57 degrees C), income ($72,450.25), age, price, and elapsed time. Continuous features contrast with [categorical](/wiki/categorical_data) features (a finite set of labels such as red, yellow, green) and with discrete features (a countable set of values such as the number of children), and they form the backbone of most [regression](/wiki/regression_model) and [classification](/wiki/classification_model) tasks. Because their natural scales differ so widely, continuous features usually require preprocessing such as normalization, standardization, or bucketing before a model can use them effectively.

## ELI5 (Explain like I'm 5)

Imagine you have a ruler. You can measure something and get 3 inches, or 3.5 inches, or 3.51 inches, or even 3.5172 inches. You can always find a number in between two other numbers. That is what a continuous feature is: a measurement that can be any number, not just whole numbers or categories like "red" or "blue." When a computer learns to make predictions, it uses these kinds of measurements (like how tall someone is, or how much something weighs) to figure out patterns.

## What is a continuous feature?

A continuous feature (also called a continuous variable or numerical feature) is an input column whose values lie on a smooth, unbroken scale, so that between any two values there is always another valid value. In computing terms, continuous features are typically stored as floating-point numbers. The Google for Developers Machine Learning Glossary states plainly that a continuous feature is "a floating-point feature with an infinite range of possible values, such as temperature or weight," and explicitly contrasts it with the [discrete feature](/wiki/categorical_data), defined there as "a feature with a finite set of possible values."[14]

In [feature engineering](/wiki/feature_engineering), a continuous feature is any input column in a dataset whose domain is (a subset of) the real numbers and whose values are meaningful as quantities rather than labels. All standard arithmetic operations (addition, subtraction, multiplication, division) are meaningful on a continuous feature, which is one practical test for distinguishing it from a categorical code that merely happens to be stored as a number.

## Mathematical definition

In probability theory and statistics, a continuous random variable *X* is one whose [cumulative distribution function](https://en.wikipedia.org/wiki/Cumulative_distribution_function) (CDF) is absolutely continuous. This means the CDF can be expressed as the integral of a nonnegative function called the [probability density function](https://en.wikipedia.org/wiki/Probability_density_function) (PDF):

$$F_X(x) = P(X \le x) = \int_{-\infty}^{x} f_X(u)\,du$$

where $f_X(u) \ge 0$ for all $u$ and $\int_{-\infty}^{\infty} f_X(u)\,du = 1$.

A defining property of continuous random variables is that the probability of the variable taking any single exact value is zero: P(X = x) = 0 for every x. Instead, probabilities are assigned to intervals. This contrasts with discrete random variables, where individual outcomes can have nonzero probabilities.
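
The interval-versus-point distinction can be illustrated numerically. The sketch below assumes a standard normal variable and uses the closed-form CDF written with the error function; the helper names `normal_cdf` and `interval_probability` are illustrative, not part of any library.

```python
import math

def normal_cdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """CDF of a normal distribution, expressed with the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def interval_probability(a: float, b: float) -> float:
    """P(a <= X <= b) for a standard normal X, computed as F(b) - F(a)."""
    return normal_cdf(b) - normal_cdf(a)

# A wide interval carries substantial probability mass.
p_wide = interval_probability(-1.0, 1.0)

# Shrinking the interval around a point drives the probability toward zero.
p_narrow = interval_probability(0.0, 1e-9)

# A degenerate interval [x, x] has probability exactly zero.
p_point = interval_probability(0.5, 0.5)

print(p_wide, p_narrow, p_point)
```

Here `p_wide` is the familiar "about 68% within one standard deviation" figure, while `p_point` is zero because F(x) − F(x) = 0 for any continuous CDF.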

## How is a continuous feature different from a categorical or discrete feature?

Understanding the differences between feature types is fundamental for choosing the right preprocessing methods and [algorithms](/wiki/machine_learning). A continuous feature has an infinite range of possible floating-point values; a discrete feature has a countable (often integer) set of values obtained by counting; and a [categorical feature](/wiki/categorical_data) has a finite set of named labels. As the Google ML Glossary puts it, categorical data consists of "features having a specific set of possible values," giving the example of a feature `traffic-light-state` that "can only have one of the following three possible values: red, yellow, green."[14]

| Property | Continuous feature | Discrete feature | Categorical feature |
|---|---|---|---|
| **Value domain** | Any real number within a range (infinite possible values) | Countable set of values, usually integers | Finite set of named categories or labels |
| **Obtained by** | Measurement (length, weight, time) | Counting (number of children, inventory count) | Observation or assignment (color, nationality) |
| **Examples** | Temperature (22.57 °C), height (175.3 cm), salary ($72,450.25) | Number of siblings (0, 1, 2, 3), number of product reviews (0, 1, 2, ...) | Gender (male, female), blood type (A, B, AB, O) |
| **Arithmetic operations** | All arithmetic meaningful (add, subtract, multiply, divide) | Addition and subtraction meaningful; multiplication context-dependent | Arithmetic not meaningful |
| **Typical encoding** | Used directly as numeric input; may need [scaling](/wiki/scaling) | Used directly or [one-hot encoded](/wiki/one-hot_encoding) | [One-hot encoded](/wiki/one-hot_encoding), label encoded, or [embedded](/wiki/embeddings) |
| **Common distributions** | Normal, log-normal, uniform, exponential | Poisson, binomial, geometric | Not described by standard continuous distributions |

### Levels of measurement

Stanley Smith Stevens introduced a widely used classification of measurement scales in his 1946 paper "On the Theory of Scales of Measurement" published in *Science*.[1] Continuous features typically fall under two of these levels:

- **Interval scale.** Differences between values are meaningful, but the zero point is arbitrary. Temperature in Celsius and Fahrenheit are classic examples: the difference between 20 °C and 30 °C is the same as between 30 °C and 40 °C, but 0 °C does not mean "no temperature." You can add and subtract interval-scale values, but ratios are not meaningful (40 °C is not "twice as hot" as 20 °C).
- **Ratio scale.** The most informative level. Like the interval scale, differences are meaningful, but ratio-scale data also has a true zero that represents the complete absence of the measured property. Height, weight, distance, and time duration are ratio-scale. All arithmetic operations (addition, subtraction, multiplication, division) are valid.

In practice, most continuous features in [machine learning](/wiki/machine_learning) datasets are ratio-scale variables, though interval-scale features (such as dates or temperatures) also appear frequently.
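
The "40 °C is not twice as hot as 20 °C" point can be checked with a little arithmetic. The sketch below converts Celsius (interval scale) to kelvin (ratio scale, with a true zero); the function name is illustrative.

```python
def celsius_to_kelvin(celsius: float) -> float:
    """Shift the arbitrary Celsius zero to the absolute zero of the kelvin scale."""
    return celsius + 273.15

# Dividing Celsius readings gives a number, but it has no physical meaning.
naive_ratio = 40.0 / 20.0

# On the ratio scale the same two temperatures differ by only a few percent.
kelvin_ratio = celsius_to_kelvin(40.0) / celsius_to_kelvin(20.0)

# Differences, by contrast, are unaffected by where the zero point sits.
diff_celsius = 40.0 - 20.0
diff_kelvin = celsius_to_kelvin(40.0) - celsius_to_kelvin(20.0)

print(naive_ratio, round(kelvin_ratio, 4), diff_celsius, round(diff_kelvin, 6))
```

The ratio changes when the zero point moves (2.0 versus roughly 1.07) while the difference stays at 20 degrees, which is exactly what separates interval from ratio scales.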

## Common examples by domain

Continuous features appear in virtually every applied machine learning problem. The table below lists typical continuous features across several domains.

| Domain | Continuous features | Typical prediction target |
|---|---|---|
| Healthcare | Age, body mass index, blood pressure, heart rate, cholesterol level, blood glucose | Disease diagnosis, patient readmission risk |
| Finance | Income, account balance, transaction amount, credit score, debt-to-income ratio | Loan default, fraud detection, credit scoring |
| Weather and climate | Temperature, humidity, wind speed, atmospheric pressure, precipitation | Weather forecasting, crop yield prediction |
| E-commerce | Price, session duration, page views, cart value, shipping distance | Purchase probability, customer lifetime value |
| Manufacturing | Machine temperature, vibration frequency, pressure, cycle time, defect size | Predictive maintenance, quality control |
| Real estate | Square footage, lot size, distance to city center, property tax, median neighborhood income | Property price estimation |
| Transportation | Speed, fuel consumption, trip distance, traffic density, wait time | Travel time prediction, route optimization |

## How do you preprocess continuous features?

Raw continuous features often need to be transformed before they can be used effectively in [machine learning](/wiki/machine_learning) models. Preprocessing aims to put features on comparable scales, handle missing data, reduce skewness, and remove or flag anomalous values. The three most common steps are feature scaling (normalization and standardization), handling missing values and outliers, and bucketing (binning), each covered below.

### Feature scaling: normalization and standardization

Many learning algorithms (particularly those that rely on distance calculations or gradient-based optimization) are sensitive to the scale of input features. If one feature ranges from 0 to 1 and another from 0 to 100,000, the larger-scale feature can dominate the model. [Feature scaling](/wiki/scaling), also called [normalization](/wiki/normalization), addresses this.[12] Google's Machine Learning Crash Course describes normalization broadly as "the process of converting a variable's actual range of values into a standard range of values, such as: -1 to +1, 0 to 1, [or] Z-scores (roughly, -3 to +3)," and notes that "if you normalize a feature during training, you must also normalize that feature when making predictions."[15]

| Method | Formula | Output range | Strengths | Weaknesses |
|---|---|---|---|---|
| **Min-max normalization** | x' = (x − x_min) / (x_max − x_min) | [0, 1] | Simple; preserves original distribution shape | Sensitive to [outliers](/wiki/outlier_detection); can compress most data into a narrow band if outliers are extreme |
| **Z-score standardization** | x' = (x − μ) / σ | Unbounded (centered at 0) | Works well with normally distributed data; widely supported | Output is unbounded; μ and σ are themselves distorted by extreme outliers |
| **Robust scaling** | x' = (x − Q₂) / (Q₃ − Q₁) | Unbounded (centered at 0) | Uses median and IQR, making it resistant to outliers | Less intuitive output range |
| **Max-abs scaling** | x' = x / max(|x|) | [−1, 1] | Preserves sparsity (zeros remain zeros) | Still affected by individual extreme values |
| **Unit vector (L2 norm)** | x' = x / ‖x‖₂ | Unit sphere | Useful when direction matters more than magnitude (e.g., text similarity) | Destroys information about absolute scale |

Min-max normalization rescales values into a fixed band, usually 0 to 1, while z-score standardization (often just called standardization) re-centers a feature to a mean of 0 and rescales it to a standard deviation of 1. As a rule of thumb, standardization is the safer default for most algorithms, while min-max normalization is preferred when the algorithm requires bounded inputs.
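
The first three formulas in the table can be sketched in a few lines of standard-library Python. This is a minimal illustration rather than a production scaler: the function names and the `incomes` sample are made up for the example, and quartiles are computed with the "inclusive" method of `statistics.quantiles`, which is one of several common conventions.

```python
import statistics

def min_max_scale(values):
    """x' = (x - x_min) / (x_max - x_min); a constant feature maps to 0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def z_score_scale(values):
    """x' = (x - mean) / std, using the population standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def robust_scale(values):
    """x' = (x - median) / IQR, which is far less sensitive to outliers."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        return [0.0 for _ in values]
    return [(v - q2) / iqr for v in values]

# Hypothetical incomes with one extreme value.
incomes = [30_000.0, 40_000.0, 50_000.0, 60_000.0, 1_000_000.0]

scaled_min_max = min_max_scale(incomes)
scaled_z = z_score_scale(incomes)
scaled_robust = robust_scale(incomes)
print(scaled_min_max, scaled_robust)
```

With this sample, min-max scaling squeezes the four typical incomes into a narrow band just above 0 because the single outlier defines the top of the range, whereas robust scaling keeps them evenly spread and leaves only the outlier far from 0.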

#### Which algorithms need scaling?

Not all algorithms require scaled inputs. The table below summarizes sensitivity.

| Algorithm category | Examples | Scaling needed? | Reason |
|---|---|---|---|
| Distance-based | k-nearest neighbors, [SVM](/wiki/support_vector_machine_svm) | Yes | Distance calculations are dominated by large-scale features |
| Gradient-based | [Neural networks](/wiki/neural_network), [logistic regression](/wiki/logistic_regression) | Yes | Unscaled features cause slow or unstable convergence in [gradient descent](/wiki/gradient_descent) |
| Regularized linear | Lasso ([L1](/wiki/l1_regularization)), Ridge ([L2](/wiki/l2_regularization)) | Yes | Regularization penalizes coefficients equally, so feature scales must be comparable |
| Tree-based | [Decision trees](/wiki/decision_tree), [random forests](/wiki/random_forest), [gradient boosting](/wiki/gradient_boosting) | No | Trees split on thresholds, so scale does not affect split quality |
| Naive Bayes | Gaussian Naive Bayes | No | Parameters are estimated per feature independently |

### Handling missing values

Missing data in continuous features is common in real-world datasets. Several imputation strategies exist, each with trade-offs.

| Method | Description | Best suited for | Drawbacks |
|---|---|---|---|
| **Mean imputation** | Replace missing values with the feature's mean | Normally distributed features with few missing values | Reduces variance; biased when data is skewed |
| **Median imputation** | Replace with the median | Skewed distributions | Ignores relationships between features |
| **KNN imputation** | Impute using the mean of *k* nearest neighbors | Complex datasets where features are correlated | Computationally expensive; sensitive to *k* and distance metric |
| **Regression imputation** | Predict missing values from other features using a regression model | Features with strong linear relationships | Can overfit; underestimates variability |
| **MICE** (Multiple Imputation by Chained Equations) | Iteratively imputes each feature using the others[8] | Complex missing-data patterns | Computationally intensive; requires careful model specification |
| **MissForest** | Uses a [random forest](/wiki/random_forest) to impute each feature iteratively[7] | Mixed feature types, nonlinear relationships | Slow on large datasets |
| **Indicator variable** | Add a binary flag column marking whether the value was missing, then impute with any simple method | When missingness itself is informative | Increases dimensionality |
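
The simplest rows of the table (mean or median imputation combined with an indicator variable) can be sketched as follows. Missing values are represented as `None` here, which is an assumption of this example; the function name and the `ages` sample are hypothetical.

```python
import statistics

def impute_with_indicator(values, strategy="median"):
    """Fill None entries with the mean or median of the observed values.

    Returns the imputed column plus a 0/1 flag column marking which
    entries were originally missing.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    if strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "mean":
        fill = statistics.fmean(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    imputed = [fill if v is None else v for v in values]
    was_missing = [1 if v is None else 0 for v in values]
    return imputed, was_missing

# Hypothetical right-skewed column with two missing entries.
ages = [25.0, None, 31.0, 40.0, None, 90.0]

median_filled, flags = impute_with_indicator(ages, strategy="median")
mean_filled, _ = impute_with_indicator(ages, strategy="mean")
print(median_filled, flags, mean_filled)
```

In this sample the mean fill is pulled upward by the value 90 while the median fill is not, which is why the median is usually preferred for skewed features.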

### Handling outliers

Outliers in continuous features can distort model training, especially for algorithms sensitive to extreme values (linear models, [neural networks](/wiki/neural_network), distance-based methods).

Common detection methods include:

- **Z-score method.** Flag observations more than 2 or 3 standard deviations from the mean. Works well for approximately normal distributions.
- **IQR method.** Compute the interquartile range (Q3 − Q1). Values below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR are considered outliers. More robust to non-normal distributions.
- **Isolation Forest.** An unsupervised [ensemble](/wiki/ensemble) method that isolates anomalies by randomly partitioning the feature space.[10] Effective in high-dimensional settings.

Once detected, outliers can be removed, capped (winsorized), or transformed (see the next section).
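
The z-score and IQR rules, plus capping, can be sketched with the standard library. The function names and the `readings` sample are illustrative, and the quartile convention ("inclusive") is an assumption; other conventions shift the fences slightly.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

def z_score_outliers(values, threshold=3.0):
    """Values more than `threshold` population standard deviations from the mean."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]

def cap(values, lower, upper):
    """Winsorize: clip every value into the closed range [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]

# Hypothetical sensor readings with one obvious anomaly.
readings = [10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 14.0, 95.0]

flagged_iqr = iqr_outliers(readings)
flagged_z2 = z_score_outliers(readings, threshold=2.0)
flagged_z3 = z_score_outliers(readings, threshold=3.0)
capped = cap(readings, 7.0, 17.0)
print(flagged_iqr, flagged_z2, flagged_z3, capped)
```

On this small sample the 3-standard-deviation rule flags nothing, because the outlier itself inflates the standard deviation, while the IQR rule and the 2-standard-deviation rule both flag 95. This masking effect is one reason the IQR method is considered more robust.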

### Transformations for skewed distributions

Many machine learning algorithms perform best when input features follow an approximately normal (Gaussian) distribution. Highly skewed continuous features can be transformed to reduce skewness.

| Transformation | Formula | Applicable to | Effect |
|---|---|---|---|
| **Log transform** | x' = log(x + 1) | Non-negative values; right-skewed distributions | Compresses large values, expands small values |
| **Square root** | x' = √x | Non-negative values; moderate right skew | Less aggressive than log; easier to interpret |
| **Box-Cox** | x' = (x^λ − 1) / λ (λ ≠ 0); log(x) (λ = 0) | Strictly positive values | Finds optimal λ to maximize normality; very flexible |
| **Yeo-Johnson** | Extension of Box-Cox that handles negative and zero values | Any real-valued data | Same flexibility as Box-Cox without the positivity constraint |
| **Quantile transform** | Maps data to a uniform or normal distribution via CDF | Any distribution | Guarantees specified output distribution; non-linear and may distort relationships |

The Box-Cox transformation was proposed by George Box and David Cox in a 1964 paper in the *Journal of the Royal Statistical Society*.[2] The Yeo-Johnson transformation, proposed by In-Kwon Yeo and Richard Johnson in 2000, generalizes Box-Cox to handle negative values.[3]
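
The log and Box-Cox formulas from the table can be sketched directly. In this example λ is supplied by the caller rather than estimated from the data (estimating the optimal λ is a separate fitting step that is not shown), and the `skewness` helper and `sample` data are illustrative.

```python
import math

def log1p_transform(values):
    """x' = log(x + 1); defined for x > -1, so zeros are allowed."""
    return [math.log1p(v) for v in values]

def box_cox(values, lam):
    """Box-Cox transform for strictly positive data with a fixed lambda."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]

def skewness(values):
    """Population skewness: third central moment over variance to the 3/2."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return 0.0 if m2 == 0 else m3 / m2 ** 1.5

# Hypothetical right-skewed sample.
sample = [1.0, 2.0, 3.0, 5.0, 8.0, 13.0, 21.0, 100.0]

skew_before = skewness(sample)
skew_after = skewness(log1p_transform(sample))
print(round(skew_before, 2), round(skew_after, 2))
```

The log transform pulls in the long right tail, so the skewness of the transformed sample is noticeably lower than that of the raw sample.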

## What is bucketing or binning of a continuous feature?

Discretization, also called bucketing or binning, converts a continuous feature into a set of discrete intervals or bins. The Google ML Glossary describes bucketing as "converting a single feature into multiple binary features called buckets or bins, typically based on a value range," and adds that "the chopped feature is typically a continuous feature."[14] While this deliberately discards some information, it can be useful in specific situations.

### Common binning methods

| Method | How it works | Strengths | Weaknesses |
|---|---|---|---|
| **Equal-width** | Divides the feature range into *k* bins of equal width | Simple to implement | Bins can be highly imbalanced if the distribution is skewed |
| **Equal-frequency (quantile)** | Each bin contains approximately the same number of observations | Naturally handles skewness | Bin boundaries may split similar values |
| **K-means binning** | Uses [k-means clustering](/wiki/k-means) to find natural groupings | Adapts to the data's structure | Requires choosing *k*; computationally heavier |
| **Decision tree-based** | Uses a [decision tree](/wiki/decision_tree) trained on the target variable to find optimal split points | Supervised; maximizes information gain | Prone to overfitting if tree depth is not limited |
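
The first two rows of the table (equal-width and equal-frequency binning) can be compared on a skewed sample. This sketch returns only the interior bin edges and assigns bins with `bisect`; the function names and the `incomes_k` data (in thousands) are hypothetical.

```python
import bisect
import statistics

def equal_width_edges(values, k):
    """Interior edges that split [min, max] into k bins of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]

def equal_frequency_edges(values, k):
    """Interior edges at the k-quantiles, so bins hold similar counts."""
    return statistics.quantiles(values, n=k, method="inclusive")

def assign_bins(values, edges):
    """Bin index for each value: 0 below the first edge, len(edges) above the last."""
    return [bisect.bisect_right(edges, v) for v in values]

def bin_counts(bins, k):
    """Number of observations that fall into each of the k bins."""
    return [bins.count(i) for i in range(k)]

# Hypothetical incomes in thousands, with one very large value.
incomes_k = [20.0, 22.0, 25.0, 27.0, 30.0, 32.0, 35.0, 40.0, 200.0]

width_counts = bin_counts(assign_bins(incomes_k, equal_width_edges(incomes_k, 3)), 3)
freq_counts = bin_counts(assign_bins(incomes_k, equal_frequency_edges(incomes_k, 3)), 3)
print(width_counts, freq_counts)
```

Equal-width binning puts eight of the nine values in the first bin and leaves the middle bin empty, while equal-frequency binning spreads them three per bin, matching the strengths and weaknesses listed in the table.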

### When to discretize

- When a nonlinear relationship exists and the model is linear, binning can help approximate that relationship.
- When the downstream algorithm natively handles categorical data better than continuous (e.g., Naive Bayes).
- When interpretability is more important than predictive power (e.g., grouping ages into "18-25," "26-35," etc.).

Discretization should be used with caution. It can result in information loss, and practitioners should verify through proper cross-validated evaluation that binning actually improves model performance before adopting it.

## Feature engineering with continuous features

[Feature engineering](/wiki/feature_engineering) transforms raw continuous features into representations that better capture patterns in the data.[11]

### Polynomial and interaction features

For a pair of continuous features [a, b], generating degree-2 polynomial features produces [1, a, b, a², ab, b²]. The term *ab* is called an interaction feature. Polynomial features allow linear models to capture nonlinear relationships without switching to a more complex model architecture.

Scikit-learn provides `PolynomialFeatures` for this purpose.[6] In practice, degrees of 2 or 3 are most common: with *n* input features and degree *d*, the full expansion contains C(n + d, d) terms, a count that grows quickly in both *n* and *d* and can lead to [overfitting](/wiki/overfitting) and high computational cost.
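
To make the expansion concrete, the following pure-Python sketch enumerates every monomial up to a given total degree. It is an illustration of the idea, not scikit-learn's implementation, and the function name is hypothetical. The feature count it produces is the binomial coefficient C(n + d, d) for n inputs and degree d.

```python
from itertools import combinations_with_replacement
from math import comb, prod

def polynomial_features(row, degree=2):
    """All monomials of total degree 0..degree, ordered by degree."""
    features = []
    for d in range(degree + 1):
        # Each multiset of d column indices corresponds to one monomial.
        for indices in combinations_with_replacement(range(len(row)), d):
            features.append(prod(row[i] for i in indices))
    return features

# For [a, b] = [2, 3] the order is [1, a, b, a^2, ab, b^2].
expanded = polynomial_features([2, 3], degree=2)

# The number of generated features grows quickly with inputs and degree.
count_2_inputs = comb(2 + 2, 2)
count_10_inputs_deg3 = comb(10 + 3, 3)
count_100_inputs_deg3 = comb(100 + 3, 3)
print(expanded, count_2_inputs, count_10_inputs_deg3, count_100_inputs_deg3)
```

Going from 10 to 100 input features at degree 3 takes the expansion from a few hundred columns to well over a hundred thousand, which is why higher degrees are rarely used without some form of feature selection or regularization.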

### Domain-specific transformations

Some common domain-specific feature engineering techniques for continuous variables include:

- **Ratios and proportions.** Dividing one feature by another (e.g., debt-to-income ratio, body mass index = weight / height²).
- **Differences and changes.** Computing differences between time-stamped values (e.g., month-over-month revenue change).
- **Rolling statistics.** Computing moving averages, moving standard deviations, or rolling sums over a time window in [time series](/wiki/time_series_analysis) data.
- **Aggregations.** Summarizing groups of related continuous values (e.g., average transaction amount per customer).
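
A few of these derived features can be sketched without any libraries. The function names and the `monthly_revenue` series are made up for illustration; in practice a dataframe library would usually handle windowing and grouping.

```python
def ratio(numerator, denominator):
    """Element-wise ratio of two columns, e.g. debt / income."""
    return [n / d for n, d in zip(numerator, denominator)]

def bmi(weight_kg, height_m):
    """Body mass index = weight / height squared."""
    return weight_kg / height_m ** 2

def differences(values):
    """Change between consecutive observations (one element shorter)."""
    return [b - a for a, b in zip(values, values[1:])]

def rolling_mean(values, window):
    """Mean over each full trailing window of the given length."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]

# Hypothetical monthly revenue series.
monthly_revenue = [100.0, 110.0, 105.0, 120.0, 130.0]

month_over_month = differences(monthly_revenue)
smoothed = rolling_mean(monthly_revenue, window=3)
debt_to_income = ratio([12_000.0, 5_000.0], [48_000.0, 50_000.0])
print(month_over_month, smoothed, debt_to_income, round(bmi(70.0, 1.75), 2))
```

Note that the difference and rolling outputs are shorter than the input series, so they need to be aligned with the original rows (or padded) before being used as model features.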

### Numerical embeddings in deep learning

Recent research has explored transforming scalar continuous features into high-dimensional vector representations (embeddings) before feeding them into [deep learning](/wiki/deep_learning) models for tabular data. Gorishniy et al. (2022) showed that numerical embeddings can significantly improve the performance of transformer-based models on tabular benchmarks, making them competitive with [gradient boosting](/wiki/gradient_boosting) methods like XGBoost and LightGBM.[5]

Two main approaches have emerged:

1. **Piecewise linear encoding.** The continuous value is encoded as a vector using a set of learned breakpoints, producing a piecewise linear mapping.
2. **Periodic activations.** The scalar is passed through periodic functions (such as sine and cosine at different frequencies) to create a rich embedding.
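
Both approaches can be sketched as plain functions that map one scalar to a vector. This is a simplified illustration, not the implementation from the cited paper: the bin edges and frequencies are fixed by hand here, whereas in a real model they would be chosen from data or trained, and values outside the edge range are simply clamped.

```python
import math

def piecewise_linear_encode(x, edges):
    """One component per bin [edges[t], edges[t+1]].

    A component is 1.0 if x lies above the bin, 0.0 if below it,
    and the fractional position inside the bin otherwise.
    """
    encoded = []
    for left, right in zip(edges, edges[1:]):
        fraction = (x - left) / (right - left)
        encoded.append(min(1.0, max(0.0, fraction)))
    return encoded

def periodic_encode(x, frequencies):
    """Concatenated sin/cos pairs of 2*pi*f*x for each frequency f."""
    encoded = []
    for f in frequencies:
        angle = 2.0 * math.pi * f * x
        encoded.extend([math.sin(angle), math.cos(angle)])
    return encoded

# Hand-picked (assumed) breakpoints and frequencies for illustration.
ple_vector = piecewise_linear_encode(2.5, edges=[0.0, 1.0, 2.0, 3.0, 4.0])
periodic_vector = periodic_encode(0.3, frequencies=[1.0, 2.0, 4.0])
print(ple_vector, periodic_vector)
```

In both cases a single number becomes a small vector, which downstream layers can then project to the model's embedding width.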

## Feature selection for continuous features

When a dataset contains many continuous features, selecting the most relevant ones can improve model accuracy, reduce training time, and prevent [overfitting](/wiki/overfitting).[13]

### Filter methods

Filter methods evaluate features individually based on statistical properties, independent of any particular model.

- **Pearson correlation.** Measures the linear relationship between a continuous feature and a continuous target. Values range from −1 to 1. Features with correlations close to 0 are candidates for removal.
- **Mutual information.** A nonparametric measure from information theory that captures both linear and nonlinear dependencies. It equals zero if and only if two variables are independent. Unlike Pearson correlation, mutual information can detect arbitrary relationships and works with mixed data types (continuous and discrete). Scikit-learn's `mutual_info_classif` and `mutual_info_regression` functions use k-nearest neighbor-based entropy estimators for continuous features.[9]
- **ANOVA F-statistic.** Tests whether the means of a continuous feature differ significantly across classes. Useful for classification tasks with continuous inputs.
- **Variance threshold.** Removes features whose variance falls below a specified cutoff. A feature with near-zero variance provides almost no information.
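
Two of these filters are easy to sketch by hand, and doing so also shows why Pearson correlation alone can miss useful features. The function names and toy data below are illustrative; the example does not implement mutual information or the ANOVA F-statistic.

```python
import math
import statistics

def pearson_r(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return cov / (spread_x * spread_y)

def variance_threshold(columns, cutoff=0.0):
    """Names of the columns whose population variance exceeds the cutoff."""
    return [name for name, col in columns.items() if statistics.pvariance(col) > cutoff]

x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y_linear = [3.0 * v + 1.0 for v in x]   # perfectly linear in x
y_quadratic = [v ** 2 for v in x]       # fully determined by x, but not linearly

r_linear = pearson_r(x, y_linear)
r_quadratic = pearson_r(x, y_quadratic)
kept = variance_threshold({"x": x, "constant": [5.0] * 5})
print(r_linear, r_quadratic, kept)
```

The quadratic target depends entirely on `x`, yet its Pearson correlation with `x` is zero, so a correlation filter would discard a perfectly informative feature. This is the kind of dependency that mutual information is designed to detect.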

### Wrapper methods

Wrapper methods evaluate subsets of features by training and validating a model.

- **Recursive feature elimination (RFE).** Trains a model, removes the least important feature, and repeats.
- **Forward/backward selection.** Iteratively adds or removes features and evaluates model performance at each step.

### Embedded methods

Embedded methods perform feature selection as part of the model training process.

- **L1 regularization (Lasso).** Drives the coefficients of less important features to exactly zero during training, effectively performing automatic feature selection.
- **Tree-based [feature importance](/wiki/feature_importances).** [Decision trees](/wiki/decision_tree) and ensembles like [random forests](/wiki/random_forest) provide feature importance scores based on how much each feature reduces impurity (e.g., Gini impurity or information gain) across all splits.

## Algorithm-specific considerations

Different machine learning algorithms interact with continuous features in different ways.

### Linear models

[Linear regression](/wiki/linear_regression), [logistic regression](/wiki/logistic_regression), and other generalized linear models assume a linear (or monotonic, after a link function) relationship between features and the target. Continuous features can be used directly, but nonlinear relationships must be captured through explicit feature engineering (polynomial terms, binning) or by switching to a nonlinear model.

### Tree-based models

[Decision trees](/wiki/decision_tree), [random forests](/wiki/random_forest), and [gradient boosting](/wiki/gradient_boosting) algorithms handle continuous features natively. They find optimal split points by evaluating thresholds along each feature. Because a split depends only on the ordering of a feature's values, tree-based models are invariant to strictly monotonic transformations of features (the threshold values change, but the resulting partitions of the training data do not), so scaling is unnecessary.
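
The invariance can be demonstrated with a one-split regression "stump". The sketch below is a deliberately tiny stand-in for a real tree learner: it searches for the single threshold that minimizes the total sum of squared errors and reports which rows go left. All names and data are hypothetical.

```python
import math

def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split_left_rows(xs, ys):
    """Row indices sent left by the threshold split that minimizes total SSE."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best_score, best_left = math.inf, None
    for cut in range(1, len(order)):
        if xs[order[cut - 1]] == xs[order[cut]]:
            continue  # no threshold can separate equal values
        left, right = order[:cut], order[cut:]
        score = sse([ys[i] for i in left]) + sse([ys[i] for i in right])
        if score < best_score:
            best_score, best_left = score, frozenset(left)
    return best_left

x_raw = [1.0, 10.0, 100.0, 1_000.0, 10_000.0, 100_000.0]
y = [1.0, 1.2, 0.9, 5.0, 5.3, 4.8]

left_raw = best_split_left_rows(x_raw, y)
left_log = best_split_left_rows([math.log(v) for v in x_raw], y)
left_scaled = best_split_left_rows([v / 1_000.0 for v in x_raw], y)
print(sorted(left_raw), sorted(left_log), sorted(left_scaled))
```

The raw, log-transformed, and rescaled versions of the feature all send the same three rows to the left child, because each transformation preserves the ordering of the values.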

### Neural networks

[Neural networks](/wiki/neural_network) benefit substantially from scaled continuous inputs. Without scaling, neurons in early layers may saturate (for sigmoid or tanh activations) or produce wildly different gradient magnitudes, slowing or destabilizing training. [Batch normalization](/wiki/batch_normalization), proposed by Ioffe and Szegedy in 2015, normalizes intermediate layer outputs during training and has become a standard technique in deep networks.[4]

### Support vector machines

[SVMs](/wiki/support_vector_machine_svm) compute decision boundaries based on distances in feature space. Unscaled features lead to boundaries that are dominated by the highest-magnitude feature. Standardization (z-score) or min-max scaling is typically applied before training an SVM.

### K-nearest neighbors

k-nearest neighbors (KNN) classifiers and regressors rely on distance metrics (Euclidean, Manhattan, etc.) to find similar observations. Continuous features must be scaled so that all features contribute proportionally to distance calculations.
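
A two-feature example shows how an unscaled feature can decide which neighbor is "nearest". The rows and the per-feature standard deviations below are invented for illustration; in practice the standard deviations would be estimated from training data. Mean-centering is omitted because it cancels out in distance calculations.

```python
import math

# Hypothetical rows: (age in years, income in dollars).
query = (30.0, 50_000.0)
candidates_raw = {
    "similar_age": (31.0, 58_000.0),
    "different_age": (70.0, 50_500.0),
}

# Assumed standard deviations, as if estimated from a training set.
AGE_STD, INCOME_STD = 15.0, 20_000.0

def scale(row):
    """Divide each feature by its (assumed) standard deviation."""
    return (row[0] / AGE_STD, row[1] / INCOME_STD)

def nearest(point, candidates):
    """Name of the candidate with the smallest Euclidean distance to point."""
    return min(candidates, key=lambda name: math.dist(point, candidates[name]))

candidates_scaled = {name: scale(row) for name, row in candidates_raw.items()}

nearest_raw = nearest(query, candidates_raw)
nearest_scaled = nearest(scale(query), candidates_scaled)
print(nearest_raw, nearest_scaled)
```

On the raw values the income column dominates, so the person who is 40 years older is judged nearest. After scaling, both features contribute on comparable terms and the similar-aged person becomes the nearest neighbor.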

## Distributions of continuous features

Understanding the distribution of a continuous feature guides preprocessing and modeling decisions. Below are commonly encountered distributions.

| Distribution | Shape | Real-world examples | Relevant ML consideration |
|---|---|---|---|
| **Normal (Gaussian)** | Symmetric bell curve | Height, test scores, measurement errors | Many algorithms assume or perform best with Gaussian inputs |
| **Log-normal** | Right-skewed; log of the variable is normal | Income, stock prices, city population sizes | Apply log transform to normalize |
| **Uniform** | Flat; all values equally likely | Random seeds, some synthetic features | Scaling is straightforward |
| **Exponential** | Heavily right-skewed; memoryless | Time between events (e.g., customer arrivals) | Consider log or Box-Cox transform |
| **Bimodal / multimodal** | Two or more peaks | Mixed populations (e.g., heights of adults from two demographic groups) | Consider separating into subpopulations or using [clustering](/wiki/clustering) |
| **Power-law** | Extremely right-skewed; long tail | Website page views, word frequencies, earthquake energies | Log transform; consider robust scaling |

## Visualization techniques

Visualizing continuous features helps identify distribution shape, outliers, relationships, and potential issues before modeling.

- **Histogram.** Shows the frequency distribution of a single continuous feature. Bin width affects the level of detail visible.
- **Kernel density estimate (KDE).** A smoothed version of the histogram that estimates the PDF of the feature.
- **Box plot.** Displays the median, quartiles, and outliers. Useful for comparing distributions across groups.
- **Q-Q plot (quantile-quantile plot).** Compares the quantiles of the feature against the quantiles of a theoretical distribution (typically normal). Points falling on a straight line indicate the feature follows that distribution.
- **Scatter plot.** Plots two continuous features against each other to reveal linear or nonlinear relationships.
- **Correlation heatmap.** Visualizes pairwise Pearson correlations between all continuous features in a dataset, helping to identify multicollinearity.

## Practical guidelines

The following recommendations summarize best practices for working with continuous features in machine learning projects.

1. **Always visualize first.** Plot histograms and box plots for every continuous feature before doing any modeling. Understanding the distribution, skewness, and outlier profile informs every subsequent decision.
2. **Scale when required.** Use standardization (z-score) as a safe default for most algorithms. Use min-max normalization if the algorithm requires bounded inputs (e.g., some [neural network](/wiki/neural_network) architectures). Use robust scaling if outliers are present.[12]
3. **Transform skewed features.** Apply log, Box-Cox, or Yeo-Johnson transformations for heavily skewed features, especially when using algorithms that assume normality.
4. **Handle missing values thoughtfully.** Simple mean or median imputation works for small amounts of missingness. For larger proportions, use KNN imputation or MICE.
5. **Watch for multicollinearity.** Highly correlated continuous features provide redundant information and can destabilize linear models. Use variance inflation factor (VIF) analysis or correlation matrices to detect and address this.
6. **Avoid leakage when validating transformations.** Any preprocessing step (scaling, imputation, transformation) should be fit only on the training set, or on the training portion of each cross-validation fold, and then applied unchanged to the validation and test sets. Fitting on the full dataset before splitting leads to data leakage. As Google's Machine Learning Crash Course warns, "if you normalize a feature during training, you must also normalize that feature when making predictions."[15]
7. **Consider the algorithm.** Tree-based models need little preprocessing of continuous features. Linear and distance-based models need careful scaling and transformation.
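
The fit-on-train, apply-everywhere pattern from guideline 6 can be sketched with a small scaler class. `TrainFittedScaler` is a hypothetical name written for this example and is not a library class; the data are made up.

```python
import statistics

class TrainFittedScaler:
    """Hypothetical z-score scaler that learns its statistics from training data only."""

    def fit(self, train_values):
        self.mean_ = statistics.fmean(train_values)
        # Fall back to 1.0 so a constant feature does not cause division by zero.
        self.std_ = statistics.pstdev(train_values) or 1.0
        return self

    def transform(self, values):
        """Apply the stored training statistics; nothing is re-estimated here."""
        return [(v - self.mean_) / self.std_ for v in values]

train = [10.0, 12.0, 14.0, 16.0, 18.0]
test = [20.0, 100.0]

scaler = TrainFittedScaler().fit(train)    # statistics come from train only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)       # reuses the training mean and std
print(scaler.mean_, round(scaler.std_, 4), [round(v, 3) for v in test_scaled])
```

Test values may land far outside the range seen in training, which is expected. Refitting the scaler on the test set would hide that shift and leak information about the test distribution into the evaluation.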

## See also

- [Feature engineering](/wiki/feature_engineering)
- [Scaling](/wiki/scaling)
- [Normalization](/wiki/normalization)
- [Categorical data](/wiki/categorical_data)
- [Numerical data](/wiki/numerical_data)
- [Feature extraction](/wiki/feature_extraction)
- [Dense feature](/wiki/dense_feature)
- [Preprocessing](/wiki/preprocessing)
- [Outlier detection](/wiki/outlier_detection)
- [One-hot encoding](/wiki/one-hot_encoding)
- [Batch normalization](/wiki/batch_normalization)

## References

1. Stevens, S. S. (1946). "On the Theory of Scales of Measurement." *Science*, 103(2684), 677-680.
2. Box, G. E. P., & Cox, D. R. (1964). "An Analysis of Transformations." *Journal of the Royal Statistical Society, Series B*, 26(2), 211-252.
3. Yeo, I.-K., & Johnson, R. A. (2000). "A New Family of Power Transformations to Improve Normality or Symmetry." *Biometrika*, 87(4), 954-959.
4. Ioffe, S., & Szegedy, C. (2015). "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift." *Proceedings of the 32nd International Conference on Machine Learning (ICML)*.
5. Gorishniy, Y., Rubachev, I., Khrulkov, V., & Babenko, A. (2022). "On Embeddings for Numerical Features in Tabular Deep Learning." *Advances in Neural Information Processing Systems (NeurIPS)*.
6. Pedregosa, F., et al. (2011). "Scikit-learn: Machine Learning in Python." *Journal of Machine Learning Research*, 12, 2825-2830.
7. Stekhoven, D. J., & Bühlmann, P. (2012). "MissForest: Non-parametric Missing Value Imputation for Mixed-type Data." *Bioinformatics*, 28(1), 112-118.
8. van Buuren, S., & Groothuis-Oudshoorn, K. (2011). "mice: Multivariate Imputation by Chained Equations in R." *Journal of Statistical Software*, 45(3), 1-67.
9. Kraskov, A., Stögbauer, H., & Grassberger, P. (2004). "Estimating Mutual Information." *Physical Review E*, 69(6), 066138.
10. Liu, F. T., Ting, K. M., & Zhou, Z.-H. (2008). "Isolation Forest." *Proceedings of the IEEE International Conference on Data Mining (ICDM)*, 413-422.
11. Zheng, A., & Casari, A. (2018). *Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists*. O'Reilly Media.
12. Raschka, S. (2014). "About Feature Scaling and Normalization." sebastianraschka.com.
13. Guyon, I., & Elisseeff, A. (2003). "An Introduction to Variable and Feature Selection." *Journal of Machine Learning Research*, 3, 1157-1182.
14. Google for Developers. "Machine Learning Glossary." developers.google.com/machine-learning/glossary (continuous feature, discrete feature, categorical data, bucketing).
15. Google for Developers. "Machine Learning Crash Course: Numerical Data, Normalization." developers.google.com/machine-learning/crash-course/numerical-data/normalization.