See also: Categorical data, Feature engineering, Data preprocessing
Numerical data, also called quantitative data, represents information expressed as numbers on a continuous or discrete scale. In machine learning and statistics, numerical data forms the foundation for mathematical models and algorithms that identify patterns, compute distances, and generate predictions. Unlike categorical data, which represents group membership or labels, numerical data supports arithmetic operations such as addition, subtraction, multiplication, and division, making it directly usable by most learning algorithms.
Virtually every machine learning pipeline involves numerical data at some stage. Even when the original inputs are images, text, or audio, they are ultimately converted into numerical representations (pixel values, embeddings, or spectrograms) before a model can process them. Understanding how to properly handle, transform, and preprocess numerical data is therefore one of the most important skills in applied machine learning and data science.
Numerical data falls into two primary categories based on how values are distributed along the number line.
Continuous data can take any value within a given range, including fractions and decimals. Between any two continuous values, there are infinitely many possible intermediate values. A continuous feature is measured rather than counted.
| Example | Range | Notes |
|---|---|---|
| Temperature | -273.15 C (absolute zero) and above | Can take any decimal value |
| Height | 0 to ~2.75 m for humans | Measured with arbitrary precision |
| Stock price | 0 to unbounded | Varies continuously over time |
| Sensor voltage | Depends on sensor | Analog signal converted to digital |
| Blood pressure | 0 to ~300 mmHg | Measured on a continuous scale |
Continuous features are common in scientific measurements, financial data, and sensor readings. Many statistical methods and machine learning algorithms assume that input features are continuous, since operations like computing means, variances, and gradients are naturally defined for continuous values.
Discrete data consists of distinct, countable values. A discrete feature is counted rather than measured, and there are gaps between successive possible values.
| Example | Possible Values | Notes |
|---|---|---|
| Number of children | 0, 1, 2, 3, ... | Cannot have 2.5 children |
| Number of website visits | 0, 1, 2, ... | Whole-number counts |
| Dice roll outcome | 1, 2, 3, 4, 5, 6 | Finite set of integers |
| Number of defective items | 0, 1, 2, ... | Count data in manufacturing |
| Shoe size (US) | 5, 5.5, 6, 6.5, ... | Fixed increments of 0.5 |
Discrete numerical data sometimes overlaps with ordinal categorical data. For instance, a customer satisfaction rating from 1 to 5 could be treated as either discrete numerical data (if the spacing between values is meaningful) or as ordinal categorical data (if only the rank order matters). The choice depends on the assumptions of the model and the analyst's judgment.
Numerical data is further classified by the measurement scale it uses. Psychologist Stanley Stevens proposed four scales of measurement in 1946, two of which apply to numerical data.
Interval-scale data has equal spacing between values, but no true zero point. Because the zero is arbitrary, ratios between values are not meaningful. The classic example is temperature measured in Celsius or Fahrenheit: the difference between 20 C and 30 C is the same as the difference between 30 C and 40 C, but 40 C is not "twice as hot" as 20 C.
| Property | Supported? |
|---|---|
| Identity (values can be distinguished) | Yes |
| Order (values can be ranked) | Yes |
| Equal intervals (differences are meaningful) | Yes |
| True zero (ratios are meaningful) | No |
Other examples of interval data include calendar years, IQ scores, and standardized test scores.
Ratio-scale data has all the properties of interval data plus a true zero point, meaning that zero represents a complete absence of the quantity being measured. This allows meaningful ratio comparisons: a weight of 100 kg is genuinely twice as heavy as 50 kg.
| Property | Supported? |
|---|---|
| Identity | Yes |
| Order | Yes |
| Equal intervals | Yes |
| True zero (ratios are meaningful) | Yes |
Examples include height, weight, distance, income, age, and duration. Most numerical features encountered in machine learning are ratio-scale data.
Raw numerical data often needs preprocessing before being fed into a model. The sections below cover the most important preprocessing steps for numerical features.
Different numerical features frequently have different units and ranges. A feature representing income might range from 0 to 500,000, while a feature representing age might range from 0 to 100. When features have very different scales, many algorithms (particularly those based on distance calculations or gradient descent) can be biased toward the feature with the larger range or converge slowly. Normalization and scaling techniques address this problem.
| Technique | Formula / Description | When to Use |
|---|---|---|
| Min-max scaling | x' = (x - x_min) / (x_max - x_min) | Data is uniformly distributed; few outliers; bounded features |
| Z-score standardization | x' = (x - mean) / std | Data is approximately normally distributed; most general-purpose use |
| Robust scaling | x' = (x - median) / IQR | Data contains significant outliers; uses median and interquartile range |
| Max-abs scaling | x' = x / max(abs(x)) | Sparse data; avoids centering, which would destroy sparsity |
| Log scaling | x' = log(x) | Data follows a power-law distribution; long right tail |
A critical practical consideration: the scaler must be fitted on the training data only, and then the same fitted parameters (mean, standard deviation, min, max) are applied to transform both training and test data. Fitting the scaler on the entire dataset before splitting introduces data leakage, which inflates performance estimates and produces unreliable models.
As noted by Google's Machine Learning Crash Course, "if you normalize a feature during training, you must also normalize that feature when making predictions."
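A minimal sketch of this train-only fitting pattern with scikit-learn's StandardScaler (the toy feature matrix and split are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Illustrative feature matrix: income (large scale) and age (small scale).
X = np.array([[48_000, 23], [92_000, 41], [150_000, 35],
              [31_000, 58], [67_000, 29], [110_000, 47]], dtype=float)

X_train, X_test = train_test_split(X, test_size=0.33, random_state=0)

scaler = StandardScaler()
scaler.fit(X_train)                        # learn mean/std from training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)   # reuse the training statistics

# Fitting the scaler on all of X before splitting would leak test-set
# statistics into training and inflate performance estimates.
```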
Real-world datasets frequently contain missing values in numerical columns. Many machine learning algorithms, including linear regression, support vector machines, and k-nearest neighbors, cannot handle missing values natively and will raise errors if NaN values are present. Imputation replaces missing values with estimated substitutes.
| Method | Description | Pros | Cons |
|---|---|---|---|
| Mean imputation | Replace missing values with the column mean | Simple; preserves overall mean | Reduces variance; distorts distribution if data is skewed |
| Median imputation | Replace missing values with the column median | Robust to skewed distributions | Reduces variance; ignores relationships between features |
| KNN imputation | Replace with weighted average of k nearest neighbors' values | Preserves feature relationships; more accurate | Computationally expensive; sensitive to k and distance metric |
| Iterative (MICE) imputation | Model each feature with missing values as a function of other features in a round-robin fashion | Captures complex inter-feature relationships | Slow; results can vary across runs |
| Constant imputation | Replace with a fixed value (e.g., 0 or -1) | Simple; sometimes domain-appropriate | Can introduce bias; model may learn the imputed value as meaningful |
For small amounts of missing data (under 5%), simple methods like mean or median imputation are usually sufficient. For larger proportions, advanced methods like KNN or MICE tend to produce better estimates. The choice of imputation method can significantly affect model performance and should be validated through cross-validation.
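As a rough illustration, median and KNN imputation might look like the following with scikit-learn's SimpleImputer and KNNImputer (the small matrix is made up):

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, 6.0],
              [4.0, np.nan]])

# Median imputation: robust to skew, but ignores inter-feature structure.
X_median = SimpleImputer(strategy="median").fit_transform(X)

# KNN imputation: fills each gap using the k most similar complete rows.
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)
```

As with scaling, the imputer should be fitted on training data only and then applied to both splits.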
Many machine learning algorithms, especially linear models and neural networks, assume or benefit from features that follow a roughly Gaussian (normal) distribution. Real-world numerical data is often skewed, and applying mathematical transformations can make the distribution more symmetric. These transformations are a key part of feature engineering.
The log transformation (x' = log(x)) compresses large values and spreads out small values, reducing right skew. It is effective when data spans several orders of magnitude or follows a multiplicative process. For example, income data and word frequencies often benefit from log transformation. The main limitation is that log is defined only for strictly positive values; a common workaround is to use log(x + 1) when zeros are present.
The square root transformation (x' = sqrt(x)) is a milder alternative to the log transform. It is often applied to count data, such as the number of events per time period. Like the log transform, it requires non-negative values.
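Both transforms are one-liners in NumPy; this quick sketch uses a made-up array of counts:

```python
import numpy as np

counts = np.array([0, 1, 4, 9, 250, 10_000], dtype=float)

log_t  = np.log1p(counts)  # log(x + 1): safe when zeros are present
sqrt_t = np.sqrt(counts)   # milder compression; requires x >= 0
```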
The Box-Cox transformation is a parametric family of power transformations defined as x' = (x^lambda - 1) / lambda for lambda ≠ 0, and x' = log(x) for lambda = 0.
The optimal value of lambda is estimated from the data, typically by maximum likelihood. Special cases include the log transform (lambda = 0), the reciprocal transform (lambda = -1), and the square root transform (lambda = 0.5). The Box-Cox transformation is restricted to strictly positive data.
The Yeo-Johnson transformation extends the Box-Cox approach to handle zero and negative values. It applies a Box-Cox-like formula to positive values and a separate mirrored formula to negative values, maintaining continuity at zero. This makes it applicable to a wider range of real-world data without requiring a positivity constraint. In scikit-learn, both Box-Cox and Yeo-Johnson are available through the PowerTransformer class.
| Transformation | Handles Zeros? | Handles Negatives? | Strength of Effect |
|---|---|---|---|
| Log | No (use log(x+1)) | No | Strong |
| Square root | Yes | No | Moderate |
| Box-Cox | No | No | Adaptive (tuned lambda) |
| Yeo-Johnson | Yes | Yes | Adaptive (tuned lambda) |
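A minimal sketch with scikit-learn's PowerTransformer, using synthetic right-skewed data (the lognormal sample is illustrative):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
X = rng.lognormal(mean=0.0, sigma=1.0, size=(500, 1))  # long right tail

# Yeo-Johnson accepts any real values; method="box-cox" would require
# strictly positive inputs.
pt = PowerTransformer(method="yeo-johnson", standardize=True)
X_t = pt.fit_transform(X)

print("estimated lambda:", pt.lambdas_[0])
```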
Outliers are data points that deviate substantially from the majority of observations. They can arise from measurement errors, data entry mistakes, or genuinely rare events. In numerical data, outliers can distort means and standard deviations, mislead model training, and degrade prediction quality.
Z-Score Method. The Z-score measures how many standard deviations a data point lies from the mean. A common threshold is |z| > 3, meaning any observation more than three standard deviations from the mean is flagged as an outlier. This method works best when the data is approximately normally distributed; for skewed distributions, it may miss outliers on one side while over-flagging on the other.
Interquartile Range (IQR) Method. The IQR is the difference between the 75th percentile (Q3) and the 25th percentile (Q1). Observations below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are considered outliers. The IQR method is more robust than the Z-score approach because it relies on percentiles rather than the mean and standard deviation, making it less sensitive to the outliers it is trying to detect.
Isolation Forest. This algorithm isolates observations by randomly selecting a feature and a split value. Outliers, being rare and different, require fewer splits to isolate and therefore have shorter average path lengths in the tree ensemble. Isolation Forest is effective for high-dimensional numerical data.
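The three detectors sketched below flag the same planted extremes on a synthetic sample (data and thresholds are illustrative):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(50, 5, 200), [120.0, -10.0]])  # two planted outliers

# Z-score rule: flag |z| > 3 (assumes roughly normal data).
z = (x - x.mean()) / x.std()
z_flags = np.abs(z) > 3

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
iqr_flags = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

# Isolation Forest: points with short isolation paths get label -1.
if_flags = IsolationForest(random_state=0).fit_predict(x.reshape(-1, 1)) == -1
```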
| Strategy | Description | When to Use |
|---|---|---|
| Removal | Delete outlier rows from the dataset | Outliers are clearly erroneous (e.g., negative age) |
| Winsorization (capping) | Replace outliers with the nearest non-outlier value | Preserve dataset size; reduce influence of extremes |
| Transformation | Apply log or other transforms to compress the range | Outliers are genuine but skew the distribution |
| Separate modeling | Build a separate model for outlier observations | Outliers represent a distinct subpopulation |
| Robust algorithms | Use models that are inherently tolerant of outliers (e.g., tree-based models) | Outliers are expected and informative |
Binning (also called discretization) converts continuous numerical data into discrete intervals or bins. This can be useful for capturing nonlinear relationships in linear models, reducing the effect of minor observation errors, and improving model interpretability.
Equal-Width Binning. The range of the feature is divided into a fixed number of intervals, each having the same width. For example, an age feature ranging from 0 to 100 could be split into ten bins: 0 to 10, 10 to 20, and so on. Equal-width binning is simple but can produce unbalanced bins when the data distribution is skewed.
Equal-Frequency (Quantile) Binning. The data is divided into bins that each contain approximately the same number of observations, with boundaries set at quantile values. This approach handles skewed distributions better than equal-width binning because it ensures a roughly uniform number of data points in every bin.
Decision-Tree Binning. A decision tree is trained to predict the target using only the feature to be binned. The split points learned by the tree are then used as bin boundaries. This supervised approach creates bins that are optimized for predicting the target and often outperforms unsupervised binning methods.
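The first two strategies are available through scikit-learn's KBinsDiscretizer; this sketch uses a made-up age column:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

ages = np.array([[3], [18], [24], [35], [41], [52], [67], [88]], dtype=float)

# strategy="uniform": equal-width bins (can be unbalanced on skewed data).
width = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="uniform")
# strategy="quantile": equal-frequency bins with boundaries at quantiles.
quant = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="quantile")

print(width.fit_transform(ages).ravel())
print(quant.fit_transform(ages).ravel())
```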
Different families of machine learning models handle numerical features in distinct ways.
| Model Type | How It Uses Numerical Data | Preprocessing Typically Needed |
|---|---|---|
| Linear models (linear/logistic regression) | Learns a weight for each feature; assumes linear relationship | Scaling, handling skew, outlier treatment |
| Distance-based models (KNN, SVM) | Computes distances between data points | Scaling is critical; features with larger ranges dominate distances |
| Tree-based models (decision trees, random forests, gradient boosting) | Splits on threshold values; invariant to monotonic transforms | Minimal preprocessing; robust to outliers and different scales |
| Neural networks | Learns nonlinear combinations via weighted sums and activations | Scaling improves convergence; normalization layers can help |
| Naive Bayes (Gaussian) | Estimates mean and variance per class per feature | Assumes Gaussian distribution; may need transformation |
Tree-based models are particularly forgiving with numerical data because their split-based logic is invariant to monotonic transformations and unaffected by differences in feature scale. Linear models and neural networks, by contrast, are much more sensitive to the scale and distribution of numerical inputs.
Understanding the differences between numerical and categorical data is essential for selecting appropriate preprocessing pipelines and model architectures.
| Aspect | Numerical Data | Categorical Data |
|---|---|---|
| Nature | Expressed as numbers; supports arithmetic | Expressed as labels or groups; arithmetic not meaningful |
| Examples | Temperature, weight, price, count | Color, country, product type, gender |
| Subtypes | Continuous, discrete | Nominal, ordinal |
| Measurement scales | Interval, ratio | Nominal, ordinal |
| Distance computation | Directly computable (Euclidean, Manhattan) | Requires encoding (Hamming distance, etc.) |
| Missing value imputation | Mean, median, KNN | Mode, KNN, constant category |
| Common preprocessing | Scaling, transformation, binning | One-hot encoding, label encoding, target encoding |
| Model compatibility | Natively accepted by most algorithms | Must be encoded to numerical form first |
In practice, most datasets contain a mix of numerical and categorical features. Libraries like scikit-learn provide the ColumnTransformer class to apply different preprocessing pipelines to different feature types within a single workflow.
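A minimal sketch of such a mixed pipeline (column names and values are made up):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "income": [48_000, 92_000, None, 31_000],
    "age": [23, 41, 35, 58],
    "country": ["US", "DE", "US", "JP"],
})

# Numerical columns: impute then scale; categorical columns: one-hot encode.
numeric = Pipeline([("impute", SimpleImputer(strategy="median")),
                    ("scale", StandardScaler())])

preprocess = ColumnTransformer([
    ("num", numeric, ["income", "age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["country"]),
])

X = preprocess.fit_transform(df)
```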
Numerical data is stored in several standard data structures within machine learning frameworks.
Vectors. A one-dimensional array represents a single data point's features or a single feature across multiple data points. For example, a vector [5.1, 3.5, 1.4, 0.2] could represent the four measurements of one iris flower.
Matrices. A two-dimensional array (rows by columns) represents a dataset where each row is an observation and each column is a feature. Most tabular datasets in machine learning are stored as matrices.
Tensors. Multi-dimensional arrays generalize vectors and matrices to higher dimensions. Convolutional neural networks use 4D tensors (batch size, channels, height, width) to represent image data, while recurrent neural networks process 3D tensors (batch size, time steps, features) for sequential data.
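In NumPy terms, the three structures differ only in their shape (the sizes below are illustrative):

```python
import numpy as np

vector = np.array([5.1, 3.5, 1.4, 0.2])    # one iris flower, shape (4,)
matrix = np.zeros((150, 4))                # 150 observations x 4 features
images = np.zeros((32, 3, 224, 224))       # (batch, channels, height, width)
sequences = np.zeros((32, 100, 8))         # (batch, time steps, features)

print(vector.shape, matrix.shape, images.shape, sequences.shape)
```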
Imagine you have a jar of marbles. You can describe the marbles in two ways: by their color (red, blue, green) or by their size (small, medium, big). The colors are like categorical data: you can sort them into groups, but you cannot add "red" and "blue" together. The sizes, though, can be measured with a ruler. You might find one marble is 1.2 centimeters wide and another is 2.5 centimeters wide. Those measurements are numerical data.
Numerical data is special because you can do math with it. You can find the average size of all your marbles, figure out which one is the biggest, and even predict how big a new marble might be based on the ones you have already measured. Computers love numerical data because they are really good at math, and that is exactly what machine learning is: using math to find patterns and make guesses about new things.