See also: Categorical data, Feature engineering, Data preprocessing
Numerical data, also called quantitative data, represents information expressed as numbers on a continuous or discrete scale. In machine learning and statistics, numerical data forms the foundation for mathematical models and algorithms that identify patterns, compute distances, and generate predictions. Unlike categorical data, which represents group membership or labels, numerical data supports arithmetic operations such as addition, subtraction, multiplication, and division, making it directly usable by most learning algorithms.
Virtually every machine learning pipeline involves numerical data at some stage. Even when the original inputs are images, text, or audio, they are ultimately converted into numerical representations (pixel values, embeddings, or spectrograms) before a model can process them. Understanding how to properly handle, transform, and preprocess numerical data is therefore one of the most important skills in applied machine learning and data science.
Numerical data falls into two primary categories based on how values are distributed along the number line.
Continuous data can take any value within a given range, including fractions and decimals. Between any two continuous values, there are infinitely many possible intermediate values. A continuous feature is measured rather than counted.
| Example | Range | Notes |
|---|---|---|
| Temperature | -273.15 C (absolute zero) and above | Can take any decimal value |
| Height | 0 to ~2.75 m for humans | Measured with arbitrary precision |
| Stock price | 0 to unbounded | Varies continuously over time |
| Sensor voltage | Depends on sensor | Analog signal converted to digital |
| Blood pressure | 0 to ~300 mmHg | Measured on a continuous scale |
Continuous features are common in scientific measurements, financial data, and sensor readings. Many statistical methods and machine learning algorithms assume that input features are continuous, since operations like computing means, variances, and gradients are naturally defined for continuous values.
Discrete data consists of distinct, countable values. A discrete feature is counted rather than measured, and there are gaps between successive possible values.
| Example | Possible Values | Notes |
|---|---|---|
| Number of children | 0, 1, 2, 3, ... | Cannot have 2.5 children |
| Number of website visits | 0, 1, 2, ... | Whole-number counts |
| Dice roll outcome | 1, 2, 3, 4, 5, 6 | Finite set of integers |
| Number of defective items | 0, 1, 2, ... | Count data in manufacturing |
| Shoe size (US) | 5, 5.5, 6, 6.5, ... | Fixed increments of 0.5 |
Discrete numerical data sometimes overlaps with ordinal categorical data. For instance, a customer satisfaction rating from 1 to 5 could be treated as either discrete numerical data (if the spacing between values is meaningful) or as ordinal categorical data (if only the rank order matters). The choice depends on the assumptions of the model and the analyst's judgment.
Numerical data is further classified by the measurement scale it uses. Psychologist Stanley Stevens proposed four scales of measurement in 1946, two of which apply to numerical data.
Interval-scale data has equal spacing between values, but no true zero point. Because the zero is arbitrary, ratios between values are not meaningful. The classic example is temperature measured in Celsius or Fahrenheit: the difference between 20 C and 30 C is the same as the difference between 30 C and 40 C, but 40 C is not "twice as hot" as 20 C.
| Property | Supported? |
|---|---|
| Identity (values can be distinguished) | Yes |
| Order (values can be ranked) | Yes |
| Equal intervals (differences are meaningful) | Yes |
| True zero (ratios are meaningful) | No |
Other examples of interval data include calendar years, IQ scores, and standardized test scores.
Ratio-scale data has all the properties of interval data plus a true zero point, meaning that zero represents a complete absence of the quantity being measured. This allows meaningful ratio comparisons: a weight of 100 kg is genuinely twice as heavy as 50 kg.
| Property | Supported? |
|---|---|
| Identity | Yes |
| Order | Yes |
| Equal intervals | Yes |
| True zero (ratios are meaningful) | Yes |
Examples include height, weight, distance, income, age, and duration. Most numerical features encountered in machine learning are ratio-scale data.
Raw numerical data often needs preprocessing before being fed into a model. The sections below cover the most important preprocessing steps for numerical features.
Different numerical features frequently have different units and ranges. A feature representing income might range from 0 to 500,000, while a feature representing age might range from 0 to 100. When features have very different scales, many algorithms (particularly those based on distance calculations or gradient descent) can be biased toward the feature with the larger range or converge slowly. Normalization and scaling techniques address this problem.
| Technique | Formula / Description | When to Use |
|---|---|---|
| Min-max scaling | x' = (x - x_min) / (x_max - x_min) | Data is uniformly distributed; few outliers; bounded features |
| Z-score standardization | x' = (x - mean) / std | Data is approximately normally distributed; most general-purpose use |
| Robust scaling | x' = (x - median) / IQR | Data contains significant outliers; uses median and interquartile range |
| Max-abs scaling | x' = x / max(abs(x)) | Sparse data; avoids centering, which would destroy sparsity |
| Log scaling | x' = log(x) | Data follows a power-law distribution; long right tail |
A critical practical consideration: the scaler must be fitted on the training data only, and then the same fitted parameters (mean, standard deviation, min, max) are applied to transform both training and test data. Fitting the scaler on the entire dataset before splitting introduces data leakage, which inflates performance estimates and produces unreliable models.
As noted by Google's Machine Learning Crash Course, "if you normalize a feature during training, you must also normalize that feature when making predictions."
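A minimal sketch of this train-only fitting pattern with scikit-learn's StandardScaler (the toy feature matrix and split are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Illustrative feature matrix: income (large scale) and age (small scale).
X = np.array([[48_000, 23], [92_000, 41], [150_000, 35],
              [31_000, 58], [67_000, 29], [110_000, 47]], dtype=float)

X_train, X_test = train_test_split(X, test_size=0.33, random_state=0)

scaler = StandardScaler()
scaler.fit(X_train)                        # learn mean/std from training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)   # reuse the training statistics

# Fitting the scaler on all of X before splitting would leak test-set
# statistics into training and inflate performance estimates.
```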
Real-world datasets frequently contain missing values in numerical columns. Many machine learning algorithms, including linear regression, support vector machines, and k-nearest neighbors, cannot handle missing values natively and will raise errors if NaN values are present. Imputation replaces missing values with estimated substitutes.
| Method | Description | Pros | Cons |
|---|---|---|---|
| Mean imputation | Replace missing values with the column mean | Simple; preserves overall mean | Reduces variance; distorts distribution if data is skewed |
| Median imputation | Replace missing values with the column median | Robust to skewed distributions | Reduces variance; ignores relationships between features |
| KNN imputation | Replace with weighted average of k nearest neighbors' values | Preserves feature relationships; more accurate | Computationally expensive; sensitive to k and distance metric |
| Iterative (MICE) imputation | Model each feature with missing values as a function of other features in a round-robin fashion | Captures complex inter-feature relationships | Slow; results can vary across runs |
| Constant imputation | Replace with a fixed value (e.g., 0 or -1) | Simple; sometimes domain-appropriate | Can introduce bias; model may learn the imputed value as meaningful |
For small amounts of missing data (under 5%), simple methods like mean or median imputation are usually sufficient. For larger proportions, advanced methods like KNN or MICE tend to produce better estimates. The choice of imputation method can significantly affect model performance and should be validated through cross-validation.
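As a rough illustration, median and KNN imputation might look like the following with scikit-learn's SimpleImputer and KNNImputer (the small matrix is made up):

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, 6.0],
              [4.0, np.nan]])

# Median imputation: robust to skew, but ignores inter-feature structure.
X_median = SimpleImputer(strategy="median").fit_transform(X)

# KNN imputation: fills each gap using the k most similar complete rows.
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)
```

As with scaling, the imputer should be fitted on training data only and then applied to both splits.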
Many machine learning algorithms, especially linear models and neural networks, assume or benefit from features that follow a roughly Gaussian (normal) distribution. Real-world numerical data is often skewed, and applying mathematical transformations can make the distribution more symmetric. These transformations are a key part of feature engineering.
The log transformation (x' = log(x)) compresses large values and spreads out small values, reducing right skew. It is effective when data spans several orders of magnitude or follows a multiplicative process. For example, income data and word frequencies often benefit from log transformation. The main limitation is that log is defined only for strictly positive values; a common workaround is to use log(x + 1) when zeros are present.
The square root transformation (x' = sqrt(x)) is a milder alternative to the log transform. It is often applied to count data, such as the number of events per time period. Like the log transform, it requires non-negative values.
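Both transforms are one-liners in NumPy; this quick sketch uses a made-up array of counts:

```python
import numpy as np

counts = np.array([0, 1, 4, 9, 250, 10_000], dtype=float)

log_t  = np.log1p(counts)  # log(x + 1): safe when zeros are present
sqrt_t = np.sqrt(counts)   # milder compression; requires x >= 0
```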
The Box-Cox transformation is a parametric family of power transformations defined as x' = (x^lambda - 1) / lambda for lambda ≠ 0, and x' = log(x) for lambda = 0.
The optimal value of lambda is estimated from the data, typically by maximum likelihood. Special cases include the log transform (lambda = 0), the reciprocal transform (lambda = -1), and the square root transform (lambda = 0.5). The Box-Cox transformation is restricted to strictly positive data.
The Yeo-Johnson transformation extends the Box-Cox approach to handle zero and negative values. It applies a Box-Cox-like formula to positive values and a separate mirrored formula to negative values, maintaining continuity at zero. This makes it applicable to a wider range of real-world data without requiring a positivity constraint. In scikit-learn, both Box-Cox and Yeo-Johnson are available through the PowerTransformer class.
| Transformation | Handles Zeros? | Handles Negatives? | Strength of Effect |
|---|---|---|---|
| Log | No (use log(x+1)) | No | Strong |
| Square root | Yes | No | Moderate |
| Box-Cox | No | No | Adaptive (tuned lambda) |
| Yeo-Johnson | Yes | Yes | Adaptive (tuned lambda) |
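A minimal sketch with scikit-learn's PowerTransformer, using synthetic right-skewed data (the lognormal sample is illustrative):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
X = rng.lognormal(mean=0.0, sigma=1.0, size=(500, 1))  # long right tail

# Yeo-Johnson accepts any real values; method="box-cox" would require
# strictly positive inputs.
pt = PowerTransformer(method="yeo-johnson", standardize=True)
X_t = pt.fit_transform(X)

print("estimated lambda:", pt.lambdas_[0])
```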
Outliers are data points that deviate substantially from the majority of observations. They can arise from measurement errors, data entry mistakes, or genuinely rare events. In numerical data, outliers can distort means and standard deviations, mislead model training, and degrade prediction quality.
Z-Score Method. The Z-score measures how many standard deviations a data point lies from the mean. A common threshold is |z| > 3, meaning any observation more than three standard deviations from the mean is flagged as an outlier. This method works best when the data is approximately normally distributed; for skewed distributions, it may miss outliers on one side while over-flagging on the other.
Interquartile Range (IQR) Method. The IQR is the difference between the 75th percentile (Q3) and the 25th percentile (Q1). Observations below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are considered outliers. The IQR method is more robust than the Z-score approach because it relies on percentiles rather than the mean and standard deviation, making it less sensitive to the outliers it is trying to detect.
Isolation Forest. This algorithm isolates observations by randomly selecting a feature and a split value. Outliers, being rare and different, require fewer splits to isolate and therefore have shorter average path lengths in the tree ensemble. Isolation Forest is effective for high-dimensional numerical data.
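The three detectors sketched below flag the same planted extremes on a synthetic sample (data and thresholds are illustrative):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(50, 5, 200), [120.0, -10.0]])  # two planted outliers

# Z-score rule: flag |z| > 3 (assumes roughly normal data).
z = (x - x.mean()) / x.std()
z_flags = np.abs(z) > 3

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
iqr_flags = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

# Isolation Forest: points with short isolation paths get label -1.
if_flags = IsolationForest(random_state=0).fit_predict(x.reshape(-1, 1)) == -1
```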
| Strategy | Description | When to Use |
|---|---|---|
| Removal | Delete outlier rows from the dataset | Outliers are clearly erroneous (e.g., negative age) |
| Winsorization (capping) | Replace outliers with the nearest non-outlier value | Preserve dataset size; reduce influence of extremes |
| Transformation | Apply log or other transforms to compress the range | Outliers are genuine but skew the distribution |
| Separate modeling | Build a separate model for outlier observations | Outliers represent a distinct subpopulation |
| Robust algorithms | Use models that are inherently tolerant of outliers (e.g., tree-based models) | Outliers are expected and informative |
Binning (also called discretization) converts continuous numerical data into discrete intervals or bins. This can be useful for capturing nonlinear relationships in linear models, reducing the effect of minor observation errors, and improving model interpretability.
Equal-Width Binning. The range of the feature is divided into a fixed number of intervals, each having the same width. For example, an age feature ranging from 0 to 100 could be split into ten bins: 0 to 10, 10 to 20, and so on. Equal-width binning is simple but can produce unbalanced bins when the data distribution is skewed.
Equal-Frequency (Quantile) Binning. The data is divided into bins that each contain approximately the same number of observations, with boundaries set at quantile values. This approach handles skewed distributions better than equal-width binning because it ensures a roughly uniform number of data points in every bin.
Decision-Tree Binning. A decision tree is trained to predict the target using only the feature to be binned. The split points learned by the tree are then used as bin boundaries. This supervised approach creates bins that are optimized for predicting the target and often outperforms unsupervised binning methods.
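The first two strategies are available through scikit-learn's KBinsDiscretizer; this sketch uses a made-up age column:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

ages = np.array([[3], [18], [24], [35], [41], [52], [67], [88]], dtype=float)

# strategy="uniform": equal-width bins (can be unbalanced on skewed data).
width = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="uniform")
# strategy="quantile": equal-frequency bins with boundaries at quantiles.
quant = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="quantile")

print(width.fit_transform(ages).ravel())
print(quant.fit_transform(ages).ravel())
```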
Different families of machine learning models handle numerical features in distinct ways.
| Model Type | How It Uses Numerical Data | Preprocessing Typically Needed |
|---|---|---|
| Linear models (linear/logistic regression) | Learns a weight for each feature; assumes linear relationship | Scaling, handling skew, outlier treatment |
| Distance-based models (KNN, SVM) | Computes distances between data points | Scaling is critical; features with larger ranges dominate distances |
| Tree-based models (decision trees, random forests, gradient boosting) | Splits on threshold values; invariant to monotonic transforms | Minimal preprocessing; robust to outliers and different scales |
| Neural networks | Learns nonlinear combinations via weighted sums and activations | Scaling improves convergence; normalization layers can help |
| Naive Bayes (Gaussian) | Estimates mean and variance per class per feature | Assumes Gaussian distribution; may need transformation |
Tree-based models are particularly forgiving with numerical data because their split-based logic is invariant to monotonic transformations and unaffected by differences in feature scale. Linear models and neural networks, by contrast, are much more sensitive to the scale and distribution of numerical inputs.
Understanding the differences between numerical and categorical data is essential for selecting appropriate preprocessing pipelines and model architectures.
| Aspect | Numerical Data | Categorical Data |
|---|---|---|
| Nature | Expressed as numbers; supports arithmetic | Expressed as labels or groups; arithmetic not meaningful |
| Examples | Temperature, weight, price, count | Color, country, product type, gender |
| Subtypes | Continuous, discrete | Nominal, ordinal |
| Measurement scales | Interval, ratio | Nominal, ordinal |
| Distance computation | Directly computable (Euclidean, Manhattan) | Requires encoding (Hamming distance, etc.) |
| Missing value imputation | Mean, median, KNN | Mode, KNN, constant category |
| Common preprocessing | Scaling, transformation, binning | One-hot encoding, label encoding, target encoding |
| Model compatibility | Natively accepted by most algorithms | Must be encoded to numerical form first |
In practice, most datasets contain a mix of numerical and categorical features. Libraries like scikit-learn provide the ColumnTransformer class to apply different preprocessing pipelines to different feature types within a single workflow.
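A minimal sketch of such a mixed pipeline (column names and values are made up):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "income": [48_000, 92_000, None, 31_000],
    "age": [23, 41, 35, 58],
    "country": ["US", "DE", "US", "JP"],
})

# Numerical columns: impute then scale; categorical columns: one-hot encode.
numeric = Pipeline([("impute", SimpleImputer(strategy="median")),
                    ("scale", StandardScaler())])

preprocess = ColumnTransformer([
    ("num", numeric, ["income", "age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["country"]),
])

X = preprocess.fit_transform(df)
```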
Numerical data is stored in several standard data structures within machine learning frameworks.
Vectors. A one-dimensional array represents a single data point's features or a single feature across multiple data points. For example, a vector [5.1, 3.5, 1.4, 0.2] could represent the four measurements of one iris flower.
Matrices. A two-dimensional array (rows by columns) represents a dataset where each row is an observation and each column is a feature. Most tabular datasets in machine learning are stored as matrices.
Tensors. Multi-dimensional arrays generalize vectors and matrices to higher dimensions. Convolutional neural networks use 4D tensors (batch size, channels, height, width) to represent image data, while recurrent neural networks process 3D tensors (batch size, time steps, features) for sequential data.
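In NumPy terms, the three structures differ only in their shape (the sizes below are illustrative):

```python
import numpy as np

vector = np.array([5.1, 3.5, 1.4, 0.2])    # one iris flower, shape (4,)
matrix = np.zeros((150, 4))                # 150 observations x 4 features
images = np.zeros((32, 3, 224, 224))       # (batch, channels, height, width)
sequences = np.zeros((32, 100, 8))         # (batch, time steps, features)

print(vector.shape, matrix.shape, images.shape, sequences.shape)
```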
Imagine you have a jar of marbles. You can describe the marbles in two ways: by their color (red, blue, green) or by their size (small, medium, big). The colors are like categorical data: you can sort them into groups, but you cannot add "red" and "blue" together. The sizes, though, can be measured with a ruler. You might find one marble is 1.2 centimeters wide and another is 2.5 centimeters wide. Those measurements are numerical data.
Numerical data is special because you can do math with it. You can find the average size of all your marbles, figure out which one is the biggest, and even predict how big a new marble might be based on the ones you have already measured. Computers love numerical data because they are really good at math, and that is exactly what machine learning is: using math to find patterns and make guesses about new things.