Numerical Data
Last reviewed
Sources
12 citations
Review status
Source-backed
Revision
v5 · 3,679 words
See also: Categorical data, Feature engineering, Data preprocessing
Numerical data (also called quantitative data) is information expressed as numbers on a continuous or discrete scale that supports arithmetic operations such as addition, subtraction, multiplication, and division. It is the most common feature type in applied machine learning: even when raw inputs are images, text, or audio, models can only process them after conversion into numerical representations such as pixel values, embeddings, or spectrograms. Unlike categorical data, which encodes group membership or labels and for which arithmetic is not meaningful, numerical data is consumed directly by most learning algorithms, which is why Google's Machine Learning Crash Course treats "Working with numerical data" as a foundational module. [1]
Numerical data, also called quantitative data, represents information expressed as numbers on a continuous or discrete scale. In machine learning and statistics, numerical data forms the foundation for mathematical models and algorithms that identify patterns, compute distances, and generate predictions. Unlike categorical data, which represents group membership or labels, numerical data supports arithmetic operations such as addition, subtraction, multiplication, and division, making it directly usable by most learning algorithms.
Virtually every machine learning pipeline involves numerical data at some stage. Even when the original inputs are images, text, or audio, they are ultimately converted into numerical representations (pixel values, embeddings, or spectrograms) before a model can process them. Understanding how to properly handle, transform, and preprocess numerical data is therefore one of the most important skills in applied machine learning and data science.
Numerical data falls into two primary categories based on how values are distributed along the number line.
Continuous data can take any value within a given range, including fractions and decimals. Between any two continuous values, there are infinitely many possible intermediate values. A continuous feature is measured rather than counted.
| Example | Range | Notes |
|---|---|---|
| Temperature | -273.15 C to theoretically unbounded | Can take any decimal value |
| Height | 0 to ~2.75 m for humans | Measured with arbitrary precision |
| Stock price | 0 to unbounded | Varies continuously over time |
| Sensor voltage | Depends on sensor | Analog signal converted to digital |
| Blood pressure | 0 to ~300 mmHg | Measured on a continuous scale |
Continuous features are common in scientific measurements, financial data, and sensor readings. Many statistical methods and machine learning algorithms assume that input features are continuous, since operations like computing means, variances, and gradients are naturally defined for continuous values.
Discrete data consists of distinct, countable values. A discrete feature is counted rather than measured, and there are gaps between successive possible values.
| Example | Possible Values | Notes |
|---|---|---|
| Number of children | 0, 1, 2, 3, ... | Cannot have 2.5 children |
| Number of website visits | 0, 1, 2, ... | Whole-number counts |
| Dice roll outcome | 1, 2, 3, 4, 5, 6 | Finite set of integers |
| Number of defective items | 0, 1, 2, ... | Count data in manufacturing |
| Shoe size (US) | 5, 5.5, 6, 6.5, ... | Fixed increments of 0.5 |
Discrete numerical data sometimes overlaps with ordinal categorical data. For instance, a customer satisfaction rating from 1 to 5 could be treated as either discrete numerical data (if the spacing between values is meaningful) or as ordinal categorical data (if only the rank order matters). The choice depends on the assumptions of the model and the analyst's judgment.
Numerical data is further classified by the measurement scale it uses. Psychologist Stanley Smith Stevens proposed four scales of measurement (nominal, ordinal, interval, and ratio) in a 1946 article in the journal Science titled "On the theory of scales of measurement," two of which apply to numerical data. [2] Stevens argued that the mathematical operations permissible on a set of numbers depend on the measurement scale used, a framework still taught to statistics students today. [3]
Interval-scale data has equal spacing between values, but no true zero point. Because the zero is arbitrary, ratios between values are not meaningful. The classic example is temperature measured in Celsius or Fahrenheit: the difference between 20 C and 30 C is the same as the difference between 30 C and 40 C, but 40 C is not "twice as hot" as 20 C.
| Property | Supported? |
|---|---|
| Identity (values can be distinguished) | Yes |
| Order (values can be ranked) | Yes |
| Equal intervals (differences are meaningful) | Yes |
| True zero (ratios are meaningful) | No |
Other examples of interval data include calendar years, IQ scores, and standardized test scores.
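The temperature example can be checked with a little arithmetic. The sketch below is illustrative only (the helper name `celsius_to_kelvin` is an assumption, not part of any source): it converts the same two readings to Kelvin, a ratio scale, and compares differences and ratios on both scales.

```python
def celsius_to_kelvin(c):
    """Convert Celsius (interval scale) to Kelvin (ratio scale)."""
    return c + 273.15

a_c, b_c = 20.0, 40.0

# Differences are unaffected by where the zero point sits.
diff_c = b_c - a_c
diff_k = celsius_to_kelvin(b_c) - celsius_to_kelvin(a_c)

# Ratios are not: the Celsius ratio depends on its arbitrary zero.
ratio_c = b_c / a_c
ratio_k = celsius_to_kelvin(b_c) / celsius_to_kelvin(a_c)

print(diff_c, round(diff_k, 2))
print(ratio_c, round(ratio_k, 3))
```

The difference is 20 degrees on both scales, while the ratio drops from 2 in Celsius to roughly 1.07 in Kelvin, which is why "twice as hot" is not a meaningful statement about Celsius readings.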
Ratio-scale data has all the properties of interval data plus a true zero point, meaning that zero represents a complete absence of the quantity being measured. This allows meaningful ratio comparisons: a weight of 100 kg is genuinely twice as heavy as 50 kg.
| Property | Supported? |
|---|---|
| Identity | Yes |
| Order | Yes |
| Equal intervals | Yes |
| True zero (ratios are meaningful) | Yes |
Examples include height, weight, distance, income, age, and duration. Most numerical features encountered in machine learning are ratio-scale data.
Raw numerical data often needs preprocessing before being fed into a model. The sections below cover the most important preprocessing steps for numerical features.
Different numerical features frequently have different units and ranges. A feature representing income might range from 0 to 500,000, while a feature representing age might range from 0 to 100. When features have very different scales, many algorithms (particularly those based on distance calculations or gradient descent) can be biased toward the feature with the larger range or converge slowly. Normalization, which Google's Machine Learning Crash Course defines as transforming "features to be on a similar scale," addresses this problem. [1]
| Technique | Formula / Description | When to Use |
|---|---|---|
| Min-max scaling (linear scaling) | x' = (x - x_min) / (x_max - x_min) | Feature is mostly uniformly distributed across its range; few outliers; bounded features |
| Z-score standardization | x' = (x - mean) / std | Feature is approximately normally distributed (peak close to mean); most general-purpose use |
| Robust scaling | x' = (x - median) / IQR | Data contains significant outliers; uses median and interquartile range |
| Max-abs scaling | x' = x / max(abs(x)) | Sparse data that should not be centered to zero |
| Log scaling | x' = log(x) | Feature distribution is heavily skewed; follows a power-law distribution; long right tail |
| Clipping | Cap values above/below chosen thresholds | Feature contains extreme outliers |
Google's Machine Learning Crash Course recommends min-max scaling when a feature is "mostly uniformly distributed across range," Z-score scaling "when the feature is normally distributed (peak close to mean)," and log scaling "when the feature distribution is heavily skewed." [1]
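The formulas in the table can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a library implementation: the function names and the sample `incomes` list are assumptions, the standard deviation is the population form, and the quartiles use the "inclusive" interpolation of `statistics.quantiles`, which is one convention among several. Constant features (zero range, zero spread) are not handled.

```python
import math
import statistics

def min_max_scale(xs):
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def z_score_scale(xs):
    mean = statistics.fmean(xs)
    std = statistics.pstdev(xs)  # population standard deviation
    return [(x - mean) / std for x in xs]

def robust_scale(xs):
    # Quartiles via the "inclusive" method; other conventions give slightly different cut points.
    q1, median, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return [(x - median) / (q3 - q1) for x in xs]

def max_abs_scale(xs):
    biggest = max(abs(x) for x in xs)
    return [x / biggest for x in xs]

def log_scale(xs):
    # Defined only for strictly positive values.
    return [math.log(x) for x in xs]

def clip(xs, lower, upper):
    return [min(max(x, lower), upper) for x in xs]

# Hypothetical sample with one extreme value.
incomes = [20_000, 30_000, 40_000, 50_000, 1_000_000]

print(min_max_scale(incomes))
print(robust_scale(incomes))
print(clip(incomes, 0, 100_000))
```

On this sample, min-max scaling squeezes the four typical incomes into a narrow band near zero because the single extreme value sets the maximum, while robust scaling keeps them spread between -1 and 0.5 and pushes only the extreme value far out.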
A critical practical consideration: the scaler must be fitted on the training data only, and then the same fitted parameters (mean, standard deviation, min, max) are applied to transform both training and test data. Fitting the scaler on the entire dataset before splitting introduces data leakage, which inflates performance estimates and produces unreliable models.
As the Google Machine Learning Crash Course states, "If you normalize a feature during training, you must also normalize that feature when making predictions." [1]
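The fit-on-training-only rule can be sketched as follows. `ZScoreScaler` is a hypothetical minimal class written for this example (its `fit`/`transform` split mirrors a common library convention, but it is not any library's API), and the train and test values are made up.

```python
import statistics

class ZScoreScaler:
    """Hypothetical minimal scaler: learns mean/std once, then reuses them."""

    def fit(self, xs):
        self.mean_ = statistics.fmean(xs)
        self.std_ = statistics.pstdev(xs)  # population standard deviation
        return self

    def transform(self, xs):
        return [(x - self.mean_) / self.std_ for x in xs]

train = [10.0, 20.0, 30.0, 40.0]
test = [50.0, 60.0]

scaler = ZScoreScaler().fit(train)     # statistics come from the training data only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)   # same mean and std reused; no refitting

print(scaler.mean_, round(scaler.std_, 3))
print([round(v, 3) for v in test_scaled])
```

Because the test values never influence `mean_` or `std_`, they can legitimately fall outside the range seen in training; refitting on the combined data would hide that and leak test-set information into the preprocessing step.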
Real-world datasets frequently contain missing values in numerical columns. Many machine learning algorithms, including linear regression, support vector machines, and k-nearest neighbors, cannot handle missing values natively and will raise errors if NaN values are present. Imputation replaces missing values with estimated substitutes. [4]
| Method | Description | Pros | Cons |
|---|---|---|---|
| Mean imputation | Replace missing values with the column mean | Simple; preserves overall mean | Reduces variance; distorts distribution if data is skewed |
| Median imputation | Replace missing values with the column median | Robust to skewed distributions | Reduces variance; ignores relationships between features |
| KNN imputation | Replace with weighted average of k nearest neighbors' values | Preserves feature relationships; more accurate | Computationally expensive; sensitive to k and distance metric |
| Iterative (MICE) imputation | Model each feature with missing values as a function of other features in a round-robin fashion | Captures complex inter-feature relationships | Slow; results can vary across runs |
| Constant imputation | Replace with a fixed value (e.g., 0 or -1) | Simple; sometimes domain-appropriate | Can introduce bias; model may learn the imputed value as meaningful |
For small amounts of missing data (under 5%), simple methods like mean or median imputation are usually sufficient. For larger proportions, advanced methods like KNN or MICE tend to produce better estimates. [9] The choice of imputation method can significantly affect model performance and should be validated through cross-validation.
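A minimal sketch of mean and median imputation is shown below. It assumes missing values are represented as `None` (real pipelines more often use NaN), and the function names and sample ages are hypothetical. As with scaling, the fill value is computed from the training column and then reused on new data.

```python
import statistics

def fit_imputer(column, strategy="median"):
    """Compute the fill value from the observed (non-missing) entries."""
    observed = [x for x in column if x is not None]
    if strategy == "mean":
        return statistics.fmean(observed)
    if strategy == "median":
        return statistics.median(observed)
    raise ValueError(f"unknown strategy: {strategy}")

def impute(column, fill_value):
    """Replace missing entries with a previously fitted fill value."""
    return [fill_value if x is None else x for x in column]

train_ages = [22, None, 35, 41, None, 90]
test_ages = [None, 30]

fill = fit_imputer(train_ages, "median")  # learned from training data only
print(impute(train_ages, fill))
print(impute(test_ages, fill))
```

With the extreme value 90 in the column, the median fill (38.0) sits closer to the typical ages than the mean fill (47.0), which illustrates why the median is the more robust choice for skewed features.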
Many machine learning algorithms, especially linear models and neural networks, assume or benefit from features that follow a roughly Gaussian (normal) distribution. Real-world numerical data is often skewed, and applying mathematical transformations can make the distribution more symmetric. These transformations are a key part of feature engineering.
The log transformation (x' = log(x)) compresses large values and spreads out small values, reducing right skew. It is effective when data spans several orders of magnitude or follows a multiplicative process. For example, income data and word frequencies often benefit from log transformation. The main limitation is that log is defined only for strictly positive values; a common workaround is to use log(x + 1) when zeros are present.
The square root transformation (x' = sqrt(x)) is a milder alternative to the log transform. It is often applied to count data, such as the number of events per time period. Unlike the log transform, it is defined at zero, but it still requires non-negative values.
The Box-Cox transformation is a parametric family of power transformations defined as: [5]
x' = (x^lambda - 1) / lambda if lambda != 0, and x' = log(x) if lambda = 0
The optimal value of lambda is estimated from the data, typically by maximum likelihood. Special cases, up to a linear rescaling, include the log transform (lambda = 0), the reciprocal transform (lambda = -1), and the square root transform (lambda = 0.5). The Box-Cox transformation is restricted to strictly positive data.
The Yeo-Johnson transformation extends the Box-Cox approach to handle zero and negative values. [6] It applies a Box-Cox-like formula to non-negative values and a separate mirrored formula to negative values, maintaining continuity at zero. This makes it applicable to a wider range of real-world data without requiring a positivity constraint. In scikit-learn, both Box-Cox and Yeo-Johnson are available through the PowerTransformer class. [7]
| Transformation | Handles Zeros? | Handles Negatives? | Strength of Effect |
|---|---|---|---|
| Log | No (use log(x+1)) | No | Strong |
| Square root | Yes | No | Moderate |
| Box-Cox | No | No | Adaptive (tuned lambda) |
| Yeo-Johnson | Yes | Yes | Adaptive (tuned lambda) |
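The two power transformations can be written out for a single value as a sketch. The function names are assumptions for this example, and lambda is passed in by hand here rather than estimated by maximum likelihood, so this shows only the transformation step, not the fitting step a library would perform.

```python
import math

def box_cox(x, lam):
    """Box-Cox transform of one strictly positive value for a given lambda."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive input")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1) / lam

def yeo_johnson(x, lam):
    """Yeo-Johnson transform of one value (any sign) for a given lambda."""
    if x >= 0:
        if lam == 0:
            return math.log1p(x)            # log(x + 1)
        return ((x + 1) ** lam - 1) / lam
    if lam == 2:
        return -math.log1p(-x)              # -log(-x + 1)
    return -(((-x + 1) ** (2 - lam)) - 1) / (2 - lam)

print(box_cox(4, 0.5))       # square-root-like case
print(yeo_johnson(0, 0.5))   # zero is allowed
print(yeo_johnson(-3, 1))    # negatives are allowed; lambda = 1 leaves values unchanged
```

Setting lambda to 1 is a useful sanity check: Box-Cox reduces to a shift by one, and Yeo-Johnson returns its input unchanged for both positive and negative values.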
Outliers are data points that deviate substantially from the majority of observations. They can arise from measurement errors, data entry mistakes, or genuinely rare events. In numerical data, outliers can distort means and standard deviations, mislead model training, and degrade prediction quality.
Z-Score Method. The Z-score measures how many standard deviations a data point lies from the mean. A common threshold is |z| > 3, meaning any observation more than three standard deviations from the mean is flagged as an outlier. This method works best when the data is approximately normally distributed; for skewed distributions, it may miss outliers on one side while over-flagging on the other.
Interquartile Range (IQR) Method. The IQR is the difference between the 75th percentile (Q3) and the 25th percentile (Q1). Observations below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are considered outliers. The IQR method is more robust than the Z-score approach because it relies on percentiles rather than the mean and standard deviation, making it less sensitive to the outliers it is trying to detect.
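The two statistical rules above can be compared on a small sample. This is a sketch under stated assumptions: the function names and the `readings` list are hypothetical, the standard deviation is the population form, and quartiles use the "inclusive" method of `statistics.quantiles`.

```python
import statistics

def z_score_outliers(xs, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(xs)
    std = statistics.pstdev(xs)
    return [x for x in xs if abs(x - mean) / std > threshold]

def iqr_outliers(xs, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in xs if x < lower or x > upper]

readings = [10, 12, 11, 13, 12, 11, 10, 12, 95]

print(z_score_outliers(readings))  # → []
print(iqr_outliers(readings))      # → [95]
```

On this sample the value 95 inflates the mean and standard deviation so much that its own Z-score stays just below 3, so the Z-score rule flags nothing, while the percentile-based IQR rule still catches it. Small samples make this masking effect especially pronounced.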
Isolation Forest. Introduced by Liu, Ting, and Zhou in 2008, this algorithm isolates observations by randomly selecting a feature and a split value. [8] Outliers, being rare and different, require fewer splits to isolate and therefore have shorter average path lengths in the tree ensemble. As the original paper notes, "a normal point generally requires more partitions to be isolated, while an anomaly generally requires less partitions to be isolated." [8] Isolation Forest is effective for high-dimensional numerical data.
| Strategy | Description | When to Use |
|---|---|---|
| Removal | Delete outlier rows from the dataset | Outliers are clearly erroneous (e.g., negative age) |
| Winsorization (capping) | Replace outliers with the nearest non-outlier value | Preserve dataset size; reduce influence of extremes |
| Transformation | Apply log or other transforms to compress the range | Outliers are genuine but skew the distribution |
| Separate modeling | Build a separate model for outlier observations | Outliers represent a distinct subpopulation |
| Robust algorithms | Use models that are inherently tolerant of outliers (e.g., tree-based models) | Outliers are expected and informative |
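As one illustration of the capping strategy in the table, the sketch below replaces each flagged value with the nearest value that is not flagged. It assumes outliers are identified with the 1.5 * IQR fences and "inclusive" quartiles; the function name `winsorize` and the sample data are hypothetical, and percentile-based caps are an equally common variant.

```python
import statistics

def winsorize(xs, k=1.5):
    """Cap values outside the IQR fences at the nearest non-outlier value."""
    q1, _, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    inliers = [x for x in xs if lower <= x <= upper]
    lo, hi = min(inliers), max(inliers)
    # Clamping leaves inliers unchanged and pulls outliers to the inlier extremes.
    return [min(max(x, lo), hi) for x in xs]

readings = [10, 12, 11, 13, 12, 11, 10, 12, 95]
capped = winsorize(readings)
print(capped)
```

The dataset keeps all nine rows, and the extreme reading is pulled down to 13, the largest value inside the fences.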
Binning (also called bucketing or discretization) converts continuous numerical data into discrete intervals or bins. Google's Machine Learning Crash Course defines it as "a feature engineering technique that groups different numerical subranges into bins or buckets," noting that "in many cases, binning turns numerical data into categorical data." [10] It is particularly useful when the linear relationship between a feature and the label is weak, for capturing nonlinear relationships in linear models, for reducing the effect of minor observation errors, and for improving model interpretability. [10]
In equal-width binning, the range of the feature is divided into a fixed number of intervals, each having the same width. For example, an age feature ranging from 0 to 100 could be split into ten bins: 0 to 10, 10 to 20, and so on. Equal-width binning is simple but can produce unbalanced bins when the data distribution is skewed, leaving some bins with insufficient training examples. [10]
In equal-frequency binning, the data is divided into bins that each contain approximately the same number of observations, with boundaries set at quantile values. Google's Machine Learning Crash Course calls this "quantile bucketing" and recommends it for skewed data because it "gives extra information space to the large torso while compacting the long tail into a single bucket." [10] This approach handles skewed distributions better than equal-width binning because it ensures a roughly uniform number of data points in every bin.
In decision-tree-based binning, a decision tree is trained to predict the target variable using only the feature to be binned. The split points learned by the tree are then used as bin boundaries. [11] This supervised approach creates bins that are optimized for predicting the target and often outperforms unsupervised binning methods.
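The two unsupervised binning schemes can be contrasted on a skewed sample. This is a minimal sketch: the function names and the `values` list are assumptions, bin edges are interior cut points only, and quantile edges use the "inclusive" method of `statistics.quantiles`. Decision-tree-based binning is not shown because it requires a trained model.

```python
import bisect
import statistics

def equal_width_edges(xs, n_bins):
    """Interior cut points that split [min, max] into n_bins equal widths."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]

def quantile_edges(xs, n_bins):
    """Interior cut points placed at quantiles (equal-frequency bins)."""
    return statistics.quantiles(xs, n=n_bins, method="inclusive")

def assign_bins(xs, edges):
    """Map each value to the index of the bin it falls into."""
    return [bisect.bisect_right(edges, x) for x in xs]

values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # long right tail

print(assign_bins(values, equal_width_edges(values, 3)))
print(assign_bins(values, quantile_edges(values, 3)))
```

With three equal-width bins, nine of the ten values land in the first bin and the middle bin is empty; with quantile edges the counts are 3, 3, and 4, and the long tail is absorbed into the last bin.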
Beyond the mechanics of preprocessing, Google's Machine Learning Crash Course identifies several qualities that distinguish a good continuous numerical feature. [12]
For example, a clearly named feature such as house_age_years is preferable to one named feature_07.
These qualities complement the numerical considerations above: a feature can be perfectly scaled and still be a poor input if its meaning is ambiguous or it hides missing data behind a magic value.
Different families of machine learning models handle numerical features in distinct ways.
| Model Type | How It Uses Numerical Data | Preprocessing Typically Needed |
|---|---|---|
| Linear models (linear/logistic regression) | Learns a weight for each feature; assumes linear relationship | Scaling, handling skew, outlier treatment |
| Distance-based models (KNN, SVM) | Computes distances between data points | Scaling is critical; unscaled features dominate distance |
| Tree-based models (decision trees, random forests, gradient boosting) | Splits on threshold values; invariant to monotonic transforms | Minimal preprocessing; robust to outliers and different scales |
| Neural networks | Learns nonlinear combinations via weighted sums and activations | Scaling improves convergence; normalization layers can help |
| Naive Bayes (Gaussian) | Estimates mean and variance per class per feature | Assumes Gaussian distribution; may need transformation |
Tree-based models are particularly forgiving with numerical data because their split-based logic is invariant to monotonic transformations and unaffected by differences in feature scale. A decision tree that splits on "income > 50,000" produces the same partition whether income is measured in dollars or thousands of dollars, and whether or not it has been log-transformed, because the split threshold simply moves with the transformation. Linear models and neural networks, by contrast, are much more sensitive to the scale and distribution of numerical inputs, which is why scaling and normalization are routine steps in those pipelines.
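The invariance of a threshold split can be verified directly. The sketch below uses a made-up `incomes` list and applies the same split to the raw feature, to a log-transformed copy, and to a rescaled copy, transforming the threshold in the same way each time.

```python
import math

incomes = [12_000, 48_000, 50_000, 51_000, 250_000]
threshold = 50_000

# Partition induced by the split on the raw feature.
raw_split = [x > threshold for x in incomes]

# Same split after a strictly increasing transform, with the threshold transformed too.
log_split = [math.log(x) > math.log(threshold) for x in incomes]

# Same split after a change of units (dollars to thousands of dollars).
unit_split = [x / 1000 > threshold / 1000 for x in incomes]

print(raw_split)
print(raw_split == log_split == unit_split)
```

Every row lands on the same side of the split in all three versions, so a tree trained on any of them can represent an identical partition. A linear model has no such guarantee, since its output changes with the numeric values themselves rather than only with their order.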
Understanding the differences between numerical and categorical data is essential for selecting appropriate preprocessing pipelines and model architectures.
| Aspect | Numerical Data | Categorical Data |
|---|---|---|
| Nature | Expressed as numbers; supports arithmetic | Expressed as labels or groups; arithmetic not meaningful |
| Examples | Temperature, weight, price, count | Color, country, product type, gender |
| Subtypes | Continuous, discrete | Nominal, ordinal |
| Measurement scales | Interval, ratio | Nominal, ordinal |
| Distance computation | Directly computable (Euclidean, Manhattan) | Requires encoding (Hamming distance, etc.) |
| Missing value imputation | Mean, median, KNN | Mode, KNN, constant category |
| Common preprocessing | Scaling, transformation, binning | One-hot encoding, label encoding, target encoding |
| Model compatibility | Natively accepted by most algorithms | Must be encoded to numerical form first |
In practice, most datasets contain a mix of numerical and categorical features. Libraries like scikit-learn provide the ColumnTransformer class to apply different preprocessing pipelines to different feature types within a single workflow. [7]
Numerical data is stored in several standard data structures within machine learning frameworks.
Vectors. A one-dimensional array represents a single data point's features or a single feature across multiple data points. For example, a vector [5.1, 3.5, 1.4, 0.2] could represent the four measurements of one iris flower.
Matrices. A two-dimensional array (rows by columns) represents a dataset where each row is an observation and each column is a feature. Most tabular datasets in machine learning are stored as matrices.
Tensors. Multi-dimensional arrays generalize vectors and matrices to higher dimensions. Convolutional neural networks use 4D tensors (batch size, channels, height, width) to represent image data, while recurrent neural networks process 3D tensors (batch size, time steps, features) for sequential data.
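The three structures differ only in how many dimensions they have, which can be seen with plain nested lists. This is an illustrative sketch: the `shape` helper is hypothetical and assumes non-empty, rectangular nesting, and numerical libraries provide dedicated array types with an equivalent shape attribute.

```python
def shape(array):
    """Return the dimensions of a non-empty, rectangular nested list."""
    dims = []
    while isinstance(array, list):
        dims.append(len(array))
        array = array[0]
    return tuple(dims)

# Vector: the four measurements of one flower.
vector = [5.1, 3.5, 1.4, 0.2]

# Matrix: rows are observations, columns are features.
matrix = [
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
]

# 4D tensor: (batch size, channels, height, width) for a batch of tiny images.
images = [[[[0.0] * 4 for _ in range(4)] for _ in range(3)] for _ in range(2)]

print(shape(vector))  # → (4,)
print(shape(matrix))  # → (2, 4)
print(shape(images))  # → (2, 3, 4, 4)
```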
Imagine you have a jar of marbles. You can describe the marbles in two ways: by their color (red, blue, green) or by their size (small, medium, big). The colors are like categorical data: you can sort them into groups, but you cannot add "red" and "blue" together. The sizes, though, can be measured with a ruler. You might find one marble is 1.2 centimeters wide and another is 2.5 centimeters wide. Those measurements are numerical data.
Numerical data is special because you can do math with it. You can find the average size of all your marbles, figure out which one is the biggest, and even predict how big a new marble might be based on the ones you have already measured. Computers love numerical data because they are really good at math, and that is exactly what machine learning is: using math to find patterns and make guesses about new things.