See also: Machine learning terms
Z-score normalization, also called standardization or z-score scaling, is a data preprocessing technique that transforms numerical features so that they have a mean of zero and a standard deviation of one. The transformation works by subtracting the mean from each value and then dividing the result by the standard deviation. The output values are called z-scores, and each z-score represents how many standard deviations a given data point sits above or below the mean of its distribution.
In machine learning, z-score normalization is one of the most widely used feature scaling methods. Many learning algorithms are sensitive to the relative scales of input features. Without scaling, a feature measured in thousands (such as annual income in dollars) can dominate a feature measured in single digits (such as age in decades), leading to poor model performance and slow convergence. Standardization addresses this problem by placing all features on a comparable scale while preserving the shape of each feature's distribution.[1][2]
The z-score for a single observation is computed as:
z = (x - μ) / σ
where:
| Symbol | Meaning |
|---|---|
| z | The standardized value (z-score) |
| x | The original raw value |
| μ (mu) | The arithmetic mean of all values for that feature |
| σ (sigma) | The standard deviation of all values for that feature |
To standardize an entire feature column, compute the mean and standard deviation across all observations in the training set, then apply the formula to every value, including values in the validation and test sets.[1]
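As a minimal sketch of this procedure (using NumPy and made-up values), the following standardizes a feature column with statistics computed from the training split only:

```python
import numpy as np

# Hypothetical raw values for one feature column
train_values = np.array([180.0, 170.0, 160.0, 150.0, 165.0])
test_values = np.array([175.0, 155.0])

# Compute the statistics on the TRAINING data only
mu = train_values.mean()
sigma = train_values.std()  # population standard deviation

# Apply the same transformation to both splits
train_z = (train_values - mu) / sigma
test_z = (test_values - mu) / sigma

print(train_z.mean(), train_z.std())  # ~0.0 and ~1.0
print(test_z)  # scaled with the training statistics
```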
After z-score normalization is applied to a feature, the transformed values always exhibit two key properties:[3]
- The mean of the transformed values is 0.
- The standard deviation of the transformed values is 1.
These properties hold regardless of the original distribution's shape. If the raw data is skewed, the z-scored data will still be skewed; standardization changes the location and scale but not the shape of the distribution.
For data that follows a normal distribution, z-scores have an additional useful interpretation. Roughly 68% of values fall between z = -1 and z = +1, about 95% fall between z = -2 and z = +2, and approximately 99.7% fall between z = -3 and z = +3. This is known as the 68-95-99.7 rule (or the empirical rule).[3]
| Z-Score Range | Approximate Percentage of Data (Normal Distribution) |
|---|---|
| -1 to +1 | 68% |
| -2 to +2 | 95% |
| -3 to +3 | 99.7% |
Consider a dataset with two features: height (in cm) and weight (in kg).
| Person | Height (cm) | Weight (kg) |
|---|---|---|
| A | 180 | 85 |
| B | 170 | 70 |
| C | 160 | 60 |
| D | 150 | 55 |
| E | 165 | 65 |
Step 1: Compute summary statistics (using the population standard deviation).
| Feature | Mean (μ) | Standard Deviation (σ) |
|---|---|---|
| Height (cm) | 165.0 | 10.0 |
| Weight (kg) | 67.0 | 10.30 |
Step 2: Apply the formula to each value.
| Person | Height Z-Score | Weight Z-Score |
|---|---|---|
| A | (180 - 165) / 10 = 1.50 | (85 - 67) / 10.30 = 1.75 |
| B | (170 - 165) / 10 = 0.50 | (70 - 67) / 10.30 = 0.29 |
| C | (160 - 165) / 10 = -0.50 | (60 - 67) / 10.30 = -0.68 |
| D | (150 - 165) / 10 = -1.50 | (55 - 67) / 10.30 = -1.17 |
| E | (165 - 165) / 10 = 0.00 | (65 - 67) / 10.30 = -0.19 |
After standardization, both features are centered around zero and expressed in comparable units (standard deviations). A height z-score of 1.50 and a weight z-score of 1.75 tell us that Person A is 1.5 standard deviations above the mean height and 1.75 standard deviations above the mean weight.
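The arithmetic above can be verified with a short NumPy snippet (population standard deviation, matching the table):

```python
import numpy as np

heights = np.array([180, 170, 160, 150, 165], dtype=float)
weights = np.array([85, 70, 60, 55, 65], dtype=float)

for name, col in [("height", heights), ("weight", weights)]:
    mu, sigma = col.mean(), col.std()  # population standard deviation
    print(name, np.round((col - mu) / sigma, 2))
# height: [ 1.5   0.5  -0.5  -1.5   0. ]
# weight: approximately [ 1.75  0.29 -0.68 -1.17 -0.19]
```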
Many machine learning models, including linear regression, logistic regression, and neural networks, are trained using gradient descent. When input features have very different scales, the loss surface becomes elongated (shaped like a narrow valley rather than a symmetric bowl). Gradient descent in such a landscape oscillates back and forth across the narrow dimension and makes slow progress along the long dimension, resulting in slow convergence. Standardizing the features reshapes the loss surface into something closer to a symmetric bowl, allowing gradient descent to take more direct paths toward the minimum and converge significantly faster.[2][4]
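A rough illustration of this effect, using synthetic data, invented coefficients, and a hand-rolled gradient descent loop with a fixed step budget: with raw features, the stable learning rate must be tiny and the small-scale direction barely moves, while standardized features converge quickly.

```python
import numpy as np

def loss_after(X, y, lr, steps=1000):
    # Batch gradient descent on mean squared error, fixed step count
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * 2 * X.T @ (X @ w - y) / len(y)
    return np.mean((X @ w - y) ** 2)

rng = np.random.default_rng(0)
x1 = rng.uniform(0, 1000, 500)  # large-scale feature
x2 = rng.uniform(0, 1, 500)     # small-scale feature
y = 0.003 * x1 + 2.0 * x2 + rng.normal(0, 0.1, 500)
y -= y.mean()  # center the target (the model has no intercept)

X_raw = np.column_stack([x1 - x1.mean(), x2 - x2.mean()])
X_std = X_raw / X_raw.std(axis=0)

# Raw features: the learning rate must stay tiny to avoid divergence,
# so the small-scale direction stalls and the loss stays high
print(loss_after(X_raw, y, lr=5e-6))
# Standardized features: a large learning rate is stable and the loss
# drops to roughly the noise variance (~0.01)
print(loss_after(X_std, y, lr=0.1))
```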
Distance-based algorithms such as k-nearest neighbors (KNN), k-means clustering, and support vector machines (SVM) calculate distances between data points. Without standardization, features with larger numeric ranges contribute disproportionately to the distance calculation. For example, if one feature ranges from 0 to 1,000 and another from 0 to 1, the first feature would overwhelm the second in any Euclidean distance computation. Standardization ensures that every feature contributes equally.[1][5]
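For instance (toy values, assumed feature ranges), the raw Euclidean distance below is dominated entirely by the wide-range feature, while the standardized distance reflects both:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Feature 1 ranges to ~100,000 (income); feature 2 to ~10 (age in decades)
X = np.array([[50_000.0, 3.0],
              [52_000.0, 7.0],
              [50_100.0, 3.1]])

# Raw distances: income differences dominate, age is invisible
print(np.linalg.norm(X[0] - X[1]))  # ~2000.0
print(np.linalg.norm(X[0] - X[2]))  # ~100.0

# Standardized distances: both features contribute comparably
X_scaled = StandardScaler().fit_transform(X)
print(np.linalg.norm(X_scaled[0] - X_scaled[1]))
print(np.linalg.norm(X_scaled[0] - X_scaled[2]))
```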
Regularization techniques such as L1 regularization (Lasso) and L2 regularization (Ridge) penalize large weight values. When features are on different scales, the associated weights must differ in magnitude just to compensate for the scale differences, not because of genuine differences in feature importance. Standardization removes scale-driven magnitude differences, allowing the regularization penalty to treat all features fairly.[5]
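A small sketch of this effect (synthetic data, arbitrary coefficients): with raw features, the Ridge weights differ by orders of magnitude purely because of scale; after standardization, their magnitudes become comparable.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
income = rng.uniform(20_000, 120_000, 200)  # large-scale feature
age = rng.uniform(2.0, 8.0, 200)            # small-scale feature
y = 0.0001 * income + 1.0 * age + rng.normal(0, 0.5, 200)

X = np.column_stack([income, age])
print(Ridge(alpha=1.0).fit(X, y).coef_)  # tiny weight vs. large weight

# After scaling, both weights land in the same order of magnitude
X_scaled = StandardScaler().fit_transform(X)
print(Ridge(alpha=1.0).fit(X_scaled, y).coef_)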
Principal Component Analysis (PCA) identifies the directions of maximum variance in the data. If features are not standardized, PCA tends to identify the features with the largest numeric ranges as the most important, even if those features are not truly the most informative. Sebastian Raschka's empirical study on a wine classification dataset found that accuracy jumped from 64.81% to 98.15% when standardization was applied before PCA.[5]
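The pattern is straightforward to test. The sketch below uses scikit-learn's built-in wine dataset and logistic regression as a stand-in (not Raschka's exact setup) to compare PCA with and without a preceding scaling step:

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# PCA on raw features: components chase the largest-range features
raw = make_pipeline(PCA(n_components=2), LogisticRegression(max_iter=1000))
# PCA on standardized features
std = make_pipeline(StandardScaler(), PCA(n_components=2),
                    LogisticRegression(max_iter=1000))

print(cross_val_score(raw, X, y, cv=5).mean())
print(cross_val_score(std, X, y, cv=5).mean())
```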
Not all algorithms benefit equally from standardization. The table below summarizes which model families typically need it and which do not.[5][6]
| Model Type | Needs Standardization? | Reason |
|---|---|---|
| Linear regression, logistic regression | Yes | Uses gradient descent; convergence depends on feature scale |
| Support vector machines (SVM) | Yes | Distance-based kernel computations are scale-sensitive |
| K-nearest neighbors (KNN) | Yes | Euclidean distance dominated by large-scale features |
| K-means clustering | Yes | Cluster assignment uses distance metrics |
| Neural networks | Yes | Gradient-based optimization; large-scale inputs cause unstable gradients |
| Principal Component Analysis (PCA) | Yes | Variance-based; scale differences distort principal components |
| Decision trees | No | Splits are based on thresholds; scale-invariant |
| Random forests | No | Ensemble of decision trees; inherits scale invariance |
| Gradient boosted trees (XGBoost, LightGBM) | No | Tree-based; not affected by feature scale |
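This difference can be checked empirically. A quick sketch (scikit-learn's wine dataset, default hyperparameters) compares a scale-sensitive KNN with a scale-invariant decision tree:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

# KNN: accuracy typically jumps once the features are standardized
print(cross_val_score(KNeighborsClassifier(), X, y, cv=5).mean())
print(cross_val_score(
    make_pipeline(StandardScaler(), KNeighborsClassifier()), X, y, cv=5).mean())

# Decision tree: threshold-based splits, so scaling leaves the
# learned splits effectively unchanged
print(cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5).mean())
print(cross_val_score(
    make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=0)),
    X, y, cv=5).mean())
```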
Z-score normalization and min-max normalization are the two most common feature scaling techniques. They serve different purposes and behave differently in the presence of outliers.[7]
| Property | Z-Score Normalization (Standardization) | Min-Max Normalization |
|---|---|---|
| Formula | z = (x - μ) / σ | x' = (x - x_min) / (x_max - x_min) |
| Output range | Unbounded (typically -3 to +3 for normal data) | Fixed [0, 1] (or custom range) |
| Center and spread | Mean = 0, Std = 1 | Depends on data range |
| Outlier sensitivity | Moderate (outliers shift the mean and inflate the standard deviation, but the remaining values are not compressed into a narrow band) | High (a single extreme value compresses all other values into a narrow band) |
| Distribution shape | Preserved | Preserved |
| Best for | Algorithms using gradient descent or distance metrics; data with outliers | Algorithms requiring bounded inputs (e.g., pixel values for image models); data with no significant outliers |
When to choose standardization: Use z-score normalization when the data may contain outliers, when no fixed output range is required, or when training scale-sensitive algorithms such as linear models, SVMs, and neural networks.
When to choose min-max scaling: Use min-max normalization when a bounded output range is needed (for example, pixel intensity values in image processing) and when the data contains no significant outliers.
In practice, it is often worth trying both approaches and comparing model performance through cross-validation.[7]
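As a concrete sketch of that comparison (scikit-learn's breast cancer dataset and an SVM chosen arbitrarily), both scalers can be evaluated in otherwise identical pipelines:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Same model, two scaling strategies, scored by 5-fold cross-validation
for scaler in (StandardScaler(), MinMaxScaler()):
    pipe = make_pipeline(scaler, SVC())
    print(type(scaler).__name__, cross_val_score(pipe, X, y, cv=5).mean())
```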
Standard z-score normalization uses the mean and standard deviation, both of which are sensitive to extreme values. When a dataset contains significant outliers, a single extreme observation can shift the mean and inflate the standard deviation, distorting the standardized values for all other points.
Robust standardization addresses this limitation by replacing the mean with the median and the standard deviation with the interquartile range (IQR, the range between the 25th and 75th percentiles):[8]
x_robust = (x - median) / IQR
Because the median and IQR are less sensitive to outliers than the mean and standard deviation, robust standardization produces more stable scaling in the presence of extreme values.
In scikit-learn, robust standardization is available through the RobustScaler class:
```python
from sklearn.preprocessing import RobustScaler

# Fit on the training data only, then reuse the same fitted scaler
scaler = RobustScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```
| Scaler | Center Statistic | Scale Statistic | Outlier Robustness |
|---|---|---|---|
| StandardScaler | Mean | Standard deviation | Low |
| RobustScaler | Median | Interquartile range (IQR) | High |
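The contrast is easy to see with one injected outlier (toy values): StandardScaler squashes the inliers together, while RobustScaler keeps their spacing intact.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Four small values plus one extreme outlier
X = np.array([[1.0], [2.0], [3.0], [4.0], [10_000.0]])

# The outlier inflates the std, squashing the inliers near -0.5
print(StandardScaler().fit_transform(X).ravel())
# The median and IQR ignore the outlier, so the inliers keep their spacing
print(RobustScaler().fit_transform(X).ravel())
```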
The most common way to apply z-score normalization in Python is through the StandardScaler class in scikit-learn. Below is a typical workflow:[1]
```python
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Create and fit the scaler on TRAINING data only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Transform the test data using the SAME scaler
X_test_scaled = scaler.transform(X_test)

# Train a model on the scaled data
model = LogisticRegression()
model.fit(X_train_scaled, y_train)
accuracy = model.score(X_test_scaled, y_test)
```
StandardScaler accepts three main parameters:
| Parameter | Default | Description |
|---|---|---|
| with_mean | True | If True, center the data by subtracting the mean |
| with_std | True | If True, scale the data to unit variance |
| copy | True | If False, attempt to modify arrays in place instead of copying |
After fitting, the scaler exposes the learned statistics as attributes:
| Attribute | Description |
|---|---|
| mean_ | Per-feature mean computed from the training data |
| var_ | Per-feature variance computed from the training data |
| scale_ | Per-feature scaling factor (the standard deviation) |
| n_features_in_ | Number of features seen during fit |
A critical best practice is to fit the scaler on the training set only and then use the same fitted scaler to transform the validation and test sets. This prevents data leakage, a situation where information from the test set influences the training process. If the scaler were fit on the entire dataset (including test data), the computed mean and standard deviation would contain information from the test set, giving the model an unfair advantage during evaluation and producing overly optimistic performance estimates.[9]
To reduce the risk of data leakage, scikit-learn recommends using Pipelines, which chain preprocessing steps and the estimator together and automatically ensure that fit is called only on the training fold during cross-validation:
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# The scaler is fit only on the data passed to pipeline.fit()
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC())
])

pipeline.fit(X_train, y_train)
score = pipeline.score(X_test, y_test)
```
Batch normalization extends the core idea behind z-score normalization into the hidden layers of deep neural networks. Proposed by Sergey Ioffe and Christian Szegedy in 2015, batch normalization applies standardization to the activations of each layer during training.[10]
For each mini-batch, the algorithm computes the mean and variance of the activations and then normalizes them:
x_hat = (x - μ_batch) / √(σ²_batch + ε)
where ε is a small constant added for numerical stability. After normalization, the values are scaled and shifted using two learnable parameters, γ (gamma) and β (beta):
y = γ · x_hat + β
These learnable parameters allow each layer to recover the optimal activation distribution, while still benefiting from the stability that normalization provides.
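A minimal NumPy sketch of the training-time forward pass (inference-time running statistics and backpropagation omitted; shapes and values invented):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    # x: activations for one mini-batch, shape (batch_size, features)
    mu = x.mean(axis=0)                    # per-feature batch mean
    var = x.var(axis=0)                    # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # standardize each feature
    return gamma * x_hat + beta            # learnable scale and shift

rng = np.random.default_rng(0)
acts = rng.normal(loc=5.0, scale=3.0, size=(32, 4))  # fake activations
out = batch_norm_forward(acts, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0))  # ~0 per feature
print(out.std(axis=0))   # ~1 per feature
```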
| Aspect | Z-Score Normalization | Batch Normalization |
|---|---|---|
| Applied to | Input features (before training) | Hidden layer activations (during training) |
| Statistics source | Entire training set | Current mini-batch |
| Learnable parameters | None | γ (scale) and β (shift) |
| Purpose | Equalize input feature scales | Stabilize and accelerate deep network training |
Batch normalization allows the use of higher learning rates, reduces sensitivity to weight initialization, and provides a mild regularization effect. It has become a standard component in modern convolutional neural networks and other deep architectures.[10]
If the data contains significant outliers, consider RobustScaler before defaulting to StandardScaler. When in doubt, try both StandardScaler and MinMaxScaler, evaluate using cross-validation, and pick whichever produces better results for your specific problem.
Imagine you and your friends are comparing how good you are at two different games: one where scores go up to 1,000, and another where scores only go up to 10. If you just look at the raw numbers, the first game's scores always seem "bigger" and more important, even though a score of 8 out of 10 might be just as impressive as 800 out of 1,000.
Z-score normalization is like a magic translator. It takes every score and asks: "How far above or below average is this?" Then it writes the answer in a simple language where "0" means perfectly average, "+1" means one step above average, and "-1" means one step below average. Now you can compare your performance across both games fairly, because the numbers all speak the same language.