# Z-Score Normalization

> Source: https://aiwiki.ai/wiki/z-score_normalization
> Updated: 2026-05-09
> Categories: Data & Datasets, Machine Learning
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

## introduction

Z-score normalization, also called **standardization**, **standard score normalization**, or **z-score scaling**, is a [data preprocessing](/wiki/data_preprocessing) technique that transforms numerical [features](/wiki/feature) so that they have a mean of zero and a [standard deviation](/wiki/standard_deviation) of one. The transformation works by subtracting the mean from each value and then dividing the result by the standard deviation. The output values are called **z-scores**, and each z-score represents how many standard deviations a given data point sits above or below the mean of its distribution.

In [machine learning](/wiki/machine_learning), z-score [normalization](/wiki/normalization) is one of the most widely used [feature scaling](/wiki/feature_scaling) methods. Many learning algorithms are sensitive to the relative scales of input features. Without scaling, a feature measured in thousands (such as annual income in dollars) can dominate a feature measured in single digits (such as age in decades), leading to poor model performance and slow [convergence](/wiki/convergence). Standardization addresses this problem by placing all features on a comparable scale while preserving the shape of each feature's distribution.<sup>[1][2]</sup>

The technique has roots in classical statistics that predate machine learning by more than a century. Francis Galton's late-19th-century work on hereditary statistics and Karl Pearson's early-20th-century formalization of the product-moment correlation coefficient relied on standardized variables. The textbook *Statistical Methods* by George W. Snedecor, first published in 1937, co-authored in later editions with William G. Cochran, and revised through eight editions, helped cement standardization as a default operation in applied statistics. The same arithmetic that statisticians used to compare student test scores or crop yields now serves as a routine step in modern [training pipelines](/wiki/training_pipeline) for [deep learning](/wiki/deep_learning) models.

## formula

The z-score for a single observation is computed as:

**z = (x - μ) / σ**

where:

| Symbol | Meaning |
| --- | --- |
| z | The standardized value (z-score) |
| x | The original raw value |
| μ (mu) | The arithmetic mean of all values for that [feature](/wiki/feature) |
| σ (sigma) | The [standard deviation](/wiki/standard_deviation) of all values for that feature |

To standardize an entire feature column, compute the mean and standard deviation across all observations in the [training set](/wiki/training_set), then apply the formula to every value, including values in the validation and test sets.<sup>[1]</sup>
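As a minimal sketch of this procedure, the snippet below uses only Python's standard library. The helper names `fit_standardizer` and `standardize` and the toy values are illustrative assumptions, not part of any library.

```python
import statistics

def fit_standardizer(train_values):
    """Return (mean, population std) estimated from training values only."""
    mu = statistics.fmean(train_values)
    sigma = statistics.pstdev(train_values)  # population std (denominator n)
    return mu, sigma

def standardize(values, mu, sigma):
    """Apply z = (x - mu) / sigma using previously fitted statistics."""
    return [(x - mu) / sigma for x in values]

train = [150.0, 160.0, 165.0, 170.0, 180.0]
test = [155.0, 175.0]

mu, sigma = fit_standardizer(train)
train_z = standardize(train, mu, sigma)
test_z = standardize(test, mu, sigma)  # reuses the training mean and std
print(train_z, test_z)
```

The test values are scaled with the training statistics, so the scaled test column will generally not have a mean of exactly zero or a standard deviation of exactly one.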

### population versus sample standard deviation

A subtle but important choice when implementing z-score normalization is whether to use the **population** standard deviation or the **sample** standard deviation. The two formulas differ only in their denominator:

| Variance formula | Denominator | Notes |
| --- | --- | --- |
| Population variance | n | Divides by the number of observations |
| Sample variance | n - 1 | Bessel's correction; produces an unbiased estimate of the population variance |

[scikit-learn](/wiki/scikit_learn)'s `StandardScaler` and the [NumPy](/wiki/numpy) function `numpy.std` default to the population formula (denominator n). The [pandas](/wiki/pandas) methods `Series.std` and `DataFrame.std` default to the sample formula (denominator n-1). The [SciPy](/wiki/scipy) function `scipy.stats.zscore` defaults to the population formula but accepts a `ddof` argument that switches to the sample form. For large training sets the difference between the two is small, but the inconsistency between libraries occasionally causes small numerical discrepancies that are worth understanding before debugging a pipeline.<sup>[1][8]</sup>

| Library | Function | Default ddof | Result |
| --- | --- | --- | --- |
| scikit-learn | `StandardScaler` | 0 | Population std |
| NumPy | `numpy.std` | 0 | Population std |
| SciPy | `scipy.stats.zscore` | 0 | Population std (configurable) |
| pandas | `DataFrame.std` | 1 | Sample std (Bessel correction) |
| TensorFlow | `tf.math.reduce_std` | 0 | Population std |
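To make the size of the discrepancy concrete, the sketch below compares the two conventions with Python's standard library, where `statistics.pstdev` corresponds to `ddof=0` and `statistics.stdev` to `ddof=1`. The five values are illustrative.

```python
import math
import statistics

heights = [180, 170, 160, 150, 165]
n = len(heights)

pop_std = statistics.pstdev(heights)   # denominator n, like ddof=0
samp_std = statistics.stdev(heights)   # denominator n - 1, like ddof=1

# The two estimates always differ by a factor of sqrt(n / (n - 1))
ratio = samp_std / pop_std
expected_ratio = math.sqrt(n / (n - 1))
print(pop_std, samp_std, ratio, expected_ratio)
```

For n = 5 the ratio is about 1.118, so z-scores computed under the two conventions differ by roughly 12%. The ratio approaches 1 as n grows.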

## mathematical properties

After z-score normalization is applied to a feature, the transformed values always exhibit two key properties:<sup>[3]</sup>

1. **Zero mean.** The mean of the z-scores equals zero. Subtracting the original mean from every value centers the distribution at the origin.
2. **Unit variance.** The standard deviation (and therefore the [variance](/wiki/variance)) of the z-scores equals one. Dividing by the original standard deviation rescales the spread to a standard size.

These properties hold regardless of the original distribution's shape. If the raw data is skewed, the z-scored data will still be skewed; standardization changes the location and scale but not the shape of the distribution.

For data that follows a [normal distribution](/wiki/normal_distribution), z-scores have an additional useful interpretation. Roughly 68% of values fall between z = -1 and z = +1, about 95% fall between z = -2 and z = +2, and approximately 99.7% fall between z = -3 and z = +3. This is known as the 68-95-99.7 rule (or the empirical rule).<sup>[3]</sup>

| Z-Score Range | Approximate Percentage of Data (Normal Distribution) |
| --- | --- |
| -1 to +1 | 68% |
| -2 to +2 | 95% |
| -3 to +3 | 99.7% |
| -4 to +4 | 99.994% |
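The percentages in the table can be reproduced from the error function, since for a standard normal variable the probability of falling within k standard deviations of the mean is erf(k / √2). The helper name `normal_coverage` is an illustrative choice.

```python
import math

def normal_coverage(k):
    """Probability that a standard normal value lies within k standard deviations."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3, 4):
    print(f"|z| < {k}: {normal_coverage(k):.4%}")
```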

### linearity and invertibility

Z-score normalization is an **affine transformation** of the form `f(x) = ax + b` with `a = 1/σ` and `b = -μ/σ`. Affine transformations preserve linear relationships, which is why standardization does not change the rank ordering of values, the [Pearson correlation coefficient](/wiki/pearson_correlation_coefficient) between two features, or the [coefficient of determination](/wiki/coefficient_of_determination) of a linear fit. They do, however, change unstandardized regression coefficients in interpretable ways, which is the basis for **standardized regression coefficients** (also called **beta coefficients**) used in social science research.

The transformation is invertible. Given a fitted scaler with stored mean μ and standard deviation σ, the original value is recovered as `x = z·σ + μ`. Inverse transformation is essential for two practical workflows. First, when a [target variable](/wiki/target_variable) has been standardized before training a [regression](/wiki/regression) model, the model's predictions must be inverted before they are reported in original units. Second, when interpreting feature importance scores or explaining model behavior, analysts often want to map z-scored thresholds back to the original measurement scale.

```python
# Inverse transform with scikit-learn
from sklearn.preprocessing import StandardScaler
import numpy as np

X = np.array([[180, 85], [170, 70], [160, 60], [150, 55], [165, 65]])
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)
X_recovered = scaler.inverse_transform(X_scaled)
# X_recovered equals X within floating-point precision
```

## worked example

Consider a dataset with two features: height (in cm) and weight (in kg).

| Person | Height (cm) | Weight (kg) |
| --- | --- | --- |
| A | 180 | 85 |
| B | 170 | 70 |
| C | 160 | 60 |
| D | 150 | 55 |
| E | 165 | 65 |

**Step 1: Compute summary statistics.**

| Feature | Mean (μ) | Standard Deviation (σ) |
| --- | --- | --- |
| Height (cm) | 165.0 | 10.0 |
| Weight (kg) | 67.0 | 10.30 |

**Step 2: Apply the formula to each value.**

| Person | Height Z-Score | Weight Z-Score |
| --- | --- | --- |
| A | (180 - 165) / 10 = **1.50** | (85 - 67) / 10.30 = **1.75** |
| B | (170 - 165) / 10 = **0.50** | (70 - 67) / 10.30 = **0.29** |
| C | (160 - 165) / 10 = **-0.50** | (60 - 67) / 10.30 = **-0.68** |
| D | (150 - 165) / 10 = **-1.50** | (55 - 67) / 10.30 = **-1.17** |
| E | (165 - 165) / 10 = **0.00** | (65 - 67) / 10.30 = **-0.19** |

After standardization, both features are centered around zero and expressed in comparable units (standard deviations). A height z-score of 1.50 and a weight z-score of 1.75 tell us that Person A is 1.5 standard deviations above the mean height and 1.75 standard deviations above the mean weight.

**Step 3: Verify the output statistics.**

After standardization, the mean of each scaled column should be zero (within floating-point precision) and the standard deviation should be one. For the height z-scores: 1.50 + 0.50 + (-0.50) + (-1.50) + 0.00 = 0, so the mean is exactly 0. For the weight z-scores: 1.75 + 0.29 + (-0.68) + (-1.17) + (-0.19) = 0.00, again confirming a zero mean. Computing the population standard deviation (denominator n, the convention used for the statistics in Step 1) of each transformed column produces a value of 1.00. These checks are useful as unit tests when implementing a custom standardizer.
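The checks in Step 3 can be written as a small test. The sketch below uses the standard library and assumes the population standard deviation (denominator n); the helper name `zscores` is illustrative.

```python
import statistics

heights = [180, 170, 160, 150, 165]
weights = [85, 70, 60, 55, 65]

def zscores(values):
    """Standardize a column with its own mean and population std."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population std, matching ddof=0
    return [(x - mu) / sigma for x in values]

for name, column in (("height", heights), ("weight", weights)):
    z = zscores(column)
    # Unit-test style checks: zero mean and unit population std
    assert abs(statistics.fmean(z)) < 1e-9
    assert abs(statistics.pstdev(z) - 1.0) < 1e-9
    print(name, [round(v, 2) for v in z])
```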

## why standardization helps machine learning

### faster gradient descent convergence

Many [machine learning](/wiki/machine_learning) models, including [linear regression](/wiki/linear_regression), [logistic regression](/wiki/logistic_regression), and [neural networks](/wiki/neural_network), are trained using [gradient descent](/wiki/gradient_descent). When input features have very different scales, the loss surface becomes elongated (shaped like a narrow valley rather than a symmetric bowl). Gradient descent in such a landscape oscillates back and forth across the narrow dimension and makes slow progress along the long dimension, resulting in slow convergence. Standardizing the features reshapes the loss surface into something closer to a symmetric bowl, allowing gradient descent to take more direct paths toward the minimum and converge significantly faster.<sup>[2][4]</sup>

This effect can be quantified using the **condition number** of the Hessian matrix of the loss surface. A condition number near 1 corresponds to a roughly spherical loss surface, while a large condition number corresponds to an elongated valley. Standardization reduces the condition number by removing scale-driven magnitude differences across features. Yann LeCun and colleagues' 1998 paper *Efficient BackProp* recommended centering and scaling inputs precisely for this reason and noted that the recommendation extended to hidden activations as well, foreshadowing later work on [batch normalization](/wiki/batch_normalization).<sup>[10]</sup>
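As a rough illustration, the sketch below compares the condition number before and after standardization for a linear model with mean squared error loss, whose Hessian is proportional to XᵀX / n. The synthetic income and age features, the random seed, and the helper name `hessian_condition_number` are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
# Two independent features on very different scales (illustrative values)
income = rng.normal(50_000, 15_000, size=n)
age = rng.normal(4.0, 1.2, size=n)
X = np.column_stack([income, age])

def hessian_condition_number(X):
    """Condition number of X^T X / n, proportional to the MSE Hessian of a linear model."""
    H = X.T @ X / X.shape[0]
    return np.linalg.cond(H)

X_std = (X - X.mean(axis=0)) / X.std(axis=0)

cond_raw = hessian_condition_number(X)
cond_std = hessian_condition_number(X_std)
print(f"raw: {cond_raw:.3e}, standardized: {cond_std:.3f}")
```

With uncorrelated features the standardized condition number is close to 1. Correlation between features raises it again, which standardization alone does not remove.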

### equal feature weighting

Distance-based algorithms such as [k-nearest neighbors](/wiki/k_nearest_neighbors) (KNN), [k-means clustering](/wiki/k-means), and [support vector machines](/wiki/support_vector_machine_svm) (SVM) calculate distances between data points. Without standardization, features with larger numeric ranges contribute disproportionately to the distance calculation. For example, if one feature ranges from 0 to 1,000 and another from 0 to 1, the first feature would overwhelm the second in any Euclidean distance computation. Standardization ensures that every feature contributes equally.<sup>[1][5]</sup>

The Euclidean distance between two standardized observations equals the Mahalanobis distance computed with a diagonal covariance matrix, that is, under the assumption that the features are uncorrelated. When the features are correlated, the full Mahalanobis distance uses the inverse of the complete covariance matrix to account for the correlations, but z-score standardization is still a useful first step before computing distances.
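The relationship to the Mahalanobis distance can be checked numerically. The sketch below uses synthetic data, and the variable names are illustrative. It builds the Mahalanobis distance from only the per-feature variances, which corresponds to treating the features as uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(loc=[170.0, 70.0], scale=[10.0, 12.0], size=(200, 2))

mu = X.mean(axis=0)
sigma = X.std(axis=0)
Z = (X - mu) / sigma

a, b = 0, 1  # indices of two observations to compare
d_standardized = np.linalg.norm(Z[a] - Z[b])

# Mahalanobis distance using only the diagonal of the covariance matrix,
# i.e. treating the features as uncorrelated
diag_cov_inv = np.diag(1.0 / sigma**2)
diff = X[a] - X[b]
d_mahalanobis_diag = np.sqrt(diff @ diag_cov_inv @ diff)

print(d_standardized, d_mahalanobis_diag)
```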

### improved regularization

Regularization techniques such as [L1 regularization](/wiki/l1_regularization) (Lasso) and [L2 regularization](/wiki/l2_regularization) (Ridge) penalize large [weight](/wiki/weight) values. When features are on different scales, the associated weights must differ in magnitude just to compensate for the scale differences, not because of genuine differences in feature importance. Standardization removes scale-driven magnitude differences, allowing the regularization penalty to treat all features fairly.<sup>[5]</sup>

In the original 1996 Lasso paper, Robert Tibshirani assumed that the predictors were standardized to have mean zero and unit variance before applying the penalty. Most modern implementations, including scikit-learn's `Lasso`, `Ridge`, and `ElasticNet` classes, expect the user to standardize the inputs explicitly, typically with `StandardScaler` inside a pipeline; the former `normalize` argument was deprecated in scikit-learn 1.0 and removed in 1.2. Failing to standardize before fitting a regularized linear model is one of the more common silent bugs in applied machine learning.

### better performance in PCA

[Principal Component Analysis](/wiki/principal_component_analysis_pca) (PCA) identifies the directions of maximum [variance](/wiki/variance) in the data. If features are not standardized, PCA tends to identify the features with the largest numeric ranges as the most important, even if those features are not truly the most informative. Sebastian Raschka's empirical study on a wine classification dataset found that accuracy jumped from 64.81% to 98.15% when standardization was applied before PCA.<sup>[5]</sup>

A related operation is **whitening**, which goes beyond z-score normalization by also decorrelating the features. Whitening transforms a feature vector x with mean μ and covariance Σ into `W·(x - μ)` where W is chosen so that the resulting covariance matrix is the identity. PCA whitening is a common preprocessing step for [autoencoders](/wiki/autoencoder), [independent component analysis](/wiki/independent_component_analysis), and certain [generative models](/wiki/generative_model).
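A minimal sketch of PCA whitening with NumPy follows. The mixing matrix, offset, and seed are arbitrary illustrative choices, and W is built from the eigendecomposition of the sample covariance, which is one of several valid whitening matrices.

```python
import numpy as np

rng = np.random.default_rng(2)
# Correlated 2-D data (the mixing matrix is an arbitrary illustrative choice)
A = np.array([[2.0, 0.0], [1.5, 0.5]])
X = rng.normal(size=(1000, 2)) @ A.T + np.array([5.0, -3.0])

mu = X.mean(axis=0)
Xc = X - mu
cov = Xc.T @ Xc / X.shape[0]

# Eigendecomposition of the covariance matrix: cov = V diag(w) V^T
w, V = np.linalg.eigh(cov)
W = np.diag(1.0 / np.sqrt(w)) @ V.T   # PCA whitening matrix

X_white = Xc @ W.T                     # each row is W (x - mu)
cov_white = X_white.T @ X_white / X.shape[0]
print(np.round(cov_white, 6))
```

Plain z-score normalization would set only the diagonal of the covariance matrix to one. Whitening also drives the off-diagonal entries to zero.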

## which models require standardization?

Not all algorithms benefit equally from standardization. The table below summarizes which model families typically need it and which do not.<sup>[5][6]</sup>

| Model Type | Needs Standardization? | Reason |
| --- | --- | --- |
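| [Linear regression](/wiki/linear_regression), [logistic regression](/wiki/logistic_regression) | Yes (when fitted by gradient descent or with regularization) | [Gradient descent](/wiki/gradient_descent) convergence depends on feature scale; predictions from closed-form ordinary least squares are unaffected by scaling |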
| [Support vector machines](/wiki/support_vector_machine_svm) (SVM) | Yes | Distance-based kernel computations are scale-sensitive |
| [K-nearest neighbors](/wiki/k_nearest_neighbors) (KNN) | Yes | Euclidean distance dominated by large-scale features |
| [K-means clustering](/wiki/k-means) | Yes | Cluster assignment uses distance metrics |
| [Neural networks](/wiki/neural_network) | Yes | Gradient-based optimization; large-scale inputs cause unstable gradients |
| [Principal Component Analysis](/wiki/principal_component_analysis_pca) (PCA) | Yes | Variance-based; scale differences distort principal components |
| [Lasso, Ridge, Elastic Net](/wiki/regularization) | Yes | Regularization penalty depends on weight magnitude |
| [Naive Bayes](/wiki/naive_bayes) (Gaussian) | Sometimes | Class-conditional Gaussians are estimated independently per feature, but standardization can stabilize numerical computation |
| [Decision trees](/wiki/decision_tree) | No | Splits are based on thresholds; scale-invariant |
| [Random forests](/wiki/random_forest) | No | Ensemble of decision trees; inherits scale invariance |
| [Gradient boosted trees](/wiki/gradient_boosting) (XGBoost, LightGBM) | No | Tree-based; not affected by feature scale |
| Rule-based models | No | Decision rules use thresholds, not magnitudes |

## standardization vs other scalers

Z-score normalization is one of several feature scaling techniques. The other common ones are [min-max scaling](/wiki/min_max_scaling), [robust scaler](/wiki/robust_scaler), max-abs scaling, and the quantile transformer.<sup>[7]</sup>

| Scaler | Output Center | Output Spread | Output Range | Outlier Robustness | When to Use |
| --- | --- | --- | --- | --- | --- |
| StandardScaler (z-score) | Mean = 0 | Std = 1 | Unbounded | Moderate | Default for gradient-based models, distance-based methods, PCA |
| MinMaxScaler | Depends on data | Depends on data | [0, 1] or custom | Low | Bounded inputs needed (image pixels), neural networks with sigmoid outputs |
| RobustScaler | Median = 0 | IQR = 1 | Unbounded | High | Datasets with outliers or heavy-tailed distributions |
| MaxAbsScaler | Preserved | Scaled by max absolute value | [-1, 1] | Low | Sparse data; preserves zero entries |
| Normalizer (L2) | Per row | Unit norm per row | Unit sphere | Low | Text classification with TF-IDF, cosine similarity |
| QuantileTransformer | Median = 0.5 (uniform output) or 0 (normal output) | Uniform or normal | [0, 1] for uniform output; clipped tails for normal output | High | Heavily skewed features |
| PowerTransformer (Yeo-Johnson, Box-Cox) | Approx. mean 0 | Approx. std 1 | Unbounded | High | Features that should be made more Gaussian-like |

### z-score versus min-max scaling

Z-score normalization and [min-max normalization](/wiki/min_max_scaling) are the two most common feature scaling techniques. They serve different purposes and behave differently in the presence of [outliers](/wiki/outliers).<sup>[7]</sup>

| Property | Z-Score Normalization (Standardization) | Min-Max Normalization |
| --- | --- | --- |
| Formula | z = (x - μ) / σ | x' = (x - x_min) / (x_max - x_min) |
| Output range | Unbounded (typically -3 to +3 for normal data) | Fixed [0, 1] (or custom range) |
| Center and spread | Mean = 0, Std = 1 | Depends on data range |
| Outlier sensitivity | Moderate (mean and std are affected, but output is not bounded) | High (a single extreme value compresses all other values into a narrow band) |
| Distribution shape | Preserved | Preserved |
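| Best for | Algorithms using gradient descent or distance metrics; data where no fixed output range is needed | Algorithms requiring bounded inputs (e.g., pixel values for image models); data with no significant outliers |

**When to choose standardization:** Use z-score normalization when no fixed output range is required, when occasional [outliers](/wiki/outliers) should not squeeze the remaining values into a narrow part of a bounded interval, or when training algorithms whose objectives assume features centered near zero with comparable variance (such as the RBF kernel of [SVMs](/wiki/support_vector_machine_svm) and the L1 and L2 penalties of [linear models](/wiki/linear_regression)). When outliers are severe, the mean and standard deviation are themselves distorted, and the robust standardization described later in this article is usually a better fit.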

**When to choose min-max scaling:** Use min-max normalization when a bounded output range is needed (for example, pixel intensity values in image processing) and when the data contains no significant outliers.

In practice, it is often worth trying both approaches and comparing model performance through [cross-validation](/wiki/cross_validation).<sup>[7]</sup>

## robust standardization

Standard z-score normalization uses the mean and standard deviation, both of which are sensitive to extreme values. When a dataset contains significant [outliers](/wiki/outliers), a single extreme observation can shift the mean and inflate the standard deviation, distorting the standardized values for all other points.

**Robust standardization** addresses this limitation by replacing the mean with the **median** and the standard deviation with the **interquartile range** (IQR, the range between the 25th and 75th percentiles):<sup>[8]</sup>

**x_robust = (x - median) / IQR**

Because the median and IQR are less sensitive to outliers than the mean and standard deviation, robust standardization produces more stable scaling in the presence of extreme values.

In [scikit-learn](/wiki/scikit_learn), robust standardization is available through the [RobustScaler](/wiki/robust_scaler) class:

```python
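import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy data: the last training row has an outlier in the first feature
X_train = np.array([[1.0, 10.0], [2.0, 12.0], [3.0, 11.0], [100.0, 13.0]])
X_test = np.array([[2.5, 11.5]])

scaler = RobustScaler()  # centers on the median, scales by the IQR
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
X_test_scaled = scaler.transform(X_test)  # reuse the training statistics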
```

| Scaler | Center Statistic | Scale Statistic | Outlier Robustness |
| --- | --- | --- | --- |
| StandardScaler | Mean | Standard deviation | Low |
| RobustScaler | Median | Interquartile range (IQR) | High |
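To see the difference on a small example, the sketch below applies both formulas by hand with NumPy. The six values are illustrative, and the quartiles use NumPy's default linear interpolation, which is an assumption about how the percentiles are estimated.

```python
import numpy as np

values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 500.0])  # last value is an outlier

# Standard z-score: mean and std are both pulled by the outlier
z = (values - values.mean()) / values.std()

# Robust scaling: median and interquartile range
median = np.median(values)
q1, q3 = np.percentile(values, [25, 75])
robust = (values - median) / (q3 - q1)

print(np.round(z[:5], 3))       # inliers end up nearly indistinguishable
print(np.round(robust[:5], 3))  # inliers keep a usable spread
```

In this example the five inliers span about 0.02 units after z-scoring but about 2 units after robust scaling, because the outlier inflates the standard deviation while leaving the IQR unchanged.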

### modified z-score (MAD-based)

A closely related variant is the **modified z-score** introduced by Boris Iglewicz and David Hoaglin in their 1993 volume *How to Detect and Handle Outliers*, published by the American Society for Quality Control (ASQC). The modified z-score uses the **median absolute deviation** (MAD) instead of the standard deviation:<sup>[11]</sup>

**M_i = 0.6745 · (x_i - median) / MAD**

where `MAD = median(|x_i - median|)` and the constant 0.6745 is approximately the 75th percentile of the standard normal distribution (its reciprocal is 1.4826). For normally distributed data the MAD is about 0.6745·σ, so this constant rescales the MAD to make the modified z-score approximately equal to the ordinary z-score.

Iglewicz and Hoaglin recommended treating any observation with `|M_i| > 3.5` as a potential outlier. The modified z-score is widely used in [anomaly detection](/wiki/anomaly_detection) systems where the underlying distribution is heavy-tailed or contaminated, and a small number of outliers should not influence the threshold for the rest of the data.

| Statistic | Standard z-score | Modified z-score |
| --- | --- | --- |
| Center | Mean | Median |
| Scale | Standard deviation | 1.4826 · MAD (or equivalently divides by MAD/0.6745) |
| Outlier flag rule of thumb | \|z\| > 3 | \|M\| > 3.5 |
| Breakdown point | 0% (a single extreme value distorts both stats) | 50% (median and MAD remain stable until half the data is corrupted) |
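A minimal standard-library sketch of the modified z-score and the 3.5 cutoff follows. The function name `modified_z_scores` and the sample data are illustrative, and the zero-MAD guard is an added assumption about how to handle degenerate input.

```python
import statistics

def modified_z_scores(values):
    """Modified z-scores M_i = 0.6745 * (x_i - median) / MAD."""
    med = statistics.median(values)
    mad = statistics.median(abs(x - med) for x in values)
    if mad == 0:
        raise ValueError("MAD is zero; modified z-scores are undefined")
    return [0.6745 * (x - med) / mad for x in values]

data = [10.0, 12.0, 11.0, 13.0, 12.0, 500.0]
scores = modified_z_scores(data)
flagged = [x for x, m in zip(data, scores) if abs(m) > 3.5]
print(flagged)
```

Here the median is 12 and the MAD is 1, so only the value 500 exceeds the cutoff.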

## StandardScaler in scikit-learn

The most common way to apply z-score normalization in Python is through the `StandardScaler` class in [scikit-learn](/wiki/scikit_learn). Below is a typical workflow:<sup>[1]</sup>

```python
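from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Example data; any numeric feature matrix X and label vector y would work
X, y = load_breast_cancer(return_X_y=True)

# Split the data (fixed seed for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)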

# Create and fit the scaler on TRAINING data only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Transform the test data using the SAME scaler
X_test_scaled = scaler.transform(X_test)

# Train a model on the scaled data
model = LogisticRegression()
model.fit(X_train_scaled, y_train)
accuracy = model.score(X_test_scaled, y_test)
```

### key parameters

| Parameter | Default | Description |
| --- | --- | --- |
| `with_mean` | True | If True, center data by subtracting the mean |
| `with_std` | True | If True, scale data to unit variance |
| `copy` | True | If False, attempt to modify arrays in place instead of copying |

### key attributes (after fitting)

| Attribute | Description |
| --- | --- |
| `mean_` | Per-feature mean computed from the training data |
| `var_` | Per-feature variance computed from the training data |
| `scale_` | Per-feature scaling factor (standard deviation) |
| `n_features_in_` | Number of features seen during fit |
| `n_samples_seen_` | Number of samples processed (relevant for `partial_fit`) |

### fitting on training data only

A critical best practice is to fit the scaler on the [training set](/wiki/training_set) only and then use the same fitted scaler to transform the validation and [test sets](/wiki/test_set). This prevents **data leakage**, a situation where information from the test set influences the training process. If the scaler were fit on the entire dataset (including test data), the computed mean and standard deviation would contain information from the test set, giving the model an unfair advantage during evaluation and producing overly optimistic performance estimates.<sup>[9]</sup>

To reduce the risk of data leakage, scikit-learn recommends using **Pipelines**, which chain preprocessing steps and the estimator together and automatically ensure that `fit` is called only on the training fold during cross-validation:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC())
])

pipeline.fit(X_train, y_train)
score = pipeline.score(X_test, y_test)
```

### sparse data

Setting `with_mean=False` is required when standardizing sparse matrices because subtracting a non-zero mean from every element would densify the matrix and consume large amounts of memory.

Separately, `StandardScaler` supports `partial_fit`, which updates `mean_` and `var_` incrementally over multiple chunks of data. The scikit-learn documentation attributes its update rule to Chan, Golub, and LeVeque, a batch-wise relative of the [Welford online algorithm](/wiki/welford_algorithm) described below. This pattern is useful for datasets larger than memory and for streaming preprocessing.

### implementing standardization from scratch

```python
import numpy as np

class MyStandardScaler:
    def fit(self, X):
        self.mean_ = X.mean(axis=0)
        self.scale_ = X.std(axis=0, ddof=0)
        # Avoid division by zero for constant features
        self.scale_[self.scale_ == 0.0] = 1.0
        return self

    def transform(self, X):
        return (X - self.mean_) / self.scale_

    def fit_transform(self, X):
        return self.fit(X).transform(X)

    def inverse_transform(self, X_scaled):
        return X_scaled * self.scale_ + self.mean_
```

The scikit-learn implementation is more elaborate (sparse-matrix support, numerical stability checks, partial fit, validation of input shapes), but the arithmetic core is the same.

## online and streaming computation: Welford's algorithm

When training data does not fit in memory or arrives as a continuous stream, the mean and variance must be computed incrementally rather than from a single batch. The naive approach of accumulating the running sum and the running sum of squares (`sum`, `sum_sq`) and then computing the variance as `sum_sq/n - (sum/n)^2` is numerically unstable: subtracting two large, similar quantities loses precision.

B. P. Welford published a numerically stable online algorithm in 1962 that updates the mean and a running quantity `M2` (the sum of squared deviations from the running mean) one observation at a time:<sup>[12]</sup>

```python
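def welford_update(state, x):
    """Fold one observation x into the running (n, mean, M2) state."""
    n, mean, M2 = state
    n += 1
    delta = x - mean          # deviation from the old mean
    mean += delta / n
    delta2 = x - mean         # deviation from the updated mean
    M2 += delta * delta2
    return n, mean, M2

def welford_finalize(state, ddof=0):
    """Return (mean, std); ddof=0 is population, ddof=1 applies Bessel's correction."""
    n, mean, M2 = state
    if n - ddof <= 0:
        return mean, float('nan')
    variance = M2 / (n - ddof)
    return mean, variance ** 0.5

# Usage: start from an empty state and stream values one at a time
state = (0, 0.0, 0.0)
for x in [180, 170, 160, 150, 165]:
    state = welford_update(state, x)
mean, std = welford_finalize(state)  # mean 165.0, population std 10.0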
```

Welford's recursion preserves accuracy across many updates. Closely related incremental update rules underlie `partial_fit` in scikit-learn's `StandardScaler` (whose documentation cites the batch formulation of Chan, Golub, and LeVeque), `tf.keras.layers.Normalization.adapt` in [TensorFlow](/wiki/tensorflow), and equivalent functions in [PyTorch](/wiki/pytorch). The parallel version of the recursion (Chan, Golub, and LeVeque 1979) merges the statistics from two partitions and is used in distributed feature-statistics jobs on systems such as [Apache Spark](/wiki/apache_spark).
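The partition-merging step can be sketched as follows, assuming each partition is summarized by a tuple `(n, mean, M2)`. The function names `merge_stats` and `summarize` are illustrative, not taken from any library.

```python
def merge_stats(a, b):
    """Merge two (n, mean, M2) summaries computed on disjoint partitions."""
    n_a, mean_a, M2_a = a
    n_b, mean_b, M2_b = b
    n = n_a + n_b
    if n == 0:
        return 0, 0.0, 0.0
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    M2 = M2_a + M2_b + delta * delta * n_a * n_b / n
    return n, mean, M2

def summarize(values):
    """Batch (n, mean, M2) for one non-empty partition."""
    n = len(values)
    mean = sum(values) / n
    M2 = sum((x - mean) ** 2 for x in values)
    return n, mean, M2

part_1 = [180.0, 170.0, 160.0]
part_2 = [150.0, 165.0]
n, mean, M2 = merge_stats(summarize(part_1), summarize(part_2))
std_pop = (M2 / n) ** 0.5
print(n, mean, std_pop)
```

The merged result matches the statistics of the combined data (mean 165 and population standard deviation 10 for these five values), and because the merge is associative it can be applied pairwise across any number of partitions.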

## framework APIs

| Framework | API | Notes |
| --- | --- | --- |
| [scikit-learn](/wiki/scikit_learn) | `sklearn.preprocessing.StandardScaler` | `fit`, `transform`, `partial_fit`, `inverse_transform`; integrates with `Pipeline` and `ColumnTransformer` |
| [SciPy](/wiki/scipy) | `scipy.stats.zscore` | Pure function; supports `axis` and `ddof` arguments |
| [NumPy](/wiki/numpy) | Manual: `(x - x.mean(axis=0)) / x.std(axis=0)` | No built-in scaler; commonly used in custom code |
| [pandas](/wiki/pandas) | `(df - df.mean()) / df.std()` | Default `std` uses Bessel correction (ddof=1) |
| [TensorFlow](/wiki/tensorflow) | `tf.keras.layers.Normalization` | Preprocessing layer; call `adapt(dataset)` to compute statistics |
| [PyTorch](/wiki/pytorch) | Manual or `torchvision.transforms.Normalize` | The `Normalize` transform expects pre-computed mean and std; for images these are typically the channel-wise statistics of the training set (e.g., the ImageNet mean `[0.485, 0.456, 0.406]` and std `[0.229, 0.224, 0.225]`) |
| [Spark MLlib](/wiki/spark_mllib) | `pyspark.ml.feature.StandardScaler` | Distributed; configurable `withMean` and `withStd` |
| [Polars](/wiki/polars) | Manual via `(col - col.mean()) / col.std()` | Lazy evaluation supported |
| R | `scale(x, center=TRUE, scale=TRUE)` | Built-in; uses sample standard deviation |

## connection to batch normalization and layer normalization

[Batch normalization](/wiki/batch_normalization) extends the core idea behind z-score normalization into the hidden layers of [deep neural networks](/wiki/deep_neural_network). Proposed by Sergey Ioffe and Christian Szegedy in 2015, batch normalization applies standardization to the activations of each layer during training.<sup>[13]</sup>

For each mini-batch, the algorithm computes the mean and variance of the activations and then normalizes them:

**x_hat = (x - μ_batch) / √(σ²_batch + ε)**

where ε is a small constant added for numerical stability. After normalization, the values are scaled and shifted using two learnable parameters, γ (gamma) and β (beta):

**y = γ · x_hat + β**

These learnable parameters allow each layer to recover the optimal activation distribution, while still benefiting from the stability that normalization provides.
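The two equations above can be sketched in NumPy for inputs of shape (batch, features). This covers only the training-mode forward pass; the running averages used at inference and the backward pass are omitted, and the function name `batch_norm_forward` is illustrative.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Training-mode batch normalization for inputs of shape (batch, features)."""
    mu_batch = x.mean(axis=0)
    var_batch = x.var(axis=0)            # population variance over the mini-batch
    x_hat = (x - mu_batch) / np.sqrt(var_batch + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(3)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))
gamma = np.ones(4)    # learnable scale, initialized to 1
beta = np.zeros(4)    # learnable shift, initialized to 0

y = batch_norm_forward(x, gamma, beta)
print(y.mean(axis=0), y.std(axis=0))
```

With γ = 1 and β = 0 the output is simply the per-feature z-score of the mini-batch, up to the small effect of ε.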

[Layer normalization](/wiki/layer_normalization), introduced by Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey Hinton in 2016, applies the same arithmetic but computes the statistics across the feature dimension of a single example rather than across the batch dimension. Layer normalization is the standard choice in [transformer](/wiki/transformer) architectures because it does not depend on the batch size and is well-suited to variable-length sequences. Other variants include **instance normalization** (used in style transfer), **group normalization** (used in computer vision when batch sizes are small), and **RMSNorm** (used in modern [large language models](/wiki/large_language_model) such as the [LLaMA](/wiki/llama) family).
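The change of axis that distinguishes layer normalization can be sketched as below. The tensor shape (batch, sequence, features), the seed, and the function name `layer_norm` are assumptions made for the example.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Layer normalization: statistics are computed per example over the last axis."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(4)
x = rng.normal(size=(2, 3, 8))   # (batch, sequence, features)
y = layer_norm(x, gamma=np.ones(8), beta=np.zeros(8))

# Each token vector is normalized independently of the rest of the batch
y_single = layer_norm(x[:1], gamma=np.ones(8), beta=np.zeros(8))
print(np.allclose(y[:1], y_single))
```

Normalizing the first example on its own gives the same result as normalizing it inside the batch, which is the batch-size independence described above.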

| Aspect | Z-Score Normalization | [Batch Normalization](/wiki/batch_normalization) | [Layer Normalization](/wiki/layer_normalization) |
| --- | --- | --- | --- |
| Applied to | Input features (before training) | Hidden layer activations | Hidden layer activations |
| Statistics computed across | Entire training set | Mini-batch (per channel) | Feature dimension (per example) |
| Statistics source at inference | Stored from training | Running averages collected during training | Computed from each input on the fly |
| Learnable parameters | None | γ (scale) and β (shift) per channel | γ and β per feature |
| Common use | All ML models | CNNs | Transformers, RNNs |

Batch normalization allows the use of higher [learning rates](/wiki/learning_rate), reduces sensitivity to weight initialization, and provides a mild regularization effect. It has become a standard component in modern [convolutional neural networks](/wiki/convolutional_neural_network) and other deep architectures.<sup>[13]</sup>

## use in anomaly detection

The absolute z-score is a popular heuristic for [outlier](/wiki/outliers) and [anomaly detection](/wiki/anomaly_detection). Under the assumption that a feature is approximately normally distributed, observations with `|z| > 3` are sometimes flagged as suspicious because they correspond to the tails of the empirical 68-95-99.7 rule and account for only about 0.3% of the distribution.

The **three-sigma rule** is convenient but blunt. It assumes a unimodal, approximately Gaussian distribution; on heavy-tailed or skewed data, the threshold either flags too many points (Cauchy-like distributions, financial returns) or too few (long-tailed user behavior data). For these reasons, modified z-scores based on the median and MAD (Iglewicz and Hoaglin 1993) and tree-based methods such as [isolation forest](/wiki/isolation_forest) are often preferred for production anomaly detection pipelines.<sup>[11]</sup>

| Method | Sensitivity Threshold | Robust to skew? |
| --- | --- | --- |
| Standard z-score | \|z\| > 2 (loose), \|z\| > 3 (strict) | No |
| Modified z-score (MAD) | \|M\| > 3.5 | Yes |
| IQR rule (Tukey) | x outside [Q1 - 1.5·IQR, Q3 + 1.5·IQR] | Yes |
| Mahalanobis distance | Chi-squared threshold on multivariate distance | Partial |
| Isolation forest | Score-based | Yes |
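
The first two rows of the table can be compared with a short NumPy sketch. It uses the modified z-score M = 0.6745 · (x − median) / MAD with the 3.5 cutoff listed above. The function names and the sample data are hypothetical, and the sketch assumes the MAD is non-zero.

```python
import numpy as np

def zscore_flags(x, threshold=3.0):
    """Flag points whose absolute standard z-score exceeds the threshold."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return np.abs(z) > threshold

def modified_zscore_flags(x, threshold=3.5):
    """Flag points using the median/MAD modified z-score."""
    x = np.asarray(x, dtype=float)
    median = np.median(x)
    mad = np.median(np.abs(x - median))  # median absolute deviation
    m = 0.6745 * (x - median) / mad      # 0.6745 puts M on the z scale for normal data
    return np.abs(m) > threshold

# Hypothetical sample: ten ordinary readings plus two gross outliers
data = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 9.9, 55.0, 60.0])

standard_flags = zscore_flags(data)           # no point is flagged
modified_flags = modified_zscore_flags(data)  # only the last two points are flagged
```

In this sample the two outliers inflate the mean and standard deviation so much that neither reaches `|z| > 3`, an effect often called masking. The median and MAD are barely affected, so the modified z-score flags both.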

A practical financial example is the **z-score of returns**, used to flag market days where an asset's daily return is more than three standard deviations from its rolling average. Risk-management systems use such rules to trigger model recalibration or reporting events.
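
A rule of this kind can be sketched with pandas. The returns below are synthetic, the 60-day window is an arbitrary illustrative choice, and the injected shock exists only to give the rule something to find. A real system would pick the window and threshold from its own risk policy.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
returns = pd.Series(rng.normal(loc=0.0, scale=0.01, size=500))  # synthetic daily returns
returns.iloc[400] = 0.08                                         # injected shock

window = 60  # hypothetical look-back length in trading days

# Shift first so each day's statistics come from the preceding `window` days only
past = returns.shift(1).rolling(window)
z = (returns - past.mean()) / past.std()

# Days whose return is more than three rolling standard deviations from the rolling mean
flagged_days = z.index[z.abs() > 3].tolist()
```

The first `window` entries of `z` are NaN because no complete history exists for them yet.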

## applications outside machine learning

Z-score normalization predates [machine learning](/wiki/machine_learning) by more than a century and remains heavily used across the quantitative sciences and finance:

- **Education and psychometrics.** Standardized tests such as the SAT, GRE, and IQ tests report scores derived from z-scores that have been linearly rescaled to a target mean and standard deviation. The classical SAT scaled score has historically targeted a mean of 500 and a standard deviation of 100 per section, while the Wechsler Adult Intelligence Scale (WAIS) IQ scale targets a mean of 100 and a standard deviation of 15.
- **Pediatric medicine.** The World Health Organization Child Growth Standards report **z-scores** for weight-for-age, height-for-age, and BMI-for-age. A child below z = -2 on weight-for-age is classified as underweight, and below z = -3 as severely underweight.
- **Finance.** Edward Altman's 1968 **Altman Z-Score** is a weighted linear combination of five accounting ratios used to predict corporate bankruptcy; despite its name, it is a discriminant score rather than a standard score. Modern risk-management systems also compute rolling z-scores of asset returns to flag tail events.
- **Quality control.** Six Sigma uses standardized deviations from a target value to characterize manufacturing defects per million opportunities.
- **Sports analytics.** Player and team metrics are often expressed as z-scores relative to a league average so that performance can be compared across positions. League-adjusted metrics such as *plus-minus* in basketball or **WAR** in baseball serve a similar comparative purpose but are not themselves z-scores.
- **Climate science.** Temperature and precipitation anomalies are routinely reported as departures from a reference period mean expressed in standard deviations, allowing comparisons across stations with different baselines.
- **Genomics and bioinformatics.** Gene-expression z-scores are computed within microarray and RNA-seq experiments to highlight genes that are over- or under-expressed relative to the average across samples.
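
The linear rescaling mentioned in the education example is simply `score = target_mean + target_sd * z`. The sketch below shows only that final step, with a hypothetical helper name. Real testing programs add equating, rounding, and score caps that are not modeled here.

```python
import numpy as np

def rescale_z(z, target_mean, target_sd):
    """Map z-scores onto a reporting scale with the given mean and SD."""
    return target_mean + target_sd * np.asarray(z, dtype=float)

z_scores = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

# Mean 500, SD 100 (classical SAT section scale)
sat_like = rescale_z(z_scores, target_mean=500, target_sd=100)  # 300, 400, 500, 600, 700

# Mean 100, SD 15 (WAIS IQ scale)
iq_like = rescale_z(z_scores, target_mean=100, target_sd=15)    # 70, 85, 100, 115, 130
```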

## common pitfalls

A number of mistakes show up repeatedly when applying z-score normalization in practice. Avoiding them tends to be more valuable than choosing between scaler variants.

| Pitfall | Why it is wrong | Correct approach |
| --- | --- | --- |
| Fitting the scaler on the full dataset | Test-set statistics leak into training | Fit on `X_train` only; transform `X_train`, `X_val`, `X_test` |
| Standardizing across rows instead of across columns | Rows mix incommensurable features (height, weight, age); the per-row mean has no statistical meaning | Standardize per column (`axis=0`); per-row normalization is for vector-norm scaling, not z-score |
| Standardizing one-hot or binary indicators | Destroys interpretability and sparsity, and the resulting values may be larger than the original signal | Skip standardization for binary or categorical encodings; use `ColumnTransformer` to apply scaling only to numeric features |
| Forgetting to standardize new inference data | The model was trained on standardized inputs; raw production inputs fall far outside the range it learned | Save the fitted scaler with the model and apply `transform` at inference |
| Standardizing the [target variable](/wiki/target_variable) without inverting | Predictions reported in the wrong units | Apply `inverse_transform` to predictions before reporting |
| Standardizing the whole dataset before cross-validation | Validation-fold statistics leak into every training fold, so cross-validation scores are optimistic | Use scikit-learn `Pipeline` so the scaler is refit on each training fold automatically |
| Standardizing time-series data with the full series | Future statistics leak into past predictions | Use rolling or expanding statistics that respect temporal ordering |
| Refitting the scaler later without retraining the model | Model expects the original mean and std; new statistics shift the inputs it receives | Version the scaler alongside the model; retrain both together |
| Constant feature with zero variance | Division by zero in the denominator | Detect and handle via `with_std=False`, removal, or replacing zero std with 1 |
| Mixing population and sample standard deviations | Tiny numeric differences cause confusion across libraries | Pick one convention (usually population, ddof=0) and stick with it across the pipeline |
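
Several of these fixes (fitting on training data only, refitting per fold, and skipping binary indicators) can be combined in one scikit-learn pipeline. The following is a minimal sketch on synthetic data; the column names, the label rule, and the choice of `LogisticRegression` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: two numeric columns on very different scales plus a binary flag
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "income": rng.normal(60_000, 15_000, n),
    "age": rng.normal(40, 12, n),
    "is_member": rng.integers(0, 2, n),
})
y = (df["income"] / 15_000 + df["age"] / 12 + df["is_member"]
     + rng.normal(0, 1, n) > 8.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.25, random_state=0)

# Scale only the numeric columns; pass the binary indicator through unchanged
preprocess = ColumnTransformer(
    [("num", StandardScaler(), ["income", "age"])],
    remainder="passthrough",
)
model = Pipeline([("prep", preprocess), ("clf", LogisticRegression())])

# cross_val_score clones and refits the whole pipeline, scaler included, per training fold
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

# Final fit uses training data only; the stored mean/std are reused on the test set
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
```

Because the scaler lives inside the pipeline, there is no separate `fit_transform` call on the full dataset that could leak test-set statistics.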

### standardization with time series

For [time series](/wiki/time_series) and [forecasting](/wiki/forecasting) problems, z-score normalization must respect temporal ordering. Computing the mean and standard deviation over the entire series and then transforming every observation leaks future information into the past. A safer approach is to standardize each time step against a **rolling window** or **expanding window** of past values only. In Python, pandas provides `rolling` and `expanding` window methods that compute such statistics directly.

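```python
# Expanding-window z-score for a pandas Series
import numpy as np
import pandas as pd

# Illustrative data; replace with your own time-ordered observations
values = np.random.default_rng(0).normal(loc=100.0, scale=15.0, size=200)
series = pd.Series(values)

# Statistics over all observations so far, shifted so step t sees only earlier steps
expanding_mean = series.expanding(min_periods=30).mean().shift(1)
expanding_std = series.expanding(min_periods=30).std().shift(1)  # pandas default: ddof=1

# The first 30 entries are NaN because they lack 30 prior observations
z_series = (series - expanding_mean) / expanding_std
```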

The `.shift(1)` step is essential. It ensures that the statistics at time t are computed only from observations strictly before t.
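
One way to check for leakage is to alter only the future part of a series and confirm that the standardized past does not change. The sketch below applies this test to a shifted expanding-window z-score and, for contrast, to a global z-score. The helper names and the synthetic series are hypothetical.

```python
import numpy as np
import pandas as pd

def expanding_z(s, min_periods=30):
    """Z-score each point against the mean/std of strictly earlier points."""
    mean = s.expanding(min_periods=min_periods).mean().shift(1)
    std = s.expanding(min_periods=min_periods).std().shift(1)
    return (s - mean) / std

def global_z(s):
    """Z-score against full-series statistics (uses future observations)."""
    return (s - s.mean()) / s.std()

rng = np.random.default_rng(1)
original = pd.Series(rng.normal(size=200))

# Alter only the last 50 observations, i.e. the "future" relative to t < 150
altered = original.copy()
altered.iloc[150:] += 10.0

past = slice(30, 150)  # positions with a full history that precede the alteration
expanding_unchanged = np.allclose(
    expanding_z(original).iloc[past], expanding_z(altered).iloc[past]
)
global_unchanged = np.allclose(
    global_z(original).iloc[past], global_z(altered).iloc[past]
)
```

Here `expanding_unchanged` is `True` and `global_unchanged` is `False`: the global statistics absorb the altered future values, so every past z-score moves with them.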

## practical tips

- **Always standardize after splitting.** Compute the mean and standard deviation from the training set only. Apply the same transformation to validation and test data.
- **Store scaler parameters for production.** When deploying a model, save the fitted scaler alongside the model so that incoming data can be transformed with the exact same mean and standard deviation used during training.
- **Consider robust scaling for dirty data.** If your dataset has significant outliers or measurement errors, try `RobustScaler` before defaulting to `StandardScaler`.
- **Tree-based models do not need it.** [Decision trees](/wiki/decision_tree), [random forests](/wiki/random_forest), and gradient boosted trees are invariant to monotonic transformations of features, so standardization neither helps nor hurts.
- **Match the scaler to inference.** As Google's Machine Learning Crash Course states, "if you normalize a feature during training, you must also normalize that feature when making predictions."<sup>[2]</sup>
- **Use ColumnTransformer for mixed types.** Apply standardization only to numeric columns; leave categorical and binary indicators untouched.
- **Combine with imputation.** Fit imputation and standardization in one [Pipeline](/wiki/scikit_learn_pipeline) so that train and inference apply the same operations in the same order.
- **Watch for constant features.** Features with zero variance break the formula; remove them or set `with_std=False`.
- **Do not standardize tree-model targets.** Tree-based regressors are scale-equivariant for the target; standardizing the target is unnecessary and complicates inverse-transform bookkeeping.
- **Experiment.** There is no universal best scaler. Try both `StandardScaler` and `MinMaxScaler`, evaluate using cross-validation, and pick whichever produces better results for your specific problem.
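
The advice to store scaler parameters for production can be sketched by serializing the scaler and model as one fitted pipeline. This example uses the standard-library `pickle` module to stay self-contained; `joblib` is a common alternative. The data, the `Ridge` model, and the variable names are assumptions for illustration, and pickled files should only be loaded from trusted sources.

```python
import pickle

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic training data with two features on very different scales
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[50_000.0, 40.0], scale=[15_000.0, 12.0], size=(200, 2))
y_train = X_train @ np.array([1e-4, 0.05]) + rng.normal(scale=0.1, size=200)

# Scaler and model travel together as a single fitted object
pipeline = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X_train, y_train)

# Serialize once at training time ...
blob = pickle.dumps(pipeline)

# ... and restore at inference time; raw inputs are scaled with the training mean/std
restored = pickle.loads(blob)
X_new = np.array([[62_000.0, 35.0], [48_000.0, 51.0]])
predictions = restored.predict(X_new)
```

Keeping both steps in one artifact means the scaler cannot drift out of sync with the model version it was fitted alongside.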

## history and terminology

The term **standard score** appears throughout the early-20th-century statistical literature, and Karl Pearson's correlation work in the 1890s explicitly used standardized variables. Ronald A. Fisher's 1925 *Statistical Methods for Research Workers* further popularized the use of standardized residuals and tabulated values of the standard normal distribution. The letter **z** for the standardized variable became conventional through textbooks such as Snedecor and Cochran's *Statistical Methods* and through the widespread reproduction of standard-normal tables in undergraduate courses.

In machine learning, the equivalent operation has been called **z-score normalization**, **standardization**, **autoscaling** (in chemometrics), and **mean-centering and unit-variance scaling**. The scikit-learn project names its implementation `StandardScaler`, while [TensorFlow](/wiki/tensorflow) calls the equivalent Keras layer `Normalization` and computes the mean and variance from data via the `adapt` method. Despite the variety of names, the underlying arithmetic has remained unchanged for more than a century.

## explain like I'm 5 (ELI5)

Imagine you and your friends are comparing how good you are at two different games: one where scores go up to 1,000, and another where scores only go up to 10. If you just look at the raw numbers, the first game's scores always seem "bigger" and more important, even though a score of 8 out of 10 might be just as impressive as 800 out of 1,000.

Z-score normalization is like a magic translator. It takes every score and asks: "How far above or below average is this?" Then it writes the answer in a simple language where "0" means perfectly average, "+1" means one step above average, and "-1" means one step below average. Now you can compare your performance across both games fairly, because the numbers all speak the same language.

## see also

- [Feature scaling](/wiki/feature_scaling)
- [Min-max scaling](/wiki/min_max_scaling)
- [Robust scaler](/wiki/robust_scaler)
- [Batch normalization](/wiki/batch_normalization)
- [Layer normalization](/wiki/layer_normalization)
- [Normalization](/wiki/normalization)
- [Standard deviation](/wiki/standard_deviation)
- [Variance](/wiki/variance)
- [Normal distribution](/wiki/normal_distribution)
- [Anomaly detection](/wiki/anomaly_detection)
- [Principal component analysis](/wiki/principal_component_analysis_pca)
- [Data preprocessing](/wiki/data_preprocessing)

## references

1. [StandardScaler - scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)
2. [Numerical Data: Normalization - Google Machine Learning Crash Course](https://developers.google.com/machine-learning/crash-course/numerical-data/normalization)
3. [Z-Scores and the Standard Normal Distribution - Statistics LibreTexts](https://stats.libretexts.org/Bookshelves/Applied_Statistics/Introduction_to_Statistics_in_the_Psychological_Sciences_(Cote_Gordon_Randell_and_Marvin)/01:_Fundamentals_of_Statistics/1.04:_Chapter_4-_z_Scores_and_the_Standard_Normal_Distribution)
4. [About Feature Scaling and Normalization - Sebastian Raschka](https://sebastianraschka.com/Articles/2014_about_feature_scaling.html)
5. [Importance of Feature Scaling - scikit-learn documentation](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html)
6. [Which Machine Learning Algorithms Require Feature Scaling - The Professionals Point](http://theprofessionalspoint.blogspot.com/2019/02/which-machine-learning-algorithms.html)
7. [Normalization vs. Standardization - DataCamp](https://www.datacamp.com/tutorial/normalization-vs-standardization)
8. [RobustScaler - scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html)
9. [Common Pitfalls and Recommended Practices - scikit-learn documentation](https://scikit-learn.org/stable/common_pitfalls.html)
10. [Efficient BackProp - LeCun, Bottou, Orr, and Müller, 1998](http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf)
11. [How to Detect and Handle Outliers - Iglewicz and Hoaglin, 1993, ASQC Quality Press](https://www.asq.org/quality-press/display-item?item=H0613)
12. [Note on a Method for Calculating Corrected Sums of Squares and Products - B. P. Welford, Technometrics 1962](https://doi.org/10.1080/00401706.1962.10490022)
13. [Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift - Ioffe and Szegedy, 2015](https://arxiv.org/abs/1502.03167)
14. [Layer Normalization - Ba, Kiros, and Hinton, 2016](https://arxiv.org/abs/1607.06450)
15. [Statistical Methods - Snedecor and Cochran, 8th ed., Iowa State University Press, 1989](https://www.wiley.com/en-us/Statistical+Methods%2C+8th+Edition-p-9780813815619)
16. [scipy.stats.zscore - SciPy documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.zscore.html)
17. [Normalization layer - TensorFlow documentation](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Normalization)
18. [WHO Child Growth Standards - World Health Organization](https://www.who.int/tools/child-growth-standards)

