Stationarity

Stationarity is a property of a stochastic process whose statistical characteristics, such as the mean, variance, and autocorrelation, do not change when shifted in time. A time series drawn from a stationary process looks statistically the same regardless of when you start observing it. The concept is foundational to classical time series analysis, to econometrics, to signal processing, and to many results in machine learning, where the closely related independent and identically distributed (i.i.d.) assumption underpins generalization theory. Without stationarity (or some controlled departure from it), many standard estimators lose consistency, standard test statistics can follow nonstandard distributions, and forecasts can be badly off because the past no longer informs the future in a stable way.

The notion was developed primarily in the first half of the twentieth century by mathematicians and statisticians working on signal processing, communication theory, and economics, including Andrey Kolmogorov, Norbert Wiener, Aleksandr Khinchin, and later George Box and Gwilym Jenkins. James D. Hamilton's 1994 textbook Time Series Analysis remains the standard graduate reference and devotes substantial space to the formal treatment, the econometric tests, and the consequences of nonstationarity. Robert Engle and Clive Granger received the 2003 Sveriges Riksbank Prize in Economic Sciences in Memory of Alfred Nobel in part for showing how to model long-run relationships among nonstationary series through cointegration, an idea that grew directly out of the realization that running ordinary regressions on nonstationary data could yield spurious results.[^nobel2003]

Formal definitions

Let $\{X_t\}_{t \in T}$ denote a stochastic process indexed by time. Several distinct but related notions of stationarity are used in practice. The right one to invoke depends on what you actually need: full distributional invariance, invariance of the first two moments, or merely invariance of the mean.

Strict (strong) stationarity

A process is strictly stationary (also called strongly stationary) if its joint probability distribution is invariant under arbitrary time shifts. Formally, for every finite collection of time points $t_1, t_2, \dots, t_n$ and every shift $\tau$,

$$F_{X}(x_{t_1}, x_{t_2}, \dots, x_{t_n}) = F_{X}(x_{t_1 + \tau}, x_{t_2 + \tau}, \dots, x_{t_n + \tau}).$$

In words, the entire joint distribution of any finite subset of observations depends only on the relative spacing of the time points, not on where in time you happen to be looking. Strict stationarity is a strong requirement and is usually difficult to verify directly from data because it concerns the full distribution rather than summary moments.[^wiki_stationary]

Weak (wide-sense, covariance) stationarity

A process is weakly stationary, also called wide-sense stationary (WSS) or covariance stationary, if the following three conditions hold:

The second moment is finite for all $t$: $E[X_t^2] < \infty$.
The mean is constant: $E[X_t] = \mu$ for all $t$.
The autocovariance depends only on the lag $h$, not on the absolute time $t$: $\operatorname{Cov}(X_t, X_{t+h}) = \gamma(h)$.

The function $\gamma(h)$ is the autocovariance function of the process. Dividing by the variance gives the autocorrelation function $\rho(h) = \gamma(h) / \gamma(0)$. Weak stationarity is the workhorse definition because most useful estimators (sample means, sample autocovariances) and most standard asymptotic results require only the first two moments to be well behaved.[^statlect_cov]
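
As an illustration of the definitions above, the sample counterparts of $\gamma(h)$ and $\rho(h)$ can be computed directly. This is a minimal sketch using NumPy; the function names and the simulated white-noise input are illustrative assumptions, not part of any particular library.

```python
import numpy as np

def sample_autocovariance(x, max_lag):
    """Return the sample autocovariances gamma_hat(0), ..., gamma_hat(max_lag)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    centered = x - x.mean()
    # Dividing by n (rather than n - h) is the usual convention; it keeps the
    # estimated autocovariance sequence positive semidefinite.
    return np.array([np.dot(centered[: n - h], centered[h:]) / n
                     for h in range(max_lag + 1)])

def sample_autocorrelation(x, max_lag):
    """Return rho_hat(h) = gamma_hat(h) / gamma_hat(0) for h = 0..max_lag."""
    gamma = sample_autocovariance(x, max_lag)
    return gamma / gamma[0]

# Illustrative input: Gaussian white noise, which is weakly stationary.
rng = np.random.default_rng(0)
white_noise = rng.normal(size=2000)
rho = sample_autocorrelation(white_noise, max_lag=5)
```

For white noise the estimated autocorrelations at nonzero lags should be close to zero, with sampling error on the order of $1/\sqrt{n}$.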

The relationship between the two definitions is asymmetric. A strictly stationary process with finite second moments is automatically weakly stationary, but a weakly stationary process need not be strictly stationary because the higher moments could still drift over time. The two notions coincide for Gaussian processes, since a Gaussian distribution is fully characterized by its first two moments.

Trend stationarity

A process is trend stationary if it can be written as the sum of a deterministic trend $f(t)$ and a stationary component $\epsilon_t$:

$$X_t = f(t) + \epsilon_t.$$

Subtracting the trend (typically a polynomial in $t$, often linear) yields a stationary residual series. Trend stationary processes revert to the deterministic trend in the long run, meaning shocks have only a transitory effect.[^matlab_trend]
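
A short sketch of detrending, assuming a linear trend and a simulated series whose parameters (intercept 2.0, slope 0.05, unit-variance noise) are chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
t = np.arange(n)
# Hypothetical trend-stationary series: linear trend plus white noise.
x = 2.0 + 0.05 * t + rng.normal(scale=1.0, size=n)

# Fit f(t) = intercept + slope * t by least squares and subtract it.
slope, intercept = np.polyfit(t, x, deg=1)
residual = x - (intercept + slope * t)

# The residual mean should be stable across the sample, unlike the raw mean.
raw_gap = x[n // 2:].mean() - x[: n // 2].mean()
residual_gap = residual[n // 2:].mean() - residual[: n // 2].mean()
```

Here the raw series has very different means in its two halves, while the detrended residual does not. The same recipe extends to higher-order polynomial trends by changing `deg`.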

Difference stationarity and unit roots

A process is difference stationary if its first difference $\Delta X_t = X_t - X_{t-1}$ (or some higher-order difference) is stationary. Such processes contain a unit root in the autoregressive polynomial. A series that becomes stationary after $d$ differencing operations is said to be integrated of order d, written $I(d)$. The simplest example of an $I(1)$ process is the random walk $X_t = X_{t-1} + \epsilon_t$, where $\epsilon_t$ is white noise; the level $X_t$ is nonstationary, but $\Delta X_t = \epsilon_t$ is stationary.[^matlab_trend][^wiki_unit_root]
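
The random-walk example can be checked by simulation. In the sketch below (path counts, horizon, and unit shock variance are arbitrary choices), the variance of the level across simulated paths grows roughly in proportion to $t$, while first differences recover the white-noise shocks.

```python
import numpy as np

rng = np.random.default_rng(2)
n_paths, n_steps = 5000, 400
shocks = rng.normal(size=(n_paths, n_steps))   # white noise with variance 1
walks = np.cumsum(shocks, axis=1)              # X_t = X_{t-1} + eps_t, X_0 = 0

# Cross-sectional variance of the level at two dates.
var_early = walks[:, 99].var()    # t = 100, theoretical variance 100
var_late = walks[:, 399].var()    # t = 400, theoretical variance 400

# First differences are the shocks themselves, so they are stationary.
diffs = np.diff(walks, axis=1)
```

Because $\operatorname{Var}(X_t) = t\sigma^2$ depends on $t$, the level violates weak stationarity even though its mean is constant.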

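The distinction between trend stationary and difference stationary processes matters because the appropriate transformation differs: detrending is correct for the former, while differencing is correct for the latter. Applying the wrong transformation leaves structure the model cannot remove: detrending a unit-root process leaves the stochastic trend in the residuals, and differencing a trend stationary process introduces a noninvertible moving-average component.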

Cyclostationarity and other generalizations

A process is cyclostationary if its statistical properties vary periodically with time, so that it is invariant only under shifts by whole multiples of the period. Other generalizations include local stationarity (slowly varying parameters) and nth-order stationarity (invariance of moments up to order $n$). These extensions are used in signal processing, climate science, and any setting where a strict stationary-or-nonstationary dichotomy is too coarse.

Why stationarity matters

Without stationarity, most of the statistical machinery for time series analysis breaks down. The reasons span estimation, inference, and prediction.

Estimation. The sample mean $\bar{X}_n = \frac{1}{n} \sum X_t$ is a natural estimator of the population mean, but it converges to a meaningful population quantity only if the population mean exists and is constant. If the mean drifts, $\bar{X}_n$ chases a moving target.

Inference. Standard t-statistics and F-statistics rely on asymptotic normality results that assume stationarity (or related mixing conditions). When you regress one nonstationary series on another, the t-statistic can diverge even when the variables are unrelated, producing what Granger and Newbold called spurious regression. Regressions of independent random walks on each other tend to show high $R^2$ values and highly significant coefficients despite the absence of any genuine relationship.[^granger_newbold]
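
The spurious regression effect is easy to reproduce. The following Monte Carlo sketch (sample size, number of replications, and the helper `slope_t_stat` are illustrative assumptions) regresses pairs of independent random walks on each other and counts how often a nominal 5% t-test rejects, with independent white noise as a control.

```python
import numpy as np

def slope_t_stat(y, x):
    """OLS t-statistic on b in y = a + b x + e, using classical standard errors."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - 2)
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta[1] / np.sqrt(cov[1, 1])

rng = np.random.default_rng(3)
n_sims, n = 500, 200
reject_walks = 0
reject_noise = 0
for _ in range(n_sims):
    e1, e2 = rng.normal(size=(2, n))
    # Independent random walks: no true relationship, but nonstationary.
    reject_walks += int(abs(slope_t_stat(np.cumsum(e1), np.cumsum(e2))) > 1.96)
    # Independent white noise: no true relationship, stationary.
    reject_noise += int(abs(slope_t_stat(e1, e2)) > 1.96)

rate_walks = reject_walks / n_sims
rate_noise = reject_noise / n_sims
```

The white-noise rejection rate should sit near the nominal 5%, while the random-walk rate is far higher, which is the pattern Granger and Newbold described.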

Forecasting. ARIMA and SARIMA models assume an underlying stationary structure after appropriate differencing. Exponential smoothing and many state-space models instead build trend and seasonal components into the model itself, but they still assume that those components evolve according to fixed rules. If the structure changes during the forecast horizon, predictions degrade. The validity of confidence intervals and prediction intervals depends on the assumed structure holding in the period being forecast.

Machine learning generalization. The standard supervised learning setup assumes training and test data are drawn i.i.d. from the same distribution. This is a stronger condition than stationarity (i.i.d. requires independence as well), but it shares the core idea that the statistical environment does not change between when you fit the model and when you deploy it. When the deployment distribution differs, the model is operating under distribution shift or concept drift, and its accuracy is no longer guaranteed.[^d2l_shift]

Tests for stationarity

A range of formal hypothesis tests have been developed to assess stationarity in observed series. The tests differ in their null hypothesis (stationarity vs. unit root), in how they handle serial correlation, and in their power against various alternatives. It is common practice to apply at least two tests with opposing nulls and reconcile the results.

Test	Year	Null hypothesis	Alternative	Approach to serial correlation	Notes
Dickey-Fuller (DF)	1979	Unit root (nonstationary)	Stationary	None (assumes white-noise errors)	Original test; superseded by ADF
Augmented Dickey-Fuller (ADF)	1981	Unit root	Stationary	Parametric: includes lagged differences	Most widely used unit-root test
Phillips-Perron (PP)	1988	Unit root	Stationary	Nonparametric Newey-West correction	Robust to heteroskedasticity; weaker finite-sample power than ADF
KPSS (Kwiatkowski-Phillips-Schmidt-Shin)	1992	Stationary (or trend stationary)	Unit root	Nonparametric long-run variance	Reverses the null; useful as confirmatory test
Elliott-Rothenberg-Stock (DF-GLS)	1996	Unit root	Stationary	GLS detrending before ADF regression	Higher power than ADF near the unit root
Zivot-Andrews	1992	Unit root (no structural break)	Trend stationary with one structural break	Allows endogenous break date	Useful when structural breaks are suspected
HEGY	1990	Seasonal unit roots	Seasonal stationarity	Tests roots at seasonal frequencies	Used for seasonal ARIMA model identification
Ljung-Box	1978	No autocorrelation in residuals	Autocorrelation present	n/a	Diagnostic check on residuals after model fitting

The ADF and KPSS tests are often used together. Because their nulls are opposite, the combination produces four possible outcomes. If ADF rejects its unit-root null and KPSS does not reject its stationarity null, both tests point to a stationary series. If ADF does not reject and KPSS does, both point to a unit root. The remaining two outcomes are conflicting. When both tests reject, the series is often treated as difference stationary and differenced; when neither rejects, it is often treated as trend stationary and detrended, although that pattern can also simply reflect low power in a short sample.[^statsmodels_adfkpss]
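
For reference, the KPSS level-stationarity statistic can be sketched in a few lines of NumPy. This is an illustrative implementation under stated assumptions (Bartlett-kernel long-run variance, a rule-of-thumb bandwidth, simulated inputs), not a substitute for a tested library routine. The asymptotic 5% critical value for the level-stationary case is about 0.463.

```python
import numpy as np

def kpss_level_stat(y, n_lags):
    """KPSS statistic for the null of stationarity around a constant mean."""
    y = np.asarray(y, dtype=float)
    n = y.size
    e = y - y.mean()                      # residuals from a constant-only fit
    partial_sums = np.cumsum(e)
    # Newey-West long-run variance with Bartlett weights.
    long_run_var = np.dot(e, e) / n
    for s in range(1, n_lags + 1):
        weight = 1.0 - s / (n_lags + 1.0)
        long_run_var += 2.0 * weight * np.dot(e[s:], e[:-s]) / n
    return np.sum(partial_sums**2) / (n**2 * long_run_var)

rng = np.random.default_rng(11)
n = 1000
noise = rng.normal(size=n)                 # stationary
walk = np.cumsum(rng.normal(size=n))       # unit root
n_lags = int(4 * (n / 100) ** 0.25)        # a common rule-of-thumb bandwidth

stat_noise = kpss_level_stat(noise, n_lags)
stat_walk = kpss_level_stat(walk, n_lags)
```

Large values of the statistic count against stationarity, so the random walk should produce a much larger value than the white noise.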

The ADF regression, in its general form, takes the test equation

$$\Delta y_t = \alpha + \beta t + \gamma y_{t-1} + \sum_{i=1}^{p} \delta_i \Delta y_{t-i} + \epsilon_t,$$

and tests $H_0: \gamma = 0$ (unit root) against $H_1: \gamma < 0$ (stationary). The lag length $p$ is typically chosen by an information criterion (AIC, BIC) or a sequential testing procedure. The asymptotic distribution of the t-statistic on $\gamma$ under the null is nonstandard (the Dickey-Fuller distribution), so critical values are taken from tabulated simulations rather than from the standard normal.[^wiki_adf]
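
The test equation can be estimated by ordinary least squares. The sketch below implements the constant-only variant (that is, $\beta = 0$, no time trend) with a fixed lag length; the function name and simulated inputs are illustrative assumptions. For this variant the asymptotic 5% Dickey-Fuller critical value is roughly $-2.86$.

```python
import numpy as np

def adf_t_stat(y, p=1):
    """t-statistic on gamma in the ADF regression with a constant and
    p lagged differences (no time trend)."""
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)                          # dy[k] = y[k+1] - y[k]
    rows = len(dy) - p                       # usable observations
    target = dy[p:]                          # Delta y_t
    cols = [np.ones(rows), y[p:-1]]          # constant and y_{t-1}
    for i in range(1, p + 1):
        cols.append(dy[p - i : len(dy) - i])  # Delta y_{t-i}
    X = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ beta
    sigma2 = resid @ resid / (rows - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta[1] / np.sqrt(cov[1, 1])

rng = np.random.default_rng(4)
n = 500
eps = rng.normal(size=n)
random_walk = np.cumsum(eps)                 # gamma = 0 (unit root)
ar1 = np.zeros(n)
for t in range(1, n):
    ar1[t] = 0.5 * ar1[t - 1] + eps[t]       # gamma = -0.5 (stationary)

t_walk = adf_t_stat(random_walk, p=1)
t_ar1 = adf_t_stat(ar1, p=1)
```

The statistic for the stationary AR(1) series should be strongly negative and well below the critical value, whereas the random walk typically yields a value much closer to zero. In applied work the lag length would be selected rather than fixed.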

Methods for achieving stationarity

If a series is not stationary, several transformations can often render it (approximately) stationary. The right choice depends on the source of the nonstationarity: changing mean, changing variance, deterministic trend, stochastic trend, seasonality, or some combination.

Method	What it addresses	Formula	When to use	Caveats
First differencing	Stochastic trend (unit root)	$y'_t = y_t - y_{t-1}$	$I(1)$ series; ADF fails to reject	Over-differencing introduces unit root in MA part
Second differencing	Stronger stochastic trend	$y''_t = y'_t - y'_{t-1}$	$I(2)$ series; rare in economic data	Most series are at most $I(2)$; further differencing usually unnecessary
Seasonal differencing	Seasonal nonstationarity	$y^{(s)}_t = y_t - y_{t-s}$	Strong seasonal pattern of period $s$	Combine with first differencing if needed
Detrending	Deterministic trend	$y'_t = y_t - \hat{f}(t)$	Trend stationary series	Wrong if true process has unit root
Log transformation	Variance proportional to level	$y'_t = \log(y_t)$	Exponential growth; multiplicative noise	Requires positive values
Box-Cox transformation	Heteroskedasticity (more general)	$y'_t = (y_t^\lambda - 1)/\lambda$ for $\lambda \neq 0$, $\log y_t$ for $\lambda = 0$	Variance changes with level; $\lambda$ chosen to maximize likelihood	Requires positive values; back-transformation introduces forecast bias
Square root	Mild heteroskedasticity	$y'_t = \sqrt{y_t}$	Count data, Poisson-like variance	Special case of Box-Cox with $\lambda = 0.5$
Seasonal decomposition (STL, X-13)	Trend plus seasonality	$y_t = T_t + S_t + R_t$, model $R_t$	Series with both trend and seasonality	Adds modeling assumptions
Fractional differencing	Long memory ($I(d)$, non-integer $d$)	$(1-L)^d y_t$	ARFIMA processes, hydrology, finance	Computationally heavier than integer differencing
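
The fractional differencing operator in the last row of the table can be expanded as $(1-L)^d = \sum_k w_k L^k$ with $w_0 = 1$ and $w_k = -w_{k-1}(d - k + 1)/k$. A minimal sketch follows; the function names, the truncation length, and the test series are illustrative assumptions.

```python
import numpy as np

def frac_diff_weights(d, n_weights):
    """Coefficients w_0, w_1, ... in the expansion (1 - L)^d = sum_k w_k L^k."""
    w = np.empty(n_weights)
    w[0] = 1.0
    for k in range(1, n_weights):
        w[k] = -w[k - 1] * (d - k + 1) / k
    return w

def frac_diff(y, d, n_weights=100):
    """Apply a truncated (1 - L)^d filter. Early outputs use fewer terms
    because observations before the start of the sample are unavailable."""
    y = np.asarray(y, dtype=float)
    w = frac_diff_weights(d, min(n_weights, len(y)))
    return np.array([np.dot(w[: t + 1], y[t::-1][: len(w)])
                     for t in range(len(y))])

y = np.cumsum(np.random.default_rng(12).normal(size=50))
fractionally_differenced = frac_diff(y, d=0.4)
```

With $d = 1$ the weights reduce to $(1, -1, 0, 0, \dots)$, which is ordinary first differencing; for non-integer $d$ the weights decay slowly, which is what gives the filtered series its long memory.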

The Box-Cox transformation, introduced by George Box and David Cox in 1964, is a parametric family of power transformations indexed by $\lambda$ that includes the log transformation ($\lambda = 0$), the square root ($\lambda = 0.5$), the identity ($\lambda = 1$), and the inverse ($\lambda = -1$). The optimal $\lambda$ is typically chosen by maximum likelihood, often visualized via a profile likelihood plot. Box-Cox stabilizes variance and brings the marginal distribution closer to Gaussian, which helps when the model assumes normally distributed errors (as ARIMA does under maximum likelihood estimation).[^box_cox][^otexts_transforms]
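
The transformation and its inverse follow directly from the formula in the table. This is a minimal sketch with a fixed $\lambda$; the function names are illustrative, and choosing $\lambda$ by maximum likelihood is not shown.

```python
import numpy as np

def box_cox(y, lam):
    """Box-Cox transform: (y**lam - 1) / lam, or log(y) when lam == 0."""
    y = np.asarray(y, dtype=float)
    if np.any(y <= 0):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return np.log(y)
    return (y**lam - 1.0) / lam

def inverse_box_cox(z, lam):
    """Invert the transform. Note that back-transforming a forecast of the
    transformed series gives a median-type forecast, not the mean."""
    z = np.asarray(z, dtype=float)
    if lam == 0:
        return np.exp(z)
    return (lam * z + 1.0) ** (1.0 / lam)

y = np.array([1.0, 10.0, 100.0, 1000.0])
logged = box_cox(y, 0)                                 # natural log
round_trip = inverse_box_cox(box_cox(y, 0.5), 0.5)     # recovers y
```

As $\lambda \to 0$ the power form converges to the logarithm, which is why the log is treated as the $\lambda = 0$ member of the family.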

A practical pitfall is over-differencing. Differencing a stationary series introduces a unit root in the moving-average component of the resulting model, hurting forecast accuracy. The Box-Jenkins recommendation is to difference only as many times as the data require, typically determined by a unit-root test combined with inspection of the autocorrelation function. If the ACF of the differenced series shows a single large negative spike at lag 1 (around $-0.5$ or below), that is a classic sign of over-differencing.
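
The lag-1 signature of over-differencing can be seen by differencing a series that is already stationary. In this sketch (simulated white noise, illustrative helper name), the differenced series is $\epsilon_t - \epsilon_{t-1}$, whose theoretical lag-1 autocorrelation is $-0.5$.

```python
import numpy as np

def lag1_autocorr(x):
    """Sample autocorrelation at lag 1."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    return np.dot(x[:-1], x[1:]) / np.dot(x, x)

rng = np.random.default_rng(5)
noise = rng.normal(size=5000)            # already stationary
over_differenced = np.diff(noise)        # e_t - e_{t-1}: MA(1) with a unit root

r_noise = lag1_autocorr(noise)
r_diff = lag1_autocorr(over_differenced)
```

The original series shows essentially no lag-1 correlation, while the unnecessarily differenced one shows a strong negative spike.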

Use in ARIMA and SARIMA

The autoregressive integrated moving average (ARIMA) model, popularized by George Box and Gwilym Jenkins in the 1970s, is built directly on the assumption of stationarity. An ARIMA(p, d, q) model is an ARMA(p, q) model fit to the $d$-th difference of the series, where:

$p$ is the number of autoregressive lags,
$d$ is the number of differencing operations applied to achieve stationarity,
$q$ is the number of moving-average lags.

The "I" (integrated) component exists precisely to handle nonstationarity by differencing. If the original series is already stationary, $d = 0$ and the model reduces to an ARMA model. If the series has a single unit root, $d = 1$ and the model fits ARMA structure to the first differences. The Box-Jenkins methodology lays out a three-stage iterative procedure of model identification, parameter estimation, and diagnostic checking; the identification stage is largely concerned with determining the appropriate $d$ and inspecting the autocorrelation and partial autocorrelation functions of the differenced series to choose $p$ and $q$.[^box_jenkins_wiki]
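
The role of $d$ can be illustrated by simulating an ARIMA(1, 1, 0) process and then fitting an AR(1) to its first difference. This is a simplified sketch: the parameter values are arbitrary, and the AR coefficient is estimated by least squares without an intercept rather than by the full Box-Jenkins procedure.

```python
import numpy as np

rng = np.random.default_rng(6)
n, phi = 3000, 0.6
eps = rng.normal(size=n)

# Simulate ARIMA(1, 1, 0): the first difference follows a stationary AR(1).
dx = np.zeros(n)
for t in range(1, n):
    dx[t] = phi * dx[t - 1] + eps[t]
x = np.cumsum(dx)                 # integrate once to get the I(1) level

# With d = 1, difference once and fit the AR(1) coefficient by least squares.
w = np.diff(x)
phi_hat = np.dot(w[:-1], w[1:]) / np.dot(w[:-1], w[:-1])

# One-step forecast of the level: forecast the difference, then undo it.
next_level = x[-1] + phi_hat * w[-1]
```

The estimate `phi_hat` should be close to the true coefficient, and the forecast of the level is obtained by adding the forecast difference back to the last observed value.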

The seasonal ARIMA (SARIMA) model extends ARIMA to handle seasonality. A SARIMA$(p, d, q)(P, D, Q)_s$ model includes seasonal autoregressive and moving-average terms with period $s$ in addition to the non-seasonal components, and applies seasonal differencing $D$ times. When seasonal unit roots are suspected, the HEGY test (Hylleberg, Engle, Granger, Yoo, 1990) tests for unit roots at the zero frequency and at each seasonal frequency separately, allowing the analyst to determine whether ordinary differencing, seasonal differencing, or both are required.[^hegy]
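
Seasonal differencing with $D = 1$ can be sketched on a simulated seasonal random walk, $y_t = y_{t-s} + \epsilon_t$. The period $s = 12$, the sample length, and the helper name are illustrative assumptions.

```python
import numpy as np

def difference(y, lag=1):
    """Return y_t - y_{t-lag}; lag=1 is ordinary, lag=s is seasonal differencing."""
    y = np.asarray(y, dtype=float)
    return y[lag:] - y[:-lag]

rng = np.random.default_rng(7)
s, n = 12, 240
eps = rng.normal(size=n)

# Seasonal random walk: each month is last year's value plus a shock.
y = np.zeros(n)
y[:s] = eps[:s]
for t in range(s, n):
    y[t] = y[t - s] + eps[t]

seasonally_differenced = difference(y, lag=s)   # recovers the shocks
```

When a series has both an ordinary and a seasonal unit root, the two operators can be chained, for example `difference(difference(y, lag=s), lag=1)`.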

For multivariate settings, vector autoregression (VAR) and vector error correction models (VECM) extend the same ideas. A VAR fit to a vector of stationary series produces consistent estimates; if the component series are nonstationary but cointegrated, a VECM is the appropriate specification because it explicitly models the long-run equilibrium relationship.

Cointegration

Cointegration, introduced by Robert Engle and Clive Granger in their 1987 Econometrica paper, addresses a problem that motivated much of the unit-root literature. Two or more nonstationary series can drift through their domains without any apparent common pattern, yet maintain a stable long-run relationship. Formally, a vector $(y_1, y_2, \dots, y_k)$ of $I(1)$ series is cointegrated if there exists a nonzero linear combination $\beta_1 y_{1,t} + \beta_2 y_{2,t} + \dots + \beta_k y_{k,t}$ that is $I(0)$, that is, stationary. The vector $\boldsymbol{\beta}$ is called a cointegrating vector. Economic examples include consumption and income, short and long interest rates, and prices of substitutable goods, all of which can wander individually but do so in ways that keep the spread bounded.[^engle_granger][^nobel2003]

The Engle-Granger two-step procedure tests for cointegration by (1) running a static regression of one series on the others and (2) testing the residuals for a unit root using an ADF-type test with adjusted critical values. Johansen's likelihood-based procedure (1988, 1991) extends the analysis to multiple cointegrating vectors and is now standard for systems of more than two variables.
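
The two steps of the Engle-Granger procedure can be sketched with least squares alone. The simulated series, the helper names, and the use of a Dickey-Fuller regression without lagged differences are simplifying assumptions. Because the residuals come from an estimated regression, the statistic must be compared with Engle-Granger critical values (roughly $-3.34$ at the 5% level for two variables with a constant), not the ordinary Dickey-Fuller ones.

```python
import numpy as np

def ols(X, y):
    """Return OLS coefficients and residuals."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, y - X @ beta

def df_t_stat(u):
    """Dickey-Fuller t-statistic: regress Delta u_t on u_{t-1} (no constant, no lags)."""
    du, lag = np.diff(u), u[:-1]
    gamma = np.dot(lag, du) / np.dot(lag, lag)
    resid = du - gamma * lag
    se = np.sqrt(resid @ resid / (len(du) - 1) / np.dot(lag, lag))
    return gamma / se

rng = np.random.default_rng(8)
n = 1000
common_trend = np.cumsum(rng.normal(size=n))     # shared I(1) component
y1 = common_trend + rng.normal(size=n)
y2 = 2.0 * common_trend + rng.normal(size=n)      # cointegrated with y1
y3 = np.cumsum(rng.normal(size=n))                # unrelated random walk

# Step 1: static regression on y1. Step 2: unit-root test on the residuals.
X = np.column_stack([np.ones(n), y1])
beta_coint, resid_coint = ols(X, y2)
_, resid_unrelated = ols(X, y3)
t_coint = df_t_stat(resid_coint)
t_unrelated = df_t_stat(resid_unrelated)
```

For the cointegrated pair the residuals are stationary and the statistic is strongly negative; for the unrelated pair the residuals inherit a unit root and the statistic is typically much closer to zero.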

Cointegration matters because it justifies running regressions on nonstationary data without falling into the spurious regression trap, and because it implies the existence of an error correction mechanism: deviations from the long-run equilibrium produce predictable adjustments in the short run. The Granger representation theorem formally connects the existence of a cointegrating relationship to the existence of an error correction representation. Engle and Granger were jointly awarded the 2003 Sveriges Riksbank Prize in Economic Sciences in Memory of Alfred Nobel for their contributions, with Granger cited specifically for cointegration and Engle for autoregressive conditional heteroskedasticity (ARCH).[^nobel2003]

Stationarity in machine learning

Classical machine learning theory rests on the assumption that data are drawn independently from a fixed distribution, the so-called i.i.d. assumption. This assumption is closely related to (and stronger than) stationarity: i.i.d. requires both stationarity and independence, while stationarity alone permits temporal dependence as long as the joint distribution is shift-invariant. Generalization bounds in statistical learning theory, including the VC bounds and PAC learning results, derive their finite-sample guarantees from i.i.d. sampling. When this assumption fails, the bounds no longer apply directly.[^iid]

Distribution shift

When the joint distribution $P(X, Y)$ at deployment differs from the training distribution, the model is operating under distribution shift. Several special cases are recognized:

Type	What changes	What stays the same	Example
Covariate shift	$P(X)$	$P(Y \mid X)$	New geographic region; same disease etiology
Label shift (prior shift)	$P(Y)$	$P(X \mid Y)$	Disease prevalence changes; symptom distribution per disease unchanged
Concept drift	$P(Y \mid X)$	$P(X)$	Spam patterns evolve; email content distribution stable
Joint shift	$P(X, Y)$	nothing	Domain change with both inputs and labels affected

Covariate shift is sometimes correctable through importance weighting if the shift is known. Label shift can be corrected through methods like BBSE (black box shift estimation). Concept drift typically requires online learning, model retraining, or explicit drift detection systems. Detecting drift is its own subfield, with methods ranging from monitoring statistical tests on input features to comparing model prediction distributions over time.[^d2l_shift][^huyen]
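
Importance weighting under covariate shift can be sketched as follows. The example assumes, purely for illustration, that both input densities are known Gaussians and that a simple function of the input stands in for the per-example loss; in practice the density ratio usually has to be estimated.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

rng = np.random.default_rng(9)
x_train = rng.normal(loc=0.0, scale=1.0, size=20000)   # training inputs ~ N(0, 1)

# Hypothetical deployment input distribution, assumed known: N(0.5, 1).
weights = gaussian_pdf(x_train, 0.5, 1.0) / gaussian_pdf(x_train, 0.0, 1.0)

def loss(x):
    # Stand-in for a per-example loss; P(Y | X) is assumed unchanged.
    return (x - 1.0) ** 2

train_risk = loss(x_train).mean()
# Self-normalized importance-weighted estimate of the deployment risk,
# computed using training samples only.
weighted_risk = np.sum(weights * loss(x_train)) / np.sum(weights)
```

In this setup the expected loss is 2.0 under the training distribution and 1.25 under the deployment distribution, and the weighted estimate should land near the latter. Weights with heavy tails make the estimate unstable, which is one reason the correction works best for mild shifts.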

Nonstationarity in reinforcement learning

Reinforcement learning typically formulates problems as Markov decision processes (MDPs). A standard MDP assumes stationary transition probabilities and reward functions, meaning the dynamics do not change over time. Under this assumption a finite discounted MDP has a stationary optimal policy; value iteration and policy iteration converge to it, and so does Q-learning under standard step-size and exploration conditions. Non-stationary MDPs relax this assumption to allow transitions and rewards to drift over time, which arises naturally in multi-agent settings (where other agents are learning) and in real-world deployments (where the environment evolves). Algorithms for non-stationary MDPs include sliding-window methods, change-point detection, restart policies, and meta-learning approaches that explicitly model how the environment changes.[^non_stationary_mdp]
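
The effect of nonstationary rewards on a value estimate can be seen in a one-armed example. The sketch below compares a plain running average, a constant step-size update (exponential recency weighting, a simple relative of the sliding-window methods mentioned above), and a sliding-window mean. The reward stream, step size, and window length are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(10)
n_steps = 2000
# Hypothetical reward stream whose mean jumps from 0 to 5 halfway through.
true_mean = np.where(np.arange(n_steps) < n_steps // 2, 0.0, 5.0)
rewards = true_mean + rng.normal(size=n_steps)

sample_avg = 0.0      # weights all past rewards equally
tracking = 0.0        # constant step size: recent rewards count more
alpha = 0.1
for k, r in enumerate(rewards, start=1):
    sample_avg += (r - sample_avg) / k
    tracking += alpha * (r - tracking)

window_mean = rewards[-100:].mean()   # sliding window over the last 100 rewards
```

After the change the running average remains stuck between the old and new means, while the two recency-weighted estimates follow the new mean.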

Time series forecasting with deep learning

Deep learning models for forecasting (LSTMs, Transformers, N-BEATS, TFT, TimeGPT, Chronos) do not strictly require stationarity in the same way ARIMA does, because they are flexible enough to capture trends and seasonality directly. In practice, however, preprocessing steps such as differencing, log transformation, or removing the linear trend are still common because they reduce the dynamic range of the input, stabilize gradients, and make patterns more learnable. Recent work on foundation models for time series often benchmarks against simple stationarity-based baselines, and these baselines remain surprisingly competitive on many real-world datasets.

Practical workflow

A typical workflow for assessing and handling stationarity in an applied problem:

1. Plot the series. A visual inspection of the time series, ACF, and PACF often reveals trends, seasonality, and changing variance more quickly than any test.
2. Stabilize the variance. If the spread of the series increases with the level, apply a log or Box-Cox transformation first. Variance changes can mask or mimic mean changes.
3. Test for stationarity. Run the ADF test (null: unit root) and KPSS test (null: stationary). Reconcile the results.
4. Difference if needed. If the tests indicate a unit root, take first differences and retest. Repeat at most once or twice; if more is needed, reconsider whether the series is genuinely $I(d)$ for $d > 2$ or whether there is a structural break.
5. Handle seasonality. If a strong seasonal pattern is present, apply seasonal differencing or use a SARIMA model. The HEGY test can guide whether seasonal differencing is appropriate.
6. Check residuals. After fitting, residuals from a correctly specified model should be approximately white noise. The Ljung-Box test on residuals is a standard diagnostic.
7. Watch for structural breaks. A series that fails stationarity tests because of a one-time level shift is qualitatively different from one with a unit root. Tests like Zivot-Andrews or Bai-Perron explicitly model breaks.
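
The residual check in the workflow relies on the Ljung-Box statistic $Q = n(n+2)\sum_{k=1}^{h} \hat{\rho}_k^2/(n-k)$. A minimal sketch follows, with simulated inputs and an illustrative function name. The statistic is compared with a chi-square quantile with $h$ degrees of freedom, reduced by the number of fitted ARMA parameters when applied to model residuals.

```python
import numpy as np

def ljung_box_q(resid, h):
    """Ljung-Box statistic over lags 1..h."""
    x = np.asarray(resid, dtype=float)
    x = x - x.mean()
    n = x.size
    denom = np.dot(x, x)
    rho = np.array([np.dot(x[:-k], x[k:]) / denom for k in range(1, h + 1)])
    return n * (n + 2) * np.sum(rho**2 / (n - np.arange(1, h + 1)))

CHI2_95_DF10 = 18.307   # 95% quantile of chi-square with 10 degrees of freedom

rng = np.random.default_rng(13)
n = 1000
white = rng.normal(size=n)                # what good residuals should look like
correlated = np.zeros(n)
for t in range(1, n):
    correlated[t] = 0.8 * correlated[t - 1] + white[t]   # leftover AR(1) structure

q_white = ljung_box_q(white, h=10)
q_correlated = ljung_box_q(correlated, h=10)
```

A value of $Q$ far above the critical value, as for the autocorrelated series here, indicates that the fitted model has left structure in the residuals.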

Common misconceptions

A few clarifications come up often when teaching this material.

Stationarity is not the absence of pattern. A stationary series can have rich autocorrelation structure (think of an AR(1) with $\phi = 0.95$). What matters is that the structure does not change over time.
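
To make the AR(1) example concrete, assume zero-mean white-noise shocks with variance $\sigma^2$. Then

$$X_t = \phi X_{t-1} + \epsilon_t,\quad |\phi| < 1 \;\Longrightarrow\; E[X_t] = 0,\quad \gamma(0) = \frac{\sigma^2}{1 - \phi^2},\quad \rho(h) = \phi^{|h|}.$$

For $\phi = 0.95$ this gives $\rho(1) = 0.95$ and $\rho(10) \approx 0.60$: the dependence is strong and decays slowly, yet none of these quantities depends on $t$.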

Stationarity is not the same as being stochastic. A constant function is stationary; a deterministic linear trend is not.

Failing to reject the unit-root null does not prove a unit root. Like any hypothesis test, ADF can fail to reject when the true process is stationary but close to the unit root. This is why complementary tests (KPSS, DF-GLS) are used, and why visual diagnostics matter.

Differencing is not detrending. Differencing is appropriate for stochastic trends (unit roots); detrending is appropriate for deterministic trends. Using the wrong one introduces residual structure the model cannot remove.

Stationarity is not a panacea for ML deployment. Even if a model is trained on stationary data and the underlying data-generating process is stationary, distribution shift can still arise from sample selection, measurement changes, or interventions on the system being modeled.

References
