Stationarity
Last reviewed
Apr 30, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v4 · 3,928 words
See also: Time series analysis, ARIMA, Machine learning terms
Stationarity is a property of a stochastic process whose statistical characteristics, such as the mean, variance, and autocorrelation, do not change when shifted in time. A time series drawn from a stationary process looks statistically the same regardless of when you start observing it. The concept is foundational to classical time series analysis, to econometrics, to signal processing, and to many results in machine learning, where the closely related independent and identically distributed (i.i.d.) assumption underpins generalization theory. Without stationarity (or some controlled departure from it), most standard estimators are inconsistent, standard significance tests have nonstandard distributions, and forecasts can be wildly off because the past no longer informs the future in a stable way.
The notion was developed primarily in the first half of the twentieth century by mathematicians and statisticians working on signal processing, communication theory, and economics, including Andrey Kolmogorov, Norbert Wiener, Aleksandr Khinchin, and later George Box and Gwilym Jenkins. James D. Hamilton's 1994 textbook Time Series Analysis remains the standard graduate reference and devotes substantial space to the formal treatment, the econometric tests, and the consequences of nonstationarity. Robert Engle and Clive Granger received the 2003 Sveriges Riksbank Prize in Economic Sciences in Memory of Alfred Nobel in part for showing how to model long-run relationships among nonstationary series through cointegration, an idea that grew directly out of the realization that running ordinary regressions on nonstationary data could yield spurious results.[^nobel2003]
Let $\{X_t\}_{t \in T}$ denote a stochastic process indexed by time. Several distinct but related notions of stationarity are used in practice. The right one to invoke depends on what you actually need: full distributional invariance, invariance of the first two moments, or merely invariance of the mean.
A process is strictly stationary (also called strongly stationary) if its joint probability distribution is invariant under arbitrary time shifts. Formally, for every finite collection of time points $t_1, t_2, \dots, t_n$ and every shift $\tau$,
$$F_{X}(x_{t_1}, x_{t_2}, \dots, x_{t_n}) = F_{X}(x_{t_1 + \tau}, x_{t_2 + \tau}, \dots, x_{t_n + \tau}).$$
In words, the entire joint distribution of any finite subset of observations depends only on the relative spacing of the time points, not on where in time you happen to be looking. Strict stationarity is a strong requirement and is usually difficult to verify directly from data because it concerns the full distribution rather than summary moments.[^wiki_stationary]
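A process is weakly stationary, also called wide-sense stationary (WSS) or covariance stationary, if the following three conditions hold: (1) the mean is constant, $E[X_t] = \mu$ for all $t$; (2) the variance is finite, $E[X_t^2] < \infty$ for all $t$; and (3) the autocovariance depends only on the lag, $\mathrm{Cov}(X_t, X_{t+h}) = \gamma(h)$ for all $t$ and $h$.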
The function $\gamma(h)$ is the autocovariance function of the process. Dividing by the variance gives the autocorrelation function $\rho(h) = \gamma(h) / \gamma(0)$. Weak stationarity is the workhorse definition because most useful estimators (sample means, sample autocovariances) and most standard asymptotic results require only the first two moments to be well behaved.[^statlect_cov]
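To make $\gamma(h)$ and $\rho(h)$ concrete, the following is a minimal sketch of the sample autocorrelation function using NumPy. The helper name `sample_acf` and the AR(1) parameters are illustrative assumptions, not part of any particular library.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation rho_hat(h) = gamma_hat(h) / gamma_hat(0), h = 0..max_lag."""
    x = np.asarray(x, dtype=float)
    n = x.size
    xc = x - x.mean()
    # Divide-by-n autocovariance estimator (the common convention).
    gamma = np.array([np.dot(xc[: n - h], xc[h:]) / n for h in range(max_lag + 1)])
    return gamma / gamma[0]

# Illustrative AR(1) process X_t = phi * X_{t-1} + eps_t (parameters are assumptions).
rng = np.random.default_rng(0)
phi, n = 0.7, 5000
eps = rng.standard_normal(n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + eps[t]

acf = sample_acf(x, max_lag=5)
print(np.round(acf, 2))  # theory for a stationary AR(1): rho(h) = phi ** h
```

Because the estimator relies only on first and second moments, it is meaningful exactly when the weak stationarity conditions hold.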
The relationship between the two definitions is asymmetric. A strictly stationary process with finite second moments is automatically weakly stationary, but a weakly stationary process need not be strictly stationary because the higher moments could still drift over time. The two notions coincide for Gaussian processes, since a Gaussian distribution is fully characterized by its first two moments.
A process is trend stationary if it can be written as the sum of a deterministic trend $f(t)$ and a stationary component $\epsilon_t$:
$$X_t = f(t) + \epsilon_t.$$
Subtracting the trend (typically a polynomial in $t$, often linear) yields a stationary residual series. Trend stationary processes revert to the deterministic trend in the long run, meaning shocks have only a transitory effect.[^matlab_trend]
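As a small illustration of detrending, the sketch below simulates a series with a linear trend plus stationary AR(1) noise, estimates the trend by least squares with `np.polyfit`, and subtracts it. The trend coefficients and noise parameters are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400
t = np.arange(n)

# Assumed trend-stationary process: linear trend plus stationary AR(1) noise.
noise = np.zeros(n)
shocks = rng.standard_normal(n)
for i in range(1, n):
    noise[i] = 0.5 * noise[i - 1] + shocks[i]
x = 2.0 + 0.3 * t + noise

# Estimate f(t) = intercept + slope * t by least squares and subtract it.
slope, intercept = np.polyfit(t, x, deg=1)
residual = x - (intercept + slope * t)

print(round(slope, 3), round(residual.std(), 3))
```

The residual series plays the role of $\epsilon_t$; its spread stays bounded over time, unlike the level $X_t$.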
A process is difference stationary if its first difference $\Delta X_t = X_t - X_{t-1}$ (or some higher-order difference) is stationary. Such processes contain a unit root in the autoregressive polynomial. A series that becomes stationary after $d$ differencing operations is said to be integrated of order d, written $I(d)$. The simplest example of an $I(1)$ process is the random walk $X_t = X_{t-1} + \epsilon_t$, where $\epsilon_t$ is white noise; the level $X_t$ is nonstationary, but $\Delta X_t = \epsilon_t$ is stationary.[^matlab_trend][^wiki_unit_root]
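The random-walk example can be checked numerically. With $X_0 = 0$ and shock variance $\sigma^2$, $\mathrm{Var}(X_t) = t\sigma^2$, which grows with $t$. The sketch below, with simulation sizes chosen arbitrarily, estimates that variance across many simulated paths and confirms that first differencing recovers the white-noise shocks.

```python
import numpy as np

rng = np.random.default_rng(2)
n_paths, n_steps = 2000, 400
eps = rng.standard_normal((n_paths, n_steps))   # white noise with variance 1
x = np.cumsum(eps, axis=1)                      # X_t = X_{t-1} + eps_t along each row

# Cross-path variance at two dates: theory says Var(X_t) = t.
var_early = x[:, 99].var()    # t = 100
var_late = x[:, 399].var()    # t = 400

# The first difference of the level is the original shock sequence.
dx = np.diff(x, axis=1)
recovers_noise = np.allclose(dx, eps[:, 1:])

print(round(var_early, 1), round(var_late, 1), recovers_noise)
```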
The distinction between trend stationary and difference stationary processes matters because the appropriate transformation differs: detrending is correct for the former, while differencing is correct for the latter. Applying the wrong transformation introduces structure the model cannot remove.
A process is cyclostationary if its statistical properties vary periodically with time, so the process is stationary up to a known periodic component. Other generalizations include local stationarity (slowly varying parameters) and nth-order stationarity (invariance of moments up to order $n$). These extensions are used in signal processing, climate science, and any setting where the i.i.d.-or-nothing dichotomy is too coarse.
Without stationarity, most of the statistical machinery for time series analysis breaks down. The reasons span estimation, inference, and prediction.
Estimation. The sample mean $\bar{X}_n = \frac{1}{n} \sum X_t$ is a natural estimator of the population mean, but it converges to a meaningful population quantity only if the population mean exists and is constant. If the mean drifts, $\bar{X}_n$ chases a moving target.
Inference. Standard t-statistics and F-statistics rely on asymptotic normality results that assume stationarity (or related mixing conditions). When you regress one nonstationary series on another, the t-statistic can diverge even when the variables are unrelated, producing what Granger and Newbold called spurious regression. Regressions of independent random walks on each other tend to show extreme $R^2$ values and highly significant coefficients despite the absence of any genuine relationship.[^granger_newbold]
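A small Monte Carlo experiment illustrates the spurious regression effect. The sketch below regresses one independent random walk on another, counts how often the slope looks significant at the nominal 5% level, and compares that with the same regression on independent white noise. Sample size, number of replications, and the helper name `slope_t_stat` are assumptions for illustration.

```python
import numpy as np

def slope_t_stat(y, x):
    """t-statistic of the slope in an OLS regression of y on a constant and x."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - 2)
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta[1] / np.sqrt(cov[1, 1])

rng = np.random.default_rng(3)
n, n_sims = 200, 500
reject_walks = reject_noise = 0
for _ in range(n_sims):
    e1, e2 = rng.standard_normal(n), rng.standard_normal(n)
    # Independent random walks: there is no true relationship.
    reject_walks += abs(slope_t_stat(np.cumsum(e1), np.cumsum(e2))) > 1.96
    # Independent white noise: the stationary benchmark.
    reject_noise += abs(slope_t_stat(e1, e2)) > 1.96

print(reject_walks / n_sims, reject_noise / n_sims)
```

For the white-noise pairs the rejection rate should sit near the nominal 5%, while for the random walks it is far higher, even though both pairs are independent by construction.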
Forecasting. ARIMA and SARIMA models assume an underlying stationary structure after appropriate differencing, and exponential smoothing and most state-space models likewise assume that the model's structure and parameters stay fixed over time. If the structure changes during the forecast horizon, predictions degrade. The validity of confidence intervals and prediction intervals depends on that stability holding in the period being forecast.
Machine learning generalization. The standard supervised learning setup assumes training and test data are drawn i.i.d. from the same distribution. This is a stronger condition than stationarity (i.i.d. requires independence as well), but it shares the core idea that the statistical environment does not change between when you fit the model and when you deploy it. When the deployment distribution differs, the model is operating under distribution shift or concept drift, and its accuracy is no longer guaranteed.[^d2l_shift]
A range of formal hypothesis tests have been developed to assess stationarity in observed series. The tests differ in their null hypothesis (stationarity vs. unit root), in how they handle serial correlation, and in their power against various alternatives. It is common practice to apply at least two tests with opposing nulls and reconcile the results.
| Test | Year | Null hypothesis | Alternative | Approach to serial correlation | Notes |
|---|---|---|---|---|---|
| Dickey-Fuller (DF) | 1979 | Unit root (nonstationary) | Stationary | None (assumes white-noise errors) | Original test; superseded by ADF |
| Augmented Dickey-Fuller (ADF) | 1981 | Unit root | Stationary | Parametric: includes lagged differences | Most widely used unit-root test |
| Phillips-Perron (PP) | 1988 | Unit root | Stationary | Nonparametric Newey-West correction | Robust to heteroskedasticity; weaker finite-sample power than ADF |
| KPSS (Kwiatkowski-Phillips-Schmidt-Shin) | 1992 | Stationary (or trend stationary) | Unit root | Nonparametric long-run variance | Reverses the null; useful as confirmatory test |
| Elliott-Rothenberg-Stock (DF-GLS) | 1996 | Unit root | Stationary | GLS detrending before ADF regression | Higher power than ADF near the unit root |
| Zivot-Andrews | 1992 | Unit root (no structural break) | Trend stationary with one structural break | Parametric, as in ADF; break date estimated endogenously | Useful when structural breaks are suspected |
| HEGY | 1990 | Seasonal unit roots | Seasonal stationarity | Tests roots at seasonal frequencies | Used for seasonal ARIMA model identification |
| Ljung-Box | 1978 | No autocorrelation in residuals | Autocorrelation present | n/a | Diagnostic check on residuals after model fitting |
The ADF and KPSS tests are often used together. Because their nulls are opposite, the combination produces four possible outcomes. If ADF rejects and KPSS does not, both tests point to a stationary series. If ADF does not reject and KPSS does, both point to a unit root. The remaining two outcomes are conflicting: when both reject or neither rejects, the tests disagree and further analysis is needed. One common reading of the conflicting cases treats both tests rejecting as a sign of difference stationarity and neither rejecting as a sign of trend stationarity, to be checked by differencing or detrending and then retesting.[^statsmodels_adfkpss]
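The cross-check can be written as a small decision helper. This is only a sketch: the function name `combine_adf_kpss`, the labels it returns, and the 5% threshold are assumptions, and the p-values are expected to come from whatever ADF and KPSS implementations are in use.

```python
def combine_adf_kpss(adf_pvalue, kpss_pvalue, alpha=0.05):
    """Cross-check a unit-root test (null: unit root) against KPSS (null: stationary)."""
    adf_rejects = adf_pvalue < alpha      # evidence against a unit root
    kpss_rejects = kpss_pvalue < alpha    # evidence against stationarity
    if adf_rejects and not kpss_rejects:
        return "stationary"
    if kpss_rejects and not adf_rejects:
        return "unit root"
    # Both reject or neither rejects: the tests disagree.
    return "conflicting"

print(combine_adf_kpss(0.01, 0.30))  # ADF rejects, KPSS does not
print(combine_adf_kpss(0.40, 0.01))  # KPSS rejects, ADF does not
```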
In its general form, the ADF test is based on the regression
$$\Delta y_t = \alpha + \beta t + \gamma y_{t-1} + \sum_{i=1}^{p} \delta_i \Delta y_{t-i} + \epsilon_t,$$
and tests $H_0: \gamma = 0$ (unit root) against $H_1: \gamma < 0$ (stationary). The lag length $p$ is typically chosen by an information criterion (AIC, BIC) or a sequential testing procedure. The asymptotic distribution of the t-statistic on $\gamma$ under the null is nonstandard (the Dickey-Fuller distribution), so critical values are taken from tabulated simulations rather than from the standard normal.[^wiki_adf]
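The test regression can be estimated directly by ordinary least squares. The sketch below is a hand-rolled illustration with NumPy, not a replacement for a library routine: the function name `adf_t_stat` is hypothetical, the lag length is fixed rather than selected by an information criterion, and the comparison value of about $-2.86$ is the approximate asymptotic 5% Dickey-Fuller critical value for the constant-only case (about $-3.41$ applies when the trend term is included).

```python
import numpy as np

def adf_t_stat(y, p=1, trend=False):
    """t-statistic on gamma in
    dy_t = alpha (+ beta*t) + gamma*y_{t-1} + sum_i delta_i*dy_{t-i} + e_t."""
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)                        # dy[k] = y[k+1] - y[k]
    rows = np.arange(p, dy.size)           # observations left after losing p lags
    cols = [np.ones(rows.size), y[rows]]   # constant and lagged level y_{t-1}
    if trend:
        cols.append(rows.astype(float))    # deterministic time trend
    cols += [dy[rows - i] for i in range(1, p + 1)]   # lagged differences
    X = np.column_stack(cols)
    target = dy[rows]
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ beta
    sigma2 = resid @ resid / (rows.size - X.shape[1])
    se_gamma = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
    return beta[1] / se_gamma

rng = np.random.default_rng(4)
eps = rng.standard_normal(500)
random_walk = np.cumsum(eps)               # unit root: gamma = 0
ar1 = np.zeros(500)
for t in range(1, 500):
    ar1[t] = 0.5 * ar1[t - 1] + eps[t]     # stationary: gamma = -0.5

t_walk, t_ar1 = adf_t_stat(random_walk), adf_t_stat(ar1)
print(round(t_walk, 2), round(t_ar1, 2))   # compare each with about -2.86
```

The statistic for the stationary AR(1) series should fall far below the critical value, while the random walk's statistic typically does not.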
If a series is not stationary, several transformations can often render it (approximately) stationary. The right choice depends on the source of the nonstationarity: changing mean, changing variance, deterministic trend, stochastic trend, seasonality, or some combination.
| Method | What it addresses | Formula | When to use | Caveats |
|---|---|---|---|---|
| First differencing | Stochastic trend (unit root) | $y'_t = y_t - y_{t-1}$ | $I(1)$ series; ADF fails to reject | Over-differencing introduces unit root in MA part |
| Second differencing | Stronger stochastic trend | $y''_t = y'_t - y'_{t-1}$ | $I(2)$ series; rare in economic data | Most series are at most $I(2)$; further differencing usually unnecessary |
| Seasonal differencing | Seasonal nonstationarity | $y^{(s)}_t = y_t - y_{t-s}$ | Strong seasonal pattern of period $s$ | Combine with first differencing if needed |
| Detrending | Deterministic trend | $y'_t = y_t - \hat{f}(t)$ | Trend stationary series | Wrong if true process has unit root |
| Log transformation | Variance proportional to level | $y'_t = \log(y_t)$ | Exponential growth; multiplicative noise | Requires positive values |
| Box-Cox transformation | Heteroskedasticity (more general) | $y'_t = (y_t^\lambda - 1)/\lambda$ for $\lambda \neq 0$, $\log y_t$ for $\lambda = 0$ | Variance changes with level; $\lambda$ chosen to maximize likelihood | Requires positive values; back-transformation introduces forecast bias |
| Square root | Mild heteroskedasticity | $y'_t = \sqrt{y_t}$ | Count data, Poisson-like variance | Special case of Box-Cox with $\lambda = 0.5$ |
| Seasonal decomposition (STL, X-13) | Trend plus seasonality | $y_t = T_t + S_t + R_t$, model $R_t$ | Series with both trend and seasonality | Adds modeling assumptions |
| Fractional differencing | Long memory ($I(d)$, non-integer $d$) | $(1-L)^d y_t$ | ARFIMA processes, hydrology, finance | Computationally heavier than integer differencing |
The Box-Cox transformation, introduced by George Box and David Cox in 1964, is a parametric family of power transformations indexed by $\lambda$ that includes the log transformation ($\lambda = 0$), the square root ($\lambda = 0.5$), the identity ($\lambda = 1$), and the inverse ($\lambda = -1$). The optimal $\lambda$ is typically chosen by maximum likelihood, often visualized via a profile likelihood plot. Box-Cox stabilizes variance and brings the marginal distribution closer to Gaussian, which helps when the model assumes normally distributed errors (as ARIMA does under maximum likelihood estimation).[^box_cox][^otexts_transforms]
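The transformation and its inverse are short enough to write out. The sketch below follows the formula in the table for a fixed $\lambda$; the function names are illustrative, and it does not estimate $\lambda$ (SciPy's `scipy.stats.boxcox` can do that by maximum likelihood).

```python
import numpy as np

def box_cox(y, lam):
    """Box-Cox transform: log(y) for lam == 0, else (y**lam - 1) / lam."""
    y = np.asarray(y, dtype=float)
    if np.any(y <= 0):
        raise ValueError("Box-Cox requires strictly positive values")
    return np.log(y) if lam == 0 else (y ** lam - 1.0) / lam

def inv_box_cox(z, lam):
    """Inverse transform, mapping back to the original scale."""
    z = np.asarray(z, dtype=float)
    return np.exp(z) if lam == 0 else (lam * z + 1.0) ** (1.0 / lam)

y = np.array([1.0, 2.0, 4.0, 8.0])
for lam in (0, 0.5, 1):
    print(lam, np.round(box_cox(y, lam), 3))
```

Note that $\lambda = 1$ gives $y_t - 1$, a shift of the identity, and that values of $\lambda$ close to zero approach the log transform.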
A practical pitfall is over-differencing. Differencing a stationary series introduces a unit root in the moving-average component of the resulting model, hurting forecast accuracy. The Box-Jenkins recommendation is to difference only as many times as the data require, typically determined by a unit-root test combined with inspection of the autocorrelation function. If the ACF of the differenced series shows a single large negative spike at lag 1, that is a classic sign of over-differencing.
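The negative spike is easy to reproduce. Differencing white noise $e_t$ gives $e_t - e_{t-1}$, an MA(1) process with coefficient $-1$, whose lag-1 autocorrelation is $-1/(1 + 1) = -0.5$. The sketch below checks this on simulated data; the sample size and helper name are assumptions.

```python
import numpy as np

def lag1_autocorr(x):
    """Sample autocorrelation at lag 1."""
    xc = np.asarray(x, dtype=float) - np.mean(x)
    return float(xc[:-1] @ xc[1:] / (xc @ xc))

rng = np.random.default_rng(5)
white_noise = rng.standard_normal(5000)    # already stationary, no differencing needed
over_differenced = np.diff(white_noise)    # e_t - e_{t-1}

r_before = lag1_autocorr(white_noise)
r_after = lag1_autocorr(over_differenced)
print(round(r_before, 2), round(r_after, 2))  # theory: 0 and -0.5
```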
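The autoregressive integrated moving average (ARIMA) model, popularized by George Box and Gwilym Jenkins in the 1970s, is built directly on the assumption of stationarity. An ARIMA(p, d, q) model is an ARMA(p, q) model fit to the $d$-th difference of the series, where $p$ is the order of the autoregressive part, $d$ is the number of times the series is differenced, and $q$ is the order of the moving-average part.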
The "I" (integrated) component exists precisely to handle nonstationarity by differencing. If the original series is already stationary, $d = 0$ and the model reduces to an ARMA model. If the series has a single unit root, $d = 1$ and the model fits ARMA structure to the first differences. The Box-Jenkins methodology lays out a three-stage iterative procedure of model identification, parameter estimation, and diagnostic checking; the identification stage is largely concerned with determining the appropriate $d$ and inspecting the autocorrelation and partial autocorrelation functions of the differenced series to choose $p$ and $q$.[^box_jenkins_wiki]
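The role of the integrated component can be seen in a hand-built ARIMA(1, 1, 0) example: difference once, fit an AR(1) to the differences, forecast the differences, then cumulate back to levels. This is a sketch under simplifying assumptions (known $d = 1$, no intercept, least-squares estimation, simulated data), not the full Box-Jenkins procedure.

```python
import numpy as np

rng = np.random.default_rng(6)
n, phi_true = 2000, 0.6
eps = rng.standard_normal(n)

# Simulate ARIMA(1,1,0): the first difference w_t follows a stationary AR(1).
w = np.zeros(n)
for t in range(1, n):
    w[t] = phi_true * w[t - 1] + eps[t]
y = np.cumsum(w)                      # integrate once, so d = 1

# Identification (assumed known here): d = 1, so model the differenced series.
dy = np.diff(y)

# Estimation: least-squares AR(1) coefficient on the differences.
phi_hat = (dy[:-1] @ dy[1:]) / (dy[:-1] @ dy[:-1])

# Forecasting: predict future differences, then cumulate back to the level.
horizon = 5
diff_forecast = dy[-1] * phi_hat ** np.arange(1, horizon + 1)
level_forecast = y[-1] + np.cumsum(diff_forecast)

print(round(phi_hat, 3), np.round(level_forecast, 2))
```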
The seasonal ARIMA (SARIMA) model extends ARIMA to handle seasonality. A SARIMA$(p, d, q)(P, D, Q)_s$ model includes seasonal autoregressive and moving-average terms with period $s$ in addition to the non-seasonal components, and applies seasonal differencing $D$ times. When seasonal unit roots are suspected, the HEGY test (Hylleberg, Engle, Granger, Yoo, 1990) tests for unit roots at the zero frequency and at each seasonal frequency separately, allowing the analyst to determine whether ordinary differencing, seasonal differencing, or both are required.[^hegy]
For multivariate settings, vector autoregression (VAR) and vector error correction models (VECM) extend the same ideas. A VAR fit to a vector of stationary series produces consistent estimates; if the component series are nonstationary but cointegrated, a VECM is the appropriate specification because it explicitly models the long-run equilibrium relationship.
Cointegration, introduced by Robert Engle and Clive Granger in their 1987 Econometrica paper, addresses a problem that motivated much of the unit-root literature. Two or more nonstationary series can drift through their domains without any apparent common pattern, yet maintain a stable long-run relationship. Formally, a vector $(y_1, y_2, \dots, y_k)$ of $I(1)$ series is cointegrated if there exists a nonzero linear combination $\beta_1 y_{1,t} + \beta_2 y_{2,t} + \dots + \beta_k y_{k,t}$ that is $I(0)$, that is, stationary. The vector $\boldsymbol{\beta}$ is called a cointegrating vector. Economic examples include consumption and income, short and long interest rates, and prices of substitutable goods, all of which can wander individually but do so in ways that keep the spread bounded.[^engle_granger][^nobel2003]
The Engle-Granger two-step procedure tests for cointegration by (1) running a static regression of one series on the others and (2) testing the residuals for a unit root using an ADF-type test with adjusted critical values. Johansen's likelihood-based procedure (1988, 1991) extends the analysis to multiple cointegrating vectors and is now standard for systems of more than two variables.
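The two steps can be sketched on simulated data. In the example below, the data-generating process, the absence of lagged differences in step 2, and the comparison value of about $-3.34$ (an approximate asymptotic 5% critical value for two variables with a constant) are all assumptions for illustration; applied work would rely on a tested implementation and its tabulated critical values.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500
x = np.cumsum(rng.standard_normal(n))     # I(1) series
u = np.zeros(n)                            # stationary AR(1) equilibrium error
shocks = rng.standard_normal(n)
for t in range(1, n):
    u[t] = 0.5 * u[t - 1] + shocks[t]
y = 1.0 + 2.0 * x + u                      # cointegrated with x; y - 2x is I(0)

# Step 1: static regression y_t = a + b * x_t + residual.
X = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
a_hat, b_hat = coef
resid = y - X @ coef

# Step 2: Dickey-Fuller regression on the residuals,
# d(resid_t) = gamma * resid_{t-1} + e_t (no constant: residuals have mean zero).
lagged, d_resid = resid[:-1], np.diff(resid)
gamma_hat = (lagged @ d_resid) / (lagged @ lagged)
e = d_resid - gamma_hat * lagged
se = np.sqrt((e @ e) / (e.size - 1) / (lagged @ lagged))
t_stat = gamma_hat / se

EG_CRIT_5PCT = -3.34   # approximate; stricter than the ordinary ADF value
print(round(b_hat, 3), round(t_stat, 2), t_stat < EG_CRIT_5PCT)
```

The critical value is more negative than in the ordinary ADF test because the residuals come from a fitted regression, which makes them look more stationary than the true equilibrium errors.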
Cointegration matters because it justifies running regressions on nonstationary data without falling into the spurious regression trap, and because it implies the existence of an error correction mechanism: deviations from the long-run equilibrium produce predictable adjustments in the short run. The Granger representation theorem formally connects the existence of a cointegrating relationship to the existence of an error correction representation. Engle and Granger were jointly awarded the 2003 Sveriges Riksbank Prize in Economic Sciences in Memory of Alfred Nobel for their contributions, with Granger cited specifically for cointegration and Engle for autoregressive conditional heteroskedasticity (ARCH).[^nobel2003]
Classical machine learning theory rests on the assumption that data are drawn independently from a fixed distribution, the so-called i.i.d. assumption. This assumption is closely related to (and stronger than) stationarity: i.i.d. requires both stationarity and independence, while stationarity alone permits temporal dependence as long as the joint distribution is shift-invariant. Generalization bounds in statistical learning theory, including the VC bounds and PAC learning results, derive their finite-sample guarantees from i.i.d. sampling. When this assumption fails, the bounds no longer apply directly.[^iid]
When the joint distribution $P(X, Y)$ at deployment differs from the training distribution, the model is operating under distribution shift. Several special cases are recognized:
| Type | What changes | What stays the same | Example |
|---|---|---|---|
| Covariate shift | $P(X)$ | $P(Y \mid X)$ | New geographic region; same disease etiology |
| Label shift (prior shift) | $P(Y)$ | $P(X \mid Y)$ | Disease prevalence changes; symptom distribution per disease unchanged |
| Concept drift | $P(Y \mid X)$ | $P(X)$ | Spam patterns evolve; email content distribution stable |
| Joint shift | $P(X, Y)$ | nothing | Domain change with both inputs and labels affected |
Covariate shift is sometimes correctable through importance weighting, provided the density ratio between deployment and training inputs is known or can be estimated. Label shift can be corrected through methods like BBSE (black box shift estimation). Concept drift typically requires online learning, model retraining, or explicit drift detection systems. Detecting drift is its own subfield, with methods ranging from monitoring statistical tests on input features to comparing model prediction distributions over time.[^d2l_shift][^huyen]
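Importance weighting under covariate shift can be illustrated in a setting where both input densities are known exactly. In the sketch below, training inputs follow $N(0, 1)$ and deployment inputs follow $N(1, 1)$, so the density ratio is $w(x) = \exp(x - 0.5)$; the loss function is a hypothetical stand-in for a model's per-example loss. In practice the ratio is unknown and must be estimated.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 50_000
x_train = rng.normal(loc=0.0, scale=1.0, size=n)   # training inputs ~ N(0, 1)

def loss(x):
    # Hypothetical per-example loss that depends only on the input.
    return (x - 1.0) ** 2

# Density ratio p_test(x) / p_train(x) with p_test = N(1, 1), p_train = N(0, 1):
# exp(-(x - 1)^2 / 2) / exp(-x^2 / 2) = exp(x - 0.5).
w = np.exp(x_train - 0.5)

naive = loss(x_train).mean()                        # estimates E_train[loss] = 2
weighted = np.sum(w * loss(x_train)) / np.sum(w)    # estimates E_test[loss] = 1

print(round(naive, 2), round(weighted, 2))
```

The weighted average uses only training samples yet targets the deployment-time expectation, which is the idea behind reweighting the training loss.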
Reinforcement learning typically formulates problems as Markov decision processes (MDPs). A standard MDP assumes stationary transition probabilities and reward functions, meaning the dynamics do not change over time. Under this assumption there exists a stationary optimal policy; value iteration and policy iteration converge to it, and Q-learning does so under standard exploration and step-size conditions. Non-stationary MDPs relax this assumption to allow transitions and rewards to drift over time, which arises naturally in multi-agent settings (where other agents are learning) and in real-world deployments (where the environment evolves). Algorithms for non-stationary MDPs include sliding-window methods, change-point detection, restart policies, and meta-learning approaches that explicitly model how the environment changes.[^non_stationary_mdp]
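The sliding-window idea can be shown on the simplest possible case, a single reward stream rather than a full MDP. In the sketch below the reward mean jumps halfway through (an assumed change, with arbitrary noise level and window length); an estimate that averages all history lags behind, while a windowed estimate tracks the current value.

```python
import numpy as np

rng = np.random.default_rng(9)
n_steps, window = 1000, 50

# Hypothetical reward stream whose mean jumps from 0 to 1 halfway through.
true_mean = np.where(np.arange(n_steps) < n_steps // 2, 0.0, 1.0)
rewards = true_mean + 0.1 * rng.standard_normal(n_steps)

full_average = rewards.mean()               # weights all history equally
window_average = rewards[-window:].mean()   # uses only the most recent rewards

print(round(full_average, 2), round(window_average, 2))  # current true mean is 1.0
```

The window length trades off variance (short windows are noisy) against responsiveness (long windows adapt slowly).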
Deep learning models for forecasting (LSTMs, Transformers, N-BEATS, TFT, TimeGPT, Chronos) do not strictly require stationarity in the same way ARIMA does, because they are flexible enough to capture trends and seasonality directly. In practice, however, preprocessing steps such as differencing, log transformation, or removing the linear trend are still common because they reduce the dynamic range of the input, stabilize gradients, and make patterns more learnable. Recent work on foundation models for time series often benchmarks against simple stationarity-based baselines, and these baselines remain surprisingly competitive on many real-world datasets.
A typical workflow for assessing and handling stationarity in an applied problem:
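One way such a workflow might look in code is sketched below. The step order, the function names, the simple constant-only Dickey-Fuller statistic without lagged differences, the approximate critical value of $-2.86$, and the $-0.4$ over-differencing threshold are all assumptions made for illustration; they are not a prescribed recipe.

```python
import numpy as np

def df_t_stat(y):
    """Dickey-Fuller t-statistic on gamma in  dy_t = alpha + gamma * y_{t-1} + e_t."""
    y = np.asarray(y, dtype=float)
    dy, lag = np.diff(y), y[:-1]
    X = np.column_stack([np.ones(lag.size), lag])
    beta, *_ = np.linalg.lstsq(X, dy, rcond=None)
    resid = dy - X @ beta
    sigma2 = resid @ resid / (dy.size - 2)
    return beta[1] / np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])

def lag1_autocorr(x):
    xc = x - x.mean()
    return float(xc[:-1] @ xc[1:] / (xc @ xc))

def choose_d(y, max_d=2, crit=-2.86):
    """Difference until the unit-root null is rejected, then flag likely over-differencing."""
    z = np.asarray(y, dtype=float)
    d = 0
    while d < max_d and df_t_stat(z) > crit:   # cannot reject a unit root yet
        z = np.diff(z)
        d += 1
    over_differenced = d > 0 and lag1_autocorr(z) < -0.4
    return d, over_differenced

rng = np.random.default_rng(10)
eps = rng.standard_normal(500)
drifting_walk = np.cumsum(1.0 + eps)       # I(1) with drift
ar1 = np.zeros(500)
for t in range(1, 500):
    ar1[t] = 0.5 * ar1[t - 1] + eps[t]     # stationary

result_walk, result_ar1 = choose_d(drifting_walk), choose_d(ar1)
print(result_walk, result_ar1)
```

A fuller version would also plot the series, cross-check with a test whose null is stationarity, and consider variance-stabilizing or seasonal transformations before differencing.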
A few clarifications come up often when teaching this material.
Stationarity is not the absence of pattern. A stationary series can have rich autocorrelation structure (think of an AR(1) with $\phi = 0.95$). What matters is that the structure does not change over time.
Stationarity is not the same as being stochastic. A constant function is stationary; a deterministic linear trend is not.
Failing to reject the unit-root null does not prove a unit root. Like any hypothesis test, ADF can fail to reject when the true process is stationary but close to the unit root. This is why complementary tests (KPSS, DF-GLS) are used, and why visual diagnostics matter.
Differencing is not detrending. Differencing is appropriate for stochastic trends (unit roots); detrending is appropriate for deterministic trends. Using the wrong one introduces residual structure the model cannot remove.
Stationarity is not a panacea for ML deployment. Even if a model is trained on stationary data and the underlying data-generating process is stationary, distribution shift can still arise from sample selection, measurement changes, or interventions on the system being modeled.