Nonstationarity refers to the condition in which the statistical properties of a data-generating process change over time. In a stationary process, quantities such as the mean, variance, and autocorrelation structure remain constant regardless of when they are measured. When any of these properties shift, the process is said to be nonstationary. This concept sits at the intersection of time series analysis, statistics, and machine learning, and it poses significant challenges for models that assume the data distribution remains fixed between training and deployment.
In practice, nonstationarity is the norm rather than the exception. Financial markets exhibit changing volatility regimes, consumer preferences evolve, language usage shifts over time, and sensor readings degrade as equipment ages. Any predictive system operating in the real world must eventually confront nonstationarity in some form.
Imagine you learn the rules of a game and get really good at it. But then someone keeps changing the rules while you play. Sometimes the changes are small (like moving the goal posts a little), and sometimes the rules change completely overnight. That is what nonstationarity is like for a computer that has learned patterns from data. The patterns it memorized stop working because the "rules" behind the data keep changing. To stay good at the game, the computer has to notice when the rules change and learn the new ones.
A stochastic process {X_t} is strictly stationary if the joint distribution of (X_{t1}, X_{t2}, ..., X_{tk}) is the same as the joint distribution of (X_{t1+h}, X_{t2+h}, ..., X_{tk+h}) for all choices of time indices and all shifts h. A weaker condition, weak stationarity (also called second-order or covariance stationarity), requires only that the mean E[X_t] is constant, the variance Var(X_t) is finite and constant, and the autocovariance Cov(X_t, X_{t+h}) depends only on the lag h, not on t.
A process is nonstationary when it violates one or more of these conditions. Common violations include a time-varying mean (trend), a time-varying variance (heteroscedasticity), or a changing autocorrelation structure.
Nonstationarity takes many forms depending on which aspect of the data-generating process changes and how quickly the change occurs.
A trend-stationary process has a deterministic trend component (for example, a linear or polynomial function of time) but stationary fluctuations around that trend. Removing the trend by regression or differencing yields a stationary residual series. A common example is GDP, which tends to grow over time but fluctuates around a long-run growth path.
A process with a unit root is one in which shocks have a permanent effect on the level of the series. The classic example is a random walk: X_t = X_{t-1} + e_t, where e_t is white noise. Because each shock accumulates without decay, the variance of X_t grows without bound over time. Unlike trend nonstationarity, unit root nonstationarity cannot be removed by subtracting a deterministic trend; instead, differencing (computing X_t - X_{t-1}) is required.
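The growing variance of a random walk, and the effect of differencing, are easy to check numerically. The sketch below (plain NumPy; the seed and series lengths are arbitrary) simulates many independent random-walk paths and compares the cross-path variance early and late in the series:

```python
import numpy as np

rng = np.random.default_rng(0)
shocks = rng.normal(size=(1000, 500))   # 1000 independent white-noise paths
walks = shocks.cumsum(axis=1)           # X_t = X_{t-1} + e_t for each path

# For a unit-root process, Var(X_t) grows roughly linearly in t, so the
# cross-path variance late in the series dwarfs the variance early on.
var_early = walks[:, 49].var()
var_late = walks[:, 499].var()
assert var_late > 5 * var_early

# First differences recover the stationary white-noise shocks.
diffs = np.diff(walks, axis=1)
```

The differenced series has approximately zero mean and unit variance, matching the white-noise shocks that generated it.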
A structural break is an abrupt, permanent change in the parameters of the data-generating process. For example, a policy change or a financial crisis may shift the mean or variance of an economic time series at a single point in time. The Chow test and the Bai-Perron procedure are commonly used to detect structural breaks.
When the variance of a process changes over time, the process displays conditional heteroscedasticity. Financial return series are a well-known example: periods of high volatility cluster together, a phenomenon captured by ARCH and GARCH models introduced by Robert Engle (1982) and Tim Bollerslev (1986).
In machine learning, nonstationarity is often discussed under the umbrella of dataset shift or distribution shift. The foundational framework was laid out by Quinonero-Candela, Sugiyama, Schwaighofer, and Lawrence in their 2009 book Dataset Shift in Machine Learning. The core idea is that the joint distribution P(X, Y) of features X and labels Y differs between training and test (or deployment) time.
Because P(X, Y) = P(Y|X) * P(X) = P(X|Y) * P(Y), different components of the joint distribution can shift independently. This decomposition gives rise to several well-studied types of shift.
Covariate shift occurs when the marginal distribution of the input features changes (P_train(X) differs from P_test(X)) but the conditional distribution of labels given features remains the same (P(Y|X) is unchanged). An example is a medical diagnosis model trained mostly on young patients that is later deployed on an older population. The input distribution changes, but the relationship between symptoms and diseases does not.
Covariate shift can be corrected by importance weighting, where each training example is weighted by the density ratio P_test(X) / P_train(X). Shimodaira (2000) formalized this approach, and kernel mean matching and logistic regression-based density ratio estimation are common practical methods.
Prior probability shift, also called label shift or target shift, occurs when the marginal distribution of labels changes (P_train(Y) differs from P_test(Y)) but the class-conditional distribution of features given labels remains the same (P(X|Y) is unchanged). This type of shift is common in anti-causal prediction settings. For instance, a disease screening model may be trained on a hospital population where 20% of patients have the disease, then deployed in a general population where prevalence is only 2%.
Correction methods for label shift include the black-box shift estimation technique proposed by Lipton, Wang, and Smola (2018), which uses the model's own confusion matrix to estimate the shifted label distribution.
Concept shift, often called concept drift, occurs when the conditional distribution P(Y|X) changes over time. This is the most challenging type of shift because the very relationship between inputs and outputs has changed. A spam filter provides a classic example: spammers continually adapt their strategies, so the mapping from email features to "spam" or "not spam" labels evolves over time.
The term "concept drift" was popularized by Widmer and Kubat (1996) in their work on learning in the presence of hidden contexts.
| Type of shift | What changes | What stays the same | Example |
|---|---|---|---|
| Covariate shift | P(X) | P(Y\|X) | Training on daytime images, deploying on nighttime images |
| Prior probability shift (label shift) | P(Y) | P(X\|Y) | Disease prevalence differs between training hospital and deployment population |
| Concept shift (concept drift) | P(Y\|X) | P(X) may or may not change | Spam tactics evolve, changing what counts as spam |
| Full dataset shift | P(X, Y) | Nothing guaranteed | Entirely new deployment environment |
Concept drift does not always follow the same pattern over time. Understanding the temporal dynamics of drift helps practitioners choose the right detection and adaptation strategy.
Sudden drift (also called abrupt drift) occurs when the data-generating process changes instantaneously from one concept to another. An example is a regulatory change that instantly alters customer eligibility criteria for a loan, making the old classification boundary obsolete overnight.
Gradual drift involves a slow transition period during which data points from both the old and new concepts coexist. Over time, the proportion of data from the new concept increases until it fully replaces the old one. Slowly changing consumer preferences on a social media platform are a typical example.
In incremental drift, the concept changes through a sequence of very small steps. Each individual step may be imperceptible, but the cumulative effect over a long period is a substantially different concept. Sensor calibration degradation is a common real-world instance.
Recurring drift arises when previously observed concepts reappear after a period of absence. Seasonal patterns in retail demand are a classic example: holiday shopping spikes repeat each year. Models that can store and retrieve past concept descriptions have an advantage in this setting.
| Drift pattern | Speed of change | Reversibility | Detection difficulty | Example |
|---|---|---|---|---|
| Sudden | Instantaneous | Typically irreversible | Easier (sharp signal) | Regulatory policy change |
| Gradual | Slow transition | May or may not reverse | Moderate | Evolving consumer taste |
| Incremental | Very slow, continuous | Usually irreversible | Harder (weak signal) | Sensor degradation |
| Recurring | Periodic | Cyclical by nature | Moderate (if periodicity is known) | Seasonal retail demand |
Before modeling a time series or detecting distribution shift, practitioners often apply formal statistical tests to assess whether the data is stationary.
The Augmented Dickey-Fuller test is the most widely used unit root test. Its null hypothesis is that the series contains a unit root (is nonstationary). A low p-value (below the chosen significance level, typically 0.05) leads to rejection of the null, suggesting the series is stationary. The ADF test extends the original Dickey-Fuller test by including lagged difference terms to account for higher-order autocorrelation. It is available in Python through statsmodels.tsa.stattools.adfuller.
The Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test reverses the hypothesis structure of the ADF test. Its null hypothesis is that the series is stationary (or trend-stationary). A low p-value here suggests nonstationarity. Because the ADF and KPSS tests have complementary null hypotheses, using both tests together provides stronger evidence than either test alone. When both tests agree, the conclusion is more reliable; when they disagree, further investigation is needed.
The Phillips-Perron test shares the same null hypothesis as the ADF test (unit root present) but uses a non-parametric correction for serial correlation and heteroscedasticity rather than adding lagged difference terms. This makes the PP test robust to a wider range of error structures, though it can suffer from size distortions in small samples.
| Test | Null hypothesis | Alternative hypothesis | Handles serial correlation via | Key advantage |
|---|---|---|---|---|
| ADF | Unit root (nonstationary) | Stationary | Lagged difference terms | Most widely used; well understood |
| KPSS | Stationary | Unit root (nonstationary) | Spectral density estimation | Complementary to ADF; confirms stationarity |
| Phillips-Perron | Unit root (nonstationary) | Stationary | Non-parametric correction | No lag length selection needed; robust to heteroscedasticity |
A recommended approach is to run both the ADF test and the KPSS test on the same series. Four outcomes are possible:

- ADF rejects its unit-root null and KPSS fails to reject stationarity: both tests agree that the series is stationary.
- ADF fails to reject and KPSS rejects: both tests agree that the series is nonstationary; differencing is the usual remedy.
- Both tests fail to reject their nulls: the evidence is consistent with trend stationarity, and detrending is appropriate.
- Both tests reject: the results conflict, which often indicates a difference-stationary series or features (such as heteroscedasticity) that neither test models well, and warrants further investigation.
In deployed machine learning systems, models must be monitored for performance degradation caused by distribution shift. A variety of statistical and algorithmic methods have been developed to detect drift in real time or near-real time.
The Kolmogorov-Smirnov (KS) test is a non-parametric test that compares the empirical cumulative distribution functions (CDFs) of two samples. It computes the maximum absolute difference between the two CDFs and tests whether this difference is statistically significant. The KS test makes no assumptions about the underlying distribution, which makes it broadly applicable. However, its sensitivity increases with sample size, meaning that on very large datasets it may flag statistically significant but practically irrelevant differences.
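A sketch with scipy.stats.ks_2samp, comparing a baseline feature sample against a drifted and an undrifted live sample (all data synthetic):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(7)
baseline = rng.normal(0.0, 1.0, 1000)  # e.g. a training-time feature sample
drifted = rng.normal(0.5, 1.0, 1000)   # live sample after a mean shift
steady = rng.normal(0.0, 1.0, 1000)    # live sample with no shift

stat_d, p_drift = ks_2samp(baseline, drifted)
stat_s, p_steady = ks_2samp(baseline, steady)
assert p_drift < 0.01       # the shifted sample is flagged
assert p_steady > p_drift   # the unshifted sample looks far more similar
```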
The Population Stability Index (PSI) measures how much a distribution has shifted relative to a baseline by comparing binned probability distributions. Standard thresholds are:

- PSI < 0.1: no significant shift.
- 0.1 ≤ PSI < 0.25: moderate shift; investigate.
- PSI ≥ 0.25: significant shift; the model likely needs retraining or recalibration.
PSI is less sensitive to sample size than the KS test and works with both continuous and categorical features, making it popular in financial services and credit scoring.
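PSI has no single canonical library implementation, so monitoring systems often hand-roll it. A minimal sketch over baseline quantile bins (the bin count and the small floor for empty bins are implementation choices; the 0.1 and 0.25 cut-offs are the conventional thresholds):

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline and a new sample."""
    # Interior bin edges from baseline quantiles; outer bins are open-ended.
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))[1:-1]
    e_frac = np.bincount(np.searchsorted(edges, expected), minlength=bins) / len(expected)
    a_frac = np.bincount(np.searchsorted(edges, actual), minlength=bins) / len(actual)
    # A small floor avoids log(0) and division by zero for empty bins.
    e_frac = np.clip(e_frac, 1e-6, None)
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(3)
base = rng.normal(0, 1, 5000)
psi_same = psi(base, rng.normal(0, 1, 5000))     # no shift
psi_shift = psi(base, rng.normal(1.0, 1, 5000))  # one-sigma mean shift
assert psi_same < 0.1
assert psi_shift > 0.25
```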
The Maximum Mean Discrepancy is a kernel-based two-sample test that measures the distance between the mean embeddings of two distributions in a reproducing kernel Hilbert space (RKHS). It was formalized by Gretton, Borgwardt, Rasch, Scholkopf, and Smola (2012). Unlike the KS test, MMD operates on multivariate data natively, making it suitable for detecting shift across multiple features simultaneously. The Gaussian (RBF) kernel is the most common choice.
The Drift Detection Method (Gama et al., 2004) monitors the online error rate of a classifier. It assumes that the error rate of a well-performing model is approximately binomially distributed and raises a warning when the error rate plus its standard deviation exceeds a threshold, and signals drift when a higher threshold is crossed. DDM works well for detecting sudden drift but can be slow to react to gradual changes.
The Early Drift Detection Method (Baena-Garcia et al., 2006) modifies DDM by tracking the distance between consecutive classification errors rather than the raw error rate. This modification makes EDDM more sensitive to gradual drift, though it is also more prone to false alarms from noise.
The Page-Hinkley test is a sequential analysis technique derived from the CUSUM (cumulative sum) algorithm. It detects changes by accumulating the difference between observed values and their running mean. When the accumulated sum exceeds a user-defined threshold, drift is signaled. The test is computationally efficient and well-suited for monitoring a single metric stream.
ADWIN (Bifet and Gavalda, 2007) maintains a variable-length sliding window over incoming data. It automatically grows the window when the data is stationary and shrinks it when a change is detected, by comparing the means of two sub-windows within the current window. ADWIN provides theoretical guarantees on false positive and false negative rates and does not require the user to fix a window size in advance.
| Method | Type | What it monitors | Best for | Limitations |
|---|---|---|---|---|
| KS test | Statistical | Feature distributions | Single-feature drift | Overly sensitive on large samples |
| PSI | Statistical | Binned distributions | Tabular data, credit scoring | Requires binning decisions |
| MMD | Statistical (kernel) | Multivariate distributions | High-dimensional feature drift | Computationally expensive |
| DDM | Error-based | Classifier error rate | Sudden drift | Slow on gradual drift; needs labels |
| EDDM | Error-based | Error intervals | Gradual drift | Sensitive to noise; needs labels |
| Page-Hinkley | Sequential | Running mean deviation | Continuous metric streams | Single metric only |
| ADWIN | Window-based | Windowed means | Streaming data with unknown drift points | Memory overhead for window storage |
Classical time series methods require stationarity as a precondition, so a variety of transformation techniques have been developed to convert nonstationary series into stationary ones.
Differencing computes the change between consecutive observations: Y'_t = Y_t - Y_{t-1}. This removes a unit root and stabilizes the mean. If a single round of differencing is not sufficient (assessed via a unit root test), second-order differencing can be applied. In seasonal data, seasonal differencing (Y_t - Y_{t-m}, where m is the seasonal period) removes seasonal patterns.
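Both kinds of differencing are one-liners in NumPy. The synthetic series below (arbitrary trend slope, seasonal period m = 12, and noise level) illustrates how first and seasonal differences strip out the trend and the cycle:

```python
import numpy as np

rng = np.random.default_rng(9)
t = np.arange(240)
# Monthly-style series: linear trend + seasonal cycle (m = 12) + noise.
y = 0.5 * t + 10 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 1, t.size)

d1 = np.diff(y)            # first difference removes the trend
d12 = d1[12:] - d1[:-12]   # seasonal difference (lag m = 12) removes the cycle

# The doubly differenced series is far less variable than the original.
assert d12.std() < y.std() / 5
```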
Differencing is the foundation of the ARIMA family of models. The "I" in ARIMA stands for "integrated," referring to the number of differencing operations (the d parameter) needed to achieve stationarity.
When the variance of a series changes proportionally with its level, a logarithmic transformation can stabilize the variance. The Box-Cox family of transformations generalizes this idea by parameterizing the power transformation and selecting the parameter that best normalizes the data.
For trend-stationary processes, fitting a deterministic trend (such as a linear or polynomial regression on time) and subtracting it yields a stationary residual series. This approach is appropriate when the nonstationarity is due to a smooth, predictable trend rather than a stochastic unit root.
Once nonstationarity is detected, a model must adapt. The choice of adaptation strategy depends on the type and speed of the shift, the availability of labeled data, and the computational budget.
The simplest approach is to retrain the model on recent data at regular intervals (daily, weekly, or monthly). This works well when drift is slow relative to the retraining frequency. The main risk is that the model may be stale between retraining cycles if drift accelerates.
Online learning algorithms update model parameters incrementally as each new data point arrives. This allows the model to adapt continuously to changing distributions. Algorithms such as stochastic gradient descent and Passive-Aggressive classifiers are naturally suited to online settings. Forgetting mechanisms, such as decaying weights on older examples or fixed-size sliding windows, help the model shed outdated patterns.
Ensemble learning approaches maintain a pool of diverse models and combine their predictions. As the data distribution shifts, new models can be added and old ones retired. Streaming random forest variants and weighted voting ensembles are examples of this approach. The Dynamic Weighted Majority algorithm (Kolter and Maloof, 2007) adds and removes experts based on their recent performance, effectively tracking concept drift.
Domain adaptation methods explicitly model the difference between a source domain (training data) and a target domain (deployment data) and learn representations that are invariant to the domain. Techniques include:

- Adversarial feature learning, as in domain-adversarial neural networks (DANN), where a domain classifier trained through a gradient-reversal layer pushes the feature extractor toward domain-invariant representations.
- Moment matching, such as CORAL (correlation alignment) and MMD-based objectives that align feature statistics across domains.
- Self-training, in which the model's confident predictions on unlabeled target-domain data serve as pseudo-labels for further training.
These methods are closely related to transfer learning and are widely used in computer vision and natural language processing.
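CORAL (correlation alignment) is one of the simpler domain adaptation techniques: it recolours source features so their second-order statistics match the target's. A minimal NumPy sketch (the eps regularizer is an implementation detail):

```python
import numpy as np

def coral(Xs, Xt, eps=1e-5):
    """Align source features to the target's mean and covariance (CORAL)."""
    Cs = np.cov(Xs, rowvar=False) + eps * np.eye(Xs.shape[1])
    Ct = np.cov(Xt, rowvar=False) + eps * np.eye(Xt.shape[1])
    # Whiten the source, then recolour with the target covariance.
    whiten = np.linalg.inv(np.linalg.cholesky(Cs).T)
    recolour = np.linalg.cholesky(Ct).T
    return (Xs - Xs.mean(0)) @ whiten @ recolour + Xt.mean(0)

rng = np.random.default_rng(19)
Xs = rng.normal(0, 1, (1000, 3))   # source features
Xt = rng.normal(2, 3, (1000, 3))   # target features: different mean and scale
Xs_aligned = coral(Xs, Xt)

# After alignment, the source matches the target's first two moments.
assert np.allclose(Xs_aligned.mean(0), Xt.mean(0), atol=1e-6)
assert np.allclose(np.cov(Xs_aligned, rowvar=False),
                   np.cov(Xt, rowvar=False), atol=0.01)
```

A downstream model is then trained on the aligned source features and applied directly to the target domain.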
Continual learning (also called lifelong learning) aims to learn from a sequence of tasks or distributions without forgetting previously acquired knowledge. The central challenge is catastrophic forgetting, where training on new data causes the model to lose performance on old data. Strategies include elastic weight consolidation (EWC), progressive neural networks, and experience replay buffers. Continual learning is especially relevant in reinforcement learning settings where the environment changes over time.
Standard reinforcement learning algorithms assume a stationary environment, typically formalized as a Markov decision process (MDP) with fixed transition dynamics and reward function. In practice, many real-world environments are nonstationary. Traffic patterns change throughout the day, user behavior on a platform evolves, and physical systems degrade over time.
Nonstationarity in RL can affect the transition function P(s'|s, a), the reward function R(s, a), or both. When the environment drifts, a policy that was once optimal can become suboptimal or even harmful.
Approaches to handle nonstationarity in RL include:

- Forgetting mechanisms, such as sliding windows or constant (rather than decaying) step sizes that weight recent experience more heavily.
- Change-point detection, which monitors reward or transition statistics and triggers re-exploration or a policy reset when the environment shifts.
- Context-based methods, which infer a latent variable summarizing the current environment dynamics and condition the policy on it.
- Meta-learning, which trains agents to adapt rapidly to new dynamics from small amounts of recent experience.
Achieving sublinear regret in drifting environments typically requires the algorithm to forget outdated data at a controlled rate. Theoretical work by Jaksch, Ortner, and Auer (2010) on the UCRL2 algorithm and subsequent extensions has established performance bounds for RL under bounded nonstationarity.
Text data is inherently nonstationary. Word meanings shift over time (semantic drift), new terms emerge, and writing styles change. For natural language processing models, this creates a distinct form of concept drift.
Named entity recognition (NER) models, for example, degrade as new entities appear and old ones become less frequent. Lazaridou et al. (2021) showed that language model perplexity increases on text from years not represented in the training data. Sentiment and stance classifiers trained on historical data can become unreliable as public discourse evolves.
Strategies for handling temporal drift in NLP include periodic fine-tuning on recent text, continual pre-training, and retrieval-augmented approaches that ground predictions in up-to-date external knowledge.
A related but distinct phenomenon is internal covariate shift, a term introduced by Ioffe and Szegedy (2015). During the training of a deep neural network, the distribution of inputs to each hidden layer changes as the parameters of preceding layers are updated. This shifting distribution can slow training convergence and require careful learning rate and initialization choices.
Batch normalization was proposed as a direct countermeasure. By normalizing the inputs to each layer within each mini-batch, it reduces the degree to which the internal distributions shift during training. Later research has debated whether internal covariate shift is truly the main mechanism by which batch normalization helps, but the practical benefits of the technique are well established.
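For reference, the batch-normalization forward pass at training time is a per-feature standardization followed by a learned affine rescaling. A minimal NumPy sketch (omitting the running statistics used at inference time):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise a mini-batch per feature, then apply the affine rescale."""
    mu = x.mean(axis=0)                     # per-feature batch mean
    var = x.var(axis=0)                     # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # standardised activations
    return gamma * x_hat + beta             # learnable scale and shift

rng = np.random.default_rng(29)
batch = rng.normal(5.0, 3.0, (64, 10))  # layer inputs with an off-centre scale
out = batch_norm(batch)

# Whatever the incoming distribution, the normalised batch has (approximately)
# zero mean and unit variance per feature.
assert np.allclose(out.mean(0), 0.0, atol=1e-7)
assert np.allclose(out.std(0), 1.0, atol=1e-2)
```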
Nonstationarity affects virtually every domain where machine learning is applied in production.
| Domain | Source of nonstationarity | Consequence for ML models |
|---|---|---|
| Email spam filtering | Spammers adapt tactics to evade detection | Classification accuracy degrades unless the model is retrained |
| Fraud detection | Fraudsters develop new attack vectors | Static models miss novel fraud patterns |
| Recommendation systems | User preferences and item catalogs evolve | Recommendations become stale and less engaging |
| Autonomous driving | Weather, road conditions, and traffic patterns change | Perception models must generalize across conditions |
| Medical diagnosis | Patient demographics and disease prevalence shift | Models calibrated on one population may miscalibrate on another |
| Financial trading | Market regimes shift (bull, bear, high volatility) | Strategies optimized for one regime underperform in another |
| Natural language processing | Language usage, slang, and topics evolve | Language models and classifiers degrade on newer text |