# Concept drift

> Source: https://aiwiki.ai/wiki/concept_drift
> Updated: 2026-06-24
> Categories: Data Science, MLOps, Machine Learning, Statistics
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Concept drift** is the change over time in the statistical relationship between a model's inputs and its target. Formally, it occurs when the joint distribution `P(X, Y)` that a model learned during training (and, in the most damaging case, the conditional `P(Y | X)`) stops matching the live data, so a deployed model's accuracy degrades unless it is monitored and adapted [1][2]. It is one of the most common causes of silent model failure in production: predictions still return, the pipeline still runs, and no exception is thrown, but accuracy quietly collapses [5]. Concept drift is a central problem in [machine learning](/wiki/machine_learning) operations ([MLOps](/wiki/mlops)), [online learning](/wiki/online_learning), [model evaluation](/wiki/model_evaluation), and any system that consumes [streaming data](/wiki/streaming_data) from a non-stationary source.

The term was introduced by Schlimmer and Granger in 1986, whose STAGGER system was the first algorithm built to track a moving concept [3]. Tsymbal's 2004 technical report consolidated the field and gave the canonical definition still used today [4]. The 2014 ACM Computing Surveys paper by Gama and colleagues, cited more than 3,200 times, remains the most referenced work in the field and frames drift as learning under non-stationarity, with detection and adaptation as two complementary directions [1]. The 2018 IEEE TKDE review by Lu et al. catalogues the large body of algorithms developed across roughly 2010 to 2018 and organises them into a unified framework [2].

In production, the cost of drift is best understood as maintenance debt. The 2015 Google paper "Hidden Technical Debt in Machine Learning Systems" by Sculley et al. observed that "it is common to incur massive ongoing maintenance costs in real-world ML systems" and identified drift handling, monitoring, and entanglement as primary sources of that cost [5].

## What is concept drift, and what are its types?

Let `X` denote the input features and `Y` the target. At time `t` the data is generated from a joint distribution `P_t(X, Y)`. **Concept drift** is said to occur between times `t` and `t+1` when `P_t(X, Y) != P_{t+1}(X, Y)`. Because the joint distribution factors as `P(X, Y) = P(Y | X) * P(X)`, there are several distinct ways the distribution can change, and the literature draws careful distinctions among them [1][4].

**Real concept drift**, sometimes called *true drift*, occurs when the conditional distribution `P(Y | X)` changes. In the formal terms of Gama et al., real drift is the case where `P_t(Y | X) != P_{t+1}(Y | X)` even when `P_t(X) = P_{t+1}(X)`: the same input now corresponds to a different label or probability, so the decision boundary shifts [1]. Real drift is the most damaging form because the function the model was trained to approximate is no longer correct.

**Virtual concept drift**, also called [data drift](/wiki/covariate_shift), [covariate shift](/wiki/covariate_shift), or input drift, occurs when `P(X)` changes but `P(Y | X)` stays the same. Features arrive in different proportions, but the underlying labelling rule is stable. The model can still in principle make correct predictions, though calibration may suffer on under-represented regions [1][6].

**Label drift** or *prior shift* is the case where `P(Y)` changes while `P(X | Y)` stays the same. The base rate of each class shifts, common when the deployment population has different prevalences than the training population. See [label shift](/wiki/label_shift) for standard reweighting techniques [6].

**Concept evolution** describes the appearance of new classes that did not exist during training, such as a spam filter encountering a new category of phishing.

These categories are not mutually exclusive. Real production systems usually exhibit a mixture, and disentangling them is part of the diagnosis problem [2].
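
The difference between virtual and real drift can be made concrete with a small simulation. The sketch below is illustrative only: the one-dimensional feature, the labelling rule `x > 0`, the 10 percent label noise, and the helper names `sample` and `accuracy` are all assumptions made for this example, not part of any cited method.

```python
import random

def sample(n, x_mean, flipped, rng, noise=0.1):
    """Draw (x, y) pairs with x ~ Normal(x_mean, 1).

    The label follows the rule x > 0 (or its opposite when `flipped`),
    and a fraction `noise` of labels is flipped at random.
    """
    data = []
    for _ in range(n):
        x = rng.gauss(x_mean, 1.0)
        y = (x > 0) != flipped          # the labelling rule, i.e. P(Y | X)
        if rng.random() < noise:
            y = not y
        data.append((x, y))
    return data

def accuracy(data):
    """Accuracy of a frozen model that learned the original rule: predict x > 0."""
    return sum((x > 0) == y for x, y in data) / len(data)

rng = random.Random(0)
baseline = accuracy(sample(5000, 0.0, False, rng))
virtual = accuracy(sample(5000, 2.0, False, rng))   # P(X) moves, P(Y | X) unchanged
real = accuracy(sample(5000, 0.0, True, rng))       # P(X) unchanged, P(Y | X) flips
print(f"baseline={baseline:.3f} virtual={virtual:.3f} real={real:.3f}")
```

In this toy setting the frozen model happens to match the true rule everywhere, so the shift in `P(X)` leaves its accuracy near the noise ceiling, while flipping `P(Y | X)` drives accuracy far below chance. A model that is only accurate in the region it was trained on would also lose accuracy under virtual drift.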

## How does drift behave over time?

Drift can also be classified by its temporal shape, irrespective of which component of the joint distribution is changing. Gama and colleagues distinguish four canonical patterns [1].

| Pattern | Description | Typical example |
| --- | --- | --- |
| Sudden / abrupt | An instantaneous change from one stable concept to another | Regulatory rule change, deployment of a new sensor |
| Gradual | Old and new concepts coexist for a period, with the new one progressively dominating | Product preferences shifting between generations |
| Incremental | Continuous slow change, with no clear before-and-after | Slow demographic drift in a customer base |
| Recurring / cyclical | Old concepts reappear in a periodic pattern | Seasonality in retail or energy demand |

A fifth category, **outliers** or **blips**, covers short-lived deviations that revert to the prior distribution. These are not true drift, and robust detectors should ignore them [1][7]. Different patterns favour different adaptation strategies: sudden drift rewards aggressive forgetting and fast retraining; recurring drift rewards a memory of past models that can be reactivated when a known regime returns [2].
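
For testing detectors it helps to generate synthetic streams with a known temporal shape. The following is a minimal sketch of the four patterns in the table, assuming a one-dimensional stream whose concept is summarised by its mean; the function names, the transition points, and the noise level are hypothetical choices for illustration.

```python
import random

def concept_mean(pattern, t, n, rng, old=0.0, new=1.0):
    """Mean of the generating concept at step t of an n-step stream."""
    frac = t / n                       # position in the stream, in [0, 1)
    if pattern == "sudden":            # instantaneous switch at the midpoint
        return new if frac >= 0.5 else old
    if pattern == "gradual":           # both concepts coexist; the new one wins more often
        p_new = min(1.0, max(0.0, (frac - 0.25) / 0.5))
        return new if rng.random() < p_new else old
    if pattern == "incremental":       # the concept itself slides from old to new
        return old + (new - old) * frac
    if pattern == "recurring":         # concepts alternate every quarter of the stream
        return new if int(4 * frac) % 2 == 1 else old
    raise ValueError(f"unknown pattern: {pattern}")

def make_stream(pattern, n=1000, noise=0.1, seed=0):
    """Generate n noisy observations following the chosen drift pattern."""
    rng = random.Random(seed)
    return [rng.gauss(concept_mean(pattern, t, n, rng), noise) for t in range(n)]

def window_mean(stream, start, stop):
    """Average of stream[start:stop]."""
    return sum(stream[start:stop]) / (stop - start)

for pattern in ("sudden", "gradual", "incremental", "recurring"):
    s = make_stream(pattern)
    print(pattern, [round(window_mean(s, i, i + 250), 2) for i in range(0, 1000, 250)])
```

Note that gradual drift mixes samples from two fixed concepts, whereas incremental drift moves a single concept continuously; the quarter-by-quarter means printed above can look similar even though the underlying processes differ.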

## What causes concept drift in practice?

Drift arises from many practical sources, and identifying the cause points toward the right remediation [2][7].

- **Population changes.** The user base shifts; a recommender trained on early adopters meets a mass-market audience.
- **Behavioural changes.** Consumer preferences, browsing patterns, or buying habits evolve, a shift that [recommender systems](/wiki/recommender_system) face continuously.
- **Adversarial actors.** [Fraud detection](/wiki/fraud_detection), spam filtering, and intrusion detection face opponents who deliberately change their patterns to evade the model.
- **Seasonal effects.** Retail demand, electricity load, hospital admissions, and traffic follow daily, weekly, and yearly cycles.
- **External shocks.** Pandemics, financial crises, and regulatory changes cause abrupt drift across many features simultaneously.
- **Sensor degradation or replacement.** A camera lens accumulates dust, a temperature probe drifts in calibration, a microphone is swapped. Features change without any change in the underlying world.
- **Pipeline changes.** Upstream feature engineering is updated, a logging schema changes, or a data provider modifies the meaning of a field.
- **Feedback loops.** The model's own predictions influence the data it later sees: a pricing model sets prices that change buying behaviour; a loan model rejects applicants who therefore never appear in the labelled outcome stream. Feedback loops can produce gradual drift invisible to a naive monitor because the labels themselves have been censored [5].

## How is concept drift detected?

Drift detection methods fall into three broad families: statistical tests on feature distributions, performance-based detectors that watch model accuracy, and sequential change-point procedures borrowed from quality control [1][2][8].

### Statistical tests on the data distribution

These methods compare a recent window of feature values against a reference window or the training distribution. They do not require labels and so are usable in real time, but they are blind to whether the drift actually hurts predictive performance.

- **[Kolmogorov-Smirnov test](/wiki/kolmogorov_smirnov_test)** compares two empirical CDFs via their maximum absolute difference. Non-parametric, suited to continuous features.
- **[Chi-squared test](/wiki/chi_squared_test)** compares observed and expected counts in categorical bins. Standard for discrete features.
- **[Population Stability Index](/wiki/psi)** (PSI) is widely used in credit risk. It bins the variable and sums `(p_curr - p_ref) * ln(p_curr / p_ref)` across bins. The industry rule of thumb is that PSI below 0.1 indicates no significant shift, 0.1 to 0.25 a moderate shift, and above 0.25 a major shift requiring action, though these cut-offs are heuristics rather than calibrated error rates and should be read alongside sample size [9].
- **[Wasserstein distance](/wiki/wasserstein_distance)** (Earth Mover's Distance) measures the minimum cost of transforming one distribution into another. Unlike KL or JS it is sensitive to the geometry of the support, useful for ordinal or continuous data.
- **[Maximum Mean Discrepancy](/wiki/maximum_mean_discrepancy)** (MMD) is a kernel-based two-sample test that lifts both samples into a reproducing kernel Hilbert space and compares their means. It is one of the multivariate two-sample tests benchmarked in the "Failing Loudly" study [8].
- **[KL divergence](/wiki/kl_divergence)** and **[Jensen-Shannon divergence](/wiki/js_divergence)** are information-theoretic measures of dissimilarity; KL is asymmetric and unbounded, while JS is its symmetrised, bounded variant. **Hellinger distance** is another bounded f-divergence, often preferred for numerical stability.
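
As an illustration of the simplest of these statistics, the PSI formula above can be computed in a few lines. This is a minimal sketch using only the standard library; the choice of ten bins at reference quantiles and the `eps` floor for empty bins are common conventions assumed here, not requirements of the index.

```python
import math
import random
from bisect import bisect_right

def psi(reference, current, n_bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples.

    Bins are cut at quantiles of the reference sample; `eps` keeps
    empty bins from producing log(0) or division by zero.
    """
    ref_sorted = sorted(reference)
    # n_bins - 1 interior cut points at the reference quantiles.
    edges = [ref_sorted[len(ref_sorted) * k // n_bins] for k in range(1, n_bins)]

    def proportions(sample):
        counts = [0] * n_bins
        for v in sample:
            counts[bisect_right(edges, v)] += 1
        return [max(c / len(sample), eps) for c in counts]

    p_ref, p_cur = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(p_ref, p_cur))

rng = random.Random(0)
ref = [rng.gauss(0, 1) for _ in range(5000)]
same = [rng.gauss(0, 1) for _ in range(5000)]
shifted = [rng.gauss(1, 1) for _ in range(5000)]
print(f"PSI same={psi(ref, same):.4f} shifted={psi(ref, shifted):.4f}")
```

With samples of this size, a second draw from the same distribution should land well under the 0.1 rule of thumb and a one-standard-deviation shift well above 0.25. The value depends on the binning, which is why PSI figures are only comparable when the bin definition is held fixed.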

In high-dimensional settings, univariate tests applied per feature suffer from the multiple-comparisons problem. Rabanser, Gunnemann, and Lipton's NeurIPS 2019 paper "Failing Loudly" benchmarked many strategies and found that applying [dimensionality reduction](/wiki/dimensionality_reduction) first (for example PCA, an autoencoder, or the outputs of a pre-trained black-box classifier) and then running two-sample tests works well in practice, as does training a domain classifier and using its accuracy as a drift signal [8].
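
The multiple-comparisons issue can be handled in its simplest form with a Bonferroni correction: run one KS test per feature and compare each p-value against `alpha` divided by the number of features. The sketch below is written in plain Python and assumes the standard asymptotic approximation to the Kolmogorov distribution for the p-value; the function names and the data layout (a dict of feature name to values) are hypothetical.

```python
import math
import random

def ks_statistic(a, b):
    """Two-sample KS statistic: largest absolute gap between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:   # advance both CDFs past x (handles ties)
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

def ks_pvalue(d, n, m, terms=100):
    """Approximate p-value from the asymptotic Kolmogorov distribution."""
    n_eff = n * m / (n + m)
    lam = (math.sqrt(n_eff) + 0.12 + 0.11 / math.sqrt(n_eff)) * d
    if lam < 0.2:                         # the series converges slowly here; p is ~1
        return 1.0
    s = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
            for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * s))

def drifted_features(reference, current, alpha=0.05):
    """Return {feature: p-value} for features that pass the Bonferroni-corrected test."""
    threshold = alpha / len(reference)    # Bonferroni: split alpha across features
    flagged = {}
    for name, ref_values in reference.items():
        cur_values = current[name]
        d = ks_statistic(ref_values, cur_values)
        p = ks_pvalue(d, len(ref_values), len(cur_values))
        if p < threshold:
            flagged[name] = p
    return flagged

rng = random.Random(0)
reference = {f"f{i}": [rng.gauss(0, 1) for _ in range(1000)] for i in range(5)}
current = {f"f{i}": [rng.gauss(0, 1) for _ in range(1000)] for i in range(5)}
current["f2"] = [v + 0.5 for v in current["f2"]]   # inject a shift into one feature
print(sorted(drifted_features(reference, current)))
```

Bonferroni is conservative, and per-feature tests still cannot see shifts that exist only in the joint distribution, which is the motivation for the dimensionality-reduction and domain-classifier approaches described above.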

### Performance-based detectors

These methods watch a stream of error indicators (correct vs incorrect predictions, or residuals) and signal drift when the error process changes. They directly reflect predictive performance but require ground-truth labels, possibly with delay.

- **[DDM](/wiki/ddm) (Drift Detection Method)** of Gama et al. (2004) models the running error rate `p_i` and its standard deviation `s_i = sqrt(p_i (1 - p_i) / i)`, tracking the minimum of `p_i + s_i`. A warning fires when `p_i + s_i >= p_min + 2 * s_min` and drift is declared when `p_i + s_i >= p_min + 3 * s_min`, the two-sigma and three-sigma thresholds of a normal approximation to the error rate [10].
- **[EDDM](/wiki/eddm) (Early Drift Detection Method)** of Baena-Garcia et al. (2006) monitors the distance between consecutive errors, which is more sensitive to gradual drift [11].
- **HDDM** (Frias-Blanco et al., 2015) uses Hoeffding bounds to give probabilistic guarantees on false alarm rates [12].
- **[ADWIN](/wiki/adwin) (Adaptive Windowing)** of Bifet and Gavalda (2007) maintains a window of recent values and repeatedly tests whether it can be split into two sub-windows whose means differ by more than a Hoeffding bound; when they do, the older sub-window is dropped, so the window grows while data are stationary and shrinks when change occurs. ADWIN gives rigorous bounds on its false positive and false negative rates and is the de facto standard adaptive window in streaming ML [13].
- **[Page-Hinkley test](/wiki/page_hinkley_test)**, derived from CUSUM, accumulates the deviation of each observation from the running mean and signals when the cumulative sum exceeds a threshold [1].
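
The DDM rule described above is short enough to sketch directly. The class below is a simplified illustration under stated assumptions: the 30-sample warm-up, the reset after a drift signal, and the guard for a degenerate error history (`s == 0`) are choices made for this example, and the class is not the API of any particular library.

```python
import math

class DDM:
    """Minimal Drift Detection Method over a stream of 0/1 error indicators."""

    def __init__(self, min_samples=30):
        self.min_samples = min_samples   # warm-up before any decision (assumed value)
        self.reset()

    def reset(self):
        self.n = 0
        self.errors = 0
        self.p_min = float("inf")
        self.s_min = float("inf")

    def update(self, error):
        """Feed 1 for a wrong prediction, 0 for a correct one.

        Returns 'stable', 'warning', or 'drift'.
        """
        self.n += 1
        self.errors += int(error)
        p = self.errors / self.n                  # running error rate p_i
        s = math.sqrt(p * (1 - p) / self.n)       # its standard deviation s_i
        if self.n < self.min_samples or s == 0.0:
            return "stable"   # too little evidence, or an all-right / all-wrong history
        if p + s <= self.p_min + self.s_min:      # track the best (lowest) point so far
            self.p_min, self.s_min = p, s
        if p + s >= self.p_min + 3 * self.s_min:  # three-sigma rule: drift
            self.reset()
            return "drift"
        if p + s >= self.p_min + 2 * self.s_min:  # two-sigma rule: warning
            return "warning"
        return "stable"

# Deterministic toy stream: 10% errors for 1000 steps, then 50% errors.
stream = [1 if t % 10 == 0 else 0 for t in range(1000)]
stream += [1 if t % 2 == 0 else 0 for t in range(500)]
detector = DDM()
states = [detector.update(e) for e in stream]
first_drift = states.index("drift")
print("first drift signalled at step", first_drift)
```

On this stream the detector stays quiet during the stable phase, enters the warning zone shortly after the error rate jumps, and signals drift a few dozen samples into the new regime. In a full system the warning zone is typically used to start buffering recent examples for the replacement model.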

### Sequential change-point methods

These methods come from statistical process control and are agnostic to the source of the signal.

- **[CUSUM](/wiki/cusum)** (cumulative sum), introduced by Page in 1954, accumulates positive and negative deviations from a target value and signals a change when either accumulator crosses a threshold [14]. It is optimal for detecting a known mean shift in a Gaussian stream.
- **Generalised likelihood ratio** tests assume parametric families before and after a possible change point and search over the change time.
- **Bayesian online change-point detection** (Adams and MacKay, 2007) maintains a posterior over the time since the last change [23].
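
A two-sided CUSUM can be sketched as follows. The parameter names `slack` (often written `k`) and `threshold` (often written `h`) follow common usage in statistical process control, and the default values are illustrative assumptions that would need tuning for a real signal.

```python
class Cusum:
    """Two-sided CUSUM around a known target value."""

    def __init__(self, target, slack=0.5, threshold=5.0):
        self.target = target        # in-control mean of the monitored signal
        self.slack = slack          # deviations smaller than this are ignored
        self.threshold = threshold  # alarm level for either accumulator
        self.pos = 0.0              # accumulated upward deviation
        self.neg = 0.0              # accumulated downward deviation

    def update(self, x):
        """Return 'up', 'down', or None after observing x."""
        self.pos = max(0.0, self.pos + (x - self.target - self.slack))
        self.neg = max(0.0, self.neg + (self.target - x - self.slack))
        if self.pos > self.threshold:
            return "up"
        if self.neg > self.threshold:
            return "down"
        return None

def first_alarm(stream, detector):
    """Index and direction of the first alarm, or None if the stream stays in control."""
    for i, x in enumerate(stream):
        direction = detector.update(x)
        if direction is not None:
            return i, direction
    return None

# The mean jumps from 0 to 1 at index 100; each later sample adds 0.5 to `pos`.
print(first_alarm([0.0] * 100 + [1.0] * 100, Cusum(target=0.0)))  # (110, 'up')
```

The detection delay here is eleven samples because each post-change observation contributes `1.0 - 0.5 = 0.5` and the accumulator must exceed 5.0. A smaller threshold shortens the delay and raises the false alarm rate on noisy data.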

The following table compares the major detector families on the dimensions practitioners care about.

| Detector | Needs labels | Strength | Weakness |
| --- | --- | --- | --- |
| KS test | No | Simple, non-parametric, per-feature | Misses shifts visible only in the joint distribution |
| PSI | No | Industry standard in credit risk | Bin-dependent |
| MMD | No | Multivariate, kernel-flexible | Cost grows with sample size |
| Domain classifier | No | Captures multivariate shifts | Needs separate model |
| DDM | Yes | Lightweight, easy to implement | Slow on gradual drift |
| EDDM | Yes | Better on gradual drift | Noise-sensitive early on |
| ADWIN | Yes | Rigorous bounds, parameter-light | Memory scales with window |
| Page-Hinkley | Yes | Classic, well understood | Single threshold tuning |
| CUSUM | Only when run on the error stream | Optimal for known mean shift | Parametric assumption |

## How do models adapt to concept drift?

Once drift is detected, or pre-emptively even without detection, several strategies keep the model aligned with the current distribution [1][2].

**Retraining** is the simplest approach. It can be **scheduled** (retrain every day, week, or month) or **triggered** (retrain when a detector signals drift). Full retraining starts from scratch; **incremental retraining** continues training the existing model on new data, which is cheaper but risks catastrophic forgetting in neural networks.

**Sliding window** training keeps only the most recent `w` samples; the window size trades off recency and statistical efficiency, and adaptive windows like [ADWIN](/wiki/adwin) set `w` automatically. **Weighted learning** assigns each example a weight that decays with age, often exponentially, the streaming analogue of exponential moving averages.
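
The idea of a window that sets its own size can be illustrated with a deliberately naive version of the ADWIN test. This sketch assumes values in `[0, 1]`, checks every split point on each update (linear cost, unlike the bucket structure of the real algorithm), and drops the whole older sub-window at the first split that fails the Hoeffding-style bound; the class name and the `max_size` cap are hypothetical.

```python
import math
from collections import deque

class NaiveAdaptiveWindow:
    """Simplified ADWIN-style window over values in [0, 1]."""

    def __init__(self, delta=0.002, max_size=500):
        self.delta = delta                      # confidence parameter
        self.window = deque(maxlen=max_size)    # cap keeps the naive scan cheap

    def update(self, x):
        """Append x; return True if stale data was dropped."""
        self.window.append(x)
        shrunk = False
        while self._drop_stale_prefix():
            shrunk = True
        return shrunk

    def _drop_stale_prefix(self):
        values = list(self.window)
        n = len(values)
        if n < 2:
            return False
        total = sum(values)
        delta_prime = self.delta / n
        head = 0.0
        for split in range(1, n):
            head += values[split - 1]
            n0, n1 = split, n - split
            m = 1.0 / (1.0 / n0 + 1.0 / n1)     # harmonic-mean style sample size
            eps = math.sqrt(math.log(4.0 / delta_prime) / (2.0 * m))
            if abs(head / n0 - (total - head) / n1) > eps:
                for _ in range(split):          # forget the older sub-window
                    self.window.popleft()
                return True
        return False

# The signal level moves from 0.2 to 0.8 after 300 steps.
detector = NaiveAdaptiveWindow()
shrink_steps = [i for i, x in enumerate([0.2] * 300 + [0.8] * 100) if detector.update(x)]
print("window shrank at steps", shrink_steps, "final size", len(detector.window))
```

While the signal is stationary the window simply grows; shortly after the level changes, the old segment is discarded and the window restarts from the recent data, which is the behaviour a learner trained on the window contents would rely on.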

**Ensemble methods** combine multiple models and update their composition over time. SEA (Street and Kim, 2001), Oza's Online Bagging and Boosting, and Adaptive Random Forest (ARF) by Gomes et al. (2017) all maintain base learners and add or remove them in response to performance signals [15]. ARF, which trains a [random forest](/wiki/random_forest)-style ensemble of Hoeffding trees with a drift detector attached to each tree (ADWIN by default), is one of the strongest off-the-shelf streaming classifiers.

**Online learning** algorithms update on every incoming example without storing a finite training set. [Hoeffding Trees](/wiki/hoeffding_tree) (Domingos and Hulten, 2000) build decision trees from streams using Hoeffding bounds to decide when to split; [Naive Bayes](/wiki/naive_bayes) admits a trivial online update because its sufficient statistics are simple counts; SGD provides online updates for linear and neural models.
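
The claim that Naive Bayes updates online through simple counts can be shown with a small categorical implementation. This is a sketch, not a library API: the class is hypothetical, the `learn_one` / `predict_one` method names merely echo the convention used by River, and Laplace smoothing with `smoothing=1.0` is an assumed default.

```python
import math
from collections import defaultdict

class OnlineNaiveBayes:
    """Categorical Naive Bayes whose state is nothing but counts."""

    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing
        self.class_counts = defaultdict(int)     # class -> count
        self.feature_counts = defaultdict(int)   # (class, feature index, value) -> count
        self.values = defaultdict(set)           # feature index -> values seen so far

    def learn_one(self, x, y):
        """Update the sufficient statistics with one labelled example."""
        self.class_counts[y] += 1
        for i, v in enumerate(x):
            self.feature_counts[(y, i, v)] += 1
            self.values[i].add(v)

    def predict_one(self, x):
        """Return the most probable class, or None before any training."""
        if not self.class_counts:
            return None
        total = sum(self.class_counts.values())
        best, best_score = None, -math.inf
        for c, n_c in self.class_counts.items():
            score = math.log(n_c / total)
            for i, v in enumerate(x):
                k = len(self.values[i] | {v})    # number of distinct values, incl. v
                count = self.feature_counts.get((c, i, v), 0)
                score += math.log((count + self.smoothing) / (n_c + self.smoothing * k))
            if score > best_score:
                best, best_score = c, score
        return best

# Test-then-train loop: predict each example before learning from it.
model = OnlineNaiveBayes()
stream = [((a, b), a) for _ in range(10) for a in (0, 1) for b in (0, 1)]  # label = first feature
correct = 0
for x, y in stream:
    correct += int(model.predict_one(x) == y)
    model.learn_one(x, y)
print(f"prequential accuracy: {correct / len(stream):.2f}")
```

Because the state is a set of counters, forgetting can be added by decaying the counts or by subtracting the contribution of examples that leave a sliding window.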

**Test-time adaptation** for [neural networks](/wiki/neural_network) updates a small number of parameters at inference time using only new inputs, without labels. TENT (Wang et al., 2021) re-estimates batch normalisation statistics on the test data and updates the normalisation layers' affine parameters to minimise prediction entropy on the new domain; MEMO extends the idea with augmentations [16].

**[Domain adaptation](/wiki/domain_adaptation)** is the body of techniques from [transfer learning](/wiki/transfer_learning) for the case where source and target distributions differ. Importance reweighting, feature alignment, and adversarial domain-invariant representations are the classical recipes.

**Active learning** complements adaptation by choosing which examples to query for labels, focusing annotation effort on regions of input space that have actually moved.

## How is concept drift handled in MLOps?

Handling drift in production is a [model monitoring](/wiki/model_monitoring) problem. A modern monitoring stack tracks three layers: features, predictions, and performance [5][7].

**Feature monitoring** computes statistics on each input feature in a recent window and compares them to a reference; KS or PSI tests run per feature, and joint statistics like correlations are also tracked because some drifts only manifest in the joint distribution.

**Prediction monitoring** watches the distribution of model outputs. A sudden shift in predicted class probabilities, average regression output, or rate of high-confidence predictions can signal drift without labels. Prediction drift is sometimes the first observable signal because predictions arrive in real time while labels arrive with delay.

**Performance monitoring** computes accuracy, AUC, calibration, log loss, or business metrics when labels arrive. Alerts are configured on absolute thresholds (AUC below 0.7), relative drops (AUC fell more than 5 percent from baseline), or detector signals.
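
The two threshold styles can be expressed as a small rule. The sketch below uses the example figures from the paragraph above (an AUC floor of 0.7 and a 5 percent relative drop) as defaults; the function name and return format are hypothetical.

```python
def performance_alerts(current_auc, baseline_auc, floor=0.7, max_relative_drop=0.05):
    """Return the list of alert types triggered by the current AUC."""
    alerts = []
    if current_auc < floor:                                   # absolute threshold
        alerts.append("absolute")
    relative_drop = (baseline_auc - current_auc) / baseline_auc
    if relative_drop > max_relative_drop:                     # relative to baseline
        alerts.append("relative")
    return alerts

print(performance_alerts(0.80, 0.86))  # ['relative']: about a 7% drop, still above the floor
print(performance_alerts(0.68, 0.70))  # ['absolute']: under the floor, under a 3% drop
```

The two rules catch different failures: a strong model can lose a lot of ground before it hits an absolute floor, while a model that was marginal from the start can cross the floor with only a small relative change.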

Feature drift triggers investigation; prediction drift triggers a closer look at the pipeline; performance drift triggers retraining. Without label access, feature and prediction monitoring are the only options, which is why they receive disproportionate attention in commercial tools [7].

A recurring lesson from production ML is that the components of a system are deeply entangled, so even a small data change ripples through the whole model. Sculley et al. named this the CACE principle: "Changing Anything Changes Everything", which is precisely why drift cannot be patched feature by feature and instead demands system-level monitoring and retraining discipline [5].

The following table summarises the major commercial and open-source monitoring platforms.

| Platform | Type | Notable features |
| --- | --- | --- |
| [Evidently AI](/wiki/evidently_ai) | Open source + cloud | Reports and tests for data and prediction drift |
| [Arize AI](/wiki/arize_ai) | Commercial | Embedding drift, slice analysis, performance debugging |
| [WhyLabs](/wiki/whylabs) | Commercial | whylogs profiling library, sketches at scale |
| [Fiddler AI](/wiki/fiddler_ai) | Commercial | Explainability plus drift, governance focus |
| DataRobot MLOps | Commercial | Automated ML with drift monitoring built in |
| SageMaker Model Monitor | Cloud | Native AWS drift checks, [SageMaker](/wiki/amazon_sagemaker) integration |
| [Azure ML](/wiki/azure_ml) Data Drift | Cloud | Dataset drift between baseline and target |
| Vertex AI Model Monitoring | Cloud | Drift and skew for [Vertex AI](/wiki/vertex_ai) endpoints |
| Comet ML, Neptune.ai | Commercial | Experiment tracking with monitoring add-ons |
| [Weights and Biases](/wiki/wandb) | Commercial | Experiment tracking and production monitoring |

Monitoring requires a stable definition of "reference". A [feature store](/wiki/feature_store) helps by providing a versioned, queryable record of training features so monitoring at serving time has something well-defined to compare against. Lineage tracking ties predictions back to the exact feature and model versions that produced them, which is essential for diagnosing drift in complex pipelines.

## Real-world examples and case studies

The **COVID-19 pandemic in early 2020** was a textbook drift event at planetary scale. Mobility models, demand forecasts, fraud detectors, recommender systems, ad auction predictors, and clinical risk models simultaneously broke as user behaviour shifted within days. Stocking-up purchases skewed e-commerce baselines, work-from-home patterns broke commute-time models, and elective-surgery cancellations made hospital length-of-stay predictors unusable [17]. The episode is often cited as the moment drift monitoring graduated from a niche concern to a board-level risk.

**[Fraud detection](/wiki/fraud_detection)** is the canonical adversarial drift problem. Card-not-present fraud, account takeover, and synthetic identity attacks evolve continuously as fraudsters probe a defender's blind spots. Deployed fraud models routinely retrain on a daily or hourly cadence with champion-challenger evaluation deciding when to promote shadow models.

**[Recommender systems](/wiki/recommender_system)** experience drift on both sides of the user-item matrix as new items appear, old ones become unavailable, and user preferences shift. Modern recommenders combine streaming candidate generation, frequent re-ranking updates, and online experimentation to track a moving target.

**Credit scoring** under macroeconomic regimes is sensitive to label drift. Default rates rise in recessions and fall in expansions, and the relationship between application features and default risk shifts with credit conditions. US and EU regulators require model risk management practices that address performance monitoring and recalibration [9].

**Manufacturing sensor models** drift with equipment wear: a predictive maintenance model trained on a new machine sees different vibration spectra a year later. **Climate and environmental models** face slow incremental drift as the underlying baseline itself moves; hydrological and air quality models are routinely re-fit on rolling baselines.

The **"Hidden Technical Debt" paper** (Sculley et al., NeurIPS 2015) gave a now-classic taxonomy of failure modes in production ML systems and coined many of the terms in current use: glue code, pipeline jungles, data dependencies, undeclared consumers, configuration debt, and changes in the external world [5]. Drift sits squarely in the last category.

## Theoretical foundations

Classical statistical learning theory assumes training and test data are drawn independently from the same distribution. Concept drift breaks this assumption, and a body of theoretical work has explored what learning under non-stationarity can guarantee.

Bartlett's 1992 paper proposed a **[PAC-learning](/wiki/statistical_learning_theory)**-style framework for slowly changing distributions, bounding how fast the target can change while still permitting learning to a given accuracy [18]. Helmbold and Long (1994) gave related results for tracking slowly changing concepts. The common message is that learning remains possible only when the rate of drift is small relative to the desired error and the complexity of the hypothesis class.

**Online learning regret bounds**, developed for prediction-with-expert-advice and online convex optimisation, give a complementary view. Shifting-regret algorithms specifically target piecewise-stationary streams. **[VC theory](/wiki/vc_theory)** generalisation bounds depend on i.i.d. sampling; mixing-condition bounds extend them to weakly dependent stationary processes but do not cover concept drift. Bounds for non-stationary settings typically introduce a measure of total variation between source and target distributions and pay a price proportional to it.

The practical consequence is that no model trained on a finite past can be guaranteed to perform well on an arbitrary future. Drift management is therefore not a one-time analysis but an ongoing engineering practice.

## How do you tell drift apart from noise and rare events?

A drift detector that fires on every transient fluctuation is worse than no detector at all because it forces unnecessary retraining and triggers alert fatigue. The trade-off between false alarms and detection latency is unavoidable: ADWIN, DDM, EDDM, Page-Hinkley, and CUSUM all expose threshold parameters that move the detector along this trade-off, so lowering the false alarm rate lengthens the time to detection and vice versa.

Practical systems often combine signals, for example firing a drift alert only when a feature distribution test, a prediction distribution test, and a performance-based detector agree. Requiring agreement reduces the false alarm rate at the cost of additional latency. Some teams add manual review before automatic retraining when the cost of a bad retraining run is high.
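
One way to encode such an agreement rule is a small gate that counts votes. In the sketch below, the `patience` parameter (requiring agreement over several consecutive checks so that short blips are ignored) is an assumption added for illustration, and the class and signal names are hypothetical.

```python
class AgreementGate:
    """Alert only when enough signals agree for enough consecutive checks."""

    def __init__(self, min_votes, patience=1):
        self.min_votes = min_votes   # how many signals must fire together
        self.patience = patience     # how many consecutive checks must agree
        self.streak = 0

    def update(self, signals):
        """signals: dict mapping signal name to True/False for this check."""
        votes = sum(1 for fired in signals.values() if fired)
        self.streak = self.streak + 1 if votes >= self.min_votes else 0
        return self.streak >= self.patience

gate = AgreementGate(min_votes=3, patience=2)
all_fire = {"feature_test": True, "prediction_test": True, "performance_detector": True}
one_quiet = {"feature_test": True, "prediction_test": True, "performance_detector": False}
print(gate.update(all_fire))   # False: agreement seen once, patience not yet met
print(gate.update(all_fire))   # True: agreement on two consecutive checks
print(gate.update(one_quiet))  # False: the streak resets
```

Lowering `min_votes` below the number of signals turns strict agreement into majority voting, which recovers some latency at the price of more false alarms.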

Distinguishing drift from data quality issues is also essential. A feature that suddenly contains 50 percent missing values is not drifting; the upstream pipeline is broken.

## Open-source libraries

A mature open-source ecosystem supports drift detection and adaptation in Python and Java. **River**, the 2020 merger of scikit-multiflow and Creme, provides a uniform API for online classification, regression, clustering, and anomaly detection, plus drift detectors including ADWIN, DDM, EDDM, HDDM, KSWIN, and Page-Hinkley [19]. **MOA (Massive Online Analysis)** is the Java counterpart from Waikato, the standard tool in academic streaming ML [20].

**Alibi Detect** (Seldon) focuses on drift, outlier, and adversarial detection for production, implementing MMD, learned kernels, classifier-based drift detection, KS tests with multiple-comparison correction, and tests over embeddings [21]. **Evidently** is both a library and hosted service whose Python package generates HTML reports and JSON metrics for data drift, target drift, and performance [22]. Other tools include **NannyML** for performance estimation without labels, **whylogs** for compact statistical profiles, and **TorchDrift** for drift detection in [PyTorch](/wiki/pytorch) workflows.

The field owes much to several researchers. [Joao Gama](/wiki/joao_gama) of the University of Porto is lead author of the widely cited 2014 ACM Computing Surveys paper and co-developed DDM. [Albert Bifet](/wiki/albert_bifet) of Telecom Paris and Waikato co-developed ADWIN, MOA, and many of the streaming ensemble methods now in standard use.

## See also

- [Data drift](/wiki/covariate_shift), [Covariate shift](/wiki/covariate_shift), [Label shift](/wiki/label_shift), [Concept evolution](/wiki/concept_evolution)
- [Online learning](/wiki/online_learning), [Streaming data](/wiki/streaming_data), [MLOps](/wiki/mlops), [Model monitoring](/wiki/model_monitoring), [Model evaluation](/wiki/model_evaluation)
- [Domain adaptation](/wiki/domain_adaptation), [Transfer learning](/wiki/transfer_learning)
- [Hoeffding tree](/wiki/hoeffding_tree), [Random forest](/wiki/random_forest), [Naive Bayes](/wiki/naive_bayes)
- [Recommender system](/wiki/recommender_system), [Fraud detection](/wiki/fraud_detection)
- [PAC learning](/wiki/statistical_learning_theory), [VC theory](/wiki/vc_theory)

## References

[1] Gama, J., Zliobaite, I., Bifet, A., Pechenizkiy, M., and Bouchachia, A. (2014). "A survey on concept drift adaptation." *ACM Computing Surveys* 46(4), 1-37. https://dl.acm.org/doi/10.1145/2523813

[2] Lu, J., Liu, A., Dong, F., Gu, F., Gama, J., and Zhang, G. (2018). "Learning under concept drift: A review." *IEEE Transactions on Knowledge and Data Engineering* 31(12), 2346-2363. https://arxiv.org/abs/2004.05785

[3] Schlimmer, J. C., and Granger, R. H. (1986). "Incremental learning from noisy data." *Machine Learning* 1(3), 317-354. https://link.springer.com/article/10.1007/BF00116895

[4] Tsymbal, A. (2004). "The problem of concept drift: definitions and related work." Technical report, Trinity College Dublin. https://www.scss.tcd.ie/publications/tech-reports/reports.04/TCD-CS-2004-15.pdf

[5] Sculley, D., et al. (2015). "Hidden technical debt in machine learning systems." *NeurIPS*. https://papers.nips.cc/paper/2015/hash/86df7dcfd896fcaf2674f757a2463eba-Abstract.html

[6] Moreno-Torres, J. G., et al. (2012). "A unifying view on dataset shift in classification." *Pattern Recognition* 45(1), 521-530. https://www.sciencedirect.com/science/article/abs/pii/S0031320311002901

[7] Huyen, C. (2022). *Designing Machine Learning Systems*. O'Reilly. https://www.oreilly.com/library/view/designing-machine-learning/9781098107956/

[8] Rabanser, S., Gunnemann, S., and Lipton, Z. C. (2019). "Failing loudly: An empirical study of methods for detecting dataset shift." *NeurIPS*. https://arxiv.org/abs/1810.11953

[9] Siddiqi, N. (2017). *Intelligent Credit Scoring*. Wiley. https://www.wiley.com/en-us/Intelligent+Credit+Scoring-p-9781119279150

[10] Gama, J., Medas, P., Castillo, G., and Rodrigues, P. (2004). "Learning with drift detection." *SBIA*, 286-295. https://link.springer.com/chapter/10.1007/978-3-540-28645-5_29

[11] Baena-Garcia, M., et al. (2006). "Early drift detection method." *IWKDDS*. https://www.cs.upc.edu/~abifet/EDDM.pdf

[12] Frias-Blanco, I., et al. (2015). "Online and non-parametric drift detection methods based on Hoeffding's bounds." *IEEE TKDE* 27(3), 810-823. https://ieeexplore.ieee.org/document/6871418

[13] Bifet, A., and Gavalda, R. (2007). "Learning from time-changing data with adaptive windowing." *SDM*, 443-448. https://epubs.siam.org/doi/10.1137/1.9781611972771.42

[14] Page, E. S. (1954). "Continuous inspection schemes." *Biometrika* 41(1-2), 100-115. https://academic.oup.com/biomet/article-abstract/41/1-2/100/461922

[15] Gomes, H. M., et al. (2017). "Adaptive random forests for evolving data stream classification." *Machine Learning* 106(9-10), 1469-1495. https://link.springer.com/article/10.1007/s10994-017-5642-8

[16] Wang, D., et al. (2021). "Tent: Fully test-time adaptation by entropy minimization." *ICLR*. https://arxiv.org/abs/2006.10726

[17] Heaven, W. D. (2020). "Our weird behavior during the pandemic is messing with AI models." *MIT Technology Review*. https://www.technologyreview.com/2020/05/11/1001563/covid-pandemic-broken-ai-machine-learning-amazon-retail-fraud-humans-in-the-loop/

[18] Bartlett, P. L. (1992). "Learning with a slowly changing distribution." *COLT*, 243-252. https://dl.acm.org/doi/10.1145/130385.130412

[19] Montiel, J., et al. (2021). "River: machine learning for streaming data in Python." *JMLR* 22(110), 1-8. https://www.jmlr.org/papers/v22/20-1380.html

[20] Bifet, A., Holmes, G., Kirkby, R., and Pfahringer, B. (2010). "MOA: Massive Online Analysis." *JMLR* 11, 1601-1604. https://www.jmlr.org/papers/v11/bifet10a.html

[21] Klaise, J., et al. (2020). "Monitoring and explainability of models in production." *ICML Workshop*. https://arxiv.org/abs/2007.06299

[22] Evidently AI documentation. https://docs.evidentlyai.com/

[23] Adams, R. P., and MacKay, D. J. C. (2007). "Bayesian online changepoint detection." Technical report, Cambridge. https://arxiv.org/abs/0710.3742

[24] Webb, G. I., Hyde, R., Cao, H., Nguyen, H. L., and Petitjean, F. (2016). "Characterizing concept drift." *DMKD* 30, 964-994. https://link.springer.com/article/10.1007/s10618-015-0448-4

[25] Alibi Detect docs. https://docs.seldon.io/projects/alibi-detect/en/stable/ ; River docs. https://riverml.xyz/