See also: Machine learning terms
A dynamic model in machine learning is a model that is retrained frequently or continuously as new data arrives, so that its parameters track changes in the underlying data distribution over time. The Google Machine Learning Glossary defines a dynamic model as one that is "frequently (maybe even continuously) retrained," describes it as a "lifelong learner that constantly adapts to evolving data," and notes that the term is synonymous with online model [1]. The opposite is a static model, which is trained once on a snapshot of historical data and then served unchanged for some period.
Dynamic models are the natural choice for production environments where the relationship between features and labels changes over hours, minutes, or seconds. Common examples include recommender systems, online ad ranking, fraud detection, dynamic pricing, news ranking, and short-video feed personalization. The general term for the process of routinely refitting a model on fresh data is continuous training (CT), which sits alongside continuous integration and continuous delivery (CI/CD) in the MLOps pipeline [2].
A static model fixes its parameters at training time and treats inference as a separate stage that may run for days, months, or years on the same weights. A dynamic model collapses that boundary. New observations stream into a training process that is either always running or kicked off on a short cadence (every few minutes, every hour, or every day), and the resulting weights replace or augment the model in production.
The Google Machine Learning Crash Course frames the choice as static training versus dynamic training. In static training, a model is trained once on a fixed dataset and then served for a while; in dynamic training, the model is trained continuously or at least frequently, and the most recently trained version is the one that gets served [3]. The crash course also distinguishes static inference (predictions are computed offline and cached) from dynamic inference (predictions are computed on demand at request time). A system can be dynamic in training, dynamic in inference, or both. A common production pattern is dynamic training with dynamic inference, since the value of fresh weights would otherwise be lost in a stale prediction cache.
In the online machine learning literature, the same idea is described as processing data "in a sequential order" and updating the predictor at each step, in contrast with batch learning that fits a model in one pass over the entire training set [4]. From this point of view, a dynamic model is the operational embodiment of an online learning algorithm running indefinitely against a live data stream.
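In code, the sequential protocol is a predict-then-learn loop: every example is first used to score the current model, then to update it. The sketch below uses a running-mean regressor as a stand-in model; the names and structure are illustrative, not any particular library's API.

```python
# Minimal predict-then-learn loop: each example is first used for
# evaluation, then for an incremental parameter update. The "model"
# here is just a running mean, a deliberately trivial placeholder.
def online_loop(stream):
    n, mean, abs_err = 0, 0.0, 0.0
    for y in stream:
        pred = mean                   # 1. predict before seeing the label
        abs_err += abs(y - pred)      # 2. score the prediction
        n += 1
        mean += (y - mean) / n        # 3. update the model incrementally
    return mean, abs_err / n

model, mae = online_loop([2.0, 4.0, 6.0, 4.0])
```

Because evaluation always happens before the update, the accumulated error is an honest out-of-sample estimate even though no holdout set exists.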
Static and dynamic models differ along several axes. The differences matter because they determine which engineering investments a team has to make and what failure modes they have to monitor.
| Property | Static (offline) model | Dynamic (online) model |
|---|---|---|
| Training cadence | Once, or every few weeks | Continuous, hourly, or per-minute |
| Data assumption | Stationary distribution | Distribution can shift |
| Memory of past data | Full historical pass each time | Single pass or small replay buffer |
| Hardware footprint | Periodic large training jobs | Always-on training pipeline |
| Time to incorporate new label | Hours to weeks | Seconds to minutes |
| Failure modes | Drift, staleness | Drift detection bugs, runaway feedback, label leakage |
| Validation strategy | Train, validate, holdout, ship | Shadow deployment, A/B testing, progressive validation |
| Rollback model | Redeploy previous artifact | Snapshot weights and revert |
| Monitoring requirement | Input distribution and label shift | Input distribution, label shift, training health, model freshness |
| Typical use case | Image classification, demand forecasting at weekly cadence | Ad CTR prediction, fraud scoring, short-form video feed |
Google's documentation makes the trade-off explicit. Static training is "simpler to build and test," but "if you train offline, then the model has no way to incorporate new data as it arrives," which leads to staleness when the distribution shifts. Dynamic training keeps the model fresh but "requires continuous building, testing, and releasing cycles" and a heavier monitoring stack [3]. Even teams that pick static training are advised to monitor input distributions in production, because data drift can degrade a frozen model just as easily as a live one.
Chip Huyen's Designing Machine Learning Systems (O'Reilly, 2022) makes a related distinction between stateless retraining and stateful training. In stateless retraining, each training run starts from scratch on a fresh window of data; in stateful training, each run continues from the previous model's weights, only doing a partial update on new examples. Both fall under the dynamic model umbrella, but stateful training is the lower-cost option once a baseline exists [5].
Dynamic models depend on algorithms that can update parameters incrementally without revisiting the entire training set. The online learning literature provides several families of such algorithms, all sharing the property that they consume one example (or a small minibatch) at a time and produce a new parameter vector after each update.
Stochastic gradient descent (SGD) is the workhorse of online learning. After each example, SGD computes a gradient of the loss with respect to the parameters and takes a step in the negative gradient direction. The pure online form of SGD has constant memory cost O(d) for d parameters and constant time per example, which makes it scale to data streams of arbitrary length [4]. Most production dynamic models, including ad CTR predictors and embedding-based recommenders, are SGD-based variants.
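A minimal sketch of the pure online form in plain Python, with an illustrative learning rate and a noiseless synthetic stream; a production system would use minibatches and adaptive step sizes:

```python
import random

# One-pass SGD for linear regression on a stream: O(d) memory and
# constant work per example. Learning rate is an illustrative choice.
def sgd_stream(stream, d, lr=0.05):
    w = [0.0] * d
    for x, y in stream:
        pred = sum(wi * xi for wi, xi in zip(w, x))
        g = pred - y                      # gradient of 1/2 squared error
        for i in range(d):
            w[i] -= lr * g * x[i]         # single incremental step
    return w

# Synthetic stream drawn from y = 2*x0 - 1*x1 with no noise.
random.seed(0)
stream = []
for _ in range(5000):
    x = [random.uniform(-1, 1), random.uniform(-1, 1)]
    stream.append((x, 2 * x[0] - 1 * x[1]))
w = sgd_stream(stream, d=2)   # w converges toward [2, -1]
```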
The perceptron, originally proposed by Frank Rosenblatt in 1958, is one of the earliest online learning algorithms. It updates a linear classifier only when it makes a mistake on the current example. Novikoff's 1962 mistake-bound theorem shows that on a linearly separable dataset with margin gamma and example radius R, the perceptron makes at most (R/gamma)^2 mistakes regardless of how many examples it sees [6]. The mistake bound framework introduced by the perceptron analysis still anchors the way online algorithms are evaluated theoretically.
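The mistake-driven update is only a few lines. The toy stream below is separable by the hyperplane x0 + x1 = 0, so the Novikoff bound guarantees a finite number of mistakes no matter how long the stream runs:

```python
# Mistake-driven perceptron on a linearly separable toy stream.
# The weights change only when the current example is misclassified.
def perceptron(stream, d):
    w = [0.0] * d
    mistakes = 0
    for x, y in stream:                   # y in {-1, +1}
        score = sum(wi * xi for wi, xi in zip(w, x))
        if y * score <= 0:                # mistake (or tie): update
            mistakes += 1
            for i in range(d):
                w[i] += y * x[i]
    return w, mistakes

# Separable data, repeated 50 times to simulate a long stream.
stream = [([1.0, 2.0], 1), ([2.0, 0.5], 1),
          ([-1.0, -1.5], -1), ([-2.0, -0.5], -1)] * 50
w, mistakes = perceptron(stream, d=2)     # mistake count stays bounded
```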
Online passive-aggressive (PA) algorithms, introduced by Crammer, Dekel, Keshet, Shalev-Shwartz, and Singer in 2006, take a more aggressive update than the perceptron. After each example, PA chooses the smallest weight change that satisfies a margin constraint on the current example. This produces a closed-form analytical update and tight regret bounds for binary classification, regression, multi-class classification, uniclass prediction, and sequence labelling [7]. The PA family is a common choice when the training stream contains both correctly and incorrectly classified examples and the model needs to react sharply to surprises.
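A sketch of the closed-form update for the PA-I classification variant; C is the aggressiveness cap and its value here is illustrative:

```python
# Passive-aggressive (PA-I) update for binary classification: the
# smallest weight change that restores a unit margin on the current
# example, with the step size capped by the aggressiveness parameter C.
def pa_update(w, x, y, C=1.0):
    score = sum(wi * xi for wi, xi in zip(w, x))
    loss = max(0.0, 1.0 - y * score)              # hinge loss
    if loss > 0.0:
        tau = min(C, loss / sum(xi * xi for xi in x))
        w = [wi + tau * y * xi for wi, xi in zip(w, x)]
    return w

w = pa_update([0.0, 0.0], [1.0, 1.0], +1)
```

After this single update the example sits exactly on the margin (when the cap C does not bind), which is what "aggressive" means here: zero loss on the example just seen.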
Martin Zinkevich's 2003 paper "Online Convex Programming and Generalized Infinitesimal Gradient Ascent" generalized SGD to the online convex optimization (OCO) framework. In OCO, an adversary picks a sequence of convex losses, the learner picks parameters before seeing each loss, and the goal is to minimize regret against the best fixed comparator in hindsight. Zinkevich proved that simple projected gradient descent with stepsize eta_t = 1/sqrt(t) achieves O(sqrt(T)) regret on Lipschitz convex losses, which is the standard reference rate for online learning [8].
Follow the Regularized Leader (FTRL) is the dual of online gradient descent. At each step, FTRL solves an optimization that minimizes the cumulative loss seen so far plus a regularization term. The FTRL-Proximal variant, introduced by H. Brendan McMahan and colleagues at Google, was designed for click-through rate (CTR) prediction at scale and combines L1 regularization (for sparsity) with per-coordinate adaptive learning rates similar to AdaGrad. The KDD 2013 paper "Ad Click Prediction: a View from the Trenches" is the canonical reference for production-grade FTRL-Proximal on a live ad serving system [9].
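A compact sketch of the per-coordinate FTRL-Proximal update for logistic regression, following the form of the update in the KDD 2013 paper; the hyperparameter values and the toy training loop are illustrative, not the paper's:

```python
import math

# Per-coordinate FTRL-Proximal for logistic regression. L1 drives
# coordinates to exact zero (sparsity); per-coordinate accumulated
# squared gradients give AdaGrad-style adaptive learning rates.
class FTRLProximal:
    def __init__(self, d, alpha=0.1, beta=1.0, l1=1.0, l2=0.1):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = [0.0] * d        # accumulated adjusted gradients
        self.n = [0.0] * d        # accumulated squared gradients

    def weights(self):
        w = []
        for zi, ni in zip(self.z, self.n):
            if abs(zi) <= self.l1:
                w.append(0.0)     # L1 keeps this coordinate at zero
            else:
                sign = 1.0 if zi > 0 else -1.0
                w.append(-(zi - sign * self.l1) /
                         ((self.beta + math.sqrt(ni)) / self.alpha + self.l2))
        return w

    def update(self, x, y):       # y in {0, 1}
        w = self.weights()
        p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
        for i, xi in enumerate(x):
            g = (p - y) * xi      # logistic loss gradient
            sigma = (math.sqrt(self.n[i] + g * g)
                     - math.sqrt(self.n[i])) / self.alpha
            self.z[i] += g - sigma * w[i]
            self.n[i] += g * g
        return p

model = FTRLProximal(d=2)
for _ in range(1000):
    model.update([1.0, 1.0], 1)   # positives carry feature 0
    model.update([0.0, 1.0], 0)   # negatives do not
w = model.weights()               # feature 0 ends up with positive weight
```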
Recursive least squares (RLS) is the online analog of ordinary least squares for linear regression. It maintains an inverse covariance estimate that is updated by the Sherman-Morrison formula after each example, achieving O(d^2) per-step time and O(d^2) memory. RLS is a natural choice when an online linear model needs second-order information without the full cost of refitting an offline regression [4].
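A pure-Python sketch of the RLS recursion; `lam` sets the initial scale of the inverse-covariance estimate, an initialization choice equivalent to a small ridge penalty:

```python
# Recursive least squares via the Sherman-Morrison rank-one update.
# P tracks the inverse covariance; each example costs O(d^2) time.
def rls(stream, d, lam=1000.0):
    w = [0.0] * d
    P = [[lam if i == j else 0.0 for j in range(d)] for i in range(d)]
    for x, y in stream:
        Px = [sum(P[i][j] * x[j] for j in range(d)) for i in range(d)]
        denom = 1.0 + sum(x[i] * Px[i] for i in range(d))
        k = [pi / denom for pi in Px]                    # gain vector
        err = y - sum(w[i] * x[i] for i in range(d))
        w = [w[i] + k[i] * err for i in range(d)]
        # P <- P - k (x^T P): Sherman-Morrison downdate
        xP = [sum(x[i] * P[i][j] for i in range(d)) for j in range(d)]
        P = [[P[i][j] - k[i] * xP[j] for j in range(d)] for i in range(d)]
    return w

# Noiseless data from y = 3*x0 + 2*x1: RLS recovers the coefficients.
stream = [([1.0, 0.0], 3.0), ([0.0, 1.0], 2.0),
          ([1.0, 1.0], 5.0), ([2.0, 1.0], 8.0)]
w = rls(stream, d=2)
```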
Decision trees pose a special challenge for online learning because each split decision in principle depends on the entire dataset. Domingos and Hulten's 2000 paper "Mining High-Speed Data Streams" introduced the Very Fast Decision Tree (VFDT), also called the Hoeffding tree, which uses the Hoeffding bound to prove that a small sample is sufficient to choose a split with high probability. The result is a tree that can ingest tens of thousands of examples per second on commodity hardware while approximating the tree that batch training on the same data would have produced [10]. Hoeffding trees and their adaptive successors (HAT, EFDT) remain the most common online tree learners in streaming pipelines.
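The bound VFDT relies on is a one-liner. The example below computes how tight the split criterion becomes after 1,000 examples for two-class information gain (range R = 1); the delta value is an illustrative confidence setting:

```python
import math

# Hoeffding bound as used by VFDT: with probability 1 - delta, the
# true mean of a range-R statistic lies within eps of the sample
# mean of n observations. A split is committed once the observed
# gain gap between the two best attributes exceeds eps.
def hoeffding_eps(R, delta, n):
    return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

# Two-class information gain has range R = log2(2) = 1.
eps_1k = hoeffding_eps(R=1.0, delta=1e-7, n=1000)   # roughly 0.09
```

Because eps shrinks like 1/sqrt(n), a stream of a few thousand examples usually suffices to make split decisions that match what batch training would have chosen.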
For settings where memory is bounded, several variants of online learners maintain a fixed-size pool of support vectors or prototypes. Examples include the Forgetron, Randomized Budget Perceptron, and Online Passive-Aggressive on a Budget. These are useful in resource-constrained dynamic models where the support set cannot be allowed to grow without bound.
Dynamic models are valuable precisely because real-world data shifts. The literature distinguishes several kinds of shift, and the choice of detection algorithm depends on which kind is dominant in the application.
| Type | What changes | Also called | Typical example |
|---|---|---|---|
| Covariate shift / data drift | P(X) | Data drift, feature drift | Customer demographic mix changes after product launch |
| Label shift | P(Y) | Prior probability shift | Fraud rate rises during a holiday weekend |
| Concept drift | P(Y\|X) | Real concept drift | Definition of spam evolves as spammers change tactics |
| Posterior shift | P(X\|Y) | Conditional shift | New imaging device produces brighter X-rays |
The distinction between data drift and concept drift is the most cited in production ML literature. Data drift occurs when the input distribution changes but the input-output relationship stays the same; concept drift occurs when the input-output relationship itself changes [11]. Both can degrade a static model, but only retraining on fresh labels (a dynamic model) can fix concept drift.
Concept drift can be abrupt (an overnight regime change), gradual (an old concept fades while a new one rises), incremental (a continuous shift in the boundary), or recurring (seasonal patterns that re-emerge). Different detectors react well to different patterns.
| Detector | Year | Authors | Idea | Best for |
|---|---|---|---|---|
| DDM (Drift Detection Method) | 2004 | Gama, Medas, Castillo, Rodrigues | Monitor the binomial error rate; alarm when it exceeds a threshold based on its own minimum | Abrupt drift on classifiers with declining error |
| EDDM (Early Drift Detection Method) | 2006 | Baena-Garcia, del Campo-Avila, Fidalgo, Bifet, Gavalda, Morales-Bueno | Monitor the distance between consecutive errors instead of the raw error rate | Gradual drift |
| Page-Hinkley test | 1954 / streaming use 2000s | E. S. Page (original); revived by Gama et al. | Cumulative sum test on the difference between current accuracy and a moving average | Smooth drift in numerical signals |
| ADWIN (Adaptive Windowing) | 2007 | Bifet, Gavalda | Maintain a variable-length sliding window; cut the window when statistics on its two halves differ | Any drift, with rigorous false-positive bounds |
| HDDM | 2014 | Frias-Blanco, del Campo-Avila, Ramos-Jimenez, Morales-Bueno, Ortiz-Diaz, Caballero-Mota | Hoeffding-bound-based test on weighted moving averages | Streams with non-stationary noise |
| KSWIN | 2020 | Raab, Heusinger, Schleif | Kolmogorov-Smirnov test on a sliding window | Distribution-free drift on numeric features |
ADWIN is a particularly common choice in production streaming systems because Bifet and Gavalda's 2007 SDM paper provided rigorous bounds on both false-positive and false-negative rates, and because ADWIN can be plugged in as a black-box monitor for either model error or any individual feature [12]. The drift_detection module of the river library implements ADWIN, DDM, EDDM, HDDM, KSWIN, and Page-Hinkley with a uniform interface.
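As a concrete example of the detector family, here is a toy Page-Hinkley-style detector for an upward shift in a monitored signal such as model error. The delta and lambda values are illustrative tuning choices, not recommended defaults, and this is a from-scratch sketch rather than any library's implementation:

```python
# Toy Page-Hinkley detector: accumulate deviations of the signal from
# its running mean and alarm when the cumulative sum rises more than
# `lam` above its historical minimum.
class PageHinkley:
    def __init__(self, delta=0.005, lam=5.0):
        self.delta, self.lam = delta, lam
        self.n, self.mean = 0, 0.0
        self.cum, self.cum_min = 0.0, 0.0

    def update(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum += x - self.mean - self.delta
        self.cum_min = min(self.cum_min, self.cum)
        return self.cum - self.cum_min > self.lam   # True = drift alarm

# Stable error around 0.2 for 200 steps, then an abrupt jump to 0.8.
ph = PageHinkley()
alarms = []
for t in range(300):
    alarms.append(ph.update(0.2 if t < 200 else 0.8))
```

On this stream the detector stays silent through the stable phase and raises an alarm shortly after the jump, which is the behavior a monitoring pipeline would route to a retraining trigger or a human operator.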
When a detector raises an alarm, the dynamic model has several options. It can reset its weights and start retraining from scratch, increase the learning rate to react more quickly, switch to a buffered alternative model that has been training in parallel, or simply log the event for a human operator. Production systems usually combine these: minor drift triggers an automatic catch-up update, major drift pages a human.
A production dynamic model needs more than an online algorithm; it needs an entire pipeline that can ingest streams, update parameters, evaluate on the fly, and serve predictions. Several open-source frameworks specialize in this layer.
| Framework | Language | First released | Maintainer | Specialty |
|---|---|---|---|---|
| river | Python | 2020 (merger of creme + scikit-multiflow) | online-ml community | General-purpose online ML in Python with progressive validation |
| Vowpal Wabbit | C++ | 2007 | Microsoft Research (originally Yahoo Research) | Massive-scale online learning, contextual bandits, hashing trick |
| MOA (Massive Online Analysis) | Java | 2010 | University of Waikato | Stream classification, clustering, drift detection, evaluation tools |
| Apache Flink ML | Java / Scala | 2015 | Apache Software Foundation | Online learning on top of Flink streaming runtime |
| Spark Streaming MLlib | Scala / Python | 2014 | Apache Software Foundation | Streaming linear regression, k-means, batch-style streaming |
| scikit-learn (partial_fit) | Python | 2010 (incremental support) | scikit-learn community | SGDClassifier, SGDRegressor, Naive Bayes, MiniBatchKMeans |
| TensorFlow Extended (TFX) | Python | 2017 | Google | End-to-end ML pipelines with continuous training support |
| ByteDance Monolith | Python / C++ | Open-sourced 2022 | ByteDance | Real-time recommendation training with collisionless embeddings |
River is the result of a 2020 merger between two earlier projects: creme (started in 2018 at Telecom ParisTech) and scikit-multiflow (started by Bifet's group at the University of Waikato). The library provides online versions of linear models, decision trees and random forests, k-nearest neighbors, anomaly detectors, drift detectors, recommender systems, time-series models, factorization machines, and bandits. The JMLR 2021 paper "River: machine learning for streaming data in Python" by Montiel, Halford, Mastelini, and others is the canonical reference [13]. Version 0.24.2 was released on April 15, 2026.
Vowpal Wabbit (VW), started by John Langford at Yahoo Research and now maintained at Microsoft Research, is the heavyweight open-source online learner. VW uses SGD-based online learning combined with the hashing trick (32-bit MurmurHash3 of feature names into a fixed-size weight vector) to scale to billions of features and billions of examples [14]. It supports binary and multiclass classification, regression, contextual bandits, active learning, and reductions for structured prediction. VW has been used to learn a tera-feature (10^12) dataset on 1000 nodes in one hour, which remains a reference point for raw online-learning throughput.
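The hashing trick itself is nearly a one-liner. The sketch below uses `zlib.crc32` as a stand-in hash purely because it ships with the Python standard library (VW itself uses MurmurHash); the bit width and feature names are illustrative:

```python
import zlib

# The hashing trick: map arbitrary feature names into a fixed-size
# weight vector by hashing, so the model never needs a feature
# dictionary and memory stays bounded regardless of vocabulary size.
def hash_features(named_features, bits=18):
    mask = (1 << bits) - 1                  # 2**bits weight slots
    indexed = {}
    for name, value in named_features.items():
        idx = zlib.crc32(name.encode()) & mask
        indexed[idx] = indexed.get(idx, 0.0) + value   # collisions add up
    return indexed

x = hash_features({"user=42": 1.0, "ad=snack-pretzel": 1.0, "hour=14": 1.0})
```

Hash collisions introduce a small amount of noise, but in high-dimensional sparse settings the effect on accuracy is typically negligible compared with the memory savings.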
MOA is the Java equivalent of WEKA for streaming data, created in 2010 by Bifet, Holmes, Kirkby, and Pfahringer at the University of Waikato. The framework provides Hoeffding trees, ADWIN, drift detectors, online ensembles (online bagging, leveraging bagging, ARF), clusterers, and evaluation tools for prequential and holdout-on-stream evaluation [15]. MOA is the most cited reference for academic streaming ML benchmarks.
Flink ML is the machine learning library built on top of Apache Flink's streaming runtime. It supports online versions of common preprocessors (OnlineStandardScaler, OnlineKMeans), agglomerative clustering, and online linear models. The library reached 2.2.0 in April 2023 and is in production use at Alibaba for real-time clustering and feature engineering on log data [16]. Its main appeal is the unified pipeline API that lets the same algorithm run on bounded (offline) and unbounded (online) data streams.
Apache Spark MLlib supports streaming linear regression and streaming k-means via the StreamingLinearRegressionWithSGD and StreamingKMeans classes. The implementation runs SGD on each Spark Streaming batch, so it is closer to mini-batch online learning than to true per-example online learning [17]. For teams already on Spark, this is the smallest-effort path to a dynamic model.
scikit-learn does not advertise itself as a streaming library, but a number of its estimators expose a partial_fit method that supports incremental training: SGDClassifier, SGDRegressor, PassiveAggressiveClassifier, PassiveAggressiveRegressor, Perceptron, MultinomialNB, BernoulliNB, MiniBatchKMeans, and others. Combined with dask-ml's Incremental wrapper, partial_fit is the path of least resistance for adding online learning to an existing scikit-learn pipeline.
Beyond open-source libraries, every large platform that depends on dynamic models maintains its own internal training infrastructure. ByteDance's Monolith (open-sourced 2022) is built on TensorFlow with a Worker / Parameter-Server architecture and uses Cuckoo hashmaps for collisionless embedding tables, allowing TikTok to update its recommendation model on a minute scale [18]. Google uses TensorFlow Extended (TFX) and an internal continuous training system that integrates FTRL-Proximal for ad ranking and other linear models. Meta uses PyTorch with FBLearner Flow for batch training and a separate online training stack for ranking. Netflix uses a hybrid of offline training plus online fine-tuning for portions of its recommendation stack.
Dynamic models are not appropriate for every problem; they pay back the engineering cost only when the data distribution changes faster than the static-retrain cadence can keep up with. The most common production use cases share that property.
Large-scale recommender systems are the canonical home for dynamic models. User preferences shift in real time, new items appear hourly, and the value of a recommendation depends on freshness. Most production recommenders combine offline-trained candidate generation models (refreshed daily or weekly) with online-trained ranking models (refreshed on a minute scale).
Click-through rate prediction is the second canonical home. Ads, queries, and creatives turn over at hour-by-hour rates, and a 1% lift in CTR translates into eight or nine figures of revenue at scale. Google's KDD 2013 paper documents an FTRL-Proximal-based dynamic model serving ad CTR predictions in production [9]. Comparable systems are described in publications from Yahoo, Microsoft, Meta, and ByteDance.
Fraud detection is a textbook concept-drift problem: adversaries change tactics in response to detection. Static models become obsolete as soon as they ship. Production fraud detection systems combine an online learning model trained on labeled fraud reports with a separate anomaly detector tuned to flag previously unseen patterns.
Financial markets and dynamic pricing systems both deal with non-stationary data where the cost of staleness is measured in basis points or in lost revenue per minute. Most production trading systems combine slow offline-trained risk models with fast online-trained execution models.
News ranking, short-form video ranking (TikTok, Reels, Shorts), and feed personalization all share the property that the inventory turns over within hours. TikTok's Monolith paper documents a system that incorporates user interactions into the model within minute-scale latency, which the authors directly attribute to its real-time online training pipeline [18]. Netflix has reported moving portions of its recommendation stack from batch to online training because batch-trained models created "regret as many members over a long period did not benefit from the better experience."
Spam filtering is one of the oldest production applications of online learning, dating back to early Bayesian filters. Modern email and chat platforms run a continuous training loop that incorporates user-flagged spam labels into a refreshed model every few hours.
Industrial sensor streams (vibration, temperature, current draw) are non-stationary because equipment ages, environmental conditions vary, and operating modes change. Online learning algorithms with drift detectors are well-matched to this regime. Apache Flink ML is used at Alibaba for online clustering of log data, and a similar pattern shows up in factory sensor monitoring with MOA, river, or custom Flink jobs.
| Platform | Application | Approach | Reference |
|---|---|---|---|
| Google Ads | Sponsored search CTR prediction | FTRL-Proximal with per-coordinate learning rates | McMahan et al., KDD 2013 [9] |
| Google Play | App recommendation | Wide & Deep with online fine-tuning | Cheng et al., DLRS 2016 |
| TikTok | For You feed ranking | Monolith real-time training, minute-scale updates | Liu et al., arXiv 2022 [18] |
| Netflix | Homepage ranking | Hybrid offline plus online fine-tuning | Netflix Tech Blog |
| YouTube | Video ranking | Two-stage candidate generation plus online ranker | Covington et al., RecSys 2016 |
| Spotify | Daily Mix and Discover Weekly | Weekly batch retrains plus online bandit layer | Spotify Engineering |
| LinkedIn | Feed ranking | Offline GLMix plus online ranker fine-tuning | LinkedIn Engineering |
| Alibaba | Real-time clustering on log streams | Flink ML OnlineKMeans | Apache Flink blog [16] |
| Meta (Facebook) | News Feed and Ads | Continuous training on PyTorch with FBLearner Flow | Meta Engineering |
These systems vary in how aggressively they update their models. TikTok pushes new weights on roughly a one-minute cadence. Google Ads CTR predictors update on the order of a few minutes. Netflix's recommendation models historically updated every few hours but have moved closer to minute-scale for some surfaces. Spotify's Discover Weekly remains a weekly batch update because the user expectation is a weekly playlist drop, not a continuously shifting one.
Dynamic models cost more to operate than static ones, and they introduce failure modes that static models do not have. The decision to go dynamic should be driven by a measured cost of staleness, not by aesthetics.
A dynamic model requires an always-on training pipeline. That means continuous data ingestion, feature computation, label joining (often the hardest part), gradient computation, parameter updates, and rollouts to inference servers. Each of these layers needs to be scaled, monitored, and on-call rotated. A typical production dynamic model is two to five engineers' worth of operational ownership beyond the model itself.
Online learning depends on quickly observed labels. CTR prediction has labels available within seconds (a click happens or it doesn't). Fraud detection has labels available after minutes to days, since chargebacks take time. Long-horizon prediction problems (lifetime value, churn) have labels that arrive too late for online learning to be useful, and for those problems static or hybrid models are usually the right choice.
A dynamic model that ranks the items it sees can amplify its own biases. If a recommender is trained on the clicks generated by its own previous predictions, its training distribution becomes self-conditioned. The standard mitigation is to mix in exploration via contextual bandits or randomized impressions, and to log propensity-corrected rewards for off-policy evaluation.
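A minimal sketch of the mitigation: epsilon-greedy exploration at serving time, logging the propensity of the chosen item so rewards can be corrected off-policy later. The function and parameter names are hypothetical, and real systems typically use richer contextual-bandit policies:

```python
import random

# Epsilon-greedy serving: a small fraction of impressions are chosen
# uniformly at random so the training stream is not purely generated
# by the model's own previous predictions. The propensity of the
# chosen item is logged alongside the reward for off-policy correction.
def choose(scores, rng, epsilon=0.05):
    best = max(range(len(scores)), key=scores.__getitem__)
    if rng.random() < epsilon:
        item = rng.randrange(len(scores))   # explore uniformly
    else:
        item = best                         # exploit the top-scored item
    # Probability that this policy selects `item`, logged for evaluation.
    propensity = epsilon / len(scores) + (1.0 - epsilon) * (item == best)
    return item, propensity

rng = random.Random(0)
item, p = choose([0.1, 0.9, 0.3], rng)
```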
Classic train / validation / test splits do not apply directly to streaming models. The standard alternatives are prequential evaluation (test each example before training on it, then update), interleaved test-then-train, and holdout-on-stream (set aside a small fraction of the stream for evaluation only). Production systems also rely on shadow deployments, A/B tests, canary releases, and interleaving experiments, all of which are described in detail in Huyen's Designing Machine Learning Systems [5].
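Prequential evaluation is straightforward to implement: score each example before training on it, and accumulate the results into a running metric. The sketch below uses a trivial majority-class model as a placeholder; only the test-then-train ordering matters:

```python
# Prequential (test-then-train) evaluation: every example is scored
# before the model learns from it, so the running accuracy is an
# honest out-of-sample estimate with no separate holdout set.
def prequential_accuracy(stream):
    counts, correct, n = {}, 0, 0
    for x, y in stream:
        pred = max(counts, key=counts.get) if counts else None  # test first
        correct += (pred == y)
        n += 1
        counts[y] = counts.get(y, 0) + 1                        # then train
    return correct / n

acc = prequential_accuracy([(None, "a"), (None, "a"), (None, "b"),
                            (None, "a"), (None, "a")])
```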
Neural networks trained online can forget old patterns quickly when the input distribution shifts, a phenomenon known as catastrophic forgetting or catastrophic interference. Mitigations include replay buffers (keeping a sample of older data and mixing it into updates), elastic weight consolidation, and architectural choices that protect specific subspaces of weights from rapid updates. The continual learning literature exists largely to address this issue.
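A replay buffer based on reservoir sampling keeps a uniform sample of the entire stream in fixed memory, so each online update can mix in a few older examples. The sketch below is a generic illustration of the data structure, not any specific system's implementation:

```python
import random

# Replay buffer via reservoir sampling (Algorithm R): after N adds,
# every item ever seen has equal probability capacity/N of being in
# the buffer, using only O(capacity) memory.
class ReplayBuffer:
    def __init__(self, capacity, seed=0):
        self.capacity, self.seen = capacity, 0
        self.items = []
        self.rng = random.Random(seed)

    def add(self, item):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
        else:                                # replace with prob capacity/seen
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = item

    def sample(self, k):
        return self.rng.sample(self.items, min(k, len(self.items)))

buf = ReplayBuffer(capacity=100)
for i in range(10_000):
    buf.add(i)
old_mix = buf.sample(8)   # older examples to mix into the next update
```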
A static model can be monitored by checking its prediction distribution and a few business metrics. A dynamic model needs all of that plus monitoring for training health (loss curves, gradient norms), data freshness (how stale is the most recent example), label freshness (how stale is the most recent label), drift detection signals, and rollback readiness (can the system swap to a previous snapshot if today's training run goes off the rails). Drift-aware monitoring tools, often built around ADWIN or KSWIN under the hood, are part of every well-run dynamic model.
A static model trained from a fixed dataset and a fixed seed is reproducible to the bit. A dynamic model that has been training continuously for six months has consumed billions of examples in a specific order from a specific stream, and reproducing it exactly is usually impossible. The replacement is not bit-level reproducibility but process-level reproducibility: the same training pipeline applied to the same window of data should produce a model with the same statistical behavior.
A short decision guide for picking between the two regimes:

- Prefer a static model when the distribution is stable, labels arrive slowly, and a periodic batch retrain keeps staleness within acceptable bounds.
- Prefer a dynamic model when the distribution shifts faster than any feasible batch-retrain cadence and labels arrive quickly enough to learn from.
- In either case, measure the cost of staleness first: freeze a trained model and track how quickly its metrics decay on fresh data before committing to the heavier pipeline.
The choice is rarely binary in practice. Most large production systems run a portfolio of models with different cadences: a daily-batch model for stable signals, an hourly online model for fast-moving signals, and a per-request rerank step that uses the latest interaction context.
In 2026, dynamic models are the default in any consumer-facing recommendation, advertising, or feed ranking application at scale. The continuing rise of short-form video platforms, real-time chat assistants, and personalized agentic interfaces has only increased the cost of staleness. At the same time, the tooling has matured: river, Flink ML, Vowpal Wabbit, and MOA cover the open-source side, while every major cloud provider offers a managed continuous training service.
The rise of large language models has not displaced dynamic modelling. Although foundation models themselves are usually trained statically (with periodic full retrains rather than continuous updates), the systems that use them often wrap them in online learning loops at the application layer. A search ranker that uses an LLM as a feature extractor is still a dynamic model in the sense that its ranker weights update continuously based on user interactions. Personalization layers, retrieval indexes, and post-training adapters all use online updates even when the base model is frozen.
The research frontier has moved toward neural online learning, online deep learning with replay and consolidation, online fine-tuning of LLMs, federated dynamic models that learn from decentralized data, and graph-based online learning for evolving social networks. The 2007 ADWIN paper, the 2003 Zinkevich paper, and the 2013 FTRL-Proximal paper remain the standard references for the underlying theory.
Imagine you have a robot that picks the best snack to give you each day. A regular robot is trained once, when it is first built. It learns that you like pretzels and apples and never changes its mind. After a few months it is still offering you pretzels even though you got tired of pretzels weeks ago.
A dynamic robot watches what you actually eat every day. If you start liking grapes, the dynamic robot notices and starts offering grapes too. If you stop liking pretzels, it stops offering pretzels. It is always learning, a little bit at a time, instead of being frozen forever after one big lesson.
The trade-off is that the dynamic robot is more work to take care of. Someone has to make sure it is not learning weird things, that the snacks are still real snacks, and that it does not suddenly forget you are allergic to peanuts. The plain robot is simpler but it gets boring. The dynamic robot keeps up with you, as long as someone keeps an eye on what it is learning.