# Online learning

> Source: https://aiwiki.ai/wiki/online_learning
> Updated: 2026-07-11
> Categories: Machine Learning, Reinforcement Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

Online learning is a [machine learning](/wiki/machine_learning) paradigm in which a model receives data sequentially, one example or one mini-batch at a time, and updates its parameters immediately after each observation rather than training on a fixed dataset all at once. Unlike batch (offline) learning, the model never assumes the full dataset is available in advance: it consumes a stream of inputs and predictions and improves as it goes, with a memory footprint that stays bounded no matter how long the stream runs. The framework has a deep theoretical history rooted in sequential decision making, game theory, and [regret minimization](/wiki/regret_minimization), and its industrial form powers systems that learn from billions of events per day, such as ad click-through prediction and real-time recommendation.[1][8]

Online learning is the natural framework for problems where data arrives faster than it can be stored, where the data-generating distribution changes over time, or where the cost of retraining a full model from scratch is prohibitive. In the worst-case (adversarial) view, performance is measured not by expected error but by regret: the gap between the learner's accumulated loss and the loss of the single best fixed predictor chosen in hindsight.[1][17]

The term is overloaded in practice. "Online" can refer to the algorithmic regime described above, but it sometimes also refers to [online inference](/wiki/online_inference), meaning serving predictions to users in real time on already-trained models. These are different concepts. This article focuses on the learning paradigm.

## How does online learning differ from batch (offline) learning?

The operational difference between online and offline learning shows up in how data is consumed, how parameters are updated, and what assumptions are made about the data distribution.

| Aspect | Online learning | Offline (batch) learning |
| --- | --- | --- |
| Data access | Sequential, one example or mini-batch at a time | Entire training set available upfront |
| Update frequency | After every example or small batch | After full passes (epochs) over the data |
| Memory footprint | Bounded, independent of stream length | Grows with dataset size |
| Distribution assumption | Often makes none (adversarial setting) | Usually assumes iid samples from a fixed distribution |
| Reaction to drift | Adapts continuously | Requires periodic retraining |
| Theoretical goal | Sublinear regret against best fixed predictor | Low generalization error against a fixed distribution |
| Typical bottleneck | Per-step compute and stability | Total training time and memory |

The practical line is not always crisp. [Stochastic gradient descent](/wiki/stochastic_gradient_descent_sgd) on a fixed dataset with a single example per step looks structurally identical to an online algorithm, and many SGD-based recipes are described as "online-like." The distinction is whether data is streaming and unrepeatable (true online) or whether it is a finite shuffled dataset that the optimizer simply happens to visit one example at a time.

## What concepts are related to but distinct from online learning?

Several neighboring ideas are often confused with online learning, and the distinctions matter:

| Concept | Defining property |
| --- | --- |
| Online learning | Sequential examples, immediate per-example updates |
| Offline / batch learning | Full dataset known in advance, parameters updated in epochs |
| [Continual learning](/wiki/continual_learning) | Sequential tasks with shifting distributions; must retain old skills |
| [Incremental learning](/wiki/incremental_learning) | Model is extended with new data or classes over time without full retraining |
| [Streaming data](/wiki/streaming_data) | Continuous data flow, often unbounded; subset of the online regime |
| [Online inference](/wiki/online_inference) | Serving predictions in real time on a fixed model |
| [Active learning](/wiki/active_learning) | Learner chooses which unlabeled examples to query for labels |
| [Reinforcement learning](/wiki/reinforcement_learning) | Sequential interaction where actions affect future observations |

Continual learning sits closest to online learning. The difference is that continual learning explicitly cares about retaining performance on prior tasks while learning new ones, which motivates techniques like Elastic Weight Consolidation, replay buffers, and progressive networks. A pure online learner has no such constraint and may overwrite older knowledge freely.

## What is the difference between adversarial and stochastic settings?

Online learning theory generally splits the world into two regimes based on how the data sequence is generated.

In the stochastic setting, examples are drawn independently and identically distributed (iid) from some unknown but fixed distribution. The sequential nature of the algorithm is a computational convenience, not a statistical necessity. Classical generalization bounds apply once the online output is averaged or otherwise post-processed.

In the adversarial setting, the data sequence is chosen by an opponent that may have access to the learner's algorithm. No probabilistic assumption is placed on the inputs. The natural performance measure is no longer expected error but regret, which compares the learner's accumulated loss to the best fixed prediction strategy in hindsight. Cesa-Bianchi and Lugosi's 2006 monograph *Prediction, Learning, and Games* gives the canonical treatment, studying "the problem of predicting individual sequences" of data without imposing any probabilistic assumption on how those sequences are generated.[1]

A learner is said to be no-regret if its regret grows sublinearly in the number of rounds $$T$$, meaning average per-round regret tends to zero as $$T$$ grows. A regret bound of $$O(\sqrt{T})$$ is standard. $$O(\log T)$$ is possible under stronger conditions like strong convexity.[17]

## What are the classical online learning algorithms?

A handful of algorithms anchor the field. They differ in their loss assumptions, their use of regularization, and the regret they achieve.

| Algorithm | Year and source | Update rule (sketch) | Regret / mistake bound |
| --- | --- | --- | --- |
| [Perceptron](/wiki/perceptron) | Rosenblatt 1958, analysis Novikoff 1962 | If misclassified, $$w_t = w_{t-1} + y_t x_t$$ | Mistake bound $$R^2 / \gamma^2$$ if data is linearly separable with margin $$\gamma$$ and radius $$R$$ |
| Hedge / Multiplicative Weights | Freund and Schapire 1997 | $$w_{i,t+1} \propto w_{i,t} \exp(-\eta \cdot \mathrm{loss}_{i,t})$$ | $$O(\sqrt{T \log N})$$ for N experts |
| Online Gradient Descent (OGD) | Zinkevich 2003 | $$\theta_t = \theta_{t-1} - \eta_t \nabla L(\theta_{t-1}; x_t, y_t)$$, then projected | $$O(\sqrt{T})$$ for convex losses; $$O(\log T)$$ under strong convexity |
| Online Mirror Descent (OMD) | Beck and Teboulle 2003 | Update in a dual space defined by a Bregman divergence | Matches OGD; better constants for sparse or simplex domains |
| [Follow the Regularized Leader (FTRL)](/wiki/follow_the_regularized_leader) | McMahan 2011 (analysis), Shalev-Shwartz 2007 (formulation) | $$\theta_t = \arg\min_\theta \sum_{s<t} L_s(\theta) + R(\theta)$$ | $$O(\sqrt{T})$$ in general; equivalent to OMD in many cases |
| FTRL-Proximal | McMahan, Holt et al. 2013 | FTRL variant with per-coordinate adaptive learning rates and L1 regularization | Same regret as FTRL; produces highly sparse models, used in Google ad CTR |

The Perceptron is the historical entry point. Frank Rosenblatt introduced it in 1958 in *Psychological Review* as "a probabilistic model for information storage and organization in the brain," a brain-inspired model for binary classification.[2] Its online update only changes weights on a misclassified point. Albert Novikoff's 1962 mistake bound, which says the algorithm makes at most $$R^2 / \gamma^2$$ mistakes on linearly separable data, was a foundational result for the entire field.[3] The bound depends on geometry of the data, not on the length of the sequence.

Online gradient descent generalized this idea to arbitrary convex loss functions.[18] Martin Zinkevich's 2003 analysis showed that taking a gradient step on the current example, with step size $$\eta_t = 1/\sqrt{t}$$ and a projection back into the feasible set, "achieves additive regret $$O(\sqrt{T})$$, for an arbitrary sequence of T convex cost functions (of bounded gradients), with respect to the best single decision in hindsight," with no probabilistic assumption on the data.[4]

FTRL recasts the problem: at each step, output the parameter that would have minimized cumulative loss so far, plus a regularizer. The regularizer keeps the iterate from chasing noise too aggressively. McMahan's 2011 unification showed that FTRL and online mirror descent are essentially equivalent in many settings, which clarified a decade of seemingly different algorithms.[7][16]

The Hedge algorithm, also called multiplicative weights, attacks the experts problem. The learner maintains a probability distribution over N experts and adjusts the weights exponentially based on observed expert losses. Freund and Schapire showed regret $$O(\sqrt{T \log N})$$ against the best expert.[5] The algorithm reappears throughout machine learning, optimization, and even theoretical computer science as a primitive for solving zero-sum games and linear programs.

## How is online learning used in Google's ad systems?

The FTRL-Proximal algorithm, presented by H. Brendan McMahan and colleagues at KDD 2013 in "Ad Click Prediction: a View from the Trenches," is the most widely cited industrial deployment of online learning.[8] The paper opens by framing the problem in stark terms: "Predicting ad click-through rates (CTR) is a massive-scale learning problem that is central to the multi-billion dollar online advertising industry."[8] The system predicted click-through rate for ads served by Google and operated at a scale that rules out periodic batch retraining.

Three design choices made the algorithm work in production:

- L1 regularization combined with the FTRL update naturally drives many feature weights exactly to zero, producing sparse models even with billions of candidate features.
- Per-coordinate learning rates let frequent features (such as common keywords) settle quickly while rare features still get meaningful updates.
- Probability calibration and confidence estimation were treated as engineering problems alongside the learner itself, with separate calibration layers correcting for systematic bias.

The paper reports that FTRL-Proximal "gives significantly improved sparsity with the same or better prediction accuracy" than earlier gradient-descent baselines, and it is one of the few deeply practical accounts of running an online learner at internet scale.[8] It also covered topics that academic papers usually skip: how to subsample training data to fit memory budgets, how to monitor a continuously updating model, and how to detect regressions in real time.

## What is online-to-batch conversion?

The link between online learning and standard supervised learning runs through online-to-batch conversion. If an online algorithm achieves regret $$O(\sqrt{T})$$ on an iid stream of T examples and the user averages the sequence of online predictors (or otherwise selects one carefully), the resulting fixed predictor has expected risk within $$O(\sqrt{1/n})$$ of the best in the comparator class. Cesa-Bianchi, Conconi, and Gentile gave a clean analysis of this conversion in 2004, showing that the online-to-batch trick yields data-dependent generalization bounds without strong probabilistic machinery.[6]

This result is more than a technicality. It means an algorithm designed for the worst-case adversarial setting still works in the friendlier iid setting and inherits a sound generalization guarantee almost for free. Many practical recipes, such as averaging the iterates of SGD, are special cases.

## How do bandits and partial feedback fit in?

Online learning becomes harder when the learner only observes the loss of the action it actually took, not the loss of all available actions. This is the [multi-armed bandit](/wiki/multi-armed_bandit) setting. The classic stochastic bandit assumes each arm has a fixed but unknown reward distribution; algorithms like UCB and Thompson sampling achieve regret $$O(\log T)$$. The adversarial bandit setting drops the stochastic assumption and is solved by EXP3, a Hedge variant that uses importance-weighted estimates to recover unbiased gradient information.[13]

Bandits sit at the boundary between supervised online learning and reinforcement learning. Unlike full reinforcement learning, the action does not change the underlying state, so the problem is sequential but not Markovian.

## What is catastrophic forgetting, and how does continual learning fix it?

A pure online learner that updates greedily on each new example often overwrites useful knowledge. In neural networks, this phenomenon is called catastrophic forgetting: training a network on Task B after Task A typically destroys most of the performance on Task A. Continual learning is the broader research area trying to fix this.

Elastic Weight Consolidation (EWC), introduced by James Kirkpatrick and colleagues at DeepMind in PNAS in 2017, is the most cited approach.[9] EWC adds a quadratic penalty to the loss that keeps weights important for previous tasks close to their old values. Importance is estimated using the diagonal of the Fisher information matrix. The intuition mirrors synaptic consolidation in biological brains: neurons that mattered for learned skills become harder to overwrite. EWC works for both supervised and reinforcement learning task sequences and was demonstrated on Atari games learned in succession.[9]

Other continual learning techniques include:

- Replay-based methods, which interleave new examples with stored or generated samples from prior tasks.
- Progressive Networks (Rusu et al. 2016), which freeze old columns of weights and add new ones for new tasks, avoiding interference at the cost of growing model size.[10]
- Modular and gating approaches, which route different inputs to different sub-networks.

These methods address [distribution shift](/wiki/distribution_shift) in a structured way that pure online algorithms generally do not.

## How is concept drift detected?

In streaming applications the distribution can shift suddenly (a holiday traffic spike, a software change upstream) or gradually (slow user behavior changes). Concept drift detectors sit alongside an online learner and decide when the model has fallen out of date.[12]

| Detector | Reference | Idea |
| --- | --- | --- |
| Page-Hinkley test | Page 1954 | Cumulative deviation of observed loss from running mean |
| DDM (Drift Detection Method) | Gama et al. 2004 | Tracks online error rate; flags warning and drift thresholds |
| EDDM (Early DDM) | Baena-Garcia et al. 2006 | Detects gradual drift using distance between consecutive errors |
| ADWIN | Bifet and Gavalda 2007 | Adaptive sliding window; provides false-positive and false-negative bounds |

ADWIN is widely used because it offers rigorous guarantees: it keeps a variable-length window of recent observations such that no statistically significant change is detected within it, and it shrinks the window when a change appears.[11] The detector can be plugged into [concept drift](/wiki/concept_drift) pipelines to trigger model resets or rolling-window retraining.

## What tools and libraries implement online learning?

A few systems have shaped how practitioners actually deploy online learners.

| Tool | Origin | Notable features |
| --- | --- | --- |
| Vowpal Wabbit | Started by John Langford at Yahoo Research, then Microsoft Research | Out-of-core SGD, the hashing trick (32-bit MurmurHash3), contextual bandits, allreduce-based distributed training |
| River | Merger of creme and scikit-multiflow, JMLR 2021 | Pure Python online ML; supports per-example fit, drift detectors, evaluators, and a scikit-learn-like API |
| scikit-learn `partial_fit` | scikit-learn project | Mini-batch style updates for SGDClassifier, MultinomialNB, MiniBatchKMeans, and others |
| Apache Flink ML / Spark Streaming MLlib | Apache Foundation | Online updates inside larger streaming data pipelines |
| MOA (Massive Online Analysis) | University of Waikato | Java framework for stream mining, includes ADWIN, Hoeffding trees, and concept drift datasets |

Vowpal Wabbit deserves special mention. It pairs online learning with feature hashing to keep the feature dimension fixed regardless of dataset size, which is what allows it to learn from terabyte-scale logs on commodity hardware.[14] The combination of streaming data and bounded memory makes it effectively a one-pass learner.

River took the academic stream-mining ecosystem and packaged it for working data scientists.[15] As the authors describe it, "all predictive models in River perform two core functions: learn and predict," exposed through `learn_one(x, y)` and `predict_one(x)`, so a model can be updated with a single observation and queried immediately afterward.[15] River is closer to how production systems actually look, where each event is processed as it arrives.

## How does online learning apply to modern LLMs?

Large language models are typically trained in giant offline runs, but they live in production for months or years. As user behavior, preferences, and adversarial pressure all shift, several lines of work apply online learning ideas to keep the models current.

Online RLHF gathers human or AI preference data continuously and updates a policy model against an evolving reward model. The risk is that fixed reward models drift off-distribution as the policy changes, so methods like Reward Learning on Policy (Lang et al. 2024) refresh the reward model with on-policy samples, and CPPO and COPR add regularization terms or sample-weighting schemes to limit catastrophic forgetting of earlier preferences.[19]

Real-time recommender systems push embeddings and ranker weights through online updates whenever user feedback arrives. Similar patterns appear in real-time anomaly detection, where models are retrained or fine-tuned every few minutes against the latest data.

LLM agents that learn from production traces and feedback during deployment are pushing online learning in a less mature direction. Without guardrails such as replay buffers, frozen safety heads, or KL constraints to a reference model, an online policy can drift in undesirable ways. Operators routinely combine streaming updates with offline evaluation suites to catch regressions before they reach users.

## What is online learning used for?

Online learning shows up wherever data is fresh, fast, or both.

| Domain | Why online learning fits | Representative system or paper |
| --- | --- | --- |
| Online advertising (CTR prediction) | Billions of impressions per day; per-feature sparsity matters | FTRL-Proximal at Google (McMahan et al. 2013) |
| Recommendation systems | New items, new users, and changing preferences arrive constantly | YouTube, Netflix, and TikTok ranking pipelines update incrementally |
| Real-time anomaly detection | Anomalies are rare; baselines drift | River-based intrusion and fraud detection pipelines |
| Financial trading | Markets are non-stationary; latency is costly | Online portfolio selection (Cover 1991), online sequential prediction[20] |
| Robotics | Sensors stream continuously; offline data is expensive | On-policy fine-tuning during deployment |
| Search ranking | Query distribution shifts daily | Click models updated via online logistic regression |
| Spam and abuse classification | Adversaries change tactics in response to defenses | Vowpal Wabbit and similar pipelines at email providers |
| Online RLHF for LLMs | Preferences and abuse patterns shift | OAIF, Nash-MD, VPO (2024) |

The common thread is that the cost of waiting for a batch retrain is high, either because the data is too big to store or because the world has already moved on by the time the next training cycle finishes.

## What are the limitations and failure modes of online learning?

Online learning trades flexibility for several real risks.

Sensitivity to data ordering is structural. The same data delivered in a different order can produce a different model. This makes reproducibility harder and complicates debugging.

Stability and convergence are harder than in batch settings. Step sizes that are too aggressive cause oscillation; step sizes that are too small ignore drift. Per-coordinate learning rate schedules and FTRL-style regularization help but require tuning.

Catastrophic forgetting affects neural online learners that update without replay or regularization. Continual learning techniques mitigate it but never eliminate it.

Online RL and online RLHF can drift in undesirable directions if the feedback loop is biased, gameable, or too narrow. Operators usually anchor online updates with KL-regularization to a reference policy, hold-out evaluations, and rollback procedures.

Debugging is harder. The model state at any point in time depends on every prior example, so reproducing a bug may require replaying the exact sequence. Logging, deterministic shuffling, and checkpointing are essential operational disciplines.

Adversarial inputs can poison an online model if no input validation is in place. Spam classifiers and recommender systems both face active adversaries trying to push the model toward bad updates.

## Explain like I'm 5

Most machine learning is like studying for a test. You read a big textbook from start to finish, take a test once, and you are done. Online learning is more like learning to play a video game. You see one screen at a time, you make a move, and you get a small piece of feedback. You change how you play just a little bit and then you see the next screen. Over time you get better, even though you never sat down with the whole game spread out in front of you. The hard part is remembering the early levels when you are deep in the later ones, and not getting confused when the game changes the rules.

## See also

- [Continual learning](/wiki/continual_learning)
- [Incremental learning](/wiki/incremental_learning)
- [Reinforcement learning](/wiki/reinforcement_learning)
- [Stochastic gradient descent](/wiki/stochastic_gradient_descent_sgd)
- [Regret minimization](/wiki/regret_minimization)
- [Multi-armed bandit](/wiki/multi-armed_bandit)
- [Concept drift](/wiki/concept_drift)
- [Streaming data](/wiki/streaming_data)
- [Perceptron](/wiki/perceptron)
- [Follow the regularized leader](/wiki/follow_the_regularized_leader)
- [Online inference](/wiki/online_inference)
- [Distribution shift](/wiki/distribution_shift)
- [Active learning](/wiki/active_learning)

## References

1. Cesa-Bianchi, N. and Lugosi, G. (2006). *Prediction, Learning, and Games*. Cambridge University Press. https://www.cambridge.org/core/books/prediction-learning-and-games/A05C9F6ABC752FAB8954C885D0065C8F
2. Rosenblatt, F. (1958). "The perceptron: A probabilistic model for information storage and organization in the brain." *Psychological Review*, 65(6), 386-408. https://doi.org/10.1037/h0042519
3. Novikoff, A. B. (1962). "On convergence proofs on perceptrons." *Symposium on the Mathematical Theory of Automata*, 12, 615-622.
4. Zinkevich, M. (2003). "Online convex programming and generalized infinitesimal gradient ascent." *Proceedings of ICML 2003*. https://www.cs.cmu.edu/~maz/publications/techconvex.pdf
5. Freund, Y. and Schapire, R. E. (1997). "A decision-theoretic generalization of on-line learning and an application to boosting." *Journal of Computer and System Sciences*, 55(1), 119-139.
6. Cesa-Bianchi, N., Conconi, A., and Gentile, C. (2004). "On the generalization ability of on-line learning algorithms." *IEEE Transactions on Information Theory*, 50(9), 2050-2057. https://homes.di.unimi.it/~cesabian/Pubblicazioni/J20.pdf
7. McMahan, H. B. (2011). "Follow-the-regularized-leader and mirror descent: Equivalence theorems and L1 regularization." *AISTATS 2011*.
8. McMahan, H. B., Holt, G., Sculley, D., Young, M., Ebner, D., Grady, J., et al. (2013). "Ad click prediction: a view from the trenches." *Proceedings of KDD 2013*. https://research.google.com/pubs/pub41159.html
9. Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., et al. (2017). "Overcoming catastrophic forgetting in neural networks." *Proceedings of the National Academy of Sciences*, 114(13), 3521-3526. https://www.pnas.org/doi/10.1073/pnas.1611835114
10. Rusu, A. A., Rabinowitz, N. C., Desjardins, G., Soyer, H., Kirkpatrick, J., Kavukcuoglu, K., Pascanu, R., and Hadsell, R. (2016). "Progressive neural networks." arXiv:1606.04671.
11. Bifet, A. and Gavalda, R. (2007). "Learning from time-changing data with adaptive windowing." *Proceedings of the 2007 SIAM International Conference on Data Mining*, 443-448. https://epubs.siam.org/doi/10.1137/1.9781611972771.42
12. Gama, J., Medas, P., Castillo, G., and Rodrigues, P. (2004). "Learning with drift detection." *Brazilian Symposium on Artificial Intelligence*.
13. Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. (2002). "The nonstochastic multi-armed bandit problem." *SIAM Journal on Computing*, 32(1), 48-77.
14. Langford, J., Li, L., and Strehl, A. (2007). "Vowpal Wabbit" (open source software). https://github.com/VowpalWabbit/vowpal_wabbit
15. Montiel, J., Halford, M., Mastelini, S. M., et al. (2021). "River: machine learning for streaming data in Python." *Journal of Machine Learning Research*, 22(110), 1-8. https://www.jmlr.org/papers/v22/20-1380.html
16. Beck, A. and Teboulle, M. (2003). "Mirror descent and nonlinear projected subgradient methods for convex optimization." *Operations Research Letters*, 31(3), 167-175.
17. Hazan, E. (2016). *Introduction to Online Convex Optimization*. *Foundations and Trends in Optimization*, 2(3-4), 157-325.
18. Shalev-Shwartz, S. (2012). "Online learning and online convex optimization." *Foundations and Trends in Machine Learning*, 4(2), 107-194.
19. Lang, H., Huang, F., and Li, Y. (2024). "Fine-tuning language models with reward learning on policy." *NAACL 2024*. https://aclanthology.org/2024.naacl-long.75/
20. Cover, T. M. (1991). "Universal portfolios." *Mathematical Finance*, 1(1), 1-29.