Online learning is a machine learning paradigm in which a model receives data sequentially, one example or one mini-batch at a time, and updates its parameters immediately after each observation rather than training on a fixed dataset all at once. The model never assumes the full dataset is available in advance. Instead, it consumes a stream of inputs, makes a prediction on each, and ideally improves as it goes.
The approach contrasts with the more familiar batch (or offline) setup, where a learner sees the entire training set, possibly multiple times across many epochs, before producing a final model. Online learning is the natural framework for problems where data arrives faster than it can be stored, where the data-generating distribution changes over time, or where the cost of retraining a full model from scratch is prohibitive. It also has a deep theoretical history rooted in sequential decision making, game theory, and regret minimization.
The term is overloaded in practice. "Online" can refer to the algorithmic regime described above, but it sometimes also refers to online inference, meaning serving predictions to users in real time on already-trained models. These are different concepts. This article focuses on the learning paradigm.
The operational difference between online and offline learning shows up in how data is consumed, how parameters are updated, and what assumptions are made about the data distribution.
| Aspect | Online learning | Offline (batch) learning |
|---|---|---|
| Data access | Sequential, one example or mini-batch at a time | Entire training set available upfront |
| Update frequency | After every example or small batch | After full passes (epochs) over the data |
| Memory footprint | Bounded, independent of stream length | Grows with dataset size |
| Distribution assumption | Often makes none (adversarial setting) | Usually assumes iid samples from a fixed distribution |
| Reaction to drift | Adapts continuously | Requires periodic retraining |
| Theoretical goal | Sublinear regret against best fixed predictor | Low generalization error against a fixed distribution |
| Typical bottleneck | Per-step compute and stability | Total training time and memory |
The practical line is not always crisp. Stochastic gradient descent on a fixed dataset with a single example per step looks structurally identical to an online algorithm, and many SGD-based recipes are described as "online-like." The distinction is whether data is streaming and unrepeatable (true online) or whether it is a finite shuffled dataset that the optimizer simply happens to visit one example at a time.
Several neighboring ideas are often confused with online learning, and the distinctions matter:
| Concept | Defining property |
|---|---|
| Online learning | Sequential examples, immediate per-example updates |
| Offline / batch learning | Full dataset known in advance, parameters updated in epochs |
| Continual learning | Sequential tasks with shifting distributions; must retain old skills |
| Streaming data | Continuous, often unbounded data flow; the data regime that online learners consume |
| Online inference | Serving predictions in real time on a fixed model |
| Active learning | Learner chooses which unlabeled examples to query for labels |
| Reinforcement learning | Sequential interaction where actions affect future observations |
Continual learning sits closest to online learning. The difference is that continual learning explicitly cares about retaining performance on prior tasks while learning new ones, which motivates techniques like Elastic Weight Consolidation, replay buffers, and progressive networks. A pure online learner has no such constraint and may overwrite older knowledge freely.
Online learning theory generally splits the world into two regimes based on how the data sequence is generated.
In the stochastic setting, examples are drawn independently and identically distributed (iid) from some unknown but fixed distribution. The sequential nature of the algorithm is a computational convenience, not a statistical necessity. Classical generalization bounds apply once the online output is averaged or otherwise post-processed.
In the adversarial setting, the data sequence is chosen by an opponent that may have access to the learner's algorithm. No probabilistic assumption is placed on the inputs. The natural performance measure is no longer expected error but regret, which compares the learner's accumulated loss to the best fixed prediction strategy in hindsight. The 2006 monograph by Cesa-Bianchi and Lugosi gives the canonical treatment of this perspective and grounds many of the algorithms discussed below.
A learner is said to be no-regret if its regret grows sublinearly in the number of rounds T, meaning the average per-round regret tends to zero as T grows. A regret bound of O(sqrt(T)) is standard for convex losses; O(log T) is achievable under stronger conditions such as strong convexity.
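Written out, the regret after T rounds compares the learner's cumulative loss on the sequence of loss functions ℓ_t to the best fixed parameter chosen in hindsight:

```latex
R_T \;=\; \sum_{t=1}^{T} \ell_t(\theta_t) \;-\; \min_{\theta^\ast \in \Theta} \sum_{t=1}^{T} \ell_t(\theta^\ast),
\qquad
\text{no-regret:} \quad \lim_{T \to \infty} \frac{R_T}{T} \;=\; 0.
```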
A handful of algorithms anchor the field. They differ in their loss assumptions, their use of regularization, and the regret they achieve.
| Algorithm | Year and source | Update rule (sketch) | Regret / mistake bound |
|---|---|---|---|
| Perceptron | Rosenblatt 1958, analysis Novikoff 1962 | If misclassified, w_t = w_{t-1} + y_t x_t | Mistake bound R^2 / gamma^2 if data is linearly separable with margin gamma and radius R |
| Hedge / Multiplicative Weights | Freund and Schapire 1997 | w_i,t+1 proportional to w_i,t * exp(-eta * loss_i,t) | O(sqrt(T log N)) for N experts |
| Online Gradient Descent (OGD) | Zinkevich 2003 | theta_t = theta_{t-1} - eta_t * grad L(theta_{t-1}; x_t, y_t), then projected | O(sqrt(T)) for convex losses; O(log T) under strong convexity |
| Online Mirror Descent (OMD) | Beck and Teboulle 2003 | Update in a dual space defined by a Bregman divergence | Matches OGD; better constants for sparse or simplex domains |
| Follow the Regularized Leader (FTRL) | McMahan 2011 (analysis), Shalev-Shwartz 2007 (formulation) | theta_t = argmin_theta sum_{s<t} L_s(theta) + R(theta) | O(sqrt(T)) in general; equivalent to OMD in many cases |
| FTRL-Proximal | McMahan, Holt et al. 2013 | FTRL variant with per-coordinate adaptive learning rates and L1 regularization | Same regret as FTRL; produces highly sparse models, used in Google ad CTR |
The Perceptron is the historical entry point. Frank Rosenblatt introduced it in 1958 as a brain-inspired model for binary classification. Its online update only changes weights on a misclassified point. Albert Novikoff's 1962 mistake bound, which says the algorithm makes at most R^2 / gamma^2 mistakes on linearly separable data, was a foundational result for the entire field. The bound depends on geometry of the data, not on the length of the sequence.
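For concreteness, here is a minimal sketch of the perceptron's online loop (the function and variable names are illustrative, not from any particular library):

```python
import numpy as np

def perceptron_stream(stream, dim):
    """Online perceptron: predict, then update weights only on mistakes.

    `stream` yields (x, y) pairs with x an ndarray of shape (dim,)
    and y in {-1, +1}.
    """
    w = np.zeros(dim)
    mistakes = 0
    for x, y in stream:
        y_hat = 1 if w @ x > 0 else -1
        if y_hat != y:
            # Novikoff: at most (R / gamma)^2 such updates occur on
            # linearly separable data with margin gamma and radius R.
            w += y * x
            mistakes += 1
    return w, mistakes
```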
Online gradient descent generalized this idea to arbitrary convex loss functions. Martin Zinkevich's 2003 analysis showed that taking a gradient step on the current example, with step size eta_t = 1 / sqrt(t) and a projection back into the feasible set, achieves regret O(sqrt(T)) against the best fixed parameter in hindsight, with no probabilistic assumption on the data.
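A sketch of projected OGD under the same assumptions, taking the feasible set to be an L2 ball and the per-example gradient as a caller-supplied function:

```python
import numpy as np

def online_gradient_descent(stream, grad, dim, radius=1.0):
    """Zinkevich-style OGD: gradient step with eta_t = 1/sqrt(t),
    then Euclidean projection back into the feasible set."""
    theta = np.zeros(dim)
    for t, (x, y) in enumerate(stream, start=1):
        eta = 1.0 / np.sqrt(t)
        theta -= eta * grad(theta, x, y)
        # Projection onto the L2 ball {theta : ||theta||_2 <= radius}.
        norm = np.linalg.norm(theta)
        if norm > radius:
            theta *= radius / norm
    return theta
```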
FTRL recasts the problem: at each step, output the parameter that would have minimized cumulative loss so far, plus a regularizer. The regularizer keeps the iterate from chasing noise too aggressively. McMahan's 2011 unification showed that FTRL and online mirror descent are essentially equivalent in many settings, which clarified a decade of seemingly different algorithms.
The Hedge algorithm, also called multiplicative weights, attacks the experts problem. The learner maintains a probability distribution over N experts and adjusts the weights exponentially based on observed expert losses. Freund and Schapire showed regret O(sqrt(T log N)) against the best expert. The algorithm reappears throughout machine learning, optimization, and even theoretical computer science as a primitive for solving zero-sum games and linear programs.
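The update is short enough to state in full. A sketch, assuming losses in [0, 1] supplied as a (T, N) array; tuning the learning rate as eta = sqrt(8 ln N / T) recovers the stated regret bound:

```python
import numpy as np

def hedge(loss_matrix, eta):
    """Hedge / multiplicative weights over N experts.

    loss_matrix[t, i] is expert i's loss at round t, in [0, 1].
    Returns the final probability distribution over experts."""
    T, N = loss_matrix.shape
    w = np.ones(N) / N
    for t in range(T):
        w *= np.exp(-eta * loss_matrix[t])  # exponential down-weighting
        w /= w.sum()                        # renormalize
    return w
```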
The FTRL-Proximal algorithm, presented by H. Brendan McMahan and colleagues at KDD 2013 in "Ad Click Prediction: a View from the Trenches," is the most widely cited industrial deployment of online learning. The system predicted click-through rate (CTR) for ads served by Google and operated at a scale that rules out periodic batch retraining.
Three design choices made the algorithm work in production:

- The FTRL-Proximal update itself, whose L1 term keeps most coordinates exactly zero, so the model stays sparse enough to fit in memory while matching the accuracy of online gradient descent.
- Per-coordinate learning rates, so rare features keep large step sizes while frequent features anneal quickly.
- Memory-saving tricks such as probabilistic feature inclusion, which admits a feature into the model only after it has appeared enough times.
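A compact sketch of the resulting per-coordinate update, following the pseudocode in the paper (the hyperparameters alpha, beta, l1, l2 mirror the paper's alpha, beta, lambda_1, lambda_2; the sparse-dict input format is our own):

```python
import math
from collections import defaultdict

class FTRLProximal:
    """Per-coordinate FTRL-Proximal for logistic regression, sketched
    after McMahan et al. (KDD 2013). Inputs are sparse dicts mapping
    feature index -> value; labels y are in {0, 1}."""

    def __init__(self, alpha=0.1, beta=1.0, l1=1.0, l2=1.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = defaultdict(float)  # lazy weight accumulators
        self.n = defaultdict(float)  # sums of squared gradients

    def _weight(self, i):
        z = self.z[i]
        if abs(z) <= self.l1:
            return 0.0  # L1 threshold: coordinate stays exactly zero
        return -(z - math.copysign(self.l1, z)) / (
            (self.beta + math.sqrt(self.n[i])) / self.alpha + self.l2)

    def predict_one(self, x):
        s = sum(self._weight(i) * v for i, v in x.items())
        return 1.0 / (1.0 + math.exp(-s))

    def learn_one(self, x, y):
        p = self.predict_one(x)
        for i, v in x.items():
            w_i = self._weight(i)  # weight used at prediction time
            g = (p - y) * v        # gradient of log loss, coordinate i
            sigma = (math.sqrt(self.n[i] + g * g)
                     - math.sqrt(self.n[i])) / self.alpha
            self.z[i] += g - sigma * w_i
            self.n[i] += g * g
```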
The paper reported AUC improvements on the order of 11% over earlier production baselines and is one of the few deeply practical accounts of running an online learner at internet scale. It also covered topics that academic papers usually skip: how to subsample training data to fit memory budgets, how to monitor a continuously updating model, and how to detect regressions in real time.
The link between online learning and standard supervised learning runs through online-to-batch conversion. If an online algorithm achieves regret O(sqrt(T)) on an iid stream of T examples and the user averages the sequence of online predictors (or otherwise selects one carefully), the resulting fixed predictor has expected risk within O(1/sqrt(T)) of the best in the comparator class. Cesa-Bianchi, Conconi, and Gentile gave a clean analysis of this conversion in 2004, showing that the online-to-batch trick yields data-dependent generalization bounds without strong probabilistic machinery.
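For convex losses the conversion can be as simple as averaging the iterates; Jensen's inequality then transfers the regret bound directly to the excess risk:

```latex
\bar{\theta}_T \;=\; \frac{1}{T} \sum_{t=1}^{T} \theta_t,
\qquad
\mathbb{E}\!\left[L(\bar{\theta}_T)\right]
\;\le\; \min_{\theta^\ast} L(\theta^\ast) + \frac{\mathbb{E}[R_T]}{T}
\;=\; \min_{\theta^\ast} L(\theta^\ast) + O\!\left(\tfrac{1}{\sqrt{T}}\right).
```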
This result is more than a technicality. It means an algorithm designed for the worst-case adversarial setting still works in the friendlier iid setting and inherits a sound generalization guarantee almost for free. Many practical recipes, such as averaging the iterates of SGD, are special cases.
Online learning becomes harder when the learner only observes the loss of the action it actually took, not the loss of all available actions. This is the multi-armed bandit setting. The classic stochastic bandit assumes each arm has a fixed but unknown reward distribution; algorithms like UCB and Thompson sampling achieve regret O(log T). The adversarial bandit setting drops the stochastic assumption and is solved by EXP3, a Hedge variant that uses importance weighting to build unbiased estimates of the losses of arms it did not pull.
Bandits sit at the boundary between supervised online learning and reinforcement learning. Unlike full reinforcement learning, the action does not change the underlying state, so the problem is sequential but not Markovian.
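UCB1 (Auer, Cesa-Bianchi, and Fischer 2002) is the standard concrete instance of the UCB idea. A sketch, with `pull` standing in for whatever environment supplies rewards in [0, 1]:

```python
import math

def ucb1(pull, n_arms, horizon):
    """UCB1 for stochastic bandits: try each arm once, then pick the
    arm maximizing empirical mean reward plus an exploration bonus."""
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # initialization: pull every arm once
        else:
            arm = max(range(n_arms),
                      key=lambda a: means[a]
                      + math.sqrt(2 * math.log(t) / counts[a]))
        r = pull(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]  # running mean
    return means, counts
```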
A pure online learner that updates greedily on each new example often overwrites useful knowledge. In neural networks, this phenomenon is called catastrophic forgetting: training a network on Task B after Task A typically destroys most of the performance on Task A. Continual learning is the broader research area trying to fix this.
Elastic Weight Consolidation (EWC), introduced by James Kirkpatrick and colleagues at DeepMind in PNAS in 2017, is the most cited approach. EWC adds a quadratic penalty to the loss that keeps weights important for previous tasks close to their old values. Importance is estimated using the diagonal of the Fisher information matrix. The intuition mirrors synaptic consolidation in biological brains: neurons that mattered for learned skills become harder to overwrite. EWC works for both supervised and reinforcement learning task sequences and was demonstrated on Atari games learned in succession.
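The penalty from the paper, for a new task B trained after an old task A, with F_i the diagonal Fisher estimates, θ*_A the parameters learned on task A, and λ controlling how strongly old weights are anchored:

```latex
L(\theta) \;=\; L_B(\theta) \;+\; \sum_i \frac{\lambda}{2}\, F_i \left(\theta_i - \theta_{A,i}^{\ast}\right)^2
```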
Other continual learning techniques include:

- Replay (rehearsal) methods, which store or regenerate examples from earlier tasks and mix them into later updates.
- Progressive networks, which freeze the columns trained on earlier tasks and add fresh capacity for each new one.
- Distillation-based methods such as Learning without Forgetting, which penalize changes to the old model's outputs on new data.
- Regularization methods in the EWC family, such as Synaptic Intelligence, which estimate per-parameter importance online.
These methods address distribution shift in a structured way that pure online algorithms generally do not.
In streaming applications the distribution can shift suddenly (a holiday traffic spike, a software change upstream) or gradually (slow user behavior changes). Concept drift detectors sit alongside an online learner and decide when the model has fallen out of date.
| Detector | Reference | Idea |
|---|---|---|
| Page-Hinkley test | Page 1954 | Cumulative deviation of observed loss from running mean |
| DDM (Drift Detection Method) | Gama et al. 2004 | Tracks online error rate; flags warning and drift thresholds |
| EDDM (Early DDM) | Baena-Garcia et al. 2006 | Detects gradual drift using distance between consecutive errors |
| ADWIN | Bifet and Gavalda 2007 | Adaptive sliding window; provides false-positive and false-negative bounds |
ADWIN is widely used because it offers rigorous guarantees: it keeps a variable-length window of recent observations such that no statistically significant change is detected within it, and it shrinks the window when a change appears. The detector can be plugged into concept drift pipelines to trigger model resets or rolling-window retraining.
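In River, for example, a detector can monitor the stream of per-example losses. A sketch; `loss_stream` is a hypothetical iterable of floats, and recent River releases expose the `drift_detected` property shown here (older versions called it `change_detected`):

```python
from river import drift

detector = drift.ADWIN()

for i, loss in enumerate(loss_stream):
    detector.update(loss)
    if detector.drift_detected:
        # ADWIN shrank its window: the loss distribution changed.
        print(f"change detected at example {i}; consider a model reset")
```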
A few systems have shaped how practitioners actually deploy online learners.
| Tool | Origin | Notable features |
|---|---|---|
| Vowpal Wabbit | Started by John Langford at Yahoo Research, then Microsoft Research | Out-of-core SGD, the hashing trick (32-bit MurmurHash3), contextual bandits, allreduce-based distributed training |
| River | Merger of creme and scikit-multiflow, JMLR 2021 | Pure Python online ML; supports per-example fit, drift detectors, evaluators, and a scikit-learn-like API |
| scikit-learn partial_fit | scikit-learn project | Mini-batch style updates for SGDClassifier, MultinomialNB, MiniBatchKMeans, and others |
| Apache Flink ML / Spark Streaming MLlib | Apache Foundation | Online updates inside larger streaming data pipelines |
| MOA (Massive Online Analysis) | University of Waikato | Java framework for stream mining, includes ADWIN, Hoeffding trees, and concept drift datasets |
Vowpal Wabbit deserves special mention. It pairs online learning with feature hashing to keep the feature dimension fixed regardless of dataset size, which is what allows it to learn from terabyte-scale logs on commodity hardware. The combination of streaming data and bounded memory makes it effectively a one-pass learner.
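The hashing trick itself fits in a few lines. A sketch of the idea in Python; Vowpal Wabbit uses MurmurHash3 internally, and CRC32 here is only an illustrative stand-in:

```python
import zlib

def hash_features(named_features, n_bins=2**18):
    """Feature hashing: map arbitrarily many named features into a
    fixed-size sparse vector, so memory stays bounded no matter how
    many distinct feature names the stream produces."""
    x = {}
    for name, value in named_features.items():
        idx = zlib.crc32(name.encode()) % n_bins
        x[idx] = x.get(idx, 0.0) + value  # colliding features simply add
    return x
```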
River took the academic stream-mining ecosystem and packaged it for working data scientists. Every estimator implements learn_one(x, y) and predict_one(x), so a model can be updated with a single observation and queried immediately afterward. River is closer to how production systems actually look, where each event is processed as it arrives.
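The canonical River loop interleaves prediction and learning, which also yields progressive (prequential) validation for free. This follows the pattern from River's documentation, using its bundled Phishing dataset:

```python
from river import datasets, linear_model, metrics

model = linear_model.LogisticRegression()
metric = metrics.Accuracy()

# Predict first, then learn from the same example.
for x, y in datasets.Phishing():
    y_pred = model.predict_one(x)
    metric.update(y, y_pred)
    model.learn_one(x, y)

print(metric)
```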
Large language models are typically trained in giant offline runs, but they live in production for months or years. As user behavior, preferences, and adversarial pressure all shift, several lines of work apply online learning ideas to keep the models current.
Online RLHF gathers human or AI preference data continuously and updates a policy model against an evolving reward model. The risk is that fixed reward models drift off-distribution as the policy changes, so methods like Reward Learning on Policy (Lang et al. 2024) refresh the reward model with on-policy samples, and CPPO and COPR add regularization terms or sample-weighting schemes to limit catastrophic forgetting of earlier preferences.
Real-time recommender systems push embeddings and ranker weights through online updates whenever user feedback arrives. Similar patterns appear in real-time anomaly detection, where models are retrained or fine-tuned every few minutes against the latest data.
LLM agents that learn from production traces and feedback during deployment are pushing online learning in a less mature direction. Without guardrails such as replay buffers, frozen safety heads, or KL constraints to a reference model, an online policy can drift in undesirable ways. Operators routinely combine streaming updates with offline evaluation suites to catch regressions before they reach users.
Online learning shows up wherever data is fresh, fast, or both.
| Domain | Why online learning fits | Representative system or paper |
|---|---|---|
| Online advertising (CTR prediction) | Billions of impressions per day; per-feature sparsity matters | FTRL-Proximal at Google (McMahan et al. 2013) |
| Recommendation systems | New items, new users, and changing preferences arrive constantly | YouTube, Netflix, and TikTok ranking pipelines update incrementally |
| Real-time anomaly detection | Anomalies are rare; baselines drift | River-based intrusion and fraud detection pipelines |
| Financial trading | Markets are non-stationary; latency is costly | Online portfolio selection (Cover 1991), online sequential prediction |
| Robotics | Sensors stream continuously; offline data is expensive | On-policy fine-tuning during deployment |
| Search ranking | Query distribution shifts daily | Click models updated via online logistic regression |
| Spam and abuse classification | Adversaries change tactics in response to defenses | Vowpal Wabbit and similar pipelines at email providers |
| Online RLHF for LLMs | Preferences and abuse patterns shift | OAIF, Nash-MD, VPO (2024) |
The common thread is that the cost of waiting for a batch retrain is high, either because the data is too big to store or because the world has already moved on by the time the next training cycle finishes.
Online learning trades flexibility for several real risks.
Sensitivity to data ordering is structural. The same data delivered in a different order can produce a different model. This makes reproducibility harder and complicates debugging.
Stability and convergence are harder than in batch settings. Step sizes that are too aggressive cause oscillation; step sizes that are too small ignore drift. Per-coordinate learning rate schedules and FTRL-style regularization help but require tuning.
Catastrophic forgetting affects neural online learners that update without replay or regularization. Continual learning techniques mitigate it but never eliminate it.
Online RL and online RLHF can drift in undesirable directions if the feedback loop is biased, gameable, or too narrow. Operators usually anchor online updates with KL-regularization to a reference policy, hold-out evaluations, and rollback procedures.
Debugging is harder. The model state at any point in time depends on every prior example, so reproducing a bug may require replaying the exact sequence. Logging, deterministic shuffling, and checkpointing are essential operational disciplines.
Adversarial inputs can poison an online model if no input validation is in place. Spam classifiers and recommender systems both face active adversaries trying to push the model toward bad updates.
Most machine learning is like studying for a test. You read a big textbook from start to finish, take a test once, and you are done. Online learning is more like learning to play a video game. You see one screen at a time, you make a move, and you get a small piece of feedback. You change how you play just a little bit and then you see the next screen. Over time you get better, even though you never sat down with the whole game spread out in front of you. The hard part is remembering the early levels when you are deep in the later ones, and not getting confused when the game changes the rules.