Online learning is a machine learning paradigm in which a model receives data sequentially, one example or one mini-batch at a time, and updates its parameters immediately after each observation rather than training on a fixed dataset all at once. The model never assumes the full dataset is available in advance. Instead, it consumes a stream of inputs, makes a prediction on each, and ideally improves as it goes.
The approach contrasts with the more familiar batch (or offline) setup, where a learner sees the entire training set, possibly multiple times across many epochs, before producing a final model. Online learning is the natural framework for problems where data arrives faster than it can be stored, where the data-generating distribution changes over time, or where the cost of retraining a full model from scratch is prohibitive. It also has a deep theoretical history rooted in sequential decision making, game theory, and regret minimization.
The term is overloaded in practice. "Online" can refer to the algorithmic regime described above, but it sometimes also refers to online inference, meaning serving predictions to users in real time on already-trained models. These are different concepts. This article focuses on the learning paradigm.
The operational difference between online and offline learning shows up in how data is consumed, how parameters are updated, and what assumptions are made about the data distribution.
| Aspect | Online learning | Offline (batch) learning |
|---|---|---|
| Data access | Sequential, one example or mini-batch at a time | Entire training set available upfront |
| Update frequency | After every example or small batch | After full passes (epochs) over the data |
| Memory footprint | Bounded, independent of stream length | Grows with dataset size |
| Distribution assumption | Often makes none (adversarial setting) | Usually assumes iid samples from a fixed distribution |
| Reaction to drift | Adapts continuously | Requires periodic retraining |
| Theoretical goal | Sublinear regret against best fixed predictor | Low generalization error against a fixed distribution |
| Typical bottleneck | Per-step compute and stability | Total training time and memory |
The practical line is not always crisp. Stochastic gradient descent on a fixed dataset with a single example per step looks structurally identical to an online algorithm, and many SGD-based recipes are described as "online-like." The distinction is whether data is streaming and unrepeatable (true online) or whether it is a finite shuffled dataset that the optimizer simply happens to visit one example at a time.
Several neighboring ideas are often confused with online learning, and the distinctions matter:
| Concept | Defining property |
|---|---|
| Online learning | Sequential examples, immediate per-example updates |
| Offline / batch learning | Full dataset known in advance, parameters updated in epochs |
| Continual learning | Sequential tasks with shifting distributions; must retain old skills |
| Streaming data | Continuous, often unbounded data flow; the data regime that online learners consume |
| Online inference | Serving predictions in real time on a fixed model |
| Active learning | Learner chooses which unlabeled examples to query for labels |
| Reinforcement learning | Sequential interaction where actions affect future observations |
Continual learning sits closest to online learning. The difference is that continual learning explicitly cares about retaining performance on prior tasks while learning new ones, which motivates techniques like Elastic Weight Consolidation, replay buffers, and progressive networks. A pure online learner has no such constraint and may overwrite older knowledge freely.
Online learning theory generally splits the world into two regimes based on how the data sequence is generated.
In the stochastic setting, examples are drawn independently and identically distributed (iid) from some unknown but fixed distribution. The sequential nature of the algorithm is a computational convenience, not a statistical necessity. Classical generalization bounds apply once the online output is averaged or otherwise post-processed.
In the adversarial setting, the data sequence is chosen by an opponent that may have access to the learner's algorithm. No probabilistic assumption is placed on the inputs. The natural performance measure is no longer expected error but regret, which compares the learner's accumulated loss to the best fixed prediction strategy in hindsight. The 2006 monograph by Cesa-Bianchi and Lugosi gives the canonical treatment of this perspective and grounds many of the algorithms discussed below.
A learner is said to be no-regret if its regret grows sublinearly in the number of rounds T, meaning the average per-round regret tends to zero as T grows. A regret bound of O(sqrt(T)) is standard for convex losses; O(log T) is achievable under stronger conditions such as strong convexity.
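Written out, the regret after T rounds compares the learner's cumulative loss on the sequence of loss functions ℓ_t to the best fixed parameter chosen in hindsight:

```latex
R_T \;=\; \sum_{t=1}^{T} \ell_t(\theta_t) \;-\; \min_{\theta^\ast \in \Theta} \sum_{t=1}^{T} \ell_t(\theta^\ast),
\qquad
\text{no-regret:} \quad \lim_{T \to \infty} \frac{R_T}{T} \;=\; 0.
```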
A handful of algorithms anchor the field. They differ in their loss assumptions, their use of regularization, and the regret they achieve.
| Algorithm | Year and source | Update rule (sketch) | Regret / mistake bound |
|---|---|---|---|
| Perceptron | Rosenblatt 1958, analysis Novikoff 1962 | If misclassified, w_t = w_{t-1} + y_t x_t | Mistake bound R^2 / gamma^2 if data is linearly separable with margin gamma and radius R |
| Hedge / Multiplicative Weights | Freund and Schapire 1997 | w_i,t+1 proportional to w_i,t * exp(-eta * loss_i,t) | O(sqrt(T log N)) for N experts |
| Online Gradient Descent (OGD) | Zinkevich 2003 | theta_t = theta_{t-1} - eta_t * grad L(theta_{t-1}; x_t, y_t), then projected | O(sqrt(T)) for convex losses; O(log T) under strong convexity |
| Online Mirror Descent (OMD) | Beck and Teboulle 2003 | Update in a dual space defined by a Bregman divergence | Matches OGD; better constants for sparse or simplex domains |
| Follow the Regularized Leader (FTRL) | McMahan 2011 (analysis), Shalev-Shwartz 2007 (formulation) | theta_t = argmin_theta sum_{s<t} L_s(theta) + R(theta) | O(sqrt(T)) in general; equivalent to OMD in many cases |
| FTRL-Proximal | McMahan, Holt et al. 2013 | FTRL variant with per-coordinate adaptive learning rates and L1 regularization | Same regret as FTRL; produces highly sparse models, used in Google ad CTR |
The Perceptron is the historical entry point. Frank Rosenblatt introduced it in 1958 as a brain-inspired model for binary classification. Its online update only changes weights on a misclassified point. Albert Novikoff's 1962 mistake bound, which says the algorithm makes at most R^2 / gamma^2 mistakes on linearly separable data, was a foundational result for the entire field. The bound depends on geometry of the data, not on the length of the sequence.
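For concreteness, here is a minimal sketch of the perceptron's online loop (the function and variable names are illustrative, not from any particular library):

```python
import numpy as np

def perceptron_stream(stream, dim):
    """Online perceptron: predict, then update weights only on mistakes.

    `stream` yields (x, y) pairs with x an ndarray of shape (dim,)
    and y in {-1, +1}.
    """
    w = np.zeros(dim)
    mistakes = 0
    for x, y in stream:
        y_hat = 1 if w @ x > 0 else -1
        if y_hat != y:
            # Novikoff: at most (R / gamma)^2 such updates occur on
            # linearly separable data with margin gamma and radius R.
            w += y * x
            mistakes += 1
    return w, mistakes
```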
Online gradient descent generalized this idea to arbitrary convex loss functions. Martin Zinkevich's 2003 analysis showed that taking a gradient step on the current example, with step size eta_t = 1 / sqrt(t) and a projection back into the feasible set, achieves regret O(sqrt(T)) against the best fixed parameter in hindsight, with no probabilistic assumption on the data.
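A sketch of projected OGD under the same assumptions, taking the feasible set to be an L2 ball and the per-example gradient as a caller-supplied function:

```python
import numpy as np

def online_gradient_descent(stream, grad, dim, radius=1.0):
    """Zinkevich-style OGD: gradient step with eta_t = 1/sqrt(t),
    then Euclidean projection back into the feasible set."""
    theta = np.zeros(dim)
    for t, (x, y) in enumerate(stream, start=1):
        eta = 1.0 / np.sqrt(t)
        theta -= eta * grad(theta, x, y)
        # Projection onto the L2 ball {theta : ||theta||_2 <= radius}.
        norm = np.linalg.norm(theta)
        if norm > radius:
            theta *= radius / norm
    return theta
```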
FTRL recasts the problem: at each step, output the parameter that would have minimized cumulative loss so far, plus a regularizer. The regularizer keeps the iterate from chasing noise too aggressively. McMahan's 2011 unification showed that FTRL and online mirror descent are essentially equivalent in many settings, which clarified a decade of seemingly different algorithms.
The Hedge algorithm, also called multiplicative weights, attacks the experts problem. The learner maintains a probability distribution over N experts and adjusts the weights exponentially based on observed expert losses. Freund and Schapire showed regret O(sqrt(T log N)) against the best expert. The algorithm reappears throughout machine learning, optimization, and even theoretical computer science as a primitive for solving zero-sum games and linear programs.
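The update is short enough to state in full. A sketch, assuming losses in [0, 1] supplied as a (T, N) array; tuning the learning rate as eta = sqrt(8 ln N / T) recovers the stated regret bound:

```python
import numpy as np

def hedge(loss_matrix, eta):
    """Hedge / multiplicative weights over N experts.

    loss_matrix[t, i] is expert i's loss at round t, in [0, 1].
    Returns the final probability distribution over experts."""
    T, N = loss_matrix.shape
    w = np.ones(N) / N
    for t in range(T):
        w *= np.exp(-eta * loss_matrix[t])  # exponential down-weighting
        w /= w.sum()                        # renormalize
    return w
```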
The FTRL-Proximal algorithm, presented by H. Brendan McMahan and colleagues at KDD 2013 in "Ad Click Prediction: a View from the Trenches," is the most widely cited industrial deployment of online learning. The system predicted click-through rate (CTR) for ads served by Google and operated at a scale that rules out periodic batch retraining.
Three design choices made the algorithm work in production:

- The FTRL-Proximal update itself, whose L1 term keeps most coordinates exactly zero, so the model stays sparse enough to fit in memory while matching the accuracy of online gradient descent.
- Per-coordinate learning rates, so rare features keep large step sizes while frequent features anneal quickly.
- Memory-saving tricks such as probabilistic feature inclusion, which admits a feature into the model only after it has appeared enough times.
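A compact sketch of the resulting per-coordinate update, following the pseudocode in the paper (the hyperparameters alpha, beta, l1, l2 mirror the paper's alpha, beta, lambda_1, lambda_2; the sparse-dict input format is our own):

```python
import math
from collections import defaultdict

class FTRLProximal:
    """Per-coordinate FTRL-Proximal for logistic regression, sketched
    after McMahan et al. (KDD 2013). Inputs are sparse dicts mapping
    feature index -> value; labels y are in {0, 1}."""

    def __init__(self, alpha=0.1, beta=1.0, l1=1.0, l2=1.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = defaultdict(float)  # lazy weight accumulators
        self.n = defaultdict(float)  # sums of squared gradients

    def _weight(self, i):
        z = self.z[i]
        if abs(z) <= self.l1:
            return 0.0  # L1 threshold: coordinate stays exactly zero
        return -(z - math.copysign(self.l1, z)) / (
            (self.beta + math.sqrt(self.n[i])) / self.alpha + self.l2)

    def predict_one(self, x):
        s = sum(self._weight(i) * v for i, v in x.items())
        return 1.0 / (1.0 + math.exp(-s))

    def learn_one(self, x, y):
        p = self.predict_one(x)
        for i, v in x.items():
            w_i = self._weight(i)  # weight used at prediction time
            g = (p - y) * v        # gradient of log loss, coordinate i
            sigma = (math.sqrt(self.n[i] + g * g)
                     - math.sqrt(self.n[i])) / self.alpha
            self.z[i] += g - sigma * w_i
            self.n[i] += g * g
```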
The paper reported AUC improvements on the order of 11% over earlier production baselines and is one of the few deeply practical accounts of running an online learner at internet scale. It also covered topics that academic papers usually skip: how to subsample training data to fit memory budgets, how to monitor a continuously updating model, and how to detect regressions in real time.
The link between online learning and standard supervised learning runs through online-to-batch conversion. If an online algorithm achieves regret O(sqrt(T)) on an iid stream of T examples and the user averages the sequence of online predictors (or otherwise selects one carefully), the resulting fixed predictor has expected risk within O(1/sqrt(T)) of the best in the comparator class. Cesa-Bianchi, Conconi, and Gentile gave a clean analysis of this conversion in 2004, showing that the online-to-batch trick yields data-dependent generalization bounds without strong probabilistic machinery.
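For convex losses the conversion can be as simple as averaging the iterates; Jensen's inequality then transfers the regret bound directly to the excess risk:

```latex
\bar{\theta}_T \;=\; \frac{1}{T} \sum_{t=1}^{T} \theta_t,
\qquad
\mathbb{E}\!\left[L(\bar{\theta}_T)\right]
\;\le\; \min_{\theta^\ast} L(\theta^\ast) + \frac{\mathbb{E}[R_T]}{T}
\;=\; \min_{\theta^\ast} L(\theta^\ast) + O\!\left(\tfrac{1}{\sqrt{T}}\right).
```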
This result is more than a technicality. It means an algorithm designed for the worst-case adversarial setting still works in the friendlier iid setting and inherits a sound generalization guarantee almost for free. Many practical recipes, such as averaging the iterates of SGD, are special cases.
Online learning becomes harder when the learner only observes the loss of the action it actually took, not the loss of all available actions. This is the multi-armed bandit setting. The classic stochastic bandit assumes each arm has a fixed but unknown reward distribution; algorithms like UCB and Thompson sampling achieve regret O(log T). The adversarial bandit setting drops the stochastic assumption and is solved by EXP3, a Hedge variant that uses importance weighting to build unbiased estimates of the losses of arms it did not pull.
Bandits sit at the boundary between supervised online learning and reinforcement learning. Unlike full reinforcement learning, the action does not change the underlying state, so the problem is sequential but not Markovian.
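UCB1 (Auer, Cesa-Bianchi, and Fischer 2002) is the standard concrete instance of the UCB idea. A sketch, with `pull` standing in for whatever environment supplies rewards in [0, 1]:

```python
import math

def ucb1(pull, n_arms, horizon):
    """UCB1 for stochastic bandits: try each arm once, then pick the
    arm maximizing empirical mean reward plus an exploration bonus."""
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # initialization: pull every arm once
        else:
            arm = max(range(n_arms),
                      key=lambda a: means[a]
                      + math.sqrt(2 * math.log(t) / counts[a]))
        r = pull(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]  # running mean
    return means, counts
```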
A pure online learner that updates greedily on each new example often overwrites useful knowledge. In neural networks, this phenomenon is called catastrophic forgetting: training a network on Task B after Task A typically destroys most of the performance on Task A. Continual learning is the broader research area trying to fix this.
Elastic Weight Consolidation (EWC), introduced by James Kirkpatrick and colleagues at DeepMind in PNAS in 2017, is the most cited approach. EWC adds a quadratic penalty to the loss that keeps weights important for previous tasks close to their old values. Importance is estimated using the diagonal of the Fisher information matrix. The intuition mirrors synaptic consolidation in biological brains: neurons that mattered for learned skills become harder to overwrite. EWC works for both supervised and reinforcement learning task sequences and was demonstrated on Atari games learned in succession.
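The penalty from the paper, for a new task B trained after an old task A, with F_i the diagonal Fisher estimates, θ*_A the parameters learned on task A, and λ controlling how strongly old weights are anchored:

```latex
L(\theta) \;=\; L_B(\theta) \;+\; \sum_i \frac{\lambda}{2}\, F_i \left(\theta_i - \theta_{A,i}^{\ast}\right)^2
```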
Other continual learning techniques include:

- Replay (rehearsal) methods, which store or regenerate examples from earlier tasks and mix them into later updates.
- Progressive networks, which freeze the columns trained on earlier tasks and add fresh capacity for each new one.
- Distillation-based methods such as Learning without Forgetting, which penalize changes to the old model's outputs on new data.
- Regularization methods in the EWC family, such as Synaptic Intelligence, which estimate per-parameter importance online.
These methods address distribution shift in a structured way that pure online algorithms generally do not.
In streaming applications the distribution can shift suddenly (a holiday traffic spike, a software change upstream) or gradually (slow user behavior changes). Concept drift detectors sit alongside an online learner and decide when the model has fallen out of date.
| Detector | Reference | Idea |
|---|---|---|
| Page-Hinkley test | Page 1954 | Cumulative deviation of observed loss from running mean |
| DDM (Drift Detection Method) | Gama et al. 2004 | Tracks online error rate; flags warning and drift thresholds |
| EDDM (Early DDM) | Baena-Garcia et al. 2006 | Detects gradual drift using distance between consecutive errors |
| ADWIN | Bifet and Gavalda 2007 | Adaptive sliding window; provides false-positive and false-negative bounds |
ADWIN is widely used because it offers rigorous guarantees: it keeps a variable-length window of recent observations such that no statistically significant change is detected within it, and it shrinks the window when a change appears. The detector can be plugged into concept drift pipelines to trigger model resets or rolling-window retraining.
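In River, for example, a detector can monitor the stream of per-example losses. A sketch; `loss_stream` is a hypothetical iterable of floats, and recent River releases expose the `drift_detected` property shown here (older versions called it `change_detected`):

```python
from river import drift

detector = drift.ADWIN()

for i, loss in enumerate(loss_stream):
    detector.update(loss)
    if detector.drift_detected:
        # ADWIN shrank its window: the loss distribution changed.
        print(f"change detected at example {i}; consider a model reset")
```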
A few systems have shaped how practitioners actually deploy online learners.
| Tool | Origin | Notable features |
|---|---|---|
| Vowpal Wabbit | Started by John Langford at Yahoo Research, then Microsoft Research | Out-of-core SGD, the hashing trick (32-bit MurmurHash3), contextual bandits, allreduce-based distributed training |
| River | Merger of creme and scikit-multiflow, JMLR 2021 | Pure Python online ML; supports per-example fit, drift detectors, evaluators, and a scikit-learn-like API |
| scikit-learn partial_fit | scikit-learn project | Mini-batch style updates for SGDClassifier, MultinomialNB, MiniBatchKMeans, and others |
| Apache Flink ML / Spark Streaming MLlib | Apache Foundation | Online updates inside larger streaming data pipelines |
| MOA (Massive Online Analysis) | University of Waikato | Java framework for stream mining, includes ADWIN, Hoeffding trees, and concept drift datasets |
Vowpal Wabbit deserves special mention. It pairs online learning with feature hashing to keep the feature dimension fixed regardless of dataset size, which is what allows it to learn from terabyte-scale logs on commodity hardware. The combination of streaming data and bounded memory makes it effectively a one-pass learner.
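The hashing trick itself fits in a few lines. A sketch of the idea in Python; Vowpal Wabbit uses MurmurHash3 internally, and CRC32 here is only an illustrative stand-in:

```python
import zlib

def hash_features(named_features, n_bins=2**18):
    """Feature hashing: map arbitrarily many named features into a
    fixed-size sparse vector, so memory stays bounded no matter how
    many distinct feature names the stream produces."""
    x = {}
    for name, value in named_features.items():
        idx = zlib.crc32(name.encode()) % n_bins
        x[idx] = x.get(idx, 0.0) + value  # colliding features simply add
    return x
```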
River took the academic stream-mining ecosystem and packaged it for working data scientists. Every estimator implements learn_one(x, y) and predict_one(x), so a model can be updated with a single observation and queried immediately afterward. River is closer to how production systems actually look, where each event is processed as it arrives.
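The canonical River loop interleaves prediction and learning, which also yields progressive (prequential) validation for free. This follows the pattern from River's documentation, using its bundled Phishing dataset:

```python
from river import datasets, linear_model, metrics

model = linear_model.LogisticRegression()
metric = metrics.Accuracy()

# Predict first, then learn from the same example.
for x, y in datasets.Phishing():
    y_pred = model.predict_one(x)
    metric.update(y, y_pred)
    model.learn_one(x, y)

print(metric)
```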
Large language models are typically trained in giant offline runs, but they live in production for months or years. As user behavior, preferences, and adversarial pressure all shift, several lines of work apply online learning ideas to keep the models current.
Online RLHF gathers human or AI preference data continuously and updates a policy model against an evolving reward model. The risk is that fixed reward models drift off-distribution as the policy changes, so methods like Reward Learning on Policy (Lang et al. 2024) refresh the reward model with on-policy samples, and CPPO and COPR add regularization terms or sample-weighting schemes to limit catastrophic forgetting of earlier preferences.
Real-time recommender systems push embeddings and ranker weights through online updates whenever user feedback arrives. Similar patterns appear in real-time anomaly detection, where models are retrained or fine-tuned every few minutes against the latest data.
LLM agents that learn from production traces and feedback during deployment are pushing online learning in a less mature direction. Without guardrails such as replay buffers, frozen safety heads, or KL constraints to a reference model, an online policy can drift in undesirable ways. Operators routinely combine streaming updates with offline evaluation suites to catch regressions before they reach users.
Online learning shows up wherever data is fresh, fast, or both.
| Domain | Why online learning fits | Representative system or paper |
|---|---|---|
| Online advertising (CTR prediction) | Billions of impressions per day; per-feature sparsity matters | FTRL-Proximal at Google (McMahan et al. 2013) |
| Recommendation systems | New items, new users, and changing preferences arrive constantly | YouTube, Netflix, and TikTok ranking pipelines update incrementally |
| Real-time anomaly detection | Anomalies are rare; baselines drift | River-based intrusion and fraud detection pipelines |
| Financial trading | Markets are non-stationary; latency is costly | Online portfolio selection (Cover 1991), online sequential prediction |
| Robotics | Sensors stream continuously; offline data is expensive | On-policy fine-tuning during deployment |
| Search ranking | Query distribution shifts daily | Click models updated via online logistic regression |
| Spam and abuse classification | Adversaries change tactics in response to defenses | Vowpal Wabbit and similar pipelines at email providers |
| Online RLHF for LLMs | Preferences and abuse patterns shift | OAIF, Nash-MD, VPO (2024) |
The common thread is that the cost of waiting for a batch retrain is high, either because the data is too big to store or because the world has already moved on by the time the next training cycle finishes.
Online learning trades flexibility for several real risks.
Sensitivity to data ordering is structural. The same data delivered in a different order can produce a different model. This makes reproducibility harder and complicates debugging.
Stability and convergence are harder than in batch settings. Step sizes that are too aggressive cause oscillation; step sizes that are too small ignore drift. Per-coordinate learning rate schedules and FTRL-style regularization help but require tuning.
Catastrophic forgetting affects neural online learners that update without replay or regularization. Continual learning techniques mitigate it but never eliminate it.
Online RL and online RLHF can drift in undesirable directions if the feedback loop is biased, gameable, or too narrow. Operators usually anchor online updates with KL-regularization to a reference policy, hold-out evaluations, and rollback procedures.
Debugging is harder. The model state at any point in time depends on every prior example, so reproducing a bug may require replaying the exact sequence. Logging, deterministic shuffling, and checkpointing are essential operational disciplines.
Adversarial inputs can poison an online model if no input validation is in place. Spam classifiers and recommender systems both face active adversaries trying to push the model toward bad updates.
Most machine learning is like studying for a test. You read a big textbook from start to finish, take a test once, and you are done. Online learning is more like learning to play a video game. You see one screen at a time, you make a move, and you get a small piece of feedback. You change how you play just a little bit and then you see the next screen. Over time you get better, even though you never sat down with the whole game spread out in front of you. The hard part is remembering the early levels when you are deep in the later ones, and not getting confused when the game changes the rules.