Continual learning, also called lifelong learning or incremental learning, is a machine learning paradigm in which a model learns from a stream of tasks or data distributions over time, accumulating knowledge while retaining what it has already learned. Unlike conventional training, where a model is trained once on a fixed dataset and then deployed, continual learning requires the model to adapt to new information without losing performance on previously acquired tasks. The central challenge is catastrophic forgetting: the tendency of neural networks to abruptly lose knowledge of earlier tasks when their parameters are updated to accommodate new ones.
The problem is fundamental to building intelligent systems that operate in open-ended, non-stationary environments. Humans learn continuously throughout their lives, building on prior knowledge and rarely forgetting well-practiced skills when acquiring new ones. Replicating this ability in artificial systems has been a long-standing goal of artificial intelligence research. Continual learning sits at the intersection of several related fields, including transfer learning, meta-learning, multi-task learning, and curriculum learning, but is distinguished by its emphasis on sequential task presentation and the explicit goal of avoiding forgetting.
The field has gained renewed prominence with the rise of large language models (LLMs) and foundation models, which are expensive to retrain from scratch but must be updated to incorporate new knowledge, skills, and alignment objectives over time.
The phenomenon now known as catastrophic forgetting was first identified by McCloskey and Cohen in 1989. Their experiments involved a simple neural network trained to learn arithmetic (addition of pairs of integers). When the network was trained on a second group of addition problems, it rapidly and almost completely forgot the first group. Ratcliff (1990) independently demonstrated similar effects. These findings revealed a serious limitation of connectionist models: the shared, distributed representations that give neural networks their generalization power also make them vulnerable to interference when the training distribution changes.
Robert French published an influential review of the problem in 1999 in Trends in Cognitive Sciences, titled "Catastrophic Forgetting in Connectionist Networks," which formalized the issue and surveyed early mitigation strategies. French connected catastrophic forgetting to the broader stability-plasticity dilemma, a concept first articulated by Stephen Grossberg in the early 1980s. Grossberg observed that any learning system must balance two competing demands: it must be plastic enough to incorporate new information and stable enough to retain existing knowledge. Too much plasticity leads to forgetting; too much stability prevents learning.
A key insight from neuroscience came with the complementary learning systems (CLS) theory, proposed by McClelland, McNaughton, and O'Reilly in 1995. CLS theory suggests that the mammalian brain solves the stability-plasticity dilemma through two complementary memory systems. The hippocampus rapidly encodes new episodic memories, while the neocortex slowly integrates information into structured, long-term knowledge through a process of memory consolidation and replay. This dual-system architecture has inspired many modern continual learning approaches, particularly replay-based methods.
Research on continual learning in deep neural networks accelerated dramatically after 2016, driven by several landmark papers: Elastic Weight Consolidation (Kirkpatrick et al., 2017), Progressive Neural Networks (Rusu et al., 2016), Learning without Forgetting (Li and Hoiem, 2016), and Synaptic Intelligence (Zenke, Poole, and Ganguli, 2017). These works established the main families of approaches that continue to shape the field.
Catastrophic forgetting occurs when a neural network trained sequentially on multiple tasks experiences a severe drop in performance on earlier tasks after learning new ones. The root cause lies in the way neural networks store knowledge. Unlike lookup tables or databases where each piece of information occupies a distinct memory location, neural networks encode knowledge in a distributed fashion across shared weights. When those weights are updated to minimize the loss on a new task, the changes can overwrite the weight configurations that were critical for earlier tasks.
More formally, consider a network with parameters θ that has been trained on task A to reach an optimal configuration θ_A. When the network is then trained on task B, gradient descent pushes the parameters toward a new configuration θ_B that minimizes the loss on task B. If the loss landscapes for tasks A and B are sufficiently different, θ_B may be far from θ_A in parameter space, resulting in poor performance on task A. The problem is especially severe when the two tasks are dissimilar, when training on task B continues for many updates, and when no data from task A is available for rehearsal.
Catastrophic forgetting is distinct from the gradual forgetting that humans experience. Human forgetting follows a smooth decay curve (the Ebbinghaus forgetting curve), while neural network forgetting is abrupt and often nearly complete. A network that achieved 95% accuracy on task A might drop to near-random performance after just a few epochs of training on task B.
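The abruptness of this collapse is easy to reproduce in miniature. The following sketch uses a hypothetical two-parameter model with quadratic task losses (not any specific published setup): plain gradient descent first reaches the task-A optimum, then training on task B alone drags the parameters away from it.

```python
import numpy as np

# Toy illustration of catastrophic forgetting (hypothetical 2-D example):
# two quadratic "task losses" with different optima. Gradient descent on
# task B alone pulls the parameters far from the task-A optimum.

def train(theta, target, steps=200, lr=0.1):
    for _ in range(steps):
        theta = theta - lr * 2.0 * (theta - target)  # grad of ||theta - target||^2
    return theta

theta_A_opt = np.array([1.0, 0.0])    # optimum of task A
theta_B_opt = np.array([-1.0, 2.0])   # optimum of task B

theta = train(np.zeros(2), theta_A_opt)                    # learn task A
loss_A_before = float(np.sum((theta - theta_A_opt) ** 2))  # near zero
theta = train(theta, theta_B_opt)                          # then train only on task B
loss_A_after = float(np.sum((theta - theta_A_opt) ** 2))   # large again
```

With no constraint tying the parameters to θ_A, the task-A loss returns almost exactly to its untrained level; every method family below is a different way of adding such a constraint.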
The severity of catastrophic forgetting depends on the degree of overlap between task representations. When tasks share similar low-level features (for example, two image classification tasks that both rely on edge detectors), forgetting tends to be less severe because the shared features remain useful. When tasks are more dissimilar, forgetting is typically worse.
Van de Ven and Tolias (2019), in their paper "Three Scenarios for Continual Learning," introduced a widely adopted framework for categorizing continual learning problems based on what information is available at test time. The three scenarios differ in difficulty and in the types of methods that succeed in each. Van de Ven, Tuytelaars, and Tolias later expanded this framework in their 2022 paper "Three Types of Incremental Learning" published in Nature Machine Intelligence.
In task-incremental learning, the model must learn a sequence of distinct tasks, and at test time it is told which task it is being asked to perform. This means the model receives a task identifier (task ID) as input alongside the test sample. Each task typically has its own output head (a separate set of output neurons), and the task ID tells the model which output head to use.
Because the model knows which task to evaluate, it only needs to distinguish between classes within a single task, not across all tasks. This makes task-incremental learning the easiest of the three scenarios. Regularization-based methods, such as Elastic Weight Consolidation (EWC), perform relatively well in this setting.
Example: a model sequentially learns to classify digits 0-1 (task 1), digits 2-3 (task 2), and digits 4-5 (task 3). At test time, it is told "this is a task 2 sample" and must decide between digit 2 and digit 3.
In domain-incremental learning, the model learns to solve the same type of problem across different contexts or domains, and the task structure remains the same. At test time, the model is not told which domain the sample comes from. The output space is shared across all domains.
A classic example is Permuted MNIST, where the model must classify handwritten digits (always the same 10 classes), but each task applies a different fixed pixel permutation to the images. The model must learn to handle all permutations without being told which permutation was applied at test time.
Domain-incremental learning is harder than task-incremental learning because the model cannot rely on a task ID to select the appropriate processing pathway. However, it is simpler than class-incremental learning because the output space does not grow.
Class-incremental learning is the most challenging scenario. The model must learn new classes over time, and at test time it must distinguish between all classes seen so far without receiving any task identifier. The output space grows with each new task.
For example, a model first learns to classify cats versus dogs, then learns to classify birds versus fish. At test time, it must correctly classify any of the four classes without being told which pair of classes the sample belongs to.
Class-incremental learning requires the model to both learn good representations for new classes and maintain decision boundaries between all previously learned classes. Van de Ven et al. found that replay-based methods (either using a memory buffer or a generative model) are the only approaches among commonly used strategies that perform consistently well in this scenario.
| Feature | Task-Incremental (Task-IL) | Domain-Incremental (Domain-IL) | Class-Incremental (Class-IL) |
|---|---|---|---|
| Task ID at test time | Yes | No | No |
| Output space | Separate head per task | Shared across tasks | Grows with each task |
| Key challenge | Preserving per-task parameters | Handling domain shift | Discriminating all classes |
| Relative difficulty | Easiest | Moderate | Hardest |
| Best-performing methods | Regularization, replay, architectural | Replay, domain adaptation | Replay (buffer or generative) |
| Classic benchmark | Split MNIST (with task ID) | Permuted MNIST | Split MNIST (without task ID) |
The methods developed to address catastrophic forgetting can be organized into three broad families: regularization-based, replay-based, and architecture-based approaches. Some methods combine elements from multiple families.
Regularization-based methods add a penalty term to the loss function that discourages changes to parameters that are important for previously learned tasks. The core idea is to constrain the optimization so that learning a new task does not drastically alter the weight configurations needed for old tasks.
Elastic Weight Consolidation, introduced by Kirkpatrick et al. in 2017 in the Proceedings of the National Academy of Sciences, is one of the most influential regularization methods for continual learning. EWC draws inspiration from synaptic consolidation in neuroscience: just as the brain reduces the plasticity of synapses that are important for well-learned behaviors, EWC selectively constrains the parameters that matter most for previous tasks.
EWC adds a quadratic penalty to the loss function that pulls each parameter toward its value after training on the previous task, weighted by the parameter's importance. The importance of each parameter is estimated using the diagonal of the Fisher information matrix, which approximates the curvature of the loss landscape around the learned solution. Parameters with high Fisher information (those where small changes would significantly increase the loss on the previous task) are penalized more heavily, while parameters with low Fisher information are left free to adapt to the new task.
The EWC loss for task B, given a model previously trained on task A, is:
L_total = L_B(θ) + (λ/2) Σ_i F_i (θ_i - θ*_A,i)²
where L_B is the loss on task B, θ*_A is the optimal parameter vector for task A, F_i is the Fisher information for parameter i, and λ controls the strength of the regularization.
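A minimal sketch of this penalty in code, assuming a precomputed Fisher diagonal and an arbitrary stand-in for the task-B loss (all numbers are illustrative):

```python
import numpy as np

# EWC loss sketch: quadratic penalty on deviation from the task-A solution,
# weighted elementwise by the diagonal Fisher information.

def ewc_loss(theta, loss_B, theta_star_A, fisher_diag, lam):
    penalty = 0.5 * lam * np.sum(fisher_diag * (theta - theta_star_A) ** 2)
    return loss_B(theta) + penalty

theta_star_A = np.array([1.0, -2.0, 0.5])   # parameters after task A
fisher_diag = np.array([10.0, 0.1, 1.0])    # parameter 0 is most important
loss_B = lambda th: np.sum(th ** 2)         # stand-in loss for task B

theta = np.array([0.9, 0.0, 0.2])
total = ewc_loss(theta, loss_B, theta_star_A, fisher_diag, lam=1.0)
```

Note how the high-Fisher parameter (index 0) pays a large penalty for a small deviation from θ*_A, while the low-Fisher parameter (index 1) moves almost freely.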
A known limitation of standard EWC is that it does not scale gracefully to many tasks, because the Fisher information matrices from all previous tasks must be stored. Online EWC (Schwarz et al., 2018) addresses this by maintaining a running average of the Fisher information, using only a single regularization term regardless of the number of tasks.
Synaptic Intelligence, proposed by Zenke, Poole, and Ganguli in 2017 at ICML, takes a similar approach to EWC but computes parameter importance in an online fashion during training, rather than computing it offline after each task. Each parameter accumulates an importance score based on its contribution to the reduction of the training loss along the entire learning trajectory. When training on a new task, parameters with high accumulated importance are penalized for deviating from their previous values.
SI and EWC yield similar performance on benchmarks like Permuted MNIST, but SI has the advantage of not requiring a separate Fisher information computation step.
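The online accumulation at the heart of SI can be sketched as follows. The per-step update (each parameter accumulates minus its gradient times its step) and the end-of-task normalization by squared displacement follow the paper's recipe; the quadratic toy task and all constants are illustrative.

```python
import numpy as np

# Synaptic Intelligence sketch: accumulate each parameter's running
# contribution to loss reduction, omega_i += -g_i * dtheta_i, then
# normalize by its total displacement (plus a damping term xi).

def train_with_si(theta, grad_fn, steps=100, lr=0.1, xi=1e-3):
    omega = np.zeros_like(theta)        # running path integral
    theta_start = theta.copy()
    for _ in range(steps):
        g = grad_fn(theta)
        step = -lr * g
        omega += -g * step              # per-step contribution to loss decrease
        theta = theta + step
    total_change = theta - theta_start
    importance = omega / (total_change ** 2 + xi)
    return theta, importance

# Toy quadratic task: loss depends strongly on theta[0], weakly on theta[1].
grad_fn = lambda th: np.array([10.0 * th[0], 0.1 * th[1]])
theta, importance = train_with_si(np.array([1.0, 1.0]), grad_fn)
```

The parameter that drove most of the loss reduction ends up with a much larger importance score, and would be penalized more heavily on the next task.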
Learning without Forgetting, introduced by Li and Hoiem (2016) at ECCV, uses knowledge distillation rather than explicit parameter regularization. Before training on a new task, the model's current outputs on the new task's data are recorded. During training, the model is optimized to both perform well on the new task and reproduce its previous outputs (soft targets) on the new data. This acts as a form of functional regularization: the model's behavior on old tasks is preserved even though the training data for those tasks is not available.
LwF is notable because it requires no storage of old data and no computation of parameter importance. However, its effectiveness decreases as the number of tasks grows or when new tasks are very different from previous ones, because the soft targets become less reliable guides for preserving old knowledge.
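The distillation term can be sketched as a cross-entropy between temperature-softened output distributions; T = 2 is a common choice, and the logits below are made up for illustration.

```python
import numpy as np

# LwF-style distillation sketch: penalize the new model for diverging from
# the soft targets the old model produced on the new task's inputs.

def softmax(z, T=1.0):
    z = z / T                                    # temperature softening
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(old_logits, new_logits, T=2.0):
    # Cross-entropy between softened old and new output distributions.
    p_old = softmax(old_logits, T)
    p_new = softmax(new_logits, T)
    return -np.mean(np.sum(p_old * np.log(p_new + 1e-12), axis=-1))

old_logits = np.array([[2.0, 0.5, -1.0]])
same = distillation_loss(old_logits, old_logits)            # minimal (entropy)
drift = distillation_loss(old_logits, old_logits[:, ::-1])  # outputs changed
```

The loss is minimized when the new model reproduces the old model's soft outputs, which is exactly the functional-preservation pressure described above.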
Replay-based methods maintain a memory of past experiences and interleave them with new data during training. This directly combats forgetting by periodically exposing the model to examples from earlier tasks. Replay methods are inspired by the hippocampal replay observed in mammalian brains, where the hippocampus replays stored memories during sleep to consolidate them into long-term neocortical storage.
Experience replay is the simplest replay strategy. The model maintains a small buffer of stored examples from previous tasks. When training on a new task, samples from the buffer are mixed into each training batch. Despite its simplicity, experience replay is a strong baseline that often outperforms more complex methods. The buffer is typically managed using reservoir sampling, which ensures that all previously seen examples have an equal probability of being stored, regardless of when they were encountered.
A key design choice is the buffer size. Larger buffers reduce forgetting but increase memory requirements. Even small buffers (a few hundred examples) can significantly reduce catastrophic forgetting.
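A minimal reservoir-sampling buffer, as commonly used for experience replay; this is a sketch, not any particular library's implementation.

```python
import random

# Reservoir sampling (Algorithm R): after n examples have streamed past,
# each of them occupies the buffer with equal probability capacity/n.

class ReservoirBuffer:
    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.data = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, example):
        self.seen += 1
        if len(self.data) < self.capacity:
            self.data.append(example)            # fill phase
        else:
            j = self.rng.randrange(self.seen)    # index in [0, seen)
            if j < self.capacity:
                self.data[j] = example           # replace a stored example

    def sample(self, k):
        return self.rng.sample(self.data, min(k, len(self.data)))

buf = ReservoirBuffer(capacity=100)
for x in range(10_000):     # stream of examples across successive tasks
    buf.add(x)
replay_batch = buf.sample(10)
```

During training, `replay_batch` would be concatenated with the current task's batch so that old examples keep contributing to the gradient.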
Gradient Episodic Memory, introduced by Lopez-Paz and Ranzato in 2017, uses stored examples not for direct replay but to define optimization constraints. At each training step, GEM checks whether the proposed gradient update would increase the loss on any stored examples from previous tasks. If so, the gradient is projected to the closest vector that does not violate these constraints. This ensures that each update either improves or at least does not harm performance on past tasks.
GEM is computationally expensive because it requires solving a quadratic programming problem at each step, with cost growing linearly in the number of tasks. Averaged GEM (A-GEM), proposed by Chaudhry et al. in 2019, simplifies this by using a single averaged gradient constraint from a random subset of the memory buffer, making it much faster while retaining most of GEM's benefits.
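The A-GEM correction itself is a one-line gradient projection; this sketch assumes `g_ref` has already been averaged over a batch drawn from the memory buffer.

```python
import numpy as np

# A-GEM projection sketch: if the proposed gradient g conflicts with the
# average replay gradient g_ref (negative dot product), project g onto the
# half-space where it no longer increases the replayed examples' loss.

def agem_project(g, g_ref):
    dot = g @ g_ref
    if dot >= 0:
        return g                                  # no interference, keep g
    return g - (dot / (g_ref @ g_ref)) * g_ref    # remove conflicting component

g = np.array([1.0, -1.0])        # gradient on the current task
g_ref = np.array([0.0, 1.0])     # averaged gradient on replayed examples
g_proj = agem_project(g, g_ref)
```

After projection the update is orthogonal to (or aligned with) the replay gradient, so in expectation it does not increase the loss on stored examples, at a tiny fraction of GEM's quadratic-programming cost.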
iCaRL, introduced by Rebuffi, Kolesnikov, Sperl, and Lampert at CVPR 2017, combines replay with knowledge distillation for class-incremental learning. iCaRL maintains a fixed-size set of exemplars (representative examples) for each class, selected to be closest to the class mean in feature space. Classification is performed using a nearest-class-mean rule rather than the network's softmax output. When new classes arrive, the model is trained using a combination of the new class data and the stored exemplars, with a distillation loss to preserve the representations of old classes.
iCaRL was one of the first methods to address class-incremental learning in a principled way and remains a widely used baseline.
Generative replay, introduced by Shin et al. in 2017, replaces the memory buffer with a generative model that learns to produce synthetic examples resembling data from past tasks. Instead of storing real examples, the system trains a generative adversarial network (GAN) or other generative model alongside the task-solving model. When learning a new task, the generative model produces pseudo-examples from previous tasks, which are interleaved with real data from the current task.
The approach is inspired by the generative nature of hippocampal memory replay. Its main advantage is that it avoids storing raw data, which may be desirable for privacy or memory reasons. However, the quality of replay depends on the fidelity of the generative model. If the generator fails to capture the full diversity of past data, the task solver may still forget. More recent work has explored using diffusion models (DDGR, Gao et al., 2023) instead of GANs for higher-quality generative replay.
Architecture-based methods modify the network structure itself to accommodate new tasks while protecting the parameters dedicated to old tasks. These methods can guarantee zero forgetting on previous tasks but typically require more memory as the number of tasks grows.
Progressive Neural Networks, introduced by Rusu, Rabinowitz, Desjardins, Soyer, Kirkpatrick, Kavukcuoglu, Pascanu, and Hadsell in 2016, prevent forgetting by freezing all parameters of a trained column and adding a new column of neurons for each new task. Lateral connections from previous columns to the new column allow knowledge transfer. Because old parameters are never modified, catastrophic forgetting is impossible by design.
The main drawback is that the model size grows linearly with the number of tasks, which is impractical for long task sequences. Each new task requires allocating a complete set of layers and learning the lateral connections to all previous columns.
PackNet (Mallya and Lazebnik, 2018) addresses the growing-model problem by sharing a single network across tasks. After training on each task, PackNet prunes the network to identify a sparse subnetwork that is sufficient for that task. The pruned weights are then freed up for use by subsequent tasks. Each task uses a distinct subset of the network's parameters, with a binary mask indicating which parameters belong to which task.
PackNet allows efficient use of network capacity: a single network can accommodate many tasks if each task's subnetwork is sufficiently sparse. However, eventually the available capacity is exhausted, at which point either the network must be expanded or performance on new tasks will degrade.
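The core bookkeeping, selecting a task's subnetwork by magnitude pruning over the still-free weights, can be sketched as follows; `keep_fraction` and the toy weight values are illustrative.

```python
import numpy as np

# PackNet-style mask assignment sketch: after training a task, keep the
# largest-magnitude fraction of the weights not yet claimed by earlier
# tasks, and release the rest for later tasks.

def assign_task_mask(weights, free_mask, keep_fraction):
    free_idx = np.flatnonzero(free_mask)              # weights still available
    k = max(1, int(keep_fraction * free_idx.size))
    order = np.argsort(-np.abs(weights[free_idx]))    # largest magnitude first
    task_mask = np.zeros_like(free_mask)
    task_mask[free_idx[order[:k]]] = True
    return task_mask

weights = np.array([0.9, -0.1, 0.5, 0.05, -0.8, 0.2])
free = np.ones(6, dtype=bool)
mask_t1 = assign_task_mask(weights, free, keep_fraction=0.5)  # claims 3 weights
free = free & ~mask_t1          # remaining capacity for the next task
```

Weights inside `mask_t1` are frozen thereafter; the next task trains only the weights still marked free, which is why capacity is eventually exhausted.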
HAT (Serra et al., 2018) learns task-specific attention masks over the network's units. For each task, a set of near-binary gates controls which units are active. After training on a task, the gates are fixed, and subsequent tasks are trained with a gradient compensation mechanism that prevents changes to the units most used by previous tasks. HAT provides a middle ground between the full parameter isolation of Progressive Neural Networks and the full sharing of regularization methods.
| Approach | Category | Forgetting Prevention | Requires Old Data | Memory Overhead | Key Strength | Key Weakness |
|---|---|---|---|---|---|---|
| EWC (Kirkpatrick et al., 2017) | Regularization | Fisher-weighted parameter penalty | No | Low (Fisher diagonal per task) | Biologically inspired, simple | Accumulating Fisher matrices; gradual forgetting |
| Online EWC (Schwarz et al., 2018) | Regularization | Running Fisher average | No | Low (single Fisher estimate) | Scales to many tasks | Less precise than per-task Fisher |
| SI (Zenke et al., 2017) | Regularization | Online importance accumulation | No | Low | Online computation, no extra pass | Importance estimates can be noisy |
| LwF (Li and Hoiem, 2016) | Regularization (functional) | Knowledge distillation loss | No | None | No exemplar storage needed | Degrades with dissimilar tasks |
| Experience Replay | Replay | Interleaving stored examples | Yes (buffer) | Buffer size | Simple and effective | Requires storing raw data |
| GEM (Lopez-Paz and Ranzato, 2017) | Replay / Optimization | Gradient constraints from memory | Yes (buffer) | Buffer + QP solver | Principled constraint approach | Quadratic programming cost |
| A-GEM (Chaudhry et al., 2019) | Replay / Optimization | Average gradient constraint | Yes (buffer) | Buffer | Fast approximation of GEM | Less precise constraints |
| iCaRL (Rebuffi et al., 2017) | Replay + Distillation | Exemplar set + distillation | Yes (exemplars) | Fixed exemplar budget | Strong class-incremental learner | Fixed capacity; exemplar selection |
| Generative Replay (Shin et al., 2017) | Replay (generative) | Generated pseudo-examples | No (generates data) | Generative model | No raw data storage needed | Generator fidelity; training cost |
| Progressive Nets (Rusu et al., 2016) | Architecture | Freeze old columns, add new | No | Linear growth with tasks | Zero forgetting guaranteed | Model size grows without bound |
| PackNet (Mallya and Lazebnik, 2018) | Architecture | Task-specific pruned subnetworks | No | Binary masks per task | Efficient parameter reuse | Finite capacity, eventual saturation |
| HAT (Serra et al., 2018) | Architecture | Learned attention masks per task | No | Mask parameters per task | Flexible parameter sharing | Mask learning adds complexity |
The emergence of large language models such as GPT-4, LLaMA, and Claude has given continual learning new urgency. These models are trained at enormous computational cost, making full retraining impractical every time new knowledge or capabilities are needed. Yet LLMs must be updated to stay current with evolving world knowledge, to add new skills (such as code generation or mathematical reasoning), and to align with changing human preferences and safety requirements.
A 2025 survey published in ACM Computing Surveys (Wang et al.) provides a comprehensive framework for continual learning in LLMs, organizing the problem around the successive stages of the LLM training pipeline: continual pre-training, continual instruction tuning, and continual alignment.
Continual pre-training extends the knowledge base of a pre-trained LLM by training it on new corpora without starting from scratch. This is particularly important for keeping models up to date with recent events, scientific discoveries, or domain-specific data (such as medical literature or legal documents). The main risk is that extensive pre-training on new data can cause the model to forget its general-purpose capabilities, producing a domain expert that has lost its ability to perform basic tasks.
Mitigation strategies include mixing new data with a small proportion of the original pre-training data, using learning rate warmup and decay schedules to limit the magnitude of parameter changes, and applying regularization techniques adapted from the classical continual learning literature.
Continual instruction tuning involves sequentially fine-tuning an LLM on new instruction-following tasks. Each phase of tuning may focus on a different capability (summarization, translation, question answering, code generation). The challenge is that tuning on a new instruction set can degrade performance on previously learned instructions.
Parameter-efficient fine-tuning (PEFT) methods, particularly LoRA (Low-Rank Adaptation), have become central to continual instruction tuning. By updating only a small number of low-rank adapter parameters rather than the full model, PEFT reduces the risk of catastrophic forgetting. However, even PEFT methods are not immune to forgetting: sequential LoRA tuning on different tasks can still cause significant performance degradation on earlier tasks.
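A minimal sketch of the LoRA parameterization itself (dimensions and initialization scales are illustrative): the frozen weight W is augmented by a trainable low-rank product, so the adapter is a no-op at initialization and touches far fewer parameters than full fine-tuning.

```python
import numpy as np

# LoRA-style low-rank update sketch: the frozen matrix W is augmented by
# B @ A, so only r * (d_in + d_out) parameters are trained per adapter
# instead of the full d_in * d_out.

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 128, 4

W = rng.standard_normal((d_out, d_in))      # frozen pre-trained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                    # trainable up-projection (zero init)

def forward(x, scale=1.0):
    return W @ x + scale * (B @ (A @ x))    # base path + low-rank adapter

x = rng.standard_normal(d_in)
y0 = forward(x)   # with B = 0 the adapter contributes nothing at init
```

Here the adapter holds r·(d_in + d_out) = 768 trainable values against 8,192 in W; sequential tuning would train a fresh (or constrained) A, B pair per task while W stays fixed.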
Recent methods address this by, for example, isolating task-specific adapter modules that can be selected or composed at inference time, or by constraining each new task's low-rank update to a subspace orthogonal to those learned for earlier tasks.
As safety standards and human preferences evolve, LLMs must be re-aligned over time. Reinforcement learning from human feedback (RLHF) and related alignment procedures can be applied sequentially, but each round of alignment risks undoing the effects of previous rounds. Continual alignment research explores how to update the model's value alignment incrementally without degrading earlier safety training.
Continual learning research relies on standardized benchmarks to compare methods. The most commonly used benchmarks originated in computer vision but have since expanded to NLP, reinforcement learning, and multimodal settings.
| Benchmark | Dataset Base | Tasks | Scenario | Description |
|---|---|---|---|---|
| Permuted MNIST | MNIST | 10-20 tasks | Domain-IL | Each task applies a different fixed pixel permutation to the MNIST images. The model must classify digits under all permutations. |
| Split MNIST | MNIST | 5 tasks | Task-IL / Class-IL | The 10 digit classes are split into 5 tasks of 2 classes each. With task IDs it is Task-IL; without, it is Class-IL. |
| Split CIFAR-10 | CIFAR-10 | 5 tasks | Task-IL / Class-IL | The 10 CIFAR-10 classes are split into 5 tasks of 2 classes each. |
| Split CIFAR-100 | CIFAR-100 | 10-20 tasks | Task-IL / Class-IL | The 100 CIFAR-100 classes are divided into 10 or 20 tasks. Often used for evaluating class-incremental methods at larger scale. |
| CORe50 | CORe50 video | 8-11 tasks | Multiple | A benchmark featuring 50 objects captured on video under varying conditions. Supports multiple CL scenarios with natural distribution shifts. |
| Split TinyImageNet | Tiny ImageNet | 10 tasks | Task-IL / Class-IL | 200 classes from Tiny ImageNet split into sequential tasks. |
| Benchmark | Focus | Tasks | Description |
|---|---|---|---|
| TRACE | LLM continual learning | 8 datasets | Evaluates LLMs on domain-specific tasks, multilingual capabilities, code generation, and mathematical reasoning after sequential training. |
| CLiMB | Vision-and-language | Multiple V&L tasks | Evaluates continual learning in multimodal (vision-language) models, measuring both forgetting and forward transfer. |
| CTrL | General continual learning | Configurable | Allows detailed analysis of continual learning properties with controllable task similarity and difficulty. |
The field uses several standardized metrics to quantify continual learning performance:
| Metric | Definition | What It Measures |
|---|---|---|
| Average Accuracy (ACC) | Mean test accuracy across all tasks after the final task has been learned | Overall performance of the continual learner |
| Backward Transfer (BWT) | Average change in performance on task i after learning subsequent tasks i+1, ..., T | Degree of catastrophic forgetting (negative BWT indicates forgetting) |
| Forward Transfer (FWT) | Average improvement on task i from having learned tasks 1, ..., i-1 compared to learning task i from scratch | Benefit of prior knowledge for learning new tasks |
| Forgetting Measure (FM) | Maximum performance on task i at any point minus final performance on task i, averaged across tasks | Worst-case forgetting across the task sequence |
A model with high ACC, near-zero (or positive) BWT, and positive FWT is considered an effective continual learner. In practice, most methods involve a trade-off between these metrics: strategies that minimize forgetting (keeping BWT close to zero) sometimes do so at the cost of plasticity, reducing the model's ability to learn new tasks well.
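Given a matrix R of accuracies, where R[t, i] is test accuracy on task i after training through task t, these metrics reduce to a few lines; the accuracy values below are made up for illustration.

```python
import numpy as np

# Continual learning metrics from an accuracy matrix R[t, i] = accuracy on
# task i after training on tasks 0..t (lower triangle; untrained tasks 0).

R = np.array([
    [0.95, 0.00, 0.00],   # after task 1
    [0.70, 0.92, 0.00],   # after task 2: task 1 has degraded
    [0.60, 0.75, 0.90],   # after task 3
])
T = R.shape[0]

ACC = R[-1].mean()                                          # final average accuracy
BWT = np.mean([R[-1, i] - R[i, i] for i in range(T - 1)])   # negative = forgetting
FM  = np.mean([R[:, i].max() - R[-1, i] for i in range(T - 1)])  # peak-to-final drop
```

In this example ACC is 0.75, BWT is −0.26 (substantial forgetting), and FM is 0.26; FWT would additionally require single-task baselines trained from scratch.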
Continual learning, transfer learning, and meta-learning are related but distinct paradigms that address different aspects of learning efficiency and knowledge reuse.
Transfer learning involves pre-training a model on a source task and then fine-tuning it on a target task. The goal is to transfer useful representations or knowledge from the source to the target domain to improve learning speed or accuracy. Transfer learning is typically a one-shot process: the model is adapted from one task to another, with no concern for retaining performance on the source task. In fact, standard fine-tuning often causes catastrophic forgetting of the source task's capabilities, which is acceptable in transfer learning but unacceptable in continual learning.
Continual learning can be viewed as an extension of transfer learning to the sequential multi-task setting, where the model must both transfer knowledge forward (to help with new tasks) and retain backward knowledge (to preserve performance on old tasks).
Meta-learning, or "learning to learn," focuses on training models that can quickly adapt to new tasks with minimal data. Rather than learning a single task well, meta-learning optimizes the learning process itself so that the model acquires new tasks efficiently. Methods like MAML (Model-Agnostic Meta-Learning) train models to find parameter initializations from which a few gradient steps suffice to learn any task in a given distribution.
Meta-learning and continual learning are complementary. Meta-learning can be used to improve continual learning by finding initializations or learning algorithms that are inherently more resistant to catastrophic forgetting. This intersection has been explored under the name meta-continual learning, where each episode of traditional meta-learning is replaced with a continual learning episode. Conversely, continual meta-learning studies how a meta-learner can itself adapt over time as the distribution of tasks changes.
Javed and White (2019) proposed OML (Online Meta-Learning), which uses meta-learning to find representations that are specifically suited for continual learning. Beaulieu et al. (2020) introduced ANML (A Neuromodulated Meta-Learning Algorithm), which learns a neuromodulatory network that gates plasticity in a prediction network, combining ideas from meta-learning and architecture-based continual learning.
| Paradigm | Goal | Task Presentation | Forgetting Constraint | Data Access |
|---|---|---|---|---|
| Transfer learning | Adapt source knowledge to target task | Source, then target (two stages) | None (forgetting source is acceptable) | Source data may or may not be available |
| Multi-task learning | Learn all tasks simultaneously | All tasks at once | Not applicable (joint training) | All task data available simultaneously |
| Meta-learning | Learn to learn new tasks quickly | Episodic (train/test splits per task) | Not a primary concern | Task distribution available for meta-training |
| Continual learning | Learn tasks sequentially without forgetting | Sequential, one at a time | Central constraint | Typically only current task data available |
Several software libraries support continual learning research and experimentation. The most widely used is Avalanche, a PyTorch-based library maintained by the ContinualAI community that provides standard benchmarks, training strategies (including replay, EWC, and LwF), and evaluation metrics in a unified interface.
Despite significant progress, several challenges remain in continual learning:
Scalability to long task sequences. Most continual learning methods are evaluated on sequences of 5 to 20 tasks. Real-world applications may involve hundreds or thousands of tasks over a model's lifetime. Whether current methods can scale to such long sequences without gradual performance degradation remains an open question.
Task-free continual learning. Most existing methods assume clear task boundaries, where the model knows when one task ends and another begins. In many real-world scenarios, the data distribution shifts gradually and continuously, without discrete boundaries. Developing methods that can detect and adapt to gradual distribution shifts is an active area of research.
Balancing stability and plasticity in LLMs. As foundation models become the primary tool for AI applications, finding effective strategies for continual updates (knowledge, alignment, and capabilities) without degrading existing performance is a central engineering and research challenge. The computational cost of these models makes trial-and-error approaches impractical.
Evaluation beyond classification. The majority of continual learning benchmarks focus on image classification. Extending rigorous evaluation to other domains, such as object detection, natural language understanding, generative modeling, and reinforcement learning, is important for understanding the generality of proposed methods.
Theoretical foundations. While many practical methods exist, the theoretical understanding of why certain approaches work better than others, and what the fundamental limits of continual learning are, remains incomplete. Questions about the minimum memory required to prevent forgetting, the relationship between network capacity and the number of learnable tasks, and the information-theoretic limits of continual learning are areas of ongoing investigation.