Continual learning, also called lifelong learning or incremental learning, is a machine learning paradigm in which a model learns from a stream of tasks or data distributions over time, accumulating knowledge while retaining what it has already learned. Unlike conventional training, where a model is trained once on a fixed dataset and then deployed, continual learning requires the model to adapt to new information without losing performance on previously acquired tasks. The central challenge is catastrophic forgetting: the tendency of neural networks to abruptly lose knowledge of earlier tasks when their parameters are updated to accommodate new ones.
The problem is fundamental to building intelligent systems that operate in open-ended, non-stationary environments. Humans learn continuously throughout their lives, building on prior knowledge and rarely forgetting well-practiced skills when acquiring new ones. Replicating this ability in artificial systems has been a long-standing goal of artificial intelligence research. Continual learning sits at the intersection of several related fields, including transfer learning, meta-learning, multi-task learning, and curriculum learning, but is distinguished by its emphasis on sequential task presentation and the explicit goal of avoiding forgetting.
The field has gained renewed prominence with the rise of large language models (LLMs) and foundation models, which are expensive to retrain from scratch but must be updated to incorporate new knowledge, skills, and alignment objectives over time.
The phenomenon now known as catastrophic forgetting was first identified by McCloskey and Cohen in 1989. Their experiments involved a simple neural network trained to learn arithmetic (addition of pairs of integers). When the network was trained on a second group of addition problems, it rapidly and almost completely forgot the first group. Ratcliff (1990) independently demonstrated similar effects. These findings revealed a serious limitation of connectionist models: the shared, distributed representations that give neural networks their generalization power also make them vulnerable to interference when the training distribution changes.
Robert French published an influential review of the problem in 1999 in Trends in Cognitive Sciences, titled "Catastrophic Forgetting in Connectionist Networks," which formalized the issue and surveyed early mitigation strategies. French connected catastrophic forgetting to the broader stability-plasticity dilemma, a concept first articulated by Stephen Grossberg in the early 1980s. Grossberg observed that any learning system must balance two competing demands: it must be plastic enough to incorporate new information and stable enough to retain existing knowledge. Too much plasticity leads to forgetting; too much stability prevents learning.
A key insight from neuroscience came with the complementary learning systems (CLS) theory, proposed by McClelland, McNaughton, and O'Reilly in 1995. CLS theory suggests that the mammalian brain solves the stability-plasticity dilemma through two complementary memory systems. The hippocampus rapidly encodes new episodic memories, while the neocortex slowly integrates information into structured, long-term knowledge through a process of memory consolidation and replay. This dual-system architecture has inspired many modern continual learning approaches, particularly replay-based methods.
Research on continual learning in deep neural networks accelerated dramatically after 2016, driven by several landmark papers: Elastic Weight Consolidation (Kirkpatrick et al., 2017), Progressive Neural Networks (Rusu et al., 2016), Learning without Forgetting (Li and Hoiem, 2016), and Synaptic Intelligence (Zenke, Poole, and Ganguli, 2017). These works established the main families of approaches that continue to shape the field.
Catastrophic forgetting occurs when a neural network trained sequentially on multiple tasks experiences a severe drop in performance on earlier tasks after learning new ones. The root cause lies in the way neural networks store knowledge. Unlike lookup tables or databases where each piece of information occupies a distinct memory location, neural networks encode knowledge in a distributed fashion across shared weights. When those weights are updated to minimize the loss on a new task, the changes can overwrite the weight configurations that were critical for earlier tasks.
More formally, consider a network with parameters θ that has been trained on task A to reach an optimal configuration θ_A. When the network is then trained on task B, gradient descent pushes the parameters toward a new configuration θ_B that minimizes the loss on task B. If the loss landscapes for tasks A and B are sufficiently different, θ_B may be far from θ_A in parameter space, resulting in poor performance on task A. The problem is especially severe when the two tasks are dissimilar, when training on task B continues for many updates, and when no data from task A is available for rehearsal.
Catastrophic forgetting is distinct from the gradual forgetting that humans experience. Human forgetting follows a smooth decay curve (the Ebbinghaus forgetting curve), while neural network forgetting is abrupt and often nearly complete. A network that achieved 95% accuracy on task A might drop to near-random performance after just a few epochs of training on task B.
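The abruptness of this collapse is easy to reproduce in miniature. The following sketch uses a hypothetical two-parameter model with quadratic task losses (not any specific published setup): plain gradient descent first reaches the task-A optimum, then training on task B alone drags the parameters away from it.

```python
import numpy as np

# Toy illustration of catastrophic forgetting (hypothetical 2-D example):
# two quadratic "task losses" with different optima. Gradient descent on
# task B alone pulls the parameters far from the task-A optimum.

def train(theta, target, steps=200, lr=0.1):
    for _ in range(steps):
        theta = theta - lr * 2.0 * (theta - target)  # grad of ||theta - target||^2
    return theta

theta_A_opt = np.array([1.0, 0.0])    # optimum of task A
theta_B_opt = np.array([-1.0, 2.0])   # optimum of task B

theta = train(np.zeros(2), theta_A_opt)                    # learn task A
loss_A_before = float(np.sum((theta - theta_A_opt) ** 2))  # near zero
theta = train(theta, theta_B_opt)                          # then train only on task B
loss_A_after = float(np.sum((theta - theta_A_opt) ** 2))   # large again
```

With no constraint tying the parameters to θ_A, the task-A loss returns almost exactly to its untrained level; every method family below is a different way of adding such a constraint.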
The severity of catastrophic forgetting depends on the degree of overlap between task representations. When tasks share similar low-level features (for example, two image classification tasks that both rely on edge detectors), forgetting tends to be less severe because the shared features remain useful. When tasks are more dissimilar, forgetting is typically worse.
Van de Ven and Tolias (2019), in their paper "Three Scenarios for Continual Learning," introduced a widely adopted framework for categorizing continual learning problems based on what information is available at test time. The three scenarios differ in difficulty and in the types of methods that succeed in each. Van de Ven, Tuytelaars, and Tolias later expanded this framework in their 2022 paper "Three Types of Incremental Learning" published in Nature Machine Intelligence.
In task-incremental learning, the model must learn a sequence of distinct tasks, and at test time it is told which task it is being asked to perform. This means the model receives a task identifier (task ID) as input alongside the test sample. Each task typically has its own output head (a separate set of output neurons), and the task ID tells the model which output head to use.
Because the model knows which task to evaluate, it only needs to distinguish between classes within a single task, not across all tasks. This makes task-incremental learning the easiest of the three scenarios. Regularization-based methods, such as Elastic Weight Consolidation (EWC), perform relatively well in this setting.
Example: a model sequentially learns to classify digits 0-1 (task 1), digits 2-3 (task 2), and digits 4-5 (task 3). At test time, it is told "this is a task 2 sample" and must decide between digit 2 and digit 3.
In domain-incremental learning, the model learns to solve the same type of problem across different contexts or domains, and the task structure remains the same. At test time, the model is not told which domain the sample comes from. The output space is shared across all domains.
A classic example is Permuted MNIST, where the model must classify handwritten digits (always the same 10 classes), but each task applies a different fixed pixel permutation to the images. The model must learn to handle all permutations without being told which permutation was applied at test time.
Domain-incremental learning is harder than task-incremental learning because the model cannot rely on a task ID to select the appropriate processing pathway. However, it is simpler than class-incremental learning because the output space does not grow.
Class-incremental learning is the most challenging scenario. The model must learn new classes over time, and at test time it must distinguish between all classes seen so far without receiving any task identifier. The output space grows with each new task.
For example, a model first learns to classify cats versus dogs, then learns to classify birds versus fish. At test time, it must correctly classify any of the four classes without being told which pair of classes the sample belongs to.
Class-incremental learning requires the model to both learn good representations for new classes and maintain decision boundaries between all previously learned classes. Van de Ven et al. found that replay-based methods (either using a memory buffer or a generative model) are the only approaches among commonly used strategies that perform consistently well in this scenario.
| Feature | Task-Incremental (Task-IL) | Domain-Incremental (Domain-IL) | Class-Incremental (Class-IL) |
|---|---|---|---|
| Task ID at test time | Yes | No | No |
| Output space | Separate head per task | Shared across tasks | Grows with each task |
| Key challenge | Preserving per-task parameters | Handling domain shift | Discriminating all classes |
| Relative difficulty | Easiest | Moderate | Hardest |
| Best-performing methods | Regularization, replay, architectural | Replay, domain adaptation | Replay (buffer or generative) |
| Classic benchmark | Split MNIST (with task ID) | Permuted MNIST | Split MNIST (without task ID) |
The methods developed to address catastrophic forgetting can be organized into three broad families: regularization-based, replay-based, and architecture-based approaches. Some methods combine elements from multiple families.
Regularization-based methods add a penalty term to the loss function that discourages changes to parameters that are important for previously learned tasks. The core idea is to constrain the optimization so that learning a new task does not drastically alter the weight configurations needed for old tasks.
Elastic Weight Consolidation, introduced by Kirkpatrick et al. in 2017 in the Proceedings of the National Academy of Sciences, is one of the most influential regularization methods for continual learning. EWC draws inspiration from synaptic consolidation in neuroscience: just as the brain reduces the plasticity of synapses that are important for well-learned behaviors, EWC selectively constrains the parameters that matter most for previous tasks.
EWC adds a quadratic penalty to the loss function that pulls each parameter toward its value after training on the previous task, weighted by the parameter's importance. The importance of each parameter is estimated using the diagonal of the Fisher information matrix, which approximates the curvature of the loss landscape around the learned solution. Parameters with high Fisher information (those where small changes would significantly increase the loss on the previous task) are penalized more heavily, while parameters with low Fisher information are left free to adapt to the new task.
The EWC loss for task B, given a model previously trained on task A, is:
L_total = L_B(θ) + (λ/2) Σ_i F_i (θ_i - θ*_A,i)²
where L_B is the loss on task B, θ*_A is the optimal parameter vector for task A, F_i is the Fisher information for parameter i, and λ controls the strength of the regularization.
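A minimal sketch of this penalty in code, assuming a precomputed Fisher diagonal and an arbitrary stand-in for the task-B loss (all numbers are illustrative):

```python
import numpy as np

# EWC loss sketch: quadratic penalty on deviation from the task-A solution,
# weighted elementwise by the diagonal Fisher information.

def ewc_loss(theta, loss_B, theta_star_A, fisher_diag, lam):
    penalty = 0.5 * lam * np.sum(fisher_diag * (theta - theta_star_A) ** 2)
    return loss_B(theta) + penalty

theta_star_A = np.array([1.0, -2.0, 0.5])   # parameters after task A
fisher_diag = np.array([10.0, 0.1, 1.0])    # parameter 0 is most important
loss_B = lambda th: np.sum(th ** 2)         # stand-in loss for task B

theta = np.array([0.9, 0.0, 0.2])
total = ewc_loss(theta, loss_B, theta_star_A, fisher_diag, lam=1.0)
```

Note how the high-Fisher parameter (index 0) pays a large penalty for a small deviation from θ*_A, while the low-Fisher parameter (index 1) moves almost freely.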
A known limitation of standard EWC is that it does not scale gracefully to many tasks, because the Fisher information matrices from all previous tasks must be stored. Online EWC (Schwarz et al., 2018) addresses this by maintaining a running average of the Fisher information, using only a single regularization term regardless of the number of tasks.
Synaptic Intelligence, proposed by Zenke, Poole, and Ganguli in 2017 at ICML, takes a similar approach to EWC but computes parameter importance in an online fashion during training, rather than computing it offline after each task. Each parameter accumulates an importance score based on its contribution to the reduction of the training loss along the entire learning trajectory. When training on a new task, parameters with high accumulated importance are penalized for deviating from their previous values.
SI and EWC yield similar performance on benchmarks like Permuted MNIST, but SI has the advantage of not requiring a separate Fisher information computation step.
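The online accumulation at the heart of SI can be sketched as follows. The per-step update (each parameter accumulates minus its gradient times its step) and the end-of-task normalization by squared displacement follow the paper's recipe; the quadratic toy task and all constants are illustrative.

```python
import numpy as np

# Synaptic Intelligence sketch: accumulate each parameter's running
# contribution to loss reduction, omega_i += -g_i * dtheta_i, then
# normalize by its total displacement (plus a damping term xi).

def train_with_si(theta, grad_fn, steps=100, lr=0.1, xi=1e-3):
    omega = np.zeros_like(theta)        # running path integral
    theta_start = theta.copy()
    for _ in range(steps):
        g = grad_fn(theta)
        step = -lr * g
        omega += -g * step              # per-step contribution to loss decrease
        theta = theta + step
    total_change = theta - theta_start
    importance = omega / (total_change ** 2 + xi)
    return theta, importance

# Toy quadratic task: loss depends strongly on theta[0], weakly on theta[1].
grad_fn = lambda th: np.array([10.0 * th[0], 0.1 * th[1]])
theta, importance = train_with_si(np.array([1.0, 1.0]), grad_fn)
```

The parameter that drove most of the loss reduction ends up with a much larger importance score, and would be penalized more heavily on the next task.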
Learning without Forgetting, introduced by Li and Hoiem (2016) at ECCV, uses knowledge distillation rather than explicit parameter regularization. Before training on a new task, the model's current outputs on the new task's data are recorded. During training, the model is optimized to both perform well on the new task and reproduce its previous outputs (soft targets) on the new data. This acts as a form of functional regularization: the model's behavior on old tasks is preserved even though the training data for those tasks is not available.
LwF is notable because it requires no storage of old data and no computation of parameter importance. However, its effectiveness decreases as the number of tasks grows or when new tasks are very different from previous ones, because the soft targets become less reliable guides for preserving old knowledge.
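The distillation term can be sketched as a cross-entropy between temperature-softened output distributions; T = 2 is a common choice, and the logits below are made up for illustration.

```python
import numpy as np

# LwF-style distillation sketch: penalize the new model for diverging from
# the soft targets the old model produced on the new task's inputs.

def softmax(z, T=1.0):
    z = z / T                                    # temperature softening
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(old_logits, new_logits, T=2.0):
    # Cross-entropy between softened old and new output distributions.
    p_old = softmax(old_logits, T)
    p_new = softmax(new_logits, T)
    return -np.mean(np.sum(p_old * np.log(p_new + 1e-12), axis=-1))

old_logits = np.array([[2.0, 0.5, -1.0]])
same = distillation_loss(old_logits, old_logits)            # minimal (entropy)
drift = distillation_loss(old_logits, old_logits[:, ::-1])  # outputs changed
```

The loss is minimized when the new model reproduces the old model's soft outputs, which is exactly the functional-preservation pressure described above.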
Replay-based methods maintain a memory of past experiences and interleave them with new data during training. This directly combats forgetting by periodically exposing the model to examples from earlier tasks. Replay methods are inspired by the hippocampal replay observed in mammalian brains, where the hippocampus replays stored memories during sleep to consolidate them into long-term neocortical storage.
Experience replay is the simplest replay strategy. The model maintains a small buffer of stored examples from previous tasks. When training on a new task, samples from the buffer are mixed into each training batch. Despite its simplicity, experience replay is a strong baseline that often outperforms more complex methods. The buffer is typically managed using reservoir sampling, which ensures that all previously seen examples have an equal probability of being stored, regardless of when they were encountered.
A key design choice is the buffer size. Larger buffers reduce forgetting but increase memory requirements. Even small buffers (a few hundred examples) can significantly reduce catastrophic forgetting.
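A minimal reservoir-sampling buffer, as commonly used for experience replay; this is a sketch, not any particular library's implementation.

```python
import random

# Reservoir sampling (Algorithm R): after n examples have streamed past,
# each of them occupies the buffer with equal probability capacity/n.

class ReservoirBuffer:
    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.data = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, example):
        self.seen += 1
        if len(self.data) < self.capacity:
            self.data.append(example)            # fill phase
        else:
            j = self.rng.randrange(self.seen)    # index in [0, seen)
            if j < self.capacity:
                self.data[j] = example           # replace a stored example

    def sample(self, k):
        return self.rng.sample(self.data, min(k, len(self.data)))

buf = ReservoirBuffer(capacity=100)
for x in range(10_000):     # stream of examples across successive tasks
    buf.add(x)
replay_batch = buf.sample(10)
```

During training, `replay_batch` would be concatenated with the current task's batch so that old examples keep contributing to the gradient.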
Gradient Episodic Memory, introduced by Lopez-Paz and Ranzato in 2017, uses stored examples not for direct replay but to define optimization constraints. At each training step, GEM checks whether the proposed gradient update would increase the loss on any stored examples from previous tasks. If so, the gradient is projected to the closest vector that does not violate these constraints. This ensures that each update either improves or at least does not harm performance on past tasks.
GEM is computationally expensive because it requires solving a quadratic programming problem at each step, with cost growing linearly in the number of tasks. Averaged GEM (A-GEM), proposed by Chaudhry et al. in 2019, simplifies this by using a single averaged gradient constraint from a random subset of the memory buffer, making it much faster while retaining most of GEM's benefits.
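The A-GEM correction itself is a one-line gradient projection; this sketch assumes `g_ref` has already been averaged over a batch drawn from the memory buffer.

```python
import numpy as np

# A-GEM projection sketch: if the proposed gradient g conflicts with the
# average replay gradient g_ref (negative dot product), project g onto the
# half-space where it no longer increases the replayed examples' loss.

def agem_project(g, g_ref):
    dot = g @ g_ref
    if dot >= 0:
        return g                                  # no interference, keep g
    return g - (dot / (g_ref @ g_ref)) * g_ref    # remove conflicting component

g = np.array([1.0, -1.0])        # gradient on the current task
g_ref = np.array([0.0, 1.0])     # averaged gradient on replayed examples
g_proj = agem_project(g, g_ref)
```

After projection the update is orthogonal to (or aligned with) the replay gradient, so in expectation it does not increase the loss on stored examples, at a tiny fraction of GEM's quadratic-programming cost.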
iCaRL, introduced by Rebuffi, Kolesnikov, Sperl, and Lampert at CVPR 2017, combines replay with knowledge distillation for class-incremental learning. iCaRL maintains a fixed-size set of exemplars (representative examples) for each class, selected to be closest to the class mean in feature space. Classification is performed using a nearest-class-mean rule rather than the network's softmax output. When new classes arrive, the model is trained using a combination of the new class data and the stored exemplars, with a distillation loss to preserve the representations of old classes.
iCaRL was one of the first methods to address class-incremental learning in a principled way and remains a widely used baseline.
Generative replay, introduced by Shin et al. in 2017, replaces the memory buffer with a generative model that learns to produce synthetic examples resembling data from past tasks. Instead of storing real examples, the system trains a generative adversarial network (GAN) or other generative model alongside the task-solving model. When learning a new task, the generative model produces pseudo-examples from previous tasks, which are interleaved with real data from the current task.
The approach is inspired by the generative nature of hippocampal memory replay. Its main advantage is that it avoids storing raw data, which may be desirable for privacy or memory reasons. However, the quality of replay depends on the fidelity of the generative model. If the generator fails to capture the full diversity of past data, the task solver may still forget. More recent work has explored using diffusion models (DDGR, Gao et al., 2023) instead of GANs for higher-quality generative replay.
Architecture-based methods modify the network structure itself to accommodate new tasks while protecting the parameters dedicated to old tasks. These methods can guarantee zero forgetting on previous tasks but typically require more memory as the number of tasks grows.
Progressive Neural Networks, introduced by Rusu, Rabinowitz, Desjardins, Soyer, Kirkpatrick, Kavukcuoglu, Pascanu, and Hadsell in 2016, prevent forgetting by freezing all parameters of a trained column and adding a new column of neurons for each new task. Lateral connections from previous columns to the new column allow knowledge transfer. Because old parameters are never modified, catastrophic forgetting is impossible by design.
The main drawback is that the model size grows linearly with the number of tasks, which is impractical for long task sequences. Each new task requires allocating a complete set of layers and learning the lateral connections to all previous columns.
PackNet (Mallya and Lazebnik, 2018) addresses the growing-model problem by sharing a single network across tasks. After training on each task, PackNet prunes the network to identify a sparse subnetwork that is sufficient for that task. The pruned weights are then freed up for use by subsequent tasks. Each task uses a distinct subset of the network's parameters, with a binary mask indicating which parameters belong to which task.
PackNet allows efficient use of network capacity: a single network can accommodate many tasks if each task's subnetwork is sufficiently sparse. However, eventually the available capacity is exhausted, at which point either the network must be expanded or performance on new tasks will degrade.
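The core bookkeeping, selecting a task's subnetwork by magnitude pruning over the still-free weights, can be sketched as follows; `keep_fraction` and the toy weight values are illustrative.

```python
import numpy as np

# PackNet-style mask assignment sketch: after training a task, keep the
# largest-magnitude fraction of the weights not yet claimed by earlier
# tasks, and release the rest for later tasks.

def assign_task_mask(weights, free_mask, keep_fraction):
    free_idx = np.flatnonzero(free_mask)              # weights still available
    k = max(1, int(keep_fraction * free_idx.size))
    order = np.argsort(-np.abs(weights[free_idx]))    # largest magnitude first
    task_mask = np.zeros_like(free_mask)
    task_mask[free_idx[order[:k]]] = True
    return task_mask

weights = np.array([0.9, -0.1, 0.5, 0.05, -0.8, 0.2])
free = np.ones(6, dtype=bool)
mask_t1 = assign_task_mask(weights, free, keep_fraction=0.5)  # claims 3 weights
free = free & ~mask_t1          # remaining capacity for the next task
```

Weights inside `mask_t1` are frozen thereafter; the next task trains only the weights still marked free, which is why capacity is eventually exhausted.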
HAT (Serra et al., 2018) learns task-specific attention masks over the network's units. For each task, a set of near-binary gates controls which units are active. After training on a task, the gates are fixed, and subsequent tasks are trained with a gradient compensation mechanism that prevents changes to the units most used by previous tasks. HAT provides a middle ground between the full parameter isolation of Progressive Neural Networks and the full sharing of regularization methods.
| Approach | Category | Forgetting Prevention | Requires Old Data | Memory Overhead | Key Strength | Key Weakness |
|---|---|---|---|---|---|---|
| EWC (Kirkpatrick et al., 2017) | Regularization | Fisher-weighted parameter penalty | No | Low (Fisher diagonal per task) | Biologically inspired, simple | Accumulating Fisher matrices; gradual forgetting |
| Online EWC (Schwarz et al., 2018) | Regularization | Running Fisher average | No | Low (single Fisher estimate) | Scales to many tasks | Less precise than per-task Fisher |
| SI (Zenke et al., 2017) | Regularization | Online importance accumulation | No | Low | Online computation, no extra pass | Importance estimates can be noisy |
| LwF (Li and Hoiem, 2016) | Regularization (functional) | Knowledge distillation loss | No | None | No exemplar storage needed | Degrades with dissimilar tasks |
| Experience Replay | Replay | Interleaving stored examples | Yes (buffer) | Buffer size | Simple and effective | Requires storing raw data |
| GEM (Lopez-Paz and Ranzato, 2017) | Replay / Optimization | Gradient constraints from memory | Yes (buffer) | Buffer + QP solver | Principled constraint approach | Quadratic programming cost |
| A-GEM (Chaudhry et al., 2019) | Replay / Optimization | Average gradient constraint | Yes (buffer) | Buffer | Fast approximation of GEM | Less precise constraints |
| iCaRL (Rebuffi et al., 2017) | Replay + Distillation | Exemplar set + distillation | Yes (exemplars) | Fixed exemplar budget | Strong class-incremental learner | Fixed capacity; exemplar selection |
| Generative Replay (Shin et al., 2017) | Replay (generative) | Generated pseudo-examples | No (generates data) | Generative model | No raw data storage needed | Generator fidelity; training cost |
| Progressive Nets (Rusu et al., 2016) | Architecture | Freeze old columns, add new | No | Linear growth with tasks | Zero forgetting guaranteed | Model size grows without bound |
| PackNet (Mallya and Lazebnik, 2018) | Architecture | Task-specific pruned subnetworks | No | Binary masks per task | Efficient parameter reuse | Finite capacity, eventual saturation |
| HAT (Serra et al., 2018) | Architecture | Learned attention masks per task | No | Mask parameters per task | Flexible parameter sharing | Mask learning adds complexity |
The emergence of large language models such as GPT-4, LLaMA, and Claude has given continual learning new urgency. These models are trained at enormous computational cost, making full retraining impractical every time new knowledge or capabilities are needed. Yet LLMs must be updated to stay current with evolving world knowledge, to add new skills (such as code generation or mathematical reasoning), and to align with changing human preferences and safety requirements.
A 2025 survey published in ACM Computing Surveys (Wang et al.) provides a comprehensive framework for continual learning in LLMs, organizing the problem around the successive stages of the LLM training pipeline: continual pre-training, continual instruction tuning, and continual alignment.
Continual pre-training extends the knowledge base of a pre-trained LLM by training it on new corpora without starting from scratch. This is particularly important for keeping models up to date with recent events, scientific discoveries, or domain-specific data (such as medical literature or legal documents). The main risk is that extensive pre-training on new data can cause the model to forget its general-purpose capabilities, producing a domain expert that has lost its ability to perform basic tasks.
Mitigation strategies include mixing new data with a small proportion of the original pre-training data, using learning rate warmup and decay schedules to limit the magnitude of parameter changes, and applying regularization techniques adapted from the classical continual learning literature.
Continual instruction tuning involves sequentially fine-tuning an LLM on new instruction-following tasks. Each phase of tuning may focus on a different capability (summarization, translation, question answering, code generation). The challenge is that tuning on a new instruction set can degrade performance on previously learned instructions.
Parameter-efficient fine-tuning (PEFT) methods, particularly LoRA (Low-Rank Adaptation), have become central to continual instruction tuning. By updating only a small number of low-rank adapter parameters rather than the full model, PEFT reduces the risk of catastrophic forgetting. However, even PEFT methods are not immune to forgetting: sequential LoRA tuning on different tasks can still cause significant performance degradation on earlier tasks.
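A minimal sketch of the LoRA parameterization itself (dimensions and initialization scales are illustrative): the frozen weight W is augmented by a trainable low-rank product, so the adapter is a no-op at initialization and touches far fewer parameters than full fine-tuning.

```python
import numpy as np

# LoRA-style low-rank update sketch: the frozen matrix W is augmented by
# B @ A, so only r * (d_in + d_out) parameters are trained per adapter
# instead of the full d_in * d_out.

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 128, 4

W = rng.standard_normal((d_out, d_in))      # frozen pre-trained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                    # trainable up-projection (zero init)

def forward(x, scale=1.0):
    return W @ x + scale * (B @ (A @ x))    # base path + low-rank adapter

x = rng.standard_normal(d_in)
y0 = forward(x)   # with B = 0 the adapter contributes nothing at init
```

Here the adapter holds r·(d_in + d_out) = 768 trainable values against 8,192 in W; sequential tuning would train a fresh (or constrained) A, B pair per task while W stays fixed.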
Recent methods address this by, for example, isolating task-specific adapter modules that can be selected or composed at inference time, or by constraining each new task's low-rank update to a subspace orthogonal to those learned for earlier tasks.
As safety standards and human preferences evolve, LLMs must be re-aligned over time. Reinforcement learning from human feedback (RLHF) and related alignment procedures can be applied sequentially, but each round of alignment risks undoing the effects of previous rounds. Continual alignment research explores how to update the model's value alignment incrementally without degrading earlier safety training.
Continual learning research relies on standardized benchmarks to compare methods. The most commonly used benchmarks originated in computer vision but have since expanded to NLP, reinforcement learning, and multimodal settings.
| Benchmark | Dataset Base | Tasks | Scenario | Description |
|---|---|---|---|---|
| Permuted MNIST | MNIST | 10-20 tasks | Domain-IL | Each task applies a different fixed pixel permutation to the MNIST images. The model must classify digits under all permutations. |
| Split MNIST | MNIST | 5 tasks | Task-IL / Class-IL | The 10 digit classes are split into 5 tasks of 2 classes each. With task IDs it is Task-IL; without, it is Class-IL. |
| Split CIFAR-10 | CIFAR-10 | 5 tasks | Task-IL / Class-IL | The 10 CIFAR-10 classes are split into 5 tasks of 2 classes each. |
| Split CIFAR-100 | CIFAR-100 | 10-20 tasks | Task-IL / Class-IL | The 100 CIFAR-100 classes are divided into 10 or 20 tasks. Often used for evaluating class-incremental methods at larger scale. |
| CORe50 | CORe50 video | 8-11 tasks | Multiple | A benchmark featuring 50 objects captured on video under varying conditions. Supports multiple CL scenarios with natural distribution shifts. |
| Split TinyImageNet | Tiny ImageNet | 10 tasks | Task-IL / Class-IL | 200 classes from Tiny ImageNet split into sequential tasks. |
| Benchmark | Focus | Tasks | Description |
|---|---|---|---|
| TRACE | LLM continual learning | 8 datasets | Evaluates LLMs on domain-specific tasks, multilingual capabilities, code generation, and mathematical reasoning after sequential training. |
| CLiMB | Vision-and-language | Multiple V&L tasks | Evaluates continual learning in multimodal (vision-language) models, measuring both forgetting and forward transfer. |
| CTrL | General continual learning | Configurable | Allows detailed analysis of continual learning properties with controllable task similarity and difficulty. |
The field uses several standardized metrics to quantify continual learning performance:
| Metric | Definition | What It Measures |
|---|---|---|
| Average Accuracy (ACC) | Mean test accuracy across all tasks after the final task has been learned | Overall performance of the continual learner |
| Backward Transfer (BWT) | Average change in performance on task i after learning subsequent tasks i+1, ..., T | Degree of catastrophic forgetting (negative BWT indicates forgetting) |
| Forward Transfer (FWT) | Average improvement on task i from having learned tasks 1, ..., i-1 compared to learning task i from scratch | Benefit of prior knowledge for learning new tasks |
| Forgetting Measure (FM) | Maximum performance on task i at any point minus final performance on task i, averaged across tasks | Worst-case forgetting across the task sequence |
A model with high ACC, near-zero (or positive) BWT, and positive FWT is considered an effective continual learner. In practice, most methods involve a trade-off between these metrics: strategies that minimize forgetting (keeping BWT close to zero) sometimes do so at the cost of plasticity, reducing the model's ability to learn new tasks well.
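Given a matrix R of accuracies, where R[t, i] is test accuracy on task i after training through task t, these metrics reduce to a few lines; the accuracy values below are made up for illustration.

```python
import numpy as np

# Continual learning metrics from an accuracy matrix R[t, i] = accuracy on
# task i after training on tasks 0..t (lower triangle; untrained tasks 0).

R = np.array([
    [0.95, 0.00, 0.00],   # after task 1
    [0.70, 0.92, 0.00],   # after task 2: task 1 has degraded
    [0.60, 0.75, 0.90],   # after task 3
])
T = R.shape[0]

ACC = R[-1].mean()                                          # final average accuracy
BWT = np.mean([R[-1, i] - R[i, i] for i in range(T - 1)])   # negative = forgetting
FM  = np.mean([R[:, i].max() - R[-1, i] for i in range(T - 1)])  # peak-to-final drop
```

In this example ACC is 0.75, BWT is −0.26 (substantial forgetting), and FM is 0.26; FWT would additionally require single-task baselines trained from scratch.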
Continual learning, transfer learning, and meta-learning are related but distinct paradigms that address different aspects of learning efficiency and knowledge reuse.
Transfer learning involves pre-training a model on a source task and then fine-tuning it on a target task. The goal is to transfer useful representations or knowledge from the source to the target domain to improve learning speed or accuracy. Transfer learning is typically a one-shot process: the model is adapted from one task to another, with no concern for retaining performance on the source task. In fact, standard fine-tuning often causes catastrophic forgetting of the source task's capabilities, which is acceptable in transfer learning but unacceptable in continual learning.
Continual learning can be viewed as an extension of transfer learning to the sequential multi-task setting, where the model must both transfer knowledge forward (to help with new tasks) and retain backward knowledge (to preserve performance on old tasks).
Meta-learning, or "learning to learn," focuses on training models that can quickly adapt to new tasks with minimal data. Rather than learning a single task well, meta-learning optimizes the learning process itself so that the model acquires new tasks efficiently. Methods like MAML (Model-Agnostic Meta-Learning) train models to find parameter initializations from which a few gradient steps suffice to learn any task in a given distribution.
Meta-learning and continual learning are complementary. Meta-learning can be used to improve continual learning by finding initializations or learning algorithms that are inherently more resistant to catastrophic forgetting. This intersection has been explored under the name meta-continual learning, where each episode of traditional meta-learning is replaced with a continual learning episode. Conversely, continual meta-learning studies how a meta-learner can itself adapt over time as the distribution of tasks changes.
Javed and White (2019) proposed OML (Online Meta-Learning), which uses meta-learning to find representations that are specifically suited for continual learning. Beaulieu et al. (2020) introduced ANML (A Neuromodulated Meta-Learning Algorithm), which learns a neuromodulatory network that gates plasticity in a prediction network, combining ideas from meta-learning and architecture-based continual learning.
| Paradigm | Goal | Task Presentation | Forgetting Constraint | Data Access |
|---|---|---|---|---|
| Transfer learning | Adapt source knowledge to target task | Source, then target (two stages) | None (forgetting source is acceptable) | Source data may or may not be available |
| Multi-task learning | Learn all tasks simultaneously | All tasks at once | Not applicable (joint training) | All task data available simultaneously |
| Meta-learning | Learn to learn new tasks quickly | Episodic (train/test splits per task) | Not a primary concern | Task distribution available for meta-training |
| Continual learning | Learn tasks sequentially without forgetting | Sequential, one at a time | Central constraint | Typically only current task data available |
Several software libraries support continual learning research and experimentation. The most widely used is Avalanche, a PyTorch-based library maintained by the ContinualAI community that provides standard benchmarks, training strategies (including replay, EWC, and LwF), and evaluation metrics in a unified interface.
Despite significant progress, several challenges remain in continual learning:
Scalability to long task sequences. Most continual learning methods are evaluated on sequences of 5 to 20 tasks. Real-world applications may involve hundreds or thousands of tasks over a model's lifetime. Whether current methods can scale to such long sequences without gradual performance degradation remains an open question.
Task-free continual learning. Most existing methods assume clear task boundaries, where the model knows when one task ends and another begins. In many real-world scenarios, the data distribution shifts gradually and continuously, without discrete boundaries. Developing methods that can detect and adapt to gradual distribution shifts is an active area of research.
Balancing stability and plasticity in LLMs. As foundation models become the primary tool for AI applications, finding effective strategies for continual updates (knowledge, alignment, and capabilities) without degrading existing performance is a central engineering and research challenge. The computational cost of these models makes trial-and-error approaches impractical.
Evaluation beyond classification. The majority of continual learning benchmarks focus on image classification. Extending rigorous evaluation to other domains, such as object detection, natural language understanding, generative modeling, and reinforcement learning, is important for understanding the generality of proposed methods.
Theoretical foundations. While many practical methods exist, the theoretical understanding of why certain approaches work better than others, and what the fundamental limits of continual learning are, remains incomplete. Questions about the minimum memory required to prevent forgetting, the relationship between network capacity and the number of learnable tasks, and the information-theoretic limits of continual learning are areas of ongoing investigation.