Curriculum learning is a training strategy for machine learning models in which training examples are presented in a meaningful order, progressing from easier samples to harder ones, rather than in random order. Inspired by the way humans and animals learn, where instruction typically begins with simple concepts before introducing more complex material, curriculum learning has been shown to improve both convergence speed and generalization performance across a range of tasks. Since the foundational paper by Yoshua Bengio and colleagues in 2009, the technique has been applied to deep learning, reinforcement learning, computer vision, natural language processing, and large language model pretraining.
The intellectual roots of curriculum learning extend back to cognitive science and developmental psychology. In 1993, Jeffrey Elman published "Learning and Development in Neural Networks: The Importance of Starting Small," in which he trained recurrent neural networks to process complex sentences involving relative clauses and number agreement. Elman found that networks with full adult-level capacity could not learn the grammar at all. Training succeeded only when networks began with limited working memory that gradually "matured" to the adult state. Elman's key insight was that developmental restrictions on resources may serve as a necessary prerequisite for mastering certain complex domains. By starting with simpler input, the network developed internal representations of basic grammatical structure that constrained the solution space for subsequent, more complex learning.
This idea also connects to the concept of shaping in behavioral psychology, introduced by B.F. Skinner, where animals learn complex behaviors through a sequence of progressively more challenging tasks with appropriate rewards at each stage.
The term "curriculum learning" was formally introduced by Yoshua Bengio, Jerome Louradour, Ronan Collobert, and Jason Weston in their 2009 paper presented at the 26th International Conference on Machine Learning (ICML). The paper drew a direct analogy to human education: just as students learn arithmetic before calculus, neural networks might benefit from structured exposure to training data ordered by difficulty.
Bengio et al. advanced two central hypotheses. First, curriculum learning accelerates the convergence of the training process. Second, for non-convex optimization criteria (which characterize most deep learning problems), curriculum learning can guide the model toward better local minima. The authors connected curriculum learning to continuation methods, a family of global optimization strategies that solve a sequence of smoothed approximations to a non-convex problem, gradually increasing the problem's complexity.
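The connection to continuation methods can be made precise. Bengio et al. define a curriculum as a family of training distributions indexed by a progress parameter λ; the sketch below reproduces the gist of their definition, reweighting the target distribution P(z) over examples z:

```latex
% Curriculum as a family of reweighted training distributions (Bengio et al., 2009)
Q_\lambda(z) \;\propto\; W_\lambda(z)\,P(z), \qquad 0 \le \lambda \le 1, \qquad Q_1 = P
```

subject to two conditions: the entropy of Q_λ is nondecreasing in λ (the distribution admits progressively more diverse examples), and each weight W_λ(z) is nondecreasing in λ (once an example is admitted, its weight never shrinks). Setting λ near 0 yields the easiest, most concentrated distribution, and λ = 1 recovers ordinary training on the full data, mirroring how continuation methods anneal from a smoothed problem to the true objective.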
The 2009 paper reported experiments on two tasks:

- **Shape recognition:** classifying images of geometric shapes, where training began with a simpler set of canonical shapes (circles, squares, and equilateral triangles) before switching to a more varied set (ellipses, rectangles, and general triangles).
- **Language modeling:** predicting the next word in text, where the curriculum grew the training vocabulary in order of word frequency, so that early training used only sentences built from the most common words.
Across both tasks, curriculum-trained models achieved better generalization and converged faster than models trained on randomly shuffled data.
Modern curriculum learning methods can be decomposed into two core components, as described in surveys by Wang et al. (2021) and Soviany et al. (2022):
| Component | Role | Examples |
|---|---|---|
| Difficulty Measurer | Assigns a difficulty score to each training sample | Sentence length, loss value, confidence score from a pretrained model, distance to decision boundary, noise level |
| Training Scheduler | Determines when harder samples are introduced during training | Fixed schedule (step functions), continuous pacing functions, exponential pacing, linear pacing |
The difficulty measurer answers the question "which examples are easy and which are hard?" while the training scheduler answers "when should the model see harder examples?" Together, these two components define a complete curriculum.
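The interplay of the two components can be sketched in a few lines. Below, a toy difficulty measurer (sentence length) is paired with a linear training scheduler; the corpus and function names are illustrative, not taken from any particular paper:

```python
import random

# Toy labeled corpus; sentences and labels are illustrative only.
corpus = [
    ("hi", 0), ("a cat", 0), ("short text", 1), ("the dog runs", 1),
    ("a very long and winding sentence", 1),
    ("an unusually elaborate example sentence appears here", 0),
]

def difficulty(example):
    """Difficulty measurer: score each sample (here, length in words)."""
    return len(example[0].split())

def linear_pacing(step, total_steps, full_size, start_frac=0.3):
    """Training scheduler: how much of the easy-to-hard ordering is visible."""
    frac = start_frac + (1.0 - start_frac) * step / total_steps
    return max(1, int(min(1.0, frac) * full_size))

ordered = sorted(corpus, key=difficulty)  # easy examples first
total_steps = 10
for step in range(total_steps + 1):
    visible = ordered[:linear_pacing(step, total_steps, len(ordered))]
    batch = random.choice(visible)  # sample a training example from the visible prefix
```

Swapping in a different `difficulty` (for example, a teacher model's loss) or a different pacing function changes the curriculum without touching the training loop, which is exactly the modularity the two-component decomposition describes.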
Curriculum learning methods broadly fall into two categories: **predefined curricula**, in which sample difficulty and ordering are fixed in advance using human prior knowledge or domain heuristics, and **automatic curricula**, in which difficulty is inferred dynamically from the model's own training signals, as in self-paced learning and the teacher-based methods described below.
Soviany et al. (2022) proposed a four-level taxonomy of curriculum learning approaches:
| Curriculum Type | Description | Example |
|---|---|---|
| Data-level | Training examples are ordered from easy to hard | Presenting short sentences before long sentences in NMT |
| Model-level | The modeling capacity of the network is gradually increased | Starting with a shallow network and progressively adding layers |
| Task-level | The complexity of the learning task itself increases during training | Training on simple subtasks before the full task in RL |
| Objective-level | The model optimizes toward an increasingly complex objective | Beginning with a simple loss function and transitioning to a harder one |
Self-paced learning (SPL) was introduced by M. Pawan Kumar, Benjamin Packer, and Daphne Koller in their 2010 NeurIPS paper "Self-Paced Learning for Latent Variable Models." While Bengio's original curriculum learning relies on a predefined, fixed ordering of examples, SPL lets the model itself decide which examples are easy or hard at each stage of training.
The SPL algorithm works iteratively: at each step, it simultaneously selects a subset of easy samples (those with low training loss) and updates the model parameters. A weight parameter controls how many samples are selected, and this weight is annealed over time so that progressively more difficult samples are included until the entire training set is being used.
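The alternating scheme can be sketched on a toy regression problem. This is a minimal illustration of the SPL selection rule (select samples whose loss falls below a threshold, then refit, then raise the threshold), not Kumar et al.'s latent-variable formulation; the threshold `lam` and all data here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear-regression data with a minority of "hard" high-noise samples.
X = rng.normal(size=(200, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=200)
y[:20] += rng.normal(scale=5.0, size=20)  # 20 noisy outliers

w = np.zeros(3)
lam = 0.5  # loss threshold: samples with loss below lam count as "easy"
for epoch in range(30):
    losses = (X @ w - y) ** 2
    v = losses < lam              # closed-form easy-sample selection
    if v.any():
        # Alternate step: refit the model on the currently "easy" subset
        w = np.linalg.lstsq(X[v], y[v], rcond=None)[0]
    lam *= 1.3                    # anneal the threshold so harder samples enter
```

By the final epochs the threshold admits the whole training set, recovering ordinary training; early on, the high-noise outliers are excluded because their losses exceed the threshold.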
Kumar et al. demonstrated that SPL outperformed existing methods on four tasks: object localization, noun phrase coreference resolution, motif finding, and handwritten digit recognition. The strength of SPL lies in its adaptiveness: rather than relying on a fixed external notion of difficulty, the model's own learning state determines which examples to prioritize.
Lu Jiang, Deyu Meng, Qian Zhao, Shiguang Shan, and Alexander Hauptmann unified curriculum learning and self-paced learning in their 2015 AAAI paper "Self-Paced Curriculum Learning" (SPCL). They observed that CL and SPL each have complementary strengths: CL incorporates prior knowledge about sample difficulty (from an instructor), while SPL adapts to the model's current learning state (the student's perspective).
SPCL combines both signals into a single optimization framework. In the analogy to human education, SPCL corresponds to an "instructor-student collaborative" learning mode, as opposed to "instructor-driven" (pure CL) or "student-driven" (pure SPL). The unified approach demonstrated improved performance over either method alone on tasks including matrix factorization and multimedia event detection.
Alex Graves, Marc G. Bellemare, Jacob Menick, Remi Munos, and Koray Kavukcuoglu presented "Automated Curriculum Learning for Neural Networks" at ICML 2017. Their approach frames curriculum selection as a multi-armed bandit problem. A nonstationary bandit algorithm selects which tasks or data subsets to present to the network at each training step, using a reward signal based on how much the network learns from each sample.
The authors considered multiple signals of learning progress, including the rate of increase in prediction accuracy and the rate of increase in network complexity. Experiments on LSTM networks across three curricula demonstrated that automated curriculum selection could halve the training time needed to reach a target performance level.
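Graves et al. use an adversarially robust bandit algorithm (a variant of Exp3) over the curriculum's tasks; the sketch below substitutes a simpler ε-greedy rule over an exponentially smoothed progress signal, with simulated per-task loss curves standing in for a real network. All task names and constants are illustrative:

```python
import random

random.seed(0)

tasks = ["short_seqs", "medium_seqs", "long_seqs"]  # illustrative task names

# Simulated per-task losses: each task's loss decays at its own rate when trained.
loss = {"short_seqs": 1.0, "medium_seqs": 2.0, "long_seqs": 3.0}
decay = {"short_seqs": 0.30, "medium_seqs": 0.10, "long_seqs": 0.02}

progress = {t: 0.0 for t in tasks}  # smoothed learning-progress estimate per task
counts = {t: 0 for t in tasks}
beta, eps = 0.8, 0.2                # smoothing factor and exploration rate

for step in range(300):
    if random.random() < eps:
        task = random.choice(tasks)                   # explore
    else:
        task = max(tasks, key=lambda t: progress[t])  # exploit: most progress
    before = loss[task]
    loss[task] -= decay[task] * loss[task]            # "train" one step on the task
    reward = before - loss[task]                      # learning-progress reward
    progress[task] = beta * progress[task] + (1 - beta) * reward
    counts[task] += 1
```

The qualitative behavior matches the paper's intuition: the bandit concentrates on whichever task currently yields the most progress, then moves on as that task's loss plateaus.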
Tambet Matiisen and colleagues proposed Teacher-Student Curriculum Learning (TSCL), where a separate "Teacher" network selects training tasks for a "Student" network. The Teacher monitors the Student's learning progress and preferentially assigns tasks where the Student is improving most quickly. To counter the problem of catastrophic forgetting, the Teacher also revisits tasks where the Student's performance has declined.
TSCL matched or surpassed hand-crafted curricula on tasks including decimal addition with LSTMs and navigation in Minecraft. The framework was published in IEEE Transactions on Neural Networks and Learning Systems in 2020.
Several methods use reinforcement learning to train a curriculum policy. In these approaches, the teacher is itself an RL agent whose action space consists of selecting tasks or difficulty levels for the student, and whose reward is derived from the student's learning progress. This creates a bilevel optimization problem where the outer loop trains the teacher and the inner loop trains the student. Such methods have been applied to robotics, game playing, and dialogue systems.
Curriculum learning has been widely applied in computer vision tasks. Hacohen and Weinshall (2019) conducted a thorough study of curriculum learning for convolutional neural network image classification in their ICML paper "On The Power of Curriculum Learning in Training Deep Networks." They proposed two methods for scoring example difficulty: transfer scoring (using confidence scores from a pretrained teacher network) and self-taught scoring (using the network's own confidence from an initial training pass). Combined with an exponential pacing function that gradually expands the training set, their method consistently accelerated learning and improved final accuracy on standard benchmarks.
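A sketch of the two pieces, with an exponential pacing function in the spirit of Hacohen and Weinshall: examples are sorted by a pretrained teacher's confidence (transfer scoring), and the visible prefix grows multiplicatively. The dataset size, pacing constants, and `teacher_confidence` array are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: confidence of a pretrained teacher on 50,000 images.
n = 50_000
teacher_confidence = rng.uniform(size=n)  # transfer scoring: higher = easier
order = np.argsort(-teacher_confidence)   # most confident (easiest) first

def exponential_pacing(step, step_length=100, increase=1.9, start_frac=0.04):
    """Exponential pacing: the visible training-set size starts at
    start_frac * n and is multiplied by `increase` every `step_length`
    optimizer steps, capped at the full set."""
    frac = start_frac * increase ** (step // step_length)
    return min(n, round(frac * n))

def visible_indices(step):
    """Indices the model may draw minibatches from at this step."""
    return order[:exponential_pacing(step)]
```

Self-taught scoring replaces `teacher_confidence` with the network's own confidences from an initial training pass, leaving the pacing logic unchanged.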
In medical imaging, curriculum learning has been applied to address challenges with limited labeled data and noisy annotations. The CURVETE framework (Curriculum Learning and Progressive Self-supervised Training) introduced a progressive curriculum based on sample decomposition granularity for medical image classification tasks, including brain tumor detection, knee X-ray analysis, and mammography screening.
Curriculum learning has also been employed for object detection and image segmentation, where training can begin with images containing clearly visible, well-separated objects before introducing crowded scenes with heavy occlusion.
Curriculum learning has proven particularly natural for neural machine translation (NMT), where sentence length and vocabulary rarity provide intuitive difficulty measures.
Kocmi and Bojar (2017) conducted the first systematic study of curriculum learning for NMT. They explored two types of data ordering for a Czech-English translation system: organizing minibatches so that each batch contains sentences similar in some linguistic aspect, and gradually introducing more complex sentence types as training progresses.
Platanios et al. (2019) introduced competence-based curriculum learning for NMT in their NAACL paper. Their framework defines a model competence function that increases over training time, and at each step only trains on examples whose estimated difficulty falls within the model's current competence level. The difficulty of each sentence pair is estimated based on features such as length and word rarity. This approach achieved up to a 70% reduction in training time and improvements of up to 2.2 BLEU points on standard translation benchmarks, for both recurrent neural network and Transformer architectures.
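The schedule and filtering step can be sketched directly. The square-root competence function below follows the paper's formulation; the toy corpus and the length-CDF difficulty measure are illustrative stand-ins:

```python
import math

def competence(t, T=1000, c0=0.01):
    """Square-root competence schedule (Platanios et al., 2019): starts at
    initial competence c0 and reaches full competence 1.0 after T steps."""
    return min(1.0, math.sqrt(t * (1 - c0 ** 2) / T + c0 ** 2))

# Toy corpus; difficulty is the CDF of sentence length, i.e. the fraction
# of corpus sentences no longer than this one (so difficulty lies in (0, 1]).
corpus = ["a b", "a b c", "a", "a b c d e", "a b c d", "a b c d e f"]
lengths = sorted(len(s.split()) for s in corpus)

def difficulty(sentence):
    k = len(sentence.split())
    return sum(1 for x in lengths if x <= k) / len(lengths)

def trainable(t):
    """At step t, train only on examples within the model's competence."""
    c = competence(t)
    return [s for s in corpus if difficulty(s) <= c]
```

Because difficulty is a CDF, the competence value has a direct interpretation: at competence c, roughly the easiest c-fraction of the corpus is available for sampling.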
Curriculum learning has attracted significant interest for pretraining large language models, though results have been mixed and context-dependent.
Sequence length warmup (Li et al., 2022): Conglong Li and colleagues at Microsoft, working within the DeepSpeed framework, proposed using sequence length as a curriculum dimension for GPT model pretraining. Their approach starts training with shorter sequences and progressively increases sequence length. This enabled stable training with 8x larger batch sizes and 4x larger learning rates for GPT-2 models (117M and 1.5B parameters). To match baseline evaluation results on WikiText-103 and LAMBADA, the method reduced required training tokens by up to 2.2x and wall clock time by up to 3.7x.
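The schedule itself is simple to sketch; the constants below are illustrative choices, not DeepSpeed's defaults:

```python
def seqlen_schedule(step, warmup_steps=10_000, min_len=8, max_len=2048, multiple=8):
    """Sequence length warmup: grow the truncation length linearly from
    min_len to max_len over warmup_steps, rounded down to a multiple of
    `multiple` for hardware efficiency."""
    if step >= warmup_steps:
        return max_len
    raw = min_len + (max_len - min_len) * step / warmup_steps
    return max(min_len, int(raw // multiple) * multiple)
```

Each batch is truncated to `seqlen_schedule(step)` tokens before the forward pass; since attention cost grows quadratically with sequence length, early steps are also substantially cheaper, which is part of why the method tolerates much larger batch sizes and learning rates.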
Multi-stage pretraining: Several recent LLMs, including OLMo 2, Phi-4, and others, have adopted a two-phase curriculum strategy. The first phase trains on a data mixture dominated by large-scale web data. The second phase (sometimes called "mid-training") shifts to a mixture that primarily consists of high-quality, curated data. This approach treats data quality as the curriculum dimension.
Mixed findings at scale: Research presented at NeurIPS 2024 found that curriculum learning has limited impact during post-training stages such as supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), where standard random sampling is often competitive. During pretraining, however, combining curriculum learning with appropriate learning rate schedules and weight averaging produced synergistic benefits, improving downstream benchmark accuracy by over 1.5% through data reordering alone.
Curriculum learning is especially valuable in reinforcement learning, where agents often face sparse reward signals and must learn complex behaviors that are nearly impossible to acquire from scratch.
Narvekar et al. (2020) published a comprehensive framework and survey in the Journal of Machine Learning Research, classifying curriculum approaches for RL domains. In RL, a curriculum can operate at multiple levels: ordering individual experience samples, sequencing subtasks, or progressively modifying the environment.
Automatic Domain Randomization (ADR): OpenAI applied a form of automatic curriculum learning in their work on solving a Rubik's Cube with a robotic hand (2019). ADR starts with a single, non-randomized simulation environment and progressively increases the amount of domain randomization (friction, object size, lighting, and other physical parameters) as the agent's performance improves. Each time the agent surpasses a performance threshold, the environment becomes more challenging. The resulting policy transferred successfully to a physical robot hand, solving the Rubik's Cube 60% of the time.
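The range-expansion rule for a single randomized parameter can be sketched as follows; the parameter name, bounds, step size, and threshold are illustrative, not OpenAI's published values:

```python
import random

class ADRParam:
    """Automatic Domain Randomization for one parameter (sketch): the
    sampling range starts as a single point and widens each time the
    agent's success rate at the range boundary exceeds a threshold."""

    def __init__(self, value, step=0.02, threshold=0.8, lo_bound=0.0, hi_bound=2.0):
        self.lo = self.hi = value
        self.step, self.threshold = step, threshold
        self.lo_bound, self.hi_bound = lo_bound, hi_bound

    def update(self, boundary_success_rate):
        """Widen the range if the agent performs well at the boundary."""
        if boundary_success_rate >= self.threshold:
            self.lo = max(self.lo_bound, self.lo - self.step)
            self.hi = min(self.hi_bound, self.hi + self.step)

    def sample(self, rng):
        """Draw a parameter value for the next simulated episode."""
        return rng.uniform(self.lo, self.hi)

# Usage sketch: evaluate rollouts pinned at the boundary, then update.
rng = random.Random(0)
friction = ADRParam(0.5)
for success_rate in (0.9, 0.6, 0.95):  # simulated boundary evaluations
    friction.update(success_rate)
draw = friction.sample(rng)
```

In the full method, every randomized physical parameter maintains its own range, and boundary performance is estimated from rollouts in which that parameter is fixed at the edge of its current range.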
Reverse Curriculum Learning: Xi et al. (2024) introduced reverse curriculum reinforcement learning for training LLMs on reasoning tasks (ICML 2024). Instead of progressing from easy to hard, the method starts from goal states and works backward to generate a sequence of increasingly challenging starting states, ensuring that the agent always has a feasible path toward the solution.
The following table summarizes major curriculum learning approaches across domains:
| Approach | Year | Authors | Domain | Difficulty Measure | Key Contribution |
|---|---|---|---|---|---|
| Starting Small | 1993 | Elman | Language (RNN) | Working memory capacity | Showed that limited initial capacity aids grammar learning |
| Curriculum Learning | 2009 | Bengio, Louradour, Collobert, Weston | Vision, NLP | Predefined (shape complexity, vocabulary frequency) | Formalized CL; linked to continuation methods |
| Self-Paced Learning | 2010 | Kumar, Packer, Koller | Latent variable models | Training loss (model-determined) | Let the model select its own easy examples |
| Self-Paced Curriculum Learning | 2015 | Jiang, Meng, Zhao, Shan, Hauptmann | Multimedia, matrix factorization | Combined prior knowledge and model loss | Unified CL and SPL into one framework |
| Automated CL (Bandit) | 2017 | Graves, Bellemare, Menick, Munos, Kavukcuoglu | Sequence learning (LSTM) | Learning progress signals | Multi-armed bandit for curriculum selection |
| Teacher-Student CL | 2017 | Matiisen et al. | RL (Minecraft, arithmetic) | Student learning progress | Teacher network selects tasks for student |
| CL for NMT | 2017 | Kocmi, Bojar | Machine translation | Sentence length, vocabulary rarity | First CL study for NMT |
| Power of CL | 2019 | Hacohen, Weinshall | Image classification (CNN) | Transfer scoring, self-taught scoring | Scoring and pacing functions for vision |
| Competence-based CL | 2019 | Platanios, Stretcu, Neubig, Poczos, Mitchell | Machine translation | Competence function (length, rarity) | 70% training time reduction, +2.2 BLEU |
| ADR (OpenAI) | 2019 | OpenAI | Robotics (RL) | Environment randomization complexity | Sim-to-real transfer for dexterous manipulation |
| CL for RL (Survey) | 2020 | Narvekar, Peng, Leonetti, Sinapov, Taylor, Stone | Reinforcement learning | Various (task-level, sample-level) | Comprehensive CL framework for RL |
| Sequence Length Warmup | 2022 | Li et al. (Microsoft/DeepSpeed) | LLM pretraining (GPT) | Sequence length | Up to 3.7x wall clock speedup for GPT-2 |
| Reverse Curriculum RL | 2024 | Xi et al. | LLM reasoning | Distance from goal state | Backward curriculum for reasoning tasks |
The effectiveness of curriculum learning is not universal. Wu et al. (2021) published "When Do Curricula Work?" at ICLR 2021, providing one of the most rigorous analyses of when curriculum learning provides genuine benefits. Across a large number of controlled training runs on image classification benchmarks, they found that with a standard training budget and clean labels, curricula (and even anti-curricula, which present hard examples first) offered little advantage over random ordering; curricula helped chiefly when the training budget was limited or the labels were noisy.
An important nuance from the research is that the pacing function (how quickly the training set expands) may matter more than the ordering itself in some settings. Wu et al. observed that a "random curriculum" (expanding the training set size over time but in random order) sometimes performed as well as a true difficulty-ordered curriculum, implying that the gradual increase in data diversity contributes independently to the training benefit.
A central challenge in curriculum learning is measuring example difficulty. Various approaches have been proposed:
| Difficulty Measure | Description | Applicable Domain |
|---|---|---|
| Sentence length | Longer sentences are harder | NMT, language modeling |
| Vocabulary rarity | Sentences with rare words are harder | NLP tasks |
| Training loss | High-loss examples are harder (self-paced) | Any supervised task |
| Transfer scoring | Low confidence from a pretrained model indicates difficulty | Image classification |
| Self-taught scoring | Low confidence from initial training pass | Image classification |
| Edit distance | Distance from prototypical examples | Text classification |
| Image resolution / noise | Blurry or noisy images are harder | Computer vision |
| Number of objects | Scenes with more objects are harder | Object detection |
| Task complexity | More steps or subgoals required | Reinforcement learning |
| Domain randomization | More environmental variation is harder | Sim-to-real robotics |
Curriculum learning intersects with several related training strategies:

- **Transfer learning and pretraining**, which also stage learning, but across tasks or datasets rather than within one ordered dataset.
- **Active learning**, where the model selects which unlabeled examples to annotate next; like self-paced learning, the selection depends on the model's state, but the goal is label efficiency rather than ordering.
- **Hard example mining** and anti-curriculum approaches, which prioritize the hardest examples first, the opposite ordering.
- **Boosting**, which reweights misclassified (hard) examples across an ensemble rather than over the course of training a single model.
- **Continuation methods** in optimization, the framework Bengio et al. used to motivate curriculum learning theoretically.
For practitioners considering curriculum learning, several practical points emerge from the literature:

- Gains are most likely when data is noisy, the training budget is limited, or rewards are sparse; with clean data and ample compute, random ordering is often competitive.
- Simple difficulty measures, such as sentence length, training loss, or a teacher model's confidence, are strong baselines and cheap to compute.
- The pacing function deserves as much tuning attention as the ordering itself, since gradually expanding the training set contributes benefits independently of the ordering.
- For large language models, curricula appear most useful during pretraining (for example, sequence length warmup and quality-staged data mixtures) and less so during post-training stages such as SFT and RLHF.